Tuesday, July 24, 2012

On Run Distributions, pt. 7: W% Estimates

There are a number of things one can do with a per game run distribution algorithm. One can look at the scoring patterns of teams to see if they are more or less efficient than a typical team with their average runs scored per game. While this can be done in reference to the empirical distribution for a given set of teams, having a distribution formula that works at multiple points allows one to do some customization, like accounting for park factor, that is problematic when working with the empirical data.

Another application that’s near and dear to my heart is using the run distribution tool to fuel a winning percentage estimator. Most win estimators you see are expressed in terms of a relatively simple function of runs scored and runs allowed, and you never really see how the sausage is made. But the way games are won is to score more runs than the other team. Essentially, any W% estimator is just trying to approximate the proportion of games in which a team will score more runs than it allows. With a function for run distribution, we can write this out explicitly.

Let Ps(k) be the probability of scoring k runs in a game, and Pa(m) the probability of allowing m runs. Then the W% of a team can be estimated as:

W% = Ps(1)*Pa(0) + Ps(2)*[Pa(0) + Pa(1)] + Ps(3)*[Pa(0) + Pa(1) + Pa(2)] + Ps(4)*[Pa(0) + Pa(1) + Pa(2) + Pa(3)] + ...

Of course, there are a number of other ways we could express the same idea, but while the equation above may not be the simplest, I believe it is the clearest. If you score 0 runs, you cannot win. If you score 1 run, you win if you allow 0 runs. If you score 2 runs, you win if you allow either 0 or 1 runs.
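For readers who prefer code to summation notation, here is a minimal Python sketch of the sum above. The function name `win_in_nine` and the toy three-outcome distributions are my own illustration, not real run distributions; the only assumption carried over from the text is that runs scored and runs allowed are independent.

```python
from itertools import accumulate

def win_in_nine(ps, pa):
    """P(score > allow), given lists ps[k] = P(score k) and pa[m] = P(allow m)."""
    # allow_lt[k] = Pa(0) + ... + Pa(k-1), the chance of allowing fewer than k runs
    allow_lt = [0.0] + list(accumulate(pa))
    # If ps is longer than pa, any k beyond pa's range beats every allowed total
    return sum(p * allow_lt[min(k, len(pa))] for k, p in enumerate(ps))

# Toy distributions over 0, 1, or 2 runs (illustration only)
toy_win = win_in_nine([0.2, 0.3, 0.5], [0.5, 0.3, 0.2])
```

With these toy inputs the sum is Ps(1)*Pa(0) + Ps(2)*[Pa(0) + Pa(1)] = .3*.5 + .5*.8 = .55.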

There is a complication that arises when employing this logic with a discrete distribution, however. (If we had a continuous distribution, we would be integrating the difference between the runs scored and runs allowed curves rather than summing, but the idea is the same.) With a continuous distribution, the probability of achieving any given k runs scored or m runs allowed is 0, so one need not be concerned with the run distribution method setting k equal to m. With the discrete distribution, though, we need some way of handling a situation in which a team is estimated to score and allow the same number of runs.

For the sake of this issue, I’ll assume that the Enby distribution is modeling runs in the first nine innings of the game, and that if the game is tied, it will go to extra innings. What is the probability of extra innings?

P(Extra Innings) = Ps(0)*Pa(0) + Ps(1)*Pa(1) + Ps(2)*Pa(2) + Ps(3)*Pa(3) + ...

The probability of extra innings is the probability that you score the same number of runs as you allow. Simple enough. Incidentally, if we assume that the distribution of runs scored and allowed are identical (like we might if we considered the league as an entire unit with some uniform R/G rather than as 30 individual units that only average that R/G when combined), the equation becomes:

P(Extra Innings) = P(0)^2 + P(1)^2 + P(2)^2 + P(3)^2 + P(4)^2 + ...
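A short Python sketch of the tie probability, again with a hypothetical function name and toy inputs chosen only so the arithmetic is easy to follow:

```python
def p_extra_innings(ps, pa):
    """P(score == allow) after nine innings: the sum of Ps(k)*Pa(k) over k."""
    return sum(s * a for s, a in zip(ps, pa))

# Different scored/allowed distributions (toy values)
toy_tie = p_extra_innings([0.2, 0.3, 0.5], [0.5, 0.3, 0.2])

# Identical distributions reduce to the sum of squared probabilities
league = [0.5, 0.3, 0.2]
toy_league_tie = p_extra_innings(league, league)
```

The first case works out to .2*.5 + .3*.3 + .5*.2 = .29, and the second to .25 + .09 + .04 = .38.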

Knowing the percentage of extra inning games is not enough to estimate W%--we also need to know the percentage of those contests that the team goes on to win. For this, we will fall back once again on the Tango Distribution, which allows us to consider scoring on the inning level.

Before I actually use the Tango Distribution, allow me to walk through this with a simple example. The important factor in determining the probability of winning in extra innings is the percentage of the time that a team scores more runs in an inning than they allow. As soon as you do that, you win. As soon as you allow more than you score in an inning, you lose. As long as you score as many runs as you allow in an inning, the game continues.

Let’s suppose that a team has a 25% chance of outscoring their opponent in a given inning and a 20% chance of being outscored. This means that there is a 55% chance of a tie in each inning, which extends the game. Thus, the probability of winning eventually is:

.25 + .55*.25 + .55^2*.25 + .55^3*.25 + .55^4*.25 + ...

You have a 25% chance of winning in the tenth inning and a 55% chance of there being an eleventh inning. In each subsequent inning, there is also a 25% chance of winning and a 55% chance of the game continuing. This expression can be solved as follows:

.25[1 + .55 + .55^2 + .55^3 + .55^4 + ...] = .25*1/(1 - .55) = .5556

However, you don’t even need to deal with the 55% probability of an additional inning, thanks to the Craps Principle. As you can see, the expression .25*1/(1 - .55) simplifies to .25/.45 which is equal to the probability of winning the first round divided by the sum of the probability of winning the first round and the probability of losing the first round, so we can just take .25/(.25 + .2) = .5556.
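If you want to convince yourself that the shortcut and the infinite series agree, a quick numerical check does the job. This is only a sketch; the function names are mine, and the series is truncated at 200 innings, by which point the remaining terms are far too small to matter.

```python
def craps_principle(p_win, p_lose):
    """P(eventually winning) when tied rounds are simply replayed."""
    return p_win / (p_win + p_lose)

def geometric_series(p_win, p_lose, innings=200):
    """The same probability, summed inning by inning."""
    p_tie = 1 - p_win - p_lose
    return sum(p_win * p_tie**n for n in range(innings))

shortcut = craps_principle(0.25, 0.20)
long_way = geometric_series(0.25, 0.20)
```

Both routes give .5556 to four decimal places.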

Bringing the Tango Distribution back into play, let Fs(k) be the probability of scoring k runs in an inning and Fa(m) the probability of allowing m runs in an inning. By “win inning”, I mean outscoring the opponent in a single inning; by “lose inning”, I mean being outscored in a single inning.

P(win inning) = Fs(1)*Fa(0) + Fs(2)*[Fa(0) + Fa(1)] + Fs(3)*[Fa(0) + Fa(1) + Fa(2)] + Fs(4)*[Fa(0) + Fa(1) + Fa(2) + Fa(3)] + ...

P(lose inning) = Fa(1)*Fs(0) + Fa(2)*[Fs(0) + Fs(1)] + Fa(3)*[Fs(0) + Fs(1) + Fs(2)] + Fa(4)*[Fs(0) + Fs(1) + Fs(2) + Fs(3)] + ...

P(win in extra innings) = P(win inning)/[P(win inning) + P(lose inning)]

P(extra innings) = Ps(0)*Pa(0) + Ps(1)*Pa(1) + Ps(2)*Pa(2) + Ps(3)*Pa(3) + ...

P(win in 9 innings) = Ps(1)*Pa(0) + Ps(2)*[Pa(0) + Pa(1)] + Ps(3)*[Pa(0) + Pa(1) + Pa(2)] + Ps(4)*[Pa(0) + Pa(1) + Pa(2) + Pa(3)] + ...

W% = P(win in 9 innings) + P(extra innings)*P(win in extra innings)
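The six formulas above chain together in a few lines of Python. This is a sketch under the assumption that you already have the four distributions as lists indexed by runs (game-level `ps`, `pa` and inning-level `fs`, `fa`); all function names are hypothetical, and the toy numbers in the usage example are not real baseball distributions.

```python
from itertools import accumulate

def p_outscore(score, allow):
    """Sum over k of score[k] * P(allow fewer than k)."""
    allow_lt = [0.0] + list(accumulate(allow))
    return sum(p * allow_lt[min(k, len(allow))] for k, p in enumerate(score))

def p_tie(score, allow):
    """Sum over k of score[k] * allow[k]."""
    return sum(s * a for s, a in zip(score, allow))

def win_pct(ps, pa, fs, fa):
    """W% = P(win in 9) + P(extra innings) * P(win in extra innings)."""
    win_inning = p_outscore(fs, fa)
    lose_inning = p_outscore(fa, fs)
    win_extra = win_inning / (win_inning + lose_inning)  # Craps Principle
    return p_outscore(ps, pa) + p_tie(ps, pa) * win_extra

# Toy inputs (illustration only)
toy_wpct = win_pct([0.2, 0.3, 0.5], [0.5, 0.3, 0.2], [0.6, 0.4], [0.8, 0.2])
```

A useful sanity check on any implementation: if the team's scored and allowed distributions are identical at both levels, the result should be exactly .500.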

Let me demonstrate how this estimate is figured with a table. Let’s take a team that averages 5 runs scored and 4 runs allowed per game. We first use Enby to estimate the probability of scoring or allowing k runs (I’ve capped scoring at 25 runs here; technically, you need to go to infinity):



The columns “score” and “allow” are the probabilities of scoring and allowing k runs, respectively. “allow <” is the probability of allowing less than k runs. “win 9” is the probability of winning the game in nine innings while scoring k runs, and is equal to the probability of scoring k runs times the probability of allowing less than k runs. “extra” is the probability of extra innings with a score of k-k, and is the product of the probability of scoring k runs and the probability of allowing k runs. The sum of “win 9” is the probability of winning in 9 innings, which works out to .5409 in this case. The sum of “extra” is the probability of extra innings, which is 9.95%. To complete our analysis, we need to do a similar analysis on the inning level using the Tango Distribution:

Here, I’ve added a “score <” column, which is the probability of scoring less than k runs. “lose” is the probability of losing given k runs allowed and is the probability of allowing k runs times the probability of scoring less than k runs. “tie” here is the same as “extra” in the above chart, although it is also 1 - win - lose. The sum of “win” is the probability of winning the inning; the sum of “lose” is the probability of losing the inning. The probability of winning the inning divided by the sum of the probability of winning and losing the inning is the probability of winning an extra inning game (per the Craps Principle). In this case, the probability of winning an inning is 24.8%, the probability of losing an inning is 20%, and the probability of tying an inning is 55.2%.
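The inning-level figures can be reproduced with a short Python sketch. A loud caveat: this post only spells out the Tango Distribution's zero-run term, RI/(RI + c*RI^2). The recursion for one or more runs used below (d = 1 - c*f(0), f(1) = (1 - f(0))*(1 - d), f(k) = f(k-1)*d) is the form of the Tango Distribution as I understand it, so treat it as an assumption. The function names and the 40-run cap are also mine.

```python
def tango_inning(runs_per_game, c=0.767, max_runs=40):
    """Per-inning run probabilities f(0..max_runs) (assumed Tango recursion)."""
    ri = runs_per_game / 9
    f0 = ri / (ri + c * ri**2)          # probability of a scoreless inning
    d = 1 - c * f0                      # assumed geometric decay for k >= 2
    probs = [f0, (1 - f0) * (1 - d)]
    while len(probs) <= max_runs:
        probs.append(probs[-1] * d)
    return probs

def inning_outcomes(fs, fa):
    """Return (P(win inning), P(lose inning), P(tie inning))."""
    win = sum(s * sum(fa[:k]) for k, s in enumerate(fs))
    lose = sum(a * sum(fs[:k]) for k, a in enumerate(fa))
    return win, lose, 1 - win - lose

# 5 runs scored and 4 runs allowed per game, as in the example
win, lose, tie = inning_outcomes(tango_inning(5), tango_inning(4))
```

Under that assumed recursion, the three values come out at roughly .248, .200, and .552, in line with the figures quoted above.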

Thus, the probability of winning given extra innings is .248/(.248 + .20) = .5534, and the overall probability of winning is:

.5409 + .0995*.5534 = .5960

Remember, .5409 is the probability of winning in 9 innings; the probability of winning given that the game only goes nine innings is .5409/(1 - .0995) = .6007. Obviously the team with the advantage has a greater probability of winning a nine inning game than winning a one inning game.

For comparison, Pythagenpat with z = .28 estimates that a 5 R/4 RA team will have a .6018 W%, a difference of .94 games over a 162 game season from the Enby estimate. The Tango-Ben distribution (using the same methodology as what I just demonstrated except substituting the Tango-Ben estimate of the game scoring probabilities for Enby) estimates .5953 when using c = .767, which is a good match for the Enby estimate. However, Tango found that for applications involving two teams in a head-to-head matchup, a c parameter of .852 produces better results. Using .852, the Tango-Ben distribution estimates a .6011 W%, a much better match for Pythagenpat.
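For reference, the Pythagenpat figure can be checked in a couple of lines. The formula is not restated in this post, so I am assuming the usual form, in which the exponent is (R/G + RA/G)^z; the function name is hypothetical.

```python
def pythagenpat(r, ra, z=0.28):
    """Pythagenpat W% with exponent x = (R/G + RA/G)^z (assumed standard form)."""
    x = (r + ra) ** z
    return r**x / (r**x + ra**x)

pyth = pythagenpat(5, 4)
# Gap versus the .5960 Enby estimate, expressed in wins per 162 games
games_gap = (pyth - 0.5960) * 162
```

This gives a W% of about .6018 and a gap of a little under one win per season.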

It is quite possible that Enby would benefit from a separate set of parameters for use in a head-to-head matchup. This could be accomplished by modifying the variance targeted by the Enby parameters, and when I pick this topic back up in a few months I have some ideas on how to do that.

Given the complexity of the W% estimate and its questionable accuracy (at least with the current default parameterization) I have not endeavored to carry out more elaborate tests. For now, it will sit as an intellectual exercise rather than as a method I use.

Monday, July 16, 2012

On Run Distributions, pt. 6: Series Review

This post won’t introduce anything new--instead I’m just going to summarize what I’ve already done, giving you a full example of how to calculate the Enby distribution estimates for a given R/G level. I’ll also provide a spreadsheet with the parameters for each .05 R/G increment between 3-7 so that you don’t have to do all these calculations yourself.

Let’s suppose we have a team that averages exactly 5 R/G (in fact, there is such a team in my sample data--the 1984 Red Sox), and we’d like to estimate their game-level scoring distribution using the Enby distribution methodology. The first step is to estimate the variance of their runs scored per game:

Step 1: Estimate the variance of runs scored per game.

Variance = 1.43*(R/G) + .1345*(R/G)^2 = 1.43*5 + .1345*5^2 = 10.5125

Step 2: Use the mean and variance to estimate the parameters (r and B) of the negative binomial distribution (these formulas are equivalent to what I’ve presented before as explained below):

B = .1345*(R/G) + .43 = .1345*5 + .43 = 1.1025
r = (R/G)/(.1345*(R/G) + .43) = 5/(.1345*5 + .43) = 4.5351

Step 3: Use the negative binomial distribution to estimate the probability of scoring 0 runs:

q(0) = (1 + B)^(-r) = (1 + 1.1025)^(-4.5351) = .0344 (call this value a for ease later on)

Step 4: Use the Tango Distribution to estimate the probability of being shutout, which is equal to the Enby distribution (zero-modified negative binomial) parameter z:

RI = (R/G)/9 = 5/9 = .5556
z = (RI/(RI + .767*RI^2))^9 = .0410

Step 5: Using your spreadsheet, use trial and error (or a solver if you have that level of functionality) to estimate a new value of r. In choosing this value, you need to ensure that the average R/G predicted by the Enby distribution equals your sample R/G (5 in this case). This needs to be done simultaneously; use the following formula to estimate the initial probability:

q(k) = (r)(r + 1)(r + 2)(r + 3)...(r + k - 1)*B^k/(k!*(1 + B)^(r + k)) for k >=1

Then modify it as follows:
p(0) = z
p(k) = (1 - z)*q(k)/(1 - a) for k >=1

The mean is calculated:
p(1) + 2*p(2) + 3*p(3) + 4*p(4) + ...

The new value of r is the value that, when used in conjunction with this methodology and the previously calculated values for B and z, produces a mean equal to the desired R/G (5 in this case, with a corresponding r of 4.571).

So we have determined that the Enby distribution for a team that scores 5 R/G has parameters (B = 1.1025, r = 4.571, z = .041). The formulas for p(0) and p(k) calculate the probability of scoring k runs in a game.
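If a spreadsheet and trial and error are not appealing, the five steps can be sketched in Python. Two things here are my own choices rather than part of the method as described: the search for r uses bisection, and instead of summing k*p(k) term by term it uses the closed form for the mean, (1 - z)*r*B/(1 - a), which follows from the negative binomial mean of r*B. Function names, the bisection bracket, and the 60-run cap are all hypothetical.

```python
def enby_params(rg, c=0.767, tol=1e-10):
    """Return (B, r, z) for a team averaging rg runs per game."""
    B = 0.1345 * rg + 0.43                   # Step 2: B from the variance estimate
    ri = rg / 9
    z = (ri / (ri + c * ri**2)) ** 9         # Step 4: shutout probability

    def mean(r):
        a = (1 + B) ** (-r)                  # unmodified P(0 runs)
        return (1 - z) * r * B / (1 - a)     # mean after zero modification

    lo, hi = 0.01, 50.0                      # assumed bracket; mean(r) rises with r
    while hi - lo > tol:                     # Step 5: solve mean(r) = rg
        mid = (lo + hi) / 2
        if mean(mid) < rg:
            lo = mid
        else:
            hi = mid
    return B, (lo + hi) / 2, z

def enby_pmf(B, r, z, max_runs=60):
    """p(0..max_runs) for the zero-modified negative binomial."""
    a = (1 + B) ** (-r)
    q = a
    probs = [z]
    for k in range(1, max_runs + 1):
        q *= (r + k - 1) / k * B / (1 + B)   # q(k) built from q(k-1)
        probs.append((1 - z) * q / (1 - a))
    return probs

B, r, z = enby_params(5)
probs = enby_pmf(B, r, z)
```

For 5 R/G this lands on B = 1.1025, r of about 4.571, and z of about .041, matching the parameters above.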

How does our plot for the 1984 Red Sox compare to their actual scoring output?



Of course, we don’t expect a great fit for every team-season. Even if we assumed that there were no variations in run distribution due to the characteristics of an offense, the 162 game sample size would cause deviation from the expected values.

I have calculated the three parameters at each interval of .05 R/G between 3 and 7. While we have some reason to believe that the Enby may be semi-accurate outside of normal ranges, I’m not going to recommend its usage outside of the scoring range of normal teams. Getting a lot more precise than .05 is probably overkill as well, but given my limitation in having to solve for r by trial and error, I’m also limiting the increments as a matter of practicality.

Here is a link to the spreadsheet. Enter your R/G (only values between 3 and 7 are supported) in the shaded yellow cell. The spreadsheet will round this to the nearest .05 for you. P(k) is the probability of scoring k runs in the game, r is for computation purposes (it is the product of r*(r + 1)*(r + 2)... as applicable), and nb is the probability from the negative binomial distribution without zero modification.

Since I now have a table with the parameters over the 3-7 R/G range, it would feel inappropriate not to make scatterplots and look for patterns. First, z against R/G:



The red line is the z values; the thin black line is an exponential regression line that is a decent match for the data over this range. z is the parameter that needs the least investigation, though, as it is calculated via a formula based on the Tango Distribution. The formula makes sense, and there’s no mystery about why it works. The regression equation is superfluous and will certainly fail at low levels of R/G (it will predict that a team that averages 0 R/G will only be shutout in 12.66% of games).

Here is B against R/G:



B is a linear function of R/G. This is also not a surprise. Remember that B = variance/mean - 1, but I’m estimating the variance as a function of the mean. In fact, B can be simplified to B = .1345*(R/G) + .43, keeping in mind that I have used a fairly crude estimator of the variance, which is an area that might well be improved upon.
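The simplification is a one-line piece of algebra. Writing m for the mean R/G (my notation) and substituting the variance estimate used throughout this series:

```latex
B = \frac{\text{Variance}}{m} - 1
  = \frac{1.43\,m + 0.1345\,m^{2}}{m} - 1
  = 0.1345\,m + 0.43
```

At m = 5 this gives B = 1.1025, the same value used in the worked example.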

The parameter for which behavior is not defined by a formula is r:



Over this range, r is almost linear as a function of R/G. It can be modeled very closely over this range by a quadratic regression. I wouldn’t want to assume that a function can be used to estimate r consistently over a wider range of R/G, and even if it did, I wouldn’t want to advocate it as the value of r should be chosen to ensure that the expected R/G equals the actual R/G. In any event, it’s interesting to see how the parameters might behave in relation to R/G.

This post is running a little shorter than most of the others, so I’ll throw in something that would have gone in the odds and ends post that will close this series. For the last few years I’ve been looking at runs scored and allowed distributions at the end of each season, and in that time the most interesting team I’ve seen is the 2011 Red Sox. Boston led the majors in runs scored, but based on the empirical W% by runs scored in the majors in 2011, their actual distribution of runs scored would have led to an estimated 6.2 fewer wins than one would assume from just looking at their average runs scored. I thought it would be interesting to look at such a team again with the Enby distribution.

Boston averaged 5.4 R/G, which from the table above means their run distribution will be estimated as Enby(B = 1.1563, r = 4.7, z = .0331). Graphing their actual distribution against the expected, we get this:



The Red Sox were shutout a lot more than we’d expect, and while we expected the mode of their runs per game to be 4, they actually scored 4 runs in 18.5% of their games compared to an expectation of 12.8%. They also were clearly below expectation in games of 5-8 runs scored, which are games that a team has a very good chance of winning. The distribution skewed more to the right than expected, toward games with gaudy run totals, which have much less of an impact on wins because the marginal value of each additional run is quite low.

Another way to visualize this (and as you can tell from this series, charts aren’t really my thing--I'm using a lot here, but not to great effect and only because I think tables of numbers would bore you and require more exposition) is to graph the cumulative percentage of team runs scored as we progressively add in games in which k runs were scored.

Boston was shutout eleven times; obviously those games contributed zero runs. They scored one run twelve times, which contributed 12 runs. They scored 875 runs overall, so this represented 1.37% of their total output. They scored two runs fifteen times, for a total of 30 runs. So games with 0-2 runs represented 42/875 = 4.8% of their total output. Continuing in this vein, we can get a graphical sense of the share of their runs that came on the tails of the distribution:
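The running tally is easy to mechanize. Here is a small Python sketch; the function name is mine, and only the first three game counts (the ones quoted above) are filled in, since the rest of Boston's distribution is not listed in this post.

```python
def cumulative_run_share(games_by_runs, total_runs):
    """Share of total runs scored in games of k runs or fewer, for each k."""
    shares = []
    running = 0
    for k, games in enumerate(games_by_runs):
        running += k * games          # games with k runs contribute k runs each
        shares.append(running / total_runs)
    return shares

# 11 shutouts, 12 one-run games, 15 two-run games, 875 runs on the season
shares = cumulative_run_share([11, 12, 15], 875)
```

The resulting shares are 0%, 1.37%, and 4.8%, as in the text.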



I’ve included the Enby distribution expectation as well as the overall 2011 major league average on the graph. The average major league team tallied 88.6% of its runs in games in which ten or fewer runs were scored, while we’d expect a team that averages 5.4 R/G to have scored 80% of its runs in such games. However, Boston only scored 71.7% of its runs in those contests.

Saturday, July 14, 2012

Josh Edgin, #53*

Last night, Josh Edgin made his major league debut for the Mets in their game against the Braves. Edgin is a 25 year old left handed pitcher from Lewistown, Pennsylvania. Edgin was drafted by New York in the 30th round of the 2010 draft out of Francis Marion University in South Carolina. Edgin has never made a pro start, appearing in 110 minor league games with solid raw numbers: 167 K and 55 W in 144 IP with a 3.06 RA.

But as prospects go, a 25 year old lefty reliever isn't particularly interesting. So why have I devoted a blogpost to him? My interest in Edgin is due to where he *started* his college career. As a freshman in 2007, Edgin appeared in 19 games, 7 of them starts, with a 6.55 RA (-2 RAA). In 2008, Edgin's workload was reduced, appearing in 13 games with just 4 starts, and his performance in those limited opportunities wasn't any better (-8 RAA). Seeking an opportunity to start, he transferred from OSU.

Transfers are quite common in college baseball, and I certainly hold no ill will against a player for seeking the best opportunity for his career or his education (as long as it's not at a certain school up north). But while Edgin did pitch for OSU, it would also be inappropriate to claim him as a full Buckeye from either the OSU or the Edgin perspective. Nonetheless, he joins 52 other confirmed OSU students as having played in the major leagues, and I wish him the best.

Monday, July 09, 2012

On Run Distributions, pt. 5: Estimating Variance

So far, I’ve demonstrated only that the zero-modified negative binomial distribution (which I’m referring to as the Enby distribution for the sake of my sanity) can provide a decent fit to actual scoring patterns when the sample variance of runs per game is known. In order for this approach to have value with teams for which we don’t know the variance (and that’s the whole point of the exercise--estimating what the distribution should be based on the average R/G rather than simply regurgitating a sample distribution), we need a way to approximate the variance (As a tease, some work being done independently should allow for a better approximation of the variance than what I've come up with here. At some point in the future, I will incorporate that method into my methodology).

Allow me to issue this disclaimer up front: the formula I’m proposing here is woefully inadequate. If the Enby distribution is to have any real value to sabermetricians, someone will have to come along and clean this part up (while I’m making a wish list, a better way of adjusting the parameters to return the correct average R/G after the zero modification would be nice too).

The scatterplot below shows the variance of R/G plotted against average R/G for each major league team, 1981-1996. You can see there is clearly a positive correlation between the two: the higher the average, the higher the variance:



There is no clear pattern that will help me in attempting to develop a function to estimate variance from the mean. In fact, if you plot the ratio of variance to mean against the mean, you get a big clump:



There is a positive correlation (r = +.27) between the ratio of variance/mean and the mean. However, using a regression equation to describe the relationship between the mean and the variance introduces the problem of illogical results at the extremes.
For example, a linear regression yields the following equation for variance as a function of mean:

Variance = 2.637*mean - 2.670

For any team scoring less than 1.013 R/G, this formula will predict a negative variance, which is obviously impossible. Granted, I’m under no delusions that the final method I offer will be of any use outside of the normal scoring range of major league teams, but I cringe at laying out a method that obviously cannot work at such extremes.

Another option is to estimate the variance ratio as a function of the mean. The benefit here is that this constrains the estimated variance; if a team scores zero runs, the estimated variance will be zero. The estimated variance can never be negative:

Variance/mean = .1345*mean + 1.430
So Variance = mean*(.1345*mean + 1.430) = 1.430*mean + .1345*mean^2
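The difference in behavior at the extremes is easy to see by putting the two estimators side by side. A brief Python sketch, with function names of my own choosing:

```python
def variance_linear(mean):
    """Straight linear regression of variance on mean R/G."""
    return 2.637 * mean - 2.670

def variance_from_ratio(mean):
    """Variance implied by regressing variance/mean on mean R/G."""
    return 1.430 * mean + 0.1345 * mean**2

# The linear fit crosses zero here; below it the variance estimate is negative
breakeven = 2.670 / 2.637

# Both estimators at the aggregate 4.4558 R/G
linear_at_avg = variance_linear(4.4558)
ratio_at_avg = variance_from_ratio(4.4558)
```

The break-even point is about 1.013 R/G, while the ratio-based version returns exactly zero at 0 R/G and can never go negative for a non-negative mean.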

Neither of these equations comes close to matching the aggregate results at 4.46 R/G, which is troublesome. But that value is itself an amalgamation of hundreds of individual team seasons, each recorded by teams that theoretically follow their own distributions of runs scored, and so I’m not sure a failure to match the result should be a death knell. Further complicating matters is that the estimate of variance will only be used to fit initial r and B parameters, with r then being varied to ensure that the mean of the distribution equals the actual mean.

Let me try using the second formula and run through the whole process to generate the expected run distribution for the aggregate 4.4558 R/G:

1. Estimate the variance of R/G from the mean:

Variance = 1.43*mean + .1345*mean^2 = 1.43*4.4558 + .1345*4.4558^2 = 9.042

2. Fit the parameters of an Enby distribution assuming no zero-modification:

B = variance/mean - 1 = 9.042/4.4558 - 1 = 1.029
r = mean/B = 4.4558/1.029 = 4.33

3. Estimate the parameter z (variable RI = (R/G)/9 = 4.4558/9 = .4951):

z = (RI/(RI + .767*RI^2))^9 = (.4951/(.4951 + .767*.4951^2))^9 = .0552

4. Calculate the probability for k runs (where k >=1) using the zero-modified formula, then find a new value for r that sets the mean of the zero-modified distribution equal to the desired mean (4.4558).

This step can only be done via some kind of computer algorithm; I used trial and error and get r = 4.364.

For the first time in the series, we have a version of the Enby distribution that is not blatantly cheating compared to other methods: I am not treating the variance or the probability of being shutout as known values, but rather am estimating them. We’re still not flying completely solo--the formula for estimating variance from mean was based on the same data that’s been aggregated, but we’re getting closer.

What would the estimates look like if we tried to apply them to a really extreme team? I do not expect the Enby to perform well at all, but it’s worth checking to confirm. I’ll try to estimate the run distribution first for a team that averages 1.5 R/G, then for a team that averages 10. I didn’t select these numbers for any particular reason other than that they are extreme, without being so crazy as to be beyond the range that anyone could possibly care (for practical rather than theoretical reasons) if the method worked for that point.

Obviously we don’t have an actual run distribution to compare to, but I’ll compare to the Tango-Ben distribution which, while also untested at these extremes, would be a better bet. First, the 1.5 R/G team (parameters z = .3387, B = .6318, r = 2.5706):



And the 10 R/G team (parameters z = .0039, B = 1.775, r = 5.638):



I was honestly surprised when I saw how closely Enby tracked Tango-Ben. Pleasantly so, of course, but surprised nonetheless.
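One check that can be made without the charts is whether the quoted parameters really do return the target means. The sketch below uses the closed form for the mean of a zero-modified negative binomial, (1 - z)*r*B/(1 - (1 + B)^(-r)), which follows from the unmodified mean of r*B; the function name is hypothetical, and the inputs are the rounded parameters given above.

```python
def enby_mean(B, r, z):
    """Mean of the zero-modified negative binomial with parameters (B, r, z)."""
    a = (1 + B) ** (-r)               # unmodified probability of zero runs
    return (1 - z) * r * B / (1 - a)

low_scoring = enby_mean(B=0.6318, r=2.5706, z=0.3387)   # target: 1.5 R/G
high_scoring = enby_mean(B=1.775, r=5.638, z=0.0039)    # target: 10 R/G
```

Both come back within rounding error of their targets, so the parameters are at least internally consistent even this far outside the normal scoring range.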

I’ve said this before in different ways, but it’s worth repeating: my experience working with probability distributions is either theoretical (I took a Stats course once where I rarely wrote a real number down for the entire class) or working with fixed distributions (i.e. you are given a Poisson distribution with parameter h = 2.05. What is the probability of three or fewer occurrences?) I have no practical knowledge of how to best fit a zero-modified distribution to sample data, and thus my work product here will be of little value. Hopefully I’ve provided enough promising results to encourage those of you who are skilled at this type of problem to consider the negative binomial as a model for runs per game.