Walk Like a Sabermetrician: January 2009

Monday, January 26, 2009

Historical Park Factors, 1901-2006

I have just posted a new spreadsheet with park factors for all teams, 1901-2006, as a Google Spreadsheet. These are five-year park factors, calculated in the same manner I describe on this page.

The guiding philosophy was to try to include as much data as possible. If there are five possible years of data to be used for a park, they will all be used, even if four of the seasons were in the past or in the future. The source of the raw data was KJOK’s excellent park database for past seasons and various sources for the 2008 data.

I treat a park as new if there are major changes to the dimensions, but I did not by any means do a complete historical survey to find out when those changes have taken place, so some that probably should have been treated differently are not. If you have specific data on when a change should have (or shouldn’t have) been made, feel free to leave a comment and I will try to incorporate these changes when I update the chart some time in the future.

Additionally, when a team moves, and a new team immediately moves in (for example, the Senators of ’60 and ’61), this is treated as a new team. Also, in cases in which teams have played a significant (which I defined as around ten or more) number of games in a different stadium in the same year, those years are treated as being a new park (an example is the Dodgers playing games in New Jersey the two years before they moved from Brooklyn). Whenever a “new park” of this sort is established, when the old order is restarted it is treated as another new park.

The reason the park factors are only shown through 2006 is that my ideal data set is two previous years, the year in question, and two future years. For most of the parks active in 2007, we will after 2009 be able to fill this dataset, and so I don’t want to publish a park factor now and change it later. However, there are a few parks where the 2006 or 2005 factors are not yet settled because they are new and there are not yet five years of data available. In these cases, I have listed a PF but marked it as one that will change in the future (this is indicated with an orange shading; park factors for the first year after a switch are in pink text).

Now I will give an example of how I chose the years to be considered in figuring the PF. Suppose we look at the Diamondbacks, who have played in Bank One Ballpark since 1998. In 1998, we have no previous data, but there are four future years of data, so the sample is 1998-2002. For 1999, there is one previous year, so we also look at three future years, and get 1998-2002. For 2000, there are two previous years, so we use two future years, and have a sample of 1998-2002. This is now in the ideal format of the year in question, plus the two immediately prior and future years. Of course, in 2001, we use the two previous years (1999 and 2000), and two future years (2002 and 2003), making the total sample 1999-2003, and it will continue in that manner until something changes.

Let’s also consider the end of the Braves’ tenure in Fulton-County Stadium. The last season there was 1996. For 1994, we have two previous years (’92 and ’93) as well as two future years (’95 and ’96), so we use 1992-1996. For 1995, we have just one future year, so we use three previous years, and also use 1992-1996, and the same for 1996.
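The year-selection rule from the two examples above can be sketched in a few lines of Python. This is only my reading of the rule as described, not the spreadsheet logic itself; the function name `sample_window` and its arguments are made up for illustration, and `first_year`/`last_year` stand for the first and last seasons of data available for the park as currently configured.

```python
def sample_window(year, first_year, last_year, width=5):
    """Return (start, end) of the seasons used for `year`'s park factor."""
    if not first_year <= year <= last_year:
        raise ValueError("year must fall within the park's span")
    half = width // 2
    start, end = year - half, year + half
    if start < first_year:
        # Not enough past seasons: borrow the shortfall from the future.
        end += first_year - start
        start = first_year
    if end > last_year:
        # Not enough future seasons: borrow the shortfall from the past.
        start -= end - last_year
        end = last_year
    # A park with fewer than `width` seasons simply uses all of them.
    return max(start, first_year), end


# Diamondbacks, with data available from 1998 through 2008
for season in (1998, 1999, 2000, 2001):
    print(season, sample_window(season, 1998, 2008))

# Braves' last years in Atlanta; any first_year of 1992 or earlier
# gives the same windows, so 1992 is used here as a stand-in.
for season in (1994, 1995, 1996):
    print(season, sample_window(season, 1992, 1996))
```

The first loop gives 1998-2002 for 1998 through 2000 and 1999-2003 for 2001; the second gives 1992-1996 for all three seasons, matching the examples in the text.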

Terpsfan101 has also published historical park factors recently, and here is a link. His differ from mine primarily in that they use more than five years of data when applicable (with a corresponding decrease in the amount of regression used). He includes both R/O and R/PA park factors (mine are based on R/G, which is strongly correlated to R/O), has home run factors, and also includes the nineteenth century.


Monday, January 19, 2009

Runs Per Win from Pythagenpat

I have written about this topic many times before, but you’ll have to bear with me as it is one of my favorites and I like to reexamine it from time to time.

The Pythagenpat method (of which, in the interests of full disclosure, I am a co-developer along with David Smyth) is, at this time, just about the most accurate single formula that seeks to quantify the relationship between runs and wins. The “single formula” qualifier is included to allow for the fact that other approaches may be more accurate, like the Tango Distribution or any other distributions that attempt to model runs per inning or game. However, using the Tango Distribution to describe the runs-wins relationship involves finding a runs/inning distribution, and then converting this into a runs/game distribution, and...suffice it to say, it cannot be reduced to a simple two or three lines of formulas that you can easily plug into a spreadsheet. That’s not a knock on more advanced approaches, just a reality that leads people to use Pythagenpat and other simple formulas.

Digression aside, Pythagenpat is a dynamic winning percentage estimator. Oftentimes, though, runs-wins converters are applied in 100% linear approaches (one such implementation that is used all the time is using linear weights to measure a batter’s contributions, then converting this to Batting Wins), and in such a case you want a simple converter that can convert runs to wins with knowledge of only run differential. Pythagenpat requires you to know runs and runs allowed, while a fixed exponent Pythagorean formula requires you to know the run ratio. If you are converting runs above a baseline to wins, you need a formula that works on the differential, since that is precisely what the comparison to a baseline is.

So sabermetricians have developed a number of formulas that give a generalized RPW value for an average team. The most common is a static 10 runs per win, but there are also many approaches that allow RPW to vary with the total number of runs scored. To serve as an example, one of the most common is Pete Palmer’s formula:

RPW = 10*sqrt(runs per inning), where runs per inning is the total runs for both teams
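For anyone who wants to plug this in, here is a minimal Python sketch of Palmer’s formula. Since the rest of this post works in runs per game for both teams (RPG), the function converts to runs per inning by assuming nine innings per game; that conversion and the function name `palmer_rpw` are my own choices for illustration.

```python
import math


def palmer_rpw(rpg, innings_per_game=9.0):
    """Palmer's RPW = 10 * sqrt(runs per inning for both teams)."""
    runs_per_inning = rpg / innings_per_game  # assumes 9-inning games
    return 10.0 * math.sqrt(runs_per_inning)


for rpg in (8.0, 9.0, 10.0):
    print(rpg, round(palmer_rpw(rpg), 2))
```

Under that nine-inning assumption, a scoring level of 9 RPG is exactly one run per inning, so the formula returns the familiar 10 runs per win.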

There are a number of other such formulas out there, and they all do their jobs well enough. However, if you grant for the sake of argument that Pythagenpat is the “best” (understanding its limitations and the qualifier about being a relatively simple formula) W% estimator, then you may be interested in how one can use Pythagenpat to derive such a formula.

First, there is a big assumption that needs to be made. In order to find the RPW value for an average team based at some particular RPG, we need to hold RPG constant. For example, if the RPG is nine, then a run differential (RD) of .1 run/game means that the team scores 4.55 runs and allows 4.45 runs. A RD of 1 run/game would be achieved with 5 runs scored and 4 runs allowed, and so forth.

With that assumption, the Pythagenpat exponent will be a constant, x, which is figured as RPG^z. You’ll see values between .27 and .29 used for z, and it is probably true that .28 is a better choice. However, whether you use .27, .29, or something in between will make essentially no difference with respect to the end game of this post.

Standard Pythagenpat estimates W% as:

EW% = R^x/(R^x + RA^x) = RR^x/(1 + RR^x), where RR = run ratio (R/RA)
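As a quick illustration, here is a small Python sketch of the constant-RPG assumption and the standard Pythagenpat estimate. The function names are made up for this example, R and RA are per-game figures, and z = .29 is just the value I happen to use later on.

```python
def runs_from_differential(rpg, rd):
    """Split a fixed RPG into (R, RA) per game for a run differential rd."""
    return (rpg + rd) / 2.0, (rpg - rd) / 2.0


def pythagenpat(r, ra, z=0.29):
    """Expected W% with the exponent figured as x = RPG^z."""
    x = (r + ra) ** z
    return r ** x / (r ** x + ra ** x)


# At 9 RPG, a differential of 1 run/game means 5 scored and 4 allowed.
r, ra = runs_from_differential(9.0, 1.0)
print(r, ra, round(pythagenpat(r, ra), 3))
```

An average team (R = RA) comes out at .500 regardless of z, which is a handy sanity check on any implementation.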

I may use a lot of calculus in my writing, relative to the average sabermetrician, but I’m not by any means a whiz at it. So the formula for the derivative of EW% with respect to RD (I am using RD to represent run differential per game; (R - RA)/G) that I’m about to print may very well be needlessly complex and easily simplified. Nonetheless:

dEW%/dRD = x*RR^(x-1)/(2*RPG*(RR^x + 1)^2*(.5 - RD/(2*RPG))^2)

This is actually in the form of wins/run; the reciprocal is runs per win, and is:

RPW = (2*RPG*(RR^x + 1)^2*(.5 - RD/(2*RPG))^2)/(x*RR^(x-1))
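Since I don’t fully trust my own calculus, a numerical check is reassuring. The sketch below (function names are mine) compares the closed-form RPW with the reciprocal of a finite-difference slope of EW%, holding RPG and the exponent x fixed as the derivation assumes.

```python
def ew_pct(rpg, rd, x):
    """Pythagorean W% for a team with fixed RPG and differential rd."""
    r, ra = (rpg + rd) / 2.0, (rpg - rd) / 2.0
    return r ** x / (r ** x + ra ** x)


def rpw_closed_form(rpg, rd, x):
    """Runs per win from the derivative formula in the text."""
    rr = (rpg + rd) / (rpg - rd)
    numerator = 2 * rpg * (rr ** x + 1) ** 2 * (0.5 - rd / (2 * rpg)) ** 2
    return numerator / (x * rr ** (x - 1))


def rpw_numeric(rpg, rd, x, h=1e-6):
    """Runs per win from a central-difference estimate of dEW%/dRD."""
    slope = (ew_pct(rpg, rd + h, x) - ew_pct(rpg, rd - h, x)) / (2 * h)
    return 1.0 / slope


rpg = 9.0
x = rpg ** 0.29  # exponent held constant at this scoring level
for rd in (0.0, 0.5, 1.0):
    print(rd, round(rpw_closed_form(rpg, rd, x), 4),
          round(rpw_numeric(rpg, rd, x), 4))
```

The two columns should agree to several decimal places, which suggests the formula is right even if it is not in its tidiest form.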

We are interested in a generalized formula for RPW that does not depend on the team’s ratio or differential between runs scored and allowed, just the RPG. Therefore, what we’re after is the RPW for an average team at a given RPG. Since the team is average (R = RA), we know that it has a RD of 0 and a RR of 1. Plugging that in:

RPW = (2*RPG*(1^x + 1)^2*(.5 - 0/(2*RPG))^2)/(x*1^(x - 1))
= (2*RPG*(2)^2*(.5)^2)/x
= (2*RPG)/x

For a standard Pythagorean equation with x = 2, this reduces to RPW = RPG.

In the case of Pythagenpat, we have set x = RPG^z, and so we can simplify further:

RPW = (2*RPG)/(RPG^z) = 2*RPG^(1 - z)

So for z = .29, the generalized RPW, derived directly from Pythagenpat, is 2*RPG^.71, and in order to estimate W% using this equation, you just use the general formula for all RPW estimators, W% = RD/RPW + .5.
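In spreadsheet or code terms this is about as simple as it gets. Here is a minimal Python sketch, with function names of my own choosing and z defaulting to .29:

```python
def pythagenpat_rpw(rpg, z=0.29):
    """Generalized runs per win for an average team: 2*RPG^(1 - z)."""
    return 2.0 * rpg ** (1.0 - z)


def linear_w_pct(rd, rpg, z=0.29):
    """W% = RD/RPW + .5, where rd is run differential per game."""
    return rd / pythagenpat_rpw(rpg, z) + 0.5


print(round(pythagenpat_rpw(9.0), 4))    # RPW at 9 RPG
print(round(linear_w_pct(1.0, 9.0), 3))  # team that is +1 run/game at 9 RPG
```

For a team outscoring its opponents 5 to 4, this gives a W% within about a point of what full Pythagenpat produces, and the gap grows as the team gets further from average.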

None of this is new; all of the above has been previously published in some form or another either by Ralph Caola (the general findings, in his articles in By the Numbers--see Nov/2003, Feb/2004, and May/2004) or by myself (the Pythagenpat application).

Suppose, however, that you’d like to further simplify the relationship between RPW and RPG. You don’t want to have to deal with any exponents and you’re not concerned about whether it works for extreme theoretical situations. You just want a straightforward formula that allows RPW to vary with RPG as you know it should, will be easy to calculate, and will work for normal major league teams.

There are a number of ways you could try to approximate the function above, but one of the easiest is to take the tangent line of the function at a particular point. Since a RPG of 9 is easy to remember and very close to the long-term MLB average, we’ll use that point to find our tangent line.

I’ll write the line in point-slope form, y - y1 = m(x - x1), where y will be RPW, y1 will be RPW at the specific point (RPG = 9), m is the slope of the RPW function at the point RPG = 9, x is RPG, and x1 is the RPG at the point RPG = 9 (9, naturally).

The derivative of 2*RPG^(1 - z) with respect to RPG is (1-z)*2*RPG^(-z) = 2*(1-z)/RPG^z. For z = .29, it is 2*.71/RPG^.29 = 1.42/RPG^.29, which evaluates to .7509 at RPG = 9.

The RPW for a RPG of 9 is 2*9^.71 = 9.5179, and so we can put it all together and get this formula:

RPW - 9.5179 = .7509*(RPG - 9)

Simplifying this and solving for RPW gives:

RPW = .7509*RPG + 2.7598

And since we’re going for simplicity here, why not make sure all the coefficients are multiples of .05?

RPW = .75*RPG + 2.75

Comparing this approximation to 2*RPG^.71, the two are in agreement to within .05 RPW for RPGs between 7 and 11.5. It is within .20 RPW for 5.5-13.5 RPG. Beyond that range, there is a lot of divergence. For example, at the known point RPW = 2 when RPG = 1, the linear approximation gives 3.5. Fortunately, though, 5.5-13.5 RPG encompasses the scoring range that is normally seen from major league teams, and the approximation is fine within those bounds.
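If you want to check those ranges yourself, a short Python sketch like the following will do it. The helper names and the 0.1 RPG step size are arbitrary choices of mine.

```python
def pythagenpat_rpw(rpg, z=0.29):
    return 2.0 * rpg ** (1.0 - z)


def linear_rpw(rpg, slope=0.75, intercept=2.75):
    return slope * rpg + intercept


def max_gap(lo, hi, step=0.1):
    """Largest absolute RPW difference on a grid from lo to hi RPG."""
    n = int(round((hi - lo) / step))
    grid = (lo + i * step for i in range(n + 1))
    return max(abs(linear_rpw(g) - pythagenpat_rpw(g)) for g in grid)


print(round(max_gap(7.0, 11.5), 3))   # stays under .05
print(round(max_gap(5.5, 13.5), 3))   # stays under .20
print(linear_rpw(1.0), pythagenpat_rpw(1.0))  # 3.5 versus the known 2.0
```

Because 2*RPG^.71 is concave and the line is close to its tangent at 9 RPG, the largest gaps in any range show up at the endpoints.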

So there you have it: a 100% linear winning percentage estimator derived from Pythagenpat (given the assumptions that I’ve made). As I mentioned before, there are a bunch of RPW estimators out there, so it wouldn’t be surprising if this one or something close to it has been published previously. And indeed, that is the case.

Tango Tiger uses the formula 1.5*(RPG + 2) to estimate RPW, except his formula defines RPG as the runs for one team whereas I am using it to represent runs for both teams. So in my terms, his formula is 1.5*(RPG/2 + 2), which simplifies to .75*RPG + 3.

Now you can see why it works--it is a consequence (*) of using Pythagenpat to derive a 100% linear estimator at the normal (9 RPG) major league scoring level, and since Pythagenpat is the “best” W% estimator, any formula you derive from it should be similar to what other approaches like regression would produce.

After I wrote this piece, this topic came up at Fangraphs as they use Tango’s formula. So I checked the 1961-2003 data (excluding 1981 and 1994) and found that the +3 intercept had a slightly lower RMSE in predicting W% (3.949 to 3.951 per 162 games). I was surprised by this, since the teams in the sample had a mean RPG of 8.74, and the tangent line I took was at RPG = 9. I don’t have an explanation for why this is, but I’ll pass it along anyway.

In order for the tangent line approach to approximating Pythagenpat RPW to yield an intercept of 3, RPG must be 10.3 (with the slope around .72) with a z value of .29. With z = .28, you would need a RPG of 10.72 to get an intercept of 3. This is related to the phenomenon broached in the last paragraph, and I can’t explain it, although I’m not sure it’s something to be concerned about.

Allow me to finish on a digression. Among the W% estimators that can be written as relatively simple formulas, there are two main types: differential estimators and ratio estimators. Of course, the distinction I’m drawing is that the input into differential estimators is run differential and the input into ratio estimators is run ratio.

Within each of those classes, you can break it down further into what I’ll call “dumb” and “smart” methods. Dumb methods use a fixed RPW or a fixed exponent; they assume that the relationship between runs and wins is the same regardless of how high scoring goes. 10 runs = 1 win is a dumb differential estimator; Pythagorean with a fixed exponent like 2 is a dumb ratio estimator.

Smart estimators, of course, change the price of a win as the scoring level changes. Palmer’s formula or Tango’s formula exemplify smart differential estimators, while Pythagenport or Pythagenpat are smart ratio estimators.

I’m not really going anywhere with this except to say that I think it is pretty clear that the smart ratio estimators work better, theoretically, than the smart differential estimators. (As an aside, a smart differential estimator can definitely be more accurate with normal teams than a dumb ratio estimator. The dumb ratio will win some under extreme conditions since it is bounded by zero and one, but a smart differential estimator can beat it when applied to normal ranges). So, by using a differential method, you are already sacrificing some theoretical accuracy in favor of expediency. So why not simplify things further, and use a 100% linear approach?

RPW = 2*RPG^.71 is linear in a sense, since it is a differential estimator. It values each additional run equally; you fix the RPG to begin with, and as your differential changes (but the total stays the same), each run you gain is worth 1/(2*RPG^.71) wins.

It is not, of course, a purely linear function, since it has an exponent. And my point is, why bother? It’s nice to have that formula around to answer specific questions, but if you are ever going to apply it generally, you should either 1) just go ahead and use Pythagenpat or 2) use something simpler. And that is why .75*RPG + 2.75 or 3 is a nice little formula to have around. That it can be semi-derived from Pythagenpat? All the better.

Technical addendum: If you want a general formula for the tangent line so that you can try it with different values of z and RPG, here it is:

pRPW (pythpat RPW) = 2*RPG^(1 - z)
m = 2*(1-z)/RPG^z
b = pRPW - m*RPG
lRPW (linear RPW) = m*RPG + b
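
The same four lines translate directly into Python. This is just a sketch of the addendum formulas; the function name `tangent_line` is mine.

```python
def tangent_line(rpg, z=0.29):
    """Slope and intercept of the tangent to 2*RPG^(1 - z) at `rpg`."""
    p_rpw = 2.0 * rpg ** (1.0 - z)    # Pythagenpat RPW at the chosen point
    m = 2.0 * (1.0 - z) / rpg ** z    # slope of the RPW curve there
    b = p_rpw - m * rpg               # intercept of the tangent line
    return m, b


m, b = tangent_line(9.0, z=0.29)
print(round(m, 4), round(b, 2))

# Try other combinations of z and RPG
for z in (0.27, 0.28, 0.29):
    print(z, tangent_line(9.0, z))
```

At RPG = 9 and z = .29 this gives a slope of about .751 and an intercept of about 2.76, in line with the hand calculation above (small differences in the last digit come from rounding the slope before computing the intercept).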

(*) Of course, that’s a convoluted way of looking at it--all of these equations are the result of attempting to model the reality of baseball, and are a consequence of that, not each other. However, if you start from the premise that Pythagenpat is the best model (which can certainly be debated), and proceed from there to find a linear estimator of RPW, .75*RPG + b is where you end up.

Tuesday, January 13, 2009

A Perfunctory Look at Run Distribution and W%, 2008

Before I start, I want to emphasize the word “perfunctory” in the title again, as it really is just that. There is a lot of stuff that you could do with this data, and I’m just looking at a couple of things here.

Baseball Prospectus has some nice stats on their website for team record by runs scored, runs allowed, and run differential. You can download these as CSV files and open them in your spreadsheet program.

Let’s start by looking at one-run games. There were 680 one-run games in the majors last season out of a total of 2,428 (28%). The team with the most one-run games was Toronto (56), not a surprise considering that their games also had the lowest RPG in the majors. Cleveland (31) played in the fewest one-run games.

The chart is sorted by the difference between W% in one-run games and W% in all other games (labeled “else”):

I don’t need to lecture the readers of this blog about how unsurprising it is that most of the teams at the top of this list were not very good in terms of overall record, and that the opposite is true for those on the bottom.

Atlanta was peerless in their struggles in one-run games; Seattle had the second-worst W%, but they lost the same number as the Braves while winning seven more. Milwaukee had the highest W% and tied with Tampa Bay at 11 over (I am using “over .500” in the sense that the mainstream uses it, simply wins minus losses. I am well aware of the difference between this and wins above average).

Let’s also consider record in blowouts. There is no agreed upon definition for a blowout (nor is “blowout” necessarily the best word to use in this context), but I have defined them as games decided by five or more runs. The five-run threshold, for 2008 at least, has the nice property that there were 712 such games, or 29% of all games, compared to the 28% decided by one run.

The team that participated in the most blowouts was Colorado (58), which is no surprise when you consider their home park. Next was the Dodgers (55), which goes to show you that park is by no means the sole determining factor (although Dodger Stadium’s PF has crept up to hover near 1.00 in recent years). Toronto played in the fewest blowouts (34):

However, when a Jays game was not close, it was usually a very good thing for them. Their .706 W% was best in the majors, although Chicago was 19 games over .500 while Boston and Minnesota also had a greater margin than Toronto’s +14. On the flip side, the Nationals’ dreadful .280 W% and -22 was most closely approached by the Royals (.340, -15). The Angels were the only playoff team that did not have a winning record in blowouts (20-20).

What about the other 43% of games which are decided by two to four runs? Cleveland played the most of these games (82), while Colorado and Minnesota participated in the fewest (59):

It was in these games that the Angels really made their hay, playing .700 baseball; among the other 29 clubs, only Houston was above .600, and LA’s +28 blew away the Astros’ +17. Seattle (.390, -16) brought up the rear in these intermediate margin games.

Since BP kept data based on the total number of runs scored and allowed, we can look at some other, more interesting breakdowns. The most basic is the overall MLB frequencies and winning percentages by runs scored:

Three runs is the mode, occurring in 13.7% of games. Teams that scored four runs had a .480 W%, while a fifth run boosted the W% up to .639; that increase is the largest for any one-run increment.

In the 1986 Baseball Abstract, Bill James wrote about an alternative kind of Offensive Winning Percentage based on the number of times X runs were scored. If a team scored one run, they would get credit for .083 offensive wins, since the average team that scores one run wins 8.3% of the time. If they scored two, they would be credited with .202 offensive wins, and so on.

I will apply this Jamesian concept here. I have lumped all games with ten or more runs together with a uniform .950 W%. I acknowledge that it would be better to base this type of analysis on a theoretical model rather than the actual winning percentages, but again, I’m not holding up anything in this post as state of the art. I have figured the team’s conventional OW% as well, with the exponent figured as x = (R/G + 4.65)^.29, and OW% = R^x/(R^x + 4.65^x). 4.65 is the combined MLB average for R/G. There is no park adjustment in either figure; I have labeled the James alternate approach, based on the number of times scoring x runs, as “gOW%”.
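To make the two calculations concrete, here is a rough Python sketch. The function names are mine, the W% lookup holds only the handful of values quoted in this post (the full table is in the chart, not reproduced here), and the "toy team" is entirely made up so that it only needs those quoted values.

```python
def g_ow_pct(games_by_runs, w_pct_by_runs, cap=10, cap_w_pct=0.950):
    """James-style OW%: average the league W% for each run total scored."""
    total_games = sum(games_by_runs.values())
    credit = 0.0
    for runs, games in games_by_runs.items():
        # Games with ten or more runs are lumped together at .950.
        w = cap_w_pct if runs >= cap else w_pct_by_runs[runs]
        credit += games * w
    return credit / total_games


def ow_pct(runs_per_game, league_rg=4.65, z=0.29):
    """Conventional OW% against a league-average defense."""
    x = (runs_per_game + league_rg) ** z
    return runs_per_game ** x / (runs_per_game ** x + league_rg ** x)


# League W% by runs scored: only the values quoted in the text.
W_PCT = {1: 0.083, 2: 0.202, 4: 0.480, 5: 0.639}

# Hypothetical team: ten games each at 1, 2, 4, 5 and 12 runs scored.
toy_team = {1: 10, 2: 10, 4: 10, 5: 10, 12: 10}
print(round(g_ow_pct(toy_team, W_PCT), 3))  # 0.471

# Conventional OW% for a team scoring 5.07 R/G
print(round(ow_pct(5.07), 3))  # 0.542
```

Note that the toy team averages 4.8 R/G, above the league’s 4.65, yet its gOW% is below .500: the 12-run games get no more credit than 10-run games would.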

One note about my use of OW%, a construct I have railed against in the past, for this application. My objections to OW% are mostly to its use on the player level. On the team level, it is much more palatable, as it does answer the question “If this team had average defense, how many games would we expect them to win?” When you ask that question for a player, it is at best an abstraction since no player is his own offense.

As you will see, the two approaches generally yield similar results. The large differences, arbitrarily defined as +/- .015, are as follows:

Positive (gOW% > OW%): HOU, SF, MIL, SEA
Negative: TEX, NYA, DET

If a team has a higher gOW% than OW%, they are expected to win more games when you consider their run distribution than if you just look at their average runs scored. gOW% is not any more impressed by a team scoring 17 runs in a game than 11 (it should be, a little bit, but it is definitely a situation in which diminishing returns are in play), while conventional OW% would add around half a win/162 games for those six (mostly meaningless) extra runs.

The Tigers are a team that many people observed were boom or bust offensively. Based on their average of 5.07 R/G, they have an OW% of .542, but their gOW% is just .524. The comparison of OW% to gOW% backs up the observation, although any potential predictive value is undemonstrated.

We can also turn things around and look at team defenses by figuring DW% and gDW%, assuming for DW% that the offense is league average (4.65 R/G).

The biggest divergences:

Positive (gDW% > DW%): TOR, PHI, TB
Negative: TEX, COL

Nice coincidence that both pennant winners are in the positive group.

We can combine gOW% and gDW% into what I will call gEW%, and then compare it to actual W% or EW% (based on Pythagenpat or your win estimator of choice). As I have set it up here, we are assuming independence between the number of runs scored and allowed in a game (this is very likely a faulty assumption). In order to make this easy, I first convert gOW% and gDW% into an equivalent run ratio (rather than a winning percentage). Given a pythagorean exponent of x (x is unique for the offense and defense as above):

Rrat = (gOW%/(1 - gOW%))^(1/x)
RArat = ((1-gDW%)/gDW%)^(1/x)

Then, figure a new pyth exponent for the entire team (y = RPG^.29). That allows us to calculate gEW%:

gEW% = Rrat^y/(Rrat^y + RArat^y)
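Here is how I would sketch the whole conversion in Python. I am reading "x is unique for the offense and defense as above" to mean x = (R/G + 4.65)^.29 for the offense and x = (RA/G + 4.65)^.29 for the defense; the function and argument names are made up for illustration.

```python
LEAGUE_RG = 4.65  # MLB average R/G used throughout this post
Z = 0.29


def g_ew_pct(g_ow, g_dw, rg, rag):
    """Combine gOW% and gDW% into gEW% via equivalent run ratios.

    rg and rag are the team's runs scored and allowed per game.
    """
    x_off = (rg + LEAGUE_RG) ** Z    # exponent used for OW% / gOW%
    x_def = (rag + LEAGUE_RG) ** Z   # exponent used for DW% / gDW%
    r_rat = (g_ow / (1.0 - g_ow)) ** (1.0 / x_off)
    ra_rat = ((1.0 - g_dw) / g_dw) ** (1.0 / x_def)
    y = (rg + rag) ** Z              # exponent for the whole team
    return r_rat ** y / (r_rat ** y + ra_rat ** y)


# Hypothetical team: 5.0 R/G, 4.0 RA/G, with made-up gOW% and gDW%
print(round(g_ew_pct(0.540, 0.575, 5.0, 4.0), 3))
```

One useful property of setting it up this way: if you feed in the conventional OW% and DW% instead of the "g" versions, the run ratios come back as R/4.65 and RA/4.65, and the result is exactly the ordinary Pythagenpat EW%.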

Here are the teams sorted by gEW%:

Most of the big differences between gEW% and W% will be teams whose pythagorean records also diverge (like the Angels). So the differences I’ll highlight here are between gEW% and EW%, by two or more games:

Positive (gEW% > EW%): HOU, SF, KC, COL, LAA
Negative: DET, CHN, TOR, PHI, CHA

To put this into words, teams in the positive category were expected to win more games when their distributions of runs scored and allowed are considered (independently of each other) than when only their averages of runs scored and allowed are considered.

While the Angels look better when you consider their run distribution rather than average runs, they still ended up with an actual W% far beyond their gEW% (.617 to .559, easily the largest positive difference in the majors at +9.5 wins). The worst team in this difference was the Mariners (.377 to .420, -7 wins).

Unsurprisingly, twenty of the thirty teams have a smaller difference between gEW% and W% than between EW% and W%. The RMSE/162 games for predicting W% is 3.47 for gEW% and 4.19 for EW%.
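For reference, the RMSE figures are expressed in wins per 162 games. A minimal Python sketch of that calculation follows, assuming each W% error is simply scaled to a 162-game schedule before squaring; the function name is mine.

```python
import math


def rmse_per_162(actual_w_pct, predicted_w_pct, games=162):
    """Root mean squared error of a W% estimate, in wins per `games`."""
    errors = [(a - p) * games for a, p in zip(actual_w_pct, predicted_w_pct)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Only the two extreme teams quoted in the text (Angels, Mariners);
# the full calculation would of course use all thirty teams.
actual = [0.617, 0.377]
g_ew = [0.559, 0.420]
print(round(rmse_per_162(actual, g_ew), 2))
```

With just those two outliers the figure is naturally far larger than the 3.47 for the full league.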

One interesting thing is that no teams are in the range of .477-.505 gEW%. Only two teams were in this range in actual W% although four were for EW%. It doesn’t mean anything, but it’s odd to have a .028 range of gEW% (between 78 and 81 wins) right around the mean not represented.

As I said, there are other sabermetricians out there who could come up with some more revealing ways to look at this data. Hopefully you have found something interesting in this perfunctory look.