Due to the mathematical nature of this research, we have included an additional post focused entirely on the math. It is referenced throughout this post; detailed information and discussion about the research can be found there.
“Small sample size” is a phrase often used throughout the baseball season when analysts and fans discuss player statistics. Every fan, to some extent, has an idea of what a small sample is, even if they don’t know it by name: a player who goes 2 for 4 in a game is not a .500 hitter; a reliever who hasn’t allowed a run before April 10 is not a 0.00 ERA pitcher. It’s easy to recognize a small sample. The question, however, is when samples stop being small and start becoming useful and meaningful.
Our goal in this project is to expand the understanding of reliability and show a more complete picture of how additional plate appearances affect the reliability of daily stats for hitters and pitchers. We want to reinforce the idea that reliability is a spectrum: there is no single point at which you can say a stat has stabilized. We also want to use the concept of reliability to regress to the mean and build confidence bands that give a better idea of a player’s true talent. (Throughout this project, we define true talent as a player’s actual talent level, not the value he provides adjusted for park, competition, etc.)
We used an approach similar to Carleton’s in his latest reliability study: Cronbach’s alpha. There are, however, differences in our sampling structure, which we explain in more detail in the math article.
We used a data set that is also similar to Carleton’s most recent studies – the Retrosheet data – but we used more recent data from a shorter period (2009 to 2014). We’ve also removed intentional walks, bunts, and non-hitter events such as stolen bases.
From there, we split the data into player-seasons instead of just players: 2012 Mike Trout, for example, is different from 2013 Mike Trout. We then used these player-seasons to define samples for a given number of plate appearances (PA), at-bats (AB), or balls in play (BIP). So, for 10 PA, we took 10 random plate appearances from each player-season with at least 10 PA; for 600 PA, we took 600 random plate appearances from each player-season with at least 600 PA. As you can imagine, the 10-PA sample has many more player-seasons than the 600-PA sample. Since we used player-seasons, we capped our sampling at 600 PA, 500 AB, and 400 BIP. Beyond these limits, the number of qualifying player-seasons becomes too small and the results become erratic.
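The sampling step described above can be sketched roughly as follows. This is only an illustration, not the code used for the study: the dictionary layout, the function name `sample_player_seasons`, the toy data, and the choice to sample without replacement are all assumptions made for the example.

```python
import random

def sample_player_seasons(player_seasons, n, seed=0):
    """Draw n random plate appearances from each player-season with at least n.

    player_seasons: dict mapping (player, year) -> list of 0/1 outcomes,
    where 1 means the event of interest happened (hypothetical layout).
    """
    rng = random.Random(seed)
    return {
        key: rng.sample(outcomes, n)  # assumed: sampling without replacement
        for key, outcomes in player_seasons.items()
        if len(outcomes) >= n         # drop player-seasons that are too short
    }

# Toy data: two player-seasons with 12 PA each and one with only 3 PA.
toy = {
    ("Player A", 2012): [1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0],
    ("Player A", 2013): [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1],
    ("Player B", 2013): [1, 0, 0],
}
sampled = sample_player_seasons(toy, n=10)
```

With `n=10`, the 3-PA player-season is excluded, which is why the pool of player-seasons shrinks as the PA threshold grows.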
We chose this sampling structure because we believe it best represents the general question of the reliability of statistics. The more recent, smaller data set mitigated a bias we found associated with the changing run environment of Major League Baseball. We don’t assume that a player’s talent level is the same from year to year, so we’ve separated each season for each player. This also allows players to be compared from year to year. We will detail the implications and effects of sampling in a future article.
There are many different methods for measuring reliability, which is mathematically related to correlation but is a different construct with different assumptions. We chose Cronbach’s alpha because it provides a good framework for measuring the reliability of a full sample of plate appearances. Given the nature of the data – different parks, pitchers, time of year, etc. – there was no obvious way to split the data, so we used a method that splits it in as many ways as possible. Again, you can read more about Cronbach’s alpha and reliability in the math post.
The calculation of Cronbach’s alpha gives a value — alpha — which is a measure of reliability. The value represents the proportion of true talent variance to observed variance.
This is not the same as r, r-squared, or linear regression.
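For readers who want to see the calculation, here is a minimal sketch of the standard Cronbach’s alpha formula, treating each sampled plate appearance as an "item" and each player-season as a "subject". The function name and the tiny 0/1 data sets are made up for illustration; the actual sampling structure is described in the math post.

```python
from statistics import variance

def cronbach_alpha(rows):
    """Cronbach's alpha for a table of outcomes.

    rows: list of equal-length lists. Each row is one player-season and
    each column is one sampled plate appearance (0/1 outcome).
    """
    k = len(rows[0])                                    # number of items (PA)
    item_vars = [variance(col) for col in zip(*rows)]   # variance of each PA slot
    total_var = variance([sum(row) for row in rows])    # variance of row totals
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Perfectly consistent players: every PA agrees with the player's total.
consistent = [[1, 1, 1], [0, 0, 0], [1, 1, 1], [0, 0, 0]]
# Outcomes that carry no shared signal across the two PA slots.
noisy = [[1, 0], [0, 1], [1, 1], [0, 0]]

alpha_consistent = cronbach_alpha(consistent)
alpha_noisy = cronbach_alpha(noisy)
```

In the first toy table the totals explain everything, so alpha comes out at 1; in the second the two columns are unrelated, so alpha comes out at 0. Real baseball data falls between those extremes.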
Below is a data visualization of the reliability of various batting stats with respect to PA/AB/BIP counts. The lines represent the measured reliability at each 10-PA/AB/BIP increment for each statistic. To calculate the regression to the mean and the associated confidence band, enter the value of the statistic in the red box and select the appropriate confidence level. Then move along the line to see the calculation results for each PA/AB/BIP increment.
The reliability of each stat increases as the number of PA/AB/BIP increases, and the curve rises at a slower rate as the value approaches 1.0. One of the goals of this project is to demonstrate how the reliability of statistics changes with the number of PA. More importantly, there is no single point at which a stat becomes stable – each additional PA/AB/BIP simply increases reliability. Even with low reliability, there is information in the statistic; it just has more noise than a stat with high reliability.
Regression to the mean and confidence bands
Reliability values are useful for comparing different statistics, but they do not express the uncertainty of a statistic in tangible terms. In other words, they don’t give you a likely point or range for the player’s actual skill. Regression to the mean and confidence bands allow us to estimate a floor and a ceiling for this uncertainty.
This diagram shows how to regress to the mean and create confidence bands from the regressed statistic. Since we are estimating true talent from an observed statistic, the first step is to regress the statistic to the mean. If a statistic has low reliability, the sample mean is a better estimate of true talent. High reliability means that the stat contains more information about true talent, so it is regressed much less toward the mean. Reliability provides an empirical method for regressing to the mean in a manner similar to the mathematical approach described in the appendix of Tango’s The Book (more on this in the math post).
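As a rough illustration of the first step, the sketch below weights the observed statistic by its reliability and the sample mean by the remainder. The function name and the numbers (a .400 observed rate, a .320 sample mean, a reliability of 0.5) are hypothetical and chosen only to make the arithmetic easy to follow; the exact calculation we used is in the math post.

```python
def regress_to_mean(observed, sample_mean, alpha):
    """Shrink an observed stat toward the sample mean.

    alpha is the reliability at the player's PA/AB/BIP count:
    alpha = 0 returns the sample mean, alpha = 1 returns the observed stat.
    """
    return sample_mean + alpha * (observed - sample_mean)

# Hypothetical example: .400 observed, .320 sample mean, reliability 0.5.
estimate = regress_to_mean(0.400, 0.320, 0.5)
```

With a reliability of 0.5, the estimate lands halfway between the observed value and the sample mean; a lower reliability would pull it closer to the mean.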
The second part uses the total standard deviation of the sample to estimate the uncertainty and the upper and lower bounds. The higher the standard deviation, the wider the confidence band. (These confidence bands are not the same as the binomial standard error.)
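The band itself can be sketched as below. This post does not spell out the formula, so the example assumes the standard error of estimation from classical test theory, SD × sqrt(alpha × (1 − alpha)), and a normal approximation for the chosen confidence level; the math post is the reference for what we actually did. All names and numbers here are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def confidence_band(observed, sample_mean, sample_sd, alpha, level=0.95):
    """Return (lower, center, upper) for the estimated true talent.

    center: the observed stat regressed toward the sample mean.
    Band half-width is assumed to be z * SD * sqrt(alpha * (1 - alpha)),
    where SD is the total standard deviation of the sample.
    """
    center = sample_mean + alpha * (observed - sample_mean)
    std_error = sample_sd * sqrt(alpha * (1 - alpha))
    z = NormalDist().inv_cdf(0.5 + level / 2)  # two-sided normal quantile
    return center - z * std_error, center, center + z * std_error

# Hypothetical example: .400 observed, .320 mean, .040 SD, reliability 0.5.
lower, center, upper = confidence_band(0.400, 0.320, 0.040, 0.5)
```

A larger sample standard deviation or a higher confidence level widens the band, which matches the behavior described above.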
All previous reliability studies, and this one, are based on the math typically used for test evaluation, where researchers try to gauge how well a test is constructed. The basic idea is that there is a true score (in our case, the skill level), an error (or noise), and an observed score (or observed statistic).
What reliability attempts to measure is the ratio of true talent information to observed information. If there is not much true information, the reliability will be lower; if there is a lot of information, the reliability will be higher. The noise term contains almost every factor that could affect a plate appearance: pitcher, park factors, weather conditions, injuries, etc. The purpose of this analysis is to create reliability measures and confidence bands for daily stats, which do not contain these adjustments, so we left all of our data unadjusted.
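In symbols, the setup above is usually written as follows. The notation (X for the observed statistic, T for true talent, E for noise) is the conventional one from test theory and is introduced here only for illustration; the second line assumes talent and noise are uncorrelated.

```latex
\begin{align}
X &= T + E \\
\sigma_X^2 &= \sigma_T^2 + \sigma_E^2 \\
\text{reliability} &= \frac{\sigma_T^2}{\sigma_X^2}
  = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}
\end{align}
```

Cronbach’s alpha is an estimate of that last ratio computed from the data.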
Reliability is partly determined by the distribution of skills within the sample. Therefore, sampling becomes an important factor in determining reliability. We tried several variations of the sampling structure, including the one Carleton used in his most recent study. The results followed similar patterns, but there were some discrepancies due to the different groups of players used. Restricting a sample to a high minimum number of PA reduces the standard deviation, because players with better stats get more PA. This eliminates the lower tier of players, and the remaining players are grouped more tightly. The wider the spread of talent, the higher the reliability; the narrower the spread, the lower the reliability. We discuss this more in the math post.
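To see why the spread of talent matters, here is a simplified sketch. It assumes a plain binomial model, in which the noise variance of a rate stat after n trials is p(1 − p)/n, which is not how we measured reliability; it only illustrates the direction of the effect. The function name and all numbers are hypothetical.

```python
def binomial_reliability(talent_sd, league_rate, n):
    """Share of observed variance that is talent variance after n trials.

    Assumes binomial noise around a common league rate; this is an
    illustration of the talent-spread effect, not the study's method.
    """
    talent_var = talent_sd ** 2
    noise_var = league_rate * (1 - league_rate) / n  # binomial sampling noise
    return talent_var / (talent_var + noise_var)

# Same stat and same 300 PA, but two different player pools.
wide_pool = binomial_reliability(talent_sd=0.030, league_rate=0.320, n=300)
narrow_pool = binomial_reliability(talent_sd=0.015, league_rate=0.320, n=300)
```

The pool with the narrower talent spread gets the lower reliability even though the noise per plate appearance is identical, which is the effect a high PA minimum has on the sample.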
The most important conclusion to draw is that there is no single point at which a statistic becomes reliable or stable. The alpha reliability data visualization illustrates this idea, using the reliability measure to regress the statistic to the mean and create confidence bands. Regressed statistics and confidence bands are descriptive rather than predictive and are not adjusted for park factors, league adjustments, etc. They can provide an estimate of a player’s true talent level based on his performance, but they are not intended to be projections.
NOTES: If you compare our results with the results of Carleton’s analysis, keep in mind that we report the alpha for the entire PA sample. His previous analysis found the number of PA/AB/BIP associated with a certain value of alpha (0.70) and then halved that PA/AB/BIP value. The Cronbach’s alpha calculation finds the reliability coefficient associated with the entire sample, so it does not need to be halved. Our ratio method is essential for regressing to the mean and calculating confidence bands.
The code we used—along with a .csv file of the results—is available at GitHub.