Throughout MLB’s long and storied history, narratives have shaped fans’ enjoyment of the game as well as many front offices’ strategies and team building. From the fan perspective, these stories build the mythology and lore of the sport. But from the team perspective, these unbacked assumptions can often be harmful and limiting. One of the best examples is the long-held belief that only a special, “mentally tough” reliever is capable of handling the stress of finishing a game. These “proven closers” supposedly have the guts and experience to hold up under the immense weight of securing a save. This assumption has finally started to be questioned in recent years as analytically minded front offices have rigorously tested many of these ancient baseball narratives. The shift is easy to see with a simple glance at whom successful teams trust with the role today compared to ten years ago.

For this study, my cutoff for “proven” was at least 75 career saves in the years preceding the chosen season. These pitchers would have recorded enough experience to allegedly calm their nerves and earn the title of proven closer in the public baseball lexicon. During the 2010 season, 60 percent of the league’s top 15 teams had these guys finishing games. That’s lower than in previous decades, but it is huge compared to this season: in 2019, only 40 percent of the top 15 teams fielded these wily, validated veterans. While some are among the top pitchers in the league, including the likes of Aroldis Chapman, Craig Kimbrel, and Sean Doolittle, the other group has held its own. Inexperienced guys like Luke Jackson, Josh Hader, and Taylor Rogers have taken the league by storm, and while they are not as publicly recognized, they are often just as effective.

This, however, is all just anecdotal evidence. For the rest of this paper, I will evaluate whether this recent strategic change is valid and whether being a proven closer really does make you more qualified to man the most continuously stressful position in the game.

The first test I will be running is a general comparison between the “proven closers” and the not-proven (which I’ll call the “young’uns”) for 2010 and 2019. One interesting wrinkle in this comparison is that these relievers have each played both roles at different points in their careers. Because of this, you obviously can’t just compare career stats for guys with 75 saves against those without. The fix is simple: treat each season as its own individual data point. This will cause players like Mariano Rivera to have around 15 data points in the proven closer bucket, meaning he will have a larger impact on the end results than, for example, Luke Jackson. But this is fair, as his sample is much larger and deserves a higher weight than a guy who has pitched only 30 innings in save situations.
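A minimal sketch of that season-by-season bucketing, assuming each record is a hypothetical `(player, year, saves)` tuple. The function name, record layout, and sample numbers are illustrative and not the study’s actual code or data.

```python
from collections import defaultdict

PROVEN_THRESHOLD = 75  # career saves banked before the season, per the cutoff above


def label_player_seasons(seasons, threshold=PROVEN_THRESHOLD):
    """Tag each (player, year, saves) season by the saves recorded BEFORE it."""
    prior_saves = defaultdict(int)
    labeled = []
    # Walk each player's seasons chronologically so the running total only counts the past.
    for player, year, saves in sorted(seasons, key=lambda s: (s[0], s[1])):
        bucket = "proven" if prior_saves[player] >= threshold else "young'un"
        labeled.append((player, year, bucket))
        prior_saves[player] += saves
    return labeled


# Made-up save totals for a hypothetical pitcher, not real Baseball Reference data.
example = [("Pitcher A", 2017, 40), ("Pitcher A", 2018, 38), ("Pitcher A", 2019, 35)]
for row in label_player_seasons(example):
    print(row)
```

The same pitcher lands in the young’un bucket for his first two seasons and in the proven bucket once his prior total passes 75, which is how one career can feed both groups.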

After I built the aforementioned datasets, I took weighted averages of the two groups’ performance in save situations. The statistics taken into account were ERA, WHIP, K/9, and OPS allowed. I would have preferred to use some different metrics, but the convenience of Baseball Reference’s save situation splits led me to use their numbers, which should be more than fine for this exercise. I then took the means of each of these statistics for both player buckets and used them to run Welch’s t-tests, a statistical method that compares the means of two groups without assuming equal variances or equal sample sizes. The results were consistent across every stat and gave some good insights. I expected the proven closers’ numbers to look relatively similar to the young’uns’ numbers, but with maybe slightly better results due to the higher weight on a few very good players, like Rivera and Hoffman. What I found instead was that the young’un group outperformed their proven counterparts on every stat. ERA, which is probably the most important stat I tested, had the young’un group coming in at 2.95 in save situations compared to 3.09 for the vets. This may not seem like a lot, and the test returned a p-value of .21 under the null hypothesis that the proven group has the lower ERA. In layman’s terms, .21 is above the conventional .05 threshold, so a gap this size could still be random variation; what this dataset does not show is any sign that having 75 career saves leads to a lower ERA in save situations. This pattern was consistent among the other statistics. While this provides some evidence for the irrelevance of the proven closer motif, these results could stem from other biases in the dataset, including that the young’uns are, well, younger, and that most guys who recently entered the closer spot are playing at the top of their game. This makes it necessary to conduct further testing if we want to say more confidently that the “proven closer” is a myth.
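For readers who want to see what the test actually computes, here is a rough sketch of Welch’s t statistic and its degrees of freedom using only the Python standard library. The `welch_t` helper and the sample ERAs are made up for illustration, and this unweighted version ignores the innings weighting described above. Turning the statistic into a p-value requires the t distribution, which a library call such as `scipy.stats.ttest_ind(a, b, equal_var=False)` handles.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each group's mean; variances are NOT pooled.
    se_a = variance(sample_a) / n_a
    se_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(se_a + se_b)
    dof = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t_stat, dof


# Made-up save-situation ERAs, one value per player-season (illustrative only).
proven_era = [2.10, 3.40, 2.90, 4.10, 3.00]
young_era = [2.50, 3.10, 2.70, 3.60]
t_stat, dof = welch_t(proven_era, young_era)
print(f"t = {t_stat:.3f}, df = {dof:.1f}")
```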

One of the main issues with doing overarching quantitative analysis on this subject is the bias created by the uneven opportunities teams hand out. In other words, playing time is not randomly assigned to the players in the league. Teams are trying to win, and it obviously hurts this goal to have subpar pitchers on the hill during the highest-leverage part of the game. To account for this, we need to establish a baseline talent level for each player. That way we isolate pitching in the 9th inning as the only variable. To do this, I compiled the statistics for each player in the previous datasets for both save and non-save situations. If having 9th inning experience makes an actual difference, the proven closers bucket would show a much larger negative delta (save-situation numbers minus non-save numbers) than the young’uns.

The results were as follows:

| | ERA Save Sit | ERA Non-Save Sit | OPS Save Sit | OPS Non-Save Sit |
| --- | --- | --- | --- | --- |
| Proven Closer | 3.11 | 3.39 | .636 | .663 |
| Young’uns | 3.65 | 3.75 | .683 | .705 |

As you can see, both groups were better in save situations, but the improvement is larger for the proven closer group. This points slightly toward proven closers having an edge. But the difference is not as convincing as in the previous study once you take into account the smaller sample size that results from not repeating players. This makes an interesting counterargument to the previous point and definitely requires further testing.
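The deltas can be worked out directly from the table. The short sketch below uses only the four ERA and four OPS values reported above; the variable names are mine.

```python
# Values copied from the table above: (save situations, non-save situations).
results = {
    "Proven Closer": {"ERA": (3.11, 3.39), "OPS": (0.636, 0.663)},
    "Young'uns": {"ERA": (3.65, 3.75), "OPS": (0.683, 0.705)},
}

deltas = {}
for bucket, stats in results.items():
    # Negative delta = better (lower) numbers in save situations.
    deltas[bucket] = {
        stat: round(save - non_save, 3) for stat, (save, non_save) in stats.items()
    }
    print(bucket, deltas[bucket])

# Gap between the two buckets' deltas (proven minus young'uns).
for stat in ("ERA", "OPS"):
    gap = round(deltas["Proven Closer"][stat] - deltas["Young'uns"][stat], 3)
    print(f"{stat} gap: {gap}")
```

The proven group improves by 0.28 runs of ERA and 27 points of OPS in save situations, against 0.10 runs and 22 points for the young’uns, so the ERA gap between the groups is 0.18 runs while the OPS gap is only 5 points.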

The last statistical method I used brought predictive modeling into the equation. This could potentially add an extra layer of noise to the study, but I believe it’s worth it if you take that into account. My idea was to create a usable model that predicts a closer’s save situation ERA from a variety of inputs while excluding everything that has to do with experience in the role. That meant leaving out innings pitched, saves, and any other counting stat. From this starting point, I was able to build an OK but admittedly limited multiple linear regression model with various rate-based inputs ranging from FIP, to K%, to HR/9. My final model ended with a mean absolute error of 0.4 on the validation datasets, which means that on average it missed its target by 0.4 runs of ERA.
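As a rough illustration of this kind of model, the sketch below fits an ordinary least squares regression in plain Python and scores it with mean absolute error. The three inputs (FIP, K%, HR/9) come from the description above, but the helper names, the normal-equations solver, and every number in the sample data are assumptions made for the example, not the actual model or dataset.

```python
def fit_linear_regression(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) b = X^T y."""
    design = [[1.0] + list(row) for row in rows]  # leading 1.0 is the intercept
    k = len(design[0])
    # Build the augmented matrix [X^T X | X^T y].
    aug = [
        [sum(r[i] * r[j] for r in design) for j in range(k)]
        + [sum(r[i] * y for r, y in zip(design, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(k):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [a - factor * b for a, b in zip(aug[row], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


def predict(coefs, row):
    """Intercept plus the weighted sum of the inputs."""
    return coefs[0] + sum(c * x for c, x in zip(coefs[1:], row))


def mean_absolute_error(actual, predicted):
    """Average absolute miss, in the same units as the target (ERA here)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up (FIP, K%, HR/9) rows and save-situation ERAs, for illustration only.
train_x = [(2.8, 0.31, 0.7), (3.6, 0.24, 1.1), (4.1, 0.21, 1.4),
           (3.1, 0.28, 0.9), (3.9, 0.26, 1.0)]
train_y = [2.6, 3.5, 4.4, 3.0, 3.7]
valid_x = [(3.3, 0.27, 1.0), (4.0, 0.22, 1.3)]
valid_y = [3.2, 4.1]

coefs = fit_linear_regression(train_x, train_y)
mae = mean_absolute_error(valid_y, [predict(coefs, row) for row in valid_x])
print(f"validation MAE: {mae:.2f}")
```

Scoring on held-out rows rather than the training rows is what makes the error a fair estimate of how far off the model is on pitchers it has not seen.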

After this, I fed my proven and unproven data buckets into the model and evaluated the error metrics for each. If experience matters, the model should predict the proven players to be significantly worse than they actually were, and the unproven players to be significantly better. This would show that there was some hidden skill not accounted for in the model on top of just random noise. While we can’t say for sure what that skill would be, I attempted to design a model that would make experience the most likely missing piece.
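One simple way to check for that pattern is the mean signed error per bucket, sketched below. The helper name and the actual and predicted ERAs are invented for the example; with real data, a clearly negative value for the proven bucket alongside a positive one for the unproven bucket is the signature a hidden experience skill would leave.

```python
from statistics import mean


def mean_signed_error(actual, predicted):
    """Average of (actual - predicted); negative means pitchers beat the model."""
    return mean(a - p for a, p in zip(actual, predicted))


# Made-up actual vs. model-predicted save-situation ERAs (illustrative only).
buckets = {
    "proven": {"actual": [2.9, 3.4, 3.1], "predicted": [3.0, 3.2, 3.3]},
    "unproven": {"actual": [3.2, 2.7, 3.6], "predicted": [3.1, 3.0, 3.5]},
}

for name, data in buckets.items():
    err = mean_signed_error(data["actual"], data["predicted"])
    print(f"{name}: mean signed error = {err:+.2f} ERA")
```

Unlike mean absolute error, the signed version lets misses in opposite directions cancel, so it measures bias in one direction rather than overall accuracy.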

The results were the exact opposite. The model predicted the unproven guys to be worse than they actually were and the proven guys to be better. This easily could have been just variation, so I ran another Welch’s t-test to evaluate how significant the difference in error was. The test found essentially no support for the idea that the proven guys beat their predictions while the unproven guys fell short of theirs: it returned a p-value of .9997 for that hypothesis, which is nowhere near significant and reflects how strongly the errors lean the other way. Of course, this has to be taken with a grain of salt due to the previously mentioned issue of the added noise from my not-perfect predictive model. Nevertheless, this result is good evidence and backs the idea that experience in the 9th isn’t an important factor.

In conclusion, my quantitative studies, for the most part, back the industry-wide trend of no longer relying on the archetypal “proven closer”. An abundance of 9th inning experience does not seem to make any significant, quantifiable difference. While the second test conducted didn’t back this claim, the other two did. This makes sense. A good pitcher is a good pitcher. And while it is difficult to deal with stress when you haven’t before, these are professional athletes who play on the biggest stage in the world. They deal with immense stress every day. In order to get to this level of excellence in this failure-based game, they have to be mentally tough. No matter how many career saves you have, pitching the 9th inning of a close game with playoff implications is just another day at the office.