I am going to make a confession. I love combing through minor league data. It is fun to search the minor league leaderboards and try to find those few players that come out of nowhere to become big league regulars. Do not get me wrong, I enjoy watching the top prospects in the game like Bobby Witt Jr. and Julio Rodriguez, but there is a certain satisfaction you get from finding that unheralded prospect that has success at the major league level. It is akin to finding and supporting that local band before they made it big. It validates your opinions and shows that you have some decent taste.
But how do you go about finding these diamonds in the rough? To find a good local band, you can just go to a lot of local concerts. To find a future major leaguer using data alone, there are a multitude of factors to consider. What level have they had success at? How old are they? What position do they play? Are they in a hitter-friendly league? The options are endless, and that is just the first problem. The other problem is that there are so many minor league players to sift through. (Apparently MLB agrees, given that it keeps trying to reduce the number of professional teams.) That is why I decided to use cluster analysis to quickly group minor league players and see which groups are the most likely to succeed at the Major League level.
There may be a lot of minor league data, but minor league fielding data is still sparse. I would prefer to incorporate minor league defense into this project, but given the data restrictions, I decided to focus solely on a player’s offensive value.
For my data set, I gathered all affiliated minor league hitting data from 2006 to 2019 and then grouped each hitter’s minor league career totals across three levels of competition: Upper Minors (AAA & AA), A-Ball (A+ & A), and Complex-Ball (any level that is not full-season). I then removed any player with fewer than three hundred plate appearances in a grouped level. I split the data into three separate groups because the competition levels are drastically different. It would be imprudent to treat AAA statistics the same as rookie-ball statistics. I considered splitting by each individual level of competition, but this would create too many subsets, and very few players would accumulate three hundred or more plate appearances at every minor league level.
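The grouping and filtering step described above can be sketched roughly as follows. This is a minimal illustration and not the code used for the article: the level codes, the row layout, and the function names are assumptions, and the sample rows are invented.

```python
from collections import defaultdict

# Hypothetical level codes; the tier names follow the article.
LEVEL_TO_TIER = {
    "AAA": "Upper Minors", "AA": "Upper Minors",
    "A+": "A-Ball", "A": "A-Ball",
}
MIN_PA = 300  # plate-appearance cutoff described in the article


def tier_for(level):
    # Any level that is not full-season ball falls into Complex-Ball.
    return LEVEL_TO_TIER.get(level, "Complex-Ball")


def career_totals_by_tier(rows, min_pa=MIN_PA):
    """Sum each hitter's counting stats within a tier, then drop
    player/tier totals with fewer than min_pa plate appearances."""
    totals = defaultdict(lambda: defaultdict(int))
    for row in rows:
        key = (row["player"], tier_for(row["level"]))
        for stat, value in row.items():
            if stat not in ("player", "level"):
                totals[key][stat] += value
    return {key: dict(stats) for key, stats in totals.items()
            if stats["PA"] >= min_pa}


# Invented sample rows, one per player-season-level.
rows = [
    {"player": "Hitter A", "level": "AA", "PA": 180, "HR": 6},
    {"player": "Hitter A", "level": "AAA", "PA": 150, "HR": 5},
    {"player": "Hitter A", "level": "Rk", "PA": 120, "HR": 2},
    {"player": "Hitter B", "level": "A", "PA": 250, "HR": 4},
]
grouped = career_totals_by_tier(rows)
print(grouped)
```

In this toy input, only Hitter A's combined AA and AAA line clears the cutoff, which is the same reason a player can appear in one tier of the data set and be missing from another.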
There are many factors that go into being a successful hitter, but the two factors I chose to spotlight are a player’s power and his ability to make contact. I know plate discipline is important, but I prefer to limit a cluster analysis to two variables so that the results can be plotted on a simple two-dimensional chart. Besides, if you hit the ball hard enough and frequently enough, no one is going to mind if you are a free swinger. Just ask Vladimir Guerrero.
I used a player’s ISO (isolated power, which is slugging percentage minus batting average) to represent his power and his strikeout rate to represent his ability to make contact. Ideally, I would have used Statcast data to evaluate power and miss rate to evaluate contact ability. However, without access to minor league Statcast data, I had to settle for this method.
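For reference, both inputs can be computed from ordinary counting stats. The sketch below uses the standard definitions: ISO is slugging percentage minus batting average, which reduces to extra bases per at-bat, and strikeout rate is taken as strikeouts per plate appearance. The function names and sample numbers are made up for illustration.

```python
def isolated_power(doubles, triples, home_runs, at_bats):
    # ISO = SLG - AVG = (2B + 2*3B + 3*HR) / AB
    return (doubles + 2 * triples + 3 * home_runs) / at_bats


def strikeout_rate(strikeouts, plate_appearances):
    # K% = SO / PA
    return strikeouts / plate_appearances


# Invented career line for one hitter at one level group.
iso = isolated_power(doubles=30, triples=5, home_runs=20, at_bats=500)
k_rate = strikeout_rate(strikeouts=120, plate_appearances=600)
print(f"ISO: {iso:.3f}, K%: {k_rate:.1%}")
```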
I used a machine learning technique called k-means clustering to group the players by their ISO and strikeout rate. This helped me see which kinds of minor league hitters have the best chance of succeeding in the majors. I ended up with three distinct groups of hitters at each of my three levels of competition. Below are the clusters for the players in the upper minors.
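For readers unfamiliar with k-means, here is a minimal pure-Python sketch of the idea on (ISO, strikeout rate) pairs. It is not the implementation behind the article: the data points are invented, the features are left unscaled (the article does not say whether they were standardized), and the deterministic farthest-first initialization is a choice made here only so the example is reproducible. Most libraries use randomized starts instead.

```python
import math


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm: alternate between assigning each point to its
    nearest centroid and moving each centroid to the mean of its points."""
    # Farthest-first initialization (an assumption for reproducibility):
    # start from the first point, then repeatedly add the point that is
    # farthest from every centroid chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )

    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids


# Invented (ISO, K%) pairs: contact hitters, power hitters, and hitters
# with below average power and below average contact.
hitters = [
    (0.090, 0.12), (0.100, 0.13), (0.095, 0.14),
    (0.230, 0.22), (0.250, 0.24), (0.240, 0.20),
    (0.100, 0.29), (0.110, 0.30), (0.090, 0.31),
]
labels, centroids = kmeans(hitters, k=3)
print(labels)
print(centroids)
```

The cluster numbers that come out of k-means are arbitrary labels, which is why the same kind of hitter can end up with a different cluster number at each level of competition.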
I find this grouping interesting because of how the hitters separate: players with below average power who make above average contact (Cluster 1), players with above average power (Cluster 2), and players with below average power who make below average contact (Cluster 3).
With the players grouped into clusters, I decided to see how each cluster of minor leaguers performed at the Major League level. The first thing I looked for was players from each cluster that recorded three hundred or more plate appearances in a single season. This indicates that the player was receiving a reasonable amount of playing time and was a valued member of his team for at least one season.
Next, I looked at the maximum wRC+ that each minor league player recorded in a single Major League season with a minimum of three hundred plate appearances. I chose maximum wRC+ instead of career wRC+ because I wanted to focus solely on a player’s peak value and not include his decline phase. This means that some players will be over-valued on the strength of roughly half a season of performance, but it also avoids penalizing players who were legitimate MVP candidates for several seasons before a precipitous decline.
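The peak-value measure can be expressed in a few lines. This is an illustrative sketch only: the season records and the function name are hypothetical, and the wRC+ values are invented.

```python
def peak_wrc_plus(seasons, min_pa=300):
    """Return the best single-season wRC+ among seasons with at least
    min_pa plate appearances, or None if no season qualifies."""
    qualifying = [s["wRC+"] for s in seasons if s["PA"] >= min_pa]
    return max(qualifying) if qualifying else None


# Invented MLB seasons for one player. The 160 wRC+ season is ignored
# because it came in only 120 plate appearances.
regular = [
    {"PA": 120, "wRC+": 160},
    {"PA": 450, "wRC+": 112},
    {"PA": 610, "wRC+": 131},
    {"PA": 580, "wRC+": 95},
]
cup_of_coffee = [{"PA": 45, "wRC+": 80}]

print(peak_wrc_plus(regular))
print(peak_wrc_plus(cup_of_coffee))
```

A result of None here corresponds to a player who never had a three hundred plate appearance season in the majors.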
The first thing that caught my attention was that, across all clusters combined, only 20.8% of players went on to have at least one MLB season with three hundred or more plate appearances. The next thing was that the players in Cluster 2 were far more likely to perform in the majors. The data suggest that players who post a higher ISO in the upper minors are more likely to contribute at the MLB level.
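The per-cluster comparison boils down to the share of players in each cluster who logged at least one qualifying MLB season. A rough sketch, with a hypothetical record layout and invented players (the rates it prints are not the article's results):

```python
from collections import Counter


def success_rate_by_cluster(players):
    """Share of players in each cluster with at least one MLB season of
    three hundred or more plate appearances."""
    totals = Counter()
    qualified = Counter()
    for player in players:
        totals[player["cluster"]] += 1
        if player["qualified"]:
            qualified[player["cluster"]] += 1
    return {cluster: qualified[cluster] / totals[cluster] for cluster in totals}


# Invented players; "qualified" marks a 300+ PA MLB season.
players = [
    {"cluster": 1, "qualified": False},
    {"cluster": 1, "qualified": True},
    {"cluster": 1, "qualified": False},
    {"cluster": 1, "qualified": False},
    {"cluster": 2, "qualified": True},
    {"cluster": 2, "qualified": False},
]
rates = success_rate_by_cluster(players)
print(rates)
```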
In A-Ball, the clusters followed a similar pattern. Cluster 1 is composed of players with well above average power, Cluster 2 consists of players who make above average contact with below average power, and Cluster 3 is made up of players with below average contact ability. This time, though, that last cluster allows for a wider spread of ISO than its counterpart in the Upper Minors.
Fewer players from every cluster went on to play in MLB. This is unsurprising considering that many of these players did not make it to the upper levels of the minors. The cluster with above average power out-performed the other two clusters yet again.
For the third time, at the complex level, the clusters followed a similar pattern: Cluster 3 is players with above average power, Cluster 2 is players with below average contact ability and roughly average power, and Cluster 1 is players with below average power but above average contact ability. This makes it a bit easier to compare players across the various levels of competition.
The success rates are very low for each cluster, but this is partially due to selection bias. Few top prospects spend multiple seasons outside of full-season ball, so the most talented players do not accumulate the three hundred plate appearances required to be in my data set. Even with this wrinkle, the pattern remained the same, with the players with above average power having the most success in the majors. The odds of becoming an impactful major league hitter are small no matter which cluster the hitter belongs to. However, the minor league players with the greatest chance of success are the ones who post ISO marks well above league average.
2022 Top Prospects
With the historical minor leaguers assigned to their clusters, I turned to this season’s top prospects to see which clusters they belong to at each level of competition. Below is a table showing each position player prospect with a 50 or higher FV rating from FanGraphs.com along with his cluster number for each level of competition. Across all levels, I highlighted the cluster with the highest success rate in green, the cluster with the second highest success rate in yellow, and the cluster with the lowest success rate in red. I left blank any level at which the player did not accumulate three hundred plate appearances, and I removed the players who did not accumulate three hundred plate appearances at any level.
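The article does not spell out how current prospects were matched to the historical clusters. One plausible approach, sketched here purely as an assumption, is to assign each prospect to the nearest cluster center from the historical k-means fit and to leave the cell blank when he has fewer than three hundred plate appearances at that level. The centroid values and prospect lines below are invented.

```python
import math

# Hypothetical (ISO, K%) cluster centers for one level of competition,
# keyed by cluster number. These are NOT the article's fitted values.
UPPER_MINORS_CENTROIDS = {
    1: (0.095, 0.13),  # below average power, above average contact
    2: (0.240, 0.22),  # above average power
    3: (0.100, 0.30),  # below average power, below average contact
}


def assign_cluster(iso, k_rate, pa, centroids, min_pa=300):
    """Return the number of the nearest cluster center, or None (a blank
    table cell) if the prospect is under the plate-appearance cutoff."""
    if pa < min_pa:
        return None
    return min(centroids, key=lambda c: math.dist((iso, k_rate), centroids[c]))


# Invented prospect lines at this level.
print(assign_cluster(0.260, 0.25, 420, UPPER_MINORS_CENTROIDS))
print(assign_cluster(0.110, 0.15, 500, UPPER_MINORS_CENTROIDS))
print(assign_cluster(0.300, 0.20, 150, UPPER_MINORS_CENTROIDS))
```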
*Please keep in mind that cluster analysis is meant for grouping a set of objects and is not a projection system. Just because a player is in a certain group does not mean he is going to succeed or fail at the major league level. A projection system requires many more variables and rigorous analysis to forecast properly. Cluster analysis is a quick way to view trends and summaries, but it should not supersede a projection system.