Data Mining in Sports Analytics
Salford Systems
Dan Steinberg
Mikhail Golovnya
Data mining is the search for patterns in data using modern, highly automated, computer-intensive methods
◦ Data mining may be best defined as the use of a specific class of tools (data mining methods) in the analysis of data
◦ The term “search” is key to this definition, as is “automated”
The literature often refers to finding hidden information in data
Data Mining Defined
Data Mining
• Predictive Analytics
• Machine Learning
• Pattern Recognition
• Artificial Intelligence
• Business Intelligence
• Data Warehousing
Data Mining
• Statistics
• Computer science
• Insurance
• Finance
• Marketing
• Robotics
• Biotech
• Sports Analytics
Cont.
• OLAP
• CART
• SVM
• NN
• CRISP-DM
• CRM
• KDD
• Etc.
Uses of Data Mining
Data guides the analysis; it is the “Alpha and Omega” of everything you do
Analyst asks the right questions but makes no assumptions
The success of data mining depends solely on the quality of available data
◦ Famous “Garbage In – Garbage Out” principle
Long Live the King =Your Data=
(Insert visual aid)
In a nutshell: use historical data to gain insights and/or make predictions on new data
The Essence of Machine Learning
Any game is the ultimate and unambiguous source of quality data
◦ This is very different from the data availability and quality in other areas of research
However, there is no universal agreement on the best way of organizing and summarizing the results in a numeric form
◦ Large number of various game statistics available
◦ Common sense and game rules are at the core
◦ Heated debates on which stats best describe the potential for a future win
Data in Sports Analytics
(insert screenshot of Baseball-reference.com)
Available from many sources, including the Internet
Player level: summarize performance in a season, post season, and entire career
Team level: wins and losses
Game level: most detailed
Baseball Stats
(insert Sean Lahman website screenshot)
Widely known public database
Gathers baseball stats all the way back to 1871
Will use parts of it to illustrate the potential of data mining
Baseball Databases
Focusing on the 2010 regular season performance in both leagues
Have access to the player stats for the entire season organized in a flat table
Define overall player success simply as the player's team winning its division
◦ Thus, 6 of the 30 teams participating in 2010 are counted as successes
Question: Which of the player stats are associated with the team winning the division?
Typical DM Problem
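The success definition above can be sketched in a few lines of Python. This is only an illustration: the team codes are assumed to follow Lahman-style identifiers for the six 2010 division winners, the field names (`teamID`, `WIN_DIV`) are hypothetical, and the player rows are made up rather than taken from the real table.

```python
# Assumed Lahman-style codes for the six 2010 division winners
# (Rays, Twins, Rangers, Phillies, Reds, Giants); verify against your data.
DIVISION_WINNERS_2010 = {"TBA", "MIN", "TEX", "PHI", "CIN", "SFN"}


def label_records(records, winners=DIVISION_WINNERS_2010):
    """Return copies of the player records with a binary WIN_DIV target added."""
    labeled = []
    for rec in records:
        row = dict(rec)  # copy so the input rows are left untouched
        row["WIN_DIV"] = 1 if row["teamID"] in winners else 0
        labeled.append(row)
    return labeled


# Toy rows, invented for illustration only (not real 2010 stats).
toy_batters = [
    {"playerID": "batter_a", "teamID": "TEX", "AB": 500, "H": 150, "HR": 20},
    {"playerID": "batter_b", "teamID": "NYA", "AB": 450, "H": 120, "HR": 31},
    {"playerID": "batter_c", "teamID": "SFN", "AB": 300, "H": 80, "HR": 5},
]

labeled = label_records(toy_batters)
# Share of records that belong to a division-winning team.
success_rate = sum(r["WIN_DIV"] for r in labeled) / len(labeled)
```

Every player on a winning team gets the same label, so the target describes the team outcome, not individual merit; the model then looks for player stats that co-occur with that outcome.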
Core Stats
• AB - At Bats
• R - Runs
• H - Hits
• 2B - Doubles
• 3B - Triples
• HR - Home Runs
• RBI - Runs Batted In
• SB - Stolen Bases
• CS - Caught Stealing
• BB - Base on Balls
• SO - Strikeouts
• SF - Sacrifice Flies
• HBP - Hit by Pitch
Derived Stats
• AVG - Batting Average: H/AB
• TB - Total Bases: 1B + 2x2B + 3x3B + 4xHR, where 1B is singles (H - 2B - 3B - HR)
• SLG - Slugging: TB/AB
• OBP - On Base Percentage: (H+BB+HBP)/(AB+BB+SF+HBP)
• OPS - On Base Plus Slugging: OBP+SLG
• … - Many more exist
Batting Stats
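As a minimal sketch, the derived batting stats listed above can be computed from the core stats as follows. The function name and the dictionary layout are assumptions made for illustration, and the stat line is invented.

```python
def derived_batting(s):
    """Compute AVG, TB, SLG, OBP and OPS from a dict of core batting stats."""
    singles = s["H"] - s["2B"] - s["3B"] - s["HR"]
    tb = singles + 2 * s["2B"] + 3 * s["3B"] + 4 * s["HR"]
    ab = s["AB"]
    obp_denominator = s["AB"] + s["BB"] + s["SF"] + s["HBP"]
    # Guard against division by zero for players with no at bats.
    avg = s["H"] / ab if ab else 0.0
    slg = tb / ab if ab else 0.0
    obp = (s["H"] + s["BB"] + s["HBP"]) / obp_denominator if obp_denominator else 0.0
    return {"AVG": avg, "TB": tb, "SLG": slg, "OBP": obp, "OPS": obp + slg}


# Invented stat line, for illustration only.
toy_line = {"AB": 500, "H": 150, "2B": 30, "3B": 5, "HR": 20,
            "BB": 50, "SF": 5, "HBP": 5}
batting = derived_batting(toy_line)
```

Because every derived stat is a fixed function of the core stats, a flexible model fed only the core stats can in principle recover the same information on its own.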
(insert scatter matrix)
This is how the problem is usually attacked
Each dot represents a single batter record for the whole 2010 season
1245 overall records
16 core stats
Winning team batters are marked in red
No obvious insights!
Conventional Statistical Approaches
Leo Breiman, Jerome Friedman, Richard Olshen and Charles Stone
Starting with CART in 1984, laid the foundation for tree-based modeling techniques
Conduct deep look into all available data
Point out most relevant variables and features
Automatically identify optimal transformations
Capable of extracting complex patterns going way beyond the traditional “single performance at a time” approach
Unique Personalities- the “Founding Fathers” of Trees
(insert graph)
6 core batter stats were identified as most predictive
About 20% of total variation can be directly associated with the batter stats
The single-variable plots show the non-linear nature of many of the relationships
Fine plot irregularities should be ignored
Striking result: HR above 30 is associated with losing the division
Proceed by digging into pair-wise contribution plots
TreeNet Model on Core Stats
(insert images)
The colored area within each plot shows pairs that actually occur in the data
Areas associated with contribution towards team win are marked in red
Contributions towards team defeat are marked in blue
Pair-Wise Contributions
(insert graphs)
These two plots further highlight the rather unusual HR finding
It is a well-known fact that batters aiming at a home run have a higher number of strikeouts
However, in the 2010 regular season the HR-centered approach led to a defeat!
Surprise: 2010 HR Leads to Division Loss!
(insert graph)
This plot represents two performance stats plotted against each other taken “as is” from the original data table
Note the difficulty of discerning the identified HR x SO pattern visually because of “shadow” projections
Compare with Conventional Plot
(Insert graphs)
Adding Derived Stats
(Insert screen shot of Baseball-reference.com and standard pitching chart)
Similar to batting stats
Large number of derived stats exists
Pitching Stats
Core Stats
• W - Wins
• L - Losses
• H - Hits Allowed
• BFP - Batters Faced
• R - Runs Allowed
• HR - Home Runs Allowed
• WP - Wild Pitches
• IPOUTS - Outs Pitched
• SHO - Shutouts
• BB - Base on Balls
• SO - Strikeouts
• ER - Earned Runs
• HBP - Batters Hit by Pitch
Derived Stats
• ERA - Earned Run Average: 9xER/InningsPitched
• DICE - Defense Independent Component: 3.0+(13HR+3(BB+HBP)-2SO)/IP
• FIP - Fielding Independent Pitching: 3.1+(13HR+3BB-2SO)/IP
• dERA - Defense Independent ERA: 10-line algorithm
• CERA - Component ERA: long convoluted equation
• … - Many more exist
Pitching Stats
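The three closed-form pitching stats above can be sketched in the same way. The sketch assumes innings pitched is IPOUTS/3 and uses the additive constants exactly as they appear on the slide (3.0 for DICE, 3.1 for FIP); in practice the FIP constant is usually recomputed per season. The function name and the stat line are hypothetical.

```python
def derived_pitching(s):
    """Compute ERA, DICE and FIP from a dict of core pitching stats."""
    ip = s["IPOUTS"] / 3  # innings pitched: three outs per inning
    if ip == 0:
        return {"IP": 0.0, "ERA": None, "DICE": None, "FIP": None}
    era = 9 * s["ER"] / ip
    dice = 3.0 + (13 * s["HR"] + 3 * (s["BB"] + s["HBP"]) - 2 * s["SO"]) / ip
    fip = 3.1 + (13 * s["HR"] + 3 * s["BB"] - 2 * s["SO"]) / ip
    return {"IP": ip, "ERA": era, "DICE": dice, "FIP": fip}


# Invented stat line, for illustration only.
toy_pitcher = {"IPOUTS": 600, "ER": 80, "HR": 20, "BB": 50, "HBP": 5, "SO": 180}
pitching = derived_pitching(toy_pitcher)
```

DICE and FIP differ only in the constant and in whether hit batters count alongside walks, which is why the two usually track each other closely.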
(Insert charts)
Started by feeding in the complete set of 26 available pitching stats for the 2010 season
Using top variable elimination followed by bottom variable elimination, reduced the list to only 8 important stats
Modeling Steps
(insert graphs)
One-Variable Contributions
(insert graphs)
Keep the strikeouts high and the base on balls low to win the division!
Two-Variable Contributions
(insert graphs)
Remember that these are pitchers not batters
More wild pitches, more home runs allowed, more strikeouts => the division is won!
More Surprises Here!
(insert graph)
The conventional plot IGNORES the other dimensions, which effectively project on top of each other
As a result, there is a lot of confusion on the plot, making it difficult to see any pattern
In contrast, the TN dependence plot shows the contribution of the given pair AFTER the influence of the other dimensions has been eliminated
Compare with Conventional Plot
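The idea of removing the influence of the other dimensions can be illustrated with a generic partial dependence calculation: force one stat to a fixed value for every record, average the model's predictions, and repeat over a grid of values. This is a sketch of the general technique, not necessarily the exact computation TreeNet performs; `toy_model` is a made-up stand-in for a fitted model, and the rows are invented.

```python
def partial_dependence(model, rows, feature, grid):
    """Average prediction over all rows with `feature` forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)       # keep the other stats as observed
            modified[feature] = value  # override only the stat of interest
            total += model(modified)
        curve.append((value, total / len(rows)))
    return curve


def toy_model(row):
    """Hypothetical fitted model: rewards strikeouts, penalizes walks."""
    return 0.01 * row["SO"] - 0.02 * row["BB"]


# Invented pitcher rows, for illustration only.
toy_rows = [{"SO": 100, "BB": 40}, {"SO": 200, "BB": 60}]
so_curve = partial_dependence(toy_model, toy_rows, "SO", [100, 200])
```

A two-variable version fixes a pair of stats at once and averages over the rest, which is what separates it from a raw scatter plot where all remaining dimensions are projected onto the same plane.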
(insert graphs)
These plots represent the results of running conventional linear regression (LR) on the pitching data
While the anomalous HR effect is present, the model fails to identify the fine local nature of the phenomenon
LR does not provide enough “resolution”
Compare with Conventional Regression
It appears that in the 2010 regular season a home-run-driven strategy did not work!
At least, this is what the data tells us; further understanding will require experts in the field
Core stats have good explanatory potential once put into a true multivariate modeling framework
Conventional statistical approaches do not have enough “resolution” to see the real details
Modern data mining helps identify realized patterns and allows a quick and efficient check of the usefulness of the various performance measures available to a manager or researcher
What Have We Learned
NEVER FALL FOR THESE
Absolute Powers- data mining will finally find and explain everything
Gold Rush- with the right tool one can beat the stock market or predict the World Series winner to become obscenely rich
Quest for the Holy Grail- search for an algorithm that will always produce 100% accurate models
Magic Wand- getting a complete solution from start to finish with a single button push
Data Mining Mythology
The End