Data Mining in Sports Analytics Salford Systems Dan Steinberg Mikhail Golovnya
Transcript
Page 1: Data mining for baseball new ppt

Data Mining in Sports Analytics

Salford Systems

Dan Steinberg

Mikhail Golovnya

Page 2: Data mining for baseball new ppt

Data mining is the search for patterns in data using modern, highly automated, computer-intensive methods

◦ Data mining may be best defined as the use of a specific class of tools (data mining methods) in the analysis of data

◦ The term “search” is key to this definition, as is “automated”

The literature often refers to finding hidden information in data

Data Mining Defined

Page 3: Data mining for baseball new ppt

Data Mining

• Predictive Analytics
• Machine Learning
• Pattern Recognition
• Artificial Intelligence
• Business Intelligence
• Data Warehousing

Data Mining

• Statistics
• Computer science
• Insurance
• Finance
• Marketing
• Robotics
• Biotech
• Sports Analytics

Cont.
• OLAP
• CART
• SVM
• NN
• CRISP-DM
• CRM
• KDD
• Etc.

Uses of Data Mining

Page 4: Data mining for baseball new ppt

Data guides the analysis; it is the “Alpha and Omega” of everything you do

Analyst asks the right questions but makes no assumptions

The success of data mining depends solely on the quality of available data

◦ Famous “Garbage In – Garbage Out” principle

Long Live the King =Your Data=

Page 5: Data mining for baseball new ppt

(Insert visual aid)

In a nutshell: use historical data to gain insights and/or predictions on new data

The Essence of Machine Learning

Page 6: Data mining for baseball new ppt

Any game is the ultimate and unambiguous source of quality data

◦ This is very different from the data availability and quality in other areas of research

However, there is no universal agreement on the best way of organizing and summarizing the results in a numeric form

◦ Large number of various game statistics available

◦ Common sense and game rules are at the core

◦ Heated debates on which stats best describe the potential for a future win

Data in Sports Analytics

Page 7: Data mining for baseball new ppt

(insert screenshot of Baseball-reference.com)

Available from many sources, including the Internet

Player level: summarize performance in a season, post season, and entire career

Team level: wins and losses

Game level: most detailed

Baseball Stats

Page 8: Data mining for baseball new ppt

(insert Sean Lahman website screenshot)

Widely known public database

Gathers baseball stats all the way back to 1871

Will use parts of it to illustrate the potential of data mining

Baseball Databases

Page 9: Data mining for baseball new ppt

Focusing on the 2010 regular season performance in both leagues

Have access to the player stats for the entire season organized in a flat table

Define a measure of overall player success simply as the player's team winning its division

◦ Thus 6 out of 30 participating teams in 2010 are declared successes

Question: Which of the player stats are associated with the team winning the division?

Typical DM Problem
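The target definition on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the field names (`player`, `team`, `HR`), the team codes, and the winner set are hypothetical placeholders, not values from the actual 2010 table.

```python
# Hypothetical flat table: one record per player for the season.
# Field names, team codes, and values are illustrative only.
records = [
    {"player": "A", "team": "T1", "HR": 12},
    {"player": "B", "team": "T2", "HR": 31},
    {"player": "C", "team": "T1", "HR": 4},
]

# Assumed set of division-winning teams (6 of 30 in the setup described above).
division_winners = {"T1"}


def label_success(rows, winners):
    """Return copies of the rows with a binary target:
    1 if the player's team won its division, 0 otherwise."""
    return [dict(row, success=int(row["team"] in winners)) for row in rows]


labeled = label_success(records, division_winners)
```

Every player on a winning team receives the same label, so the target describes the team outcome rather than individual merit.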

Page 10: Data mining for baseball new ppt

Core Stats
• AB - At Bats
• R - Runs
• H - Hits
• 2B - Doubles
• 3B - Triples
• HR - Home Runs
• RBI - Runs Batted In
• SB - Stolen Bases
• CS - Caught Stealing
• BB - Base on Balls
• SO - Strikeouts
• SF - Sacrifice Flies
• HBP - Hit by Pitch

Derived Stats

• AVG - Batting Average: H/AB
• TB - Total Bases: 1B + 2×2B + 3×3B + 4×HR
• SLG - Slugging: TB/AB
• OBP - On Base Percentage: (H+BB+HBP)/(AB+BB+SF+HBP)
• OPS - On Base Plus Slugging: OBP+SLG
• … - Many more exist

Batting Stats
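The derived batting formulas above translate directly into code. The sketch below is a minimal illustration, not part of the original analysis. The function name and argument names are made up here, and singles (1B) are assumed to be computed as H − 2B − 3B − HR, since singles are not listed among the core stats.

```python
def batting_derived(AB, H, doubles, triples, HR, BB, HBP, SF):
    """Compute AVG, TB, SLG, OBP and OPS from core batting stats.

    Assumption: singles are not stored, so they are derived as
    hits minus extra-base hits.
    """
    singles = H - doubles - triples - HR
    TB = singles + 2 * doubles + 3 * triples + 4 * HR

    # Guard against division by zero for players with no at bats.
    AVG = H / AB if AB else 0.0
    SLG = TB / AB if AB else 0.0

    obp_denominator = AB + BB + SF + HBP
    OBP = (H + BB + HBP) / obp_denominator if obp_denominator else 0.0

    return {"AVG": AVG, "TB": TB, "SLG": SLG, "OBP": OBP, "OPS": OBP + SLG}


# Illustrative (made-up) season line.
stats = batting_derived(AB=500, H=150, doubles=30, triples=5, HR=20,
                        BB=50, HBP=5, SF=5)
```

Because every derived stat is a deterministic function of the core stats, a flexible model trained on the core stats alone can in principle recover the same information.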

Page 11: Data mining for baseball new ppt

(insert scatter matrix)

This is how the problem is usually attacked

Each dot represents a single batter record for the whole 2010 season

1245 overall records

16 core stats

Winning team batters are marked in red

No obvious insights!

Conventional Statistical Approaches

Page 12: Data mining for baseball new ppt

Leo Breiman, Jerome Friedman, Richard Olshen and Charles Stone

Starting with CART in 1984, laid the foundation for tree-based modeling techniques

Conduct deep look into all available data

Point out most relevant variables and features

Automatically identify optimal transformations

Capable of extracting complex patterns going way beyond the traditional “single performance at a time” approach

Unique Personalities- the “Founding Fathers” of Trees

Page 13: Data mining for baseball new ppt

(insert graph)

6 core batter stats were identified as most predictive

About 20% of total variation can be directly associated with the batter stats

The single plots show non-linear nature of many of the relationships

Fine plot irregularities should be ignored

Striking result: HR above 30 is associated with losing the division

Proceed by digging into pair-wise contribution plots

TreeNet Model on Core Stats

Page 14: Data mining for baseball new ppt

(insert images)

The colored area within each plot shows pairs that actually occur in the data

Areas associated with contribution towards team win are marked in red

Contributions towards team defeat are marked in blue

Pair-Wise Contributions

Page 15: Data mining for baseball new ppt

(insert graphs)

These two plots further highlight the rather unusual HR finding

It is a well-known fact that batters aiming at a home run have a higher number of strikeouts

However, in the 2010 regular season the HR-centered approach led to defeat!

Surprise: 2010 HR Leads to Division Loss!

Page 16: Data mining for baseball new ppt

(insert graph)

This plot represents two performance stats plotted against each other taken “as is” from the original data table

Note the difficulty of discerning the identified HR × SO pattern visually because of “shadow” projections

Compare with Conventional Plot

Page 17: Data mining for baseball new ppt

(Insert graphs)

Adding Derived Stats

Page 18: Data mining for baseball new ppt

(Insert screen shot of Baseball-reference.com and standard pitching chart)

Similar to batting stats

Large number of derived stats exists

Pitching Stats

Page 19: Data mining for baseball new ppt

Core Stats
• W - Wins
• L - Losses
• H - Hits Allowed
• BFP - Batters Faced
• R - Runs Allowed
• HR - Home Runs Allowed
• WP - Wild Pitches
• IPOUTS - Outs Pitched
• SHO - Shutouts
• BB - Base on Balls
• SO - Strikeouts
• ER - Earned Runs
• HBP - Batters Hit by Pitch

Derived Stats

• ERA - Earned Run Average: 9×ER/IP
• DICE - Defense Independent Component ERA: 3.0 + (13×HR + 3×(BB+HBP) − 2×SO)/IP
• FIP - Fielding Independent Pitching: 3.1 + (13×HR + 3×BB − 2×SO)/IP
• dERA - Defense Independent ERA: 10-line algorithm
• CERA - Component ERA: long, convoluted equation
• … - Many more exist

Pitching Stats
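The three closed-form pitching formulas above can be sketched as follows. The function and argument names are illustrative, the constants 3.0 and 3.1 are taken from the slide as given, and innings pitched is assumed to be IPOUTS divided by 3 (three outs per inning).

```python
def pitching_derived(ER, HR, BB, HBP, SO, IPOUTS):
    """Compute ERA, DICE and FIP from core pitching stats.

    Assumption: innings pitched (IP) = IPOUTS / 3.
    Returns None for a pitcher who recorded no outs, since all three
    stats divide by innings pitched.
    """
    if IPOUTS == 0:
        return None
    IP = IPOUTS / 3

    ERA = 9 * ER / IP
    DICE = 3.0 + (13 * HR + 3 * (BB + HBP) - 2 * SO) / IP
    FIP = 3.1 + (13 * HR + 3 * BB - 2 * SO) / IP
    return {"ERA": ERA, "DICE": DICE, "FIP": FIP}


# Illustrative (made-up) season line: 200 innings pitched.
line = pitching_derived(ER=70, HR=20, BB=50, HBP=5, SO=180, IPOUTS=600)
```

Note that in these formulas strikeouts lower the value while walks and home runs allowed raise it.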

Page 20: Data mining for baseball new ppt

(Insert charts)

Started by feeding in the complete set of 26 available pitching stats for 2010 season performance

Using top variable elimination followed by bottom variable elimination, reduced the list to only 8 important stats

Modeling Steps

Page 21: Data mining for baseball new ppt

(insert graphs)

One-Variable Contributions

Page 22: Data mining for baseball new ppt

(insert graphs)

Keep the strikeouts high and the base on balls low to win the division!

Two-Variable Contributions

Page 23: Data mining for baseball new ppt

(insert graphs)

Remember that these are pitchers not batters

More wild pitches, more home runs allowed, more strikeouts => the division is won!

More Surprises Here!

Page 24: Data mining for baseball new ppt

(insert graph)

Conventional plot IGNORES other dimensions which effectively project on top of each other

As a result, there is a lot of confusion on the plot, making it difficult to see any pattern

In contrast, the TreeNet (TN) dependence plot shows the contribution of the given pair AFTER the influence of the other dimensions has been eliminated

Compare with Conventional Plot
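The contrast between a conventional plot and a dependence plot can be illustrated with a small sketch. The slides do not say how TreeNet computes its plots, so this assumes the standard partial dependence idea: fix one variable at a grid value, leave the other variables as observed, and average the model output. The toy model, its coefficients, and the data are entirely hypothetical and are not results from the 2010 analysis.

```python
from statistics import mean


def partial_dependence(model, rows, feature, grid):
    """For each grid value v, set `feature` to v in every row
    and average the model output over all rows."""
    curve = []
    for v in grid:
        preds = [model(dict(row, **{feature: v})) for row in rows]
        curve.append(mean(preds))
    return curve


# Hypothetical stand-in for a fitted model: HR lowers the score, SO raises it.
def toy_model(row):
    return -1.0 * row["HR"] + 2.0 * row["SO"]


# Toy rows in which HR and SO rise together.
rows = [{"HR": h, "SO": h} for h in (0, 1, 2, 3)]

# "Conventional" view: score against HR, with SO silently changing as well.
raw = [toy_model(r) for r in rows]

# Dependence view: effect of HR with the SO distribution held as observed.
pd_hr = partial_dependence(toy_model, rows, "HR", [0, 1, 2, 3])
```

In this toy setup the raw scores rise with HR only because SO rises alongside it, while the averaged curve falls with HR. That is the kind of "shadow" projection effect a conventional plot cannot separate out.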

Page 25: Data mining for baseball new ppt

(insert graphs)

These plots represent the results of running conventional linear regression (LR) on the pitching data

While the anomalous HR effect is present, the model fails to identify the fine local nature of the phenomenon

LR does not provide enough “resolution”

Compare with Conventional Regression

Page 26: Data mining for baseball new ppt

It appears that in the 2010 regular season a home-run-driven strategy did not work!

At least, this is what the data tells us; further understanding will require experts in the field

Core stats have good explanatory potential once put into a true multivariate modeling framework

Conventional statistical approaches do not have enough “resolution” to see the real details

Modern data mining helps identify realized patterns and allows a quick and efficient check of the usefulness of the various performance measures available to a manager or researcher

What Have We Learned

Page 27: Data mining for baseball new ppt

NEVER FALL FOR THESE

Absolute Powers- data mining will finally find and explain everything

Gold Rush- with the right tool one can beat the stock market or predict the World Series winner and become obscenely rich

Quest for the Holy Grail- search for an algorithm that will always produce 100% accurate models

Magic Wand- getting a complete solution from start to finish with a single button push

Data Mining Mythology

Page 28: Data mining for baseball new ppt

The End