The Future Of Kaggle
Where we came from and where we’re going
kaggle.com/benhamner
@benhamner
Our mission is to help the world learn from data
We got started running supervised learning competitions
We’re now doing this at scale
Since 2010, we’ve run
● 240 general competitions
● 1,610 university classroom competitions
This has attracted a talented and diverse community
We’ve taught machine learning to hundreds of thousands of people
We’ve pushed the state of the art forward
We’ve learned a tremendous amount along the way
● What techniques work well
● How people win competitions
● Why our community participates
● What major pain points data scientists hit
● How we can help data scientists ameliorate these pain points
Great data scientists optimize the entire ML workflow
GBMs and deep neural networks are incredibly effective
Model ensembling almost always ekes out gains
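A minimal sketch of why averaging can help, assuming two models whose errors point in different directions. The targets and predictions below are made-up toy values chosen for illustration, not competition results, and `average_ensemble` is a hypothetical helper name. Real gains are usually much smaller because model errors are partly correlated.

```python
from statistics import mean


def rmse(preds, targets):
    """Root mean squared error between two equal-length sequences."""
    return mean((p - t) ** 2 for p, t in zip(preds, targets)) ** 0.5


def average_ensemble(*model_preds):
    """Blend models by taking the unweighted mean of their predictions."""
    return [mean(row) for row in zip(*model_preds)]


# Toy values (assumed for illustration): the two models miss in opposite directions.
targets = [3.0, 5.0, 7.0, 9.0]
model_a = [3.5, 4.4, 7.6, 8.5]
model_b = [2.6, 5.5, 6.5, 9.4]

blend = average_ensemble(model_a, model_b)

print("model A RMSE:", rmse(model_a, targets))
print("model B RMSE:", rmse(model_b, targets))
print("blend RMSE:  ", rmse(blend, targets))
```

An unweighted mean is the simplest blend. Weighted averages or stacking follow the same idea but need held-out predictions to fit the weights.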
Successful participants avoid overfitting
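One common guard against overfitting is k-fold cross-validation: score each model on data it was not trained on, rather than trusting a single leaderboard number. The sketch below only builds the index splits. `k_fold_splits` is a hypothetical helper, and shuffling and stratification are left out for brevity.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_idx, valid_idx) pairs; each sample is used for validation once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, valid_idx
        start += size


for train_idx, valid_idx in k_fold_splits(10, 3):
    print(len(train_idx), "train rows,", len(valid_idx), "validation rows")
```

Averaging the validation score across folds gives a steadier estimate than any single split.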
We’ve seen major pain points
Today’s practices are like programming in assembly
Compared with software engineering tools, ML tools feel like they came from the Stone Age
Accessing data is tough
Getting high-quality data is even tougher
Cleaning data is painful
Essay: “This essay got good marks, but as far as I can tell, it's gibberish.”
Human Scores: 5/5, 4/5
Data leakage is common and subtle
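One subtle form of leakage, sketched under assumptions: preprocessing statistics computed over train and test together let information about the test set seep into training. The values and the `fit_scaler` / `apply_scaler` helpers below are hypothetical and only illustrate the pattern.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn centering and scaling statistics from the given values only."""
    return mean(values), pstdev(values)


def apply_scaler(values, scaler):
    """Standardize values with previously learned statistics."""
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]


# Toy values (assumed for illustration).
train = [1.0, 2.0, 3.0, 4.0]
test = [10.0, 12.0]

# Leaky: the statistics have already "seen" the test rows.
leaky_scaler = fit_scaler(train + test)

# Safer: fit on the training rows only, then reuse those statistics on test.
safe_scaler = fit_scaler(train)
train_scaled = apply_scaler(train, safe_scaler)
test_scaled = apply_scaler(test, safe_scaler)

print("leaky mean:", leaky_scaler[0], "train-only mean:", safe_scaler[0])
```

The same rule applies to any fitted step, such as imputation, feature selection, or target encoding: fit on training data, then apply to everything else.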
Going from research to production can be brutal
Reproducing work takes days to months
We can do better than this
Accessing data should be seamless
You should never need to repeat work others have done
A single command should reproduce everything from start to finish
> make all
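A rough sketch of the same idea as a single Python entry point, assuming a tiny four-step workflow. Every step name and the in-memory "dataset" are hypothetical placeholders; a real project would read files and could wire these steps to a `make all` target.

```python
def load_data():
    # Hypothetical stand-in for downloading or reading the raw data.
    return [1, 2, None, 3, 4]


def clean(rows):
    # Drop missing values so later steps can assume complete rows.
    return [r for r in rows if r is not None]


def train(rows):
    # Placeholder "model": predict the mean of the training values.
    return sum(rows) / len(rows)


def evaluate(model, rows):
    # Largest absolute error of the constant prediction.
    return max(abs(r - model) for r in rows)


def run_all():
    """Run every step in order so one call rebuilds the full result."""
    rows = clean(load_data())
    model = train(rows)
    return {"model": model, "max_abs_error": evaluate(model, rows)}


if __name__ == "__main__":
    print(run_all())
```

Keeping each step a plain function with explicit inputs and outputs is what makes the one-command rerun possible.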
Making a successful one-line update should take seconds
Helpful metadata shouldn’t stay buried in minds or emails
Best practices should be easy defaults, not complicated custom contraptions
We’re changing this
We’ve launched two new products: Kernels and Datasets
We recently joined Google Cloud to accelerate our growth
Datasets, Kernels, and Competitions have an exciting future
The world’s data will be accessible with a common interface
That captures the important code and metadata on top of it
A central searchable hub for your organization’s data
A kernel is an atom of reproducible data science
Kernels will be your continuous integration server for data
We’ve started running code competitions
● Backtested time series
● Live data feeds
● Reinforcement learning
● Generative modeling
● Adversarial learning
● Machine learning under computational constraints
● Sensitive datasets