No-Bullshit Data Science
Szilárd Pafka, PhD
Chief Scientist, Epoch
R/Finance Conference, Chicago, May 2017
Disclaimer:
I am not representing my employer (Epoch) in this talk
I can neither confirm nor deny whether Epoch uses any of the methods, tools, results, etc. mentioned in this talk
https://deads.gitbooks.io/paratext-bench/content/teaser.html June 2016
Plot: accuracy (y-axis) vs. data size (x-axis)
- linear tops off
- more data & better algo
- random forest on 1% of data beats linear on all data
Summary / Tips for analyzing “big” data:
- Get lots of RAM (physical/cloud)
- Use R/Python and high performance packages (e.g. data.table, xgboost)
- Do data reduction in database (analytical db/big data system)
- (Only) distribute embarrassingly parallel tasks (e.g. hyperparameter search for machine learning)
- Let engineers (store and) ETL the data (“scalable”)
- Use statistics/domain knowledge/thinking
- Use “big data tools” only if the above tips are not enough
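The "embarrassingly parallel" tip can be sketched as a small grid search in which every hyperparameter combination is scored independently. This is a minimal illustration, not anything shown in the talk: `validation_score` is a hypothetical stand-in for "train a model and return a validation metric", and the grid values are made up. Threads are used only to keep the sketch portable; for CPU-bound training, `ProcessPoolExecutor` from the same module can be passed as `executor_cls`.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def validation_score(params):
    """Hypothetical stand-in for 'train a model, return a validation score'.

    A real version would fit a model (e.g. with xgboost) using these settings.
    This toy surface peaks at max_depth=6, learning_rate=0.1.
    """
    depth, rate = params["max_depth"], params["learning_rate"]
    return -((depth - 6) ** 2) - 100 * (rate - 0.1) ** 2


def grid_search(score_fn, grid, executor_cls=ThreadPoolExecutor, max_workers=4):
    """Score every combination in `grid` independently and return the best."""
    names = sorted(grid)
    candidates = [dict(zip(names, values))
                  for values in product(*(grid[n] for n in names))]
    # Each candidate is scored in isolation: no shared state and no
    # communication between tasks, so the work splits cleanly across workers.
    with executor_cls(max_workers=max_workers) as pool:
        scores = list(pool.map(score_fn, candidates))
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]


# Illustrative grid (assumed values, not from the talk).
grid = {"max_depth": [2, 4, 6, 8], "learning_rate": [0.01, 0.1, 0.3]}
best_params, best_score = grid_search(validation_score, grid)
print(best_params, best_score)
```

Because the tasks never talk to each other, the same pattern scales from local cores to separate machines without changing the search logic, which is why it is the one kind of work worth distributing.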
“I usually use other people’s code [...] I can find open source code for what I want to do, and my time is much better spent doing research and feature engineering” -- Owen Zhang
http://blog.kaggle.com/2015/06/22/profiling-top-kagglers-owen-zhang-currently-1-in-the-world/
http://www.cs.cornell.edu/~alexn/papers/empirical.icml06.pdf
http://lowrank.net/nikos/pubs/empirical.pdf
- R packages 30%
- Python scikit-learn 40%
- Vowpal Wabbit 8%
- H2O 10%
- xgboost 8%
- Spark MLlib 6%
- a few others
n = 10K, 100K, 1M, 10M, 100M
- Training time
- RAM usage
- AUC
- CPU % by core
- read data, pre-process, score test data
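Two of the metrics listed above, training time and AUC, along with a rough memory figure, can be collected with the standard library alone. The sketch below is an assumption about how such a harness might look, not the benchmark code from the talk: `auc` uses the rank-sum formulation with average ranks for ties, and `timed` is a hypothetical helper. Note that `tracemalloc` only tracks Python-level allocations, so it understates the process RAM usage that a real benchmark would record.

```python
import time
import tracemalloc


def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Rank-sum (Mann-Whitney U) formulation; tied scores share an average rank.
    Labels are expected to be 0 or 1, with at least one of each.
    """
    order = sorted(range(len(scores)), key=scores.__getitem__)
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def timed(fn, *args):
    """Run fn(*args); return (result, wall-clock seconds, peak traced bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak


# Tiny illustrative inputs (made up for this sketch).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
result, seconds, peak_bytes = timed(auc, labels, scores)
print(result, seconds, peak_bytes)
```

In a benchmark of the kind outlined above, `timed` would wrap the training call for each tool at each value of n, with `auc` applied to the scored test data afterwards.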