Introduction to Machine Learning with Python and scikit-learn

Jan 27, 2015

matt-hagy

PyATL talk about machine learning. Provides both an intro to machine learning and how to do it with Python. Includes simple examples with code and results.

  • 1. Introduction to Machine Learning with Python and scikit-learn
       Python Atlanta, Nov. 14th 2013
       Matt Hagy, [email protected]

  • 2. Machine Learning (ML):
       Finding patterns in data
       Modeling patterns
       Use models to make predictions
       Slide #2, Intro to Machine Learning with [email protected]

  • 3. ML can be easy*
       You already have ML applications!
       You can start applying ML methods now with Python & scikit-learn
       Theoretical knowledge of ML not needed (initially)*
       *Gaining more background, theory, and experience will help

  • 4. Simple Example

  • 5. Simple Model

  • 6. Code:

       import numpy as np
       from sklearn.linear_model import LinearRegression

       # Assumes data.npz stores the arrays under the keys 'x' and 'y'
       with np.load('data.npz') as data:
           x, y = data['x'], data['y']
       x_test = np.linspace(0, 200)

       model = LinearRegression()
       # scikit-learn expects 2-D inputs: one row per sample, one column per feature
       model.fit(x[:, np.newaxis], y)
       y_test = model.predict(x_test[:, np.newaxis])

  • 7. (no text)

  • 8. Variance/Bias Trade Off
       Need models that can adapt to relationships in our data
       Highly adaptable models can over-fit and will not generalize
       Regularization
       Common strategy to address the variance/bias trade off

  • 9. (no text)

  • 10. Code:

       import numpy as np
       from sklearn.svm import SVR
       from sklearn.pipeline import Pipeline
       from sklearn.preprocessing import StandardScaler

       # Assumes data.npz stores the arrays under the keys 'x' and 'y'
       with np.load('data.npz') as data:
           x, y = data['x'], data['y']
       x_test = np.linspace(0, 200)

       model = Pipeline([
           ('standardize', StandardScaler()),
           # C is the regularization term
           ('svr', SVR(kernel='rbf', verbose=0, C=5e6, epsilon=20)),
       ])
       model.fit(x[:, np.newaxis], y)
       y_test = model.predict(x_test[:, np.newaxis])

  • 11. Supervised Learning
       Modeling relationship between inputs and outputs
       [Table of example samples with columns: Sample; Input, X; Output, Y]

  • 12. Multiple Inputs
       [Table of example samples with columns: Sample; Input, X (X1, X2, X3, Xn); Output, Y]

  • 13. Example: Image Classification
       Classify handwritten digits with ML models
       Each input is an entire image
       Output is the digit in the image

  • 14. [Figure: handwritten digit images as Input, X, with Output, Y labels 9 and 2]

  • 15. Code:

       import numpy as np
       from sklearn.ensemble import RandomForestClassifier

       with np.load('train.npz') as data:
           pixels_train = data['pixels']
           labels_train = data['labels']
       with np.load('test.npz') as data:
           pixels_test = data['pixels']

       # Flatten each image into a 1-D feature vector
       X_train = pixels_train.reshape(pixels_train.shape[0], -1)
       X_test = pixels_test.reshape(pixels_test.shape[0], -1)

       model = RandomForestClassifier(n_estimators=50)
       model.fit(X_train, labels_train)
       labels_test = model.predict(X_test)

  • 16. Predicting the tags of Stack Overflow questions with machine learning
       Kaggle Data Science Competition
       Given 6 million training questions labeled with tags
       Predict the tags for 2 million unlabeled test questions
       www.users.globalnet.co.uk/~slocks/instructions.html
       stackoverflow.com/questions/895371/bubble-sort-homework

  • 17. Text Classification Overview
       [Diagram labels: Raw Posts; Feature Extraction & Selection; Vector Space; Model Selection & Training; Machine Learning]

  • 18. Term Frequency Feature Extraction
       Characterize text by the frequency of specific words in each text entry
       Example Title: Why is processing a sorted array faster than processing an array that is not sorted?
       Term Frequencies: why 1, processing 2, sorted 2, array 2, faster 1
       Ignore common words (i.e. stop words)

  • 19. Term frequencies for three titles:

                why  processing  sorted  array  faster  need  help  java  homework
       Title 1   1       2         2       2      1      0     0     0       0
       Title 2   0       0         0       0      0      1     1     1       1
       Title 3   0       0         1       1      0      0     1     0       1

       Frequency of key terms is anticipated to be correlated with the tags of the question

  • 20. Example Model Coefficients

  • 21. ML can be easy*
       You already have ML problems!
       You can start applying ML methods now with Python & scikit-learn
       Theoretical knowledge of ML not needed (initially)*
       scikit-learn.org
       github.com/scikit-learn

  • 22. Helping companies use their marketing data to delight customers
       Opportunities: Backend Engineers, Data Scientists, Full-Stack Engineers
       Tools: Java, Hadoop (Map/Reduce), Ruby
       Build and work with large distributed systems that process massive data sets.
       Check out: liveramp.com/careers
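
The term frequency feature extraction described in slides 18 and 19 can be sketched in plain Python. This is only a minimal illustration, not the code used in the talk: the `STOP_WORDS` set, the `term_frequencies` helper, and the second and third titles are hypothetical, with the two extra titles chosen so the resulting rows match the counts on slide 19. In practice scikit-learn's `CountVectorizer` performs this kind of counting and supports stop word removal.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration; real lists are much longer.
STOP_WORDS = {"is", "a", "an", "than", "this", "that", "not", "with", "my"}


def term_frequencies(text):
    """Count each non-stop word in the text, ignoring case and punctuation."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(token for token in tokens if token not in STOP_WORDS)


titles = [
    # Example title from slide 18
    "Why is processing a sorted array faster than processing an array that is not sorted?",
    # Hypothetical titles, made up to reproduce the counts on slide 19
    "Need help with Java homework",
    "Homework help: is my array sorted?",
]

# Key terms, one per column of the term-frequency table
vocabulary = ["why", "processing", "sorted", "array", "faster",
              "need", "help", "java", "homework"]

# One row per title, one column per key term (a Counter returns 0 for absent terms)
counts = [term_frequencies(title) for title in titles]
matrix = [[count[term] for term in vocabulary] for count in counts]

for row in matrix:
    print(row)
```

Each row is the vector-space representation of one title; rows like these are what a classifier would be trained on to predict tags.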