Introduction to Machine Learning with Python and scikit-learn. Python Atlanta, Nov. 14th 2013. Matt Hagy [email protected]

Introduction to Machine Learning with Python and scikit-learn

Jan 27, 2015

Matt Hagy

PyATL talk about machine learning. Provides both an intro to machine learning and how to do it with Python. Includes simple examples with code and results.
Transcript
Page 1: Introduction to Machine Learning with Python and scikit-learn

Introduction to Machine Learning with Python and scikit-learn

Python Atlanta, Nov. 14th 2013

Matt Hagy [email protected]

Page 2: Introduction to Machine Learning with Python and scikit-learn

Slide #2 Intro to Machine Learning with Python [email protected]

Machine Learning (ML):

• Finding patterns in data

• Modeling patterns

• Using models to make predictions

Page 3: Introduction to Machine Learning with Python and scikit-learn

ML can be easy*

• You already have ML applications!

• You can start applying ML methods now with Python & scikit-learn

• Theoretical knowledge of ML not needed (initially)*

*Gaining more background, theory, and experience will help

Page 4: Introduction to Machine Learning with Python and scikit-learn

Simple Example

Page 5: Introduction to Machine Learning with Python and scikit-learn

Simple Model

Page 6: Introduction to Machine Learning with Python and scikit-learn

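import numpy as np
from sklearn.linear_model import LinearRegression

# Load the training data; assumes the arrays were saved under the keys 'x' and 'y'
with np.load('data.npz') as data:
    x, y = data['x'], data['y']
x_test = np.linspace(0, 200)

# scikit-learn expects a 2-D input of shape (n_samples, n_features)
model = LinearRegression()
model.fit(x[:, np.newaxis], y)
y_test = model.predict(x_test[:, np.newaxis])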
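
Since data.npz is not included with this transcript, the same workflow can be sketched with synthetic data. This is only an illustration: the slope, intercept, and noise level below are assumed values, not the data from the talk.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for data.npz (assumed values, not the talk's data)
rng = np.random.default_rng(0)
x = np.linspace(0, 200, 100)
y = 2.0 * x + 10.0 + rng.normal(scale=5.0, size=x.size)

# Fit on a 2-D input of shape (n_samples, 1)
model = LinearRegression()
model.fit(x[:, np.newaxis], y)

# Predict on an evenly spaced grid, as in the slide
x_test = np.linspace(0, 200)
y_test = model.predict(x_test[:, np.newaxis])

print("slope:", model.coef_[0], "intercept:", model.intercept_)
```

The fitted slope should land close to the assumed value of 2.0, which is a quick sanity check that the model recovered the underlying pattern.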

Page 7: Introduction to Machine Learning with Python and scikit-learn

Page 8: Introduction to Machine Learning with Python and scikit-learn

Variance/Bias Trade Off

• Need models that can adapt to relationships in our data

• Highly adaptable models can over-fit and will not generalize

• Regularization – Common strategy to address variance/bias trade off
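
One way to see regularization at work is a small NumPy sketch, assuming ridge (L2) regularization on a degree-9 polynomial. The helper fit_ridge_polynomial, the synthetic sine data, and the two alpha values are all hypothetical choices for illustration; the talk does not say which regularizer it had in mind here.

```python
import numpy as np

def fit_ridge_polynomial(x, y, degree, alpha):
    """Fit a polynomial by L2-regularized least squares (closed form).

    For simplicity the penalty also applies to the constant term.
    """
    X = np.vander(x, degree + 1, increasing=True)  # columns: 1, x, x^2, ...
    penalty = alpha * np.eye(degree + 1)
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

# Small noisy data set (assumed for illustration)
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20)
y = np.sin(3 * x) + rng.normal(scale=0.2, size=x.size)

# Same highly adaptable model, with weak and strong regularization
weak = fit_ridge_polynomial(x, y, degree=9, alpha=1e-6)
strong = fit_ridge_polynomial(x, y, degree=9, alpha=1.0)

print("coefficient norm, weak penalty:  ", np.linalg.norm(weak))
print("coefficient norm, strong penalty:", np.linalg.norm(strong))
```

A larger penalty shrinks the coefficients, which limits how closely the polynomial can chase the noise (lower variance) at the cost of a less flexible fit (higher bias).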

Page 9: Introduction to Machine Learning with Python and scikit-learn

Page 10: Introduction to Machine Learning with Python and scikit-learn

import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumes the arrays were saved under the keys 'x' and 'y'
with np.load('data.npz') as data:
    x, y = data['x'], data['y']
x_test = np.linspace(0, 200)

model = Pipeline([
    ('standardize', StandardScaler()),
    # C is the regularization term: a smaller C regularizes more strongly
    ('svr', SVR(kernel='rbf', verbose=0, C=5e6, epsilon=20)),
])
model.fit(x[:, np.newaxis], y)
y_test = model.predict(x_test[:, np.newaxis])

Page 11: Introduction to Machine Learning with Python and scikit-learn

Supervised Learning

Modeling the relationship between inputs and outputs

[Figure: table of samples, one per row, with an Input (X) column and an Output (Y) column]

Page 12: Introduction to Machine Learning with Python and scikit-learn

Multiple Inputs

[Figure: table of samples, one per row, with input columns X1, X2, X3 through Xn and an Output (Y) column]
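
In scikit-learn terms, multiple inputs become the columns of a 2-D array with one row per sample. A minimal sketch, assuming three hypothetical input columns and made-up coefficients for a noise-free output:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_samples = 200

# Three hypothetical input columns X1, X2, X3
x1 = rng.uniform(0, 10, n_samples)
x2 = rng.uniform(0, 10, n_samples)
x3 = rng.uniform(0, 10, n_samples)
X = np.column_stack([x1, x2, x3])  # shape (n_samples, n_features)

# Noise-free output built from assumed coefficients
y = 1.0 * x1 - 2.0 * x2 + 0.5 * x3

model = LinearRegression().fit(X, y)
print(X.shape, model.coef_)
```

Because the output here has no noise, the fitted coefficients should match the assumed ones up to floating-point error.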

Page 13: Introduction to Machine Learning with Python and scikit-learn

Example: Image Classification

• Classify handwritten digits with ML models

• Each input is an entire image

• Output is digit in the image

Page 14: Introduction to Machine Learning with Python and scikit-learn

[Figure: handwritten digit images as Input (X), with the digits 9 and 2 as Output (Y)]

Page 15: Introduction to Machine Learning with Python and scikit-learn

import numpy as np
from sklearn.ensemble import RandomForestClassifier

with np.load('train.npz') as data:
    pixels_train = data['pixels']
    labels_train = data['labels']
with np.load('test.npz') as data:
    pixels_test = data['pixels']

# Flatten each image into a 1-D feature vector
X_train = pixels_train.reshape(pixels_train.shape[0], -1)
X_test = pixels_test.reshape(pixels_test.shape[0], -1)

model = RandomForestClassifier(n_estimators=50)
model.fit(X_train, labels_train)
labels_test = model.predict(X_test)

Trains on 50,000 images in roughly 20 seconds. 96% accurate!
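
The train.npz and test.npz files are not part of this transcript, so a runnable variant can use the small digits data set bundled with scikit-learn instead. This is a sketch under that assumption: load_digits provides 1,797 images of 8x8 pixels, far fewer than the 50,000 images mentioned above, so timing and accuracy will not match the talk's numbers. The 25% hold-out split is also an assumption.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Bundled stand-in for the talk's image data
digits = load_digits()

# Flatten each 8x8 image into a 64-element feature vector
X = digits.images.reshape(digits.images.shape[0], -1)
y = digits.target

# Hold out a quarter of the labeled images to measure accuracy
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.3f}")
```

Scoring on images the model never saw during training is what makes the accuracy figure meaningful as an estimate of how well it generalizes.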

Page 16: Introduction to Machine Learning with Python and scikit-learn

Kaggle Data Science Competition

• Given 6 million training questions labeled with tags

• Predict the tags for 2 million unlabeled test questions

www.users.globalnet.co.uk/~slocks/instructions.html
stackoverflow.com/questions/895371/bubble-sort-homework

Predicting the tags of Stack Overflow questions with machine learning

Page 17: Introduction to Machine Learning with Python and scikit-learn

Text Classification Overview

Raw Posts -> (Feature Extraction & Selection) -> Vector Space -> (Model Selection & Training) -> Machine Learning Model
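
The two stages of this overview map naturally onto a scikit-learn Pipeline. The sketch below is an assumption about how that could look, not the author's competition code: the toy posts and tags are invented, LogisticRegression is a stand-in classifier, and each post gets a single tag although the real task allows several.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy posts and tags, invented for illustration
posts = [
    "need help with java homework",
    "java null pointer exception",
    "python list comprehension syntax",
    "python dictionary iteration order",
]
tags = ["java", "java", "python", "python"]

model = Pipeline([
    ("vectorize", CountVectorizer()),    # raw posts -> vector space
    ("classify", LogisticRegression()),  # vector space -> predicted tag
])
model.fit(posts, tags)

predicted = model.predict(["java homework exception"])
print(predicted[0])
```

Wrapping both stages in one Pipeline means new posts go through exactly the same feature extraction as the training posts before they reach the classifier.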

Page 18: Introduction to Machine Learning with Python and scikit-learn

Term Frequency Feature Extraction

Characterize text by the frequency of specific words in each text entry

Example Title:

“Why is processing a sorted array faster than processing an array that is not sorted?”

Term Frequencies:

why  processing  sorted  array  faster
1    2           2       2      1

Ignore common words (i.e. stop words)
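
The counting step can be sketched in plain Python. The stop-word set below is hypothetical and only large enough for this one title; a real system would use a fuller list.

```python
import re
from collections import Counter

# Hypothetical stop-word list, just large enough for this title
STOP_WORDS = {"is", "a", "an", "than", "that", "not"}

def term_frequencies(text):
    """Count how often each non-stop-word term appears in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

title = ("Why is processing a sorted array faster than "
         "processing an array that is not sorted?")
counts = term_frequencies(title)
print(counts)
```

The resulting counts for why, processing, sorted, array, and faster are 1, 2, 2, 2, and 1, matching the term frequencies on the slide.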

Page 19: Introduction to Machine Learning with Python and scikit-learn

Frequency of key terms is anticipated to be correlated with the tags of the question

         why  processing  sorted  array  faster  need  help  java  homework
Title 1    1           2       2      2       1     0     0     0         0
Title 2    0           0       0      0       0     1     1     1         1
Title 3    0           0       1      1       0     0     1     0         1
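
A matrix like this can be produced with scikit-learn's CountVectorizer by fixing the vocabulary to the nine terms shown. In this sketch only the first title comes from the slides; the second and third titles are hypothetical, written so that their counts reproduce the rows above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Column order follows the terms listed on the slide
vocabulary = ["why", "processing", "sorted", "array", "faster",
              "need", "help", "java", "homework"]

# Title 1 is from the slides; titles 2 and 3 are hypothetical
titles = [
    "Why is processing a sorted array faster than "
    "processing an array that is not sorted?",
    "Need help with Java homework",
    "Help with sorted array homework",
]

# With a fixed vocabulary, every other word is simply ignored
vectorizer = CountVectorizer(vocabulary=vocabulary)
matrix = vectorizer.fit_transform(titles).toarray()
print(matrix)
```

Each row is one title and each column one term, which is exactly the (n_samples, n_features) layout that scikit-learn models expect as input.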

Page 20: Introduction to Machine Learning with Python and scikit-learn

Example Model Coefficients

Page 21: Introduction to Machine Learning with Python and scikit-learn
Page 22: Introduction to Machine Learning with Python and scikit-learn

ML can be easy*

• You already have ML problems!

• You can start applying ML methods now with Python & scikit-learn

• Theoretical knowledge of ML not needed (initially)*

scikit-learn.org

github.com/scikit-learn

Page 23: Introduction to Machine Learning with Python and scikit-learn

Check out: liveramp.com/careers

Helping companies use their marketing data to delight customers

Opportunities
• Backend Engineers
• Data Scientists
• Full-Stack Engineers

Tools
• Java
• Hadoop (Map/Reduce)
• Ruby

Build and work with large distributed systems that process massive data sets.
