A Beginner's Guide to Machine Learning with Scikit-Learn

Post on 27-Jan-2015

DESCRIPTION

Given at the PyData NYC 2013 conference (http://vimeo.com/79517341) and scheduled for PyTennessee 2014. Scikit-learn is one of the best-known machine learning modules for Python. But how does it work, and what, for that matter, is machine learning? For those who have programming experience but are new to machine learning, this talk gives a beginner-level overview of how machine learning can be useful, important machine learning concepts, and how to implement them with scikit-learn. We’ll use real-world data to look at supervised and unsupervised machine learning algorithms and why scikit-learn is useful for performing these tasks.

Transcript

A Beginner’s Guide to Machine Learning with Scikit-Learn

Sarah Guido

PyTennessee 2014

All about me

• Grad student at the University of Michigan
• Data analyst for HathiTrust
• Organizer of Ann Arbor PyLadies chapter

My talk

• Machine learning and scikit-learn
• Supervised and unsupervised learning
• Preprocessing, validation and testing, strategies for machine learning

What is machine learning?

• Application of algorithms that learn from examples

• Representation and generalization

Why should we care?

• Useful in everyday life
  • Email spam, handwriting analysis, stock market analysis, Netflix
• Especially useful in data analysis
  • Feature extraction, linear regression, classification, clustering

Machine Learning Vocab

• Instance
• Feature
• Class
• Categorical
  • Nominal
  • Ordinal
• Continuous

Machine Learning Vocab

[Figure: example data with a feature, the class, and an instance labeled]

Scikit-Learn

• Machine learning module
• Open-source
• Built-in datasets
• Good resources for learning

Scikit-Learn

• Model = EstimatorObject()
• Model.fit(dataset.data, dataset.target)
  • dataset.data = dataset
  • dataset.target = labels
• Model.predict(dataset.data)
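
The fit/predict pattern above can be sketched roughly as follows. This is a minimal illustration rather than the code from the talk: the choice of the built-in iris dataset and of `GaussianNB` as the estimator are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Load one of scikit-learn's built-in datasets (assumed here: iris).
dataset = load_iris()

# Model = EstimatorObject(); GaussianNB is an arbitrary choice of estimator.
model = GaussianNB()

# Model.fit(dataset.data, dataset.target): learn from the features and labels.
model.fit(dataset.data, dataset.target)

# Model.predict(dataset.data): returns one predicted label per instance.
predictions = model.predict(dataset.data)
print(predictions[:5])
```

Most scikit-learn estimators follow this same shape, so swapping in a different model usually means changing only the line that constructs it.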

Scikit-Learn

• Supervised
• Unsupervised
• Semi-supervised
• Reinforcement learning
• Neural networks
• …and many more!

Supervised learning

• Labeled data
• You know what you’re looking for
• Classification: predict categorical labels
• Regression: predict continuous target variables
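
As a small illustration of the regression case, here is a sketch that fits a line to made-up data. The data (`y = 3x + 2`) and the use of `LinearRegression` are assumptions for the example, not material from the talk.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: the target is a continuous value, y = 3x + 2.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + 2.0

# Labeled data: every row of X comes with a known target in y.
regressor = LinearRegression()
regressor.fit(X, y)

# Predict the continuous target for an unseen input.
predicted = regressor.predict(np.array([[12.0]]))
print(predicted)
```

A classifier is used the same way; the difference is that its targets are category labels rather than continuous numbers.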

Classification

• Categorical variables
• Relationship between instance and feature
• Classification algorithms == classifiers

Classification

• Naïve Bayes classifier
• Assumes features are independent
• Fast performance
• Decent classifier

Classification

• Car evaluation dataset (UCI)
• Features: buying price, maintenance price, number of doors, number of seats, size of the trunk, and safety ranking
• Labels: unacceptable, acceptable, good, or very good
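
A Naïve Bayes classifier on data shaped like this might look like the sketch below. The six rows are hypothetical stand-ins written in the style of the car evaluation data, not real records, and the pairing of `OrdinalEncoder` with `CategoricalNB` is one possible choice assumed for the example, not necessarily what the talk used.

```python
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical rows in the style of the car evaluation data:
# buying price, maintenance price, doors, seats, trunk size, safety.
rows = [
    ["vhigh", "vhigh", "2", "2", "small", "low"],
    ["high", "med", "4", "4", "med", "med"],
    ["med", "low", "4", "more", "big", "high"],
    ["low", "low", "5more", "more", "big", "high"],
    ["vhigh", "high", "2", "2", "small", "med"],
    ["low", "med", "4", "4", "big", "high"],
]
labels = ["unacc", "acc", "good", "vgood", "unacc", "vgood"]

# Turn each categorical feature into integer codes the model can use.
encoder = OrdinalEncoder()
X = encoder.fit_transform(rows).astype(int)

# Naive Bayes variant for categorical features.
classifier = CategoricalNB()
classifier.fit(X, labels)

# Predict labels for the first two (already seen) rows.
predicted = classifier.predict(X[:2])
print(predicted)
```

With so few rows the predictions mean little; the point is only the encode, fit, predict sequence.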

Unsupervised algorithms

• Unlabeled data
• You might have no idea what you’re looking for
• Clustering: splitting observations into groups
• Dimensionality reduction: flatten data to fewer dimensions
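
To make the dimensionality reduction point concrete, here is a sketch that flattens four features down to two. The slides do not name a technique; PCA and the iris dataset are assumptions chosen for the example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Use only the features; the labels are ignored (unsupervised).
iris = load_iris()

# Project the 4-dimensional data onto 2 principal components.
pca = PCA(n_components=2)
reduced = pca.fit_transform(iris.data)

print(iris.data.shape, reduced.shape)
```

The reduced array has the same number of rows as the input but only two columns, which makes it easy to plot.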

Clustering

• Exploring the data
• Similar objects in the same group
• Distance between data points

Clustering

• K-means clustering
• Three steps
  • Chooses initial cluster centers
  • Assigns data instance to cluster
  • Recalculates cluster center
• Efficient
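
In scikit-learn the three steps above are handled inside `KMeans`. The sketch below assumes synthetic data from `make_blobs` and three clusters; both are illustrative choices, not the dataset used in the talk.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic, unlabeled 2-D points grouped around three centers (assumed data).
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# KMeans picks initial centers, assigns each point to its nearest center,
# and recalculates the centers, repeating until the assignments settle.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)
```

The number of clusters has to be chosen up front, which is why k-means is often run several times with different values while exploring the data.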

Data preprocessing

• Encoding categorical features
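
One way to encode categorical features is sketched below. The sample values are hypothetical, and `OneHotEncoder` for features plus `LabelEncoder` for labels is an assumed pairing for illustration; the talk may have used a different encoder.

```python
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical single categorical feature (one column) and its labels.
safety = [["low"], ["med"], ["high"], ["med"]]
labels = ["unacc", "acc", "vgood", "acc"]

# One-hot encoding: one 0/1 column per distinct category.
onehot = OneHotEncoder()
encoded_features = onehot.fit_transform(safety).toarray()

# Label encoding: map each class name to an integer.
label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(labels)

print(encoded_features)
print(encoded_labels)
```

One-hot encoding avoids implying an order between categories; for features that do have a natural order, integer codes can be a reasonable alternative.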

Data preprocessing

• Split the dataset into training and test data
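
Splitting can be sketched with `train_test_split`. The iris dataset and the 25% test size are assumptions for the example. In current scikit-learn releases the function lives in `sklearn.model_selection`; releases from around the time of this talk exposed it under `sklearn.cross_validation`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# Hold out a quarter of the instances for testing (assumed proportion).
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0
)

print(X_train.shape, X_test.shape)
```

The model is then fit on the training portion only, so the test portion stays unseen until evaluation.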

Validation and testing

• Model evaluation

• Cross-validation
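
Cross-validation can be sketched with `cross_val_score`. The estimator (`GaussianNB`), the dataset (iris), and the choice of five folds are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

iris = load_iris()

# Fit and score the model on 5 different train/test partitions of the data.
scores = cross_val_score(GaussianNB(), iris.data, iris.target, cv=5)

# For a classifier, each score is the accuracy on one held-out fold.
print(scores)
print(scores.mean())
```

Averaging over several folds gives a steadier estimate of model quality than a single train/test split.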

Good strategies

• Avoid overfitting
• Use lots of data
• Intuition fails in high dimensions

My materials

• Scikit-learn.org documentation and tutorials
• Machine learning class at U of M
• Scikit-learn talks

Resources

• Scikit-learn documentation and tutorials
  • scikit-learn.org/stable/documentation.html
• Other resources
  • http://archive.ics.uci.edu/ml/datasets.html
  • Mldata.org
• Videos
  • Scikit-learn tutorial: http://vimeo.com/53062607
  • Intro to scikit-learn: http://vimeo.com/72859487

Contact me!

• @sarah_guido
• Linkedin.com/sarahguido
• github.com/sarguido
