Data Science from Scratch
Joel Grus
Seattle DAML Meetup
June 23, 2015
About me
Old-school DAML-er
Wrote a book
SWE at Google
Formerly data science at VoloMetrix, Decide, Farecast
The Road to Data Science
My Road to Data Science
Grad School
Fareology
Data Science Is A Broad Field
Some Stuff
More Stuff
Even More Stuff
Data Science
People who think they're data scientists, but they're not really data scientists
People who are a danger to everyone around them
People who say "machine learnings"
a data scientist should be able to run a regression, write a sql query, scrape a web site, design an experiment, factor matrices, use a data frame, pretend to understand deep learning, steal from the d3 gallery, argue r versus python, think in mapreduce, update a prior, build a dashboard, clean up messy data, test a hypothesis, talk to a businessperson, script a shell, code on a whiteboard, hack a p-value, machine-learn a model. specialization is for engineers.
JOEL GRUS
A lot of stuff!
What Are Hiring Managers Looking For?
Let's check LinkedIn
a data scientist should be able to run a regression, write a sql query, scrape a web site, design an experiment, factor matrices, use a data frame, pretend to understand deep learning, steal from the d3 gallery, argue r versus python, think in mapreduce, update a prior, build a dashboard, clean up messy data, test a hypothesis, talk to a businessperson, script a shell, code on a whiteboard, hack a p-value, machine-learn a model. specialization is for engineers.
JOEL GRUS
grad students!
Learning Data Science
I want to be a data scientist.
Great!
The Math Way
I like to start with matrix decompositions. How's your measure theory?
The Math Way
The Good:
Solid foundation
Math is the noblest known pursuit
The Bad:
Some weirdos don't think math is fun
Can be pretty forbidding
Can miss practical skills
So, did you count the words in that document?
No, but I have an elegant proof that the number of words is finite!
OK, Let's Try Again
I want to be a data scientist.
Great!
The Tools Way
Here's a list of the 25 libraries you really ought to know. How's your R programming?
The Tools Way
The Good:
Don't have to understand the math
Practical
Can get started doing fun stuff right away
The Bad:
Don't have to understand the math
Can get started doing bad science right away
So, did you build that model?
Yes, and it fits the training data almost perfectly!
OK, Maybe Not That Either
So Then What?
Example: k-means clustering
Unsupervised machine learning technique
Given a set of points, group them into k clusters in a way that minimizes the within-cluster sum-of-squares
i.e. in a way such that the clusters are as "small" as possible (for a particular conception of "small")
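To make "within-cluster sum-of-squares" concrete, here is a minimal Python sketch that is not from the slides. The function names (squared_distance, cluster_mean, within_cluster_sum_of_squares) and the toy points are illustrative assumptions. It scores two ways of grouping the same four points; the grouping with the "smaller" clusters gets the lower score.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def cluster_mean(cluster):
    """Componentwise mean of a non-empty list of points."""
    n = len(cluster)
    return [sum(coords) / n for coords in zip(*cluster)]


def within_cluster_sum_of_squares(clusters):
    """Total squared distance from each point to its own cluster's mean."""
    total = 0.0
    for cluster in clusters:
        center = cluster_mean(cluster)
        total += sum(squared_distance(point, center) for point in cluster)
    return total


# The same four points, grouped two different ways (toy data, assumed).
tight = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
loose = [[(0, 0), (10, 0)], [(0, 2), (10, 2)]]

print(within_cluster_sum_of_squares(tight))  # 4.0
print(within_cluster_sum_of_squares(loose))  # 100.0
```

k-means looks for the grouping with the lowest such score among groupings into k clusters.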
The Math Way
The Tools Way
# a 2-dimensional example
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
(cl <- kmeans(x, 2))
plot(x, col = cl$cluster)
points(cl$centers, col = 1:2, pch = 8, cex = 2)
The Tools Way
>>> from sklearn import cluster, datasets
>>> iris = datasets.load_iris()
>>> X_iris = iris.data
>>> y_iris = iris.target
>>> k_means = cluster.KMeans(n_clusters=3)
>>> k_means.fit(X_iris)
KMeans(copy_x=True, init='k-means++', ...
>>> print(k_means.labels_[::10])
[1 1 1 1 1 0 0 0 0 0 2 2 2 2 2]
>>> print(y_iris[::10])
[0 0 0 0 0 1 1 1 1 1 2 2 2 2 2]
So What To Do?
Bootcamps?
Data Science from Scratch
This is to certify that Joel Grus has honorably completed the course of study outlined in the book Data Science from Scratch: First Principles with Python, and is entitled to all the Rights, Privileges, and Honors thereunto appertaining.
Joel Grus
June 23, 2015
Certificate Programs?
Hey! Data scientists!
Learning By Building
You don't really understand something until you build it
For example, I understand garbage disposals much better now that I had to replace one that was leaking water all over my kitchen
More relevantly, I thought I understood hypothesis testing, until I tried to write a book chapter + code about it.
Learning By Building
Functional Programming
Break Things Down Into Small Functions
So you don't end up with something like this
Don't Mutate
Example: k-means clustering
Given a set of points, group them into k clusters in a way that minimizes the within-cluster sum-of-squares
Global optimization is hard, so use a greedy iterative approach
Fun Motivation: Image Posterization
Image consists of pixels
Each pixel is a triplet (R,G,B)
Imagine pixels as points in space
Find k clusters of pixels
Recolor each pixel to its cluster mean
I think it's fun, anyway
8 colors
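The recoloring step can be sketched in a few lines of Python. This is an illustration rather than the code behind the slide: the function names and the pixel values are assumptions, and the two "means" are hard-coded stand-ins for whatever a k-means run would return.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two (R, G, B) triplets."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def closest_mean(pixel, means):
    """The mean color nearest to this pixel."""
    return min(means, key=lambda m: squared_distance(pixel, m))


def posterize(pixels, means):
    """Return a new list of pixels, each replaced by its closest mean color."""
    return [closest_mean(pixel, means) for pixel in pixels]


# Hypothetical data: two reddish and two bluish pixels, and k = 2 assumed means.
pixels = [(250, 10, 10), (240, 0, 20), (5, 5, 200), (0, 20, 255)]
means = [(245, 5, 15), (2, 12, 228)]

print(posterize(pixels, means))
# [(245, 5, 15), (245, 5, 15), (2, 12, 228), (2, 12, 228)]
```

With k = 8 means instead of 2, the same function would produce an image that uses 8 colors.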
Example: k-means clustering
given some points, find k clusters by
choose k "means"
repeat:
    assign each point to cluster of closest "mean"
    recompute mean of each cluster
sounds simple! let's code!
import random

def k_means(points, k, num_iters=10):
    # relies on a mean(points) helper (componentwise vector mean)
    means = list(random.sample(points, k))
    assignments = [None for _ in points]

    for _ in range(num_iters):
        # assign each point to closest mean
        for i, point_i in enumerate(points):
            d_min = float('inf')
            for j, mean_j in enumerate(means):
                d = sum((x - y)**2 for x, y in zip(point_i, mean_j))
                if d < d_min:
                    d_min = d
                    assignments[i] = j

        # recompute means
        for j in range(k):
            cluster = [point for i, point in enumerate(points) if assignments[i] == j]
            means[j] = mean(cluster)

    return means
start with k randomly chosen points
start with no cluster assignments
for each iteration
for each point
for each mean, compute the distance
assign the point to the cluster of the mean with the smallest distance
find the points in each cluster
and compute the new means
Not impenetrable, but a lot less helpful than it could be
Can we make it simpler?
Break Things Down Into Small Functions
import random

def k_means(points, k, num_iters=10):
    # start with k of the points as "means"
    means = random.sample(points, k)

    # and iterate finding new means
    for _ in range(num_iters):
        means = new_means(points, means)

    return means

def new_means(points, means):
    # assign points to clusters
    # each cluster is just a list of points
    clusters = assign_clusters(points, means)

    # return the cluster means
    return [mean(cluster) for cluster in clusters]

def assign_clusters(points, means):
    # one cluster for each mean
    # each cluster starts empty
    clusters = [[] for _ in means]

    # assign each point to cluster
    # corresponding to closest mean
    for point in points:
        index = closest_index(point, means)
        clusters[index].append(point)

    return clusters

def closest_index(point, means):
    # return index of closest mean
    return argmin(distance(point, mean) for mean in means)

def argmin(xs):
    # return index of smallest element
    return min(enumerate(xs), key=lambda pair: pair[1])[0]
To Recap
k_means(points, k, num_iters=10)
new_means(points, means)
assign_clusters(points, means)
closest_index(point, means)
argmin(xs)
distance(point1, point2)
mean(points)
add(point1, point2)
scalar_multiply(c, point)
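The recap names four helpers (distance, mean, add, scalar_multiply) whose bodies the slides do not show. One possible way to write them, in the same small-functions, no-mutation style, is sketched below. The bodies are assumptions, including the choice of Euclidean distance and of lists as the return type.

```python
import math
from functools import reduce


def add(point1, point2):
    """Componentwise sum, returned as a new list."""
    return [x + y for x, y in zip(point1, point2)]


def scalar_multiply(c, point):
    """Multiply every coordinate by c, returned as a new list."""
    return [c * x for x in point]


def mean(points):
    """Componentwise mean of a non-empty list of points."""
    return scalar_multiply(1 / len(points), reduce(add, points))


def distance(point1, point2):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(point1, point2)))


print(add((1, 2), (3, 4)))          # [4, 6]
print(scalar_multiply(2, (1, 2)))   # [2, 4]
print(mean([(0, 0), (2, 4)]))       # [1.0, 2.0]
print(distance((0, 0), (3, 4)))     # 5.0
```

As sketched, mean raises an error on an empty list, so a cluster that ends up with no points would need separate handling.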
As a Pedagogical Tool
Can be used "top down" (as we did here)
    Implement high-level logic
    Then implement the details
    Nice for exposition
Can also be used "bottom up"
    Implement small pieces
    Build up to high-level logic
    Good for workshops
Example: Decision Trees
Want to predict whether a given Meetup is worth attending (True) or not (False)
Inputs are dictionaries describing each Meetup
{ "group" : "DAML", "date" : "2015-06-23", "beer" : "free", "food" : "dim sum", "speaker" : "@joelgrus", "location" : "Google", "topic" : "shameless self-promotion" }
{ "group" : "Seattle Atheists", "date" : "2015-06-23", "location" : "Round the Table", "beer" : "none", "food" : "none", "topic" : "Godless Game Night" }
Example: Decision Trees
[Decision tree diagram: "beer?" branches on free (True), none (False), and paid, which leads to "speaker?" with branches @jakevdp and @joelgrus ending in True / False leaves]
Example: Decision Trees
class LeafNode:
    def __init__(self, prediction):
        self.prediction = prediction

    def predict(self, input_dict):
        return self.prediction

class DecisionNode:
    def __init__(self, attribute, subtree_dict):
        self.attribute = attribute
        self.subtree_dict = subtree_dict

    def predict(self, input_dict):
        value = input_dict.get(self.attribute)
        subtree = self.subtree_dict[value]
        return subtree.predict(input_dict)
Example: Decision Trees
Again inspiration from functional programming:
type Input = Map.Map String String

data Tree = Predict Bool                           -- always predict a specific value
          | Subtrees String (Map.Map String Tree)  -- look at the "beer" entry; a map from each possible "beer" value to a subtree
Example: Decision Trees
predict :: Tree -> Input -> Bool
predict (Predict b) _ = b
predict (Subtrees a subtrees) input = predict subtree input
  where subtree = subtrees Map.! (input Map.! a)
Example: Decision Trees
We can do the same: we'll say a decision tree is either
True
False
(attribute, subtree_dict)
("beer", { "free" : True, "none" : False, "paid" : ("speaker", {...})})
Example: Decision Trees
def predict(tree, input_dict):
    # leaf node predicts itself
    if tree in (True, False):
        return tree
    else:
        # destructure tree
        attribute, subtree_dict = tree
        # find appropriate subtree
        value = input_dict[attribute]
        subtree = subtree_dict[value]
        # classify using subtree
        return predict(subtree, input_dict)
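A short usage sketch of the tuple-based representation, applied to the two Meetups from the earlier slide. The slides elide the "speaker" subtree as {...}, so the mapping used here is a made-up assumption for illustration, as is the third input.

```python
def predict(tree, input_dict):
    # leaf node predicts itself
    if tree in (True, False):
        return tree
    # otherwise destructure the tree and recurse into the matching subtree
    attribute, subtree_dict = tree
    subtree = subtree_dict[input_dict[attribute]]
    return predict(subtree, input_dict)


# The "speaker" subtree below is hypothetical; the slides do not specify it.
tree = ("beer", {"free": True,
                 "none": False,
                 "paid": ("speaker", {"@jakevdp": True, "@joelgrus": False})})

daml = {"group": "DAML", "date": "2015-06-23", "beer": "free",
        "food": "dim sum", "speaker": "@joelgrus", "location": "Google",
        "topic": "shameless self-promotion"}
game_night = {"group": "Seattle Atheists", "date": "2015-06-23",
              "location": "Round the Table", "beer": "none", "food": "none",
              "topic": "Godless Game Night"}
paid_beer = {"beer": "paid", "speaker": "@jakevdp"}  # hypothetical input

print(predict(tree, daml))        # True
print(predict(tree, game_night))  # False
print(predict(tree, paid_beer))   # True
```

An input whose attribute value has no entry in subtree_dict raises a KeyError in this sketch; the slides do not say how such inputs should be handled.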
Not Just For Data Science
In Conclusion
Teaching data science is fun, if you're smart about it
Learning data science is fun, if you're smart about it
Writing a book is not that much fun
Having written a book is pretty fun
Making slides is actually kind of fun
Functional programming is a lot of fun