Data Science at Scale: Using Apache Spark for Data Science at Bitly Sarah Guido Data Day Seattle 2015
Page 1: Using Apache Spark for Data Science at Bitly

Data Science at Scale: Using Apache Spark for Data Science at Bitly

Sarah Guido
Data Day Seattle 2015

Page 2: Using Apache Spark for Data Science at Bitly

Overview

• About me/Bitly
• Spark overview
• Using Spark for data science
• When it works, it’s great! When it works…

Page 3: Using Apache Spark for Data Science at Bitly

About me

• Data scientist at Bitly
• NYC Python/PyGotham co-organizer
• O’Reilly Media author
• @sarah_guido

Page 4: Using Apache Spark for Data Science at Bitly

About this talk

• This talk is:
  – Description of my workflow
  – Exploration of within-Spark tools
• This talk is not:
  – In-depth exploration of algorithms
  – Building new tools on top of Spark
  – Any sort of ground truth for how you should be using Spark

Page 5: Using Apache Spark for Data Science at Bitly

A bit of background

• Need for big data analysis tools
• MapReduce for exploratory data analysis == 
• Iterate/prototype quickly
• Overall goal: understand how people use not only our app, but the Internet!

Page 6: Using Apache Spark for Data Science at Bitly

Bitly data!

• Legit big data
• 1 hour of decodes is 10 GB
• 1 day is 240 GB
• 1 month is ~7 TB
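The three figures agree with one another, as a quick back-of-the-envelope check shows. The sketch assumes a 30-day month and decimal units (1 TB = 1000 GB); neither assumption is stated on the slide.

```python
# Back-of-the-envelope check of the decode volumes quoted on the slide.
GB_PER_HOUR = 10  # from the slide: 1 hour of decodes is 10 GB

gb_per_day = GB_PER_HOUR * 24  # hourly volume scaled to one day

# Assumption: a 30-day month and 1 TB = 1000 GB (not stated on the slide).
tb_per_month = gb_per_day * 30 / 1000

print(gb_per_day, "GB per day")
print(tb_per_month, "TB per month")
```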

Page 7: Using Apache Spark for Data Science at Bitly

What is Spark?

• Large-scale distributed data processing tool
• SQL and streaming tools
• Faster than Hadoop
• Python API

Page 8: Using Apache Spark for Data Science at Bitly

How does Spark work?

• Partitions your data to operate over in parallel
  – A partition by default is 64 MB
• Capability to add map/reduce features
• Lazy: transformations only run when an action is called
  – Ex. collect() or writing to a file
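The partitioning and laziness described above can be illustrated with a plain-Python analogy. This is not Spark's API: `LazyDataset`, `partition`, and `double` are hypothetical names made up for the sketch, and the toy version walks the partitions one after another where Spark would process them in parallel.

```python
from itertools import islice


def partition(records, size):
    """Split records into fixed-size chunks (stand-ins for Spark partitions)."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on collect()."""

    def __init__(self, partitions, funcs=()):
        self.partitions = partitions
        self.funcs = tuple(funcs)

    def map(self, func):
        # Transformation: nothing is computed here, the function is only recorded.
        return LazyDataset(self.partitions, self.funcs + (func,))

    def collect(self):
        # Action: apply every recorded function to each record of each partition.
        out = []
        for part in self.partitions:
            for record in part:
                for func in self.funcs:
                    record = func(record)
                out.append(record)
        return out


calls = []


def double(x):
    calls.append(x)  # side effect used only to show when the work happens
    return x * 2


data = LazyDataset(list(partition(range(6), size=2)))
doubled = data.map(double)
calls_before_collect = len(calls)  # map() has not run double() yet
result = doubled.collect()         # the work happens here
```

In the toy version, as in Spark, calling `map` only describes the computation; `collect` is what triggers it.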

Page 9: Using Apache Spark for Data Science at Bitly

Why Spark?

• Fast. Really fast.
• SQL layer – kind of like Hive
• Distributed scientific tools
• Python! Sometimes.
• Cutting edge technology

Page 10: Using Apache Spark for Data Science at Bitly

Setting up the workflow

• Spark journey
  – Hadoop server: 1.2
  – EMR: 1.3
  – EMR: 1.4

Page 11: Using Apache Spark for Data Science at Bitly

How do I use it?

• EMR!
• spark-submit on the cluster
• Can add script as a step to cluster launch

Page 12: Using Apache Spark for Data Science at Bitly

Creating a cluster

• aws emr create-cluster
• --bootstrap-actions
• --steps
• --auto-terminate
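A minimal sketch of how these options fit together, written as a Python helper that assembles the command line without running it. The AWS CLI spells the bootstrap flag `--bootstrap-actions`. Everything else here is an assumption for illustration and not taken from the talk: the helper name, the cluster name, the release label, the instance settings, the step arguments, and the S3 paths. The exact step syntax depends on the EMR release in use.

```python
import shlex


def build_create_cluster_command(script_s3_uri, bootstrap_s3_uri):
    """Return the argv list for an `aws emr create-cluster` call (not executed)."""
    # One step that submits the script once the cluster is up (assumed step arguments).
    step = (
        "Type=Spark,Name=SparkJob,ActionOnFailure=CONTINUE,"
        f"Args=[--deploy-mode,cluster,{script_s3_uri}]"
    )
    return [
        "aws", "emr", "create-cluster",
        "--name", "spark-data-science",     # hypothetical cluster name
        "--release-label", "emr-4.0.0",     # assumed release; pick one with your Spark version
        "--applications", "Name=Spark",
        "--instance-type", "m3.xlarge",     # assumed instance type
        "--instance-count", "3",            # assumed cluster size
        "--use-default-roles",
        "--bootstrap-actions", f"Path={bootstrap_s3_uri}",
        "--steps", step,
        "--auto-terminate",                 # shut the cluster down after the steps finish
    ]


# Hypothetical S3 locations for the job script and the bootstrap script.
cmd = build_create_cluster_command(
    "s3://example-bucket/job.py",
    "s3://example-bucket/bootstrap.sh",
)
print(shlex.join(cmd))
```

Building the command as a list keeps each flag and its value separate, so it could later be handed to `subprocess.run` without shell quoting issues.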

Page 13: Using Apache Spark for Data Science at Bitly

Creating a cluster

• LIVE!!

Page 14: Using Apache Spark for Data Science at Bitly

Let’s set the stage…

• Understanding user behavior
• How do I extract, explore, and model a subset of our data using Spark?

Page 15: Using Apache Spark for Data Science at Bitly

Data

{"a": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/600.4.10 (KHTML, like Gecko) Version/8.0.4 Safari/600.4.10", "c": "US", "nk": 0, "tz": "America/Los_Angeles", "g": "1HfTjh8", "h": "1HfTjh7", "u": "http://www.nytimes.com/2015/03/22/opinion/sunday/why-health-care-tech-is-still-so-bad.html?smid=tw-share", "t": 1427288425, "cy": "Seattle"}

Page 16: Using Apache Spark for Data Science at Bitly

Data processing

• Problem: I want to retrieve NYT decodes
• Solution: well, there are two…
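The slides showing the two solutions did not survive in this transcript, so the following is not the speaker's code. It is a plain-Python sketch of the filtering logic only, assuming that the "u" field holds the decoded long URL and that "NYT decodes" means URLs on nytimes.com. The helper name `is_nyt_decode` is hypothetical. In PySpark, a predicate like this could be passed to `RDD.filter` after parsing each line with `json.loads`.

```python
import json
from urllib.parse import urlparse


def is_nyt_decode(record):
    """True if the long URL ("u") points at nytimes.com or one of its subdomains."""
    host = urlparse(record.get("u", "")).hostname or ""
    return host == "nytimes.com" or host.endswith(".nytimes.com")


# Two sample log lines: the first is abbreviated from the record on the Data slide,
# the second is made up for contrast.
lines = [
    '{"c": "US", "u": "http://www.nytimes.com/2015/03/22/opinion/sunday/'
    'why-health-care-tech-is-still-so-bad.html?smid=tw-share", '
    '"t": 1427288425, "cy": "Seattle"}',
    '{"c": "US", "u": "http://example.com/some-page", "t": 1427288430, "cy": "Portland"}',
]

decodes = [json.loads(line) for line in lines]
nyt_decodes = [d for d in decodes if is_nyt_decode(d)]
print(len(nyt_decodes), nyt_decodes[0]["cy"])
```

Matching on the parsed hostname avoids false positives from URLs that merely contain "nytimes.com" somewhere in the path or query string.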

Page 17: Using Apache Spark for Data Science at Bitly

Data processing

Page 18: Using Apache Spark for Data Science at Bitly

Data processing

Page 19: Using Apache Spark for Data Science at Bitly

Data processing

• SparkSQL: 8 minutes
• Pure Spark: 4 minutes!!!

Page 20: Using Apache Spark for Data Science at Bitly

Data processing

Page 21: Using Apache Spark for Data Science at Bitly

Data processing

• Yes, we’re going to do a live demo of this!

Page 22: Using Apache Spark for Data Science at Bitly

Exploratory data analysis

• Problem: what’s going on with my decodes?
• Solution: DataFrames!
  – Similar to Pandas: describe, drop, fill, aggregate functions
  – You can actually convert to a Pandas DataFrame!

Page 23: Using Apache Spark for Data Science at Bitly

Exploratory data analysis

• Get a sense of what’s going on in the data
• Look at distributions, frequencies
• Mostly categorical data here
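For categorical fields, "distributions and frequencies" mostly comes down to counting values. A small plain-Python sketch of that idea follows, using made-up records and assuming that "c" is a country code and "cy" a city, as the sample record suggests. On a Spark DataFrame, the comparable operation would be `df.groupBy("c").count()`.

```python
from collections import Counter

# Made-up decode records using field names from the Data slide
# (assumption: "c" = country code, "cy" = city).
decodes = [
    {"c": "US", "cy": "Seattle"},
    {"c": "US", "cy": "New York"},
    {"c": "GB", "cy": "London"},
    {"c": "US", "cy": "Seattle"},
]

# Absolute counts per country.
country_counts = Counter(d["c"] for d in decodes)

# Relative frequency of each country, most common first.
total = sum(country_counts.values())
country_freq = [(country, n / total) for country, n in country_counts.most_common()]

print(country_freq)
```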

Page 24: Using Apache Spark for Data Science at Bitly

Exploratory data analysis

• Yet another live demo

Page 25: Using Apache Spark for Data Science at Bitly

Topic modeling

• Problem: we have so many links but no way to classify them into certain kinds of content

• Solution: LDA (latent Dirichlet allocation)
  – Sort of – compare to other solutions

Page 26: Using Apache Spark for Data Science at Bitly

Topic modeling

• Oh, the JVM…
  – LDA only in Scala
• Scala jar file
• Store script in S3

Page 27: Using Apache Spark for Data Science at Bitly

Topic modeling

• LDA in Spark
  – Generative model
  – Several different methods
  – Term frequency vector as input
• “Note: LDA is a new feature with some missing functionality...”

Page 28: Using Apache Spark for Data Science at Bitly

Topic modeling

Page 29: Using Apache Spark for Data Science at Bitly

Topic modeling

• Term frequency vector

DOCUMENT \ TERM   python   data   hot dogs   baseball   zoo
doc_1             1        3      0          0          0
doc_2             0        0      4          1          0
doc_3             4        0      0          0          5
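The table can be reproduced with a short plain-Python sketch. It assumes each document has already been reduced to a list of terms (so "hot dogs" counts as one term) and that the vocabulary order is fixed up front. The helper name `term_frequency_vector` is hypothetical, and this is not the preprocessing used in the talk.

```python
from collections import Counter

# Fixed vocabulary: one vector position per term, in this order.
vocabulary = ["python", "data", "hot dogs", "baseball", "zoo"]

# Each document is a list of already-extracted terms (tokenization is out of scope).
documents = {
    "doc_1": ["python"] + ["data"] * 3,
    "doc_2": ["hot dogs"] * 4 + ["baseball"],
    "doc_3": ["python"] * 4 + ["zoo"] * 5,
}


def term_frequency_vector(terms, vocabulary):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(terms)
    return [counts[term] for term in vocabulary]


tf_vectors = {
    name: term_frequency_vector(terms, vocabulary)
    for name, terms in documents.items()
}

for name, vector in tf_vectors.items():
    print(name, vector)
```

Each row is the kind of count vector that LDA takes as input: one entry per vocabulary term, one vector per document.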

Page 30: Using Apache Spark for Data Science at Bitly

Topic modeling

Page 31: Using Apache Spark for Data Science at Bitly

Topic modeling

Page 32: Using Apache Spark for Data Science at Bitly

Topic modeling

• Why not??
  – Means to an end
  – Current large scale scraping inability

Page 33: Using Apache Spark for Data Science at Bitly

Architecture

• Right now: not in production
  – Buy-in
• Streaming applications for parts of the app
• Python or Scala?
  – Scala by force (LDA, GraphX)

Page 34: Using Apache Spark for Data Science at Bitly

Some issues

• Hadoop servers
• JVM
• gzip
• 1.4
• Resource allocation
• Really only got it to this stage very recently

Page 35: Using Apache Spark for Data Science at Bitly

Where to go next?

• Spark in production!
• Use for various parts of our app
• Use for R&D and prototyping purposes, with the potential to expand into the product

Page 36: Using Apache Spark for Data Science at Bitly

Current/future projects

• Trend detection
• Device prediction
• User affinities
  – GraphX!
• A/B testing

Page 37: Using Apache Spark for Data Science at Bitly

Resources

• spark.apache.org - documentation
• Databricks blog
• Cloudera blog

Page 38: Using Apache Spark for Data Science at Bitly

Thanks!!

@sarah_guido