September 18, 2015 Jason Huang Senior Solutions Architect, Qubole Inc.
Company Founding
Qubole founders built the Facebook data platform.
The Facebook model changed the role of data in an enterprise.
• Facebook needed to turn its data assets into a “utility” to build a viable business.
– Collaborative: over 30% of employees use the data directly.
– Accessible: developers, analysts, business analysts, and business users all run queries, which has made the company more data driven and more agile in its use of data.
– Scalable: exabytes of data, moving fast.
It took the founders a team of more than 30 people to build this infrastructure; the team that manages it today has more than 100 people.
Work at Facebook inspired the founding of Qubole
[Diagram labels: Operations Analyst, Marketing Ops, Analyst, Data Architect, Business Users, Product Support, Customer Support, Developer, Sales Ops, Product Managers, Data Infrastructure]
Impediments for an Aspiring Data Driven Enterprise
Where Big Data falls short:
• 6-18 month implementation time
• Only 27% of Big Data initiatives were classified as “Successful” in 2014
• Rigid and inflexible infrastructure
• Non-adaptive software services
• Highly specialized systems
• Difficult to build and operate
• Only 13% of organizations achieve full-scale production
• 57% of organizations cite skills gap as a major inhibitor
State of the Big Data Industry (n=417)
[Chart: technologies shown are Hadoop, MapReduce, Pig, Spark, Storm, Presto, Cassandra, HBase, Hive; percentage axis from 0% to 80%]
• Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.
Analytic Libraries:
• Spark Streaming (Streaming Data)
• Spark SQL (Data Processing)
• MLlib (Machine Learning)
• GraphX (Graph Processing)
Apache Spark
• Streaming Data
– Process streaming data with Spark built-in functions
– Applications such as fraud detection and log processing
– ETL via data ingestion
• Machine Learning
– Helps users run repeated queries and machine learning algorithms on data sets
– MLlib can work in areas such as clustering, classification, and dimensionality reduction
– Used for very common big data functions: predictive intelligence, customer segmentation, and sentiment analysis
Common Spark Use Cases
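To make the log-processing use case concrete, here is a minimal plain-Python sketch of the micro-batch idea behind Spark Streaming: incoming records are cut into small batches, each batch is processed like a tiny batch job, and the results are folded into running state. It does not use any Spark API. The function names, the log format (severity as the first token), and the fixed batch size are assumptions for illustration; Spark Streaming itself groups records by time interval rather than by count.

```python
from collections import Counter
from itertools import islice


def micro_batches(records, batch_size):
    """Group an iterable of records into consecutive batches of batch_size."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def count_levels(batch):
    """Count log lines per severity level (assumed to be the first token)."""
    return Counter(line.split(" ", 1)[0] for line in batch)


def running_counts(records, batch_size):
    """Yield cumulative per-level counts after each micro-batch."""
    total = Counter()
    for batch in micro_batches(records, batch_size):
        total.update(count_levels(batch))
        yield dict(total)


if __name__ == "__main__":
    # Hypothetical log lines; a real job would read from a stream source.
    logs = ["INFO start", "ERROR disk full", "INFO heartbeat", "WARN slow"]
    for snapshot in running_counts(logs, batch_size=2):
        print(snapshot)
```

The same shape (split, process each batch, update state) is what makes fraud detection or streaming ETL look like ordinary batch code applied repeatedly.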
• Interactive Analysis
– MapReduce was built to handle batch processing
– Hadoop query engines such as Hive or Pig can be too slow for interactive analysis
– Spark is fast enough to perform exploratory queries without sampling
– Provides language-specific APIs for R, Python, Scala and Java
• Fog Computing
– The Internet of Things: objects and devices with tiny embedded sensors that communicate with each other and with users, creating a fully interconnected world
– Decentralize data processing and storage, and use Spark Streaming analytics and interactive real-time queries
Common Spark Use Cases
Use Spark for distributed computation:
- Combine Spark SQL and GraphX with MLlib in the same Spark program
- Use your language of choice: Python, Scala, R, or Java
- Extensive library of algorithms (http://spark.apache.org/docs/latest/mllib-guide.html)
Why Spark MLlib?
• Classification and Regression: logistic regression, linear regression, linear support vector machine (SVM), naive Bayes, decision trees
• Collaborative Filtering: alternating least squares (ALS)
• Clustering: k-means, Gaussian mixture
• Dimensionality Reduction: singular value decomposition (SVD), principal component analysis (PCA)
Algorithms
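As an illustration of what one of the listed algorithms does, below is a small single-machine sketch of k-means (Lloyd's algorithm) in plain Python. It is not the MLlib implementation or its API; MLlib runs the same assign-then-recompute loop distributed across a cluster. The function names, the fixed iteration cap, and the seeded random initialization are assumptions made for this sketch.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(cluster):
    """Component-wise mean of a non-empty list of tuples."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))


def kmeans(points, k, iterations=20, seed=0):
    """Return k cluster centers for a list of numeric tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # pick k distinct input points to start
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its points;
        # an empty cluster keeps its previous center.
        new_centers = [mean(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(sorted(kmeans(pts, k=2)))
```

Customer segmentation, mentioned earlier as a use case, is a typical application: each point would be a customer's feature vector and each center a segment.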
• Spark: Fast, Scalable and Flexible
• R: Statistics, Packages and Plots
SparkR combines both, which makes it very powerful
Use the SparkR API to process data with Spark, then bring the results back into R for machine learning, data visualization, etc.
How about R? Use SparkR!
What about the cloud?
• Central Governance & Security
• Internet Scale
• Instant Deployment
• Isolated Multi-tenancy
• Elastic
• Object Store Underpinnings
• Zero configuration – Spark, SparkR, MLlib, GraphX, etc. all pre-installed on all cluster nodes
– e.g. submit SparkR programs via a client-side API to an on-demand compute cluster
• ETL (data cleansing, transformations, table joins, etc.) is required prior to any ML modeling and analysis
– e.g. use other Big Data tools to prepare the data: Hive, Hadoop, Cascading, Pig…
Spark in the Cloud
• Use AWS S3 object store to decouple compute and storage; scale processing power and storage capacity independently
• S3 is highly available, reliable, scalable and cost effective
• Elastic compute scales on demand: calculations may require 10, 100 or 1,000+ compute nodes.
• Ability to run multiple clusters to separate teams, workloads, and production from non-production (R&D/test)
Spark in the Cloud
Cloud object store for data sets, e.g. AWS S3
• Flexible compute resource options
– High memory instances: AWS EC2 r3.* for high memory workloads to cache and manipulate large Spark RDDs
– High CPU instances: AWS EC2 c3.* for CPU intensive workloads
• Automatic cluster termination when idle
• Periodically check for bad instances and remove them
Spark in the Cloud
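The two housekeeping behaviors on this slide (terminate when idle, remove bad instances) can be sketched as simple policy functions. This is a hypothetical illustration only, not Qubole's or AWS's implementation: the `Node` type, the `healthy` flag, and the timeout parameter are all assumed names, and a real system would obtain health and activity data from its cluster manager.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical view of one cluster instance."""
    node_id: str
    healthy: bool


def should_terminate(last_activity_ts, now_ts, idle_timeout_s):
    """True when the cluster has been idle for at least idle_timeout_s seconds."""
    return now_ts - last_activity_ts >= idle_timeout_s


def remove_bad_nodes(nodes):
    """Split nodes into (kept, removed) according to the health flag."""
    kept = [n for n in nodes if n.healthy]
    removed = [n for n in nodes if not n.healthy]
    return kept, removed


if __name__ == "__main__":
    nodes = [Node("n1", True), Node("n2", False), Node("n3", True)]
    kept, removed = remove_bad_nodes(nodes)
    print([n.node_id for n in kept], [n.node_id for n in removed])
    # Idle for 2 hours with a 1 hour timeout: terminate.
    print(should_terminate(last_activity_ts=0, now_ts=7200, idle_timeout_s=3600))
```

A periodic job would call both functions on each check, replacing removed nodes if work is still queued and shutting the cluster down otherwise.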
CONFIDENTIAL. SUBJECT TO NDA PROVISIONS.
• Install Spark on EC2 (HDFS if required)
• Choose Spark backend cluster mode and configure it
– Standalone
– YARN
– Mesos
• Spin up a cluster of instances
DIY - Getting Started on the Cloud
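The choice of cluster mode mostly shows up as the master URL passed to `spark-submit --master`. The helper below is a small sketch that maps the three modes on the slide to their usual URL forms; the function names and the example host and script name are hypothetical, and the ports are the conventional defaults (7077 for standalone, 5050 for Mesos). Note that older Spark releases used `yarn-client` or `yarn-cluster` as the YARN master value, so check the documentation for the version in use.

```python
def master_url(mode, host=None, port=None):
    """Return the --master value for a given cluster mode."""
    if mode == "standalone":
        return f"spark://{host}:{port or 7077}"
    if mode == "mesos":
        return f"mesos://{host}:{port or 5050}"
    if mode == "yarn":
        # YARN locates the resource manager from the Hadoop configuration,
        # so no host or port appears in the master value.
        return "yarn"
    raise ValueError(f"unknown cluster mode: {mode}")


def spark_submit_args(mode, app, host=None, port=None):
    """Build an argument list suitable for subprocess.run (not executed here)."""
    return ["spark-submit", "--master", master_url(mode, host, port), app]


if __name__ == "__main__":
    # "master.example.com" and "job.py" are placeholder names.
    for mode in ("standalone", "yarn", "mesos"):
        print(" ".join(spark_submit_args(mode, "job.py", host="master.example.com")))
```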
EC2 scripts can help:
http://spark.apache.org/docs/latest/ec2-scripts.html
- Helps spin up named clusters
- Creates a security group, comes pre-baked with Spark installed
DIY - Getting Started on the Cloud
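For reference, a launch with the `spark-ec2` script described on the linked page takes a key pair, an identity file, a slave count, and a cluster name. The sketch below only assembles that command line; it does not run it. The helper name and all example values (key pair, file, cluster name) are placeholders, and the available options should be confirmed against the documentation for the Spark version in use.

```python
import shlex


def spark_ec2_launch(key_pair, identity_file, slaves, cluster_name):
    """Build the argument list for launching a named cluster with spark-ec2."""
    return [
        "./spark-ec2",
        f"--key-pair={key_pair}",            # name of the EC2 key pair
        f"--identity-file={identity_file}",  # private key file for that pair
        f"--slaves={slaves}",                # number of worker instances
        "launch",
        cluster_name,
    ]


if __name__ == "__main__":
    # Placeholder values; substitute your own key pair and cluster name.
    args = spark_ec2_launch("my-key", "my-key.pem", 4, "demo-cluster")
    print(" ".join(shlex.quote(a) for a in args))
```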
Another (very short) Demo
Qubole Case Study
[Diagram labels: Operations Analyst, Marketing Ops, Analyst, Data Architect, Business Users, Product Support, Customer Support, Developer, Sales Ops, Product Managers]
Ease of use for analysts
• Dozens of Data Scientist and Analyst users
• Produces double-digit TBs of data per day
• Does not have dedicated staff to set up and manage clusters and Hadoop distributions
Qubole Case Study
[Architecture diagram. Stages: Producers, Continuous Processing, Storage, Analytics. Labels: CDN, Real Time Bidding, Retargeting Platform, ETL, Kinesis, S3, Redshift, Machine Learning, Streaming, Customer Data]
Why Spark?
“Qubole put our cluster management, auto-scaling and ad-hoc queries on autopilot. Its higher performance for Big Data queries translates directly into faster and more actionable marketing intelligence for our customers.”
Yekesa Kosuru, VP, Technology