A Production Quality Sketching Library for the Analysis of Big Data
Lee Rhodes
Distinguished Architect, Verizon Media (Yahoo), Inc.
Outline
Problematic Queries of Big Data: Where traditional analysis methods don’t work well
Approximate Analysis Using Sketches: How using stochastic processes and probabilistic analysis wins in a systems architecture context
The Open Source Apache DataSketches Library: A quick overview of this unique library dedicated to production systems that process big data
The Data Analysis Challenge …
… Analyze This Data in Near-Real Time
Time Stamp   User ID   Device ID   Site    Time Spent Sec   Items Viewed
9:00 AM      U1        D1          Apps    59               5
9:30 AM      U2        D2          Apps    179              15
10:00 AM     U3        D3          Music   29               3
1:00 PM      U1        D4          Music   89               10
Billions of Rows or Key, Value Pairs …
Example: Web Site Logs
Some Very Common Queries …
Unique Identifiers with Set Expressions: (A ∪ B) ∩ (C ∪ D) \ E
Quantiles, CDFs
Histograms, PMFs
Frequent Items / Heavy Hitters
Vector & Matrix Operations: SVD, etc.
Graph Analysis
Reservoir Sampling: Uniform, Weighted
All Non-Additive
When The Data Gets Large Or Resources Are Limited,
All Of These Queries Become Problematic,
Because the Aggregations are Non-Additive.
Additivity across partitions: Result A + Result B = Combined Result
Additivity across updates: Current Result + New Item = Updated Result
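To make the non-additivity point concrete, here is a minimal Python illustration. The user IDs and the two partitions are made up for the example. Row counts can be summed across partitions, but unique-user counts cannot, because users seen in both partitions are counted twice.

```python
# Two partitions of a log; the user IDs are hypothetical.
partition_a = ["U1", "U2", "U3", "U1"]
partition_b = ["U3", "U4", "U1"]

# Row counts are additive: partial results can simply be summed.
combined_rows = len(partition_a) + len(partition_b)
assert combined_rows == len(partition_a + partition_b)

# Distinct counts are NOT additive: summing the partial results
# double-counts users that appear in both partitions.
distinct_a = len(set(partition_a))
distinct_b = len(set(partition_b))
naive_sum = distinct_a + distinct_b
true_distinct = len(set(partition_a) | set(partition_b))

print(naive_sum, true_distinct)  # prints "6 4"
```

Getting the exact answer requires keeping the full sets of IDs around, which is the local-copy problem described next.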
Col1, ..., Item, ..., Coln… Billions of rows ...
Ω(u) size: ~ Big Data
Exact Results
Difficult Query
Local Item Copies
Query processing often requires sorting…which is very slow.
Query Engine
Traditional, Exact Analysis Methods Require Local Copies
Note: Micro-batch “Streaming Platforms”, e.g., Storm, Do Not Solve The Fundamental Problem!
Big Data
Parallelization Does Not Help Much Because of Non-Additivity.
You have to keep the copies somewhere!
Col1, ..., Item, ..., Coln… Billions of rows ...
𝝨 Exact Results
Expensive Shuffle
Copy
Example: Map-Reduce
Every dataset is processed N times for a rolling N-day window!
Traditional, Exact Time Windowing Requires Multiple Touches of Every Item
Let’s challenge a fundamental premise: that our results must be exact!
If we can allow for approximation, along with some accuracy guarantees,
we can achieve orders-of-magnitude improvement in
• speed and
• reduction of resources.
Introducing the Sketch (a.k.a., Stochastic Streaming Algorithm)
Data Stream
Stream Processor: Stochastic Process (Random Selection, Sizing, Storing)
Data Structure: Size = f(k)
Query Processor: Probabilistic Analysis
Query
Results ± ε, where ε = f(1/k)
Merge / Set Operations: Sketch Stream, Result Sketch
Model the Problem as a Stochastic Process
Analyze using Probability & Statistics
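As a rough illustration of these pieces, the following Python sketch implements a K-Minimum-Values (KMV) distinct counter. This is a textbook technique chosen for brevity, not the library's code. The class name `KMVSketch`, the SHA-256-based hash and the parameter values are all assumptions of this example. The stream processor keeps only the k smallest hash values (random selection and sizing), and the query processor turns the k-th smallest value into an estimate.

```python
import hashlib
import heapq


class KMVSketch:
    """Illustrative K-Minimum-Values distinct-count sketch (hypothetical class)."""

    def __init__(self, k):
        self.k = k
        self._heap = []        # max-heap (negated values) of the k smallest hashes
        self._members = set()  # the same hashes, for duplicate detection

    @staticmethod
    def _hash(item):
        # Map an item to a pseudo-random float in [0, 1).
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def update(self, item):
        """Stream processor: touch each item once, retain at most k hashes."""
        h = self._hash(item)
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the current k-th smallest: swap it in.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        """Query processor: estimate the number of distinct items seen."""
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct items: exact
        kth_smallest = -self._heap[0]
        return (self.k - 1) / kth_smallest


# 100,000 events from 20,000 hypothetical users.
sketch = KMVSketch(k=1024)
for i in range(100_000):
    sketch.update(f"U{i % 20_000}")

print(f"estimate: {sketch.estimate():.0f} (true distinct count: 20000)")
print(f"hash values retained: {len(sketch._heap)}")
```

The retained state is bounded by k regardless of stream length, and for this estimator the relative standard error shrinks roughly as 1/sqrt(k), so a larger k buys accuracy at the cost of space.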
Major Sketch Properties
• Small Stored Size
• Sub-linear in Space
• Single-pass, “One-Touch”
• Data Insensitive
• Mergeable
• Approximate, Probabilistic
• Mathematically Proven Error Properties
Figure: Sketch Size vs. Stream Size (Linear vs. Sub-linear growth)
Sketches Overlap with Sampling, Based on the Specific Sketch
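Reservoir sampling is one place where sketching and sampling overlap: it is single-pass, has a fixed size k, and needs no knowledge of the stream length. The following is a minimal sketch of the classic uniform reservoir algorithm (often called Algorithm R). The function name, seed and sizes are assumptions of this example, and it is not the library's implementation.

```python
import random


def reservoir_sample(stream, k, rng):
    """Return a uniform random sample of up to k items from a single pass."""
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Fill phase: keep the first k items.
            reservoir.append(item)
        else:
            # Keep the n-th item with probability k / n,
            # replacing a uniformly chosen slot.
            j = rng.randrange(n)
            if j < k:
                reservoir[j] = item
    return reservoir


rng = random.Random(42)  # fixed seed so runs are repeatable
sample = reservoir_sample(range(100_000), k=100, rng=rng)
print(len(sample))  # prints "100"
```

The reservoir never grows beyond k items, so its size is independent of the stream size.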
Win #1: Small Query Space
O(k) size: ~ Kilobytes
Approximate Answer ± ε
Difficult Query
Minimal or no sorting required!
Ideal for Streaming & Batch
Query Engine
Sketch
Col1, ..., Item, ..., Coln… Billions of rows ...
Sketches Start Small
Sublinear Means they Stay Small
Single Pass Simplifies Processing
Win #2: Mergeability
Sketch Approx. ± ε
Sketch
Sketch
… , Item… many rows ...
… , Item… many rows ...
Query
Query
Partitions
Merge
Mergeability Enables Parallelism … With No Additional Loss of Accuracy!
Sketches Transform Non-Additive Metrics Into Additive Objects
The Result of a Sketch Merge is Another Sketch … Enabling Set Expressions for Selected Sketches
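A small Python illustration of mergeability, again using a K-Minimum-Values style sketch as a stand-in (an assumption of this example, not the library's API). Here a sketch is simply the sorted list of the k smallest distinct hash values, and the helper names `hash01`, `build_sketch`, `merge` and `estimate` are hypothetical. Merging the sketches of two overlapping partitions yields exactly the sketch that a single pass over all the data would have produced.

```python
import hashlib
import heapq


def hash01(item):
    # Map an item to a pseudo-random float in [0, 1).
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def build_sketch(items, k):
    # For brevity this materializes all hashes; a streaming builder
    # would retain only the k smallest values as it goes.
    return heapq.nsmallest(k, {hash01(x) for x in items})


def merge(sketch_a, sketch_b, k):
    # The merge of two sketches is another sketch of the same form.
    return heapq.nsmallest(k, set(sketch_a) | set(sketch_b))


def estimate(sketch, k):
    if len(sketch) < k:
        return float(len(sketch))  # fewer than k distinct items: exact
    return (k - 1) / sketch[-1]    # sketch[-1] is the k-th smallest hash


K = 1024
# Two partitions with overlapping hypothetical user IDs (20,000 distinct overall).
partition_1 = [f"U{i}" for i in range(0, 12_000)]
partition_2 = [f"U{i}" for i in range(8_000, 20_000)]

merged = merge(build_sketch(partition_1, K), build_sketch(partition_2, K), K)
direct = build_sketch(partition_1 + partition_2, K)

print(merged == direct)  # prints "True"
print(f"estimate from merged sketch: {estimate(merged, K):.0f}")
```

Because the merged sketch is identical to the single-pass sketch, partitioning adds no error beyond that of the sketch itself, provided both sides use the same k and the same hash function.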
Col1, ..., Item, ..., Coln… Billions of rows ...
Intermediate Hyper-Cube Staging Enables Query Speed
Additivity Enables Simpler Architecture
Win #3: Near-Real Time Query Speed
Win #4: Simpler Architecture
Win #6: Lower System Cost ($)
Case Study: Real-time Flurry, Before and After
• Customers: >250K Mobile App Developers
• Data: 40-50 TB per day
• Platform: 2 clusters X 80 Nodes = 160 Nodes
  – Node: 24 CPUs, 250GB RAM
                   Before Sketches                          After Sketches
VCS* / Mo.         ~80B                                     ~20B
Result Freshness   Daily: 2 to 8 hours; Weekly: ~3 days;    15 seconds!
                   Real-time Unique Counts Not Feasible
Big Wins! Near-Real Time, Lower System $
* VCS: Virtual Core Seconds
The DataSketches Team
Core Team / Committers
• Lee Rhodes, Distinguished Architect, Yahoo/VM*. Started internal DataSketches project 2012
• Alex Saydakov, Systems Developer, Yahoo/VM, joined 2015
• Jon Malkin, Ph.D., Research Engineer, Developer, Yahoo/VM, joined 2016
• Edo Liberty, Ph.D., Founder, HyperCube Technologies. Joined 2015
• Justin Thaler, Ph.D., Assistant Professor, Georgetown University, Computer Science. Joined 2015
• Roman Leventov, Systems Developer for Apache Druid, Metamarkets, joined 2018
• Eshcar Hillel, Ph.D., Sr Scientist, Yahoo/VM, Israel, joined 2018
Extended Team & Consultants
• Graham Cormode, Ph.D., Professor, University of Warwick, Computer Science, joined 2017
• Jelani Nelson, Ph.D., Professor, U.C. Berkeley, joined 2019
• Daniel Ting, Ph.D., Sr Scientist, Tableau / Salesforce, joined 2019
… And our Community is Growing!
* VM = Verizon Media
Our Mission…
Combine Deep Science with Exceptional Engineering
To Develop Production Quality Sketches
That Address These Difficult Queries
Cardinality, 4 Families
• HLL (on/off Heap): A very high performing implementation of this well-known sketch
• CPC: The best accuracy per space
• Theta Sketches: Set Expressions (e.g., Union, Intersection, Difference), on/off Heap
• Tuple Sketches: Generic, Associative Theta Sketches, multiple derived sketches:
Quantiles Sketches, 2 Families
• Quantiles, Histograms, PMF’s and CDF’s of streams of comparable objects, on/off Heap. KLL, highly optimized for accuracy-space.
• Relative Error Quantiles (under development)
Frequent Items (Heavy-Hitters) Sketches, 2 Families
• Frequent Items: Weighted or Unweighted
• Frequent Directions: Approximate SVD (a Vector Sketch)
Sampling: Reservoir and VarOpt (Edith Cohen) Sketches, 2 Families
• Uniform and weighted sampling to fixed-k sized buckets
Specialty Sketches
• Customer Engagement, Frequent Distinct Tuples, Maps, etc.
The Apache DataSketches Library
Languages Supported:
• Java, C++, Python
• Binary Compatibility
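To give a feel for what a Frequent Items (Heavy-Hitters) sketch does, here is a minimal Python version of the classic Misra-Gries algorithm. It is shown only as an illustration of the general idea and is not necessarily the algorithm the library implements. The function name, the test stream and the choice k = 10 are assumptions of this example.

```python
import random


def misra_gries(stream, k):
    """Keep at most k - 1 counters in one pass.

    For a stream of n items, every item occurring more than n / k times
    is guaranteed to remain, and each reported count undercounts the
    true frequency by at most n / k.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Hypothetical stream: two heavy hitters plus 200 items seen once each.
stream = ["a"] * 500 + ["b"] * 300 + [f"x{i}" for i in range(200)]
random.Random(0).shuffle(stream)

counters = misra_gries(stream, k=10)
print(sorted(counters.items(), key=lambda kv: -kv[1])[:2])
```

With n = 1000 and k = 10, both "a" and "b" exceed the n / k = 100 threshold, so both are guaranteed to be reported, with counts that are lower bounds on their true frequencies.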
Bright Future for Sketching Technology & Solutions
Items (words, IDs, events, clicks, …)
• Count Distinct
• Frequent Items, Heavy-Hitters, etc.
• Quantiles, Ranks, PMFs, CDFs, Histograms
• Set Operations
• Sampling
• Mobile (IoT)
• Moment and Entropy Estimation
Graphs (Social Networks, Communications, …)
• Connectivity
• Cut Sparsification
• Weighted Matching
• …
Areas where we have sketch implementations
Areas of research (World-wide)
Vectors (text docs, images, features, …) & Matrices (text corpora, recommendations, …)
• Dimensionality Reduction (SVD)
• Covariance Estimation
• Low Rank Approximation
• Sparsification
• Clustering (k-means, k-median, …)
• Linear Regression
• Machine Learning (in some areas)
• Density Estimation
THANK YOU!
Open Invitation for Collaboration
Learn More About Apache DataSketches
Come And Visit Us!
https://datasketches.apache.org