A Production Quality Sketching - iteblog.com

A Production Quality Sketching Library for the Analysis of Big Data

Lee Rhodes
Distinguished Architect, Verizon Media (Yahoo), Inc.

Outline

Problematic Queries of Big Data
Where traditional analysis methods don’t work well

Approximate Analysis Using Sketches
How using stochastic processes and probabilistic analysis wins in a systems architecture context

The Open Source Apache DataSketches Library
A quick overview of this unique library dedicated to production systems that process big data

The Data Analysis Challenge …

… Analyze This Data in Near-Real Time

Time Stamp   User ID   Device ID   Site    Time Spent (sec)   Items Viewed
9:00 AM      U1        D1          Apps    59                 5
9:30 AM      U2        D2          Apps    179                15
10:00 AM     U3        D3          Music   29                 3
1:00 PM      U1        D4          Music   89                 10

Billions of Rows or Key, Value Pairs …

Example: Web Site Logs

Some Very Common Queries …

Unique Identifiers with Set Expressions:

(A ∪ B) ∩ (C ∪ D) \ E

Quantiles, CDFs

Histograms, PMFs

Frequent Items / Heavy Hitters

Vector & Matrix Operations:

SVD, etc.

Graph Analysis

Reservoir Sampling: Uniform, Weighted

All Non-Additive

When The Data Gets Large Or Resources Are Limited,

All Of These Queries Become Problematic,

Because the Aggregations are Non-Additive.
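To make the non-additivity concrete, here is a minimal sketch using a unique-user count. The two partitions and their user IDs are made up for illustration and are not taken from the talk.

```python
# Two hypothetical partitions of a web log; user U1 appears in both.
partition_a = ["U1", "U2", "U3"]
partition_b = ["U1", "U4"]

# Row counts are additive: partial results can simply be summed.
rows_total = len(partition_a) + len(partition_b)

# Unique counts are not additive: summing the per-partition uniques
# counts U1 twice, so the raw items must be combined to get the answer.
naive_uniques = len(set(partition_a)) + len(set(partition_b))
true_uniques = len(set(partition_a) | set(partition_b))

print(rows_total, naive_uniques, true_uniques)  # 5 5 4
```

The exact answer needs the union of the item sets themselves, which is why exact methods have to keep copies of the items somewhere.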

[Figure: additive aggregation. Result A + Result B = Combined Result; Current Result + New Item = Updated Result.]

[Diagram: Difficult Query → Query Engine → Exact Results]

Input: Col1, ..., Item, ..., Coln … billions of rows …

Local item copies: Ω(u) size, ~ Big Data

Query processing often requires sorting … which is very slow.

Traditional, Exact Analysis Methods Require Local Copies

Note: Micro-batch “Streaming Platforms”, e.g., Storm, Do Not Solve The Fundamental Problem!

Big Data

Parallelization Does Not Help Much Because of Non-Additivity.

You have to keep the copies somewhere!


[Diagram: six item copies → Expensive Shuffle → Σ Exact Results]

Example: Map-Reduce

Every dataset is processed N times for a rolling N-day window!

Traditional, Exact Time Windowing Requires Multiple Touches of Every Item

Let’s challenge a fundamental premise: … that our results must be exact!

If we can allow for approximation, along with some accuracy guarantees,

we can achieve orders-of-magnitude improvements in speed and reductions in resource use.

Introducing the Sketch (a.k.a., Stochastic Streaming Algorithm)

[Diagram: Data Stream → Stream Processor → Data Structure → Query Processor → Result]

Stream Processor: Stochastic Process (Random Selection, Sizing, Storing)

Data Structure: Size = f(k)

Query Processor: Probabilistic Analysis; Results ± ε, where ε = f(1/k)

Merge / Set Operations: Sketch Stream → Result Sketch

Model the Problem as a Stochastic Process; Analyze using Probability & Statistics
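As one concrete instance of the stream processor / data structure / query processor pattern above, here is a minimal K-Minimum-Values (KMV) distinct-count sketch. This is a textbook technique chosen for illustration; `KmvSketch` and its methods are hypothetical names, not the DataSketches API, and the code is not tuned for production use.

```python
import hashlib
import heapq


class KmvSketch:
    """Illustrative K-Minimum-Values distinct-count sketch (hypothetical class)."""

    def __init__(self, k=1024):
        self.k = k
        self._heap = []        # negated hashes: a max-heap over the k smallest values
        self._members = set()  # the same hashes, for duplicate detection

    @staticmethod
    def _hash(item):
        # Map the item to a pseudo-random point in [0, 1): the stochastic process.
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2.0**64

    def update(self, item):
        """Single pass, one touch per item; storage never exceeds k hashes."""
        h = self._hash(item)
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            evicted = -heapq.heapreplace(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def retained(self):
        return len(self._heap)

    def estimate(self):
        """Probabilistic analysis: infer the distinct count from the k-th smallest hash."""
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct items: exact
        kth_smallest = -self._heap[0]
        return (self.k - 1) / kth_smallest


sketch = KmvSketch(k=1024)
for _ in range(2):                 # every user appears twice in the stream
    for i in range(100_000):
        sketch.update(f"user-{i}")
estimate = sketch.estimate()
print(f"retained hashes: {sketch.retained()}, estimate: {estimate:.0f}")
```

The stored size depends only on k, not on the stream length, and for this estimator the relative standard error is roughly 1/sqrt(k), so a larger k buys accuracy at the cost of space.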

How & Why Sketches Achieve Superior Performance

For Systems Processing Massive Data

Major Sketch Properties

• Small Stored Size
• Sub-linear in Space
• Single-pass, “One-Touch”
• Data Insensitive
• Mergeable
• Approximate, Probabilistic
• Mathematically Proven Error Properties

[Plot: Sketch Size vs. Stream Size, comparing Linear and Sub-linear growth]

[Venn diagram: Sketching and Sampling]

Sketches Overlap with Sampling, Based on the Specific Sketch

Win #1: Small Query Space

[Diagram: Difficult Query → Query Engine → Approximate Answer ± ε]

Input: Col1, ..., Item, ..., Coln … billions of rows …

Sketch: O(k) size, ~ Kilobytes

Minimal or no sorting required!

Ideal for Streaming & Batch

Sketches Start Small
Sublinear Means They Stay Small
Single Pass Simplifies Processing

Win #2: Mergeability

[Diagram: each partition ( … , Item … many rows … ) is queried into its own Sketch; the partition sketches are merged into a single Sketch, Approx. ± ε]

Mergeability Enables Parallelism … With No Additional Loss of Accuracy!
Sketches Transform Non-Additive Metrics Into Additive Objects
The Result of a Sketch Merge is Another Sketch … Enabling Set Expressions for Selected Sketches
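The mergeability claim can be checked on a small K-Minimum-Values sketch, where a sketch is just the k smallest hash values of a partition. This is an illustrative stand-in, not the library's implementation; `build_sketch`, `merge`, and `estimate` are hypothetical helper names, and `build_sketch` hashes a whole list at once for brevity rather than streaming.

```python
import hashlib

K = 512  # sketch size parameter (assumed value for this example)


def _hash(item):
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2.0**64


def build_sketch(items, k=K):
    """A sketch here is the sorted list of the k smallest distinct hashes."""
    return sorted({_hash(x) for x in items})[:k]


def merge(a, b, k=K):
    """Merging two sketches yields another sketch of the same form."""
    return sorted(set(a) | set(b))[:k]


def estimate(sketch, k=K):
    return float(len(sketch)) if len(sketch) < k else (k - 1) / sketch[-1]


# Two overlapping partitions: 100,000 distinct users in total.
part_a = [f"user-{i}" for i in range(60_000)]
part_b = [f"user-{i}" for i in range(40_000, 100_000)]

merged = merge(build_sketch(part_a), build_sketch(part_b))
direct = build_sketch(part_a + part_b)

# The merged sketch is identical to one built over all the data at once,
# so merging adds no error beyond that of a single sketch.
print(merged == direct, round(estimate(merged)))
```

Because the merge result is itself a sketch, partitions can be sketched in parallel and combined in any order.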


Win #3: Near-Real Time Query Speed
Win #4: Simpler Architecture

Intermediate Hyper-Cube Staging Enables Query Speed
Additivity Enables Simpler Architecture

Win #5: Simplified Time Windowing
Plus Late Data Processing

Near-Real time Results, with History!
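A minimal sketch of how mergeable sketches simplify a rolling window, assuming one small sketch is kept per day and a window query merges the stored sketches instead of re-reading raw rows. The K-Minimum-Values helpers, the `daily` store, and the day numbering are all hypothetical and only illustrate the idea.

```python
import bisect
import hashlib

K = 256  # sketch size parameter (assumed value for this example)


def _hash(item):
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2.0**64


def update(sketch, item, k=K):
    """Insert one item into a sketch held as a sorted list of its k smallest hashes."""
    h = _hash(item)
    if h not in sketch and (len(sketch) < k or h < sketch[-1]):
        bisect.insort(sketch, h)
        del sketch[k:]


def merge(sketches, k=K):
    return sorted(set().union(*sketches))[:k]


def estimate(sketch, k=K):
    return float(len(sketch)) if len(sketch) < k else (k - 1) / sketch[-1]


daily = {}  # day number -> sketch; every raw event is touched exactly once


def ingest(day, user_id):
    update(daily.setdefault(day, []), user_id)


def rolling_uniques(last_day, n_days):
    """Approximate unique users over an n-day window by merging stored daily sketches."""
    window = [daily[d] for d in range(last_day - n_days + 1, last_day + 1) if d in daily]
    return estimate(merge(window))


# Seven days of traffic; consecutive days share users (9,000 distinct overall).
for day in range(1, 8):
    for u in range(day * 1000, day * 1000 + 3000):
        ingest(day, f"U{u}")

# Late data only updates the sketch of the day it belongs to.
ingest(3, "U-late-arrival")

print(round(rolling_uniques(7, 7)), round(rolling_uniques(7, 1)))
```

Each window query merges at most N small sketches, so no raw dataset has to be reprocessed N times.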

Win #6: Lower System Cost ($)
Case Study: Real-time Flurry, Before and After

• Customers: >250K Mobile App Developers
• Data: 40-50 TB per day
• Platform: 2 clusters × 80 Nodes = 160 Nodes
  – Node: 24 CPUs, 250GB RAM

                   Before Sketches                         After Sketches
VCS* / Mo.         ~80B                                    ~20B
Result Freshness   Daily: 2 to 8 hours; Weekly: ~3 days;   15 seconds!
                   Real-time Unique Counts Not Feasible

Big Wins! Near-Real Time, Lower System $

* VCS: Virtual Core Seconds

Introducing

The DataSketches Team

Core Team / Committers
• Lee Rhodes, Distinguished Architect, Yahoo/VM*, started internal DataSketches project 2012
• Alex Saydakov, Systems Developer, Yahoo/VM, joined 2015
• Jon Malkin, Ph.D., Research Engineer, Developer, Yahoo/VM, joined 2016
• Edo Liberty, Ph.D., Founder, HyperCube Technologies, joined 2015
• Justin Thaler, Ph.D., Assistant Professor, Georgetown University, Computer Science, joined 2015
• Roman Leventov, Systems Developer for Apache Druid, Metamarkets, joined 2018
• Eshcar Hillel, Ph.D., Sr Scientist, Yahoo/VM, Israel, joined 2018

Extended Team & Consultants
• Graham Cormode, Ph.D., Professor, University of Warwick, Computer Science, joined 2017
• Jelani Nelson, Ph.D., Professor, U.C. Berkeley, joined 2019
• Daniel Ting, Ph.D., Sr Scientist, Tableau / Salesforce, joined 2019

… And our Community is Growing!

* VM = Verizon Media

Our Mission…

Combine Deep Science with Exceptional Engineering

To Develop Production Quality Sketches

That Address These Difficult Queries

Cardinality, 4 Families
• HLL (on/off Heap): A very high performing implementation of this well-known sketch
• CPC: The best accuracy per space
• Theta Sketches: Set Expressions (e.g., Union, Intersection, Difference), on/off Heap
• Tuple Sketches: Generic, Associative Theta Sketches, multiple derived sketches:

Quantiles Sketches, 2 Families
• Quantiles, Histograms, PMFs and CDFs of streams of comparable objects, on/off Heap. KLL, highly optimized for accuracy-space.
• Relative Error Quantiles (under development)

Frequent Items (Heavy-Hitters) Sketches, 2 Families
• Frequent Items: Weighted or Unweighted
• Frequent Directions: Approximate SVD (a Vector Sketch)

Sampling: Reservoir and VarOpt (Edith Cohen) Sketches, 2 Families
• Uniform and weighted sampling to fixed-k sized buckets

Specialty Sketches
• Customer Engagement, Frequent Distinct Tuples, Maps, etc.
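To give a feel for how a frequent-items (heavy-hitters) sketch can work in bounded space, here is a minimal version of the classic Misra-Gries algorithm. It is shown as a generic illustration of the technique; it is not claimed to be the algorithm the library uses, and `misra_gries` and the sample stream are made up for this example.

```python
import random


def misra_gries(stream, k):
    """Classic Misra-Gries summary using at most k - 1 counters.

    For a stream of n items, every item occurring more than n / k times is
    guaranteed to survive, and each reported count undercounts the true
    frequency by at most n / k.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all of them and drop the zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Hypothetical stream of 100 site visits: two heavy hitters plus a long tail.
stream = ["Apps"] * 50 + ["Music"] * 30 + [f"site-{i}" for i in range(20)]
random.Random(1).shuffle(stream)

result = misra_gries(stream, k=5)
print(result)
```

With n = 100 and k = 5, any site visited more than 20 times must appear in the result, whatever the order of the stream.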

The Apache DataSketches Library

Languages Supported:
• Java, C++, Python
• Binary Compatibility
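The uniform sampling family listed among the sketches above keeps a fixed-size sample of k items from a stream of unknown length. A minimal sketch of the standard technique (Algorithm R) follows; `reservoir_sample` is a hypothetical helper written for this example, not a library call.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Uniform sample of up to k items from a stream, in a single pass (Algorithm R)."""
    rng = rng or random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randrange(n)        # uniform integer in [0, n)
            if j < k:                   # happens with probability k / n
                reservoir[j] = item     # replace a random slot
    return reservoir


# Sample 10 user IDs from a stream of 10,000 without storing the stream.
sample = reservoir_sample((f"U{i}" for i in range(10_000)), k=10, rng=random.Random(7))
print(sample)
```

After n items have been seen, each of them is in the reservoir with probability k / n, and the memory used never exceeds k items.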

Bright Future for Sketching Technology & Solutions

Legend:
• Areas where we have sketch implementations
• Areas of research (World-wide)

Items (words, IDs, events, clicks, …)

• Count Distinct
• Frequent Items, Heavy-Hitters, etc.
• Quantiles, Ranks, PMFs, CDFs, Histograms
• Set Operations
• Sampling

• Mobile (IoT)
• Moment and Entropy Estimation

Graphs (Social Networks, Communications, …)

• Connectivity
• Cut Sparsification
• Weighted Matching
• …

Vectors (text docs, images, features, …) & Matrices (text corpora, recommendations, …)

• Dimensionality Reduction (SVD)
• Covariance Estimation

• Low Rank Approximation
• Sparsification
• Clustering (k-means, k-median, …)
• Linear Regression

• Machine Learning (in some areas)
• Density Estimation

THANK YOU!

Open Invitation for Collaboration

Learn More About Apache DataSketches
Come And Visit Us!

https://datasketches.apache.org

