
A Production Quality Sketching - iteblog.com

Jan 23, 2022

Page 1: A Production Quality Sketching - iteblog.com
Page 2: A Production Quality Sketching - iteblog.com

A Production Quality Sketching Library for the Analysis of Big Data

Lee Rhodes

Distinguished Architect, Verizon Media (Yahoo), Inc.

Page 3: A Production Quality Sketching - iteblog.com

Outline

Problematic Queries of Big Data
Where traditional analysis methods don’t work well

Approximate Analysis Using Sketches
How using stochastic processes and probabilistic analysis wins in a systems architecture context

The Open Source Apache DataSketches Library
A quick overview of this unique library dedicated to production systems that process big data

Page 4: A Production Quality Sketching - iteblog.com

The Data Analysis Challenge …

… Analyze This Data in Near-Real Time

Time Stamp   User ID   Device ID   Site    Time Spent (Sec)   Items Viewed
9:00 AM      U1        D1          Apps    59                 5
9:30 AM      U2        D2          Apps    179                15
10:00 AM     U3        D3          Music   29                 3
1:00 PM      U1        D4          Music   89                 10

Billions of Rows or Key, Value Pairs …

Example: Web Site Logs

Page 5: A Production Quality Sketching - iteblog.com

Some Very Common Queries …

Unique Identifiers with Set Expressions:

(A ∪ B) ∩ (C ∪ D) \ E

Quantiles, CDFs

Histograms, PMFs

Frequent Items / Heavy Hitters

Vector & Matrix Operations:

SVD, etc.

Graph Analysis

Uniform

Weighted

Reservoir Sampling

5 ⋯ 2
⋮ ⋱ ⋮
4 ⋯ 3

All Non-Additive

Page 6: A Production Quality Sketching - iteblog.com

When The Data Gets Large Or Resources Are Limited,

All Of These Queries Become Problematic,

Because the Aggregations are Non-Additive.

[Diagram labels: A → Result A; B → Result B; Result A + Result B → Combined Result; Current Result + New Item → Updated Result]
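
To make the non-additivity concrete, here is a small illustration using the (Site, User ID) pairs from the web-log example earlier in the deck. The variable and function names are hypothetical. Row counts from two partitions can simply be summed, but exact distinct-user counts cannot, because a user may appear in both partitions.

```python
# Toy (site, user) rows taken from the web-log example table.
partition_a = [("Apps", "U1"), ("Apps", "U2")]
partition_b = [("Music", "U3"), ("Music", "U1")]


def distinct_users(rows):
    """Exact distinct count: needs a local copy of every user ID seen."""
    return len({user for _, user in rows})


# Row counts are additive: partial results can be summed.
total_rows = len(partition_a) + len(partition_b)

# Distinct counts are not additive: U1 appears in both partitions,
# so summing the partial results over-counts.
sum_of_partials = distinct_users(partition_a) + distinct_users(partition_b)
true_distinct = distinct_users(partition_a + partition_b)

print(total_rows, sum_of_partials, true_distinct)
```

Getting the correct distinct count requires access to the underlying IDs from both partitions, which is the "local copies" cost that the following slides describe.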

Page 7: A Production Quality Sketching - iteblog.com

Col1, ..., Item, ..., Coln… Billions of rows ...

Ω(u) size: ~ Big Data

Exact Results

Difficult

Query

Local Item Copies

Query processing often requires sorting…which is very slow.

Query Engine

Traditional, Exact Analysis Methods Require Local Copies

Note: Micro-batch “Streaming Platforms”, e.g., Storm, Do Not Solve The Fundamental Problem!

Big Data

Page 8: A Production Quality Sketching - iteblog.com

Parallelization Does Not Help Much Because of Non-Additivity.

You have to keep the copies somewhere!

Col1, ..., Item, ..., Coln… Billions of rows ...

𝝨 Exact Results

Expensive Shuffle

Copy (×6)

Example: Map-Reduce

Page 9: A Production Quality Sketching - iteblog.com

Every dataset is processed N times for a rolling N-day window!

Traditional, Exact Time Windowing Requires Multiple Touches of Every Item

Page 10: A Production Quality Sketching - iteblog.com

Let’s challenge a fundamental premise: … that our results must be exact!

If we can allow for approximation, along with some accuracy guarantees,

we can achieve orders-of-magnitude improvement in
• speed and
• reduction of resources.

Page 11: A Production Quality Sketching - iteblog.com

Introducing the Sketch (a.k.a., Stochastic Streaming Algorithm)

[Diagram labels]
Data Stream
Stream Processor
Stochastic Process: Random Selection, Sizing, Storing
Data Structure: Size = f(k)
Sketch Stream
Merge / Set Operations
Result Sketch
Query
Query Processor
Probabilistic Analysis
Results ± ε, where ε = f(1/k)

Model the Problem as a Stochastic Process
Analyze using Probability & Statistics
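
As a rough illustration of the idea above, the following is a minimal K-Minimum-Values (KMV) distinct-count sketch in plain Python. It is not the DataSketches API; the class name `KMVSketch`, the SHA-256 hash, and `k=1024` are assumptions chosen for the example. The stream processor hashes each item to a pseudo-random point in [0, 1) and keeps only the k smallest points (size = f(k)); the query processor estimates the distinct count from the k-th smallest point. For this particular sketch the relative error shrinks roughly as 1/sqrt(k).

```python
import hashlib
import heapq


class KMVSketch:
    """K-Minimum-Values distinct-count sketch (illustrative only)."""

    def __init__(self, k):
        self.k = k
        self._heap = []        # max-heap (negated values) of retained hashes
        self._members = set()  # same hashes, for duplicate detection

    @staticmethod
    def _hash(item):
        # Map the item to a pseudo-uniform float in [0, 1).
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2.0**64

    def update(self, item):
        h = self._hash(item)
        if h in self._members:
            return  # duplicates never change the sketch
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New value is smaller than the current k-th minimum: swap it in.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct: exact
        kth_min = -self._heap[0]
        return (self.k - 1) / kth_min


sketch = KMVSketch(k=1024)
for i in range(200_000):
    sketch.update(f"user-{i % 50_000}")  # 50,000 distinct IDs, 4 visits each
print(round(sketch.estimate()))
```

The sketch never stores more than k hash values regardless of stream length, and each item is touched exactly once.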

Page 12: A Production Quality Sketching - iteblog.com

How & Why Sketches Achieve Superior Performance

For Systems Processing Massive Data

Page 13: A Production Quality Sketching - iteblog.com

Major Sketch Properties

• Small Stored Size
• Sub-linear in Space
• Single-pass, “One-Touch”
• Data Insensitive
• Mergeable
• Approximate, Probabilistic
• Mathematically Proven Error Properties

[Plot: Sketch Size versus Stream Size, with Linear and Sub-linear curves]

[Diagram: Sketching and Sampling as overlapping sets]

Sketches Overlap with Sampling Based on the Specific Sketch

Page 14: A Production Quality Sketching - iteblog.com

Win #1: Small Query Space

O(k) size: ~ Kilobytes

Approximate Answer ± ε

Difficult

Query

Minimal or no sorting required!

Ideal for Streaming & Batch

Query Engine

Sketch

Col1, ..., Item, ..., Coln … Billions of rows ...

Sketches Start Small
Sublinear Means they Stay Small
Single Pass Simplifies Processing

Page 15: A Production Quality Sketching - iteblog.com

Win #2: Mergeability

Sketch Approx. ± ε

Sketch

Sketch

… , Item… many rows ...

… , Item… many rows ...

Query

Query

Partitions

Merge

Mergeability Enables Parallelism … With No Additional Loss of Accuracy!
Sketches Transform Non-Additive Metrics Into Additive Objects
The Result of a Sketch Merge is Another Sketch … Enabling Set Expressions for Selected Sketches
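
The mergeability claim can be illustrated with a toy K-Minimum-Values sketch written as plain functions. This is a sketch under stated assumptions, not the library's Theta sketch: the names `build`, `merge`, `estimate`, the SHA-256 hash, and `K = 512` are all hypothetical. Two partitions with overlapping users are sketched separately and then merged; the merged sketch is identical to the one built from the whole stream, so merging adds no error of its own.

```python
import hashlib

K = 512  # hypothetical sketch size parameter


def h(item):
    """Hash an item to a pseudo-uniform float in [0, 1)."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2.0**64


def build(items, k=K):
    """A sketch is the sorted list of the k smallest distinct hash values."""
    return sorted({h(x) for x in items})[:k]


def merge(a, b, k=K):
    """Merging two sketches yields another sketch of the combined stream."""
    return sorted(set(a) | set(b))[:k]


def estimate(sketch, k=K):
    if len(sketch) < k:
        return float(len(sketch))  # fewer than k distinct items: exact
    return (k - 1) / sketch[-1]


# Two partitions that share 10,000 users (IDs 20,000 to 29,999).
part1 = build(f"user-{i}" for i in range(0, 30_000))
part2 = build(f"user-{i}" for i in range(20_000, 50_000))

merged = merge(part1, part2)
whole = build(f"user-{i}" for i in range(0, 50_000))

print(merged == whole, round(estimate(merged)))
```

Because the output of `merge` has the same form as its inputs, partial sketches can be combined in any order and across any number of partitions.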

Col1, ..., Item, ..., Coln… Billions of rows ...

Page 16: A Production Quality Sketching - iteblog.com

Intermediate Hyper-Cube Staging Enables Query Speed
Additivity Enables Simpler Architecture

Win #3: Near-Real Time Query Speed
Win #4: Simpler Architecture

Page 17: A Production Quality Sketching - iteblog.com

Win #5: Simplified Time Windowing
Plus Late Data Processing

Page 18: A Production Quality Sketching - iteblog.com

Near-Real time Results, with History!

Page 19: A Production Quality Sketching - iteblog.com

Win #6: Lower System Cost ($)
Case Study: Real-time Flurry, Before and After

• Customers: >250K Mobile App Developers
• Data: 40-50 TB per day
• Platform: 2 clusters × 80 Nodes = 160 Nodes
  – Node: 24 CPUs, 250GB RAM

                   Before Sketches                          After Sketches
VCS* / Mo.         ~80B                                     ~20B
Result Freshness   Daily: 2 to 8 hours; Weekly: ~3 days;    15 seconds!
                   Real-time Unique Counts Not Feasible

Big Wins!
Near-Real Time
Lower System $

* VCS: Virtual Core Seconds

Page 20: A Production Quality Sketching - iteblog.com

Introducing

Page 21: A Production Quality Sketching - iteblog.com

The DataSketches Team

Core Team / Committers
• Lee Rhodes, Distinguished Architect, Yahoo/VM*. Started internal DataSketches project 2012
• Alex Saydakov, Systems Developer, Yahoo/VM, joined 2015
• Jon Malkin, Ph.D., Research Engineer, Developer, Yahoo/VM, joined 2016
• Edo Liberty, Ph.D., Founder, HyperCube Technologies. Joined 2015
• Justin Thaler, Ph.D., Assistant Professor, Georgetown University, Computer Science. Joined 2015
• Roman Leventov, Systems Developer for Apache Druid, Metamarkets, joined 2018
• Eshcar Hillel, Ph.D., Sr Scientist, Yahoo/VM, Israel, joined 2018

Extended Team & Consultants
• Graham Cormode, Ph.D., Professor, University of Warwick, Computer Science, joined 2017
• Jelani Nelson, Ph.D., Professor, U.C. Berkeley, joined 2019
• Daniel Ting, Ph.D., Sr Scientist, Tableau / Salesforce, joined 2019

… And our Community is Growing!

* VM = Verizon Media

Page 22: A Production Quality Sketching - iteblog.com

Our Mission…

Combine Deep Science with Exceptional Engineering

To Develop Production Quality Sketches

That Address These Difficult Queries

Page 23: A Production Quality Sketching - iteblog.com

Cardinality, 4 Families
• HLL (on/off Heap): A very high performing implementation of this well-known sketch
• CPC: The best accuracy per space
• Theta Sketches: Set Expressions (e.g., Union, Intersection, Difference), on/off Heap
• Tuple Sketches: Generic, Associative Theta Sketches, multiple derived sketches:

Quantiles Sketches, 2 Families
• Quantiles, Histograms, PMFs and CDFs of streams of comparable objects, on/off Heap. KLL, highly optimized for accuracy-space.
• Relative Error Quantiles (under development)

Frequent Items (Heavy-Hitters) Sketches, 2 Families
• Frequent Items: Weighted or Unweighted
• Frequent Directions: Approximate SVD (a Vector Sketch)
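
For intuition about how a heavy-hitters sketch can work in bounded space, here is the classic Misra-Gries summary in plain Python. It is offered as a textbook example only and is not claimed to be the algorithm or API used by the library; the function name `misra_gries` and the toy stream are assumptions. With at most k - 1 counters, any item occurring more than n/k times in a stream of length n is guaranteed to survive, and each retained count underestimates the true count by at most n/k.

```python
def misra_gries(stream, k):
    """Return candidate heavy hitters using at most k - 1 counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping any that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Toy stream of n = 100 site visits: two heavy sites plus ten rare ones.
stream = ["Apps"] * 60 + ["Music"] * 30 + [f"site-{i}" for i in range(10)]
summary = misra_gries(stream, k=4)
print(summary)
```

Here n/k = 25, so both "Apps" (60 visits) and "Music" (30 visits) must be retained, while the ten single-visit sites compete for the one remaining counter.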

Sampling: Reservoir and VarOpt (Edith Cohen) Sketches, 2 Families
• Uniform and weighted sampling to fixed-k sized buckets
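
The uniform case can be sketched with the textbook reservoir sampling procedure (often called Algorithm R). This is a minimal illustration, not the library's implementation; the function name `reservoir_sample`, the optional `rng` parameter, and the seed value are assumptions. It keeps a fixed-size bucket of k items in a single pass, and every item of the stream ends up in the bucket with equal probability k/n.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Uniform sample of k items from a stream of unknown length."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)  # fill the bucket first
        else:
            j = rng.randrange(n)    # uniform integer in [0, n)
            if j < k:
                reservoir[j] = item  # replace with probability k / n
    return reservoir


# Seeded generator so that repeated runs draw the same sample.
sample = reservoir_sample(range(100_000), k=10, rng=random.Random(42))
print(sample)
```

Weighted sampling such as VarOpt needs a more involved replacement rule and is not covered by this example.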

Specialty Sketches
• Customer Engagement, Frequent Distinct Tuples, Maps, etc.

The Apache DataSketches Library

Languages Supported:
• Java, C++, Python
• Binary Compatibility

Page 24: A Production Quality Sketching - iteblog.com

Bright Future for Sketching Technology & Solutions

Legend:
Areas where we have sketch implementations
Areas of research (World-wide)

Items (words, IDs, events, clicks, …)
• Count Distinct
• Frequent Items, Heavy-Hitters, etc.
• Quantiles, Ranks, PMFs, CDFs, Histograms
• Set Operations
• Sampling

• Mobile (IoT)
• Moment and Entropy Estimation

Graphs (Social Networks, Communications, …)
• Connectivity
• Cut Sparsification
• Weighted Matching
• …

Vectors (text docs, images, features, …) & Matrices (text corpora, recommendations, …)
• Dimensionality Reduction (SVD)
• Covariance Estimation

• Low Rank Approximation
• Sparsification
• Clustering (k-means, k-median, …)
• Linear Regression

• Machine Learning (in some areas)
• Density Estimation

Page 25: A Production Quality Sketching - iteblog.com

THANK YOU!

Open Invitation for Collaboration

Learn More About Apache DataSketches
Come And Visit Us!

https://datasketches.apache.org