PROBABILISTIC DATA STRUCTURES IN REAL LIFE Valentin Bazarevsky

Feb 16, 2017


Transcript
Page 1: Probabilistic data structures in real life

PROBABILISTIC DATA STRUCTURES IN REAL LIFE
Valentin Bazarevsky

Page 2: Probabilistic data structures in real life

WHO ARE THEY?

Bloom Filter
LogLog Family
MinHash

Page 3: Probabilistic data structures in real life

BUSINESS CASE: ESTIMATE YOUR AUDIENCE

Page 4: Probabilistic data structures in real life

SEGMENT BUILDER

15 TB of transactional data
4h SLA

Page 5: Probabilistic data structures in real life

POSSIBLE SOLUTIONS

Brute force (15 TB of transactional data)
Sampling (1% of users => 1.2 MB / b.o.)
Magic tool (?!)

Estimator
HyperLogLog can estimate the cardinality of sets with more than 1 000 000 000 unique elements with 1% error, and requires only 4 KB of memory

50 000 000 basic operations

Page 6: Probabilistic data structures in real life

OOPS…

Supports only Unions

But we need Intersections, Subtractions, Not operators

Page 7: Probabilistic data structures in real life

HYPERLOGLOG INTUITION

00101010101010001111010101101 => a[2] = 0
10010101010100101010101001011 => a[9] = 1
00000101010100101010101110101 => a[0] = 1
01010101010100100101010101010 => a[5] = 1

01010000000000000000000000010 => a[5] = 23
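The register updates on this slide can be replayed in a few lines of Python. This is a minimal sketch, not the author's implementation: it assumes the first 4 bits of each 29-bit hash select the register and that each register keeps the longest run of leading zeros seen in the remaining bits, which is what the slide's numbers suggest. The names `leading_zeros` and `update_registers` are hypothetical.

```python
INDEX_BITS = 4  # assumption read off the slide: first 4 bits pick one of 16 registers

def leading_zeros(bits):
    """Count the zeros that precede the first '1' in a bit string."""
    return len(bits) - len(bits.lstrip("0"))

def update_registers(registers, hash_bits):
    """Keep, per register, the longest run of leading zeros seen so far."""
    index = int(hash_bits[:INDEX_BITS], 2)
    rank = leading_zeros(hash_bits[INDEX_BITS:])
    registers[index] = max(registers[index], rank)

registers = [0] * (1 << INDEX_BITS)

# The five hashes shown on the slide, in order.
slide_hashes = [
    "00101010101010001111010101101",
    "10010101010100101010101001011",
    "00000101010100101010101110101",
    "01010101010100100101010101010",
    "01010000000000000000000000010",
]
for h in slide_hashes:
    update_registers(registers, h)
```

A long run of zeros is rare, so a register holding 23 hints that many distinct hashes have landed in it. Merging two such register arrays with an element-wise maximum gives the registers of the union, which is why unions are the one set operation this structure supports directly.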

Page 8: Probabilistic data structures in real life

INCLUSION-EXCLUSION PRINCIPLE
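The slide body did not survive extraction. For two sets the principle reads |A ∩ B| = |A| + |B| - |A ∪ B|, which turns a union-only estimator into an intersection estimator. Below is a minimal sketch, assuming exact Python sets stand in for the cardinality estimates; `intersection_size` is a hypothetical helper name.

```python
def intersection_size(card_a, card_b, card_union):
    """Two-set inclusion-exclusion: |A ∩ B| = |A| + |B| - |A ∪ B|."""
    return card_a + card_b - card_union

# Exact sets stand in for estimators here; with HyperLogLog each of the
# three cardinalities would be an estimate carrying its own error.
a = set(range(0, 1000))
b = set(range(600, 1500))

estimate = intersection_size(len(a), len(b), len(a | b))
```

With real estimators the three errors add up, so the result can be unreliable when the intersection is small compared with the sets themselves.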

Page 9: Probabilistic data structures in real life

MINHASH

Store only the x (8192) smallest hashes of the set
Jaccard Distance
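One common way to realise "keep only the smallest hashes" is a bottom-k sketch. The code below is a minimal illustration under that assumption, not the author's code: the SHA-1-based `hash64`, the helper names and the sample user IDs are all hypothetical. It estimates Jaccard similarity; the Jaccard distance is one minus that value.

```python
import hashlib

K = 8192  # sketch size taken from the slide

def hash64(item):
    """Map a string to a 64-bit integer (assumed hash; any uniform hash works)."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def sketch(items, k=K):
    """Bottom-k sketch: the k smallest distinct hash values of the set."""
    return sorted({hash64(x) for x in items})[:k]

def jaccard_similarity(sk_a, sk_b, k=K):
    """Estimate |A ∩ B| / |A ∪ B| from two bottom-k sketches."""
    union_bottom = sorted(set(sk_a) | set(sk_b))[:k]
    if not union_bottom:
        return 0.0
    in_both = set(sk_a) & set(sk_b)
    shared = sum(1 for h in union_bottom if h in in_both)
    return shared / len(union_bottom)

def jaccard_distance(sk_a, sk_b, k=K):
    """Jaccard distance is the complement of the similarity."""
    return 1.0 - jaccard_similarity(sk_a, sk_b, k)

a = [f"user{i}" for i in range(1000)]
b = [f"user{i}" for i in range(500, 1500)]
similarity = jaccard_similarity(sketch(a), sketch(b))
```

In this toy run both sets are smaller than K, so the sketches hold every hash and the result matches the true similarity (barring hash collisions). The approximation only starts once a set has more than K elements.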

Page 10: Probabilistic data structures in real life

UNION OF INTERSECTIONS

A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
A - B - C = A - (B ∪ C)
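The two rewrites on this slide are the standard set identities A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C) and A - B - C = A - (B ∪ C). A quick check with exact Python sets, using made-up sample values, shows both sides agree:

```python
# Hypothetical sample sets, chosen only to exercise the identities.
a = {1, 2, 3, 4, 5, 6}
b = {2, 3, 7}
c = {5, 8}

# Distributive law: an intersection with a union becomes a union of intersections.
distributive_lhs = a & (b | c)
distributive_rhs = (a & b) | (a & c)

# Chained subtraction collapses into a single subtraction of a union.
subtraction_lhs = a - b - c
subtraction_rhs = a - (b | c)
```

Normalising a query this way leaves pairwise intersections and unions as the only building blocks, which is presumably what makes it tractable for sketch-based estimators.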

Page 11: Probabilistic data structures in real life

NOT OPERATOR

Subtraction

Page 12: Probabilistic data structures in real life

I WANT EVERYONE EXCEPT…

A and not B
Not A and not B

Page 13: Probabilistic data structures in real life

CORNER CASES

|(A not(B)) C| => |A C|
|A ∪ not(B)| = |Everything| - |B| + |A ∩ B|
|A ∩ not(B)| => |A| - |A ∩ B|
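Two of these identities can be checked arithmetically: |A ∪ not(B)| = |Everything| - |B| + |A ∩ B| and |A ∩ not(B)| = |A| - |A ∩ B|. The sketch below verifies them on exact Python sets. The universe of 100 IDs and the ranges for A and B are made-up sample data.

```python
# Hypothetical universe and segments, used only to check the identities.
everything = set(range(100))
a = set(range(0, 40))
b = set(range(30, 70))
not_b = everything - b

# |A ∪ not(B)| computed directly and via the identity.
union_direct = len(a | not_b)
union_formula = len(everything) - len(b) + len(a & b)

# |A ∩ not(B)| computed directly and via the identity.
minus_direct = len(a & not_b)
minus_formula = len(a) - len(a & b)
```

Both right-hand sides need only cardinalities of plain segments and of one intersection, so the complement itself never has to be materialised.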

Page 14: Probabilistic data structures in real life

ARCHITECTURE

Page 15: Probabilistic data structures in real life

ERROR RATE

Median = 5%
Percentile 75 = 8%

Page 16: Probabilistic data structures in real life