Multidimensional probabilistic real-time analytics at Scale VALENTIN BAZAREVSKY

Realtime analytics

Apr 12, 2017


Data & Analytics




Questions to audience
  HLL
  MinHash
  Uniform distribution
  Inclusion-exclusion principle
  Bitmap

Web Analytics Questions
  How big is your audience?
  From where?
  How active?
  Gender?
  What browsers / devices?
  How similar are audiences?
  Who is the most similar to your audience?
  What dynamics?

Advanced Web Analytics Questions
  What characteristics will my audience have if I build it by a particular rule?
  If a KPI can be described by a given rule, give me the audience that fits it better than others

Numbers
  2B cookie profiles
  50k segments
  35B cookie-segment pairs
  150M transaction predicate sets
  15 TB of transactional data
  50k requests per second

Segment size    Segments
> 1k            6k
> 10k           6k
> 100k          6k
> 1M            6k
> 10M           6k
> 100M          2k
> 1B            25

Estimation PIPELINE

  HyperLogLogs
  MinHashes
  1% bitmaps
  1% and 0.01% samples as sets

Probabilistic data structures landscape
  HLL, zipped, 2% error: 400 B
  MinHash: 32 KB
  1% bitmap: 2-5 MB
  1% sets: depends on size (in our case up to 150 MB, a rare case)

HyperLogLog intuition
  Allows estimating the number of unique users in a set

  Probability that a hash has 0 in the first position: 50%
  Two zeros in a row: 25%
  Three: 12.5%
  Etc.

  What can you say about the set if you know that the longest run of zeros was 10?
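The question above can be explored with a small experiment. This is only a sketch of the intuition, not the estimator used in the talk: it assumes SHA-1 truncated to 32 bits as a uniform hash, and the names `leading_zeros` and `max_zero_run` are made up for illustration. Since a run of k leading zeros appears with probability 2**-k, a longest run of 10 hints at a set of roughly 2**10 (about a thousand) distinct items.

```python
import hashlib

HASH_BITS = 32  # assumption: 32 bits of SHA-1 are uniform enough for a demo


def leading_zeros(item: str) -> int:
    """Count leading zero bits in a 32-bit hash of `item`."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    h = int.from_bytes(digest[:4], "big")
    return HASH_BITS - h.bit_length()


def max_zero_run(items) -> int:
    """Longest run of leading zeros seen across all items."""
    return max(leading_zeros(x) for x in items)


if __name__ == "__main__":
    # Duplicates never change the maximum, so only distinct items matter.
    for n in (100, 10_000, 100_000):
        k = max_zero_run(f"cookie-{i}" for i in range(n))
        print(f"n={n}: longest run={k}, rough guess 2**{k}={2 ** k}")
```

A single maximum is a very noisy estimate, which is why HLL splits the items into many buckets and averages over them.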

HLL intuition pt. 2
  0011001010100
  1010010010100
  1101101010100
  1100111010100
  0111000010100
  0101001010100
  0001000000100

Set operations on HLL
  Union
  Intersection
  Subtraction
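A minimal HyperLogLog sketch may help make these operations concrete. It is a textbook-style illustration, not the implementation behind the talk: the class name `HLL`, the SHA-1 based 64-bit hash, and the default of 2**11 registers are all assumptions. With 2048 registers the standard error is about 1.04 / sqrt(2048), roughly 2.3%. Union is native (register-wise maximum); intersection and subtraction come out only as numbers, derived from cardinalities.

```python
import hashlib
import math


class HLL:
    """Minimal HyperLogLog with 2**p registers (illustrative only)."""

    def __init__(self, p: int = 11):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        h = int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")
        idx = h >> (64 - self.p)                  # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def union(self, other: "HLL") -> "HLL":
        """Native union: register-wise maximum yields another HLL."""
        if self.p != other.p:
            raise ValueError("register counts differ")
        out = HLL(self.p)
        out.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return out

    def cardinality(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)     # bias constant, valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # small-range correction
        return raw


def intersection_estimate(a: HLL, b: HLL) -> float:
    """|A and B| = |A| + |B| - |A or B|: a number, not an HLL."""
    return a.cardinality() + b.cardinality() - a.union(b).cardinality()


def subtraction_estimate(a: HLL, b: HLL) -> float:
    """|A minus B| = |A or B| - |B|: again only a number."""
    return a.union(b).cardinality() - b.cardinality()
```

Because the intersection and subtraction estimates combine several noisy cardinalities, their relative error grows when the result is small compared with the inputs.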

Inclusion-exclusion principle
  Accuracy degradation
  Binomial coefficients

Calculation tree transformation
  An HLL can be unioned only with another HLL
  To intersect one HLL with another, use the inclusion-exclusion principle:
    |A and B| = |A| + |B| - |A or B|
    This yields a number, not an HLL
  So how do we estimate expressions like (A and B) or C?
    (A and B) or C => (A or C) and (B or C)
  A recursive tree transformation is needed, one that leaves only a single final intersection and subtraction
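One way to sketch such a transformation: push every "or" below every "and" until the expression is a single intersection of unions, then apply inclusion-exclusion over those unions. The tuple-based expression format and the names `to_cnf` and `estimate` are hypothetical, subtraction is left out, and exact Python sets stand in for HLLs so that the result can be checked. With real sketches, `union_cardinality` would merge HLLs and return an estimate.

```python
from itertools import combinations


def to_cnf(expr):
    """Rewrite an and/or tree as a list of clauses.

    Each clause is a frozenset of leaf names to be unioned; the clauses
    themselves are intersected. Leaves are strings, inner nodes are
    ("and" | "or", left, right) tuples.
    """
    if isinstance(expr, str):
        return [frozenset([expr])]
    op, left, right = expr
    l, r = to_cnf(left), to_cnf(right)
    if op == "and":
        return l + r
    if op == "or":
        # (l1 and l2 ...) or (r1 and r2 ...) => and over all (li or rj)
        return [a | b for a in l for b in r]
    raise ValueError(f"unknown operator: {op}")


def estimate(expr, union_cardinality):
    """Cardinality of expr using only union cardinalities.

    |U1 and ... and Un| = sum over non-empty subsets S of
    (-1)**(|S| + 1) * |union of Ui for i in S|.
    """
    clauses = to_cnf(expr)
    total = 0
    for k in range(1, len(clauses) + 1):
        sign = 1 if k % 2 else -1
        for combo in combinations(clauses, k):
            total += sign * union_cardinality(frozenset().union(*combo))
    return total


# Exact sets stand in for HLLs here.
SETS = {"A": set(range(0, 60)), "B": set(range(40, 100)), "C": set(range(90, 120))}


def exact_union_cardinality(names):
    return len(set().union(*(SETS[n] for n in names)))


expr = ("or", ("and", "A", "B"), "C")
result = estimate(expr, exact_union_cardinality)
```

The number of union terms grows as 2**n - 1 for n clauses, and every term adds its own error, which is one reason accuracy degrades on deep trees.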

MinHash vs K Min Values
  Jaccard index: J(A, B) = |A and B| / |A or B|

  Sampling ratio normalization
  Cardinality estimation via K Min Values
  Accuracy degrades when the estimated result is much smaller than the larger set
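A K Min Values sketch can illustrate these points. This is a generic textbook version with assumed names (`kmv`, `cardinality`, `jaccard`) and an assumed k of 1024; the talk's own parameters and code are not shown in the slides. The sketch keeps the k smallest hash values; the k-th smallest value tells how densely the hash space is populated, and taking the k smallest values of two merged sketches is one way to bring them to a common sampling ratio.

```python
import hashlib
import heapq

K = 1024           # assumption: sketch size chosen for the demo
HASH_SPACE = 2 ** 64


def hash64(item: str) -> int:
    return int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")


def kmv(items, k: int = K):
    """Sorted list of the k smallest distinct hash values."""
    return heapq.nsmallest(k, {hash64(x) for x in items})


def cardinality(sketch, k: int = K) -> float:
    """(k - 1) / (k-th smallest hash as a fraction of the hash space)."""
    if len(sketch) < k:
        return float(len(sketch))  # fewer than k items: the count is exact
    return (k - 1) / (sketch[k - 1] / HASH_SPACE)


def union(a, b, k: int = K):
    """Merge two sketches and keep the k smallest values."""
    return heapq.nsmallest(k, set(a) | set(b))


def jaccard(a, b, k: int = K) -> float:
    """Share of the union sketch that is present in both input sketches."""
    u = union(a, b, k)
    if not u:
        return 0.0
    sa, sb = set(a), set(b)
    return sum(1 for v in u if v in sa and v in sb) / len(u)


def intersection_estimate(a, b, k: int = K) -> float:
    return jaccard(a, b, k) * cardinality(union(a, b, k), k)
```

When the intersection is tiny relative to the larger set, only a handful of the k values fall into it, so the Jaccard estimate rests on very few hits.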

Bitmaps
  Each bit corresponds to a particular set item
  Good estimation accuracy and performance
  Memory-inefficient if the underlying set is small
  Requires a mapping from element id to sequence number in the bitmap (a sync challenge for a distributed application)
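The bitmap approach can be sketched with Python integers acting as bit arrays. The `IdIndex` class and helper names are hypothetical; in a distributed setup the id-to-sequence mapping would have to be shared between nodes, which is the sync challenge mentioned above.

```python
class IdIndex:
    """Assigns each element id a stable sequence number (bit position)."""

    def __init__(self):
        self._seq = {}

    def seq(self, item_id: str) -> int:
        # New ids get the next free position; known ids keep theirs.
        return self._seq.setdefault(item_id, len(self._seq))


def to_bitmap(ids, index: IdIndex) -> int:
    """Build a bitmap (a Python int) with one bit set per element."""
    bits = 0
    for item_id in ids:
        bits |= 1 << index.seq(item_id)
    return bits


def count(bits: int) -> int:
    """Number of set bits, i.e. the cardinality of the set."""
    return bin(bits).count("1")


index = IdIndex()
a = to_bitmap(["u1", "u2", "u3"], index)
b = to_bitmap(["u2", "u3", "u4"], index)

union_size = count(a | b)          # union
intersect_size = count(a & b)      # intersection
subtract_size = count(a & ~b)      # subtraction: in a but not in b
```

If the bitmaps hold only a 1% sample, the counts would be divided by 0.01 to scale back to the full audience.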

Improvement: Compressed bitmaps
  Still a big overhead, as we need to store all the items

Sampled audience as Sets
  Huge memory consumption for big audiences
  Set operation performance depends on the smaller set
  So operations on two big sets are slow
  Resample big sets to 0.01% and use that sample only when all sets in the expression are big
  No need to store an id-to-sequence-number mapping
  Efficient for small audiences
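A rough sketch of sampled sets follows, under one stated assumption: sampling is done by hashing the cookie id, so a given cookie is either in every segment's sample or in none, and the 0.01% sample is automatically a subset of the 1% sample. The slides do not say how the samples are drawn, and the function names are illustrative.

```python
import hashlib

HASH_SPACE = 2 ** 64


def in_sample(cookie_id: str, rate: float) -> bool:
    """Keep a cookie if its hash falls into the lowest `rate` share of the hash space."""
    h = int.from_bytes(hashlib.sha1(cookie_id.encode("utf-8")).digest()[:8], "big")
    return h < rate * HASH_SPACE


def sample(ids, rate: float) -> set:
    return {i for i in ids if in_sample(i, rate)}


def estimate_intersection(sample_a: set, sample_b: set, rate: float) -> float:
    """Intersect two samples drawn at the same rate and scale the count up."""
    small, big = sorted((sample_a, sample_b), key=len)
    hits = sum(1 for x in small if x in big)  # cost is driven by the smaller set
    return hits / rate


def pick_rate(sizes, big_threshold: int = 10_000_000) -> float:
    """Use the 0.01% samples only when every set in the expression is big."""
    return 0.0001 if all(s >= big_threshold for s in sizes) else 0.01
```

The `big_threshold` value is a placeholder; the slides do not state where the cut-off between the two sampling rates lies.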

To sum up (2B audience)

HLL
  Size: 2 KB (400 B packed)
  Accuracy: 2% on average for cardinality
  Restrictions: significant degradation if set sizes differ by more than 10 times
  Supported operations: union natively; intersect and subtract via the inclusion-exclusion principle. Not every calculation tree can be estimated.

MinHash (8k)
  Size: 32 KB
  Accuracy: 2% if sets cardinality less than 100
  Restrictions: set sizes difference > 1000 times
  Supported operations: union, intersect, subtract. Recursive disjoint and intersection lead to accuracy degradation. Requires tree transformation.

Bitmaps 1%
  Size: 5 MB
  Accuracy: 2% if set size > 10k
  Restrictions: lots of extra data for big sets if there is no need to intersect with small ones
  Supported operations: union, intersect, subtract

Sets (1% + 0.01%)
  Size: 0-200 MB
  Accuracy: 2% if set size > 10k
  Restrictions: lots of extra data for big sets if there is no need to intersect with small ones
  Supported operations: union, intersect, subtract

Combination of different approaches
  HLL + MH
    Use MH for intersection and subtraction
  Bitmaps + Sets
    I.e. sparse and dense representations of a set
    Store items as sets, then convert them to bitmaps after a certain threshold
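The sparse-to-dense idea can be sketched as a container that starts as a plain set and switches to a bitmap once it grows past a threshold. The class name `HybridSet` and the default threshold of 4096 are assumptions for illustration; items are taken to be sequence numbers already.

```python
class HybridSet:
    """Set of sequence numbers: sparse (set) first, dense (bitmap) later."""

    def __init__(self, threshold: int = 4096):
        self.threshold = threshold
        self.items = set()   # sparse representation
        self.bits = None     # dense representation: a Python int used as a bitmap

    def add(self, seq: int) -> None:
        if self.bits is not None:
            self.bits |= 1 << seq
            return
        self.items.add(seq)
        if len(self.items) > self.threshold:
            # Past the threshold: convert once and drop the sparse copy.
            self.bits = self._to_bitmap()
            self.items = set()

    def _to_bitmap(self) -> int:
        if self.bits is not None:
            return self.bits
        bits = 0
        for seq in self.items:
            bits |= 1 << seq
        return bits

    def is_dense(self) -> bool:
        return self.bits is not None

    def __len__(self) -> int:
        if self.bits is not None:
            return bin(self.bits).count("1")
        return len(self.items)

    def intersection_count(self, other: "HybridSet") -> int:
        if self.bits is None and other.bits is None:
            return len(self.items & other.items)
        # At least one side is dense: compare as bitmaps.
        return bin(self._to_bitmap() & other._to_bitmap()).count("1")
```

Small audiences stay cheap as sets, while large ones get the bitmap's fast bitwise operations.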

What we store
  Segment data (near realtime)
    Segment stats per day (HLL + MinHash): 14 GB, 1 GB per day
    Affinities report (daily recount + near-realtime deltas)
    1% sample bitmap (no compression in Redis, 190 GB)
    1% + 0.01% sample sets (40 GB)
  Transaction predicate sets (daily)
    HLL (compressed, 150M HLLs in 40 GB)

Questions?