It Probably Works

Apr 14, 2017

Transcript
Page 1: It Probably Works

It Probably Works

Page 2: It Probably Works

Tyler McMullen

CTO of Fastly

@tbmcmullen

Page 3: It Probably Works

Fastly

We’re an awesome CDN.

Page 4: It Probably Works

What is a probabilistic algorithm?

Page 5: It Probably Works

Why bother?

Page 6: It Probably Works

“In testing primality of very large numbers chosen at random, the chance of stumbling upon a value that fools the Fermat test is less than the chance that cosmic radiation will cause the computer to make an error in carrying out a 'correct' algorithm. Considering an algorithm to be inadequate for the first reason but not for the second illustrates the difference between mathematics and engineering.”

– Hal Abelson and Gerald J. Sussman, SICP

Page 7: It Probably Works

Everything is probabilistic

Page 8: It Probably Works

Probabilistic algorithms are not “guessing”

Page 9: It Probably Works

Provably bounded error rates

Page 10: It Probably Works

Don’t use them if someone’s life depends on it.


Page 12: It Probably Works

When should I use these?

Page 13: It Probably Works

Ok, now what?

Page 14: It Probably Works

Load Balancing

Monitoring

Application

CDN

Databases

Count-distinct

Join-shortest-queue

Reliable broadcast

Page 15: It Probably Works

Load balancing

Page 16: It Probably Works

Log-normal Distribution

50th: 0.6
75th: 1.2
95th: 3.1
99th: 6.0
99.9th: 14.1
MEAN: 1.0

Page 17: It Probably Works

import numpy as np
import numpy.random as nr

n = 8     # number of servers
m = 1000  # number of requests
bins = [0] * n

mu = 0.0
sigma = 1.15

# normalize() is not shown on the slide
for weight in nr.lognormal(mu, sigma, m):
    chosen_bin = nr.randint(0, n)
    bins[chosen_bin] += normalize(weight)

[100.7, 137.5, 134.3, 126.2, 113.5, 175.7, 101.6, 113.7]

Page 18: It Probably Works

Random simulation

Actual distribution

Page 19: It Probably Works
Page 20: It Probably Works

SO WHAT DO WE DO ABOUT IT?

Page 21: It Probably Works

Random simulation

JSQ simulation

Page 22: It Probably Works
Page 23: It Probably Works
Page 24: It Probably Works
Page 25: It Probably Works

DISTRIBUTED RANDOM IS EXACTLY THE SAME

Page 26: It Probably Works

DISTRIBUTED JOIN-SHORTEST-QUEUE IS A NIGHTMARE

Page 27: It Probably Works
Page 28: It Probably Works
Page 29: It Probably Works
Page 30: It Probably Works

import numpy as np
import numpy.random as nr

n = 8     # number of servers
m = 1000  # number of requests
bins = [0] * n

mu = 0.0
sigma = 1.15

# normalize() is not shown on the slide
for weight in nr.lognormal(mu, sigma, m):
    a = nr.randint(0, n)
    b = nr.randint(0, n)
    chosen_bin = a if bins[a] < bins[b] else b
    bins[chosen_bin] += normalize(weight)

[130.5, 131.7, 129.7, 132.0, 131.3, 133.2, 129.9, 132.6]
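The two simulations can be compared side by side in one small script. This is only a sketch using the Python standard library: the slide's `normalize()` helper is not shown, so raw log-normal weights are used here instead, and the `simulate` function, its parameters, and the seed are assumptions made for illustration.

```python
import random
import statistics

def simulate(num_servers, num_requests, choices, rng, mu=0.0, sigma=1.15):
    """Return per-server load after placing log-normally weighted requests.

    choices=1 is purely random placement; choices=2 samples two servers
    at random and sends the request to the less loaded of the two.
    """
    bins = [0.0] * num_servers
    for _ in range(num_requests):
        weight = rng.lognormvariate(mu, sigma)
        candidates = [rng.randrange(num_servers) for _ in range(choices)]
        chosen = min(candidates, key=lambda i: bins[i])
        bins[chosen] += weight
    return bins

rng = random.Random(2017)  # arbitrary seed, chosen only for repeatability
random_bins = simulate(8, 1000, choices=1, rng=rng)
two_choice_bins = simulate(8, 1000, choices=2, rng=rng)

print("random:      stdev", statistics.pstdev(random_bins))
print("two choices: stdev", statistics.pstdev(two_choice_bins))
```

Sampling two servers needs no shared queue-length state beyond the two servers being compared, which is what makes it attractive when full join-shortest-queue is hard to distribute.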

Page 31: It Probably Works

Random:

[100.7, 137.5, 134.3, 126.2, 113.5, 175.7, 101.6, 113.7]

STANDARD DEVIATION: 22.9

Randomized JSQ:

[130.5, 131.7, 129.7, 132.0, 131.3, 133.2, 129.9, 132.6]

STANDARD DEVIATION: 1.18
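The two standard deviations can be checked directly from the bin totals shown on the slides. The sketch below assumes they are population standard deviations (dividing by n, not n - 1), which is what reproduces the quoted figures; the variable names are illustrative.

```python
import statistics

# Bin totals copied from the two simulation slides.
random_bins = [100.7, 137.5, 134.3, 126.2, 113.5, 175.7, 101.6, 113.7]
two_choice_bins = [130.5, 131.7, 129.7, 132.0, 131.3, 133.2, 129.9, 132.6]

random_stdev = round(statistics.pstdev(random_bins), 1)
two_choice_stdev = round(statistics.pstdev(two_choice_bins), 2)

print(random_stdev)      # 22.9
print(two_choice_stdev)  # 1.18
```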

Page 32: It Probably Works

Random simulation

JSQ simulation

Randomized JSQ simulation

Page 33: It Probably Works
Page 34: It Probably Works

Load Balancing

Monitoring

Application

CDN

Databases

Count-distinct

Join-shortest-queue

Reliable broadcast

Page 35: It Probably Works

The Count-distinct Problem

Page 36: It Probably Works

The Problem

How many unique words are in a large corpus of text?

Page 37: It Probably Works

The Problem

How many different users visited a popular website in a day?

Page 38: It Probably Works

The Problem

How many unique IPs have connected to a server over the last hour?

Page 39: It Probably Works

The Problem

How many unique URLs have been requested through an HTTP proxy?

Page 40: It Probably Works

The Problem

How many unique URLs have been requested through an entire network of HTTP proxies?

Page 41: It Probably Works

166.208.249.236 - - [25/Feb/2015:07:20:13 +0000] "GET /product/982" 200 20246
103.138.203.165 - - [25/Feb/2015:07:20:13 +0000] "GET /article/490" 200 29870
191.141.247.227 - - [25/Feb/2015:07:20:13 +0000] "HEAD /page/1" 200 20409
150.255.232.154 - - [25/Feb/2015:07:20:13 +0000] "GET /page/1" 200 42999
191.141.247.227 - - [25/Feb/2015:07:20:13 +0000] "GET /article/490" 200 25080
150.255.232.154 - - [25/Feb/2015:07:20:13 +0000] "GET /listing/567" 200 33617
103.138.203.165 - - [25/Feb/2015:07:20:13 +0000] "HEAD /listing/567" 200 29618
191.141.247.227 - - [25/Feb/2015:07:20:13 +0000] "HEAD /page/1" 200 30265
166.208.249.236 - - [25/Feb/2015:07:20:13 +0000] "GET /page/1" 200 5683
244.210.202.222 - - [25/Feb/2015:07:20:13 +0000] "HEAD /article/490" 200 47124
103.138.203.165 - - [25/Feb/2015:07:20:13 +0000] "HEAD /listing/567" 200 48734
103.138.203.165 - - [25/Feb/2015:07:20:13 +0000] "GET /listing/567" 200 27392
191.141.247.227 - - [25/Feb/2015:07:20:13 +0000] "GET /listing/567" 200 15705
150.255.232.154 - - [25/Feb/2015:07:20:13 +0000] "GET /page/1" 200 22587
244.210.202.222 - - [25/Feb/2015:07:20:13 +0000] "HEAD /product/982" 200 30063
244.210.202.222 - - [25/Feb/2015:07:20:13 +0000] "GET /page/1" 200 6041
166.208.249.236 - - [25/Feb/2015:07:20:13 +0000] "GET /product/982" 200 25783
191.141.247.227 - - [25/Feb/2015:07:20:13 +0000] "GET /article/490" 200 1099
244.210.202.222 - - [25/Feb/2015:07:20:13 +0000] "GET /product/982" 200 31494
191.141.247.227 - - [25/Feb/2015:07:20:13 +0000] "GET /listing/567" 200 30389
150.255.232.154 - - [25/Feb/2015:07:20:13 +0000] "GET /article/490" 200 10251
191.141.247.227 - - [25/Feb/2015:07:20:13 +0000] "GET /product/982" 200 19384
150.255.232.154 - - [25/Feb/2015:07:20:13 +0000] "HEAD /product/982" 200 24062
244.210.202.222 - - [25/Feb/2015:07:20:13 +0000] "GET /article/490" 200 19070
191.141.247.227 - - [25/Feb/2015:07:20:13 +0000] "GET /page/648" 200 45159
191.141.247.227 - - [25/Feb/2015:07:20:13 +0000] "HEAD /page/648" 200 5576
166.208.249.236 - - [25/Feb/2015:07:20:13 +0000] "GET /page/648" 200 41869
166.208.249.236 - - [25/Feb/2015:07:20:13 +0000] "GET /listing/567" 200 42414

Page 42: It Probably Works

def count_distinct(stream):
    seen = set()
    for item in stream:
        seen.add(item)
    return len(seen)

Page 43: It Probably Works

Scale.

Page 44: It Probably Works

Count-distinct across thousands of servers.

Page 45: It Probably Works

def combined_cardinality(seen_sets):
    combined = set()
    for seen in seen_sets:
        combined |= seen
    return len(combined)

Page 46: It Probably Works

The set grows linearly.

Page 47: It Probably Works

Precision comes at a cost.

Page 48: It Probably Works
Page 49: It Probably Works
Page 50: It Probably Works

“example.com/user/page/1”

Page 51: It Probably Works

10110101

“example.com/user/page/1”

Page 52: It Probably Works

10110101

Page 53: It Probably Works

10110101

P(bit 0 is set) = 0.5

Page 54: It Probably Works

10110101

P(bit 0 is set & bit 1 is set) = 0.25

Page 55: It Probably Works

10110101

P(bit 0 is set & bit 1 is set & bit 2 is set) = 0.125

Page 56: It Probably Works

10110101

P(bit 0 is set) = 0.5

Expected trials = 2

Page 57: It Probably Works

10110101

P(bit 0 is set & bit 1 is set) = 0.25

Expected trials = 4

Page 58: It Probably Works

10110101

P(bit 0 is set & bit 1 is set & bit 2 is set) = 0.125

Expected trials = 8

Page 59: It Probably Works

We expect the maximum number of leading zeros we have seen + 1 to approximate log2(unique items).
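A minimal sketch of that idea with a single estimator may help. It assumes a 32-bit hash taken from SHA-1, which the slides do not specify, and the function names are hypothetical. The result is always a power of two and has very high variance, which is the motivation for partitioning the input.

```python
import hashlib

HASH_BITS = 32  # assumed hash width for this sketch

def hash32(item):
    """Map an item to a 32-bit integer (SHA-1 is an arbitrary choice here)."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def leading_zeros(value, bits=HASH_BITS):
    """Number of zero bits before the first set bit in a `bits`-wide value."""
    return bits - value.bit_length()

def rough_cardinality(stream):
    """Estimate distinct items as 2 ** (max leading zeros seen + 1)."""
    max_rank = 0
    for item in stream:
        max_rank = max(max_rank, leading_zeros(hash32(item)) + 1)
    return 2 ** max_rank
```

Because only the maximum is kept, repeated items cannot change the estimate, so duplicates in the stream are free.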

Page 60: It Probably Works

Improve the accuracy of the estimate by partitioning the input data.

Page 61: It Probably Works

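import numpy as np

# Alpha, hash_fn and hash_len are not shown on the slide
class LogLog(object):
    def __init__(self, k):
        self.k = k
        self.m = 2 ** k
        self.M = np.zeros(self.m, dtype=int)  # np.int no longer exists in NumPy
        self.alpha = Alpha[k]

    def insert(self, token):
        y = hash_fn(token)
        j = y >> (hash_len - self.k)
        remaining = y & ((1 << (hash_len - self.k)) - 1)
        # The slide hard-codes 64 here, which is only right if hash_len is 64.
        # bit_length() also avoids math.log(0) when `remaining` is 0.
        first_set_bit = (hash_len - self.k) - remaining.bit_length() + 1
        self.M[j] = max(self.M[j], first_set_bit)

    def cardinality(self):
        return self.alpha * 2 ** np.mean(self.M)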

Page 62: It Probably Works

Unions of HyperLogLogs
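Two sketches built with the same hash and the same number of registers can be merged by taking the element-wise maximum of their registers, and the result is identical to a sketch that saw both streams. Below is a minimal, self-contained HyperLogLog sketch illustrating this. It is not the code from the talk: the SHA-1 based 64-bit hash, the class layout, and the default `k` are assumptions, and the bias constant is the usual approximation for 128 or more registers.

```python
import hashlib
import math

class HyperLogLog:
    def __init__(self, k=10):
        self.k = k
        self.m = 1 << k                      # number of registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # valid for m >= 128

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        y = int.from_bytes(digest[:8], "big")        # 64-bit hash
        j = y >> (64 - self.k)                       # top k bits pick a register
        remaining = y & ((1 << (64 - self.k)) - 1)
        rank = (64 - self.k) - remaining.bit_length() + 1
        self.registers[j] = max(self.registers[j], rank)

    def cardinality(self):
        harmonic = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / harmonic
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return estimate

    def union(self, other):
        if self.k != other.k:
            raise ValueError("sketches must use the same number of registers")
        merged = HyperLogLog(self.k)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

# Two servers see overlapping sets of URLs.
server_a, server_b, everything = HyperLogLog(), HyperLogLog(), HyperLogLog()
for i in range(0, 6000):
    server_a.add("url-%d" % i)
    everything.add("url-%d" % i)
for i in range(4000, 10000):
    server_b.add("url-%d" % i)
    everything.add("url-%d" % i)

merged = server_a.union(server_b)
print(round(merged.cardinality()))  # roughly 10,000
```

Only the registers need to travel between servers, so the cost of combining counts does not grow with the number of items each server has seen.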

Page 63: It Probably Works

HyperLogLog

• Adding an item: O(1)

• Retrieving cardinality: O(1)

• Space: O(log log n)

• Error rate: 2%

Page 64: It Probably Works

HyperLogLog

For 100 million unique items, and an error rate of 2%, the size of the HyperLogLog is...

1,500 bytes
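The slide does not say which parameters produce these numbers. One configuration that lands close to them, offered only as a plausible reading, is 2,048 registers of 6 bits each, using the commonly quoted HyperLogLog standard error of about 1.04 / sqrt(m).

```python
import math

registers = 2048         # assumed: 2 ** 11 registers
bits_per_register = 6    # assumed: enough to store ranks up to 63

size_bytes = registers * bits_per_register // 8
standard_error = 1.04 / math.sqrt(registers)

print(size_bytes)                 # 1536
print(round(standard_error, 3))   # 0.023
```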

Page 65: It Probably Works

Load Balancing

Monitoring

Application

CDN

Databases

Count-distinct

Join-shortest-queue

Reliable broadcast

Page 66: It Probably Works

Reliable Broadcast

Page 67: It Probably Works

The Problem

Reliably broadcast “purge” messages across the world as quickly as possible.

Page 68: It Probably Works

Single source of truth

Page 69: It Probably Works

Single source of failures

Page 70: It Probably Works

Atomic broadcast

Page 71: It Probably Works

Reliable broadcast

Page 72: It Probably Works
Page 73: It Probably Works
Page 74: It Probably Works

Gossip Protocols

Page 75: It Probably Works
Page 76: It Probably Works
Page 77: It Probably Works

“Designed for Scale”

Page 78: It Probably Works

Probabilistic Guarantees

Page 79: It Probably Works
Page 80: It Probably Works

Bimodal Multicast

• Quickly broadcast message to all servers

• Gossip to recover lost messages
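The two phases can be illustrated with a toy simulation. This is a heavy simplification, not the protocol itself: real Bimodal Multicast exchanges digests of recent message history, whereas here each server simply asks one random peer per round whether it has the single message. The function name, parameters, and loss model are assumptions.

```python
import random

def bimodal_multicast(num_servers, loss_rate, gossip_rounds, rng):
    """Return how many servers hold the message after each phase.

    Phase 1: the sender broadcasts once; each delivery is lost
    independently with probability `loss_rate`.
    Phase 2: in every gossip round each server contacts one random
    peer and copies the message if the peer had it.
    """
    has_message = [rng.random() >= loss_rate for _ in range(num_servers)]
    has_message[0] = True  # server 0 is the sender
    history = [sum(has_message)]

    for _ in range(gossip_rounds):
        snapshot = list(has_message)  # state at the start of the round
        for server in range(num_servers):
            peer = rng.randrange(num_servers)
            if snapshot[peer] and not snapshot[server]:
                has_message[server] = True
        history.append(sum(has_message))
    return history

history = bimodal_multicast(50, loss_rate=0.3, gossip_rounds=200,
                            rng=random.Random(1))
print(history[:6], "...", history[-1])
```

The broadcast reaches most servers immediately, and gossip closes the remaining gap within a few rounds with high probability, which is the "bimodal" behaviour the name refers to.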

Page 81: It Probably Works
Page 82: It Probably Works
Page 83: It Probably Works
Page 84: It Probably Works
Page 86: It Probably Works
Page 87: It Probably Works

One Problem

Computers have limited space

Page 88: It Probably Works

Throw away messages

Page 89: It Probably Works
Page 90: It Probably Works
Page 92: It Probably Works

“with high probability” is fine

Page 93: It Probably Works

Real World

Page 94: It Probably Works

End-to-End Latency

Density plot and 95th percentile of purge latency by server location

Panels: New York, London, San Jose, Tokyo. Labeled values: 42ms, 74ms, 83ms, 133ms. X-axis: Latency (ms), 0 to 150. Y-axis: Density, 0.00 to 0.10.

Page 95: It Probably Works


Page 96: It Probably Works

Packet Loss

Page 97: It Probably Works

Good systems are boring

Page 98: It Probably Works

What was the point again?

Page 99: It Probably Works

We can build things that are otherwise unrealistic

Page 100: It Probably Works

We can build systems that are more reliable

Page 101: It Probably Works

You’re already using them.

Page 102: It Probably Works

Load Balancing

Monitoring

Application

CDN

Databases

Count-distinct

Join-shortest-queue

Reliable broadcast

Bloom Filters

Hash tables!

ECMP

Consistent hashing

Quicksort

Page 103: It Probably Works

We’re hiring!

Page 104: It Probably Works

Thanks

@tbmcmullen

Page 105: It Probably Works

What even is this?

Page 106: It Probably Works

Probabilistic Algorithms

Page 107: It Probably Works

Randomized Algorithms

Page 108: It Probably Works

Estimation Algorithms

Page 109: It Probably Works

Probabilistic Algorithms

1. An iota of theory

2. Where are they useful and where are they not?

3. HyperLogLog

4. Locality-sensitive Hashing

5. Bimodal Multicast

Page 110: It Probably Works

“An algorithm that uses randomness to improve its efficiency”

Page 111: It Probably Works

Las Vegas

Page 112: It Probably Works

Monte Carlo

Page 113: It Probably Works

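Las Vegas

from random import randrange

def find_las_vegas(haystack, needle):
    # Always returns a correct index, but the running time is random
    # (and unbounded if `needle` is not in `haystack`).
    length = len(haystack)
    while True:
        index = randrange(length)
        if haystack[index] == needle:
            return index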

Page 114: It Probably Works

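Monte Carlo

from random import randrange

def find_monte_carlo(haystack, needle, k):
    # Makes at most `k` probes, so the running time is bounded,
    # but it may miss `needle` even when it is present.
    length = len(haystack)
    for i in range(k):
        index = randrange(length)
        if haystack[index] == needle:
            return index
    return None  # gave up after k probes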

Page 115: It Probably Works

“For many problems a randomized algorithm is the simplest, the fastest, or both.”

– Prabhakar Raghavan (author of Randomized Algorithms)

Page 116: It Probably Works

Naive Solution

For 100 million unique IPv4 addresses, the size of the hash is...

>400 MB

Page 117: It Probably Works

Slightly Less Naive

Add each IP to a bloom filter and keep a counter of the IPs that don’t collide.

Page 118: It Probably Works

Slightly Less Naive

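from pybloom import BloomFilter  # assumed library; the slide does not name it

# expected_size, log_file and extract_ip are not shown on the slide
ips_seen = BloomFilter(capacity=expected_size, error_rate=0.03)
counter = 0
for line in log_file:
    ip = extract_ip(line)
    # Assumes a pybloom-style add(), which returns True when the item was
    # (probably) already present, so count only first sightings.
    if not ips_seen.add(ip):
        counter += 1
print("Unique IPs:", counter)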

Page 119: It Probably Works

Slightly Less Naive

• Adding an IP: O(1)

• Retrieving cardinality: O(1)

• Space: O(n)

• Error rate: 3%

kind of

Page 120: It Probably Works

Slightly Less Naive

For 100 million unique IPv4 addresses, and an error rate of 3%, the size of the bloom filter is...

87 MB
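That figure follows from the standard sizing formula for an optimally configured Bloom filter, bits = -n ln(p) / (ln 2)^2. The short calculation below reproduces it, assuming the slide's megabytes are binary (2^20 bytes); the variable names are illustrative.

```python
import math

n = 100_000_000   # unique IPv4 addresses
p = 0.03          # target false-positive rate

bits = -n * math.log(p) / (math.log(2) ** 2)
size_mib = bits / 8 / 2 ** 20

print(round(bits / n, 1))   # 7.3 bits per item
print(round(size_mib))      # 87
```

The size still grows linearly with the number of items, which is why the space bound on the previous slide is O(n).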

Page 121: It Probably Works

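def insert(self, token):
    # Get hash of token
    y = hash_fn(token)
    # Extract `k` most significant bits of `y`
    j = y >> (hash_len - self.k)
    # Extract remaining bits of `y`
    remaining = y & ((1 << (hash_len - self.k)) - 1)
    # Find "first" set bit of `remaining`
    # (the slide hard-codes 64, which is only right if hash_len is 64;
    # bit_length() also avoids math.log(0) when `remaining` is 0)
    first_set_bit = (hash_len - self.k) - remaining.bit_length() + 1
    # Update `M[j]` to max of `first_set_bit`
    # and existing value of `M[j]`
    self.M[j] = max(self.M[j], first_set_bit)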

Page 122: It Probably Works

def cardinality(self):
    # The mean of `M` estimates `log2(n)` with
    # an additive bias
    return self.alpha * 2 ** np.mean(self.M)

Page 123: It Probably Works

The Problem

Find documents that are similar to one specific document.

Page 124: It Probably Works

The Problem

Find images that are similar to one specific image.

Page 125: It Probably Works

The Problem

Find graphs that are correlated to one specific graph.

Page 126: It Probably Works

The Problem

Nearest neighbor search.

Page 127: It Probably Works

The Problem

“Find the n closest points in a d-dimensional space.”

Page 128: It Probably Works

The Problem

You have a bunch of things and you want to figure out which ones are similar.

Page 129: It Probably Works
Page 130: It Probably Works

“There has been a lot of recent work on streaming algorithms, i.e. algorithms that produce an output by making one pass (or a few passes) over the data while using a limited amount of storage space and time. To cite a few examples, ...”

{"there": 1, "has": 1, "been": 1, "a": 4, "lot": 1, "of": 2, "recent": 1, ...}

Page 131: It Probably Works

• Cosine similarity

• Jaccard similarity

• Euclidean distance

• etc etc etc

Page 132: It Probably Works

Euclidean Distance

Page 133: It Probably Works

Metric space

Page 134: It Probably Works

kd-trees

Page 135: It Probably Works

Curse of Dimensionality

Page 136: It Probably Works

Locality-sensitive hashing

Page 137: It Probably Works
Page 138: It Probably Works
Page 139: It Probably Works

Locality-Sensitive Hashing for Finding Nearest Neighbors - Slaney and Casey

Page 140: It Probably Works

Random Hyperplanes

Page 141: It Probably Works

{"there": 1, "has": 1, "been": 1, "a": 4, "lot": 1, "of": 2, "recent": 1, ...}

LSH

0111010000101010...
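A sketch of how random hyperplanes might turn a word-count vector into a bit string like the one above: each hyperplane is a random Gaussian vector over the vocabulary, and each output bit records which side of that hyperplane the document falls on. The helper names, the 16-bit signature length, and the seed are assumptions made for illustration.

```python
import random

def make_hyperplanes(vocabulary, num_bits, rng):
    """One random Gaussian normal vector (as a word -> weight dict) per bit."""
    return [{word: rng.gauss(0.0, 1.0) for word in vocabulary}
            for _ in range(num_bits)]

def lsh_signature(word_counts, hyperplanes):
    """Bit i is 1 when the document lies on the positive side of plane i."""
    bits = []
    for plane in hyperplanes:
        dot = sum(count * plane.get(word, 0.0)
                  for word, count in word_counts.items())
        bits.append("1" if dot >= 0 else "0")
    return "".join(bits)

doc = {"there": 1, "has": 1, "been": 1, "a": 4, "lot": 1, "of": 2, "recent": 1}
planes = make_hyperplanes(sorted(doc), num_bits=16, rng=random.Random(42))

signature = lsh_signature(doc, planes)
doubled = lsh_signature({word: 2 * count for word, count in doc.items()}, planes)
print(signature)
print(signature == doubled)  # True: only direction matters, not length
```

Because only the sign of each dot product is kept, documents pointing in similar directions agree on most bits regardless of their length.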

Page 142: It Probably Works

Cosine Similarity

LSH
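For random-hyperplane signatures, the probability that two vectors disagree on a given bit equals the angle between them divided by pi, so the fraction of differing bits gives an estimate of cosine similarity. A minimal sketch, with a hypothetical function name and signatures represented as bit strings:

```python
import math

def estimated_cosine(sig_a, sig_b):
    """Estimate cosine similarity from two equal-length bit-string signatures."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    differing = sum(a != b for a, b in zip(sig_a, sig_b))
    angle = math.pi * differing / len(sig_a)
    return math.cos(angle)

print(estimated_cosine("0111010000101010", "0111010000101010"))  # 1.0
```

Longer signatures reduce the variance of the estimate at the cost of more storage per item.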

Page 143: It Probably Works

Firewall Partition

Page 144: It Probably Works

DDoS
