April 2008 1 Brahms Byzantine-Resilient Random Membership Sampling Bortnikov, Gurevich, Keidar, Kliot, and Shraer.

April 2008 1

Brahms

Byzantine-Resilient Random Membership SamplingBortnikov, Gurevich, Keidar, Kliot, and Shraer

April 2008 2

Edward (Eddie) Bortnikov Maxim (Max) Gurevich Idit Keidar

Gabriel (Gabi) Kliot Alexander (Alex) Shraer

April 2008 3

Why Random Node Sampling

Gossip partners Random choices make gossip protocols work

Unstructured overlay networks E.g., among super-peers Random links provide robustness, expansion

Gathering statistics Probe random nodes

Choosing cache locations

April 2008 4

The Setting

Many nodes – n 10,000s, 100,000s, 1,000,000s, …

Come and go Churn

Full network Like the Internet

Every joining node knows some others (Initial) Connectivity

April 2008 5

Adversary Attacks

Faulty nodes (portion f of ids) Attack other nodes Byzantine failures

May want to bias samples Isolate nodes, DoS nodes Promote themselves, bias statistics

April 2008 6

Previous Work

Benign gossip membership Small (logarithmic) views Robust to churn and benign failures Empirical study [Lpbcast,Scamp,Cyclon,PSS] Analytical study [Allavena et al.] Never proven uniform samples Spatial correlation among neighbors’ views [PSS]

Byzantine-resilient gossip Full views [MMR,MS,Fireflies,Drum,BAR] Small views, some resilience [SPSS] We are not aware of any analytical work

April 2008 7

Our Contributions

1. Gossip-based attack-tolerant membership Linear portion f of failures O(n1/3)-size partial views Correct nodes remain connected Mathematically analyzed, validated in

simulations

2. Random sampling Novel memory-efficient approach Converges to proven independent uniform

samples

The view is not all bad

Better than benign gossip

April 2008 8

Brahms

1. Sampling - local component

2. Gossip - distributed componentsample

Sampler

Gossip

view

April 2008 9

Sampler Building Block

Input: data stream, one element at a time Bias: some values appear more than others Used with stream of gossiped ids

Output: uniform random sample of unique elements seen thus far Independent of other Samplers One element at a time (converging)

next

sample

Sampler

April 2008 10

Sampler Implementation

Memory: stores one element at a time Use random hash function h

From min-wise independent family [Broder et al.] For each set X, and all , Xx

||

1))()}(Pr(min{

XxhXh

Sampler

next

sample

init

Keep id with smallest hash so far

Choose random hash function

April 2008 11

Component S: Sampling and Validation

SamplerSampler

sample

Sampler Sampler

nextinit

using pings

id streamfrom gossip

ValidatorValidator Validator Validator

S

April 2008 12

Gossip Process

Provides the stream of ids for S Needs to ensure connectivity Use a bag of tricks to overcome attacks

April 2008 13

Gossip-Based Membership Primer

Small (sub-linear) local view V V constantly changes - essential due to churn

Typically, evolves in (unsynchronized) rounds Push: send my id to some node in V

Reinforce underrepresented nodes Pull: retrieve view from some node in V

Spread knowledge within the network [Allavena et al. ‘05]: both are essential

Low probability for partitions and star topologies

April 2008 14

Brahms Gossip Rounds

Each round: Send pushes, pulls to random nodes from V Wait to receive pulls, pushes Update S with all received ids (Sometimes) re-compute V

Tricky! Beware of adversary attacks

April 2008 15

Problem 1: Push Drowning

Push Alice

Push Bob

Push Dana

Push Caro

lP

ush

Ed

Push Mallory

Push M

&M

Push Malfoy

A

D

B

EM

M

M

M

M

April 2008 16

Trick 1: Rate-Limit Pushes

Use limited messages to bound faulty pushes system-wide

Assume at most p of pushes are faulty E.g., computational puzzles/virtual currency Faulty nodes can send portion p of them Views won’t be all bad

April 2008 17

Problem 2: Quick Isolation

Push Alice

Push Bob

Push Dana

Push Caro

l

Pu

sh E

d Push MalloryPush M

&M

Pu

sh M

alfo

y

A

C

E

M

Ha! She’s out! Now let’s move on

to the next guy!

D

April 2008 18

Trick 2: Detection & Recovery

Do not re-compute V in rounds when too many pushes are received

Push MalloryPush M

&M

Pu

sh M

alfo

y

Hey! I’m swamped!I better ignore all of ‘em pushes…

Push Bob

Slows down isolation; does not prevent it Isolation takes log (view size) time

April 2008 19

Problem 3: Pull Deterioration

Pull

M8

A B M1 M2

C D M3 M4Pull

PullM

7

E F M5 M6

PullE

M 3

M3 E M7 M8

50% faulty ids in views 75% faulty ids in views

April 2008 20

Trick 3: Balance Pulls & Pushes

Control contribution of push - α|V| ids versus contribution of pull - β|V| ids Parameters α, β

Pull-only eventually all faulty ids Push-only quick isolation of attacked node Push ensures: system-wide not all bad ids Pull slows down (does not prevent) isolation

April 2008 21

Trick 4: History Samples

Attacker influences both push and pull

Feedback γ|V| random ids from S Parameters α + β + γ = 1

Attacker loses control - samples are eventually perfectly uniform

Yoo-hoo, is there any good process out there?

April 2008 22

View and Sample Maintenance

Pushed ids

Pulled ids

S |V| |V| |V|

View V Sample

April 2008 23

Samples Take Time To Help

Judicious use essential (e.g., 10%) Deal with churn, bootstrap

Need connectivity for uniform sampling Rely on sampling for connectivity?

Break cycle: Assume attack starts when samples are empty Analyze time to partition without samples Samples become useful at least as fast

April 2008 24

Key Property

With appropriate parameters E.g.,

Time to partition > time to convergence of 1st good sample

Prove lower boundusing tricks 1,2,3

(not using samples yet)

Prove upper bound until some good sample persists

forever

3| | | | ( )V S n

Self-healing from partitions

Time to Partition Analysis

Easiest partition to cause – isolate one node Targeted attack

Analysis of targeted attack: Assume unrealistically strong adversary Analyze evolution of faulty ids in views as a

random process Show lower bound on time to isolation Depends on p, |V|, α, β Independent of n

April 2008 25

April 2008 26

Scalability of PSP(t) – Probability of Perfect Sample at Time t Analysis says:

For scalability, want small and constant convergence time independent of system size, e.g., when

2| | | |

( ) (1 )V S

nPSP t e

3| | | | ( )V S n

2| | (log ),| | ( )

log

nV n S

n

April 2008 27

Analysis

1. Sampling - mathematical analysis

2. Connectivity - analysis and simulation

3. Full system simulation

April 2008 28

Connectivity Sampling

Theorem: If overlay remains connected indefinitely, samples are eventually uniform

April 2008 29

Sampling Connectivity Ever After

Perfect sample of a sampler with hash h: the id with the lowest h(id) system-wide

If correct, sticks once the sampler sees it Correct perfect sample self-healing from

partitions ever after

April 2008 30

Convergence to 1st Perfect Sample

n = 1000

f = 0.2

40% unique ids in stream

April 2008 31

Connectivity Analysis 1: Balanced Attacks Attack all nodes the same Maximizes faulty ids in views system-wide

in any single round If repeated, system converges to fixed point

ratio of faulty ids in views, which is < 1 if γ=0 (no history) and p < 1/3 or History samples are used, any p

There are always good ids in views!

April 2008 32

Fixed Point Analysis: Push

i

Local view node 1

Local view node i

Time t:

push

1 Time t+1:

push from faulty node

lost push

x(t) – portion of faulty nodes in views at round t;portion of faulty pushes to correct nodes :

p / ( p + ( 1 − p )( 1 − x(t) ) )

April 2008 33

Fixed Point Analysis: Pull

i

Local view node 1

Local view node i

Time t:

pull from i: faulty with probability x(t)

Time t+1:

E[x(t+1)] = p / (p + (1 − p)(1 − x(t))) + ( x(t) + (1-x(t))x(t) ) + γf

pull from faulty

April 2008 34

Faulty Ids in Fixed Point

With a few history samples, any

portion of bad nodes can be toleratedPerfectly validated

fixed pointsand convergence

Assumed perfect in analysis, real history

in simulations

April 2008 35

Convergence to Fixed Point

n = 1000

p = 0.2

α=β=0.5

γ=0

April 2008 36

Connectivity Analysis 2:Targeted Attack – Roadmap Step 1: analysis without history samples

Isolation in logarithmic time … but not too fast, thanks to tricks 1,2,3

Step 2: analysis of history sample convergence Time-to-perfect-sample < Time-to-Isolation

Step 3: putting it all together Empirical evaluation No isolation happens

April 2008 37

Targeted Attack – Step 1

Q: How fast (lower bound) can an attacker isolate one node from the rest?

Worst-case assumptions No use of history samples ( = 0) Unrealistically strong adversary

Observes the exact number of correct pushes and complements it to α|V|

Attacked node not represented initially Balanced attack on the rest of the system

April 2008 38

Isolation w/out History Samples

n = 1000

p = 0.2

α=β=0.5

γ=0

Depend on α,β,p

Isolation time for |V|=60

2x2 i, ji

E(indegree(t 1)) E(indegree(t))A , A 1

E(outdegree(t 1)) E(outdegree(t)

April 2008 39

Step 2: Sample Convergence

Perfect sample in 2-3 rounds

n = 1000

p = 0.2

α=β=0.5, γ=0

40% unique ids

Empirically verified

April 2008 40

Step 3: Putting It All TogetherNo Isolation with History Samples

Works well despite small PSP

n = 1000

p = 0.2

α=β=0.45

γ=0.1

April 2008 41

p = 0.2

α=β=0.45

γ=0.1

Sample Convergence (Balanced)

32|||| NSV

32|||| nSV

Convergence twice as fast with 33|||| nSV

April 2008 42

Summary

O(n1/3)-size views Resist attacks / failures of linear portion Converge to proven uniform samples Precise analysis of impact of attacks

April 2008 1 Brahms Byzantine-Resilient Random Membership Sampling Bortnikov, Gurevich, Keidar, Kliot, and Shraer.

Documents

benign gossip slide

random nodes

sample sampler slide

gossip validator s slide

brahms gossip

random hash function

gossip protocols

gossip process