Page 1:

Cheap and Large CAMs for High Performance Data-Intensive Networked Systems

Ashok Anand, Chitra Muthukrishnan, Steven Kappes, and Aditya Akella (University of Wisconsin-Madison)

Suman Nath (Microsoft Research)

Page 2:

New data-intensive networked systems

Large hash tables (10s to 100s of GBs)

Page 3:

New data-intensive networked systems

[Figure: WAN optimizers deployed between a data center and a branch office, connected over a WAN. Each optimizer keeps an object store (~4 TB) of 4 KB chunks and a hash table (~32 GB) mapping 20 B keys to chunk pointers; objects are looked up chunk by chunk.]

Large hash tables (32 GB)

High speed (~10K/sec) inserts and evictions

High speed (~10K/sec) lookups for a 500 Mbps link
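As a rough sanity check on this requirement (back-of-the-envelope arithmetic, not a figure from the slides):

    \frac{500~\text{Mb/s}}{8~\text{bits/byte}} \approx 62.5~\text{MB/s},
    \qquad
    \frac{62.5~\text{MB/s}}{4~\text{KB/chunk}} \approx 1.5\times 10^{4}~\text{chunks/s},

i.e., on the order of 10K chunk lookups per second.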

Page 4:

New data-intensive networked systems

• Other systems
– De-duplication in storage systems (e.g., Data Domain)
– CCN cache (Jacobson et al., CoNEXT 2009)
– DONA directory lookup (Koponen et al., SIGCOMM 2007)

Cost-effective large hash tables: Cheap Large cAMs (CLAMs)

Page 5:

Candidate options

            Random reads/sec   Random writes/sec   Cost (128 GB)
DRAM        300K               300K                $120K+
Disk        250                250                 $30+
Flash-SSD   10K*               5K*                 $225+

Disk is too slow; DRAM is too expensive (only 2.5 ops/sec/$); flash SSDs are cheap and fast to read, but writes are slow.

How do we deal with the slow writes of flash SSDs?

* Derived from latencies on Intel M-18 SSD in experiments
+ Price statistics from 2008-09

Page 6:

Our CLAM design

• New data structure “BufferHash” + Flash
• Key features
– Avoid random writes; perform sequential writes in a batch
• Sequential writes are 2X faster than random writes (Intel SSD)
• Batched writes reduce the number of writes going to Flash
– Bloom filters for optimizing lookups

BufferHash performs orders of magnitude better than DRAM-based traditional hash tables in ops/sec/$

Page 7:

Outline

• Background and motivation

• CLAM design
– Key operations (insert, lookup, update)
– Eviction
– Latency analysis and performance tuning

• Evaluation

Page 8:

Flash/SSD primer

• Random writes are expensive: avoid random page writes

• Reads and writes happen at the granularity of a flash page: I/O smaller than a page should be avoided, if possible

Page 9:

Conventional hash table on Flash/SSD

[Figure: a hash table laid out directly on flash; keys are likely to hash to random locations, producing random writes.]

SSDs: the FTL handles random writes to some extent, but the garbage collection overhead is high.

Result: ~200 lookups/sec and ~200 inserts/sec with the WAN optimizer workload, far below the required ~10K/s and ~5K/s.

Page 10:

Conventional hash table on Flash/SSD

[Figure: a DRAM cache placed in front of the flash-resident hash table.]

Can't assume locality in requests: using DRAM as a cache won't work.

Page 11:

Our approach: Buffering insertions

• Control the impact of random writes
• Maintain a small hash table (buffer) in memory
• As the in-memory buffer gets full, write it to flash
– We call the in-flash copy of a buffer an incarnation (see the sketch below)

[Figure: the buffer is an in-memory hash table in DRAM; incarnations are in-flash hash tables on the SSD.]
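A minimal sketch of this buffering idea, assuming a simplified single-buffer variant in which a plain file stands in for flash and buffers are pickled wholesale (the actual BufferHash lays incarnations out as hash tables on flash):

    import pickle

    class BufferedHash:
        """Sketch of buffered insertion: DRAM buffer, sequential flushes to 'flash'."""

        def __init__(self, flash_path, buffer_capacity=4096):
            self.buffer = {}                      # in-memory hash table (the buffer)
            self.capacity = buffer_capacity
            self.flash = open(flash_path, "ab")   # append-only file: sequential writes only
            self.incarnation_offsets = []         # byte offset of each flushed incarnation

        def insert(self, key, value):
            self.buffer[key] = value
            if len(self.buffer) >= self.capacity:
                self._flush()

        def _flush(self):
            # One large sequential write replaces many small random writes.
            self.incarnation_offsets.append(self.flash.tell())
            self.flash.write(pickle.dumps(self.buffer))
            self.flash.flush()
            self.buffer = {}                      # the flushed copy becomes a new incarnation

The price is that a lookup must now consult the buffer and every incarnation, which the next slides address.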

Page 12:

Two-level memory hierarchy

[Figure: the buffer lives in DRAM; flash holds the incarnation table, with incarnations ordered from oldest to latest.]

Net hash table = buffer + all incarnations

Page 13:

Lookups are impacted by buffering

[Figure: a lookup key is checked against the buffer in DRAM and then against incarnations 4, 3, 2, 1 on flash.]

Multiple in-flash lookups per key. Can we limit it to only one?

Page 14:

Bloom filters for optimizing lookups

[Figure: one in-memory Bloom filter per incarnation; a lookup key is first checked against the Bloom filters in DRAM, and an incarnation is read from flash only when its filter matches. A false positive still costs a wasted flash read, so the filters must be configured carefully.]

2 GB Bloom filters for 32 GB Flash for false positive rate < 0.01!
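A sketch of the resulting lookup path (illustrative only: a toy Bloom filter, and a caller-supplied read_incarnation function standing in for the flash read):

    import hashlib

    class TinyBloom:
        """Toy Bloom filter using k salted SHA-1 hashes; for illustration only."""

        def __init__(self, m_bits=1 << 16, k=4):
            self.m, self.k = m_bits, k
            self.bits = bytearray(m_bits // 8)

        def _positions(self, key):
            for i in range(self.k):
                digest = hashlib.sha1(f"{i}:{key}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.m

        def add(self, key):
            for p in self._positions(key):
                self.bits[p // 8] |= 1 << (p % 8)

        def may_contain(self, key):
            return all((self.bits[p // 8] >> (p % 8)) & 1 for p in self._positions(key))

    def lookup(key, buffer, incarnation_ids, blooms, read_incarnation):
        """incarnation_ids and blooms are ordered newest first."""
        if key in buffer:                         # checking the DRAM buffer is free
            return buffer[key]
        for inc_id, bloom in zip(incarnation_ids, blooms):
            if bloom.may_contain(key):            # in-memory filter gates the flash read
                table = read_incarnation(inc_id)  # the only flash I/O on this path
                if key in table:
                    return table[key]
                # otherwise: a Bloom false positive; keep scanning older incarnations
        return None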

Page 15:

Update: naïve approach

[Figure: updating a key in place would mean locating its copy in an older incarnation and overwriting it on flash, i.e., expensive random writes.]

Discard this naïve approach.

Page 16:

Lazy updates

[Figure: to update a key, simply insert the key with its new value into the buffer; the old value stays behind in an older incarnation on flash.]

Lookups check the latest incarnations first, so they return the new value rather than the stale one.
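In code, a lazy update is therefore just an insert; continuing the earlier single-buffer sketch, nothing on flash is touched:

    def update(buffered_hash, key, new_value):
        # Do NOT rewrite the old copy on flash (that would be a random write).
        # Insert into the DRAM buffer instead; newest-first lookups return this value,
        # and the stale copy ages out when its incarnation is eventually evicted.
        buffered_hash.insert(key, new_value)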

Page 17:

Eviction for streaming apps

• Eviction policies may depend on the application
– LRU, FIFO, priority-based eviction, etc.

• Two BufferHash primitives (sketched below)
– Full discard: evict all items
• Naturally implements FIFO
– Partial discard: retain a few items
• Priority-based eviction by retaining high-priority items

• BufferHash is best suited for FIFO
– Incarnations are arranged by age
– Other useful policies are possible at some additional cost

• Details in paper
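A sketch of the two primitives, with hypothetical helpers read_incarnation / write_incarnation standing in for the flash I/O and keep as an application-supplied priority predicate:

    def full_discard(incarnation_ids, blooms):
        # FIFO eviction: drop the oldest incarnation and its Bloom filter wholesale;
        # no flash reads and no random writes are needed.
        incarnation_ids.pop()    # lists are ordered newest first, so the last is oldest
        blooms.pop()

    def partial_discard(incarnation_ids, read_incarnation, write_incarnation, keep):
        # Priority-based eviction: read back the oldest incarnation, retain only the
        # high-priority items, and rewrite it sequentially (the "additional cost").
        oldest = read_incarnation(incarnation_ids[-1])
        retained = {k: v for k, v in oldest.items() if keep(k, v)}
        incarnation_ids[-1] = write_incarnation(retained)
        # Rebuilding the corresponding Bloom filter is omitted from this sketch.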

Page 18:

Issues with using one buffer

• A single buffer in DRAM
– All operations and eviction policies work against this one buffer

• High worst-case insert latency
– Flushing a 1 GB buffer takes a few seconds
– New lookups stall during the flush


Page 19:

Partitioning buffers

• Partition buffers
– Based on the first few bits of the key space (see the sketch below)
– Size > page
• Avoids I/O smaller than a page
– Size >= block
• Avoids random page writes

• Reduces worst-case latency

• Eviction policies apply per buffer

[Figure: keys whose first bit is 0 (0XXXXX) go to one buffer, keys whose first bit is 1 (1XXXXX) to another; each buffer has its own incarnations on flash.]
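One way to pick the partition for a key, assuming a power-of-two number of buffers (16 is just an illustrative choice):

    import hashlib

    NUM_PARTITIONS = 16                            # illustrative; must be a power of two

    def partition_of(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
        """Route a key to a buffer using the first few bits of the (hashed) key space."""
        h = int.from_bytes(hashlib.sha1(key).digest()[:8], "big")   # 64-bit prefix
        prefix_bits = num_partitions.bit_length() - 1               # 16 partitions -> 4 bits
        return h >> (64 - prefix_bits)

In the WAN optimizer workload the 20 B keys are typically already content hashes of chunks, so their leading bits could be used directly without re-hashing.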

Page 20:

BufferHash: Putting it all together

• Multiple buffers in memory
• Multiple incarnations per buffer in flash
• One in-memory Bloom filter per incarnation

[Figure: buffers 1 … K in DRAM, each with its own chain of incarnations on flash.]

Net hash table = all buffers + all incarnations

Page 21:

Outline

• Background and motivation

• Our CLAM design
– Key operations (insert, lookup, update)
– Eviction
– Latency analysis and performance tuning

• Evaluation

Page 22:

Latency analysis

• Insertion latency
– Worst case: proportional to the size of the buffer
– Average case: constant for buffers larger than a block

• Lookup latency
– Average case: proportional to the number of incarnations
– Average case: proportional to the false positive rate of the Bloom filters
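Putting these together (a restatement in symbols, consistent with the tuning slide that follows): for a key that is absent from the table, the expected number of flash reads is roughly

    E[\text{flash reads per lookup}] \;\approx\; N_{\text{inc}} \cdot p,
    \qquad
    N_{\text{inc}} = \frac{\text{flash size}}{\text{total buffer size}},

where p is the Bloom filters' false positive rate; a key that is present costs about one flash read plus the same false-positive term over the newer incarnations.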

Page 23:

Parameter tuning: Total size of Buffers

[Figure: DRAM holds the buffers B1 … BN and their Bloom filters; flash holds the incarnations.]

Given fixed DRAM, how much should be allocated to buffers?

Total size of buffers = B1 + B2 + … + BN
Total Bloom filter size = DRAM - total size of buffers
# Incarnations = flash size / total buffer size

Lookup cost ∝ #incarnations × false positive rate, and the false positive rate increases as the Bloom filters shrink.

Too small is not optimal; too large is not optimal either. Optimal = 2 * SSD/entry
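A toy model of this tradeoff (an illustration only: the 64 B entry size and the textbook Bloom-filter formula are assumptions, not numbers from the talk):

    import math

    def expected_flash_reads_per_miss(dram_bytes, flash_bytes, buffer_fraction,
                                      entry_bytes=64):
        """Expected flash reads for a lookup that misses, as a function of how much
        DRAM goes to buffers (the rest holds the Bloom filters)."""
        buffer_bytes = dram_bytes * buffer_fraction
        bloom_bytes = dram_bytes - buffer_bytes
        num_incarnations = flash_bytes / buffer_bytes
        entries_on_flash = flash_bytes / entry_bytes       # assumed bytes per entry
        bits_per_entry = bloom_bytes * 8 / entries_on_flash
        # Textbook Bloom-filter false-positive rate with an optimal number of hashes:
        fpr = 0.5 ** (bits_per_entry * math.log(2))
        return num_incarnations * fpr

    # With the talk's 4 GB DRAM / 32 GB flash, giving too much DRAM to buffers starves
    # the Bloom filters and the miss cost explodes; the penalty for buffers that are
    # too small (many incarnations, frequent flushes) is outside this simple model.
    for frac in (0.25, 0.5, 0.75, 0.95):
        reads = expected_flash_reads_per_miss(4 * 2**30, 32 * 2**30, frac)
        print(f"buffers = {frac:.0%} of DRAM -> {reads:.2e} expected flash reads")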

Page 24:

Parameter tuning: Per-buffer size

What should the size of each partitioned buffer (e.g., B1) be? It affects the worst-case insertion latency.

[Figure: DRAM holds buffers B1 … BN; flash holds their incarnations.]

Adjust according to application requirements (128 KB - 1 block)

Page 25:

Outline

• Background and motivation

• Our CLAM design
– Key operations (insert, lookup, update)
– Eviction
– Latency analysis and performance tuning

• Evaluation

Page 26:

Evaluation

• Configuration
– 4 GB DRAM, 32 GB Intel SSD, Transcend SSD
– 2 GB buffers, 2 GB Bloom filters, 0.01 false positive rate
– FIFO eviction policy

Page 27:

BufferHash performance

• WAN optimizer workload
– Random key lookups followed by inserts
– Hit rate of 40%
– Also used workloads from real packet traces

• Comparison with BerkeleyDB (a traditional hash table) on the Intel SSD

Average latency    BufferHash    BerkeleyDB
Lookup (ms)        0.06          4.6
Insert (ms)        0.006         4.8

Better lookups and better inserts!

Page 28:

Insert performance

[Figure: CDF of insert latency (ms, log scale) on the Intel SSD for BufferHash and BerkeleyDB.]

BufferHash: 99% of inserts < 0.1 ms (buffering effect)
BerkeleyDB: 40% of inserts > 5 ms (random writes are slow)

Page 29:

Lookup performance

[Figure: CDF of lookup latency (ms, log scale) for the 40%-hit workload, BufferHash and BerkeleyDB.]

BufferHash: 99% of lookups < 0.2 ms; 60% of lookups don't go to flash at all, and those that do see the ~0.15 ms Intel SSD read latency
BerkeleyDB: 40% of lookups > 5 ms, due to garbage collection overhead from the interleaved writes

Page 30:

Performance in Ops/sec/$

• 16K lookups/sec and 160K inserts/sec

• Overall cost of $400

• 42 lookups/sec/$ and 420 inserts/sec/$
– Orders of magnitude better than the 2.5 ops/sec/$ of DRAM-based hash tables

Page 31:

Other workloads

• Varying fractions of lookups
• Results on the Transcend SSD

Lookup fraction    BufferHash    BerkeleyDB
0                  0.007 ms      18.4 ms
0.5                0.09 ms       10.3 ms
1                  0.12 ms       0.3 ms

• BufferHash ideally suited for write intensive workloads

Average latency per operation

Page 32:

Evaluation summary

• BufferHash performs orders of magnitude better in ops/sec/$ than traditional hash tables on DRAM (and disks)

• BufferHash is best suited for the FIFO eviction policy
– Other policies can be supported at additional cost; details in paper

• A WAN optimizer using BufferHash can operate optimally at 200 Mbps, much better than the 10 Mbps with BerkeleyDB
– Details in paper

Page 33:

Related Work

• FAWN (Vasudevan et al., SOSP 2009)
– Cluster of wimpy nodes with flash storage
– Each wimpy node has its hash table in DRAM
– We target:
• Hash tables much bigger than DRAM
• Low latency as well as high throughput

• HashCache (Badam et al., NSDI 2009)
– In-memory hash table for objects stored on disk

Page 34:

Conclusion

• We have designed a new data structure BufferHash for building CLAMs

• Our CLAM on Intel SSD achieves high ops/sec/$ for today’s data-intensive systems

• Our CLAM can support useful eviction policies

• Dramatically improves performance of WAN optimizers

Page 35:

Thank you

Page 36:

ANCS 2010: ACM/IEEE Symposium on Architectures for Networking and Communications Systems

• Estancia La Jolla Hotel & Spa (near UCSD)
• October 25-26, 2010
• Paper registration & abstract: May 10, 2010
• Submission deadline: May 17, 2010
• http://www.ancsconf.org/

Page 37:

Backup slides

Page 38:
Page 39:

WAN optimizer using BufferHash

• With BerkeleyDB, throughput up to 10 Mbps

• With BufferHash, throughput up to 200 Mbps with the Transcend SSD
– 500 Mbps with the Intel SSD

• At 10 Mbps, average throughput per object improves by 65% with BufferHash