Page 1: Investigating Distributed Caching Mechanisms for Hadoop Gurmeet Singh Puneet Chandra Rashid Tahir.

Investigating Distributed Caching Mechanisms for Hadoop

Gurmeet Singh
Puneet Chandra
Rashid Tahir

GOAL

• Explore the feasibility of a distributed caching mechanism inside Hadoop

Presentation Overview

• Motivation
• Design
• Experimental Results
• Future Work

Motivation

• Disk Access Times are a bottleneck in cluster computing

• Large amounts of data are read from disk
• DARE
• RAMClouds
• PACMan – Coordinated Cache Replacement

We want to strike a balance between RAM and Disk Storage

Our Approach

• Integrate Memcached with Hadoop
• Used Quickcached and Spymemcached
• Reserve a portion of the main memory at each node to serve as local cache
• Local caches aggregate to abstract a distributed caching mechanism governed by Memcached
• Greedy caching strategy
• Least Recently Used (LRU) cache eviction policy
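The last two bullets can be illustrated with a minimal sketch. It assumes that "greedy" means every block read on a miss is admitted to the cache, which the slides do not spell out. The names `LRUBlockCache`, `read_block`, and `read_from_disk` are hypothetical and are not taken from Hadoop, Memcached, Quickcached, or Spymemcached.

```python
from collections import OrderedDict


class LRUBlockCache:
    """Fixed-capacity block cache with least-recently-used eviction."""

    def __init__(self, capacity_blocks):
        if capacity_blocks < 1:
            raise ValueError("capacity_blocks must be at least 1")
        self.capacity_blocks = capacity_blocks
        self._blocks = OrderedDict()  # oldest entry first, newest last

    def get(self, block_id):
        """Return the cached block, or None on a miss."""
        if block_id not in self._blocks:
            return None
        self._blocks.move_to_end(block_id)  # mark as most recently used
        return self._blocks[block_id]

    def put(self, block_id, data):
        """Greedy admission (assumed): always cache, evicting if full."""
        self._blocks[block_id] = data
        self._blocks.move_to_end(block_id)
        while len(self._blocks) > self.capacity_blocks:
            self._blocks.popitem(last=False)  # drop least recently used

    def cached_ids(self):
        """Block ids ordered from least to most recently used."""
        return list(self._blocks)


def read_block(cache, block_id, read_from_disk):
    """Serve from the cache if possible; otherwise read from disk and cache."""
    data = cache.get(block_id)
    if data is None:
        data = read_from_disk(block_id)
        cache.put(block_id, data)
    return data
```

With a capacity of two blocks, reading b1, b2, b1, b3 in that order leaves b1 and b3 cached: b2 is the least recently used block when b3 arrives.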

Design Overview

Memcached

Design Choice 1

• Simultaneous requests to Namenode and Memcached

Minimizes access latency at the cost of additional network overhead
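A minimal sketch of this design choice, assuming the client has one callable for the cache lookup and one for the Namenode lookup. Both callables, and the function name `locate_block_parallel`, are hypothetical stand-ins rather than Hadoop or Memcached APIs.

```python
from concurrent.futures import ThreadPoolExecutor


def locate_block_parallel(block_id, cache_lookup, namenode_lookup):
    """Query the cache and the Namenode at the same time.

    Returns ("cache", data) on a hit, otherwise
    ("namenode", replica_addresses).
    """
    pool = ThreadPoolExecutor(max_workers=2)
    try:
        cache_future = pool.submit(cache_lookup, block_id)
        namenode_future = pool.submit(namenode_lookup, block_id)
        cached = cache_future.result()
        if cached is not None:
            # The Namenode request was still sent: this is the
            # additional network overhead of this design.
            return "cache", cached
        # On a miss the Namenode answer is already in flight.
        return "namenode", namenode_future.result()
    finally:
        # Do not block a cache hit on the slower Namenode reply.
        pool.shutdown(wait=False)
```

Because both requests start together, a miss costs roughly the slower of the two lookups rather than their sum.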

Design Choice 2

• Send request to Namenode only in the case of a cache miss

Minimizes network overhead at the cost of increased latency
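The trade-off between Design Choices 1 and 2 can be made concrete with a rough cost model. This is a sketch under stated assumptions, not a measurement: it assumes one request per lookup target, fixed lookup times `t_cache` and `t_namenode`, and a known hit ratio. The function name and all parameters are hypothetical.

```python
def lookup_cost(hit_ratio, t_cache, t_namenode, parallel):
    """Expected (requests, latency) per block lookup under a simple model."""
    miss_ratio = 1.0 - hit_ratio
    if parallel:
        # Design Choice 1: both requests are always sent.
        requests = 2.0
        latency = hit_ratio * t_cache + miss_ratio * max(t_cache, t_namenode)
    else:
        # Design Choice 2: the Namenode is contacted only after a miss.
        requests = 1.0 + miss_ratio
        latency = hit_ratio * t_cache + miss_ratio * (t_cache + t_namenode)
    return requests, latency
```

Under this model, Design Choice 2 sends fewer requests as the hit ratio grows, while each miss pays for both lookups in sequence.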

Design Choice 3

• Datanodes send requests only to Memcached

• Memcached checks for cached blocks

• If a cache miss occurs, it contacts the Namenode and returns the replicas' addresses to the Datanodes
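The request flow above can be sketched as a single entry point in front of the cache. Stock Memcached is a plain key-value store, so the fallback to the Namenode would presumably live in a wrapper around it; the slides do not say how this was built. `CacheFrontEnd` and its arguments are hypothetical.

```python
class CacheFrontEnd:
    """Hypothetical cache-side service that falls back to the Namenode."""

    def __init__(self, cached_blocks, namenode_lookup):
        self._cached_blocks = cached_blocks      # block_id -> data
        self._namenode_lookup = namenode_lookup  # block_id -> replica addresses

    def handle_request(self, block_id):
        """Single entry point for Datanode requests."""
        if block_id in self._cached_blocks:
            return {"hit": True, "data": self._cached_blocks[block_id]}
        # Cache miss: the cache layer, not the Datanode, asks the Namenode.
        replicas = self._namenode_lookup(block_id)
        return {"hit": False, "replicas": replicas}
```

The Datanode sends one request in either case and learns from the reply whether it received the data or a list of replica addresses to read from.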

Global Cache Replacement

• LRU-based Global Cache Eviction Scheme

Prefetching

Simulation Results

• Test data ranging from 2GB to 24GB
• Word Count and Grep

Word Count

[Figure: Network Overhead vs Cache Size. x-axis: Cache Size (GB), 0 to 40; y-axis: % Overhead, 0 to 100.]

Word Count

[Figure: Hit Ratio vs Cache Size. x-axis: Cache Size (GB), 0 to 35; y-axis: Cache Hit Ratio, 0 to 1.]
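For reference, a hit ratio is the number of cache hits divided by the total number of accesses. The sketch below shows one way such a value could be computed by replaying a block-access trace through an LRU cache. It is an illustration only: the function name is hypothetical, and the slides do not describe how the reported measurements were actually taken.

```python
from collections import OrderedDict


def lru_hit_ratio(trace, capacity_blocks):
    """Replay a block-access trace through an LRU cache.

    Returns hits / accesses, or 0.0 for an empty trace.
    """
    cache = OrderedDict()  # least recently used entry first
    hits = 0
    for block_id in trace:
        if block_id in cache:
            hits += 1
            cache.move_to_end(block_id)  # refresh recency on a hit
        else:
            cache[block_id] = True
            if len(cache) > capacity_blocks:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(trace) if trace else 0.0
```

For a toy trace that scans four blocks twice, a two-block cache never hits, because LRU evicts each block before the scan returns to it, while a four-block cache hits on the whole second pass.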

Grep

[Figure: Network Overhead vs Cache Size. x-axis: Cache Size (GB), 0 to 35; y-axis: % Overhead, 0 to 100.]

Grep

[Figure: Hit Ratio vs Cache Size. x-axis: Cache Size (GB), 0 to 35; y-axis: Hit Ratio, 0 to 1.]

Future Work

• Implement a pre-fetching mechanism
• Customized caching policies based on access patterns
• Compare and contrast caching with locality-aware scheduling

Conclusion

• Caching can improve the performance of cluster-based systems, depending on the access patterns of the workload being executed