MapReduce and Hadoop
Frankie Pike
Dec 27, 2015
Why care?

- 2010: 1.2 zettabytes = 1.2 trillion gigabytes
- On DVDs, that would stack past the moon
- Two-way data = 6 newspapers every day
- ~58% growth per year
- Google’s capacity = 1 exabyte
- 24 hours of YouTube > the entire Internet in 2000
- 4 years of video uploaded to YouTube per day
- 100 trillion words online
Common Architecture
- Single point of failure
- Space constraints
- Multi-tenancy difficulties
- Re-writing of programs or changes to network config
The Promise

- High reliability: any node can go down
- High scalability: easy to add nodes
- Multi-tenancy
- Cost reduction
- "Cloud-friendly"
- Language choice: Java, C++, C#, Python, R
- Transparent parallelization
Mapping
- Input K/V pairs -> intermediate K/V pairs
- Input and intermediate types can differ: (server key, blog data) -> (blog key, post count)
- Output is sorted and partitioned for reduction
- Number of maps depends on the task and cluster: 10 TB of data with a 128 MB block size = ~82,000 maps
- 10-100 maps per node is ideal
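The mapping step above can be sketched in Python. This is an illustrative word-count mapper (the function name and inputs are my own, not from the talk): each input (key, value) pair produces zero or more intermediate (key, value) pairs.

```python
def map_words(filename, contents):
    """Map an input pair (filename, contents) to intermediate
    (word, 1) pairs -- one per occurrence of each word."""
    for word in contents.split():
        yield (word, 1)

# Input pair -> intermediate pairs (note the types differ):
pairs = list(map_words("doc1.txt", "the quick the fox"))
# → [("the", 1), ("quick", 1), ("the", 1), ("fox", 1)]
```

The framework then sorts and partitions these intermediate pairs by key before handing them to the reducers.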
Reducing

- Intermediate K/V -> intermediate K/V (smaller)
- Matching keys are consolidated: (A, 15); (B, 6); (A, 3) -> (A, 18); (B, 6)
- Number of reductions >= 0
- Hopefully a smaller dataset at each iteration
- Reduce as many times as needed
An Example

{
  "type": "post",
  "name": "Raven's Map/Reduce functionality",
  "blog_id": 1342,
  "post_id": 29293921,
  "tags": ["raven", "nosql"],
  "post_content": "<p>...</p>",
  "comments": [
    { "source_ip": "124.2.21.2", "author": "martin", "text": "..." }
  ]
}

Want: count of comments per blog
http://ayende.com/blog/4435/map-reduce-a-visual-explanation
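A hedged sketch of that comment-count job: map each post document to (blog_id, number of comments), then reduce by summing per blog. Field names follow the JSON example; the sample documents and function names are illustrative.

```python
from collections import defaultdict

def map_post(doc):
    """Map a post document to an intermediate (blog_id, comment count) pair."""
    yield (doc["blog_id"], len(doc["comments"]))

def reduce_counts(key, values):
    """Sum per-post comment counts into a per-blog total."""
    return (key, sum(values))

# Illustrative input documents (trimmed to the fields the job uses):
posts = [
    {"blog_id": 1342, "comments": [{"author": "martin"}]},
    {"blog_id": 1342, "comments": [{"author": "a"}, {"author": "b"}]},
]

groups = defaultdict(list)
for doc in posts:
    for key, value in map_post(doc):
        groups[key].append(value)

totals = [reduce_counts(k, vs) for k, vs in groups.items()]
# → [(1342, 3)]
```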
Dealing with Failure
Workers
- Occasional check-in pings by masters

Masters
- Data structures get periodic auto-saves and consistency checks; can restart from periodic saves

Bandwidth
- Tasks attempt to pair with local storage
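The worker check-in mechanism can be sketched as a heartbeat tracker. This is an illustrative model, not Hadoop's actual implementation: a worker that has not pinged within the timeout is considered failed, and its tasks become eligible for re-execution elsewhere.

```python
import time

HEARTBEAT_TIMEOUT = 10.0  # seconds; illustrative value

class Master:
    """Tracks the last check-in ping from each worker."""

    def __init__(self):
        self.last_seen = {}  # worker id -> timestamp of last ping

    def record_ping(self, worker_id):
        self.last_seen[worker_id] = time.monotonic()

    def failed_workers(self):
        """Workers whose pings have gone silent past the timeout."""
        now = time.monotonic()
        return [w for w, t in self.last_seen.items()
                if now - t > HEARTBEAT_TIMEOUT]

master = Master()
master.record_ping("w1")
# No missed heartbeats yet, so no workers are marked failed:
# master.failed_workers() → []
```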
Apache Hadoop
"open source software for reliable, scalable, distributed computing"

- Hadoop Distributed File System (HDFS)
- Hadoop MapReduce
- Cassandra (multi-master database)
- HBase (scalable, distributed, structured database)
- Mahout (data mining and machine learning libraries)
- ZooKeeper (coordination service)