Clemens Neudecker KB National Library of the Netherlands SCAPE & OPF Hackathon Vienna, 2 dec 2013 What is Hadoop? Hadoop Driven Digital Preservation

Clemens Neudecker, KB National Library of the Netherlands

SCAPE & OPF Hackathon, Vienna, 2 Dec 2013

What is Hadoop?
Hadoop Driven Digital Preservation

Timeline

This work was partially supported by the SCAPE Project. The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).

• Dec 2004: Dean/Ghemawat (Google) MapReduce paper

• 2005: Doug Cutting and Mike Cafarella (Yahoo) create Hadoop, at first only to extend Nutch (the name is derived from Doug’s son’s toy elephant)

• 2006: Yahoo runs Hadoop on 5-20 nodes

• March 2008: Cloudera founded

• July 2008: Hadoop wins the terabyte sort benchmark (1st time a Java program won this competition)

• April 2009: Amazon introduces “Elastic MapReduce” as a service on S3/EC2

• June 2011: Hortonworks founded

• 27 Dec 2011: Apache Hadoop release 1.0.0

• June 2012: Facebook claims the “biggest Hadoop cluster”, totalling more than 100 petabytes in HDFS

• 2013: Yahoo runs Hadoop on 42,000 nodes, computing about 500,000 MapReduce jobs per day

• 15 Oct 2013: Apache Hadoop release 2.2.0 (YARN)

Contributions 2006 - 2011

(Cf. http://hortonworks.com/blog/reality-check-contributions-to-apache-hadoop/)

“Core” Hadoop

• Hadoop Common (formerly Hadoop Core)

• Hadoop MapReduce

• Hadoop YARN (MapReduce 2.0)

• Hadoop Distributed File System (HDFS)

The wider Hadoop Ecosystem

• Ambari, ZooKeeper (managing & monitoring)

• HBase, Cassandra (database)

• Hive, Pig (data warehouse and query language)

• Mahout (machine learning)

• Chukwa, Avro, Oozie, Giraph, and many more

http://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-cloudera

“Hadoop is a hammer”

• “Hadoop is a hammer. Start by figuring out what house you’re gonna build.”

Alistair Croll

• “If all you have is a hammer, throw away everything that is not a nail!”

Jimmy Lin

MapReduce in 41 words (including “library”)

Goal: count the number of books in the library.

• Map: You count up shelf #1, I count up shelf #2.

(The more people we get, the faster this part goes)

• Reduce: We all get together and add up our individual counts.

(Cf. http://www.chrisstucchio.com/blog/2011/mapreduce_explained.html)
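
The library analogy can be sketched in a few lines of plain Python. This is only an illustration of the map and reduce steps, not Hadoop code: the shelf contents and the names count_shelf and add_counts are made up for the example, and everything runs in a single process.

```python
from functools import reduce

# Hypothetical library: each shelf is a list of book titles.
shelves = [
    ["Dune", "Emma", "Ulysses"],
    ["Hamlet", "Walden"],
    ["Beloved", "Ivanhoe", "Persuasion", "Dracula"],
]

def count_shelf(shelf):
    """Map step: one person counts the books on one shelf."""
    return len(shelf)

def add_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right

# Map: every shelf is counted independently of the others,
# which is why adding more people speeds this part up.
partial_counts = list(map(count_shelf, shelves))

# Reduce: get together and add up the individual counts.
total_books = reduce(add_counts, partial_counts, 0)

print(partial_counts)  # [3, 2, 4]
print(total_books)     # 9
```

On a real cluster the shelves would be blocks of a file in HDFS and the counting would run on many nodes at once; the structure of the computation stays the same.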

MapReduce in a nutshell

[Diagram: tasks (Task 1, Task 2, Task 3) produce output data, which is combined into aggregated results. © Sven Schlarb]

MapReduce “v1” issues

• JobTracker as a single point of failure

• Deficiencies in scalability, memory consumption, threading model, reliability and performance (https://issues.apache.org/jira/browse/MAPREDUCE-278)

• Aim to support programming paradigms other than MapReduce (e.g. BSP, Bulk Synchronous Parallel)

MapReduce vs YARN

(Cf. http://hortonworks.com/blog/office-hours-qa-on-yarn-in-hadoop-2/)

When to use Hadoop?

• Generally, whenever “standard tools” don’t work anymore because of sheer data size (rule of thumb: if your data fits on a regular hard drive, you’re better off sticking to Python/SQL/Bash/etc.!)

• Aggregation across large data sets: use the power of Reducers!

• Large-scale ETL operations (extract, transform, load)
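
To make the aggregation point concrete, here is a minimal in-memory sketch of the map, shuffle and reduce phases in Python. It is not Hadoop API code: the input records (file name and MIME type, as a format-identification run might produce) and the function names mapper, shuffle and reducer are assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical input: (file name, MIME type) pairs.
records = [
    ("a.tif", "image/tiff"),
    ("b.pdf", "application/pdf"),
    ("c.tif", "image/tiff"),
    ("d.xml", "text/xml"),
    ("e.pdf", "application/pdf"),
    ("f.tif", "image/tiff"),
]

def mapper(record):
    """Emit one (key, value) pair per record: the MIME type and a count of 1."""
    _name, mime_type = record
    yield mime_type, 1

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Aggregate all values seen for one key into a single result."""
    return key, sum(values)

mapped = [pair for record in records for pair in mapper(record)]
grouped = shuffle(mapped)
counts = dict(reducer(key, values) for key, values in grouped.items())

print(counts)
```

Each reducer only ever sees the values for its own key, so the aggregation can be spread over as many reducers as there are distinct keys. For a data set this small, the rule of thumb above applies: a plain script or SQL query is the better tool.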

Reading

• Tom White: Hadoop: The Definitive Guide (get 3rd ed. for extra YARN chapter)

• YARN explained (really quite well): http://blog.cloudera.com/blog/2012/02/mapreduce-2-0-in-hadoop-0-23/

• Jimmy Lin and Chris Dyer: Data-Intensive Text Processing with MapReduce: http://lintool.github.io/MapReduceAlgorithms/ed1n.html

Happy Hadooping!
