Transcript
Page 1: 20091030nasajpl


Page 2: 20091030nasajpl

Hadoop and Cloudera
Managing Petabytes with Open Source

Jeff Hammerbacher
Chief Scientist and Vice President of Products, Cloudera
October 30, 2009


Page 3: 20091030nasajpl

Why You Should Care
Hadoop in the Sciences
▪ Crossbow: Genotyping from short reads using cloud computing
▪ “Crossbow shows how Hadoop can be an enabling technology for computational biology”
▪ SMARTS substructure searching using the CDK and Hadoop
▪ “The Hadoop framework makes handling large data problems pretty much trivial”
▪ Tier II data storage for LHC
▪ “The shift from dCache to Hadoop has been a pleasant transition”
▪ Scaling the Sky with MapReduce/Hadoop
▪ “This research project is focused on developing new algorithms for indexing, accessing and analyzing astronomical images.”


Page 4: 20091030nasajpl

My Background
Thanks for Asking

▪ [email protected]
▪ Studied Mathematics at Harvard
▪ Worked as a Quant on Wall Street
▪ Conceived, built, and led Data team at Facebook
▪ Nearly 30 amazing engineers and data scientists
▪ Several open source projects and research papers
▪ Founder of Cloudera
▪ Vice President of Products and Chief Scientist (other titles)
▪ Also, check out the book “Beautiful Data”


Page 5: 20091030nasajpl

Presentation Outline
▪ What is Hadoop?
▪ HDFS
▪ MapReduce
▪ Hive, Pig, Avro, Zookeeper, and friends
▪ Solving big data problems with Hadoop at Facebook and Yahoo!
▪ Short history of Facebook’s Data team
▪ Hadoop applications at Yahoo!, Facebook, and Cloudera
▪ Other examples: LHC, smart grid, genomes
▪ Questions and Discussion


Page 6: 20091030nasajpl

What is Hadoop?
▪ Apache Software Foundation project, mostly written in Java
▪ Inspired by Google infrastructure
▪ Software for programming warehouse-scale computers (WSCs)
▪ Hundreds of production deployments
▪ Project structure
▪ Hadoop Distributed File System (HDFS)
▪ Hadoop MapReduce
▪ Hadoop Common
▪ Other subprojects
▪ Avro, HBase, Hive, Pig, Zookeeper


Page 7: 20091030nasajpl

Anatomy of a Hadoop Cluster
▪ Commodity servers
▪ 1 RU, 2 x 4 core CPU, 8 GB RAM, 4 x 1 TB SATA, 2 x 1 GbE NIC
▪ Typically arranged in 2 level architecture
▪ 40 nodes per rack
▪ Inexpensive to acquire and maintain

Commodity Hardware Cluster (ApacheCon US 2008)
▪ Typically in 2 level architecture
▪ Nodes are commodity Linux PCs
▪ 40 nodes/rack
▪ Uplink from rack is 8 gigabit
▪ Rack-internal is 1 gigabit all-to-all


Page 8: 20091030nasajpl

HDFS
▪ Pool commodity servers into a single hierarchical namespace
▪ Break files into 128 MB blocks and replicate blocks
▪ Designed for large files written once but read many times
▪ Files are append-only
▪ Two major daemons: NameNode and DataNode
▪ NameNode manages file system metadata
▪ DataNode manages data using local filesystem
▪ HDFS manages checksumming, replication, and compression
▪ Throughput scales nearly linearly with cluster size
▪ Access from Java, C, command line, FUSE, or Thrift (see the Java sketch below)
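Not on the original slide: a minimal sketch of the Java access path listed above, using the org.apache.hadoop.fs.FileSystem API. It assumes the cluster address (fs.default.name) is supplied by a core-site.xml on the classpath; /tmp/hello.txt is a hypothetical path used only for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    // Picks up fs.default.name and other cluster settings from core-site.xml
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Write a small file; HDFS splits large files into blocks and replicates each block
    Path file = new Path("/tmp/hello.txt");  // hypothetical path
    FSDataOutputStream out = fs.create(file);
    out.writeUTF("hello, HDFS");
    out.close();

    // Read it back; HDFS files are write-once/append-only, so existing bytes never change
    FSDataInputStream in = fs.open(file);
    System.out.println(in.readUTF());
    in.close();
  }
}

The same file is then visible through the other interfaces on the slide, for example hadoop fs -cat /tmp/hello.txt from the command line.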


Page 9: 20091030nasajpl

HDFS
HDFS distributes file blocks among servers

[Embedded whitepaper excerpt:]

Hadoop includes a fault-tolerant storage system called the Hadoop Distributed File System, or HDFS. HDFS is able to store huge amounts of information, scale up incrementally and survive the failure of significant parts of the storage infrastructure without losing data.

Hadoop creates clusters of machines and coordinates work among them. Clusters can be built with inexpensive computers. If one fails, Hadoop continues to operate the cluster without losing data or interrupting work, by shifting work to the remaining machines in the cluster.

HDFS manages storage on the cluster by breaking incoming files into pieces, called “blocks,” and storing each of the blocks redundantly across the pool of servers. In the common case, HDFS stores three complete copies of each file by copying each piece to three different servers.

Figure 1: HDFS distributes file blocks among servers

HDFS has several useful features. In the very simple example shown, any two servers can fail, and the entire file will still be available. HDFS notices when a block or a node is lost, and creates a new copy of missing data from the replicas it…

[Sidebar:] Major Internet properties like Google, Amazon, Facebook and Yahoo! have pioneered the use of networks of inexpensive computers for large-scale data storage and processing. HDFS uses these techniques to store enterprise data.


Page 10: 20091030nasajpl

Hadoop MapReduce
▪ Fault tolerant execution layer and API for parallel data processing
▪ Can target multiple storage systems
▪ Key/value data model
▪ Two major daemons: JobTracker and TaskTracker
▪ Many client interfaces (see the Java sketch below)
▪ Java
▪ C++
▪ Streaming
▪ Pig
▪ SQL (Hive)
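Not on the original slide: a minimal sketch of the Java client interface, the usual word count, written against the org.apache.hadoop.mapreduce API of the Hadoop 0.20 era. The class name is illustrative; input and output paths are taken from the command line.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every token in the input line
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (token.isEmpty()) continue;
        word.set(token);
        ctx.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts for each word
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The JobTracker schedules the map and reduce tasks; TaskTrackers run them on the nodes that hold the input blocks, which is the data-locality point made on the next slide.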


Page 11: 20091030nasajpl

MapReduce
MapReduce pushes work out to the data

[Embedded whitepaper excerpt:]

Figure 2: Hadoop pushes work out to the data

Running the analysis on the nodes that actually store the data delivers much, much better performance than reading data over the network from a single centralized server. Hadoop monitors jobs during execution, and will restart work lost due to node failure if necessary. In fact, if a particular node is running very slowly, Hadoop will restart its work on another server with a copy of the data.

Summary

Hadoop’s MapReduce and HDFS use simple, robust techniques on inexpensive computer systems to deliver very high data availability and to analyze enormous amounts of information quickly. Hadoop offers enterprises a powerful new tool for managing big data.

For more information, please contact Cloudera at info@cloudera.com or http://www.cloudera.com/

[Sidebar:] Hadoop takes advantage of HDFS’s data distribution strategy to push work out to many nodes in a cluster. This allows analyses to run in parallel and eliminates the bottlenecks imposed by monolithic storage systems.


Page 12: 20091030nasajpl

Hadoop Subprojects
▪ Avro
▪ Cross-language framework for RPC and serialization
▪ HBase
▪ Table storage on top of HDFS, modeled after Google’s BigTable
▪ Hive
▪ SQL interface to structured data stored in HDFS
▪ Pig
▪ Language for data flow programming; also Owl, Zebra, SQL
▪ Zookeeper
▪ Coordination service for distributed systems (see the Java sketch below)
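Not on the original slide: a minimal sketch of the kind of coordination Zookeeper provides, using its Java client to register a worker under an ephemeral znode. The ensemble address, the /workers path, and the advertised host:port are illustrative assumptions; the /workers parent znode is assumed to already exist.

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkRegister {
  public static void main(String[] args) throws Exception {
    final CountDownLatch connected = new CountDownLatch(1);

    // Connect to a ZooKeeper ensemble; the address is illustrative
    ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 10000, new Watcher() {
      public void process(WatchedEvent event) {
        if (event.getState() == Event.KeeperState.SyncConnected) {
          connected.countDown();
        }
      }
    });
    connected.await();  // wait until the session is established

    // Ephemeral, sequential znode: removed automatically if this process dies,
    // so live workers can be discovered simply by listing the parent node
    String path = zk.create("/workers/worker-", "10.0.0.5:50060".getBytes(),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
    System.out.println("Registered as " + path);
    System.out.println("Live workers: " + zk.getChildren("/workers", false));

    zk.close();
  }
}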


Page 13: 20091030nasajpl

Hadoop Community Support
▪ 185+ contributors to the open source code base
▪ ~60 engineers at Yahoo!, ~15 at Facebook, ~15 at Cloudera
▪ Over 500 (paid!) attendees at Hadoop World NYC
▪ Hadoop World Beijing later this month
▪ Three books (O’Reilly, Apress, Manning)
▪ Training videos free online
▪ Regular user group meetups in many cities
▪ University courses across the world
▪ Growing consultant and systems integrator expertise
▪ Commercial training, certification, and support from Cloudera


Page 14: 20091030nasajpl

Hadoop Project Mechanics
▪ Trademark owned by ASF; Apache 2.0 license for code
▪ Rigorous unit, smoke, performance, and system tests
▪ Release cycle of 3 months (-ish)
▪ Last major release: 0.20.0 on April 22, 2009
▪ 0.21.0 will be last release before 1.0; nearly complete
▪ Subprojects on different release cycles
▪ Releases put to a vote according to Apache guidelines
▪ Releases made available as tarballs on Apache and mirrors
▪ Cloudera packages own release for many platforms
▪ RPM and Debian packages; AMI for Amazon’s EC2


Page 15: 20091030nasajpl

Hadoop at Facebook
Early 2006: The First Research Scientist
▪ Source data living on horizontally partitioned MySQL tier
▪ Intensive historical analysis difficult
▪ No way to assess impact of changes to the site
▪ First try: Python scripts pull data into MySQL
▪ Second try: Python scripts pull data into Oracle
▪ ...and then we turned on impression logging


Page 16: 20091030nasajpl

Facebook Data Infrastructure
2007

[Diagram:]
Oracle Database Server
Data Collection Server
MySQL Tier
Scribe Tier


Page 17: 20091030nasajpl

Facebook Data Infrastructure
2008

[Diagram:]
MySQL Tier
Scribe Tier
Hadoop Tier
Oracle RAC Servers


Page 18: 20091030nasajpl

Major Data Team Workloads
▪ Data collection
▪ server logs
▪ application databases
▪ web crawls
▪ Thousands of multi-stage processing pipelines
▪ Summaries consumed by external users
▪ Summaries for internal reporting
▪ Ad optimization pipeline
▪ Experimentation platform pipeline
▪ Ad hoc analyses


Page 19: 20091030nasajpl

Workload Statistics
Facebook 2009

▪ Largest cluster running Hive: 4,800 cores, 5.5 PB of storage
▪ 4 TB of compressed new data added per day
▪ 135 TB of compressed data scanned per day
▪ 7,500+ Hive jobs per day
▪ 80K compute hours per day
▪ Around 200 people per month run Hive jobs

(data from Ashish Thusoo’s Hadoop World NYC presentation)


Page 20: 20091030nasajpl

Hadoop at Yahoo!
▪ Jan 2006: Hired Doug Cutting
▪ Apr 2006: Sorted 1.9 TB on 188 nodes in 47 hours
▪ Apr 2008: Sorted 1 TB on 910 nodes in 209 seconds
▪ Aug 2008: Deployed 4,000 node Hadoop cluster
▪ May 2009: Sorted 1 TB on 1,460 nodes in 62 seconds
▪ Sorted 1 PB on 3,658 nodes in 16.25 hours
▪ Other data points
▪ Over 25,000 nodes running Hadoop across 17 clusters
▪ Hundreds of thousands of jobs per day from over 600 users
▪ 82 PB of data


Page 21: 20091030nasajpl

Example Hadoop Applications
▪ Yahoo!
▪ Yahoo! Search Webmap
▪ Content and ad targeting optimization
▪ Facebook
▪ Fraud and abuse detection
▪ Lexicon (text mining)
▪ Cloudera
▪ Facial recognition for automatic tagging
▪ Genome sequence analysis
▪ Financial services, government, telco, scientific data


Page 22: 20091030nasajpl

Cloudera Offerings
Only One Slide, I Promise

▪ Two software products
▪ Cloudera’s Distribution for Hadoop
▪ Cloudera Desktop
▪ ...more on the way
▪ Training and Certification
▪ For Developers, Operators, and Managers
▪ Support
▪ Professional services


Page 23: 20091030nasajpl

Cloudera Desktop
Big Data can be Beautiful


Page 24: 20091030nasajpl

(c) 2009 Cloudera, Inc. or its licensors. “Cloudera” is a registered trademark of Cloudera, Inc. All rights reserved. 1.0
