8/12/2019 Xiaoxiao Hadoop
Hadoop and its Real-world
Applications
Xiaoxiao Shi, Guan Wang
Experience: worked at Yahoo! in summer 2010 on developing Hadoop-based machine learning models.
Contents
Motivation of Hadoop
History of Hadoop
Current applications of Hadoop
Programming examples
Research with Hadoop
Conclusions
Motivation of Hadoop
How do you scale up applications?
Run jobs processing 100s of terabytes of data
Takes 11 days to read on 1 computer
Need lots of cheap computers
Fixes the speed problem (15 minutes on 1000 computers), but brings reliability problems
In large clusters, computers fail every day
Cluster size is not fixed
Need common infrastructure
Must be efficient and reliable
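The "11 days vs. 15 minutes" claim above can be sanity-checked with back-of-the-envelope arithmetic. The figures below are illustrative assumptions (100 TB of data, roughly 100 MB/s of sustained sequential read per machine), not numbers from the slide beyond the data size:

```java
// Rough scan-time math: one machine vs. a 1000-machine cluster.
// Assumed: 100 TB of input, ~100 MB/s sustained read per machine.
public class ScanTime {
    public static void main(String[] args) {
        double dataBytes = 100e12;      // 100 TB
        double readBytesPerSec = 100e6; // ~100 MB/s per machine (assumption)

        double oneMachineDays = dataBytes / readBytesPerSec / 86400;
        double clusterMinutes = dataBytes / (readBytesPerSec * 1000) / 60;

        System.out.printf("1 machine: %.1f days%n", oneMachineDays);        // ~11.6 days
        System.out.printf("1000 machines: %.1f minutes%n", clusterMinutes); // ~16.7 minutes
    }
}
```

With these assumptions the result lands close to the slide's numbers (about 11.6 days on one machine, under 20 minutes on a thousand), assuming perfectly parallel reads.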
Motivation of Hadoop
Open Source Apache Project
Hadoop Core includes:
Distributed File System - distributes data
Map/Reduce - distributes application
Written in Java
Runs on Linux, Mac OS X, Windows, and Solaris
Runs on commodity hardware
Fun Fact of Hadoop
"The name my kid gave a stuffed yellowelephant. Short, relatively easy to spelland pronounce, meaningless, and not used
elsewhere: those are my namingcriteria. Kids are good at generating such.Googol is a kids term."
---- Doug Cutting, Hadoop projectcreator
http://hadoop.apache.org/
History of Hadoop
Doug Cutting created Apache Nutch. When Google published the map-reduce paper in 2004, he recognized it as an important technique and extended Nutch with it -- the great journey begins.
History of Hadoop
Yahoo! became the primary contributor in 2006.
History of Hadoop
Yahoo! deployed large-scale science clusters in 2007.
Tons of Yahoo! Research papers emerge: WWW, CIKM, SIGIR, VLDB.
Yahoo! began running major production jobs in Q1 2008.
Nowadays
Nowadays
Ads Optimization
Content Optimization
Search Index
Content Feed Processing
Machine Learning (e.g. spam filters)
When you visit Yahoo!, you are interacting with data processed with Hadoop!
Nowadays
Yahoo! has ~20,000 machines running Hadoop
The largest clusters are currently 2,000 nodes
Several petabytes of user data (compressed, unreplicated)
Yahoo! runs hundreds of thousands of jobs every month
Nowadays
Who uses Hadoop?
Amazon/A9
AOL
Fox Interactive Media
IBM
New York Times
PowerSet (now Microsoft)
Quantcast
Rackspace/Mailtrust
Veoh
Yahoo!
More at http://wiki.apache.org/hadoop/PoweredBy
Nowadays (job market on Nov 15th)
Software Developer Intern - IBM - Somers, NY +3 locations - Agile development - Big data / Hadoop / data analytics a plus
Software Developer - IBM - San Jose, CA +4 locations - include Hadoop-powered distributed parallel data processing system, big data analytics ... multiple technologies, including Hadoop
It is important
Details
Nowadays
Hadoop Core: Distributed File System and MapReduce framework
Pig (initiated by Yahoo!): parallel programming language and runtime
HBase (initiated by Powerset): table storage for semi-structured data
ZooKeeper (initiated by Yahoo!): coordinating distributed systems
Hive (initiated by Facebook): SQL-like query language and metastore
HDFS
Hadoop's Distributed File System is designed to reliably store very large files across machines in a large cluster. It is inspired by the Google File System. Hadoop DFS stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. Blocks belonging to a file are replicated for fault tolerance. The block size and replication factor are configurable per file. Files in HDFS are "write once" and have strictly one writer at any time.
Hadoop Distributed File System goals:
Store large data sets
Cope with hardware failure
Emphasize streaming data access
Typical Hadoop Structure
Commodity hardware: Linux PCs with 4 local disks
Typically a 2-level architecture: 40 nodes/rack
Uplink from rack is 8 gigabit
Rack-internal is 1 gigabit, all-to-all
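These topology numbers imply that cross-rack bandwidth is scarce relative to in-rack bandwidth, which is why Hadoop tries to keep computation near its data. A quick calculation with the slide's figures:

```java
// Oversubscription math for the 2-level topology above, using the
// slide's figures: 40 nodes/rack, 1 Gb/s per node in-rack, 8 Gb/s uplink.
public class RackMath {
    public static void main(String[] args) {
        int nodesPerRack = 40;
        double nodeGbps = 1.0;   // rack-internal, all-to-all
        double uplinkGbps = 8.0; // uplink from the rack

        // If every node in a rack talks off-rack at once, each gets
        // only uplink/nodes of its link: the uplink is oversubscribed.
        double oversubscription = nodesPerRack * nodeGbps / uplinkGbps;
        System.out.println("Cross-rack oversubscription: " + oversubscription + ":1"); // 5.0:1
    }
}
```

A 5:1 oversubscription means a map task reading its input from another rack can see one fifth of the bandwidth of a rack-local read, motivating the locality optimizations discussed later.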
Hadoop structure
Single namespace for the entire cluster, managed by a single namenode
Files are single-writer and append-only
Optimized for streaming reads of large files
Files are broken into large blocks, typically 128 MB
Replicated to several datanodes for reliability
Clients talk to both the namenode and the datanodes
Data is not sent through the namenode
Throughput of the file system scales nearly linearly with the number of nodes
Access from Java, C, or the command line
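To make the block layout concrete, here is a small worked example. The 128 MB block size is from the slide; the 1 GB file size and the replication factor of 3 (a common HDFS default) are illustrative assumptions:

```java
// Block math for the HDFS layout described above.
// Assumed: a hypothetical 1 GB file, 128 MB blocks, 3x replication.
public class BlockMath {
    public static void main(String[] args) {
        long fileBytes = 1_000_000_000L;      // hypothetical 1 GB file
        long blockBytes = 128L * 1024 * 1024; // 128 MB block size (from the slide)
        int replication = 3;                  // assumed replication factor

        // Ceiling division: every block but the last is full-size.
        long blocks = (fileBytes + blockBytes - 1) / blockBytes;
        long rawBytes = fileBytes * replication;

        System.out.println(blocks + " blocks");            // 8 blocks
        System.out.println(rawBytes + " bytes stored raw"); // 3000000000 (3 GB)
    }
}
```

So one logical gigabyte becomes 8 blocks and, with 3x replication, about 3 GB of raw disk across the datanodes.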
Hadoop Structure
Java and C++ APIs: in Java use objects, in C++ bytes
Each task can process data sets larger than RAM
Automatic re-execution on failure
In a large cluster, some nodes are always slow or flaky
The framework re-executes failed tasks
Locality optimizations
Map-Reduce queries HDFS for locations of input data
Map tasks are scheduled close to the inputs when possible
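The locality preference can be sketched in a few lines. This is a toy model, not Hadoop's actual scheduler: given the datanodes that hold a block's replicas, prefer one that also has a free task slot, and fall back to any free node (paying a remote read) otherwise.

```java
import java.util.List;

// Toy sketch of locality-aware placement (NOT Hadoop's real scheduler):
// prefer a free node that already stores a replica of the input block.
public class LocalityPick {
    static String pickNode(List<String> replicaNodes, List<String> freeNodes) {
        for (String node : replicaNodes) {
            if (freeNodes.contains(node)) {
                return node; // data-local: the map reads the block from local disk
            }
        }
        // No replica holder is free: run anywhere and read over the network.
        return freeNodes.isEmpty() ? null : freeNodes.get(0);
    }

    public static void main(String[] args) {
        List<String> replicas = List.of("node3", "node7", "node9");
        List<String> free = List.of("node1", "node7");
        System.out.println(pickNode(replicas, free)); // node7 (data-local)
    }
}
```

The real framework adds a middle tier (rack-local before off-rack), but the principle is the same: move the computation to the data, not the data to the computation.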
Example of Hadoop Programming
Word Count:
I like parallel computing. I also took courses
on parallel computing
Parallel: 2
Computing: 2
I: 2
Like: 1
Example of Hadoop Programming
Intuition: design
Assume each node will process a paragraph
Map: What is the key?
What is the value?
Reduce: What to collect?
What to reduce?
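The answers to the questions above: the map key is a word, the map value is the count 1, and reduce sums the 1s per key. Before looking at the real Hadoop code, the design can be simulated in plain Java (no Hadoop required); lower-casing and punctuation-stripping are illustrative choices so that "computing." and "computing" count together:

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the word-count design: "map" emits (word, 1)
// per token, and the merge call plays the role of "reduce" (sum per key).
public class WordCountSketch {
    public static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.split("\\s+")) {
            String word = token.toLowerCase().replaceAll("[^a-z]", "");
            if (word.isEmpty()) continue;
            counts.merge(word, 1, Integer::sum); // the "reduce" step
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "I like parallel computing. I also took courses on parallel computing";
        System.out.println(count(text));
        // {also=1, computing=2, courses=1, i=2, like=1, on=1, parallel=2, took=1}
    }
}
```

Hadoop's contribution is not this logic but running it over terabytes: the framework splits the input, runs many map tasks in parallel, and routes all pairs with the same key to the same reduce task.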
Word Count Example

public class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable ONE = new IntWritable(1);

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> out,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      out.collect(new Text(itr.nextToken()), ONE);
    }
  }
}
Word Count Example

public class ReduceClass extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> out,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    out.collect(key, new IntWritable(sum));
  }
}
Word Count Example
public static void main(String[] args) throws Exception {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");

  conf.setMapperClass(MapClass.class);
  conf.setCombinerClass(ReduceClass.class);
  conf.setReducerClass(ReduceClass.class);

  FileInputFormat.setInputPaths(conf, args[0]);
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));

  conf.setOutputKeyClass(Text.class);          // output keys are words (strings)
  conf.setOutputValueClass(IntWritable.class); // output values are counts

  JobClient.runJob(conf);
}
Hadoop in Yahoo!
                   Before Hadoop    After Hadoop
Time               26 days          20 minutes
Language           C++              Python
Development Time   2-3 weeks        2-3 days

The database for Search Assist is built using Hadoop:
3 years of log data
20 steps of map-reduce
Related research of Hadoop
Conference tutorials:
KDD Tutorial: Modeling with Hadoop, KDD 2011 (top conference in data mining)
Strata Tutorial: How to Develop Big Data Applications for Hadoop
OSCON Tutorial: Introduction to Hadoop
Papers:
Scalable distributed inference of dynamic user interests for behavioral targeting. KDD 2011: 114-122
Yucheng Low, Deepak Agarwal, Alexander J. Smola: Multiple domain user personalization. KDD 2011: 123-131
Shuang-Hong Yang, Bo Long, Alexander J. Smola, Hongyuan Zha, Zhaohui Zheng: Collaborative competitive filtering: learning recommender using context of user choice. SIGIR 2011: 295-304
Srinivas Vadrevu, Choon Hui Teo, Suju Rajan, Kunal Punera, Byron Dom, Alexander J. Smola, Yi Chang, Zhaohui Zheng: Scalable clustering of news search results. WSDM 2011: 675-684
Shuang-Hong Yang, Bo Long, Alexander J. Smola, Narayanan Sadagopan, Zhaohui Zheng, Hongyuan Zha: Like like alike: joint friendship and interest propagation in social networks. WWW 2011: 537-546
Amr Ahmed, Alexander J. Smola: WWW 2011 invited tutorial overview: latent variable models on the internet. WWW (Companion Volume) 2011: 281-282
Daniel Hsu, Nikos Karampatziakis, John Langford, Alexander J. Smola: Parallel Online Learning. CoRR abs/1103.4204 (2011)
Neethu Mohandas, Sabu M. Thampi: Improving Hadoop Performance in Handling Small Files. ACC 2011: 187-194
Tomasz Wiktor Wlodarczyk, Yi Han, Chunming Rong: Performance Analysis of Hadoop for Query Processing. AINA Workshops 2011: 507-513
All just this year! 2011!
For more information:
http://hadoop.apache.org/
http://developer.yahoo.com/hadoop/
Who uses Hadoop?:
http://wiki.apache.org/hadoop/PoweredBy