Page 1
Data-Intensive Computing for Text Analysis CS395T / INF385T / LIN386M
University of Texas at Austin, Fall 2011
Lecture 4 September 15, 2011
Matt Lease
School of Information
University of Texas at Austin
ml at ischool dot utexas dot edu
Jason Baldridge
Department of Linguistics
University of Texas at Austin
Jasonbaldridge at gmail dot com
Page 2
Acknowledgments
Course design and slides based on Jimmy Lin’s cloud computing courses at the University of Maryland, College Park
Some figures courtesy of the following excellent Hadoop books (order yours today!)
• Chuck Lam’s Hadoop In Action (2010)
• Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010)
Page 3
Today’s Agenda
• Practical Hadoop
– Input/Output
– Splits: small file and whole file operations
– Compression
– Mounting HDFS
– Hadoop Workflow and EC2/S3
Page 5
“Hello World”: Word Count
Map(String docid, String text):
for each word w in text:
Emit(w, 1);
Reduce(String term, Iterator<Int> values):
int sum = 0;
for each v in values:
sum += v;
Emit(term, sum);
map ( K1=String, V1=String ) → list ( K2=String, V2=Integer )
reduce ( K2=String, list(V2=Integer) ) → list ( K3=String, V3=Integer )
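The pseudocode above can be mimicked in plain Java collections (no Hadoop involved) to check the map/shuffle/reduce logic in-process; class and method names here are illustrative only:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WordCountSketch {
    // map: emit (word, 1) for each whitespace-separated token in the text
    static List<Map.Entry<String, Integer>> map(String docid, String text) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : text.split("\\s+")) {
            if (!w.isEmpty()) out.add(Map.entry(w, 1));
        }
        return out;
    }

    // reduce: sum the counts grouped under each term
    static int reduce(String term, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // driver: the "shuffle" groups map output by key before reducing
    static Map<String, Integer> run(Map<String, String> docs) {
        Map<String, List<Integer>> groups = new HashMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (Map.Entry<String, Integer> kv : map(doc.getKey(), doc.getValue())) {
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, List<Integer>> g : groups.entrySet()) {
            counts.put(g.getKey(), reduce(g.getKey(), g.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(run(Map.of("d1", "a b a", "d2", "b c")));
    }
}
```

In real Hadoop the grouping is done by the framework between the map and reduce phases; the driver above just makes that step explicit.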
Page 6
Courtesy of Chuck Lam’s Hadoop In Action (2010), p. 17
Page 7
Courtesy of Chuck Lam’s Hadoop In Action (2010), pp. 48-49
Page 8
Courtesy of Chuck Lam’s Hadoop In Action (2010), p. 51
Page 9
Courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), p. 191
Page 10
Command-Line Parsing
Courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), p. 135
Page 11
Data Types in Hadoop
Writable — defines a de/serialization protocol. Every data type in Hadoop is a Writable.
WritableComparable — defines a sort order. All keys must be of this type (but not values).
IntWritable, LongWritable, Text, … — concrete classes for different data types.
SequenceFile — binary encoding of a sequence of key/value pairs.
Page 12
Hadoop basic types
Courtesy of Chuck Lam’s Hadoop In Action (2010), p. 46
Page 13
Complex Data Types in Hadoop
How do you implement complex data types?
The easiest way:
Encode it as Text, e.g., (a, b) = “a:b”
Use regular expressions to parse and extract data
Works, but pretty hack-ish
The hard way:
Define a custom implementation of WritableComparable
Must implement: readFields, write, compareTo
Computationally efficient, but slower for rapid prototyping
Alternatives:
Cloud9 offers two other choices: Tuple and JSON
(Actually, not that useful in practice)
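The three required methods boil down to the sketch below, written against the stdlib DataInput/DataOutput interfaces that Hadoop's serialization is built on. The class name is made up; in a real job it would declare `implements WritableComparable<StringPair>` and typically hold Hadoop Text fields:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Sketch of a custom pair-of-strings key; in Hadoop this would implement
// org.apache.hadoop.io.WritableComparable rather than plain Comparable.
class StringPair implements Comparable<StringPair> {
    String first, second;

    StringPair() {}                                  // no-arg constructor required for deserialization
    StringPair(String f, String s) { first = f; second = s; }

    // write: serialize the fields in a fixed order
    public void write(DataOutput out) throws IOException {
        out.writeUTF(first);
        out.writeUTF(second);
    }

    // readFields: deserialize in exactly the order write() used
    public void readFields(DataInput in) throws IOException {
        first = in.readUTF();
        second = in.readUTF();
    }

    // compareTo: defines the sort order applied to keys during the shuffle
    public int compareTo(StringPair o) {
        int c = first.compareTo(o.first);
        return c != 0 ? c : second.compareTo(o.second);
    }
}
```

The no-arg constructor matters: the framework creates an empty instance and then calls readFields on it.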
Page 14
InputFormat & RecordReader
Note re-use key & value objects!
Courtesy of Tom White’s
Hadoop: The Definitive Guide,
2nd Edition (2010), pp. 198-199
Split is logical; atomic
records are never split
Page 15
Courtesy of Tom White’s
Hadoop: The Definitive Guide,
2nd Edition (2010), p. 201
Page 16
Input
Courtesy of Chuck Lam’s Hadoop In Action (2010), p. 53
Page 17
Output
Courtesy of Chuck Lam’s Hadoop In Action (2010), p. 58
Page 18
Source: redrawn from a slide by Cloudera, cc-licensed
[Figure: each Reducer writes its own Output File through a RecordWriter supplied by the OutputFormat]
Page 19
Creating Input Splits (White p. 202-203)
FileInputFormat: large files split into blocks
isSplitable() – default TRUE
computeSplitSize() = max(minSize, min(maxSize,blockSize) )
getSplits()…
How to prevent splitting?
Option 1: set mapred.min.split.size=Long.MAX_VALUE
Option 2: subclass FileInputFormat, override isSplitable() to return false
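Option 2 is only a few lines against the old (mapred) API; this is an untested configuration-style fragment, and the class name is invented:

```java
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// A TextInputFormat that refuses to split its files,
// so each input file goes to exactly one mapper.
class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }
}
```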
Page 20
How to process whole file as a single record?
e.g. file conversion
Preventing splitting is necessary, but not sufficient
Need a RecordReader that delivers entire file as a record
Implement WholeFile input format & record reader recipe
See White pp. 206-209
Overrides getRecordReader() in FileInputFormat
Defines new WholeFileRecordReader
Page 21
Small Files
Files < Hadoop block size are never split (by default)
Note this is with default mapred.min.split.size = 1 byte
Could extend FileInputFormat to override this behavior
Using many small files inefficient in Hadoop
Overhead for TaskTracker, JobTracker, Map object, …
Requires more disk seeks
Wasteful for NameNode memory
How to deal with small files??
Page 22
Dealing with small files
Pre-processing: merge into one or more bigger files
Doubles disk space, unless clever (can delete after merge)
Create Hadoop Archive (White pp. 72-73)
• Doesn’t solve splitting problem, just reduces NameNode memory
Simple text: just concatenate (e.g. each record on a single line)
XML: concatenate, specify start/end tags
StreamXmlRecordReader (analogous to newline acting as the record delimiter for plain text)
Create a SequenceFile (see White pp. 117-118)
• Sequence of records, all with same (key,value) type
• E.g. Key=filename, Value=text or bytes of original file
• Can also use for larger files, e.g. if block processing is really fast
Use CombineFileInputFormat
Reduces map overhead, but not seeks or NameNode memory…
Only an abstract class provided, you get to implement it… :-<
Could use to speed up the pre-processing above…
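The SequenceFile idea (key = filename, value = bytes, records packed back to back in one big file) can be illustrated with a plain-Java stand-in that length-prefixes each record; the real SequenceFile adds header metadata, sync markers, and optional compression. Class and method names here are made up:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

class TinySeqFile {
    // pack: merge many small (filename, contents) pairs into one stream,
    // each record as UTF filename + length-prefixed bytes
    static byte[] pack(Map<String, byte[]> files) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        for (Map.Entry<String, byte[]> e : files.entrySet()) {
            out.writeUTF(e.getKey());           // key: original filename
            out.writeInt(e.getValue().length);  // value length prefix
            out.write(e.getValue());            // value: file contents
        }
        return buf.toByteArray();
    }

    // unpack: recover the individual small files from the merged stream
    static Map<String, byte[]> unpack(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        Map<String, byte[]> files = new LinkedHashMap<>();
        while (in.available() > 0) {
            String name = in.readUTF();
            byte[] value = new byte[in.readInt()];
            in.readFully(value);
            files.put(name, value);
        }
        return files;
    }
}
```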
Page 23
Multiple File Formats?
What if you have multiple formats for same content type?
MultipleInputs (White pp. 214-215)
Specify InputFormat & Mapper to use on a per-path basis
• Path could be a directory or a single file
• Even a single file could have many records (e.g. Hadoop archive or
SequenceFile)
All mappers must have the same output signature!
• Same reducer used for all (only input format is different, not the
logical records being processed by the different mappers)
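The per-path wiring looks roughly like this (untested old-API fragment; the paths and the PlainTextMapper/SequenceMapper classes are hypothetical):

```java
// Each path gets its own InputFormat and Mapper; both mappers must
// emit the same (key, value) output types for the shared reducer.
MultipleInputs.addInputPath(conf, new Path("/data/plaintext"),
    TextInputFormat.class, PlainTextMapper.class);
MultipleInputs.addInputPath(conf, new Path("/data/sequence"),
    SequenceFileInputFormat.class, SequenceMapper.class);
```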
What about multiple file formats stored in the same
Archive or SequenceFile?
Multiple formats stored in the same directory?
How are multiple file types typically handled in general?
e.g. factory pattern, White p. 80
Page 24
Data Compression
Big data = big disk space & I/O-bound transfer times
Affects both intermediate (mapper output) and persistent data
Compression makes big data less big (but still cool)
Often 1/4th size of original data
Main issues
Does the compression format support splitting?
• What happens to parallelization if an entire 8GB compressed file has
to be decompressed before we can access the splits?
Compression/decompression ratio vs. speed
• More compression reduces disk space and transfer times, but…
• Slow compression can take longer than reduced transfer time savings
• Use native libraries!
White 77-86, Lam 153-155
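The ratio-vs-speed tradeoff is easy to see even with the stdlib Deflater: level 1 (BEST_SPEED) compresses less but faster than level 9 (BEST_COMPRESSION). A small sketch, with invented names, on synthetic redundant text:

```java
import java.util.zip.Deflater;

class CompressionTradeoff {
    // compress data at the given level, returning compressed size in bytes
    static int compressedSize(byte[] data, int level) {
        Deflater d = new Deflater(level);
        d.setInput(data);
        d.finish();
        byte[] buf = new byte[data.length * 2 + 64];
        int total = 0;
        while (!d.finished()) {
            total += d.deflate(buf);  // drain until all input is consumed
        }
        d.end();
        return total;
    }

    public static void main(String[] args) {
        byte[] text = "the quick brown fox ".repeat(1000).getBytes();
        System.out.println("fast (level 1): " + compressedSize(text, Deflater.BEST_SPEED));
        System.out.println("best (level 9): " + compressedSize(text, Deflater.BEST_COMPRESSION));
    }
}
```

The same pattern explains the LZO-vs-bzip2 numbers on the next slide: a lighter algorithm trades a few percent of ratio for a large speed win.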
Page 25
Slow; decompression can’t keep pace with disk reads
Courtesy of Tom White’s
Hadoop: The Definitive Guide,
2nd Edition (2010), Ch. 4
Page 26
Compression Speed
LZO 2x faster than gzip
LZO ~15-20x faster than bzip2
http://arunxjacob.blogspot.com/2011/04/rolling-out-splittable-lzo-on-cdh3.html
http://www.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/
Page 27
Splittable LZO to the rescue
LZO format not internally splittable, but we can create
a separate, accompanying index of split points
Recipe
Get LZO from Cloudera or elsewhere, and setup
See URL on last slide for instructions
LZO compress files, copy to HDFS at /path
Index them: $ hadoop jar /path/to/hadoop-lzo.jar com.hadoop.compression.lzo.LzoIndexer /path
Use hadoop-lzo’s LzoTextInputFormat instead of TextInputFormat
Voila!
Page 28
Compression API for persistent data
JobConf helper functions –or– set properties
Input
conf.setInputFormat(LzoTextInputFormat.class);
Persistent (reducer) output
FileOutputFormat.setCompressOutput(conf, true)
FileOutputFormat.setOutputCompressorClass(conf, LzopCodec.class)
Courtesy of Tom White’s
Hadoop: The Definitive Guide,
2nd Edition (2010), p. 85
Page 29
Compression API for intermediate data
Similar JobConf helper functions –or– set properties
conf.setCompressMapOutput(true)
conf.setMapOutputCompressorClass(LzopCodec.class)
Courtesy of Chuck Lam’s
Hadoop In Action(2010),
pp. 153-155
Page 30
SequenceFile & compression
Use SequenceFile for passing data between Hadoop jobs
Optimized for this usage case
conf.setOutputFormat(SequenceFileOutputFormat.class)
With compression, one more parameter to set
Default compression is per-record; compressing on a per-block basis is almost always preferable
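With the old JobConf API, switching to per-block compression is one extra call (untested fragment; `conf` is assumed to be an existing JobConf):

```java
conf.setOutputFormat(SequenceFileOutputFormat.class);
// Default is RECORD; BLOCK compresses runs of records together,
// which almost always yields a better compression ratio.
SequenceFileOutputFormat.setOutputCompressionType(conf,
    SequenceFile.CompressionType.BLOCK);
```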
Page 31
See White p. 50;
hadoop: src/contrib/fuse-dfs
From “hadoop fs” commands to a mounted HDFS
Page 32
Hadoop Workflow
Hadoop Cluster You
1. Load data into HDFS
2. Develop code locally
3. Submit MapReduce job 3a. Go back to Step 2
4. Retrieve data from HDFS
Page 33
On Amazon: With EC2
You
1. Load data into HDFS
2. Develop code locally
3. Submit MapReduce job 3a. Go back to Step 2
4. Retrieve data from HDFS
0. Allocate Hadoop cluster
EC2
Your Hadoop Cluster
5. Clean up!
Uh oh. Where did the data go?
Page 34
On Amazon: EC2 and S3
Your Hadoop Cluster
S3 (Persistent Store)
EC2 (The Cloud)
Copy from S3 to HDFS
Copy from HDFS to S3