Lecture 4: Data-Intensive Computing for Text Analysis (Fall 2011)

Page 1: Lecture 4: Data-Intensive Computing for Text Analysis (Fall 2011)

Data-Intensive Computing for Text Analysis CS395T / INF385T / LIN386M

University of Texas at Austin, Fall 2011

Lecture 4 September 15, 2011

Matt Lease

School of Information

University of Texas at Austin

ml at ischool dot utexas dot edu

Jason Baldridge

Department of Linguistics

University of Texas at Austin

Jasonbaldridge at gmail dot com

Page 2: Lecture 4: Data-Intensive Computing for Text Analysis (Fall 2011)

Acknowledgments

Course design and slides based on Jimmy Lin’s cloud computing courses at the University of Maryland, College Park

Some figures courtesy of the following excellent Hadoop books (order yours today!)

• Chuck Lam’s Hadoop In Action (2010)

• Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010)

Page 3: Lecture 4: Data-Intensive Computing for Text Analysis (Fall 2011)

Today’s Agenda

• Practical Hadoop

– Input/Output

– Splits: small file and whole file operations

– Compression

– Mounting HDFS

– Hadoop Workflow and EC2/S3

Page 4: Lecture 4: Data-Intensive Computing for Text Analysis (Fall 2011)

Practical Hadoop

Page 5: Lecture 4: Data-Intensive Computing for Text Analysis (Fall 2011)

“Hello World”: Word Count

Map(String docid, String text):
    for each word w in text:
        Emit(w, 1);

Reduce(String term, Iterator<Int> values):
    int sum = 0;
    for each v in values:
        sum += v;
    Emit(term, sum);

map (K1=String, V1=String) → list (K2=String, V2=Integer)
reduce (K2=String, list(V2=Integer)) → list (K3=String, V3=Integer)
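
In Java, against the older org.apache.hadoop.mapred API used in Lam’s examples, the same program might look roughly like the sketch below. Note that with TextInputFormat the map input key is actually a byte offset (LongWritable) rather than a String docid, and the driver/job setup is omitted; this is a minimal sketch, not the course’s reference implementation.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

  // map(offset, line) -> (word, 1) for every token in the line
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      StringTokenizer tok = new StringTokenizer(value.toString());
      while (tok.hasMoreTokens()) {
        word.set(tok.nextToken());
        output.collect(word, ONE);
      }
    }
  }

  // reduce(word, [1, 1, ...]) -> (word, sum)
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
}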

Page 6: Lecture 4: Data-Intensive Computing for Text Analysis (Fall 2011)

Courtesy of Chuck Lam’s Hadoop In Action (2010), p. 17

Page 7: Lecture 4: Data-Intensive Computing for Text Analysis (Fall 2011)

Courtesy of Chuck Lam’s Hadoop In Action (2010), pp. 48-49

Page 8: Lecture 4: Data-Intensive Computing for Text Analysis (Fall 2011)

Courtesy of Chuck Lam’s Hadoop In Action (2010), p. 51

Page 9: Lecture 4: Data-Intensive Computing for Text Analysis (Fall 2011)

Courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), p. 191

Page 10: Lecture 4: Data-Intensive Computing for Text Analysis (Fall 2011)

Command-Line Parsing

Courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), p. 135

Page 11: Lecture 4: Data-Intensive Computing for Text Analysis (Fall 2011)

Data Types in Hadoop

Writable: defines a de/serialization protocol. Every data type in Hadoop is a Writable.

WritableComparable: defines a sort order. All keys must be of this type (but not values).

IntWritable, LongWritable, Text: concrete classes for different data types.

SequenceFile: a binary encoding of a sequence of key/value pairs.

Page 12: Lecture 4: Data-Intensive Computing for Text Analysis (Fall 2011)

Hadoop basic types

Courtesy of Chuck Lam’s Hadoop In Action (2010), p. 46

Page 13: Lecture 4: Data-Intensive Computing for Text Analysis (Fall 2011)

Complex Data Types in Hadoop

How do you implement complex data types?

The easiest way:

Encode it as Text, e.g., (a, b) = “a:b”

Use regular expressions to parse and extract data

Works, but pretty hack-ish

The hard way:

Define a custom implementation of WritableComparable (see the sketch below)

Must implement: readFields, write, compareTo

Computationally efficient, but slower for rapid prototyping

Alternatives:

Cloud9 offers two other choices: Tuple and JSON

(Actually, not that useful in practice)
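
For the “hard way,” here is a minimal sketch of a custom WritableComparable for an (a, b) string pair. The class name StringPair and its fields are our own illustration (a hypothetical example, not Cloud9’s Tuple).

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical (a, b) string pair usable as a Hadoop key
public class StringPair implements WritableComparable<StringPair> {

  private final Text first = new Text();
  private final Text second = new Text();

  public void set(String a, String b) {
    first.set(a);
    second.set(b);
  }

  @Override
  public void write(DataOutput out) throws IOException {
    first.write(out);      // serialize both fields in a fixed order
    second.write(out);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    first.readFields(in);  // deserialize in the same order
    second.readFields(in);
  }

  @Override
  public int compareTo(StringPair other) {
    int cmp = first.compareTo(other.first);  // sort by first, then second
    return (cmp != 0) ? cmp : second.compareTo(other.second);
  }

  // If used as a map output key with the default HashPartitioner,
  // hashCode() and equals() should also be overridden consistently.

  @Override
  public String toString() {
    return first + ":" + second;
  }
}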

Page 14: Lecture 4: Data-Intensive Computing for Text Analysis (Fall 2011)

InputFormat & RecordReader

Courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), pp. 198-199

A split is logical; atomic records are never split across splits

Note: the framework re-uses the key & value objects across calls!
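
To see why the re-use matters, here is an illustrative reducer (a hypothetical example, old mapred API) that buffers its input values: because the framework recycles a single Text instance across the iterator, each value must be copied before it is stored.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Buffers all values for a key before emitting them
public class BufferingReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  public void reduce(Text key, Iterator<Text> values,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    List<Text> buffered = new ArrayList<Text>();
    while (values.hasNext()) {
      Text value = values.next();
      // buffered.add(value);          // WRONG: every element would point at the same re-used object
      buffered.add(new Text(value));   // copy the contents before storing
    }
    for (Text v : buffered) {
      output.collect(key, v);
    }
  }
}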

Page 15: Lecture 4: Data-Intensive Computing for Text Analysis (Fall 2011)

Courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), p. 201

Page 16: Lecture 4: Data-Intensive Computing for Text Analysis (Fall 2011)

Input

Courtesy of Chuck Lam’s Hadoop In Action (2010), p. 53

Page 17: Lecture 4: Data-Intensive Computing for Text Analysis (Fall 2011)

Output

Courtesy of Chuck Lam’s Hadoop In Action (2010), p. 58

Page 18: Lecture 4: Data-Intensive Computing for Text Analysis (Fall 2011)

Source: redrawn from a slide by Cloudera, cc-licensed

[Figure: each Reducer writes its Output File through a RecordWriter supplied by the OutputFormat]

Page 19: Lecture 4: Data-Intensive Computing for Text Analysis (Fall 2011)

Creating Input Splits (White p. 202-203)

FileInputFormat: large files are split into blocks

isSplitable() – default TRUE

computeSplitSize() = max(minSize, min(maxSize, blockSize))

getSplits()…

How to prevent splitting? (see the sketch below)

Option 1: set mapred.min.split.size = Long.MAX_VALUE

Option 2: subclass FileInputFormat, override isSplitable() to return FALSE
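
A minimal sketch of both options against the old mapred API; the class names here are placeholders.

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class NoSplitExamples {

  // Option 1: raise the minimum split size above any file size
  public static void disableSplittingViaConf(JobConf conf) {
    conf.setLong("mapred.min.split.size", Long.MAX_VALUE);
  }

  // Option 2: subclass FileInputFormat (here via TextInputFormat) and refuse to split
  public static class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
      return false;
    }
  }
}

Option 1 affects every FileInputFormat-based input in the job, while Option 2 only changes the specific input format you register.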

Page 20: Lecture 4: Data-Intensive Computing for Text Analysis (Fall 2011)

How to process a whole file as a single record?

e.g. file conversion

Preventing splitting is necessary, but not sufficient

Need a RecordReader that delivers the entire file as one record

Recipe: implement a WholeFile input format & record reader (White pp. 206-209; see the sketch below)

Overrides getRecordReader() in FileInputFormat

Defines a new WholeFileRecordReader
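
A compressed sketch of that recipe (old mapred API), delivering each file as one (NullWritable, BytesWritable) record; see White pp. 206-209 for the full version with error handling.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Hands each input file to a mapper as a single record
public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;  // never split: one file = one record
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new WholeFileRecordReader((FileSplit) split, job);
  }

  static class WholeFileRecordReader implements RecordReader<NullWritable, BytesWritable> {
    private final FileSplit split;
    private final Configuration conf;
    private boolean processed = false;

    WholeFileRecordReader(FileSplit split, Configuration conf) {
      this.split = split;
      this.conf = conf;
    }

    public boolean next(NullWritable key, BytesWritable value) throws IOException {
      if (processed) return false;  // only one record per split
      byte[] contents = new byte[(int) split.getLength()];
      Path file = split.getPath();
      FileSystem fs = file.getFileSystem(conf);
      FSDataInputStream in = null;
      try {
        in = fs.open(file);
        IOUtils.readFully(in, contents, 0, contents.length);
        value.set(contents, 0, contents.length);
      } finally {
        IOUtils.closeStream(in);
      }
      processed = true;
      return true;
    }

    public NullWritable createKey() { return NullWritable.get(); }
    public BytesWritable createValue() { return new BytesWritable(); }
    public long getPos() { return processed ? split.getLength() : 0; }
    public float getProgress() { return processed ? 1.0f : 0.0f; }
    public void close() throws IOException { }
  }
}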

Page 21: Lecture 4: Data-Intensive Computing for Text Analysis (Fall 2011)

Small Files

Files < Hadoop block size are never split (by default)

Note this is with the default mapred.min.split.size = 1 byte

Could extend FileInputFormat to override this behavior

Using many small files is inefficient in Hadoop

Overhead for TaskTracker, JobTracker, Map object, …

Requires more disk seeks

Wasteful of NameNode memory

How to deal with small files?

Page 22: Lecture 4: Data-Intensive Computing for Text Analysis (Fall 2011)

Dealing with small files

Pre-processing: merge into one or more bigger files

Doubles disk space, unless clever (can delete the originals after the merge)

Create a Hadoop Archive (White pp. 72-73)

• Doesn’t solve the splitting problem, just reduces NameNode memory

Simple text: just concatenate (e.g. each record on a single line)

XML: concatenate, specify start/end tags

StreamXmlRecordReader (the tags play the role that newline plays for plain text)

Create a SequenceFile (see White pp. 117-118, and the sketch below)

• Sequence of records, all with the same (key, value) type

• E.g. key = filename, value = text or bytes of the original file

• Can also use for larger files, e.g. if per-block processing is really fast

Use CombineFileInputFormat

Reduces map overhead, but not seeks or NameNode memory…

Only an abstract class is provided; you get to implement it… :-<

Could be used to speed up the pre-processing above…
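
As one way to do the SequenceFile pre-processing, the following hypothetical utility packs a directory of small files into a single SequenceFile with key = filename and value = raw bytes; the class name and paths are placeholders and error handling is simplified.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Packs every file directly under inputDir into one SequenceFile
public class SmallFilePacker {
  public static void pack(Path inputDir, Path seqFile) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, seqFile, Text.class, BytesWritable.class);
    try {
      for (FileStatus status : fs.listStatus(inputDir)) {
        if (status.isDir()) continue;  // skip subdirectories
        byte[] contents = new byte[(int) status.getLen()];
        FSDataInputStream in = fs.open(status.getPath());
        try {
          IOUtils.readFully(in, contents, 0, contents.length);
        } finally {
          IOUtils.closeStream(in);
        }
        writer.append(new Text(status.getPath().getName()),
                      new BytesWritable(contents));
      }
    } finally {
      writer.close();
    }
  }
}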

Page 23: Lecture 4: Data-Intensive Computing for Text Analysis (Fall 2011)

Multiple File Formats?

What if you have multiple formats for the same content type?

MultipleInputs (White pp. 214-215; see the sketch below)

Specify the InputFormat & Mapper to use on a per-path basis

• Path could be a directory or a single file

• Even a single file could contain many records (e.g. Hadoop archive or SequenceFile)

All mappers must have the same output signature!

• The same reducer is used for all (only the input format differs, not the logical records being processed by the different mappers)

What about multiple file formats stored in the same archive or SequenceFile?

Multiple formats stored in the same directory?

How are multiple file types typically handled in general?

e.g. factory pattern, White p. 80
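
A minimal sketch of MultipleInputs with the old mapred API; PlainTextMapper, SequenceRecordMapper, SharedReducer, and the paths are placeholders, and the remaining job setup (output path, output key/value classes) is omitted.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.MultipleInputs;

public class MultiFormatDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MultiFormatDriver.class);

    // Plain-text records in one directory, SequenceFile records in another;
    // each path gets its own InputFormat and Mapper (placeholders).
    MultipleInputs.addInputPath(conf, new Path("/data/plaintext"),
        TextInputFormat.class, PlainTextMapper.class);
    MultipleInputs.addInputPath(conf, new Path("/data/seqfiles"),
        SequenceFileInputFormat.class, SequenceRecordMapper.class);

    // Both mappers must emit the same (key, value) types; one reducer handles everything.
    conf.setReducerClass(SharedReducer.class);
    JobClient.runJob(conf);
  }
}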

Page 24: Lecture 4: Data-Intensive Computing for Text Analysis (Fall 2011)

Data Compression

Big data = big disk space & I/O-bound transfer times

Affects both intermediate (mapper output) and persistent data

Compression makes big data less big (but still cool)

Often 1/4 the size of the original data

Main issues

Does the compression format support splitting?

• What happens to parallelization if an entire 8GB compressed file has to be decompressed before we can access the splits?

Compression/decompression ratio vs. speed

• More compression reduces disk space and transfer times, but…

• Slow compression can take longer than the transfer time it saves

• Use native libraries!

White pp. 77-86, Lam pp. 153-155

Page 25: Lecture 4: Data-Intensive Computing for Text Analysis (Fall 2011)

Slow; decompression can’t keep pace with disk reads

Courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), Ch. 4

Page 26: Lecture 4: Data-Intensive Computing for Text Analysis (Fall 2011)

Compression Speed

LZO 2x faster than gzip

LZO ~15-20x faster than bzip2

http://arunxjacob.blogspot.com/2011/04/rolling-out-splittable-lzo-on-cdh3.html

http://www.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/

Page 27: Lecture 4: Data-Intensive Computing for Text Analysis (Fall 2011)

Splittable LZO to the rescue

The LZO format is not internally splittable, but we can create a separate, accompanying index of split points

Recipe

Get LZO from Cloudera or elsewhere, and set it up

See the URLs on the previous slide for instructions

LZO-compress files, copy them to HDFS at /path

Index them: $ hadoop jar /path/to/hadoop-lzo.jar com.hadoop.compression.lzo.LzoIndexer /path

Use hadoop-lzo’s LzoTextInputFormat instead of TextInputFormat

Voila!

Page 28: Lecture 4: Data-Intensive Computing for Text Analysis (Fall 2011)

Compression API for persistent data

JobConf helper functions –or– set properties

Input

conf.setInputFormatClass(LzoTextInputFormat.class);

Persistent (reducer) output

FileOutputFormat.setCompressOutput(conf, true)

FileOutputFormat.setOutputCompressorClass(conf, LzopCodec.class)

Courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), p. 85
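
Putting the persistent-output settings together, here is a minimal sketch with the old JobConf API; the paths are placeholders, and LzopCodec is assumed to come from the hadoop-lzo package referenced earlier.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

import com.hadoop.compression.lzo.LzopCodec;  // from the hadoop-lzo package

public class LzoOutputJob {
  public static void configure(JobConf conf) {
    FileInputFormat.addInputPath(conf, new Path("/data/in"));    // placeholder paths
    FileOutputFormat.setOutputPath(conf, new Path("/data/out"));

    // Compress persistent (reducer) output with LZO
    FileOutputFormat.setCompressOutput(conf, true);
    FileOutputFormat.setOutputCompressorClass(conf, LzopCodec.class);
  }
}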

Page 29: Lecture 4: Data-Intensive Computing for Text Analysis (Fall 2011)

Compression API for intermediate data

Similar JobConf helper functions –or– set properties

conf.setCompressMapOutput(true);

conf.setMapOutputCompressorClass(LzopCodec.class);

Courtesy of Chuck Lam’s Hadoop In Action (2010), pp. 153-155

Page 30: Lecture 4: Data-Intensive Computing for Text Analysis (Fall 2011)

SequenceFile & compression

Use SequenceFile for passing data between Hadoop jobs

Optimized for this use case

conf.setOutputFormat(SequenceFileOutputFormat.class);

With compression, one more parameter to set

The default is per-record compression; it is almost always preferable to compress on a per-block basis (see the sketch below)
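
A minimal sketch of switching a job’s output to block-compressed SequenceFiles (old JobConf API):

import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class SequenceFileOutputJob {
  public static void configure(JobConf conf) {
    // Chain jobs through SequenceFiles rather than plain text
    conf.setOutputFormat(SequenceFileOutputFormat.class);

    // Compress whole blocks of records, not each record individually
    FileOutputFormat.setCompressOutput(conf, true);
    SequenceFileOutputFormat.setOutputCompressionType(conf,
        SequenceFile.CompressionType.BLOCK);
    // A specific codec can also be chosen via FileOutputFormat.setOutputCompressorClass
  }
}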

Page 31: Lecture 4: Data-Intensive Computing for Text Analysis (Fall 2011)

See White p. 50; Hadoop source: src/contrib/fuse-dfs

From “hadoop fs <command>” to a mounted HDFS (via FUSE)

Page 32: Lecture 4: Data-Intensive Computing for Text Analysis (Fall 2011)

Hadoop Workflow

[Figure: you ↔ Hadoop cluster]

1. Load data into HDFS

2. Develop code locally

3. Submit MapReduce job

3a. Go back to Step 2

4. Retrieve data from HDFS

Page 33: Lecture 4: Data-Intensive Computing for Text Analysis (Fall 2011)

On Amazon: With EC2

[Figure: you ↔ your Hadoop cluster, allocated on EC2]

0. Allocate Hadoop cluster

1. Load data into HDFS

2. Develop code locally

3. Submit MapReduce job

3a. Go back to Step 2

4. Retrieve data from HDFS

5. Clean up!

Uh oh. Where did the data go?

Page 34: Lecture 4: Data-Intensive Computing for Text Analysis (Fall 2011)

On Amazon: EC2 and S3

[Figure: your Hadoop cluster runs on EC2 (the cloud); S3 is the persistent store]

Copy from S3 to HDFS

Copy from HDFS to S3