
Large-Scale Data Processing / Cloud Computing Lecture 4 – Word Co-occurrence Matrix

Jan 02, 2016

Page 1: Large-Scale Data Processing / Cloud Computing Lecture 4 – Word Co-occurrence Matrix

Peng Bo (彭波), School of Electronics Engineering and Computer Science, Peking University

7/10/2014 http://net.pku.edu.cn/~course/cs402/

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

Jimmy Lin, University of Maryland

SEWMGroup

Page 2

WordCount Review

Page 3

Problems & Solutions

• Mac OS – Configuration notes for Mac OS X, by Xin Lv

• Eclipse – Notes on connecting Eclipse 3.7 Indigo to Hadoop, by Zhu Yujian (朱瑜坚)

• Linux – Notes on manually configuring and running Hadoop under Linux, by Haoyan Huo

• VMPlayer – not yet available :)

Page 4

Homework Submission

• What to hand in
– Please pack the ACCEPTED source code from the online evaluation into a single rar/tar.gz file, name it "assign1-YourPinYinName.rar" or "assign1-YourPinYinName.tar.gz", and send the package to our TA by email (cs402.pku AT gmail.com) with "CS40214-Assign1-YourPinYinName" as the subject.

Page 5

Changping11 Usage Rules

• hadoop.job.ugi = YourName, cs402

• Input data is located in:
– /public
– data you upload yourself goes in your personal directory

• Output data must go in your personal directory under /cs402
– /cs402/YourName
– do not use the default /user/Yourname

Page 6
Page 7

Streaming for Python Programmers

Page 8

Hadoop Streaming

• Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.

Page 9

How Does Streaming Work

• Both the mapper and the reducer are executables that read their input from stdin (line by line) and emit their output to stdout.

• By default, the prefix of a line up to the first tab character is the key, and the rest of the line (excluding the tab character) is the value.
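A minimal sketch of this contract for word count, in Python. The names `map_lines` and `reduce_lines` are illustrative and not part of Hadoop; in a real Streaming job each function would sit in its own script, read `sys.stdin`, and print to `sys.stdout`. The reducer assumes its input arrives sorted by key, which is what Streaming provides.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer: input lines arrive sorted by key, so equal keys are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local simulation of map -> sort -> reduce; on a cluster each stage would
# read sys.stdin and write to sys.stdout instead of using Python lists.
mapped = list(map_lines(["a b a", "c b"]))
reduced = list(reduce_lines(sorted(mapped)))
print("\n".join(reduced))
```

The same pipeline can be tried on a single machine with a shell pipe of the form mapper, sort, reducer before submitting it to the cluster.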

Page 10

More Features

• Specifying Other Plugins for Jobs
– -inputformat JavaClassName
– -outputformat JavaClassName
– -partitioner JavaClassName
– -combiner JavaClassName

• Specifying Additional Configuration Variables for Jobs

• Customizing the Way to Split Lines into Key/Value Pairs

• A Useful Partitioner Class

• A Useful Comparator Class

• Working with the Hadoop Aggregate Package
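To make the key/value splitting rule concrete, here is a small Python simulation. `split_key_value` is a hypothetical helper, not a Hadoop API; its two parameters are meant to mirror the Streaming properties `stream.map.output.field.separator` and `stream.num.map.output.key.fields` as the Streaming documentation describes them.

```python
def split_key_value(line, separator="\t", num_key_fields=1):
    """Split a line into (key, value) at the num_key_fields-th separator.

    If the line has fewer separators than that, the whole line becomes the
    key and the value is empty.
    """
    parts = line.split(separator)
    if len(parts) <= num_key_fields:
        return line, ""
    key = separator.join(parts[:num_key_fields])
    value = separator.join(parts[num_key_fields:])
    return key, value


print(split_key_value("hello\tworld\t1"))    # default: split at the first tab
print(split_key_value("a.b.c.d", ".", 2))    # key spans the first two fields
print(split_key_value("no-separator-here"))  # whole line is the key
```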

Page 11

Debugging in Hadoop

Page 12

What Constitutes Progress in MapReduce?

• Hadoop will not fail a task that’s making progress. The following operations count as progress:
– Reading an input record (in a mapper or reducer)
– Writing an output record (in a mapper or reducer)
– Setting the status message (using Context’s setStatus() method)
– Incrementing a counter (using Context’s getCounter().increment() method)
– Calling Reporter’s progress() method

Page 13

Counters & Status Message

• Counters are a useful channel for gathering statistics about the job: for quality control or for application-level statistics.

• Status Message
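For Streaming jobs, counters and status messages are reported by writing specially formatted lines to stderr. The sketch below wraps the two line formats in Python helpers; the helper names and the counter group/name used in the demo are made up for illustration.

```python
import io
import sys


def increment_counter(group, counter, amount=1, stream=None):
    """Ask Hadoop Streaming to add `amount` to a counter via stderr."""
    stream = sys.stderr if stream is None else stream
    stream.write(f"reporter:counter:{group},{counter},{amount}\n")


def set_status(message, stream=None):
    """Ask Hadoop Streaming to update the task's status message via stderr."""
    stream = sys.stderr if stream is None else stream
    stream.write(f"reporter:status:{message}\n")


# Demo against an in-memory buffer instead of the real stderr.
buffer = io.StringIO()
increment_counter("WordCount", "EmptyLines", 1, stream=buffer)
set_status("processed 1000 lines", stream=buffer)
print(buffer.getvalue(), end="")
```

Inside a mapper script the `stream` argument would simply be left out, so the lines go to the task's stderr.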

Page 14

Hadoop Logs

• MapReduce task logs
– Each tasktracker child process produces a logfile using log4j (called syslog), a file for data sent to standard out (stdout), and a file for standard error (stderr).
– Accessible through the web UI

Page 15
Page 16
Page 17

'wordcount': How does it work?

Page 18

[Figure: MapReduce data flow. Input pairs (k1, v1) through (k6, v6) feed four map tasks, which emit (a, 1), (b, 2); (c, 3), (c, 6); (a, 5), (c, 2); (b, 7), (c, 8). "Shuffle and Sort: aggregate values by keys" groups these into a → [1, 5], b → [2, 7], c → [2, 3, 6, 8]. Three reduce tasks then emit (r1, s1), (r2, s2), (r3, s3).]
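The shuffle-and-sort step in the figure can be imitated in a few lines of Python. This is only a single-process sketch using the figure's values; `shuffle_and_sort` is an illustrative name, and summing is just one possible reduce function.

```python
from collections import defaultdict

# Intermediate (key, value) pairs emitted by the four map tasks in the figure.
map_outputs = [
    [("a", 1), ("b", 2)],
    [("c", 3), ("c", 6)],
    [("a", 5), ("c", 2)],
    [("b", 7), ("c", 8)],
]


def shuffle_and_sort(outputs):
    """Group values by key across all map outputs, with keys in sorted order."""
    grouped = defaultdict(list)
    for task_output in outputs:
        for key, value in task_output:
            grouped[key].append(value)
    # Values are sorted here only to match the figure; Hadoop does not order
    # the values of a key unless a secondary sort is configured.
    return {key: sorted(grouped[key]) for key in sorted(grouped)}


grouped = shuffle_and_sort(map_outputs)
for key, values in grouped.items():
    print(key, values, "-> reduce output (sum):", sum(values))
```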

Page 19

“Hello World”: Word Count

Page 20

But, in a real system...

• How to inject user code into a running system?
– job submission
– mapper & reducer class instantiation
– reading/writing data in the mapper & reducer

Page 21

Implementation in Hadoop

• job submission

• mapper & reducer class instantiation

• reading/writing data in the mapper & reducer

Page 22

Hadoop Cluster

Page 23

Job Submission Process

Page 24

• Asks the jobtracker for a new job ID (by calling getNewJobId() on JobTracker) (step 2).

• Checks the output specification of the job. For example, if the output directory has not been specified or it already exists, the job is not submitted and an error is thrown to the MapReduce program.

• Computes the input splits for the job. If the splits cannot be computed (because the input paths don’t exist, for example), the job is not submitted and an error is thrown to the MapReduce program.

Page 25

• Copies the resources needed to run the job, including the job JAR file, the configuration file, and the computed input splits, to the jobtracker’s filesystem in a directory named after the job ID.
– The job JAR is copied with a high replication factor (controlled by the mapred.submit.replication property, which defaults to 10) so that there are lots of copies across the cluster for the tasktrackers to access when they run tasks for the job (step 3).

• Tells the jobtracker that the job is ready for execution by calling submitJob() on JobTracker (step 4).

Page 26

InputSplits

• input split
– is a chunk of the input that is processed by a single map.
– Each map processes a single split.
– Each split is divided into records, and the map processes each record (a key-value pair) in turn.
– Splits and records are logical.
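A simplified sketch of what "logical" means here: a split can be described by nothing more than an offset and a length, without copying any data. `compute_splits` is a hypothetical helper; Hadoop's real split computation also takes the block size and the configured minimum and maximum split sizes into account.

```python
def compute_splits(file_length, split_size):
    """Return logical (offset, length) splits covering a file of file_length bytes."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    splits = []
    offset = 0
    while offset < file_length:
        length = min(split_size, file_length - offset)
        splits.append((offset, length))
        offset += length
    return splits


# A hypothetical 300-byte file with a 128-byte split size yields three splits,
# so three map tasks would be scheduled for it.
splits = compute_splits(300, 128)
print(splits)
```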

Page 27

InputFormat

• An InputFormat is responsible for creating the input splits and dividing them into records.

Page 28

• Mapper’s run() method
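In the Java API, Mapper's run() calls setup() once, then map() for every record, then cleanup() once. The Python analog below mirrors only that control flow; it is not Hadoop source code, the `context` is a plain list standing in for Hadoop's Context object, and `TokenCountMapper` is a made-up subclass.

```python
class Mapper:
    """Python analog of the template-method structure of Mapper.run()."""

    def setup(self, context):
        pass  # called once before any record

    def map(self, key, value, context):
        context.append((key, value))  # default behavior: identity map

    def cleanup(self, context):
        pass  # called once after the last record

    def run(self, records, context):
        self.setup(context)
        for key, value in records:  # stands in for context.nextKeyValue()
            self.map(key, value, context)
        self.cleanup(context)


class TokenCountMapper(Mapper):
    """Hypothetical subclass: emit (word, 1) for every word in a line."""

    def map(self, key, value, context):
        for word in value.split():
            context.append((word, 1))


# Records are (byte offset, line) pairs, as a text input format would supply.
output = []
TokenCountMapper().run([(0, "to be"), (6, "or not")], output)
print(output)
```

User code normally overrides only map(); overriding run() is useful when a mapper needs control over how records are iterated.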

Page 29

InputFormat Class Hierarchy

Page 30

Serialization

• Serialization is the process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage.

• Deserialization is the reverse process of turning a byte stream back into a series of structured objects.

• In Hadoop, interprocess communication between nodes in the system is implemented using remote procedure calls (RPCs).

Page 31

The Writable Interface

• public interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
  }

• public interface WritableComparable<T> extends Writable, Comparable<T>
– A Writable which is also Comparable.
– public int compareTo(WritableComparable w){}
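To show the shape of the write/readFields/compareTo contract, here is a Python analog of a two-string key. It is a sketch only: the byte layout (4-byte length prefix plus UTF-8 bytes) is an assumption chosen for the example and is not the wire format of Hadoop's Text class.

```python
import io
import struct


class TextPair:
    """Python analog of a WritableComparable holding two strings."""

    def __init__(self, first="", second=""):
        self.first, self.second = first, second

    def write(self, out):
        # Each field is written as a 4-byte big-endian length plus UTF-8 bytes.
        for text in (self.first, self.second):
            data = text.encode("utf-8")
            out.write(struct.pack(">i", len(data)))
            out.write(data)

    def read_fields(self, inp):
        fields = []
        for _ in range(2):
            (length,) = struct.unpack(">i", inp.read(4))
            fields.append(inp.read(length).decode("utf-8"))
        self.first, self.second = fields

    def compare_to(self, other):
        # Order by first string, then by second, like a typical compareTo().
        mine, theirs = (self.first, self.second), (other.first, other.second)
        return (mine > theirs) - (mine < theirs)


buffer = io.BytesIO()
TextPair("hello", "world").write(buffer)
buffer.seek(0)
copy = TextPair()
copy.read_fields(buffer)
print(copy.first, copy.second, copy.compare_to(TextPair("hello", "zebra")))
```

As with Java Writables, the object is created empty and then populated by read_fields(), which lets a framework reuse one instance across many records.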

Page 32
Page 33
Page 34

Word Co-occurrence

Page 35

Tasks

• Do word co-occurrence analysis on the Shakespeare Collection and the AP Collection, which are under the directories /public/Shakespeare and /public/AP on our sewm cluster (or your own virtual cluster). By default, your map function receives one line of text as input. (80 points)

• Try to optimize your program, and find the fastest version. Describe your approaches and evaluation in your report. (20 points)

• Analyze the resulting data matrix and find something interesting. (10 bonus points)

• Write a report describing your approach to each task, the problems you met, etc.

Page 36

Co-occurrence

• Co-occurrence or cooccurrence is a linguistics term that can either mean concurrence / coincidence or, in a more specific sense, the above-chance frequent occurrence of two terms from a text corpus alongside each other in a certain order. Co-occurrence in this linguistic sense can be interpreted as an indicator of semantic proximity or an idiomatic expression. In contrast to collocation, co-occurrence assumes interdependency of the two terms. A co-occurrence restriction is identified when linguistic elements never occur together. Analysis of these restrictions can lead to discoveries about the structure and development of a language.[1]

From Wikipedia, the free encyclopedia

Page 37
Page 38

Input Data

Page 39
Page 40

Pairs vs. Stripes

Idea: group together pairs into an associative array

(a, b) → 1
(a, c) → 2
(a, d) → 5
(a, e) → 3
(a, f) → 2

a → { b: 1, c: 2, d: 5, e: 3, f: 2 }

Each mapper takes a sentence:
– Generate all co-occurring term pairs
– For each term, emit a → { b: count_b, c: count_c, d: count_d … }

Reducers perform element-wise sum of associative arrays

  a → { b: 1, d: 5, e: 3 }
+ a → { b: 1, c: 2, d: 2, f: 2 }
= a → { b: 2, c: 2, d: 7, e: 3, f: 2 }

Key: cleverly-constructed data structure brings together partial results
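The two designs can be compared with a small single-process Python sketch. It assumes that two words co-occur when they appear in the same sentence (other neighborhood definitions are possible), and the function names are illustrative rather than part of any Hadoop API.

```python
from collections import Counter


def pairs_mapper(sentence):
    """Pairs: emit ((a, b), 1) for every ordered pair of distinct positions."""
    words = sentence.split()
    for i, a in enumerate(words):
        for j, b in enumerate(words):
            if i != j:
                yield (a, b), 1


def stripes_mapper(sentence):
    """Stripes: emit one (a, {b: count, ...}) associative array per position."""
    words = sentence.split()
    for i, a in enumerate(words):
        yield a, Counter(b for j, b in enumerate(words) if j != i)


def stripes_reducer(term, stripes):
    """Element-wise sum of all stripes received for one term."""
    total = Counter()
    for stripe in stripes:
        total.update(stripe)
    return term, total


# The two partial stripes for term "a" from the slide.
term, merged = stripes_reducer("a", [
    {"b": 1, "d": 5, "e": 3},
    {"b": 1, "c": 2, "d": 2, "f": 2},
])
print(term, dict(merged))
```

Pairs emits many small records with simple keys, while stripes emits fewer but larger records; which one runs faster on the assignment data is something to measure rather than assume.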

Page 41

Pairs

• Customized KEY
– (a, b): a TextPair that implements WritableComparable<>

• Customized Partitioner
– all (a, b), (a, c), (a, f), (a, *) go to the same Reducer
– the default partitioner, HashPartitioner, uses the key’s hashCode() method

Page 42

[Figure: MapReduce data flow with combiners and partitioners. Each of the four map tasks is followed by a combine step and a partition step. The map outputs are (a, 1), (b, 2); (c, 3), (c, 6); (a, 5), (c, 2); (b, 7), (c, 8); the combiner on the second map task merges (c, 3) and (c, 6) into (c, 9). "Shuffle and Sort: aggregate values by keys" then yields a → [1, 5], b → [2, 7], c → [2, 9, 8] instead of c → [2, 3, 6, 8], and three reduce tasks emit (r1, s1), (r2, s2), (r3, s3).]
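The effect of the combine step can be reproduced with the figure's numbers. This is a single-process Python sketch with illustrative function names; it uses summation as the combine function, which is safe to apply locally because addition is associative and commutative.

```python
from collections import defaultdict


def combine(task_output):
    """Combiner: sum values per key within a single map task's output."""
    local = defaultdict(int)
    for key, value in task_output:
        local[key] += value
    return sorted(local.items())


def shuffle(outputs):
    """Group the (possibly combined) values by key across all map tasks."""
    grouped = defaultdict(list)
    for task_output in outputs:
        for key, value in task_output:
            grouped[key].append(value)
    return dict(grouped)


map_outputs = [
    [("a", 1), ("b", 2)],
    [("c", 3), ("c", 6)],  # the combiner collapses these into ("c", 9)
    [("a", 5), ("c", 2)],
    [("b", 7), ("c", 8)],
]
combined = [combine(task_output) for task_output in map_outputs]
grouped = shuffle(combined)
print(grouped["c"])
```

The reducer for key c receives three values instead of four, yet their sum is unchanged, which is the point of a combiner: less data crosses the network during the shuffle.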

Page 43

Partitioner

• public abstract class Partitioner<KEY, VALUE> {
    public abstract int getPartition(KEY key, VALUE value, int numPartitions);
  }

• job setup
– job.setPartitionerClass(UserPartitioner.class);
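Hadoop's HashPartitioner computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks over the whole key, so (a, b) and (a, c) may land on different reducers. The Python sketch below contrasts that with a partitioner that hashes only the left word. All function names are illustrative, and `zlib.crc32` is merely a deterministic stand-in for Java's hashCode().

```python
import zlib


def stable_hash(text):
    """Deterministic stand-in for Java's hashCode(); Python's hash() is salted."""
    return zlib.crc32(text.encode("utf-8"))


def hash_partition(key, num_partitions):
    """Default-style partitioning: hash the whole (left, right) key."""
    left, right = key
    return stable_hash(left + "\t" + right) % num_partitions


def left_word_partition(key, num_partitions):
    """Custom partitioning: hash only the left word, so all (a, *) pairs meet."""
    left, _right = key
    return stable_hash(left) % num_partitions


keys = [("a", "b"), ("a", "c"), ("a", "f"), ("b", "a")]
for key in keys:
    print(key, hash_partition(key, 4), left_word_partition(key, 4))
```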

Page 44

Comparators

• Comparable<T>
– compareTo()

• Comparator
– RawComparator<>
– WritableComparator

• job.setSortComparatorClass
– sorts the map() output in the Mapper

• job.setGroupingComparatorClass
– controls how the sorted, shuffled data is grouped in the Reducer; each group is sent to one reduce() call
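The difference between the two comparators can be imitated with `sorted` and `itertools.groupby`. This is a sketch under the assumption that keys are (left, right) word pairs: sorting uses the full key, while grouping looks only at the left word. The sample records are invented for the demo.

```python
from itertools import groupby

# Shuffled map output for one reducer: ((left, right), count) records.
records = [
    (("b", "a"), 1),
    (("a", "c"), 2),
    (("a", "b"), 1),
    (("a", "c"), 1),
]

# "Sort comparator": order records by the full (left, right) key.
ordered = sorted(records, key=lambda record: record[0])

# "Grouping comparator": treat keys with the same left word as equal, so
# each reduce() call sees every record for one left word.
reduce_calls = [
    (left, [(key[1], count) for key, count in group])
    for left, group in groupby(ordered, key=lambda record: record[0][0])
]
for left, values in reduce_calls:
    print(left, values)
```

Because the records are already sorted by the full key, each reduce() call sees the right-hand words in order, which is the basis of the secondary-sort pattern.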

Page 45

Stripes

• Associative array
– map -> MapWritable?

• Caution:
– make sure the JVM memory is big enough
– set mapred.child.java.opts (200 MB heap by default)

Page 46

Q&A