大大大大大大大 / 大大大 Lecture 4 – Word Co-occurrence Matrix 彭彭 彭彭彭彭彭彭彭彭彭彭彭彭 7/10/2014 http://net.pku.edu.cn/~cours e/cs402/ This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United S See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details Jimmy Lin University of Maryland SEWMGroup
46
Embed
大规模数据处理 / 云计算 Lecture 4 – Word Co-occurrence Matrix
大规模数据处理 / 云计算 Lecture 4 – Word Co-occurrence Matrix. 彭波 北京大学信息科学技术学院 7/10/2014 http://net.pku.edu.cn/~course/cs402/. Jimmy Lin University of Maryland. SEWMGroup. - PowerPoint PPT Presentation
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
大规模数据处理 / 云计算 Lecture 4 – Word Co-occurrence Matrix
彭波北京大学信息科学技术学院
7/10/2014http://net.pku.edu.cn/~course/cs402/
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United StatesSee http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details
• What to hand in– Please pack the ACCEPTED source codes of
oneline evalution into a single rar/tar.gz file, name it as "assign1-YourPinYinName.rar" or "assign1-YourPinYinName.tar.gz" and send the package to our TA by email (cs402.pku AT gmail.com) with "CS40214-Assign1-YourPinYinName" as the subject.
• Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.
How Does Streaming Work
• both the mapper and the reducer are executables that read the input from stdin (line by line) and emit the output to stdout
• By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) will be the value
More Features
• Specifying Other Plugins for Jobs– inputformat JavaClassName– outputformat JavaClassName– partitioner JavaClassName– combiner JavaClassName
• Specifying Additional Configuration Variables for Jobs• Customizing the Way to Split Lines into Key/Value Pairs• A Useful Partitioner Class• A Useful Comparator Class• Working with the Hadoop Aggregate Package
Debug in Hadoop
What Constitutes Progress in MapReduce?
• Hadoop will not fail a task that’s making progress.– Reading an input record (in a mapper or
reducer)– Writing an output record (in a mapper or
reducer)– Setting the status message (using Context’s
setStatus() method)– Incrementing a counter (using Context’s
• Counters are a useful channel for gathering statistics about the job: for quality control or for application-level statistics.
• Status Message
Hadoop Logs
• MapReduce task logs – Each tasktracker child process produces a
logfile using log4j (called syslog), a file for data sent to standard out (stdout), and a file for standard error (stderr).
– accessible through the web UI
'wordcount'How does it work?
mapmap map map
Shuffle and Sort: aggregate values by keys
reduce reduce reduce
k1 k2 k3 k4 k5 k6v1 v2 v3 v4 v5 v6
ba 1 2 c c3 6 a c5 2 b c7 8
a 1 5 b 2 7 c 2 3 6 8
r1 s1 r2 s2 r3 s3
18
“Hello World”: Word Count
19
But, in a real system......
• How to inject user code into a runing system?– job submission– mapper & reducer class instantiate– read/write data in mapper& reducer
Implementation in Hadoop
• job submission
• mapper & reducer class instantiate
• read/write data in mapper& reducer
Hadoop Cluster
22
Job Submission Process
• Asks the jobtracker for a new job ID (by calling getNewJobId() on JobTracker) (step2).
• Checks the output specification of the job. For example, if the output directory has not been specified or it already exists, the job is not submitted and an error is thrown to the MapReduce program.
• Computes the input splits for the job. If the splits cannot be computed (because the input paths don’t exist, for example), the job is not submitted and an error is thrown to the MapReduce program.
• Copies the resources needed to run the job, including the job JAR file, the configuration file, and the computed input splits, to the jobtracker’s filesystem in a directory named after the job ID. – The job JAR is copied with a high replication factor
(controlled by the mapred.submit.replication property, which defaults to 10) so that there are lots of copies across the cluster for the tasktrackers to access when they run tasks for the job (step 3).
• Tells the jobtracker that the job is ready for execution by calling submitJob() on JobTracker (step 4).
InputSplits
• input split – is a chunk of the input that is processed by a
single map. – Each map processes a single split. – Each split is divided into records, and the map
processes each record—a key-value pair—in turn.
– Splits and records are logical:
InputFormat
• An InputFormat is responsible for creating the input splits and dividing them into records.
• Mapper’s run() method
InputFormat Class Hierarchy
Serialization
• Serialization is the process of turning structured objects into a byte stream for trans-mission over a network or for writing to persistent storage.
• Deserialization is the reverse process of turning a byte stream back into a series of structured objects.
• In Hadoop, interprocess communication between nodes in the system is implemented using remote procedure calls (RPCs).
extends Writable, Comparable<T>– A Writable which is also Comparable. – public int compareTo(WritableComparable w){}
Word Co-occurrence
Tasks
• Do word co-occurrence analysis on ShakeSpeare Collection and AP Collection, which is under the directory of /public/Shakespeare and /public/AP of our sewm cluster (or your own virtual cluster). You will get one line of text data as input to process in map function by default.(80 points)
• Try to optimize your program, and find the fastest one. Write your approaches and evaluation in your report.(20 points)
• Analysis the result data matrix and find something interesting. (10 points bonus)
• Write a report to describe approach to each task, the problem you met etc.
co-occurrence
• Co-occurrence or cooccurrence is a linguistics term that can either mean concurrence / coincidence or, in a more specific sense, the above-chance frequent occurrence of two terms from a text corpus alongside each other in a certain order. Co-occurrence in this linguistic sense can be interpreted as an indicator of semantic proximity or an idiomatic expression. In contrast to collocation, co-occurrence assumes interdependency of the two terms. A co-occurrence restriction is identified when linguistic elements never occur together. Analysis of these restrictions can lead to discoveries about the structure and development of a language.[1]
From Wikipedia, the free encyclopedia
Input Data
Pairs .vs. Stripes Idea: group together pairs into an associative array
Each mapper takes a sentence: Generate all co-occurring term pairs For each term, emit a → { b: countb, c: countc, d: countd … }
Reducers perform element-wise sum of associative arrays
(a, b) → 1 (a, c) → 2 (a, d) → 5 (a, e) → 3 (a, f) → 2