Data-Intensive Computing for Text Analysis
CS395T / INF385T / LIN386M
University of Texas at Austin, Fall 2011
Lecture 7
October 6, 2011
Matt Lease, School of Information, University of Texas at Austin (ml at ischool dot utexas dot edu)
Jason Baldridge, Department of Linguistics, University of Texas at Austin (Jasonbaldridge at gmail dot com)
Counter values are definitive only once a job has successfully completed (White p. 227)
What about while a job is running?
• If a task reports progress, it sets a flag to indicate that a status change should be sent to the TaskTracker
– The flag is checked in a separate thread every 3s, and if set, the TaskTracker is notified
– What about counter updates?
• The TaskTracker sends heartbeats to the JobTracker (at least every 5s), which include the status of all tasks being run by the TaskTracker...
– Counters (which can be relatively large) are sent less frequently
• JobClient receives the latest status by polling the JobTracker every 1s
• Clients can call JobClient’s getJob() to obtain a RunningJob instance with the latest status information (at time of the call?)
(White p. 172)
Representing Graphs
What’s a graph?
Graphs are ubiquitous
The Web (pages and hyperlink structure)
Computer networks (computers and connections)
Highways and railroads (cities and roads/tracks)
Social networks
G = (V,E), where
V: the set of vertices (nodes)
E: the set of edges (links)
Either/Both may contain additional information
• e.g. edge weights (e.g. cost, time, distance)
• e.g. node values (e.g. PageRank)
Graph types
Directed vs. undirected
Cyclic vs. acyclic
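As a rough illustration of G = (V,E) with extra information attached, the sketch below stores edge weights and node values in plain dictionaries. The graph, the weights, and the helper name is_symmetric are hypothetical and chosen only for this example.

```python
# Hypothetical example graph; all names and values are illustrative only.
V = {1, 2, 3, 4}                                   # the set of vertices
E = {(1, 2): 1.0, (1, 4): 2.5, (2, 3): 0.7}        # directed edges -> weights
node_value = {v: 1.0 / len(V) for v in V}          # e.g. a uniform starting score


def is_symmetric(edges):
    """True if every edge (u, v) has a matching (v, u) with the same weight,
    i.e. the edge set can be read as an undirected graph."""
    return all(edges.get((v, u)) == w for (u, v), w in edges.items())
```

Here E is directed because the edge (1, 2) has no reverse edge (2, 1).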
Some Graph Problems
Finding shortest paths
Routing Internet traffic and UPS trucks
Finding minimum spanning trees
Telco laying down fiber
Finding Max Flow
Airline scheduling
Identify “special” nodes and communities
Breaking up terrorist cells, spread of avian flu
Bipartite matching
Monster.com, Match.com
And of course... PageRank
Graphs and MapReduce
MapReduce graph processing typically involves
Performing computations at each node
• e.g. using node features, edge features, and local link structure
Propagating computations
• “traversing” the graph
Key questions
How do you represent graph data in MapReduce?
How do you traverse a graph in MapReduce?
Graph Representation
How do we encode graph structure suitably for
computation
propagation
Two common approaches
Adjacency matrix
Adjacency list
[Figure: example graph with nodes 1, 2, 3, 4]
Adjacency Matrices
Represent a graph as a |V| x |V| square matrix M
Mjk = w ⇒ directed edge of weight w from node j to node k
• w = 0 ⇒ no edge exists
• Mii: main diagonal gives self-loop weights from node i to itself
If undirected, use only the upper-right triangle of the matrix (symmetry)
  1 2 3 4
1 0 1 0 1
2 1 0 1 1
3 1 0 0 0
4 1 0 1 0
Adjacency Matrices: Critique
Advantages:
Amenable to mathematical manipulation
Easy iteration for computation over out-links and in-links
• Mj*: row j covers all out-links from node j
• M*k: column k covers all in-links to node k
Disadvantages
Sparsity: wasted computations, wasted space
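A minimal sketch of these trade-offs, assuming the convention that Mjk is the weight of the edge from node j to node k, so that scanning row j yields out-links and scanning column k yields in-links. The matrix is the 4-node example from the slides; the function names are hypothetical.

```python
# The 4-node example matrix; M[j-1][k-1] != 0 means an edge j -> k.
M = [
    [0, 1, 0, 1],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 1, 0],
]


def out_links(M, j):
    """Nodes k with an edge j -> k: scan row j (node IDs are 1-based)."""
    return [k + 1 for k, w in enumerate(M[j - 1]) if w != 0]


def in_links(M, k):
    """Nodes j with an edge j -> k: scan column k."""
    return [j + 1 for j, row in enumerate(M) if row[k - 1] != 0]


def zero_fraction(M):
    """Share of matrix cells that store 'no edge'."""
    cells = [w for row in M for w in row]
    return cells.count(0) / len(cells)
```

Even in this tiny graph half of the cells are zeros; for large, sparse graphs such as the Web the wasted fraction is far higher.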
Adjacency Lists
Take adjacency matrices… and throw away all the zeros
Hmm… look familiar…?
1: 2, 4
2: 1, 3, 4
3: 1
4: 1, 3
  1 2 3 4
1 0 1 0 1
2 1 0 1 1
3 1 0 0 0
4 1 0 1 0
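"Throwing away the zeros" can be sketched as a one-pass conversion that keeps, for each row, only the column indices with a non-zero entry. The function name to_adjacency_list is hypothetical.

```python
# The 4-node example matrix; row j holds the out-links of node j.
M = [
    [0, 1, 0, 1],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 1, 0],
]


def to_adjacency_list(M):
    """Keep only the non-zero entries of each row (node IDs are 1-based)."""
    return {
        j + 1: [k + 1 for k, w in enumerate(row) if w != 0]
        for j, row in enumerate(M)
    }


adjacency = to_adjacency_list(M)
```

For an unweighted graph the neighbour IDs are enough; a weighted graph would need (neighbour, weight) pairs instead.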
Inverted Index: Boolean Retrieval
Doc 1: one fish, two fish
Doc 2: red fish, blue fish
Doc 3: cat in the hat
Doc 4: green eggs and ham
Term: postings (doc IDs)
blue: 2
cat: 3
egg: 4
fish: 1, 2
green: 4
ham: 4
hat: 3
one: 1
red: 2
two: 1
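The parallel with adjacency lists can be made concrete with a small sketch that builds the postings lists from the four example documents. The stopword set and the "eggs" to "egg" normalization are assumptions made here so that the terms match the slide; they are not a real stemmer or a prescribed stopword list.

```python
from collections import defaultdict

docs = {
    1: "one fish, two fish",
    2: "red fish, blue fish",
    3: "cat in the hat",
    4: "green eggs and ham",
}

STOPWORDS = {"in", "the", "and"}   # assumption: dropped to match the slide
NORMALIZE = {"eggs": "egg"}        # assumption: stand-in for stemming


def tokenize(text):
    """Lowercase, strip commas, drop stopwords, normalize terms."""
    words = text.lower().replace(",", " ").split()
    return [NORMALIZE.get(w, w) for w in words if w not in STOPWORDS]


def build_index(docs):
    """Map each term to the sorted list of doc IDs containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


def boolean_and(index, a, b):
    """Docs containing both terms: intersect the two postings lists."""
    return sorted(set(index.get(a, [])) & set(index.get(b, [])))


index = build_index(docs)
```

Each term plays the role of a node and its postings list the role of its out-links, which is why the structure looks like an adjacency list.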
Adjacency Lists: Critique
Vs. Adjacency matrix
Sparsity: More compact, fewer wasted computations
Easy to compute over out-links
What about computation over in-links?
1: 2, 4
2: 1, 3, 4
3: 1
4: 1, 3
  1 2 3 4
1 0 1 0 1
2 1 0 1 1
3 1 0 0 0
4 1 0 1 0
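One possible answer to the in-links question is to invert the adjacency list: emit a (target, source) pair for every out-link and group the pairs by target. The sketch below imitates the map, shuffle, and reduce steps in plain Python; it is not Hadoop code, and the function names are hypothetical.

```python
from collections import defaultdict

# Out-link adjacency list of the 4-node example graph.
adjacency = {1: [2, 4], 2: [1, 3, 4], 3: [1], 4: [1, 3]}


def map_invert(node, out_links):
    """Map step: for each edge node -> target, emit (target, node)."""
    for target in out_links:
        yield target, node


def reduce_collect(target, sources):
    """Reduce step: gather every source that links to this target."""
    return target, sorted(sources)


def invert(adjacency):
    grouped = defaultdict(list)            # stands in for the shuffle/sort phase
    for node, outs in adjacency.items():
        for target, source in map_invert(node, outs):
            grouped[target].append(source)
    return dict(reduce_collect(t, s) for t, s in grouped.items())


in_link_lists = invert(adjacency)
```

The cost is one full pass over all edges, in contrast to the matrix, where in-links are available by reading a single column.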
Single Source Shortest Path
Problem
Find shortest path from a source node to one or more
target nodes
Shortest may mean lowest weight or cost, etc.
Classic approach
Dijkstra’s Algorithm
• Maintain a global priority queue over all (node, distance) pairs
• Sort queue by min distance to reach each node from the source node
• Initialization: distance to source node = 0, all others = ∞
• Visit nodes in order of (monotonically) increasing path length
• Whenever a node is visited, no shorter path to it exists
• For each node visited:
– update its neighbours in the queue
– remove the node from the queue
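The steps above can be sketched in a few lines of Python. Two assumptions apply: edge weights are non-negative, and because Python's heapq has no decrease-key operation, "updating a neighbour in the queue" is done by pushing a new entry and skipping stale ones when they are popped. The edge weights in the example graph are hypothetical.

```python
import heapq
import math


def dijkstra(graph, source):
    """graph maps node -> list of (neighbour, weight); weights must be >= 0.
    Returns the shortest distance from source to every node."""
    dist = {node: math.inf for node in graph}   # all others = infinity
    dist[source] = 0
    queue = [(0, source)]                       # priority queue keyed on distance
    visited = set()
    while queue:
        d, node = heapq.heappop(queue)          # closest unvisited node
        if node in visited:
            continue                            # stale entry, already finalized
        visited.add(node)                       # no shorter path to node exists
        for neighbour, weight in graph[node]:
            new_d = d + weight
            if new_d < dist[neighbour]:
                dist[neighbour] = new_d
                heapq.heappush(queue, (new_d, neighbour))
    return dist


# The 4-node example graph with hypothetical edge weights.
graph = {
    1: [(2, 10), (4, 5)],
    2: [(1, 1), (3, 1), (4, 2)],
    3: [(1, 4)],
    4: [(1, 7), (3, 9)],
}
distances = dijkstra(graph, 1)
```

The single global priority queue is the part that does not carry over directly to MapReduce, where no shared data structure is visible to all workers.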
Edsger W. Dijkstra
May 11, 1930 – August 6, 2002
Received the 1972 Turing Award
Schlumberger Centennial Chair of Computer Science at