Page 1

Background Knowledge

http://net.pku.edu.cn/~course/cs402

Peng Bo, School of EECS, Peking University

6/26/2008

Refer to Aaron Kimball’s slides

Page 2

Background Topics

• Parallelization & Synchronization

• Fundamentals of Networking

• Search Engine Technology
  – Inverted index
  – PageRank algorithm

Page 3

Parallelization & Synchronization

Page 4

Parallelization Idea

• Parallelization is “easy” if processing can be cleanly split into n units:

(Diagram: the work is partitioned into units w1, w2, w3.)

Page 5

Parallelization Idea (2)

Spawn worker threads:

(Diagram: each work unit w1, w2, w3 is assigned to its own thread.)

In a parallel computation, we would like to have as many threads as we have processors; e.g., a four-processor computer would be able to run four threads at the same time.

Page 6

Parallelization Idea (3)

Workers process data:

(Diagram: each thread processes its assigned work unit.)

Page 7

Parallelization Idea (4)

Report results:

(Diagram: the threads' outputs w1, w2, w3 are combined into the final results.)

Page 8

Parallelization Pitfalls

But this model is too simple!

• How do we assign work units to worker threads?
• What if we have more work units than threads?
• How do we aggregate the results at the end?
• How do we know all the workers have finished?
• What if the work cannot be divided into completely separate tasks?

What is the common theme of all of these problems?

Page 9

Parallelization Pitfalls (2)

• Each of these problems represents a point at which multiple threads must communicate with one another, or access a shared resource.

• Golden rule: Any memory that can be used by multiple threads must have an associated synchronization system!

Page 10

What is Wrong With This?

Thread 1:
void foo() {
  x++;
  y = x;
}

Thread 2:
void bar() {
  y++;
  x++;
}

If the initial state is y = 0, x = 6, what happens after these threads finish running?
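To make the hazard concrete, here is a minimal, runnable Java sketch of the same two threads (the variables x and y follow the slide; the use of java.lang.Thread and everything else is my own illustration, not code from the slides). Run it repeatedly and the final values vary with the interleaving:

public class RaceDemo {
    static int x = 6, y = 0;   // initial state from the slide

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> { x++; y = x; });   // foo()
        Thread t2 = new Thread(() -> { y++; x++; });     // bar()
        t1.start(); t2.start();
        t1.join();  t2.join();
        // Without synchronization the result depends on the interleaving:
        // x may end up 7 or 8, and y may end up 1, 7, or 8.
        System.out.println("x=" + x + " y=" + y);
    }
}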

Page 11

Multithreaded = Unpredictability

• When we run a multithreaded program, we don’t know what order threads run in, nor do we know when they will interrupt one another.

Thread 1:
void foo() {
  eax = mem[x];
  inc eax;
  mem[x] = eax;
  ebx = mem[x];
  mem[y] = ebx;
}

Thread 2:
void bar() {
  eax = mem[y];
  inc eax;
  mem[y] = eax;
  eax = mem[x];
  inc eax;
  mem[x] = eax;
}

• Many things that look like “one step” operations actually take several steps under the hood:

Page 12

Multithreaded = Unpredictability

This applies to more than just integers:

• Pulling work units from a queue• Reporting work back to master unit• Telling another thread that it can begin the

“next phase” of processing

… All require synchronization!

Page 13

Synchronization Primitives

• Synchronization primitives:
  – Semaphore / mutex
  – Condition variable
  – Barriers

Page 14

Semaphores

• A semaphore is a flag that can be raised or lowered in one step

• Semaphores were flags that railroad engineers would use when entering a shared track

(Images: semaphore flag in the "set" and "reset" positions.)

Page 15

The “Corrected” Example

Thread 1:
void foo() {
  sem.lock();
  x++;
  y = x;
  sem.unlock();
}

Thread 2:
void bar() {
  sem.lock();
  y++;
  x++;
  sem.unlock();
}

Global var “Semaphore sem = new Semaphore();” guards access to x & y
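For reference, a hedged Java rendering of the corrected example. The slide's Semaphore class is pseudocode; here I use java.util.concurrent.Semaphore with a single permit as the mutex, which is my own mapping, not the slide's code:

import java.util.concurrent.Semaphore;

public class GuardedCounters {
    static int x = 6, y = 0;
    static final Semaphore sem = new Semaphore(1);   // one permit => mutual exclusion

    static void foo() throws InterruptedException {
        sem.acquire();                               // "sem.lock()" in the pseudocode
        try { x++; y = x; }
        finally { sem.release(); }                   // "sem.unlock()"
    }

    static void bar() throws InterruptedException {
        sem.acquire();
        try { y++; x++; }
        finally { sem.release(); }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> { try { foo(); } catch (InterruptedException e) { } });
        Thread t2 = new Thread(() -> { try { bar(); } catch (InterruptedException e) { } });
        t1.start(); t2.start(); t1.join(); t2.join();
        System.out.println("x=" + x + " y=" + y);    // with mutual exclusion, always x=8, y=8
    }
}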

Page 16

Condition Variables

• A condition variable notifies threads that a particular condition has been met

• Inform another thread that a queue now contains elements to pull from (or that it’s empty – request more elements!)

Page 17

The final example

Thread 1:
void foo() {
  sem.lock();
  x++;
  y = x;
  fooDone = true;
  sem.unlock();
  fooFinishedCV.notify();
}

Thread 2:
void bar() {
  sem.lock();
  while (!fooDone) fooFinishedCV.wait(sem);
  y++;
  x++;
  sem.unlock();
}

Global vars: Semaphore sem = new Semaphore(); ConditionVar fooFinishedCV = new ConditionVar(); boolean fooDone = false;
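A possible Java rendering of this final example, using ReentrantLock and Condition from java.util.concurrent in place of the slide's pseudocode classes (my mapping, not the slide's code; note one adaptation: a Condition must be signalled while the lock is held, so the notify happens inside the critical section here):

import java.util.concurrent.locks.*;

public class OrderedUpdate {
    static int x = 6, y = 0;
    static boolean fooDone = false;
    static final ReentrantLock lock = new ReentrantLock();        // plays the role of "sem"
    static final Condition fooFinishedCV = lock.newCondition();   // plays the role of the ConditionVar

    static void foo() {
        lock.lock();
        try {
            x++; y = x;
            fooDone = true;
            fooFinishedCV.signal();                // notify() in the pseudocode
        } finally { lock.unlock(); }
    }

    static void bar() throws InterruptedException {
        lock.lock();
        try {
            while (!fooDone) fooFinishedCV.await(); // wait(sem) in the pseudocode
            y++; x++;
        } finally { lock.unlock(); }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t2 = new Thread(() -> { try { bar(); } catch (InterruptedException e) { } });
        Thread t1 = new Thread(OrderedUpdate::foo);
        t2.start(); t1.start(); t1.join(); t2.join();
        System.out.println("x=" + x + " y=" + y);   // always x=8, y=8: bar() runs after foo()
    }
}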

Page 18

Barriers

• A barrier knows in advance how many threads it should wait for. Threads “register” with the barrier when they reach it, and fall asleep.

• Barrier wakes up all registered threads when total count is correct

• Pitfall: What happens if a thread takes a long time?
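As an illustration (not from the slides), java.util.concurrent.CyclicBarrier behaves exactly this way: it is created knowing how many threads to wait for, and releases them all once the last one arrives.

import java.util.concurrent.CyclicBarrier;

public class BarrierDemo {
    public static void main(String[] args) {
        // The barrier knows in advance it should wait for 3 threads.
        CyclicBarrier barrier = new CyclicBarrier(3,
                () -> System.out.println("all workers reached the barrier"));

        for (int i = 0; i < 3; i++) {
            final int id = i;
            new Thread(() -> {
                try {
                    System.out.println("worker " + id + " finished phase 1");
                    barrier.await();   // "register" and fall asleep until the count is correct
                    System.out.println("worker " + id + " starts phase 2");
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }).start();
        }
    }
}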

Page 19

Too Much Synchronization? Deadlock

Synchronization becomes even more complicated when multiple locks can be used

Can cause entire system to “get stuck”

Thread A:
  semaphore1.lock();
  semaphore2.lock();
  /* use data guarded by semaphores */
  semaphore1.unlock();
  semaphore2.unlock();

Thread B:
  semaphore2.lock();
  semaphore1.lock();
  /* use data guarded by semaphores */
  semaphore1.unlock();
  semaphore2.unlock();

(Image: RPI CSCI.4210 Operating Systems notes)
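One standard remedy, shown here as a hypothetical Java sketch rather than anything the slide prescribes, is to impose a global lock order: if every thread acquires lock1 before lock2, the circular wait above cannot form.

import java.util.concurrent.locks.ReentrantLock;

public class LockOrdering {
    static final ReentrantLock lock1 = new ReentrantLock();   // "semaphore1"
    static final ReentrantLock lock2 = new ReentrantLock();   // "semaphore2"

    // Both threads take lock1 before lock2, so the cycle from the slide
    // (A holds 1 and wants 2, B holds 2 and wants 1) cannot occur.
    static void threadA() {
        lock1.lock();
        lock2.lock();
        try { /* use data guarded by both locks */ }
        finally { lock2.unlock(); lock1.unlock(); }
    }

    static void threadB() {
        lock1.lock();          // same order as threadA, not the reversed order from the slide
        lock2.lock();
        try { /* use data guarded by both locks */ }
        finally { lock2.unlock(); lock1.unlock(); }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread a = new Thread(LockOrdering::threadA);
        Thread b = new Thread(LockOrdering::threadB);
        a.start(); b.start(); a.join(); b.join();
        System.out.println("no deadlock");
    }
}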

Page 20

And if you thought I was joking…

Page 21

The Moral: Be Careful!

• Synchronization is hard
  – Need to consider all possible shared state
  – Must keep locks organized and use them consistently and correctly

• Knowing there are bugs may be tricky; fixing them can be even worse!

• Keeping shared state to a minimum reduces total system complexity

Page 22

Fundamentals of Networking

Page 23

Sockets: The Internet = tubes?

• A socket is the basic network interface
• Provides a two-way “pipe” abstraction between two applications
• Client creates a socket and connects to the server, which receives a socket representing the other side

Page 24

Ports

• Within an IP address, a port is a sub-address identifying a listening program

• Allows multiple clients to connect to a server at once

Page 25

Example: Web Server (1/3)

1) Server creates a socket attached to port 80

The server creates a listener socket attached to a specific port. 80 is the agreed-upon port number for web traffic.

Page 26

Example: Web Server (2/3)

2) Client creates a socket and connects to host

When the client requests a URL (e.g., “www.google.com”), its OS uses a system called DNS to find its IP address.

The client-side socket is still connected to a port, but the OS chooses a random unused port number.

(Diagram: the client's anonymous-port socket connects to 66.102.7.99:80.)

Page 27

Example: Web Server (3/3)

3) Server accepts connection, gets new socket for client

Server chooses a randomly-numbered port to handle this particular client. The listener is ready for more incoming connections, while we process the current connection in parallel.

(Diagram: the listener stays on port 80; the new per-client connection uses anonymous ports.)

4) Data flows across the connected socket as a “stream”, just like a file
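Putting steps 1-4 together, here is a minimal Java sketch of the whole exchange (the port number 8080, class name, and message are my own illustrative choices, not from the slides): a listener accepts one client and the two sides talk over the connected streams.

import java.io.*;
import java.net.*;

public class EchoExample {
    public static void main(String[] args) throws Exception {
        ServerSocket listener = new ServerSocket(8080);          // 1) server socket on a port

        Thread server = new Thread(() -> {
            try (Socket client = listener.accept();               // 3) accept -> new per-client socket
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(client.getInputStream()));
                 PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
                out.println("echo: " + in.readLine());            // 4) data flows as a stream
            } catch (IOException e) {
                e.printStackTrace();
            }
        });
        server.start();

        try (Socket sock = new Socket("localhost", 8080);         // 2) client connects; OS picks local port
             PrintWriter out = new PrintWriter(sock.getOutputStream(), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(sock.getInputStream()))) {
            out.println("hello");
            System.out.println(in.readLine());                    // prints "echo: hello"
        }
        server.join();
        listener.close();
    }
}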

Page 28

What makes this work?

• Underneath the socket layer are several more protocols

• Most important are TCP and IP (which are used hand-in-hand so often, they’re often spoken of as one protocol: TCP/IP)

(Diagram: your data is wrapped in a TCP header, which in turn is wrapped in an IP header.)

Even more low-level protocols handle how data is sent over Ethernet wires, or how bits are sent through the air using 802.11 wireless…

Page 29

IP: The Internet Protocol

• Defines the addressing scheme for computers

• Encapsulates internal data in a “packet”

• Does not provide reliability

• Just includes enough information with the data to tell routers where to send it

Page 30

TCP: Transmission Control Protocol

• Built on top of IP
• Introduces concept of “connection”
• Provides reliability and ordering

(Diagram: your data inside a TCP header inside an IP header.)

Page 31

Why is This Necessary?

• Not actually tube-like “underneath the hood”
• Unlike the phone system (circuit switched), the packet-switched Internet uses many routes at once

(Diagram: many simultaneous routes between you and www.google.com.)

Page 32

Networking Issues

• If a party to a socket disconnects, how much data did they receive?

• … Did they crash? Or did a machine in the middle?

• Can someone in the middle intercept/modify our data?

• Traffic congestion makes switch/router topology important for efficient throughput

Page 33

Search Engine Technology

This presentation © Michael Cafarella. Redistributed under the Creative Commons Attribution 3.0 license.

Page 34

Doug Cutting

Page 35

Meta-details

• Built to encourage public search work
  – Open-source, w/ pluggable modules
  – Cheap to run, both machines & admins
• Goal: Search more pages, with better quality, than any other engine
  – Pretty good ranking
  – Has done ~200M pages, more possible
• Hadoop is a spinoff

Page 36

Page 37

(Diagram: Nutch architecture. Inject feeds the WebDB; the WebDB produces Fetchlists 0…N; Fetchers 0…N download Content 0…N and generate Updates 0…N that are applied back to the WebDB; Indexers 0…N build Indexes 0…N; Searchers 0…N answer queries through WebServers 0…M.)

Page 38

Moving Parts

• Acquisition cycle
  – WebDB
  – Fetcher
• Index generation
  – Indexing
  – Link analysis (maybe)
• Serving results

Page 39

WebDB

• Contains info on all pages, links
  – Per page: URL, last download, # failures, link score, content hash, ref counting
  – Per link: source hash, target URL
• Must always be consistent
• Designed to minimize disk seeks
  – 19 ms seek time × 200M new pages/mo ≈ 44 days of disk seeks! (0.019 s × 2×10⁸ ≈ 3.8×10⁶ s ≈ 44 days)
• Single-disk WebDB was a huge headache

Page 40

Fetcher

• Fetcher is very stupid. Not a “crawler”
• Pre-MapRed: divide “to-fetch list” into k pieces, one for each fetcher machine
• URLs for one domain go to same list, otherwise random
  – “Politeness” w/o inter-fetcher protocols
  – Can observe robots.txt similarly
  – Better DNS, robots caching
  – Easy parallelism
• Two outputs: pages, WebDB edits

Page 41

WebDB/Fetcher Updates

1. Write down fetcher edits
2. Sort edits (externally, if necessary)
3. Read streams in parallel, emitting new database
4. Repeat for other tables

WebDB (before):
  URL: http://www.flickr.com/index.html   LastUpdated: Never    ContentHash: None
  URL: http://www.cnn.com/index.html      LastUpdated: Never    ContentHash: None
  URL: http://www.yahoo.com/index.html    LastUpdated: 4/07/05  ContentHash: MD5_toewkekqmekkalekaa
  URL: http://www.about.com/index.html    LastUpdated: 3/22/05  ContentHash: MD5_sdflkjweroiwelksd

Fetcher edits:
  Edit: DOWNLOAD_CONTENT  URL: http://www.cnn.com/index.html     ContentHash: MD5_balboglerropewolefbag
  Edit: DOWNLOAD_CONTENT  URL: http://www.yahoo.com/index.html   ContentHash: MD5_toewkekqmekkalekaa
  Edit: NEW_LINK          URL: http://www.flickr.com/index.html  ContentHash: None

WebDB (after merging the edits):
  URL: http://www.cnn.com/index.html      LastUpdated: Today!   ContentHash: MD5_balboglerropewolefbag
  URL: http://www.yahoo.com/index.html    LastUpdated: Today!   ContentHash: MD5_toewkekqmekkalekaa

Page 42

Indexing

• How do we retrieve information from a large document set efficiently?

Page 43

Document Collection

Page 44

User Information Need

• Search inside this news site for articles that talk about culture between China and Japan, and don't talk about students abroad.

• QUERY:
  – “中国 日本 文化 -留学生” (China, Japan, culture, minus “overseas students”)

Page 45

How to do it?

• Could grep all web pages for “中国”, “文化” and “日本”, then strip out pages containing “留学生”?
  – Slow (for large corpora)
  – NOT “留学生” is non-trivial
  – Other operations (e.g., find “中国” NEAR “日本”) not feasible

Page 46

Document Representation

• Bag of words model
• Document-term incidence matrix

         中国  文化  日本  留学生  教育  北京  …
  D1      1     1     0      0      1     1
  D2      0     1     1      1      0     0
  D3      1     0     1      1      0     0
  D4      1     0     0      1      1     0
  D5      1     1     1      0      0     1
  D6      0     0     1      0      0     1

  (1 if the page contains the word, 0 otherwise)

Page 47

Incidence Vector

          D1  D2  D3  D4  D5  D6  …
  中国     1   0   1   1   1   0
  文化     1   1   0   0   1   0
  日本     0   1   1   0   1   1
  留学生   0   1   1   1   0   0
  教育     1   0   0   1   0   0
  北京     1   0   0   0   1   1

• Transpose the document-term incidence matrix
• So we have a 0/1 vector for each term.

Page 48

Retrieval

• Search inside this news site for articles that talk about culture between China and Japan, and don't talk about students abroad.

• To answer the query:
  – Take the vectors for “中国”, “文化”, “日本”, and “留学生” (complemented), and bitwise AND them:
  – 101110 AND 110010 AND 011011 AND 100011 = 000010
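A tiny sketch of the same computation, encoding each 6-bit incidence vector as an int (leftmost bit = D1; this encoding and the variable names are my own, chosen only to mirror the slide's bit strings):

public class BooleanRetrieval {
    public static void main(String[] args) {
        // 6-bit incidence vectors over D1..D6, taken from the slide (leftmost bit = D1).
        int zhongguo    = 0b101110;   // 中国
        int wenhua      = 0b110010;   // 文化
        int riben       = 0b011011;   // 日本
        int liuxuesheng = 0b011100;   // 留学生
        int notLiuxue   = ~liuxuesheng & 0b111111;   // complement within 6 documents = 100011

        int result = zhongguo & wenhua & riben & notLiuxue;
        System.out.println(Integer.toBinaryString(result));   // prints "10", i.e. 000010 -> D5
    }
}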

Page 49

Answer: D5

Page 50

Let’s build a search system!

• Consider N = 1 million documents, each with about 1K terms.

• Avg 6 bytes/term including spaces/punctuation → 6 GB of data in the documents.

• Say there are M = 500K distinct terms among these.

Page 51

Can’t build the matrix

• 500K x 1M matrix has half-a-trillion 0’s and 1’s.

• But it has no more than one billion 1's.
  – The matrix is extremely sparse.

• What's a better representation?
  – We only record the 1 positions.

Why?

Page 52

Inverted index

• For each term T: store a list of all documents that contain T.

• Do we use an array or a list for this?

  中国    → 1 2 3 5 8 13 21 34
  文化    → 2 4 8 16 32 64 128
  留学生  → 13 16

What happens if the word 中国 is added to document 14?

Page 53

Inverted index

• Linked lists generally preferred to arrays
  – Dynamic space allocation
  – Insertion of terms into documents easy
  – Space overhead of pointers

  Dictionary → Postings
  中国    → 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
  文化    → 2 → 4 → 8 → 16 → 32 → 64 → 128
  留学生  → 13 → 16

Sorted by docID (more later on why).

Page 54

Inverted index construction

Documents to be indexed:   “Friends, Romans, countrymen.”
  → Tokenizer → Token stream:   Friends  Romans  Countrymen
  → Linguistic modules → Modified tokens:   friend  roman  countryman
  → Indexer → Inverted index:
      friend → 2 → 4
      roman → 1 → 2
      countryman → 13 → 16

Page 55

Indexer steps

• Sequence of (Modified token, Document ID) pairs.

Doc 1: “I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.”
Doc 2: “So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.”

(term, Doc #) pairs:
  (I, 1) (did, 1) (enact, 1) (julius, 1) (caesar, 1) (I, 1) (was, 1) (killed, 1) (i', 1) (the, 1) (capitol, 1) (brutus, 1) (killed, 1) (me, 1)
  (so, 2) (let, 2) (it, 2) (be, 2) (with, 2) (caesar, 2) (the, 2) (noble, 2) (brutus, 2) (hath, 2) (told, 2) (you, 2) (caesar, 2) (was, 2) (ambitious, 2)

Page 56

• Sort by terms (the core indexing step).

Sorted (term, Doc #) pairs:
  (ambitious, 2) (be, 2) (brutus, 1) (brutus, 2) (capitol, 1) (caesar, 1) (caesar, 2) (caesar, 2) (did, 1) (enact, 1) (hath, 2) (I, 1) (I, 1) (i', 1) (it, 2) (julius, 1) (killed, 1) (killed, 1) (let, 2) (me, 1) (noble, 2) (so, 2) (the, 1) (the, 2) (told, 2) (you, 2) (was, 1) (was, 2) (with, 2)

Page 57

• Multiple term entries in a single document are merged.
• Frequency information is added.

(term, Doc #, Freq):
  (ambitious, 2, 1) (be, 2, 1) (brutus, 1, 1) (brutus, 2, 1) (capitol, 1, 1) (caesar, 1, 1) (caesar, 2, 2) (did, 1, 1) (enact, 1, 1) (hath, 2, 1) (I, 1, 2) (i', 1, 1) (it, 2, 1) (julius, 1, 1) (killed, 1, 2) (let, 2, 1) (me, 1, 1) (noble, 2, 1) (so, 2, 1) (the, 1, 1) (the, 2, 1) (told, 2, 1) (you, 2, 1) (was, 1, 1) (was, 2, 1) (with, 2, 1)

Page 58

• The result is split into a Dictionary file and a Postings file.

Dictionary (Term, # docs, total freq):
  (ambitious, 1, 1) (be, 1, 1) (brutus, 2, 2) (capitol, 1, 1) (caesar, 2, 3) (did, 1, 1) (enact, 1, 1) (hath, 1, 1) (I, 1, 2) (i', 1, 1) (it, 1, 1) (julius, 1, 1) (killed, 1, 2) (let, 1, 1) (me, 1, 1) (noble, 1, 1) (so, 1, 1) (the, 2, 2) (told, 1, 1) (you, 1, 1) (was, 2, 2) (with, 1, 1)

Postings (Doc #, Freq), in dictionary order:
  (2, 1) (2, 1) (1, 1) (2, 1) (1, 1) (1, 1) (2, 2) (1, 1) (1, 1) (2, 1) (1, 2) (1, 1) (2, 1) (1, 1) (1, 2) (2, 1) (1, 1) (2, 1) (2, 1) (1, 1) (2, 1) (2, 1) (2, 1) (1, 1) (2, 1) (2, 1)

Why split?
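A compact, self-contained sketch of the indexer steps above (collect (term, docID) pairs, sort, merge duplicates, record frequencies). The class and method names are hypothetical, not Nutch or Lucene code:

import java.util.*;

public class TinyIndexer {
    // Build term -> sorted postings (docID -> term frequency) from raw documents.
    static TreeMap<String, TreeMap<Integer, Integer>> index(List<String[]> docs) {
        TreeMap<String, TreeMap<Integer, Integer>> index = new TreeMap<>();
        for (String[] doc : docs) {                    // doc = {docId, text}
            int docId = Integer.parseInt(doc[0]);
            for (String token : doc[1].toLowerCase().split("\\W+")) {
                if (token.isEmpty()) continue;
                index.computeIfAbsent(token, t -> new TreeMap<>())
                     .merge(docId, 1, Integer::sum);   // merge duplicates, count frequency
            }
        }
        return index;   // TreeMaps keep terms sorted and postings sorted by docID
    }

    public static void main(String[] args) {
        List<String[]> docs = Arrays.asList(
            new String[]{"1", "I did enact Julius Caesar I was killed i' the Capitol Brutus killed me"},
            new String[]{"2", "So let it be with Caesar The noble Brutus hath told you Caesar was ambitious"});
        index(docs).forEach((term, postings) -> System.out.println(term + " -> " + postings));
        // e.g. brutus -> {1=1, 2=1}, caesar -> {1=1, 2=2}
    }
}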

Page 59

The index we just built

• How do we process a Boolean query?

Page 60

Query processing

• Consider processing the query: 中国 AND 文化
  – Locate 中国 in the Dictionary; retrieve its postings.
  – Locate 文化 in the Dictionary; retrieve its postings.
  – “Merge” the two postings:

    中国 → 1 2 3 5 8 13 21 34
    文化 → 2 4 8 16 32 64 128

Page 61

The merge

• Walk through the two postings simultaneously, in time linear in the total number of postings entries

  中国 → 1 2 3 5 8 13 21 34
  文化 → 2 4 8 16 32 64 128
  Result: 2 8

If the list lengths are x and y, the merge takes O(x+y) operations.
Crucial: postings sorted by docID.
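The merge is the classic two-pointer intersection; a minimal Java sketch (my own code, illustrating the O(x+y) walk over the postings from the slide):

import java.util.*;

public class PostingsMerge {
    // Intersect two postings lists, each sorted by docID, in O(x + y) time.
    static List<Integer> intersect(int[] p1, int[] p2) {
        List<Integer> answer = new ArrayList<>();
        int i = 0, j = 0;
        while (i < p1.length && j < p2.length) {
            if (p1[i] == p2[j]) { answer.add(p1[i]); i++; j++; }
            else if (p1[i] < p2[j]) i++;   // advance the pointer with the smaller docID
            else j++;
        }
        return answer;
    }

    public static void main(String[] args) {
        int[] zhongguo = {1, 2, 3, 5, 8, 13, 21, 34};    // 中国
        int[] wenhua   = {2, 4, 8, 16, 32, 64, 128};     // 文化
        System.out.println(intersect(zhongguo, wenhua)); // [2, 8], as on the slide
    }
}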

Page 62

Indexing in Nutch

• Iterate through all k page sets in parallel, constructing inverted index
• Creates a “searchable document” of:
  – URL text
  – Content text
  – Incoming anchor text
• Other content types might have different document fields
  – E.g., email has sender/receiver
  – Any searchable field the end-user will want
• Uses Lucene text indexer

Page 63

Link analysis

• A page's relevance depends on both intrinsic and extrinsic factors
  – Intrinsic: page title, URL, text
  – Extrinsic: anchor text, link graph

• PageRank is the most famous of many

• Others include:
  – HITS
  – OPIC
  – Simple incoming link count

Page 64

Page 65

Web Graph

http://www.touchgraph.com/TGGoogleBrowser.html

Page 66

PageRank Algorithm
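The slide's figure is not reproduced in this transcript; as a reference point, here is a minimal sketch of the standard PageRank power iteration from Brin and Page, PR(p) = (1-d)/N + d · Σ PR(q)/outdeg(q), with the conventional damping d = 0.85 and a made-up four-page link graph (both choices are mine, not from the slide):

public class PageRank {
    public static void main(String[] args) {
        // links[i] = pages that page i links to (a made-up 4-page web graph).
        int[][] links = { {1, 2}, {2}, {0}, {0, 2} };
        int n = links.length;
        double d = 0.85;                       // damping factor
        double[] pr = new double[n];
        java.util.Arrays.fill(pr, 1.0 / n);    // start from a uniform distribution

        for (int iter = 0; iter < 50; iter++) {             // power iteration
            double[] next = new double[n];
            java.util.Arrays.fill(next, (1 - d) / n);
            for (int p = 0; p < n; p++)
                for (int q : links[p])
                    next[q] += d * pr[p] / links[p].length;  // p shares its rank among its out-links
            pr = next;
        }
        for (int p = 0; p < n; p++)
            System.out.printf("PR(%d) = %.3f%n", p, pr[p]);
    }
}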

Page 67

Link analysis in Nutch

• Nutch performs analysis in WebDB
  – Emit a score for each known page
  – At index time, incorporate score into inverted index

• Extremely time-consuming
  – In our case, disk-consuming, too (because we want to use low-memory machines)

• Link analysis is sexy, but its importance is generally overstated, so keep it fast and easy:
  – 0.5 * log(# incoming links)

Page 68

Query Processing

(Diagram: the query “britney” is broadcast to five index servers holding Docs 0-1M, 1-2M, 2-3M, 3-4M, and 4-5M; each returns its local matches, e.g. Ds 1, 29; Ds 1.2M, 1.7M; Ds 2.3M, 2.9M; Ds 3.1M, 3.2M; Ds 4.4M, 4.5M; the front end merges them into one result list: 1.2M, 4.4M, 29, …)

Page 69

Administering Nutch

• Admin costs are critical
  – It's a hassle when you have 25 machines
  – Google has >100k, probably more

• Files
  – WebDB content, working files
  – Fetchlists, fetched pages
  – Link analysis outputs, working files
  – Inverted indices

• Jobs
  – Emit fetchlists, fetch, update WebDB
  – Run link analysis
  – Build inverted indices

Page 70

Administering Nutch (2)

• Admin sounds boring, but it's not!
  – Really
  – I swear

• Large-file maintenance
  – Google File System (Ghemawat, Gobioff, Leung)
  – Nutch Distributed File System

• Job Control
  – Map/Reduce (Dean and Ghemawat)
  – Pig (Yahoo Research)

• Data Storage (BigTable)

Page 71

Nutch Distributed File System

• Similar, but not identical, to GFS

• Requirements are fairly strange
  – Extremely large files
  – Most files read once, from start to end
  – Low admin costs per GB

• Equally strange design
  – Write-once, with delete
  – Single file can exist across many machines
  – Wholly automatic failure recovery

Page 72

NDFS (2)

• Data divided into blocks

• Blocks can be copied, replicated

• Datanodes hold and serve blocks

• Namenode holds meta info
  – Filename → block list
  – Block → datanode location

• Datanodes report in to namenode every few seconds

Page 73

NDFS File Read

(Diagram: a Namenode and Datanodes 0-5.)

1. Client asks namenode for filename info
2. Namenode responds with blocklist, and location(s) for each block
3. Client fetches each block, in sequence, from a datanode

  “crawl.txt” → (block-33 / datanodes 1, 4) (block-95 / datanodes 0, 2) (block-65 / datanodes 1, 4, 5)

Page 74

NDFS Replication

(Diagram: a Namenode and Datanode 0 (blocks 33, 95), Datanode 1 (46, 95), Datanode 2 (33, 104), Datanode 3 (21, 33, 46), Datanode 4 (90), Datanode 5 (21, 90, 104).)

1. Always keep at least k copies of each blk
2. Imagine datanode 4 dies; blk 90 lost
3. Namenode loses heartbeat, decrements blk 90's reference count. Asks datanode 5 to replicate blk 90 to datanode 0
4. Choosing replication target is tricky

Page 75

Map/Reduce

• Map/Reduce is a programming model from Lisp (and other places)
  – Easy to distribute across nodes
  – Nice retry/failure semantics

• map(key, val) is run on each item in set
  – emits key/val pairs

• reduce(key, vals) is run for each unique key emitted by map()
  – emits final output

• Many problems can be phrased this way

Page 76

Map/Reduce (2)

• Task: count words in docs
  – Input consists of (url, contents) pairs
  – map(key=url, val=contents):
    • For each word w in contents, emit (w, “1”)
  – reduce(key=word, values=uniq_counts):
    • Sum all “1”s in values list
    • Emit result “(word, sum)”
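A single-machine simulation of this word-count job (my own sketch; the real Hadoop/MapReduce API differs): map() emits (word, 1) pairs, a grouping step plays the role of the shuffle, and reduce() sums the counts.

import java.util.*;

public class WordCountMR {
    // map(key=url, val=contents): emit (w, 1) for each word w.
    static List<Map.Entry<String, Integer>> map(String url, String contents) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : contents.toLowerCase().split("\\W+"))
            if (!w.isEmpty()) out.add(new AbstractMap.SimpleEntry<>(w, 1));
        return out;
    }

    // reduce(key=word, values): sum the 1s.
    static int reduce(String word, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        Map<String, String> docs = Map.of(
            "url1", "to be or not to be",
            "url2", "to search is to find");

        // "Shuffle": group the emitted pairs by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        docs.forEach((url, contents) ->
            map(url, contents).forEach(kv ->
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue())));

        grouped.forEach((word, vals) ->
            System.out.println(word + " -> " + reduce(word, vals)));
        // e.g. to -> 4, be -> 2, search -> 1
    }
}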

Page 77

Map/Reduce (3)

• Task: grep
  – Input consists of (url+offset, single line)
  – map(key=url+offset, val=line):
    • If contents match regexp, emit (line, “1”)
  – reduce(key=line, values=uniq_counts):
    • Don't do anything; just emit line

• We can also do graph inversion, link analysis, WebDB updates, etc.

Page 78

Map/Reduce (4)

• How is this distributed?
  1. Partition input key/value pairs into chunks, run map() tasks in parallel
  2. After all map()s are complete, consolidate all emitted values for each unique emitted key
  3. Now partition the space of output map keys, and run reduce() in parallel

• If map() or reduce() fails, re-execute!

Page 79

Map/Reduce Job Processing

(Diagram: a JobTracker and TaskTrackers 0-5 running a “grep” job.)

1. Client submits “grep” job, indicating code and input files
2. JobTracker breaks input file into k chunks (in this case 6). Assigns work to tasktrackers.
3. After map(), tasktrackers exchange map-output to build reduce() keyspace
4. JobTracker breaks reduce() keyspace into m chunks (in this case 6). Assigns work.
5. reduce() output may go to NDFS

Page 80

Nutch & Hadoop

• NDFS stores the crawl and indexes

• MapReduce for indexing, parsing, WebDB construction, even fetching
  – Broke previous 200M/mo limit
  – Index-serving?

• Required massive rewrite of almost every Nutch component

Page 81

Summary

• Parallelization & Synchronization

• Fundamentals of Networking

• Search Engine Technology
  – Inverted index
  – PageRank algorithm

Page 82

Readings

• [Brin and Page, 1998] S. Brin and L. Page, “The Anatomy of a Large-Scale Hypertextual Web Search Engine,” in Proceedings of the 7th International World Wide Web Conference / Computer Networks, Amsterdam, 1998.

Page 83

Q&A