Google N-gram Data Analyzer
Project and presentation by Anagha Dharasurkar, Andrew Norgren, Premchand Bellamkonda, Shruti Pandey, and Salil Bapat.


[Diagram: system overview]
Input data: 1-gram and 2-gram data
Modules: Unigram Cut-off / Associative Cut-off, Disjoint Network Module, Reverse Network Creator, Network Interface Module, Path Finder Module
Intermediate data: Disjoint Networks, Reverse Networks
Query and result: Target Query, Output Paths

[Diagram: module responsibilities]
Input data: 2-gram data, 1-gram data

Distribution of Data (Premchand)
Input: Unigram data
Output: Array of unigrams

Find Disjoint Network (Shruti)
Input: Array of unigrams, bigram data
Output: Linked list, files

Agglomeration (Andrew)
Input: Number of networks for each processor, linked lists
Output: Number of disjoint networks

Number of Edges (Premchand)
Input: Number of networks
Output: Number of edges

Reverse Network Building (Salil)
Input: 2-gram data
Output: Reverse network

Network Interface Module (Salil)
Input: Reverse network
Output: Linked list

Path Finder Module (Anagha)
Input: Target query file
Output: Query paths

Final outputs: query paths, number of disjoint networks, number of nodes in each disjoint network, number of edges

Network building background
Using what we have – the 2gm data.
Building a reverse network.
Store whatever is built.
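The slides do not define the reverse network precisely. The sketch below is only one plausible reading: for each bigram line of the form `<word1> <word2> <frequency>`, it records `word1` under `word2`, so that the words leading into a token can be looked up directly. The function name, the in-memory dictionary, and the sample frequencies other than "match day 2000" are assumptions made for illustration; the project itself stores the network on disk.

```python
from collections import defaultdict


def build_reverse_network(bigram_lines):
    """Map each second word to the (first word, frequency) pairs that precede it.

    Hypothetical helper: assumes whitespace-separated "<w1> <w2> <freq>" lines.
    """
    reverse = defaultdict(list)
    for line in bigram_lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip lines that do not match the assumed format
        first, second, freq = parts
        reverse[second].append((first, int(freq)))
    return dict(reverse)


# Illustrative input; only the first line comes from the slides.
sample = ["match day 2000", "game day 150", "day off 75"]
network = build_reverse_network(sample)
```

With this layout, `network["day"]` lists the words that precede "day" together with the bigram frequencies.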

Network Details
The folder structure:
1 root directory
63 second-level directories
Third level of directories, with a file inside each directory.
Consider the bigram “match day 2000”.

Parallelism Details
Block allocation
Lines are distributed amongst processors instead of the files.
Processor 0 sends to each processor:
• Number of lines it has to process
• File number from which it should start
• The starting line
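As a rough sketch of the block allocation described above, the following computes, for each processor, the three values that Processor 0 sends: the number of lines to process, the file number to start from, and the starting line within that file. The function name, the zero-based numbering, and the way leftover lines are spread over the first ranks are assumptions; the slides do not say how the remainder is handled, and the actual MPI send calls are omitted.

```python
def block_allocation(lines_per_file, num_procs):
    """Split all lines (across files, in order) into contiguous blocks per rank."""
    total = sum(lines_per_file)
    base, extra = divmod(total, num_procs)
    assignments = []
    offset = 0  # global index of the first line owned by the current rank
    for rank in range(num_procs):
        # Assumption: the first `extra` ranks take one additional line each.
        count = base + (1 if rank < extra else 0)
        # Translate the global offset into (file number, line within file).
        remaining = offset
        file_no = 0
        while (file_no < len(lines_per_file) - 1
               and remaining >= lines_per_file[file_no]):
            remaining -= lines_per_file[file_no]
            file_no += 1
        assignments.append({
            "rank": rank,
            "num_lines": count,
            "start_file": file_no,
            "start_line": remaining,
        })
        offset += count
    return assignments


# Three hypothetical files with 10, 6 and 4 lines, shared by 4 processors.
plan = block_allocation([10, 6, 4], 4)
```

A rank whose block crosses a file boundary simply continues reading from the next file until it has consumed `num_lines` lines.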

Benchmarking

Processors   Timings          Number of 2-gm files
16           3 hrs 56 mins    32
32           2 hrs 47 mins    32
64           1 hr 58 mins     32
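The timings above can be turned into relative speedup figures. Since no single-processor run is reported, the short calculation below uses the 16-processor run as the baseline, which is a choice made here and not something stated in the slides.

```python
# Timings from the benchmarking table, converted to minutes.
timings_min = {16: 3 * 60 + 56, 32: 2 * 60 + 47, 64: 1 * 60 + 58}

base_procs = 16
base_time = timings_min[base_procs]

# Speedup relative to the 16-processor run.
speedups = {p: base_time / t for p, t in timings_min.items()}

# Relative efficiency: achieved speedup divided by the increase in processors.
efficiency = {p: speedups[p] / (p / base_procs) for p in timings_min}

for p in sorted(timings_min):
    print(f"{p} processors: speedup {speedups[p]:.2f}, efficiency {efficiency[p]:.2f}")
```

Going from 16 to 64 processors (four times the processors) halves the run time, so the relative efficiency at 64 processors is 0.5.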

Space Requirements Space Requirements

Not much to store in the memory.Not much to store in the memory.

Large space requirements. Around 5.5 gbs Large space requirements. Around 5.5 gbs for the google 2 gram data.for the google 2 gram data.

As a general rule, the reverse network will As a general rule, the reverse network will be approximately of the same size as the be approximately of the same size as the original data content. original data content.

Finding Disjoint Networks

Module Description:

This module deals with finding the disjoint networks from the google-2-gram data. It takes unigrams as input and for each unigram, it gets all the tokens connected to it and processes them as described later to find the disjoint network.

Approach

We exploited the simple fact that if two networks of words have any word in common, then the two networks are connected.

Example:

Network 1: A --> B --> C --> D --> E --> Z

Network 2: Q --> Y --> P --> R --> S --> A --> V -->

In the above example, A is common to both networks, so we can say that the two networks are connected.

Distribution of Data
Distributes the unigram data
Follows block distribution
Finds the number of lines in the unigram file
Then finds the interval for the block distribution

Data Structure Used

We have used a two-dimensional linked list structure. The first linked list (Network List) contains all the connected words, and the second linked list (Base List) connects all the network lists.

[Diagram: a Base List whose entries each lead to a Network List]

Working
[Diagram: root tokens 1, 2, 3, 4]
1. Get the root tokens.
2. Get the words connected to the root tokens.
3. If it is the first root token:
[Diagram: B1 -> 1]
4. If not the first root token, then process each word one by one.
Nature of this network: unique
-> if connected to 1 existing network.
-> if connected to some network different to the marked network
-> Not connected at all

Working (cntd.)
Cases:
1. None of the words in root token 2 exist in root token 1
[Diagram: B1 -> 1, B2 -> 2]
2. Any one word exists in an already existing network
[Diagram: B1 -> 1 2]

Working (cntd.)
3. A word is common to a network different to the marked network
To process: 3
Existing: B1 -> 1, B2 -> 2
Result: B1 -> 1 3 2
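The three cases can be sketched in a few lines. This is an illustration only: it uses Python sets in a plain list in place of the two-dimensional linked list described in the slides, and the function name and sample words are made up for the example.

```python
def add_root_token(networks, root, connected_words):
    """Fold one root token and its connected words into the disjoint networks."""
    group = {root, *connected_words}
    marked = None  # index of the first existing network that shares a word
    i = 0
    while i < len(networks):
        if networks[i] & group:
            if marked is None:
                # Case 2: a word exists in an already existing network.
                marked = i
                networks[i] |= group
            else:
                # Case 3: a word is also common to a different network,
                # so that network is merged into the marked one.
                networks[marked] |= networks.pop(i)
                continue  # the list shifted left; re-check the same index
        i += 1
    if marked is None:
        # Case 1: not connected to anything seen so far.
        networks.append(group)
    return networks


networks = []
add_root_token(networks, "A", ["B", "C"])   # first root token
add_root_token(networks, "Q", ["Y", "P"])   # case 1: new network
add_root_token(networks, "C", ["D"])        # case 2: joins the first network
add_root_token(networks, "P", ["A"])        # case 3: links both networks
```

Because the existing networks are always pairwise disjoint, checking each one against the new group alone is enough to find every network that has to be merged.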

Animated Example
[Animation frames not preserved; labels shown: root tokens 1 2 3 4, B1, TEMP]

Observations & Conclusion
Execution takes a lot of time.
2gm-0031 data has 1869 networks.
Initially fast.
Execution slows down as network size increases.
Use of a linked list of arrays for speeding up the process.

Agglomeration
Combines work of all processors
Finds:
• Number of disjoint networks
• Number of nodes in each network
For this step to work:
• Processor 0 and k (k = np/2) have networks in linked list
• Other processors have written out their networks to file

How It Works
Processors 1 to k-1 send their “local” number of networks to Processor 0
Processors k+1 to (number of processors) - 1 send theirs to Processor k
Processor 0 and k combine networks
Open files and check if a word is in their networks.
• Yes – Combine the two networks (eliminating redundancy)
• No – Add that network to its list of networks

Final Step
Processor k writes its networks to files
Sends its number of networks to Processor 0
Processor 0 then combines those networks

Results
Processor 0 has list of disjoint networks
Prints out number of disjoint networks
Prints out the number of nodes in each network

Unigram Cut-Off
Happens during distribution of data to processors
When distributing to processors, check for condition:
• If frequency of unigram is > cut-off, store in array for distribution.
• Else ignore that unigram
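A minimal sketch of the unigram cut-off check, assuming each unigram line has the form `<word> <frequency>`. The function name and the sample words and counts are hypothetical; only the strict "greater than cut-off" condition comes from the slide.

```python
def apply_unigram_cutoff(unigram_lines, cutoff):
    """Keep only the unigrams whose frequency is strictly above the cut-off."""
    kept = []
    for line in unigram_lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip lines that do not match the assumed format
        word, freq = parts
        if int(freq) > cutoff:
            kept.append(word)  # stored in the array for distribution
        # else: the unigram is ignored
    return kept


# Made-up unigram counts, for illustration only.
sample = ["the 5000", "match 120", "zyzzyva 3"]
kept = apply_unigram_cutoff(sample, 100)
```

Because the comparison is strict, a unigram whose frequency equals the cut-off is dropped.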

Associative Cut-off
Happens during the path finding module
For each path found, find the association score:
• If > association cut-off, then include in path
• Else don’t include in path

Path Finding
This module supports queries against the constructed network.
The aim was to explore the built network by path finding.
The queries allow a user to specify a target word and display the paths of a given length leading to and from that word, and to the words connected to those words, etc.

Requirements
The specified target word should be at the center of the paths that lead into and out from it.
Path lengths are defined in terms of the number of edges in the path to and from the target word.
Eg: was 3 (Path length: 3)
Italian --> (34) --> poor --> (34) --> girl --> (43) --> was --> (34) --> hardworking --> (432) --> and --> (23) --> beautiful
TIME: 0.432
(+ more path length 3 variations)
The number between 2 words represents the frequency of that bigram.

Algorithm (Broader View)
Read the query (target-list) file according to the file format, which is <token> <path length>.
Distribute each query from the target list to processors in a parallel manner (using MPI).
Each processor builds its internal tree structure and finds the entire paths.
One processor needed to be dedicated to printing. If all processors start printing, chaos occurs, because the full result set for a single query word must be grouped together.
All processors send the path results they obtain to the processor with rank 0, which is responsible for printing the individual paths obtained by each processor.
Caching logic helped in cycle detection to some extent.

Algorithm Details (A Closer Look)
[Diagram: call structure]
StartQuerySearch()
ReadQueryFile()
BuildNetworkForTarget()
CreateGraphNode()
GetNodeFromCache()
AddNodeToCache()
BuildFromNetwork() (Recursive)
BuildToNetwork() (Recursive)
GetLinksFromNode() (Network Interface)
GetLinksToNode() (Network Interface)
AddLinks()
PrintOutput()
AddToResults()

Challenges
The recursive traversal done for all the 'from' and 'to' nodes of the given target node limits the scope of parallelism.
Memory issue (maximum memory limit): for path lengths up to one there was no problem.
Eg: the Bush 1 entry has 20000+ 'from/to' words associated with it. Processing the 'from' and 'to' lists of each of these 20000+ words recursively requires a huge investment of memory and time. This causes the program to hit the maximum memory limit on Blade easily, before path processing is complete.
Blade: maximum memory limit of 7GB for user programs (4 million nodes in our case before it crashes).

Alternatives to overcome challenges
Fix memory leaks: the code had memory leaks in some places. Identified the major culprits in memory consumption and freed them appropriately for optimum memory consumption.
Major bottlenecks
Eg: any time a 'from' list or 'to' list for a token was obtained, the memory was not getting freed.
Function AddToResults() was allocating memory on every path found but was not freeing it.

Alternatives to overcome challenges (continued)
Migration to ALTIX: since the amount of memory available on ALTIX is a lot more than on Blade, the chances of path finding working for path lengths greater than 1 were high.
Exploited good memory support on Altix by writing data to files.
This gave good results for path lengths up to 8-9 for smaller-scope target words and 4-5 for slightly higher-scope words.
The files to which data was written were as big as 20GB.

Change in Methodology and Performance
The new method exploited memory and also enhanced performance in terms of the time required to find paths.
However, because of the recursive nature of the algorithm, the inherent sequential component was fixed, and this limited performance according to Amdahl’s Law.
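Amdahl's Law bounds the speedup on n processors by 1 / (s + (1 - s) / n), where s is the fraction of the work that must run sequentially. The snippet below only illustrates the formula. The slides do not report the sequential fraction of the path finder, so the value 0.2 is purely hypothetical.

```python
def amdahl_speedup(serial_fraction, num_procs):
    """Upper bound on speedup when `serial_fraction` of the work cannot be parallelised."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / num_procs)


# Hypothetical sequential fraction; not a measured value from the project.
serial_fraction = 0.2
bounds = {n: amdahl_speedup(serial_fraction, n) for n in (16, 32, 64)}

for n, bound in bounds.items():
    print(f"{n} processors: speedup at most {bound:.2f}")
```

However many processors are added, the bound never exceeds 1 / s (here 5), which is why a fixed sequential component caps the achievable gain.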

New Method (A Closer Look)
[Diagram: call structure]
StartQuerySearch()
ReadQueryFile()
BuildNetworkForTarget()
CreateGraphNode()
GetNodeFromCache()
AddNodeToCache()
BuildFromNetwork() (Recursive)
BuildToNetwork() (Recursive)
GetLinksFromNode() (Network Interface)
GetLinksToNode() (Network Interface)
AddLinks()
SetupOutputFiles()
AddToResults()
WriteStagingOutput() (Recursive)
CombineFinalOutputFiles()

How is this better?
Merging:
[Diagram: Left Side Paths and Right Side Paths joined at the Target Word]
Final output of combined files: for each line of the left file, combine it with every line in the right file to form a complete path.
This architecture allows parallelism at a much more granular level than just query level.
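The merging step can be pictured as a cross product of the two staging files. In the sketch below, the function name, the " --> " separator, and the assumption that staging lines do not themselves contain the target word are all illustrative choices; bigram frequencies are left out, and the second left-hand path is invented for the example.

```python
def combine_lines(left_lines, right_lines, target, sep=" --> "):
    """Join every left-side path with every right-side path through the target word."""
    combined = []
    for left in left_lines:
        for right in right_lines:
            combined.append(sep.join([left, target, right]))
    return combined


# First left path and the right path follow the "was 3" example from the slides.
left_paths = ["Italian --> poor --> girl", "the --> old --> house"]
right_paths = ["hardworking --> and --> beautiful"]
complete = combine_lines(left_paths, right_paths, "was")

for path in complete:
    print(path)
```

Each left-side line can be combined independently of the others, which is what makes it possible to split this step across processors at a finer grain than one query per processor.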

Questions?

Thank you!
