CLOUD COMPUTING PROJECT By: - Manish Motwani - Devendra Singh Parmar - Ashish Sharma

Jun 01, 2020



Instructor: Prof. Reddy Raja
Mentor: Ms M. Padmini

To implement the PageRank algorithm using Map-Reduce for Wikipedia, and to verify it on smaller data sets.

• Motivation
• Introduction to Algorithm
• PageRank Equation Analysis
• Brief Description of Project
• Module1
• Module2
• Module3
• Applications

-> Need for PageRank:

Search engines store billions of web pages, which together contain trillions of URL links. An algorithm is therefore needed that returns the pages most relevant to a given query.

-> Need for a Distributed Environment (Map-Reduce and Distributed Storage):

• Trillions of links imply a huge storage requirement (if each URL requires 0.5 KB, then we need over 400 TB just to store the URLs!)
• A large data set implies large computations


PageRank is a link analysis algorithm, named after Larry Page and used by the Google Internet search engine, that assigns a numerical weighting to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of "measuring" its relative importance within the set.

The numerical weight that it assigns to any given element E is also called the PageRank of E and is denoted by PR(E).


Google figures that when one page links to another page, it is effectively casting a vote for the other page. The more votes that are cast for a page, the more important the page must be. Also, the importance of the page that is casting the vote determines how important the vote itself is. Google calculates a page's importance from the votes cast for it, and the importance of each vote is taken into account when a page's PageRank is calculated.


Simple Iterative Algorithm

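For the k-th iteration, the PageRank of the i-th page is obtained by summing, over every page that links to it, that page's PageRank from iteration k-1 divided by its number of outbound links.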
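In symbols, the undamped update is commonly written as below. This is a sketch in assumed notation rather than the slide's own: t_1, ..., t_n are taken to be the pages that link to page i, and C(t_j) the number of outbound links on page t_j.

```latex
PR_k(i) = \sum_{j=1}^{n} \frac{PR_{k-1}(t_j)}{C(t_j)}
```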


Problems:

• Sinks or Dangling Pages
• Cycles

Solution:


Solution for cycles, and for the case where a random surfer gets bored:

Here 'd' is known as the damping factor. It represents the probability, at any step, that the person will continue surfing. The value of 'd' is typically set to 0.85.
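Written out, the damped update has the standard form below, which agrees with the "0.15/N + 0.85 * share" wording used elsewhere in this deck when d = 0.85. The symbol meanings are assumptions where the slides do not spell them out: x is the page being ranked, t_1, ..., t_n are the pages linking to x, C(t_i) is the number of outbound links on t_i, and N is the number of documents in the collection.

```latex
PR(x) = \frac{1 - d}{N} + d \sum_{i=1}^{n} \frac{PR(t_i)}{C(t_i)}
```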


In simpler terms:

a page's PageRank = 0.15/N + 0.85 * (a "share" of the PageRank of every page that links to it)

"share" = the linking page's PageRank divided by the number of outbound links on that page

N = the number of documents in the collection

The equation of PageRank shows clearly how a page's PageRank is arrived at. But what isn't immediately obvious is that it can't work if the calculation is done just once.
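As a small worked example of the formula, here is a minimal Python sketch of a single update for one page. The function name `page_rank_update` and the sample numbers are hypothetical and chosen only for illustration.

```python
def page_rank_update(inbound, n_pages, d=0.85):
    """Return one page's new PageRank from a single application of the formula.

    inbound: list of (linking_page_rank, linking_page_outlink_count) pairs.
    n_pages: N, the number of documents in the collection.
    """
    # Each linking page contributes its "share": rank / number of outbound links.
    share_total = sum(rank / out_count for rank, out_count in inbound)
    return (1 - d) / n_pages + d * share_total


# Hypothetical collection of 4 pages; two of them link to the page of interest.
# One has rank 0.25 and 2 outlinks, the other has rank 0.25 and 1 outlink.
new_rank = page_rank_update([(0.25, 2), (0.25, 1)], n_pages=4)
print(round(new_rank, 5))
```

The result depends on the ranks of the linking pages, which are themselves only estimates, so one pass is not enough.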


Input: A data set containing multiple records, where each record contains the URL of a page (FromUrl) followed by the URL of a page that it points to (ToUrl).

Wiki_Votes.txt: FromUrl ToUrl


Output: The output file consists of records containing the URL of the page (FromUrl), the PageRank value of the page (PRValue) and the list of URLs that the page points to (ToUrlList).

FinalOutput.txt: FromUrl PRValue ToUrlList


Module1: Converter

Module2: PageRank Calculator

Module3: Output Analyzer

[Diagram: WebGraph -> Converter -> PageRank Calculator (iterate until convergence) -> Output Analyzer -> Search Engine, ..., Create Index]


Converter (initializing with PR = 1/N)

FromUrl PRValue List:


Self loops:

- handled by comparing FromUrl with ToUrl before the record is sent to the reduce function.

Dangling pages:

- handled by initializing their PRValue to 1/N and leaving their list of ToUrls blank.
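A minimal Python sketch of the converter step, under stated assumptions: the function name `convert` and the in-memory dictionary layout are hypothetical, and a self loop is simply dropped (the slides say only that FromUrl is checked against ToUrl, not what happens next).

```python
from collections import defaultdict


def convert(edges):
    """Turn (from_url, to_url) records into {url: (pr_value, to_url_list)}."""
    out_links = defaultdict(list)
    pages = set()
    for from_url, to_url in edges:
        pages.update((from_url, to_url))
        if from_url == to_url:
            continue  # assumption: a self loop is dropped and casts no vote
        out_links[from_url].append(to_url)
    initial = 1.0 / len(pages)  # every page starts with PR = 1/N
    # Dangling pages never appear as from_url, so their list stays empty.
    return {url: (initial, out_links.get(url, [])) for url in sorted(pages)}


# Hypothetical records: C only links to itself, so it ends up dangling.
graph = convert([("A", "B"), ("A", "C"), ("B", "C"), ("C", "C")])
for url, (pr_value, to_urls) in graph.items():
    print(url, pr_value, to_urls)
```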


PageRank Calculator (the user can specify the precision)


Map:

Input: index.html PRValue OutList: <1.html 2.html ...>

Output:

1. Output for each outlink:
   key: "1.html"
   value: PRValue / ListLength (vote share)

2. The page itself (FromUrl):
   key: index.html
   value: <OutList>

Reduce:

Input:
   Key: "1.html"
   Value: 0.5 23
   Value: 0.24 2
   ...
   Value: UrlList <OutLink>

Output:
   Key: "1.html"
   Value: "<new pagerank> <OutList> 1.html 2.html ..."

$$PR(x) = \frac{1 - d}{N} + d \sum_{i=1}^{n} \frac{PR(t_i)}{C(t_i)}$$

Start with the initial PageRank and outlinks of a document.


For each outlink, output its share of the page's PageRank, and also output the page's list of outlinks.
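The map step can be sketched in plain Python as a generator, without any Hadoop machinery. The name `pagerank_map` and the "share"/"links" tags on the values are assumptions made here so that a reducer can tell vote shares apart from the outlink list; the slides do not say how the two kinds of value are distinguished.

```python
def pagerank_map(url, pr_value, out_list):
    """Yield (key, value) pairs for one input record."""
    # 1. One vote share per outlink: PRValue / ListLength.
    for to_url in out_list:
        yield to_url, ("share", pr_value / len(out_list))
    # 2. The page's own outlink list, so the reducer can write it back out.
    yield url, ("links", list(out_list))


# Hypothetical record: index.html has rank 0.5 and two outlinks.
pairs = list(pagerank_map("index.html", 0.5, ["1.html", "2.html"]))
for key, value in pairs:
    print(key, value)
```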


Now the reducer has the URL of a document, all the inlinks to that document with their corresponding PageRank shares, and the document's list of outlinks.


Compute the new PageRank and output in the same format as the input.
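A matching sketch of the reduce step in plain Python. The name `pagerank_reduce`, the "share"/"links" value tags, and the collection size of 4 are assumptions for illustration; the two vote shares 0.5 and 0.24 echo the example values on the slide.

```python
def pagerank_reduce(url, values, n_pages, d=0.85):
    """Combine all values for one key into (url, new_pr_value, out_list)."""
    share_total = 0.0
    out_list = []
    for kind, payload in values:
        if kind == "share":
            share_total += payload  # a vote share from one inlink
        else:
            out_list = payload  # "links": the page's own outlink list
    new_pr = (1 - d) / n_pages + d * share_total
    # Same shape as the mapper's input record, ready for the next iteration.
    return url, new_pr, out_list


record = pagerank_reduce(
    "1.html",
    [("share", 0.5), ("share", 0.24), ("links", ["2.html"])],
    n_pages=4,
)
print(record)
```

Returning the record in the input format is what lets the same job be run again on its own output.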

Now iterate until convergence (determined by the precision value).


Suppose we have 2 pages, A and B, which link to each other, and neither has any other links of any kind. This is what happens:

Step 1: Calculate A's PageRank from the value of its inbound links.

Step 2: Calculate B's PageRank from the value of its inbound links.

We can't work out A's PageRank until we know B's PageRank, and we can't work out B's PageRank until we know A's PageRank. Thus a single pass leaves the PageRank of A and B inaccurate.


This problem is overcome by repeating the calculations many times. Each pass produces slightly more accurate values. In fact, total accuracy can never be achieved because the calculations are always based on inaccurate values. The number of iterations should be sufficient to reach a point where further iterations would not change the values enough to matter.

=> Use a "delta function" that keeps track of the change in PageRank of every page; once the change for all pages is less than the value specified by the user, the iterations can be stopped.
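The stopping rule can be sketched as a small in-memory loop in Python. This is an illustration under stated assumptions, not the project's Map-Reduce job: the name `iterate_pagerank`, the default precision, the iteration cap, and the three-page graph are all hypothetical, and "delta" is taken to be the largest per-page change between two passes.

```python
from collections import defaultdict


def iterate_pagerank(graph, precision=1e-6, d=0.85, max_iterations=200):
    """graph: {url: out_list}. Returns ({url: rank}, iterations_used)."""
    n = len(graph)
    ranks = {url: 1.0 / n for url in graph}  # start from PR = 1/N
    for iteration in range(1, max_iterations + 1):
        shares = defaultdict(float)
        for url, out_list in graph.items():  # "map": emit vote shares
            for to_url in out_list:
                shares[to_url] += ranks[url] / len(out_list)
        # "reduce": apply the damped formula to the summed shares
        new_ranks = {url: (1 - d) / n + d * shares[url] for url in graph}
        # delta: the largest change in PageRank over all pages
        delta = max(abs(new_ranks[url] - ranks[url]) for url in graph)
        ranks = new_ranks
        if delta < precision:
            return ranks, iteration
    return ranks, max_iterations


# A and B link to each other; C links to A. No dangling pages here, since
# this sketch does not redistribute the rank of a page without outlinks.
ranks, used = iterate_pagerank({"A": ["B"], "B": ["A"], "C": ["A"]})
print(used, {url: round(rank, 4) for url, rank in ranks.items()})
```

A smaller precision value buys more accurate ranks at the cost of more passes over the data.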


Analyzer (if the user wants the top 3 pages)

[Figure: sample analyzer input and the resulting output]
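A top-k selection like the one the analyzer performs can be sketched in Python with the standard-library `heapq.nlargest`. The function name `top_pages` and the sample records are hypothetical; the slides do not show how the analyzer is implemented.

```python
import heapq


def top_pages(records, k=3):
    """records: iterable of (url, pr_value). Returns the k highest-ranked."""
    return heapq.nlargest(k, records, key=lambda record: record[1])


# Hypothetical PageRank output for four pages.
sample = [("a.html", 0.10), ("b.html", 0.40), ("c.html", 0.05), ("d.html", 0.45)]
top3 = top_pages(sample)
print(top3)
```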


A simple model of a search engine (implemented).

The application utilizes:
1. The PageRank calculated by the PageRank Calculator
2. The output generated by a map-reduce module that finds the number of times a pattern (as per the user's query) matches in each of the files present in the data set.

And outputs: the list of pages that are relevant to the query, in order of their importance.

(DEMO)
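One way the two inputs could be combined is sketched below in Python. How the project actually merges them is not stated in the slides, so this rests on an assumption: a page counts as relevant if the pattern matches at least once, and relevant pages are ordered by PageRank alone. The function name `search` and the sample data are hypothetical.

```python
def search(page_ranks, match_counts):
    """Return URLs with at least one pattern match, highest PageRank first."""
    hits = [url for url, count in match_counts.items() if count > 0]
    return sorted(hits, key=lambda url: page_ranks.get(url, 0.0), reverse=True)


# Hypothetical outputs of the PageRank Calculator and the pattern-match job.
page_ranks = {"a.html": 0.2, "b.html": 0.5, "c.html": 0.3}
match_counts = {"a.html": 3, "b.html": 0, "c.html": 1}
results = search(page_ranks, match_counts)
print(results)
```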


Other Applications:

• PageRank-based mechanism to rank knowledge items used in E-Learning.

• GeneRank (based on PageRank) ranks the genes analyzed in the microarray to see the relationship between the cell’s function and gene expression.

• Can be used to sort the items present in the side menu in various blogs and sites depending on their importance.


Questions


Thank You