Information Management on the World-Wide Web
Junghoo “John” Cho, UCLA Computer Science

Dec 19, 2015

Transcript
Page 1:

Information Management on the World-Wide Web

Junghoo “John” Cho

UCLA Computer Science

Page 2:

The Web and Information Galore

Page 3:

10 Years Ago

Reading papers for research
– Stacks of papers
– Long wait

Page 4:

With Web

Page 5:

Challenges (1)

Information overload
– Too much information, too little time

Page 6:

Information Overload

“XML” to Google
– 14 million matching documents!

“XML” to Amazon
– 464 matching books!

Which one to read?

Page 7:

Challenges (2)

Hidden Web
– Not indexed by search engines
– “Hidden” from an average user
– Browse every site manually?

Page 8:

Challenges (3)

Transience

Page 9:

Challenges (4)

Scattered & unstructured data
– All Computer Science faculty members and graduate students in the US?

Page 10:

Projects In Our Group

Web Archive
Hidden Web Integration
Page Ranking Algorithm
User Recommendation System

Page 11:

User Recommendation System

464 books on XML
Which one to read?
– The one that my colleagues and friends recommend?

Page 12:

Amazon’s Recommendation System

1 – 5 star rating by individual users
Books can be sorted by “average user rating”

Page 13:

My Typical Scenario

Sort books by their average user rating
Browse top 20 books to decide what to read

Page 14:

Questions

Is “5 star” by one user better than “4.9 star” by 100 users?
– Intuitively, I prefer 4.9 star by 100 users
– More “reliable” rating

How much can I trust the rating of a particular person?
– How do I know that the person’s rating is reliable?

Page 15:

Our Approach

“Inherent quality” or “rating” of a book
– How many users would recommend the book (i.e., give a high rating) if all users had read the book?

More user ratings → more information on the “quality” of the book
– An average user is likely to give a high rating to a high-quality book

Page 16:

Probabilistic Rating Model

How likely is it that the book is of “4-star” quality?
– Rating probability distribution

[Figure: probability density (0 to 1) plotted against book rating/quality (1 to 5)]

Page 17:

Update of Rating Probability

As more users provide ratings, we update our probability distribution

[Figure: probability density (0 to 1) plotted against book rating/quality (1 to 5)]

Page 18:

Update of Rating Probability

As more users provide ratings, we update our probability distribution

[Figure: probability density (0 to 1) plotted against book rating/quality (1 to 5), after a five-star rating by a user]

Page 19:

Update of Rating Probability

As more users provide ratings, we update our probability distribution

[Figure: probability density (0 to 1) plotted against book rating/quality (1 to 5), after a one-star rating by a user]

Page 20:

Update of Rating Probability

As more users provide ratings, we update our probability distribution

[Figure: probability density (0 to 1) plotted against book rating/quality (1 to 5), after many ratings]

Page 21:

Bayesian Inference Theory

Given a user rating UR, what is the inherent rating IR?

$$P(IR \mid UR) = \frac{P(UR \mid IR)\, P(IR)}{P(UR)}$$

– P(IR): probability of the book rating BEFORE the user rating
– P(IR | UR): probability of the book rating AFTER the user rating
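The Bayesian update on this slide can be sketched on a discrete grid of inherent ratings 1 to 5. The slides do not specify P(UR | IR), so the `rating_likelihood` function below is a hypothetical stand-in that simply makes ratings close to the inherent rating more likely; the uniform prior and the names `bayes_update`, `prior`, `after_five`, and `after_one` are also assumptions made for illustration.

```python
def bayes_update(prior, likelihood):
    """Return P(IR | UR) for each candidate inherent rating IR = 1..5.

    prior[i]      = P(IR = i + 1), before the user rating
    likelihood[i] = P(UR | IR = i + 1) for the rating actually observed
    """
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    evidence = sum(unnormalized)  # P(UR), the denominator in Bayes' rule
    return [u / evidence for u in unnormalized]


def rating_likelihood(user_rating, spread=1.0):
    """Hypothetical P(UR | IR): ratings near the inherent rating are likelier.

    This shape is an assumption, not the model used in the talk.
    """
    likelihood = []
    for inherent in range(1, 6):
        # Unnormalized weight of every possible user rating given this IR.
        row = [1.0 / (1.0 + abs(r - inherent) / spread) for r in range(1, 6)]
        likelihood.append(row[user_rating - 1] / sum(row))
    return likelihood


prior = [0.2] * 5  # assumed starting point: every quality equally likely
after_five = bayes_update(prior, rating_likelihood(5))
after_one = bayes_update(prior, rating_likelihood(1))
```

Under these assumptions a five-star rating moves probability mass toward the high end of the scale and a one-star rating moves it toward the low end. Feeding each posterior back in as the next prior applies further ratings one at a time.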

Page 22:

User Model

The characteristics of a user

Sensitivity: slope of the curve
+1: good, –1: bad, 0: not useful

[Figure: two panels, “Good” and “Bad”, each plotting user rating (1 to 5) against book quality (1 to 5)]

Page 23:

User Model

The characteristics of a user

Bias: average “height” of the curve

[Figure: two panels, “Positive bias” and “Negative bias”, each plotting user rating (1 to 5) against book quality (1 to 5)]
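One way to make sensitivity and bias concrete is to fit a straight line through a user's (book quality, user rating) points. This is a minimal sketch, assuming a linear user curve and assuming that bias is measured as the gap between the user's average rating and the average quality; neither definition is spelled out in the slides, and `fit_user_model` is a hypothetical helper.

```python
def fit_user_model(qualities, ratings):
    """Least-squares line through (book quality, user rating) points.

    Returns (sensitivity, bias): sensitivity is the slope of the line, and
    bias is how far the user's average rating sits above the average quality.
    """
    n = len(qualities)
    mean_q = sum(qualities) / n
    mean_r = sum(ratings) / n
    var_q = sum((q - mean_q) ** 2 for q in qualities)
    cov = sum((q - mean_q) * (r - mean_r) for q, r in zip(qualities, ratings))
    sensitivity = cov / var_q if var_q else 0.0
    bias = mean_r - mean_q
    return sensitivity, bias


qualities = [1, 2, 3, 4, 5]
good_user = fit_user_model(qualities, [1, 2, 3, 4, 5])      # slope +1, no bias
bad_user = fit_user_model(qualities, [5, 4, 3, 2, 1])       # slope -1, no bias
flat_user = fit_user_model(qualities, [3, 3, 3, 3, 3])      # slope 0: not useful
generous_user = fit_user_model(qualities, [2, 3, 4, 5, 5])  # positive bias
```

In a real system the book qualities are not known in advance, so the `qualities` list would have to come from the current quality estimates.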

Page 24:

Iterative Model Refinement

As more users rate a book, we get better estimates of the book’s quality

As we estimate a book’s quality better, we get a better idea of a user’s sensitivity and bias

Page 25:

Iterative Model Refinement

[Diagram: User-provided Rating, Book Rating Estimate, User Characteristics]
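The alternation between book estimates and user characteristics can be sketched as a simple loop. This is a deliberately simplified stand-in, not the method from the talk: it collapses each user's characteristics into a single reliability weight (instead of sensitivity and bias) and uses weighted averages instead of full probability distributions. The function `refine`, the weight formula, and the toy data are all assumptions.

```python
def refine(ratings, rounds=10):
    """Alternate between book-quality estimates and per-user reliability weights.

    ratings maps user -> {book: rating}. Returns (quality, weight) dicts.
    """
    weight = {user: 1.0 for user in ratings}
    quality = {}
    books = {book for by_book in ratings.values() for book in by_book}
    for _ in range(rounds):
        # Step 1: book quality = reliability-weighted mean of its ratings.
        for book in books:
            raters = [u for u in ratings if book in ratings[u]]
            total = sum(weight[u] for u in raters)
            quality[book] = sum(weight[u] * ratings[u][book] for u in raters) / total
        # Step 2: a user who disagrees with the current estimates loses weight.
        for user, by_book in ratings.items():
            mse = sum((r - quality[b]) ** 2 for b, r in by_book.items()) / len(by_book)
            weight[user] = 1.0 / (1.0 + mse)
    return quality, weight


# Toy data: three users agree with one another, one user rates erratically.
ratings = {
    "ann": {"A": 5, "B": 4, "C": 2},
    "bob": {"A": 5, "B": 4, "C": 2},
    "cat": {"A": 5, "B": 4, "C": 2},
    "eve": {"A": 1, "B": 5, "C": 3},
}
quality, weight = refine(ratings)
```

On this toy data the erratic user ends up with a much lower weight than the others, so the estimate for book A moves from the plain average of 4.0 toward the value the consistent users gave it.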

Page 26:

Final Recommendation

Recommend the book with the highest expected rating
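Given a rating probability distribution per book, the expected rating is the probability-weighted average of the ratings 1 to 5. The sketch below assumes each distribution is stored as a five-element list; the book titles and probabilities are made up for illustration.

```python
def expected_rating(distribution):
    """Expected value of a discrete distribution over ratings 1..5."""
    return sum(rating * p for rating, p in enumerate(distribution, start=1))


def recommend(books):
    """Return the title whose rating distribution has the highest expectation."""
    return max(books, key=lambda title: expected_rating(books[title]))


# Hypothetical distributions: index 0 is P(rating = 1), index 4 is P(rating = 5).
books = {
    "Book A": [0.05, 0.05, 0.10, 0.30, 0.50],
    "Book B": [0.20, 0.20, 0.20, 0.20, 0.20],
}
best = recommend(books)
```

Here "Book A" has an expected rating of about 4.15 against 3.0 for "Book B", so it is the one recommended.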

Page 27:

Initial Results

Our system prefers a 4.9-star book by 100 people to a 5-star book by 1 user
If a user gives random ratings, the system ignores the user’s rating
More thorough evaluation on the way
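The first result can be illustrated with a much simpler technique than the probabilistic model in the talk: a "Bayesian average" that mixes the observed ratings with a few imaginary neutral ratings. The prior mean of 3 stars and the prior weight of 5 are arbitrary assumptions, and `shrunk_rating` is a hypothetical helper.

```python
def shrunk_rating(ratings, prior_mean=3.0, prior_weight=5.0):
    """Average the ratings together with prior_weight imaginary prior_mean ratings."""
    return (prior_mean * prior_weight + sum(ratings)) / (prior_weight + len(ratings))


one_fan = [5]                       # a single 5-star rating
many_readers = [5] * 90 + [4] * 10  # 100 ratings averaging 4.9 stars

score_one = shrunk_rating(one_fan)
score_many = shrunk_rating(many_readers)
```

With these assumed settings the single 5-star rating scores about 3.33, while the 100 ratings averaging 4.9 score about 4.81, so the widely rated book ranks higher.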

Page 28:

Other Projects

Web Archive
Hidden Web Integration
Page Ranking Algorithm

Page 29:

Ph.D. Students on the Projects

Alex Ntoulas
Rob Adams
Victor Liu
– In Dr Chu’s group

Page 30:

Thank You

Questions?