Characterizing Web Content, User Interests, and Search Behavior by Reading Level and Topic

Post on 24-Mar-2016


Characterizing Web Content, User Interests, and Search Behavior by Reading Level and Topic

Jin Young Kim*, Kevyn Collins-Thompson, Paul Bennett and Susan Dumais

*Work done during internship at Microsoft Research

Search and recommendation are about the matching.

Queries, Documents, Websites, Users

Term-space matching is not always a good idea.

Granularity, Sparsity, Efficiency

Can we build representations beyond the term vectors?

Topic Category, Reading Level, Sentiment, Style

What would be their implications for search and recommendations?

Entities: Queries, Documents, Websites, Users

Representations: Topic Category, Reading Level, Sentiment, Style

In a Nutshell

WHAT WE DID: Build profiles of Reading Level and Topic (RLT)
  For queries, websites, users, and search sessions
  In order to characterize and compare entities

WHAT WE FOUND:
  Profile matching predicts a user's content preference
  Profiles can indicate when not to personalize
  Profile features can predict expert content

Building Reading Level and Topic Profiles

Predicting Reading Level and Topic for a URL
  Reading Level Classifier: based on a language model and other sources
  Topic Classifier: trained using URLs in each Open Directory Project (ODP) category

Profile
  Distribution over reading level, topic, or reading level and topic (RLT): P(R|d1), P(T|d1)

Entity Profile Built from Related URLs

Entities and Related URLs
  Websites: content vs. user-viewed URLs
  Users: URLs visited during search sessions
  Queries: top-10 retrieved URLs

Example: Site profile made from URLs visited during search sessions

[Figure: per-URL profiles P(R|d1), P(T|d1) aggregated into the site profile P(R,T|s)]
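The aggregation in this slide can be sketched roughly as follows. The slides do not say how the per-URL distributions P(R|d) and P(T|d) are combined, so this sketch assumes that reading level and topic are independent given a URL and that every related URL carries equal weight. The function names and toy numbers are hypothetical and only illustrate the idea.

```python
def joint_profile(p_r, p_t):
    """Joint P(R,T|d) for one URL as an outer product.

    Assumption: R and T are independent given the URL.
    """
    return [[r * t for t in p_t] for r in p_r]


def entity_profile(url_profiles):
    """Average the per-URL joints into an entity-level P(R,T|e).

    Assumption: every related URL is weighted equally.
    """
    joints = [joint_profile(p_r, p_t) for p_r, p_t in url_profiles]
    n = len(joints)
    rows, cols = len(joints[0]), len(joints[0][0])
    return [
        [sum(j[i][k] for j in joints) / n for k in range(cols)]
        for i in range(rows)
    ]


# Toy data: 3 reading levels, 2 topics, 2 URLs visited on one site.
urls = [
    ([0.2, 0.5, 0.3], [0.9, 0.1]),
    ([0.6, 0.3, 0.1], [0.4, 0.6]),
]
site_profile = entity_profile(urls)

# Marginals recovered from the joint profile.
p_r_site = [sum(row) for row in site_profile]        # P(R|s)
p_t_site = [sum(col) for col in zip(*site_profile)]  # P(T|s)
```

Weighting URLs by visit count instead of equally would be a natural variation when the profile is built from user-viewed URLs.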

Entity Profile Built with Related Entities

Entity and related entities
  User: websites visited
  Website: surfacing queries
  Query: issuing users

Example: Site profile made from the profiles of its visitors

[Figure: users, queries, and websites linked by Issue, Surface, and Visit relations; visitor profiles P(R,T|u) aggregated into the site profile P(R,T|s)]

Characterizing and Comparing Profiles

Characterizing an Individual Entity
  Mean: expectation
  Variance: entropy

Characterizing a Group of Entities
  Build a group centroid from its members
  Variance: divergence among members

Comparing Entities and Groups
  Difference in mean
  Divergence in profile (distribution)
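A minimal sketch of these profile statistics, under stated assumptions: reading levels are treated as the integers 1..N, entropy and divergence are measured in bits, KL divergence stands in for "divergence" (the slides use KL for user-site distance but do not name the measure used within a group), and the centroid is an unweighted average. All function names are hypothetical. A joint RLT profile can be passed in as a flattened list.

```python
import math


def expectation(p_r):
    """Expected reading level, assuming levels are numbered 1..len(p_r)."""
    return sum(level * p for level, p in enumerate(p_r, start=1))


def entropy(p):
    """Shannon entropy in bits; lower values mean a more focused profile."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) in bits; q is floored at eps to avoid division by zero."""
    return sum(x * math.log2(x / max(y, eps)) for x, y in zip(p, q) if x > 0)


def centroid(profiles):
    """Unweighted average of the member profiles."""
    n = len(profiles)
    return [sum(col) / n for col in zip(*profiles)]


def group_divergence(profiles):
    """Mean KL divergence of each member from the group centroid."""
    c = centroid(profiles)
    return sum(kl_divergence(p, c) for p in profiles) / len(profiles)
```

For example, two users with profiles `[1, 0]` and `[0, 1]` have the centroid `[0.5, 0.5]`, and each member sits one bit away from it.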

Characterizing Web Content, User Interests, and Search Behavior

Data Set

Session Log Data
  2,281,150 URL visits (1,218,433 SERP clicks)
  Collected from 8,841 users

Profiles of Entities
  4,715 websites with 25+ clicked URLs
  7,613 users with 25+ URL visits
  141,325 unique queries

Each topic has a different reading level distribution

Reading Level Distribution for Top ODP Categories

Category        R1    R2    R3    R4    R5    R6    R7    R8    R9    R10   R11   R12   E[R|T]
Reference       0.00  0.00  0.00  0.02  0.17  0.10  0.15  0.04  0.02  0.03  0.20  0.27  8.80
Health          0.00  0.00  0.00  0.03  0.18  0.08  0.13  0.04  0.04  0.10  0.27  0.11  8.53
Science         0.00  0.00  0.00  0.06  0.23  0.09  0.07  0.02  0.01  0.08  0.27  0.17  8.44
Computers       0.00  0.00  0.00  0.06  0.24  0.19  0.03  0.01  0.01  0.02  0.32  0.12  8.11
Business        0.00  0.00  0.00  0.05  0.22  0.16  0.09  0.03  0.02  0.04  0.26  0.12  8.08
Society         0.00  0.00  0.00  0.02  0.23  0.07  0.35  0.03  0.01  0.01  0.22  0.06  7.62
Adult           0.00  0.00  0.00  0.05  0.28  0.26  0.14  0.05  0.02  0.01  0.13  0.06  6.98
Kids and Teens  0.00  0.00  0.02  0.23  0.26  0.13  0.09  0.02  0.01  0.02  0.15  0.08  6.60
Games           0.00  0.00  0.00  0.19  0.36  0.10  0.11  0.02  0.02  0.03  0.12  0.03  6.39
Recreation      0.00  0.00  0.00  0.11  0.44  0.19  0.08  0.02  0.02  0.02  0.09  0.02  6.18
Arts            0.00  0.00  0.00  0.08  0.40  0.27  0.10  0.05  0.01  0.01  0.06  0.02  6.18
Home            0.00  0.00  0.02  0.19  0.41  0.14  0.04  0.03  0.01  0.03  0.09  0.04  6.08
News            0.00  0.00  0.00  0.04  0.41  0.33  0.14  0.02  0.02  0.01  0.03  0.01  5.99
Shopping        0.00  0.00  0.01  0.22  0.29  0.24  0.09  0.03  0.01  0.02  0.07  0.02  5.98
Sports          0.00  0.00  0.00  0.09  0.56  0.11  0.10  0.03  0.03  0.02  0.06  0.02  5.94
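The E[R|T] column can be read as the expected reading level under each row's distribution. As a rough check, the sketch below recomputes it for the Reference row, assuming the columns R1..R12 correspond to the numeric levels 1 to 12. The probabilities in the table are rounded to two decimals, so the result is only expected to agree with the reported value approximately.

```python
# Reference row of the table: P(R = r | T = Reference) for r = 1..12.
reference = [0.00, 0.00, 0.00, 0.02, 0.17, 0.10,
             0.15, 0.04, 0.02, 0.03, 0.20, 0.27]


def expected_level(p_r):
    """Expected reading level, assuming columns map to levels 1..12.

    Dividing by the total compensates for rows whose rounded
    probabilities do not sum exactly to 1.
    """
    total = sum(p_r)
    return sum(level * p for level, p in enumerate(p_r, start=1)) / total


reference_mean = expected_level(reference)  # close to the reported 8.80
```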

Topic and reading level characterize websites in each category

Profile matching predicts a user's preference over search results

Metric
  % of a user's preferences predicted by profile matching, for each clicked website over the skipped website above it

Results
  By degree of focus in the user profile: H(R,T|u)
  By the distance metric between user and website: KL_R(u,s) / KL_T(u,s) / KL_RLT(u,s)

User Group   #Clicks   KL_R(u,s)   KL_T(u,s)   KL_RLT(u,s)
↑ Focused      5,960   59.23%      60.79%      65.27%
(middle)     147,195   52.25%      54.20%      54.41%
↓ Diverse    197,733   52.75%      53.36%      53.63%
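One way to read the metric above: for every pair of a clicked website and the skipped website ranked above it, profile matching "predicts" the preference when the user's profile is closer to the clicked site than to the skipped one. The sketch below assumes the distance is KL(user || site) and that ties count as misses; the slides do not state the direction of the divergence or the tie rule, and the function names are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) in bits; q is floored at eps to avoid division by zero."""
    return sum(x * math.log2(x / max(y, eps)) for x, y in zip(p, q) if x > 0)


def preference_accuracy(user_profile, pairs):
    """Fraction of (clicked, skipped) site pairs where the clicked site
    is strictly closer to the user's profile than the skipped one."""
    correct = sum(
        kl_divergence(user_profile, clicked) < kl_divergence(user_profile, skipped)
        for clicked, skipped in pairs
    )
    return correct / len(pairs)


# Toy data: one user, two click-over-skip pairs (profiles over 2 topics).
user = [0.8, 0.2]
pairs = [
    ([0.7, 0.3], [0.2, 0.8]),    # clicked site matches the user better
    ([0.1, 0.9], [0.75, 0.25]),  # skipped site matches the user better
]
accuracy = preference_accuracy(user, pairs)
```

In this toy case the first pair is predicted correctly and the second is not, so the accuracy is 0.5.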

Users' Deviation from Their Own Profiles

Stretch reading: session-level reading level >> long-term reading level
Casual reading: session-level reading level << long-term reading level

URL Title Words for Stretch Reading     URL Title Words for Casual Reading
Title word             Log ratio        Title word       Log ratio
tests                  2.22             best             -0.42
test                   1.99             football         -0.45
sample                 1.94             store            -0.46
digital                1.88             great (deals)    -0.47
(tuition) options      1.87             items            -0.52
(financial) aid        1.87             new              -0.53
(medication) effects   1.84             sale             -0.61
education              1.77             games            -0.65
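The slides do not define how the log ratio is computed. A plausible reading is the log of a word's relative frequency in the titles of one kind of session divided by its relative frequency in a comparison set of titles. The sketch below follows that reading and additionally assumes natural logarithms, whitespace tokenization, and add-one smoothing; the function name and the toy titles are hypothetical, not taken from the study's data.

```python
import math
from collections import Counter


def title_log_ratios(target_titles, background_titles, smoothing=1.0):
    """Log ratio of smoothed relative word frequencies, target vs. background.

    Positive values mark words over-represented in the target titles.
    """
    target = Counter(w for t in target_titles for w in t.lower().split())
    background = Counter(w for t in background_titles for w in t.lower().split())
    vocab = set(target) | set(background)
    n_target = sum(target.values()) + smoothing * len(vocab)
    n_background = sum(background.values()) + smoothing * len(vocab)
    return {
        w: math.log(
            ((target[w] + smoothing) / n_target)
            / ((background[w] + smoothing) / n_background)
        )
        for w in vocab
    }


# Toy titles only, to show the sign convention.
stretch_titles = ["practice test sample", "test scores"]
other_titles = ["football store sale", "great sale"]
ratios = title_log_ratios(stretch_titles, other_titles)
```

Here `ratios["test"]` comes out positive and `ratios["sale"]` negative, mirroring the sign pattern in the two columns of the table.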

Comparing Expert vs. Non-expert URLs
  Expert vs. Non-expert URLs taken from [White’09]

Predicting Expert vs. Novice Websites

Results
  Baseline (predict most likely class): 65.8%
  Classifier accuracy: 82.2%

Features

Feature        Correl. with Expertness   Description
E[R|Qs]        +0.34                     Expectation of surfacing query's RL
E[R|Us]        +0.44                     Expectation of visitor's RL
DivRLT(U,s)    -0.56                     Distance of visitors' RLT profile from site's
DivT(U,s)      -0.55                     Distance of visitors' Topic profile from site's

Thank you for your attention!

WHAT WE DID: Build profiles of Reading Level and Topic (RLT)
  For queries, websites, users, and search sessions
  To characterize and compare entities

WHAT WE FOUND:
  Profile matching predicts a user's content preference
  Profiles can indicate when not to personalize
  Profile features can predict expert content

More at: @jin4ir / cs.umass.edu/~jykim

Optional Slides

Website reading level vs. visitor diversity

Correlation between Site vs. Visitor Profiles (Website Reading Level vs. Visitor Profile Diversity)

          DivR(U|s)   DivT(U|s)   DivRT(U|s)
E[R|s]    0.052       0.081       0.095

Breakdown per topic reveals a stronger relationship

[Figure: per-topic correlation, axis from -0.3 to 0.4; topics shown: Computers, Reference, News, Arts, Recreation, Science, Health, Sports, Society, Business, Adult, Games, Home, Shopping, Kids_and_Teens]

Query / User Reading Level against P(Topic)
  User profile shows different trends in Computers
