Enriching the Web with Readability Metadata Kevyn Collins-Thompson Context, Learning, and User Experience for Search Group Microsoft Research PITR 2012.

Enriching the Web with Readability Metadata

Kevyn Collins-Thompson

Context, Learning, and User Experience for Search Group
Microsoft Research

PITR 2012: NAACL HLT 2012 Workshop
Predicting and improving text readability for target reader populations

June 7, 2012 - Montréal

Acknowledgements

Joint work with my collaborators:

Paul Bennett, Ryen White, Sue Dumais (MSR)
Jin Young Kim (U. Mass.)
Sebastian de la Chica (Microsoft)
Paul Kidwell (LLNL)
Guy Lebanon (GaTech)
David Sontag (NYU)

Bringing together readability and the Web… sometimes in unexpected ways

We use the comparative and superlative form to compare and contrast different objects in English. Use the comparative form to show the difference between two objects. Example: New York is more exciting than Seattle. Use the superlative form when speaking about three or more objects to show which object is 'the most' of something. Example: New York is the most exciting city in the USA. Here is a chart showing how to construct the comparative form in English. Notice in the example sentences that we use 'than' to compare the two objects

[Diagram labels: Syntax; Vocabulary; Coherence; Visual Cues; Topic Interest; Reading level prediction; Topic prediction; Text Readability Modeling and Prediction; Search Engines]

Bringing together readability and the Web… sometimes in unexpected ways

[Diagram labels: Syntax; Vocabulary; Coherence; Visual; Topic Interest]

The Web: billions of pages, millions of sites, billions of users

Readability of content

Reading proficiency and expertise of users


How Web interactions can be enriched with reading level metadata

• Prelude: Predicting reading level of Web pages
• Web applications:
– Personalization [Collins-Thompson et al.: CIKM 2011]
– Search snippet quality
– Modeling user & site expertise [Kim et al. WSDM 2012]
– Searcher motivation
• Challenges and opportunities for readability modeling and prediction

It’s not relevant …if you can’t understand it.

A search result should be at the reading level the user wants for that query.


Search engines try to maximize relevance but have traditionally ignored text difficulty

(at least, not immediately)

[Diagram labels: Intent Models; Matching; Content Models]

Web pages occur at a wide range of reading difficulty levels

Query [insect diet]: Lower difficulty

Medium difficulty [insect diet]


Higher difficulty [insect diet]


Users also exhibit a wide range of proficiency and expertise

• Students at different grade levels
• Non-native speakers
• General population
– Large variation in language proficiency
– Special needs, language deficits
– Familiarity or expertise in specific topic areas

• Even for a single user there can be broad variation in intent across search queries


Default results for [insect diet]


Relevance as seen by an elementary school student (e.g. age 10)

[Screenshot annotations on the default results: four results marked “X Technical”, three marked “X Relevance”]


Blending in lower difficulty results would improve relevance for this user

[Screenshot annotations on the blended results: two results marked “X Technical”, two marked “X Relevance”]


Reading difficulty has many factors

• Factors include:
– Semantics, e.g. vocabulary
– Syntax, e.g. sentence structure, complexity
– Discourse-level structure
– Reader background and interest in topic
– Text legibility
– Supporting illustrations and layout

• Different from parental control, UI issues


Traditional readability measures don’t work for Web content

• Flesch-Kincaid (Microsoft Word)

• Problems include:
– They assume the content has well-formed sentences
– They are sensitive to noise
– Input must be at least 100 words long

• Web content is often short, noisy, less structured
– Page body, titles, snippets, queries, captions, …

• Billions of pages → computational constraints on approaches

• We focus on vocabulary-based prediction models that learn fine-grained models of word usage from labeled texts

$$RG_{FK} = 0.39 \left[\frac{\text{Words}}{\text{Sentence}}\right] + 11.8 \left[\frac{\text{Syllables}}{\text{Word}}\right] - 15.59$$
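The Flesch-Kincaid grade formula (0.39 × words per sentence + 11.8 × syllables per word − 15.59) can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the vowel-group syllable counter is a rough heuristic assumed here, not the counter used by Microsoft Word or any other tool. The sentence splitter shows why the measure depends on well-formed sentences.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): one syllable per group of vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text: str) -> float:
    """0.39 * words/sentence + 11.8 * syllables/word - 15.59."""
    # Sentences are split on terminal punctuation, so unpunctuated
    # Web text (titles, queries, captions) collapses into one "sentence".
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text needs at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


simple = "The cat sat. The dog ran."
harder = "Entomologists investigate nutritional requirements of herbivorous insects."
print(flesch_kincaid_grade(simple), flesch_kincaid_grade(harder))
```

Very short inputs give unstable scores (the simple example above comes out below grade 0), which is consistent with the minimum-length caveat listed for traditional measures.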



Method 1: Mixtures of language models that capture how vocabulary changes with level

[Figure: four plots of P(word | grade) against grade class (0 to 12), one each for the words “perimeter”, “red”, “determine”, and “the”]

[Collins-Thompson & Callan: HLT 2004]


Grade level likelihood usually has a well-defined maximum

[Figure: log likelihood (0 to -18000) against grade level (1 to 12) for a 1500-word Grade 8 document]
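The idea of scoring a document against one language model per grade and picking the grade with the highest likelihood can be sketched as follows. This is a simplified stand-in, not the method from the cited paper: it uses independent unigram models with add-one smoothing instead of a mixture of models smoothed across grades, and the training texts and function names are hypothetical.

```python
import math
from collections import Counter


def train_grade_models(labeled_texts):
    """Build one unigram count table per grade from (grade, text) pairs."""
    counts = {}
    for grade, text in labeled_texts:
        counts.setdefault(grade, Counter()).update(text.lower().split())
    vocab = set().union(*counts.values())
    return counts, vocab


def log_likelihood(text, grade_counts, vocab):
    """Log P(text | grade) under a unigram model with add-one smoothing."""
    total = sum(grade_counts.values())
    denom = total + len(vocab) + 1  # the extra 1 reserves mass for unseen words
    return sum(math.log((grade_counts[w] + 1) / denom) for w in text.lower().split())


def predict_grade(text, counts, vocab):
    """Return the most likely grade and the per-grade log likelihoods."""
    scores = {g: log_likelihood(text, c, vocab) for g, c in counts.items()}
    return max(scores, key=scores.get), scores


# Toy training data (hypothetical, far too small for real use).
training = [
    (1, "the red pig sat on the red shelf"),
    (1, "the toad fed the pig"),
    (8, "determine the perimeter of the angle"),
    (8, "acid levels determine the typical habitat"),
]
counts, vocab = train_grade_models(training)
grade, scores = predict_grade("determine the perimeter", counts, vocab)
print(grade, scores)
```

With real labeled corpora, plotting `scores` against grade gives the kind of likelihood curve described on this slide.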

We can use these word usage trends to compute feature weights per grade

Grade 1          Grade 4              Grade 8          Grade 12
grownup 2.485    desert 1.787         acidic 1.425     essay 2.441
ram 2.425        crew 1.765           soda 1.425       literary 2.383
planes 2.411     habitat 1.763        acid 1.408       technology 2.363
pig 2.356        butterflies 1.758    typical 1.379    analysis 2.301
jimmy 2.324      rough 1.707          angle 1.362      fuels 2.296
toad 2.237       slept 1.659          press 1.318      senior 2.292
shelf 2.192      bowling 1.643        radio 1.284      analyze 2.279
cover 2.184      ribs 1.610           flash 1.231      management 2.269
spot 2.174       grows 1.606          levels 1.229     issues 2.248
fed 2.164        entrance 1.604       pain 1.220       tested 2.226


Method 2: Vocabulary-based difficulty measure via word acquisition modeling

[Kidwell, Lebanon, Collins-Thompson: EMNLP 2009, JASA 2011]


• Documents can contain high-difficulty words but still be lower grade level
– e.g. teaching new concepts

• We introduce a statistical model of (r, s) readability
– r : familiarity threshold for any word. A word w is familiar at a grade if known by at least r percent of population at that grade.
– s : coverage requirement for documents. A document d is readable at level t if s percent of the words in d are familiar at grade t.

• Estimate word acquisition age Gaussian (μw, σw) for each word w from labeled documents via maximum likelihood

• (r, s) parameters can be learned automatically or specified to tune the model for different scenarios

The r parameter controls the familiarity threshold for words

[Figure: curves for “red” and “perimeter” rising from 0 to 1 against grade level (0 to 14)]

Level quantile for word w: q_w(r)
q_RED(0.80) = 3.5
q_PERIMETER(0.80) = 8.2

The s parameter controls required document coverage

[Figure: CDF against grade level (0 to 14) for “red” and “perimeter”]

Suppose: p(“red” | d) = p(“perimeter” | d) = 0.5
Predicted grade with s = 0.70: 8.8
Predicted grade with s = 0.50: 3.5

Multiple-word example

[Figure: CDF against grade level (0 to 14) for the words “the”, “red”, “ants”, “explored”, and “perimeter”]

“The red ants explored the perimeter.”

Predicted grade with s = 0.70: 5.3
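The (r, s) definitions can be made concrete with a small sketch. It assumes each word has a Gaussian acquisition age and treats r and s as fractions in (0, 1]. The per-word (mean, standard deviation) values below are hypothetical placeholders, so the outputs are not expected to match the grades quoted on the slides, and the function names are illustrative only.

```python
from statistics import NormalDist

# Hypothetical acquisition-age parameters: word -> (mean grade, std dev).
ACQUISITION = {
    "the": (0.5, 0.5),
    "red": (2.0, 1.0),
    "ants": (3.0, 1.0),
    "explored": (5.0, 1.5),
    "perimeter": (7.0, 1.5),
}


def word_quantile(word, r):
    """q_w(r): grade by which a fraction r of the population knows the word."""
    mu, sigma = ACQUISITION[word]
    return NormalDist(mu, sigma).inv_cdf(r)


def predict_grade(tokens, r, s):
    """Lowest grade at which at least a fraction s of the tokens are familiar."""
    if not tokens:
        raise ValueError("document has no tokens")
    levels = sorted(word_quantile(w, r) for w in tokens)
    for i, level in enumerate(levels, start=1):
        if i / len(levels) >= s:
            return level
    raise ValueError("s must be in (0, 1]")


tokens = "the red ants explored the perimeter".split()
print(predict_grade(tokens, r=0.80, s=0.50))
print(predict_grade(tokens, r=0.80, s=0.70))
```

Raising s forces harder words to count toward the prediction, and raising r pushes every word's familiarity grade upward, which is the tuning behavior the two parameters are meant to provide.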


New metadata based on reading level

• Documents:
– Posterior distribution over levels
– Distribution statistics:
   • Expected reading difficulty
   • Entropy of level prediction
– Temporal / positional series
– Vocabulary models
   • Key technical terms
   • Regions needing augmentation (Text, images, links to sources)

• Web sites:
– Topic, reading level expectation and entropy across pages

• User profiles:
– Aggregated statistics of documents and sites based on short- or long-term search/browse behavior
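Two of the distribution statistics listed above, expected reading difficulty and entropy of the level prediction, follow directly from a posterior over grade levels. The sketch below uses hypothetical posteriors and function names, and it measures entropy in bits, which is an assumption about the unit.

```python
import math


def expected_level(posterior):
    """Expected reading level: sum over grades of grade * P(grade)."""
    return sum(grade * p for grade, p in posterior.items())


def entropy_bits(posterior):
    """Shannon entropy in bits; zero-probability grades contribute nothing."""
    return -sum(p * math.log2(p) for p in posterior.values() if p > 0)


# Hypothetical posteriors over grade levels for two pages.
confident = {4: 0.1, 5: 0.8, 6: 0.1}
uncertain = {g: 1 / 12 for g in range(1, 13)}

for name, post in [("confident", confident), ("uncertain", uncertain)]:
    print(name, expected_level(post), entropy_bits(post))
```

A low-entropy posterior signals a confident level prediction. The same two statistics can be aggregated over the pages of a site or the pages a user has visited.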

Local readability within a document

[Figure: plot over grade levels 1 to 12 (vertical axis 0 to 0.3), labeled “Health article: Bronchitis, efficacy …”]

[Figure: movie dialogue in “The Matrix: Reloaded”, annotated with “Architect’s speech”, “Keanu Reeves enters”, and “Merovingian Scene (French)”]

[Kidwell, Lebanon, Collins-Thompson. J. Am. Stats. 2011]



Application: Personalizing Search Results by Reading Level


Personalization by modeling users and content

[Diagram labels: Desired reading level (scale 0 to 1); Content reading level; Re-ranker; Session; User and Intent; User profile: Long-term, Short-term (this talk)]

How could a Web search engine personalize results by reading level?

1. Model a user’s likely search intent:
– Get explicit preferences or instructions from a user
– Learn a user’s interests and expertise over time

2. Extract reading-level and topical features:
– Queries and Sessions: (Query text, results clicked, … )
– User Profile (Explicit or Implicit from history)
– Page reading level, Result snippet level

3. Use these features for personalized re-ranking



A simple session model combines the reading levels of previous satisfied clicks

[Figure: session queries [insect diet], [grasshoppers], [insect habits], combined into a session reading level distribution]
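One simple way to combine the reading levels of previous satisfied clicks is to average their per-page level distributions. The sketch below assumes that averaging rule and uses hypothetical distributions for the three session queries; the function name is illustrative, and this is not necessarily the exact combination used in the experiments.

```python
def session_level_distribution(clicked_page_levels):
    """Average the reading level distributions of previously satisfied clicks."""
    if not clicked_page_levels:
        raise ValueError("need at least one satisfied click")
    grades = sorted({g for dist in clicked_page_levels for g in dist})
    n = len(clicked_page_levels)
    return {g: sum(dist.get(g, 0.0) for dist in clicked_page_levels) / n for g in grades}


# Hypothetical P(grade | page) for the satisfied click on each query.
clicks = [
    {3: 0.6, 4: 0.4},  # [insect diet]
    {4: 0.5, 5: 0.5},  # [grasshoppers]
    {3: 0.2, 4: 0.8},  # [insect habits]
]
session = session_level_distribution(clicks)
print(session)
```

The resulting distribution can then be compared with each candidate result's page or snippet level when re-ranking.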


Typical features used for reading level personalization

• Content
– Page reading level (query-agnostic)
– Result snippet reading level (query-dependent)

• User: Session
– Reading level averaged across previous satisfied clicks
– Count of previous queries in session

• User: Query
– Length in words, characters
– Reading level prediction for raw text

• Interaction features
– Snippet-Page, Query-Page, Query-Snippet

• Confidence features for many of the above

What types of queries are helped most by reading level personalization?

• Gain for all queries, and most query subsets (205,623 sessions)
– Size of gain varied with query subset
– Science queries benefited most in our experiment

• Beating the default production baseline is very hard: Gain ≥ 1.0 is notable
• Net +1.6% of all queries improved at least one rank position in satisfied click
– Large rank changes (> 5 positions) more than 70% likely to result in a win


Point-Gain in Mean Reciprocal Rank of Last-SAT click
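The evaluation metric can be illustrated with a short sketch. It assumes that "point gain" means the difference in mean reciprocal rank (MRR) between the re-ranked and baseline lists, scaled by 100; the slides do not spell this out. The rank data and function name are hypothetical.

```python
def mean_reciprocal_rank(ranks):
    """Mean of 1/rank, where each rank is the 1-based position of the
    last satisfied (SAT) click for one query."""
    if not ranks:
        raise ValueError("need at least one query")
    return sum(1.0 / r for r in ranks) / len(ranks)


# Hypothetical positions of the last-SAT click for four queries.
baseline_ranks = [1, 3, 2, 5]
reranked_ranks = [1, 2, 2, 4]

# Assumed definition: MRR difference expressed in points (x100).
point_gain = 100 * (mean_reciprocal_rank(reranked_ranks)
                    - mean_reciprocal_rank(baseline_ranks))
print(round(point_gain, 2))
```

Because reciprocal rank changes most near the top of the list, moving a satisfied click from position 3 to 2 contributes more than moving it from 5 to 4.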

What features were most important for reading level personalization?


[Figure: feature importance bar chart. Features shown: Session user model confidence; Session prev query count; Page level; Snippet level; Snippet-page diff confidence; Query length (words); Query vs snippet; Dale snippet difficulty; Snippet vs page; Session level vs page; Query length (chars.); Relative snippet difficulty; Reciprocal rank. Horizontal axis (0 to 1): average reduction in residual squared error over all trees and over all splits, relative to the most informative feature.]


Application: Improving snippet quality


Users can be misled by a mismatch between snippet readability and page readability

Page Difficulty: High

Snippet Difficulty: Medium

Click!

Retreat!!

Users abandon pages faster when actual page is more difficult than the search result snippet suggested

Page harder than its result snippet

Page easier than its result snippet

Future goal: Expected snippet difficulty should match the underlying document difficulty


[Collins-Thompson et al., CIKM 2011]


Application: Modeling expertise on the Web using reading level + topic metadata


Topic drift can occur when the specified reading level changes

Example: [quantum theory]

Top 4 results


[quantum theory] + lower difficulty

Top 4 results


[quantum theory] + lower difficulty + science topic constraint

Top 4 results


[cinderella] + higher difficulty

Top 4 results


[bambi]

Top 3 results


[bambi] + higher difficulty

Top 4 results


P(RL|T) for Top ODP Topic Categories

Top Category     R1    R2    R3    R4    R5    R6    R7    R8    R9    R10   R11   R12   E(RL)
Home             0.00  0.00  0.02  0.30  0.45  0.08  0.03  0.01  0.01  0.01  0.07  0.02  5.49
Shopping         0.00  0.00  0.01  0.16  0.32  0.23  0.10  0.04  0.02  0.03  0.07  0.02  6.14
Recreation       0.00  0.00  0.01  0.11  0.43  0.19  0.09  0.03  0.01  0.02  0.08  0.02  6.15
Sports           0.00  0.00  0.00  0.09  0.48  0.12  0.12  0.04  0.02  0.02  0.08  0.02  6.19
News             0.00  0.00  0.00  0.06  0.42  0.18  0.17  0.03  0.01  0.01  0.08  0.03  6.36
Arts             0.00  0.00  0.01  0.10  0.37  0.15  0.14  0.06  0.01  0.02  0.09  0.04  6.48
Kids_and_Teens   0.00  0.00  0.02  0.19  0.32  0.13  0.09  0.03  0.01  0.03  0.11  0.07  6.54
Adult            0.00  0.00  0.00  0.07  0.28  0.26  0.15  0.06  0.01  0.01  0.09  0.06  6.73
Games            0.00  0.00  0.01  0.13  0.29  0.13  0.11  0.04  0.02  0.03  0.19  0.05  7.09
Society          0.00  0.00  0.00  0.07  0.31  0.14  0.11  0.06  0.02  0.03  0.16  0.08  7.27
Business         0.00  0.00  0.01  0.07  0.23  0.18  0.09  0.03  0.02  0.04  0.22  0.11  7.74
Science          0.00  0.00  0.00  0.06  0.23  0.09  0.07  0.02  0.01  0.07  0.27  0.17  8.46
Reference        0.00  0.00  0.00  0.03  0.17  0.10  0.16  0.04  0.02  0.03  0.23  0.21  8.61
Health           0.00  0.00  0.00  0.03  0.16  0.07  0.13  0.04  0.03  0.11  0.30  0.13  8.79
Computers        0.00  0.00  0.00  0.04  0.10  0.07  0.05  0.02  0.01  0.04  0.43  0.23  9.62


[Figure: P(RL|S) against P(Science|S)]

More scientific → Higher reading level


[Figure: P(RL|S) against P(Kids_and_Teens|S)]

More Kids-like → Lower reading level


Results suggest that there are both expert (high RL) and novice (low RL) users for computer topics

User Reading Level against P(Topic)


Using reading level and topic together to model user and site expertise

Four features that aggregate metadata over pages:

Reading level:
1. Expected reading level E(R) over site/user pages
2. Entropy H(R) of reading level over site/user pages

Topic:
3. Top-K ODP category predictions over site/user pages
4. Entropy H(T) of ODP category distribution for site/user pages


Sites with low topic entropy (focused) tend to be expert-oriented

Website                   H(T|S)  T1          P1    T2                P2    T3                 P3
www.prosportsdaily.com    0.83    Sports      0.74  Sports/Football   0.26
www.organize.com          0.91    Shopping    0.67  Shop/Home&Garden  0.33
www.trulia.com            0.92    Business    0.78  Society           0.18  Bus./Construction  0.04
www.fandango.com          0.95    Arts        0.63  Arts/Movies       0.36
www.hobbytron.com         0.96    Recreation  0.62  Shopping          0.38

Sites with focused topical content: Low Entropy, H(T|S) < 1


Sites with high topic entropy (breadth) tend to be for general audiences

Website              H(T|S)  T1          P1    T2            P2    T3               P3
ezinearticles.com    4.27    Business    0.12  Health        0.09  Home             0.08
www.dummies.com      4.28    Computers   0.17  Computers/HW  0.09  Business         0.08
en.allexperts.com    4.38    Recreation  0.12  Home          0.09  Recreation/Pets  0.07
phoenix.about.com    4.38    Recreation  0.12  Society       0.09  Arts             0.07
www.wisegeek.com     4.40    Health      0.12  Business      0.10  Science          0.09

Sites with very broad topical content: High Entropy, H(T|S) > 4
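As a check on how H(T|S) appears to be computed, the worked equation below applies Shannon entropy with a base-2 logarithm (the base is an assumption; the slides do not state it) to the two category probabilities listed for www.prosportsdaily.com. The result agrees with the tabulated 0.83 after rounding.

$$
H(T \mid S) = -\sum_{t} P(t \mid S)\,\log_2 P(t \mid S) = -\left(0.74 \log_2 0.74 + 0.26 \log_2 0.26\right) \approx 0.32 + 0.51 = 0.83
$$

The same check cannot be run for the high-entropy sites, because only their top three categories are listed.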


Reading level entropy measures breadth of a site’s content difficulty

Sites with focused reading level: Low Entropy, H(RL|S) < 1

Website                        H(RL|S)  R1  R2  R3   R4   R5   R6  R7  R8  R9  R10  R11  R12  Count  E(RL|S)
www.pumpkinpatchesandmore.org  0.99     0   0   0.7  0.2  0    0   0   0   0   0    0    0    35     3.3
busycooks.about.com            0.9      0   0   0    0.8  0.1  0   0   0   0   0    0    0    45     4.12
www.pickyourown.org            0.93     0   0   0    0.8  0.2  0   0   0   0   0    0    0    38     4.14
www.ssa.gov                    0.91     0   0   0    0    0    0   0   0   0   0    0.1  0.8  59     11.52
h10025.www1.hp.com             0.78     0   0   0    0    0    0   0   0   0   0    0.2  0.8  55     11.77
www.socialsecurity.gov         0.53     0   0   0    0    0    0   0   0   0   0    0.1  0.9  29     11.87

Sites with broad range of reading level: High Entropy, H(RL|S) > 2

Website                    H(RL|S)  R1  R2  R3   R4   R5   R6   R7   R8  R9  R10  R11  R12  Count  E(RL|S)
www.dltk-kids.com          2.02     0   0   0.2  0.5  0.2  0.1  0    0   0   0    0    0    39     4.4
www.dltk-teach.com         2.1      0   0   0.2  0.4  0.2  0.2  0    0   0   0    0    0    26     4.47
www.dltk-holidays.com      2.07     0   0   0.2  0.5  0.1  0    0.1  0   0   0    0    0    31     4.65
psychology.about.com       2.32     0   0   0    0    0    0    0.1  0   0   0.2  0.3  0.4  59     10.46
compnetworking.about.com   2.07     0   0   0    0    0    0    0.1  0   0   0.1  0.4  0.4  68     10.58
pcsupport.about.com        2.02     0   0   0    0    0    0    0    0   0   0.1  0.4  0.3  39     10.68


Reading level and topic entropy features can help separate expert from non-expert websites

[Figure: scatter plot of topic entropy (1.5 to 4) against reading level in grades (7 to 12). Legend labels: Expert, Nonexpert, Finance, CS, Legal, Medical]

[Kim, Collins-Thompson, Bennett, Dumais. WSDM 2012]


Which features were most correlated with site expertise?

Baseline (predict most likely class): 65.8%
Classifier accuracy: 82.2%

Feature       Correl. with Expertness   Description
DivRLT(U,s)   -0.56                     Distance of visitors’ RLT profile from site's
DivT(U,s)     -0.55                     Distance of visitors’ Topic profile from site's
DivRT(U)      -0.45                     Average distance among visitors’ RLT profile
E[R|s]        +0.23                     Expectation of Site's RL
E[R|Qs]       +0.34                     Expectation of Surfacing Query's RL
E[R|Us]       +0.44                     Expectation of Visitor's RL


Application: Searcher motivation


Readability metadata may also help predict when searchers are highly motivated

• Sites that are popular but also have large difference from average reading level

Website Type of site

socialsecurity.gov Government retirement/disability

collegeboard.com Entrance exam preparation, college application help

softwarepatch.com Find software patches

fileinfo.com Find programs to open file types

msdn.microsoft.com Technical reference


‘Stretch’ tasks: what are people searching for when they deviate from their typical reading level profile?

Capturing stretch behaviors:
– Estimate a user’s typical reading level profile over time, from historical search data
– Collect search sessions where E[R|Session] – E[R|User] > 4 grade levels
– Build language models from titles of clicked pages
– Compare word probability in clicked vs. all titles
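The steps for capturing stretch behaviors can be sketched as a small pipeline. Only the 4-grade threshold comes from the slide. The add-alpha smoothing, the natural-log ratio, the function names, and the example sessions are all assumptions made for illustration.

```python
import math
from collections import Counter

STRETCH_THRESHOLD = 4.0  # grade levels, from the session criterion above


def is_stretch(session_level, user_level, threshold=STRETCH_THRESHOLD):
    """True when E[R|Session] exceeds E[R|User] by more than the threshold."""
    return session_level - user_level > threshold


def unigram_model(titles, vocab, alpha=1.0):
    """Unigram probabilities over vocab with add-alpha smoothing (assumed)."""
    counts = Counter(w for t in titles for w in t.lower().split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def stretch_log_ratios(sessions):
    """sessions: list of (E[R|Session], E[R|User], [clicked page titles])."""
    all_titles = [t for _, _, titles in sessions for t in titles]
    stretch_titles = [t for s, u, titles in sessions if is_stretch(s, u) for t in titles]
    vocab = {w for t in all_titles for w in t.lower().split()}
    p_stretch = unigram_model(stretch_titles, vocab)
    p_all = unigram_model(all_titles, vocab)
    return {w: math.log(p_stretch[w] / p_all[w]) for w in vocab}


# Hypothetical sessions: (session level, user's typical level, clicked titles).
sessions = [
    (11.0, 6.0, ["sample medical tests", "financial aid forms"]),
    (6.5, 6.0, ["football news", "sports store sale"]),
    (5.0, 6.0, ["music news"]),
]
ratios = stretch_log_ratios(sessions)
for word in sorted(ratios, key=ratios.get, reverse=True)[:3]:
    print(word, round(ratios[word], 2))
```

Words with a positive log ratio are over-represented in titles clicked during stretch sessions, and words with a negative ratio are under-represented.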



Highest association with stretch reading     Lowest association with stretch reading

Title word   Log ratio                       Title word   Log ratio
tests        2.22                            best         -0.42
test         1.99                            football     -0.45
sample       1.94                            store        -0.46
digital      1.88                            great        -0.47
options      1.87                            items        -0.52
aid          1.87                            new          -0.53
effects      1.84                            sale         -0.61
education    1.77                            games        -0.65
forms        1.76                            sports       -0.78
plan         1.74                            food         -0.81
pay          1.71                            news         -0.82
medical      1.69                            music        -1.02
learning     1.62                            all          -1.35

Annotations (highest association): Medical tests; College entrance; Gov’t forms; Job search; Financial aid
Annotations (lowest association): Shopping!; Exploration; Leisure

Future work:
1. Identify & predict stretch tasks
2. Decide how and when to provide support
3. Determine helpful background or alternatives

[Kim et al, WSDM 2012] Based on 2-month user profiles from Bing search log data

Three key innovation directions for readability modeling and prediction


[Diagram labels: Syntax; Vocabulary; Coherence; Visual; Topic Interest; The Web]

• Data-driven
• User-centric
• Knowledge-based

Some key challenges and opportunities for readability research

[Chart axes: Basic Advancement of Knowledge; Relevance for applications]

• Deep content understanding
- Identifying gaps and assumptions
- Concepts and their dependencies
• Deep user understanding
- Your expertise & changes over time
- Learning plans tailored for you
- Cognitive models of learning

• Web-scale speed and reliability
• Exploiting new content forms
- Blogs, wiki structure & edits
• Adapting to different tasks and populations
• Human computation/crowdsource
• Predicting quality/authority

• Data-driven, personalized readability measures
• Adapting content to users
- Enrich, augment, rewrite
• Adapting users to content
• Influencing search presentation and interaction

• Analyzing movie scripts with Keanu Reeves dialogue

Thanks! Questions?

For more information:

E-mail: [email protected]

Web site: http://research.microsoft.com/~kevynct

