Retrieval and Feedback Models for Blog Feed Search
May 24, 2015
SIGIR 2008, Singapore
Jonathan Elsas, Jaime Arguello,
Jamie Callan & Jaime Carbonell
LTI/SCS/CMU
Outline
• The task
– Overview of Blogs & Blog Search
– Challenges in Blog Search
• Our approach
– Retrieval Models
– Query Expansion Models
• Conclusion
Background
What is a Blog?
What is a Feed?
<xml>
<feed>
<entry>
<author>Peter …</author>
<title>Good, Evil…</title>
<content>I’ve said…</content>
</entry>
<entry>
<author>Peter …</author>
<title>Agreeing…</title>
<content>Some peo…</content>
</entry>
…
Blog-Feed Correspondence
Blog ↔ Feed
Post ↔ Entry
HTML ↔ XML
Why are Blogs important?
Technorati is currently tracking:
• > 112.8 Million Blogs
• > 175,000 new Blogs per day
• > 1.6 Million posts per day
[http://www.technorati.com/about/]
The Task
Feed Search at TREC
Ranking Blogs/Feeds (collections of posts) in response to a user’s query, [X]
“A relevant feed should have a principle and recurring interest in X”
— TREC 2007 Blog Track
(a.k.a. Blog Distillation)
Feed Search at TREC
Example queries: [Gardening], [Apple iPod], [Violence in Sudan], [Gun Control], [Food], [Wine]
• Represent Ongoing Information Needs
• Frequently Very General
Challenges in Feed Search
1. A feed is a collection of documents
– How does relevance at the entry level correspond to relevance at the feed level?
[Figure: a feed shown as a sequence of entries over time]
2. Even a topical feed is topically diverse
– Can we favor entries close to the central topic of the feed?
[Figure: entries of a "Space Exploration" feed plotted by time and topic, labeled NASA, China’s plans for the moon, shuttle launch, My dog, Mars rover, Boeing]
3. Feeds are noisy
– Spam blogs, spam & off-topic comments
4. General & Ongoing Information Needs
[Mac]: … post regularly about new products, features, or application software of Apple Mac computers.
[Music]: … describing songs, biographies of musicians, musical styles and their influences of music on people are discussed.
[Food]: … describing experiences eating cuisines, culinary delights, recipes, nutrition plans.
[Wine]: … such as tastings, reviews, food matching or pairing, and oenophile news and events.
Our Approach
Challenges → Our Approach
Feeds: Topically Diverse, Noisy Collections → Retrieval Models
Information Needs: General & Ongoing → Feedback Models
Retrieval Models
• Challenge: ranking topically diverse collections
• Representation: feed vs. entry
• Model topical relationship between entries
Large Document (Feed) Model
[Figure: the entries of each feed are concatenated into one large document; the query [Q] is run against the resulting Feed Document Collection to produce Ranked Feeds]
Rank by Indri’s standard retrieval model [Metzler and Croft, 2004; 2005]
Large Document (Feed) Model
Advantages:
• A straightforward application of existing retrieval techniques
Potential Pitfalls:
• Large entries dominate a feed’s language model
• Ignores relationship among entries
[Figure: a feed made up of entries of differing lengths]
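The large document model can be sketched as follows. This is a simplified stand-in and not Indri's implementation: it assumes plain query likelihood with Dirichlet smoothing, and the function names, the `mu` value, and the input layout (a dict of tokenized entries per feed) are all illustrative assumptions.

```python
import math
from collections import Counter


def dirichlet_log_likelihood(query_terms, doc_tf, doc_len, coll_tf, coll_len, mu):
    """log P(Q|D) under a Dirichlet-smoothed unigram language model."""
    score = 0.0
    for term in query_terms:
        p_coll = coll_tf.get(term, 0) / coll_len
        if p_coll == 0.0:
            continue  # term never seen in the collection: ignore it
        score += math.log((doc_tf.get(term, 0) + mu * p_coll) / (doc_len + mu))
    return score


def rank_feeds_large_document(query_terms, feeds, mu=2500.0):
    """feeds: dict feed_id -> list of entries, each entry a list of tokens."""
    # Concatenate all entries of a feed into one bag of words.
    feed_tf = {
        fid: Counter(tok for entry in entries for tok in entry)
        for fid, entries in feeds.items()
    }
    coll_tf = Counter()
    for tf in feed_tf.values():
        coll_tf.update(tf)
    coll_len = sum(coll_tf.values())
    scores = {
        fid: dirichlet_log_likelihood(
            query_terms, tf, sum(tf.values()), coll_tf, coll_len, mu
        )
        for fid, tf in feed_tf.items()
    }
    return sorted(scores, key=scores.get, reverse=True)
```

Because every entry is poured into the same bag of words, a single very long entry contributes most of the term counts, which is the first pitfall listed above.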
Small Document (Entry) Model
[Figure: the query [Q] is run against the Entry Document Collection (document = entry) to produce Ranked Entries; a rank aggregation function is then applied to produce Ranked Feeds]
Rank by:
Small Document (Entry) Model
• Query Likelihood
• Entry Centrality
• Feed Prior: favors longer feeds
ReDDE Federated Search Algorithm [Si & Callan, 2003]
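One way the three components might be combined is sketched below. The ranking formula itself did not survive on the slide, so the form used here, P(F|Q) ∝ P(F) · Σ_E P(Q|E) · P(E|F), is an assumption, as are the function names and the `log(1 + n)` form of the length-based prior.

```python
import math


def log_length_prior(entry_counts):
    """Hypothetical feed prior that grows with the number of entries."""
    return {fid: math.log(1 + n) for fid, n in entry_counts.items()}


def rank_feeds_small_document(entry_scores, feed_of, centrality, feed_prior):
    """
    entry_scores: dict entry_id -> query likelihood P(Q|E)
    feed_of:      dict entry_id -> feed_id
    centrality:   dict entry_id -> entry centrality P(E|F)
    feed_prior:   dict feed_id  -> feed prior P(F)
    Returns (feed_id, score) pairs, best first.
    """
    totals = {}
    for eid, p_q_e in entry_scores.items():
        fid = feed_of[eid]
        # Sum the entry-level evidence, weighted by centrality.
        totals[fid] = totals.get(fid, 0.0) + p_q_e * centrality[eid]
    scores = {fid: feed_prior[fid] * total for fid, total in totals.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

In practice `entry_scores` would hold only the retrieved entries for the query, so the aggregation touches a ranked list rather than the whole collection.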
Entry Centrality
Uniform:
Geometric Mean:
[Figure: entries plotted by time and topic]
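The two centrality variants can be illustrated with a short sketch. The slide's equations were lost, so the geometric-mean form below is only one plausible reading: each entry is scored by the geometric mean of its terms' probabilities under the feed's unsmoothed language model, then normalized across the feed. Function names and the input layout are assumptions.

```python
import math
from collections import Counter


def uniform_centrality(entries):
    """Every entry in the feed is equally central."""
    return [1.0 / len(entries) for _ in entries]


def geometric_mean_centrality(entries):
    """entries: list of token lists belonging to one feed."""
    feed_tf = Counter(tok for entry in entries for tok in entry)
    feed_len = sum(feed_tf.values())
    raw = []
    for entry in entries:
        if not entry:
            raw.append(0.0)  # an empty entry carries no topical evidence
            continue
        # Geometric mean of P(t|F) over the entry's tokens, in log space.
        mean_log = sum(math.log(feed_tf[t] / feed_len) for t in entry) / len(entry)
        raw.append(math.exp(mean_log))
    total = sum(raw)
    if total == 0.0:
        return uniform_centrality(entries)
    return [r / total for r in raw]
```

Dividing by the entry length inside the mean keeps long entries from being penalized simply for containing more terms.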
Small Document (Entry) Model
Advantages:
• Controls for differing entry length
• Models topical relationship among entries
Disadvantages:
• Centrality computation is slow(er)
Not only improves speed, but also performance
Retrieval Model Results
• 45 Queries from the TREC 2007 Blog Distillation Task
• BLOG06 test collection, XML feeds only
• 5-Fold Cross Validation for all retrieval model smoothing parameters
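The results in this section are reported as Mean Average Precision. A minimal sketch of that metric follows; the function names and the input layout (ranked feed ids per query, and a set of relevant feed ids per query) are illustrative assumptions, not the evaluation tooling used for the talk.

```python
def average_precision(ranked, relevant):
    """ranked: list of ids, best first; relevant: set of relevant ids."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / position  # precision at this relevant hit
    return precision_sum / len(relevant)


def mean_average_precision(runs, qrels):
    """runs: dict query -> ranked ids; qrels: dict query -> set of relevant ids."""
    return sum(
        average_precision(runs.get(q, []), rel) for q, rel in qrels.items()
    ) / len(qrels)
```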
Retrieval Model Results
[Figure: bar chart of Mean Average Precision for the Large Document (Feed) Model and the Small Document (Entry) Models. Labels: Uniform, Log(Feed Length), Prior. Value labels: 0.29, 0.277, 0.29, 0.298, 0.315; one bar is annotated "MAP 0.188" and one position is marked n/a.]
Feedback Models
• Challenge: Noisy collection with general & ongoing information needs
• Use a cleaner external collection for query expansion (Wikipedia)
• With an expansion technique designed to identify multiple query facets
Query Expansion (PRF)
[Figure: the query [Q] is run against the BLOG06 Collection; related terms from the top K documents are added to form [Q + Terms]]
[Lavrenko & Croft, 2001]
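The pseudo-relevance feedback loop can be sketched roughly as below. This is a simplified estimate in the spirit of relevance models, not the exact method from the cited work: each term is weighted by its relative frequency in a top-K document, scaled by that document's normalized retrieval score. The function name, parameters, and input layout are assumptions.

```python
from collections import Counter


def prf_expansion_terms(ranked_docs, k=10, n_terms=5):
    """
    ranked_docs: list of (tokens, query_likelihood) pairs, best first.
    Returns the n_terms highest-weighted (term, weight) pairs.
    """
    top = [(tokens, score) for tokens, score in ranked_docs[:k] if tokens]
    norm = sum(score for _, score in top)
    if norm == 0.0:
        return []
    weights = Counter()
    for tokens, score in top:
        doc_tf = Counter(tokens)
        for term, count in doc_tf.items():
            # P(term | doc) weighted by the document's share of the score mass.
            weights[term] += (count / len(tokens)) * (score / norm)
    return weights.most_common(n_terms)
```

Since the terms come straight from the top-ranked documents, any spam among those documents flows directly into the expanded query.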
Query Expansion Example
Query: [Photography]
Ideal: digital photography, depth of field, photographic film, photojournalism, cinematography
PRF: photography, nude, erotic, art, girl, free, teen, fashion, women
Feedback Model Results
[Figure: bar chart of Mean Average Precision for BLOG LD and BLOG SD, comparing None and PRF]
Query Expansion (Wikipedia PRF)
[Figure: the query [Q] is run against Wikipedia; related terms from the top K documents are added to form [Q + Terms], which is then run against the BLOG06 Collection]
[Lavrenko & Croft, 2001] [Diaz & Metzler, 2006]
Query Expansion Example
Query: [Photography]
Ideal: digital photography, depth of field, photographic film, photojournalism, cinematography
PRF: photography, nude, erotic, art, girl, free, teen, fashion, women
Wikipedia PRF: photography, director, special, film, art, camera, music, cinematographer, photographic
Feedback Model Results
[Figure: bar chart of Mean Average Precision for BLOG LD and BLOG SD, comparing None, PRF, and Wiki. PRF]
Query Expansion (Wikipedia Link)
[Figure: the query [Q] is run against Wikipedia; related terms from the link structure are added to form [Q + Terms], which is then run against the BLOG06 Collection]
Wikipedia Link-Based Query Expansion
[Figure: the query Q is run against Wikipedia, producing a ranked list of articles]
Relevance Set, Top R = 100
Working Set, Top W = 1000
Extract anchor text from links in the Working Set that point to the Relevance Set.
Wikipedia Link-Based Expansion
Relevance Set, Top R = 500
Working Set, Top W = 1000
Extract anchor text from links in the Working Set that point to the Relevance Set.
Combines relevance and popularity
Relevance: An anchor phrase that links to a high ranked article gets a high score
Popularity: An anchor phrase that links many times to a mid-ranked article also gets a high score
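A small sketch of how relevance and popularity might combine in one score. The scoring rule is an assumption: each link from the Working Set into the Relevance Set adds `R - rank + 1` to its anchor phrase, so a single link to a top article and many links to a mid-ranked article can both score well. The function name, `n_phrases`, and the link data layout are hypothetical.

```python
from collections import defaultdict


def link_based_expansion(ranked_articles, links, r=100, w=1000, n_phrases=20):
    """
    ranked_articles: article ids, best first (Q run against Wikipedia).
    links: dict source_article -> list of (anchor_text, target_article).
    Returns (anchor_phrase, score) pairs, best first.
    """
    # Rank positions (1-based) of articles in the Relevance Set.
    rank = {article: i for i, article in enumerate(ranked_articles[:r], start=1)}
    scores = defaultdict(float)
    for source in ranked_articles[:w]:  # the Working Set
        for anchor, target in links.get(source, []):
            if target in rank:
                # Higher-ranked targets add more; repeated links accumulate.
                scores[anchor.lower()] += r - rank[target] + 1
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ordered[:n_phrases]
```

Anchor phrases are multi-word units such as "depth of field", so the expansion terms arrive as phrases rather than isolated words.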
Query Expansion Example
Query: [Photography]
Ideal: digital photography, depth of field, photographic film, photojournalism, cinematography
PRF: photography, nude, erotic, art, girl, free, teen, fashion, women
Wikipedia Link-Based: photography, photographer, digital photography, photographic, depth of field, feature photography, film, photographic film, photojournalism
Feedback Model Results
[Figure: bar chart of Mean Average Precision for BLOG LD and BLOG SD, comparing None, PRF, Wiki. PRF, and Wiki. Link]
Conclusion
• Feed Search Challenges:
– Feeds are topically diverse, noisy collections
– Ranked against ongoing & general information needs
• Novel Retrieval Models:
– Ranking collections, sensitive to topical relationship among entries
• Novel Feedback Models:
– Discover multiple query facets & robust to collection noise
Thank You!
Student Travel Grant funding from: ACM SIGIR, Amit Singhal, Microsoft Research
Entry Centrality GM Derivation
where
Entry Generation Likelihood:
|E|
Query Expansion Examples
Query: [Music]
Wikipedia Expansion: Music, Folk music, Electronic music, Folk, Music video, World music, Ambient, Electronic, Country music
PRF: Music, Country, Download, Free, MP3, Mp3andmore, Lyric, Listen, Song
Query Expansion Examples
Query: [Scottish Independence]
Wikipedia Expansion: scotland, scottish parliament, scottish, scottish national party, wars of scottish independence, scottish independence, william wallace, glasgow, scottish socialist party
PRF: scotland, independence, party, convention, politics, snp, national, people, scot
Query Expansion Examples
Query: [Machine Learning]
Wikipedia Expansion: machine learning, learning, artificial intelligence, turing machine, machine gun, neural network, support vector machine, supervised learning, artificial neural network
PRF: learn, machine, credit, card, karaoke, journal, sex, model, sew
Query Generality Characteristics
• Query Length:
– BLOG: 1.9 words
– TB04: 3.2 words
– TB05: 3.0 words
• ODP Depth:
– BLOG: 4.7 levels
– TB04: 5.2 levels
– TB05: 5.3 levels
Relevance Set Cohesiveness
[Figure: ranked Wikipedia articles with the Relevance Set, Top R = 100, marked]
$\text{Cohesiveness} = \frac{|L_{in}|}{|L_{in} \cup L_{out}|}$
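The cohesiveness ratio can be computed with a few lines of code. The slide does not define its symbols, so this sketch assumes that L_in is the set of links from Relevance Set articles to other Relevance Set articles, and L_out is the set of links from Relevance Set articles to articles outside it. The function name and link data layout are hypothetical.

```python
def relevance_set_cohesiveness(relevance_set, links):
    """
    relevance_set: iterable of article ids in the Relevance Set.
    links: dict source_article -> iterable of target article ids.
    Returns |L_in| / |L_in ∪ L_out|, or 0.0 when there are no links.
    """
    members = set(relevance_set)
    l_in, l_out = set(), set()
    for source in members:
        for target in links.get(source, ()):
            if target in members:
                l_in.add((source, target))   # link stays inside the set
            else:
                l_out.add((source, target))  # link leaves the set
    total = len(l_in | l_out)
    return len(l_in) / total if total else 0.0
```

A value near 1 would mean the top-ranked articles mostly link to one another; a value near 0 would mean their links mostly point elsewhere.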
Relevant Set Cohesiveness
Is it the Queries?
Feed Search Queries ≠ TB Adhoc Queries
But none of these measures predicts whether Wikipedia expansion helps…