Page 1
The Mechanical Librarian
Recommending Journal Articles
in a Scientific Digital Library
Andre Vellino
[email protected]
Group Leader, CISTI Research
Canada Institute for Scientific and Technical Information
Chef de groupe, Recherche ICIST
Institut canadien de l'information scientifique et technique
Page 2
Outline of Talk
• The Mechanical Librarian
• How Recommenders Work
• Recommenders in Digital Libraries
• Problems for Science Article Recommenders and
Strategies for CISTI’s Recommender Research
• Synthese on CISTI Lab
• Alternative Approaches
• Future Work
Acknowledgements to: Glen Newton, Jeff Demaine and Greg Kresko, and
students Dave Zeber, Matthew Rutledge-Taylor and Aurel Constantinescu
Page 3
The Human (Reference)
Librarian
World Knowledge Experience
Authoritative
Trustworthy
References
Vocabularies
Databases
Page 4
The Mechanical Librarian
The Web, they say, is leaving the era of search and entering
one of discovery. What's the difference? Search is what you
do when you're looking for something. Discovery is when
something wonderful that you didn't know existed, or didn't
know how to ask for, finds you.
Jeffrey M. O'Brien, Fortune Magazine
Page 5
Knowledge Discovery
Technologies
• Text Mining
– Enhances the researcher’s ability to
discover new and meaningful information
from existing text repositories
• Network Analysis
– Distills the structural relationships among
bibliographic elements to reveal trends
and patterns in science
• User Behaviour
– Infers “wisdom of the crowds” from
usage statistics
Page 6
What is a “Recommender”?
• A recommender is a software system which attempts to predict
items that a user may be interested in, given information about
– the user's interests
– the content in the items
– the usage patterns of other users
• Items may be:
– Merchandise: movies, music, books
– Text: news, blogs, web pages, and, why not,
Scientific Journal Articles
Page 7
Amazon Recommender
System
Page 8
Explanations
User Ratings
Category Filter
Personalized
User
Control
Page 9
Companies That Offer
Recommenders to Users
Books
Web Sites
Movies
Music
Page 10
Companies That Sell
Recommender Services
Product Merchandise Placement
Database Mining
Advertising / Product Placement
Software as a Service Platform
Page 11
Recommendation is Hard
Netflix Prize: $1M
• Netflix Prize
– To develop a recommender that improves the quality of
recommendations by 10% over Netflix’s own system
– http://www.netflixprize.com/
• Current Leader Board
– BellKor (9.6%)
– … + 39 others
• NY Times Magazine Article
http://www.nytimes.com/2008/11/23/magazine/23Netflix-t.html
Page 12
Good Recommendations
are REALLY Hard
Page 14
Taxonomy of
Recommender Systems
Collaborative Filtering
• Usage based, with item-ratings
– User-Based (“similar users”)
– Item-Based (“like items”)
• Algorithms
– Memory-based
– Model-based
Content-Based Filtering
• Content (text / waveform / pixel) analysis to
– Find “similar users”
– Find “similar items”
J. Breese, D. Heckerman, C. Kadie, et al. Empirical Analysis of Predictive Algorithms for Collaborative
Filtering. Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, 461, 1998.
Page 15
How Collaborative
Filtering (CF) Works
• User-Based CF
– Given user A find all the other users {U} that have the most
“similar” item-rating patterns
– For each item I not yet rated by A, predict the likely rating A will
assign to I given the ratings for I given by {U}
– Present the Top-N ordered list of items {I} to the user
• Item-Based CF
– Given user A and the set of items {I} to which A has given
ratings, find all the other items {O} that are “similar” to {I}
– Present the Top-N ordered list of items {O} to the user
Sarwar, Badrul M., George Karypis, Joseph A. Konstan, and John Riedl. "Item-based
collaborative filtering recommendation algorithms." World Wide Web. 2001, 285-295.
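The item-based variant described above can be sketched as follows. This is a minimal illustration, not the algorithm from the cited paper: the toy ratings, the function names and the choice of cosine similarity between item rating vectors are all assumptions made for the example.

```python
from math import sqrt

# Hypothetical toy ratings: user -> {item: rating on a 1-5 scale}.
ratings = {
    "ann": {"a1": 5, "a2": 4, "a3": 1},
    "ben": {"a1": 4, "a2": 5, "a4": 2},
    "cat": {"a2": 4, "a3": 2, "a4": 5},
    "dan": {"a1": 5, "a2": 5},
}

def item_vectors(ratings):
    """Invert user -> item ratings into item -> {user: rating}."""
    items = {}
    for user, row in ratings.items():
        for item, r in row.items():
            items.setdefault(item, {})[user] = r
    return items

def cosine(x, y):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(x[k] * y[k] for k in x.keys() & y.keys())
    nx = sqrt(sum(v * v for v in x.values()))
    ny = sqrt(sum(v * v for v in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

def item_based_top_n(user, ratings, n=2):
    """Rank unrated items {O} by rating-weighted similarity to the user's rated items {I}."""
    items = item_vectors(ratings)
    rated = ratings[user]
    scores = {
        cand: sum(r * cosine(vec, items[i]) for i, r in rated.items())
        for cand, vec in items.items()
        if cand not in rated
    }
    return sorted(scores, key=scores.get, reverse=True)[:n]

print(item_based_top_n("dan", ratings))
```

Because similarities are computed between items rather than users, they can be precomputed offline, which is one common reason given for preferring the item-based form on large catalogues.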
Page 16
Find “Nearest Neighbour”
and Predict Rating
• Find Nearest Neighbours (e.g. cosine similarity)
• Predict Rating (item i for user u)
– Weighted average of the ratings given to item i by the N users most similar to u
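The two steps on this slide can be written out in a common textbook form. The notation is an assumption, since the slide's own formulas are not preserved here: r_{u,i} is the rating of user u for item i (0 if unrated), and N(u) is the set of nearest neighbours of u who rated item i.

```latex
% Step 1: cosine similarity between the rating vectors of users u and v
\mathrm{sim}(u, v) =
  \frac{\sum_{i} r_{u,i}\, r_{v,i}}
       {\sqrt{\sum_{i} r_{u,i}^{2}}\;\sqrt{\sum_{i} r_{v,i}^{2}}}

% Step 2: predicted rating as a similarity-weighted average over the neighbours
\hat{r}_{u,i} =
  \frac{\sum_{v \in N(u)} \mathrm{sim}(u, v)\, r_{v,i}}
       {\sum_{v \in N(u)} \lvert \mathrm{sim}(u, v) \rvert}
```

Many implementations subtract each user's mean rating before averaging, to compensate for users who rate systematically high or low; that refinement is omitted here.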
Page 17
User-Based
Collaborative Filtering
[Table: ratings matrix with users as rows (Ted, Alice, Carol, Bob) and movies as columns (Bolt, Dark Night, Doubt, Milk). Ratings as extracted, empty cells omitted: Ted ?, 4, 4; Alice 5, 3, 4, 5; Carol 4, 3, 4; Bob 5, 1.]
• Goal: predict the rating Ted will give to the movie “Bolt”
• Step 1 – eliminate the user-profiles of users who didn’t rate “Bolt”
• Step 2 – find Ted’s “K-nearest neighbours” who rated “Bolt” and at
least 2 other movies (Alice)
• R(Ted,Bolt) ~= 5.
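The two steps of this example can be sketched in code. The ratings below are hypothetical: they are chosen so that Alice ends up as Ted's only eligible neighbour, as on the slide, but the individual cell values, the function names and the use of cosine similarity over shared movies are assumptions.

```python
from math import sqrt

# Hypothetical ratings (1-5); every cell value is an assumption for illustration.
ratings = {
    "Ted":   {"Milk": 4, "Doubt": 4},
    "Alice": {"Milk": 5, "Doubt": 4, "Dark Night": 3, "Bolt": 5},
    "Carol": {"Milk": 4, "Doubt": 3, "Dark Night": 4},
    "Bob":   {"Dark Night": 5, "Bolt": 1},
}

def cosine_on_shared(a, b):
    """Cosine similarity restricted to the movies both users rated."""
    shared = a.keys() & b.keys()
    if not shared:
        return 0.0
    dot = sum(a[m] * b[m] for m in shared)
    norm_a = sqrt(sum(a[m] ** 2 for m in shared))
    norm_b = sqrt(sum(b[m] ** 2 for m in shared))
    return dot / (norm_a * norm_b)

def predict(target, movie, ratings, k=1, min_shared=2):
    """Predict target's rating of movie from the k nearest eligible neighbours."""
    me = ratings[target]
    # Step 1: keep only the users who rated the movie.
    raters = [r for u, r in ratings.items() if u != target and movie in r]
    # Step 2: keep raters sharing at least min_shared other movies with the target.
    eligible = [
        (cosine_on_shared(me, r), r[movie])
        for r in raters
        if len((me.keys() & r.keys()) - {movie}) >= min_shared
    ]
    nearest = sorted(eligible, reverse=True)[:k]
    weight = sum(sim for sim, _ in nearest)
    if weight == 0:
        return None  # no usable neighbour
    return sum(sim * rating for sim, rating in nearest) / weight

print(round(predict("Ted", "Bolt", ratings), 2))
```

With these numbers Carol is dropped in step 1 (she did not rate "Bolt") and Bob in step 2 (he shares no other movie with Ted), so the prediction is simply Alice's rating.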
Page 18
Things that can go wrong
with Collaborative Filtering
• False “product ratings” to artificially boost ranking (spamming)
• Losing the diversity in the “Long Tail” – converges to “Top N”.
Fleder, D. and K. Hosanagar. 2008. Blockbuster culture's next rise or fall: The effect of
recommender systems on sales diversity. NET Institute Working Paper 07-10.
Page 19
Content-Based
Recommenders
“These things are similar (in content) to that”.
• Depends only on a measure of similarity between the content in
the items (text, music, images)
• Typical Steps for Content Based Recommenders
1. Cluster the user’s purchased or highly-rated items by
content-similarity
2. Find other similar items not purchased or rated by the user
3. Recommend the “Top N” to the user
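The steps above can be sketched with a bag-of-words model. This is a simplified illustration under stated assumptions: the documents are hypothetical, the clustering in step 1 is replaced by a single merged profile of the user's liked items, and raw word counts stand in for a proper term weighting such as TF-IDF.

```python
from collections import Counter
from math import sqrt

# Hypothetical item descriptions; in a digital library these could be abstracts.
docs = {
    "d1": "collaborative filtering for movie ratings",
    "d2": "matrix factorization for movie ratings prediction",
    "d3": "protein folding simulation methods",
    "d4": "neighbourhood models for collaborative filtering",
    "d5": "simulation of protein structure",
}

def vectorize(text):
    """Bag-of-words vector: word -> count."""
    return Counter(text.lower().split())

def cosine(x, y):
    """Cosine similarity of two word-count vectors."""
    dot = sum(x[t] * y[t] for t in x.keys() & y.keys())
    nx = sqrt(sum(v * v for v in x.values()))
    ny = sqrt(sum(v * v for v in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

def content_based_top_n(liked, docs, n=2):
    """Merge the liked items into one profile, then rank the remaining items by similarity."""
    profile = Counter()
    for d in liked:
        profile.update(vectorize(docs[d]))
    scores = {
        d: cosine(profile, vectorize(text))
        for d, text in docs.items()
        if d not in liked
    }
    return sorted(scores, key=scores.get, reverse=True)[:n]

print(content_based_top_n(["d1"], docs))
```

No other user's behaviour is consulted at any point, which is what distinguishes this from collaborative filtering.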
Page 20
Search Engine as
“Content-Based
Recommender”
Collaborative filtering
Page 21
“Similar Pages” is a
Content-Based
Recommender
Page 22
What can go wrong with
Content Based
Recommenders
that use only Metadata
• Bad Men Do What Good Men Dream: A Forensic Psychiatrist Illuminates the Darker Side of Human Behavior
• Do Animals Dream?: Children's Questions about Animals Most Often asked of the Natural History Museum
• All I Do is Dream of You
• The other end of the leash : why we do what we do around dogs
• Why do Catholics do that : a guide to the teachings and practices of the Catholic Church
• Electric universe : the shocking true story of electricity
• The Island of Sheep
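A small sketch of why matching on title words alone produces lists like the one above. The seed title is a guess (the slide does not show which record the list was generated from), and the word-overlap function is a hypothetical stand-in for whatever metadata matching was actually used.

```python
import re

# Assumption: the seed record is not shown on the slide; this title is a guess.
seed = "Do Androids Dream of Electric Sheep?"

# Main titles taken from the list above.
titles = [
    "Bad Men Do What Good Men Dream",
    "Do Animals Dream?",
    "All I Do is Dream of You",
    "The other end of the leash : why we do what we do around dogs",
    "Why do Catholics do that",
    "Electric universe : the shocking true story of electricity",
    "The Island of Sheep",
]

def words(title):
    """Lower-cased set of alphabetic words in a title."""
    return set(re.findall(r"[a-z]+", title.lower()))

def shared_words(a, b):
    """Words two titles have in common, sorted for stable output."""
    return sorted(words(a) & words(b))

for title in titles:
    print(shared_words(seed, title), title)
```

Every title shares at least one word with the seed, mostly function words such as "do" and "of", yet none is related to it in subject.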
Page 24
Value of Recommenders
in a Digital Library
• For the Researcher
– Provide serendipity in a Browse / Search / Retrieve portal
• Broaden scope of search to cognate but otherwise disparate domains
• For the Library
– Increase customer loyalty by creating dynamic, adaptive,
customized services
• Alerts & notifications based on usage and collaborative filtering rather
than stored queries
• For Authors
– Given a draft article (with citations), find additional citations
• For Publishers & Journal Reviewers
– Given a submitted article, recommend peer reviewers
Page 25
Recommender Systems in
Digital Libraries
– TechLens (University of Minnesota) (2002)
• Uses ACM DL, full text Mixed Hybrid
– BibTip (University of Karlsruhe) (2003)
• Uses OPAC (Library Catalog) usage data for collaborative filtering
– IngentaConnect (2007)
• Uses Baynote (SaaS) customer tracking
– DSpace (2008)
• Content-based recommender based on user-bookmarks
– CiteULike (academic experiment 2008)
• Collaborative filtering on user bookmarks from CiteULike
– “bX” system from Ex Libris (2009)
• Uses SFX resolver logs
– NextBio (to be announced in March 2009)
• Life sciences search engine that uses collaborative filtering + ontologies to suggest new content (trials / abstracts / data)
Page 27
“bX”
Recommender (Jan ’09)
Features
• Uses log data from SFX resolvers
• Applies Collaborative Filtering
• Uses lots of aggregated data
• Developed w/ the Los Alamos National Laboratory.
Possible issues
• Infers identity of users only through IP address
• May not be accurate when http proxies are used
• Same IP address can have several “IR objectives”
• Identical resolved objects may not be recognized
Page 29
Typical Problems with CF
Recommenders in General
• Data Sparsity
– Ratio of Users / Items is low (~ 1:10)
– Number of Ratings per User is low
– Ratings matrix sparsity ~ 95%
• Cold Start Problem
– First-time users get poor or no recommendations because CF matrix has no entries
• Rating Items
– CF recommender must be trained (explicitly or implicitly) by providing ratings to items
• Principle of Induction
– People who exhibited similar behaviour in the past will tend to exhibit similar behaviour in the future.
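The sparsity figure quoted above is the fraction of empty cells in the users × items ratings matrix. A minimal sketch of the arithmetic follows; the matrix sizes are hypothetical, picked only to match the 1:10 user-to-item ratio on the slide.

```python
def sparsity(n_users, n_items, n_ratings):
    """Fraction of empty cells in a users x items ratings matrix."""
    return 1.0 - n_ratings / (n_users * n_items)

# Hypothetical sizes: 1 user per 10 items, 50 ratings per user.
print(f"{sparsity(100, 1000, 100 * 50):.0%}")
```

Holding the number of ratings per user fixed, sparsity increases as the item catalogue grows.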
Page 30
Specific Problems for
Collaborative Filtering in
Science Digital Libraries
• Data Sparsity
– Many More Articles & Far Fewer Users (10x)
– Fewer Item / Ratings (~ 99% sparsity)
• Rating Articles
– Explicit ratings are more difficult to obtain
• DL users have less need to “express themselves” by explicitly rating items than movie watchers
– Implicit ratings depend on UI features of DL
• No reliable method for inferring ratings from browsing and query behaviour
• Principle of Induction (that the past is a good predictor of the future) is not necessarily true in digital libraries
– Interest drift
– Context shifts
Recommender Research
Strategy @ CISTI
• Follow in footsteps of TechLens+
– Collaborative Filtering (CF) among users
– Seed CF recommender with citation matrix
– Extended with
• PageRank on Citations
• User Contexts
– Future Extensions
• Add Content-Based Filtering (“Fusion Mixed Hybrid” model)
• Distributed Multi-Dimensional Recommender
• Explanation-based interface
A. Vellino and D. Zeber. (2007) “A Hybrid, Multi-dimensional Recommender for Journal
Articles in a Scientific Digital Library.” Conference Proceedings on Web Intelligence and
Intelligent Agent Technology
Page 32
Making a Reference Rating
Page 33
Recommender Citation
Seeding
• Articles either cite or don’t cite other articles
• Some articles that are cited are not in collection
• Users’ “article collection profile” citations
TechLens approach to Cold Start / Data Sparsity problem
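One reading of citation seeding, sketched below: each citing article acts as a pseudo-user whose Boolean "ratings" are the articles it cites, so the matrix is populated before any real user has rated anything. The data, the key prefixes and the function names are hypothetical, and this is not presented as the TechLens or Synthese implementation.

```python
# Hypothetical citation lists: article -> articles it cites.
citations = {
    "p1": ["p2", "p3", "x9"],  # x9 is cited but not held in the collection
    "p2": ["p3"],
    "p3": [],
    "p4": ["p1", "p2"],
}

def seed_ratings(citations):
    """Each citing article becomes a pseudo-user that 'rates' (1) the in-collection articles it cites."""
    collection = set(citations)
    seeded = {}
    for article, cited in citations.items():
        row = {c: 1 for c in cited if c in collection}
        if row:  # articles with no usable citations contribute nothing
            seeded[f"article:{article}"] = row
    return seeded

def add_user_profile(seeded, user, basket):
    """A real user's profile is the set of articles in their collection, rated the same way."""
    merged = dict(seeded)
    merged[f"user:{user}"] = {a: 1 for a in basket}
    return merged

matrix = add_user_profile(seed_ratings(citations), "u1", ["p3"])
print(matrix)
```

A first-time user with a single article in their collection can then be matched against the pseudo-users, which is how the seeding addresses the cold start problem.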
Page 35
Synthese Recommender
on CISTI Lab
Page 37
Add Important Articles to
“Basket” (1)
Page 38
Add Important Articles to
“Basket” (2)
Page 39
Add Important Articles to
“Basket” (3)
Page 40
Add Important Articles to
“Basket” (4)
Page 42
Add More Articles to
“Basket” (1)
Page 43
Add More Articles to
“Basket” (2)
Page 44
Recommend Based on
Current “Basket”
Page 45
View Recommendations
Page 46
Evaluate Recommender
Page 47
Search and Basket
History
Page 48
Multiple Profiles
Page 49
Synthese Performance
[Figure: distribution of user ratings of recommendations; x-axis: rating (1 to 5), y-axis: percentage (0 to 35).]
Page 50
Recommender Citation
Seeding
Can we improve on 0 / 1 (Boolean) citation seeding?
Page 51
Apply PageRank to
Citation Matrix
Aurel Constantinescu “Ranking Full-Text Articles using Citation Based Methods”
Master’s Thesis, University of Ottawa
PageRank algorithm applied to citations
Page 52
PageRank-weighted
Citation matrix
• Apply PageRank to Citations
– Use citation data (as in TechLens+)
– Apply PageRank to weight the citation-based “ratings”
• Done before but only at the Journal level (http://www.eigenfactor.org/)
[Figure: citation matrix over articles p1 to p8 and users u1, u2, with fractional weights (0.2 to 0.7) in place of Boolean citation entries; labels: articles, citations, users, "= constant".]
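A sketch of how PageRank could replace the Boolean citation entries. The power-iteration PageRank below is the standard algorithm; the citation graph, the damping factor of 0.85 and the normalisation of each weight by the highest rank are assumptions, since the slides do not state how the weights were scaled.

```python
# Hypothetical citation graph: article -> articles it cites.
links = {
    "p1": ["p2", "p3"],
    "p2": ["p3"],
    "p3": [],
    "p4": ["p1", "p3"],
}

def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank; articles citing nothing spread their rank evenly."""
    nodes = sorted(set(links) | {c for cited in links.values() for c in cited})
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in nodes}
        for p in nodes:
            out = links.get(p, [])
            if out:
                share = damping * rank[p] / len(out)
                for c in out:
                    new[c] += share
            else:
                share = damping * rank[p] / n
                for q in nodes:
                    new[q] += share
        rank = new
    return rank

def weighted_seed(links, rank):
    """Replace each Boolean citation entry by the cited article's rank, scaled to (0, 1]."""
    top = max(rank.values())
    return {p: {c: rank[c] / top for c in cited} for p, cited in links.items() if cited}

rank = pagerank(links)
print(weighted_seed(links, rank))
```

A citation to a heavily cited article then counts for more than a citation to an obscure one, rather than both counting as 1.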
Page 53
PageRank Experimental
Results
A. Vellino “The Effect of PageRank on the Collaborative Filtering of Journal Articles”
NRC Research Report, 2008.
Page 55
What is a Holographic
Memory System?
• A Holographic Memory System (HMS) stores information in
a manner analogous to the storage of an image on a
holographic plate.
• HMS is composed of units called items
– Each item represents some content
• e.g., a concept, a word, a bibliographic item
– Items are analogous to points on the surface of
holographic film (or plate)
– Each item stores information about the associations it
has with other items
T. A. Plate, 2003 Holographic Reduced Representations: Distributed Representations for
Cognitive Structures (Stanford, CA: CSLI Publications)
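The holographic reduced representations of the cited book store associations by circular convolution and retrieve them by circular correlation. The sketch below illustrates only those two operations; the role names (colour, shape, kind), the vector size and the random seed are hypothetical, and this is not the recommender system discussed in this talk.

```python
import random
from math import sqrt

def random_vector(n, rng):
    """Random vector with elements drawn from a normal distribution of variance 1/n."""
    return [rng.gauss(0.0, 1.0 / sqrt(n)) for _ in range(n)]

def bind(a, b):
    """Circular convolution: associates two vectors in a vector of the same size."""
    n = len(a)
    return [sum(a[k] * b[(j - k) % n] for k in range(n)) for j in range(n)]

def unbind(cue, trace):
    """Circular correlation: approximate inverse of bind, given one of the two vectors."""
    n = len(cue)
    return [sum(cue[k] * trace[(j + k) % n] for k in range(n)) for j in range(n)]

def cosine(x, y):
    """Cosine similarity of two dense vectors."""
    dot = sum(p * q for p, q in zip(x, y))
    return dot / (sqrt(sum(p * p for p in x)) * sqrt(sum(q * q for q in y)))

rng = random.Random(0)
n = 256
names = ["colour", "shape", "kind", "red", "spherical", "fruit"]
vocab = {name: random_vector(n, rng) for name in names}

# "Apple" stores its associations as a sum of bound (role, item) pairs.
apple = [0.0] * n
for role, item in [("colour", "red"), ("shape", "spherical"), ("kind", "fruit")]:
    apple = [x + y for x, y in zip(apple, bind(vocab[role], vocab[item]))]

def recall(role, trace):
    """Decode the item bound to a role by comparing the noisy result with known items."""
    noisy = unbind(vocab[role], trace)
    return max(["red", "spherical", "fruit"], key=lambda name: cosine(noisy, vocab[name]))

print(recall("colour", apple))
```

All three associations live in the same 256 numbers, and each is recovered only approximately, which is the sense in which the storage resembles a hologram.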
Page 56
Holographic Memory
System (HMS)
HMS: each item stores information about many other items in the system
[Diagram labels: Apple, Red, Spherical, Fruit]
Holography: each point on the holographic plate stores information about
many parts of the image
Page 57
HMS Recommender for
Journal Articles
• We compared DSHM (Dynamically Structured Holographic Memory) and user-based CF
on journal article recommendation on 2 small collections
• 90% - 10% Cross Validation
• systematically removed one reference at a time
• tested whether recommender predicts the reference.
Collections:
– Medicine: 7,495 articles, 0.55 references per article
– Biology: 38,667 articles, 1.15 references per article
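The "remove one reference at a time" test can be sketched as a leave-one-out loop. The reference lists are hypothetical, and the co-occurrence recommender is only a stand-in so that the loop has something to evaluate; it is neither DSHM nor the user-based CF used in the experiment.

```python
from collections import Counter

# Hypothetical data: article -> list of references it cites.
reference_lists = {
    "a1": ["r1", "r2", "r3"],
    "a2": ["r1", "r2"],
    "a3": ["r2", "r3"],
    "a4": ["r1", "r3", "r4"],
}

def cooccurrence_recommender(known_refs, others, n=2):
    """Stand-in recommender: rank references that co-occur with the known ones elsewhere."""
    scores = Counter()
    for refs in others:
        overlap = len(set(refs) & set(known_refs))
        if overlap:
            for r in refs:
                if r not in known_refs:
                    scores[r] += overlap
    return [r for r, _ in scores.most_common(n)]

def leave_one_out_hit_rate(reference_lists, recommend, n=2):
    """Hide each reference in turn and check whether it reappears in the top-n list."""
    hits = trials = 0
    for article, refs in reference_lists.items():
        others = [r for a, r in reference_lists.items() if a != article]
        for held_out in refs:
            known = [r for r in refs if r != held_out]
            if not known:
                continue  # nothing left to recommend from
            trials += 1
            hits += held_out in recommend(known, others, n)
    return hits / trials

print(leave_one_out_hit_rate(reference_lists, cooccurrence_recommender))
```

Passing a different `recommend` function to the same loop is how two recommenders would be compared on identical held-out references.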
M. F. Rutledge-Taylor, A. Vellino and R. L. West. “A Holographic Associative Memory
Recommender System” 3rd Int. Conference on Digital Information Management, London, 2008.
Page 58
Experimental Results
Page 59
Holographic Recommender:
Discussion
• Advantages
– Holographic System outperformed standard user-based
CF on very sparse bibliographic datasets
– DSHM is better able to exploit the available information
– The uniformly consistent model of DSHM gives it good
potential for success on multi-dimensional datasets
• Disadvantages
– Requires a lot of computational resources
– Unclear about how it works on a large scale.
Page 61
Multi-Dimensional Ratings
Matrix
G. Adomavicius, R. Sankaranarayanan, S. Sen, A. Tuzhilin. "Incorporating Contextual Information in
Recommender Systems Using a Multidimensional Approach." ACM Transactions on Information Systems, 2005
Page 62
Scaling Strategy:
Distributed
Recommenders
• Multiple ratings matrices decomposed by subject area
• Merge separate recommendations by subject
• Reduces matrix sparsity
• Improves accuracy of recommendations
S. Berkovsky, T. Kuflik, and F. Ricci. "Distributed Collaborative Filtering with
Domain Specialization." Proceedings of Recommender Systems 2007
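A sketch of the split-and-merge idea above: one ratings matrix per subject area, a recommendation list from each, and a merged list at the end. The data and function names are hypothetical, and the per-subject recommender (mean rating by other users) is a deliberately trivial stand-in for a real CF algorithm.

```python
from collections import defaultdict

# Hypothetical (user, article, subject, rating) tuples.
ratings = [
    ("u1", "m1", "medicine", 5), ("u2", "m1", "medicine", 4), ("u2", "m2", "medicine", 5),
    ("u1", "b1", "biology", 3), ("u3", "b1", "biology", 4), ("u3", "b2", "biology", 2),
]

def split_by_subject(ratings):
    """Decompose the ratings into one user -> {item: rating} matrix per subject."""
    matrices = defaultdict(lambda: defaultdict(dict))
    for user, item, subject, r in ratings:
        matrices[subject][user][item] = r
    return matrices

def recommend_in_subject(matrix, user):
    """Stand-in recommender: mean rating by other users for items the user has not rated."""
    seen = matrix.get(user, {})
    collected = defaultdict(list)
    for other, row in matrix.items():
        if other == user:
            continue
        for item, r in row.items():
            if item not in seen:
                collected[item].append(r)
    return [(item, sum(rs) / len(rs)) for item, rs in collected.items()]

def merged_recommendations(ratings, user, n=3):
    """Collect per-subject recommendations and merge them into one ranked list."""
    merged = []
    for subject, matrix in split_by_subject(ratings).items():
        if user in matrix:  # only consult subject areas where the user has a profile
            merged.extend(recommend_in_subject(matrix, user))
    return sorted(merged, key=lambda pair: pair[1], reverse=True)[:n]

print(merged_recommendations(ratings, "u1"))
```

Each per-subject matrix contains only the users and articles active in that subject, so it is smaller and denser than the single combined matrix.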
Page 63
Importance of Quality and
Trust
What predicts overall usefulness of a System?
[Figure: chart of correlation (y-axis, 0 to 0.6) with overall usefulness for: Good Rec., Useful Rec., Trust-Generating Rec., Adequate Item Description, Ease of Use.]
Rashmi Sinha & Kirsten Swearingen – UC Berkeley
Page 64
UI for Navigating
Recommendations
• Explanation-based
Recommendations
– Provide transparency to
increase user trust
– Allow users to cluster by
type of reason
– Filter out unwanted
recommendations
P. Pu and L. Chen. Trust Building with Explanation Interfaces. In IUI ’06: Proceedings of
the 11th International Conference On Intelligent User Interfaces, pages 93–100
Page 65
Conclusions
• Recommender technology is only 12 years old, but mature
enough for widespread commercial use.
• Digital Libraries / Web 2.0 Bibliographic applications are
beginning to use recommenders.
• Digital Libraries create new problems for recommenders
(“context drift” / “data sparsity” / “multiple dimensions”)
• Recommenders are still insufficiently understood in Digital Libraries.
• Recommenders as a mechanism for enhancing the process of
scientific discovery are promising but still uncertain.
Page 66
Thank You!
Questions?
http://lab.cisti-icist.nrc-cnrc.gc.ca/synthese/