Usage-Based vs. Citation-Based Recommenders in a Digital Library André Vellino School of Information Studies University of Ottawa blog: http://synthese.wordpress.com twitter: @vellino e-mail: [email protected]
May 06, 2015
Context
• Canada Institute for Scientific and Technical Information (aka Canada's National Science Library)
• Has a full-text digital collection (Scientific, Technical, Medical) with text-mining rights for research purposes only
  • Elsevier and Springer (mostly)
  • ~8M articles, ~2,800 journals, ~3 TB
• Plan: a hybrid, multi-dimensional recommender
  • Usage-based (collaborative filtering, CF)
  • Content-based (CBF)
  • User-context
Sparsity of Usage Data is a Problem in Digital Libraries

                     Users      Items
Amazon               ~70 M      ~93 M
Digital Libraries    ~70,000    ~7 M
Data is Sparse Too
• Sparseness of a dataset:
  S = (number of edges in the user-item graph) / (total number of possible edges)
• Mendeley data: S = 2.66 × 10⁻⁵
• Netflix: S = 1.18 × 10⁻²
• But also, the Mendeley data isn't "highly connected":
  • 83.6% of Mendeley articles were referenced by only 1 user
  • < 6% of the articles were referenced by 3 or more users
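The sparseness ratio is straightforward to compute; a minimal sketch with toy numbers (not the Mendeley or Netflix data):

```python
def sparseness(edges, n_users, n_items):
    """S = observed user-item edges / total possible edges."""
    return len(edges) / (n_users * n_items)

# Toy example: 3 users, 4 items, 5 observed interactions (edges).
edges = {(0, 0), (0, 2), (1, 1), (2, 2), (2, 3)}
print(sparseness(edges, 3, 4))  # 5/12 ≈ 0.417
```

Real digital-library matrices sit many orders of magnitude below this toy value, which is why pure collaborative filtering struggles there.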
ExLibris bX (2009) solution to data sparsity: harvest large volumes of usage (co-download) behaviour from world-wide SFX (Ex Libris OpenURL resolver) logs and apply collaborative filtering to correlate articles.

Johan Bollen and Herbert Van de Sompel. An architecture for the aggregation and analysis of scholarly usage data. (in JCDL 2006)
TechLens+ Citation-Based Recommendation

[Figure: bipartite graph linking articles (p2, p3, p5, …) to their references]

R. Torres, S. McNee, M. Abel, J. Konstan, and J. Riedl. Enhancing Digital Libraries with TechLens+. (in JCDL 2004)
Does Rating Citations with PageRank Help?

[Figure: citation graph of articles p1–p8 and users u1, u2, with citation edges weighted either by PageRank scores (0.2–0.7) or by a constant]

Answer: using PageRank to "rate" citations is not significantly better than using a constant (0/1) weight.
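For reference, the PageRank scores used as edge weights can be computed by plain power iteration. A minimal NumPy sketch on a hypothetical three-paper citation graph (not the experiment's data):

```python
import numpy as np

def pagerank(adj, d=0.85, iters=100):
    """Power iteration. adj[i][j] = 1 if paper j cites paper i."""
    A = np.array(adj, dtype=float)
    n = A.shape[0]
    col_sums = A.sum(axis=0)
    safe = np.where(col_sums == 0, 1.0, col_sums)
    # Normalize columns; papers citing nothing spread rank uniformly.
    M = np.where(col_sums > 0, A / safe, 1.0 / n)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - d) / n + d * M @ r
    return r

# Toy graph: p1 is cited by p2 and p3; p2 is cited by p3.
ranks = pagerank([[0, 1, 1],
                  [0, 0, 1],
                  [0, 0, 0]])
# p1, being most cited, gets the highest rank.
```

The experiment's finding is that feeding such scores into the citation matrix, instead of constant 0/1 weights, made no significant difference.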
Note: there is ongoing work with NRC on a machine learning method for extracting the "most important references", which might help more.
Sarkanto (NRC Article Recommender)
• Uses the TechLens+ strategy of replacing the User-Item matrix with an Article-Article matrix built from citation data
• Uses the TASTE recommender (now the recommendation component of Apache Mahout)
• Is now decoupled from the user-based recommender
• Compare side by side with 'bX' recommendations. Try it here:
  http://lab.cisti-icist.nrc-cnrc.gc.ca/Sarkanto/
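A minimal sketch of the article-article idea, assuming simple cosine similarity over a binary citation matrix (Sarkanto itself uses TASTE/Mahout; the matrix here is toy, hypothetical data):

```python
import numpy as np

# Toy citation matrix: rows = citing articles, columns = cited articles;
# C[i, j] = 1 means article i cites article j (hypothetical data).
C = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1]], dtype=float)

def cocitation_similarity(C):
    """Cosine similarity between columns: two articles are similar
    when they tend to be cited by the same citing articles."""
    norms = np.linalg.norm(C, axis=0)
    norms[norms == 0] = 1.0           # guard against never-cited articles
    U = C / norms
    return U.T @ U                    # (articles x articles) similarity

S = cocitation_similarity(C)
# Top recommendation for article 0 = most similar other article (here: article 1).
top = max((j for j in range(C.shape[1]) if j != 0), key=lambda j: S[0, j])
```

This replaces the usual user-item matrix with an article-article one, so no usage logs are needed, sidestepping the sparsity problem above.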
Sarkanto compared with bX
• bX: "Users who viewed this article also viewed these articles."
• Sarkanto: "These are articles whose co-citations are similar to this one."
Experiments
• Sarkanto generated ~1.9 million citation-based recommendations (statically)
• Experimental comparison done on 1,886 randomly selected articles from a subset of ~1.2M articles (down from ~8M)
• Questions asked in the experiment:
  • How many recommendations are produced by each recommender?
  • Coverage: how often does a seed article generate a recommendation?
  • How semantically diverse are the recommendations?
Measuring Semantic Diversity
• Question: what is the semantic distance between the source article and its recommendations?
• In this setup it was not possible to compare semantic distances without the full text of both sets of recommendations
• Full text is available for the Sarkanto recommendations but not for the bX recommendations
Journal-Journal Semantic Distance
• Concatenate the full text of all the articles in each journal
• From a Lucene index of the full text of each journal, use Dominic Widdows' Semantic Vectors package to create:
  • a term-journal matrix
  • reduced-dimensionality term vectors (512) for each journal, using random projections
• Apply multidimensional scaling (MDS) in R to obtain a 2-D distance matrix (2300 × 2300)
G. Newton, A. Callahan, and M. Dumontier. Semantic journal mapping for search visualization in a large scale article digital library in Second Workshop on Very Large Digital Libraries, ECDL 2009
http://cuvier.cisti.nrc.ca/~gnewton/torngat/applet.2009.07.22/index.html
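The projection-plus-MDS pipeline above can be sketched as follows (the actual work used Lucene, Semantic Vectors and R; this NumPy version with classical MDS and randomly generated toy counts only illustrates the steps):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy term-journal matrix: rows = terms, columns = journals (hypothetical
# counts; the talk used 512 projected dims over ~2300 journals).
n_terms, n_journals, k = 1000, 6, 64
X = rng.poisson(0.2, size=(n_terms, n_journals)).astype(float)

# 1) Random projection: compress each journal's term vector to k dims.
R = rng.normal(0, 1.0 / np.sqrt(k), size=(k, n_terms))
J = R @ X                                # one k-dim column per journal

# 2) Pairwise journal-journal distances.
diff = J[:, :, None] - J[:, None, :]
D = np.sqrt((diff ** 2).sum(axis=0))     # (n_journals x n_journals)

# 3) Classical MDS to 2-D: double-center squared distances,
#    keep the top-2 eigenvectors.
n = n_journals
H = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * H @ (D ** 2) @ H
w, V = np.linalg.eigh(B)
idx = np.argsort(w)[::-1][:2]
coords = V[:, idx] * np.sqrt(np.maximum(w[idx], 0))  # 2-D journal map
```

Random projection keeps pairwise distances approximately intact (Johnson-Lindenstrauss), so the 2-D map is a reasonable proxy for distances in the full term space.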
2-D Journal Distance Map
[Figure: 2-D map of journals; coloured clusters represent journal subject headings (from publisher metadata)]
Results: Diversity of Recommendations
• ~13% of seed articles generated recommendations from both bX and Sarkanto (i.e. not much overlap!)
• Citation-based recommendations appear to be more semantically diverse than usage-based ones
Conclusions
• Citation-based and user-based recommendations are complementary
• Different kinds of data sources (users vs. citations) produce different kinds of (non-overlapping) results
• Citation-based recommendations are more semantically diverse
• Hypothesis: "user-based recommendations may be biased by the semantic similarity of search-engine results"