Page 1
Academic Recommendation using Citation Analysis with theadvisor
Erik Saule
joint work with Onur Kucuktunc, Kamer Kaya, Umit V. Catalyurek
[email protected]
Department of Biomedical Informatics, The Ohio State University
CSTA 2013
Erik Saule, Ohio State University, Biomedical Informatics
HPC Lab: http://bmi.osu.edu/hpc
theadvisor: http://theadvisor.osu.edu/
:: 1 / 37
Page 2
Table of Contents
1 Introduction
    Why?
    Overview
2 Citation Analysis for Document Recommendation
    Previous Approaches
    Direction Aware Recommendation
3 A High Performance Computing Problem
    A specialization of SpMV
    Ordering and Partitioning
4 Result Diversification
5 Other Features
6 Final Thoughts
    Conclusion
    Future Works
Page 3
Once upon a time : a survey paper
The Jimmy John’s scheduling problem
{scheduling, partitioning, mapping}
× {pipeline, workflow, data flow, task graph}
× {linear, chain, sequences, (tree), (serial parallel)}
But also...
“Scheduling problems in parallel query optimization”
“Bringing skeletons out of the closet: A pragmatic manifesto for skeletal parallel programming”
After 6 months, previously unknown papers were still being uncovered.
Develop software to make the search easier!
Page 7
Design Goals
Personalized
The user should be able to make a query that describes precisely what she is looking for.
Conceptual
The system should be free of linguistic problems. Ambiguity and synonymy should be taken into account.
Exploratory
Different perspectives should be available. The system should enhance the user’s search.
Easy to use
The user should not need to know anything about data mining or algorithms.
Page 8
The Academic Web Service Ecosystem
DBLP
List of CS papers with clean references and disambiguated names.
Citeseer, {Ref,Ack,Collab}Seer
Automatically crawled papers in CS. Gives PDFs. Contains citation information and full text. Computes similarity.
CiteUlike
Social paper tagging application. Find papers from researchers with similar interests.
ArnetMiner
Academic network analysis.
Mendeley
Application for managing references. Database of references.
Google Scholar
Keyword-based search engine (with citation information).
Microsoft Academic Search
Keyword-based search engine and academic network analysis.
IEEE, ACM, Elsevier, JSTOR, ...
Publishers or digital libraries with complete text and references. Some suggestions.
Page 16
A Use Case
Page 17
System Overview
Architecture
A web server as a front end. A cluster in the back end. New instances are dynamically created as the load varies.
Functional
[Diagram. Inputs: .bib, .ris, .xml, paper IDs, and the parameters {k, d, κ}. Components: Paper Mapper, Recommendation Engine, Diversification Engine, Venue Rec., Reviewer Rec., Relevance Feedback, Visualization. Outputs: papers, venues, reviewers. Several arrows are labeled π.]
Page 19
Using the Citation Graph
Hypothesis
If two papers are related or treat the same subject, then they will be close to each other in the citation graph (and conversely).
Benefits
No linguistics => no synonymy, no ambiguity
Automatically crowdsourced by researchers
Drawbacks
Difficult to gather the data (but thanks to Citeseer)
Relies on researchers having already made similar connections
Page 20
Local Approaches
[Figure: a paper v on a time axis t (2001 to 2006), with the reference edges of v, the citation edges of v, and two other papers u and x.]
Bibliographic coupling [Kessler63]: papers having similar references are related
Cocitation [Small73]: papers which are cited by the same papers are related
CCIDF [Lawrence99]: cocitations weighted with inverse frequencies
Problem: Considers only level-2 papers based on level-1 information.
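The two basic local measures can be sketched in a few lines. This is a minimal illustration on a made-up toy graph, assuming the citation graph is stored as a dictionary `refs` mapping each paper to the set of papers it cites; the names `refs`, `cited_by`, `bibliographic_coupling`, and `cocitation` are hypothetical and not taken from theadvisor.

```python
# Toy citation graph: refs[p] is the set of papers that p cites (illustrative data).
refs = {
    "a": {"x", "y"},
    "b": {"x", "y", "z"},
    "c": {"z"},
    "x": set(), "y": set(), "z": set(),
}

def cited_by(refs):
    """Invert the reference lists: cited_by[p] is the set of papers citing p."""
    inv = {p: set() for p in refs}
    for p, targets in refs.items():
        for t in targets:
            inv[t].add(p)
    return inv

def bibliographic_coupling(refs, p, q):
    """Number of references shared by p and q."""
    return len(refs[p] & refs[q])

def cocitation(refs, p, q):
    """Number of papers that cite both p and q."""
    inv = cited_by(refs)
    return len(inv[p] & inv[q])

coupling_ab = bibliographic_coupling(refs, "a", "b")  # a and b share x and y
cocit_xy = cocitation(refs, "x", "y")                 # x and y are both cited by a and b
cocit_xz = cocitation(refs, "x", "z")                 # only b cites both x and z
```

Both functions only look at direct neighbors, which is the limitation the slide points out: a paper two hops away can be scored, but nothing farther.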
Page 22
Global Approaches
Graph distance-based
Katz: number of paths between two papers [Strohman07]
Random walk with restarts (RWR) based
ArticleRank [Li09] (PageRank [Brin98] extension)
PaperRank [Gori06] (Personalized PageRank [Haveliwala02] extension)
RWR treats the citations and references in the same way
This is not exploratory!
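As an illustration of the distance-based family, here is a small sketch of a truncated Katz-style score. It assumes an undirected adjacency list and counts walks (which may revisit vertices) rather than simple paths; the attenuation `beta`, the cutoff `max_len`, and the function name are illustrative choices, not values from the talk.

```python
def katz_score(adj, source, target, beta=0.1, max_len=4):
    """Truncated Katz score: sum over l = 1..max_len of beta**l * (#walks of length l)."""
    # walks[v] = number of walks of the current length from source to v
    walks = {v: 0 for v in adj}
    walks[source] = 1
    score = 0.0
    for length in range(1, max_len + 1):
        nxt = {v: 0 for v in adj}
        for v, count in walks.items():
            if count:
                for w in adj[v]:
                    nxt[w] += count
        walks = nxt
        score += beta ** length * walks[target]
    return score

# Undirected toy triangle a - b - c - a (illustrative data).
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
score_ac = katz_score(adj, "a", "c", beta=0.5, max_len=2)
```

With `max_len=2` there is one walk of length 1 (a to c) and one of length 2 (a to b to c), so the score is 0.5 + 0.25.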
Page 27
PageRank
Let $G = (V, E)$ be the citation graph
PageRank [Brin98]
$\pi_i(u) = d\,\frac{1}{|V|} + (1-d) \sum_{v \in N(u)} \frac{\pi_{i-1}(v)}{\delta(v)}$
where $d \in (0, 1)$ is the damping factor. It converges to a stable distribution. But it is not personalized.
Personalized PageRank [Jeh03]
$\pi_i(u) = d\,p^*(u) + (1-d) \sum_{v \in N(u)} \frac{\pi_{i-1}(v)}{\delta(v)}$
with $\sum_{u} p^*(u) = 1$. But it is not exploratory.
source: Wikipedia
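The two iterations on this slide differ only in the restart vector. The sketch below assumes an undirected graph given as a neighbor list (so N(u) is the neighborhood of u and δ(v) is the degree of v) and a fixed number of iterations instead of a convergence test; the function name and the default values are illustrative.

```python
def personalized_pagerank(neighbors, restart, d=0.15, iterations=50):
    """Iterate pi_i(u) = d * p*(u) + (1 - d) * sum_{v in N(u)} pi_{i-1}(v) / delta(v)."""
    pi = dict(restart)
    for _ in range(iterations):
        nxt = {u: d * restart[u] for u in neighbors}
        for v, nbrs in neighbors.items():
            if nbrs:
                # v gives an equal share of its score to each of its neighbors
                share = (1 - d) * pi[v] / len(nbrs)
                for u in nbrs:
                    nxt[u] += share
        pi = nxt
    return pi

# Toy path graph a - b - c - d (illustrative data).
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}

# Plain PageRank: uniform restart vector 1 / |V|.
uniform = {v: 1.0 / len(neighbors) for v in neighbors}
plain = personalized_pagerank(neighbors, uniform)

# Personalized PageRank: all restart mass on paper "a".
focused = personalized_pagerank(neighbors, {"a": 1.0, "b": 0.0, "c": 0.0, "d": 0.0})
```

With the uniform restart vector the two ends of the path score the same; moving the restart mass to `"a"` pulls the scores toward that end of the graph.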
Page 28
Direction Awareness
Time exploration
What if we are interested in searching papers by year? Recent papers? Traditional papers?
Let M be a set of known relevant papers.
Direction Aware Random Walk with Restart
$\pi_i(u) = d\,p^*(u) + (1-d)\left( \kappa \sum_{v \in N^+(u)} \frac{\pi_{i-1}(v)}{\delta^-(v)} + (1-\kappa) \sum_{v \in N^-(u)} \frac{\pi_{i-1}(v)}{\delta^+(v)} \right)$
$d \in (0, 1)$ is the damping factor.
$\kappa \in (0, 1)$.
$p^*(u) = \frac{1}{|M|}$ if $u \in M$, $p^*(u) = 0$ otherwise.
[Figure: a paper v with its reference edges, its back-reference (citation) edges, and a restart edge, each labeled with a transition probability.]
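A direct transcription of the DaRWR update can look like the sketch below. The slide does not define the neighborhoods, so the code assumes one reading of the notation: N+(u) is the set of papers u cites, N−(u) the set of papers citing u, δ+(v) the number of references of v, and δ−(v) the number of citations of v. The function name, the toy graph, and the fixed iteration count are illustrative.

```python
def darwr(refs, query, d=0.15, kappa=0.5, iterations=50):
    """Direction-aware random walk with restart; refs[v] lists the papers v cites."""
    cited_by = {v: [] for v in refs}
    for v, targets in refs.items():
        for t in targets:
            cited_by[t].append(v)
    # p*(u) = 1 / |M| on the query set M, 0 elsewhere
    restart = {v: (1.0 / len(query) if v in query else 0.0) for v in refs}
    pi = dict(restart)
    for _ in range(iterations):
        nxt = {u: d * restart[u] for u in refs}
        for v in refs:
            if pi[v] == 0.0:
                continue
            # kappa term: v in N+(u) means u cites v, so v feeds the papers citing it
            for u in cited_by[v]:
                nxt[u] += (1 - d) * kappa * pi[v] / len(cited_by[v])
            # (1 - kappa) term: v in N-(u) means v cites u, so v feeds its references
            for u in refs[v]:
                nxt[u] += (1 - d) * (1 - kappa) * pi[v] / len(refs[v])
        pi = nxt
    return pi

# Toy chain: "new" cites "mid", "mid" cites "old" (illustrative data).
refs = {"old": [], "mid": ["old"], "new": ["mid"]}
toward_citers = darwr(refs, {"mid"}, kappa=0.9)
toward_refs = darwr(refs, {"mid"}, kappa=0.1)
```

Under this reading, a large κ sends most of the score along citation edges (here to `"new"`), and a small κ sends it along reference edges (here to `"old"`).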
Page 29
Exploring in Depth
[Figure: heat map of the average shortest distance (color scale 1 to 3) as a function of κ (x-axis, 0 to 1) and d (y-axis, 0.5 to 1).]
Page 30
Exploring in Time
[Figure: heat map of the average publication year (color scale 1980 to 2010) as a function of κ (x-axis, 0 to 1) and d (y-axis, 0.1 to 1).]
Page 31
Hidden Reference Discovery
The recovery test
Let’s hide some references from a paper and see if an algorithm can find them.
Results of the experiments with mean average precision (MAP@50) and 95% confidence intervals.

            hide random              hide recent              hide earlier
Method      mean   interval          mean   interval          mean   interval
DaRWR       48.00  [46.80, 49.20]    42.22  [40.95, 43.50]    60.64  [59.48, 61.80]
P.R.        56.56  [55.31, 57.80]    38.75  [37.50, 40.00]    58.93  [57.76, 60.10]
Katzβ       46.33  [45.16, 47.50]    34.56  [33.42, 35.70]    44.19  [42.97, 45.40]
Cocit       44.60  [43.39, 45.80]    14.22  [13.25, 15.20]    55.97  [54.64, 57.30]
Cocoup      17.28  [16.36, 18.20]    17.56  [16.61, 18.50]     2.93  [2.57, 3.30]
CCIDF       18.05  [17.11, 19.00]    18.97  [17.94, 20.00]     3.55  [3.10, 4.00]
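The recovery test scores a ranked recommendation list against the set of hidden references. The sketch below shows one common way to compute average precision at k; the normalization by min(|hidden|, k) is an assumption, since the slide does not state which variant of MAP@50 was used, and the function name and sample data are illustrative.

```python
def average_precision_at_k(ranked, hidden, k=50):
    """Average precision of the top-k ranked papers with respect to the hidden references."""
    hits = 0
    total = 0.0
    for rank, paper in enumerate(ranked[:k], start=1):
        if paper in hidden:
            hits += 1
            total += hits / rank  # precision at this rank
    denom = min(len(hidden), k)
    return total / denom if denom else 0.0

# Two hidden references, recovered at ranks 1 and 3 (illustrative data).
ranked = ["p1", "p2", "p3", "p4"]
hidden = {"p1", "p3"}
ap = average_precision_at_k(ranked, hidden)
```

MAP is then the mean of this value over all test papers.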
Page 33
A Sparse Matrix-Vector Multiplication (SpMV)
Rewriting DaRWR
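$\pi_i(u) = d\,p^*(u) + (1-d)\left( \kappa \sum_{v \in N^+(u)} \frac{\pi_{i-1}(v)}{\delta^-(v)} + (1-\kappa) \sum_{v \in N^-(u)} \frac{\pi_{i-1}(v)}{\delta^+(v)} \right)$
$\pi_i(u) = d\,p^*(u) + \sum_{v \in N^+(u)} \frac{(1-d)\kappa}{\delta^-(v)}\,\pi_{i-1}(v) + \sum_{v \in N^-(u)} \frac{(1-d)(1-\kappa)}{\delta^+(v)}\,\pi_{i-1}(v)$
$\pi_i = d\,p^* + A\,\pi_{i-1}$
CRS Full
Traverse A column by column.
Skip columns where $\pi_{i-1}(v) = 0$.
Per edge: 2 non-zeros (2 indices, 2 values)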
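The "CRS Full" variant can be sketched as follows: the matrix A is stored explicitly in a compressed layout, one column per source vertex v, and a column is skipped when the previous score of v is zero. The code assumes vertices numbered 0..n-1, reference lists `refs[v]`, and the same reading of N+, N−, δ+, δ− as papers cited, citing papers, number of references, and number of citations; all names are illustrative and this is not the authors' implementation.

```python
def build_columns(refs, d, kappa):
    """Return (col_ptr, row_idx, vals) storing A column by column."""
    n = len(refs)
    cited_by = [[] for _ in range(n)]
    for v, targets in enumerate(refs):
        for t in targets:
            cited_by[t].append(v)
    col_ptr, row_idx, vals = [0], [], []
    for v in range(n):
        for u in cited_by[v]:      # entries (1 - d) * kappa / delta-(v)
            row_idx.append(u)
            vals.append((1 - d) * kappa / len(cited_by[v]))
        for u in refs[v]:          # entries (1 - d) * (1 - kappa) / delta+(v)
            row_idx.append(u)
            vals.append((1 - d) * (1 - kappa) / len(refs[v]))
        col_ptr.append(len(row_idx))
    return col_ptr, row_idx, vals

def spmv_step(col_ptr, row_idx, vals, pi, restart, d):
    """One iteration pi_i = d * p* + A * pi_{i-1}, skipping zero entries of pi."""
    nxt = [d * r for r in restart]
    for v, x in enumerate(pi):
        if x == 0.0:
            continue  # the whole column contributes nothing
        for k in range(col_ptr[v], col_ptr[v + 1]):
            nxt[row_idx[k]] += vals[k] * x
    return nxt

refs = [[], [0], [1]]        # paper 2 cites 1, paper 1 cites 0 (toy data)
restart = [0.0, 1.0, 0.0]    # query set M = {1}
d, kappa = 0.15, 0.9
col_ptr, row_idx, vals = build_columns(refs, d, kappa)
pi1 = spmv_step(col_ptr, row_idx, vals, restart, restart, d)
```

The toy graph has two edges and the stored matrix has four non-zeros, each with its own index and value, which matches the "2 non-zeros per edge" count on the slide.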
Page 36
A Sparse Matrix-Vector Multiplication (SpMV)
Rewriting DaRWR
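$\pi_i(u) = d\,p^*(u) + \sum_{v \in N^+(u)} \frac{(1-d)\kappa}{\delta^-(v)}\,\pi_{i-1}(v) + \sum_{v \in N^-(u)} \frac{(1-d)(1-\kappa)}{\delta^+(v)}\,\pi_{i-1}(v)$
$\pi_i = d\,p^* + B^- \left( \frac{(1-d)\kappa}{\delta^-}\,\pi_{i-1} \right) + B^+ \left( \frac{(1-d)(1-\kappa)}{\delta^+}\,\pi_{i-1} \right)$
CRS Half
Pre-compute $\frac{(1-d)\kappa}{\delta^-}\,\pi_{i-1}$ and $\frac{(1-d)(1-\kappa)}{\delta^+}\,\pi_{i-1}$ (element-wise).
B− and B+ are 0/1 and symmetric.
Traverse the matrix twice (B− and B+).
Skip columns where $\pi_{i-1}(v) = 0$.
Per edge: 1 non-zero (1 index, 0 values)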
Page 37
A Sparse Matrix-Vector Multiplication (SpMV)
Rewriting DaRWR
$\pi_i = d\,p^* + B^- \left( \frac{(1-d)\kappa}{\delta^-}\,\pi_{i-1} \right) + B^+ \left( \frac{(1-d)(1-\kappa)}{\delta^+}\,\pi_{i-1} \right)$
COO Half
Pre-compute $\frac{(1-d)\kappa}{\delta^-}\,\pi_{i-1}$ and $\frac{(1-d)(1-\kappa)}{\delta^+}\,\pi_{i-1}$ (element-wise).
B− and B+ are 0/1 and symmetric.
Traverse the matrix once (B− and B+).
Arbitrary order. Don’t skip anything.
Per edge: 1 non-zero (2 indices, 0 values)
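The "COO Half" idea can be sketched with a plain edge list: the previous vector is scaled once per vertex, and a single pass over the edges applies both the B− and the B+ contribution, so no per-edge value is stored. The code assumes edges are (citing, cited) pairs over vertices 0..n-1 and the same reading of δ+ and δ− as number of references and number of citations; the function name and toy data are illustrative.

```python
def coo_half_step(edges, out_deg, in_deg, pi, restart, d, kappa):
    """One DaRWR iteration from an edge list; edges holds (citing, cited) pairs."""
    n = len(pi)
    # Pre-scale the previous vector once per vertex, so edges need no stored value.
    to_citers = [(1 - d) * kappa * pi[v] / in_deg[v] if in_deg[v] else 0.0
                 for v in range(n)]
    to_refs = [(1 - d) * (1 - kappa) * pi[v] / out_deg[v] if out_deg[v] else 0.0
               for v in range(n)]
    nxt = [d * r for r in restart]
    for citing, cited in edges:          # single pass, any order, nothing skipped
        nxt[citing] += to_citers[cited]  # B- contribution
        nxt[cited] += to_refs[citing]    # B+ contribution
    return nxt

edges = [(1, 0), (2, 1)]     # paper 1 cites 0, paper 2 cites 1 (toy data)
out_deg = [0, 1, 1]          # number of references per paper
in_deg = [1, 1, 0]           # number of citations per paper
restart = [0.0, 1.0, 0.0]    # query set M = {1}
step = coo_half_step(edges, out_deg, in_deg, restart, restart, d=0.15, kappa=0.9)
```

The trade-off named on the slide is visible here: each edge costs two indices and no value, but every edge is visited even when the corresponding scores are zero.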
Page 38
Number of updates
[Figure: number of updates (y-axis, 0 to 12M) per iteration (x-axis, 2 to 20) for CRS-Full, CRS-Half, and COO-Half.]
Page 39
Runtimes
[Figure: execution time in seconds (y-axis, 1 to 3) versus #partitions (x-axis, 1 to 64) for CRS-Full, CRS-Half, COO-Half, and Hybrid, each as-is and with the RCM, AMD, and SB orderings.]
Page 43
Ordering
Locality
SpMV is sensitive to non-zero locality.
Reverse Cuthill-McKee [Cuthill, McKee, 69]
Order with respect to a Breadth First Search ordering. (Do 10 times, pick best)
Approximate Minimum Degree [Amestoy et al., 96]
Greedily, add the vertex whose degree is minimum.
Slashburn [Kang, Faloutsos, 11]
Order by connected components. Remove the highest degree vertex. Repeat.
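To make the effect of a BFS-based ordering concrete, here is a minimal sketch of a Reverse Cuthill-McKee style ordering together with a bandwidth measure. It assumes an undirected adjacency list, starts each BFS from a lowest-degree unvisited vertex, and runs only once (the slide mentions running 10 times and keeping the best); the function names and toy graph are illustrative.

```python
from collections import deque

def reverse_cuthill_mckee(adj):
    """Return a vertex order: BFS from low-degree vertices, neighbors by degree, reversed."""
    visited = set()
    order = []
    for start in sorted(adj, key=lambda v: len(adj[v])):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in sorted(adj[v], key=lambda x: len(adj[x])):
                if w not in visited:
                    visited.add(w)
                    queue.append(w)
    return order[::-1]

def bandwidth(adj, order):
    """Largest distance between the positions of two adjacent vertices."""
    pos = {v: i for i, v in enumerate(order)}
    return max((abs(pos[u] - pos[v]) for u in adj for v in adj[u]), default=0)

# A path 3 - 0 - 4 - 1 - 2 whose labels are scrambled (toy data).
adj = {3: [0], 0: [3, 4], 4: [0, 1], 1: [4, 2], 2: [1]}
natural = sorted(adj)
rcm = reverse_cuthill_mckee(adj)
```

On this toy path the natural labeling places neighbors up to 4 positions apart, while the BFS-based order places every pair of neighbors next to each other, which is the kind of locality SpMV benefits from.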
Page 44
Partitioning
Page 45
Runtimes
[Figure: execution time in seconds (y-axis, 1 to 3) versus #partitions (x-axis, 1 to 64) for CRS-Full, CRS-Half, COO-Half, and Hybrid, each as-is and with the RCM, AMD, and SB orderings.]
Page 49
Principle
The goal of diversity is to avoid clustered answers.
[Figure: two panels, labeled "Relevant" and "Relevant Diverse".]
Page 50
A Modelization problem
[Figure: scatter plot of diversification algorithms, rel (x-axis, 0 to 1) versus dens2 (y-axis, 0 to 1), with an arrow marking the "better" direction. Legend: 10-RLM, BC1, BC1V, BC2, BC2V and further BC1/BC2 variants, CDivRank, DRAGON, GRASSHOPPER, GSPARSE, IL1, IL2, LM, PDivRank, PR, k-RLM, top-90%+random, top-75%+random, top-50%+random, top-25%+random, All Random.]
Here is a distribution of known algorithms
Page 52
A Modelization problem
[Figure: the same scatter plot of rel (x-axis, 0 to 1) versus dens2 (y-axis, 0 to 1), with the "better" direction marked.]
Would such an algorithm be of interest?
That algorithm is random!
Page 53
What to do?
See later talk!
Page 54
Results
[Figure: result visualization with topic labels GPU, Multicore, Generic SpMV, Eigensolvers, Partitioning, Compression, Graph mining; legend: references, recommendations, top-100.]
Page 56
Relevance Feedback
Papers can be tagged as relevant or irrelevant.
Positive feedback (+RF): Relevant results are added to Q
Negative feedback (-RF): Irrelevant results are removed from the graph
How long does it take to find the first level-3 paper?
[Figure: curves for No RF, Pos RF, Neg RF, and Pos+Neg RF; axes labeled Tau and Ratio, with ticks 0 to 1 and 1 to 100.]
More exploration!
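The two feedback rules can be written as a small update on the query set and the graph. This is a sketch under assumptions: the graph is a dictionary of reference lists, Q is a set of paper identifiers, and a paper tagged irrelevant is also dropped from Q if it was there; the function name and the toy data are hypothetical.

```python
def apply_feedback(refs, query, relevant=(), irrelevant=()):
    """Return an updated (refs, query) pair after one round of feedback."""
    new_query = set(query) | set(relevant)   # +RF: grow the query set Q
    removed = set(irrelevant)                # -RF: drop these vertices and their edges
    new_query -= removed
    new_refs = {
        p: [t for t in targets if t not in removed]
        for p, targets in refs.items()
        if p not in removed
    }
    return new_refs, new_query

# Toy graph: a cites b and c, b cites c (illustrative data).
refs = {"a": ["b", "c"], "b": ["c"], "c": []}
new_refs, new_query = apply_feedback(refs, {"a"}, relevant=["b"], irrelevant=["c"])
```

The recommendation is then recomputed on the updated graph with the updated query set.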
Page 58
Visualization
More exploration!
Page 60
Application Programming Interface
Web service
theadvisor can be accessed programmatically. Emit HTTP requests and obtain JSON-encoded replies.
Potential Applications
Interfacing with article editors (e.g., TexShop)
Recommendation in bibliography managers (e.g., Mendeley)
Suggesting reviewers to program committees (e.g., EasyChair)
Suggesting sessions of interest at conferences (e.g., iConference)
Easier to use!
Page 64
Design Goals - Are they matched?
Personalized
The query is expressed in a very precise way.
Conceptual
Using the citation graph avoids all linguistic issues, though it may not be enough to find all relevant papers.
Exploratory
Direction Awareness (to choose time), Diversification (to see more topics), Visualization (for manual crawling)
Easy to use
Efficient (recommendation in less than 2 seconds), web-based.
Is it good enough? Tell us!
Page 66
Future works
Clustering
Let’s assume for an instant that we have accurate disambiguated tags for every document. We could restrict analysis to some fields. Improve diversification.
Betweenness Centrality
DaRWR provides recommendation around the query set. What about recommending what is between it?
Contextual information
Distinguishing types of papers and citations. Survey, Method, Application...
Page 67
Thank you
More information
contact : [email protected] : http://theadvisor.osu.edu
(or http://bmi.osu.edu/hpc/
or http://bmi.osu.edu/~esaule)
Research at HPC lab is supported by