Mendeley’s Research Catalogue: building it, opening it up and making it even more useful for researchers
Kris Jack, PhD, Chief Data Scientist, @_krisjack
Dec 05, 2014
Outline
1. What's Mendeley?
2. Under the Bonnet
3. Opening up Data
4. Working with Academia
5. Conclusions
What's Mendeley?
Mendeley's not just a reference manager
→ Mendeley is a platform that connects researchers, research data and apps
[Diagram labels: Mendeley Open API, research catalogue]
Mendeley provides tools to help users...
...organise their research
→ Reference management
→ Cite-as-you-write
→ Full-text article search
→ Digitalised annotations
...collaborate with one another
→ Professional research groups
→ Social network
→ Annotation sharing
...discover new research
→ Explore crowdsourced research catalogue
→ Document statistics
→ Personalised article recommendations
→ Related research
→ Research contact suggestions
Our community from a data perspective
• Social network (>2.4M users)
• Research catalogue (~85M unique articles)
• Research groups (~240K groups)
• Personal libraries (>425M articles)
• Logging a massive set of usage data
Under the Bonnet
Lots of features to build & support
→ Reference management
→ Cite-as-you-write
→ Full-text article search
→ Digitalised annotations
→ Professional research groups
→ Social network
→ Annotation sharing
→ Explore crowdsourced research catalogue
→ Document statistics
→ Personalised article recommendations
→ Related research
→ Research contact suggestions
[Diagram: features; Research catalogue (~30M unique articles); Personal libraries (>100M articles); Crowdsourcing (deduplication, metadata aggregation, statistics)]
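The crowdsourcing step named in the diagram (deduplication, metadata aggregation, statistics) can be illustrated with a minimal Python sketch. This is not Mendeley's pipeline: the normalised-title key, the majority vote per field and the `LibraryEntry` type are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
import re


@dataclass(frozen=True)
class LibraryEntry:
    """One article record as it appears in one user's library (hypothetical shape)."""
    user_id: str
    title: str
    year: str


def dedup_key(title: str) -> str:
    """Assumed deduplication key: lower-cased title with punctuation and spaces removed."""
    return re.sub(r"[^a-z0-9]", "", title.lower())


def build_catalogue(entries):
    """Group library entries into catalogue records with voted metadata and reader counts."""
    clusters = defaultdict(list)
    for entry in entries:
        clusters[dedup_key(entry.title)].append(entry)

    catalogue = []
    for group in clusters.values():
        catalogue.append({
            # Metadata aggregation: keep the most common value seen across libraries.
            "title": Counter(e.title for e in group).most_common(1)[0][0],
            "year": Counter(e.year for e in group).most_common(1)[0][0],
            # Statistics: number of distinct users holding the article.
            "readers": len({e.user_id for e in group}),
        })
    return catalogue


entries = [
    LibraryEntry("u1", "Latent Dirichlet Allocation", "2003"),
    LibraryEntry("u2", "Latent dirichlet allocation.", "2003"),
    LibraryEntry("u3", "Latent Dirichlet Allocation", "2002"),
    LibraryEntry("u1", "MapReduce: Simplified Data Processing on Large Clusters", "2004"),
]
catalogue = build_catalogue(entries)
for record in catalogue:
    print(record)
```

Three differently typed copies of the same article collapse into one catalogue record with three readers, and the minority year is outvoted. A real system would need a far more robust key than a normalised title.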
The curse of success
• More articles came
• More users came
• Keeping catalogue data fresh was a burden
• Algorithms relied on global counts
• Iterating over MySQL tables was slow
• Needed to shard tables to grow catalogue
• In short, our backend system didn’t scale
Please try again later
~0.5 million users; the 20 largest user bases:
University of Cambridge, Stanford University, MIT, University of Michigan, Harvard University, University of Oxford, Sao Paulo University, Imperial College London, University of Edinburgh, Cornell University, University of California at Berkeley, RWTH Aachen, Columbia University, Georgia Tech, University of Wisconsin, UC San Diego, University of California at LA, University of Florida, University of North Carolina
~30m research articles
The system started to become slow.
How long did it take to generate our daily readership statistics?
23 hours!
We had serious needs
• Build a catalogue based on billions of articles
• Support many features that rely on the catalogue
  • Statistics
  • Search
  • Recommendations
  • Sharing
• Data
  • Freshness
  • Consistency
• Business context
  • Agile development (rapid prototyping)
  • Cost effective
  • Going viral
  • Technical debt stacking up
Enter Hadoop
What is Hadoop?
The Apache Hadoop Project develops open-source software for reliable, scalable, distributed computing
www.hadoop.apache.org
Hadoop
• Designed to operate on a cluster of computers
  • 1…thousands
  • Commodity hardware (low cost units)
• Each node offers local computation and storage
• Provides a framework for working with big data (beyond petabytes)
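To make the programming model concrete, here is a minimal single-process Python sketch of the map, shuffle and reduce phases that Hadoop distributes across nodes, applied to readership counting. It does not use Hadoop itself; the `(user_id, article_id)` input rows and all function names are assumptions for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(library_row):
    """Map: emit (article_id, 1) for one (user_id, article_id) library row."""
    _user_id, article_id = library_row
    yield article_id, 1


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(article_id, counts):
    """Reduce: sum the partial counts for one article."""
    return article_id, sum(counts)


def readership(rows):
    """Run map, shuffle and reduce in sequence and return readers per article."""
    mapped = chain.from_iterable(map_phase(row) for row in rows)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


rows = [("u1", "a1"), ("u2", "a1"), ("u2", "a2"), ("u3", "a1")]
stats = readership(rows)
print(stats)
```

Each map call sees a single row and each reduce call sees a single key, which is what allows the work to be spread over many nodes.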
New tech stack for backend
[Diagram: features; Research catalogue (~30M unique articles); Personal libraries (>100M articles); Crowdsourcing (deduplication, metadata aggregation, statistics)]
23-hour computations now took 15 minutes
recommended reading
Mendeley Suggest
Generating recommendations through matrix multiplication
These are item-based recommendations, as similarity is computed between items, not users
org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
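The idea of generating item-based recommendations through matrix multiplication can be sketched in a few lines of plain Python. This is a toy, single-machine illustration and not Mahout's code: co-occurrence counts are assumed as the item-to-item similarity, articles are assumed to be numbered from 0, and the function names are hypothetical.

```python
def cooccurrence_matrix(libraries, n_items):
    """Count, for every pair of distinct items, how many libraries contain both."""
    matrix = [[0] * n_items for _ in range(n_items)]
    for items in libraries.values():
        for i in items:
            for j in items:
                if i != j:
                    matrix[i][j] += 1
    return matrix


def recommend(user, libraries, n_items, top_n=2):
    """Score items as (co-occurrence matrix) x (user's item vector), skipping owned items."""
    matrix = cooccurrence_matrix(libraries, n_items)
    owned = libraries[user]
    user_vector = [1 if i in owned else 0 for i in range(n_items)]
    # Matrix-vector product: one score per candidate item.
    scores = [sum(matrix[i][j] * user_vector[j] for j in range(n_items))
              for i in range(n_items)]
    candidates = [i for i in range(n_items) if i not in owned]
    # Highest score first; ties broken by item number.
    return sorted(candidates, key=lambda i: (-scores[i], i))[:top_n]


libraries = {
    "alice": {0, 1},
    "bob": {0, 1, 2},
    "carol": {1, 2, 3},
    "dave": {0},
}
suggestions = recommend("dave", libraries, n_items=4)
print(suggestions)
```

Here dave holds only article 0, so the scores are simply the co-occurrence counts with article 0, and article 1 (shared with it in two libraries) ranks ahead of article 2 (shared in one).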
Running on Amazon's Elastic MapReduce
On-demand use and easy to cost
Mahout's Performance
[Scatter plot: x-axis No. Good Recommendations/10 (0 to 3), y-axis Normalised Amazon Hours (0 to 7K); quadrants: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good]
• Orig. item-based: 6.5K hours, 1.5 good recommendations
• Cust. item-based: 2.4K hours, 1.5 good recommendations
• Orig. user-based: 1K hours, 2.5 good recommendations
• Cust. user-based: 0.3K hours, 2.5 good recommendations
Changes between implementations:
• Orig. item-based → Cust. item-based: -4.1K hours (63%)
• Cust. item-based → Orig. user-based: -1.4K hours (58%), +1 good recommendation (67%)
• Orig. user-based → Cust. user-based: -0.7K hours (70%)
• Orig. item-based → Cust. user-based: -6.2K hours (95%), +1 good recommendation (67%)
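The savings quoted on the chart follow directly from the four plotted points. A short Python check, using only the figures from the slides (cost in normalised Amazon hours, quality as good recommendations out of 10); the `compare` helper is introduced here purely for illustration:

```python
# (normalised Amazon hours, good recommendations per 10), as read from the chart.
results = {
    "Orig. item-based": (6500, 1.5),
    "Cust. item-based": (2400, 1.5),
    "Orig. user-based": (1000, 2.5),
    "Cust. user-based": (300, 2.5),
}


def compare(before, after):
    """Return (hours saved, % hours saved, good recs gained, % good recs gained)."""
    hours_before, good_before = results[before]
    hours_after, good_after = results[after]
    saved = hours_before - hours_after
    gained = good_after - good_before
    return (saved, round(100 * saved / hours_before),
            gained, round(100 * gained / good_before))


steps = [
    ("Orig. item-based", "Cust. item-based"),
    ("Cust. item-based", "Orig. user-based"),
    ("Orig. user-based", "Cust. user-based"),
    ("Orig. item-based", "Cust. user-based"),
]
for before, after in steps:
    print(before, "->", after, compare(before, after))
```

The overall move from the original item-based job to the customised user-based one saves 6.2K of 6.5K hours while raising quality from 1.5 to 2.5 good recommendations out of 10.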
Disclaimer: these advantages have costs
• Migrating to a new system (data consistency)
• Setup costs
  • Learn black magic to configure
  • Hardware for cluster
• Administrative costs
  • Steep learning curve to administer Hadoop
  • Still an immature technology
  • You may need to debug the source code
• Developing against Mahout
  • Still needs lots of love
Big data backend
[Diagram: features; Research catalogue (~30M unique articles); Personal libraries (>100M articles); Crowdsourcing (deduplication, metadata aggregation, statistics)]
Opening up Data
Our community from a data perspective
• Social network (>2.4M users)
• Research catalogue (~85M unique articles)
• Research groups (~240K groups)
• Personal libraries (>425M articles)
• Logging a massive set of usage data
PLoS/Mendeley's Binary Battle
Challenge: Build an application with our data and make science more open.
More details at http://dev.mendeley.com/api-binary-battle/
ScienceRec Challenge 2012
Challenge: Build an off-line system for scientific recommendations with our API and the DataTEL data set
More details at http://2012.recsyschallenge.com/tracks/sciencerec/
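An off-line recommender is typically judged by hiding part of each user's library, generating recommendations from the rest, and counting how many hidden articles are recovered. The Python sketch below shows one such measure, precision at 10. It is an assumed metric chosen for illustration, not the challenge's official protocol, and the data and function names are hypothetical.

```python
def precision_at_k(recommended, held_out, k=10):
    """Fraction of the top-k recommended articles that appear in the held-out set."""
    top_k = recommended[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for article in top_k if article in held_out)
    return hits / k


def mean_precision_at_k(recommendations, held_out_by_user, k=10):
    """Average precision@k over all users that have held-out articles."""
    users = [user for user, hidden in held_out_by_user.items() if hidden]
    if not users:
        return 0.0
    total = sum(precision_at_k(recommendations.get(user, []), held_out_by_user[user], k)
                for user in users)
    return total / len(users)


# Articles hidden from each user's library before recommending (toy data).
held_out_by_user = {"u1": {"a3", "a7"}, "u2": {"a1"}}
recommendations = {
    "u1": ["a3", "a9", "a7", "a2", "a4", "a5", "a6", "a8", "a10", "a11"],
    "u2": ["a2", "a3", "a4", "a5", "a6", "a7", "a8", "a9", "a10", "a11"],
}
score = mean_precision_at_k(recommendations, held_out_by_user)
print(score)
```

In this toy data u1 gets two hidden articles back in the top 10 and u2 gets none, so the mean precision at 10 is 0.1.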
The Next Challenge…?
Challenge: Metadata Extraction Challenge
Working with Academia
We have a history of academic collaborations
Duration    Project
2009-2011   MAKIN’IT
2010-2014   TEAM
2010-2011   DURA
2012-2012   CSL Editor
2012-2014   CODE
2012-2014   ERASM
2013-2015   EEXCESS
Demo
CSL Editor http://editor.citationstyles.org/
Demo
CODE Mendeley Desktop http://code-research.eu/results
Demo
Mendeley Labs http://labs.mendeley.com/
Want to collaborate?
Conclusions
→ Mendeley is far more than a reference manager: it's a platform that connects researchers, data and apps
→ Starting small is good, but be prepared for the cost of scaling up
→ We're opening up our data for you to build apps on our platform
→ We're always keen to collaborate with academic groups