Mendeley’s Research Catalogue: building it, opening it up and making it even more useful for researchers


Kris Jack, PhD Chief Data Scientist, @_krisjack

Outline

1.  What's Mendeley?

2.  Under the Bonnet

3.  Opening up Data

4.  Working with Academia

5.  Conclusions

What's Mendeley?

Mendeley's not just a reference manager

→ Mendeley is a platform that connects researchers, research data and apps

[Diagram: Mendeley Open API, research catalogue]

Mendeley provides tools to help users...

...organise their research

→ Reference management
→ Cite-as-you-write
→ Full-text article search
→ Digitalised annotations


...collaborate with one another


→ Professional research groups
→ Social network
→ Annotation sharing



...discover new research


→ Explore crowdsourced research catalogue
→ Document statistics
→ Personalised article recommendations
→ Related research
→ Research contact suggestions









Our community from a data perspective

•  Social network (>2.4M users)
•  Research catalogue (~85M unique articles)
•  Research groups (~240K groups)
•  Personal libraries (>425M articles)
•  Logging a massive set of usage data

Under the Bonnet

Lots of features to build & support



















[Diagram: features built on crowdsourcing (deduplication, metadata aggregation, statistics)]
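As a rough illustration of what the three crowdsourcing steps could involve, here is a minimal sketch. It is not Mendeley's pipeline: the title-based key, the majority vote and all function and field names are assumptions made for the example.

```python
from collections import Counter, defaultdict
import re


def title_key(title):
    """Normalise a title into a crude deduplication key (illustrative only)."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def build_catalogue(library_entries):
    """Group user-submitted records by key, then vote on each metadata field."""
    groups = defaultdict(list)
    for entry in library_entries:
        groups[title_key(entry["title"])].append(entry)

    catalogue = []
    for entries in groups.values():
        record = {}
        for field in ("title", "year"):
            values = [e[field] for e in entries if e.get(field) is not None]
            # Metadata aggregation: the most frequently submitted value wins.
            record[field] = Counter(values).most_common(1)[0][0] if values else None
        # Statistics: number of distinct users holding the article.
        record["readers"] = len({e["user"] for e in entries})
        catalogue.append(record)
    return catalogue


# Hypothetical library entries from three users.
entries = [
    {"user": "u1", "title": "Deep Learning", "year": 2015},
    {"user": "u2", "title": "Deep learning.", "year": 2015},
    {"user": "u3", "title": "Deep Learning", "year": 2014},
    {"user": "u1", "title": "MapReduce: Simplified Data Processing", "year": 2004},
]
catalogue = build_catalogue(entries)
```

The three "Deep Learning" variants collapse into one catalogue record, with the year chosen by vote and a reader count of three. A real system would need a far more robust key than a normalised title.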

The curse of success

•  More articles came
•  More users came
•  Keeping catalogue data fresh was a burden
•  Algorithms relied on global counts
•  Iterating over MySQL tables was slow
•  Needed to shard tables to grow catalogue
•  In short, our backend system didn't scale

Please try again later

~0.5 million users; the 20 largest user bases:

•  University of Cambridge
•  Stanford University
•  MIT
•  University of Michigan
•  Harvard University
•  University of Oxford
•  Sao Paulo University
•  Imperial College London
•  University of Edinburgh
•  Cornell University
•  University of California at Berkeley
•  RWTH Aachen
•  Columbia University
•  Georgia Tech
•  University of Wisconsin
•  UC San Diego
•  University of California at LA
•  University of Florida
•  University of North Carolina

~30m research articles










The system started to become slow.

How long did it take to generate our daily readership statistics?











23 hours!

We had serious needs

•  Build a catalogue based on billions of articles
•  Support many features that rely on the catalogue
   •  Statistics
   •  Search
   •  Recommendations
   •  Sharing
•  Data
   •  Freshness
   •  Consistency
•  Business context
   •  Agile development (rapid prototyping)
   •  Cost effective
   •  Going viral
   •  Technical debt stacking up

Enter Hadoop

What is Hadoop?

The Apache Hadoop Project develops open-source software for reliable, scalable, distributed computing

www.hadoop.apache.org

Hadoop

•  Designed to operate on a cluster of computers
   •  1…thousands
   •  Commodity hardware (low cost units)
•  Each node offers local computation and storage
•  Provides framework for working with big data (beyond petabytes)
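To make the programming model concrete, here is a minimal single-machine sketch of a MapReduce-style readership count. It only imitates the map, shuffle and reduce phases in plain Python; it does not use Hadoop, and the function names and sample libraries are assumptions for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_library(user_id, article_ids):
    """Map step: emit (article_id, 1) once per article in one user's library."""
    return [(article_id, 1) for article_id in set(article_ids)]


def shuffle(pairs):
    """Shuffle step: group emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(article_id, ones):
    """Reduce step: sum the ones to get a reader count per article."""
    return article_id, sum(ones)


# Hypothetical personal libraries; u3 holds a duplicate copy of a3.
libraries = {"u1": ["a1", "a2"], "u2": ["a1"], "u3": ["a1", "a3", "a3"]}
mapped = chain.from_iterable(map_library(u, arts) for u, arts in libraries.items())
readership = dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())
```

Because each map call looks at a single library and each reduce call at a single article, the work can in principle be spread across many nodes, which is what makes this shape of job a natural fit for a cluster.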

New tech stack for backend

[Diagram: backend stack; surviving labels: features, statistics]

23 hr computations now took 15 minutes

recommended reading

Mendeley Suggest

Generating recommendations through matrix multiplication

These are item-based recommendations, as similarity is computed between items, not users

org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
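The idea of item-based recommendation through matrix multiplication can be sketched in a few lines. This is an illustrative toy in plain Python, not Mahout's RecommenderJob: it assumes a simple co-occurrence count as the item similarity, and the function names and sample libraries are made up for the example.

```python
def cooccurrence(libraries, items):
    """Build an item-by-item matrix: how often two items share a library."""
    index = {item: i for i, item in enumerate(items)}
    matrix = [[0] * len(items) for _ in items]
    for library in libraries.values():
        for a in library:
            for b in library:
                if a != b:
                    matrix[index[a]][index[b]] += 1
    return matrix


def recommend(user_library, matrix, items, top_n=2):
    """Multiply the co-occurrence matrix by the user's 0/1 item vector."""
    vector = [1 if item in user_library else 0 for item in items]
    scores = [sum(row[j] * vector[j] for j in range(len(items))) for row in matrix]
    # Rank unseen items by score, breaking ties by item id for determinism.
    ranked = sorted(
        (i for i, item in enumerate(items) if item not in user_library),
        key=lambda i: (-scores[i], items[i]),
    )
    return [items[i] for i in ranked[:top_n] if scores[i] > 0]


# Hypothetical libraries: article a4 never co-occurs with anything.
items = ["a1", "a2", "a3", "a4"]
libraries = {
    "u1": {"a1", "a2"},
    "u2": {"a1", "a2", "a3"},
    "u3": {"a2", "a3"},
    "u4": {"a4"},
}
matrix = cooccurrence(libraries, items)
suggestions = recommend({"a1"}, matrix, items)
```

A reader who holds only a1 is scored against every other article by how often it sits alongside a1 in other libraries, so a2 ranks ahead of a3 and a4 is never suggested.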

Running on Amazon's Elastic MapReduce

On-demand use and easy to cost

Mahout's Performance

[Chart: Normalised Amazon Hours (0 to 7K) plotted against No. Good Recommendations/10 (0 to 3). Quadrants: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good.]

Recommender          Normalised Amazon Hours   No. Good Recommendations/10
Orig. item-based     6.5K                      1.5
Cust. item-based     2.4K                      1.5
Orig. user-based     1K                        2.5
Cust. user-based     0.3K                      2.5

•  Orig. item-based → Cust. item-based: -4.1K (63%)
•  Cust. item-based → Orig. user-based: -1.4K (58%), +1 (67%)
•  Orig. user-based → Cust. user-based: -0.7K (70%)
•  Orig. item-based → Cust. user-based: -6.2K (95%), +1 (67%)
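The percentages on the chart follow directly from the plotted values (6.5K, 2.4K, 1K and 0.3K normalised Amazon hours; 1.5 and 2.5 good recommendations). The short check below reproduces them; the variable and function names are illustrative.

```python
def percent_change(before, after):
    """Relative change from before to after, as a whole-number percentage."""
    return round(abs(after - before) / before * 100)


# (normalised Amazon hours, good recommendations out of 10), read off the chart.
results = {
    "orig_item": (6500, 1.5),
    "cust_item": (2400, 1.5),
    "orig_user": (1000, 2.5),
    "cust_user": (300, 2.5),
}

hours = {name: h for name, (h, _) in results.items()}
item_tuning_saving = percent_change(hours["orig_item"], hours["cust_item"])
switch_to_user_saving = percent_change(hours["cust_item"], hours["orig_user"])
user_tuning_saving = percent_change(hours["orig_user"], hours["cust_user"])
overall_saving = percent_change(hours["orig_item"], hours["cust_user"])
quality_gain = percent_change(results["orig_item"][1], results["cust_user"][1])
```

The savings come out at 63%, 58%, 70% and 95% respectively, and the quality gain at 67%, matching the annotations on the chart.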

Disclaimer: these advantages have costs

•  Migrating to a new system (data consistency)
•  Setup costs
   •  Learn black magic to configure
   •  Hardware for cluster
•  Administrative costs
   •  Steep learning curve to administer Hadoop
   •  Still an immature technology
   •  You may need to debug the source code
•  Developing against Mahout
   •  Still needs lots of love

Big data backend


Opening up Data

Our community from a data perspective

•  Social network (>2.4M users)
•  Research groups (~240K groups)
•  Logging a massive set of usage data

PLoS/Mendeley's Binary Battle

Challenge: Build an application with our data to make science more open.

More details at http://dev.mendeley.com/api-binary-battle/

ScienceRec Challenge 2012

Challenge: Build an off-line system for scientific recommendations with our API and the DataTEL data set

More details at http://2012.recsyschallenge.com/tracks/sciencerec/


The Next Challenge…?

Challenge: Metadata Extraction Challenge

Working with Academia

We have a history of academic collaborations

Duration     Project
2009-2011    MAKIN'IT
2010-2014    TEAM
2010-2011    DURA
2012-2012    CSL Editor
2012-2014    CODE
2012-2014    ERASM
2013-2015    EEXCESS

Demo

CSL Editor http://editor.citationstyles.org/

Demo

CODE Mendeley Desktop http://code-research.eu/results

Demo

Mendeley Labs http://labs.mendeley.com/


Want to collaborate?

Conclusions

→ Mendeley is far more than a reference manager: it's a platform that connects researchers, data and apps
→ Starting small is good, but be prepared for the cost of scaling up
→ We're opening up our data for you to build apps on our platform
→ We're always keen to collaborate with academic groups
