Mahout becomes a researcher
Kris Jack, PhD
Senior Data Mining Engineer
May 28, 2015
Overview
➔ What's Mendeley?
➔ Applications of Mahout's Recommender
➔ Under Mahout's Bonnet
➔ Mahout's Research Career so Far
➔ Conclusions
What's Mendeley?
➔ Mendeley is a data platform for researchers
➔ We're bringing together researchers and the research that they produce from all over the world
➔ We're structuring this data in a machine-readable format
➔ We're opening this data up for you to build applications on top of it using our API
➔ These applications help researchers to do even better research and become more productive
➔ How are we building our community?
Mendeley provides tools to help users...
...organise their research
➔ Reference management
➔ Cite-as-you-write
➔ Full-text article search
➔ Digitalised annotations
...collaborate with one another
➔ Research network
➔ Professional research groups
...discover new research
➔ Mendeley Suggest
➔ Personalised article recommendations
➔ Weekly batch of 10 recommended articles
➔ Collaborative Filtering
➔ The more data, the better
1.5 million+ users; 50m research articles
The 20 largest user bases:
University of Cambridge, Stanford University, MIT, University of Michigan, Harvard University, University of Oxford, Sao Paulo University, Imperial College London, University of Edinburgh, Cornell University, University of California at Berkeley, RWTH Aachen, Columbia University, Georgia Tech, University of Wisconsin, UC San Diego, University of California at LA, University of Florida, University of North Carolina
We need a recommender that scales up, coping with our data and future growth
Applications of Mahout's Recommender
Mahout use cases:
➔ Retrieve related items in large collections
    http://www.slideshare.net/kryton/the-data-layer
➔ Discover relevant items that you may have overlooked
    http://engineering.foursquare.com/2011/03/22/building-a-recommendation-engine-foursquare-style/
➔ Find love!
    http://www.speeddate.com/apps/site/views/mp/technology.php
➔ Mendeley Suggest
    http://krisjack.blogspot.co.uk/2012/02/your-very-own-personalised-research.html
    ➔ Discover new research
    ➔ Fill in gaps in your library
    ➔ Your personal advisor
➔ Mahout implements collaborative filtering, a surprisingly powerful algorithm
Under Mahout's Bonnet
Generating recommendations through matrix multiplication
Adomavicius, G., & Tuzhilin, A. (2005). Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6), 734-749. Piscataway, NJ, USA.
These are item-based recommendations, as similarity is computed between items, not users
Not convinced? Try reading these...
http://www.slideshare.net/srowen/collaborative-filtering-at-scale-2
http://krisjack.blogspot.co.uk/2012/04/under-bonnet-of-mahouts-item-based.html
Input (all user preferences)
Figure: a matrix of Research Articles (Comp Sci 1, Comp Sci 2, Physics 1, Physics 2) by Researchers (Turing, Babbage, Einstein, Newton). At Mendeley's scale: 1.5M researchers, 50M research articles, 300M prefs.
item.RecommenderJob
1. Prepare preference matrix (steps 1-3)
2. Generate similarity matrix (steps 4-6)
3. Multiply matrices (steps 7-10)
Figure: Item Similarity (item x item) X A User's Preferences (item x user, here Turing's) = Recommendations (item x user). Both inputs are derived from All User Preferences (item x user).
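The three phases of item.RecommenderJob can be tried out on a toy example in plain Java. This is a minimal sketch, not Mahout's implementation. It assumes binary preferences and uses plain co-occurrence counts as the item similarity, which is only one possible similarity measure. The class and method names (ItemBasedSketch, itemSimilarity, scores) and the toy preference values are assumptions made for illustration.

```java
import java.util.Arrays;

/** Toy item-based recommender: co-occurrence similarity, then a matrix-vector product. */
final class ItemBasedSketch {

    /** Item x item co-occurrence: sim[i][j] = number of users who have both item i and item j. */
    static int[][] itemSimilarity(int[][] prefs) {
        int items = prefs.length;
        int users = prefs[0].length;
        int[][] sim = new int[items][items];
        for (int i = 0; i < items; i++) {
            for (int j = 0; j < items; j++) {
                for (int u = 0; u < users; u++) {
                    sim[i][j] += prefs[i][u] * prefs[j][u];
                }
            }
        }
        return sim;
    }

    /** Scores for one user: the similarity matrix times that user's preference column. */
    static int[] scores(int[][] sim, int[][] prefs, int user) {
        int items = prefs.length;
        int[] out = new int[items];
        for (int i = 0; i < items; i++) {
            if (prefs[i][user] != 0) {
                continue; // items the user already has keep a score of 0
            }
            for (int j = 0; j < items; j++) {
                out[i] += sim[i][j] * prefs[j][user];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed toy data. Rows: Comp Sci 1, Comp Sci 2, Physics 1, Physics 2.
        // Columns: Turing, Babbage, Einstein, Newton. 1 = article is in the library.
        int[][] prefs = {
            {1, 1, 0, 0},
            {1, 0, 0, 0},
            {0, 0, 1, 1},
            {0, 0, 1, 1},
        };
        int[][] sim = itemSimilarity(prefs);                        // phase 2
        System.out.println(Arrays.deepToString(sim));
        System.out.println(Arrays.toString(scores(sim, prefs, 1))); // phase 3, for Babbage
    }
}
```

With this toy data, Babbage (who only has Comp Sci 1) gets the scores [0, 1, 0, 0], so Comp Sci 2 is the one article recommended. At Mendeley's scale the same multiplication is spread over MapReduce jobs instead of nested loops.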
Running on Amazon's Elastic MapReduce
On-demand use and easy to cost
Mahout's Research Career so Far
Mendeley Suggest
Mahout's Performance
Figure: Normalised Amazon Hours (vertical axis, 0 to 7K) against No. Good Recommendations/10 (horizontal axis, 0 to 3), divided into four quadrants: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good.
➔ Orig. item-based: 6.5K hours, 1.5 good recommendations
➔ Cust. item-based: 2.4K hours, 1.5 good recommendations
➔ Saving: -4.1K hours (63%)
Reducing processing time and cost
➔ Mahout's recommender is already efficient
    ➔ but your data may have unusual properties
➔ We got improvements by:
    ➔ tuning Hadoop's mapper and reducer allocation over the 10 steps in the RecommenderJob
    ➔ using an appropriate partitioner
Task Allocation
37 hours to complete
1 reducer allocated, despite having 48 available...
Allocating more mappers on a per job basis
job.getConfiguration().set("mapred.max.split.size", String.valueOf(splitSize));
Allocating more reducers on a per job basis
job.getConfiguration().setInt("mapred.reduce.tasks", numReducers);
Task Allocation
From 1 → 40 reducers: 37 hours → 14 hours to complete
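The split size passed to mapred.max.split.size has to be chosen somehow. The following is a minimal sketch of the arithmetic for a target number of map tasks, assuming a single splittable input of known size. The class and method names (SplitSizeSketch, maxSplitSize, numSplits) and the input size are hypothetical. In a real job the block size, the minimum split size and file boundaries also affect the number of splits, so treat the result as a starting point.

```java
/** Arithmetic behind picking a maximum split size for a target number of map tasks. */
final class SplitSizeSketch {

    /** Split size in bytes (rounded up) so that inputBytes yields at most targetMappers splits. */
    static long maxSplitSize(long inputBytes, int targetMappers) {
        if (inputBytes <= 0 || targetMappers <= 0) {
            throw new IllegalArgumentException("inputBytes and targetMappers must be positive");
        }
        return (inputBytes + targetMappers - 1) / targetMappers;
    }

    /** Number of splits that a single input of inputBytes produces at the given split size. */
    static long numSplits(long inputBytes, long splitSize) {
        return (inputBytes + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long inputBytes = 1_000_000_000L; // assumed input size, for illustration only
        long splitSize = maxSplitSize(inputBytes, 40);
        System.out.println(splitSize + " bytes per split, "
                + numSplits(inputBytes, splitSize) + " splits");
    }
}
```

The value returned by maxSplitSize is what would be passed as splitSize in the per-job setting shown on the slide.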
Partitioners
14 hours to complete
Figure: unevenly distributed partitions, ranging from ~50KB to ~500MB
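// Sample input keys, then write range split points to the partition file
InputSampler.Sampler<IntWritable, Text> sampler =
    new InputSampler.RandomSampler<IntWritable, Text>(...);
InputSampler.writePartitionFile(conf, sampler);
// Partition by key range using the sampled split points
conf.setPartitionerClass(TotalOrderPartitioner.class);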
http://www.philippeadjiman.com/blog/2009/12/20/hadoop-tutorial-series-issue-2-getting-started-with-customized-partitioning/
Partitioners
14 hours → 2 hours to complete
Evenly distributed
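The idea behind sampling keys and partitioning by range can be shown without Hadoop. This is a minimal sketch under stated assumptions: keys are plain ints, the whole key set is used as the sample, and the names (RangePartitionSketch, splitPoints, partition) are hypothetical. It is not the code of InputSampler or TotalOrderPartitioner, only the principle of picking split points at quantiles so that each reducer receives a similar share.

```java
import java.util.Arrays;

/** Range partitioning from sampled keys, the idea behind a total-order partitioner. */
final class RangePartitionSketch {

    /** Picks numPartitions - 1 split points at evenly spaced quantiles of the sample. */
    static int[] splitPoints(int[] sampledKeys, int numPartitions) {
        if (sampledKeys.length == 0 || numPartitions < 1) {
            throw new IllegalArgumentException("need a non-empty sample and at least 1 partition");
        }
        int[] sorted = sampledKeys.clone();
        Arrays.sort(sorted);
        int[] points = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            points[i - 1] = sorted[(int) ((long) i * sorted.length / numPartitions)];
        }
        return points;
    }

    /** Partition index for a key: the number of split points that are <= key. */
    static int partition(int key, int[] points) {
        int p = 0;
        while (p < points.length && points[p] <= key) {
            p++;
        }
        return p;
    }

    public static void main(String[] args) {
        int[] keys = {1, 2, 3, 4, 5, 6, 7, 8}; // assumed toy keys
        int[] points = splitPoints(keys, 4);
        int[] counts = new int[4];
        for (int k : keys) {
            counts[partition(k, points)]++;
        }
        System.out.println(Arrays.toString(points) + " " + Arrays.toString(counts));
    }
}
```

For the toy keys this prints [3, 5, 7] [2, 2, 2, 2]: every partition receives the same number of keys. With skewed keys the split points move towards the dense region, which is what evens out the reducer load.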
Figure: the item.RecommenderJob pipeline again (Item Similarity (item x item) X A User's Preferences (item x user) = Recommendations (item x user)), then with "item" replaced by "user": a User Similarity (user x user) matrix over Researchers takes the place of the item similarity matrix.
Mahout's Performance
Figure: Normalised Amazon Hours (vertical axis, 0 to 7K) against No. Good Recommendations/10 (horizontal axis, 0 to 3), divided into four quadrants: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good.
➔ Orig. item-based: 6.5K hours, 1.5 good recommendations
➔ Cust. item-based: 2.4K hours, 1.5 good recommendations
    ➔ -4.1K hours (63%) against orig. item-based
➔ Orig. user-based: 1K hours, 2.5 good recommendations
    ➔ -1.4K hours (58%) and +1 good recommendation (67%) against cust. item-based
➔ Cust. user-based: 0.3K hours, 2.5 good recommendations
    ➔ -0.7K hours (70%) against orig. user-based
➔ Overall, orig. item-based to cust. user-based: -6.2K hours (95%) and +1 good recommendation (67%)
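The percentages on the chart are relative changes between two points. A minimal sketch of that arithmetic, using the values read off the chart (the class and method names are hypothetical):

```java
/** Relative change between two chart points, as shown in the chart annotations. */
final class RelativeChangeSketch {

    /** Change from before to after, as a percentage of before, rounded to a whole number. */
    static long percentChange(double before, double after) {
        return Math.round(100.0 * (after - before) / before);
    }

    public static void main(String[] args) {
        // Normalised Amazon hours, in thousands
        System.out.println(percentChange(6.5, 2.4)); // orig. item-based to cust. item-based
        System.out.println(percentChange(2.4, 1.0)); // cust. item-based to orig. user-based
        System.out.println(percentChange(1.0, 0.3)); // orig. user-based to cust. user-based
        System.out.println(percentChange(6.5, 0.3)); // orig. item-based to cust. user-based
        // Good recommendations out of 10
        System.out.println(percentChange(1.5, 2.5)); // item-based to user-based
    }
}
```

The results match the annotations: -63, -58, -70 and -95 for the hours, and 67 for the good recommendations.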
Conclusions
➔ Mahout is doing a great job of powering Mendeley Suggest
    ➔ Large scale data set
    ➔ Excellent for batch processing requirements
➔ We'll soon be feeding our user-based implementation into Mahout
    ➔ User-based can outperform item-based
    ➔ Makes Mahout's offering more rounded
➔ Save resources and money by understanding your data
    ➔ Help Hadoop with task allocation if necessary
    ➔ Partition your data appropriately
We're Hiring!
➔ Hadoop Data Architect
    ➔ design a coherent data model across the company
    ➔ take ownership of our data
    ➔ hands-on Hadoop administration
➔ Marie Curie Senior Research Fellow
    ➔ ensure that Mendeley's research catalogue is of high quality
    ➔ research and development opportunity
➔ £500 Finder's Fee if you find someone who we hire
    ➔ http://www.mendeley.com/careers/
www.mendeley.com