Building Babel Large Scale Data Collection in the Cloud Ian Wesley-Smith [email protected]

Feb 13, 2017


Building Babel

Large Scale Data Collection in the Cloud
Ian Wesley-Smith [email protected]


Scholarly Article Recommendation

• Information Overload
  – 50m to 150m articles in existence


Google Scholar

• Recommendation vs Search
  – Serendipity
• Homonymity
• Synonymity


Netflix/Spotify/Amazon

• User ratings (explicit, implicit)
• Density
  – # user-item interactions >> # items
• Netflix Competition (2006) [1]
  – 100m ratings
  – 480k users
  – 17k movies

[1] http://www.netflixprize.com/community/viewtopic.php?id=68
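The density point can be made concrete with the Netflix Prize figures quoted on this slide. The sketch below is only back-of-the-envelope arithmetic on those three rounded numbers; the variable names are illustrative and not from the talk.

```python
# Rounded Netflix Prize figures from the slide.
ratings = 100_000_000   # 100m ratings
users = 480_000         # 480k users
movies = 17_000         # 17k movies

# Average number of ratings per item: interactions far outnumber items.
ratings_per_movie = ratings / movies

# Fraction of the user-item matrix that is actually filled in.
density = ratings / (users * movies)

print(f"ratings per movie: {ratings_per_movie:,.0f}")
print(f"matrix density: {density:.2%}")
```

With these inputs each movie has on the order of 5,900 ratings, and a little over 1% of all possible user-movie pairs are rated.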


Barriers to Research

• Hard to get datasets
• Difficult to measure effectiveness
  – Judges
  – Citation prediction


Enter Babel

• Provide access to private data sets
• Provide scholarly article recommendations, freely to anyone
  – Feedback data in return
• Evaluate recommenders using usage data
  – With enough traffic could be very fast


Audience

• Publishers
  – Offload expensive research into recommender systems to academia
  – Better recommendations drive more traffic/purchases
• Tool Developers
• Researchers


Requirements

• Fast
• Reliable
• Scalable (lots of data!)
• Easy to use
• Cheap


REST API

    curl http://babel-us-east-1.eigenfactor.org/recommendation/aminer/12345

    {
      "transaction_id": "46bb84190e9ddfd17700bfafb500ab3c",
      "results": [
        {
          "paper_id": "672",
          "publisher": "aminer"
        },
        {
          "paper_id": "11274",
          "publisher": "aminer"
        }
      ]
    }
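As a rough illustration of how a client might consume this endpoint, the sketch below builds the same URL as the curl call and parses a response of the shape shown on the slide. The function names are hypothetical, the service may no longer be reachable, and the only fields assumed are the ones visible in the sample response (`transaction_id`, `results`, `paper_id`, `publisher`).

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Host taken from the curl example on the slide.
BASE_URL = "http://babel-us-east-1.eigenfactor.org"


def recommendation_url(publisher, paper_id):
    """Build /recommendation/<publisher>/<paper_id>, as in the curl call."""
    return f"{BASE_URL}/recommendation/{quote(publisher, safe='')}/{quote(paper_id, safe='')}"


def parse_recommendations(body):
    """Return (transaction_id, [(publisher, paper_id), ...]) from a response body."""
    payload = json.loads(body)
    results = [(r["publisher"], r["paper_id"]) for r in payload["results"]]
    return payload["transaction_id"], results


def fetch_recommendations(publisher, paper_id, timeout=5.0):
    """Perform the HTTP GET and parse the JSON body (needs network access)."""
    with urlopen(recommendation_url(publisher, paper_id), timeout=timeout) as resp:
        return parse_recommendations(resp.read().decode("utf-8"))


# Offline check against the response shown on the slide.
SAMPLE = """{"transaction_id": "46bb84190e9ddfd17700bfafb500ab3c",
 "results": [{"paper_id": "672", "publisher": "aminer"},
             {"paper_id": "11274", "publisher": "aminer"}]}"""

transaction_id, recommendations = parse_recommendations(SAMPLE)
print(transaction_id, recommendations)
```

Keeping the parsing separate from the network call makes the response handling easy to test without depending on the live service.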


http://babel.eigenfactor.org


Browser Plugins


http://labs.jstor.org/sustainability/


Babel Architecture

[Architecture diagram. Labeled components: Recommenders (EigenFactor Recommends, Co-Citation, Bibliographic Coupling); Metadata Database; update.eigenfactor.org; Object Store; Archive; Metadata Extraction; Recommender Frontend; Recommendation Cache; Normalization; Analytics; Publisher; Researcher; Demo Website; Chrome Plugin; Desktop App.]

Frontend

[Babel architecture diagram, repeated.]


Frontend

AWS Elastic Beanstalk

[Diagram labels: Application, Package, Deploy]


Swagger UI


AWS Elastic Beanstalk

Image: Part 1: Develop, Deploy, and Manage for Scale with Elastic Beanstalk and CloudFormation Series by Evan Brown, AWS


DynamoDB

• AWS NoSQL
  – Key-value store
• Very fast (<10ms)
• Very scalable
  – Specify throughput
• Not too expensive

Recommendation Cache
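One common way to use a key-value store as a recommendation cache is a read-through pattern: look the paper up by key, and only call a recommender on a miss. The sketch below is an assumption about how that could look, not a description of Babel's actual code. The key attribute `paper_key`, the `results` attribute, and the class names are hypothetical. `get_item(Key=...)` and `put_item(Item=...)` mirror the call shape of a boto3 DynamoDB `Table` resource, and an in-memory stand-in is used so the example runs without AWS credentials.

```python
from typing import Callable, Dict, List


class FakeTable:
    """In-memory stand-in mimicking the get_item/put_item shape of a boto3 Table."""

    def __init__(self) -> None:
        self._items: Dict[str, dict] = {}

    def get_item(self, Key: dict) -> dict:
        item = self._items.get(Key["paper_key"])
        # The real service omits "Item" from the response when the key is missing.
        return {"Item": item} if item is not None else {}

    def put_item(self, Item: dict) -> dict:
        self._items[Item["paper_key"]] = Item
        return {}


class RecommendationCache:
    """Read-through cache: look up by key, compute and store on a miss."""

    def __init__(self, table, compute: Callable[[str, str], List[str]]) -> None:
        self.table = table
        self.compute = compute  # hypothetical call into a recommender

    def get(self, publisher: str, paper_id: str) -> List[str]:
        key = f"{publisher}:{paper_id}"  # hypothetical key layout
        hit = self.table.get_item(Key={"paper_key": key}).get("Item")
        if hit is not None:
            return hit["results"]
        results = self.compute(publisher, paper_id)
        self.table.put_item(Item={"paper_key": key, "results": results})
        return results


calls = []


def slow_recommender(publisher: str, paper_id: str) -> List[str]:
    """Placeholder recommender that records how often it is invoked."""
    calls.append((publisher, paper_id))
    return ["672", "11274"]  # placeholder IDs borrowed from the API example


cache = RecommendationCache(FakeTable(), slow_recommender)
first = cache.get("aminer", "12345")   # miss: calls the recommender
second = cache.get("aminer", "12345")  # hit: served from the table
print(first, second, len(calls))
```

Swapping `FakeTable()` for a real table object should require no change to `RecommendationCache`, provided the table's key schema matches the hypothetical `paper_key` attribute.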


Issues

• Not all AWS services are created equal
  – Data Pipeline
  – CloudSearch
• Documentation
• SDK/Tooling
• Python & GIL
• Access Keys


Future Directions

• Finish backend
• Expand clients (publishers, tool developers)
• Actually get more recommenders
• Babel 3.0 – simple middleware
  – Automatically logs & adds transaction info to outgoing requests
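The slides do not say how the planned middleware would attach transaction info, so the following is only a guess at the general idea: a helper that copies a request's headers, adds a transaction ID, and logs it. The header name `X-Transaction-Id` and the function name are hypothetical.

```python
import logging
import uuid
from typing import Dict, Optional

log = logging.getLogger("babel.middleware")

# Hypothetical header name; the slides do not say how transaction info is carried.
TRANSACTION_HEADER = "X-Transaction-Id"


def add_transaction_info(headers: Dict[str, str],
                         transaction_id: Optional[str] = None) -> Dict[str, str]:
    """Return a copy of headers carrying a transaction ID, and log it."""
    tagged = dict(headers)
    # Reuse an ID from an earlier response if given, otherwise mint a new one.
    tagged[TRANSACTION_HEADER] = transaction_id or uuid.uuid4().hex
    log.info("outgoing request transaction_id=%s", tagged[TRANSACTION_HEADER])
    return tagged


original = {"Accept": "application/json"}
tagged = add_transaction_info(original, "46bb84190e9ddfd17700bfafb500ab3c")
fresh = add_transaction_info(original)
print(tagged)
```

Returning a copy rather than mutating the caller's dictionary keeps the helper safe to apply to shared default headers.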


http://[email protected]