Rapid Data Exploration With Hadoop
Peter Skomoroch, Senior Data Scientist, @peteskomoroch

May 11, 2015

Peter Skomoroch

LinkedIn is the premier professional social network, with over 60 million users and a new user joining every second. One of LinkedIn's strategic advantages is its unique data. While most organizations treat data as a service function, LinkedIn considers data a cornerstone of its product portfolio.

To develop these products rapidly, LinkedIn leverages a number of technologies, including open source, third-party solutions, and some we've had to invent along the way.

This LinkedIn talk at the NYC Hadoop Meetup held 3/18 at ContextWeb focused on best practices for quickly uncovering patterns, visualizing trends, and generating actionable insights from large datasets.
Transcript
Page 1: Rapid Data Exploration With Hadoop

Rapid Data Exploration With Hadoop

Peter Skomoroch
Senior Data Scientist

@peteskomoroch

Page 2: Rapid Data Exploration With Hadoop

Outline
• Overview: LinkedIn Biz, Tech, & Analytics
• Rapid Data Exploration 101
  - Spatial Analytics Pig Code
  - Trend detection with Pig & Python
  - R Streaming Example
• Deep Dive: Our Data Analysis Approach
• Building Data Products
• LinkedIn Data Insights

Page 3: Rapid Data Exploration With Hadoop

Connect the world’s professionals to make them more productive and successful

Page 4: Rapid Data Exploration With Hadoop

Professional Identity

Page 5: Rapid Data Exploration With Hadoop

LinkedIn at a glance
• Founded in 2003
• #17 site in the US (Alexa)
• 60+ million members
• First million members = 477 days
• Latest million = 9 days
• 500K+ company profiles
• 12+ million small business professionals
• In 2009 - 1 billion people searches
• Average age: 41
• Household income $107,000
• 42% are “decision makers”

Page 6: Rapid Data Exploration With Hadoop

How International?
• More than 50% international (members in over 200 countries & territories)
• 13+ million in Europe
• 4+ million in India
• 3+ million in UK
• #13 site in UK (Alexa)

Page 7: Rapid Data Exploration With Hadoop

How do we keep the lights on?
• Profitable since 2007
• Valued at over $1B at the last funding round
• Subscriptions
• Ads
• Job Postings
• Enterprise Client

Page 8: Rapid Data Exploration With Hadoop

Hadoop on LinkedIn
1,400+ members list “Hadoop” on their profile

What other skills do they have?
• HBase, Lucene, Solr, MapReduce, Nutch...

Where are they?
• 36% in Bay Area
• 8% in India
• 6% in NYC
• 4% in Seattle
• 4% in Los Angeles

Who do they work for?
• 11% Yahoo!
• 2% Apache Software Foundation
• 1% LinkedIn
• 1% Google
• 1% Facebook

Page 9: Rapid Data Exploration With Hadoop

Hadoop at LinkedIn

Page 10: Rapid Data Exploration With Hadoop

Voldemort Data Storage
Compact, compressed, binary data (something like Avro)
Type can be any combination of int, double, float, String, Map, List, etc.
=> Sequence Files
Example member definition:
{ 'member_id': 'int32', 'first_name': 'string', 'last_name': 'string', 'age': 'int32' … }
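As a rough illustration of how a typed definition like the one above can be used, the following Python sketch checks a record against the four fields shown on the slide. It is not Voldemort's serializer: `TYPE_MAP`, `MEMBER_SCHEMA`, and `conforms` are hypothetical names introduced here, and the fields elided by the slide's "…" are left out rather than guessed.

```python
# Hypothetical mapping from the slide's type names to Python types.
# This is an assumption for illustration, not part of Voldemort.
TYPE_MAP = {"int32": int, "string": str, "double": float, "float": float}

# Only the fields visible on the slide; the original definition continues.
MEMBER_SCHEMA = {
    "member_id": "int32",
    "first_name": "string",
    "last_name": "string",
    "age": "int32",
}


def conforms(record, schema):
    """Return True if every schema field is present with the expected type."""
    for field, type_name in schema.items():
        if field not in record:
            return False
        if not isinstance(record[field], TYPE_MAP[type_name]):
            return False
    return True


member = {"member_id": 1, "first_name": "Ada", "last_name": "Lovelace", "age": 36}
print(conforms(member, MEMBER_SCHEMA))
```

Checking records against the definition before writing them is one way to catch malformed rows early; a real binary format would also enforce numeric ranges such as the 32-bit limit.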

Page 11: Rapid Data Exploration With Hadoop

Getting Data In
• From databases (user data, news, jobs etc.)
  - Need a way to get data reliably and periodically
  - Need tests to verify data
  - Support for incremental replication
  - Solution: Transmogrify Driver Program
    - InputReader: JDBCReader, CSV Reader
    - Output Writer: JDBCWriter, HDFS writers
• From web logs (page views, search, clicks etc.)
  - Weblog files are rsynced and loaded up in HDFS
  - Hadoop jobs for data cleaning and transformation
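The reader/writer split described on this slide can be sketched as a small driver loop. The slide does not show Transmogrify's actual interfaces, so everything below is an assumption: `CSVReader`, `ListWriter`, and `run_driver` are hypothetical names, the writer is an in-memory stand-in for a JDBC or HDFS writer, and the verification step is a plain callback.

```python
import csv
import io


class CSVReader:
    """Hypothetical reader: yields rows of CSV text as dictionaries."""

    def __init__(self, text):
        self.text = text

    def read(self):
        return csv.DictReader(io.StringIO(self.text))


class ListWriter:
    """Hypothetical stand-in for a JDBC or HDFS writer: keeps rows in memory."""

    def __init__(self):
        self.rows = []

    def write(self, row):
        self.rows.append(row)


def run_driver(reader, writer, verify):
    """Copy rows from reader to writer, skipping rows that fail verification.

    Returns a (written, rejected) pair so a test can check the load.
    """
    written = rejected = 0
    for row in reader.read():
        if verify(row):
            writer.write(row)
            written += 1
        else:
            rejected += 1
    return written, rejected


reader = CSVReader("member_id,age\n1,30\n2,\n")
writer = ListWriter()
print(run_driver(reader, writer, lambda row: row["age"] != ""))
```

Keeping readers and writers behind small interfaces means the same driver can move data between any source and sink, and the returned counts give the "tests to verify data" something concrete to assert on.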

Page 12: Rapid Data Exploration With Hadoop

Getting Data Out

Page 13: Rapid Data Exploration With Hadoop

Giving Back: Open Source
http://sna-projects.com/sna/

Page 14: Rapid Data Exploration With Hadoop

Analytics Technologies

Page 15: Rapid Data Exploration With Hadoop

We Build Things With Data

Give smart people great tools, enable them to solve problems

Page 16: Rapid Data Exploration With Hadoop

Prototyping Culture

Page 17: Rapid Data Exploration With Hadoop

How does Hadoop enable rapid data exploration?

Page 18: Rapid Data Exploration With Hadoop

Pig for Spatial Analytics

Page 19: Rapid Data Exploration With Hadoop

US County HeatMap

Page 20: Rapid Data Exploration With Hadoop

Pig for Trend Detection

Page 21: Rapid Data Exploration With Hadoop

Python Streaming Script
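The script on this slide was an image and is not in the transcript. As a stand-in, here is a minimal sketch of what a Python script called from Pig's `STREAM ... THROUGH` operator could look like for trend detection. It is not the script from the talk. It assumes tab-separated input lines of `term`, `period`, `count`, already grouped by term, and it uses a simple ratio of the latest count to the mean of earlier counts as the trend score; `parse`, `score_trends`, and `main` are hypothetical names.

```python
import sys
from itertools import groupby


def parse(line):
    """Split one tab-separated line into (term, period, count)."""
    term, period, count = line.rstrip("\n").split("\t")
    return term, period, int(count)


def score_trends(lines):
    """Yield 'term<TAB>score' strings.

    score = latest count / mean of all earlier counts for that term.
    Assumes lines for the same term are adjacent in the input.
    """
    records = (parse(line) for line in lines if line.strip())
    for term, group in groupby(records, key=lambda r: r[0]):
        # Order each term's counts by period so the last one is the latest.
        counts = [c for _, _, c in sorted(group, key=lambda r: r[1])]
        if len(counts) < 2:
            continue  # no history to compare against
        baseline = sum(counts[:-1]) / len(counts[:-1])
        if baseline > 0:
            yield "%s\t%.2f" % (term, counts[-1] / baseline)


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Entry point for streaming use: read stdin, write scored lines."""
    for line in score_trends(stdin):
        stdout.write(line + "\n")
```

To use this as a streaming script, call `main()` at the bottom of the file. Keeping the scoring in a plain function makes it easy to test locally on a small sample before running it on the cluster.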

Page 22: Rapid Data Exploration With Hadoop

Sort Output & Display

Page 23: Rapid Data Exploration With Hadoop

R Streaming Also Easy

*from http://www.stat.uiowa.edu/~luke/classes/295-hpc/

Page 24: Rapid Data Exploration With Hadoop

Let’s Talk Data

Page 25: Rapid Data Exploration With Hadoop

Business is recognizing the importance of analytics

Page 26: Rapid Data Exploration With Hadoop

What data do we start with?

Page 27: Rapid Data Exploration With Hadoop

We can also leverage...
• Connection Graph
• Recommendations
• Address Book Uploads
• Search Logs
• Profile Views & Activity
• Job Postings
• LinkedIn Groups
• LinkedIn Questions
• Company Pages
• Talent Match
• Web Referrals
• 1M+ Twitter Accounts
• Wikipedia Data
• Mechanical Turk
• Census, BLS, & Data.gov
• Much more...

Page 28: Rapid Data Exploration With Hadoop

How do we think of Analytics?

Data Jujitsu

Page 29: Rapid Data Exploration With Hadoop

Lots of Medium can be more powerful than Big

>

Page 30: Rapid Data Exploration With Hadoop

Reconstruct Reality from Data Exhaust

Page 31: Rapid Data Exploration With Hadoop

Data Scientist Lessons
• Follow the data, avoid assumptions
• Sanity check the extremes (0, infinity)
• Don’t get mired in rare edge cases
• Data Jujitsu: solve easier auxiliary problems
• Build smaller consistent samples to test code
• Establish a baseline model quickly, iterate often
• Use the right tool for the job at hand
• Iterate quickly with high level languages
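One way to read "build smaller consistent samples" is to sample by a hash of the member id, so the same members land in the sample across every dataset and every run. The sketch below shows that idea. The talk does not say how samples were built, so the function name `in_sample`, the use of MD5, and the percent-based bucket are all assumptions.

```python
import hashlib


def in_sample(member_id, percent=1):
    """Deterministically decide whether a member belongs to the sample.

    The same member_id always gives the same answer, so joins between
    sampled datasets still line up.
    """
    digest = hashlib.md5(str(member_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent


# Keep roughly 10% of members, the same ones on every run.
sample = [m for m in range(1000) if in_sample(m, percent=10)]
print(len(sample))
```

Because the bucket depends only on the id, a 1% sample is always a subset of the 10% sample, which makes it easy to scale a test up without changing which members it covers.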

Page 32: Rapid Data Exploration With Hadoop

Where did the bankers go?