Machine Learning with Hadoop
Page 1: Machine Learning with Hadoop Boston hug 2012

Machine Learning with Hadoop

Page 2: Machine Learning with Hadoop Boston hug 2012

Agenda

• Why Big Data? Why now?

• What can you do with big data?

• How does it work?

Page 3: Machine Learning with Hadoop Boston hug 2012

Slow Motion Explosion

Page 4: Machine Learning with Hadoop Boston hug 2012

Why Now?

• But Moore’s law has applied for a long time

• Why is Hadoop/Big Data exploding now?

• Why not 10 years ago?

• Why not 20?

8/9/2013

Page 5: Machine Learning with Hadoop Boston hug 2012

Size Matters, but …

• If it were just availability of data then existing big companies would adopt big data technology first

Page 6: Machine Learning with Hadoop Boston hug 2012

Size Matters, but …

• If it were just availability of data then existing big companies would adopt big data technology first

They didn’t

Page 7: Machine Learning with Hadoop Boston hug 2012

Or Maybe Cost

• If it were just a net positive value then finance companies should adopt first because they have higher opportunity value / byte

Page 8: Machine Learning with Hadoop Boston hug 2012

Or Maybe Cost

• If it were just a net positive value then finance companies should adopt first because they have higher opportunity value / byte

They didn’t

Page 9: Machine Learning with Hadoop Boston hug 2012

Backwards adoption

• Under almost any threshold argument startups would not adopt big data technology first

Page 10: Machine Learning with Hadoop Boston hug 2012

Backwards adoption

• Under almost any threshold argument startups would not adopt big data technology first

They did

Page 11: Machine Learning with Hadoop Boston hug 2012

Everywhere at Once?

• Something very strange is happening

– Big data is being applied at many different scales

– At many value scales

– By large companies and small

Page 12: Machine Learning with Hadoop Boston hug 2012

Everywhere at Once?

• Something very strange is happening

– Big data is being applied at many different scales

– At many value scales

– By large companies and small

Why?

Page 13: Machine Learning with Hadoop Boston hug 2012

Analytics Scaling Laws

• Analytics scaling is all about the 80-20 rule

– Big gains for little initial effort

– Rapidly diminishing returns

• The key to net value is how costs scale

– Old school – exponential scaling

– Big data – linear scaling, low constant

• Cost/performance has changed radically

– IF you can use many commodity boxes
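
The cost argument above can be illustrated with a small sketch. The slides give no formulas, so the curve shapes and every constant below are assumptions chosen only to show the idea: value saturates with scale, "old school" cost grows exponentially, and "big data" cost grows linearly with a small slope.

```python
import math

# Illustrative assumptions only: the slides give no formulas or constants.
def value(scale):
    """Diminishing returns: big gains early, then saturation toward 1."""
    return 1.0 - math.exp(-scale / 300.0)

def cost_exponential(scale):
    """'Old school' cost: grows exponentially with scale."""
    return 0.01 * (math.exp(scale / 300.0) - 1.0)

def cost_linear(scale):
    """'Big data' cost: linear in scale with a low constant."""
    return 0.0002 * scale

def best_scale(cost, max_scale=2000):
    """Grid-search the scale with the highest net value (value minus cost)."""
    return max(range(max_scale + 1), key=lambda s: value(s) - cost(s))

old_optimum = best_scale(cost_exponential)
new_optimum = best_scale(cost_linear)
print("optimum scale, exponential cost:", old_optimum)
print("optimum scale, linear cost:", new_optimum)
print("net value at scale 2000, exponential cost:", value(2000) - cost_exponential(2000))
print("net value at scale 2000, linear cost:", value(2000) - cost_linear(2000))
```

Under these assumed curves, the exponential-cost optimum sits well before the maximum scale and net value turns negative far beyond it. With linear cost, small scales are actually more expensive, but net value stays positive all the way to the largest scale.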

Page 14: Machine Learning with Hadoop Boston hug 2012

We knew that

We should have known that

We didn’t know that!

You’re kidding, people do that?

Page 15: Machine Learning with Hadoop Boston hug 2012

[Figure: value (0 to 1) versus scale (0 to 2,000). Labels on the curve: Anybody with eyes; Intern with a spreadsheet; In-house analytics; Industry-wide data consortium; NSA, non-proliferation]

Page 16: Machine Learning with Hadoop Boston hug 2012

[Figure: two plots of value (0 to 1) versus scale (0 to 2,000)]

Net value optimum has a sharp peak well before maximum effort

Page 17: Machine Learning with Hadoop Boston hug 2012

But scaling laws are changing both slope and shape

Page 18: Machine Learning with Hadoop Boston hug 2012

[Figure: value (0 to 1) versus scale (0 to 2,000)]

More than just a little

Page 19: Machine Learning with Hadoop Boston hug 2012

[Figure: value (0 to 1) versus scale (0 to 2,000)]

They are changing a LOT!

Page 20: Machine Learning with Hadoop Boston hug 2012
Page 21: Machine Learning with Hadoop Boston hug 2012
Page 22: Machine Learning with Hadoop Boston hug 2012

[Figure: value (0 to 1) versus scale (0 to 2,000)]

Page 23: Machine Learning with Hadoop Boston hug 2012

[Figure: value (0 to 1) versus scale (0 to 2,000)]

Page 24: Machine Learning with Hadoop Boston hug 2012

[Figure: value (0 to 1) versus scale (0 to 2,000)]

Initially, linear cost scaling actually makes things worse

A tipping point is reached and things change radically …

Page 25: Machine Learning with Hadoop Boston hug 2012

Pre-requisites for Tipping

To reach the tipping point:

• Algorithms must scale out horizontally

– On commodity hardware

– That can and will fail

• Data practice must change

– Denormalized is the new black

– Flexible data dictionaries are the rule

– Structured data becomes rare

Page 26: Machine Learning with Hadoop Boston hug 2012

So that is why and why now

Page 27: Machine Learning with Hadoop Boston hug 2012

So that is why, and why now

What can you do with it?

And how?

Page 28: Machine Learning with Hadoop Boston hug 2012

Agenda

• Mahout outline

– Recommendations

– Clustering

– Classification

• Hybrid Parallel/Sequential Systems

• Real-time learning

Page 29: Machine Learning with Hadoop Boston hug 2012

Agenda

• Mahout outline

– Recommendations

– Clustering

– Classification

• Supervised on-line learning

• Feature hashing

• Hybrid Parallel/Sequential Systems

• Real-time learning

Page 30: Machine Learning with Hadoop Boston hug 2012

Classification in Detail

• Naive Bayes Family

– Hadoop based training

• Decision Forests

– Hadoop based training

• Logistic Regression (aka SGD)

– fast on-line (sequential) training

Page 31: Machine Learning with Hadoop Boston hug 2012

Page 32: Machine Learning with Hadoop Boston hug 2012

Classification in Detail

• Naive Bayes Family

– Hadoop based training

• Decision Forests

– Hadoop based training

• Logistic Regression (aka SGD)

– fast on-line (sequential) training

– Now with MORE topping!

Page 33: Machine Learning with Hadoop Boston hug 2012

How it Works

• We are given “features”

– Often binary values in a vector

• Algorithm learns weights

– The key quantity is the weighted sum: the sum of feature * weight over all features

• Each weight is a single real value
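
A minimal sketch of the idea, assuming the logistic regression model named on the earlier classification slide: the score is the weighted sum of features passed through the logistic function, and the weights are learned one example at a time. The update rule is the standard SGD step for logistic regression, not necessarily what Mahout implements, and the toy data and learning rate are hypothetical.

```python
import math

def score(weights, features):
    """Weighted sum of feature * weight, squashed to a probability."""
    z = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def sgd_step(weights, features, target, learning_rate=0.1):
    """One on-line update: nudge each weight along the prediction error."""
    error = target - score(weights, features)
    return [w + learning_rate * error * x for w, x in zip(weights, features)]

# Toy training data (hypothetical): binary feature vectors with 0/1 targets.
examples = [
    ([1, 1, 0, 1], 1),
    ([1, 0, 1, 0], 0),
    ([1, 1, 0, 0], 1),
    ([1, 0, 1, 1], 0),
]

weights = [0.0] * 4          # each weight is a single real value
for _ in range(200):         # several sequential passes over the examples
    for features, target in examples:
        weights = sgd_step(weights, features, target)

print([round(score(weights, f), 3) for f, _ in examples])
```

Because each update touches one example at a time, training is sequential and needs only the current weights in memory.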

Page 34: Machine Learning with Hadoop Boston hug 2012

An Example

Page 35: Machine Learning with Hadoop Boston hug 2012

Features

From: Dr. Paul Acquah

Dear Sir,

Re: Proposal for over-invoice Contract Benevolence

Based on information gathered from the India hospital directory, I am pleased to propose a confidential business deal for our mutual benefit. I have in my possession, instruments (documentation) to transfer the sum of 33,100,000.00 eur thirty-three million one hundred thousand euros, only) into a foreign company's bank account for our favor.

...

Date: Thu, May 20, 2010 at 10:51 AM

From: George <[email protected]>

Hi Ted, was a pleasure talking to you last night at the Hadoop User Group. I liked the idea of going for lunch together. Are you available tomorrow (Friday) at noon?

Page 36: Machine Learning with Hadoop Boston hug 2012

But …

• Text and words aren’t suitable features

• We need a numerical vector

• So we use binary vectors with lots of slots

Page 37: Machine Learning with Hadoop Boston hug 2012

Feature Encoding

Page 38: Machine Learning with Hadoop Boston hug 2012

Hashed Encoding
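
As a rough sketch of hashed encoding (not Mahout's encoder): each token is hashed to a slot in a fixed-size binary vector, so no dictionary of words is needed. The hash function, the slot count, and the function name `hashed_encode` are assumptions made for this illustration.

```python
import hashlib

def hashed_encode(tokens, num_slots=20):
    """Encode tokens as a binary vector by hashing each token to a slot."""
    vector = [0] * num_slots
    for token in tokens:
        digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
        slot = int(digest, 16) % num_slots   # stable across runs, unlike hash()
        vector[slot] = 1
    return vector

tokens = "hi ted was a pleasure talking to you last night".split()
vector = hashed_encode(tokens)
print(vector)
# Different tokens can land in the same slot (a feature collision), so the
# number of set slots can be smaller than the number of distinct tokens.
print(sum(vector), "slots set for", len(set(tokens)), "distinct tokens")
```

Using more slots makes collisions rarer at the cost of a longer vector.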

Page 39: Machine Learning with Hadoop Boston hug 2012

Feature Collisions

Page 40: Machine Learning with Hadoop Boston hug 2012

Training Data

Page 41: Machine Learning with Hadoop Boston hug 2012

Training Data

Page 42: Machine Learning with Hadoop Boston hug 2012

Training Data

[Diagram: Raw data -> Joining, combining, transforming -> Training examples with target values -> Parsing -> Tokens -> Encoding -> Vectors -> Training algorithm]

Page 43: Machine Learning with Hadoop Boston hug 2012

Full Scale Training

[Diagram labels: Input; Side-data; Data join; Feature extraction and down sampling; Map-reduce; Sequential SGD Learning; Now via NFS]

Page 44: Machine Learning with Hadoop Boston hug 2012

Hybrid Model Development

[Diagram labels: Logs; User sessions; Training data; Group by user; Count transaction patterns; Account info; Training data; Big-data cluster; Legacy modeling; Shared filesystem; Merge; PROC LOGISTIC; Model]

Page 45: Machine Learning with Hadoop Boston hug 2012

Enter the Pig Vector

• Pig UDFs for

– Vector encoding

– Model training

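-- Define the encoder UDF (arguments: vector size, model formula, input schema)
define EncodeVector org.apache.mahout.pig.encoders.EncodeVector(
    '10', 'x+y+1', 'x:numeric, y:numeric, z:numeric');

-- Encode each document as a feature vector
vectors = foreach docs generate newsgroup, EncodeVector(*) as v;
grouped = group vectors all;

-- Train one model over all vectors (train is a training UDF assumed to be defined elsewhere)
model = foreach grouped generate 1 as key, train(vectors) as model;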

Page 46: Machine Learning with Hadoop Boston hug 2012

Real-time Developments

• Storm + Hadoop + MapR

– Real-time with Storm

– Long-term with Hadoop

– State checkpoints with MapR

• Add the Bayesian Bandit for on-line learning

Page 47: Machine Learning with Hadoop Boston hug 2012

Aggregate Splicing

t

Hadoop handles the past

Storm handles the present
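
The splicing idea can be sketched in a few lines: totals computed in batch cover everything up to a cutoff time, and events after the cutoff are added from the real-time stream. This uses plain Python stand-ins rather than any Hadoop or Storm API; the function name `splice`, the cutoff handling, and the data are all hypothetical.

```python
from collections import Counter

def splice(batch_counts, realtime_events, batch_end):
    """Combine batch totals up to batch_end with live events after it."""
    total = Counter(batch_counts)
    for timestamp, key in realtime_events:
        if timestamp >= batch_end:   # earlier events are already in the batch totals
            total[key] += 1
    return total

# Hypothetical data: the batch job has aggregated everything before t=100.
batch_counts = {"clicks": 1000, "views": 5000}
realtime_events = [(98, "clicks"), (101, "clicks"), (102, "views"), (105, "clicks")]

spliced = splice(batch_counts, realtime_events, batch_end=100)
print(spliced)
```

The cutoff check is what keeps an event from being counted twice when the batch and real-time windows overlap.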

Page 48: Machine Learning with Hadoop Boston hug 2012

Mobile Network Monitor

Transaction data

Batch aggregation

HBase

Real-time dashboard and alerts

Geo-dispersed ingest servers

Retro-analysis interface

Page 49: Machine Learning with Hadoop Boston hug 2012

A Quick Diversion

• You see a coin

– What is the probability of heads?

– Could it be larger or smaller than that?

• I flip the coin and while it is in the air ask again

• I catch the coin and ask again

• I look at the coin (and you don’t) and ask again

• Why does the answer change?

– And did it ever have a single value?

Page 50: Machine Learning with Hadoop Boston hug 2012

A First Conclusion

• Probability as expressed by humans is subjective and depends on information and experience

Page 51: Machine Learning with Hadoop Boston hug 2012

A Second Conclusion

• A single number is a bad way to express uncertain knowledge

• A distribution of values might be better
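
One common way to express such a distribution for a coin is a Beta distribution over the probability of heads. The sketch below assumes a uniform prior and reads the counts as (heads, tails). The example counts echo the slide titles "5 and 5" and "2 and 10", but that reading of the titles is an assumption.

```python
import math

def beta_summary(heads, tails):
    """Mean and standard deviation of Beta(heads + 1, tails + 1).

    This is the posterior for the heads probability under a uniform prior
    (an assumption made for this sketch).
    """
    a, b = heads + 1, tails + 1
    mean = a / (a + b)
    variance = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(variance)

for heads, tails in [(0, 0), (5, 5), (2, 10)]:
    mean, std = beta_summary(heads, tails)
    print(f"{heads} heads, {tails} tails: mean={mean:.3f}, std={std:.3f}")
```

With no data the distribution is flat; with more observations the mean may stay the same while the spread shrinks, which a single number cannot show.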

Page 52: Machine Learning with Hadoop Boston hug 2012

I Dunno

Page 53: Machine Learning with Hadoop Boston hug 2012

5 and 5

Page 54: Machine Learning with Hadoop Boston hug 2012

2 and 10

Page 55: Machine Learning with Hadoop Boston hug 2012

Bayesian Bandit

• Compute distributions based on data

• Sample p1 and p2 from these distributions

• Put a coin in bandit 1 if p1 > p2

• Else, put the coin in bandit 2
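
The steps above can be sketched as follows, assuming each bandit pays off with a fixed unknown probability and that the distributions are Beta posteriors with a uniform prior (this scheme is commonly called Thompson sampling). The payoff rates, round count, and seed are hypothetical.

```python
import random

def bayesian_bandit(true_rates, rounds=2000, seed=42):
    """Thompson sampling over Bernoulli bandits with Beta posteriors."""
    rng = random.Random(seed)
    wins = [0] * len(true_rates)
    losses = [0] * len(true_rates)
    pulls = [0] * len(true_rates)
    for _ in range(rounds):
        # Sample one plausible payoff rate per bandit from its posterior.
        samples = [rng.betavariate(w + 1, l + 1) for w, l in zip(wins, losses)]
        # Put the coin in the bandit whose sample is largest.
        choice = samples.index(max(samples))
        pulls[choice] += 1
        if rng.random() < true_rates[choice]:
            wins[choice] += 1
        else:
            losses[choice] += 1
    return pulls

pulls = bayesian_bandit([0.1, 0.3])   # hypothetical payoff rates
print(pulls)
```

Early on both bandits get tried because their distributions are wide; as evidence accumulates, the samples from the better bandit win more often and it receives most of the coins.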

Page 56: Machine Learning with Hadoop Boston hug 2012
Page 57: Machine Learning with Hadoop Boston hug 2012
Page 58: Machine Learning with Hadoop Boston hug 2012

The Basic Idea

• We can encode a distribution by sampling

• Sampling allows unification of exploration and exploitation

• Can be extended to more general response models

Page 59: Machine Learning with Hadoop Boston hug 2012

Deployment with Storm/MapR

[Diagram labels: Impression Logs; Click Logs; Targeting Engine; Conversion Detector; Model Selector; Online Model (x3); Training (x3); Conversion Dashboard; RPC (x5)]

All state managed transactionally in MapR file system

Page 60: Machine Learning with Hadoop Boston hug 2012

Service Architecture

[Diagram labels: MapR Lockless Storage Services; MapR Pluggable Service Management; Storm; Hadoop; Impression Logs; Click Logs; Targeting Engine; Conversion Detector; Model Selector; Online Model (x3); Training (x3); Conversion Dashboard; RPC (x5)]

Page 61: Machine Learning with Hadoop Boston hug 2012

Find Out More

• Me: [email protected]

[email protected]

[email protected]

• MapR: http://www.mapr.com

• Mahout: http://mahout.apache.org

• Code: https://github.com/tdunning