Machine Learning with Hadoop

Boston HUG (Hadoop User Group)

Nov 19, 2014



Ted Dunning

A talk about why and how machine learning works with Hadoop, with recent developments for real-time operation.
Transcript
Page 1: Boston hug

Machine Learning with Hadoop

Page 2: Boston hug


Agenda

• Why Big Data? Why now?

• What can you do with big data?

• How does it work?

Page 3: Boston hug


Slow Motion Explosion

Page 4: Boston hug


Why Now?

• But Moore’s law has applied for a long time

• Why is Hadoop/Big Data exploding now?

• Why not 10 years ago?

• Why not 20?


Page 5: Boston hug


Size Matters, but …

• If it were just availability of data, existing big companies would have adopted big data technology first

Page 6: Boston hug


Size Matters, but …

• If it were just availability of data, existing big companies would have adopted big data technology first

They didn’t

Page 7: Boston hug


Or Maybe Cost

• If it were just a net positive value, finance companies should have adopted first, because they have a higher opportunity value per byte

Page 8: Boston hug


Or Maybe Cost

• If it were just a net positive value, finance companies should have adopted first, because they have a higher opportunity value per byte

They didn’t

Page 9: Boston hug


Backwards adoption

• Under almost any threshold argument, startups would not have adopted big data technology first

Page 10: Boston hug


Backwards adoption

• Under almost any threshold argument, startups would not have adopted big data technology first

They did

Page 11: Boston hug


Everywhere at Once?

• Something very strange is happening
– Big data is being applied at many different scales
– At many value scales
– By large companies and small

Page 12: Boston hug


Everywhere at Once?

• Something very strange is happening
– Big data is being applied at many different scales
– At many value scales
– By large companies and small

Why?

Page 13: Boston hug

Analytics Scaling Laws

• Analytics scaling is all about the 80-20 rule
– Big gains for little initial effort
– Rapidly diminishing returns

• The key to net value is how costs scale
– Old school: exponential scaling
– Big data: linear scaling, low constant

• Cost/performance has changed radically
– IF you can use many commodity boxes

Page 14: Boston hug

(Chart annotations: “We knew that”, “We should have known that”, “We didn’t know that!”, “You’re kidding, people do that?”)

Page 15: Boston hug

(Chart annotations: anybody with eyes, intern with a spreadsheet, in-house analytics, industry-wide data consortium, NSA / non-proliferation)

Page 16: Boston hug

Net value optimum has a sharp peak well before maximum effort

Page 17: Boston hug

But scaling laws are changing both slope and shape

Page 18: Boston hug

More than just a little

Page 19: Boston hug

They are changing a LOT!

Page 20: Boston hug
Page 21: Boston hug
Page 22: Boston hug
Page 23: Boston hug
Page 24: Boston hug

Initially, linear cost scaling actually makes things worse

A tipping point is reached and things change radically …
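To make that concrete, here is a toy illustration in Java (the cost numbers are invented for this sketch, not from the talk): an old-school approach whose cost grows exponentially with scale against a commodity approach with a higher fixed cost but linear scaling. The commodity approach loses at small scale and then wins by a widening margin past the tipping point.

// Toy cost curves (illustrative numbers only) showing a scaling tipping point.
public class TippingPoint {
    public static void main(String[] args) {
        for (int scale = 1; scale <= 10; scale++) {
            double oldSchool = 10 * Math.pow(1.8, scale);   // exponential scaling
            double bigData = 150 + 20 * scale;              // linear scaling, low slope, higher constant
            System.out.printf("scale %2d: old school %8.0f   big data %6.0f%n",
                    scale, oldSchool, bigData);
        }
    }
}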

Page 25: Boston hug

Prerequisites for Tipping

• To reach the tipping point, algorithms must scale out horizontally
– On commodity hardware
– That can and will fail

• Data practice must change
– Denormalized is the new black
– Flexible data dictionaries are the rule
– Structured data becomes rare

Page 26: Boston hug


So that is why and why now

Page 27: Boston hug


So that is why, and why now

What can you do with it? And how?

Page 28: Boston hug

Agenda

• Mahout outline
– Recommendations
– Clustering
– Classification

• Hybrid Parallel/Sequential Systems
• Real-time learning

Page 29: Boston hug

Agenda

• Mahout outline
– Recommendations
– Clustering
– Classification
  • Supervised on-line learning
  • Feature hashing

• Hybrid Parallel/Sequential Systems
• Real-time learning

Page 30: Boston hug

Classification in Detail

• Naive Bayes Family
– Hadoop based training

• Decision Forests
– Hadoop based training

• Logistic Regression (aka SGD)
– fast on-line (sequential) training

Page 31: Boston hug

Classification in Detail

• Naive Bayes Family
– Hadoop based training

• Decision Forests
– Hadoop based training

• Logistic Regression (aka SGD)
– fast on-line (sequential) training

Page 32: Boston hug

Classification in Detail

• Naive Bayes Family
– Hadoop based training

• Decision Forests
– Hadoop based training

• Logistic Regression (aka SGD)
– fast on-line (sequential) training
– Now with MORE topping!
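As a rough sketch of what the SGD option looks like in code (this assumes Mahout's 0.x OnlineLogisticRegression API; the toy data and settings below are made up for illustration):

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

// Sketch: fast on-line (sequential) logistic regression training with SGD.
public class SgdSketch {
    public static void main(String[] args) {
        // 2 categories, 2 features, L1 prior on the weights
        OnlineLogisticRegression learner = new OnlineLogisticRegression(2, 2, new L1());

        // Toy training data: label 1 when the first feature dominates
        double[][] x = {{5, 1}, {4, 0}, {0, 3}, {1, 4}};
        int[] y = {1, 1, 0, 0};
        for (int pass = 0; pass < 100; pass++) {
            for (int i = 0; i < x.length; i++) {
                Vector v = new DenseVector(x[i]);
                learner.train(y[i], v);   // one SGD update per example
            }
        }
        // Probability that a new example belongs to category 1
        System.out.println(learner.classifyScalar(new DenseVector(new double[] {5, 0})));
    }
}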

Page 33: Boston hug

How it Works

• We are given “features”
– Often binary values in a vector

• The algorithm learns weights
– The weighted sum of feature × weight is the key

• Each weight is a single real value
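A minimal sketch of that scoring step in plain Java (names and numbers are illustrative only, not Mahout's API): the model is just one real-valued weight per feature slot, and the prediction is the logistic function of the weighted sum.

// Illustrative only: score one example with a learned weight vector.
public class LinearScorer {
    private final double[] weights;   // one real value per feature slot

    public LinearScorer(double[] weights) {
        this.weights = weights;
    }

    // features[i] is 1.0 if feature i is present, 0.0 otherwise
    public double probability(double[] features) {
        double sum = 0;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i] * features[i];      // weighted sum of feature * weight
        }
        return 1.0 / (1.0 + Math.exp(-sum));      // squash the sum to a probability
    }

    public static void main(String[] args) {
        LinearScorer model = new LinearScorer(new double[] {1.2, -0.7, 0.3});
        System.out.println(model.probability(new double[] {1, 0, 1}));
    }
}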

Page 34: Boston hug

An Example

Page 35: Boston hug

Features

From: Dr. Paul Acquah
Dear Sir,
Re: Proposal for over-invoice Contract Benevolence

Based on information gathered from the India hospital directory, I am pleased to propose a confidential business deal for our mutual benefit.  I have in my possession, instruments (documentation) to transfer the sum of 33,100,000.00 eur thirty-three million one hundred thousand euros, only) into a foreign company's bank account for our favor....

Date: Thu, May 20, 2010 at 10:51 AM
From: George <[email protected]>

Hi Ted, was a pleasure talking to you last night at the Hadoop User Group. I liked the idea of going for lunch together. Are you available tomorrow (Friday) at noon?

Page 36: Boston hug

But …

• Text and words aren’t suitable features
• We need a numerical vector
• So we use binary vectors with lots of slots

Page 37: Boston hug

Feature Encoding

Page 38: Boston hug

Hashed Encoding
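As a sketch of the hashed-encoding idea (the hashing trick in plain Java, not Mahout's actual encoder classes): each word is hashed into one of a fixed number of slots, so the vector size is fixed up front, and different words can collide in the same slot, which is exactly the feature-collision point on the next slide.

import java.util.Arrays;

// Illustrative hashing-trick encoder: map words into a fixed-size binary vector.
public class HashedEncoder {
    private final int slots;   // "lots of slots", e.g. 10^4 to 10^6 in practice

    public HashedEncoder(int slots) {
        this.slots = slots;
    }

    public double[] encode(String[] words) {
        double[] v = new double[slots];
        for (String w : words) {
            int slot = Math.floorMod(w.hashCode(), slots);   // hash the word to a slot index
            v[slot] = 1.0;                                   // colliding words share a slot
        }
        return v;
    }

    public static void main(String[] args) {
        HashedEncoder enc = new HashedEncoder(16);
        System.out.println(Arrays.toString(
                enc.encode("pleased to propose a confidential business deal".split(" "))));
    }
}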

Page 39: Boston hug

Feature Collisions

Page 40: Boston hug

Training Data

Page 41: Boston hug

Training Data

Page 42: Boston hug

Training Data

Page 43: Boston hug

Full Scale Training

(Diagram: input and side-data flow through a data join plus feature extraction and down-sampling, run as map-reduce and now also via NFS, feeding sequential SGD learning.)

Page 44: Boston hug


Hybrid Model Development

(Diagram: on the big-data cluster, logs are grouped by user into user sessions and transaction-pattern counts; merged with account info across a shared filesystem, the resulting training data feeds legacy modeling with PROC LOGISTIC to produce the model.)

Page 45: Boston hug

Enter the Pig Vector

• Pig UDFs for
– Vector encoding
– Model training

define EncodeVector org.apache.mahout.pig.encoders.EncodeVector(
    '10', 'x+y+1', 'x:numeric, y:numeric, z:numeric');

-- encode each document as a vector, then train one model over all of them
-- (train is another pig-vector UDF; its define is not shown on the slide)
vectors = foreach docs generate newsgroup, EncodeVector(*) as v;
grouped = group vectors all;
model = foreach grouped generate 1 as key, train(vectors) as model;
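The "group vectors all" line is what makes this a hybrid parallel/sequential job: vector encoding runs in parallel across the cluster, while grouping everything into a single bag hands the whole training set to one train() call, which (as I read the preceding slides) runs the fast sequential SGD learner in a single task.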

Page 46: Boston hug

Real-time Developments

• Storm + Hadoop + MapR
– Real-time with Storm
– Long-term with Hadoop
– State checkpoints with MapR

• Add the Bayesian Bandit for on-line learning

Page 47: Boston hug

Aggregate Splicing

Hadoop handles the past

Storm handles the present

Page 48: Boston hug


Mobile Network Monitor

(Diagram: geo-dispersed ingest servers feed transaction data through batch aggregation into HBase, which serves a real-time dashboard and alerts as well as a retro-analysis interface.)

Page 49: Boston hug

A Quick Diversion

• You see a coin
– What is the probability of heads?
– Could it be larger or smaller than that?

• I flip the coin and while it is in the air ask again
• I catch the coin and ask again
• I look at the coin (and you don’t) and ask again
• Why does the answer change?
– And did it ever have a single value?

Page 50: Boston hug

A First Conclusion

• Probability as expressed by humans is subjective and depends on information and experience

Page 51: Boston hug

A Second Conclusion

• A single number is a bad way to express uncertain knowledge

• A distribution of values might be better

Page 52: Boston hug

I Dunno

Page 53: Boston hug

5 and 5

Page 54: Boston hug

2 and 10

Page 55: Boston hug

Bayesian Bandit

• Compute distributions based on data
• Sample p1 and p2 from these distributions

• Put a coin in bandit 1 if p1 > p2

• Else, put the coin in bandit 2
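A sketch of that loop as code (my reading of the slides, with Apache Commons Math assumed for Beta sampling): each bandit's payoff rate gets a Beta(wins + 1, losses + 1) posterior, p1 and p2 are sampled from those posteriors, and the coin goes to whichever bandit drew the larger sample. Early on the posteriors are wide, so both bandits get explored; as evidence accumulates the samples concentrate and the better bandit gets nearly all the coins.

import java.util.Random;
import org.apache.commons.math3.distribution.BetaDistribution;

// Sketch of the Bayesian Bandit: Thompson sampling over two bandits.
public class BayesianBandit {
    public static void main(String[] args) {
        double[] trueRates = {0.12, 0.05};   // unknown to the algorithm
        int[] wins = new int[2];
        int[] plays = new int[2];
        Random world = new Random(42);

        for (int coin = 0; coin < 10000; coin++) {
            // Sample a plausible payoff rate for each bandit from its Beta posterior
            double p1 = new BetaDistribution(wins[0] + 1, plays[0] - wins[0] + 1).sample();
            double p2 = new BetaDistribution(wins[1] + 1, plays[1] - wins[1] + 1).sample();
            int choice = (p1 > p2) ? 0 : 1;   // put the coin in bandit 1 if p1 > p2, else bandit 2

            plays[choice]++;
            if (world.nextDouble() < trueRates[choice]) {
                wins[choice]++;
            }
        }
        System.out.printf("plays: %d vs %d, wins: %d vs %d%n",
                plays[0], plays[1], wins[0], wins[1]);
    }
}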

Page 56: Boston hug
Page 57: Boston hug
Page 58: Boston hug

The Basic Idea

• We can encode a distribution by sampling
• Sampling allows unification of exploration and exploitation

• Can be extended to more general response models

Page 59: Boston hug

Deployment with Storm/MapR

All state managed transactionally in MapR file system

Page 60: Boston hug

Service Architecture

(Diagram components: MapR Lockless Storage Services, MapR Pluggable Service Management, Storm, Hadoop.)

Page 61: Boston hug

Find Out More

• Me: [email protected] [email protected] [email protected]

• MapR: http://www.mapr.com
• Mahout: http://mahout.apache.org

• Code: https://github.com/tdunning