Algorithms Poul Petersen @pejpgrep CIO, BigML, Inc @bigmlcom UI Algorithms & Feature Engineering with Flatline
Web UI, Algorithms, and Feature Engineering

Feb 07, 2017

Data & Analytics

BigML, Inc
Transcript
Page 1: Web UI, Algorithms, and Feature Engineering

Algorithms

Poul Petersen @pejpgrep CIO, BigML, Inc @bigmlcom

UI Algorithms & Feature Engineering with Flatline

Page 2: Web UI, Algorithms, and Feature Engineering

BigML, Inc: ML Crash Course - UI/Algorithms/Feature Engineering

BigML Algorithm History

2011: Prototyping and Beta; API-first Approach

2012: Core ML workflow: source, dataset, model, prediction

2013: Evaluations, Batch Predictions, Ensembles, Sunburst

2014: Anomaly Detection, Clusters, Flatline

2015: Association Discovery, Correlations, Samples, Statistical Tests

2016: Scripts, Libraries, Executions, WhizzML, Logistic Regression

Page 3: Web UI, Algorithms, and Feature Engineering

The need for Machine Learning

• Can you find any pattern in this tiny data set?

Talk  Text  Purchases  Data  Age  Churn?
148   72    0          33.6  50   TRUE
85    66    0          26.6  31   FALSE
183   64    0          23.3  32   TRUE
89    66    94         28.1  21   FALSE
115   0     0          35.3  29   FALSE
166   72    175        25.8  51   TRUE
100   0     0          30    32   TRUE
118   84    230        45.8  31   TRUE
171   110   240        45.4  54   TRUE
159   64    0          27.4  40   FALSE

… but this is a simple example

Page 4: Web UI, Algorithms, and Feature Engineering

Data Types

numeric: 1, 2.0, 3, -5.4

categorical: true, yes, red, mammal

DATE-TIME: 2013-09-25 10:02

Field         Format         Value
YEAR          YYYY-MM-DD     2013
MONTH         YYYY-MM-DD     September
DAY-OF-MONTH  YYYY-MM-DD     25
DAY-OF-WEEK   M-T-W-T-F-S-D  Wednesday
HOUR          HH:MM:SS       10
MINUTE        HH:MM:SS       02

text / items: Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon 'em.

“great”   appears 2 times
“afraid”  appears 1 time
“born”    appears 1 time
“some”    appears 2 times

Page 5: Web UI, Algorithms, and Feature Engineering

Text Analysis

Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon 'em.

great: appears 4 times

Bag of Words

Page 6: Web UI, Algorithms, and Feature Engineering

Text Analysis

Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon 'em.

…  great  afraid  born  achieve  …  …
…  4      1       1     1        …  …
…  …      …       …     …        …  …

Model

The token “great” occurs more than 3 times

The token “afraid” occurs no more than once
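The count of 4 for "great" is consistent with treating "great" (once) and "greatness" (three times) as the same token. A minimal Python sketch of that bag-of-words step follows; the one-entry stem map is a hypothetical stand-in for a real stemmer, and nothing here is claimed to be BigML's actual tokenizer.

```python
import re
from collections import Counter

# Hypothetical stem map for this one sentence; a real stemmer covers far more cases.
STEMS = {"greatness": "great"}

def bag_of_words(text):
    """Lower-case the text, split it into word tokens, apply the toy stem map, and count."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(STEMS.get(token, token) for token in tokens)

sentence = ("Be not afraid of greatness: some are born great, some achieve "
            "greatness, and some have greatness thrust upon 'em.")
counts = bag_of_words(sentence)

# The two conditions listed under "Model", written as predicates on the counts.
great_more_than_3 = counts["great"] > 3
afraid_at_most_once = counts["afraid"] <= 1
```

Each distinct token becomes a numeric column, which is what lets a model branch on conditions such as the two above.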

Page 7: Web UI, Algorithms, and Feature Engineering

Evaluation

[Diagram: a DATASET is split into a TRAIN SET and a TEST SET; the TEST SET yields PREDICTIONS, which yield METRICS]
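The evaluation flow can be sketched in a few lines of Python. This is an illustration only: the 80/20 split, the fixed seed, and the majority-class "model" are assumptions chosen to keep the example short, and the rows reuse the Talk, Age, and Churn? columns from the tiny data set earlier in the deck.

```python
import random
from collections import Counter

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and hold out a fraction for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def train_majority_model(train_rows):
    """Stand-in model: always predict the most common label in the train set."""
    majority = Counter(label for _, label in train_rows).most_common(1)[0][0]
    return lambda features: majority

def accuracy(model, test_rows):
    """Fraction of test rows whose prediction matches the actual label."""
    hits = sum(model(features) == label for features, label in test_rows)
    return hits / len(test_rows)

# ((Talk, Age), Churn?) rows from the tiny data set.
rows = [((148, 50), True), ((85, 31), False), ((183, 32), True),
        ((89, 21), False), ((115, 29), False), ((166, 51), True),
        ((100, 32), True), ((118, 31), True), ((171, 54), True),
        ((159, 40), False)]

train_set, test_set = train_test_split(rows)
model = train_majority_model(train_set)
metric = accuracy(model, test_set)
```

The point of the split is that `metric` is computed only on rows the model never saw during training.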

Page 8: Web UI, Algorithms, and Feature Engineering

Ensembles

Diameter  Color  Shape  Fruit
4         red    round  plum
5         red    round  apple
5         red    round  apple
6         red    round  plum
7         red    round  apple

What is a round, red 6cm fruit?

All Data: “plum”

Bagging!

Sample 1: “plum”
Sample 2: “apple”
Sample 3: “apple”
Samples 1-3 combined: “apple”

Random Decision Forest!
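A rough Python sketch of bagging on this fruit table: draw bootstrap samples, fit one model per sample, and take a majority vote. The base learner here is a toy nearest-diameter rule rather than a decision tree, and the seed and the number of samples are arbitrary assumptions.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the "bagging" step)."""
    return [rng.choice(rows) for _ in rows]

def nearest_diameter_model(sample):
    """Toy base learner (an assumption, not a decision tree): predict the
    fruit of the sampled row whose diameter is closest to the query."""
    def predict(diameter):
        return min(sample, key=lambda row: abs(row[0] - diameter))[1]
    return predict

def majority_vote(predictions):
    """Combine the per-model predictions by plurality."""
    return Counter(predictions).most_common(1)[0][0]

rows = [(4, "plum"), (5, "apple"), (5, "apple"), (6, "plum"), (7, "apple")]
rng = random.Random(0)  # fixed seed so the run is repeatable
models = [nearest_diameter_model(bootstrap_sample(rows, rng)) for _ in range(3)]
ensemble_answer = majority_vote([model(6) for model in models])

# The vote shown on the slide: one "plum" and two "apple" predictions.
slide_answer = majority_vote(["plum", "apple", "apple"])
```

Because each sample can omit the single 6 cm plum, individual models may disagree with the all-data answer, which is exactly what the slide's three samples show.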

Page 9: Web UI, Algorithms, and Feature Engineering

Logistic Regression

Page 10: Web UI, Algorithms, and Feature Engineering

Logistic Regression

????

Page 11: Web UI, Algorithms, and Feature Engineering

Logistic Regression

P≈0    0<P<1    P≈1

• x→-∞ : P(x)→0

• x→∞ : P(x)→1
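The two limits are properties of the standard logistic function P(x) = 1 / (1 + e^(-x)). The slide does not write the formula out, so treat the definition below as the textbook one rather than a quote from the deck; the two-branch form is only there to avoid floating-point overflow for large negative x.

```python
import math

def logistic(x):
    """Standard logistic function 1 / (1 + e^-x), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # safe: x is negative, so z is in (0, 1)
    return z / (1.0 + z)

p_far_left = logistic(-50.0)   # very close to 0
p_middle = logistic(0.0)       # exactly 0.5
p_far_right = logistic(50.0)   # very close to 1
```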

Page 12: Web UI, Algorithms, and Feature Engineering

Supervised Learning

Classification

animal    state   …  proximity  action
tiger     hungry  …  close      run
elephant  happy   …  far        take picture

Regression

animal  state   …  proximity  min_kmh
tiger   hungry  …  close      70
hippo   angry   …  far        10

Multi-Label Classification

animal    state   …  proximity  action1       action2
tiger     hungry  …  close      run           look untasty
elephant  happy   …  far        take picture  call friends

label: the final column (action, min_kmh) or columns (action1, action2) of each table

Page 13: Web UI, Algorithms, and Feature Engineering

Unsupervised Learning

date  customer  account  auth  class    zip    amount
Mon   Bob       3421     pin   clothes  46140  135
Tue   Bob       3421     sign  food     46140  401
Tue   Alice     2456     pin   food     12222  234
Wed   Sally     6788     pin   gas      26339  94
Wed   Bob       3421     pin   tech     21350  2459
Wed   Bob       3421     pin   gas      46140  83
Thu   Sally     6788     sign  food     26339  51

Clustering: find similar instances

Anomaly Detection: find unusual instances

Page 14: Web UI, Algorithms, and Feature Engineering

K-Means

K=3

Page 15: Web UI, Algorithms, and Feature Engineering

K-Means

K=3
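For readers without the animation, here is a minimal K-Means sketch in Python with K=3 on made-up 2-D points. The farthest-point initialization is an assumption chosen to make the run deterministic; it is not necessarily how BigML seeds its clusters.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def farthest_point_init(points, k):
    """Deterministic seeding: start at the first point, then repeatedly add
    the point farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(
            points, key=lambda p: min(squared_distance(p, c) for c in centroids)))
    return centroids

def kmeans(points, k, iterations=20):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    centroids = farthest_point_init(points, k)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centroids[i]  # keep an empty cluster's centroid in place
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical data: three well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
          (20.0, 0.0), (20.0, 1.0), (21.0, 0.0)]
centroids, clusters = kmeans(points, k=3)
```

K has to be chosen up front here, which is the limitation the G-Means slides that follow address.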

Page 16: Web UI, Algorithms, and Feature Engineering

G-Means

Page 17: Web UI, Algorithms, and Feature Engineering

G-Means

Page 18: Web UI, Algorithms, and Feature Engineering

G-Means

Let K=2

Keep 1, Split 1

New K=3

Page 19: Web UI, Algorithms, and Feature Engineering

G-Means

Let K=3

Keep 1, Split 2

New K=5

Page 20: Web UI, Algorithms, and Feature Engineering

G-Means

Let K=5

K=5

Page 21: Web UI, Algorithms, and Feature Engineering

Isolation Forest

Grow a random decision tree until each instance is in its own leaf

“easy” to isolate

“hard” to isolate

Depth

Now repeat the process several times and use average Depth to compute anomaly score: 0 (similar) -> 1 (dissimilar)
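The procedure above can be sketched in Python. Instead of growing whole trees, this version follows only the branch that contains the point of interest, which gives the same depth for that point. The score normalization is the one from the original isolation forest paper (Liu, Ting and Zhou, 2008); whether BigML uses exactly this formula is an assumption. The data, the seed, and the number of trees are made up.

```python
import math
import random

def isolation_depth(point, data, rng, depth=0, max_depth=50):
    """Number of random axis-aligned splits needed to leave `point` alone.
    `point` must be one of the rows in `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    # Only features that still vary within this partition can be split.
    splittable = [d for d in range(len(point))
                  if min(row[d] for row in data) < max(row[d] for row in data)]
    if not splittable:
        return depth
    dim = rng.choice(splittable)
    low = min(row[dim] for row in data)
    high = max(row[dim] for row in data)
    split = rng.uniform(low, high)
    same_side = [row for row in data
                 if (row[dim] < split) == (point[dim] < split)]
    return isolation_depth(point, same_side, rng, depth + 1, max_depth)

def average_path_length(n):
    """Normalizing constant c(n) from the isolation forest paper, for n > 2."""
    harmonic = math.log(n - 1) + 0.5772156649  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(point, data, n_trees=200, seed=0):
    """Average depth over many random trees, mapped so that shallow
    (easy to isolate) points score near 1 and deep points score lower."""
    rng = random.Random(seed)
    mean_depth = sum(isolation_depth(point, data, rng)
                     for _ in range(n_trees)) / n_trees
    return 2.0 ** (-mean_depth / average_path_length(len(data)))

# Hypothetical data: seven similar points and one obvious outlier.
data = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.3),
        (0.8, 0.8), (1.3, 1.0), (1.0, 1.2), (9.0, 9.0)]
outlier_score = anomaly_score((9.0, 9.0), data)
inlier_score = anomaly_score((1.0, 1.0), data)
```

The outlier is usually cut off by the very first split, so its average depth is small and its score is higher than that of a point inside the dense group.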

Page 22: Web UI, Algorithms, and Feature Engineering

Model Competence

Prediction    T       T
Confidence    86%     84%
AnomalyScore  0.5367  0.7124
Competent?    Y       N

[Diagram: DATASET, MODEL and ANOMALY DETECTOR, with stages labeled “At Training Time” and “At Prediction Time”]

Page 23: Web UI, Algorithms, and Feature Engineering

Association Rules

date  customer  account  auth  class    zip    amount
Mon   Bob       3421     pin   clothes  46140  135
Tue   Bob       3421     sign  food     46140  401
Tue   Alice     2456     pin   food     12222  234
Wed   Sally     6788     pin   gas      26339  94
Wed   Bob       3421     pin   tech     21350  2459
Wed   Bob       3421     pin   gas      46140  83
Thu   Sally     6788     sign  food     26339  51

Rules:

Antecedent                         Consequent
{class = gas}                      amount < 100
{customer = Bob, account = 3421}   zip = 46140

Page 24: Web UI, Algorithms, and Feature Engineering

Association Metrics

Coverage

Percentage of instances which match antecedent “A”

[Venn diagram: sets A and C within all Instances]

Page 25: Web UI, Algorithms, and Feature Engineering

Association Metrics

Support

Percentage of instances which match antecedent “A” and consequent “C”

[Venn diagram: sets A and C within all Instances]

Page 26: Web UI, Algorithms, and Feature Engineering

Association Metrics

Confidence

Percentage of instances in the antecedent which also contain the consequent.

Confidence = Support / Coverage

[Venn diagram: sets A and C within all Instances, with Coverage and Support marked]

Page 27: Web UI, Algorithms, and Feature Engineering

Association Metrics

Confidence

0%: A never implies C

Between 0% and 100%: A sometimes implies C

100%: A always implies C

[Venn diagrams of A and C within all Instances, one for each of the three cases]

Page 28: Web UI, Algorithms, and Feature Engineering

Association Metrics

Lift

Ratio of observed support to support if A and C were statistically independent.

Lift = Support / [ p(A) * p(C) ] = Confidence / p(C)

[Venn diagrams: Independent versus Observed overlap of A and C]

Page 29: Web UI, Algorithms, and Feature Engineering

Association Metrics

Lift < 1: Negative Correlation

Lift = 1: No Association

Lift > 1: Positive Correlation

[Venn diagrams: Independent versus Observed overlap of A and C for each of the three cases]

Page 30: Web UI, Algorithms, and Feature Engineering

Association Metrics

Leverage

Difference of observed support and support if A and C were statistically independent.

Leverage = Support - [ p(A) * p(C) ]

[Venn diagrams: Independent versus Observed overlap of A and C]

Page 31: Web UI, Algorithms, and Feature Engineering

Association Metrics

Leverage < 0: Negative Correlation

Leverage = 0: No Association

Leverage > 0: Positive Correlation

[Venn diagrams: Independent versus Observed overlap of A and C for each of the three cases; the leverage scale is marked from -1…]
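The five metrics can be checked against the seven-row transaction table from the Association Rules slide. The sketch below is plain Python and follows the definitions on the preceding slides; the function and variable names are illustrative, and only the columns needed for the two example rules are kept.

```python
# (customer, account, class, zip, amount) for the seven transactions.
transactions = [
    ("Bob", 3421, "clothes", 46140, 135),
    ("Bob", 3421, "food", 46140, 401),
    ("Alice", 2456, "food", 12222, 234),
    ("Sally", 6788, "gas", 26339, 94),
    ("Bob", 3421, "tech", 21350, 2459),
    ("Bob", 3421, "gas", 46140, 83),
    ("Sally", 6788, "food", 26339, 51),
]

def association_metrics(rows, antecedent, consequent):
    """Compute coverage, support, confidence, lift and leverage for A -> C.
    Assumes at least one row matches A and at least one matches C."""
    n = len(rows)
    p_a = sum(antecedent(r) for r in rows) / n            # coverage
    p_c = sum(consequent(r) for r in rows) / n
    support = sum(antecedent(r) and consequent(r) for r in rows) / n
    return {
        "coverage": p_a,
        "support": support,
        "confidence": support / p_a,
        "lift": support / (p_a * p_c),
        "leverage": support - p_a * p_c,
    }

# Rule: {class = gas} -> amount < 100
gas_rule = association_metrics(
    transactions,
    antecedent=lambda r: r[2] == "gas",
    consequent=lambda r: r[4] < 100,
)
```

For this rule both gas purchases are under 100, so confidence is 100%; lift is 7/3 (about 2.33) and leverage is 8/49 (about 0.16), both on the positive-correlation side.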

Page 32: Web UI, Algorithms, and Feature Engineering

Machine Learning Secret

“…the largest improvements in accuracy often came from quick experiments, feature engineering, and model tuning rather than applying fundamentally different algorithms.”

Facebook FBLearner 2016

Feature Engineering: applying domain knowledge of the data to create features that make machine learning algorithms work better or at all.

Page 33: Web UI, Algorithms, and Feature Engineering

Feature Engineering

Automatic Date Transformation

DATE-TIME: 2013-09-25 10:02

…  year  month  day  hour  minute  …
…  2013  Sep    25   10    2       …
…  …     …      …    …     …       …

NUM NUM CAT NUM NUM
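A small Python sketch of the same expansion, using only the standard library. The output field names and the input format string are assumptions made for this example; BigML performs this step automatically.

```python
from datetime import datetime

MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")

def expand_datetime(value, fmt="%Y-%m-%d %H:%M"):
    """Split one date-time string into separate numeric and categorical fields."""
    moment = datetime.strptime(value, fmt)
    return {
        "year": moment.year,
        "month": MONTHS[moment.month - 1],
        "day": moment.day,
        "day-of-week": WEEKDAYS[moment.weekday()],  # weekday(): Monday is 0
        "hour": moment.hour,
        "minute": moment.minute,
    }

fields = expand_datetime("2013-09-25 10:02")
```

Splitting the timestamp this way lets a model pick up patterns such as "weekends" or "mornings" that are invisible in the raw string.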

Page 34: Web UI, Algorithms, and Feature Engineering

Feature Engineering

Automatic Categorical Transformation

…  alchemy_category  …
…  business          …
…  recreation        …
…  health            …
…  …                 …

CAT

…  business  health  recreation  …
…  1         0       0           …
…  0         0       1           …
…  0         1       0           …
…  …         …       …           …

NUM NUM NUM
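The categorical-to-numeric step shown here is commonly called one-hot encoding. A minimal Python sketch, with column order chosen alphabetically to match the slide (an assumption about ordering, not a statement about BigML internals):

```python
def one_hot(values):
    """Turn a list of category labels into 0/1 indicator rows,
    with one column per distinct category in alphabetical order."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows

columns, encoded = one_hot(["business", "recreation", "health"])
# columns -> ['business', 'health', 'recreation']
# encoded -> [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
```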

Page 35: Web UI, Algorithms, and Feature Engineering

Feature Engineering

Automatic Text Transformation

Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon 'em.

TEXT

…  great  afraid  born  achieve  …
…  4      1       1     1        …
…  …      …       …     …        …

NUM NUM NUM NUM

Page 36: Web UI, Algorithms, and Feature Engineering

Feature Engineering

{ "url": "cbsnews", "title": "Breaking News Headlines Business Entertainment World News ", "body": " news covering all the latest breaking national and world news headlines, including politics, sports, entertainment, business and more." }

TEXT

Better representation

title           body
Breaking News…  news covering…
…               …

TEXT TEXT
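One way to get from the single JSON text field to separate title and body fields is to parse it, sketched here with Python's standard `json` module. The helper name and the choice to strip surrounding whitespace are assumptions made for this example.

```python
import json

def split_json_field(raw, keys=("title", "body")):
    """Parse a JSON string and return the selected keys as separate text fields."""
    record = json.loads(raw)
    return {key: record.get(key, "").strip() for key in keys}

raw = ('{"url": "cbsnews", '
       '"title": "Breaking News Headlines Business Entertainment World News ", '
       '"body": " news covering all the latest breaking national and world news '
       'headlines, including politics, sports, entertainment, business and more."}')
fields = split_json_field(raw)
```

With the fields separated, each one gets its own bag of words, so a token in the title is no longer confused with the same token in the body or with JSON syntax.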

Page 37: Web UI, Algorithms, and Feature Engineering

Feature Engineering

Discretization

Total Spend (NUM)  Spend Category (CAT)
7,342.99           Top 33%
304.12             Middle 33%
4.56               Bottom 33%
345.87             Middle 33%
8,546.32           Top 33%

NUM: “Predict will spend $3,521 with error $1,232”

CAT: “Predict customer will be Top 33% in spending”
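Discretization can be sketched as a simple threshold function. The two cut points below are hypothetical: the slide shows only five rows of a larger data set, so the real tercile boundaries are not known, and these values were picked only so that the five rows land in the categories shown.

```python
def spend_category(total_spend, low_cut=100.0, high_cut=1000.0):
    """Map a numeric spend to one of three categories.
    low_cut and high_cut are hypothetical tercile boundaries."""
    if total_spend < low_cut:
        return "Bottom 33%"
    if total_spend < high_cut:
        return "Middle 33%"
    return "Top 33%"

spends = [7342.99, 304.12, 4.56, 345.87, 8546.32]
categories = [spend_category(value) for value in spends]
```

In practice the cut points would be computed from the full column (for example its 33rd and 67th percentiles) rather than fixed by hand.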

Page 38: Web UI, Algorithms, and Feature Engineering

Feature Engineering

Combinations of Multiple Features

Kg (NUM)  M2 (NUM)  BMI = Kg / M2 (NUM)
101.4     3.24      31.17
85.2      2.8       30.4
56.2      2.9       19.38
136.1     3.6       37.8
95.9      4.1       23.39
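The combined feature is a plain ratio of the two columns, sketched below in Python. The function name is illustrative. The computed values agree with the BMI column to within rounding, except for the first row, where 101.4 / 3.24 evaluates to about 31.30 rather than the 31.17 shown.

```python
def bmi(weight_kg, height_m_squared):
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m_squared

rows = [(101.4, 3.24), (85.2, 2.8), (56.2, 2.9), (136.1, 3.6), (95.9, 4.1)]
bmi_column = [round(bmi(kg, m2), 2) for kg, m2 in rows]
```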

Page 39: Web UI, Algorithms, and Feature Engineering

Feature Engineering

Flatline

• BigML’s Domain-Specific Language (DSL) for Transforming Datasets

• Limited programming language structures

  • let, cond, if, maps, list operators, */+-

• Dataset Fields are first-class citizens

  • (field "diabetes pedigree")

• Built-in transformations

  • statistics, strings, timestamps, windows

Page 40: Web UI, Algorithms, and Feature Engineering

Feature Engineering

Shock: Deviations from a Trend

(Current - (4-day avg)) / std dev

(/ (- (f "price") (avg-window "price" -4 -1)) (standard-deviation "price"))

date  volume  price
1     34353   314
2     44455   315
3     22333   315
4     52322   321
5     28000   320
6     31254   319
7     56544   323
8     44331   324
9     81111   287
10    65422   294
11    59999   300
12    45556   302
13    19899   301
14    21453   302

day-4  day-3  day-2  day-1  4davg
                            0
                     314    314
              314    315    314.5
       314    315    315    314.6
314    315    315    321    316.25
315    315    321    320    317.75
315    321    320    319    318.75

Page 41: Web UI, Algorithms, and Feature Engineering

Feature Engineering

Shock: Deviations from a Trend

(Current - (4-day avg)) / std dev

(/ (- (f "price") (avg-window "price" -4 -1)) (standard-deviation "price"))

Current: (field "price")

4-day avg: (avg-window "price" -4 -1)

std dev: (standard-deviation "price")
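To make the arithmetic concrete, here is a Python re-implementation of the same calculation on the 14 prices from the previous slide. It is a sketch, not Flatline: the helper names are made up, the population standard deviation is an assumption (the slide does not say which variant Flatline uses), and rows with no earlier prices return None where the slide's window table shows 0.

```python
import statistics

prices = [314, 315, 315, 321, 320, 319, 323, 324, 287, 294, 300, 302, 301, 302]

def avg_window(values, index, start=-4, end=-1):
    """Mean of the rows from index+start to index+end, ignoring offsets
    that fall before the first row. Returns None if no row is available."""
    window = [values[index + offset]
              for offset in range(start, end + 1)
              if 0 <= index + offset < len(values)]
    return sum(window) / len(window) if window else None

def shock(values, index):
    """(current - average of the previous four rows) / std dev of the column."""
    trend = avg_window(values, index)
    if trend is None:
        return None
    # Assumption: population standard deviation over the whole column.
    return (values[index] - trend) / statistics.pstdev(values)

shocks = [shock(prices, i) for i in range(len(prices))]
```

Day 9 (price 287 after four days averaging 321.5) produces the most negative value, which is the kind of deviation from trend the feature is meant to expose.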

Page 42: Web UI, Algorithms, and Feature Engineering

Feature Engineering

Fix Missing Values in a “Meaningful” Way

[Workflow diagram. Steps: Filter Zeros, Model insulin, Predict insulin, Select insulin. Datasets: Original Dataset, Clean Dataset, Amended Dataset, Fixed Dataset]