
VSSML16 L6. Feature Engineering

BigML, Inc
Transcript
Page 1: VSSML16 L6. Feature Engineering

Feature Engineering

#VSSML16

September 2016


Page 2: VSSML16 L6. Feature Engineering

Outline

1 Some Unfortunate Examples

2 Feature Engineering

3 Mathematical Transformations

4 Missing Values

5 Series Features

6 Datetime Features

7 Text Features

8 Advanced Topics


Page 3: VSSML16 L6. Feature Engineering

Outline


Page 4: VSSML16 L6. Feature Engineering

Should I Drive?

• Building a predictive model to recommend driving (or not)

• Have data from the beginning and ending of the trip, and whether or not there are paved roads between the two points

• Tracked human-made decisions for several hundred trips


Page 5: VSSML16 L6. Feature Engineering

A Simple Function

• Create a predictive model to emulate a simple Boolean function

• Features are Boolean variables

• Objective is the inputs XOR'ed together (true if the number of ones is odd and false otherwise)
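
As a quick illustration (not from the slides), here is a minimal Python sketch of why this objective is so awkward: any single raw input carries no information about the parity objective, while an engineered "number of ones mod 2" feature predicts it exactly. The data and variable names are purely illustrative.

import itertools
import numpy as np

# Enumerate every row of a 4-input Boolean parity (XOR) problem.
X = np.array(list(itertools.product([0, 1], repeat=4)))
y = X.sum(axis=1) % 2  # objective: 1 if the number of ones is odd

# Any single raw feature is useless: the class is perfectly balanced
# whether that feature is 0 or 1.
for i in range(X.shape[1]):
    print(f"P(y=1 | x{i}=1) = {y[X[:, i] == 1].mean():.2f}")  # prints 0.50 each time

# An engineered parity feature matches the objective exactly.
parity = X.sum(axis=1) % 2
print("engineered feature equals objective:", bool(np.array_equal(parity, y)))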


Page 6: VSSML16 L6. Feature Engineering

Outline


Page 7: VSSML16 L6. Feature Engineering

What?

• So we should just give up, yes?

• All the model knows about are the features we provide for it

• In the cases above the features are broken


Page 8: VSSML16 L6. Feature Engineering

Broken?

• The features we've provided aren't useful for prediction
  - Latitude and longitude don't correlate especially well with drivability
  - Any single feature in the XOR problem doesn't predict the outcome (and, in fact, changing any one feature changes the class)
  - In both cases, the same feature value has different semantics in the presence of other features

• Machine learning algorithms, in general, rely on some statistically detectable relationship between the features and the class

• The nature of the relationship is particular to the algorithm


Page 9: VSSML16 L6. Feature Engineering

How to Confuse a Machine Learning Algorithm

• Remember that machine learning algorithms are searching for a classifier in a particular hypothesis space

• Decision Trees
  - Thresholds on individual features
  - Are you able to set a meaningful threshold on any of your input features?

• Logistic Regression
  - Weighted combinations of features
  - Can a good model be made on a weighted average of your input features?


Page 10: VSSML16 L6. Feature Engineering

Feature Engineering

• Feature engineering: The process of transforming raw input data into machine-learning-ready data

• Alternatively: Using your existing features and some math to make new features that models will "like"

• Not covered here, but important: Going out and getting better information

• "Applied machine learning" is domain-specific feature engineering and evaluation!


Page 11: VSSML16 L6. Feature Engineering

Some Good Times to Engineer Features

• When the relationship between the feature and the objective is mathematically unsatisfying

• When a function of two or more features is far more relevant than the original features

• When there is missing data

• When the data is time-series, especially when the previous time period's objective is known

• When the data can't be used for machine learning in the obvious way (e.g., timestamps, text data)

Rule of thumb: Every bit of work you do in feature engineering is a bit that the model doesn't have to figure out.


Page 12: VSSML16 L6. Feature Engineering

Aside: Flatline

• Feature engineering is the most important topic in real-world machine learning

• BigML has its own domain-specific language for it, Flatline, and we'll use it for our examples here

• Two things to note right off the bat:
  - Flatline uses Lisp-like "prefix notation"
  - You get the value for a given feature using a function f

• So, to create a new column in your dataset with the sum of feature1 and feature2:

  (+ (f "feature1") (f "feature2"))
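
For readers who think in pandas rather than Flatline, a rough equivalent of that expression might look like the sketch below; the DataFrame and its values are placeholders, not part of the slides.

import pandas as pd

df = pd.DataFrame({"feature1": [1.0, 2.0, 3.0], "feature2": [10.0, 20.0, 30.0]})

# Same idea as (+ (f "feature1") (f "feature2")): add a derived column.
df["feature1_plus_feature2"] = df["feature1"] + df["feature2"]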


Page 13: VSSML16 L6. Feature Engineering

Outline


Page 14: VSSML16 L6. Feature Engineering

Statistical Aggregations

• Many times you have a bunch of features that all "mean" the same thing
  - Pixels in the wavelet transform of an image
  - Did or did not make a purchase for day n-1 to day n-30

• Easiest thing is sum, average, or count (especially with sparse data)

• The all and all-but field selectors are helpful here:

  (/ (+ (all-but "PurchasesM-0")) (count (all-but "PurchasesM-0")))
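
A hedged pandas sketch of the same kind of aggregation: averaging a block of columns that all mean roughly the same thing. The purchase column names and values are invented for the example.

import pandas as pd

df = pd.DataFrame({
    "PurchasesM-0": [3, 0, 5],   # current month: excluded from the feature
    "PurchasesM-1": [1, 0, 4],
    "PurchasesM-2": [2, 1, 6],
})

# Sum of every purchase column except the current month, divided by the
# number of such columns (i.e., their average).
history = [c for c in df.columns if c != "PurchasesM-0"]
df["avg_past_purchases"] = df[history].sum(axis=1) / len(history)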


Page 15: VSSML16 L6. Feature Engineering

Better Categories

• Some categories are helped by collapsing them
  - Categories with elaborate hierarchies (occupation)
  - "Numeric" categories with too many levels (good, very good, amazingly good)
  - Any category with a natural grouping (country, US state)

• Group categories with cond:

  (cond (= "GRUNT" (f "job")) "Worker"
        (> (occurrences (f "job") "Chief") 0) "Fancy Person"
        "Everyone Else")

• Consider converting them to a numeric if they are ordinal:

  (cond (= (f "test1") "poor") 0
        (= (f "test1") "fair") 1
        (= (f "test1") "good") 2
        (= (f "test1") "excellent") 3)


Page 16: VSSML16 L6. Feature Engineering

Binning or Discretization

• You can also turn a numeric variable into a categorical one

• Main idea is that you make bins and put each value into one of them (e.g., low, middle, high)

• Good to give the model information about potential noise (say, body temperature)

• There are a whole bunch of ways to do it (in the interface):
  - Quartiles
  - Deciles
  - Any generic percentiles

• Note: This includes the objective itself!
  - Turns a regression problem into a classification problem
  - Might turn a hard problem into an easy one
  - Might be more what you actually care about
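
A small pandas sketch of percentile binning, including binning the objective itself to turn a regression problem into a classification problem; the column names and data are illustrative.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"body_temp": rng.normal(36.8, 0.4, size=200),
                   "price": rng.lognormal(mean=3.0, sigma=0.5, size=200)})

# Quartile bins for a noisy numeric feature.
df["temp_bin"] = pd.qcut(df["body_temp"], q=4,
                         labels=["low", "mid-low", "mid-high", "high"])

# The same trick on the objective turns regression into classification.
df["price_band"] = pd.qcut(df["price"], q=3,
                           labels=["cheap", "typical", "expensive"])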


Page 17: VSSML16 L6. Feature Engineering

Linearization

• Not important for decision trees (transformations that preserve ordering have no effect)

• Can be important for logistic regression and clustering

• Common and simple cases are exp and log:

  (log (f "test"))

• Use cases
  - Monetary amounts (salaries, profits)
  - Medical tests
  - Various engagement metrics (e.g., website activity)
  - In general, hockey-stick distributions
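
A minimal numpy/pandas sketch of the log transform for a hockey-stick-shaped column (the salary values are made up); log1p is used so that zeros stay finite.

import numpy as np
import pandas as pd

df = pd.DataFrame({"salary": [18_000, 35_000, 52_000, 250_000, 1_200_000]})

# Compress the long right tail; often helps linear and distance-based methods.
df["log_salary"] = np.log1p(df["salary"])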


Page 18: VSSML16 L6. Feature Engineering

Outline


Page 19: VSSML16 L6. Feature Engineering

Missing Data: Why?

• Occasionally, a feature value might be missing from one or more instances in your data

• This could be for a number of diverse reasons:
  - Random noise (corruption, formatting)
  - Feature is not computable (mathematical reasons)
  - Collection errors (network errors, systems down)
  - Missing for a reason (test not performed, value doesn't apply)

• Key question: Does the missing value have semantics?


Page 20: VSSML16 L6. Feature Engineering

How Machine Learning Algorithms Treat Missing Data

• The standard treatment of missing values by decision trees is that they're just due to random noise (unless you choose the BigML "or missing" splits. Plug: this isn't available in a lot of other packages.)
  - They're essentially "ignored" during tree construction (a bit of an oversimplification)
  - This means that features that have missing values are less likely to be chosen for a split than those that aren't

• Can we “repair” the features?


Page 21: VSSML16 L6. Feature Engineering

Missing Value Replacement

• Simplest thing to do is just replace the missing value with a common thing
  - Mean or median - For symmetric distributions
  - Mode - Especially for features where the "default" value is incredibly common (e.g., word counts, medical tests)

• Such a common operation that it's built into the interface

• Also available in Flatline:

  (if (missing? "test1") (mean "test1") (f "test1"))
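
A rough pandas sketch of the same replacements (column names invented): mean for a roughly symmetric numeric column, mode for a categorical one.

import pandas as pd

df = pd.DataFrame({"test1": [1.2, None, 3.4, None, 2.2],
                   "blood_type": ["A", "O", None, "O", "O"]})

# Mean replacement for a roughly symmetric numeric feature.
df["test1"] = df["test1"].fillna(df["test1"].mean())

# Mode replacement where the "default" value dominates.
df["blood_type"] = df["blood_type"].fillna(df["blood_type"].mode()[0])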


Page 22: VSSML16 L6. Feature Engineering

Missing Value Induction

• But we can do better than the mean, can’t we? (Spoiler: Yes)

• If only we had an algorithm that could predict a value given the other feature values for the same instance HEY WAIT A MINUTE

• Train a model to predict your missing values
  - Training set is all points where the value is non-missing
  - Predict for the points where the value is missing
  - Remember not to use your objective as part of the missing value predictor

• Some good news: You probably don't know or care what your performance is!

• If you're modeling with a technique that's robust to missing values, you can model every column without getting into a "cycle"
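
A hedged scikit-learn sketch of the idea, not BigML's implementation: fit a model on the rows where the column is present and predict it where it's missing, keeping the real objective out of the imputer's inputs. The column names and the choice of a random forest are illustrative.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(300, 4)), columns=["a", "b", "c", "objective"])
df.loc[rng.random(300) < 0.2, "c"] = np.nan   # knock out ~20% of one column

predictors = ["a", "b"]                       # the objective is deliberately excluded
known = df["c"].notna()

imputer = RandomForestRegressor(n_estimators=100, random_state=0)
imputer.fit(df.loc[known, predictors], df.loc[known, "c"])
df.loc[~known, "c"] = imputer.predict(df.loc[~known, predictors])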


Page 23: VSSML16 L6. Feature Engineering

Constructing Features From Missing Data

• Maybe a more interesting thing to do is to use missing data as a feature

• Does your missing value have semantics?
  - Feature is incomputable?
  - Presence/absence is more telling than the actual value

• Easy to make a binary feature

• You could also turn a quasi-numeric feature into a multilevel categorical with Flatline's cond operator:

  (cond (missing? "test1") "not performed"
        (< (f "test1") 10) "low value"
        (< (f "test1") 20) "medium value"
        "high value")


Page 24: VSSML16 L6. Feature Engineering

Outline


Page 25: VSSML16 L6. Feature Engineering

Time-series Prediction

• Occasionally a time series prediction problem comes at you as a 1-d prediction problem

• The objective: Predict the value of the sequence given history.

• But... there aren't any features!


Page 26: VSSML16 L6. Feature Engineering

Some Very Simple Time-series Data

• Closing Prices for the S&P 500

• A useless objective!

• No features!

• What are we going to do? (Spoiler: Either drink away our sorrows or FEATURE ENGINEERING)

Price: 2019.32, 2032.43, 2015.11, 2043.93, 2060.50, 2085.38, 2092.93


Page 27: VSSML16 L6. Feature Engineering

A Better Objective: Percent Change

• Going to be really difficult to predict the actual closing price. Why?
  - Price gets larger over long time periods
  - If we train on historical data, the future price will be out of range

• Predicting the percent change from the previous close is a more stationary and more relevant objective

• In Flatline, we can get the previous field value by passing an index to the f function:

  (/ (- (f "price") (f "price" -1)) (f "price" -1))
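
The same transformation in pandas, as a rough sketch; the price values are the ones from the earlier slide.

import pandas as pd

prices = pd.Series([2019.32, 2032.43, 2015.11, 2043.93, 2060.50, 2085.38, 2092.93],
                   name="price")

# (price_t - price_{t-1}) / price_{t-1}: the percent-change objective.
pct_change = prices.pct_change()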


Page 28: VSSML16 L6. Feature Engineering

Features: Delta from Previous Day (Week, Month, . . .)

• Percent change over the last n days

• Remember these are features, so don't include the objective day; you won't know it!

  (/ (- (f "price" -1) (f "price" -10)) (f "price" -10))

• Note that this could be anything, and exactly what it should be is domain-specific


Page 29: VSSML16 L6. Feature Engineering

Features: Above/Below Moving Average

• The avg-window function makes it easy to compute a moving average:

  (avg-window "price" -50 -1)

• How far are we off the moving average?

  (let (ma50 (avg-window "price" -50 -1))
    (/ (- (f "price" -1) ma50) ma50))


Page 30: VSSML16 L6. Feature Engineering

Features: Recent Volatility

• Let's start with the squared deviations within a window:

  (let (win-mean (avg-window "price" -10 -1))
    (map (square (- _ win-mean)) (window "price" -10 -1)))

• With that, it's easy to get the standard deviation:

  (let (win-mean (avg-window "price" -10 -1)
        sq-errs (map (square (- _ win-mean)) (window "price" -10 -1)))
    (sqrt (/ (+ sq-errs) (- (count sq-errs) 1))))

• This is a reasonably nice measure of volatility of the objective over the last n time periods.
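
A hedged pandas sketch of the windowed features from the last few slides: a lagged delta, distance from a 50-period moving average, and rolling volatility. Everything is shifted by one step so that no window touches the objective row (see the leakage caveats on the next slide). The synthetic price series is for illustration only.

import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({"price": 2000 + rng.normal(0, 10, size=200).cumsum()})

# Objective: today's percent change.
df["pct_change"] = df["price"].pct_change()

# Features use only *past* closes, hence the shift(1).
past = df["price"].shift(1)
df["delta_10"] = (past - df["price"].shift(10)) / df["price"].shift(10)

ma50 = past.rolling(50).mean()
df["dist_from_ma50"] = (past - ma50) / ma50

# Rolling sample standard deviation of past returns: recent volatility.
df["volatility_10"] = df["price"].pct_change().shift(1).rolling(10).std(ddof=1)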


Page 31: VSSML16 L6. Feature Engineering

Go Nuts!

• Not hard to imagine all sorts of interesting features:
  - Moving average crosses
  - Breaking out of a range
  - All with different time parameters

• One of the difficulties of feature engineering is dealing with this exponential explosion

• Makes it spectacularly easy to keep wasting effort (or losing money)


Page 32: VSSML16 L6. Feature Engineering

Some Caveats

• The regularity in time of the points has to match your training data

• You have to keep track of past points to compute your windows

• Really easy to get information leakage by including your objective in a window computation (and it can be very hard to detect)!

• Did I mention how awful information leakage can be here?

• WHAT ABOUT INFORMATION LEAKAGE


Page 33: VSSML16 L6. Feature Engineering

Outline


Page 34: VSSML16 L6. Feature Engineering

A Nearly Useless Datatype

• There's no easy way to include timestamps in our models (really just a formatted text field)

• What about epoch time? Usually not what we want.
  - Weather forecasting?
  - Activity prediction?
  - Energy usage?

• A datetime is really a collection of features


Page 35: VSSML16 L6. Feature Engineering

An Opportunity for Automatic Feature Engineering

• Timestamps are usually found in a fairly small (okay, massive) number of standard formats

• Once parsed into epoch time, we can automatically extract a bunch of features:
  - Date features - month, day, year, day of week
  - Time features - hour, minute, second, millisecond

• We do this "for free" in BigML

• You can also specify a custom format in Flatline
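
A rough pandas sketch of unpacking a timestamp column into this kind of feature set, plus a working-hours flag like the one on the next slide; the column names and timestamps are invented.

import pandas as pd

df = pd.DataFrame({"timestamp": pd.to_datetime([
    "2016-09-12 08:45:00", "2016-09-17 23:10:00", "2016-09-19 14:05:00"])})

dt = df["timestamp"].dt
df["year"] = dt.year
df["month"] = dt.month
df["day"] = dt.day
df["day_of_week"] = dt.dayofweek          # Monday = 0
df["hour"] = dt.hour
df["minute"] = dt.minute

# Derived flag: roughly 9:00-18:00 on a weekday.
df["working_hours"] = df["hour"].between(9, 18) & (df["day_of_week"] < 5)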


Page 36: VSSML16 L6. Feature Engineering

Useful Features May Be Buried Even More Deeply

• Important to remember that the computer doesn't have the information about time that you do

• Example - Working Hours
  - Need to know if it's between, say, 9:00 and 17:00
  - Also need to know if it's Saturday or Sunday

  (let (hour (f "SomeDay.hour")
        day (f "SomeDay.day-of-week"))
    (and (<= 9 hour 18) (< day 6)))

• Example - Is Daylight
  - Need to know hour of day
  - Also need to know day of year


Page 37: VSSML16 L6. Feature Engineering

Go Nuts!

• Date Features
  - endOfMonth? - Important feature of lots of clerical work
  - nationalHoliday? - What it says on the box
  - duringWorldCup? - Certain behaviors (e.g., TV watching) might be different during this time

• Time Features
  - isRushHour? - Between 7 and 9am on a weekday
  - mightBeSleeping? - From midnight to 6am
  - mightBeDrinking? - Weekend evenings (or Wednesday at 1am, if that's your thing)

• There are a ton of these things that are spectacularly domain-dependent (think contract rolls in futures trading)


Page 38: VSSML16 L6. Feature Engineering

Outline


Page 39: VSSML16 L6. Feature Engineering

Bag of Words

• The standard way that BigML processes text is to create one feature for each word found in the text field across all instances.

• This is the so-called "bag of words" approach

• Called this because all notion of sequence goes away after processing

• In this case, any notion of correlation also disappears, since the words are treated as independent features.
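
A minimal scikit-learn sketch of the bag-of-words idea (not BigML's implementation): each distinct token becomes a column of counts.

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs"]

vectorizer = CountVectorizer()            # tokenize, lowercase, count
bow = vectorizer.fit_transform(docs)      # sparse matrix: documents x vocabulary

print(vectorizer.get_feature_names_out())
print(bow.toarray())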


Page 40: VSSML16 L6. Feature Engineering

Tokenization

• Tokenization seems trivial
  - Except if you have numbers or special characters in the tokens
  - What about hyphens? Apostrophes?

• Do we want to do n-grams?

• Keep only tokens that occur a certain number of times (not too rare, not too frequent)

• Note that this is more difficult with languages that don't have clear word boundaries


Page 41: VSSML16 L6. Feature Engineering

Word Stemming

• Now we have a list of tokens

• But sometimes we get "forms" of the same term
  - playing, plays, play
  - confirm, confirmed, confirmation

• We can use "stemming" to map these different forms back to the same root

• Most western languages have a reasonable set of rules


Page 42: VSSML16 L6. Feature Engineering

Other Textual Features

• Example: Length of text

  (length (f "textFeature"))

• Contains certain strings

• Dollar amounts? Dates? Salutations? Please and Thank you?

• Flatline has full regular expression capabilities
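
A hedged pandas sketch of a few such hand-crafted text features; the messages and regular-expression patterns are just examples.

import pandas as pd

df = pd.DataFrame({"message": ["Please send the $25 invoice by 2016-09-30.",
                               "thanks!!!",
                               "Dear Sir, kindly note the attached report."]})

df["length"] = df["message"].str.len()
df["has_dollar_amount"] = df["message"].str.contains(r"\$\d+", regex=True)
df["has_date"] = df["message"].str.contains(r"\d{4}-\d{2}-\d{2}", regex=True)
df["is_polite"] = df["message"].str.lower().str.contains(r"please|thank", regex=True)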


Page 43: VSSML16 L6. Feature Engineering

Latent Dirichlet Allocation

• Learn word distributions for topics

• Infer topic scores for each document

• Use the topic scores as features to a model
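
A hedged scikit-learn sketch of that pipeline: fit LDA on token counts, then use each document's topic proportions as features for a downstream model. The documents and number of topics are arbitrary.

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["stocks fell as markets reacted to interest rates",
        "the team scored late to win the match",
        "investors sold bonds and bought stocks",
        "the striker missed an easy goal"]

counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_scores = lda.fit_transform(counts)    # one row per document, one column per topic

# topic_scores can now be appended to the feature matrix of a classifier.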


Page 44: VSSML16 L6. Feature Engineering

Outline


Page 45: VSSML16 L6. Feature Engineering

Feature Construction as Projection

• Feature construction means increasing the space in which learning happens

• Another set of techniques replaces the feature space instead

• Often these techniques are called dimensionality reduction, and the models that are learned are a new basis for the data.

• Why would you do this?
  - New, possibly unrelated hypothesis space
  - Speed
  - Better visualization


Page 46: VSSML16 L6. Feature Engineering

Principal Component Analysis

• Find the axis that preserves the maximum amount of variance in the data

• Find the axis, orthogonal to the first, that preserves the next largest amount of variance, and so on

• In spite of this description, this isn't an iterative algorithm (it can be solved with a matrix decomposition)

• Projecting the data into the new space is accomplished with a matrix multiplication

• Resulting features are a linear combination of the old features
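
A short scikit-learn sketch of the projection on made-up data: fit finds the new axes, transform is the matrix multiplication mentioned above.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 20))            # stand-in for the original features

pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X)          # 500 x 5: linear combinations of the old columns

print(pca.explained_variance_ratio_)      # variance preserved by each new axis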


Page 47: VSSML16 L6. Feature Engineering

Distance to Cluster Centroids

• Do a k-Means clustering

• Compute the distances from each point to each cluster centroid

• Ta-da! k new features

• Lots of variations on this theme:
  - Normalized / unnormalized, and by what?
  - Average the class distributions of the resulting clusters
  - Take the number of points / spread of each cluster into account
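
A hedged scikit-learn sketch: transform on a fitted k-means model returns exactly those k distances, one column per centroid; the scaling step here is just one choice of normalization, and the data is synthetic.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 6))

X_scaled = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_scaled)

# 300 x 5 matrix: distance from each point to each centroid, usable as new features.
cluster_distances = kmeans.transform(X_scaled)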


Page 48: VSSML16 L6. Feature Engineering

Stacked Generalization: Classifiers as Features

• Idea: Use model scores as input to a “meta-classifier”

• Algorithm:
  - Split the training data into "base" and "meta" subsets
  - Learn several different classifiers on the "base" subset
  - Compute predictions for the "meta" subset with the "base" classifiers
  - Use the scores on the "meta" subset as features for a classifier learned on that subset

• The meta-classifier learns when the predictions of each of the "base" classifiers are to be preferred.

• Pop quiz: Why do we need to split the data?
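
A hedged scikit-learn sketch of that recipe, hand-rolled rather than using StackingClassifier so the steps mirror the slide; the dataset and base learners are arbitrary choices.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# 1. Split the training data into "base" and "meta" subsets.
X_base, X_meta, y_base, y_meta = train_test_split(X, y, test_size=0.5, random_state=0)

# 2. Learn several different classifiers on the "base" subset.
base_models = [
    DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_base, y_base),
    RandomForestClassifier(n_estimators=100, random_state=0).fit(X_base, y_base),
]

# 3. Their scores on the "meta" subset become the meta-features.
meta_features = np.column_stack([m.predict_proba(X_meta)[:, 1] for m in base_models])

# 4. The meta-classifier learns when to trust which base model.
meta_model = LogisticRegression().fit(meta_features, y_meta)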


Page 49: VSSML16 L6. Feature Engineering

Caveats (Again)

• There’s obviously a lot of representational power here

• Surprising sometimes how little it helps (remember, the upstream algorithms are already trying to do this)

• Really easy to get bogged down in details

• Some thoughts that might help you avoid that:
  - Something really good is likely to be good with little effort
  - Think of a decaying exponential where the vertical axis is "chance of dramatic improvement" and the horizontal one is "amount of tweaking"
  - Have faith in both your own intelligence and creativity
  - If you miss something really important, you'll probably come back to it

• Use Flatline as inspiration; it was designed with data scientists in mind


Page 50: VSSML16 L6. Feature Engineering

It’s over!

Questions
