Feature Engineering The Dark Art of Data Science Josh Wills // Senior Director of Data Science

Source: files.meetup.com/17627552/josh-wills-feature-engineering.pdf

Aug 01, 2020


Feature Engineering The Dark Art of Data Science

Josh Wills // Senior Director of Data Science


2 © 2014 Cloudera, Inc. All rights reserved.

About Me


The Two Kinds of Data Scientists

• The Lab
  • Statisticians who got really good at programming
  • Neuroscientists, geneticists, social scientists, etc.
• The Factory
  • Software engineers who were in the wrong place at the wrong time


A Brief History of Data Products


Scorecards


Spell Correction


Virtual Personal Assistants

• Pipeline of not-so-loosely coupled ML systems
  • Speech recognition
  • Semantic decoding
  • Intent model
  • Dialogue rules
  • Language generation
  • Speech synthesis


Machine Learning vs. Feature Extraction


Talking About Feature Engineering


Brainwash


Good: More Features == Better Performance


Good: Feature Development Scales


Bad: Lots of Grunt Work


Deployment: Even More Grunt Work


Analytic Data Model: Giant Spreadsheet


Operational Data Model: 3NF


The Impedance Mismatch


What Do We Need?


One Solution I Thought Might Work


Inventing on Principle


A Data Model For Feature Engineering


Bridging the Gap


Spell Correction Revisited


A Simple Star Schema for Search


A Supernova Schema for Search


Beyond Analytic SQL: Nested SQL Sessions


Exhibit (http://github.com/jwills/exhibit)


Operational Supernovas


ln(supernova)


Data Science and the Holy Grail


Feature Engineering: Within / Across

1. Generate features for normalization/segmentation.
2. Generate normalization constants and segments using the result of Step 1.
3. Generate input features using the original data and the result of Step 2.
4. Generate a model using the result of Step 3.
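
The four steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the method from the talk: the event data, the function names, the choice of a per-user mean as the "within" feature, z-scoring as the normalization, and the threshold "model" are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical raw events: (user_id, query_length). Invented for illustration.
events = [
    ("u1", 3), ("u1", 5), ("u1", 4),
    ("u2", 10), ("u2", 14),
    ("u3", 7),
]

def within_features(rows):
    """Step 1: aggregate WITHIN each entity (here, a per-user mean)."""
    by_user = defaultdict(list)
    for user, value in rows:
        by_user[user].append(value)
    return {user: mean(values) for user, values in by_user.items()}

def across_constants(per_user):
    """Step 2: normalization constants ACROSS entities, from Step 1 output."""
    values = list(per_user.values())
    return {"mean": mean(values), "std": pstdev(values)}

def input_features(rows, constants):
    """Step 3: z-score the original rows using the Step 2 constants."""
    std = constants["std"] or 1.0  # guard against zero spread
    return [(user, (value - constants["mean"]) / std) for user, value in rows]

def fit_model(features):
    """Step 4: a placeholder 'model' that flags rows above the mean z-score."""
    threshold = mean(z for _, z in features)
    return lambda z: z > threshold

per_user = within_features(events)
constants = across_constants(per_user)
features = input_features(events, constants)
model = fit_model(features)
```

Each step consumes only the output of earlier steps, which is what makes the sequence awkward to express as a single pass over the data.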


Feature Engineering IDE


Data Science as ETL Workflow


MetLife’s Wall

• A 360-degree view of all customer data and interactions
• Backed by a NoSQL, supernova-style data model (MongoDB, in this instance)
• Multiple use cases
  • Customer support
  • Exploratory analytics/metrics definitions
  • Profiles/segmentation
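
One way to picture a customer-centric, nested record of this kind is a single document per customer with related records embedded in it. The sketch below assumes that reading; it is not MetLife's actual schema. Every field name, the sample values, and the `customer_features` helper are hypothetical, and a plain Python dict stands in for a stored document.

```python
from datetime import date

# Hypothetical customer document; all field names and values are invented.
customer = {
    "customer_id": "c-001",
    "policies": [
        {"policy_id": "p-1", "type": "life", "start": date(2012, 3, 1)},
        {"policy_id": "p-2", "type": "auto", "start": date(2013, 7, 15)},
    ],
    "interactions": [
        {"channel": "phone", "date": date(2014, 1, 10), "topic": "claim"},
        {"channel": "web", "date": date(2014, 2, 2), "topic": "billing"},
        {"channel": "phone", "date": date(2014, 2, 20), "topic": "claim"},
    ],
}

def customer_features(doc, as_of):
    """Derive a few per-customer features from one nested document."""
    interactions = doc["interactions"]
    phone = sum(1 for i in interactions if i["channel"] == "phone")
    last = max((i["date"] for i in interactions), default=None)
    return {
        "n_policies": len(doc["policies"]),
        "n_interactions": len(interactions),
        "phone_share": phone / len(interactions) if interactions else 0.0,
        "days_since_last_contact": (as_of - last).days if last else None,
    }

features = customer_features(customer, as_of=date(2014, 3, 1))
```

Because everything about the customer sits in one record, the features are computed without joins, and the same document could plausibly serve support lookups, exploratory metrics, and segmentation.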


Queryable Wall


Integrated Model and Feature Store


Feature Application and Evaluation


Exhibitor (http://github.com/jhlch/exhibitor)


Thank you.