MLConf NYC 2014
Josh Wills, Senior Director of Data Science
Cloudera

A Little Bit About Me

An Experience I Had Recently

The Two Kinds of Data Scientists

• The Lab
  • Statisticians who got really good at programming
  • Neuroscientists, geneticists, etc.
• The Factory
  • Software engineers who were in the wrong place at the wrong time

The Lab and The Factory

Analytics in the Lab

• Question-driven
• Interactive
• Ad-hoc, post-hoc
• Fixed data
• Focus on speed and flexibility
• Output is embedded into a report or in-database scoring engine

Analytics in the Factory

• Metric-driven
• Automated
• Systematic
• Fluid data
• Focus on transparency and reliability
• Output is a production system that makes customer-facing decisions

Data Science In The Factory

On Icebergs

The Impedance Mismatch

What Do We Need?

Apache Spark

A Feature Extraction DSL for Spark

The R Formula Specification

So Why Doesn’t This Exist Yet?

Functional Programming to the Rescue

Data Science in the Lab

Great Tools for Investigative Analytics

Cloudera Impala

LLVM and NUMBA

Python UDFs for Impala

• github.com/cloudera/impyla
• Already There
  • Numeric and boolean types (as native Python objects)
• In Progress
  • String support
  • C/C++ function integration
• Planned
  • Struct/tuple and array types
  • UDAFs
  • Include support for PyData stack (scikit-learn, NLTK)
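
As a rough illustration of the "Already There" item, a scalar UDF body restricted to numeric and boolean values might look like the sketch below. The function names (`in_range`, `scaled`) and the convention of passing SQL NULL as `None` are assumptions made for illustration; they are not impyla's API. The slides do not show how a function is compiled or registered with Impala, so that step is omitted.

```python
from typing import Optional


def in_range(value: Optional[float], low: float, high: float) -> Optional[bool]:
    """Hypothetical boolean UDF body: is value within [low, high]?"""
    # Assumption: SQL NULL arrives as None; propagate it instead of guessing.
    if value is None:
        return None
    return low <= value <= high


def scaled(value: Optional[float], factor: float) -> Optional[float]:
    """Hypothetical numeric UDF body: multiply a column value by a constant."""
    if value is None:
        return None
    return value * factor


if __name__ == "__main__":
    # Apply the functions row by row, the way a scalar UDF is evaluated.
    column = [1.5, None, 12.0]
    print([in_range(v, 0.0, 10.0) for v in column])
    print([scaled(v, 2.0) for v in column])
```

Both functions use only native Python numbers and booleans, which is the subset the slide lists as already supported; strings, tuples, and arrays are deliberately avoided.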

Josh Wills, Director of Data Science, Cloudera
@josh_wills

Thank you!
