Data Science Consulting or Science meets business, again. Third time a charm? David Johnston ThoughtWorks March 17, 2014

NYC Open Data Meetup-- Thoughtworks chief data scientist talk

May 10, 2015


Transcript
Page 1: NYC Open Data Meetup-- Thoughtworks chief data scientist talk

Data Science Consulting or

Science meets business, again. Third time a charm?

David Johnston, ThoughtWorks

March 17, 2014

Page 2: NYC Open Data Meetup-- Thoughtworks chief data scientist talk

Young scientists become…

[Figure: career destinations of young scientists, with segments labeled Professors, Programmers, Traders and Data scientists]

Page 3: NYC Open Data Meetup-- Thoughtworks chief data scientist talk

Talk Overview

• Agile Analytics group at ThoughtWorks

• What is data science anyway? Origins and future. Good or evil?

• Guide to technologies and limits to technology

• Process and methodology for successful data science consulting

Page 4: NYC Open Data Meetup-- Thoughtworks chief data scientist talk

ThoughtWorks

• Global software consulting company

• HQ in Chicago. Major offices in NY, San Fran, Dallas, India, Brazil, Australia, China - over 30 worldwide.

• Privately owned by Roy Singham

• Flat hierarchy of passionate people

Page 5: NYC Open Data Meetup-- Thoughtworks chief data scientist talk

The three pillars

Page 6: NYC Open Data Meetup-- Thoughtworks chief data scientist talk

Agile Analytics at TW

• Practice started in 2011

• Led by Ken Collier and John Spens

• About a dozen people involved

Key Themes

• BI, data warehousing and analytics have largely missed the revolution in agile methodologies.

• We can do analytics in an agile, fast, light-footprint way.

Page 7: NYC Open Data Meetup-- Thoughtworks chief data scientist talk

What do we do?

• Probabilistic modeling

• Predictive analytics / machine learning

• Advanced BI, prescriptive analysis

• Big Data technologies

• Advanced algorithms and data structures, streaming

Our main goals

• Use data analysis to give companies an edge in their marketplace

• Use data analysis to improve the world at large

Page 8: NYC Open Data Meetup-- Thoughtworks chief data scientist talk

Some typical projects

• Recommender systems

• Customer behavior analysis

• Optimization

• Efficient algorithms/tech for massive data sets

• Company-specific analytics challenges

Page 9: NYC Open Data Meetup-- Thoughtworks chief data scientist talk

Case Study 1: HealthCare Group Purchasing Organization

• One of the largest GPOs. 1000s of client hospitals

• Hospitals sign up, pay a fee and get group-purchasing discounts

• The GPO has to make estimates to hospitals on their likely savings.

• A hospital’s data is usually in a non-standard spreadsheet. No SKUs in healthcare (yet).

• A data matching mess

Page 10: NYC Open Data Meetup-- Thoughtworks chief data scientist talk

Case Study 1: HealthCare Group Purchasing Organization

GPO: Johnson & Johnson Sterile Scalpel #F8-505

Hospital: J&J scalpel, steel item f8505 size 3’’

• Their in-place solution – Oracle, lots of ETL tools, using SQL with lots of rigid rules for how to match.

• Database of matching rules was difficult to maintain

• Accuracy of matching ~60%. Rest was done by hand. Took 1 day for processing and weeks for lines done by hand.

Page 11: NYC Open Data Meetup-- Thoughtworks chief data scientist talk

Case Study 1: HealthCare Group Purchasing Organization

What we did

• First convinced them that their solution was highly inefficient.

• Wrote a Python program using a tree data structure and machine learning to do the matching.

• Ran on my laptop in a few minutes. Match rates > 80%

• This was done in 3 weeks. Later settled on a solution using Elasticsearch.
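
As a rough illustration of why flexible matching can beat rigid rules on lines like the scalpel example, here is a minimal Python sketch. It is not the tree-plus-machine-learning program from the talk, whose details are not given; it swaps in plain token normalization and Jaccard similarity. The function names and the second catalog line are hypothetical.

```python
import re

def tokens(text):
    """Lowercase, split on whitespace and strip punctuation inside each token,
    so '#F8-505' and 'f8505' both become 'f8505'."""
    cleaned = (re.sub(r"[^a-z0-9]", "", tok) for tok in text.lower().split())
    return {tok for tok in cleaned if tok}

def jaccard(a, b):
    """Share of distinct tokens that the two descriptions have in common."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def best_match(hospital_line, catalog):
    """Return the catalog entry most similar to the hospital's free-text line."""
    return max(catalog, key=lambda item: jaccard(hospital_line, item))

catalog = [
    "Johnson & Johnson Sterile Scalpel #F8-505",  # from the slide
    "Johnson & Johnson Surgical Gloves #G2-110",  # hypothetical second entry
]
hospital_line = "J&J scalpel, steel item f8505 size 3''"

match = best_match(hospital_line, catalog)
print(match)
```

A real system would also need abbreviation handling (J&J vs. Johnson & Johnson) and a confidence threshold below which lines go to manual review.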

Page 12: NYC Open Data Meetup-- Thoughtworks chief data scientist talk

Case Study 2: Retail Rec Systems

• Client provides coupons to retailers’ customers

• Needed a better recommendation system

• We’re using a simple logistic regression model
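
For readers unfamiliar with the model, the following is a minimal pure-Python sketch of logistic regression fitted by gradient descent. The features (whether the shopper bought in the coupon's category before, and the discount fraction), the toy data and all function names are assumptions for illustration; the talk does not describe the actual features or training setup.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, x):
    """Predicted probability that the coupon is redeemed."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

def fit_logistic(rows, labels, lr=0.5, epochs=2000):
    """Batch gradient descent on the average log-loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = predict(weights, bias, x) - y  # gradient of log-loss w.r.t. the logit
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

# Hypothetical training data: [bought_in_category_before, discount_fraction]
rows = [[1, 0.1], [1, 0.3], [0, 0.1], [0, 0.3], [1, 0.2], [0, 0.2]]
labels = [1, 1, 0, 0, 1, 0]  # 1 = coupon redeemed

weights, bias = fit_logistic(rows, labels)
p_repeat = predict(weights, bias, [1, 0.2])
p_new = predict(weights, bias, [0, 0.2])
print(round(p_repeat, 3), round(p_new, 3))
```

Ranking coupons by predicted redemption probability is one simple way to turn such a model into recommendations.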

Page 13: NYC Open Data Meetup-- Thoughtworks chief data scientist talk

What exactly is data science?

• Is this really new?

• Does the term “data science” make any sense?

• Is it just a fad? Over-hyped?

• Why did this term just become popular a few years back?

• Where is this going?

• Should scientists/engineers/math-types really go and make a career doing this?

Page 14: NYC Open Data Meetup-- Thoughtworks chief data scientist talk

What exactly is data science?

• Is this really new? - Not really

• Does the term “data science” make any sense? - Not really but so what?

• Is it just a fad? Over-hyped? – No. Sometimes.

• Why did this term just become popular a few years back? - Productivity

• Where is this going?

• Should scientists/engineers/math-types really go and make a career doing this? - Yes, for most

Page 15: NYC Open Data Meetup-- Thoughtworks chief data scientist talk

Is it new?

Of course not

Combination of many subjects:

• Mathematics and statistics – probability theory

• Machine learning

• Computer science – algorithms, data structures, databases

• Operations research – process optimization

• Business consulting

• Software development

Where have we seen this before?

Business: Finance, Insurance, Sports, Government accounting, Retail, Google

Science: Physics, Astronomy, Biology

Page 16: NYC Open Data Meetup-- Thoughtworks chief data scientist talk

Isn’t there anything new?

Of course

• Analytics finally becoming ubiquitous in business (as it always should have been)

• Much more communication between disparate fields

• It’s finally work that’s fun

Ok, but why now?

It’s a big movement, so let’s give it a new name: Data Science

Page 17: NYC Open Data Meetup-- Thoughtworks chief data scientist talk

Why now? - Productivity

• There has always been plenty of data science in science

• Job prospects in academia are slim

• Productivity has been rising much faster than postdoc salaries and scientist job creation

Page 18: NYC Open Data Meetup-- Thoughtworks chief data scientist talk

Data scientist productivity growth

• Salary increase over postdoc requires ~2.5x

• Salaries in Industry are set by productivity and supply/demand

• Crossing the threshold in productivity leads to new job creation

• Slowing productivity growth and/or changes in supply/demand will eventually end this burst in job creation

• Nothing magical happened in 2005!

Page 19: NYC Open Data Meetup-- Thoughtworks chief data scientist talk

Productivity Drivers for Data-science

Long time scale

• Compute, Moore’s law

• The internet (duh!)

• HD and RAM price drop

• Science learns to deal with Big Data

• Growing importance of statistics

More recent

• Git, code-sharing

• Machine learning libraries

• Python/R, open source

• Hadoop and ecosystem

• The Cloud, AWS

• NoSQL databases, in-memory

• Growing “data science” community: cohesion, feedback effects of popularity

Page 20: NYC Open Data Meetup-- Thoughtworks chief data scientist talk

Then and now

1990s data science

• Writing code in C/C++

• Working with flat files

• Even relational/SQL is new

• Using proprietary software: Matlab, IDL

• Writing all algorithms from scratch. Slow. Buggy.

Data science today

• Working in high-level open-source languages: Python, R

• We’re good at SQL and have lots of other options (NoSQL)

• Git, thousands of libraries available. Easy to install.

• Can concentrate more on what we’re good at.

Page 21: NYC Open Data Meetup-- Thoughtworks chief data scientist talk

So what is data science now

Data Science:

An interdisciplinary field utilizing statistics, computer science and the methods of scientific research in areas outside of science.

Page 22: NYC Open Data Meetup-- Thoughtworks chief data scientist talk

Where is it going?

• Big Data technology is separated from data science

• Software developers take over much of Big Data roles

• Businesses that are not Twitter begin to understand data science terminology, as they now understand software terminology.

• Data scientists and businesses find a methodology that works, as industrial-scale software development has

Page 23: NYC Open Data Meetup-- Thoughtworks chief data scientist talk

Where is it going?

Specialization

• Most experienced data scientists move into consulting or management of teams

• Universities graduate many “data scientist-lite” students from new more specialized BS or MA programs

• Fewer generalists

• PhD students need to learn additional skills. Not instant hires (http://bit.ly/1m3krq6)

Page 24: NYC Open Data Meetup-- Thoughtworks chief data scientist talk

Why won’t we have 100x more data scientists in N years?

• Pool of disgruntled postdocs will dry up or “I am not even supposed to be here!”

• Many data science problems don’t need the most cutting edge tools. (Some do).

• People rarely get much experience working with real data in academic settings. Requires real-world experience, takes time.

Page 25: NYC Open Data Meetup-- Thoughtworks chief data scientist talk

Are we there yet?

Overhyped, underhyped, mis-hyped?

• No, probably not

• Productivity growth is real

• We are solving important problems. Plenty left.

• Big Data will probably peak in the hype cycle before data science

• Just watched my first analytics commercial. IBM.

Page 26: NYC Open Data Meetup-- Thoughtworks chief data scientist talk

Why Big Data enthusiasm might peak soon

Big Data defined – Process for performing calculations on data that:

• Cannot possibly be done on a single machine

• When sampling and streaming are not effective

• When data reduction is not possible

• When storage and compute are closely balanced

• Parallelizing is absolutely unavoidable

Most tasks are not like this

• Sampling is usually good enough for training machine learning

• Need for rapid feedback, interactive work

• CPUs are underutilized. IO limited.

• Usually a better algorithm can solve the problem better
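
To make the sampling point concrete, here is a minimal Python sketch of reservoir sampling (Algorithm R), one standard way to draw a fixed-size uniform sample from a stream in a single pass without holding the whole data set in memory. The talk does not name a specific technique; the function name, stream and sample size are assumptions for illustration.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Return k items drawn uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample

# Stand-in for a large file or event log read one record at a time.
sample = reservoir_sample(range(1_000_000), k=1000, seed=0)
print(len(sample))
```

The sample can then be used for model development on a laptop, leaving the full data set for batch scoring.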

Page 27: NYC Open Data Meetup-- Thoughtworks chief data scientist talk

Hadoop (Spark)

Good use cases

• Large batch jobs like: restructuring and reducing data from raw files.

• Scoring with ML models

• When you have to do something on every data point.

• Raw storage in HDFS

Bad use cases

• Model development

• Visualization

• Brute-forcing an inefficient algorithm.

• Treating Hadoop like a database.

Page 28: NYC Open Data Meetup-- Thoughtworks chief data scientist talk

The data-sizes we typically see

Most companies have a few million customers (~10^7)

Often they store ~1000 items per customer

That’s 10^10 data points. At 5 bytes/data-point that is 50 GB; at 50 or more bytes/data-point it is 500 GB or a few TB. Either way it fits on our laptops (but not in memory). Such data can be moved to the cloud if need be in 1-2 days.

Often we can be productive with either a sample or an aggregation.

True when

• Customer-specific items are things like purchases, manually entered text, logins, etc.

Not true when

• Things are web events, pair-wise interactions (e.g. graphs, social)
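
The sizing estimate on this slide can be checked with a few lines of Python. The customer and item counts come from the slide; the bytes-per-point values other than 5 are assumptions added for comparison. Note that 10^10 points at 5 bytes each comes to 50 GB, while 500 GB corresponds to about 50 bytes per point.

```python
customers = 10**7           # "a few million customers", rounded up as on the slide
items_per_customer = 1000
data_points = customers * items_per_customer

def size_gb(bytes_per_point):
    """Raw size in gigabytes (10^9 bytes), ignoring indexes and compression."""
    return data_points * bytes_per_point / 1e9

for bytes_per_point in (5, 50, 500):  # 50 and 500 are assumed record sizes
    print(f"{bytes_per_point:>3} bytes/point -> {size_gb(bytes_per_point):,.0f} GB")
```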

Page 29: NYC Open Data Meetup-- Thoughtworks chief data scientist talk

Sources of really big data

Sensor data

• Pictures

• Video

• Health monitoring devices

• Internal device monitors

• Results of combinatorial complexity

However

• Is it really economic to store and process these huge data sets to begin with?

• We will learn to use streaming algorithms

• We will learn to focus on information, not noise
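
As one example of a streaming algorithm, the sketch below keeps a running mean and variance of sensor readings using Welford's method, so each reading can be discarded after it is processed. The class name and the readings are hypothetical; the talk does not single out a particular algorithm.

```python
class RunningStats:
    """Single-pass mean and population variance (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / self.n if self.n else 0.0

stats = RunningStats()
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # hypothetical sensor values
    stats.update(reading)

print(stats.mean, stats.variance)
```

Memory use is constant regardless of how many readings arrive, which is the property that makes storing the raw stream optional.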

Page 30: NYC Open Data Meetup-- Thoughtworks chief data scientist talk

Case study: Particle Physics

Data reduction par excellence

• 600 million collisions per second

• Most are boring events and are not saved

• Save ~ 100 petabytes per year

Determine existence of Higgs boson – 1 bit

Measure its mass to 1% ~ 1 byte

Data = Exabytes

Information = 9 bits

Compression 10^18

Goal

$9 billion per byte!
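
The compression figure on this slide can be sanity-checked with quick arithmetic. This sketch assumes "exabytes" means on the order of one exabyte (10^18 bytes); the slide does not give an exact total.

```python
data_bytes = 10**18        # assumption: one exabyte of raw collision data
data_bits = data_bytes * 8

info_bits = 1 + 8          # 1 bit for existence + ~1 byte for the mass

ratio = data_bits / info_bits
print(f"compression ratio ~ {ratio:.1e}")
```

The ratio comes out a little under 10^18, consistent with the order of magnitude quoted on the slide.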

Page 31: NYC Open Data Meetup-- Thoughtworks chief data scientist talk

Data science consulting

The good

• Always something new, always learning.

• Exposed to many different people.

• Get to see how everything works on the inside.

• See the world!

• Low career risk but still fun.

The bad

• Your clients choose you

• People problems often more important than math problems

• Travel can be extreme

• Your great ideas will rarely be credited to you.

Page 32: NYC Open Data Meetup-- Thoughtworks chief data scientist talk

Challenges in data science consulting

• Businesses don’t yet understand the terminology, process or techniques. Much teaching involved.

• Visionary CEO sends you into a not-so-visionary environment

• Problems can be vague

• Communication with business stakeholders takes much of your time

• We are still developing an effective model. More than just agile techniques

Page 33: NYC Open Data Meetup-- Thoughtworks chief data scientist talk

Red flags to avoid

• “Build us a platform for analytics so we can become a data-driven company.” Non sequitur.

• Wanting prediction of the unpredictable

• Attempting to use ML on noisy data

• When incentives and opinions are all over the map

• Convinced that the problem was solved 20 years ago. E.g. linear regression, segmentation model, SAS.

Page 34: NYC Open Data Meetup-- Thoughtworks chief data scientist talk

Keep offering up bold ideas

• Look for ways to achieve major productivity enhancements

• Keep up on cutting-edge literature in stats/ML

• All my best ideas for web-apps are now successful companies.

• Everybody laughed at them!

Data science is NOT going to be productized.

FIN