Laws and Limits of Data Science: The Next Decade Michael L. Brodie

Laws and limits of data science 11 10-14

Jul 02, 2015


Data & Analytics

Michael Brodie

Keynote Analytics Week, Boston, MA November 7, 2014
Big Data is in its infancy and is opening the door to profound change - Grand Opportunities (Accelerating Scientific Discovery) and Grand Challenges to be addressed over the next decade. We explore the premise that Data Science is to data-intensive discovery as the Scientific Method is to scientific discovery, leading us to potential Laws and Limits of Data Science, and then to Best Practices.
Transcript
Page 1: Laws and limits of data science 11 10-14

Laws and Limits of Data Science: The Next Decade

Michael L. Brodie

Page 2: Laws and limits of data science 11 10-14


Big Data is Opening the door to …

Page 3: Laws and limits of data science 11 10-14

Grand Opportunities: Accelerating Scientific Discovery …

Page 4: Laws and limits of data science 11 10-14

Grand Challenges: Many – efficacy, efficiency, …

Page 5: Laws and limits of data science 11 10-14

What is Big Data?

•  Defining Big Data constrains this emerging phenomenon
•  Since Big Data is not
   –  About data, but a problem-solving ecosystem
   –  A discipline, but a multidisciplinary sub-domain of most disciplines*
•  What matters is what we will do with Big Data
•  Big Data is opening the door to profound change in
   –  Processing
   –  Thinking
•  Let’s use the potential of profound change to understand Big Data


* “transformative … changing academia (… emerged .. on the critical path for their sub-discipline)” and is changing society” Michael Jordan.

Page 6: Laws and limits of data science 11 10-14

Starting to Understand Big Data

•  Listen to Data
   –  Hypothesis generation → overcome limits of human cognition*
•  Multiple, Simultaneous Perspectives
   –  Ensemble models → Accelerating Scientific Discovery*
•  And many more …


* Necessary condition: human-guidance

Page 7: Laws and limits of data science 11 10-14

Big Data is in its infancy
With at least decade-long challenges

Page 8: Laws and limits of data science 11 10-14

Outline
•  Big Picture: Why and What
•  Grand Opportunities
•  Grand Challenges
   –  Efficacy, amongst many
•  Laws and Limits of Data Science

Page 9: Laws and limits of data science 11 10-14

Big Picture: Scientific Method

[Diagram labels: Phenomenon; Hypothesis; Experiment; Model; Causality]

Page 10: Laws and limits of data science 11 10-14

Big Picture: Why & What

[Diagram labels: Phenomenon; Experiment; Model; What (Big Data) – Correlation: What might occur; Why (Empiricism) – Causation: Why it occurs]

Page 11: Laws and limits of data science 11 10-14

Why: Scientific Method and the Search for Causation

History of Science and the Scientific Method
Mature Disciplines: Empiricism, Clinical Studies, Drug Discovery

The Holy Grail of science is to identify accurate causality.

Empirical, clinical trial, and drug discovery methods take time: 100+ years

Three Ages of Medicine [The Remedy: Goetz]
Free-for-All: 1850s–1940s
Rise of Trials: 1940s–2010s
Beyond the Lab: Post-2010

Page 12: Laws and limits of data science 11 10-14

What: Models and the Search for Meaningful Correlations

•  History of Modelling: mathematics, sciences, computing, …
•  Disciplines
   –  Mature (theory-driven): math, physics, statistics, …
   –  Emerging (data-driven): data mining, machine learning, neural networks, support vector machines, …

The Holy Grail of data-intensive discovery is correlations that are meaningful.

Correlation does not imply causation

•  Methodologies
   –  Mature: 100s of years
   –  Emerging: at least a decade

The Holy Grail of data-intensive discovery is correlations that are accurate and reliable.
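The point that correlation does not imply causation can be illustrated with a small simulation. This is only a sketch under stated assumptions, not something from the talk: the hidden common cause `z`, the noise levels, and the function names are all hypothetical. Two variables that never influence each other correlate strongly in observational data because both depend on `z`. When one of them is set at random, as in an experiment, the correlation disappears.

```python
import random


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def observed_vs_experiment(n=5000, seed=0):
    """Return (observational correlation, experimental correlation).

    Assumed toy model: x and y are each driven by a hidden variable z
    plus independent noise; x has no effect on y.
    """
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]            # hidden common cause
    x_obs = [zi + rng.gauss(0, 0.5) for zi in z]       # observed x
    y_obs = [zi + rng.gauss(0, 0.5) for zi in z]       # observed y
    # "Experiment": x is assigned at random, independently of z.
    x_set = [rng.gauss(0, 1) for _ in range(n)]
    y_exp = [zi + rng.gauss(0, 0.5) for zi in z]       # y still ignores x
    return pearson(x_obs, y_obs), pearson(x_set, y_exp)


if __name__ == "__main__":
    r_obs, r_exp = observed_vs_experiment()
    print(f"observational r = {r_obs:.2f}, experimental r = {r_exp:.2f}")
```

In this toy model the observational correlation is large while the experimental one is close to zero. The data alone answers "what", and the experiment is needed to address "why".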

Page 13: Laws and limits of data science 11 10-14

GRAND OPPORTUNITIES Big Data

Page 14: Laws and limits of data science 11 10-14

Accelerating Scientific Discovery

[Diagram labels: Experiment; Model; Correlations; Hypotheses; Why: Causation; What: Correlation; Data Driven; Theory Driven]

Page 15: Laws and limits of data science 11 10-14

Accelerating Scientific Discovery

[Diagram labels: Experiment; Model; Correlations; Hypotheses; Why: Causation; What: Correlation; Data Driven; Theory Driven; Watson; Baylor; Scientists]

Wonderful Use Case

Page 16: Laws and limits of data science 11 10-14

Grand Challenges
•  Big Data is in its infancy: 10+ year evolution
   –  Efficiency: expression/language → execution (stack)
   –  Open Data: data use/reuse / sharing
   –  Efficacy

“major engineering and mathematical challenge, one that will not be solved by just gluing together a few existing ideas from statistics, optimization, databases and computer systems.” Michael Jordan

Page 17: Laws and limits of data science 11 10-14

“wrt to Big Data we’re now at the what are the principles? point in time”. Michael Jordan

Page 18: Laws and limits of data science 11 10-14

What is Data Science @ Scale?

Data Science @ scale is to data-intensive discovery as the Scientific Method is to scientific discovery

Reframe Empiricism*
   –  Data Science is the data component of the Scientific Method for data
   –  Concepts, tools, and techniques for data-intensive discovery
      •  Data-intensive discovery = virtual experiment
   –  Laws and Limits of Data Science

* With Dr. Jennie Duggan, MIT & Northwestern University

Page 19: Laws and limits of data science 11 10-14

First Law of Data Science

Meaning of a correlation requires empirical verification

What is seldom enough
Why is not always necessary

Best Practice #1: Efficacy-driven data discovery

(Efficacy before efficiency)
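One way to read the First Law is that a correlation found by searching the data should be re-checked on data that played no part in the search. The sketch below is an illustration under assumptions rather than the author's method: the data is pure noise, and the sizes and function names are hypothetical. It picks the feature most correlated with a target on a discovery set, then verifies it on a held-out set.

```python
import random


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def discover_then_verify(n_discover=100, n_verify=300, n_features=500, seed=1):
    """Return (|r| on the discovery rows, |r| on the held-out rows).

    Every feature and the target are independent noise (assumption),
    so any correlation that is "discovered" is spurious by construction.
    """
    rng = random.Random(seed)
    n_rows = n_discover + n_verify
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    features = [[rng.gauss(0, 1) for _ in range(n_rows)]
                for _ in range(n_features)]

    # Discovery: search all features for the strongest correlation.
    scores = [abs(pearson(f[:n_discover], target[:n_discover]))
              for f in features]
    best = max(range(n_features), key=scores.__getitem__)

    # Verification: re-measure the winner on rows the search never saw.
    verified = abs(pearson(features[best][n_discover:], target[n_discover:]))
    return scores[best], verified


if __name__ == "__main__":
    found, verified = discover_then_verify()
    print(f"discovered |r| = {found:.2f}, verified |r| = {verified:.2f}")
```

With hundreds of candidate features, the best one looks correlated on the discovery rows purely by chance. On the held-out rows the effect shrinks toward zero, which is the sense in which "what" is seldom enough.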

Page 20: Laws and limits of data science 11 10-14

Second Law of Data Science*

Causality can be determined from correlations only by community-accepted mechanisms and metrics**, e.g., empiricism.

* With Gregory Piatetsky-Shapiro, KDNuggets

** for What and Why

Page 21: Laws and limits of data science 11 10-14

Limits of Data Science

We do not know where our concepts, tools, and techniques break on massive data sets!

Caution: Big Data Winter Potential (Michael Jordan)

Best Practice #2: Experiment + Error bars everywhere
   –  Common Practice: not so much

Best Practice #3: Machine-driven, human-guided
   –  Common Practice: not so much
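"Error bars everywhere" can be as simple as reporting an interval next to every estimate. The following is a minimal sketch, assuming a percentile bootstrap is an acceptable way to get that interval. The talk does not prescribe a method, and the function names and defaults here are hypothetical.

```python
import random


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def bootstrap_ci(xs, ys, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the Pearson correlation.

    Resamples (x, y) pairs with replacement. Assumes continuous data,
    so a resample with zero variance is not a practical concern.
    """
    rng = random.Random(seed)
    n = len(xs)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(pearson([xs[i] for i in idx], [ys[i] for i in idx]))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


if __name__ == "__main__":
    rng = random.Random(42)
    xs = [rng.gauss(0, 1) for _ in range(300)]
    ys = [0.5 * x + rng.gauss(0, 1) for x in xs]   # synthetic example data
    low, high = bootstrap_ci(xs, ys)
    print(f"r = {pearson(xs, ys):.2f}, 95% interval = [{low:.2f}, {high:.2f}]")
```

Reporting the interval rather than the bare number shows how much of a correlation could be sampling noise.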

Page 22: Laws and limits of data science 11 10-14

Best Practice Not So Common*
•  BP1: Efficacy-driven data discovery
   –  Best eScience, Journalism, Economics, Computational X, …
   –  Big Data not so much (<5%)
•  BP2: Experiment + Error bars everywhere
   –  Above + Best Data Scientists (~5%, w/scientific, ML, … training)
   –  Big Data (<5%): Customers don’t ask; data scientists don’t practice
•  BP3: Machine-driven, human guided
   –  ~5% strict; 95% not so much, e.g., ~60 Data Curation products
   –  50% partial: supervised / trained
•  Example: based on the above Laws and Best Practices

*Personal un-scientific study, limited data, yet so unbiased and oh so true

Page 23: Laws and limits of data science 11 10-14

Laws of Data Science Less So …

1st Correlations ≠ Causation
Common confusion in science*, more in Data Science, even more in business

2nd Causality (meaning) requires verification by community-accepted norms
Cornerstone of Science, hopefully emerging in Data Science**

*Richard Feynman, 1974
** If #1 is rare, #2 is more so

Page 24: Laws and limits of data science 11 10-14

Conclusions
•  Big Data is in its infancy and is opening the door to …
•  Grand Opportunities
•  Grand Challenges
•  10+ year evolution
•  Data Science ~= Scientific Method For Data
•  Laws of Data Science
   1  Correlations must be verified
   2  Verification relative to community-accepted norms
•  Data Science Best Practices
   1  Efficacy-driven discovery
   2  Experiment + Error Bars everywhere
   3  Machine-Driven – Human Guided
•  Limit of Data Science: we do not know where our tools break
