Page 1: Winning With Big Data:  Secrets of the Successful Data Scientist

WINNING WITH

BIG DATA

Michael Driscoll, @dataspora

SDForum BI SIG, June 15, 2010

Secrets of the Successful

Data Scientist

Page 2: Winning With Big Data:  Secrets of the Successful Data Scientist

WHY DATA MATTERS NOW

Page 3: Winning With Big Data:  Secrets of the Successful Data Scientist

THE INDUSTRIAL AGE OF DATA

Page 4: Winning With Big Data:  Secrets of the Successful Data Scientist

WHAT IS BIG DATA?

Data that is distributed.

class    size          manage with                    how it fits                     examples
small    < 10 GB       Excel, R                       fits in one machine’s memory    thousands of sales figures
medium   10 GB - 1 TB  indexed files, monolithic DB   fits on one machine’s disk      millions of web pages
big      > 1 TB        Hadoop, distributed DBs        stored across many machines     billions of web clicks

Page 5: Winning With Big Data:  Secrets of the Successful Data Scientist

WHAT IS DATA SCIENCE?

Page 6: Winning With Big Data:  Secrets of the Successful Data Scientist

WHY DATA SCIENCE IS SEXY

Page 7: Winning With Big Data:  Secrets of the Successful Data Scientist

+ =

“The sexy job in the next ten years will be statisticians…” - Hal Varian

Page 8: Winning With Big Data:  Secrets of the Successful Data Scientist
Page 9: Winning With Big Data:  Secrets of the Successful Data Scientist

data          model
1000 bytes    2 bytes

Page 10: Winning With Big Data:  Secrets of the Successful Data Scientist

9 WAYS TO WIN WITH DATA

Page 11: Winning With Big Data:  Secrets of the Successful Data Scientist

1. CHOOSE THE RIGHT TOOL

You don’t need a chainsaw to cut butter.

Page 12: Winning With Big Data:  Secrets of the Successful Data Scientist

2. COMPRESS EVERYTHING

The world is IO-bound.

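# Dump, compress, ship over ssh, then decompress and load on the target.
# Note: -p takes the password with no space (-pmypass); "-p mypass" would prompt instead.
mysqldump -u myuser -pmypass sourceDB | gzip | \
  ssh [email protected] "gunzip | mysql -u myuser -pmypass targetDB"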
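
As a rough illustration of why compressing before moving data pays off when I/O is the bottleneck, here is a minimal Python sketch. The sample rows are made up for this example (they are not from the talk), and the ratio on real data will differ.

```python
import gzip


def compression_ratio(raw: bytes) -> float:
    """Return the original size divided by the gzip-compressed size."""
    compressed = gzip.compress(raw)
    return len(raw) / len(compressed)


# Hypothetical, repetitive CSV-like log rows (an assumption for illustration).
rows = "".join(f"2010-06-15,user{i % 100},click\n" for i in range(10_000))
raw = rows.encode("utf-8")

ratio = compression_ratio(raw)
# gzip is lossless: decompressing must give back exactly the original bytes.
roundtrip_ok = gzip.decompress(gzip.compress(raw)) == raw

print(f"raw={len(raw)} bytes, ratio={ratio:.1f}x, lossless={roundtrip_ok}")
```

Fewer bytes on the wire or on disk means less time waiting on I/O, at the cost of some CPU for compression.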

Page 13: Winning With Big Data:  Secrets of the Successful Data Scientist

3. SPLIT UP YOUR DATA

Split, apply, combine.
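
One way to picture split, apply, combine is the sketch below, written with only the Python standard library. The function name `split_apply_combine` and the sales records are hypothetical, chosen for illustration rather than taken from the talk.

```python
from collections import defaultdict


def split_apply_combine(records, key, value, apply):
    """Group records by `key`, apply a function to each group's values, collect results."""
    # Split: bucket the values by group key.
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec[value])
    # Apply and combine: one result per group, gathered into a dict.
    return {k: apply(v) for k, v in groups.items()}


# Hypothetical sales records (an assumption for illustration).
sales = [
    {"region": "west", "amount": 10},
    {"region": "east", "amount": 5},
    {"region": "west", "amount": 7},
    {"region": "east", "amount": 1},
]

totals = split_apply_combine(sales, "region", "amount", sum)
print(totals)
```

Because each group is processed independently, the apply step can in principle run on separate machines, which is what makes the pattern a good fit for distributed data.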

Page 14: Winning With Big Data:  Secrets of the Successful Data Scientist

4. WORK WITH SAMPLES

Big Data is heavy, samples are light.

perl -ne "print if (rand() < 0.01)" data.csv > sample.csv
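
The Perl one-liner above keeps each line with probability 0.01. A comparable Python sketch follows; the `seed` parameter is an addition made here so the sample can be reproduced, and the input rows are made up for illustration.

```python
import random


def sample_lines(lines, rate=0.01, seed=None):
    """Keep each line independently with probability `rate` (Bernoulli sampling)."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < rate]


# Hypothetical input: 100,000 rows standing in for data.csv.
lines = [f"row{i}\n" for i in range(100_000)]

sample = sample_lines(lines, rate=0.01, seed=42)
print(f"kept {len(sample)} of {len(lines)} lines")
```

The sample size is random and only approximately 1% of the input. If an exact sample size is needed, a different technique such as reservoir sampling would be more appropriate.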

Page 15: Winning With Big Data:  Secrets of the Successful Data Scientist

5. USE STATISTICS

Page 16: Winning With Big Data:  Secrets of the Successful Data Scientist

6. COPY FROM OTHERS

Use open source.

git clone git://github.com/kevinweil/hadoop-lzo

Page 17: Winning With Big Data:  Secrets of the Successful Data Scientist

7. ESCHEW CHART TYPOLOGIES

Charts are compositions, not containers.

Page 18: Winning With Big Data:  Secrets of the Successful Data Scientist

8. COLOR WITH CARE

Color can enhance or insult.

Page 19: Winning With Big Data:  Secrets of the Successful Data Scientist

9. TELL A STORY

People are listening.

Page 20: Winning With Big Data:  Secrets of the Successful Data Scientist

ONE SUCCESS STORY

Page 21: Winning With Big Data:  Secrets of the Successful Data Scientist

WHY DO TELCO CUSTOMERS LEAVE?

Sign up Leave

Goal: “less churn.”

Page 22: Winning With Big Data:  Secrets of the Successful Data Scientist

DATA: BILLIONS OF CALLS

… and millions of callers.

Page 23: Winning With Big Data:  Secrets of the Successful Data Scientist

DOES CALL QUALITY MATTER?

… a difference, but not significant.
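
The talk does not say which test was used to judge significance. As one possible illustration, the sketch below runs a two-proportion z-test on churn counts for customers with poor versus good call quality. All counts are hypothetical and the choice of test is an assumption made here.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for a difference between two proportions x1/n1 and x2/n2."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # common rate under the null hypothesis
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: churners among 1000 poor-quality and 1000 good-quality customers.
z, p = two_proportion_z_test(x1=33, n1=1000, x2=30, n2=1000)
print(f"z={z:.2f}, p={p:.2f}")
```

With these made-up counts the churn rates differ (3.3% versus 3.0%), but the p-value is well above 0.05, so the difference would not be called significant.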

Page 24: Winning With Big Data:  Secrets of the Successful Data Scientist

WHAT ABOUT SOCIAL NETWORKS?

Hmmm...

Page 25: Winning With Big Data:  Secrets of the Successful Data Scientist

BUILD THE CALL GRAPH

… but is it predictive?

Page 26: Winning With Big Data:  Secrets of the Successful Data Scientist

April

EVOLUTION OF A CALL GRAPH

Page 27: Winning With Big Data:  Secrets of the Successful Data Scientist

May

EVOLUTION OF A CALL GRAPH

Page 28: Winning With Big Data:  Secrets of the Successful Data Scientist

June

EVOLUTION OF A CALL GRAPH

Page 29: Winning With Big Data:  Secrets of the Successful Data Scientist

July

EVOLUTION OF A CALL GRAPH

Page 30: Winning With Big Data:  Secrets of the Successful Data Scientist

700% INCREASE IN CHURN

when a cancellation occurs in a call network.
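
The talk does not show how this figure was computed. As a sketch of the general idea only, the code below builds a call graph from caller/callee pairs and compares later churn between subscribers who had a contact cancel and those who did not. The function name `churn_by_exposure` and the tiny data set are hypothetical, and the resulting rates are not the talk's numbers.

```python
from collections import defaultdict


def rate(flags):
    """Fraction of True values, or 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0


def churn_by_exposure(calls, cancelled_first, churned_later):
    """Return (churn rate with a cancelling contact, churn rate without one)."""
    # Build an undirected call graph: who has talked to whom.
    neighbors = defaultdict(set)
    for a, b in calls:
        neighbors[a].add(b)
        neighbors[b].add(a)

    exposed, unexposed = [], []
    for person, contacts in neighbors.items():
        if person in cancelled_first:
            continue  # early cancellers are the exposure, not the outcome
        bucket = exposed if contacts & cancelled_first else unexposed
        bucket.append(person in churned_later)
    return rate(exposed), rate(unexposed)


# Hypothetical call records and cancellations (assumptions for illustration).
calls = [("a", "b"), ("a", "c"), ("a", "d"),
         ("e", "f"), ("f", "g"), ("g", "h"), ("e", "h")]
exposed_rate, unexposed_rate = churn_by_exposure(
    calls, cancelled_first={"a"}, churned_later={"b", "c", "e"}
)
print(f"exposed={exposed_rate:.2f}, unexposed={unexposed_rate:.2f}")
```

On real data the graph would be rebuilt per month, as in the April to July slides, so that a cancellation in one month can be related to churn in the following months.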

Page 31: Winning With Big Data:  Secrets of the Successful Data Scientist

FINAL THOUGHTS

Page 32: Winning With Big Data:  Secrets of the Successful Data Scientist

THE BIG DATA STACK

Big Data Dedicated RDBMS

Analytics (R, SPSS, SAS, SAP)

Data Products (Content Filters, Rec Engines)

Data

Actions

Insights

Page 33: Winning With Big Data:  Secrets of the Successful Data Scientist

THANKS! QUESTIONS?

Michael [email protected]

@dataspora on Twitter
http://www.dataspora.com/blog

SDForum BI SIG, June 15, 2010

