DATA SCIENTIST’S DAILY LIFE
Page 1: Data Scientist's Daily Life

DATA SCIENTIST’S DAILY LIFE

Page 2: Data Scientist's Daily Life

AGENDA

• Data scientist?

• Big data and data scientist

• Data scientist’s Toolbox

• Data is the biggest

Page 3: Data Scientist's Daily Life

Derive Knowledge

from Big data

Page 4: Data Scientist's Daily Life

Efficiently

and

Intelligently

Page 5: Data Scientist's Daily Life

FROM BACKEND TO FRONTEND

https://doubleclix.wordpress.com/2012/12/15/what-or-who-is-a-data-scientist/

Page 6: Data Scientist's Daily Life

WHAT IS BIG DATA?

Page 7: Data Scientist's Daily Life

WHERE DOES THE DATA COME FROM?

• Web Log data

• Machine data

• Transactional data

• Social media data

• …

Page 8: Data Scientist's Daily Life

https://plus.google.com/+DigitalStrategyIE

Page 9: Data Scientist's Daily Life
Page 10: Data Scientist's Daily Life

A WEB SERVICE RECEIVES MORE THAN 50 GB OF LOG DATA PER DAY

TOTAL SPACE USED IN THE LAST THREE MONTHS: 4,500 GB

TOTAL SPACE USED IN THE LAST YEAR: 18,000 GB (17.6 TB)

Page 11: Data Scientist's Daily Life

• Data Storage / Backup

• 2 TB per HDD

• How do we store MORE than 2 TB of data?

• $0.30 USD per gigabyte

• Pay 900 USD just for KEEPING the data, without doing anything else with it.

• Read/Write Speed

• Read: 131.6 MB/s / Write: 131.4 MB/s

• Spend 393 s (over 6 min) reading just ONE day of data.

• A large number of transactions must be handled immediately
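The storage and read-speed figures above can be checked with a little arithmetic. The following is a minimal sketch that uses only the numbers on the slides; the unit conversion (1 GB = 1024 MB, 1 TB = 1024 GB) and the function names are assumptions made for illustration.

```python
import math

# Figures taken from the slides.
DAILY_LOG_GB = 50          # log data received per day
READ_SPEED_MB_S = 131.6    # sequential read speed of one HDD
HDD_CAPACITY_TB = 2        # capacity of one HDD

# Assumption: binary units (the slides do not say which convention they use).
MB_PER_GB = 1024
GB_PER_TB = 1024


def read_seconds(size_gb, speed_mb_s=READ_SPEED_MB_S):
    """Seconds needed to read size_gb from a single disk at speed_mb_s."""
    return size_gb * MB_PER_GB / speed_mb_s


def disks_needed(size_gb, capacity_tb=HDD_CAPACITY_TB):
    """Smallest number of disks that can hold size_gb (no backup copies)."""
    return math.ceil(size_gb / (capacity_tb * GB_PER_TB))


if __name__ == "__main__":
    seconds = read_seconds(DAILY_LOG_GB)
    print(f"One day of logs: {seconds:.0f} s ({seconds / 60:.1f} min) to read")
    print(f"One year of logs (18,000 GB): {disks_needed(18_000)} disks")
```

With these assumptions, the read time for one day lands within a few seconds of the 393 s quoted on the slide; the small gap depends on the unit convention and the exact daily volume.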

Page 12: Data Scientist's Daily Life

HADOOP AND MAPREDUCE

Page 13: Data Scientist's Daily Life

HADOOP AND HDFS

http://www.fraudtechwire.com/f-level-guide-to-hadoop-hdfs/

Page 14: Data Scientist's Daily Life
Page 15: Data Scientist's Daily Life
Page 16: Data Scientist's Daily Life

– DISTRIBUTED ALGORITHM

“The world will change when data is distributed.”

Page 17: Data Scientist's Daily Life

MAPREDUCE

http://www.milanor.net/blog/?p=853
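The slide image is not part of this transcript, so here is a minimal single-process sketch of the MapReduce idea, using the classic word-count example. It is an illustration only, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real cluster would run the map and reduce steps in parallel on many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big deal"]))
```

Because each mapper sees only its own line and each reducer sees only one key, both steps can be spread across machines, which is the point of the model.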

Page 18: Data Scientist's Daily Life

https://chamibuddhika.wordpress.com/2012/02/26/joins-with-map-reduce/

Page 19: Data Scientist's Daily Life

http://blog.agro-know.com/?p=3810

Page 20: Data Scientist's Daily Life

PERFORMANCE OF HADOOP?

• Not great, but at least the job runs.

• Counting 86,389,084 rows (one day of data) takes 39 sec. (10 nodes, each with 64 GB RAM and 2 × 8-core E5 CPUs)

• How about 39 sec × 30 days?
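To answer the question on the slide: if the counting job scales linearly with the number of days (an assumption; real jobs carry fixed startup overhead and may scale differently), 30 days take 30 × 39 s = 1,170 s, about 19.5 minutes. A small sketch of that estimate, with hypothetical function names:

```python
# Figures taken from the slide.
ROWS_PER_DAY = 86_389_084
SECONDS_PER_DAY_OF_DATA = 39


def scan_seconds(days, seconds_per_day=SECONDS_PER_DAY_OF_DATA):
    """Estimated job time for `days` of data, assuming linear scaling."""
    return days * seconds_per_day


def rows_per_second(rows=ROWS_PER_DAY, seconds=SECONDS_PER_DAY_OF_DATA):
    """Average counting throughput across the whole cluster."""
    return rows / seconds


if __name__ == "__main__":
    month = scan_seconds(30)
    print(f"30 days: {month} s = {month / 60:.1f} min")
    print(f"Throughput: {rows_per_second():,.0f} rows/s")
```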

Page 21: Data Scientist's Daily Life

BEFORE ANALYTICS…

Page 22: Data Scientist's Daily Life

EXTRACT TRANSFORM LOAD

http://www.wisdomjobs.com/e-university/data-warehouse-etl-toolkit-tutorial-201/surrounding-the-requirements-1319/architecture-8029.html

Page 23: Data Scientist's Daily Life

http://www.slideshare.net/capgemini/emc-world-2014-breakout-move-to-the-business-data-lake-not-as-hard-as-it-sounds

Page 24: Data Scientist's Daily Life

http://www.slideshare.net/hortonworks/modern-data-architecture-for-a-data-lake-with-informatica-and-hortonworks-data-platform

Page 25: Data Scientist's Daily Life

DATA SCIENTIST’S TOOLBOX

Page 26: Data Scientist's Daily Life

LINUX

• The best server choice

• Free as in price and free as in freedom

• Easy to control the system

• Easy data processing

• Hadoop runs primarily on Linux

Page 27: Data Scientist's Daily Life
Page 28: Data Scientist's Daily Life

POWERFUL SHELL SCRIPT

Page 29: Data Scientist's Daily Life

SQL DATABASE

• MySQL, PostgreSQL, Hive, MongoDB (NoSQL)

• Standard SQL Language

• Store and Manage data

Page 30: Data Scientist's Daily Life

RELATIONAL DATABASE

Page 31: Data Scientist's Daily Life

TABLE RELATION

https://cloudant.com/blog/foundbites-data-model-relational-db-vs-nosql-on-cloudant/

Page 32: Data Scientist's Daily Life

http://ghtorrent.org/relational.html

Page 33: Data Scientist's Daily Life

SQL SYNTAX
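The syntax example on this slide was an image and is not in the transcript. As a stand-in, here is a minimal sketch of standard SQL (create, insert, join, aggregate) run through Python's built-in `sqlite3` module. SQLite is swapped in only because it needs no server; the tables `users` and `orders` and the function names are hypothetical, and the same statements should carry over to MySQL or PostgreSQL with little change.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two related tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE users (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        CREATE TABLE orders (
            id      INTEGER PRIMARY KEY,
            user_id INTEGER NOT NULL REFERENCES users(id),
            amount  REAL NOT NULL
        );
        INSERT INTO users (id, name) VALUES (1, 'alice'), (2, 'bob');
        INSERT INTO orders (user_id, amount) VALUES (1, 10.0), (1, 5.5), (2, 7.0);
        """
    )
    return conn


def totals_per_user(conn):
    """Join the two tables and aggregate the order amounts per user."""
    query = """
        SELECT u.name, COUNT(o.id) AS n_orders, SUM(o.amount) AS total
        FROM users AS u
        JOIN orders AS o ON o.user_id = u.id
        GROUP BY u.name
        ORDER BY u.name
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for name, n_orders, total in totals_per_user(build_demo_db()):
        print(name, n_orders, total)
```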

Page 34: Data Scientist's Daily Life

R & PYTHON

• Basic Analysis Tools

• Easy to Learn

• Many Packages

Page 35: Data Scientist's Daily Life
Page 36: Data Scientist's Daily Life
Page 38: Data Scientist's Daily Life

ETC…

• Excel

• Google Analytics

• Visualisation tools (Tableau)

• Web Crawler

• Version control management (Git)

• ETL and job scheduling tools (Jenkins)

• …

Page 39: Data Scientist's Daily Life

DATA IS THE BIGGEST

Page 40: Data Scientist's Daily Life

– JOSH WILLS

“Person who is better at statistics than any software engineer and better at software engineering than any statistician.”

Page 41: Data Scientist's Daily Life

STATISTICS

Page 42: Data Scientist's Daily Life

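WHY DO WE NEED MACHINE LEARNING?

• Clustering: how many groups can these people be divided into?

• Classification: which group does each person belong to?

• Regression: what is the probability that an event occurs, or that a person belongs to a given group?

• Dimensionality reduction: reducing the number of dimensions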

Page 43: Data Scientist's Daily Life

CLUSTERING

http://simplystatistics.org/2014/02/18/k-means-clustering-in-a-gif/

Source: http://humble-developer.blogspot.tw/2011/01/kmeans-clustering-algorithm-part-1.html
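Both links above illustrate k-means. A minimal pure-Python sketch of the usual iteration (assign each point to its nearest centroid, then move each centroid to the mean of its points) is shown below. The function names, the random initialisation from the data points, and the fixed iteration cap are choices made for this illustration, not taken from the slides.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
        for p in points
    ]


def kmeans(points, k, iterations=20, seed=0):
    """Return (centroids, labels) after at most `iterations` update rounds."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        labels = assign(points, centroids)
        new_centroids = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            # Keep the old centroid if a cluster ends up empty.
            new_centroids.append(mean_point(members) if members else centroids[i])
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, assign(points, centroids)


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centroids, labels = kmeans(data, k=2)
    print(centroids, labels)
```

The result depends on the starting centroids, so on harder data it is common to run the algorithm several times with different seeds and keep the best clustering.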

Page 44: Data Scientist's Daily Life

CLASSIFICATION

http://letsmakerobots.com/content/tcs3200-color-sensor-with-k-nearest-neighbor-classification-algorithm

Page 45: Data Scientist's Daily Life

http://www.astroml.org/sklearn_tutorial/

Page 46: Data Scientist's Daily Life

LOGISTIC REGRESSION

https://www.coursera.org/instructor/andrewng

Page 47: Data Scientist's Daily Life

COST FUNCTION

https://www.coursera.org/instructor/andrewng
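The formula on this slide was an image and is missing from the transcript. The sketch below implements the cross-entropy cost commonly used for logistic regression, J(θ) = −(1/m) Σ [y log h(x) + (1 − y) log(1 − h(x))] with h(x) = sigmoid(θ·x); that this matches the slide exactly is an assumption. Function names are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(theta, x):
    """Predicted probability that the label of x is 1."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def cost(theta, X, y):
    """Average cross-entropy cost over all examples (labels are 0 or 1)."""
    total = 0.0
    for xi, yi in zip(X, y):
        h = predict_proba(theta, xi)
        total += yi * math.log(h) + (1 - yi) * math.log(1 - h)
    return -total / len(y)


if __name__ == "__main__":
    # Each row starts with a constant 1 so theta[0] acts as the intercept.
    X = [[1, 2], [1, -1]]
    y = [1, 0]
    print(cost([0, 0], X, y))  # uninformed model: every prediction is 0.5
    print(cost([0, 1], X, y))  # a model that separates the two examples
```

A lower cost means the predicted probabilities agree better with the labels; training searches for the θ that minimises it. This sketch does not guard against extreme inputs, where `math.exp` can overflow or `math.log` can receive 0.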

Page 48: Data Scientist's Daily Life

OVERFITTING

https://www.coursera.org/instructor/andrewng

Page 49: Data Scientist's Daily Life

OH MY GOD! HOW DO I CHOOSE?

Page 50: Data Scientist's Daily Life

MACHINE LEARNING ALGORITHM

http://amueller.github.io/sklearn_tutorial/

Page 51: Data Scientist's Daily Life

STATISTICS VS ML

STATISTICS

• Focus on understanding data in terms of models

• Interpretability, hypothesis testing

MACHINE LEARNING

• Focus on the analysis of learning algorithms

• Greater focus on prediction

Page 52: Data Scientist's Daily Life

SYSTEMATICS AND AUTOMATION

http://www.slideshare.net/CetasAnalytics/cetas-e-baymeetupprezofinal

Page 53: Data Scientist's Daily Life

http://mlg.postech.ac.kr/projects/

Page 54: Data Scientist's Daily Life

SHOW YOUR DATA AND FINDINGS

Page 55: Data Scientist's Daily Life

http://hortonworks.com/wp-content/uploads/2012/06/Tableau2.png

Page 56: Data Scientist's Daily Life

http://www.tableau.com

Page 57: Data Scientist's Daily Life

http://www.tableau.com

Page 58: Data Scientist's Daily Life

http://www.tableau.com

Page 59: Data Scientist's Daily Life

THE REAL CASE

Page 60: Data Scientist's Daily Life

HOW TO START?

Page 61: Data Scientist's Daily Life

• Codecademy http://www.codecademy.com/ Covers many programming languages, e.g. Python, JavaScript, even shell script and SQL.

• Coursera https://www.coursera.org/ A well-known MOOC website for self-study.

Page 62: Data Scientist's Daily Life

http://nirvacana.com/thoughts/becoming-a-data-scientist/