CIS 4930/6930 Spring 2014 Introduction to Data Science Data Intensive Computing University of Florida, CISE Department Prof. Daisy Zhe Wang

CIS 4930/6930 Spring 2014 Introduction to Data Science

Data Intensive Computing

University of Florida, CISE Department

Prof. Daisy Zhe Wang


Data Science Overview

Why, What, How, Who


Outline

• Why Data Science?

• What is Data Science?

• What are some prominent examples of Data Science?

• How to become a Data Scientist?

• Who is hiring Data Scientists now?


Why Data Science?


The Dawn of Big Data

• Google, Yahoo today

– Web search and computational advertising

– Google: 35,000 searches/sec

– Yahoo! scale: 600 million users per month, 4 billion clicks per day, 25 terabytes of data collected every day

• Netflix 2007

– Movie recommendations, Netflix Prize

– 100 million ratings, 500,000 users, 18,000 movies

• Amazon 2003

– Product recommendations, reviews

– 29 million customers, millions of products

• World Economic Forum 2011 at Davos

– Personal data (digital data created by and about people) represents a new economic “asset class”, touching all aspects of society.


How Big is Your Data?

• Kilobyte (1000 bytes)

• Megabyte (1 000 000 bytes)

• Gigabyte (1 000 000 000 bytes)

• Terabyte (1 000 000 000 000 bytes)

• Petabyte (1 000 000 000 000 000 bytes)

• Exabyte (1 000 000 000 000 000 000 bytes)

• Zettabyte (1 000 000 000 000 000 000 000 bytes)

• Yottabyte (1 000 000 000 000 000 000 000 000 bytes)
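The unit ladder above can be turned into a small conversion helper. The following is a minimal sketch, assuming the decimal definitions listed on this slide (each unit is 1000 times the previous one); the function name `human_size` and the abbreviations in `UNITS` are illustrative choices, not part of the original slides.

```python
# Decimal byte units as listed on the slide: each step is a factor of 1000.
# The abbreviations and the function name are illustrative assumptions.
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def human_size(num_bytes: float) -> str:
    """Format a byte count with the largest decimal unit that keeps the value below 1000."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value fits, or at the largest unit.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:.1f} {unit}"
        value /= 1000


# Example: the daily Yahoo! collection volume quoted earlier (25 terabytes).
print(human_size(25 * 10**12))
```

Storage tools sometimes use binary units instead (1 KiB = 1024 bytes), so the same byte count can be reported differently depending on the convention.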


5 Vs of Big Data

• Raw Data: Volume

• Change over time: Velocity

• Data types: Variety

• Data Quality: Veracity

• Information for Decision Making: Value


Cloud Computing

Cloud computing is a style of computing where scalable and elastic IT-enabled capabilities are delivered as a (pay-as-you-go) service to external customers using Internet technologies.

-- Gartner IT Glossary

Cloud Computing is a new term for a long-held dream of computing as a utility

-- Above the Clouds, 2009


Cloud Computing = Cloud + SaaS

Cloud computing refers to both:

• SaaS: The applications delivered as services over the Internet, and

• Cloud: The hardware and system software in the datacenters that provide those services.

– Public Cloud (Utility Computing) vs. Private Cloud

• Cloud Computing started around 2006

• Big Data and Data Science (Big Data Analytics) started around 2011


Current Trends

• Applications have bigger data and need more advanced analysis

– Example: Web, corporate documents, and emails

• Natural Language Processing

– Example: Social media

• Network/Graph Analysis

• IT Infrastructure moving to Cloud Computing

• Data Science arises from this application pull and technology push


What is Data Science?


Data Science – A Definition

Data Science is the science that uses computer science, statistics, machine learning, visualization, and human-computer interaction to collect, clean, integrate, analyze, visualize, and interact with data in order to create data products.


Goal of Data Science

Turn data into data products.


Data to Data Products

• Transaction Databases → Fraud Detection

• Wireless Sensor Data → Smart Home

• Text Data, Social Media Data → Product Review and Consumer Satisfaction

• Software Log Data → Automatic Troubleshooting

• Genotype and Phenotype Data → New Treatment for Cancer


Other Data Products

• Financial products for investment or retirement funds

• E-discovery tools for the legal profession: retrieval and review of legal documents

• Political campaign management

• Sports (e.g., Oakland baseball team)

• Remote Sensing for Environment Monitoring


What are some prominent examples of Data Science?


Data Products – Google

• Web Search

• Google Ads

• News Recommendation Engine

• Google Maps

Currently one of the best, if not the best, IT companies to work for. (Google event on Jan 21/22)


Data Products – Netflix

• Personalized Movie Ratings

• Movie Recommendations

• Similar Movies

• Movie Categories (e.g., 80’s movie with a strong female lead, Kung Fu movies)

Blockbuster is out of business …


Data Products – LinkedIn/Facebook

• People you may know

• Applications you may like

• Jobs/Events you might be interested in

• Classifier for bad users and bad content

• With high accuracy, Facebook can guess whether you are single or married

Who does not have a LinkedIn/Facebook account?


Data Products – Twitter

• Text Analysis – Spam Filter/Similarity Search

• User Sentiment/Satisfaction/Feedback

• News Breakout

• Trend and Topics

200 million users as of 2011, generating over 200 million tweets and handling over 1.6 billion search queries per day


Data Products – Splunk

• Degradation, Failure Detection

• Identify Security Breach

• Event Monitoring

• Troubleshooting Tools

• Cross-platform Event Correlation

Founded 2004, IPO in 2012


How to become a Data Scientist?


The Life of Data (state-of-the-art)

[Figure: diagram of the data life cycle. Labeled stages: Data Sources, Collect, Clean, Integrate, Analysis, Visualization, Interface, Users.]
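The stages named on this slide can be read as a chain of functions, each consuming the output of the previous one. The following is a toy sketch only: the records, movie titles, and function names are hypothetical, and the stage order (collect, clean, integrate, analyze, visualize) is taken from the definition of Data Science given earlier, not from any specific system.

```python
from collections import defaultdict
from statistics import mean


def collect():
    """Collect: gather raw records from two hypothetical data sources."""
    ratings = [
        {"user": "u1", "movie": "m1", "rating": "5"},
        {"user": "u2", "movie": "m1", "rating": "4"},
        {"user": "u3", "movie": "m2", "rating": None},  # incomplete record
        {"user": "u1", "movie": "m2", "rating": "3"},
    ]
    titles = {"m1": "Movie One", "m2": "Movie Two"}
    return ratings, titles


def clean(ratings):
    """Clean: drop incomplete records and convert ratings to integers."""
    return [{**r, "rating": int(r["rating"])} for r in ratings if r["rating"] is not None]


def integrate(ratings, titles):
    """Integrate: join each rating with the title from the second source."""
    return [{**r, "title": titles[r["movie"]]} for r in ratings]


def analyze(records):
    """Analyze: compute the average rating per title."""
    by_title = defaultdict(list)
    for r in records:
        by_title[r["title"]].append(r["rating"])
    return {title: mean(values) for title, values in by_title.items()}


def visualize(summary):
    """Visualize: render a plain-text report, one line per title."""
    return "\n".join(f"{title}: {avg:.1f}" for title, avg in sorted(summary.items()))


raw_ratings, titles = collect()
summary = analyze(integrate(clean(raw_ratings), titles))
report = visualize(summary)
print(report)
```

In practice each stage is far larger than one function, but keeping the stages separate makes it possible to swap in a different source, cleaning rule, or analysis without touching the rest.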


Challenges in Data Science

• Preparing Data (Noisy, Incomplete, Diverse, Streaming …)

• Analyzing Data (Scalable, Accurate, Real-time, Advanced Methods, Probabilities and Uncertainties …)

• Representing Analysis Results, i.e., the data product (Story-telling, Interactive, Explainable …)


Skill Set of a Data Scientist

• Data Management

– Data collection, storage, cleaning, filtering, integration …

• Large-scale Parallel Data Processing

– Parallel computing

• Statistics and Machine Learning

– Data modeling, inference, prediction, pattern recognition …

• Interface and Data Visualization

– HCI design, visualization, story-telling …


Who is hiring Data Scientists now?


“Sexy Job” in the next 10 years

“The sexy job in the next ten years will be … The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill.”

-- Hal Varian, Google Chief Economist, 2009


Who’s hiring Data Scientists?

• IT companies: Google, Twitter, Lexis/Nexis, Facebook, Pivotal/EMC…

• Media and financial sectors: Fox, CNN, NYT, Bloomberg, …

• Research: Biology, Medicine, Physics, Psychology,…

• Information office in government and corporations…

• Law firms: e-discovery tools…


Books on Data Science


Additional Reading Pointers

• Data Science Summit (Strata) (http://www.datascientistsummit.com/ )

• Kaggle Competitions (http://www.kaggle.com/)

• Data Science course at Berkeley & Coursera (http://datascienc.es/ )


Summary

• Why now: Dawn of Big Data, Need for Advanced Analytics and Cloud Computing

• What is it: Data → Data Product, many examples incl. Google, Netflix, Splunk, LinkedIn

• How to become: Data management, parallel computing and data processing, statistical machine learning, and visualization skills

– Life of Data

• Who is hiring: Data Scientists are in great demand, from industry to government to science.
