Data Science Background and Course Software setup Week 1.


Index

Installation process

Lecture 1: Introduction to big data and data science

Lecture 2: Performing data science and preparing data

Installation process (I)

The same development environment for everyone: a virtual machine built with two free software packages, VirtualBox and Vagrant

Hardware and Software Prerequisites

Minimum Hardware Requirements

• Free disk space: 3.5 GB

• RAM: 2.5 GB (4+ GB preferred)

• Processor: any recent Intel or AMD multicore processor should be sufficient

Supported Operating Systems

• Windows, Linux, Mac OS X
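
The disk-space requirement above can be checked before installing anything. The following is a minimal sketch using only the Python standard library; the function names are illustrative and not part of the course material, and the RAM and processor requirements are not checked because the standard library has no portable call for them.

```python
import shutil

REQUIRED_FREE_GB = 3.5  # minimum free disk space listed on the slide


def free_disk_gb(path="."):
    """Return the free space on the disk that holds `path`, in GB (1024**3 bytes)."""
    return shutil.disk_usage(path).free / 1024**3


def has_enough_disk(path=".", required_gb=REQUIRED_FREE_GB):
    """True if the disk holding `path` has at least `required_gb` free."""
    return free_disk_gb(path) >= required_gb


print(f"Free disk space: {free_disk_gb():.1f} GB")
print("Enough for the course VM:", has_enough_disk())
```

Pass the custom directory that will hold the VM as `path` if it lives on a different disk from the current one.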

Installation process (II)

Installation of VirtualBox: virtualbox.org -> Downloads. Choose the appropriate version of VirtualBox for your OS.

Installation of Vagrant: www.vagrantup.com -> Downloads. Choose the appropriate version of Vagrant for your OS.

Installation of the Virtual Machine:

1. Create a custom directory (e.g., /home/marrval/myvagrant).

2. Download the file https://github.com/spark-mooc/mooc-setup/archive/master.zip to the custom directory and unzip it.

3. Copy Vagrantfile to the custom directory you created in step 1.

4. Open a DOS prompt (Windows) or Terminal (Mac/Linux), change to the custom directory, and issue the command vagrant up (VirtualBox opens in the background).

Sparkvm is running!
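
The unzip-and-copy part of the installation can also be scripted. The sketch below assumes master.zip has already been downloaded; the function name `extract_vagrantfile` is hypothetical, and it searches the archive by file name so that it makes no assumption about the folder layout inside the zip.

```python
import zipfile
from pathlib import Path


def extract_vagrantfile(zip_path, target_dir):
    """Copy the Vagrantfile found inside `zip_path` into `target_dir`."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # Look for a file (not a folder) named exactly "Vagrantfile".
        matches = [
            name for name in archive.namelist()
            if not name.endswith("/") and Path(name).name == "Vagrantfile"
        ]
        if not matches:
            raise FileNotFoundError(f"No Vagrantfile found in {zip_path}")
        destination = target / "Vagrantfile"
        destination.write_bytes(archive.read(matches[0]))
    return destination
```

For example, `extract_vagrantfile("master.zip", "/home/marrval/myvagrant")` would place the Vagrantfile in the custom directory, after which vagrant up is run from that directory as described above.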

Installation process (III)

Basic Instructions for Using the Virtual Machine

To start the VM, from a DOS prompt (Windows) or Terminal (Mac/Linux), issue the command vagrant up.

To stop the VM, use the command vagrant halt. You should always stop the VM before you log off, turn off, or reboot your computer.

To erase or delete the VM, use the command vagrant destroy.

Once the VM is running, open a web browser to "http://localhost:8001/" to access the notebook. The VM starts the IPython notebook server on port 8001, which gives access to an IPython notebook with Spark.
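
If the browser shows nothing at that address, it can help to test whether anything is listening on port 8001 at all. This is a minimal sketch, not part of the course tooling; the function name `notebook_is_reachable` is an assumption, and it only tests that a TCP connection succeeds, not that the server answering is really the notebook.

```python
import socket


def notebook_is_reachable(host="localhost", port=8001, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution errors.
        return False


if notebook_is_reachable():
    print("Something is answering at http://localhost:8001/")
else:
    print("Nothing is listening on port 8001; try `vagrant up` first.")
```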

Installation process (IV)

Running Your First Notebook

Start the VM

Open a web browser to "http://localhost:8001/"

Upload the file "lab0_student.ipynb", which is contained in the .zip

Run the cells and verify that you do not encounter any errors

Introduction to big data and data science (I)

Correlation doesn’t imply causation

• Use more data

• Explore more types of data/factors
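
A small simulation illustrates why exploring more factors matters. In this sketch, which uses made-up synthetic data, a hidden factor z drives both x and y; the two look strongly correlated until z is taken into account. The variable names and noise levels are arbitrary choices for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 2000

# Hidden factor z drives both x and y; x and y never influence each other.
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 0.5) for zi in z]
y = [zi + rng.gauss(0, 0.5) for zi in z]

raw_corr = pearson(x, y)
# Remove the contribution of z from both variables, then correlate what is left.
adjusted_corr = pearson(
    [xi - zi for xi, zi in zip(x, z)],
    [yi - zi for yi, zi in zip(y, z)],
)

print(f"corr(x, y)                  = {raw_corr:.2f}")
print(f"corr(x, y) after removing z = {adjusted_corr:.2f}")
```

The first value is large even though neither variable causes the other; the second is close to zero once the extra factor is included in the analysis.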

Introduction to big data and data science (II)

Big Data: Why all this excitement?

• From 2003 to 2008, Google looked at weekly search queries

• Identified 45 terms relevant to people searching about flu

• Built a model

Google rolled out flu stories in Google News during this period, and people reading those stories skewed the results

Introduction to big data and data science (III)

Big Data: Why all this excitement?

• Bloggers used data science to analyze the elections

• The campaigns were using data science (database that modeled the behavior of the electorate)

Pollsters try to predict the outcome by polling people; they have biases (+ errors), which lead to incorrect results

Challenge: remove biases + errors

Introduction to big data and data science (IV)

Cautionary tale

• How did they come to this conclusion?

• Look at Google Trends searches for MySpace and apply the same model to Facebook

• Correlation doesn’t imply causation

• Identify important factors

Introduction to big data and data science (V)

Where Does Big Data Come From?

• Online (and can be recorded). Much data is collected, but little of it is analyzed

• Users (user-generated content)

Each individual contribution is not very large

Introduction to big data and data science (VI)

Where Does Big Data Come From?

• Health and scientific computing

• Graphs

• Log files (generated by servers around the Internet)

• The Internet of Things (e.g., sensors in a forest, toll-collection transponders used for traffic reporting)

Performing Data Science and preparing Data (I)

What is Data Science?

• Data Science aims to derive knowledge from big data, efficiently and intelligently

• Data Science encompasses the set of activities, tools, and methods that enable data-driven activities in science, business, medicine, and government

Apply algorithms at scale to large amounts of data, and understand both the algorithms and the results

Collect data, analyze them, and understand the analytical process and results

Collect knowledge, apply algorithms, but do not understand


Apply domain-specific knowledge at very large scale, and understand both the algorithms and the results

Performing Data Science and preparing Data (II)

Contrasting Data Science: Database

Performing Data Science and preparing Data (III)

Contrasting Data Science: Database

Contrasting Data Science: Scientific computing

Performing Data Science and preparing Data (IV)

Contrasting Data Science: Traditional Machine Learning

Performing Data Science and preparing Data (V)

Doing data science

• Problem -> collect data -> clean the data -> build a model -> communicate the results

Performing Data Science and preparing Data (V)

• Cloud computing: key enabler of data science

• Allows data science on a massive scale

Data science practice

Performing Data Science and preparing Data (VI)

What is hard about Data Science?

Performing Data Science and preparing Data (VII)

Data acquisition and Preparation

1. Extract data from sources

2. Load data into the sink

3. Transform data (source, sink, staging area)
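
The three steps can be sketched end to end with the Python standard library. Everything here is illustrative: the CSV text stands in for a source, an in-memory SQLite database stands in for the sink, and the table and column names are invented. This sketch transforms the data in a staging area (plain Python lists) before loading; as the list above notes, the transformation could equally run at the source or in the sink.

```python
import csv
import io
import sqlite3

# Hypothetical source: a small CSV export with messy values.
SOURCE_CSV = """city,temperature_c
Madrid, 31.5
 Sevilla,35.0
Bilbao,
Valencia,29.0
"""


def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim whitespace, drop rows with no reading, convert types."""
    cleaned = []
    for row in rows:
        city = row["city"].strip()
        reading = (row["temperature_c"] or "").strip()
        if city and reading:
            cleaned.append((city, float(reading)))
    return cleaned


def load(rows, connection):
    """Load: write the prepared rows into the sink."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS readings (city TEXT, temperature_c REAL)"
    )
    connection.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    connection.commit()


sink = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), sink)
print(sink.execute("SELECT city, temperature_c FROM readings").fetchall())
```

Keeping each step in its own function makes it easy to reorder them, for instance loading the raw rows first and cleaning them with SQL inside the sink.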

Performing Data Science and preparing Data (VIII)

Data acquisition and Preparation

• We create pipelines or workflows, which can be scheduled

• Recording the execution of a workflow is known as capturing lineage or provenance (Spark does it automatically)

• Impediments to collaboration: diversity of tools/programming languages, finding a script is hard, most analysis work is ‘thrown away’
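
To make the idea of lineage concrete, here is a minimal sketch of a workflow that records each step as it runs. It is not how Spark tracks lineage; the decorator `tracked`, the step functions, and the record fields are all hypothetical names chosen for illustration.

```python
import functools
import time

lineage = []  # one record per executed step, in execution order


def tracked(step):
    """Decorator that appends a lineage record every time `step` runs."""
    @functools.wraps(step)
    def wrapper(data):
        result = step(data)
        lineage.append({
            "step": step.__name__,
            "rows_in": len(data),
            "rows_out": len(result),
            "finished_at": time.time(),
        })
        return result
    return wrapper


@tracked
def drop_missing(rows):
    """Remove entries that have no value."""
    return [r for r in rows if r is not None]


@tracked
def square(rows):
    """Square every remaining value."""
    return [r * r for r in rows]


result = square(drop_missing([1, None, 3]))
for record in lineage:
    print(record["step"], record["rows_in"], "->", record["rows_out"])
```

With a record like this, a collaborator can see which steps produced a result and how many rows each one kept, even if the original script is hard to find later.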

Performing Data Science and preparing Data (VIII)

Data Science roles

Individual

Organizational
