Data Science Background and Course Software setup Week 1
Jan 20, 2016
Index
• Installation process
• Lecture 1: Introduction to big data and data science
• Lecture 2: Performing data science and preparing data
Installation process (I)
Everyone uses the same development environment, built from two free software packages: VirtualBox and Vagrant, which together run the course virtual machine
Hardware and Software Prerequisites
Minimum Hardware Requirements
• Free disk space: 3.5 GB
• RAM: 2.5 GB (4+ GB preferred)
• Processor: any recent Intel or AMD multicore processor should be sufficient
Supported Operating Systems
• Windows, Linux, Mac OS X
Installation process (II)
Installation of VirtualBox:
• Go to virtualbox.org -> Downloads and choose the appropriate version of VirtualBox for your OS
Installation of Vagrant:
• Go to www.vagrantup.com -> Downloads and choose the appropriate version of Vagrant for your OS
Installation of the Virtual Machine:
1. Create a custom directory (e.g., /home/marrval/myvagrant)
2. Download the file https://github.com/spark-mooc/mooc-setup/archive/master.zip to the custom directory and unzip it
3. Copy Vagrantfile to the custom directory you created in step 1
4. Open a DOS prompt (Windows) or Terminal (Mac/Linux), change to the custom directory, and issue the command vagrant up (VirtualBox opens in the background)
Sparkvm is running!
Installation process (III)
Basic Instructions for Using the Virtual Machine
• To start the VM, from a DOS prompt (Windows) or Terminal (Mac/Linux), issue the command vagrant up
• To stop the VM, use the command vagrant halt. You should always stop the VM before you log off, turn off, or reboot your computer
• To erase or delete the VM, use the command vagrant destroy
• Once the VM is running, open a web browser to "http://localhost:8001/" to access the notebook. The VM starts the IPython notebook server on port 8001, which gives access to an IPython notebook with Spark
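The three lifecycle commands above can also be driven from a script. The following is a minimal sketch, assuming the vagrant executable is on the PATH and that the Vagrantfile lives in the directory passed as vm_dir. The helper names vagrant_command and run_vagrant are hypothetical and not part of the course material.

```python
import subprocess

# Subcommands named in the slides: up (start), halt (stop), destroy (delete).
ALLOWED_ACTIONS = ("up", "halt", "destroy")


def vagrant_command(action):
    """Build the argument list for a Vagrant lifecycle command."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    return ["vagrant", action]


def run_vagrant(action, vm_dir):
    """Run the command inside vm_dir (the directory holding the Vagrantfile).

    Returns the process exit code; 0 means Vagrant reported success.
    """
    completed = subprocess.run(vagrant_command(action), cwd=vm_dir)
    return completed.returncode
```

For example, run_vagrant("halt", "/home/marrval/myvagrant") would stop the VM created in the example directory from the installation steps. Restricting the action to the three known subcommands keeps a typo from being passed straight to the command line.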
Installation process (IV)
Running Your First Notebook
Start the VM
Open a web browser to "http://localhost:8001/"
Upload the file "lab0_student.ipynb", which is contained in the .zip
Run the cells and verify that none of them reports an error
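If the browser shows nothing at that address, it can help to check whether anything is answering on port 8001 at all. The sketch below uses only the Python standard library. The function names notebook_url and is_reachable are hypothetical helpers, and the 3-second timeout is an arbitrary assumption.

```python
import urllib.error
import urllib.request


def notebook_url(host="localhost", port=8001):
    """Return the notebook address used in the slides."""
    return f"http://{host}:{port}/"


def is_reachable(url, timeout=3.0):
    """Return True if an HTTP server answers at url, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a success status.
        return True
    except OSError:
        # Connection refused, timeout, or similar: nothing usable is listening.
        return False
```

Calling is_reachable(notebook_url()) after vagrant up should return True once the notebook server has started. A False result suggests the VM is not running or the port is not forwarded.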
Introduction to big data and data science (I)
Correlation doesn’t imply causation
• Use more data
• Explore more types of data/factors
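A small made-up example shows why exploring more factors matters. In the sketch below, two series are both driven by a third factor (temperature) and never by each other, yet they correlate strongly. All numbers are invented for illustration, and the pearson helper is written out by hand rather than taken from a library.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hidden factor: daily temperature in degrees Celsius (made-up values).
temperature = list(range(10, 40))

# Both quantities depend on temperature plus a small deterministic wobble;
# neither depends on the other.
ice_cream_sales = [2.0 * t + (7 * t) % 5 for t in temperature]
swimming_accidents = [0.5 * t + (3 * t) % 4 for t in temperature]

r = pearson(ice_cream_sales, swimming_accidents)
print(f"correlation between the two series: {r:.2f}")
```

The correlation is high even though neither series causes the other. Only once temperature is added to the analysis does the common driver become visible.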
Introduction to big data and data science (II)
Big Data: Why all this excitement?
• From 2003 to 2008, Google looked at weekly search queries
• Identified 45 terms relevant to people searching about flu
• Built a model
Google rolled out flu stories in Google News during this period; people reading those stories skewed the results
Introduction to big data and data science (III)
Big Data: Why all this excitement?
• Bloggers used data science to analyze the elections
• The campaigns were using data science (database that modeled the behavior of the electorate)
Pollsters try to predict the outcome by polling people, but polls have biases (+ errors), which lead to incorrect results
Challenge: remove biases + errors
Introduction to big data and data science (IV)
Cautionary tale
• How did they come to this conclusion?
• Look at Google Trends searches for MySpace and apply the same model to Facebook
• Correlation doesn’t imply causation
• Identify important factors
Introduction to big data and data science (V)
Where Does Big Data Come From?
• Online activity (which can be recorded). A lot of data is collected, but little of it is analyzed
• Users (user-generated content). Each individual contribution is not very large
Introduction to big data and data science (VI)
Where Does Big Data Come From?
• Health and scientific computing
• Graphs
• Log files (generated by servers around the Internet)
• The Internet of Things (e.g., sensors in a forest, toll collection transponders used for traffic reporting)
Performing Data Science and preparing Data (I)
What is Data Science?
• Data Science aims to derive knowledge from big data, efficiently and intelligently
• Data Science encompasses the set of activities, tools, and methods that enable data-driven activities in science, business, medicine, and government
Apply algorithms at scale to large amounts of data, and understand both the algorithms and the results
Collect data, analyze them, and understand the analytical process and results
Collect knowledge and apply algorithms, but do not understand
Apply domain-specific knowledge at very large scale, and understand both the algorithms and the results
Performing Data Science and preparing Data (II)
Contrasting Data Science: Database
Performing Data Science and preparing Data (III)
Contrasting Data Science: Database
Contrasting Data Science: Scientific computing
Performing Data Science and preparing Data (IV)
Contrasting Data Science: Traditional Machine Learning
Performing Data Science and preparing Data (V)
Doing data science
• Problem -> collect data -> clean the data -> build a model -> communicate the results
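The stages above can be sketched end to end on a toy problem. This is only an illustration: the readings are made up, the "model" is just the mean of the cleaned values, and the function names are hypothetical.

```python
from statistics import mean

# 1. Collect: raw readings as they might arrive, with gaps and bad entries.
raw_readings = ["21.5", "22.0", "", "n/a", "23.5", "21.0"]


def clean(values):
    """2. Clean: keep only entries that parse as numbers."""
    cleaned = []
    for value in values:
        try:
            cleaned.append(float(value))
        except ValueError:
            continue  # drop empty or malformed entries
    return cleaned


def build_model(values):
    """3. Build a model: here, just the mean as a constant predictor."""
    return mean(values)


def communicate(model, n_used, n_total):
    """4. Communicate: summarize the result in one sentence."""
    return f"Predicted value {model:.2f} from {n_used} of {n_total} readings."


readings = clean(raw_readings)
model = build_model(readings)
report = communicate(model, len(readings), len(raw_readings))
print(report)  # Predicted value 22.00 from 4 of 6 readings.
```

Reporting how many readings survived cleaning is part of communicating the result: it tells the audience how much of the collected data the conclusion actually rests on.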
Performing Data Science and preparing Data (V)
• Cloud computing: key enabler of data science
• Allows data science on a massive scale
Data science practice
Performing Data Science and preparing Data (VI)
What is hard about Data Science?
Performing Data Science and preparing Data (VII)
Data acquisition and Preparation
1. Extract data from sources
2. Load data into the sink
3. Transform data (at the source, at the sink, or in a staging area)
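The three steps can be sketched with the Python standard library. In this illustration the source is a small made-up CSV text, the sink is an in-memory SQLite database, and the transformation runs inside the sink. The table and column names are assumptions chosen for the example.

```python
import csv
import io
import sqlite3

# Source: a small CSV export (made-up data standing in for a real source).
SOURCE_CSV = """city,temp_f
Berkeley,68
Madrid,95
Oslo,50
"""

# 1. Extract rows from the source.
rows = list(csv.DictReader(io.StringIO(SOURCE_CSV)))

# 2. Load the rows into the sink (an in-memory SQLite database).
sink = sqlite3.connect(":memory:")
sink.execute("CREATE TABLE weather (city TEXT, temp_f REAL)")
sink.executemany(
    "INSERT INTO weather (city, temp_f) VALUES (?, ?)",
    [(row["city"], float(row["temp_f"])) for row in rows],
)

# 3. Transform inside the sink: derive Celsius from Fahrenheit.
sink.execute("ALTER TABLE weather ADD COLUMN temp_c REAL")
sink.execute("UPDATE weather SET temp_c = (temp_f - 32) * 5.0 / 9.0")

result = sink.execute(
    "SELECT city, temp_c FROM weather ORDER BY city"
).fetchall()
print(result)
```

Transforming at the sink keeps the loaded data close to its original form, so the conversion can be redone later without extracting again. Transforming at the source or in a staging area would move step 3 before the INSERT.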
Performing Data Science and preparing Data (VIII)
Data acquisition and Preparation
• We create pipelines or workflows, which can be scheduled
• Recording the execution of a workflow is known as capturing lineage or provenance (Spark does it automatically)
• Impediments to collaboration: diversity of tools/programming languages, finding a script is hard, most analysis work is ‘thrown away’
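To make the idea of lineage concrete, the sketch below runs a tiny workflow and records, for each step, its name and how many rows went in and came out. This is a plain-Python illustration under assumed names (run_workflow, drop_missing, square); it is not how Spark records lineage internally.

```python
def run_workflow(data, steps):
    """Apply each step in order and record what happened along the way.

    Returns the final data and a lineage log with one entry per step.
    """
    lineage = []
    for step in steps:
        result = step(data)
        lineage.append(
            {"step": step.__name__, "rows_in": len(data), "rows_out": len(result)}
        )
        data = result
    return data, lineage


def drop_missing(rows):
    """Remove entries that are None."""
    return [row for row in rows if row is not None]


def square(rows):
    """Square every remaining value."""
    return [row * row for row in rows]


final, lineage = run_workflow([1, None, 3], [drop_missing, square])
for entry in lineage:
    print(entry)
```

With such a log, a surprising result can be traced back to the step that produced it, which also helps when the workflow is shared with collaborators.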
Performing Data Science and preparing Data (IX)
Data Science roles
Individual
Organizational