CS 626 Large Scale Data Science Jun Zhang Department of Computer Science University of Kentucky Based on materials prepared by Dr. Licong Cui Lecture 1 – Introduction 1


Jun 17, 2020

Page 1: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture1.pdf · • Contact Mr. Jarad Downing jarad@cs.uky.edu for obtaining an account and knowing the ... Tom White • ISBN-13: 978-1491901632

CS 626 Large Scale Data Science

Jun Zhang
Department of Computer Science
University of Kentucky
Based on materials prepared by Dr. Licong Cui

Lecture 1 – Introduction


Outline

Course Logistics

Student Introduction

Introduction to Big Data


Course Logistics

• Class hours: TR 12:30 pm - 1:45 pm
• Class location: F. Paul Anderson Tower, Room 255
• Office hours: MW 9:00 am – 10:00 am
• Course documents: http://www.cs.uky.edu/~jzhang/CS626/cs626.html
  o Syllabus
  o Files
    - Slides
    - Homework and Project Assignments


Course Description

• Data => Actionable information
• Big Data Techniques
  – Hadoop/MapReduce
  – HBase
  – Hive
  – Pig
  – Spark
• Real-world data science problems
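
To make the MapReduce item in the list above concrete, here is a minimal sketch of the map, shuffle, and reduce phases, using word count as the example. It runs in plain Python on a single machine and does not use Hadoop. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are illustrative assumptions, not part of the course materials.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as a framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical sample input, for illustration only.
sample = ["big data big models", "data science"]
counts = word_count(sample)
print(counts)
```

In a real cluster the map and reduce calls would run in parallel on different machines; only the division of work into the three phases carries over from this sketch.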


Prerequisites and Expected Background

• Algorithm design and analysis
• Database systems (e.g., MySQL)
• Programming languages
  – Java (preferred)
  – Python
• Linux basics (e.g., ssh, scp)
• Your own computer requirements:
  – 64-bit OS
  – 10+ GB RAM


Alternative Hardware Systems

• Use the CS Department’s OpenStack cluster
• Contact Mr. Jarad Downing (jarad@cs.uky.edu) to obtain an account and learn the requirements
• The Cloudera system has been installed on OpenStack
• More information about OpenStack is at: https://www.cs.uky.edu/docs/users/openstack.html


What Do You Need for the OpenStack Cluster?

• You need to connect to the UK campus network via VPN; see: https://www.cs.uky.edu/docs/users/vpn.html
• You need to install NoMachine, which can be found here: https://www.nomachine.com
• You need to use your UK ID address and the credentials (cloudera/cloudera) to connect.


Textbook (Optional)

• Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale (4th Edition)

• Author: Tom White
• ISBN-13: 978-1491901632
• ISBN-10: 1491901632


Grading Criteria

• Homework/Programming assignments (40%)
• Paper presentation (20%)
• Project (30%)
  – Project team: each team consists of up to 3 members
  – Clear statement of contribution for each team member
  – Deliverables: mid-project report (5%), live demos (5%), and final project report (20%)
• Attendance and participation (10%)
  – Attendance: 5%
  – Participation: 5% (participating in class discussions)


Grading Scale

85 – 100% = A
75 – 84% = B
60 – 74% = C
< 60% = E
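
As a worked example, the sketch below combines the weights from the Grading Criteria slide with the scale above. The weights and letter bands come from the slides. The treatment of fractional scores (each band's lower bound taken as inclusive, no rounding) is an assumption, because the slides do not say how a score such as 84.5 is handled. The function names and the sample student are hypothetical.

```python
# Weights from the Grading Criteria slide (fractions of the final grade).
WEIGHTS = {
    "homework": 0.40,
    "paper_presentation": 0.20,
    "project": 0.30,
    "attendance_participation": 0.10,
}


def final_percentage(scores):
    """Combine per-component scores (each on a 0-100 scale) into one percentage."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def letter_grade(percentage):
    """Map a percentage to a letter.

    Assumption: each band's lower bound is inclusive and scores are not
    rounded first; the slides do not specify either detail.
    """
    if percentage >= 85:
        return "A"
    if percentage >= 75:
        return "B"
    if percentage >= 60:
        return "C"
    return "E"


# Hypothetical student, for illustration only.
example = {
    "homework": 90,
    "paper_presentation": 80,
    "project": 85,
    "attendance_participation": 100,
}
total = final_percentage(example)
grade = letter_grade(total)
print(total, grade)
```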


Course Policies

• Academic Integrity
  – Independently complete homework/programming assignments.
  – Proper acknowledgement is required if you borrow ideas or content from other sources.
• Submission Policy
  – See each assignment for deadlines.
  – Late submissions will not be accepted.


Course Policies

• Attendance Policy
  – In order to meet federal regulations, the instructor will monitor student participation in this class through attendance or assignments. Students whose attendance or participation cannot be determined one time during the first three weeks of the semester may be dropped from the course.


Course Policies

• Attendance Policy
  – University policy: students are expected to withdraw from the class if more than 20% of the classes scheduled for the semester are missed (excused or unexcused)
• Excused Absences
  – http://www.uky.edu/Ombud/


Student Introduction


Introduction to Big Data

Why Big Data?
  o What launches the Big Data era?
  o What makes Big Data valuable?

Characteristics of Big Data


What Launches the Big Data Era?

Retail
  2 billion products sold in 2014

Social media
  204 million emails/min
  1.8 million likes, 200,000 photos/min
  278,000 tweets/min
  40,000 queries/sec, 3.5 billion/day

Healthcare
  Samaritan Medical Center, Watertown, NY: 120 TB as of 2013
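
The per-second and per-day query figures above can be cross-checked with simple unit conversion: 40,000 queries/sec × 86,400 sec/day = 3,456,000,000, which rounds to the 3.5 billion/day on the slide. The short sketch below does the same conversion in code and applies it to the email rate as well. The variable names are illustrative; the input figures are the ones quoted on the slide.

```python
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400
MINUTES_PER_DAY = 24 * 60        # 1,440

queries_per_second = 40_000      # figure from the slide
queries_per_day = queries_per_second * SECONDS_PER_DAY

emails_per_minute = 204_000_000  # figure from the slide
emails_per_day = emails_per_minute * MINUTES_PER_DAY

print(f"{queries_per_day:,} queries/day")
print(f"{emails_per_day:,} emails/day")
```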


What Makes Big Data Valuable?

Big Data => Better Models => Higher Precision


Example: Recommendation Engines


Example: Using Big Data to Help Patients

Big Data for precision medicine
  o Personalized healthcare
  o Predict/Prevent disease

Data sources
  o Genome
  o Sensors
  o Electronic Health Record (EHR)
  o People


Genome Data

200 GB/genome


Sensor Data


Electronic Health Record (EHR)


People-generated Data – Fitness Device Data

2-5 GB/day


How Can Big Data Help?

Integration

• Genome Data
• Sensor Data
• Electronic Health Records
• People-generated Data


How Can Big Data Help?

Integration => Personalization => Precision


Basic principles for big data integration

• Create a common understanding of data definitions

• Develop a set of data services to qualify the data and make it consistent and ultimately trustworthy

• Set up a streamlined way to integrate your big data sources and systems of record
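
As a minimal sketch of the first two principles, the snippet below maps records from two hypothetical sources onto one shared definition and then applies a simple quality check. The source layouts, field names, and the plausibility range are all assumptions made for illustration; none of them come from the slides.

```python
from dataclasses import dataclass


@dataclass
class PatientRecord:
    """Hypothetical common data definition shared by all sources."""
    patient_id: str
    heart_rate_bpm: int
    source: str


def from_ehr(row):
    # Assumed EHR export layout: {"pid": ..., "hr": ...}
    return PatientRecord(str(row["pid"]), int(row["hr"]), "ehr")


def from_fitness_device(row):
    # Assumed device layout: {"user": ..., "pulse": ...}
    return PatientRecord(str(row["user"]), int(row["pulse"]), "fitness")


def is_trustworthy(record):
    """Toy data-quality service: reject implausible heart rates (assumed range)."""
    return 20 <= record.heart_rate_bpm <= 250


# Hypothetical raw inputs, for illustration only.
raw_ehr = [{"pid": 17, "hr": "72"}]
raw_fitness = [{"user": "17", "pulse": 68}, {"user": "17", "pulse": 0}]

integrated = [from_ehr(r) for r in raw_ehr] + [from_fitness_device(r) for r in raw_fitness]
clean = [r for r in integrated if is_trustworthy(r)]
print(clean)
```

Once every source is converted to the same record type, downstream code can treat EHR and device readings uniformly, which is the point of agreeing on a common definition first.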


Characteristics of Big Data – 6V’s

• Volume
• Variety
• Velocity
• Veracity
• Valence
• Value


Volume of big data

• The amount of data
• Facebook has 250 billion images and 2.5 trillion posts (2016)
• The amount of data is ever increasing
• How to store the data
• How to process the data


Variety of big data

• Ever-increasing different forms of data
• Photographs, sensor data, tweets, encrypted packages
• Traditional data tables
• E-mail messages, with attachments
• Photos, videos, and audio recordings


Velocity of big data

• The speed at which big data is created, stored, and/or analyzed
• Facebook users upload 900 million photos every day
• Packet analysis for cybersecurity
• Search engine queries
• Internet of Things


Veracity of big data

• Quality and trustworthiness of data
• Accuracy, precision, reliability
• Any bias, noise, or abnormality in the data?
• Falsification?
• No good data, no good results


Valence of big data

• Connectedness of big data in the form of graphs
• Data items bond with each other
• Forming connections between disparate data
• Positive valence and negative valence
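
The slide does not give a formula for valence. One common way to quantify connectedness, used here purely as an assumption, is graph density: the number of connections that exist divided by the number that could exist among the data items. The function name and the sample graph below are hypothetical.

```python
def graph_density(num_nodes, edges):
    """Fraction of possible undirected connections that are actually present."""
    if num_nodes < 2:
        return 0.0
    # Treat (A, B) and (B, A) as the same connection and ignore self-loops.
    unique_edges = {frozenset(edge) for edge in edges if edge[0] != edge[1]}
    possible = num_nodes * (num_nodes - 1) // 2
    return len(unique_edges) / possible


# Hypothetical data items A-D with three connections among them.
edges = [("A", "B"), ("B", "C"), ("C", "D")]
density = graph_density(4, edges)
print(density)
```

Under this measure, a value near 1 means almost every pair of items is directly connected, and a value near 0 means the items are mostly isolated.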


Value of big data

• The ability to convert big data information into a monetary reward
• The final goal of big data
• Data mining?
• Decisions and results
