COP 6727: Advanced Database Systems Spring 2013 Dr. Tao Li Florida International University

Dec 26, 2015

COP 6727: Advanced Database Systems

Spring 2013

Dr. Tao Li
Florida International University

Student Self-Introduction

• Name
  – I will try to remember your names. If you have a long name, please let me know what I should call you.

• Anything you want us to know

Course Overview

• Meeting time
  – Tuesday and Thursday, 12:30pm – 1:45pm

• Office hours
  – Thursday, 2:30pm – 4:30pm, or by appointment

• Course webpage
  – http://www.cs.fiu.edu/~taoli/class/CAP6727-S13/index.html

Course Objectives

• This is an advanced database course
  – Students should have already taken COP5725

• Assume knowledge of the fundamental concepts of relational databases.

• Cover the core principles and techniques of data and information management

• Discuss advanced techniques that can be applied to traditional database systems to provide efficient support for emerging applications.

Tentative Topics

• Query processing and optimization
• Transaction management
• Database tuning
• Data stream systems
• Spatial databases
• XML
• Information retrieval and Web data management
• Scalable data processing
• Readings in recent developments in database systems and applications
  – SQL vs. non-SQL databases
  – Nearest neighbor queries
  – High-dimensional indexing
  – Database retrieval and ranking
  – Stream processing
  – Big Data
  – Incremental and online query processing
  – Mobile databases

Assignments and Grading

• Reading/Written Assignments
• Programming Projects
• Midterm Exam
• Final Project/Presentations
• Class attendance is mandatory.
• Evaluation will be a subjective process
  – Effort is a very important component
• Regular In-class Students
  – Quizzes and Class Participation: 5%
  – Midterm Exam: 30%
  – Final Project: 30%
  – Assignments and Projects: 35%
• Online Students
  – Midterm Exam: 30%
  – Final Project: 30%
  – Homework Assignments: 40%

Text and References

Raghu Ramakrishnan and Johannes Gehrke. Database Management Systems. Third Edition, McGraw Hill, 2003. ISBN: 0-07-246563-8.

In addition, course materials will be drawn from recent research literature.

Lecture 1 & 2

• Lecture 1 & 2: Introduction to MapReduce (most of the slides are adapted from Bill Graham, Spiros Papadimitriou, and the Cloudera tutorials)

Outline

• Motivation for MapReduce

• What is MapReduce?

• What is Hadoop?

• What is Hive?

Motivation for MapReduce

• The Big Data

• How to handle big data?

The Big Data

• Big data is everywhere

• Documents
  – Blogs (77 million on Tumblr and 56.6 million on WordPress as of 2012), microblogs, news, reviews

• Images
  – Instagram, Flickr (more than 6 billion images)

• Videos
  – YouTube, all broadcast

• Others
  – Maps (Google Maps)
  – Human genome
  – Aeronautics and space data

Another view on “big”

• 2008: Google processes 20 PB a day

• 2009: Facebook has 2.5 PB of user data + 15 TB/day

• 2009: eBay has 6.5 PB of user data + 50 TB/day

• 2011: Yahoo! has 180-200 PB of data

• 2012: Facebook ingests 500 TB/day

Why do we care about those data?

• Modeling and predicting information flow
• Recommending/predicting links in social networks
• Relevance classification / information filtering
• Sentiment analysis and opinion mining
• Topic modeling and evolution
• Measuring influence in social networks
• Concept mapping
• Search
• …

Big data analysis

• Scalability (with reasonable cost)
  – Algorithm improvements
  – Intuitive way: divide and conquer

Divide and Conquer

Challenges

• Parallel processing is complicated
  – How do we assign tasks to workers?
  – What if we have more tasks than slots?
  – What happens when tasks fail?
  – How do we handle distributed synchronization?

Challenges (cont'd)

• Data storage is not trivial
  – A traditional database alone is not reliable enough at this scale
    • Data volumes are massive
    • Reliably storing PBs of data is challenging
  – Disk/hardware/network failures
  – Probability of a failure event increases with the number of machines

• For example:
  – 1000 hosts, each with 10 disks; a disk lasts 3 years
  – How many failures per day?
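
The question above can be answered with a rough back-of-the-envelope estimate. The sketch below assumes that every disk fails exactly once per 3-year lifetime and that failures are spread evenly over time; both are simplifying assumptions, and the function name is hypothetical.

```python
def expected_disk_failures_per_day(hosts, disks_per_host, disk_lifetime_years):
    """Steady-state estimate: total disks divided by lifetime in days."""
    total_disks = hosts * disks_per_host          # 1000 * 10 = 10,000 disks
    lifetime_days = disk_lifetime_years * 365     # 3 years = 1095 days
    return total_disks / lifetime_days

failures = expected_disk_failures_per_day(1000, 10, 3)
print(f"{failures:.1f} disk failures per day")    # 9.1 disk failures per day
```

Under these assumptions the cluster loses roughly nine disks every day, which is why the storage layer has to treat failure as the normal case.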

What is MapReduce?

• A programming model for expressing distributed computations at a massive scale

• An execution framework for organizing and performing such computations

• An open-source implementation called Hadoop

Workflow of Large Data Problem

MapReduce paradigm

• Implement two functions:

Map(k1, v1) -> list(k2, v2)
Reduce(k2, list(v2)) -> list(v3)

• Framework handles everything else*

• Values with the same key go to the same reducer
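
To make the two signatures concrete, here is a minimal single-process sketch of word count in Python. It is not Hadoop code: `run_mapreduce` is a hypothetical stand-in for the framework, and the in-memory dictionary plays the role of the shuffle that groups values by key.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word."""
    return [(word.lower(), 1) for word in text.split()]

def reduce_fn(word, counts):
    """Reduce(k2, list(v2)) -> list(v3): sum the counts for one word."""
    return [sum(counts)]

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Shuffle: values with the same key end up in the same group.
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    # Reduce: call reduce_fn once per key, visiting keys in sorted order.
    return {k2: reduce_fn(k2, values) for k2, values in sorted(groups.items())}

docs = [("doc1", "hello world"), ("doc2", "hello mapreduce")]
result = run_mapreduce(docs, map_fn, reduce_fn)
print(result)  # {'hello': [2], 'mapreduce': [1], 'world': [1]}
```

Only `map_fn` and `reduce_fn` are specific to word count; a different job would swap those two functions and leave the driver unchanged.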

MapReduce Flow

An Example

MapReduce paradigm (cont'd)

• There's more!

• Partitioners decide which key goes to which reducer
  – partition(k', numPartitions) -> partNumber
  – Divides the key space into chunks for parallel reducers
  – Default is hash-based

• Combiners can combine mapper output before sending it to the reducer
  – Reduce(k2, list(v2)) -> list(v3)
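
A small Python sketch of both ideas, under stated assumptions: `zlib.crc32` is used here only as a convenient stable hash (it is not the hash Hadoop uses), and `combine` is a hypothetical mapper-side pre-aggregation for word count.

```python
import zlib
from collections import defaultdict

def partition(key, num_partitions):
    """partition(k', numPartitions) -> partNumber, hash-based."""
    # crc32 is an assumed stand-in for the framework's key hash.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def combine(pairs):
    """Pre-aggregate (word, count) pairs on the mapper side."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())

mapper_output = [("a", 1), ("b", 1), ("a", 1), ("a", 1)]
combined = combine(mapper_output)
print(combined)  # [('a', 3), ('b', 1)]

# Route each combined pair to its reducer.
by_reducer = defaultdict(list)
for word, count in combined:
    by_reducer[partition(word, 2)].append((word, count))
```

The combiner shrinks four intermediate pairs to two before anything crosses the network, and the partitioner guarantees that every pair for a given word lands on the same reducer.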

MapReduce Flow

MapReduce additional details

• Reduce starts after all mappers complete

• Mapper output gets written to disk

• Intermediate data can be copied sooner

• Reducer gets keys in sorted order

• Keys not sorted across reducers

• Global sort requires 1 reducer or smart partitioning
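
One way to read "smart partitioning" is range partitioning: if reducer i only receives keys smaller than every key sent to reducer i+1, concatenating the sorted reducer outputs gives a global sort. The sketch below assumes hand-picked split points, which are hypothetical; a real job would typically sample the data to choose them.

```python
import bisect

def range_partition(key, split_points):
    """Return the reducer index whose key range contains `key`."""
    return bisect.bisect_right(split_points, key)

keys = ["pear", "apple", "fig", "kiwi", "zucchini", "banana"]
split_points = ["g", "p"]  # hypothetical boundaries for 3 reducers

reducers = [[] for _ in range(len(split_points) + 1)]
for key in keys:
    reducers[range_partition(key, split_points)].append(key)

# Each reducer sees its own keys in sorted order.
sorted_parts = [sorted(part) for part in reducers]

# Concatenating reducer outputs in reducer order yields a global sort.
global_order = [key for part in sorted_parts for key in part]
print(global_order)
# ['apple', 'banana', 'fig', 'kiwi', 'pear', 'zucchini']
```

With a hash-based partitioner the same concatenation would not be sorted, since neighbouring keys are scattered across reducers.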

MapReduce is good at

• Embarrassingly parallel algorithms

• Summing, grouping, filtering, joining

• Off-line batch jobs on massive data sets

• Analyzing an entire large dataset

MapReduce can do

• Iterative jobs (e.g., PageRank, K-means clustering)
  – Each iteration must read/write data to disk
  – The I/O and latency cost of an iteration is high

MapReduce is not good at

• Jobs that need shared state/coordination
  – Tasks are shared-nothing
  – Shared state requires a scalable state store

• Low-latency jobs

• Jobs on small datasets

• Finding individual records

Summary of MapReduce

• Simple programming model

• Scalable, fault-tolerant

• Ideal for (pre-)processing large volumes of data

What is Hadoop?

• Hadoop is an open-source implementation based on GFS and MapReduce from Google

• Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. (2003) The Google File System

• Jeffrey Dean and Sanjay Ghemawat. (2004) MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004

Hadoop provides

• Redundant, fault-tolerant data storage

• Parallel computation framework

• Job coordination

Hadoop Stack

Who uses Hadoop?

• Yahoo!

• Facebook

• Last.fm

• Rackspace

• Digg

• Apache Nutch

• ...

HDFS

• The Hadoop Distributed File System

• Redundant storage

• Designed to reliably store data using commodity hardware

• Designed to expect hardware failures

• Intended for large files

• Designed for batch inserts

Some Concepts about HDFS

• Files are stored as a collection of blocks
• Blocks are 64 MB chunks of a file (configurable)
• Blocks are replicated on 3 nodes (configurable)
• The NameNode (NN) manages metadata about files and blocks
• The SecondaryNameNode (SNN) holds a backup of the NN data
• DataNodes (DN) store and serve blocks
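
As a quick worked example of the block and replication numbers above, the sketch below estimates how many blocks a file occupies and how much raw disk space it needs. The function name is hypothetical, and the calculation assumes the last block only takes up the space of the data it actually holds.

```python
import math

BLOCK_SIZE_MB = 64   # default block size from the slide (configurable)
REPLICATION = 3      # default replication factor from the slide (configurable)

def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (blocks, block replicas, raw storage in MB) for one file."""
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    block_replicas = num_blocks * replication
    raw_storage_mb = file_size_mb * replication
    return num_blocks, block_replicas, raw_storage_mb

print(hdfs_footprint(1000))  # (16, 48, 3000)
```

A 1000 MB file therefore becomes 16 blocks and 48 block replicas spread over the DataNodes, and the NameNode has to track the location of each one.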

Write

Read

If a DataNode fails

• DNs check in with the NN to report health

• Upon failure, the NN orders DNs to re-replicate under-replicated blocks
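
The bookkeeping behind this can be sketched in a few lines of Python. This is an illustration only, not the NameNode's actual logic: the block map, node names, and `under_replicated` function are all hypothetical.

```python
def under_replicated(block_locations, live_nodes, target=3):
    """Return {block: number of missing replicas} given the live DataNodes."""
    missing = {}
    for block, nodes in block_locations.items():
        live = [node for node in nodes if node in live_nodes]
        if len(live) < target:
            missing[block] = target - len(live)
    return missing

# Hypothetical block map kept by the NameNode.
block_locations = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn2", "dn3", "dn4"],
}
live_nodes = {"dn2", "dn3", "dn4"}  # dn1 has stopped checking in

print(under_replicated(block_locations, live_nodes))  # {'blk_1': 1}
```

Here `blk_1` has lost one replica, so one of the surviving DataNodes holding it would be asked to copy it to another node.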

Jobs and Tasks in Hadoop

• Job: a user-submitted map and reduce implementation to apply to a data set

• Task: a single mapper or reducer task
  – Failed tasks get retried automatically
  – Tasks run local to their data, ideally

• JobTracker (JT) manages job submission and task delegation

• TaskTrackers (TT) ask for work and execute tasks

Architecture

How to handle failed tasks?

• JT will retry failed tasks up to N attempts

• After N failed attempts for a task, the job fails

• Some tasks are slower than others

• Speculative execution: the JT starts multiple copies of the same task

• The first one to complete wins; the others are killed
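
The retry rule can be sketched as a simple loop. This is an illustrative single-process sketch, not the JobTracker's code; `run_with_retries` and `flaky_task` are hypothetical names, and speculative execution is not modeled.

```python
def run_with_retries(task, max_attempts):
    """Run task() up to max_attempts times; the job fails if all attempts fail."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(), attempt
        except Exception as err:  # a failed attempt: remember it and retry
            last_error = err
    raise RuntimeError(f"job failed after {max_attempts} attempts") from last_error

calls = {"n": 0}

def flaky_task():
    # Simulated task that fails twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("simulated task failure")
    return "done"

result, attempts = run_with_retries(flaky_task, max_attempts=4)
print(result, attempts)  # done 3
```

With `max_attempts=2` the same task would exhaust its attempts and the whole job would be reported as failed.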

Data locality

• Move computation to the data

• Moving data between nodes has a cost

• Hadoop tries to schedule tasks on nodes with the data

• When that is not possible, the TT has to fetch the data from a DN
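
A toy version of the scheduling preference, assuming the scheduler knows which nodes hold a replica of the input block and which nodes have a free slot. The function and node names are hypothetical, and the real scheduler considers more than this (for example, rack locality).

```python
def pick_node(block_replicas, free_nodes):
    """Prefer a free node that already stores the block; otherwise use any free node."""
    for node in block_replicas:
        if node in free_nodes:
            return node, True              # data-local: no network transfer
    return sorted(free_nodes)[0], False    # remote: the block must be fetched

print(pick_node(["dn1", "dn2"], {"dn2", "dn5"}))  # ('dn2', True)
print(pick_node(["dn1"], {"dn4", "dn5"}))         # ('dn4', False)
```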

Hadoop execution environment

• Local machine (standalone or pseudo-distributed)

• Virtual machine

• Cloud (e.g. Amazon EC2)

• Own cluster

Demo: word count

• Demo

Homework

• Write a Hadoop program to index the words within the text document dataset
  – Example:
    • Input:
      – Doc1: Hello World!
      – Doc2: Hello Java!
    • Expected output:
      – Hello \t Doc1 Doc2
      – World \t Doc1
      – Java \t Doc2

• Due: beginning of class on 01/10
• If you have any questions, email Jingxuan Li ([email protected])
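
To illustrate the expected data flow only, here is an in-memory Python sketch of the indexing job on the two example documents. It is not a Hadoop program and is not the required submission; the function names and the letters-only tokenization are assumptions.

```python
import re
from collections import defaultdict

def map_fn(doc_id, text):
    """Emit (word, doc_id) once per distinct word in the document."""
    words = re.findall(r"[A-Za-z]+", text)   # assumed tokenization: letters only
    return [(word, doc_id) for word in dict.fromkeys(words)]

def reduce_fn(word, doc_ids):
    """Collect the sorted, de-duplicated list of documents containing the word."""
    return sorted(set(doc_ids))

def build_index(docs):
    groups = defaultdict(list)                # stands in for the shuffle
    for doc_id, text in docs:
        for word, d in map_fn(doc_id, text):
            groups[word].append(d)
    return {word: reduce_fn(word, ids) for word, ids in groups.items()}

docs = [("Doc1", "Hello World!"), ("Doc2", "Hello Java!")]
index = build_index(docs)
for word, ids in index.items():
    print(f"{word}\t{' '.join(ids)}")
```

On the example input this prints one line per word, with the word and its document list separated by a tab, in the same form as the expected output above.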

Login Info

• Below is the login information for our Hadoop cluster
  – Server: datamining-node03.cs.fiu.edu
  – U: dbstudent, p: ******* (announced during class)
  – To access the working directory in HDFS (do not modify or remove the other directories!): hadoop fs -ls /user/dbstudent

• Input dataset for the homework (everyone will be working on this dataset, so do not modify it!): /user/dbstudent/dataset

• Output directory (including the source code and the indexing results) format: /user/dbstudent/output-PID

What is Hive?

• Data warehousing tool on top of Hadoop
• Originally developed at Facebook
  – Now a Hadoop sub-project

• Data warehouse infrastructure
  – Execution: MapReduce
  – Storage: HDFS files

• Large datasets, e.g. Facebook daily logs
  – 30 GB (Jan '08), 200 GB (Mar '08), 15+ TB (2009)

• Hive QL: SQL-like query language

Motivation

• Missing components when using Hadoop MapReduce jobs to process data
  – Command-line interface for “end users”
  – Ad-hoc query support
    … without writing full MapReduce jobs
  – Schema information

Hive Applications

• Log processing

• Text mining

• Document indexing

• Customer-facing business intelligence (e.g., Google Analytics)

• Predictive modeling, hypothesis testing

Hive Components

• Shell: allows interactive queries, like a MySQL shell connected to a database
  – Also supports web and JDBC clients
• Driver: session handles, fetch, execute
• Compiler: parse, plan, optimize
• Execution engine: DAG of stages (M/R, HDFS, or metadata)
• Metastore: schema, location in HDFS

Data Model

• Tables
  – Typed columns (int, float, string, date, boolean)
  – Also list and map (for JSON-like data)

• Partitions
  – e.g., to range-partition tables by date

• Buckets
  – Hash partitions within ranges (useful for sampling, join optimization)

Metastore

• Database: namespace containing a set of Tables

• Holds table definitions (column types, physical layout)

• Partition data

• Uses JPOX ORM for implementation; can be stored in Derby, MySQL, many other relational databases

Physical Layout

• Warehouse directory in HDFS
  – e.g., /home/hive/warehouse

• Tables stored in subdirectories of warehouse
  – Partitions, buckets form subdirectories of tables

• Actual data stored in flat files
  – Control char-delimited text, or SequenceFiles
  – With custom SerDe, can use arbitrary format

Useful command examples

• Start Hive: bin/hive
• Show all the tables: SHOW TABLES;
• Create a new table: CREATE TABLE shakespeare (freq INT, word STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;

• Load data into the table: LOAD DATA INPATH "shakespeare_freq" INTO TABLE shakespeare;

Useful command examples (cont'd)

• Select data: SELECT * FROM shakespeare WHERE freq > 100 SORT BY freq ASC LIMIT 10;

• Join: INSERT OVERWRITE TABLE merged SELECT s.word, s.freq, k.freq FROM shakespeare s JOIN kjv k ON (s.word = k.word) WHERE s.freq >= 1 AND k.freq >= 1;
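
Since Hive executes queries as MapReduce jobs, it can help to see what the join above computes in map/shuffle/reduce terms. The Python sketch below is a simplified reduce-side join run in memory; it is an assumption about the general shape of such a plan, not the plan Hive actually generates, and the sample rows are made up for illustration.

```python
from collections import defaultdict

def reduce_side_join(shakespeare, kjv):
    """Join two (freq, word) tables on word, keeping rows with freq >= 1."""
    # Map: tag each record with its source table, keyed by the join column.
    tagged = [(word, ("s", freq)) for freq, word in shakespeare]
    tagged += [(word, ("k", freq)) for freq, word in kjv]

    # Shuffle: records with the same word reach the same reducer.
    groups = defaultdict(list)
    for word, record in tagged:
        groups[word].append(record)

    # Reduce: pair every matching row from one table with every row from the other.
    merged = []
    for word in sorted(groups):
        s_freqs = [f for tag, f in groups[word] if tag == "s" and f >= 1]
        k_freqs = [f for tag, f in groups[word] if tag == "k" and f >= 1]
        merged += [(word, s, k) for s in s_freqs for k in k_freqs]
    return merged

# Made-up sample rows in (freq, word) order, matching the table definition.
shakespeare = [(25, "love"), (3, "thee"), (7, "sword")]
kjv = [(40, "love"), (12, "thee"), (5, "ark")]
print(reduce_side_join(shakespeare, kjv))
# [('love', 25, 40), ('thee', 3, 12)]
```

Words that appear in only one table ("sword", "ark") produce no output row, which matches the inner-join semantics of the query.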

Summary of Hive

• Supports rapid iteration of ad-hoc queries

• Can perform complex joins with minimal code

• Scales to handle much more data than many similar systems

References

• White, T., Hadoop: The Definitive Guide, 2012

• http://hadoop.apache.org/

• http://hive.apache.org/

• MapReduce tutorial: http://hadoop.apache.org/docs/r0.20.2/mapred_tutorial.html#Example%3A+WordCount+v1.0

• Bill Graham, http://blogs.ischool.berkeley.edu/i290-abdt-s12/files/2012/08/BillGraham_IntroToHadoop_Aug30.pdf

• Spiros Papadimitriou, Jimeng Sun, and Rong Yan, http://cs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/07-1.pdf

• Cloudera, http://blog.cloudera.com/wp-content/uploads/2010/01/6-IntroToHive.pdf

Exercises

• To be announced
