
Hadoop and MySQL by Chris Schneider

Jan 27, 2015

Transcript
Page 1: MySQL and Hadoop

MySQL SF Meetup 2012

Chris Schneider

Page 2: About Me

Chris Schneider, Data Architect @ Ning.com (a Glam Media Company)

Spent the last ~2 years working with Hadoop (CDH)

Spent the last 10 years building MySQL architecture for multiple companies

[email protected]

Page 3: What we’ll cover

Hadoop

CDH

Use cases for Hadoop

Map Reduce

Sqoop

Hive

Impala

Page 4: What is Hadoop?

An open-source framework for storing and processing data on a cluster of servers

Based on Google’s whitepapers on the Google File System (GFS) and MapReduce

Scales linearly

Designed for batch processing

Optimized for streaming reads

Page 5: The Hadoop Distribution

Cloudera, the only distribution for Apache Hadoop

What Cloudera does: Cloudera Manager; enterprise training (Hadoop admin, Hadoop development, HBase, Hive and Pig); enterprise support

Page 6: Why Hadoop

Volume: use Hadoop when you cannot or should not use a traditional RDBMS

Velocity: can ingest terabytes of data per day

Variety: you can have structured or unstructured data

Page 7: Use cases for Hadoop

Recommendation engines: Netflix recommends movies

Ad targeting, log processing, search optimization: eBay, Orbitz

Machine learning and classification: Yahoo Mail’s spam detection; financial identity theft and credit risk

Social graph: Facebook, LinkedIn and eHarmony connections

Election forecasting: predicting the outcome before the election, 50 states out of 50 correct, thanks to Nate Silver!

Page 8: Some details about Hadoop

Two main pieces of Hadoop:

Hadoop Distributed File System (HDFS): distributed and redundant data storage across many nodes, because hardware will inevitably fail

MapReduce: read and process data, with the processing sent to the data; many “map” tasks each work on a slice of the data, and failed tasks are automatically restarted on another node or replica

Page 9

Page 10: MapReduce Word Count

The key and value together represent a row of data, where the key is the byte offset and the value is the line

map (key, value)
    foreach (word in value)
        output (word, 1)

Page 11: Map is used for Searching

Input: 64, “big data is totally cool and big…”

MAP runs “foreach word” over the line

Intermediate output (on local disk): big, 1; data, 1; is, 1; totally, 1; cool, 1; and, 1; big, 1

Page 12: Reduce is used to aggregate

Hadoop groups the intermediate keys and calls a reduce for each unique key, much like GROUP BY and ORDER BY in SQL

reduce (key, list)
    sum the list
    output (key, sum)

Reduce input: big, (1,1); data, (1); is, (1); totally, (1); cool, (1); and, (1)

Reduce output: big, 2; data, 1; is, 1; totally, 1; cool, 1; and, 1
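The word-count flow on the two slides above can be simulated end to end in plain Python. This is only an illustration of the map, shuffle/sort, and reduce steps, not Hadoop API code:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(key, value):
    # Emit (word, 1) for every word in the line, as in the slide's pseudocode.
    for word in value.split():
        yield (word, 1)

def reduce_phase(key, values):
    # Sum the list of counts for one unique key.
    return (key, sum(values))

line = (64, "big data is totally cool and big")   # key = byte offset, value = line
intermediate = sorted(map_phase(*line))           # the shuffle/sort step
result = {key: reduce_phase(key, [v for _, v in group])[1]
          for key, group in groupby(intermediate, key=itemgetter(0))}
print(result)  # {'and': 1, 'big': 2, 'cool': 1, 'data': 1, 'is': 1, 'totally': 1}
```

In real Hadoop the intermediate pairs are partitioned and sorted across the cluster; `sorted()` plus `groupby()` plays that role here on a single machine.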

Page 13: Where does Hadoop fit in?

Think of Hadoop as an augmentation of your traditional RDBMS:

You want to store years of data

You need to aggregate all of the data over many years' time

You want/need ALL your data stored and accessible, not forgotten or deleted

You need this to be free software running on commodity hardware

Page 14: Where does Hadoop fit in?

[Architecture diagram: http traffic is served by a tier of MySQL servers; Sqoop, custom ETL, and Flume move data into the Hadoop (CDH4) cluster (eight DataNodes, a NameNode, NameNode2/SecondaryNameNode, and a JobTracker); Hive and Pig run against the cluster, with Tableau on top for business analytics.]

Page 15: Data Flow

MySQL is used for OLTP data processing

An ETL process moves data from MySQL to Hadoop: either a cron job running Sqoop, or a cron job running custom ETL

Use MapReduce to transform data, run batch analysis, join data, etc.

Export transformed results to OLAP or back to OLTP, for example a dashboard of aggregated data or a report
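As a rough sketch of this flow, a nightly cron-driven job might assemble its Sqoop and Hive steps like this. Every connect string, table name, path, and query below is hypothetical, chosen only to illustrate the import, transform, and export stages:

```python
def nightly_etl_commands(run_date):
    """Build, in order, the commands a cron-driven ETL job would run.
    All hosts, databases, tables, and HDFS paths here are illustrative."""
    # 1. OLTP -> Hadoop: pull the day's data out of MySQL with Sqoop.
    import_cmd = ["sqoop", "import",
                  "--connect", "jdbc:mysql://oltp.example.com/world",
                  "--table", "City",
                  "--target-dir", f"/etl/city/{run_date}"]
    # 2. Transform with MapReduce, here expressed as a Hive batch query.
    transform_cmd = ["hive", "-e",
                     "INSERT OVERWRITE TABLE city_summary "
                     "SELECT CountryCode, SUM(Population) FROM city GROUP BY CountryCode"]
    # 3. Hadoop -> OLAP/OLTP: push the aggregated result back out with Sqoop.
    export_cmd = ["sqoop", "export",
                  "--connect", "jdbc:mysql://olap.example.com/reports",
                  "--table", "city_summary",
                  "--export-dir", f"/etl/city_summary/{run_date}"]
    return [import_cmd, transform_cmd, export_cmd]

steps = nightly_etl_commands("2012-06-01")
for step in steps:
    print(" ".join(step))
```

A real job would run each command with `subprocess.run` and stop the pipeline if any step fails.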

Page 16: MySQL vs. Hadoop

                         MySQL                 Hadoop
Data capacity            Depends, TB+          PB+
Data per query/MR job    Depends, MB to GB     PB+
Read/write               Random read/write     Sequential scans, append-only
Query language           SQL                   MapReduce, scripted streaming, HiveQL, Pig Latin
Transactions             Yes                   No
Indexes                  Yes                   No
Latency                  Sub-second            Minutes to hours
Data structure           Relational            Both structured and unstructured
Enterprise and           Yes                   Yes
community support

Page 17: About Sqoop

Open source; the name stands for SQL-to-Hadoop

Parallel import and export between Hadoop and various RDBMS

Default implementation is JDBC

Optimized for MySQL but not for performance

Integrated with connectors for Oracle, Netezza, Teradata (Not Open Source)

Page 18: Sqoop Data Into Hadoop

This command will submit a Hadoop job that queries your MySQL server and reads all the rows from world.City

The resulting TSV file(s) will be stored in HDFS

$ sqoop import --connect jdbc:mysql://example.com/world \
    --table City \
    --fields-terminated-by '\t' \
    --lines-terminated-by '\n'

Page 19: Sqoop Features

You can choose specific tables to import, specific columns (--columns), or specific rows (--where)

Controlled parallelism: set the number of parallel mappers/connections (--num-mappers) and the column to split on (--split-by)

Incremental loads

Integration with Hive and HBase
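Putting those flags together, a controlled-parallelism import might be assembled like this. The connect string and column values are illustrative, not from the slides:

```python
def build_import(table, split_by, num_mappers=4, where=None):
    """Compose a sqoop import command using the parallelism flags above."""
    cmd = ["sqoop", "import",
           "--connect", "jdbc:mysql://example.com/world",  # illustrative host/db
           "--table", table,
           "--split-by", split_by,             # column whose value range is split across mappers
           "--num-mappers", str(num_mappers)]  # parallel map tasks, i.e. parallel MySQL connections
    if where:
        cmd += ["--where", where]              # row filter pushed into the generated SELECT
    return cmd

cmd = build_import("City", "ID", num_mappers=8, where="Population > 100000")
print(" ".join(cmd))
```

Pick a `--split-by` column with evenly distributed values (an auto-increment primary key works well); a skewed column leaves most mappers idle while one does all the work.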

Page 20: Sqoop Export

The City table needs to exist

Default CSV formatted

Can use staging table (--staging-table)

$ sqoop export --connect jdbc:mysql://example.com/world \
    --table City \
    --export-dir /hdfs_path/City_data

Page 21: About Hive

Offers a way around the complexities of MapReduce/Java

Hive is an open-source project managed by the Apache Software Foundation

Facebook uses Hadoop and wanted non-Java employees to be able to access data: Hive’s language is based on SQL, easy to learn and use, so the data is available to many more people

Hive is a SQL SELECT statement to MapReduce translator

Page 22: More About Hive

Hive is NOT a replacement for an RDBMS

Not all SQL works

Hive is only an interpreter that converts HiveQL to MapReduce

HiveQL queries can take many seconds or minutes to produce a result set

Page 23: RDBMS vs. Hive

              RDBMS                          Hive
Language      SQL                            Subset of SQL, plus Hive extensions
Transactions  Yes                            No
ACID          Yes                            No
Latency       Sub-second (indexed data)      Many seconds to minutes (non-indexed data)
Updates?      Yes: INSERT [IGNORE], UPDATE,  INSERT OVERWRITE
              DELETE, REPLACE

Page 24: Sqoop and Hive

Alternatively, you can create table(s) within the Hive CLI and run an “fs -put” with an exported CSV file on the local file system

$ sqoop import --connect jdbc:mysql://example.com/world \
    --table City \
    --hive-import

Page 25: Impala

It’s new, it’s fast

Allows real-time analytics on very large data sets

Runs on top of Hive

Based on Google’s Dremel: http://research.google.com/pubs/pub36632.html

Cloudera VM for Impala: https://ccp.cloudera.com/display/SUPPORT/Downloads

Page 26: Thanks Everyone

Questions?

Good references:

Cloudera.com

http://infolab.stanford.edu/~ragho/hive-icde2010.pdf

VM downloads: https://ccp.cloudera.com/display/SUPPORT/Cloudera%27s+Hadoop+Demo+VM+for+CDH4