
Set Up & Operate Real-Time Data Loading into Hadoop

Getting data into Hadoop is not difficult, but it becomes complex when you want to load 'live' or semi-live data into your Hadoop cluster from your Oracle and MySQL databases. There are plenty of solutions available, from manually dumping and loading to using a tool like Sqoop, each with its good and bad sides. Neither approach is easy, and both are prone to lag between the moment you perform the dump and the moment the data is loaded into Hadoop.

Replicating into Hadoop with Tungsten Replicator enables you to stream replication data from your Oracle and MySQL servers straight into Hadoop. Built on the replication service at the core of Tungsten Replicator, and supporting all of its topology and reliability features, the Hadoop applier lets you replicate data directly from Oracle and MySQL into Hadoop.

In this course, we look at the existing methods of loading data into Hadoop, review how the Hadoop replicator works, and give a live demo of replicating data from MySQL into Hadoop.
Transcript
Page 1: Set Up & Operate Real-Time Data Loading into Hadoop

©Continuent 2014

Real-Time Loading from MySQL to Hadoop

Featuring Continuent Tungsten

MC Brown, Senior Information Architect

Page 2: Set Up & Operate Real-Time Data Loading into Hadoop

©Continuent 2014 2

Introducing Continuent

Page 3: Set Up & Operate Real-Time Data Loading into Hadoop

©Continuent 2014

Introducing Continuent

3

• The leading provider of clustering and replication for open source DBMS

• Our Product: Continuent Tungsten

• Clustering - Commercial-grade HA, performance scaling and data management for MySQL

• Replication - Flexible, high-performance data movement

Page 4: Set Up & Operate Real-Time Data Loading into Hadoop

©Continuent 2014

Quick Continuent Facts

• Largest Tungsten installation processes over 700 million transactions daily on 225 terabytes of data

• Tungsten Replicator was application of the year at the 2011 MySQL User Conference

• A wide variety of topologies involving MySQL, Oracle, Vertica, and MongoDB is in production now

• MySQL to Hadoop deployments are now in progress with multiple customers

4

Page 5: Set Up & Operate Real-Time Data Loading into Hadoop

©Continuent 2014

Continuent Tungsten Customers

5

Page 6: Set Up & Operate Real-Time Data Loading into Hadoop

©Continuent 2014 6

Five Minute Hadoop Introduction

Page 7: Set Up & Operate Real-Time Data Loading into Hadoop

©Continuent 2014

What Is Hadoop, Exactly?

7

a. A distributed file system

b. A method of processing massive quantities of data in parallel

c. The Cutting family’s stuffed elephant

d. All of the above

Page 8: Set Up & Operate Real-Time Data Loading into Hadoop

©Continuent 2014

Hadoop Distributed File System

8

[Diagram] Clients (a Java client, Hive, Pig, or the hadoop command) ask the NameNode (the directory) to find a file, then read its block(s) from the DataNodes, which hold the replicated data.
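
Not on the slide, but to make the file-system side concrete, here are a few basic HDFS operations using the standard hadoop command (the staging path reappears later in the deck; the file name is hypothetical):

$ hadoop fs -mkdir -p /user/tungsten/staging/db01        # directory creation handled by the NameNode
$ hadoop fs -put sbtest.csv /user/tungsten/staging/db01  # file blocks are written to the DataNodes
$ hadoop fs -ls /user/tungsten/staging/db01              # listing is a NameNode (directory) lookup
$ hadoop fs -cat /user/tungsten/staging/db01/sbtest.csv  # reads stream the blocks back from the DataNodes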

Page 9: Set Up & Operate Real-Time Data Loading into Hadoop

©Continuent 2014

Map/Reduce

9

Input split #1:  Acme,2013,4.75 / Spitze,2013,25.00 / Acme,2013,55.25 / Excelsior,2013,1.00 / Spitze,2013,5.00

Input split #2:  Spitze,2014,60.00 / Spitze,2014,9.50 / Acme,2014,1.00 / Acme,2014,4.00 / Excelsior,2014,1.00 / Excelsior,2014,9.00

MAP (split #1):  Acme,60.00 / Excelsior,1.00 / Spitze,30.00

MAP (split #2):  Acme,5.00 / Excelsior,10.00 / Spitze,69.50

REDUCE:          Acme,65.00 / Excelsior,11.00 / Spitze,99.50
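
As a rough analogue outside Hadoop (not in the deck), the same per-company aggregation can be sketched with awk; the input file names are hypothetical:

$ cat split1.csv split2.csv \
    | awk -F, '
        { sum[$1] += $3 }                                      # map: accumulate the amount per company
        END { for (c in sum) printf "%s,%.2f\n", c, sum[c] }   # reduce: emit one total per company
      ' | sort
Acme,65.00
Excelsior,11.00
Spitze,99.50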

Page 10: Set Up & Operate Real-Time Data Loading into Hadoop

©Continuent 2014

Typical MySQL to Hadoop Use Case

10

[Diagram] Transaction processing in MySQL feeding a Hadoop cluster, with Hive used for analytics. Open questions: How is the initial load done? What latency can be achieved? Are application changes needed? How are materialized views built? How are ongoing changes captured? How much load is placed on the application?

Page 11: Set Up & Operate Real-Time Data Loading into Hadoop

©Continuent 2014

Options for Loading Data

11

• Manual loading of CSV files

• Sqoop

• Tungsten Replicator

Page 12: Set Up & Operate Real-Time Data Loading into Hadoop

©Continuent 2014

Comparing Methods in Detail

12

                          Manual via CSV              Sqoop                          Tungsten Replicator
Process                   Manual/Scripted             Manual/Scripted                Fully automated
Incremental Loading       Possible with DDL changes   Requires DDL changes           Fully supported
Latency                   Full-load                   Intermittent                   Real-time
Extraction Requirements   Full table scan             Full and partial table scans   Low-impact binlog scan

Page 13: Set Up & Operate Real-Time Data Loading into Hadoop

©Continuent 2014 13

Replicating MySQL Data to Hadoop using

Tungsten Replicator

Page 14: Set Up & Operate Real-Time Data Loading into Hadoop

©Continuent 2014

What is Tungsten Replicator?

14

A real-time, high-performance, open source database replication engine

• GPL V2 license - 100% open source

• Download from https://code.google.com/p/tungsten-replicator/

• Annual support subscription available from Continuent

“GoldenGate without the Price Tag”®

Page 15: Set Up & Operate Real-Time Data Loading into Hadoop

©Continuent 2014

Tungsten Replicator Overview

15

[Diagram] Master side: a replicator extracts transactions from the DBMS logs and writes them, together with metadata, into the THL (Transaction History Log). Slave side: a replicator pulls the THL (transactions + metadata) and applies the changes to the target.

Page 16: Set Up & Operate Real-Time Data Loading into Hadoop

©Continuent 2014

Tungsten Replicator 3.0 & Hadoop

16

• Extract from MySQL or Oracle

• Base Hadoop support

• Platforms: Cloudera, HortonWorks, MapR, Amazon EMR, IBM InfoSphere BigInsights

• Provision using Sqoop or parallel extraction

• Automatic replication of incremental changes

• Transformation to preferred HDFS formats

• Schema generation for Hive

• Tools for generating materialized views

Page 17: Set Up & Operate Real-Time Data Loading into Hadoop

©Continuent 2014

Hadoop Support

17

                              Hadoop                         Hadoop-Base FS
Apache Hadoop                 Yes                            Yes
Cloudera                      Yes (Certified)                Yes (Certified)
MapR                          Yes
HortonWorks                   Yes (Awaiting Certification)
IBM InfoSphere BigInsights    Yes
Amazon EMR                    Yes

Page 18: Set Up & Operate Real-Time Data Loading into Hadoop

©Continuent 2014

Basic MySQL to Hadoop Replication

18

[Diagram] MySQL (binlog_format=row) → Tungsten master replicator extracts from the MySQL binlog → Tungsten slave replicator writes CSV files and loads the raw CSV into the Hadoop cluster using the hadoop command (e.g., via LOAD DATA into Hive) → data is accessed via Hive.

Master-side filtering (see the configuration sketch after this slide):
• pkey - fill in primary key info
• colnames - fill in column names
• cdc - add update type and schema/table info
• source - add the source DBMS
• replicate - subset the tables to be replicated
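
Not part of the deck: a minimal sketch of how the master-side replicator and its extractor filters might be installed with tpm, reusing the host and credentials shown elsewhere in the slides. The service name is hypothetical, and the exact option set (in particular --svc-extractor-filters and any Hadoop-specific applier options) should be checked against the Tungsten Replicator 3.0 documentation:

# hypothetical service name "mysql2hadoop"; options to be verified against the 3.0 docs
$ ./tools/tpm install mysql2hadoop \
    --master=logos1 \
    --install-directory=/opt/continuent \
    --replication-user=tungsten \
    --replication-password=secret \
    --svc-extractor-filters=colnames,pkey \
    --start-and-report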

Page 19: Set Up & Operate Real-Time Data Loading into Hadoop

©Continuent 2014

Hadoop Data Loading - Gory Details

19

[Diagram] Transactions from the master arrive at the slave replicator, which writes the data to CSV files through a JavaScript load script (e.g. hadoop.js) and loads them into staging “tables” using the hadoop command. Table definitions are generated for both the staging and the base tables, and Map/Reduce jobs turn the staging data into base tables / materialized views.

Page 20: Set Up & Operate Real-Time Data Loading into Hadoop

©Continuent 2014 20

Demo #1

Replicating sysbench data

Page 21: Set Up & Operate Real-Time Data Loading into Hadoop

©Continuent 2014 21

Viewing MySQL Data in Hadoop

Page 22: Set Up & Operate Real-Time Data Loading into Hadoop

©Continuent 2014

Generating Staging Table Schema

22

$ ddlscan -template ddl-mysql-hive-0.10-staging.vm \
    -user tungsten -pass secret \
    -url jdbc:mysql:thin://logos1:3306/db01 -db db01
...
DROP TABLE IF EXISTS db01.stage_xxx_sbtest;

CREATE EXTERNAL TABLE db01.stage_xxx_sbtest
(
  tungsten_opcode STRING ,
  tungsten_seqno INT ,
  tungsten_row_id INT ,
  id INT ,
  k INT ,
  c STRING ,
  pad STRING )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' ESCAPED BY '\\'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE LOCATION '/user/tungsten/staging/db01/sbtest';
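
Not shown on the slide: the generated DDL is typically captured to a file and executed with the Hive CLI; the file name here is hypothetical:

$ ddlscan -template ddl-mysql-hive-0.10-staging.vm \
    -user tungsten -pass secret \
    -url jdbc:mysql:thin://logos1:3306/db01 -db db01 > staging.sql
$ hive -f staging.sql    # create the staging table definitions in Hive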

Page 23: Set Up & Operate Real-Time Data Loading into Hadoop

©Continuent 2014

Generating Base Table Schema

$ ddlscan -template ddl-mysql-hive-0.10.vm -user tungsten \
    -pass secret -url jdbc:mysql:thin://logos1:3306/db01 -db db01
...
DROP TABLE IF EXISTS db01.sbtest;

CREATE TABLE db01.sbtest
(
  id INT ,
  k INT ,
  c STRING ,
  pad STRING )
;

23

Page 24: Set Up & Operate Real-Time Data Loading into Hadoop

©Continuent 2014

Creating a Materialized View in Theory

24

[Diagram] Log #1, Log #2, ... Log #N feed the job.

MAP: sort rows by key(s), then by transaction order

REDUCE: emit the last row for each key, unless that row is a delete
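
Not in the deck: a minimal command-line sketch of the same idea, assuming the staging rows are in a CSV file (name hypothetical) with the column order of the staging schema shown earlier: opcode, seqno, row_id, id, k, c, pad, and assuming 'D' marks a delete. Sort by key and transaction order, then keep the final row per key unless it is a delete:

$ sort -t, -k4,4n -k2,2n -k3,3n staging.csv \
    | awk -F, '
        { last[$4] = $0 }                       # map: remember the most recent row seen for each id
        END {
          for (k in last) {
            split(last[k], f, ",")
            if (f[1] != "D")                    # drop keys whose final operation is a delete
              print f[4] "," f[5] "," f[6] "," f[7]
          }
        }'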

Page 25: Set Up & Operate Real-Time Data Loading into Hadoop

©Continuent 2014

Creating a Materialized View in Hive

$ hive
...
hive> ADD FILE /home/rhodges/github/continuent-tools-hadoop/bin/tungsten-reduce;
hive> FROM (
        -- MAP
        SELECT sbx.*
        FROM db01.stage_xxx_sbtest sbx
        DISTRIBUTE BY id
        SORT BY id, tungsten_seqno, tungsten_row_id
      ) map1
      -- REDUCE
      INSERT OVERWRITE TABLE db01.sbtest
        SELECT TRANSFORM(
          tungsten_opcode, tungsten_seqno, tungsten_row_id, id, k, c, pad)
        USING 'perl tungsten-reduce -k id -c tungsten_opcode,tungsten_seqno,tungsten_row_id,id,k,c,pad'
        AS id INT, k INT, c STRING, pad STRING;
...

25

Page 26: Set Up & Operate Real-Time Data Loading into Hadoop

©Continuent 2014

Comparing MySQL and Hadoop Data

$ export TUNGSTEN_EXT_LIBS=/usr/lib/hive/lib
...
$ /opt/continuent/tungsten/bristlecone/bin/dc \
    -url1 jdbc:mysql:thin://logos1:3306/db01 \
    -user1 tungsten -password1 secret \
    -url2 jdbc:hive2://localhost:10000 \
    -user2 'tungsten' -password2 'secret' -schema db01 \
    -table sbtest -verbose -keys id \
    -driver org.apache.hive.jdbc.HiveDriver
22:33:08,093 INFO DC - Data comparison utility
...
22:33:24,526 INFO Tables compare OK

26

Page 27: Set Up & Operate Real-Time Data Loading into Hadoop

©Continuent 2014

Doing it all at once

$ git clone \
    https://github.com/continuent/continuent-tools-hadoop.git

$ cd continuent-tools-hadoop

$ bin/load-reduce-check \
    -U jdbc:mysql:thin://logos1:3306/db01 \
    -s db01 --verbose

27

Page 28: Set Up & Operate Real-Time Data Loading into Hadoop

©Continuent 2014 28

Demo #2

Constructing and Checking a Materialized View

Page 29: Set Up & Operate Real-Time Data Loading into Hadoop

©Continuent 2014 29

Scaling It Up!

Page 30: Set Up & Operate Real-Time Data Loading into Hadoop

©Continuent 2014

MySQL to Hadoop Fan-In Architecture

30

[Diagram] Masters m1, m2, and m3 each run a replicator extracting row-based replication (RBR) events. On the slave side, replicator services m1, m2, and m3 fan in, applying all three streams into the Hadoop cluster (many nodes).

Page 31: Set Up & Operate Real-Time Data Loading into Hadoop

©Continuent 2014

Integration with Provisioning

31

[Diagram] As in the basic topology, MySQL (binlog_format=row) feeds the Tungsten master and slave replicators, which write CSV files from the MySQL binlog and load them into the Hadoop cluster for access via Hive. In parallel, Sqoop or another ETL tool performs the initial provisioning run directly from MySQL into Hadoop (an example Sqoop command follows below).
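
Not part of the deck: a minimal sketch of what the initial provisioning run with Sqoop might look like, reusing the db01.sbtest example and the staging path from the earlier slides; check the exact options against your Sqoop version:

$ sqoop import \
    --connect jdbc:mysql://logos1:3306/db01 \
    --username tungsten --password secret \
    --table sbtest \
    --fields-terminated-by '\001' \
    --target-dir /user/tungsten/staging/db01/sbtest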

Page 32: Set Up & Operate Real-Time Data Loading into Hadoop

©Continuent 2014

On-Demand Provisioning via Parallel Extract

32

[Diagram] Same topology as the basic replication slide, except that for provisioning the Tungsten master replicator extracts directly from the MySQL tables (parallel extraction) rather than from the binlog. The same master-side filters apply (pkey, colnames, cdc, source, replicate, plus other filters as needed); the slave replicator writes CSV files, loads the raw CSV into the Hadoop cluster (e.g., via LOAD DATA into Hive), and the data is accessed via Hive.

Page 33: Set Up & Operate Real-Time Data Loading into Hadoop

©Continuent 2014

Tungsten Replicator Roadmap

33

• Parallel CSV file loading (supported)

• Partition loaded data by commit time (supported)

• Expanded data format support (CSV, JSON)

• Replication out of Hadoop

Page 34: Set Up & Operate Real-Time Data Loading into Hadoop

©Continuent 2014

Continuent Hadoop Tools Roadmap

• HBase Data Support & Materialization

• Impala Data Support & Materialization

• Integration with emerging real-time analytics (e.g. Storm, Spark, Shark, Stinger, …)

• Point-in-Time Table Generation

• Time-Series Generation

• Rolling and Managed Materialization

• Replicator driven data manipulation (e.g. denormalisation, combining, …)

34

Page 35: Set Up & Operate Real-Time Data Loading into Hadoop

©Continuent 2014 35

Getting Started with Continuent Tungsten

Page 36: Set Up & Operate Real-Time Data Loading into Hadoop

©Continuent 2014

Where Is Everything?

36

• Tungsten Replicator 3.0 builds are now available on code.google.com http://code.google.com/p/tungsten-replicator/

• Replicator 3.0 documentation is available on the Continuent website http://docs.continuent.com/tungsten-replicator-3.0/deployment-hadoop.html

• Tungsten Hadoop tools are available on GitHub https://github.com/continuent/continuent-tools-hadoop

• Contact Continuent for support

Page 37: Set Up & Operate Real-Time Data Loading into Hadoop

©Continuent 2014

Commercial Terms

• Replicator features are open source (GPL V2)

• Investment Elements

• POC / Development (Walk Away Option)

• Production Deployment

• Annual Support Subscription

• Governing Principles

• Annual Subscription Required

• More Upfront Investment -> Less Annual Subscription

37

Page 38: Set Up & Operate Real-Time Data Loading into Hadoop

©Continuent 2014

We Do Clustering Too!

38

Tungsten clusters combine off-the-shelf open source MySQL servers into data services with:

• 24x7 data access
• Scaling of load on replicas
• Simple management commands

...without app changes or data migration.

[Diagram] Example: GonzoPortal.com running Apache/PHP in Amazon US West, with application traffic routed through Tungsten Connectors to the cluster.

Page 39: Set Up & Operate Real-Time Data Loading into Hadoop

©Continuent 2014

In Conclusion: Tungsten Offers...

• Fully automated, real-time replication from MySQL into Hadoop

• Support for automatic transformation to HDFS data formats and creation of full materialized views

• Positions users to take advantage of evolving real-time features in Hadoop

39

Page 40: Set Up & Operate Real-Time Data Loading into Hadoop

©Continuent 2014

Continuent Web Page: http://www.continuent.com

Tungsten Replicator: http://code.google.com/p/tungsten-replicator

Our Blogs:
  http://scale-out-blog.blogspot.com
  http://mcslp.wordpress.com
  http://www.continuent.com/news/blogs

560 S. Winchester Blvd., Suite 500, San Jose, CA 95128
Tel +1 (866) 998-3642
Fax +1 (408) 668-1009
e-mail: [email protected]