Page 1: Gorbachev Hadoop R

Practical Hadoop by Example

for relational database professionals

Alex Gorbachev

12-Mar-2013

New York, NY

Page 2: Gorbachev Hadoop R

© 2012 – Pythian

 Alex Gorbachev

•  Chief Technology Officer at Pythian

•  Blogger

•  OakTable Network member

•  Oracle ACE Director

•  Founder of BattleAgainstAnyGuess.com

•  Founder of Sydney Oracle Meetup

•  IOUG Director of Communities

•  EVP, Ottawa Oracle User Group

Page 3: Gorbachev Hadoop R

Why Companies Trust Pythian

Recognized Leader:

•  Global industry leader in remote database administration services and consulting for Oracle, Oracle Applications, MySQL and SQL Server

•  Work with over 150 multinational companies such as Forbes.com, Fox Interactive Media, and MDS Inc. to help manage their complex IT deployments

Expertise:

•  One of the world’s largest concentrations of dedicated, full-time DBA expertise.

Global Reach & Scalability:

•  24/7/365 global remote support for DBA and consulting, systems administration, special projects or emergency response

Page 4: Gorbachev Hadoop R

 Agenda

•  What is Big Data?

•  What is Hadoop?

•  Hadoop use cases

•  Moving data in and out of Hadoop

•  Avoiding major pitfalls

Page 5: Gorbachev Hadoop R

What is Big Data?

Page 6: Gorbachev Hadoop R

Doesn’t Matter.

We are here to discuss data architecture and use cases. 

Not define market segments.  

Page 7: Gorbachev Hadoop R

Page 8: Gorbachev Hadoop R

Given enough skill and money, Oracle can do anything.

Let's talk about efficient solutions.

Page 9: Gorbachev Hadoop R

When RDBMS Makes no Sense?

•  Storing images and video

•  Processing images and video

•  Storing and processing other large files

•  PDFs, Excel files

•  Processing large blocks of natural language text

•  Blog posts, job ads, product descriptions

•  Processing semi-structured data

•  CSV, JSON, XML, log files

•  Sensor data

Page 10: Gorbachev Hadoop R

When RDBMS Makes no Sense?

•  Ad-hoc, exploratory analytics

•  Integrating data from external sources

•  Data cleanup tasks

•  Very advanced analytics (machine learning)

Page 11: Gorbachev Hadoop R

New Data Sources

•  Blog posts

•  Social media

•  Images

•  Videos

•  Logs from web applications

•  Sensors

They all have large potential value

But they are an awkward fit for traditional data warehouses

Page 12: Gorbachev Hadoop R

Big Problems with Big Data

•  It is:

•  Unstructured

•  Unprocessed

•  Un-aggregated

•  Un-filtered

•  Repetitive

•  Low quality

•  And generally messy.

Oh, and there is a lot of it.

Page 13: Gorbachev Hadoop R

Technical Challenges

•  Storage capacity

•  Storage throughput

•  Pipeline throughput

•  Processing power

•  Parallel processing

•  System Integration

•  Data Analysis

Scalable storage 

Massive Parallel Processing  

Ready to use tools  

Page 14: Gorbachev Hadoop R

Big Data Solutions  

Real-time transactions at very high scale, always available, distributed

•  Relaxing ACID rules

•  Atomicity

•  Consistency

•  Isolation

•  Durability

Example: eventual consistency in Cassandra

Analytics and batch-like workloads on very large volumes of often unstructured data

•  Massively scalable

•  Throughput oriented

•  Sacrifice efficiency for scale

Hadoop is the most widely accepted industry standard / tool

Page 15: Gorbachev Hadoop R

What is Hadoop?

Page 16: Gorbachev Hadoop R

Hadoop Principles

Bring Code to Data

Share Nothing

 

Page 17: Gorbachev Hadoop R

Hadoop in a Nutshell

Replicated Distributed Big-Data File System

Map-Reduce - framework for writing massively parallel jobs

Page 18: Gorbachev Hadoop R

HDFS architecture: simplified view

•  Files are split into large blocks

•  Each block is replicated on write

•  Files can only be created and deleted by one client

•  Uploading new data? => new file

•  Append supported in recent versions

•  Update data? => recreate file

•  No concurrent writes to a file

•  Clients transfer blocks directly to & from data nodes

•  Data nodes use cheap local disks

•  Local reads are efficient
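
The block splitting and replication described on this slide can be put into numbers with a minimal sketch. The 128 MiB block size and the replication factor of 3 are assumed values for illustration; the slide only says that blocks are large and replicated on write.

```python
import math

def hdfs_footprint(file_bytes, block_bytes=128 * 1024**2, replication=3):
    """Return (number of blocks, raw bytes stored) for one file.

    block_bytes and replication are assumptions for illustration,
    not values taken from the slides.
    """
    blocks = math.ceil(file_bytes / block_bytes)
    # The last block is not padded, so raw usage is file size times replicas.
    raw_bytes = file_bytes * replication
    return blocks, raw_bytes

blocks, raw = hdfs_footprint(1024**3)  # a 1 GiB file
print(blocks, "blocks,", raw / 1024**3, "GiB of raw disk")
```

Under these assumptions a 1 GiB file becomes 8 blocks and occupies 3 GiB of raw disk across the data nodes.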

Page 19: Gorbachev Hadoop R

HDFS design principles 

Page 20: Gorbachev Hadoop R

Map Reduce example: histogram calculation
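
The diagram for this slide did not survive in the transcript. As a stand-in, here is a minimal single-process sketch of a histogram computed in map, shuffle and reduce steps. The function names, the bucket width and the sample data are hypothetical; a real job would run the map and reduce steps on many nodes.

```python
from collections import defaultdict

def map_phase(record, bucket_width=10):
    # Emit (bucket_start, 1) for each input value.
    yield (record // bucket_width) * bucket_width, 1

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Sum the counts collected for one bucket.
    return key, sum(values)

def histogram(records, bucket_width=10):
    pairs = [kv for r in records for kv in map_phase(r, bucket_width)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(k, v) for k, v in sorted(grouped.items()))

ages = [23, 27, 31, 35, 38, 42, 29]
print(histogram(ages))  # {20: 3, 30: 3, 40: 1}
```

The shuffle step is where every (key, value) pair crosses the network in a real cluster, which is why the next slide lists shuffle traffic as a pitfall.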

Page 21: Gorbachev Hadoop R

Map Reduce pros & cons

Advantages

•  Very simple

•  Flexible

•  Highly scalable

•  Good fit for HDFS – mappers read locally

•  Fault tolerant

Pitfalls

•  Low efficiency

•  Lots of intermediate data

•  Lots of network traffic on shuffle

•  Complex manipulation requires pipeline of multiple jobs

•  No high-level language

•  Only mappers leverage local reads on HDFS

Page 22: Gorbachev Hadoop R

Main components of Hadoop ecosystem

•  Hive – HiveQL is a SQL-like query language

•  Generates MapReduce jobs

•  Pig – data set manipulation language (like creating your own query execution plan)

•  Generates MapReduce jobs

•  Zookeeper – distributed cluster manager

•  Oozie – workflow scheduler services

•  Sqoop – transfer data between Hadoop and relational

Page 23: Gorbachev Hadoop R

Non-MR processing on Hadoop

•  HBase – column-oriented key-value store (NoSQL)

•  SQL without Map Reduce

•  Impala (Cloudera)

•  Drill (MapR)

•  Phoenix (Salesforce.com)

•  Hadapt (commercial)

•  Shark – Spark in-memory analytics on Hadoop

Page 24: Gorbachev Hadoop R

Hadoop Benefits

•  Reliable solution based on unreliable hardware

•  Designed for large files

•  Load data first, structure later

•  Designed to maximize throughput of large scans

•  Designed to leverage parallelism

•  Designed to scale

•  Flexible development platform

•  Solution Ecosystem

Page 25: Gorbachev Hadoop R

Hadoop Limitations

•  Hadoop is scalable but not fast

•  Some assembly required

•  Batteries not included

•  Instrumentation not included either

•  DIY mindset (remember MySQL?)

Page 26: Gorbachev Hadoop R

How much does it cost?

$300K DIY on SuperMicro

•  100 data nodes

•  2 name nodes

•  3 racks

•  800 Sandy Bridge CPU cores

•  6.4 TB RAM

•  600 x 2TB disks

•  1.2 PB of raw disk capacity

•  400 TB usable (triple mirror)

•  Open-source s/w
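
The figures on this slide can be cross-checked with a few lines of arithmetic. This is a sketch using only the slide's numbers; the per-node split assumes the disks, cores and RAM are spread evenly over the 100 data nodes, which the slide does not state.

```python
data_nodes = 100
disks = 600
disk_tb = 2
cores = 800
ram_tb = 6.4
cost_usd = 300_000
replication = 3  # "triple mirror" on the slide

raw_tb = disks * disk_tb              # raw capacity in TB
usable_tb = raw_tb / replication      # capacity after 3x replication

# Assumption: resources are divided evenly across the data nodes.
per_node = {
    "disks": disks / data_nodes,
    "cores": cores / data_nodes,
    "ram_gb": ram_tb * 1000 / data_nodes,
}
usd_per_usable_tb = cost_usd / usable_tb

print(raw_tb, "TB raw,", usable_tb, "TB usable")
print(per_node)
print(usd_per_usable_tb, "USD per usable TB")
```

The totals agree with the slide: 1,200 TB (1.2 PB) raw and 400 TB usable, which works out to roughly 750 USD per usable TB for the hardware.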

Page 27: Gorbachev Hadoop R

Hadoop Use Cases

Page 28: Gorbachev Hadoop R

Use Cases for Big Data

•  Top-line contributions

•  Analyze customer behavior

•  Optimize ad placements

•  Customized promotions, etc.

•  Recommendation systems

•  Netflix, Pandora, Amazon

•  Improve connection with your customers

•  Know your customers – patterns and responses

•  Bottom-line contributors

•  Cheap archive storage

•  ETL layer – transformation engine, data cleansing

Page 29: Gorbachev Hadoop R

Typical Initial Use-Cases for Hadoop in modern Enterprise IT

•  Transformation engine (part of ETL)

•  Scales easily

•  Inexpensive processing capacity

•  Any data source and destination

•  Data Landfill

•  Stop throwing away any data

•  Don’t know how to use data today? Maybe tomorrow you will

•  Hadoop is very inexpensive but very reliable

Page 30: Gorbachev Hadoop R

 Advanced: Data Science Platform

•  Data warehouse is good when questions are known, data domain and structure are defined

•  Hadoop is great for seeking new meaning of data, new types of insights

•  Unique information parsing and interpretation

•  Huge variety of data sources and domains

•  When new insights are found and new structure defined, Hadoop often takes the place of the ETL engine

•  Newly structured information is then loaded to more traditional data warehouses (still today)

Page 31: Gorbachev Hadoop R

Pythian Internal Hadoop Use

•  OCR of screen video capture from Pythian privileged access surveillance system

•  Input raw frames from video capture

•  Map-Reduce job runs OCR on frames and produces text

•  Map-Reduce job identifies text changes from frame to frame and produces text stream with timestamp when it was on the screen

•  Other Map-Reduce jobs mine text (and keystrokes) for insights

•  Credit card patterns

•  Sensitive commands (like DROP TABLE)

•  Root access

•  Unusual activity patterns

•  Merge with monitoring and documentation systems
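
The text-mining step above could look roughly like the following map function. This is only a sketch: the regular expressions, the pattern names and the sample text stream are hypothetical and are not taken from Pythian's system.

```python
import re

# Hypothetical patterns for illustration; a real deployment would tune these.
PATTERNS = {
    "credit_card": re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b"),
    "sensitive_sql": re.compile(r"\bDROP\s+TABLE\b", re.IGNORECASE),
    "root_access": re.compile(r"\bsudo\s+su\b"),
}

def map_frame(timestamp, text):
    """Emit (finding_type, timestamp) for each pattern seen in one frame's text."""
    for name, pattern in PATTERNS.items():
        if pattern.search(text):
            yield name, timestamp

# Made-up OCR output: (time on screen, recognized text).
stream = [
    ("10:00:01", "SQL> select count(*) from orders;"),
    ("10:00:07", "SQL> drop table orders;"),
    ("10:00:15", "$ sudo su -"),
    ("10:00:22", "card 4111 1111 1111 1111 exp 01/15"),
]

findings = [hit for ts, text in stream for hit in map_frame(ts, text)]
print(findings)
```

Because each frame is checked independently, the same function can run as a mapper over any slice of the text stream.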

Page 32: Gorbachev Hadoop R

Hadoop in the Data Warehouse

Use Cases and Customer Stories

Page 33: Gorbachev Hadoop R

ETL for Unstructured Data

Logs (web servers, app server, clickstreams) → Flume → Hadoop (cleanup, aggregation, long-term storage) → DWH (BI, batch reports)

Page 34: Gorbachev Hadoop R

ETL for Structured Data

OLTP (Oracle, MySQL, Informix…) → Sqoop, Perl → Hadoop (transformation, aggregation, long-term storage) → DWH (BI, batch reports)

Page 35: Gorbachev Hadoop R

Page 36: Gorbachev Hadoop R

Rare Historical Report

Page 37: Gorbachev Hadoop R

Find Needle in Haystack

Page 38: Gorbachev Hadoop R

Hadoop for Oracle DBAs?

•  alert.log repository

•  listener.log repository

•  Statspack/AWR/ASH repository

•  trace repository

•  DB Audit repository

•  Web logs repository

•  SAR repository

•  SQL and execution plans repository

•  Database jobs execution logs

Page 39: Gorbachev Hadoop R

Connecting the (big) Dots

Page 40: Gorbachev Hadoop R

Sqoop

Queries 

Page 41: Gorbachev Hadoop R

Sqoop is Flexible Import

• Select <columns> from <table> where <condition>

• Or <write your own query>

• Split column

• Parallel

• Incremental

• File formats

Page 42: Gorbachev Hadoop R

Sqoop Import Examples

•  sqoop import --connect jdbc:oracle:thin:@//dbserver:1521/masterdb --username hr --table emp --where "start_date > '01-01-2012'"

•  sqoop import --connect jdbc:oracle:thin:@//dbserver:1521/masterdb --username myuser --table shops --split-by shop_id --num-mappers 16

The split-by column must be indexed or partitioned to avoid 16 full table scans
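
To see why the split column matters, consider how a parallel import divides the work. Sqoop finds the minimum and maximum of the split column and gives each mapper one slice of that range, so each mapper runs its own range query. The sketch below is a simplified picture of that idea, not Sqoop's actual splitter code; the function name and the 1 to 1600 key range are hypothetical.

```python
def split_ranges(column, lo, hi, num_mappers):
    """Divide [lo, hi] on an integer column into one WHERE clause per mapper."""
    step = (hi - lo + 1) / num_mappers
    # Lower bound of each slice, plus an exclusive upper bound at the end.
    bounds = [lo + round(i * step) for i in range(num_mappers)] + [hi + 1]
    return [
        f"{column} >= {bounds[i]} AND {column} < {bounds[i + 1]}"
        for i in range(num_mappers)
    ]

for clause in split_ranges("shop_id", 1, 1600, 16):
    print(clause)
```

Each of the 16 clauses becomes a separate query against the source table. Without an index or partitioning on shop_id, the database has to scan the whole table once per query.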

Page 43: Gorbachev Hadoop R

Less Flexible Export

• 100 row batch inserts

• Commit every 100 batches

• Parallel export

• Merge vs. Insert

Example:

sqoop export --connect jdbc:mysql://db.example.com/foo --table bar --export-dir /results/bar_data
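
The batching figures on this slide translate into round trips and commits as follows. This is a back-of-the-envelope sketch; the function name is hypothetical, and it treats the export as a single stream, whereas a parallel export would apply the same arithmetic to each task.

```python
import math

def export_plan(rows, rows_per_batch=100, batches_per_commit=100):
    """Return (insert batches, commits) needed to export the given number of rows."""
    batches = math.ceil(rows / rows_per_batch)
    commits = math.ceil(batches / batches_per_commit)
    return batches, commits

print(export_plan(1_000_000))  # (10000, 100)
```

With the slide's defaults, a commit happens every 10,000 rows, so an export that fails partway through can leave already committed rows in the target table.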

Page 44: Gorbachev Hadoop R

FUSE-DFS

•  Mount HDFS on Oracle server:

•  sudo yum install hadoop-0.20-fuse

•  hadoop-fuse-dfs dfs://<name_node_hostname>:<namenode_port> <mount_point>

•  Use external tables to load data into Oracle

•  File Formats may vary

•  All ETL best practices apply

Page 45: Gorbachev Hadoop R

Oracle Loader for Hadoop

•  Load data from Hadoop into Oracle

•  Map-Reduce job inside Hadoop

•  Converts data types, partitions and sorts

•  Direct path loads

•  Reduces CPU utilization on database

•  NEW:

•  Support for Avro

•  Support for compression codecs

Page 46: Gorbachev Hadoop R

Oracle Direct Connector to HDFS

•  Create external tables of files in HDFS

•  PREPROCESSOR HDFS_BIN_PATH:hdfs_stream

•  All the features of External Tables

•  Tested (by Oracle) as 5 times faster (GB/s) than FUSE-DFS

Page 47: Gorbachev Hadoop R

Oracle SQL Connector for HDFS

•  Map-Reduce Java program

•  Creates an external table

•  Can use Hive Metastore for schema

•  Optimized for parallel queries

•  Supports Avro and compression

Page 48: Gorbachev Hadoop R

How not to Fail

Page 49: Gorbachev Hadoop R

Data That Belong in RDBMS

Page 50: Gorbachev Hadoop R

Prepare for Migration

Page 51: Gorbachev Hadoop R

Use Hadoop Efficiently

•  Understand your bottlenecks:

•  CPU, storage or network?

•  Reduce use of temporary data:

•  All data is over the network

•  Written to disk in triplicate.

•  Eliminate unbalanced workloads

•  Offload work to RDBMS

•  Fine-tune optimization with Map-Reduce
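
One way to spot an unbalanced workload is to count how many records each reducer would receive for a given set of keys. The sketch below uses a CRC32 hash as a stand-in for the framework's partitioner, and the key distribution is made up; both are assumptions for illustration.

```python
import zlib
from collections import Counter

def reducer_load(keys, num_reducers):
    """Count records per reducer under simple hash partitioning."""
    load = Counter({r: 0 for r in range(num_reducers)})
    for key in keys:
        # CRC32 stands in for the real partitioner's hash function.
        load[zlib.crc32(key.encode()) % num_reducers] += 1
    return load

def skew_ratio(load):
    """Busiest reducer relative to the mean; 1.0 is perfectly balanced."""
    mean = sum(load.values()) / len(load)
    return max(load.values()) / mean

# Hypothetical skewed data: one key dominates.
keys = ["US"] * 900 + ["CA"] * 50 + ["MX"] * 50
load = reducer_load(keys, num_reducers=4)
print(dict(load), round(skew_ratio(load), 2))
```

All records with the same key land on the same reducer, so a dominant key makes one reducer the bottleneck while the others sit idle. A ratio well above 1.0 suggests repartitioning or pre-aggregating the hot key.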

Page 52: Gorbachev Hadoop R

Your Data 

is NOT 

as BIG as you think

Page 53: Gorbachev Hadoop R

Getting started

•  Pick a business problem

•  Acquire data

•  Get the tools: Hadoop, R, Hive, Pig, Tableau

•  Get platform: can start cheap

•  Analyze data

•  Need Data Analysts a.k.a. Data Scientists

•  Pick an operational problem

•  Data store

•  ETL

•  Get the tools: Hadoop, Sqoop, Hive, Pig, Oracle Connectors

•  Get platform: Ops suitable

•  Operational team

Page 54: Gorbachev Hadoop R

Continue Your Education

Communities, Knowledge Sharing, Education

www.collaborate13.ioug.org

Page 55: Gorbachev Hadoop R

 Thank you & Q&A

To contact us…

1-866-PYTHIAN

[email protected]

To follow us…

http://www.pythian.com/news/

http://www.facebook.com/pages/The-Pythian-Group/

http://twitter.com/pythian

http://www.linkedin.com/company/pythian