Hadoop & Greenplum: Why Do Such a Thing?

Nov 04, 2014

Ed Kohlwey

Greenplum is using Hadoop in several interesting ways as part of a larger big data architecture with EMC Greenplum Database (a scale-out MPP SQL database) and EMC Isilon (a scale-out network-attached storage appliance). After a quick introduction of Greenplum Database and Isilon, I list some ways Greenplum is tightly integrating with Hadoop and why we would want to do such a thing. Integration points discussed include: Greenplum Database external tables to seamlessly access data in HDFS, querying HBase tables natively from Greenplum Database, Greenplum Database having its underlying storage on HDFS, and Isilon OneFS as a seamless replacement for HDFS.
Transcript
Page 1: Hadoop & Greenplum: Why Do Such a Thing?

© Copyright 2012 EMC Corporation. All rights reserved.

Greenplum & Hadoop

Why do such a thing?

Donald Miner, Solutions Architect, Advanced Technologies Group

Page 2: Hadoop & Greenplum: Why Do Such a Thing?

QUICK INTRODUCTION TO GREENPLUM DATABASE

Page 3: Hadoop & Greenplum: Why Do Such a Thing?

Greenplum Database Basics

Massively Parallel Processing (MPP) Database

Uses commodity hardware

Data is distributed by a user-defined “distribution key”

Master node delegates queries to segments

1:1 segment and master mirroring for redundancy

[Figure: architecture diagram showing a Master, four Segments, and a second Master]
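The distribution-key idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hash function (`zlib.crc32`), the segment count of four, and the names `segment_for`, `distribute`, and `master_count` are hypothetical and are not Greenplum's actual implementation.

```python
import zlib
from collections import defaultdict

NUM_SEGMENTS = 4  # assumption: four segments, as drawn on the slide


def segment_for(key: str, num_segments: int = NUM_SEGMENTS) -> int:
    """Map a distribution-key value to a segment index.

    crc32 is a stand-in hash chosen for the sketch; it is not the
    hash function Greenplum Database uses internally.
    """
    return zlib.crc32(key.encode("utf-8")) % num_segments


def distribute(rows, key_column, num_segments=NUM_SEGMENTS):
    """Place each row on the segment chosen by its distribution key."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_for(str(row[key_column]), num_segments)].append(row)
    return segments


def master_count(segments):
    """Scatter/gather: each segment counts its own rows, the master sums."""
    partial_counts = [len(rows) for rows in segments.values()]
    return sum(partial_counts)


# Hypothetical table of 100 households keyed by household_id.
rows = [{"household_id": i, "income": 1000 * i} for i in range(100)]
segments = distribute(rows, "household_id")
total = master_count(segments)
```

Because the same key always hashes to the same segment, rows that share a distribution-key value are stored together, and the master only has to combine per-segment partial results.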

Page 4: Hadoop & Greenplum: Why Do Such a Thing?

Greenplum Database Features

Full SQL support based on PostgreSQL 8.2

Columnar or row-oriented storage with compression

Multi-level table partitioning with query time partition pruning

B-tree and bitmap indexes

JDBC, ODBC, OLEDB, etc. interfaces

High speed, parallel bulk ingest

Parallel query optimizer

External tables

Page 5: Hadoop & Greenplum: Why Do Such a Thing?

MADlib Analytics with Greenplum

Scalable and in-database

Mathematical, statistical, machine learning

Active open source project

> SELECT householdID, variables FROM households ORDER BY RANDOM() LIMIT 100000;
> SELECT run_univariate_analysis('households_training', 'variables'); WHERE pvalue<.01 AND r2>.01;
> SELECT run_regression('univariate_results', 'households_training');
> SELECT householdID, madlib.array_dot(coef::REAL[], xmatrix::REAL[]) FROM coefficients, households;

Page 6: Hadoop & Greenplum: Why Do Such a Thing?

MADlib In-Database Analytical Functions

Descriptive Statistics
• Quantile
• Profile
• CountMin (Cormode-Muthukrishnan) Sketch-based Estimator
• FM (Flajolet-Martin) Sketch-based Estimator
• MFV (Most Frequent Values) Sketch-based Estimator
• Frequency
• Histogram
• Bar Chart
• Box Plot Chart

Modeling
• Correlation Matrix
• Association Rule Mining
• K-Means Clustering
• Naïve Bayes Classification
• Linear Regression
• Logistic Regression
• Support Vector Machines
• SVD Matrix Factorisation
• Decision Trees/CART
• Latent Dirichlet Allocation Topic Modeling

Page 7: Hadoop & Greenplum: Why Do Such a Thing?

PostGIS Support in Greenplum DB

PostGIS adds support for geographic objects in PostgreSQL

Example: find all records within 25 miles of hurricane path

select customer_id, ST_AsText(lat_lon), phone_num
from clients
where ST_DWithin(lat_lon,
  ST_GeometryFromText('LINESTRING(-79.3 17, -79.3 17.1, -79.3 17.3, -79.7 17.6, -79.6 17.4, -79.6 16.8, -79.9 15.8, -80.2 15.8, -80 15.7, -80 15.7, -80.2 15.9, -80.6 16.5, -81.1 16.7, -81.8 16.7, -82.1 16.8, -82.5 17.2, -83.9 17.9, -85.2 18.3, -85.5 18.4)', 4326),
  25.0/3959.0 * 180.0/PI());

 customer_id | st_astext                   | phone_num
-------------+-----------------------------+------------
 493140      | POINT(-80.040397 26.570613) | 1231231234
 192401      | POINT(-81.820933 26.242611) | 2342342345

http://postgis.refractions.net/
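The distance argument in the query, `25.0/3959.0 * 180.0/PI()`, turns 25 miles into degrees because the geometry uses SRID 4326, whose units are degrees. The arithmetic can be checked with a short Python sketch; the function name `miles_to_degrees` is hypothetical, and 3959 is the Earth radius in miles taken from the slide.

```python
import math

EARTH_RADIUS_MILES = 3959.0  # mean Earth radius, as used in the slide's query


def miles_to_degrees(miles: float) -> float:
    """Convert a great-circle distance in miles to degrees of arc."""
    radians = miles / EARTH_RADIUS_MILES  # arc length / radius = angle in radians
    return math.degrees(radians)          # same as radians * 180 / pi


search_radius_deg = miles_to_degrees(25.0)
print(f"25 miles is about {search_radius_deg:.2f} degrees of arc")
```

The result is roughly 0.36 degrees. Treating that as a planar distance is an approximation: a degree of longitude covers less ground as latitude increases, so the effective search radius is not exactly 25 miles in every direction.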

Page 8: Hadoop & Greenplum: Why Do Such a Thing?

Solr integration with GPDB

Solr is an open source enterprise search engine

Enable in-database text indexing and search

select t.id, q.score, t.message_text
from message t,
     gptext.search('twitter.public.message', '(iphone and (hate or love))', 'author_lang:en', 100) q
where t.id = q.id
order by score desc;

 id        | score            | message_text
-----------+------------------+-------------------------------------------
 71552856  | 5.43078422546387 | Hates BB's Love IPhones!
 91373993  | 4.06371879577637 | Its a love hate relationship with iPhone spellcheck
 25444233  | 4.05911064147949 | #iPhone autocorrect is a love/hate relationship...
 120166038 | 3.39410924911499 | Love the new iPhone 4s, hate @ATT service #Verizonhereicome
 117498183 | 3.39181470870972 | I got a love-hate relationship for my iPhone!!!
 86416378  | 3.39180779457092 | Absolutely love the new iPhone, but Siri seems to hate me..

Page 9: Hadoop & Greenplum: Why Do Such a Thing?

GREENPLUM HADOOP

Page 10: Hadoop & Greenplum: Why Do Such a Thing?

Greenplum “HD”

• Bundled open source

• HDFS, MapReduce, Hive, Pig, HBase, ZooKeeper, Mahout

Page 11: Hadoop & Greenplum: Why Do Such a Thing?

Greenplum “MR”

• Bundled MapR, a commercial version of Hadoop
• API compatible with traditional Hadoop
• MapR improvements over Hadoop:
– Improved control system
– Major portions of HDFS re-implemented in C++
– HDFS is NFS mountable
– Improved shuffle and sort
– Distributed NameNode
– Supports large number of files
– Mirroring, snapshot capability

Page 12: Hadoop & Greenplum: Why Do Such a Thing?

Why do such a thing? Greenplum DB

[Figure: Greenplum DB capabilities placed along a STRUCTURED / SEMISTRUCTURED / UNSTRUCTURED spectrum. Labels: SQL, RDBMS, Tables and Schemas, GPMapReduce, Indexing, Partitioning, Text objects, GP Solr/Lucene, MADlib, PostGIS]

Page 13: Hadoop & Greenplum: Why Do Such a Thing?

Why do such a thing? Hadoop

[Figure: Hadoop capabilities placed along a STRUCTURED / SEMISTRUCTURED / UNSTRUCTURED spectrum. Labels: Hive, MapReduce, Pig, XML, JSON, …, Flat files, Schema on load]

Page 14: Hadoop & Greenplum: Why Do Such a Thing?

Why do such a thing? HBase

[Figure: HBase capabilities placed along a STRUCTURED / SEMISTRUCTURED / UNSTRUCTURED spectrum. Labels: Hive, MapReduce, Pig, HBase Tables, Row keys, Flexible schema]

Page 15: Hadoop & Greenplum: Why Do Such a Thing?

Why do such a thing? Hybrid architecture with all three (or two…)

[Figure: Greenplum DB, Hadoop, and HBase capabilities combined along a STRUCTURED / SEMISTRUCTURED / UNSTRUCTURED spectrum. Labels: HBase Tables, Row keys, Flexible schema, SQL, RDBMS, Tables and Schemas, GPMapReduce, Indexing, Partitioning, Text objects, Hive, MapReduce, Pig, XML, JSON, …, Flat files, Schema on load, GP Solr/Lucene, MADlib, PostGIS]

Page 16: Hadoop & Greenplum: Why Do Such a Thing?

Greenplum Unified Analytics Platform

Page 17: Hadoop & Greenplum: Why Do Such a Thing?

Hadoop External Tables in GPDB

External tables bring external data into the database.

Native support for HDFS with parallelized loading.

Can write to HDFS or read from HDFS.

> CREATE EXTERNAL TABLE hdfs_document_feature (docid integer, term text, freq integer) LOCATION ('gphdfs://namenode:9000/user/don/docs/part-*') FORMAT 'text' (delimiter '|');

> SELECT COUNT(*) FROM hdfs_document_feature h, gpdb_words g WHERE h.term = g.word;

> INSERT INTO hdfs_export SELECT * FROM gpdb_source;
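To make the external table definition concrete, the sketch below shows in Python what one of the `part-*` files might contain and what the join count computes. It is an illustration only: the sample rows, the word list, and the names `DocumentFeature`, `parse_feature_lines`, and `count_matches` are hypothetical, and the real work is done in parallel by the database segments rather than by a single process.

```python
from collections import Counter
from typing import Iterable, Iterator, NamedTuple


class DocumentFeature(NamedTuple):
    """One row of hdfs_document_feature: (docid integer, term text, freq integer)."""
    docid: int
    term: str
    freq: int


def parse_feature_lines(lines: Iterable[str], delimiter: str = "|") -> Iterator[DocumentFeature]:
    """Parse '|'-delimited text rows, matching FORMAT 'text' (delimiter '|')."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        docid, term, freq = line.split(delimiter)
        yield DocumentFeature(int(docid), term, int(freq))


def count_matches(features: Iterable[DocumentFeature], words: Iterable[str]) -> int:
    """Inner-join row count for the predicate h.term = g.word."""
    word_counts = Counter(words)  # duplicates in words multiply matches, as in SQL
    return sum(word_counts[feature.term] for feature in features)


# Hypothetical contents of one HDFS part file and of the gpdb_words table.
part_file = ["1|hadoop|3\n", "1|greenplum|5\n", "2|hadoop|1\n", "2|isilon|2\n"]
gpdb_words = ["hadoop", "isilon"]

matches = count_matches(parse_feature_lines(part_file), gpdb_words)
```

With this sample data, three feature rows have a term that appears in the word list, which is the value the `SELECT COUNT(*)` join would return.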

Page 18: Hadoop & Greenplum: Why Do Such a Thing?

Why do such a thing?

Many of the same use cases as an HBase/Hadoop environment

Use Hadoop as a data groomer

Do rollups in Hadoop and store results in GPDB

Use the best tool for the job (structured vs. unstructured)

Use GPDB to host data sets in a more real-time layer for ad-hoc analytics

Page 19: Hadoop & Greenplum: Why Do Such a Thing?

EMC Isilon

Hardware appliance for scale-out network-attached storage (NAS)

Stripes data across all nodes

Uses InfiniBand for intra-cluster communication

Up to 15.5PB total storage

3 different hardware configurations to handle different workloads

Uses “OneFS”, Isilon’s operating system and file system

Interfaces with iSCSI, NFS, CIFS, HTTP, HDFS, and a few more.

Page 20: Hadoop & Greenplum: Why Do Such a Thing?

Isilon HDFS interface

Isilon is able to “pretend” to be an HDFS cluster: it mimics the NameNode and DataNode protocols to host data.

Underlying system is OneFS and does not follow the traditional HDFS scheme.

Point HDFS clients (MapReduce, command line, etc.) to any IP in the Isilon cluster.

Page 21: Hadoop & Greenplum: Why Do Such a Thing?

Pros & Cons

Isilon is more dense

Isilon can be mounted via a number of protocols

– Easier ingest / egress
– Raw data accessible by applications

Isilon is easy to manage

Free of certain HDFS limitations

Isilon loses data locality (~250MB/sec throughput per node over network)
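A back-of-the-envelope calculation shows what the ~250MB/sec figure implies for a full scan over the network. The sketch assumes throughput scales linearly with node count and uses decimal units; the 10-node cluster and 10 TB data set are hypothetical values chosen for the example, and `scan_seconds` is an illustrative name.

```python
PER_NODE_MB_PER_SEC = 250.0  # approximate per-node network throughput from the slide


def scan_seconds(data_tb: float, nodes: int, per_node: float = PER_NODE_MB_PER_SEC) -> float:
    """Estimate the time to read data_tb terabytes once.

    Assumes aggregate throughput is nodes * per_node (linear scaling),
    which ignores contention and protocol overhead.
    """
    data_mb = data_tb * 1_000_000  # 1 TB = 1,000,000 MB (decimal units)
    return data_mb / (nodes * per_node)


# Hypothetical example: 10 TB read through 10 nodes.
seconds = scan_seconds(data_tb=10, nodes=10)
minutes = seconds / 60
```

Under these assumptions the scan takes 4000 seconds, a little over an hour, which is the cost to weigh against the density and manageability points above.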

Page 22: Hadoop & Greenplum: Why Do Such a Thing?

Why do such a thing?

Hadoop backup or archive
– More dense than HDFS, more accessible than tape, no need for compute

Complete HDFS replacement
– More dense, more accessible, utilize existing Isilon, slower per terabyte of storage

Hot/warm storage
– Use HDFS as primary, but Isilon as secondary

Storage for original content
– Use MapReduce to extract metadata from original content, and leave original content in place

Page 23: Hadoop & Greenplum: Why Do Such a Thing?

HBase External Tables in GPDB

Project in development

Load data in parallel from HBase by specifying table name and column qualifiers

> CREATE EXTERNAL TABLE hbase_document_feature ("HBASEROWKEY" text, "term" text, "freq" integer) LOCATION ('gphbase://docfeatures') FORMAT 'CUSTOM' (formatter='gpdbwriteable_import');

> SELECT COUNT(*) FROM hbase_document_feature h, gpdb_words g WHERE h.term = g.word;

Page 24: Hadoop & Greenplum: Why Do Such a Thing?

HBase External Tables in GPDB

Possible TODO list:

Specify range of rowkeys

Support writes into HBase

Specify filter criteria on the external table

select * from hbase_external where ROWKEY='abc'

Accumulo?

Page 25: Hadoop & Greenplum: Why Do Such a Thing?

Why do such a thing?

Have HBase store semi-structured data

Exploit the strengths of each

Use HBase for really really wide tables

Use HBase as a scalable archive of raw records

Leverage existing HBase applications

Page 26: Hadoop & Greenplum: Why Do Such a Thing?

Greenplum On HDFS

Get Greenplum Database to run natively off of HDFS

Underlying Greenplum Database data is stored in HDFS

Unifies the two platforms further – no need for external tables

Fully supports Greenplum’s append-only tables

Early project in R&D

Talk will be given by Chang Lei at Yahoo Summit

Page 27: Hadoop & Greenplum: Why Do Such a Thing?

Greenplum On HDFS

[Figure: Greenplum on HDFS architecture diagram. Labels: Master host, Segment hosts each with a Segment and a Segment (Mirror), Interconnect, Namenode, Datanodes in Rack1 and Rack2, replication, Read/Write, Meta Ops, Tables in HDFS filespace]

Page 28: Hadoop & Greenplum: Why Do Such a Thing?

Why do such a thing?

Covers many of the same use cases as Hive

Run Hadoop MapReduce over data managed by Greenplum DB

Initial results show it is faster than Hive

You only have to store your data in one system
