Hadoop & Greenplum: Why Do Such a Thing?

© Copyright 2012 EMC Corporation. All rights reserved.

Greenplum & Hadoop

Why do such a thing?

Donald Miner
Solutions Architect
Advanced Technologies


QUICK INTRODUCTION TO GREENPLUM DATABASE


Greenplum Database Basics

Massively Parallel Processing (MPP) Database

Uses commodity hardware

Data is distributed by a user-defined “distribution key”

Master node delegates queries to segments

1:1 segment and master mirroring for redundancy

[Diagram: two master nodes and four segments]
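To illustrate how a distribution key spreads rows across segments, here is a minimal Python sketch. The hash function (CRC32), the segment count, and the sample rows are assumptions made for illustration; Greenplum's actual hashing is not described in these slides.

```python
import zlib
from collections import defaultdict

NUM_SEGMENTS = 4  # assumption: matches the four segments in the diagram


def segment_for(key, num_segments=NUM_SEGMENTS):
    """Map a distribution-key value to a segment index.

    CRC32 is an illustrative stand-in, not Greenplum's real hash.
    """
    return zlib.crc32(str(key).encode("utf-8")) % num_segments


def distribute(rows, key_column, num_segments=NUM_SEGMENTS):
    """Group rows by the segment their distribution-key value hashes to."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_for(row[key_column], num_segments)].append(row)
    return segments


# Hypothetical rows: a high-cardinality key and a low-cardinality key.
rows = [{"customer_id": i, "state": "FL" if i % 2 else "GA"} for i in range(1000)]

by_id = distribute(rows, "customer_id")  # many distinct keys
by_state = distribute(rows, "state")     # only two distinct keys

for name, placement in (("customer_id", by_id), ("state", by_state)):
    print(name, {seg: len(group) for seg, group in sorted(placement.items())})
```

A key with only two distinct values can occupy at most two segments, which is why the choice of distribution key matters for parallelism.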



Greenplum Database Features

Full SQL support based on PostgreSQL 8.2

Columnar or row-oriented storage with compression

Multi-level table partitioning with query time partition pruning

B-tree and bitmap indexes

JDBC, ODBC, OLEDB, etc. interfaces

High speed, parallel bulk ingest

Parallel query optimizer

External tables



MADlib Analytics with Greenplum

Scalable and in-database

Mathematical, statistical, machine learning

Active open source project

> SELECT householdID, variables
    FROM households
    ORDER BY RANDOM()
    LIMIT 100000;

> SELECT * FROM run_univariate_analysis(
      'households_training', 'variables')
    WHERE pvalue < .01 AND r2 > .01;

> SELECT run_regression(
      'univariate_results', 'households_training');

> SELECT householdID,
      madlib.array_dot(coef::REAL[], xmatrix::REAL[])
    FROM coefficients, households;
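The last query scores each household by taking the dot product of the fitted coefficients with that household's feature vector. A minimal Python sketch of that step follows, assuming array_dot is an ordinary dot product; the coefficient values and household vectors are made up for illustration.

```python
def array_dot(a, b):
    """Ordinary dot product of two equal-length numeric sequences."""
    if len(a) != len(b):
        raise ValueError("arrays must have the same length")
    return sum(x * y for x, y in zip(a, b))


# Hypothetical fitted coefficients (a single row, as in the query above).
coef = [0.5, -1.0, 2.0]

# Hypothetical household feature vectors keyed by householdID.
households = {
    101: [1.0, 3.0, 2.0],
    102: [1.0, 0.0, 0.25],
}

# One score per household: coef . xmatrix
scores = {hid: array_dot(coef, x) for hid, x in households.items()}
print(scores)
```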



MADlib In-Database Analytical Functions

Descriptive Statistics:
Quantile
Profile
CountMin (Cormode-Muthukrishnan) Sketch-based Estimator
FM (Flajolet-Martin) Sketch-based Estimator
MFV (Most Frequent Values) Sketch-based Estimator
Frequency
Histogram
Bar Chart
Box Plot Chart

Modeling:
Correlation Matrix
Association Rule Mining
K-Means Clustering
Naïve Bayes Classification
Linear Regression
Logistic Regression
Support Vector Machines
SVD Matrix Factorisation
Decision Trees/CART
Latent Dirichlet Allocation Topic Modeling



PostGIS Support in Greenplum DB

PostGIS adds support for geographic objects in PostgreSQL

Example: find all records within 25 miles of hurricane path

select customer_id, ST_AsText(lat_lon), phone_num
from clients
where ST_DWithin(lat_lon,
    ST_GeometryFromText('LINESTRING(-79.3 17, -79.3 17.1, -79.3 17.3,
        -79.7 17.6, -79.6 17.4, -79.6 16.8, -79.9 15.8, -80.2 15.8,
        -80 15.7, -80 15.7, -80.2 15.9, -80.6 16.5, -81.1 16.7,
        -81.8 16.7, -82.1 16.8, -82.5 17.2, -83.9 17.9, -85.2 18.3,
        -85.5 18.4)', 4326),
    25.0/3959.0 * 180.0/PI())

 customer_id |          st_astext          | phone_num
-------------+-----------------------------+------------
      493140 | POINT(-80.040397 26.570613) | 1231231234
      192401 | POINT(-81.820933 26.242611) | 2342342345

http://postgis.refractions.net/
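The last argument to ST_DWithin, 25.0/3959.0 * 180.0/PI(), converts 25 miles into degrees, because geometries in SRID 4326 are measured in degrees. A short Python sketch of the same arithmetic follows; it assumes 3959 is the Earth's mean radius in miles, as the query appears to.

```python
import math

EARTH_RADIUS_MILES = 3959.0  # assumption: mean Earth radius used in the query


def miles_to_degrees(miles, radius_miles=EARTH_RADIUS_MILES):
    """Convert a surface distance to the angle it subtends, in degrees.

    miles / radius gives the angle in radians; math.degrees applies
    the 180/pi factor.
    """
    return math.degrees(miles / radius_miles)


tolerance = miles_to_degrees(25.0)
print(round(tolerance, 4))
```

This is an approximation: one degree of longitude covers less ground away from the equator, so a fixed tolerance in degrees is not exactly 25 miles in every direction.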



Solr integration with GPDB

Solr is an open source enterprise search engine

Enable in-database text indexing and search

select t.id, q.score, t.message_text
from message t,
     gptext.search('twitter.public.message',
                   '(iphone and (hate or love))',
                   'author_lang:en',
                   100) q
where t.id = q.id
order by score desc;

    id     |      score       | message_text
-----------+------------------+-------------------------------------------
  71552856 | 5.43078422546387 | Hates BB's Love IPhones!
  91373993 | 4.06371879577637 | Its a love hate relationship with iPhone spellcheck
  25444233 | 4.05911064147949 | #iPhone autocorrect is a love/hate relationship...
 120166038 | 3.39410924911499 | Love the new iPhone 4s, hate @ATT service #Verizonhereicome
 117498183 | 3.39181470870972 | I got a love-hate relationship for my iPhone!!!
  86416378 | 3.39180779457092 | Absolutely love the new iPhone, but Siri seems to hate me..



GREENPLUM HADOOP


Greenplum “HD”

• Bundled open source

• HDFS, MapReduce, Hive, Pig, HBase, ZooKeeper, Mahout


Greenplum “MR”

• Bundled MapR, a commercial version of Hadoop

• API compatible with traditional Hadoop

• MapR improvements over Hadoop:

– Improved control system
– Major portions of HDFS re-implemented in C++
– HDFS is NFS mountable
– Improved shuffle and sort
– Distributed NameNode
– Supports a large number of files
– Mirroring, snapshot capability


Why do such a thing? Greenplum DB

[Diagram: a data spectrum labeled STRUCTURED, SEMISTRUCTURED, UNSTRUCTURED, with Greenplum DB capabilities placed along it: SQL, RDBMS, Tables and Schemas, GP MapReduce, Indexing, Partitioning, Text objects, GP Solr/Lucene, MADlib, PostGIS]


Why do such a thing? Hadoop

[Diagram: Hadoop capabilities: Hive, MapReduce, Pig, XML, JSON, …, Flat files, Schema on load]


Why do such a thing? HBase

[Diagram: HBase capabilities: Hive, MapReduce, Pig, HBase Tables, Row keys, Flexible schema]


Why do such a thing? Hybrid architecture with all three (or two…)

[Diagram: the three previous diagrams combined. HBase: HBase Tables, Row keys, Flexible schema. Greenplum DB: SQL, RDBMS, Tables and Schemas, GP MapReduce, Indexing, Partitioning, Text objects, GP Solr/Lucene, MADlib, PostGIS. Hadoop: Hive, MapReduce, Pig, XML, JSON, …, Flat files, Schema on load]


Greenplum Unified Analytics Platform


Hadoop External Tables in GPDB

External tables bring external data into the database.

Native support for HDFS with parallelized loading.

Can write to HDFS or read from HDFS.

> SELECT COUNT(*)
    FROM hdfs_document_feature h, gpdb_words g
    WHERE h.term = g.word;

> CREATE EXTERNAL TABLE hdfs_document_feature (
      docid integer, term text, freq integer)
    LOCATION ('gphdfs://namenode:9000/user/don/docs/part-*')
    FORMAT 'text' (delimiter '|');

> INSERT INTO hdfs_export SELECT * FROM gpdb_source;
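The external table definition implies that each line of the part-* files holds three pipe-delimited fields: docid, term, freq. A small Python sketch of that layout, and of what the COUNT(*) join computes, is shown here. The sample lines and word list are hypothetical stand-ins for the HDFS files and the gpdb_words table.

```python
from collections import Counter


def parse_line(line, delimiter="|"):
    """Parse one 'docid|term|freq' line into typed fields."""
    docid, term, freq = line.rstrip("\n").split(delimiter)
    return int(docid), term, int(freq)


# Hypothetical contents of the HDFS part files.
hdfs_lines = [
    "1|hadoop|3",
    "1|greenplum|2",
    "2|hbase|5",
    "2|hadoop|1",
]

# Hypothetical contents of the gpdb_words table (column: word).
gpdb_words = ["hadoop", "greenplum"]

features = [parse_line(line) for line in hdfs_lines]

# COUNT(*) over the join h.term = g.word: each feature row contributes
# once per matching word row.
word_counts = Counter(gpdb_words)
match_count = sum(word_counts[term] for _, term, _ in features)
print(match_count)
```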



Covers many of the same use cases as an HBase/Hadoop environment

Use Hadoop as a data groomer

Do rollups in Hadoop and store results in GPDB

Use the best tool for the job (structured vs. unstructured)

Use GPDB to host data sets in a more real-time layer for ad-hoc analytics


EMC Isilon

Hardware appliance for scale-out network-attached storage (NAS)

Stripes data across all nodes

Uses InfiniBand for intra-cluster communication

Up to 15.5PB total storage

3 different hardware configurations to handle different workloads

Uses “OneFS”, Isilon’s operating system and file system

Interfaces with iSCSI, NFS, CIFS, HTTP, HDFS, and a few more.


Isilon HDFS interface

Isilon is able to “pretend” to be an HDFS cluster: it mimics the NameNode and DataNode protocols to host data.

Underlying system is OneFS and does not follow the traditional HDFS scheme.

Point HDFS clients (MapReduce, command line, etc.) to any IP in the Isilon cluster.


Pros & Cons

Isilon is more dense

Isilon can be mounted via a number of protocols

– Easier ingest / egress
– Raw data accessible by applications

Isilon is easy to manage

Free of certain HDFS limitations

Isilon loses data locality (~250MB/sec throughput per node over network)
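To put the ~250MB/sec per-node figure in context, here is a rough back-of-the-envelope calculation in Python. It assumes throughput scales linearly with node count, that the network is the only bottleneck, and that MB and TB are decimal units; the 10-node cluster and 1 TB data set are hypothetical.

```python
MB = 1_000_000          # decimal megabyte (assumption)
TB = 1_000_000_000_000  # decimal terabyte (assumption)


def scan_seconds(dataset_bytes, nodes, mb_per_sec_per_node=250.0):
    """Estimate the time to read a data set over the network.

    Assumes perfectly linear scaling across nodes, which real
    clusters will only approximate.
    """
    aggregate_bytes_per_sec = nodes * mb_per_sec_per_node * MB
    return dataset_bytes / aggregate_bytes_per_sec


seconds = scan_seconds(1 * TB, nodes=10)  # hypothetical 10-node cluster
print(f"{seconds:.0f} s to scan 1 TB")
```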


Why do such a thing?

Hadoop backup or archive
– More dense than HDFS, more accessible than tape, no need for compute

Complete HDFS replacement
– More dense, more accessible, utilize existing Isilon, slower per terabyte of storage

Hot/warm storage
– Use HDFS as primary, but Isilon as secondary

Storage for original content
– Use MapReduce to extract metadata from original content, and leave original content in place


HBase External Tables in GPDB

Project in development

Load data in parallel from HBase by specifying table name and column qualifiers

> SELECT COUNT(*)
    FROM hbase_document_feature h, gpdb_words g
    WHERE h.term = g.word;

> CREATE EXTERNAL TABLE hbase_document_feature (
      "HBASEROWKEY" text, "term" text, "freq" integer)
    LOCATION ('gphbase://docfeatures')
    FORMAT 'CUSTOM' (formatter='gpdbwriteable_import');


HBase External Tables in GPDB

Possible TODO list:

Specify range of rowkeys

Support writes into HBase

Specify filter criteria on the external table

select * from hbase_external where ROWKEY='abc'

Accumulo?



Have HBase store semi-structured data

Exploit the strengths of each

Use HBase for really really wide tables

Use HBase as a scalable archive of raw records

Leverage existing HBase applications


Greenplum On HDFS

Get Greenplum Database to run natively on HDFS

Underlying Greenplum Database data is stored in HDFS

Unifies the two platforms further – no need for external tables

Fully supports Greenplum’s append-only tables

Early project in R&D

Talk will be given by Chang Lei at Yahoo Summit


Greenplum On HDFS

[Diagram: a master host (meta ops) and several segment hosts, each running primary and mirror segments and linked by the interconnect, read and write tables in an HDFS filespace served by a Namenode and Datanodes, with replication across Rack1 and Rack2]



Covers many of the same use cases as Hive

Run Hadoop MapReduce over data managed by Greenplum DB

Initial results show it is faster than Hive

You only have to store your data in one system
