Data Federation & Extensible Framework
Goden Yao, Product Manager, HAWQ
Agenda
● History / Motivations / Goals
● Architecture / Interfaces / Design
● Contribute to Apache HAWQ/PXF
● Q&A
History - Timeline (1986-2015)
● Michael Stonebraker develops Postgres at UCB
● Postgres adds support for SQL
● Open-source PostgreSQL
● PostgreSQL 7.0 released
● PostgreSQL 8.0 released
● Greenplum forks PostgreSQL
● MADlib launched
● Hadoop 1.0 released
● HAWQ project launched
● Hadoop 2.0 released
● HAWQ + MADlib go open source (Apache)
Motivations: SQL on Hadoop
● How does an RDBMS reach the various formats and storages supported on HDFS?
● External Tables already cover other protocols: gpfdist://…, gphdfs://…, file://…, http://…
Goals: PXF
● Parallelism
● Extensible
● Query efficiency
● DO: federate the Hadoop ecosystem, querying data in place
● DON'T: materialize data
● DON'T: import/export
Design - Communication
● The PXF webapp runs in Apache Tomcat; it exposes a REST API (HTTP, port 51200) and calls plugins through a Java API
● HAWQ segments reach PXF over HTTP when scanning External Tables
● Native Tables bypass PXF: segments read HDFS directly via libhdfs3 (written in C)
Architecture - Deployment
HAWQMaster Node NN
pxf
HBase Master
DN4
pxf
HAWQseg4
DN1
pxf
HAWQseg1
HBase Region Server1
DN2
pxf
HAWQseg2
HBase Region Server2
DN3
pxf
HAWQseg3
HBase Region Server3
* PXF needs to be installed on all DN* PXF is recommended to be installed on NN
Design - Components (PXF)
● Fragmenter: get the locations of fragments for an external table
● Accessor: understand and read/write the fragment, return records
● Resolver: convert records to a HAWQ-consumable format (data types)
* Analyzer was also a component before open-sourcing; it provided statistics on files to the Query Optimizer. The recent HAWQ-44 and HAWQ-191 improved the way PXF gets statistics, making the Analyzer obsolete in newer versions (Apache HAWQ 2.0+).
Design - Define External Tables (SQL)

CREATE EXTERNAL TABLE ext_table <attr list, ...>
LOCATION('pxf://<namenode>:<port>/path/to/data?FRAGMENTER=package.name.FragmenterForX&ACCESSOR=package.name.AccessorForX&RESOLVER=package.name.ResolverForX&<Other custom user options>=<value>')
FORMAT 'custom' (formatter='pxfwritable_import');

CREATE EXTERNAL TABLE ext_table <attr list, ...>
LOCATION('pxf://<namenode>:<port>/path/to/data?PROFILE=hbase')
FORMAT 'custom' (formatter='pxfwritable_import');

* When defining a table, PXF doesn't check whether the file exists or is accessible, or whether the name and port are correct. It is just a DEFINITION.
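Once defined, the external table is queried like any other HAWQ table, e.g. over the PostgreSQL wire protocol. A minimal sketch in Java, assuming the stock PostgreSQL JDBC driver on the classpath; the URL, database, and credentials are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ExtTableQuery {
    public static void main(String[] args) throws Exception {
        // HAWQ speaks the PostgreSQL protocol, so the standard driver works.
        String url = "jdbc:postgresql://localhost:5432/postgres"; // placeholder
        try (Connection conn = DriverManager.getConnection(url, "gpadmin", "");
             Statement stmt = conn.createStatement();
             // PXF is only invoked when the external table is actually read.
             ResultSet rs = stmt.executeQuery("SELECT * FROM ext_table LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}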
Architecture - Data Flow: Query (HDFS)

select * from ext_table, defined with LOCATION 'pxf://<namenode>:<port>/path/to/data'

0. The query arrives at the HAWQ master
1. The master calls getFragments() on PXF (REST), served by the Fragmenter
2. PXF returns the fragments as JSON
3. The master gets the server list (REST)
4. PXF returns the servers
5. The master computes the split mapping (fragment -> segment)
6. The query is dispatched to segments 1, 2, 3, … (Interconnect)
7. Each segment calls Read() on its local PXF (REST)
8. The Accessor reads records from the fragment
9. The Resolver converts them, and the records are streamed back to the segment
10. The query result is returned to the client

3 differences from a HAWQ native query:
1. Get fragments from PXF
2. Assign fragments to segments
3. Get data from PXF
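For illustration, step 1 can be reproduced with a plain HTTP client. A minimal sketch, assuming a PXF webapp at namenode:51200; the /pxf/v14 path prefix and the X-GP-* header names are version-dependent assumptions, not a stable public contract, so check them against the PXF webapp sources:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class GetFragmentsByHand {
    public static void main(String[] args) throws Exception {
        // Assumed endpoint layout and header name (see note above).
        URL url = new URL("http://namenode:51200/pxf/v14/Fragmenter/getFragments"
                + "?path=/path/to/data");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("X-GP-PROFILE", "HdfsTextSimple"); // assumption
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // fragments as JSON (step 2)
            }
        }
    }
}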
Interface - Fragmenter: get the locations of fragments for an external table
● HDFS: splits (blocks, replicas)
● HBase: regions
● Hive: splits of the files stored in a table

package org.apache.hawq.pxf.api;

/**
 * Abstract class that defines the splitting of a data resource into fragments
 * that can be processed in parallel.
 */
public abstract class Fragmenter extends Plugin {
    protected List<Fragment> fragments;

    public Fragmenter(InputData metaData) {
        super(metaData);
        fragments = new LinkedList<Fragment>();
    }
    ...

    /**
     * Gets the fragments of a given path (source name and location of each
     * fragment). Used to get fragments of data that could be read in parallel
     * from the different segments.
     */
    public abstract List<Fragment> getFragments() throws Exception;
}
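As a concrete illustration, a toy Fragmenter might look like the sketch below. DemoFragmenter, the shard naming, and the three-fragment loop are all hypothetical; the Fragment(sourceName, hosts, metadata) constructor and the inherited inputData field reflect the API above but should be verified against the PXF sources:

package com.example.pxf; // hypothetical plugin package

import java.util.List;

import org.apache.hawq.pxf.api.Fragment;
import org.apache.hawq.pxf.api.Fragmenter;
import org.apache.hawq.pxf.api.utilities.InputData;

public class DemoFragmenter extends Fragmenter {

    public DemoFragmenter(InputData metaData) {
        super(metaData);
    }

    @Override
    public List<Fragment> getFragments() throws Exception {
        // Pretend the data source splits into three shards, each with a
        // preferred host so HAWQ can schedule the read locally.
        String source = inputData.getDataSource();
        for (int i = 1; i <= 3; i++) {
            fragments.add(new Fragment(
                    source + "/shard-" + i,   // source name of the fragment
                    new String[]{"dn" + i},   // replica hosts (locality hints)
                    new byte[0]));            // plugin-private metadata
        }
        return fragments;
    }
}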
Interface - Fragmenter: statistics

package org.apache.hawq.pxf.api;

/**
 * FragmentsStats holds statistics for a given path.
 */
public class FragmentsStats {
    // number of fragments
    private long fragmentsNumber;
    // first fragment size
    private SizeAndUnit firstFragmentSize;
    // total fragments size
    private SizeAndUnit totalSize;

    /**
     * Container for size and unit.
     */
    public class SizeAndUnit {
        long size;
        SizeUnit unit;
        ...
    }
}
Analyzer - obsolete (HAWQ-44, HAWQ-191)

Fragmenter.getFragmentsStats() returns a string in JSON format with statistics for the data source. For example, if the input path is an HDFS directory of 3 files, each of 1 block, the output will be:
1. the number of fragments (3)
2. the size of the first file
3. the size of all files in that directory

* Only implemented for HDFS. Hive (HAWQ-181) and HBase (HAWQ-182) implementations are still needed.
Fragments Distribution (source: hd_work_mgr.c): Parallelism + Locality

Example: four HBase fragments, each replicated on two region servers:
● HBase1, HBase2
● HBase1, HBase3
● HBase1, HBase2
● HBase1, HBase3

Cluster layout:
● DN1: pxf, HAWQ seg1
● DN2: pxf, HAWQ seg2, HBase Region Server 1
● DN3: pxf, HAWQ seg3, HBase Region Server 2
● DN4: pxf, HBase Region Server 3

Split mapping result (colors refer to the four fragments in the diagram):
● seg1: purple fragment on DN4
● seg2: yellow fragment on DN2 + red fragment on DN2
● seg3: green fragment on DN3
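The mapping itself lives in hd_work_mgr.c on the HAWQ side; the sketch below is not that code, just a hedged Java illustration of the underlying idea: greedily give each fragment to the least-loaded segment host, preferring hosts that hold one of the fragment's replicas.

import java.util.Arrays;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SplitMappingSketch {

    /** Assign each fragment (list of replica hosts) to a segment host. */
    static Map<Integer, String> assign(List<String[]> fragments,
                                       List<String> segmentHosts) {
        Map<String, Integer> load = new HashMap<>();
        for (String host : segmentHosts) {
            load.put(host, 0);
        }
        Map<Integer, String> mapping = new HashMap<>();
        for (int f = 0; f < fragments.size(); f++) {
            String best = null;
            // Prefer the least-loaded segment that holds a replica (locality).
            for (String replica : fragments.get(f)) {
                if (load.containsKey(replica)
                        && (best == null || load.get(replica) < load.get(best))) {
                    best = replica;
                }
            }
            // Fall back to the globally least-loaded segment (parallelism).
            if (best == null) {
                for (String host : segmentHosts) {
                    if (best == null || load.get(host) < load.get(best)) {
                        best = host;
                    }
                }
            }
            mapping.put(f, best);
            load.put(best, load.get(best) + 1);
        }
        return mapping;
    }

    public static void main(String[] args) {
        List<String[]> frags = new ArrayList<>();
        frags.add(new String[]{"DN2", "DN4"}); // replicas of one fragment
        frags.add(new String[]{"DN2", "DN3"});
        System.out.println(assign(frags, Arrays.asList("DN1", "DN2", "DN3")));
    }
}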
Interface - Accessor: understand and read/write the fragment, return records
● Read / Write Accessor
● org.apache.hawq.pxf.api.OneRow

package org.apache.hawq.pxf.api;

/*
 * An interface for writing data into a data store
 * (e.g., a sequence file on HDFS).
 * All classes that implement actual access to such data sources must
 * respect this interface.
 */
public interface WriteAccessor {
    boolean openForWrite() throws Exception;
    boolean writeNextObject(OneRow onerow) throws Exception;
    void closeForWrite() throws Exception;
}

/*
 * Internal interface that defines the access to data on the source
 * data store (e.g., a file on HDFS, a region of an HBase table, etc).
 * All classes that implement actual access to such data sources must
 * respect this interface.
 */
public interface ReadAccessor {
    boolean openForRead() throws Exception;
    OneRow readNextObject() throws Exception;
    void closeForRead() throws Exception;
}
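A toy ReadAccessor for illustration only: it synthesizes a few rows instead of reading a real store. DemoReadAccessor is hypothetical; the Plugin base class with its InputData constructor mirrors the Fragmenter shown earlier, and OneRow's (key, data) constructor is an assumption to check against the PXF sources:

package com.example.pxf; // hypothetical plugin package

import org.apache.hawq.pxf.api.OneRow;
import org.apache.hawq.pxf.api.ReadAccessor;
import org.apache.hawq.pxf.api.utilities.InputData;
import org.apache.hawq.pxf.api.utilities.Plugin;

public class DemoReadAccessor extends Plugin implements ReadAccessor {
    private int current;
    private static final int TOTAL = 3; // pretend the fragment holds 3 records

    public DemoReadAccessor(InputData input) {
        super(input);
    }

    @Override
    public boolean openForRead() throws Exception {
        current = 0; // a real accessor would open its fragment here
        return true;
    }

    @Override
    public OneRow readNextObject() throws Exception {
        if (current >= TOTAL) {
            return null; // null signals end-of-fragment to the bridge
        }
        current++;
        // key = position in the fragment, data = the raw record
        return new OneRow(current, "record-" + current);
    }

    @Override
    public void closeForRead() throws Exception {
        // release file handles / scanners here
    }
}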
Interface - Resolver: convert records to a HAWQ-consumable format (data types)
● OneRow <-> OneField
● Read/Write Resolver
● OneField: type/val

package org.apache.hawq.pxf.api;

/**
 * Interface that defines the serialization of data read from the DB
 * into a OneRow object. Every implementation of a serialization method
 * (e.g., Writable, Avro, ...) must implement this interface.
 */
public interface WriteResolver {
    OneRow setFields(List<OneField> record) throws Exception;
}

/*
 * Interface that defines the deserialization of one record brought from
 * the data Accessor. Every implementation of a deserialization method
 * (e.g., Writable, Avro, ...) must implement this interface.
 */
public interface ReadResolver {
    List<OneField> getFields(OneRow row) throws Exception;
}
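And a matching toy ReadResolver that turns each OneRow from the accessor above into typed fields. DemoReadResolver is hypothetical; OneField's (type, val) constructor, OneRow's getKey()/getData() accessors, and the DataType OID helpers reflect the PXF API but are worth double-checking:

package com.example.pxf; // hypothetical plugin package

import java.util.LinkedList;
import java.util.List;

import org.apache.hawq.pxf.api.OneField;
import org.apache.hawq.pxf.api.OneRow;
import org.apache.hawq.pxf.api.ReadResolver;
import org.apache.hawq.pxf.api.io.DataType;
import org.apache.hawq.pxf.api.utilities.InputData;
import org.apache.hawq.pxf.api.utilities.Plugin;

public class DemoReadResolver extends Plugin implements ReadResolver {

    public DemoReadResolver(InputData input) {
        super(input);
    }

    @Override
    public List<OneField> getFields(OneRow row) throws Exception {
        List<OneField> fields = new LinkedList<>();
        // Map the (key, data) pair to the external table's two columns:
        // an int id and a text payload.
        fields.add(new OneField(DataType.INTEGER.getOID(), row.getKey()));
        fields.add(new OneField(DataType.TEXT.getOID(), row.getData().toString()));
        return fields;
    }
}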
PXF Bridge, Stream and Format
● Serialize fields using BridgeOutputBuilder
● Stream data back to the HAWQ master
● Formats: csv, text, pxfwritable_import (READ) / pxfwritable_export (WRITE)

CREATE EXTERNAL TABLE ext_table <attr list, ...>
LOCATION('pxf://<namenode>:<port>/path/to/data?PROFILE=hbase')
FORMAT 'custom' (formatter='pxfwritable_import');
PXF Plugins, Profiles
• Built-in with HAWQ (Profiles)
• HDFS: HDFSTextSimple(R/W), HDFSTextMulti(RO), Avro(RO)
• Hive(RO): Hive, HiveRC, HiveText
• HBase(RO): HBase
• Community (https://bintray.com/big-data/maven/pxf-plugins/view)
• JSON HAWQ-178
• Cassandra
• Accumulo
• ...
PXF Filter Push Down
● Goal: performance, efficiency, less data over the wire
● Criteria: the SQL WHERE clause
  • a single expression, or a group of expressions combined only with AND
● Supported data types and operators
  • Types: text, int, smallint, bigint
  • Operators: EQ, NE, LT, GT, LE, GE, AND

select * from ext_sales where id > 500 and id < 1000

package org.apache.hawq.pxf.api;

/*
 * Interface a user of FilterParser should implement.
 * This is used to let the user build filter expressions in the manner she
 * sees fit.
 *
 * When an operator is parsed, this function is called to let the user decide
 * what to do with its operands.
 */
interface FilterBuilder {
    Object build(Operation operation, Object left, Object right) throws Exception;
}
Where the filter is applied:
● Fragmenter: reduce fragments by partitions
● Accessor: reduce returned data records
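A minimal FilterBuilder sketch: it just renders the parsed expression tree back into a string, which is enough to see what the parser hands the plugin. StringFilterBuilder is hypothetical; FilterBuilder is shown here as nested in FilterParser, and the ColumnIndex/Constant accessors are assumptions to verify against FilterParser in the PXF sources:

package com.example.pxf; // hypothetical plugin package

import org.apache.hawq.pxf.api.FilterParser;

public class StringFilterBuilder implements FilterParser.FilterBuilder {

    @Override
    public Object build(FilterParser.Operation operation,
                        Object left, Object right) throws Exception {
        // Leaves are column references or query constants; inner nodes are
        // sub-expressions this builder already produced (plain Strings here).
        return "(" + render(left) + " " + operation + " " + render(right) + ")";
    }

    private String render(Object operand) {
        if (operand instanceof FilterParser.ColumnIndex) {
            return "col" + ((FilterParser.ColumnIndex) operand).index();
        }
        if (operand instanceof FilterParser.Constant) {
            return String.valueOf(((FilterParser.Constant) operand).constant());
        }
        return String.valueOf(operand); // already-built sub-expression
    }
}

Roughly, a plugin feeds this builder the serialized filter from the query (InputData.getFilterString()) through the FilterParser, then translates the built result into its store's native predicate, e.g. an HBase scan filter.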
New in Apache HAWQ 2.0 beta - incubating Release
● HCatalog Integration
Old way to access the data:

CREATE EXTERNAL TABLE zoo_ext (id double, animal string, age int)
LOCATION ('pxf://namenode:51200/db1.zoo?PROFILE=hive')
FORMAT 'CUSTOM' (formatter='pxfwritable_import');

SELECT * FROM zoo_ext;

With HCatalog integration:

SELECT * FROM hcatalog.db1.zoo;
● Advanced statistics (HDFS)
cwiki.apache.org/confluence/display/HAWQ/PXF
github.com/apache/incubator-hawq/tree/master/pxf
issues.apache.org/jira/browse/HAWQ (Component = PXF)
Contribution
● Feature areas: plugins (storage, formats), push-down filters, performance
● Documentation: Confluence
● Code review: GitHub/Git (git-wip-us.apache.org/repos/asf?p=incubator-hawq.git)
● Join discussion / ask questions: Apache mailing lists, [email protected] and [email protected]
Future
● High Availability: ZooKeeper
● Security: Ranger
● HCatalog write
● Complex types
● Partitions
Q & A