Sharpest Tool in the - Vertica · 2018-09-12 · Spark Hive Pig HDFS MapReduce HCatalog HBase. #SeizeTheData 4 ... Relative performance of Impala and Vertica/Parquet Numbers greater

Sharpest Tool in the Hadoop Toolbox
Vertica SQL on Hadoop
Bob Hansen, Deepak Majeti, James Clampffer

#SeizeTheData

We are making Vertica the fastest structured data processor for Hadoop.

[Diagram: Hadoop ecosystem components: Spark, Hive, Pig, MapReduce, HCatalog]

Accessing your existing data is easy to do

    CREATE HCATALOG SCHEMA hive WITH
        HOSTNAME='hcat.mycorp.com'
        HCATALOG_SCHEMA='tweets';

    SELECT
        keyword,
        EXTRACT(month FROM created_at),
        AVG(score)
    FROM hive.tweets.tweet_keywords
    WHERE created_at >= now() - '3 months'::interval
    GROUP BY 1, 2
    ORDER BY 2;

Hadoop has two popular formats for columnar data: Parquet and ORC

Column-oriented file formats used by popular Hadoop ingestion tools such as Hive, Spark, Drill, and Impala.

Columnar formats efficiently pack data into Hadoop blocks

File broken into blocks (rowgroups/stripes)

Typical size up to 256 MB (size of an HDFS block)

Structured: Metadata contains information about the file including DDL, statistics, etc.
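
The layout described on this slide can be sketched in a few lines of Python. This is a toy model, not the Parquet or ORC on-disk format: the class names, the `rows_per_group` parameter, and the sample rows are all assumptions made for illustration. It shows rows being packed into row groups, each holding one contiguous chunk per column plus min/max statistics kept as metadata.

```python
from dataclasses import dataclass, field


@dataclass
class RowGroup:
    """One block of the file: each column's values stored contiguously."""
    columns: dict  # column name -> list of values in this group
    stats: dict    # column name -> (min, max) for this group


@dataclass
class ColumnarFile:
    schema: list  # column names, kept in the file metadata
    row_groups: list = field(default_factory=list)


def write_columnar(rows, schema, rows_per_group):
    """Pack row-oriented records into row groups of column chunks."""
    out = ColumnarFile(schema=list(schema))
    for start in range(0, len(rows), rows_per_group):
        batch = rows[start:start + rows_per_group]
        columns = {name: [row[name] for row in batch] for name in schema}
        stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
        out.row_groups.append(RowGroup(columns=columns, stats=stats))
    return out


def read_column(file, name):
    """Read a single column by touching only that column's chunks."""
    values = []
    for group in file.row_groups:
        values.extend(group.columns[name])
    return values


# Hypothetical sample data, two rows per group for readability.
rows = [
    {"keyword": "hadoop", "score": 3},
    {"keyword": "vertica", "score": 9},
    {"keyword": "spark", "score": 5},
    {"keyword": "orc", "score": 7},
    {"keyword": "parquet", "score": 8},
]
tweets = write_columnar(rows, ["keyword", "score"], rows_per_group=2)
print(len(tweets.row_groups))               # 3
print(tweets.row_groups[0].stats["score"])  # (3, 9)
print(read_column(tweets, "score"))         # [3, 9, 5, 7, 8]
```

In a real file the row groups are sized in megabytes rather than rows, but the idea is the same: a reader that wants only `score` never has to touch the `keyword` chunks.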

VSQLoH allows you to use your Hadoop data fast

Big Data SQL Performance Tournament

[Tournament bracket: Cloudera, Hortonworks, and Vertica in four configurations (Parquet: libhdfs++, Parquet: webhdfs, ORC: libhdfs++, ORC: webhdfs)]

Big Data SQL Performance Tournament: Impala vs. Spark

Impala is 4x – 60x faster
Impala succeeded in 60 queries that Spark failed
Both Impala and Spark failed 18 queries

Measured under TPC Benchmark™ DS standards

[Chart: TPC-DS queries, sorted by relative run-time]

HAWQ is 2x – 60x faster
Tez is up to 3x faster
HAWQ succeeded in 22 queries that Tez failed
Tez succeeded in 16 queries that HAWQ failed
Both Tez and HAWQ failed 18 queries

Parquet: libhdfs++ vs. Parquet: webhdfs

Comparable

ORC: libhdfs++ vs. ORC: webhdfs

Comparable

Impala vs. Vertica/Parquet

Vertica is 2x – 30x faster
Similar
Vertica succeeded with 19 queries that Impala failed

Vertica is 2x – 73x faster
HAWQ up to 4x faster
Vertica succeeded in 34 queries that HAWQ failed

Parquet ORC

Libhdfs++ is 1.2x – 2.5x faster

Comparable

ROS vs. VSQLoH

ROS is 2x – 11x faster

VSQLoH is fast because of our open source investments

Vertica developed libParquet and libOrc for speed and stability

Most systems use Java SerDes and Java Vectorized Readers. These do not couple well with C++ based systems due to lack of control over resources and lack of tighter integration.

libOrc (https://orc.apache.org)
• Development started early 2015
• HPE + Hortonworks collaboration

libParquet (https://parquet.apache.org)
• Development started early 2016
• HPE + Cloudera collaboration

Optimizations: read only the data you need

• Column selection
• Partition pruning
• Predicate pushdown
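
As a rough illustration of two of these optimizations, the Python sketch below models partition pruning and predicate pushdown over a made-up dataset. It is not how the Vertica, ORC, or Parquet readers are implemented: the partition keys, the `max_score` statistic, and both function names are hypothetical, chosen only to show where data gets skipped.

```python
def prune_partitions(partitions, wanted_month):
    """Partition pruning: keep only partitions whose key matches the filter."""
    return {key: groups for key, groups in partitions.items() if key == wanted_month}


def scan_score_at_least(partitions, threshold):
    """Predicate pushdown: skip row groups whose max score is below the threshold."""
    hits, groups_read = [], 0
    for groups in partitions.values():
        for group in groups:
            if group["max_score"] < threshold:
                continue  # statistics prove no row can match, so the group is not read
            groups_read += 1
            hits.extend(s for s in group["scores"] if s >= threshold)
    return hits, groups_read


# Hypothetical layout: partition key (year-month) -> row groups with statistics.
partitions = {
    201607: [{"max_score": 4, "scores": [1, 4]}],
    201608: [
        {"max_score": 3, "scores": [2, 3]},
        {"max_score": 9, "scores": [5, 9]},
    ],
}

selected = prune_partitions(partitions, 201608)
hits, groups_read = scan_score_at_least(selected, threshold=5)
print(hits, groups_read)  # [5, 9] 1
```

Of the three row groups in this toy dataset, only one is actually read: one is dropped with its partition and one is skipped from its statistics alone.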

How much do we gain?

Resources: https://github.com/apache/orc/pull/43/files

Fast is no good if it doesn’t work reliably

VSQLoH will run your SQL queries out of the box

Successful Unaltered TPC-DS Queries

Running unmodified TPC‐DS benchmark queries

VSQLoH’s fine-grained resource management ensures that queries will complete without running out of memory

60 concurrent queries before error

Running concurrent select TPC‐DS queries

WebHDFS was not a good fit for Vertica’s use case

WebHDFS was intended to be easy for people to use, not for high performance. Meant to be accessed from curl or a web browser:

curl webhdfs://host:port/webhdfs/v1/my_file_path
or http://host:port/webhdfs/v1/my_file_path
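
To make the URL shape concrete, here is a small Python sketch that builds WebHDFS REST URLs. The `/webhdfs/v1/` prefix comes from the slide, and the `op` query parameter (for example `OPEN`, with optional `offset` and `length`) is part of the WebHDFS REST API. The helper name `webhdfs_url`, the host, the port, and the file path are assumptions for illustration only. The sketch only formats the URL and does not contact a cluster.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS REST URL: http://host:port/webhdfs/v1/<path>?op=..."""
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1/{quote(path.lstrip('/'))}?{query}"


# Hypothetical host, port, and path.
url = webhdfs_url("namenode.example.com", 50070, "/user/tweets/part-0000.orc", "OPEN")
print(url)

# Reading a byte range means another HTTP request with offset/length parameters.
ranged = webhdfs_url(
    "namenode.example.com", 50070, "/user/tweets/part-0000.orc",
    "OPEN", offset=0, length=1024,
)
print(ranged)
```

Every read goes through an HTTP request of this form, and an `OPEN` request is answered with a redirect to a datanode before any bytes arrive, which helps explain why the interface is convenient for people but a poor fit for a high-throughput reader.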

[Diagram: Vertica reaching the HDFS server through three interfaces: the WebHDFS interface (via a web server and HDFS client), the libhdfs interface (via a JVM HDFS client), and the libhdfs++ interface]

Libhdfs++ is developed from scratch with a focus on performance

• Implemented in C++ with minimal dependencies.
• Supports Linux and OSX.
• All interfaces are non-blocking (unless you want them to be).
• Minimal memory footprint; all memory is explicitly freed as soon as possible.

HDFS JIRAs
• HDFS-7280
• HDFS-7279
• HDFS-7270
• HDFS-7945

[Charts: Java vs. C++ client. Find of 1 directory: time (sec). Find across 1M directories: time (sec) and memory (MB). Values shown: 2.4 seconds and 0.012 seconds.]

Libraries were implemented with bindings to other languages in mind

libhdfs++, liborc, libparquet

• Pure C wrapper APIs allow functionality to be accessed from nearly any other language.
• Write prototypes and tools in scripting languages, and move to native implementations if required.
• More language bindings lead to more adoption and more contributors.
• Development here: http://issues.apache.org/jira/browse/HDFS-8707

Developed within the Apache community
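
The point about pure C wrappers can be shown with a short Python sketch using the standard `ctypes` module. It does not call libhdfs++, liborc, or libparquet: the C library's `strlen` stands in for any function exported through a plain C interface, and the path string is a made-up example. It assumes Linux or macOS, where `ctypes.CDLL(None)` exposes the symbols already linked into the Python process.

```python
import ctypes

# Load the symbols already linked into the Python process, which include libc.
# This works on Linux and macOS; other platforms may need a library name instead.
libc = ctypes.CDLL(None)

# Declare the C signature: size_t strlen(const char *s)
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

# Call the C function directly from the scripting language.
length = libc.strlen(b"/user/customers")
print(length)  # 15
```

Because a C ABI has no templates, exceptions, or name mangling, the same two declaration lines are all a scripting language needs, which is what makes prototyping against a C wrapper cheap.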

In upcoming releases, it will be faster

Caching Hadoop data in Vertica’s ROS format will supercharge your queries

    CREATE CACHE VIEW fast_tweets
    FROM hive.tweets.tweet_keywords
    WHERE created_year_month BETWEEN 201509 AND 201608;

    SELECT keyword, EXTRACT(month FROM created_at), AVG(score)
    FROM fast_tweets
    WHERE created_year_month = 201608
    GROUP BY 1, 2
    ORDER BY 2;

ROS has 10 years of R&D to make it the fastest format around

ROS vs. VSQLoH

[Diagram labels: HDFS, ROS, SELECT…]

ROS is 2x – 11x faster

Complex types allow richer, more semantically clean data

Complex types enable expression of SQL queries in a natural and intuitive way

    SELECT customer, orders.total_cost
    FROM customers
    WHERE orders.total_sales > 4000 AND orders.products.id = 'B2';

Writing data to HDFS will make Vertica a central part of your workflow

Support writing data in Vertica to ORC and Parquet formats.

    SELECT * FROM customers AS COPY TO 'hdfs:///user/customers' PARQUET;

[Diagram labels: HDFS, ORC / Parquet]

Who wants some fast?

Vertica SQL on Hadoop Summary

• High-performance Vertica engine
  • Beats HAWQ, Impala, Spark, Tez
• All TPC-DS queries run out of the box
• Supports major HDFS file formats
  • ORC, Parquet
• Native readers enable tighter integration
  • Partition pruning, predicate pushdown, column selection
• Libhdfs++ enables efficient communication with HDFS
• Roadmap for more features and further performance improvements
