Running Cloudera Impala on PostgreSQL By Chengzhong Liu [email protected] 2013.12

Chengzhong Liu (刘诚忠): Running Cloudera Impala on PostgreSQL

May 25, 2015

BDTC 2013 Beijing China
Page 1: Chengzhong Liu: Running Cloudera Impala on PostgreSQL

Running Cloudera Impala on PostgreSQL

By Chengzhong Liu

[email protected]

2013.12

Page 2

Where the story comes from…

• Data gravity

• Why big data

• Why SQL on big data

Page 3

Today's agenda

• Big data at Miaozhen Systems (秒针系统)

• Overview of Cloudera Impala

• Hacking practice in Cloudera Impala

• Performance

• Conclusions

• Q&A

Page 4

What happens at Miaozhen

• 3 billion ad impressions per day

• 20 TB data scan for report generation every morning

• 24-server cluster

• Besides this

– TV Monitor

– Mobile Monitor

– Site Monitor

– …

Page 5

Before Hadoop

• Scrat

– PostgreSQL 9.1 cluster

– Wrote a simple proxy

– <2 s for a 2 TB data scan

• Mobile Monitor

– Hadoop-like distributed computing system

– RabbitMQ + 3 computing servers

– Wrote a MapReduce in C++

– Handles 30 million to 500 million ad impressions

Page 6

Problems & Opportunities

• Database cluster

• SQL on Hadoop

• Miscellaneous data

• Requirements

– Most data is relational

– SQL interface

Page 7

SQL on Hadoop

• Google Dremel

• Apache Drill

• Cloudera Impala

• Facebook Presto

• EMC Greenplum/Pivotal

[Diagram: HDFS; MapReduce; Hive, Pig; Impala/Drill/Pivotal/Presto. Caption: Latency matters]

Page 8

What’s this

• A kind of MPP engine

• In memory processing

• Small to big join

– Broadcast join

• Small result size
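To make the "small to big join" bullet concrete, here is a minimal Python sketch of a broadcast hash join. It is an illustration only, not Impala's implementation: the function name, the row layout, and the sample data are all assumptions, and the list of partitions simply stands in for nodes that each hold part of the big table.

```python
from collections import defaultdict


def broadcast_hash_join(small_rows, big_partitions, small_key, big_key):
    """Join a small table against a partitioned big table.

    The small table is built into one hash table that every partition
    (standing in for a node) probes locally, so big-table rows never move.
    """
    # Build phase: hash the small table once; this is what gets "broadcast".
    hash_table = defaultdict(list)
    for row in small_rows:
        hash_table[row[small_key]].append(row)

    # Probe phase: each partition scans only its own rows.
    results = []
    for partition in big_partitions:
        for big_row in partition:
            for small_row in hash_table.get(big_row[big_key], []):
                results.append({**small_row, **big_row})
    return results


# Hypothetical sample data: a small campaign table and a big impression log.
campaigns = [{"campaign": 1, "name": "spring"}, {"campaign": 2, "name": "summer"}]
impressions = [
    [{"id": "a", "campaign": 1}, {"id": "b", "campaign": 2}],  # node 0
    [{"id": "c", "campaign": 1}, {"id": "d", "campaign": 3}],  # node 1
]
joined = broadcast_hash_join(campaigns, impressions, "campaign", "campaign")
```

The design only pays off when one side is small enough to copy to every node, which matches the "small to big" and "small result size" bullets on this slide.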

Page 9

Why Cloudera Impala

• The team moves fast

– UDFs coming out

– Better join strategy on the way

• Good code base

– Modular

– Easy to add subclasses

• Really fast

– LLVM code generation

• 80s/95s – UV test

– Distributed aggregation tree

– In-situ data processing (inside storage)

Page 10

Typical Arch.

[Diagram: SQL Interface, Meta Store, and three nodes, each with a Query Planner, a Coordinator, and an Exec Engine]

Page 11

Our target

• An MPP database

– Built on PostgreSQL 9.1

– Scale well

– Speed

• A mixed data source MPP query engine

– Join two tables in different sources

– In fact…

Page 12

Hacking… where to start

• Add, not change

– Scan Node type

– DB Meta info

• Put changes in configuration

– Thrift Protocol update

• TDBHostInfo

• TDBScanNode

Page 13

Front end

• Meta store update

– Link data to the table name

– Table location management

• Front end

– Compute table location

Page 14

Back end

• Coordinator

– pg host

• New scan node type

– db scan node

• Pg scan node

• Psql library using cursor
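The point of scanning through a cursor is that the scan node pulls rows in bounded batches instead of materializing the whole result. The sketch below shows that pattern in Python under loud assumptions: the standard-library sqlite3 module stands in for PostgreSQL so the example runs anywhere, and the function name, table, and batch size are hypothetical. The deck's actual scan node lives in Impala's back end and is not shown here.

```python
import sqlite3


def scan_in_batches(conn, query, batch_size=1024):
    """Yield row batches from a cursor instead of loading the full result."""
    cursor = conn.cursor()
    cursor.execute(query)
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield batch


# sqlite3 is only a stand-in for the PostgreSQL host behind the scan node.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT, campaign INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(f"user{i}", i % 3) for i in range(10)],
)
batches = list(scan_in_batches(conn, "SELECT id, campaign FROM t", batch_size=4))
batch_sizes = [len(b) for b in batches]
```

Each yielded batch plays the role of a row batch handed to the next plan node, so memory use is bounded by the batch size rather than by the table size.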

Page 15

SQL Plan

[Plan diagram, bottom to top: HDFS/PG scan → Aggr.: group by id → Exchange node → Aggr.: group by id → Aggr.: count(id) → Exchange node → Aggr.: sum(count(id))]

• select count(distinct id) from table

– MapReduce-like process
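One reading of the plan on this slide is a two-phase, MapReduce-like computation of count(distinct id): deduplicate locally, exchange by id, deduplicate again, count, and sum at the coordinator. The Python sketch below simulates that flow in a single process. The function name, the partition count, and the sample scans are assumptions for illustration, not Impala code.

```python
def distributed_count_distinct(node_scans, num_partitions=2):
    """Simulate count(distinct id) as a scan/aggregate/exchange pipeline."""
    # Phase 1: each node scans its rows and groups by id locally.
    local_groups = [set(rows) for rows in node_scans]

    # Exchange: route each id to a partition by hash, so equal ids meet.
    partitions = [[] for _ in range(num_partitions)]
    for group in local_groups:
        for value in group:
            partitions[hash(value) % num_partitions].append(value)

    # Phase 2: group by id again per partition, then count(id).
    partial_counts = [len(set(p)) for p in partitions]

    # Final exchange to the coordinator: sum(count(id)).
    return sum(partial_counts)


# Hypothetical scans from three nodes; ids repeat within and across nodes.
scans = [["a", "b", "a"], ["b", "c"], ["c", "d", "a"]]
distinct_ids = distributed_count_distinct(scans)
```

Hash-partitioning on id is what makes the final sum correct: every occurrence of an id lands in exactly one partition, so no id is counted twice.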

Page 16

Env.

• Ad impression logs

– 150 million lines, 100 KB/line

• 3 servers

– 24 cores

– 32 GB memory

– 12 × 2 TB hard disks

– 100 Mbps LAN

• Queries

– select count(id) from t group by campaign

– select count(distinct id) from t group by campaign

– select * from t where id = 'xxxxxxxx'

Page 17

Performance

[Chart legend: impala, hive, pg+impala]

• Group by speed / core

• 20 M/s

Page 18

With index

Page 19

Codegen on/off

[Chart legend: en_codegen, dis_codegen]

• select count(distinct id) from t group by c

• select distinct id from t

• select id from t group by id having count(case when c = '1' then 1 else null end) > 0 and count(case when c = '2' then 1 else null end) > 0 limit 10;
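The last benchmark query is the least obvious of the three: it returns ids that appear under both c = '1' and c = '2', using conditional counts in the HAVING clause. The sketch below runs that query shape on a tiny table to show what it selects. It uses Python's sqlite3 module purely as a convenient stand-in (an assumption, not the engine benchmarked in the talk), and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT, c TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [
        ("a", "1"), ("a", "2"),              # seen under both values
        ("b", "1"),                          # only c = '1'
        ("c", "2"),                          # only c = '2'
        ("d", "1"), ("d", "2"), ("d", "1"),  # both, with a repeat
    ],
)

# count() skips NULLs, so each conditional count is the number of rows
# for that id with the given value of c.
query = """
    select id from t
    group by id
    having count(case when c = '1' then 1 else null end) > 0
       and count(case when c = '2' then 1 else null end) > 0
    limit 10
"""
ids_in_both = sorted(row[0] for row in conn.execute(query))
```

Only ids "a" and "d" satisfy both conditions here; "b" and "c" each fail one of the two counts.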

Page 20

Multi-users

Page 21

Conclusion

• Source quality

– Readable

– Google C++ style

– Robust

• MPP solution based on PG

– Proved perf.

– Easy to scale

• Mixed engine usage

– HDFS and DB

Page 22

What’s next

• YARN integration

• UDFs

• Join with big tables

• BI roadmap

• Failover

Page 23

Ref.

• Cloudera Impala online doc. & src

• http://files.meetup.com/1727991/Impala%20and%20BigQuery.ppt

• http://www.cubrid.org/blog/dev-platform/meet-impala-open-source-real-time-sql-querying-on-hadoop/

• http://berlinbuzzwords.de/sites/berlinbuzzwords.de/files/slides/Impala%20tech%20talk.pdf

• @datascientist, @dongxicheng, @flyingsk, @zhh

Page 24

Thanks! Q & A