10 Reasons To Start Your Analytics Project with PostgreSQL Satoshi Nagayasu @snaga HKOSCon 2016

10 Reasons to Start Your Analytics Project with PostgreSQL

Jan 15, 2017

Transcript
Page 1: 10 Reasons to Start Your Analytics Project with PostgreSQL

10 Reasons To Start Your Analytics Project with PostgreSQL

Satoshi Nagayasu @snaga

HKOSCon 2016

Page 2: 10 Reasons to Start Your Analytics Project with PostgreSQL

Agenda
• Collecting Data / Database Federation
• Building Data Warehouse and Data Mart
• Writing Queries / SQL Features
• Performance
• In-Database Analytics

Page 3: 10 Reasons to Start Your Analytics Project with PostgreSQL

Collecting Data / Database Federation
Foreign Data Wrapper
Unlogged Table

Page 4: 10 Reasons to Start Your Analytics Project with PostgreSQL

Foreign Data Wrapper
• Connects external data sources (RDBMS, NoSQL, files, etc.) to the PostgreSQL executor.
• Allows SELECT/INSERT/UPDATE/DELETE operations on external tables (writes only where the wrapper supports them).

[Diagram: PostgreSQL connected to Oracle, MySQL and HDFS]

https://wiki.postgresql.org/wiki/Foreign_data_wrappers

Page 5: 10 Reasons to Start Your Analytics Project with PostgreSQL

Unlogged Table
• Does not write WAL (XLOG) records.
• Performs better on writes than a regular table.
• Is truncated after crash recovery.

http://pgsnaga.blogspot.jp/2011/10/data-loading-into-unlogged-tables-and.html

Page 6: 10 Reasons to Start Your Analytics Project with PostgreSQL

Building Data Warehouse and Data Mart
Materialized Views
Transactional DDLs

Page 7: 10 Reasons to Start Your Analytics Project with PostgreSQL

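Materialized View
• Defines a view that caches its result records.
• Avoids re-running complicated queries and aggregations every time.
• The cache must be refreshed explicitly by the user (REFRESH MATERIALIZED VIEW).

[Diagram: a query on a View reads the underlying Tables each time; a query on a Materialized View reads a Cache built from the Table]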
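As a rough illustration of the caching behaviour described on this slide (not PostgreSQL's implementation), the sketch below models a plain view as a function that recomputes on every read, and a materialized view as a stored result that changes only on an explicit refresh. The `MaterializedView` class, the `sales` rows and the `total_by_region` query are all hypothetical names made up for this example.

```python
from collections import defaultdict

# Hypothetical base table: (region, amount) rows.
sales = [("HK", 100), ("HK", 50), ("JP", 70)]


def total_by_region(rows):
    """The 'complicated' query: aggregate amount per region."""
    totals = defaultdict(int)
    for region, amount in rows:
        totals[region] += amount
    return dict(totals)


class MaterializedView:
    """Toy model: caches a query result until refresh() is called."""

    def __init__(self, query, table):
        self.query = query
        self.table = table
        self.cache = query(table)  # populated once, at creation time

    def read(self):
        return self.cache  # no aggregation work on read

    def refresh(self):
        self.cache = self.query(self.table)  # explicit, user-driven refresh


mv = MaterializedView(total_by_region, sales)
sales.append(("JP", 30))             # the base table changes

plain_view = total_by_region(sales)  # a plain view recomputes on every read
stale = mv.read()                    # the materialized view still holds the old result
mv.refresh()
fresh = mv.read()                    # now it reflects the new row
```

The trade-off is the one the slide names: reads are cheap, but the result can be stale until someone refreshes it.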

Page 8: 10 Reasons to Start Your Analytics Project with PostgreSQL

Transactional DDLs
• Most DDL statements can be run inside a transaction in PostgreSQL.
• The schema can be modified atomically, even online (commit or rollback).
• Transactional DDL helps DBAs manage their schemas more easily.

Page 9: 10 Reasons to Start Your Analytics Project with PostgreSQL

Writing Queries / SQL Features
Rich SQL features
Compatibility with SQL standard

Page 10: 10 Reasons to Start Your Analytics Project with PostgreSQL

Writing Queries / SQL Features
• Rich SQL features
  – Subqueries
  – WITH clauses (Common Table Expressions, CTEs)
  – Many aggregation functions
  – Window functions
• JSON support
• Compatibility with the SQL standard

Page 11: 10 Reasons to Start Your Analytics Project with PostgreSQL

WITH clause
• Defines a temporary named result set for a query.
• May perform better than repeating the same subquery more than once.

WITH foo AS (
  SELECT ... FROM ... GROUP BY ...
)
SELECT ... FROM foo WHERE ...
UNION ALL
SELECT ... FROM foo WHERE ...;

https://www.postgresql.org/docs/9.5/static/queries-with.html

Page 12: 10 Reasons to Start Your Analytics Project with PostgreSQL

Many Aggregations
• New in 9.4
  – percentile_cont()
  – percentile_disc()
  – mode()
  – rank()
  – dense_rank()
  – percent_rank()
  – cume_dist()
• New in 9.5
  – ROLLUP()
  – CUBE()
  – GROUPING SETS()

https://www.postgresql.org/docs/9.5/static/functions-aggregate.html

Page 13: 10 Reasons to Start Your Analytics Project with PostgreSQL

ROLLUP
• Calculates total/subtotal values

Page 14: 10 Reasons to Start Your Analytics Project with PostgreSQL

CUBE
• Calculates aggregates for all combinations of the specified columns
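ROLLUP and CUBE can both be read as shorthand for a list of grouping sets. The sketch below only enumerates which column combinations each one stands for; the helper names `rollup_sets` and `cube_sets` are made up for this illustration and are not PostgreSQL functions.

```python
from itertools import combinations


def rollup_sets(columns):
    """ROLLUP(a, b, c) -> (a, b, c), (a, b), (a), (): each prefix, down to the grand total."""
    return [tuple(columns[:n]) for n in range(len(columns), -1, -1)]


def cube_sets(columns):
    """CUBE(a, b) -> every subset of the columns, from all of them down to the grand total."""
    sets = []
    for n in range(len(columns), -1, -1):
        sets.extend(combinations(columns, n))
    return sets


rollup = rollup_sets(["region", "product"])
cube = cube_sets(["region", "product"])
```

For two columns, ROLLUP yields three groupings (detail, subtotal per region, grand total), while CUBE adds the subtotal per product as a fourth.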

Page 15: 10 Reasons to Start Your Analytics Project with PostgreSQL

GROUPING SETS
• Runs multiple GROUP BY queries at once

[Callout on the example: Two GROUP BYs at once.]
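To make "multiple GROUP BYs at once" concrete, here is a minimal sketch that sums a column for several grouping sets in a single pass over the rows. The table contents and the `group_by_sets` helper are hypothetical, and how PostgreSQL actually executes such a query is up to its planner.

```python
from collections import defaultdict

# Hypothetical rows: (region, product, amount).
rows = [
    ("HK", "tea", 10),
    ("HK", "rice", 20),
    ("JP", "tea", 5),
]
COLUMNS = ("region", "product")


def group_by_sets(rows, grouping_sets):
    """Sum `amount` for every grouping set in a single pass over the rows.

    Columns that are not part of a grouping set are reported as None,
    mirroring the NULLs SQL puts in those columns.
    """
    totals = defaultdict(int)
    for region, product, amount in rows:
        values = {"region": region, "product": product}
        for gset in grouping_sets:
            key = tuple(values[c] if c in gset else None for c in COLUMNS)
            totals[key] += amount
    return dict(totals)


# Equivalent in spirit to GROUP BY GROUPING SETS ((region), (product)).
result = group_by_sets(rows, [("region",), ("product",)])
```

The result holds both the per-region and the per-product totals, which would otherwise take two separate GROUP BY queries joined with UNION ALL.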

Page 16: 10 Reasons to Start Your Analytics Project with PostgreSQL

JSON data type

testdb=# create table t1 ( j jsonb );
CREATE TABLE
testdb=# insert into t1 values ('{ "key1": "value1", "key2": "value2" }');
INSERT 0 1
testdb=# select * from t1;
                  j
--------------------------------------
 {"key1": "value1", "key2": "value2"}
(1 row)

testdb=# select j->>'key2' key2 from t1;
  key2
--------
 value2
(1 row)

Page 17: 10 Reasons to Start Your Analytics Project with PostgreSQL

JSON data type

testdb=# select n_nationkey,n_name from nation where n_nationkey = 12;
 n_nationkey |          n_name
-------------+---------------------------
          12 | JAPAN
(1 row)

testdb=# select jsonb_build_object('n_nationkey', n_nationkey, 'n_name', n_name) from nation where n_nationkey = 12;
                     jsonb_build_object
------------------------------------------------------------
 {"n_name": "JAPAN ", "n_nationkey": 12}
(1 row)

Page 18: 10 Reasons to Start Your Analytics Project with PostgreSQL

JSON data type

Operator | Description
(9.4)
->       | Get an element by key as a JSON object
->>      | Get an element by key as text
#>       | Get an element by path as a JSON object
#>>      | Get an element by path as text
<@, @>   | Evaluate whether a JSON object contains a key/value pair
?        | Evaluate whether a JSON object contains a key or a value
?|       | Evaluate whether a JSON object contains ANY of keys or values
?&       | Evaluate whether a JSON object contains ALL of keys or values
(9.5)
||       | Insert or update an element in a JSON object
-        | Delete an element by key from a JSON object
#-       | Delete an element by path from a JSON object

http://www.postgresql.org/docs/9.5/static/functions-json.html

Page 19: 10 Reasons to Start Your Analytics Project with PostgreSQL

JSON data type
• Allows collecting data without defining a schema up front.
• “Schema-less”, “Schema on Read” or “Schema-later”.
• Still accessible with SQL.

[Diagram labels: App, Fluentd, Fluentd pg-Json plugin, JSON Data Type, View (Schema)]

Page 20: 10 Reasons to Start Your Analytics Project with PostgreSQL

Performance
3 types of Join
Full text search (n-gram)
Table Partition
BRIN Index
Table Sample
Parallel Queries

Page 21: 10 Reasons to Start Your Analytics Project with PostgreSQL

3 types of Join
• Nested Loop (NL) Join
  – Works well when joining a small number of records between tables with indexes.
• Merge Join
• Hash Join
  – Works better than NL when joining a large number of records between large tables.
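The difference between the two join strategies named above can be sketched in a few lines. This is a textbook-style illustration with hypothetical `orders` and `customers` tables, not PostgreSQL's executor code; in the real nested loop case, an index on the inner table replaces the inner scan with a lookup.

```python
# Hypothetical tables: orders(order_id, customer_id) and customers(customer_id, name).
orders = [(1, 10), (2, 11), (3, 10)]
customers = [(10, "Alice"), (11, "Bob"), (12, "Carol")]


def nested_loop_join(outer, inner):
    """For each outer row, scan the inner rows for matches: O(len(outer) * len(inner))."""
    result = []
    for order_id, customer_id in outer:
        for cid, name in inner:
            if cid == customer_id:
                result.append((order_id, name))
    return result


def hash_join(outer, inner):
    """Build a hash table on the inner rows once, then probe it per outer row."""
    table = {}
    for cid, name in inner:
        table.setdefault(cid, []).append(name)
    result = []
    for order_id, customer_id in outer:
        for name in table.get(customer_id, []):
            result.append((order_id, name))
    return result


nl_result = nested_loop_join(orders, customers)
hash_result = hash_join(orders, customers)
```

Both return the same rows; the hash join pays a one-time build cost so that each probe is cheap, which is why it tends to win as both inputs grow.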

Page 22: 10 Reasons to Start Your Analytics Project with PostgreSQL

Full-text search (n-gram)
• Splits text into N-character tokens and builds an index.
  – pg_trgm: Tri-gram (3-char)
  – pg_bigm: Bi-gram (2-char)
• CJK has lots of 2-char words, so Bi-gram may be more useful than Tri-gram.
  – CJK: Chinese, Japanese and Korean.

pg_trgm: https://www.postgresql.org/docs/9.5/static/pgtrgm.html
pg_bigm: http://pgbigm.osdn.jp/index_en.html
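A minimal sketch of the idea behind an n-gram index follows. It is a simplification: the real extensions do additional processing that is omitted here, and the helper names (`ngrams`, `build_index`, `search`) and the sample titles are made up for this example.

```python
from collections import defaultdict


def ngrams(text, n):
    """Split text into overlapping n-character tokens (no padding or normalization)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_index(titles, n):
    """Inverted index: n-gram -> set of row ids containing it."""
    index = defaultdict(set)
    for row_id, title in enumerate(titles):
        for gram in ngrams(title, n):
            index[gram].add(row_id)
    return index


def search(titles, index, keyword, n):
    """Find rows containing `keyword` (assumes len(keyword) >= n)."""
    grams = ngrams(keyword, n)
    # Candidate rows contain every n-gram of the keyword...
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # ...but the grams may not be adjacent, so recheck against the real text.
    return sorted(i for i in candidates if keyword in titles[i])


titles = ["香港大学列表", "香港十八区", "北區 (香港)", "港香"]
bigram_index = build_index(titles, 2)
hits = search(titles, bigram_index, "香港", 2)

# A 2-character keyword produces no 3-character token at all in this sketch.
trigrams_of_keyword = ngrams("香港", 3)
```

The last line hints at why bi-grams suit CJK text: a two-character keyword yields one usable bi-gram but no tri-gram to look up.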

Page 23: 10 Reasons to Start Your Analytics Project with PostgreSQL

Pg_bigm performance
• Wikipedia title data (2,789,266 records)
  – https://dumps.wikimedia.org/zhwiki/20160601/
  – zhwiki-20160601-pages-articles-multistream-index.txt.bz2

zhwikidb=> select * from zhwiki_index where title like '%香港%';
   id1    |  id2  | title
----------+-------+----------------------------------------
  5693863 |  2087 | 香港特別行政區基本法第二十三條
 11393231 |  4323 | 香港特别行政区
 12830042 |  5085 | 香港大学列表
 14349335 |  6088 | 香港行政区划
 14349335 |  6090 | 香港行政區劃
 14349335 |  6091 | 香港十八区
 14349335 |  6092 | 香港十八區
 16084672 |  7168 | 香港兒童文學作家
 18110426 |  8206 | 北區 (香港)
 18110426 |  8236 | 東區 (香港)
 19537078 |  9528 | 香港專業教育學院
 19537078 |  9567 | 香港中文大學

Page 24: 10 Reasons to Start Your Analytics Project with PostgreSQL

Pg_bigm performance

select count(*) from zhwiki_index where title like '%香港電影%';

Aggregate (actual time=481.512..481.541 rows=1 loops=1)
  -> Seq Scan on zhwiki_index (actual time=1.458..478.326 rows=317 loops=1)
       Filter: (title ~~ '%香港電影%'::text)
       Rows Removed by Filter: 2788949
Planning time: 0.125 ms
Execution time: 481.654 ms
(6 rows)

Page 25: 10 Reasons to Start Your Analytics Project with PostgreSQL

Pg_bigm performance

select count(*) from zhwiki_index where title like '%香港電影%';

Aggregate (actual time=1.790..1.792 rows=1 loops=1)
  -> Bitmap Heap Scan on zhwiki_index (actual time=0.299..1.225 rows=317 loops=1)
       Recheck Cond: (title ~~ '%香港電影%'::text)
       Rows Removed by Index Recheck: 1
       Heap Blocks: exact=191
       -> Bitmap Index Scan on zhwiki_index_title_idx (actual time=0.258..0.258 rows=318 loops=1)
            Index Cond: (title ~~ '%香港電影%'::text)
Planning time: 0.103 ms
Execution time: 1.833 ms
(9 rows)

481.6ms → 1.8ms: over 200x faster than a regular LIKE.

Page 26: 10 Reasons to Start Your Analytics Project with PostgreSQL

Table Partition
• Table partitioning by range or list.
• Does not scan unnecessary partitions.
  – This is called “Constraint Exclusion”.
  – The partitions to skip are determined by their “constraints”.
• Can avoid a full table scan of a large table entirely.

https://www.postgresql.org/docs/9.5/static/ddl-partitioning.html
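The pruning decision can be sketched as a range-overlap test. This is only an illustration of the idea: the partition names and bounds below are hypothetical, and the tuples stand in for the constraint declared on each partition.

```python
from datetime import date

# Hypothetical partitions: name -> (inclusive lower bound, exclusive upper bound),
# standing in for the constraint on each partition.
partitions = {
    "orders_1992_01": (date(1992, 1, 1), date(1992, 2, 1)),
    "orders_1992_02": (date(1992, 2, 1), date(1992, 3, 1)),
    "orders_1992_03": (date(1992, 3, 1), date(1992, 4, 1)),
}


def partitions_to_scan(partitions, query_from, query_to):
    """Keep only partitions whose constraint can overlap [query_from, query_to)."""
    return [
        name
        for name, (lower, upper) in partitions.items()
        if lower < query_to and query_from < upper
    ]


# WHERE order_date >= '1992-02-10' AND order_date < '1992-02-20'
to_scan = partitions_to_scan(partitions, date(1992, 2, 10), date(1992, 2, 20))
```

Only the February partition survives, so the other two are never read.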

Page 27: 10 Reasons to Start Your Analytics Project with PostgreSQL

BRIN Index
• Block Range INdex (New in 9.5)
  – Holds “summary” data, instead of raw data.
  – Reduces index size tremendously.
  – Also reduces creation/maintenance cost.
  – Needs extra tuple fetches to get the exact record.

[Charts: Btree vs. BRIN for Index Creation (elapsed time, ms), Index Size (number of blocks) and Select 1 record (elapsed time, ms)]

https://gist.github.com/snaga/82173bd49749ccf0fa6c

Page 28: 10 Reasons to Start Your Analytics Project with PostgreSQL

BRIN Index
• Structure of BRIN Index

[Diagram: a table file divided into Block Range 1 (128 blocks), Block Range 2, Block Range 3, ...]

Block Range | Min. Value | Max. Value
1           | 1992-01-02 | 1992-01-28
2           | 1992-01-27 | 1992-02-08
3           | 1992-02-08 | 1992-02-16
…           | …          | …

Holds only min/max values for “Block Ranges”, 128 blocks each (in the case of a date column).
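The structure above can be mimicked with a short sketch: one (min, max) summary per run of 128 blocks, and a lookup that skips every range whose summary rules the value out. The function names and the integer test data are hypothetical; this is a toy model, not PostgreSQL's BRIN code.

```python
BLOCKS_PER_RANGE = 128  # matches the range size shown on the slide


def build_brin(blocks):
    """Summarize each run of BLOCKS_PER_RANGE blocks as (first_block, min, max).

    `blocks` is a list of blocks, each a non-empty list of column values.
    """
    summary = []
    for start in range(0, len(blocks), BLOCKS_PER_RANGE):
        values = [v for block in blocks[start:start + BLOCKS_PER_RANGE] for v in block]
        summary.append((start, min(values), max(values)))
    return summary


def lookup(blocks, summary, target):
    """Scan only the block ranges whose [min, max] could contain `target`."""
    scanned_ranges = 0
    matches = []
    for start, lo, hi in summary:
        if lo <= target <= hi:
            scanned_ranges += 1
            # The summary is lossy, so every row in the range must be rechecked.
            for block in blocks[start:start + BLOCKS_PER_RANGE]:
                matches.extend(v for v in block if v == target)
    return matches, scanned_ranges


# Hypothetical table: 512 blocks of 10 increasing integers each (well-correlated data).
blocks = [list(range(b * 10, b * 10 + 10)) for b in range(512)]
summary = build_brin(blocks)
matches, scanned_ranges = lookup(blocks, summary, 2000)
```

The index here is four tuples for 5,120 rows, which is where the size saving comes from; the recheck loop is the "extra tuple fetch" cost. The approach works best when values correlate with physical row order, as in this test data.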

Page 29: 10 Reasons to Start Your Analytics Project with PostgreSQL

TABLESAMPLE
• Allows getting approximate aggregation results by sampling.
• BERNOULLI
  – Accurate
  – Samples by tuple
• SYSTEM
  – Performance
  – Samples by block

http://blog.2ndquadrant.com/tablesample-in-postgresql-9-5-2/
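The two sampling methods differ in what unit they pick at random. The sketch below models a table as a list of blocks and contrasts row-level with block-level selection; the function names, the table shape and the 10% fraction are assumptions for illustration, not PostgreSQL internals.

```python
import random


def bernoulli_sample(blocks, fraction, rng):
    """Visit every row and keep each one independently with probability `fraction`."""
    return [row for block in blocks for row in block if rng.random() < fraction]


def system_sample(blocks, fraction, rng):
    """Pick whole blocks with probability `fraction` and keep all of their rows."""
    sample = []
    for block in blocks:
        if rng.random() < fraction:
            sample.extend(block)
    return sample


# Hypothetical table: 1,000 blocks of 100 prices each.
rng = random.Random(42)
blocks = [[rng.uniform(0, 1000) for _ in range(100)] for _ in range(1000)]

actual_avg = sum(sum(b) for b in blocks) / 100_000
bern = bernoulli_sample(blocks, 0.1, rng)
syst = system_sample(blocks, 0.1, rng)
bern_avg = sum(bern) / len(bern)
syst_avg = sum(syst) / len(syst)
```

Block-level sampling touches only the chosen blocks, which is why it is faster, but rows stored together are sampled together, so it can be less representative when neighbouring rows are similar.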

Page 30: 10 Reasons to Start Your Analytics Project with PostgreSQL

TABLESAMPLE
• Calculating the average of total price.
  – The actual value and the approximate ones

Page 31: 10 Reasons to Start Your Analytics Project with PostgreSQL

TABLESAMPLE
• Without TABLESAMPLE: 1787ms
• SYSTEM sampling: 22ms
• BERNOULLI sampling: 405ms

Page 32: 10 Reasons to Start Your Analytics Project with PostgreSQL

Parallel Queries
• The leader process cooperates with worker processes for:
  – Sequential scan
  – Joins (Nested Loop & Hash)
  – Aggregations
• Will be shipped with 9.6
  – 9.6 is beta2 as of today

[Diagram: the Client sends a Query to the Leader and receives the Result; the Leader launches Workers and gathers their output; the Workers read and examine the Data]
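The launch-and-gather pattern in the diagram can be sketched as a leader that splits the data, hands slices to workers, and combines their partial aggregates. Threads stand in for PostgreSQL's worker processes here, so this shows the structure only and is not expected to run faster in Python; `partial_count` and `parallel_count` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_count(chunk):
    """Worker: scan one slice of the data and return a partial aggregate."""
    return sum(1 for value in chunk if value % 2 == 0)


def parallel_count(data, workers):
    """Leader: split the data, launch workers, then gather and combine partial counts."""
    size = (len(data) + workers - 1) // workers  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_count, chunks))
    return sum(partials)


data = list(range(100_000))
total = parallel_count(data, workers=4)
```

Counting works this way because partial counts can simply be added; aggregates that cannot be combined from partial results need a different plan.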

Page 33: 10 Reasons to Start Your Analytics Project with PostgreSQL

Parallel Aggregation Performance & Scalability

• count(*) on 30M rows
  – Shows good parallel scalability

Page 34: 10 Reasons to Start Your Analytics Project with PostgreSQL

In-Database Analytics
User Defined Functions
Apache MADlib

Page 35: 10 Reasons to Start Your Analytics Project with PostgreSQL

In-Database Analytics
• What is In-Database Analytics?
  – Performing analytics workloads in the database without pulling the data out of the server.
• Advantages of In-Database Analytics
  – No need to move “BigData” between server and client for analytics.
  – Servers have higher-performance hardware resources (CPU, memory, storage) than client PCs.

Page 36: 10 Reasons to Start Your Analytics Project with PostgreSQL

In-Database Analytics
• User defined functions
  – PL/Python, PL/R, PL/v8, ... or C.
  – Allow you to run (almost) any logic within the database.
• Apache MADlib
  – Machine Learning Library for PostgreSQL

Page 37: 10 Reasons to Start Your Analytics Project with PostgreSQL

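UDF by Python

-- Returns the server process's environment variables as (name, value) rows.
CREATE OR REPLACE FUNCTION dumpenv(OUT text, OUT text)
RETURNS SETOF record
AS $$
import os
for e in os.environ:
    # Log each variable as a NOTICE, then emit it as a result row.
    plpy.notice(str(e) + ": " + os.environ[e])
    yield(e, os.environ[e])
$$ LANGUAGE plpythonu;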

Page 38: 10 Reasons to Start Your Analytics Project with PostgreSQL

UDF by Python

CREATE OR REPLACE FUNCTION dumpenv(OUT text, OUT text)
RETURNS SETOF record
AS $$
import os
for e in os.environ:
    plpy.notice(str(e) + ": " + os.environ[e])
    yield(e, os.environ[e])
$$ LANGUAGE plpythonu;

testdb=# select * from dumpenv() order by 1 limit 10;
      column1       |        column2
--------------------+-----------------------
 G_BROKEN_FILENAMES | 1
 HISTCONTROL        | ignoredups
 HISTSIZE           | 1000
 HOME               | /home/snaga
 HOSTNAME           | localhost.localdomain
 LANG               | ja_JP.UTF-8
 LC_COLLATE         | C
 LC_CTYPE           | C
 LC_MESSAGES        | C
 LC_MONETARY        | C
(10 rows)

Page 39: 10 Reasons to Start Your Analytics Project with PostgreSQL

Apache MADlib
• An Open Source Machine Learning Library
  – Can run in PostgreSQL, Greenplum Database and Apache HAWQ.
  – Supports many ML algorithms.

http://madlib.incubator.apache.org/

Page 40: 10 Reasons to Start Your Analytics Project with PostgreSQL

Others
Strict type checking and constraints.
Industry Standard Interface (for BI tools)

Page 41: 10 Reasons to Start Your Analytics Project with PostgreSQL

Others
• Strict type checking and constraints.
  – Avoid “Garbage in, garbage out.”
• Industry Standard Interface (for BI tools)
  – ODBC, JDBC

Page 42: 10 Reasons to Start Your Analytics Project with PostgreSQL

Summary
• PostgreSQL already has lots of features that help your analytics project
  – In terms of productivity and performance.
• And more “BigData” features are coming in future releases.
  – Parallel query should be a big one.
• Let’s start your analytics project with PostgreSQL and join our community.
  – PostgreSQL 9.6 beta2 is available now!

Page 43: 10 Reasons to Start Your Analytics Project with PostgreSQL

Resources
• http://www.postgresql.org
• http://wiki.postgresql.org
• http://planet.postgresql.org
• http://pgcon.org

Page 44: 10 Reasons to Start Your Analytics Project with PostgreSQL

pgDay Asia 2016
• pgDay Asia 2016 / FOSSASIA 2016
  – March 17-19 in Singapore
• Speakers:
  – 19+ speakers from 9 countries
• Sessions:
  – 19 Regular Sessions.
  – Plus, lightning talks
• Attendees:
  – Around 100 attendees

Page 48: 10 Reasons to Start Your Analytics Project with PostgreSQL

pgDay Asia 2017
• FOSSASIA 2017 (March, 2017)
  – Probably the same format, in the same season, in the same region.
• Do not miss the next one!
  – Will be better and bigger.
• Join us at:
  – http://pgday.asia
  – https://www.facebook.com/pgdayasia

Page 49: 10 Reasons to Start Your Analytics Project with PostgreSQL

Q&A