Using the PostgreSQL Extension Ecosystem for Advanced Analytics

Jan 15, 2017

Chartio
Transcript
Page 1: Using the PostgreSQL Extension Ecosystem for Advanced Analytics

[email protected] (855) 232-0320

Using the PostgreSQL Extension Ecosystem for Advanced Analytics

Page 2: Using the PostgreSQL Extension Ecosystem for Advanced Analytics

Agenda

- The problem
  - The prevailing view vs. the practical reality
- A possible solution
  - Or just building blocks?
- Nearness
  - Near at hand, near to our skill set, near to our capabilities
- A more complete solution
  - The PostgreSQL extension ecosystem

Page 3: Using the PostgreSQL Extension Ecosystem for Advanced Analytics

The Problem

The Prevailing View vs. The Practical Reality

Page 4: Using the PostgreSQL Extension Ecosystem for Advanced Analytics

The Prevailing View - Logical

Dimension: Schema objects
- Relational: structured rows and columns; schema on write; referential integrity; painful migrations
- Non-relational: unstructured files, docs, etc.; schema on read; no referential integrity; no migrations

Dimension: Query languages
- Relational: SQL; declarative; easy enough for non-tech users
- Non-relational: various; procedural; requires some programming skills

Dimension: Exploratory analysis
- Relational: native support for joins; interactive/low execution overhead
- Non-relational: no native support for joins; OLAP - batch processing

Dimension: Data science and ML
- Relational: only descriptive statistics; requires exporting dumps/samples
- Non-relational: robust ecosystem; does not require exports

Page 5: Using the PostgreSQL Extension Ecosystem for Advanced Analytics

The Prevailing View - Physical

Dimension: Parallel query processing
- Relational: single node system; single process per query
- Non-relational: multiple node system; multiple processes per query

Dimension: Concurrency
- Relational: high concurrency; single process per connection
- Non-relational: OLAP - low concurrency/high scheduling overhead

Dimension: High availability & replication
- Relational: async and sync replication; HA may not be native
- Non-relational: async and sync replication; HA likely to be native

Dimension: Sharding
- Relational: sharding may not be native; difficult to manage
- Non-relational: sharding likely to be native; easy to manage

Page 6: Using the PostgreSQL Extension Ecosystem for Advanced Analytics

The Prevailing View - Summary

- RDBMSs have nice properties for producing rich data
  - ACID, relational integrity, constraints, strong data types
  - Easier for non-tech users and exploratory analysis
- Probably don't meet the needs of today's analysts
  - Data science & machine learning
  - Parallel processing
- Definitely don't meet the needs of today's apps
  - Schema migrations
  - Replication and sharding

Page 7: Using the PostgreSQL Extension Ecosystem for Advanced Analytics

The Practical Reality

Page 8: Using the PostgreSQL Extension Ecosystem for Advanced Analytics

The Practical Reality

But we still want more advanced functionality.

Page 9: Using the PostgreSQL Extension Ecosystem for Advanced Analytics

A Possible Solution

Or Just Building Blocks?

Page 10: Using the PostgreSQL Extension Ecosystem for Advanced Analytics

Modern SQL

- Many people still think of SQL in terms of SQL-92
- Since then we've had: SQL:1999, SQL:2003, SQL:2006, SQL:2008, SQL:2011
  - http://use-the-index-luke.com/blog/2015-02/modern-sql
- Common Table Expressions (CTEs) / Recursive CTEs
- Window Functions
- Ordered-set Aggregates
- Lateral joins
- Temporal support
- The list goes on...
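
As an illustration of the recursive CTE item in the Modern SQL list, here is a minimal sketch. It uses Python's built-in sqlite3 module only so that it is self-contained; SQLite is a stand-in here, not PostgreSQL, although the WITH RECURSIVE form shown is standard SQL. The staff table, its columns, and the reporting_chain helper are hypothetical names chosen for the example.

```python
import sqlite3


def reporting_chain(rows, root):
    """Return (name, depth) for `root` and everyone below it, via a recursive CTE."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE staff (name TEXT PRIMARY KEY, manager TEXT)")
    conn.executemany("INSERT INTO staff VALUES (?, ?)", rows)
    query = """
        WITH RECURSIVE chain(name, depth) AS (
            -- anchor member: the starting person
            SELECT name, 0 FROM staff WHERE name = ?
            UNION ALL
            -- recursive member: direct reports of anyone already in the chain
            SELECT s.name, c.depth + 1
            FROM staff AS s
            JOIN chain AS c ON s.manager = c.name
        )
        SELECT name, depth FROM chain ORDER BY depth, name
    """
    result = conn.execute(query, (root,)).fetchall()
    conn.close()
    return result


# Hypothetical sample data: (employee, manager)
STAFF = [("ada", None), ("bob", "ada"), ("cy", "ada"), ("dee", "bob")]

print(reporting_chain(STAFF, "ada"))
# [('ada', 0), ('bob', 1), ('cy', 1), ('dee', 2)]
```

Against PostgreSQL the same query text should work through a driver, with the driver's own placeholder style in place of `?`.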

Page 11: Using the PostgreSQL Extension Ecosystem for Advanced Analytics

Procedural Languages

- Native: pgSQL, Tcl, Perl, Python
- Community: Java, PHP, R, Javascript, Ruby, Scheme, sh

Page 12: Using the PostgreSQL Extension Ecosystem for Advanced Analytics

Building Blocks

These solve some problems. For others, they are just building blocks.

Page 13: Using the PostgreSQL Extension Ecosystem for Advanced Analytics

Nearness

Near at Hand
Near to Our Skill Set
Near to Our Capabilities

Page 14: Using the PostgreSQL Extension Ecosystem for Advanced Analytics

Nearness

- http://www.infoq.com/presentations/Simple-Made-Easy

Page 15: Using the PostgreSQL Extension Ecosystem for Advanced Analytics

Nearness Drives Adoption

- Near at hand
  - Easily installable
- Near to our skill set
  - Familiar tool/language/abstraction
  - Modular and composable
- Near to our capabilities
  - Capable of solving a problem in our domain

Page 16: Using the PostgreSQL Extension Ecosystem for Advanced Analytics

A More Complete Solution

The PostgreSQL Extension Ecosystem

Page 17: Using the PostgreSQL Extension Ecosystem for Advanced Analytics

Postgres Extension Ecosystem Examples

- PostgreSQL Extension Network: http://pgxn.org/
- UDFs & operators: https://github.com/eulerto/pg_similarity
- UDAs & data types: https://github.com/aggregateknowledge/postgresql-hll
- Foreign Data Wrappers: http://multicorn.org/, https://github.com/shish/pgosquery
- Indexes: https://github.com/zombodb/zombodb
- Composing Extension Methods: http://doc.madlib.net/
- MPP: https://www.citusdata.com/, https://github.com/greenplum-db/gpdb
- Composing Extensions
  - Custom Background Workers: https://github.com/no0p/alps
  - Record linking: http://no0p.github.io/2015/10/20/record_linking.html#/

Page 19: Using the PostgreSQL Extension Ecosystem for Advanced Analytics

The PostgreSQL Extension Network

- Package Manager: pgxn
- Index/Network: http://pgxn.org/
- Comparable to: PyPI, RubyGems, CPAN, CRAN

Page 20: Using the PostgreSQL Extension Ecosystem for Advanced Analytics

The PostgreSQL Extension Network

- Near at hand (with pgxn)
  - pgxn search semver
  - pgxn info semver
  - pgxn install semver
  - pgxn load -d somedb semver
  - pgxn unload -d somedb semver
  - pgxn uninstall semver
- Without pgxn
  - Search github? google? mailing list?
  - Github README?
  - git clone; make; make install;
  - psql -c "CREATE EXTENSION IF NOT EXISTS"
  - psql -c "DROP EXTENSION IF EXISTS"
  - make uninstall?

Page 22: Using the PostgreSQL Extension Ecosystem for Advanced Analytics

UDFs & Operators: pg_similarity

- Near to our capabilities
- Similarity coefficient algorithms
  - L1 Distance
  - Cosine Distance
  - Dice Coefficient
  - Euclidean Distance
  - Hamming Distance
  - Jaccard Coefficient
  - Jaro Distance
  - Jaro-Winkler Distance
  - Levenshtein Distance
  - Matching Coefficient
  - Monge-Elkan Coefficient
  - Needleman-Wunsch Coefficient
  - Overlap Coefficient
  - Q-Gram Distance
  - Smith-Waterman Coefficient
  - Smith-Waterman-Gotoh Coefficient
  - Soundex Distance
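
To make two of the listed measures concrete, here is a minimal pure-Python sketch of Levenshtein distance and the Jaccard coefficient. It is not pg_similarity's implementation, only an illustration of what these measures compute; the whitespace tokenization used for Jaccard is an assumption made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: fewest insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def jaccard(a: str, b: str) -> float:
    """Jaccard coefficient over lowercased, whitespace-separated tokens (assumed tokenizer)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty inputs are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


print(levenshtein("kitten", "sitting"))      # 3
print(jaccard("new york city", "new york"))  # 2 shared tokens out of 3
```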

Page 23: Using the PostgreSQL Extension Ecosystem for Advanced Analytics

UDFs & Operators: pg_similarity

- Near to our skill set

Page 24: Using the PostgreSQL Extension Ecosystem for Advanced Analytics

UDFs & Operators: pg_similarity

- Implementation

Page 26: Using the PostgreSQL Extension Ecosystem for Advanced Analytics

UDAs & Data Types: postgresql-hll

- Near to our capabilities & near to our skill set
  - Data type
  - Estimates count distinct with tunable precision
  - A 1280-byte value can estimate tens of billions of distinct values with only a few percent error
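
The idea behind this kind of estimator can be sketched in a few lines of Python. This is a minimal HyperLogLog illustration, not the postgresql-hll implementation: the class name, the SHA-1 hash, and the unpacked list of registers are assumptions made for the example. With 2^11 registers of 5 bits each, a packed representation would occupy 2048 x 5 / 8 = 1280 bytes, which appears to be where the figure on the slide comes from.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch; registers are plain ints, not packed into 5 bits."""

    def __init__(self, log2m: int = 11):
        if log2m < 7:
            raise ValueError("the bias constant used here assumes at least 128 registers")
        self.p = log2m
        self.m = 1 << log2m
        self.registers = [0] * self.m

    def add(self, value) -> None:
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")          # 64-bit hash (assumed hash choice)
        width = 64 - self.p
        index = h >> width                             # top p bits select a register
        rest = h & ((1 << width) - 1)                  # remaining bits
        rank = width - rest.bit_length() + 1           # position of the leftmost 1-bit
        self.registers[index] = max(self.registers[index], rank)

    def cardinality(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)          # bias correction for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)   # linear counting for small sets
        return raw

    def union(self, other: "HyperLogLog") -> "HyperLogLog":
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged


sketch = HyperLogLog()
for user_id in range(100_000):
    sketch.add(user_id)
    sketch.add(user_id)  # duplicates do not change the estimate
print(round(sketch.cardinality()))  # an estimate near 100000, not an exact count
```

Because a union is just an element-wise maximum of registers, sketches stored per day or per segment can be combined later without rescanning the raw rows, which is what makes the data type useful as an aggregate.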

Page 27: Using the PostgreSQL Extension Ecosystem for Advanced Analytics

UDAs & Data Types: postgresql-hll

Page 28: Using the PostgreSQL Extension Ecosystem for Advanced Analytics

UDAs & Data Types: postgresql-hll

- Implementation

Page 30: Using the PostgreSQL Extension Ecosystem for Advanced Analytics

Foreign Data Wrappers: API

Page 31: Using the PostgreSQL Extension Ecosystem for Advanced Analytics

Foreign Data Wrappers: multicorn

- Near to our skill set
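
Multicorn is "near to our skill set" because a foreign data wrapper becomes a small Python class. The following is a rough sketch of that shape, assuming Multicorn's documented ForeignDataWrapper base class with its `__init__(options, columns)` and `execute(quals, columns)` methods. The SquaresFdw class, its `limit` option, and the column names are hypothetical, and the fallback base class exists only so the sketch runs outside PostgreSQL.

```python
try:
    # Available when the code runs inside a Multicorn-enabled PostgreSQL.
    from multicorn import ForeignDataWrapper
except ImportError:
    class ForeignDataWrapper:
        """Stand-in base class so this sketch also runs as a plain script."""

        def __init__(self, options, columns):
            pass


class SquaresFdw(ForeignDataWrapper):
    """Hypothetical wrapper exposing the rows (n, n*n) as a foreign table."""

    def __init__(self, options, columns):
        super().__init__(options, columns)
        # 'limit' is an illustrative table option; options arrive as strings.
        self.limit = int(options.get("limit", 5))

    def execute(self, quals, columns):
        # Called once per scan; every yielded dict is one row of the table.
        # `quals` (the WHERE conditions) are ignored in this sketch, which
        # leaves the filtering to PostgreSQL.
        for n in range(1, self.limit + 1):
            row = {"n": n, "square": n * n}
            yield {name: row.get(name) for name in columns}


fdw = SquaresFdw({"limit": "3"}, {"n": None, "square": None})
print(list(fdw.execute([], ["n", "square"])))
# [{'n': 1, 'square': 1}, {'n': 2, 'square': 4}, {'n': 3, 'square': 9}]
```

In a real deployment the class would be referenced by its module path in the `wrapper` option of a CREATE SERVER statement, and a CREATE FOREIGN TABLE statement would declare the columns.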

Page 32: Using the PostgreSQL Extension Ecosystem for Advanced Analytics

Foreign Data Wrappers: pgosquery

- Near at hand

Page 34: Using the PostgreSQL Extension Ecosystem for Advanced Analytics

Indexes: ZomboDB

- Index Access Method API
  - http://www.postgresql.org/docs/9.4/static/indexam.html

Page 35: Using the PostgreSQL Extension Ecosystem for Advanced Analytics

Postgres Extension Ecosystem Examples

- PostgreSQL Extension Network: http://pgxn.org/
- UDFs & Operators: https://github.com/eulerto/pg_similarity
- UDAs & Data Types: https://github.com/aggregateknowledge/postgresql-hll
- Foreign Data Wrappers: http://multicorn.org/, https://github.com/shish/pgosquery
- Indexes (GiST, GIN): https://github.com/zombodb/zombodb
- Composing Extension Methods: http://doc.madlib.net/
- MPP: https://www.citusdata.com/, https://github.com/greenplum-db/gpdb
- Composing Extensions
  - Custom Background Workers: https://github.com/no0p/alps
  - Record linking: http://no0p.github.io/2015/10/20/record_linking.html#/

Page 36: Using the PostgreSQL Extension Ecosystem for Advanced Analytics

Composing Extension Methods: MADlib

- Near to our capabilities

Page 37: Using the PostgreSQL Extension Ecosystem for Advanced Analytics

Composing Extension Methods: MADlib

- Near to our skill set

Page 38: Using the PostgreSQL Extension Ecosystem for Advanced Analytics

Composing Extension Methods: MADlib

Page 40: Using the PostgreSQL Extension Ecosystem for Advanced Analytics

Parallel Processing

- Parallel sequential scan
  - http://rhaas.blogspot.com/2015/11/parallel-sequential-scan-is-committed.html
- Columnar FDW
  - https://github.com/citusdata/cstore_fdw

Page 42: Using the PostgreSQL Extension Ecosystem for Advanced Analytics

Composing Extensions: Alps

Page 43: Using the PostgreSQL Extension Ecosystem for Advanced Analytics

Composing Extensions: Record Linking

Page 44: Using the PostgreSQL Extension Ecosystem for Advanced Analytics

Beyond Analytics

- Web app framework
  - http://blog.aquameta.com/
- REST API
  - https://github.com/begriffs/postgrest
- Unit testing framework
  - http://pgtap.org/
- Firewall
  - https://github.com/uptimejp/sql_firewall
- More every week!

Page 45: Using the PostgreSQL Extension Ecosystem for Advanced Analytics

Conclusion

- With PostgreSQL, you get
  - more than rows and columns
  - more than SELECT, FROM, WHERE, GROUP BY, ORDER BY
  - more than a single machine
- Make sure you get the full return on your investment!

Get your Chartio free trial!

[email protected]

(855) 232-0320