Big Data & PostgreSQL
Using TABLESAMPLE to Analyze Very Large Datasets
By Umair Shahid
© 2ndQuadrant 2016
files.meetup.com/19722453/3 - TABLESAMPLE.pdf · 2016-08-29

Who am I?

● Director, Products @ 2ndQuadrant
● Got “pushed” into PostgreSQL in 2004, ended up falling in love with it
● Not a hardcore techie, yet passionate about open source software
● Interested in Big Data, especially the newer PostgreSQL features supporting it

What is Big Data?

● Volume
  ○ Size: Text files to HD videos
  ○ Sources: Spreadsheets to sensors
  ○ From lakes to oceans
● Velocity
  ○ More sources imply more speed
  ○ Faster connectivity implies more speed
  ○ High-paced world requires faster turnaround
● Variety

What is the problem?

Number of Rows   Size on Disk (MB)   Time Taken (ms)
1k               0.23                219.706
100k             24                  1,302.135
1M               195                 7,696.386
5M               951                 40,691.603
10M              1,923               60,012.457
100M             19,456              801,493.319

Why is this significant?

● Data mining has typically been a painful process
● Major contributor to the pain has been the time it takes for queries to return
● Many false steps before the required data is identified
● Waiting time is wasted time
● Sampling, count based or time based, reduces the wasted time significantly

What is TABLESAMPLE?

● Ability to read a random sample of data in a table

● Defined in SQL:2003 (5th revision of SQL)

● Implemented in PostgreSQL 9.5

Syntax

SELECT select_expression
FROM table_name
TABLESAMPLE sampling_method ( argument [, ...] )
[ REPEATABLE ( seed ) ]
...
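
As a minimal sketch of the syntax above: the table name sample_demo and its columns id and val are assumptions made up for illustration, not part of the original slides.

```sql
-- Hypothetical demo table (an assumption for illustration): one million rows.
CREATE TABLE IF NOT EXISTS sample_demo AS
SELECT g AS id, random() AS val
FROM generate_series(1, 1000000) AS g;

-- Read roughly 1% of the table with the built-in SYSTEM method.
SELECT *
FROM sample_demo
TABLESAMPLE SYSTEM (1);

-- Same query with a fixed seed (42 is an arbitrary choice).
SELECT *
FROM sample_demo
TABLESAMPLE SYSTEM (1) REPEATABLE (42);
```

Any WHERE clause is applied after the sample is taken, so a filter narrows the sampled rows rather than the set being sampled from.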

sampling_method

● argument is percentage of rows
● SYSTEM
  ○ Block level sampling
  ○ Very fast
  ○ Non-independent rows
● BERNOULLI
  ○ Row level sampling
  ○ Slower than SYSTEM
  ○ Independent rows (uniformly random)
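
The two built-in methods can be compared side by side. This is a rough sketch, assuming a hypothetical table sample_demo with columns id and val (names invented for illustration); exact row counts will differ from run to run.

```sql
-- Hypothetical demo table (an assumption for illustration): one million rows.
CREATE TABLE IF NOT EXISTS sample_demo AS
SELECT g AS id, random() AS val
FROM generate_series(1, 1000000) AS g;

-- SYSTEM: picks whole blocks, so rows stored together are sampled together.
SELECT count(*) AS sampled_rows
FROM sample_demo
TABLESAMPLE SYSTEM (1);

-- BERNOULLI: scans the table and keeps each row independently
-- with a 1% probability.
SELECT count(*) AS sampled_rows
FROM sample_demo
TABLESAMPLE BERNOULLI (1);

-- Scale a 1% sample back up to estimate aggregates over the full table.
SELECT count(*) * 100 AS estimated_rows,
       avg(val)       AS estimated_avg
FROM sample_demo
TABLESAMPLE BERNOULLI (1);
```

Prefixing either query with EXPLAIN ANALYZE is one way to compare their timings on your own data.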

Demo sampling methods

REPEATABLE results

● (Reminder: [ REPEATABLE ( seed ) ])
● Optional argument
● Used if random, yet repeatable results are required
● seed and argument need to be the same to produce repeatable results
● Any changes made to the table in the meantime can result in a different data set
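
A small sketch of repeatable sampling, assuming a hypothetical table sample_demo with columns id and val (names and seed values are invented for illustration):

```sql
-- Hypothetical demo table (an assumption for illustration): one million rows.
CREATE TABLE IF NOT EXISTS sample_demo AS
SELECT g AS id, random() AS val
FROM generate_series(1, 1000000) AS g;

-- Same seed and same percentage: while the table is unchanged,
-- running this query twice selects the same sample.
SELECT count(*) AS sampled_rows, min(id) AS first_id, max(id) AS last_id
FROM sample_demo
TABLESAMPLE BERNOULLI (0.1) REPEATABLE (2016);

-- A different seed (or a different percentage) will typically
-- select a different sample.
SELECT count(*) AS sampled_rows, min(id) AS first_id, max(id) AS last_id
FROM sample_demo
TABLESAMPLE BERNOULLI (0.1) REPEATABLE (2017);
```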

Now it gets interesting …

● TABLESAMPLE allows for additional sampling methods via extensions
● tsm_system_time specifies max number of milliseconds to spend reading a table
● Implements the syntax:

SELECT select_expression
FROM table_name
TABLESAMPLE SYSTEM_TIME (argument)
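
A minimal sketch of time-based sampling. It assumes the tsm_system_time contrib module is installed on the server and uses a hypothetical table sample_demo (name and columns invented for illustration).

```sql
-- The extension ships with PostgreSQL's contrib modules and must be
-- enabled once per database.
CREATE EXTENSION IF NOT EXISTS tsm_system_time;

-- Hypothetical demo table (an assumption for illustration): one million rows.
CREATE TABLE IF NOT EXISTS sample_demo AS
SELECT g AS id, random() AS val
FROM generate_series(1, 1000000) AS g;

-- Spend at most about 100 milliseconds reading the table, then return
-- whatever was sampled in that time.
SELECT count(*) AS sampled_rows, avg(val) AS sample_avg
FROM sample_demo
TABLESAMPLE SYSTEM_TIME (100);
```

The number of rows returned depends on how much of the table can be read within the time limit, so it varies with hardware, caching and load.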

Demo tsm_system_time

Enter Orange ...

● Funded by AXLE (http://axleproject.eu)
● Same project funded TABLESAMPLE
● Available integrated with PostgreSQL in 2UDA (http://2ndquadrant.com/2uda)
● Uses TABLESAMPLE to very quickly create visualizations for data
● Can quickly create predictive models

Demo Orange

You can find a very helpful tutorial at http://2ndquadrant.com/2uda

Other Big Data features in PostgreSQL

● HSTORE
● XML
● JSON & JSONB
● BRIN INDEXES
● Parallel sequential scan
● Parallel aggregates
● FDWs
● Horizontal Scalability
  ○ Check out Postgres-XL http://www.postgres-xl.org/

Features from the latest release

● 9.6 Beta3 announced last night

● Added support to parallel query for TABLESAMPLE

Moving Forward …

● Next meetup: Tentatively August 19, 2016
● Please come forward and share your PostgreSQL stories
● Today’s refreshments are sponsored by 2ndQuadrant - THANK YOU!
  ○ Need more sponsors
  OR
  ○ Need to start charging for these sessions

Special Thanks!!

Umair Shahid
Email: [email protected]
Twitter: @pg_umair

2ndQuadrant is hiring - All geographies!

Thank you for your time!