Vertica’s Design: Basics, Successes, and Failures
Chuck Bear
CIDR 2015 - January 5, 2015
© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
1. Vertica Basics: Storage Format
Design Goals
• SQL (for the ecosystem and knowledge pool)
• Clusters of commodity hardware (for cost)
  - Linux, x86, Ethernet
• Software-only solution (for flexibility)
  - Special-purpose hardware has a poor track record in databases
• Shared-nothing MPP
  - Cheaper, but puts more complexity in the software
• Analytics: run large queries many times faster than a legacy DB, load as fast, but feel free to snarl and growl at small UPDATEs and DELETEs
• Work smart, and work hard
  - Robust algorithms, query optimizer, vectorization, JIT, etc.
Start from how data is stored on disk…
SELECT SUM(volume) FROM trades WHERE symbol = 'HPQ' AND date = '5/13/2011'
SYMBOL  DATE      TIME         PRICE   VOLUME  ETC
…       …         …            …       …       …
HPQ     05/13/11  01:02:02 PM  40.01   100     …
IBM     05/13/11  01:02:03 PM  171.22  10      …
AAPL    05/13/11  01:02:03 PM  338.02  5       …
GOOG    05/13/11  01:02:04 PM  524.03  150     …
HPQ     05/13/11  01:02:05 PM  39.97   40      …
AAPL    05/13/11  01:02:07 PM  338.02  20      …
GOOG    05/13/11  01:02:07 PM  524.02  40      …
…       …         …            …       …       …
Sorted Data
Sort by Symbol, Date, and Time
SYMBOL  DATE      TIME         PRICE   VOLUME  ETC
…       …         …            …       …       …
AAPL    05/13/11  01:02:03 PM  338.02  5       …
AAPL    05/13/11  01:02:07 PM  338.02  20      …
…       …         …            …       …       …
GOOG    05/13/11  01:02:04 PM  524.03  150     …
GOOG    05/13/11  01:02:07 PM  524.02  40      …
…       …         …            …       …       …
HPQ     05/13/11  01:02:02 PM  40.01   100     …
HPQ     05/13/11  01:02:05 PM  39.97   40      …
…       …         …            …       …       …
IBM     05/13/11  01:02:03 PM  171.22  10      …
…       …         …            …       …       …
Column Files
Split into columns
SYMBOL | DATE     | TIME        | PRICE  | VOLUME | ETC
…      | …        | …           | …      | …      | …
AAPL   | 05/13/11 | 01:02:03 PM | 338.02 | 5      | …
AAPL   | 05/13/11 | 01:02:07 PM | 338.02 | 20     | …
…      | …        | …           | …      | …      | …
GOOG   | 05/13/11 | 01:02:04 PM | 524.03 | 150    | …
GOOG   | 05/13/11 | 01:02:07 PM | 524.02 | 40     | …
…      | …        | …           | …      | …      | …
HPQ    | 05/13/11 | 01:02:02 PM | 40.01  | 100    | …
HPQ    | 05/13/11 | 01:02:05 PM | 39.97  | 40     | …
…      | …        | …           | …      | …      | …
IBM    | 05/13/11 | 01:02:03 PM | 171.22 | 10     | …
…      | …        | …           | …      | …      | …
Compression + RLE
SYMBOL (8K distinct)   DATE (250/yr)         VOLUME
…                      …                     …
                       …                     22
GOOG (x18M)            05/13/2011 (x150K)    150
                       …                     40
…                      …                     …
                       …                     99
HPQ (x22M)             05/13/2011 (x220K)    100
                       …                     40
…                      …                     …
                       …                     200
IBM (x19M)             05/13/2011 (x150K)    10
                       …                     18
…                      …                     …
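A minimal sketch of why sorting helps here, assuming nothing about Vertica's actual on-disk format: once the rows are sorted, the SYMBOL and DATE columns collapse into a handful of (value, run length) pairs, and the earlier SUM(volume) query only has to read the VOLUME positions that fall inside the matching runs. The helper names (`rle_encode`, `matching_ranges`, `sum_volume`) are made up for this illustration; the rows are the seven sample rows from the slides.

```python
from itertools import groupby

def rle_encode(column):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(column)]

def matching_ranges(runs, target, lo=0, hi=None):
    """Yield [begin, end) position ranges where an RLE column equals target,
    clipped to the window [lo, hi)."""
    pos = 0
    for value, length in runs:
        begin, end = pos, pos + length
        pos = end
        if hi is not None and begin >= hi:
            break
        if value != target:
            continue
        clipped_begin = max(begin, lo)
        clipped_end = end if hi is None else min(end, hi)
        if clipped_begin < clipped_end:
            yield clipped_begin, clipped_end

# Sample rows from the slides, already sorted by (symbol, date, time).
symbols = ["AAPL", "AAPL", "GOOG", "GOOG", "HPQ", "HPQ", "IBM"]
dates = ["05/13/11"] * 7
volumes = [5, 20, 150, 40, 100, 40, 10]

symbol_runs = rle_encode(symbols)
date_runs = rle_encode(dates)

def sum_volume(symbol, date):
    """SUM(volume) WHERE symbol = ... AND date = ..., reading only matching positions."""
    total = 0
    for s_lo, s_hi in matching_ranges(symbol_runs, symbol):
        for d_lo, d_hi in matching_ranges(date_runs, date, s_lo, s_hi):
            total += sum(volumes[d_lo:d_hi])
    return total

print(symbol_runs)
print(sum_volume("HPQ", "05/13/11"))
```

The predicate columns are never decompressed row by row; the query walks run boundaries and then slices the one column it actually needs.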
Position Index (NOT Row ID)
[Figure: a position index with one entry per compressed data block (CmpData). Each entry lists: Last P, Size, Comp Sz, CRC, Min/Max, Null Count.]
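The slide only names the index fields, so the following is a guess at how such an index might be used, not a description of Vertica's implementation. It assumes that "Last P" is the position of the last value in a block and that Min/Max are the smallest and largest values in the block. `BlockEntry`, `blocks_to_read`, and `block_for_position` are hypothetical names.

```python
import bisect
from dataclasses import dataclass

@dataclass
class BlockEntry:
    last_pos: int    # assumed meaning of "Last P": position of the block's last value
    min_value: int   # smallest value stored in the block
    max_value: int   # largest value stored in the block
    null_count: int  # number of NULLs in the block

def blocks_to_read(index, lo, hi):
    """Return the blocks whose [min, max] range overlaps the predicate range [lo, hi]."""
    return [i for i, entry in enumerate(index)
            if entry.max_value >= lo and entry.min_value <= hi]

def block_for_position(index, position):
    """Binary-search the last_pos fields for the block that holds `position`."""
    last_positions = [entry.last_pos for entry in index]
    i = bisect.bisect_left(last_positions, position)
    if i == len(index):
        raise IndexError(f"position {position} is past the end of the column")
    return i

# Three hypothetical blocks of 100 positions each.
index = [
    BlockEntry(last_pos=99, min_value=1, max_value=50, null_count=0),
    BlockEntry(last_pos=199, min_value=40, max_value=90, null_count=2),
    BlockEntry(last_pos=299, min_value=95, max_value=120, null_count=0),
]

print(blocks_to_read(index, 60, 100))
print(block_for_position(index, 100))
```

Under these assumptions, a range predicate can skip whole compressed blocks without decompressing them, and a position coming from another column resolves to a block with a binary search instead of a row-ID lookup.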
2. Vertica Basics: Updates & Deletes
Q: How do you update this?
SYMBOL (8K distinct)   DATE (250/yr)         VOLUME
…                      …                     …
                       …                     22
GOOG (x18M)            05/13/2011 (x150K)    150
                       …                     40
…                      …                     …
                       …                     99
HPQ (x22M)             05/13/2011 (x220K)    100
                       …                     40
…                      …                     …
                       …                     200
IBM (x19M)             05/13/2011 (x150K)    10
                       …                     18
…                      …                     …
A: You Do Not!
• Multiple sets of sorted files loaded
  - Or keep things in memory for a while
• Update is INSERT + DELETE
• Delete is just a mark: a nice sorted list of positions
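The points above can be sketched as a toy table whose sorted files are never rewritten. This is an illustration under simplifying assumptions (rows are tuples, files are in-memory lists), and the `Table` class and its methods are hypothetical, not Vertica's API.

```python
import bisect

class Table:
    def __init__(self):
        self.containers = []  # immutable sorted row lists, one per load
        self.deleted = []     # per container: sorted positions marked as deleted

    def load(self, rows):
        """Each load creates a new sorted container; existing ones are untouched."""
        self.containers.append(sorted(rows))
        self.deleted.append([])

    def _is_deleted(self, c, pos):
        marks = self.deleted[c]
        i = bisect.bisect_left(marks, pos)
        return i < len(marks) and marks[i] == pos

    def scan(self):
        """Yield every row that has not been marked deleted."""
        for c, rows in enumerate(self.containers):
            for pos, row in enumerate(rows):
                if not self._is_deleted(c, pos):
                    yield row

    def delete(self, predicate):
        """Mark matching rows; the row data itself is not modified."""
        count = 0
        for c, rows in enumerate(self.containers):
            for pos, row in enumerate(rows):
                if predicate(row) and not self._is_deleted(c, pos):
                    bisect.insort(self.deleted[c], pos)
                    count += 1
        return count

    def update(self, predicate, change):
        """UPDATE = DELETE the old versions + INSERT the new ones."""
        replacements = [change(row) for row in self.scan() if predicate(row)]
        self.delete(predicate)
        if replacements:
            self.load(replacements)
        return len(replacements)

t = Table()
t.load([("HPQ", 100), ("AAPL", 5)])
t.load([("HPQ", 40)])
updated = t.update(lambda r: r == ("HPQ", 40), lambda r: (r[0], 45))
print(updated, list(t.scan()), t.deleted)
```

The original container holding ("HPQ", 40) still exists after the update; only its delete list changed, and the new version lives in a third container.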
It’ll Get Dirty…
So you need compaction, or whatever. We call ours the tuple mover.
[Figure: Merge Strata (blue dots represent ROSs). Axis: ROS size (sum of all columns), log scale, running from negligible ROS size (1 MB/column or less) to maximum ROS size (~1/2 disk or less). Labels: Start Epoch, End Epoch; Stratum Height; Full Stratum (ready to merge); Stratum 0, Stratum 1, Stratum 3, Stratum 4, ...up to the number of strata.]
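One way to read the merge-strata figure, sketched under stated assumptions: ROS sizes are bucketed into strata of equal height on a log scale, and a stratum that has collected enough ROSs is merged into a single larger ROS, which then lands in a higher stratum. The constants below (`NEGLIGIBLE_MB`, `MAX_ROS_MB`, `NUM_STRATA`, `STRATUM_CAPACITY`) and the merge policy are invented for the example and are not Vertica's actual parameters.

```python
import math

NEGLIGIBLE_MB = 1        # hypothetical: sizes at or below this fall into stratum 0
MAX_ROS_MB = 10 ** 5     # hypothetical maximum ROS size
NUM_STRATA = 5           # hypothetical number of strata
STRATUM_CAPACITY = 4     # hypothetical: ROS count at which a stratum is "full"

def stratum_of(size_mb):
    """Map a ROS size to a stratum; each stratum has the same height on a log scale."""
    if size_mb <= NEGLIGIBLE_MB:
        return 0
    height = math.log(MAX_ROS_MB / NEGLIGIBLE_MB) / NUM_STRATA
    return min(NUM_STRATA - 1, int(math.log(size_mb / NEGLIGIBLE_MB) / height))

def merge_once(ros_sizes):
    """Merge the lowest full stratum into one ROS; return None if nothing is full."""
    strata = {}
    for size in ros_sizes:
        strata.setdefault(stratum_of(size), []).append(size)
    for level in sorted(strata):
        members = strata[level]
        if len(members) < STRATUM_CAPACITY or sum(members) > MAX_ROS_MB:
            continue
        remaining = list(ros_sizes)
        for size in members:
            remaining.remove(size)
        return remaining + [sum(members)]
    return None

def run_tuple_mover(ros_sizes):
    """Keep merging until no stratum is full; return the final sizes and merge count."""
    merges = 0
    while (merged := merge_once(ros_sizes)) is not None:
        ros_sizes = merged
        merges += 1
    return ros_sizes, merges

print(run_tuple_mover([2, 3, 4, 5, 50]))
```

With log-scale strata, each row is rewritten at most once per stratum it passes through, so the total merge work grows with the number of strata rather than with the number of loads.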
How Do You Judge a Tuple Mover?
Not by glamour, etc.
• Magical: no problems, no backlogs, no errors
• Latency and freshness: how much batching is needed?
• Sustained load rate (consider machine capacity + retention interval)
• Efficiency will be required
3. Vertica Basics: Transactions & Recovery
Transactions
Vertica offers full ACID (just at low TPS)
Queries take a snapshot of the relevant list of files, and need no locks at READ COMMITTED isolation
Tuple Mover (etc.) doesn’t interfere
Loads do not conflict with each other
COMMIT: keep the new files
ROLLBACK: discard them
Table level locks for SERIALIZABLE
All Operations are On-Line
Database is essentially its own undo / redo log
Recovery can be as simple as file copies
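A minimal sketch of the file-list idea, assuming an in-memory catalog and ignoring durability, epochs, and SERIALIZABLE locking. `Catalog` and `Load` are hypothetical names. A query takes an immutable snapshot of the committed file list, a load writes only new files, and COMMIT or ROLLBACK decides whether those files join the list.

```python
class Catalog:
    def __init__(self):
        self.committed = ()  # immutable tuple of file names visible to new queries

    def snapshot(self):
        """A query reads from this tuple for its whole lifetime; no locks needed."""
        return self.committed

    def begin_load(self):
        return Load(self)

class Load:
    def __init__(self, catalog):
        self.catalog = catalog
        self.new_files = []

    def write(self, file_name):
        """Loads only ever create new files, so they cannot conflict with each other."""
        self.new_files.append(file_name)

    def commit(self):
        """COMMIT: keep the new files by publishing a new file list."""
        self.catalog.committed = self.catalog.committed + tuple(self.new_files)
        self.new_files = []

    def rollback(self):
        """ROLLBACK: discard the new files (a real system would also delete them)."""
        self.new_files = []

catalog = Catalog()
load_a, load_b = catalog.begin_load(), catalog.begin_load()
load_a.write("ros_001")
load_b.write("ros_002")
before = catalog.snapshot()   # taken while both loads are still open
load_a.commit()
load_b.rollback()
after = catalog.snapshot()
print(before, after)
```

The snapshot taken before the commit is unaffected by it, which is the READ COMMITTED behavior described above: readers never see a half-finished load.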
[Figure: four-node cluster holding two copies (1 and 2) of segments A to D, shown in three panels: a) All nodes up, b) Node 2 down, c) Recovery]
Node 1: 1A 2B
Node 2: 1B 2C
Node 3: 1C 2D
Node 4: 1D 2A
With Node 2 down, all data is still available, in several combinations:
• 2A, 2B, 1C, 1D (shown)
• 1A, 2B, 1C, 1D
• 2A, 2B, 1C, 2D
• 1A, 2B, 1C, 2D (never chosen)
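The four combinations can be reproduced by enumerating, for each segment, the copies that sit on a live node. The sketch below uses the node layout from the figure; `LAYOUT` and `combinations` are names invented here. The slide does not say why the last combination is never chosen. One plausible reading, and only an assumption, is that it is the only one that leaves a live node (Node 4) with no work.

```python
from itertools import product

# Node layout from the figure: "1A" is copy 1 of segment A, and so on.
LAYOUT = {1: ("1A", "2B"), 2: ("1B", "2C"), 3: ("1C", "2D"), 4: ("1D", "2A")}

def combinations(down):
    """Enumerate every way to read segments A..D using only live nodes."""
    live = {node: copies for node, copies in LAYOUT.items() if node not in down}
    choices = []
    for segment in "ABCD":
        options = [(copy, node)
                   for node, copies in sorted(live.items())
                   for copy in copies if copy[1] == segment]
        if not options:
            raise RuntimeError(f"segment {segment} is unavailable")
        choices.append(options)
    return list(product(*choices))

combos = combinations(down={2})
for combo in combos:
    copies = [copy for copy, _ in combo]
    nodes = {node for _, node in combo}
    print(copies, "uses nodes", sorted(nodes))
```

Taking down two adjacent nodes, for example Nodes 1 and 2, makes `combinations` raise, because both copies of segment B live on those two nodes.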
4. Mistake: Execution Engine Design
Simple Design
• Use iterators
  - open
  - getNext
  - close
• If there’s trouble, use a temp relation
Too Slow!
(You have to vectorize. And do JIT compiling.)
[Figure: merge time (ms, Nehalem) versus number of merge streams, for three implementations: Original, Vectorized Copy, and JIT Compiled]
“Push Model” DAG Executor
You might even get parallelism for free
Problems with DAG Execution
• Free-for-all
  - And that parallelism thing didn’t pan out after all
• Resource usage: could do better
• Diamond problem
• Need to give clues to upstream operations
  - Imagine subqueries?
End Result?
We threw it away, and went back to the “pull” model
• Block iterators
  - open
  - getNextBatch (w/ optimizations to avoid tuple copies)
  - close
  - Also, send information back upstream
• When it gets tricky, use coroutines or other tactics
• We still push data when there are multiple targets
  - Such as loading multiple projections, UPDATEs, etc.
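The block-iterator interface above might look roughly like this. It is a sketch of the general pull-model pattern, not Vertica's operator classes: `Scan`, `Filter`, and `run` are hypothetical, batches are plain Python lists, and the upstream feedback channel and copy-avoidance tricks are left out.

```python
class Scan:
    """Leaf operator: hands out its rows one batch at a time."""
    def __init__(self, rows, batch_size=1024):
        self.rows = rows
        self.batch_size = batch_size

    def open(self):
        self.pos = 0

    def get_next_batch(self):
        batch = self.rows[self.pos:self.pos + self.batch_size]
        self.pos += len(batch)
        return batch  # an empty batch signals end of stream

    def close(self):
        pass

class Filter:
    """Pulls batches from its child and keeps the rows that satisfy the predicate."""
    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def open(self):
        self.child.open()

    def get_next_batch(self):
        while True:
            batch = self.child.get_next_batch()
            if not batch:
                return []
            kept = [row for row in batch if self.predicate(row)]
            if kept:  # never return an empty batch before the end of stream
                return kept

    def close(self):
        self.child.close()

def run(operator):
    """Drive the plan from the top: open, pull batches until empty, close."""
    operator.open()
    out = []
    try:
        while (batch := operator.get_next_batch()):
            out.extend(batch)
    finally:
        operator.close()
    return out

plan = Filter(Scan(list(range(10)), batch_size=4), lambda x: x % 2 == 0)
print(run(plan))
```

Compared with a per-tuple getNext, the per-call overhead is paid once per batch, and the inner loops run over contiguous data, which is what makes vectorization possible.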
5. Evolution of Joins in Vertica: The Good, Bad, and Ugly
Early Materialized Joins
SELECT SUM(sv) FROM fact WHERE fk IN (SELECT pk FROM d)
a) No SIPS, EMJ
  1. Scan all columns
  2. Pre-SUM 'sv' data against 'fk' join key (optional)
  3. Perform join of 'fk' against 'pk' IN list
  4. Final 'sv' SUM
Late Materialized Joins
SELECT SUM(sv) FROM fact WHERE fk IN (SELECT pk FROM d)
a) No SIPS, EMJ
  1. Scan all columns
  2. Pre-SUM 'sv' data against 'fk' join key (optional)
  3. Perform join of 'fk' against 'pk' IN list
  4. Final 'sv' SUM
b) No SIPS, LMJ
  1. Scan 'fk' key column
  2. Perform join of 'fk' against 'pk' IN list
  3. Materialize columns from rows that joined
  4. SUM 'sv' from rows
Sideways Information Passing (SIPS)
[Figure: a JOIN over SCAN fact and SCAN d, with steps numbered 1, 2, 3]
Late Materialized Joins
SELECT SUM(sv) FROM fact WHERE fk IN (SELECT pk FROM d)
c) SIPS, EMJ
  1. Scan 'fk' key column (filter out keys that will not join)
  2. Pre-SUM 'sv' data against 'fk' join key (optional)
  3. Perform join of 'fk' against 'pk' IN list
  4. Final 'sv' SUM
d) SIPS, LMJ
  1. Scan 'fk' key column (filter out keys that will not join)
  2. Perform join of 'fk' against 'pk' IN list
  3. Materialize columns from rows that joined
  4. SUM 'sv' from rows
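To make the difference between the plans concrete, here is a toy version of the query over two in-memory columns. It is a sketch only: the functions `emj_no_sips` and `lmj_with_sips` and the sample data are made up, and in this simplified form the SIPS filter and the join test collapse into the same set lookup. Each function returns the sum together with the number of 'sv' values it had to read.

```python
def emj_no_sips(fact_fk, fact_sv, d_pk):
    """Early materialization: build full rows first, then join and sum."""
    keys = set(d_pk)
    rows = list(zip(fact_fk, fact_sv))       # every 'sv' value is read up front
    total = sum(sv for fk, sv in rows if fk in keys)
    return total, len(rows)

def lmj_with_sips(fact_fk, fact_sv, d_pk):
    """Late materialization with a sideways filter pushed into the 'fk' scan."""
    keys = set(d_pk)                          # built from the dimension side first
    # The scan of 'fk' drops keys that will not join and keeps only positions.
    positions = [pos for pos, fk in enumerate(fact_fk) if fk in keys]
    # Only now fetch 'sv', and only for the rows that joined.
    total = sum(fact_sv[pos] for pos in positions)
    return total, len(positions)

fact_fk = [1, 2, 3, 2, 9]
fact_sv = [10, 20, 30, 40, 50]
d_pk = [2, 3]

print(emj_no_sips(fact_fk, fact_sv, d_pk))
print(lmj_with_sips(fact_fk, fact_sv, d_pk))
```

Both plans return the same sum. The late-materialized plan reads fewer 'sv' values when few rows join; when nearly every row joins, that saving disappears while the extra positional fetches remain.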
Outcome?
Selectivity  Neither Feature  LMJ only  SIPS only  SIPS+LMJ
0.00%        1206             39        23         27
1.00%        1202             63        33         39
2.00%        1200             75        50         57
3.00%        1208             121       75         79
5.00%        1207             151       93         116
10.00%       1200             195       141        191
20.00%       1202             362       405        360
50.00%       1202             1050      1086       1047
100.00%      1204             1720      1222       1724
Robustness to Join Order Errors
[Figure: four join trees over tables A, B, and C, annotated with row counts (listed here in the order they appear in the figure)]
a) Good join order, no SIPS: (A Join B) Join C; row counts 10M, 1M, 10, 100, 10M
b) Bad join order, no SIPS: (A Join C) Join B; row counts 10M, 100M, 100, 10, 10M
c) Good join order, w/ SIPS: (A Join B) Join C; row counts 1M, 1M, 10, 100, 10M
d) Bad join order, w/ SIPS: (A Join C) Join B; row counts 1M, 10M, 100, 10, 10M
6. Mistake: Partitioned Hash Join
Partitioned Hash Join vs. Sort Merge Join
• (There are papers about these)
• PHJ was the first one tried
• SMJ was simpler to implement
• Sometimes one relation is sorted already
• Sometimes, you need to sort for other reasons
• Much more compatible with SIPS
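For reference, a textbook sort-merge join fits in a few lines, which gives some feel for the "simpler to implement" point. This is the generic algorithm, not Vertica's code; `sort_merge_join` and its `key` parameter are illustrative names. When an input is already sorted on the join key, as a projection may be, the corresponding sort step is unnecessary.

```python
def sort_merge_join(left, right, key=lambda row: row[0]):
    """Inner equi-join of two row lists on `key`, by sorting and merging."""
    left = sorted(left, key=key)     # unnecessary if the input is already sorted
    right = sorted(right, key=key)
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = key(left[i]), key(right[j])
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of equal keys on the right side...
            j_end = j
            while j_end < len(right) and key(right[j_end]) == left_key:
                j_end += 1
            # ...and pair it with every left row that carries the same key.
            while i < len(left) and key(left[i]) == left_key:
                for right_row in right[j:j_end]:
                    out.append((left[i], right_row))
                i += 1
            j = j_end
    return out

left = [(2, "a"), (1, "b"), (2, "c")]
right = [(2, "x"), (3, "y"), (2, "z")]
print(sort_merge_join(left, right))
```

Unlike a partitioned hash join, nothing here needs a partitioning pass or a hash table sized to one of the inputs; memory use is bounded by the longest run of duplicate keys once the inputs are sorted.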
Also, There’s Performance
[Figure: performance chart comparing PHJ and SMJ]
7. Good Idea: Data Collection
Big Data Mentality
A database that doesn’t self-collect is hypocrisy at its worst
• How busy is the machine compared to historical trends?
• What have my users been doing?
• How long will this job take to finish?
• What is the most common error?
• When was the last time we made a backup?
• My request’s run-time changed… why?
• Have there been changes from the standard configuration?
• Are there problems that the customer hasn’t called about?
• Which features have been used?
• Where do customer machines burn the most CPU cycles?
Unexpected Questions
Don’t Compromise on the Design
• Data collector can’t kill the system
• Like a log, lots of little appends
• Shouldn’t accidentally monitor itself
• Should be able to analyze off-line
• Result: Separate data management scheme
8. Good Idea: Dynamic Workload Management
Static (Known) Workload Management
Don't want reports to take over the entire system, preventing loads or tactical queries
Keep some resources (e.g. memory) reserved so that high-priority queries can always begin
Apply run-time prioritization to manage CPU and I/O
Unpredictable Workload: Short Query Bias
Dynamic Prioritization
Q: Are optimizer cost model estimates really that bad?
A: Doesn’t matter!
[Figure: cumulative completion (%) versus time (s), comparing Unprioritized and Dynamic Priority]
Thank you
Please come visit our development team in:
Boston (Cambridge and Andover), MA
Pittsburgh, PA
Sunnyvale, CA