Data Blocks: Hybrid OLTP and OLAP on Compressed Storage using both Vectorization and Compilation † Harald Lang 1 , Tobias Mühlbauer 1 , Florian Funke 2,* , Peter Boncz 3,* , Thomas Neumann 1 , Alfons Kemper 1 1 Technical University Munich, 2 Snowflake Computing, 3 Centrum Wiskunde & Informatica † To appear at SIGMOD 2016. * Work done while at Technical University Munich.
Goals
- Primary goal
  - Reducing the memory footprint in hybrid OLTP&OLAP database systems
  - Retaining high query performance and transactional throughput
- Secondary goals / future work
  - Evicting cold data to secondary storage
  - Reducing costly disk I/O
- Out of scope
  - Hot/cold clustering (see previous work of Funke et al.: “Compacting Transactional Data in Hybrid OLTP&OLAP Databases”)
Compression in Hybrid OLTP&OLAP Database Systems
- SAP HANA (existing approach)
  - Compress entire relations
  - Updates are performed in an uncompressed write-optimized partition
  - Implicit hot/cold clustering
  - Merge partitions
- HyPer (our approach)
  - Split relations into fixed-size chunks (e.g., 64 K tuples)
  - Cold chunks are “frozen” into immutable Data Blocks
Data Blocks
- Compressed columnar storage format
- Designed for cold data (mostly read)
- Immutable and self-contained
- Fast scans and fast point-accesses
- Novel index-structure to narrow scan ranges

[Figure: hot uncompressed chunks and cold compressed Data Blocks. OLTP: mostly point accesses through index, some on cold data. OLAP: query pipeline.]
Compression Schemes
- Lightweight compression only
  - Single value, byte-aligned truncation, ordered dictionary
  - Efficient predicate evaluation, decompression and point-accesses
- Optimal compression chosen based on the actual value distribution
  - Improves compression ratio, amortizes light-weight compression schemes and redundancies caused by block-wise compression
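[Figure: uncompressed chunks (chunk0 with A0, B0, C0; chunk1 with A1, B1, C1; ...) of columns A, B, C are frozen into Data Blocks. Each Data Block picks a scheme per column, e.g. truncated (A), dictionary (B) with keys (B), truncated (C) in one block and single value (A), truncated (B), dictionary (C) with keys (C) in another.]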
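A minimal sketch of how a scheme could be chosen per column and per block from the actual value distribution. The slides do not spell out the selection rule, so the size estimates, the function names `truncated_width` and `choose_scheme`, and the assumption that truncation stores `value - min` in 1, 2, 4 or 8 bytes are illustrative only.

```python
from typing import List, Tuple

def truncated_width(lo: int, hi: int) -> int:
    """Smallest byte-aligned width (1, 2, 4 or 8 bytes) that can hold hi - lo."""
    span = hi - lo
    for width in (1, 2, 4, 8):
        if span < 1 << (8 * width):
            return width
    raise ValueError("value range does not fit into 8 bytes")

def choose_scheme(values: List[int], value_bytes: int = 8) -> Tuple[str, int]:
    """Return (scheme name, estimated size in bytes) for one column of one block."""
    lo, hi = min(values), max(values)
    if lo == hi:
        # All tuples share one value: store it once.
        return ("single value", value_bytes)
    n = len(values)
    # Byte-aligned truncation: one code of value - min per tuple (assumption).
    trunc_size = n * truncated_width(lo, hi)
    # Ordered dictionary: sorted distinct values plus one truncated key per tuple.
    distinct = len(set(values))
    key_width = truncated_width(0, distinct - 1)
    dict_size = distinct * value_bytes + n * key_width
    if dict_size < trunc_size:
        return ("ordered dictionary", dict_size)
    return ("truncated", trunc_size)
```

A dense range such as 1000..1199 ends up truncated to one byte per tuple, while a column with few distinct but far-apart values is cheaper as an ordered dictionary.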
Positional SMAs
- Lightweight indexing
- Extension of traditional SMAs (min/max-indexes)
- Narrow scan ranges in a Data Block

[Figure: a SARGable selection σ probes the Positional SMA (PSMA), which yields the range with potential matches inside the compressed Data Block.]

- Supported predicates:
  - column ∘ constant, where ∘ ∈ {=, is, <, ≤, ≥, >}
  - column between a and b
Positional SMAs - Details
- Lookup table where each table entry contains a range with potential matches
- For n-byte values, the table consists of n × 256 entries
- Only the most significant non-zero byte is considered

[Figure: lookup table and data. The data 0x0200AA, 0x0000AE, 0x02FA42, ... sits at positions 0, 1, 2, 3, ... Each value consists of leading zero-bytes, the most significant non-zero byte and tail bytes. The lookup table holds one group of 256 range entries per byte position; the entry shared by 0x0200AA and 0x02FA42 stores the range [0,3). Max # of values sharing an entry, per group: 1, 2^8, 2^16, 2^24.]
[Figure annotation: the preferred range is achieved by using the delta value - SMA min.]
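The lookup table described above can be sketched as follows. This is an illustration under assumptions, not HyPer's implementation: the names `psma_slot`, `build_psma` and `probe_eq` are hypothetical, ranges are half-open position ranges, and `(0, 0)` stands for an empty range.

```python
from typing import List, Tuple

def psma_slot(delta: int) -> int:
    """Slot of a non-negative delta: its most significant non-zero byte,
    offset by 256 for every tail byte that follows it."""
    tail_bytes = 0
    while delta >= 256:
        delta >>= 8
        tail_bytes += 1
    return delta + 256 * tail_bytes

def build_psma(values: List[int], value_bytes: int = 4):
    """Return (SMA min, lookup table with value_bytes * 256 range entries)."""
    lo = min(values)
    table: List[Tuple[int, int]] = [(0, 0)] * (value_bytes * 256)
    for pos, value in enumerate(values):
        slot = psma_slot(value - lo)          # index by delta = value - SMA min
        begin, end = table[slot]
        if begin == end:                      # first value that maps to this slot
            table[slot] = (pos, pos + 1)
        else:                                 # widen the range to include pos
            table[slot] = (begin, pos + 1)
    return lo, table

def probe_eq(psma, constant: int) -> Tuple[int, int]:
    """Range of positions that may contain `constant` (may hold false positives)."""
    lo, table = psma
    if constant < lo:
        return (0, 0)
    slot = psma_slot(constant - lo)
    return table[slot] if slot < len(table) else (0, 0)
```

Because several values can share an entry, the returned range only narrows the scan; the predicate still has to be evaluated on the tuples inside the range.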
Positional SMAs - Example
[Figure: example lookup table for a Data Block with SMA min: 2 and SMA max: 999. Range entries shown include [0,6), [1,17), [16,19), [6,7) and the empty range [0,0).]

- probe 7: delta = 5 (7 - min); bytes of delta: 00 00 00 05; leading non-zero byte: 5
- probe 998: delta = 996 (998 - min); bytes of delta: 00 00 03 E4
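The two probes can be worked through byte by byte. The sketch below assumes 4-byte values and the indexing rule from the details slide (most significant non-zero byte, plus 256 for each tail byte); `slot_from_bytes` is a hypothetical helper name.

```python
def slot_from_bytes(value: int, sma_min: int, width: int = 4) -> int:
    """Lookup-table index for `value`, derived from the bytes of its delta."""
    delta_bytes = (value - sma_min).to_bytes(width, "big")
    for i, byte in enumerate(delta_bytes):
        if byte != 0:
            tail_bytes = width - 1 - i        # bytes after the leading non-zero byte
            return byte + 256 * tail_bytes
    return 0                                  # delta is zero

# probe 7:   delta = 5,   bytes 00 00 00 05, no tail byte  -> index 5
# probe 998: delta = 996, bytes 00 00 03 E4, one tail byte -> index 3 + 256
slot_7 = slot_from_bytes(7, sma_min=2)
slot_998 = slot_from_bytes(998, sma_min=2)
```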
Challenge for JIT-compiling Query Engines
- HyPer compiles queries just-in-time (JIT) using the LLVM compiler framework
  - Generated code is data-centric and processes one tuple at a time
- Data Blocks individually determine the most suitable compression scheme for each column on a per-block basis
- The variety of physical representations either results in
  - multiple code paths => exploding compile-time
  - or interpretation overhead => performance drop at runtime
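To get a feeling for the "multiple code paths" problem, the following sketch counts the layout combinations a fully specialized scan would have to cover. The list of four layouts is an assumption for illustration (the three schemes named on the slides plus uncompressed hot chunks); distinguishing truncation widths would raise the count further.

```python
from itertools import product

# Assumed set of physical layouts a column can have in a chunk or Data Block.
LAYOUTS = ["uncompressed", "single value", "truncated", "ordered dictionary"]

def layout_combinations(num_columns: int):
    """Enumerate every per-column layout combination for a scan."""
    return list(product(LAYOUTS, repeat=num_columns))

def count_code_paths(num_layouts: int, num_columns: int) -> int:
    """Specialized code paths needed: one per layout combination."""
    return num_layouts ** num_columns
```

With these assumptions, a scan over 8 columns already has 65,536 combinations, which is why generating one compiled path per combination does not scale.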
Vectorization to the Rescue
- Vectorization greatly reduces the interpretation overhead
- Specialized vectorized scan functions for each compression scheme
- Vectorized scan extracts matching tuples to temporary storage where tuples are consumed by tuple-at-a-time JIT code
[Figure: two scan paths over columns A, B, C, D that fill vectors and push matches tuple-at-a-time. Interpreted vectorized scan on Data Block: vectorized evaluation of SARGable predicates on compressed data and unpacking of matches. Interpreted vectorized scan on uncompressed chunk: vectorized evaluation of SARGable predicates and copying of matches.]
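The division of labor between the interpreted vectorized scan and the tuple-at-a-time consumer can be sketched as below. Everything here is a simplified assumption: columns are plain Python dicts, only the uncompressed and truncated layouts are covered, `consume` stands in for the JIT-compiled pipeline, and the vector size is tiny on purpose.

```python
from typing import Callable, Dict, List, Tuple

VECTOR_SIZE = 4  # deliberately small; real vectors are assumed to be much larger

def scan_uncompressed(col, lo, hi, begin, end) -> List[int]:
    data = col["data"]
    return [p for p in range(begin, end) if lo <= data[p] <= hi]

def scan_truncated(col, lo, hi, begin, end) -> List[int]:
    # Translate the constants into the compressed domain once,
    # then compare the codes without decompressing them.
    clo, chi = lo - col["min"], hi - col["min"]
    codes = col["codes"]
    return [p for p in range(begin, end) if clo <= codes[p] <= chi]

# One specialized scan function per storage layout.
SCAN = {"uncompressed": scan_uncompressed, "truncated": scan_truncated}

def unpack(col, pos: int) -> int:
    """Point access that restores the original value at `pos`."""
    if col["scheme"] == "truncated":
        return col["codes"][pos] + col["min"]
    return col["data"][pos]

def vectorized_scan(block: Dict[str, dict], pred_col: str, lo: int, hi: int,
                    out_cols: List[str], consume: Callable[[Tuple], None]) -> None:
    col = block[pred_col]
    n = len(col["codes"] if col["scheme"] == "truncated" else col["data"])
    for begin in range(0, n, VECTOR_SIZE):
        end = min(begin + VECTOR_SIZE, n)
        matches = SCAN[col["scheme"]](col, lo, hi, begin, end)
        # Unpack only the matching tuples into temporary vectors.
        vectors = {c: [unpack(block[c], p) for p in matches] for c in out_cols}
        # Push the matches tuple-at-a-time into the consumer.
        for i in range(len(matches)):
            consume(tuple(vectors[c][i] for c in out_cols))
```

The layout-specific work is interpreted once per vector in `SCAN[...]`, so the consumer sees the same uncompressed tuples regardless of how a block is stored.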
Speedup of TPC-H Q6 (scale factor 100) on block-wise sorted data (+SORT; sorted by l_shipdate).

[Figure: bar chart of speedup over JIT (y-axis 0 to 14) for JIT, VEC, Data Blocks (+PSMA), +SORT (-PSMA) and +PSMA, with the gain by PSMA marked.]
OLTP Performance - Point Access
Throughput (in lookups per second) of random point access queries
select * from customer where c_custkey = randomCustKey()
on TPC-H scale factor 100 with a primary key index on c_custkey.

                Throughput [lookups/sec]
Uncompressed    545,554
Data Blocks     294,291 (0.54×)
OLTP Performance - TPC-C
TPC-C transaction throughput (5 warehouses), old neworder records compressed into Data Blocks:

                Throughput [Tx/sec]
Uncompressed    89,229
Data Blocks     88,699 (0.99×)

Only read-only TPC-C transactions order status and stock level; all relations frozen into Data Blocks:

                Throughput [Tx/sec]
Uncompressed    119,889
Data Blocks     109,649 (0.91×)
Performance of SIMD Predicate Evaluation
Speedup of SIMD predicate evaluation of type l ≤ A ≤ r with selectivity 20%:

[Figure: bar chart of speedup over sequential x86 code (y-axis 0 to 5) by data type bit-width (8-bit, 16-bit, 32-bit, 64-bit) for x86, SSE and AVX2.]
Performance of SIMD Predicate Evaluation (cont’d)
Costs of applying an additional restriction with varying selectivities of the first predicate and the selectivity of the second predicate set to 40%:

[Figure: four panels (8-bit, 16-bit, 32-bit, 64-bit) showing cycles per element (y-axis 0 to 2) over the selectivity of the first predicate [%] (x-axis 0 to 100) for x86 and AVX2.]
Advantages of Byte-Addressability - Predicate Evaluation
Cost of evaluating a SARGable predicate of type l ≤ A ≤ r with varying selectivities:
- Intentionally, the domain exceeds the 2-byte truncation by one bit
- 17-bit codes with bit-packing, 32-bit codes with Data Blocks
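The trade-off between the two layouts can be illustrated with a small sketch: 17-bit codes packed back to back need offset arithmetic, a shift and a mask for every positional access, while byte-aligned 32-bit codes are read directly. The helper names and the little-endian layout are assumptions made for this example.

```python
from typing import List

def pack_bits(codes: List[int], bits: int) -> bytes:
    """Pack codes back to back using `bits` bits each (little-endian bit order)."""
    buf = 0
    for i, code in enumerate(codes):
        buf |= code << (i * bits)
    return buf.to_bytes((len(codes) * bits + 7) // 8, "little")

def get_bit_packed(data: bytes, i: int, bits: int) -> int:
    """Positional access: locate the bytes covering code i, then shift and mask."""
    byte_off, shift = divmod(i * bits, 8)
    span = (shift + bits + 7) // 8            # number of bytes the code touches
    window = int.from_bytes(data[byte_off:byte_off + span], "little")
    return (window >> shift) & ((1 << bits) - 1)

def get_byte_aligned(data: bytes, i: int, width: int) -> int:
    """Positional access on byte-aligned codes: a plain fixed-width read."""
    return int.from_bytes(data[i * width:(i + 1) * width], "little")

codes = [0, 65536, 70000, 131071]             # values that need 17 bits
packed = pack_bits(codes, 17)
aligned = b"".join(c.to_bytes(4, "little") for c in codes)
```

For these four codes the bit-packed buffer takes 9 bytes and the byte-aligned one 16 bytes, so bit-packing is smaller, but each access pays for the extra offset, shift and mask work.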
Advantages of Byte-Addressability - Unpacking matching tuples
Cost of unpacking matching tuples:

[Figure: cycles per matching tuple (log scale, 10 to 100) over selectivity [%] (1 to 100) for Data Blocks, bit-packed (positional access) and bit-packed (unpack all and filter).]

- 3 attributes, dom(A) = dom(B) = [0, 2^16] and dom(C) = [0, 2^8]
- Intentionally, the domains exceed 1-byte and 2-byte truncation by one bit
- The compression ratio of bit-packing is almost two times higher in this