DATABASE SYSTEM IMPLEMENTATION
GT 4420/6422 // SPRING 2019 // @JOY_ARULRAJ
LECTURE #23: VECTORIZED EXECUTION
ANATOMY OF A DATABASE SYSTEM
Process Manager
→ Connection Manager + Admission Control

Query Processor
→ Query Parser
→ Query Optimizer
→ Query Executor

Transactional Storage Manager
→ Lock Manager (Concurrency Control)
→ Access Methods (or Indexes)
→ Buffer Pool Manager
→ Log Manager

Shared Utilities
→ Memory Manager + Disk Manager
→ Networking Manager
Source: Anatomy of a Database System
http://cs.brown.edu/courses/cs295-11/2006/anatomyofadatabase.pdf
TODAY’S AGENDA
→ Background
→ Hardware
→ Vectorized Algorithms (Columbia)
VECTORIZATION
The process of converting an algorithm's scalar implementation, which processes a single pair of operands at a time, into a vector implementation that performs the same operation on multiple pairs of operands at once.
WHY THIS MATTERS
Say we can parallelize our algorithm over 32 cores. Each core has 4-wide SIMD registers.
Potential Speed-up: 32x × 4x = 128x
MULTI-CORE CPUS
Use a small number of high-powered cores.
→ Intel Xeon Skylake / Kaby Lake
→ High power consumption and area per core.

Massively superscalar and aggressive out-of-order execution
→ Instructions are issued from a sequential stream.
→ Check for dependencies between instructions.
→ Process multiple instructions per clock cycle.
MANY INTEGRATED CORES (MIC)
Use a larger number of low-powered cores.
→ Intel Xeon Phi
→ Low power consumption and area per core.
→ Expanded SIMD instructions with larger register sizes.

Knights Ferry (Columbia Paper)
→ Non-superscalar and in-order execution
→ Cores = Intel P54C (aka Pentium from the 1990s)

Knights Landing (Since 2016)
→ Superscalar and out-of-order execution
→ Cores = Silvermont (aka Atom)

http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-detail.html
https://en.wikipedia.org/wiki/P5_(microarchitecture)
https://en.wikipedia.org/wiki/Xeon_Phi
SINGLE INSTRUCTION, MULTIPLE DATA
A class of CPU instructions that allow the processor to perform the same operation on multiple data points simultaneously.
All major ISAs have microarchitectural support for SIMD operations.
→ x86: MMX, SSE, SSE2, SSE3, SSE4, AVX, AVX2, AVX512
→ PowerPC: Altivec
→ ARM: NEON
SIMD EXAMPLE
X + Y = Z
→ X = [x1, x2, …, xn]
→ Y = [y1, y2, …, yn]
→ Z = [x1+y1, x2+y2, …, xn+yn]
for (i = 0; i < n; i++) {
  Z[i] = X[i] + Y[i];
}

SISD (scalar): each loop iteration issues one add instruction and produces a single element of Z.
SIMD: one vector add instruction processes four (or more) pairs of elements of X and Y at once.
STREAMING SIMD EXTENSIONS (SSE)
SSE is a collection of SIMD instructions that target special 128-bit SIMD registers. These registers can be packed with four 32-bit scalars, after which an operation can be performed on each of the four elements simultaneously.

First introduced by Intel in 1999.
SIMD INSTRUCTIONS (1)

Data Movement
→ Moving data in and out of vector registers

Arithmetic Operations
→ Apply operation on multiple data items (e.g., 2 doubles, 4 floats, 16 bytes)
→ Example: ADD, SUB, MUL, DIV, SQRT, MAX, MIN

Logical Instructions
→ Logical operations on multiple data items
→ Example: AND, OR, XOR, ANDN, ANDPS, ANDNPS
SIMD INSTRUCTIONS (2)
Comparison Instructions
→ Comparing multiple data items (==, <, <=, >, >=, !=)

Shuffle Instructions
→ Move data between SIMD registers

Miscellaneous
→ Conversion: Transform data between x86 and SIMD registers.
→ Cache Control: Move data directly from SIMD registers to memory (bypassing the CPU cache).
INTEL SIMD EXTENSIONS
1997  MMX      64 bits
1999  SSE      128 bits  (×4 single-precision)
2001  SSE2     128 bits  (×2 double-precision)
2004  SSE3     128 bits
2006  SSSE3    128 bits
2006  SSE4.1   128 bits
2008  SSE4.2   128 bits
2011  AVX      256 bits  (×8 single, ×4 double)
2013  AVX2     256 bits
2017  AVX-512  512 bits  (×16 single, ×8 double)

Source: James Reinders
https://www.youtube.com/watch?v=_OJmxi4-twY
WHY NOT GPUS?
Moving data back and forth between DRAM and the GPU is slow over the PCI-E bus.

There are some newer GPU-enabled DBMSs
→ Examples: MapD, SQream, Kinetica

Emerging co-processors that can share the CPU's memory may change this.
→ Examples: AMD's APU, Intel's Knights Landing

https://www.mapd.com/
http://sqream.com/
https://www.kinetica.com/
VECTORIZATION

Choice #1: Automatic Vectorization
Choice #2: Compiler Hints
Choice #3: Explicit Vectorization

Moving from #1 to #3 trades ease of use for greater programmer control.

Source: James Reinders
https://www.youtube.com/watch?v=_OJmxi4-twY
AUTOMATIC VECTORIZATION
The compiler can identify when instructions inside of a loop can be rewritten as a vectorized operation.

Works for simple loops only and is rare in database operators. Requires hardware support for SIMD instructions.
AUTOMATIC VECTORIZATION

void add(int *X, int *Y, int *Z) {
  for (int i = 0; i < MAX; i++) {
    Z[i] = X[i] + Y[i];
  }
}

This loop is not legal to automatically vectorize.

The code is written such that the addition is described as being done sequentially, and the pointers X, Y, and Z might point to the same address!
COMPILER HINTS
Provide the compiler with additional information about the code to let it know that it is safe to vectorize.

Two approaches:
→ Give explicit information about memory locations.
→ Tell the compiler to ignore vector dependencies.
COMPILER HINTS
The restrict keyword (standard in C99, available as __restrict in most C++ compilers) tells the compiler that the arrays are distinct locations in memory.

void add(int *restrict X,
         int *restrict Y,
         int *restrict Z) {
  for (int i = 0; i < MAX; i++) {
    Z[i] = X[i] + Y[i];
  }
}
COMPILER HINTS
This pragma tells the compiler to ignore loop dependencies for the vectors.

It's up to you to make sure that this is correct.

void add(int *X, int *Y, int *Z) {
  #pragma ivdep
  for (int i = 0; i < MAX; i++) {
    Z[i] = X[i] + Y[i];
  }
}
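A more portable hint (my addition, not from the lecture) is OpenMP's simd pragma, which GCC and Clang accept with -fopenmp-simd; as with ivdep, you are asserting that the loop iterations are independent:

#define MAX 1024   /* illustrative size; the slides leave the bound implicit */

void add(int *X, int *Y, int *Z) {
  /* Assert that the iterations may be executed as SIMD lanes. */
  #pragma omp simd
  for (int i = 0; i < MAX; i++) {
    Z[i] = X[i] + Y[i];
  }
}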
EXPLICIT VECTORIZATION
Use CPU intrinsics to manually marshal data between SIMD registers and execute vectorized instructions.
Potentially not portable.
EXPLICIT VECTORIZATION
Store the vectors in 128-bit SIMD registers.
Then invoke the intrinsic to add together the vectors and write them to the output location.
void add(int *X, int *Y, int *Z) {
  __m128i *vecX = (__m128i*)X;
  __m128i *vecY = (__m128i*)Y;
  __m128i *vecZ = (__m128i*)Z;
  for (int i = 0; i < MAX/4; i++) {
    _mm_store_si128(vecZ++, _mm_add_epi32(*vecX++, *vecY++));
  }
}
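For completeness, a small driver sketch (my addition, not from the slides). The casts to __m128i* and the aligned _mm_store_si128 above require the arrays to be 16-byte aligned, so the driver uses C11 alignas:

#include <stdalign.h>
#include <stdio.h>

#define MAX 8

void add(int *X, int *Y, int *Z);   /* the SSE version shown above */

static alignas(16) int X[MAX] = {1, 2, 3, 4, 5, 6, 7, 8};
static alignas(16) int Y[MAX] = {8, 7, 6, 5, 4, 3, 2, 1};
static alignas(16) int Z[MAX];

int main(void) {
  add(X, Y, Z);
  for (int i = 0; i < MAX; i++)
    printf("%d ", Z[i]);            /* prints "9 " eight times */
  printf("\n");
  return 0;
}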
VECTORIZATION DIRECTION

Approach #1: Horizontal
→ Perform operation on all elements together within a single vector.
→ Example: a horizontal SIMD add over the single vector [0, 1, 2, 3] produces the scalar 6.

Approach #2: Vertical
→ Perform operation in an elementwise manner on elements of each vector.
→ Example: a vertical SIMD add of [0, 1, 2, 3] and [1, 1, 1, 1] produces [1, 2, 3, 4].

Source: Przemysław Karpiński
https://gain-performance.com/2017/05/01/umesimd-tutorial-2-calculation/
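A minimal sketch of the two directions with SSE intrinsics (my example, assuming an SSSE3-capable x86 CPU and a compiler flag such as -mssse3); it reproduces the two additions described above:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
  __m128i v = _mm_setr_epi32(0, 1, 2, 3);

  /* Vertical: element-wise add of two vectors -> {1, 2, 3, 4} */
  int out[4];
  _mm_storeu_si128((__m128i *)out, _mm_add_epi32(v, _mm_set1_epi32(1)));
  printf("vertical:   %d %d %d %d\n", out[0], out[1], out[2], out[3]);

  /* Horizontal: reduce a single vector to one sum -> 6 */
  __m128i t = _mm_hadd_epi32(v, v);   /* {0+1, 2+3, 0+1, 2+3} */
  t = _mm_hadd_epi32(t, t);           /* {6, 6, 6, 6} */
  printf("horizontal: %d\n", _mm_cvtsi128_si32(t));
  return 0;
}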
EXPLICIT VECTORIZATION
Linear Access Operators
→ Predicate evaluation
→ Compression

Ad-hoc Vectorization
→ Sorting
→ Merging

Composable Operations
→ Multi-way trees
→ Bucketized hash tables
Source: Orestis Polychroniou
http://www.cs.columbia.edu/~orestis
VECTORIZED DBMS ALGORITHMS
Principles for efficient vectorization by using fundamental vector operations to construct more advanced functionality.
→ Favor vertical vectorization by processing different input data per lane.
→ Maximize lane utilization by executing different things per lane subset.

RETHINKING SIMD VECTORIZATION FOR IN-MEMORY DATABASES (SIGMOD 2015)
FUNDAMENTAL OPERATIONS
→ Selective Load
→ Selective Store
→ Selective Gather
→ Selective Scatter
FUNDAMENTAL VECTOR OPERATIONS

Selective Load
→ Given vector [A, B, C, D], mask [0, 1, 0, 1], and memory [U, V, W, X, Y, Z, …]: load values from contiguous memory into only the lanes whose mask bit is set, leaving the other lanes unchanged. Result: [A, U, C, V].

Selective Store
→ The inverse of a selective load: write only the lanes whose mask bit is set to contiguous memory. With vector [A, B, C, D] and mask [0, 1, 0, 1], memory becomes [B, D, W, X, Y, Z, …].

Selective Gather
→ Given index vector [2, 1, 5, 3] and memory [U, V, W, X, Y, Z, …] (0-indexed): each lane loads the value at its index. Result: [W, V, Z, X].

Selective Scatter
→ The inverse of a gather: each lane of the value vector [A, B, C, D] is written to the memory location given by its index in [2, 1, 5, 3]. Memory becomes [U, B, A, D, Y, C, …].
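The paper emulates some of these operations on the Xeon Phi; on AVX-512F CPUs they map fairly directly onto expand loads, compress stores, gathers, and scatters. A minimal sketch (my mapping, not the paper's code; the masks, indexes, and buffer sizes are illustrative):

#include <immintrin.h>

/* memory must hold at least 6 readable ints; out must hold at least 6. */
void fundamental_ops(const int *memory, int *out) {
  __mmask16 mask  = 0x000A;                 /* lanes 1 and 3 active (0,1,0,1,...) */
  __mmask16 four  = 0x000F;                 /* only the first four lanes used */
  __m512i   vec   = _mm512_set1_epi32(0);   /* existing lane contents */
  __m512i   index = _mm512_setr_epi32(2, 1, 5, 3, 0, 0, 0, 0,
                                      0, 0, 0, 0, 0, 0, 0, 0);

  /* Selective load: the next contiguous values from memory land in the
     lanes whose mask bit is set ("expand load"). */
  __m512i loaded = _mm512_mask_expandloadu_epi32(vec, mask, memory);

  /* Selective store: only the active lanes are written out contiguously
     ("compress store"). */
  _mm512_mask_compressstoreu_epi32(out, mask, loaded);

  /* Selective gather: lane i reads memory[index[i]] (scale = 4 bytes). */
  __m512i gathered = _mm512_mask_i32gather_epi32(vec, four, index, memory, 4);

  /* Selective scatter: lane i writes its value to out[index[i]]. */
  _mm512_mask_i32scatter_epi32(out, four, index, gathered, 4);
}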
ISSUES
Gathers and scatters are not really executed in parallel because the L1 cache only allows one or two distinct accesses per cycle.

Gathers are only supported in newer CPUs. Selective loads and stores are also emulated in Xeon CPUs using vector permutations.
https://software.intel.com/en-us/node/683481
VECTORIZED OPERATORS
→ Selection Scans
→ Hash Tables
→ Partitioning

Paper provides additional info:
→ Joins, Sorting, Bloom filters

RETHINKING SIMD VECTORIZATION FOR IN-MEMORY DATABASES (SIGMOD 2015)
SELECTION SCANS
SELECT * FROM table
 WHERE key >= $(low)
   AND key <= $(high)
SELECTION SCANS

Scalar (Branching)
i = 0
for t in table:
  key = t.key
  if (key ≥ low) && (key ≤ high):
    copy(t, output[i])
    i = i + 1

Scalar (Branchless)
i = 0
for t in table:
  copy(t, output[i])
  key = t.key
  m = (key ≥ low ? 1 : 0) && (key ≤ high ? 1 : 0)
  i = i + m

Source: Bogdan Raducanu
http://15721.courses.cs.cmu.edu/spring2017/papers/20-compilation/p1231-raducanu.pdf
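A C sketch of the two scalar variants (my translation of the pseudocode above; it emits matching offsets rather than copying whole tuples, and low/high are passed as parameters):

#include <stddef.h>

/* Branching: cheap at very low or very high selectivity, but branch
   mispredictions hurt when roughly half of the keys match. */
size_t scan_branching(const int *keys, size_t n, int low, int high, int *out) {
  size_t i = 0;
  for (size_t t = 0; t < n; t++) {
    if (keys[t] >= low && keys[t] <= high) {
      out[i] = (int)t;
      i = i + 1;
    }
  }
  return i;
}

/* Branchless: always write, then advance the output cursor by 0 or 1.
   out must have room for n entries, since every iteration writes. */
size_t scan_branchless(const int *keys, size_t n, int low, int high, int *out) {
  size_t i = 0;
  for (size_t t = 0; t < n; t++) {
    out[i] = (int)t;
    i = i + ((keys[t] >= low) & (keys[t] <= high));
  }
  return i;
}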
SELECTION SCANS

Vectorized
i = 0
for vt in table:
  simdLoad(vt.key, vk)
  vm = (vk ≥ low ? 1 : 0) && (vk ≤ high ? 1 : 0)
  simdStore(vt, vm, output[i])
  i = i + |vm ≠ false|

Example: SELECT * FROM table WHERE key >= "O" AND key <= "U"
→ Table (ID, KEY): (1, J), (2, O), (3, Y), (4, S), (5, U), (6, X)
→ Key Vector: [J, O, Y, S, U, X]
→ SIMD Compare produces Mask: [0, 1, 0, 1, 1, 0]
→ SIMD Store of All Offsets [0, 1, 2, 3, 4, 5] under the mask yields Matched Offsets: [1, 3, 4]
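A sketch of the vectorized scan using AVX-512 (my mapping of the pseudocode, not the paper's code; it assumes 32-bit keys, n divisible by 16, and compilation with flags such as -mavx512f -mpopcnt):

#include <immintrin.h>
#include <stddef.h>

size_t scan_vectorized(const int *keys, size_t n, int low, int high, int *out) {
  __m512i vlow    = _mm512_set1_epi32(low);
  __m512i vhigh   = _mm512_set1_epi32(high);
  __m512i offsets = _mm512_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7,
                                      8, 9, 10, 11, 12, 13, 14, 15);
  __m512i step    = _mm512_set1_epi32(16);
  size_t  i       = 0;

  for (size_t t = 0; t < n; t += 16) {
    __m512i vk = _mm512_loadu_si512(keys + t);                /* simdLoad */
    __mmask16 vm =
        _mm512_cmp_epi32_mask(vk, vlow,  _MM_CMPINT_NLT) &    /* key >= low */
        _mm512_cmp_epi32_mask(vk, vhigh, _MM_CMPINT_LE);      /* key <= high */
    /* simdStore: selective (compress) store of the matching offsets. */
    _mm512_mask_compressstoreu_epi32(out + i, vm, offsets);
    i += (size_t)_mm_popcnt_u32(vm);                          /* i += |vm| */
    offsets = _mm512_add_epi32(offsets, step);
  }
  return i;
}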
SELECTION SCANS

[Figure: Throughput (billion tuples/sec) vs. selectivity (0–100%) for Scalar (Branching), Scalar (Branchless), Vectorized (Early Mat), and Vectorized (Late Mat). Left panel: MIC (Xeon Phi 7120P – 61 Cores + 4×HT); right panel: Multi-Core (Xeon E3-1275v3 – 4 Cores + 2×HT). The vectorized variants approach the memory-bandwidth limit, and throughput drops for all variants as selectivity increases.]
HASH TABLES – PROBING

Scalar
→ Linear probing hash table (KEY, PAYLOAD).
→ Hash the input key: h1 = hash(k1).
→ Starting at slot h1, compare the input key against the stored keys one at a time (k9, k3, k8, …) until the matching slot (k1) is found.

Vectorized (Horizontal)
→ Linear probing bucketized hash table: each bucket holds several KEYS and PAYLOADs.
→ Hash the input key: h1 = hash(k1).
→ A single SIMD compare checks the input key against the whole bucket [k9, k3, k8, k1] at once, producing the matched mask [0, 0, 0, 1].

Vectorized (Vertical)
→ Process a vector of input keys [k1, k2, k3, k4] at once.
→ Hash all lanes to get the hash index vector [h1, h2, h3, h4].
→ SIMD gather loads the table keys at those slots, e.g., [k1, k99, k88, k4].
→ SIMD compare against the input keys yields the mask [1, 0, 0, 1].
→ Matched lanes (k1, k4) emit their payloads and are refilled with new input keys (k5, k6); unmatched lanes advance to the next slot (h2+1, h3+1) and probe again.
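A sketch of one step of the vertical approach with AVX-512 gathers (my illustration, not the paper's code; it assumes an open-addressing table of 32-bit keys whose size is a power of two, and a simple multiplicative hash):

#include <immintrin.h>
#include <stdint.h>

/* Probes 16 keys at once; returns the mask of lanes whose key was found
   at its first slot. Unmatched lanes would advance their slot by one and
   gather again; matched lanes would be refilled via a selective load. */
__mmask16 probe_step(const int *table_keys, uint32_t slot_mask,
                     __m512i probe_keys, __m512i *slots /* out */) {
  /* Hash every lane: h = (key * constant) & (table_size - 1) */
  __m512i hashed = _mm512_mullo_epi32(probe_keys,
                                      _mm512_set1_epi32((int)0x9E3779B1u));
  *slots = _mm512_and_si512(hashed, _mm512_set1_epi32((int)slot_mask));

  /* SIMD gather: lane i loads table_keys[slots[i]]. */
  __m512i stored = _mm512_i32gather_epi32(*slots, table_keys, 4);

  /* SIMD compare: which lanes found their probe key? */
  return _mm512_cmpeq_epi32_mask(stored, probe_keys);
}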
HASH TABLES – PROBING

[Figure: Throughput (billion tuples/sec) vs. hash table size (4KB–64MB) for Scalar, Vectorized (Horizontal), and Vectorized (Vertical). Left panel: MIC (Xeon Phi 7120P – 61 Cores + 4×HT); right panel: Multi-Core (Xeon E3-1275v3 – 4 Cores + 2×HT). Throughput drops sharply once the hash table falls out of cache.]
PARTITIONING – HISTOGRAM

Use gathers and scatters to increment counts. Replicate the histogram to handle collisions.

→ Compute the hash index vector [h1, h2, h3, h4] from the input key vector [k1, k2, k3, k4] (SIMD radix).
→ Gather the current counts, add 1 to each, and scatter them back. If two lanes map to the same bucket, their scattered writes collide and only one increment survives.
→ Fix: replicate the histogram so that each vector lane has its own copy (number of copies = number of vector lanes), scatter the increments into the per-lane copies, then SIMD-add the copies together to produce the final histogram.
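A plain-C sketch of the replicated-histogram idea (my scalar emulation of the SIMD gather/scatter; the lane count and 8-bit radix are illustrative assumptions):

#include <stddef.h>
#include <string.h>

#define LANES   4      /* number of vector lanes being emulated */
#define BUCKETS 256    /* 8-bit radix -> 256 partitions */

void build_histogram(const unsigned *keys, size_t n, unsigned hist[BUCKETS]) {
  /* One private copy per lane, so two lanes of the same "vector" that map
     to the same bucket never collide on a single counter. */
  unsigned replicated[LANES][BUCKETS];
  memset(replicated, 0, sizeof(replicated));

  for (size_t i = 0; i < n; i += LANES) {
    for (int lane = 0; lane < LANES && i + lane < n; lane++) {
      unsigned h = keys[i + lane] & (BUCKETS - 1);  /* SIMD radix */
      replicated[lane][h] += 1;                     /* SIMD scatter of +1 */
    }
  }

  /* Final reduction: add the per-lane copies into the real histogram. */
  for (int b = 0; b < BUCKETS; b++) {
    hist[b] = 0;
    for (int lane = 0; lane < LANES; lane++)
      hist[b] += replicated[lane][b];
  }
}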
JOINS
No Partitioning
→ Build one shared hash table using atomics
→ Partially vectorized

Min Partitioning
→ Partition the building table
→ Build one hash table per thread
→ Fully vectorized

Max Partitioning
→ Partition both tables repeatedly
→ Build and probe cache-resident hash tables
→ Fully vectorized
JOINS

[Figure: Join time (sec), broken down into Partition, Build, Probe, and Build+Probe, for scalar vs. vectorized versions of No Partitioning, Min Partitioning, and Max Partitioning. Workload: 200M ⨝ 200M tuples (32-bit keys & payloads) on a Xeon Phi 7120P – 61 Cores + 4×HT.]
PARTING THOUGHTS
Vectorization is essential for OLAP queries. These algorithms don't work when the data exceeds your CPU cache.

We can combine all the intra-query parallelism optimizations we've talked about in a DBMS.
→ Multiple threads processing the same query.
→ Each thread can execute a compiled plan.
→ The compiled plan can invoke vectorized operations.
NEXT CLASS
Reminder: No class on 4/18 (Thu).
Reminder: Guest lecture on 4/23 (Tue).
Reminder: Extra credit due on 4/18 (Thu).
Reminder: Final presentation on 4/25 (Thu).