Quantitative Performance Analysis Joe Chang [email protected] www.sql-server-performance.com/joe_chang.asp
Mar 22, 2016
Objectives
Estimate DB performance early
  What design/architecture decisions impact performance?
  Is the project/architecture feasible?
Production database performance tuning
  Reduces guesswork, but there are easier ways
Server Performance Characteristics
Processor Architecture: Pentium III – Xeon – Itanium 2 – Opteron
System Architecture: 2, 4, 8, 16-way etc
SPEC CPU 2000 Integer
Pentium M 2.0GHz is 90nm, all others 130nm
Xeon 2.4GHz/512K base: 913 (used later in this presentation)
Pentium III Xeon 700MHz/2M: 431
[Chart: SPEC CPU 2000 Integer scores (0 to 3,000) for base, gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf. Series: Pentium M 2.0GHz/2M, Xeon 3.2GHz/2M, Opteron 2.2GHz/1M, Itanium 2 1.5GHz/6M, Power5 1.9GHz]
TPC-C Performance – SQL Server
Xeon 3.2GHz/2M & Xeon MP 3.0GHz/4M
# CPUs  System           tpm-C     $/tpm-C  Mem(GB)  # Disks
1       HP (3.2GHz/2M)   35,030    $1.88    12       43+4
2       HP (3.2GHz/2M)   60,364    $3.51    12       280+10
4       IBM x365         102,667   $3.52    32       266
8       IBM x445         156,195   $4.31    64       616
16      Unisys ES7000    237,869   $5.08    64       700+10
32      Unisys ES7000    304,148   $6.18    64       1092+12

Opteron 2.2GHz/1M
# CPUs  System           tpm-C     $/tpm-C  Mem(GB)  # Disks
4       HP DL585         115,110   2.62     32       295+8

IA-32: limited max memory (64GB), AWE overhead, bus architecture
Itanium 2 (SQL Server) vs IBM Power 5
IBM Power 4+/5 1.9GHz
# CPUs  System             tpm-C       $/tpm-C  Mem(GB)  # Disks
4       Pwr5 570 /Oracle   194,391     $5.62    128      432+16
8       Pwr5 570 /UDB      429,900     $4.99    256      880+40
16      Pwr5 570 /UDB      809,144     $4.95    512      1600+40
32      Pwr5 695 /Oracle   1,601,785   $5.27    1024     3200+96
64      Pwr5 695 /UDB      3,210,540   $5.19    2048     6400+140

Itanium 2 1.5GHz/6M
# CPUs  System             tpm-C       $/tpm-C  Mem(GB)  # Disks
4       HP rx5670          121,065     $4.49    64       448+20
8       Bull               175,366     $4.54    128      225
16      Unisys ES7000      309,037     $4.49    128      770+24
32      NEC Express        577,531     $7.74    512      1150
64      HP Superdome       786,646     $6.49    512      1792+60
Scaling
Pn / P1 = S ** log2(n)
  Pn  Performance with n processors
  P1  Performance with 1 processor
  S   Scale factor
  n   Number of processors
[Chart: scaling (Pn / P1, 0 to 36) versus number of processors (2 to 64). Series: S=1.5, S=1.6, S=1.7, S=1.8, Linear]
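The scaling formula can be tabulated with a few lines of Python. This is only a sketch of the arithmetic on the slide; the function name `scaling_ratio` is made up for illustration, and S = 2.0 is included because 2 ** log2(n) equals n, which gives the linear reference curve.

```python
import math


def scaling_ratio(n: int, s: float) -> float:
    """Return Pn / P1 = S ** log2(n) for n processors and scale factor S."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return s ** math.log2(n)


if __name__ == "__main__":
    processor_counts = (2, 4, 8, 16, 32, 64)
    # S = 2.0 reproduces linear scaling, since 2 ** log2(n) == n.
    for s in (1.5, 1.6, 1.7, 1.8, 2.0):
        ratios = ", ".join(f"{n}P={scaling_ratio(n, s):.1f}" for n in processor_counts)
        print(f"S={s}: {ratios}")
```

With S = 1.5, each doubling of processors multiplies throughput by 1.5, so 16 processors give roughly 5 times the single-processor result rather than 16 times.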
Unit of Measure – CPU-Cycles
Query costs measured in CPU-cycles
Alternative: CPU-sec
Cost = Runtime (sec) × CPU Util. × Available CPU-cycles ÷ Iterations
Available CPU-cycles = Number of CPUs × Frequency
Example: 4 x 700MHz = 2.8B cycles/sec
CPU-cycles does not imply CPU instructions; the unit of time is the same as the CPU clock period
  1GHz CPU: time unit = 1ns
  2GHz CPU: time unit = 0.5ns
All tests on Windows 2000/2003, SQL Server 2000 SP1-3
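As a worked illustration of the cost formula, the sketch below computes CPU-cycles per iteration from a load-test measurement. The 4 x 700MHz system is the example from the slide; the run length, CPU utilization and iteration count are hypothetical numbers chosen only to show the arithmetic, and the function names are not from the presentation.

```python
def available_cycles_per_sec(num_cpus: int, frequency_hz: float) -> float:
    """Available CPU-cycles per second = number of CPUs x frequency."""
    return num_cpus * frequency_hz


def cost_per_iteration(runtime_sec: float, cpu_util: float, num_cpus: int,
                       frequency_hz: float, iterations: int) -> float:
    """Cost = runtime x CPU utilization x available CPU-cycles / iterations."""
    if iterations <= 0:
        raise ValueError("iterations must be positive")
    available = available_cycles_per_sec(num_cpus, frequency_hz)
    return runtime_sec * cpu_util * available / iterations


if __name__ == "__main__":
    # 4 x 700MHz is the slide's example; the measurement values are hypothetical.
    cost = cost_per_iteration(runtime_sec=60.0, cpu_util=0.90, num_cpus=4,
                              frequency_hz=700e6, iterations=500_000)
    print(f"Cost per call: {cost:,.0f} CPU-cycles")
```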
CPU-Cycles dependencies
CPU-cycles on one processor architecture have no relation to those on another, e.g. Pentium III, Pentium 4, Itanium, Opteron
Some platform dependencies: cache size, bus speed, SMP
Processor   System/Cache/Mem      Performance  Cost / trans.
Itanium 2   4 x 1.5GHz/6M /64G    121,065      2.974M
Xeon        4 x 3.0GHz/4M /32G    102,667      7.013M
Opteron     4 x 2.2GHz/1M /32G    105,687      4.996M
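The cost-per-transaction figures in this table appear to follow from the performance column if that column is read as transactions per minute and the CPUs are assumed to be fully utilized. The slide does not state the derivation, so the sketch below is an inference that happens to agree with the three rows, not the author's documented method.

```python
def cycles_per_transaction(num_cpus: int, frequency_hz: float, tpm: float) -> float:
    """CPU-cycles per transaction, assuming 100% CPU utilization.

    Cycles available in one minute (CPUs x frequency x 60) divided by
    the number of transactions completed per minute.
    """
    return num_cpus * frequency_hz * 60.0 / tpm


if __name__ == "__main__":
    systems = {
        "Itanium 2 4 x 1.5GHz": (4, 1.5e9, 121_065),
        "Xeon 4 x 3.0GHz": (4, 3.0e9, 102_667),
        "Opteron 4 x 2.2GHz": (4, 2.2e9, 105_687),
    }
    for name, (cpus, freq, tpm) in systems.items():
        cost = cycles_per_transaction(cpus, freq, tpm)
        print(f"{name}: {cost / 1e6:.3f}M cycles per transaction")
```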
Cost Structure - Model
Stored Procedure Call Cost =
  RPC cost (once per procedure)
  + Type cost (once per procedure?)
  + Query costs (one or more per procedure)
Query – one or more components
Component Cost = Cost for component operation base
  + Cost per additional row or page
Only stored procedures are examined
RPC Cost
Cost of RPC to SQL Server includes:
  1) Network roundtrip
  2) SQL Server handling costs
Calls to SQL Server are made with RPC (not SQL Batch)
Profiler -> Event Class: RPC
ADO.NET Command Type: Stored Procedure, or Text with parameters
Command Text without parameters: SQL Batch
Type Cost?
Blank Procedure: ~250,000 CPU-cycles
  CREATE PROC p_ProcBlank AS
  RETURN
Proc with Single Query: ~320K CPU-cycles
  CREATE PROC p_ProcSingleQuery AS
  SELECT … FROM TableA WHERE ID = @ID
Proc with Two Queries: ~360K CPU-cycles
  CREATE PROC p_ProcTwoQueries AS
  SELECT … FROM TableA WHERE ID = @ID
  SELECT … FROM TableB WHERE ID = @ID
RPC Cost ~250K CPU-cycles, Type Cost ~30K CPU-cycles, Query Cost ~40K CPU-cycles
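The three measurements decompose cleanly into the cost model: the blank procedure pays only the RPC cost, and each procedure with at least one SELECT adds the type cost once plus a per-query cost. A minimal sketch of that additive model, using the approximate figures above (the function name and the rule that the type cost applies only when a query is present are my reading of the slide):

```python
RPC_COST = 250_000    # ~250K CPU-cycles, paid once per call
TYPE_COST = 30_000    # ~30K CPU-cycles, paid once when the procedure runs a SELECT
QUERY_COST = 40_000   # ~40K CPU-cycles per single-row SELECT


def procedure_cost(num_queries: int) -> int:
    """Estimated CPU-cycles for one stored procedure call with num_queries SELECTs."""
    if num_queries < 0:
        raise ValueError("num_queries must be non-negative")
    if num_queries == 0:
        return RPC_COST  # blank procedure: RPC cost only
    return RPC_COST + TYPE_COST + num_queries * QUERY_COST


if __name__ == "__main__":
    for n in (0, 1, 2):
        print(f"{n} queries: ~{procedure_cost(n):,} CPU-cycles")
```

The model reproduces the ~250K, ~320K and ~360K measurements, which is the point of the decomposition: the fixed RPC cost dominates a call that runs only one or two simple queries.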
Blank RPC Performance
All systems with 2 processors; Xeon 3.2GHz with 800MHz bus
RPCs are expensive relative to a simple query
[Chart: blank RPC/sec (0 to 30,000) by system: Pent III 900M/2M, Xeon 3.06G/512K, Xeon 3.06G/512K w/HT, Xeon 3.20G/1M 800, Xeon 3.20G/1M w/HT, Itanium 2 1.50G/6M, Opteron 2.20G/1M. Data labels: 120K, 270K, 240K, 260K, 195K, 154K, 140K]
Costs in CPU-Cycles
RPC Cost
Processor           PIII     PIII X   Xeon     Opteron  Itanium 2*  It2    It2
CPUs                2        4        2        2        2           4      8
RPC cost            140K     200K     250K     140K     155K        290K   350K
  Xeon: 270K (2.xx driver)
Type Cost, Select: 20-30K, ~5K, 35-55K, ~20K, ~8K

Systems:
  Pentium III: 2x 600MHz/256K, 2x 733MHz/256K
  PIII Xeon: 2x 500MHz/2M, 4x 700MHz/2M, 4x 900/2M
  Xeon (P4): 2x 2.0GHz/512K, 2x 2.4GHz/512K
  Opteron: 2x 2.2GHz/1M
  Itanium 2: 2x 900MHz/1.5M, 8x 1.5GHz/6M
OS: W2K, W2K3, various sp
SQL Server 2000, various sp
PIII: Intel PRO/100+, Others: Broadcom Gigabit Ethernet driver 5.xx+
*Itanium 2 system booted with 2, 4 or 8 processors (4P config may have had procs from more than 1 cell)
Costs in CPU-Cycles
RPC Cost – Fiber versus Threads
PIII Xeon - TCP: 1P, 2P, 4P
  FE-Thread: 105K, 150K, 200K
  FE-Fiber: 95K, 120K, 170K
Xeon - TCP
  GE-Thread: 210K, 250K
  GE-Fiber: 200K, 230K
Xeon - VI
  VI Thread: 190K
  VI Fiber: 160K, 180K
Itanium 2 - TCP: 1P, 2P, 4P, 8P
  Thread: 105K, 155K, 290K, 350K
  Fiber: 95K, 145K, 260K, 300K
Costs in CPU-Cycles
Broadcom Gigabit Ethernet driver 5.xx, 6.xx, 7.xx (270K for 2P 2.xx driver)
VI: QLogic QLA2350, drivers: qla2300 8.2.2.10, qlvika 1.1.3
RPC Cost TCP vs Named Pipes
PIII Xeon 4P      TCP    Named Pipes
FE-Thread         200K   315K
FE-Fiber          170K   370K

Xeon, Thread      1P     2P
GE, TCP           210K   250K
GE, Named Pipes   320K   360K
Costs in CPU-Cycles
RPC Costs – owner, case
            PIII    PIII X   P4/Xeon
RPC cost    140K    140K?    250K
sp_executesql: 210K
Unspecified owner, e.g. user1 calls procedure owned by dbo:
  +100K on 4P PIII, +100K on 2P Xeon, 300K on 8P Itanium2
Case mismatch:
  Actual procedure: p_Get_Rows
  Called procedure: p_get_rows
  +100K on 4P PIII, +150K on 2P Xeon, +300K on 8P Itanium2
Costs in CPU-Cycles
Single Row Select Costs
Clustered Index Seek
  Does cost depend on index depth?
  Role of I/O count in cost
Index Seek with Bookmark Lookup
  Does cost depend on table type? Heap versus clustered index
Table Scan
Single Row Index Seek vs Index Depth
[Chart: CPU-cycles per SELECT (0 to 110,000) versus SELECTs per stored procedure (0 to 10). Series: Cl 50K, Cl 200K, Cl 500K, NH 50K, NH 200K, NH 500K, NC 50K, NC 200K, NC 500K]
Index depth: 50K rows: 2 (both); 200K rows: 3 (Cl), 2 (NC); 500K rows: 3
System: Pentium III 2x733
Batching Queries into a Single RPC
Each query: clustered index seek returning 1 row
[Chart: queries/sec (0 to 120,000) versus queries per batch (0 to 10). Series: X 3.06/512K, X 3.2/1M, It2 1.5/6M, Opt 2.2/1M, PIII 733]
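The benefit of batching follows from the fixed RPC cost being spread over more queries. The sketch below estimates queries per second for a given batch size under a purely additive cost model. The ~250K / ~30K / ~40K inputs are the approximate figures from the Type Cost slide and the 2 x 3.06GHz capacity is one of the charted systems; pairing them, assuming full CPU utilization and assuming a constant per-query cost are all simplifications of mine, so the output shows the shape of the curve rather than the measured results.

```python
def queries_per_second(queries_per_call: int, available_cycles_per_sec: float,
                       rpc_cost: float = 250_000, type_cost: float = 30_000,
                       query_cost: float = 40_000) -> float:
    """Estimated queries/sec when each RPC carries queries_per_call queries."""
    if queries_per_call < 1:
        raise ValueError("queries_per_call must be at least 1")
    call_cost = rpc_cost + type_cost + queries_per_call * query_cost
    calls_per_sec = available_cycles_per_sec / call_cost
    return calls_per_sec * queries_per_call


if __name__ == "__main__":
    capacity = 2 * 3.06e9  # illustrative: 2 x 3.06GHz at 100% utilization
    for n in (1, 2, 5, 10):
        print(f"{n:2d} queries per call: ~{queries_per_second(n, capacity):,.0f} queries/sec")
```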
Single Row Cost Summary
Index depth
  Plan cost: no; true cost: yes
  Cost versus index depth not fully examined
  Fill factor dependence not tested
Bookmark Lookup – Table type
  Plan cost: no
  True cost higher for clustered index than heap
Multi-row Select Queries
Queries return aggregate - not a test of network bandwidth
Single Table Example
  SELECT @Count = count(*), @Value1 = AVG(randDecimal)
  FROM M3C_01 WHERE GroupID = @ID
  1 Count, 1 AVG on 5 byte decimal
Join Example
  SELECT count(*), AVG(m.randDecimal), min(n.randDecimal)
  FROM M2A_02 m INNER LOOP JOIN M2B_02 n ON n.ID = m.ID
  WHERE m.GroupID = @ID AND n.GroupID = @ID
  1 Count, 2 AVG on 5 byte decimal
Table Scan tests return either
  - a single row
  - 1 Count, 1 aggregate on 5 byte decimal
Index Seek – Aggregates
Xeon 2x3.06GHz/512K
[Chart: query x rows/sec (0 to 14,000,000) versus rows per query (10 to 100,000). Series: count, cnt+avg(money), cnt+avg(decimal)]
Multi-row Bookmark Lookups
2x2.2GHz/1M Opteron; 1 Count, 1 AVG on decimal
Table lock escalation at >5K rows?
[Chart: query x rows/sec (0 to 1,000,000) versus rows per query (10 to 10,000). Series: Seq Heap, Dist Heap, Seq Cl IX, Dist Cl IX, Seq Hp PL, Dist Hp PL, Seq Cl PL, Dist Cl PL]
Table Scan
Xeon 3.2GHz/800MHz FSB HT
[Chart: pages/sec (0 to 250,000) versus table size in pages (1 to 10,000). Series: Default, Row Lock, Page Lock, Table Lock, No Lock]
Table Scan – Component Cost
Total cost = RPC + Type + Base + per page costs
Columns: PIII (X), P4/Xeon, Opteron, It2 2P, It2 8P
Type + Base cost: 60K/40K, 145K, 35K, 90K
Cost per page:
  NOLOCK: 24K, -, 16K, 23-35K
  TABLOCK: 24K, 25K, 16K
  PAGLOCK: 26K, 26K, 20K, 17K, 23-35K
  ROWLOCK: 140K, 250K, 110K, 100K, 150K
Table Scan or Index Scan – Plan Formula
  I/O: 0.0375785 + 0.0007407 per page
  CPU: 0.0000785 + 0.0000011 per row
Measured Table Scan cost formulas for 99 rows per page
Costs in CPU-Cycles per page
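The plan formula can be evaluated directly to see how the optimizer's estimated cost per page falls as the table grows, because the fixed I/O term is spread over more pages. The sketch below applies the coefficients literally, per page and per row; whether the base terms already cover the first page or row is not stated on the slide, so that reading is an assumption, as are the function names. The 99 rows per page matches the measured tests.

```python
IO_BASE, IO_PER_PAGE = 0.0375785, 0.0007407
CPU_BASE, CPU_PER_ROW = 0.0000785, 0.0000011


def plan_scan_cost(pages: int, rows_per_page: int = 99) -> float:
    """Optimizer plan cost for a table or index scan (I/O + CPU components)."""
    if pages < 1:
        raise ValueError("pages must be at least 1")
    rows = pages * rows_per_page
    io_cost = IO_BASE + IO_PER_PAGE * pages
    cpu_cost = CPU_BASE + CPU_PER_ROW * rows
    return io_cost + cpu_cost


if __name__ == "__main__":
    for pages in (1, 10, 100, 1000):
        total = plan_scan_cost(pages)
        print(f"{pages:5d} pages: plan cost {total:.6f}, per page {total / pages:.6f}")
```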
Table Scan Cost per Page
2x733MHz/256KB Pentium III
[Chart: normalized cost per page (0 to 7) versus table size in pages (1 to 1,000). Series: Default, Row Lock, Page Lock, Table Lock, No Lock, Plan]
Bookmark – Table Scan Cross-over
[Chart: normalized cost (1 to 1,000, log scale) versus rows (10 to 10,000). Series: Bk on Cluster, Bk on Heap, Table Scan, Clust. Index, Plan Cost Bk, Plan Cost Scan, Plan Clust. IX]
Plan Costs: ≤1GB
System: 2x733MHz/256K Pentium III
Loop, Hash and Merge Join
Loop joins: Case 1, 2, 3 etc
Hash joins: in-memory versus others
Merge: regular versus many-to-many
Merge join with sort operations
Locks – page lock
Joins – Locking Granularity
Xeon 2x2.0GHz/512K
Count + 2 AVG(decimal); hash join spools to tempdb
[Chart: CPU-cycles per row (0 to 40,000) versus rows (10 to 100,000). Series: Loop, L Page Lock, Hash, H Page Lock, Merge, M Page Lock]
Loop, Hash & Merge Joins
Xeon 2x3.06GHz/512K (Count)
[Chart: query x rows/sec (100,000 to 10,000,000, log scale) versus rows per query (10 to 100,000). Series: Loop, Loop PL, Hash, Hash PL, Merge, Merge PL]
Processor Architecture
Pentium III, Xeon, Opteron, Itanium 2
Index Seek
COUNT(*) + AVG(Decimal)
[Chart: query x rows/sec (0 to 4,500,000) versus rows per query (10 to 100,000). Series: Xeon 3.06/512K, Xeon 3.2 w/HT, Itanium 2, Opteron 2.2]
Bookmark Lookup
Heap, Page Lock, sequential rows 1 Count, 1 AVG on decimal
[Chart: query x rows/sec (0 to 1,000,000) versus rows per query (10 to 10,000). Series: PIII 900/2M, Xeon 2.4, Itanium 2, Opteron 2.2]
Table Scan
NOLOCK
[Chart: pages/sec (0 to 300,000) versus table size in pages (1 to 10,000). Series: X 3.06/512K, X 3.2/1M, It2 1.5/6M, Opt 2.2/1M, 4xPIII-700]
Hyper-Threading
Xeon 3.06GHz/512K (130nm) vs. Xeon 3.2GHz/1M & 800MHz FSB (90nm)
Hyper-Threading
[Chart, No Hyper-Threading: CPU (0 to 100) versus RPC/sec (0 to 25,000)]
[Chart, with Hyper-Threading: CPU (0 to 100) versus RPC/sec (0 to 25,000)]
CPU linear with throughput
CPU non-linear with throughput
Exercise caution in extrapolating system performance based on low CPU measurements
Index Seek – Single Row
[Chart: queries/sec (0 to 120,000) versus queries per batch (0 to 10). Series: X 3.06/512K, X 3.06 w/HT, X 3.2/1M, X 3.2 w/HT]
Index Seek - Multiple Rows
[Chart: query x rows/sec (0 to 16,000,000) versus rows per query (10 to 100,000). Series: Xeon 3.06/512K, Xeon 3.06 w/HT, Xeon 3.2/1M, Xeon 3.2 w/HT]
Table Scan - NOLOCK
Xeon, Xeon/800 NOLOCK
[Chart: pages/sec (0 to 300,000) versus table size in pages (1 to 10,000). Series: X 3.06/512K, X 3.06 w/HT, X 3.2/1M, X 3.2 w/HT]
Bookmark Lookup – Cl. IX, Seq, PL
1 Count, Clustered Index, Sequential rows, Page Lock
[Chart: query x rows/sec (0 to 600,000) versus rows per query (10 to 100,000). Series: Xeon 3.06/512K, Xeon 3.06 w/HT, Xeon 3.2/1M, Xeon 3.2 w/HT]
Bookmark Lookup – Heap, Seq, PL
1 Count, Heap, Sequential rows, Page Lock
[Chart: query x rows/sec (0 to 1,000,000) versus rows per query (10 to 100,000). Series: Xeon 3.06/512K, Xeon 3.06 w/HT, Xeon 3.2/1M, Xeon 3.2 w/HT]
Loop Join
[Chart: query x rows/sec (0 to 450,000) versus rows per query (10 to 100,000). Series: Xeon 3.06/512K, Xeon 3.06 w/HT, Xeon 3.2/1M, Xeon 3.2 w/HT]
Hash Join
[Chart: query x rows/sec (0 to 1,400,000) versus rows per query (10 to 100,000). Series: Xeon 3.06/512K, Xeon 3.06 w/HT, Xeon 3.2/1M, Xeon 3.2 w/HT]
Merge Join
[Chart: query x rows/sec (0 to 5,000,000) versus rows per query (10 to 100,000). Series: Xeon 3.06/512K, Xeon 3.06 w/HT, Xeon 3.2/1M, Xeon 3.2 w/HT]
Hyper-Threading Summary
Lower cost RPC
Improves performance of low cost queries
May improve performance on some ops
Degrades performance on others
Xeon 800MHz FSB with 90nm core better than 130nm
Scaling
Itanium 2, 1.5GHz/6M, 1, 2, 4, 8, 16 CPUs
Scaling
First System: HP rx7620
  8-way Itanium 2 1.5GHz/6M
  Boot with /NUMPROC = xx
  Query: Count + 1 or 2 AVG(Decimal)
Second System: Unisys ES7000 Aries 420
  16-way Itanium 2 1.5GHz/6M
  Booted with all 16 CPUs, Affinity Mask = 1, 2, …
  Query: Count only
SQL Server 2000, Service Pack 3 (build 760)
Index Seek
9X difference between 16P and 1P
[Chart: query x rows/sec (0 to 40,000,000) versus rows per query (100 to 100,000). Series: 1P, 2P, 4P, 8P, 16P]
Bookmark Lookup 1-way
Minor differences between Table org & lock granularity
[Chart: query x rows/sec (0 to 300,000) versus rows per query (100 to 100,000). Series: Seq Heap PL, Seq Heap, Seq Cl PL, Seq Clustered]
Bookmark Lookup – 16 way
Major differences! 100X !!
[Chart: query x rows/sec (10,000 to 10,000,000, log scale) versus rows per query (10 to 100,000). Series: Seq Heap PL, Seq Heap, Seq Cl PL, Seq Clustered]
Bookmark Lookup – Clustered Index
Really bad news beyond 2P, especially bad ~5K rows
[Chart: query x rows/sec (0 to 350,000) versus rows per query (100 to 100,000). Series: 1P, 2P, 4P, 8P, 16P]
Bookmark Lookup – Cl. IX, PL
No scaling beyond 2P, but no disaster at 5K rows
[Chart: query x rows/sec (0 to 350,000) versus rows per query (100 to 100,000). Series: 1P, 2P, 4P, 8P, 16P]
Bookmark Lookup – Heap
Performance degradation at 5K rows
[Chart: query x rows/sec (10,000 to 10,000,000, log scale) versus rows per query (100 to 100,000). Series: 1P, 2P, 4P, 8P, 16P]
Bookmark Lookup – Heap Page Lock
Excellent scaling, 8P anomaly?
[Chart: query x rows/sec (0 to 4,500,000) versus rows per query (100 to 100,000). Series: 1P, 2P, 4P, 8P, 16P]
Joins – 1P
Up to 10X difference between loop, hash & merge
[Chart: query x rows/sec (100,000 to 10,000,000, log scale) versus rows per query (100 to 100,000). Series: Loop, Hash, Merge]
Joins – 16P
Up to 100X difference between loop, hash & merge
[Chart: query x rows/sec (100,000 to 100,000,000, log scale) versus rows per query (100 to 100,000). Series: Loop, Hash, Merge]
Loop Join
No scaling beyond 2P
[Chart: query x rows/sec (0 to 300,000) versus rows per query (100 to 100,000). Series: 1P, 2P, 4P, 8P, 16P]
Hash Join
Scales to 8P, peaks at ~5K rows
[Chart: query x rows/sec (0 to 4,000,000) versus rows per query (100 to 100,000). Series: 1P, 2P, 4P, 8P, 16P]
Merge Join
Scales well all around
[Chart: query x rows/sec (0 to 20,000,000) versus rows per query (100 to 100,000). Series: 1P, 2P, 4P, 8P, 16P]
Table Scans
Questionable scaling, peaks at 300 pages, cache size ?
[Chart: pages/sec (0 to 1,200,000) versus table size in pages (100 to 100,000). Series: 1P, 2P, 4P, 8P, 16P]
Scaling Summary
Avoid Bookmark Lookups on tables with clustered indexes
  Use heap tables
  Use NOLOCK, especially at ~5000 rows / query
Avoid Loop Joins
  Design for Merge or Hash Joins
Avoid table scans (Itanium only?)
Bookmark HP 32P, SP3 build 760/818
[Chart: query x rows/sec (10,000 to 10,000,000, log scale) versus rows per query (10 to 100,000). Series: Seq Heap PL, Seq Heap, Seq Cl PL, Seq Clustered]
Bookmark HP 32P, SP4 build 2000?
[Chart: query x rows/sec (10,000 to 10,000,000, log scale) versus rows per query (10 to 100,000). Series: Seq Heap PL, Seq Heap, Seq Cl PL, Seq Clustered]
Joins HP 32P, SP3 build 760/818
[Chart: query x rows/sec (100,000 to 100,000,000, log scale) versus rows per query (100 to 100,000). Series: Loop, Hash, Merge]
Joins HP 32P, SP4 build 2000?
[Chart: query x rows/sec (100,000 to 100,000,000, log scale) versus rows per query (100 to 100,000). Series: Loop, Hash, Merge]
Table Scan HP 32P
[Chart: pages/sec (0 to 2,000,000) versus table size in pages (10 to 100,000). Series: 760 Def, 760 NL, 818 Def, 818 NL, sp4 Def, sp4 NL]
Bookmark Lookup (PL Sequential)
[Chart: query-rows/sec (0 to 10,000,000) versus rows per query (100 to 100,000). Series: 16P Xeon, 16P It2, 32P It2]
Index Seek
[Chart: query-rows/sec (1,000,000 to 100,000,000, log scale) versus rows per query (100 to 100,000). Series: 16P Xeon, 16P It2, 32P It2]
Loop Join
[Chart: query-rows/sec (0 to 300,000) versus rows per query (100 to 100,000). Series: 16P Xeon, 16P It2, 32P It2]
Hash Join
[Chart: query-rows/sec (0 to 16,000,000) versus rows per query (100 to 100,000). Series: 16P Xeon, 16P It2, 32P It2]
Merge
[Chart: query-rows/sec (0 to 45,000,000) versus rows per query (100 to 100,000). Series: 16P Xeon, 16P It2, 32P It2]
Database Design for Performance
Round-trip minimization – RPC cost
Row count management – cost per row
Indexes – isolate queried rows to a limited number of adjacent pages quickly, not most selective columns 1st
Design for low cost operations
  Covered index instead of bookmark lookups
  Merge joins, indexed views
  Avoid excessive logic
NOLOCK on non-transactional data
Statistics
Accuracy & Relevance
  More than keeping statistics up to date
  Data queried needs to reflect data in table
  Avoid populating database with test data having different distribution than live data
Performance Issues
NC Index Seek + Bookmark Lookup vs. Table Scan
  Query optimizer switches to table scan too soon for in-memory data, too late for disk-bound data
  Watch table scan lock level
Row count plan versus actual cost issues
  May be related to WHERE clause conditions
Lock hints
Merge and Hash joins vs. Loop joins
Fixed costs favor consolidation
  Both in RPC and queries
Summary
Query cost structure
  Fixed costs identified
  Costs applied once per RPC
Component operations examined
  Base cost and cost per row or page
Lock hints – Row, Page, Table, No Lock
Additional Information
www.sql-server-performance.com/joe_chang.asp
SQL Server Quantitative Performance Analysis
Server System Architecture
Processor Performance
Direct Connect Gigabit Networking
Parallel Execution Plans
Large Data Operations
Transferring Statistics
SQL Server Backup Performance with Imceda LiteSpeed
Quantitative Performance Analysis
Backup Slides
Subjects
Cost measurement
  Test procedure, unit of measure
Query costs
  Single row, table scan, multi-row, joins
Discrepancies with plan costs
  Logical I/O, lock hints, conditions
Database design implications
  Design, coding and indexes
Test Procedure
Load generator
  Drive DB server CPU utilization as high as possible (80-100%)
Multiple load generators running same query
  Single thread will not fully load DB server
  Network propagation time is significant
  Most queries run on single processor
  Server may have more than one processor
Component operation cost derived
  Comparing two queries differing in one op
Index Seek by Aggregates
SELECT COUNT(*)
SELECT COUNT(*), CONVERT int to bigint, etc
SELECT COUNT(*), SUM(int)
SELECT COUNT(*), SUM(Money)
[Chart: duration per 1K rows in ms (0 to 0.8) by system: Xeon 1P, Opt 1P, It2 1P, Xeon 2P, Opt 2P, It2 2P. Series: 1 of 10M, Count, Sum, Convert, Max, Money, Decimal]
Join Cost Linearity on 32-bit
[Chart: duration per 1K rows in ms (0 to 6) versus rows in thousands (100 to 10,000). Series: Index Seek, Merge Join, Hash Join, Loop Join]
Hash join cost per row is somewhat linear up to ~5M rows
Duration then jumps due to disk IO
Join Cost Linearity on Itanium 2
Loop & Merge join cost per row independent of row count
Hash join cost per row is not (no spooling to temp)
[Chart: duration per 1K rows in ms (0 to 6) versus rows in thousands (100 to 10,000). Series: Index Seek, Merge Join, Hash Join, Loop Join]
Hash Join – row size
Optimizer cost: depends on OS size, not IS size
Actual cost: depends on # of columns and OS/IS
[Chart: duration per 1K rows in ms (0 to 6) versus rows in thousands (100 to 10,000). Series: Count, OS 1 Sum, IS 1 Sum, OS 2 Sum, IS 2 Sum, OS 3 Sum, IS 3 Sum]
Type + Component Base Cost
Processor    PIII    PIII X   Xeon     Opt    Itanium 2
# of CPUs    2       4        2        2      2       4      8
Frequency    ~733    700      2-2.4    2.2    900M    1.5G   1.5G
Cache        256K    2M       512K     1M     1.5M    6M     6M
Aggregate    100K    40K      135K     45K    45K     35K    66K
Table Scan   60K     40K      145K     50K    35K     40K    90K
Loop Join    110K    24K      130K     50K    45K     10K    15K
Hash Join    500K    250K     620K     250K   225K    210K   320K
Merge Join   150K    54K      170K     80K    75K     55K    110K
Merge + Sort: 145K, 320K, 140K, 230K
Many-to-Many: 250K, 550K, 200K, 320K
Costs in CPU-Cycles
Big cache lowers base cost?
Cost per Row
Processor               PIII    Xeon   Opt     Itanium 2
                                               2P     4P     8P
Index Seek              1.3K    1.8K   1.1K    1.0K   1.0K   1.0K
Bookmark LU (CL)        16K     20K    12K     11K    16K    27K
BL (CL) page lock       11K     15K    8K      8K     13K    18K
Bookmark LU (Heap)      11K     14K    9K      10K    11K    16K
BL (Heap) page lock     7K      9K     5K      6K     8K     8K
Loop Join               16K     25K    13K     11K    16K    23K
Loop Join, page lock    13K     20K    11.5K   9K     14K    21K
Hash Join               8.5K    11K    7.5K    7K     7K     7K
Hash Join, page lock    6.5K    8K     6.0K    5K     5K     5K
Merge Join              6.5K    10K    5.0K    6K     6K     6K
Merge Join, page lock   3.0K    4K     2.5K    3K     3K     3K
Merge+Sort: 7.5K, 9K, 6K, 7K
Many-To-Many PL: 32K, 40K, 18K, 31K
Costs in CPU-Cycles
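Combining the base costs with the per-row costs gives a rough linear estimate of join cost and shows where one join type overtakes another. The sketch below uses the Xeon figures without page locks (loop 130K + 25K per row, hash 620K + 11K per row, merge 170K + 10K per row). Pairing the Xeon columns of the two tables, treating every row as an "additional" row, and the function names are all assumptions of mine; the measured curves are not perfectly linear, so this is an illustration of the cost structure only.

```python
# (base cost, cost per row) in CPU-cycles, Xeon columns, no page lock.
XEON_JOIN_COSTS = {
    "loop": (130_000, 25_000),
    "hash": (620_000, 11_000),
    "merge": (170_000, 10_000),
}


def component_cost(join_type: str, rows: int) -> int:
    """Linear estimate: base cost plus per-row cost times the row count."""
    base, per_row = XEON_JOIN_COSTS[join_type]
    return base + per_row * rows


def crossover_rows(cheap_base: str, cheap_per_row: str) -> float:
    """Row count at which the join with the lower per-row cost becomes cheaper."""
    base_a, row_a = XEON_JOIN_COSTS[cheap_base]
    base_b, row_b = XEON_JOIN_COSTS[cheap_per_row]
    if row_a == row_b:
        raise ValueError("per-row costs are equal; the lines never cross")
    return (base_b - base_a) / (row_a - row_b)


if __name__ == "__main__":
    for rows in (10, 100, 1000):
        costs = {name: component_cost(name, rows) for name in XEON_JOIN_COSTS}
        print(rows, costs)
    print("Hash overtakes loop at about", crossover_rows("loop", "hash"), "rows")
```

Under these assumptions the high base cost of the hash join is recovered after a few dozen rows, which is consistent with the advice elsewhere in the deck to design for merge or hash joins on larger row counts.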
IUD Cost Structure
                   2xXeon*    Notes
RPC cost           240,000    Higher for threads, owner m/m
Type Cost          130,000    once per procedure
IUD Base           170,000    once per IUD statement
Single row IUD     300,000    Range: 200,000-400,000
Multi-row Insert
  Cost per row     90,000     cost per additional row
INSERT, UPDATE & DELETE cost structure very similar
Multi-row UPDATE & DELETE not fully investigated
*Use Windows NT fibers on
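The IUD figures can be combined the same way as the SELECT costs. The sketch below estimates the cost of one stored procedure call that inserts a given number of rows in a single statement, and compares it with issuing one call per row. It assumes the components simply add (RPC + type + IUD base + 90,000 per additional row) and ignores index and foreign key overhead; the function names are illustrative.

```python
RPC_COST = 240_000
TYPE_COST = 130_000                # once per procedure
IUD_BASE = 170_000                 # once per IUD statement
COST_PER_ADDITIONAL_ROW = 90_000   # multi-row insert


def insert_call_cost(rows: int) -> int:
    """Estimated CPU-cycles for one call running one INSERT statement of `rows` rows."""
    if rows < 1:
        raise ValueError("rows must be at least 1")
    return RPC_COST + TYPE_COST + IUD_BASE + COST_PER_ADDITIONAL_ROW * (rows - 1)


def single_row_calls_cost(rows: int) -> int:
    """Estimated CPU-cycles for inserting `rows` rows with one call per row."""
    return rows * insert_call_cost(1)


if __name__ == "__main__":
    for rows in (1, 10, 100):
        print(f"{rows:3d} rows: one call ~{insert_call_cost(rows):,}, "
              f"row-at-a-time ~{single_row_calls_cost(rows):,} CPU-cycles")
```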
INSERT Cost Structure
Index and Foreign Key not fully explored
Early measurements:
  50-70,000 per additional index
  50-70,000 per foreign key
Assumes inserts into “sequential” segments
  Cluster on Identity or Cluster Key + Identity
  Cluster on row GUID can cause substantial disk loading!