Evolution of Thread-Level Parallelism in Desktop Applications
Geoffrey Blake*, Ronald G. Dreslinski*, Trevor Mudge*, Krisztián Flautner†
University of Michigan – Ann Arbor*, ARM†
ISCA 2010, June 22, 2010
Dec 27, 2015
ACAL – University of Michigan 2
Introduction
2000
  Single-core machines common
  Clock speed steadily increasing; Intel predicts reaching 10 GHz by 2010
2005
  Consumer CMPs announced (from Intel/AMD)
  Aggressive clock speed increases halt
2010 and Future
  Multi-core machines common
  Core counts steadily increasing
Pictured: Pentium 4, Core Duo, Core i7, Nehalem EX
"If you build it, they will come" -- "Field of Dreams", 1989
Motivation
Server and scientific workloads are already parallel
What about desktop/laptop workloads?
Flautner et al. at ASPLOS-IX studied interactive desktop workloads
  Almost all applications behaved as single-threaded
  One extra core helped system responsiveness
Multiprocessor desktop/laptop systems are now common
Has desktop/laptop software followed?
[Figure: performance vs. number of cores, with curves labeled "Ideal Scaling", "Server/Scientific Scaling", and "Desktop Scaling?"]
Motivation
Over the past 5 years of ISCA, server and scientific workload research has dominated
The market shows the opposite
Is this the correct domain to invest disproportionate effort into? Is there a trickle-down effect?
[Figure: bar chart of percent distribution (0 to 100) for Scientific/Server vs. Desktop/Laptop, comparing Research (Papers) with Market ($)]
*source IDC and Gartner 2009
*source ISCA '05-'09
Metrics
Replicate contemporary experiments of previous work
Measure "Thread Level Parallelism" (TLP) instead of utilization
  c_i = fraction of time i CPUs are doing work
  c_0 = idle (0 CPUs are doing work)
  c_1 = 1 CPU doing work, c_2 = 2 CPUs doing work
Example: c_0 = 0.5, c_1 = 0.25, c_2 = 0.25
  Util = 0.25 * 1 + 0.25 * 2 = 0.75
  TLP = (0.25 * 1 + 0.25 * 2) / (1 - 0.5) = 1.5
TLP is a measure of how efficiently the system uses parallel resources when work needs to be done
*[Flautner et al. ASPLOS'00]
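The worked example on this slide can be reproduced in a few lines. The following is a minimal sketch, assuming the distribution is supplied as a list `c` in which `c[i]` is the fraction of time exactly `i` CPUs are busy. The function name `utilization_and_tlp` is illustrative and does not come from the original work. It generalizes the two-CPU example to TLP = (sum over i >= 1 of c_i * i) / (1 - c_0), and it computes Util the way the slide does, without dividing by the CPU count.

```python
def utilization_and_tlp(c):
    """Return (util, tlp) for a concurrency distribution.

    c[i] is the fraction of time exactly i CPUs are doing work; c[0] is idle.
    Util follows the slide's example: sum of i * c[i], not normalized by CPU count.
    """
    if abs(sum(c) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    non_idle = 1.0 - c[0]
    if non_idle <= 0.0:
        raise ValueError("TLP is undefined when the system is always idle")
    busy = sum(i * ci for i, ci in enumerate(c))
    return busy, busy / non_idle


# The slide's example: idle half the time, 1 CPU busy 25%, 2 CPUs busy 25%.
util, tlp = utilization_and_tlp([0.5, 0.25, 0.25])
print(util, tlp)
```

Dividing by the non-idle time is what separates TLP from utilization: a machine that is mostly idle can still show high TLP if it uses several CPUs whenever it has work.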
Test Systems
2009 Mac Pro (fast, high-end)
  2x 2.26 GHz Intel Xeon E5520 (8 cores, 16 hardware threads)
  NVIDIA GTX 285/GT120 (240/32 CUDA cores)
  Mac OS 10.6 + Windows 7 x64
ASUS ASRock (slow, cheap)
  1x 1.6 GHz Intel Atom 330 (2 cores, 4 hardware threads)
  NVIDIA ION (16 CUDA cores)
  Windows 7 x64
Machines were chosen to measure effect of system speed on TLP.
Fast system may allow OS to schedule tasks on same core all the time.
Measurement Infrastructure
Developed system-wide monitoring programs for both client OSes
Mac OS X 10.6
  DTrace to track thread context switches
  I/O Kit probing of GPU driver for GPU utilization
Windows 7
  Event Tracing for Windows (ETW) to track thread context switches
  NVPerfKit SDK for GPU utilization
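A context-switch trace of this kind can be turned into the c_i fractions used by the TLP metric. The sketch below is only an illustration under stated assumptions: the event format `(timestamp, cpu, thread_id)`, the idle-thread marker `IDLE_TID`, and the function name `concurrency_fractions` are hypothetical, and they do not reflect the actual output of DTrace, ETW, or the authors' tools.

```python
IDLE_TID = 0  # assumption: thread id 0 stands for the per-CPU idle thread


def concurrency_fractions(events, num_cpus, end_time):
    """Return c, where c[i] is the fraction of time exactly i CPUs were busy.

    events: time-sorted (timestamp, cpu, thread_id) tuples, each meaning
    "thread_id was switched onto cpu at timestamp" (hypothetical format).
    Every CPU is assumed idle before its first event.
    """
    if not events:
        raise ValueError("need at least one context-switch event")
    running = [IDLE_TID] * num_cpus       # thread currently on each CPU
    time_busy = [0.0] * (num_cpus + 1)    # time spent with i CPUs busy
    start = last = events[0][0]
    for timestamp, cpu, thread_id in events:
        busy = sum(1 for tid in running if tid != IDLE_TID)
        time_busy[busy] += timestamp - last   # credit the interval just ended
        running[cpu] = thread_id
        last = timestamp
    busy = sum(1 for tid in running if tid != IDLE_TID)
    time_busy[busy] += end_time - last        # tail interval up to end_time
    total = end_time - start
    if total <= 0:
        raise ValueError("trace must cover a positive time span")
    return [t / total for t in time_busy]


# Toy trace on 2 CPUs: one thread runs from t=0 to t=3, a second from t=1 to t=2.
trace = [(0.0, 0, 101), (1.0, 1, 102), (2.0, 1, IDLE_TID), (3.0, 0, IDLE_TID)]
c = concurrency_fractions(trace, num_cpus=2, end_time=4.0)
tlp = sum(i * ci for i, ci in enumerate(c)) / (1.0 - c[0])
print(c, tlp)
```

Tracking only switch events keeps the cost proportional to the number of context switches rather than to the trace duration.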
Benchmarking
Tested software in six categories
  Games
  Image Authoring
  Office Productivity
  Multimedia Playback
  Video Authoring/CUDA-enabled video authoring
  Web Browsing
Used detailed task sets and input parameters
  Tests performed by user; results fully reproducible, low variance
  Test length was 5 minutes or more
Details, input sets and tracing tools can be found here:
http://itlpbench.eecs.umich.edu
Are threads used?

Benchmark         Threads Created   Avg. Threads Alive
Handbrake 0.9     22511             24
Call of Duty 4    77                44
Photoshop CS4     82                75
Adobe Reader 9    239               24
Quicktime-HD      53                52
Firefox 3.5       522               38

Many threads created
Many threads alive and visible to the OS during runtime
10 Year Comparison
[Figure: bar chart of TLP (y-axis, 0 to 9) per application, with series for 2000 and 2010. X-axis labels as extracted: Quake 2, Crysis, Call of Duty 4, Bioshock, Extreme 3D, Photoshop 4.0.1, Maya3D, Photoshop CS4, Adobe Reader 4.0, PowerPoint 97, Word 97, Excel 97, Adobe Reader 9.0, PowerPoint 2007, Word 2007, Excel 2007, Quicktime 4.0.3, Parallel MPEG HD, QuickTime 7.6, QuickTime 7.6 HD, Premiere 4.2, Handbrake 0.9, PowerDirector v7, IE 5, Netscape 4.05, Firefox 3.5, Safari 4.0. Category groups: Games, Image Authoring, Office, Playback, Video Authoring, Web.]
Requires high performance, progress made in 10 years
Requires high performance, little progress in 10 years
Elsewhere, very little progress in 10 years
Overall TLP Results – Xeon Windows 7
8 Core SMT - System Wide TLP
[Heat map columns 0 (idle) through 16 hardware contexts, shaded from 0% to 100% of time, omitted]

Category / Application    TLP   AVG
Game                            1.7
  Bioshock                1.6
  Call of Duty 4          2.1
  Crysis                  1.4
Image Authoring                 2.2
  Maya3D 2010             2.4
  Photoshop CS4           2.0
Office                          1.2
  Adobe Reader 9          1.3
  Excel 2007              1.2
  PowerPoint 2007         1.2
  Streets 2010            1.4
  Word 2007               1.2
Playback                        1.6
  iTunes 9                1.3
  Quicktime               1.3
  QuicktimeHD             2.1
CUDA                            2.2
  Badaboom                1.3
  PowerDirector v9        3.2
Video Authoring                 6.6
  Handbrake               8.4
  PowerDirector v9        4.8
Web Browsing                    1.6
  Firefox 3.5*            1.5
  Safari 4.0*             1.6
Overall TLP Results – Atom Windows 7
System TLP
[Heat map columns 0 (idle) through 4 hardware contexts, shaded from 0% to 100% of time, omitted]

Category / Application    TLP   AVG
Game                            2.0
  Bioshock                2.2
  Call of Duty 4          2.1
  Crysis                  1.8
Image Authoring                 1.9
  Maya3D 2010             2.1
  Photoshop CS4           1.6
Office                          1.4
  Adobe Reader 9          1.5
  Excel 2007              1.3
  PowerPoint 2007         1.4
  Streets 2010            1.6
  Word 2007               1.3
Playback                        2.1
  iTunes 9                1.5
  Quicktime               2.1
  Quicktime HD            2.6
CUDA                            2.0
  Badaboom                1.3
  PowerDirector v8        2.8
Video Authoring                 3.6
  Handbrake               3.8
  PowerDirector v8        3.3
Web Browsing                    1.7
  Firefox 3.5*            1.6
  Safari 4.0*             1.8
Call of Duty 4: TLP vs Time – Xeon Windows 7
[Figure: per-CPU activity timeline (active vs. idle) for CPU0 through CPU15 over a 3 s window]
75 threads spawned during execution, average of 44 live threads at any time
[Hauser et al., SIGOPS '93]
TLP vs Time – Atom Windows 7
[Figure: per-CPU activity timelines for CPU0 through CPU3 over 3 s windows, one panel for Call of Duty 4 and one for Firefox 3.5]
Firefox 3.5: 522 threads spawned, average of 38 live threads at any time
GPU Measurements - Throughput
[Figure: transcode rate (fps, 0 to 100) vs. number of cores (1, 2, 4, 8) for Badaboom GTX285, Badaboom GT120, Badaboom ION, Handbrake OS X SMT, and Handbrake Atom]
[Lee et al., ISCA'10]
Conclusions
Little change in TLP over ten years
Single-thread speed has little impact on TLP
Single-thread performance is still important
Specific applications do take advantage of resources
Large amounts of silicon are underutilized
Debatable whether aggressively increasing core count is the correct direction
  Can desktop applications be parallelized effectively?
  Would architecture specialization be more beneficial?
Future Directions and Work
Categorizing thread use in desktop applications
Better performance metrics
Perform critical path analysis
Detailed characterization of instruction stream
Motivation
The mobile market is even larger:
  ~1.2 billion units shipped in 2009
  174.1 million units were smartphones
Source: IDC
Firefox 3.5: TLP vs Time (Xeon)
[Figure: activity timeline (active vs. idle) of hardware contexts over a 3 s window]
TLP – OS Comparison
Small differences between Windows and OS X
Applications written originally for a particular platform perform better than the port
[Figure: bar chart of TLP (0 to 10) on Windows vs. OS X for Call of Duty 4, Maya3D 2010, Photoshop CS4, Adobe Reader 9, Excel, PowerPoint, Word, iTunes 9, Quicktime, Quicktime - HD, Handbrake 0.9, Firefox 3.5, and Safari 4.0; category groups: Game, Image, Office, Video, Web]
Discussion - Atom
Reduced-performance cores do not appreciably increase TLP
Lack of TLP appears to be due more to software design
Single-thread performance is still important for desktop/laptop
Many slow cores instead of a few fast cores may be a bad fit for the desktop/laptop space
Overall TLP Results - Xeon
8 Core SMT - System Wide TLP. Columns 0 (idle) through 16 give the percentage of time that many hardware contexts were active.

Application          0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16   TLP     σ  GPU Util
Game
Bioshock             1  57  31   7   1   2   1   0   0   0   0   0   0   0   0   0   0   1.6  0.05       75%
Call of Duty 4       0  12  35  20  14   8   7   2   1   0   0   0   0   0   0   0   0   2.1  0.21       86%
Crysis               1  72  23   4   1   0   0   0   0   0   0   0   0   0   0   0   0   1.4  0.07       84%
Image Authoring
Maya3D 2010         55  34   6   1   0   0   0   0   1   0   0   0   0   0   0   0   2   2.4  0.53       18%
Photoshop CS4       43  43   7   1   0   0   0   0   3   0   0   0   0   0   0   0   1   2.0  0.56       17%
Office
Adobe Reader 9      65  25   8   1   0   0   0   0   0   0   0   0   0   0   0   0   0   1.3  0.05       23%
Excel 2007          72  23   4   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1.2  0.02       10%
PowerPoint 2007     69  25   5   1   0   0   0   0   0   0   0   0   0   0   0   0   0   1.2  0.03       16%
Streets 2010        68  23   7   1   0   0   0   0   0   0   0   0   0   0   0   0   0   1.4  0.01       14%
Word 2007           74  22   4   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1.2  0.04       16%
Playback
iTunes 9            71  23   5   1   0   0   0   0   0   0   0   0   0   0   0   0   0   1.3  0.16       22%
Quicktime           50  38  10   2   0   0   0   0   0   0   0   0   0   0   0   0   0   1.3  0.01       43%
QuicktimeHD         66  22   5   1   1   0   0   0   3   0   0   0   0   0   0   0   0   2.1  0.06       40%
CUDA
Badaboom            54  35   9   2   0   0   0   0   0   0   0   0   0   0   0   0   0   1.3  0.03       95%
PowerDirector v9    42  20  12   6   5   4   3   2   3   1   0   0   0   0   0   0   0   3.2  0.52       28%
Video Authoring
Handbrake            1   0   0   0   1   3   9  17  22  20  14   8   4   1   0   0   0   8.4  0.02        8%
PowerDirector v9    27  20  11   6   4   3   2   6   8   5   5   3   0   0   0   0   0   4.8  0.15       18%
Web Browsing
Firefox 3.5*        66  24   6   1 ### ### ### ###   1 ### ### ### ### ### ### ### ###   1.5  0.05       24%
Safari 4.0*         50  34  11   3   1 ### ### ###   1 ### ### ### ### ### ### ### ###   1.6  0.06       24%
Overall TLP Results - Atom
System TLP. Columns 0 (idle) through 4 give the percentage of time that many hardware contexts were active.

Application          0   1   2   3   4   TLP     σ
Game
Bioshock             2  25  35  27  11   2.2  0.04
Call of Duty 4       1  37  27  26  10   2.1  0.05
Crysis               0  39  44  14   2   1.8  0.02
Image Authoring
Maya3D 2010         20  41  13   5  21   2.1  0.05
Photoshop CS4        8  59  17   6  10   1.6  0.11
Office
Adobe Reader 9      40  36  19   5   1   1.5  0.03
Excel 2007          45  40  12   2   0   1.3  0.01
PowerPoint 2007     38  42  16   4   1   1.4  0.01
Streets 2010        39  34  20   5   2   1.6  0.02
Word 2007           35  49  13   3   0   1.3  0.01
Playback
iTunes 9            24  45  24   6   1   1.5  0.09
Quicktime            4  28  39  23   6   2.1  0.10
Quicktime HD        11  19  22  22  26   2.6  0.01
CUDA
Badaboom            68  23   9   1   0   1.3  0.04
PowerDirector v8     8  17  20  23  32   2.8  0.07
Video Authoring
Handbrake            0   0   2  10  88   3.8  0.04
PowerDirector v8     3   8  11  19  58   3.3  0.04
Web Browsing
Firefox 3.5*        25  42  19   9   5   1.6  0.04
Safari 4.0*         23  35  21  12   8   1.8  0.08
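The TLP column in the Atom results can be cross-checked against the per-context percentages. The sketch below copies four rows from those results and recomputes TLP as the weighted count of active contexts divided by the non-idle share of time. The names `ATOM_ROWS` and `tlp_from_percentages` are illustrative, and dividing by the sum of the non-idle columns (rather than 100 minus the idle column) is an assumption made here because the rounded percentages do not always sum to exactly 100.

```python
# Rows copied from the Atom results: percent of time with 0..4 hardware
# contexts active, followed by the TLP value reported on the slide.
ATOM_ROWS = {
    "Bioshock": ([2, 25, 35, 27, 11], 2.2),
    "Excel 2007": ([45, 40, 12, 2, 0], 1.3),
    "Quicktime HD": ([11, 19, 22, 22, 26], 2.6),
    "Handbrake": ([0, 0, 2, 10, 88], 3.8),
}


def tlp_from_percentages(pct):
    """Recompute TLP from a row of percentages; pct[0] is the idle share."""
    active = sum(pct[1:])  # non-idle share, tolerant of rounding in the table
    if active == 0:
        raise ValueError("TLP is undefined for an always-idle row")
    return sum(i * p for i, p in enumerate(pct)) / active


for name, (pct, reported) in ATOM_ROWS.items():
    print(f"{name}: recomputed {tlp_from_percentages(pct):.2f}, reported {reported}")
```

For these rows the recomputed values agree with the reported ones to within about 0.1, which is consistent with the percentages having been rounded to whole numbers.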
Discussion
Many threads, but few used concurrently
Lack of concurrency appears to be due to software design issues
Single-thread performance is still important
Underutilized GPU may offer additional opportunities
Unlikely that programmers will quickly take advantage of multi-cores
Focus on desktop/laptop applications should be greater
  Well-understood programs like video transcoding are already parallel
  Others, like web browsers, use only 1 to 2 cores