ISPASS 2011
Characterizing Multi-threaded Applications based on Shared-Resource Contention
Tanima Dey, Wei Wang, Jack W. Davidson, Mary L. Soffa
Department of Computer Science, University of Virginia
Dec 14, 2015
Motivation
The number of cores doubles every 18 months
Expected: performance proportional to the number of cores
One of the bottlenecks is shared-resource contention
For multi-threaded workloads, contention is unavoidable
To reduce contention, it is necessary to understand where and how the contention is created
Shared Resource Contention in Chip-Multiprocessors
[Figure: Intel Quad Core Q9550 with four cores (C0 to C3), four L1 caches, two L2 caches, the Front-Side Bus, and memory.]
Scenario 1: Multi-threaded applications with co-runner
[Figure: four-core platform (C0 to C3, four L1 caches, two L2 caches, memory) showing the placement of Application 1 threads and Application 2 threads.]
Scenario 2: Multi-threaded applications without co-runner
[Figure: four-core platform (C0 to C3, four L1 caches, two L2 caches, memory) showing the placement of the threads of a single application.]
Shared-Resource Contention
Intra-application contention: contention among threads from the same application (no co-runners)
Inter-application contention: contention among threads from co-running applications
Contributions
A general methodology to evaluate a multi-threaded application's performance
  Intra-application contention
  Inter-application contention
  Contention in the memory-hierarchy shared resources
Characterizing applications facilitates better understanding of the application's resource sensitivity
Thorough performance analyses and characterization of multi-threaded PARSEC benchmarks
Outline
  Motivation
  Contributions
  Methodology
  Measuring intra-application contention
  Measuring inter-application contention
  Related Work
  Summary
Methodology
Designed to measure both intra- and inter-application contention for a targeted shared resource
  L1-cache, L2-cache
  Front-Side Bus (FSB)
Each application is run in two configurations
  Baseline: threads do not share the targeted resource
  Contention: threads share the targeted resource
Multiple instances of the targeted resource
Determine contention by comparing performance (gathered from hardware performance counter values)
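The two configurations can be illustrated with a minimal Python sketch that picks core IDs for a baseline run (no two threads under the same instance of the targeted resource) and a contention run (all threads under one instance). The topology map `L2_OF_CORE` and both function names are hypothetical: the slides show four cores and two L2 caches but do not state which cores share which cache, so the pairing below is an assumption for illustration only.

```python
# Hypothetical topology: which L2 cache instance each core sits under.
# The exact pairing is an assumption, not taken from the slides.
L2_OF_CORE = {0: "L2_a", 1: "L2_a", 2: "L2_b", 3: "L2_b"}


def group_by_resource(resource_of_core):
    """Map each resource instance to the sorted list of cores under it."""
    groups = {}
    for core, resource in sorted(resource_of_core.items()):
        groups.setdefault(resource, []).append(core)
    return groups


def contention_placement(resource_of_core, n_threads):
    """Choose cores that all share ONE instance of the targeted resource."""
    for cores in group_by_resource(resource_of_core).values():
        if len(cores) >= n_threads:
            return cores[:n_threads]
    raise ValueError("no resource instance has enough cores")


def baseline_placement(resource_of_core, n_threads):
    """Choose one core per resource instance so no two threads share it."""
    groups = group_by_resource(resource_of_core)
    if len(groups) < n_threads:
        raise ValueError("not enough resource instances for a baseline run")
    return [cores[0] for cores in list(groups.values())[:n_threads]]


contended = contention_placement(L2_OF_CORE, 2)  # both threads under L2_a
baseline = baseline_placement(L2_OF_CORE, 2)     # one thread per L2 cache
```

The returned core IDs could then be handed to whatever pinning mechanism the platform offers (on Linux, for example, `os.sched_setaffinity`); the sketch deliberately stops short of launching or pinning anything.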
Outline
  Motivation
  Contributions
  Methodology
  Measuring intra-application contention (see paper)
  Measuring inter-application contention
  Related Work
  Summary
Measuring inter-application contention: L1-cache
[Figure: Baseline Configuration and Contention Configuration on the four-core platform (C0 to C3, four L1 caches, two L2 caches, memory), showing the placement of Application 1 threads and Application 2 threads.]
Measuring inter-application contention: L2-cache
[Figure: Baseline Configuration and Contention Configuration on the four-core platform (C0 to C3, four L1 caches, two L2 caches, memory), showing the placement of Application 1 threads and Application 2 threads.]
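Measuring inter-application contention: FSB
[Figure: Baseline Configuration on the eight-core platform (cores C0, C2, C4, C6 and cores C1, C3, C5, C7, each group with four L1 caches and two L2 caches, sharing memory), showing the placement of Application 1 threads and Application 2 threads.]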
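Measuring inter-application contention: FSB
[Figure: Contention Configuration on the eight-core platform (cores C0, C2, C4, C6 and cores C1, C3, C5, C7, each group with four L1 caches and two L2 caches, sharing memory), showing the placement of Application 1 threads and Application 2 threads.]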
PARSEC Benchmarks
Application Domain    Benchmark(s)
Financial Analysis    Blackscholes (BS), Swaptions (SW)
Computer Vision       Bodytrack (BT)
Engineering           Canneal (CN)
Enterprise Storage    Dedup (DD)
Animation             Facesim (FA), Fluidanimate (FL)
Similarity Search     Ferret (FE)
Rendering             Raytrace (RT)
Data Mining           Streamcluster (SC)
Media Processing      Vips (VP), X264 (X2)
Experimental platform
Platform 1: Yorkfield
Intel Quad core Q9550
32 KB L1-D and L1-I cache
6 MB L2-cache
2 GB Memory
Common FSB
[Figure: Yorkfield block diagram with cores C0 to C3, each with an L1 cache and an L1 hardware prefetcher (L1 HW-PF); two L2 caches, each with an L2 hardware prefetcher (L2 HW-PF) and an FSB interface; the FSB; the Memory Controller Hub (Northbridge); MB; and memory.]
Experimental platform
Platform 2: Harpertown
[Figure: Harpertown block diagram with two groups of four cores (C0, C2, C4, C6 and C1, C3, C5, C7). Each core has an L1 cache and an L1 hardware prefetcher (L1 HW-PF); each group has two L2 caches, each with an L2 hardware prefetcher (L2 HW-PF) and an FSB interface. Two FSBs connect to the Memory Controller Hub (Northbridge), which connects to memory via MB.]
Performance Analysis
Inter-application contention
For the i-th co-runner:
\[
\mathrm{PercentPerformanceDifference}_i = \frac{(\mathrm{PerformanceBase}_i - \mathrm{PerformanceContend}_i) \times 100}{\mathrm{PerformanceBase}_i}
\]
Absolute performance difference sum:
\[
\mathrm{APDS} = \sum_i \left| \mathrm{PercentPerformanceDifference}_i \right|
\]
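As a worked illustration of the two formulas above, the following Python sketch computes the percent performance difference for each co-runner and then the APDS. The function names, the co-runner labels, and the measurement values are all made up for the example; the slides do not say which performance metric or units are used, so the numbers stand in for whatever counter-derived metric is chosen.

```python
def percent_performance_difference(perf_base, perf_contend):
    """(PerformanceBase - PerformanceContend) * 100 / PerformanceBase."""
    if perf_base == 0:
        raise ValueError("baseline performance must be non-zero")
    return (perf_base - perf_contend) * 100.0 / perf_base


def apds(measurements):
    """Sum of absolute percent differences over all co-runners.

    `measurements` is an iterable of (perf_base, perf_contend) pairs,
    one pair per co-runner.
    """
    return sum(abs(percent_performance_difference(base, contend))
               for base, contend in measurements)


# Hypothetical (baseline, contention) values for two co-runners.
measurements = {
    "corunner_a": (200.0, 190.0),  # difference of  5.0 %
    "corunner_b": (100.0, 104.0),  # difference of -4.0 %
}

diffs = {name: percent_performance_difference(base, contend)
         for name, (base, contend) in measurements.items()}
total = apds(measurements.values())  # |5.0| + |-4.0| = 9.0
```

Taking absolute values means that positive and negative differences do not cancel, so the APDS reflects how strongly the co-runners affect the measured application overall, regardless of direction.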
Inter-application contention: L1-cache, for Streamcluster
[Figure: bar chart "Inter-application L1-cache Contention". x-axis: co-running benchmarks (Blackscholes, Bodytrack, Canneal, Dedup, Facesim, Ferret, Fluidanimate, Raytrace, Swaptions, Vips, X264). y-axis: Performance Difference (%), from -8 to 8.]
Inter-application L1-cache contention: Streamcluster
[Figure: bar chart "Inter-application L1-cache Contention". x-axis: co-running benchmarks (Blackscholes, Bodytrack, Canneal, Dedup, Facesim, Ferret, Fluidanimate, Raytrace, Streamcluster, Swaptions, Vips, X264). y-axis: Performance Difference (%), from -8 to 8.]
Characterization
Benchmarks      L1-cache   L2-cache       FSB
Blackscholes    none       none           none
Bodytrack       inter      inter          intra
Canneal         intra      inter          intra
Dedup           inter      intra, inter   intra, inter
Facesim         inter      inter          intra
Ferret          intra      intra, inter   intra
Fluidanimate    inter      inter          intra
Raytrace        none       none           intra
Streamcluster   inter      inter          intra
Swaptions       none       none           none
Vips            intra      inter          inter
X264            inter      intra, inter   intra
Summary
The methodology generalizes contention analysis of multi-threaded applications
New approach to characterize applications
Useful for performance analysis of existing and future architectures or benchmarks
Helpful for creating new workloads of diverse properties
Provides insights for designing improved contention-aware scheduling methods
Related Work
Cache contention
  Knauerhase et al., IEEE Micro 2008
  Zhuravlev et al., ASPLOS 2010
  Xie et al., CMP-MSI 2008
  Mars et al., HiPEAC 2011
Characterizing parallel workload
  Jin et al., NASA Technical Report 2009
PARSEC benchmark suite
  Bienia et al., PACT 2008
  Bhadauria et al., IISWC 2009