Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly
Koushik Chakraborty, Philip Wells, Gurindar Sohi
{kchak,pwells,sohi}@cs.wisc.edu
Jan 07, 2016
Chakraborty, Wells, and Sohi ASPLOS 2006 2
Paper Overview
Multiprocessor Code Reuse
  Poor resource utilization
Computation Spreading
  New model for assigning computation within a program on CMP cores in H/W
Case Study: OS and User computation
  Investigate performance characteristics
Talk Outline
Motivation
Computation Spreading (CSP)
Case study: OS and User computation
Implementation
Results
Related Work and Summary
Homogeneous CMP
Many existing systems are homogeneous
  Sun Niagara, IBM Power 5, Intel Xeon MP
Multithreaded server application
  Composed of server threads
  Typically each thread handles a client request
  OS assigns software threads to cores
    • Entire computation from one thread executes on a single core (barring migration)
Code Reuse
Many client requests are similar
  Similar service across multiple threads
  Same code path traversed in multiple cores
Instruction footprint classification
  Exclusive – single core access
  Common – many cores access
  Universal – all cores access
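The three footprint classes can be illustrated with a minimal sketch. The trace format (pairs of core ID and instruction block address) and the function name are hypothetical, and reading "Common" as "more than one core but not all" is an assumption, since the slide only says "many cores".

```python
from collections import defaultdict

def classify_footprint(accesses, num_cores):
    """Classify instruction blocks by how many distinct cores fetched them.

    accesses: iterable of (core_id, block_address) pairs (hypothetical format).
    Returns a dict: block_address -> "exclusive" | "common" | "universal".
    """
    cores_per_block = defaultdict(set)
    for core_id, block in accesses:
        cores_per_block[block].add(core_id)

    classes = {}
    for block, cores in cores_per_block.items():
        if len(cores) == 1:
            classes[block] = "exclusive"   # single core access
        elif len(cores) == num_cores:
            classes[block] = "universal"   # all cores access
        else:
            classes[block] = "common"      # assumed: several cores, but not all
    return classes

# Toy trace on a 3-core system.
trace = [(0, 0x100), (1, 0x100), (2, 0x100), (0, 0x140), (1, 0x140), (2, 0x180)]
result = classify_footprint(trace, num_cores=3)
```

In this toy trace, block 0x100 is fetched by every core, 0x140 by two of the three, and 0x180 by one core only.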
Multiprocessor Code Reuse
Implications
Lack of instruction stream specialization
  Redundancy in predictive structures
    • Poor capacity utilization
  Destructive interference
No synergy among multiple cores
  Lost opportunity for co-operation
Exploit core proximity in CMP
Computation Spreading (CSP)
Computation fragment = dynamic instruction stream portion
Collocate similar computation fragments from multiple threads
  Enhance constructive interference
Distribute dissimilar computation fragments from a single thread
  Reduce destructive interference
Reassignment is the key
Example
[Figure: threads T1, T2, and T3 execute the fragment sequences (A1, B1, C1), (B2, C2, A2), and (C3, A3, B3) over time. Two assignments of these fragments to processors P1, P2, and P3 are shown: CANONICAL and CSP.]
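The contrast between the two assignments in this example can be sketched as follows. This is an illustrative model only: the function names are hypothetical, the letter of each label is assumed to identify the kind of computation and the digit the thread, and the sketch ignores timing and only shows which core each fragment lands on.

```python
def canonical_schedule(threads):
    """Canonical: every fragment of a thread runs on that thread's own core."""
    return {core: list(frags) for core, frags in enumerate(threads)}

def csp_schedule(threads, kind_of):
    """CSP: fragments of the same kind share a core, whatever thread they come from."""
    kinds = sorted({kind_of(f) for frags in threads for f in frags})
    core_of_kind = {kind: core for core, kind in enumerate(kinds)}
    schedule = {core: [] for core in core_of_kind.values()}
    for frags in threads:
        for f in frags:
            schedule[core_of_kind[kind_of(f)]].append(f)
    return schedule

# Fragment labels from the example: letter = kind of computation, digit = thread.
threads = [["A1", "B1", "C1"], ["B2", "C2", "A2"], ["C3", "A3", "B3"]]
canonical = canonical_schedule(threads)
csp = csp_schedule(threads, kind_of=lambda f: f[0])
```

Under the canonical assignment each core sees all three kinds of fragment; under the CSP assignment each core sees only one kind, which is the specialization the talk is after.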
Key Aspects
Dynamic Specialization
  Homogeneous multicore acquires specialization via retaining mutually exclusive predictive state
Data Locality
  Data dependencies between different computation fragments
  Careful fragment selection to avoid loss of data locality
Selecting Fragments
Server workload characteristics
  Large data and instruction footprint
  Significant OS computation
User Computation and OS Computation
  A natural separation
  Exclusive instruction footprints
  Relatively independent data footprint
Data Communication
[Figure: threads T1 and T2 are split into T1-User, T1-OS, T2-User, and T2-OS computation, mapped onto Core 1 and Core 2.]
Relative Inter-core Data Communication
[Figure: relative inter-core data communication; panels: Apache, OLTP]
OS-User Communication is limited
Implementation
Migrating Computation
  Transfer state through the memory subsystem
    • ~2KB of register state in SPARC V9
    • Memory state through coherence
Lightweight Virtual Machine Monitor
  Migrates computation as dictated by the CSP Policy
  Implemented in hardware/firmware
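A rough feel for the register-state traffic can be had from a small sketch. The ~2KB figure comes from the slide; the trace format, the function name, and the assumption that every user/OS transition triggers exactly one migration are illustrative only and may not match the actual mechanism.

```python
REGISTER_STATE_BYTES = 2 * 1024  # "~2KB of register state in SPARC V9" (from the slide)

def migration_cost(mode_trace):
    """Count migrations and register-state bytes moved for one virtual CPU.

    mode_trace: sequence of "user" / "os" labels, one per computation fragment
    (hypothetical format). Assumption: each user<->OS transition causes one
    migration, and each migration moves the full register state.
    """
    migrations = sum(1 for prev, cur in zip(mode_trace, mode_trace[1:]) if prev != cur)
    return migrations, migrations * REGISTER_STATE_BYTES

migrations, bytes_moved = migration_cost(["user", "os", "os", "user", "os"])
```

Memory state is not modeled here, since the slide says it moves on demand through coherence rather than as an explicit transfer.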
Implementation (cont.)
[Figure: software stack (threads), virtual CPUs, and physical cores. Labels: Baseline, User Cores, OS Cores, User Comp, OS Comp.]
Implementation (cont.)
[Figure: software stack (threads), virtual CPUs, and physical cores. Labels: User Cores, OS Cores.]
CSP Policy
Policy dictates computation assignment
Thread Assignment Policy (TAP)
  Maintains affinity between VCPUs and physical cores
Syscall Assignment Policy (SAP)
  OS computation assigned based on system calls
TAP and SAP use identical assignment for user computation
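The difference between the two policies can be sketched as a pair of mapping functions. Everything concrete here is an assumption for illustration: the 4/4 split of an 8-core CMP into user and OS cores, the modulo mappings, and the function names are not taken from the slides.

```python
USER_CORES = [0, 1, 2, 3]  # assumed split of an 8-core CMP
OS_CORES = [4, 5, 6, 7]

def user_core(vcpu):
    """User computation: same VCPU-based mapping under both TAP and SAP."""
    return USER_CORES[vcpu % len(USER_CORES)]

def tap_os_core(vcpu, syscall):
    """TAP sketch: OS computation keeps VCPU affinity; the syscall is ignored."""
    return OS_CORES[vcpu % len(OS_CORES)]

def sap_os_core(vcpu, syscall):
    """SAP sketch: OS computation is placed by system call number, not by VCPU."""
    return OS_CORES[syscall % len(OS_CORES)]

# VCPU 5 issuing (hypothetical) system call number 3:
placement = (user_core(5), tap_os_core(5, 3), sap_os_core(5, 3))
```

Under the SAP sketch, the same system call issued from different VCPUs lands on the same OS core, which is what lets that core specialize for that call's code path.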
Simulation Methodology
Virtutech SIMICS MAI running Solaris 9
CMP system: 8 out-of-order processors
  2 wide, 8 stages, 128 entry ROB, 3GHz
3 level memory hierarchy
  Private L1 and L2
  Directory-based MOSI
  L3: Shared, Exclusive 8MB (16w) (75 cycle load-to-use)
  Point to point ordered interconnect (25 cycle latency)
  Main Memory 255 cycle load-to-use, 40GB/s
Measure impact on predictive structures
L2 Instruction Reference
Result Summary
Branch predictors
  9-25% reduction in mis-predictions
L2 data references
  0-19% reduction in load misses
  Moderate increase in store misses
Interconnect messages
  Moderate reduction (after accounting for extra messages for migration)
Performance Potential
[Figure: performance potential; axis label: Migration Overhead]
Related Work
Software re-design: staged execution
  Cohort Scheduling [Larus and Parkes 01], STEPS [Ailamaki 04], SEDA [Welsh 01], LARD [Pai 98]
  CSP: similar execution in hardware
OS and User Interference [several]
  Structural separation to avoid interference
  CSP avoids interference and exploits synergy
Summary
Extensive code reuse in CMPs
  45-66% instruction blocks universally accessed in server workloads
Computation Spreading
  Localize similar computation and separate dissimilar computation
  Exploits core proximity in CMPs
Case Study: OS and User computation
  Demonstrate substantial performance potential
Thank You!
Backup Slides
L2 Data Reference
L2 load miss comparable, slight to moderate increase in L2 store miss
Multiprocessor Code Reuse
Performance Potential