Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly


Koushik Chakraborty, Philip Wells, Gurindar Sohi

{kchak,pwells,sohi}@cs.wisc.edu

Chakraborty, Wells, and Sohi, ASPLOS 2006

Paper Overview

Multiprocessor Code Reuse
• Poor resource utilization

Computation Spreading
• New model for assigning computation within a program to CMP cores, in hardware
• Case study: OS and User computation

Investigate performance characteristics


Talk Outline

• Motivation
• Computation Spreading (CSP)
• Case study: OS and User computation
• Implementation
• Results
• Related Work and Summary


Homogeneous CMP

Many existing systems are homogeneous

Sun Niagara, IBM Power 5, Intel Xeon MP

Multithreaded server application
• Composed of server threads
• Typically each thread handles a client request
• OS assigns software threads to cores
• Entire computation from one thread executes on a single core (barring migration)


Code Reuse

Many client requests are similar
• Similar service across multiple threads
• Same code path traversed in multiple cores

Instruction footprint classification
• Exclusive – single core access
• Common – many cores access
• Universal – all cores access
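The three-way classification above can be sketched in a few lines. This is only an illustration, not the measurement code behind the talk: the trace format, the function name `classify_footprint`, and the reading of "Common" as "more than one core but not all" are assumptions.

```python
from collections import defaultdict

def classify_footprint(accesses, num_cores):
    """Label each instruction block by how many cores fetched it.

    accesses: iterable of (core_id, block_addr) pairs (hypothetical trace format).
    Assumption: "common" means more than one core but fewer than all cores.
    """
    cores_per_block = defaultdict(set)
    for core, block in accesses:
        cores_per_block[block].add(core)

    labels = {}
    for block, cores in cores_per_block.items():
        if len(cores) == 1:
            labels[block] = "exclusive"
        elif len(cores) == num_cores:
            labels[block] = "universal"
        else:
            labels[block] = "common"
    return labels

# Toy trace on a 4-core system: block 0x100 is fetched by every core,
# block 0x200 by two cores, block 0x300 by one core only.
trace = [(0, 0x100), (1, 0x100), (2, 0x100), (3, 0x100),
         (0, 0x200), (1, 0x200), (2, 0x300)]
labels = classify_footprint(trace, num_cores=4)
```

The fraction of blocks labelled "universal" is the quantity the code-reuse slides report.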


Multiprocessor Code Reuse


Implications

Lack of instruction stream specialization

Redundancy in predictive structures
• Poor capacity utilization

Destructive interference

No synergy among multiple cores
• Lost opportunity for co-operation
• Exploit core proximity in CMP






Computation Spreading (CSP)

Computation fragment = dynamic instruction stream portion

Collocate similar computation fragments from multiple threads
• Enhance constructive interference

Distribute dissimilar computation fragments from a single thread
• Reduce destructive interference

Reassignment is the key
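A minimal sketch of the reassignment idea, assuming each fragment is tagged with a (thread, kind) pair and that there are at least as many cores as fragment kinds. The function names and the kind-to-core mapping are hypothetical; the slides do not specify how fragments are identified or mapped.

```python
def canonical_assignment(fragments, num_cores):
    """Canonical model: every fragment of a thread runs on that thread's core."""
    return {(t, k): t % num_cores for (t, k) in fragments}

def csp_assignment(fragments, num_cores):
    """CSP-style model: fragments of the same kind are collocated on one core.

    Assumption: kinds are mapped to cores in sorted order, wrapping around
    if there are more kinds than cores.
    """
    kinds = sorted({k for _, k in fragments})
    core_of_kind = {k: i % num_cores for i, k in enumerate(kinds)}
    return {(t, k): core_of_kind[k] for (t, k) in fragments}

def kinds_per_core(assignment):
    """Which fragment kinds does each core end up executing?"""
    seen = {}
    for (_, kind), core in assignment.items():
        seen.setdefault(core, set()).add(kind)
    return seen

# Three threads, each made of fragments of kinds A, B and C.
fragments = [(t, k) for t in range(3) for k in "ABC"]
canonical = canonical_assignment(fragments, num_cores=3)
csp = csp_assignment(fragments, num_cores=3)
```

Under the canonical assignment every core executes all three kinds; under the CSP-style assignment each core sees a single kind, so its predictive structures hold only that kind's state.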


Example

[Figure: threads T1, T2, and T3 each execute fragments of types A, B, and C (A1 through C3). Timelines over cores P1, P2, and P3 compare the CANONICAL assignment with the CSP assignment; the axis is time.]


Key Aspects

Dynamic Specialization
• Homogeneous multicore acquires specialization via retaining mutually exclusive predictive state

Data Locality
• Data dependencies between different computation fragments
• Careful fragment selection to avoid loss of data locality


Selecting Fragments

Server workload characteristics
• Large data and instruction footprint
• Significant OS computation

User Computation and OS Computation
• A natural separation
• Exclusive instruction footprints
• Relatively independent data footprint


Data Communication

[Figure: threads T1 and T2 split into user and OS fragments (T1-User, T1-OS, T2-User, T2-OS), placed on Core 1 and Core 2.]


Relative Inter-core Data Communication

[Figure: results for Apache and OLTP.]

OS-User Communication is limited






Implementation

Migrating Computation
• Transfer state through the memory subsystem
  • ~2KB of register state in SPARC V9
  • Memory state through coherence

Lightweight Virtual Machine Monitor
• Migrates computation as dictated by the CSP policy
• Implemented in hardware/firmware


Implementation (cont.)

[Figure, two slides: threads in the software stack run on virtual CPUs, which map to physical cores. Labels: Baseline, User Cores, OS Cores, User Comp, OS Comp.]


CSP Policy

Policy dictates computation assignment

Thread Assignment Policy (TAP)
• Maintains affinity between VCPUs and physical cores

Syscall Assignment Policy (SAP)
• OS computation assigned based on system calls

TAP and SAP use identical assignment for user computation
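The two policies can be sketched as mapping functions. Everything concrete here is an assumption made for illustration: the 4/4 split of the 8 cores into user and OS cores, the modulo mappings, and the names `tap`, `sap`, `USER_CORES`, and `OS_CORES`. The slides state only that TAP keeps VCPU-to-core affinity, that SAP assigns OS computation by system call, and that both treat user computation identically.

```python
USER_CORES = [0, 1, 2, 3]   # hypothetical partition of an 8-core CMP
OS_CORES = [4, 5, 6, 7]

def user_core(vcpu):
    """User computation: same mapping under both policies."""
    return USER_CORES[vcpu % len(USER_CORES)]

def tap(vcpu, mode, syscall=None):
    """Thread Assignment Policy sketch: OS work follows the VCPU's affinity."""
    if mode == "user":
        return user_core(vcpu)
    return OS_CORES[vcpu % len(OS_CORES)]

def sap(vcpu, mode, syscall=None):
    """Syscall Assignment Policy sketch: OS work is placed by system call number."""
    if mode == "user":
        return user_core(vcpu)
    if syscall is None:
        raise ValueError("SAP needs a system call number for OS computation")
    return OS_CORES[syscall % len(OS_CORES)]

# VCPU 5 and VCPU 2 both issue (hypothetical) system call number 3.
tap_cores = (tap(5, "os", syscall=3), tap(2, "os", syscall=3))
sap_cores = (sap(5, "os", syscall=3), sap(2, "os", syscall=3))
```

In this sketch TAP sends the two VCPUs' OS work to different cores, while SAP collocates the same system call from both VCPUs on one core.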






Simulation Methodology

Virtutech SIMICS MAI running Solaris 9

CMP system: 8 out-of-order processors
• 2 wide, 8 stages, 128 entry ROB, 3GHz

3 level memory hierarchy
• Private L1 and L2
• Directory-based MOSI
• L3: shared, exclusive, 8MB (16-way), 75 cycle load-to-use
• Point-to-point ordered interconnect (25 cycle latency)
• Main memory: 255 cycle load-to-use, 40GB/s

Measure impact on predictive structures


L2 Instruction Reference


Result Summary

Branch predictors
• 9-25% reduction in mispredictions

L2 data references
• 0-19% reduction in load misses
• Moderate increase in store misses

Interconnect messages
• Moderate reduction (after accounting for the extra messages due to migration)


Performance Potential

Migration Overhead






Related Work

Software re-design: staged execution
• Cohort Scheduling [Larus and Parkes 01], STEPS [Ailamaki 04], SEDA [Welsh 01], LARD [Pai 98]
• CSP: similar execution in hardware

OS and User Interference [several]
• Structural separation to avoid interference
• CSP avoids interference and exploits synergy


Summary

Extensive code reuse in CMPs
• 45-66% of instruction blocks universally accessed in server workloads

Computation Spreading
• Localize similar computation and separate dissimilar computation
• Exploits core proximity in CMPs

Case Study: OS and User computation
• Demonstrates substantial performance potential


Thank You!


Backup Slides


L2 Data Reference

L2 load misses comparable; slight to moderate increase in L2 store misses


Multiprocessor Code Reuse


Performance Potential
