Page 1: Xian-He Sun C-AMAT : Concurrent Average Memory Access Time Xian-He Sun April , 2015 Illinois Institute of Technology sun@iit.edu With Yuhang Liu and Dawei.

Xian-He Sun

C-AMAT : Concurrent Average Memory Access Time

Xian-He Sun

April 2015

Illinois Institute of Technology

sun@iit.edu

With Yuhang Liu and Dawei Wang


Outline

Motivation

Memory System and Metrics

C-AMAT: Definition and Contribution

Experimental Design and Verification

Application and Related Work

Conclusion

References

X.-H. Sun and D. Wang, "Concurrent Average Memory Access Time," IEEE Computer, vol. 47, no. 5, pp. 74-80, May 2014.

D. Wang and X.-H. Sun, "APC: A Novel Memory Metric and Measurement Methodology for Modern Memory System," IEEE Transactions on Computers, vol. 63, no. 7, pp. 1626-1639, 2014.


Motivation

Processors are 400x faster than memory, and applications are becoming more data intensive

Data access becomes THE performance bottleneck of high-end computing

Many concurrency-based technologies have been developed to improve data access speed, but their impact on final performance is elusive and, therefore, they are not fully utilized

Existing memory optimization strategies are still primarily based on the sequential single-access assumption


Memory Wall Problem

[Figure: Processor-DRAM memory gap ("Moore's Law"). µProc: 1.52/yr (2X/1.5 yr); µProc: 1.20/yr; DRAM: 7%/yr (2X/10 yrs). Processor-memory performance gap grows 50% per year.]

• 1980: no cache in microprocessors; 2010: 3-level cache on chip, 4-level cache off chip
• 1989: the first Intel processor with on-chip L1 cache was the Intel 486 (8KB)
• 1995: the first Intel processor with on-chip L2 cache was the Intel Pentium Pro (256KB)
• 2003: the first Intel processor with on-chip L3 cache was the Intel Itanium 2 (6MB)

Source: Computer Architecture: A Quantitative Approach


Extremely Unbalanced Operation Latency

[Figure: operation latency in cycles; IO access: 5~15M cycles]


Data Access becomes Performance Bottleneck

GROMACS (molecular dynamics). Source: Gromacs

MPQC (Massively Parallel Quantum Chemistry). Source: MPQC

Multi-Grid solver (CFD). Source: Multi-grid solver

Microstructure


Data Access becomes Performance Bottleneck

Computational Fluid Dynamics

Data mining

Computational Finance

Adaptive Multigrid


Solution: Memory Hierarchy

Level           Capacity     Access Time
CPU Registers   <8KB         <0.2~0.5 ns
L1 Cache        <128KB       0.5-1 ns
L2 Cache        <50MB        1-10 ns
Main Memory     Gigabytes    50-100 ns
Disk            Terabytes    5 ms

Staging transfer unit between adjacent levels (managed by, size):

Registers <-> L1 Cache: instruction operands (prog./compiler, 1-8 bytes)
L1 Cache <-> L2 Cache: blocks (L1 cache cntl, 32-128 bytes)
L2 Cache <-> Memory: blocks (L2 cache cntl, 32-128 bytes)
Memory <-> Disk: pages (OS, 4K-4M bytes)

Upper levels are faster; lower levels are larger


Data Access Concurrency Exists


Solution: Memory Hierarchy & Parallelism

CPU: Multi-core, Multi-threading, Multi-issue; Out-of-order Execution, Speculative Execution, Runahead Execution

Cache: Multi-banked Cache, Multi-level Cache; Pipelined Cache, Non-blocking Cache, Data Prefetching, Write buffer

Memory: Multi-channel, Multi-rank, Multi-bank

Disks, Input-Output (I/O): Parallel File System

Pipeline, Non-blocking, Prefetching, Write buffer


Extremely Unbalanced Operation Latency

[Figure: operation latency in cycles; IO access: 5~15M cycles]

Assumption of Current Solutions

Memory Hierarchy: Locality

Concurrence: Data access pattern

o Data stream

Performance varies widely


Existing Memory Metrics

Miss Rate (MR)

o {the number of missed memory accesses} over {the number of total memory accesses}

Misses Per Kilo-Instructions (MPKI)

o {the number of missed memory accesses × 1000} over {the number of total committed instructions}

Average Miss Penalty (AMP)

o {the sum of the single miss latencies} over {the number of missed memory accesses}

Average Memory Access Time (AMAT)

o AMAT = Hit time + MR × AMP

Flaw of Existing Metrics

o Focus on a single component, or
o A single memory access

Missing memory parallelism/concurrency
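A minimal sketch of how the four existing metrics follow from raw event counts. The function name, argument names, and counter values are hypothetical and chosen only for illustration; latencies are in cycles.

```python
from fractions import Fraction


def existing_metrics(accesses, misses, instructions, total_miss_latency, hit_time):
    """Return (MR, MPKI, AMP, AMAT) from raw counts (all inputs are made-up examples)."""
    mr = Fraction(misses, accesses)               # miss rate
    mpki = Fraction(misses * 1000, instructions)  # misses per 1000 committed instructions
    amp = Fraction(total_miss_latency, misses)    # average miss penalty in cycles
    amat = hit_time + mr * amp                    # AMAT = Hit time + MR x AMP
    return mr, mpki, amp, amat


mr, mpki, amp, amat = existing_metrics(
    accesses=1000, misses=50, instructions=4000,
    total_miss_latency=5000, hit_time=2)
print(float(mr), float(mpki), float(amp), float(amat))
```

Every input is a per-access or per-component total. Nothing in these formulas records how many accesses were in flight at the same time, which is the gap the slide points out.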


Concurrent AMAT (C-AMAT)

$$\text{C-AMAT} = \frac{H}{C_H} + pMR \times \frac{pAMP}{C_M}$$

• $H$ is the hit time

• $C_H$ is the hit concurrency

• $C_M$ is the pure miss concurrency

• $pMR$ and $pAMP$ are the pure-miss ratio and the pure-miss penalty

• A pure-miss cycle is a miss cycle in which there is no hit

Conventional metric:

$$\text{AMAT} = H + MR \times AMP$$
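The two formulas can be compared side by side with a short sketch. The function names and all parameter values below are hypothetical; exact fractions are used so the arithmetic is easy to check by hand.

```python
from fractions import Fraction


def amat(hit_time, miss_rate, avg_miss_penalty):
    """AMAT = H + MR x AMP (sequential view)."""
    return hit_time + miss_rate * avg_miss_penalty


def c_amat(hit_time, hit_concurrency, pure_miss_rate, pure_miss_penalty,
           pure_miss_concurrency):
    """C-AMAT = H / C_H + pMR x pAMP / C_M (concurrent view)."""
    return (Fraction(hit_time) / hit_concurrency
            + pure_miss_rate * Fraction(pure_miss_penalty) / pure_miss_concurrency)


# Hypothetical parameter values, in cycles.
sequential = amat(3, Fraction(2, 5), Fraction(5, 2))
concurrent = c_amat(3, Fraction(5, 2), Fraction(1, 5), 2, 1)
# With C_H = C_M = 1, pMR = MR and pAMP = AMP, the two formulas coincide.
degenerate = c_amat(3, 1, Fraction(2, 5), Fraction(5, 2), 1)
print(sequential, concurrent, degenerate)
```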


Different perspectives

Sequential perspective: AMAT

Concurrent perspective: C-AMAT

[Figure: timeline of five overlapping accesses (Access 1 to Access 5) showing hit phases, a hit/miss phase, and a pure miss phase; miss cycles and pure miss cycles are marked.]
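The concurrent perspective can be made concrete with a cycle-level sketch. The trace below is hypothetical (it is not the data behind the slide's figure). The per-cycle averaging used for C_H and C_M is an assumption about how the concurrency terms are measured. The code assumes the trace contains at least one pure miss.

```python
from collections import Counter
from fractions import Fraction

H = 3  # hit-phase length in cycles, assumed identical for every access

# Hypothetical trace: (start_cycle, miss_penalty_cycles); penalty 0 means a hit.
TRACE = [(0, 0), (0, 0), (1, 4), (2, 1), (3, 0)]


def measure(trace, hit_time):
    """Return (C-AMAT from its five parameters, memory-active cycles per access)."""
    hits_in_cycle = Counter()    # cycle -> number of accesses in their hit phase
    misses_in_cycle = Counter()  # cycle -> number of accesses in their miss phase
    for start, penalty in trace:
        for c in range(start, start + hit_time):
            hits_in_cycle[c] += 1
        for c in range(start + hit_time, start + hit_time + penalty):
            misses_in_cycle[c] += 1

    # A pure-miss cycle has at least one outstanding miss and no hit activity.
    pure_cycles = {c for c in misses_in_cycle if c not in hits_in_cycle}

    # A pure miss is a miss that owns at least one pure-miss cycle.
    pure_counts = []
    for start, penalty in trace:
        miss_phase = range(start + hit_time, start + hit_time + penalty)
        k = sum(1 for c in miss_phase if c in pure_cycles)
        if k:
            pure_counts.append(k)

    n = len(trace)
    c_h = Fraction(sum(hits_in_cycle.values()), len(hits_in_cycle))
    p_mr = Fraction(len(pure_counts), n)
    p_amp = Fraction(sum(pure_counts), len(pure_counts))
    c_m = Fraction(sum(misses_in_cycle[c] for c in pure_cycles), len(pure_cycles))

    from_parameters = Fraction(hit_time) / c_h + p_mr * p_amp / c_m
    active_cycles = len(hits_in_cycle) + len(pure_cycles)
    return from_parameters, Fraction(active_cycles, n)


print(measure(TRACE, H))
```

Under these assumptions the five-parameter formula and the direct count of memory-active cycles per access give the same value. For this trace, the sequential AMAT would be 3 + (2/5) × (5/2) = 4 cycles.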


Pure-miss

Miss is not important (pure miss is)

The penalty is due to pure miss:

$$pMR \times \frac{pAMP}{C_M}$$

[Figure: timeline of five overlapping accesses (Access 1 to Access 5) showing hit phases, a hit/miss phase, and a pure miss phase; miss cycles and pure miss cycles are marked.]


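C-AMAT is Recursive

$$\text{C-AMAT}_1 = \frac{H_1}{C_{H_1}} + pMR_1 \times \eta_1 \times \text{C-AMAT}_2$$

$$\text{C-AMAT}_1 = \frac{H_1}{C_{H_1}} + pMR_1 \times \frac{pAMP_1}{C_{M_1}}$$

$$\text{C-AMAT}_2 = \frac{H_2}{C_{H_2}} + pMR_2 \times \frac{pAMP_2}{C_{M_2}}$$

where

$$\eta_1 = \frac{pAMP_1}{AMP_1} \times \frac{C_{m_1}}{C_{M_1}}$$

This equation shows the recurrence relation between C-AMAT1 and C-AMAT2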
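A minimal sketch of the recurrence for a two-level hierarchy. The function names and all parameter values are hypothetical; here η1 is simply passed in as a number rather than measured.

```python
from fractions import Fraction


def c_amat_level(h, c_h, p_mr, p_amp, c_m):
    """Non-recursive form for one level: H / C_H + pMR x pAMP / C_M."""
    return Fraction(h) / c_h + p_mr * Fraction(p_amp) / c_m


def c_amat_recursive(h1, c_h1, p_mr1, eta1, c_amat2):
    """Recursive form: C-AMAT1 = H1 / C_H1 + pMR1 x eta1 x C-AMAT2."""
    return Fraction(h1) / c_h1 + p_mr1 * eta1 * c_amat2


# Hypothetical L2 parameters (cycles): H2=10, C_H2=2, pMR2=1/10, pAMP2=100, C_M2=4.
c_amat2 = c_amat_level(10, 2, Fraction(1, 10), 100, 4)

# Hypothetical L1 parameters: H1=2, C_H1=2, pMR1=1/20, eta1=1/2.
c_amat1 = c_amat_recursive(2, 2, Fraction(1, 20), Fraction(1, 2), c_amat2)
print(c_amat2, c_amat1)
```

The L2 term reaches L1 only after being scaled by pMR1 × η1, so a large C-AMAT2 can still have a small effect on C-AMAT1.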


The physical meaning of η1

R1 = pure miss cycles / miss cycles

R2 = pure misses / misses

η1 = R1 / R2

The penalty at L2 is C-AMAT2

The actual delay impact is η1 × C-AMAT2

η1 is the L1 (concurrency) data delay reducer
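As an illustration of the ratio definition, η1 can be computed from four counters. The counter values and function names are hypothetical.

```python
from fractions import Fraction


def eta(pure_miss_cycles, miss_cycles, pure_misses, misses):
    """eta = R1 / R2, with R1 = pure miss cycles / miss cycles
    and R2 = pure misses / misses."""
    r1 = Fraction(pure_miss_cycles, miss_cycles)
    r2 = Fraction(pure_misses, misses)
    return r1 / r2


# Hypothetical L1 counters.
eta1 = eta(pure_miss_cycles=200, miss_cycles=1000, pure_misses=40, misses=100)

# Hypothetical L2 penalty in cycles, and the delay that L1 actually sees.
c_amat2 = Fraction(15, 2)
effective_delay = eta1 * c_amat2
print(eta1, effective_delay)
```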


Architecture Impacts

CH could be contributed by

o multi-port cache
o multi-banked cache
o pipelined cache structures

CM could be contributed by

o non-blocking cache structures
o prefetching logic

These techniques can increase both CH and CM

o out-of-order execution
o multiple issue pipeline
o SMT
o CMP


Detecting System

[Figure: block diagram with CPU, Interface, Cache, MSHR, Hit Concurrency Detector, Miss Concurrency Detector, and C-AMAT analyzer]

Structure for detecting cache hit concurrency and cache miss concurrency using the C-AMAT metric


Experimental Environment

Simulator

o GEM5

Benchmarks

o 29 benchmarks from SPEC CPU2006 suite

For each benchmark, 10 million instructions were simulated to collect statistics

Average values of the corresponding memory metrics are shown

A good memory metric should match the actual design choices for modern processors


Default configuration

Default processor and cache configuration parameters for simulated testing of C-AMAT


Experimental Results

L1 DCache AMAT and C-AMAT when Changing Issue Pipeline Width

AMAT gets worse and C-AMAT gets better as concurrency increases


Experimental Results

L1 DCache AMAT and C-AMAT when Changing MSHR Size

AMAT gets worse and C-AMAT gets better as concurrency increases


Experimental Results

L2 Cache AMAT and C-AMAT when Changing MSHR Size

More results can be found in X.-H. Sun and D. Wang, "Concurrent Average Memory Access Time," IEEE Computer, vol. 47, no. 5, pp. 74-80, May 2014.

AMAT gets worse and C-AMAT gets better as concurrency increases


Potential of C-AMAT and Data Concurrency

Assume total running time is T

Data stall time is d; d/T is up to 70%, that is, d is 0.7T

Compute time is t, and t is 0.3T

Therefore, data stall time can be up to 0.7/0.3 ≈ 2.3 times the compute time

If layered performance matching can be achieved, then when the overlapping effect of data access concurrency is sufficient, data stall time is only 1% of compute time

Then memory performance can be improved 230 times!
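The arithmetic behind the 230x figure can be checked with a short sketch. The function and argument names are hypothetical; the 70% stall share and the 1% target are the slide's own numbers.

```python
from fractions import Fraction


def stall_reduction(stall_share, target_stall_over_compute):
    """Factor by which data stall time shrinks when it drops from
    `stall_share` of total run time to a fixed fraction of compute time."""
    compute_share = 1 - stall_share                  # t / T
    stall_over_compute = stall_share / compute_share  # d / t
    return stall_over_compute / target_stall_over_compute


factor = stall_reduction(Fraction(7, 10), Fraction(1, 100))
print(factor, float(factor))
```

The exact ratio is 700/3, about 233. The slide rounds 0.7/0.3 to 2.3 first, which gives 230.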


Improvement potential due to concurrency

Aided by concurrency, memory system performance can be improved by up to hundreds of times (230X) at each layer of a memory hierarchy with layered performance matching


How the 230x Improvement Is Achieved

Increasing data access concurrency to obtain a 230x speedup of memory system performance with our LPM algorithm


Technique Impact Analysis (Original)

Figure 2.11 on page 96 of Hennessy & Patterson’s latest book


Technique Impact Analysis (Ours)

A new technique summary table with C-AMAT


The Impact of C-AMAT

New understanding of memory systems with a rigorous mathematical description

Unifies the influence of data locality and concurrency under one formulation

Foundation for developing new concurrency-based optimizations, and utilizing existing locality-based optimizations

Foundation for automatic tuning for best configuration, partition, scheduling, etc.


C-AMAT in Action

New C-AMAT model

CPU-time = IC × (CPIexe + fmem × C-AMAT × (1 - overlapRatio_{c-m})) × cycle-time

Traditional AMAT model

[Equation not recoverable from the transcript]

(The memory term of each model is annotated "Data stall time" on the slide.)

Only pure misses cause processor stalls, and the penalty is formulated here

Y.-H. Liu and X.-H. Sun, “Reevaluating data stall time with the consideration of data access concurrency,” Journal of Computer Science and Technology, vol. 30, no. 2, pp. 227–245, 2015.
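A minimal sketch of the C-AMAT CPU-time model as a function. The function name, argument names, and all input values are hypothetical; only the formula itself is taken from the slide.

```python
from fractions import Fraction


def cpu_time(ic, cpi_exe, f_mem, c_amat, overlap_ratio_cm, cycle_time):
    """CPU-time = IC x (CPIexe + fmem x C-AMAT x (1 - overlapRatio_c-m)) x cycle-time."""
    data_stall_per_instruction = f_mem * c_amat * (1 - overlap_ratio_cm)
    return ic * (cpi_exe + data_stall_per_instruction) * cycle_time


# Hypothetical workload: 1M instructions, 30% memory instructions,
# C-AMAT of 1.6 cycles, half of the memory time overlapped with
# computation, 0.5 ns cycle time.
seconds = cpu_time(
    ic=1_000_000,
    cpi_exe=1,
    f_mem=Fraction(3, 10),
    c_amat=Fraction(8, 5),
    overlap_ratio_cm=Fraction(1, 2),
    cycle_time=Fraction(1, 2_000_000_000),
)
print(float(seconds))
```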


C-AMAT in Action

Layered performance matching at each layer of the memory hierarchy

Using recursive C-AMAT to measure and mitigate layered performance mismatch

For instance, the impact of C-AMAT2 can be trimmed by pMR1 and η1

The key is to reduce pure misses, not misses, and data concurrency can do so

Y.-H. Liu, X.-H. Sun, "LPM: Layered Performance Matching in Memory Hierarchy," Illinois Institute of Technology Technical Report (IIT/CS-SCS-2014-08), 2014.


C-AMAT in Action

Online Reconfiguration and Smart Scheduling

o A performance optimization tool has been developed based on C-AMAT
o Provide measurement and optimization suggestions
o Measure C-AMAT on existing computing systems
o Optimization in hardware reconfiguration
o Optimization in software task partitioning and scheduling

Y.-H. Liu, X.-H. Sun, "TuningC: A Concurrency-aware Optimization Tool," Illinois Institute of Technology Technical Report (IIT/CS-SCS-2015-05), 2015.


Related Work: APC Versus C-AMAT

Access Per (memory active) Cycle (APC)

o APC = A/T

APC is a measurement, a companion of C-AMAT

C-AMAT is an analysis and optimization tool

APC is very different from the traditional IPC

o Memory Active Cycle (data centric/access)
o Overlapping mode (concurrent data access)

C-AMAT does not depend on its five parameters for its value

C-AMAT = 1/APC

D. Wang, X.-H. Sun, "Memory Access Cycle and the Measurement of Memory Systems," IEEE Transactions on Computers, vol. 63, no. 7, pp. 1626-1639, July 2014.
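A short sketch of the APC and C-AMAT relation, reading A as the number of memory accesses and T as the number of memory-active cycles (an interpretation of the slide's abbreviations). The function names and counter values are hypothetical.

```python
from fractions import Fraction


def apc(accesses, memory_active_cycles):
    """APC = A / T: accesses completed per memory-active cycle."""
    return Fraction(accesses, memory_active_cycles)


def c_amat_from_apc(apc_value):
    """C-AMAT = 1 / APC, obtained without measuring the five parameters."""
    return 1 / apc_value


# Hypothetical counters: 5 accesses served in 8 memory-active cycles.
value = apc(5, 8)
print(value, c_amat_from_apc(value))
```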


Related Work: MLP

Memory Level Parallelism (MLP)

o Average number of outstanding long-latency main memory accesses when there is at least one such outstanding access
o Assuming each off-chip memory access has a constant latency, say m cycles, APCM = MLP/m
o That means APCM is directly proportional to MLP
o APC is a superset of MLP

C-AMAT is an analytical tool and a measurement; MLP is a measurement

MLP does not consider locality, while APC and C-AMAT do
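The MLP definition and its relation to APCM can be sketched as follows. The per-cycle sample list and the latency m are hypothetical values, and the function names are illustrative only.

```python
from fractions import Fraction


def mlp(outstanding_per_cycle):
    """Average number of outstanding main-memory accesses, taken only over
    cycles in which at least one such access is outstanding."""
    busy = [n for n in outstanding_per_cycle if n > 0]
    return Fraction(sum(busy), len(busy))


def apc_m(mlp_value, m):
    """APCM = MLP / m, assuming a constant off-chip latency of m cycles."""
    return mlp_value / m


# Hypothetical samples: outstanding off-chip accesses observed in each cycle.
samples = [0, 1, 2, 2, 1, 0, 0, 3]
level = mlp(samples)
print(level, apc_m(level, 200))
```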


Conclusions

Data access delay is the premier bottleneck of computing

Hardware memory concurrency exists but is underutilized

C-AMAT unifies data concurrency with locality for combined data access optimizations

C-AMAT can improve AMAT performance 230 times

This 230X number could be even larger. With multicore technology, CPUs can be built faster. The question is whether data can be moved up fast enough


Conclusions

Develop C-AMAT based technologies to reduce data access time!


Thank You & Questions?