Not only Faster but Accurater, too
Bruce Jacob
University of Maryland
Faster and Accurater
The Future of Memory-System Modeling and Simulation
Bruce Jacob (with Ph.D. results of Shang Li)
Keystone Professor, University of Maryland
The Bottom Line
[Figure: Simulation Speed vs. Simulation Accuracy. Points shown: Trace Based, Cycle Accurate, HDL Based. Annotations: “[don’t actually go here]” and “We can get up here (e.g., via prediction)”.]
Background
The Root of the Problem
[Figure: DRAM read timing diagram over TIME. Labels: Bank Precharge; Row Activate (15ns) and Data Restore (another 22ns); Column Read; DATA (on bus).]
tRP = 15ns, tRCD = 15ns, tRAS = 37.5ns, CL = 8, BL = 8
Cost of access is high; requires significant effort to amortize this over the (increasingly short) payoff.
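The timing values on this slide allow a quick back-of-the-envelope comparison between the cost of cycling a row and the time the data actually occupies the bus. The sketch below is only an illustration under stated assumptions: the clock period `T_CK_NS` is not on the slide (it is chosen so that CL = 8 cycles equals 15 ns), and double-data-rate transfer is assumed so that a burst of 8 beats takes 4 clocks.

```python
# Back-of-the-envelope DRAM access cost, using the slide's timing values.
T_RP_NS = 15.0    # precharge time (from the slide)
T_RAS_NS = 37.5   # row activate plus data restore (from the slide)
BL_BEATS = 8      # burst length in data beats (from the slide)

# ASSUMPTION: the clock period is not given on the slide.
# 1.875 ns is chosen because it makes CL = 8 cycles equal to 15 ns.
T_CK_NS = 1.875


def row_cycle_ns():
    """Minimum time between opening two different rows in one bank (tRAS + tRP)."""
    return T_RAS_NS + T_RP_NS


def burst_ns():
    """Time one burst occupies the bus. ASSUMPTION: two beats per clock (DDR)."""
    return (BL_BEATS / 2) * T_CK_NS


def bus_utilization_one_bank():
    """Fraction of a row cycle spent moving data if every access opens a new row."""
    return burst_ns() / row_cycle_ns()


print(f"row cycle:   {row_cycle_ns()} ns")
print(f"data burst:  {burst_ns()} ns")
print(f"utilization: {bus_utilization_one_bank():.3f}")
```

Under these assumptions a single bank that misses on every access keeps the data bus busy for only about one seventh of the time, which is why the controller has to work to overlap and reorder requests.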
Background (‘significant effort’)
[Figure: “Significant Effort”. Labels: CPU/$ (two blocks), outgoing bus request, MC, read data, beat, cmd. Queued requests shown: Read B; Write X, data; Read Z; Write Q, data; Read A; Write A, data; Read W; Read Z; Read Y. DRAM commands shown: ACT, RD, WR, PRE.]
Faster?
CPU Simulators
A. JAMES CLARK SCHOOL OF ENGINEERING
Background & Motivation: Performance & Scalability
[Figure: Simulation Speed vs. Accuracy for CPU simulators. Points shown: SST-Ariel, Gem5-O3, Graphite, Gem5-Timing, Sniper, ZSim, CMP$im, Marss-O3.]
• Simulation speed: 100X faster
• Error: < 20%
• 10s of cores simulated on 10s of cores
Faster?
DRAM vs CPU simulator performance
Background & Motivation: Performance & Scalability
[Figure, two panels: “Cycle accurate, traditional CPU simulator” and “Faster CPU simulator w/ DRAM simulator”. Annotation: “Easily Predictable”.]
Result: Memory-System Simulation is now Limiting Factor
I’ll just leave this here …
Cycle Accurate Memory Simulation: Simulator Comparison
Trace Simulation: 10M Random requests, 10M Stream requests
Even Faster via Prediction
Turning DRAM timing simulation into a classification problem
Statistical DRAM Model: Proposed Approach
Clock  Address     OP     Class     Latency
0      0x01230000  READ   Idle      36
12     0x01230020  READ   Row-Hit   22
40     0x0123003C  READ   Row-Hit   22
65     0x06340000  WRITE  Row-Miss  56
...    ...         ...    ...       ...
Request trace → (Classification) → Class → (Recovery) → Latency
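The two steps on this slide, classification and recovery, can be sketched as a pair of small functions. This is a minimal illustration, not the model from the talk: a rule-based bank-state tracker stands in for the trained classifier, the address mapping (row in bits 16 and up, bank in bits 13 to 15) is a hypothetical choice, and the per-class latencies are simply the example values shown in the slide's table.

```python
# Sketch: label each request by bank state, then recover a latency per class.
# ASSUMPTIONS: the address mapping and the one-open-row-per-bank policy are
# hypothetical; the latencies are the example values from the slide.
CLASS_LATENCY = {"Idle": 36, "Row-Hit": 22, "Row-Miss": 56}


def decode(addr):
    """Split an address into (bank, row) under the assumed mapping."""
    return (addr >> 13) & 0x7, addr >> 16


def classify(trace):
    """Return one class label per (clock, addr, op) request."""
    open_row = {}  # bank -> row currently open in that bank
    labels = []
    for _clock, addr, _op in trace:
        bank, row = decode(addr)
        if bank not in open_row:
            labels.append("Idle")
        elif open_row[bank] == row:
            labels.append("Row-Hit")
        else:
            labels.append("Row-Miss")
        open_row[bank] = row
    return labels


def recover(labels):
    """Map each class label back to a latency."""
    return [CLASS_LATENCY[label] for label in labels]


TRACE = [
    (0, 0x01230000, "READ"),
    (12, 0x01230020, "READ"),
    (40, 0x0123003C, "READ"),
    (65, 0x06340000, "WRITE"),
]
labels = classify(TRACE)
latencies = recover(labels)
print(labels)
print(latencies)
```

With the assumed mapping, the four requests from the slide receive the same classes and latencies as in its table. The point of the split is that recovery is a table lookup, so all of the modeling effort goes into predicting the class.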
Latency ← Queue Contents
The Root of the Problem
[Figure: the DRAM read timing diagram over TIME (Bank Precharge; Row Activate (15ns) and Data Restore (another 22ns); Column Read; DATA (on bus)), annotated with the labels listed below.]
tRP = 15ns, tRCD = 15ns, tRAS = 37.5ns, CL = 8, BL = 8
Cost of access is high; requires significant effort to amortize this over the (increasingly short) payoff.
Bank Conflict
Idle Bank
Row Hit
…plus any Queueing Delays
Refresh Delay
Training Process
Statistical DRAM Model: Training: Supervised Learning
Synthetic Trace: ~7000 Requests
Various access patterns, inter-arrival timings to cover all kinds of workloads
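A synthetic training trace of this kind could be generated along the following lines. This is a sketch under assumptions, not the generator used for the results: the two access patterns, the phase lengths, the set of inter-arrival gaps, the 30% write ratio, the 1 GiB address space, and the 64-byte line size are all hypothetical choices made for illustration.

```python
import random


def synthetic_trace(n_requests=7000, seed=0):
    """Generate (clock, addr, op) tuples that mix stream and random phases."""
    rng = random.Random(seed)
    trace, clock, addr = [], 0, 0
    while len(trace) < n_requests:
        # Each phase picks one access pattern and one inter-arrival gap.
        pattern = rng.choice(["stream", "random"])
        gap = rng.choice([1, 4, 16, 64, 256])
        phase_len = min(rng.randint(8, 64), n_requests - len(trace))
        for _ in range(phase_len):
            if pattern == "stream":
                addr = (addr + 64) % (1 << 30)        # next cache line
            else:
                addr = rng.randrange(0, 1 << 30, 64)  # random cache line
            op = "WRITE" if rng.random() < 0.3 else "READ"
            trace.append((clock, addr, op))
            clock += gap
    return trace


trace = synthetic_trace()
print(len(trace), trace[:3])
```

Each request would then be run through a cycle-accurate simulator to obtain its label, giving the (features, class) pairs that supervised learning needs.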
Models (performed the same)
Models: Decision Tree & Random Forest
Statistical DRAM Model: Training
[Figure: Decision Tree. Internal nodes test features such as same-row-last, ref-after-last, …, num-recent-rank; leaves are classes such as Idle, …, row-miss.]
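A decision tree of this shape can be represented as nested tuples and evaluated with a short loop. The tree below is hand-written for illustration only: the split order, the 0.5 thresholds, and the "row-hit" leaf are assumptions, not the trained model from the talk, and only two of the slide's features are used.

```python
# Each internal node is (feature_name, threshold, subtree_if_below, subtree_if_above).
# Each leaf is a class name. ASSUMPTION: this particular tree is hypothetical.
TREE = (
    "ref-after-last", 0.5,
    ("same-row-last", 0.5, "row-miss", "row-hit"),  # no refresh since last request
    "idle",                                         # a refresh closed the row
)


def predict(node, features):
    """Walk from the root to a leaf and return the class name found there."""
    while not isinstance(node, str):
        name, threshold, below, above = node
        node = above if features[name] > threshold else below
    return node


print(predict(TREE, {"ref-after-last": 0, "same-row-last": 1}))
print(predict(TREE, {"ref-after-last": 0, "same-row-last": 0}))
print(predict(TREE, {"ref-after-last": 1, "same-row-last": 1}))
```

A random forest would hold several such trees and take a majority vote over their predictions; the slide reports that the two model types performed the same.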
Results: Way Faster
But Wait - and Accurater? A Little Background:
MEMSYS’18, October 2018, Alexandria, Washington DC, USA Anon.
Figure 1: When CPU and memory simulators are coupled, the timings of the memory request between the LLC and the memory controller could be easily overlooked.
… better simulation accuracy. Finally, we quantify the LLC to memory latency for various high-end and emerging platforms and we show its significant range, between 30ns (POWER8) and 277ns (Knights Landing); therefore, it is really important to properly adjust and validate this parameter in system simulators before any measurements are performed. Overall, we believe that the issues addressed in this paper would help researchers of the computer architecture community to improve main memory system simulation.
The rest of the paper is organized as follows. Section 2 explains the simulation environment and evaluates main memory latency with a microbenchmark for real and simulated systems. This section also proposes approaches to fix the deviation identified between real and simulated main memory latency measurements. Section 3 details the validation of the proposed approaches with SPEC CPU2006 benchmarks, while Section 4 discusses LLC to main memory latency of various high-end and emerging High Performance Computing (HPC) platforms. Section 5 analyzes the validation procedure of state-of-the-art system and memory simulators. Finally, Section 6 presents the conclusions of the study.
2 MAIN MEMORY LATENCY EVALUATION AND SIMULATION ENHANCEMENTS
In this section we detail the methodology used to model a targeted system in a simulation infrastructure and we describe the microbenchmarks used to discover the main memory access latency. The targeted system we aimed to model is an Intel Xeon E5-2670 Sandy Bridge-EP processor [8] operating at 3.0GHz. The main memory comprises four 4GiB DIMM devices [21] connected to the processor using four DDR3-1600 channels. Each processor runs eight cores where the hyper-threading feature has been disabled like in most HPC systems [22].
2.1 Simulation environment
The simulator infrastructure we chose to use is an integration of two simulators: ZSim [20] as CPU simulator and DRAMsim2 as main memory simulator.
ZSim is a user-level, execution-driven CPU simulator widely used in the computer architecture research community. Developed by researchers from MIT and Stanford University, ZSim is designed for simulation of large-scale systems. However, ZSim was originally developed to simulate the Intel Westmere architecture, which is no longer being used in the HPC domain. One of the tasks that we had to perform was to upgrade and validate ZSim for the Intel Sandy Bridge processor. The work to upgrade ZSim consisted of the following steps: First, we adjusted the simulator by updating the instruction latencies obtained through the execution of CPU microbenchmarks [23] on the real hardware; Second, we improved the micro-operation fusion and we increased the number of entries in the Reorder Buffer (ROB) from 128 (Westmere) to 168 (Sandy Bridge); Third, we configured the cache hierarchy according to the Intel documentation [8] for a Sandy Bridge EP Class, summarized in Table 1. Finally, we updated the L3 caching mechanism, implementing the hashing function described in work by Maurice et al. [13].
Table 1: Cache parameters of the Sandy Bridge EP class processor used in the study.
                          L1-D     L2       L3
Size                      32 KiB   256 KiB  20 MiB
Latency (in CPU cycles)   4        8        28
Cache line size           64 B     64 B     64 B
Set associativity         8-way    8-way    20-way
ZSim is easily integrated with a main memory simulator such as DRAMsim2. DRAMsim2 is a cycle-accurate simulator validated against Verilog models for memory devices. We configured DRAMsim2 following the manufacturer's documentation, with specific timings for the memory device part [21].
2.2 Memory latency microbenchmark
State-of-the-art memory benchmarks such as LMbench [15], stream [14] and Intel’s Memory Latency Checker (imlc) [24] can be used for main memory latency measurements. However, they are not a good fit for our study because it is very difficult to use them in ZSim simulation. LMbench and stream rely on compiler optimization and imlc is a binary-only distributed program; hence, no tailored analysis nor modification to the code could be made. Therefore, as none of the existing open source benchmarks was appropriate for our analysis, we had to design a specific microbenchmark to use for our experiments.
Our microbenchmark is designed to stress the caches and main memory by implementing the concept of pointer chasing. Because the microbenchmarks are designed to run on top of an Operating System, a C program wraps all functionality outside the microbenchmark design, such as memory initialization, metrics collection, and program cleanup. By doing so, the microarchitectural implications of running on top of an OS are diminished.
In the microbenchmark's prologue, we allocate a contiguous section of memory that stores an array of pointers. The elements of the array are initialized as a circular linked list that follows a pointer chase pattern; Figure 2 portrays an example of such an ordering. Our design goals for the microbenchmarks are summarized as: (1) Iteratively traverse the whole array; (2) Access different cache lines for every memory access; (3) The memory accesses have a random pattern, preventing data prefetchers from bringing data into any level of cache.
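The circular, randomly ordered linked list described above can be sketched as an index array in which each slot holds the index of the next slot to visit. The excerpt does not say how the ordering is generated; the sketch below uses Sattolo's algorithm, which is one standard way to obtain a random permutation that forms a single cycle, so a traversal covers the whole array before repeating. Each slot stands in for one cache line (a C version would pad entries to the line size), and the Python code only illustrates the construction; it is not suitable for measuring hardware latency.

```python
import random


def build_chase(n_lines, seed=0):
    """Return a next-index array forming one random cycle over n_lines slots.

    Uses Sattolo's algorithm (an assumption; the excerpt names no algorithm).
    """
    rng = random.Random(seed)
    nxt = list(range(n_lines))
    for i in range(n_lines - 1, 0, -1):
        j = rng.randrange(i)  # j < i is what guarantees a single cycle
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def chase(nxt, steps, start=0):
    """Follow the chain; each access depends on the result of the previous one."""
    idx = start
    for _ in range(steps):
        idx = nxt[idx]
    return idx


chain = build_chase(1024)
print(chase(chain, 1024))  # a full lap returns to the starting slot
```

Because every access depends on the previous one, requests cannot overlap, so total traversal time divided by the number of steps approximates the per-access latency.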
Conference’17, July 2017, Washington, DC, USA Anon.
[Figure 3(a): DRAM latency and overall latency reported by Gem5 and ZSim.]
[Figure 3(b): ZSim 2-phase memory model timeline diagram compared with real hardware/cycle accurate model. Three back-to-back memory requests (0, 1, 2) are issued to the memory model. Timeline rows: Hardware & Cycle Accurate, ZSim Phase 1, ZSim Phase 2. Labels: Request 0/1/2, Request 0/1/2 Returned, Request 0/1/2 Latency, Min Latency, CPU to Mem, Mem to CPU, “(Program timing instrumented here)”.]
Figure 3: Simulator memory latency analysis
… main memory backend at all. [17] supports a cycle accurate memory backend, but as we will see soon, it has its issues when integrating a cycle accurate memory backend.
The problem was first discovered by [19], who observed a memory latency error of about 20ns when they tested a memory latency benchmark. But [19] did not determine where this missing 20ns of latency comes from, suspecting that it came from the cycle accurate DRAM simulator they were using. We will analyze this situation and provide a conclusive answer to this question.
To replicate the issue independently, we developed a simplified version of LMBench (the ram_lat referred to in Table 1) that randomly traverses a huge array and measures the average latency of each access. When the array is too large to fit in the cache and most accesses go to DRAM, the average access latency will include the DRAM latency. The benchmark inserts time stamps before and after the memory traversal, uses them to determine the overall latency of a certain number of memory requests, and divides by the number of requests to obtain the average memory latency. This average memory latency consists of cache latency and DRAM latency, and thus we use the term overall latency in the following discussion.
Like [19], we ran this benchmark natively on our machine to obtain the “hardware measured” latency (72ns), then ran it in ZSim with DRAMSim2 as the DRAM backend, and we were able to reproduce results similar to [19]. That is, the overall latency (43ns) is 29ns lower than the hardware measurement (72ns). To determine whether this is a ZSim-specific issue or a DRAM simulator issue, we ran the same benchmark in Gem5 with the same cache and DRAM parameters, and this time the overall latency is 78ns, much closer to our hardware measurement. So we conclude this is a ZSim-specific issue, not a DRAM simulator issue. We then looked further into the simulator statistics and found that the DRAM latency reported by the DRAM simulator in Gem5 is 55ns, which makes sense, as the overall latency (78ns) should be a combination of DRAM latency (55ns) and cache latency (23ns). However, in ZSim, the DRAM latency reported by the DRAM simulator is 73ns, much higher than the overall latency, which makes no sense. Figure 3a visualizes these results. This again confirms that the issue lies within the ZSim memory model.
The ZSim memory model works in two phases. The first phase is a fixed-latency model that assumes a fixed “minimum latency” for all memory events. Its purpose is to simulate instructions as fast as possible and generate a trace of memory events. After the memory event trace is generated, the second phase kicks in, and that is when the cycle accurate DRAM simulator actually works: the cycle accurate simulation uses the event trace as input and updates the latency timings associated with these events.
For instance, Figure 3b demonstrates how the ZSim memory model handles memory requests differently from hardware/cycle accurate models. Suppose there are 3 back-to-back memory requests (each relies on the completion of the previous one). In real hardware or a cycle accurate model, each memory request’s latency may vary, and the next request cannot be issued until the previous request is returned. In ZSim Phase 1, all requests are assumed to finish with the “minimum latency”, and therefore finish earlier than they should. Then in ZSim Phase 2, cycle accurate simulation is performed, more accurate latency timing is produced by the cycle accurate simulator, and all 3 requests update their timings. But even if all memory requests obtain correct timings in Phase 2, when the simulated program, like our benchmark, has instrumenting instructions such as reading the system clock, it will obtain the timing numbers during Phase 1, which can be substantially smaller. This is why the overall latency is much smaller than the DRAM latency.
In other words, the “minimum latency” ZSim parameter will dictate the latency observed by the simulated program. To verify this claim, we run the same simulation with different “minimum latency” parameters, and plot them against the benchmark reported
But Wait - and Accurater? The Real Culprit (took 2 yrs to find):
ZSim 2-phase memory model timeline diagram compared with real hardware/cycle accurate model. Three back-to-back memory requests (0, 1, 2) are issued to the memory model.
First phase of memory access aggressively schedules reqs for performance; second phase fails to take into account dependence information.
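The effect on a chain of dependent requests can be shown with a toy calculation. This is not ZSim's code; it is a minimal sketch of the behaviour described in the excerpt, and the three request latencies and the 30 ns minimum latency are hypothetical numbers chosen for illustration.

```python
def observed_times(latencies, min_latency=None):
    """Completion times seen by the program for a chain of dependent requests.

    Each request issues when the previous one returns. If min_latency is given,
    every request is assumed to return after that fixed delay, as in phase 1.
    """
    now, done = 0, []
    for latency in latencies:
        now += latency if min_latency is None else min_latency
        done.append(now)
    return done


TRUE_LATENCIES = [60, 45, 80]  # ASSUMPTION: hypothetical latencies in ns
accurate = observed_times(TRUE_LATENCIES)
phase1 = observed_times(TRUE_LATENCIES, min_latency=30)  # hypothetical minimum
print(accurate)
print(phase1)
```

A timestamp read by the simulated program after the third request sees the phase-1 value rather than the cycle-accurate one, and the gap grows with every dependent request in the chain.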
But Wait - and Accurater? What Programmers WANT: (and if you can do it → accurate, parallel sims)
if (INSTR.isMemOp) {
    if (L1_cache_miss(INSTR.dAddr)) {
        if (L2_cache_miss(INSTR.dAddr)) {
            INSTR.valid = now + DRAM_request(INSTR.dAddr);
        }
    }
}
Prediction gives it to them
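The interface on this slide, in which a DRAM request returns its latency immediately, can be mirrored in a few lines. The sketch below is an assumption-laden illustration: the caches are modelled as plain sets of cached addresses, cache hit latency is ignored because the slide's pseudocode does not mention it, and `stub_predictor` is a hypothetical constant-latency stand-in for a trained predictor.

```python
def valid_time(now, is_mem_op, addr, l1_lines, l2_lines, predict_latency):
    """Return the time at which the instruction's data is valid.

    Mirrors the slide's nested ifs: only an access that misses in both
    caches pays a DRAM latency, obtained from a synchronous call.
    """
    if is_mem_op and addr not in l1_lines and addr not in l2_lines:
        return now + predict_latency(addr)
    return now  # ASSUMPTION: hit latency is not modelled here


def stub_predictor(addr):
    """Hypothetical stand-in for the predictor: a constant latency."""
    return 36


print(valid_time(100, True, 0x1000, set(), set(), stub_predictor))
print(valid_time(100, True, 0x1000, {0x1000}, set(), stub_predictor))
```

The CPU model needs no event queue or callback to learn the latency, which is what makes the code this simple.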
Bottom Line (scalability)
The Future:
[Figure 5.1: Parallel DRAM simulator architecture. Serial region: Input Interface (configuration, request inputs, address mapper...). Parallel region: multiple Controllers, each with Bank States, Scheduler, Statistics, .... Serial region: Output Interface (aggregated statistics, request callbacks...).]
5.1.2 Evaluation
First, we examine the execution speedup of the parallel DRAM simulator over
trace inputs. The trace frontend contributes little overhead to the overall simulation
time. We can also load traces to keep the DRAM simulator busy all the time to
maximize the simulation time spent in the DRAM simulator. Therefore it is the
ideal scenario for testing parallelization speedup.
We run two types of trace, stream and random, for 10 million cycles on an
8-channel HBM configuration. We compare the simulation time of running the
simulation in 1, 2, 4, and 8 threads respectively.
Large parallel simulations enabled, wherein each CPU
model can have its own memory-system predictor to
provide estimates of main memory-system latency.
None of the memory models need interact to provide their
predictions.
Moreover, the CPU models can be written in a FAR simpler
way than they are now, making them faster and less likely to
contain “gotcha” assumptions.
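The structure described here, one private predictor per CPU model with no interaction between them, can be sketched as follows. Everything in the sketch is hypothetical: the predictor is a trivial row-hit/row-miss rule with made-up latencies, and the row is taken from address bits 16 and up. Python threads are used only to show that the per-core work shares no state; because of the interpreter lock they would not give a real speedup for this kind of work.

```python
from concurrent.futures import ThreadPoolExecutor


def make_predictor(row_hit=22, row_miss=56):
    """Build an independent predictor with private state (hypothetical rule)."""
    state = {"row": None}

    def predict(addr):
        row = addr >> 16  # ASSUMPTION: row = address bits 16 and up
        latency = row_hit if row == state["row"] else row_miss
        state["row"] = row
        return latency

    return predict


def run_core(addresses):
    """Simulate one core: total predicted memory latency, private predictor."""
    predict = make_predictor()
    return sum(predict(addr) for addr in addresses)


def run_all(per_core_addresses, workers=4):
    """Run every core model independently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_core, per_core_addresses))


print(run_all([[0x10000, 0x10020], [0x20000, 0x30000]]))
```

Since no predictor reads another's state, the result for each core is the same whether the cores run in parallel or one after another.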
Call For Participation: www.memsys.io
The International Symposium on Memory Systems, October 1–4, Washington DC
MEMSYS 2018
Keynote Addresses
Hardware Keynote: Steve Wallach, Micron
Software Keynote: Brian Barrett, Amazon
Postamble: J Thomas Pawlowski, Micron
Panelists
Keren Bergman, Columbia
Wendy Elsasser, Arm
Philip Emma, Systems Technology & Architecture Consulting
Michael Healy, IBM
Adolfy Hoisie, BNL
Dave Resnick, Consultant
Jeffrey Vetter, ORNL
Organizers
Bruce Jacob, U. Maryland
Kathy Smiley, Memory Systems
Rajat Agarwal, Intel
Abdel-Hameed Badawy, NMSU
Jonathan Beard, Arm
Ishwar Bhati, Intel
Bruce Christenson, Intel
Zeshan Chishti, Intel
Zhaoxia (Summer) Deng, Facebook
Chen Ding, U. Rochester
David Donofrio, Berkeley Lab
Dietmar Fey, FAU Erlangen-Nürnberg
Maya Gokhale, LLNL
Xiaochen Guo, Lehigh U.
Manish Gupta, NVIDIA
Fazal Hameed, TU Dresden
Matthias Jung, Fraunhofer IESE
Kurt Keville, MIT
Hyesoon Kim, Georgia Tech
Scott Lloyd, LLNL
Sally A. McKee, Clemson
Moinuddin Qureshi, Georgia Tech
Petar Radojkovic, BSC
Arun Rodrigues, Sandia National Labs
Robert Voigt, Northrop Grumman
Gwendolyn Voskuilen, Sandia
David T. Wang, Samsung
Vincent Weaver, U. Maine
Norbert Wehn, U. Kaiserslautern
Yuan Xie, UC Santa Barbara
Ke Zhang, Chinese Acad. of Sciences
Xiaodong Zhang, Ohio State
Jishen Zhao, UC San Diego
Memory-device manufacturing, memory-architecture design, and the use of memory technologies by application software all profoundly impact today’s and tomorrow’s computing systems, in terms of their performance, function, reliability, predictability, power dissipation, and cost. Existing memory technologies are seen as limiting in terms of power, capacity, and bandwidth. Emerging memory technologies offer the potential to overcome both technology- and design-related limitations to answer the requirements of many different applications. Our goal is to bring together researchers, practitioners, and others interested in this exciting and rapidly evolving field, to update each other on the latest state of the art, to exchange ideas, and to discuss future challenges.
Conference Schedule and Venue
The conference will be held at the Gaylord National Resort & Convention Center at The National Harbor, Maryland. An opening reception will be held on Monday evening, followed by 2 1/2 days of technical presentations (full days on Tuesday and Wednesday, a half length technical day on Thursday), Conference Dinner Wednesday evening, and Awards Luncheon Tuesday afternoon. A discounted room block is still available on the registration site, with only a few rooms left.
Tracks and Topics
The following topics will be presented over the 3-day conference:
• Memory-system design from both hardware and software perspectives
• Memory failure modes and mitigation strategies
• Memory-system resilience, especially at large scale
• Memory and system security issues
• Operating system design for hybrid/nonvolatile memories
• Technologies like flash, DRAM, STT-MRAM, 3DXP, memristors, etc.
• Memory-centric programming models, languages, optimization
• Compute-in-memory and compute-near-memory technologies
• Large-scale data movement: networks, hardware, software, mitigation
• Virtual memory redesign for unifying storage/memory/accelerators
• Algorithmic & software memory-management techniques
• Emerging memory technologies, both hardware and software, including memory-related blockchain applications
• Interference at the memory level across datacenter applications
• Issues in the design and operation of large-memory machines
• In-memory databases and NoSQL stores
• Post-CMOS scaling efforts and memory technologies to support them, including cryogenic, neural, quantum, and heterogeneous memories
The conference focuses on these and other related topics.
Publications & Presentations
All accepted papers will be published in the ACM & IEEE Digital Libraries. Our primary goal is to showcase interesting ideas that will spark conversation between disparate groups, to get applications people, operating systems people, system architecture people, interconnect people and circuits people to talk to each other. Thus, we try to showcase interesting ideas in a format that will facilitate this. The talks are short, to encourage participation and discussion. Every evening we host a panel discussion of invited speakers, with beer, wine, and hot hors d’oeuvres.
2018 Conference Sponsors
Shameless Plug
Washington DC Sep 30 – Oct 3, 2019
www.memsys.io
Thank You!
Bruce Jacob
blj@umd.edu www.ece.umd.edu/~blj
Backup Slides
Nomenclature
[Figure: DRAM nomenclature at the System Level. Labels: Memory Controller (two shown), Package Pins, PCB Bus Traces, Edge Connectors, DIMM 0, DIMM 1, DIMM 2, Side View, Top View, DRAMs, DIMMs, I/O MUX, DRAM Array; rank labelings shown as Rank 0, Rank 1, or Rank 0, Rank 1, or even Rank 0/1, Rank 2/3….]
One DIMM can have one RANK, two RANKs, or even more depending on its configuration.
One DRAM device with eight internal BANKS, each of which connects to the shared I/O bus.
One DRAM bank is comprised of many DRAM ARRAYS, depending on the part’s configuration. This example shows four arrays, indicating a x4 part (4 data pins).
One BANK, four ARRAYS
Background
Chapter 15: MEMORY SYSTEM DESIGN ANALYSIS
[FIGURE 15.36: Bandwidth improvement: 16-bank versus 8-bank DDR3 devices; relaxed tFAW and tWTR. x-axis: Per Bank (DRAM Command) Queue Depth (4 to 16); y-axis: Bandwidth Efficiency Improvement Percentage (0 to 40). Legend: 2R8B vs. 1R8B; 1R16B vs. 1R8B; 2R16B vs. 2R8B; 2R8B vs. 1R16B; tRTRS of 1 cycle, 2 cycles, 3 cycles. Annotation: 1R16B outperforms 2R8B with queue depth of 16.]
[FIGURE 15.37: Impact of scheduling policy on memory-access latency distribution: 179.art. Panels: 179.art FCFS and 179.art CPRH. x-axis: Memory Access Latency (ns), 0 to 800; y-axis: Number of Accesses at Given Latency Value, 1 to 1e+08.]
Memory Systems: Cache, DRAM, Disk
… problems with the CPRH scheduling algorithm for other workloads. Figure 15.38 shows the latency distribution curve for 188.ammp, and 188.ammp was one workload that points to possible issues with the CPRH algorithm. Figure 15.38 shows that the CPRH scheduling algorithm resulted in longer latencies for a number of transactions, and the number of transactions with memory-access latency greater than 400 ns actually increased. Figure 15.38 also shows that the increase of a small number of transactions with memory-access latency greater than 400 ns is offset by the reduction of the number of transactions with memory transaction latency around 200 ns and the increase of the number of transactions with memory-access latency less than 100 ns. In other words, the CPRH scheduling algorithm redistributed the memory-access latency curve so that most memory transactions received a modest reduction in access latency, but a few memory transactions suffered a substantial increase in access latency. The net result is that the changes in access latency cancelled each other out, resulting in limited speedup for the CPRH algorithm over the FCFS algorithm for 188.ammp.
15.5 A Latency-Oriented Study
In the previous section, we examined the impact of transaction ordering on the memory-access latency distribution for various applications. Memory controller schedulers typically attempt to maximize performance by taking advantage of memory application access patterns to hide DRAM-access penalties. In this section, we provide insight into the impact that DRAM architectural choices make on the average read latency or memory-access latency. We briefly examine how the choice of DRAM protocol impacts memory system performance and then discuss in detail how aspects of the memory system protocol and configuration contribute to the observed access latency. [footnote 4]
15.5.1 Experimental Framework
This study uses DRAMSim, a stand-alone memory subsystem simulator. DRAMSim provides a detailed execution-driven model of a Fully Buffered (FB) DIMM memory system. The simulator also supports the variation of memory system parameters of interest, including scheduling policies and memory
[FIGURE 15.38: Impact of scheduling policy on memory-access latency distribution: 188.ammp. Panels: 188.ammp FCFS and 188.ammp CPRH. x-axis: Memory Access Latency (ns), 0 to 800; y-axis: Number of Accesses at Given Latency Value, 1 to 1e+08.]
[footnote 4] Some of this section’s material appears in “Fully-Buffered DIMM memory architectures: Understanding mechanisms, overheads and scaling,” by B. Ganesh, A. Jaleel, D. Wang, and B. Jacob. In Proc. 13th International Symposium on High Performance Computer Architecture (HPCA 2007). Phoenix, AZ, February 2007. Copyright IEEE. Used with permission.
Features Extracted
Table 6.1: Features with Descriptions
Each entry reads: Feature (Values): Description. Intuition.
same-row-last (0/1): whether the last request that goes to same bank has the same row (as this one). Intuition: key factor for the most recent bank state.
is-last-recent (0/1): whether the last request to the same bank added recently (tRC). Intuition: relevancy of the last request to the same bank.
is-last-far (0/1): whether the last request to the same bank added long ago (tRFC). Intuition: relevancy of the last request to the same bank.
op (0/1): operation (read/write). Intuition: for potential R/W scheduling.
last-op (0/1): operation of last request to the same bank. Intuition: for potential R/W scheduling.
ref-after-last (0/1): whether there is a refresh since last request to the same bank. Intuition: refresh resets the bank to idle.
near-ref (0/1): whether this cycle is near a refresh cycle. Intuition: latency can be really high if it’s near a refresh.
same-row-prev (int): number of previous requests with same row to the same bank. Intuition: if there is same row request then OOO may be possible.
num-recent-bank (int): number of requests added recently to the same bank. Intuition: contention/queuing in the bank.
num-recent-rank (int): number of recent requests added recently to the same rank. Intuition: contention.
num-recent-all (int): number of recent requests added recently to all ranks. Intuition: contention.
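Several of these features can be computed from a simple list of earlier requests. The sketch below covers only a subset of Table 6.1 and rests on assumptions: requests are plain dictionaries with hypothetical field names, `recent_window` is a stand-in for the tRC-based notion of "recent", op is encoded as 0 for read and 1 for write, and refresh-related features are left out because they need the refresh schedule.

```python
def extract_features(history, req, recent_window=40):
    """Compute a subset of the Table 6.1 features for one request.

    history: earlier requests, oldest first, as dicts with the keys
             clock, rank, bank, row, op (field names are hypothetical).
    recent_window: ASSUMPTION, cycles that count as "recent".
    """
    def same_bank(h):
        return (h["rank"], h["bank"]) == (req["rank"], req["bank"])

    bank_history = [h for h in history if same_bank(h)]
    last = bank_history[-1] if bank_history else None
    recent = [h for h in history if req["clock"] - h["clock"] <= recent_window]
    return {
        "same-row-last": int(last is not None and last["row"] == req["row"]),
        "is-last-recent": int(last is not None
                              and req["clock"] - last["clock"] <= recent_window),
        "op": int(req["op"] == "WRITE"),
        "last-op": int(last is not None and last["op"] == "WRITE"),
        "same-row-prev": sum(h["row"] == req["row"] for h in bank_history),
        "num-recent-bank": sum(same_bank(h) for h in recent),
        "num-recent-rank": sum(h["rank"] == req["rank"] for h in recent),
        "num-recent-all": len(recent),
    }


HISTORY = [
    {"clock": 0, "rank": 0, "bank": 0, "row": 5, "op": "READ"},
    {"clock": 10, "rank": 0, "bank": 1, "row": 7, "op": "WRITE"},
    {"clock": 90, "rank": 1, "bank": 0, "row": 5, "op": "READ"},
]
REQUEST = {"clock": 100, "rank": 0, "bank": 0, "row": 5, "op": "WRITE"}
print(extract_features(HISTORY, REQUEST))
```

The resulting dictionary is the kind of input a decision tree would consume: the binary features describe the bank state and the counts describe contention.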