MB3 D3.7{ Final Report on Memory Hierarchy Investigations. … · MB3 D3.7{ Final Report on Memory Hierarchy Investigations. Version 1.0 Document Information Contract Number 671697

MB3 D3.7– Final Report on Memory Hierarchy Investigations.Version 1.0

Document Information

Contract Number 671697

Project Website www.montblanc-project.eu

Contractual Deadline PM33

Dissemination Level Public

Nature Document

Authors Abdoulaye Gamatie (CNRS), Alejandro Nocua (CNRS), Gilles Sas-satelli (CNRS), David Novo (CNRS), Michel Robert (CNRS), Li-onel Torres (CNRS)

Contributors CNRS

Reviewers Pablo Oliveira (UVSQ), Patrick Schiffmann (AVL)

Keywords Design Modeling and Evaluation, Non Volatile Memory, MainMemory, Energy-Efficiency, DDR4, PCM, RRAM, gem5, NVMain

Notices: This project has received funding from the European Union’s Horizon 2020 research and innovation

programme under grant agreement No 671697.

c©Mont-Blanc 3 Consortium Partners. All rights reserved.

Ref. Ares(2018)4547268 - 05/09/2018

MB3 D3.7 - Final Report on Memory Hierarchy Investigations.Version 1.0

Change Log

Version Description of Change

v0.1 Initial version by CNRS

v0.2 Addition of new results on HPC applications by CNRS

v0.3 Revision after internal review by CNRS

v0.4 Revision after review by AVL partner

v0.5 Revision after review by UVSQ partner

v1.0 Final version

2

D3.7 - Final Report on Memory Hierarchy Investigations.Version 1.0

Contents

Executive Summary 4

1 Introduction 51.1 Trends About Main Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51.2 System and Memory Models for Design Evaluation . . . . . . . . . . . . . . . . . 71.3 Outline of this Deliverable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2 Main Memory Modeling in NVMain 82.1 Selected Memory Technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82.2 Model Refinement: Memory and Architecture . . . . . . . . . . . . . . . . . . . . 9

2.2.1 DDR4 Technology Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . 92.2.2 NVM Technology Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . 92.2.3 Simulated Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

3 Intrinsic Technology Evaluation 113.1 Memory Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113.2 Idle Memory Power Consumption . . . . . . . . . . . . . . . . . . . . . . . . . . . 123.3 Some Observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

4 Evaluation based on Parsec benchmarks 144.1 Selection of Representative Benchmarks . . . . . . . . . . . . . . . . . . . . . . . 144.2 Evaluation on Different Memory Technologies . . . . . . . . . . . . . . . . . . . . 15

5 Evaluation on Typical HPC mini-applications 175.1 Performance of Selected Mini-Applications . . . . . . . . . . . . . . . . . . . . . . 17

5.1.1 High Performance Conjugate Gradient . . . . . . . . . . . . . . . . . . . . 175.1.2 Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics . . . 18

5.2 Memory Activity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

6 Open NVM Evaluation Directions 226.1 NVM Latency Mitigation by Varying Memory Parameters . . . . . . . . . . . . . 22

6.1.1 Memory Frequency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226.1.2 Number of Channels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

6.2 NVM-specific Memory Controllers? . . . . . . . . . . . . . . . . . . . . . . . . . . 236.3 Impact of Programming Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246.4 NVM Integration in Memory Hierarchy . . . . . . . . . . . . . . . . . . . . . . . 28

7 Concluding Remarks and Insights 30

Acronyms and Abbreviations 31

A Memory Model Parameters in NVMain 32

3


Executive Summary

This deliverable presents some results on the integration of Non-Volatile Memory technologiesat main memory level in multicore architectures. This follows our initial work (see deliverableD3.2) on a design exploration framework for analyzing these technologies in different levelsof the memory hierarchy, e.g., caches and main memory. It improves the design evaluationframework by targeting ARMv8 heterogeneous multicore architectures, while the main memorytechnology models have been modified to reflect as much as possible the typical compute nodearchitectures under consideration in the MontBlanc3 project.

We define a gem5 model of the Exynos 7420 ARMv8 chip, based on which we quantitativelyevaluate the possible impacts of various main memory technologies (DDR4, Phase-Change Mem-ory, Resistive RAM) on both performance and energy-to-solution. This is achieved by consid-ering typical benchmarks, e.g., Parsec and a few HPC mini-applications studied in WP6, whichaims at studying a set of kernels, mini-apps and applications for co-design and system softwareassessment. Further design improvement opportunities are addressed, by focusing on memoryarchitecture parameters (e.g., frequency, multi-channel design) and on the implications of pro-gramming models, via their runtimes as investigated in WP7, which deals with runtimes forparallel programming.

4


1 Introduction

Main memory is a critical component in compute systems as it plays a central role regardingthe efficient data access, which has a direct impact on the overall system performance andenergy-efficiency. This is true in both embedded and high-performance computing domains.

1.1 Trends About Main Memory

A number of important trends have been identified in the literature about main memory [28],as recalled below:

• Increasing need for memory capacity and bandwidth: modern systems keep on integratingseveral processing units, which can be of different nature, i.e., heterogeneous, (e.g., CPUs,GPUs or FPGA) in order to meet the requirements of performance-demanding (data-intensive) applications. Following this trend, a drop by 30% of the memory capacity percore is envisioned every two years [24]1 (see Figure 1). As a consequence, the memorybandwidth per core will decrease.

Figure 1: Trends on memory capacity gap w.r.t. integrated core counts in server systems(source: [24]).

• End of DRAM technology scaling: the DRAM technology will hardly scale beyond a cer-tain technology node, because of higher cell manufacturing complexity, reduced reliability,and more cell leakage necessitating higher refresh rates [29].

• Memory energy/power as a key system design concern: the energy consumed by off-chipmemory hierarchy (including L3 cache, DRAM, memory controller and their connectinginterfaces) can reach up to 41% of the total energy consumption of a compute system [23].In addition, the periodic refresh mechanism of DRAM induces power consumption evenwhen no activity occurs in the memory. In addition, the background power consump-tion related to the peripheral circuitry (e.g., word-line drivers, sense-amplifiers and writedrivers) is another concern as it contributes significantly to the memory leakage.

For instance, authors in [2] previously quantified the increase in performance and poweroverheads of refresh when using high-density 32 GB memory devices, and they observedthat refresh can account for more than 20% of the DRAM energy consumption whilerunning SPEC CPU2006 benchmarks with a full-system simulator (see Figure 2). Thebackground energy consumption is the largest part of the overall energy reported for eachmemory size.

1To the best of our knowledge, no updated projection is provided in the literature regarding this trend since[24].

5


(a) Impact on energy (b) Impact on performance

Figure 2: Impact of refresh on energy and performance. Energy breakdown: ref = refresh,rd/wr = read/write, act/pre= activate/pre-charge, bg = background, ipc = instructions percycle and avg lat = average memory access latency (source: [2]).

In the present document, we mainly focus on the last trend, i.e., energy-efficiency concern inhigh-performance compute node design from memory perspective. As motivated in deliverableD3.2 [12], the main memory market, though extremely competitive, is a promising applicationarea for Non-Volatile Memories (NVMs) from different the following perspectives:

• Technology independence: main memory comes in form of discrete memory modules madeof chips with a JEDEC DDR standardized interface, thereby lifting the difficulty of inte-grating different technologies on the same substrate.

• Voiding refresh needs: DRAM spends a significant amount of energy into the necessaryrefresh cycles. The overall resulting complexity of DRAM required for coping with the hi-erarchical organization in pages, ranks and banks incurs significant performance penaltiescompared to NVMs. Indeed, thanks to their inherent non-volatility, most NVMs do notrequire any refresh mechanism to preserve their stored data. This also confers to NVMnegligible leakage.

• Scaling and density NVM memory technologies can scale better than DRAM. For in-stance, the Phase-Change Memory is expected to scale to 9nm around 2022, while a 20nmprototype was already proposed in 2008 by IBM [28]. This enables denser memories thatcould meet the memory capacity requirements of multi/manycore compute systems.

We carry out a system-level analysis of three memory models, i.e., a DDR4 modules fromMicron used in the Dibona compute nodes built by Atos, Phase-Change Memory (PCM) andResistive RAM (RRAM), while integrated in the main memory of a typical ARMv8 big.LITTLEmulticore system. Figure 3 illustrates the corresponding evaluated system designs. Our initialinvestigations [12] on the opportunities offered by Non-Volatile Memories (NVMs) in memoryhierarchy suggested the potential of NVMs in energy reduction in systems (up to 20% lessenergy). Due to the lack of accurate timing and energy information in considered NVM models,our reasoning relied the comparison of some accurate DDR models versus their correspondingvariants resulting from an artificial deactivation of the DDR refresh operations.

Now, we aim at digging more into these initial observations by refining the previous modelingsetup so as to quantify more accurately the energy-efficiency gains enabled by NVMs. Startingfrom the PCM and RRAM model templates provided by NVMain [32], we enhance them throughsome extrapolation relying on NVM-related literature. In addition, our evaluation targets state-of-the-art heterogeneous ARMv8 multicore systems, which are envisioned for better processingefficiency of compute nodes in upcoming HPC systems [34].

6


(a) DDR4-based main memory (b) NVM-based main memory (PCM or RRAM)

Figure 3: Sketch of evaluated system designs

1.2 System and Memory Models for Design Evaluation

In order to evaluate the impact of the above NVM technologies w.r.t. DDR4 in multicoresystems, two simulation tools are used: the gem5 cycle-approximate architecture simulatorcoupled with the NVMain main memory simulator. By leveraging them, potential gains interms of performance and energy are quantified regarding NVM integration. Indeed, gem5 [13]provides an accurate evaluation of system performance [5] thanks to its high configurability for afine grained architecture modeling. Its full-system simulation mode runs unmodified operatingsystems. It includes several predefined architecture component models, e.g., CPU, memory andinterconnect. This simulator produces detailed execution statistics (even at micro-architecturelevel) for power and footprint area estimation.

NVMain [32] is an architectural-level simulator that enables to model main memory designwith both DRAM and emerging non-volatile memory technologies [20]. It is flexible enough toallow the implementation of various memory controllers, interconnects and organizations. Itis well-integrated together with the gem5 simulator, which makes it possible to achieve cycle-approximate multicore system simulation together with cycle-level simulation of a variety ofmain memory technologies in the system. In addition, NVMain can also accomodate tracesas input for facilitating design space exploration by considering either user-defined emergingmemory technologies or those provided together with the simulator. In this deliverable, weconsider both options by calibrating a few DRAM models based on available data-sheets andby evaluating the PCM and RRAM memory technology models already present in NVMain.

1.3 Outline of this Deliverable

The remainder of this deliverable is organized as follows: Section 2 presents the calibration ofour models for the design exploration. Then, Section 3 evaluates the defined models by focusingon their intrinsic properties. Sections 4 and 5 extend the previous evaluation by considering theParsec benchmark suite and two representative HPC applications in order to assess the memorymodels application-wise. A few insights are discussed from all these evaluations in Section 6.Finally, some conclusions and perspectives are given in Section 7. Note that further technicaldetails on the considered experiment setup are provided in Appendix A.

7


2 Main Memory Modeling in NVMain

This section describes the setup of our modeling and evaluation framework. Starting fromthe selected memory technologies, we provide a brief description of the evaluated multicoreheterogeneous architecture used in the rest of this deliverable.

2.1 Selected Memory Technologies

In the present work, we consider two NVM technologies, namely RRAM and PCM, which arecompared to DRAM at main memory level. Table 1 introduces some numbers regarding themain characteristics of these memory technologies according to literature [4]. It has to beconsidered as a relative comparison basis that will serve to assess the specific memory modelsused in the rest of this document. One must keep in mind that generally at system level, theinterconnect used for memory access has a notable (if not the highest) impact on those metrics.

Table 1: Approximate device-level characteristics of NVMs

DRAMNon Volatile MemoryRRAM PCM

Cell size (F2) 60 - 100 4 - 10 4 - 12Read latency (nsec) ∼10 ∼10 20 - 60Write latency (nsec) ∼10 ∼50 20 - 150

Read energy medium low mediumWrite energy medium high high

Leakage power medium low lowWrite endurance > 1015 108 - 1011 108 - 109

Maturity mature test chips test chips

Among the most relevant features of NVMs for energy-efficiency, we note their low leakagepower. Since they are non volatile, they do not require any data refresh mechanism to keepwritten values in the memory, contrarily to DRAM. However, writing operation on those NVMsis more expensive compared to DRAM.

Phase-Change Memory (PCM) was first proposed in 1970 [30] by Gordon Moore, co-founderof Intel. However, material quality and power consumption issues prevented commercializationof the technology until very recently. Initial PCM prototypes achieved a switching time on theorder of 100 ns [15]—conventional volatile SDRAM has a switching time on the order of 5 ns— but a January 2006 Samsung Electronics patent application indicates that PCM can achieveswitching times as fast as 5 ns [31]. A number of industrial actors already announced PCM-based device prototypes (e.g., Samsung, Toshiba, IBM and SK Hynix [33]). More recently, Inteland Micron announced the 3D XPoint NVM technology [18], which shares many similaritieswith the PCM technology (it is generally believed that it is a phase-change memory-basedtechnology), but details of the materials and physics of operation have not been disclosed. A375 GB SSD PCIe card from Intel, aimed at enterprise markets, is going to be released as thefirst product with this technology around 2019. Both PCM and 3D XPoint technologies areseen as potential solutions for a compromise between main memory and storage-class memory.Their higher access latencies and dynamic power consumption compared to DRAM (see Table1) may however become penalizing at main memory level.

The Resistive RAM (RRAM) memory technology [1] goes back to 2003, when Unity Semi-conductor started developing its Conductive Metal-Oxide (CMOx) technology. This technologyworks by changing the resistance across a dielectric solid-state material, also called memristor.

8


The RRAM technology has proved a promising candidate for main memory thanks to its lowerpower consumption and access latency compared to PCM (see Table 1). It has shown a goodstability at 10nm node fabrication [1]. Among major industrial players investing in the RRAMtechnology, one can mention Panasonic Semiconductor [14].

2.2 Model Refinement: Memory and Architecture

The memory configurations used in our study are indicated in Table 2. They are designed asLPDDR technologie. These parameters are used by NVMain to estimate the delay and energyconsumption of the main memory in the execution of a particular workload.

Table 2: Compared main memory configurations

Name (Channel, Bank, Rank) Freq [MHz] Size [GB]

DDR4 Micron [26] (2,4,2) 1333 4PCM Samsung [6] (2,1,2) {400, 1333} 4

RRAM Panasonic [19] (2,4,2) {400, 1333} 4

2.2.1 DDR4 Technology Modeling

As a reference for our study, we consider the DRAM technology used in the Cavium Thun-derX2 compute nodes of the Dibona supercomputer built by Atos/Bull, within the Mont-Blanc3project. This technology consists of a DDR4 from Micron defined from the corresponding data-sheet [26]. The bandwidth of the reference memory module considered for our investigation isaround 21.3GB/s according to the data-sheet. The details about the parameters of the corre-sponding simulation model are given in Annex A of the current document (see Tables 7 and 8).They are captured by over 60 parameters regarding memory architecture, timing and energyconsumption. Some of the captured memory components are illustrated in Figure 4.

Figure 4: Memory organization.

2.2.2 NVM Technology Modeling

First of all, we note that building realistic models of PCM and RRAM technologies is not atrivial task since only a few prototype designs are currently available. The design parametersof these prototypes that are required for devising accurate models are hardly available. A fewNVM performance and energy evaluation tools [8] [27] could be considered as an alternative solu-tion. Unfortunately, they fail to provide the fine-grain parameter description necessary to mainmemory modeling. For all these reasons, we decided to rely our study on the PCM and RRAMmodels provided in NVMain. They respectively result from Samsung [6] and Panasonic [19]specifications, presented in the International Solid-State Circuits Conference (ISSCC’20122),

2To the best of our knowledge, these are the most complete models publically available for NVM evaluation atmain memory level. In addition, the recent trends about NVMs observed in literature suggest that the relativecomparison between these technologies has not changed.

9


the top venue for presenting the most advanced memory prototypes. While the latency param-eters are consistent with the designs from Samsung and Panasonic, it is not the case of theenergy parameters (this has been confirmed by the authors of NVMain).

We therefore modified the values of energy parameters in order to get as close as possible tothe featured NVMs. For this purpose, we extrapolated their parameter values regarding energyaccording to existing literature on NVM [22] [21] [4]. In particular, we considered the followingenergy ratios, w.r.t. the above DDR4 model as the DRAM reference:

• 2.1x and 43.1x more energy-consuming for read and write on PCM respectively;

• 1.2x and 23.7x more energy-consuming for read and write on RRAM respectively.

On the one hand, the above ratios considered for PCM result from [22] in which authorscarried out an exhaustive analysis of a number of PCM prototypes proposed in literature from2003 to 2008. We assume these ratios remain relevant enough for the Samsung PCM modelproposed in 2012 and integrated within NVMain. Note that these ratios may have been reducedtoday with recent advances in PCM technology.

On the other hand, the ratios concerning the RRAM have been chosen empirically fromthe trends found in NVM-related literature [21] [4], which suggests that RRAM has a bit-leveldynamic energy that is comparable to DRAM. However, in terms of latency, writes are severaltimes slower than reads. So, we assume that the write energy on RRAM is notably highercompared to DRAM, while it remains lower than for PCM. In fact, the recent trends aboutRRAM tend to suggest that this technology has at least the same energy cost in read comparedto DRAM, while the gap in write energy is getting reduced between the two technologies.Therefore, the ratio values chosen above can be seen as an upper-bound approximation thatallows us to conduct a conservative analysis in the rest of this document. In other words, thereexists a probable margin for improving the quantified gains reported in this work.

2.2.3 Simulated Architecture

We consider the Exynos 7420 ARMv8 chip embedded in the Exsom board [17] for modelingour simulated architecture in gem5. This chip relies on the big.LITTLE technology proposedby ARM. Table 3 summarizes the main parameters of the architecture.

10


Table 3: Exynos 7 Octa (7420) SoCs characteristics and corresponding gem5 models

Exynos 7 Octa (7420) ex7 gem5 model

ParametersCortex-A53ARMv8 ISA(in-order)

Cortex-A57ARMv8 ISA(out-of-order)

ex7 LITTLE(in-order)

ex7 big(out-of-order)

Max. core count 4 4 4 4Frequency 1.4GHz 2.1GHz 1.5GHz 2.1GHz

L1ISize 32kB 48kB 32kB 48kB

Assoc. 2 3 2 3Latency 3 3 3 3

L1DSize 32kB 32kB 32kB 32kB

Assoc. 4 2 4 2Latency 3 3 3 3

L2Size 256kB 2MB 256kB 2MB

Assoc. 16 16 16 16Latency 12 16 12 16

Interconnect CCI-400 XBAR - 256 bus width @533MHzMemory 3GB LPDDR4@1555MHz, 2 channels see Table 2

Memory bus width 32 32Technology 14nm FinFET -

3 Intrinsic Technology Evaluation

As a preliminary evaluation, we focus on intrinsic properties of modeled main memories: accesslatency and idle power consumption. We aim at assessing the soundness of the base memorymodels decided in Section 2.

3.1 Memory Latency

For each of the three memory technology models introduced above, we consider the lmbenchbenchmark suite [25] to assess their corresponding read latencies. In particular, we use thelat mem rd benchmark to measure the memory hierarchy latency when using a DDR4 in mainmemory versus PCM and RRAM integration. Note that the L1 and L2 caches are both inSRAM.

Basically, the memory hierarchy latency is computed by repeatedly accessing contiguousdata structures of increasing sizes, given a particular value of memory stride length. In ourexperiments, we consider a stride length of 4kB. The result of these repetitive computationsare variations in memory access latency, according to the size of the three hierarchy levels ofinterest here: L1 cache, L2 cache and main memory.

The result provided by lat mem rd is reported in Figure 5 (here, the plotted latency valuesfollow a logarithmic scale). The benchmark has been executed while using a single LITTLEcore according to the associated memory hierarchy (see Table 3), i.e. 32kB L1 cache, 256kBL2 cache and 4GB main memory. The frequency of the DDR4 memory is 1333MHz, while thefrequencies of both PCM and RRAM is set to 400MHz according to the initial configurationinherited from NVMain. For all three cases, the rest of the memory parameters is the same,e.g., memory controller, memory organization.

As expected, Micron’s DDR4 model offers a better memory latency compared to both RRAMand PCM. Their respective main memory latencies are around: 69 nsec, 383 nsec and 813 nsec.In other words, read accesses in the considered DDR4 model are respectively 5.5 and 11.7 timesfaster in comparison with RRAM and PCM respectively.

11


Figure 5: Memory hierarchy latency with different main memory technologies (log scale).

Comparing these results w.r.t. the tendencies usually found in literature (see Table 1), weobserve that the device-level slowdown of PCM in read latency compared to RRAM and DRAMis confirmed here at memory system level. However, this is not the case for RRAM compared toDRAM: while the read latencies of the two memory technologies are similar at device level, itis not true at memory system level since read accesses with DDR4 are faster than with RRAM.

3.2 Idle Memory Power Consumption

As motivated earlier, an important advantage of NVMs is that they do not require any datarefresh mechanism thanks to their non-volatile nature. It is not the case of DRAM, which canspend a non negligible amount of energy due to refresh process and background power. Therefresh process periodically reads data from a memory area and immediately rewrites the datato the same area, while the background power is due to the peripheral logic.

(a) Idle energy (b) Idle power

Figure 6: Idle memory energy and power consumption

Given the anticipated significance of the refresh and background energy consumption forDRAM, we assess its impact compared to the considered RRAM and PCM memory models.For this purpose, we evaluate several scenarios with little-to-no access in main memory, i.e., noread, no write. In Figure 6, we report the energy and power consumption of an idle memorystatus for different durations, varying from 1 second to 30 seconds. This is achieved in gem5by executing the ”sleep” system call with different input delay values in seconds. The reported

12


values enable us to analyze the impact of the overhead related to the background power andthe refresh mechanism on the DDR4 memory energy consumption.

Here, the total energy consumption computed by NVMain consists of the sum of four com-ponents: background energy, refresh energy, activate energy and burst energy. In all experimentsshown in Figure 6, the background energy represents almost 98% of the total memory energyconsumption. The refresh energy accounts for around 1% (it is equal to 0% for RRAM andPCM), while the remaining energy corresponds to activate and burst energies.

The evolution of the energy consumption of 4GB main memory for each of the three memorytechnologies is summarized in Figure 6. We observe that with the DDR4 the energy consumptionof the main memory grows significantly with bigger input delays. On the contrary, only amarginal energy increase is observed with RRAM and PCM, showing the potential energysaving opportunity offered by NVMs for system execution scenarios where the main memory isunder-used.

Figure 7: Idle Memory Energy Consumption: 4Gb versus 32Gb main memory

When increasing the size of the main memory, NVMs further mitigate the energy consump-tion. This is illustrated in Figure 7, where the size of the main memory is increased from 4GBto 32GB, while keeping unchanged the other memory parameters. The reported results concernan execution of the ”sleep” system call for 15 seconds.

3.3 Some Observations

The above preliminary evaluation of the NVM models regarding their latency and energy con-sumption is in line with existing observation from literature: refresh and background energyconsumption has an important impact on the overall energy consumption of advanced DRAM-based memories [2]. This suggests the considered PCM and RRAM models are relevant enoughto be considered for further investigations.

In particular, provided these results, we were able to quantify the latency gap between theconsidered NVM technologies w.r.t. a typical DDR4 model when integrated in main memory.Note that the read latencies have been determined by using a benchmark designed for thisspecific purpose within the stream benchmark set. Then, it will be interesting to evaluate theimpact of NVMs application-wise, i.e., which is a more general case of workload.

13


4 Evaluation based on Parsec benchmarks

We consider the Parsec benchmark set for evaluating the impact of the PCM and RRAM models,compared to the Micron DDR4 model, within a 4Gb main memory model. The applicationsprovided within Parsec provide evaluation scenarios that are more representative of applications.

4.1 Selection of Representative Benchmarks

For the sake of simplicity, we consider three categories of applications, each featured by onerepresentative workload from Parsec, i.e., blackscholes, dedup and swaptions, described below:

• the dedup workload is mainly memory-bound benchmark, particularly with a high numberof read transactions in the memory. It is dedicated to the definition of high data streamcompression ratios.

• the blackscholes workload is mainly compute-intensive, with marginal memory accessescompared to the two others. It expresses an analytical price calculation based on theBlack-Scholes partial differential equation.

• the swaptions workload shows intermediate memory access properties compared to theprevious two. It applies Monte Carlo simulation to compute prices.

These workloads are selected based on a previous characterization described in [3].Theircorresponding properties in terms of memory access and computation load are summarized inTable 4. In the rest of the document, we therefore consider blackscholes, dedup and swaptions asthe respective representatives of low memory access, high memory access and moderate memoryaccess applications.

Table 4: Breakdown of instructions (in Billions) for selected Parsec benchmarks [3]

Workloads FLOPs Reads Writesblackscholes 1.14 0.68 0.19

dedup 0 11.71 3.13swaptions 2.62 5.08 1.16

In order to confirm the memory access properties of the above three workloads with mediuminput sets3, we run them by considering the same system setup as described in Section 3.1. Theresulting profiles in terms of read and write bandwidths for each of the workloads, are illustratedin Figure 8. They are compliant with the read/write volumes characterized in [3]. Indeed theread/write bandwidth thresholds obtained with dedup and swaptions globally dominate that ofblackscholes since they perform more memory accesses than blackscholes. This is particularlyvisible for dedup, which has the highest memory activity in terms of both read and writetransactions, among the three benchmarks.

Here, we can notice that only a small fraction of the memory bandwidth is used by the threebenchmarks (i.e., far from memory bandwidth saturation, which is around 21.3GB/s accordingto the reference data-sheet [26]).

3The breakdown shown in Table 4 considers large input sets run on a real computer. Here, it is used as aninitial criterion for benchmarks selection w.r.t. their characteristics. In our experiments, we use rather mediuminput sets because it is more tractable with gem5 in term of simulation time.

14


(a) blackscholes (low memory accesses) (b) dedup (high memory access)

(c) swaptions (medium memory access)

Figure 8: Read/Write bandwidth for three Parsec benchmarks with different DDR4 memoryaccesses

4.2 Evaluation on Different Memory Technologies

Now, let us consider the execution of the three workloads by varying the memory technologyin main memory as in previous section, i.e., DDR4, RRAM and PCM. Medium input sets areused for the experiments. Figure 9 reports the different results. More precisely, the executiontime and the main memory energy consumption are illustrated.

• On the one hand, in Figure 9a, we observe that the maximum execution slowdown be-tween DDR4 and PCM is about 7.2x, while between DDR4 and RRAM it is only about1.7x. These slowdown factors of NVMs over DDR4 are obtained with the dedup memory-intensive workload. They are respectively smaller than the previous 11.7x and 5.5x mem-ory latency gap resulting from the execution of lat-mem rd benchmark (see Figure 5)benchmark. This suggests that the detrimental impact of the intrinsic higher memorylatencies of NVMs compared to DRAM, is limited application-wise on corresponding ex-ecution times, even for a memory-intensive workload. For application workloads withlow and moderate memory accesses, the slowdown ratios between DDR4 and NVM basedscenarios are further reduced, as illustrated for blackscholes and swaptions in Figure 9a.

We notice that the important overhead in execution time observed for the dedup andswaptions workloads, while using PCM in main memory, is due to the write transactions,which are more present in dedup. Since write latency with PCM is higher than with

15


(a) Executime time (b) Memory energy consumption

Figure 9: Normalized execution time and main memory energy consumption for DDR4 andNVMs (RRAM and PCM), on three Parsec benchmarks (smaller execution time / energy con-sumption is better)

RRAM (see Table 1), this exacerbates the gap between the execution times resulting withthe two NVM technologies. The overhead in execution time observed for the swaptionsworkload compared to the blackscholes workload mainly comes from the higher numberof read transactions in the former.

• On the other hand, the energy consumption of the main memory with NVMs is alwayssmaller than with DDR4, no matter the workload type (see Figure 9b). Of course, here,the energy gap observed application-wise is more important than the ”Idle Memory”scenario shown previously in Figure 6, due to the dynamic activity of the memory. Inorder to appreciate the overall benefit of NVM integration in main memory, let us considerthe Energy-Delay-Product (EDP) from the main memory perspective only. The EDP iscalculated as the product of the workload execution time and the energy consumed by themain memory during the execution time. The aim is to assess the global benefit of eachdesign alternative. Figure 10 shows the normalized EDP of all memory configurations,according to the three representative workloads. Here, one can see that NVMs alwaysprovide better EDP than DDR4, except for one scenario: when dedup is executed whilethe main memory is in PCM. In other words, for memory-intensive applications, the energyreduction enabled by the PCM technology cannot be compensated by the correspondingexecution slowdown, because of its higher device-level read/write latencies.

Figure 10: Energy-Delay-Product applied to main memory for DDR4 and NVMs (RRAM andPCM) on three Parsec benchmarks (smaller EDP is better)

16


5 Evaluation on Typical HPC mini-applications

The evaluation proposed in this deliverable relies on simulation tools, which can required longerprocessing time. We therefore select two representative mini-applications (identified from WP6studies on a set of kernels, mini-apps and applications for co-design and system software assess-ment) that can be simulated in a reasonable time. This allows us to assess the impact of thedifferent memory technologies while executing these mini-applications.

5.1 Performance of Selected Mini-Applications

5.1.1 High Performance Conjugate Gradient

The High Performance Conjugate Gradient (HPCG) mini-application [9] for a 3D chimneydomain on an arbitrary number of processors. It has been devised with the aim of providinga metric for assessing HPC systems. It enables to exercise computational and data accesspatterns, which feature a range of applications. HPCG shares the same goal as the popularHigh Performance LINPACK (HPL) benchmark [10].

The HPCCG 1.0 OpenMP implementation is considered in the evaluation presented below.It is executed on 8-cores big.LITTLE architecture, where its three input parameters nx, ny andnz are all set to 104. These parameters denote the number of nodes in the 3D dimension.

(DDR4)

Initial Residual = 2760.46...Mini-Application Name: hpccgMini-Application Version: 1.0Parallelism:

MPI not enabled:Number of OpenMP threads: 8

Dimensions:nx: 104ny: 104nz: 104

Number of iterations: : 149Final residual: : 1.37739e-20

Time Summary (in sec):Total : 59.956DDOT : 3.676WAXPBY : 14.94SPARSEMV: 41.276

FLOPS Summary:Total : 1.07267e+10DDOT : 6.70419e+08WAXPBY : 1.00563e+09SPARSEMV: 9.05066e+09

MFLOPS Summary:Total : 178.91DDOT : 182.377WAXPBY : 67.3111SPARSEMV: 219.272

(RRAM)

Initial Residual = 2760.46...Mini-Application Name: hpccgMini-Application Version: 1.0Parallelism:MPI not enabled:Number of OpenMP threads: 8






(PCM)

Initial Residual = 2760.46...Mini-Application Name: hpccgMini-Application Version: 1.0Parallelism:MPI not enabled:Number of OpenMP threads: 8






Figure 11: Summary of HPCCG performance numbers according to 3 memory technologies

The performance numbers obtained upon executing HPCCG with three different main mem-ory configurations (DDR4, RRAM and PCM) are summarized in Figure 11. As expected, the

17


configuration with DDR4 provides the best performance in terms of mega floating point opera-tions per second (MFLOPS). The performance degradation induced by RRAM is very limitedcompared to PCM. This is due to their respective high memory access latencies.

5.1.2 Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics

We consider first the Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics mini-application [16], also known as Lulesh. This mini-application encodes a discrete approximationof the hydrodynamics equations by partitioning the spatial problem domain into a collectionof volumetric elements defined by a mesh. It presents high instruction-level and memory-levelparallelism with bursts of independent memory accesses. We consider its implementation inOpenMP.

(DDR4)

Loading new script...Running problem size 15ˆ3per domain until completionNum processors: 1Num threads: 8Total num. of elements: 3375

...

Run completed:Problem size = 15MPI tasks = 1Iteration count = 400Final Origin Energy =

5.702894e+04Testing Plane 0 of Energy

Array on rank 0:MaxAbsDiff = 1.091394e-11TotalAbsDiff= 8.856107e-11MaxRelDiff = 5.046922e-13

Elapsed time = 2.54 (s)Grind time (us/z/c) =

1.8829452 (per dom)( 1.8829452 overall)

FOM = 531.08291 (z/s)

(RRAM)


...





1.8628081 (per dom)( 1.8628081 overall)

FOM = 536.82393 (z/s)

(PCM)


...





4.4751096 (per dom)( 4.4751096 overall)

FOM = 223.45821 (z/s)

Figure 12: Summary of Lulesh execution outputs according to 3 memory technologies

The output of Lulesh execution with the previous three main memory configurations (DDR4,RRAM and PCM) is summarized in Figure 12. Elapsed time denotes how much time it takesto perform the simulation. Grind time denotes the time it takes to update a single zone forone iteration of the time-step loop. The Lulesh Figure-Of-Merit (FOM) gives the number ofupdated zones per second.

From these results, the configurations with DDR4 and RRAM lead to almost similar per-formance results with respectively 531.08 and 536.82 z/s FOM values (RRAM even slightlyoutperforms DDR4). They are far better than with the PCM configuration, which has penaliz-ing memory access latency.

18


5.2 Memory Activity Analysis

We focus on the main memory activity for the previous two mini-applications in order toanalyze the implication of the different memory technologies w.r.t. the performance resultsdescribed above. Figure 13 presents the main memory read/write bandwidths for the differenttechnologies.

(a) Main memory in DDR4 (Lulesh) (b) Main memory in DDR4 (HPCCG)

(c) Main memory in RRAM (Lulesh) (d) Main memory in RRAM (HPCCG)

(e) Main memory in PCM (Lulesh) (f) Main memory in PCM (HPCCG)

Figure 13: Main memory Read/Write bandwidth for Lulesh and HPCCG mini-applications

Overall, the Lulesh mini-application exhibits lower bandwidth thresholds (both for read andwrite) than HPCCG. Therefore, performance penalty is expected to be more visible with the

19


latter. This is the reason why the gap in execution times between DDR4 configuration and NVMconfigurations is more important for HPCCG, e.g., more than four times when comparing DDR4and PCM (see Figure 14b). This gap remains below 2.5x for these technologies when consideringLulesh (see Figure 14a). Note that for both mini-applications, the RRAM technology providescomparable execution time than DDR4.

(a) Lulesh (b) HPCCG

Figure 14: Normalized execution time and main memory energy for Lulesh and HPCCG

The main memory energy consumption shown in Figure 14 for both mini-applications showsthat NVMs always provide improvements compared DDR4, despite their possible performancepenalty. Nevertheless, when considering the Energy-Delay-Product (EDP) figure-of-merit, PCMnever appears as efficient as DDR4 because of the high performance degradation it induces. Thisis not the case of RRAM, which always provides a better EDP than the DDR4 configuration,whatever the mini-application.

The energy breakdown for the main memory is shown in Figure 15, considering both Luleshand HPCCG mini-applications. On the one hand, the DDR4 configurations show that the maincontributor is the background energy. The dynamic activity of the memory as well as the refreshare quite marginal. The cost of the refresh is unexpectedly negligible for some unknown reasons.On the other hand, the ratio of the background energy over the total energy is significantlyreduced in NVM configurations, which suggests that the current saved in peripheral circuitrywith NVM integration is beneficial to the leaky parts of DDR4 memory devices.

(a) Lulesh (b) HPCCG

Figure 15: Memory energy breakdown for Lulesh and HPCCG

Remark. We notice that the refresh energy proportion reported in Figure 15 does not seem tobe in-line with the typical refresh energy ratios found in DRAM technologies (see the introduc-

20


tion section of this deliverable). This probably relates to the limitation of the NVMain-basedmodeling even though this simulator is currently the most advanced solution available for mainmemory simulation with NVM technologies. Taking into account advanced features4 of DDR4technologies for refresh management (e.g., temperature-aware low power auto-self refresh, mem-ory utilization-based refresh) requires further extensions of the simulator with suitable mecha-nisms. This is certainly a important perspective that deserves attention.

4According to http://www.memcon.com/pdfs/proceedings2013/track2/Designing_for_DDR4_Power_and_Performance.pdf, up to 20% savings in refresh energy are possible.

21

http://www.memcon.com/pdfs/proceedings2013/track2/Designing_for_DDR4_Power_and_Performance.pdf

http://www.memcon.com/pdfs/proceedings2013/track2/Designing_for_DDR4_Power_and_Performance.pdf


6 Open NVM Evaluation Directions

The evaluations conducted in Sections 4 and 5 show that non-volatile memory technologies canbe beneficial to the improvement of energy-efficiency for typical high-performance workloads.Beyond the considered NVM design assumptions, there are further design opportunities thatmay deserve attention for additional gains in both performance and energy consumption. Thenext sections address a few possible directions that can contribute to this aim by focusingon memory architecture parameters (e.g., memory frequency, channel organization or memorycontroller - see Figure 4) and parallel programming models.

6.1 NVM Latency Mitigation by Varying Memory Parameters

In our previous experiments, we evaluated the read latency of main memory by configuringthe frequency of DDR4 to 1333MHz, while the frequencies of both PCM and RRAM is set to400MHz (see Section 3.1). An interesting direction is to explore the possibility of upscalingthe main memory frequency of NVMs so as to reduce their latency gap as observed previously.Another direction concerns the memory architecture in terms of channels, and more specificallyits number of channels.

6.1.1 Memory Frequency

We consider a new design setup, which assumes the same frequency level for all three memorytypes. In other words, the frequency of main memory in NVM is set to 1333MHz, similarly toDDR4. All other main memory parameters are kept unchanged, e.g., size, banks, ranks, etc.The lat mem rd benchmark is executed using a single LITTLE core according to the associatedmemory hierarchy (see Table 3), i.e. 32kB L1 cache, 256kB L2 cache and 4GB main memory.

Figure 16: Memory hierarchy latency with different main memory technologies (log scale Y-axis).

Figure 16 reports the obtained read latency values (according to a logarithmic scale). Theaverage latencies for RRAM and PCM are respectively 116 nsec and 265 nsec, while that of theDDR4 model is unchanged, i.e. 69 nsec. This represents a latency gap reduction from x5.5 tox1.7 for RRAM compared to DDR4 and from x11.7 to x3.8 for PCM compared to DDR4.

In the case of the RRAM technology, we notice that the saturation array size of the L1 cacheis slightly delayed compared to the other technologies. This is unexpected since the L1 and L2caches have similar characteristics in all three cases, i.e., in SRAM and with same sizes. Thisis probably a side-effect of the prefetching mechanism and/or cache replacement policy.

22


Assuming that upcoming NVM technologies could support higher frequency ranges, webelieve memory frequency upscaling is a potential optimization parameter for mitigating theimpact of NVM latency overhead on application execution time.

6.1.2 Number of Channels

Now, we evaluate the impact of the channel count in the main memory architecture. In theprevious memory design scenarios, we only considered two channel memory technologies. Moregenerally, memory architectures with multiple channels aim at increasing the data transfer ratebetween the memory and its associated controller by multiplying the number of communicationchannels between them.

Figure 17 summarizes the memory latency evaluation for different multi-channel memoryarchitectures. The number of channels is varied from 2 to 8 for NVMs, while the referenceDDR4 memory architecture is kept unchanged with 2 channels. Figure 17a shows the potentialimprovement achieved for the RRAM technology with architectures including either 4 or 8channels. The latency gap of RRAM technology is decreased up to x1.1 compared to the baselineDDR4 technology, i.e., almost equivalent latencies. When considering the PCM technology, thelatency gap w.r.t. DDR4 is reduced up to x2.6, which is also quite relevant.

(a) RRAM configurations (b) PCM configurations

Figure 17: Comparison of NVM memory system latencies with 2, 4 and 8 channels versus thereference DDR4 configuration: zoom from L2 cache to main memory (log scale Y-axis)

Multi-channel memory architectures therefore provide an interesting opportunity for miti-gating the latency overhead of NVMs. While, the latency improvement enabled here is higherthan that obtained through memory frequency upscaling, both opportunities could be combinedfor further gains.

6.2 NVM-specific Memory Controllers?

Another design question regarding the successful integration of NVMs in compute systemsconcerns adequate control mechanisms at hardware level. Indeed, memory controllers play acentral role in data allocation and movement within the main memory for meeting the lowlatency, high throughput and optimized energy requirements.

In the present study, all experiments have been achieved by considering the same controllerdesign parameters for both DDR4 and selected NVMs. Of course, this choice leaves room forfurther design optimizations since memory accesses have different costs in latency and energy.

23


Typically, the row buffer design could take into account the different latencies of hits and missesdepending on the memory technology: DRAM and PCM are known to have similar row bufferhit latency, while the row buffer miss latency is larger for PCM due to their related higher arrayaccess latency [28].

The aforementioned controller design concern should be consider the heterogeneous/homo-geneous nature of the main memory. Some existing work already addressed the case of hybridDRAM-NVM main memory design, where row buffer locality-aware data placement policiesare proposed [35]. For instance, the authors observed that streaming data accesses are PCM-friendly, while other data accesses (with reuse) are rather DRAM-friendly for data placement.In the case of non hybrid main memory integrating only NVM, as evaluated in the present work,the design of controller is an open question that remains to be addressed. We believe that re-visiting the design principle of such a mechanism can improve the benefits of NVM integrationby taking their specificity into account.

6.3 Impact of Programming Models

Beyond the above directions which focus on memory architecture and controller design, webelieve that leveraging of NVMs also calls for reconsidering the entire system design stack, andin particular the impact of programming models and system software.

We evaluate here the energy-efficiency of NVM-based compute systems relying on differentparallel programming models, which involve different memory activities through their associ-ated runtime systems. As an illustration, let us consider the OpenMP and OmpSs programmingmodels. OpenMP5 is a popular shared memory parallel programming interface. The OpenMP3.0 version (in which our benchmarks and mini-applications are implemented) features a thread-based fork-join task allocation model as illustrated in Figure 18a. It consists of a set of compilerdirectives, library routines and environment variables for the development of parallel applica-tions. OmpSs [11] is a task-based programming model (see Figure 18b) that improves OpenMPwith support for irregular and asynchronous parallelism, and for heterogeneous architectures.It incorporates dataflow concepts enabling its compiler/runtime to automatically move datawhenever necessary and to apply various useful optimizations.

(a) OpenMP fork-join parallel region (b) OmpSs task graph

Figure 18: Parallel programming models: OmpSs vs OpenMP

The Nanos++ runtime used with OmpSs enables an asymmetry-conscious task scheduling[7], referred to as botlev-ompss. More precisely, it incorporates Criticality-Aware Task Scheduler(CATS) scheduling policy CATS, which exploits the criticality of generated tasks, at executiontime, to schedule them. The most critical tasks, i.e., those appearing on the critical executionpath, are executed by the high-performance (big) cores. Typically, such tasks are representedin red color in Figure 18b. Less critical tasks are executed on low-performance (LITTLE) cores.

5https://www.openmp.org

24

https://www.openmp.org


These tasks are denoted by green color in Figure 18b. The CATS scheduler integrates workstealing between big and LITTLE cores. The runtimes of parallel programming models areinvestigated in WP7 within MontBlanc3 project. We rely on the insights gained from theseinvestigations.

Given its performance improvements over the OpenMP static scheduler for heterogeneousasymmetric platforms, we explore the impact of the three main memory technologies for HPCCGversions in OmpSs versus OpenMP. Figure 19 shows the evolution of processor speed by report-ing the number of million instructions per second (MIPS), while executing the two versions,

(a) OpenMP (DDR4) (b) OmpSs (DDR4)

(c) OpenMP (RRAM) (d) OmpSs (RRAM)

(e) OpenMP (PCM) (f) OmpSs (PCM)

Figure 19: HPCCG instructions per second with different main memory technologies: OpenMPversus OmpSs programming models

25


according to the big and LITTLE clusters.

More generally, we observe that the OmpSs version of HPCCG provides better performancethan the OpenMP version. This is in major part due to the botlev-ompss scheduler, whichbetter balances the tasks execution on both big and LITTLE cores for reducing executiontime, thanks to high MIPS thresholds. On the other hand, OpenMP scenarios show that theworkload is not scheduled as efficiently as in OmpSs (see Figure 19), so that only the big clus-ter appears globally active. Some preliminary experiments that we conducted on the ExsomARMv8 big.LITTLE board showed that the static, dynamic and guided scheduling strategiesof OpenMP lead to similar performance when executing the HPCCG mini-application. Nev-ertheless, further improvements may be possible by exploring all possible scheduling strategiesproposed in OpenMP. Note that the design of efficient runtime schedulers for parallel programsis addressed in other work-packages of the Montblanc 3 project.

As a result, the read/write bandwidth obtained with OmpSs version (see Figure 20) is signif-icantly higher than that of OpenMP version (see Figure 13). For instance, the read bandwidthis improved by around 4x, 3x and 1.8x for DDR4, RRAM and PCM respectively, while thewrite bandwidth is only marginally improved. Note that despite this important bandwidthimprovement measured by OmpSs, there is still a margin for DDR4 bandwidth saturation [26]in particular.

(a) Main memory in DDR4 (b) Main memory in RRAM

(c) Main memory in PCM

Figure 20: Main memory bandwidth for OmpSs version of HPCCG according to different mem-ory technologies

The impact of this improvement is depicted in Figure 21, where the execution time and mainmemory energy consumption are compared separately for each memory technology, dependingon the two programming models for HPCCG. The OmpSs runtime enables to reduce by more

26


(a) Normalized execution time (b) Normalized energy consumption

Figure 21: Separate comparison per memory technology for HPCCG mini-application: OpenMPvs OmpSs

than 50% the execution time of the OpenMP version for all three memory technologies. Italso allows a reduction of the energy consumption, by 60% for DDR4 and by 20% for PCM.However, it marginally increases the energy consumption for RRAM (around 2%). This is dueto the notable increase in write bandwidth enabled by OmpSs scheduler, specifically for RRAM.Writes on NVMs are very expensive operations both in latency and energy consumption.

(a) Speedup (b) Normalized energy

(c) Normalized memory EDP (lower is better)

Figure 22: Cross-comparison of different HPCCG mini-application scenarios

Figure 22 provides a cross-comparison of the different scenarios where the OpenMP versionwith DDR4 is considered as the reference scenario. It shows the variation in speedup, memoryenergy and EDP of the different combinations of programming models and memory technologies,

27


while executing the HPCCG mini-application.

The best configurations in term of EDP are obtained when executing the OmpSs version ofHPCCG with DDR4 and RRAM as main memory setups. The former memory technologyenables the best speedup while the latter provides the lowest main memory energy consumption.

Finally, the energy breakdown for the main memory when executing the OmpSs versionfollows a similar trend as seen in Section 5: the dynamic activity of the memory as well as therefresh are quite marginal here also, while the dominating background energy gets decreasedwith NVM integration (see Figure 23).

Figure 23: Memory energy breakdown for HPCCG in OmpSs

6.4 NVM Integration in Memory Hierarchy

All results obtained in this deliverable rely on the assumptions we made during the calibration ofthe considered NVM models in Section 2. Provided this basis, the global observations resultingfrom the reported experiments suggest that the RRAM technology is a better candidate forDRAM replacement at main memory level. The high overhead in performance and powerconsumption exhibited by the PCM model rather suggests its integration in between mainmemory and storage levels. This also holds for a similar technology such as 3D XPoint fromIntel and Micron [18]. Figure 24 summarizes this idea.

Figure 24: Memory technologies within memory hierarchy.

On the other hand, we note that our study does not consider alternative NVM integrationapproaches, such as hybrid memory design as addressed widely in existing literature [28]. Theseapproaches are complementary solution that may have an impact on the way NVM integrationis seen through the memory hierarchy depicted in Figure 24.

Finally, the NVM integration question should be considered beyond the performance andenergy consumption concerns. Indeed, the write endurance of both RRAM and PCM are still

28


lower than that provided by DRAM (see Table 1 in Section 2). This has important implicationson their reliability when used for write-intensive workloads.

29


7 Concluding Remarks and Insights

In this deliverable, we carried out different evaluations of non-volatile memory (NVM) tech-nologies in order to analyze their impact on both system performance and main memory energysystem. This study follows our preliminary work presented in deliverable D3.2 [12]. An impor-tant motivation here is to quantify as much as possible the impact of NVMs in explored designscenarios. Starting from the PCM and RRAM model templates provided by the NVMain simu-lation framework, we enhanced them through some extrapolation relying on existing literature.The NVM integration has been evaluated with the gem5 cycle-approximate simulator, whileconsidering a heterogeneous ARMv8 multicore system model (such systems are envisioned asgood candidates for compute node design in HPC domain [34]). A subset of representativeParsec benchmarks and two typical mini-applications from HPC domain have been used asevaluation workloads.

The different results obtained throughout the present study showed NVM can relevantlyimprove the main memory energy consumption, at the cost of limited penalty on performance.This seems to be particularly promising with the RRAM technology, while PCM may incurnon negligible performance. The PCM technology rather appears as a in-between good candi-date w.r.t. faster main memories and slower storage-class memories (this observation also holdsfor the recent 3D XPoint technology from Intel). Nevertheless, our exploration of a few opendirections regarding NVM exploitation in systems at a wider scope, shows that further oppor-tunities for better leveraging these technologies: i) memory architecture design adaptation, e.g.,by considering multi-channel memory technology in presence of NVMs; and ii) more suitableprogramming models together with their runtime systems for performance improvements, e.g.the OmpSs programming in combination with NVMs shows notable performance gains overOpenMP3.0, leading to better energy-efficiency. All these promising directions promote NVMsas serious candidates for integration in the memory hierarchy of energy-efficient compute sys-tems. Their announced cost-effectiveness6 compared to concurrent technologies (e.g., DRAM,NAND Flash) makes this perspective credible.

Finally, beyond the interesting observations made possible by the current study, some of thegained insights also point out the need, in the future, for detailed NVM technology data-sheetsand advanced modeling and simulation tools capable of taking into account more accuratelythe features of recent memory technologies, in order to conduct fine-grained design analyses.

6https://www.computerworld.com/article/3194147/data-storage/faq-3d-xpoint-memory-nand-flash-killer-or-dram-replacement.html

30


Acronyms and Abbreviations

• DDR: Double Data Rate.

• DRAM: Dynamic Random Access Memory.

• EDP: Energy-Delay Product.

• EtoS: Energy-to-Solution.

• MRAM: Magnetoresistive Random Access Memory.

• NVM: Non Volatile Memory.

• NVMain: Non Volatile Main memory simulator.

• NVSim: Non Volatile memory Simulator.

• PCM: Phase Change Memory.

• RAM: Random Access Memory.

• RRAM: Resistive Random Access Memory.

• STT-RAM: Spin Transfer Torque Random Access Memory.

31


A Memory Model Parameters in NVMain

This appendix provides the different memory configuration parameters of NVMain used tomodel the memory technologies considered in this document. Tables 5 and 6 first summarizethe parameters. Then, Tables 8 and 7 show the values assigned to those parameters for modelingthe DDR4, RRAM and PCM memory models.

parameters commentsenergy /power

UselowPower

PowerDownModeEnergyModel “current” or “joules”Ewrpb Subarray write energy per bitErd Read energy from a single matEopenrdEwr Write energyErefEleak energy leaked in 1 secondEactst dbyEprest dbyEpdaEpdpfEpdpsVoltage memory supply voltageEIDD0 all values below in mA taken from datasheetEIDD1EIDD2P0EIDD2P1EIDD2NEIDD2NTEIDD3PEIDD3NEIDD4REIDD4WEIDD5BEIDD6

memorycon-troller

MEM CTL Memory controller policy, e.g., FCFS, FRFCFS, FRFCFS-WQF

CTL DUMP Memory request trace dumpClosePage close-page row buffer management policyScheduleScheme command scheduling schemeAddressMappingScheme SA=Sub array; R=Row; C=Column; BK=Bank; RK=Rank;

CH=ChannelINTERCONNECT Interconnect between controller and memory chipsReadQueueSize read queue sizeWriteQueueSize write queue sizeHighWaterMark write drain high watermark, write drain is triggered if it is reachedLowWaterMark write drain low watermark, write drain is stopped if it is reached

endurancemodel

EnduranceModel specifies the endurance model to use in the memory system

EnduranceDist specifies the probability distribution to use when determining endurancefor memory cells – used for NVMs

EnduranceDistMean endurance parameter – used for NVMsEnduranceDistVariance endurance parameter – used for NVMsFlipNWriteGranularity endurance parameter – used for NVMs

Table 5: Overview of NVMain memory model parameters: energy/power parameters, memorycontroller and memory endurance model.

32


parameters commentsinterface CLK Frequency of the interconnect to the memory (MHz)

MULT Clock rate Multiplier (the core run at MULT*CLK)Rate Number of data outputs per clock cycleBuswidth Number of data in a rank (64=JEDEC standard)Device width bits provided by each device in a rankBPC Bits per clock cycleCPUFreq frequency at which the memory controller will run

memorysystem

Banks Banks in each memory device (per rank)

Ranks Ranks per channelChannels channels in the memory subsystemRows wordlines in each bankCOLS logical columns in each bank (bitlines/bits provided in a rank)MatHeight Matrix HeightRbsize Size of rows buffer in bytes ((RBSize)/(device width*burst length)) COLSUseRefresh BooleanBanksperRefreshRefresh Rows number of rows to refresh for each refresh commandDelayed refresh threshold number of refresh that can be delayed (from 1 to 8)

memorytiming

tBURST length of data burst

tCMd command spends on a bustRAS restore data read from memory cells back to memory cells in the case the

data is losttRCD row activationtAL posted commandstCCD Delay of the command decodertCWD delay to switch the two-way circuitry from output to inputtWTR time between a write and a read commandtWR recover from a write command before a precharge can be issuedtWP Write pulsetRP precharge the bitlines to the voltage required by the sense amplifierstCAS data to be sensed by the sense amplifiers and output to the interconnecttRTRS small delay to allow transmission line buses to discharge when switching

between different bus driverstRTP time between a read and precharge commandtRFC refresh command to completetOST switch the on-die terminationtRRDR minimum time between activations when no write has occurred in the

previous activationtRRDW minimum time between activations when a write occurred in the previous

activationRAWtRAW minimum of time for the value of the RAW parameter activations to occurtRDPDEN read to powerdown entry with no precharge issued (precharge powerdown)tWRPDEN write to powerdown entry with no precharge issued (active powerdown)tWRAPDEN write to powerdown entry with no precharge issued (precharge power-

down)tPD powerdown any powerdown modetXP powerup time from powerdown fast exittXPDLL powerup time from precharge powerdown slow exittXStXSDLLtREFW maximum amount of time between refreshes of the same row

Table 6: Overview of NVMain memory model parameters: interface properties, memory systemand timing parameters.

33


parameters DDR4 Micron PCM RRAMinterface CLK 1333 400 400

MULT 4 8 8Rate 2 2 2Buswidth 64 64 64Device width 8 8 8BPC 8 8 8CPUFreq 2000 2000 2000

memorysystem

Banks 4 1 4

Ranks 2 2 2Channels 2 2 2Rows 131072 16384 8192COLS 32 1024 512MatHeight 131072 16384 8096Rbsize - 4 8UseRefresh true false falseBanksperRefresh 4 1 4Refresh Rows 16 4 4Delayed refreshthreshold

8 1 1

memorytiming

tBURST 4 4 4

tCMd 1 1 1tRAS 42 0 0tRCD 19 48 10tAL 0 0 0tCCD 4 2 2tCWD 7 4 4tWTR 5 3 3tWR 210 0 0tWP - 60 4tRP 19 1 1tCAS 10 1 6tRTRS 1 1 1tRTP 5 3 3tRFC 107 100 100tOST 1 0 0tRRDR 4 4 4tRRDW 4 4 4RAW 4 4 4tRAW 21 20 20tRDPDEN 14 5 10tWRPDEN 19 68 12tWRAPDEN 20 68 12tPD 4 1 1tXP 5 3 3tXPDLL 16 200000 200000tXS 5 - -tXSDLL 854 - -tREFW 42666667 42666667 42666667

Table 7: Configuration of DDR4 Micron (reference for comparison), PCM, RRAM memorymodels: interface properties, memory system and timing parameters.

34


parameters DDR4 Micron PCM RRAMenergy /power

UselowPower true -

PowerDownMode FASTEXIT -EnergyModel current energy energyEwrpb 2.02E-4 2.02E-4 2.02E-4Erd 3.405401 7.1513421 4.0864812Eopenrd 1.08108 2.270268 1.297296Ewr 1.023750 44.123625 24.26799375Eref 38.558533 0 0Eleak - 3120.202 3120.202Eactst dby 0.090090 - -Eprest dby 0.083333 - -Epda 0 0 0Epdpf 7.8828999999999996E-2 0 0Epdps 0 0 0Voltage 1.2 1.5 1.8EIDD0 59 - -EIDD1 76 - -EIDD2P0 22 - -EIDD2P1 22 - -EIDD2N 42 - -EIDD2NT 54 - -EIDD3P 33 - -EIDD3N 58 - -EIDD4R 145 - -EIDD4W 140 - -EIDD5B 66 - -EIDD6 25 - -

memorycon-troller

MEM CTL FRFCFS FRFCFS-WQF FRFCFS

CTL DUMP false - -ClosePage 0 0 0ScheduleScheme 2 2 2AddressMapping-Scheme

SA:R:RK:BK:CH:C R:RK:BK:CH:C R:RK:BK:CH:C

INTERCONNECT OffChipBus OffChipBus OffChipBusReadQueueSize 32 32 32WriteQueueSize 32 32 32HighWaterMark 32 32 32LowWaterMark 16 16 16

endurancemodel

EnduranceModel NullModel NullModel NullModel

EnduranceDist Normal Normal NormalEnduranceDist-Mean

1000000 1000000 1000000

EnduranceDist-Variance

100000 100000 100000

FlipNWrite Granu-larity

32 - -

Table 8: Configuration of DDR4 Micron (reference for comparison), PCM, RRAM memorymodels: energy/power parameters, memory controller and memory endurance model.

35


References

[1] Hiroyuki Akinaga and Hisashi Shima. Resistive random access memory (reram) based onmetal oxides. Proceedings of the IEEE, 98(12):2237–2251, 2010.

[2] Ishwar Bhati, Mu-Tien Chang, Zeshan Chishti, Shih-Lien Lu, and Bruce Jacob. Dram re-fresh mechanisms, penalties, and trade-offs. IEEE Trans. Comput., 65(1):108–121, January2016.

[3] C. Bienia, S. Kumar, J. P. Singh, and K. Li. The parsec benchmark suite: Characterizationand architectural implications. In 2008 International Conference on Parallel Architecturesand Compilation Techniques (PACT), pages 72–81, Oct 2008.

[4] Jalil Boukhobza, Stephane Rubini, Renhai Chen, and Zili Shao. Emerging nvm: A surveyon architectural integration and research challenges. ACM Trans. Des. Autom. Electron.Syst., 23(2):14:1–14:32, November 2017.

[5] Anastasiia Butko, Florent Bruguier, Abdoulaye Gamatie, Gilles Sassatelli, David Novo,Lionel Torres, and Michel Robert. Full-system simulation of big.little multicore architec-ture for performance and energy exploration. In MCSoC, pages 201–208. IEEE ComputerSociety, 2016.

[6] Youngdon Choi, Ickhyun Song, Mu-Hui Park, Hoeju Chung, Sanghoan Chang, BeakhyoungCho, Jinyoung Kim, Younghoon Oh, Duckmin Kwon, Jung Sunwoo, et al. A 20 nm 1.8V 8 Gb PRAM with 40 MB/s program bandwidth. In Proceedings of the InternationalSolid-State Circuits Conference, pages 46–48, 2012.

[7] Kallia Chronaki, Alejandro Rico, Rosa M. Badia, Eduard Ayguade, Jesus Labarta, andMateo Valero. Criticality-aware dynamic task scheduling for heterogeneous architectures.In Proceedings of the 29th ACM on International Conference on Supercomputing, ICS ’15,pages 329–338, New York, NY, USA, 2015. ACM.

[8] Xiangyu Dong, Cong Xu, Yuan Xie, and Norman P Jouppi. Nvsim: A circuit-level perfor-mance, energy, and area model for emerging nonvolatile memory. Computer-Aided Designof Integrated Circuits and Systems, IEEE Transactions on, 31(7):994–1007, 2012.

[9] Jack Dongarra, Michael A Heroux, and Piotr Luszczek. High-performance conjugate-gradient benchmark. Int. J. High Perform. Comput. Appl., 30(1):3–10, February 2016.

[10] Jack J. Dongarra, Piotr Luszczek, and Antoine Petitet. The linpack benchmark: Past,present, and future. concurrency and computation: Practice and experience. Concurrencyand Computation: Practice and Experience, 15:2003, 2003.

[11] Alejandro Duran, Eduard Ayguade, Rosa M. Badia, Jesus Labarta, Luis Martinell, XavierMartorell, and Judit Planas. Ompss: a proposal for programming heterogeneous multi-corearchitectures. Parallel Processing Letters, 21(2):173–193, 2011.

[12] Abdoulaye Gamatie, Gilles Sassatelli David Novo, Lionel Torres, and Michel Robert. De-liverable D3.2 - Initial report on the use of emerging technologies in memory hierarchy.Technical report, MontBlanc 3 project, 2017.

[13] gem5. The gem5 simulator. http://www.gem5.org, 2016. [Accessed: February-2017].

36

http://www.gem5.org


[14] EE Herald. UMC is a foundry partner for Panasonic in making Resistive RAM.http://www.eeherald.com/section/news/owns20170202003.html. [Accessed:February-2017].

[15] H Horii, JH Yi, JH Park, YH Ha, IG Baek, SO Park, YN Hwang, SH Lee, YT Kim,KH Lee, et al. A novel cell technology using n-doped GeSbTe films for phase change RAM.In Proceedings of the Symposium on VLSI Technology, pages 177–178, 2003.

[16] R. D. Hornung, J. A. Keasler, and M. B. Gokhale. Hydrodynamics challenge problem.Technical report, LLNL-TR-490254, Lawrence Livermore National Laboratory, 2011.

[17] HowChip. Exsom board, 2016. http://howchip.com/ExSOM7420SB.php.

[18] Intel. Intel and Micron Produce Breakthrough Memory Tech-nology. https://newsroom.intel.com/news-releases/intel-and-micron-produce-breakthrough-memory-technology/. [Accessed:February-2017].

[19] Akifumi Kawahara, Ryotaro Azuma, Yuuichirou Ikeda, Ken Kawai, Yoshikazu Katoh,Kouhei Tanabe, Toshihiro Nakamura, Yoshihiko Sumimoto, Naoki Yamada, NobuyukiNakai, et al. An 8 Mb multi-layered cross-point ReRAM macro with 443 MB/s writethroughput. In Proceedings of the International Solid-State Circuits Conference, pages432–434, 2012.

[20] Manu Komalan, Oh Hyung Rock, Matthias Hartmann, Sushil Sakhare, Christian Tenllado,Jose Ignacio Gomez, Gouri Sankar Kar, Arnaud Furnemont, Francky Catthoor, SophianeSenni, David Novo, Abdoulaye Gamatie, and Lionel Torres. Main memory organizationtrade-offs with DRAM and STT-MRAM options based on gem5-nvmain simulation frame-works. In DATE, pages 103–108. IEEE, 2018.

[21] M. H. Kryder and Chang soo Kim. After hard driveswhat comes next? IEEE Transactionson Magnetics, 45:3406–3413, 2009.

[22] Benjamin C. Lee, Engin Ipek, Onur Mutlu, and Doug Burger. Architecting phase changememory as a scalable dram alternative. SIGARCH Comput. Archit. News, 37(3):2–13, June2009.

[23] Charles Lefurgy, Karthick Rajamani, Freeman Rawson, Wes Felter, Michael Kistler, andTom W. Keller. Energy management for commercial servers. Computer, 36(12):39–48,December 2003.

[24] Kevin Lim, Jichuan Chang, Trevor Mudge, Parthasarathy Ranganathan, Steven K. Rein-hardt, and Thomas F. Wenisch. Disaggregated memory for expansion and sharing in bladeservers. SIGARCH Comput. Archit. News, 37(3):267–278, June 2009.

[25] Larry W McVoy, Carl Staelin, et al. lmbench: Portable Tools for Performance Analysis.In USENIX annual technical conference, pages 279–294. San Diego, CA, USA, 1996.

[26] Micron. Ddr4 sdram rdimm - mta18asf2g72pdz 16gb. https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0ahUKEwif55DcxtDbAhUKVxQKHbsiBaAQFggoMAA&url=https%3A%2F%2Fwww.micron.com%2F˜%2Fmedia%2Fdocuments%2Fproducts%2Fdata-sheet%2Fmodules%2Frdimm%2Fddr4%2Fasf18c2gx72pdz.pdf&usg=AOvVaw1YvGKpxb5Aok0KYGnbhwrE, 2015. Online; accessed 13th June 2018.

37

http://www.eeherald.com/section/news/owns20170202003.html

http://howchip.com/ExSOM7420SB.php

https://newsroom.intel.com/news-releases/intel-and-micron-produce-breakthrough-memory-technology/

https://newsroom.intel.com/news-releases/intel-and-micron-produce-breakthrough-memory-technology/

https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0ahUKEwif55DcxtDbAhUKVxQKHbsiBaAQFggoMAA&url=https%3A%2F%2Fwww.micron.com%2F~%2Fmedia%2Fdocuments%2Fproducts%2Fdata-sheet%2Fmodules%2Frdimm%2Fddr4%2Fasf18c2gx72pdz.pdf&usg=AOvVaw1YvGKpxb5Aok0KYGnbhwrE







[27] Sparsh Mittal, Rujia Wang, and Jeffrey Vetter. Destiny: A comprehensive tool with 3dand multi-level cell memory modeling capability. Journal of Low Power Electronics andApplications, 7(3), 2017.

[28] Onur Mutlu. Opportunities and challenges of emerging memory technologies,2017. ARM Research Summit - https://people.inf.ethz.ch/omutlu/pub/onur-OpportunitiesAndChallengesOfEmergingMemoryTechnologies-ARMResearchSummit-September-11-2017-unrolled.pdf.

[29] Onur Mutlu and Lavanya Subramanian. Research problems and opportunities in memorysystems. Supercomput. Front. Innov.: Int. J., 1(3):19–55, October 2014.

[30] RG Neale, DL Nelson, and Gordon E Moore. Nonvolatile and reprogrammable, the read-mostly memory is here. Electronics, 43(20):56–60, 1970.

[31] Hyung-Rok Oh, Woo-Yeong Cho, and Choong-keun Kwak. Data read circuit for use in asemiconductor memory and a memory thereof, January 2006. US Patent 6,982,913.

[32] Matthew Poremba, Tao Zhang, and Yuan Xie. Nvmain 2.0: A user-friendly memory simula-tor to model (non-)volatile memory systems. Computer Architecture Letters, 14(2):140–143,2015.

[33] M Stanisavljevic, H Pozidis, A Athmanathan, N Papandreou, T Mittelholzer, and E Eleft-heriou. Demonstration of reliable triple-level-cell (TLC) phase-change memory. In Pro-ceedings of the International Memory Workshop (IMW), pages 1–4, 2016.

[34] J. Wanza Weloli, S. Bilavarn, M. De Vries, S. Derradji, and C. Belleudy. Efficiency mod-eling and exploration of 64-bit arm compute nodes for exascale. Microprocess. Microsyst.,53(C):68–80, August 2017.

[35] HanBin Yoon. Row buffer locality aware caching policies for hybrid memories. In Proceed-ings of the 2012 IEEE 30th International Conference on Computer Design (ICCD 2012),ICCD ’12, pages 337–344, Washington, DC, USA, 2012. IEEE Computer Society.

38

https://people.inf.ethz.ch/omutlu/pub/onur-OpportunitiesAndChallengesOfEmergingMemoryTechnologies-ARMResearchSummit-September-11-2017-unrolled.pdf



MB3 D3.7{ Final Report on Memory Hierarchy Investigations. … · MB3 D3.7{ Final Report on Memory Hierarchy Investigations. Version 1.0 Document Information Contract Number 671697

Documents