4th National Conference on Emerging Trends in Engineering Technologies, ETET-2015
20th & 21st February 2015
Jyothy Institute of Technology, Department of ECE

EC-29

DIFFERENT APPROACHES IN ENERGY EFFICIENT CACHE MEMORY ARCHITECTURE

Dhritiman Halder, Dept. of ECE, REVA ITM, Yelahanka, Bangalore-64

ABSTRACT - Many high-performance microprocessors employ the cache write-through policy to improve performance while achieving good tolerance to soft errors in on-chip caches. However, the write-through policy also incurs a large energy overhead due to the increased accesses to lower-level caches (e.g., L2 caches) during write operations. In this paper, a new cache architecture, referred to as a way-tagged cache, is introduced to improve the energy efficiency of write-through caches. By maintaining the way tags of the L2 cache in the L1 cache during read operations, the proposed technique enables the L2 cache to work in an equivalent direct-mapping manner during write hits, which account for the majority of L2 cache accesses. This leads to significant energy reduction without performance degradation.

Index Terms - Cache, low power, write-through policy.

I. INTRODUCTION

MULTI-LEVEL on-chip cache systems have been widely adopted in high-performance microprocessors. To keep data consistent throughout the memory hierarchy, write-through and write-back policies are commonly employed. Under the write-back policy, a modified cache block is copied back to its corresponding lower-level cache only when the block is about to be replaced. Under the write-through policy, all copies of a cache block are updated immediately after the cache block is modified at the current cache, even though the block might not be evicted. As a result, the write-through policy maintains identical data copies at all levels of the cache hierarchy throughout most of their execution lifetime. This feature is important as CMOS technology is scaled into the nanometer range, where soft errors have emerged as a major reliability issue in on-chip cache systems. It has been reported that single-event multi-bit upsets are getting worse in on-chip memories. Currently, this problem is addressed at different levels of design abstraction. At the architecture level, an effective solution is to keep data consistent among different levels of the memory hierarchy to prevent the system from collapsing due to soft errors. Benefiting from immediate updates, the write-through policy is inherently tolerant to soft errors because the data at all related levels of the cache hierarchy are always kept consistent. Due to this feature, many high-performance microprocessor designs have adopted the write-through policy. While enabling better tolerance to soft errors, the write-through policy also incurs a large energy overhead. This is because, under the write-through policy, lower-level caches experience more accesses during write operations. Consider a two-level (i.e., Level-1 and Level-2) cache system, for example. If the L1 data cache implements the write-back policy, a write hit in the L1 cache does not need to access the L2 cache. In contrast, if the L1 cache is write-through, then both the L1 and L2 caches need to be accessed for every write operation. Obviously, the write-through policy incurs more write accesses in the L2 cache, which in turn increases the energy consumption of the cache system. Power dissipation is now considered one of the critical issues in cache design. Studies have shown that on-chip caches can consume about


50% of the total power in high-performance microprocessors. In this paper, a new cache architecture, referred to as a way-tagged cache, is proposed to improve the energy efficiency of write-through cache systems with minimal area overhead and no performance degradation. Consider a two-level cache hierarchy in which the L1 data cache is write-through and the L2 cache is inclusive for high performance. It is observed that all the data residing in the L1 cache have copies in the L2 cache. In addition, the locations of these copies in the L2 cache do not change until they are evicted from the L2 cache. Thus, a tag can be attached to each way in the L2 cache and sent to the L1 cache when the data is loaded into the L1 cache. By doing so, the exact locations (i.e., ways) of the L2 copies of all data in the L1 cache are known. During subsequent accesses, when there is a write hit in the L1 cache (which also initiates a write access to the L2 cache under the write-through policy), the L2 cache can be accessed in an equivalent direct-mapping manner because the way tag of the data copy in the L2 cache is available. As this operation accounts for the majority of L2 cache accesses in most applications, the energy consumption of the L2 cache can be reduced significantly.

II. RELATED WORKS

The basic idea of the horizontal cache partitioning approach is to partition the cache data memory into several segments, each of which can be powered individually. Cache sub-banking, proposed in [2], is one horizontal cache partitioning technique that partitions the data array of a cache into several banks (called cache sub-banks). Each cache sub-bank can be accessed (powered up) individually, so only the sub-bank where the requested data is located consumes power on each cache access. A basic structure for cache sub-banking is presented in the figure below.
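As an illustration of the selection logic only, the following is a minimal sketch of sub-bank selection; the bank count, field widths, and enable interface are assumptions made for this example and are not taken from the paper.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical geometry: 4 sub-banks, selected by the low bits of the set index. */
#define NUM_SUBBANKS   4
#define OFFSET_BITS    5   /* 32-byte lines (assumed) */
#define INDEX_BITS     6   /* 64 sets (assumed)       */

/* Returns the sub-bank that must be powered up for this access; all other
 * sub-banks stay disabled, which is where the energy saving comes from. */
static unsigned subbank_select(uint32_t addr)
{
    uint32_t set_index = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    return set_index % NUM_SUBBANKS;   /* simple decode, easily hidden in index decoding */
}

int main(void)
{
    uint32_t addr = 0x0000A3C4u;       /* example address */
    printf("address 0x%08X -> enable sub-bank %u only\n", (unsigned)addr, subbank_select(addr));
    return 0;
}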

Cache sub-banking saves power by eliminating unnecessary accesses. The amount of power saving depends on the number of cache sub-banks; more sub-banks save more power. One advantage of cache sub-banking over block buffering is that the effective cache hit time of a sub-banked cache can be as fast as that of a conventional performance-driven cache, since the sub-bank selection logic is usually very simple and can be easily hidden in the cache index decoding logic. Because it maintains cache performance, cache sub-banking can be very attractive to computer architects designing energy-efficient high-performance microprocessors. [2]

Bit-line segmentation offers a solution for further power savings. The internal organization of each column in the data or tag array is modified as shown in the figure below. Every column of bit cells sharing one (or more) pair of bit lines is split into independent segments, and an additional pair of common lines is run across the segments. The bit lines within each segment can be connected to or isolated from these common lines. The metal layer used for clock distribution can implement this common line, since the clock does not need to be routed across the bit-cell array. Before a readout, all segments are connected to the common lines, which are precharged as usual. In the meantime,


the address decoder identifies the segment targeted by the row address issued to the array and isolates all but the targeted segment from the common bit line. This reduces the effective capacitive loading (due to the diffusion capacitances of the pass transistors) on the common line. The reduction is somewhat offset by the additional capacitance of the common line that spans a single segment and by the diffusion capacitances of the isolating switches. The common line is then sensed. Because of the reduced loading on the common line, the energy discharged during a readout or spent in a write is small. Thus, smaller drivers, precharging transistors, and sense amplifiers can be used. [3]

The figure above depicts the architecture of the base cache of [4]. The memory address is split into a line-offset field, an index field, and a tag field. For this base cache, those fields are 5, 6, and 21 bits, respectively, assuming a 32-bit address. Being four-way set-associative, the cache contains four tag arrays and four data arrays. During an access, the cache decodes the address's index field to simultaneously read out the appropriate tag from each of the four tag arrays, while also decoding the index field to simultaneously read out the appropriate data from the four data arrays. The cache feeds the decoded lines through two inverters to strengthen their signals. The read tags and data items pass through sense amplifiers. The cache simultaneously compares the four tags with the address's tag field; if one tag matches, a multiplexor routes the corresponding data to the cache output. [4]
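For concreteness, here is a minimal sketch of that 5/6/21-bit address split; the function and constant names are invented for the example.

#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 5    /* line-offset field (32-byte lines) */
#define INDEX_BITS  6    /* index field (64 sets)             */
/* the remaining 21 bits of a 32-bit address form the tag     */

static void split_address(uint32_t addr,
                          uint32_t *offset, uint32_t *index, uint32_t *tag)
{
    *offset = addr & ((1u << OFFSET_BITS) - 1);
    *index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    *tag    = addr >> (OFFSET_BITS + INDEX_BITS);
}

int main(void)
{
    uint32_t off, idx, tag;
    split_address(0xDEADBEEFu, &off, &idx, &tag);
    /* the tag is compared against all four ways; the index selects the set */
    printf("offset=%u index=%u tag=0x%X\n", (unsigned)off, (unsigned)idx, (unsigned)tag);
    return 0;
}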

The energy consumption of a set-associative cache tends to be higher than that of a direct-mapped cache, because all the ways in a set are accessed in parallel although at most one way holds the desired data. To address this energy issue, the phased cache divides the cache-access process into the following two phases, as shown below. First, all the tags in the set are examined in parallel, and no data accesses occur during this phase. Next, if there is a hit, a data access is performed for the hit way.
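A minimal sketch of the two-phase access is given below; the data structures are illustrative assumptions, not the referenced design.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WAYS 4

struct cache_set {
    uint32_t tag[WAYS];
    bool     valid[WAYS];
    uint8_t  data[WAYS][32];   /* 32-byte lines (assumed) */
};

/* Phase 1: compare all tags; phase 2: read data from the hit way only.
 * Returns a pointer to the line data, or NULL on a miss. */
static const uint8_t *phased_access(const struct cache_set *set, uint32_t tag)
{
    int hit_way = -1;
    for (int w = 0; w < WAYS; w++)            /* phase 1: tag-only lookup */
        if (set->valid[w] && set->tag[w] == tag)
            hit_way = w;

    if (hit_way < 0)
        return NULL;                          /* miss: no data array was read */

    return set->data[hit_way];                /* phase 2: single data-array read */
}

The energy saving comes from the second phase touching only one data way, at the cost of an extra access cycle.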

The way-predicting cache speculatively chooses one way before starting the normal cache-access process and then accesses the predicted way, as shown below.


Fig. (a)

If the prediction is correct, the cache access completes successfully. Otherwise, the cache then searches the remaining ways, as shown below.

Fig. (b)

On a prediction hit, shown in Fig. (a), the way-predicting cache consumes only the energy for activating the predicted way, and the cache access can be completed in one cycle. On prediction misses (or cache misses), however, the cache-access time of the way-predicting cache increases due to the successive processing of the two phases, as shown in Fig. (b). Since all the remaining ways are then activated in the same manner as in a conventional set-associative cache, the way-predicting cache cannot reduce energy consumption in this scenario. The performance and energy efficiency of the way-predicting cache therefore depend largely on the accuracy of the way prediction.

In this approach, an MRU algorithm is used. The MRU information for each set, a two-bit flag, is used to speculatively choose one way from the corresponding set. These two-bit flags are stored in a table accessed by the set-index address. Reading the MRU information before starting the cache access might lengthen the cache access time; however, this can be hidden by calculating the set-index address at an earlier pipeline stage. In addition, way prediction helps reduce cache access time by eliminating the delay for way selection. It is therefore assumed that the cache-access time on a prediction hit of the way-predicting cache is the same as that of a conventional set-associative cache. [5]
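The following is a minimal sketch of MRU-based way prediction for a 4-way set (a 2-bit flag per set); the table layout and update policy shown are assumptions for illustration.

#include <stdbool.h>
#include <stdint.h>

#define WAYS 4
#define SETS 64

struct set {
    uint32_t tag[WAYS];
    bool     valid[WAYS];
};

static uint8_t mru_way[SETS];        /* 2-bit MRU flag per set, indexed by set index */

/* Returns the way that hit, or -1 on a cache miss. *first_probe_hit reports
 * whether the predicted (MRU) way was correct, i.e., whether only one way
 * had to be activated. */
static int mru_predicted_lookup(const struct set *sets, uint32_t set_idx,
                                uint32_t tag, bool *first_probe_hit)
{
    uint8_t guess = mru_way[set_idx];                 /* speculative choice */
    if (sets[set_idx].valid[guess] && sets[set_idx].tag[guess] == tag) {
        *first_probe_hit = true;                      /* one way activated */
        return guess;
    }
    *first_probe_hit = false;                         /* fall back: probe the rest */
    for (int w = 0; w < WAYS; w++) {
        if (w == guess) continue;
        if (sets[set_idx].valid[w] && sets[set_idx].tag[w] == tag) {
            mru_way[set_idx] = (uint8_t)w;            /* update the MRU flag */
            return w;
        }
    }
    return -1;                                        /* miss */
}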

Another approach uses a two-phase associative cache: access all tags to determine the correct way in the first phase, and then access only a single data item from the matching way in the second phase. Although this approach has been proposed to reduce primary cache energy, it is better suited to secondary cache designs because of the performance penalty of an extra cycle in cache access time. A higher-performance alternative to a phased primary cache is to use CAM (content-addressable memory) to hold tags. CAM tags have been used in a number of low-power processors, including the StrongARM and XScale. Although they add roughly 10% to total cache area, CAMs perform tag checks for all ways and read out only the matching data in one cycle. Moreover, a 32-way associative cache with CAM tags has roughly the same hit energy as a two-way set-associative cache with RAM tags, but a higher hit rate. Even so, a CAM tag lookup still adds considerable energy overhead to the simple RAM fetch of one instruction word. Way prediction can also reduce the cost of tag accesses by using a way-prediction table and accessing only the tag and data from the predicted way.


Correct prediction avoids the cost of reading tags and data from the incorrect ways, but a misprediction requires an extra cycle to perform tag comparisons for all ways. This scheme has been used in commercial high-performance designs to add associativity to off-chip secondary caches and to on-chip primary instruction caches to reduce cache hit latencies in superscalar processors, and it has been proposed to reduce the access energy in low-power microprocessors. Since way prediction is a speculative technique, it still requires fetching one tag and comparing it against the current PC to check whether the prediction was correct. Though it has never been examined, way prediction can also be applied to CAM-tagged caches. However, because of the speculative nature of way prediction, a tag still needs to be read out and compared. Also, on a misprediction, the entire access needs to be restarted; there is no work that can be salvaged. Thus, twice the number of words are read out of the cache.

An alternative to way prediction is way memoization. Way memoization stores tag-lookup results (links) within the instruction cache in a manner similar to some way-prediction schemes. However, way memoization also associates a valid bit with each link. These valid bits indicate, prior to the instruction access, whether the link is correct. This is in contrast to way prediction, where the access must be verified afterward. This is the crucial difference between the two schemes, and it allows way memoization to work better in CAM-tagged caches. If the link is valid, the link is simply followed to fetch the next instruction and no tag checks are performed. Otherwise, the cache falls back on a regular tag search to find the location of the next instruction and updates the link for future use. The main complexity in this technique is caused by the need to invalidate all links to a line when that line is evicted; the coherence of all the links is maintained through an invalidation scheme. Way memoization is orthogonal to, and can be used in conjunction with, other cache energy reduction techniques such as sub-banking, block buffering, and the filter cache. Another approach to removing instruction cache tag-lookup energy is the L-cache; however, it is only applicable to loops and requires compiler support.

The way-memoizing instruction cache keeps links within the cache. These links allow an instruction fetch to bypass the tag array and read out words directly from the instruction array. Valid bits indicate whether the cache should use the direct access method or fall back to the normal access method; these valid bits are the key to maintaining the coherence of the way-memoizing cache. When a valid link is encountered, the link is followed to obtain the cache address of the next instruction, completely avoiding tag checks. When an invalid link is encountered, the cache falls back to a regular tag search to find the target instruction and updates the link; future instruction fetches reuse the valid link. Way memoization can be applied to a conventional cache, a phased cache, or a CAM-tag cache. On a correct way prediction, the way-predicting cache performs one tag lookup and reads one word, whereas the way-memoizing cache does no tag lookup and reads out only one word. On a way misprediction, the way-predicting cache is as power-hungry as the conventional cache and as slow as the phased cache; thus it can be worse than the normal non-predicting caches. The way-memoizing cache, in contrast, merely becomes one of the three normal non-predicting caches in the worst case. The most important difference is that the way-memoization technique can be applied to CAM-tagged caches. [6]
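A minimal sketch of the link-following idea is shown below; the line format, link encoding, and geometry are illustrative assumptions, not the referenced design.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WAYS 4
#define SETS 128   /* 7 index bits (assumed) */

struct iline {
    uint32_t tag;
    bool     valid;
    /* memoized link to the line holding the next instruction */
    bool     link_valid;
    uint16_t link_set;
    uint8_t  link_way;
    uint8_t  words[32];        /* 32-byte line (assumed) */
};

static struct iline icache[SETS][WAYS];

/* Fetch the cache line holding 'pc'. If the previous line carries a valid link,
 * follow it and perform no tag check at all; otherwise fall back to a regular
 * tag search and memoize the result in the previous line's link fields. */
static struct iline *fetch_line(struct iline *prev, uint32_t pc)
{
    if (prev && prev->link_valid)
        return &icache[prev->link_set][prev->link_way];

    uint32_t set = (pc >> 5) % SETS;
    uint32_t tag = pc >> 12;                  /* 5 offset bits + 7 index bits */

    for (uint32_t w = 0; w < WAYS; w++) {
        struct iline *l = &icache[set][w];
        if (l->valid && l->tag == tag) {
            if (prev) {                       /* memoize for future fetches */
                prev->link_valid = true;
                prev->link_set   = (uint16_t)set;
                prev->link_way   = (uint8_t)w;
            }
            return l;
        }
    }
    /* Miss: refill not shown. Evicting a line must invalidate every link that
     * points to it, which is the main source of complexity in this scheme. */
    return NULL;
}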

A more recent way-memoization technique eliminates redundant tag and way accesses to reduce power consumption. The basic idea is to keep a small number of Most Recently Used (MRU) addresses in a Memory Address Buffer (MAB) and to omit the redundant tag and way accesses when there is a MAB hit.


The MAB is accessed in parallel with the adder used for address generation, so the technique does not increase the delay of the circuit. Furthermore, this approach does not require modifying the cache architecture, which is considered an important advantage in industry because it makes it possible to use the processor core with previously designed caches or IPs provided by other vendors. The base address and the displacement for load and store operations usually take a small number of distinct values; therefore, the hit rate of the MAB can be kept high while storing only a small number of most recently used tags. Assume the bit width of the tag memory, the number of sets in the cache, and the cache line size are 18, 512, and 32 bytes, respectively; the widths of the set-index and offset fields are then 9 and 5 bits. Since most displacement values (according to the authors' experiments, more than 99%) are less than 2^14, tag values can be calculated without full address generation. This can be done by checking the upper 18 bits of the base address, the sign extension of the displacement, and the carry bit of a 14-bit adder that adds the low 14 bits of the base address and the displacement. Therefore, the delay of the added circuit is the sum of the delay of the 14-bit adder and the delay of accessing the set-index table, and the reported experiments show this delay is smaller than the delay of the 32-bit adder used to calculate the address. The technique therefore has no delay penalty. Note that if the displacement value is greater than or equal to 2^14 or less than -2^14, there will be a MAB miss, but the chance of this happening is less than 1%.
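A minimal sketch of that tag computation is shown below, for a 32-bit address with an 18-bit tag field and a signed 16-bit displacement; only the fast path with |disp| < 2^14 is handled, as in the text, and the helper names are invented for the example.

#include <stdbool.h>
#include <stdint.h>

/* Compute the tag bits of (base + disp) without a full 32-bit add, assuming
 * |disp| < 2^14. Only the upper 18 bits of the base, the sign of the
 * displacement, and the carry out of a 14-bit add of the low bits are needed. */
static bool fast_tag(uint32_t base, int16_t disp, uint32_t *tag_out)
{
    if (disp >= (1 << 14) || disp < -(1 << 14))
        return false;                                 /* MAB-miss path (<1% of cases) */

    uint32_t low14   = base & 0x3FFF;                 /* low 14 bits of the base       */
    uint32_t sum14   = low14 + ((uint32_t)disp & 0x3FFF);
    uint32_t carry   = (sum14 >> 14) & 1;             /* carry out of the 14-bit add   */
    int32_t  upper18 = (int32_t)(base >> 14);         /* upper 18 bits of the base     */
    int32_t  signext = (disp < 0) ? -1 : 0;           /* sign extension of the offset  */

    *tag_out = (uint32_t)(upper18 + signext + (int32_t)carry) & 0x3FFFF;
    return true;                                      /* tag ready before the full add */
}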

To eliminate redundant tag and way accesses for inter-cache-line flows, a MAB can also be used for the instruction cache. Unlike the MAB used for the D-cache, the inputs of the MAB used for the I-cache can be one of the following three types: 1) an address stored in a link register, 2) a base address (i.e., the current program counter address) and a displacement value (i.e., a branch offset), and 3) the current program counter address and its stride. In the case of inter-cache-line sequential flow, the current program counter address and the stride of the program counter are chosen as the inputs of the MAB, with the stride treated as the displacement value. If the current operation is a "branch (or jump) to the link target", the address in the link register is selected as the input of the MAB, as shown in the figure below. Otherwise, the base address and the displacement are used, as done for the data cache. [7]

Another technique is a new cache architecture called the location cache; the figure below illustrates its structure.


The location cache is a small, virtually indexed, direct-mapped cache. It caches the location information (the way number within a set that a memory reference falls into) and works in parallel with the TLB and the L1 cache. On an L1 cache miss, the physical address translated by the TLB and the way information of the reference are both presented to the L2 cache, which is then accessed as a direct-mapped cache. If there is a miss in the location cache, the L2 cache is instead accessed as a conventional set-associative cache. As opposed to way-prediction information, the cached location is not a prediction; thus, when there is a hit, both time and power are saved, and even on a miss there is no extra delay penalty as seen in way-prediction caches. Caching the position, unlike caching the data itself, does not cause coherence problems in multi-processor systems: although the snooping mechanism may modify the data stored in the L2 cache, the location does not change. Also, even if a cache line is replaced in the L2 cache, the way information stored in the location cache will not generate a fault. One interesting issue arises here: for which references should locations be cached? The location cache should capture the references that turn out to be L1 misses. A recency-based strategy is not suitable because recent accesses to the L2 cache are very likely to be cached in the L1 cache. The equation below defines the optimal coverage of the location cache:

Optimal coverage = L2 coverage - L1 coverage

As the indexing rules of the L1 and L2 caches are different, this optimal coverage is not reachable. Fortunately, memory locations are usually referenced in sequences or strides. Whenever a reference to the L2 cache is generated, the location of the next cache line is calculated and fed into the location cache. The proposed cache system works in the following way. The location cache is accessed in parallel with the L1 caches. If the L1 cache sees a hit, the result from the location cache is discarded. If there is a miss in the L1 cache and a hit in the location cache, the L2 cache is accessed as a direct-mapped cache. If both the L1 cache and the location cache miss, the L2 cache is accessed as a traditional L2 cache. The tags of the L2 cache are duplicated; the duplicated tag arrays are called location tag arrays. When the L2 cache is accessed, the location tag arrays are accessed to generate the location information for the next memory reference, and the generated location information is then sent to and stored in the location cache.
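A minimal sketch of that access flow is given below; the structure layouts and the placeholder lookup helpers are assumptions made for the illustration.

#include <stdbool.h>
#include <stdint.h>

#define LOC_ENTRIES 1024
#define L2_WAYS     8

struct loc_entry { bool valid; uint32_t line_addr; uint8_t way; };

static struct loc_entry loc_cache[LOC_ENTRIES];

/* Placeholder lookups standing in for the real cache arrays. */
static bool l1_lookup(uint32_t addr)           { (void)addr; return false; }
static bool l2_probe_way(uint32_t addr, int w) { (void)addr; (void)w; return true; }

struct l2_access { bool hit; int ways_activated; };

static struct l2_access memory_access(uint32_t addr)
{
    struct l2_access r = { true, 0 };
    uint32_t line = addr >> 7;                       /* 128-byte L2 lines (assumed) */
    struct loc_entry *e = &loc_cache[line % LOC_ENTRIES];

    /* The location cache is probed in parallel with L1; on an L1 hit its
     * result is simply discarded. */
    if (l1_lookup(addr))
        return r;                                    /* L1 hit: no L2 access */

    if (e->valid && e->line_addr == line) {          /* location-cache hit */
        r.ways_activated = 1;                        /* direct-mapped L2 access */
        r.hit = l2_probe_way(addr, e->way);
        return r;
    }

    /* Both missed: conventional set-associative access, all ways activated. */
    r.ways_activated = L2_WAYS;
    r.hit = false;
    for (int w = 0; w < L2_WAYS; w++)
        if (l2_probe_way(addr, w)) { r.hit = true; break; }
    return r;
}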

In the reported evaluation, the L1 cache is a 16 KB 4-way set-associative cache with a 64-byte cache line, implemented in a 0.13 μm technology, and the results were produced using the CACTI 3.2 simulator. The access delay of a 16 KB direct-mapped cache was chosen as the baseline, which is the best-case delay when a way-prediction mechanism is implemented in the L1 cache, and this baseline delay was normalized to 1. It is observed that a location cache with up to 1024 entries has a shorter access latency than the L1 cache. Though the organization of the location cache is similar to that of a direct-mapped cache, there is a small change in the indexing rule: the block offset is 7 bits, as the cache line size for the simulated L2 cache is 128 bytes, so the tag of the location cache is narrower than that of a regular cache.


Compared to a regular cache design, the modification is minor. Note that the tags (or the number of ports to the tag array) need to be doubled because, while the original tags are compared to validate the current access, a spare set of tags is compared to generate the future location information. This idea is similar to the phased cache; the difference is that the tag comparison for future references is overlapped with the existing cache reference, and the location cache stores the resulting location information. The simulated cache geometry parameters were optimized for the set-associative cache. The simulation results show that the access latency for a direct-mapped hit is 40% shorter than for a set-associative hit.

Although the extra hardware employed by the location cache design does not introduce extra delay on the memory-reference critical path, it does introduce extra power consumption, which comes from the small location cache and the duplicated tag arrays. With the power consumption for the tag access of a direct-mapped hit normalized to one, the location cache consumes a small amount of power compared to the L2 cache. However, as the location cache is triggered much more often than the L2 cache, its power consumption cannot be ignored. The total chip area of the proposed location cache system (with duplicated tags and a location cache of 1024 entries) is only 1.39% larger than that of the original cache system. [8]

The reactive-associative (r-a) cache is formed by using the tag array of a set-associative cache with the data array of a direct-mapped cache, as shown in Figure 1.

For an n-way r-a cache, there is a single data bank, and n tag banks. The tag array is accessed using the conventional set-associative index, probing all the n-ways of the set in parallel, just as in a normal set-associative cache. The data array index uses the conventional set-associative index concatenated with a way number to locate a block in the set. The way number is log2(n) bits wide. For the first probe, it may come from either the conventional set-associative tag field’s lower-order bits (for the direct-mapped blocks), or the way-prediction mechanism (for the displaced blocks). If there is a second probe (due to a misprediction), then the matching way number is provided by the tag array. The r-a cache simultaneously accesses the tag and data arrays for the first probe, at either the direct-mapped location or a set-associative position provided by the way-prediction mechanism. If the first probe, called probe0, hits, then the access is complete and the data is returned to the processor. If probe0 fails to locate the block due to a misprediction (i.e., either the block is in a set-associative position when probe0 assumed direct-mapped access or the block is in a set-associative position different than the one supplied by way-prediction), probe0 obtains the correct way-number from the tag array if the block is in the cache, and a second probe, called probe1, is done using the correct way-number.


Probe1 probes only the data array, not the tag array. If the block is not in the cache, probe0 signals an overall miss and probe1 is not necessary. Thus there are three possible paths through the cache for a given address: (1) probe0 is predicted to be a direct-mapped access; (2) probe0 is predicted to be a set-associative access and the prediction mechanism provides the predicted way number; and (3) probe0 is mispredicted but obtains the correct way number from the tag array, and the data array is probed using the correct way number in probe1.
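The following is a minimal sketch of how the single data bank of a 4-way r-a cache could be indexed across the two probes; the array sizes and the prediction hook are assumptions for illustration.

#include <stdbool.h>
#include <stdint.h>

#define WAYS     4            /* n-way tag array      */
#define SETS     256
#define WAY_BITS 2            /* log2(n)              */

/* The single data bank is indexed by {set index, way number}. */
static inline uint32_t data_index(uint32_t set, uint32_t way)
{
    return (set << WAY_BITS) | way;
}

struct tagbank { uint32_t tag[SETS][WAYS]; bool valid[SETS][WAYS]; };

/* Data-bank index for probe0. A block assumed to sit in its direct-mapped
 * position takes its way number from the low-order bits of the conventional
 * set-associative tag field; a displaced block uses the predicted way. */
static uint32_t probe0_index(uint32_t set, uint32_t tag, bool displaced,
                             uint32_t predicted_way)
{
    uint32_t way = displaced ? predicted_way : (tag & (WAYS - 1));
    return data_index(set, way);
}

/* On a probe0 misprediction, the tag array supplies the matching way (if the
 * block is present at all) and probe1 re-indexes only the data bank. */
static int probe1_index(const struct tagbank *tb, uint32_t set, uint32_t tag)
{
    for (uint32_t w = 0; w < WAYS; w++)
        if (tb->valid[set][w] && tb->tag[set][w] == tag)
            return (int)data_index(set, w);
    return -1;                /* overall miss: probe1 is not needed */
}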

On an overall miss, the block is placed in the direct-mapped position if it is non-conflicting, and in a set-associative position (LRU, random, etc.) otherwise.

Way prediction: The r-a cache employs hardware way prediction to obtain the way number for blocks that are displaced to set-associative positions, before address computation is complete. The strict timing constraint of performing the prediction in parallel with effective-address computation requires that the prediction mechanism use information available in the pipeline earlier than the address-compute stage. The equivalent of way prediction for I-caches is often combined with branch prediction, but because D-caches do not interact with branch prediction, those techniques cannot be used directly. An alternative to prediction is to obtain the correct way number of the displaced block using the address, which delays initiating the cache access to the displaced block, as is the case for statically probed schemes such as column-associative and group-associative caches. Two handles can be used to perform way prediction: the instruction PC, and an approximate data address formed by XORing the register value with the instruction offset, which may be faster than performing a full add. These two handles represent the two extremes of the trade-off between prediction accuracy and early availability in the pipeline.

The PC is available much earlier than the XOR approximation, but the XOR approximation is more accurate because it is hard for the PC to distinguish among different data addresses touched by the same instruction. Other handles, such as instruction fields (e.g., operand register numbers), do not have significantly more information content from a prediction standpoint, and the PSA paper recommends the XOR scheme for its high accuracy. In an out-of-order processor pipeline (figure above), the instruction PC of a memory operation is available much earlier than the source register.


Therefore, way prediction can be done in parallel with the pipeline front-end processing of memory instructions, so that the predicted way number and the probe0 way-number mux select input are ready well before the data address is computed. The XOR scheme, on the other hand, needs to squeeze in an XOR operation on a value often obtained late from a register-forwarding path, followed by a prediction-table lookup to produce the predicted way number and the probe0 way-number mux select, all within the time the pipeline computes the real address using a full add. Note that the prediction table must have more entries or be more associative than the cache itself to avoid conflicts among the XORed approximate data addresses, and therefore it will probably have a significant access time, exacerbating the timing problem.

III. WAY-TAGGED CACHE

A way-tagged cache that exploits the way information in the L2 cache to improve energy efficiency is now introduced. In a conventional set-associative cache system, when the L1 data cache loads/writes data from/into the L2 cache, all ways in the L2 cache are activated simultaneously for performance considerations, at the cost of energy overhead.

The figure above illustrates the architecture of the two-level cache. Only the L1 data cache and the unified L2 cache are shown, as the L1 instruction cache only reads from the L2 cache. Under the write-through policy, the L2 cache always maintains the most recent copy of the data. Thus, whenever a data item is updated in the L1 cache, the L2 cache is updated with the same data as well. This results in an increase in the write accesses to the L2 cache and consequently more energy consumption. The locations (i.e., way tags) of L1 data copies in the L2 cache will not change until the data are evicted from the L2 cache. The proposed way-tagged cache exploits this fact to reduce the number of ways accessed during L2 cache accesses. When the L1 data cache loads a data item from the L2 cache, the way tag of the data in the L2 cache is also sent to the L1 cache and stored in a new set of way-tag arrays. These way tags provide the key information for the subsequent write accesses to the L2 cache.

In general, both write and read accesses in the L1 cache may need to access the L2 cache. These accesses lead to different operations in the proposed way-tagged cache, as summarized in Table I. Under the write-through policy, all write operations of the L1 cache need to access the L2 cache. In the case of a write hit in the L1 cache, only one way in the L2 cache is activated, because the way-tag information of the L2 cache is available, i.e., from the way-tag arrays we can obtain the L2 way of the accessed data. For a write miss in the L1 cache, the requested data is not stored in the L1 cache; as a result, its corresponding L2 way information is not available in the way-tag arrays, and all ways in the L2 cache need to be activated simultaneously. Since a write hit/miss is not known a priori, the way-tag arrays need to be accessed simultaneously with all L1 write operations in order to avoid performance degradation. The way-tag arrays are very small, and the energy overhead involved can be easily compensated for.
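As an illustration of the Table I behavior, here is a minimal sketch of the L2-side decision on an L1 write; the structures and names are assumptions for the example, not the paper's implementation.

#include <stdbool.h>
#include <stdint.h>

#define L1_SETS 128
#define L2_WAYS 8

/* One way-tag entry per L1 line: the L2 way that holds its copy. */
struct way_tag { bool valid; uint8_t l2_way; };

static struct way_tag way_tag_array[L1_SETS];

/* Write-through of one L1 store to the L2 cache.
 * Returns a bitmask of the L2 ways that must be enabled. */
static uint32_t l2_write_enable(uint32_t l1_set, bool l1_write_hit)
{
    if (l1_write_hit && way_tag_array[l1_set].valid)
        return 1u << way_tag_array[l1_set].l2_way;   /* write hit: one way only */
    return (1u << L2_WAYS) - 1;                      /* write miss: all ways, as in
                                                        a conventional SA access */
}

/* On an L1 refill from L2, the refill path records which L2 way supplied the
 * line, so later write hits can go direct-mapped. */
static void record_way_tag(uint32_t l1_set, uint8_t l2_way)
{
    way_tag_array[l1_set].valid  = true;
    way_tag_array[l1_set].l2_way = l2_way;
}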


The figure above shows the system diagram of the proposed way-tagged cache. Several new components are introduced: way-tag arrays, a way-tag buffer, a way decoder, and a way register, all shown within the dotted line. The way tags of each cache line in the L2 cache are maintained in the way-tag arrays, located with the L1 data cache. Note that write buffers are commonly employed in write-through caches (and even in many write-back caches) to improve performance. With a write buffer, the data to be written into the L1 cache is also sent to the write buffer, and the operations stored in the write buffer are then sent to the L2 cache in sequence. This avoids write stalls while the processor waits for write operations to be completed in the L2 cache. In the proposed technique, the way tags stored in the way-tag arrays also need to be sent to the L2 cache along with the operations in the write buffer. Thus, a small way-tag buffer is introduced to buffer the way tags read from the way-tag arrays. A way decoder is employed to decode the way tags and generate the enable signals for the L2 cache, which activate only the desired ways in the L2 cache. Each way in the L2 cache is encoded into a way tag, and a way register stores these way tags and provides this information to the way-tag arrays.

For L1 read operations, neither read hits nor misses need to access the way-tag arrays. This is because read hits do not need to access the L2 cache, while for read misses the corresponding way-tag information is not available in the way-tag arrays. As a result, all ways in the L2 cache are activated simultaneously under read misses. The amount of energy consumed per read and write by the conventional set-associative L2 cache and by the proposed L2 cache is shown below:
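As a rough illustrative model only, with purely symbolic per-way costs rather than the paper's measurements, the per-write energy difference can be sketched as follows.

#include <stdio.h>

/* Symbolic per-way energy costs (arbitrary illustrative units, NOT measured data). */
#define E_TAG_WAY   1.0
#define E_DATA_WAY  4.0
#define E_WAYTAG    0.1     /* reading the small way-tag array and buffer */
#define N_WAYS      8

int main(void)
{
    /* Conventional set-associative L2: every write-through activates all ways. */
    double e_conv_write = N_WAYS * (E_TAG_WAY + E_DATA_WAY);

    /* Proposed cache on an L1 write hit: one L2 way plus the way-tag overhead. */
    double e_prop_write_hit = 1 * (E_TAG_WAY + E_DATA_WAY) + E_WAYTAG;

    printf("conventional write: %.1f units, way-tagged write hit: %.1f units\n",
           e_conv_write, e_prop_write_hit);
    return 0;
}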

This cache configuration, used in the Pentium 4, will be used as a baseline system for comparison with the proposed technique under different cache configurations.

IV. CONCLUSION

This paper presents a new energy-efficient cache technique for high-performance microprocessors employing the write-through policy. The proposed technique attaches a tag to each way in the L2 cache. This way tag is sent to the way-tag arrays in the L1 cache when the data is loaded from the L2 cache into the L1 cache. Utilizing the way tags stored in the way-tag arrays, the L2 cache can be accessed as a direct-mapped cache during subsequent write hits, thereby reducing cache energy consumption. Simulation results demonstrate a significant reduction in cache energy consumption with minimal area overhead and no performance degradation. Furthermore, the idea of way tagging can be applied to many existing low-power cache techniques, such as the phased-access cache, to further reduce cache energy consumption. Future work is directed towards extending this technique to other levels of the cache hierarchy and reducing the energy consumption of other cache operations.

REFERENCES

[1] J. Dai and L. Wang, "An energy-efficient L2 cache architecture using way tag information under write-through policy," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 21, no. 1, Jan. 2013.

[2] C. Su and A. Despain, "Cache design tradeoffs for power and performance optimization: A case study," in Proc. Int. Symp. Low Power Electron. Design, 1997, pp. 63–68.

[3] K. Ghose and M. B. Kamble, "Reducing power in superscalar processor caches using subbanking, multiple line buffers and bit-line segmentation," in Proc. Int. Symp. Low Power Electron. Design, 1999, pp. 70–75.


[4] C. Zhang, F. Vahid, and W. Najjar, "A highly-configurable cache architecture for embedded systems," in Proc. Int. Symp. Comput. Arch., 2003, pp. 136–146.

[5] K. Inoue, T. Ishihara, and K. Murakami, "Way-predicting set-associative cache for high performance and low energy consumption," in Proc. Int. Symp. Low Power Electron. Design, 1999, pp. 273–275.

[6] A. Ma, M. Zhang, and K. Asanovic, "Way memoization to reduce fetch energy in instruction caches," in Proc. ISCA Workshop Complexity Effective Design, 2001, pp. 1–9.

[7] T. Ishihara and F. Fallah, "A way memoization technique for reducing power consumption of caches in application specific integrated processors," in Proc. Design Autom. Test Euro. Conf., 2005, pp. 358–363.

[8] R. Min, W. Jone, and Y. Hu, "Location cache: A low-power L2 cache system," in Proc. Int. Symp. Low Power Electron. Design, 2004, pp. 120–125.

[9] T. N. Vijaykumar, "Reactive-associative caches," in Proc. Int. Conf. Parallel Arch. Compiler Tech., 2011, p.4961.

[10] V. Vasudevan Nair, "Way-tagged L2 cache architecture in conjunction with energy efficient datum storage," ECE Department, Anna University Chennai, Sri Eshwar College of Engineering, Coimbatore, India.
