4th National Conference on Emerging Trends in Engineering Technologies, ETET-2015
20th & 21st February 2015
Jyothy Institute of Technology, Department of ECE
EC-29
DIFFERENT APPROACHES IN ENERGY EFFICIENT CACHE MEMORY ARCHITECTURE
Dhritiman Halder, Dept. of ECE, REVA ITM, Yelahanka, Bangalore-64
ABSTRACT - Many high-performance microprocessors employ the cache write-through policy to improve performance while achieving good tolerance to soft errors in on-chip caches. However, the write-through policy also incurs a large energy overhead due to the increased accesses to caches at the lower level (e.g., L2 caches) during write operations. This paper introduces a new cache architecture, referred to as a way-tagged cache, to improve the energy efficiency of write-through caches. By maintaining the way tags of the L2 cache in the L1 cache during read operations, the proposed technique enables the L2 cache to work in an equivalent direct-mapping manner during write hits, which account for the majority of L2 cache accesses. This leads to significant energy reduction without performance degradation.
Index Terms - Cache, low power, write-through policy.
I. INTRODUCTION
MULTI-LEVEL on-chip cache systems have been widely adopted in high-performance microprocessors. To keep data consistent throughout the memory hierarchy, write-through and write-back policies are commonly employed. Under the write-back policy, a modified cache block is copied back to its corresponding lower-level cache only when the block is about to be replaced. Under the write-through policy, all copies of a cache block are updated immediately after the block is modified in the current cache, even though the block might not be evicted. As a result, the write-through policy maintains identical data copies at all levels of the cache hierarchy throughout most of their execution lifetime. This feature is important as CMOS technology is scaled into the nanometer range, where soft errors have emerged as a major reliability issue in on-chip cache systems. It has been reported that single-event multi-bit upsets are getting worse in on-chip memories. Currently, this problem is addressed at different levels of design abstraction. At the architecture level, an effective solution is to keep data consistent among different levels of the memory hierarchy to prevent the system from collapsing due to soft errors. Benefiting from immediate updates, the write-through policy is inherently tolerant to soft errors because the data at all related levels of the cache hierarchy are always kept consistent. Due to this feature, many high-performance microprocessor designs have adopted the write-through policy.

While enabling better tolerance to soft errors, the write-through policy also incurs a large energy overhead, because caches at the lower level experience more accesses during write operations. Consider a two-level (i.e., Level-1 and Level-2) cache system, for example. If the L1 data cache implements the write-back policy, a write hit in the L1 cache does not need to access the L2 cache. In contrast, if the L1 cache is write-through, then both the L1 and L2 caches need to be accessed for every write operation. Obviously, the write-through policy incurs more write accesses in the L2 cache, which in turn increases the energy consumption of the cache system. Power dissipation is now considered one of the critical issues in cache design. Studies have shown that on-chip caches can consume about
50% of the total power in high-performance microprocessors. In this paper, a new cache architecture, referred to as a way-tagged cache, is proposed to improve the energy efficiency of write-through cache systems with minimal area overhead and no performance degradation. Consider a two-level cache hierarchy where the L1 data cache is write-through and the L2 cache is inclusive for high performance. All the data residing in the L1 cache have copies in the L2 cache, and the locations of these copies in the L2 cache do not change until they are evicted from the L2 cache. Thus, a tag can be attached to each way in the L2 cache and sent to the L1 cache when the data is loaded into the L1 cache. By doing so, for all the data in the L1 cache, the exact locations (i.e., ways) of their copies in the L2 cache are known. During subsequent accesses, when there is a write hit in the L1 cache (which also initiates a write access to the L2 cache under the write-through policy), the L2 cache can be accessed in an equivalent direct-mapping manner, because the way tag of the data copy in the L2 cache is available. As this operation accounts for the majority of L2 cache accesses in most applications, the energy consumption of the L2 cache can be reduced significantly.
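As a rough illustration, the way-tagged idea can be sketched in a few lines of Python. This is a toy model with made-up sizes and an activated-ways counter as an energy proxy, not the paper's hardware: the L1 learns the L2 way on a read, and later write-through hits activate only that one way.

```python
# Toy model (hypothetical sizes): an L2 whose energy proxy is the number of
# ways activated per access. A conventional access powers all ways; a
# way-tagged write-through hit powers only the remembered way.

class WayTaggedL2:
    def __init__(self, num_sets=4, num_ways=4):
        self.num_ways = num_ways
        # tags[set][way] -> block tag (or None if empty)
        self.tags = [[None] * num_ways for _ in range(num_sets)]
        self.ways_activated = 0  # energy proxy: ways powered across accesses

    def lookup_all_ways(self, set_idx, tag):
        """Conventional set-associative access: all ways activated."""
        self.ways_activated += self.num_ways
        for way, t in enumerate(self.tags[set_idx]):
            if t == tag:
                return way
        return None

    def write_one_way(self, set_idx, way, tag):
        """Direct-mapped-style access: only the tagged way activated."""
        self.ways_activated += 1
        assert self.tags[set_idx][way] == tag  # way tag must stay coherent

l2 = WayTaggedL2()
l2.tags[2][3] = 0xAB                 # block resides in set 2, way 3
way = l2.lookup_all_ways(2, 0xAB)    # L1 read miss: full lookup, learn the way
for _ in range(10):                  # subsequent write-through hits reuse it
    l2.write_one_way(2, way, 0xAB)
```

A conventional cache would activate 4 ways for all 11 accesses (44 way-activations); the way-tagged scheme here activates 4 + 10 = 14.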
II. RELATED WORKS

The basic idea of the horizontal cache partitioning approach is to partition the cache data memory into several segments, each of which can be powered individually. Cache sub-banking is one horizontal cache partitioning technique; it partitions the data array of a cache into several banks (called cache sub-banks), each of which can be accessed (powered up) individually. Only the cache sub-bank where the requested data is located consumes power in each cache access. A basic structure for cache sub-banking is presented in the figure below.

Cache sub-banking saves power by eliminating unnecessary accesses; the amount of power saved grows with the number of cache sub-banks. One advantage of cache sub-banking over block buffering is that the effective cache hit time of a sub-banked cache can be as fast as that of a conventional performance-driven cache, since the sub-bank selection logic is usually very simple and can easily be hidden in the cache index decoding logic. Because it maintains cache performance, cache sub-banking can be very attractive to computer architects designing energy-efficient high-performance microprocessors. [2]

Bit-line segmentation offers a solution for further power savings. The internal organization of each column in the data or tag array is modified as shown in the figure below. Every column of bit cells sharing one (or more) pair of bit lines is split into independent segments, and an additional pair of common lines is run across the segments. The bit lines within each segment can be connected to, or isolated from, these common lines. The metal layer used for clock distribution can implement the common line, since the clock does not need to be routed across the bit-cell array. Before a readout, all segments are connected to the common lines, which are precharged as usual. In the meantime,
the address decoder identifies the segment targeted by the row address issued to the array and isolates all but the targeted segment from the common bit line. This reduces the effective capacitive loading (due to the diffusion capacitances of the pass transistors) on the common line. The reduction is somewhat offset by the additional capacitance of the common-line span over a single segment and the diffusion capacitances of the isolating switches. The common line is then sensed. Because of the reduced loading on the common line, the energy discharged during a readout or spent in a write is small. Thus, smaller drivers, precharge transistors, and sense amplifiers can be used. [3]
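The sub-banking saving discussed above can be put in rough numbers with a toy energy model. The per-access energy and the decode overhead below are assumed for illustration, not taken from the paper:

```python
# Back-of-the-envelope model: with k sub-banks, only one bank's bit lines
# are precharged and sensed per access, so data-array dynamic energy scales
# roughly as 1/k, plus a small constant for the bank-select decode logic.

def subbank_energy(full_array_energy_pj, num_subbanks, decode_overhead_pj=0.5):
    # one active sub-bank + bank-select decode overhead (both assumed values)
    return full_array_energy_pj / num_subbanks + decode_overhead_pj

base = 100.0  # pJ per access for the monolithic data array (assumed)
savings = {k: subbank_energy(base, k) for k in (1, 2, 4, 8)}
```

The model reproduces the qualitative claim: more sub-banks, less energy per access, with diminishing returns once the fixed decode overhead dominates.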
The figure above depicts the architecture of our base cache. The memory address is split into a line-offset field, an index field, and a tag field. For our base cache, those fields are 5, 6, and 21 bits, respectively, assuming a 32-bit address. Being four-way set-associative, the cache contains four tag arrays and four data arrays. During an access, the cache decodes the address' index field to simultaneously read out the appropriate tag from each of the four tag arrays, while also decoding the index field to simultaneously read out the appropriate data from the four data arrays. The cache feeds the decoded lines through two inverters to strengthen their signals. The read tags and data items pass through sense amplifiers, and the cache simultaneously compares the four tags with the address' tag field. If one tag matches, a multiplexer routes the corresponding data to the cache output. [4]
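The 5/6/21-bit field split above is easy to check with a short sketch (the helper name and the example address are ours, for illustration only):

```python
# Field split used by the base cache: 5-bit line offset, 6-bit index,
# 21-bit tag (5 + 6 + 21 = 32 bits total).

OFFSET_BITS, INDEX_BITS, TAG_BITS = 5, 6, 21

def split_address(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

# A 5-bit offset implies 32-byte lines and a 6-bit index implies 64 sets,
# so this four-way cache holds 64 sets * 32 B * 4 ways = 8 KB of data.
tag, index, offset = split_address(0xDEADBEEF)
```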
The energy consumption of a set-associative cache tends to be higher than that of a direct-mapped cache, because all the ways in a set are accessed in parallel although at most one way holds the desired data. To address this energy issue, the phased cache divides the cache-access process into the two phases shown below.
First, all the tags in the set are examined in parallel, and no data accesses occur during this phase. Next, if there is a hit, a data access is performed for the hit way. The way-predicting cache instead speculatively chooses one way before starting the normal cache-access process, and then accesses the predicted way as shown below.
4th National Conference on Emerging Trends in Engineering Technologies, ETET-2015
20th & 21st February 2015
Jyothy Institute of Technology Department of ECE P a g e | 187
Fig-a
If the prediction is correct, the cache access completes successfully. Otherwise, the cache searches the remaining ways, as shown below:
Fig-b

On a prediction hit, shown in Figure (a), the way-predicting cache consumes only the energy for activating the predicted way, and the cache access can be completed in one cycle. On a prediction miss (or cache miss), however, the cache-access time of the way-predicting cache increases due to the successive two phases shown in Figure (b). Since all the remaining ways are then activated in the same manner as in a conventional set-associative cache, the way-predicting cache cannot reduce energy consumption in this scenario. The performance and energy efficiency of the way-predicting cache therefore largely depend on the accuracy of the way prediction.
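The trade-off between the phased and way-predicting schemes can be sketched with assumed unit energies (the constants below are illustrative, not measurements from the paper):

```python
# Toy energy comparison: a phased cache always reads N tags then 1 data way;
# a way-predicting cache reads 1 tag + 1 data way on a prediction hit, but
# pays the remaining ways' cost on a prediction miss.

E_TAG, E_DATA, N_WAYS = 1.0, 4.0, 4   # assumed per-way energies and ways

def phased_energy():
    return N_WAYS * E_TAG + E_DATA    # tags first, then one data way

def way_predicting_energy(prediction_hit):
    first = E_TAG + E_DATA            # probe only the predicted way
    if prediction_hit:
        return first
    # miss: activate the remaining ways like a conventional cache
    return first + (N_WAYS - 1) * (E_TAG + E_DATA)

conventional = N_WAYS * (E_TAG + E_DATA)   # all ways probed in parallel
```

With these numbers a prediction hit costs 5.0, a phased access 8.0, and a prediction miss degenerates to the conventional cost of 20.0, matching the qualitative argument above.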
In this approach, an MRU algorithm is introduced. The MRU information for each set, a two-bit flag, is used to speculatively choose one way from the corresponding set. These two-bit flags are stored in a table accessed by the set-index address. Reading the MRU information before starting the cache access might lengthen the cache access time; however, this can be hidden by calculating the set-index address at an earlier pipeline stage. In addition, way prediction helps reduce cache access time by eliminating the delay for way selection. We therefore assume that the cache-access time on a prediction hit of the way-predicting cache is the same as that of a conventional set-associative cache. [5]
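The MRU table described above amounts to very little state. A minimal sketch, with assumed sizes and hypothetical helper names:

```python
# MRU-based way prediction: a table indexed by the set index holds a
# two-bit flag (values 0..3 for a 4-way cache) naming the most recently
# used way of that set.

NUM_SETS = 64                 # assumed cache geometry
mru = [0] * NUM_SETS          # one two-bit flag per set

def predict_way(set_idx):
    return mru[set_idx]       # speculatively chosen way

def update_on_access(set_idx, hit_way):
    mru[set_idx] = hit_way    # the hit way becomes most recently used

update_on_access(5, 3)        # a hit in way 3 of set 5 trains the predictor
```

Because the table is indexed by the set index alone, the lookup can be issued as soon as the set index is known, which is why it can be hidden in an earlier pipeline stage.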
Another approach uses a two-phase associative cache: access all tags to determine the correct way in the first phase, and then access only a single data item from the matching way in the second phase. Although this approach has been proposed to reduce primary cache energy, it is better suited to secondary cache designs due to the performance penalty of an extra cycle in cache access time. A higher-performance alternative to a phased primary cache is to use a CAM (content-addressable memory) to hold tags. CAM tags have been used in a number of low-power processors, including the StrongARM and XScale. Although they add roughly 10% to total cache area, CAMs perform tag checks for all ways and read out only the matching data in one cycle. Moreover, a 32-way associative cache with CAM tags has roughly the same hit energy as a two-way set-associative cache with RAM tags, but has a higher hit rate. Even so, a CAM tag lookup still adds considerable energy overhead to the simple RAM fetch of one instruction word. Way prediction can also reduce the cost of tag accesses by using a way-prediction table and accessing only the tag and data from the predicted way.
Correct prediction avoids the cost of reading tags and data from incorrect ways, but a misprediction requires an extra cycle to perform tag comparisons from all ways. This scheme has been used in commercial high-performance designs to add associativity to off-chip secondary caches and to on-chip primary instruction caches to reduce cache hit latencies in superscalar processors, and it has been proposed to reduce access energy in low-power microprocessors. Since way prediction is a speculative technique, it still requires that we fetch one tag and compare it against the current PC to check whether the prediction was correct. Though it has never been examined, way prediction can also be applied to CAM-tagged caches. However, because of the speculative nature of way prediction, a tag still needs to be read out and compared. Also, on a misprediction, the entire access needs to be restarted; there is no work that can be salvaged. Thus, twice the number of words are read out of the cache.

An alternative to way prediction is way memoization. Way memoization stores tag-lookup results (links) within the instruction cache in a manner similar to some way-prediction schemes. However, way memoization also associates a valid bit with each link. These valid bits indicate, prior to the instruction access, whether the link is correct. This is in contrast to way prediction, where the access needs to be verified afterward. This is the crucial difference between the two schemes, and it allows way memoization to work better in CAM-tagged caches. If the link is valid, we simply follow the link to fetch the next instruction, and no tag checks are performed. Otherwise, we fall back on a regular tag search to find the location of the next instruction and update the link for future use. The main complexity in this technique is the need to invalidate all links to a line when that line is evicted; the coherence of all the links is maintained through an invalidation scheme. Way memoization is orthogonal to, and can be used in conjunction with, other cache energy reduction techniques such as sub-banking, block buffering, and the filter cache. Another approach to removing instruction-cache tag lookup energy is the L-cache; however, it is only applicable to loops and requires compiler support.
The way-memoizing instruction cache keeps links within the cache. These links allow an instruction fetch to bypass the tag array and read out words directly from the instruction array. Valid bits indicate whether the cache should use the direct access method or fall back to the normal access method; they are the key to maintaining the coherence of the way-memoizing cache. When we encounter a valid link, we follow the link to obtain the cache address of the next instruction and thereby completely avoid tag checks. When we encounter an invalid link, we fall back to a regular tag search to find the target instruction and update the link; future instruction fetches reuse the valid link. Way memoization can be applied to a conventional cache, a phased cache, or a CAM-tag cache. On a correct way prediction, the way-predicting cache performs one tag lookup and reads one word, whereas the way-memoizing cache does no tag lookup and only reads out one word. On a way misprediction, the way-predicting cache is as power-hungry as the conventional cache and as slow as the phased cache; thus it can be worse than a normal non-predicting cache. The way-memoizing cache, however, merely becomes one of the three normal non-predicting caches in the worst case. The most important difference, though, is that the way-memoization technique can be applied to CAM-tagged caches. [6]
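The valid-link mechanism above can be sketched as follows. The class and its fields are hypothetical illustrations of the scheme, with a tag-check counter standing in for lookup energy:

```python
# Sketch of a way-memoizing fetch: each fetched line stores a link (the way
# of the next fetch target) plus a valid bit. A valid link bypasses the tag
# check entirely; an invalid one falls back to a full tag search and then
# installs the link for future fetches.

class WayMemoCache:
    def __init__(self, num_ways=4):
        self.num_ways = num_ways
        self.tags = {}        # (set_idx, way) -> tag
        self.links = {}       # line id -> (next_way, valid)
        self.tag_checks = 0   # energy proxy: full tag searches performed

    def full_tag_search(self, set_idx, tag):
        self.tag_checks += 1  # CAM/RAM tag lookup energy spent here
        for way in range(self.num_ways):
            if self.tags.get((set_idx, way)) == tag:
                return way
        return None

    def fetch_next(self, cur_line, set_idx, tag):
        link = self.links.get(cur_line)
        if link is not None and link[1]:
            return link[0]    # valid link: no tag check at all
        way = self.full_tag_search(set_idx, tag)
        self.links[cur_line] = (way, True)  # memoize for future fetches
        return way

c = WayMemoCache()
c.tags[(0, 2)] = 0x42
first = c.fetch_next("lineA", 0, 0x42)   # invalid link: one full tag search
again = c.fetch_next("lineA", 0, 0x42)   # valid link: zero tag checks
```

Eviction handling (clearing every link that points at a replaced line) is omitted here; as the text notes, that invalidation is the main source of complexity in the real scheme.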
There is a new way memoization technique which eliminates redundant tag and way accesses to reduce power consumption. The basic idea is to keep a small number of Most Recently Used (MRU) addresses in a Memory Address Buffer (MAB) and to omit redundant tag and way accesses when there is a MAB hit.
The MAB is accessed in parallel with the adder used for address generation, so the technique does not increase the delay of the circuit. Furthermore, this approach does not require modifying the cache architecture, which is considered an important advantage in industry because it makes it possible to use the processor core with previously designed caches or IPs provided by other vendors. The base address and the displacement for load and store operations usually take a small number of distinct values; therefore, we can improve the hit rate of the MAB by keeping only a small number of most recently used tags. Assume the bit width of the tag memory, the number of sets in the cache, and the size of the cache lines are 18, 512, and 32 bytes, respectively. The widths of the set-index and offset fields will then be 9 and 5 bits, respectively. Since most (according to our experiments, more than 99% of) displacement values are less than 2^14, we can easily calculate tag values without full address generation. This can be done by checking the upper 18 bits of the base address, the sign extension of the displacement, and the carry bit of a 14-bit adder which adds the low 14 bits of the base address and the displacement. Therefore, the delay of the added circuit is the sum of the delay of the 14-bit adder and the delay of accessing the set-index table. Our experiment shows this delay is smaller than the delay of the 32-bit adder used to calculate the address.
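The carry-bit shortcut can be verified with a short sketch. The function names are ours, and the check assumes a 32-bit address with an 18-bit tag, i.e. the geometry stated above:

```python
# For |displacement| < 2**14, the upper 18 bits of (base + disp) follow from
# the base's upper 18 bits, the displacement's sign, and the carry out of a
# 14-bit add of the low bits -- no full 32-bit add is needed.

MASK14 = (1 << 14) - 1

def fast_upper18(base, disp):
    carry = ((base & MASK14) + (disp & MASK14)) >> 14  # 14-bit adder carry
    sign = -1 if disp < 0 else 0                       # displacement sign ext.
    return ((base >> 14) + sign + carry) & ((1 << 18) - 1)

def reference_upper18(base, disp):
    # ground truth: full address generation, then take the tag bits
    return ((base + disp) >> 14) & ((1 << 18) - 1)
```

The two functions agree for any displacement in the range the text assumes, which is why the shortcut's delay is only that of the 14-bit adder plus the table access.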
Therefore, our technique does not incur any delay penalty. Note that if the displacement value is greater than or equal to 2^14 or less than -2^14, there will be a MAB miss, but the chance of this happening is less than 1%. To eliminate redundant tag and way accesses for inter-cache-line flows, we can likewise use a MAB. Unlike the MAB used for the D-cache, the input of the MAB used for the I-cache can be one of the following three types: 1) an address stored in a link register; 2) a base address (i.e., the current program counter address) and a displacement value (i.e., a branch offset); or 3) the current program counter address and its stride. In the case of inter-cache-line sequential flow, the current program counter address and the stride of the program counter are chosen as inputs of the MAB, with the stride treated as the displacement value. If the current operation is a "branch (or jump) to the link target", the address in the link register is selected as the input of the MAB, as shown in the figure below. Otherwise, the base address and the displacement are used, as done for the data cache. [7]

Next comes a new cache architecture called the location cache; the figure below illustrates its structure.
The location cache is a small virtually-indexed direct-mapped cache. It caches the location information (the way number within a set that a memory reference falls into) and works in parallel with the TLB and the L1 cache. On an L1 cache miss, the physical address translated by the TLB and the way information of the reference are both presented to the L2 cache, and the L2 cache is then accessed as a direct-mapped cache. If there is a miss in the location cache, the L2 cache is accessed as a conventional set-associative cache. As opposed to way-prediction information, the cached location is not a prediction. Thus, when there is a hit, both time and power are saved, and even when there is a miss, there is no extra delay penalty as seen in way-prediction caches. Caching the position, unlike caching the data itself, does not cause coherence problems in multiprocessor systems: although the snooping mechanism may modify the data stored in the L2 cache, the location will not change. Also, even if a cache line is replaced in the L2 cache, the way information stored in the location cache will not generate a fault. One interesting issue arises here: which references' locations should be cached? The location cache should catch the references that turn out to be L1 misses. A recency-based strategy is not suitable, because recent accesses to the L2 cache are very likely to be cached in the L1 cache. The equation below defines the optimal coverage of the location cache.

Opt. coverage = L2 coverage - L1 coverage
As the indexing rules of the L1 and L2 caches are different, this optimal coverage is not reachable. Fortunately, memory locations are usually referenced in sequences or strides. Whenever a reference to the L2 cache is generated, we calculate the location of the next cache line and feed it into the location cache. The proposed cache system works in the following way. The location cache is accessed in parallel with the L1 caches. If the L1 cache sees a hit, the result from the location cache is discarded. If there is a miss in the L1 cache and a hit in the location cache, the L2 cache is accessed as a direct-mapped cache. If both the L1 cache and the location cache see a miss, the L2 cache is accessed as a traditional L2 cache. The tags of the L2 cache are duplicated; we call the duplicated tag arrays of the L2 cache the location tag arrays. When the L2 cache is accessed, the location tag arrays are accessed to generate the location information for the next memory reference, which is then sent to and stored in the location cache.
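The access flow just described can be sketched as follows. All class and function names are hypothetical, the L2 is modeled as 8-way, and an activated-ways counter serves as the energy proxy:

```python
# Flow sketch of the location-cache system: the location cache is probed in
# parallel with L1; its payload is the L2 way number, so an L1 miss that hits
# the location cache touches only one L2 way.

class L2Stub:
    def __init__(self, data, num_ways=8):
        self.data, self.num_ways = data, num_ways
        self.ways_activated = 0           # energy proxy
    def read_one_way(self, addr, way):
        self.ways_activated += 1          # direct-mapped-style access
        return self.data[addr]
    def read_all_ways(self, addr):
        self.ways_activated += self.num_ways  # conventional access
        return self.data[addr]

def access(addr, l1, loc_cache, l2):
    loc = loc_cache.get(addr)             # probed in parallel with L1
    if addr in l1:
        return l1[addr], "L1"             # location-cache result discarded
    if loc is not None:
        return l2.read_one_way(addr, loc), "L2-direct"
    return l2.read_all_ways(addr), "L2-assoc"

l2 = L2Stub({0x100: "blk"})
hit = access(0x100, {}, {0x100: 3}, l2)   # L1 miss, location-cache hit
miss = access(0x100, {}, {}, l2)          # both miss: conventional L2 access
```

Unlike a way predictor, the stored location is authoritative, so the "L2-direct" path needs no verification and no second probe.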
The L1 cache is a 16 KB 4-way set-associative cache with a cache line size of 64 bytes, implemented in a 0.13 μm technology. The results were produced using the CACTI 3.2 simulator. We chose the access delay of a 16 KB direct-mapped cache as the baseline, which is the best-case delay when a way-prediction mechanism is implemented in the L1 cache, and normalized the baseline delay to 1. It is observed that a location cache with up to 1024 entries has a shorter access latency than the L1 cache. Though the organization of the location cache is similar to that of a direct-mapped cache, there is a small change in the indexing rule: the block offset is 7 bits, as the cache line size for the simulated L2 cache is 128 bytes. Thus, the width of the tag is smaller for the location cache compared with a regular cache.
Compared to a regular cache design, the modification is minor. Note that we need to double the tags (or the number of ports to the tag array), because while the original tags are compared to validate the current access, a spare set of tags is compared to generate the future location information. This idea is similar to the phased cache; the difference is that we overlap the tag comparison for future references with the existing cache reference and use the location cache to store the resulting location information. The simulated cache geometry parameters were optimized for the set-associative cache. The simulation results show that the access latency for a direct-mapped hit is 40% faster than for a set-associative hit.
Although the extra hardware employed by the location cache design does not introduce extra delay on the memory-reference critical path, it does introduce extra power consumption, which comes from the small location cache and the duplicated tag arrays. The power consumption for the tag access of a direct-mapped hit is normalized to one. Compared to the L2 cache power consumption, the location cache consumes a small amount of power; however, as the location cache is triggered much more often than the L2 cache, its power consumption cannot be ignored. The total chip area of the proposed location cache system (with duplicated tags and a location cache of 1024 entries) is only 1.39% larger than that of the original cache system. [8]

The r-a cache is formed by using the tag array of a set-associative cache with the data array of a direct-mapped cache, as shown in Figure 1.
For an n-way r-a cache, there is a single data bank and n tag banks. The tag array is accessed using the conventional set-associative index, probing all n ways of the set in parallel, just as in a normal set-associative cache. The data array index uses the conventional set-associative index concatenated with a way number to locate a block in the set; the way number is log2(n) bits wide. For the first probe, the way number may come either from the lower-order bits of the conventional set-associative tag field (for direct-mapped blocks) or from the way-prediction mechanism (for displaced blocks). If there is a second probe (due to a misprediction), the matching way number is provided by the tag array. The r-a cache simultaneously accesses the tag and data arrays for the first probe, at either the direct-mapped location or a set-associative position provided by the way-prediction mechanism. If the first probe, called probe0, hits, the access is complete and the data is returned to the processor. If probe0 fails to locate the block due to a misprediction (i.e., either the block is in a set-associative position when probe0 assumed a direct-mapped access, or the block is in a set-associative position different from the one supplied by way prediction), probe0 obtains the correct way number from the tag array if the block is in the cache, and a second probe, called probe1, is performed using the correct way number.
Probe1 probes only the data array, not the tag array. If the block is not in the cache, probe0 signals an overall miss and probe1 is not necessary. Thus, there are three possible paths through the cache for a given address: (1) probe0 is predicted to be a direct-mapped access; (2) probe0 is predicted to be a set-associative access and the prediction mechanism provides the predicted way number; or (3) probe0 is mispredicted but obtains the correct way number from the tag array, and the data array is probed using the correct way number in probe1.
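The three paths can be traced with a small sketch (the function name and the toy tag array are ours; timing and energy are omitted):

```python
# r-a cache access paths: probe0 goes to the direct-mapped slot or a
# predicted way; if probe0 misses but the block is present, the tag array
# yields the correct way and a data-only probe1 follows.

def ra_access(tag_array, set_idx, addr_tag, dm_way, predicted_way=None):
    probe0_way = dm_way if predicted_way is None else predicted_way
    if tag_array[set_idx][probe0_way] == addr_tag:
        return ("probe0", probe0_way)        # paths (1) and (2): done in one probe
    for way, tag in enumerate(tag_array[set_idx]):
        if tag == addr_tag:
            return ("probe1", way)           # path (3): data-only second probe
    return ("miss", None)                    # probe0 signals an overall miss

tags = [[7, None, 9, None]]                  # one set of a toy 4-way r-a cache
```

For example, a request for tag 7 with direct-mapped way 0 completes in probe0, while a request for tag 9 from the same slot needs the probe1 reprobe.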
On an overall miss, the block is placed in the direct-mapped position if it is non-conflicting, and in a set-associative position (LRU, random, etc.) otherwise.

Way prediction: the r-a cache employs hardware way prediction to obtain the way number for blocks displaced to set-associative positions before address computation is complete. The strict timing constraint of performing the prediction in parallel with effective-address computation requires that the prediction mechanism use information that is available in the pipeline earlier than the address-compute stage. The equivalent of way prediction for I-caches is often combined with branch prediction, but because D-caches do not interact with branch prediction, those techniques cannot be used directly. An alternative to prediction is to obtain the correct way number of the displaced block using the address, which delays initiating the cache access to the displaced block, as is the case for statically probed schemes such as column-associative and group-associative caches. We examine two handles that can be used to perform way prediction: the instruction PC, and an approximate data address formed by XORing the register
value with the instruction offset, which may be faster than performing a full add. These two handles represent the two extremes of the trade-off between prediction accuracy and early availability in the pipeline. The PC is available much earlier than the XOR approximation, but the XOR approximation is more accurate, because it is hard for the PC to distinguish among different data addresses touched by the same instruction. Other handles, such as instruction fields (e.g., operand register numbers), do not have significantly more information content from a prediction standpoint, and the PSA paper recommends the XOR scheme for its high accuracy. In an out-of-order processor pipeline (figure above), the instruction PC of a memory operation is available much earlier than the source register. Therefore, way prediction can be done in parallel with the pipeline front-end processing of memory instructions, so that the predicted way number and the probe0 way# mux select input are ready well before the data address is computed. The XOR scheme, on the other hand, needs to squeeze in an XOR operation on a value often obtained late from a register-forwarding path, followed by a prediction-table lookup, to produce the predicted way number and the probe0 way# mux select, all within the time the pipeline computes the real address using a full add. Note that the prediction table must have more entries, or be more associative, than the cache itself to avoid conflicts among the XORed approximate data addresses, and therefore will probably have a significant access time, exacerbating the timing problem.

III. WAY-TAGGED CACHE

A way-tagged cache that exploits the way information in the L2 cache to improve energy efficiency is introduced. In a conventional set-associative cache system, when the L1 data cache loads/writes data from/into the L2 cache, all ways in the L2 cache are activated simultaneously for performance reasons, at the cost of energy overhead. The figure above illustrates the architecture of the two-level cache. Only the L1 data cache and the L2 unified cache are shown, as the L1 instruction
cache only reads from the L2 cache. Under the
write-through policy, the L2 cache always
maintains the most recent copy of the data. Thus,
whenever data is updated in the L1 cache, the
L2 cache is updated with the same data as well.
This results in an increase in the write accesses to
the L2 cache and consequently more energy
consumption. The locations (i.e., way tags) of L1
data copies in the L2 cache will not change until
the data are evicted from the L2 cache. The
proposed way-tagged cache exploits this fact to
reduce the number of ways accessed during L2
cache accesses. When the L1 data cache loads
data from the L2 cache, the way tag of the data in
the L2 cache is also sent to the L1 cache and
stored in a new set of way-tag arrays. These way
tags provide the key information for the
subsequent write accesses to the L2 cache.
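The bookkeeping just described can be illustrated with a minimal sketch. The class and field names below are hypothetical, and the model abstracts away data storage entirely; it only shows how recording the L2 way on an L1 fill lets a later write-through enable a single L2 way.

```python
# Illustrative sketch (not the authors' design) of way-tag bookkeeping:
# each L1 line remembers the L2 way-number it was fetched from, so the
# write-through of a later L1 write hit activates only that one L2 way.

L2_WAYS = 8  # assumed L2 associativity for this sketch

class WayTaggedL1:
    def __init__(self, num_lines):
        self.tags = [None] * num_lines      # L1 address tags
        self.way_tags = [None] * num_lines  # L2 way of each cached line

    def fill_from_l2(self, line_idx, addr_tag, l2_way):
        # On an L1 load from the L2, the L2 returns data plus its
        # way-number, which is recorded in the way-tag array.
        self.tags[line_idx] = addr_tag
        self.way_tags[line_idx] = l2_way

    def l2_ways_to_enable(self, line_idx, addr_tag):
        # Write hit: way tag is known -> enable a single L2 way.
        if self.tags[line_idx] == addr_tag:
            return [self.way_tags[line_idx]]
        # Write miss: way unknown -> all L2 ways activated simultaneously.
        return list(range(L2_WAYS))
```

On a write hit the returned list has one entry, matching the paper's claim that the L2 then behaves like a direct-mapped cache for that access.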
In general, both write and read accesses in the L1 cache may need to access the L2 cache. These accesses lead to different operations in the proposed way-tagged cache, as summarized in Table I. Under the write-through policy, all write operations of the L1 cache need to access the L2 cache. In the case of a write hit in the L1 cache, only one way in the L2 cache is activated, because the way-tag information of the L2 cache is available, i.e., from the way-tag arrays we can obtain the L2 way of the accessed data. For a write miss in the L1 cache, the requested data is not stored in the L1 cache; as a result, its corresponding L2 way information is not available in the way-tag arrays, and all ways in the L2 cache need to be activated simultaneously. Since a write hit/miss is not known a priori, the way-tag arrays need to be accessed simultaneously with all L1 write operations in order to avoid performance degradation. The way-tag arrays are very small, and the involved energy overhead can be easily compensated for. For L1 read operations, neither read hits nor misses need to access the way-tag arrays. This is because read hits do not need to access the L2 cache, while for read misses the corresponding way-tag information is not available in the way-tag arrays. As a result, all ways in the L2 cache are activated simultaneously under read misses.

The above figure shows the system diagram of the proposed way-tagged cache. We introduce several new components: way-tag arrays, a way-tag buffer, a way decoder, and a way register, all shown within the dotted line. The way tags of each cache line in the L2 cache are maintained in the way-tag arrays, located with the L1 data cache. Note that write buffers are commonly employed in write-through caches (and even in many write-back caches) to improve performance. With a write buffer, the data to be written into the L1 cache is also sent to the write buffer, and the operations stored in the write buffer are then sent to the L2 cache in sequence. This avoids write stalls while the processor waits for write operations to complete in the L2 cache. In the proposed technique, we also need to send the way tags stored in the way-tag arrays to the L2 cache along with the operations in the write buffer. Thus, a small way-tag buffer is introduced to buffer the way tags read from the way-tag arrays. A way decoder is employed to decode the way tags and generate the enable signals for the L2 cache, which activate only the desired ways. Each way in the L2 cache is encoded into a way tag; a way register stores these way tags and provides this information to the way-tag arrays. The amount of energy consumed per read and write by the conventional set-associative L2 cache and the proposed L2 cache is shown below:
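The figure with the measured per-access energies is not reproduced here, but the shape of the saving can be sketched with a back-of-envelope estimate. The per-way energy and the write-hit fraction below are hypothetical placeholders, not figures from the paper.

```python
# Back-of-envelope estimate with hypothetical numbers: if each activated
# L2 way costs e_way per access, a conventional set-associative L2
# activates all ways on every write, while the way-tagged L2 activates
# one way on write hits and all ways only on write misses.

def l2_write_energy(n_ways, write_hit_frac, e_way=1.0):
    conventional = n_ways * e_way                    # all ways, every write
    proposed = (write_hit_frac * 1 +                 # one way on write hits
                (1 - write_hit_frac) * n_ways) * e_way
    return conventional, proposed

conv, prop = l2_write_energy(n_ways=8, write_hit_frac=0.9)
print(conv, round(prop, 2))  # 8.0 1.7
```

With these assumed numbers, an 8-way L2 and a 90% write-hit rate give roughly a 79% reduction in L2 write energy, consistent with the paper's observation that write hits account for the majority of L2 accesses.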
This cache configuration, used in the Pentium 4, will be used as a baseline system for comparison with the proposed technique under different cache configurations.

IV. CONCLUSION

This paper presents a new energy-efficient cache technique for high-performance microprocessors employing the write-through policy. The proposed technique attaches a tag to each way in the L2 cache. This way tag is sent to the way-tag arrays in the L1 cache when the data is loaded from the L2 cache into the L1 cache. Utilizing the way tags stored in the way-tag arrays, the L2 cache can be accessed as a direct-mapping cache during subsequent write hits, thereby reducing cache energy consumption. Simulation results demonstrate a significant reduction in cache energy consumption with minimal area overhead and no performance degradation. Furthermore, the idea of way tagging can be applied to many existing low-power cache techniques, such as the phased access cache, to further reduce cache energy consumption. Future work is directed towards extending this technique to other levels of the cache hierarchy and towards reducing the energy consumption of other cache operations.
REFERENCES
[1] J. Dai and L. Wang, "An energy-efficient L2 cache architecture using way tag information under write-through policy," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 21, no. 1, Jan. 2013.
[2] C. Su and A. Despain, "Cache design tradeoffs for power and performance optimization: A case study," in Proc. Int. Symp. Low Power Electron. Design, 1997, pp. 63–68.
[3] K. Ghose and M. B. Kamble, "Reducing power in superscalar processor caches using subbanking, multiple line buffers and bit-line segmentation," in Proc. Int. Symp. Low Power Electron. Design, 1999, pp. 70–75.
[4] C. Zhang, F. Vahid, and W. Najjar, "A highly configurable cache architecture for embedded systems," in Proc. Int. Symp. Comput. Arch., 2003, pp. 136–146.
[5] K. Inoue, T. Ishihara, and K. Murakami, "Way-predicting set-associative cache for high performance and low energy consumption," in Proc. Int. Symp. Low Power Electron. Design, 1999, pp. 273–275.
[6] A. Ma, M. Zhang, and K. Asanović, "Way memoization to reduce fetch energy in instruction caches," in Proc. ISCA Workshop Complexity Effective Design, 2001, pp. 1–9.
[7] T. Ishihara and F. Fallah, "A way memoization technique for reducing power consumption of caches in application specific integrated processors," in Proc. Design Autom. Test Euro. Conf., 2005, pp. 358–363.
[8] R. Min, W. Jone, and Y. Hu, "Location cache: A low-power L2 cache system," in Proc. Int. Symp. Low Power Electron. Design, 2004, pp. 120–125.
[9] B. Batson and T. N. Vijaykumar, "Reactive-associative caches," in Proc. Int. Conf. Parallel Arch. Compilation Tech., 2001, pp. 49–61.
[10] V. Vasudevan Nair, "Way-tagged L2 cache architecture in conjunction with energy efficient datum storage," Dept. of ECE, Anna University Chennai / Sri Eshwar College of Engineering, Coimbatore, India.