CSEE W4824 – Computer Architecture, Fall 2012
Luca Carloni, Department of Computer Science
Columbia University in the City of New York
http://www.cs.columbia.edu/~cs4824/

Lecture 8: Memory Hierarchy Design: Memory Technologies and the Basics of Caches
CSEE 4824 – Fall 2012 - Lecture 8, Page 2 – Luca Carloni – Columbia University
Announcements: Class Pre-Taping
Wednesday 10/3: Lecture #8, Regular Class
Monday 10/8: Lecture #9 (Pre-taped)
– Pre-taped this Wed 10/3 at 4:15pm in Mudd 1127
Wednesday 10/10: Lecture #10, Regular Class, Guest lecturer
• Reason: Instructor is traveling to attend Embedded Systems Week 2012
• Pre-taped lectures will be shown as videos from the class PC during regular class time in Mudd 535
• Instructor’s office hours are canceled for the week of October 8
The Processor-Memory Performance Gap
• CPU speed baseline
– assumes 25% improvement per year until 1986, 52% per year until 2000, 20% per year until 2005, and no change (on a per-core basis) until 2010
• Memory baseline
– 64KB DRAM with 150-250ns latency in 1980, 7% per year latency improvement
• Architects must attempt to work around this gap to minimize the memory bottleneck
[Figure: processor vs. memory performance over time, plotted on a log scale]
How Many Memory References?
• A modern high-end multi-core processor (e.g., Intel Core i7) can generate two data memory references per core each clock cycle
– with 4 cores and a 3.2 GHz clock rate, this leads to a peak of 25.6 billion 64-bit data-memory references per second, in addition to a peak of about 12.8 billion 128-bit instruction references
– How to support a total peak bandwidth of 409.6 GB/sec!?
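The 409.6 GB/sec figure follows directly from the per-core rates quoted above; a quick sanity check in Python (the core count and clock rate are taken from the slide's example):

```python
# Peak memory bandwidth for the Core i7 example in the slides.
cores = 4
clock_hz = 3.2e9

# Two 64-bit data references per core per cycle.
data_refs_per_sec = 2 * cores * clock_hz          # 25.6 billion/sec
data_bw = data_refs_per_sec * 8                   # bytes/sec (64 bits = 8 bytes)

# About one 128-bit instruction reference per core per cycle.
instr_refs_per_sec = 1 * cores * clock_hz         # 12.8 billion/sec
instr_bw = instr_refs_per_sec * 16                # bytes/sec (128 bits = 16 bytes)

total_gb_per_sec = (data_bw + instr_bw) / 1e9
print(total_gb_per_sec)                           # 409.6

# A DRAM peak of 25 GB/sec covers only ~6% of this demand.
print(round(25 / total_gb_per_sec * 100, 1))      # 6.1
```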
CSEE 4824 – Fall 2012 - Lecture 8 Page 5Luca Carloni – Columbia University
• The Memory Hierarchy
– multiporting and pipelining the caches, using multiple levels of caches, using separate first- and sometimes second-level caches per core, and using a Harvard architecture (split instruction/data caches) for the first level
• in contrast, the peak bandwidth to DRAM main memory is only about 6% of this (25 GB/sec)
Typical PC Organization
Source: B. Jacob et al., "Memory Systems"
DSP-Style Memory System: Example based on TI TMS320C3x DSP family
• dual tag-less on-chip SRAMs (visible to the programmer)
• off-chip programmable ROM (or PROM or Flash) that holds the executable image
• off-chip DRAM used for computation
Source: B. Jacob et al., "Memory Systems"
Memory Technology
• At the core of the success of computers
• Various types of memory; the most common types:
– Dynamic Random-Access Memory (DRAM)
– Static Random-Access Memory (SRAM)
– Read-Only Memory (ROM)
– Flash Memory
• Memory Latency Metrics
– Access time
• time between when a read is requested and when the desired word arrives
– Cycle time (≥ access time)
• minimum time between two requests to memory
• memory needs the address lines to be stable between accesses
A 64M-bit DRAM: Logical Organization
• Highest memory cell density
– only 1 transistor used to store 1 bit
– to prevent data loss, each bit must be refreshed periodically
• DRAM periodically accesses all bits in every row (refresh)
– about 5% of the time a DRAM is not available due to refreshing
• To limit package costs, address lines are multiplexed
– e.g., first send a 14-bit row address (Row Access Strobe), then a 14-bit column address (Column Access Strobe)
Logical Organization of Wide Data-Out DRAMs
• In order to output more than one bit at a time, the DRAM is organized internally with multiple arrays, each providing one bit towards the aggregate output
• Wider-output DRAMs have appeared in the last two decades
– DRAM parts with x16 and x32 data widths are now common, used primarily in high-performance applications
Source: B. Jacob et al., "Memory Systems"
DIMMs, Ranks, Banks, and Arrays
• A memory system may have many DIMMs, each of which may contain one or more ranks
• Each rank is a set of engaged DRAM devices, each of which may have many banks
• Each bank may have many constituent arrays, depending on the part’s data width
Source: B. Jacob et al., "Memory Systems"
DRAM Generations
Year of Introd. | Chip Size (bit) | $ per GB    | Access time, new row/column | Access time, existing row
1980            | 64K             | $1,500,000  | 250ns                       | 150ns
1983            | 256K            | $500,000    | 185ns                       | 100ns
1985            | 1M              | $200,000    | 135ns                       | 40ns
1989            | 4M              | $50,000     | 110ns                       | 40ns
1992            | 16M             | $15,000     | 90ns                        | 30ns
1996            | 64M             | $10,000     | 60ns                        | 12ns
1998            | 128M            | $4,000      | 60ns                        | 10ns
2000            | 256M            | $1,000      | 55ns                        | 7ns
2004            | 512M            | $250        | 50ns                        | 5ns
2007            | 1G              | $50         | 40ns                        | 1.25ns
SRAMs
• SRAM memory cell is bigger than a DRAM cell
– typically 6 transistors per bit
• Better for low-power applications thanks to stand-by mode
– only minimal power is necessary to retain charge in stand-by mode
• Access Time = Cycle Time
– address lines are not multiplexed (for speed)
• In comparable technologies…
– SRAM has only 1/4 to 1/8 of DRAM capacity
– SRAM cycle time is 8-16 times faster than DRAM
– SRAM cost-per-bit is 8-16 times more expensive than DRAM
ROM and Flash Memory
• ROM
– programmed once and for all at manufacture time
– cannot be rewritten by the microprocessor
– 1 transistor per bit
– good for storing code and data constants in embedded applications
• replaces magnetic disks in providing nonvolatile storage
• adds a level of protection for embedded software
• Flash Memories
– floating-gate technology
– read access time comparable to DRAMs
• 50-100us depending on size (16M-128M)
– write is 10-100 times slower than DRAM (plus erasing time of 1-2ms)
– price is cheaper than DRAM but more expensive than magnetic disks
• Flash: $2/GB, DRAM: $40/GB, disk: $0.09/GB
– initially mostly used for low-power/embedded applications
• but now also as solid-state replacements for disks
– or as efficient intermediate storage between DRAM and disks
Flash Storage: Increasingly an Alternative to Magnetic Disks
• nonvolatile like disks, but with smaller (100-1000x) latency
• smaller, more power efficient, more shock resistant
• critical for mobile electronics
– high volumes lead to technology improvements
• cost per GB is falling 50% per year
• $2-4 per GB (in 2011)
Typical Memory Hierarchies: Servers vs. Personal Mobile Devices
• All data in one level are usually also found in the level below
[Figure: levels of the memory hierarchy, each managed by a different agent and backed by the next level down]
– registers: managed by the compiler, backed by the cache
– cache: managed by hardware, backed by main memory
– main memory (4-16 GB): managed by operating systems, backed by disk
– disk (4-16 TB): managed by operating systems/operator, backed by CD or tape
Review: Principle of Locality
• Temporal Locality
– a resource that is referenced at one point in time will be referenced again sometime in the near future
• Spatial Locality
– the likelihood of referencing a resource is higher if a resource near it was just referenced
• 90/10 Locality Rule of Thumb
– a program spends 90% of its execution time in only 10% of its code
• a consequence of how we write programs and how we store data in memory
• hence, it is possible to predict with reasonable accuracy what instructions and data a program will use in the near future based on its accesses in the recent past
Cache Concepts
• The term “Cache”
– the first (from the CPU) level of the memory hierarchy
– often used to refer to any buffering technique exploiting the principle of locality
• Directly exploits temporal locality, providing faster access to a smaller subset of the main memory that contains a copy of recently used data
• Not all data in the cache are necessarily spatially close in the main memory…
• …still, when a cache miss occurs, a fixed-size block of contiguous memory cells is retrieved from the main memory, based on the principle of spatial locality
Cache Concepts – cont.
• Cache Hit
– CPU finds the requested data item in the cache
• Cache Miss
– CPU doesn’t find the requested data item in the cache
• Miss Penalty
– time to replace a block in the cache (plus time to deliver the data item to the CPU)
– time depends on both latency and bandwidth
• latency determines the time to retrieve the first word
• bandwidth determines the time to retrieve the rest of the block
– handled by hardware that stalls the memory unit (and, therefore, the whole instruction processing in the case of a simple single-issue processor)
Cache : Main Memory = Main Memory : Disk
• Virtual Memory
– makes it possible to increase the amount of memory that a program can use by temporarily storing some objects on disk
– the program address space is divided into pages (fixed-size blocks) which reside either in cache/main memory or on disk
– a better way to organize the address space across programs
• a protection scheme is necessary to control page access
– when the CPU references an item within a page that is not present in cache/main memory, a page fault occurs
• the entire page is moved from disk to main memory
– page faults have a long penalty time
• handled in SW without stalling the CPU, which switches to other tasks
Caching the Address Space
• Programs today are written to run on no particular HW configuration
• Processes execute in imaginary address spaces that are mapped onto the memory system (including DRAM and disk) by the OS
• Every HW memory structure between the CPU and the permanent store is a cache for the instructions and data in the process’s address space
Source: B. Jacob et al., "Memory Systems"
Cache Schemes: Placing a Memory Block into a Cache Block Frame
• Block
– unit of memory transferred across hierarchy levels
• Set
– a group of blocks
• Modern processors use
– direct mapped
– 2-way set associative
– 4-way set associative
• Modern memories
– millions of blocks
• Modern caches
– thousands of block frames
• The range of caches is really a continuum of levels of set associativity (e.g., 1-way/direct mapped, 2-way, 8-way set associative)

Set Index = (Block Address) MOD (Number of Sets in Cache)
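The set-index formula above can be sketched in a few lines; a hypothetical 16-frame cache is assumed here just to show how the index changes with associativity:

```python
# Hypothetical example: where a given block address lands for different
# associativities of a cache with 16 block frames.
BLOCK_FRAMES = 16

def set_index(block_address, ways):
    """Set Index = (Block Address) MOD (Number of Sets in Cache)."""
    num_sets = BLOCK_FRAMES // ways   # more ways -> fewer, larger sets
    return block_address % num_sets

block_address = 12
print(set_index(block_address, 1))    # direct mapped: 16 sets -> index 12
print(set_index(block_address, 2))    # 2-way: 8 sets -> index 4
print(set_index(block_address, 16))   # fully associative: 1 set -> index 0
```

Note how in the fully associative case the index field disappears: every block maps to the single set, and only the tag distinguishes blocks.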
Example: Direct Mapped Cache with 8 Block Frames
• Each memory block is mapped to one cache entry
– cache index = (block address) mod (# of cache blocks)
– e.g., with 8 blocks, the 3 low-order address bits are sufficient
• log2(8) = 3
• Is a block present in the cache?
– must check the cache block tag
• upper bits of the block address
• Block offset
– addresses bytes in a block
• block == word ⇒ byte offset = 2 bits
• How do we know if the data in a block is valid?
– add a valid bit to each entry
• The tag/index boundary moves to the right as we increase associativity (no index field in fully associative caches)
Ex: Direct Mapped Cache with 1024 Block Frames and Block Size of 1 Word for MIPS-32
• Block Offset
– is just a byte offset, because each block of this cache contains 1 word
• Byte Offset
– least significant 2 bits, because in MIPS-32 memory words are aligned to multiples of 4 bytes
• Block Index
– 10 low-order address bits, because this cache has 1024 block frames
• Block Tag
– remaining 20 address bits, in order to check that the address of the requested word matches the cache entry
• Index is for addressing; Tag is for checking/searching
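The field breakdown above can be made concrete with a small sketch; the sample address below is arbitrary, chosen only for illustration:

```python
# Decomposing a 32-bit address for the cache above
# (1024 block frames, 1-word blocks, 4-byte words).
def split_address(addr):
    byte_offset = addr & 0x3         # bits [1:0]: byte within the word
    index = (addr >> 2) & 0x3FF      # bits [11:2]: 10 bits for 1024 frames
    tag = addr >> 12                 # remaining 20 bits
    return tag, index, byte_offset

tag, index, byte_offset = split_address(0x00401234)
print(hex(tag), index, byte_offset)  # 0x401 141 0
```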
Example: 16KB Direct Mapped Cache with 256 Block Frames (of 16 Words Each)
• Single tag comparator needed
Example: Accessing a Direct Mapped Cache with 8 Blocks and Block Size of 1 Word
• Assumptions
– 8 block frames
– block size = 1 word
– main memory of 32 words
• toy example
– we consider ten subsequent accesses to memory, starting from an empty cache (all valid bits set to N)

Access trace:

cycle | Memory Address | address in decimal | Cache Event
1     | 10110          | 22                 | miss
2     | 11010          | 26                 | miss
3     | 11010          | 26                 | hit
4     | 10110          | 22                 | hit
5     | 10000          | 16                 | miss
6     | 00011          | 3                  | miss
7     | 10000          | 16                 | hit
8     | 10010          | 18                 | miss
9     | 11010          | 26                 | miss (replaces Mem[10010] in frame 010)
10    | 11010          | 26                 | hit

Cache contents after cycle 10:

Index | V | Tag | Data
000   | Y | 10  | Mem[10000]
001   | N |     |
010   | Y | 11  | Mem[11010]
011   | Y | 00  | Mem[00011]
100   | N |     |
101   | N |     |
110   | Y | 10  | Mem[10110]
111   | N |     |
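The access sequence in this example can be replayed with a minimal simulator, a sketch assuming the same parameters (8 frames, 1-word blocks, 32-word memory):

```python
# Minimal direct-mapped cache simulation for the toy example:
# 8 block frames, block size = 1 word, 32-word memory (5-bit addresses).
NUM_BLOCKS = 8

def simulate(accesses):
    cache = {}                        # index -> tag of the resident block
    events = []
    for addr in accesses:
        index = addr % NUM_BLOCKS     # 3 low-order address bits
        tag = addr // NUM_BLOCKS      # 2 upper address bits
        if cache.get(index) == tag:
            events.append("hit")
        else:
            cache[index] = tag        # fetch the block on a miss
            events.append("miss")
    return events

trace = [22, 26, 26, 22, 16, 3, 16, 18, 26, 26]
print(simulate(trace))
# ['miss', 'miss', 'hit', 'hit', 'miss', 'miss', 'hit', 'miss', 'miss', 'hit']
```

The two misses at cycles 8 and 9 are conflict misses: addresses 18 (10010) and 26 (11010) share index 010 and keep evicting each other.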
Example: Measuring Cache Size
• How many total bits are required for a direct-mapped cache with 16KB of data and 4-word block frames, assuming a 32-bit address?
• 16KB of data = 4K words = 2^12 words
• Block size of 4 (= 2^2) words ⇒ 2^10 blocks
• Address fields: TAG = 18 bits, INDEX = 10 bits, word OFFSET = 2 bits, byte OFFSET = 2 bits
• # Bits in a Tag = 32 - (10 + 2 + 2) = 18
• # Bits in a block frame = # Tag Bits + # Data Bits + Valid bit = 18 + (4 × 32) + 1 = 147
• Cache Size = # Blocks × # Bits per block frame = 2^10 × 147 = 147 Kbits
• Cache Overhead = 147 Kbits / 16KB = 147 / 128 ≈ 1.15
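The arithmetic above is easy to double-check programmatically:

```python
# Checking the cache-size arithmetic from the slide:
# direct-mapped, 16KB of data, 4-word blocks, 32-bit addresses.
address_bits = 32
data_bytes = 16 * 1024                   # 16KB of data
words = data_bytes // 4                  # 4096 words
block_words = 4
num_blocks = words // block_words        # 1024 = 2**10 block frames

index_bits = 10                          # log2(1024)
word_offset_bits = 2                     # 4 words per block
byte_offset_bits = 2                     # 4 bytes per word
tag_bits = address_bits - (index_bits + word_offset_bits + byte_offset_bits)

bits_per_frame = tag_bits + block_words * 32 + 1    # tag + data + valid bit
total_kbits = num_blocks * bits_per_frame / 1024
print(tag_bits, bits_per_frame, total_kbits)        # 18 147 147.0

# Overhead relative to the 16KB (= 128 Kbits) of pure data.
print(round(num_blocks * bits_per_frame / (data_bytes * 8), 2))   # 1.15
```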
Performance Metrics for Caches
• Miss Rate (misses per memory reference)
– fraction of cache accesses that result in a miss
• Misses Per Instruction
– often reported as misses per 1000 instructions
– for speculative processors we only count the instructions that commit
• Miss Penalty
– additional clock cycles necessary to retrieve the block with the missing word from the main memory

Misses per instruction = Miss rate × (Memory accesses / Instruction count)
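Converting between the two metrics is a one-liner; the 2% miss rate and 1.5 accesses per instruction below are hypothetical figures, chosen to match the example later in the lecture:

```python
# Converting a miss rate into misses per 1000 instructions.
miss_rate = 0.02                 # 2% of memory accesses miss
mem_accesses_per_instr = 1.5     # average memory references per instruction

misses_per_instr = miss_rate * mem_accesses_per_instr
print(round(misses_per_instr * 1000))   # 30 misses per 1000 instructions
```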
Performance Metrics for Caches – continued
• Average Memory Access Time (AMAT)

AMAT = Hit time + Miss rate × Miss penalty

• AMAT is a better estimate of cache performance, but still not a substitute for execution time
• Impact on CPU Time
– including “hit clock cycles” in CPU execution clock cycles

CPU Time = (CPU execution cycles + Memory stall cycles) × Clock cycle time
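The AMAT formula is worth plugging numbers into; the figures below are hypothetical (1-cycle hits, 2% miss rate, 200-cycle penalty, matching the example later in the lecture):

```python
# AMAT = Hit time + Miss rate * Miss penalty
hit_time = 1          # clock cycles on a hit
miss_rate = 0.02      # 2% of accesses miss
miss_penalty = 200    # clock cycles to fetch the block from main memory

amat = hit_time + miss_rate * miss_penalty
print(amat)           # 5.0 clock cycles per memory access on average
```

Even a 2% miss rate quintuples the average access time here, which is why reducing misses dominates cache design.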
Performance Metrics for Caches – continued
• Impact on CPU Time
– including “hit clock cycles” in CPU execution clock cycles
– and breaking down the memory stall cycles
– the lower the CPI, the higher the relative impact of a fixed number of cache-miss clock cycles
– the faster the CPU (i.e., the lower the clock cycle time), the higher the number of clock cycles per miss

CPU Time = IC × (CPI_exec + Miss rate × Memory accesses per instruction × Miss penalty) × Clock cycle time
Example: The Impact of Cache on Performance
• Assumptions
– CPI_exec = 1 clock cycle (ignoring memory stalls)
– Miss rate = 2%
– Miss penalty = 200 clock cycles
– Average memory references per instruction = 1.5

(CPI)_no_cache = 1 + 1.5 × 200 = 301
(CPI)_with_cache = 1 + (1.5 × 0.02 × 200) = 7

• The impact of the cache on CPU time is greater
– the lower the CPI of the other instructions
• for a fixed number of cache-miss clock cycles
– the lower the clock cycle time of the CPU
• because the CPU spends a larger number of clock cycles per miss (i.e., a higher memory portion of CPI)
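The two CPI figures from this example can be reproduced directly from the assumptions:

```python
# Reproducing the slide's CPI comparison.
cpi_exec = 1                 # base CPI, ignoring memory stalls
miss_rate = 0.02
miss_penalty = 200           # clock cycles
mem_refs_per_instr = 1.5

# Without a cache, every memory reference pays the full penalty.
cpi_no_cache = cpi_exec + mem_refs_per_instr * miss_penalty
# With a cache, only the misses pay the penalty.
cpi_with_cache = cpi_exec + miss_penalty * miss_rate * mem_refs_per_instr

print(cpi_no_cache, cpi_with_cache)   # 301.0 7.0
```

A 43x gap in CPI from a simple cache is the concrete payoff of the locality principle reviewed earlier.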
Assigned Readings
• Computer Architecture – A Quantitative Approach, by John Hennessy (Stanford University) and Dave Patterson (UC Berkeley), Fifth Edition, 2012, Morgan Kaufmann (Elsevier)
– Sections 2.1 and 2.3
– Appendix B.1
• For review purposes: see Chapter 7 of the Hennessy & Patterson “Computer Organization & Design” book
• Assigned paper: A. Leventhal, “Flash Storage Memories”