CACHE MEMORY ORGANIZATION FOR MULTIPROCESSORS
Authors: Anil Pothireddy and Mekhala Vishwanath
12/4/02

Transcript
Page 1:
CACHE MEMORY ORGANIZATION FOR MULTIPROCESSORS
Authors: Anil Pothireddy and Mekhala Vishwanath
12/4/02

Page 2:
Paper Courtesy

Bus and cache memory organizations for multiprocessors

This presentation is based on the doctoral dissertation of Donald Charles Winsor (EE, University of Michigan, 1989).

Page 3:
Organization of the Presentation

Introduction to multiprocessors.
Introduction to cache memory.
Cache organization for multiprocessors.
Performance evaluation of multiprocessors.

Page 4:
INTRODUCTION TO MULTIPROCESSORS

Computer architects have always sought the El Dorado of computer design: to create powerful computers by connecting many existing smaller ones. This golden vision is the fountainhead of multiprocessors.

The single shared bus multiprocessor has been the most commercially successful multiprocessor system design up to this time, largely because it permits the implementation of efficient hardware mechanisms to enforce cache consistency.

Restricted bandwidth of the shared bus has been the most limiting factor in these systems.

Page 5:
INTRODUCTION TO MULTIPROCESSORS (2)

A single-bus multiprocessor. (Typical size is between 2 and 32 processors.)

Page 6:
Multiprocessor performance evaluation:

Let's compare the performance of a SINGLE 68020 processor and a multiprocessor system consisting of TEN 68020 processors.

Consider the task of adding 100 numbers stored in memory.

The ADD.W <ea>,Dn operation in a 68020 takes 4 clock cycles. Therefore, the uni-processor system would require at least 400 cycles to add all the numbers.

The multiprocessor system would perform the above addition as follows:

STEP 1: All 10 uPs first add 10 numbers each (100) = 40 cycles.
STEP 2: Five uPs then add the 10 partial sums in pairs (5) = 4 cycles.
STEP 3: Two uPs next add 4 of the 5 partial sums (2) = 4 cycles.
STEP 4: One uP finally adds the 3 remaining partial sums (2 additions) = 8 cycles.
________________________________________________________________
TOTAL CLOCK CYCLES REQUIRED = 56 cycles
________________________________________________________________
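The four steps above can be checked with a few lines of arithmetic. This is a sketch of the slide's cycle counting, assuming the stated 4-cycle ADD.W and that each step takes as long as its busiest processor:

```python
ADD_CYCLES = 4

# Uniprocessor: 100 additions, one after another.
uni = 100 * ADD_CYCLES                      # 400 cycles

# Multiprocessor: the four parallel steps from the slide.
multi = (10 * ADD_CYCLES   # step 1: each of 10 uPs adds 10 numbers
         + 1 * ADD_CYCLES  # step 2: 5 uPs each add one pair
         + 1 * ADD_CYCLES  # step 3: 2 uPs each add one pair
         + 2 * ADD_CYCLES) # step 4: 1 uP adds the 3 remaining sums

speedup = uni / multi
print(multi, round(speedup, 2))  # 56 7.14
```

The 7.14x figure carried forward to the next slide is simply 400/56.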

Page 7:
Multiprocessor performance evaluation: (2)

• We would expect a x10 improvement in performance while using 10 processors in a multiprocessor system.

• But the performance improvement observed in the previous example is only x7.14.

• This decrease in the performance factor is due to the overheads involved in the parallel processing of multiprocessors.

Page 8:
Multiprocessor performance evaluation: (3)

• Another factor which requires serious consideration is the latency (wait time) of the memory system of a multiprocessor.

• Since multiple processors place memory requests simultaneously, the memory latency will be greater in a multiprocessor system than in a uni-processor system.

• We present a technique which alleviates this problem by using a cache.

Page 9:
Levels of the Memory Hierarchy

Level           Capacity        Access Time     Cost/MB
CPU Registers   100s of bytes   <10s ns         -
Cache           Few KB          10-100 ns       -
Main Memory     Few MB          100 ns - 1 us   $1 - $50
Disk            Few GB          ms              $0.005 - $0.01
Tape            Infinite        sec-min         $10^-6

Data is staged between adjacent levels in characteristic transfer units:

Registers <-> Cache: instruction operands (managed by program/compiler, 1-8 bytes)
Cache <-> Memory: blocks (managed by cache controller, 8-128 bytes)
Memory <-> Disk: pages (managed by OS, 512-4K bytes)
Disk <-> Tape: files (managed by user/operator, Mbytes)

The upper levels of the hierarchy are faster; the lower levels are larger.

Page 10:
INTRODUCTION TO CACHE MEMORY

CACHING: It is a technology based on the memory subsystem of the computer.

• Accelerates the computer while keeping the price of the computer low.

• Allows computer tasks to be performed more rapidly.

Page 11:
LIBRARY EXAMPLE (1)

Imagine a librarian behind her desk. She is there to give the customers the books they ask for. For the sake of simplicity, let's say the customers can't get the books themselves -- they have to ask the librarian for any book they want to read, and she fetches it for them from a set of stacks in a storeroom.

Page 12:
LIBRARY EXAMPLE (2)

The librarian goes into the storeroom, gets the book, returns to the counter and gives the book to the customer. Later, when the customer comes back to return the book, the librarian takes the book and returns it to the storeroom.

When a second customer immediately asks for the same book, the librarian then has to return to the storeroom to get the book she recently handled and then give it to the customer. Under this model, the librarian has to make a complete round trip to fetch every book -- even very popular ones that are requested frequently.

Page 13:
Is there a way to improve the performance of the librarian?

Yes, there's a way -- we can put a CACHE on the librarian !!! Let's give the librarian a backpack in which she will be able to store 10 books (in computer lingo, the librarian now has a 10-book cache). In this backpack, she will put the books the customers return to her, up to a maximum of 10.

Page 14:
This is how it works:

The day starts. The backpack of the librarian is empty. Customers take books and later return them. Instead of going back to the storeroom when a book is returned, the librarian puts the book in her backpack.

When customers request a book, before going to the storeroom, the librarian checks to see if that title is in her backpack.

If she finds it, all she has to do is take the book from the backpack and give it to the customer. There's no journey back into the storeroom, so the customer is served more efficiently.

Page 15:
What if the customer asked for a title not in the cache (backpack)?

In this case, the librarian is less efficient with a cache than without one, because of the time she takes to look for the book in her backpack first.

One of the challenges of cache design is to minimize the impact of cache searches, and modern hardware has reduced this time delay to practically zero.

Even in our simple librarian example, the latency (the waiting time) of searching the cache is so small compared to the time to walk back to the storeroom that it is irrelevant.

The cache is small (10 books), and the time it takes to notice a miss is only a tiny fraction of the time that a journey to the storeroom takes.

Page 16:
TWO-LEVEL CACHE (aka L2)

It is possible to have multiple layers of cache. With our librarian example, the smaller but faster memory type is the backpack, and the storeroom represents the larger and slower memory type. This is a one-level cache.

There might be another layer of cache consisting of a shelf that can hold 100 books behind the counter. The librarian can check the backpack, then the shelf and then the storeroom. This would be a two-level cache.

Page 17:
Mapping

LIBRARIAN      PROCESSOR
BACKPACK       CACHE
BOOK           BLOCK / LINE
BOOK STACKS    MAIN MEMORY

Page 18:
MORE ABOUT CACHE

There are still a few questions unanswered with the cache model described:

Where can a block be placed?
How is a block found?
Which block should be replaced on a cache miss?
What happens on a write?

Page 19:
Cache Principle:

Cache design is based on the principle of locality.

Temporal locality (locality in time): If an item is referenced, it will tend to be referenced again soon.

Spatial locality (locality in space): If an item is referenced, items whose addresses are close by will tend to be referenced soon.

Cache memory is designed to take advantage of the locality of access.
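Spatial locality can be made concrete with a toy miss counter. This sketch assumes an 8-word block size and illustrative access traces; it counts only cold misses, ignoring capacity and conflict effects:

```python
BLOCK_WORDS = 8  # assumed block size, in words

def miss_count(addresses):
    """Count cold misses: an access misses when its block is seen for the first time."""
    seen_blocks = set()
    misses = 0
    for addr in addresses:
        block = addr // BLOCK_WORDS
        if block not in seen_blocks:
            seen_blocks.add(block)
            misses += 1
    return misses

sequential = list(range(64))      # stride 1: good spatial locality
strided = list(range(0, 512, 8))  # stride 8: each access touches a new block

print(miss_count(sequential))  # 8 misses for 64 accesses
print(miss_count(strided))     # 64 misses for 64 accesses
```

The sequential scan fetches each block once and then hits on the other seven words in it, which is exactly the behavior the spatial-locality principle predicts.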

Page 20:
Glossary of technical terms

Block: The minimum unit of information that can either be present or not present in a two-level hierarchy.

Hit: The data requested by the processor appears in some block in the upper level.

Miss: The data requested by the processor is not found in the upper level.

Hit Rate: The fraction of memory accesses found in the upper level of a two-level memory hierarchy. (Miss Rate = 1 - Hit Rate.)

Miss Penalty: The time to replace a block in the upper level with the corresponding block from the lower level, plus the time to deliver it to the processor.

Page 21:
Where can a block be placed?

The simplest way to assign a location in the cache for each word in memory is to assign the cache location based on the address of the word in memory. This cache structure is called direct mapped, as each memory location is mapped to exactly one location in the cache.

In direct mapping, the cache block containing the requested data is given by

(Block number) modulo (Number of blocks in the cache)
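The modulo rule above is one line of code. A minimal sketch, assuming an 8-block cache for illustration:

```python
NUM_CACHE_BLOCKS = 8  # assumed cache size, in blocks

def direct_mapped_index(block_number):
    """Direct mapping: each memory block has exactly one cache slot."""
    return block_number % NUM_CACHE_BLOCKS

# Memory blocks 1, 9, 17, 25 all contend for the same cache slot.
print([direct_mapped_index(b) for b in (1, 9, 17, 25)])  # [1, 1, 1, 1]
```

This contention is why a second differently-mapped block arriving always evicts the first in a direct-mapped cache.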

Page 22:
Direct Mapped Cache Example

A direct-mapped cache with eight entries, showing the addresses of memory words between 0 and 31 that map to the same cache locations. Because there are 8 words in the cache, the address X in memory maps to X modulo 8 in the cache.

Page 23:
Where can a block be placed? (2)

While the direct mapped placement scheme sits at one end of a whole range of schemes for placing blocks, fully associative is at the other end, where a block can be placed in any location in the cache.

In the middle lies the set associative scheme, where there are a fixed number of locations in which each block can be placed; a set associative cache with n locations for a block is called an n-way set associative cache.

In a set-associative cache, the set containing the memory block is given by

(Block number) modulo (Number of sets in the cache)
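The set-selection rule covers the whole placement spectrum: with one way per set it degenerates to direct mapping, and with one set it becomes fully associative. A sketch, assuming an 8-block cache as in the earlier example:

```python
CACHE_BLOCKS = 8  # assumed total cache size, in blocks

def target_set(block_number, ways):
    """Set-associative placement: set = block number mod number of sets."""
    num_sets = CACHE_BLOCKS // ways
    return block_number % num_sets

block = 12
print(target_set(block, ways=1))  # direct mapped: set 4 of 8
print(target_set(block, ways=2))  # 2-way: set 0 of 4
print(target_set(block, ways=8))  # fully associative: the single set 0
```

Block 12 here matches the comparison on the next slide: the same block lands in different places depending on the associativity.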

Page 24:
Comparison of cache architectures

The location of a memory block whose address is 12 in a cache with eight blocks varies for direct-mapped, set-associative, and fully associative placement.

Page 25:
How is a block found?

The choice of how we locate a block depends on the block placement scheme, since that dictates the number of possible locations.

ASSOCIATIVITY       LOCATION METHOD                            COMPARISONS REQUIRED
Direct mapped       Index                                      1
Set associative     Index the set, search among its elements   Degree of associativity
Fully associative   Search all cache entries                   Size of cache

Page 26:
How is a block found? (2)

A cache with 16K blocks and one long word per block: Index = 14 bits, Tag = 16 bits.
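The bit counts follow from the geometry: 16K blocks need 14 index bits, a 4-byte long word needs 2 byte-offset bits, and the remaining 32 - 14 - 2 = 16 bits of a 32-bit address form the tag (the 32-bit address width is an assumption consistent with the slide's numbers). A sketch of the field extraction:

```python
INDEX_BITS, OFFSET_BITS = 14, 2  # 16K blocks, 4-byte blocks

def split_address(addr):
    """Split a 32-bit byte address into (tag, index, byte offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

tag, index, offset = split_address(0x1234ABCD)
print(hex(tag), hex(index), offset)  # 0x1234 0x2af3 1
```

On a lookup, the index selects one cache entry and the stored 16-bit tag is compared against the address tag to decide hit or miss.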

Page 27:
Which block should be replaced on a cache miss?

We have no options while using a direct-mapped cache: there is only one location in the cache for any given memory location, so that block is replaced when a miss occurs.

There are two primary strategies for replacement in set-associative or fully associative caches:

Random: Candidate blocks are randomly selected, using some hardware assistance.

Least recently used (LRU): The block replaced is the one that has been unused for the longest time.
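LRU bookkeeping for a single set can be sketched in a few lines. This is a software stand-in, not how hardware tracks recency (real designs use small state machines or per-way counters), and the 2-way set size is an illustrative assumption:

```python
from collections import OrderedDict

class LRUSet:
    """One cache set with least-recently-used replacement."""
    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()  # ordered oldest-first

    def access(self, block):
        """Return True on a hit; on a miss, insert and evict the LRU block if full."""
        if block in self.blocks:
            self.blocks.move_to_end(block)   # mark most recently used
            return True
        if len(self.blocks) == self.ways:
            self.blocks.popitem(last=False)  # evict least recently used
        self.blocks[block] = True
        return False

s = LRUSet(ways=2)
for b in (1, 2, 1, 3):  # the miss on 3 evicts 2, the least recently used
    s.access(b)
print(list(s.blocks))   # [1, 3]
```

Note that block 1 survives even though it arrived first: the re-reference at step 3 refreshed its recency, which is exactly the behavior LRU is meant to reward.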

Page 28:
What happens on a write?

A key characteristic of any memory hierarchy is how it deals with writes.

There are two basic options:

Write-through: The information is written to both the block in the cache and the block in main memory.

Write-back: The information is written only to the block in the cache. The modified block is written to main memory only when it is being replaced.
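The two policies differ only in when main memory is updated. A minimal sketch with plain dicts (all names here are illustrative; a dirty set models write-back's deferred update):

```python
memory = {}   # main memory: addr -> value
cache = {}    # cache:       addr -> value
dirty = set() # write-back only: blocks modified since being cached

def write_through(addr, value):
    cache[addr] = value
    memory[addr] = value  # main memory updated on every write

def write_back(addr, value):
    cache[addr] = value
    dirty.add(addr)       # memory updated only at replacement time

def evict(addr):
    if addr in dirty:     # write-back: flush the modified block
        memory[addr] = cache[addr]
        dirty.discard(addr)
    cache.pop(addr, None)

write_back(0x10, 42)
print(memory.get(0x10))  # None: memory is stale until eviction
evict(0x10)
print(memory.get(0x10))  # 42
```

The stale-memory window shown here is precisely what makes write-back caches harder to keep consistent in a multiprocessor, which motivates the consistency machinery in the following slides.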

Page 29:
Cache performance evaluation

The estimated number of cycles required for various memory types is tabulated:

Register   1 cycle
Cache      ~2 cycles
RAM        ~10 cycles

Let's consider a system with a modest HIT RATE of 90%, and let the frequency of all loads and stores be 36%.

(CPU time without cache) / (CPU time with cache) = 1.45

CALCULATION:

                       (I * CPI * Clock cycle time)
------------------------------------------------------------------------------
( [ (I * 0.36 * (2/50) * CPI * 0.9) + (I * 0.36 * CPI * 0.1) + (I * 0.64 * CPI) ] * CCT )
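The 1.45 ratio can be reproduced directly from the formula. Instruction count I, CPI, and the clock cycle time cancel; the 2/50 factor for cache hits is taken from the slide's formula as given:

```python
loads_stores = 0.36  # fraction of instructions that access memory
hit_rate = 0.90

with_cache = (loads_stores * (2 / 50) * hit_rate  # hits at a fraction of the cost
              + loads_stores * (1 - hit_rate)     # misses at full memory cost
              + (1 - loads_stores))               # non-memory instructions
ratio = 1 / with_cache
print(round(ratio, 2))  # 1.45
```

So even with a modest 90% hit rate and only 36% of instructions touching memory, the cache buys roughly a 1.45x reduction in CPU time.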

Page 30:
CROSSPOINT CACHE ARCHITECTURE

We propose a new cache architecture that allows a multiple-bus system to be constructed which ensures hardware cache consistency while avoiding the performance bottlenecks of a single shared bus.

Our proposed architecture is a crossbar interconnection network with a cache memory at each crosspoint.

Page 31:
Block Diagram of a Typical Multiprocessor

Single bus with snooping caches

Each processor has a private cache memory. The caches all service their misses by going to main memory over the shared bus. Cache consistency is ensured by using a snooping cache protocol in which each cache monitors all bus addresses.
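The snooping idea described above can be sketched as a toy write-invalidate model. The class and method names are invented for illustration, the model is write-through for brevity, and real protocols track richer per-line states (e.g. MESI):

```python
class SnoopingCache:
    """Toy cache that watches bus writes and invalidates its stale copies."""
    def __init__(self, bus):
        self.lines = {}   # addr -> value
        bus.append(self)  # register on the shared bus

    def read(self, addr, memory):
        if addr not in self.lines:
            self.lines[addr] = memory.get(addr, 0)  # miss: fetch from memory
        return self.lines[addr]

    def write(self, addr, value, memory, bus):
        for cache in bus:
            if cache is not self:
                cache.lines.pop(addr, None)  # snooped invalidate
        self.lines[addr] = value
        memory[addr] = value                 # write-through, for simplicity

bus, memory = [], {0x40: 7}
c0, c1 = SnoopingCache(bus), SnoopingCache(bus)
c0.read(0x40, memory)           # both caches hold addr 0x40
c1.read(0x40, memory)
c0.write(0x40, 8, memory, bus)  # c1's stale copy is invalidated
print(c1.read(0x40, memory))    # 8, re-fetched from memory
```

The key property is the one the slide states: every cache sees every bus address, so a stale copy can never be read after another processor's write.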

Page 32:
Snooping cache architecture

• Although this architecture is simple and inexpensive, the bandwidth of the shared bus severely limits its maximum performance.

• Furthermore, since the bus is shared by all the processors, arbitration logic is needed to control access to the bus. Logic delays in the arbitration circuitry may impose additional performance penalties.

Page 33:
CROSSPOINT CACHE ARCHITECTURE

Crossbars have traditionally been avoided because of their complexity.

However, for the system we are focusing on, with 16 or 32 processors per memory bank and no more than 4 to 8 memory banks, the number of crosspoint switches required is not excessive.

Page 34:
CROSSPOINT CACHE ARCHITECTURE (2)

The simplicity and regular structure of the crossbar architecture greatly outweigh any disadvantages due to complexity.

We will show that our architecture allows the use of straightforward and efficient bus-oriented cache consistency schemes while overcoming their bus traffic limitations.

Page 35:
Crossbar architecture

Crossbar network

Page 36:
Crossbar architecture (2)

Each processor has its own bus, as does each memory bank.

The processor and memory buses are oriented (at least conceptually) at right angles to each other, forming a two-dimensional grid.

A crosspoint switch is placed at each intersection of a processor bus and a memory bus.

Each crosspoint switch consists of a bi-directional bus transceiver and the control logic needed to enable the transceiver at the appropriate times.

Page 37:
Crossbar architecture (3)

This array of crosspoint switches allows any processor to be connected to any memory bank through a single switching element.

Arbitration is still needed on the memory buses, since each is shared by all processors. Thus, this architecture does not eliminate arbitration delay.

If we have 4 memory banks, we can connect 4x as many processors for a given bus load.

Page 38:
Crossbar Architecture Evaluation:

The crossbar architecture is more expensive than a single bus.

However, it avoids the performance bottleneck of the single bus, since several memory requests may be serviced simultaneously.

Unfortunately, if a cache were associated with each processor in this architecture, cache consistency would be difficult to achieve.

Page 39:
Crossbar Architecture Evaluation: (2)

The snooping cache schemes would not work, since there is no reasonable way for every processor to monitor all the memory references of every other processor.

Each processor would have to monitor activity on every memory bus simultaneously.

Page 40:
Crossbar network with caches

If each of these processors were associated with a cache, there would be no way for the cache to monitor all the memory accesses and update itself.

Page 41:
Crosspoint cache architecture

To overcome this problem, we propose the crosspoint cache architecture.

Its general structure is similar to that of the crossbar network, with the addition of a cache memory at each crosspoint.

Page 42:
Crosspoint cache architecture


Crosspoint cache architecture (2)

For each processor, the multiple crosspoint cache memories that serve it (those attached to its processor bus) behave similarly to a larger single cache memory.

For example, in a system with four memory banks and a 16K byte direct mapped cache with a 16 byte line size at each crosspoint, each processor would “see” a single 64K byte direct mapped cache with a 16 byte line size.
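The aggregation can be sketched numerically. Assuming the memory banks are interleaved on cache-line boundaries (an assumption for illustration; the exact interleaving is shown later in the addressing example), the bank bits of an address select one crosspoint cache on the processor bus, and the four 16K byte caches together behave like one 64K byte direct mapped cache:

```python
LINE_SIZE = 16          # bytes per cache line
LINES_PER_CACHE = 1024  # 16 KB / 16 B lines per crosspoint cache
NUM_BANKS = 4           # one crosspoint cache per bank on each processor bus

def crosspoint_lookup(addr):
    """Map a byte address to (bank, line index) under line interleaving."""
    line_addr = addr // LINE_SIZE          # strip the byte-within-line offset
    bank = line_addr % NUM_BANKS           # bank bits pick the crosspoint cache
    index = (line_addr // NUM_BANKS) % LINES_PER_CACHE
    return bank, index

# Consecutive lines land in different banks, so each processor's four
# 16 KB caches hold 4 * 1024 = 4096 distinct lines, like one 64 KB cache.
```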


Crosspoint cache architecture (3)

Note that this use of multiple caches with each processor increases the total cache size, but it does not affect the line size or the degree of set associativity.

This approach is, in effect, an interleaving of the entire memory subsystem, including both the cache memory and the main memory.


Processor bus activity

Each processor has the exclusive use of its processor bus and all the caches connected to it.

There is only one cache in which a memory reference of a particular processor to a particular memory bank may be cached.

The processor bus bandwidth requirement is low, since each bus needs only enough bandwidth to service the memory requests of a single processor.

Since each processor bus and the caches on it are dedicated to a single processor, arbitration is not needed for a processor bus or its caches.


Memory bus activity

When a cache miss occurs, a memory bus transaction is necessary.

The cache that missed places the requested memory address on the bus and waits for main memory to supply the data.

Since all the caches on a particular memory bus may generate bus requests, bus arbitration is necessary on the memory buses.

Since data from a particular memory bank may be cached in any of the caches connected to the corresponding memory bus, it is necessary to observe a cache consistency protocol along the memory buses.


Crosspoint vs. Snooping Cache

Since each memory bus services only a fraction of each processor’s cache misses, this architecture can support more processors than a single bus system before reaching the upper bound on performance imposed by the memory bus bandwidth.

If main memory were divided into four banks (say), each with its own memory bus, then each memory bus would only service an average of one fourth of all the cache misses in the system.

Hence, the memory bus bandwidth would allow four times as many processors as a single bus snooping cache system.
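The scaling argument can be made concrete with a toy calculation (the numbers below are illustrative and not from the dissertation): if one memory bus can absorb a fixed rate of miss traffic, spreading misses evenly across B banked buses multiplies the supportable processor count by B.

```python
def max_processors(bus_capacity_misses_per_sec, miss_rate_per_processor, num_banks):
    """Upper bound on processor count before the memory buses saturate,
    assuming cache misses spread evenly across the banks' buses."""
    per_bus_limit = bus_capacity_misses_per_sec / miss_rate_per_processor
    return int(num_banks * per_bus_limit)

# Hypothetical figures: a bus handling 80M misses/s and processors
# generating 10M misses/s each.
single_bus = max_processors(80e6, 10e6, 1)   # -> 8 processors
four_banks = max_processors(80e6, 10e6, 4)   # -> 32 processors, 4x as many
```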


Memory addressing example

Address bit mapping example

To better illustrate the memory addressing in the crosspoint cache architecture, we consider a system with the following parameters:

64 processors, 4 memory banks, 256 crosspoint caches, 32-bit byte addressable address space, 32-bit word size, 32-bit bus width.

4 word (16 byte) crosspoint cache line size, 16K byte (1024 lines) crosspoint cache size, and direct mapped crosspoint caches.
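These parameters imply a particular split of the 32-bit address: 4 offset bits (16-byte line), 2 bank-select bits (4 banks), 10 index bits (1024 lines per crosspoint cache), and 16 tag bits. The sketch below assumes the bank bits sit just above the offset bits, so that consecutive lines interleave across the banks; the exact bit assignment in the original figure is not reproduced here.

```python
def decompose(addr):
    """Split a 32-bit byte address under the parameters above:
    bits 0-3 offset, 4-5 bank, 6-15 index, 16-31 tag (an assumed layout)."""
    offset = addr & 0xF            # byte within the 16-byte line
    bank   = (addr >> 4) & 0x3     # selects memory bank / crosspoint cache
    index  = (addr >> 6) & 0x3FF   # line within the 1024-line crosspoint cache
    tag    = addr >> 16            # stored in the cache for hit comparison
    return tag, index, bank, offset

# Sanity check: 4 + 2 + 10 + 16 bits account for the full 32-bit address.
```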


Two-level cache

Crosspoint cache architecture with two cache levels

Using a two-level cache scheme introduces additional cache consistency problems. Fortunately, a simple solution is possible.


Two-level cache (2)

In our cache consistency solution, a snooping protocol is used on the memory buses to ensure consistency between the crosspoint caches.

The on-chip caches use write through to ensure that the crosspoint caches always have current data.

The high traffic of write through caches is not a problem, since the processor buses are only used by a single processor.
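The division of labor between the two levels can be sketched as a toy model (the class and method names are illustrative, not from the dissertation): every store updates both the on-chip cache and the crosspoint cache, so a snoop arriving on the memory bus is answered entirely at the crosspoint level without touching the processor bus.

```python
class TwoLevelCache:
    """Toy model of a write-through on-chip cache backed by a crosspoint cache."""
    def __init__(self):
        self.on_chip = {}     # address -> data, private to the processor
        self.crosspoint = {}  # address -> data, visible to memory-bus snoops

    def write(self, addr, data):
        self.on_chip[addr] = data
        self.crosspoint[addr] = data  # write through: crosspoint is always current

    def snoop(self, addr):
        # Snoop hits are serviced from the crosspoint cache alone, so the
        # on-chip cache and its processor bus are never disturbed.
        return self.crosspoint.get(addr)

c = TwoLevelCache()
c.write(0x100, 42)
assert c.snoop(0x100) == 42
```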


Performance evaluation of the crosspoint cache architecture

Several memory requests may be serviced simultaneously.

The two-level cache architecture reduces the effect of the processor bus and crosspoint cache delays.

The fast on-chip cache keeps the average memory access time small, while the large crosspoint caches keep the memory bus traffic low.

All these factors should make it feasible to construct shared memory multiprocessor systems with several hundred processors.


FUTURE RESEARCH

Further studies could investigate protocols suitable for use with set associative caches.

The proposed protocol used a write through policy for the on-chip caches, ensuring that the crosspoint caches always had the most current data so that snoop hits could be serviced quickly.

An alternative approach would be to use a write back policy for the on-chip caches, which would reduce the traffic on the processor buses but would significantly complicate the actions required when a snoop hit occurs.

Performance studies of these alternative implementations could be performed.


REFERENCES

Donald Charles Winsor. “Bus and Cache Memory Organizations for Multiprocessors”. Doctoral dissertation, EE, University of Michigan, 1989.

David A. Patterson and John L. Hennessy. Computer Organization and Design. Morgan Kaufmann Publishers.

James Archibald and Jean-Loup Baer. “Cache Coherence Protocols: Evaluation Using a Multiprocessor Simulation Model”. ACM Transactions on Computer Systems, volume 4, number 4, November 1986, pages 273–298.

www.howstuffworks.com