Oct 30, 2014
Whiz, Wizard - a person with extraordinary skill or accomplishment (www.whizchip.com)
Confidential information of WhizChip Design Technologies (www.whizchip.com). Contains WhizChip or its customer's proprietary and business-sensitive data.
Tanmay Rao M
Reg No - 101002014
MS VLSI CAD [email protected]
Design And Evaluation Of Cache Performance Evaluation System
(To Implement Search Algorithms and Evaluate Simulation Models)
Contents
• Introduction
• Terms Related to Cache
• Cache States
• Project Specifications
• Need For Search Algorithm
• Proposed Algorithm 1: BST
• Proposed Algorithm 2: Splay Tree
• Class Architecture
• Conclusion and Scope for Future Work
• References
Introduction
• Cache plays a vital role in improving performance by providing data to the requesting master in a SoC within a very few clock cycles. Cache reduces the need for frequent accesses to the main memory, which typically take 50 to 100 clock cycles.
• The importance of cache in a typical SoC containing several masters is determined by adding caches at different levels.
• Four simulation models are designed to determine the importance of cache in a SoC. The local caching of data introduces the cache coherence problem.
• The cache coherence problem is solved by implementing a cache coherency protocol.
• Search algorithms are used to implement the cache controllers. Two search algorithms are implemented and their performance evaluated.
Terms Related to Cache
• Cache Entries
• Cache Performance
• Replacement Policies
• Locality
• Write Policies
• Master Stalls
• Flag Bits
• Cache Miss
• Cache Hierarchy
• Victim Cache
Cache States
• Valid, Invalid: When valid, the cache line is present in the cache; when invalid, the cache line is not present in the cache.
• Unique, Shared: When unique, the cache line exists in only one cache; when shared, the cache line exists in more than one cache.
• Clean, Dirty: When clean, the cache line is unchanged, so there is no need to update the main memory when the line is replaced. When dirty, the cache line has been changed, and the main memory must be updated when the line is replaced.
[State diagram: Valid lines are Unique or Shared, and Clean or Dirty; the five states are:]
• UD - Unique Dirty
• SD - Shared Dirty
• UC - Unique Clean
• SC - Shared Clean
• I - Invalid
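The five states above can be encoded with the three flag bits (valid, unique, dirty). A minimal Python sketch of this encoding follows; it is illustrative only (the project itself is written in SystemVerilog, and the names here are assumptions, not project code):

```python
# Map each cache-line state to its (valid, unique, dirty) flag bits.
# State names follow the slide: UD, SD, UC, SC, I.
CACHE_STATES = {
    "UD": {"valid": True,  "unique": True,  "dirty": True},
    "SD": {"valid": True,  "unique": False, "dirty": True},
    "UC": {"valid": True,  "unique": True,  "dirty": False},
    "SC": {"valid": True,  "unique": False, "dirty": False},
    "I":  {"valid": False, "unique": False, "dirty": False},
}

def needs_writeback(state):
    """A line must be written back to main memory only if it is valid and dirty."""
    flags = CACHE_STATES[state]
    return flags["valid"] and flags["dirty"]
```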
• Main Memory Controller
• This controller accepts addresses on the shared bus.
• It also has a delay to mimic the real latency seen in a typical main memory access.
• Typical delays are in the range of 50-100 clock cycles.
• L1 Cache Controller
• The cache controller serves the master's read and write requests.
• On a cache miss, it fetches the data from the main memory or from another snooped cache.
• An additional responsibility of the cache controller is supporting snooping, which can conflict with its normal operation.
Project Specifications
• L2 Cache Controller
• When a dirty cache line from any of the L1 caches is to be replaced, it is moved to the L2 cache.
• This saves the clock cycles needed to write it back to the main memory.
• Search Algorithm
• Two search algorithms have been implemented in this project.
• Cache Simulation Models
• Four simulation models were developed and evaluated using test cases.
Project Specifications
Model 1: Four masters (M1-M4), each with a private L1 cache, connected to the main memory through a snoop channel and a main memory channel.
Model 2: Model 1 plus a shared L2 cache on a dedicated L2 cache channel, in addition to the snoop and main memory channels.
Model 3: Four masters (M1-M4) with private L1 caches and a main memory channel, but no snoop channel.
Model 4: Four masters (M1-M4) accessing the main memory directly over the main memory channel, with no caches.
• Constraints
• The whole model is written in SystemVerilog.
• The memory leaks associated with C++ do not exist in SystemVerilog.
• Assumptions
• The delay associated with the main memory is modeled as 100 cycles; it can be varied by the user.
• The replacement algorithm for the cache is not a standard policy.
• The read and write channels of the main memory are separate.
• No particular snooping protocol is used.
Constraints and Assumptions
• The basic requirement is a search for the requested address in the cache.
• Many algorithms have been devised for efficient search.
• Along with search, we need to add and delete addresses.
• We also need to update and replace cache lines with new lines.
• The algorithm should add and delete addresses in a manner that does not affect the search drastically.
• We need a suitable data structure that can store the addresses effectively.
• The memory footprint of the data structure should also be optimal, so that it does not consume excessive memory resources.
Need For Search Algorithm
• Hash Coding
• Hash coding is a process in which a search key is transformed, through a hash function, into an actual address for the associated data.
• A very simple hash function is the modulus function.
• Pseudo CAMs
• Since fully associative memories are difficult and expensive to build relative to normal main memory, a method of building a large random-access memory with associative access would be advantageous.
• The pseudo CAM uses a multiple-memory-bank architecture in which a key is hashed to an address that is valid in every bank.
• Pre-Computation Technique
• Here extra information is stored along with the tag; it is derived from the stored bits.
• For an input tag we first compute the number of ones/zeros and compare it with the stored value.
Background Study
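As a small illustration of hash coding, a modulus hash with chained buckets can be sketched as below (Python, purely illustrative; names and structure are assumptions, not the project's code):

```python
def hash_index(key, table_size):
    # Modulus hash: transforms the search key into a table index.
    return key % table_size

# Toy hash table of (key, data) buckets, using chaining for collisions.
table_size = 8
table = [[] for _ in range(table_size)]

def insert(key, data):
    table[hash_index(key, table_size)].append((key, data))

def lookup(key):
    # Scan only the one bucket the key hashes to.
    for k, d in table[hash_index(key, table_size)]:
        if k == key:
            return d
    return None
```

Note that 0x40 and 0x48 hash to the same bucket here (both are 0 mod 8), which is exactly the collision case that chaining, or the pseudo-CAM's multiple banks, must handle.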
• The data structure used for storing the addresses is a binary search tree.
• The properties of a binary search tree are:
• The left subtree of a node contains only values less than the node's value.
• The right subtree of a node contains only values greater than the node's value.
• Both the left and right subtrees must also be binary search trees.
• The number of elements the data structure can hold depends on the number of levels of the binary search tree; as the number of levels increases, so does the number of elements. A binary search tree with n levels holds up to 2^n − 1 elements.
Proposed Algorithm 1: BST
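The BST properties above can be expressed as a short check. The sketch below (Python, illustrative only; the project implements this in SystemVerilog) builds a node type and verifies the ordering property:

```python
class Node:
    """One node of the binary search tree; key is the cache-line address."""
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def is_bst(node, lo=float("-inf"), hi=float("inf")):
    """Check the BST property: every key in the left subtree is smaller than
    the node's key, every key in the right subtree is larger, recursively."""
    if node is None:
        return True
    if not (lo < node.key < hi):
        return False
    return is_bst(node.left, lo, node.key) and is_bst(node.right, node.key, hi)
```

A full tree with 3 levels holds 2^3 − 1 = 7 elements, matching the count given above.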
Search Operation
• Compare the requested address with the root node; if they are equal, the search is successful.
• If the address is less than the current node, move to the left child; if it is greater, move to the right child.
• Repeat the comparison at each node; the search succeeds when a node holding the address is found and fails when an empty child is reached.
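The search flow above is the classic iterative BST search; a minimal Python sketch (illustrative, not the project's SystemVerilog code):

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def search(node, addr):
    """Iterative BST search: returns the node holding addr, or None on a miss."""
    while node is not None:
        if addr == node.key:
            return node                      # search successful
        node = node.left if addr < node.key else node.right
    return None                              # reached an empty child: miss
```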
Add Operation
• If the root is empty, add the address as the root node.
• If the address is less than the current node: add it as the left child if that child is empty; otherwise descend to the left child.
• If the address is greater than the current node: add it as the right child if that child is empty; otherwise descend to the right child.
• Repeat until the address is added to the tree.
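The add flow maps directly onto an iterative insert; a Python sketch under the same illustrative assumptions as above:

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def add(root, addr):
    """Insert addr following the flowchart: walk down until an empty child is found."""
    if root is None:
        return Node(addr)                  # tree was empty: new node is the root
    cur = root
    while True:
        if addr < cur.key:
            if cur.left is None:
                cur.left = Node(addr)      # empty left child: add here
                return root
            cur = cur.left
        elif addr > cur.key:
            if cur.right is None:
                cur.right = Node(addr)     # empty right child: add here
                return root
            cur = cur.right
        else:
            return root                    # address already present: nothing to add
```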
Delete Operation
• If the address to be deleted is a leaf node, delete the leaf node.
• If the address to be deleted is the root node or an intermediate node, find the in-order successor of that node.
• Replace the node with its in-order successor, then delete the duplicate entry.
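The in-order-successor delete can be sketched as follows (Python, illustrative only); the successor's key is copied up and the duplicate entry is then deleted from the right subtree, as in the flow above:

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def delete(node, addr):
    """Delete addr from the BST, replacing internal nodes with the in-order successor."""
    if node is None:
        return None
    if addr < node.key:
        node.left = delete(node.left, addr)
    elif addr > node.key:
        node.right = delete(node.right, addr)
    else:
        if node.left is None:              # leaf or single-child node
            return node.right
        if node.right is None:
            return node.left
        succ = node.right                  # in-order successor: leftmost of right subtree
        while succ.left is not None:
            succ = succ.left
        node.key = succ.key                # copy the successor's key up...
        node.right = delete(node.right, succ.key)  # ...then delete the duplicate entry
    return node
```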
Update Operation
• If the old cache line's data differs from the new cache line's data, update the data in the cache line.
• If the old cache line's dirty flag differs from the new cache line's, update the dirty bit in the cache line.
• If the old cache line's shared flag differs from the new cache line's, update the shared bit in the cache line.
Replace Operation
• If the cache line is shared, it is replaced with the new cache line.
• Otherwise, if the cache line is clean, it is replaced with the new cache line.
• Otherwise, the last added node is replaced with the new cache line.
• A modified version of a binary search tree called the splay tree is used to implement the data structure.
• A splay tree is a self-balancing binary search tree.
• A balanced binary search tree has uniform height in both subtrees.
• Along with the self-balancing property, the splay tree has the additional property that whenever a new address is added, it is brought to the root node.
• This process of bringing the added address to the root is called splaying.
• So in a splay tree the time required to access the most recently used addresses is very low, as they are near the root.
Proposed Algorithm 2: Splay Tree
• The three types of splay steps (where x is the node being splayed, p its parent and g its grandparent) are:
• Zig Step
• This step is done when p is the root.
• The tree is rotated on the edge between x and p.
• Zig-Zig Step
• This step is done when p is not the root and x and p are either both right children or both left children.
• The tree is rotated on the edge joining p with its parent g, then rotated on the edge joining x with p.
• Zig-Zag Step
• This step is done when p is not the root and x is a right child and p is a left child, or vice versa.
• The tree is rotated on the edge between x and p, then rotated on the edge between x and its new parent g.
Splaying
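The zig, zig-zig and zig-zag steps can be sketched with two rotation helpers and a compact recursive splay (Python, illustrative only; the project's SystemVerilog implementation may differ in structure):

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def rotate_right(p):
    """Rotate on the edge between p and its left child x."""
    x = p.left
    p.left, x.right = x.right, p
    return x

def rotate_left(p):
    """Rotate on the edge between p and its right child x."""
    x = p.right
    p.right, x.left = x.left, p
    return x

def splay(root, key):
    """Bring key (or the last node on its search path) to the root
    via zig, zig-zig and zig-zag steps (simplified recursive form)."""
    if root is None or root.key == key:
        return root
    if key < root.key:
        if root.left is None:
            return root
        if key < root.left.key:                          # zig-zig (left-left)
            root.left.left = splay(root.left.left, key)
            root = rotate_right(root)
        elif key > root.left.key:                        # zig-zag (left-right)
            root.left.right = splay(root.left.right, key)
            if root.left.right is not None:
                root.left = rotate_left(root.left)
        return root if root.left is None else rotate_right(root)   # final zig
    else:
        if root.right is None:
            return root
        if key > root.right.key:                         # zig-zig (right-right)
            root.right.right = splay(root.right.right, key)
            root = rotate_left(root)
        elif key < root.right.key:                       # zig-zag (right-left)
            root.right.left = splay(root.right.left, key)
            if root.right.left is not None:
                root.right = rotate_right(root.right)
        return root if root.right is None else rotate_left(root)   # final zig
```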
[Diagrams: zig and zig-zig/zig-zag splay rotations on node X with parent P and grandparent G, with subtrees A, B, C and D, before and after splaying.]
• Search is the most important operation in a binary search tree.
• The given address is compared with the root p, its left child and its right child.
• The next stage of comparisons has four possibilities:
• If the address is less than p and less than the left child, then the address, if it exists, is in subtree A.
• If the address is less than p and greater than the left child, then the address, if it exists, is in subtree B.
• If the address is greater than p and less than the right child, then the address, if it exists, is in subtree C.
• If the address is greater than p and greater than the right child, then the address, if it exists, is in subtree D.
Search Operation
[Diagram: root P with left child LC and right child RC; subtrees A and B hang below LC, subtrees C and D below RC.]
Add Operation
• Start with the current node set to the root node.
• If addr > current node and addr > right node: if the right child of the right node is empty, add the address there; otherwise make the right child the current node and repeat.
• If addr > current node and addr < right node: if the left child of the right node is empty, add the address there; otherwise make the right child the current node and repeat.
• If addr < current node and addr > left node: if the right child of the left node is empty, add the address there; otherwise make the left child the current node and repeat.
• If addr < current node and addr < left node: if the left child of the left node is empty, add the address there; otherwise make the left child the current node and repeat.
• After the address is added, splay the tree if required.
Delete Operation
• If the address to be deleted is a leaf node, delete the leaf node.
• If the address to be deleted is the root node or an intermediate node, find the in-order successor of that node, replace the node with it, and delete the duplicate entry.
• Splay the tree if required.
Cache Line
• Key - Stores the address, which is matched in the search operation; it is the unique part of the cache line that distinguishes it from other cache lines.
• Data - Stores the data associated with the address; it may be consistent with the main memory or may be provided by the master.
• Shared - Flag bit. When set, it indicates that the cache line is shared among other masters.
• Dirty - Flag bit. When set, it indicates that the data in the cache line is dirty; when the cache line is evicted, the data must be written back to the main memory.
• Invalid - Flag bit. When set, it indicates that the cache line does not hold valid data.
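The fields above can be sketched as a record type (Python dataclass, illustrative only; the project models the cache line in SystemVerilog and the defaults here are assumptions):

```python
from dataclasses import dataclass

@dataclass
class CacheLine:
    """One cache line, mirroring the field table above (illustrative sketch)."""
    key: int               # address; the unique part matched during search
    data: int = 0          # data associated with the address
    shared: bool = False   # set when the line exists in more than one cache
    dirty: bool = False    # set when the data differs from main memory
    invalid: bool = True   # set when the line does not hold valid data
```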
• The add task is used to add the cache line to the data structure.
• The delete operation for the binary search tree is also implemented.
• Temporary nodes are used to find the in-order successor or the in-order predecessor.
• In this implementation, the in-order successor method is used.
• The update operation is common to both algorithms.
• The splay task is implemented only for the second algorithm.
• In the implementation, the data structure is splayed when the 3rd, 5th, 7th, 9th and 15th elements are added.
Data Structure Class
[Diagrams: Splay For 3 Element Tree (keys 8, 16, 24) and Splay For 5 Element Tree (keys 8, 16, 24, 32, 40), showing each tree before and after the most recently added key is brought to the root.]
• Binary Search Tree
• The algorithm class takes the data structure object as a parameter.
• This means that even if the data structure changes, the algorithm need not change.
• Splay Tree
• In the splay tree implementation the whole data structure is divided into eight binary trees.
• The hash function selects the bank where each address is stored.
• The pipelined scheme saves cycles compared with a non-pipelined algorithm, in which the add and delete logic would sit idle; here search, add and delete all work in parallel, saving clock cycles.
Algorithm Class
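The bank-selection idea can be sketched as below (Python, illustrative; a dict stands in for each of the eight trees, and the modulus hash is an assumption, since the slides do not name the hash function):

```python
NUM_BANKS = 8

def select_bank(address):
    # The hash function selects one of the eight banks (binary trees).
    return address % NUM_BANKS

# Each dict stands in for one per-bank search tree in this sketch.
banks = [dict() for _ in range(NUM_BANKS)]

def add(address, data):
    banks[select_bank(address)][address] = data

def search(address):
    # Only the selected bank needs to be searched, so the banks
    # can be operated on in parallel, as in the pipelined scheme.
    return banks[select_bank(address)].get(address)
```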
• Another important component in the model is the Main Memory Controller.
• The read task accepts an address and returns the data with a data-valid signal after 20 cycles.
• Similarly, the write task accepts the address and the data to be written into the main memory.
• Another task initializes the locations of the main memory for simulation purposes.
Main Memory Class
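A behavioural sketch of such a memory model follows (Python, illustrative only; the initialisation pattern and method names are assumptions, not the project's SystemVerilog tasks):

```python
class MainMemory:
    """Behavioural main-memory model: reads return data after a fixed delay."""
    def __init__(self, delay=20):
        self.delay = delay          # cycles before the data-valid signal asserts
        self.mem = {}

    def init_locations(self, n):
        # Initialise n locations for simulation; the data pattern is illustrative.
        for addr in range(n):
            self.mem[addr] = addr * 2

    def write(self, addr, data):
        self.mem[addr] = data

    def read(self, addr):
        # Returns the data together with the cycle count consumed by the access.
        return self.mem.get(addr, 0), self.delay
```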
Test Cases for Evaluation of Model 1

Description                              | Avg Cycles (BST) | Avg Cycles (Splay Tree) | Local Cache Hits (4 Masters) | Snoop Cache Hits (4 Masters) | Main Memory Accesses (4 Masters)
Local Hit Rate 0.25                      | 15.11 | 15.15 | 32  | 0   | 96
Local Hit Rate 0.5                       | 10.32 | 10.25 | 64  | 0   | 64
Local Hit Rate 0.75                      | 5.66  | 5.35  | 96  | 0   | 32
Local Hit Rate 1                         | 1.24  | 0.65  | 128 | 0   | 0
Local Hit Rate 0.25, Snoop Hit Rate 0.25 | 10.97 | 10.58 | 32  | 32  | 64
Local Hit Rate 0.5, Snoop Hit Rate 0.25  | 6.19  | 5.67  | 64  | 32  | 32
Local Hit Rate 0.5, Snoop Hit Rate 0.5   | 2.5   | 1.09  | 64  | 64  | 0
Local Hit Rate 0.25, Snoop Hit Rate 0.75 | 2.64  | 1.61  | 32  | 96  | 0
Local Hit Rate 0.75, Snoop Hit Rate 0.25 | 1.23  | 0.99  | 96  | 32  | 0
Snoop Hit Rate 1                         | 4.31  | 2.24  | 0   | 128 | 0
Local Hit Rate 0.25, Snoop Hit Rate 0.5  | 6.02  | 6.13  | 32  | 64  | 32
Test Cases for Evaluation of Model 2

Description                                                    | Avg Cycles (BST) | Avg Cycles (Splay Tree) | Local Cache Hits | Snoop Cache Hits | Victim Cache Hits | Main Memory Accesses (all per 4 Masters)
Local Hit Rate 0.25                                            | 15.11 | 15.15 | 32  | 0  | 0  | 96
Local Hit Rate 0.5                                             | 10.32 | 10.25 | 64  | 0  | 0  | 64
Local Hit Rate 0.75                                            | 5.66  | 5.35  | 96  | 0  | 0  | 32
Local Hit Rate 1                                               | 1.24  | 0.65  | 128 | 0  | 0  | 0
Local Hit Rate 0.25, Victim Hit Rate 0.25                      | 10.80 | 11.13 | 32  | 0  | 32 | 64
Local Hit Rate 0.25, Snoop Hit Rate 0.25, Victim Hit Rate 0.25 | 6.67  | 6.66  | 32  | 32 | 32 | 32
Local Hit Rate 0.25, Snoop Hit Rate 0.25                       | 11.09 | 10.66 | 32  | 32 | 0  | 64
Local Hit Rate 0.5, Snoop Hit Rate 0.25                        | 6.29  | 5.75  | 64  | 32 | 0  | 32
Local Hit Rate 0.5, Snoop Hit Rate 0.5                         | 2.65  | 1.16  | 64  | 64 | 0  | 0
Test Cases for Evaluation of Model 2 (continued)

Description                               | Avg Cycles (BST) | Avg Cycles (Splay Tree) | Local Cache Hits | Snoop Cache Hits | Victim Cache Hits | Main Memory Accesses (all per 4 Masters)
Local Hit Rate 0.25, Snoop Hit Rate 0.75  | 3.08  | 1.63  | 32 | 96  | 0  | 0
Local Hit Rate 0.75, Snoop Hit Rate 0.25  | 1.41  | 1.03  | 96 | 32  | 0  | 0
Snoop Hit Rate 1                          | 5.14  | 2.29  | 0  | 128 | 0  | 0
Local Hit Rate 0.25, Snoop Hit Rate 0.5   | 6.22  | 6.15  | 32 | 64  | 0  | 32
Snoop Hit Rate 0.25, Victim Hit Rate 0.25 | 11.40 | 11.77 | 0  | 32  | 32 | 64
Victim Hit Rate 0.25                      | 15.70 | 16.01 | 0  | 0   | 32 | 96
Test Cases for Evaluation of Model 3

Description          | Avg Cycles (BST) | Avg Cycles (Splay Tree) | Local Cache Hits (4 Masters) | Main Memory Accesses (4 Masters)
Local Hit Rate 0.25  | 15.11 | 15.15 | 32  | 96
Local Hit Rate 0.5   | 10.32 | 10.25 | 64  | 64
Local Hit Rate 0.75  | 5.66  | 5.35  | 96  | 32
Local Hit Rate 1     | 1.24  | 0.65  | 128 | 0
Test Cases for Evaluation of Model 4

Description          | Avg Cycles (BST) | Avg Cycles (Splay Tree) | Main Memory Accesses (4 Masters)
Local Hit Rate 0.25  | 20 | 20 | 96
Local Hit Rate 0.5   | 20 | 20 | 64
Local Hit Rate 0.75  | 20 | 20 | 32
Local Hit Rate 1     | 20 | 20 | 0
Graph for 4 Models
[Graph: clock cycles (y-axis) against input address (x-axis) for Cache Models 1-4, marking local cache hits, snoop cache hits, victim cache hits and main memory accesses.]
• Various search algorithms were studied for the implementation of the cache controller.
• Two search algorithms were implemented in SystemVerilog; they are used by the cache models developed.
• The model can be enhanced by incorporating more search algorithms; users may plug in their own search algorithm.
• Different replacement policies can also be used for the cache controller, and the cache architecture itself can be of different types, such as direct-mapped or set-associative.
Conclusion and Scope for Future Work
References
1) John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach.
2) Changkyu Kim, Jatin Chhugani, Nadathur Satish, Eric Sedlar, Anthony D. Nguyen, et al., "FAST: Fast Architecture Sensitive Tree Search on Modern CPUs and GPUs", SIGMOD 2010.
3) John H. Shaffer, "Designing Very Large Content-Addressable Memories", University of Pennsylvania.
4) Stephen J. Allan, "Splay Tree".
5) AMBA AXI and ACE Protocol Specification, ARM.
6) SystemVerilog 3.1a Language Reference Manual.
Thank You