Oct 30, 2014
Whiz, Wizard - a person with extraordinary skill or accomplishment (www.whizchip.com)
Confidential information of WhizChip Design Technologies (www.whizchip.com). Contains WhizChip or its customer's proprietary and business-sensitive data.
Tanmay Rao M
Reg No - 101002014
MS VLSI CAD [email protected]
Design And Evaluation Of Cache Performance Evaluation System
(To Implement Search Algorithms and Evaluate Simulation Models)
Contents
• Introduction
• Terms Related to Cache
• Cache States
• Project Specifications
• Need For Search Algorithm
• Proposed Algorithm 1: BST
• Proposed Algorithm 2: Splay Tree
• Class Architecture
• Conclusion and Scope for Future Work
• References
Introduction
• Cache plays a vital role in improving performance by providing data to the requesting master in a SoC within a very few clock cycles. Cache reduces the need for frequent accesses to the main memory, which typically take 50 to 100 clock cycles.
• The importance of cache in a typical SoC containing several masters is determined by adding caches at different levels.
• Four simulation models are designed to determine the importance of cache in a SoC. The local caching of data introduces the cache coherence problem.
• The cache coherence problem is solved by implementing a cache coherency protocol.
• Search algorithms are used to implement the cache controllers. Two search algorithms are implemented and their performance evaluated.
Terms Related to Cache
• Cache Entries
• Cache Performance
• Replacement Policies
• Locality
• Write Policies
• Master Stalls
• Flag Bits
• Cache Miss
• Cache Hierarchy
• Victim Cache
Cache States
• Valid, Invalid: When valid, the cache line is present in the cache; when invalid, the cache line is not present in the cache.
• Unique, Shared: When unique, the cache line exists in only one cache; when shared, the cache line exists in more than one cache.
• Clean, Dirty: When clean, the cache line is unchanged, so there is no need to update the main memory when the line is replaced. When dirty, the cache line has been changed, and the main memory must be updated when the line is replaced.
[State diagram: Valid lines are Unique or Shared, and Clean or Dirty; the five states are:]
• UD - Unique Dirty
• SD - Shared Dirty
• UC - Unique Clean
• SC - Shared Clean
• I - Invalid
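The five states above can be encoded with the three flag bits (valid, unique, dirty). A minimal Python sketch of this encoding follows; it is illustrative only (the project itself is written in SystemVerilog, and the names here are assumptions, not project code):

```python
# Map each cache-line state to its (valid, unique, dirty) flag bits.
# State names follow the slide: UD, SD, UC, SC, I.
CACHE_STATES = {
    "UD": {"valid": True,  "unique": True,  "dirty": True},
    "SD": {"valid": True,  "unique": False, "dirty": True},
    "UC": {"valid": True,  "unique": True,  "dirty": False},
    "SC": {"valid": True,  "unique": False, "dirty": False},
    "I":  {"valid": False, "unique": False, "dirty": False},
}

def needs_writeback(state):
    """A line must be written back to main memory only if it is valid and dirty."""
    flags = CACHE_STATES[state]
    return flags["valid"] and flags["dirty"]
```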
• Main Memory Controller
• This controller accepts addresses on the shared bus.
• It also has a delay to mimic the real latency seen in a typical main memory access.
• Typical delays are in the range of 50-100 clock cycles.
• L1 Cache Controller
• The cache controller serves the master's read and write requests.
• On a cache miss, it fetches the data from the main memory or from another snooped cache.
• An additional responsibility of the cache controller is supporting snooping, which can conflict with its normal operation.
Project Specifications
• L2 Cache Controller
• When a dirty cache line from any of the L1 caches is to be replaced, it is moved to the L2 cache.
• This saves the clock cycles needed to write it back to the main memory.
• Search Algorithm
• Two search algorithms have been implemented in this project.
• Cache Simulation Models
• Four simulation models were developed and evaluated using test cases.
Project Specifications
Model 1: Four masters (M1-M4), each with a private L1 cache, connected to the main memory through a snoop channel and a main memory channel.
Model 2: Model 1 plus a shared L2 cache on a dedicated L2 cache channel, in addition to the snoop and main memory channels.
Model 3: Four masters (M1-M4) with private L1 caches and a main memory channel, but no snoop channel.
Model 4: Four masters (M1-M4) accessing the main memory directly over the main memory channel, with no caches.
• Constraints
• The whole model is written in SystemVerilog.
• The memory leaks associated with C++ do not exist in SystemVerilog.
• Assumptions
• The delay associated with the main memory is modeled as 100 cycles; it can be varied by the user.
• The replacement algorithm for the cache is not a standard policy.
• The read and write channels of the main memory are separate.
• No particular snooping protocol is used.
Constraints and Assumptions
• The basic requirement is a search for the requested address in the cache.
• Many algorithms have been devised for efficient search.
• Along with search, we need to add and delete addresses.
• We also need to update and replace cache lines with new lines.
• The algorithm should add and delete addresses in a manner that does not affect the search drastically.
• We need a suitable data structure that can store the addresses effectively.
• The memory footprint of the data structure should also be optimal, so that it does not consume excessive memory resources.
Need For Search Algorithm
• Hash Coding
• Hash coding is a process in which a search key is transformed, through a hash function, into an actual address for the associated data.
• A very simple hash function is the modulus function.
• Pseudo CAMs
• Since fully associative memories are difficult and expensive to build relative to normal main memory, a method of building a large random-access memory with associative access would be advantageous.
• The pseudo CAM uses a multiple-memory-bank architecture in which a key is hashed to an address that is valid in every bank.
• Pre-Computation Technique
• Here extra information is stored along with the tag; it is derived from the stored bits.
• For an input tag we first compute the number of ones/zeros and compare it with the stored value.
Background Study
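As a small illustration of hash coding, a modulus hash with chained buckets can be sketched as below (Python, purely illustrative; names and structure are assumptions, not the project's code):

```python
def hash_index(key, table_size):
    # Modulus hash: transforms the search key into a table index.
    return key % table_size

# Toy hash table of (key, data) buckets, using chaining for collisions.
table_size = 8
table = [[] for _ in range(table_size)]

def insert(key, data):
    table[hash_index(key, table_size)].append((key, data))

def lookup(key):
    # Scan only the one bucket the key hashes to.
    for k, d in table[hash_index(key, table_size)]:
        if k == key:
            return d
    return None
```

Note that 0x40 and 0x48 hash to the same bucket here (both are 0 mod 8), which is exactly the collision case that chaining, or the pseudo-CAM's multiple banks, must handle.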
• The data structure used for storing the addresses is a binary search tree.
• The properties of a binary search tree are:
• The left subtree of a node contains only values less than the node's value.
• The right subtree of a node contains only values greater than the node's value.
• Both the left and right subtrees must also be binary search trees.
• The number of elements the data structure can hold depends on the number of levels of the binary search tree; as the number of levels increases, so does the number of elements. A binary search tree with n levels holds up to 2^n − 1 elements.
Proposed Algorithm 1: BST
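The BST properties above can be expressed as a short check. The sketch below (Python, illustrative only; the project implements this in SystemVerilog) builds a node type and verifies the ordering property:

```python
class Node:
    """One node of the binary search tree; key is the cache-line address."""
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def is_bst(node, lo=float("-inf"), hi=float("inf")):
    """Check the BST property: every key in the left subtree is smaller than
    the node's key, every key in the right subtree is larger, recursively."""
    if node is None:
        return True
    if not (lo < node.key < hi):
        return False
    return is_bst(node.left, lo, node.key) and is_bst(node.right, node.key, hi)
```

A full tree with 3 levels holds 2^3 − 1 = 7 elements, matching the count given above.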
Search Operation
• Compare the requested address with the root node; if they are equal, the search is successful.
• If the address is less than the current node, move to the left child; if it is greater, move to the right child.
• Repeat the comparison at each node; the search succeeds when a node holding the address is found and fails when an empty child is reached.
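The search flow above is the classic iterative BST search; a minimal Python sketch (illustrative, not the project's SystemVerilog code):

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def search(node, addr):
    """Iterative BST search: returns the node holding addr, or None on a miss."""
    while node is not None:
        if addr == node.key:
            return node                      # search successful
        node = node.left if addr < node.key else node.right
    return None                              # reached an empty child: miss
```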
Add Operation
• If the root is empty, add the address as the root node.
• If the address is less than the current node: add it as the left child if that child is empty; otherwise descend to the left child.
• If the address is greater than the current node: add it as the right child if that child is empty; otherwise descend to the right child.
• Repeat until the address is added to the tree.
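The add flow maps directly onto an iterative insert; a Python sketch under the same illustrative assumptions as above:

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def add(root, addr):
    """Insert addr following the flowchart: walk down until an empty child is found."""
    if root is None:
        return Node(addr)                  # tree was empty: new node is the root
    cur = root
    while True:
        if addr < cur.key:
            if cur.left is None:
                cur.left = Node(addr)      # empty left child: add here
                return root
            cur = cur.left
        elif addr > cur.key:
            if cur.right is None:
                cur.right = Node(addr)     # empty right child: add here
                return root
            cur = cur.right
        else:
            return root                    # address already present: nothing to add
```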
Delete Operation
• If the address to be deleted is a leaf node, delete the leaf node.
• If the address to be deleted is the root node or an intermediate node, find the in-order successor of that node.
• Replace the node with its in-order successor, then delete the duplicate entry.
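The in-order-successor delete can be sketched as follows (Python, illustrative only); the successor's key is copied up and the duplicate entry is then deleted from the right subtree, as in the flow above:

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def delete(node, addr):
    """Delete addr from the BST, replacing internal nodes with the in-order successor."""
    if node is None:
        return None
    if addr < node.key:
        node.left = delete(node.left, addr)
    elif addr > node.key:
        node.right = delete(node.right, addr)
    else:
        if node.left is None:              # leaf or single-child node
            return node.right
        if node.right is None:
            return node.left
        succ = node.right                  # in-order successor: leftmost of right subtree
        while succ.left is not None:
            succ = succ.left
        node.key = succ.key                # copy the successor's key up...
        node.right = delete(node.right, succ.key)  # ...then delete the duplicate entry
    return node
```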
Update Operation
• If the old cache line's data differs from the new cache line's data, update the data in the cache line.
• If the old cache line's dirty flag differs from the new cache line's, update the dirty bit in the cache line.
• If the old cache line's shared flag differs from the new cache line's, update the shared bit in the cache line.
Replace Operation
• If the cache line is shared, it is replaced with the new cache line.
• Otherwise, if the cache line is clean, it is replaced with the new cache line.
• Otherwise, the last added node is replaced with the new cache line.
• A modified version of a binary search tree called the splay tree is used to implement the data structure.
• A splay tree is a self-balancing binary search tree.
• A balanced binary search tree has uniform height in both subtrees.
• Along with the self-balancing property, the splay tree has the additional property that whenever a new address is added, it is brought to the root node.
• This process of bringing the added address to the root is called splaying.
• So in a splay tree the time required to access the most recently used addresses is very low, as they are near the root.
Proposed Algorithm 2: Splay Tree
• The three types of splay steps (where x is the node being splayed, p its parent and g its grandparent) are:
• Zig Step
• This step is done when p is the root.
• The tree is rotated on the edge between x and p.
• Zig-Zig Step
• This step is done when p is not the root and x and p are either both right children or both left children.
• The tree is rotated on the edge joining p with its parent g, then rotated on the edge joining x with p.
• Zig-Zag Step
• This step is done when p is not the root and x is a right child and p is a left child, or vice versa.
• The tree is rotated on the edge between x and p, then rotated on the edge between x and its new parent g.
Splaying
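The zig, zig-zig and zig-zag steps can be sketched with two rotation helpers and a compact recursive splay (Python, illustrative only; the project's SystemVerilog implementation may differ in structure):

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def rotate_right(p):
    """Rotate on the edge between p and its left child x."""
    x = p.left
    p.left, x.right = x.right, p
    return x

def rotate_left(p):
    """Rotate on the edge between p and its right child x."""
    x = p.right
    p.right, x.left = x.left, p
    return x

def splay(root, key):
    """Bring key (or the last node on its search path) to the root
    via zig, zig-zig and zig-zag steps (simplified recursive form)."""
    if root is None or root.key == key:
        return root
    if key < root.key:
        if root.left is None:
            return root
        if key < root.left.key:                          # zig-zig (left-left)
            root.left.left = splay(root.left.left, key)
            root = rotate_right(root)
        elif key > root.left.key:                        # zig-zag (left-right)
            root.left.right = splay(root.left.right, key)
            if root.left.right is not None:
                root.left = rotate_left(root.left)
        return root if root.left is None else rotate_right(root)   # final zig
    else:
        if root.right is None:
            return root
        if key > root.right.key:                         # zig-zig (right-right)
            root.right.right = splay(root.right.right, key)
            root = rotate_left(root)
        elif key < root.right.key:                       # zig-zag (right-left)
            root.right.left = splay(root.right.left, key)
            if root.right.left is not None:
                root.right = rotate_right(root.right)
        return root if root.right is None else rotate_left(root)   # final zig
```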
[Diagrams: zig and zig-zig/zig-zag splay rotations on node X with parent P and grandparent G, with subtrees A, B, C and D, before and after splaying.]
• Search is the most important operation in a binary search tree.
• The given address is compared with the root p, its left child and its right child.
• The next stage of comparisons has four possibilities:
• If the address is less than p and less than the left child, then the address, if it exists, is in subtree A.
• If the address is less than p and greater than the left child, then the address, if it exists, is in subtree B.
• If the address is greater than p and less than the right child, then the address, if it exists, is in subtree C.
• If the address is greater than p and greater than the right child, then the address, if it exists, is in subtree D.
Search Operation
[Diagram: root P with left child LC and right child RC; subtrees A and B hang below LC, subtrees C and D below RC.]
Add Operation
• Start with the current node set to the root node.
• If addr > current node and addr > right node: if the right child of the right node is empty, add the address there; otherwise make the right child the current node and repeat.
• If addr > current node and addr < right node: if the left child of the right node is empty, add the address there; otherwise make the right child the current node and repeat.
• If addr < current node and addr > left node: if the right child of the left node is empty, add the address there; otherwise make the left child the current node and repeat.
• If addr < current node and addr < left node: if the left child of the left node is empty, add the address there; otherwise make the left child the current node and repeat.
• After the address is added, splay the tree if required.
Delete Operation
• If the address to be deleted is a leaf node, delete the leaf node.
• If the address to be deleted is the root node or an intermediate node, find the in-order successor of that node, replace the node with it, and delete the duplicate entry.
• Splay the tree if required.
Cache Line
• Key - Stores the address, which is matched in the search operation; it is the unique part of the cache line that distinguishes it from other cache lines.
• Data - Stores the data associated with the address; it may be consistent with the main memory or may be provided by the master.
• Shared - Flag bit. When set, it indicates that the cache line is shared among other masters.
• Dirty - Flag bit. When set, it indicates that the data in the cache line is dirty; when the cache line is evicted, the data must be written back to the main memory.
• Invalid - Flag bit. When set, it indicates that the cache line does not hold valid data.
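The fields above can be sketched as a record type (Python dataclass, illustrative only; the project models the cache line in SystemVerilog and the defaults here are assumptions):

```python
from dataclasses import dataclass

@dataclass
class CacheLine:
    """One cache line, mirroring the field table above (illustrative sketch)."""
    key: int               # address; the unique part matched during search
    data: int = 0          # data associated with the address
    shared: bool = False   # set when the line exists in more than one cache
    dirty: bool = False    # set when the data differs from main memory
    invalid: bool = True   # set when the line does not hold valid data
```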
• The add task is used to add the cache line to the data structure.
• The delete operation for the binary search tree is also implemented.
• Temporary nodes are used to find the in-order successor or the in-order predecessor.
• In this implementation, the in-order successor method is used.
• The update operation is common to both algorithms.
• The splay task is implemented only for the second algorithm.
• In the implementation, the data structure is splayed when the 3rd, 5th, 7th, 9th and 15th elements are added.
Data Structure Class
[Diagrams: Splay For 3 Element Tree (keys 8, 16, 24) and Splay For 5 Element Tree (keys 8, 16, 24, 32, 40), showing each tree before and after the most recently added key is brought to the root.]
• Binary Search Tree
• The algorithm class takes the data structure object as a parameter.
• This means that even if the data structure changes, the algorithm need not change.
• Splay Tree
• In the splay tree implementation the whole data structure is divided into eight binary trees.
• The hash function selects the bank where each address is stored.
• The pipelined scheme saves cycles compared with a non-pipelined algorithm, in which the add and delete logic would sit idle; here search, add and delete all work in parallel, saving clock cycles.
Algorithm Class
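The bank-selection idea can be sketched as below (Python, illustrative; a dict stands in for each of the eight trees, and the modulus hash is an assumption, since the slides do not name the hash function):

```python
NUM_BANKS = 8

def select_bank(address):
    # The hash function selects one of the eight banks (binary trees).
    return address % NUM_BANKS

# Each dict stands in for one per-bank search tree in this sketch.
banks = [dict() for _ in range(NUM_BANKS)]

def add(address, data):
    banks[select_bank(address)][address] = data

def search(address):
    # Only the selected bank needs to be searched, so the banks
    # can be operated on in parallel, as in the pipelined scheme.
    return banks[select_bank(address)].get(address)
```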
• Another important component in the model is the Main Memory Controller.
• The read task accepts an address and returns the data with a data-valid signal after 20 cycles.
• Similarly, the write task accepts the address and the data to be written into the main memory.
• Another task initializes the locations of the main memory for simulation purposes.
Main Memory Class
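A behavioural sketch of such a memory model follows (Python, illustrative only; the initialisation pattern and method names are assumptions, not the project's SystemVerilog tasks):

```python
class MainMemory:
    """Behavioural main-memory model: reads return data after a fixed delay."""
    def __init__(self, delay=20):
        self.delay = delay          # cycles before the data-valid signal asserts
        self.mem = {}

    def init_locations(self, n):
        # Initialise n locations for simulation; the data pattern is illustrative.
        for addr in range(n):
            self.mem[addr] = addr * 2

    def write(self, addr, data):
        self.mem[addr] = data

    def read(self, addr):
        # Returns the data together with the cycle count consumed by the access.
        return self.mem.get(addr, 0), self.delay
```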
Test Cases for Evaluation of Model 1

Description                              | Avg Cycles (BST) | Avg Cycles (Splay Tree) | Local Cache Hits (4 Masters) | Snoop Cache Hits (4 Masters) | Main Memory Accesses (4 Masters)
Local Hit Rate 0.25                      | 15.11 | 15.15 | 32  | 0   | 96
Local Hit Rate 0.5                       | 10.32 | 10.25 | 64  | 0   | 64
Local Hit Rate 0.75                      | 5.66  | 5.35  | 96  | 0   | 32
Local Hit Rate 1                         | 1.24  | 0.65  | 128 | 0   | 0
Local Hit Rate 0.25, Snoop Hit Rate 0.25 | 10.97 | 10.58 | 32  | 32  | 64
Local Hit Rate 0.5, Snoop Hit Rate 0.25  | 6.19  | 5.67  | 64  | 32  | 32
Local Hit Rate 0.5, Snoop Hit Rate 0.5   | 2.5   | 1.09  | 64  | 64  | 0
Local Hit Rate 0.25, Snoop Hit Rate 0.75 | 2.64  | 1.61  | 32  | 96  | 0
Local Hit Rate 0.75, Snoop Hit Rate 0.25 | 1.23  | 0.99  | 96  | 32  | 0
Snoop Hit Rate 1                         | 4.31  | 2.24  | 0   | 128 | 0
Local Hit Rate 0.25, Snoop Hit Rate 0.5  | 6.02  | 6.13  | 32  | 64  | 32
Test Cases for Evaluation of Model 2

Description                                                    | Avg Cycles (BST) | Avg Cycles (Splay Tree) | Local Cache Hits | Snoop Cache Hits | Victim Cache Hits | Main Memory Accesses (all per 4 Masters)
Local Hit Rate 0.25                                            | 15.11 | 15.15 | 32  | 0  | 0  | 96
Local Hit Rate 0.5                                             | 10.32 | 10.25 | 64  | 0  | 0  | 64
Local Hit Rate 0.75                                            | 5.66  | 5.35  | 96  | 0  | 0  | 32
Local Hit Rate 1                                               | 1.24  | 0.65  | 128 | 0  | 0  | 0
Local Hit Rate 0.25, Victim Hit Rate 0.25                      | 10.80 | 11.13 | 32  | 0  | 32 | 64
Local Hit Rate 0.25, Snoop Hit Rate 0.25, Victim Hit Rate 0.25 | 6.67  | 6.66  | 32  | 32 | 32 | 32
Local Hit Rate 0.25, Snoop Hit Rate 0.25                       | 11.09 | 10.66 | 32  | 32 | 0  | 64
Local Hit Rate 0.5, Snoop Hit Rate 0.25                        | 6.29  | 5.75  | 64  | 32 | 0  | 32
Local Hit Rate 0.5, Snoop Hit Rate 0.5                         | 2.65  | 1.16  | 64  | 64 | 0  | 0
Test Cases for Evaluation of Model 2 (continued)

Description                               | Avg Cycles (BST) | Avg Cycles (Splay Tree) | Local Cache Hits | Snoop Cache Hits | Victim Cache Hits | Main Memory Accesses (all per 4 Masters)
Local Hit Rate 0.25, Snoop Hit Rate 0.75  | 3.08  | 1.63  | 32 | 96  | 0  | 0
Local Hit Rate 0.75, Snoop Hit Rate 0.25  | 1.41  | 1.03  | 96 | 32  | 0  | 0
Snoop Hit Rate 1                          | 5.14  | 2.29  | 0  | 128 | 0  | 0
Local Hit Rate 0.25, Snoop Hit Rate 0.5   | 6.22  | 6.15  | 32 | 64  | 0  | 32
Snoop Hit Rate 0.25, Victim Hit Rate 0.25 | 11.40 | 11.77 | 0  | 32  | 32 | 64
Victim Hit Rate 0.25                      | 15.70 | 16.01 | 0  | 0   | 32 | 96
Test Cases for Evaluation of Model 3

Description          | Avg Cycles (BST) | Avg Cycles (Splay Tree) | Local Cache Hits (4 Masters) | Main Memory Accesses (4 Masters)
Local Hit Rate 0.25  | 15.11 | 15.15 | 32  | 96
Local Hit Rate 0.5   | 10.32 | 10.25 | 64  | 64
Local Hit Rate 0.75  | 5.66  | 5.35  | 96  | 32
Local Hit Rate 1     | 1.24  | 0.65  | 128 | 0
Test Cases for Evaluation of Model 4

Description          | Avg Cycles (BST) | Avg Cycles (Splay Tree) | Main Memory Accesses (4 Masters)
Local Hit Rate 0.25  | 20 | 20 | 96
Local Hit Rate 0.5   | 20 | 20 | 64
Local Hit Rate 0.75  | 20 | 20 | 32
Local Hit Rate 1     | 20 | 20 | 0
Graph for 4 Models
[Graph: clock cycles (y-axis) against input address (x-axis) for Cache Models 1-4, marking local cache hits, snoop cache hits, victim cache hits and main memory accesses.]
• Various search algorithms were studied for the implementation of the cache controller.
• Two search algorithms were implemented in SystemVerilog; they are used by the cache models developed.
• The model can be enhanced by incorporating more search algorithms; users may plug in their own search algorithm.
• Different replacement policies can also be used for the cache controller, and the cache architecture itself can be of different types, such as direct-mapped or set-associative.
Conclusion and Scope for Future Work
References
1) John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach.
2) Changkyu Kim, Jatin Chhugani, Nadathur Satish, Eric Sedlar, Anthony D. Nguyen, et al., "FAST: Fast Architecture Sensitive Tree Search on Modern CPUs and GPUs", SIGMOD 2010.
3) John H. Shaffer, "Designing Very Large Content-Addressable Memories", University of Pennsylvania.
4) Stephen J. Allan, "Splay Tree".
5) AMBA AXI and ACE Protocol Specification, ARM.
6) SystemVerilog 3.1a Language Reference Manual.
Thank You