Bus / Crossbar Switch
AMANO, Hideharu
hunga@am.ics.keio.ac.jp
Today, I am talking about the basic knowledge on buses and crossbar switches
used in UMA or multi-core.
1
CPU
L1 Cache
L2 Cache
L3 Cache
SRAM
Main memory
DRAM
~64KB 1-2clock
~256KB 3-10clock
2M~4MB 10-20clock
4~16GB 50-100clock
Memory Hierarchy
Locality is used.
Small high speed
Large low speed
Secondary Memory
μ-msec
TB
On-Chip cache
Transparent from Software
Managed by
Operating
System
First of all, let’s review on the memory hierarch. It makes the use of special
and temporal locality of memory access. That is, a high speed but small
memory is located near the CPU. L1 or level 1 cache is directly connected to
the CPU, and L2 cache is located on the same chip. L3 cache is an on-board
cache but recently it is also embedded in the same chip. These cache memory
modules are all implemented with SRAM or static ram. The main memory is
implemented with the dynamic RAM. From the main memory, all memory
modules are transparent, that is, they have to be controlled by the hardware.
The secondary memory is managed by the operating system. It has been built
with magnetic disk, but recently, flush memory has taken place rapidly.
2
Uni-processor structure
CPU
North
Bridge
South
Bridge
Graphics DRAM
USB
Ether
Legacy I/O
PCI/PCIexpress
Memory Controller HUB
I/O Controller HUB
L1
L2
L3
The memory hierarchy is implemented with two types of bus or switch: North
bridge and south bridge. They are Intel’s terminorogy. North bridge is a switch
for high speed devices such as DRAMs and graphics processing unit. On the
other hand south bridge is a bridge to I/O devices, USB, Ethernet and other
legacy I/Os. PCIe bus is also connected.
3
Sharing cache introduces congestion
PE
North
Bridge
South
Bridge
Graphics DRAM
USB
Ether
Legacy I/O
PCI/PCIexpress
Memory Controller HUB
I/O Controller HUB
L1
L2
L3
PEPE PE
OK. What happens when a multicore is used instead of the uniprocessor? The
simplest idea is to share the L1 cache. But, it is impossible because of the
severe access contention at the L1 cache, since the L1 cache is very tightly
connected to the processing element.
4
The typical multicore structure
PE
North
Bridge
South
Bridge
Graphics DRAM
USB
Ether
Legacy I/O
PCI/PCIexpress
Memory Controller HUB
I/O Controller HUB
L1
L3
PE PE PE
L1
L2
L1 L1
Snoop CacheBus
Crossbar
So, how about providing the private cache as the L1 cache and sharing the L2
cache. It is possible just if the cache coherence problem is solved. I will talk
about this problem in the next lesson, and skip it today. Anyway, note that the
bus is used to connect cores and the L2 cache. Of course, this bus is not a
traditional passive bus.
5
Implementation of buses
Multiplexer
Passive Bus:Board levelimplementation
Active Bus:Chip level implementation
A single module sends data to all other modules
In old days, the bus is implemented with just wires pulled up registers for
termination. This style of bus is still used for the back-plane bus of desktop
computers. Various types of components can be connected but the operational
speed tends to be slow.
Instead, buses in the chip is implemented with active gates. For example, a
multiplexer is used in this case, it selects one of inputs data and distributes to
all components. The definition of the bus is that anyone of the connected
modules can send its data to all other members. This is sometimes called a
multi-drop bus. In computers, only multi-drop buses are used.
6
Requirements
◼ High Performance
❑ Bandwidth(Throughput)
❑ Latency
◼ Flexibility(Universality)
❑ The number of modules
❑ Clock frequency
❑ Electrical characteristics
Dedicated Bus
Standard Bus
There are inconsistent requirements to bus. First is high performance. The
bandwidth or throughput indicates the total amount of data to be transferred in
a cycle. The latency means the time from the sender starting data to the
receiver receiving them. The high speed bus is needed to have a large
bandwidth and short latency.
On the other hand, flexibility or universality is also required. That is, a bus is
needed to connect a certain number of modules which use various clock
frequency and electrical characteristics. Since it is difficult to satisfy both, we
provide two types of buses: dedicated bus and standard bus in a system.
7
System bus vs. I/O bus
System Bus
(Dedicated)
I/O Bus
(Standard)
Most of multicore system uses a dedicated system bus to connect cache system
and high speed accelerators. Also it provides standard I/O bus through the
bridge chip. North-and-South bridge structure which was introduced before is
one of this example.
8
Synchronous vs. Asynchronous
◼ Synchronous bus
❑ Data is sent synchronized with a clock
◼ Easy to handshake, block (continuous) data transfer
◼ Module numbers/types are limited
❑ PCI、Mbus、PCIx、PCI express, On chip buses
❑ Performance centric
◼ Asynchronous bus
❑ Data is sent without a system clock
◼ Variable modules can be connected
❑ VME、Futurebus+
Recently, asynchronous buses are not commonly used
Buses are sometimes classified into two categories: synchronous or
asynchronous. In synchronous buses, data are sent synchronized with a clock.
This bus can support a high speed block data transfer but modules numbers
and/or types are limited. They must follow the fixed bus clock. On the
contrary, in asynchronous buses, there is no system clock, and so every data
transfer requires a handshake. However, this classification is rather old-
fashioned. Recent buses are all synchronous bus, and asynchronous buses
seem to be extencted.
9
Terms around bus
◼ Transaction: A continuous data transfer of
address and data
◼ Arbitration:An operation for taking a right to
control the bus
◼ Bus Master: a module which had a right of
controlling the bus through the arbitration
◼ Bus Slave:modules except the bus master
Now, let me explain some terms around bus.
10
A sequence of data transfer with the bus
◼ Get the mastership with the arbitration
◼ Bus Transaction
❑ Address transfer
❑ Data transfer (repeated if necessary)
❑ End of transaction
◼ Release the mastership
Arbiter hardware
Handshake
When you want to use a bus, first, you have to get the mastership with the
arbitration. Then, you start the bus transaction. Usually, address and data are
multiplexed, so first, address is transferred. At that time, handshake is needed
between the master and slaves. Then, data transfer starts. Multiple data are
often transferred iteratively synchronized with a clock in the synchronous bus.
At the end of transaction, sometimes, data for error detecting or correction
code are transferred. Then, the master releases the mastership, and the next
transaction starts by another master.
11
Arbiter
..
Priority Encoder
Arbiter
Centralized
H
Distributed bus
Daisy Chain
Distributed
Centralized arbiter is used inside the chip
Arbiter is classified into centralized and distributed. The simplest centralized
arbiter is a priority encoder. It selects a winner from multiple competitors. For
multicores, since every module is embedded into a chip, the centralized arbiter
is used.
12
From
CMOS VLSI Design
by Weste and Harris
Centralized
Arbiter
=
Priority
Encoder
Tree
There are various types of priority encoders. They somehow, resemble to
adders. The simplest one uses the ripple carry structure. Like the adder, it has
the problem of long delay time. So, look ahead, incremented, and Skransky
methods have been proposed to reduce the delay. Of course, they need extra
hardware.
13
Daisy Chain
If no request EI→EO
The request can be issued only if EI is H level
When the request is issued, EO becomes L level
EI EO EI EO EI EO EI EO EI EOH H H H H
Request
L L
RequestX
Right side module has a low priority
Request
Left side module has a high priority
L L
X
Daisy chain is the simplest arbiter. The modules are connected tandemly with
the signals EI and EO. The rule is simple, if there is no requests, the module
transfers the level of EI to EO. The request can be issued only if EI is High
level, and negates EO. All the following modules’ EI turns Low, thus they
cannot issue the request. Obviously, the leftmost module as the highest
priority. Since this method is easy to implement, it is often used for an arbiter
Direct Memory Access or DMA requests.
14
Open Drain bus
‘H’
‘H’
O
F
F
O
F
F
O
F
F
O
F
F
If all inputs are ‘H’, the bus becomes ‘H’.
‘H’ ‘H’ ‘H’
‘H’
‘H’
O
F
F
O
F
F
O
F
F
O
F
F
If at least an input becomes ‘L’,
the bus becomes ‘L’ .
‘H’ ‘H’ ‘H’
O
N
‘L’
‘L’‘L’
O
N
If multiple inputs become ‘L’
it still remains ‘L’,
Wired-OR(AND Tie)
The traditional backplane bus uses the open drain gate. In such a gate, the
drain of output transistor is opened and connected together.
The high level is supplied through the termination resisters. In this case, if all
transistors are off, the bus wire becomes high because it is pulled up by
resisters.
If at least an input becomes low, the transistor turns on, and the level becomes
low. The case of two or more transistors are ON, the same thing happens.
Since at least an input becomes low, the wire is low. Thus, this mechanism is
sometimes called Wired-OR. Some people call it AND Tie.
15
Distributed bus arbiter
0 1 2 3 4 5 6 7
001
011
110
0 1 2 3 4 5 6 7
001
011
110
Check from the upper line.
If the value on the line is
not equal to its output
number, then stop the
output.
Open Drain:0 overtakes 1
Output its own
number
By using open drain buses, a log2N distributed bus arbiter can be built. We
need log2N buses for N modules. Here, there are 8 modules, so three wires are
needed. Each module sends its own binary number to the bus. Here, assume
that module 1, 3, and 6 issue the request.
Then, the bus is checked from the highest digit. If the level is not the same as
its own level, it removes its request from all wires. In this case, the module six
withdraws since it receives 0 instead of its issued 1. Then, the second wire is
checked. This time 3 withdraws. Finally the module 1 wins.
16
Modified method(Keio’s patent)
0 1 2 3 4 5 6 7
001
011
110
Output its own
number
XXXX
0 1 2 3 4 5 6 7
011
110
Parallel check is
possible
XXXX
001
Set cut-points on the bus
The problem of the distributed arbiter is it takes Log2N clocks to fix the result.
Keio university invented the improved the method. The key of magic is to cut
the wire and divides it into 4 segments and 2 segments like this.
Then the same algorithm is applied, but this time the winner can be selected
just with a clock cycle, since 0 is not propagated beyond the cut. I think it is a
good idea, and Keio University got a patent. Unfortunately, no one was
interested in this idea and this patent was expired without getting any money.
17
Starvation Problem
◼ If the priority of the arbiter is fixed, a weak
module cannot use the bus continuously.
◼ Central arbiter
→ Round robin priority scheduling
◼ Distributed arbiter
→ The next request cannot be issued until all
requesting modules satisfy their requests.
If the priority level of the arbiter is fixed, the starvation problem may happen.
That is, a weak module cannot use the bus continuously. For the central arbiter,
the round robin priority scheduling can be implemented. That is the priority
level is shifted between modules. For the distributed arbiter, it is hard to be
implemented. In this case pheudo round robin method is used.
18
Round Robin
Priority 000 001 010 011 100 101 110 111
000001 010 011 100 101 110 111
000 001010 011 100 101 110 111
000 001 010011 100 101 110 111
This diagram shows the round robin priority scheduling. After an arbitration is
finished, the priority shifts so that the fairness of the bus is kept.
19
Practical Starvation Avoidance
Priority 000 001 010 011 100 101 110 111
Assume that 0 is the strongest.
Blocked
Blocked
Blocked
Blocked
All Blocked modules are released
Since the round robin scheduling takes cost, a simpler policy is often used. The
rule is simple, a module who won the arbitration cannot issue the next request
until there is no requesting module. This diagram shows an example.
20
Overlap between the arbitration and data
transfer
Arbitration
Data transfer
n
n-1
bus master for
n-th transaction
n
n+1
n+1
bus master for
n+1-th transaction
n+2
n+1
bus master for
n+2-th transaction
n+3
So, the arbitration time is not critical in most cases.
The bus arbitration is usually done overlapped with bus transaction itself.
Thus, during a transaction, the next bus master is selected, and the time of bus
arbitration is hidden. This is why the speed of arbiter is sometimes not treated
as a critical matter.
21
glossary-1
◼ Arbiter 調停回路◼ Arbitration 調停操作、バスマスタを選ぶ◼ Bus master バスマスタ、バスの利用権を管理するモジュール◼ Bus slave バススレーブ、バスの利用権を持たないモジュール(マスタからスレーブに常にデータを転送するわけではないので注意!)
◼ Centralized 集中型 ⇔ Distributed 分散型◼ Daisy Chain Arbiterの一方法で、ヒナゲシの花輪から来ている◼ Transaction バス上でデータを転送するための一連の操作◼ Open drain オープンドレイン、バスの作り方の一つで、出力トランジスタをオープンにして抵抗につなぐ。全てがOFFのときのみHレベルになり、どれか一つでもONになるとLレベルになる。この操作をワイヤードORと呼ぶ。
◼ Starvation 飢餓状態、バスの利用権を獲得できない状態が長期間続くこと
◼ Round-robin ラウンドロビン、優先順位をArbitration毎に隣りのモジュールに移動していく方法
22
Handshake for data transfer
2-edge
2-line (Strobe + 1 Acknowledge)
4-edge
Only for a single slave
2-edge
3-line (Strobe + 2-Acknowledge)
4-edge
For multiple slaves
Now, let me explain about the handshake mechanism. For a bus with a single
slave, 2-line handshake is enough. If it transfers data with four edges, it is
called 4-edge handshake, and with 2 edges, it is called 2-edge handshake. For
multiple slaves we need three lines.
23
2-line 4-edge handshake
Strobe
Address/
Data
Acknowledge
This shows a waveform of the 2-line 4-edge handshake. Here is an example of
transferring the data from the master to a slave. The master set the address or
data on the bus and assert the strobe. The slave checks the strobe and receives
the content on the bus at the negative edge of the strobe signal. After receiving
the slave changes the acknowledge low to high. The master removes the
content and changes strobe to high. After checking it, the slave returns the
acknowledge signal and goes to the next step. This method uses four edges to
send one item on the bus.
24
2-line 2-edge handshake
Strobe
Address/
Data
Acknowledge
Data ttem is transferred with both edges of the strobe
4 edge handshake is time consuming. So, we can send the next data at the
rising edge of the strobe. This is called 2 edge handshake. Address or data are
transferred at the both edge.
25
In the case of multiple slaves
Strobe
Address/
Data
Module 1Acknowledge
Module 2Acknowledge
L because 2 is L
Acknowledge
Bus (Wired-OR)
Two line handshake does not work well when the number of slaves become
two or more. In this case, when both slaves receive the data, they turn
acknowledge line high. Since open drain bus is used for acknowledge line, it
turns high when the slower module changes the level. This diagram seems
well.
26
Quiz
◼ 3-line handshake (1 for strove and 2 for
acknowledge) is used for multiple slaves.
◼ Why 2-line handshake cannot manage
multiple slaves?
Let’s think well. Why 2-line handshake cannot manage multiple slaves and
another line is needed?
27
2-line cannot manage multiple slaves
Strobe
Address/
Data
Module 1Acknowledge
Module 2 (SLOW!)
Acknowledge
2 is still L
Then, go to next transfer!
OK!
Slow module
Cannot receive
Acknowledge
Bus (Wired-OR=AND)
Negative edge cannot be used for synchronization
This diagram illustrates the reason. Assume that the module 2 is extremely
slow. Even the case, the acknowledge for receiving data will transferred well
because of the wired or bus. When the slower module turns the level high, the
bus level becomes high. However, when master wants to go to next step and
turns the strobe high. In this case, the acknowledge signal becomes low when
the faster module turns the level low. So, the master cannot recognize whether
the slower module is ready or not. In this case, the master can go the next bus
transaction and misunderstand the acknowledge of the slower module.
28
3-line handshake
Strobe
Address/
Data
Acknowledge 1
Acknowledge 2
3-line 2-edge handshake is also possible
OK!Next transfer
OK!
Positive edges of two acknowledge lines are used in turn
This problem can be solved by providing another acknowledge signals and
using them in turn. In this case, only the rising edge can be used for the
handshake, and it certainly becomes high when the slower module turns to
high. This diagram shows 3-line 4-edge handshake, but 3-line 2-edge
handshake is possible.
29
Handshake in the chip
StrobeMaster
slave 1
slave 2
slave n…
slave 1
slave 2
slave n…
AND
AND
Master
Ack1
Ack2
Of course, wired-or wire is not used.
The concept itself is not changed.
Inside the chip, of course, the wired-or wire is not used, but the concept itself
is not changed. Two acknowledge signals are used to inform master whether
slaves are ready or not.
30
Synchronous bus is suitable for block
transfer
Strobe
Address/
Data
Acknowledge
Clock
The start/end handshake is the same, but block
transfer is possible synchronized with a clock
Usually, the handshake is required when bus transaction is started. But, the
data transfer can be done synchronized with the bus clock, and continuous data
transfer can be done. This is why the synchronous bus is suitable for block or
burst data transfer. After that, the handshake is taken.
31
Non-Split Transaction
Address
Memory reading
Data
transfer
Module A
Module B
Bus utilization is degraded
On the basic bus, the bus is locked during the waiting time. Assume that the
module A is a core and module B is a memory. When a core sends an address
with reading request, the memory starts reading the data. After the memory is
ready, the data block is transferred.
32
Split Transaction
データ転送B→A
Address
Module A
Module B Module D
Address
Module C
Split transaction of A→B
Transaction C→D is executed
C→D
For efficient use of such a waiting time, the split transaction is used. After
sending the address and reading request, the module A releases the bus. So, the
module C can use the bus for its purpose. After the waiting time, the split
transaction from B to A starts again. After that, the data transfer from C to D is
executed.
33
Advanced I/O Buses
◼ PCI bus was widely used, but it could not cope with
recent computer system.
❑ 32bit/33MHz, 64bit/66MHz
◼ New standard I/O bus
❑ PCI-X
◼ 64bit/133MHz DDR/QDR
❑ PCI Express
◼ Point-to-point serial data transfer
◼ 1 lane:2.5Gbps
◼ x2, x4, x8
❑ Now, PCI Express is used instead of PCI bus.
As the bus for Personal Computers, PCI bus was widely used until about
1990’s. It was 32bit bus with 33MHz clock and then extended to 64bit with 66
MHz. But, it could not cope with the performance improvement of personal
computers. Then, the PCI-X, a straight extension was shortly, but they were
replaced with a new concept PCI express. It has been widely used.
34
PCI Express
◼ Consisting of serial one-to-one bidirectional connection
wires called lanes.
◼ Each lane supports 2.5Gbps/5Gbps (Physical Speed)
◼ Multiple lanes can be used as a link(x4, x8, x16 and
x32).
◼ The data is transferred in a packet called TLP
(Transaction Layer Packet).
◼ Interconnection network rather than the bus, but the
protocol of traditional PCI bus is supported.
port
Physical
layer
port
Physical
layer
lane
link
PCI Express is actually not a bus in traditional sense, but a serial high speed
one-to-one bidirectional connection. There are several bi-directional lanes and
form a link. The data is transferred in a packet called TLP or Transaction Layer
Packet. It is a packet switching network, but the protocol of old PCI bus is
supported.
35
PCIe standardGen1 Gen2 Gen3
Physical speed
(Gbps)2.5 5 8
Bandwidth
(GB/sec)
0.25 0.5 1.0
x8 bandwidth
(GB/sec)
2.0 4.0 7.9
Encoding 8b/10b 8b/10b 128b/130b
Physical speed is x1.6, but almost twice practical performance is realized by
changing the encoding method.
OK. Let’s review PCIe standard. The physical link speed of Gen3 is 8Gbps,
the 1.6 times as that of Gen2. But, by using 128b/130b coding instead of
8b/10b used in Gen2, it achieves almost twice bandwidth that of Gen3.
36
An example of bus system using PCI
expressCPU
Root Complex Memory
System bus
Memory bus
Graphics
Switch Switch
End
point
End
point
End
pointPCI Bridge
PCI bus
PCI
Express
PCI express is used with a tree like structure. It is connected with CPU through
the root complex. It is actually a switch which has multiple links to other
switches. I/O modules are connected to end-point.
37
On-chip bus
◼ For on-chip implementation, various types of IP (Intellectual Property) must be connected.
◼ Standard bus is required.❑ AMBA (Advanced Microcontroller Bus Architecture): a bus
for ARM cores.
❑ CoreConnect: a bus for PowerPC cores.
❑ Wrapper based buses
◼ IPs are wrapped in the standard interface.
◼ For further performance improvement, NoCs (Network on Chips) are introduced.
→ Introduced in the later part of this lecture
There are various types of on-chip bus, but recently network-on-chips or NoCs
are becoming used popularly.
38
NEC MP211
ARM926
PE0
ARM926
PE1
ARM926
PE2
SPX-K602
DSP
DMAC USB
OTG
3D
Acc.
Rot-
ater.Image
Acc.
Cam
DTV
I/F.
LCD
I/F
Async
Bridge0
Async
Bridge1
APB
Bridge0
IIC UART
TIM1
TIM2
TIM3
WDT
Mem. card
PCM
APB
Bridge1Bus Interface
Scheduler
SDRAM
Controller
SRAM
Interface
On-chip
SRAM
(640KB)
PLL OSC
Inst.
RAM
PMU
INTC TIM0GPIO SIO
Sec.
Acc.
SMU uWIRE
CameraLCD
FLASH DDR SDRAM
Multi-Layer AHB
An example of on-chip bus
39
Summary of Bus
◼ Classic bus with passive wires has been
changed to active bus with a kind of switches
◼ High Speed Bus
❑ Synchronous bus with Split Transaction
❑ Using active devices
❑ It becomes somehow like a packet transfer with
switching hub.
OK. This is a summary of buses.
40
glossary 2
◼ Handshake 握手のことだがここでは正しく転送するための信号のやりとりを指す
◼ Synchronous 同期式⇔Asynchronous 非同期式
◼ Strobe 転送を起動を知らせる信号線
◼ Acknowledge Strobeに対する応答用の信号線
◼ Edge 信号線の変化
◼ Split transaction バス転送を中断して途中に他の転送を挟むことを可能にする方法
41
Crossbar switch
n
m
Cross point: small
switching element
The number of
cross points:
nxm
Extension of the buses
Today’s next target is a crossbar switch. Some of you may feel strange,
because the crossbar is a typical switch. However, the crossbar switch is
actually an extension of multiple buses. Here we assume n cores and m
memory. Providing small switching elements at the cross points of their
individual buses, the crossbar can be built. That is, it requires nxm cross point
switches.
42
Non-blocking property
n
m
For different
destination,
conflict free
By make the best use of cross point switches dedicated lines can be formed for
the different destination. That is it is conflict free or non-blocking.
43
Head Of Line (HOL) conflict
n
m
X
Arbiter is required for each bus
The buffer is required
The number of cross
point is not dominant.
However, of course, if the destination module is the same the multiple requests
conflict with each other. In this case, like common buses, the arbiters are used.
Some people call this conflict, head of line or HOL conflict.
44
Input buffer switch
Crossbar
Input buffer
One of conflicting packets is selected.
Others are stored Into the input buffer
When conflict occurs, one of the requests is usually stored in the input buffer
or FIFO attached to the crossbar. After another request passing the crossbar,
the waiting request is transferred. The similar mechanism can be provided for
the bus. Parallel machines commonly use this style so called input buffer
switch.
45
Output buffer switch
Crossbar
Crossbar must work with ×n frequency of input/output rate.
No HOL problem.
Used in switches in WAN, but for parallel machines it is difficult.
Output buffer
works with ×n freq.
Some people claim that there is output buffer switch. But, for parallel machine,
it is rarely used, because for solving the HOL conflict, the multiple data must
be transferred to the output buffer. It means that the output buffer must work
with n times clock frequency. This situation is almost impossible for multicore
machines.
46
Buffers at cross-point
n
m
The buffer is provided
at each cross-point.
High performance but
the total amount of
buffer becomes large.
Another idea is providing buffers at cross point. It achieves high performance
but the total amount of buffer becomes huge.
47
48
An example of a modern router◼ WH router with two virtual channels
5x5 XBAR
ARBITER
FIFO
FIFO
FIFO
FIFO
FIFOX+
X-
Y+
Y-
CORE
X+
X-
Y+
Y-
CORE
(Introduced later in this lecture)
This is a practical design example of a modern router using the crossbar. I will
explain this structure later in this lesson.
Merit/demerit of Crossbars
◼ Non-blocking property
◼ Simple structure/Control
◼ The hardware for cross-points usually do not
limit the system (Fallacy of crossbars)
◼ Extension is difficult by the pin-limitation of
LSIs
❑ If pins can be used, a large crossbar can be
constructed → Earth simulator
Here, the pros and cons of crossbars are summarized.
49
SUN T1
Core
Core
Core
Core
Core
Core
Core
Core
Crossbar
Switch
FPU
L2
Cache
bankDirectory
L2
Cache
bankDirectory
L2
Cache
bankDirectory
L2
Cache
bankDirectory
Single issue six-stage pipeline
RISC with 16KB Instruction cache/
8KB Data cache for L1Total 3MB, 64byte Interleaved
Memory
This is an example of Sun microsystem’s earlier multi core processor.
50
The earth simulator
Ve
cto
r P
ro
ce
sso
r
Ve
cto
r P
ro
ce
sso
r
…
Ve
cto
r P
ro
ce
sso
r
0 1 7
Shared Memory
16GB
Ve
cto
r P
ro
ce
sso
r
Ve
cto
r P
ro
ce
sso
r
…
Ve
cto
r P
ro
ce
sso
r0 1 7
Shared Memory
16GB
Ve
cto
r P
ro
ce
sso
r
Ve
cto
r P
ro
ce
sso
r
…
Ve
cto
r P
ro
ce
sso
r
0 1 7
Shared Memory
16GB
….
Interconnection Network (16GB/s x 2)
Node 0 Node 1 Node 639
Peak performance
40TFLOPS
Japanese supercomputer “the earth simulator” used a huge crossbar. It
connected 639 modules.
51
glossary 3
◼ Crossbar switch: クロスバスイッチ、ここでは主としてスイッチ本体を指すが、バッファも入れて考える場合もある
◼ Router:パケットを転送するためのハードウェア全体を指す
◼ WH, Virtual Channel:この授業のもっとあとで紹介するのでここでは深く追求しないでよい
◼ Non-blocking, blocking:出力ポートが重ならなければ、衝突が起きないのがノンブロッキング、出力ポートが重ならなくてもスイッチ内部で衝突するのがブロッキング
◼ HOL conflict:出線競合、出力ポートが重なることで起きる衝突
52
Homework 2
◼ Your computer uses PCIe gen2 x 8.
1. How much maximum bandwidth can be used ?
2. You want to improve the bandwidth.
2-1. When you use PCIe gen2 x 16, how much
maximum bandwidth can be used?
2-2. You changed the bus to PCIe gen3 x 8, how
much maximum bandwidth can be used?
Just a simple calculation. You will spend only about
3 minutes.
OK. This homework is too simple. Maybe you will spend 3 minutes.
53