
1

Computer Architectures M

Core

2

CMP Chip Multi Processor

In this context I/O indicates any communication with the external world (real I/O, memory, external caches). Shared cache indicates L2 or L3. Very often the L2 is integrated in the processor too.

3

Advantages

• Minimum latency time for data transfer

• No bus use for interprocessor communication

• Possible dynamic cache allocation between the processors

Disadvantages

• Complexity. The controller must evaluate in real time the needs of the two CPUs, and an error can block one of the CPUs.

• The cache bandwidth must be much higher in order to serve two CPUs

• If the cache is multiported, complexity increases further; if accesses are only queued, efficiency is reduced

• This design does not cater for scaling (the cache cannot be divided)

4

Advantages

• Reduced handling complexity

• Easy scaling (i.e. one CPU only)

• No bus involvement

Disadvantages

• No dynamic balancing

• Requires a careful I/O controller design

• Performance reduced because of the traffic between the two CPUs (which affects the I/O too)

5

Advantages

• It is a dual CPU and therefore an easier design

• Easy scaling with one CPU only

• Reduced test complexity (one CPU at a time can be tested)

• Shorter «time to market»

Disadvantages

• Very important: CPU communication affects the bus

• Double electrical (capacitive) load on the bus => slower behaviour

Shared package

7

Enhanced SpeedStep Technology

• It allows the operating voltage to be reduced together with the clock frequency

Pentium 1.6 GHz:

Voltage    Clock
1.484 V    1.6 GHz
1.42 V     1.4 GHz
1.276 V    1.2 GHz
1.164 V    1 GHz
1.036 V    800 MHz
0.956 V    600 MHz

• The functional units have separate power lines, so that each unit can be selectively switched off
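To see why lowering voltage and frequency together pays off, the operating points listed above can be fed into the usual first-order model of dynamic power, P proportional to C·V²·f. The sketch below is illustrative only: the model, the constant capacitance and the function name `relative_dynamic_power` are assumptions and are not taken from the slides; leakage power is ignored.

```python
# Operating points listed above: (voltage in volts, clock in GHz).
OPERATING_POINTS = [
    (1.484, 1.6),
    (1.420, 1.4),
    (1.276, 1.2),
    (1.164, 1.0),
    (1.036, 0.8),
    (0.956, 0.6),
]


def relative_dynamic_power(voltage, freq_ghz, ref_voltage, ref_freq_ghz):
    """Dynamic power relative to a reference point, assuming P ~ C * V^2 * f.

    The switched capacitance C is assumed constant, so it cancels out.
    Static (leakage) power is not modelled in this sketch.
    """
    return (voltage ** 2 * freq_ghz) / (ref_voltage ** 2 * ref_freq_ghz)


ref_v, ref_f = OPERATING_POINTS[0]
for v, f in OPERATING_POINTS:
    rel = relative_dynamic_power(v, f, ref_v, ref_f)
    print(f"{v:.3f} V  {f:.1f} GHz  -> {rel:6.1%} of the full-speed dynamic power")
```

Under these assumptions the lowest operating point runs at 37.5% of the top clock but needs only about 16% of the dynamic power, because the voltage enters the model squared.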

10

64 bit extensions

• 64-bit MMX registers

[Figure: layout of the RAX register. RAX = bits 63-0, EAX = bits 31-0, AX = bits 15-0, AH = bits 15-8, AL = bits 7-0]

• It allows the execution of 64-bit operating systems and programs

• Addressing space up to 16 exabytes (2**64 bytes = 2**32 x 2**32 bytes = 4G x 4 GB).

• 8 additional 64-bit general-purpose registers (R8-R15)

• All other general-purpose registers extended to 64 bits
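The address-space arithmetic and the register aliasing can be checked with a few lines of Python. This is only a sketch: the helper name `sub_registers` is hypothetical, and the masks simply encode the standard bit positions of EAX, AX, AH and AL inside RAX.

```python
ADDRESS_BITS = 64
EXBIBYTE = 2 ** 60  # one binary exabyte

address_space = 2 ** ADDRESS_BITS
assert address_space == (2 ** 32) * (2 ** 32)   # 4G x 4 GB
print(address_space // EXBIBYTE, "EiB")         # prints: 16 EiB


def sub_registers(rax):
    """Return the values seen through the legacy names that alias RAX."""
    return {
        "RAX": rax & 0xFFFFFFFFFFFFFFFF,  # bits 63-0
        "EAX": rax & 0xFFFFFFFF,          # bits 31-0
        "AX": rax & 0xFFFF,               # bits 15-0
        "AH": (rax >> 8) & 0xFF,          # bits 15-8
        "AL": rax & 0xFF,                 # bits 7-0
    }


for name, value in sub_registers(0x1122334455667788).items():
    print(f"{name:>3} = {value:#x}")
```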

11

Core

• While the Pentium 4 (PIV) tried to increase efficiency by raising the clock frequency, Core relies on multiprocessing, made possible by the reduced transistor size. The smaller transistors also allow a larger cache.

• 2005-2007

• New architecture named Core, designed for multicore

• Low power consumption

• 14-stage pipeline

• Developed in Israel

• Multicore with out-of-order execution

12

[Figure: two cores (Core 0 and Core 1), each with its own pipeline and L1 cache, a shared L2 cache and the FSB interface]

There were many different Core versions:

1) Merom (mobile, low power)

2) Conroe (the first implemented, desktop)

3) Bloomfield (end of 2008, server, quad-core)

Core

NB: in this figure the prefetcher includes the L1 cache

13

Core

14

• The two cores L1 can exchange information directly without using the bus

Core

• 1 + 3 decoders (7 u-ops per clock), larger ROB, increased number of EUs

• 2.66 GHz

• Smart power reduction. For instance, not only are the unused EUs powered down, but the internal bus paths are also activated only when necessary for each instruction

• Each core has two L1 caches (data and instructions): instructions => 32 (or 64) KB, 8-way; data => 32 (or 64) KB, 2/8-way. No trace cache (inefficient!)

• L2 Cache: 2-4 MB unified

15

• Core has no multithreading, which is reintroduced in the following processor generation

Core Microarchitecture

• Dual core, 4-way superscalar, 36-bit physical addresses

• Shared, inclusive L2, unified for data and instructions. Each core uses the portion it needs. If the two cores use the same instructions, these can be shared

16

[Figure: independent L2 caches vs. shared L2 cache]

Advanced Smart Cache

Non-shared L2 has the following disadvantages:

• Possible replication of the same data in the two caches

• Snooping through the FSB

• Static partitioning of the silicon

Shared L2 advantages:

• None of the disadvantages listed for the non-shared L2
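The cost of static partitioning can be illustrated with a toy simulation. The sketch below is not a model of the real Advanced Smart Cache: the LRU policy, the 8-line capacity, the access trace and all names (`LRUCache`, `trace`, `run_split`, `run_shared`) are assumptions chosen only to make the effect visible. One core cycles over 6 cache lines, the other over 2.

```python
from collections import OrderedDict


class LRUCache:
    """Fully associative cache with LRU replacement (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()
        self.misses = 0

    def access(self, tag):
        if tag in self.lines:
            self.lines.move_to_end(tag)      # hit: mark as most recently used
            return
        self.misses += 1
        self.lines[tag] = True
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)   # evict the least recently used line


def trace(rounds=10):
    """Core 0 cycles over 6 lines, core 1 over 2 lines (hypothetical workload)."""
    for _ in range(rounds):
        for i in range(6):
            yield 0, ("core0", i)
            yield 1, ("core1", i % 2)


def run_split(total_lines=8):
    """Two independent caches, each statically given half of the capacity."""
    caches = [LRUCache(total_lines // 2), LRUCache(total_lines // 2)]
    for core, tag in trace():
        caches[core].access(tag)
    return sum(cache.misses for cache in caches)


def run_shared(total_lines=8):
    """One shared cache: each core takes the portion it needs."""
    cache = LRUCache(total_lines)
    for _core, tag in trace():
        cache.access(tag)
    return cache.misses


print("independent caches, misses:", run_split())
print("shared cache, misses:      ", run_shared())
```

With this trace the split configuration thrashes on core 0 while half of core 1's cache stays unused, whereas the shared cache only takes the initial cold misses. Data replication and FSB snooping are not modelled here.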

17

• The prefetch algorithm (undisclosed) considers the access sequences in order to predict the next requests and to anticipate them

Core Microarchitecture

• Intelligent prefetcher: 2x16=32 byte buffer (as in P6). The system tries to guess the required data. For instance, when the data at addresses 1, 3 and 5 are requested, the system reads in advance the data at address 7 (if the bus is available)

• More precisely, the dual-core chip has 2x3+2=8 prefetchers (for each core two for the data and one for the instructions, plus two prefetchers for the shared L2). The prefetch policies differ between models according to the intended use (mobile, server, desktop)
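Since the real prefetch algorithm is not public, the following is only a minimal sketch of the simplest policy consistent with the 1-3-5 -> 7 example: a constant-stride predictor. The function name and the three-access rule are assumptions.

```python
def predict_next_address(history):
    """Return the predicted next address, or None if no pattern is seen.

    Assumption: a prediction is made only when the last three accesses
    are separated by the same non-zero stride.
    """
    if len(history) < 3:
        return None
    a, b, c = history[-3:]
    stride = b - a
    if stride != 0 and c - b == stride:
        return c + stride
    return None


print(predict_next_address([1, 3, 5]))   # prints: 7
print(predict_next_address([1, 3, 6]))   # prints: None (no constant stride)
```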

18

Loop detector

• Exploits hardware loop detection. The loop detector analyzes the branches and determines whether the code being executed is a loop

– Avoids the repetitive fetch and branch prediction

– … but still requires decoding on each cycle

[Figure: Branch Prediction, Fetch, Loop Stream Detector (18 instructions), Decode]
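A minimal sketch of the decision the loop detector has to take, assuming that a loop is recognised as a taken backward branch and that the 18-instruction figure is the capacity of the loop buffer. Instruction indices are used instead of byte addresses, and the function name is hypothetical.

```python
LSD_CAPACITY = 18  # instructions, as indicated in the figure


def fits_loop_buffer(branch_index, target_index, capacity=LSD_CAPACITY):
    """True if a taken branch closes a loop that is small enough to be streamed.

    branch_index: position of the branch instruction.
    target_index: position of the instruction the branch jumps to.
    """
    is_backward = target_index <= branch_index
    body_length = branch_index - target_index + 1   # the branch itself included
    return is_backward and body_length <= capacity


print(fits_loop_buffer(branch_index=17, target_index=0))   # 18 instructions: True
print(fits_loop_buffer(branch_index=18, target_index=0))   # 19 instructions: False
```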

19

Core Pipeline

• Note that the number of «in flight» instructions is further increased by fusion (see later). The effective Core window is therefore larger than the increase in RS and ROB entries alone would suggest

• The pipeline has 14 stages (P6: 12 stages). The two additional stages were inserted for delays and fusion handling

• The ROB stores 96 u-ops (Xeon: 126, because of multithreading)

• Unified RS (memory and non-memory u-ops, no distinction between the FUs) with more entries for better exploitation of the FUs

20

Core Architecture

6 Ports

Data restructured for the vector ALUs. Operations on 128 bit data split into two 64 bit operations

21

6 ports

Higher efficiency ALUs

Core Microarchitecture

Decoders: 4 + 3 u-ops = 7 u-ops/clock

22

Macrofusion

The sequence

    load EAX, [mem1]
    cmp EAX, [mem2]
    jne Target

becomes

    load EAX, [mem1]
    cmp EAX, [mem2] + jne Target    (test and branch)

• In the main decoder, pairs of machine instructions can be fused (typically compare and test instructions are fused with the following branch instruction). The only limit is that only one “macrofused” instruction can be generated per cycle

• This requires a more complex decoder, ALU and Branch EU, but grants a reduced number of «in flight» u-ops, faster emptying of the ROB and RS, and an apparently higher efficiency of the ALUs. This means a lower power consumption for the same program.
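The fusion rule can be sketched as a pass over a decoded instruction list. This is an illustration, not the real decoder logic: the tuple representation, the opcode sets and the function name `macrofuse` are assumptions, and the limit of one fused pair per cycle is not modelled.

```python
COMPARE_OPS = {"cmp", "test"}
CONDITIONAL_JUMPS = {"je", "jne", "jl", "jle", "jg", "jge"}  # illustrative subset


def macrofuse(instructions):
    """Fuse each compare/test with the conditional jump that directly follows it."""
    fused = []
    i = 0
    while i < len(instructions):
        current = instructions[i]
        following = instructions[i + 1] if i + 1 < len(instructions) else None
        if (following is not None
                and current[0] in COMPARE_OPS
                and following[0] in CONDITIONAL_JUMPS):
            fused.append(("fused", current, following))  # one entry for the pair
            i += 2
        else:
            fused.append(current)
            i += 1
    return fused


program = [
    ("load", "EAX", "[mem1]"),
    ("cmp", "EAX", "[mem2]"),
    ("jne", "Target"),
]
result = macrofuse(program)
print(len(program), "->", len(result))   # prints: 3 -> 2
```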

23

Microfusion

• Two distinct u-ops are allocated in the same bit string

• When a microfused instruction reaches the RS, its u-ops are sent separately to the respective FUs, either in parallel or serially when they require the same FU (e.g. the case of LOAD and STORE)

• STORE operations are normally subdivided into two u-ops: one for the data and one for the address (two separate FUs). The data are sent to the store buffer while the address is calculated: when ready, it is retired by the store buffer.

• The same applies to the LOAD or READ-MODIFY: in this case the two operations are executed serially.

• The number of u-ops is reduced on average by 10%. The efficiency increase is 5% for integer operations and 10% for FP operations.
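The bookkeeping effect of microfusion can be sketched by counting how many entries a short instruction sequence occupies with and without it. The u-op breakdown in `UOPS` and the instruction mix are hypothetical; the reduction obtained depends entirely on the mix and is not meant to reproduce the percentages quoted above.

```python
# u-ops generated per instruction kind (illustrative assumption).
UOPS = {
    "store": ["store_address", "store_data"],
    "load_op": ["load", "alu"],      # e.g. an ALU operation with a memory operand
    "alu": ["alu"],
}


def tracked_entries(instruction_kinds, microfusion):
    """Entries needed to track the instructions before they reach the RS.

    With microfusion the u-ops of one instruction share a single entry;
    without it every u-op needs an entry of its own.
    """
    total = 0
    for kind in instruction_kinds:
        total += 1 if microfusion else len(UOPS[kind])
    return total


mix = ["store", "alu", "load_op", "alu"]
print("without microfusion:", tracked_entries(mix, microfusion=False))
print("with microfusion:   ", tracked_entries(mix, microfusion=True))
```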

24

• Combining macrofusion and microfusion, an average 10% reduction of the u-ops is achieved. Higher use of the FUs. Higher parallelism and a higher number of u-ops among which to choose the OOO sequence.

Core Front End

• Predecode and fusion stage: it detects the instruction lengths and the relative boundaries

• The trace cache is no longer present because of its statistically poor performance. 4 decoders (one complex and three simple, 7 u-ops/clock, one more than P6)

25

Front End

[Figure: P6 front end vs. Core front end]

One more simple decoder: 7 u-ops/cycle

The simple decoders are able to decode a larger number of instructions: almost one u-op per instruction is achieved
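The decoder figures can be checked with a one-line calculation, assuming (as the "4 + 3 u-ops" annotation suggests) that the complex decoder emits up to 4 u-ops per clock and each simple decoder 1. The function name is hypothetical.

```python
def peak_uops_per_clock(complex_decoders, simple_decoders,
                        uops_per_complex=4, uops_per_simple=1):
    """Upper bound on the u-ops produced by the decoders in one clock."""
    return complex_decoders * uops_per_complex + simple_decoders * uops_per_simple


p6 = peak_uops_per_clock(complex_decoders=1, simple_decoders=2)
core = peak_uops_per_clock(complex_decoders=1, simple_decoders=3)
print("P6:  ", p6, "u-ops/clock")     # prints: P6:   6 u-ops/clock
print("Core:", core, "u-ops/clock")   # prints: Core: 7 u-ops/clock
```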

26

Dispatching architecture

27

Core

• Many u-ops require multiple clock cycles to execute, but this does not block the ports. For instance port 1 is free for the IEU once a FADD has started

Core 6 ports

• One more dispatch port is dedicated to the logical and arithmetical u-ops

• Increased number of integer units

• Up to 3 u-ops (ports 0, 1 and 2) can be executed per clock (not counting the Branch Execution Unit and the Memory Address Unit, ports 3, 4 and 5, which don't produce results).

• The system is not symmetrical: FP multiplications can be executed only in one of the FPUs, and the same holds for FADD

28

Mathematical EUs

Floating point execution units

Two units able to execute scalar and FP u-ops. One unit for simple operations (e.g. FADD)

Integer execution units

• Three integer EUs, each one able to execute a 64-bit u-op per clock. One is for complex u-ops (CIU, Complex Integer Unit) and two are for simple u-ops (SIU) like additions. All of them operate in parallel with the branch execution unit

29

Memory access instructions

• Load and Store, when committed, are moved from the ROB to a FIFO called MOB (Memory Reorder Buffer), which in some cases allows Loads to «overtake»

• Load and Store are much more complex than, for instance, an addition: first because they require access to the RF (for the address computation), and second because they must access the data cache. L1 access is much slower than access to the renamed registers, and there is always the risk of an L2 access

30

Memory “disambiguation”

• u-op commits (and therefore memory and register updates and reads) must necessarily be executed “in order”. But…

• There are two cases: the “Store” uses the same address as the “Load” (case A) or not (case B)

• In case A the “Store” must precede the “Load”; in case B it need not

• Case A is the “memory aliasing”

• Statistically case B occurs 97% of the time (it depends on the compiler too!), but in P6 and PIV, because of the case A occurrences (3%), no Load can be executed before the Store: a big performance loss
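The distinction between case A and case B amounts to an address comparison between a load and the older stores that are still pending. The sketch below is a simplification: it generalises "same address" to overlapping byte ranges, which is an assumption, and the names `ranges_overlap` and `load_may_bypass` are hypothetical.

```python
def ranges_overlap(addr_a, size_a, addr_b, size_b):
    """True if the byte ranges [addr, addr + size) share at least one byte."""
    return addr_a < addr_b + size_b and addr_b < addr_a + size_a


def load_may_bypass(load, pending_stores):
    """True in case B: no older pending store aliases with the load.

    load and each pending store are (address, size_in_bytes) pairs.
    """
    load_addr, load_size = load
    return not any(
        ranges_overlap(load_addr, load_size, store_addr, store_size)
        for store_addr, store_size in pending_stores
    )


print(load_may_bypass((0x100, 4), [(0x200, 4)]))   # case B: True
print(load_may_bypass((0x100, 4), [(0x102, 4)]))   # case A (aliasing): False
```

In real hardware the store addresses may not be known yet when the load is ready, which is why a prediction is needed rather than a simple check like this one.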

31

If the processor detects that we are in case B, we need not wait for the memory update related to the Store and we can overlap the operations as in the figure: one clock cycle out of six is saved (more than 16%)

[Figure: timing diagram over clock cycles 1 to 6]

In case A the address is computed at clock 1 and the store is executed at clock 2. Another cycle must elapse for the memory update (clock 3), and then the load can be executed, which requires cycles 4 and 5 for the register update. (It must be remembered that a Load in any case «occupies» the memory location, which cannot be used by a Store at the same time.) Finally, on the 6th clock cycle the sum can be executed.

32

In the case of Core we have the B-2 situation, where the load is anticipated before the store, saving 3 cycles in comparison with A and two cycles in comparison with B. This is possible thanks to an algorithm which analyses the u-ops and predicts memory aliasing. In this case too the prediction can be wrong, and then the pipeline must be flushed, but the percentage advantage is very significant

[Figure: timing diagrams for cases B-1 and B-2 over clock cycles 1 to 6]
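The savings quoted in the text can be reproduced from the cycle counts of the three situations. The count for case A (6 cycles) is stated explicitly; the counts for B (5) and B-2 (3) are derived here from the stated savings and are therefore an interpretation. The cost of a wrong prediction is not modelled.

```python
# Clock cycles needed until the sum can be executed in each situation.
CYCLES = {"A": 6, "B": 5, "B-2": 3}


def saving(baseline, improved):
    """Return (cycles saved, fraction of the baseline saved)."""
    saved = CYCLES[baseline] - CYCLES[improved]
    return saved, saved / CYCLES[baseline]


for baseline, improved in [("A", "B"), ("A", "B-2"), ("B", "B-2")]:
    cycles, fraction = saving(baseline, improved)
    print(f"{improved} vs {baseline}: {cycles} cycle(s) saved ({fraction:.1%})")
```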

