Multiprocessor Initialization


An introduction to the use of Interprocessor Interrupts

A traditional MP system

[Diagram: CPU0, CPU1, and Main memory, all attached to a shared system bus]

Dual-Core Technology

[Diagram: a Core 2 Duo processor, in which CPU0 and CPU1 share a level-2 cache, attached together with Main memory to the system bus]

Multi-Core Technology

[Diagram: a Core 2 Quad processor containing CPU0, CPU1, CPU2, and CPU3, attached together with Main memory to the system bus]


Each CPU has its own Local-APIC

[Diagram: one CPU, containing the following parts]
• processor’s application registers: EAX, EBX, …, EIP, EFLAGS
• processor’s system registers: CR0, CR2, CR3, …, IDTR, GDTR, TR
• processor’s Local-APIC registers: Local-ID, IRR, ISR, EOI, LVT0, LVT1, …, ICR, TCFG
• processor’s Execution Engine

The Local-APIC ID register

Bits 31-24: APIC ID
Bits 23-0: reserved

Memory-Mapped Register-Address: 0xFEE00020

This register is initially zero, but its APIC ID field (8 bits) is programmed by the BIOS during system startup with a unique processor identification number, which is subsequently used when specifying the processor as a recipient of inter-processor interrupts.

The Local-APIC EOI register

Bits 31-0: write-only register

Memory-Mapped Register-Address: 0xFEE000B0

This write-only register is used by Interrupt Service Routines to issue an ‘End-Of-Interrupt’ command to the Local-APIC. Any value written to this register will be interpreted by the Local-APIC as an EOI command. The value stored in this register is initially zero (and it will remain unchanged).

The Spurious Interrupt register

Bits 31-9: reserved
Bit 8: EN, Local-APIC is Enabled (1=yes, 0=no)
Bits 7-0: spurious vector

Memory-Mapped Register-Address: 0xFEE000F0

This register is used to enable or disable the functioning of the Local-APIC and, when enabled, to specify the interrupt-vector number to be delivered to the processor in case the Local-APIC generates a ‘spurious’ interrupt. (In some processor models, the vector’s lowest 4 bits are hardwired 1s.)

Interrupt Command Register

• Each processor’s Local-APIC unit has a 64-bit Interrupt Command Register

• It can be programmed by system software to transmit messages to one, or to several, of the other processors in the system

• Each processor has a unique identification number in its APIC Local-ID Register that can be used for directing messages to it

ICR (upper 32-bits)

Bits 31-24: Destination field
Bits 23-0: reserved

Memory-Mapped Register-Address: 0xFEE00310

The Destination Field (8 bits) can be used to specify which processor (or group of processors) will receive the message

ICR (lower 32-bits)

Bits 31-20: reserved
Bits 19-18: Destination Shorthand
    00 = no shorthand
    01 = only to self
    10 = all including self
    11 = all excluding self
Bit 15: Trigger Mode (0 = Edge, 1 = Level)
Bit 14: Level (0 = De-assert, 1 = Assert)
Bit 12: Delivery Status, read-only (0 = Idle, 1 = Pending)
Bit 11: Destination Mode (0 = Physical, 1 = Logical)
Bits 10-8: Delivery Mode
    000 = Fixed
    001 = Lowest Priority
    010 = SMI
    011 = (reserved)
    100 = NMI
    101 = INIT
    110 = Start Up
    111 = (reserved)
Bits 7-0: Vector field

Memory-Mapped Register-Address: 0xFEE00300

MP initialization protocol

• Set a shared processor-counter equal to 1
• Step 1: issue an ‘INIT’ IPI to all-except-self
• Delay for 10 milliseconds
• Step 2: issue ‘Startup’ IPI to all-except-self
• Delay for 200 microseconds
• Step 3: issue ‘Startup’ IPI to all-except-self
• Delay for 200 microseconds
• Check the value of the processor-counter

Issue an ‘INIT’ IPI

	# address Local-APIC via register FS
	mov	$sel_fs, %ax
	mov	%ax, %fs

	# broadcast ‘INIT’ IPI to ‘all-except-self’
	mov	$0x000C4500, %eax
	mov	%eax, %fs:(0xFEE00300)

	# wait until the Delivery Status bit (bit #12) reads ‘Idle’
.B0:	btl	$12, %fs:(0xFEE00300)
	jc	.B0

Issue a ‘Startup’ IPI

	# broadcast ‘Startup’ IPI to all-except-self
	# using vector 0x11 to specify entry-point
	# at real memory-address 0x00011000
	mov	$0x000C4611, %eax
	mov	%eax, %fs:(0xFEE00300)

.B1:	btl	$12, %fs:(0xFEE00300)
	jc	.B1

Timing delays

• Intel’s MP Initialization Protocol specifies the use of some timing-delays:
    – 10 milliseconds ( = 10,000 microseconds)
    – 200 microseconds

• We can use the 8254 Timer’s Channel 2 for implementing these timed delays, by programming it for ‘one-shot’ countdown mode, then polling bit #5 at i/o port 0x61

Mathematical examples

EXAMPLE 1
Delaying for 10 milliseconds means delaying for 1/100-th of a second (because 100 times 10 milliseconds = one-thousand milliseconds)

EXAMPLE 2
Delaying for 200 microseconds means delaying for 1/5000-th of a second (because 5000 times 200 microseconds = one-million microseconds)

GENERAL PRINCIPLE
Delaying for x microseconds means delaying for x/1000000 of a second, i.e., for 1/(1000000/x)-th of a second (because 1000000/x times x microseconds = one-million microseconds)

Mathematical theory

RECALL: Clock-Frequency = 1193182 Hertz (pulses per second)
ALSO: One second equals one-million microseconds

PROBLEM: Given the desired delay-time in microseconds, express the desired delay-time in clock-frequency pulses and program that number into the PIT’s Latch-Register

APPLYING DIMENSIONAL ANALYSIS

Delay-in-Clock-Pulses = Delay-in-Microseconds * Pulses-Per-Microsecond

Pulses-Per-Microsecond = Pulses-Per-Second / Microseconds-Per-Second

CONCLUSION

For a desired time-delay of x microseconds, the number of clock-pulses may be computed as x * (1193182 / 1000000) = (1193182 * x) / 1000000, as dividing by a fraction amounts to multiplying by that fraction’s reciprocal

Delaying for EAX microseconds

# We compute the value for the 8254 Timer’s Channel-2 Latch-register
# Delaying for EAX microseconds means that Latch-register’s value is
# a certain fraction of one full second’s worth of input-pulses:
#   fraction = (EAX microseconds)/(one-million microseconds-per-second)
#
# Thus the latch-value should be: fraction*(1193182 pulses-per-second)
# which we can compute by doing a multiplication followed by a division
#
	mov	%eax, %ecx	# copy the delay to ECX
	mov	$1193182, %eax	# setup input-frequency in EAX
	mul	%ecx		# multiplied by microseconds
	mov	$1000000, %ecx	# setup one-million as a divisor
	div	%ecx		# so quotient will be Latch-value

# Quotient in register AX should be written to the timer’s Latch Register

Intel’s MP terminology

• When an MP system starts up, one of the CPUs will be selected to handle the ‘boot’ procedures, while the other CPUs ‘sleep’

• The BSP is this BootStrap Processor, and every other processor is known as an AP (i.e., a so-called ‘Application Processor’)

[Diagram: one BSP alongside three APs]

‘parallel computing’ principles

• When it’s awakened, each processor will need its own private stack-area, so it can handle any interrupts or procedure-calls without modifying an area in memory which another processor is also using

• And whenever two or more processors do share ‘write-access’ to any memory area, then those accesses must be ‘serialized’

‘atomic’ memory-access

• Shared variables must not be modified by more than one processor at a time (‘atomic’ access)
• The x86 cpu’s ‘lock’ prefix helps enforce this
• Example: every processor adds 1 to a counter

	lock
	incl	(counter)

• Some instructions (such as ‘xchg’ with a memory operand) have ‘atomic’ access built in; others (such as ‘xadd’) still need the ‘lock’ prefix
• Example: all processors need private stacks

	mov	$0x1000, %ax
	lock
	xadd	%ax, (new_SS)
	mov	%ax, %ss

ROM-BIOS isn’t ‘reentrant’

• The video service-functions in ROM-BIOS that are often used to display a message-string at the current cursor-location (and afterward advance the cursor) modify global storage locations (as well as i/o ports), and hence must be called by only one processor at a time

• A shared memory-variable (called ‘mutex’) is used to enforce this mutual exclusion

Implementing a ‘spinlock’

# Here is a ‘global’ variable, which all of the processors can modify
mutex:	.word	1		# initial value for variable is 1

# Here is a ‘prologue’ and ‘epilog’ for using this variable to enforce
# ‘mutually exclusive access’ to a section of ‘non-reentrant’ code

spin:	btw	$0, mutex	# test bit #0 to see if mutex is free
	jnc	spin		# spin if the mutex is not available

	lock			# else request exclusive bus-access
	btrw	$0, mutex	# and try to grab mutex ownership
	jnc	spin		# unsuccessful? then try again

	< CRITICAL SECTION OF ‘NON-REENTRANT’ CODE >

	btsw	$0, mutex	# release the mutex when finished

Demo: ‘mphello.s’

• Each CPU needs to access its Local-APIC

• The BSP (“Boot-Strap Processor”) wakes up other processors by broadcasting the ‘INIT-SIPI-SIPI’ message-sequence

• Each AP (“Application Processor”) starts executing at a 4K page-boundary -- and needs its own private stack-area

• Shared variables require ‘atomic’ access

Demo’s organization

MAIN:	# the BSP will execute these calls
	call	allow_4GB_access
	call	display_APIC_LocalID
	call	broadcast_AP_starup
	call	delay_until_APs_halt

initAP:	# each AP will execute these calls
	call	allow_4GB_access
	call	display_APIC_LocalID

In-class exercise #1

• Add a call to this procedure by each of the processors, but do it without using a ‘lock’ prefix (and outside mutex-protected code)

• Then let the BSP print the value of ‘total’

total:	.word	0		# include this ‘shared’ global-variable

add_one_thousand:		# let each processor call this subroutine
	mov	$1000, %cx
nxadd:	addw	$1, total
	loop	nxadd
	ret

Binary-to-Decimal

• Recall the algorithm for converting numbers to decimal digit-strings (for console display)

num2dec:	# converts value in register AX to a decimal string at DS:DI
	mov	$10, %bx	# setup the number-base in BX
	xor	%cx, %cx	# setup remainder-count in CX

nxdiv:	xor	%dx, %dx	# extend AX to a doubleword
	div	%bx		# divide the doubleword by ten
	push	%dx		# save remainder on the stack
	inc	%cx		# and count this remainder
	or	%ax, %ax	# was the quotient zero yet?
	jnz	nxdiv		# no, generate another digit

nxdgt:	pop	%dx		# recover saved remainder
	add	$'0', %dl	# convert remainder to ASCII
	mov	%dl, (%di)	# store numeral in output-buffer
	inc	%di		# and advance buffer-pointer
	loop	nxdgt		# again for other remainders

In-class exercise #2

• Using a Core-2 Quad processor we might expect the value of ‘total’ would be 4000

• But see if that’s what actually happens!

• Without the ‘lock’ prefix, the four CPUs may all try to increment ‘total’ at once, resulting in a logically incorrect total

• So fix this problem (by using a ‘lock’ prefix ahead of the ‘addw $1, total’ instruction)

Do you need a ‘barrier’?

• You can use a software construct, known as a ‘barrier’, to stop CPUs from entering a block of code until a prescribed number of them are all ready to enter it together (i.e., simultaneously)

• This may be helpful with the in-class exercises

arrived:	.word	0	# allocate a shared global variable

barrier:	lock		# acquire exclusive bus-access
	incw	arrived		# each cpu adds 1 to the variable

await:	cmpw	$4, arrived	# are four cpus ready to proceed?
	jb	await		# no, wait for others to arrive here
	call	add_one_thousand	# then proceed together
