Target Machine Architecture
Group 1, Presented by Greg Klepic

Transcript

Page 1: Target Machine Architecture

Target Machine Architecture
Group 1, Presented by Greg Klepic

Page 2: Sections

Sections
● 5.1 The Memory Hierarchy
● 5.2 Data Representation
● 5.3 Instruction Set Architecture
● 5.4 Architecture and Implementation
● 5.5 Compiling for Modern Processors

Page 3: Memory Hierarchy

Memory Hierarchy

Level                     | Typical Access Time | Typical Capacity
--------------------------|---------------------|-----------------------
Registers                 | 0.2-0.5 ns          | 256-1024 bytes
Primary (L1) cache        | 0.4-1 ns            | 32K-256K bytes
L2 or L3 (on-chip) cache  | 4-30 ns             | 1M-32M bytes
Off-chip cache            | 10-50 ns            | Up to 128M bytes
Main memory               | 50-200 ns           | 256M-16G bytes
Flash                     | 40-400 μs           | 4G-1T bytes
Disk                      | 5-15 ms             | 500G bytes and up
Tape                      | 1-50 s              | Effectively unlimited

Page 4: Memory Hierarchy

Memory Hierarchy
● Moving down the hierarchy increases latency but also increases capacity
● Registers are accessed in one clock cycle; everything else takes more
● Keeping data moving up the hierarchy prevents idle clock cycles
● Use of caches versus main memory and other slower sources is limited by physical size and cost
● Latency is limited by technology and by the bandwidth of buses
● Disk and tape drives are also limited by the mechanical speed of the drive

Page 5: Memory Hierarchy: Registers

Memory Hierarchy: Registers
● x86-64 register diagram
● For a 32-bit application, only the white parts of the diagram are available
● XMM 128-bit registers are used for SSE vector operations

Page 6: Memory Hierarchy: Caches

Memory Hierarchy: Caches
● Cache is typically on the same chip as the CPU
● Much larger but slower to access than registers; much faster to access than main memory
● L1 cache is divided into two parts: one for instructions, the other for data
● L3 cache is shared among cores on a multi-core chip
● Exploit locality to avoid cache misses

Cache hierarchy of the Intel Westmere architecture (4-core processor)

Page 7: Caches: Locality

Caches: Locality
● Caches exploit locality to avoid cache misses
● Spatial locality: the tendency of a program to access data sequentially, e.g. arrays
● Temporal locality: the tendency to reuse certain data, e.g. a local variable in a loop
● Cache miss: if data is not in the cache at the time it is needed, the processor must wait for memory (wasted clock cycles); see the C sketch below
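
To make spatial locality concrete, here is a minimal C sketch (my illustration, not from the slides): both functions compute the same sum, but the row-major loop walks memory sequentially while the column-major loop strides across it, typically causing far more cache misses on large arrays.

```c
#include <stdio.h>

#define N 1024

static double a[N][N];

/* Row-major traversal: consecutive iterations touch adjacent
 * addresses, so each cache line fetched is fully used
 * (good spatial locality). */
double sum_rows(void) {
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* Column-major traversal of the same array: consecutive
 * iterations are N*sizeof(double) bytes apart, so most cache
 * lines are evicted before their remaining bytes are used. */
double sum_cols(void) {
    double sum = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}

int main(void) {
    printf("%f %f\n", sum_rows(), sum_cols());
    return 0;
}
```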

Page 8: Memory Hierarchy: Processor/Memory Gap

Memory Hierarchy: Processor/Memory Gap
● Over time, processor speed has increased at a much greater rate than RAM speed
● If data had to be fetched from RAM frequently, the processor would be idle too often

Page 9: Memory: Alignment

Memory: Alignment
● Operands appear in several sizes, typically 1, 2, 4, and 8 bytes
● Some architectures require n-byte operands to appear at an address divisible by n
● Others, like x86, do not, but run faster if operands are aligned
● Buses are designed such that if operands are not aligned, bits must be shifted, which takes time
● Forcing alignment also allows offsets to be specified in words rather than bytes (see the struct-layout sketch below)
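
One place alignment shows up directly in C is struct layout. This sketch (my illustration, not from the slides) prints the padding the compiler inserts so that each field lands on an address divisible by its size, on typical ABIs:

```c
#include <stdio.h>
#include <stddef.h>

/* The compiler pads this struct so that 'x' sits at an offset
 * divisible by 4 and 'd' at an offset divisible by 8. */
struct sample {
    char   c;   /* 1 byte, typically followed by 3 bytes of padding */
    int    x;   /* 4 bytes, typically at offset 4 */
    double d;   /* 8 bytes, typically at offset 8 */
};

int main(void) {
    printf("offset of c: %zu\n", offsetof(struct sample, c));
    printf("offset of x: %zu\n", offsetof(struct sample, x));
    printf("offset of d: %zu\n", offsetof(struct sample, d));
    printf("total size:  %zu\n", sizeof(struct sample));
    return 0;
}
```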

Page 10: Data Representation

Data Representation
● Operations interpret bits in memory in different ways
● Data formats include instructions, addresses, binary integers, floating-point numbers, and characters
● Integers come in half-word, word, and double-word lengths
● Floating-point numbers come in single- and double-precision lengths

Page 11: Little-Endian vs Big-Endian

Little-Endian vs Big-Endian
● If the least significant byte of a multiword datum sits at the datum's address, the machine is little-endian
● If the most significant byte sits at the datum's address, the machine is big-endian
● Little-endian is tolerant of variations in operand size
● x86 is little-endian; most other common architectures can be configured either way (see the sketch below)
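
A quick way to observe byte order at runtime (a sketch of mine, not from the slides): store a known multi-byte value and inspect its first byte.

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t value = 0x01020304;
    /* View the same 4 bytes through a byte pointer. On a
     * little-endian machine the least significant byte (0x04)
     * is at the lowest address; on a big-endian machine the
     * most significant byte (0x01) is. */
    unsigned char *bytes = (unsigned char *)&value;
    if (bytes[0] == 0x04)
        printf("little-endian\n");
    else
        printf("big-endian\n");
    return 0;
}
```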

Page 12: Operations on Characters

Operations on Characters
● Support varies from architecture to architecture
● Some architectures can perform arithmetic and logical operations on 1-byte quantities
● Most cannot: load/store only
● x86 has instructions for strings of characters, e.g. copying, comparing, and searching
● Vector operations in x86 can also be used on strings

Page 13: Integer Arithmetic: Representation

Integer Arithmetic: Representation
● Two different representations of integers: signed and unsigned
● Also two sets of operators; unsigned arithmetic is used for pointers
● Unsigned integers are commonly written in hexadecimal, preceded by 0x, e.g. 0x400 = 4*16^2 + 0*16^1 + 0*16^0 = 1024 = 0100 0000 0000 in binary
● Signed arithmetic uses two's complement representation
● Examples of 4-bit two's complement integers appear below
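
The slide's table of examples did not survive extraction; the standard 4-bit two's complement values it would have shown are:

0111 =  7    1111 = -1
0010 =  2    1110 = -2
0001 =  1    1001 = -7
0000 =  0    1000 = -8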

Page 14: Signed Integer Arithmetic: Addition

Signed Integer Arithmetic: Addition
● The most significant bit of every negative number is one and of every non-negative number is zero; non-negative two's complement integers are represented the same as unsigned
● The smallest (most negative) number representable in n bits is 1 followed by n-1 zeros, and the normal counting rules apply as magnitude increases
● The addition algorithm is the same as for unsigned integers; no additional logic is needed
● Overflow occurs when a result is too large to fit in a word
● If addition of two non-negative integers gives a 'negative' result (i.e. the most significant bit flips), there is overflow; similarly for two negative integers (see the sketch below)
● Subtraction is adding the additive inverse
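
A minimal C sketch (my illustration) of the slide's sign-bit rule for detecting two's complement overflow:

```c
#include <stdio.h>
#include <stdint.h>

/* Overflow occurred iff both operands have the same sign and the
 * result's sign differs. The addition itself is done on unsigned
 * values, since the bit-level algorithm is identical; the cast
 * back wraps on typical two's complement targets. */
int add_overflows(int32_t a, int32_t b) {
    int32_t sum = (int32_t)((uint32_t)a + (uint32_t)b);
    return ((a >= 0) == (b >= 0)) && ((sum >= 0) != (a >= 0));
}

int main(void) {
    printf("%d\n", add_overflows(INT32_MAX, 1));  /* 1: overflow    */
    printf("%d\n", add_overflows(5, -7));         /* 0: no overflow */
    return 0;
}
```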

Page 15: Floating-Point Representation

Floating-Point Representation
● Prior to 1985, floating point was poorly defined and varied from machine to machine
● IEEE standard 754 defined two sizes for floats: single and double precision
● Single: sign bit, 8-bit exponent, 23-bit significand
  ○ Represents numbers from roughly 10^-38 to 10^38
● Double: 11-bit exponent and 52-bit significand
  ○ Represents numbers from roughly 10^-308 to 10^308
● Notation: where s is the sign bit, sig is the significand, and exp is the exponent,
  ○ (-1)^s * sig * 2^exp is the value of a given float (see the decoding sketch below)
● exp is obtained by subtracting a bias b from the stored exponent field
  ○ b = 127 for single and b = 1023 for double precision
● sig is always of the form 1.X (the leading 1 is implicit), unless the value is very close to zero, in which case the value is 0.X * 2^(min+1)
  ○ min is the smallest allowed exponent
● Special bit patterns represent zero, ∞, -∞, and Not-a-Number values
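
A sketch (my illustration) that pulls the three fields out of a single-precision float and reassembles its value; it handles only normalized numbers, not the special patterns or subnormals:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

int main(void) {
    float f = -6.25f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);     /* reinterpret, not convert */

    uint32_t s    = bits >> 31;           /* 1 sign bit          */
    uint32_t e    = (bits >> 23) & 0xFF;  /* 8 exponent bits     */
    uint32_t frac = bits & 0x7FFFFF;      /* 23 significand bits */

    /* Normalized value: (-1)^s * 1.frac * 2^(e - 127). */
    double sig = 1.0 + frac / (double)(1 << 23);
    double val = (s ? -1.0 : 1.0) * ldexp(sig, (int)e - 127);

    printf("s=%u e=%u frac=0x%06x -> %f\n", s, e, frac, val);
    return 0;
}
```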

Page 16: Instruction Set Architecture (ISA)

Instruction Set Architecture (ISA)
● An ISA comprises the instructions available on a given machine and their encoding in machine language
● Instructions fall into three classes:
  ○ Computation: arithmetic and logical operations, tests, and comparisons on values in registers or memory (with the address held in a register)
  ○ Data movement: loads/stores between memory and registers, or copies from one register or memory location to another
  ○ Control flow: branches, subroutine calls, and returns

Page 17: ISAs

ISAs
● Differing philosophies:
● CISC (Complex Instruction Set Computing): do as much work as possible with each instruction, as in x86
● RISC (Reduced Instruction Set Computing): maximize the number of instructions performed per second, as in ARM, MIPS, etc.
● RISC instructions are more suitable for pipelined architectures
● Most CISC implementations convert instructions to RISC-like operations internally to facilitate pipelining

Page 18: Addressing Modes

Addressing Modes
● RISC systems allow computational instructions only on values held in registers: a register-register architecture
● CISC systems allow computational instructions to access operands directly in memory: a register-memory architecture
● x86 uses 2-address instructions, meaning one operand will be overwritten by the result
● Other architectures allow a third address to specify where to send the result
● Displacement addressing forms an address as a displacement relative to a base register; used by some RISC ISAs
● Indexed addressing forms an address from two registers, e.g. one register holds the address of an array and the second holds an index; ARM uses this (see the sketch below)
● CISC machines have more complex addressing modes
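
To connect these modes to source code (my illustration, not from the slides): a struct field access typically compiles to displacement addressing, while an array element access typically compiles to indexed addressing.

```c
#include <stdio.h>

struct point { int x, y; };

int field(struct point *p) {
    /* Typically displacement addressing: load from [p + offset],
     * where the offset of 'y' within the struct is a constant. */
    return p->y;
}

int element(int *a, long i) {
    /* Typically indexed addressing: load from [a + i*4]
     * (base register plus index register, scaled by the
     * element size). */
    return a[i];
}

int main(void) {
    struct point pt = { 1, 2 };
    int arr[] = { 10, 20, 30 };
    printf("%d %d\n", field(&pt), element(arr, 2));
    return 0;
}
```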

Page 19: Conditions and Branching

Conditions and Branching
● Condition codes in a special processor status register control branch flow
● Operations may change these codes
● A conditional move instruction moves a value only if the codes are set appropriately
● Predication allows any operation to be marked as conditional
● Branchy code can then be made branchless: instructions on the path not taken simply have no effect
● Doing this can save time if both branches are short (see the sketch below)
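
A common branchless pattern in C (my illustration): compilers frequently lower a ternary like the one below to a conditional move rather than a branch, since both operands are cheap to compute.

```c
#include <stdio.h>

/* Branchy version: a mispredicted branch costs pipeline flushes. */
int max_branchy(int a, int b) {
    if (a > b)
        return a;
    return b;
}

/* Version compilers commonly lower to a conditional move (cmov on
 * x86): both 'a' and 'b' are already available, and the condition
 * merely selects which one is kept, so there is nothing to flush. */
int max_cmov(int a, int b) {
    return (a > b) ? a : b;
}

int main(void) {
    printf("%d %d\n", max_branchy(3, 7), max_cmov(3, 7));
    return 0;
}
```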

Page 20: Architecture and Implementation

Architecture and Implementation
● Four architectural breakthroughs led to modern processors:
1. Microprogramming: in the early 1960s IBM developed a way to share code among processor 'families'; previously code was written for a specific machine. This also led to more concise instructions
2. The microprocessor: by the mid-1970s a processor could be implemented on a single chip with local registers; 8-bit registers to start, growing as transistor counts increased
3. RISC: compute many small pipelined operations in parallel, sometimes with more than one pipeline
4. Multicore processors: clock speed hit a limit due to heat, so continued decreases in feature size led instead to the development of multicore processors
● More recently, SoCs (systems on a chip) and HSAs (heterogeneous system architectures)

Page 21: Compiling for Modern Processors

Compiling for Modern Processors
● Main concerns: effective use of the pipeline and of registers
● Reasons a pipeline may stall:
1. Cache misses: instructions or data not ready in the cache
2. Resource hazards: two instructions need the same functional unit
3. Data hazards: an operand is needed but is still in use by another instruction
4. Control hazards: fetching cannot proceed until a branch is resolved
● Branch prediction: predict the branch outcome based on past results and roll back execution if needed, to avoid control hazards
● Instruction scheduling: reorder instructions at compile time to minimize stalling and maximize instruction-level parallelism

Page 22: Branch Prediction

Branch Prediction
● Early RISC processors provided delayed branch instructions
  ○ The instruction following the branch is executed no matter what the outcome of the branch
● This proved impractical due to scheduling conflicts
● Modern solution: a branch predictor guesses the outcome of branches and continues without knowing the result, backtracking as needed (see the sketch below)
● Avoiding cache misses typically falls to the programmer; the compiler assumes cache hits, which is safe for most programs
● Even fast caches have some delay, so instructions can be reordered to hide that delay, as long as the result of the program does not change
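
A classic demonstration of the predictor's past-results heuristic (my illustration, not from the slides): the same branch over the same data is much cheaper when the data is sorted, because the predictor almost always guesses right once the comparison outcomes become a long run of "no" followed by a long run of "yes".

```c
#include <stdio.h>
#include <stdlib.h>

#define N 1000000

/* Sum elements greater than 127. The branch inside the loop is
 * nearly unpredictable on random data but almost perfectly
 * predictable on sorted data, which is the whole difference. */
long sum_big(const int *a, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++)
        if (a[i] > 127)
            sum += a[i];
    return sum;
}

static int cmp(const void *p, const void *q) {
    return *(const int *)p - *(const int *)q;
}

int main(void) {
    static int a[N];
    for (int i = 0; i < N; i++)
        a[i] = rand() % 256;
    long r1 = sum_big(a, N);      /* random order: many mispredictions */
    qsort(a, N, sizeof a[0], cmp);
    long r2 = sum_big(a, N);      /* sorted: predictor nearly perfect  */
    printf("%ld %ld\n", r1, r2);  /* same sums, very different timing  */
    return 0;
}
```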

Page 23: Scheduling Dependences

Scheduling Dependences
● Flow dependence (true, or read-after-write, dependence): a later instruction uses a value produced by an earlier instruction
● Anti-dependence (write-after-read dependence): a later instruction overwrites a value read by an earlier instruction
● Output dependence (write-after-write dependence): a later instruction overwrites a value written by an earlier instruction
● The second and third kinds can frequently be eliminated by the compiler renaming registers (see the sketch below)
  ○ This increases the registers used but also increases instruction-level parallelism
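
A source-level sketch of renaming (my illustration), with each C variable standing in for a register:

```c
#include <stdio.h>

int main(void) {
    int a = 1, c = 3;

    /* Original sequence:
     *   (1) r1 = a + 10;   writes r1
     *   (2) r2 = r1 * 2;   flow dependence on (1): cannot be removed
     *   (3) r1 = c - 1;    anti-dependence: must wait for (2)'s read of r1
     */
    int r1 = a + 10;
    int r2 = r1 * 2;
    r1 = c - 1;

    /* Renamed sequence: (3) writes a fresh register r3 instead, so it
     * no longer depends on (1) or (2) and can execute in parallel with
     * them; only the true dependence (1)->(2) remains. */
    int s1 = a + 10;
    int s2 = s1 * 2;
    int r3 = c - 1;

    printf("%d %d %d %d %d\n", r1, r2, s1, s2, r3);
    return 0;
}
```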

Page 24: Register Allocation

Register Allocation
● A 2-stage process:
1. Identify 'blocks' of code: sequences with no branches in or out
  a. Within each block, assign a 'virtual register' to each value
2. The compiler maps the virtual registers of a subroutine to the architectural registers
  a. It reuses the same register when possible
  b. If there are not enough registers, values spill over into memory
● Instruction scheduling then takes the instructions and may reorder them; this may increase the number of registers needed, but also decreases stalling

Page 25: Subroutine Calls

Subroutine Calls
● A small subroutine may be treated inline with the rest of the code; this increases code length but may help with register allocation
● If inlining is not possible, the subroutine is treated separately, with registers allocated to the new routine
  ○ If a value is used by the program making the call, it must spill over into memory
  ○ The program making the subroutine call must then reread the new values from memory
  ○ Sometimes the compiler must make assumptions; if necessary, the subroutine must write every variable in scope back to memory
● Inlining saves time but can increase cache usage (see the sketch below)
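
A small C illustration (mine, not from the slides): a function this small is a typical inlining candidate, and the `static inline` hint makes the trade-off visible in source form.

```c
#include <stdio.h>

/* A call this small costs more in call/return overhead and register
 * spills than in actual work, so compilers typically inline it: the
 * body is copied to each call site, growing the code but avoiding
 * the spill/reload traffic around a real call. */
static inline int square(int x) {
    return x * x;
}

int main(void) {
    int total = 0;
    for (int i = 0; i < 10; i++)
        total += square(i);   /* likely becomes total += i*i */
    printf("%d\n", total);
    return 0;
}
```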

Page 26: Conclusions

Conclusions
● Hardware changes such as pipelining and RISC instruction sets have increased the complexity of the machine code needed to take full advantage of the hardware
● The processor-memory gap, the relatively rapid increase of processor performance versus memory performance, makes it necessary to exploit the memory hierarchy
● Optimizations take place at the ISA level, the compiler level, and the program level