Target Machine Architecture
Group 1, Presented by Greg Klepic

Transcript

Page 1: Target Machine Architecture

Target Machine Architecture
Group 1, Presented by Greg Klepic

Page 2: Sections

Sections
● 5.1 The Memory Hierarchy
● 5.2 Data Representation
● 5.3 Instruction Set Architecture
● 5.4 Architecture and Implementation
● 5.5 Compiling for Modern Processors

Page 3: Memory Hierarchy

Memory Hierarchy

Level                     | Typical Access Time | Typical Capacity
--------------------------|---------------------|-----------------------
Registers                 | 0.2-0.5 ns          | 256-1024 bytes
Primary (L1) cache        | 0.4-1 ns            | 32K-256K bytes
L2 or L3 (on-chip) cache  | 4-30 ns             | 1M-32M bytes
Off-chip cache            | 10-50 ns            | Up to 128M bytes
Main memory               | 50-200 ns           | 256M-16G bytes
Flash                     | 40-400 μs           | 4G-1T bytes
Disk                      | 5-15 ms             | 500G bytes and up
Tape                      | 1-50 s              | Effectively unlimited

Page 4: Memory Hierarchy

Memory Hierarchy
● Moving down the hierarchy increases latency but also increases capacity
● Registers are accessed in one clock cycle; everything else takes more
● Keeping data moving up the hierarchy prevents idle clock cycles
● Use of caches versus main memory and other slower sources is limited by physical size and cost
● Latency is limited by technology and by the bandwidth of buses
● Disk and tape drives are also limited by the mechanical speed of the drive

Page 5: Memory Hierarchy: Registers

Memory Hierarchy: Registers
● x86-64 register diagram
● For a 32-bit application, only the white parts of the diagram are available
● XMM 128-bit registers are used for SSE vector operations

Page 6: Memory Hierarchy: Caches

Memory Hierarchy: Caches
● Cache is typically on the same chip as the CPU
● Much larger but slower to access than registers; much faster to access than main memory
● L1 cache is divided into two parts: one for instructions, the other for data
● L3 cache is shared among cores on a multi-core chip
● Exploit locality to avoid cache misses

Cache hierarchy of the Intel Westmere architecture (4-core processor)

Page 7: Caches: Locality

Caches: Locality
● Caches exploit locality to avoid cache misses
● Spatial locality: the tendency of a program to access data sequentially, e.g. arrays
● Temporal locality: the tendency to reuse certain data, e.g. a local variable in a loop
● Cache miss: if data is not in the cache at the time it is needed, the processor must wait for memory (wasted clock cycles); see the C sketch below
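
To make spatial locality concrete, here is a minimal C sketch (my illustration, not from the slides): both functions compute the same sum, but the row-major loop walks memory sequentially while the column-major loop strides across it, typically causing far more cache misses on large arrays.

```c
#include <stdio.h>

#define N 1024

static double a[N][N];

/* Row-major traversal: consecutive iterations touch adjacent
 * addresses, so each cache line fetched is fully used
 * (good spatial locality). */
double sum_rows(void) {
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* Column-major traversal of the same array: consecutive
 * iterations are N*sizeof(double) bytes apart, so most cache
 * lines are evicted before their remaining bytes are used. */
double sum_cols(void) {
    double sum = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}

int main(void) {
    printf("%f %f\n", sum_rows(), sum_cols());
    return 0;
}
```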

Page 8: Memory Hierarchy: Processor/Memory Gap

Memory Hierarchy: Processor/Memory Gap
● Over time, processor speed has increased at a much greater rate than RAM speed
● If data had to be fetched from RAM frequently, the processor would be idle too often

Page 9: Memory: Alignment

Memory: Alignment
● Operands appear in several sizes, typically 1, 2, 4, and 8 bytes
● Some architectures require n-byte operands to appear at an address divisible by n
● Others, like x86, do not, but run faster if operands are aligned
● Buses are designed such that if operands are not aligned, bits must be shifted, which takes time
● Forcing alignment also allows offsets to be specified in words rather than bytes (see the struct-layout sketch below)
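
One place alignment shows up directly in C is struct layout. This sketch (my illustration, not from the slides) prints the padding the compiler inserts so that each field lands on an address divisible by its size, on typical ABIs:

```c
#include <stdio.h>
#include <stddef.h>

/* The compiler pads this struct so that 'x' sits at an offset
 * divisible by 4 and 'd' at an offset divisible by 8. */
struct sample {
    char   c;   /* 1 byte, typically followed by 3 bytes of padding */
    int    x;   /* 4 bytes, typically at offset 4 */
    double d;   /* 8 bytes, typically at offset 8 */
};

int main(void) {
    printf("offset of c: %zu\n", offsetof(struct sample, c));
    printf("offset of x: %zu\n", offsetof(struct sample, x));
    printf("offset of d: %zu\n", offsetof(struct sample, d));
    printf("total size:  %zu\n", sizeof(struct sample));
    return 0;
}
```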

Page 10: Data Representation

Data Representation
● Operations interpret bits in memory in different ways
● Data formats include instructions, addresses, binary integers, floating-point numbers, and characters
● Integers come in half-word, word, and double-word lengths
● Floating-point numbers come in single- and double-precision lengths

Page 11: Little-Endian vs Big-Endian

Little-Endian vs Big-Endian
● If the least significant byte of a multiword datum sits at the datum's address, the machine is little-endian
● If the most significant byte sits at the datum's address, the machine is big-endian
● Little-endian is tolerant of variations in operand size
● x86 is little-endian; most other common architectures can be configured either way (see the sketch below)
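
A quick way to observe byte order at runtime (a sketch of mine, not from the slides): store a known multi-byte value and inspect its first byte.

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t value = 0x01020304;
    /* View the same 4 bytes through a byte pointer. On a
     * little-endian machine the least significant byte (0x04)
     * is at the lowest address; on a big-endian machine the
     * most significant byte (0x01) is. */
    unsigned char *bytes = (unsigned char *)&value;
    if (bytes[0] == 0x04)
        printf("little-endian\n");
    else
        printf("big-endian\n");
    return 0;
}
```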

Page 12: Operations on Characters

Operations on Characters
● Support varies from architecture to architecture
● Some architectures can perform arithmetic and logical operations on 1-byte quantities
● Most cannot: load/store only
● x86 has instructions for strings of characters, e.g. copying, comparing, and searching
● Vector operations in x86 can also be used on strings

Page 13: Integer Arithmetic: Representation

Integer Arithmetic: Representation
● Two different representations of integers: signed and unsigned
● Also two sets of operators; unsigned arithmetic is used for pointers
● Unsigned integers are commonly written in hexadecimal, preceded by 0x, e.g. 0x400 = 4*16^2 + 0*16^1 + 0*16^0 = 1024 = 0100 0000 0000 in binary
● Signed arithmetic uses two's complement representation
● Examples of 4-bit two's complement integers appear below
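
The slide's table of examples did not survive extraction; the standard 4-bit two's complement values it would have shown are:

0111 =  7    1111 = -1
0010 =  2    1110 = -2
0001 =  1    1001 = -7
0000 =  0    1000 = -8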

Page 14: Signed Integer Arithmetic: Addition

Signed Integer Arithmetic: Addition
● The most significant bit of every negative number is one and of every non-negative number is zero; non-negative two's complement integers are represented the same as unsigned
● The smallest (most negative) number representable in n bits is 1 followed by n-1 zeros, and the normal counting rules apply as magnitude increases
● The addition algorithm is the same as for unsigned integers; no additional logic is needed
● Overflow occurs when a result is too large to fit in a word
● If addition of two non-negative integers gives a 'negative' result (i.e. the most significant bit flips), there is overflow; similarly for two negative integers (see the sketch below)
● Subtraction is adding the additive inverse
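
A minimal C sketch (my illustration) of the slide's sign-bit rule for detecting two's complement overflow:

```c
#include <stdio.h>
#include <stdint.h>

/* Overflow occurred iff both operands have the same sign and the
 * result's sign differs. The addition itself is done on unsigned
 * values, since the bit-level algorithm is identical; the cast
 * back wraps on typical two's complement targets. */
int add_overflows(int32_t a, int32_t b) {
    int32_t sum = (int32_t)((uint32_t)a + (uint32_t)b);
    return ((a >= 0) == (b >= 0)) && ((sum >= 0) != (a >= 0));
}

int main(void) {
    printf("%d\n", add_overflows(INT32_MAX, 1));  /* 1: overflow    */
    printf("%d\n", add_overflows(5, -7));         /* 0: no overflow */
    return 0;
}
```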

Page 15: Floating-Point Representation

Floating-Point Representation
● Prior to 1985, floating point was poorly defined and varied from machine to machine
● IEEE standard 754 defined two sizes for floats: single and double precision
● Single: sign bit, 8-bit exponent, 23-bit significand
  ○ Represents numbers from roughly 10^-38 to 10^38
● Double: 11-bit exponent and 52-bit significand
  ○ Represents numbers from roughly 10^-308 to 10^308
● Notation: where s is the sign bit, sig is the significand, and exp is the exponent,
  ○ (-1)^s * sig * 2^exp is the value of a given float (see the decoding sketch below)
● exp is obtained by subtracting a bias b from the stored exponent field
  ○ b = 127 for single and b = 1023 for double precision
● sig is always of the form 1.X (the leading 1 is implicit), unless the value is very close to zero, in which case the value is 0.X * 2^(min+1)
  ○ min is the smallest allowed exponent
● Special bit patterns represent zero, ∞, -∞, and Not-a-Number values
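
A sketch (my illustration) that pulls the three fields out of a single-precision float and reassembles its value; it handles only normalized numbers, not the special patterns or subnormals:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

int main(void) {
    float f = -6.25f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);     /* reinterpret, not convert */

    uint32_t s    = bits >> 31;           /* 1 sign bit          */
    uint32_t e    = (bits >> 23) & 0xFF;  /* 8 exponent bits     */
    uint32_t frac = bits & 0x7FFFFF;      /* 23 significand bits */

    /* Normalized value: (-1)^s * 1.frac * 2^(e - 127). */
    double sig = 1.0 + frac / (double)(1 << 23);
    double val = (s ? -1.0 : 1.0) * ldexp(sig, (int)e - 127);

    printf("s=%u e=%u frac=0x%06x -> %f\n", s, e, frac, val);
    return 0;
}
```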

Page 16: Instruction Set Architecture (ISA)

Instruction Set Architecture (ISA)
● An ISA comprises the instructions available on a given machine and their encoding in machine language
● Instructions fall into three classes:
  ○ Computation: arithmetic and logical operations, tests, and comparisons on values in registers or memory (with the address held in a register)
  ○ Data movement: loads/stores between memory and registers, or copies from one register or memory location to another
  ○ Control flow: branches, subroutine calls, and returns

Page 17: ISAs

ISAs
● Differing philosophies:
● CISC (Complex Instruction Set Computing): do as much work as possible with each instruction, as in x86
● RISC (Reduced Instruction Set Computing): maximize the number of instructions performed per second, as in ARM, MIPS, etc.
● RISC instructions are more suitable for pipelined architectures
● Most CISC implementations convert instructions to RISC-like operations internally to facilitate pipelining

Page 18: Addressing Modes

Addressing Modes
● RISC systems allow computational instructions only on values held in registers: a register-register architecture
● CISC systems allow computational instructions to access operands directly in memory: a register-memory architecture
● x86 uses 2-address instructions, meaning one operand will be overwritten by the result
● Other architectures allow a third address to specify where to send the result
● Displacement addressing forms an address as a displacement relative to a base register; used by some RISC ISAs
● Indexed addressing forms an address from two registers, e.g. one register holds the address of an array and the second holds an index; ARM uses this (see the sketch below)
● CISC machines have more complex addressing modes
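
To connect these modes to source code (my illustration, not from the slides): a struct field access typically compiles to displacement addressing, while an array element access typically compiles to indexed addressing.

```c
#include <stdio.h>

struct point { int x, y; };

int field(struct point *p) {
    /* Typically displacement addressing: load from [p + offset],
     * where the offset of 'y' within the struct is a constant. */
    return p->y;
}

int element(int *a, long i) {
    /* Typically indexed addressing: load from [a + i*4]
     * (base register plus index register, scaled by the
     * element size). */
    return a[i];
}

int main(void) {
    struct point pt = { 1, 2 };
    int arr[] = { 10, 20, 30 };
    printf("%d %d\n", field(&pt), element(arr, 2));
    return 0;
}
```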

Page 19: Conditions and Branching

Conditions and Branching
● Condition codes in a special processor status register control branch flow
● Operations may change these codes
● A conditional move instruction moves a value only if the codes are set appropriately
● Predication allows any operation to be marked as conditional
● Branchy code can then be made branchless: instructions on the path not taken simply have no effect
● Doing this can save time if both branches are short (see the sketch below)
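
A common branchless pattern in C (my illustration): compilers frequently lower a ternary like the one below to a conditional move rather than a branch, since both operands are cheap to compute.

```c
#include <stdio.h>

/* Branchy version: a mispredicted branch costs pipeline flushes. */
int max_branchy(int a, int b) {
    if (a > b)
        return a;
    return b;
}

/* Version compilers commonly lower to a conditional move (cmov on
 * x86): both 'a' and 'b' are already available, and the condition
 * merely selects which one is kept, so there is nothing to flush. */
int max_cmov(int a, int b) {
    return (a > b) ? a : b;
}

int main(void) {
    printf("%d %d\n", max_branchy(3, 7), max_cmov(3, 7));
    return 0;
}
```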

Page 20: Architecture and Implementation

Architecture and Implementation
● Four architectural breakthroughs led to modern processors:
1. Microprogramming: in the early 1960s IBM developed a way to share code among processor 'families'; previously code was written for a specific machine. This also led to more concise instructions
2. The microprocessor: by the mid-1970s a processor could be implemented on a single chip with local registers; 8-bit registers to start, growing as transistor counts increased
3. RISC: compute many small pipelined operations in parallel, sometimes with more than one pipeline
4. Multicore processors: clock speed hit a limit due to heat, so continued decreases in feature size led instead to the development of multicore processors
● More recently, SoCs (systems on a chip) and HSAs (heterogeneous system architectures)

Page 21: Compiling for Modern Processors

Compiling for Modern Processors
● Main concerns: effective use of the pipeline and of registers
● Reasons a pipeline may stall:
1. Cache misses: instructions or data not ready in the cache
2. Resource hazards: two instructions need the same functional unit
3. Data hazards: an operand is needed but is still in use by another instruction
4. Control hazards: fetching cannot proceed until a branch is resolved
● Branch prediction: predict the branch outcome based on past results and roll back execution if needed, to avoid control hazards
● Instruction scheduling: reorder instructions at compile time to minimize stalling and maximize instruction-level parallelism

Page 22: Branch Prediction

Branch Prediction
● Early RISC processors provided delayed branch instructions
  ○ The instruction following the branch is executed no matter what the outcome of the branch
● This proved impractical due to scheduling conflicts
● Modern solution: a branch predictor guesses the outcome of branches and continues without knowing the result, backtracking as needed (see the sketch below)
● Avoiding cache misses typically falls to the programmer; the compiler assumes cache hits, which is safe for most programs
● Even fast caches have some delay, so instructions can be reordered to hide that delay, as long as the result of the program does not change
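
A classic demonstration of the predictor's past-results heuristic (my illustration, not from the slides): the same branch over the same data is much cheaper when the data is sorted, because the predictor almost always guesses right once the comparison outcomes become a long run of "no" followed by a long run of "yes".

```c
#include <stdio.h>
#include <stdlib.h>

#define N 1000000

/* Sum elements greater than 127. The branch inside the loop is
 * nearly unpredictable on random data but almost perfectly
 * predictable on sorted data, which is the whole difference. */
long sum_big(const int *a, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++)
        if (a[i] > 127)
            sum += a[i];
    return sum;
}

static int cmp(const void *p, const void *q) {
    return *(const int *)p - *(const int *)q;
}

int main(void) {
    static int a[N];
    for (int i = 0; i < N; i++)
        a[i] = rand() % 256;
    long r1 = sum_big(a, N);      /* random order: many mispredictions */
    qsort(a, N, sizeof a[0], cmp);
    long r2 = sum_big(a, N);      /* sorted: predictor nearly perfect  */
    printf("%ld %ld\n", r1, r2);  /* same sums, very different timing  */
    return 0;
}
```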

Page 23: Scheduling Dependences

Scheduling Dependences
● Flow dependence (true, or read-after-write, dependence): a later instruction uses a value produced by an earlier instruction
● Anti-dependence (write-after-read dependence): a later instruction overwrites a value read by an earlier instruction
● Output dependence (write-after-write dependence): a later instruction overwrites a value written by an earlier instruction
● The second and third kinds can frequently be eliminated by the compiler renaming registers (see the sketch below)
  ○ This increases the registers used but also increases instruction-level parallelism
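
A source-level sketch of renaming (my illustration), with each C variable standing in for a register:

```c
#include <stdio.h>

int main(void) {
    int a = 1, c = 3;

    /* Original sequence:
     *   (1) r1 = a + 10;   writes r1
     *   (2) r2 = r1 * 2;   flow dependence on (1): cannot be removed
     *   (3) r1 = c - 1;    anti-dependence: must wait for (2)'s read of r1
     */
    int r1 = a + 10;
    int r2 = r1 * 2;
    r1 = c - 1;

    /* Renamed sequence: (3) writes a fresh register r3 instead, so it
     * no longer depends on (1) or (2) and can execute in parallel with
     * them; only the true dependence (1)->(2) remains. */
    int s1 = a + 10;
    int s2 = s1 * 2;
    int r3 = c - 1;

    printf("%d %d %d %d %d\n", r1, r2, s1, s2, r3);
    return 0;
}
```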

Page 24: Register Allocation

Register Allocation
● A 2-stage process:
1. Identify 'blocks' of code: sequences with no branches in or out
  a. Within each block, assign a 'virtual register' to each value
2. The compiler maps the virtual registers of a subroutine to the architectural registers
  a. It reuses the same register when possible
  b. If there are not enough registers, values spill over into memory
● Instruction scheduling then takes the instructions and may reorder them; this may increase the number of registers needed, but also decreases stalling

Page 25: Subroutine Calls

Subroutine Calls
● A small subroutine may be treated inline with the rest of the code; this increases code length but may help with register allocation
● If inlining is not possible, the subroutine is treated separately, with registers allocated to the new routine
  ○ If a value is used by the program making the call, it must spill over into memory
  ○ The program making the subroutine call must then reread the new values from memory
  ○ Sometimes the compiler must make assumptions; if necessary, the subroutine must write every variable in scope back to memory
● Inlining saves time but can increase cache usage (see the sketch below)
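
A small C illustration (mine, not from the slides): a function this small is a typical inlining candidate, and the `static inline` hint makes the trade-off visible in source form.

```c
#include <stdio.h>

/* A call this small costs more in call/return overhead and register
 * spills than in actual work, so compilers typically inline it: the
 * body is copied to each call site, growing the code but avoiding
 * the spill/reload traffic around a real call. */
static inline int square(int x) {
    return x * x;
}

int main(void) {
    int total = 0;
    for (int i = 0; i < 10; i++)
        total += square(i);   /* likely becomes total += i*i */
    printf("%d\n", total);
    return 0;
}
```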

Page 26: Conclusions

Conclusions
● Hardware changes such as pipelining and RISC instruction sets have increased the complexity of the machine code needed to take full advantage of the hardware
● The processor-memory gap, the relatively rapid increase of processor performance versus memory performance, makes it necessary to exploit the memory hierarchy
● Optimizations take place at the ISA level, the compiler level, and the program level