Transcript
Page 1: isa architecture

ISA

By

AJAL.A.J - AP/ECE

Page 2: isa architecture

Instruction Set Architecture

• Instruction set architecture is the structure of a computer that a machine language programmer must understand to write a correct (timing-independent) program for that machine.

• The instruction set architecture is also the machine description that a hardware designer must understand to design a correct implementation of the computer.

• A fixed number of operations are formatted as one big instruction (called a bundle)

[Bundle format: op | op | op | bundling info]

Page 3: isa architecture

Instruction Set Architecture

Computer Architecture = Instruction Set Architecture + Machine Organization

• “... the attributes of a [computing] system as seen by the programmer, i.e. the conceptual structure and functional behavior …”

Page 4: isa architecture

Evolution of Instruction Sets

Single Accumulator (EDSAC, 1950)

Accumulator + Index Registers (Manchester Mark I, IBM 700 series, 1953)

Separation of Programming Model from Implementation

High-level Language Based (B5000, 1963)   Concept of a Family (IBM 360, 1964)

General Purpose Register Machines

Complex Instruction Sets (VAX, Intel 432, 1977-80)   Load/Store Architecture (CDC 6600, Cray-1, 1963-76)

RISC (MIPS, SPARC, HP-PA, IBM RS/6000, PowerPC, ... 1987)

VLIW / "EPIC"? (IA-64, ... 1999)

Page 5: isa architecture

Instruction Set Architecture

– Interface between all the software that runs on the machine and the hardware that executes it

• Computer Architecture = Hardware + ISA

Page 6: isa architecture

instruction set, or instruction set architecture (ISA)

• An instruction set, or instruction set architecture (ISA), is the part of the computer architecture related to programming, including the native data types, instructions, registers, addressing modes, memory architecture, interrupt and exception handling, and external I/O. An ISA includes a specification of the set of opcodes (machine language) and the native commands implemented by a particular processor.

Page 7: isa architecture

Microarchitecture

• Instruction set architecture is distinguished from the microarchitecture, which is the set of processor design techniques used to implement the instruction set. Computers with different microarchitectures can share a common instruction set.

• For example, the Intel Pentium and the AMD Athlon implement nearly identical versions of the x86 instruction set, but have radically different internal designs.

Page 8: isa architecture
Page 9: isa architecture

NUAL vs. UAL

• Unit Assumed Latency (UAL)
– Semantics of the program are that each instruction is completed before the next one is issued
– This is the conventional sequential model

• Non-Unit Assumed Latency (NUAL)
– At least one operation has a non-unit assumed latency, L, which is greater than 1
– The semantics of the program are correctly understood if exactly the next L-1 instructions are understood to have issued before this operation completes

Page 10: isa architecture
Page 11: isa architecture
Page 12: isa architecture
Page 13: isa architecture
Page 14: isa architecture
Page 15: isa architecture
Page 16: isa architecture
Page 17: isa architecture
Page 18: isa architecture
Page 19: isa architecture
Page 20: isa architecture
Page 21: isa architecture
Page 22: isa architecture

Instruction Set Architecture ICS 233 – Computer Architecture and Assembly Language – KFUPM

© Muhamed Mudawar slide 22

Summary of RISC Design

 All instructions are typically of one size

 Few instruction formats

 All operations on data are register to register
• Operands are read from registers
• Result is stored in a register

 General purpose integer and floating-point registers
• Typically, 32 integer and 32 floating-point registers

 Memory access only via load and store instructions
• Load and store: bytes, half words, words, and double words

 Few simple addressing modes

Page 23: isa architecture

Instruction Set Architectures

Reduced Instruction Set Computers (RISC)
 Simple instructions
 Flexibility
 Higher throughput
 Faster execution

Complex Instruction Set Computers (CISC)
 Hardware support for high-level languages
 Compact programs

Page 24: isa architecture

MIPS: A RISC example

 Smaller and simpler instruction set: 111 instructions
 One-cycle execution time
 Pipelining
 32 registers, 32 bits each

Page 25: isa architecture

MIPS Instruction Set

 25 branch/jump instructions
 21 arithmetic instructions
 15 load instructions
 12 comparison instructions
 10 store instructions
 8 logic instructions
 8 bit manipulation instructions
 8 move instructions
 4 miscellaneous instructions

Page 26: isa architecture

Overview of the MIPS Processor

[Block diagram]

Memory: up to 2^32 bytes = 2^30 words, 4 bytes per word

EIU – Execution & Integer Unit (main processor): 32 general-purpose registers $0 ... $31, Arithmetic & Logic Unit (ALU), integer multiplier/divider with Hi and Lo registers

FPU – Floating-Point Unit (Coprocessor 1): 32 floating-point registers F0 ... F31, floating-point arithmetic unit

TMU – Trap & Memory Unit (Coprocessor 0): EPC, Cause, BadVaddr, and Status registers

Page 27: isa architecture

MIPS R2000 Organization

[Block diagram]
CPU: registers $0 ... $31, arithmetic unit, multiply/divide unit with Lo and Hi registers
Coprocessor 1 (FPU): registers $0 ... $31, arithmetic unit
Coprocessor 0 (traps and memory): BadVAddr, Status, Cause, and EPC registers
Memory

Page 28: isa architecture

ECE 361

Definitions

Performance is typically in units-per-second
• bigger is better

If we are primarily concerned with response time
• performance = 1 / execution_time

"X is n times faster than Y" means

Performance_X / Performance_Y = ExecutionTime_Y / ExecutionTime_X = n

Page 29: isa architecture

Organizational Trade-offs

[Diagram: Application → Programming Language → Compiler → ISA → Datapath/Control → Transistors, Wires, Pins; the ISA ties Instruction Mix, CPI, and Cycle Time together through the Function Units]

CPI is a useful design measure relating the Instruction Set Architecture with the implementation of that architecture and the program being measured

Page 30: isa architecture

Principal Design Metrics: CPI and Cycle Time

Performance = 1 / ExecutionTime

Performance = 1 / (CPI × CycleTime)

Performance = 1 / ( (Cycles/Instruction) × (Seconds/Cycle) )   [Instructions/Second]

Page 31: isa architecture

Amdahl's "Law": Make the Common Case Fast

Speedup due to enhancement E:

Speedup(E) = (ExTime w/o E) / (ExTime w/ E) = (Performance w/ E) / (Performance w/o E)

Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected; then

ExTime(with E) = ((1-F) + F/S) × ExTime(without E)

Speedup(with E) = ExTime(without E) / ( ((1-F) + F/S) × ExTime(without E) ) = 1 / ((1-F) + F/S)

Performance improvement is limited by how much the improved feature is used: invest resources where time is spent.
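As a quick numeric check, the speedup formula can be evaluated directly (a minimal sketch; the function name is ours, not from the slides):

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of the execution time
    is accelerated by a factor s: 1 / ((1 - f) + f / s)."""
    return 1.0 / ((1.0 - f) + f / s)

# Accelerating 40% of the task by 10x gives only ~1.56x overall,
# because the untouched 60% dominates.
print(round(amdahl_speedup(0.4, 10), 2))  # 1.56
```

Even an infinite speedup of that 40% could never exceed 1/0.6 ≈ 1.67x overall, which is the point of "make the common case fast".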

Page 32: isa architecture

Classification of Instruction Set Architectures

Page 33: isa architecture

Instruction Set Design

Multiple implementations: 8086 → Pentium 4

ISAs evolve: MIPS-I, MIPS-II, MIPS-III, MIPS-IV, MIPS-V, MDMX, MIPS-32, MIPS-64

[Diagram: the instruction set is the interface between software and hardware]

Page 34: isa architecture

The steps for executing an instruction:

1. Fetch the instruction
2. Decode the instruction
3. Locate the operand
4. Fetch the operand (if necessary)
5. Execute the operation in processor registers
6. Store the results
7. Go back to step 1
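These steps can be mimicked by a toy interpreter loop (a hypothetical two-opcode mini-ISA, invented purely for illustration):

```python
# Toy machine: each instruction is a tuple (opcode, dest, src1, src2).
# The loop walks the same fetch / decode / operand-fetch / execute /
# store / repeat cycle listed above.
def run(program, registers):
    pc = 0
    while pc < len(program):                     # 7. go back to step 1
        instr = program[pc]                      # 1. fetch the instruction
        op, dest, src1, src2 = instr             # 2. decode the instruction
        a = registers[src1]                      # 3-4. locate and fetch operands
        b = registers[src2]
        if op == "add":                          # 5. execute in registers
            result = a + b
        elif op == "sub":
            result = a - b
        registers[dest] = result                 # 6. store the result
        pc += 1
    return registers

regs = {"r0": 0, "r1": 5, "r2": 3}
run([("add", "r0", "r1", "r2"), ("sub", "r0", "r0", "r2")], regs)
print(regs["r0"])  # (5 + 3) - 3 = 5
```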

Page 35: isa architecture

Typical Processor Execution Cycle

Instruction Fetch: obtain instruction from program storage
Instruction Decode: determine required actions and instruction size
Operand Fetch: locate and obtain operand data
Execute: compute result value or status
Result Store: deposit results in register or storage for later use
Next Instruction: determine successor instruction

Page 36: isa architecture

Instruction and Data Memory: Unified or Separate

Programmer's view: ADD, SUBTRACT, AND, OR, COMPARE, ...
Computer's view: 0101001110100111000111010...

[Diagram: CPU connected to Memory and I/O; the computer program (instructions) resides in memory]

Princeton (von Neumann) Architecture
--- Data and instructions mixed in the same unified memory
--- Program as data
--- Storage utilization
--- Single memory interface

Harvard Architecture
--- Data & instructions in separate memories
--- Has advantages in certain high-performance implementations
--- Can optimize each memory

Page 37: isa architecture

Basic Addressing Classes

Declining cost of registers

Page 38: isa architecture

Data-transfer instructions

Page 39: isa architecture

Arithmetic instructions

Page 40: isa architecture

Logical and bit-manipulation instructions

Page 41: isa architecture

Shift instruction

Page 42: isa architecture

Stack Architectures

Page 43: isa architecture

Stack architecture
 its high frequency of memory accesses has made it unattractive
 it is useful for rapid interpretation of high-level language programs

Infix expression: (A+B)×C + (D×E)

Postfix expression: AB+C×DE×+
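A stack architecture evaluates the postfix form directly: operands are pushed, and each operator pops two values and pushes the result. A sketch (operand values chosen arbitrarily for illustration):

```python
def eval_postfix(tokens, env):
    """Evaluate a postfix expression the way a stack machine would:
    push operands; an operator pops two values and pushes the result."""
    stack = []
    for t in tokens:
        if t == "+":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif t == "*":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        else:
            stack.append(env[t])  # push an operand (a "load")
    return stack.pop()

env = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}
# (A+B)*C + (D*E)  in postfix:  A B + C *  D E *  +
print(eval_postfix(["A", "B", "+", "C", "*", "D", "E", "*", "+"], env))  # 29
```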

Page 44: isa architecture

Accumulator Architectures

Page 45: isa architecture

Register-Set Architectures

Page 46: isa architecture

Register-to-Register: Load-Store Architectures

Page 47: isa architecture

Register-to-Memory Architectures

Page 48: isa architecture

Memory-to-Memory Architectures

Page 49: isa architecture

Addressing Modes

Page 50: isa architecture

Instruction Set Design Metrics

Static Metrics
• How many bytes does the program occupy in memory?

Dynamic Metrics
• How many instructions are executed?
• How many bytes does the processor fetch to execute the program?
• How many clocks are required per instruction?
• How "lean" a clock is practical?

ExecutionTime = 1 / Performance = Instructions × (Cycles/Instruction) × (Seconds/Cycle)

(the three factors: Instruction Count, CPI, Cycle Time)
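These metrics combine into the classic product ExecutionTime = Instructions × CPI × CycleTime; a minimal numeric sketch (helper name is ours):

```python
def execution_time(instructions, cpi, cycle_time):
    """ExecutionTime = Instructions x (Cycles/Instruction) x (Seconds/Cycle)."""
    return instructions * cpi * cycle_time

# 1 million instructions at CPI = 2 on a 1 GHz clock (1 ns cycle):
t = execution_time(1_000_000, 2.0, 1e-9)
print(t)  # about 0.002 seconds; performance = 1/t
```

Halving any one factor (fewer instructions, lower CPI, or a faster clock) halves execution time, which is why all three are design targets.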

Page 51: isa architecture

Types of ISA and examples:

1. RISC -> PlayStation
2. CISC -> Intel x86
3. MISC -> INMOS Transputer
4. ZISC -> ZISC36
5. SIMD -> many GPUs
6. EPIC -> IA-64 Itanium
7. VLIW -> C6000 (Texas Instruments)

Page 52: isa architecture

Problems of the Past

• In the past, it was believed that hardware design was easier than compiler design
– Most programs were written in assembly language

• Hardware concerns of the past:
– Limited and slower memory
– Few registers

Page 53: isa architecture

The Solution

• Have instructions do more work, thereby minimizing the number of instructions executed in a program

• Allow for variations of each instruction– Usually variations in memory access

• Minimize the number of memory accesses

Page 54: isa architecture

The Search for RISC

• Compilers became more prevalent
• The majority of CISC instructions were rarely used
• Some complex instructions were slower than a group of simple instructions performing an equivalent task
– Too many instructions for designers to optimize each one

Page 55: isa architecture

RISC Architecture

• Small, highly optimized set of instructions
• Uses a load-store architecture
• Short execution time
• Pipelining
• Many registers

Page 56: isa architecture

Pipelining

• Break instructions into steps
• Work on instructions like in an assembly line
• Allows for more instructions to be executed in less time
• An n-stage pipeline is n times faster than a non-pipelined processor (in theory)

Page 57: isa architecture

RISC Pipeline Stages

• Fetch instruction
• Decode instruction
• Execute instruction
• Access operand
• Write result

– Note: Slight variations depending on processor

Page 58: isa architecture

Without Pipelining

[Diagram: Instr 1 occupies clock cycles 1-5; Instr 2 occupies cycles 6-10]

• Normally, you would perform the fetch, decode, execute, operand access, and write steps of one instruction and then move on to the next instruction

Page 59: isa architecture

With Pipelining

[Diagram: five instructions overlapped on a 5-stage pipeline, all completing within 9 clock cycles]

• The processor is able to perform each stage simultaneously.

• If the processor is decoding an instruction, it may also fetch another instruction at the same time.

Page 60: isa architecture

Pipeline (cont.)

• Length of pipeline depends on the longest step
• Thus in RISC, all instructions were made to be the same length
• Each stage takes 1 clock cycle
• In theory, an instruction should be finished each clock cycle
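The timing claims above can be checked with a simple cycle count under ideal assumptions (one cycle per stage, no stalls; the helper names are ours):

```python
def cycles_unpipelined(n_instructions, n_stages):
    # Each instruction runs through all stages before the next one starts.
    return n_instructions * n_stages

def cycles_pipelined(n_instructions, n_stages):
    # Fill the pipeline once, then one instruction completes per cycle.
    return n_stages + (n_instructions - 1)

# 5 instructions on a 5-stage pipeline, as in the diagrams above:
print(cycles_unpipelined(5, 5))  # 25 cycles
print(cycles_pipelined(5, 5))    # 9 cycles
```

As the instruction count grows, the ratio approaches n_stages, which is the "n times faster, in theory" claim.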

Page 61: isa architecture

Pipeline Problem

• Problem: An instruction may need to wait for the result of another instruction

Page 62: isa architecture

Pipeline Solution :

• Solution: Compiler may recognize which instructions are dependent or independent of the current instruction, and rearrange them to run the independent one first

Page 63: isa architecture

How to make pipelines faster

• Superpipelining
– Divide the stages of pipelining into more stages
• Ex: Split “fetch instruction” stage into two stages

• Super duper pipelining

• Superscalar pipelining
– Run multiple pipelines in parallel


Page 64: isa architecture

Dynamic pipeline

• Dynamic pipeline: Uses buffers to hold instruction bits in case a dependent instruction stalls

Page 65: isa architecture

Why CISC Persists ?

• Most Intel and AMD chips are CISC x86
• Most PC applications are written for x86
• Intel spent more money improving the performance of their chips
• Modern Intel and AMD chips incorporate elements of pipelining
– During decoding, x86 instructions are split into smaller pieces

Page 67: isa architecture

Outline

• Types of architectures
• Superscalar
• Differences between CISC, RISC and VLIW
• VLIW (very long instruction word)

Page 68: isa architecture

Very Long Instruction Word

VLIW goals: flexible enough; match technology well

o Very long instruction word (VLIW) refers to a processor architecture designed to take advantage of instruction-level parallelism

VLIW philosophy:
– “dumb” hardware
– “intelligent” compiler

Page 69: isa architecture

VLIW - History

• Floating Point Systems Array Processor
– very successful in the 70’s
– all latencies fixed; fast memory

• Multiflow
– Josh Fisher (now at HP)
– 1980’s mini-supercomputer

• Cydrome
– Bob Rau (now at HP)
– 1980’s mini-supercomputer

• Tera
– Burton Smith
– 1990’s supercomputer
– multithreading

• Intel IA-64 (Intel & HP)

Page 70: isa architecture

VLIW Processors

Goal of the hardware design:
• reduce hardware complexity
• shorten the cycle time for better performance
• reduce power requirements

How do VLIW designs reduce hardware complexity?
1. less multiple-issue hardware
– no dependence checking for instructions within a bundle
– can be fewer paths between instruction issue slots & FUs
2. simpler instruction dispatch
– no out-of-order execution, no instruction grouping
3. ideally no structural hazard checking logic

• Reduction in hardware complexity affects cycle time & power consumption

Page 71: isa architecture

VLIW Processors

More compiler support to increase ILP: the compiler detects hazards & hides latencies

• structural hazards
– no 2 operations to the same functional unit
– no 2 operations to the same memory bank

• hiding latencies
– data prefetching
– hoisting loads above stores

• data hazards
– no data hazards among instructions in a bundle

• control hazards
– predicated execution
– static branch prediction

Page 72: isa architecture

VLIW: Definition

• Multiple independent functional units
• Instruction consists of multiple independent operations
• Each of them is aligned to a functional unit
• Latencies are fixed
– Architecturally visible
• Compiler packs operations into a VLIW and also schedules all hardware resources
• Entire VLIW issues as a single unit
• Result: ILP with simple hardware
– compact, fast hardware control
– fast clock
– At least, this is the goal

Page 73: isa architecture

Introduction

o An instruction of a VLIW processor consists of multiple independent operations grouped together.
o There are multiple independent functional units in the VLIW processor architecture.
o Each operation in the instruction is aligned to a functional unit.
o All functional units share the use of a common large register file.
o This type of processor architecture is intended to allow higher performance without the inherent complexity of some other approaches.

Page 74: isa architecture

VLIW History

The term was coined by J.A. Fisher (Yale) in 1983
 ELI-512 (prototype)
 Trace (commercial)

Origin lies in horizontal microcode optimization

Another pioneering work by B. Ramakrishna Rau in 1982
 Polycyclic (prototype)
 Cydra-5 (commercial)

Recent developments
 TriMedia – Philips
 TMS320C6X – Texas Instruments

Page 75: isa architecture

"Bob" Rau

• Bantwal Ramakrishna "Bob" Rau (1951 – December 10, 2002) was a computer engineer and HP Fellow. Rau was a founder and chief architect of Cydrome, where he helped develop the Very long instruction word technology that is now standard in modern computer processors. Rau was the recipient of the 2002 Eckert–Mauchly Award.

Page 76: isa architecture

1984: Co-founded Cydrome Inc. and was the chief architect of the Cydra 5 mini-supercomputer.

1989: Joined Hewlett-Packard and started HP Labs' research program in VLIW and instruction-level parallel processing. Director of the Compiler and Architecture Research (CAR) program, which during the 1990s developed advanced compiler technology for Hewlett-Packard and Intel computers.

At HP, also worked on PICO (Program In, Chip Out) project to take an embedded application and to automatically design highly customized computing hardware that is specific to that application, as well as any compiler that might be needed.

2002: passed away after losing a long battle with cancer

Page 77: isa architecture

The VLIW Architecture

• A typical VLIW (very long instruction word) machine has instruction words hundreds of bits in length.

• Multiple functional units are used concurrently in a VLIW processor.

• All functional units share the use of a common large register file.

Page 78: isa architecture

Parallel Operating Environment (POE)

• Compiler creates a complete plan of run-time execution
– At what time and using what resource
– POE communicated to hardware via the ISA
– Processor obediently follows POE
– No dynamic scheduling, no out-of-order execution
• These second-guess the compiler’s plan

• Compiler allowed to play the statistics
– Many types of info are only available at run-time
• branch directions, pointer values
– Traditionally compilers behave conservatively and handle the worst-case possibility
– Allow the compiler to gamble when it believes the odds are in its favor
• Profiling

• Expose the micro-architecture to the compiler
– memory system, branch execution

Page 79: isa architecture

VLIW Processors

Compiler support to increase ILP
• compiler creates each VLIW word
• need for good code scheduling is greater than with in-order issue superscalars
• an instruction doesn’t issue if even 1 of its operations can’t (like a maala bulb: a series-wired string of lights where one failed bulb holds up the whole string)

• techniques for increasing ILP
1. loop unrolling
2. software pipelining (schedules instructions from different iterations together)
3. aggressive inlining (function becomes part of the caller code)
4. trace scheduling (schedule beyond basic block boundaries)

Page 80: isa architecture

Different Approaches

Other approaches to improving performance in processor architectures:

o Pipelining
 Breaking up instructions into sub-steps so that instructions can be executed partially at the same time

o Superscalar architectures
 Dispatching individual instructions to be executed completely independently in different parts of the processor

o Out-of-order execution
 Executing instructions in an order different from the program

Page 81: isa architecture

Parallel processing

Processing instructions in parallel requires three major tasks:

1. checking dependencies between instructions to determine which instructions can be grouped together for parallel execution;

2. assigning instructions to the functional units on the hardware;

3. determining when instructions are initiated and placed together into a single word.

Page 82: isa architecture

ILP

Consider the following program:
 op1: e = a + b
 op2: f = c + d
 op3: m = e * f

o Operation 3 depends on the results of operations 1 and 2, so it cannot be calculated until both of them are completed

o However, operations 1 and 2 do not depend on any other operation, so they can be calculated simultaneously

o If we assume that each operation can be completed in one unit of time then these three instructions can be completed in a total of two units of time giving an ILP of 3/2.
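A minimal dependence-aware scheduler (illustrative only; it assumes unlimited functional units and unit latency, as in the example) reproduces the two-time-unit result:

```python
def schedule(ops):
    """Greedy scheduling with unlimited units: each op starts in the
    earliest time unit after every op it depends on has finished;
    each op takes one time unit. Returns finish times by op name."""
    finish = {}
    for name, deps in ops:
        start = max((finish[d] for d in deps), default=0)
        finish[name] = start + 1
    return finish

# op1 and op2 are independent; op3 needs both results.
ops = [("op1", []), ("op2", []), ("op3", ["op1", "op2"])]
finish = schedule(ops)
total_time = max(finish.values())
print(total_time)              # 2 time units
print(len(ops) / total_time)   # ILP = 1.5
```

op1 and op2 both finish at time 1, so op3 runs at time 2: three operations in two time units, giving the ILP of 3/2 from the slide.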

Page 83: isa architecture

Two approaches to ILP

o Hardware approach: works upon dynamic parallelism, where scheduling of instructions is at run time

o Software approach: works upon static parallelism, where scheduling of instructions is done by the compiler

Page 84: isa architecture

VLIW COMPILER

o The compiler is responsible for static scheduling of instructions in a VLIW processor.

o The compiler finds out which operations can be executed in parallel in the program.

o It groups together these operations in a single instruction, which is the very long instruction word.

o The compiler ensures that an operation is not issued before its operands are ready.

Page 85: isa architecture

VLIW Example

[Diagram: an I-fetch & issue unit feeding multiple FUs and memory ports, all sharing a multi-ported register file]

Page 86: isa architecture

Block Diagram

Page 87: isa architecture

Working

o Long instruction words are fetched from the memory
o A common multi-ported register file is used for fetching the operands and storing the results
o Parallel random access to the register file is possible through the read/write crossbar
o Execution in the functional units is carried out concurrently with the load/store operation of data between RAM and the register file
o One or multiple register files for FX and FP data
o Rely on the compiler to find parallelism and schedule dependency-free program code

Page 88: isa architecture

Major categories

VLIW – Very Long Instruction Word
EPIC – Explicitly Parallel Instruction Computing

Page 89: isa architecture

IA-64 EPIC

Explicitly Parallel Instruction Computing, VLIW

2001: 800 MHz Itanium, IA-64 implementation

Bundle of instructions
• 128-bit bundles
• three 41-bit instructions/bundle
• 2 bundles can be issued at once
• if issue one, get another
– less delay in bundle issue
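Three 41-bit slots account for 123 of the 128 bits; in the IA-64 encoding the remaining 5 bits are a template field describing the bundle. The packing can be sketched with integer bit operations (helper names are ours; field order simplified to template in the low bits, then slots 0-2):

```python
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1  # mask for one 41-bit instruction slot

def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit slots into a 128-bit bundle."""
    assert 0 <= template < 32 and len(slots) == 3
    bundle = template
    for i, s in enumerate(slots):
        assert 0 <= s <= SLOT_MASK
        bundle |= s << (5 + i * SLOT_BITS)  # slot i starts at bit 5 + 41*i
    return bundle

def unpack_bundle(bundle):
    template = bundle & 0x1F
    slots = [(bundle >> (5 + i * SLOT_BITS)) & SLOT_MASK for i in range(3)]
    return template, slots

b = pack_bundle(0x10, [0x123, 0x456, 0x789])
print(b.bit_length() <= 128)                              # fits in 128 bits
print(unpack_bundle(b) == (0x10, [0x123, 0x456, 0x789]))  # round-trips
```

Since 5 + 3 × 41 = 128, the bundle fills its 128 bits exactly, with no padding.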

Page 90: isa architecture

Data path: A Simple VLIW Architecture

[Diagram: several FUs sharing one central register file]

Scalability? Access time, area, and power consumption sharply increase with the number of register ports

Page 91: isa architecture

Data path: Clustered VLIW Architecture (distributed register file)

[Diagram: clusters of FUs, each pair with its own register file, connected through an interconnection network]

Page 92: isa architecture

Coarse-Grain FUs with VLIW Core

[Diagram: embedded (co)processors (MULT, RAM, ALU) as coarse-grain FUs, connected through a multiplexer network to register pairs and driven by microcode, IR, program counter, and control logic]

Embedded (co)-processors as FUs in a VLIW architecture

Page 93: isa architecture

Application-Specific FUs

Functional units are characterized by: functionality, number of inputs, number of outputs, latency, initiation interval, and I/O time shape

Page 94: isa architecture

Superscalar Processors

• Superscalar processors are designed to exploit more instruction-level parallelism in user programs.

• Only independent instructions can be executed in parallel without causing a wait state.

• The amount of instruction-level parallelism varies widely depending on the type of code being executed.

• Superscalar
– Operations are sequential
– Hardware figures out resource assignment, time of execution

Page 95: isa architecture

Pipelining in Superscalar Processors

• In order to fully utilise a superscalar processor of degree m, m instructions must be executable in parallel. This situation may not be true in all clock cycles. In that case, some of the pipelines may be stalling in a wait state.

• In a superscalar processor, the simple operation latency should require only one cycle, as in the base scalar processor.

Page 96: isa architecture
Page 97: isa architecture

Superscalar Execution

Page 98: isa architecture

Superscalar Implementation

• Simultaneously fetch multiple instructions• Logic to determine true dependencies involving

register values• Mechanisms to communicate these values• Mechanisms to initiate multiple instructions in

parallel• Resources for parallel execution of multiple

instructions

• Mechanisms for committing process state in correct order

Page 99: isa architecture

Difference Between VLIW & Superscalar Architecture

Page 100: isa architecture

Why are superscalar processors commercially more popular than VLIW processors?

 Binary code compatibility among scalar & superscalar processors of the same family

 The same compiler works for all processors (scalars and superscalars) of the same family

 Assembly programming of VLIWs is tedious

 Code density in VLIWs is very poor
– instruction encoding schemes affect area and performance

Page 101: isa architecture

Superscalars vs. VLIW

VLIW requires a more complex compiler

Superscalars can more efficiently execute pipeline-independent code
• consequence: don’t have to recompile if the implementation changes

Page 102: isa architecture

VLSI DESIGN GROUP – METS SCHOOL OF ENGINEERING, MALA

Comparison: CISC, RISC, VLIW

Page 103: isa architecture


Page 104: isa architecture

Advantages of VLIW

Compiler prepares fixed packets of multiple operations that give the full "plan of execution"

– dependencies are determined by compiler and used to schedule according to function unit latencies

– function units are assigned by compiler and correspond to the position within the instruction packet ("slotting")

– compiler produces fully-scheduled, hazard-free code => hardware doesn't have to "rediscover" dependencies or schedule

Page 105: isa architecture

Disadvantages of VLIW

Compatibility across implementations is a major problem
– VLIW code won’t run properly with a different number of function units or different latencies
– unscheduled events (e.g., cache miss) stall the entire processor

Code density is another problem
– low slot utilization (mostly nops)
– reduce nops by compression (“flexible VLIW”, “variable-length VLIW”)

Page 106: isa architecture
Page 107: isa architecture

5. Texas Instruments, Tutorial on TMS320C6000 VelociTI Advanced VLIW Architecture. http://www.acm.org/sigs/sigmicro/existing/micro31/pdf/m31_seshan.pdf