Application Note

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

This document contains information on a product under development at Advanced Micro Devices (AMD). The information is intended to help you evaluate this product. AMD reserves the right to change or discontinue work on this proposed product without notice.

Publication # 21828   Rev: A   Amendment/0
Issue Date: August 1997
8/14/2019 x86 Code Optimization for AMD Processors
Advanced Micro Devices, Inc. ("AMD") reserves the right to make changes in its
products without notice in order to improve design or performance characteristics.
The information in this publication is believed to be accurate at the time of
publication, but AMD makes no representations or warranties with respect to the
accuracy or completeness of the contents of this publication or the information
contained herein, and reserves the right to make changes at any time, without
notice. AMD disclaims responsibility for any consequences resulting from the use
of the information included in this publication.
This publication neither states nor implies any representations or warranties of
any kind, including but not limited to, any implied warranty of merchantability or
fitness for a particular purpose. AMD products are not authorized for use as critical
components in life support devices or systems without AMD’s written approval. AMD assumes no liability whatsoever for claims associated with the sale or use
(including the use of engineering samples) of AMD products except as provided in
AMD’s Terms and Conditions of Sale for such product.
Trademarks
AMD, the AMD logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc.
RISC86 is a registered trademark, and K86, AMD-K5, AMD-K6, and the AMD-K6 logo are trademarks of Advanced
Micro Devices, Inc.
Microsoft and Windows are registered trademarks, and Windows NT is a trademark of Microsoft Corporation.
Pentium is a registered trademark and MMX is a trademark of the Intel Corporation.
Other product names used in this publication are for identification purposes only and may be trademarks of their
respective companies.
AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization 21828A/0 —August 1997
compatibility. An x86 binary-compatible processor implements the industry-standard x86 instruction set by decoding and executing the x86 instruction set as its native mode of operation. Only this native mode permits delivery of maximum performance when running PC software.
The AMD-K6™ MMX™ Enhanced Processor
The AMD-K6 MMX enhanced processor, the first in the AMD-K6 family, brings superscalar RISC performance to desktop systems running industry-standard x86 software. This processor implements advanced design techniques such as multiple execution units, out-of-order execution, data forwarding, register renaming, and dynamic branch prediction. In other words, the AMD-K6 processor is capable of issuing, executing, and retiring multiple x86 instructions per cycle, resulting in superior scalable performance.
Although the AMD-K6 processor is capable of extracting code parallelism out of off-the-shelf, commercially available x86 software, specific code optimizations for the AMD-K6 processor can result in even higher delivered performance. This document describes the RISC86® microarchitecture in the AMD-K6 processor and makes recommendations for optimizing execution of x86 software on the processor. The coding techniques for achieving peak performance on the AMD-K6 processor include, but are not limited to, those recommended for the Pentium® and Pentium Pro processors. However, many of these optimizations are not necessary for the AMD-K6 processor to achieve maximum performance. Due to the more flexible pipeline control of the AMD-K6 microarchitecture, the AMD-K6 processor is not as sensitive to instruction selection and the scheduling of code. This flexibility is one of the distinct advantages of the AMD-K6 processor microarchitecture.
2 The AMD-K6™ Processor RISC86® Microarchitecture
Overview
When discussing processor design, it is important to understand the terms architecture, microarchitecture, and design implementation. The term architecture refers to the instruction set and features of a processor that are visible to software programs running on the processor. The architecture determines what software the processor can run. The architecture of the AMD-K6 MMX processor is the industry-standard x86 instruction set.
The term microarchitecture refers to the design techniques used in the processor to reach the target cost, performance, and functionality goals. The AMD-K6 processor is based on a sophisticated RISC core known as the enhanced RISC86 microarchitecture. The enhanced RISC86 microarchitecture is an advanced, second-order decoupled decode/execution design approach that enables industry-leading performance for x86-based software.

The term design implementation refers to the actual logic and circuit designs from which the processor is created according to the microarchitecture specifications.
RISC86® Microarchitecture
The enhanced RISC86 microarchitecture defines the characteristics of the AMD-K6 MMX enhanced processor. The innovative RISC86 microarchitecture approach implements the x86 instruction set by internally translating x86 instructions into RISC86 operations. These RISC86 operations were specially designed to include direct support for the x86 instruction set while observing the RISC performance principles of fixed-length encoding, regularized instruction fields, and a large register set. The enhanced RISC86 microarchitecture used in the AMD-K6 enables higher processor core performance and promotes straightforward extendibility in future designs. Instead of executing complex x86 instructions, which have lengths of 1 to 15 bytes, the AMD-K6 processor executes the simpler fixed-length RISC86 opcodes, while maintaining the instruction coding efficiencies found in x86 programs.
The AMD-K6 processor includes parallel decoders, a centralized scheduler, and seven execution units that support superscalar operation—multiple decode, execution, and retirement—of x86 instructions. These elements are packed into an aggressive and very efficient six-stage pipeline.
Decoding of the x86 instructions into RISC86 operations begins when the on-chip level-one instruction cache is filled. Predecode logic determines the length of an x86 instruction on a byte-by-byte basis. This predecode information is stored, alongside the x86 instructions, in a dedicated level-one predecode cache to be used later by the decoders. Up to two x86 instructions are decoded per clock on-the-fly, resulting in a maximum of four RISC86 operations per clock with no additional latency.
The AMD-K6 processor categorizes x86 instructions into three types of decodes—short, long, and vector. The decoders process either two short, one long, or one vectored decode at a time. The three types of decodes have the following characteristics:

■ Short decode—common x86 instructions less than or equal to 7 bytes in length that produce one or two RISC86 operations.
■ Long decode—more complex and somewhat common x86 instructions less than or equal to 11 bytes in length that produce up to four RISC86 operations.

■ Vectored decode—complex x86 instructions requiring long sequences of RISC86 operations.
Short and long decodes are processed completely within the decoders. Vectored decodes are started by the decoders with the generation of an initial set of four RISC86 operations, and then completed by fetching a sequence of additional operations from an on-chip ROM (at a rate of four operations per clock). RISC86 operations, whether produced by decoders or fetched from ROM, are then sent to a buffer in the centralized scheduler for dispatch to the execution units.
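The decode classification above can be restated as a small rule function. This is a sketch of the stated byte-length and operation-count rules only, not the processor's actual decode logic:

```python
def decode_type(length_bytes, risc86_ops):
    """Classify an x86 instruction per the decode rules stated above.
    A rough sketch of the rules as written, not the real decode hardware."""
    if length_bytes <= 7 and risc86_ops <= 2:
        return "short"    # two short decodes can proceed per clock
    if length_bytes <= 11 and risc86_ops <= 4:
        return "long"     # one long decode per clock
    return "vector"       # completed from on-chip ROM, 4 ops/clock

print(decode_type(2, 1))    # short (e.g. a simple register ALU op)
print(decode_type(10, 4))   # long
print(decode_type(3, 20))   # vector (long microcode sequence)
```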
The internal RISC86 instruction set consists of the following six categories or types of operations (the execution unit that handles each type of operation is shown in parentheses):

■ Memory load operations (load)
■ Memory store operations (store)
■ Integer register operations (alu/alux)
■ MMX register operations (meu)
■ Floating-point register operations (float)
■ Branch condition evaluations (branch)
The following example shows a series of x86 instructions and the corresponding decoded RISC86 operations.

x86 Instructions     RISC86 Operations
MOV CX, [SP+4]       Load
ADD AX,BX            Alu (Add)
CMP CX,[AX]          Load
                     Alu (Sub)
JZ foo               Branch
The MOV instruction converts to a RISC86 load that requires indirect data to be loaded from memory. The ADD instruction converts to an alu function that can be sent to either of the integer units. The CMP instruction converts into two RISC86 operations. The first RISC86 load operation requires indirect data to be loaded from memory. That value is then compared (alu function) with CX.
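The example above can also be tallied programmatically. The mapping below is simply a restatement of the table, with op-type names of my own choosing:

```python
# RISC86 operation breakdown of the sample sequence above (op types only).
risc86_map = {
    "MOV CX, [SP+4]": ["load"],
    "ADD AX,BX":      ["alu"],
    "CMP CX,[AX]":    ["load", "alu"],  # load the memory operand, then compare
    "JZ foo":         ["branch"],
}
total_ops = sum(len(ops) for ops in risc86_map.values())
print(total_ops)  # 5 RISC86 operations for 4 x86 instructions
```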
Once the RISC86 operations are in the centralized scheduler buffer, they are ready for the scheduler to issue them to the appropriate execution unit. The AMD-K6 processor contains seven execution units—Integer X, Integer Y, Multimedia, Load, Store, Branch, and Floating-Point. Figure 1 shows a block diagram of these units.

The centralized scheduler buffer, in conjunction with the instruction control unit (ICU/scheduler), buffers and manages up to 24 RISC86 operations at a time (which equates to up to 12 x86 instructions). This buffer size (24) is well matched to the processor’s six-stage RISC86 pipeline and seven parallel execution units.

On every clock, the centralized scheduler buffer can accept up to four RISC86 operations from the decoders and issue up to six RISC86 operations to corresponding execution units. (Six RISC86 operations can be issued at a time because the alux and multimedia execution units share the same pipeline.)
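The buffer behavior described above can be sketched as a toy occupancy model (not a cycle-accurate simulator); the capacity and per-clock limits are the figures stated in the text:

```python
def scheduler_clock(buffered, decodable, ready):
    """One clock of the 24-entry centralized scheduler buffer: issue up
    to six ready RISC86 ops, then accept up to four newly decoded ops.
    A toy occupancy model, not a simulator of the real scheduler."""
    CAPACITY, MAX_ACCEPT, MAX_ISSUE = 24, 4, 6
    issued = min(ready, MAX_ISSUE, buffered)
    buffered -= issued
    accepted = min(decodable, MAX_ACCEPT, CAPACITY - buffered)
    return buffered + accepted, issued

# A full buffer with six ready ops drains six and refills four:
print(scheduler_clock(24, 4, 6))  # (22, 6)
```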
[Figure 1: Out-of-Order Execution Engine. Instruction Control Unit; Scheduler Buffer (24 RISC86 operations, six-RISC86-operation issue); Integer X (register), Integer Y (register), Floating-Point, Branch (resolving), and Store units; Store Queue; Level-One Dual-Port Data Cache (32 Kbyte) with 128-entry DTLB.]

When managing the 24 RISC86 operations, the scheduler uses 48 physical registers contained within the RISC86 microarchitecture. The 48 physical registers are located in a
general register file and are grouped as 24 committed or architectural registers plus 24 rename registers. The 24 architectural registers consist of 16 scratch registers and eight registers that correspond to the x86 general-purpose registers—EAX, EBX, ECX, EDX, EBP, ESP, ESI, and EDI.
The AMD-K6 processor offers sophisticated dynamic branch logic that includes the following elements:
■ Branch history/prediction table
■ Branch target cache
■ Return address stack
These components serve to minimize or eliminate the delays due to the branch instructions (jumps, calls, returns) common in x86 software.
The AMD-K6 processor implements a two-level branch prediction scheme based on an 8192-entry branch history table. The branch history table stores prediction information that is used for predicting the direction of conditional branches. The target addresses of conditional and unconditional branches are not predicted, but instead are calculated on-the-fly during instruction decode by special branch target address ALUs. The branch target cache augments performance of taken branches by avoiding a one-cycle cache-fetch penalty. This specialized target cache does this by supplying the first 16 bytes of target instructions to the decoders when a branch is taken.
The return address stack serves to optimize CALL/RET instruction pairs by remembering the return address of each CALL within a nested series of subroutines and supplying it as the predicted target address of the corresponding RET instruction.
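The return-address-stack mechanism can be sketched in a few lines. The stack depth here is a made-up parameter (the document does not state the AMD-K6's depth), and the overflow policy is an assumption:

```python
class ReturnAddressStack:
    """Sketch of a return-address-stack predictor: CALL pushes the
    return address, RET pops it as the predicted target. Depth and
    overflow handling are illustrative assumptions, not K6 specifics."""
    def __init__(self, depth=8):
        self.depth, self.stack = depth, []
    def call(self, return_addr):
        if len(self.stack) == self.depth:
            self.stack.pop(0)          # assumed: oldest entry lost on overflow
        self.stack.append(return_addr)
    def ret(self):
        return self.stack.pop() if self.stack else None  # predicted target

ras = ReturnAddressStack()
ras.call(0x1004); ras.call(0x2008)   # nested CALLs
print(hex(ras.ret()))  # 0x2008 -- innermost return predicted first
print(hex(ras.ret()))  # 0x1004
```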
As shown in Figure 1 on page 6, the high-performance, out-of-order execution engine is mated to a split 64-Kbyte (Harvard architecture) writeback level-one cache with 32 Kbytes of instruction cache and 32 Kbytes of data cache. The level-one instruction cache feeds the decoders and, in turn, the decoders feed the scheduler. The ICU controls the issue and retirement of RISC86 operations contained in the centralized scheduler buffer. The level-one data cache satisfies most memory reads and writes by the load and store execution units.
The store queue temporarily buffers memory writes from the store unit until they can safely be committed into the cache (that is, when all preceding operations have been found to be free of faults and branch mispredictions). The system bus interface is an industry-standard 64-bit Pentium processor-compatible demultiplexed address/data system bus.
The AMD-K6 processor uses the latest in processor microarchitecture techniques to provide the highest x86 performance for today’s PC. In short, the AMD-K6 processor offers true sixth-generation performance and full x86 binary software compatibility.
3 AMD-K6™ Processor Execution Units and Dependency Latencies
The AMD-K6 MMX enhanced processor contains seven specialized execution units—store, load, integer X, integer Y, multimedia, floating-point, and branch condition. Each unit operates independently and handles a specific group of the RISC86 instruction set. This chapter describes the operation of these units, their execution latencies, and how concurrent dependency chains affect those latencies.
A dependency occurs when data needed in one execution unit is being processed in another unit (or the same unit). Additional latencies can occur because the dependent execution unit must wait for the data. Table 1 on page 16 provides a summary of the execution units, the operations performed within these units, the operation latency, and the operation throughput.
Execution Unit Terminology
The execution units operate on two different types of register values—operands and results. There are three types of operands and two types of results.

Operands. The three types of operands are as follows:

■ Address register operands—used for address calculations of load and store operations
■ Data register operands—used for register operations
■ Store data register operands—used for memory stores

Results. The two types of results are as follows:

■ Data register results—from load or register operations
■ Address register results—from Lea or Push operations
The following examples illustrate the operand and result definitions:

Add AX, BX
    The Add operation has two data register operands (AX and BX) and one data register result (AX).

Load BX, [SP+4·CX+8]
    The Load operation has two address register operands (SP and CX as base and index registers, respectively) and a data register result (BX).

Store [SP+4·CX+8], AX
    The Store operation has a store data register operand (AX) and two address register operands (SP and CX as base and index registers, respectively).

Lea SI, [SP+4·CX+8]
    The Lea operation (a type of store operation) has address register operands (SP and CX as base and index registers, respectively), and an address register result.
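The four examples above can be restated as data. The field names below are my own shorthand, not AMD terminology:

```python
# Operand/result breakdown of the four examples above, as data.
# Field names are illustrative shorthand for the document's terms.
examples = {
    "Add AX, BX":            {"data_operands": ["AX", "BX"], "address_operands": [],
                              "store_data_operand": None, "result": ("data register", "AX")},
    "Load BX, [SP+4*CX+8]":  {"data_operands": [], "address_operands": ["SP", "CX"],
                              "store_data_operand": None, "result": ("data register", "BX")},
    "Store [SP+4*CX+8], AX": {"data_operands": [], "address_operands": ["SP", "CX"],
                              "store_data_operand": "AX", "result": None},
    "Lea SI, [SP+4*CX+8]":   {"data_operands": [], "address_operands": ["SP", "CX"],
                              "store_data_operand": None, "result": ("address register", "SI")},
}
# Every memory-addressing operation here uses SP and CX as its
# base and index address register operands.
print(examples["Store [SP+4*CX+8], AX"]["store_data_operand"])  # AX
```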
Six-Stage Pipeline
To help visualize the operations within the AMD-K6 processor, Figure 2 illustrates the six-stage pipeline design. This is a simplified illustration in that the AMD-K6 contains multiple parallel pipelines (starting after common instruction fetch and x86 decode pipe stages), and these pipelines often execute operations out-of-order with respect to each other. This view of the AMD-K6 execution pipeline illustrates the effect of execution latencies for various types of operations.

For register operations that only require one execution cycle, this pipeline is effectively shorter due to the absence of execution stage 2.
The samples starting on page 19 assume that the x86 instructions have already been fetched, decoded, and placed in the centralized scheduler buffer. The RISC86 operations are waiting to be dispatched to the appropriate execution units.
Figure 2. AMD-K6™ Processor Pipeline
Integer and Multimedia Execution Units
The integer X execution unit can execute all ALU operations, multiplies and divides (signed and unsigned), shifts, and rotates. Data register results are available after one clock of execution latency.

The multimedia execution unit (meu) executes all MMX operations and shares pipeline control with the integer X execution unit (an integer X operation and an MMX operation cannot be dispatched simultaneously). In most cases, data register results are available after one clock, and after two clocks for PMULH and PMADD operations.
[Figure 2 pipeline stages: Instruction Fetch, x86-to-RISC86 Decode, RISC86 Issue, Execution Stage 1, Execution Stage 2, Retire]
The integer Y execution unit can execute the basic word and doubleword ALU operations (ADD, AND, CMP, OR, SUB, and XOR) and zero- and sign-extend operations. Data register results are available after one clock.

Figure 3 shows the architecture of the single-stage integer execution pipeline. The operation issue and fetch stages that precede this execution stage are not part of the execution pipeline. The data register operands are received at the end of the operand fetch pipe stage, and the data register result is produced near the end of the execution pipe stage.
Figure 3. Integer/Multimedia Execution Unit
Load Unit
The load unit is a two-stage pipelined design that performs data memory reads. This unit uses two address register operands and a memory data value as inputs and produces a data register result.

The load unit has a two-clock latency from the time it receives the address register operands until it produces a data register result.

Memory read data can come from either the data cache or the store queue entry for a recent store. If the data is forwarded from the store queue, there is zero additional execution latency. This means that a dependent load operation can complete its execution one clock after a store operation completes execution.
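The read path described above (forward from the store queue if a recent store matches, otherwise read the cache) can be sketched functionally, ignoring operand sizes and partial overlap:

```python
def load_value(address, store_queue, data_cache):
    """Memory read path: forward from the youngest matching store-queue
    entry if present, otherwise read the data cache. A functional
    sketch of store-to-load forwarding, not the real lookup hardware."""
    for addr, value in reversed(store_queue):   # youngest store wins
        if addr == address:
            return value
    return data_cache.get(address)

cache = {0x100: 5}
queue = [(0x100, 7)]                    # a recent, not-yet-committed store
print(load_value(0x100, queue, cache))  # 7 -- forwarded from the store queue
print(load_value(0x100, [], cache))     # 5 -- read from the data cache
```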
Figure 4 shows the architecture of the two-stage load execution pipeline. The operation issue and fetch stages that precede this execution stage are not part of the execution pipeline. The address register operands are received at the end of the operand fetch pipe stage, and the data register result is produced near the end of the second execution pipe stage.

Figure 4. Load Execution Unit
[Address register operands (base and index) feed Execution Stage 1 (address calculation); memory data from the data cache or store queue feeds Execution Stage 2 (data cache/store queue lookup), which produces the data register result.]
Store Unit
The store execution unit is a two-stage pipelined design that performs data memory writes and/or, in some cases, produces an address register result. For inputs, the store unit uses two address register operands and, during actual memory writes, a store data register operand. This unit also produces an address register result for some store unit operations. For most store operations, which actually write to memory, the store unit produces a physical memory address and the associated bytes of data to be written. After execution completes, these results are entered in a new store queue entry.
The store unit has a one-clock execution latency from the time it receives address register operands until the time it produces an address register result. The most common examples are the Load Effective Address (Lea) and Store and Update (Push) RISC86 operations, which are produced from the x86 LEA and PUSH instructions, respectively. Most store operations do not
produce an address register result and only perform a memory write. The Push operation is unique because it produces both an address register result and performs a memory write.
The store unit has a one-clock execution latency from the time it receives a store data register operand until it enters a store memory address and data pair into the store queue.

The store unit also has a three-clock latency from the time it receives address register operands until it enters a store memory address and data pair into the store queue.
Note: Address register operands are required at the start of execution, but register data is not required until the end of execution.
Figure 5 shows the architecture of the two-stage store execution pipeline. The operation issue and fetch stages that precede this execution stage are not part of the execution pipeline. The address register operands are received at the end of the operand fetch pipe stage, and the new store queue entry is created upon completion of the second execution pipe stage.
Figure 5. Store Execution Unit
[Address register operands (base and index) and the store data register operand feed Execution Stage 1 (address calculation); Execution Stage 2 produces the address register result and the address/data pair for a new store queue entry.]
Branch Condition Unit
The branch condition unit is separate from the branch prediction logic, which is utilized at x86 instruction decode time. This unit resolves conditional branches, such as JCC and LOOP instructions, at a rate of up to one per clock cycle.
Floating Point Unit
The floating-point unit handles all register operations for x87 instructions. The execution unit is a single-stage design that takes data register operands as inputs and produces a data register result as an output. The most common floating-point instructions have a two-clock execution latency from the time the unit receives data register operands until the time it produces a data register result.
Latencies and Throughput
Table 1 on page 16 summarizes the latencies and throughput of each execution unit.
Table 1. RISC86® Execution Latencies and Throughput

Execution Unit  Operations                                              Latency  Throughput
Integer X       Integer ALU                                             1        1
                Integer Multiply                                        2–3      2–3
                Integer Shift                                           1        1
Multimedia      MMX ALU                                                 1        1
                MMX Shifts, Packs, Unpack                               1        1
                MMX Multiply Low/High                                   1/2      1/2
                MMX Multiply-Accumulate                                 2        2
Integer Y       Basic ALU (16- and 32-bit operands)                     1        1
Load            From Address Register Operands to Data Register Result  2        1
                Memory Read Data from Data Cache/Store Queue
                  to Data Register Result                               0        1
Store           From Address Register Operands to Address
                  Register Result                                       1        1
                From Store Data Register Operands to Store Queue Entry  1        1
                From Address Register Operands to Store Queue Entry     3        1
Branch          Resolves Branch Conditions                              1        1
FPU             FADD, FSUB                                              2        2
                FMUL                                                    2        2

Note: No additional latency exists between execution of dependent operations. Bypassing of register results directly from producing execution units to the operand inputs of dependent units is fully supported. Similarly, forwarding of memory store values from the store queue to dependent load operations is supported.
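Because of the full bypassing described in the note, the minimum time for a fully dependent chain is simply the sum of the per-operation latencies. A sketch with a handful of Table 1 entries, collapsing ranges to their worst case:

```python
# Latencies (clocks) taken from Table 1; the 2-3 multiply range is
# collapsed to its worst case, and "load" is the full two-clock latency.
LATENCY = {
    "load": 2, "int_alu": 1, "int_mul": 3, "int_shift": 1,
    "mmx_alu": 1, "mmx_mac": 2, "fadd": 2, "fmul": 2,
}

def chain_clocks(ops):
    """Minimum clocks for a chain in which each operation depends on the
    previous result: with full result bypassing, latencies simply add."""
    return sum(LATENCY[op] for op in ops)

# e.g. a load feeding an ALU op feeding a shift:
print(chain_clocks(["load", "int_alu", "int_shift"]))  # 4 clocks
```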
Resource Constraints
To optimize code effectively, consider not only the latencies of critical dependencies, but also execution resource constraints. Due to a fixed number of execution units, only so many operations can be issued in each cycle (up to six RISC86 operations per cycle), even though, based on dependencies, more execution parallelism may be possible.
For example, if code contains three consecutive integer operations that have no co-dependencies, they cannot all execute in parallel because there are only two integer execution units. The third operation is delayed by one cycle.
Contention for execution resources causes delays in the issuing and execution of instructions. In addition, stalls due to resource constraints can combine with dependency latencies and exacerbate the resulting stalls. In general, constraints that delay non-critical instructions do not impact performance, because such stalls typically overlap with the execution of critical operations.
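The resource constraint in the example above reduces to a ceiling division. This toy model deliberately ignores dependencies and decode limits:

```python
import math

def issue_clocks(independent_ops, units):
    """Clocks needed to start N mutually independent operations on a
    given number of identical execution units (a toy model only)."""
    return math.ceil(independent_ops / units)

# Three independent integer ops, two integer units (X and Y):
print(issue_clocks(3, 2))  # 2 -- the third op waits a clock despite no dependency
```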
Code Sample Analysis
The samples in this section show the execution behavior of several series of instructions as a function of decode constraints, dependencies, and execution resource constraints. The sample tables show the x86 instructions, the RISC86 operation equivalents, the clock counts, and a description of the events occurring within the processor.
The following nomenclature is used to describe the current location of a RISC86 operation (RISC86op):
■ D — Decode stage
■ IX — Issue stage of integer X unit
■ OX — Operand fetch stage of integer X unit
■ EX1 — Execution stage 1 of integer X unit
■ IY — Issue stage of integer Y unit
■ OY — Operand fetch stage of integer Y unit
■ EY1 — Execution stage 1 of integer Y unit
■ IL — Issue stage of load unit
■ OL — Operand fetch stage of load unit
■ EL1 — Execution stage 1 of load unit
■ EL2 — Execution stage 2 of load unit
■ IS — Issue stage of store unit
■ OS — Operand fetch stage of store unit
■ ES1 — Execution stage 1 of store unit
■ ES2 — Execution stage 2 of store unit
Note: Instructions execute more efficiently (that is, without delays) when scheduled apart by suitable distances based on dependencies. In general, the samples in this section show poorly scheduled code in order to illustrate the resultant effects.
No.  Instruction             RISC86op   Stage sequence (clocks 1–11)
1    DEC EDX                 alu        D IX OX EX1
2    MOV EDI, [ECX]          load       D IL OL EL1 EL2
3    SUB EAX, [EDX+20]       load       D IL OL EL1 EL2
                             alu          IX OX IX OX EX1
4    SAR EAX, 5              alux       D IX OX IX OX EX1
5    ADD ECX, [EDI+4]        load       D IL OL EL1 EL2
                             alu          IY OY IY OY EY1
6    AND EBX, 0x1F           alu        D IY OY EY1
7    MOV ESI, [0x0F100]      load       D IL OL EL1 EL2
8    OR ECX, [ESI+EAX*4+8]   load       D IL OL OL EL1 EL2
                             alu          IX OX OX OX EX1

(A repeated stage indicates a clock spent waiting in, or bumped back to, that stage.)
Comments for Each Instruction Number
1 This simple alu operation ends up in the X pipe.
2 This operation will occupy the load execution unit.
3 The register operand for the load operation is bypassed, without delay, from the result of instruction #1’s register operand. In clock 4, the register operation is ‘bumped’ out of the integer X unit while waiting for the previous load operation result to complete. It is re-issued just in time to receive the bypassed result of the load.
4 Shift instructions are only executable in the integer X unit. The register operation is bumped in clock 5 while waiting for the result of the preceding instruction #3.
5 The register operand for the load operation is bypassed, without delay, from the result of instruction #2’s register operand. Note how this and most surrounding load operations are generated by the instruction decoders, and issued and executed by the load unit smoothly at a rate of one per clock. In clock 5, the register operation is bumped out of the integer Y unit while waiting for the previous load operation result to complete.
6 The register operation falls through into the integer Y unit right behind instruction #5’s register operation.
7 This operation falls into the load unit behind the load in instruction #5.
8 The operand fetch for the load operation is delayed because it needs the result of the immediately preceding load operation #7 as well as the results from earlier instructions #3 and #4.
No.  Instruction              RISC86op   Stage sequence (clocks 1–11)
1    MOV EDX, [0xA0008F00]    load       D IL OL EL1 EL2
2    ADD [EDX+16], 7          load       D IL OL EL1 EL2
                              alu          IX OX IX OX EX1
                              store        IS OS OS ES1 ES2 ES2
3    SUB EAX, [EDX+16]        load       D IL IL OL EL1 EL2 EL2
                              alu          IX OX IX IX OX OX EX1
4    PUSH EAX                 store      D IS IS OS ES1 ES2 ES2 ES2
5    LEA EBX, [ECX+EAX*4+3]   store      D IS OS OS OS ES1 ES2
6    MOV EDI, EBX             alu        D IY OY OY OY OY OY EY1
Comments for Each Instruction Number
1 This operation will occupy the load unit.
2 This long-decoded ADD instruction takes a single clock to decode. The operand fetch for the load operation is delayed waiting for the result of the previous load operation from instruction #1. The store operation completes concurrent with the register operation. The result of the register operation is bypassed directly into a new store queue entry created by the store operation.
3 The issue of the load operation is delayed because the operand fetch of the preceding load operation from instruction #2 was delayed. The completion of the load operation is held up due to a memory dependency on the preceding store operation of instruction #2. The load operation completes immediately after the store operation, with the store data being forwarded from a new store queue entry.
4 Completion of the store operation is held up due to a data dependency on the preceding instruction #3. The store data is bypassed directly into a new store queue entry from the result of instruction #3’s register operation.
5 The Lea RISC86 operation is executed by the store unit. The operand fetch is delayed waiting for the result of instruction #3. The register result value is produced in the first execution stage of the store unit.
6 This simple alu operation is stalled due to the dependency on the EBX result of instruction #5.
No.  Instruction            RISC86op   Stage sequence (clocks 1–10)
1    MOVQ MM0, [EAX]        mload      D IL OL EL1 EL2
2    PSUBSW MM0, [EAX+16]   mload      D IL OL EL1 EL2
                            alux         IX OX OX OX EX1
3    ADD EBX, ECX           alu        D IY OY EY1
4    PADDSW MM1, MM2        alux       D IX IX IX OX EX1
5    PUSH EBX               store      D IS OS ES1 ES2
6    PMADDWD MM0, MM1       alux       D IX OX EX1 EX1
7    ADD EAX, 32            alu        D IY OY EY1
8    MOVQ [EDI], MM0        mstore     D IS OS ES1 ES2 ES2
9    ADD EDI, 8             alu        D IY OY EY1
Comments for Each Instruction Number
1 This multimedia operation occupies the load unit.
2 Instruction #2 could not be decoded along with the preceding instruction because MMX instructions can be decoded only in the first decode position. The MMX register operation is executable only by the integer X unit. The operand fetch is delayed because of the dependency on the load.
3 This instruction can be decoded in parallel with instruction #2 because it is not an MMX instruction. It is issued to the integer Y unit in parallel with the issuing of the preceding MMX register operation in instruction #2.
4 This instruction is executable only in the integer X unit. The issue of this MMX instruction is delayed due to the delay of the operand fetch of the preceding MMX register operation.
5 This instruction stores the contents of EBX in memory.
6 Instruction #6 is executable only in the integer X unit. This non-pipelined unit has a two-clock execution latency for this instruction, and it is delayed due to 'stacking up' behind the preceding MMX operations.
7 This instruction is issued to the integer Y unit in parallel with the series of preceding MMX register operations being issued to the integer X unit.
8 Completion of this store operation is held up due to a data dependency on the preceding MMX register operation from instruction #6. The store data is bypassed directly into a new store queue entry from the result of the MMX operation.
9 This instruction is issued to the integer Y unit in parallel with the series of preceding MMX register operations being issued to the integer X unit.
Chapter 4: Instruction Dispatch and Execution Timing
AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization 21828A/0 —August 1997
• eXX—register width depending on the operand size
• mem32real—32-bit floating-point value in memory
• mem64real—64-bit floating-point value in memory
• mem80real—80-bit floating-point value in memory
• mmreg—MMX register
• mmreg1—MMX register defined by bits 5, 4, and 3 of the modR/M byte
• mmreg2—MMX register defined by bits 2, 1, and 0 of the modR/M byte

The second and third columns list all applicable opcode byte encodings.

The fourth column lists the modR/M byte when used by the instruction. The modR/M byte defines the instruction as register or memory form. If mod bits 7 and 6 are documented as mm (memory form), mm can only be 10b, 01b, or 00b.
The fifth column lists the type of instruction decode—short, long, or vectored. The AMD-K6 MMX enhanced processor decode logic can process two short, one long, or one vectored decode per clock. In addition, two short integer instructions, one short integer and one short MMX instruction, or one short integer and one short FPU instruction can be decoded simultaneously.

Note: In order to simultaneously decode an integer instruction with a floating-point or MMX instruction, the floating-point or MMX instruction must precede the integer instruction.
The sixth column lists the type of RISC86 operation(s) required for the instruction. The operation types and corresponding execution units are as follows:
• load, fload, mload—load unit
• store, fstore, mstore—store unit
• alu—either of the integer execution units
• alux—integer X execution unit only
• branch—branch condition unit
• float—floating-point execution unit
• meu—multimedia execution unit
• limm—load immediate, instruction control unit
The operation(s) of most instructions form a single dependency chain. For instructions whose operations form two parallel dependency chains, the RISC86 operations and execution latency for each dependency chain are shown on a separate row.
Chapter 5: x86 Optimization Coding Guidelines
General x86 Optimization Techniques
This section describes general code optimization techniques specific to superscalar processors (that is, techniques common to the AMD-K6 MMX enhanced processor, the AMD-K5™ processor, and Pentium-family processors). In general, all optimization techniques used for the AMD-K5 processor, Pentium, and Pentium Pro processors either improve the performance of the AMD-K6 processor or are not required and have no effect (due to fewer coding restrictions with the AMD-K6 processor).
Short Forms—Use shorter forms of instructions to increase the effective number of instructions that can be examined for decoding at any one time. Use 8-bit displacements and jump offsets where possible.

Simple Instructions—Use simple instructions with hardwired decode (pairable, short, or fast) because they perform more efficiently. This includes "register←register op memory" as well as "register←register op register" forms of instructions.
Dependencies—Spread out true dependencies to increase the opportunities for parallel execution. Anti-dependencies and output dependencies do not impact performance.
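As a hedged illustration (the register names and memory operands are chosen arbitrarily), interleaving an independent chain spreads out the true dependencies so both integer units can stay busy:

    ; Back-to-back dependent chain (each ADD waits on the previous one):
    ;   MOV EAX, [a]
    ;   ADD EAX, [b]
    ;   ADD EAX, [c]
    ; Interleaving a second, independent chain in EBX exposes parallelism:
        MOV EAX, [a]
        MOV EBX, [d]
        ADD EAX, [b]
        ADD EBX, [e]
        ADD EAX, [c]
        ADD EBX, [f]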
Memory Operands—Instructions that operate on data in memory (load/op/store) can inhibit parallelism. The use of separate move and ALU instructions allows better code scheduling for independent operations. However, if there are no opportunities for parallel execution, use the load/op/store forms to reduce the number of register spills (storing values in memory to free registers for other uses).
Register Operands—Maintain frequently used values in registers rather than in memory.
Stack References—Use ESP for stack references so that EBP remains available.
Stack Allocation—When allocating space for local variables and/or outgoing parameters within a procedure, adjust the stack pointer and use moves rather than pushes. This method of allocation allows random access to the outgoing parameters so that they can be set up when they are calculated, instead of being held somewhere else until the procedure call. This method also reduces ESP dependencies and uses fewer execution resources.
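A minimal sketch of this approach (the 12-byte frame size, argument offsets, and the callee name func are illustrative assumptions, including caller cleanup of the arguments):

    ; Instead of pushing each outgoing argument:
    ;   PUSH ECX
    ;   PUSH EBX
    ;   PUSH EAX
    ;   CALL func
    ; adjust ESP once and store the arguments with moves, in any order:
        SUB  ESP, 12          ;reserve space for three DWORD arguments
        MOV  [ESP+8], ECX     ;third argument, stored as soon as available
        MOV  [ESP], EAX       ;first argument
        MOV  [ESP+4], EBX     ;second argument
        CALL func
        ADD  ESP, 12          ;caller cleans up (an assumption)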
Data Embedding—When data is embedded in the code segment, align it in cache blocks separate from nearby code. This technique avoids some of the overhead of maintaining coherency between the instruction and data caches.
Loops—Unroll loops to get more parallelism and reduce loop overhead, even with branch prediction. Inline small routines to avoid procedure-call overhead. For both techniques, however, consider the cost of possible increased register usage, which might add load/store instructions for register spilling.
Code Alignment—Aligning at 0-mod-16 improves performance(ideally at 0-mod-32). However, there is a trade-off betweenexecution speed and code size.
General AMD-K6™ Processor x86 Coding Optimizations
This section describes general code optimization techniques specific to the AMD-K6 MMX enhanced processor.
Use short-decodeable instructions—To increase decode bandwidth and minimize the number of RISC86 operations per x86 instruction, use short-decodeable x86 instructions. See "Instruction Dispatch and Execution Timing" on page 23 for the list of short-decodeable instructions.
Pair short-decodeable instructions—Two short-decodeable x86 instructions can be decoded per clock, using the full decode bandwidth of the AMD-K6 processor.
Avoid using complex instructions—The more complex and uncommon instructions are vector decoded and can generate a larger ratio of RISC86 operations per x86 instruction than short-decodeable or long-decodeable instructions.
0Fh prefix usage—0Fh does not count as a prefix.
Avoid long instruction length—Use x86 instructions that are less than eight bytes in length. An x86 instruction that is longer than seven bytes cannot be short-decoded.
Align branch targets—Keep branch targets away from the end of a cache line. 16-byte alignment is preferred for branch targets, while 32-byte alignment is ideal.
Use read-modify-write instructions over the discrete equivalent—No advantage is gained by splitting read-modify-write instructions into a load-execute-store instruction group. Both read-modify-write instructions and load-execute-store instruction groups decode and execute in one cycle, but read-modify-write instructions promote better code density.
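For instance (the memory operand counter is a hypothetical variable), the single read-modify-write form is preferred here:

    ; Preferred read-modify-write form (one x86 instruction):
        ADD DWORD PTR [counter], 1
    ; Equivalent load-execute-store group (same timing, worse density):
    ;   MOV EAX, [counter]
    ;   ADD EAX, 1
    ;   MOV [counter], EAX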
Move rarely used code and data to separate pages—Placing code such as exception handlers and data such as error text messages in separate pages maximizes the use of the TLBs and prevents their pollution with rarely used items.
Avoid multiple and accumulated prefixes—In order to accomplish an instruction decode, the decoders require sufficient predecode information. When an instruction has multiple prefixes and its length cannot be deduced by the decoders (due to a lack of data in the instruction decode buffer), the first decoder retires and accumulates one prefix at a time until the instruction is completely decoded. Table 9 shows when prefixes are accumulated and decoding is serialized.
Avoid mixing code size types—Size prefixes that affect the length of an instruction can sometimes inhibit dual decoding.
Always pair CALL and RETURN—If CALLs and RETs are notpaired, the return address stack gets out of synchronization,increasing the latency of returns and decreasing performance.
Exploit parallel execution of integer and floating-point multiplies—The AMD-K6 MMX enhanced processor allows simultaneous integer and floating-point multiplies using separate, low-latency multipliers.
Avoid more than 16 levels of nesting in subroutines—More than 16 levels of nested subroutine calls overflow the return address stack, leading to lower performance. While this is not a problem for most code, recursive subroutines might easily exceed 16 levels of subroutine calls. If the recursive subroutine is tail recursive, it can usually be mechanically transformed into an iterative version, which leads to increased performance.
Table 9. Decode Accumulation and Serialization

Decoder #1    Decoder #2                          Result
Instruction   —                                   Single instruction decoded
Instruction   Instruction                         Dual instruction decode
Instruction   Prefix                              Single instruction decoded; the prefix is accumulated
Prefix        Instruction (modified by prefix)    No prefix accumulation; a single instruction is decoded
PrefixA       PrefixB                             PrefixA is accumulated and decode of the second prefix is canceled
PrefixB       Instruction                         If a prefix was already accumulated in the previous decode cycle, PrefixB is accumulated, the instruction decode is canceled, and the instruction is decoded in the next decode cycle
Place frequently used stack data within 128 bytes of EBP—The statically most-referenced data items in a function's stack frame should be located from –128 to +127 bytes from EBP. This technique improves code density by enabling the use of an 8-bit sign-extended displacement instead of a 32-bit displacement.
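As an illustration (the offsets are arbitrary), both of the accesses below encode with a one-byte displacement:

        MOV EAX, [EBP-8]      ;frequently used local: 8-bit displacement
        MOV EDX, [EBP+12]     ;incoming argument: 8-bit displacement
    ; A rarely used item can live farther away and pay the 32-bit
    ; displacement cost instead:
    ;   MOV ECX, [EBP-256]    ;requires a 32-bit displacement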
Avoid superset dependencies—Using the larger form of a register immediately after an instruction uses the smaller form creates a superset dependency and prevents parallel execution. For example, avoid the following type of code:

Avoid:  OR  AH, 055h
        AND EAX, 1555555h
Avoid excessive loop unrolling or code inlining—Excessive loop unrolling or code inlining increases code size and reduces locality, which leads to lower cache hit rates and reduced performance.
Avoid splitting a 16-bit memory access in 32-bit code—No advantage is gained by splitting a 16-bit memory access in 32-bit code into two byte-sized accesses, even though the split avoids the operand-size override prefix.
Avoid data-dependent branches around a single instruction—Data-dependent branches acting upon essentially random data cause the branch prediction logic to mispredict the branch about 50% of the time. Design branch-free alternative code sequences to replace straightforward code that contains data-dependent branches. The effect is a shorter average execution time. The following example illustrates this concept:
• Signed integer ABS function (x = labs(x))
Static Latency: 4 cycles
MOV ECX, [x] ;load value
MOV EBX, ECX
SAR ECX, 31
XOR EBX, ECX ;1’s complement if x<0, else don’t modify
SUB EBX, ECX ;2’s complement if x<0, else don’t modify
MOV [x], EBX ;save labs result
AMD-K6™ Processor Integer x86 Coding Optimizations

This section describes integer code optimization techniques specific to the AMD-K6 MMX enhanced processor.
Neutral code filler—Use the XCHG EAX, EAX or NOP instruction when aligning instructions. XCHG EAX, EAX consumes decode slots but requires no execution resources. Essentially, the scheduler absorbs the equivalent RISC86 operation without requiring any of the execution units.
Inline REP string instructions with low counts—Expand REP string instructions into equivalent sequences of simple x86 instructions. This technique eliminates the setup overhead of these instructions and increases instruction throughput.
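A sketch of this expansion for a small fixed copy of three doublewords (the registers follow the implicit REP MOVSD operands; this assumes ESI and EDI do not need to be advanced afterward):

    ; Instead of:
    ;   MOV ECX, 3
    ;   REP MOVSD             ;copy 3 DWORDs from [ESI] to [EDI]
    ; expand into simple moves:
        MOV EAX, [ESI]
        MOV [EDI], EAX
        MOV EAX, [ESI+4]
        MOV [EDI+4], EAX
        MOV EAX, [ESI+8]
        MOV [EDI+8], EAX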
Use ADD reg, reg instead of SHL reg, 1—This optimization technique allows the scheduler to use either of the two integer adders rather than the single shifter and effectively increases overall throughput. The only difference between these two instructions is the setting of the AF flag.
Access 16-bit memory data using the MOVSX and MOVZX instructions—The AMD-K6 processor has direct hardware support for extending word-size operands to doubleword length.
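For example (the memory operand val16 is a hypothetical 16-bit variable), a word can be loaded into a full 32-bit register without an operand-size prefix:

        MOVZX EAX, WORD PTR [val16]   ;zero-extend a 16-bit load into EAX
        MOVSX EDX, WORD PTR [val16]   ;sign-extend the same word into EDX
    ; This avoids the 66h-prefixed form:
    ;   MOV AX, [val16]               ;writes only AX, leaving the upper
    ;                                 ;half of EAX unchanged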
Use load-execute integer instructions—Most load-execute integer instructions are short-decodeable and can be decoded at the rate of two per cycle. Splitting a load-execute instruction into two separate instructions—a load instruction and a reg, reg instruction—reduces decoding bandwidth and increases register pressure.
Use AL, AX, and EAX to improve code density—In many cases, instructions using AL, AX, and EAX can be encoded in one less byte than the same operation on another general-purpose register. For example, ADD AX, 0x5555 should be encoded 05 55 55 and not 81 C0 55 55.
Clear registers using MOV reg, 0 instead of XOR reg, reg—Executing XOR reg, reg requires additional overhead due to register dependency checking and flag generation. Using MOV reg, 0 produces a limm (load immediate) RISC86 operation that is completed when placed in the scheduler and does not consume execution resources.
Use 8-bit sign-extended immediates—Using 8-bit sign-extended immediates improves code density with no negative effects on the AMD-K6 processor. For example, ADD BX, –5 should be encoded 83 C3 FB and not 81 C3 FB FF.
Use 8-bit sign-extended displacements for conditional branches—Using short, 8-bit sign-extended displacements for conditional branches improves code density with no negative effects on the AMD-K6 processor.
Use integer multiply over shift-add sequences when it is advantageous—The AMD-K6 MMX enhanced processor features a low-latency integer multiplier; therefore, many shift-add sequences have higher latency than the equivalent MUL or IMUL instruction. An exception is the trivial case of multiplication by a power of two by means of a left shift. In general, make the replacement if the shift-add sequence has a latency greater than or equal to three clocks.
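For example, multiplying by 100 (the constant is illustrative) takes three dependent operations as a shift-add sequence, which is the break-even point suggested above:

    ; Shift-add sequence for EAX*100 (three dependent operations):
    ;   LEA EAX, [EAX+EAX*4]  ;EAX*5
    ;   LEA EAX, [EAX+EAX*4]  ;EAX*25
    ;   SHL EAX, 2            ;EAX*100
    ; The low-latency multiplier does it in a single instruction:
        IMUL EAX, EAX, 100    ;EAX = EAX*100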
Carefully choose the best method for pushing memory data—To reduce register pressure and code dependencies, use PUSH [mem] rather than MOV EAX, [mem] followed by PUSH EAX.
Balance the use of CWD, CBW, CDQ, and CWDE—These instructions require special attention to avoid either decreased decode or execution bandwidth. The following replacements illustrate the possible trade-offs:

• The following code replacement trades decode bandwidth (CWD is vector decoded, but with only one RISC86 operation) for execution bandwidth (the MOV/SAR replacement requires two RISC86 operations, including a shift):

Replace:  CWD
With:     MOV DX, AX
          SAR DX, 15
• The following code replacement improves decode bandwidth (CBW is vector decoded, while MOVSX is short decoded):

Replace:  CBW
With:     MOVSX AX, AL
• The following code replacement trades decode bandwidth (CDQ is vector decoded, but with only two RISC86 operations) for execution bandwidth (the MOV/SAR replacement requires two RISC86 operations, including a shift):

Replace:  CDQ
With:     MOV EDX, EAX
          SAR EDX, 31
• The following code replacement improves decode bandwidth (CWDE is vector decoded, while MOVSX is short decoded):

Replace:  CWDE
With:     MOVSX EAX, AX
Replace integer division by constants with multiplication by the reciprocal—This is a commonly used optimization on RISC CPUs. Because the AMD-K6 processor has an extremely fast integer multiply (two cycles) while integer division delivers only two bits of quotient per cycle (approximately 18 cycles for a 32-bit divide), the equivalent multiplication code is much faster. The following examples illustrate integer division by constants:

• Unsigned division by 10 using multiplication by the reciprocal
  Static Latency: 5 cycles
  ; IN:  EAX = dividend
  ; OUT: EDX = quotient
  MOV EDX, 0CCCCCCCDh  ;0.1 * 2^32 * 8, rounded up
  MUL EDX
  SHR EDX, 3           ;divide by 2^32 * 8
• Unsigned division by 3 using multiplication by the reciprocal
  Static Latency: 5 cycles
  ; IN:  EAX = dividend
  ; OUT: EDX = quotient
  MOV EDX, 0AAAAAAABh  ;1/3 * 2^32 * 2, rounded up
  MUL EDX
  SHR EDX, 1           ;divide by 2^32 * 2
• Signed division by 2
  Static Latency: 3 cycles
  ; IN:  EAX = dividend
  ; OUT: EAX = quotient
  CMP EAX, 80000000h   ;CY = 1 if dividend >= 0
  SBB EAX, –1          ;increment dividend if it is < 0
  SAR EAX, 1           ;perform the right shift
• Signed division by 2^n
  Static Latency: 5 cycles
  ; IN:  EAX = dividend
  ; OUT: EAX = quotient
  MOV EDX, EAX         ;sign extend into EDX
  SAR EDX, 31          ;EDX = 0FFFFFFFFh if dividend < 0
  AND EDX, (2^n–1)     ;mask correction (use divisor – 1)
  ADD EAX, EDX         ;apply correction if necessary
  SAR EAX, (n)         ;perform right shift by log2(divisor)
AMD-K6™ Processor Multimedia Coding Optimizations

This section describes multimedia code optimization techniques specific to the AMD-K6 MMX enhanced processor.
Pair MMX instructions with short-decodeable instructions—MMX instructions are short-decodeable and can be simultaneously decoded with any other short-decodeable instruction. This technique requires that the MMX instruction be arranged as the first of a pair of short-decodeable instructions.
Avoid using MMX registers to move double-precision floating-point data—Although using an MMX register to move floating-point data appears fast, doing so requires the EMMS instruction when switching from MMX to floating-point instructions.
Avoid switching between MMX and FPU instructions—Because the MMX registers are mapped onto the floating-point stack, the EMMS instruction must be executed after using MMX code and before using the floating-point unit. Group or partition MMX code away from FPU code so that the use of the EMMS instruction is minimized. Note that the actual penalty from the use of the EMMS instruction occurs not at the time of its execution but when the first floating-point instruction is encountered.
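A minimal sketch of the recommended grouping (the memory operands src, dst, x, y, and z are hypothetical):

    ; Keep all MMX work together, then exit MMX state once:
        MOVQ    MM0, [src]
        PADDSW  MM0, MM1
        MOVQ    [dst], MM0
        EMMS                  ;empty MMX state once, after the MMX block
    ; Only now begin floating-point work:
        FLD     DWORD PTR [x]
        FMUL    DWORD PTR [y]
        FSTP    DWORD PTR [z]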
AMD-K6™ Processor Floating-Point Coding Optimizations

This section describes floating-point code optimization techniques specific to the AMD-K6 MMX enhanced processor.

Avoid vector decoded floating-point instructions—Most floating-point instructions are short decodeable. A few of the less common instructions are vector decoded. Additionally, if a short-decodeable instruction straddles a cache line, it becomes vector decoded. This adds unnecessary overhead that can be avoided by inserting NOPs in strategic locations within the code.
Pair floating-point with short-decodeable instructions—Most floating-point instructions (also known as ESC instructions) are short-decodeable and are limited to the first decoder. The short-decodeable floating-point instructions can be paired with other short-decodeable instructions. This technique requires that the floating-point instruction be arranged as the first of a pair of short-decodeable instructions.
Avoid FXCH usage—Pairing FXCH with other floating-point instructions does not increase performance.
Avoid switching between MMX and FPU instructions—Because the MMX registers are mapped onto the floating-point stack, the EMMS instruction must be executed after using MMX code and before using the floating-point unit. Group or partition MMX code away from FPU code so that the use of the EMMS instruction is minimized. Note that the actual penalty from the use of the EMMS instruction occurs not at the time of its execution but when the first floating-point instruction is encountered.
Avoid using MMX registers to move double-precision floating-point data—Although using an MMX register to move floating-point data appears fast, doing so requires the EMMS instruction when switching from MMX to floating-point instructions.
Avoid splitting floating-point instructions with integer instructions—No penalty is incurred when using arithmetic or comparison floating-point instructions that take integer operands, such as FIADD or FICOM. Splitting these instructions into discrete load and floating-point instructions decreases performance.
Replace FDIV instructions with FMUL where possible—The latency of the FMUL instruction is much lower than that of the FDIV instruction. When possible, replace floating-point division with floating-point multiplication by the reciprocal.
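A sketch for division by a compile-time constant (the operands x, eighth, and result are hypothetical; a precomputed reciprocal is exact only for divisors whose reciprocal is representable, such as powers of two, and is otherwise an approximation):

    ; Instead of:
    ;   FLD  DWORD PTR [x]
    ;   FDIV DWORD PTR [eight]    ;x / 8.0, high latency
    ; multiply by the precomputed reciprocal:
        FLD  DWORD PTR [x]
        FMUL DWORD PTR [eighth]   ;x * 0.125, much lower latency
        FSTP DWORD PTR [result]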
Use integer instructions to move floating-point data—A floating-point load and store instruction pair requires a minimum of four cycles to complete (two-cycle latency for each instruction). The AMD-K6 processor can perform one integer load and one integer store per cycle. Therefore, when using integer loads and stores, moving single-precision data requires one cycle, moving double-precision data requires two cycles, and moving extended-precision data requires only three cycles. The example below shows how to translate the C-style code when moving […]
[…] are identical in throughput to FP reg, reg instructions. Because common floating-point instructions execute in two cycles each and the floating-point unit is not pipelined, code executes more efficiently if the minimum possible number of floating-point instructions is generated.
Floating-Point Code Sample

The following code sample uses three important rules to optimize this matrix multiply routine. The first rule is to force [ESI] to be encoded as [ESI+0]. The second rule is the insertion of NOPs to avoid cache-line straddles. The third rule is avoiding vector decoded instructions.
MATMUL MACRO
db 0d9h, 046h, 00h ;; FLD DWORD PTR [ESI+00] ;;x
FMUL DWORD PTR [EBX] ;; a11*x
FLD DWORD PTR [ESI+4] ;; y
FMUL DWORD PTR [EBX+4] ;; a21*y
FLD DWORD PTR [ESI+8] ;; z
FMUL DWORD PTR [EBX+8] ;; a31*z
FLD DWORD PTR [ESI+12] ;; w
FMUL DWORD PTR [EBX+12] ;; a41*w
FADDP ST(3), ST ;; a41*w+a31*z
FADDP ST(2), ST ;; a41*w+a31*z+a21*y
FADDP ST(1), ST ;; a41*w+a31*z+a21*y+a11*x
FSTP DWORD PTR [EDI] ;; store rx
NOP ;; make sure it does not
;; straddle across a cache line
db 0d9h, 046h, 00h ;; FLD DWORD PTR [ESI+00] ;; x
FMUL DWORD PTR [EBX+16] ;; a12*x
FLD DWORD PTR [ESI+4] ;; y
FMUL DWORD PTR [EBX+20] ;; a22*y
FLD DWORD PTR [ESI+8] ;; z
NOP ;; make sure it does not
;; straddle across a cache line
FMUL DWORD PTR [EBX+24] ;; a32*z
FLD DWORD PTR [ESI+12] ;; w
FMUL DWORD PTR [EBX+28] ;; a42*w
FADDP ST(3), ST ;; a42*w+a32*z
FADDP ST(2), ST ;; a42*w+a32*z+a22*y
FADDP ST(1), ST ;; a42*w+a32*z+a22*y+a12*x
NOP ;; make sure it does not
;; straddle across a cache line
FSTP DWORD PTR [EDI+4] ;; store ry
db 0d9h, 046h, 00h ;; FLD DWORD PTR [ESI+00] ;; x
FMUL DWORD PTR [EBX+32] ;; a13*x
FLD DWORD PTR [ESI+4] ;; y
FMUL DWORD PTR [EBX+36] ;; a23*y
NOP ;; make sure it does not
;; straddle across a cache line
FLD DWORD PTR [ESI+8] ;; z
Table 10. Specific Optimizations and Guidelines for AMD-K6™ and AMD-K5™ Processors (continued)
(Columns: AMD-K5 Processor Guideline/Event; AMD-K5 Processor Details; Usage/Effect on AMD-K6 Processors; AMD-K6 Processor Details)

Dispatch Conflicts
  AMD-K5: Load-balancing (that is, selecting instructions for parallel decode) is still important, but to a lesser extent than on the Pentium processor. In particular, arrange instructions to avoid execution-unit dispatching conflicts.
  AMD-K6: Same.

Byte Operations
  AMD-K5: For byte operations, the high and low bytes of AX, BX, CX, and DX are effectively independent registers that can be operated on in parallel. For example, reading AL does not have a dependency on an outstanding write to AH.
  AMD-K6: Same. Register dependency is checked on a byte boundary.

Floating-Point Top-of-Stack Bottleneck
  AMD-K5: The AMD-K5 processor has a pipelined floating-point unit. Greater parallelism can be achieved by using FXCH in parallel with floating-point operations to alleviate the top-of-stack bottleneck, as in the Pentium.
  AMD-K6: Not required. Loads and stores are performed in parallel with floating-point instructions.

Move and Convert
  AMD-K5: MOVZX, MOVSX, CBW, CWDE, CWD, and CDQ all take 1 cycle (2 cycles for memory-based input).
  AMD-K6: Same. Zero and sign extension are short-decodeable with 1-cycle execution.

Indexed Addressing
  AMD-K5: There is no penalty for base + index addressing in the AMD-K5 processor.
  AMD-K6: Same.

Instruction Prefixes
  AMD-K5: There is no penalty for instruction prefixes, including combinations such as segment-size and operand-size prefixes. This is particularly important for 16-bit code.
  AMD-K6: Possible. A penalty can only occur during accumulated prefix decoding.

Floating-Point Execution Parallelism
  AMD-K5: The AMD-K5 processor permits integer operations (ALU, branch, load/store) in parallel with floating-point operations.
  AMD-K6: Same. In addition, the AMD-K6 processor allows two integer operations, a branch, a load, and a store.

Locating Branch Targets
  AMD-K5: Performance can be sensitive to code alignment, especially in tight loops. Locating branch targets in the first 17 bytes of the 32-byte cache line maximizes the opportunity for parallel execution at the target.
  AMD-K6: Optional. Branch targets should be placed on 0-mod-16 alignment for optimal performance.

NOPs
  AMD-K5: The AMD-K5 processor executes NOPs (opcode 90h) at the rate of two per cycle. Adding NOPs is even more effective if they execute in parallel with existing code.
  AMD-K6: Same. NOPs are short-decodeable and consume decode bandwidth but no execution resources.
Branch Prediction
  AMD-K5: There are two branch prediction bits in a 32-byte instruction cache line. For effective branch prediction, code should be generated with one branch per 16-byte line half.
  AMD-K6: Not required. This optimization has a neutral effect on the AMD-K6 processor.

Bit Scan
  AMD-K5: BSF and BSR take 1 cycle (2 cycles for memory-based input), compared to the Pentium's data-dependent 6 to 34 cycles.
  AMD-K6: Different. A multi-cycle operation, but faster than the Pentium.

Bit Test
  AMD-K5: BT, BTS, BTR, and BTC take 1 cycle for register-based operands, and 2 or 3 cycles for memory-based operands with an immediate bit offset. Register-based bit-offset forms on the AMD-K5 processor take 5 cycles.
  AMD-K6: Different. Bit test latencies are similar to the Pentium.
Table 11. AMD-K6™ Processor Versus Pentium® Processor-Specific Optimizations and Guidelines
(Columns: Pentium Guideline/Event; Pentium Effect; Usage/Effect on AMD-K6 Processors; AMD-K6 Processor Details)

Instruction Fetches Across Two Cache Lines
  Pentium: No penalty.
  AMD-K6: Possible. Decode penalty only if there is not sufficient information to decode at least one instruction.

Mispredicted Conditional Branch Executed in U Pipe
  Pentium: 3-cycle penalty.
  AMD-K6: Different. Mispredicted branches have a 1- to 4-cycle penalty.

Mispredicted Conditional Branch Executed in V Pipe
  Pentium: 4-cycle penalty.
  AMD-K6: Different. Mispredicted branches have a 1- to 4-cycle penalty.

Mispredicted Calls
  Pentium: 3-cycle penalty.
  AMD-K6: None.

Mispredicted Unconditional Jumps
  Pentium: 3-cycle penalty.
  AMD-K6: None.

FXCH Optimizing
  Pentium: Pairs with most FP instructions and effectively hides FP stack manipulations.
  AMD-K6: None.

Index Versus Base Register
  Pentium: 1-cycle penalty to calculate the effective address when an index register is used.
  AMD-K6: None.
[Continuation of the Pentium® processor with MMX technology comparison table; the table caption was lost in extraction.]

Two-Clock Stalls for Writing Then Storing an MMX Register
  Pentium: Requires scheduling the store two cycles after writing (updating) the MMX register.
  AMD-K6: None.

U Pipe: Integer/MMX Pairing
  Pentium: MMX instructions that access either memory or integer registers cannot be executed in the V pipe.
  AMD-K6: Different. Pairing requires a short-decodeable integer instruction as the second instruction.

U Pipe: MMX/Integer Pairing
  Pentium: The V-pipe integer instruction must be pairable.
  AMD-K6: Similar. Pairing requires a short-decodeable integer instruction as the second instruction.

Pairing Two MMX Instructions
  Pentium: Cannot pair two MMX multiplies, two MMX shifts, or MMX instructions in the V pipe with a U-pipe dependency.
  AMD-K6: None.

66h or 67h Prefix Penalty
  Pentium: Three clocks.
  AMD-K6: None.
Table 13. AMD-K6™ Processor and Pentium® Pro Processor-Specific Optimizations
(Columns: Pentium Pro Guideline/Event; Pentium Pro Effect; Usage/Effect on AMD-K6 Processor; AMD-K6 Processor Detail)

Partial-Register Stalls
  Pentium Pro: Avoid reading a large register after writing a smaller version of the same register. This causes the P6 to stall the issuing of instructions that reference the full register, and all subsequent instructions, until after the partial write has retired. If the partial register update is adjacent to a subsequent full register read, the stall lasts at least seven clock cycles with respect to the decoder outputs. On average, such a stall can prevent 3 to 21 micro-ops from being issued.
  AMD-K6: Different. The AMD-K6 processor performs register dependency checking at byte granularity. Due to shorter pipelines, execution latency, and commitment latency, instruction issuing is not affected; however, execution is stalled.