Application Note

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

This document contains information on a product under development at Advanced Micro Devices (AMD). The information is intended to help you evaluate this product. AMD reserves the right to change or discontinue work on this proposed product without notice.

Publication # 21828   Rev: A   Amendment/0
Issue Date: August 1997
8/14/2019 x86 Code Optimization for AMD Processors
Advanced Micro Devices, Inc. ("AMD") reserves the right to make changes in its
products without notice in order to improve design or performance characteristics.
The information in this publication is believed to be accurate at the time of
publication, but AMD makes no representations or warranties with respect to the
accuracy or completeness of the contents of this publication or the information
contained herein, and reserves the right to make changes at any time, without
notice. AMD disclaims responsibility for any consequences resulting from the use
of the information included in this publication.
This publication neither states nor implies any representations or warranties of
any kind, including but not limited to, any implied warranty of merchantability or
fitness for a particular purpose. AMD products are not authorized for use as critical
components in life support devices or systems without AMD’s written approval. AMD assumes no liability whatsoever for claims associated with the sale or use
(including the use of engineering samples) of AMD products except as provided in
AMD’s Terms and Conditions of Sale for such product.
Trademarks
AMD, the AMD logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc.
RISC86 is a registered trademark, and K86, AMD-K5, AMD-K6, and the AMD-K6 logo are trademarks of Advanced
Micro Devices, Inc.
Microsoft and Windows are registered trademarks, and Windows NT is a trademark of Microsoft Corporation.
Pentium is a registered trademark and MMX is a trademark of the Intel Corporation.
Other product names used in this publication are for identification purposes only and may be trademarks of their
respective companies.
AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization 21828A/0 —August 1997
compatibility. An x86 binary-compatible processor implements the industry-standard x86 instruction set by decoding and executing the x86 instruction set as its native mode of operation. Only this native mode permits delivery of maximum performance when running PC software.
The AMD-K6™ MMX™ Enhanced Processor
The AMD-K6 MMX enhanced processor, the first in the AMD-K6 family, brings superscalar RISC performance to desktop systems running industry-standard x86 software. This processor implements advanced design techniques such as multiple execution units, out-of-order execution, data forwarding, register renaming, and dynamic branch prediction. In other words, the AMD-K6 processor is capable of issuing, executing, and retiring multiple x86 instructions per cycle, resulting in superior scalable performance.
Although the AMD-K6 processor is capable of extracting code parallelism out of off-the-shelf, commercially available x86 software, specific code optimizations for the AMD-K6 processor can result in even higher delivered performance. This document describes the RISC86® microarchitecture in the AMD-K6 processor and makes recommendations for optimizing execution of x86 software on the processor. The coding techniques for achieving peak performance on the AMD-K6 processor include, but are not limited to, those recommended for the Pentium® and Pentium Pro processors. However, many of these optimizations are not necessary for the AMD-K6 processor to achieve maximum performance. Due to the more flexible pipeline control of the AMD-K6 microarchitecture, the AMD-K6 processor is not as sensitive to instruction selection and the scheduling of code. This flexibility is one of the distinct advantages of the AMD-K6 processor microarchitecture.
2 The AMD-K6™ Processor RISC86® Microarchitecture
Overview
When discussing processor design, it is important to understand the terms architecture, microarchitecture, and design implementation. The term architecture refers to the instruction set and features of a processor that are visible to software programs running on the processor. The architecture determines what software the processor can run. The architecture of the AMD-K6 MMX processor is the industry-standard x86 instruction set.
The term microarchitecture refers to the design techniques used in the processor to reach the target cost, performance, and functionality goals. The AMD-K6 processor is based on a sophisticated RISC core known as the enhanced RISC86 microarchitecture. The enhanced RISC86 microarchitecture is an advanced, second-order decoupled decode/execution design approach that enables industry-leading performance for x86-based software.

The term design implementation refers to the actual logic and circuit designs from which the processor is created according to the microarchitecture specifications.
RISC86® Microarchitecture
The enhanced RISC86 microarchitecture defines the characteristics of the AMD-K6 MMX enhanced processor. The innovative RISC86 microarchitecture approach implements the x86 instruction set by internally translating x86 instructions into RISC86 operations. These RISC86 operations were specially designed to include direct support for the x86 instruction set while observing the RISC performance principles of fixed-length encoding, regularized instruction fields, and a large register set. The enhanced RISC86 microarchitecture used in the AMD-K6 enables higher processor core performance and promotes straightforward extendibility in future designs. Instead of executing complex x86 instructions, which have lengths of 1 to 15 bytes, the AMD-K6 processor executes the simpler fixed-length RISC86 opcodes, while maintaining the instruction coding efficiencies found in x86 programs.
The AMD-K6 processor includes parallel decoders, a centralized scheduler, and seven execution units that support superscalar operation—multiple decode, execution, and retirement—of x86 instructions. These elements are packed into an aggressive and very efficient six-stage pipeline.
Decoding of the x86 instructions into RISC86 operations begins when the on-chip level-one instruction cache is filled. Predecode logic determines the length of an x86 instruction on a byte-by-byte basis. This predecode information is stored, alongside the x86 instructions, in a dedicated level-one predecode cache to be used later by the decoders. Up to two x86 instructions are decoded per clock on-the-fly, resulting in a maximum of four RISC86 operations per clock with no additional latency.
The AMD-K6 processor categorizes x86 instructions into three types of decodes—short, long, and vector. The decoders process either two short, one long, or one vectored decode at a time. The three types of decodes have the following characteristics:

■ Short decode—common x86 instructions less than or equal to 7 bytes in length that produce one or two RISC86 operations.
■ Long decode—more complex and somewhat common x86 instructions less than or equal to 11 bytes in length that produce up to four RISC86 operations.

■ Vectored decode—complex x86 instructions requiring long sequences of RISC86 operations.
Short and long decodes are processed completely within the decoders. Vectored decodes are started by the decoders with the generation of an initial set of four RISC86 operations, and then completed by fetching a sequence of additional operations from an on-chip ROM (at a rate of four operations per clock). RISC86 operations, whether produced by decoders or fetched from ROM, are then sent to a buffer in the centralized scheduler for dispatch to the execution units.
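The decode classification above can be restated as a small rule function. This is a sketch of the stated byte-length and operation-count rules only, not the processor's actual decode logic:

```python
def decode_type(length_bytes, risc86_ops):
    """Classify an x86 instruction per the decode rules stated above.
    A rough sketch of the rules as written, not the real decode hardware."""
    if length_bytes <= 7 and risc86_ops <= 2:
        return "short"    # two short decodes can proceed per clock
    if length_bytes <= 11 and risc86_ops <= 4:
        return "long"     # one long decode per clock
    return "vector"       # completed from on-chip ROM, 4 ops/clock

print(decode_type(2, 1))    # short (e.g. a simple register ALU op)
print(decode_type(10, 4))   # long
print(decode_type(3, 20))   # vector (long microcode sequence)
```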
The internal RISC86 instruction set consists of the following six categories or types of operations (the execution unit that handles each type of operation is shown in parentheses):

■ Memory load operations (load)
■ Memory store operations (store)
■ Integer register operations (alu/alux)
■ MMX register operations (meu)
■ Floating-point register operations (float)
■ Branch condition evaluations (branch)
The following example shows a series of x86 instructions and the corresponding decoded RISC86 operations.

x86 Instructions     RISC86 Operations
MOV CX, [SP+4]       Load
ADD AX,BX            Alu (Add)
CMP CX,[AX]          Load
                     Alu (Sub)
JZ foo               Branch
The MOV instruction converts to a RISC86 load that requires indirect data to be loaded from memory. The ADD instruction converts to an alu function that can be sent to either of the integer units. The CMP instruction converts into two RISC86 operations. The first RISC86 load operation requires indirect data to be loaded from memory. That value is then compared (alu function) with CX.
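The example above can also be tallied programmatically. The mapping below is simply a restatement of the table, with op-type names of my own choosing:

```python
# RISC86 operation breakdown of the sample sequence above (op types only).
risc86_map = {
    "MOV CX, [SP+4]": ["load"],
    "ADD AX,BX":      ["alu"],
    "CMP CX,[AX]":    ["load", "alu"],  # load the memory operand, then compare
    "JZ foo":         ["branch"],
}
total_ops = sum(len(ops) for ops in risc86_map.values())
print(total_ops)  # 5 RISC86 operations for 4 x86 instructions
```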
Once the RISC86 operations are in the centralized scheduler buffer, they are ready for the scheduler to issue them to the appropriate execution unit. The AMD-K6 processor contains seven execution units—Integer X, Integer Y, Multimedia, Load, Store, Branch, and Floating-Point. Figure 1 shows a block diagram of these units.

The centralized scheduler buffer, in conjunction with the instruction control unit (ICU/scheduler), buffers and manages up to 24 RISC86 operations at a time (which equates to up to 12 x86 instructions). This buffer size (24) is well matched to the processor’s six-stage RISC86 pipeline and seven parallel execution units.

On every clock, the centralized scheduler buffer can accept up to four RISC86 operations from the decoders and issue up to six RISC86 operations to corresponding execution units. (Six RISC86 operations can be issued at a time because the alux and multimedia execution units share the same pipeline.)
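The buffer behavior described above can be sketched as a toy occupancy model (not a cycle-accurate simulator); the capacity and per-clock limits are the figures stated in the text:

```python
def scheduler_clock(buffered, decodable, ready):
    """One clock of the 24-entry centralized scheduler buffer: issue up
    to six ready RISC86 ops, then accept up to four newly decoded ops.
    A toy occupancy model, not a simulator of the real scheduler."""
    CAPACITY, MAX_ACCEPT, MAX_ISSUE = 24, 4, 6
    issued = min(ready, MAX_ISSUE, buffered)
    buffered -= issued
    accepted = min(decodable, MAX_ACCEPT, CAPACITY - buffered)
    return buffered + accepted, issued

# A full buffer with six ready ops drains six and refills four:
print(scheduler_clock(24, 4, 6))  # (22, 6)
```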
[Figure 1: Out-of-Order Execution Engine. Instruction Control Unit; Scheduler Buffer (24 RISC86 operations, six-RISC86-operation issue); Integer X (register), Integer Y (register), Floating-Point, Branch (resolving), and Store units; Store Queue; Level-One Dual-Port Data Cache (32 Kbyte) with 128-entry DTLB.]

When managing the 24 RISC86 operations, the scheduler uses 48 physical registers contained within the RISC86 microarchitecture. The 48 physical registers are located in a
general register file and are grouped as 24 committed or architectural registers plus 24 rename registers. The 24 architectural registers consist of 16 scratch registers and eight registers that correspond to the x86 general-purpose registers—EAX, EBX, ECX, EDX, EBP, ESP, ESI, and EDI.
The AMD-K6 processor offers sophisticated dynamic branch logic that includes the following elements:
■ Branch history/prediction table
■ Branch target cache
■ Return address stack
These components serve to minimize or eliminate the delays due to the branch instructions (jumps, calls, returns) common in x86 software.
The AMD-K6 processor implements a two-level branch prediction scheme based on an 8192-entry branch history table. The branch history table stores prediction information that is used for predicting the direction of conditional branches. The target addresses of conditional and unconditional branches are not predicted, but instead are calculated on-the-fly during instruction decode by special branch target address ALUs. The branch target cache augments performance of taken branches by avoiding a one-cycle cache-fetch penalty. This specialized target cache does this by supplying the first 16 bytes of target instructions to the decoders when a branch is taken.
The return address stack serves to optimize CALL/RET instruction pairs by remembering the return address of each CALL within a nested series of subroutines and supplying it as the predicted target address of the corresponding RET instruction.
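The return-address-stack mechanism can be sketched in a few lines. The stack depth here is a made-up parameter (the document does not state the AMD-K6's depth), and the overflow policy is an assumption:

```python
class ReturnAddressStack:
    """Sketch of a return-address-stack predictor: CALL pushes the
    return address, RET pops it as the predicted target. Depth and
    overflow handling are illustrative assumptions, not K6 specifics."""
    def __init__(self, depth=8):
        self.depth, self.stack = depth, []
    def call(self, return_addr):
        if len(self.stack) == self.depth:
            self.stack.pop(0)          # assumed: oldest entry lost on overflow
        self.stack.append(return_addr)
    def ret(self):
        return self.stack.pop() if self.stack else None  # predicted target

ras = ReturnAddressStack()
ras.call(0x1004); ras.call(0x2008)   # nested CALLs
print(hex(ras.ret()))  # 0x2008 -- innermost return predicted first
print(hex(ras.ret()))  # 0x1004
```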
As shown in Figure 1 on page 6, the high-performance, out-of-order execution engine is mated to a split 64-Kbyte (Harvard architecture) writeback level-one cache with 32 Kbytes of instruction cache and 32 Kbytes of data cache. The level-one instruction cache feeds the decoders and, in turn, the decoders feed the scheduler. The ICU controls the issue and retirement of RISC86 operations contained in the centralized scheduler buffer. The level-one data cache satisfies most memory reads and writes by the load and store execution units.
The store queue temporarily buffers memory writes from the store unit until they can safely be committed into the cache (that is, when all preceding operations have been found to be free of faults and branch mispredictions). The system bus interface is an industry-standard 64-bit Pentium processor-compatible demultiplexed address/data system bus.
The AMD-K6 processor uses the latest in processor microarchitecture techniques to provide the highest x86 performance for today’s PC. In short, the AMD-K6 processor offers true sixth-generation performance and full x86 binary software compatibility.
3 AMD-K6™ Processor Execution Units and Dependency Latencies
The AMD-K6 MMX enhanced processor contains seven specialized execution units—store, load, integer X, integer Y, multimedia, floating-point, and branch condition. Each unit operates independently and handles a specific group of the RISC86 instruction set. This chapter describes the operation of these units, their execution latencies, and how concurrent dependency chains affect those latencies.
A dependency occurs when data needed in one execution unit is being processed in another unit (or the same unit). Additional latencies can occur because the dependent execution unit must wait for the data. Table 1 on page 16 provides a summary of the execution units, the operations performed within these units, the operation latency, and the operation throughput.
Execution Unit Terminology
The execution units operate on two different types of register values—operands and results. There are three types of operands and two types of results.

Operands. The three types of operands are as follows:

■ Address register operands—used for address calculations of load and store operations
■ Data register operands—used for register operations
■ Store data register operands—used for memory stores

Results. The two types of results are as follows:

■ Data register results—from load or register operations
■ Address register results—from Lea or Push operations
The following examples illustrate the operand and result definitions:

Add AX, BX
    The Add operation has two data register operands (AX and BX) and one data register result (AX).

Load BX, [SP+4·CX+8]
    The Load operation has two address register operands (SP and CX as base and index registers, respectively) and a data register result (BX).

Store [SP+4·CX+8], AX
    The Store operation has a store data register operand (AX) and two address register operands (SP and CX as base and index registers, respectively).

Lea SI, [SP+4·CX+8]
    The Lea operation (a type of store operation) has address register operands (SP and CX as base and index registers, respectively), and an address register result.
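The four examples above can be restated as data. The field names below are my own shorthand, not AMD terminology:

```python
# Operand/result breakdown of the four examples above, as data.
# Field names are illustrative shorthand for the document's terms.
examples = {
    "Add AX, BX":            {"data_operands": ["AX", "BX"], "address_operands": [],
                              "store_data_operand": None, "result": ("data register", "AX")},
    "Load BX, [SP+4*CX+8]":  {"data_operands": [], "address_operands": ["SP", "CX"],
                              "store_data_operand": None, "result": ("data register", "BX")},
    "Store [SP+4*CX+8], AX": {"data_operands": [], "address_operands": ["SP", "CX"],
                              "store_data_operand": "AX", "result": None},
    "Lea SI, [SP+4*CX+8]":   {"data_operands": [], "address_operands": ["SP", "CX"],
                              "store_data_operand": None, "result": ("address register", "SI")},
}
# Every memory-addressing operation here uses SP and CX as its
# base and index address register operands.
print(examples["Store [SP+4*CX+8], AX"]["store_data_operand"])  # AX
```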
Six-Stage Pipeline
To help visualize the operations within the AMD-K6 processor, Figure 2 illustrates the six-stage pipeline design. This is a simplified illustration in that the AMD-K6 contains multiple parallel pipelines (starting after common instruction fetch and x86 decode pipe stages), and these pipelines often execute operations out-of-order with respect to each other. This view of the AMD-K6 execution pipeline illustrates the effect of execution latencies for various types of operations.

For register operations that only require one execution cycle, this pipeline is effectively shorter due to the absence of execution stage 2.
The samples starting on page 19 assume that the x86 instructions have already been fetched, decoded, and placed in the centralized scheduler buffer. The RISC86 operations are waiting to be dispatched to the appropriate execution units.
Figure 2. AMD-K6™ Processor Pipeline
Integer and Multimedia Execution Units
The integer X execution unit can execute all ALU operations, multiplies and divides (signed and unsigned), shifts, and rotates. Data register results are available after one clock of execution latency.

The multimedia execution unit (meu) executes all MMX operations and shares pipeline control with the integer X execution unit (an integer X operation and an MMX operation cannot be dispatched simultaneously). In most cases, data register results are available after one clock, and after two clocks for PMULH and PMADD operations.
[Figure 2 pipeline stages: Instruction Fetch, x86-to-RISC86 Decode, RISC86 Issue, Execution Stage 1, Execution Stage 2, Retire]
The integer Y execution unit can execute the basic word and doubleword ALU operations (ADD, AND, CMP, OR, SUB, and XOR) and zero- and sign-extend operations. Data register results are available after one clock.

Figure 3 shows the architecture of the single-stage integer execution pipeline. The operation issue and fetch stages that precede this execution stage are not part of the execution pipeline. The data register operands are received at the end of the operand fetch pipe stage, and the data register result is produced near the end of the execution pipe stage.
Figure 3. Integer/Multimedia Execution Unit
Load Unit
The load unit is a two-stage pipelined design that performs data memory reads. This unit uses two address register operands and a memory data value as inputs and produces a data register result.

The load unit has a two-clock latency from the time it receives the address register operands until it produces a data register result.

Memory read data can come from either the data cache or the store queue entry for a recent store. If the data is forwarded from the store queue, there is zero additional execution latency. This means that a dependent load operation can complete its execution one clock after a store operation completes execution.
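The read path described above (forward from the store queue if a recent store matches, otherwise read the cache) can be sketched functionally, ignoring operand sizes and partial overlap:

```python
def load_value(address, store_queue, data_cache):
    """Memory read path: forward from the youngest matching store-queue
    entry if present, otherwise read the data cache. A functional
    sketch of store-to-load forwarding, not the real lookup hardware."""
    for addr, value in reversed(store_queue):   # youngest store wins
        if addr == address:
            return value
    return data_cache.get(address)

cache = {0x100: 5}
queue = [(0x100, 7)]                    # a recent, not-yet-committed store
print(load_value(0x100, queue, cache))  # 7 -- forwarded from the store queue
print(load_value(0x100, [], cache))     # 5 -- read from the data cache
```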
Figure 4 shows the architecture of the two-stage load execution pipeline. The operation issue and fetch stages that precede this execution stage are not part of the execution pipeline. The address register operands are received at the end of the operand fetch pipe stage, and the data register result is produced near the end of the second execution pipe stage.

Figure 4. Load Execution Unit
[Address register operands (base and index) feed Execution Stage 1 (address calculation); memory data from the data cache or store queue feeds Execution Stage 2 (data cache/store queue lookup), which produces the data register result.]
Store Unit
The store execution unit is a two-stage pipelined design that performs data memory writes and/or, in some cases, produces an address register result. For inputs, the store unit uses two address register operands and, during actual memory writes, a store data register operand. This unit also produces an address register result for some store unit operations. For most store operations, which actually write to memory, the store unit produces a physical memory address and the associated bytes of data to be written. After execution completes, these results are entered in a new store queue entry.
The store unit has a one-clock execution latency from the time it receives address register operands until the time it produces an address register result. The most common examples are the Load Effective Address (Lea) and Store and Update (Push) RISC86 operations, which are produced from the x86 LEA and PUSH instructions, respectively. Most store operations do not
produce an address register result and only perform a memory write. The Push operation is unique because it produces both an address register result and performs a memory write.
The store unit has a one-clock execution latency from the time it receives a store data register operand until it enters a store memory address and data pair into the store queue.

The store unit also has a three-clock latency from the time it receives address register operands until it enters a store memory address and data pair into the store queue.
Note: Address register operands are required at the start of execution, but register data is not required until the end of execution.
Figure 5 shows the architecture of the two-stage store execution pipeline. The operation issue and fetch stages that precede this execution stage are not part of the execution pipeline. The address register operands are received at the end of the operand fetch pipe stage, and the new store queue entry is created upon completion of the second execution pipe stage.
Figure 5. Store Execution Unit
[Address register operands (base and index) and the store data register operand feed Execution Stage 1 (address calculation); Execution Stage 2 produces the address register result and the address/data pair for a new store queue entry.]
Branch Condition Unit
The branch condition unit is separate from the branch prediction logic, which is utilized at x86 instruction decode time. This unit resolves conditional branches, such as JCC and LOOP instructions, at a rate of up to one per clock cycle.
Floating Point Unit
The floating-point unit handles all register operations for x87 instructions. The execution unit is a single-stage design that takes data register operands as inputs and produces a data register result as an output. The most common floating-point instructions have a two-clock execution latency from the time the unit receives data register operands until the time it produces a data register result.
Latencies and Throughput
Table 1 on page 16 summarizes the latencies and throughput of each execution unit.
Table 1. RISC86® Execution Latencies and Throughput

Execution Unit  Operations                                              Latency  Throughput
Integer X       Integer ALU                                             1        1
                Integer Multiply                                        2–3      2–3
                Integer Shift                                           1        1
Multimedia      MMX ALU                                                 1        1
                MMX Shifts, Packs, Unpack                               1        1
                MMX Multiply Low/High                                   1/2      1/2
                MMX Multiply-Accumulate                                 2        2
Integer Y       Basic ALU (16- and 32-bit operands)                     1        1
Load            From Address Register Operands to Data Register Result  2        1
                Memory Read Data from Data Cache/Store Queue
                  to Data Register Result                               0        1
Store           From Address Register Operands to Address
                  Register Result                                       1        1
                From Store Data Register Operands to Store Queue Entry  1        1
                From Address Register Operands to Store Queue Entry     3        1
Branch          Resolves Branch Conditions                              1        1
FPU             FADD, FSUB                                              2        2
                FMUL                                                    2        2

Note: No additional latency exists between execution of dependent operations. Bypassing of register results directly from producing execution units to the operand inputs of dependent units is fully supported. Similarly, forwarding of memory store values from the store queue to dependent load operations is supported.
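Because of the full bypassing described in the note, the minimum time for a fully dependent chain is simply the sum of the per-operation latencies. A sketch with a handful of Table 1 entries, collapsing ranges to their worst case:

```python
# Latencies (clocks) taken from Table 1; the 2-3 multiply range is
# collapsed to its worst case, and "load" is the full two-clock latency.
LATENCY = {
    "load": 2, "int_alu": 1, "int_mul": 3, "int_shift": 1,
    "mmx_alu": 1, "mmx_mac": 2, "fadd": 2, "fmul": 2,
}

def chain_clocks(ops):
    """Minimum clocks for a chain in which each operation depends on the
    previous result: with full result bypassing, latencies simply add."""
    return sum(LATENCY[op] for op in ops)

# e.g. a load feeding an ALU op feeding a shift:
print(chain_clocks(["load", "int_alu", "int_shift"]))  # 4 clocks
```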
Resource Constraints
To optimize code effectively, consider not only the latencies of critical dependencies, but also execution resource constraints. Due to a fixed number of execution units, only so many operations can be issued in each cycle (up to six RISC86 operations per cycle), even though, based on dependencies, more execution parallelism may be possible.
For example, if code contains three consecutive integer operations that have no co-dependencies, they cannot all execute in parallel because there are only two integer execution units. The third operation is delayed by one cycle.
Contention for execution resources causes delays in the issuing and execution of instructions. In addition, stalls due to resource constraints can combine with dependency latencies and exacerbate the resulting stalls. In general, constraints that delay non-critical instructions do not impact performance, because such stalls typically overlap with the execution of critical operations.
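The resource constraint in the example above reduces to a ceiling division. This toy model deliberately ignores dependencies and decode limits:

```python
import math

def issue_clocks(independent_ops, units):
    """Clocks needed to start N mutually independent operations on a
    given number of identical execution units (a toy model only)."""
    return math.ceil(independent_ops / units)

# Three independent integer ops, two integer units (X and Y):
print(issue_clocks(3, 2))  # 2 -- the third op waits a clock despite no dependency
```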
Code Sample Analysis
The samples in this section show the execution behavior of several series of instructions as a function of decode constraints, dependencies, and execution resource constraints. The sample tables show the x86 instructions, the RISC86 operation equivalents, the clock counts, and a description of the events occurring within the processor.
The following nomenclature is used to describe the current location of a RISC86 operation (RISC86op):
■ D — Decode stage
■ IX — Issue stage of integer X unit
■ OX — Operand fetch stage of integer X unit
■ EX1 — Execution stage 1 of integer X unit
■ IY — Issue stage of integer Y unit
■ OY — Operand fetch stage of integer Y unit
■ EY1 — Execution stage 1 of integer Y unit
■ IL — Issue stage of load unit
■ OL — Operand fetch stage of load unit
■ EL1 — Execution stage 1 of load unit
■ EL2 — Execution stage 2 of load unit
■ IS — Issue stage of store unit
■ OS — Operand fetch stage of store unit
■ ES1 — Execution stage 1 of store unit
■ ES2 — Execution stage 2 of store unit
Note: Instructions execute more efficiently (that is, without delays) when scheduled apart by suitable distances based on dependencies. In general, the samples in this section show poorly scheduled code in order to illustrate the resultant effects.
No.  Instruction             RISC86op   Stage sequence (clocks 1–11)
1    DEC EDX                 alu        D IX OX EX1
2    MOV EDI, [ECX]          load       D IL OL EL1 EL2
3    SUB EAX, [EDX+20]       load       D IL OL EL1 EL2
                             alu          IX OX IX OX EX1
4    SAR EAX, 5              alux       D IX OX IX OX EX1
5    ADD ECX, [EDI+4]        load       D IL OL EL1 EL2
                             alu          IY OY IY OY EY1
6    AND EBX, 0x1F           alu        D IY OY EY1
7    MOV ESI, [0x0F100]      load       D IL OL EL1 EL2
8    OR ECX, [ESI+EAX*4+8]   load       D IL OL OL EL1 EL2
                             alu          IX OX OX OX EX1

(A repeated stage indicates a clock spent waiting in, or bumped back to, that stage.)
Comments for Each Instruction Number
1 This simple alu operation ends up in the X pipe.
2 This operation will occupy the load execution unit.
3 The register operand for the load operation is bypassed, without delay, from the result of instruction #1’s register operand. In clock 4, the register operation is ‘bumped’ out of the integer X unit while waiting for the previous load operation result to complete. It is re-issued just in time to receive the bypassed result of the load.
4 Shift instructions are only executable in the integer X unit. The register operation is bumped in clock 5 while waiting for the result of the preceding instruction #3.
5 The register operand for the load operation is bypassed, without delay, from the result of instruction #2’s register operand. Note how this and most surrounding load operations are generated by the instruction decoders, and issued and executed by the load unit smoothly at a rate of one per clock. In clock 5, the register operation is bumped out of the integer Y unit while waiting for the previous load operation result to complete.
6 The register operation falls through into the integer Y unit right behind instruction #5’s register operation.
7 This operation falls into the load unit behind the load in instruction #5.
8 The operand fetch for the load operation is delayed because it needs the result of the immediately preceding load operation #7 as well as the results from earlier instructions #3 and #4.
No.  Instruction              RISC86op   Stage sequence (clocks 1–11)
1    MOV EDX, [0xA0008F00]    load       D IL OL EL1 EL2
2    ADD [EDX+16], 7          load       D IL OL EL1 EL2
                              alu          IX OX IX OX EX1
                              store        IS OS OS ES1 ES2 ES2
3    SUB EAX, [EDX+16]        load       D IL IL OL EL1 EL2 EL2
                              alu          IX OX IX IX OX OX EX1
4    PUSH EAX                 store      D IS IS OS ES1 ES2 ES2 ES2
5    LEA EBX, [ECX+EAX*4+3]   store      D IS OS OS OS ES1 ES2
6    MOV EDI, EBX             alu        D IY OY OY OY OY OY EY1
Comments for Each Instruction Number
1 This operation will occupy the load unit.
2 This long-decoded ADD instruction takes a single clock to decode. The operand fetch for the load operation is delayed waiting for the result of the previous load operation from instruction #1. The store operation completes concurrent with the register operation. The result of the register operation is bypassed directly into a new store queue entry created by the store operation.
3 The issue of the load operation is delayed because the operand fetch of the preceding load operation from instruction #2 was delayed. The completion of the load operation is held up due to a memory dependency on the preceding store operation of instruction #2. The load operation completes immediately after the store operation, with the store data being forwarded from a new store queue entry.
4 Completion of the store operation is held up due to a data dependency on the preceding instruction #3. The store data is bypassed directly into a new store queue entry from the result of instruction #3’s register operation.
5 The Lea RISC86 operation is executed by the store unit. The operand fetch is delayed waiting for the result of instruction #3. The register result value is produced in the first execution stage of the store unit.
6 This simple alu operation is stalled due to the dependency on the EBX result of instruction #5.
No.  Instruction            RISC86op   Stage sequence (clocks 1–10)
1    MOVQ MM0, [EAX]        mload      D IL OL EL1 EL2
2    PSUBSW MM0, [EAX+16]   mload      D IL OL EL1 EL2
                            alux         IX OX OX OX EX1
3    ADD EBX, ECX           alu        D IY OY EY1
4    PADDSW MM1, MM2        alux       D IX IX IX OX EX1
5    PUSH EBX               store      D IS OS ES1 ES2
6    PMADDWD MM0, MM1       alux       D IX OX EX1 EX1
7    ADD EAX, 32            alu        D IY OY EY1
8    MOVQ [EDI], MM0        mstore     D IS OS ES1 ES2 ES2
9    ADD EDI, 8             alu        D IY OY EY1
Comments for Each Instruction Number
1 This multimedia operation occupies the load unit.
2 Instruction #2 could not be decoded along with the preceding instruction because MMX instructions can be decoded only in the first decode position. The MMX register operation is executable only by the integer X unit. The operand fetch is delayed because of the dependency on the load.
3 This instruction can be decoded in parallel with instruction #2 because it is not an MMX instruction. It is issued to the integer Y unit in parallel with the issuing of the preceding MMX register operation in instruction #2.
4 This instruction is executable only in the integer X unit. The issue of this MMX instruction is delayed due to the delay of the operand fetch of the preceding MMX register operation.
5 This instruction stores the contents of EBX in memory.
6 Instruction #6 is executable only in the integer X unit. This non-pipelined unit has a two-clock execution latency for this instruction, and it is delayed due to 'stacking up' behind the preceding MMX operations.
7 This instruction is issued to the integer Y unit in parallel with the series of preceding MMX register operations being issued to the integer X unit.
8 Completion of this store operation is held up due to a data dependency on the preceding MMX register operation from instruction #6. The store data is bypassed directly into a new store queue entry from the result of the MMX operation.
9 This instruction is issued to the integer Y unit in parallel with the series of preceding MMX register operations being issued to the integer X unit.
Chapter 4: Instruction Dispatch and Execution Timing
AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization 21828A/0 —August 1997
• eXX—register width depending on the operand size
• mem32real—32-bit floating-point value in memory
• mem64real—64-bit floating-point value in memory
• mem80real—80-bit floating-point value in memory
• mmreg—MMX register
• mmreg1—MMX register defined by bits 5, 4, and 3 of the modR/M byte
• mmreg2—MMX register defined by bits 2, 1, and 0 of the modR/M byte

The second and third columns list all applicable opcode byte encodings.

The fourth column lists the modR/M byte when used by the instruction. The modR/M byte defines the instruction as register or memory form. If mod bits 7 and 6 are documented as mm (memory form), mm can only be 10b, 01b, or 00b.
The fifth column lists the type of instruction decode—short, long, or vectored. The AMD-K6 MMX enhanced processor decode logic can process two short, one long, or one vectored decode per clock. In addition, two short integer instructions, one short integer and one short MMX instruction, or one short integer and one short FPU instruction can be decoded simultaneously.

Note: In order to simultaneously decode an integer instruction with a floating-point or MMX instruction, the floating-point or MMX instruction must precede the integer instruction.
The sixth column lists the type of RISC86 operation(s) required for the instruction. The operation types and corresponding execution units are as follows:
• load, fload, mload—load unit
• store, fstore, mstore—store unit
• alu—either of the integer execution units
• alux—integer X execution unit only
• branch—branch condition unit
• float—floating-point execution unit
• meu—multimedia execution unit
• limm—load immediate, instruction control unit
The operation(s) of most instructions form a single dependency chain. For instructions whose operations form two parallel dependency chains, the RISC86 operations and execution latency for each dependency chain are shown on a separate row.
Chapter 5: x86 Optimization Coding Guidelines
General x86 Optimization Techniques
This section describes general code optimization techniques specific to superscalar processors (that is, techniques common to the AMD-K6 MMX enhanced processor, the AMD-K5™ processor, and Pentium-family processors). In general, all optimization techniques used for the AMD-K5 processor, Pentium, and Pentium Pro processors either improve the performance of the AMD-K6 processor or are not required and have no effect (due to fewer coding restrictions with the AMD-K6 processor).
Short Forms—Use shorter forms of instructions to increase the effective number of instructions that can be examined for decoding at any one time. Use 8-bit displacements and jump offsets where possible.

Simple Instructions—Use simple instructions with hardwired decode (pairable, short, or fast) because they perform more efficiently. This includes "register←register op memory" as well as "register←register op register" forms of instructions.
Dependencies—Spread out true dependencies to increase the opportunities for parallel execution. Anti-dependencies and output dependencies do not impact performance.
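As a hedged illustration (the register names and memory operands are chosen arbitrarily), interleaving an independent chain spreads out the true dependencies so both integer units can stay busy:

    ; Back-to-back dependent chain (each ADD waits on the previous one):
    ;   MOV EAX, [a]
    ;   ADD EAX, [b]
    ;   ADD EAX, [c]
    ; Interleaving a second, independent chain in EBX exposes parallelism:
        MOV EAX, [a]
        MOV EBX, [d]
        ADD EAX, [b]
        ADD EBX, [e]
        ADD EAX, [c]
        ADD EBX, [f]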
Memory Operands—Instructions that operate on data in memory (load/op/store) can inhibit parallelism. The use of separate move and ALU instructions allows better code scheduling for independent operations. However, if there are no opportunities for parallel execution, use the load/op/store forms to reduce the number of register spills (storing values in memory to free registers for other uses).
Register Operands—Maintain frequently used values in registers rather than in memory.
Stack References—Use ESP for stack references so that EBP remains available.
Stack Allocation—When allocating space for local variables and/or outgoing parameters within a procedure, adjust the stack pointer and use moves rather than pushes. This method of allocation allows random access to the outgoing parameters so that they can be set up when they are calculated, instead of being held somewhere else until the procedure call. This method also reduces ESP dependencies and uses fewer execution resources.
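A minimal sketch of this approach (the 12-byte frame size, argument offsets, and the callee name func are illustrative assumptions, including caller cleanup of the arguments):

    ; Instead of pushing each outgoing argument:
    ;   PUSH ECX
    ;   PUSH EBX
    ;   PUSH EAX
    ;   CALL func
    ; adjust ESP once and store the arguments with moves, in any order:
        SUB  ESP, 12          ;reserve space for three DWORD arguments
        MOV  [ESP+8], ECX     ;third argument, stored as soon as available
        MOV  [ESP], EAX       ;first argument
        MOV  [ESP+4], EBX     ;second argument
        CALL func
        ADD  ESP, 12          ;caller cleans up (an assumption)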
Data Embedding—When data is embedded in the code segment, align it in cache blocks separate from nearby code. This technique avoids some of the overhead of maintaining coherency between the instruction and data caches.
Loops—Unroll loops to get more parallelism and reduce loop overhead, even with branch prediction. Inline small routines to avoid procedure-call overhead. For both techniques, however, consider the cost of possible increased register usage, which might add load/store instructions for register spilling.
Code Alignment—Aligning at 0-mod-16 improves performance(ideally at 0-mod-32). However, there is a trade-off betweenexecution speed and code size.
General AMD-K6™ Processor x86 Coding Optimizations
This section describes general code optimization techniques specific to the AMD-K6 MMX enhanced processor.
Use short-decodeable instructions—To increase decode bandwidth and minimize the number of RISC86 operations per x86 instruction, use short-decodeable x86 instructions. See "Instruction Dispatch and Execution Timing" on page 23 for the list of short-decodeable instructions.
Pair short-decodeable instructions—Two short-decodeable x86 instructions can be decoded per clock, using the full decode bandwidth of the AMD-K6 processor.
Avoid using complex instructions—The more complex and uncommon instructions are vector decoded and can generate a larger ratio of RISC86 operations per x86 instruction than short-decodeable or long-decodeable instructions.
0Fh prefix usage—0Fh does not count as a prefix.
Avoid long instruction length—Use x86 instructions that are less than eight bytes in length. An x86 instruction that is longer than seven bytes cannot be short-decoded.
Align branch targets—Keep branch targets away from the end of a cache line. 16-byte alignment is preferred for branch targets, while 32-byte alignment is ideal.
Use read-modify-write instructions over the discrete equivalent—No advantage is gained by splitting read-modify-write instructions into a load-execute-store instruction group. Both read-modify-write instructions and load-execute-store instruction groups decode and execute in one cycle, but read-modify-write instructions promote better code density.
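For instance (the memory operand counter is a hypothetical variable), the single read-modify-write form is preferred here:

    ; Preferred read-modify-write form (one x86 instruction):
        ADD DWORD PTR [counter], 1
    ; Equivalent load-execute-store group (same timing, worse density):
    ;   MOV EAX, [counter]
    ;   ADD EAX, 1
    ;   MOV [counter], EAX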
Move rarely used code and data to separate pages—Placing code such as exception handlers and data such as error text messages in separate pages maximizes the use of the TLBs and prevents their pollution with rarely used items.
Avoid multiple and accumulated prefixes—In order to accomplish an instruction decode, the decoders require sufficient predecode information. When an instruction has multiple prefixes and its length cannot be deduced by the decoders (due to a lack of data in the instruction decode buffer), the first decoder retires and accumulates one prefix at a time until the instruction is completely decoded. Table 9 shows when prefixes are accumulated and decoding is serialized.
Avoid mixing code size types—Size prefixes that affect the length of an instruction can sometimes inhibit dual decoding.
Always pair CALL and RETURN—If CALLs and RETs are notpaired, the return address stack gets out of synchronization,increasing the latency of returns and decreasing performance.
Exploit parallel execution of integer and floating-point multiplies—The AMD-K6 MMX enhanced processor allows simultaneous integer and floating-point multiplies using separate, low-latency multipliers.
Avoid more than 16 levels of nesting in subroutines—More than 16 levels of nested subroutine calls overflow the return address stack, leading to lower performance. While this is not a problem for most code, recursive subroutines might easily exceed 16 levels of subroutine calls. If the recursive subroutine is tail recursive, it can usually be mechanically transformed into an iterative version, which leads to increased performance.
Table 9. Decode Accumulation and Serialization

Decoder #1    Decoder #2                          Result
Instruction   —                                   Single instruction decoded
Instruction   Instruction                         Dual instruction decode
Instruction   Prefix                              Single instruction decoded; the prefix is accumulated
Prefix        Instruction (modified by prefix)    No prefix accumulation; a single instruction is decoded
PrefixA       PrefixB                             PrefixA is accumulated and decode of the second prefix is canceled
PrefixB       Instruction                         If a prefix was already accumulated in the previous decode cycle, PrefixB is accumulated, the instruction decode is canceled, and the instruction is decoded in the next decode cycle
Place frequently used stack data within 128 bytes of EBP—The statically most-referenced data items in a function's stack frame should be located from –128 to +127 bytes from EBP. This technique improves code density by enabling the use of an 8-bit sign-extended displacement instead of a 32-bit displacement.
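As an illustration (the offsets are arbitrary), both of the accesses below encode with a one-byte displacement:

        MOV EAX, [EBP-8]      ;frequently used local: 8-bit displacement
        MOV EDX, [EBP+12]     ;incoming argument: 8-bit displacement
    ; A rarely used item can live farther away and pay the 32-bit
    ; displacement cost instead:
    ;   MOV ECX, [EBP-256]    ;requires a 32-bit displacement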
Avoid superset dependencies—Using the larger form of a register immediately after an instruction uses the smaller form creates a superset dependency and prevents parallel execution. For example, avoid the following type of code:

Avoid:  OR  AH, 055h
        AND EAX, 1555555h
Avoid excessive loop unrolling or code inlining—Excessive loop unrolling or code inlining increases code size and reduces locality, which leads to lower cache hit rates and reduced performance.
Avoid splitting a 16-bit memory access in 32-bit code—No advantage is gained by splitting a 16-bit memory access in 32-bit code into two byte-sized accesses, even though the split avoids the operand-size override prefix.
Avoid data-dependent branches around a single instruction—Data-dependent branches acting upon essentially random data cause the branch prediction logic to mispredict the branch about 50% of the time. Design branch-free alternative code sequences to replace straightforward code that contains data-dependent branches. The effect is a shorter average execution time. The following example illustrates this concept:
• Signed integer ABS function (x = labs(x))
Static Latency: 4 cycles
MOV ECX, [x] ;load value
MOV EBX, ECX
SAR ECX, 31
XOR EBX, ECX ;1’s complement if x<0, else don’t modify
SUB EBX, ECX ;2’s complement if x<0, else don’t modify
MOV [x], EBX ;save labs result
AMD-K6™ Processor Integer x86 Coding Optimizations

This section describes integer code optimization techniques specific to the AMD-K6 MMX enhanced processor.
Neutral code filler—Use the XCHG EAX, EAX or NOP instruction when aligning instructions. XCHG EAX, EAX consumes decode slots but requires no execution resources. Essentially, the scheduler absorbs the equivalent RISC86 operation without requiring any of the execution units.
Inline REP string instructions with low counts—Expand REP string instructions into equivalent sequences of simple x86 instructions. This technique eliminates the setup overhead of these instructions and increases instruction throughput.
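A sketch of this expansion for a small fixed copy of three doublewords (the registers follow the implicit REP MOVSD operands; this assumes ESI and EDI do not need to be advanced afterward):

    ; Instead of:
    ;   MOV ECX, 3
    ;   REP MOVSD             ;copy 3 DWORDs from [ESI] to [EDI]
    ; expand into simple moves:
        MOV EAX, [ESI]
        MOV [EDI], EAX
        MOV EAX, [ESI+4]
        MOV [EDI+4], EAX
        MOV EAX, [ESI+8]
        MOV [EDI+8], EAX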
Use ADD reg, reg instead of SHL reg, 1—This optimization technique allows the scheduler to use either of the two integer adders rather than the single shifter and effectively increases overall throughput. The only difference between these two instructions is the setting of the AF flag.
Access 16-bit memory data using the MOVSX and MOVZX instructions—The AMD-K6 processor has direct hardware support for extending word-size operands to doubleword length.
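For example (the memory operand val16 is a hypothetical 16-bit variable), a word can be loaded into a full 32-bit register without an operand-size prefix:

        MOVZX EAX, WORD PTR [val16]   ;zero-extend a 16-bit load into EAX
        MOVSX EDX, WORD PTR [val16]   ;sign-extend the same word into EDX
    ; This avoids the 66h-prefixed form:
    ;   MOV AX, [val16]               ;writes only AX, leaving the upper
    ;                                 ;half of EAX unchanged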
Use load-execute integer instructions—Most load-execute integer instructions are short-decodeable and can be decoded at the rate of two per cycle. Splitting a load-execute instruction into two separate instructions—a load instruction and a reg, reg instruction—reduces decoding bandwidth and increases register pressure.
Use AL, AX, and EAX to improve code density—In many cases, instructions using AL, AX, and EAX can be encoded in one less byte than the same operation on another general-purpose register. For example, ADD AX, 0x5555 should be encoded 05 55 55 and not 81 C0 55 55.
Clear registers using MOV reg, 0 instead of XOR reg, reg—Executing XOR reg, reg requires additional overhead due to register dependency checking and flag generation. Using MOV reg, 0 produces a limm (load immediate) RISC86 operation that is completed when placed in the scheduler and does not consume execution resources.
Use 8-bit sign-extended immediates—Using 8-bit sign-extended immediates improves code density with no negative effects on the AMD-K6 processor. For example, ADD BX, –5 should be encoded 83 C3 FB and not 81 C3 FB FF.
Use 8-bit sign-extended displacements for conditional branches—Using short, 8-bit sign-extended displacements for conditional branches improves code density with no negative effects on the AMD-K6 processor.
Use integer multiply over shift-add sequences when it is advantageous—The AMD-K6 MMX enhanced processor features a low-latency integer multiplier; therefore, many shift-add sequences have higher latency than the equivalent MUL or IMUL instruction. An exception is the trivial case of multiplication by a power of two by means of a left shift. In general, make the replacement if the shift-add sequence has a latency greater than or equal to three clocks.
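For example, multiplying by 100 (the constant is illustrative) takes three dependent operations as a shift-add sequence, which is the break-even point suggested above:

    ; Shift-add sequence for EAX*100 (three dependent operations):
    ;   LEA EAX, [EAX+EAX*4]  ;EAX*5
    ;   LEA EAX, [EAX+EAX*4]  ;EAX*25
    ;   SHL EAX, 2            ;EAX*100
    ; The low-latency multiplier does it in a single instruction:
        IMUL EAX, EAX, 100    ;EAX = EAX*100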
Carefully choose the best method for pushing memory data—To reduce register pressure and code dependencies, use PUSH [mem] rather than MOV EAX, [mem] followed by PUSH EAX.
Balance the use of CWD, CBW, CDQ, and CWDE—These instructions require special attention to avoid either decreased decode or execution bandwidth. The following replacements illustrate the possible trade-offs:

• The following code replacement trades decode bandwidth (CWD is vector decoded, but with only one RISC86 operation) for execution bandwidth (the MOV/SAR replacement requires two RISC86 operations, including a shift):

Replace:  CWD
With:     MOV DX, AX
          SAR DX, 15
• The following code replacement improves decode bandwidth (CBW is vector decoded, while MOVSX is short decoded):

Replace:  CBW
With:     MOVSX AX, AL
• The following code replacement trades decode bandwidth (CDQ is vector decoded, but with only two RISC86 operations) for execution bandwidth (the MOV/SAR replacement requires two RISC86 operations, including a shift):

Replace:  CDQ
With:     MOV EDX, EAX
          SAR EDX, 31
• The following code replacement improves decode bandwidth (CWDE is vector decoded, while MOVSX is short decoded):

Replace:  CWDE
With:     MOVSX EAX, AX
Replace integer division by constants with multiplication by the reciprocal—This is a commonly used optimization on RISC CPUs. Because the AMD-K6 processor has an extremely fast integer multiply (two cycles) while integer division delivers only two bits of quotient per cycle (approximately 18 cycles for a 32-bit divide), the equivalent multiplication code is much faster. The following examples illustrate integer division by constants:

• Unsigned division by 10 using multiplication by the reciprocal
  Static Latency: 5 cycles
  ; IN:  EAX = dividend
  ; OUT: EDX = quotient
  MOV EDX, 0CCCCCCCDh  ;0.1 * 2^32 * 8, rounded up
  MUL EDX
  SHR EDX, 3           ;divide by 2^32 * 8
• Unsigned division by 3 using multiplication by the reciprocal
  Static Latency: 5 cycles
  ; IN:  EAX = dividend
  ; OUT: EDX = quotient
  MOV EDX, 0AAAAAAABh  ;1/3 * 2^32 * 2, rounded up
  MUL EDX
  SHR EDX, 1           ;divide by 2^32 * 2
• Signed division by 2
  Static Latency: 3 cycles
  ; IN:  EAX = dividend
  ; OUT: EAX = quotient
  CMP EAX, 80000000h   ;CY = 1 if dividend >= 0
  SBB EAX, –1          ;increment dividend if it is < 0
  SAR EAX, 1           ;perform the right shift
• Signed division by 2^n
  Static Latency: 5 cycles
  ; IN:  EAX = dividend
  ; OUT: EAX = quotient
  MOV EDX, EAX         ;sign extend into EDX
  SAR EDX, 31          ;EDX = 0FFFFFFFFh if dividend < 0
  AND EDX, (2^n–1)     ;mask correction (use divisor – 1)
  ADD EAX, EDX         ;apply correction if necessary
  SAR EAX, (n)         ;perform right shift by log2(divisor)
AMD-K6™ Processor Multimedia Coding Optimizations

This section describes multimedia code optimization techniques specific to the AMD-K6 MMX enhanced processor.
Pair MMX instructions with short-decodeable instructions—MMX instructions are short-decodeable and can be simultaneously decoded with any other short-decodeable instruction. This technique requires that the MMX instruction be arranged as the first of a pair of short-decodeable instructions.
Avoid using MMX registers to move double-precision floating-point data—Although using an MMX register to move floating-point data appears fast, doing so requires the EMMS instruction when switching from MMX to floating-point instructions.
Avoid switching between MMX and FPU instructions—Because the MMX registers are mapped onto the floating-point stack, the EMMS instruction must be executed after using MMX code and before using the floating-point unit. Group or partition MMX code away from FPU code so that the use of the EMMS instruction is minimized. Note that the actual penalty from the use of the EMMS instruction occurs not at the time of its execution but when the first floating-point instruction is encountered.
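A minimal sketch of the recommended grouping (the memory operands src, dst, x, y, and z are hypothetical):

    ; Keep all MMX work together, then exit MMX state once:
        MOVQ    MM0, [src]
        PADDSW  MM0, MM1
        MOVQ    [dst], MM0
        EMMS                  ;empty MMX state once, after the MMX block
    ; Only now begin floating-point work:
        FLD     DWORD PTR [x]
        FMUL    DWORD PTR [y]
        FSTP    DWORD PTR [z]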
AMD-K6™ Processor Floating-Point Coding Optimizations

This section describes floating-point code optimization techniques specific to the AMD-K6 MMX enhanced processor.

Avoid vector decoded floating-point instructions—Most floating-point instructions are short decodeable. A few of the less common instructions are vector decoded. Additionally, if a short-decodeable instruction straddles a cache line, it becomes vector decoded. This adds unnecessary overhead that can be avoided by inserting NOPs in strategic locations within the code.
Pair floating-point with short-decodeable instructions—Most floating-point instructions (also known as ESC instructions) are short-decodeable and are limited to the first decoder. The short-decodeable floating-point instructions can be paired with other short-decodeable instructions. This technique requires that the floating-point instruction be arranged as the first of a pair of short-decodeable instructions.
Avoid FXCH usage—Pairing FXCH with other floating-point instructions does not increase performance.
Avoid switching between MMX and FPU instructions—Because the MMX registers are mapped onto the floating-point stack, the EMMS instruction must be executed after using MMX code and before using the floating-point unit. Group or partition MMX code away from FPU code so that the use of the EMMS instruction is minimized. Note that the actual penalty from the use of the EMMS instruction occurs not at the time of its execution but when the first floating-point instruction is encountered.
Avoid using MMX registers to move double-precision floating-point data—Although using an MMX register to move floating-point data appears fast, doing so requires the EMMS instruction when switching from MMX to floating-point instructions.
Avoid splitting floating-point instructions with integer instructions—No penalty is incurred when using arithmetic or comparison floating-point instructions that take integer operands, such as FIADD or FICOM. Splitting these instructions into discrete load and floating-point instructions decreases performance.
Replace FDIV instructions with FMUL where possible—The latency of the FMUL instruction is much lower than that of the FDIV instruction. When possible, replace floating-point division with floating-point multiplication by the reciprocal.
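A sketch for division by a compile-time constant (the operands x, eighth, and result are hypothetical; a precomputed reciprocal is exact only for divisors whose reciprocal is representable, such as powers of two, and is otherwise an approximation):

    ; Instead of:
    ;   FLD  DWORD PTR [x]
    ;   FDIV DWORD PTR [eight]    ;x / 8.0, high latency
    ; multiply by the precomputed reciprocal:
        FLD  DWORD PTR [x]
        FMUL DWORD PTR [eighth]   ;x * 0.125, much lower latency
        FSTP DWORD PTR [result]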
Use integer instructions to move floating-point data—A floating-point load and store instruction pair requires a minimum of four cycles to complete (two-cycle latency for each instruction). The AMD-K6 processor can perform one integer load and one integer store per cycle. Therefore, when using integer loads and stores, moving single-precision data requires one cycle, moving double-precision data requires two cycles, and moving extended-precision data requires only three cycles. The example below shows how to translate the C-style code when moving […]
[…] are identical in throughput to FP reg, reg instructions. Because common floating-point instructions execute in two cycles each and the floating-point unit is not pipelined, code executes more efficiently if the minimum possible number of floating-point instructions is generated.
Floating-Point Code Sample

The following code sample uses three important rules to optimize this matrix multiply routine. The first rule is to force [ESI] to be encoded as [ESI+0]. The second rule is the insertion of NOPs to avoid cache-line straddles. The third rule is avoiding vector decoded instructions.
MATMUL MACRO
db 0d9h, 046h, 00h ;; FLD DWORD PTR [ESI+00] ;;x
FMUL DWORD PTR [EBX] ;; a11*x
FLD DWORD PTR [ESI+4] ;; y
FMUL DWORD PTR [EBX+4] ;; a21*y
FLD DWORD PTR [ESI+8] ;; z
FMUL DWORD PTR [EBX+8] ;; a31*z
FLD DWORD PTR [ESI+12] ;; w
FMUL DWORD PTR [EBX+12] ;; a41*w
FADDP ST(3), ST ;; a41*w+a31*z
FADDP ST(2), ST ;; a41*w+a31*z+a21*y
FADDP ST(1), ST ;; a41*w+a31*z+a21*y+a11*x
FSTP DWORD PTR [EDI] ;; store rx
NOP ;; make sure it does not
;; straddle across a cache line
db 0d9h, 046h, 00h ;; FLD DWORD PTR [ESI+00] ;; x
FMUL DWORD PTR [EBX+16] ;; a12*x
FLD DWORD PTR [ESI+4] ;; y
FMUL DWORD PTR [EBX+20] ;; a22*y
FLD DWORD PTR [ESI+8] ;; z
NOP ;; make sure it does not
;; straddle across a cache line
FMUL DWORD PTR [EBX+24] ;; a32*z
FLD DWORD PTR [ESI+12] ;; w
FMUL DWORD PTR [EBX+28] ;; a42*w
FADDP ST(3), ST ;; a42*w+a32*z
FADDP ST(2), ST ;; a42*w+a32*z+a22*y
FADDP ST(1), ST ;; a42*w+a32*z+a22*y+a12*x
NOP ;; make sure it does not
;; straddle across a cache line
FSTP DWORD PTR [EDI+4] ;; store ry
db 0d9h, 046h, 00h ;; FLD DWORD PTR [ESI+00] ;; x
FMUL DWORD PTR [EBX+32] ;; a13*x
FLD DWORD PTR [ESI+4] ;; y
FMUL DWORD PTR [EBX+36] ;; a23*y
NOP ;; make sure it does not
;; straddle across a cache line
FLD DWORD PTR [ESI+8] ;; z
Table 10. Specific Optimizations and Guidelines for AMD-K6™ and AMD-K5™ Processors (continued)
(Columns: AMD-K5 Processor Guideline/Event; AMD-K5 Processor Details; Usage/Effect on AMD-K6 Processors; AMD-K6 Processor Details)

Dispatch Conflicts
  AMD-K5: Load-balancing (that is, selecting instructions for parallel decode) is still important, but to a lesser extent than on the Pentium processor. In particular, arrange instructions to avoid execution-unit dispatching conflicts.
  AMD-K6: Same.

Byte Operations
  AMD-K5: For byte operations, the high and low bytes of AX, BX, CX, and DX are effectively independent registers that can be operated on in parallel. For example, reading AL does not have a dependency on an outstanding write to AH.
  AMD-K6: Same. Register dependency is checked on a byte boundary.

Floating-Point Top-of-Stack Bottleneck
  AMD-K5: The AMD-K5 processor has a pipelined floating-point unit. Greater parallelism can be achieved by using FXCH in parallel with floating-point operations to alleviate the top-of-stack bottleneck, as in the Pentium.
  AMD-K6: Not required. Loads and stores are performed in parallel with floating-point instructions.

Move and Convert
  AMD-K5: MOVZX, MOVSX, CBW, CWDE, CWD, and CDQ all take 1 cycle (2 cycles for memory-based input).
  AMD-K6: Same. Zero and sign extension are short-decodeable with 1-cycle execution.

Indexed Addressing
  AMD-K5: There is no penalty for base + index addressing in the AMD-K5 processor.
  AMD-K6: Same.

Instruction Prefixes
  AMD-K5: There is no penalty for instruction prefixes, including combinations such as segment-size and operand-size prefixes. This is particularly important for 16-bit code.
  AMD-K6: Possible. A penalty can only occur during accumulated prefix decoding.

Floating-Point Execution Parallelism
  AMD-K5: The AMD-K5 processor permits integer operations (ALU, branch, load/store) in parallel with floating-point operations.
  AMD-K6: Same. In addition, the AMD-K6 processor allows two integer operations, a branch, a load, and a store.

Locating Branch Targets
  AMD-K5: Performance can be sensitive to code alignment, especially in tight loops. Locating branch targets in the first 17 bytes of the 32-byte cache line maximizes the opportunity for parallel execution at the target.
  AMD-K6: Optional. Branch targets should be placed on 0-mod-16 alignment for optimal performance.

NOPs
  AMD-K5: The AMD-K5 processor executes NOPs (opcode 90h) at the rate of two per cycle. Adding NOPs is even more effective if they execute in parallel with existing code.
  AMD-K6: Same. NOPs are short-decodeable and consume decode bandwidth but no execution resources.
Branch Prediction
  AMD-K5: There are two branch prediction bits in a 32-byte instruction cache line. For effective branch prediction, code should be generated with one branch per 16-byte line half.
  AMD-K6: Not required. This optimization has a neutral effect on the AMD-K6 processor.

Bit Scan
  AMD-K5: BSF and BSR take 1 cycle (2 cycles for memory-based input), compared to the Pentium's data-dependent 6 to 34 cycles.
  AMD-K6: Different. A multi-cycle operation, but faster than the Pentium.

Bit Test
  AMD-K5: BT, BTS, BTR, and BTC take 1 cycle for register-based operands, and 2 or 3 cycles for memory-based operands with an immediate bit offset. Register-based bit-offset forms on the AMD-K5 processor take 5 cycles.
  AMD-K6: Different. Bit test latencies are similar to the Pentium.
Table 11. AMD-K6™ Processor Versus Pentium® Processor-Specific Optimizations and Guidelines
(Columns: Pentium Guideline/Event; Pentium Effect; Usage/Effect on AMD-K6 Processors; AMD-K6 Processor Details)

Instruction Fetches Across Two Cache Lines
  Pentium: No penalty.
  AMD-K6: Possible. Decode penalty only if there is not sufficient information to decode at least one instruction.

Mispredicted Conditional Branch Executed in U Pipe
  Pentium: 3-cycle penalty.
  AMD-K6: Different. Mispredicted branches have a 1- to 4-cycle penalty.

Mispredicted Conditional Branch Executed in V Pipe
  Pentium: 4-cycle penalty.
  AMD-K6: Different. Mispredicted branches have a 1- to 4-cycle penalty.

Mispredicted Calls
  Pentium: 3-cycle penalty.
  AMD-K6: None.

Mispredicted Unconditional Jumps
  Pentium: 3-cycle penalty.
  AMD-K6: None.

FXCH Optimizing
  Pentium: Pairs with most FP instructions and effectively hides FP stack manipulations.
  AMD-K6: None.

Index Versus Base Register
  Pentium: 1-cycle penalty to calculate the effective address when an index register is used.
  AMD-K6: None.
[Continuation of the Pentium® processor with MMX technology comparison table; the table caption was lost in extraction.]

Two-Clock Stalls for Writing Then Storing an MMX Register
  Pentium: Requires scheduling the store two cycles after writing (updating) the MMX register.
  AMD-K6: None.

U Pipe: Integer/MMX Pairing
  Pentium: MMX instructions that access either memory or integer registers cannot be executed in the V pipe.
  AMD-K6: Different. Pairing requires a short-decodeable integer instruction as the second instruction.

U Pipe: MMX/Integer Pairing
  Pentium: The V-pipe integer instruction must be pairable.
  AMD-K6: Similar. Pairing requires a short-decodeable integer instruction as the second instruction.

Pairing Two MMX Instructions
  Pentium: Cannot pair two MMX multiplies, two MMX shifts, or MMX instructions in the V pipe with a U-pipe dependency.
  AMD-K6: None.

66h or 67h Prefix Penalty
  Pentium: Three clocks.
  AMD-K6: None.
Table 13. AMD-K6™ Processor and Pentium® Pro Processor-Specific Optimizations
(Columns: Pentium Pro Guideline/Event; Pentium Pro Effect; Usage/Effect on AMD-K6 Processor; AMD-K6 Processor Detail)

Partial-Register Stalls
  Pentium Pro: Avoid reading a large register after writing a smaller version of the same register. This causes the P6 to stall the issuing of instructions that reference the full register, and all subsequent instructions, until after the partial write has retired. If the partial register update is adjacent to a subsequent full register read, the stall lasts at least seven clock cycles with respect to the decoder outputs. On average, such a stall can prevent 3 to 21 micro-ops from being issued.
  AMD-K6: Different. The AMD-K6 processor performs register dependency checking at byte granularity. Due to shorter pipelines, execution latency, and commitment latency, instruction issuing is not affected; however, execution is stalled.