
Instruction Sets and Beyond:

Computers, Complexity, and Controversy

Robert P. Colwell, Charles Y. Hitchcock III, E. Douglas Jensen, H. M. Brinkley Sprunt,

and Charles P. Kollar

Carnegie-Mellon University

Focus on the assignment of system functionality to implementation levels within an architecture, and not be guided by whether it is a RISC or CISC design.

. . . a deeper understanding of hardware/software, computer performance, the influence of VLSI on processor design, and many other topics. Articles on RISC research, however, often fail to explore these topics properly and can be misleading. Further, the few papers that present comparisons with complex instruction set computer design often do not address the same issues. As a result, even careful study of the literature is likely to give a distorted view of this area of research. This article offers a useful perspective of RISC/Complex Instruction Set Computer research, one that is supported by recent work at Carnegie-Mellon University.

Much RISC literature is devoted to discussions of the size and complexity of computer instruction sets. These discussions are extremely misleading. Instruction set design is important, but it should not be driven solely by adherence to convictions about design style, RISC or CISC. The focus of discussion should be on the more general question of the assignment of system functionality to implementation levels within an architecture. This point of view encompasses the instruction set (CISCs tend to install functionality at lower system levels than RISCs), but it also takes into account other design features such as register sets, coprocessors, and caches.

While the implications of RISC research extend beyond the instruction set, even within the instruction set domain there are limitations that have not been identified. Typical RISC papers give few clues about where the RISC approach might break down. Claims are made for faster machines that are cheaper and easier to design and that "map" particularly well onto VLSI technology. It has been said, however, that "Every complex problem has a simple solution . . . and it is wrong." RISC ideas are not "wrong," but a simple-minded view of them would be. RISC theory has many implications that are not obvious. Research in this area has helped focus attention on some important issues in computer architecture whose resolutions have too often been determined by defaults; yet RISC proponents often fail to discuss the application, architecture, and implementation contexts in which their assertions seem justified.

While RISC advocates have been vocal concerning their design methods and theories, CISC advocates have been disturbingly mute. This is not a healthy state of affairs. Without substantive, reported CISC research, many RISC arguments are left uncountered and, hence, out of perspective. The lack of such reports is due partially to the proprietary nature of most commercial CISC designs and partially to the fact that industry designers do not generally publish as much as academics. Also, the CISC design style has no coherent statement of design principles, and CISC designers do not appear to be actively working on one. This lack of a manifesto differentiates the CISC and RISC design styles and is the result of their different historical developments.

Towards defining a RISC

Since the earliest digital electronic computers, instruction sets have tended to grow larger and more complex. The 1948 MARK-1 had only seven instructions of minimal complexity, such as adds and simple jumps, but a contemporary machine like the VAX has hundreds of instructions. Furthermore, its instructions can be rather complicated, like atomically inserting an element into a doubly linked list or evaluating a floating point polynomial of arbitrary degree. Any high performance implementation of the VAX, as a result, has to rely on complex implementation techniques such as pipelining, prefetching, and multi-cycle instruction execution.

This progression from small and simple to large and complex instruction sets is striking in the development of single-chip processors within the past decade. Motorola's 68020, for example, carries 11 more addressing modes than the 6800, more than twice as many instructions, and support for an instruction cache and coprocessors. Again, not only has the number of addressing modes and instructions increased, but so has their complexity.

This general trend toward CISC machines was fueled by many things, including the following:

* New models are often required to be upward-compatible with existing models in the same computer family, resulting in the supersetting and proliferation of features.

* Many computer designers tried to reduce the "semantic gap" between programs and computer instruction sets. By adding instructions semantically closer to those used by programmers, these designers hoped to reduce software costs by creating a more easily programmed machine. Such instructions tend to be more complex because of their higher semantic level. (It is often the case, however, that instructions with high semantic content do not exactly match those required for the language at hand.)

* In striving to develop faster machines, designers constantly moved functions from software to microcode and from microcode to hardware, often without concern for the adverse effects that an added architectural feature can have on an implementation. For example, addition of an instruction requiring an extra level of decoding logic can slow a machine's entire instruction set. (This is called the "n + 1" phenomenon;1 the rough calculation after this list illustrates it.)

* Tools and methodologies aid designers in handling the inherent complexity of large architectures. Current CAD tools and microcoding support programs are examples.
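To see how the "n + 1" effect can dominate, consider a rough calculation with numbers that are purely illustrative assumptions of ours, not measurements of any machine. Suppose a newly added instruction absorbs three-instruction sequences that make up 2 percent of a program's executed instructions, but the extra decode level it requires stretches every cycle by 3 percent. With one cycle per instruction in both cases,

    T_new / T_old = (0.98 + 0.02/3) × 1.03 ≈ 1.016,

a net slowdown of about 1.6 percent: every instruction in the stream pays for a feature that only a small fraction of the stream uses.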

Microcode is an interesting example of a technique that encourages complex designs in two ways. First, it provides a structured means of effectively creating and altering the algorithms that control execution of numerous operations and complex instructions in a computer. Second, the proliferation of CISC features is encouraged by the quantum nature of microcode memories; it is relatively easy to add another addressing mode or obscure instruction to a machine which has not yet used all of its microcode space.

Instruction traces from CISC machines consistently show that few of the available instructions are used in most computing environments. This situation led IBM's John Cocke, in the early 70's, to contemplate a departure from traditional computer styles. The result was a research project based on an ECL machine that used a very advanced compiler, creatively named "801" for the research group's building number. Little has been published about that project, but what has been released speaks for a principled and coherent research effort.

The 801's instruction set was based on three design principles. According to Radin,2 the instruction set was to be that set of run-time operations that

* could not be moved to compile time,

* could not be more efficiently executed by object code produced by a compiler that understood the high-level intent of the program, and

* could be implemented in random logic more effectively than the equivalent sequence of software instructions.

The machine relied on a compiler that used many optimization strategies for much of its effectiveness, including a powerful scheme of register allocation. The hardware implementation was guided by a desire for leanness and featured hardwired control and single-cycle instruction execution. The architecture was a 32-bit load/store machine (only load and store instructions accessed memory) with 32 registers and single-cycle instructions. It had separate instruction and data caches to allow simultaneous access to code and operands.

Some of the basic ideas from the 801 research reached the West Coast in the mid 70's. At the University of California at Berkeley, these ideas grew into a series of graduate courses that produced the RISC I* (followed later by the RISC II) and the numerous CAD tools that facilitated its design. These courses laid the foundation for related research efforts in performance evaluation, computer-aided design, and computer implementation.

*Please note that the term "RISC" is used throughout this article to refer to all research efforts concerning Reduced Instruction Set Computers, while the term "RISC I" refers specifically to the Berkeley research project.

The RISC I processor,3 like the 801, is a load/store machine that executes most of its instructions in a single cycle. It has only 31 instructions, each of which fits in a single 32-bit word and uses practically the same encoding format. A special feature of the RISC I is its large number of registers, well over a hundred, which are used to form a series of overlapping register sets. This feature makes procedure calls on the RISC I less expensive in terms of processor-memory bus traffic.

Soon after the first RISC I project at Berkeley, a processor named MIPS (Microprocessor without Interlocked Pipe Stages) took shape at Stanford.1 MIPS is a pipelined, single-chip processor that relies on innovative software to ensure that its pipeline resources are properly managed. (In machines such as the IBM System/360 Model 91, pipeline interstage interlocking is performed at run-time by special hardware.) By trading hardware for compile-time software, the Stanford researchers were able to expose and use the inherent internal parallelism of their fast computing engine.

These three machines, the 801, RISC I, and MIPS, form the core of RISC research machines, and share a set of common features. We propose the following elements as a working definition of a RISC:

(1) Single-cycle operation facilitates the rapid execution of simple functions that dominate a computer's instruction stream and promotes a low interpretive overhead.

(2) Load/store design follows from a desire for single-cycle operation.

(3) Hardwired control provides for the fastest possible single-cycle operation. Microcode leads to slower control paths and adds to interpretive overhead.

(4) Relatively few instructions and addressing modes facilitate a fast, simple interpretation by the control engine.

(5) Fixed instruction format with consistent use eases the hardwired decoding of instructions, which again speeds control paths.

(6) More compile-time effort offers an opportunity to explicitly move static run-time complexity into the compiler. A good example of this is the software pipeline reorganizer used by MIPS.1

A consideration of the two companies that claim to have created the first commercial "RISC" computer, Ridge Computers and Pyramid Technology, illustrates why a definition is needed. Machines of each firm have restricted instruction formats, a feature they share with RISC machines. Pyramid's machine is not a load/store computer, however, and both Ridge and Pyramid machines have variable length instructions and use multiple-cycle interpretation and microcoded control engines. Further, while their instruction counts might seem reduced when compared to a VAX, the Pyramid has almost 90 instructions and the Ridge has over 100. The use of microcoding in these machines is for price and performance reasons. The Pyramid machine also has a system of multiple register sets derived from the Berkeley RISC I, but this feature is orthogonal to RISC theory. These may be successful machines, from both technological and marketing standpoints, but they are not RISCs.

The six RISC features enumerated above can be used to weed out misleading claims and provide a springboard for points of debate. Although some aspects of this list may be arguable, it is useful as a working definition.
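Purely as an illustration (the structure, field names, and the two entries below are ours, restating characterizations made elsewhere in this article rather than any formal specification), the working definition amounts to a six-item checklist:

    /* The six working-definition criteria as a checklist. Field names and
     * the example entries are illustrative only. */
    #include <stdio.h>

    struct machine {
        const char *name;
        int single_cycle;       /* (1) single-cycle operation                */
        int load_store;         /* (2) load/store design                     */
        int hardwired_control;  /* (3) hardwired rather than microcoded      */
        int few_instructions;   /* (4) relatively few instructions and modes */
        int fixed_format;       /* (5) fixed instruction format              */
        int compiler_effort;    /* (6) complexity moved to compile time      */
    };

    static int criteria_met(const struct machine *m)
    {
        return m->single_cycle + m->load_store + m->hardwired_control +
               m->few_instructions + m->fixed_format + m->compiler_effort;
    }

    int main(void)
    {
        struct machine examples[] = {
            { "RISC I",      1, 1, 1, 1, 1, 1 },
            { "MicroVAX-32", 0, 0, 0, 0, 0, 0 },  /* violates all six; see below */
        };
        for (int i = 0; i < 2; i++)
            printf("%-12s meets %d of 6 criteria\n",
                   examples[i].name, criteria_met(&examples[i]));
        return 0;
    }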

Points of attention and contention

There are two prevalent misconceptions about RISC and CISC. The first is due to the RISC and CISC acronyms, which seem to imply that the domain for discussion should be restricted to selecting candidates for a machine's instruction set. Although specification format and number of instructions are the primary issues in most RISC literature, the best generalization of RISC theory goes well beyond them. It connotes a willingness to make design tradeoffs freely and consciously across architecture/implementation, hardware/software, and compile-time/run-time boundaries in order to maximize performance as measured in some specific context.

The RISC and CISC acronyms also seem to imply that any machine can be classified as one or the other and that the primary task confronting an architect is to choose the most appropriate design style for a particular application. But the classification is not a dichotomy. RISCs and CISCs are at different corners of a continuous multidimensional design space. The need is not for an algorithm by which one can be chosen; rather, the goal should be the formulation of a set of techniques, drawn from CISC experiences and RISC tenets, which can be used by a designer in creating new systems.4-6

One consequence of the us-or-them attitude evinced by most RISC publications is that the reported performance of a particular machine (e.g., RISC I) can be hard to interpret if the contributions made by the various design decisions are not presented individually. A designer faced with a large array of choices needs guidance more specific than a monolithic, all-or-nothing performance measurement.

An example of how the issue of scope can be confused is found in a recent article.7 By creating a machine with only one instruction, its authors claim to have delimited the RISC design space, with their machine at one end of the space and the RISC I (with 31 instructions) at the other end. This model is far too simplistic to be useful; an absolute number of instructions cannot be the sole criterion for categorizing an architecture as RISC or CISC. It ignores aspects of addressing modes and their associated complexity, fails to deal with compiler/architecture coupling, and provides no way to evaluate the implementation of other non-instruction set design decisions such as register files, caches, memory management, floating point operations, and co-processors.

Another fallacy is that the total system is composed of hardware, software, and application code. This leaves out the operating system, and the overhead and the needs of the operating system cannot be ignored in most systems. This area has received far too little attention from RISC research efforts, in contrast to the CISC efforts focused on this area.8,9

An early argument in favor of RISC design was that simpler designs could be realized more quickly, giving them a performance advantage over complex machines. In addition to the economic advantages of getting to market first, the simple design was supposed to avoid the performance disadvantages of introducing a new machine based on relatively old implementation technology.

The insinuation that the MicroVAX-32 follows in a RISC tradition is unreasonable. It does not follow our definition of a RISC; it violates all six RISC criteria.

In light of these arguments, DEC's MicroVAX-32 is especially interesting.10

The VAX easily qualifies as a CISC. According to published reports, the MicroVAX-32, a VLSI implementation of the preponderance of the VAX instruction set, was designed, realized, and tested in a period of several months. One might speculate that this very short gestation period was made possible in large part by DEC's considerable expertise in implementing the VAX architecture (existing products included the 11/780, 11/750, 11/730, and VLSI-VAX). This shortened design time would not have been possible had DEC not first created a standard instruction set. Standardization at this level, however, is precisely what RISC theory argues against. Such standards constrain the unconventional RISC hardware/software tradeoffs. From a commercial standpoint, it is significant that the MicroVAX-32 was born into a world where compatible assemblers, compilers, and operating systems abound, something that would certainly not be the case for a RISC design.

Such problems with RISC system designs may encourage commercial RISC designers to define a new level of standardization in order to achieve some of the advantages of multiple implementations supporting one standard interface. A possible choice for such an interface would be to define an intermediate language as the target for all compilation. The intermediate language would then be translated into optimal machine code for each implementation. This translation process would simply be performing resource scheduling at a very low level (e.g., pipeline management and register allocation).

It should be noted that the MicroVAX-32 does not directly implement all of the VAX architecture. The suggestion has been made that this implementation somehow supports the RISC inclination toward emulating complex functions in software. In a recent publication, David Patterson observed:

Although I doubt DEC is calling them RISCs, I certainly found it interesting that DEC's single chip VAXs do not implement the whole VAX instruction set. A MicroVAX traps when it tries to execute some infrequent but complicated operations, and invokes transparent software routines that simulate those complicated instructions.11

The insinuation that the MicroVAX-32 follows in a RISC tradition is unreasonable. It does not come close to fitting our definition of a RISC; it violates all six RISC criteria. To begin with, any VAX by definition has a variable-length instruction format and is not a load/store machine. Further, the MicroVAX-32 has multicycle instruction execution, relies on a microcoded control engine, and interprets the whole array of VAX addressing modes. Finally, the MicroVAX-32 executes 175 instructions on-chip, hardly a reduced number.


A better perspective on the MicroVAX-32 shows that there are indeed cost/performance ranges where microcoded implementation of certain functions is inappropriate and software emulation is better. The importance of carefully making this assignment of function to implementation level (software, microcode, or hardware) has been amply demonstrated in many RISC papers. Yet this basic concern is also evidenced in many CISC machines. In the case of the MicroVAX-32, floating point instructions are migrated either to a coprocessor chip or to software emulation routines. The numerous floating-point chips currently available attest to the market reception for this partitioning. Also migrated to emulation are the console, decimal, and string instructions. Since many of these instructions are infrequent, not time-critical, or not generated by many compilers, it would be difficult to fault this approach to the design of an inexpensive VAX. The MicroVAX-32 also shows that it is still possible for intelligent, competent computer designers who understand the notion of correct function-to-level mapping to find microcoding a valuable technique. Published RISC work, however, does not accommodate this possibility.

The application environment is also of crucial importance in system design. The RISC I instruction set was designed specifically to run the C language efficiently, and it appears reasonably successful. The RISC I researchers have also investigated the Smalltalk-80 computing environment.12 Rather than evaluate RISC I as a Smalltalk engine, however, the RISC I researchers designed a new RISC and report encouraging performance results from simulations. Still, designing a processor to run a single language well is different from creating a single machine, such as the VAX, that must exhibit at least acceptable performance for a wide range of languages. While RISC research offers valuable insights on a per-language basis, more emphasis on cross-language anomalies, commonalities, and tradeoffs is badly needed.

Especially misleading are RISC claims concerning the amount of design time saved by creating a simple machine instead of a complex one. Such claims sound reasonable. Nevertheless, there are substantial differences in the design environments for an academic one-of-a-kind project (such as MIPS or RISC I) and a machine with a lifetime measured in years that will require substantial software and support investments. As was pointed out in a recent Electronics Week article, R. D. Lowry, market development manager for Denelcor, noted that "commercial-product development teams generally start off a project by weighing the profit and loss impacts of design decisions."13 Lowry is quoted as saying, "A university doesn't have to worry about that, so there are often many built-in deadends in projects. This is not to say the value of their research is diminished. It does, however, make it very difficult for someone to reinvent the system to make it a commercial product." For a product to remain viable, a great deal of documentation, user training, coordination with fabrication or production facilities, and future upgrades must all be provided. It is not known how these factors might skew a design-time comparison, so all such comparisons should be viewed with suspicion.

Even performance claims, perhaps the most interesting of all RISC assertions, are ambiguous. Performance as measured by the narrowly compute-bound, low-level benchmarks that have been used by RISC researchers (e.g., calculating a Fibonacci series recursively) is not the only metric in a computer system. In some, it is not even one of the most interesting. For many current computers, the only useful performance index is the number of transactions per second, which has no direct or simple correlation to the time it takes to calculate Ackermann's function. While millions of instructions per second might be a meaningful metric in some computing environments, reliability, availability, and response time are of much more concern in others, such as space and aviation computing. The extensive error checking incorporated into these machines at every level may slow the basic clock time and substantially diminish performance. Reduced performance is tolerable; downtime may not be.

In the extreme, naive application of the RISC rules for designing an instruction set might result in a missile guidance computer optimized for running its most common task: diagnostics. In terms of instruction frequencies, of course, flight control applications constitute a trivial special case and would not be given much attention. It is worth emphasizing that in efforts to quantify performance and apply those measurements to system design, one must pay attention not just to instruction execution frequencies, but also to cycles consumed per instruction execution. Levy and Clark make this point regarding the VAX instruction set,14 but it has yet to appear in any papers on RISC.
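The point about frequencies and cycles can be stated compactly. Using the standard run-time identity (written here in our own notation, not Levy and Clark's), if instruction class i executes n_i times and consumes c_i cycles per execution, then

    T = t_cycle × sum over i of (n_i × c_i),

so an instruction class that is executed rarely can still dominate T if its cycle count is large, and a frequency profile by itself fixes neither factor of the product.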

When performance, such as throughput or transactions per second, is a first-order concern, one is faced with the task of quantifying it. The Berkeley RISC I efforts to establish the machine's throughput are laudable, but before sweeping conclusions are drawn one must carefully examine the benchmark programs used. As Patterson noted:

The performance predictions for [RISC I and RISC II] were based on small programs. This small size was dictated by the reliability of the simulator and compiler, the available simulation time, and the inability of the first simulators to handle UNIX system calls.11

Some of these "small" programs actually execute millions of instructions, yet they are very narrow programs in terms of the scope of function. For example, the Towers of Hanoi program, when executing on the 68000, spends over 90 percent of its memory accesses in procedure calls and returns. The RISC I and II researchers recently reported results from a large benchmark,11 but the importance of large, heterogeneous benchmarks in performance measurement is still lost on many commercial and academic computer evaluators who have succumbed to the misconception that "microbenchmarks" represent a useful measurement in isolation.

Multiple register sets

Probably the most publicized RISC-style processor is the Berkeley RISC I. The best-known feature of this chip is its large register file, organized as a series of overlapping register sets. This is ironic, since the register file is a performance feature independent of any RISC (as defined earlier) aspect of the processor. Multiple register sets (MRSs) could be included in any general-purpose register machine.

It is easy to believe that MRSs can yield performance benefits, since procedure-based, high-level languages typically use registers for information specific to a procedure. When a procedure call is performed, the information must be saved, usually on a memory stack, and restored on a procedure return. These operations are typically very time consuming due to the intrinsic data transfer requirements. RISC I uses its multiple register sets to reduce the frequency of this register saving and restoring. It also takes advantage of an overlap between register sets for parameter passing, reducing even further the memory reads and writes necessary.15

RISC I has a register file of 138 32-bit registers organized into eight overlapping "windows." In each window, six registers overlap the next window (for outgoing parameters and incoming results). During any procedure, only one of these windows is actually accessible. A procedure call changes the current window to the next window by incrementing a pointer, and the six outgoing parameter registers become the incoming parameters of the called procedure. Similarly, a procedure return changes the current window to the previous window, and the outgoing result registers become the incoming result registers of the calling procedure. If we assume that six 32-bit registers are enough to contain the parameters, a procedure call involves no actual movement of information (only the window pointer is adjusted). The finite on-chip resources limit the actual savings due to register window overflows and underflows.3
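A minimal sketch of the call/return mechanism just described, written as C of our own devising (the eight windows and six overlapping registers come from the text above; the spill policy, the names, and the recursive driver are illustrative assumptions, not RISC I specifics):

    /* Overlapping register windows: a call advances the current-window
     * pointer so the caller's six outgoing registers become the callee's
     * incoming registers; memory is touched only on overflow/underflow. */
    #include <stdio.h>

    #define NWINDOWS 8            /* eight overlapping windows, as in RISC I  */

    static int call_depth = 0;    /* procedure activations outstanding        */
    static int resident   = 0;    /* activations currently held in windows    */
    static long spills = 0, fills = 0;

    static void proc_call(void)
    {
        call_depth++;
        if (resident == NWINDOWS)
            spills++;             /* overflow: oldest window saved to memory  */
        else
            resident++;           /* otherwise only the window pointer moves  */
    }

    static void proc_return(void)
    {
        call_depth--;
        if (--resident == 0 && call_depth > 0) {
            fills++;              /* underflow: caller's window restored      */
            resident = 1;
        }
    }

    /* Deep recursion standing in for a procedure-intensive benchmark. */
    static void recurse(int n)
    {
        proc_call();
        if (n > 0)
            recurse(n - 1);
        proc_return();
    }

    int main(void)
    {
        recurse(20);
        printf("%ld spills, %ld fills\n", spills, fills);
        return 0;
    }

The point of the sketch is that as long as the call depth stays within the eight windows, a call or return touches no memory at all; processor-memory traffic appears only when the windows overflow or underflow.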

It has been claimed that the small control area needed to implement the simple instruction set of a VLSI RISC leaves enough chip area for the large register file.3 The relatively small amount of control logic used by a RISC does free resources for other uses, but a large register file is not the only way to use them, nor even necessarily the best. For example, designers of the 801 and MIPS chose other ways to use their available hardware; these RISCs have only a single, conventionally sized register set. Caches, floating-point hardware, and interprocess communication support are a few of the many possible uses for those resources "freed" by a RISC's simple instruction set. Moreover, as chip technology improves, the tradeoffs between instruction set complexity and architecture/implementation features become less constrained. Computer designers will always have to decide how to best use available resources and, in doing so, should realize which relations are intrinsic and which are not.

The Berkeley papers describing the RISC I and RISC II processors claimed their resource decisions produced large performance improvements, two to four times over CISC machines like the VAX and the 68000.3,11 There are many problems with these results and the methods used to obtain them. Foremost, the performance effects of the reduced instruction set were not decoupled from those of the overlapped register windows. Consequently, these reports shed little light on the RISC-related performance of the machine, as shown below.

Some performance comparisons between different machines, especially early ones, were based on simulated benchmark execution times. While absolute speed is always interesting, other, less implementation-dependent metrics can provide design information more useful to computer architects, such as data concerning the processor-memory traffic necessary to execute a series of benchmarks. It is difficult to draw firm conclusions from comparisons of vastly different machines unless some effort has been made to factor out implementation-dependent features not being compared (e.g., caches and floating point accelerators).

Experiments structured to accommodate these reservations were conducted at CMU to test the hypothesis that the effects of multiple register sets are orthogonal to instruction set complexity.16 Specifically, the goal was to see if the performance effects of MRSs were comparable for RISCs and CISCs. Simulators were written for two CISCs (the VAX and the 68000) without MRSs, with non-overlapping MRSs, and with overlapping MRSs. Simulators were also written for the RISC I, RISC I with non-overlapping register sets, and RISC I with only a single register set. In each of the simulators, care was taken not to change the initial architectures any more than absolutely necessary to add or remove MRSs. Instead of simulating execution time, the total amount of processor-memory traffic (bytes read and written) for each benchmark was recorded for comparison. To use this data fairly, only different register set versions of the same architecture were compared, so the ambiguities that arise from comparing different architectures like the RISC I and the VAX were avoided. The benchmarks used were the same ones originally used to evaluate RISC I. A summary of the experiments and their results is presented by Hitchcock and Sprunt.17
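The metric itself is easy to instrument. The fragment below is our own sketch of the sort of counters such a simulator can attach to its memory interface (the names and structure are illustrative; the CMU simulators are not published in this form):

    /* Count processor-memory traffic in bytes rather than simulated time,
     * so register-set variants of one architecture can be compared directly. */
    #include <stdint.h>
    #include <stdio.h>

    static uint64_t traffic_bytes = 0;

    /* The simulated processor calls these for every access it makes,
     * including instruction fetches and register-set saves and restores. */
    static void mem_read(uint32_t addr, unsigned nbytes)  { (void)addr; traffic_bytes += nbytes; }
    static void mem_write(uint32_t addr, unsigned nbytes) { (void)addr; traffic_bytes += nbytes; }

    int main(void)
    {
        mem_read(0x1000, 4);   /* e.g., an instruction fetch            */
        mem_write(0x2000, 4);  /* e.g., saving a register across a call */
        printf("processor-memory traffic: %llu bytes\n",
               (unsigned long long)traffic_bytes);
        return 0;
    }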

As expected, the results show a substantial difference in processor-memory traffic for an architecture with and without MRSs. The MRS versions of both the VAX and the 68000 show marked decreases in processor-memory traffic for procedure-intensive benchmarks, as shown in Figures 1 and 2. Similarly, the single register set version of RISC I requires many more memory reads and writes than RISC I with overlapped register sets (Figure 3).

Figure 1. Total processor-memory traffic for benchmarks on the standard VAX and two modified VAX computers, one with multiple register sets and one with overlapped multiple register sets.

Figure 2. Total processor-memory traffic for benchmarks on the standard 68000 and two modified 68000s, one with multiple register sets and one with overlapped multiple register sets.

Figure 3. Total processor-memory traffic for benchmarks on the standard RISC I and two modified RISC I's, one with no overlap between register sets and one with only one register set.


This result is due in part to the method used for handling register set overflow and underflow, which was kept the same for all three variations. With a more intelligent scheme, the single register set RISC I actually required fewer bytes of memory traffic on Ackermann's function than its multiple register set counterparts. For benchmarks with very few procedure calls (e.g., the sieve of Eratosthenes), the single register set version has the same amount of processor-memory traffic as the MRS version of the same architecture.17

Clearly, MRSs can affect the amount of processor-memory traffic necessary to execute a program. A significant amount of the performance of RISC I for procedure-intensive environments has been shown to be attributable to its scheme of overlapped register sets, a feature independent of instruction-set complexity. Thus, any performance claims for reduced instruction set computers that do not remove effects due to multiple register sets are inconclusive, at best.

These CMU experiments used benchmarks drawn from other RISC research efforts for the sake of continuity and consistency. Some of the benchmarks, such as Ackermann, Fibonacci, and Hanoi, actually spend most of their time performing procedure calls. The percentage of the total processor-memory traffic due to "C" procedure calls for these three benchmarks on the single register set version of the 68000 ranges from 66 to 92 percent. As was expected, RISC I, with its overlapped register structure that allows procedure calls to be almost free in terms of processor-memory bus traffic, did extremely well on these highly recursive benchmarks when compared to machines with only a single register set. It has not been established, however, that these benchmarks are representative of any computing environment.

The 432

The Intel 432 is a classic example of a CISC. It is an object-oriented VLSI microprocessor chip-set designed expressly to provide a productive Ada programming environment for large scale, multiple-process, multiple-processor systems. Its architecture supports object orientation such that every object is protected uniformly, without regard to traditional distinctions such as "supervisor/user mode" or "system/user data structures." The 432 has a very complex instruction set.

Its instructions are bit-encoded and range in length from six to 321 bits. The 432 incorporates a significant degree of functional migration from software to on-chip microcode. The interprocess communication SEND primitive is a 432 machine instruction, for instance.

Published studies of the performance of the Intel 432 on low-level benchmarks (e.g., towers of Hanoi18) show that it is very slow, taking 10 to 20 times as long as the VAX 11/780. Such a design, then, invites scrutiny in the RISC/CISC controversy.

One is tempted to blame the machine's object-oriented runtime environment for imposing too much overhead. Every memory reference is checked to ensure that it lies within the boundaries of the referenced object, and the read/write protocols of the executing context are verified. RISC proponents argue that the complexity of the 432 architecture, and the additional decoding required for a bit-encoded instruction stream, contribute to its poor performance. To address these and other issues, a detailed study of the 432 was undertaken to evaluate the effectiveness of the architectural mechanisms provided in support of its intended runtime environment. The study concentrated on one of the central differences in the RISC and CISC design styles: RISC designs avoid hardware/microcode structures intended to support the runtime environment, attempting instead to place equivalent functionality into the compiler or software. This is contrary to the mainstream of instruction set design, which reflects a steady migration of such functionality from higher levels (software) to lower ones (microcode or hardware) in the expectation of improved performance.

This investigation should include an analysis of the 432's efficiency in executing large-system code, since executing such code well was the primary design goal of the 432. Investigators used the Intel 432 microsimulator, which yields cycle-by-cycle traces of the machine's execution. While this microsimulator is well-suited to simulating small programs, it is quite unwieldy for large ones. As a result, the concentration here is on the low-level benchmarks that first pointed out the poor 432 performance.

Simulations of these benchmarks revealed several performance problems with the 432 and its compiler:

(1) The 432's Ada compiler performs almost no optimization. The machine is frequently forced to make unnecessary changes to its complex addressing environment, and it often recomputes costly, redundant subexpressions. This recomputation seriously skews many results from benchmark comparisons. Such benchmarks reflect the performance of the present version of the 432 but show very little about the efficacy of the architectural tradeoffs made in that machine.

(2) The bandwidth of 432 memory is limited by several factors. The 432 has no on-chip data caching, no instruction stream literals, and no local data registers. Consequently, it makes far more memory references than it would otherwise have to. These reference requirements also make the code size much larger, since many more bits are required to reference data within an object than within a local register. And because of pin limitations, the 432 must multiplex both data and address information over only 16 pins. Also, the standard Intel 432/600 development system, which supports shared-memory multiprocessing, uses a slow asynchronous bus that was designed more for reliability than throughput. These implementation factors combine to make wait states consume 25 to 40 percent of the processor's time on the benchmarks.

(3) On highly recursive benchmarks, the object-oriented overhead in the 432 does indeed appear in the form of a slow procedure call. Even here, though, the performance problems should not be attributed to object orientation or to the machine's intrinsic complexity. Designers of the 432 made a decision to provide a new, protected context for every procedure call; the user has no option in this respect. If an unprotected call mechanism were used where appropriate, the Dhrystone benchmark19 would run 20 percent faster.

(4) Instructions are bit-aligned, so the 432 must almost of necessity decode the various fields of an instruction sequentially (the sketch after this list shows why). Since such decoding often overlaps with instruction execution, the 432 stalls three percent of the time while waiting for the instruction decoder. This percentage will get worse, however, once the other problems above are eliminated.
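To see why bit alignment serializes decoding, consider this sketch of our own (the field widths and layout are invented for illustration and are not the 432's actual instruction format): each field's starting bit position depends on the widths of the fields before it, so the fields cannot be extracted independently.

    /* Extracting variable-width, bit-aligned fields. The start of each
     * field is known only after the preceding fields have been decoded,
     * so extraction is inherently sequential. */
    #include <stdint.h>
    #include <stdio.h>

    /* Pull 'width' bits starting at bit offset *pos, then advance *pos. */
    static unsigned get_bits(const uint8_t *stream, unsigned *pos, unsigned width)
    {
        unsigned value = 0;
        for (unsigned i = 0; i < width; i++, (*pos)++)
            value |= ((stream[*pos / 8] >> (*pos % 8)) & 1u) << i;
        return value;
    }

    int main(void)
    {
        /* Hypothetical encoding: a 3-bit format field whose value sets how
         * many operand-specifier bits follow, then a 6-bit opcode. */
        const uint8_t stream[] = { 0xB5, 0x3C, 0x0F };
        unsigned pos = 0;

        unsigned format = get_bits(stream, &pos, 3);
        unsigned spec   = get_bits(stream, &pos, 4 + format); /* width depends on 'format' */
        unsigned opcode = get_bits(stream, &pos, 6);

        printf("format=%u spec=%u opcode=%u; next instruction starts at bit %u\n",
               format, spec, opcode, pos);
        return 0;
    }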

Colwell provides a detailed treatment of this experiment and its results.20

This 432 experiment is evidence that RISC's renewed emphasis on the importance of fast instruction decoding and fast local storage (such as caches or registers) is substantiated, at least for low-level compute-bound benchmarks. Still, the 432 does not provide compelling evidence that large-scale migration of function to microcode and hardware is ineffective. On the contrary, Cox et al.21 demonstrated that the 432 microcode implementation of interprocess communication is much faster than an equivalent software version. On these low-level benchmarks, the 432 could have much higher performance with only a better compiler and minor changes to its implementation. Thus, it is wrong to conclude that the 432 supports the general RISC point of view.

In spite of, and sometimes because of, the wide publicity given to current RISC and CISC research, it is not easy to gain a thorough appreciation of the important issues. Articles on RISC research are often oversimplified, overstated, and misleading, and papers on CISC design offer no coherent design principles for comparison. RISC/CISC issues are best considered in light of their function-to-implementation level assignment. Strictly limiting the focus to instruction counts or other oversimplifications can be misleading or meaningless.

Some of the more subtle issues have not been brought out in current literature. Many of these are design considerations that do not lend themselves to the benchmark level analysis used in RISC research. Nor are they always properly evaluated by CISC designers, guided so frequently by tradition and corporate economics.

RISC/CISC research has a great deal to offer computer designers. These contributions must not be lost due to an illusory and artificial dichotomy. Lessons learned studying RISC machines are not incompatible with or mutually exclusive of the rich tradition of computer design that preceded them. Treating RISC ideas as perspectives and techniques rather than dogma, and understanding their domains of applicability, can add important new tools to a computer designer's repertoire.

Acknowledgements

We would like to thank the innumerable individuals, from industry and academia, who have shared their thoughts on this matter with us and stimulated many of our ideas. In particular, we are grateful to George Cox and Konrad Lai of Intel for their help with the 432 microsimulator.

This research was sponsored in part by the Department of the Army under contract DAA B07-82-C-J164.

References

1. J. Hennessy et al., "Hardware/Software Tradeoffs for Increased Performance," Proc. Symp. Architectural Support for Programming Languages and Operating Systems, 1982, pp. 2-11.

2. G. Radin, "The 801 Minicomputer," Proc. Symp. Architectural Support for Programming Languages and Operating Systems, 1982, pp. 39-47.

3. D. A. Patterson and C. H. Sequin, "A VLSI RISC," Computer, Vol. 15, No. 9, Sept. 1982, pp. 8-21.

4. R. P. Colwell, C. Y. Hitchcock III, and E. D. Jensen, "A Perspective on the Processor Complexity Controversy," Proc. Int'l Conf. Computer Design: VLSI in Computers, 1983, pp. 613-616.

5. D. Hammerstrom, "Tutorial: The Migration of Function into Silicon," 10th Ann. Int'l Symp. Computer Architecture, 1983.

6. J. C. Browne, "Understanding Execution Behavior of Software Systems," Computer, Vol. 17, No. 7, July 1984, pp. 83-87.

7. H. Azaria and D. Tabak, "The MODHEL Microcomputer for RISCs Study," Microprocessing and Microprogramming, Vol. 12, No. 3-4, Oct.-Nov. 1983, pp. 199-206.

8. G. C. Barton, "Sentry: A Novel Hardware Implementation of Classic Operating System Mechanisms," Proc. Ninth Ann. Int'l Symp. Computer Architecture, 1982, pp. 140-147.

9. A. D. Berenbaum, M. W. Condry, and P. M. Lu, "The Operating System and Language Support Features of the BELLMAC-32 Microprocessor," Proc. Symp. Architectural Support for Programming Languages and Operating Systems, 1982, pp. 30-38.

10. J. Hennessy, "VLSI Processor Architecture," IEEE Transactions on Computers, Vol. C-33, No. 12, Dec. 1984, pp. 1221-1246.

11. D. Patterson, "RISC Watch," Computer Architecture News, Vol. 12, No. 1, Mar. 1984, pp. 11-19.

12. D. Ungar et al., "Architecture of SOAR: Smalltalk on a RISC," 11th Ann. Int'l Symp. Computer Architecture, 1984, pp. 188-197.

13. W. R. Iversen, "Money Starting to Flow As Parallel Processing Gets Hot," Electronics Week, Apr. 22, 1985, pp. 36-38.

14. H. M. Levy and D. W. Clark, "On the Use of Benchmarks for Measuring System Performance," Computer Architecture News, Vol. 10, No. 6, 1982, pp. 5-8.

15. D. C. Halbert and P. B. Kessler, "Windows of Overlapping Register Frames," CS292R Final Reports, University of California, Berkeley, June 9, 1980.

16. R. P. Colwell, C. Y. Hitchcock III, and E. D. Jensen, "Peering Through the RISC/CISC Fog: An Outline of Research," Computer Architecture News, Vol. 11, No. 1, Mar. 1983, pp. 44-50.

17. C. Y. Hitchcock III and H. M. B. Sprunt, "Analyzing Multiple Register Sets," 12th Ann. Int'l Symp. Computer Architecture, 1985, in press.

18. P. M. Hansen et al., "A Performance Evaluation of the Intel iAPX 432," Computer Architecture News, Vol. 10, No. 4, June 1982, pp. 17-27.

19. R. P. Weicker, "Dhrystone: A Synthetic Systems Programming Benchmark," Comm. ACM, Vol. 27, No. 10, Oct. 1984, pp. 1013-1030.

20. R. P. Colwell, "The Performance Effects of Functional Migration and Architectural Complexity in Object-Oriented Systems," PhD thesis, Carnegie-Mellon University, Pittsburgh, PA. Expected completion in June, 1985.

21. G. W. Cox et al., "Interprocess Communication and Processor Dispatching on the Intel 432," ACM Trans. Computer Systems, Vol. 1, No. 1, Feb. 1983, pp. 45-66.

Robert P. Colwell recently completed his doctoral dissertation on the performance effects of migrating functions into silicon, using the Intel 432 as a case study. His industrial experience includes design of a color graphics workstation for Perq Systems, and work on Bell Labs' microprocessors. He received the PhD and MSEE degrees from Carnegie-Mellon University in 1985 and 1978, and the BSEE degree from the University of Pittsburgh in 1977. He is a member of the IEEE and ACM.

Charles Y. Hitchcock III is a doctoral candidate in Carnegie-Mellon University's Department of Electrical and Computer Engineering. He is currently pursuing research in computer architecture and is a member of the IEEE and ACM. He graduated with honors in 1981 from Princeton University with a BSE in electrical engineering and computer science. His MSEE from CMU in 1983 followed research he did in design automation.

E. Douglas Jensen has been on the faculties of both the Computer Science and Electrical and Computer Engineering Departments of Carnegie-Mellon University for six years. For the previous 14 years he performed industrial R&D on computer systems, hardware, and software. He consults and lectures extensively throughout the world and has participated widely in professional society activities.

H. M. Brinkley Sprunt is a doctoral candidate in the Department of Electrical and Computer Engineering of Carnegie-Mellon University. He received a BSEE degree in electrical engineering from Rice University in 1983. His research interests include computer architecture evaluation and design. He is a member of the IEEE and ACM.

Charles P. Kollar is a senior research staff member in Carnegie-Mellon University's Computer Science Department. He is currently pursuing research in decentralized asynchronous computing systems. He has been associated with the MCF and NEBULA projects at Carnegie-Mellon University since 1978. His previous research has been in the area of computer architecture validation and computer architecture description languages. He holds a BS in computer science from the University of Pittsburgh.

Questions about this article can be directed to Colwell at the Computer Science Department, Carnegie-Mellon University, Pittsburgh, PA 15213.
