
A Case Study of Multi-Threading in the Embedded Space

Greg Hoover
University of California, Santa Barbara
Engineering I
Santa Barbara, California 93106

[email protected]

Forrest Brewer
University of California, Santa Barbara



[email protected]

Timothy Sherwood
University of California, Santa Barbara



[email protected]

ABSTRACT

The continuing miniaturization of technology coupled with wireless networks has made it feasible to physically embed sensor network systems into the environment. Sensor net processors are tasked with the job of handling a disparate set of interrupt driven activity, from networks to timers to the sensors themselves. In this paper, we demonstrate the advantages of a tiny multi-threaded microcontroller design which targets embedded applications that need to respond to events at high speed. While multi-threading is typically used to improve resource utilization, in the embedded space it can provide zero-cycle context switching and interrupt service threads (IST), enabling complex programmable control in latency constrained environments. To explore the advantages of multi-threading on these embedded problems, we have implemented in hardware a family of controllers supporting eight dynamically interleaved threads and executing the AVR instruction set. This allows us to carefully quantify the effects of threading on interrupt latency, code size, overall processor throughput, cycle time, and design area for complete designs with different numbers of threads.

Categories and Subject Descriptors

B.8.2 [Hardware]: Performance and Reliability - performance analysis and design aids; B.5 [Hardware]: Register-Transfer-Level Implementation

General Terms

Performance, Design

Keywords

multi-threading, embedded architecture

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
CASES’06, October 23–25, 2006, Seoul, Korea.
Copyright 2006 ACM 1-59593-XXX-X/06/0010 ...$5.00.

1. INTRODUCTION

The continuing miniaturization of technology enables small systems to be innocuously embedded into our physical environment. Such systems open the possibility of early detection of structural failure, tracking microclimates in forest canopies, and observing migratory patterns of any number of species [16, 11, 29]. In an ideal world, these tiny digital systems would be operating a wireless network, handling sensor readings, controlling electro-statically-activated devices, processing software updates, and performing distributed computations. To handle all of these functions at the required throughput, we argue for the use of dynamic multi-threading at all levels of the microcontroller design.

At first this concept may seem counterintuitive, as most multi-threaded architectures were developed to better exploit an abundance of resources [25], something that these systems most certainly do not have. Indeed, because most modern hardware-threaded systems are superscalar designs with support for speculative or even out-of-order execution, a very common belief is that adding hardware support for threading will increase complexity more than is sensible in the microcontroller space. Only by carefully quantifying both the circuit-level overhead of multi-threading and the software-level advantages of its application can we truly understand the tradeoffs involved in embedded multi-threaded systems.

This paper presents the architecture and implementation details of our multi-threaded microcontroller designed with these concerns in mind. Our design, JackKnife, aims to show that threading can be a winning design point in this space, providing a synthesizable, multi-threaded, pipelined microcontroller supporting the AVR instruction set. JackKnife is highly modular and customizable, supporting the inclusion of custom peripherals via an extensible bus architecture. The current implementation supports up to 8 dynamically interleaved threads, each with an independent processor context. Synthesized in a 0.15µm TSMC process, our implementation runs at over 400MHz. We show at least an order of magnitude increase in performance over currently available AVR designs which, for many instructions, require multiple cycles per instruction and have a maximum clock frequency of 20MHz.

We show that a practical embedded multi-threaded design is possible (even for an ISA as complex as AVR), and through a detailed hardware-software implementation study we demonstrate that:

• zero-cycle context switching enables extremely low latency interrupt response, which could then be exploited to decrease the required frequency.

• for a fully synthesizable design (including control and datapaths), the area overhead to implement multi-threading is minimal: only 20% per added thread.

• not only does the latency response improve, multi-threaded interrupt code actually requires both less total storage (50% on average) and fewer total instructions executed (50% on average) than unthreaded code.

If supported as a first class design constraint at all levels, threading and pipelining the AVR does not add significantly to the complexity or area of the design. The major complexity comes in supporting true zero-cycle interrupts and in the interaction between in-flight interrupts and schedule-affecting instructions. However, by implementing the control using the pyPBS control synthesis tool, the complexity exposed to the designer is kept to a minimum.

The remainder of this paper is organized as follows: Section 2 discusses the motivation for and contributions of this work. Section 3 describes the architectural features of the processor core pertaining to interleaving, pipeline structure, scheduling, interrupts and synchronization. The memory architecture is presented in Section 4 with specifics on instruction fetch, the register file, and the extensible bus (I/O). Section 5 outlines implementation details with discussion of multi-threading effects and synthesis results. Prior art in the field of multi-threading is presented in Section 6. Finally, we summarize our findings in Section 7.

2. MOTIVATION

The availability of small electronic sensors and embedded computing platforms has already provided the means for exploring new areas in network sensors. In general, sensor systems tend to be small in size, data-intensive, diverse in design and use, and exhibit limited physical parallelism [8]. These systems must meet size, reliability, and longevity requirements [11], while providing communication and sensor interfaces, data processing and storage, low-latency response times, and power management. Existing network sensor architectures have been built around commercially available microcontroller devices such as the AVR microcontroller [8], the Hitachi SH1 [11, 29], and the ARM7 [12].

While traditional embedded systems and sensor nets share many similarities (power and cost constraints, small amounts of storage, etc.), a sensor net processor is called on to be truly general purpose. In particular, sensor net processors juggle many different interrupt driven tasks, from timers to sensors to the network. As interrupts are one of the most common tasks, they need to be handled efficiently. Unfortunately, in a single threaded design, managing the complexity of program execution becomes difficult, and the bookkeeping work of transferring control between processes consumes significant time and space.

Figure 1 highlights the prologue and epilogue code segments for an interrupt service routine on a single-threaded AVR microcontroller. While the size of these segments can vary, the routine presented is not atypical. Such save/return code segments can dominate the execution, resulting in inefficient use of resources. Each added instruction affects not only the program footprint but also execution latency, and the added delay serves to further constrain the rate at which the system can process events. While Figure 1 motivates the need for a more efficient interrupt management method, a quantitative analysis of this problem is presented in Section 5.

Prologue:

    0x921F ; PUSH R1
    0x920F ; PUSH R0
    0xB60F ; IN 63 R0
    0x920F ; PUSH R0
    0x2411 ; EOR R1 R1
    0x932F ; PUSH R18
    0x933F ; PUSH R19
    0x934F ; PUSH R20
    0x935F ; PUSH R21
    0x936F ; PUSH R22
    0x937F ; PUSH R23
    0x938F ; PUSH R24
    0x939F ; PUSH R25
    0x93AF ; PUSH R26
    0x93BF ; PUSH R27
    0x93EF ; PUSH R30
    0x93FF ; PUSH R31

Epilogue:

    0x91FF ; POP R31
    0x91EF ; POP R30
    0x91BF ; POP R27
    0x91AF ; POP R26
    0x919F ; POP R25
    0x918F ; POP R24
    0x917F ; POP R23
    0x916F ; POP R22
    0x915F ; POP R21
    0x914F ; POP R20
    0x913F ; POP R19
    0x912F ; POP R18
    0x900F ; POP R0
    0xBE0F ; OUT 63 R0
    0x900F ; POP R0
    0x901F ; POP R1
    0x9518 ; RETI

Routine body (excerpts):

    0x9180 ; LDS R24 2102
    0x0836 ; ---
    0x2388 ; AND R24 R24
    0xF049 ; BRBS 9 1
    0xE080 ; LDI 0 R24
    0x940E ; CALL 2750
    0x0ABE ; ---

    0x9B81 ; SBIS 16 1
    0xC005 ; RJMP 5
    0x9180 ; LDS R24 2095
    0x082F ; ---
    0x2388 ; AND R24 R24
    0xF009 ; BRBS 1 1
    0xCF81 ; RJMP 3969

Figure 1: Assembly code from an AVR interrupt service routine highlighting the overhead of prologue and epilogue code segments.

In traditional microcontrollers, software developers cannot be shielded from the fact that interrupts are typically both complex to program and slow to respond. While the peripheral sets of modern microcontrollers continue to expand, support for the growing complexity of interrupt processing does not. These devices are typically limited to, at most, one fast-interrupt source. This imposes limitations on embedded software, requiring the overhead of a task-manager or similar OS-level component [8, 16, 15].

The main idea behind this paper is to explore the effectiveness of multi-threading in the embedded space. While the case for interrupt service threads has been made before, no one to date has conducted a careful study of embedded multi-threading that quantified the ramifications from the circuit level all the way to the software layers. A multi-threaded design will only make sense if the software gains are thoroughly weighed against the hardware overheads. To flesh out the important tradeoffs, we have fully implemented a family of threaded processors which are binary compatible with the Atmel AVR instruction set. We quantify the area and timing ramifications of our design, and more importantly we demonstrate how it scales with the number of hardware threads. However, before we get to our results we need to present the details of our micro-architecture and describe the techniques that made multi-threading work in this extremely resource constrained environment.

3. PROCESSOR ARCHITECTURE

JackKnife models its architecture after that of the AVR, providing advanced capabilities while maintaining simplicity and software compatibility. Figure 2 provides a functional overview of the processor architecture, illustrating the division of the six pipeline stages. A memory-mapped infrastructure is used throughout the design, reducing design complexity by providing a uniform interface to system peripherals and memory. By supporting interleaved multi-threading, JackKnife provides implicit sharing of data path resources for increased throughput and low-latency response to dynamic events. Extensions to the original AVR pipeline result in over an order of magnitude increase in clock rate, further increasing system throughput. Additionally, JackKnife includes a custom scheduler, dynamic interrupt handling, and a novel synchronization construction.

[Figure 2 block diagram. Labeled components: Scheduler, Program Cntr, Status Reg, Stack Pointer, Register File, Reg File Addr, Instr Reg, Ext Instr Reg, Instr Format, Fetch, Decode, Commit, Arithmetic/Logic Unit, Multiplication Unit, Condition Unit, Branch/Jump Unit, Load/Store Unit, Instruction Memory Bus, Data Memory Bus.]

Figure 2: Functional overview of the JackKnife core components, including the division of the 6 pipeline stages.

3.1 Dynamic Interleaving

In an interleaved pipeline, instructions are selected from a different thread every cycle. This strategy provides implicit data path sharing and removes data dependencies between consecutive instructions, alleviating many costly stall conditions and providing better utilization of processor resources. Dynamic interleaving allows the scheduler to dynamically select a thread for execution based on resource availability and thread readiness. The JackKnife architecture focuses on flexibility of design and system responsiveness, selecting active threads on a round-robin basis and scheduling interrupt service threads (ISTs) on the next cycle.

JackKnife supports concurrent execution of any number of threads up to the current design limit of eight. Interleaved processors, like the Tera MTA, often suffer from performance degradation when the number of executing threads drops below some threshold. We have escaped this drawback through limited use of forwarding, which typically reduces data-dependent stalls to a single cycle. The ability to execute single-threaded code with only slight performance degradation allows the system to respond to the randomness typically exhibited in many embedded environments. While we do not directly address the need for greater than 8 concurrent threads, we expect that software-level threads could be mapped onto hardware-level threads to provide additional design flexibility.

3.2 Pipeline

JackKnife employs a six stage pipeline with extended functionality for per-stage flush and stall conditions. The event-driven nature of embedded systems means that control flow will change often and unexpectedly; a successful system should minimize the overhead of these transitions. The AVR architecture utilizes a dynamic pipeline which executes instructions in anywhere from one to four cycles, depending on instruction complexity. JackKnife inherits this dynamic behavior, but provides single-cycle support for a much greater number of instructions than the AVR.

Dynamic multi-threaded execution forces increases in pipeline complexity, requiring additional logic for recognizing data and control hazards. Managing the control logic, and all of the states needed to handle the interactions of data path stalls, flushes, interrupts, and multiple threads, can quickly get out of hand. One of the problems is that the semantics of an instruction stream should hold no matter how it is interleaved with streams from other threads. A typical state machine approach to building such a controller can quickly explode into an unmanageable number of states. The other option is to hand build a control structure that can effectively handle a particular instance of the design. This, however, is against our design goal of creating a configurable and extensible design (for instance, this would make changing the number of threads handled by the architecture at a hardware level very difficult). To tackle this problem we have used instruction tagging in conjunction with a novel pipeline specification methodology based around the control specification language pyPBS [9]. The non-deterministic finite automata (NFA) technique allows efficient controllers to be realized with minimum design effort and implementation overhead. Complete descriptions of the methodology and synthesis language are available in [10] and [9].

3.3 Scheduler

Supporting multiple concurrent streams of execution and zero-latency interrupt response requires a custom scheduler capable of balancing system requirements. The JackKnife scheduler is a best-effort, round-robin scheduler targeted at maintaining balanced execution among threads. A round-robin scheduling policy is utilized based on the last dispatched thread, the pool of active threads, and the state of interrupts.

Scheduling can quickly become a complex process, accounting for many aspects of the system state. The JackKnife scheduler is designed with simplicity in mind, operating on a minimal set of inputs. Configuration is handled via a bank of active registers which maintain thread state and a 7-bit counter for delayed de-scheduling. The density of the AVR instruction set precludes adding custom instructions for frequently used atomic operations, such as releasing a synchronization lock and de-scheduling a thread. Without atomicity, such scenarios result in race conditions, making system execution unpredictable. Delayed de-scheduling provides a solution to these issues by allowing a thread to inform the scheduler about the number of instructions it needs to execute before being deactivated.
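The delayed de-scheduling idea can be illustrated with a small software model. This is only a sketch under stated assumptions: the type `jk_thread_state`, the functions `jk_request_deschedule` and `jk_on_instruction_retired`, and the counter semantics (a plain count of instructions remaining after the request) are hypothetical, since the text does not specify the register encoding.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-thread scheduler state; field names are illustrative. */
typedef struct {
    bool    active;     /* thread is eligible for dispatch */
    bool    countdown;  /* a delayed de-schedule is pending */
    uint8_t remaining;  /* instructions left before deactivation (7 bits) */
} jk_thread_state;

/* The thread asks to be deactivated after `n` more instructions. */
void jk_request_deschedule(jk_thread_state *t, uint8_t n)
{
    t->remaining = n & 0x7F;  /* the counter is 7 bits wide */
    t->countdown = true;
}

/* Called once for every instruction the thread completes after the request. */
void jk_on_instruction_retired(jk_thread_state *t)
{
    if (!t->countdown)
        return;
    if (t->remaining > 0)
        t->remaining--;
    if (t->remaining == 0) {
        t->active = false;    /* deactivate with no further software action */
        t->countdown = false;
    }
}
```

In this model a thread can request de-scheduling, then release a lock in the instructions that follow, and be put to sleep by the scheduler only once the release has completed, which is how the race described above is avoided.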

The addition of multi-threading to a single-threaded system poses some issues in terms of thread support. Thread initialization is particularly challenging, as each processor context has no direct access to any other. To facilitate this process, the scheduler provides initialization registers which are used when a thread has yet to execute. Modification of the execution history flags and target instruction address allows any thread to change the execution flow of any other. The ability to (re)start threads at arbitrary code segments is similar to software-level threading techniques like POSIX and Java, where execution begins at a specified function/method. While these operations are universally allowed, we expect that protected thread execution and other complex scheduling can be accomplished at the software level.

System execution always begins in thread zero, which enters program code at the beginning of memory. This thread is effectively used to bootstrap the initialization of other threads. Shared memory provides an easily-used communication channel for information passing between threads. This is of particular importance during thread initialization, where an appropriate address for the stack pointer must be established to avoid memory corruption. On acceptance of an interrupt, the scheduler overrides the round-robin policy, scheduling the interrupt at the soonest possible time.
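The selection policy described in this section (round-robin over the active threads, based on the last dispatched thread, with an accepted interrupt taking precedence) can be sketched as a software model. The function `jk_select_thread`, the constant `NO_THREAD`, and the bit-mask representation of the active pool are assumptions made for illustration; the actual scheduler is hardware.

```c
#include <stdint.h>

#define NUM_THREADS 8    /* current JackKnife design limit */
#define NO_THREAD   (-1) /* hypothetical marker: nothing to dispatch */

/*
 * Pick the thread to dispatch this cycle.
 *   active     - bit i set means thread i is runnable
 *   last       - thread dispatched on the previous cycle (0..7)
 *   irq_thread - service thread of an accepted interrupt, or NO_THREAD
 */
int jk_select_thread(uint8_t active, int last, int irq_thread)
{
    /* An accepted interrupt overrides the round-robin policy. */
    if (irq_thread != NO_THREAD)
        return irq_thread;

    /* Otherwise scan upward from the thread after `last`, wrapping around. */
    for (int step = 1; step <= NUM_THREADS; step++) {
        int candidate = (last + step) % NUM_THREADS;
        if (active & (1u << candidate))
            return candidate;
    }
    return NO_THREAD; /* no active threads */
}
```

With threads 0, 1 and 3 active and thread 1 dispatched last, the model selects thread 3 next and then wraps to thread 0; with a single active thread it selects that thread every cycle.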

3.4 Interrupts

Efficient use of system resources often necessitates the use of interrupt-driven execution models. Atmel’s AVR devices support up to 35 interrupts, servicing a variety of peripherals and inputs. JackKnife provides 31 customizable, priority-based interrupt sources, allowing for the inclusion of a broad range of peripherals. Compatibility with AVR devices is maintained by executing interrupt routines out of a common jump vector located at the beginning of program memory. Code migration from AVR is as simple as assigning service threads to each interrupt source. This assignment is dynamic and configured in a set of memory-mapped control registers within the interrupt unit.

With clock speeds reaching 20MHz and minimum interrupt response times of 4 clock cycles, interrupt latency for commercially available AVRs is at least 200ns. The AVR architecture provides no hardware support for fast interrupt context switching, adding additional delay to ISR response times. AVR interrupt routines typically require prologue and epilogue code as long as 17 instructions each (Section 5.2), resulting in delays easily reaching 1µs. In contrast, JackKnife can achieve interrupt response times of several nanoseconds by providing zero-cycle context switching, priority scheduling, and dedicated interrupt service threads (ISTs) [3, 21, 13, 2]. ISTs allow user code to be executed immediately, without the overhead of prologue and epilogue code. This alone has shown a nearly 50% improvement in execution time (Section 5.2).
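The latency figures above follow from simple cycle arithmetic, which the helper below makes explicit. The function name `cycles_to_ps` is hypothetical, and the result is in picoseconds so that a 400MHz cycle (2.5ns) stays an integer.

```c
#include <stdint.h>

/* Duration of `cycles` clock cycles at `clock_mhz` MHz, in picoseconds. */
uint64_t cycles_to_ps(uint32_t cycles, uint32_t clock_mhz)
{
    /* One cycle at f MHz lasts 1,000,000 / f picoseconds. */
    return (uint64_t)cycles * 1000000u / clock_mhz;
}
```

Four cycles at 20MHz come to 200,000ps (200ns). Adding a 17-instruction prologue, under the optimistic assumption of one cycle per instruction, gives 21 cycles or 1.05µs, in line with delays reaching 1µs. A single cycle at 400MHz is 2,500ps (2.5ns), which is the scale of the JackKnife response times quoted.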

Guaranteeing correct program execution requires that interrupt services save and restore system state prior to and following execution of interrupt routines. To enable correct program flow, interrupts trigger an implicit stack push with the servicing thread’s current instruction address. This makes thread initialization important for ISTs as well as normal process threads to avoid corrupting memory. Interrupt routines typically conclude with a return instruction, causing execution flow to return to a pre-interrupt state. An IST has no meaningful previous execution point and therefore should de-schedule itself to avoid rampant execution.

While ISTs offer immediate response to interrupt events, the time to complete an interrupt routine is highly influenced by the number of concurrently running threads. Worst case execution times can be determined based on the maximum number of supported hardware threads. At the software level, the current scheduler provides the option of custom, or even dynamic, allocation of system resources to executing interrupts. For instance, a high priority interrupt could de-schedule all competing threads upon entry. At the hardware level, a priority scheduler could provide the same advantages with reduced software overhead.

3.5 Synchronization

Processor throughput is maximized when parallel processes have no inter-dependencies. However, it is often the case that several threads of execution require some combination of synchronization, communication, and data sharing. The distribution of tasks among threads makes this even more important, as it is inevitable that a coherent view of the system is needed at times. While communication and data sharing are implicitly supported by the shared memory architecture, synchronization requires specialized support to guarantee race-free execution [19, 5]. This support is frequently seen through a variety of implementations ranging from flagged memory [26, 20, 1] to dedicated instructions [26, 17]. Some of these methods aim to provide primitive constructs such as spin-locks [7, 14], while others provide more elaborate solutions.

[Figure 3 block diagram. Labels: Thread, Program Counter, Active, History, Interrupt Source, Round Robin, Interrupt, Interrupt Thread, Start Address, Data Memory Bus.]

Figure 3: Functional overview of the JackKnife scheduler.

Though there exist any number of suitable solutions, many of these techniques require custom instructions or architectural modifications that are at odds with our design goals for maintaining compatibility with existing tools and software. Instead, we have taken an approach that builds on the extensible nature of the I/O system to implement a configurable synchronization module that offers a variable number of dedicated synchronization locks. The module is a small memory with special handling of read and write requests. Obtaining one of the synchronization locks is done by performing a read request to the I/O address associated with the lock. Read requests generate an atomic test-and-set in the synchronization unit, setting the lock and returning one if the lock was previously free, and zero otherwise: effectively a ‘try’ operation. Locks are cleared by writing zero to the lock address.
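The read and write behavior of the synchronization module can be modeled in a few lines. This is a sketch, not the hardware: the names `jk_sync_read` and `jk_sync_write` and the lock count `JK_NUM_LOCKS` are hypothetical, the model is single-threaded (the hardware makes the read atomic), and ignoring non-zero writes is an assumption, since the text only specifies that writing zero clears a lock.

```c
#include <stdint.h>

#define JK_NUM_LOCKS 8  /* hypothetical; the module's lock count is configurable */

static uint8_t jk_lock_held[JK_NUM_LOCKS]; /* 1 = held, 0 = free */

/* Models a read of lock `i`: test-and-set, returns 1 only when acquired. */
uint8_t jk_sync_read(unsigned i)
{
    uint8_t was_free = (jk_lock_held[i] == 0);
    jk_lock_held[i] = 1;  /* the lock is set whether or not it was free */
    return was_free;
}

/* Models a write to lock `i`: writing zero releases the lock. */
void jk_sync_write(unsigned i, uint8_t value)
{
    if (value == 0)
        jk_lock_held[i] = 0;
    /* Assumption: non-zero writes leave the lock unchanged. */
}
```

A spin-lock built on this model is a loop that repeats the read until it returns one, which matches the `while (!SYNC0);` idiom used in Figures 4 and 5.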

By combining spin-locks (using the synchronization registers) with explicit scheduler control, it is possible to create more complex forms of synchronization with less overhead. Figure 4 illustrates a meeting point where all but the last thread to enter de-schedule themselves. The final thread to enter then restarts all threads simultaneously. This example makes use of delayed de-scheduling to deactivate threads and release the synchronization lock, avoiding an otherwise potential race condition. As threads progressively deactivate themselves, processor resources are dynamically reallocated to running threads, resulting in reduced overhead and faster completion, an obvious improvement over spin-waiting. Figure 5 demonstrates a turnstile in which threads are allowed to execute in the critical region in single file. De-scheduling all competing threads upon entry of the critical region reduces contention for resources and also removes any overhead otherwise incurred by spin-waiting.

4. MEMORY ARCHITECTURE

JackKnife implements unique contexts for all of its 8 hardware threads. Each context consists of a 32-byte register file, status register, program counter, and stack pointer. The memory architecture of JackKnife mirrors that of the AVR, allowing for seamless code migration from existing devices. In the AVR architecture, much of the programming complexity is masked by providing a uniform interface to peripherals and memory. This allows our controller to handle a large number of different on-chip structures without requiring specialized design. Memory-mapping serves to reduce complexity stemming from otherwise necessary custom instructions and component interconnect. From an implementation standpoint, this methodology reduces the overhead of adding custom peripherals (such as our synchronization module) and control elements, since interfacing requires no additional hardware.

A robust set of memory access instructions adds flexibility to the AVR architecture by providing multiple ways to efficiently access both data and I/O memory. These instructions allow direct and indirect access with pre-decrement, post-increment, and displacement capabilities, as well as bit manipulation for some I/O registers. All AVR peripherals are accessible via memory-mapped I/O registers at addresses 32 - 255. Though no commercially available AVR devices use even half of this space, JackKnife allows for remapping of data memory to facilitate larger I/O memory space. While functionally disparate, I/O and data memory appear as a single contiguous memory that can be partitioned in any number of ways. This flexibility allows for straightforward customization of both on-board peripherals and memory for meeting area and functionality constraints.

The remainder of this section describes the instruction fetch and cache methodology employed in JackKnife, the register file implementation strategy, and the extensible bus architecture.

    void meeting_point(void) {
        // Spin-wait on sync lock zero
        while (!SYNC0);

        // Check if any other threads are running
        if ((ACTIVE & THREAD_GROUP_MASK) == THREAD_ID_MASK)
            // Wake all threads in group
            ACTIVE = THREAD_GROUP_MASK;
        else
            // Delayed sleep after 1 instruction
            *((char *)(ACTIVE0 + THREAD_ID)) = 3;

        // Release sync lock
        SYNC0 = 0;

        // End meeting point
    }

Figure 4: Thread Meeting Point

    void turnstile(void) {
        // Spin-wait on sync lock zero
        while (!SYNC0);

        // Store status of threads
        unsigned char thread_status_backup = ACTIVE;

        // Sleep all other threads in group
        ACTIVE = (ACTIVE & ~THREAD_GROUP_MASK) | THREAD_THIS;

        /* Critical code segment */

        // Restore thread runnable status
        ACTIVE = thread_status_backup;

        // Release the sync lock
        SYNC0 = 0;
    }

Figure 5: Thread Turnstile

4.1 Instruction Fetch

Many available microcontroller devices provide embedded memories for both program and data. Some devices, such as ARM, use a bootstrapping technique to move program code from non-volatile memory to volatile memory before beginning execution of the main program. JackKnife employs a hardware-level bootstrapping process to move program code to fast instruction memory. With multi-threading, one of the concerns is always the increased pressure on instruction fetch. Many of the embedded applications we target are very small, on the order of hundreds of bytes, and can be pulled completely on-chip. The ability to perform single-cycle access to all of memory greatly improves processor throughput when compared to the penalties incurred from instruction cache implementations and external memory accesses. In a final production design, a vast majority of the code can be stored in ROM, and a software patch memory can ensure that limited software updates and bug-fixes will still be possible.

4.2 Register File

The AVR architecture provides 32 8-bit general purpose registers with support for some 16-bit operations. JackKnife implements the 8 processor contexts as a pair-wise contiguous memory. Pairing registers provides 16-bit aligned data access with minimal overhead for outputting 8-bit data. Rather than implement independent register files for each thread, a uniform memory can be optimized for speed, area, and locality with data path components. Utilizing the thread identifier as part of the register file address provides a straightforward method for accessing and updating data with very little added complexity. Rather than add complex forwarding paths, the register file provides transparent updates, effectively forwarding data for consecutive instructions. This further opens space for optimizations in the non-uniform distribution of registers to threads, and could potentially be of use in reducing the size of the register file.

4.3 Extensible Bus

As previously mentioned, the embedded domain contains many different applications that require varying types of interfaces. Because the peripheral set of a commercial (COTS) device is fixed, families of devices are typically offered with varying peripheral sets and memory sizes. It is often the case that these predetermined peripheral sets are not well suited to a particular design: included peripherals are unnecessary or do not implement the desired functionality. As a synthesizable design, JackKnife provides the ability to customize system peripherals for a given application.

Central to this capability is an extensible bus architecture that links the processor core to both data and I/O memory. Peripheral expansion can be easily accomplished while maintaining compatibility with existing AVR devices by implementing memory-mapped control registers in unused regions of the AVR I/O space. The decision to fully support a memory-mapped infrastructure allows integration complexity to be masked, and provides a uniform interface to all system components. Peripheral modules are self-contained, making them interchangeable and infinitely customizable. In addition to memory-mapped access, JackKnife provides access to its prioritized interrupt unit, allowing peripherals to request immediate servicing via one of 31 customizable interrupt sources. All peripheral modules adhere to a common bus policy in which write requests are specified as single-cycle operations and basic read requests are two-cycle operations, consisting of one cycle for address output and one cycle for data consumption. Though not advisable, the JackKnife memory controller supports the use of memory wait signals for slow memory devices. By maintaining a tight communication protocol, we aim to increase performance by reducing the number of required stall cycles.

5. IMPLEMENTATION AND SYNTHESIS RESULTS

Now that we have described the architecture in a fair amount of detail, we describe the performance and area impact as determined through synthesis, and the observed effects of multi-threading on both execution and interrupt response time. The JackKnife implementation contains roughly 4250 lines of synthesizable Verilog HDL. Data path components are all written in Verilog, while sequential controllers are specified using the pyPBS language and compiled into synthesizable Verilog.

5.1 Synthesis Results

Synopsys Design Compiler was used to synthesize several versions of the JackKnife implementation. DesignWare components are used throughout the implementation, providing tested and optimized components. A 0.15µm TSMC standard cell library is used for target mapping of all components. All components, including memories, are implemented in standard cells, resulting in substantially larger area and power cost than would be incurred by full-custom implementation of memory components. The TSMC cells are characteristic of typical standard cells in the technology. Conservative wiring and wire-delay models were used for performance calculations.

[Figure 6 block diagram. Labels: JackKnife Core, Instruction Memory Bus, Extensible Data Memory Bus, Instruction Memory, Data Memory, I/O Port A, I/O Port B, UART, SPI, Interrupt Controller, Synchronization.]

Figure 6: System-level overview of the JackKnife core and peripheral set.

As we have argued throughout this paper, a multi-threaded architecture would ease the programming burden of those controlling multi-tasking embedded systems, but this can only be justified if it does not increase the size and delay of the design significantly. To show the effect of multi-threading we have synthesized a single-threaded design along with multi-threaded designs with support for 2 to 8 different hardware contexts. We should point out that all of our synthesized designs run at more than 20 times higher frequency than the current best commercial AVR processors, which run at 20MHz.

The small context size of the AVR allows addition of multiple contexts at a significantly reduced cost when compared to that of a 32-bit machine. Figure 7 compares the area and clock speed trade-offs of implementing the entire JackKnife core with 1 through 8 threads. The area scales nicely, with nearly a 3.5x increase from the single-thread implementation to that with 8 threads. The effective area difference between implementations is the size of the added hardware contexts, each consisting of 37 bytes of memory and supporting logic. The trade-off for multi-threading in general is shown to be minimal, with added logic complexity appearing mainly in scheduling and control logic.
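As a rough illustration of why the per-context cost is small, the sketch below tallies the state one hardware context must hold. The 37-byte total comes from the text; the split into registers, stack pointer, status register, and program counter is our assumption based on the standard AVR programmer-visible state, and the names are hypothetical.

```python
# Assumed per-context breakdown of AVR programmer-visible state, in bytes.
# The paper quotes only the 37-byte total; the split below is an assumption.
CONTEXT_BYTES = {
    "general_purpose_registers": 32,  # r0..r31, 8 bits each
    "stack_pointer": 2,               # 16-bit SP
    "status_register": 1,             # 8-bit SREG
    "program_counter": 2,             # PC, assumed to fit in 16 bits
}

BYTES_PER_CONTEXT = sum(CONTEXT_BYTES.values())


def context_storage_bytes(num_threads: int) -> int:
    """Total context storage for a design with num_threads hardware contexts."""
    if not 1 <= num_threads <= 8:
        raise ValueError("the synthesized designs support 1 to 8 threads")
    return num_threads * BYTES_PER_CONTEXT


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        total = context_storage_bytes(n)
        print(f"{n} thread(s): {total} bytes ({total * 8} bits of state)")
```

Context storage grows eightfold from 1 to 8 threads, while the reported total area grows by only about 3.5x, since the decode logic and functional units are shared by all contexts.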

The clock speed is shown to remain constant across all designs, with variances in design optimization accounting for any differences in maximum clock frequency. Currently, critical paths exist through the multiplication unit. This stems from single-cycle support for fractional, unsigned with signed multiplication, an operation that takes 2 cycles on the original AVR. The obscurity of such an instruction provides motivation for pipelining the multiplication unit in future implementations. While power concerns are clearly at least as important as performance, we should point out that all the performance we can extract from the machine at this stage in its design will give us slack to exploit in a full custom power optimized implementation.
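For readers unfamiliar with the operation on the critical path, the following sketch models what a fractional signed-with-unsigned multiply computes. We assume the text refers to the AVR `FMULSU` instruction (signed multiplicand, unsigned multiplier, 16-bit product shifted left by one bit); the function name and return convention are hypothetical, and only the result and carry flag are modelled.

```python
def fmulsu(rd: int, rr: int):
    """Model of a fractional signed-by-unsigned 8x8 multiply.

    rd and rr are raw 8-bit register values (0..255). rd is interpreted as
    signed, rr as unsigned. Returns (result16, carry), where carry is bit 15
    of the product before the one-bit left shift.
    """
    if not (0 <= rd <= 0xFF and 0 <= rr <= 0xFF):
        raise ValueError("operands must be 8-bit register values")
    signed_rd = rd - 0x100 if rd & 0x80 else rd  # two's-complement decode
    product = (signed_rd * rr) & 0xFFFF          # 16-bit product
    carry = (product >> 15) & 1                  # bit shifted out
    result = (product << 1) & 0xFFFF             # fractional alignment
    return result, carry
```

For example, `fmulsu(0x40, 0x80)` multiplies 0.5 (signed 1.7 format) by 1.0 (unsigned 1.7 format) and yields `0x4000`, which is 0.5 in 1.15 format. The sign handling, multiply, and shift all have to fit in one clock period in a single-cycle implementation, which is consistent with this unit limiting the clock.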

Figure 7: A plot of the area of the JackKnife core (µm²) and the operating frequency (MHz) as a function of the number of hardware contexts supported. [Plot; x-axis: Number of Threads (1 to 8); left axis: Area (µm²), 0 to 800000; right axis: Clock Frequency (MHz), 0 to 450; series: Area, Clock Frequency.]

To understand how the different parts of the processor scale with the number of contexts, Figure 8 shows the individual breakdown among processor components as they contribute to the total design area. As mentioned earlier, the addition of a context should increase the size of the register file and other context registers linearly, and impact some of the control and scheduling logic. At 61% of the 8-threaded design, the register file clearly dominates the total area. This is in part due to the standard cell implementation; it is important to note that the use of modern memory structures can reduce this area by as much as a factor of 12. Fabricated implementations would take advantage of modern memory structures through the use of a memory compiler, increasing performance further while significantly reducing area costs.

Looking past the size of the hardware contexts, Figure 8 shows other key components as well. The size of the decode logic remains constant across all implementations, while the control grows only slightly. This is due to a common control hierarchy that is present in all implementations [10]. For single-threaded implementations, the scheduler consists of the program counter register and supporting logic. The area balance attributed to the core corresponds to the functional units and supporting glue logic. While the logic complexity of the pipeline and functional units remains constant across all designs, switching between hardware contexts comes at some expense.

Figure 8: Area breakdown by component. As we vary the number of hardware thread contexts, this shows how each of the components of JackKnife scales in terms of required area. [Plot; x-axis: Number of Threads (1 to 8); y-axis: Area (µm²), 0 to 800000; components: Decode, Control, Register File, Stack Pointers, Status Registers, Scheduler, Core.]

Figure 9: Constrained Optimization. This figure shows the effect of area optimized synthesis versus performance optimized synthesis. [Plot; groups: Clock Frequency, Area; axes: Clock Frequency (MHz), 0 to 450, and Area (µm²), 0 to 800000; series: Clock Frequency Optimized, Area Optimized.]

Results shown thus far were synthesized using timing constraints as the primary implementation target. Figure 9 shows the synthesis trade-off for implementations targeted at area rather than performance. While the implementation targeting area is shown to be less than 90% of the size of the version strictly targeted for performance, the resulting performance cost is 2.5x. Again, the total area cost of the implementation would be greatly reduced by use of memory structures versus standard cells. Overall, our microcontroller design scales well with the number of threads and, even at 167MHz, outperforms other designs in its class.

Figure 10: A plot of the CPI and ratio of dispatched to committed instructions executing a Reed-Solomon encoding algorithm. [Plot; x-axis: Number of Threads (1 to 8); left axis: CPI, 0 to 1.8; right axis: Instructions Dispatched / Committed, 0 to 100; series: CPI, Commit Ratio.]

5.2 Effect of Multi-threading

While clock speeds in excess of 400MHz provide a strong basis for our argument, they fail to convey true machine performance without some measure of the pipeline efficiency. In order to provide this measure, we simulated the JackKnife design for several test applications and characterized the performance in terms of average clocks per instruction (CPI), interrupt bandwidth, and efficiency in terms of the ratio of committed to dispatched instructions. Figure 10 plots the CPI and pipeline efficiency ratio as determined by running a Reed-Solomon encoding algorithm on 1 to 8 threads, with each thread running an independent encoding job. CPI is clearly reduced by the use of multi-threading, where running more than 4 threads in parallel is shown to provide a 30% improvement compared to single-threaded execution. The committed to dispatched instruction efficiency measure shows that execution with more than 4 threads negates any waste associated with flushed in-flight instructions arising from control flow changes. In terms of power, this result is interesting as it signifies that the processor is performing 100% of its scheduled instructions, thereby making efficient use of power.

While typical CPI varies depending on the application source code, commercial AVR devices require more than one cycle to execute the majority of the instruction set. AVR devices specify a peak (best-case) CPI of 1 but, like most processors, cannot achieve this under typical execution; some instructions require as many as 4 clock cycles to complete, and reported averages are in the range of 1.8. While the JackKnife core also has a peak CPI of 1, it typically executes at 1.16 when running more than 4 threads.
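Clock frequency and CPI combine into instruction throughput as clock rate divided by CPI. The sketch below applies that relation to the figures quoted in the text; it is a back-of-the-envelope estimate only. We take 400MHz as a lower bound for JackKnife ("in excess of 400MHz"), and the two CPI values come from different workloads, so the ratio is indicative rather than a measured result. The function and constant names are hypothetical.

```python
def mips(clock_mhz: float, cpi: float) -> float:
    """Millions of instructions per second for a given clock rate and average CPI."""
    if clock_mhz <= 0 or cpi <= 0:
        raise ValueError("clock_mhz and cpi must be positive")
    return clock_mhz / cpi


# Figures quoted in the text; the JackKnife clock is taken as a lower bound.
JACKKNIFE_CLOCK_MHZ = 400.0
JACKKNIFE_CPI = 1.16       # typical, when running more than 4 threads
COMMERCIAL_AVR_CLOCK_MHZ = 20.0
COMMERCIAL_AVR_CPI = 1.8   # reported average

if __name__ == "__main__":
    jk = mips(JACKKNIFE_CLOCK_MHZ, JACKKNIFE_CPI)
    avr = mips(COMMERCIAL_AVR_CLOCK_MHZ, COMMERCIAL_AVR_CPI)
    print(f"JackKnife: {jk:.1f} MIPS, commercial AVR: {avr:.1f} MIPS, ratio {jk / avr:.1f}x")
```

Under these assumptions the estimated throughput advantage is larger than the clock ratio alone, because the lower CPI compounds with the higher frequency.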

A side effect of multi-threading and the use of ISTs is the resulting advantages in terms of interrupt routine size and execution performance. By utilizing ISTs, interrupt routines can avoid the overhead associated with backing up and restoring machine state (the prologue and epilogue code segments). The prologue and epilogue cannot be avoided, however, for systems with a large number of pre-emptive interrupts. In such systems, the eight thread limit requires that several interrupt sources map to a common thread. To this end, JackKnife offers a great deal of flexibility in thread mapping, allowing the software designer to make intelligent decisions based on system requirements.

Figure 11: The sizes (instructions) of the interrupt service routines from three different AVR applications, showing the breakdown of prologue, epilogue, and user instructions. [Plot; groups: MP3 Player, Power Relay Controller, Automotive Engine Monitor, Average; y-axis: Interrupt Size (instructions), 0 to 70; series: User, Prologue, Epilogue.]

Figure 12: A comparison of execution time (ns) for running the surveyed interrupt routines in a single-threaded implementation (non-IST) versus a multi-threaded implementation employing the use of interrupt service threads. [Plot; groups: MP3 Player, Power Relay Controller, Automotive Engine Monitor, Average; y-axis: Execution Time (ns), 0 to 300; series: Non-IST, IST.]

We have used a set of microcontroller applications from other domains in an attempt to characterize the amount of interrupt handling code. Figure 11 surveys interrupt routine sizes for three applications: a hard disk-based MP3 player with LCD display, USB, serial, and push button interfaces; a timer and power relay controller with LCD, serial, and push button interfaces; and an automotive engine monitor with ADC, serial, and VFD interfaces. The information gathered reveals that typically 50% of the instructions in an ISR are overhead. Replacing standard interrupt routines with dedicated ISTs results in an average 50% decrease in both code size and execution time as shown in Figure 12.
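The relationship between ISR overhead and IST savings can be made concrete with a small model. This is a sketch under stated assumptions: the class and method names are hypothetical, the instruction counts in the usage note are invented for illustration and are not taken from the surveyed applications, and equating code-size savings with execution-time savings assumes every instruction costs roughly the same number of cycles.

```python
from dataclasses import dataclass


@dataclass
class InterruptRoutine:
    prologue: int  # instructions that save machine state
    user: int      # instructions that do the actual work
    epilogue: int  # instructions that restore machine state

    def isr_size(self) -> int:
        """Size of a conventional ISR, including save/restore code."""
        return self.prologue + self.user + self.epilogue

    def ist_size(self) -> int:
        """Size of the same routine as a dedicated IST with its own context."""
        return self.user

    def overhead_fraction(self) -> float:
        """Fraction of the conventional ISR spent on prologue and epilogue."""
        total = self.isr_size()
        if total == 0:
            raise ValueError("routine has no instructions")
        return (self.prologue + self.epilogue) / total
```

With hypothetical counts such as `InterruptRoutine(prologue=10, user=20, epilogue=10)`, the overhead fraction is 0.5, so moving the routine to a dedicated IST halves its size, in line with the typical 50% figure reported above.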

In the end, the major concern is the impact of multi-threading on interrupt performance. We have argued that multi-threading benefits the designer by alleviating overhead while allowing greater bandwidth for interfacing with the external world. We have shown that the use of ISTs results in savings in code size and execution time; however, we have not yet shown how multi-threading affects interrupt performance. Figure 13 attempts to capture the improvements in execution time resulting from modifying a typical ISR to run as an IST. It further demonstrates the average runtime of such a routine when run in parallel with at least 4 other copies. It is important to stress that all incoming interrupts are executed with single cycle response times. Running this routine in parallel on all 8 threads provides nearly 20MHz interrupt bandwidth, a 2x increase over single-threaded operation.

6. RELATED WORK

As we have described, the idea of hardware support for multi-threading is central to our design. There is a great deal of prior work from both academia and industry on multi-threading, and while a full description of all related work is not possible here, we briefly describe several related schemes as they relate to our synthesizable design.

Figure 13: Comparison of execution time (ns) of a typical interrupt routine when coded for single-threaded execution, as an IST, and the average time for running multiple copies of the routine in parallel, with single-cycle response times. [Plot; y-axis: Execution Time (ns), 0 to 120; series: Non-IST, IST, Parallel IST.]

Prior work in the area of multi-threaded architecture has primarily addressed the high-performance market, with designs targeting highly parallel large-scale applications. The Dynamic Instruction Stream Computer (DISC) [18] showed that dynamic interleaving is a viable solution to achieving better resource utilization in modern processors. The issues of synchronization, interrupt handling, and memory architecture were identified as key areas in efficient multi-threaded architectures, and approaches were presented. Interleaved multi-threading is capable of filling the vertical waste [27, 25] that occurs in conventional pipelined processor designs, but shows little advantage when utilizing more than four threads [25].

Multi-threading provides higher bandwidth than conventional processors, requiring instruction fetch and issue architectures capable of saturating the pipeline at all times [24]. Novel approaches to thread utilization [28] provide a mechanism for speculative execution, reducing the overhead of branch mispredictions. Speculative caching techniques [22] reduce the latency in instruction fetching, providing as much as a 28% improvement in performance. Novel approaches to exception handling [30] parallel the methodology employed in ISTs, removing unnecessary program serialization and allowing software techniques to achieve performance on par with sophisticated hardware.

The Komodo project [3, 21, 13, 2] has been developing a Java-based microcontroller targeting real-time applications. Real-time support is guaranteed for three of the four hardware contexts, with non-real-time tasks sharing the fourth context during periods of high interrupt activity. Real-time scheduling techniques [6] bound worst case execution for multi-threaded processors, providing the capability to perform static scheduling for real-time applications. Alternatively, non-deterministic elements of the processor can be eliminated, providing predictable timing at the expense of performance. The Java microcontroller excludes cache elements and utilizes scheduling algorithms targeting shared execution, simulations of which show a 28% speed increase [3] in certain tasks when used to control an autonomous guided vehicle.

The Tera MTA [23, 4, 1, 27] is one of the more successful architectures in implementing interleaved multi-threading (IMT). Tera MTA supports multi-CPU systems where each processor can handle up to 128 concurrent threads. The architecture is optimized for execution of a large number of threads with compiler support for VLIW instructions and dependency lookahead information. Execution of small numbers of threads results in performance degradation, while only slight performance improvement is seen for variations in larger numbers of threads [20]. Single-threaded execution is impractical given that each thread may only have a single instruction in the pipeline at a time. Interrupts are supported through polling in a dedicated thread rather than supporting a preemptive interrupt architecture.

While there is a large base of theoretical work in the area of multi-threading, few physical implementations have been realized. Our work describes the architecture of a synthesizable threaded and pipelined AVR compatible microcontroller that has been mapped to standard cells. Because this design is extensible through a simple memory mapped I/O interface, it can be easily combined on-chip with a variety of sensors to control and coordinate operation.

7. CONCLUSIONS

Many embedded applications require small size and flexibility without sacrificing high levels of performance and interrupt bandwidth. Our multi-threaded microcontroller, JackKnife, aims to provide these features with capabilities for low-latency event handling, higher performance, and customization. Supporting interrupt service threads (IST) with zero-cycle context switching and reduced ISR overhead provides response times on the order of nanoseconds with typically 50% reduction in execution time.

We have shown that multi-threading in the embedded space, even in a pipelined machine with complex synchronization and I/O handling, is feasible and can be done with minimal overhead. Standard cell synthesis under a variety of constraints has shown that our design scales well in both performance and area, offering a range of implementations targeting power, size, and performance. Given the extensible and open nature of our design, and the availability of mature compilers and software systems for the AVR instruction set, we believe that our design will open the opportunity to create integrated solutions not possible with off-the-shelf components.

8. REFERENCES

[1] G. Alverson, P. Briggs, S. Coatney, S. Kahan, and R. Korry. Tera hardware-software cooperation. In 1997 ACM/IEEE Conference on Supercomputing (CDROM), pages 1–16, 1997.

[2] U. Brinkschulte, C. Krakowski, J. Kreuzinger, and T. Ungerer. Interrupt service threads - a new approach to handle multiple hard real-time events on a multithreaded microcontroller. In RTSS WIP sessions, pages 11–15, Dec. 1999.

[3] U. Brinkschulte, C. Krakowski, J. Kreuzinger, and T. Ungerer. A multithreaded Java microcontroller for thread-oriented real-time event-handling. In 1999 International Conference on Parallel Architectures and Compilation Techniques, pages 34–39, Oct. 1999.

[4] S. Brunett, J. Thornley, and M. Ellenbecker. An initial evaluation of the Tera multithreaded architecture and programming system using the C3I parallel benchmark suite. In 1998 ACM/IEEE Conference on Supercomputing (CDROM), pages 1–19, 1998.

[5] S. Carr, J. Mayo, and C.-K. Shene. Race conditions: a case study. J. Comput. Small Coll., 17(1):90–105, 2001.

[6] A. El-Haj-Mahmoud and E. Rotenberg. Safely exploiting multithreaded processors to tolerate memory latency in real-time systems. In CASES ’04: Proceedings of the 2004 international conference on Compilers, architecture, and synthesis for embedded systems, pages 2–13, New York, NY, USA, 2004. ACM Press.

[7] J. R. Goodman, M. K. Vernon, and P. J. Woest. Efficient synchronization primitives for large-scale cache-coherent multiprocessors. In ASPLOS-III: Proceedings of the third international conference on Architectural support for programming languages and operating systems, pages 64–75, New York, NY, USA, 1989. ACM Press.

[8] J. Hill, R. Szewczyk, A. Woo, S. Hollar, D. Culler, and K. Pister. System architecture directions for network sensors. In Ninth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 93–104, 2000.

[9] G. Hoover and F. Brewer. pyPBS design and methodologies. In Third International Conference on Formal Methods and Models for Codesign, 2005.

[10] G. Hoover, F. Brewer, and T. Sherwood. Extensible control architectures. In International Conference on Compilers, Architectures, and Synthesis for Embedded Systems, 2006.

[11] P. Juang, H. Oki, Y. Wang, M. Martonosi, L. S. Peh, and D. Rubenstein. Energy-efficient computing for wildlife tracking: Design tradeoffs and early experience with ZebraNet. In Tenth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 96–107, 2002.

[12] R. Kling. Intel mote: An enhanced sensor network node. In International Workshop on Advanced Sensors, Structural Health Monitoring, and Smart Sensors, 2003.

[13] J. Kreuzinger, R. Marston, T. Ungerer, U. Brinkschulte, and C. Krakowski. The Komodo project: thread-based event handling supported by a multithreaded Java microcontroller. In 25th EUROMICRO Conference, 1999, volume 2, pages 122–128, Sept. 1999.

[14] C. P. Kruskal, L. Rudolph, and M. Snir. Efficient synchronization of multiprocessors with shared memory. ACM Trans. Program. Lang. Syst., 10(4):579–601, 1988.

[15] T. Liu and M. Martonosi. Impala: A middleware system for managing autonomic, parallel sensor systems. In Ninth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 107–118, 2003.

[16] T. Liu, C. M. Sadler, P. Zhang, and M. Martonosi. Implementing software on resource-constrained mobile sensors: Experiences with Impala and ZebraNet. In International Conference on Mobile Systems, Applications and Services, pages 256–269, 2004.

[17] J. M. Mellor-Crummey and M. L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Trans. Comput. Syst., 9(1):21–65, 1991.

[18] M. D. Nemirovsky, F. Brewer, and R. C. Wood. DISC: Dynamic instruction stream computer. In 24th International Symposium on Microarchitecture, pages 163–171, 1991.

[19] R. H. B. Netzer and B. P. Miller. What are race conditions?: Some issues and formalizations. ACM Lett. Program. Lang. Syst., 1(1):74–88, 1992.

[20] L. Oliker and R. Biswas. Parallelization of a dynamic unstructured application using three leading paradigms. In 1999 ACM/IEEE Conference on Supercomputing (CDROM), 1999.

[21] M. Pfeffer, S. Uhrig, T. Ungerer, and U. Brinkschulte. A real-time Java system on a multithreaded Java microcontroller. In Fifth IEEE International Symposium on Object-Oriented Real-Time Distributed Computing, 2002 (ISORC 2002), pages 34–41, May 2002.

[22] E. Rotenberg, S. Bennett, and J. Smith. Trace cache: a low latency approach to high bandwidth instruction fetching. In 29th International Symposium on Microarchitecture, pages 24–34. IEEE, Dec. 1996.

[23] A. Snavely, L. Carter, J. Boisseau, A. Majumdar, K. S. Gatlin, N. Mitchell, J. Feo, and B. Koblenz. Multi-processor performance on the Tera MTA. In 1998 ACM/IEEE Conference on Supercomputing (CDROM), pages 1–8, 1998.

[24] D. Tullsen, S. Eggers, J. Emer, H. Levy, J. Lo, and R. Stamm. Exploiting choice: Instruction fetch and issue on an implementable simultaneous multithreading processor. In 23rd Annual International Symposium on Computer Architecture, pages 191–202, May 1996.

[25] D. Tullsen, S. Eggers, and H. Levy. Simultaneous multithreading: Maximizing on-chip parallelism. In 22nd Annual International Symposium on Computer Architecture, pages 392–403, June 1995.

[26] D. M. Tullsen, J. L. Lo, S. J. Eggers, and H. M. Levy. Supporting fine-grained synchronization on a simultaneous multithreading processor. In Proceedings of the Fifth International Symposium on High-Performance Computer Architecture, pages 54–58, Jan. 1999.

[27] T. Ungerer and B. Robic. A survey of processors with explicit multithreading. 35:29–63, Mar. 2003.

[28] S. Wallace, B. Calder, and D. Tullsen. Threaded multiple path execution. In 25th Annual International Symposium on Computer Architecture, June 1998.

[29] P. Zhang, C. M. Sadler, S. A. Lyon, and M. Martonosi. Hardware design experiences in ZebraNet. In International Conference on Embedded Networked Sensor Systems, pages 227–238, 2004.

[30] C. B. Zilles, J. S. Emer, and G. S. Sohi. The use of multithreading for exception handling. In MICRO 32: Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture, pages 219–229, Washington, DC, USA, 1999. IEEE Computer Society.
