
818 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 18, NO. 5, MAY 2010

Implementation of a Self-Motivated Arbitration Scheme for the Multilayer AHB Busmatrix

Soo Yun Hwang, Dong Soo Kang, Hyeong Jun Park, and Kyoung Son Jhang, Member, IEEE

Abstract: The multilayer advanced high-performance bus (ML-AHB) busmatrix employs slave-side arbitration. Slave-side arbitration is different from master-side arbitration in terms of request and grant signals since, in the former, the master merely starts a burst transaction and waits for the slave response to proceed to the next transfer. Therefore, in the former, the unit of arbitration can be a transaction or a transfer. However, the ML-AHB busmatrix of ARM offers only transfer-based fixed-priority and round-robin arbitration schemes. In this paper, we propose the design and implementation of a flexible arbiter for the ML-AHB busmatrix to support three priority policies (fixed priority, round robin, and dynamic priority) and three data multiplexing modes (transfer, transaction, and desired transfer length). In total, there are nine possible arbitration schemes. The proposed arbiter, which is self-motivated (SM), selects one of the nine possible arbitration schemes based upon the priority-level notifications and the desired transfer length from the masters so that arbitration leads to the maximum performance. Experimental results show that, although the area of the proposed SM arbitration scheme is 9%–25% larger than that of the other arbitration schemes, our arbiter improves the throughput by 14%–62% compared to other schemes.

Index Terms: Multilayer AHB (ML-AHB) busmatrix, on-chip bus, self-motivated (SM) arbitration scheme, slave-side arbitration, system-on-a-chip (SoC).

I. INTRODUCTION

THE ON-CHIP bus plays a key role in system-on-a-chip (SoC) design by enabling the efficient integration of heterogeneous system components such as CPUs, DSPs, application-specific cores, memories, and custom logic [1]. Recently, as design complexity has grown, SoC designs have come to require a system bus with high bandwidth to perform multiple operations in parallel [2]. To solve the bandwidth problems, several types of high-performance on-chip buses have been proposed, such as the multilayer AHB (ML-AHB) busmatrix from ARM [3], the PLB crossbar switch from IBM [4], and CONMAX from Silicore [5]. Among them, the ML-AHB busmatrix has been widely used in many SoC designs. This is because of the simplicity of the AMBA bus of ARM, which attracts many IP designers [6], and the good architecture of the AMBA bus for low-power embedded systems [7].

Manuscript received July 15, 2008; revised December 13, 2008. First published July 28, 2009; current version published April 23, 2010. This work was supported in part by the STSAT-3 Program, Grant of the Ministry of Education, Science and Technology of South Korea, and by the IT R&D program of MKE/IITA [2006-S001-03, Development of Adaptive Radio Access and Transmission Technologies for the 4th Generation Mobile Communications].

S. Y. Hwang and H. J. Park are with the High-Speed User Equipment Modem Research Team, Department of Mobile Convergence Research, Electronics and Telecommunications Research Institute, Daejeon 305-700, Korea (e-mail: [email protected]; [email protected]).

D. S. Kang and K. S. Jhang are with the Digital System Laboratory, Department of Computer Engineering, Chungnam National University, Daejeon 305-764, Korea (e-mail: [email protected]; [email protected]).

Digital Object Identifier 10.1109/TVLSI.2009.2015665

The ML-AHB busmatrix is an interconnection scheme based on the AMBA AHB protocol, which enables parallel access paths between multiple masters and slaves in a system. This is achieved by using a more complex interconnection matrix and gives the benefit of both increased overall bus bandwidth and a more flexible system structure [3]. In particular, the ML-AHB busmatrix uses slave-side arbitration. Slave-side arbitration is different from master-side arbitration in terms of request and grant signals since, in the former, the master merely starts a burst transaction and waits for the slave response to proceed to the next transfer. Therefore, the unit of arbitration can be a transaction or a transfer [8]. The transaction-based arbiter multiplexes the data transfer based on the burst transaction, and the transfer-based arbiter switches the data transfer based on a single transfer. However, the ML-AHB busmatrix of ARM presents only transfer-based arbitration schemes, i.e., transfer-based fixed-priority and round-robin arbitration schemes. This limitation on the arbitration scheme may lead to degradation of the system performance because the arbitration scheme is usually dependent on the application requirements; recent applications are likewise becoming more complex and diverse. By implementing an efficient arbitration scheme, the system performance can be tuned to better suit applications [9].

For a high-performance on-chip bus, several arbitration schemes have been proposed, such as table-lookup-based crossbar arbitration [10], two-level time-division multiplexing (TDM) scheduling [11], a token-ring mechanism [12], a dynamic bus distribution algorithm [13], and LOTTERYBUS [14]. However, these approaches employ master-side arbitration. Therefore, they can only control the priority policy and also present some limitations when handling the transfer-based arbitration scheme since master-side arbitration uses a centralized arbiter. In contrast, it is possible to deal with the transfer-based arbitration scheme as well as the transaction-based arbitration scheme in slave-side arbitration. In this paper, we propose a flexible arbiter based on the self-motivated (SM) arbitration scheme for the ML-AHB busmatrix. Our SM arbitration scheme has the following advantages: 1) It can adjust the processed data unit; 2) it changes the priority policies during runtime; and 3) it is easy to tune the arbitration scheme according to the characteristics of the target application. Hence, our arbiter is able to not only deal with the transfer-based fixed-priority, round-robin, and dynamic-priority arbitration schemes but also manage the transaction-based fixed-priority, round-robin, and dynamic-priority arbitration schemes. Furthermore, our arbiter provides the desired-transfer-length-based fixed-priority, round-robin, and dynamic-priority arbitration schemes. In addition, the proposed SM arbiter selects one of the nine possible arbitration schemes based on the priority-level notifications and the desired transfer length from the masters to ensure that the arbitration leads to the maximum performance.

1063-8210/$26.00 © 2009 IEEE

Fig. 1. Overall structure of the ML-AHB busmatrix of ARM [3].

In Section II, we briefly explain the arbitration schemes for the ML-AHB busmatrix of ARM, while Section III describes an implementation method for our flexible arbiter based upon the SM arbitration scheme for the ML-AHB busmatrix. We then present experimental results in Section IV and concluding remarks in Section V.

II. ARBITRATION SCHEMES FOR THE ML-AHB BUSMATRIX OF ARM

The ML-AHB busmatrix of ARM consists of the input stage, decoder, and output stage, including an arbiter [3]. Fig. 1 shows the overall structure of the ML-AHB busmatrix of ARM.

The input stage is responsible for holding the address and control information when a transfer to a slave is not able to commence immediately. The decoder determines which slave a transfer is destined for. The output stage is used to select which of the various master input ports is routed to the slave. Each output stage has an arbiter. The arbiter determines which input stage has to perform a transfer to the slave and decides which input stage currently has the highest priority. The ML-AHB busmatrix employs slave-side arbitration, in which the arbiters are located in front of each slave port, as shown in Fig. 1; the master simply starts a transaction and waits for the slave response to proceed to the next transfer. Therefore, the unit of arbitration can be a transaction or a transfer. However, the ML-AHB busmatrix of ARM furnishes only transfer-based arbitration schemes, specifically transfer-based fixed-priority and round-robin arbitration schemes. The transfer-based fixed-priority (round-robin) arbiter multiplexes the data transfer based on a single transfer in a fixed-priority (round-robin) fashion.

III. SM ARBITRATION SCHEME FOR THE ML-AHB BUSMATRIX

We assume that the masters can change their priority level and can issue the desired transfer length to the arbiters in order to implement an SM arbitration scheme. This assumption should be valid because the system developer generally recognizes the features of the target applications [15]. For example, some masters in embedded systems are required to complete their jobs within given timing constraints so that the system-level timing constraints are satisfied. The computation time of each master is predictable, but it is not easy to foresee the data transfer time since the on-chip bus is usually shared by several masters. Previous works addressed this issue by minimizing the latencies of several latency-critical masters, but a side effect of these methods is that they can increase the latencies of other masters; hence, they may violate the given timing constraints [16]. Unlike existing works, our scheme can keep the latency close to its given constraint by adjusting the priority level and transfer length of the masters. Fig. 2 shows an example.

Fig. 2. Arbitration scheme examples in an embedded system. (a) Arbitration scheme with no consideration of the latency constraint. (b) Arbitration scheme minimizing latency. (c) SM arbitration scheme.

In this example, the service latencies (latency-limit times) of M1, M2, and M3 are 4, 8, and 2 cycles (T14, T8, and T10), respectively. The requests of the three masters are all initiated at T0, and M3 is the most latency-sensitive master. Fig. 2(a) shows an arbitration scheme that does not use latency constraints for arbitration. Therefore, M2 and M3 violate the latency constraint as the masters are selected in ascending order. Only M1 meets the constraint. Fig. 2(b) shows the scheduling of a typical latency-minimizing arbiter. It minimizes the latency of the most latency-sensitive module, namely, M3, causing M2 to violate its constraint. Although neither of these two arbitration schemes can meet the latency constraints for all three masters, in the SM arbitration shown in Fig. 2(c), all masters use the bus with no violations by configuring the priority levels (transfer lengths) of M1, M2, and M3 as the lowest, highest, and intermediate priorities (4, 8, and 2), respectively.
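The arithmetic behind this example can be checked with a minimal sketch. It assumes that each master is served back to back without preemption starting at T0, one transfer per cycle; the function names are hypothetical, and the service order after M3 in case (b) is an assumption, since the text only states that M3 is served first.

```python
# Hypothetical model of the Fig. 2 example (not the authors' code).
LENGTH = {"M1": 4, "M2": 8, "M3": 2}    # service latency in cycles (from the text)
LIMIT = {"M1": 14, "M2": 8, "M3": 10}   # latency-limit times T14, T8, T10 (from the text)


def finish_times(order):
    """Return the cycle at which each master completes for a given service order."""
    now, done = 0, {}
    for master in order:
        now += LENGTH[master]
        done[master] = now
    return done


def violations(order):
    """Return the sorted list of masters that finish after their latency limit."""
    done = finish_times(order)
    return sorted(m for m in order if done[m] > LIMIT[m])


case_a = violations(["M1", "M2", "M3"])  # (a) masters selected in ascending order
case_b = violations(["M3", "M1", "M2"])  # (b) M3 first; remaining order is assumed
case_c = violations(["M2", "M3", "M1"])  # (c) SM: M2 highest, M3 intermediate, M1 lowest
print(case_a, case_b, case_c)
```

Under these assumptions, case (a) leaves M2 and M3 late, case (b) leaves M2 late, and case (c) meets every limit exactly, which agrees with the description of Fig. 2.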

We use part of a 32-b address bus of the masters to inform the arbiters of the priority level and the desired transfer length of the masters. Fig. 3 shows the decoding information for our address bus.

Fig. 3. Decoding information of the 32-b address bus.

Fig. 4. Internal structure of our arbiter.

In Fig. 3, S_Number indicates the target slave number, P_Level means the priority level of a master, T_Length denotes the desired transfer length of a master, and Offset_Add specifies the internal address of the target slave. Each of S_Number and P_Level consists of 3 b because the maximum number of master–slave sets is 8 × 8 [3]. Also, T_Length is composed of 4 b because the maximum burst length is 16 [3]. Although we used 7 b for P_Level and T_Length in the 32-b address bus to notify the arbiters of the priority level and the desired transfer length of a master, we consider the remaining bits adequate to express the internal address of a slave because the 22 b left for Offset_Add give it a range from 0 to 2^22 − 1. Under the aforementioned assumption, the priority level and transfer length can then be changed by the SM demand of each master.
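The field split described above can be illustrated with a short sketch. The field widths come from the text; the bit positions (S_Number in the most significant bits, followed by P_Level, T_Length, and Offset_Add) are an assumption made for illustration, since Fig. 3 is not reproduced here, and the function names are hypothetical.

```python
# Field widths are taken from the text; the bit layout is an ASSUMPTION.
S_NUMBER_BITS, P_LEVEL_BITS, T_LENGTH_BITS = 3, 3, 4
OFFSET_BITS = 32 - S_NUMBER_BITS - P_LEVEL_BITS - T_LENGTH_BITS  # 22 b remain


def encode_address(s_number, p_level, t_length, offset_add):
    """Pack the four fields into a 32-b address (assumed order, MSB to LSB:
    S_Number | P_Level | T_Length | Offset_Add)."""
    assert 0 <= s_number < (1 << S_NUMBER_BITS)
    assert 0 <= p_level < (1 << P_LEVEL_BITS)
    assert 0 <= t_length < (1 << T_LENGTH_BITS)
    assert 0 <= offset_add < (1 << OFFSET_BITS)
    word = s_number
    word = (word << P_LEVEL_BITS) | p_level
    word = (word << T_LENGTH_BITS) | t_length
    word = (word << OFFSET_BITS) | offset_add
    return word


def decode_address(haddr):
    """Split a 32-b address into (s_number, p_level, t_length, offset_add)."""
    offset_add = haddr & ((1 << OFFSET_BITS) - 1)
    haddr >>= OFFSET_BITS
    t_length = haddr & ((1 << T_LENGTH_BITS) - 1)
    haddr >>= T_LENGTH_BITS
    p_level = haddr & ((1 << P_LEVEL_BITS) - 1)
    haddr >>= P_LEVEL_BITS
    s_number = haddr & ((1 << S_NUMBER_BITS) - 1)
    return s_number, p_level, t_length, offset_add


example = encode_address(s_number=1, p_level=2, t_length=8, offset_add=0x100)
print(hex(example), decode_address(example))
```

The sketch returns the raw T_Length field value; how a 4-b field maps onto burst lengths up to 16 is not specified in the text, so no interpretation is applied.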

Fig. 4 shows the internal structure of our arbiter based upon the SM arbitration scheme.

In Fig. 4, the NoPort signal means that none of the masters must be selected and that the address and control signals to the shared slave must be driven to an inactive state, while Master No. indicates the currently selected master number generated by the controller for the SM arbitration scheme. Overall, our arbiter consists of an RR block, a P block, two multiplexers, a counter, a controller, and two flip-flops. MUX_1 and MUX_2 are used to select the arbitration scheme and the desired transfer length of a master, respectively. The counter tracks the transfer length, and the two flip-flops are inserted to keep the arbitration logic off the critical path. The RR block (P block) performs the round-robin-based (priority-based) arbitration scheme. Fig. 5 shows the internal process of the RR block. Initially, we create the up- and down-mask vectors (Up_Mask and Dn_Mask) based on the number of the currently selected master, as shown in Fig. 5. We then generate the up- and down-masked vectors through a bitwise AND operation between each mask vector and the requested master vector. After generating the up- and down-masked vectors, we examine whether each masked vector is zero. If the up-masked vector is zero, the down-masked vector is passed as the input parameter of the round-robin function; if it is not zero, the up-masked vector is passed instead. A master for the next transfer is chosen by the round-robin function, and the current master is updated after 1 clock cycle. The RR block operates by repeating the arbitration procedure shown in Fig. 5.

Fig. 5. Internal process of the RR block.

Fig. 6. VHDL code of the round-robin function.

Fig. 6 shows the VHDL code of the round-robin function at the behavioral level.

In Fig. 6, a master for the next transfer is selected through the for-statements in line 6, with the priority level of the least significant bit in Masked_Vector being the highest. If we modify the range of Masked_Vector in line 6 to “0 to Masked_Vector’left,” then the priority level of the most significant bit in Masked_Vector becomes the highest.
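The masking procedure of the RR block and the scan order of the round-robin function can be sketched in Python as follows. This is a behavioral illustration, not a transcription of the VHDL in Fig. 6. It assumes that Up_Mask covers the masters with a higher index than the current one and that Dn_Mask covers the current master and those below it; the function names are hypothetical.

```python
def round_robin(masked_vector, n_masters):
    """Return the index of the lowest set bit (the LSB has the highest priority),
    or None when no bit is set."""
    for i in range(n_masters):
        if (masked_vector >> i) & 1:
            return i
    return None


def rr_block(requests, current, n_masters):
    """Choose the next master from the request bit vector and the current master.

    ASSUMPTION: Up_Mask selects masters above `current`; Dn_Mask selects
    `current` and the masters below it.
    """
    all_ones = (1 << n_masters) - 1
    up_mask = all_ones & ~((1 << (current + 1)) - 1)
    dn_mask = all_ones & ~up_mask
    up_masked = requests & up_mask
    dn_masked = requests & dn_mask
    # If no master above the current one is requesting, wrap around.
    chosen_vector = up_masked if up_masked != 0 else dn_masked
    return round_robin(chosen_vector, n_masters)


# With all four masters requesting, the grant rotates 0 -> 1 -> 2 -> 3 -> 0.
grants, current = [], 0
for _ in range(4):
    current = rr_block(0b1111, current, 4)
    grants.append(current)
print(grants)
```

Scanning from the least significant bit of the masked vector is what makes the next-higher requesting master win; the wrap-around through the down-masked vector gives the rotation.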

Fig. 7 shows the internal procedure of the P block. First, we create the highest priority vector (V) through the round-robin function of Fig. 6. After generating the highest priority vector (V), the priority-level vectors and the highest priority vector (V) are passed as the input parameters of the priority function. The master with the highest priority is chosen by the priority function, while the current master is updated after 1 clock cycle.

Fig. 8 shows the VHDL code of the priority function at the behavioral level.

In Fig. 8, the master with the highest priority is selected through the for-statements in line 7.


Fig. 7. Internal procedure of the P block.

Fig. 8. VHDL code of the priority function.

A controller compares the priority levels of the requesting masters. If the masters have equal priorities, the controller selects the round-robin arbitration scheme (RR block); in other cases, it chooses the priority arbitration scheme (P block). The controller also makes the final decision on the master for the next transfer based on the transfer length of the selected master. The control process consists of the following three steps.

1) If HMASTLOCK is asserted, the same master remains selected.

2) If HMASTLOCK is not asserted and the currently selected master does not exist, the following hold.

a) If no master is requesting access, the NoPort signal is asserted.

b) Otherwise, a new master for the next transfer is initially selected. If the masters have equal priorities, the round-robin arbitration scheme is selected; otherwise, the priority arbitration scheme is chosen. In addition, the counter is updated based on the transfer length of the selected master.

3) If none of the previous statements applies, the following hold.

a) If the counter is expired, the following hold.

i) If the requesting masters do not exist, the NoPort signal is updated based on the HSEL signal of the currently selected master. If the HSEL signal is “1,” the same master remains selected, and the NoPort signal is deasserted. Otherwise, the NoPort signal is asserted.

ii) Otherwise, a master for the next transfer is selected based on the priority levels of the requesting masters. Also, the counter is updated.

b) If the counter is not expired, and the HSEL signal of the current master is “1,” the same master remains selected, and the counter is decreased.

c) If the currently selected master completes a transaction before the counter is expired, the following hold.

i) If the requesting masters do not exist, the NoPort signal is asserted.

ii) Otherwise, a master for the next transfer is chosen based on the priority levels of the requesting masters, and the counter is updated.
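The three control steps above can be sketched as a small behavioral model. This is an illustration under stated assumptions rather than the authors' RTL: the counter is assumed to hold the number of transfers left before re-arbitration (loaded with the transfer length minus one and expired at zero), `reqs` is assumed to contain only requests from masters other than the current one, a smaller level number is assumed to mean a higher priority, ties are broken toward the lowest index, and the class and parameter names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Request = Tuple[int, int]  # (p_level, t_length) as decoded from the address bus


@dataclass
class Controller:
    n_masters: int
    current: Optional[int] = None  # None models an asserted NoPort
    counter: int = 0               # transfers left before re-arbitration (assumed meaning)

    def _pick(self, reqs: Dict[int, Request]) -> int:
        levels = {p_level for p_level, _ in reqs.values()}
        if len(levels) == 1:
            # Equal priorities: round-robin, scanning upward from the current master.
            start = -1 if self.current is None else self.current
            for k in range(1, self.n_masters + 1):
                candidate = (start + k) % self.n_masters
                if candidate in reqs:
                    return candidate
        # Different priorities: smallest level number wins (assumed tie-break: lowest index).
        return min(reqs, key=lambda m: (reqs[m][0], m))

    def _grant(self, reqs: Dict[int, Request]) -> int:
        self.current = self._pick(reqs)
        self.counter = reqs[self.current][1] - 1  # assumed counter encoding
        return self.current

    def step(self, reqs, hmastlock=False, hsel=True, burst_done=False):
        """Run one arbitration cycle and return the selected master (None = NoPort)."""
        if hmastlock:                            # step 1
            return self.current
        if self.current is None:                 # step 2
            return self._grant(reqs) if reqs else None
        if self.counter == 0 or burst_done:      # steps 3a and 3c
            if reqs:
                return self._grant(reqs)
            if self.counter == 0 and hsel:       # step 3a-i: keep the current master
                return self.current
            self.current = None                  # NoPort asserted
            return None
        if hsel:                                 # step 3b
            self.counter -= 1
        return self.current                      # case not covered by the text: hold
```

For example, with two masters of equal priority requesting lengths 2 and 8, the model grants master 0 for two cycles and then switches to master 1, because the counter has expired while another request is pending.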

The SM arbitration scheme is achieved through iteration of the aforementioned steps. Combining the priority level and the desired transfer length of the masters allows our arbiter to handle the transfer-based fixed-priority, round-robin, and dynamic-priority arbitration schemes (abbreviated as the FT, RT, and DT arbitration schemes, respectively), as well as the transaction-based fixed-priority, round-robin, and dynamic-priority arbitration schemes (abbreviated as the FR, RR, and DR arbitration schemes, respectively). Moreover, our arbiter can also deal with the desired-transfer-length-based fixed-priority, round-robin, and dynamic-priority arbitration schemes (abbreviated as the FL, RL, and DL arbitration schemes, respectively). The transfer-based (transaction-based) arbiter switches the data transfer based upon a single transfer (burst transaction), and the desired-transfer-length-based arbiter multiplexes the data transfer based on the transfer length assigned by the masters.

Fig. 9 shows the configurations for the fixed-priority arbitration schemes.

In this figure, the smaller the priority level number, the higher the priority level. In the fixed-priority arbitration schemes, each master has a static priority. In transfer-based arbitration, however, the transfer length is allocated as 1, indicating a single transfer; in transaction-based arbitration, the transfer length is given by the HBURST signal, which refers to the transaction type. In addition, the transfer length for the desired-transfer-length-based arbitration is allotted according to the demand of each master. The arbitration results of Fig. 9 are as follows (“#” indicates the transfer number).

1) FT arbitration scheme: M2(#0), M2(#1), M2(#2), M1(#0), M1(#1), M1(#2), M1(#3), M1(#4), M0(#0), M0(#1), M0(#2), M0(#3), M0(#4), M0(#5), M0(#6), M0(#7), M1(#5), M1(#6), M1(#7), M2(#3), M2(#4), M2(#5), M2(#6), M2(#7), M3(#0), M3(#1), M3(#2), M3(#3), M3(#4), M3(#5), M3(#6), M3(#7).

2) FR arbitration scheme: M2(#0), M2(#1), M2(#2), M2(#3), M2(#4), M2(#5), M2(#6), M2(#7), M0(#0), M0(#1), M0(#2), M0(#3), M0(#4), M0(#5), M0(#6), M0(#7), M1(#0), M1(#1), M1(#2), M1(#3), M1(#4), M1(#5), M1(#6), M1(#7), M3(#0), M3(#1), M3(#2), M3(#3), M3(#4), M3(#5), M3(#6), M3(#7).

3) FL arbitration scheme: M2(#0), M2(#1), M2(#2), M2(#3), M2(#4), M2(#5), M2(#6), M2(#7), M0(#0), M0(#1), M0(#2), M0(#3), M0(#4), M0(#5), M0(#6), M0(#7), M1(#0), M1(#1), M1(#2), M1(#3), M1(#4), M1(#5), M1(#6), M1(#7), M3(#0), M3(#1), M3(#2), M3(#3), M3(#4), M3(#5), M3(#6), M3(#7).

Fig. 9. Configurations for the fixed-priority arbitration schemes.

In this case, the result of transaction-based arbitration is equal to that of desired-transfer-length-based arbitration because the priority levels of all the masters are fixed.

Fig. 10 shows the combinations for the round-robin arbitration schemes.

In these schemes, the masters have equal priorities, with the transfer length being assigned as 1 in transfer-based arbitration and 8 in transaction-based arbitration. Also, in desired-transfer-length-based arbitration, the transfer length is assigned according to the demand of each master. The arbitration results of Fig. 10 are as follows.

1) RT arbitration scheme: M0(#0), M1(#0), M2(#0), M3(#0), M0(#1), M1(#1), M2(#1), M3(#1), M0(#2), M1(#2), M2(#2), M3(#2), M0(#3), M1(#3), M2(#3), M3(#3), M0(#4), M1(#4), M2(#4), M3(#4), M0(#5), M1(#5), M2(#5), M3(#5), M0(#6), M1(#6), M2(#6), M3(#6), M0(#7), M1(#7), M2(#7), M3(#7).

2) RR arbitration scheme: M0(#0), M0(#1), M0(#2), M0(#3), M0(#4), M0(#5), M0(#6), M0(#7), M1(#0), M1(#1), M1(#2), M1(#3), M1(#4), M1(#5), M1(#6), M1(#7), M2(#0), M2(#1), M2(#2), M2(#3), M2(#4), M2(#5), M2(#6), M2(#7), M3(#0), M3(#1), M3(#2), M3(#3), M3(#4), M3(#5), M3(#6), M3(#7).

3) RL arbitration scheme: M0(#0), M0(#1), M1(#0), M1(#1), M1(#2), M1(#3), M1(#4), M1(#5), M1(#6), M1(#7), M2(#0), M2(#1), M2(#2), M2(#3), M2(#4), M2(#5), M3(#0), M3(#1), M3(#2), M3(#3), M0(#2), M0(#3), M2(#6), M2(#7), M3(#4), M3(#5), M3(#6), M3(#7), M0(#4), M0(#5), M0(#6), M0(#7).
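The three round-robin results listed above can be reproduced with a short simulation. The sketch assumes that all four masters request one 8-beat transaction at the start and that the arbiter moves to the next master in index order after the assigned transfer length. The per-master lengths 2, 8, 6, and 4 used for the RL case are not taken from the text; they are values inferred here because they reproduce the listed RL sequence. Function names are hypothetical.

```python
def simulate_round_robin(transfer_lengths, burst=8):
    """Return the grant sequence as (master, transfer_number) pairs.

    ASSUMPTIONS: every master requests one `burst`-beat transaction at the
    start, and re-arbitration happens after `transfer_lengths[m]` transfers.
    """
    n = len(transfer_lengths)
    done = [0] * n          # transfers completed per master
    sequence = []
    master = 0
    while any(d < burst for d in done):
        if done[master] < burst:
            grant = min(transfer_lengths[master], burst - done[master])
            for _ in range(grant):
                sequence.append((master, done[master]))
                done[master] += 1
        master = (master + 1) % n  # rotate to the next master
    return sequence


def show(sequence):
    """Format a grant sequence in the notation used in the text."""
    return ", ".join(f"M{m}(#{t})" for m, t in sequence)


rt = simulate_round_robin([1, 1, 1, 1])   # transfer-based
rr = simulate_round_robin([8, 8, 8, 8])   # transaction-based
rl = simulate_round_robin([2, 8, 6, 4])   # desired lengths: inferred, not from the text
print(show(rl))
```

The RL run shows why a master with a short desired length, such as M0 here, finishes last: it yields the bus after every two transfers while the others keep longer turns.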

Fig. 11 shows the configurations for the dynamic-priority arbitration schemes. In the dynamic-priority arbitration schemes, the priority of the masters can be changed by the SM demand of each master. Furthermore, the transfer length is assigned as 1 in transfer-based arbitration and 4 in transaction-based arbitration. Also, the transfer length for desired-transfer-length-based arbitration is assigned, as shown in Fig. 11. The arbitration results of Fig. 11 are as follows.

1) DT arbitration scheme: M2(#0), M3(#0), M3(#1), M3(#2), M3(#3), M1(#0), M1(#1), M1(#2), M1(#3), M0(#0), M0(#1), M0(#2), M0(#3), M2(#1), M2(#2), M2(#3), M3(#0), M3(#1), M0(#0), M0(#1), M0(#2), M2(#0), M2(#1), M2(#2), M2(#3), M0(#3), M1(#0), M1(#1), M1(#2), M1(#3), M3(#2), M3(#3).

2) DR arbitration scheme: M2(#0), M2(#1), M2(#2), M2(#3), M3(#0), M3(#1), M3(#2), M3(#3), M1(#0), M1(#1), M1(#2), M1(#3), M0(#0), M0(#1), M0(#2), M0(#3), M3(#0), M3(#1), M3(#2), M3(#3), M0(#0), M0(#1), M0(#2), M0(#3), M2(#0), M2(#1), M2(#2), M2(#3), M1(#0), M1(#1), M1(#2), M1(#3).

3) DL arbitration scheme: M2(#0), M2(#1), M2(#2), M3(#0), M3(#1), M3(#2), M3(#3), M1(#0), M1(#1), M1(#2), M1(#3), M0(#0), M0(#1), M0(#2), M0(#3), M2(#3), M3(#0), M3(#1), M0(#0), M0(#1), M0(#2), M0(#3), M2(#0), M2(#1), M2(#2), M2(#3), M1(#0), M1(#1), M1(#2), M1(#3), M3(#2), M3(#3).

Fig. 10. Configurations for the round-robin arbitration schemes.

IV. IMPLEMENTATION RESULTS AND PERFORMANCE ANALYSIS

A. Implementation Results

We implemented different slave-side arbitration schemes for the ML-AHB busmatrix. Each arbitration-scheme-based busmatrix was implemented with synthesizable RTL VHDL targeting a XILINX FPGA (XC2VP100-6ff1704). The XILINX design tool (ISE 7.1i) was used to measure the total area. The implemented arbitration schemes are as follows:

• FT, FR, RT, RR, DT, DR, and SM arbitration schemes.

The ML-AHB busmatrix of ARM provides only two arbitration schemes: FT and RT arbitration schemes. Thus, we compared the FT- and RT-based busmatrixes of ARM with our corresponding busmatrixes in terms of area overhead to show the credibility of our implementation. Fig. 12 shows the comparison results. The total areas of our FT- and RT-based busmatrixes decreased by 21% and 13% on average, respectively, compared with the FT- and RT-based busmatrixes of ARM. One reason is that we adapted the bit masking mechanism [17] to our busmatrixes to reduce the area of the arbiter, while ARM used multiple priority encoders, a multiplexer, and a demultiplexer to implement the arbiters of the busmatrixes.

Table I charts the synthesis results of our ML-AHB busmatrixes with the different arbitration schemes.

It is apparent that the total area of the SM-based busmatrix is 9%–25% larger than those of the other busmatrixes. This may be due to our SM-based busmatrix also requiring the comparator to compare the priority of the masters and the counters to calculate the transfer length. Although our SM-based busmatrix occupies more area than the other busmatrixes, our arbiter is able to deal with varied arbitration schemes such as the FT, FR, RT, RR, DT, and DR arbitration schemes.

B. Performance Analysis

We utilized a ModelSim II simulator to measure the performance of the ML-AHB busmatrixes with the different arbitration schemes and demonstrate the efficiency of our flexible SM arbitration scheme.


Fig. 11. Configurations for the dynamic-priority arbitration schemes.

1) Simulation Environments: Fig. 13 shows our simulation environment.

In our simulation environment, the clock frequency of all components is 100 MHz (10 ns). The implemented ML-AHB busmatrix has a 32-b address bus, a 32-b write data bus, a 32-b read data bus, a 15-b control bus, and a 3-b response bus. The simulation environment consists of an implemented part and a virtual part. The former corresponds to the ML-AHB busmatrixes with different arbitration schemes and consists of four masters and two slaves. Specifically, we considered only two target slaves, a configuration in which conflicts happen frequently when the masters access them, in order to focus the performance analysis on the arbitration schemes of each busmatrix. The virtual part is composed of AHB masters and AHB slaves. The AHB master generates the transactions, with the transactions of all masters having the same length, namely, an 8-beat incrementing burst type. The AHB slave responds to the transfers of the masters. Both the AHB masters and slaves are fully compatible with the AMBA AHB protocol [3]. For a more realistic model of an SoC design, we modeled the AHB masters after the features of the processor and DMA with VHDL at the behavioral level. For the AHB slaves, we used the real SRAM, SDRAM, and SDRAM controller RTL models used in many applications. We also constructed the protocol checker and performance monitor modules with VHDL and the foreign language interface (FLI C module) to ensure the reliability of our performance simulations.

Prior to the simulation, the workloads should be determined as they affect the simulation results. However, determining the appropriate workloads of real applications is difficult because these can only be obtained when all applications with real input data are specifically modeled. Instead, the workloads for performance simulation are obtained through synthetic workload generation [18] with the following parameters.

1) The distribution of transactions. This indicates what proportion of the total transactions each master is responsible for.

2) The ratio of the nonbus transaction time to the total transaction time per AHB master, where the total transaction time consists of a nonbus transaction (internal transaction of the master) time and a bus transaction (external transaction of the master through the busmatrix) time.

3) The latency time of the accessed slave by each master.

These parameters determine the delay of components in the virtual part. Through synthetic workload generation, various possible situations are investigated, where the ML-AHB busmatrixes with each arbitration scheme can be utilized well. In this regard, we found three useful categories of experiments to identify the effects of the following factors:

1) job length of the masters;

2) latency time of the slaves;

3) both the job length of the masters and the latency time of the slaves.


Fig. 12. Comparison of our busmatrixes with those of ARM in total area.

TABLE I. SYNTHESIS RESULTS OF THE ML-AHB BUSMATRIXES WITH DIFFERENT ARBITRATION SCHEMES (NUMBER OF FPGA SLICES)

The dynamic-priority-based arbitration scheme has the advantage for throughput when there are few masters with long job lengths in a system; in other cases, the round-robin-based arbitration scheme can get higher throughput than other arbitration schemes [19]. In addition, the arbitration scheme with transaction-based multiplexing performs better than the same arbitration scheme with single-transfer-based switching in applications with frequent access to long-latency slaves such as SDRAM [19].

The slave for the first category is the SRAM-type AHB slave (AHB slave0 in Fig. 13) without latency for access, while the slave for the second category is the SDRAM-type AHB slave (AHB slave1 in Fig. 13) with a long latency time for access. The slave for the third category can be an AHB slave0 or an AHB slave1. In particular, the target addresses are generated based on the uniform distribution random number function between AHB slave0 and AHB slave1. Therefore, each master communicates with the slaves with the same probability in the third category. Tables II and III tabulate the simulation parameters.

TABLE II. DISTRIBUTION OF TRANSACTIONS: TOTAL TRANSACTION/NONBUS TRANSACTION

Notes for Table II: 1) In Table II, the transactions of the masters with long job lengths are generally bus transactions. 2) In this case, the masters follow the characteristics of the DMA master in that DMA is used for data transmissions through buses in many applications.

TABLE III. DISTRIBUTION OF TRANSACTIONS: TOTAL TRANSACTION/NONBUS TRANSACTION

Notes for Table III: 1) In Table III, the transactions of the masters with long job lengths are mostly nonbus transactions. In other words, the majority of the masters have internal jobs. 2) In this case, the masters are processor-type AHB masters in the sense that the processor usually has many internal jobs for calculation compared with external jobs of setting the control registers for the slave module.

We performed a number of performance simulations at various job lengths and observed no difference in the results once a specific job length was reached. That job length was 4800, so we fixed the job length for the performance analysis at 4800. In addition, this job length exhibits the features of each arbitration scheme very well.

2) Simulation Results: Fig. 14 shows the simulation results of the first category. In this paper, throughput is defined as

$$\text{Throughput} = \frac{N_{\text{trans}} \times N_{\text{transfer}} \times W_{\text{data}}}{T_{\text{total}}}$$

where $N_{\text{trans}}$ is the total number of transactions, $N_{\text{transfer}}$ indicates the number of transfers per transaction, $W_{\text{data}}$ denotes the data bit width, and $T_{\text{total}}$ means the completion time of the data transmission. Note that $N_{\text{trans}}$, $N_{\text{transfer}}$, and $W_{\text{data}}$ are all fixed in the three categories because the job length is fixed as 4800. However, the simulation results are different from each other since the distribution of transactions (total transaction/nonbus transaction) is different from each other, as shown in Tables II and III. In addition, $T_{\text{total}}$ (total transaction time) consists of internal- and external-job times. For example, consider a master whose job schedule contains three jobs: Job0, Job1, and Job2.

In this job schedule, Job0 is performed regardless of the arbitration scheme because an internal job does not use the busmatrix. However, Job2 can be performed only after completion of Job1, and Job1 is strongly related to the busmatrix arbitration scheme since Job1 is an external job. In other words, there is a close dependence between internal and external jobs. The external-job time is a critical factor that decides $T_{\text{total}}$, which is defined as the maximum completion time over all masters. We employ the aforementioned scheme for more realistic experimentation.

Fig. 13. Simulation environment for performance analysis.
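The throughput definition can be made concrete with a small numerical sketch. All numbers below are hypothetical and chosen only to show the arithmetic; the function and parameter names are likewise illustrative. The completion time is taken as the maximum over the per-master completion times, as described above.

```python
def completion_time(per_master_times_s):
    """Total completion time: the slowest master determines it."""
    return max(per_master_times_s)


def throughput_bps(n_transactions, transfers_per_transaction,
                   data_width_bits, completion_time_s):
    """Throughput in bits per second: total bits moved divided by completion time."""
    total_bits = n_transactions * transfers_per_transaction * data_width_bits
    return total_bits / completion_time_s


# HYPOTHETICAL example: 600 transactions of 8 transfers on a 32-b data bus,
# with four masters finishing at different times (in seconds).
times = [20e-6, 25e-6, 18e-6, 22e-6]
t_total = completion_time(times)
rate = throughput_bps(600, 8, 32, t_total)
print(f"{rate / 1e9:.3f} Gb/s")
```

Because the numerator is fixed once the job length is fixed, any difference in throughput between arbitration schemes in this model comes entirely from the completion time of the slowest master.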

Based on the performance simulations of the first category,we observed that the overall system performance dependson the number of masters with a long job length and theprocessed data unit by the arbiter, regardless of the masters’transaction type (bus or nonbus transaction). In type1 and type5cases, where only one of the masters had a long job length,the SM-based busmatrix had the highest throughput. This isbecause master0, which had the longest job length, issuedthe highest priority level together with the desired transferlength to the arbiter. The arbiter, in turn, processed the datatransfer, focusing on the demands of master0. Accordingly,master0 could finish the transactions more rapidly. Althoughthe transactions of the other masters were somewhat delayed,the total transaction end time of the SM-based busmatrix wasthe shortest in type1 and type5. In particular, master0 could

Fig. 14. Simulation results for the first category. (a) Simulation result for Table II. (b) Simulation result for Table III.


Fig. 15. Simulation results for the second category. (a) Simulation result for Table II. (b) Simulation result for Table III. (c) Simulation result for Table II. (d) Simulation result for Table III. (e) Simulation result for Table II. (f) Simulation result for Table III.

TABLE IV
CONFIGURATIONS OF THE SM-BASED ARBITRATION SCHEME FOR MAXIMUM PERFORMANCE

quickly complete the internal job (nonbus transaction) in type5 due to the prompt processing of the arbiter for the master0 bus transaction. The SM-based busmatrix likewise showed the highest performance in type2 and type6 cases, where there were two masters with long job lengths. The reasons are similar to those of type1 and type5. Clearly, when there are few masters with long job lengths in a bus system, the SM-based busmatrix configured as the DL arbitration scheme shows the maximum performance. However, the SM-based busmatrix configured as the RL arbitration scheme had the highest throughput in type3, type4, type7, and type8 cases, where there were many masters with long job lengths. In other words, when there are many masters with long job lengths or the job lengths of all masters are similar or the same, the SM-based busmatrix organized as the RL arbitration scheme shows the highest performance. In most cases, the fixed-priority scheme has the lowest throughput because of starvation.
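The behavior described for the DL configuration, in which a master announces a priority level together with a desired transfer length and the arbiter serves the most urgent request first, can be sketched as follows. This is only an illustrative model. The `Request` type, the convention that a larger number means a higher priority, and the tie-break on the lowest master index are assumptions, not details of the implemented arbiter.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Request:
    master: int          # index of the requesting master
    priority: int        # assumed convention: larger value = more urgent
    desired_length: int  # number of transfers the master wants to keep the bus for


def grant_dynamic_length(requests: List[Request]) -> Optional[Tuple[int, int]]:
    """Return (master, granted transfer count) for the most urgent request.

    Ties are broken in favor of the lowest master index (an assumption).
    """
    if not requests:
        return None
    best = max(requests, key=lambda r: (r.priority, -r.master))
    return best.master, best.desired_length


pending = [Request(0, 3, 16), Request(1, 1, 4), Request(2, 3, 8)]
print(grant_dynamic_length(pending))
```

With a single long-job master that raises its priority, such a policy keeps granting that master, which matches the observation that master0 finishes sooner while the others are somewhat delayed.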

We also identified the effect of the data switching unit on the overall system performance. The data multiplexing unit can be ordered by the system throughput as follows:

desired-transfer-length-based > transaction-based > transfer-based

The desired-transfer-length-based and transfer-based arbitrations have the highest and lowest throughput, respectively, because the data multiplexing occurs in units of the desired transfer length or of a single transfer. Based on the performance simulations for the first category, we observed that the SM-based arbitration scheme improved the throughput by 14%–47% compared with other arbitration schemes. In addition, the maximum bandwidth on the busmatrix of the first category was 8.4 Gb/s, with our SM-based arbitration utilizing about 82% of the bandwidth.

Fig. 15 shows the simulation results of the second category. Using the performance simulations in this category, we identified that the latency time of the slave has an effect on the


Fig. 16. Simulation results for the third category. (a) Simulation result for Table II. (b) Simulation result for Table III. (c) Simulation result for Table II. (d) Simulation result for Table III. (e) Simulation result for Table II. (f) Simulation result for Table III.

overall system performance. Based on the simulation results of other arbitration schemes besides our SM-based one, we observed that an arbitration scheme with transaction-based multiplexing displays a higher performance than the same arbitration scheme with transfer-based switching in an application with frequent access to long-latency devices or memories such as the SDRAM. The improvements of the arbitration scheme with transaction-based rather than transfer-based multiplexing are, on average, 19%, 24%, and 27% when the latency times of the SDRAM are 1, 2, and 3 clock cycles, respectively. More specifically, the differences between the transfer- and transaction-based arbitrations are largest in the round-robin arbitration schemes because the data switching occurs in units of a single transfer in the RT arbitration scheme; furthermore, the latency increases as the amount of data multiplexing grows. In fact, the improvements of the RR arbitration scheme over the RT arbitration scheme are about 26%, 42%, and 51% when the latency times of the SDRAM are 1, 2, and 3 clock cycles, respectively.
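A toy model can show why transfer-based round robin suffers most as the slave latency grows. The sketch below is not the simulation environment used here and does not reproduce the reported percentages. It assumes a single slave that pays `latency` extra cycles every time the granted master changes, with one cycle per data transfer. The function name and parameters are hypothetical.

```python
from collections import deque
from typing import List


def round_robin_cycles(bursts_per_master: List[int], burst_len: int,
                       latency: int, unit: str) -> int:
    """Total cycles for a toy round-robin arbiter in front of one slave.

    unit="transfer":    the grant moves on after every single transfer (RT-like).
    unit="transaction": the grant moves on after a whole burst (RR-like).
    Assumption: each change of the granted master costs `latency` cycles.
    """
    if unit not in ("transfer", "transaction"):
        raise ValueError("unit must be 'transfer' or 'transaction'")
    chunk = 1 if unit == "transfer" else burst_len
    remaining = [n * burst_len for n in bursts_per_master]
    queue = deque(i for i, r in enumerate(remaining) if r > 0)
    cycles, last = 0, None
    while queue:
        m = queue.popleft()
        if m != last:            # the slave sees a new master: pay the latency
            cycles += latency
            last = m
        n = min(chunk, remaining[m])
        cycles += n              # one cycle per data transfer
        remaining[m] -= n
        if remaining[m] > 0:
            queue.append(m)      # go to the back of the round-robin queue
    return cycles


for lat in (1, 2, 3):
    rt = round_robin_cycles([2, 2], burst_len=4, latency=lat, unit="transfer")
    rr = round_robin_cycles([2, 2], burst_len=4, latency=lat, unit="transaction")
    print(lat, rt, rr)
```

In this model the transfer-based variant pays the latency once per transfer, whereas the transaction-based variant pays it once per burst, so the gap widens with every added latency cycle.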

Based on the previous results and the outcome of the first category, we can configure our SM-based arbitration scheme to obtain the maximum throughput as follows:

1) type1, type2, type5, and type6: DR arbitration scheme;
2) type3, type4, type7, and type8: RR arbitration scheme.

The performance simulations for the second category show that the SM-based arbitration scheme enhances the throughput by 26%–62% compared with other arbitration schemes. Also, the maximum bandwidth on the busmatrix of the second category is 7.12 Gb/s, while our SM-based arbitration used about 79% of the bandwidth.

By virtue of the results of the first and second categories, we can predict the optimal configurations of the SM-based arbitration scheme for the highest performance, as shown in Table IV.

For SRAM-type slaves without latency for access, the desired-transfer-length-based arbitration schemes are suitable for the highest throughput; the transaction-based arbitration schemes are appropriate for SDRAM-type slaves with a long latency time for access. In addition, when there are few masters with long job lengths in a bus system, such as in type1, type2, type5, and type6, the SM-based busmatrix configured as a dynamic-priority arbitration scheme has the maximum performance. In comparison, the SM-based busmatrix configured as the round-robin arbitration scheme obtains the highest throughput, provided that there are many masters with long job lengths or that the job lengths of all masters are similar or the same in a bus system, such as in type3, type4, type7, and type8.
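The selection rules stated above can be condensed into a small lookup, sketched here in Python. The two-letter codes follow the usage in the text (first letter: D for dynamic priority or R for round robin; second letter: L for desired transfer length or R for transaction). The function name, the string labels for the slave type, and the `few_threshold` parameter that separates "few" from "many" long-job masters are assumptions made for illustration.

```python
def select_scheme(slave_type: str, long_job_masters: int,
                  few_threshold: int = 2) -> str:
    """Pick an arbitration scheme code from the slave type and workload.

    slave_type: "sram" (no access latency) or "sdram" (long access latency).
    long_job_masters: number of masters whose job length is clearly longer
        than the rest; 0 means all job lengths are similar.
    few_threshold: hypothetical cut-off for "few" long-job masters.
    """
    if slave_type == "sram":
        mode = "L"   # desired-transfer-length-based multiplexing
    elif slave_type == "sdram":
        mode = "R"   # transaction-based multiplexing
    else:
        raise ValueError("slave_type must be 'sram' or 'sdram'")
    few = 0 < long_job_masters <= few_threshold
    policy = "D" if few else "R"  # dynamic priority vs. round robin
    return policy + mode


print(select_scheme("sram", 1))   # one long-job master, SRAM-type slave
print(select_scheme("sdram", 4))  # many long-job masters, SDRAM-type slave
```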

On the other hand, based on the performance simulations for the third category, we confirm that the configurations of Table IV have the maximum performance among the arbitration schemes. Fig. 16 shows the simulation results of the third category. These results indicate that the total throughput of the third category improves to about 67% compared to the first and second categories because the number of accessible target slaves is two and some slave operations are performed in parallel.

Based on the results of the performance simulations for the third category, we observe that the SM-based arbitration scheme configured in Table IV improves the throughput by 19%–47% compared to other arbitration schemes. Moreover, the maximum bandwidth on the busmatrix of the third category is 15.52 Gb/s, and our SM-based arbitration utilized about 69% of the bandwidth.
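For a rough comparison across the three categories, the reported maximum bandwidths and utilization figures can be multiplied to estimate the bandwidth actually used by the SM-based arbitration. The short Python sketch below does only this arithmetic. The input numbers are those quoted in the text, while the dictionary layout and function name are illustrative.

```python
# (maximum bandwidth in Gb/s, approximate utilization) per category, from the text
CATEGORIES = {
    "first": (8.4, 0.82),
    "second": (7.12, 0.79),
    "third": (15.52, 0.69),
}


def utilized_bandwidth(max_gbps: float, utilization: float) -> float:
    """Bandwidth actually used, in Gb/s."""
    return max_gbps * utilization


for name, (max_gbps, utilization) in CATEGORIES.items():
    print(f"{name}: {utilized_bandwidth(max_gbps, utilization):.2f} Gb/s")
```

Because the utilization figures are approximate ("about"), the products are estimates rather than measured values.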

V. CONCLUSION

In this paper, we proposed a flexible arbiter based on the SM arbitration scheme for the ML-AHB busmatrix. Our arbiter supports three priority policies (fixed priority, round robin, and dynamic priority) and three approaches to data multiplexing (transfer, transaction, and desired transfer length); in other words, there are nine possible arbitration schemes. In addition, the proposed SM arbiter selects one of the nine possible arbitration schemes based on the priority-level notifications and the desired transfer length from the masters to allow the arbitration to lead to the maximum performance. Experimental results show that, although the area of the proposed SM arbitration scheme is 9%–25% larger than those of other arbitration schemes, our arbiter improves the throughput by 14%–62% compared with other schemes. We therefore expect that it would be better to apply our SM arbitration scheme to an application-specific system because it is easy to tune the arbitration scheme according to the features of the target system.
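The nine schemes are simply the cross product of the three priority policies and the three multiplexing modes, which the following Python sketch enumerates. The codes RT, RR, RL, DR, and DL appear in the text; the remaining codes (FT, FR, FL, and DT) are inferred by analogy and should be read as an assumption.

```python
from itertools import product

PRIORITY_POLICIES = {
    "F": "fixed priority",
    "R": "round robin",
    "D": "dynamic priority",
}
MULTIPLEXING_MODES = {
    "T": "transfer",
    "R": "transaction",
    "L": "desired transfer length",
}


def arbitration_schemes() -> dict:
    """Map each two-letter code to its (priority policy, multiplexing mode)."""
    return {
        p + m: (policy, mode)
        for (p, policy), (m, mode) in product(
            PRIORITY_POLICIES.items(), MULTIPLEXING_MODES.items()
        )
    }


for code, (policy, mode) in arbitration_schemes().items():
    print(f"{code}: {policy}, {mode}-based multiplexing")
```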

For future work, we feel that the configurations of the SM arbitration scheme with the maximum throughput need to be found automatically during runtime. We are likewise looking at the applicability of the proposed arbitration scheme to AMBA AXI (ver. 3.0).1

REFERENCES

[1] M. Drinic, D. Kirovski, S. Megerian, and M. Potkonjak, “Latency-guided on-chip bus-network design,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 25, no. 12, pp. 2663–2673, Dec. 2006.

[2] S. Y. Hwang, K. S. Jhang, H. J. Park, Y. H. Bae, and H. J. Cho, “An ameliorated design method of ML-AHB busmatrix,” ETRI J., vol. 28, no. 3, pp. 397–400, Jun. 2006.

[3] ARM, “AHB Example AMBA System,” 2001 [Online]. Available: http://www.arm.com/products/solutions/AMBA_Spec.html

[4] IBM, New York, “32-bit Processor Local Bus Architecture Specification,” 2001.

[5] R. Usselmann, “WISHBONE interconnect matrix IP core,” OpenCores, 2002. [Online]. Available: http://www.opencores.org/?do=project=wb_conmax

[6] N.-J. Kim and H.-J. Lee, “Design of AMBA wrappers for multiple-clock operations,” in Proc. Int. Conf. ICCCAS, Jun. 2004, vol. 2, pp. 1438–1442.

[7] D. Flynn, “AMBA: Enabling reusable on-chip designs,” IEEE Micro, vol. 17, no. 4, pp. 20–27, Jul./Aug. 1997.

1 http://www.arm.com/products/solutions/axi_spec.html, accessed Feb. 2008

[8] S. Y. Hwang, H.-J. Park, and K.-S. Jhang, “Performance analysis of slave-side arbitration schemes for the multi-layer AHB busmatrix,” J. KISS, Comput. Syst. Theory, vol. 34, no. 5, pp. 257–266, Jun. 2007.

[9] S. S. Kallakuri and A. Doboli, “Customization of arbitration policies and buffer space distribution using continuous-time Markov decision processes,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 15, no. 2, pp. 240–245, Feb. 2007.

[10] D. Seo and M. Thottethodi, “Table-lookup based crossbar arbitration for minimal-routed, 2D mesh and torus networks,” in Proc. Int. Conf. IPDPS, Mar. 2007, pp. 1–10.

[11] K. Lahiri, A. Raghunathan, and S. Dey, “Performance analysis of systems with multi-channel communication architectures,” in Proc. Int. Conf. VLSI Design, Jan. 2000, pp. 530–537.

[12] J. Turner and N. Yamanaka, “Architectural choices in large scale ATM switches,” IEICE Trans. Commun., vol. E-81B, no. 2, pp. 120–137, Feb. 1998.

[13] C. H. Pyoun, C. H. Lin, H. S. Kim, and J. W. Chong, “The efficient bus arbitration scheme in SoC environment,” in Proc. Int. Conf. SoC Real-Time Appl., Jul. 2003, pp. 311–315.

[14] K. Lahiri, A. Raghunathan, and G. Lakshminarayana, “The LOTTERYBUS on-chip communication architecture,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 14, no. 6, pp. 596–608, Jun. 2006.

[15] J. H. Han, M. Y. Lee, B. Younghwan, and C. Hanjin, “Application specific processor design for H.264 decoder with a configurable embedded processor,” ETRI J., vol. 27, no. 5, pp. 491–496, Oct. 2005.

[16] M. Jun, K. Bang, H.-J. Lee, N. Chang, and E.-Y. Chung, “Slack-based bus arbitration scheme for soft real-time constrained embedded systems,” in Proc. Int. Conf. ASP-DAC, Jan. 2007, pp. 159–164.

[17] S. Y. Hwang, H. J. Park, and K. S. Jhang, An Efficient Implementation Method of Arbiter for the ML-AHB Busmatrix. Berlin, Germany: Springer-Verlag, May 2007, vol. 4523, LNCS, pp. 229–240.

[18] E.-G. Jeong, J.-G. Lee, K.-S. Jhang, J.-A. Lee, and D. Har, “Asynchronous layered interface of multimedia SoCs for multiple outstanding transactions,” J. VLSI Signal Process. Syst., vol. 46, no. 2/3, pp. 133–151, Mar. 2007.

[19] S. Y. Hwang, H. J. Park, and K. S. Jhang, “An implementation and performance analysis of slave-side arbitration schemes for the ML-AHB busmatrix,” in Proc. Int. Conf. ACM Symp. Appl. Comput., Mar. 2007, vol. 2, pp. 1545–1551.

Soo Yun Hwang was born in Seoul, Korea, in 1976. He received the B.S. degree in computer engineering from Hannam University, Daejeon, Korea, in 2002 and the M.S. and Ph.D. degrees in computer engineering from Chungnam National University, Daejeon, in 2004 and 2008, respectively. His Ph.D. degree work concerned enhancements in the architecture and arbitration schemes of the multilayer AHB busmatrix.

He joined the Electronics and Telecommunications Research Institute, Daejeon, in 2006, where he is currently a Senior Member of the Engineering Staff working in the High-Speed User Equipment Modem Research Team, Department of Mobile Convergence Research. His research interests include CAD for VLSI, system-on-a-chip (SoC) design methodology, on-chip communication architecture in SoC, and high-speed user equipment modem designs.

Dong Soo Kang received the B.S. and M.S. degrees from the Department of Computer Engineering, Chungnam National University, Daejeon, Korea, in 2005 and 2007, respectively, where he is currently working toward the Ph.D. degree in the Digital System Laboratory.

His research interests include satellite onboard computers, memory-aware compilers and architectures, multimedia system designs, and processor architectures.


Hyeong Jun Park received the B.S. degree in electronic engineering from Hanyang University, Seoul, Korea, in 1987 and the M.S. degree in electronic engineering from Chungnam National University, Daejeon, Korea, in 2001, where he is currently working toward the Ph.D. degree.

He joined the Electronics and Telecommunications Research Institute, Daejeon, in 1987, where he is currently the Head of the High-Speed User Equipment Modem Research Team, Department of Mobile Convergence Research. His research interests include system-on-a-chip design methodology, mobile communication architecture, and high-speed user equipment modem designs.

Kyoung Son Jhang (M’89) was born in Oggu-Gun, Korea, in 1964. He received the B.S., M.S., and Ph.D. degrees in computer engineering from Seoul National University, Seoul, Korea, in 1986, 1988, and 1995, respectively.

In 1996, he joined Hannam University, Daejeon, Korea, as a Faculty Member. He then moved to Chungnam National University, Daejeon, where he is currently a Professor teaching systems programming and digital hardware design in the Department of Computer Engineering. His current major interests include fault-tolerant hardware designs, electronic design automation, and digital system design.
