Hierarchical Computing

May 06, 2017

Download

Documents

Vinit Patel
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Hierarchical Computing

Hierarchical Computing: An Architecture for Efficient Transaction Processing

Juan Rubio and Lizy K. John
Laboratory for Computer Architecture
The University of Texas at Austin
Austin, TX 78712
{jrubio,ljohn}@ece.utexas.edu

Abstract

Transaction processing workloads impose heavy demands on the memory and storage sub-systems and often result in large amounts of traffic on I/O and memory buses. In this paper, we propose to utilize processing elements distributed across the memory hierarchy, with the objective of performing the computation close to the data residence. Leveraging active memory modules and active disk devices emerging from other research groups and available in the market, we propose a hierarchical computing system in which the distributed processing elements operate concurrently and communicate using a hierarchical interconnect. Transactions are partitioned across the different layers in the hierarchy depending on the affinity of code to a particular layer or other heuristics. Commands percolate down into the lower layers of the hierarchy and preprocessed/partially processed information flows up into the higher layers. All layers ACTIVELY participate in the processing of the transaction by doing tasks for which they are particularly suited. The lower layers contain inexpensive processor units and, in conjunction with the powerful central processor and other collaborating memory and disk processors, yield high performance in a cost-effective fashion. This concept is then applied to the online transaction processing benchmark TPC-C, and schemes for code partitioning are outlined. A hierarchical computing system containing four inexpensive memory processors and 32 very inexpensive disk processors can yield speedups of up to 4.52x when compared with a traditional system.
Since transaction processing has been seen to contain fine-grained and coarse-grained parallelism, the proposed hierarchical computing paradigm, which exploits parallelism and reduces data transport requirements, seems to be a feasible model for future database servers.

Keywords: Computer Architecture, Parallel Processing, Memory Hierarchy, High Performance Computing, Database Servers.

Technical Area: Architecture


1 Introduction

It is a well-known fact that the server market is the driving factor for several of the technological advancements in the computer industry. A few years back, this market was mainly dominated by technical workloads, but during the last two decades it has changed to power a large portion of commercial operations.

One important type of application in the group of commercial workloads is Transaction Processing (TP). Transaction processing workloads are classified in two types: Online Transaction Processing (OLTP) and Decision Support Systems (DSS). OLTP systems are used to handle those operations that occur during the normal operation of a business (e.g., a client buys products, the managers check the inventory or adjust the price of an item). On the other hand, DSS systems are used to make decisions based on the data gathered by a business, which usually comes from an OLTP system (e.g., find the most popular product within a given demographic bracket, estimate the net profit of all sales in the last three months). Even when both workloads fit within the category of transaction processing, they have many differences:

- OLTP operations are of short duration, taking milliseconds to complete, whereas DSS operations take minutes.
- The same applies to the dataset of an operation. While OLTP operations usually have datasets in the order of kilobytes or megabytes, DSS operations usually access megabytes or hundreds of megabytes of data. Recent literature suggests that DSS systems will be accessing gigabytes in the next couple of years [1].
- The number of concurrent operations in an OLTP system is in the order of thousands, while DSS systems normally have fewer than a hundred concurrent operations.
- OLTP systems constantly modify the data stored in the databases (e.g., enter a sale, deliver a package). DSS systems, on the other hand, use mostly read operations during their execution.

Transaction processing systems are typically implemented using a multi-tier architecture.
The idea is to implement a functional pipeline that streams transactions from the clients to the server database in an efficient way. As can be seen from Figure 1, clients on the left are connected to an intermediate server, or Middle-Tier, through a switched network. The function of the middle-tier server is to act as a filter and reject those requests presented by the clients that are incorrectly generated. It also enforces the security in the system and serves as a parser that transforms requests formulated in one language domain (e.g., HTML) to another domain (e.g., SQL). In this example, the Middle-Tier server is implemented as a server cluster, with a front-side connection to the clients through a load-balancing switch, whose

Page 3: Hierarchical Computing

[Figure 1 diagram: clients connect through a front-side switch to the middle-tier server, which connects through a back-end switch to the back-end-tier server.]
Figure 1: Conventional System Level Architecture for a Transaction Processing System.

function is to create a distributed load on all the nodes of the cluster. Work done in this area includes locality-aware distribution algorithms like the one developed by Pai et al. [2]. The nodes in the cluster communicate with each other through a back-side network interface, which they also use to send the requests to the database server, also referred to as the Back-end-Tier server.

The final component of the system is the Back-end-Tier, which is also the focus of this paper. This server is the one that manipulates the primary data of the commercial operation (e.g., it keeps the list of clients, the orders they place, the prices of items and their quantities in the warehouses). As such, the back-end-tier has complete control over a large portion of the data, which is normally local to it and accessed using a Relational Database Management System (RDBMS, or commonly DBMS). Implementations of this server include symmetric multiprocessor systems (SMP) as well as cluster servers.

When we look at the execution behavior of commercial workloads, we observe that they are different from technical workloads and present more vigorous demands to the memory and storage sub-systems [3, 4, 5]. In fact, studies that analyzed transaction processing workloads indicate that systems spend around 90% of the time waiting for the I/O devices to access the data [6]. Once the data are brought to memory, the processor uses between 25% and 45% of the execution time handling memory accesses [7]. That results in a sub-optimal utilization of the latency-hiding features of modern dynamically scheduled processors [8].

One of the reasons for this imbalance between computation and data access traces back to the principles


of traditional memory hierarchies, where data moves from the storage sub-system to the processor before it can be processed. Although we have become accustomed to this execution model, which works well for technical and some other applications, it is far from optimal when used with a transaction processing workload. The action of moving data back and forth between the storage and the computing elements not only results in a high volume of traffic, which hurts the scalability of the system, but it also creates an artificial bottleneck by serializing the execution in an environment with enough parallelism.

This paper presents the Hierarchical Computing model as a possible solution to the problems presented above. The next section introduces the idea, giving special attention to the operation of the hardware and the communication of the devices. Section 3 covers the programming model used in the system, how we plan to partition the problem, and the use of a set of basic primitives to operate on the data. Section 4 presents a basic code partitioning scheme and applies the idea to a conventional transaction processing workload, which helps us to determine the feasibility of the model. Section 5 performs a mathematical analysis of the idea and identifies those parameters that affect the performance of this technique. Section 6 looks at other ideas proposed in the literature. Section 7 concludes with a highlight of the most significant contributions.

2 Hierarchical Computing

To address the problems presented in the previous section, most transaction processing systems exploit the coarse-grain parallelism present in the form of concurrent transactions. Before proceeding further it is important to define what a transaction is. A transaction is a sequence of operations which is executed atomically (atomic), which always maintains a consistent state in the database (consistent), whose execution is not affected by concurrent transactions (isolation) and whose effects are permanent (durable).
This set of properties, commonly referred to as the ACID properties after the first letter of each property, is the basis for transaction processing theory. Implementations of current systems use a thin layer of software known as the Transaction Manager. Its function is to enforce an order in the arrival of the transactions and the ACID properties required by the transaction model. The queries that form a transaction are then passed to the database processes which run in each one of the processors in the system. Although this approach has been relatively successful, it can cause resource management problems, manifested in the form of hot spots during the access of the database tables, thus inhibiting the exploitation of all the parallelism in the system.

The Hierarchical Computing execution model exploits the available parallelism present within a single transaction, in addition to the thread-level or coarse-level parallelism that can be exploited using other means. The idea is to distribute the computation across a computer system on behalf of a unique transaction. In the model, the communication follows a message-based approach over a hierarchical topology of


interconnects. Since this model exploits a different type of parallelism than a traditional system (intra-transaction vs. inter-transaction), its use can be orthogonal to existing methods. There is an important difference between the two models, which is seen in the way a server handles a query from a client: while in a traditional system all the computations are required to be performed by the CPU, under this model there is no need to insist on that. In fact, several operations may be efficiently performed at the data residence. This brings a set of interesting tradeoffs, which are presented and evaluated in the next sections.

To distribute computing in this way, some computing power is located in memory and some in disk, close to the location of the data, where it performs simple computations such as comparison and accumulation. The components are coupled using a hierarchical interconnect (any sort of directed acyclic graph, such as a binary tree).

[Figure 2 diagram: 1. Processor (ILP), 2. Interconnect, 3. Main Memory (multi-bank), 4. Interconnect, 5. Disk (Array), with a processing element at each interconnect.]
Figure 2: Sample topology for a Hierarchical Computing system.

Figure 2 shows the topology of a sample system based on the above ideas, in which we perform computations at five different levels. The additional points of computation are represented by shaded areas and are located in the memory banks, storage devices and interconnects between the main levels. To exemplify the operation of the system, we can introduce the analogy of a corporate office, where the employees are organized in a well-defined hierarchy, each of them with an amount of data in their close vicinity and over which they have complete ownership. As a team, they handle each of the transactions received by the office in a very distributed fashion. Even when, at a particular point in time, only some members work on the same task, each one operates on the transaction at some point. Managers at different


levels do not need to know every detail of the operations of their subordinates. Each level in the hierarchy conveys only the right amount of information to the layer above.

A system with, say, 1 main memory module and 4 disk modules needs to have only 3 levels of computation (1 in the main CPU, 1 in memory and 1 in the disks), whereas processors in the interconnect will be critical to larger hierarchical computing systems. The number of levels of interconnect processors depends on the number of disks reporting to the same memory, as well as on the interconnect network; thus for a binary tree we need a number of levels of interconnects expressed by:

    Max(log2(Number of disks) - log2(Number of memory units) - 1, 0)    (1)

The intelligence in the different levels can be realized using intelligent memory modules investigated in recent research [9, 10, 11, 12], and intelligent or active disks [13, 14, 15, 16, 17]. It is possible to find storage devices in the market with a 150 MIPS core and up to 2 MB of main memory [18, 19, 20, 21]. Ongoing effort in intelligent or active memory and disk components can thus be leveraged to implement hierarchical computing systems. If computing capability is required in the network or switches, it can be realized using chips similar to micro-controllers embedded in the switch/bus interface. It may be noted that the computing resources required in the storage devices or memory are significantly cheaper than the central processor.
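Equation 1 can be evaluated directly. The following sketch (function name ours, not from the paper) applies it to the configuration used in the evaluation, with 32 disk processors and 4 memory processors:

```python
from math import log2

def interconnect_levels(num_disks: int, num_memory_units: int) -> int:
    """Levels of interconnect processors for a binary-tree hierarchy (Equation 1)."""
    return max(int(log2(num_disks) - log2(num_memory_units) - 1), 0)

# Configuration evaluated later in the paper: 4 memory processors, 32 disk processors.
print(interconnect_levels(32, 4))   # max(5 - 2 - 1, 0) = 2
```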
Use of a powerful processor which exploits instruction-level parallelism and thread-level parallelism is favored at the root of the hierarchy.

The hierarchical mode of operation has worked relatively well in our society, and we expect it to work as well in a computer system due to the following characteristics:

- Raw data stay local to a level of the hierarchy, which gives more freedom to the upper levels to operate and hold temporary results of the operations in their fast but reduced storage space.
- It suggests a specialization of the units, which permits a system to use dynamically scheduled processors in the upper levels of the hierarchy, where complex decisions are required and control flow is hard to predict. In-order processors, or narrower power-efficient processors, can be used to handle the bulk of the data in the lower levels.

The implications of the first point affect the mode of operation of the hierarchy and its programming model and are covered in the next section. The analysis of the second point directly affects the hardware used in the system and will be studied in subsequent sections.

3 Programming Model

As mentioned in the previous section, the basic element behind the Hierarchical Computing model is the use of computation engines that sit close to the location of the data. The idea is to expedite the


movement of data from its natural point of residence to the computation unit. This computation unit can be the main processor, for tasks that require the high computation power provided by a high-frequency, dynamically scheduled processor, or it could as well be a much simpler microprocessor or microcontroller that functions as an intelligent memory or disk controller. The partitioning of the data and operations to take advantage of this new architecture is covered in Section 4. This section deals with the programming model used to support a transaction processing workload.

The heart of a transaction processing system is the database server. Nowadays, databases are based on the relational model [22]. In this model data is stored in the form of tables with a variable number of rows of a predetermined width. To access these tables, these systems use data manipulation languages (DML), of which the structured query language (SQL) is the one most frequently used. The operations supported by the SQL language include:

- Scan: locates rows that match a particular criterion. The criterion can be a single predicate (e.g., list all flights to Pittsburgh) or a complex predicate (e.g., find all vehicles in the state of Texas and registered after 1973). It is called select under some data manipulation languages. This operates on a single table.
- Join: similar to the scan operation, but operates on two or more tables (e.g., find all vehicles registered before 1966 and currently owned by individuals born after 1966).
  The two pieces of information are contained in separate tables.
- Insert: creates a new row in an existing table.
- Remove: homologous to insert, it removes a row from a table.
- Update: modifies one or more fields within one or more rows.
- Sort: not an independent type of operation, but can be applied to a scan operation.

In order to implement the functionality required by the relational model and the SQL language, and to take advantage of the concurrency provided by the Hierarchical Computing model, we have opted for a message-based dataflow model. This concept is presented in Figure 3, which shows two levels of a sample hierarchy (although the model is flexible enough to support several levels).

We assume that data is already partitioned and that it resides in the lower level (L{1,2}). The top level (T{1}) is the level that initiates the operation by issuing a command (CMD) to one or more modules in the lower level (L{1} and/or L{2}). The action of sending the command can be a broadcast or multicast (e.g., select all rows which match a criterion), or it can also be a unicast (e.g., insert row in table). The commands encompass enough information to allow the lower levels to perform the computations on behalf


of the upper level. Once the upper level sends the command to the lower level, it waits for data, which depending on the operation can be a null response, a single element or a sequence of elements. The hardware provides basic control flow signals to help the upper level handle the amount of data that might result from an operation.
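A minimal sketch of this command/response exchange, assuming hypothetical class and method names (this is an illustration of the idea, not the authors' implementation): an upper level issues a command to the modules below it and consumes whatever stream of responses comes back, from none to many.

```python
class Level:
    """One level of the hierarchy: owns some rows and may have levels below it."""
    def __init__(self, rows=None, children=None):
        self.rows = rows or []          # data resident at this level
        self.children = children or []  # levels immediately below

    def execute(self, predicate):
        """Handle a command: scan local data, then forward the command to the
        lower levels, streaming every response up (a null result is simply
        an empty stream)."""
        for row in self.rows:
            if predicate(row):
                yield row
        for child in self.children:     # forward CMD down the hierarchy
            yield from child.execute(predicate)

# Upper level T{1} broadcasts a command to two lower-level modules L{1} and L{2}:
lower = [Level(rows=[1, 4, 7]), Level(rows=[2, 5, 8])]
top = Level(children=lower)
print(list(top.execute(lambda r: r > 4)))   # -> [7, 5, 8]
```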

[Figure 3 diagram: the upper level T{1} sends a CMD to the lower level L{1,2}, whose active components operate on the local data and stream DATUM responses back up.]
Figure 3: Execution of an operation in the Hierarchical Computing model.

From the perspective of the lower level, once it receives a command from the upper level, it performs a preorder traversal starting at this level. Thus, the node proceeds to access the data over which it has control. If the data is not present in its level, it forwards the command to the level immediately under it and forwards all responses to the upper level. If there are no levels under it, a blank response is sent to the upper level.

In order to support this mechanism, both commands and responses need to be tagged with a unique identifier. This technique is similar to the tokens present in traditional dataflow machines [23]. We also tag the commands with the ID of the level that initiated them; this is done to reduce the overhead of processing the responses. Finally, the model also implements a name space locator in the form of a table allocation index. This index permits the processor in the level to locate data within its boundaries. It is also used to determine if data is not present, thus avoiding a lengthy traversal of all the data.

After studying the different SQL operations and the algorithms used in transaction processing workloads, we designed two types of operations: individual and aggregate. They are based on the execution presented in Figure 3, but differ in the way the lower level generates the results and how the upper level interprets them.

- Type 1: Individual
  Named Individual because the lower level informs the upper level of every single result. It effectively acts as an unbuffered filter. The semantics can be designed to allow the operation to return on the first


event triggered, or to continue operating until it reaches the end of the region.
  Example: search a range of data for a string.
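The two semantics mentioned for the individual primitive, return on the first match or stream every match until the end of the region, can be sketched as follows (function names are illustrative, not from the paper):

```python
def scan_all(region, needle):
    """Type 1, run-to-completion: forward every matching row as it is found."""
    return [row for row in region if needle in row]

def scan_first(region, needle):
    """Type 1, return-on-first-event: stop as soon as one row matches."""
    for row in region:
        if needle in row:
            return row
    return None   # blank response: no match in this region

region = ["order 17 pending", "order 23 shipped", "order 42 shipped"]
print(scan_all(region, "shipped"))    # -> ['order 23 shipped', 'order 42 shipped']
print(scan_first(region, "shipped"))  # -> 'order 23 shipped'
```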

[Figure 4 diagram: the command CMD is forwarded as CMD' to the active components in the lower level, and each matching element (3', 5', 7') is streamed up individually.]
Figure 4: Primitive Type 1 (Individual).

- Type 2: Aggregate
  For this primitive, the lower level accesses its associated data and finds those elements that match a particular criterion. However, it does not send all these results to the upper level. Instead it produces an aggregate number and sends it once all its data has been analyzed. In this context, an aggregate function Aggregate() is any function that produces a single number based on a set of numbers ({I0, ..., IN}). The most common aggregate functions in transaction processing workloads are: sum(), count(), average(), max(), and min(). In this case, the results returned by the different nodes in the lower level might need to be combined in the upper level to produce a unique answer. We can accomplish that if we apply the same aggregate function in the upper level to the values returned. This works for all the examples shown above, except for the average() function. For those circumstances the algorithm is changed to return both sum() and count() from the nodes in the lower levels, and the upper level performs the computation of average().
  Example: count orders placed in January 2001.

4 Code partitioning

When we introduced the programming model in Section 3, we assumed that the data was laid out correctly on disk. The distribution of code among the different levels of the hierarchy impacts performance. In this section we address the issue of how to partition a transaction in order to achieve good performance with the Hierarchical Computing model.
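Before turning to partitioning, the average() special case of the aggregate primitive described in Section 3, where each lower-level node returns both sum() and count() and the upper level combines them, can be sketched like this (illustrative code with toy data, not the authors' implementation):

```python
def node_aggregate(rows, predicate):
    """Lower level: scan local rows and return one (sum, count) pair
    instead of streaming every matching element upward."""
    matches = [r for r in rows if predicate(r)]
    return (sum(matches), len(matches))

def combine_average(partials):
    """Upper level: combine the per-node (sum, count) pairs into an average."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None

# Two disk-level nodes aggregate small order amounts (made-up data):
partials = [node_aggregate([10, 200, 30], lambda r: r < 100),
            node_aggregate([5, 400, 15], lambda r: r < 100)]
print(combine_average(partials))   # (10+30+5+15) / 4 = 15.0
```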


[Figure 5 diagram: the command CMD is forwarded as CMD' to the active components, and each node returns a single aggregate of its matches (aggr<3'> and aggr<5',7'>).]
Figure 5: Primitive Type 2 (Aggregate).

To study the data and code partitioning, we begin by looking at the TPC-C benchmark [24], a popular transaction processing workload. The TPC-C benchmark is developed by the Transaction Processing Council (TPC) and is intended to serve as a standard benchmark for Online Transaction Processing systems. The benchmark models the operation of a business with five different types of transactions (new order, payment, order status, delivery and stock level).

Table 1 shows the characteristics of the database tables used by the benchmark. The parameter W represents the number of warehouses in the benchmark, and is used to scale it to different hardware configurations. The cardinality column indicates the number of rows in a database table. The next column shows the size of a row for our implementation using IBM DB2 Universal Database [25]. The last column shows the size of the tables for a configuration with 17,500 warehouses, which is close to the highest non-clustered record for TPC-C at the time we conducted this study.

Specified as part of the TPC-C benchmark is a high-level description of each of the five different transactions. For each of these transactions, we show the query execution plans (Figures 6 to 10). Execution plans are directed graphs that represent the tables in the database, the flow of data and the operations performed over the data. The tables are represented by the circles, operations by the rectangles and the flow of data by the arcs.

Note that there is no reference to time in this representation, as the computation is driven by the arrival of data. Also, to read them correctly, they should be traversed in postorder (i.e., visit children before entering the node). So if we look at, say, Figure 8, we observe that we cannot perform the scan over table order-line before we perform the scan over table order, which itself needs the scan over table customer together with a sort operation.
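The postorder reading of an execution plan (visit children before the node itself) can be sketched as a small tree walk; the node structure and operation labels below are assumptions for illustration, loosely following the shape of the Order-Status plan in Figure 8:

```python
class PlanNode:
    """An operation in a query execution plan; children are its inputs."""
    def __init__(self, op, children=()):
        self.op = op
        self.children = list(children)

def postorder(node):
    """Yield operations children-first, the order in which the plan executes."""
    for child in node.children:
        yield from postorder(child)
    yield node.op

# Rough shape of Figure 8: the customer scan feeds a sort, which feeds the
# order scan, which feeds the order-line scan.
plan = PlanNode("scan(order-line)",
                [PlanNode("scan(order)",
                          [PlanNode("sort", [PlanNode("scan(customer)")])])])
print(list(postorder(plan)))
# -> ['scan(customer)', 'sort', 'scan(order)', 'scan(order-line)']
```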
An additional clarification is needed for the transaction shown in Figure 6, where there is a dotted box around some operations. It is used to indicate that a section of the plan is


[Figure 6 diagram: scans of Warehouse, Customer, District, Item and Stock feed updates of District and Stock and inserts into Order and OrderLine.]

Figure 6: Execution Plan for the TPC-C New-Order Transaction.

[Figure 7 diagram: scans and updates of Warehouse, District and Customer feed an insert into History.]

Figure 7: Execution Plan for the TPC-C Payment Transaction.

[Figure 8 diagram: a scan and sort of Customer feeds the scan of Order, which feeds the scan of OrderLine.]

Figure 8: Execution Plan for the TPC-C Order-Status Transaction.


[Figure 9 diagram: a scan and sort of NewOrder feeds a remove, and scans and updates of Order, OrderLine and Customer, with an aggregate over OrderLine.]

Figure 9: Execution Plan for the TPC-C Delivery Transaction.

[Figure 10 diagram: scans of District, OrderLine and Stock feed a join followed by a count.]

Figure 10: Execution Plan for the TPC-C Stock-Level Transaction.


    Table        Cardinality    Row Size (bytes)    Table Size (GB, W=17.5k)
    Warehouse    W              101                 <0.1
    District     W x 10         107                 <0.1
    Customer     W x 30k        701                 342.8
    Stock        W x 100k       330                 537.9
    Item         100k           90                  <0.1
    Order        W x 30k+       40                  19.6
    New-Order    W x 9k+        10                  1.5
    Order-Line   W x 300k+      80                  391.2
    History      W x 30k+       68                  33.3
    Total                                           1,270.4

Table 1: Dimensions of tables for the TPC-C benchmark.

In a cold system all the data is stored on disk, but as the execution progresses, we can expect tables, a subset of all the rows in a table, or indices to move to upper levels in the hierarchy. We call this process data promotion. Different from what happens in a traditional system with caches, this operation is not transparent to the software, and is controlled by the mapping algorithms.

Based on the query execution plans, we use a static cost analysis similar to the one used in some of the first query optimizers [26], where the cost (in instructions) of accessing a table is computed as a function of the dimensions of the table and the memory capacity. Given a table Table_j of size Size(Table_j) and with Rows(Table_j) rows, in a hierarchy with M levels, the process used to partition the code is shown in Figure 11. For join operations we use a second table Table_k of size Size(Table_k) and Rows(Table_k) rows. This is a simple model, and by no means should it be considered an optimal partition. Additional information used by the algorithm might include the notion of processing affinity [27], where a module of computation is sent to the component which will execute it in the minimum amount of time.

5 Evaluation

To evaluate the potential of the hierarchical computing model, we estimate the time needed to perform a task in this model and compare it with the time needed in a conventional system. Equation 2 shows the basic expression for the time necessary to complete a task, where performance gains can be obtained by improving any of the factors of the equation.
We have chosen to express the time required to execute an instruction (TPI) as the product of the clock period and the CPI (cycles per instruction). The first is commonly associated with the hardware implementation details, while the second is a factor of the processor architecture and the workload being executed.


Initial table allocation:
  For each Level_i
    For each Table_j
      if (i = M)
        TableInLevel(Table_j, Level_i) <- True
      else if (Size(Table_j) < Threshold(Level_i))
        TableInLevel(Table_j, Level_i) <- True
      else
        TableInLevel(Table_j, Level_i) <- False

Cost estimation:
  Traverse execution plan
  For each operation
    Obtain type(operation)
    Set level where computation is performed:
      Levels <- {Level_i | TableInLevel(Table(operation), Level_i) = True}
      HighestLevel <- Min(Levels)
      if (type(operation) in {SCAN, INSERT, REMOVE, UPDATE})
        ExecOpInLevel(operation) <- HighestLevel
      if (type(operation) in {JOIN})
        ExecOpInLevel(operation) <- Min(Rows(Table_j), Rows(Table_k))
        Assist(operation) <- Max(HighestLevel - 1, 1)
    Remember tables used by level:
      LevelUsesTable(ExecOpInLevel(operation), TablesUsedBy(operation)) <- True
    Compute cost(operation):
      if (type(operation) = SCAN)      cost(operation) <- Rows(Table_j)
      if (type(operation) = INSERT)    cost(operation) <- 1
      if (type(operation) = REMOVE)    cost(operation) <- 1
      if (type(operation) = UPDATE)
        if (dependencies(operation) > 0) cost(operation) <- Rows(Table_j)
        else                             cost(operation) <- 0
      if (type(operation) = AGGREGATE) cost(operation) <- 0
      if (type(operation) = SORT)      cost(operation) <- Rows(Table_j)
      if (type(operation) = JOIN)      cost(operation) <- Max(j, k)
      Add cost(operation) to CostOfLevel(ExecOpInLevel(operation))

Remove unused copies of tables:
  For each Level_i where i < M
    For each Table_j
      if (LevelUsesTable(Level_i, Table_j) = False)
        TableInLevel(Level_i, Table_j) <- False

Figure 11: Simple scheme for partitioning transaction processing workloads.
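The initial-allocation phase of Figure 11 can be expressed directly in code; the table sizes and per-level thresholds below are made-up numbers for illustration only:

```python
def initial_allocation(table_sizes, thresholds, M):
    """Assign tables to levels 1..M: the bottom level (M) holds everything,
    while an upper level holds only tables below its capacity threshold."""
    in_level = {}
    for level in range(1, M + 1):
        for table, size in table_sizes.items():
            in_level[(table, level)] = (level == M) or (size < thresholds[level])
    return in_level

# Toy configuration: 3 levels, sizes in GB (thresholds are hypothetical).
sizes = {"Item": 0.1, "Stock": 537.9, "Warehouse": 0.1}
thresholds = {1: 1.0, 2: 50.0, 3: float("inf")}
alloc = initial_allocation(sizes, thresholds, M=3)
print(alloc[("Item", 1)], alloc[("Stock", 1)], alloc[("Stock", 3)])
# -> True False True
```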


    t = (Instructions x TPI) / N_processors    (2)

    TPI = clk x CPI    (3)

This execution model performs computations in the different levels of the storage hierarchy, in what can be considered heterogeneous computing elements; thus Equation 4 expresses the time spent computing by a single level. The level index i goes from 1 to M, where M is the total number of levels in the hierarchy.(*) The parameter N_i indicates the number of processing elements in the particular level i.

    t_i = (Instructions_i x TPI_i) / N_i    (4)

Another interesting characteristic of this model is that it allows for operations to be performed concurrently in the different levels. However, there might be algorithms that do not allow for that level of parallelism because they have serial components in their execution. Taking that into account, Equations 5 and 6 show the range of values for the total time taken by a task. When we can exploit full parallelism, the resulting time will be the maximum of the individual times. Situations where all the operations need to be serialized would result in a time equal to the sum of all the times. Hence the range of execution time can be calculated as follows:

    t_max = SUM_{i=1..M} t_i    (no overlapping)    (5)

    t_min = MAX_{i=1..M} t_i    (full overlapping)    (6)

To enable this model, computation is partitioned so operations are assigned to the computation elements in each one of the levels. Considering the nature of transaction processing workloads, the partitioning of operations can be done statically or using query optimizers like the ones built into most commercial databases [26]. With these systems, we express the distribution of computation in Equation 7. The coefficients φ_i indicate the fraction of the total number of instructions executed in level i of the hierarchy.

    Instructions = SUM_{i=1..M} Instructions_i    (7)

    φ_i = Instructions_i / Instructions    (8)

(*) Our convention is to number the hierarchy from the top to the bottom, so the level with the main processor is assigned a value i = 1.


The number of instructions a processor is capable of executing is a function of a myriad of factors in the system. Among them, the type of workload plays an important role. Given that we are evaluating this model in the light of transaction processing workloads, we decompose the CPI into computation and storage components:

    CPI = CPI_computation + CPI_storage    (9)

A characteristic of this model is the use of relatively simple computation engines in the lower levels of the hierarchy. This affects the balance of the operations in the system, according to Equation 10.

    CPI_computation,i < CPI_computation,j   for i < j    (10)

Assuming a hierarchy of three levels like the one shown in Figure 12, it is possible to present an expansion of the above relations, showing the factors into which the time decomposes. We have assigned the first level to operations computed by the main processor, the second level to the main memory, and the third level to those operations performed by the disk controller.

[Figure 12 shows a three-level tree: the main processor (Main, i = 1) at the root, N_2 memory processors (P.memory, i = 2) in the middle, and N_3 disk processors (P.disk, i = 3) at the leaves.]

Figure 12: Topology of a simple Hierarchical Computing system with M = 3.

t = Instructions_1 × TPI_1 / N_1  +  Instructions_2 × TPI_2 / N_2  +  Instructions_3 × TPI_3 / N_3    (11)
         (main)                          (memory)                           (disk)

Since the disk and memory processors will typically not be as sophisticated as the central processor, we use a degradation factor to indicate how the memory and disk processors compare to the central processor.


The degradation factor for TPI_i is defined as δ_i = TPI_i / TPI_1. Using it, the speedup for the non-overlapping mode of Equation 5 is:

Speedup = 1 / (φ_1 + φ_2 δ_2 N_1/N_2 + φ_3 δ_3 N_1/N_3)    (12)

Similarly, the speedup when the algorithm shows full overlapping is expressed as:

Speedup = 1 / MAX(φ_1, φ_2 δ_2 N_1/N_2, φ_3 δ_3 N_1/N_3)    (13)
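Equations 12 and 13 reduce to one-liners in code. The sketch below evaluates both modes for one illustrative operating point (the φ, δ, and N values are examples, not measurements):

```python
def speedup_no_overlap(phi, delta, n):
    """Eq. 12: all levels serialized; level 1 (delta_1 = 1) is the baseline."""
    return 1.0 / sum(p * d * n[0] / ni for p, d, ni in zip(phi, delta, n))

def speedup_full_overlap(phi, delta, n):
    """Eq. 13: all levels execute concurrently."""
    return 1.0 / max(p * d * n[0] / ni for p, d, ni in zip(phi, delta, n))

phi   = (0.0, 0.25, 0.75)   # nothing left on the main processor
delta = (1.0, 2.0, 4.0)     # fast memory and disk processors
n     = (1, 4, 32)
s_serial  = speedup_no_overlap(phi, delta, n)    # ~4.57x
s_overlap = speedup_full_overlap(phi, delta, n)  # 8.0x
```

Note that when φ_1 = 0 the main processor drops out of both expressions entirely, so the speedup is limited only by the slower of the memory and disk levels.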

[Figure 13 shows the speedup as a surface over the degradation factors δ_2 and δ_3; the contour legend ranges from 0.00 to 3.00 in steps of 0.25.]

Figure 13: Speedup for Φ(17%, 25%, 58%) and Ψ(1, 4, 32).

Let us use the tuple Ψ to represent the number of computation elements in each level of the hierarchy. Code will be partitioned between the different layers based on affinity or other heuristics. We use the tuple Φ to represent the fractions of the instructions executed in each level of the hierarchy (φ_i). To analyze the effect of the selection of devices, we generate the speedup surface for variations of δ_2 and δ_3 (Figure 13). As expected, we achieve the maximum speedup with the smallest values of δ_2 and δ_3 (i.e., when P.memory and P.disk are not significantly slower than P.main). There are, however, several points with equal speedup for multiple pairs of degradation factors, which gives designers a great degree of freedom when selecting components.
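A surface like the one in Figure 13 can be regenerated directly from the non-overlapping speedup expression of Equation 12. The sketch below sweeps integer grid points of δ_2 and δ_3 (the paper's actual sampling may be finer) and confirms that the best speedup sits at the smallest degradation factors:

```python
phi, n = (0.17, 0.25, 0.58), (1, 4, 32)

def speedup(d2, d3):
    """Eq. 12 (no-overlap mode) at the Phi(17%, 25%, 58%), Psi(1, 4, 32) point."""
    return 1.0 / (phi[0] + phi[1] * d2 * n[0] / n[1] + phi[2] * d3 * n[0] / n[2])

surface = {(d2, d3): speedup(d2, d3)
           for d2 in range(2, 11)    # 2 <= delta2 <= 10
           for d3 in range(4, 16)}   # 4 <= delta3 <= 15
best_point = max(surface, key=surface.get)  # smallest (delta2, delta3) wins
```

The surface also makes the iso-speedup observation easy to check: many (δ_2, δ_3) pairs land in the same contour band, and at the slow corner (δ_2 = 10, δ_3 = 15) the speedup drops below 1, i.e., a slowdown.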


Ψ            Φ                   Min    Max
(1, 4, 32)   (10%,  5%, 85%)     1.60   4.32
(1, 4, 32)   (35%, 10%, 55%)     1.17   2.13
(1, 4, 32)   (15%, 25%, 60%)     0.95   2.86
(1, 4, 32)   ( 0%, 25%, 75%)     1.01   4.52
(1, 4, 32)   (10%, 20%, 70%)     1.08   3.48

(1, 4, 16)   (10%, 20%, 70%)     0.80   2.67
(1, 4, 32)   (10%, 20%, 70%)     1.08   3.48
(1, 4, 64)   (10%, 20%, 70%)     1.31   4.10
(1, 2, 32)   (10%, 20%, 70%)     0.70   2.58
(1, 2, 64)   (10%, 20%, 70%)     0.79   2.91

Table 2: Speedups of a hierarchical computing system with conventional memory and disk components.

The algorithm presented in Figure 11 is used to obtain a preliminary estimate of the code partitioning. For the situation where Threshold(Level_i) = Capacity(Level_i), Table 3 shows the fractions of the computation that should be performed in each level for a configuration Ψ(1, 4, 16).

Transaction type   φ_1      φ_2      φ_3
New Order          10.0 %    0.8 %   89.2 %
Payment            36.3 %    9.1 %   54.6 %
Order Status       16.7 %   25.0 %   58.3 %
Delivery            0.0 %   25.0 %   75.0 %
Stock Level         6.3 %   18.8 %   74.9 %

Table 3: Partition of code for the TPC-C benchmark.

We now proceed to a sensitivity analysis assuming a hierarchical computing system built with available technologies. We assume a main processor running at 1 GHz, memory processors running between 100 and 500 MHz (i.e., 2 < δ_2 < 10), and disk processors running between 66 and 250 MHz (i.e., 4 < δ_3 < 15). Considering the database sizes of some implementations of transaction processing systems, we selected a configuration with four memory processors and 32 disks (expressed as Ψ(1, 4, 32)). The first section of Table 2 shows the maximum and minimum speedups obtained for different partitions of the code (i.e., varying values of Φ). Depending on the distribution of work, we can observe speedups of up to 4.52x when comparing with a traditional uniprocessor system. If the memory and disk processors are very slow, potential slowdowns may be observed for certain partitions of the code. For instance, the distribution Φ(15%, 25%, 60%) encounters a slowdown when δ_2 = 10 and δ_3 = 15.
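The Min and Max columns of Table 2 can be reproduced from Equation 12 by evaluating it at the extreme degradation factors, (δ_2, δ_3) = (10, 15) and (2, 4) respectively. A short sketch for the Φ(10%, 20%, 70%) rows (the function name and layout are ours):

```python
def speedup(phi, delta, n):
    """Eq. 12, non-overlapping mode; delta_1 = 1 for the main processor."""
    return 1.0 / sum(p * d * n[0] / ni for p, d, ni in zip(phi, delta, n))

phi = (0.10, 0.20, 0.70)             # partition used in the lower half of Table 2
slow, fast = (1, 10, 15), (1, 2, 4)  # worst- and best-case degradation factors
rows = {n: (speedup(phi, slow, n), speedup(phi, fast, n))
        for n in [(1, 4, 16), (1, 4, 32), (1, 4, 64), (1, 2, 32), (1, 2, 64)]}
# rows[(1, 4, 32)] rounds to (1.08, 3.48), matching that row of Table 2
```

Comparing rows[(1, 4, 16)], rows[(1, 4, 32)], and rows[(1, 4, 64)] also shows the sublinear scaling discussed in the text: doubling the number of disk processors raises the best-case speedup from 2.67 to 3.48 to 4.10, well short of doubling it.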
However, the same configuration is capable of a speedup of 2.86x given more powerful nodes.

The code partition depends on the nature of the transaction, and as such is relatively hard to change


without restructuring the algorithm. It is therefore important to analyze the effect that changing the hardware configuration has on the speedup for a given code partition. This is shown in the lower half of Table 2, where we fix the partition Φ(10%, 20%, 70%) and change the number of memory and disk modules. Given the large fraction of code assigned to the disk modules, we observe that increasing the number of disks increases the speedup. As in any other multiprocessor system, the increase is not linear in the number of processing elements added. It may also be noted that active memory systems and active disk systems are subsets of the proposed general hierarchical computing system.

6 Related Work

During the 1970s, computer scientists looked at database applications and proposed specially designed machines to handle the increasing performance gap between primary and secondary storage, as well as the overwhelming software complexity present in database applications [28, 29]. Known as Database Machines, these systems incorporated specialized components in the form of per-disk, per-track, and per-head processors and associative memories, in order to facilitate access to data. The problem with these systems was that the use of non-commodity hardware drastically increased their cost. Additionally, they were designed to handle only database workloads, which resulted in declining interest from the rest of the architecture and software communities.

Database machines saw their last days with the development of parallel databases [30], which proved to be a cost-effective solution for the problems of the day. Since then, modern commercial databases have adopted several of the proposed algorithms: parallel sort [31], parallel join [32], and other algorithms that trade off memory utilization for I/O bandwidth [33].

In addition to database machines, several research projects have looked at the idea of placing computation elements close to the data.
Intelligent memories have mostly been targeted at regular numeric applications [9, 10], but recent attempts also look at their use in non-regular applications [11, 34, 12, 35]. Likewise, the idea of the intelligent disk has been explored by different research groups. The workloads considered for this technology consist of decision support systems [13, 14, 15, 17], data mining, and multimedia applications [16].

Related to our research, the X-Tree architecture [36, 37] looks at a multiprocessor organization where processors are connected in a binary tree. This topology facilitates the design of high-bandwidth systems, as the average distance between nodes increases logarithmically with the number of nodes in the system. However, the emphasis in the X-Tree system was on building VLSI chips based on the idea of recursive architectures [38], where it was possible to design a computer system by constructing a hierarchy with the same type of processors. Another example of a recursive system is the Data Driven Machine (DDM1), designed by Davis et al. [39]. DDM1 was able to exploit concurrency due to its implementation


of Data Driven Nets (DDN), which constitute a form of dataflow similar to the one used by our model.

While the computing paradigm presented in this paper has similarities to the aforementioned research efforts, it must be noted that the merits of several past architectural paradigms are synergistically combined and applied to transaction processing in our current research effort. We are leveraging advancements in active memories and active disks, while at the same time taking advantage of the advances made during the last twenty years in parallel databases and query optimizers.

7 Conclusions

In this paper, we have presented a hierarchical model of computing, based on the concept of performing computations in a hierarchical manner distributed over the memory hierarchy. This research is intended to alleviate the imbalance between computation and data access experienced by large-scale transaction processing systems. The building blocks that help realize the proposed hierarchical model are the active memory units and active disk devices that are emerging in the market. The basic principle is to use computation engines that sit close to the location of the data, and a hierarchy to connect these computation engines. This paradigm also brings in benefits of the dataflow model of computation.

We described the computation paradigm and outlined a simple code partitioning scheme. We then applied the code partitioning scheme to the TPC-C benchmark, and observed that it is possible to split the transactions using static information about the database tables and the storage capacity of the computation nodes. Using the simple partitioning algorithm and assuming inexpensive memory and disk processors, we showed that a hierarchical computing system with four memory processors and 32 disk processors can obtain speedups of up to 4.52x when compared with traditional uniprocessor systems. While performance degradations will occur in non-parallelizable code, or in systems with very slow memory or disk processors, judicious partitioning of code can yield performance improvements. Since transaction processing has been shown to contain significant amounts of parallelism, partitioning code in a fruitful way is not immensely difficult. In summary, hierarchical computing, which exploits parallelism, distributes computations, and reduces data transport requirements, is a feasible model of computation for future database servers.

One attractive result of using this paradigm is the feasibility of using processors with different performance ratings in the same system. This is possible due to the use of heterogeneous processing elements in the different layers of the hierarchy. Processors used at the top of the hierarchy move down as memory processors once a new high-performance processor generation arrives; simultaneously, the current memory processors move to the disks as disk processors. This maximizes the lifetime of a processor design, amortizing its cost.


References

[1] R. Winter, "The growth of enterprise data: Implications for the storage infrastructure," Whitepaper, Winter Corporation, 1998.

[2] V. S. Pai, M. Aron, G. Banga, M. Svendsen, P. Druschel, W. Zwaenepoel, and E. Nahum, "Locality-aware request distribution in cluster-based network servers," in Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, (San Jose, CA, USA), pp. 205–216, Oct. 2–7, 1998.

[3] A. M. G. Maynard, C. M. Donnelly, and B. R. Olszewski, "Contrasting characteristics and cache performance of technical and multi-user commercial workloads," in Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, (San Jose, CA, USA), pp. 145–156, Oct. 4–7, 1994.

[4] S. E. Perl and R. L. Sites, "Studies of Windows NT performance using dynamic execution traces," in Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation, (Seattle, WA, USA), pp. 169–184, Oct. 28–31, 1996.

[5] L. Barroso, K. Gharachorloo, and E. Bugnion, "Memory system characterization of commercial workloads," in Proceedings of the 25th Annual International Symposium on Computer Architecture (ISCA-98), (Barcelona, Spain), pp. 3–14, June 27–July 1, 1998.

[6] M. Rosenblum, E. Bugnion, S. A. Herrod, E. Witchel, and A. Gupta, "The impact of architectural trends on operating system performance," in Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, (Copper Mountain, CO), pp. 285–298, ACM Press, Dec. 1995.

[7] A. Ailamaki, D. J. DeWitt, M. D. Hill, and D. A. Wood, "DBMSs on a modern processor: Where does time go?," in Proceedings of the 25th Conference on Very Large Data Bases (VLDB'99), (Edinburgh, Scotland), pp. 15–26, Sept. 7–10, 1999.

[8] K. Keeton, D. Patterson, Y. Q. He, R. C. Raphael, and W. E. Baker, "Performance characterization of a quad Pentium Pro SMP using OLTP workloads," in Proceedings of the 25th Annual International Symposium on Computer Architecture (ISCA-98), (Barcelona, Spain), pp. 15–26, June 27–July 1, 1998.

[9] D. G. Elliott, W. M. Snelgrove, and M. Stumm, "Computational RAM: A memory-SIMD hybrid and its application to DSP," in Custom Integrated Circuits Conference, pp. 30.6.1–30.6.4, May 1992.

[10] D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, and K. Yelick, "A case for intelligent RAM: IRAM," IEEE Micro, Apr. 1997.

[11] M. Oskin, F. Chong, and T. Sherwood, "Active pages: A computation model for intelligent memory," in Proceedings of the 25th Annual International Symposium on Computer Architecture (ISCA-98), (Barcelona, Spain), pp. 192–203, June 27–July 1, 1998.

[12] M. Hall, P. Kogge, J. Koller, P. Diniz, J. Chame, J. Draper, J. LaCoss, J. Granacki, J. Brockman, A. Srivastava, W. Athas, V. Freeh, J. Shin, and J. Park, "Mapping irregular applications to DIVA, a PIM-based data-intensive architecture," in Proceedings of the High Performance Networking and Computing Conference (SC99), (Portland, OR), Nov. 13–19, 1999.


[13] K. Keeton, D. A. Patterson, and J. M. Hellerstein, "A case for intelligent disks (IDISKs)," in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD-98), (Seattle, WA, USA), pp. 42–52, June 1–4, 1998.

[14] A. Acharya, M. Uysal, and J. Saltz, "Active disks: Programming model, algorithms and evaluation," in Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, (San Jose, CA, USA), pp. 81–91, Oct. 2–7, 1998.

[15] G. A. Gibson, D. F. Nagle, K. Amiri, J. Butler, F. W. Chang, H. Gobioff, C. Hardin, E. Riedel, D. Rochberg, and J. Zelenka, "A cost-effective, high-bandwidth storage architecture," in Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, (San Jose, CA, USA), pp. 92–103, Oct. 2–7, 1998.

[16] E. Riedel, G. Gibson, and C. Faloutsos, "Active storage for large-scale data mining and multimedia," in Proceedings of the 24th VLDB Conference, Aug. 24–27, 1998.

[17] G. Memik, M. T. Kandemir, and A. Choudhary, "Design and evaluation of smart disk architecture for DSS commercial workloads," in Proceedings of the 2000 International Conference on Parallel Processing, (Toronto, ON, Canada), pp. 335–342, Aug. 21–24, 2000.

[18] Cirrus Logic, Inc., "Preliminary product bulletin CLSH8665," June 1998.

[19] Intel Corporation, "i960 HX microprocessor developer's manual," Order Number 272484-002, Sept. 1998.

[20] Siemens Microelectronics, "TriCore architecture overview handbook," Feb. 1999.

[21] A. Tessardo, "TMS320C27x: New generation of embedded processors looks like a microcontroller, runs like a DSP," White Paper SPRA446, Digital Signal Processing Solutions, 1998.

[22] E. F. Codd, "A relational model of data for large shared data banks," Communications of the ACM, vol. 13, no. 6, pp. 377–387, 1970.

[23] Arvind and R. S. Nikhil, "Executing a program on the MIT tagged-token dataflow architecture," in PARLE '87, Parallel Architectures and Languages Europe, Volume 2: Parallel Languages (J. W. de Bakker, A. J. Nijman, and P. C. Treleaven, eds.), Berlin, DE: Springer-Verlag, 1987. Lecture Notes in Computer Science 259.

[24] "TPC-C specification." http://www.tpc.org/cspec.html.

[25] "IBM DB2 Universal Database." http://www.software.ibm.com/data/db2/udb/.

[26] M. Jarke and J. Koch, "Query optimization in database systems," ACM Computing Surveys, vol. 16, pp. 111–152, June 1984.

[27] J. Lee, Y. Solihin, and J. Torrellas, "Automatically mapping code on an intelligent memory architecture," in Proceedings of the Seventh International Symposium on High Performance Computer Architecture (HPCA-7), (Monterrey, Mexico), Jan. 19–24, 2001.

[28] D. K. Hsiao, ed., Advanced Database Machine Architecture. Englewood Cliffs, NJ: Prentice-Hall, 1983.


[29] L. L. Miller, A. R. Hurson, and S. H. Pakzad, eds., Parallel Architectures for Data/Knowledge-Based Systems. Los Alamitos, CA: IEEE Computer Society Press, 1995.

[30] D. J. DeWitt and J. Gray, "Parallel database systems: The future of high-performance database systems," Communications of the ACM, vol. 35, pp. 85–98, June 1992.

[31] M. H. Nodine and J. S. Vitter, "Greed Sort: Optimal deterministic sorting on parallel disks," Journal of the ACM, vol. 42, pp. 919–933, July 1995.

[32] A. Segev, "Optimization of join operations in horizontally partitioned database systems," ACM Transactions on Database Systems, vol. 11, pp. 48–80, Mar. 1986.

[33] L. D. Shapiro, "Join processing in database systems with large main memories," ACM Transactions on Database Systems, vol. 11, pp. 239–264, Sept. 1986.

[34] Y. Kang, W. Huang, S.-M. Yoo, D. Keen, Z. Ge, V. Lam, P. Pattnaik, and J. Torrellas, "FlexRAM: Toward an advanced intelligent memory system," in Proceedings of the International Conference on Computer Design (ICCD99), (Austin, TX, USA), Oct. 1999.

[35] K. Mai, T. Paaske, N. Jayasena, R. Ho, W. J. Dally, and M. Horowitz, "Smart Memories: A modular reconfigurable architecture," in Proceedings of the 27th Annual International Symposium on Computer Architecture (ISCA'00), (Vancouver, BC, Canada), pp. 161–171, June 12–14, 2000.

[36] A. M. Despain and D. A. Patterson, "X-TREE: A tree structured multi-processor computer architecture," in Proceedings of the 5th Annual International Symposium on Computer Architecture, (Palo Alto, CA, USA), pp. 144–151, Apr. 3–5, 1978.

[37] D. A. Patterson, E. S. Fehr, and C. H. Séquin, "Design considerations for the VLSI processor of X-TREE," in Proceedings of the 6th Annual International Symposium on Computer Architecture, (Philadelphia, PA, USA), pp. 90–101, Apr. 23–25, 1979.

[38] P. C. Treleaven, "VLSI processor architectures," IEEE Computer, pp. 33–45, June 1982.

[39] A. L. Davis, "The architecture and system method of DDM1: A recursively structured data driven machine," in Proceedings of the 5th Annual International Symposium on Computer Architecture, (Palo Alto, CA, USA), pp. 210–215, Apr. 3–5, 1978.
