Congestion Management in MINs through Marked and Validated Packets∗

Joan-Lluís Ferrer, Elvira Baydal, Antonio Robles, Pedro Lopez, Jose Duato
Parallel Architectures Group
Universidad Politecnica Valencia
Camino de Vera s/n, 46021 Valencia, Spain

[email protected];{elvira, arobles, plopez, jduato}@disca.upv.es

Abstract

Congestion management is a critical problem that has been tackled in interconnection networks for years but is not yet solved. Although several mechanisms have recently been proposed for lossless Multistage Interconnection Networks (MINs), they either have drawbacks or are partial solutions. Some of them penalize packets that are not actually addressed to the hot-spots, whereas others can cope only with congestion situations that last a short time. In this paper, we propose an effective and efficient congestion management mechanism for lossless interconnection networks based on explicit congestion notification. The mechanism uses two different flags in ACK packets, a Marking Bit (MB) and a Validation Bit (VB), to detect congestion and warn the origin hosts. In this way, packets that belong to “cold flows” but are stopped because of head-of-line (HOL) blocking can be distinguished from “hot flow” packets, which are really causing congestion. In response, origin hosts can apply corrective actions only to the “hot flows”, minimizing the negative impact on “cold flow” performance. Evaluation results show that the proposed congestion management strategy is able to avoid the degradation of network performance, regardless of traffic load and the location of the congestion in the network.

1. Introduction

Over the last few years, processing systems have experienced dramatic growth, driven by the needs of new communication-intensive applications and the increasing demand for new services. High-performance network technologies have made it possible to meet the requirements derived from this new scenario. Among all the feasible topologies, MINs have become very popular. Good examples of these technologies are InfiniBand, Myrinet and Quadrics. As these interconnects are expensive compared to processors, an easy way to cut costs is to reduce the number of components (i.e., switches and links), increasing their use. In addition, there is a rising interest in applying power saving techniques everywhere. In the case of interconnects, these techniques are mainly based on reducing the number of links of the network or using frequency/voltage scaling techniques [1, 13]. Both factors, cost reduction and power saving, lead to a higher link utilization. In this situation, network congestion may arise, dramatically degrading network performance. Congestion appears when there is contention between several packets trying to use the same output link. These packets generate “hot flows”. If this situation persists, packets start to accumulate at the input queues of the affected switches. As a consequence of the back pressure exerted by the flow control mechanism, packet advance in the previous switches is also delayed, generating head-of-line (HOL) blocking, which prevents the advance of packets addressed to non-congested links. These packets belong to “cold flows” and are victims of the “hot flows”. This situation spreads through the network forming a saturation tree, blocking upstream switches and degrading overall network performance. Therefore, to avoid these negative effects, an efficient congestion management mechanism is mandatory.

∗This work was supported by the Spanish program CONSOLIDER-INGENIO 2010 under Grant CSD2006-00046, by the Spanish CICYT under Grant TIN2006-15516-C04-01, by Junta de Comunidades de Castilla-La Mancha under Grants PBC-05-005-2 and PBC-05-007-2 and by the European Commission in the context of the SCALA integrated project #27648 (FP6).

Congestion management has generated a lot of research, and many mechanisms have been proposed over the years. As we describe in Section 2, many of them have scalability problems, requiring more resources as the network size grows. Others cannot manage congestion when it lasts for a long time. Recently, some mechanisms based on marking packets in transit have been proposed to detect and manage congestion in InfiniBand. Unfortunately, these approaches do not guarantee, for all traffic distributions, that corrective actions are only carried out on those packet flows causing congestion. However, if switches with queues at both their input and output ports (CIOQ switches) are used, as is the case in many recent designs, a more selective packet marking mechanism can be applied. In this way, “cold” and “hot flows” can be distinguished, so that the available network resources are evenly distributed among the devices that demand them and network throughput is maximized.

In this paper, we take on this challenge and propose a cost-effective congestion management strategy for lossless MINs based on CIOQ switch technology. The proposed mechanism combines packet marking at the switch input buffers with the corresponding validation at the output buffers that feed links that are truly saturated. Only the packets belonging to “hot flows” are validated. Additionally, we propose an effective mechanism to correct the injection rate of the “hot flows”, able to respond quickly and in accordance with the degree of congestion in the network.

The rest of the paper is structured as follows. Section 2 gives background on the congestion management mechanisms proposed for lossless interconnection networks. Section 3 presents the proposed strategy in detail. The simulation scenario and the evaluation results are presented in Section 4. Finally, some conclusions are drawn in Section 5.

2. Related work

Congestion management in lossless networks has been widely studied over the years [9]. From our point of view, there are mainly two strategies to solve the problem: congestion prevention and congestion recovery. Congestion prevention requires knowing in advance the network resources needed and reserving them before starting the packet transmission. These strategies imply some overhead and are only used in protocols aimed at QoS. Congestion recovery is usually based on three actions: detection, notification, and correction. Congestion is frequently detected by measuring switch buffer occupancy [4, 11, 12, 15]. After congestion is detected, the sources injecting traffic have to be warned. Usually, they then apply message throttling. However, most of the proposed mechanisms do not guarantee that the notified sources are only the ones that are injecting traffic to the congested links. This may occur if “hot flows” cannot be properly distinguished from “cold flows” [5, 11, 12], or if broadcast is used to notify the nodes [15, 16]. This may penalize network throughput. Moreover, broadcasting may require transmitting a lot of control information, thereby wasting network bandwidth. To avoid this drawback, other proposals [5, 11, 12] take advantage of the acknowledgment (ACK) sent back to the source when a packet has reached its destination. In this ACK packet, a single bit is enough to notify the source of congestion efficiently. Another mechanism, proposed in [4], tries to solve congestion using additional queues in the switches where congestion is detected. These additional queues are intended for “hot flows”. This makes it possible to separate “hot flows” from “cold flows” in different queues, but it only works when congestion lasts a short time interval. On the other hand, there are some proposals that try to resolve HOL blocking at the switch level [2, 8, 14] or at the network level [3, 7]. This reduces congestion, but it does not really solve it.

3. The MVCM mechanism

In this section, we describe in detail the proposed congestion management strategy. This strategy combines a packet marking/validation mechanism with corrective actions performed at the origin hosts. The main goal of the strategy is to properly differentiate “hot flows” from “cold flows”, applying packet injection limitation only to the source nodes that are really generating congestion, and with the appropriate intensity. The mechanism will be referred to as MVCM (Marking/Validation Mechanism for Congestion Management).

3.1. Congestion detection mechanism

To cope with congestion situations, some approaches mark all the packets waiting at an input buffer when their number exceeds a threshold. When these marked packets arrive at their destinations, the destination hosts inform the origin hosts about the occurrence of congestion. However, notice that if we pay attention only to the status of the input buffers of switches, packets that belong to “cold flows” could also be marked. Therefore, the corrective actions at origin hosts could penalize the “cold flows”, negatively impacting overall network performance. To avoid penalizing “cold flows”, we propose a new packet marking mechanism that not only marks packets at the input buffers but also validates them later at some output buffers when certain conditions are met. Therefore, the marking mechanism operates in two steps.

Firstly, packets arriving at an input buffer are marked if the number of packets stored in the buffer exceeds a certain threshold. This indicates that a congestion tree may be starting to form. Marking is performed by activating the Marking Bit (MB) in the packet header. To this end, we can use any of the header bits usually reserved by the specs for vendor applications.

Secondly, when a marked packet is forwarded through a saturated output link (possibly belonging to a different switch from the one where it was marked), we validate it by activating a second bit in the packet header, the Validation Bit (VB). We assume that an output link is saturated when the number of packets stored in its buffer (output buffer) exceeds another threshold. The mechanism forbids two actions: a packet that has not been previously marked (MB = 0) cannot be validated (VB = 1), and marked packets are never unmarked.

Once congestion has been detected by identifying the existence of some “hot flows” in the network (MB = 1, VB = 1), the corresponding origin hosts have to be warned so that they apply corrective actions. For this purpose, we take advantage of ACK packets. At the destination host, an ACK packet is sent back to the origin host with its MB and VB bits set to the same values as those of the corresponding received data packet. In this way, the origin host can identify the flows on which it should apply corrective actions. Notice that this notification mechanism does not overload the network with additional control traffic, because it only makes use of the ACK packets commonly used in reliable communications.

Table 1. Corrective actions.

Ack bits
MB  VB  Actions
0   0   None
1   0   Moderate (Preventive)
1   1   Imminent (Corrective)
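The two-step marking and the ACK echo described in this section can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `Packet`, `on_input_arrival`, `on_output_forward` and `make_ack` are hypothetical, the threshold values are the ones reported in Section 3.2.3, and whether the occupancy count includes the arriving packet is an assumption left to the caller.

```python
from dataclasses import dataclass

# Thresholds in data packets. The values 2 (input) and 1 (output) are the
# ones the paper reports for its simulated setup; treat them as configuration.
INPUT_THRESHOLD = 2
OUTPUT_THRESHOLD = 1


@dataclass
class Packet:
    mb: bool = False  # Marking Bit
    vb: bool = False  # Validation Bit


def on_input_arrival(packet: Packet, input_occupancy: int) -> None:
    """Step 1: mark a packet arriving at an input buffer that is over threshold."""
    if input_occupancy > INPUT_THRESHOLD:
        packet.mb = True  # a marked packet is never unmarked


def on_output_forward(packet: Packet, output_occupancy: int) -> None:
    """Step 2: validate an already marked packet crossing a saturated output link."""
    if packet.mb and output_occupancy > OUTPUT_THRESHOLD:
        packet.vb = True  # VB can only be set when MB is already set


def make_ack(data_packet: Packet) -> Packet:
    """The destination copies MB and VB from the data packet into the ACK."""
    return Packet(mb=data_packet.mb, vb=data_packet.vb)
```

Because validation is conditioned on `packet.mb`, the combination MB = 0, VB = 1 cannot be produced, which matches the restriction stated above.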

3.2. Congestion correction mechanism

Basically, to avoid congestion, it is necessary to reduce the injection rate of the hosts responsible for “hot flows”. To this end, we propose two levels of corrective actions at the origin hosts. The first level adjusts the packet injection rate by using a sliding window (SW). This technique is already used successfully in other network protocols such as TCP [6]. It is based on the idea of limiting, for each flow (the traffic generated from an origin to a destination), the number of outstanding packets in the network. Outstanding packets are those that have not been acknowledged yet. If congestion persists, a second level reduces the injection rate even further. It consists of introducing waiting periods (WP) between packet injections while the window size is kept at 1. Table 1 shows the corrective actions defined according to the values of the bits received in ACK packets. We refer to the actions performed on flows with both marked and validated packets as “imminent”. On flows with only marked packets, we take only “moderate” actions to prevent a possible onset of congestion. The corrective actions taken always depend on the seriousness of the detected congestion.

3.2.1 Reducing the injection rate

Initially, the window size is set to some value SWmax, and packets can be injected without interruption (WP=0).

Table 2. Corrective algorithm.

Ack bits
MB  VB  Reduction Tech.            Recovery Tech.
0   0                              if WP>0 then WP:=0
                                   elseif SW<SWmax then SW:=SW+1
                                   end;
1   0   if SW>1 then SW:=SW-1
        end;
1   1   if SW>1 then SW:=SW-1
        else Increase(WP) (*)
        end;

For (*) see “Waiting periods calculation”

Moderate actions: Only the SW size is modified, by subtracting one from its current value for each marked ACK packet received. If the window size reaches its minimum value (one) and more marked (but not validated) ACK packets arrive, the mechanism keeps the window size equal to 1. This situation remains until a non-marked ACK packet arrives. Then, the window size is increased by one for each non-marked ACK packet received until the maximum value of SW is reached (see Section 3.2.2). In this way, for those flows where no congestion is detected but marked packets are received (an indication that congestion may be near), the injection rate is decreased only for the strictly necessary period of time, by simple SW control. Thus, the SW size is kept within the interval [1..SWmax].

Imminent actions: They are applied when ACK packets that are both marked and validated are received. The mechanism reacts by reducing the SW size as moderate actions do. If the SW size reduction is not enough to stop the growth of the saturation tree, heavier actions are applied to reduce even further the injection rate of the flows responsible for the congestion. This second level of actions starts when the SW size becomes equal to one. Then, waiting periods (WP) are inserted between the injected packets. Notice that both situations can occur at the same time. That is, ACKs that are only marked and ACKs that are both marked and validated can arrive interleaved during a congestion episode. In that situation, the mechanism combines the actions described above (i.e., with marked packets, only actions on SW are applied; with both marked and validated ones, actions on SW and, later, WP insertion are applied). The corrective algorithm is shown in Table 2.
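The per-flow rules of Table 2 can be written as a small state machine at the origin host. The sketch below is one possible reading, not the authors' code: the class name `FlowState` and its fields are hypothetical, and `increase_wp` assumes the rule given under “Waiting periods calculation” (the first increase sets WP to the round-trip delay, later ones multiply it by the radix).

```python
from dataclasses import dataclass


@dataclass
class FlowState:
    sw_max: int      # initial (maximum) sliding-window size
    rtt: float       # round-trip delay, used to seed the first waiting period
    radix: int       # switch radix, used to scale later waiting periods
    sw: int = 0      # current window size, kept in [1..sw_max]
    wp: float = 0.0  # current waiting period (0 means no waiting)

    def __post_init__(self) -> None:
        self.sw = self.sw_max  # start at full injection rate

    def increase_wp(self) -> None:
        # Assumed reading of Increase(WP): seed with RTT, then multiply by radix.
        self.wp = self.rtt if self.wp == 0 else self.radix * self.wp

    def on_ack(self, mb: bool, vb: bool) -> None:
        if not mb:                   # MB=0, VB=0: recovery
            if self.wp > 0:
                self.wp = 0.0
            elif self.sw < self.sw_max:
                self.sw += 1
        elif not vb:                 # MB=1, VB=0: moderate action
            if self.sw > 1:
                self.sw -= 1
        else:                        # MB=1, VB=1: imminent action
            if self.sw > 1:
                self.sw -= 1
            else:
                self.increase_wp()
```

Keeping all state per origin-destination flow means that ACKs for cold flows never change the window or waiting period of a hot flow from the same host.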

3.2.2 Recovering the injection rate

The strategy to recover the injection rate must be a trade-off between achieving a fast response when congestion has finished and avoiding the injection of too much traffic while the network is still congested.

Non-marked packets reception. The reception of ACK packets at the origin node with MB=0 and VB=0 allows the parameters of the congestion control mechanism to recover their initial values (i.e., full injection rate). The recovery period depends only on the size the window has reached. When the first non-marked ACK packet is received, the WP applied to that flow is immediately eliminated, thus allowing a fast recovery while keeping the window size equal to 1. If more non-marked ACK packets arrive, SW recovers its initial value by adding one for each non-marked packet received. Table 2 shows these actions. After SWmax non-marked ACK packets have been received, the full injection rate is available again.

Injecting a packet into an empty queue at the origin host. To remove the corrective actions even faster when congestion is no longer detected, the parameters are set to their initial values if an origin host injects a packet into an empty injection queue. Notice that if a host temporarily stops injecting, it no longer contributes to congestion. Moreover, if no data packets are injected, then no ACK packets are received. Hence, although this host is no longer generating congestion, it would not be able to recover the initial values of the parameters quickly. Consequently, the parameters for this origin-destination flow are reset, that is, SW = SWmax and WP = 0. With this action, we get an immediate recovery.
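The empty-queue reset can be illustrated with a short sketch. It assumes one injection queue per origin-destination flow, and the names `Flow` and `enqueue` are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Flow:
    sw_max: int
    sw: int
    wp: float
    queue: deque = field(default_factory=deque)  # per-flow injection queue

    def enqueue(self, packet: object) -> None:
        # An empty queue means the flow has been idle, so no ACKs are pending
        # that could drive the normal recovery: restore full rate directly.
        if not self.queue:
            self.sw = self.sw_max
            self.wp = 0.0
        self.queue.append(packet)
```

A packet that joins a non-empty queue leaves SW and WP untouched, so a flow that is still backlogged keeps its current restrictions.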

Therefore, the proposed mechanism takes corrective actions immediately against serious situations that could cause congestion in the network. However, if the congestion takes place only during a brief period of time (few validated ACK packets are received), the recovery is also very fast. As a consequence, the mechanism does not penalize network performance in the absence of congestion.

3.2.3 Parameters initialization

Buffer Threshold: The input and output buffer thresholds at each switch should be chosen such that, for a uniform traffic pattern, packets cross the switch without being either marked or validated, regardless of the network load. Even if the network traffic is near the saturation point, if the destination distribution is uniform, the network itself will limit the entrance of packets at origin hosts. To calculate the thresholds, we have simulated the network behavior with a uniform traffic distribution and an injection rate near the saturation point. Packet size is assumed to be 256+22 bytes for data packets (payload+control) and 0+22 bytes for ACK packets. At this traffic level, the maximum occupancy of the input and output buffers is measured. From these results, we define the buffer thresholds. In particular, we have obtained values of 2 and 1 (data packets) for the input and output buffer thresholds, respectively.
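One way to turn such a calibration run into thresholds is to take the largest occupancy observed under uniform traffic, since marking and validation only fire when occupancy exceeds the threshold. The sketch below assumes this rule (the paper only says the thresholds are defined "from these results"); the function name and the sample traces are hypothetical and are not measured data.

```python
def calibrate_threshold(occupancy_samples: list[int]) -> int:
    """Return the largest buffer occupancy seen under uniform traffic.

    A packet is marked or validated only when occupancy exceeds the
    threshold, so the observed maximum keeps uniform traffic untouched.
    """
    return max(occupancy_samples)


# Hypothetical occupancy traces (in data packets), for illustration only.
input_trace = [0, 1, 2, 1, 0, 2, 1]
output_trace = [0, 1, 0, 1, 1, 0, 0]

input_threshold = calibrate_threshold(input_trace)
output_threshold = calibrate_threshold(output_trace)
```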

Sliding Window Size: There are some proposals with SW fixed to 1 [10]. While this choice is appropriate to alleviate congestion in a general scenario, it may negatively impact network throughput (this would be the case when a host sends packets toward a single destination host). Therefore, it is necessary to identify the optimum window size for the network. Let us assume a scenario with a uniform destination distribution and an injection rate near the saturation point. The initial value of the window size is established according to the round-trip time (RTT) of the packets. It is given by the number of packets from a flow that can be injected into the network before the first ACK is received at the origin host.

[Figure 1 plots: (a) SW=infinite, (b) SW=2 and (c) SW=1 show Latency (Cycles) versus Simulation Time (Cycles); (d) shows Waiting Packs at Origins versus simulation time for SW=Infinite, SW=2 and SW=1.]

Figure 1. Comparison of network performance with different window sizes for a 4-ary 4-fly. (a), (b) and (c) average latency, (d) average number of waiting packets at origins.

Figure 1 shows the influence of the SW size for a 4-ary 4-fly interconnection network. The traffic pattern is a hot-spot formed by 64 hosts injecting packets toward a single destination only during part of the simulation, whereas the remaining 192 hosts continuously inject uniform traffic. We can see that if the window size is equal to 1, both latency from generation time and the number of packets waiting at the origins are minimized when congestion appears. However, once congestion decreases, the network never recovers its initial performance values, because a SW size equal to 1 does not allow all the packets accumulated at origin hosts during the congestion period to be consumed. As shown, a SW size equal to 2 provides the best behavior. In this network, when it is not saturated, the round-trip delay is equal to the time needed to inject 1.45 packets. This is the ideal window size, as it is the minimum size that is big enough to avoid bubbles in packet sending. Since the window must hold a whole number of packets, it is rounded up to 2. Notice that when the network is not saturated, a bigger SW size does not provide advantages, since the ACKs arrive at the origin before more than 2 packets can be injected. However, as the network becomes saturated, packet contention arises and packets stay longer in the queues, increasing RTTs and delaying ACKs. In this situation, a bigger SW size allows more packets to be in transit through the network, thus leading to an even more congested situation. In Figure 1, the SW=infinite case confirms this behavior, taking the network to the most congested situation.
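The choice of window size from the round-trip delay can be checked with a short calculation. The helper below is a sketch: the name `initial_window` is hypothetical, and the 278-cycle injection time is an assumed figure chosen only so that the ratio matches the 1.45 packets quoted in the text.

```python
import math


def initial_window(rtt_cycles: float, packet_injection_cycles: float) -> int:
    """Smallest whole number of packets that covers one unsaturated round trip."""
    return max(1, math.ceil(rtt_cycles / packet_injection_cycles))


# Assumed injection time per packet; the RTT is 1.45 times that value.
injection_cycles = 278
rtt = 1.45 * injection_cycles
window = initial_window(rtt, injection_cycles)
```

Rounding up rather than to the nearest integer is what avoids bubbles: a window of 1 would leave the link idle for the remaining 0.45 of a packet time in every round trip.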

Waiting Periods Calculation: The injection rate is reduced by adding waiting periods between the injection of consecutive packets at the origins. The waiting periods (WP) are calculated with the following function:

WP := Radix * WP_previous

In the absence of congestion, WP=0. When the first validated ACK is received, WP is set to the round-trip delay. WP is then multiplied by the radix each time Increase(WP) is applied again (see Table 2). The radix of the network is used because it reflects the switches used in the network. Notice that the higher the radix, the more severe a congestion situation can be, requiring a greater reduction of the injection rate. As can be seen, the mechanism is easy to implement.
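Unrolling this rule gives a closed form. Assuming, as stated above, that the first increase sets WP to the round-trip delay (written RTT here) and every later increase multiplies by the radix, and writing WP_k for the value after the k-th increase (the index k is our notation, not the paper's):

```latex
\begin{align*}
WP_1 &= \mathrm{RTT}, \\
WP_k &= \mathrm{Radix} \cdot WP_{k-1} \qquad (k \ge 2), \\
WP_k &= \mathrm{Radix}^{\,k-1} \cdot \mathrm{RTT}.
\end{align*}
```

For a radix of 8, the first three waiting periods would be RTT, 8 RTT and 64 RTT, so the injection rate of a persistent hot flow falls geometrically.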

4. Performance evaluation

4.1. Network configurations

The proposed mechanism has been evaluated by using a network simulator. A generic switch-based network technology with point-to-point serial links and buffered credit-based flow control has been assumed. A packet is transmitted over a link only if there is enough buffer space (measured in credits of 64 bytes) to store the entire packet. We have evaluated butterfly MINs for different values of switch radix and different numbers of network stages. In particular, the following network configurations have been evaluated: 2-ary 4-fly, 2-ary 6-fly, 2-ary 7-fly, 4-ary 3-fly, 4-ary 4-fly, 4-ary 5-fly, and 8-ary 3-fly. Data packets have a payload of 256 bytes plus 22 bytes of control, resulting in a 278-byte packet size. ACK packet size is 22 bytes.
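The credit check described above can be sketched as follows. The function names are hypothetical, and rounding a packet up to whole credits is an assumption about how the simulator accounts for partially filled credits.

```python
import math

CREDIT_BYTES = 64  # credit size stated in the text


def credits_needed(packet_bytes: int) -> int:
    """Whole credits required to store a packet (assumes rounding up)."""
    return math.ceil(packet_bytes / CREDIT_BYTES)


def can_transmit(packet_bytes: int, free_credits: int) -> bool:
    """A packet is sent only if the receiver can store it entirely."""
    return credits_needed(packet_bytes) <= free_credits


data_credits = credits_needed(256 + 22)  # data packet: payload + control
ack_credits = credits_needed(22)         # ACK packet: control only
```

Under this accounting a 278-byte data packet needs 5 credits and a 22-byte ACK needs 1.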

Switches have 1 KByte buffers at both their input and output ports, thus allowing the storage of 3 complete data packets plus some ACK packets. Full Virtual Output Queuing (VOQ) is used at source hosts, which eliminates HOL blocking at the origins. A deterministic routing algorithm is used. We first generate packets according to a uniform distribution of message destinations. Then, we create a hot-spot in the network and analyze what happens with “cold” and “hot flows”. In particular, hosts that send uniform traffic keep injecting packets during the whole simulation. Hosts that generate hot-spot traffic remain inactive until the first 50,000 packets have been received. Then, each of them injects 1,000 packets at the same injection rate as the other hosts, but addressed to only one destination host (the hot-spot), and stops generating packets afterwards. This scenario has been simulated with different injection rates: low, intermediate, and high (0.45, 0.76, and 1.6 bytes/cycle/injecting sw, respectively; “B/C/isw” in figures). “Injecting sw” refers to the switches with source hosts connected to them (i.e., the switches of the first stage). In all simulation scenarios, we have obtained very good results. However, for the sake of brevity we only show a subset of the results, for the following cases: 8-ary 3-fly in Figures 3, 4, and 5; 4-ary 3-fly in Figure 6; and 4-ary 4-fly in Figures 7 and 8. Traffic patterns are shown in Table 3. The results obtained are qualitatively similar in the rest of the environments.

Table 3. Evaluated traffic patterns.

            8-ary 3-fly (512)   4-ary 4-fly (256)   4-ary 3-fly (64)
Hot-Spot    #Srcs   Dest.       #Srcs   Dest.       #Srcs   Dest.
Small       480     Unif.       240     Unif.       60      Unif.
            32      H.S.        16      H.S.        4       H.S.
Medium      448     Unif.       224     Unif.       56      Unif.
            64      H.S.        32      H.S.        8       H.S.
High        384     Unif.       192     Unif.       48      Unif.
            128     H.S.        64      H.S.        16      H.S.
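The source counts in Table 3 follow a simple pattern: a k-ary n-fly has k^n hosts, and the small, medium and high hot-spots use 1/16, 1/8 and 1/4 of them as hot sources. These fractions are inferred from the table rather than stated in the text, and the names below are hypothetical.

```python
# Divisor applied to the host count for each hot-spot size (inferred from Table 3).
HOT_DIVISOR = {"small": 16, "medium": 8, "high": 4}


def traffic_pattern(k: int, n: int, hot_spot: str) -> tuple[int, int]:
    """Return (uniform sources, hot-spot sources) for a k-ary n-fly."""
    hosts = k ** n
    hot = hosts // HOT_DIVISOR[hot_spot]
    return hosts - hot, hot


medium_8ary = traffic_pattern(8, 3, "medium")
high_4ary4 = traffic_pattern(4, 4, "high")
small_4ary3 = traffic_pattern(4, 3, "small")
```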

4.2. Evaluation results

To properly evaluate the proposed congestion management mechanism, it is interesting to analyze what happens to network performance when a very short hot-spot is applied, since the congestion management mechanism is not able to react in such a small period of time. Figure 2 shows the “cold flows” performance when MVCM is applied with the traffic load shown in Table 4. As can be seen in the table, only one packet is sent per “hot flow”. So, if a validated ACK packet is returned to the source host because congestion is suspected to be starting (although it is not actual congestion), no actions will be taken, because no more packets will be sent to the hot-spot from the “hot hosts”. As shown in Figure 2a, “cold flows” suffer a sharp increase in their latency just when nodes start injecting packets to the hot-spot. In a general scenario, with true congestion, this time is the minimum delay between the beginning of congestion and the reaction of the mechanism. Note that this short latency pulse cannot be avoided, and it becomes more significant as network traffic increases. In any case, this latency pulse can be considered unimportant if we take into account that the “cold flows” throughput (Figure 2b) is not penalized at all. To sum up, a small latency pulse can appear just at the beginning of congestion, but it has hardly any influence because “cold flows” throughput is not altered.

[Figure 2 plots: (a) Average latency, Latency (Cycles); (b) Throughput, Throughput (B/C/isw); both against simulation time.]

Figure 2. Cold flows performance in a 4-ary 4-fly with short hot-spot and MVCM.

Table 4. Traffic pattern for a very short hot-spot for a 4-ary 4-fly.

#Srcs  Dest.  Inj.Rate  Traffic Beginning             Traffic Ending
192    Unif   100%      Start of Simulation           End of Simulation
64     HS     100%      After 50000 received packets  After injecting a single packet

The following results have been obtained for a medium hot-spot according to the traffic patterns shown in Table 3, when applying an intermediate injection rate. The congestion root link is located in the last stage of every analyzed network.

Firstly, we check whether the mechanism is able to manage congestion in an 8-ary 3-fly network. Figures 3, 4, and 5 show the performance of “cold” and “hot” flows, with and without the MVCM mechanism.

[Figure 3 plots: Latency (Cycles) against simulation time; (a) Cold Flows, (b) Hot Flows, (c) Cold Flows, (d) Hot Flows.]

Figure 3. Network latency for an 8-ary 3-fly. (a) and (b) without MVCM, (c) and (d) with MVCM.

Figure 3 shows how MVCM provides an important reduction of the latency of “cold flows”. Only a small pulse, due to the “congestion-notification-reaction” interval discussed above, remains. On the other hand, although latency for “hot flows” is slightly increased when applying MVCM, the time to consume the whole hot-spot is significantly reduced. Notice that the fluctuations shown in Figure 3b within the interval [4e+06..1e+07] are produced by packets generated by “hot hosts” that remain in the network without being delivered once the hot-spot period has finished. Figure 3d shows that all the packets belonging to “hot flows” have arrived at their destination before simulation time 5e+06. Therefore, MVCM reduces latency for “cold flows” while providing a stable injection rate for “hot flows” that is just enough to fill the bandwidth of the congestion root link.

[Figure 4 plots: four panels of Throughput (B/C/isw) against Simulation Time (Cycle).]

Figure 4. Network throughput for an 8-ary 3-fly. (a) and (b) without MVCM, (c) and (d) with MVCM.

Figure 4 shows network throughput with and without MVCM. Notice that the interference produced in “cold flows” by the hot-spot (Figure 4a) completely disappears when MVCM is applied (Figure 4c). Furthermore, as can be seen, the short latency pulse shown in Figure 3c has no influence on throughput. Moreover, MVCM also keeps the network throughput for “hot flows” constant both during the hot-spot period (Figure 4d) and once the packets generated by the hot-spot have been totally consumed (Simul.Time > 5e+06). Notice that in the latter case, throughput is due only to the packets belonging to uniform traffic. However, if MVCM is not applied, the network throughput for “hot flows” (Figure 4b) after the hot-spot period has finished (Simul.Time > 4e+06) is due to packets belonging to uniform traffic plus packets from “hot flows” that still remain at origin hosts or in the network.

Figure 5b shows how MVCM is able to provide the maximum occupation for the link connected to the hot-spot, keeping it 100% busy during the hot-spot period. However, as shown in Figure 5a, when no congestion management is applied, it is not possible to guarantee full bandwidth utilization of the hot-spot link. Moreover, in the absence of congestion, once the hot-spot period has concluded (Simul. Time > 5e+06 in Figure 5b), the occupation with MVCM is more stable than without congestion management (Figure 5a).

All the results shown above have been obtained by using a radix value of 8 to calculate WP. It is necessary to check whether similar results can be obtained for other network sizes with different radix values.

[Figure 5: two panels plotting Utilization (%), from 0 to 100, against simulation time, from 0 to 9e+06. (a) Without MVCM. (b) With MVCM.]

Figure 5. Utilization percentage of the link connected to the hot-spot for an 8-ary 3-fly.

Figure 6 shows similar results for a 4-ary 3-fly network. As can be seen, the mechanism also reacts by reducing latency for “cold flows” (Figures 6a and 6c) and eliminating the interference after the hot-spot period for “hot flows” (Figures 6b and 6d), thus allowing packets belonging to “hot flows” to be consumed in a stable and continuous way. Therefore, MVCM is able to suitably manage congestion using the proposed expression to compute the waiting periods as a function of network radix.

Finally, we are interested in analyzing the influence of the congestion root link location on network performance. Specifically, we analyze what happens when the congestion root link is located at an intermediate stage instead of at the last network stage. In particular, we have evaluated the case of a 4-ary 4-fly network with the congestion root link located at stage 3 (Figure 8).

For comparison purposes, Figure 7 shows the results with the congestion root link in the last stage. Results are qualitatively similar to those of the other network configurations analyzed. That is, latency is reduced for “cold flows”, the time to consume all the “hot packets” is decreased, and a stable injection rate for “cold” and “hot” flows is obtained.

[Figure 6: four panels plotting Latency (Cycles) against simulation time, from 0 to 3e+06.]

Figure 6. Network latency for a 4-ary 3-fly. (a) and (b) without MVCM, (c) and (d) with MVCM.

[Figure 7: four panels plotting Latency (Cycles) against Simulation Time (Cycles), from 0 to 6e+06.]

Figure 7. Congestion root link in stage 4 for a 4-ary 4-fly. (a) and (b) without MVCM, (c) and (d) with MVCM.

Figure 8 shows how MVCM reacts against congestion when the congestion root link is located at stage 3. Notice that in this case, the number of hosts that can contribute to the hot-spot¹ is smaller than when the congestion root link is located in the last stage. The figures show again that the proposed congestion management mechanism works properly, regardless of how strong the hot-spot is and where the congestion is located.
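The difference in the number of contributing hosts can be illustrated with a short calculation. The sketch below is not taken from the evaluation itself: it assumes a unidirectional k-ary n-fly butterfly with stages numbered from 1 at the source side, in which an output link of a switch at stage s can be reached from k^s source hosts. The function name `hosts_reaching_link` is hypothetical.

```python
def hosts_reaching_link(radix: int, stage: int) -> int:
    """Source hosts whose packets can cross one output link of a switch
    at `stage` (1 = first stage) in a k-ary n-fly butterfly.

    Assumption: unidirectional butterfly with a unique path per
    source-destination pair, so the fan-in grows by `radix` per stage.
    """
    if radix < 2 or stage < 1:
        raise ValueError("radix must be >= 2 and stage >= 1")
    return radix ** stage


RADIX, STAGES = 4, 4              # the 4-ary 4-fly network discussed above
total_hosts = RADIX ** STAGES     # 256 hosts

for stage in (4, 3):
    reach = hosts_reaching_link(RADIX, stage)
    print(f"stage {stage}: {reach} of {total_hosts} hosts can reach the link")
```

Under these assumptions, a congestion root link in the last stage can receive traffic from all 256 hosts, whereas one at stage 3 can receive traffic from only 64 of them.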

To sum up, a significant reduction in latency is observed for “cold flows” in all cases when MVCM is applied. On the other hand, latency since generation for “hot flows” remains the same or increases slightly because their packets are stopped at the origin hosts (they are prevented from being injected into the network). Therefore, although a slight increase in latency may appear, the time required to absorb the hot-spot is reduced. Indeed, regardless of where the congestion root is located and how strong the hot-spot is, the MVCM mechanism can effectively react against congestion, evenly distributing bandwidth among the flows that demand it.

¹ Hosts that do not contribute to the hot-spot inject uniform traffic.

[Figure 8: four panels plotting Latency (Cycles) against simulation time, from 0 to 6e+06.]

Figure 8. Congestion root link in stage 3 for a 4-ary 4-fly. (a) and (b) without MVCM, (c) and (d) with MVCM.

5. Conclusions

In this paper, we have proposed a packet marking/validation mechanism together with an end-to-end flow-control strategy for congestion management in MINs. This strategy (MVCM) uses ACK packets to report which flows are causing congestion, so that corrective actions can be applied directly to them. Basically, the mechanism consists of an early detection of the congestion produced by “hot flows”, a packet marking at the switch input buffers and the corresponding validation at the output buffers of the switch where congestion is detected, a notification to the origin hosts responsible for the congestion by means of ACK packets, and an injection rate reduction in response to those ACK packets. Injection rate reduction has two phases, depending on the congestion degree. First, the size of the sliding window is reduced. Second, if the first phase is not enough to reduce congestion, waiting periods are inserted between the packets injected at the source host.
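The marking and validation steps summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the switch implementation: the occupancy threshold, the rule that only already-marked packets are validated, and all names (`Packet`, `mark_at_input`, `validate_at_output`, `ack_flags`, `is_hot`) are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Packet:
    flow: str
    mb: bool = False  # Marking Bit, set at a congested input buffer
    vb: bool = False  # Validation Bit, set at a congested output buffer


def mark_at_input(pkt: Packet, input_occupancy: int, threshold: int) -> None:
    """Set MB when the input buffer holding the packet reaches the threshold."""
    if input_occupancy >= threshold:
        pkt.mb = True


def validate_at_output(pkt: Packet, output_occupancy: int, threshold: int) -> None:
    """Set VB only for marked packets that cross a congested output buffer."""
    if pkt.mb and output_occupancy >= threshold:
        pkt.vb = True


def ack_flags(pkt: Packet) -> tuple[bool, bool]:
    """Flags the destination would copy into the ACK packet."""
    return pkt.mb, pkt.vb


def is_hot(mb: bool, vb: bool) -> bool:
    """The source treats a flow as hot only when both bits are set."""
    return mb and vb


THRESHOLD = 6  # hypothetical occupancy threshold, in packets

hot = Packet("hot")
cold = Packet("cold")

# Both packets wait in the same congested input buffer, so both are marked.
for pkt in (hot, cold):
    mark_at_input(pkt, input_occupancy=8, threshold=THRESHOLD)

# Only the hot packet leaves through the congested output buffer.
validate_at_output(hot, output_occupancy=8, threshold=THRESHOLD)
validate_at_output(cold, output_occupancy=1, threshold=THRESHOLD)

print(is_hot(*ack_flags(hot)), is_hot(*ack_flags(cold)))  # True False
```

In this sketch the cold packet is marked because it suffers HOL blocking behind the hot one, but it is never validated, so its source host takes no corrective action.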

By using this mechanism, packets belonging to “hot flows” wait at their source hosts instead of remaining blocked in the network, which avoids HOL blocking and therefore does not penalize the “cold flows”.
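The two-phase reaction at the source host can be sketched as a per-flow state update driven by the ACK flags. This is only an illustration under stated assumptions: the step sizes, the window limits, and the recovery policy applied when an ACK carries no congestion notification are hypothetical, and `wait_step` is a placeholder rather than the radix-based expression used in the evaluation.

```python
from dataclasses import dataclass


@dataclass
class FlowState:
    window: int           # sliding-window size, in outstanding packets
    wait_cycles: int = 0  # waiting period inserted between injections


def on_ack(state: FlowState, mb: bool, vb: bool, *,
           min_window: int = 1, max_window: int = 8,
           wait_step: int = 100) -> FlowState:
    """Update the injection limits of one flow when an ACK arrives."""
    if mb and vb:
        if state.window > min_window:
            # Phase 1: shrink the sliding window.
            state.window -= 1
        else:
            # Phase 2: the window cannot shrink further, so insert
            # (or lengthen) the waiting period between packets.
            state.wait_cycles += wait_step
    else:
        # Assumed recovery policy: undo phase 2 first, then phase 1.
        if state.wait_cycles > 0:
            state.wait_cycles = max(0, state.wait_cycles - wait_step)
        elif state.window < max_window:
            state.window += 1
    return state


flow = FlowState(window=4)
for _ in range(4):                      # four ACKs with MB and VB set
    on_ack(flow, mb=True, vb=True)
print(flow.window, flow.wait_cycles)    # 1 100

on_ack(flow, mb=True, vb=False)         # marked but not validated: not hot
print(flow.window, flow.wait_cycles)    # 1 0
```

Packets held back by the reduced window or the waiting period stay queued at the source host, which is what keeps them from occupying switch buffers.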

Performance evaluation shows that the MVCM mechanism is able to effectively manage congestion in MINs, regardless of the switch radix value, the number of network stages, the location of the congestion root link, and the congestion degree.

References

[1] M. Alonso, S. Coll, J. Martinez, V. Santonja, P. Lopez, and J. Duato. “Dynamic Power Saving in Fat-Tree Interconnection Networks Using On/Off Links”. in Proc. on Int. Symp. on HP-PAC and IPDPS, 2006.

[2] T. Anderson, S. Owicki, J. Saxe, and C. Thacker. “High-Speed Switch Scheduling for Local-Area Networks”. ACM Trans. on Computer, 1993.

[3] W. Dally, P. Carvey, and L. Dennison. “The Avici Terabit Switch/Router”. in Proc. on Hot Interconnects, 1998.

[4] J. Duato, I. Johnson, J. Flich, F. Naven, P. Garcia, and T. Nachiondo. “A New Scalable and Cost-Effective Congestion Management Strategy for Lossless Multistage Interconnection Networks”. in Proc. on Int. Sym. on HPCA, 2005.

[5] Http://www.infinibandta.org.

[6] V. Jacobson. “Congestion Avoidance and Control”. in Proc. ACM SIGCOMM, 1988.

[7] M. Katevenis, D. Serpanos, and E. Spyridakis. “Credit-Flow Controlled ATM for MP Interconnection: the ATLAS I Single-Chip ATM Switch”. in Proc. on Int. Symp. on HPCA, 1998.

[8] V. Krishnan and D. Mayhew. “A Localized Congestion Control Mechanism for PCI Express Advanced Switching Fabrics”. in Proc. IEEE Symp. on Hot Interconnects, 2004.

[9] G. Pfister and V. Norton. “Hot Spot Contention and Combining in Multistage Interconnection Networks”. IEEE Trans. on Computers, 1985.

[10] G. Pfister et al. “Solving Hot Spot Contention Using Infiniband Architecture Congestion Control”. on HPI-DC, 2005.

[11] J. Renato, Y. Turner, and G. Janakiraman. “Evaluation of Congestion Detection Mechanism for Infiniband Switches”. on IEEE GLOBECOM, 2002.

[12] J. Renato, Y. Turner, and G. Janakiraman. “End-to-End Congestion Control for Infiniband”. on IEEE INFOCOM, 2003.

[13] L. Shang, L. Peh, and N. Jha. “Dynamic Voltage Scaling with Links for Power Optimization of Interconnection Networks”. in Proc. on Int. Sym. on HPCA, 2003.

[14] A. Smai and L. Thorelli. “Global Reactive Congestion Control in Multicomputer Networks”. Proc. on Int. Conf. on HPC, 1998.

[15] M. Thottetodi, A. Lebeck, and S. Mukherjee. “Self-Tuned Congestion Control for Multiprocessor Networks”. in Proc. on Int. Symp. on HPCA, 2001.

[16] W. Vogels et al. “Tree-Saturation Control in the AC3 Velocity Cluster Interconnect”. in Proc. Conference on Hot Interconnects, 2000.





































