DARD: Distributed Adaptive Routing for Datacenter Networks
Xin Wu, Xiaowei Yang
Dept. of Computer Science, Duke University
Duke-CS-TR-2011-01
ABSTRACT
Datacenter networks typically have many paths connecting
each host pair to achieve high bisection bandwidth for arbi-
trary communication patterns. Fully utilizing the bisection
bandwidth may require flows between the same source and
destination pair to take different paths to avoid hot spots.
However, the existing routing protocols have little support
for load-sensitive adaptive routing. We propose DARD, a
Distributed Adaptive Routing architecture for Datacenter net-
works. DARD allows each end host to move traffic from
overloaded paths to underloaded paths without central co-
ordination. We use an OpenFlow implementation and sim-
ulations to show that DARD can effectively use a datacen-
ter network’s bisection bandwidth under both static and dy-
namic traffic patterns. It outperforms previous solutions based
on random path selection by 10%, and performs similarly to
previous work that assigns flows to paths using a centralized
controller. We use competitive game theory to show that
DARD’s path selection algorithm makes progress in every
step and converges to a Nash equilibrium in finite steps. Our
evaluation results suggest that DARD can achieve a close-
to-optimal solution in practice.
1. INTRODUCTION
Datacenter network applications, e.g., MapReduce and net-
work storage, often demand high intra-cluster bandwidth [11]
to transfer data among distributed components. This is be-
cause the components of an application cannot always be
placed on machines close to each other (e.g., within a rack)
for two main reasons. First, applications may share com-
mon services provided by the datacenter network, e.g., DNS,
search, and storage. These services are not necessarily placed
in nearby machines. Second, the auto-scaling feature offered
by a datacenter network [1, 5] allows an application to cre-
ate dynamic instances when its workload increases. Where
those instances will be placed depends on machine avail-
ability, and is not guaranteed to be close to the application’s
other instances.
Therefore, it is important for a datacenter network to have
high bisection bandwidth to avoid hot spots between any pair
of hosts. To achieve this goal, today’s datacenter networks
often use commodity Ethernet switches to form multi-rooted
tree topologies [21] (e.g., fat-tree [10] or Clos topology [16])
that have multiple equal-cost paths connecting any host pair.
A flow (which in this paper refers to a TCP connection) can
use an alternative path if its current path is overloaded.
However, legacy transport protocols such as TCP lack the
ability to dynamically select paths based on traffic load. To
overcome this limitation, researchers have advocated a va-
riety of dynamic path selection mechanisms to take advan-
tage of the multiple paths connecting any host pair. At a
high-level view, these mechanisms fall into two categories:
centralized dynamic path selection, and distributed traffic-
oblivious load balancing. A representative example of cen-
tralized path selection is Hedera [11], which uses a central
controller to compute an optimal flow-to-path assignment
based on dynamic traffic load. Equal-Cost-Multi-Path for-
warding (ECMP) [19] and VL2 [16] are examples of traffic-
oblivious load balancing. With ECMP, routers hash flows
based on flow identifiers to multiple equal-cost next hops.
VL2 [16] uses edge switches to forward a flow to a randomly
selected core switch to achieve valiant load balancing [23].
Each of these two design paradigms has merit and im-
proves the available bisection bandwidth between a host pair
in a datacenter network. Yet each has its limitations. A
centralized path selection approach introduces a potential
scaling bottleneck and a centralized point of failure. When
a datacenter scales to a large size, the control traffic sent
to and from the controller may congest the link that con-
nects the controller to the rest of the datacenter network. Dis-
tributed traffic-oblivious load balancing scales well to large
datacenter networks, but may create hot spots, because its flow
assignment algorithms do not consider dynamic traffic load
on each path.
In this paper, we aim to explore the design space that
uses end-to-end distributed load-sensitive path selection to
fully use a datacenter’s bisection bandwidth. This design
paradigm has a number of advantages. First, placing the
path selection logic at an end system rather than inside a
switch facilitates deployment, as it does not require special
hardware or replacing commodity switches. One can also
upgrade or extend the path selection logic later by applying
software patches rather than upgrading switching hardware.
Second, a distributed design can be more robust and scale
better than a centralized approach.
This paper presents DARD, a lightweight, distributed, end
system based path selection system for datacenter networks.
DARD’s design goal is to fully utilize bisection bandwidth
and dynamically balance the traffic among the multiple paths
between any host pair. A key design challenge DARD faces
is how to achieve dynamic distributed load balancing. Un-
like in a centralized approach, with DARD, no end system or
router has a global view of the network. Each end system can
only select a path based on its local knowledge, which makes
close-to-optimal load balancing a challenging problem.
To address this challenge, DARD uses a selfish path se-
lection algorithm that provably converges to a Nash equilib-
rium in finite steps (Appendix B). Our experimental evalua-
tion shows that the equilibrium’s gap to the optimal solution
is small. To facilitate path selection, DARD uses hierarchi-
cal addressing to represent an end-to-end path with a pair of
source and destination addresses, as in [27]. Thus, an end
system can switch paths by switching addresses.
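To make this concrete, the following is a minimal sketch (our illustration, not DARD's implementation; the function name, path names, and addresses are hypothetical): each candidate path to a peer is identified by one (source address, destination address) pair, so switching a flow's path reduces to choosing a different address pair, for example the one whose path currently reports the lowest load.

# Minimal sketch (not DARD's code): a path is identified by a
# (source address, destination address) pair, so an end host switches
# paths by switching the pair it uses for a flow.

def pick_address_pair(address_pairs, path_load):
    """Return the path name and (src, dst) pair with the lowest load.

    address_pairs: path name -> (src_ip, dst_ip); path_load: path name ->
    load estimate (e.g., obtained by querying switches). Names are hypothetical.
    """
    best = min(address_pairs, key=lambda p: path_load.get(p, 0.0))
    return best, address_pairs[best]

# Two equal-cost paths between the same host pair, one per core switch.
address_pairs = {
    "via_core1": ("10.1.1.2", "10.2.1.3"),
    "via_core2": ("10.5.1.2", "10.6.1.3"),
}
path_load = {"via_core1": 0.9, "via_core2": 0.3}
print(pick_address_pair(address_pairs, path_load))  # ('via_core2', ('10.5.1.2', '10.6.1.3'))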
We have implemented a DARD prototype on DeterLab [6]
and in the ns-2 simulator. We use static traffic patterns to show
that DARD converges to a stable state in two to three con-
trol intervals. We use dynamic traffic patterns to show that
DARD outperforms ECMP, VL2, and TeXCP, and that its per-
formance gap to centralized scheduling is small. Under
dynamic traffic patterns, DARD maintains stable link utiliza-
tion. About 90% of the flows change their paths fewer than four
times in their life cycles. Evaluation results also show that
the bandwidth taken by DARD’s control traffic is bounded
by the size of the topology.
DARD is a scalable and stable end host based approach
to load balancing datacenter traffic. We make every effort to
leverage existing infrastructures and to make DARD prac-
tically deployable. The rest of this paper is organized as
follows. Section 2 introduces background knowledge and
discusses related work. Section 3 describes DARD’s design
goals and system components. In Section 4, we introduce
the system implementation details. We evaluate DARD in
Section 5. Section 6 concludes our work.
2. BACKGROUND AND RELATED WORK
In this section, we first briefly introduce what a datacenter
network looks like and then discuss related work.
2.1 Datacenter Topologies
Recent proposals [10, 16, 24] suggest using multi-rooted
tree topologies to build datacenter networks. Figure 1 shows
a 3-stage multi-rooted tree topology. The topology has three
vertical layers: Top-of-Rack (ToR), aggregation, and core. A
pod is a management unit. It represents a replicable building
block consisting of a number of servers and switches sharing
the same power and management infrastructure.
An important design parameter of a datacenter network
is the oversubscription ratio at each layer of the hierarchy,
which is computed as a layer's downstream bandwidth di-
vided by its upstream bandwidth, as shown in Figure 1. The
oversubscription ratio is usually designed to be larger than one,
assuming that not all downstream devices will be active con-
currently.
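As a hypothetical illustration (the numbers are ours, not from the paper): an aggregation switch with 40 Gbps of total downstream bandwidth toward its ToR switches and 10 Gbps of upstream bandwidth toward the core has an oversubscription ratio of 40/10 = 4:1, so when all downstream hosts send upward concurrently, each can obtain at most one quarter of its access bandwidth.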
We design DARD to work for arbitrary multi-rooted tree
topologies. But for ease of exposition, we mostly use the
fat-tree topology to illustrate DARD's design, unless other-
wise noted. Therefore, we briefly describe what a fat-tree
topology is.
Figure 1: A multi-rooted tree topology for a datacenter network. The
aggregation layer's oversubscription ratio is defined as BW_down / BW_up.
Figure 2 shows a fat-tree topology example. A p-pod fat-
tree topology (in Figure 2, p = 4) has p pods in the horizontal
direction. It uses 5p^2/4 p-port switches and supports non-
blocking communication among p^3/4 end hosts. A pair of
end hosts in different pods have p^2/4 equal-cost paths con-
necting them. Once the two end hosts choose a core switch
as the intermediate node, the path between them is uniquely
determined.
Figure 2: A 4-pod fat-tree topology.
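As a quick check of these formulas, the following minimal sketch (our own, not from the paper) evaluates them; for the 4-pod topology of Figure 2 it gives 20 switches, 16 end hosts, and 4 equal-cost paths between hosts in different pods.

# Minimal sketch (ours): evaluate the p-pod fat-tree sizing formulas above.

def fat_tree_sizes(p):
    """Return (switches, end_hosts, inter_pod_paths) for a p-pod fat-tree."""
    assert p % 2 == 0, "p must be even"
    switches = 5 * p * p // 4        # (p/2)^2 core + p*(p/2) aggregation + p*(p/2) ToR
    end_hosts = p ** 3 // 4          # p pods, (p/2)^2 hosts per pod
    inter_pod_paths = p * p // 4     # one path per core switch
    return switches, end_hosts, inter_pod_paths

print(fat_tree_sizes(4))  # (20, 16, 4), matching Figure 2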
In this paper, we use the term elephant flow to refer to
a long-lived TCP connection that has transferred more bytes
than a given threshold. We discuss how
to choose this threshold in § 3.
2.2 Related work
Related work falls into three broad categories: adaptive
path selection mechanisms, end host based multipath trans-
port, and centralized flow scheduling. Adaptive path selection proto-
cols such as TeXCP [20] were originally designed to balance
traffic in an ISP network, but can be adapted to datacenter
networks. However, because these protocols are not end-to-
end solutions, they forward traffic along different paths at
the granularity of a packet rather than a TCP flow. There-
fore, they can cause TCP packet reordering, harming a TCP
flow's performance. In addition, unlike DARD, they
also place the path selection logic at switches and therefore
require upgrading switches.
3. DARD DESIGN
In this section, we describe DARD’s design. We first high-
light the system design goals. Then we present an overview
of the system. We present more design details in the follow-
ing sub-sections.
3.1 Design Goals
DARD’s essential goal is to effectively utilize a datacen-
ter’s bisection bandwidth with practically deployable mech-
anisms and limited control overhead. We elaborate on the de-
sign goals in more detail below.
1. Efficiently utilizing the bisection bandwidth. Given the
large bisection bandwidth in datacenter networks, we aim to
take advantage of the multiple paths connecting each host
pair and fully utilize the available bandwidth. Meanwhile, we
aim to avoid design choices that may cause
packet reordering and decrease the system goodput.
2. Fairness among elephant flows. We aim to provide fair-
ness among elephant flows so that concurrent elephant flows
can evenly share the available bisection bandwidth. We fo-
cus our work on elephant flows for two reasons. First, ex-
isting work shows ECMP and VL2 already perform well on
scheduling a large number of short flows [16]. Second, ele-
phant flows occupy a significant fraction of the total band-
width (more than 90% of bytes are in the 1% of the largest
flows [16]).
3. Lightweight and scalable. We aim to design a lightweight
and scalable system. We desire to avoid a centralized scaling
bottleneck and minimize the amount of control traffic and
computation needed to fully utilize bisection bandwidth.
4. Practically deployable. We aim to make DARD com-
patible with existing datacenter infrastructures so that it can
be deployed without significant modifications or upgrade of
existing infrastructures.
3.2 Overview
In this section, we present an overview of the DARD de-
sign. DARD uses three key mechanisms to meet the above
system design goals. First, it uses a lightweight distributed
end-system-based path selection algorithm to move flows
from overloaded paths to underloaded paths to improve effi-
ciency and prevent hot spots (§ 3.5). Second, it uses hierar-
chical addressing to facilitate efficient path selection (§ 3.3).
Each end system can use a pair of source and destination ad-
dresses to represent an end-to-end path, and vary paths by
varying addresses. Third, DARD places the path selection
logic at an end system to facilitate practical deployment, as
a datacenter network can upgrade its end systems by apply-
ing software patches. It only requires that switches support
the OpenFlow protocol, and such switches are commercially
available.
Figure 3 shows DARD’s system components and how it
works. Since we choose to place the path selection logic
at an end system, a switch in DARD has only two func-
tions: (1) it forwards packets to the next hop according to a
pre-configured routing table; (2) it keeps track of the Switch
State (SS, defined in § 3.4) and replies to end systems' Switch
State Requests (SSRs). Our design implements this function
using the OpenFlow protocol.
An end system has three DARD components as shown in
Figure 3: Elephant Flow Detector, Path State Monitor, and
Path Selector.
Figure 3: DARD's system overview. There are multiple paths con-
necting each source and destination pair. DARD is a distributed system
running on every end host. It has three components. The Elephant
Flow Detector detects elephant flows. The Path State Monitor monitors
traffic load on each path by periodically querying the switches. The
Path Selector moves flows from overloaded paths to underloaded paths.
The Elephant Flow Detector monitors all the
output flows and treats one flow as an elephant once its size
grows beyond a threshold. We use 100KB as the threshold
in our implementation because, according to a re-
cent study, more than 85% of flows in a datacenter are smaller
than 100 KB [16]. The Path State Monitor sends SSRs to the
switches on all the paths and assembles the SS replies into the
Path State (PS, as defined in § 3.4). The path state indicates the
load on each path. Based on both the path state and the de-
tected elephant flows, the Path Selector periodically moves
flows from overloaded paths to underloaded paths.
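As an illustration of the Elephant Flow Detector's role, here is a minimal sketch (our simplification, not DARD's implementation; the class and flow identifiers are hypothetical): it counts bytes per outgoing flow and flags a flow as an elephant once it exceeds the 100 KB threshold.

# Minimal sketch (ours, not DARD's code): count bytes per outgoing flow
# and flag flows that exceed the 100 KB elephant threshold.

from collections import defaultdict

ELEPHANT_THRESHOLD = 100 * 1024  # bytes

class ElephantFlowDetector:
    def __init__(self, threshold=ELEPHANT_THRESHOLD):
        self.threshold = threshold
        self.bytes_sent = defaultdict(int)  # flow id -> bytes observed so far
        self.elephants = set()

    def record(self, flow_id, nbytes):
        """Account nbytes to flow_id; return True if the flow is an elephant."""
        self.bytes_sent[flow_id] += nbytes
        if self.bytes_sent[flow_id] > self.threshold:
            self.elephants.add(flow_id)
        return flow_id in self.elephants

# Example: a flow (identified here by its 5-tuple) becomes an elephant
# after transferring more than 100 KB.
detector = ElephantFlowDetector()
flow = ("10.1.1.2", 34567, "10.2.1.3", 5001, "tcp")
for _ in range(3):
    detector.record(flow, 40 * 1024)
assert flow in detector.elephants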
The rest of this section presents more design details of
DARD, including how to use hierarchical addressing to se-
lect paths at an end system (§ 3.3), how to actively monitor
all paths’ state in a scalable fashion (§ 3.4), and how to as-
sign flows from overloaded paths to underloaded paths to
improve efficiency and prevent hot spots (§ 3.5).
3.3 Addressing and Routing
To fully utilize the bisection bandwidth and, at the same
time, to prevent retransmissions caused by packet reorder-
ing (Goal 1), we allow a flow to take different paths in its
life cycle to reach the destination. However, one flow can
use only one path at any given time. Since we are exploring
the design space of putting as much control logic as possi-
ble at the end hosts, we decided to leverage the datacenter's
hierarchical structure to enable an end host to actively select
paths for a flow.
A datacenter network is usually constructed as a multi-
rooted tree. Take Figure 4 as an example: all the switches
and end hosts highlighted by the solid circles form a tree
with its root core1. Three other similar trees exist in the
same topology. This strictly hierarchical structure facili-
tates adaptive routing through some customized addressing
rules [10]. We borrow the idea from NIRA [27] to split
an end-to-end path into uphill and downhill segments and
encode a path in the source and destination addresses. In
DARD, each of the core switches obtains a unique prefix
and then allocates nonoverlapping subdivisions of the prefix
to each of its sub-trees. The sub-trees will recursively allo-
cate nonoverlapping subdivisions of their prefixes to lower
hierarchies. By this hierarchical prefix allocation, each net-
work device receives multiple IP addresses, each of which
represents the device’s position in one of the trees.
As shown in Figure 4, we use core_i to refer to the i-th core
switch and aggr_ij to refer to the j-th aggregation switch in the
i-th pod. We follow the same rule to interpret ToR_ij for the
top-of-rack switches and E_ij for the end hosts. We use
the device names prefixed with letter P and delimited by
colons to illustrate how prefixes are allocated along the hi-
erarchies. The first core is allocated with prefix Pcore1. It
then allocates nonoverlapping prefixes Pcore1.Paggr11 and
Pcore1.Paggr21 to two of its sub-trees. The sub-tree rooted
at aggr11 will further allocate four prefixes to lower hier-
archies.
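The following sketch (our illustration; the routine and tree layout are hypothetical, not DARD's code) shows how such nonoverlapping prefixes can be handed down a core-rooted tree recursively, so that every device accumulates one hierarchical address per tree it belongs to.

# Illustrative sketch (not DARD's code): recursively hand nonoverlapping
# prefixes down a core-rooted tree; each device collects one address per tree.

def allocate_prefixes(node, prefix, addresses):
    """Assign prefix to node, then extend it for each of node's children."""
    addresses.setdefault(node["name"], []).append(prefix)
    for child in node.get("children", []):
        allocate_prefixes(child, prefix + ".P" + child["name"], addresses)

# A fragment of the tree rooted at core1 in Figure 4 (hypothetical layout).
core1_tree = {
    "name": "core1",
    "children": [
        {"name": "aggr11",
         "children": [{"name": "ToR11", "children": [{"name": "E11"}]}]},
        {"name": "aggr21",
         "children": [{"name": "ToR21", "children": [{"name": "E21"}]}]},
    ],
}

addresses = {}
allocate_prefixes(core1_tree, "Pcore1", addresses)
# addresses["E11"] == ["Pcore1.Paggr11.PToR11.PE11"]; running the same
# routine for the other core-rooted trees appends one more address per tree.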
For a general multi-rooted tree topology, the datacenter
operators can generate a similar address assignment schema
and allocate the prefixes along the topology hierarchies. In
case more IP addresses than network cards are assigned to
each end host, we propose to use IP aliasing to configure mul-
tiple IP addresses on one network interface. Modern oper-
ating systems support a large number of IP aliases on
one network interface, e.g., Linux kernel 2.6 sets the
limit to 256K IP aliases per interface [3]. Windows NT 4.0 has no limitation on the number of IP addresses that can be
bound to a network interface [4].
One nice property of this hierarchical addressing is that
one host address uniquely encodes the sequence of upper-
level switches that allocate that address, e.g., in Figure 4,