Solving the TCP-incast Problem with Application-Level Scheduling

Copyright © 2005 Department of Computer Science


Maxim Podlesny, University of Waterloo
Carey Williamson, University of Calgary



Motivation


• Emerging IT paradigms
– Data centers, grid computing, HPC, multi-core
– Cluster-based storage systems, SAN, NAS
– Large-scale data management “in the cloud”
– Data manipulation via “services-oriented computing”

• Cost and efficiency advantages from IT trends, economy of scale, specialization marketplace

• Performance advantages from parallelism
– Partition/aggregation, MapReduce, BigTable, Hadoop
– Think RAID at Internet scale! (1000x)



Problem Statement

• High-speed, low-latency network (RTT ≤ 0.1 ms)
• Highly-multiplexed link (e.g., 1000 flows)
• Highly-synchronized flows on bottleneck link
• Limited switch buffer size (e.g., 32 KB)

How to provide high goodput for data center applications?

TCP retransmission timeouts

TCP throughput degradation


Related Work

• E. Krevat et al., “On Application-based Approaches to Avoiding TCP Throughput Collapse in Cluster-based Storage Systems”, Proceedings of SuperComputing 2007

• A. Phanishayee et al., “Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems”, Proceedings of FAST 2008

• Y. Chen et al., “Understanding TCP Incast Throughput Collapse in Datacenter Networks”, WREN 2009

• V. Vasudevan et al., “Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication”, Proceedings of ACM SIGCOMM 2009

• M. Alizadeh et al., “Data Center TCP”, Proc. ACM SIGCOMM 2010

• A. Shpiner et al., “A Switch-based Approach to Throughput Collapse and Starvation in Data Centers”, IWQoS 2010



Summary of Related Work

• Data centers have specific network characteristics

• TCP-incast throughput collapse problem emerges

• Possible solutions:

– Tweak TCP timers and/or parameters for this environment

– Redesign (or replace!) TCP in this environment

– Rewrite applications for this environment (Facebook)

– Increase switch buffer sizes (extra queueing delay!)

– Smart edge coordination for uploads/downloads



Data Center System Model

[Figure: N servers (1, 2, 3, ..., N), a switch, and a client]

• Logical data block (S) (e.g., 1 MB)
• Server Request Unit (SRU) (e.g., 32 KB)
• Packet size S_DATA
• Small buffer B
• Link capacity C



Performance Comparisons

Internet vs. data center network:
• Internet propagation delay: 10-100 ms
• Data center propagation delay: 0.1 ms
• Packet size 1 KB, link capacity 1 Gbps -> packet transmission time is about 0.01 ms
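The transmission-time figure above can be checked with a few lines of arithmetic. This is only a sketch: it assumes 1 KB means 1000 bytes and 1 Gbps means 10^9 bits per second, and the function name is illustrative rather than taken from the slides.

```python
def transmission_time_ms(packet_bytes: int, link_bps: float) -> float:
    """Time to serialize one packet onto the link, in milliseconds."""
    return packet_bytes * 8 / link_bps * 1000.0


# Assumed units: 1 KB = 1000 bytes, 1 Gbps = 1e9 bits per second.
tx_ms = transmission_time_ms(1000, 1e9)
print(f"packet transmission time: {tx_ms:.3f} ms")  # 0.008 ms, roughly 0.01 ms

# Propagation delays quoted on the slide, expressed in packet transmission times.
delays_ms = {"Internet (low end)": 10.0, "Internet (high end)": 100.0, "data center": 0.1}
for label, prop_ms in delays_ms.items():
    print(f"{label}: {prop_ms / tx_ms:.1f} packet times of propagation delay")
```

The ratio is the point of the comparison: in a data center the propagation delay is only on the order of ten packet transmission times, so very few packets are in flight on the wire and most outstanding packets must sit in the switch buffer.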



Analysis Overview (1 of 2)

• Determine maximum TCP flow concurrency (n) that can be supported without any packet loss

• Arrange the servers into k groups of (at most) n servers each, by staggering the group scheduling



Analysis Overview (2 of 2)

• Determine maximum TCP flow concurrency (n) that can be supported without any packet loss

– Determine flow size in packets (based on SRU and MSS)

– Determine maximum outstanding packets per flow (Wmax)

– Determine max flow concurrency (based on B and Wmax)

• Arrange the servers into k groups of (at most) n servers each, by staggering the group scheduling
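The analysis steps can be sketched in code. The slides do not give the exact expression for n, so this sketch assumes a group is loss-free when the worst-case outstanding packets of all its flows (n × Wmax) fit in the switch buffer B. That rule, the function names, and the 1460-byte MSS in the example are assumptions for illustration, not the authors' formulas.

```python
import math


def flow_size_packets(sru_bytes: int, mss_bytes: int) -> int:
    """Number of packets needed to carry one SRU."""
    return math.ceil(sru_bytes / mss_bytes)


def plan_groups(num_servers: int, buffer_bytes: int, packet_bytes: int, w_max: float):
    """Return (n, k): servers per group and number of groups.

    Assumption (not from the slides): n concurrent flows are loss-free
    when n * w_max packets fit in the switch buffer.
    """
    buffer_packets = buffer_bytes // packet_bytes
    n = max(1, int(buffer_packets // w_max))  # at least one server per group
    n = min(n, num_servers)                   # never more servers than exist
    k = math.ceil(num_servers / n)            # groups of at most n servers
    return n, k


# Example values: 32 KB SRU and buffer from the slides, hypothetical MSS and Wmax.
packets = flow_size_packets(32 * 1024, 1460)
n, k = plan_groups(num_servers=10, buffer_bytes=32 * 1024, packet_bytes=1024, w_max=11)
print(f"flow size: {packets} packets, n = {n} servers per group, k = {k} groups")
```

Wmax is passed in as a parameter here, which keeps the grouping step separate from the slow-start analysis that produces it.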



Determining Wmax

• Recall TCP slow start dynamics:

– Initial TCP congestion window (cwnd) is 1 packet

– Acks cause cwnd to double every RTT (1, 2, 4, 8, 16…)

• Consider TCP transfer of an arbitrary SRU (e.g., 21 packets)

• Determine peak power-of-2 cwnd value (WA)

• Determine “residual window” for the last RTT (WB)

• Wmax depends on both WA and WB (e.g., WA + WB/2)
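A short sketch of this calculation, assuming the example SRU of 21 is measured in packets and using the slide's example expression WA + WB/2 for Wmax. The function names are illustrative, and idealized slow start is assumed (initial window of 1 packet, exact doubling every RTT, no losses).

```python
def slow_start_windows(flow_packets: int):
    """Return (w_a, w_b) for a flow sent entirely in idealized slow start.

    w_a: largest power-of-2 congestion window that is sent in full.
    w_b: residual packets left for the final, partially filled RTT.
    """
    if flow_packets < 1:
        raise ValueError("flow must contain at least one packet")
    cwnd, sent, w_a = 1, 0, 0
    while sent + cwnd <= flow_packets:
        sent += cwnd   # a full window goes out in this RTT
        w_a = cwnd
        cwnd *= 2      # acks double the window for the next RTT
    return w_a, flow_packets - sent


def w_max(flow_packets: int) -> float:
    """Peak outstanding packets, using the slide's example form WA + WB/2."""
    w_a, w_b = slow_start_windows(flow_packets)
    return w_a + w_b / 2


w_a, w_b = slow_start_windows(21)
print(w_a, w_b, w_max(21))
```

For 21 packets the full windows are 1, 2, 4 and 8 (15 packets in total), so WA = 8 and the residual window is WB = 6, giving Wmax = 8 + 6/2 = 11.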



Scheduling Overview

[Figure: the N servers arranged into groups of n servers each]



Scheduling Details

Lossless scheduling of server responses: at most n servers respond simultaneously, and the k groups of responding servers are scheduled in a staggered fashion.

Server i (1 <= i <= N) starts responding at:
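The start-time expression did not survive in this copy of the slides. As an illustration only, the sketch below assumes that servers are assigned to groups in index order and that each group starts one scheduling interval after the previous one, where the interval is the SRU completion time used for scheduling. This rule and all names are assumptions, not the authors' formula.

```python
def start_time(i: int, n: int, t_sched: float) -> float:
    """Assumed start time of server i (1-based) with groups of at most n servers.

    Hypothetical rule: servers 1..n form group 0, servers n+1..2n form
    group 1, and so on; group j starts at j * t_sched.
    """
    if i < 1 or n < 1:
        raise ValueError("i and n must be positive")
    group = (i - 1) // n
    return group * t_sched


# Example: N = 7 servers, n = 3 per group, scheduling interval of 1.0 time unit.
schedule = [start_time(i, n=3, t_sched=1.0) for i in range(1, 8)]
print(schedule)  # [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 2.0]
```

With this rule the last of k groups starts after k - 1 scheduling intervals.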



Theoretical Results

Maximum goodput of an application in a data center with lossless scheduling is:

$$g = \frac{S}{(k-1)\tilde{T} + T + d_{max}}$$

where:
• $S$ - size of a logical data block
• $T$ - actual completion time of an SRU
• $\tilde{T}$ - SRU completion time used for scheduling
• $k$ - how many groups of servers to use
• $d_{max}$ - real system scheduling variance
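To make the goodput expression concrete, the sketch below evaluates g = S / ((k - 1)·T̃ + T + dmax) for one set of inputs. Only the 1 MB block size comes from the slides; the timing values and the function name are hypothetical and chosen just to show the calculation.

```python
def max_goodput(s_bytes: float, t_actual: float, t_sched: float, k: int, d_max: float) -> float:
    """Goodput in bytes per second under lossless scheduling.

    s_bytes:  size of the logical data block S
    t_actual: actual completion time T of an SRU, in seconds
    t_sched:  SRU completion time used for scheduling, in seconds
    k:        number of server groups
    d_max:    scheduling variance of the real system, in seconds
    """
    return s_bytes / ((k - 1) * t_sched + t_actual + d_max)


# Hypothetical inputs: 1 MB block, 4 groups, 2.5 ms per SRU, 0.5 ms variance.
g = max_goodput(s_bytes=1e6, t_actual=2.5e-3, t_sched=2.5e-3, k=4, d_max=0.5e-3)
print(f"goodput: {g * 8 / 1e6:.0f} Mbps")
```

Increasing k adds a full scheduling interval to the denominator for every extra group, which lowers the goodput for a fixed block size.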



Solution Analytical Model Results



Results for 10 KB Fixed SRU Size (1 of 2)



Results for 10 KB Fixed SRU Size (2 of 2)



Results for Varied SRU Size (1 MB / N)



Effect of TCP Timer Granularity



Summary and Conclusion

Application-level scheduling for TCP-incast throughput collapse

Main idea: scheduling responses of servers so that there are no losses

Maximum goodput with lossless scheduling

Non-monotonic goodput, highly-sensitive to network configuration parameters



Future Work

Implementing and testing our solution in real data centers

Evaluating our solution for different application traffic scenarios
