Sunday, September 27, 2009
Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication
V. Vasudevan, A. Phanishayee, H. Shah, E. Krevat, D. G. Andersen, G. R. Ganger, G. A. Gibson, B. Mueller, "Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication," ACM SIGCOMM Conference, (August 2009).
One line summary: This paper suggests reducing or eliminating the minimum RTO and enabling fine-grained RTT and RTO measurement and calculation on the order microseconds as a way of eliminating the problem of TCP incast collapse in data centers.
Summary
This paper presents a solution to the TCP incast problem. TCP incast collapse results in severely reduced throughput when multiple senders try to send to a single receiver. The preconditions for TCP incast collapse are: (1) the network must have high bandwidth and low latency with switches that have small buffers, (2) the workload must be a highly parallel barrier synchronization request workload, (3) the servers must return a small amount of data per request (high fan in). An example scenario in which TCP incast collapse occurs is one in which a client sends a request for data that has been partitioned across many servers. The servers all try to respond with their data and overload the client’s buffers. Packets are lost and some servers experience a TCP retransmission timeout (RTO). But the default minimum retransmission timeout is 200ms, so the client must wait at least this long to receive the extra data, even though the client’s link may be idle for most of this waiting time. This results in decreased throughput and link utilization. The authors postulate that in order to prevent TCP incast collapse, the RTO must operate on granularity closer to the RTT of the underlying network, which in datacenters is hundreds of microseconds or less.
The authors demonstrate via a series of simulations and experiments in real clusters that allowing for a minimum RTO on the order of microseconds or eliminating the minimum RTO altogether improves throughput and helps avoid TCP incast collapse. They also experiment with desynchronizing retransmissions by adding a randomizing component. They advocate such desychronization in data centers but note that it is likely unnecessary in the wide-area since different flows have different RTTs and thus different RTOs. The authors explain that it is not enough to lower the minimum RTO but that it is also necessary to enable measurement and computation of the RTT and RTO on a fine-grained level as well. They describe their implementation of fine-grained RTT and RTO computation using Linux high-resolution timers. This involved making modifications to the kernel and the TCP stack.
The authors then discuss whether eliminating the minimum RTO and allowing microsecond retransmissions is appropriate for the wide-area or if such techniques should be limited to the data center. They note two potential conflicts: spurious retransmissions when the network RTT suddenly spikes, and delayed ACKs acting as an additional timeout mechanism in certain situations because the delayed ACK timer is set higher than the RTO. They argue that problems arising from these conflicts don’t occur often and are mitigated by newer TCP features. They perform some wide-area experiments to test this claim and find that spurious retransmissions aren’t an issue and while better performance can be achieved by disabling delayed ACKS, leaving them enabled and with a high timer only slightly harms performance.
Critique
I liked this paper and I think it should remain on the syllabus. I don’t have many criticisms, although I did find it strange that in their wide-area experiments, two different servers experienced almost the exact same distribution of flows. This suggests that something funny might be going on with those experiments. I thought the proposed solution in this paper was good, it would be perfect if enabling fine-grained RTT and RTO measurements and calculation didn’t require special kernel hacking. Perhaps in the future the OS will automatically provide such capabilities. In the meantime, I thought their suggestion of setting the minimum RTO to the lowest value possible was at least a nice practical idea. I also liked how they did both simulations and experiments on a real cluster.
Subscribe to:
Post Comments (Atom)
It is certainly a simple solution, but when we apply it to our implosion workload based on map-reduce traffic, it does not work very well in obtaining high throughput.
ReplyDelete