
Sunday, September 27, 2009

Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication


V. Vasudevan, A. Phanishayee, H. Shah, E. Krevat, D. G. Andersen, G. R. Ganger, G. A. Gibson, B. Mueller, "Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication," ACM SIGCOMM Conference, (August 2009).

One line summary: This paper suggests reducing or eliminating the minimum RTO and enabling fine-grained RTT and RTO measurement and calculation on the order of microseconds as a way of eliminating TCP incast collapse in data centers.

Summary

This paper presents a solution to the TCP incast problem. TCP incast collapse results in severely reduced throughput when multiple senders try to send to a single receiver. The preconditions for TCP incast collapse are: (1) the network must have high bandwidth and low latency with switches that have small buffers, (2) the workload must be a highly parallel barrier-synchronization request workload, and (3) the servers must return a small amount of data per request (high fan-in). An example scenario in which TCP incast collapse occurs is one in which a client sends a request for data that has been partitioned across many servers. The servers all try to respond with their data and overload the client’s buffers. Packets are lost and some servers experience a TCP retransmission timeout (RTO). But the default minimum retransmission timeout is 200 ms, so the client must wait at least this long to receive the remaining data, even though the client’s link may be idle for most of this waiting time. This results in decreased throughput and link utilization. The authors postulate that in order to prevent TCP incast collapse, the RTO must operate on a granularity closer to the RTT of the underlying network, which in datacenters is hundreds of microseconds or less.
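To make the timer arithmetic concrete, here is a minimal sketch (not the paper’s implementation; function and variable names are my own) of the standard TCP RTO estimator from RFC 6298, showing why a 200 ms minimum-RTO floor swamps datacenter RTTs of hundreds of microseconds:

```python
def rto_estimator(rtt_samples, min_rto):
    """Feed a sequence of RTT samples (seconds) through the RFC 6298
    smoothed-RTT estimator and return the resulting RTO."""
    srtt = rttvar = None
    for r in rtt_samples:
        if srtt is None:
            # First sample initializes the estimator.
            srtt = r
            rttvar = r / 2
        else:
            # Standard EWMA updates (beta = 1/4, alpha = 1/8).
            rttvar = 0.75 * rttvar + 0.25 * abs(srtt - r)
            srtt = 0.875 * srtt + 0.125 * r
    # The minimum-RTO clamp is where the 200 ms floor bites.
    return max(min_rto, srtt + 4 * rttvar)

datacenter_rtts = [100e-6] * 10  # ~100 microsecond RTTs

# With the common 200 ms floor, the RTO is ~2000x the actual RTT.
print(rto_estimator(datacenter_rtts, min_rto=0.200))
# With no floor, the RTO tracks the real network RTT.
print(rto_estimator(datacenter_rtts, min_rto=0.0))
```

With the floor in place, a server that loses a response packet idles for 200 ms before retransmitting, even though the network could have delivered the retransmission in a fraction of a millisecond.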

The authors demonstrate via a series of simulations and experiments on real clusters that allowing a minimum RTO on the order of microseconds, or eliminating the minimum RTO altogether, improves throughput and helps avoid TCP incast collapse. They also experiment with desynchronizing retransmissions by adding a randomized component to the timeout. They advocate such desynchronization in data centers but note that it is likely unnecessary in the wide area, since different flows have different RTTs and thus different RTOs. The authors explain that it is not enough to lower the minimum RTO; it is also necessary to enable measurement and computation of the RTT and RTO at a fine-grained level. They describe their implementation of fine-grained RTT and RTO computation using Linux high-resolution timers, which required modifications to the kernel and the TCP stack.
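The desynchronization idea can be sketched as follows; the exact randomization the authors use may differ, so treat this as illustrative of the concept rather than their code:

```python
import random

def desynchronized_timeout(rto, spread=0.5):
    """Randomize a retransmission timeout into [rto, (1 + spread) * rto],
    so senders that lost packets in the same burst do not all retransmit
    (and collide again) at exactly the same instant."""
    return rto * (1 + random.uniform(0, spread))

# Ten senders that computed the same base RTO now fire at scattered times.
timeouts = [desynchronized_timeout(200e-6) for _ in range(10)]
```

The scatter matters most in a datacenter, where identical low RTTs would otherwise give every synchronized sender an identical RTO.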

The authors then discuss whether eliminating the minimum RTO and allowing microsecond retransmissions is appropriate for the wide area or whether such techniques should be limited to the data center. They note two potential conflicts: spurious retransmissions when the network RTT suddenly spikes, and delayed ACKs acting as an additional timeout mechanism in certain situations because the delayed ACK timer is set higher than the RTO. They argue that problems arising from these conflicts don’t occur often and are mitigated by newer TCP features. They perform some wide-area experiments to test this claim and find that spurious retransmissions aren’t an issue, and that while better performance can be achieved by disabling delayed ACKs, leaving them enabled with a high timer only slightly harms performance.

Critique

I liked this paper and I think it should remain on the syllabus. I don’t have many criticisms, although I did find it strange that in their wide-area experiments, two different servers experienced almost exactly the same distribution of flows. This suggests that something funny might be going on with those experiments. I thought the proposed solution in this paper was good; it would be perfect if enabling fine-grained RTT and RTO measurement and calculation didn’t require special kernel hacking. Perhaps in the future the OS will automatically provide such capabilities. In the meantime, I thought their suggestion of setting the minimum RTO to the lowest value possible was at least a nice practical idea. I also liked how they did both simulations and experiments on a real cluster.

Thursday, September 3, 2009

Understanding BGP Misconfiguration


R. Mahajan, D. Wetherall, T. Anderson, "Understanding BGP Misconfiguration," ACM SIGCOMM Conference, (August 2002).

One line summary: In this paper the authors identify and analyze BGP misconfiguration errors, measure their frequency, classify them into a number of types, examine their causes, and suggest a number of mechanisms for reducing these misconfigurations.

Summary

This paper provides a quantitative and systematic study of BGP misconfiguration. The authors classify misconfigurations into two main types: origin misconfiguration and export misconfiguration. Origin misconfiguration occurs when an AS inadvertently advertises an IP prefix and it becomes globally visible. Export misconfiguration occurs when an AS fails to filter a route that should have been filtered, thereby violating the policies of one or more of the ASs in the AS path. They identify a number of negative effects of such misconfigurations, including increased routing load, connectivity disruption, and policy violation. To measure and analyze misconfigurations, the authors collected data from 23 peers in 19 ASs over a period of three weeks. They examine new routes and assume that those that don’t last very long are likely due to misconfiguration or failure, and so select these to investigate. As part of their investigation they used an email survey of the operators of the ASs involved, as well as a connectivity verifier to determine the extent of disruptions. They note that their method underestimates the number and effect of misconfigurations for various reasons.


The authors first describe their results for origin misconfiguration analysis. They classify the new routes that are potential results of origin misconfiguration into three categories: self-deaggregation of prefixes, announcement of a new route with an origin related to the origin of the old route via their AS paths, and announcement of a new route with a foreign origin (unrelated to that of the old route). They observe that the number of incidents from each of these three categories is roughly the same, with self-deaggregation being slightly higher. They note, however, that the success rates for identifying these different types of origin misconfigurations differ, as some incidents that were classified as origin misconfigurations were actually the result of failures. Some interesting conclusions they draw from their analysis: at least 72% of new routes seen by a router in a day are the result of misconfiguration; 13% of incidents cause connectivity disruptions, mainly caused by new routes of foreign origin; connectivity disruptions due to misconfigurations play a small role compared to those due to failures; and 80% of misconfigurations are corrected within an hour, often less if the misconfiguration disrupts connectivity. The authors next examine export misconfigurations. They note that such misconfigurations do not tend to cause connectivity problems directly, and that most incidents involved providers rather than peers. The authors also examine the effect of misconfigurations on routing load, and conclude that in the extreme case, load can spike to 60%.
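The three-way classification of candidate origin misconfigurations can be sketched as a simple rule over the old route’s AS path. The function and AS numbers below are illustrative, not the authors’ actual tooling:

```python
def classify_origin_incident(new_origin_as, old_origin_as, old_as_path):
    """Classify a suspicious new route relative to the old route it overlaps.

    - "self-deaggregation": the same AS announces a more specific prefix
    - "related origin":     the new origin already appears in the old AS path
    - "foreign origin":     the new origin has no relation to the old route
    """
    if new_origin_as == old_origin_as:
        return "self-deaggregation"
    if new_origin_as in old_as_path:
        return "related origin"
    return "foreign origin"

print(classify_origin_incident(65001, 65001, [65010, 65001]))          # self-deaggregation
print(classify_origin_incident(65020, 65001, [65010, 65020, 65001]))   # related origin
print(classify_origin_incident(65099, 65001, [65010, 65001]))          # foreign origin
```

Foreign-origin incidents are the ones most likely to disrupt connectivity, since an unrelated AS is effectively hijacking traffic for the prefix.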


In the paper, the authors identify a number of causes of misconfiguration, which they group into slips and mistakes. Slips and mistakes turn out to be roughly equally responsible for misconfiguration. Mistakes causing origin misconfiguration that they identified include initialization bugs, reliance on upstream filtering, and use of old configurations. Slips causing origin misconfiguration include accidents (such as typos) in specifying redistribution, attachment of the wrong community attribute to prefixes, hijacks, forgotten filters, incorrect summaries, unknown errors, and miscellaneous problems. They also identify three additional mistakes causing export misconfiguration: prefix-based configuration, bad ACLs or route maps, and initialization bugs. Lastly, they identify a number of causes of short-lived new routes that are not misconfigurations, including failures, testing, migration, and load balancing.


Finally, the authors suggest a number of ways to reduce misconfigurations. These include improvements to router CLIs, transactional semantics for configuration changes, development and support of high-level configuration tools, configuration checkers, and database consistency mechanisms. They also describe how S-BGP, a proposed security extension to BGP, would prevent about half of the misconfigurations they observed.
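As a hint of what one of the suggested configuration checkers might do, here is a toy default-deny export check. This is entirely illustrative (the paper does not specify an implementation): it flags prefixes about to be announced to a neighbor without an explicit entry in that neighbor’s export filter, the kind of forgotten-filter slip the study observed:

```python
def check_exports(routes_to_export, export_filter):
    """Return prefixes that would leak: announced to a neighbor without
    being explicitly permitted by that neighbor's export filter."""
    return [prefix for prefix in routes_to_export
            if prefix not in export_filter]

# A provider route accidentally exported to a peer would be caught here.
leaks = check_exports(
    routes_to_export={"10.1.0.0/16", "192.0.2.0/24"},
    export_filter={"10.1.0.0/16"},
)
print(leaks)  # only the unfiltered prefix remains
```

A real checker would run against the router’s full configuration before changes are committed, which is where the transactional-semantics suggestion comes in.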

Critique

In general, I thought this was an entertaining read, especially as I could relate to how difficult router CLIs are to use and how easy it is to make mistakes in configuring BGP, having had to do this in a previous networks class. It is unfortunate that because misconfigurations are hard to identify, the authors’ methodology was necessarily limited. I’d be interested to see if other newer techniques have been or could be developed to do this and similar sorts of analysis. Due to the weaknesses in their methodology, I’m not sure how meaningful some of the figures and percentages they derive actually are, but they do still provide some interesting insights, especially if they are correct in arguing that their study provides a lower bound. That said, I still think their approach was clever, given the difficulties.

I particularly like their suggestions for reducing some of the causes of misconfigurations. User interface design improvements seemed to me the most obvious thing to do. In general, I wonder why many of their suggested solutions, which have been used in many other contexts and computer systems, have not been used in router configuration. Although the authors do briefly discuss some of the barriers to implementing such solutions, it still surprises me that harried system administrators haven’t risen up and demanded that at least some of the more obvious steps be taken sooner; maybe the potential for missteps makes things more interesting for them, but it’s hard to say. I think investigation of some of the improvements they suggest would be an interesting area for research, although they did seem to imply at one point that industry can be a barrier to the adoption of some of these improvements, which might be too frustrating to deal with.