Sara's Networks Class Blog: Joseph

Thursday, November 12, 2009

A Policy-aware Switching Layer for Data Centers

D. Joseph, A. Tavakoli, I. Stoica, "A Policy-aware Switching Layer for Data Centers," ACM SIGCOMM Conference, (August 2008)

One line summary: This paper presents a new layer-2 mechanism for supporting middleboxes in datacenters called the PLayer that incorporates pswitches, which allow administrators to specify the sequence of middleboxes a given type of traffic should traverse.

Summary

This paper presents PLayer, a new layer-2 for data centers consisting of policy-aware switches called pswitches. The main goal of the PLayer is to provide a mechanism to support the use of middleboxes in datacenters. The problems with using middleboxes in datacenters today are that (1) it is hard to ensure that traffic traverses the correct sequence of middleboxes without physically changing the network topology or overloading other layer-2 mechanisms, both of which are difficult and error-prone, (2) network inflexibility makes introducing changes difficult and can result in fate-sharing between middleboxes and other traffic flows, and (3) some traffic is often forced to traverse middleboxes unnecessarily, and load balancing across multiple instances of a middlebox is difficult. PLayer seeks to address these problems by providing correctness, flexibility, and efficiency. It does this by adhering to two main design principles: (1) separate policy from reachability and (2) take middleboxes off the physical network path.

The PLayer consists of pswitches, which forward frames according to administrator-specified policies. A policy defines the sequence of middleboxes that a given type of traffic should traverse and has the form “[start location, traffic selector] → sequence”, where a traffic selector is a 5-tuple consisting of source and destination IP addresses, source and destination port numbers, and the protocol type. These policies get translated into rules of the form “[previous hop, traffic selector] → next hop”. Administrators specify policies at a centralized policy controller, which disseminates them to each pswitch. A centralized middlebox controller monitors the liveness of middleboxes and keeps the pswitches informed of failures. The PLayer keeps infrastructure changes minimal by requiring only that switches be replaced, by using encapsulation, and by being incrementally deployable. It allows the use of unmodified middleboxes and servers by ensuring packets reach middleboxes and servers only in the proper Ethernet format, by using only nonintrusive techniques to identify previous hops, and by supporting the various middlebox addressing requirements. It supports non-transparent middleboxes, that is, middleboxes that modify frame headers or content, through per-segment policies. The PLayer can be enhanced with stateful pswitches, which store hash tables in order to track the processing of a flow. As the paper demonstrates, the PLayer guarantees correctness under network, policy, and middlebox churn. The end of the paper describes the implementation of pswitches in Click and validates their functionality. It also includes a performance test; as expected, pswitches achieve lower TCP throughput and higher latency, as do middleboxes deployed with them. The paper concludes with a summary of the limitations of the PLayer. Briefly, these include the introduction of indirect paths since the PLayer directs traffic to off-path middleboxes, difficulty in traffic classification and policy specification, incorrect packet classification when the 5-tuple is insufficient for correct classification, failure of the PLayer if middleboxes are incorrectly wired, the existence of some policies the PLayer cannot support, and the added complexity of pswitches.
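
To make the policy-to-rule translation described above more concrete, here is a minimal sketch in Python. It is not the paper's implementation (which is in Click). The names `Selector`, `policy_to_rules`, `next_hop`, and the `"FinalDestination"` marker are hypothetical, and treating `None` as a wildcard is an assumption made for illustration.

```python
from typing import NamedTuple, Optional


class Selector(NamedTuple):
    """5-tuple traffic selector. None is treated as a wildcard (an assumption)."""
    src_ip: Optional[str] = None
    dst_ip: Optional[str] = None
    src_port: Optional[int] = None
    dst_port: Optional[int] = None
    proto: Optional[str] = None

    def matches(self, pkt: "Selector") -> bool:
        # Every non-wildcard field must equal the packet's field.
        return all(want is None or want == got for want, got in zip(self, pkt))


def policy_to_rules(start, selector, sequence, final_hop="FinalDestination"):
    """Expand "[start, selector] -> sequence" into
    "[previous hop, selector] -> next hop" rules."""
    hops = [start, *sequence, final_hop]
    return {(prev, selector): nxt for prev, nxt in zip(hops, hops[1:])}


def next_hop(rules, previous_hop, pkt):
    """Return the next hop for a packet arriving from previous_hop, or None."""
    for (prev, selector), nxt in rules.items():
        if prev == previous_hop and selector.matches(pkt):
            return nxt
    return None


# Hypothetical policy: web traffic entering at the core router must
# traverse a firewall and then a load balancer.
rules = policy_to_rules(
    "core_router",
    Selector(dst_port=80, proto="tcp"),
    ["firewall", "load_balancer"],
)
pkt = Selector("10.0.0.5", "10.0.1.9", 43210, 80, "tcp")
print(next_hop(rules, "core_router", pkt))    # firewall
print(next_hop(rules, "firewall", pkt))       # load_balancer
print(next_hop(rules, "load_balancer", pkt))  # FinalDestination
```

Keying each rule on the previous hop is what lets a switch decide where a frame goes next without the frame carrying its own history, which fits the goal of leaving middleboxes and servers unmodified.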

Critique

I thought the PLayer seemed like a good idea, and very well-motivated. I don’t think we should have necessarily read this version of the paper, however. Some sections, such as the formal analysis and the appendix, didn’t really add much toward understanding the PLayer, but rather just made the paper an annoyingly long read. With respect to the PLayer itself, it seems that if it could overcome its poor performance when compared to current datacenter switches, it could become pretty competitive. Or, at least the idea of allowing administrators to specify which traffic should traverse which middleboxes in a flexible way requiring minimal changes to existing infrastructure seems like an idea that could be very applicable, and perhaps companies such as Cisco would do well to incorporate some of those ideas into their products.

Monday, September 28, 2009

Understanding TCP Incast Throughput Collapse in Datacenter Networks

Y. Chen, R. Griffith, J. Liu, A. Joseph, R. H. Katz, "Understanding TCP Incast Throughput Collapse in Datacenter Networks," Workshop on Research in Enterprise Networks (WREN'09), (August 2009).

One line summary: This paper examines the problem of TCP incast throughput collapse by attempting to reproduce the results of previous papers and examining in depth the behavior observed, as well as developing a quantitative model that attempts to explain it.

Summary

This paper examines the TCP incast problem and (1) shows that it is a general problem and occurs in a variety of network environments, (2) reproduces the experiments from prior work, and (3) proposes a quantitative model to understand some of the observed behavior. The authors describe the difference between the fixed-fragment and variable-fragment workloads used in previous papers. In variable-fragment workloads, the amount of data sent by each sender decreases as the number of senders increases. They justify their use of a fixed-fragment workload as being more representative. They perform their experiments in two different test environments, avoiding the use of simulation. Their first step is to verify that the incast problem can be replicated, which they do. They then test a number of modifications, including decreasing the minimum TCP RTO, randomizing the minimum TCP RTO, setting a smaller multiplier for the RTO exponential backoff, and randomizing the multiplier value. Their results suggest that reducing the minimum TCP RTO value was the most helpful modification, while the majority of the other modifications were unhelpful.
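
The difference between the two workload types can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `bytes_per_sender`, the string labels, and the 1 MiB and 256 KiB sizes are hypothetical and not taken from the paper.

```python
def bytes_per_sender(workload: str, n_senders: int, size_bytes: int) -> int:
    """Bytes each sender transmits in one synchronized request.

    "fixed":    size_bytes is the per-sender fragment, so the total
                grows with the number of senders.
    "variable": size_bytes is the total block, split evenly, so each
                sender's share shrinks as senders are added.
    """
    if n_senders < 1:
        raise ValueError("need at least one sender")
    if workload == "fixed":
        return size_bytes
    if workload == "variable":
        return size_bytes // n_senders
    raise ValueError(f"unknown workload: {workload!r}")


# Hypothetical sizes: a 256 KiB fixed fragment versus a 1 MiB block
# that is split across all senders.
for n in (1, 4, 16):
    fixed = bytes_per_sender("fixed", n, 256 * 1024)
    variable = bytes_per_sender("variable", n, 1024 * 1024)
    print(n, fixed, variable)
```

With the fixed-fragment workload, the total data converging on the receiver scales with the sender count, while with the variable-fragment workload it stays constant, which is one reason results from the two setups are hard to compare directly.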

They performed experiments in which they used a fixed-fragment workload with different minimum RTO values, varying the number of senders and measuring goodput. They found that there are three regions of interest in the resulting graph: (1) the initial goodput collapse, (2) a goodput increase, and (3) another region of slower goodput decrease. They then ran a similar experiment, this time testing different minimum RTO values with and without delayed ACKs enabled. Unlike in previous papers by other researchers, they found that disabling delayed ACKs resulted in suboptimal behavior. They hypothesize that this is because disabling delayed ACKs causes the TCP congestion window to be over-driven, and they examine some internal TCP state variables to verify that this hypothesis is correct. They also conclude that this suboptimal behavior is independent of the type of workload. The results of their experiments were very different from the results in previous papers. They conclude that this is in part due to the difference in workloads (fixed-fragment versus variable-fragment), but also due to the difference in the network test environments.

They propose a simple quantitative model to describe the behavior in the case of the delayed ACK enabled, low-resolution timer using the fixed-fragment workload. It comes close to describing the observed behavior for the 200ms minimum RTO timer. However, their model does not come as close for the lower 1ms minimum RTO timer. Their model reveals that the goodput is affected by both the minimum RTO value and also the inter-packet wait times between packet transmissions, and that for larger RTO values, reducing the value helps, but for smaller RTO values, controlling the inter-packet wait time is important. The authors provide a number of refinements to their model which they use to explain several details observed in the graphs from their earlier experiments.
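
The qualitative point about minimum RTO versus inter-packet wait time can be illustrated with a toy calculation. To be clear, this is not the authors' model. It is a simplified sketch that assumes each timeout event costs the minimum RTO plus an inter-packet wait, and every name and number in it is hypothetical.

```python
def toy_goodput_bps(data_bits, ideal_time_s, timeouts, min_rto_s, wait_s):
    """Toy goodput: data divided by ideal transfer time plus stall time.

    Assumption (not from the paper): each timeout event stalls the
    transfer for min_rto_s + wait_s seconds.
    """
    total_time_s = ideal_time_s + timeouts * (min_rto_s + wait_s)
    return data_bits / total_time_s


# Hypothetical transfer: 8 Mbit that would ideally take 8 ms,
# interrupted by 5 timeout events with a 5 ms inter-packet wait.
args = dict(data_bits=8e6, ideal_time_s=0.008, timeouts=5)

large_rto = toy_goodput_bps(min_rto_s=0.200, wait_s=0.005, **args)
small_rto = toy_goodput_bps(min_rto_s=0.001, wait_s=0.005, **args)
print(small_rto > large_rto)  # lowering a large RTO raises goodput

# Once the RTO is already small, the wait time dominates the stall.
half_wait = toy_goodput_bps(min_rto_s=0.001, wait_s=0.0025, **args)
half_rto = toy_goodput_bps(min_rto_s=0.0005, wait_s=0.005, **args)
print(half_wait > half_rto)
```

In this toy setting, cutting a 200ms minimum RTO helps a great deal, but at 1ms the stall time is mostly inter-packet wait, so halving the wait helps more than halving the RTO again.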

Critique

There are many things I appreciated about this paper. At first I thought it was going to be redundant and not contribute much beyond what the paper we read before this one did, but it is clear that the authors’ comparison of their results with the results from the previous paper revealed a lot of interesting things. This paper brings up a lot of points regarding important details and influencing factors that were not examined in the previous paper and that weren’t immediately obvious from reading it. It was also nice to see them try to reproduce the experiments and results from the earlier paper; I don’t think I’ve seen that elsewhere, even though it seems like it would be good to make this standard practice.

I’m not sure I completely buy some of their explanations in this paper, but I appreciate that they tried to explain the behavior they observed so thoroughly. For instance, I’m not sure I buy their possible explanation for why randomizing the minimum and initial RTO values is unhelpful, but that could very likely be because I don’t understand it. One other thing that is not clear to me is whether, when they examined enabling delayed ACKs, they adjusted the delayed ACK timer so that it wasn’t the default 40ms or left it alone. I wonder if they would have drawn other conclusions had they tried adjusting the delayed ACK timer along with the minimum RTO timer. I thought their attempt at a model was an impressive undertaking, but I’m not so sure the results of this attempt were very good.
