Monday, September 28, 2009

Understanding TCP Incast Throughput Collapse in Datacenter Networks


Y. Chen, R. Griffith, J. Liu, A. Joseph, and R. H. Katz, "Understanding TCP Incast Throughput Collapse in Datacenter Networks," Workshop on Research in Enterprise Networks (WREN'09), August 2009.

One line summary: This paper examines the problem of TCP incast throughput collapse by reproducing the results of previous papers, examining the observed behavior in depth, and developing a quantitative model that attempts to explain it.

Summary

This paper examines the TCP incast problem and (1) shows that it is a general problem that occurs in a variety of network environments, (2) reproduces the experiments from prior work, and (3) proposes a quantitative model to explain some of the observed behavior. The authors describe the difference between the fixed-fragment and variable-fragment workloads used in previous papers. In variable-fragment workloads, the amount of data sent by each sender decreases as the number of senders increases. They justify their use of a fixed-fragment workload as being more representative. They perform their experiments in two different test environments, avoiding the use of simulation. Their first step is to verify that the incast problem can be replicated, which they do. They then test a number of modifications, including decreasing the minimum TCP RTO, randomizing the minimum TCP RTO, setting a smaller multiplier for the RTO exponential backoff, and randomizing the multiplier value. Their results suggest that reducing the minimum TCP RTO value is the most helpful modification, while most of the other modifications are unhelpful.
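
To make the workload distinction concrete, here is a minimal sketch of the two workload types. This is my own illustration, not code from the paper; the 256KB fragment and 1MB block sizes, and the function names, are made-up for the example.

    # My own illustration of the two workload types (sizes are made up,
    # not taken from the paper).
    FRAGMENT_BYTES = 256 * 1024   # fixed-fragment: constant per-sender size
    BLOCK_BYTES = 1024 * 1024     # variable-fragment: constant total block size

    def fixed_fragment(num_senders):
        """Per-sender data stays constant, so total data grows with senders."""
        return FRAGMENT_BYTES, FRAGMENT_BYTES * num_senders

    def variable_fragment(num_senders):
        """Total block stays constant, so per-sender data shrinks with senders."""
        return BLOCK_BYTES // num_senders, BLOCK_BYTES

    for n in (4, 16, 64):
        print(n, "fixed:", fixed_fragment(n), "variable:", variable_fragment(n))

The point of the contrast is that under the fixed-fragment workload the offered load keeps growing as senders are added, whereas under the variable-fragment workload each sender's share shrinks.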

They performed experiments in which they used a fixed-fragment workload with different minimum RTO values, varying the number of senders and measuring goodput. They found three regions of interest in the resulting graph: (1) the initial goodput collapse, (2) a goodput increase, and (3) a region of slower goodput decrease. They then ran a similar experiment, in this case testing different minimum RTO values with delayed ACKs enabled and disabled. Unlike previous papers by other researchers, they found that disabling delayed ACKs resulted in suboptimal behavior. They hypothesize that this is because disabling delayed ACKs causes the TCP congestion window to be over-driven, and they examine some internal TCP state variables to verify this hypothesis. They also conclude that this suboptimal behavior is independent of the type of workload. The results of their experiments were very different from the results in previous papers. They conclude that this is due in part to the difference in workloads (fixed-fragment versus variable-fragment), but also to the difference in the network test environments.
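
For anyone who wants to poke at the delayed ACK behavior themselves, the sketch below (my own, not the authors' setup; the helper name is hypothetical) shows how delayed ACKs can be suppressed on a Linux receiver via the TCP_QUICKACK socket option. The minimum RTO itself is not a socket option; it is typically changed per-route (iproute2's rto_min attribute) or by modifying the kernel.

    # Sketch of suppressing delayed ACKs on a Linux receiver socket.
    # This is my own illustration, not the authors' test harness.
    import socket

    # Fall back to Linux's option number (12) if this Python build
    # doesn't expose the constant.
    TCP_QUICKACK = getattr(socket, "TCP_QUICKACK", 12)

    def recv_without_delayed_ack(sock, nbytes):
        # TCP_QUICKACK is not sticky: the kernel can quietly revert to
        # delayed ACKs, so re-assert it before each receive.
        sock.setsockopt(socket.IPPROTO_TCP, TCP_QUICKACK, 1)
        return sock.recv(nbytes)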

They propose a simple quantitative model to describe the behavior in the case of delayed ACKs enabled with a low-resolution timer, using the fixed-fragment workload. The model comes close to describing the observed behavior for the 200ms minimum RTO timer; however, it does not come as close for the lower 1ms minimum RTO timer. Their model reveals that goodput is affected both by the minimum RTO value and by the inter-packet wait times between packet transmissions: for larger RTO values, reducing the minimum RTO helps, but for smaller RTO values, controlling the inter-packet wait time is what matters. The authors provide a number of refinements to their model, which they use to explain several details observed in the graphs from their earlier experiments.
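
As a rough back-of-the-envelope version of that intuition (my own simplification and notation, not the authors' actual model): if a transfer moves $D$ bytes total over a bottleneck of capacity $C$, suffers $r$ retransmission timeouts, and spends a total time $W$ in inter-packet waits, then roughly

    goodput \approx \frac{D}{D/C + r \cdot RTO_{\min} + W}

With a 200ms minimum RTO the $r \cdot RTO_{\min}$ term dominates the denominator, so shrinking the minimum RTO helps a lot; once the minimum RTO is pushed down toward 1ms, the inter-packet wait term $W$ dominates instead, which matches the trade-off the authors describe.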

Critique

There are many things I appreciated about this paper. At first I thought it was going to be redundant and not contribute much beyond the paper we read before this one, but it is clear that the authors’ comparison of their results with those of the previous paper revealed a lot of interesting things. This paper brings up many important details and influencing factors that were not examined in the previous paper and that weren’t immediately obvious from reading it. It was also nice to see the authors try to reproduce the experiments and results from the earlier paper; I don’t think I’ve seen that done elsewhere, even though it seems like it should be standard practice.

I’m not sure I completely buy some of the explanations in this paper, but I appreciate that the authors tried to explain the behavior they observed so thoroughly. For instance, I’m not sure I buy their possible explanation for why randomizing the minimum and initial RTO values is unhelpful, but that could very likely be because I don’t understand it. Another thing that is not clear to me is whether, when they examined enabling delayed ACKs, they adjusted the delayed ACK timer away from the default 40ms or left it alone. I wonder if they would have drawn other conclusions had they tried adjusting the delayed ACK timer along with the minimum RTO timer. I thought their attempt at a model was an impressive undertaking, but I’m not so sure the results of this attempt were very good.

2 comments:

  1. 1. Yes, the model is kind of ugly.
    2. The randomized min/init RTO result is pretty convincing in that it doesn't help. What is particularly unconvincing about our explanation? Do you have a better one? (said not as a challenge but as an invitation for your opinion)
    3. We did not try changing delayed ACK values because the delayed ACK behavior is not universal across testbeds - you'll see tomorrow. But it does point out that congestion window management is important.

  2. The paper is a work in progress, so your reaction among others is very useful for us. We felt that the CMU paper analyzes a very different kind of workload, and we believe that our approach is more generally applicable.
