Monday, September 28, 2009

Understanding TCP Incast Throughput Collapse in Datacenter Networks


Y. Chen, R. Griffith, J. Liu, A. Joseph, R. H. Katz, "Understanding TCP Incast Throughput Collapse in Datacenter Networks," Workshop on Research in Enterprise Networks (WREN'09), (August 2009).

One line summary: This paper examines the problem of TCP incast throughput collapse by attempting to reproduce the results of previous papers and examining in depth the behavior observed, as well as developing a quantitative model that attempts to explain it.

Summary

This paper examines the TCP incast problem and (1) shows that it is a general problem and occurs in a variety of network environments, (2) reproduces the experiments from prior work, and (3) proposes a quantitative model to understand some of the observed behavior. The authors describe the difference between fixed-fragment and variable-fragment workloads used in previous papers. In variable-fragment workloads, the amount of data sent by each sender decreases as the number of senders increases. They justify their use of a fixed-fragment workload as being more representative. They perform their experiments in two different test environments, avoiding the use of simulation. Their first step is to verify that the incast problem can be replicated, which they do. They then test a number of modifications, including decreasing the minimum TCP RTO, randomizing the minimum TCP RTO, setting a smaller multiplier for the RTO exponential backoff, and randomizing the multiplier value. Their results suggest that reducing the minimum TCP RTO value is the most helpful modification, while the majority of the other modifications are unhelpful.

They performed experiments in which they used a fixed-fragment workload for different minimum RTO values, varying the number of senders and measuring goodput. They found that there are three regions of interest in the resulting graph: (1) the initial goodput collapse, (2) a goodput increase, and (3) a region of slower goodput decrease. They then ran a similar experiment, in this case testing different RTO values with and without delayed ACKs enabled. Unlike previous papers by other researchers, they found that disabling delayed ACKs resulted in suboptimal behavior. They hypothesize that this is because disabling delayed ACKs causes the TCP congestion window to be over-driven, and they examine some internal TCP state variables to verify this hypothesis. They also conclude that this suboptimal behavior is independent of the type of workload. The results of their experiments were very different from those in previous papers. They conclude that this is due in part to the difference in workloads (fixed-fragment versus variable-fragment), but also to differences in the network test environments.

They propose a simple quantitative model to describe the behavior in the case of delayed ACKs enabled with a low-resolution timer under the fixed-fragment workload. It comes close to describing the observed behavior for the 200ms minimum RTO timer; however, it does not come as close for the lower 1ms minimum RTO timer. Their model reveals that goodput is affected both by the minimum RTO value and by the inter-packet wait times between packet transmissions: for larger RTO values, reducing the RTO helps, while for smaller RTO values, controlling the inter-packet wait time becomes important. The authors provide a number of refinements to their model, which they use to explain several details observed in the graphs from their earlier experiments.
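
To make the shape of such a model concrete, here is a back-of-the-envelope sketch of my own (not the authors' exact equations) of how a fixed-fragment incast transfer's goodput depends on the minimum RTO and the inter-packet wait overhead. All parameter values are illustrative.

```python
# Back-of-the-envelope incast goodput model (my own sketch, not the
# authors' exact equations): total transfer time is the ideal time plus
# RTO stalls plus inter-packet wait overhead.

def goodput_mbps(num_senders, rto_min_s, fragment_bytes=256 * 1024,
                 link_mbps=1000, rto_events=1, inter_packet_wait_s=0.0):
    total_bits = num_senders * fragment_bytes * 8
    ideal_time_s = total_bits / (link_mbps * 1e6)  # no-stall transfer time
    # Each RTO event idles the link for roughly the minimum RTO.
    total_time_s = ideal_time_s + rto_events * rto_min_s + inter_packet_wait_s
    return total_bits / total_time_s / 1e6

# With a 200 ms minimum RTO a single timeout dominates the transfer time,
# so lowering the RTO helps enormously; near 1 ms, the wait terms dominate.
print(goodput_mbps(32, rto_min_s=0.200))  # ~251 Mbps on a gigabit link
print(goodput_mbps(32, rto_min_s=0.001))  # ~985 Mbps, near line rate
```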

Critique

There are many things I appreciated about this paper. At first I thought it was going to be redundant and not contribute much beyond what the paper we read before this one did, but it is clear that the authors' comparison of their results with the results from the previous paper revealed a lot of interesting things. This paper brings up a lot of important details and influencing factors that were not examined in the previous paper and that weren't immediately obvious from reading it. It was also just nice to see them try to reproduce the experiments and results from the earlier paper; I don't think I've seen that elsewhere, even though it seems like it would be good to make this standard practice.

I'm not sure I completely buy some of the explanations in this paper, but I appreciate that the authors tried to explain the behavior they observed so thoroughly. For instance, I'm not sure I buy their proposed explanation for why randomizing the minimum and initial RTO values is unhelpful, but that could very well be because I don't understand it. One other thing that is not clear to me is whether, when they examined enabling delayed ACKs, they adjusted the delayed ACK timer away from the default 40ms or left it alone. I wonder if they would have drawn other conclusions had they tried adjusting the delayed ACK timer along with the minimum RTO timer. I thought their attempt at a model was an impressive undertaking, but I'm not so sure the results of this attempt were very good.

Sunday, September 27, 2009

Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication


V. Vasudevan, A. Phanishayee, H. Shah, E. Krevat, D. G. Andersen, G. R. Ganger, G. A. Gibson, B. Mueller, "Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication," ACM SIGCOMM Conference, (August 2009).

One line summary: This paper suggests reducing or eliminating the minimum RTO and enabling fine-grained RTT and RTO measurement and calculation on the order of microseconds as a way of eliminating the problem of TCP incast collapse in data centers.

Summary

This paper presents a solution to the TCP incast problem. TCP incast collapse results in severely reduced throughput when multiple senders try to send to a single receiver. The preconditions for TCP incast collapse are: (1) the network must have high bandwidth and low latency with switches that have small buffers, (2) the workload must be a highly parallel barrier synchronization request workload, (3) the servers must return a small amount of data per request (high fan in). An example scenario in which TCP incast collapse occurs is one in which a client sends a request for data that has been partitioned across many servers. The servers all try to respond with their data and overload the client’s buffers. Packets are lost and some servers experience a TCP retransmission timeout (RTO). But the default minimum retransmission timeout is 200ms, so the client must wait at least this long to receive the extra data, even though the client’s link may be idle for most of this waiting time. This results in decreased throughput and link utilization. The authors postulate that in order to prevent TCP incast collapse, the RTO must operate on granularity closer to the RTT of the underlying network, which in datacenters is hundreds of microseconds or less.
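
A quick piece of arithmetic (illustrative numbers, not measurements from the paper) shows why this mismatch is so damaging:

```python
# Why a 200 ms minimum RTO hurts so much in a datacenter: a single
# timeout idles the link for thousands of round trips.
rtt_s = 100e-6      # typical datacenter RTT: on the order of 100 us
rto_min_s = 200e-3  # default minimum RTO: 200 ms

print(rto_min_s / rtt_s)  # => 2000.0 round trips of dead time per timeout
```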

The authors demonstrate via a series of simulations and experiments in real clusters that allowing a minimum RTO on the order of microseconds, or eliminating the minimum RTO altogether, improves throughput and helps avoid TCP incast collapse. They also experiment with desynchronizing retransmissions by adding a randomized component. They advocate such desynchronization in data centers but note that it is likely unnecessary in the wide area, since different flows have different RTTs and thus different RTOs. The authors explain that it is not enough to lower the minimum RTO; it is also necessary to enable measurement and computation of the RTT and RTO at a fine-grained level. They describe their implementation of fine-grained RTT and RTO computation using Linux high-resolution timers, which involved making modifications to the kernel and the TCP stack.
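
For flavor, here is a sketch of the standard TCP RTO computation carried out in microseconds rather than in coarse timer ticks, which is the spirit of what the authors implement with high-resolution timers. The constants are the textbook RFC 6298 ones; this is not the authors' kernel code.

```python
# Standard TCP RTT estimation (RFC 6298 style) in microseconds, with the
# minimum-RTO clamp exposed as a parameter -- the paper's change amounts
# to shrinking or removing that clamp.

class FineGrainedRto:
    def __init__(self):
        self.srtt = None    # smoothed RTT estimate, in microseconds
        self.rttvar = None  # RTT variance estimate, in microseconds

    def on_rtt_sample(self, rtt_us):
        if self.srtt is None:
            self.srtt = rtt_us
            self.rttvar = rtt_us / 2
        else:
            self.rttvar = 0.75 * self.rttvar + 0.25 * abs(self.srtt - rtt_us)
            self.srtt = 0.875 * self.srtt + 0.125 * rtt_us

    def rto_us(self, min_rto_us=0):
        # A coarse-timer stack would pass min_rto_us=200000 (200 ms) here;
        # the paper argues for letting the RTO track the actual RTT scale.
        return max(min_rto_us, self.srtt + 4 * self.rttvar)
```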

The authors then discuss whether eliminating the minimum RTO and allowing microsecond retransmissions is appropriate for the wide area or whether such techniques should be limited to the data center. They note two potential conflicts: spurious retransmissions when the network RTT suddenly spikes, and delayed ACKs acting as an additional timeout mechanism in certain situations because the delayed ACK timer is set higher than the RTO. They argue that problems arising from these conflicts don't occur often and are mitigated by newer TCP features. They perform some wide-area experiments to test this claim and find that spurious retransmissions aren't an issue, and that while better performance can be achieved by disabling delayed ACKs, leaving them enabled with a high timer only slightly harms performance.

Critique

I liked this paper and I think it should remain on the syllabus. I don't have many criticisms, although I did find it strange that in their wide-area experiments, two different servers experienced almost exactly the same distribution of flows. This suggests that something funny might be going on with those experiments. I thought the proposed solution in this paper was good; it would be perfect if enabling fine-grained RTT and RTO measurement and calculation didn't require special kernel hacking. Perhaps in the future the OS will automatically provide such capabilities. In the meantime, I thought their suggestion of setting the minimum RTO to the lowest value possible was at least a nice practical idea. I also liked how they did both simulations and experiments on a real cluster.

Monday, September 21, 2009

VL2: A Scalable and Flexible Data Center Network


A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. A. Maltz, P. Patel, S. Sengupta, "VL2: A Scalable and Flexible Data Center Network," ACM SIGCOMM 2009, (August 2009).

One line summary: This paper presents VL2, a data center network architecture that provides layer 2 semantics, high capacity, and performance isolation between services using flat addressing, Valiant Load Balancing, and a directory service for mappings, and runs over a Clos topology.

Summary

This paper presents a data center network architecture called VL2. The goals of VL2 are to achieve uniform high capacity, performance isolation between services, and layer 2 semantics, as well as agility (the ability to assign any server to any service). VL2 is built from low-cost switches arranged into a Clos topology and uses Valiant Load Balancing (VLB) to spread traffic across available paths and adapt to volatility in traffic and failure patterns. The authors first conduct a traffic analysis by instrumenting a data center that supports data mining. Their major findings are that (1) the majority of flows are small, (2) more than half the time, the number of concurrent flows through a machine is around ten, (3) the variability in traffic patterns is not easily summarized and there are a large number of representative traffic matrices, (4) traffic patterns change nearly constantly and unpredictably, (5) most failures are small in size, and (6) the main causes of downtime are network misconfigurations, firmware bugs, and faulty components, and there is no obvious way to eliminate failures from the top of the data center topology hierarchy.

Briefly, VL2 operates as follows. It is built on a Clos network topology and uses two different kinds of addresses: location-specific IP addresses (LAs) and application-specific IP addresses (AAs). All switches and interfaces have LAs. Each server has an AA that remains the same no matter where the application is located and is assigned when the server is provisioned to a service. All servers within a service believe they are in the same subnet. The AA of a server is associated with the LA of the ToR switch to which the server is connected. Switches run a link-state routing protocol over these LAs. For routing, VL2 uses VLB to distribute traffic across intermediate switches and ECMP to distribute it across equal-cost paths. A VL2 agent at each server encapsulates outgoing packets within a packet that carries the LA of the destination ToR in its header. A directory service stores AA-LA-MAC address mappings. It consists of a number of read-optimized directory servers that cache AA-LA mappings and a smaller number of write-optimized replicated state machine servers that reliably store them. The directory service performs lookups, updates, and reactive cache updates.
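
To make the forwarding path concrete, here is a toy sketch of the agent's send path as I understand it from the paper: resolve the destination AA to its ToR's LA via the directory service (with local caching), then tunnel the packet to that LA. The names and structure are mine, not from VL2's implementation.

```python
# Toy sketch of a VL2 agent's send path: AA -> LA resolution through a
# directory service, then encapsulation toward the destination ToR.
# The directory_service object and its lookup() method are assumed here
# for illustration.

class Vl2Agent:
    def __init__(self, directory_service):
        self.directory = directory_service  # resolves AA -> LA of the ToR
        self.cache = {}                     # local AA -> LA cache

    def resolve(self, dest_aa):
        if dest_aa not in self.cache:
            self.cache[dest_aa] = self.directory.lookup(dest_aa)
        return self.cache[dest_aa]

    def send(self, payload, dest_aa):
        tor_la = self.resolve(dest_aa)
        # Outer header targets the ToR's LA; the ToR decapsulates and
        # delivers to the AA in the inner header.
        return {"outer_dst": tor_la, "inner_dst": dest_aa, "payload": payload}
```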

The authors perform an evaluation of VL2, which serves to demonstrate that VL2 provides uniform high capacity, VLB fairness, performance isolation, convergence after link failures, and has a directory service that is scalable and resilient, with high availability and throughput. Of these experiments, I thought the most interesting was the one where they performed an all-to-all data shuffle stress test to show VL2 achieves high utilization and fairness between flows.

Critique

In terms of criticisms, here are several notes I made about this paper:

I liked that before designing VL2 they did an analysis of data center traffic. The results of this were somewhat surprising to me so I found them interesting. However, for this analysis, the authors instrumented a data mining workload, but it’s probably not the case that a data mining workload is representative of most traffic in data centers in general.

I didn't understand why they exclude UDP traffic from their evaluations, which use only TCP flows; I don't know if this is realistic. They make a somewhat hand-wavy claim that techniques such as STCP for making UDP "TCP friendly" are well known and can be used, but I am not convinced it is really that easy. Also in their evaluations, they state that they rely on TCP to ensure that each flow is rate-limited to its fair share of the bottleneck, but many of the previous papers we read suggest that TCP is not very good at fairness. Another thing is that in their data shuffle evaluation, they claim that a conventional design would take 11x longer, but they don't explain where they got their figures. Also, they claim several times throughout the paper that agility is a big goal for VL2, but they don't do an experiment showing what happens when a server is migrated, as was done in the PortLand paper. I don't think that by agility they necessarily mean ease of VM migration, but perhaps this still would have been interesting to examine.

Some miscellaneous points: in one part of the paper, they claim that VL2 enables fine-grained path control by adjusting the randomization used in VLB, but they don't elaborate further on this. Also, the performance of VL2 appears to depend on traffic obeying the hose model. I don't know what the hose model is, but is it safe to assume all data center traffic will obey it?

A general point not just about this paper in particular but about all of these papers on data center networks is that it seems that every paper only uses evaluations that highlight the good features of the solution suggested in that paper, so it is hard to compare solutions. It would be nice if there were at least some standard experiments or similar experiments that could facilitate comparison, analogous to the sets of benchmarks used for processors in architecture.

PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric


R. N. Mysore, A. Pamboris, N. Farrington, N. Huang, P. Miri, S. Radhakrishnan, V. Subramanya, A. Vahdat, "PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric", ACM SIGCOMM, (August 2009).

One line summary: This paper presents PortLand, a layer 2 network fabric consisting of a set of Ethernet-compatible routing and addressing protocols tailored for data centers with multi-rooted tree topologies.

Summary

This paper presents PortLand, a network fabric for data centers with layer 2 semantics. The goals of PortLand are that it be scalable, easy to manage, flexible, and fault-tolerant, and that it support virtual machine migration. The authors motivate their approach by first describing the problems with traditional Ethernet-, LAN-, and VLAN-based solutions. These problems are essentially the same ones mentioned in the SEATTLE paper, so refer to the SEATTLE post below for a discussion of them. One main difference between PortLand and other data center network fabric solutions is that PortLand is designed for use over a specific topology: the fat tree or similar multi-rooted tree topologies. Central to the design of PortLand is the fabric manager (FM), which maintains soft state about network configuration information such as topology. Also key is that each end host is assigned a pseudo-MAC (PMAC) address, of which the host is unaware, that encodes the host's location in the topology. These PMACs are mapped to the hosts' actual MAC (AMAC) addresses.


Forwarding in PortLand is done using PMACs, and the host-connected edge switches perform the AMAC-to-PMAC header rewriting so that hosts remain unaware of the PMACs. AMAC-PMAC-IP address mappings are maintained by the FM, which responds with this information to ARP requests unicast to it from the edge switches; this special handling of ARP messages is called proxy-based ARP in the paper. Switches learn their location in the topology using the Location Discovery Protocol (LDP), which requires no administrator configuration. After the switches establish their location, they use updates from neighboring switches to populate their forwarding tables. Forwarding in PortLand is provably loop-free (by enforcing up-down semantics: once a packet has been forwarded down a layer in the topology, it is never forwarded back up). Multicast groups in PortLand are mapped to core switches using a hash function, and all multicast packets are forwarded to the core switch the group hashes to. The FM is responsible for installing the proper forwarding state in core and aggregation switches to ensure that every host in a multicast group receives that group's packets. LDP also provides switch failure detection, and failure recovery is aided by the FM, which tracks the state of switches.
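
Since the whole design hinges on PMACs encoding location, here is a small sketch of how such an address might be packed into 48 bits. The paper describes a pod.position.port.vmid layout; the exact field widths below (16/8/8/16) follow my reading of the paper and should be treated as illustrative.

```python
# Pack a location-encoding PMAC (pod.position.port.vmid) into 48 bits.
# Field widths here are illustrative.

def make_pmac(pod, position, port, vmid):
    value = (pod << 32) | (position << 24) | (port << 16) | vmid
    return ":".join(f"{b:02x}" for b in value.to_bytes(6, "big"))

# A host in pod 2, at switch position 1, on port 3, first VM on that port:
print(make_pmac(pod=2, position=1, port=3, vmid=1))  # -> 00:02:01:03:00:01
```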


The authors conclude by evaluating PortLand on a testbed of 20 4-port switches and 16 end hosts. They measure convergence time in the presence of link failures for both UDP and TCP flows, as well as multicast convergence when a link to a core switch fails. They also measure scalability and VM migration. In all these experiments, they evaluate PortLand alone, without comparing it to other solutions.


Critique

I didn't really like this paper. One problem is that the authors claim to restrict the amount of centralized knowledge in PortLand, but they never explain what happens when the central FM goes down. They discuss several fault cases but not this one. Clearly, losing the FM would be catastrophic, so you would think they would address it. Also, they claim that the FM is amenable to a distributed realization, but they don't elaborate, so it is not very convincing. The FM doesn't seem very scalable, especially if it isn't distributed somehow, and distributing it seems like it would be more complicated than they admit.


I also was struck by how often they mentioned special failure cases. Maybe this is always the case for these kinds of systems and other papers just don’t emphasize it like the authors do here, but having so many cases where PortLand could fail and special measures need to be taken makes it seem kind of fragile.


Another obvious criticism is that PortLand isn’t applicable to general topologies; they assume that the fat-tree or a similar topology is the obvious best one, so in some sense their solution depends on this assumption remaining true. Also, they mention at one point that their implementation uses a separate control network for communication between the FM and the switches. I feel like they should have been more upfront about this, and perhaps evaluated PortLand without the separate control network as well.


I also thought that the authors were slightly unfair and misleading in their description of SEATTLE. They make a lot of criticisms of SEATTLE and TRILL but they don’t compare PortLand to them in their experiments. While a direct comparison might have been difficult for many reasons, it would have been nice if they had at least attempted to do similar experiments as done in the SEATTLE and TRILL papers. It doesn’t seem fair that they didn’t.


One thing that is good about PortLand is that, as the authors claim, it is implementable using existing hardware and software, although maybe that’s true of most of the alternatives as well.


Wednesday, September 16, 2009

Detailed Diagnosis in Enterprise Networks


S. Kandula, R. Mahajan, P. Verkaik, S. Agarwal, J. Padhye, P. Bahl, "Detailed Diagnosis in Enterprise Networks," ACM SIGCOMM Conference, (August 2009).


One line summary: This paper presents NetMedic, a diagnostic system for enterprise networks that uses network history to infer the causes of faults without application-specific knowledge, representing components as a directed graph and reasoning over this structure; the paper shows that NetMedic can pinpoint fault causes with much greater specificity than previous diagnostic systems.

Summary

This paper describes a diagnostic system for enterprise networks called NetMedic. NetMedic approaches the problem of diagnosing faults as one of inference. The goal of NetMedic is to identify the likely causes of a fault with as much specificity as possible, using minimal application-specific knowledge. It does this by modeling the network as a dependency graph and then using history to detect abnormalities and likely causes. The nodes of this graph are network components such as application processes, machines, configurations, and network paths. There is a directed edge from a node A to a node B if A impacts B, and the weight of this edge represents the magnitude of the impact. Each component has a state consisting of many variables. The abnormality of a component at a given time is determined and used to compute the edge weights. The authors describe various extensions to make this process more robust to large and diverse sets of variables. After these weights are obtained, causes are ranked such that more likely causes have lower (better) ranks.
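
To illustrate the history-based idea, here is a toy version of how an edge weight might be computed, as I understand the approach: to judge whether component A is impacting component B now, find past moments when B's state resembled its current state and check how much A's state then resembled A's current state. The similarity function and all names here are my own simplifications, not NetMedic's actual code.

```python
# Toy history-based edge weighting in the spirit of NetMedic. States are
# dicts of variable -> value; histories are lists of states over time.

import math

def similarity(s1, s2):
    # Crude similarity: closer variable values give a score nearer 1.
    diffs = [abs(s1[k] - s2[k]) for k in s1]
    return math.exp(-sum(diffs) / max(len(diffs), 1))

def edge_weight(history_a, history_b, now_a, now_b, k=3):
    # Find the k past moments when B looked most like it does now...
    times = sorted(range(len(history_b)),
                   key=lambda t: similarity(history_b[t], now_b),
                   reverse=True)[:k]
    # ...and score how much A looked like its current state at those times.
    return sum(similarity(history_a[t], now_a) for t in times) / max(len(times), 1)
```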

The authors implement NetMedic on the Windows platform, using the Windows Performance Counter framework as their source of data. They claim to be developing a prototype for Linux as well. They evaluate NetMedic against a coarse diagnosis method loosely based on previous methods. They evaluate in a live environment and a controlled environment, but it is unclear how realistic even the live environment is, as they inject the faults they are trying to detect. In one of their evaluations, the median rank of the true cause is 1 for 80% of the faults, meaning NetMedic correctly identifies the culprit in those cases. They also demonstrate the benefit of their extensions by comparing against a version of NetMedic that has application-specific information hand-coded into it; their extensions perform well in this comparison. They also study how NetMedic does when diagnosing two simultaneous faults and study the impact of the length of history used.

Critique

This paper was interesting to read. After reading the first section or two, the first question that comes to mind is how they manage without knowing application-specific details, because it is pretty obvious that this method isn't workable in the general case. They have a clever way of getting around this in the analysis phase, but they do admit that in the data collection phase of their experiments they utilize application-specific information about where configuration data is stored, though they claim to be working on a way around this. There is one part of the implementation section where they explain that they handle some counters differently from others because those counters represent cumulative values, and it made me wonder how they determine which counters fall into that category. Doesn't that count as having application-specific information? They talk about automatically detecting cumulative variables earlier in the paper when discussing their extensions, such as aggregate relationships across variables, but the example they give with the counter (the number of exceptions a process has experienced) doesn't seem to fall into the same category as those discussed in the extensions section.

Since NetMedic would be used by network administrators, and you can imagine that some things about a network stay the same all the time (such as what kinds of applications are running), it might be interesting to see how NetMedic could be enhanced if administrators had the option of providing some application-specific details, and whether there are ways NetMedic could leverage this. I'm not really sure how or if that would work, but it is something to consider.

I wasn’t particularly impressed by their evaluation. It would be more compelling if they had more data from real-world situations instead of constructed situations with injected faults. It might have been informative too if they had shown some results measuring metrics other than the rank of the correct cause, although I can’t think of another metric off the top of my head. Also, they compared their system against another that was “loosely based” on systems such as Sherlock and Score, and they don’t really discuss that system much, so it seems a bit questionable as to whether this is a fair comparison. Lastly, their evaluation in which they show NetMedic identifies a virus scanning program or sync utility as abnormal doesn’t seem like something to brag about. I’m not sure I understand this section or why this is a good thing, since presumably virus scanning is an acceptable activity. In this section, they claim to be showing how NetMedic can help with naturally occurring faults, but I’m not sure they actually accomplish that. I could be just entirely misunderstanding this section.

It might potentially be interesting to explore using variables as nodes in the graph instead of just the components. This could probably make it much harder to scale though. It would also be interesting to see how NetMedic performs when a fault is due to a confluence of factors and not just one culprit alone.

Floodless in SEATTLE: A Scalable Ethernet Architecture for Large Enterprises


C. Kim, M. Caesar, J. Rexford, "Floodless in SEATTLE: A Scalable Ethernet Architecture for Large Enterprises," ACM SIGCOMM Conference, (August 2008).


One line summary: This paper describes SEATTLE, an Ethernet-based architecture for use as a building block in large enterprise networks that uses a DHT to provide the same functionality as conventional Ethernet, but in a scalable and easy-to-administer way.

Summary

This paper discusses an alternative Ethernet-based protocol called SEATTLE for use as a network building block in enterprise and access provider networks. It is motivated by the observation that despite its usefulness, especially its plug-and-play semantics, conventional Ethernet scales poorly and has several limitations: (1) it disseminates every host's location globally using flooding, resulting in high control overhead and large forwarding tables, (2) it forces paths onto a spanning tree, constraining route selection, and (3) bootstrapping protocols such as ARP and DHCP rely on broadcasts for frequent and basic operations, degrading network performance. One alternative that attempts to address some of these problems is to decompose the network into multiple LANs interconnected by IP routing. The disadvantages of this approach include (1) configuration overhead due to hierarchical addressing and the difficulty of maintaining consistency between routers and DHCP servers, (2) complexity in implementing network policies, since these are often based on the IP prefix of a host, which is subject to change, and (3) limited support for host mobility, making it difficult for hosts to move from one subnet to another. A further approach uses virtual LANs (VLANs) instead of LANs to address these deficiencies in turn; however, it too has limitations, including (1) overhead in configuring trunks, since deciding which bridges should be in a given VLAN must often be done manually, (2) large forwarding tables on bridges in multiple VLANs, and (3) a single spanning tree per VLAN that limits route selection and requires manual rebalancing and updating to adjust to shifts in network load.

SEATTLE in turn addresses all of these problems. It maintains MAC-IP and MAC-location mappings using a one-hop DHT. This information is hashed onto switches using consistent hashing to reduce the re-hashing overhead when a switch fails and keys must be remapped. Among switches, a link-state routing protocol allows routing along shortest paths. SEATTLE allows for the creation of virtual switches for load balancing, i.e., more powerful switches can host more virtual switches and thus take a higher load. The DHT can also store information about various network services or attributes, such as printers and servers, by hashing a descriptor of the service along with its location or other information. SEATTLE can also be configured to use a multi-level one-hop DHT to run over hierarchical networks with several regions connected by backbones; in this case, separate regional and backbone hash rings are maintained. SEATTLE is backwards compatible with end hosts using conventional Ethernet. It also allows for the definition of groups, a group being a set of hosts that share the same broadcast domain. This is important for backward compatibility with conventional Ethernet and also facilitates the implementation of reachability policies.
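
As a concrete picture of the one-hop DHT, here is a small sketch of a consistent-hashing lookup: a host's MAC address is hashed onto a ring of switch identifiers, and its mapping lives at the switch whose hash is the successor of the key. The hash choice and structure are illustrative, not from the paper's implementation.

```python
# Toy one-hop DHT lookup via consistent hashing: the switch storing a
# host's mapping is the successor of hash(MAC) on the ring of switches.

import bisect
import hashlib

def h(value: str) -> int:
    return int(hashlib.sha1(value.encode()).hexdigest(), 16)

class OneHopDht:
    def __init__(self, switch_ids):
        self.ring = sorted((h(s), s) for s in switch_ids)
        self.keys = [pos for pos, _ in self.ring]

    def resolver_for(self, host_mac):
        idx = bisect.bisect(self.keys, h(host_mac)) % len(self.ring)
        return self.ring[idx][1]

dht = OneHopDht(["switch-a", "switch-b", "switch-c"])
print(dht.resolver_for("00:1a:2b:3c:4d:5e"))  # switch storing this mapping
```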

The paper then provides results of several simulations using four topologies that the authors consider representative. They measure SEATTLE's sensitivity to cache eviction timeouts, network changes, host mobility, and switch failures, along with its forwarding table size and control overhead, and compare it with other approaches. They also describe a prototype implementation of SEATTLE.

Critique

I liked this paper. There was quite a bit of time between my reading it and writing this summary, so I can't remember specific criticisms or points I wanted to make about it. The authors did a good job of explaining what was wrong with current solutions, and I liked how they addressed each of those problems. One thing that confused me in their simulations is that they said they leveraged real network traces and supplemented them with synthetic traces, but I must not fully understand how these are used, because it seems like swapping in a different building block (SEATTLE in place of Ethernet) alters behavior in ways that could make a sequence of events from the original trace no longer make sense. One thing I liked about their simulations is that, unlike in many of the other papers we've read, their simulated networks were at least big. I feel like a lot of other experiments that use topologies with only a handful of hosts are not as interesting, although in many cases there is probably nothing wrong with evaluating in that way.

Tuesday, September 15, 2009

Congestion Control for High Bandwidth-Delay Product Networks


D. Katabi, M. Handley, C. Rohrs, "Congestion Control for High Bandwidth-Delay Product Networks," ACM SIGCOMM Conference, (August 2002).

Summary

This paper describes an alternative transport protocol to TCP called the eXplicit Control Protocol (XCP). One major problem with TCP is that as the delay-bandwidth product of links in a network increases, TCP becomes oscillatory and unstable, degrading performance regardless of the congestion control or queueing scheme used at the routers. TCP can also be unfair to connections using high-latency links. In addition, TCP unnecessarily delays the transfer of short flows, which make up most of the flows in the Internet. Thus, this paper presents XCP as an alternative to TCP. Some of the main concepts underlying XCP are the decoupling of utilization control from fairness control and the placement of state in the packets as opposed to the routers. The paper demonstrates that XCP is stable for any link capacity, feedback delay, or number of sources, and that the parameters of XCP are constants independent of the environment. It achieves high utilization, small queues, and near-zero drops in any environment. Other advantages of XCP are that it allows for easy implementation of service differentiation schemes, it distinguishes error losses from congestion losses, it facilitates detection of misbehaving sources, and its performance is an incentive to deployment. Also, it is TCP-friendly.

The basic operation of XCP is as follows. An XCP sender maintains a congestion window (cwnd) and an estimate of round-trip time (RTT), just as in TCP. The XCP protocol defines a special congestion header, carried in each packet, used to communicate a flow's state and feedback. This header contains a cwnd field set to the sender's current cwnd, an RTT field set to the sender's RTT estimate, and a feedback field, initially set to the sender's demand and later reset by the router to provide feedback to the sender. An XCP router has an efficiency controller (EC) and a fairness controller (FC). These compute estimates of the average RTT of all flows over the link and make a single control decision every average RTT, which is the control interval. The EC computes the desired increase or decrease in the aggregate number of bytes to transmit in the next control interval. These bytes are allocated via the feedback field of the packet header. Exactly how they are divided among the flows is decided by the FC, which uses an additive-increase multiplicative-decrease (AIMD) mechanism to converge to fairness: if an aggregate increase is desired, it allocates so that the increase in throughput of all flows is the same, and if a decrease is desired, it allocates so that the decrease in throughput of a flow is proportional to its current throughput. In addition, the FC periodically shuffles the allocation among 10% of the flows at a link in order to speed convergence to fairness.
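
For concreteness, here is a sketch of the efficiency controller's per-interval computation, using the aggregate feedback formula and the gains reported in the paper (alpha = 0.4, beta = 0.226): feedback = alpha * avg_rtt * spare_bandwidth - beta * persistent_queue. Variable names and units are my own choices.

```python
# XCP efficiency controller: aggregate feedback for one control interval.
ALPHA, BETA = 0.4, 0.226  # stability gains derived in the paper

def aggregate_feedback(capacity_bps, input_traffic_bps,
                       avg_rtt_s, persistent_queue_bytes):
    spare_bytes_per_s = (capacity_bps - input_traffic_bps) / 8
    # Positive feedback ramps flows up; negative feedback drains the queue.
    return (ALPHA * avg_rtt_s * spare_bytes_per_s
            - BETA * persistent_queue_bytes)

# Underutilized link, empty queue: positive feedback (bytes to add).
print(aggregate_feedback(1e9, 0.8e9, 100e-6, 0))        # => 1000.0
# Saturated link with a standing queue: negative feedback (bytes to shed).
print(aggregate_feedback(1e9, 1.0e9, 100e-6, 30000))    # => -6780.0
```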

The paper uses a fluid model to prove that XCP is stable independently of delay, capacity, and number of sources. It then compares XCP to TCP Reno over Random Early Discard (RED), Random Early Marking (REM), Adaptive Virtual Queue (AVQ), and Core-Stateless Fair Queueing (CSFQ) queueing policies. The queueing policy used with XCP almost never mattered because XCP almost never drops packets. These simulations show that XCP maintains near-optimal utilization even in the face of capacity increases, feedback delay, increases in the number of flows, and short web-like traffic, whereas TCP's performance degrades. Also, XCP converges to fairness quickly, is not biased against long-RTT flows like TCP, and is robust to high variance in the RTT distribution even though it uses the average RTT. Moreover, XCP is able to dampen oscillations and respond quickly to sudden increases and decreases in traffic. Toward the end of the paper the authors describe how XCP can easily be used to provide differentiated service according to the shadow prices model and to detect misbehaving sources. They also demonstrate that XCP is TCP-friendly.

Critique

I thought this paper was nice to read. It seems like their fluid-model approach from control theory led to a very clean design and analysis of XCP. I tend to agree with their discussion of why the congestion signal shouldn't be packet drops and needn't even be binary. I also thought the extension to provide differential bandwidth allocation using the shadow prices model was good, especially as it seemed so straightforward to implement. I could see where that would be a commercially desirable feature of XCP.

I thought their argument that XCP could be used to facilitate policing agents that detect misbehaving sources was interesting but somewhat lacking. It might be nice to see this in action, or to take a more detailed look at XCP from a security angle and see if there are vulnerabilities that aren't obvious at first glance. This paper made me wonder why XCP isn't supported in the Internet today. I'd be interested to see if anything ever came of this paper or if it was just a thought experiment.

Random Early Detection Gateways for Congestion Avoidance


S. Floyd, V. Jacobson, "Random Early Detection Gateways for Congestion Avoidance," IEEE/ACM Transactions on Networking, (August 1993).

One line summary: This paper presents an algorithm for congestion avoidance called Random Early Detection (RED) that drops packets probabilistically when the computed average queue length is between a minimum and maximum threshold in order to avoid congestion.

Summary

This paper presents an algorithm for congestion avoidance called Random Early Detection (RED). The authors argue that the most effective congestion detection can occur at gateways as opposed to the endpoints. In the current Internet, TCP only detects congestion when a packet has been dropped, meaning the queue at the gateway is full. Large queues at gateways are especially undesirable in networks with large delay-bandwidth product connections as it can significantly increase the average delay in the network. The authors also assume that it is useful to have queues at gateways in which traffic is multiplexed together with FIFO scheduling, and that per-connection gateway mechanisms such as those used in Fair Queueing should only be used when not having per-flow mechanisms is clearly inadequate. As RED does not use per-connection mechanisms, it is intended for networks where transport protocols are cooperative and respond to congestion indications, although it can be used to control congestion even when sources are not cooperative. In addition to the goals of keeping average queue size low and detecting congestion, RED places emphasis on avoiding global synchronization of TCP connections and avoiding bias against bursty traffic.

The RED algorithm works by first calculating the average queue size using a low-pass filter with an exponentially weighted moving average. RED defines two thresholds, a minimum and a maximum. When the average queue size is below the minimum threshold, no packets are dropped. When it is above the maximum threshold, all arriving packets are dropped. When it is between the two thresholds, each arriving packet may be dropped with a probability that is a function of the average queue size; as a result, the fraction of a connection's packets that are dropped is roughly proportional to that connection's share of the bandwidth at the gateway. Later in the paper, the authors discuss setting the parameters for the algorithm. The weight used in calculating the average should be selected to be not so low that the average responds too slowly to changes in the actual queue size, but not so high that the average fluctuates too wildly in response to bursty traffic. The minimum threshold shouldn't be too low if traffic is fairly bursty. The maximum threshold depends on how high an average queue length is tolerable. A rule of thumb is that the difference between the threshold values should be larger than the typical queue length increase in one RTT, in order to avoid global synchronization.
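
Here is a short sketch of that dropping decision, including the count-based correction the paper uses to space drops out evenly; parameter values are illustrative.

```python
# RED dropping decision: EWMA average queue length, two thresholds, and a
# probability ramp between them. Parameter values are illustrative.

import random

class RedQueue:
    def __init__(self, wq=0.002, min_th=5, max_th=15, max_p=0.02):
        self.wq, self.min_th, self.max_th, self.max_p = wq, min_th, max_th, max_p
        self.avg = 0.0   # EWMA of the instantaneous queue length
        self.count = 0   # packets since the last drop (spaces drops out)

    def should_drop(self, queue_len):
        self.avg = (1 - self.wq) * self.avg + self.wq * queue_len
        if self.avg < self.min_th:
            self.count = 0
            return False
        if self.avg >= self.max_th:
            self.count = 0
            return True
        # Between thresholds: probability grows with the average queue...
        pb = self.max_p * (self.avg - self.min_th) / (self.max_th - self.min_th)
        # ...and is adjusted so drops are spread out rather than clustered.
        pa = pb / max(1 - self.count * pb, 1e-9)
        self.count += 1
        if random.random() < pa:
            self.count = 0
            return True
        return False
```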

The authors state that RED meets the following goals: congestion avoidance, congestion detection at the same time scale as connections can respond to congestion, no global synchronization, simplicity, maximizing the throughput-to-delay ratio, no bias against bursty traffic (which the authors somewhat misleadingly term “fairness”), and appropriateness for a wide range of network environments. To show this they conduct a number of simulation experiments that compare RED with Drop Tail and Random Drop gateways. They also show how RED gateways can be used to identify connections using more than their fair share of bandwidth by noting which connections are having more of their packets dropped, as the probability of a connection having its packets dropped is assumed to be equal to that connection’s fraction of the link bandwidth through the gateway. Lastly, they provide an efficient implementation of the RED algorithm.

Critique

I didn’t really like this paper. I found it interesting that they argued for FIFO queueing, and I’m not sure I found their arguments as compelling after reading the other congestion avoidance papers we read. I also found their talk about their algorithm being “fair” rather misleading, since they don’t mean fair in the accepted sense. I think they gloss over the issue of true fairness as it was talked about in the Fair Queueing paper a little bit too much. Also, proper setting of the parameters of their algorithm seems to be important, but despite their rough guidelines it may not always be clear what the proper setting is. This seems like an undesirable property of their algorithm. It does seem like a step up from Drop Tail and a smaller step up from Random Drop, I will give them that. However, as I think we mentioned in class, I don’t think that this has led it to be widely implemented anyway. On more of a nit-picky note, I thought that a lot of their phrasing was awkward and their graphs were absolute crap (difficult to read, didn’t even have legends). I didn’t think the paper needed to be 20 pages long.

Wednesday, September 9, 2009

Core-Stateless Fair Queueing: Achieving Approximately Fair Bandwidth Allocations in High Speed Networks


I. Stoica, S. Shenker, H. Zhang, "Core-Stateless Fair Queueing: Achieving Approximately Fair Bandwidth Allocations in High Speed Networks," ACM SIGCOMM, (August 1998).


One line summary: This paper describes Core-Stateless Fair Queueing, a mechanism for implementing fair allocation in networks in such a way that minimizes the amount of state stored and the complexity of algorithms implemented in core network routers by relying on edge routers to perform flow rate estimation and labeling.

Summary

This paper presents a mechanism for fair allocation called Core-Stateless Fair Queueing (CSFQ). CSFQ treats contiguous portions of the network as islands consisting of edge routers and core routers that perform different functions to provide fair allocation in a simple and effective way. Edge routers compute per-flow rate estimates and label packets passing through them for use by core routers. Core routers keep no per-flow state; they execute FIFO queueing and probabilistic dropping using the information in the packet labels put in place by the edge routers. The authors' motivation for designing CSFQ begins with the assumptions that fair allocation is important to congestion control and that the complexity of existing mechanisms hinders their adoption. The goal of CSFQ, then, is to achieve fair allocation while avoiding the need for core routers to maintain per-flow state or use complicated algorithms.

The authors use a fluid model to design and analyze CSFQ. Their fairness model is max-min fairness, which the algorithm uses to calculate the fair share of each flow. When the total arrival rate of all flows is less than the link capacity, no packets are dropped. When the arrival rate exceeds the link capacity, flows that are using more than their fair share have packets dropped with a probability based on how far they are over their fair share. Two challenges in this algorithm are estimating flow rates and estimating the fair share. To estimate flow arrival rates, the algorithm uses an exponential averaging formula. Estimating the fair share is more complicated: it is based on the prior estimate of the fair share, the estimated total arrival rate, the estimated rate at which the algorithm accepts packets, and whether or not the link is congested. Edge routers estimate flow rates and use them to label packets. Note that a label must be rewritten at a congested link because the estimated rate is no longer accurate downstream; the label is rewritten to the minimum of the estimated rate and the fair share rate. During congestion, core routers use the labeled flow rate and the fair share estimate to perform probabilistic dropping of flows' packets based on how far beyond their fair share those flows are. The authors briefly mention an extension to CSFQ that uses weights. They provide performance bounds for CSFQ and show that flows cannot use more than their fair share over the long run, so the algorithm does ensure fairness. They also demonstrate how it could be implemented efficiently in contemporary routers.
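
The core-router side of this is simple enough to sketch in a few lines: given a packet labeled with its flow's arrival rate r and the link's current fair-share estimate alpha, drop with probability max(0, 1 - alpha/r). The fair-share estimation loop itself is omitted here; names are mine.

```python
# CSFQ core-router forwarding decision and edge relabeling. The fair-share
# estimation loop is omitted; see the paper for that part.

import random

def forward_packet(label_rate, fair_share):
    """Return True to forward the packet, False to drop it."""
    if label_rate <= 0:
        return True  # flow is under any positive fair share
    drop_prob = max(0.0, 1.0 - fair_share / label_rate)
    return random.random() >= drop_prob

def relabel(label_rate, fair_share):
    # At a congested link, the label is rewritten to the rate the flow
    # actually receives downstream: min(arrival rate, fair share).
    return min(label_rate, fair_share)
```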

The authors compare CSFQ with four other algorithms. Two are baseline cases that do not ensure fairness: FIFO and RED. The other two represent alternative approaches to ensuring fairness: FRED and DRR. The authors note that CSFQ edge routers have complexity comparable to FRED, while core routers have complexity comparable to RED. This is important because one of the authors' main arguments is that CSFQ is simpler and thus more feasible to deploy. Their findings include that CSFQ provides reasonable fairness even in the face of ill-behaved sources, works with different underlying flow control schemes, performs well over multiple congested links, reacts well to bursty traffic, performs reasonably when there is a large link delay, and works well even when packets must be relabeled. In general, CSFQ provides fairness comparable to FRED but in a much more scalable manner. The authors also show how CSFQ can be extended to punish unfriendly or unresponsive sources.

Critique

In general, I liked this paper. I especially thought their experiments were nice and clean and revealed interesting aspects of the systems tested. However, as the authors admit, their experiments do not fully reveal how CSFQ would operate over more complex topologies, especially one containing large latencies. Large latency links seem to be a hazardous area for many a network protocol.

One concern is whether CSFQ would in fact be easier to adopt and administer than other fair queueing algorithms. That was one of their main motivations for proposing the architecture, but the fact that it has to be adopted on an “island-by-island” basis as opposed to a router-by-router basis seems potentially awkward. As the authors freely admit in the beginning of their paper, their underlying assumptions are open to debate. In particular, it was (and still is?) arguable whether other fair allocation mechanisms are too complex to implement and deploy in contemporary routers. While this may indeed turn out to be false, I think I recall discussing in class the fact that fair allocation mechanisms weren’t really implemented in most routers today. I may be misremembering this, but if not, then we can’t really say if the authors’ goal of providing a simple and feasibly deployable fair allocation mechanism has been met, since it hasn’t really been tested in practice.


Analysis and Simulation of a Fair Queueing Algorithm


A. Demers, S. Keshav, S. Shenker, "Analysis and Simulation of a Fair Queueing Algorithm," Internetworking: Research and Experience, 1 (1990), pp. 3-26.


One line summary: This paper proposes a fair queueing algorithm for congestion control at gateways and provides analysis and simulation results for this algorithm.

Summary

This paper proposes a fair queueing algorithm for congestion control at gateways. As the authors point out, there are two points at which congestion control can be implemented. The first is at the source and the second is at the gateways. At the source, flow control algorithms, while not directly designed to control congestion, do affect overall network traffic. At the gateways, congestion control can be implemented through routing algorithms, such as adaptive routing, and through queueing algorithms, which affect congestion indirectly. This paper focuses on the latter.

The authors note that queueing algorithms can be thought of as allocating bandwidth, promptness, and buffer space. The most commonly used queueing algorithm, first-come-first-served (FCFS) does not address each of these things separately. The authors argue that it is important to consider how flow control algorithms interact with queueing algorithms to affect network congestion. With this in mind, a major shortcoming of FCFS is that it does not provide good congestion control in the presence of ill-behaved sources in the network i.e. sources that do not provide good flow control and thus flood the network with packets. Such ill-behaving sources can unfairly capture high portions of the outgoing bandwidth of gateways implementing FCFS. In response to this, the authors propose a fair queueing (FQ) algorithm.

The authors' requirements for their algorithm are: it must be able to allocate bandwidth and buffer space fairly; it must be able to control promptness/delay somewhat independently of bandwidth and buffer space; and it should provide service that does not depend discontinuously on a packet's time of arrival. To define fairness, the authors use the max-min fairness criteria, which state that an allocation is fair if (1) no user receives more than it requests, (2) no other allocation satisfying the first condition has a higher minimum allocation, and (3) the second condition remains recursively true as the minimal user is removed and the total resource amount is reduced accordingly. The authors define a user as a source-destination pair, or a conversation. They then define their algorithm, which is a packetized version of a bit-by-bit round-robin scheme. Essentially, they allocate a queue to each conversation. They then consider the packets at the head of each queue and estimate what they call the 'finish time' for these packets, which is the time at which each packet would finish being serviced under bit-by-bit round-robin. Finally, the algorithm selects the packet with the lowest finish time for transmission. The authors build on this algorithm by introducing an additional quantity that they call the bid. In this improved version of the algorithm, packets are selected for transmission based on their bid, which is calculated in such a way as to give less delay to users using less than their fair share of bandwidth. When the gateway's buffers begin to fill, it must choose packets to drop; the authors' algorithm drops the last packet from the user currently using the most buffer space.
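
Here is a compact sketch of the basic finish-time mechanism described above (without the bid refinement for promptness). The round-number bookkeeping is deliberately simplified to a monotone clock; in the real algorithm R(t) advances at a rate that depends on the number of active conversations.

```python
# Packetized bit-by-bit round robin: each packet gets a finish time based
# on its flow's previous finish time and the current round number, and the
# packet with the smallest finish time is transmitted next. The round
# clock here is a crude simplification.

import heapq
import itertools

class FairQueue:
    def __init__(self):
        self.last_finish = {}         # flow -> finish time of its last packet
        self.heap = []                # (finish_time, tiebreak, packet)
        self.round = 0.0              # simplified stand-in for R(t)
        self.seq = itertools.count()  # tie-breaker for equal finish times

    def enqueue(self, flow, packet, length):
        # Start where the flow's previous packet finished, or at the
        # current round if the flow has been idle.
        start = max(self.last_finish.get(flow, 0.0), self.round)
        finish = start + length
        self.last_finish[flow] = finish
        heapq.heappush(self.heap, (finish, next(self.seq), packet))

    def dequeue(self):
        finish, _, packet = heapq.heappop(self.heap)
        self.round = finish
        return packet
```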

The authors then analyze their algorithm with respect to two types of sources: FTP-like sources and telnet-like sources. They do a bunch of annoying math and conclude that their algorithm is indeed able to provide lower delay to lower-throughput sources. They then consider their algorithm with respect to flow control algorithms. They consider three types: (1) generic flow control algorithms using a timeout mechanism and a sliding window of set size, (2) second-generation flow control algorithms using dynamic window sizes, fast retransmit, improved timeout calculation, and congestion signals, and (3) third-generation flow control algorithms, which are similar to the second except that congestion signals are sent selectively. They select one flow control algorithm from each category and use each with their FQ algorithm and with FCFS to compare the two via simulation. They provide a lot of results, but the main lesson is that FQ gateways cannot control congestion by themselves but can help protect against ill-behaved sources. Their FQ algorithm isn't perfect, and its performance depends a lot on the flow control algorithm used and the specific network dynamics, but it still provides much better fairness than FCFS.

Critique

There are a few things I didn't like about this paper. I found all the mathematical analysis a bit much. It was annoying to read, especially the preemptive versus nonpreemptive bit, which didn't factor in much anywhere else in the paper. I think it would have been better if the authors had put all the mathematical details in an appendix and used the space instead to state the main point and conclusion of all that analysis, which, as far as I can tell, they did in just one sentence out of 3-5 pages. The freed-up space could then have been used to provide some pseudocode or examples of their algorithm in action. These things would have made their paper much more appealing.

The authors also point out that there are objections to their algorithm, which might be interesting to discuss. The first objection is that some source-destination pairs, such as mail servers, need more than their fair share of bandwidth. The second is that their algorithm is too complicated to implement on routers in such a way that it could run fast enough to match the speed of the links. It wouldn't be very useful or compelling if their algorithm slowed the routers down.

Tuesday, September 8, 2009

Analysis of the Increase and Decrease Algorithms for Congestion Avoidance in Computer Networks


D-M Chiu, R. Jain, "Analysis of the Increase and Decrease Algorithms for Congestion Avoidance in Computer Networks," Computer Networks and ISDN Systems, 17 (1989), pp 1-14.


One line summary: In this paper the authors algebraically analyze a subset of algorithms for congestion avoidance with respect to a number of criteria, including efficiency, fairness, distributedness, and convergence; they are able to conclude that the optimal algorithm uses additive increase and multiplicative decrease.

Summary

In this paper, the authors analyze decentralized algorithms for congestion avoidance based on binary feedback signals, focusing on what they call increase/decrease algorithms. They focus on an abstract system in which there is one bottleneck resource used by n users. They assume that the binary feedback signal indicates whether this resource is overloaded or underloaded, and that the signal is sent by setting a bit in the packet header. They also assume that the feedback and control loop is synchronous across all users. The authors focus on linear control functions. In this context then, increase/decrease algorithms are merely those in which one signal indicates that the resource is underloaded and the load should be increased, and the other signal means the opposite. Examples of such algorithms include multiplicative increase/additive decrease and additive increase/multiplicative decrease.

The criteria by which the authors select controls are efficiency, fairness, distributedness, and convergence. Efficiency is measured by how close the total load on a resource is to the point beyond which any further load would cause a decrease in response time and throughput, i.e., the point of maximum utilization and throughput and lowest response time. Fairness is measured by a function the authors define that attempts to capture how equal the allocation of the resource is across all users. Distributedness means that users have only limited knowledge of the system, in particular, whether the resource is overloaded or underloaded. Lastly, convergence is a measure of both responsiveness, or how quickly the system reaches equilibrium, and smoothness, or how large the oscillations around the equilibrium are. They choose to examine distributed linear control functions that converge to efficient and fair states.

In order to analyze the set of possible control functions, the authors represent system state transitions in an n-dimensional vector space (representing the load for each of n users) across discrete time slots. The ith user's load at time t is x_i(t), and the load at the next time interval, x_i(t+1), is a function of the control scheme used, the load at time t, and the system feedback. For example, it could be that x_i(t+1) = a + b*x_i(t) if the binary signal indicates the resource is underloaded; in this function, a is the additive component and b the multiplicative component. With this representation, they are able to deduce algebraically a number of propositions. The first is essentially that the decrease policy should be multiplicative and the increase policy should have an additive component and can optionally have a multiplicative component with a coefficient no less than one. The second proposition is that both the increase and the decrease can have both additive and multiplicative components, provided that a special step is taken when certain convergence restrictions would be violated. The third and most conclusive proposition is that for feasibility and optimal convergence, the increase should be additive and the decrease should be multiplicative.
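
A tiny simulation makes the third proposition easy to see: with additive increase and multiplicative decrease, two users sharing one bottleneck drift toward equal allocations, because additive increase preserves the absolute gap between them while multiplicative decrease shrinks it. The constants below are arbitrary illustrative choices.

```python
# AIMD convergence toward fairness for two users and one bottleneck.

def step(x1, x2, capacity=100.0, a=1.0, b=0.5):
    if x1 + x2 <= capacity:   # feedback bit: resource underloaded
        return x1 + a, x2 + a # additive increase keeps the gap fixed...
    return x1 * b, x2 * b     # ...multiplicative decrease shrinks it

x1, x2 = 70.0, 10.0           # start far from a fair allocation
for _ in range(200):
    x1, x2 = step(x1, x2)
print(round(x1, 1), round(x2, 1))  # the two loads end up nearly equal
```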

At the end of the paper, the authors describe what nonlinear control functions in congestion avoidance algorithms might look like, but reason that because such controls are too sensitive to various system parameters, their robustness is decreased. On this note, they go on to enumerate a few practical considerations in choosing an algorithm for congestion avoidance. The first is that it should be as independent of hardware and software parameters as possible, in order to reduce the complexity of the configuration task and make configuration errors less likely. The second consideration is that resources and allocations must be measured in integer quantities, so this should be taken into account when choosing an algorithm. The last practical consideration they mention is ease of implementation. Given these considerations in addition to the results of their earlier analysis, they conclude that an algorithm that uses a linear control function with additive increase and multiplicative decrease is optimal.

Critique

I didn’t like everything about this paper. I thought that it was a bit too abstract, and in general it is no trivial matter to translate abstract results into real working solutions. However, given that there appear to be so many different algorithms for congestion avoidance, of which they examine only a subset, their analysis does provide some solid guidance in choosing among them, at least if you accept that the conclusions drawn from their simplified system still hold in the much more complex environment of real networks. Since real TCP uses additive increase/multiplicative decrease, that seems to have been a safe bet.

I’d be curious to see if algorithms with control functions other than additive increase/multiplicative decrease have been tested in practice. The authors mention that this paper is just one in a series of papers in which they examine a range of mechanisms for congestion avoidance, so an interesting exercise might be to take some of the algorithms they look at in their other papers and actually test them against additive increase/multiplicative decrease, and also maybe examine some algorithms that use nonlinear control functions.

One thing I did really like about this paper is their use of graphs. Whereas some of the algebra was at times tedious or difficult to follow, I thought the graphs provided a clear and more intuitive explanation of how the control functions worked and how they converged to fairness and efficiency. Although these graphs were simplified even beyond the paper's simplifying assumptions, in that they represented the case with only two users, they were still key in helping me understand some of the conclusions.

Monday, September 7, 2009

Congestion Avoidance and Control


V. Jacobson, M. Karels, "Congestion Avoidance and Control," ACM SIGCOMM Conference, (August 1988).


One line summary: This paper describes several important algorithms for congestion control in TCP, including slow-start, exponential backoff, multiplicative decrease/additive increase, and RTT variance estimation; it also presents results demonstrating that TCP implementations using these algorithms outperform older implementations.

Summary


The authors note that due to an explosive increase in size, computer networks have increasingly experienced congestion problems. Seven algorithms added to TCP are meant to help deal with this: round-trip time (RTT) variance estimation, exponential retransmit timer backoff, slow-start, a more aggressive receiver ACK policy, dynamic window sizing on congestion, Karn’s algorithm, and fast retransmit. The authors describe the first five of these algorithms. According to the authors, these algorithms stem from the observation that a TCP connection should obey a “conservation of packets” principle, which states that for a connection in equilibrium, the sender should not place a new packet into the network until an old packet has left. There are three ways this principle can fail: the connection doesn’t initially reach equilibrium; after equilibrium is reached, a sender injects a new packet before an old one has left; or resource limitations along the connection path make equilibrium unattainable. The authors address each of these in turn.

The first failure occurs when a connection is being started or restarted. The authors note that a TCP connection is a self-clocking system, which helps make it stable; however, this also means that in order to get packets flowing there must be ACKs to clock them out, but to get ACKs there must be packets flowing. The authors thus describe the slow-start algorithm as a method for “starting the clock” and gradually increasing the number of packets in flight. The slow-start algorithm is as follows: for each connection, store a variable called the congestion window that will be used to determine the number of packets to send. When starting or restarting, set the congestion window to one packet. Every time a packet is acknowledged, increase the congestion window by one. To select how many packets to send at a time, choose the minimum of the congestion window and the receiver’s advertised window. The result of this algorithm is that if no packets are lost or delayed, the congestion window doubles every RTT, and the sender never sends at more than twice the maximum rate the connection path can support. The authors find that this algorithm greatly decreases the number of retransmits and increases effective bandwidth.
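A minimal sketch of that growth pattern (my own illustration, not the authors' code; the advertised window of 64 packets is an assumed value):

```python
# Slow start as described above: cwnd starts at one packet and grows by one
# per ACK, so it doubles every round trip until it hits the advertised window.

cwnd = 1          # congestion window, in packets
advertised = 64   # receiver's advertised window (assumed value)

for rtt in range(7):
    in_flight = min(cwnd, advertised)   # packets sent this round trip
    print(f"RTT {rtt}: sending {in_flight} packets")
    cwnd += in_flight                   # one increment per ACK -> doubling
```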

The authors next explain that the second failure must be due to a sender’s faulty retransmit timer. This timer should estimate the RTT, which grows as load grows, in order to determine when a packet should be retransmitted. One common mistake is failing to estimate the variance in the RTT; the authors present an easy-to-calculate method for estimating this variance in their appendix. Another mistake is not choosing the correct backoff after a packet is retransmitted; the authors argue that only exponential backoff schemes have any hope of working in the general case. The authors demonstrate that a timer using their estimation method approximates the actual RTT much more accurately than the method previously used in TCP.
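Here is a rough floating-point sketch of that estimator as I understand it; the paper's appendix uses scaled integer arithmetic, and the gains of 1/8 and 1/4 and the deviation multiplier of 4 are the commonly cited values rather than anything taken verbatim from this text:

```python
# Jacobson-style RTT estimation: a smoothed RTT plus a smoothed mean
# deviation (a cheap stand-in for variance), combined into a retransmit
# timeout. The gains and the first sample below are assumed values.

def update_rto(srtt, rttvar, sample, g=0.125, h=0.25):
    """Fold one RTT sample into the estimates; return (srtt, rttvar, rto)."""
    err = sample - srtt
    srtt += g * err                      # smoothed round-trip time
    rttvar += h * (abs(err) - rttvar)    # smoothed mean deviation
    return srtt, rttvar, srtt + 4 * rttvar

srtt, rttvar = 1.0, 0.0   # seed from a hypothetical first sample of 1.0s
for sample in (1.1, 0.9, 1.4, 2.0):
    srtt, rttvar, rto = update_rto(srtt, rttvar, sample)
    print(f"sample={sample:.1f}s  srtt={srtt:.2f}s  rto={rto:.2f}s")
```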

Lastly, the authors point out that if the timers are doing a good job of estimating the RTT, then a timeout probably indicates a lost packet rather than a broken timer. Lost packets are typically a sign that the network is congested and that connections are hitting resource limits, which is the third failure case. Connection endpoints must respond to congestion by decreasing the amount they send when the network is congested and increasing it when possible, in order to achieve proper utilization of network resources while avoiding congestion. To this end, the authors introduce a multiplicative decrease/additive increase algorithm. The decrease happens much faster than the increase because it is easy for the network to become congested and much harder for it to become uncongested. The algorithm works as follows: on a timeout, set the congestion window to half of its current size (multiplicative decrease). For each packet acknowledgement received thereafter, increase the congestion window (cwnd) by 1/cwnd (additive increase). Then, when choosing the number of packets to send, choose the minimum of the congestion window and the receiver’s advertised window. The authors note that this does not include slow-start, because they are two entirely different algorithms, but that the two should be implemented together, usually by having separate slow-start and congestion avoidance phases. The authors show that their congestion avoidance approach results in fewer retransmits and higher bandwidth utilization.
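A sketch of how those two phases are usually combined (again my own reading, not the paper's code): a slow-start threshold, conventionally called ssthresh, records half the old window at a timeout, and slow start runs back up to it before the linear increase takes over:

```python
# Combined slow start and congestion avoidance, with window sizes in
# packets. The initial ssthresh of 64 is an assumed value.

cwnd, ssthresh = 1.0, 64.0

def on_ack():
    """Called once per acknowledged packet."""
    global cwnd
    if cwnd < ssthresh:
        cwnd += 1.0           # slow-start phase: window doubles each RTT
    else:
        cwnd += 1.0 / cwnd    # congestion avoidance: ~1 extra packet per RTT

def on_timeout():
    """Called when the retransmit timer fires (taken as a congestion signal)."""
    global cwnd, ssthresh
    ssthresh = cwnd / 2.0     # remember half the window (multiplicative decrease)
    cwnd = 1.0                # restart the ACK clock via slow start
```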

Critique

The algorithms the authors propose in this paper later made it into newer implementations of TCP, so in that sense they were very successful. Their results demonstrate that their algorithms were generally an improvement over previous implementations. However, as the authors point out, there were still improvements to be made. For instance, none of their solutions ensure fair sharing of network bandwidth, so they propose this as an area of future work. (Then again, according to Kurose and Ross, additive increase/multiplicative decrease was shown by Chiu and Jain in 1989, a year after this paper was published, to converge to an equal sharing of bandwidth, so perhaps that work did not need to be done after all.)

Also, in certain situations, TCP’s exponential backoff and congestion avoidance can cause serious performance problems. Regular TCP can be improved for high-performance networks, and there are extensions to TCP for this purpose. Some of the later changes to TCP’s congestion avoidance algorithms were suggested in order to let it better handle “long fat networks” (LFNs) and high-latency links. Lastly, some of the assumptions that the authors make about the network aren’t always valid. For instance, the authors state that packets are lost due to either network congestion or packet damage, but that the latter occurs so infrequently that it is safe to assume a timeout is almost always a sign of network congestion. In wireless networks, however, this assumption does not generally hold, so TCP’s congestion avoidance can result in link underutilization there. A modified mechanism that takes loss on the wireless link into account would thus be more appropriate for TCP in wireless networks.

I thought this paper was good, and obviously the algorithms described in it are important, as they are now an integral part of TCP. Congestion control is a critical service, without which the Internet could not function well, if at all. I was somewhat confused as to how many of the suggested algorithms were original contributions, since some of the footnotes seem to indicate that very similar algorithms had first been suggested elsewhere. The authors’ tendency to reference a handful of papers I hadn’t read, without briefly explaining what was in them, inhibited my understanding. And as a matter of personal taste, I did not like that so much information was crammed into the footnotes and appendices; to me this made the paper less readable, although aside from this it was well written.


[EDIT: From my comment about fairness as examined by Chiu it's probably clear that I read this paper before I read the other one. ;-) ]

Thursday, September 3, 2009

Understanding BGP Misconfiguration


R. Mahajan, D. Wetherall, T. Anderson, "Understanding BGP Misconfiguration," ACM SIGCOMM Conference, (August 2002).

One line summary: In this paper the authors identify and analyze BGP misconfiguration errors, measure their frequency, classify them into a number of types, examine their causes, and suggest a number of mechanisms for reducing these misconfigurations.

Summary

This paper provides a quantitative and systematic study of BGP misconfiguration. The authors classify misconfigurations into two main types: origin misconfiguration and export misconfiguration. Origin misconfiguration occurs when an AS inadvertently advertises an IP prefix and it becomes globally visible. Export misconfiguration occurs when an AS fails to filter a route that should have been filtered, thereby violating the policies of one or more of the ASs in the AS path. They identify a number of negative effects of such misconfigurations, including increased routing load, connectivity disruption, and policy violation. In order to measure and analyze misconfigurations, the authors collected data from 23 peers in 19 ASs over a period of three weeks. They examine new routes and assume that those that are short-lived are likely due to misconfigurations or failures, and so select these to investigate. As part of their investigation they used an email survey of the operators of the ASs involved, as well as a connectivity verifier to determine the extent of disruptions. They note that their method underestimates the number and effect of misconfigurations for various reasons.


The authors first describe their results for origin misconfiguration analysis. They classify the new routes that are potential results of origin misconfiguration into three categories: self-deaggregation of prefixes; announcement of a new route whose origin is related to the origin of the old route via their AS paths; and announcement of a new route with a foreign origin, unrelated to that of the old route. They observe that the number of incidents from each of these three categories is roughly the same, with self-deaggregation slightly higher. They note, however, that the success rates for identifying these different types of origin misconfiguration differ, as some incidents classified as origin misconfigurations were actually the result of failures. Some interesting conclusions they draw from their analysis are that at least 72% of new routes seen by a router in a day are the result of misconfiguration; that 13% of incidents cause connectivity disruptions, mainly from new routes of foreign origin; that, compared to failures, connectivity disruptions due to misconfigurations play a small role; and that 80% of misconfigurations are corrected within an hour, often sooner if the misconfiguration disrupts connectivity. The authors next examine export misconfigurations. They note that such misconfigurations do not tend to cause connectivity problems directly, and that most incidents involved providers rather than peers. The authors also examine the effect of misconfigurations on routing load, and conclude that in the extreme case, load can spike to 60%.


In the paper, the authors identify a number of causes of misconfiguration, which they classify into slips and mistakes. Slips and mistakes turn out to be roughly equally responsible for misconfiguration. Mistakes causing origin misconfiguration that they identified include initialization bugs, reliance on upstream filtering, and use of old configurations. Slips causing origin misconfiguration include accidents (such as typos) in specifying redistribution, attachment of the wrong community attribute to prefixes, hijacks, forgotten filters, incorrect summaries, unknown errors, and miscellaneous problems. They also identify three additional mistakes causing export misconfiguration: prefix-based configuration, bad ACLs or route maps, and initialization bugs. Lastly, they identify a number of causes of short-lived new routes that are not misconfigurations, including failures, testing, migration, and load balancing.


Lastly, the authors suggest a number of ways to reduce misconfigurations. These include making improvements to router CLIs, implementing transactional semantics for configuration changes, developing and supporting high-level configuration tools, developing configuration checkers, and building database consistency mechanisms. They also note that S-BGP, a proposed security extension to BGP, would prevent about half of the misconfigurations they observed.

Critique

In general, I thought this was an entertaining read, especially as I could relate to how difficult router CLIs are to use and how easy it is to make mistakes configuring BGP, having had to do so in a previous networks class. It is unfortunate that, because misconfigurations are hard to identify, the authors’ methodology was necessarily limited. I’d be interested to see whether newer techniques have been or could be developed for this and similar sorts of analysis. Due to the weaknesses in their methodology, I’m not sure how meaningful some of the figures and percentages they derive actually are, but they do still provide some interesting insights, especially if the authors are correct in arguing that their study provides a lower bound. That said, I still think their approach was clever, given the difficulties.

I particularly liked their suggestions for reducing some of the causes of misconfiguration. User interface design improvements seemed to me the most obvious thing to do. In general, I wonder why many of their suggested solutions, which have been used in many other contexts and computer systems, have not been used in router configuration. Although the authors do briefly discuss some of the barriers to implementing such solutions, it still surprises me that harried system administrators haven’t risen up and demanded that at least some of the more obvious steps be taken sooner; then again, maybe the potential for missteps makes things more interesting for them. It’s hard to say. I think investigation of some of the improvements they suggest would be an interesting area for research, although they did seem to imply at one point that industry can be a barrier to the adoption of some of these improvements, which might be too frustrating to deal with.

Interdomain Internet Routing


H. Balakrishnan, "Interdomain Internet Routing," MIT Lecture Notes.


One line summary: In this lecture the author describes the Border Gateway Protocol (BGP), used to exchange reachability information for interdomain routing in the Internet, explaining its operation as well as pointing out several flaws in the protocol.

Summary

This lecture starts by explaining that Internet service is provided by a large number of commercial entities called Internet Service Providers (ISPs), which own and administer portions of the wide-area routing infrastructure called autonomous systems (ASs). ASs are classified into tiers based on their size and routing scope; e.g., the ISPs that have global scope are Tier-1 ISPs, whereas local providers tend to be Tier-3 ISPs. Within each AS, routing is done via Interior Gateway Protocols (IGPs) such as RIP and OSPF, while routing between ASs is accomplished through BGP, which is used to exchange routing information between them. AS interconnections are generally either transit or peering connections. In a transit connection, one AS acts as the customer and the other as the provider, where the provider gives the customer access to some or all of the destinations in its routing tables. In a peering connection, ASs share some subset of their routing tables with each other; these connections tend to be between business competitors. The first kind of interconnection generates revenue for the provider, while the second does not.

BGP consists of two parts: eBGP between ASs and iBGP within an AS. In BGP, one router opens a BGP connection with another router using an OPEN message. The routers then exchange some of their routing information, subject to the filtering rules ASs use to determine which routes to share. This exchange is done via UPDATE messages, which consist of an IP prefix followed by a number of attributes; these attributes are what ASs use to apply their filtering rules. Some of these attributes include Next Hop, AS Path, Local Preference, and Multiple-Exit Discriminator (MED). As mentioned, eBGP is used to disseminate routes between ASs, and eBGP connections are usually point-to-point. Two main goals eBGP must meet are loop-free forwarding and complete visibility. Within an AS, routers use iBGP to learn about routes external to the AS, although routing within the AS itself is handled by one of the IGPs. Ideally, the iBGP routers within an AS would be organized into a full mesh, but because a full mesh of n routers requires on the order of n^2 sessions, this solution does not scale for ASs with large numbers of internal routers. There are two methods for organizing iBGP routers into a hierarchy to improve scalability: the first is to use route reflectors, and the second is to set up confederations.
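As a back-of-the-envelope illustration of the mesh-scaling point (my own sketch, not from the lecture notes), compare iBGP session counts for n internal routers:

```python
# Why a full iBGP mesh does not scale: session counts for n internal routers.

def full_mesh_sessions(n: int) -> int:
    return n * (n - 1) // 2   # every router peers with every other router

def route_reflector_sessions(n: int) -> int:
    return n - 1              # each client peers only with a single reflector

for n in (10, 100, 1000):
    print(f"n={n:4d}: full mesh={full_mesh_sessions(n):6d}, "
          f"one route reflector={route_reflector_sessions(n):4d}")
```

Even at a few hundred internal routers, the full mesh requires tens of thousands of sessions; route reflectors and confederations exist to avoid exactly this.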

Critique

There are many issues with BGP that cause it to function less well than desired. One issue is that because peering interconnections between ASs usually occur between business competitors, financial motivations lead ASs to try to force other ASs to use up their resources (i.e., bandwidth) carrying packets across the wide-area network, producing a large amount of asymmetric routing. In addition, as mentioned before, there is the issue of scalability. Even bigger issues are caused by the lack of origin authentication. Because of this, misconfigurations or malicious users can cause a route to any IP prefix to be incorrectly advertised from an AS, causing traffic to be routed to that AS even though it does not actually own that set of addresses. The propagation of incorrect routing information via BGP can affect huge portions of the Internet and leave users unable to connect to the mis-advertised addresses. Also, by advertising false routing information via BGP, malicious users can engage in activities such as transmitting hard-to-trace email spam. An additional issue with BGP is slow convergence: BGP can take several minutes to converge, and to prevent the propagation of transient faulty routing information, many routers adopt the policy of ignoring frequently changing advertisements, which increases convergence time further. Another issue is that the growth of multi-homed customer networks is adding stress by increasing routing complexity, churn, and convergence time.

There are several things I found interesting about this lecture. One is the vulnerability of BGP to misconfigurations and malicious activity. I wonder if it is possible to “fix” BGP to address this issue or if it would be more advisable to attempt to replace BGP with a better protocol, if one even exists. Given the capacity for massive routing failures caused by BGP, I wonder what measures are taken to prevent them from occurring more often. I also found the discussion of the effect of financial interests of ISPs on routing interesting. It seems that ISPs engage in a fair amount of subterfuge, subverting routing goals. It might be interesting to examine this activity in more detail, and possibly determine if it would be possible to design a protocol to prevent this sort of behavior or create incentives to improve routing in the Internet, or if this sort of behavior is unavoidable given the nature of the participants.