
Thursday, November 12, 2009

X-Trace: A Pervasive Network Tracing Framework


R. Fonseca, G. Porter, R. H. Katz, S. Shenker, I. Stoica, "X-Trace: A Pervasive Network Tracing Framework," NSDI'07, (April 2007).


One line summary: This paper describes X-Trace, a tracing framework for distributed and networked applications intended to be used as a debugging and diagnostic tool.

Summary

This paper describes a tracing framework for distributed and networked applications called X-Trace, intended to be used as a diagnostic tool. It works by tagging each application task with metadata that includes a task identifier; each network operation originating from that task is then tagged with the task identifier along with data relevant to that operation. From a trace of these tags it is possible to construct a task tree, which can be used to debug networked applications. In X-Trace, the trace request is sent in band, while the collected trace data itself is sent out of band back to the requesting source. The authors implemented X-Trace and tested the performance of a web server and database back-end with X-Trace enabled. They observed a small decrease in performance, but argue this is justified given that X-Trace is meant to be used for debugging. The authors also provide a number of usage scenarios.
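
To make the tagging idea concrete, here is a minimal sketch of how operations tagged with a task identifier and a parent-operation identifier could be assembled offline into a task tree. This is my own illustration, not the authors' implementation; the report fields task_id, op_id, parent_id, and label are invented for the example.

```python
# Sketch: reconstructing a task tree from X-Trace-style reports.
# Field names (task_id, op_id, parent_id, label) are illustrative only.
from collections import defaultdict

def build_task_tree(reports):
    """Group reports by (task, parent operation) so children can be found quickly."""
    children = defaultdict(list)
    for r in reports:
        children[(r["task_id"], r["parent_id"])].append(r)
    return children

def print_tree(children, task_id, parent_id=None, depth=0):
    """Walk the tree from the root operations (those with no parent)."""
    for r in children.get((task_id, parent_id), []):
        print("  " * depth + f'{r["op_id"]}: {r["label"]}')
        print_tree(children, task_id, r["op_id"], depth + 1)

reports = [
    {"task_id": "T1", "op_id": "a", "parent_id": None, "label": "HTTP GET"},
    {"task_id": "T1", "op_id": "b", "parent_id": "a",  "label": "DNS lookup"},
    {"task_id": "T1", "op_id": "c", "parent_id": "a",  "label": "TCP connect"},
]
print_tree(build_task_tree(reports), "T1")
```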

Critique

I didn’t think that this paper was a particularly great idea. The concept does not seem difficult to come up with, and it seems very unlikely that it would become widely used because it requires modifications to applications and network protocols. It also would not be very useful if only partially deployed, as the authors mention. Other drawbacks the authors mention include security concerns, non-tree request structures that can’t be captured, the difficulty of distinguishing lost reports from other problems and of constructing causal connections when reports are lost, the extra traffic caused by X-Trace, and the difficulty of ascertaining whether X-Trace actually helps system administrators and other users. The authors’ treatment of these issues did not really convince me that X-Trace was still a worthwhile consideration. I suspect that alternative, and potentially already available, diagnostic tools and methods could be just as useful, if not more so, especially considering the potential difficulty of deploying X-Trace. I think something that requires no modifications to existing applications and protocols is vastly more appropriate.

A Policy-aware Switching Layer for Data Centers


D. Joseph, A. Tavakoli, I. Stoica, "A Policy-aware Switching Layer for Data Centers," ACM SIGCOMM Conference, (August 2008)


One line summary: This paper presents a new layer-2 mechanism for supporting middleboxes in datacenters called the PLayer that incorporates pswitches, which allow administrators to specify the sequence of middleboxes a given type of traffic should traverse.

Summary

This paper presents the PLayer, a new layer-2 design for data centers consisting of policy-aware switches called pswitches. The main goal of the PLayer is to provide a mechanism to support the use of middleboxes in datacenters. The problems with using middleboxes in datacenters today are that (1) it is hard to ensure traffic traverses the correct sequence of middleboxes without physically changing the network topology or overloading other layer-2 mechanisms, which is difficult and error-prone, (2) network inflexibility makes introducing changes difficult and can result in fate-sharing between middleboxes and other traffic flows, and (3) some traffic is often forced to traverse middleboxes unnecessarily, and load balancing across multiple instances of a middlebox is difficult. The PLayer seeks to address these problems by providing correctness, flexibility, and efficiency. It does this by adhering to two main design principles: (1) separate policy from reachability and (2) take middleboxes off the physical network path.

The PLayer consists of pswitches, which forward frames according to administrator-specified policies. Policies define the sequence of middleboxes that a given type of traffic should traverse and are of the form “[start location, traffic selector] → sequence”, where a traffic selector is a 5-tuple consisting of the source and destination IP addresses and port numbers and the protocol type. These policies are translated into rules of the form “[previous hop, traffic selector] → next hop”. Administrators specify policies at a centralized policy controller, which disseminates them to each pswitch. A centralized middlebox controller monitors the liveness of middleboxes and keeps the pswitches informed of failures. The PLayer requires minimal infrastructure changes: only switches need to be replaced, encapsulation is used, and it is incrementally deployable. It allows the use of unmodified middleboxes and servers by ensuring packets reach middleboxes and servers only in the proper Ethernet format, using only nonintrusive techniques to identify previous hops, and supporting the various middlebox addressing requirements. It supports non-transparent middleboxes, that is, middleboxes that modify frame headers or content, through per-segment policies. The PLayer can be enhanced with stateful pswitches, which store hash tables in order to track the processing of a flow. The PLayer guarantees correctness under network, policy, and middlebox churn, as demonstrated in the paper.

The end of the paper describes an implementation of pswitches in Click and a validation of their functionality. It also includes a performance test; as expected, pswitches achieve lower TCP throughput and higher latency, as do middleboxes deployed with them. The paper concludes with a summary of the limitations of the PLayer. Briefly, these include the indirect paths introduced because the PLayer directs traffic to off-path middleboxes, difficulty in traffic classification and policy specification, incorrect packet classification when the 5-tuple is insufficient, failure of the PLayer if middleboxes are incorrectly wired, the existence of some policies the PLayer cannot support, and the added complexity of pswitches.
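
As a rough illustration of the policy-to-rule translation described above, here is a small sketch. The data structures, the policy_to_rules helper, and the device names (core-router, firewall-1, and so on) are invented for the example and are not taken from the paper.

```python
# Simplified sketch of expanding a PLayer-style policy into per-hop rules.
# A policy "[start, selector] -> [mb1, mb2, ...]" becomes one rule
# "[previous hop, selector] -> next hop" per step of the middlebox sequence.

def policy_to_rules(start, selector, middlebox_sequence, final_dest):
    hops = [start] + middlebox_sequence + [final_dest]
    return [((prev, selector), nxt) for prev, nxt in zip(hops, hops[1:])]

# Hypothetical example: web traffic entering at the core router must
# traverse a firewall and then a load balancer before reaching the server.
selector = ("*", "10.0.0.5", "*", 80, "tcp")   # (src IP, dst IP, src port, dst port, proto)
rules = policy_to_rules("core-router", selector, ["firewall-1", "loadbal-1"], "web-server")
for (prev, sel), nxt in rules:
    print(f"[{prev}, {sel}] -> {nxt}")
```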

Critique

I thought the PLayer seemed like a good idea, and very well motivated. I don’t think we should have necessarily read this version of the paper, however. Some sections, such as the formal analysis and the appendix, didn’t really add much toward understanding the PLayer, but rather just made the paper an annoyingly long read. With respect to the PLayer itself, it seems that if it could overcome its poor performance compared to current datacenter switches, it could become pretty competitive. At the least, the idea of allowing administrators to flexibly specify which traffic should traverse which middleboxes, with minimal changes to existing infrastructure, seems very applicable, and companies such as Cisco would do well to incorporate some of these ideas into their products.

Internet Indirection Infrastructure


I. Stoica, D. Adkins, S. Zhuang, S. Shenker, S. Surana, "Internet Indirection Infrastructure," ACM SIGCOMM Conference, (August 2002).


One line summary: This paper introduces the Internet Indirection Infrastructure, an overlay architecture for the Internet designed to make it easy to implement otherwise difficult-to-build applications such as anycast, service composition, multicast, and mobility.

Summary

This paper proposes an overlay framework for the Internet called the Internet Indirection Infrastructure (i3) that provides a communication abstraction to facilitate the implementation of applications such as multicast, anycast, and mobility. These are typically difficult to implement in the Internet, which was originally designed to provide unicast point-to-point communication between fixed locations, and many current solutions that provide functionality for one of these applications aren’t general enough to support the others. i3 seeks to resolve that, primarily by adding a layer of indirection that decouples senders from receivers, so that senders and receivers need not be aware of each other’s existence or location. This is achieved by a mechanism in which senders send packets to logical identifiers and receivers express interest in receiving packets destined to an identifier by inserting triggers. Thus, packets are pairs (id, data) and triggers are pairs (id, addr). IDs are m bits long, and longest prefix matching is used to match triggers to IDs. A set of servers in the overlay network that makes up i3 is responsible for storing these triggers and forwarding packets to interested receivers; each ID maps to a unique server. A generalization of i3 allows stacks of identifiers to be used in order to send packets to a series of identifiers, similar to source routing.
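
To make the rendezvous abstraction concrete, here is a toy sketch of the trigger/packet model, written as my own simplification rather than code from the paper: receivers insert triggers (id, addr), senders send packets (id, data), and the infrastructure forwards data to every matching trigger. Real i3 uses m-bit identifiers with longest-prefix matching distributed across Chord servers; this sketch uses exact-match string identifiers on a single invented I3Server class.

```python
# Toy sketch of the i3 rendezvous abstraction (exact-match IDs, one server).
from collections import defaultdict

class I3Server:
    def __init__(self):
        self.triggers = defaultdict(list)   # id -> [receiver addresses]

    def insert_trigger(self, ident, addr):
        """A receiver expresses interest in packets sent to this identifier."""
        self.triggers[ident].append(addr)

    def send(self, ident, data):
        """Forward a packet (id, data) to every matching trigger."""
        for addr in self.triggers[ident]:
            print(f"forwarding {data!r} to {addr}")   # stand-in for actual delivery

server = I3Server()
server.insert_trigger("video-feed-42", "10.1.2.3:9000")   # first receiver
server.insert_trigger("video-feed-42", "10.4.5.6:9000")   # second receiver -> multicast for free
server.send("video-feed-42", b"frame-001")                # sender never learns receiver addresses
```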

The authors provide examples of how i3 might be used, including service composition, heterogeneous multicast, server selection, and large-scale multicast. They then discuss several design and performance issues. One consideration is that the overlay network on which i3 is built must be robust, efficient, scalable, and stable; the authors’ prototype implementation of i3 is therefore built on top of Chord, which provides many of these properties. Other advantages of i3, some inherited from the Chord overlay, are that it is incrementally deployable and can support legacy applications. The main concerns regarding i3 are routing efficiency, hot spots, and security. The authors address each in turn and suggest possible remedies, many of which rely on the distinction between private and public triggers. Public triggers are knowable by all end hosts in the system and are used to exchange private triggers with which subsequent communication occurs. This concept can be used to address routing efficiency concerns. The main problem here is triangle routing: the server on which a trigger is stored and through which packets are forwarded may be far away from both end hosts. To address this, the authors suggest that end hosts probe servers to find the closest ones and choose private triggers that will be stored on these nearby servers. One security concern, eavesdropping, can also be addressed by the use of private triggers. Hot spots can be addressed by replicating triggers across more than one server. Trigger hijacking can be prevented by inserting two triggers that chain together such that the shared part of each trigger is known only to the host itself. DoS attacks can be addressed by a series of mechanisms, including challenging hosts that insert triggers to verify they intended to insert them, having servers use fair queueing, and implementing loop detection. The authors argue that these measures make security in i3 no worse than in today’s Internet.

The authors provide some simulation results, and as expected, there is some latency stretch associated with routing over i3. They also run experiments on their prototype, examining the latency of trigger insertion, packet forwarding, and internal i3 routing, as well as throughput.

Critique

I liked how the authors of this paper built upon many of the research ideas we’ve read about and really integrated a lot of the “received wisdom” into their system (e.g. indirection good). I thought they did a nice job of trying to address every potential criticism of i3 by suggesting possible solutions. However, in practical terms, I think the performance and security issues associated with i3 are in reality too much work to fix and the stigma or concern over them too great for i3 to ever actually be used. But I’m getting pretty used to the idea that most research ideas that are published will never actually see real use, at least in the form in which they are initially published, even if they seem like the greatest thing ever invented. I suppose a lot of times it is enough that the underlying ideas and contributions make it out there somehow, whether as components of or inspiration for other systems.

Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications


I. Stoica, R. Morris, D. Karger, F. Kaashoek, H. Balakrishnan, "Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications," ACM SIGCOMM Conference, 2001.


One line summary: This paper presents Chord, a protocol for a distributed lookup service built on a distributed hash table abstraction.

Summary

This paper presents a distributed lookup protocol called Chord that is simple and has provable correctness and performance properties. It is based on the distributed hash table (DHT) abstraction. It addresses the problems of load balancing, decentralization, scalability, availability, and flexible naming. Chord specifies how to map key/value pairs to storage nodes. It does this through the use of consistent hashing. This works as follows: each of the nodes participating in Chord is assigned an identifier based on the hash of its IP address. Each key also has an identifier that is the hash of the key. The identifiers form an identifier circle that has 2^m places in it, where m is the length of the identifier. The identifier of a node determines its place in the circle. A key is assigned to the first node that has an identifier equal to or following it on the circle. The node that the key is assigned to is responsible for storing the corresponding key/value pair. Consistent hashing is used for this because it theoretically ensures that key/value pairs are roughly evenly distributed across nodes.
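
A minimal sketch of this key-to-node assignment follows. It is my own illustration: Chord itself uses SHA-1 with m = 160, whereas a tiny m is used here to keep the numbers readable, and the function names (ident, assigned_node) are invented.

```python
# Sketch of consistent hashing: nodes and keys are hashed onto a 2^m circle,
# and each key is stored on its successor, the first node whose identifier
# is equal to or follows the key's identifier.
import hashlib

M = 6
RING = 2 ** M

def ident(name):
    """Hash a name onto the 2^m identifier circle."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % RING

def assigned_node(key_id, node_ids):
    """First node at or after the key's identifier, wrapping around the circle."""
    for n in sorted(node_ids):
        if n >= key_id:
            return n
    return min(node_ids)   # wrap around

node_ids = [ident(ip) for ip in ["10.0.0.1", "10.0.0.2", "10.0.0.3"]]
for key in ["alice.mp3", "bob.txt"]:
    k = ident(key)
    print(f"key {key!r} (id {k}) is stored on node id {assigned_node(k, node_ids)}")
```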


Each node in the circle stores a pointer to its successor so that queries for a key/value pair can be routed to the correct node. Each node also stores a finger table in which entry i contains a pointer to the first node that is at least 2^(i-1) places ahead of it on the circle. As an optimization, each node can also store a pointer to its predecessor. The result of this setup is that a query for a key/value pair takes O(log N) steps to answer, where N is the number of nodes. When a node joins Chord it is logically placed in the spot on the circle to which its identifier maps. The predecessors and fingers of the new node and of the other nodes are updated, and all key/value pairs for which the new node is now the successor (those with identifiers falling between its predecessor’s identifier and its own) are transferred from its successor to the new node. To join the circle, a node need only know the identity of one other node already in Chord; from that node it can learn its place on the circle. Chord periodically runs a stabilization procedure in which each node verifies its successor. To deal with failures, each Chord node keeps a list of its next several successors so that it can find a new successor when its current one fails.
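
Continuing in the same spirit, here is a sketch of how a finger table can be built and used for an O(log N) lookup on a small identifier circle. Again this is my own illustration rather than the paper's pseudocode: it assumes a global view of the node identifiers instead of real per-node RPCs, and the helper names are invented.

```python
# Sketch of Chord's finger table and lookup. Entry i of node n's table points
# to the successor of (n + 2^(i-1)) mod 2^m; a lookup hops to the closest
# finger preceding the key until it reaches the key's successor.
M = 6
RING = 2 ** M

def succ(x, nodes):
    """First node at identifier x or after it, wrapping around the circle."""
    return next((n for n in nodes if n >= x), nodes[0])

def in_interval(x, a, b):
    """True if x lies in (a, b], walking clockwise from a."""
    return (a < x <= b) if a < b else (x > a or x <= b)

def finger_table(n, nodes):
    # entry i points to the successor of (n + 2^(i-1)) mod 2^m, for i = 1..m
    return [succ((n + 2 ** i) % RING, nodes) for i in range(M)]

def lookup(key, n, nodes, hops=0):
    nxt = succ((n + 1) % RING, nodes)          # n's immediate successor
    if in_interval(key, n, nxt):               # key is owned by n's successor
        return nxt, hops
    for f in reversed(finger_table(n, nodes)): # closest preceding finger first
        if f != key and in_interval(f, n, key):
            return lookup(key, f, nodes, hops + 1)
    return nxt, hops                           # fall back to the successor

nodes = sorted([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])
print(lookup(54, 8, nodes))                    # key 54 resolves to node 56 in 2 hops
```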


The paper next provides an evaluation of Chord. The authors first find that due to the non-uniform distribution of node identifiers, keys are not evenly mapped across the nodes, so they suggest a workaround involving virtual nodes. They next show that the average path length traversed by a query is ½ log N, where N is the number of nodes. They find that Chord experiences virtually no lookup failures due to node failures and very few lookup failures (~3%) for lookups that occur during stabilization. Lastly, they find that lookup latency grows slowly with the number of nodes.


Critique

I liked reading this paper because I’ve seen Chord used in a number of other systems. Briefly, I have only two main criticisms. First, the load balancing, or distribution of keys among Chord nodes, without virtual nodes is disappointing but foreseeable. Second, it might be nice if Chord provided some optional automatic or built-in way of replicating keys across nodes. It seems straightforward to do, so it would be nice if it were provided automatically instead of left to the user.


Tuesday, November 3, 2009

Looking Up Data in P2P Systems


H. Balakrishnan, F. Kaashoek, D. Karger, R. Morris, I. Stoica, "Looking Up Data in P2P Systems," Communications of the ACM, V. 46, N. 2, (February 2003).


One line summary: This paper provides an overview of the various DHT-based data lookup mechanisms for peer-to-peer systems.

Summary

This paper begins by introducing the lookup problem in peer-to-peer (P2P) systems: finding data in a large P2P system containing a dynamic set of nodes, in a scalable manner, without introducing any centralization or hierarchy. The paper focuses primarily on solutions based on the distributed hash table (DHT) abstraction, after briefly outlining the problems with solutions involving a central database or hierarchical distributed storage structures. The main benefit of a DHT-based solution is that it imposes few constraints: all data items must be uniquely identified with a numeric key, and nodes must store data for other nodes. A DHT implements only a lookup operation that takes as input the identifying key of a data item and returns the location of the node responsible for that data item. A DHT lookup algorithm must address the following: mapping keys to nodes in a load-balanced way, forwarding lookups to the appropriate nodes, providing a distance function that describes the closeness of keys to each other and of nodes to each other, and adaptively building routing tables.

The DHT-based lookup algorithms reviewed in this paper include CAN, Chord, Kademlia, Pastry, Tapestry, and Viceroy. These vary in the data structures they use to provide O(log N) lookups, their routing dimensions, emphasis, distance functions, operation costs, degree of fault tolerance and concurrency, routing, and security measures. For example, Chord maintains a skiplist-like data structure, while Pastry, Tapestry, and Kademlia use tree-based structures; CAN, in contrast, uses a d-dimensional Cartesian coordinate space as its abstraction. With respect to distance functions, Chord uses the clockwise distance along its identifier circle, which is unidirectional but asymmetric; Pastry uses prefixes and numeric distance, and so is symmetric but not unidirectional; and Kademlia uses an XOR-based function that is both unidirectional and symmetric. Topics for future work in this area include analyzing operation costs when the system is large and joins and departures are frequent, improving routing latency, providing security mechanisms, and implementing efficient indexing and keyword search.
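
As a toy illustration of these distance properties, here is a small example on an 8-bit identifier space. The example values and function names are my own; only the two distance definitions (clockwise distance and XOR) come from the algorithms discussed.

```python
# Chord's clockwise distance is unidirectional but asymmetric; Kademlia's XOR
# distance is both symmetric and unidirectional (for a point x and distance d
# there is exactly one y with d(x, y) = d).
M = 8
RING = 2 ** M

def chord_distance(a, b):       # clockwise distance from a to b
    return (b - a) % RING

def xor_distance(a, b):         # Kademlia's XOR metric
    return a ^ b

a, b = 0x1A, 0xC3
print(chord_distance(a, b), chord_distance(b, a))   # asymmetric: 169 vs 87
print(xor_distance(a, b), xor_distance(b, a))       # symmetric: 217 and 217
```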

Critique

There isn’t much to say in terms of a critique since this paper is just a brief introduction to and description of various algorithms, but I did find it helpful and thought the writing was nicely structured, so I would keep it in the syllabus.

Wednesday, September 9, 2009

Core-Stateless Fair Queueing: Achieving Approximately Fair Bandwidth Allocations in High Speed Networks


I. Stoica, S. Shenker, H. Zhang, "Core-Stateless Fair Queueing: Achieving Approximately Fair Bandwidth Allocations in High Speed Networks," ACM SIGCOMM, (August 1998).


One line summary: This paper describes Core-Stateless Fair Queueing, a mechanism for approximating fair bandwidth allocation that minimizes the state stored and the complexity of the algorithms run in core network routers by relying on edge routers to perform flow rate estimation and packet labeling.

Summary

This paper presents a mechanism for fair allocation called Core-Stateless Fair Queueing (CSFQ). CSFQ identifies contiguous portions of the network as islands consisting of edge routers and core routers that perform different functions in order to provide fair allocation in a simple and effective way. Edge routers compute per-flow rate estimates and label packets passing through them for use by core routers. Core routers keep no per-flow state; they perform FIFO queueing and probabilistic dropping using the information in the packet labels put in place by the edge routers. The authors’ motivation for designing CSFQ begins with the assumptions that fair allocation is important to congestion control and that the complexity of existing mechanisms hinders their adoption. The goal of CSFQ, then, is to achieve fair allocation while avoiding the need for core routers to maintain per-flow state or use complicated algorithms.

The authors use a fluid model to design and analyze CSFQ. Their fairness model is max-min fairness, and their algorithm uses it to calculate the fair share of each flow. When the total arrival rate of all flows is less than the link capacity, no packets are dropped. When the arrival rate exceeds the link capacity, flows that are using more than their fair share have packets dropped with a probability based on how far they exceed that share. Two challenges in this algorithm are estimating flow rates and estimating the fair share. To estimate flow arrival rates, the algorithm uses an exponential averaging formula. Estimating the fair share is more complicated: it is based on the prior estimate of the fair share, the estimated total arrival rate, the estimated rate at which the algorithm accepts packets, and whether or not the link is congested. Edge routers estimate flow rates and use them to label packets. Note that these labels must be rewritten at a congested link because the estimated rate is no longer accurate there; the label is rewritten as the minimum of the estimated rate and the fair share rate. During congestion, core routers use the labeled flow rate and the fair share estimate to probabilistically drop packets from flows based on how far beyond their fair share those flows are. The authors briefly mention an extension to CSFQ that uses weights. They provide performance bounds for CSFQ and show that flows cannot use more than their fair share over the long run, so the algorithm does ensure fairness. They also demonstrate how the algorithm could be implemented efficiently in contemporary routers.
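
A small sketch of the two estimates described above follows. The drop rule, max(0, 1 - alpha/r) for a flow labeled with rate r against fair-share estimate alpha, and the exponential averaging of arrival rates follow the paper's approach; the averaging constant, packet values, and function names are my own illustration.

```python
# Sketch of CSFQ's edge-router rate estimation and core-router drop decision.
import math
import random

K = 100e-3            # averaging constant in seconds (an assumed value)

def update_rate(old_rate, pkt_len_bits, interarrival_s):
    """Exponential averaging of a flow's arrival rate at an edge router."""
    w = math.exp(-interarrival_s / K)
    return (1 - w) * (pkt_len_bits / interarrival_s) + w * old_rate

def forward(labeled_rate, fair_share):
    """Core-router decision: drop with probability max(0, 1 - alpha/r)."""
    p_drop = max(0.0, 1.0 - fair_share / labeled_rate)
    return random.random() >= p_drop     # True -> forward, False -> drop

rate = update_rate(old_rate=1e6, pkt_len_bits=12e3, interarrival_s=5e-3)
print(f"estimated rate: {rate/1e6:.2f} Mbps,",
      "forwarded" if forward(rate, fair_share=1.5e6) else "dropped")
```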

The authors compare CSFQ with four other algorithms. Two are baseline cases that do not ensure fairness: FIFO and RED. The other two represent alternative approaches to ensuring fairness: FRED and DRR. The authors note that CSFQ edge routers have complexity comparable to FRED, while core routers have complexity comparable to RED. This is important because one of the authors’ main arguments is that CSFQ is simpler and thus more feasible to deploy. Their findings include that CSFQ provides reasonable fairness even in the face of ill-behaved sources, works with different underlying flow control schemes, performs well over multiple congested links, reacts well to bursty traffic, performs reasonably when there is large link delay, and works well even when packets must be relabeled. In general, CSFQ provides fairness comparable to FRED but in a much more scalable manner. The authors also show how CSFQ can be extended to punish unfriendly or unresponsive sources.

Critique

In general, I liked this paper. I especially thought their experiments were clean and revealed interesting aspects of the systems tested. However, as the authors admit, their experiments do not fully reveal how CSFQ would operate over more complex topologies, especially ones containing large latencies. Large-latency links seem to be a hazardous area for many a network protocol.

One concern is whether CSFQ would in fact be easier to adopt and administer than other fair queueing algorithms. That was one of their main motivations for proposing the architecture, but the fact that it has to be adopted on an “island-by-island” basis as opposed to a router-by-router basis seems potentially awkward. As the authors freely admit in the beginning of their paper, their underlying assumptions are open to debate. In particular, it was (and still is?) arguable whether other fair allocation mechanisms are too complex to implement and deploy in contemporary routers. While this may indeed turn out to be false, I think I recall discussing in class the fact that fair allocation mechanisms weren’t really implemented in most routers today. I may be misremembering this, but if not, then we can’t really say if the authors’ goal of providing a simple and feasibly deployable fair allocation mechanism has been met, since it hasn’t really been tested in practice.