Thursday, November 12, 2009

X-Trace: A Pervasive Network Tracing Framework


R. Fonseca, G. Porter, R. H. Katz, S. Shenker, I. Stoica, "X-Trace: A Pervasive Network Tracing Framework," NSDI'07, (April 2007).


One line summary: This paper describes X-Trace, a tracing framework for distributed or networked applications that is intended to serve as a diagnostic and debugging tool.

Summary

This paper describes a tracing framework for distributed or networked applications called X-Trace. X-Trace is intended to be used as a diagnostic tool. It works by tagging each application task with metadata that includes a task identifier; each network operation originating from that task is then tagged with the task identifier along with data relevant to that operation. From a trace of these tags it is possible to construct a task tree, which can be used to debug networked applications. In X-Trace, the trace request is sent in band, while the collected trace data itself is sent out of band back to the requesting source. The authors implemented X-Trace and measured the performance of a web server and database back-end with X-Trace enabled. They observed a small decrease in performance, but felt that this was justified given that X-Trace is meant to be a debugging tool. The authors also provide a number of usage scenarios.
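
To make the tagging scheme concrete, here is a rough sketch of how such metadata might be propagated and reported. The push_next/push_down names echo the propagation operations described in the paper, but the fields, report format, and example flow are my own assumptions, not the authors' implementation.

```python
# Rough sketch of X-Trace-style metadata propagation (illustrative only).
# A task ID ties every operation to one task, and each report records a
# parent edge so an offline tool can rebuild the task tree.
import uuid
from dataclasses import dataclass

@dataclass
class XTraceMetadata:
    task_id: str                     # shared by every operation in the task
    op_id: str                       # unique per operation
    parent_op_id: str | None = None  # edge used to reconstruct the task tree

def new_task() -> XTraceMetadata:
    return XTraceMetadata(task_id=uuid.uuid4().hex, op_id=uuid.uuid4().hex)

def push_next(md: XTraceMetadata) -> XTraceMetadata:
    """Propagate metadata to the next hop in the same layer."""
    return XTraceMetadata(md.task_id, uuid.uuid4().hex, parent_op_id=md.op_id)

def push_down(md: XTraceMetadata) -> XTraceMetadata:
    """Propagate metadata to the layer below (e.g. HTTP down to TCP)."""
    return XTraceMetadata(md.task_id, uuid.uuid4().hex, parent_op_id=md.op_id)

# In the paper, reports travel out of band back to the requester; here they
# just accumulate in a list.
reports: list[dict] = []

def report(md: XTraceMetadata, info: str) -> None:
    reports.append({"task": md.task_id, "op": md.op_id,
                    "parent": md.parent_op_id, "info": info})

# Example: an HTTP request whose handling descends into a TCP operation.
http_md = new_task()
report(http_md, "HTTP GET received")
tcp_md = push_down(http_md)
report(tcp_md, "TCP connection opened")
```

Reconstructing the task tree then amounts to grouping reports by task ID and following the parent edges.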

Critique

I didn’t think this paper presented a particularly great idea. The concept doesn’t seem that difficult to come up with, and it seems very unlikely that it would actually become widely used, because it requires modifications to applications and network protocols. It also would not be very useful if only partially deployed, as the authors mention. Other drawbacks the authors mention include security concerns, non-tree request structures that can’t be captured, the difficulty of distinguishing lost reports from other problems and of constructing causal connections when reports are lost, the extra traffic caused by X-Trace, and the difficulty of ascertaining whether X-Trace actually helps system administrators and other users. The authors’ treatment of these issues did not really convince me that X-Trace is still a worthwhile consideration. I suspect that alternative, and potentially already available, diagnostic tools and methods could be just as useful, if not more so, especially considering the potential difficulty of deploying X-Trace. Something that requires no modifications to existing applications and protocols seems vastly more appropriate.

NetFPGA: A Tool for Network Research and Education


G. Watson, N. McKeown, M. Casado. "NetFPGA: A Tool for Network Research and Education," 2nd Workshop on Architecture Research using FPGA Platforms (WARFP), (February, 2006).


One line summary: This paper describes a hardware platform called NetFPGA that is intended to allow students and researchers to quickly deploy and debug new networking hardware in an operational network.

Summary

This paper describes NetFPGA, a hardware platform that enables the development and deployment of networking hardware. The original goal was to allow students to gain hands-on experience with the link and physical layers of networking; it later became interesting for use in network research. NetFPGA went through two versions, with the second improving upon the first in a number of ways. It has been used in several courses and is currently being tested for use in two research projects, Rate Control Protocol (RCP) and shunting for intrusion detection.

Critique

NetFPGA sounds like it would be fun to play with, and I wish I could have taken a course that uses it. However, I’m not sure why this paper is in the syllabus; I don’t think it adds anything very useful.


A Policy-aware Switching Layer for Data Centers


D. Joseph, A. Tavakoli, I. Stoica, "A Policy-aware Switching Layer for Data Centers," ACM SIGCOMM Conference, (August 2008).


One line summary: This paper presents a new layer-2 mechanism for supporting middleboxes in datacenters called the PLayer that incorporates pswitches, which allow administrators to specify the sequence of middleboxes a given type of traffic should traverse.

Summary

This paper presents the PLayer, a new layer-2 design for data centers consisting of policy-aware switches called pswitches. The main goal of the PLayer is to provide a mechanism to support the use of middleboxes in datacenters. The problems with using middleboxes in datacenters today are that (1) ensuring traffic traverses the correct sequence of middleboxes requires physically changing the network topology or overloading other layer-2 mechanisms, which is difficult and error-prone, (2) network inflexibility makes introducing changes difficult and can result in fate-sharing between middleboxes and other traffic flows, and (3) some traffic is often forced to traverse middleboxes unnecessarily, and load balancing across multiple instances of a middlebox is difficult. The PLayer seeks to address these problems by providing correctness, flexibility, and efficiency. It does this by adhering to two main design principles: (1) separate policy from reachability, and (2) take middleboxes off the physical network path.

The PLayer consists of pswitches, which forward frames according to an administrator-specified policy. Policies define the sequence of middleboxes that a given type of traffic should traverse and are of the form “[start location, traffic selector] → sequence”, where a traffic selector is a 5-tuple consisting of the source and destination IP addresses and port numbers and the protocol type. These policies are translated into rules of the form “[previous hop, traffic selector] → next hop”. Administrators specify policies at a centralized policy controller, which disseminates them to each pswitch, while a centralized middlebox controller monitors the liveness of middleboxes and keeps the pswitches informed of failures.

The PLayer requires minimal infrastructure changes: only switches need to be replaced, it uses encapsulation, and it is incrementally deployable. It allows the use of unmodified middleboxes and servers by ensuring packets reach them only in the proper Ethernet format, by using only nonintrusive techniques to identify previous hops, and by supporting the various middlebox addressing requirements. It supports non-transparent middleboxes, that is, middleboxes that modify frame headers or content, through per-segment policies. The PLayer can be enhanced with stateful pswitches, which store hash tables in order to track the processing of a flow, and it guarantees correctness under network, policy, and middlebox churn, as demonstrated in the paper. The end of the paper describes an implementation of pswitches in Click and a validation of their functionality. It also includes a performance test; as expected, pswitches achieve lower TCP throughput and higher latency than standard switches, as do middleboxes deployed with them. The paper concludes with a summary of the limitations of the PLayer. Briefly, these include the introduction of indirect paths since the PLayer directs traffic to off-path middleboxes, difficulty in traffic classification and policy specification, incorrect packet classification when the 5-tuple is insufficient, failure of the PLayer if middleboxes are incorrectly wired, the existence of some policies the PLayer cannot support, and the added complexity of pswitches.
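
To illustrate, here is a hedged sketch, my own rather than the authors' code, of how a policy of that form might be compiled into the per-hop rules a pswitch could match on; the names, the final-destination handling, and the rule representation are assumptions.

```python
# Illustrative sketch: compile a PLayer-style policy
#   [start location, traffic selector] -> middlebox sequence
# into per-hop rules
#   [previous hop, traffic selector] -> next hop.
from dataclasses import dataclass

@dataclass(frozen=True)
class Selector:
    src_ip: str
    dst_ip: str
    src_port: int | None
    dst_port: int | None
    proto: str

def compile_policy(start: str, selector: Selector,
                   sequence: list[str], final_dst: str) -> dict:
    """Return a rule table mapping (previous hop, selector) -> next hop."""
    hops = [start] + sequence + [final_dst]
    return {(prev, selector): nxt for prev, nxt in zip(hops, hops[1:])}

# Example: web traffic entering at the core router must traverse a firewall
# and then a load balancer before reaching the web servers.
web = Selector(src_ip="*", dst_ip="10.0.0.0/24",
               src_port=None, dst_port=80, proto="tcp")
rules = compile_policy("core_router", web,
                       ["firewall", "load_balancer"], "web_servers")
for (prev, sel), nxt in rules.items():
    print(f"[{prev}, dst_port={sel.dst_port}] -> {nxt}")
```

A pswitch would then look up each frame's identified previous hop and 5-tuple in such a table to choose the next hop.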

Critique

I thought the PLayer seemed like a good idea, and it is very well motivated. I don’t think we necessarily should have read this version of the paper, however: some sections, such as the formal analysis and the appendix, didn’t really add much toward understanding the PLayer, but rather just made the paper an annoyingly long read. With respect to the PLayer itself, if it could overcome its poor performance relative to current datacenter switches, it could become pretty competitive. At the least, the idea of letting administrators specify which traffic should traverse which middleboxes, in a flexible way that requires minimal changes to existing infrastructure, seems very applicable, and companies such as Cisco might do well to incorporate some of these ideas into their products.

Internet Indirection Infrastructure


I. Stoica, D. Adkins, S. Zhuang, S. Shenker, S. Surana, "Internet Indirection Infrastructure," ACM SIGCOMM Conference, (August 2002).


One line summary: This paper introduces the Internet Indirection Infrastructure, an overlay architecture for the Internet designed to make it easy to implement applications that are otherwise difficult to build, such as anycast, service composition, multicast, and mobility.

Summary

This paper proposes an overlay framework for the Internet called the Internet Indirection Infrastructure (i3) that provides a communication abstraction to facilitate the implementation of applications such as multicast, anycast, and mobility. These are typically difficult to implement in the Internet, which was originally designed to provide unicast point-to-point communication between fixed locations, and many current solutions that provide functionality for one of these applications aren’t general enough to support the others. i3 seeks to resolve that, primarily by adding a layer of indirection that decouples senders from receivers, so that senders and receivers need not be aware of each other’s existence or location. This is achieved by a mechanism in which senders send packets to logical identifiers and receivers express interest in packets destined to an identifier by inserting triggers. Thus, packets are pairs (id, data) and triggers are pairs (id, addr). IDs are m bits long, and longest prefix matching is used to match packet IDs to triggers. A set of servers in the overlay network that makes up i3 is responsible for storing these triggers and forwarding packets to interested receivers; each ID maps to a unique server. A generalization of i3 allows stacks of identifiers to be used in order to send packets through a series of identifiers, similar to source routing.
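
The rendezvous abstraction is simple enough to sketch. This is my own illustration, not the i3 implementation, and the fixed prefix-matching threshold below is an assumption standing in for the exact-match-plus-longest-prefix rule the paper describes.

```python
# Toy i3-style rendezvous: receivers insert triggers (id, addr) and the
# infrastructure forwards packets (id, data) to the best-matching triggers.
PREFIX_BITS = 16          # assumed minimum prefix match required for delivery
ID_BITS = 256             # identifier length m

triggers: list[tuple[int, str]] = []     # (identifier, receiver address)

def insert_trigger(trigger_id: int, addr: str) -> None:
    triggers.append((trigger_id, addr))

def matching_prefix(a: int, b: int) -> int:
    """Length of the common most-significant-bit prefix of two identifiers."""
    diff = a ^ b
    return ID_BITS if diff == 0 else ID_BITS - diff.bit_length()

def send_packet(packet_id: int, data: bytes) -> None:
    """Deliver to the trigger(s) with the longest matching prefix."""
    best = max((matching_prefix(packet_id, t) for t, _ in triggers), default=0)
    if best < PREFIX_BITS:
        return                            # no interested receiver; drop
    for t, addr in triggers:
        if matching_prefix(packet_id, t) == best:
            print(f"deliver {len(data)} bytes to {addr}")

# A receiver expresses interest in an identifier; a sender sends to that
# identifier without knowing the receiver's address.
rendezvous_id = 0xABCD << (ID_BITS - 16)
insert_trigger(rendezvous_id, "10.1.2.3:5000")
send_packet(rendezvous_id, b"hello")
```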

The authors provide examples of how i3 might be used, including service composition, heterogeneous multicast, server selection, and large scale multicast. They then discuss several design and performance issues. One consideration is that the overlay network on which i3 is built must be robust, efficient, scalable, and stable; the authors’ prototype implementation of i3 is therefore built on top of Chord, which provides many of these properties. Other advantages of i3, including some inherited from the Chord overlay, are that it is incrementally deployable and can support legacy applications.

The main concerns regarding i3 are routing efficiency, hot spots, and security. The authors address each in turn and suggest possible solutions, many of which rely on the distinction between private and public triggers. Public triggers are knowable by all end hosts in the system and are used to exchange private triggers, over which subsequent communication occurs. This distinction can be used to address routing efficiency: the main problem here is triangle routing, where the server on which a trigger is stored and through which packets are forwarded may be far from both end hosts. To address this, the authors suggest that end hosts probe servers to find the closest ones and choose private triggers that will be stored on those nearby servers. One security concern, eavesdropping, can also be addressed by the use of private triggers. Hot spots can be addressed by replicating triggers across more than one server. Trigger hijacking can be prevented by inserting two triggers that chain together such that the shared part of each trigger is known only to the host itself. DoS attacks can be addressed by a series of mechanisms, including challenging hosts that insert triggers to verify that they intended to insert them, having servers use fair queuing, and implementing loop detection. The authors argue that these measures make security in i3 no worse than in today’s Internet. The authors provide some simulation results, and as expected, there is some latency stretch associated with routing over i3. They also experiment with their prototype, examining the latency of trigger insertion, packet forwarding, and internal i3 routing, as well as throughput.
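
The routing-efficiency fix can be sketched as well, with the caveat that in the real system the mapping from identifiers to servers comes from the Chord overlay; the id_prefix_for() helper below is a hypothetical stand-in for that knowledge, and the identifier split is assumed.

```python
# Hedged sketch of picking a private trigger that lands on a nearby server:
# probe candidate i3 servers, choose the closest, and construct a private
# trigger identifier whose prefix is known to map to that server.
import random

ID_BITS, PREFIX_BITS = 256, 64            # assumed split of the identifier

def closest_server(rtts_ms: dict[str, float]) -> str:
    return min(rtts_ms, key=rtts_ms.get)

def id_prefix_for(server: str) -> int:
    """Hypothetical: a prefix the overlay maps onto this server.
    In reality this knowledge would come from sampling the overlay."""
    return random.getrandbits(PREFIX_BITS)  # placeholder only

def choose_private_trigger(rtts_ms: dict[str, float]) -> int:
    prefix = id_prefix_for(closest_server(rtts_ms))
    suffix = random.getrandbits(ID_BITS - PREFIX_BITS)
    return (prefix << (ID_BITS - PREFIX_BITS)) | suffix

measured = {"i3-node-a": 12.0, "i3-node-b": 48.0, "i3-node-c": 85.0}  # RTTs in ms
private_id = choose_private_trigger(measured)
```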

Critique

I liked how the authors of this paper built upon many of the research ideas we’ve read about and really integrated a lot of the “received wisdom” into their system (e.g., that indirection is good). I thought they did a nice job of trying to address every potential criticism of i3 by suggesting possible solutions. However, in practical terms, I think the performance and security issues associated with i3 would in reality be too much work to fix, and the stigma or concern over them too great, for i3 to ever actually be used. But I’m getting pretty used to the idea that most research ideas that are published will never actually see real use, at least in the form in which they are initially published, even if they seem like the greatest thing ever invented. I suppose a lot of the time it is enough that the underlying ideas and contributions make it out there somehow, whether as components of or inspiration for other systems.

DNS Performance and the Effectiveness of Caching


J. Jung, E. Sit, H. Balakrishnan, "DNS Performance and the Effectiveness of Caching," IEEE/ACM Transactions on Networking, V. 10, N. 5, (October 2002).


One line summary: This paper studies the Domain Name System with respect to client-perceived performance and the effectiveness of caching by analyzing several sets of real trace data including DNS and TCP traffic.

Summary

This paper examines DNS performance from the perspective of clients by analyzing DNS and related TCP traffic traces collected at MIT and the Korea Advanced Institute of Science and Technology (KAIST). It seeks to answer two questions: (1) what performance, in terms of latency and failures, clients perceive, and (2) how varying the TTL and the degree of cache sharing affect the effectiveness of caching. The paper first provides an overview of DNS and then explains the methodology. The study uses three sets of traces that include outgoing DNS queries and responses as well as outgoing TCP connections, and it collects various statistics on these traces for analysis, including, for a given query, the number of referrals involved, the lookup latency, and whether the query was successful. The dominant type of query the authors observed was A, at approximately 60%, followed by PTR at approximately 30%, with MX and ANY making up roughly equal shares of the remainder. Only 50% of the queries were associated with TCP connections, and 20% of the TCP connections were not preceded by DNS queries.

The rest of the paper is organized around the two main questions it attempts to answer. First, the authors examine client-perceived performance. They find median lookup latencies of 85 and 97 ms. 80% of lookups are resolved without any referrals, and the number of referrals increases the overall lookup latency. About 20-24% of lookups receive no response, and of the answered lookups, 90% involve no retransmissions. Queries that receive no response generate much more wide-area traffic than those that do. These last facts suggest that many DNS name servers are too persistent in their retry strategies. They also find that most negative responses are due to typos and to reverse lookups for addresses that have no host name, and that negative caching does not work as well as it should, probably because the distribution of names causing negative responses is heavy tailed. They find that 15-18% of lookups contact root or gTLD servers and that 15-27% of such lookups receive negative responses. Next, the authors examine the effectiveness of caching. Because the distribution of domain name popularity is very long tailed, many names are accessed just once, so a low TTL doesn’t hurt for those names because caching doesn’t really help them. However, not caching NS records would greatly increase the load on root and gTLD servers, so TTL values for these should not be set too low. Also due to the distribution of domain name popularity, sharing caches does not increase the hit rate very much beyond a certain point, because while some names are very popular, the rest are likely to be of interest to just one host. They also find that increasing the TTL past a few minutes does not greatly increase the cache hit rate, because most cache hits are produced by a single client looking up the same server multiple times in a row. This summarizes most of their main findings.
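
A toy, trace-driven cache simulation, my own back-of-the-envelope sketch rather than the authors' methodology, makes the TTL result easy to see: with a heavy-tailed mix of names, raising the TTL mainly helps the few names that the same clients re-query soon afterwards.

```python
# Given a list of (timestamp, name) lookups, estimate the fraction that a
# shared cache would absorb for a given TTL. Cache hits do not refresh the
# TTL, mirroring how a cached record keeps its original expiry.
def hit_rate(lookups: list[tuple[float, str]], ttl: float) -> float:
    expires: dict[str, float] = {}          # name -> time the cached record expires
    hits = 0
    for t, name in sorted(lookups):
        if expires.get(name, float("-inf")) > t:
            hits += 1                       # served from cache
        else:
            expires[name] = t + ttl         # miss: fetch and cache until t + ttl
    return hits / len(lookups) if lookups else 0.0

# Toy trace: one popular name queried repeatedly plus one-off names that,
# as in the paper's heavy-tailed popularity distribution, never benefit.
trace = [(0, "popular.example"), (30, "popular.example"), (70, "popular.example"),
         (95, "popular.example"), (10, "once-a.example"), (20, "once-b.example")]
for ttl in (15, 60, 300):
    print(f"TTL={ttl:3d}s  hit rate={hit_rate(trace, ttl):.2f}")
```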

Critique

I always really like papers that examine systems like DNS in this way, because I find it interesting that even though we built these systems, we don’t always understand them and aren’t able to predict the kinds of observations that studies like this later make. As always, it would be interesting to go back and do a similar study now that about a decade has passed and see what turns up. One thing in this paper that I am confused about is how the authors could have found that such a significant fraction of lookups never receive an answer. I don’t think they provide much in the way of an explanation for this, if I remember correctly. Are these unanswered lookups just due to dropped packets? That seems like a lot of dropped packets, and I find that surprising, although I have no real basis for thinking so.

Development of the Domain Name System


P. Mockapetris, K. Dunlap, "Development of the Domain Name System," ACM SIGCOMM Conference, 1988.


One line summary: This paper describes the design and deployment of the Domain Name System and reflects on its surprises, successes, and shortcomings from the perspective of several years after its initial introduction.

Summary

This paper discusses the design and development of the Domain Name System (DNS), candidly reflecting on its history and future. DNS provides the naming service for the Internet, primarily mapping hostnames to IP addresses, among other things. Prior to the development and adoption of DNS, the method for disseminating the mappings of hostnames to IP addresses was the HOSTS.TXT file, centrally maintained on a server at SRI and distributed to machines on the Internet via file transfers. As the number of hosts on the Internet grew, it became clear that this approach was not scalable, so another was needed; the one developed was the DNS. The design goals of the DNS were that (1) it must provide the same information as the old HOSTS.TXT system, (2) it must be able to be maintained in a distributed manner, (3) it must have no obvious size limits for names and associated data, (4) it must interoperate across the Internet, and (5) it must provide reasonable performance. In light of these constraints, it had to be extensible and avoid forcing a single OS, architecture, or organization onto its users. Initially, the design of DNS tried to strike a balance between being very lean and being completely general.

DNS is a hierarchical system. Its main components are name servers and resolvers. Name servers contain mappings and answer queries; resolvers provide the client interface, including algorithms for finding a name server to query for the information the client wants. The DNS name space is a tree in which each node has an associated label, and the domain name of a node is the concatenation of the labels on the path from that node to the root. Labels are case-insensitive, and the zero-length label is reserved for the root. DNS decouples the tree structure from implicit semantics in order to give applications more choices. The data associated with each name is stored in resource records (RRs), each of which has a type and a class along with its data. Some of the more common types include A records, which map hostnames to IP addresses; PTR records, which provide a reverse map from IP addresses to hostnames; and NS records, which map zones to name servers. A zone is a contiguous subtree of the namespace controlled by a specific organization; for example, UC Berkeley controls berkeley.edu. The controlling organization is responsible for delegating subzones, maintaining the data of the zone, and providing redundant name servers for it. In addition to distributing data through these zones, DNS name servers and resolvers cache RRs, which are removed from the cache once their TTL expires.
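
As a toy model of these structures, here is my own illustration, with hypothetical records, of a tree of labels with RRs stored at each node and lookups that walk a name's labels from the root; it is not real resolver code.

```python
# Toy DNS name tree: labels at the nodes, resource records stored per node,
# names resolved by walking labels from the root. Records below are made up.
from collections import defaultdict

class Node:
    def __init__(self) -> None:
        self.children: dict[str, "Node"] = {}
        self.rrs: dict[str, list[str]] = defaultdict(list)   # RR type -> data

root = Node()

def add_rr(name: str, rtype: str, data: str) -> None:
    node = root
    for label in reversed(name.lower().split(".")):   # labels are case-insensitive
        node = node.children.setdefault(label, Node())
    node.rrs[rtype].append(data)

def lookup(name: str, rtype: str) -> list[str]:
    node = root
    for label in reversed(name.lower().split(".")):
        node = node.children.get(label)
        if node is None:
            return []                        # would produce a negative response
    return node.rrs.get(rtype, [])

add_rr("www.berkeley.edu", "A", "192.0.2.10")      # hypothetical address
add_rr("berkeley.edu", "NS", "ns.berkeley.edu")    # hypothetical name server
print(lookup("WWW.Berkeley.EDU", "A"))             # case-insensitive -> ['192.0.2.10']
```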

In addition to describing the basics of DNS, the paper describes its early implementation status and its first deployment in the Berkeley network, which was apparently painful but necessary. The paper then goes on to describe a number of surprising issues in the operation of the DNS. The first was that the designers’ assumptions about the form and content of the information stored in the DNS turned out to be poor ones. A second is related to performance: the performance of the underlying network was initially much worse than the designers of DNS anticipated, and as a result DNS query performance was poor; they also found it difficult to make reasonable measurements of DNS performance because of interference from unrelated effects. The third surprise was the large number of negative responses to queries, which led to the need to cache such responses, referred to as negative caching. There were a number of very successful aspects of DNS as well. These include the choice of a variable-depth hierarchy, the organization-oriented structure of names, the use of UDP to access name servers, the feature that allows a name server to include additional data in a response if it sees fit, allowing it to anticipate some requests, the use of caching, and the agreement to use organizationally structured domain names for mail addressing and routing. Lastly, shortcomings of the DNS include difficulty in defining new classes and types of RRs, difficulty in upgrading applications to use DNS, and difficulty in conveying to system administrators the correct way of configuring and using DNS. The paper concludes by offering several pieces of advice and describing some future directions for DNS.

Critique

I really liked this paper; I almost always enjoy papers that provide historical perspective on and reflection about well-known systems, like this one and the “design of the DARPA Internet protocols” paper that we read at the beginning of the semester. I especially thought that a lot of the lessons learned and the resulting advice in this paper was just good system-building advice, plain and simple. Some particular gems I liked include, “Documentation should always be written with the assumption that only the examples are read,” “It is often more difficult to remove functions from systems than it is to get a new function added,” “The most capable implementors [sic] lose interest once a new system delivers the level of performance they expect; they are not easily motivated to optimize their use of others’ resources or provide easily used guidelines for administrators that use the systems,” and “Allowing variations in the implementation structure used to provide service is a great idea; allowing variation in the provided service causes problems.”

Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications


I. Stoica, R. Morris, D. Karger, F. Kaashoek, H. Balakrishnan, "Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications," ACM SIGCOMM Conference, 2001.


One line summary: This paper presents Chord, a protocol for a distributed lookup service built on a distributed hash table abstraction.

Summary

This paper presents a distributed lookup protocol called Chord that is simple and has provable correctness and performance properties. It is based on the distributed hash table (DHT) abstraction. It addresses the problems of load balancing, decentralization, scalability, availability, and flexible naming. Chord specifies how to map key/value pairs to storage nodes. It does this through the use of consistent hashing. This works as follows: each of the nodes participating in Chord is assigned an identifier based on the hash of its IP address. Each key also has an identifier that is the hash of the key. The identifiers form an identifier circle that has 2^m places in it, where m is the length of the identifier. The identifier of a node determines its place in the circle. A key is assigned to the first node that has an identifier equal to or following it on the circle. The node that the key is assigned to is responsible for storing the corresponding key/value pair. Consistent hashing is used for this because it theoretically ensures that key/value pairs are roughly evenly distributed across nodes.
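
A minimal sketch of this consistent-hashing assignment, for illustration only: identifiers are truncated to 16 bits here for readability, and the node addresses are hypothetical.

```python
# Map keys and nodes onto an identifier circle and assign each key to the
# first node whose identifier is equal to or follows the key's identifier.
import hashlib
from bisect import bisect_left

M = 16                                   # identifier length in bits; 2**M positions

def ident(value: str) -> int:
    return int(hashlib.sha1(value.encode()).hexdigest(), 16) % (2 ** M)

nodes = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]          # hypothetical node addresses
ring = sorted(ident(ip) for ip in nodes)
node_at = {ident(ip): ip for ip in nodes}

def successor(key_id: int) -> str:
    """First node at or after key_id, wrapping around the circle."""
    i = bisect_left(ring, key_id)
    return node_at[ring[i % len(ring)]]

print(successor(ident("some-key")))       # node responsible for "some-key"
```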


Each node in the circle stores a pointer to its successor so that queries for a key/value pair can be routed to the correct node. Each node also stores a finger table in which entry i contains a pointer to the first node that is at least 2^(i-1) places ahead of it on the circle. As an optimization, each node can also store a pointer to its predecessor. The result of this setup is that a query for a key/value pair takes O(log N) steps to answer, where N is the number of nodes. When a node joins Chord, it is logically placed in the spot on the circle to which its identifier maps. The predecessors and fingers of the new node and of the other nodes are updated, and all key/value pairs for which the new node is now the successor (those with identifiers between its predecessor’s identifier and its own) are transferred from the node where they were previously stored to the new node. To join the circle, a node need only know the identity of one other node already in Chord; from that node it can learn its place on the circle. Chord periodically runs a stabilization procedure in which each node verifies its successor. To deal with failures, each Chord node keeps a list of its next several successors so that it can find a new successor when its current one fails.
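
The finger-table routing can be sketched similarly. This is again illustrative: the fingers here are computed from a global membership list and the identifiers are small, whereas real Chord builds and repairs its fingers incrementally.

```python
# Sketch of finger-table routing on a small identifier circle.
M = 8                                   # identifier length in bits; ring size 2**M
RING = 2 ** M

def in_interval(x: int, start: int, end: int) -> bool:
    """True if x lies in the half-open interval (start, end] on the circle."""
    if start < end:
        return start < x <= end
    return x > start or x <= end        # interval wraps past zero

def successor(node_ids: list[int], key: int) -> int:
    """First node identifier equal to or following key on the circle."""
    ids = sorted(node_ids)
    return next((n for n in ids if n >= key), ids[0])

def finger_table(node_ids: list[int], n: int) -> list[int]:
    # Entry i points at the successor of n + 2**i (the paper indexes from 1,
    # so its entry i is the successor of n + 2**(i-1)).
    return [successor(node_ids, (n + 2 ** i) % RING) for i in range(M)]

def lookup(node_ids: list[int], start: int, key: int):
    """Route a query for key from node start; each hop roughly halves the
    remaining identifier distance, giving O(log N) hops."""
    n, path = start, [start]
    while not in_interval(key, n, successor(node_ids, (n + 1) % RING)):
        fingers = finger_table(node_ids, n)
        # Forward to the closest finger that precedes the key on the circle.
        n = next(f for f in reversed(fingers) if in_interval(f, n, (key - 1) % RING))
        path.append(n)
    return successor(node_ids, key), path

nodes = [5, 60, 110, 190, 240]          # hypothetical node identifiers
print(lookup(nodes, start=5, key=150))  # -> (190, [5, 110]): node 190 stores key 150
```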


The paper next provides an evaluation of Chord. The authors first find that due to the non-uniform distribution of node identifiers, keys are not evenly mapped across the nodes, so they suggest a workaround involving virtual nodes. They next show that the average path length traversed by a query is ½ log N, where N is the number of nodes. They find that Chord experiences virtually no lookup failures due to node failures and very few lookup failures (~3%) for lookups that occur during stabilization. Lastly, they find that lookup latency grows slowly with the number of nodes.


Critique

I liked reading this paper because I’ve seen Chord used in a number of other systems. Briefly, I have only two main criticisms. First, the load balancing, or distribution of keys among Chord nodes, without virtual nodes is disappointing but foreseeable. Second, it would be nice if Chord provided an optional automatic or built-in way of replicating keys across nodes. It seems straightforward to do, so it would be nice if it were just provided automatically instead of being left to the user.


Tuesday, November 3, 2009

Looking Up Data in P2P Systems


H. Balakrishnan, F. Kaashoek, D. Karger, R. Morris, I. Stoica, "Looking Up Data in P2P Systems," Communications of the ACM, V. 46, N. 2, (February 2003).


One line summary: This paper provides an overview of the various DHT-based data lookup mechanisms for peer-to-peer systems.

Summary

This paper begins by introducing the lookup problem in peer-to-peer (P2P) systems: finding data in a large P2P system containing a dynamic set of nodes, in a scalable manner and without introducing any centralization or hierarchy. The paper focuses primarily on solutions based on the distributed hash table (DHT) abstraction, after briefly outlining the problems with solutions involving a central database or hierarchical distributed storage structures. The main benefit of a DHT-based solution is that it imposes few constraints: all data items must be uniquely identified with a numeric key, and nodes must store data for other nodes. A DHT implements only a lookup operation, which takes as input the identifying key of a data item and returns the location of the node responsible for that item. A DHT lookup algorithm must address the following: mapping keys to nodes in a load-balanced way, forwarding lookups to the appropriate nodes, providing a distance function that describes the closeness of keys to each other and of nodes to each other, and adaptively building routing tables.

The DHT-based lookup algorithms reviewed in this paper include CAN, Chord, Kademlia, Pastry, Tapestry, and Viceroy. They vary in the data structures they use to provide O(log N) lookups, their routing dimensions and emphases, their distance functions, their operation costs, their degree of fault tolerance and concurrency, their routing, and their security measures. For example, Chord maintains a skiplist-like data structure, while Pastry, Tapestry, and Kademlia use a tree-based structure; in contrast, CAN uses a d-dimensional Cartesian coordinate space as its abstraction. With respect to distance functions, Chord uses a distance along the identifier circle that is unidirectional and asymmetric, Pastry uses prefixes and numeric distances and so is symmetric but not unidirectional, and Kademlia uses an XOR-based function that is both unidirectional and symmetric. Topics for future work in this area include analyzing operation costs when the system is large and joins and departures are frequent, improving routing latency, providing security mechanisms, and implementing efficient indexing and keyword search.
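
The contrast between these distance functions is easy to show on small identifiers; the sketch below is my own illustration, using 8-bit IDs for readability.

```python
# Tiny illustration of the distance functions contrasted above.
M = 8
RING = 2 ** M

def chord_distance(a: int, b: int) -> int:
    """Clockwise distance from a to b: unidirectional but asymmetric."""
    return (b - a) % RING

def prefix_closeness(a: int, b: int) -> int:
    """Pastry/Tapestry-style closeness: length of the shared bit prefix.
    Symmetric, but not unidirectional (many ids share the same prefix length)."""
    diff = a ^ b
    return M if diff == 0 else M - diff.bit_length()

def kademlia_distance(a: int, b: int) -> int:
    """XOR metric: symmetric, and unidirectional (for any a and distance d
    there is exactly one b with that distance from a)."""
    return a ^ b

a, b = 0x2D, 0xB1
print(chord_distance(a, b), chord_distance(b, a))        # 132 124 (asymmetric)
print(kademlia_distance(a, b), kademlia_distance(b, a))  # 156 156 (symmetric)
print(prefix_closeness(a, b))                            # 0 shared leading bits
```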

Critique

There isn’t much to say in terms of a critique since this paper is just a brief introduction to and description of various algorithms, but I did find it helpful and thought the writing was nicely structured, so I would keep it in the syllabus.