Friday, October 30, 2009

Resilient Overly Networks


D. Andersen. H. Balakrishnan, F. Kaashoek, R. Morris, "Resilient Overly Networks," 18th Symposium on Operating Systems Principles, (December 2001).


One line summary: This paper describes a Resilient Overlay Network architecture which is intended to run at the application layer over the existing Internet infrastructure to provide path failure detection and recovery as well as mechanisms for applications to choose their own routing metrics and implement expressive routing policies.

Summary

This paper describes Resilient Overlay Network (RON), an application-layer overlay that exists on top of the existing Internet. The goal of RON is to allow Internet applications to detect and recover from path failures and performance degradation quickly. That is, a RON seeks to (1) provide failure detection and recovery within 20 seconds, (2) integrate routing more tightly with applications by letting applications choose their own routing metrics, and (3) allow expressive routing policies relating to acceptable use polices and rate controls.

A RON consists of a set of nodes deployed at various locations on the Internet and each path between two RON nodes is represented as a single virtual link. An application that communicates with a RON node is a RON client that interacts with the RON node across an API. Each client instantiates a RON router that implements a routing protocol and a RON membership manager that keeps track of the list of RON members. By default, RON maintains information about three metrics for each virtual link: latency, loss rate, and throughput. It obtains this information via probes and shares it with all RON nodes by storing it in a central performance database. It also uses these probes to detect path outages. RON uses the data collected from these probes and stored in the performance database in a link-state routing protocol to build routing tables according to each metric. When it detects a path outage, RON uses single-hop indirection to route over a new path. RON provides policy routing by allowing administrators to define types of traffic allowed on certain links and it implements this policy routing using packet classification methods. RON manages membership via either a static mechanism that reads the list of members from a file or a dynamic mechanism that uses announcement flooding and stores the discovered members in soft-state at each node. One important note about RONs is that they are intended to consist of only a small number of nodes (approximately less than 50) because the many of the mechanisms used will not scale, but RON can still provide the intended benefits with only a small number of nodes.

The authors evaluated RON by deploying two wide-area RONs, one with 12 nodes (RON1) and one with 16 nodes (RON2). They found that RON1 recovered from path outages 100% of the time and RON2 60% of the time. (The lower figure from RON2 is in part due to a site becoming completely unreachable.) RON is able to detect and share information about a virtual link failure in an average of 26 seconds. It does this with an overhead of 30 Kbps of active probing. RON was also able to route around flooding links and improve performance in a small number of cases. The authors were also able to show that RON routes are relatively stable, with the median length of time for a route to persist being about 70 seconds.

Critique

RONs are a somewhat interesting idea and I do think the problem they address – that of allowing applications to recover from path failures because BGP is very slow at doing so and of allowing applications to choose their own routing metrics – is compelling. However, there are several big issues with RONs, the two most obvious being poor performance and inability to scale. With respect to the first, RON’s IP forwarder that the authors implemented could only forward packets at a maximum rate 90 Mbps. Also, RON was only able to select paths with better performance with respect to the default metrics a very small amount of the time. RON also introduced a fair amount of overhead due to its use of flooding. This also relates to its inability to scale. However, I still think many of the ideas behind RONs are useful.

1 comment:

  1. I think you brought up an important point in class -- that the RON nodes have to be placed in such a fashion as to provide sufficient diversity in paths to provide true resiliency to path failure. Most RON nodes are likely to be in different edge ASs, since they are in universities. Hopefully there are sufficient independent Tier 1 ISPs to have enough diversity at that level, but it would be interesting to understand path failure in the current context rather than Year 2000.

    ReplyDelete