Showing posts with label Mahajan.
Saturday, October 17, 2009
Interactive WiFi Connectivity For Moving Vehicles
A. Balasubramanian, R. Mahajan, A. Venkataramani, B. N. Levine, J. Zahorjan, "Interactive WiFi Connectivity For Moving Vehicles", ACM SIGCOMM Conference, (August 2008).
One line summary: This paper describes ViFi, a protocol for providing WiFi connectivity to moving vehicles.
Summary
This paper introduces ViFi, a protocol for providing WiFi connectivity to moving vehicles with minimal disruptions. ViFi achieves this by allowing a vehicle to use several base stations (BSes) simultaneously, so that handoffs are not hard handoffs. The paper begins by evaluating six different handoff policies on the authors' VanLAN testbed at the Microsoft Redmond campus: (1) RSSI, in which the client associates with the base station with the highest received signal strength indicator; (2) BRR, in which the client associates with the base station with the highest beacon reception ratio; (3) Sticky, in which the client does not disassociate from a base station until connectivity is absent for some defined period of time, and then chooses the one with the highest signal strength; (4) History, in which the client associates with the base station that has historically provided the best performance at that location; (5) BestBS, in which the client associates, in a given second, with the base station that will provide the best performance in the next second; and (6) AllBSes, in which the client opportunistically uses all base stations in the vicinity. The first five policies use hard handoffs. The last is an idealized method that represents an upper bound. The metrics used to judge these policies are aggregate performance and periods of uninterrupted connectivity. Sticky performs the worst, while RSSI, BRR, and History perform similarly. AllBSes is by far the best. The authors also observe gray periods, in which connection quality drops sharply and unpredictably.
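To make the first two policies concrete, here is a minimal sketch of how a client might pick a base station under RSSI and BRR. The beacon-record format and the expected-beacons parameter are assumptions made for illustration, not details taken from the paper or its testbed code.

```python
from collections import defaultdict

def pick_by_rssi(beacons):
    """RSSI policy: associate with the BS whose most recent beacon
    had the highest received signal strength indicator."""
    latest = {}
    for b in beacons:                       # beacons: time-ordered dicts
        latest[b["bs_id"]] = b["rssi"]
    return max(latest, key=latest.get)

def pick_by_brr(beacons, expected_per_bs):
    """BRR policy: associate with the BS with the highest beacon
    reception ratio (beacons heard / beacons expected in the window)."""
    heard = defaultdict(int)
    for b in beacons:
        heard[b["bs_id"]] += 1
    return max(heard, key=lambda bs: heard[bs] / expected_per_bs)

# Hypothetical usage over one measurement window of overheard beacons:
window = [{"bs_id": "BS1", "rssi": -60}, {"bs_id": "BS2", "rssi": -45},
          {"bs_id": "BS2", "rssi": -47}]
print(pick_by_rssi(window), pick_by_brr(window, expected_per_bs=10))
```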
The authors explain that the effectiveness of AllBSes stems from the fact that the vehicle is often in range of multiple BSes and that packet losses are bursty and roughly independent across senders and receivers. They design ViFi as a practical version of AllBSes. In ViFi, there is an anchor base station and a set of auxiliary base stations. The identities of these are embedded in the beacons that the vehicle broadcasts periodically, along with the identity of the previous anchor. An auxiliary base station attempts to relay packets that have not been received by the destination. It does so with an estimated probability that is computed according to the following guidelines: (1) relaying decisions made by other potentially relaying auxiliaries must be accounted for, (2) auxiliaries with better connectivity to the destination should be preferred, and (3) the expected number of relayed transmissions should be limited. ViFi also allows for salvaging, in which a new anchor contacts the previous anchor for packets that were supposed to be delivered to the vehicle but could not be delivered before the vehicle moved out of range. ViFi uses broadcasts and implements its own retransmissions. It does not implement exponential backoff because it assumes collisions will not be a problem. To evaluate ViFi, the authors deploy it on VanLAN and run a trace-driven simulation using measurements from another testbed called DieselNet. The most important findings, according to the authors, are: (1) the link-layer performance of ViFi is close to ideal, (2) ViFi improves application performance twofold over current handoff methods, (3) ViFi places little additional load on the vehicle-base station medium, and (4) ViFi's mechanism for coordinating relays has low false positive and false negative rates.
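A rough sketch of how an auxiliary might set its relaying probability under those three guidelines is below. This is an illustrative formulation rather than the paper's exact computation, and the probability estimates it takes as input are simply assumed to be available to each auxiliary.

```python
def relay_probability(my_id, heard_prob, deliver_prob):
    """heard_prob[i]: estimated probability that auxiliary i overheard the
    packet; deliver_prob[i]: estimated probability that auxiliary i could
    deliver it to the destination. Returns this auxiliary's probability
    of relaying the packet."""
    # Keep the expected number of relayed copies across all auxiliaries
    # near one (guidelines 1 and 3) while favoring auxiliaries that are
    # more likely to deliver the packet (guideline 2):
    #   sum_i heard_prob[i] * r_i ~= 1, with r_i proportional to deliver_prob[i]
    norm = sum(heard_prob[i] * deliver_prob[i] for i in heard_prob)
    if norm == 0:
        return 0.0
    return min(1.0, deliver_prob[my_id] / norm)

# Hypothetical example with two auxiliaries that may have overheard the packet:
heard = {"aux1": 0.9, "aux2": 0.6}
deliver = {"aux1": 0.8, "aux2": 0.4}
print(relay_probability("aux1", heard, deliver))  # ~0.83, versus ~0.42 for aux2
```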
Critique
I didn’t find this paper very exciting, but I still thought there were a lot of good things about it. The idea behind ViFi isn’t all that novel, since similar techniques have been used in cellular networks. I liked that the authors first examined a variety of handoff techniques and gave results for those. One thing I am confused about is how they implemented AllBSes in order to test it, considering they argued that it is impractical to implement and represents the ideal. In general, the paper seemed to have a lot of evaluation in it, which I liked. Although ViFi did perform well in the settings in which it was tested, I’m not sure how practically useful it is overall. For example, the authors assume that collisions are not an issue. If ViFi were to become successful and widely deployed, it’s possible that collisions would become a problem, and they don’t even attempt to consider situations in which there might be collisions. Also, I can only imagine people wanting to connect to the Internet when they are traveling fairly long distances, but as the authors mention in the paper, WiFi deployments may not be broad enough, and lack of coverage between islands may make the system unusable. There are several other deployment challenges too, not just those mentioned in the paper. I just can’t imagine it being desirable in many settings. I don’t see cities wanting to invest in this, and I’m not sure you could get many parties besides the city government to freely provide access. It could be useful to deploy ViFi base stations along a commuter rail line, though in that case, since you know the speed, location, and schedule of the moving vehicles, you could probably do even better than ViFi.
Labels: Balasubramanian, Levine, Mahajan, Venkataramani, wireless, Zahorjan
Wednesday, September 16, 2009
Detailed Diagnosis in Enterprise Networks
S. Kandula, R. Mahajan, P. Verkaik, S. Agarwal, J. Padhye, P. Bahl, "Detailed Diagnosis in Enterprise Networks," ACM SIGCOMM Conference, (August 2009).
One line summary: This paper presents NetMedic, a diagnostic system for enterprise networks that uses network history to infer the causes of faults without application-specific knowledge, representing components and their dependencies as a directed graph and reasoning over this structure; the paper shows that NetMedic pinpoints fault causes with greater specificity than previous diagnostic systems.
Summary
This paper describes NetMedic, a diagnostic system for enterprise networks. NetMedic approaches the problem of diagnosing faults as one of inference. The goal of NetMedic is to identify the likely causes of a fault with as much specificity as possible, using minimal application-specific knowledge. It does this by modeling the network as a dependency graph and then using history to detect abnormalities and likely causes. The nodes of this graph are network components such as application processes, machines, configurations, and network paths. There is a directed edge from a node A to a node B if A impacts B, and the weight of this edge represents the magnitude of that impact. Each component has a state consisting of many variables. The abnormality of a component at a given time is determined from its history and used to compute the edge weights. The authors describe various extensions to make this process more robust to large and diverse sets of variables. After the weights are obtained, candidate causes are ranked such that more likely causes receive better (lower) ranks.
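To illustrate the flavor of that ranking step, here is a toy sketch. It is not NetMedic's actual algorithm; the component names, abnormality scores, and edge weights are invented for the example, and only direct edges into the affected component are considered.

```python
def rank_causes(edges, abnormality, affected):
    """edges: dict mapping (src, dst) -> impact weight in [0, 1].
    abnormality: dict mapping component -> abnormality score in [0, 1].
    Returns candidate causes of `affected`, most likely first."""
    scores = {}
    for (src, dst), impact in edges.items():
        if dst == affected and src != affected:
            # A likely cause is a component that is itself abnormal and
            # has a strong impact edge into the affected component.
            scores[src] = abnormality[src] * impact
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical example: a misbehaving application process and a healthy
# network path both have edges into a server that is showing problems.
edges = {("AppProc", "Server"): 0.9, ("NetPath", "Server"): 0.2}
abnormality = {"AppProc": 0.8, "NetPath": 0.1, "Server": 0.7}
print(rank_causes(edges, abnormality, "Server"))  # ['AppProc', 'NetPath']
```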
The authors implement NetMedic on the Windows platform, using the Windows Performance Counter framework as their source of data. They claim to be developing a prototype for Linux as well. They evaluate NetMedic in comparison to a coarse diagnosis method loosely based on previous systems. They evaluate in a live environment and a controlled environment, but it is unclear how realistic even the live environment is, as they inject the faults they are trying to detect. In one of their evaluations, for 80% of the faults the median rank of the true cause is 1 in NetMedic, meaning NetMedic correctly identifies the culprit in these cases. They also demonstrate the benefit of their extensions by comparing them with a version of NetMedic that has application-specific information hand-coded into it; their extensions perform well in this comparison. They also study how NetMedic does when diagnosing two simultaneous faults and examine the impact of the length of history used.
Critique
This paper was interesting to read. After reading the first section or two, the first question that comes to mind is how they manage without knowing application-specific details, because it is pretty obvious that this method isn’t workable in the general case. They have a clever way of getting around this in the analysis phase, but they do admit that in the data collection phase of their experiments they utilize application-specific information about where configuration data is stored, though they claim to be working on a way to get around this. There is one part of the implementation section where they explain that they handle some counters differently from others because those counters represent cumulative values, and it made me wonder how they determine which counters fall into this category. Does that not count as having application-specific information? They talk about automatically detecting cumulative variables earlier in the paper when discussing their extensions, such as aggregate relationships across variables, but the example they give with the counter (the number of exceptions a process has experienced) doesn’t seem to fall into the same category as those discussed in the extensions section.
Since NetMedic would be used by network administrators, and since some things about a network stay the same all the time (such as what kinds of applications are running), it might be interesting to see how NetMedic could be enhanced if administrators had the option of providing some application-specific details, and whether there are ways NetMedic could leverage this. I’m not really sure how or if that would work, but it is something to consider.
I wasn’t particularly impressed by their evaluation. It would be more compelling if they had more data from real-world situations instead of constructed situations with injected faults. It might have been informative too if they had shown some results measuring metrics other than the rank of the correct cause, although I can’t think of another metric off the top of my head. Also, they compared their system against another that was “loosely based” on systems such as Sherlock and Score, and they don’t really discuss that system much, so it seems a bit questionable as to whether this is a fair comparison. Lastly, their evaluation in which they show NetMedic identifies a virus scanning program or sync utility as abnormal doesn’t seem like something to brag about. I’m not sure I understand this section or why this is a good thing, since presumably virus scanning is an acceptable activity. In this section, they claim to be showing how NetMedic can help with naturally occurring faults, but I’m not sure they actually accomplish that. I could be just entirely misunderstanding this section.
It might be interesting to explore using variables as nodes in the graph instead of just the components, though this would probably make it much harder to scale. It would also be interesting to see how NetMedic performs when a fault is due to a confluence of factors rather than one culprit alone.
Thursday, September 3, 2009
Understanding BGP Misconfiguration
R. Mahajan, D. Wetherall, T. Anderson, "Understanding BGP Misconfiguration," ACM SIGCOMM Conference, (August 2002).
One line summary: In this paper the authors identify and analyze BGP misconfiguration errors, measure their frequency, classify them into a number of types, examine their causes, and suggest a number of mechanisms for reducing these misconfigurations.
Summary
This paper provides a quantitative and systematic study of BGP misconfiguration. The authors classify misconfigurations into two main types: origin misconfiguration and export misconfiguration. Origin misconfiguration occurs when an AS inadvertently advertises an IP prefix and the route becomes globally visible. Export misconfiguration occurs when an AS fails to filter a route that should have been filtered, thereby violating the policies of one or more of the ASes in the AS path. They identify a number of negative effects of such misconfigurations, including increased routing load, connectivity disruption, and policy violation. In order to measure and analyze misconfigurations, the authors collected data from 23 peers in 19 ASes over a period of three weeks. They examine new routes and assume that those that do not last very long are likely due to misconfigurations or failures, and so select these to investigate. As part of their investigation they used an email survey of the operators of the ASes involved, as well as a connectivity verifier to determine the extent of disruptions. They note that their method underestimates the number and effect of misconfigurations for various reasons.
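The selection heuristic described above, flagging routes that appear and then disappear quickly as candidate misconfigurations or failures, might look roughly like the following sketch. The update format, field names, and one-day threshold are assumptions made for illustration, not the authors' actual tooling.

```python
def short_lived_routes(updates, window=24 * 3600):
    """updates: time-ordered dicts with keys 'time' (seconds), 'type'
    ('A' for announce, 'W' for withdraw), and 'prefix'. Prefixes whose
    routes are withdrawn within `window` seconds of being announced are
    returned as candidate misconfigurations (or failures)."""
    announced = {}     # prefix -> time of the announcement that created it
    candidates = []
    for u in updates:
        if u["type"] == "A":
            announced.setdefault(u["prefix"], u["time"])
        elif u["type"] == "W" and u["prefix"] in announced:
            if u["time"] - announced.pop(u["prefix"]) < window:
                candidates.append(u["prefix"])
    return candidates

# Hypothetical update stream: one route persists, the other lasts ten minutes.
stream = [{"time": 0,   "type": "A", "prefix": "10.1.0.0/16"},
          {"time": 0,   "type": "A", "prefix": "192.0.2.0/24"},
          {"time": 600, "type": "W", "prefix": "192.0.2.0/24"}]
print(short_lived_routes(stream))  # ['192.0.2.0/24']
```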
The authors first describe their results for origin misconfiguration analysis. They classify the new routes that are potential results of origin misconfiguration into three categories: self-deaggregation of prefixes, announcement of a new route whose origin is related to the origin of the old route via their AS paths, and announcement of a new route with a foreign origin (unrelated to that of the old route). They observe that the number of incidents from each of these three categories is roughly the same, with self-deaggregation being slightly higher. They note, however, that the success rates for identifying these different types of origin misconfiguration differ, as some incidents that were classified as origin misconfigurations were actually the result of failures. Some interesting conclusions they draw from their analysis are that at least 72% of new routes seen by a router in a day are the result of misconfiguration; that 13% of incidents cause connectivity disruptions, mainly caused by new routes of foreign origin; that, compared to failures, connectivity disruptions due to misconfiguration play a small role; and that 80% of misconfigurations are corrected within an hour, often less if the misconfiguration disrupts connectivity. The authors next examine export misconfigurations. They note that such misconfigurations do not tend to cause connectivity problems directly, and that most incidents involved providers rather than peers. The authors also examine the effect of misconfigurations on routing load and conclude that in the extreme case, load can spike to 60%.
In the paper, the authors also identify a number of causes of misconfiguration, which they group into slips and mistakes; the two turn out to be roughly equally responsible. Mistakes causing origin misconfiguration include initialization bugs, reliance on upstream filtering, and use of old configurations. Slips causing origin misconfiguration include accidents (such as typos) in specifying redistribution, attachment of the wrong community attribute to prefixes, hijacks, forgotten filters, incorrect summaries, unknown errors, and miscellaneous problems. They also identify three additional mistakes causing export misconfiguration: prefix-based configuration, bad ACLs or route maps, and initialization bugs. Lastly, they identify a number of causes of short-lived new routes that are not misconfigurations, including failures, testing, migration, and load balancing.
Lastly, the authors suggest a number of ways to reduce misconfigurations. These include improving router CLIs, implementing transactional semantics for configuration changes, developing and supporting high-level configuration tools, developing configuration checkers, and building database consistency mechanisms. They also describe S-BGP, a protocol extension to BGP that would prevent about half of the misconfigurations they observed.
Critique
In general, I thought this was an entertaining read, especially as I could relate to how difficult router CLIs are to use and how easy it is to make mistakes configuring BGP, having had to do this in a previous networks class. It is unfortunate that, because misconfigurations are hard to identify, the authors’ methodology was necessarily limited. I’d be interested to see whether newer techniques have been or could be developed for this and similar sorts of analysis. Due to the weaknesses in their methodology, I’m not sure how meaningful some of the figures and percentages they derive actually are, but they do still provide some interesting insights, especially if the authors are correct in arguing that their study provides a lower bound. That said, I still think their approach was clever, given the difficulties.
I particularly liked their suggestions for reducing some of the causes of misconfiguration. User interface design improvements seemed to me to be the most obvious thing to do. In general, I wonder why many of their suggested solutions, which have been used in many other contexts and computer systems, have not been applied to router configuration. Although the authors do briefly discuss some of the barriers to implementing such solutions, it still surprises me that harried system administrators haven’t risen up and demanded that at least some of the more obvious steps be taken sooner; maybe the potential for missteps makes things more interesting for them, but it’s hard to say. I think investigating some of the improvements they suggest would be an interesting area for research, although they did seem to imply at one point that industry can be a barrier to the adoption of some of these improvements, which might be too frustrating to deal with.