Thursday, September 3, 2009

Understanding BGP Misconfiguration


R. Mahajan, D. Wetherall, T. Anderson, "Understanding BGP Misconfiguration," ACM SIGCOMM Conference, (August 2002).

One line summary: In this paper the authors identify and analyze BGP misconfiguration errors, measure their frequency, classify them into a number of types, examine their causes, and suggest a number of mechanisms for reducing these misconfigurations.

Summary

This paper provides a quantitative and systematic study of BGP misconfiguration. They classify misconfigurations into two main types: origin misconfiguration and export misconfiguration. Origin misconfiguration occurs when an AS inadvertently advertises an IP prefix and it becomes globally visible. Export configuration occurs when an AS fails to filter a route that should have been filtered, thereby violating policies of one or more of the ASs in the AS path. They identify a number of negative effects of such misconfigurations, including increased routing load, connectivity disruption, and policy violation. In order to measure and analyze misconfigurations, the authors collected data from 23 peers in 19 ASs over a period of three weeks. They examine new routes and assume that those that don’t last for very long are likely due to misconfiguration and failures, and so select these to investigate. As part of their investigation they used an email survey of the operators of the ASs involved, as well as a connectivity verifier to determine the extent of disruptions. They note that their method underestimates the number and effect of misconfigurations for various reasons.


The authors first describe their results for origin misconfiguration analysis. They classify the new routes that are potential results of origin misconfiguration into three categories, self-deaggregation of prefixes, announcement of a new route with an origin related to the origin of the old route via their AS paths, and announcement of a new route with a foreign origin (unrelated to that of the old route). They observe that the number of incidents from each of these three categories is roughly the same, with self-deaggregation being slightly higher. They note, however, that the success rates for identifying these different types of origin misconfigurations are different for each, as some incidents that were classified as origin misconfigurations were actually the result of failures. Some interesting conclusions they draw from their analysis are that at least 72% of new routes seen by a router in a day are the result of misconfiguration, that 13% of incidents cause connectivity disruptions, mainly caused by new routes of foreign origin, that compared to failures connectivity disruptions due to misconfigurations play a small role, and that 80% of misconfigurations are corrected within an hour, often less if the misconfiguration disrupts connectivity. The authors next examine export misconfigurations. They note that such misconfigurations do not tend to cause connectivity problems directly, and that most incidents involved providers rather than peers. The authors also examine the effect of misconfigurations on routing load, and conclude that in the extreme case, load can spike to 60%.


In the paper, the authors identify and classify a number of causes of misconfiguration, which they classify into slips and mistakes. Slips and mistakes turn out to be roughly equally responsible for misconfiguration. Mistakes in origin misconfiguration that they identified include initialization bugs, reliance on upstream filtering, and use of old configurations. Slips in origin misconfiguration include accidents (such as typos) in specifying redistribution, attachment of the wrong community attribute to prefixes, hijacks, forgotten filters, incorrect summaries, unknown errors, and miscellaneous problems. They also identify three additional mistakes causing export misconfiguration, including prefix-based configuration, bad ACL or route map, and initialization bugs. Lastly, they identify a number of causes for short-lived new routes that are not misconfigurations, including failures, testing, migration, and load balancing.


Lastly, the authors suggest a number of ways to reduce misconfigurations. These include enacting improvement to the router CLIs, implementing transactional semantics for configuration changes, developing and supporting high-level configuration tools, developing configuration checkers, and building database consistency mechanisms. They also describe a protocol extension to BGP called S-BGP which would prevent about half of the misconfigurations they observed.

Critique

In general, I thought this was an entertaining read, especially as I could relate to how difficult router CLIs are to use and how easy it is to make mistakes in configuring BGP, having had to do this in a previous networks class. It is unfortunate that because misconfigurations are hard to identify, the authors’ methodology was necessarily limited. I’d be interested to see if other newer techniques have been or could be developed to do this and similar sorts of analysis. Due to the weaknesses in their methodology, I’m not sure how meaningful some of the figures and percentages they derive actually are, but they do still provide some interesting insights, especially if they are correct in arguing that their study provides a lower bound. That said, I still think their approach was clever, given the difficulties.

I particularly like their suggestions for reducing some of the causes of misconfigurations. User interface design improvements seemed to me to be the most obvious thing to do. In general, I wonder why many of their suggested solutions, which have been used in many other contexts and computer systems, have not been used in router configuration. Although the authors do briefly discuss some of the barriers to implementing such solutions, it still surprises me that harried system administrators haven’t risen up and demanded that at least some of the more obvious steps be taken sooner, but maybe the potential for missteps makes things more interesting for them, it’s hard to say. I think investigation of some of the improvements they suggest would be an interesting area for research, although they did seem to imply at one point that industry can be a barrier to the adoption of some of these improvements, which might be too frustrating to deal with.

1 comment:

  1. Nicely written summary, well done!
    Researchers have tried to come up with better ways to specify routing policies, to little practical effect. There seems to be great resistance to doing something differently, despite the problems.

    ReplyDelete