Wednesday, September 16, 2009

Detailed Diagnosis in Enterprise Networks


S. Kandula, R. Mahajan, P. Verkaik, S. Agarwal, J. Padhye, P. Bahl, "Detailed Diagnosis in Enterprise Networks," ACM SIGCOMM Conference, August 2009.


One line summary: This paper presents NetMedic, a diagnostic system for enterprise networks that infers the causes of faults from network history, without application-specific knowledge, by representing components as a directed dependency graph and reasoning over that structure; the authors show that NetMedic pinpoints fault causes with greater specificity than previous diagnostic systems.

Summary

This paper describes NetMedic, a diagnostic system for enterprise networks. NetMedic approaches fault diagnosis as an inference problem: the goal is to identify the likely causes of a fault with as much specificity as possible while requiring minimal application-specific knowledge. It does this by modeling the network as a dependency graph and then using history to detect abnormalities and likely causes. The nodes of the graph are network components such as application processes, machines, configuration elements, and network paths. There is a directed edge from node A to node B if A impacts B, and the weight of the edge captures the magnitude of that impact. Each component has a state consisting of many variables; the abnormality of a component at a given time is determined by comparing its current state against its historical behavior, and these abnormality values are used to compute the edge weights. The authors describe several extensions that make this process more robust to large and diverse sets of variables. Once the weights are obtained, candidate causes are ranked so that the most likely causes receive the lowest (best) ranks.
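To make the mechanics concrete, here is a minimal sketch of this style of history-based inference. It is my own simplification rather than the authors' algorithm: real NetMedic tracks many state variables per component, derives edge weights from several similar historical periods rather than the single closest one, and propagates scores along multi-hop paths. All names below (Component, abnormality, edge_weight, rank_causes) and the one-variable-per-component setup are hypothetical.

    import statistics

    class Component:
        """A network component (process, machine, config) tracked through a
        single state variable; real NetMedic tracks many variables each."""
        def __init__(self, name, history):
            self.name = name        # e.g. "process:database"
            self.history = history  # past sampled values of the variable

    def abnormality(comp, current):
        """How far the current value deviates from history, squashed into [0, 1]."""
        mu = statistics.mean(comp.history)
        sigma = statistics.pstdev(comp.history) or 1e-9   # guard constant history
        return min(abs(current - mu) / sigma / 3.0, 1.0)  # ~3 sigma => fully abnormal

    def edge_weight(src, dst, now):
        """Impact of src on dst: find the historical period in which src looked
        most like it does now, and measure how different dst was then from its
        current state; a small difference means src plausibly explains dst."""
        t = min(range(len(src.history)),
                key=lambda i: abs(src.history[i] - now[src.name]))
        diff = abs(dst.history[t] - now[dst.name])
        scale = (max(dst.history) - min(dst.history)) or 1e-9
        return 1.0 - min(diff / scale, 1.0)  # in [0, 1]; higher = stronger impact

    def rank_causes(edges, affected, now):
        """Score each candidate cause of the affected component; the most
        likely cause comes first, i.e. receives rank 1."""
        scores = {src.name: abnormality(src, now[src.name]) * edge_weight(src, dst, now)
                  for src, dst in edges if dst is affected}
        return sorted(scores.items(), key=lambda kv: -kv[1])

    # Toy run: a database CPU spike that historically coincided with slow web responses.
    web = Component("process:webserver", [10, 11, 10, 12, 35, 10])  # response time (ms)
    db  = Component("process:database",  [5, 5, 6, 5, 40, 5])       # CPU (%)
    cfg = Component("config:webserver",  [1, 1, 1, 1, 1, 1])        # config version
    now = {"process:webserver": 50, "process:database": 45, "config:webserver": 1}
    print(rank_causes([(db, web), (cfg, web)], web, now))
    # -> the abnormal database outranks the unchanged configuration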

The authors implement NetMedic on the Windows platform, using the Windows Performance Counter framework as their source of data; they claim to be developing a prototype for Linux as well. They evaluate NetMedic against a coarse diagnosis method loosely based on previous systems, in both a live environment and a controlled one, though it is unclear how realistic even the live environment is, since they inject the faults they are trying to detect. In one of their evaluations, the median rank of the true cause is 1 for 80% of the faults, meaning NetMedic correctly identifies the culprit in those cases. They also demonstrate the benefit of their extensions by comparing against a version of NetMedic with application-specific information hand-coded into it; the extensions perform well in this comparison. Finally, they study how NetMedic fares when diagnosing two simultaneous faults, and they examine the impact of the length of the history used.

Critique

This paper was interesting to read. After the first section or two, the question that comes to mind is how they manage without knowing application-specific details, because a fully application-agnostic approach seems unworkable in the general case. They have a clever way of getting around this in the analysis phase, but they do admit that the data collection phase of their experiments uses application-specific information about where configuration data is stored, though they claim to be working on a way around that. One part of the implementation section says that some counters are handled differently from others because they represent cumulative values, and it made me wonder how they determine which counters fall into this category. Does that not count as application-specific information? They do discuss automatically detecting cumulative variables earlier in the paper, among extensions such as aggregate relationships across variables, but the example they give here (the number of exceptions a process has experienced) does not seem to fall into the same category as the cases discussed in the extensions section.
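As an aside, my guess is that the special handling of cumulative counters amounts to something like differencing: a counter such as "number of exceptions since process start" only ever grows, so its raw value says little, and the per-interval increment is what the diagnosis should look at. A tiny illustration (my own, not from the paper):

    def deltas(cumulative):
        """Convert cumulative counter samples to per-interval increments.
        A drop in value (e.g. a counter reset after a process restart) is
        treated as a fresh start rather than a negative increment."""
        return [cur - prev if cur >= prev else cur
                for prev, cur in zip(cumulative, cumulative[1:])]

    print(deltas([0, 3, 3, 10, 2]))  # -> [3, 0, 7, 2] (reset before the last sample)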

Since NetMedic would be used by network administrators, and some aspects of a network stay the same most of the time (such as which applications are running), it might be interesting to see how NetMedic could be enhanced if administrators had the option of providing some application-specific details, and whether there are ways NetMedic could leverage them. I am not sure how, or even whether, that would work, but it is something to consider.

I wasn't particularly impressed by their evaluation. It would be more compelling if they had more data from real-world situations instead of constructed ones with injected faults. It might also have been informative to report metrics other than the rank of the correct cause, although I can't think of another metric off the top of my head. Also, they compare their system against one that is "loosely based" on systems such as Sherlock and Score, but they don't really discuss that system, so it is questionable whether the comparison is fair. Lastly, the evaluation in which NetMedic identifies a virus-scanning program or a sync utility as abnormal doesn't seem like something to brag about; presumably virus scanning is an acceptable activity, so I'm not sure I understand this section or why this is a good thing. They claim it shows how NetMedic helps with naturally occurring faults, but I'm not sure they actually accomplish that. I could be entirely misunderstanding this section.

It might be interesting to explore using individual variables, rather than whole components, as the nodes of the graph, though that would probably make the system much harder to scale. It would also be interesting to see how NetMedic performs when a fault is due to a confluence of factors rather than a single culprit.

1 comment:

  1. Evaluation of algorithms like this is indeed rather complicated. You never know what kinds of problems will remain undiscovered in real systems. Fault injection is a fairly common approach; it becomes more believable when the injected faults are based on faults actually observed in the wild.
