Showing posts with label botnets. Show all posts
Tuesday, December 1, 2009
BotGraph: Large Scale Spamming Botnet Detection
Y. Zhao, Y. Xie, F. Yu, Q. Ke, Y. Yu, Y. Chen, E. Gillum, "BotGraph: Large Scale Spamming Botnet Detection," NSDI'09, (April 2009).
One line summary: This paper describes BotGraph, a method for detecting web account abuse attacks that combines EWMA-based detection of aggressive signups with a user-user graph whose correlations distinguish bot-users from normal users.
Summary
This paper presents a system for detecting web account abuse attacks, in which spammers use botnets to create many fake email accounts through free web email service providers (such as Gmail and Hotmail) and use them to send spam. To detect this, the authors provide an algorithmic means of finding correlations among bot-users and distinguishing them from real users, as well as a mechanism for efficiently analyzing large volumes of data to reveal such correlations. The authors name their approach BotGraph; its goal is to capture the spamming email accounts used by botnets.
BotGraph works by constructing a large user-user graph and analyzing the underlying correlations. The key observation is that bot-users share IP addresses when they log in and send emails. BotGraph has two main components: the first detects aggressive signups, and the second detects the remaining stealthy bot-users based on their login activities. To detect aggressive signups, the authors use a simple Exponentially Weighted Moving Average (EWMA) algorithm that flags sudden changes in signup activity. The rationale is that in the normal case, signups at a single IP address should happen infrequently, or at least at a roughly consistent rate over time, so a sudden increase in signups suggests the IP address may be associated with a bot.

To detect stealthy bot-users, the authors build a user-user graph in which each vertex is a user and the weight of the edge between two users is based on the number of common IP addresses from which those users logged in. The authors reason that if aggressive signups are curtailed, each bot-user will need to log in and send emails multiple times from different locations. This results in shared IP addresses for two reasons: first, the number of bot-users is often much larger than the number of bots, and second, botnets have a high churn rate, so bot-users will be assigned to different bots over time. Since normal users also share IP addresses when dynamic IP addresses and proxies are used, the authors exclude this case by counting multiple shared IP addresses in the same autonomous system (AS) as one shared IP address. Once the graph is constructed, bot-user groups distinguish themselves from normal user groups by forming giant connected components, and the authors describe a recursive algorithm for extracting these connected components from the graph.
They then prune normal user groups from the resulting components, and group the components by bot-user group, using two statistics: the percentage of users who sent more than three emails per day, and the percentage of users who sent out emails of similar size.
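As a rough illustration of the EWMA component described above (this is a toy sketch, not the authors' implementation; the smoothing factor, the threshold, and the per-day signup counts are all assumptions):

```python
def ewma_spike_days(signup_counts, alpha=0.3, threshold=5.0):
    """Flag days whose signup count at one IP far exceeds the EWMA prediction.

    alpha and threshold are illustrative values, not taken from the paper.
    """
    prediction = signup_counts[0]
    anomalies = []
    for day, observed in enumerate(signup_counts[1:], start=1):
        # A sudden jump over the smoothed history suggests bot-driven signups.
        if observed - prediction > threshold:
            anomalies.append(day)
        # Standard EWMA update: recent observations weighted by alpha.
        prediction = alpha * observed + (1 - alpha) * prediction
    return anomalies

# Signups per day at a single IP address: quiet, then a sudden burst.
print(ewma_spike_days([1, 0, 2, 1, 0, 40, 35, 1]))  # → [5, 6]
```

The appeal of EWMA here is that it needs only constant state per IP address, which matters when tracking signups across millions of addresses.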
The authors next describe how they constructed their large user-user graph from 220 GB of Hotmail login data. They use Dryad/DryadLINQ, though they suggest other programming environments for distributed, data-parallel computing might also work. They describe two methods: the first partitions the data by IP address and uses the well-known map and reduce operations, whereas the second partitions the data by user ID and uses some special optimizations. They find the second method is better suited to their application and also much faster: 1.5 hours on 240 machines, compared to over 6 hours for the first method. However, as the second method has increasing communication overhead, it may be less scalable. The authors also describe a number of optimizations. They evaluated their approach on two month-long datasets. BotGraph detected approximately 21 million bot-users in total. It was able to detect 76.84% and 85.80% of known spammers on the two datasets respectively with the graph approach, and 85.15% and 8.15% respectively with the EWMA approach. The authors perform a false positive analysis using naming patterns and signup dates and estimate a false positive rate of 0.44%.
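A toy, in-memory sketch of the edge-weight computation may help make this concrete (the real system does this at scale with Dryad/DryadLINQ; the `(user, ip, asn)` record format and the minimum-weight cutoff here are assumptions for illustration):

```python
from collections import defaultdict

def user_graph_edges(logins, min_shared=2):
    """Compute user-user edge weights from (user, ip, asn) login records.

    Multiple shared IPs within the same AS count once, to discount
    dynamic IP ranges and proxies, as the paper describes.
    """
    users_by_ip = defaultdict(set)
    asn_of_ip = {}
    for user, ip, asn in logins:
        users_by_ip[ip].add(user)
        asn_of_ip[ip] = asn
    # For each pair of users, collect the ASes in which they share an IP.
    shared_asns = defaultdict(set)
    for ip, users in users_by_ip.items():
        for u in users:
            for v in users:
                if u < v:
                    shared_asns[(u, v)].add(asn_of_ip[ip])
    # Edge weight = number of distinct ASes with at least one shared IP.
    return {pair: len(asns) for pair, asns in shared_asns.items()
            if len(asns) >= min_shared}

logins = [("bot1", "1.2.3.4", 100), ("bot2", "1.2.3.4", 100),
          ("bot1", "5.6.7.8", 200), ("bot2", "5.6.7.8", 200),
          ("alice", "1.2.3.4", 100)]
print(user_graph_edges(logins))  # → {('bot1', 'bot2'): 2}
```

Note how `alice`, who shares only one AS with the bots, produces no edge, while the two bot-users sharing IPs across two ASes do.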
Critique
I liked this paper a lot. I’m not sure how much it has to teach about networking per se, not that that’s necessarily a bad thing. It was good that the authors spent some time discussing potential countermeasures attackers might take against their approach. One thing that many other students have mentioned, and that I’m not sure was addressed in the paper, is that BotGraph will not detect bot-users that all sit behind one IP address (e.g., a NAT); it is becoming increasingly common in countries like China for very large numbers of machines to sit behind a NAT, so this is worth noting. Overall this paper was very good and I think it should be kept in the syllabus. I would be interested to see future work along these lines that uses additional features to correlate, and methods besides graphs to detect correlations, as the authors mention.
Not-a-Bot: Improving Service Availability in the Face of Botnet Attacks
R. Gummadi, H. Balakrishnan, P. Maniatis, S. Ratnasamy, "Not-a-Bot: Improving Service Availability in the Face of Botnet Attacks," NSDI'09, (April 2009).
One line summary: This paper presents a system called Not-A-Bot (NAB) that distinguishes human-generated traffic from bot-generated traffic by attesting to human-generated traffic at the clients and verifying those attestations at the servers, in order to mitigate problems such as spam, DDoS attacks, and click fraud.
Summary
This paper presents a system called Not-A-Bot (NAB) for distinguishing human-generated web traffic from bot-generated web traffic. The motivation is that bots are responsible for a large amount of spam, distributed denial-of-service (DDoS) attacks, and click fraud, so being able to determine whether an email or request is human-generated rather than bot-generated would help mitigate these problems. NAB consists of an attester and a verifier. When an application requests an attestation for a particular request or email, the attester determines whether that request or email was indeed generated by a human, and if so, attaches a signed statement certifying that it was. The verifier runs on the server and checks whether the request or email carries a valid attestation. If it does, the application may choose, for example, to prioritize the request, or to increase the score of an email so it is more likely to get through a spam filter; otherwise it may treat the request or email as more likely to have come from a bot.
NAB assumes that applications and the OS are untrusted, and so relies on a Trusted Platform Module (TPM) to load the attester code and ensure that it is trusted. As mentioned, the NAB attester grants an attestation if it determines that a human generated the associated request or email. It does this by guessing, using as a heuristic how recently before the attestation request the last keyboard or mouse activity was observed: if the attestation was requested within a certain amount of time since the last keyboard or mouse activity, the attester grants it. An attestation is non-transferable and is bound to the content of the request it is generated for; it covers the entire application-specific payload and is responder-specific and, where appropriate, challenger-specific. The mechanism for attesting web requests and email in the common case is straightforward. The only complicated case is script-generated email, which requires deferred attestations. The verifier is straightforward and, as mentioned, implements an application-specific policy. The authors provide several example policies for spam, DDoS, and click-fraud mitigation.
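The timing heuristic can be sketched roughly as follows (a toy illustration only: the window length is an assumed value, the hash-based "signature" is a stand-in for the real cryptographic signature, and the actual attester runs in a TPM-protected environment rather than ordinary application code):

```python
import time

ATTEST_WINDOW = 1.0  # seconds; an assumed value, not taken from the paper

class Attester:
    """Toy sketch of NAB's attestation heuristic."""

    def __init__(self):
        self.last_input = float("-inf")  # no human input observed yet

    def record_input(self):
        # Called on every keyboard or mouse event.
        self.last_input = time.monotonic()

    def attest(self, payload):
        # Grant only if the request closely follows human input; bind the
        # (stand-in) signature to the payload so it is non-transferable.
        if time.monotonic() - self.last_input <= ATTEST_WINDOW:
            return ("ATTESTED", hash(payload))
        return None

a = Attester()
print(a.attest(b"GET /search"))  # → None: no recent human activity
a.record_input()
print(a.attest(b"GET /search"))  # granted: request follows human input
```

Binding the signature to the payload is what prevents a bot from replaying a legitimately obtained attestation on different content.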
The authors next describe their evaluation of NAB. They evaluate the attester with respect to TCB size, CPU requirements, and application changes, and the verifier with respect to the extent to which it mitigates attacks and the rate at which it can verify attestations. They find that the attester accounts for 500 SLOC out of 30,000 for the total TCB, that the worst-case latency for generating an attestation is 10 ms on a 2 GHz Core 2 processor, and that modifying two programs to include attestations required less than 250 SLOC each. For the verifier, they find that the amount of spam can be reduced by 92% with no false positives, that it can reduce the peak processing load seen at mail servers, that it can filter out 89% of bot-generated DDoS activity without filtering out human-generated requests, and that it can identify click-fraud activity with more than 87% accuracy without filtering out human-generated clicks.
Critique
I didn’t really think that much of this paper. One criticism is that though adding NAB to applications wouldn’t be technically difficult, as the authors explain, you would still have to make sure a lot of applications (e.g., email clients, web browsers, servers) included it, and then make sure a lot of hosts ran the versions of those applications that included NAB, because it seems it would be far less useful if not all clients used it. In their evaluation, all of the client programs ran NAB, but if that weren’t the case it would be less effective. Another criticism, or perhaps point of confusion, concerns the deferred attestations used for script-generated emails: I don’t see why attackers couldn’t leverage these to generate attestations for their own spam emails or whatever else they wanted. A further criticism is that with the simple heuristic used (guessing that an activity is human-generated if it occurs within a certain amount of time of keyboard or mouse activity), bots can still generate attestations for their own traffic by, as the authors say, harvesting human activity. This is probably still sufficient for generating large amounts of spam and the like. The authors would probably be better off using specific mouse or keyboard activity to decide whether or not to grant an attestation request, but in the paper they claim that strategy is too complex to implement.
Labels:
Balakrishnan,
botnets,
Gummadi,
Maniatis,
network security,
Ratnasamy