Tuesday, December 1, 2009

BotGraph: Large Scale Spamming Botnet Detection


Y. Zhao, Y. Xie, F. Yu, Q. Ke, Y. Yu, Y. Chen, E. Gillum, "BotGraph: Large Scale Spamming Botnet Detection," NSDI'09, (April 2009).


One line summary: This paper describes a method for detecting web account abuse attacks using EWMA and BotGraph, which creates a user-user graph to find correlations and distinguish bot-users from normal users.

Summary

This paper presents a system for detecting web account abuse attacks. These are attacks in which spammers use botnets to create many fake email accounts through free web email service providers (such as Gmail and Hotmail) from which to send spam. To detect this situation, the authors provide an algorithmic means of finding correlations among bot-users and distinguishing them from real users, as well as a mechanism for efficiently analyzing large volumes of data to reveal such correlations. The authors name their approach BotGraph. The goal of BotNet is to be able to captures spamming email accounts used by botnets.

BotGraph works by constructing a large user graph and analyzing the underlying correlations. The key observation is that bot-users share IP addresses when they log in and send emails. There are two main components of BotGraph – the first tries to detect aggressive signups and the second tries to detect the remaining stealthy bot-users based on their login activities. To detect aggressive signups the authors use a simple Exponentially Weighted Moving Average (EWMA) algorithm that detects sudden changes in signup activity. The rationale for this is that in the normal case, signups should happen infrequently at a single IP address or at least be roughly consistent over time, whereas a sudden increase in signups may indicate the IP address may be associated with a bot. To detect stealthy bot-users, the authors build a user-user graph, in which each vertex is a user and the weight of an edge between two users is based on the number of common IP addresses from which the two users logged in. This is because the authors reason that if aggressive signups are limited, each bot-user will need to log in and send more emails multiple times at different locations. This results in shared IP addresses because firstly, the number of bot-users is often much larger than the number of bots, and secondly, botnets have a high churn rate, so bot-users will be assigned to different bots over time. Since normal users also share IP addresses when dynamic IP addresses and proxies are used, the authors exclude this case by counting multiple shared IP addresses in the same autonomous system (AS) as one shared IP address. Once this graph is constructed, the authors show that bot-user groups distinguish themselves from normal user groups by forming giant components in the graph. The authors then describe a recursive algorithm for extracting the set of connected components from the graph. They prune the resultant set of components to remove normal user groups, as well as group the components by bot-user group, by using two statistics: the percentage of users who sent more than three emails per day and the percentage of users who sent out emails with a similar size.

The authors next describe how they constructed their large user-user graph from 220 GB of Hotmail login data. They use Dryad/DryadLINQ, though they suggest other programming environments for distributed, data-parallel computing might also work. They describe two methods. The first involves partitioning the data by IP address and using the well-known map and reduce operations, whereas the second involves partitioning the data by user ID and using some special optimizations. They find the second method is more suited to their application and also much much faster: 1.5 hours on 240 machines for method 2 compared to over 6 hours for method 1. However, as method 2 has increasing communication overhead it may be less scalable. The authors describe a number of optimizations as well. The authors evaluated their approach on two month-long datasets. BotGraph detected approximately 21 million bot-users in total. It was able to detect 76.84% and 85.80% respectively of known spammers with the graph approach and 85.15% and 8.15% respectively with the EWMA approach. The authors perform a false positive analysis using naming patterns and signup dates and estimate a false postive rate of 0.44%.

Critique


I liked this paper a lot. I’m not sure how much it has to teach about networking per se, not that that’s necessarily a bad thing. It was good that the authors spent some time discussing potential countermeasures attackers might take to their approach. One thing that may other students have mentioned that I’m not sure was mentioned in the paper however is that BotGraph will not detect bot-users all behind one IP address (e.g. NAT) and it is becoming increasingly common in countries like China for there to be very large numbers of machines behind a NAT, so it is something to note. Overall this paper was very good and I think it should be kept in the syllabus. I would be interested to see future work along these lines that use additional features to correlate and other methods besides graphs to detect correlations, as the authors mention.

2 comments:

  1. when there are more sign ups in your ip there is more confusion in subliming the signups log onto www.ip-details.com and better results are got...........

    ReplyDelete
  2. Hey my very first comment on your site. ,I have been reading your blog for a while and thought I would completely pop in and drop a friendly note. . It is great stuff indeed. I also wanted to ask..is there a way to subscribe to your site via email?

    Email Spam Filtering

    ReplyDelete