Tuesday, December 1, 2009

BotGraph: Large Scale Spamming Botnet Detection


Y. Zhao, Y. Xie, F. Yu, Q. Ke, Y. Yu, Y. Chen, E. Gillum, "BotGraph: Large Scale Spamming Botnet Detection," NSDI'09, (April 2009).


One line summary: This paper describes a method for detecting web account abuse attacks using EWMA and BotGraph, which creates a user-user graph to find correlations and distinguish bot-users from normal users.

Summary

This paper presents a system for detecting web account abuse attacks. These are attacks in which spammers use botnets to create many fake email accounts through free web email service providers (such as Gmail and Hotmail) from which to send spam. To detect this situation, the authors provide an algorithmic means of finding correlations among bot-users and distinguishing them from real users, as well as a mechanism for efficiently analyzing large volumes of data to reveal such correlations. The authors name their approach BotGraph. The goal of BotNet is to be able to captures spamming email accounts used by botnets.

BotGraph works by constructing a large user graph and analyzing the underlying correlations. The key observation is that bot-users share IP addresses when they log in and send emails. There are two main components of BotGraph – the first tries to detect aggressive signups and the second tries to detect the remaining stealthy bot-users based on their login activities. To detect aggressive signups the authors use a simple Exponentially Weighted Moving Average (EWMA) algorithm that detects sudden changes in signup activity. The rationale for this is that in the normal case, signups should happen infrequently at a single IP address or at least be roughly consistent over time, whereas a sudden increase in signups may indicate the IP address may be associated with a bot. To detect stealthy bot-users, the authors build a user-user graph, in which each vertex is a user and the weight of an edge between two users is based on the number of common IP addresses from which the two users logged in. This is because the authors reason that if aggressive signups are limited, each bot-user will need to log in and send more emails multiple times at different locations. This results in shared IP addresses because firstly, the number of bot-users is often much larger than the number of bots, and secondly, botnets have a high churn rate, so bot-users will be assigned to different bots over time. Since normal users also share IP addresses when dynamic IP addresses and proxies are used, the authors exclude this case by counting multiple shared IP addresses in the same autonomous system (AS) as one shared IP address. Once this graph is constructed, the authors show that bot-user groups distinguish themselves from normal user groups by forming giant components in the graph. The authors then describe a recursive algorithm for extracting the set of connected components from the graph. They prune the resultant set of components to remove normal user groups, as well as group the components by bot-user group, by using two statistics: the percentage of users who sent more than three emails per day and the percentage of users who sent out emails with a similar size.

The authors next describe how they constructed their large user-user graph from 220 GB of Hotmail login data. They use Dryad/DryadLINQ, though they suggest other programming environments for distributed, data-parallel computing might also work. They describe two methods. The first involves partitioning the data by IP address and using the well-known map and reduce operations, whereas the second involves partitioning the data by user ID and using some special optimizations. They find the second method is more suited to their application and also much much faster: 1.5 hours on 240 machines for method 2 compared to over 6 hours for method 1. However, as method 2 has increasing communication overhead it may be less scalable. The authors describe a number of optimizations as well. The authors evaluated their approach on two month-long datasets. BotGraph detected approximately 21 million bot-users in total. It was able to detect 76.84% and 85.80% respectively of known spammers with the graph approach and 85.15% and 8.15% respectively with the EWMA approach. The authors perform a false positive analysis using naming patterns and signup dates and estimate a false postive rate of 0.44%.

Critique


I liked this paper a lot. I’m not sure how much it has to teach about networking per se, not that that’s necessarily a bad thing. It was good that the authors spent some time discussing potential countermeasures attackers might take to their approach. One thing that may other students have mentioned that I’m not sure was mentioned in the paper however is that BotGraph will not detect bot-users all behind one IP address (e.g. NAT) and it is becoming increasingly common in countries like China for there to be very large numbers of machines behind a NAT, so it is something to note. Overall this paper was very good and I think it should be kept in the syllabus. I would be interested to see future work along these lines that use additional features to correlate and other methods besides graphs to detect correlations, as the authors mention.

Not-a-Bot: Improving Service Availability in the Face of Botnet Attacks


R. Gummadi, H. Balakrishnan, P. Maniatis, S. Ratnasamy, "Not-a-Bot: Improving Service Availability in the Face of Botnet Attacks," NSDI'09, (April 2009).


One line summary: This paper presents a system called Not-A-Bot (NAB) that distinguishes human-generated traffic from bot-generated traffic by attesting human-generated traffic at the clients that is then verified at the servers in order to mitigate such problems as spam, DDoS attacks, and click-fraud.

Summary

This paper presents a system called Not-A-Bot (NAB) for identifying human-generated web traffic as distinguished from bot-generated web traffic. The motivation for this is that bots are responsible for a large amount of spam, distributed denial-of-service (DDoS) attacks, and click fraud, so being able to determine if an email or request is human-generated as opposed to bot-generated would help mitigate these problems. NAB consists of an attester and a verifier. When an attestation is requested by an application for a particular request or email, the attester determines if that request or email was indeed generated by a human, and if so, it attaches a signed statement to it that verifies that it was sent by a human. The verifier runs on the server and checks whether the request or email has an attestation and if it is valid. If it does, the application may choose, for example, to prioritize the request, or increase the score of an email so it is more likely to get through a spam filter. Otherwise it may consider the request or email as more likely to have come from a bot.

NAB assumes that applications and the OS are untrusted and so relies on a Trusted Platform Model (TPM) to load the attester code to ensure that it is trusted. As mentioned, the NAB attester decides to grant an attestation if it determines that a human generated the associated request or email. It does this by guessing, using as a heuristic how recently before an attestation request the last keyboard or mouse activity was observed. If the attestation was requested within a certain amount of time since the last keyboard or mouse activity, the attester grants the attestation. An attestation is non-transferrable and is bound to the content of the request it is generated for. An attestation is over the entire content of the application-specific payload and is responder-specific, and where appropriate, challenger-specific. The mechanism for attesting web requests and email in the common case is straightforward. The only complicated case is script-generated email, which requires deferred attestations. The verifier is straightforward and as metioned, implements an application-specific policy. The authors provide several example policies for spam, DDoS, and click-fraud mitigation.

The authors next describe their evaluation of NAB. They evaluate the attester with respect to TCB size, CPU requirements, and application changes. They evaluate the verifier with repect to the extent to which it mitigates attacks and the rate at which it can verify attestations. They find that the attester is 500 SLOC out of 30,000 for the TCB total, that the worst-case latency for generating an attestation is 10 ms on a 2 GHz Core 2 processor, and that modifications to two programs to include attestations required less than 250 SLOC each. For the verifier, they find that the amount of spam can be reduced by 92% with no false positives, that it can reduce the peak processing load seen at mail servers, that it can filter out 89% of bot-generated DDoS activity while not filtering out human-generated requests, and that it can identify click-fraud activity with more than 87% accuracy without filtering out human-generated clicks.

Critique

I didn’t really think that much of this paper. One criticism is that though adding NAB to applications wouldn’t be technically difficult, as the authors explain, you would still have to make sure a lot of applications (i.e. email clients, web browsers, servers, etc.) did include it and then make sure a lot of hosts ran the versions of the applications that included NAB, because it seems it wouldn’t be nearly as useful if not all the clients used it. In their evaluation, all of the client programs ran NAB, but if that weren’t the case it would be less effective. Another criticism or perhaps point of confusion is with regard to their deferred attestations, which are supposed to be used for script-generated emails. I don’t see why attackers couldn’t leverage this to generate attestations for their own spam emails or whatever else they wanted. Another criticism is that with their simple heuristic (guessing that an activity is human-generated if it is within a certain amount of time of keyboard or mouse activity), bots can still generate attestations for its own traffic by, as the authors say, harvesting human activity. This is probably still sufficient for generating large amounts of spam and such. They would probably be better off using the strategy of using specific mouse or keyboard activity to decide whether or not to grant an attestation request, but in the paper they claim that strategy is too complex to implement.

Skilled in the Art of Being Idle: Reducing Energy Waste in Networked Systems


S. Nedevschi, J. Chandrashekar, J. Liu, B. Nordman, S. Ratnasamy, N. Taft, "Skilled in the Art of Being Idle: Reducing Energy Waste in Networked Systems," NSDI'09, (April 2009).


One line summary: This paper examines the value of using proxies to handle idle-time traffic for sleeping hosts with the goal of reducing wasted energy consumption in networked end-systems; it does this by analyzing and classifying traffic to see what can be ignored or automatically handled and by examining several potential proxy designs.

Summary

This paper examines the problem of reducing wasted energy consumption in powered-on but idle networked end-systems such as desktops in home and office environments. It discusses various solutions, and in particular examines the value of using a proxy to handle idle-time traffic on behalf of sleeping hosts. The main problem is that while vendors have built in hardware support for sleep (S-states) to reduce power consumption while idle, surveys of office buildings indicate that the vast majority of machines are fully on while idle, instead of taking advantage of these sleep states. One reason for this is that a sleeping machine loses its network presence i.e. it cannot send or receive network messages, and another reason is that users or administers occasionally want to be able to schedule tasks to run during these idle times. Both of these reasons cause users to not make use of sleep states. This paper thus tries to answer the following questions: (1) Is the problem worth solving? (2) What network traffic do idle machines see? (3) What is the design space for a proxy? (4) What implications does proxying have for future protocol and system design.

To begin to answer these questions, the authors first collect network and user-level activity traces from 250 client machines belonging to Intel employees and attempt to classify the traffic. They first classify each packet as being either broadcast, multicast, or unicast, then whether it is incoming or outgoing. They find outgoing traffic tends to be dominated by unicast while incoming traffic is made of significant proportions of all three. They estimate the potential for sleep in four scenarios: (a) ignore broadcast and wake for the rest, (b) ignore multicast and wake for the rest, (c) ignore both broadcast and multicast, and (d) wake for all packets. They find that broadcast and multicast are mainly responsible for reducing the amount of potential sleep time and that doing away with just one of broadcast or multicast is not effective. The authors next classify the traffic based on protocol type, and evaluate each protocol on two metrics: total volume of traffic, and something the authors call half-sleep time. A high half-sleep time means that protocol’s packets could be handled by waking the machine up, whereas a low half-sleep time means that to achieve useful amounts of potential sleep time the proxy would have to handle them. They find that the bulk of broadcast traffic is for address resolution and service discovery, and a lot of other broadcast traffic is from router-specific protocols. Broadcast traffic allows for very little sleep in the office, but significantly more in the home. A proxy could easily handle most of these broadcast protocols. Multicast traffic is mostly caused by router protocols and is often absent or extremely reduced in homes as compared to offices. All router traffic is ignorable. From their analysis of unicast traffic, they speculate that it might be possible to ignore or eliminate much of unicast traffic. Finally, they classify traffic into one of the following three categories regarding the need to proxy that traffic: don’t wake, don’t ignore, and policy-dependent, and into one of the following three categories regarding the difficulty in proxying that traffic: ignorable/drop, handle via mechanical responses, and require specialized processing.

Next, they present four proxy designs. The first ignores all traffic classified as ignorable and wakes the host for the rest. The second ignores all traffic classified as ignorable, responds to traffic listed as capable of being handled by mechanical responses, and wakes the machine for the rest. The third does the same as the second except that it wakes up for traffic belonging to a certain set and drops any other incoming traffic. Lastly, the fourth does the same as the third except that it also wakes up for a certain set of scheduled tasks. They find that the simplest proxy, proxy 1, is inadequate for office environments and nearly inadequate for home environments, but that proxy 3 achieves a good amount of sleep time in all scenarios – more than 70% of the idle time. They also find that the effectiveness of proxy 2 depends a great deal on the environment. Given this, the best trade-off between complexity of design and power savings depends on the environment. Also, the authors also note that since scheduled wake-ups are infrequent, the impact on sleep is minimal, so proxy 4 performs practically the same as proxy 3. Finally, they offer a basic proxy architecture that could serve as a framework to building the different proxies designs they considered, and to demonstrate the feasibility of building a proxy, they implemented a simple proxy prototype in Click. The authors end by speculating about how systems could be redesigned to make them more power-aware, thereby simplifying the implementation of proxies, making proxies more effective, or eliminating the need for proxies altogether.

Critique

One thing I really liked about this paper is how the authors analyzed and classified network traffic before considering proxy design. In retrospect it seems absolutely necessary to guiding the design of a proxy. It was also just informative in general to look at the traffic traces from the perspective of which packets are ignorable, which can be handled automatically, and which are actually “important”. I also liked how they examined several points in the proxy design space and compared them. Overall I thought this was a very thoughtful and well organized paper and I think it should stay in the syllabus.

Cutting the Electric Bill for Internet-Scale Systems


A. Qureshi, R. Weber, H. Balakrishnan, J. Guttag, B. Maggs, "Cutting the Electric Bill for Internet-Scale Systems," ACM SIGCOMM Conference, (August 2009).


One line summary: In this paper the authors suggest a new method for reducing the energy costs of running large Internet-scale systems by rerouting traffic on an hourly basis to data centers in regions where the price of energy is cheaper.

Summary

This paper examines a new method for reducing energy costs of running large Internet-scale systems. This method is based on the observations (1) that electricity prices vary on an hourly basis and are not well correlated across different geographic locations and (2) that large distributed systems already incorporate request routing and replication. Considering this, the problem the authors want to solve is: given a set of datacenters or server clusters spread out geographically, map client requests to clusters such that the total electricity cost in dollars of the system is minimized, possibly subject to constraints such as staying under a maximum response time.

The authors first do an empirical market analysis. They verify that prices are not well correlated at different locations, and also note that the price differentials between any two locations generally vary hourly and somewhat unpredictably, suggesting that a pre-determined assignment is not optimal. They examine the overall differential distribution and conclude that there does exist the opportunity to exploit these differentials for savings. To evaluate their method, they use traffic data from Akamai to derive a distribution of client activity and cluster sizes and locations. They then use a simple model to map prices and cluster-traffic allocations to energy prices. They handle the issue of bandwidth costs (i.e. changing assignments of clients to clusters could increase bandwidth costs) by estimating the 95th percentile from the data and constrain their reassignments such that this 95th percentile is not increased for any location. To estimate network performance in their model, they use the geographic distance between client and server. They model the energy consumption of a cluster to be roughly proportional to its utilization. The authors assume the system is fully replicated and that the optimization for cost happens every hour. They then use a simple discrete time simulator that steps through the Akamai usage statistics. At each time step a routing module with a global view of the network allocates clusters, and from this they model each cluster’s energy consumption, and use observed hourly market prices to calculate expenditures.

The authors are able to show that existing systems can reduce energy cost by 2% without significant increase in bandwidth or reduction in client performance. They also find that savings rapidly increase with energy elasticity, which is the degree to which energy consumed by a cluster depends on the load placed on it. Lastly, they find that allowing increased distances between the client and server leads to increased savings.

Critique

This paper was somewhat thought-provoking and brought up some interesting issues. However, what I didn’t really enjoy about this paper is that their model is forced to make so many simplifying assumptions about so many things, such as network performance, cluster power usage, and bandwidth costs, to name a few, that I’m not sure how valid their actual findings turn out to be or how much credence to give them. Along this vein, it would be interesting to know to what extent their results depend on the traffic data they used. Another assumption I didn’t really like is that they assume complete replication in the system in question, so that any request can be routed anywhere. I personally can’t guess how true that is in practice but my inclination is that it is not necessarily true for all Internet-scale systems, though it may be true for many. Furthermore, while in theory the basic idea seems like it would work, the authors do hint at a few complicating factors (that are not necessarily purely computer science problems), and it’s hard to know what others might arise in actually trying to implement something like this. For instance, the authors assume that rerouting traffic in this way would not increase bandwidth prices, but that may not be true, and companies are averse to increasing their bandwidth costs. As another example, in reality, renegotiating existing contracts between companies and cluster operators work if the company rents space in a co-located and between companies and the utility companies to take into account this new method may be difficult. Given all the assumptions the authors are forced to make and other complicating details that they may or may not be aware of, it’s hard to judge whether or not something like this would be feasible in reality. And, if everybody did this, it does make you wonder how that would change things, since there is likely feedback between the number of requests routed to an area and the price of energy there. So in summary, although the authors didn’t really have a choice but to try to make good assumptions or estimates of the various unknowns in their model, I couldn’t help but be frustrated by this aspect of the paper. Regardless, I was still impressed by their model and simulation and found it interesting.

A Reliable Multicast Framework for Light-weight Sessions and Application Level Framing


S. Floyd, V. Jacobson, S. McCanne, C-G Liu, L. Zhang, "A Reliable Multicast Framework for Light-weight Sessions and Application Level Framing," ACM SIGCOMM Conference, (August 1995).


One line summary: This paper describes Scalable Reliable Multicast, a minimal framework from which applications designers can build multicast functionality suitable to their application; it also describes some analytical results pertaining to request/repair algorithms in multicast, along with several simulations.

Summary

This paper describes a reliable multicast framework called Scalable Reliable Multicast (SRM). SRM builds off of the principles of Application Level Framing (ALF), which “explicitly includes an application’s semantics in the design of that application’s protocol’, and Light-Weight Sessions (LWS), which centers on a “light-weight rendezvous mechanism based on the IP multicast distribution model” with receiver-based adaption. Thus, the SRM is designed to meet only the minimal definition of reliable multicast so as to not force on applications unnecessary overhead for functionality that they do not need. It is also designed to adhere to the core principles of TCP/IP in that it only requires the basic IP delivery model and dynamically adjusts control parameters based on observed performance, much like TCP. The authors argue that receiver-based reliability is more appropriate for SRM than sender-based reliability because the fate-sharing-based coupling of unicast does not generalize well to multicast (due to such factors as the ACK implosion effect and the need for the sender to maintain state about each of the receivers in the receiver set), and because the vocabulary of unicast conventions migrates poorly to multicast. SRM attempts to serve as a skeleton common to scalable, reliable multicast applications that supply the framework with details such as a namespace, policies and mechanisms for apportioning bandwidth, etc.

The authors go on to describe a network conferencing tool that provides a distributed whiteboard called wb, which builds off of SRM. In wb, users are members, each of whom has a globally unique identifier which is used to label pages they create and edit. Wb assumes that all data has a unique name, that the name always refers to the same data, that source-IDs are persistant, that IP multicast datagram delivery is available, and that all participants join the same multicast group. The paper describes wb’s instantiation of SRM. It then describes more generally request/repair algorithms for several simple topologies, including chains, stars, and bounded degree trees. In the context of these algorithms, it defines deterministic suppression of duplicate messages, and probabilistic suppression. They find, via simulations, that their algorithm that uses fixed timer parameters performs well in random or bounded degree trees when every node in the tree is a member of the multicast group. They use this to motivate the development of an adaptive algorithm for request/repair that adjust timer parameters as a function of delay as well as of duplicate requests or repairs in recent recovery exchanges. They demonstrate trade-offs between low delay and low number of duplicates.

Critique

I didn’t really like reading this paper. I don’t really like simulations, for one. Also, that the authors chose what they admit to be arbitrary values for some of their parameters in their algorithm annoys me. Nevertheless, however simplified the topologies in their analysis section, I would probably agree that it is good to have some mathematical results for many problems, in this case those pertaining to request/repair algorithms in multicast, in order to provide some intuition to people implementing actual systems and to help guide their choices. In that vein, I would be interested in knowing more about any work that really directly builds off of or uses their idea of a framework for scalable, reliable multicast and learn in exactly what way this particular paper was useful. On an unrelated note, it also very seriously annoyed me that the graphs in this paper lacked axis or data labels of any kind.

Scalable Application Layer Multicast


S. Banerjee, B. Bhattacharjee, C. Kommareddy, "Scalable Application Layer Multicast," ACM SIGCOMM Conference, (August 2002).


One line summary: This paper describes NICE, a new application-layer multicast protocol intended primarily for low-bandwidth, data streaming applications with large receiver sets.

Summary

This paper describes a new application-layer multicast protocol for low-bandwidth, data streaming applications with large receiver called NICE. The problem with network-layer multicast, as opposed to application-layer multicast, is that it has not been widely adopted by most ISPs, so much of the Internet is incapable of native multicast. In application-layer multicast, network infrastructure is left unchanged, while the multicast forwarding functionality is implemented at end-hosts. Application-layer multicast protocols are less efficient than network-layer multicast protocols because they must send the same packet over the same link. Thus, two good metrics by which to judge application-layer multicast protocols are stress and stretch. Stress is defined per link and is the count of the number of identical packets sent by a protocol over each underlying link in the network. Stretch is defined per multicast group member and is the ratio of the path length from the source to the member along the overlay used by the multicast protocol to the length of the direct unicast path.

The way the NICE protocol works is by logically arranging the multicast group members into a hierarchy. Each host in the hierarchy belongs to a layer, and the hosts in each layer are partitioned into a set of clusters, where each cluster has a cluster leader that is the graph-theoretic center the cluster. The size of each cluster is bounded between k and 3k-1 and clusters consist of hosts that are close to each other. Hosts are organized into layers according to the rule that all hosts are part of the lowest layer, L(0), then all the cluster leaders of all the clusters in layer L(i) join layer L(i+1). A host is allowed to belong to only a single cluster at any layer. If a host is present in layer L(i) it must be a cluster leader in each of layers L(0), ..., L(i-1). If a host is not present in a layer L(i) it cannot be present in any layer L(j) where j > i. Lastly, there are allowed to be at most logkN layers, and the highest layer has only a single member. The supercluster of any host x is defined as the cluster in the next highest layer to which x’s cluster leader y belongs.

The hierarchy is then used to define different overlay structures for control messages and data delivery paths. In NICE, the control topology is a clique and the data topology is a star, but it is possible to choose other topologies. The NICE protocol assumes the existence of a special host that all hosts know of called the Rendezvous Point (RP), and the RP is always the leader in the cluster at the highest layer of the hierarchy. The NICE protocol consists of three main components: (1) initial cluster assignment when a new host joins, (2) periodic cluster maintenance and refinement, and (3) recovery from leader failures. In initial cluster assignment, the joining hosts contacts all the members in the highest layer to identify the member closest to it, then contacts the lower-level cluster of that closest member to find the closest member there, and so on until the hosts finds its L(0) cluster. This process involves O(k log N) query-response pairs. In this process, some the center of some clusters may change, so a new cluster leader must be chosen. To aid in cluster maintenance and refinement, each member of a cluster sends periodic heartbeat messages to the other members of its cluster. There is also a method for cluster splitting and merging when to keep clusters within the size bound, as well as a method for refining attachments in case a host is not able to locate the closest cluster in a layer when joining. Leader failures are dealt with using a remove message if the leader is able to do a graceful-leave, or by detecting failures using the heartbeat messages and selecting a new leader.

To evaluate NICE, the authors first simulated their protocol and compared it to three other schemes: multi-unicast, native IP multicast (CBT), and the Narada application-layer multicast protocol. Their main findings are that NICE data paths have stretch comparable to Narada, the stress on links and routers is lower in NICE, especially as the multicast group size increases, the failure recovery of both schemes are comparable, and this performance is achievable with orders of magnitude lower control overhead for group sizes greater than 32. The authors also implemented the NICE protocol and tested their implementation in a wide-area testbed over a period of a month. They observe a maximum packet loss of 1% as members join and leave the group at random and an average control overhead of less than 1 Kbps for groups of size 100.

Critique

I liked this paper and it was informative to read about application-layer multicast as opposed to network-layer multicast. Even though the performance at the application-layer is, from what I can see, pretty much unavoidably worse than at the network-layer, if the authors are right that ISPs do not tend to support native multicast then that is a good enough reason to use the application-layer multicast despite the performance hit. I thought this paper did a nice job of describing how the NICE hierarchy is built but the authors could have done a better job explaining the control path and data paths used. Overall, as it appears that there are many application-layer multicast protocols that have been developed, this one seemed like as fine a choice as any to read about, though it’s hard to say without having read most of the other application-layer multicast protocol papers. It might be nice to read a survey paper like we did for DHT-based lookup protocols that summarizes the different kinds of multicast protocols and their differences.