Posted on Saturday, February 14th, 2009

Clustering Fast Flux Networks Through Content Hashing

by Jose Nazario

I’ve spent some time recently looking at how to improve our visibility into fast flux botnets by adding more data. The discovery of such botnets usually yields a gold mine of other nefarious activity. To do so, I’m now combining the fast flux data from ATLAS with other data sources to grow its view. We’re still actively mining spam trap data to look for fast flux domain names advertised in URLs, and we score them with the system described in the paper I did with Holz last year, As the Net Churns. What I’m doing now, several times a day, is taking that domain list, looking at the IP addresses we’re seeing actively fluxing, and then looking at what else resolves to those IPs. Our passive data source is hosted at the ISC SIE, a data sharing repository. This gives us new candidates we may not have seen, and we can then add them to the ATLAS monitor.
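A minimal sketch of that pivot (known domains, to their IPs, to whatever else resolves to those IPs) might look like the following. The record format, a plain list of (domain, ip) pairs, and the function name are assumptions made for illustration; they are not the actual ATLAS or SIE interfaces.

```python
from collections import defaultdict


def expand_candidates(known_domains, pdns_records):
    """Return domains that share an IP with a known fluxing domain.

    pdns_records is an iterable of (domain, ip) pairs; this flat format
    is a hypothetical stand-in for a real passive DNS feed.
    """
    ip_to_domains = defaultdict(set)
    domain_to_ips = defaultdict(set)
    for domain, ip in pdns_records:
        ip_to_domains[ip].add(domain)
        domain_to_ips[domain].add(ip)

    known = set(known_domains)

    # Step 1: every IP observed behind a known fast flux domain.
    flux_ips = set()
    for domain in known:
        flux_ips |= domain_to_ips.get(domain, set())

    # Step 2: everything else that resolves to those IPs.
    candidates = set()
    for ip in flux_ips:
        candidates |= ip_to_domains[ip]

    # Only domains we were not already tracking are new candidates.
    return candidates - known
```

The returned set is only a list of candidates; each one would still have to be scored before being added to a monitor.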

As noted in the above paper, we tried to determine how many active fast flux botnets we were tracking through set analysis of the botnets’ member IP addresses. A full set intersection between the results for different domains indicates the same physical botnet behind the content across multiple domain names. Now I wanted to take that one step further to see if there are relationships we are missing. Another way to infer relationships is to look at the content hosted on the fast flux domains.
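The set analysis can be sketched roughly as below. This assumes that a "full set intersection" means two domains resolve to exactly the same set of bot IPs; the Jaccard ratio is added here only as an illustrative way to see partial overlap, and both function names are hypothetical.

```python
def jaccard(ips_a, ips_b):
    """Overlap between two IP sets: 1.0 means identical, 0.0 means disjoint."""
    a, b = set(ips_a), set(ips_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def group_by_ip_set(domain_ips):
    """Group domains whose observed IP sets are identical.

    domain_ips maps a domain name to an iterable of IP addresses.
    """
    groups = {}
    for domain, ips in domain_ips.items():
        groups.setdefault(frozenset(ips), []).append(domain)
    return sorted(sorted(members) for members in groups.values())
```

In practice the observed sets are samples taken at different times, so a threshold somewhat below 1.0 may be more realistic than strict equality.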

This idea isn’t new, and I recalled it being used in the paper Spamscatter: Characterizing Internet Scam Hosting Infrastructure by Anderson et al. They use a method where they get a screen shot of the domain and do image “shingle” analysis to look for the same content being pushed in different spam campaigns. I decided to take a simpler approach and just compare the HTML. The gross assumption is that the HTML will not differ significantly for the same campaign, and nor will the layout or images. So all I need is the HTML of the landing page and a way to compare it.

Initially I tried the string distance library libdistance, something I wrote a few years ago to do honeypot payload analysis. I ran into a bunch of scalability problems as well as some software bugs on larger data sets, so I wound up using a simpler approach: fuzzy hashing and the distances between hashes.

The fuzzy hashing library I chose was ssdeep by Jesse Kornblum. Basically, ssdeep takes blocks of size B of the input (binary, executable, text, it doesn’t matter) and hashes each one into a very small representation (a few characters at most), then moves on to the next block and repeats. The output is a simple set of hashes and block sizes for the input that serves as its “fingerprint”. The whole guts are explained in the paper Identifying almost identical files using context triggered piecewise hashing. The tool has been useful for us, and many others, in analyzing related executables in our malcode repositories.
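To make the piecewise idea concrete, here is a deliberately simplified toy, not ssdeep's actual algorithm. It assumes a plain windowed byte sum as the trigger, a fixed block size, and an FNV-1 hash reduced to one character per piece; the real tool uses its own rolling hash and chooses the block size from the input length.

```python
import string

B64 = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
WINDOW = 7                # bytes of context the trigger looks at (toy value)
FNV_OFFSET = 0x811C9DC5   # FNV-1 32-bit offset basis
FNV_PRIME = 0x01000193    # FNV-1 32-bit prime


def toy_piecewise_hash(data: bytes, block_size: int = 16) -> str:
    """Toy context-triggered piecewise hash: one character per piece."""
    signature = []
    piece_hash = FNV_OFFSET
    pending = False  # True while the current piece has unemitted bytes
    for i, byte in enumerate(data):
        # Fold the byte into the hash of the current piece (FNV-1 step).
        piece_hash = ((piece_hash * FNV_PRIME) ^ byte) & 0xFFFFFFFF
        pending = True
        # The trigger depends only on the last WINDOW bytes of context.
        rolling = sum(data[max(0, i - WINDOW + 1): i + 1])
        if rolling % block_size == block_size - 1:
            signature.append(B64[piece_hash % 64])
            piece_hash = FNV_OFFSET
            pending = False
    if pending:
        # Emit whatever is left after the last trigger point.
        signature.append(B64[piece_hash % 64])
    return f"{block_size}:{''.join(signature)}"
```

Because piece boundaries depend only on nearby bytes, a local edit changes the characters for the pieces it touches and leaves the rest of the signature alone, which is what makes two signatures comparable.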

Running ssdeep on the landing pages of the domain names we discovered active in a 30 minute window, based on the fast flux bots we determined in ATLAS, yielded four distinct fast flux campaigns afoot. Note that this is different from saying four distinct botnets, as one botnet could have multiple campaigns afoot (e.g. the Storm worm would host its own drive-by lures as well as pharma sites). Looking only at pairs of domains whose hashes compare with a score of 90 or greater (indicating a strong resemblance) yields the graph below.
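The thresholding and grouping step could be sketched as follows. The scoring function here is a stand-in built on difflib from the Python standard library so the example runs on its own; it is not ssdeep's comparison. Any function returning a 0 to 100 score for two fingerprints could be passed in instead. The names cluster and stand_in_score are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def stand_in_score(a, b):
    """Stand-in 0..100 similarity score; NOT ssdeep's comparison."""
    return round(SequenceMatcher(None, a, b).ratio() * 100)


def cluster(fingerprints, score=stand_in_score, threshold=90):
    """Link domains whose fingerprints score >= threshold, then group them.

    fingerprints maps a domain name to its fingerprint string.
    Returns (edges, clusters); clusters are the connected components
    containing more than one domain.
    """
    parent = {domain: domain for domain in fingerprints}

    def find(domain):
        # Union-find lookup with path halving.
        while parent[domain] != domain:
            parent[domain] = parent[parent[domain]]
            domain = parent[domain]
        return domain

    edges = []
    for a, b in combinations(sorted(fingerprints), 2):
        if score(fingerprints[a], fingerprints[b]) >= threshold:
            edges.append((a, b))
            parent[find(a)] = find(b)

    groups = {}
    for domain in fingerprints:
        groups.setdefault(find(domain), []).append(domain)
    clusters = sorted(sorted(g) for g in groups.values() if len(g) > 1)
    return edges, clusters
```

The edge list is what would be drawn as a graph, and each cluster corresponds to one candidate campaign. Every pair is compared, so the work grows quadratically with the number of domains.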

fastflux.png

Two of the groups are two domain names apiece, while two are sizable. The graph is a subset of all of the connections because it just chewed up too much time on my laptop’s CPU to do all 5000 connections. But, you get the idea.

I’m still investigating, but I think this method of analyzing botnet content – using ssdeep on the malicious website – may have some merit.
