Posted on Friday, October 6th, 2006

Static Code Analysis Using Google Code Search

by Dug Song

Today’s guest-blogger post is from Aaron Campbell, long-time Arbor hacker and one of Canada’s finest:

Lint first appeared (outside of Bell Labs) in the seventh version (V7) of the UNIX operating system in 1979. 27 years later, you’d think static code analysis would be dead. But nothing could be further from the truth. This much I’ve proven to myself today after toying with Google’s newest gift to the world, Google Labs Code Search.

Now, this isn’t exactly a new concept. Koders launched last year and claims its database contains 225,816,744 lines of searchable open source code. Not to be outdone, The Goog has seriously one-upped the competition by providing regular expression matching. And not a hacked-up, watered-down subset of regexp, but full POSIX extended regular expression syntax, as well as select Perl extensions. Kid -> candy store.

Ok, I admit it. Recalling a previous debate over profanity in the Linux kernel source, my inaugural search term for Google’s Code Search was a naughty word. Much to my amusement, the first page of results contained colourful language not only in code comments, but also variable and function names. Potty mouths, the whole lot of us.

Anyhow, as an OpenBSD developer (on extended hiatus though, it would seem, ahem), and having worked in the industry as a vulnerability researcher, I’ve come to know a thing or two about code correctness. The smallest of errors, I’ve learned, can bite you in the biggest of ways. For example,

        if ((foo == bar) && (baz == qux));
                party_on();

Looks innocuous enough, until you notice the superfluous semicolon left dangling at the end of the first line, prematurely terminating the if () clause. The C compiler, being blissfully unconcerned with the extra whitespace in the following line, is no help to me. Quite the idiot I am, letting the party go on with foo not equalling bar and/or baz not equalling qux. Thank you Python for not allowing this to happen.
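To make the failure concrete, here is a minimal sketch with hypothetical helper names (party_runs_buggy and party_runs_fixed are not from any real project). The stray semicolon becomes the entire body of the if, so the indented statement runs unconditionally. Recent GCC and Clang releases can flag the empty body (for example via -Wempty-body), though that depends on the warning flags in use.

```c
#include <stdbool.h>

/* Hypothetical helper: reports whether the "party" statement ran. */
bool party_runs_buggy(int foo, int bar, int baz, int qux)
{
    bool partied = false;

    if ((foo == bar) && (baz == qux));  /* stray ';' is the whole if-body */
        partied = true;                 /* not guarded: always executes */

    return partied;
}

/* Same logic with the semicolon removed. */
bool party_runs_fixed(int foo, int bar, int baz, int qux)
{
    bool partied = false;

    if ((foo == bar) && (baz == qux))
        partied = true;

    return partied;
}
```

Calling party_runs_buggy(1, 2, 3, 4) reports that the party happened even though neither comparison holds, while party_runs_fixed(1, 2, 3, 4) does not.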

By now you know where I’m headed with this. We can use Google’s new tool to expose bogus, yet syntactically correct lines of source code. At the risk of boring all you fashion-forward programmers out there, I’m going to leave Ruby at the door and stick with C code analysis, for now. But let’s keep it interesting and start this exercise with a class of bugs that many of our readers have likely never encountered.

Go to http://www.google.com/codesearch and click Advanced Code Search. Select “Case-sensitive search”, and enter the following regular expression:

flags\ *&&\ *[A-Z_]+

See what is happening here? The query is not perfect, but some human filtering will quickly weed out the false positives. For the buggy lines of code shown in the results, the author intended to do a bitwise AND to test his flags variable for a bit constant, but has actually fat-fingered the keyboard and put a logical AND in its place. Bad. After combing through the results, I came across one of these bugs in OpenSSL. The following diff was sent to the OpenSSL team, and has since been committed to the 0.9.8 source tree:


--- crypto/x509v3/pcy_tree.c.orig       Thu Oct 5 12:20:10 2006
+++ crypto/x509v3/pcy_tree.c    Thu Oct 5 12:20:22 2006
@@ -197,7 +197,7 @@
                        /* Any matching allowed if certificate is self
                         * issued and not the last in the chain.
                         */
-                       if (!(x->ex_flags && EXFLAG_SS) || (i == 0))
+                       if (!(x->ex_flags & EXFLAG_SS) || (i == 0))
                                level->flags |= X509_V_FLAG_INHIBIT_ANY;
                        }
                else

Of course, the same style of bug may manifest as a bitwise OR vs logical OR botch-up. As well, the string “flags” as a variable name was hard-coded into this example query only for the purpose of clarity in demonstration; the same mistake could be applied to a variable of any name, obviously.
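The difference between the two operators is easy to demonstrate. The sketch below uses a made-up constant, FLAG_SELF_SIGNED, and made-up function names; it is not the OpenSSL code. With a logical AND, the test is true whenever flags is non-zero at all, because the constant itself is always non-zero.

```c
/* Hypothetical bit constant, chosen only for illustration. */
#define FLAG_SELF_SIGNED 0x2000UL

/* Buggy: logical AND. True for ANY non-zero flags value. */
int has_flag_buggy(unsigned long flags)
{
    return flags && FLAG_SELF_SIGNED;
}

/* Intended: bitwise AND. True only when that specific bit is set. */
int has_flag_fixed(unsigned long flags)
{
    return (flags & FLAG_SELF_SIGNED) != 0;
}
```

For flags equal to 0x1, the buggy version claims the bit is set and the fixed version does not. The two only agree when flags is zero or when the bit really is present.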

As I write this, I’m sure hundreds of bored teenagers are plugging away with queries like “strcpy”, “sprintf”, or other unsafe string functions. Not to say such a method will uncover 0 bugs, but it would have been more useful back in 1995. Finding flaws in the most popular software will require a little more creativity. Take the following regexp, for example:

\[sizeof\(.*\)\]\ *=\ *'?\\?0'?;$

This will reveal stuff like:

buf[sizeof(buf)] = '\0';


This is almost certainly wrong. Variable buf will be declared as something like “char buf[1024]”, therefore sizeof(buf) is 1024. But the valid indices run from 0 to 1023, so buf[1024] = '\0' will overwrite one byte beyond buf with a null byte. Off-by-one heaven.
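For contrast, a minimal sketch of the usual correct idiom: the last valid index is the size minus one. The helper name copy_terminated is an assumption made for this example, not something taken from the search results.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy src into dst, truncating if needed, and always
 * terminate inside the buffer. dst_size is the full size of dst in bytes.
 */
void copy_terminated(char *dst, size_t dst_size, const char *src)
{
    if (dst_size == 0)
        return;                      /* no room even for the terminator */

    strncpy(dst, src, dst_size - 1); /* leave the last byte untouched */
    dst[dst_size - 1] = '\0';        /* last valid index, not dst_size */
}
```

With a declaration like char buf[4], calling copy_terminated(buf, sizeof(buf), "abcdefgh") leaves "abc" in buf and writes nothing past its end.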

Format string bugs anyone?

^[\ \t]*printf\(getenv

Bad errno checking (assignment operator instead of equality operator):

"if (errno = E"
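A short sketch of why this one hurts, using hypothetical function names and EINTR as an arbitrary example constant. The assignment form is always true, because errno constants are non-zero, and it also clobbers whatever error value was there before.

```c
#include <errno.h>

/* Buggy: assigns EINTR to errno, then tests the (non-zero) result. */
int was_interrupted_buggy(void)
{
    if (errno = EINTR)
        return 1;
    return 0;
}

/* Intended: compares errno against EINTR without modifying it. */
int was_interrupted_fixed(void)
{
    if (errno == EINTR)
        return 1;
    return 0;
}
```

After was_interrupted_buggy() returns, errno holds EINTR regardless of what actually happened, so later error handling is misled as well.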

Back to a simpler, almost laughable example:

"<= 65553"

No, I didn’t typo USHRT_MAX. But someone else has. Try some more of your favourite power-of-2-minus-1 and you’ll have yourself some juicy 0 day BUGTRAQ fodder in no time. I feel I have to say it again. The tiniest of errors can have the most unintended of effects.
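One way to sidestep the transposed digits is to let the headers supply the limit. The sketch below assumes a platform where unsigned short is 16 bits, so USHRT_MAX is 65535; the function names are invented for illustration.

```c
#include <limits.h>

/* Buggy: 65553 is a typo for 65535, so 65536..65553 slip through. */
int fits_ushort_buggy(long value)
{
    return value >= 0 && value <= 65553;
}

/* Safer: use the named limit from <limits.h> instead of a literal. */
int fits_ushort_fixed(long value)
{
    return value >= 0 && value <= USHRT_MAX;
}

/* What happens to an out-of-range value once it is stored. */
unsigned short store_as_ushort(long value)
{
    return (unsigned short)value;   /* reduced modulo USHRT_MAX + 1 */
}
```

Under the 16-bit assumption, 65540 passes the buggy check and then quietly becomes 4 when stored, which is exactly the kind of tiny error with outsized effects.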

But why stick to decimal? How about:

0xfffffff[^0-9a-f]

Get it? Note the count of ‘f’ characters, just 7, not 8. It’s hard to visually distinguish 0xffffffff from 0xfffffff. This is far from an exact science; most of the hits from this query, I suspect, will not identify a bug. But someone has messed this up, somewhere in the 200 search results, guaranteed.
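If the eye can't tell the two constants apart, a quick bit count can. This is only an illustrative sketch; count_set_bits is a hypothetical helper.

```c
/* Count the 1 bits in v by testing and shifting out the low bit. */
int count_set_bits(unsigned long v)
{
    int n = 0;

    while (v != 0) {
        n += (int)(v & 1UL);
        v >>= 1;
    }
    return n;
}
```

Seven f digits give 28 set bits (0xfffffff is 268435455), while eight give the full 32 (0xffffffff is 4294967295), so a mask written with seven leaves the top four bits clear.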

Check for nonsensical misuse of an API. For example,

getopt\ *\(argc,\ *argv,\ *\"[^\"]*;

According to the man page, a getopt(3) optstring may contain the following elements: individual characters, characters followed by a colon, and characters followed by two colons. In this sample query, we are looking for cases where a colon was mistyped as a semicolon. Four results show up as of this writing. getopt(3) is supposed to make command line parsing easy, but clearly some command-line options go completely untested.
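The effect of the typo can be sketched with a small harness. Everything here is illustrative: the function name, the fake command line, and the option letters are assumptions, and resetting optind to 1 to rescan is assumed to work as it does with glibc (BSD-derived systems also want optreset set). With "o;" in place of "o:", the -o option no longer consumes its argument.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical harness: parse a fixed command line, "prog -o out.txt",
 * with the given optstring. Returns 1 if -o was seen WITH an argument.
 */
int output_option_has_arg(const char *optstring)
{
    char *argv[] = { "prog", "-o", "out.txt", NULL };
    int argc = 3;
    int ch;
    int found = 0;

    optind = 1;      /* restart scanning (assumption: sufficient here) */
    opterr = 0;      /* keep getopt quiet about unknown options */
    optarg = NULL;   /* clear any value left over from an earlier scan */

    while ((ch = getopt(argc, argv, optstring)) != -1) {
        if (ch == 'o' && optarg != NULL)
            found = 1;
    }
    return found;
}
```

With the intended optstring "vo:", the function sees out.txt as the argument to -o. With the mistyped "vo;", -o is parsed as a bare flag and out.txt is left behind as an ordinary operand.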

Based on this research, I’ve filed a few bug reports to various open source projects on some flaws I’ve found over the past 24 hours using these techniques. However, as it turns out, these search queries have been turning up far too many bugs for this to be a one-man effort. Some of these bugs are harmless. Some of them are bound to be security holes. My hope is that I’ve provided enough ideas and examples that our readership can join in.

To wrap it up, what makes the Google tool so powerful is the instant search response. I’m still scratching my head about how they managed to pull it off. Any one of us could download thousands of open source packages, untar them, and run find/grep with the regular expressions I’ve shown here, but with far from the immediate gratification that Google Code Search supplies. Now if only they’d add multi-line matching…
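For anyone who does want to go the local route, the matching step can be sketched with the POSIX regex API. This is a rough illustration, not a replacement for grep: line_matches is a hypothetical helper, and the pattern is written with plain spaces instead of the escaped spaces used in the queries above.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 if line matches the POSIX extended
 * regular expression in pattern, 0 if not, -1 if the pattern is invalid.
 * Matching is case-sensitive, like the "Case-sensitive search" option.
 */
int line_matches(const char *pattern, const char *line)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, line, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

Feeding it the pattern "flags *&& *[A-Z_]+" flags the original OpenSSL line and stays silent on the corrected one. Wrapping this in a loop over files and lines is left out to keep the sketch short.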

Happy bug hunting.

14 Responses



Comment Post by: SecuriTeam Blogs » More fun with Google Code Search! — October 6th, 2006 @ 11:29 pm EST

[...] From the secure coding mailing list: Robert C. Seacord points to the arbor blog, which discusses static analysis using this service: http://asert.arbornetworks.com/2006/10/static-code-analysis-using-google-code-search/ [...]

Comment Post by: musc@ - Daniele Muscetta’s Weblog » Blog Archive » Google has pissed me off this week! — October 7th, 2006 @ 3:14 am EST

[...] Now I pretty much liked GMail and Google in general. But this time they REALLY pissed me off! I will tell you that I am not a google-hater even if I work for a competing company. Of course not everything that Google does is wonderful, but some of their services are really cool and useful and I have never denied to say they rocked when I felt they did. In general, people seem to love them, and their stock value shows it (with the launch of “Code Search” this week they made a lot of people scream “how cool is this” so that they got back from just under 400 dollars to 417!). But that’s not the issue. That is cool, that works. It’s ok they make money if they make cool tools. It’s fine for me. [...]

Comment Post by: Derek Jones — October 7th, 2006 @ 7:23 am EST

Your sizeof example need not be specific to assignment. Unfortunately Google does not seem to support the ( ) \n form of matching (ie using the result of a previous match later on in the match). The following only returns 50 matches, so they can be checked manually (the [:blank:] pattern does not appear to work).

[_a-zA-Z0-9]+\ *\[\ *sizeof\ +[_a-zA-Z0-9]+\ *\]

Comment Post by: Google your way to new vulnerabilities, exposures and fun… « Observations of a digitally enlightened mind — October 8th, 2006 @ 1:10 am EST

[...] Google recently introduced “Google code search” providing static code analysis, including full regex, for any publicly available source code - trust me, hilarity will quickly ensue.  Aaron Campbell from Arbor Networks has a good blog posting on the topic, and Gadi “Botslayer” Evron provides some links describing some of the fun folks are having with the new service.  As with the full-disclosure debate it is almost pointless to argue whether this is good or bad, as I am sure there will be debate on both sides for the use of this service. The reality is that it is here and the open source community, or anyone who has publically available source code, should brace themselves for an onslaught of vuln findings from kiddie@some*.edu. [...]

Comment Post by: CMoi — October 11th, 2006 @ 2:42 pm EST
Comment Post by: lzh — November 2nd, 2006 @ 12:26 pm EST

your \ * should be either \s*, or \s+. \ * will match zero or more spaces. \s* will match zero or more spaces or tabs. + is 1 or more of the preceding. You may realize this. But why not use the correct thing.

Comment Post by: lzh — November 2nd, 2006 @ 1:07 pm EST

hmm… that comes off as quite the jerk comment. Sorry. I just wanted to recommend something that will likely find stuff a little more correctly. I suppose \ * isn’t as readable as \s+. But since source code often has tabs \ * will likely miss stuff that you probably don’t want to miss.

Yes \s also matches newline, carriage return, and form feed. I don’t think that aspect matters much here. I have yet to get a regex to match across a line boundary using Google’s code search.

Comment Post by: lzh — November 2nd, 2006 @ 1:28 pm EST

er… I meant \s+ isn’t as readable as \ *. Remind me not to do this stuff when short of sleep. :/

Comment Post by: Digital Bond » Fun with Google Code Search — November 8th, 2006 @ 8:22 am EST

[...] Last week, on many security mailing lists, folks were talking about using Google Code Search to look for various sorts of vulnerabilities in publicly-accessible source code repositories. Given the tool’s robust support for regular expressions, it is not inconceivable for static analysis tools (aka source code scanners) to be quickly google-ified to search repositories instead of a local filesystem. [...]

Comment Post by: Harald Korneliussen — November 9th, 2006 @ 8:12 am EST

This was brilliant, I’ll definitively remember (or rather, bookmark) these regexps to use myself when maintaining large, old c apps… There should be a collection of such expressions.

On a side note, I found this site by searching reddit for “static”, because I was looking for articles on static analysis.

Comment Post by: Patrick Smacchia — May 3rd, 2007 @ 5:56 am EST

You might be interested in the tool NDepend:
http://www.NDepend.com

NDepend analyses source code and .NET assemblies. It allows controlling the complexity, the internal dependencies and the quality of .NET code. NDepend provides a language (CQL Code Query Language) dedicated to querying and constraining a codebase. It also comes with advanced code visualization (Dependencies Matrix, Metric treemap, Box and Arrows graph…), more than 60 metrics, facilities to generate reports and to be integrated with mainstream build technologies and development tools. NDepend also allows precise comparison of different versions of your codebase.

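Comment Post by: google code search at 不断往后看 — April 28th, 2008 @ 2:16 am EST

[...] If you are a programmer and don’t make good use of everything the internet gives us, that is a real waste. Google code search can be seen as the best gift Google has given us programmers; being able to read the MySQL source code from anywhere, without downloading it, feels really good. Of course, to get the most out of this tool, you need a few tricks and a bit of study. Today I found an article that can serve as an introductory guide, for everyone’s reference. Static Code Analysis Using Google Code Search [...]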
