LLVM Project News and Details from the Trenches

Monday, May 23, 2011

C++ at Google: Here Be Dragons

Google has one of the largest monolithic C++ codebases in the world. We have thousands of engineers working on millions of lines of C++ code every day. To help keep the entire thing running and all these engineers fast and productive we have had to build some unique C++ tools, centering around the Clang C++ compiler. These help engineers understand their code and prevent bugs before they get to our production systems.

(Cross-posted on the Google Engineering Tools Blog)

Of course, improving the speed of Google engineers—their productivity—doesn’t always correlate to speed in the traditional sense. It requires the holistic acceleration of Google’s engineering efforts. Making any one tool faster just doesn’t cut it; the entire process has to be improved from end to end.

As a performance junkie, I like to think of this in familiar terms. It’s analogous to an algorithmic performance improvement. You get “algorithmic” improvements in productivity when you reduce the total work required for an engineer to get the job done, or fundamentally shift the time scale that the work requires. However, improving the time a single task requires often runs afoul of all the adages about performance tuning, 80/20 rules, and the pitfalls of over-optimizing.

One of the best ways to get these algorithmic improvements to productivity is to completely remove a set of tasks. Let’s take the task of triaging and debugging serious production bugs. If you’ve worked on a large software project, you’ve probably seen bugs which are somehow missed during code review, testing, and QA. When these bugs make it to production they cause a massive drain on developer productivity as the engineers cope with outages, data loss, and user complaints.

What if we could build a tool that would find these exact kinds of bugs in software automatically? What if we could prevent them from ever bringing down a server, reaching a user’s data, or causing a pager to go off? Many of these bugs boil down to simple C++ programming errors. Consider this snippet of code:

Response ProcessRequest(Widget foo, Whatsit bar, bool *charge_acct) {
  // Do some fancy stuff...
  if (/* Detect a subscription user */) {
    charge_acct = false;
  }
  // Lots more fancy stuff...
}


Do you see the bug? Careful testing and code reviews catch these and other bugs constantly, but inevitably one will sneak through, because the code looks fine. It says that it shouldn’t charge the account right there, plain as day. Unfortunately, C++ insists that ‘false’ is the same as ‘0’ which can be a pointer just as easily as it can be a boolean flag. This code sets the pointer to NULL, and never touches the flag.
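For comparison, here is a minimal sketch of what the author of that snippet presumably intended: the pointer is dereferenced, so the assignment clears the flag rather than the pointer. The definitions of Widget, Whatsit, and Response, and the is_subscriber field, are hypothetical stand-ins invented so the example compiles; they are not the real types.

```cpp
#include <cassert>

// Hypothetical stand-ins for the types named in the snippet above.
struct Widget { bool is_subscriber; };
struct Whatsit {};
struct Response { bool ok; };

Response ProcessRequest(Widget foo, Whatsit bar, bool *charge_acct) {
  assert(charge_acct != nullptr);  // The caller must pass a valid flag.
  (void)bar;                       // Unused in this sketch.
  if (foo.is_subscriber) {
    *charge_acct = false;  // Dereference: write the flag, not the pointer.
  }
  return Response{true};
}
```

The only change that matters is the single `*`. Without it, the assignment rebinds the local copy of the pointer and the caller's flag keeps whatever value it had.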

Humans aren’t good at spotting this type of devious typo, any more than humans are good at translating C++ code into machine instructions. We have tools to do that, and the tool of choice in this case is the compiler. Not just any compiler will do, because while the code above is one example of a bug, we need to teach our compiler to find lots of other examples. We also have to be careful to make certain that developers will act upon the information these tools provide. Within Google’s C++ codebase, that means we break the build for every compiler diagnostic, even warnings. We continually need to enhance our tools to find new bugs in new code based on new patterns, all while maintaining enough precision to immediately break the build and have high confidence that the code is wrong.

To address these issues we started a project at Google which is working with the LLVM Project to develop the Clang C++ compiler. We can rapidly add warnings to Clang and customize them to emit precise diagnostics about dangerous and potentially buggy constructs. Clang is designed as a collection of libraries with the express goal of supporting diverse tools and application uses. These libraries can be directly integrated into IDEs and command-line tools while still forming the core of the compiler itself.

We’ve been working on Clang for over a year now so that it can understand and reason about all of the C++ code at Google. But building the tools and technology to catch these bugs is only half the battle; we have to get engineers to use them as well. When other teams at Google respond to production bugs, our team will often begin working to enable any Clang diagnostics that might have caught the bug. Within a week of a production issue, we can sweep the entire codebase using these diagnostics to fix any latent bugs.

Recently we enabled the Clang C++ compiler for every C++ build at Google in order to provide accurate and helpful warnings and diagnostics to engineers. Some examples of how Clang can help developers with bad code are discussed in this post on the LLVM blog. Beyond that, once we have swept the codebase with a bug-finding diagnostic, we can enable it for all our engineers to catch future bugs before they’re committed. These diagnostics break the entire build of that piece of software to ensure that they aren’t ignored and are acted on immediately. For the code sample above, the user gets an error message:

example1.cc:4:17: error: initialization of pointer of type 'bool *' from literal 'false' [-Werror,-Wbool-conversions]
    charge_acct = false;
                ^


Here are two other classes of bugs we’ve found:

long kMaxDiskSpace = 10 << 30; // Ten gigs ought to be enough for anybody.

void SomeService() {
  // Setup task using external resource...
  while (/* Check if resource is available yet ... */) {
    sleep(0.5); // Yield the CPU
  }
}


Both now trigger the following errors:

example2.cc:12:25: error: shift result (10737418240) requires 35 bits to represent, but 'int' only has 32 bits [-Werror,-Wshift-overflow]
long kMaxDiskSpace = 10 << 30;
                     ~~ ^  ~~
example2.cc:16:11: error: implicit conversion turns literal floating-point number into integer: 'double' to 'unsigned int' [-Werror,-Wliteral-conversion]
    sleep(0.5);
    ~~~~~ ^~~


All of these represent real bugs that we have found in our code, and that we are catching and fixing with the help of Clang today.
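As a rough sketch of how these two bugs might be repaired (not necessarily how the real code was fixed): in the first, both operands of the shift are plain int, so the arithmetic happens in 32 bits before the result is ever widened to long. In the second, sleep() takes a whole number of seconds as an unsigned int, so 0.5 is converted to 0 and the loop spins without yielding. The version below widens the left operand before shifting and uses std::this_thread::sleep_for from C++11 for the sub-second wait. The WaitUntilAvailable helper, its predicate parameter, and the 500 ms default are assumptions made for illustration.

```cpp
#include <chrono>
#include <cstdint>
#include <thread>

// Widen the left operand first so the shift is evaluated in 64 bits.
constexpr std::int64_t kMaxDiskSpace = std::int64_t{10} << 30;
static_assert(kMaxDiskSpace == 10737418240LL, "expected ten GiB");

// Hypothetical helper: poll is_available() until it returns true, sleeping
// for poll_interval between checks. Returns how many times it slept.
template <typename Predicate>
int WaitUntilAvailable(Predicate is_available,
                       std::chrono::milliseconds poll_interval =
                           std::chrono::milliseconds(500)) {
  int polls = 0;
  while (!is_available()) {
    std::this_thread::sleep_for(poll_interval);  // Actually yields the CPU.
    ++polls;
  }
  return polls;
}
```

Expressing the interval as a std::chrono duration keeps the unit in the type, so a fractional number of seconds cannot silently collapse to zero the way it does in the call to sleep().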

Clang and its diagnostics don’t in any way obviate the need for careful code review and thorough testing. Rather, they complement these practices, combining to help reduce the number of bugs in our code. This is the platform on which we are developing new and better diagnostics for engineers going forward. This is how we are providing an algorithmic improvement to their productivity, and accelerating Google.

Stay tuned for more posts about how we rolled Clang out to Google engineers, how we have enhanced Clang to make it even more relevant for our code and our developers’ needs, and some of the exciting tools we’re building on top of this platform.