Monday, July 14, 2014

Measuring Coverage at Google

By Marko Ivanković, Google Zürich

Introduction


Code coverage is a very interesting metric, covered by a large body of research that reaches somewhat contradictory conclusions. Some people think it is an extremely useful metric and that a certain percentage of coverage should be enforced on all code. Some think it is a useful tool for identifying areas that need more testing, but don't necessarily trust that covered code is truly well tested. Still others think that measuring coverage is actively harmful because it provides a false sense of security.

Our team's mission was to collect coverage-related data, then develop and champion code coverage practices across Google. We designed an opt-in system where engineers could enable two different types of coverage measurement for their projects: daily and per-commit. With daily coverage we run all tests for the project, whereas with per-commit coverage we run only the tests affected by the commit. The two measurements are independent, and many projects opted into both.

While we did experiment with branch, function and statement coverage, we ended up focusing mostly on statement coverage because of its relative simplicity and ease of visualization.
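
To make the distinction concrete, here is a small illustrative example in Go, one of the languages we measure. The function and test below are invented for this sketch; the point is that a single test can execute every statement while still leaving one branch outcome unexercised, which is the gap branch coverage reports and statement coverage does not:

  // abs.go -- hypothetical example, for illustration only.
  package coverage

  // Abs returns the absolute value of v.
  func Abs(v int) int {
      if v < 0 {
          v = -v
      }
      return v
  }

  // abs_test.go
  package coverage

  import "testing"

  // A single test with a negative input executes every statement in Abs,
  // so statement coverage is 100%. The v >= 0 branch outcome is never
  // taken, however, so branch coverage would be only 50%.
  func TestAbsNegative(t *testing.T) {
      if got := Abs(-3); got != 3 {
          t.Errorf("Abs(-3) = %d, want 3", got)
      }
  }

With the open-source Go toolchain, for example, go test -cover reports exactly this statement-level number (coverage as a percentage of statements); branch-level data requires additional tooling.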

How we measured


Our job was made significantly easier by the wonderful Google build system, whose parallelism and flexibility allowed us to run our measurements at Google scale with little extra effort. The build system already integrated various language-specific open-source coverage measurement tools such as Gcov (C++), Emma / JaCoCo (Java) and Coverage.py (Python), and we provided a central system where teams could sign up for coverage measurement.

For daily whole-project coverage measurement, each team was provided with a simple cronjob that runs all tests across the project's codebase. The results of these runs are available to the team in a centralized dashboard that displays charts of coverage over time and allows daily / weekly / quarterly / yearly aggregation and per-language slicing. On this dashboard, teams can also compare their project (or projects) with any other project, or with Google as a whole.

For per-commit measurement, we hook into the Google code review process (briefly explained in this article) and display the data visually to both the commit author and the reviewers. We display the data at two levels: color-coded lines right next to the color-coded diff, and an aggregate number for the entire commit.


Displayed above is a screenshot of the code review tool. The green line coloring is the standard diff coloring for added lines. The orange and lighter green coloring on the line numbers is the coverage information: light green for covered lines, orange for non-covered lines, and white for non-instrumented lines.
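
The coloring comes from per-line coverage data, and the aggregate number for the commit is essentially that same data intersected with the lines the commit touches. Below is a rough, illustrative Go sketch of the computation; the type and function names are invented and are not our actual implementation:

  // commitcov.go -- illustrative sketch, not our real implementation.
  package main

  import "fmt"

  // LineCoverage records, for a single file, whether each instrumented
  // line was executed by the affected tests. In a real system this would
  // be parsed from a coverage profile produced by the build.
  type LineCoverage map[int]bool // line number -> covered?

  // CommitCoverage returns the fraction of changed, instrumented lines
  // that are covered. Lines absent from cov are not instrumented
  // (comments, blank lines) and are ignored.
  func CommitCoverage(cov LineCoverage, changedLines []int) float64 {
      covered, instrumented := 0, 0
      for _, line := range changedLines {
          isCovered, ok := cov[line]
          if !ok {
              continue // non-instrumented line
          }
          instrumented++
          if isCovered {
              covered++
          }
      }
      if instrumented == 0 {
          return 1.0 // nothing measurable changed
      }
      return float64(covered) / float64(instrumented)
  }

  func main() {
      cov := LineCoverage{10: true, 11: true, 12: false}
      changed := []int{10, 11, 12, 13} // line 13 is a comment, not instrumented
      fmt.Printf("commit coverage: %.0f%%\n", CommitCoverage(cov, changed)*100)
  }

The non-instrumented case in the sketch corresponds to the white lines described above.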

It’s important to note that we surface the coverage information before the commit is submitted to the codebase, because this is the time when engineers are most likely to be interested in improving it.

Results


One of the main benefits of working at Google is the scale at which we operate. We have been running the coverage measurement system for some time now and we have collected data for more than 650 different projects, spanning 100,000+ commits. We have a significant amount of data for C++, Java, Python, Go and JavaScript code.

I am happy to say that we can share some preliminary results with you today:


The chart above is a histogram of average measured absolute coverage across Google. The median (50th percentile) code coverage is 78%, the 75th percentile is 85%, and the 90th percentile is 90%. We believe that these numbers represent a very healthy codebase.

We have also found it very interesting that there are significant differences between languages:

Language     Coverage
C++          56.6%
Java         61.2%
Go           63.0%
JavaScript   76.9%
Python       84.2%


The table above shows the total coverage of all analyzed code for each language, averaged over the past quarter. We believe that the large differences are due to structural, paradigm, and best-practice differences between languages, as well as to how precisely coverage can be measured in each language.

Note that these numbers should not be interpreted as guidelines for a particular language; the aggregation method used is too simple for that. Instead, this finding is simply a data point for any future research that analyzes samples from a single programming language.

The feedback from our fellow engineers was overwhelmingly positive. The most loved feature was surfacing the coverage information during code review. This early surfacing of coverage had a statistically significant impact: our initial analysis suggests that it increased coverage by 10% (averaged across all commits).

Future work


We are aware that there are a few problems with the dataset we collected. In particular, the individual tools we use to measure coverage are not perfect. Large integration tests, end-to-end tests and UI tests are difficult to instrument, so large parts of code exercised by such tests can be misreported as non-covered.

We are working on improving the tools, and we are also analyzing the impact of unit tests, integration tests and other types of tests individually.

In addition to languages, we will also investigate other factors that might influence coverage, such as platforms and frameworks, to allow all future research to account for their effect.

We will be publishing more of our findings in the future, so stay tuned.

And if this sounds like something you would like to work on, why not apply on our job site?

Monday, June 30, 2014

ThreadSanitizer: Slaughtering Data Races

by Dmitry Vyukov, Synchronization Lookout, Google, Moscow

Hello,

I work in the Dynamic Testing Tools team at Google. Our team develops tools like AddressSanitizer, MemorySanitizer and ThreadSanitizer which find various kinds of bugs. In this blog post I want to tell you about ThreadSanitizer, a fast data race detector for C++ and Go programs.

First of all, what is a data race? A data race occurs when two threads access the same variable concurrently and at least one of the accesses is a write. Most programming languages provide very weak guarantees, or no guarantees at all, for programs with data races. For example, in C++ absolutely any data race renders the behavior of the whole program completely undefined (yes, it can suddenly format the hard drive). Data races are common in concurrent programs, and they are notoriously hard to debug and localize. A typical manifestation of a data race is a program that occasionally crashes with obscure symptoms that are different each time and do not point to any particular place in the source code. Such bugs can take several months of debugging without particular success, since typical debugging techniques do not work. Fortunately, ThreadSanitizer can catch most data races in the blink of an eye. See Chromium issue 15577 for an example of such a data race and issue 18488 for the resolution.
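
To make this concrete, here is a minimal example of a data race, written in Go since the tool supports both C++ and Go (the program itself is invented for illustration):

  // race.go -- a deliberately racy program.
  package main

  import "fmt"

  func main() {
      var msg string
      done := false

      go func() {
          msg = "hello"
          done = true // unsynchronized write
      }()

      for !done { // unsynchronized read: races with the write above
      }
      fmt.Println(msg)
  }

Depending on timing and compiler optimizations, this program may print hello, print an empty line, or never terminate. Running it under ThreadSanitizer (go run -race race.go) reports the race immediately and points at both conflicting accesses.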

Due to the complex nature of the bugs caught by ThreadSanitizer, we don't suggest waiting until product release validation to use the tool. For example, at Google we've made our tools easily accessible to programmers during development, so that anyone can use the tool for testing if they suspect that new code might introduce a race. For both Chromium and the Google internal server codebase, we continuously run unit tests under the tool, which catches many regressions instantly. The Chromium project has recently started using ThreadSanitizer on ClusterFuzz, a large scale fuzzing system. Finally, some teams also set up periodic end-to-end testing with ThreadSanitizer under a realistic workload, which proves to be extremely valuable. Our team has zero tolerance for races found by the tool and does not consider any race to be benign, as even seemingly benign races can lead to memory corruption.

Our tools are dynamic (as opposed to static tools). This means that they do not merely "look" at the code and try to surmise where bugs might be; instead they instrument the binary at build time and then analyze the dynamic behavior of the program to catch it red-handed. This approach has its pros and cons. On one hand, the tool does not have any false positives, thus it does not bother a developer with something that is not a bug. On the other hand, in order to catch a bug, the test must expose it -- the racing data accesses must actually be executed in different threads. This requires writing good multi-threaded tests and makes end-to-end testing especially effective.
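
In practice this means writing tests that really do perform the conflicting accesses from multiple threads at the same time. A minimal Go sketch of such a test is shown below; the Counter type is hypothetical and deliberately left unsynchronized:

  // counter_test.go -- hypothetical concurrent test.
  package counter

  import (
      "sync"
      "testing"
  )

  // Counter is intentionally unsynchronized so that the test below
  // contains a real data race for the detector to find.
  type Counter struct{ n int }

  func (c *Counter) Inc()     { c.n++ }
  func (c *Counter) Get() int { return c.n }

  // TestCounterConcurrent increments the counter from several goroutines
  // at once. Run under the race detector, the conflicting increments are
  // reported as soon as they actually overlap; a single-threaded test
  // would never expose them.
  func TestCounterConcurrent(t *testing.T) {
      c := &Counter{}
      var wg sync.WaitGroup
      for i := 0; i < 8; i++ {
          wg.Add(1)
          go func() {
              defer wg.Done()
              for j := 0; j < 1000; j++ {
                  c.Inc()
              }
          }()
      }
      wg.Wait()
      if got := c.Get(); got != 8000 {
          t.Logf("lost updates: got %d, want 8000", got)
      }
  }

go test -race runs this and every other test in the package with the detector enabled.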

As a bonus, ThreadSanitizer finds some other types of bugs: thread leaks, deadlocks, incorrect uses of mutexes, malloc calls in signal handlers, and more. It also natively understands atomic operations and thus can find bugs in lock-free algorithms (see e.g. this bug in the V8 concurrent garbage collector).
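
Because the tool models atomic operations directly, a correct lock-free version of, say, the racy counter sketched above runs cleanly under the detector, while genuine mistakes in lock-free code are still reported. An illustrative race-free sketch:

  // atomic_counter.go -- race-free counter using atomic operations, illustrative only.
  package main

  import (
      "fmt"
      "sync"
      "sync/atomic"
  )

  func main() {
      var counter int64
      var wg sync.WaitGroup

      for i := 0; i < 8; i++ {
          wg.Add(1)
          go func() {
              defer wg.Done()
              for j := 0; j < 1000; j++ {
                  atomic.AddInt64(&counter, 1) // atomic increment: no race
              }
          }()
      }
      wg.Wait()
      fmt.Println(atomic.LoadInt64(&counter)) // always prints 8000
  }

Running this with go run -race produces no report, which is what lets the detector find real bugs in lock-free algorithms instead of drowning them in noise.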

The tool is supported by both Clang and GCC compilers (only on Linux/Intel64). Using it is very simple: you just need to add a -fsanitize=thread flag during compilation and linking. For Go programs, you simply need to add a -race flag to the go tool (supported on Linux, Mac and Windows).

Interestingly, after integrating the tool into compilers, we've found some bugs in the compilers themselves. For example, LLVM was illegally widening stores, which can introduce very harmful data races into otherwise correct programs. And GCC was injecting unsafe code for initialization of function static variables. Among our other trophies are more than a thousand bugs in Chromium, Firefox, the Go standard library, WebRTC, OpenSSL, and of course in our internal projects.

So what are you waiting for? You know what to do!

Monday, June 16, 2014

GTAC 2014: Call for Proposals & Attendance

Posted by Anthony Vallone on behalf of the GTAC Committee

The application process is now open for presentation proposals and attendance for GTAC (Google Test Automation Conference; see the initial announcement), to be held at the Google Kirkland office (near Seattle, WA) on October 28-29, 2014.

GTAC will be streamed live on YouTube again this year, so even if you can’t attend, you’ll be able to watch the conference from your computer.

Speakers
Presentations are targeted at students, academics, and experienced engineers working on test automation. Full presentations and lightning talks are 45 minutes and 15 minutes respectively. Speakers should be prepared for a question-and-answer session following their presentation.

Application
For presentation proposals and/or attendance, complete this form. We will be selecting about 300 applicants for the event.

Deadline
The due date for both presentation and attendance applications is July 28, 2014.

Fees
There are no registration fees, and we will send out detailed registration instructions to each invited applicant. Meals will be provided, but speakers and attendees must arrange and pay for their own travel and accommodations.

Update: Our contact email was bouncing; this is now fixed.



Wednesday, June 04, 2014

GTAC 2014 Coming to Seattle/Kirkland in October

Posted by Anthony Vallone on behalf of the GTAC Committee

If you're looking for a place to discuss the latest innovations in test automation, then charge your tablets and pack your gumboots - the eighth GTAC (Google Test Automation Conference) will be held on October 28-29, 2014 at Google Kirkland! The Kirkland office is part of the Seattle/Kirkland campus in beautiful Washington state. This campus forms our third largest engineering office in the USA.



GTAC is a periodic conference hosted by Google, bringing together engineers from industry and academia to discuss advances in test automation and the computer science of test engineering. It’s a great opportunity to present, learn, and challenge modern testing technologies and strategies.

You can browse the presentation abstracts, slides, and videos from last year on the GTAC 2013 page.

Stay tuned to this blog and the GTAC website for application information and opportunities to present at GTAC. Subscribing to this blog is the best way to get notified. We're looking forward to seeing you there!

Friday, May 30, 2014

Testing on the Toilet: Risk-Driven Testing

by Peter Arrenbrecht

This article was adapted from a Google Testing on the Toilet (TotT) episode. You can download a printer-friendly version of this TotT episode and post it in your office.

We are all conditioned to write tests as we code: unit, functional, UI—the whole shebang. We are professionals, after all. Many of us like how small tests let us work quickly, and how larger tests inspire safety and closure. Or we may just anticipate flak during review. We are so used to these tests that often we no longer question why we write them. This can be wasteful and dangerous.

Tests are a means to an end: to reduce the key risks of a project, and to get the biggest bang for the buck. This bang may not always come from the tests that standard practice has you write, or even from tests at all.

Two examples:

“We built a new debugging aid. We wrote unit, integration, and UI tests. We were ready to launch.”

Outstanding practice. Missing the mark.

Our key risks were that we'd corrupt our data or bring down our servers for the sake of a debugging aid. None of the tests addressed this, but they gave a false sense of safety and “being done”.
We stopped the launch.


“We wanted to turn down a feature, so we needed to alert affected users. Again we had unit and integration tests, and even one expensive end-to-end test.”

Standard practice. Wasted effort.

The alert was so critical it actually needed end-to-end coverage for all scenarios. But it would be live for only three releases. The cheapest effective test? Manual testing before each release.


A Better Approach: Risks First

For every project or feature, think about testing. Brainstorm your key risks and your best options for reducing them. Do this at the start so you don't waste effort and can adapt your design. Write the results down as a QA design so that you can point to it in reviews and discussions.

To be sure, standard practice remains a good idea in most cases (hence it’s standard). Small tests are cheap and speed up coding and maintenance, and larger tests safeguard core use-cases and integration.

Just remember: Your tests are a means. The bang is what counts. It’s your job to maximize it.