Tuesday, June 14, 2011

Testing at the speed and scale of Google

(Cross-posted from the Google Engineering Tools blog)

By Pooja Gupta, Mark Ivey and John Penix


Continuous integration systems play a crucial role in keeping software working while it is being developed. The basic steps most continuous integration systems follow are:

1. Get the latest copy of the code.
2. Run all tests.
3. Report results.
4. Repeat 1-3.
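The loop above can be sketched in a few lines. This is a minimal illustration, not Google's implementation; the sync and test commands are placeholders to be replaced with your own VCS and test-runner invocations.

```python
import subprocess

def ci_cycle(sync_cmd, test_cmd):
    """One iteration of the basic CI loop: sync, run tests, report.

    Returns True if the test command exited cleanly (build is 'green').
    sync_cmd and test_cmd are placeholder command lines, e.g.
    ["git", "pull"] and ["./run_all_tests.sh"].
    """
    subprocess.run(sync_cmd, check=True)   # 1. get the latest copy of the code
    result = subprocess.run(test_cmd)      # 2. run all tests
    green = result.returncode == 0
    print("build is", "green" if green else "red")  # 3. report results
    return green

# Step 4 is simply a driver loop around the cycle:
#   while True:
#       ci_cycle(["git", "pull"], ["./run_all_tests.sh"])
```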

This works great while the codebase is small, code flux is reasonable and tests are fast. As a codebase grows over time, the effectiveness of such a system decreases. As more code is added, each clean run takes much longer and more changes get crammed into a single run. If something breaks, finding and backing out the bad change is a tedious and error-prone task for development teams.

Software development at Google is big and fast. The code base receives 20+ code changes per minute and 50% of the files change every month! Each product is developed and released from ‘head’ relying on automated tests verifying the product behavior. Release frequency varies from multiple times per day to once every few weeks, depending on the product team.

With such a huge, fast-moving codebase, it is possible for teams to get stuck spending a lot of time just keeping their build ‘green’. A continuous integration system should help by pinpointing the exact change at which a test started failing, instead of reporting a range of suspect changes or requiring a lengthy binary search for the offending change. To find the exact change that broke a test, we could run every test at every change, but that would be very expensive.

To solve this problem, we built a continuous integration system that uses dependency analysis to determine all the tests a change transitively affects and then runs only those tests for every change. The system is built on top of Google’s cloud computing infrastructure enabling many builds to be executed concurrently, allowing the system to run affected tests as soon as a change is submitted.

Here is an example where our system can provide faster and more precise feedback than a traditional continuous build. In this scenario, there are two tests and three changes that affect these tests. The gmail_server_tests are broken by the second change; however, a typical continuous integration system would only be able to tell that either change #2 or change #3 caused this test to fail. By using concurrent builds, we can launch tests without waiting for the current build/test cycle to finish. Dependency analysis limits the number of tests executed for each change, so that in this example, the total number of test executions is the same as before.
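The scenario above can be simulated with a small sketch. Everything here is illustrative and assumed, not the real system: the test names echo the article's example, the failure data is hard-coded to mimic "change #2 breaks gmail_server_tests", and a thread pool stands in for Google's cloud build infrastructure.

```python
from concurrent.futures import ThreadPoolExecutor

# Simulated history: change #2 breaks gmail_server_tests, and the
# breakage persists through change #3. Illustrative data only.
FAILS = {(2, "gmail_server_tests"), (3, "gmail_server_tests")}

def run_tests_for_change(change_id, tests):
    """Placeholder for running the given tests against the codebase as
    of change_id; returns the set of tests that failed."""
    return {t for t in tests if (change_id, t) in FAILS}

def first_breaking_change(changes, tests_per_change):
    """Launch one independent test run per change, concurrently, instead
    of waiting for the current build/test cycle to finish; then report,
    for each failing test, the earliest change at which it failed."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(
            lambda c: (c, run_tests_for_change(c, tests_per_change[c])),
            changes))
    culprits = {}
    for change, failed in sorted(results):   # earliest change wins
        for test in failed:
            culprits.setdefault(test, change)
    return culprits
```

Because each change gets its own run over only its affected tests, the result pins the failure to change #2 exactly, rather than to the range #2-#3.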




Let’s look deeper into how we perform the dependency analysis.

We maintain an in-memory graph of coarse-grained dependencies between various tests and build rules across the entire codebase. This graph, several gigabytes in memory, is kept up to date with each change that gets checked in. This allows us to transitively determine all tests that depend on the code modified in a given change and hence need to be re-run to know the current state of the build. Let’s walk through an example.
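A minimal sketch of such a graph, under the assumption that build rules declare their direct dependencies (the article does not describe the real data structures): edges are stored in reverse, from a rule to the rules that depend on it, so affected tests can be found by walking "upward" from the changed rules.

```python
from collections import defaultdict

class DependencyGraph:
    """Minimal sketch of a coarse-grained, in-memory build dependency
    graph. Rule names and the API are hypothetical."""

    def __init__(self):
        self.rdeps = defaultdict(set)  # rule -> rules that depend on it
        self.tests = set()             # rules that are test targets

    def add_rule(self, name, deps, is_test=False):
        """Register a build rule and record a reverse edge for each of
        its direct dependencies. Re-invoked whenever a checked-in change
        edits the rule's dependency list, keeping the graph up to date."""
        for dep in deps:
            self.rdeps[dep].add(name)
        if is_test:
            self.tests.add(name)
```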

Consider two sample projects, each containing a different set of tests: 



where the build dependency graph looks like this:


We will see how two isolated code changes, at different depths of the dependency tree, are analyzed to determine the affected tests, that is, the minimal set of tests that needs to be run to ensure that both the Gmail and Buzz projects are “green”.

Case 1: Change in a common library

For the first scenario, consider a change that modifies files in common_collections_util.




As soon as this change is submitted, we start a breadth-first search to find all tests that depend on it.



Once all the direct dependencies are found, we continue the BFS to collect all transitive dependencies until we reach the leaf nodes.



When done, we have all the tests that need to be run, and can calculate the projects that will need to update their overall status based on results from these tests.
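The walkthrough above amounts to a breadth-first search over reverse-dependency edges. Below is a sketch; the rule names come from the article's examples, but the exact edges are an assumption (the original figures are not reproduced here), chosen to be consistent with both cases in the text.

```python
from collections import deque

def affected_tests(rdeps, tests, changed_rules):
    """BFS from the changed build rules over reverse-dependency edges,
    collecting every transitively dependent test target.

    rdeps: dict mapping a rule to the set of rules that depend on it.
    tests: set of rule names that are test targets.
    """
    seen = set(changed_rules)
    queue = deque(changed_rules)
    found = set()
    while queue:
        rule = queue.popleft()
        if rule in tests:
            found.add(rule)
        for dependent in rdeps.get(rule, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return found

# Hypothetical graph consistent with the two cases in the article:
RDEPS = {
    "common_collections_util": {"gmail_client", "gmail_server", "buzz_server"},
    "gmail_client": {"gmail_client_tests"},
    "gmail_server": {"gmail_server_tests"},
    "buzz_server": {"buzz_server_tests"},
    "youtube_client": {"buzz_client"},
    "buzz_client": {"buzz_client_tests"},
}
TESTS = {"gmail_client_tests", "gmail_server_tests",
         "buzz_client_tests", "buzz_server_tests"}
```

With this graph, a change to common_collections_util fans out to tests in both projects, while a change to youtube_client reaches only buzz_client_tests, matching the two cases below.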





Case 2: Change in a dependent project

Now suppose a change modifying files in youtube_client is submitted.




We perform the same analysis and conclude that only buzz_client_tests is affected, and the status of the Buzz project needs to be updated:



The example above illustrates how we optimize the number of tests run per change without sacrificing the accuracy of the end results for a project. Running fewer tests per change allows us to run all affected tests for every change that gets checked in, making it easier for a developer to detect and deal with an offending change.

The use of smart tools and cloud computing infrastructure makes the continuous integration system fast and reliable. While we are constantly working on improvements to this system, thousands of Google projects already use it to launch and iterate quickly, making faster user-visible progress.

12 comments:

  1. This is a great concept, and one that I've had a lot of questions about in my test role in a CI-focused team.

    Two specific questions:
    1) How well does this scale when you have a much larger and more complicated graph of tests (i.e. orders of magnitude)?
    2) When running system-level, rather than unit-level tests, can one really map tests to limited domains enough to significantly cut down the test fanout for a given set of changes?

    Thanks for any feedback.

    ReplyDelete
  2. Great concept; one that has been discussed in my team more than once.

    Two questions:
    o Does this really work when the number of tests, number of software modules and the corresponding graph complexity gets orders of magnitude bigger?
    o Any thoughts on if/how this can apply when the tests are more system-oriented and hence most tests have a transitive relationship with most software modules?

    Thanks for this post...love reading the Google testing blog.

    ReplyDelete
  3. That's amazing. But, how do you maintain the dependencies? Is it each individual test owner's responsibility to set his/her tests dependencies?

    ReplyDelete
  4. Can you give more info about the dependency graph generation?
    Is it an automated process (ie: based on build system / makefiles / ...) or do the developers need to manually specify the dependencies?

    ReplyDelete
  5. Interesting, I have been thinking about this idea for a long time. Could be useful for upgrades as well.

    How does the framework handle callback interfaces?

    -Kristoffer

    ReplyDelete
  6. It's an interesting idea, applying the dependency management to the test runs and not only the builds.

    But I have a question: in the first example you can track down that both change #2 and change #3 are breaking gmail_server_tests independently.

    But how do you go back in the version control system and fix those changes individually? Do you go back to pre-change #2 and then serialize the (fixed) changes?

    Another question is that now you can know earlier if it was change #2 or #3 that broke the tests, but you will not know (till later) if the combination of change #2 and #3 (also) breaks the tests.

    Is this a tradeoff you agree upon since maybe individual changes breaking the build are more common than combination of changes?

    Thanks for the interesting post!

    ReplyDelete
  7. I do not quite understand, but I'll learn it. Thank you for this interesting information.

    ReplyDelete
  8. Does the dependency graph get updated by the compiler at compile time ?

    This would work great for currently extant/new features tied to the current dependency graph/map but how do the dependencies tie in with third party or external systems (same as the call back question posted by an earlier comment) that could get impacted by a change made within google's code base? Are their dependencies also mapped into this graph?

    Honestly, do you guys ever miss any tests that should have been run - how often are these a result of machine error vs human error?

    ReplyDelete
  9. @John

    1. This scales pretty well at the current scale as described in http://google-engtools.blogspot.com/2011/05/welcome-to-google-engineering-tools where all these numbers correspond to same code repository. Yes, it works! :)
    2. System-level tests are the ones that tend to be flaky, so even when they cannot be cut down to the same extent as unit tests, running them less often than at every change definitely helps.

    @Alex/@PJayTycy : Build rules are used to maintain dependencies. See similar discussion in: http://google-engtools.blogspot.com/2011/06/testing-at-speed-and-scale-of-google.html. The build rules are defined as per compiler specifications, and then the dependency graph generation is automatic.

    @krisskross Can you elaborate more?

    @PJ: There are occasional instances when we do miss tests because of a software bug, but it is rare enough that teams can continue to rely on the system.

    ReplyDelete
  10. doit is an open-source project that was created to be able to handle this kind of "runs only those tests for every change".

    For running python unittests, there is py.test plugin that integrates with doit to automatically find imports for python files and keep track of dependencies.

    ReplyDelete
  11. It sounds like this is more about helping the RCA process than saving time/resources in running the tests. It's great to have a quick feedback about how your single change passed/failed the regression suite, but like John mentioned, the system-level tests never map to a single component, and so the graph becomes much more complex.

    In our previous build system we had a similar system which was based on code coverage data. Test selection was automatic based on the code every changeset changed.

    ReplyDelete
  12. Nice, it's really useful. I have tried to make it like a personal barometer.

    ReplyDelete

The comments you read and contribute here belong only to the person who posted them. We reserve the right to remove off-topic comments.