Monday, April 14, 2014

Testing on the Toilet: Test Behaviors, Not Methods

by Erik Kuefler

This article was adapted from a Google Testing on the Toilet (TotT) episode. You can download a printer-friendly version of this TotT episode and post it in your office.

After writing a method, it's easy to write just one test that verifies everything the method does. But it can be harmful to think that tests and public methods should have a 1:1 relationship. What we really want to test are behaviors, where a single method can exhibit many behaviors, and a single behavior sometimes spans across multiple methods.

Let's take a look at a bad test that verifies an entire method:

@Test public void testProcessTransaction() {
  User user = newUserWithBalance(LOW_BALANCE_THRESHOLD.plus(dollars(2));
  transactionProcessor.processTransaction(
      user,
      new Transaction("Pile of Beanie Babies", dollars(3)));
  assertContains("You bought a Pile of Beanie Babies", ui.getText());
  assertEquals(1, user.getEmails().size());
  assertEquals("Your balance is low", user.getEmails().get(0).getSubject());
}

Displaying the name of the purchased item and sending an email about the balance being low are two separate behaviors, but this test looks at both of those behaviors together just because they happen to be triggered by the same method. Tests like this very often become massive and difficult to maintain over time as additional behaviors keep getting added in—eventually it will be very hard to tell which parts of the input are responsible for which assertions. The fact that the test's name is a direct mirror of the method's name is a bad sign.

It's a much better idea to use separate tests to verify separate behaviors:

@Test public void testProcessTransaction_displaysNotification() {
  transactionProcessor.processTransaction(
      new User(), new Transaction("Pile of Beanie Babies"));
  assertContains("You bought a Pile of Beanie Babies", ui.getText());
}
@Test public void testProcessTransaction_sendsEmailWhenBalanceIsLow() {
  User user = newUserWithBalance(LOW_BALANCE_THRESHOLD.plus(dollars(2));
  transactionProcessor.processTransaction(
      user,
      new Transaction(dollars(3)));
  assertEquals(1, user.getEmails().size());
  assertEquals("Your balance is low", user.getEmails().get(0).getSubject());
}

Now, when someone adds a new behavior, they will write a new test for that behavior. Each test will remain focused and easy to understand, no matter how many behaviors are added. This will make your tests more resilient since adding new behaviors is unlikely to break the existing tests, and clearer since each test contains code to exercise only one behavior.

Tuesday, April 01, 2014

The *Real* Test Driven Development


Update: As noticed in the comments, the date of the post was not a mere coincidence :)


by Kaue Silveira

Here at Google, we invest heavily in development productivity research. In fact, our TDD research group now occupies nearly an entire building of the Googleplex. The group has been working hard to minimize the development cycle time, and we’d like to share some of the amazing progress they’ve made.

The Concept

In the ways of old, it used to be that people wrote tests for their existing code. This was changed by TDD (Test-driven Development), where one would write the test first and then write the code to satisfy it. The TDD research group didn’t think this was enough and wanted to elevate the humble test to the next level. We are pleased to announce the Real TDD, our latest innovation in the Program Synthesis field, where you write only the tests and have the computer write the code for you!

The following graph shows how the number of tests created by a small feature team grew since they started using this tool towards the end of 2013. Over the last 2 quarters, more than 89% of this team’s production code was written by the tool!

See it in action:

Test written by a Software Engineer:

class LinkGeneratorTest(googletest.TestCase):

  def setUp(self):
    self.generator = link_generator.LinkGenerator()

  def testGetLinkFromIDs(self):
    expected = ('https://frontend.google.com/advancedSearchResults?'
                's.op=ALL&s.r0.field=ID&s.r0.val=1288585+1310696+1346270+')
    actual = self.generator.GetLinkFromIDs(set((1346270, 1310696, 1288585)))
    self.assertEqual(expected, actual)

Code created by our tool:

import urllib

class LinkGenerator(object):

  _URL = (
      'https://frontend.google.com/advancedSearchResults?'
      's.op=ALL&s.r0.field=ID&s.r0.val=')

  def GetLinkFromIDs(self, ids):
    result = []
    for id in sorted(ids):
      result.append('%s ' % id)
    return self._URL + urllib.quote_plus(''.join(result))

Note that the tool is smart enough to not generate the obvious implementation of returning a constant string, but instead it correctly abstracts and generalizes the relation between inputs and outputs. It becomes smarter at every use and it’s behaving more and more like a human programmer every day. We once saw a comment in the generated code that said "I need some coffee".

How does it work?

We’ve trained the Google Brain with billions of lines of open-source software to learn about coding patterns and how product code correlates with test code. Its accuracy is further improved by using Type Inference to infer types from code and the Girard-Reynolds Isomorphism to infer code from types.

The tool runs every time your unit test is saved, and it uses the learned model to guide a backtracking search for a code snippet that satisfies all assertions in the test. It provides sub-second responses for 99.5% of the cases (as shown in the following graph), thanks to millions of pre-computed assertion-snippet pairs stored in Spanner for global low-latency access.



How can I use it?

We will offer a free (rate-limited) service that everyone can use, once we have sorted out the legal issues regarding the possibility of mixing code snippets originating from open-source projects with different licenses (e.g., GPL-licensed tests will simply refuse to pass BSD-licensed code snippets). If you would like to try our alpha release before the public launch, leave us a comment!

Tuesday, March 18, 2014

Testing on the Toilet: What Makes a Good Test?

by Erik Kuefler

This article was adapted from a Google Testing on the Toilet (TotT) episode. You can download a printer-friendly version of this TotT episode and post it in your office.

Unit tests are important tools for verifying that our code is correct. But writing good tests is about much more than just verifying correctness — a good unit test should exhibit several other properties in order to be readable and maintainable.

One property of a good test is clarity. Clarity means that a test should serve as readable documentation for humans, describing the code being tested in terms of its public APIs. Tests shouldn't refer directly to implementation details. The names of a class's tests should say everything the class does, and the tests themselves should serve as examples for how to use the class.

Two more important properties are completeness and conciseness. A test is complete when its body contains all of the information you need to understand it, and concise when it doesn't contain any other distracting information. This test fails on both counts:

@Test public void shouldPerformAddition() {
  Calculator calculator = new Calculator(new RoundingStrategy(), 
      "unused", ENABLE_COSIN_FEATURE, 0.01, calculusEngine, false);
  int result = calculator.doComputation(makeTestComputation());
  assertEquals(5, result); // Where did this number come from?
}

Lots of distracting information is being passed to the constructor, and the important parts are hidden off in a helper method. The test can be made more complete by clarifying the purpose of the helper method, and more concise by using another helper to hide the irrelevant details of constructing the calculator:

@Test public void shouldPerformAddition() {
  Calculator calculator = newCalculator();
  int result = calculator.doComputation(makeAdditionComputation(2, 3));
  assertEquals(5, result);
}

One final property of a good test is resilience. Once written, a resilient test doesn't have to change unless the purpose or behavior of the class being tested changes. Adding new behavior should only require adding new tests, not changing old ones. The original test above isn't resilient since you'll have to update it (and probably dozens of other tests!) whenever you add a new irrelevant constructor parameter. Moving these details into the helper method solved this problem.

Monday, March 03, 2014

When/how to use Mockito Answer

by Hongfei Ding, Software Engineer, Shanghai

Mockito is a popular open source Java testing framework that allows the creation of mock objects. For example, we have the below interface used in our SUT (System Under Test):
interface Service {
  Data get();
}

In our test, normally we want to fake the Service’s behavior to return canned data, so that the unit test can focus on testing the code that interacts with the Service. We use when-return clause to stub a method.
when(service.get()).thenReturn(cannedData);

But sometimes you need mock object behavior that's too complex for when-return. An Answer object can be a clean way to do this once you get the syntax right.

A common usage of Answer is to stub asynchronous methods that have callbacks. For example, we have mocked the interface below:
interface Service {
  void get(Callback callback);
}

Here you’ll find that when-return is not that helpful anymore. Answer is the replacement. For example, we can emulate a success by calling the onSuccess function of the callback.
doAnswer(new Answer<Void>() {
    public Void answer(InvocationOnMock invocation) {
       Callback callback = (Callback) invocation.getArguments()[0];
       callback.onSuccess(cannedData);
       return null;
    }
}).when(service).get(any(Callback.class));

Answer can also be used to make smarter stubs for synchronous methods. Smarter here means the stub can return a value depending on the input, rather than canned data. It’s sometimes quite useful. For example, we have mocked the Translator interface below:
interface Translator {
  String translate(String msg);
}

We might choose to mock Translator to return a constant string and then assert the result. However, that test is not thorough, because the input to the translator function has been ignored. To improve this, we might capture the input and do extra verification, but then we start to fall into the “testing interaction rather than testing state” trap.

A good usage of Answer is to reverse the input message as a fake translation. So that both things are assured by checking the result string: 1) translate has been invoked, 2) the msg being translated is correct. Notice that this time we’ve used thenAnswer syntax, a twin of doAnswer, for stubbing a non-void method.
when(translator.translate(any(String.class))).thenAnswer(reverseMsg())
...
// extracted a method to put a descriptive name
private static Answer<String> reverseMsg() { 
  return new Answer<String>() {
    public String answer(InvocationOnMock invocation) {
       return reverseString((String) invocation.getArguments()[0]));
    }
  }
}

Last but not least, if you find yourself writing many nontrivial Answers, you should consider using a fake instead.

Monday, February 03, 2014

Minimizing Unreproducible Bugs



by Anthony Vallone

Unreproducible bugs are the bane of my existence. Far too often, I find a bug, report it, and hear back that it’s not a bug because it can’t be reproduced. Of course, the bug is still there, waiting to prey on its next victim. These types of bugs can be very expensive due to increased investigation time and overall lifetime. They can also have a damaging effect on product perception when users reporting these bugs are effectively ignored. We should be doing more to prevent them. In this article, I’ll go over some obvious, and maybe not so obvious, development/testing guidelines that can reduce the likelihood of these bugs from occurring.


Avoid and test for race conditions, deadlocks, timing issues, memory corruption, uninitialized memory access, memory leaks, and resource issues

I am lumping together many bug types in this section, but they are all related somewhat by how we test for them and how disproportionately hard they are to reproduce and debug. The root cause and effect can be separated by milliseconds or hours, and stack traces might be nonexistent or misleading. A system may fail in strange ways when exposed to unusual traffic spikes or insufficient resources. Race conditions and deadlocks may only be discovered during unique traffic patterns or resource configurations. Timing issues may only be noticed when many components are integrated and their performance parameters and failure/retry/timeout delays create a chaotic system. Memory corruption or uninitialized memory access may go unnoticed for a large percentage of calls but become fatal for rare states. Memory leaks may be negligible unless the system is exposed to load for an extended period of time.

Guidelines for development:

  • Simplify your synchronization logic. If it’s too hard to understand, it will be difficult to reproduce and debug complex concurrency problems.
  • Always obtain locks in the same order. This is a tried-and-true guideline to avoid deadlocks, but I still see code that breaks it periodically. Define an order for obtaining multiple locks and never change that order.
  • Don’t optimize by creating many fine-grained locks, unless you have verified that they are needed. Extra locks increase concurrency complexity.
  • Avoid shared memory, unless you truly need it. Shared memory access is very easy to get wrong, and the bugs may be quite difficult to reproduce.

Guidelines for testing:

  • Stress test your system regularly. You don't want to be surprised by unexpected failures when your system is under heavy load.
  • Test timeouts. Create tests that mock/fake dependencies to test timeout code. If your timeout code does something bad, it may cause a bug that only occurs under certain system conditions.
  • Test with debug and optimized builds. You may find that a well behaved debug build works fine, but the system fails in strange ways once optimized.
  • Test under constrained resources. Try reducing the number of data centers, machines, processes, threads, available disk space, or available memory. Also try simulating reduced network bandwidth.
  • Test for longevity. Some bugs require a long period of time to reveal themselves. For example, persistent data may become corrupt over time.
  • Use dynamic analysis tools like memory debuggers, ASan, TSan, and MSan regularly. They can help identify many categories of unreproducible memory/threading issues.


Enforce preconditions

I’ve seen many well-meaning functions with a high tolerance for bad input. For example, consider this function:

void ScheduleEvent(int timeDurationMilliseconds) {
  if (timeDurationMilliseconds <= 0) {
    timeDurationMilliseconds = 1;
  }
  ...
}

This function is trying to help the calling code by adjusting the input to an acceptable value, but it may be doing damage by masking a bug. The calling code may be experiencing any number of problems described in this article, and passing garbage to this function will always work fine. The more functions that are written with this level of tolerance, the harder it is to trace back to the root cause, and the more likely it becomes that the end user will see garbage. Enforcing preconditions, for instance by using asserts, may actually cause a higher number of failures for new systems, but as systems mature, and many minor/major problems are identified early on, these checks can help improve long-term reliability.

Guidelines for development:

  • Enforce preconditions in your functions unless you have a good reason not to.


Use defensive programming

Defensive programming is another tried-and-true technique that is great at minimizing unreproducible bugs. If your code calls a dependency to do something, and that dependency quietly fails or returns garbage, how does your code handle it? You could test for situations like this via mocking or faking, but it’s even better to have your production code do sanity checking on its dependencies. For example:

double GetMonthlyLoanPayment() {
  double rate = GetTodaysInterestRateFromExternalSystem();
  if (rate < 0.001 || rate > 0.5) {
    throw BadInterestRate(rate);
  }
  ...
}

Guidelines for development:

  • When possible, use defensive programming to verify the work of your dependencies with known risks of failure like user-provided data, I/O operations, and RPC calls.

Guidelines for testing:

  • Use fuzz testing to test your systems hardiness when enduring bad data.


Don’t hide all errors from the user

There has been a trend in recent years toward hiding failures from users at all costs. In many cases, it makes perfect sense, but in some, we have gone overboard. Code that is very quiet and permissive during minor failures will allow an uninformed user to continue working in a failed state. The software may ultimately reach a fatal tipping point, and all the error conditions that led to failure have been ignored. If the user doesn’t know about the prior errors, they will not be able to report them, and you may not be able to reproduce them.

Guidelines for development:

  • Only hide errors from the user when you are certain that there is no impact to system state or the user.
  • Any error with impact to the user should be reported to the user with instructions for how to proceed. The information shown to the user, combined with data available to an engineer, should be enough to determine what went wrong.


Test error handling

The most common sections of code to remain untested is error handling code. Don’t skip test coverage here. Bad error handling code can cause unreproducible bugs and create great risk if it does not handle fatal errors well.

Guidelines for testing:

  • Always test your error handling code. This is usually best accomplished by mocking or faking the component triggering the error.
  • It’s also a good practice to examine your log quality for all types of error handling.


Check for duplicate keys

If unique identifiers or data access keys are generated using random data or are not guaranteed to be globally unique, duplicate keys may cause data corruption or concurrency issues. Key duplication bugs are very difficult to reproduce.

Guidelines for development:

  • Try to guarantee uniqueness of all keys.
  • When not possible to guarantee unique keys, check if the recently generated key is already in use before using it.
  • Watch out for potential race conditions here and avoid them with synchronization.


Test for concurrent data access

Some bugs only reveal themselves when multiple clients are reading/writing the same data. Your stress tests might be covering cases like these, but if they are not, you should have special tests for concurrent data access. Case like these are often unreproducible. For example, a user may have two instances of your app running against the same account, and they may not realize this when reporting a bug.

Guidelines for testing:

  • Always test for concurrent data access if it’s a feature of the system. Actually, even if it’s not a feature, verify that the system rejects it. Testing concurrency can be challenging. An approach that usually works for me is to create many worker threads that simultaneously attempt access and a master thread that monitors and verifies that some number of attempts were indeed concurrent, blocked or allowed as expected, and all were successful. Programmatic post-analysis of all attempts and changing system state may also be necessary to ensure that the system behaved well.


Steer clear of undefined behavior and non-deterministic access to data

Some APIs and basic operations have warnings about undefined behavior when in certain states or provided with certain input. Similarly, some data structures do not guarantee an iteration order (example: Java’s Set). Code that ignores these warnings may work fine most of the time but fail in unusual ways that are hard to reproduce.

Guidelines for development:

  • Understand when the APIs and operations you use might have undefined behavior and prevent those conditions.
  • Do not depend on data structure iteration order unless it is guaranteed. It is a common mistake to depend on the ordering of sets or associative arrays.


Log the details for errors or test failures

Issues described in this article can be easier to reproduce and debug when the logs contain enough detail to understand the conditions that led to an error.

Guidelines for development:

  • Follow good logging practices, especially in your error handling code.
  • If logs are stored on a user’s machine, create an easy way for them to provide you the logs.

Guidelines for testing:

  • Save your test logs for potential analysis later.


Anything to add?

Have I missed any important guidelines for minimizing these bugs? What is your favorite hard-to-reproduce bug that you discovered and resolved?