The FedEx Tour

By Rajat Dewan


I appreciate James' offer to talk about how I have used the FedEx tour in Mobile Ads. Good timing too as I just found two more priority 0 bugs with the automation that the FedEx tour inspired! It was fun presenting this at STAR and I am pleased so many people attended.

Mobile has been a hard problem space for testing: a humongous combination of browsers, phones, and capabilities that is changing fast as the underlying technology evolves. Add to this poor tool support for the mobile platform and the rapid evolution of the devices and you'll understand why I am so interested in advice on how to do better test design. We've literally tried everything, from checking screenshots of Google's properties on mobile phones to treating the phone like a collection of client apps and automating them in the traditional UI button-clicking way.

Soon after James joined Google in May 2009, he started introducing the concept of tours, essentially making a point of "structured" exploratory testing. Tours presented a way for me to look at the testing problem in a radical new way. Traditionally, the strategy is simple: focus on the end user interaction, and verify the expected outputs from the system under test. Tours (at least for me) change this formula. They force the tester to focus on what the software does, isolating the different moving parts of software in execution, and isolating the different parts of the software at the component (and composition) level. Tours tell me to focus on testing the parts that drive the car, rather than on whether or not the car drives. This is somewhat counterintuitive, I admit; that's why it is so important. The real value add of the tours comes from the fact that they guide me in testing those different parts and help me analyze how different capabilities inter-operate. Cars will always drive you off the lot; which part will break first is the real question.

I think testing a car is a good analogy. As a system it's devilishly complicated, hard to automate and hard to find the right combination of factors to make it fail. However, testing the dashboard can be automated; so can testing the flow of gasoline from the fuel tank to the engine and from there to the exhaust, so can lots of other capabilities. These automated point solutions can also be combined to test a bigger piece of the whole system. It's exactly what a mechanic does when trying to diagnose a problem: he employs different strategies for testing/checking each mechanical subsystem.

At STAR West, I spoke about evolving a good test strategy with the help of tours, specifically the FedEx tour. Briefly, the FedEx tour talks about tracking the movement of data and how it gets consumed and transformed by the system. It focuses on a very specific moving part, and as it turns out a crucial one for mobile.

James' FedEx tour tells me to identify and track data through my system. Identifying it is the easy part: the data comes from the Ads Database and is basically the information a user sees when the ad is rendered. When I followed it through the system, I noted three (and only three) places where the data is used (either manipulated or rendered for display). I found this to be true for all 10 localized versions of the Mobile Ads application. The eureka moment for me was realizing that if I validated the data at those three points, I had little else to do in order to verify any specific localized version of an ad. Add all the languages you want, I'll be ready!

I was able to hook verification modules at each one of these three data inflection points. This basically meant validating data for the new Click-to-Call Ad parameters and locale specific phone number format. I was tracking how code is affecting the data at each stage, which also helps in localizing a bug better than other conventional means...I knew exactly where the failure was! For overcoming the location dependency, I mocked the GPS location parameters of the phone. As soon as I finished with the automation, I ran each ad in our database through each of the language versions verifying the integrity of the data. The only thing that was left was to visually verify rendering of the ads on the three platforms, reducing the manual tests to three (one each for Android, iPhone and Palm Pre).
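The idea of hooking a check at each data inflection point can be sketched in a few lines of JavaScript. This is only an illustration under loud assumptions: the three stage names (fetch, localize, render), the ad fields, and the phone-number patterns are all hypothetical stand-ins, not the real Mobile Ads pipeline. What it shows is the shape of the approach: run a validator after every stage that touches the data, and report the first stage whose output is wrong.

```javascript
// Hypothetical locale-specific phone formats (illustrative only).
const PHONE_FORMATS = {
  "en-US": /^\(\d{3}\) \d{3}-\d{4}$/,
  "de-DE": /^0\d{2,4} \d{4,8}$/,
};

// Hypothetical formatter. Only en-US is handled; other locales are
// deliberately left unformatted so the sketch has a bug to find.
function formatPhone(digits, locale) {
  if (locale === "en-US") {
    return `(${digits.slice(0, 3)}) ${digits.slice(3, 6)}-${digits.slice(6)}`;
  }
  return digits;
}

// The three (assumed) places where the ad data is manipulated or rendered.
const stages = [
  { name: "fetch", run: (ad) => ({ ...ad }) },
  {
    name: "localize",
    run: (ad) => ({ ...ad, phone: formatPhone(ad.rawPhone, ad.locale) }),
  },
  {
    name: "render",
    run: (ad) => ({ ...ad, html: `<a href="tel:${ad.rawPhone}">${ad.phone}</a>` }),
  },
];

// One verification module per stage; each returns a list of problems.
const checks = {
  fetch: (ad) => (/^\d+$/.test(ad.rawPhone) ? [] : ["rawPhone is not all digits"]),
  localize: (ad) => {
    const pattern = PHONE_FORMATS[ad.locale];
    return pattern && pattern.test(ad.phone)
      ? []
      : [`phone "${ad.phone}" does not match the ${ad.locale} format`];
  },
  render: (ad) =>
    ad.html.includes(ad.phone) ? [] : ["rendered HTML lost the phone number"],
};

// Follow one ad through every stage, validating after each one.
function trackAd(ad) {
  let current = ad;
  for (const stage of stages) {
    current = stage.run(current);
    const problems = checks[stage.name](current);
    if (problems.length > 0) {
      return { ok: false, failedAt: stage.name, problems };
    }
  }
  return { ok: true, failedAt: null, problems: [] };
}
```

With these made-up stages, trackAd({ rawPhone: "030123456", locale: "de-DE" }) reports failedAt: "localize", which is the point of the tour: the failure is pinned to the stage that introduced it rather than to the rendered ad as a whole.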

The FedEx tour guided me to build a succinct piece of automation and turned what could have been a huge and error prone manual test into a reusable piece of automation that will find and localize bugs quickly. We're now looking at applying the FedEx tour across ads and in other client and cloud areas in the company. Hopefully there will be more experience reports from others who have found it useful.

Exploratory Testing ... it's not just for manual testers anymore!

STAR West Trip Report

By James A. Whittaker


I am happy to report that attendance is way up at STAR. My back of the envelope calculations put it at several hundred more than STAR East a mere five months ago. A sure sign of economic recovery; I am surprised the stat hasn't made it to Obama's resume yet.

The Expo was my main disappointment. The vendor exhibits are still in atrophy. I realize the days of Mercury and Rational are over and Empirix's $ix figure rotating-parts booth is packed away in someone's garage, but there were only two short rows of sedate booths. (The magician was a nice touch though ... wish I could remember what he was selling.) Where have all the big players gone?

I gave a tutorial with the arrogant title (Lee Copeland's idea, not mine) "James Whittaker: On Testing." It was listed as sold out (STAR capped the audience at 100) but a couple dozen truants clearly snuck in. Apparently there is a bug in their 'sold out' exception handler and I am a poor door warden. The tutorial is a discussion of problems and trends in testing. I gave it at STAR East and it was different again this time. It's half a discussion of what we do wrong in testing and half about how to correct those behaviors. As my understanding of these issues evolves, so does this tutorial. If you attended (only a small handful of the 100+ would admit to reading this blog), feel free to post a comment, I promise not to delete any negative ones.

I had an amicable hallway conversation with James Bach. His blogger angst at my use of the title 'Exploratory Testing' didn't spill over to a face-to-face discussion. Frankly, I am not surprised. I've never claimed the term as my own; I simply took it and made it work in the hands of real testers on real software under real ship pressure. Consultants can coin all the terms they want, but when we practitioners add meat to their pie, why cry foul? Is it not a better reaction to feel happy that there are people actually doing something with the idea?

Yet I still made some jabs at the broader consultant community in my keynote. STAR remains full of vendors and people trying to sell ideas instead of results and good engineering practice. I am committing Google and the projects that I lead here to an openness regarding how we do testing and hope to be joined by others. I'd like to see the real practitioners, those who work at financial companies, data centers, ISVs, online retailers, and so forth, come out in larger numbers ... not just as learners and attendees but also as speakers, panelists and active participants. I'm not saying the consultant community has nothing to say; those guys simply need no encouragement to open their mouths. It's the practitioners whom I want to encourage. It's one thing to think really hard about testing; it's another thing to actually put those thoughts into practice.

The jabs aside, my keynote was aimed at describing the practice of exploratory testing I helped create at Microsoft and am now employing at Google and which is embodied in my new book. But it was my Google cohort Rajat Dewan who stole the show. After I detailed the Landmark Tour and how we applied it to Chrome, I ran out of time to talk about the FedEx tour. The folks at STAR were kind enough to set up an impromptu breakfast presentation for Rajat and he delivered a 20 minute talk to a standing room only crowd (I stopped counting at 150) on how he applied the FedEx tour to Mobile Ads. He showed three bugs the tour helped find and described how he automated the tour itself. (Has anyone coined the term 'automated exploratory testing' yet?)

Perhaps he can steal the show again by blogging about his presentation. Rajat?

Other highlights: apparently the twitter-verse was alight over my comment about god-the-developer. I don't tweet and I avoid twits at all costs so I am not sure if people were offended or found it insightful. Comments from tweeters? Also, I've been invited back for the tutorial at STAR East and also plan on submitting a track talk on How we test Google Chrome. Let the detailed discussion about real testing, warts and all, begin!

TotT: Making a Perfect Matcher

by Zhanyong G. Mock Wan in Google Kirkland

In the previous episode, we showed how Google C++ Mocking Framework matchers can make both your test code and your test output readable. What if you cannot find the right matcher for the task?

Don't settle for anything less than perfect. It's easy to create a matcher that does exactly what you want, either by composing from existing matchers or by writing one from scratch.

The simplest composite matcher is Not(m), which negates matcher m as you may have guessed. We also have AnyOf(m1, ..., mn) for OR-ing and AllOf(m1, ..., mn) for AND-ing. Combine them wisely and you can get a lot done. For example,

EXPECT_THAT(new_code, AnyOf(StartsWith("// Tests"),
                            Not(ContainsRegex("TODO.*intern"))));

could generate a message like:

Expected: (starts with "// Tests") or
          (doesn't contain regular expression "TODO.*intern")
  Actual: "/* TODO: hire an intern. */ int main() {}"
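To see why composing matchers gives you a readable message for free, here is a small language-neutral sketch in JavaScript. It is not how Google C++ Mocking Framework is implemented, and every name in it (matcher, startsWith, containsRegex, not, anyOf, allOf, expectThat) is a hypothetical helper invented for this illustration. The point is only that a matcher carries a description alongside its predicate, so a composite can build its description from those of its parts.

```javascript
// A matcher pairs a predicate with a human-readable description.
function matcher(description, matches) {
  return { description, matches };
}

// Two leaf matchers (hypothetical helpers for this sketch).
const startsWith = (prefix) =>
  matcher(`starts with "${prefix}"`, (s) => s.startsWith(prefix));
const containsRegex = (pattern) =>
  matcher(`contains regular expression "${pattern}"`, (s) =>
    new RegExp(pattern).test(s));

// Composite matchers combine both the predicates and the descriptions.
const not = (m) => matcher(`not (${m.description})`, (v) => !m.matches(v));
const anyOf = (...ms) =>
  matcher(
    ms.map((m) => `(${m.description})`).join(" or "),
    (v) => ms.some((m) => m.matches(v))
  );
const allOf = (...ms) =>
  matcher(
    ms.map((m) => `(${m.description})`).join(" and "),
    (v) => ms.every((m) => m.matches(v))
  );

// Returns null on success, or a failure message built from the description.
function expectThat(value, m) {
  return m.matches(value)
    ? null
    : `Expected: ${m.description}\n  Actual: ${JSON.stringify(value)}`;
}
```

Calling expectThat with the code string from the example above and anyOf(startsWith("// Tests"), not(containsRegex("TODO.*intern"))) yields a message whose first line is assembled entirely from the parts' descriptions; no extra message-writing was needed.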

If the matcher expression gets too complex, or your matcher logic cannot be expressed in terms of existing matchers, you can use plain C++. The MATCHER macro lets you define a named matcher:

MATCHER(IsEven, "") { return (arg % 2) == 0; }

allows you to write EXPECT_THAT(paren_num, IsEven()) to verify that paren_num is divisible by two. The special variable arg refers to the value being validated (paren_num in this case) – it is not a global variable.

You can put any code between {} to validate arg, as long as it returns a bool value.

The empty string "" tells Google C++ Mocking Framework to automatically generate the matcher's description from its name (therefore you'll see "Expected: is even" when the match fails). As long as you pick a descriptive name, you get a good description for free.

You can also give multiple parameters to a matcher, or customize its description. The code:

// P2 means the matcher has 2 parameters. Their names are low and high.
MATCHER_P2(InClosedRange, low, high, "is in range [%(low)s, %(high)s]") {
  return low <= arg && arg <= high;
}
...
EXPECT_THAT(my_age, InClosedRange(adult_min, penalty_to_withdraw_401k));

may print:

Expected: is in range [18, 60]
  Actual: 2

(No, that's not my real age.) Note how you can use Python-style interpolation in the description string to print the matcher parameters.

You may wonder why we haven't seen any types in the examples. Rest assured that all the code we showed you is type-safe. Google C++ Mocking Framework uses compiler type inference to “write” the matcher parameter types for you, so that you can spend the time on actually writing tests – or finding your perfect match.

Cost of Testing

By Miško Hevery

A lot of people have been asking me lately what the cost of testing is, so I decided to try to measure it, to dispel the myth that testing takes twice as long.

For the last two weeks I have been keeping track of the amount of time I spent writing tests versus the time writing production code. The number surprised even me, but after I thought about it, it makes a lot of sense. The magic number is about 10% of time spent on writing tests. Now, before you think I am nuts, let me back it up with some real numbers from a personal project I have been working on.

                  Total    Production   Test     Ratio
Commits           1,347    1,347        1,347
LOC               14,709   8,711        5,998    40.78%
JavaScript LOC    10,077   6,819        3,258    32.33%
Ruby LOC          4,632    1,892        2,740    59.15%
Lines/Commit      10.92    6.47         4.45     40.78%
Hours (estimate)  1,200    1,080        120      10.00%
Hours/Commit      0.89     0.80         0.09
Mins/Commit       53       48           5

Commits refers to the number of commits I have made to the repository. LOC is lines of code, which is broken down by language. The ratio shows the typical breakdown between the production and test code when you test drive, and it is about half, give or take a language. It is interesting to note that on average I commit about 11 lines, out of which 6.5 are production and 4.5 are test. Now, keep in mind this is an average: a lot of commits are large where you add a lot of code, but then there are a lot of commits where you are tweaking stuff, so the average is quite low.

The number of hours spent on the project is my best estimate, as I have not kept track of these numbers. Also, the 10% breakdown comes from keeping track of my coding habits for the last two weeks of coding. But, these are my best guesses.
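The derived figures can be reproduced from the raw ones with a few lines of arithmetic. The sketch below uses only numbers from the table (the per-language line counts, the commit count, the 1,200-hour estimate, and the 10% share); the variable and function names are mine, chosen for illustration.

```javascript
// Raw figures taken from the table.
const commits = 1347;
const loc = {
  javascript: { production: 6819, test: 3258 },
  ruby: { production: 1892, test: 2740 },
};
const totalHours = 1200; // the author's estimate
const testTimePercent = 10; // from two weeks of tracking

// Share of lines that are test code.
function testRatio({ production, test }) {
  return test / (production + test);
}

// Sum the per-language rows to get the overall LOC row.
const totals = Object.values(loc).reduce(
  (acc, l) => ({
    production: acc.production + l.production,
    test: acc.test + l.test,
  }),
  { production: 0, test: 0 }
);

// Split the estimated hours using the measured 10% share.
const testHours = (totalHours * testTimePercent) / 100;
const productionHours = totalHours - testHours;

// Average minutes per commit, rounded to the nearest minute.
const minutesPerCommit = (hours) => Math.round((hours * 60) / commits);

// Format a fraction as a percentage with two decimals.
const percent = (x) => (x * 100).toFixed(2) + "%";

console.log(percent(testRatio(totals))); // 40.78%
console.log(minutesPerCommit(productionHours), minutesPerCommit(testHours)); // 48 5
```

The contrast the post builds on falls straight out of these numbers: tests are roughly 41% of the lines but only 10% of the hours, about 5 minutes of an average 53-minute commit.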

Now when I test drive, I start with writing a test, which usually takes me a few minutes (about 5 minutes) to write. The test represents my scenario. I then start implementing the code to make the scenario pass, and the implementation usually takes me a lot longer (about 50 minutes). The ratio is highly asymmetrical! Why does it take me so much less time to write the scenario than it does to write the implementation, given that they are about the same length? Well, look at a typical test and implementation:

Here is a typical test for a feature:
ArrayTest.prototype.testFilter = function() {
  var items = ["MIsKO", {name:"john"}, ["mary"], 1234];
  assertEquals(4, items.filter("").length);
  assertEquals(4, items.filter(undefined).length);

  assertEquals(1, items.filter('iSk').length);
  assertEquals("MIsKO", items.filter('isk')[0]);

  assertEquals(1, items.filter('ohn').length);
  assertEquals(items[1], items.filter('ohn')[0]);

  assertEquals(1, items.filter('ar').length);
  assertEquals(items[2], items.filter('ar')[0]);

  assertEquals(1, items.filter('34').length);
  assertEquals(1234, items.filter('34')[0]);

  assertEquals(0, items.filter("I don't exist").length);
};

ArrayTest.prototype.testShouldNotFilterOnSystemData = function() {
  assertEquals("", "".charAt(0)); // assumption
  var items = [{$name:"misko"}];
  assertEquals(0, items.filter("misko").length);
};

ArrayTest.prototype.testFilterOnSpecificProperty = function() {
  var items = [{ignore:"a", name:"a"}, {ignore:"a", name:"abc"}];
  assertEquals(2, items.filter({}).length);

  assertEquals(2, items.filter({name:'a'}).length);

  assertEquals(1, items.filter({name:'b'}).length);
  assertEquals("abc", items.filter({name:'b'})[0].name);
};

ArrayTest.prototype.testFilterOnFunction = function() {
  var items = [{name:"a"}, {name:"abc", done:true}];
  assertEquals(1, items.filter(function(i){return i.done;}).length);
};

ArrayTest.prototype.testFilterIsAndFunction = function() {
  var items = [{first:"misko", last:"hevery"},
               {first:"mike", last:"smith"}];

  assertEquals(2, items.filter({first:'', last:''}).length);
  assertEquals(1, items.filter({first:'', last:'hevery'}).length);
  assertEquals(0, items.filter({first:'mike', last:'hevery'}).length);
  assertEquals(1, items.filter({first:'misko', last:'hevery'}).length);
  assertEquals(items[0], items.filter({first:'misko', last:'hevery'})[0]);
};

ArrayTest.prototype.testFilterNot = function() {
  var items = ["misko", "mike"];

  assertEquals(1, items.filter('!isk').length);
  assertEquals(items[1], items.filter('!isk')[0]);
};

Now here is the code which implements the scenario tests above:
Array.prototype.filter = function(expression) {
  // A value is kept only if every collected predicate accepts it.
  var predicates = [];
  predicates.check = function(value) {
    for (var j = 0; j < predicates.length; j++) {
      if (!predicates[j](value)) {
        return false;
      }
    }
    return true;
  };
  var getter = Scope.getter;
  // Recursive substring search; a leading '!' negates the match.
  var search = function(obj, text) {
    if (text.charAt(0) === '!') {
      return !search(obj, text.substr(1));
    }
    switch (typeof obj) {
      case "boolean":
      case "number":
      case "string":
        return ('' + obj).toLowerCase().indexOf(text) > -1;
      case "object":
        // Keys starting with '$' are system data and are skipped.
        for (var objKey in obj) {
          if (objKey.charAt(0) !== '$' && search(obj[objKey], text)) {
            return true;
          }
        }
        return false;
      case "array":
        // Never reached: typeof reports arrays as "object", so the
        // branch above handles them.
        for (var i = 0; i < obj.length; i++) {
          if (search(obj[i], text)) {
            return true;
          }
        }
        return false;
      default:
        return false;
    }
  };
  switch (typeof expression) {
    case "boolean":
    case "number":
    case "string":
      expression = {$:expression};
      // Intentional fall through: a primitive searches all properties.
    case "object":
      for (var key in expression) {
        if (key == '$') {
          (function(){
            var text = (''+expression[key]).toLowerCase();
            if (!text) return;
            predicates.push(function(value) {
              return search(value, text);
            });
          })();
        } else {
          (function(){
            var path = key;
            var text = (''+expression[key]).toLowerCase();
            if (!text) return;
            predicates.push(function(value) {
              return search(getter(value, path), text);
            });
          })();
        }
      }
      break;
    case "function":
      predicates.push(expression);
      break;
    default:
      return this;
  }
  var filtered = [];
  for (var j = 0; j < this.length; j++) {
    var value = this[j];
    if (predicates.check(value)) {
      filtered.push(value);
    }
  }
  return filtered;
};

Now, I think that if you look at these two chunks of code, it is easy to see that even though they are about the same length, one is much harder to write. The reason tests take so little time to write is that they are linear in nature: no loops, ifs, or interdependencies with other tests. Production code is a different story. I have to create complex ifs and loops, and I have to make sure that the implementation works not just for one test, but for all tests. This is why it takes so much longer to write production code than test code. In this particular case, I remember rewriting this function three times before I got it to work as expected. :-)

So a naive answer is that writing tests carries a 10% tax. But we pay taxes in order to get something in return. Here is what I get for that 10%, and how it pays me back:

  • When I implement a feature I don't have to start up the whole application and click through several pages until I get to the page where I can verify that the feature works. In this case it means that I don't have to refresh the browser, wait for it to load a dataset, type some test data, and manually assert that I got what I expected. This is immediate payback in time saved!

  • Regression is almost nil. Whenever you are adding a new feature you are running the risk of breaking something other than what you are working on (since you are not working on it, you are not actively testing it). At least once a day I have a what the @#$% moment when a change suddenly breaks a test at the opposite end of the codebase which I did not expect, and I count my lucky stars. This saves a lot of the time you would otherwise spend when you discover that a feature you thought was working no longer is, and by then you have forgotten how the feature is implemented.

  • Cognitive load is greatly reduced since I don't have to keep all of the assumptions about the software in my head. This makes it really easy to switch tasks or to come back to a task after a meeting, a good night's sleep, or a weekend.

  • I can refactor the code at will, keeping it from becoming stagnant, and hard to understand. This is a huge problem on large projects, where the code works, but it is really ugly and everyone is afraid to touch it. This is worth money tomorrow to keep you going.


These benefits translate to real value today as well as tomorrow. I write tests because the additional benefits I get more than offset the additional cost of 10%. Even if I don't include the long term benefits, the value I get from tests today is well worth it. I am faster in developing code with tests. How much? Well, that depends on the complexity of the code. The more complex the thing you are trying to build is (more ifs/loops/dependencies), the greater the benefit of tests is.

So now you understand my puzzled look when people ask me how much slower/costlier the development with tests is.
