Performance Testing of Distributed File Systems at Google

Posted by Rajat Jain and Marc Kaplan, Infrastructure Test Engineering

Google is unique in that we develop most of our software infrastructure from scratch inside the company. Distributed filesystems are no exception, and we have several here at Google that all serve different purposes. One such filesystem is the Google File System (GFS) which is used to store almost all data at Google. Although, GFS is the ultimate endpoint for much of the data at Google, there are many other distributed file systems built on top of GFS for a variety of purposes (see Bigtable, for example -- but several others also exist) with developers constantly trying to improve performance to meet the ever-increasing demands of serving data at Google. The challenge to the teams testing performance of these filesystems is that running performance tests, analyzing the results, and repeating over and over is very time consuming. Also, since each filesystem is different, we have traditionally had different performance testing tools for the different filesystems, which made it difficult to compare performance between the filesystems, and led to a lot of unnecessary maintenance work on the tools.

In order to streamline testing of these filesystems, we wanted to create a new framework that is capable of easily performance testing the filesystems at Google. The goals of this system were as follows:
  • Generic: The testing framework should be generic enough to test any type of file-system inside Google. In having a generic framework, it will be easier to compare the performance of different filesystems across many operations.
  • Ease of use: The framework should be easy enough to use so that software developers can design and run their own tests without any help from the test team.
  • Scalable: Testing can be done at various scales depending on the scalability of the FS. The framework can issue any number of operations simultaneously. So, for a testing a Linux file system, we might only issue 1000 parallel requests, while for the Google File System, we might want to issue requests at a much larger scale.
  • Extensible:
    • Firstly, it should be easy to add a new kind of operation in the framework, if its developed in future (For example, RecordAppend operation in GFS).
    • Also, the framework should allow the user to easily generate complex types of loadscenarios on the server. For example, we might want to have a scenario in which we issue File Create operations simultaneously with Read, Write, and Delete operations. Thus, we want a good mix of operations but not in a randomized way, so that we can have benchmark results.
  • Unified testing: The framework should be stand-alone or independent ie it should be a one-stop solution to setup, run the tests and monitor the results.

We developed a framework which allows us to achieve all the above mentioned goals. We used the Google's generic File API for writing the framework, since every file system can be tested just by changing the file namespace in which the testing data will be generated (e.x. /gfs vs. /bigtable). Following Google's standard, we developed a Driver + Worker system. The Driver co-ordinates the overall test, by reading configuration files to set up the test, automatically launching different number of workers depending on the load, monitoring the health of workers, collecting performance data from each worker and calculating the overall performance. TheWorker class is the one which loads the file systems with appropriate operations. A worker is an abstract class and a new child class can be created for each file operation, which gives us the flexibility to add any operation we want in the future. A separate Worker instance is launched on a different machine depending on the load that we want to generate. It is simple to run more or less workers on remote machines simply by changing the config file.

The test is divided into various phases. In a phase, we can run a single operation N number of times (with a given concurrency) and collect performance data. So, we can run a create phase followed by a write phase followed by a read phase. We can also have multiple sub-phases inside a phase, which gives us the ability to generate many different simultaneous operations on the system. For example, in a phase, we might add three subphases create, write and delete, which will issue all the different kinds of operations simultaneously on remote client machines against the distributed filesystem.

It is instructive to look at an example config file for an idea of how the load is specified against this filesystem:

# Example of create
phase {
label: "create"
shards: 200
create {
file_prefix: "metadata_perf"
}
count: 100
}

So in the example above, we launch 200 shards (which all run of different client machines) that all do creates of files with a prefix of metadata_perf, and suffixes based upon the index of the worker shard. In practice, the user of the performance test passes a flag into the performance test binary that specifies a base path to use: i.e. /gfs/cell1/perftest_path, and the resulting files will be /gfs/cell1/perftest_path/worker.i/metadata_perf.j, for i=1 until i=#shards, and j=1, until j=count.

# Example of using subphases
phase {
label: "subphase_test"
subphase {
shards: 200
label: "stat_subphase"
stat {
file_prefix: "metadata_perf"
}
count: 100
}
subphase {
shards: 200
label: "open_subphase"
open {
file_prefix: "metadata_perf"
}
count: 100
}
}

In the example above, we simultaneously do stats and opens of the files that were initially created in the create phase. Different workers execute these, and then report their results to the driver.

On conclusion of the test, the driver prints a performance test results report that details the aggregate results of all of the clients, in terms of MB/s for data intensive ops, ops/s for metadata intensive ops, and latency measures of central tendency and dispersion.

In conclusion, Google's generic File API, use of Driver & Workers and the concept of phaseshave been very useful in the development of the performance testing framework and hence making performance testing easier. Almost as important, the fact that this is a simple script-driven method of testing complex distributed filesystems has resulted in an ease of use that has given both developers and testers, the ability to quickly experiment and iterate, resulting in faster code development and better performance overall.

Permalink | Links to this post | 4 comments

TotT: The Invisible Branch

Full statement coverage may be necessary for good testing coverage, but it isn't sufficient. Two places where statement coverage will be inadequate are branches and loops. In this episode, we'll look at branches, and specifically the differences between statement coverage and branch coverage.


Let's consider a case where branch coverage and statement coverage aren't the same. Suppose we test the following snippet. We can get complete statement coverage with a single test by using a berserk EvilOverLord:


bool DeathRay::ShouldFire(EvilOverLord& o, Target& t) {
  double accumulated_rage = 0.0;

  if (o.IsBerserk())
    accumulated_rage += kEvilOverlordBerserkRage;

  accumulated_rage += o.RageFeltTowards(t);
  return (accumulated_rage > kDeathRayRageThreshold);
}


But what if DeathRay should fire at this Target even with a non-berserk Overlord? Well, we need another test for that. What should the test be? Let's rewrite the code a little bit. We would never see code like this in the real world, but it'll help us clarify an important point.


bool DeathRay::ShouldFire(EvilOverLord& o, Target& t) {
  double accumulated_rage = 0.0;

  if (o.IsBerserk()) {
    accumulated_rage += kEvilOverlordBerserkRage;
  } else {
  }

  accumulated_rage += o.RageFeltTowards(t);
  return (accumulated_rage > kDeathRayRageThreshold);
}


Why do we add an else clause if it doesn't actually do anything? If you were to draw a flowchart of both snippets (left as an exercise – and we recommend against using the paper provided), the flowcharts would be identical. The fact that the else isn't there in the first snippet is simply a convenience for us as coders – we generally don't want to write code to do nothing special – but the branch still exists... put another way, every if has an else. Some of them just happen to be invisible.


When you're testing, then, it isn't enough to cover all the statements – you should cover all the the edges in the control flow graph – which can be even more complicated with loops and nested ifs. In fact, part of the art of large-scale white-box testing is finding the minimum number of tests to cover the maximum number of paths. So the lesson here is, just because you can't see a branch doesn't mean it isn't there – or that you shouldn't test it.


Remember to download this episode of Testing on the Toilet and post it in your office.

Permalink | Links to this post | 3 comments

Intro to Ad Quality Test Challenges

Permalink | Links to this post | 6 comments

Exploratory Testing on Chat


Testing Google Talk is challenging -- we have multiple client implementations, between the Google Talk client, the Google Talk Gadget, and Gmail chat, while also managing new features and development. We rely heavily on automation. Yet there's still a need to do manual testing before the release of the product to the public.

We've found that one of the best ways to unearth interesting bugs in the product is to use Exploratory Testing (http://www.satisfice.com/articles/et-article.pdf) The trouble with ET is that while there appears to be a genetic disposition to be naturally good at exploring the product effectively, it's very easy to miss great swathes of the product when one follows their intuition through the product rather than focusing on coverage metrics. And speaking of coverage, how do we measure how well a team is doing finding bugs and getting coverage over the functional use cases for the product? All of the things that we rely on to measure the quality of the product, boundary and edge cases being covered? Plus, if not everyone is proficient at ET, how do we solve the overhead of having an experienced team member looking over people's shoulders to make sure they are executing well?

To do this, we start with the definition of a Test Strategy. This is where we outline the approach we are taking to the testing of the product as a whole. It's not super-detailed -- instead it mentions the overarching areas that need to be tested, whether automation can be used to test the area, and what role manual testing needs to play. This information lets developers and PMs know what we think we need to test for the product, and allows them to add unit tests etc to cover more ground.

Some basic test case definition go into the Test Plan. The aim of the test plan (and any test artifacts generated) is not to specify a set of actions to be followed in a rote manner, but instead a rough guide that encourages creative exploration. The test plan also acts as the virtual test expert, providing some framework under which exploratory testing can be executed effectively by the team. The plan decomposes the application into different areas of responsibility, that are doled out to members of the team in sessions that are one-working-day or less duration. By guiding people's thinking, we can cover the basics, fuzzy cases, and avoids a free-for-all, duplication, and missed areas.

Finally we get a status report from the testers every day, that describes the testing that was performed that day, any bugs raised, and blocking issues identified. The reports acts as an execution of the "contract" and gives traceability, and the ability to tweak exploratory testing that has gone off track from where we've determined we need to concentrate efforts. We can use these status reports along with bug statistics to gauge the effectiveness of the test sessions.

This is approach is fairly simple, but sometimes simple works best. Using this method has allowed us to make the best use of test engineers and maximized the effectiveness of each test pass. It's proven itself to be a fruitful approach and balances the need for reporting and accountability with the agility of exploratory testing.

Permalink | Links to this post | 7 comments

TotT: Using Dependancy Injection to Avoid Singletons

It's hard to test code that uses singletons. Typically, the code you want to test is coupled strongly with the singleton instance. You can't control the creation of the singleton object because often it is created in a static initializer or static method. As a result, you also can't mock out the behavior of that Singleton instance.

If changing the implementation of a singleton class is not an option, but changing the client of a singleton is, a simple refactoring can make it easier to test. Let's say you had a method that uses a Server as a singleton instance:

public class Client {
  public int process(Params params) {
    Server server = Server.getInstance();
    Data data = server.retrieveData(params);
    ...
  }
}


You can refactor Client to use Dependency Injection and avoid its use of the singleton pattern altogether. You have not lost any functionality, and have also not lost the requirement that only a singleton instance of Server must exist. The only difference is that instead of getting the Server instance from the static getInstance method, Client receives it in its constructor. You have made the class easier to test!

public class Client {
  private final Server server;

  public Client(Server server) {
    this.server = server;
  }

  public int process(Params params){
    Data data = this.server.retrieveData(params);
    ...
  }
}


When testing, you can create a mock Server with whatever expected behavior you need and pass it into the Client under test:

public void testProcess() {
  Server mockServer = createMock(Server.class);
  Client c = new Client(mockServer);
  assertEquals(5, c.process(params));
}


Remember to download this episode of Testing on the Toilet and post it in your office.

Permalink | Links to this post | 5 comments

TotT: Testable Contracts Make Exceptional Neighbors

Consider the following function, which modifies a client-supplied object:

bool SomeCollection::GetObjects(vector* objects) const {
  objects->clear();
  typedef vector::const_iterator Iterator;
  for (Iterator i = collection_.begin();
        i != collection_.end();
        ++i) {
    if ((*i)->IsFubarred()) return false;
    objects->push_back(*i);
  }
  return true;
}



Consider when GetObjects() is called. What if the caller doesn't check the return value, and assumes the data is in a valid state when it actually is not? If the caller does check the return value, what can it assume about the state of its objects in the failure case? When GetObjects() fails, it would be much better if either all the objects were collected or none of them. This can help avoid introducing hard to find bugs.

By using good design contracts and a solid implementation, it is reasonably easy to make functions like GetObjects() behave like transactions. By following Sutter's rule of modifying externally-visible state only after completing all operations which could possibly fail [1], and mixing in Meyers's “swap trick” [2], we move from the realm of undefined behavior to what Abrahams defines as the strong guarantee [3]:

bool SomeCollection::GetObjects(vector* objects) const {
  vector known_good_objects;
  typedef vector::const_iterator Iterator;
  for (Iterator i = collection_.begin();
        i != collection_.end();
        ++i) {
    if ((*i)->IsFubarred()) return false;
    known_good_objects->push_back(*i);
  }
  objects->swap(known_good_objects);
  return true;
}



At the cost of one temporary and a pointer swap, we've strengthened the contract of our interface such that, at best, the caller received a complete, new collection of valid objects; at worst, the state of the caller's objects remains unchanged. The caller might not verify the return value, but will not suffer from undefined results. This allows us to reason much more clearly about the program state, making it much easier to verify the intended outcome with automated tests as well as recreate, pinpoint, and banish bugs with regression tests.

  1. http://www.gotw.ca/publications
  2. Scott Meyers, Effective C++
  3. http://www.boost.org/more/generic_exception_safety.html

Remember to download this episode of Testing on the Toilet and post it in your office.

Permalink | Links to this post | 0 comments