The WebRTC project is all about enabling peer-to-peer video, voice and data transfer in the browser. To give our users the best possible experience we need to adapt the quality of the media to the bandwidth and processing power we have available. Our users encounter a wide variety of network conditions and run on a variety of devices, from powerful desktop machines with a wired broadband connection to laptops on WiFi to mobile phones on spotty 3G networks.
We want to ensure good quality for all these use cases in our implementation in Chrome. To some extent we can do this with manual testing, but the breakneck pace of Chrome development makes it very hard to keep up (several hundred patches land every day)! Therefore, we'd like to test the quality of our video and voice transfer with an automated test. Ideally, we’d like to test for the most common network scenarios our users encounter, but to start we chose to implement a test where we have plenty of CPU and bandwidth. This article covers how we built such a test.
First, we must define what we want to measure. For instance, the WebRTC video quality test uses peak signal-to-noise ratio and structural similarity to measure the quality of the video (or to be more precise, how much the output video differs from the input video; see this GTAC 13 talk for more details). The quality of the user experience is a subjective thing though. Arguably, one probably needs dozens of different metrics to really ensure a good user experience. For video, we would have to (at the very least) have some measure for frame rate and resolution besides correctness. To have the system send somewhat correct video frames seemed the most important though, which is why we chose the above metrics.
For this test we wanted to start with a similar correctness metric, but for audio. It turns out there's an algorithm called Perceptual Evaluation of Speech Quality (PESQ) which analyzes two audio files and tell you how similar they are, while taking into account how the human ear works (so it ignores differences a normal person would not hear anyway). That's great, since we want our metrics to measure the user experience as much as possible. There are many aspects of voice transfer you could measure, such as latency (which is really important for voice calls), but for now we'll focus on measuring how much a voice audio stream gets distorted by the transfer.
Feeding Audio Into WebRTC
In the WebRTC case we already had a test which would launch a Chrome browser, open two tabs, get the tabs talking to each other through a signaling server and set up a call on a single machine. Then we just needed to figure out how to feed a reference audio file into a WebRTC call and record what comes out on the other end. This part was actually harder than it sounds. The main WebRTC use case is that the web page acquires the user's mic through getUserMedia, sets up a PeerConnection with some remote peer and sends the audio from the mic through the connection to the peer where it is played in the peer's audio output device.
WebRTC calls transmit voice, video and data peer-to-peer, over the Internet.
But since this is an automated test, of course we could not have someone speak in a microphone every time the test runs; we had to feed in a known input file, so we had something to compare the recorded output audio against.
Could we duct-tape a small stereo to the mic and play our audio file on the stereo? That's not very maintainable or reliable, not to mention annoying for anyone in the vicinity. What about some kind of fake device driver which makes a microphone-like device appear on the device level? The problem with that is that it's hard to control a driver from the userspace test program. Also, the test will be more complex and flaky, and the driver interaction will not be portable.
Instead, we chose to sidestep this problem. We used a solution where we load an audio file with WebAudio and play that straight into the peer connection through the WebAudio-PeerConnection integration. That way we start the playing of the file from the same renderer process as the call itself, which made it a lot easier to time the start and end of the file. We still needed to be careful to avoid playing the file too early or too late, so we don't clip the audio at the start or end - that would destroy our PESQ scores! - but it turned out to be a workable approach.
Recording the Output
Alright, so now we could get a WebRTC call set up with a known audio file with decent control of when the file starts playing. Now we had to record the output. There are a number of possible solutions. The most end-to-end way is to straight up record what the system sends to default audio out (like speakers or headphones). Alternatively, we could write a hook in our application to dump our audio as late as possible, like when we're just about to send it to the sound card.
We went with the former. Our colleagues in the Chrome video stack team in Kirkland had already found that it's possible to configure a Windows or Linux machine to send the system's audio output (i.e. what plays on the speakers) to a virtual recording device. If we make that virtual recording device the default one, simply invoking SoundRecorder.exe and arecord respectively will record what the system is playing out.
They found this works well if one also uses the sox utility to eliminate silence around the actual audio content (recall we had some safety margins at both ends to ensure we record the whole input file as playing through the WebRTC call). We adopted the same approach, since it records what the user would hear, and yet uses only standard tools. This means we don't have to install additional software on the myriad machines that will run this test.
The only remaining step was to compare the silence-eliminated recording with the input file. When we first did this, we got a really bad score (like 2.0 out of 5.0, which means PESQ thinks it’s barely legible). This didn't seem to make sense, since both the input and recording sounded very similar. Turns out we didn’t think about the following:
- We were comparing a full-band (24 kHz) input file to a wide-band (8 kHz) result (although both files were sampled at 48 kHz). This essentially amounted to a low pass filtering of the result file.
- Both files were in stereo, but PESQ is only mono-aware.
- The files were 32-bit, but the PESQ implementation is designed for 16 bits.
As you can see, it’s important to pay attention to what format arecord and SoundRecorder.exe records in, and make sure the input file is recorded in the same way. After correcting the input file and “rebasing”, we got the score up to about 4.0.
Thus, we ended up with an automated test that runs continously on the torrent of Chrome change lists and protects WebRTC's ability to transmit sound. You can see the finished code here. With automated tests and cleverly chosen metrics you can protect against most regressions a user would notice. If your product includes video and audio handling, such a test is a great addition to your testing mix.
How the components of the test fit together.
- It might be possible to write a Chrome extension which dumps the audio from Chrome to a file. That way we get a simpler-to-maintain and portable solution. It would be less end-to-end but more than worth it due to the simplified maintenance and setup. Also, the recording tools we use are not perfect and add some distortion, which makes the score less accurate.
- There are other algorithms than PESQ to consider - for instance, POLQA is the successor to PESQ and is better at analyzing high-bandwidth audio signals.
- We are working on a solution which will run this test under simulated network conditions. Simulated networks combined with this test is a really powerful way to test our behavior under various packet loss and delay scenarios and ensure we deliver a good experience to all our users, not just those with great broadband connections. Stay tuned for future articles on that topic!
- Investigate feasibility of running this set-up on mobile devices.
1 It would be tolerable if the driver was just looping the input file, eliminating the need for the test to control the driver (i.e. the test doesn't have to tell the driver to start playing the file). This is actually what we do in the video quality test. It's a much better fit to take this approach on the video side since each recorded video frame is independent of the others. We can easily embed barcodes into each frame and evaluate them independently.
This seems much harder for audio. We could possibly do audio watermarking, or we could embed a kind of start marker (for instance, using DTMF tones) in the first two seconds of the input file and play the real content after that, and then do some fancy audio processing on the receiving end to figure out the start and end of the input audio. We chose not to pursue this approach due to its complexity.
2 Unfortunately, this also means we will not test the capturer path (which handles microphones, etc in WebRTC). This is an example of the frequent tradeoffs one has to do when designing an end-to-end test. Often we have to trade end-to-endness (how close the test is to the user experience) with robustness and simplicity of a test. It's not worth it to cover 5% more of the code if the test become unreliable or radically more expensive to maintain. Another example: A WebRTC call will generally involve two peers on different devices separated by the real-world internet. Writing such a test and making it reliable would be extremely difficult, so we make the test single-machine and hope we catch most of the bugs anyway.
3 It's important to keep the continuous build setup simple and the build machines easy to configure - otherwise you will inevitably pay a heavy price in maintenance when you try to scale your testing up.
4 When sending audio over the internet, we have to compress it since lossless audio consumes way too much bandwidth. WebRTC audio generally sounds great, but there's still compression artifacts if you listen closely (and, in fact, the recording tools are not perfect and add some distorsion as well). Given that this test is more about detecting regressions than measuring some absolute notion of quality, we'd like to downplay those artifacts. As our Kirkland colleagues found, one of the ways to do that is to "rebase" the input file. That means we start with a pristine recording, feed that through the WebRTC call and record what comes out on the other end. After manually verifying the quality, we use that as our input file for the actual test. In our case, it pushed our PESQ score up from 3 to about 4 (out of 5), which gives us a bit more sensitivity to regressions.