Friday, November 08, 2013

WebRTC Audio Quality Testing

by Patrik Höglund

The WebRTC project is all about enabling peer-to-peer video, voice and data transfer in the browser. To give our users the best possible experience we need to adapt the quality of the media to the bandwidth and processing power we have available. Our users encounter a wide variety of network conditions and run on a variety of devices, from powerful desktop machines with a wired broadband connection to laptops on WiFi to mobile phones on spotty 3G networks.

We want to ensure good quality for all these use cases in our implementation in Chrome. To some extent we can do this with manual testing, but the breakneck pace of Chrome development makes it very hard to keep up (several hundred patches land every day)! Therefore, we'd like to test the quality of our video and voice transfer with an automated test. Ideally, we’d like to test for the most common network scenarios our users encounter, but to start we chose to implement a test where we have plenty of CPU and bandwidth. This article covers how we built such a test.

Quality Metrics
First, we must define what we want to measure. For instance, the WebRTC video quality test uses peak signal-to-noise ratio and structural similarity to measure the quality of the video (or, to be more precise, how much the output video differs from the input video; see this GTAC 2013 talk for more details). The quality of the user experience is a subjective thing though. Arguably, one probably needs dozens of different metrics to really ensure a good user experience. For video, we would need (at the very least) some measure of frame rate and resolution besides correctness. Having the system send somewhat correct video frames seemed the most important though, which is why we chose the above metrics.
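To make the video metric concrete, here is a simplified sketch of how peak signal-to-noise ratio falls out of the mean squared error between a frame and its reference (illustrative only, not the actual test code):

```typescript
// Simplified sketch of per-frame PSNR (not the actual WebRTC test code).
// Frames are assumed to be flat arrays of 8-bit luma values; a higher
// PSNR means the output frame is closer to the reference frame.
function psnr(reference: number[], output: number[]): number {
  if (reference.length !== output.length) {
    throw new Error("frames must have the same dimensions");
  }
  let squaredError = 0;
  for (let i = 0; i < reference.length; i++) {
    const diff = reference[i] - output[i];
    squaredError += diff * diff;
  }
  const mse = squaredError / reference.length;
  if (mse === 0) return Infinity; // identical frames
  const maxPixel = 255; // peak value for 8-bit samples
  return 10 * Math.log10((maxPixel * maxPixel) / mse);
}
```

The real test also has to find and match up frames first (hence the barcodes mentioned in footnote 1); the formula itself is the easy part.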

For this test we wanted to start with a similar correctness metric, but for audio. It turns out there's an algorithm called Perceptual Evaluation of Speech Quality (PESQ) which analyzes two audio files and tells you how similar they are, while taking into account how the human ear works (so it ignores differences a normal person would not hear anyway). That's great, since we want our metrics to measure the user experience as much as possible. There are many aspects of voice transfer you could measure, such as latency (which is really important for voice calls), but for now we'll focus on measuring how much a voice audio stream gets distorted by the transfer.
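The ITU-T reference implementation of PESQ is a command-line tool that prints its verdict on standard output, so a harness just needs to run it and scrape the score. Here's a sketch of the scraping side; the output pattern below is an assumption about the reference tool and may differ between versions:

```typescript
// Sketch: extract the MOS score from the output of the ITU-T PESQ
// reference tool. The exact output line format varies between tool
// versions, so treat the pattern here as an assumption, not a spec.
// Assumed example line: "P.862 Prediction (Raw MOS, MOS-LQO):  = 4.124 4.549"
function parsePesqScore(pesqOutput: string): number | null {
  const match = pesqOutput.match(/Prediction[^=]*=\s*([\d.]+)\s+([\d.]+)/);
  // Return the second number (MOS-LQO, the mapped listening-quality
  // score) or null if the tool printed something unexpected.
  return match ? parseFloat(match[2]) : null;
}
```

In a harness, the tool would be invoked with the reference file and the trimmed recording as arguments, and the parsed score compared against a regression threshold.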

Feeding Audio Into WebRTC
In the WebRTC case we already had a test which would launch a Chrome browser, open two tabs, get the tabs talking to each other through a signaling server and set up a call on a single machine. Then we just needed to figure out how to feed a reference audio file into a WebRTC call and record what comes out on the other end. This part was actually harder than it sounds. The main WebRTC use case is that the web page acquires the user's mic through getUserMedia, sets up a PeerConnection with some remote peer and sends the audio from the mic through the connection to the peer where it is played in the peer's audio output device.



WebRTC calls transmit voice, video and data peer-to-peer, over the Internet.

But since this is an automated test, of course we could not have someone speak in a microphone every time the test runs; we had to feed in a known input file, so we had something to compare the recorded output audio against.

Could we duct-tape a small stereo to the mic and play our audio file on the stereo? That's not very maintainable or reliable, not to mention annoying for anyone in the vicinity. What about some kind of fake device driver which makes a microphone-like device appear on the device level? The problem with that is that it's hard to control a driver from the userspace test program. Also, the test will be more complex and flaky, and the driver interaction will not be portable.[1]

Instead, we chose to sidestep this problem. We used a solution where we load an audio file with WebAudio and play that straight into the peer connection through the WebAudio-PeerConnection integration. That way we start the playing of the file from the same renderer process as the call itself, which made it a lot easier to time the start and end of the file. We still needed to be careful to avoid playing the file too early or too late, so we don't clip the audio at the start or end - that would destroy our PESQ scores! - but it turned out to be a workable approach.[2]
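Sketched as browser-side code, the approach looks roughly like the following. This is illustrative rather than the actual test page: it uses the modern addTrack API (the 2013-era code would have used the then-current addStream), and the function name and timing details are ours to fill in.

```typescript
// Sketch: feed a known audio file into a peer connection via WebAudio,
// instead of capturing a real microphone. Runs in the same renderer as
// the call, which makes start/end timing easy to control.
async function playFileIntoCall(
  pc: RTCPeerConnection,
  fileUrl: string
): Promise<void> {
  const context = new AudioContext();

  // Fetch and decode the reference file.
  const response = await fetch(fileUrl);
  const encoded = await response.arrayBuffer();
  const buffer = await context.decodeAudioData(encoded);

  // Route the decoded audio into a MediaStream rather than the speakers.
  const source = context.createBufferSource();
  source.buffer = buffer;
  const destination = context.createMediaStreamDestination();
  source.connect(destination);

  // Send the stream's audio track over the peer connection.
  for (const track of destination.stream.getAudioTracks()) {
    pc.addTrack(track, destination.stream);
  }

  // Starting playback here, once the call is up, is what gives the test
  // tight control over when the file begins relative to the recording.
  source.start();
}
```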

Recording the Output
Alright, so now we could get a WebRTC call set up with a known audio file with decent control of when the file starts playing. Now we had to record the output. There are a number of possible solutions. The most end-to-end way is to straight up record what the system sends to default audio out (like speakers or headphones). Alternatively, we could write a hook in our application to dump our audio as late as possible, like when we're just about to send it to the sound card.

We went with the former. Our colleagues in the Chrome video stack team in Kirkland had already found that it's possible to configure a Windows or Linux machine to send the system's audio output (i.e. what plays on the speakers) to a virtual recording device. If we make that virtual recording device the default one, simply invoking SoundRecorder.exe and arecord respectively will record what the system is playing out.

They found this works well if one also uses the sox utility to eliminate silence around the actual audio content (recall we had some safety margins at both ends to ensure we record the whole input file as playing through the WebRTC call). We adopted the same approach, since it records what the user would hear, and yet uses only standard tools. This means we don't have to install additional software on the myriad machines that will run this test.[3]
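As a rough sketch, the record-and-trim step on Linux amounts to two command lines. The durations, thresholds and file names below are illustrative assumptions, not the exact flags our test uses:

```typescript
// Sketch of the Linux record-and-trim step as argument vectors.
// Durations, thresholds and file names are illustrative assumptions.

// Record 30 seconds from the default (virtual) capture device.
// "dat" is arecord shorthand for 48 kHz, 16-bit, stereo.
const recordCmd: string[] = [
  "arecord", "-d", "30", "-f", "dat", "recording.wav",
];

// Trim silence from both ends with sox: strip leading silence, reverse
// the file, strip the (now leading) trailing silence, reverse back.
const trimCmd: string[] = [
  "sox", "recording.wav", "trimmed.wav",
  "silence", "1", "0.1", "1%",
  "reverse",
  "silence", "1", "0.1", "1%",
  "reverse",
];
```

The silence threshold (1% here) matters in practice: too low and background noise defeats the trimming, too high and you clip quiet speech.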

Analyzing Audio
The only remaining step was to compare the silence-eliminated recording with the input file. When we first did this, we got a really bad score (about 2.0 out of 5.0, which means PESQ thinks the result is barely intelligible). This didn't seem to make sense, since the input and the recording sounded very similar. It turns out we hadn't considered the following:

  • We were comparing a full-band (24 kHz) input file to a wide-band (8 kHz) result (although both files were sampled at 48 kHz). This essentially amounted to a low-pass filtering of the result file.
  • Both files were in stereo, but PESQ is only mono-aware.
  • The files were 32-bit, but the PESQ implementation is designed for 16 bits.

As you can see, it's important to pay attention to what format arecord and SoundRecorder.exe record in, and to make sure the input file is recorded the same way. After correcting the input file and "rebasing", we got the score up to about 4.0.[4]
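With sox, forcing both files into a matching format is a one-liner. Here's an illustrative sketch; the target sample rate and file names are assumptions, not the test's exact values:

```typescript
// Sketch: convert a file to 16-bit mono at a wide-band sample rate so
// both sides of the PESQ comparison are in the same format. In sox,
// output format options go before the output file name.
const convertCmd: string[] = [
  "sox", "input.wav",
  "-b", "16",    // 16-bit samples, matching the PESQ implementation
  "-c", "1",     // downmix stereo to mono (PESQ is mono-aware only)
  "-r", "16000", // wide-band sample rate
  "input_16k_mono.wav",
];
```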

Thus, we ended up with an automated test that runs continuously on the torrent of Chrome changelists and protects WebRTC's ability to transmit sound. You can see the finished code here. With automated tests and cleverly chosen metrics you can protect against most regressions a user would notice. If your product includes video and audio handling, such a test is a great addition to your testing mix.


How the components of the test fit together.

Future work

  • It might be possible to write a Chrome extension which dumps the audio from Chrome to a file. That way we get a simpler-to-maintain and portable solution. It would be less end-to-end but more than worth it due to the simplified maintenance and setup. Also, the recording tools we use are not perfect and add some distortion, which makes the score less accurate.
  • There are other algorithms than PESQ to consider - for instance, POLQA is the successor to PESQ and is better at analyzing high-bandwidth audio signals.
  • We are working on a solution which will run this test under simulated network conditions. Simulated networks combined with this test is a really powerful way to test our behavior under various packet loss and delay scenarios and ensure we deliver a good experience to all our users, not just those with great broadband connections. Stay tuned for future articles on that topic!
  • Investigate feasibility of running this set-up on mobile devices.




1 It would be tolerable if the driver was just looping the input file, eliminating the need for the test to control the driver (i.e. the test doesn't have to tell the driver to start playing the file). This is actually what we do in the video quality test. It's a much better fit to take this approach on the video side since each recorded video frame is independent of the others. We can easily embed barcodes into each frame and evaluate them independently.

This seems much harder for audio. We could possibly do audio watermarking, or we could embed a kind of start marker (for instance, using DTMF tones) in the first two seconds of the input file and play the real content after that, and then do some fancy audio processing on the receiving end to figure out the start and end of the input audio. We chose not to pursue this approach due to its complexity.

2 Unfortunately, this also means we will not test the capturer path (which handles microphones, etc. in WebRTC). This is an example of the frequent tradeoffs one has to make when designing an end-to-end test. Often we have to trade end-to-endness (how close the test is to the user experience) against robustness and simplicity. It's not worth covering 5% more of the code if the test becomes unreliable or radically more expensive to maintain. Another example: a WebRTC call will generally involve two peers on different devices separated by the real-world internet. Writing such a test and making it reliable would be extremely difficult, so we make the test single-machine and hope we catch most of the bugs anyway.

3 It's important to keep the continuous build setup simple and the build machines easy to configure - otherwise you will inevitably pay a heavy price in maintenance when you try to scale your testing up.

4 When sending audio over the internet, we have to compress it since lossless audio consumes way too much bandwidth. WebRTC audio generally sounds great, but there's still compression artifacts if you listen closely (and, in fact, the recording tools are not perfect and add some distortion as well). Given that this test is more about detecting regressions than measuring some absolute notion of quality, we'd like to downplay those artifacts. As our Kirkland colleagues found, one of the ways to do that is to "rebase" the input file. That means we start with a pristine recording, feed that through the WebRTC call and record what comes out on the other end. After manually verifying the quality, we use that as our input file for the actual test. In our case, it pushed our PESQ score up from 3 to about 4 (out of 5), which gives us a bit more sensitivity to regressions.

11 comments:

  1. Nice post. I think this may have use for SDET/QA beyond streaming audio/media space. The technique may prove useful in telecom hardware/software/services testing as well.

  2. Great Post! Thanks for sharing. :)

    Just some queries on the big picture here:- this looks like a one off project and the challenge here is considered quite unique and not the usual day to day question that we encounter during testing. It interests me when the team manage to find out so many metrics & formula to measure the quality of the application. Does it mean that the test team has also spent effort to understand the formula, code it into test for calculation, etc? Or the test team actually engage some SME (subject matter expert) to advise? Would hope to know how Google test team approach this situation? Besides, mind sharing how long did you guys use to complete this project?

  3. Great post, thanks!
    Is there any articles/posts analyzing WebRTC video quality? Many thanks.

  4. Great post! I managed to run the test, however, I am getting scores less or around 2. And yes, I used 'rebasing' technique with the right sampling rate, etc. I would like to try with your file and see if it makes it any better. Could you please share the wav file at pyauto_private/webrtc/human-voice-linux.wav since it seems we (public) doesnt seem to have access to that (or I couldnt figure out how). One thing to note is that I am trying this on a virtual machine (ubuntu flavor).

    Thanks!

    Replies
    1. I'm a newbie to webrtc.
      Kindly, someone explain me in detail, how I can setup my machine to run the test? - I was able to compile the chrome nightly build using ninja. What should be the next step to run the above mentioned test (i.e., which file or command I should execute to perform the above mentioned tasks).

    2. Ok, here's how you do it. First, get the code (http://dev.chromium.org/developers/how-tos/get-the-code) or update your existing checkout so you get the latest code (I landed some patches very recently). Instead of building chrome, build browser_tests. Configure your machine like instructed in https://code.google.com/p/chromium/codesearch#chromium/src/chrome/browser/media/chrome_webrtc_audio_quality_browsertest.cc&q=chrome_webrtc_a&sq=package:chromium&l=49 (if you run Mac you're out of luck; the test isn't implemented there).

Then add this to the solutions list in your .gclient file (it's in the folder above your chrome src/ folder):

      {
      "name" : "webrtc.DEPS",
      "url" : "svn://svn.chromium.org/chrome/trunk/deps/third_party/webrtc/webrtc.DEPS",
      "managed" : True,
      },

      That will download the resources you need. It will probably fail on downloading from the webrtc-chrome-resources bucket, so you need to comment that part out in the hooks in webrtc.DEPS/DEPS. Then you need to get a hold of PESQ (and if you're on windows: sox.exe), build those yourself and put them in src/chrome/test/data/webrtc/resources/tools. We can't redistribute those binaries but they're readily available on the web.

      Then run out/Debug/browser_tests --gtest_filter="WebRtcAudio*" --run-manual. If it works you'll get a PESQ score printed out.

      Good luck! :)

  5. Unknown: Sure, that file is private only for historical reasons. I'm going to try to pull it out to the public world when I get time. For now I'll attach it in comment #3 here: https://crbug.com/279195.

    The best way to figure out why the score is low is to listen to the recorded file (just uncomment the DeleteFile on trimmed_recording). Load it up in audacity and compare to the source file. Often you will find that the volume level of the recording is too low because the system's input or playback volume levels are wrong. If they're too low you'll get a bad recording and if they're too high you'll get distortions. Read the comments on the test on how to set up your machine carefully and look at the volume levels in pavucontrol. I think they're all 100% on our machines, but this may or may not be appropriate for your sound hardware.

    Replies
    1. Thanks for the answer Patrik - I think volume levels are okay. looking at the waveforms in audacity, recorded waveform is always expanded somewhat. To give an example, I shared a photo below where the original signal is 5.181 seconds long and recorded signal is 5.581 seconds long.

      https://plus.google.com/105460497673148445079/posts?banner=pwa

      From what I know, PESQ can deal with the silence in between talkspurts well if they are different between original and recorded files, i.e. one may want to reduce the silence period and play out quicker at the receiving side without much quality compromise. However, if my memory serves me right, PESQ cannot deal with the expanded wave forms, which could be the cause for low PESQ results in my environment.

      But this doesnt explain why your tests are at around 4. I wonder if virtual machine environment has a contributing factor here. Do you run your tests on physical machines? Thanks again.

  6. BloodyArmy: We do have very skilled audio engineers on the WebRTC team, which is how I learned about PESQ, sampling rates and so on. I wouldn't have been able to pull this test off without help from those experts for sure.

    I would say this test took about three manweeks to implement. It was very hard with all the OS-specific quirks, and it took a lot of testing and tweaking to get it to run well.

  7. Akmal Nishanov: Yes! I did a tech talk on GTAC 2013. Find the video and slides here: http://www.youtube.com/watch?v=IbLNm3LsMaw&list=SPSIUOFhnxEiCODb8XQB-RUQ0RGNZ2yW7d

    There's no article like this one though; I might write one in the future.

  8. Hope this doesn't double post, tried posting and browser crashed... anyway, I found this article really appropriate to the webRTC work I'm testing. Very helpful. I saw in your future considerations you're looking at latency models. I did some work on that. Before webrtc, I was doing web automation, and I built a latency generator using a linux VM and Netem. I put squid on there to open a port and then route the browser calls through the proxy. Before the tests starts I would make a call that remotely sets the bandwidth/latency/packet loss profile on the VM. Then run the test through the VM. It worked very well for us. If you are interested I have a write up on it at my blog: http://sdet.us/simulating-real-world-latency-during-automation/

