Listening Tests


So now we've come to the last step, and it's a big one. Just how good are these Mp3s? Have we set up the encoder correctly? How do they sound compared to the CD they were made from? The obvious way to find out is with a listening test. But it's not as simple as it sounds.

There are a number of audio listening tests in common use. People have created computer programs to automate them, or gathered together with stereo equipment to run them manually. They've been done on big and small scales. The vast majority of them work on the same principle though: the idea that if any audible difference exists between two pieces of audio gear (speakers, amplifiers, file types, etc.) it should be a simple matter of playing the same song through one and then the other and asking people which sounded better. This is the classic A/B test. There is also the ABX test, which goes one step further by playing one, then the other, and then randomly one of the two again and asking which one it was. The goal in either case is to determine whether the equipment sounds the same or not, and hopefully to pick the superior equipment. These tests are best performed blind, meaning that the people doing the listening aren't told anything about which gear is in use at a given time, so that they don't subconsciously favour the more expensive brands. And even that can be made better still by making sure the person recording the results doesn't know which is which either, lest they ask leading questions or unintentionally adjust what people said to fit what they think is the right answer. So it sounds great: A/B and ABX testing seem like a foolproof way to determine once and for all whether there are any quality gains to be had between various pieces of gear or audio formats.
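For the curious, the bookkeeping behind an ABX session is simple enough to sketch in a few lines of Python (the function names here are mine, not any standard tool). This simulates a listener who genuinely can't tell A from B and is guessing at chance, which is the baseline any real ABX result has to beat:

```python
import random

def run_abx_trial(rng):
    """One ABX trial: X is secretly a copy of A or of B, and the
    listener guesses which. Returns True if the guess was right."""
    x_is_a = rng.random() < 0.5       # the hidden assignment of X
    guess_is_a = rng.random() < 0.5   # a listener who can't tell guesses at chance
    return guess_is_a == x_is_a

def run_abx_session(n_trials, seed=0):
    """Count correct identifications over a whole session."""
    rng = random.Random(seed)
    return sum(run_abx_trial(rng) for _ in range(n_trials))

# A score near n_trials / 2 means the listener is guessing;
# a score well above it suggests a genuinely audible difference.
print(run_abx_session(1000))
```

Real ABX software does exactly this tallying, just with actual audio playback in the middle of each trial.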

But I've recently started to have my doubts.

Let me be clear about something. It is not the blind aspect which I object to, not at all. This blinding is very important. The unfortunate fact is that we humans tend to see what we're expecting to. This is called confirmation bias. We look for a few key facts that seem to support what we expect to find, and we don't look for others that might disprove it. Time and time again it has been shown that if you wrap one bottle of wine with a cheap label and another identical bottle with a more expensive brand's label, both the ordinary person and the enthusiast will almost always say the more expensive wine tastes better. The enthusiast will even use the full breadth of their vocabulary to tell you all the ways in which the two wines differ. Likewise, even the most obviously bogus audiophile scam products will all get at least some good reviews from happy customers who truly believe their money was well spent. $1500 power cords infused with genuine good vibes and fairy dust are described as making records sound much better than standard $5 cables, in precisely the same way as the relabeled bottles of wine. If it costs that much, it MUST be good, right?

No, it's not the blinding which worries me. It's the fact that all traditional audio tests, be they A/B, ABX, sighted or blind, hinge on two assumptions. They assume that the average untrained person's perception is good enough to notice things like exactly how crisp a cymbal crash in a song is, or how muted a violin, and that human memory is good enough to store all that information away so that it can be recalled perfectly when it comes time to listen the second (or third) time. And these strike me as very poor assumptions.

On the first point, we tend to be very bad at thinking about things which we lack terms for. And most of us don't have a very thorough vocabulary for audio characteristics. We simply don't work with audio on that level very often in day to day life. So most of us don't know what to look for when it comes to quantifying recording quality, beyond the broad strokes. The obvious solution then is to devote years of one's life to learning about acoustics, about the way the ears pick up sound, how audio compression engines work, and to develop a rich language of descriptive words to rival that of the snootiest of wine tasters. All this would certainly make changes in audio quality a lot more quantifiable. But it would clearly also take a hell of a lot of work. Far more, I think, than the average person wants to invest in something simple like putting good music on their phone.

Secondly, the human mind does not seem to be very good at accurately comparing a current experience to the memory of a previous one. How many times have you heard someone complain about how much better life was when they were a kid? When men were real men, when kids respected their elders, and when crime wasn't an issue? Or listened to 3 different people give 3 totally different reports to a police officer about what a thief looked like? All our memories are distorted and coloured by the wet meat machine between our ears, especially when we try to recall something we weren't paying much attention to when we experienced it. And I'm not sure that there's much we can do to improve our memories in this regard.

It seems then that it might not be safe to trust ourselves with traditional listening tests. So I sought to devise a test which didn't rely on memory. It would instead be a purely perceptive test, to see if the listener noticed a change in audio quality at the moment one occurred. It might not tell us exactly what the difference is, but it will at least let us know if the change is significant enough for the human ear to notice, rather than just the human memory.

The idea came to me when a friend of mine insisted she could hear the difference between 320kbps Mp3s and ones of even slightly lower bitrate. Having researched the format and experimented with it myself, I strongly suspected that she couldn't, but my logic alone wasn't going to change her mind. She knew she could and wouldn't be told otherwise. Frustrated with the mind's ability to find a difference whether it was there or not, I thought, as I often did, about how nice it would be to conduct a more immediate listening test. One where I could switch between two audio sources of different quality at any point in a song, with no delays or changes in things like timing or volume. This last part would be critical, since our brains recognize a slight increase in volume as an increase in clarity. I imagined a pair of carefully cued audio files and a slider to electrically pan between them. If there was a significant difference in quality it should be heard as a change of some sort as the slider was moved. But I realized that I'd never be able to do something like that in hardware, since the two audio sources would need to be aligned not merely to the second but to the very sample, else their waveforms would be offset, which would cause them to alternately reinforce and cancel each other out instead of neatly overlapping as they were panned between.

Then it hit me. I could use that very property to do something even better, with commonly available software instead of mixers. I would draw on a technique which was almost as old as recorded music itself. Let me see if I can explain.

Sound is a wave. It flows down wires and through the air as an oscillation. When a speaker is producing sound, it's pushing and pulling its little membrane back and forth, corresponding to the rising and falling of the recorded wave. This creates little bursts of higher and lower pressure air which in turn move our ear drums back and forth, and we hear sound. Simple stuff, right?
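In digital audio, that rising and falling wave is stored as a long list of numbers called samples, taken tens of thousands of times per second (44,100 per second for CD audio). Here's a minimal Python sketch of what one channel of a pure tone looks like as data; the function name and values are just illustrative:

```python
import math

SAMPLE_RATE = 44100  # CD audio: 44,100 samples per second

def sine_wave(freq_hz, seconds, rate=SAMPLE_RATE):
    """One channel of a pure tone as a list of samples in [-1.0, 1.0].
    Each value tells the speaker membrane how far to push or pull."""
    n = int(rate * seconds)
    return [math.sin(2 * math.pi * freq_hz * t / rate) for t in range(n)]

tone = sine_wave(440, 0.01)   # 10 ms of an A4 tone: 441 samples
```

Positive samples push the membrane out, negative samples pull it in, and that's all a recorded wave is.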

A common mistake when hooking up home or car stereos is to reverse the polarity of one of the speakers relative to the other. Then when one speaker is pushing, the other is pulling, and the two fight each other. This means that any sound which is present on both stereo channels will be much quieter than it should be as the two speakers cancel each other out. The effect is sometimes called Out Of Phase Stereo, or OOPS. Not good for music, but some people discovered they could deliberately set up their stereos that way to help them analyze music. Vocals, you see, are usually present in both channels of pop music, with the instruments on one or the other. By removing only the stuff present on both using Out Of Phase Stereo, you can hear the instruments alone without vocals.
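If you want to see the arithmetic behind the OOPS trick, here's a toy Python sketch with made-up sample values: the vocal appears identically in both channels, the guitar in only one, and subtracting one channel from the other wipes out the shared vocal while the guitar survives:

```python
# Center-panned content (identical in both channels) cancels when one
# channel is inverted and summed with the other -- Out Of Phase Stereo.
vocals = [0.5, -0.3, 0.2, 0.1]   # present in both channels
guitar = [0.1, 0.2, -0.1, 0.0]   # present in the left channel only

left  = [v + g for v, g in zip(vocals, guitar)]
right = vocals[:]                # the right channel carries only vocals

oops = [l - r for l, r in zip(left, right)]
# `oops` is (vocals + guitar) - vocals: the shared vocal is gone,
# and only the guitar remains.
```

With real music the cancellation is rarely this clean, since vocals are never bit-identical in both channels, but the principle is the same.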

Okay, that's pretty cool, but how does it help us to compare the quality of Mp3s? Well, what if instead of removing the sounds common to both stereo channels of a single audio file, we removed the sounds common to a pair of audio files? Say, an uncompressed file and the Mp3 version? Then we would get to hear only the bits which were different. And then we can do something REALLY clever.

I took a CD off my shelf and ripped the same track from it 3 times: first as an uncompressed wave file, then as a 320kbps Mp3, and finally with my preferred VBR profile.

I loaded the raw wave file and the 320kbps version into Audacity, one below the other.


Then I went to the start of the files and zoomed in. The Mp3 compression process had subtly padded the start and finish of the file, so that it no longer lined up with the original. But unlike analog mixing boards, Audacity's natural habitat is the sample. I measured exactly how many samples late the Mp3 was (2,256 in this particular case) and removed that much silence from the start of the file.
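Audacity makes this a manual measure-and-delete job, but the alignment step itself can be sketched programmatically. This toy Python version brute-forces the lag by correlation over a small search window; the function name and sample values are mine, purely for illustration:

```python
def find_lag(reference, delayed, max_lag=64):
    """Return the offset, in samples, at which `delayed` best lines up
    with `reference`, found by brute-force correlation over small lags."""
    def score(lag):
        # Higher when the two waveforms rise and fall together.
        return sum(a * b for a, b in zip(reference, delayed[lag:]))
    return max(range(max_lag + 1), key=score)

reference = [0.0, 0.0, 1.0, -0.5, 0.25, 0.8, -0.9, 0.1]
padded = [0.0] * 3 + reference   # as if the encoder padded 3 samples of silence

print(find_lag(reference, padded))   # 3: trim that many samples to align
```

Once the lag is known, removing that many samples from the front of the Mp3 gives perfect waveform alignment, which is exactly what the measure-and-trim step in Audacity achieves by hand.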


Ta dah! The raw wave and the 320kbps Mp3 were now in perfect waveform alignment. The holy grail had been found.


Now came the fun part. I selected the whole of the 320kbps Mp3 version of the song and inverted it. This made it so that wherever the waveform had previously gone up (corresponding to the speaker pushing out) it now went down (making the speaker pull in) and vice versa.


Then I selected the inverted Mp3 and the original wave and told Audacity to mix and render. This caused it to combine them into a single file. But because I had inverted the Mp3 version first, any aspect of it which was 100% identical to the original wave version would now be its exact opposite, thus canceling it out and producing silence. The resulting file contained only the difference between the source files.

Let me see if I can explain all this a little better. If one audio file is telling the speaker to move outward by precisely, say, 1.5 millimeters, and the other is at that same moment telling the speaker to move inward by precisely 1.5 millimeters, then the net result is that the speaker sits still and makes no sound. What I've done here is mix together two files which were almost, but not quite, identical. The difference between them was the Mp3 compression. That means wherever the files were still identical, they canceled each other out and made silence. But wherever one file had a sound that was missing from the other (because the Mp3 encoder had taken it out to save space,) that sound remained in the resulting output. In other words, I had created a file which contained only the sounds normally removed by turning a wave into an Mp3.
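The invert-and-mix steps boil down to one subtraction per sample. Here's a toy Python sketch with made-up numbers standing in for the wave and Mp3 versions; in the "encoded" file the encoder has dropped one quiet detail, and that detail is all that survives the null:

```python
def invert(samples):
    """Flip the waveform: every push becomes a pull."""
    return [-s for s in samples]

def mix(a, b):
    """Sum two sample-aligned tracks, as Audacity's mix-and-render does."""
    return [x + y for x, y in zip(a, b)]

original = [0.40, -0.20, 0.10, 0.05, -0.30]
# Stand-in for the Mp3: identical except one sound the encoder removed.
encoded  = [0.40, -0.20, 0.00, 0.05, -0.30]

difference = mix(original, invert(encoded))
print(difference)   # [0.0, 0.0, 0.1, 0.0, 0.0] -- only the removed sound remains
```

Everywhere the two files agree, the samples cancel to exact silence; the difference file is literally `original - encoded` at every sample.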


And what a beautiful file it was. For the first time I could see exactly what was being altered during compression. I could hear the sorts of things which were being lost. And you can too, by clicking here.

But it was about to get even cooler.


I left this difference file open in Audacity and loaded the 320kbps Mp3 in beneath it a second time. I did the sample measuring trick again to make sure they had perfect waveform alignment. Then I hit play. Both tracks played together, sounding just like the source CD. And mathematically, it WAS the source CD. The compressed file plus the difference file equaled the source file. And by muting the difference file at any point in the song, I got to hear, instantly and without distortion or changes in volume, exactly what quality was lost by compressing a wave file to a 320kbps Mp3.
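That "mathematically, it WAS the source CD" claim is easy to verify in miniature. Using made-up sample values again, adding the difference track back onto the encoded track rebuilds the source sample for sample, and muting the difference track simply leaves the encoded version playing alone:

```python
original   = [0.40, -0.20, 0.10, 0.05, -0.30]
encoded    = [0.40, -0.20, 0.00, 0.05, -0.30]   # stand-in for the Mp3
difference = [o - e for o, e in zip(original, encoded)]

# Playing both tracks together sums them: encoded + (original - encoded).
rebuilt = [e + d for e, d in zip(encoded, difference)]
assert rebuilt == original   # bit-for-bit the source again

# "Muting" the difference track just drops its contribution,
# so what reaches the speaker is the encoded version by itself.
```

So toggling the difference track's mute button switches the output between the exact source and the exact Mp3, with no gap, click, or volume change at the transition.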

And what loss did I hear?


The Mp3 file sounded exactly like the original to my ears. I was performing the opposite of a blinded test, a sighted one, to give myself every possible chance to notice if there was a difference, and I couldn't find one. I even listened to the difference file on its own so I knew what sort of sounds had been most affected, but still I couldn't pick out a change in quality as I toggled the difference file on and off.

(Had I found a difference with a sighted test but not a blinded one, I would have been highly suspicious of the result, and gone on to do further blinded tests to see if I could reproduce it. My goal in performing it initially sighted was to make sure I knew when to expect a change, and what the change might sound like, so that I was looking in the right places.)

Want to try it for yourself? Listen to this and see if you can spot the points where the quality changes. No? Try again, listening for an increase in quality at 5 seconds which disappears at 10. If you can hear a difference, let me know. So far no one has.


Next I repeated the test using my more compressed VBR Mp3 instead of the 320. As one would expect given the smaller file size, the VBR's difference file did show more removed audio than the 320 one had. If my friend were correct, this would be enough to cause a small but detectable change in playback quality when I toggled the difference file. You can try it for yourself by listening to this. If a VBR Mp3 averaging 233kbps isn't enough to recreate CD audio, the quality should improve slightly at the 5 second mark, then drop off at 10. But I could not hear a change. No slight drop in volume, no increase in noise, no dulling of delicate notes. Once more I listened to the difference file to give myself an unfair advantage, and even when I knew what to expect I couldn't find a change. Did that mean there really wasn't a significant difference? Did that suggest my VBR scheme sounded as good as an uncompressed CD?


I wanted to make sure I wasn't falling victim to confirmation bias before I made a claim like that. To make sure that I hadn't bungled a step somewhere, I went back to the CD and ripped the test track once more at only 56kbps. This time there would be no way to miss the transitions between the source and the Mp3 if the method worked. I loaded it into Audacity, aligned it, inverted it, mixed it, loaded it again. It worked! This time the difference file was huge! There was a ton of audio which had been removed. When I started playing the 56kbps Mp3 the song sounded dreadful. But when I unmuted the difference track, it suddenly sounded just as good as the CD once more, as I had predicted. So the method DID work! You can hear it for yourself here.

Okay, so what's the bottom line?

Well, to start with, this method is only useful in very specific circumstances. It would be useless for comparing vinyl to CDs, for instance. And an encoder that did nothing more than take the input and invert it would show up as deleting the entire track using this method. Just because a sound is different between the input and the output doesn't necessarily mean it was removed; the encoder may have changed it in some other way. So the difference track might actually exaggerate the trimming done to the Mp3 files for all I know. Take it with a grain of salt. Despite this, it remains true that the difference track plus the encoded track exactly equals the source track, so that part we can trust.
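That inversion caveat is easy to demonstrate with numbers. A hypothetical "encoder" that merely flips the waveform leaves the music essentially intact, yet its difference file would contain the entire song at double amplitude, which is why the difference track can overstate what was actually lost:

```python
original = [0.4, -0.2, 0.1, 0.05, -0.3]

# A pathological "encoder" that does nothing but invert the track.
encoded = [-s for s in original]

# The null test reports difference = original - encoded = 2 * original:
difference = [o - e for o, e in zip(original, encoded)]
# The whole song shows up in the difference file, even though
# nothing was really removed -- only the polarity changed.
```

So the difference file is an upper bound on what changed, not a precise inventory of what was deleted.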

And secondly, the use of Audacity and this slightly convoluted method is far from the only way, or even the best way, to perform this test. I simply chose it because it was a free tool which I already had and knew how to use. In theory one could use a simpler program which just switched audio sources at the click of a button. But it would have to be smart enough to align the files to the sample, and it would have to be engineered very carefully to avoid making popping sounds or anything like that at the transition. I found that Audacity was surprisingly good at remaining transparent through these tests, so that's what I use.

Those limits stated, I feel the test as it stands has produced some interesting and plausible results. It suggests that my 22 year old ears, covered by a pair of Sennheiser HD-280s plugged into a Soundblaster Audigy 2 ZS, are unable to hear the difference between a particular Sarah McLachlan song compressed with my VBR profile (averaging 233kbps) and one left as a wave. This in itself isn't terribly interesting, but I've been repeating the experiment with other songs, since different types of music respond differently to compression. And I've been sending comparison files to other people in an effort to see if any of them can pick up a difference I missed. So far though it's looking good for Mp3. I have yet to find a song which sounded different in the VBR Mp3 profile vs the source CD.

This suggests to me that there's likely no advantage to using CBR 320kbps Mp3s for daily listening, especially since they take up 40% more space on average than high quality VBR Mp3s. More generally, it also means that people who insist Mp3s sound worse than CDs are probably fooling themselves, except in extremely rare cases. Especially if they're in their 40s with worn out ears.

I also discovered that I could go surprisingly low on the bitrate and still retain tolerable sound quality. It seems that the Mp3 encoders have improved dramatically in the past decade. No longer do we have to suffer a sea of audio errors if an Mp3 is "only" 192kbps. In fact I dare say that dropping to 128kbps on a modern encoder would be fine if the files were being played on an iPod or similar, where listening conditions are poor.

Another interesting fact. When I tried comparing a very heavily compressed Mp3 of 56kbps to the CD, I found that my ears tended to notice when the quality increased, but often missed it when the song got worse. I'm unsure what the implications of this are, but it might be useful to know when pinpointing a desired quality level.

All in all, I feel it's time to give up the notion that Mp3s are only good enough for puddingheads with $2 speakers. My tests suggest they can be good enough for serious enjoyment on serious equipment.

Last modified September 30th 2013

Can you hear me, Major Tom?