Basic Audio Compression



This article describes some basic audio compression techniques. It is intended to be an introduction to the subject rather than a state-of-the-art code snippet library. If you already know about MP3, companding and U-law/A-law then this article will probably bore you to tears because it won't cover any ground which you don't already know.

Lossless compression

Many compression schemes used in the world of computing are "lossless", meaning that NO errors are introduced by the compression/decompression process. If you input a data-stream then you get the same, identical data-stream as the output; nothing is lost. This is vital for code, database information and other value-critical data where even the smallest error of +/- 1 will cause problems. You can imagine that if one opcode byte was wrong then an entire system could crash or lock up.

Lossy compression

Lossy compression, on the other hand, gives up some precision in the data representation in exchange for better compression. We are willing to tolerate small (or large) errors so long as the quality isn't affected too greatly.

Lossy compression is used in the JPEG, MPEG and MP3 schemes. These compress digitised graphical and audio data using quantisation and modelling techniques to give a "close" approximation of the data-stream. Not only do all these schemes sacrifice some precision for more compression, but they all process data-streams which have themselves already given up some precision. By this I mean that when a sound was sampled, some quantisation errors were introduced as each 8-bit or 16-bit sample was taken. Likewise with scanned or captured video or images, some information was lost when the analogue signal was converted into a digital one.

Because we don't care that much about a small amount of error during the digitising process, we could record a sound using 8-bit samples rather than 16-bit ones; this would halve the amount of space needed. Of course you don't get anything for nothing: using only 8 bits will often create noise in the recorded sound because our precision is only 1/256th that of a 16-bit sample.

This is the main trade-off in lossy compression, quality vs. quantity.

Smaller precision

So we have already seen one way to reduce a sample's size: just reduce the number of bits used to represent each instant in time (16 down to 8 bits). We could also reduce this to 7, 6 or even 1 bit if we wished, but the amount of unwanted noise and the lack of quality will probably make us stop around the 6-bit mark (depending on the sound being sampled).
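For example, the 16-into-8-bit reduction can be sketched as a pair of one-liner functions in C (the names are my own):

```c
#include <stdint.h>

/* Throw away the low byte: the 8-bit sample keeps only the top of the
   16-bit range, so quantisation noise of up to 255 steps is introduced. */
int8_t reduce_16_to_8(int16_t s)
{
    return (int8_t)(s >> 8);
}

/* Expanding back just shifts up again; the lost low byte stays zero. */
int16_t expand_8_to_16(int8_t s)
{
    return (int16_t)(s << 8);
}
```

So 1234 hex survives the round trip as 1200 hex; the low byte is gone for good.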

Another way to quickly reduce a sample's size is to resample it at a lower frequency. For example, instead of 44.1 kHz use 22.05 kHz. We have again halved the number of bytes needed to store the sound's sample.
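A quick sketch of halving the rate in C (a hypothetical function; note that simply averaging pairs is only a crude stand-in for the proper low-pass filter a real resampler would use to avoid aliasing):

```c
#include <stdint.h>
#include <stddef.h>

/* Halve the sample rate by averaging each pair of input samples.
   The averaging acts as a very crude low-pass filter; a real resampler
   would filter properly first to avoid aliasing. Returns the number of
   output samples written. */
size_t downsample_by_2(const int16_t *in, size_t n, int16_t *out)
{
    size_t m = n / 2;
    for (size_t i = 0; i < m; i++)
        out[i] = (int16_t)(((int)in[2 * i] + (int)in[2 * i + 1]) / 2);
    return m;
}
```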

Precision and sound level

There is an important thing to remember when choosing a precision for a sound: whether the sound is loud or quiet. With a 16-bit sample resolution we have 65536 different levels to quantise our sound into, but if we record a very quiet sound then it is possible that only a range of 256 different levels is actually used, which means we could get away with just 8 bits.

There are still a few semi-professional rack-mounted samplers which use only a 12-bit resolution for their samples instead of 16 bits. But this tends to be rare these days because of the low price of storage devices like cheap RAM modules and even built-in hard drives. To be honest the difference between a 12-bit sampler and a 16-bit one isn't that great; it seems to depend on the filter and sound-source quality more than on the actual bit resolution.


So to cope with both quiet and loud samples it first appears that we need to use 12 or 16 bits to get good results. We need the 4096 or 65536 levels to represent the small volume changes that quiet sounds have. If we simply pumped up the volume of the sound source so that the quiet sound used the entire 256-level range of an 8-bit sample, then what would happen if suddenly a very loud signal appeared?

The over-amp'd sound would distort because we're off the top/bottom of the scale. If we reduce the volume so that loud sounds fit into the 8-bit resolution, then the quiet sounds are scaled down towards the zero line, so rather than spanning 6 or 7 bits they are squashed down into 2 or 3 bits.

The problem is not the loud sounds but the quiet ones; we need a greater resolution for these near-zero levels.

The idea behind companding (compressing/expanding) is that a non-linear scale is used for the sample resolution levels. This means that there are more fine precision steps at the bottom of the scale for low-level samples, but larger steps at the top of the scale to encode the louder sounds.


The description which follows is directly from my poor memory, so please excuse any stoopid errors (my brain works in a lossy way sometimes).

The basic idea of U-LAW and A-LAW is to reduce the number of bits in each sample word by packing away the unused MSBs (most significant bits). Starting with a 16-bit sample it is obvious that most of the time (especially for low/medium level sounds) not all 65536 levels will be used. In many cases only 10 or 8 bits are needed, so the top 6 or 8 bits are being wasted.

Imagine that the following numbers were taken from a 16-bit sample.

      hex                  binary
      ----            ----------------

      04D2            0000010011010010
      007F            0000000001111111
      0105            0000000100000101
      002E            0000000000101110

You should note that there are many leading zeros below bit #15 (the sign bit). This is what the U-LAW and A-LAW encoding schemes encode as a small count.

                           binary             encoding
                      ----------------        --------
                      FEDCBA9876543210        76543210

                      s---00000001wxyz        s000wxyz
                      s---0000001wxyz-        s001wxyz
                      s---000001wxyz--        s010wxyz
                      s---00001wxyz---        s011wxyz
                      s---0001wxyz----        s100wxyz
                      s---001wxyz-----        s101wxyz
                      s---01wxyz------        s110wxyz
                      s---1wxyz-------        s111wxyz

The 's' bit represents the sign-bit of the sample. To make the process easier a negative sample is negated first, so we only ever count the leading 0 bits of a positive magnitude.

      s = 0
      IF sample_word < 0 THEN
                              s = 1
                              sample_word = (0 - sample_word)

The bits marked with '-' mean that their value is thrown away. The leading zeros are counted and encoded in a 3-bit count. The 4 bits following the most significant 1 are encoded as normal. This allows quiet sounds to be encoded with no loss of precision. The louder sounds lose some of their precision, but the loss in quality will hardly be noticeable.

The compression ratio is 2:1, so each 16-bit sample is reduced down to 8 bits.
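Putting the table above into code, here is a minimal encoder/decoder sketch in C. The function names are made up, and the BIAS constant is my own addition (real U-LAW uses the same trick, with a bias of 33) so that magnitudes below 16, which the table doesn't cover, still end up with a leading 1 at bit 4:

```c
#include <stdint.h>

#define BIAS 0x10  /* ensures every magnitude has its leading 1 at bit 4
                      or above; an assumption beyond the table above */

/* Pack a 16-bit sample into 8 bits: sign bit, 3-bit "chord" (which row
   of the table we landed in), and the 4 bits after the leading 1. */
uint8_t ulaw_encode(int16_t sample)
{
    uint8_t sign = 0;
    int mag = sample;

    if (mag < 0) {                 /* negate negative samples so we only */
        sign = 0x80;               /* ever count leading zeros           */
        mag = -mag;
    }
    mag += BIAS;
    if (mag > 0x0FFF)              /* the scheme keeps 12 magnitude bits; */
        mag = 0x0FFF;              /* anything louder is clipped          */

    int chord = 7;                 /* find the position of the leading 1 */
    while (chord > 0 && !(mag & (0x0010 << chord)))
        chord--;

    uint8_t step = (uint8_t)((mag >> chord) & 0x0F); /* the w,x,y,z bits */
    return (uint8_t)(sign | (chord << 4) | step);
}

/* Reverse the process: restore the implicit leading 1, shift the step
   bits back into place, remove the bias and re-apply the sign. */
int16_t ulaw_decode(uint8_t code)
{
    int chord = (code >> 4) & 0x07;
    int mag = ((0x10 | (code & 0x0F)) << chord) - BIAS;
    return (code & 0x80) ? (int16_t)-mag : (int16_t)mag;
}
```

A quiet value like 002E hex (46) survives the round trip exactly, while the loud 04D2 hex (1234) from the example above comes back as 1200, a small relative error for such a loud sample.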

To speed up this compression/expansion process a 65536-element table can easily be built beforehand. You could probably employ this compression scheme in your real-time tracker/mod player to halve the size of every sample. The advantage would be that 16-bit samples can be accessed using a byte index rather than a word index (which in pmode is slower). The decompression table could consist of floating-point numbers rather than integers. This "could" be good news as many mixing routines now use FP instructions rather than the slower integer ones.
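A sketch of the decompression table idea in C. The decoder below is my own reading of the chord/step table (with an assumed bias) and the names are made up; a 65536-entry encode table can be built the same way in the other direction:

```c
#include <stdint.h>

#define DEC_BIAS 0x10  /* matches a bias assumed at encode time; my own
                          addition, not part of the table above */

float dec_table[256];  /* 8-bit code -> floating-point sample */

/* One possible decoder for the chord/step scheme: 3-bit chord,
   4-bit step, implicit leading 1. */
int16_t decode_code(uint8_t code)
{
    int mag = ((0x10 | (code & 0x0F)) << ((code >> 4) & 0x07)) - DEC_BIAS;
    return (code & 0x80) ? (int16_t)-mag : (int16_t)mag;
}

/* Fill the table once at start-up; a mixing loop then just does
   dec_table[sample_byte] for every voice, no bit-twiddling needed. */
void build_dec_table(void)
{
    for (int i = 0; i < 256; i++)
        dec_table[i] = (float)decode_code((uint8_t)i);
}
```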

Closing Words

That's all folks. In the future I hope to write about MP3 (as soon as I find some good, clear documentation). I might even read some official documents before I write my next article...

...Nah, it hasn't bothered me so far (grin).

Happy tracking.


TAD #:o)