Decoding a Tree-Less Huffman



This text is intended to follow on from my previous article "A HUFFMAN, BUT NO TREE?" (cotadhuf.txt). In that article I described a simple way to build up the famous HUFFMAN codes using a nice linked-list technique instead of the normal, more memory hungry binary-tree and recursive prefix gathering technique.

If you don't understand the above, then please go and read about HUFFMAN, SHANNON-FANO coding and Minimum redundancy coding. If you are a "newbie" (beginner) programmer then this text may be too difficult for you to understand, you really need to know the Huffman algorithm and theory of compression in general before this will make sense.

Just like me, my articles are short and simple and hopefully interesting.

What is the problem with trees?

Nothing, they are good, fast data structures, but they are usually memory hungry monsters. Not only do you need the actual 'leaf' structures for your Huffman codes, but you also need the internal 'nodes' and this is where the problem lies. The nodes are a vital part of the binary-tree (or any other kind of tree structure), they are used to navigate through all the possible prefix codes until you finish up at the leaf. A prefix code is actually the path you must take to reach a unique leaf, and so the prefix code must also be unique.

During the creation of the Huffman prefix codes a tree is normally used to form a hierarchical structure of symbols with the most frequent being placed towards the top (root) of the tree and the least frequent placed towards the bottom of the tree.

But as I have hopefully demonstrated in my previous article you don't really need to construct a tree. A simple, memory saving approach using linked-lists can be used. The linked-list approach removes the need for internal nodes and so saves a great deal of space.

The Decoding Tree

Okay, so now we can build up the Huffman codes it would be nice to be able to decode them too.

The usual way to decode variable length prefixes is by using a binary-tree. Every time you come to an internal node you read a bit from your bitstream and take a left or right turn until you finally end at one of the leaf nodes.

But one of the problems with Huffman, or any other kind of prefix encoder, is that the decoder (the depacker/decompressor etc.) needs to know the prefix codes for each and every symbol otherwise it can't decode anything. In the worst case the entire tree is passed from the encoder (packer) to the decoder (depacker) in the compressed data-stream. If the tree is large then this can mean a lot of extra data and can possibly wipe out any saving that the Huffman created in the first place. Also if the tree is large it usually means that the amount of compression wasn't that great.

The Prefix List

One way to avoid having to pass the entire tree to the decoder is by passing the frequency list of each symbol, so then the tree can be constructed in exactly the same way as the encoder did using these frequency counts.

But this would mean that the decoder would require a lot of extra program code to sort through the symbol frequency list, construct the tree and gather all the prefix codes. In effect, it would do around 70% of the same work that the encoder did. For an archive program where the encoder/decoder is contained within the same program this isn't too bad because most of the same routines can be reused, but for a separate decoder this isn't good.

It seems that the tree either needs to be constructed by the decoder or passed to it together with the compressed data-stream.

But in fact we can use another simple list to perform a similar task to the tree, except we don't need any internal nodes.

We know that each prefix code is unique, and we also know that it is made up from a variable number of bits. Imagine we have 4 symbols "A" to "D" and that we already have the Huffman prefix codes and their lengths in bits.

             Prefix     Bits        Symbol
             ------     ----        ------
               0         1            "A"
               10        2            "B"
               110       3            "C"
               111       3            "D"

Because each prefix is unique, no two symbols can share the same code. It is also impossible for one complete prefix to appear at the beginning of another, longer prefix. Look at symbol "A": its prefix is 0, and no other prefix starts with 0. Also look at symbol "B": there is no other prefix which also starts with 10.

             Prefix     Bits        Symbol
             ------     ----        ------
               0         1           "A"
 ** ERROR **   00        2           "B"
 ** ERROR **   001       3           "C"
 ** ERROR **   000       3           "D"

The above 3 codes are all invalid because part of their code is actually the complete prefix used by the symbol "A". What I am trying to say is that no prefix of length N can appear as the first N bits of another, longer prefix.

Don't worry, the Huffman algorithm takes care of keeping all these prefix codes unique for us, so there is no need to lose sleep.

The Decoder

Okay, enough waffle.

             Prefix     Bits        Symbol
             ------     ----        ------
               0         1           "A"
               10        2           "B"
               110       3           "C"
               111       3           "D"

You can actually decode Huffman codes with just this list. This idea was suggested by a muzak-mad friend ("Yo Phil, mucho Lardo") when we were talking about depacking fax bitstreams.

Right, you align each prefix with the MSB (bit 7, 15 or 31 etc.) and pad the remaining bits with zeros. I have used 'x' to denote these padding bits and used a byte here; you will probably need a word or dword.

             Prefix    N-Bits       Symbol
             ------    -----        ------
            0xxxxxxx     1           "A"
            10xxxxxx     2           "B"
            110xxxxx     3           "C"
            111xxxxx     3           "D"

Now you simply search through the entire list of prefix codes checking the top N bits against N bits in your bitstream. Once you find a match you output the symbol value (for example, "A", "B" or "C" etc.) and then advance N bits in your bitstream.

You basically search the bitstream until you find a prefix code and then you advance past it and onto the next prefix code.

There are a few important things to remember here:

 1. ALL the prefix codes must be pre-aligned to the left (MSB).
 2. Your bitstream rack (byte, word or dword) must also be left-aligned.
 3. You must be able to advance through your bitstream at different rates.
 4. You need to be able to pre-fetch up to 16 bits from your bitstream.

Some Example Code

Like I have said before, this code is intended to inspire others to write their own faster, shorter, bug-free routines.

To reduce code and memory space I will use the index within the list to denote the symbol value. So index=0 denotes symbol 0, and index=65 denotes symbol 65 (the character "A").

For every possible symbol value (0 to 255) we have a prefix value and a prefix-bit count (length in bits of the prefix code).

              [DS:SI] --> bitstream
              [ES:DI] --> output buffer

 prefixes     dw      256 dup (?)     ; array of 256 prefix codes
 prefbits     dw      256 dup (?)     ; array of 256 prefix lengths

 byterack     db      ?               ; temporary shift byte
 rackflag     db      ?               ; marker flag (for every 8th bit)

              mov     dh, [si]
              inc     si
              mov     dl, [si]
              inc     si              ; DX = first 16-bits from bitstream
              mov     [rackflag], 80h ; signal that 'byterack' is empty

Here is some example 80x86 to decode one symbol from the bitstream.

              sub     bx, bx          ; FOR bx = 0 to 255 (word index)
 find:        mov     cx, prefbits[bx]
              dec     cx
              js      short skip      ; length < 1 bit? then skip this entry
              mov     ax, 8000h
              sar     ax, cl          ; create a bitmask for the top N bits
              and     ax, dx          ; isolate top N bits in bitrack
              cmp     ax, prefixes[bx]
              jz      short found     ; have we found the prefix code?
 skip:        add     bx, 2
              cmp     bx, (2*256)
              jnz     short find      ; check all 256 prefix-codes

You may wish to add some error code at this point. After all 256 possible symbols have been searched but no prefix code match was found. This should NEVER occur unless a prefix-code is incorrect or the compressed bitstream data has an error.

We have found the prefix, so we must output the symbol for this prefix. This is a simple case of dividing our index by 2 and writing the result.

 found:       mov     ax, bx          ; divide the index (BX) by 2
              shr     ax, 1           ; for the actual symbol value
              stosb                   ; then output the byte

We need to advance through our bitstream by N bits (where N is the number of bits in the prefix-code). We shift in N bits into the bitrack DX register.

              mov     cx, prefbits[bx] ; N bits to shift in
              mov     al, [byterack]   ; get byterack
 shiftin:     rol     [rackflag], 1    ; is our byterack empty?
              jnc     short notempty   ; no, so don't reload it
              lodsb                    ; else reload byterack
 notempty:    add     al, al           ; move next bit from byterack
              rcl     dx, 1            ; into the DX bitrack.
              loop    shiftin          ; repeat for all N bits

              mov     [byterack], al   ; store our byterack

All the above code does is to read one bit from the bitstream and shift it into the DX 'bitrack' register. I know this is a SLOW way to get N bits, but the idea in the article is to try and reduce code size. If you want a faster way to get N bits from a bitstream then try looking in a recent issue of Hugi.


The algorithm presented above may cause a few problems because it needs to pre-fetch a word (for the DX register), so it may read past the end of your bitstream. This could cause a protection fault, or a disk read error, because you are trying to read past the end of the file.

You can always fudge this by adding a dummy word at the end of your compressed data, not ideal, but what is these days?


At the present time the decoder needs a list of prefix codes and prefix lengths in order to decode. This means about 1024 bytes are needed using a 16-bit prefix code and a 16-bit prefix length for all 256 symbols.

This can be improved by only encoding prefix-lengths as 4-bit counts followed by the prefix itself (if the count <> 0). This would mean a maximum of 640 bytes would be needed. This assumes that all 256 symbols have a 16-bit prefix code, which should never happen; in most cases these codes will only be between 2 and 10 bits depending on the actual data being compressed.

You may consider that this 4-bit count for each of the 256 symbols is still too much, in which case you could use a single-bit flag to indicate unused symbols which have no prefix code and a zero length.

In terms of compression Huffman isn't that good, something like a dictionary based LZ algorithm will yield far better savings. But tagged on the end of other compression techniques Huffman can still save a few bytes and that Ladies and Gentlemen is what compression is all about.

Closing Words

Well I hope you understood all the above and the tree-less Huffman. The code will not win any prizes for speed, but it hopefully demonstrates that trees aren't always needed and in some cases just eat up more memory and coding time.

There are millions of different ways to put 80x86 instructions together, so why just stick with the same, old boring ways that everyone else uses?



TAD #:o)