Some Thoughts about LZ Compression

TAD

Introduction

In this article I hope to explain some ideas and general guidelines which may be useful to know before you start coding your fantastic new LZ compression engine.

I reckon that every coder, before they begin to implement their latest encoding scheme, probably thinks, "Yeah. This will beat every other packer out there; nobody has thought of this special case before." But sadly a thousand other coders probably have, and the final compression rate is never as good as you thought it would be.

One of the golden rules of compression is that there is never just one special case which you need to deal with; there are millions. Always try as many different forms of input data as possible: just because your packer compresses one particular file well doesn't mean it will compress all files that well.

And don't forget to design for the worst case, as well as the best one!

LZ77 and dictionaries

If you don't know what LZ77 means, then look it up in your dictionary.

Like many others I have read The Data Compression Book by Mark Nelson, but I felt that there was a large gap in its treatment of the LZ algorithm. It is very widely known that rather than using a fixed number of bits for the LZ 'copy' commands and/or parameters, you can form a more efficient encoding scheme using variable-length (and unique) prefix codes.

E.g.

              0 xxxxxxxx              ; Output 1 RAW byte (the xxxxxxxx bits)
              10 xxxx                 ; Copy 1 byte at the 4 bit offset xxxx
              11 xxxxxxxx             ; Copy 2 bytes at the 8 bit offset

Of course this is just a very simple example, but the idea is the same in almost all LZ-based encoding schemes. For difficult input streams which are hard to compress, the overhead for each 'RAW' byte should be as small as possible; for highly repetitive input streams (where many bytes are frequently repeated) you should again use the smallest number of bits to encode each copy command.
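
To make the decoder side concrete, here is a minimal C sketch of how those three commands might be decoded. This is my own illustration, not part of any particular packer: the MSB-first bit packing, the BitReader/get_bits helper and the assumption that an offset field of 0 means a distance of 1 are all choices made just for the example.

    #include <stddef.h>

    typedef struct { const unsigned char *src; size_t bitpos; } BitReader;

    static unsigned int get_bits(BitReader *br, unsigned int n)
    {
        unsigned int v = 0;
        while (n--) {
            v = (v << 1) | ((br->src[br->bitpos >> 3]
                             >> (7 - (br->bitpos & 7))) & 1);
            br->bitpos++;
        }
        return v;
    }

    /* Decode out_len bytes using the 3-command scheme above. */
    void decode(const unsigned char *src, unsigned char *out, size_t out_len)
    {
        BitReader br = { src, 0 };
        size_t pos = 0;

        while (pos < out_len) {
            if (get_bits(&br, 1) == 0) {            /* 0 xxxxxxxx : 1 RAW byte */
                out[pos++] = (unsigned char)get_bits(&br, 8);
            } else if (get_bits(&br, 1) == 0) {     /* 10 xxxx : copy 1 byte   */
                size_t off = get_bits(&br, 4) + 1;  /* distances -1..-16       */
                out[pos] = out[pos - off];
                pos += 1;
            } else {                                /* 11 xxxxxxxx : copy 2    */
                size_t off = get_bits(&br, 8) + 1;  /* distances -1..-256      */
                out[pos]     = out[pos - off];
                out[pos + 1] = out[pos + 1 - off];
                pos += 2;
            }
        }
    }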

As you can see from the 3-command example above, we can cope with copying 1 or 2 bytes with a maximum offset of -16 and -256 respectively. But we could extend this prefix idea much, much further.

For example, we could add another prefix command to encode a 2-byte copy at a 4-bit offset.

E.g.

              0 xxxxxxxx              ; Output 1 RAW byte (the xxxxxxxx bits)
              10 xxxx                 ; Copy 1 byte at the 4 bit offset xxxx
              110 xxxxxxxx            ; Copy 2 bytes at the 8 bit offset
              111 xxxx                ; Copy 2 bytes at the 4 bit offset

This has some important attributes:

1) We can now save an extra 3 bits on every 2-byte copy by using the 4-bit offset rather than the 8-bit offset.

2) Because we can encode an offset of -16 to -1 using the 4-bit command we don't need to encode these values in the 8-bit command. So we can begin the 8-bit offset with a starting value (bias) of -17.

This extends the 8-bit range to -272 to -17, thus improving our chances of finding a 2-byte match in our input data stream (see the sketch after this list).

3) On the negative side, the 8-bit offset command has grown by an additional bit for each and every occurrence of it. But hopefully the savings from the 4-bit offset command will cancel out this extra cost.
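
To make point 2 concrete, here is how the two 2-byte copy commands might look inside the decoder sketch from earlier, once the leading '11' bits of the prefix have been read. Again this is just an illustration; the bias is simply added back when the 8-bit field is decoded.

    /* Continuing the earlier decoder sketch: we are inside a copy command
       and the leading '11' prefix bits have already been consumed. */
    if (get_bits(&br, 1) == 0) {        /* 110 xxxxxxxx : 2 bytes, 8-bit offset */
        off = get_bits(&br, 8) + 17;    /* bias of -17: distances -17..-272     */
    } else {                            /* 111 xxxx : 2 bytes, 4-bit offset     */
        off = get_bits(&br, 4) + 1;     /* distances -1..-16                    */
    }
    out[pos]     = out[pos - off];
    out[pos + 1] = out[pos + 1 - off];
    pos += 2;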

The Optimum Prefix Tree

We can fine-tune our prefix commands as much as we want; we could add a copy-3 command, a copy-4 and so on... (Beyond a certain point it is better to encode a fixed-length bitfield, such as a 4-bit or 8-bit count field, after the command rather than squeeze more and more prefixes in.) As we already know, simply adding more and more prefixes can have a negative effect on the overall compression, because it burdens other commands with the extra bit which is needed to keep each prefix unique.

But what is the best way to encode these prefix codes? And which prefix codes do we actually need?

Sadly there is NOT one set of prefix commands which is the best for every kind of input data stream. One encoding scheme might be better for one data file and a totally different one might be better for another data file.

This is the reason why all those hundreds (or thousands) of different LZ based compression engines end up with totally different compression rates for different files.

Adaptive Prefix Commands

One way to tackle the prefix problem is to use an adaptive set of commands which would modify themselves as we progress through the input data stream. This would mean it could exploit local redundancies more than a fixed set of commands could possibly do.

But this is not always desirable, because the decoder must also perform all the adaptive book-keeping operations in exactly the same way as the encoder does. The reason why the LZ method was chosen in the first place is its very simple and very fast decompression process. Other algorithms like LZW can often compress just as well, but the dictionary book-keeping stage makes them far slower, and LZW requires some additional memory to store the dictionary.

A 2-stage LZ method

One way we could overcome the adaptive book-keeping problem, while still being reasonably flexible for different data streams, would be to perform a 2-stage compression process. The first stage would collect statistics about the redundancy of the input data stream. These would be used to determine the most frequent offsets and counts for the copy commands. Once these have been found, the second stage would build up the prefix codes for the commands based upon these statistics.

These commands should of course be chosen based upon the total saving, and not the length of a single prefix code (as the old, static model would be). In some cases, where the input data is highly redundant, the RAW prefix could be many bits long, perhaps even 8 or more. This may at first seem wrong, as each RAW byte would now be 16 bits long instead of the original 8 bits, but if the most commonly used copy command is now encoded in a single bit then the overall total would still be smaller.
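
As a sketch of how stage 2 might weigh up the candidates, the decision can be reduced to a simple "total bits saved" figure per command. The Command structure, the 9-bits-per-RAW-byte figure and the helper names below are all assumptions made for this illustration; the real numbers come from the stage-1 statistics pass.

    #include <stdlib.h>

    /* One candidate copy command, as counted by the stage-1 statistics pass. */
    typedef struct {
        const char   *name;          /* e.g. "copy 2, 4-bit offset"              */
        unsigned long freq;          /* how often stage 1 found it usable        */
        unsigned int  field_bits;    /* offset/count bits that follow the prefix */
        unsigned int  bytes_covered; /* output bytes one use of it produces      */
        unsigned int  prefix_bits;   /* filled in after the ranking below        */
    } Command;

    /* Bits saved by one use, ignoring the (not yet chosen) prefix: the bytes
       it covers no longer cost a RAW encoding (1 prefix bit + 8 data bits
       each, as in the simple scheme earlier), only the field remains. */
    static long saving_per_use(const Command *c)
    {
        return (long)c->bytes_covered * 9 - (long)c->field_bits;
    }

    static long total_saving(const Command *c)
    {
        return (long)c->freq * saving_per_use(c);
    }

    /* Sort candidates so the largest TOTAL saving comes first; the encoder
       then hands the shortest prefixes to the commands at the front. */
    static int by_saving(const void *a, const void *b)
    {
        long sa = total_saving((const Command *)a);
        long sb = total_saving((const Command *)b);
        return (sb > sa) - (sb < sa);
    }

    /* usage: qsort(cmds, num_cmds, sizeof(Command), by_saving); */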

In short the actual input data stream should dictate the LZ commands used, so we can build a custom decompressor for each data stream. This should give a better compression rate than a fixed set of LZ commands could ever possibly give.

How do we decode these commands?

Most LZ compression methods rely on a static list of RAW and copy commands which are hardcoded into a small decompression routine. So how can we decode the data stream which our fantastic new 2-stage engine has produced?

One way would be to use a similar technique to most Huffman decoders: a binary tree. In fact a multi-way tree would be faster, with each node specifying the number of bits which should be used to form the branch/path index to its children. But the compressed data stream would need to be burdened with this decode tree.
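
Here is a small sketch of such a multi-way node and the walk through it, reusing the BitReader helper from the earlier decoder sketch. The exact node layout is only an illustration; the point is that each node names how many bits index its child table.

    /* A multi-way decode tree: each internal node says how many bits to read
       next, and those bits index directly into its child table. */
    #define MAX_CHILDREN 256            /* up to 8 index bits per node */

    typedef struct Node {
        int           is_leaf;
        int           command;          /* the decoded command, if is_leaf */
        unsigned int  index_bits;       /* bits to read, if not a leaf     */
        struct Node  *child[MAX_CHILDREN];
    } Node;

    /* Walk from the root, reading index_bits at each node until a leaf. */
    static int decode_command(const Node *n, BitReader *br)
    {
        while (!n->is_leaf)
            n = n->child[get_bits(br, n->index_bits)];
        return n->command;
    }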

Another approach could be to attach the decompression code itself to the header of the compressed data file/stream. This would be harder to build than the tree, but of course would give far quicker decompression. The downside is that the compressed data is now machine/CPU dependent (not to mention Real/Protected-mode dependent). And perhaps the worst disadvantage is the possible risk of some virus code being hidden in the decompression code.

One way to overcome these problems could be to employ a tokenized scheme for the decompression code. Rather than storing the actual 80x86/68000 instructions, we store token bytes which describe the decompression algorithm.

E.g.

            Token                      Action
            --------                   -------------------------------
              00                       end token
              01 xx                    use xx bits as the offset field
              02 xx                    use xx bits as the count field
              03                       read 1 RAW byte
              04                       output the current byte
              05                       perform the copy
              06                       define a label
              07 xx                    jump to label xx

Of course the above tokens are just examples; a real decoder would need to include all of the usual instructions: ADD, SUB, CMP, JNZ, JZ etc...

Again this method would increase the size of the compressed data and these tokens would need to be converted by the target machine into real instructions or interpreted (which would be much slower). The advantage of converting these tokens into real instructions over using multi-way trees is that the decompression speed would be far greater, especially for large files.
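
A rough sketch of how such a token program could be interpreted is shown below. The operand formats (for instance giving the label token an id byte) and the idea that token 03 reads the RAW byte into a register which token 04 then outputs are my own guesses; the conditional tokens mentioned above (CMP, JNZ and so on) are left out to keep the sketch short. The BitReader helper is the one from the first sketch.

    /* Token values follow the example table above. */
    enum { TOK_END, TOK_OFFBITS, TOK_CNTBITS, TOK_RAW, TOK_OUT, TOK_COPY,
           TOK_LABEL, TOK_JUMP };

    static void run_tokens(const unsigned char *prog, BitReader *br,
                           unsigned char *out, size_t out_len)
    {
        const unsigned char *labels[256] = { 0 };  /* label id -> position */
        unsigned int off_bits = 8, cnt_bits = 4;   /* defaults, see 01/02  */
        unsigned char cur = 0;                     /* last RAW byte read   */
        size_t pos = 0;
        const unsigned char *ip = prog;

        while (pos < out_len) {
            switch (*ip++) {
            case TOK_END:     return;
            case TOK_OFFBITS: off_bits = *ip++;                      break;
            case TOK_CNTBITS: cnt_bits = *ip++;                      break;
            case TOK_RAW:     cur = (unsigned char)get_bits(br, 8);  break;
            case TOK_OUT:     out[pos++] = cur;                      break;
            case TOK_COPY: {              /* uses the current field sizes */
                size_t off = get_bits(br, off_bits) + 1;
                size_t cnt = get_bits(br, cnt_bits) + 1;
                while (cnt-- && pos < out_len) {
                    out[pos] = out[pos - off];
                    pos++;
                }
                break;
            }
            case TOK_LABEL:               /* '06 id' - id byte added here */
                labels[*ip] = ip + 1; ip++;
                break;
            case TOK_JUMP:                /* assumes the label came first */
                ip = labels[*ip];
                break;
            default:          return;     /* unknown token: give up       */
            }
        }
    }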

Closing Words

I hope you have enjoyed this article. One of the really interesting things about compression (like coding in general) is that there is always something new to learn, and just when you think you know it all, along comes something completely new which is more efficient, quicker and easier to code.

I was very surprised when I read about LZP in a previous issue of Hugi, so much so that I tried the algorithm out. But sadly the results were pretty bad, because I only used small 20Kb chunks of files rather than huge 500Kb blocks (like the author suggested). The problem with LZP is that only ONE prediction is made, based upon the previous N bytes. If this prediction fails (as it does far too many times for small files) then the data is not compressed. The older dictionary-based LZ methods, involving offsets and searching for the largest matching block, have far more chance of finding a match. For example an 8-bit offset gives 256 different chances to find a match, whereas the LZP method has only 1.
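
As a sketch of that single-chance behaviour, here is a generic illustration of the LZP idea (not the exact algorithm from the earlier Hugi article; the hash, the table size and the 3-byte context are arbitrary choices made for the example):

    #define LZP_TABLE_SIZE 4096

    /* For each context hash, the position where that context was last seen.
       (A zero entry simply means "not seen yet" in this sketch.) */
    static size_t lzp_table[LZP_TABLE_SIZE];

    /* Hash of the 3 bytes just before 'pos'; caller ensures pos >= 3. */
    static unsigned int lzp_hash(const unsigned char *buf, size_t pos)
    {
        return ((buf[pos - 1] << 7) ^ (buf[pos - 2] << 4) ^ buf[pos - 3])
               & (LZP_TABLE_SIZE - 1);
    }

    /* Return the ONE predicted match position for 'pos' and remember 'pos'
       as the new prediction for this context.  If the prediction is wrong
       there is no second candidate - unlike an LZ77 offset search. */
    static size_t lzp_predict(const unsigned char *buf, size_t pos)
    {
        unsigned int h   = lzp_hash(buf, pos);
        size_t predicted = lzp_table[h];
        lzp_table[h] = pos;
        return predicted;
    }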

Perhaps some form of LZP and LZ77/78 hybrid would be worth considering. There is no doubt that the sheer speed and simplicity of LZP makes it an attractive algorithm to use, but as always there is room for improvement.

Greetings from the bitstream.

Regards

TAD #:o)