LZT - A Variation on the LZP Algorithm

TAD

Introduction

After reading about the LZP algorithm in a previous issue of Hugi I tried it out, but only on small 20 kb chunks of files. And just as the author said in that LZP article, the compression rate was not that good, in fact it was pretty bad. This is no doubt due to the fact that with small files not enough statistics have been seen to give a good level of prediction, and without good prediction the compression would be disappointing.

Earlier today I had an idea, one which had begun forming shortly after re-reading some of the text files I had posted to Jibz a few days ago (hi Jibz).

From my limited series of tests so far it does indeed seem to compress data better than LZP, especially for small/medium sized files. Although the algorithm is still "wet-behind-the-ears" I want to publish it somewhere to give other coders the chance to try it out themselves and to confirm or dismiss my initial results.

What is LZT?

It is basically another form of the LZP algorithm, but it is a little easier to understand and code (as if LZP wasn't easy enough?)

If you understood the LZP algorithm, then this will be so easy that it hurts.

The name LZT comes from "LZ + Tail" because of the way in which length information and RAW bytes ("literal" bytes as Jibz calls them) are encoded.

Is it LZP or LZT?

The LZT and LZP algorithms could almost be called twins; their basic structure and encoding methods are very similar, probably identical. Having read about LZP only in Hugi and nowhere else, I cannot give a very detailed description or draw a firm dividing line between it and LZT. No doubt the algorithm described here has already been published elsewhere, possibly with the original LZP paper.

So how does it work?

Just like LZP, LZT is based upon a prediction, but instead of using the last two or three previous bytes together with an 8 kb hash table to form the prediction, LZT uses just the single previous byte together with a 256-entry table. In fact it isn't really a hash at all; it is just the last recorded address of a particular byte.

The 256 address entries in the table are used as starting points in the previously seen data stream to compare (and count) against the remaining bytes in the data stream. To choose which one of these 256 possible entries to use we simply take the last encoded byte (present position - 1) as an index. The entry in the table is then updated with the address of our present position, which records the position just after the last occurrence of that byte in the data stream.

Now that we have a starting position somewhere in the previously seen data stream we can compare and count every byte which matches the remaining bytes in the data stream.
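
To make this concrete, here is a rough C sketch of the prediction step just described. All the names here (table, predict and so on) are my own and the details are only an illustration; the real code is the 8086 listing further down.

    #include <stddef.h>

    static size_t table[256];   /* table[b] = position just after the last occurrence
                                   of byte value b, 0 = byte not seen yet             */

    /* For position pos (pos >= 1): fetch the predicted position for the bytes
       that follow, then update the table with the current position.           */
    static size_t predict(const unsigned char *src, size_t pos)
    {
        unsigned char prev = src[pos - 1];   /* the previously encoded byte      */
        size_t pred = table[prev];           /* position just after its last use */
        table[prev] = pos;                   /* remember where we are now        */
        return pred;                         /* 0 = no prediction available      */
    }

Counting the match is then just a matter of comparing src[pred], src[pred+1], ... against src[pos], src[pos+1], ... until they differ, exactly as described above.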

The encoding Tail.

The method of encoding length information is very simple. A variable number of '1' bits is written, followed by a single '0' bit. The number of '1' bits represents the number of bytes which should be copied from the predicted address.

       Bits                          Action
      -------              --------------------------
         0               - RAW byte follows
         10              - Copy 1 byte, then 1 RAW byte follows
         110             - Copy 2 bytes, then 1 RAW byte follows
         1110            - Copy 3 bytes, then 1 RAW byte follows
         11110           - Copy 4 bytes, then 1 RAW byte follows

As you can see the '0' marks the end of the tail, and the number of '1' bits represents the number of bytes to copy from the predicted address.
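
As an illustration, here is a hedged C sketch of one complete encoding step using this tail scheme, building on the predict() sketch above and assuming put_bit()/put_byte() bit-writer helpers (a possible version of those is sketched after the assembly listing). One detail is easiest to see in the assembly below: when there is no prediction at all (the very first byte, or the first occurrence of the previous byte) the literal is written with no flag bit, because the decoder can tell for itself that no tail is possible.

    #include <stddef.h>

    extern void put_bit(int bit);        /* write a single '0' or '1' bit */
    extern void put_byte(int value);     /* write 8 bits                  */

    /* Encode one step at *pos and advance *pos past everything consumed. */
    static void encode_step(const unsigned char *src, size_t *pos, size_t len)
    {
        size_t pred = (*pos > 0) ? predict(src, *pos) : 0;

        if (pred != 0) {
            while (*pos < len && src[pred] == src[*pos]) {
                put_bit(1);              /* one '1' per byte copied from the prediction */
                pred++;
                (*pos)++;
            }
            if (*pos == len)
                return;                  /* input exhausted mid-tail: nothing more to write */
            put_bit(0);                  /* '0' ends the tail */
        }
        put_byte(src[(*pos)++]);         /* the RAW (literal) byte */
    }

Packing a whole buffer is then just a loop: for (pos = 0; pos < len; ) encode_step(src, &pos, len);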

The Juicy Hex Bytes.

And now some 8086 Real-mode code for you to chew over...

         [DS:SI] --> address of data to pack
         CX = number of bytes to pack

         .DATA?
 table   dw 256 dup (?)               ; our history/prediction table
                                      ; NOTE: zero this table before packing,
                                      ; a 0 entry means "byte not seen yet"

         .CODE
         sub   dx, dx                 ; number of bytes already packed
  pack:
         test  dx, dx
         jz    short new              ; 0 previous bytes?

         mov   bl, [si-1]
         mov   bh, 0                  ; BX = previous byte
         add   bx, bx
         mov   ax,table[bx]           ; [DS:AX] --> previous addr of byte
         mov   table[bx], si          ; record current addr of byte

         test  ax, ax                 ; 1st occurrence of this byte?
         jz    short new
         mov   bx, ax                 ; [DS:BX] --> predicted block
  tail:
         mov   al, [bx]
         cmp   al, [si]
         jnz   short raw1             ; no byte match?

         call  output_1               ; output a '1' bit
         inc   bx
         inc   si
         inc   dx                     ; advance by 1 byte
         loop  tail

         jmp   short done

  raw1:
         call  output_0               ; output a '0' bit (end of tail)
  new:
         mov   al, [si]
         call  output_byte            ; output byte(AL)
         inc   si
         inc   dx
         loop  pack                   ; advance by 1 byte
  done:

Of course you need to write the "output_0", "output_1" and "output_byte" routines yourself. It should be obvious what they do: the first two write a single '0' or '1' bit to the output bitstream, and the last writes a whole byte (8 bits) to the output bitstream.
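
For completeness, here is one possible C equivalent of those three routines: a simple MSB-first bit writer into a memory buffer. The buffer size and the flush routine are arbitrary choices of mine; it is only a sketch of what the routines need to do.

    #include <stddef.h>

    static unsigned char out_buf[1 << 16];   /* packed output (size is arbitrary here) */
    static size_t        out_pos = 0;        /* next free byte in out_buf              */
    static unsigned char bit_acc = 0;        /* bits collected so far                  */
    static int           bit_cnt = 0;        /* how many bits are in bit_acc           */

    void put_bit(int bit)                    /* output_0 / output_1                    */
    {
        bit_acc = (unsigned char)((bit_acc << 1) | (bit & 1));
        if (++bit_cnt == 8) {                /* a whole byte collected: store it       */
            out_buf[out_pos++] = bit_acc;
            bit_acc = 0;
            bit_cnt = 0;
        }
    }

    void put_byte(int value)                 /* output_byte: 8 bits, MSB first         */
    {
        for (int i = 7; i >= 0; i--)
            put_bit((value >> i) & 1);
    }

    void flush_bits(void)                    /* pad the final partial byte with zeros  */
    {
        while (bit_cnt != 0)
            put_bit(0);
    }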

Improvements.

Of course this is still a very crude algorithm and can be improved in many ways. The most obvious improvement would be to encode a very long series of 1's using a fixed-size bitfield (4 or 8 bits, for example). For a very small number of byte copies this would harm the compression rate: in its present form a Copy-1 command is encoded in just 2 bits and a Copy-2 in just 3, whereas with a 4-bit field they would need 6 and 7 bits.

For a count of, say, 16 bytes a tail of 17 bits is needed. This means the maximum possible compression would be 1 bit per byte, but for large areas filled with the same byte, or a group of bytes repeated many times over, this could be done more efficiently using a bitfield. So if we get, for example, four 1's in a row then we switch to using a bitfield of 4 or 8 bits to represent a Copy count of 1-16 or 1-256, instead of encoding 16 or 256 1's.
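
The exact escape scheme is open to experiment, but as one possible reading of the idea above, the following sketch keeps the plain unary tail for counts below four and, once four 1's have been written, switches to an 8-bit field holding the rest of the count. The numbers are only the examples from the text, and the details (where exactly the escape kicks in, what follows the field) are my own guess.

    #include <stddef.h>

    /* Write the copy count n, escaping to an 8-bit field after four 1's.
       Returns the count actually encoded so the caller can advance by it. */
    static size_t put_copy_count(size_t n)
    {
        if (n < 4) {
            for (size_t i = 0; i < n; i++)
                put_bit(1);              /* short run: plain unary tail            */
            put_bit(0);                  /* '0' terminator as before               */
            return n;
        }
        for (int i = 0; i < 4; i++)
            put_bit(1);                  /* four 1's in a row act as an escape     */
        size_t extra = n - 4;
        if (extra > 255)
            extra = 255;                 /* the field is 8 bits, so cap the run    */
        put_byte((int)extra);            /* remaining count as a fixed bitfield    */
        return 4 + extra;
    }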

The LZT algorithm, like LZP, does have one bad characteristic: it relies on a single prediction. This is very good if the prediction is correct, because a single bit is all that is needed to denote a match, but if the prediction fails then you have no alternative but to encode a RAW 'literal' byte. The older LZ77 algorithm, or at least an adaptation of it, still seems to have the edge over LZP and LZT at the present time for small/medium sized data files.

Closing Words

The sheer power of LZP and LZT is their quick compression speed and the fact that the compression rate improves as the size of the input data grows. The more complicated and slightly more memory-hungry hash technique used by the LZP algorithm will probably give far better results for very large files (one meg or more), so perhaps LZT is best suited to smaller files. This seems sensible because the decompression algorithm is very simple and so suits small files better than LZP would.
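
To back up the point about decompression being simple, here is a hedged sketch of a matching C decoder. It assumes get_bit()/get_byte() helpers that mirror the bit writer sketched earlier, and that the unpacked size is stored somewhere alongside the packed data.

    #include <stddef.h>

    extern int get_bit(void);                /* read the next single bit */
    extern int get_byte(void);               /* read the next 8 bits     */

    /* Unpack out_len bytes into dst, mirroring the packer step by step. */
    void lzt_unpack(unsigned char *dst, size_t out_len)
    {
        size_t table[256] = {0};             /* same prediction table as the packer    */
        size_t pos = 0;

        while (pos < out_len) {
            size_t pred = 0;
            if (pos > 0) {
                unsigned char prev = dst[pos - 1];
                pred = table[prev];          /* where that byte was last seen          */
                table[prev] = pos;           /* update, exactly like the packer        */
            }
            if (pred != 0) {
                /* copy one byte from the prediction for every '1' bit in the tail */
                while (pos < out_len && get_bit() == 1)
                    dst[pos++] = dst[pred++];
                if (pos == out_len)
                    break;                   /* packer stopped mid-tail at end of data */
            }
            dst[pos++] = (unsigned char)get_byte();   /* RAW literal byte              */
        }
    }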

Perhaps a LZP and LZT hybrid could be developed with the LZT being used for the first 400 kb or so, and then the LZP takes over...

The beauty of them is the way they effectively remove the need for slow block searches and complicated encoding schemes. This gives coders the chance to compress the same block of data a number of times using different parameters and keep the best result, something which you would never even dare to consider with the old LZ77 algorithm and its painfully slow searches.

I hope someone has enjoyed this slight variation on the LZP algorithm. If you have written an LZP/LZT compressor, or have found some useful tricks, then why not write a quick article? I know many other coders would enjoy reading it.

Have fun.

Regards

TAD #:o)