LZW - Some Patent Free Ideas For Improvement/Further Research

WARNING !!

I am not 100% fluent on the LZW algorithm, and like all my other articles this has been written straight off the top of my head, so please overlook any stupid errors or misunderstanding. Anyway an article is supposed to inspire you to develop and write your own code and not to be fodder for cut-n-paste lamers.

Introduction

As the LZW is based upon the PATENT-FREE LZ algorithms of Lempel and Ziv, you should be able to make a "LZW-ISH" algorithm without stepping on too many toes. Personally I really, really, really dislike this hoarding of algorithms. But what really sticks in my claw, is that is was based on the great work of Lempel and Ziv, using their spark of creativity to fill someone else's pockets. Okay, that's the rant out of my system.

If you don't have a modest understanding of the LZW algorithm, then this is NOT the place to learn it.

Terminology

"Sequence": A variable number of bytes in a continuous block. This is often referred to as a "String". In this article they mean the same thing.

"Token": This is a value, or 'code' associated to a particular sequence.

"Prefix": This is a number which preceeds something else, like an area code for an international telephone number. A prefix normally alters the behaviour of the following item.

"Suffix": This is the opposite of a prefix, and is placed after an item. It normally doesn't modify the behaviour of an item but instead acts as a parameter or additional data.

Simple overview of the LZW algorithm

The idea is, like all the best ones, very simple. Whenever we see a block of bytes once we create a sequence from them and assign a token value to it. Now if we encounter these same bytes again then we can use the token instead of the entire sequence of bytes, this way compression takes place. Each sequence has it's own unique token which is usually taken from an increasing counter, so the first sequence has the value 0, then 1, 2, etc.

But for the depacker (decompressor) to work it needs to know what these sequences are (the actual bytes which make up the sequence), and the length in bytes of each sequence. The smart thing about the LZW and LZ78 algorithms is that they don't need to store these sequences in the compressed data stream, instead they are built up by both the packer and the depacker in an identical manner.

After each sequence is encoded using its token, the following byte is added to the end to form a new sequence which is given the next unique token code. This continues and over time very long sequences are built up, leading to better compression.

But when we first start to pack data we don't have any sequences. So how do we encode bytes we haven't seen before?

One way is to use a 1-bit prefix to distinguish between a sequence and one of these 'RAW' (uncoded) bytes. But this adds an extra bit to every token and so can impact on the compression rate.

The LZW algorithm preloads the "dictionary" of sequences with all 256 possible byte values. This means we begin with 256 1-byte sequences and so any possible 'RAW' byte can be encoded using the correct token.

For example, sequence token 66 represents the byte 66 (ASCII 'B').

As more and more sequences are added to the dictionary, more and more bits are used for the token code. So if we have 256 sequences then we can use 8-bits, for between 257 and 512 we use 9 bits, and between 513 and 1024 we use 10-bits, and so on...

An Empty Dictionary?

Preloading the dictionary with 256 sequences is not always a good idea. For ASCII text files only perhaps 100 characters or less are used, so less than half of those 1-byte sequences are actually used.

Only loading those 100 characters means we only need a 7-bit token code and so we are already saving 1-bit on each of these sequence tokens.

But wait. If don't preload all 256 possible byte values, how can we encode a RAW byte which we haven't seen before? And how do we tell the depacker which 100 of the possible 256 values to use?

We can use a special code to denote that a 'new' byte follows. This is normally called an "ESCAPE-CODE".

So now when we start we only have 2 token values, the "RAW" token and the very first data byte. This means we can use a single token bit to distinguish between the two. We can fill a long block of bytes using a single bit.

If we encounter a byte which hasn't been seen before we can use the ESCAPE-CODE followed by the actual byte. A new 1-byte sequence is then built up from this byte and the next, unique token code is assigned to it.

As you can hopefully see, this method will result in fewer sequences and so fewer token codes, and a smaller number of bits for each token.

Recycled and Final bytes

Okay, we have seen that new bytes can be added by using the ESCAPE-CODE. This deals with the problem when a byte is used for the first time, but what about the last time a byte is used?

Well, when we encounter a byte for the final time we could use the ESCAPE-CODE again and resend the byte, but this time instead of creating a new sequence from this byte we delete it. It is easy enough to distinguish between the two (the create, or delete) simply by checking whether the 1-byte sequence is already defined.

So now we can create, or delete 1-byte sequences which should hopefully mean the number of sequences, tokens and bits will be reduced too.

This seems like a reasonable idea, but it can be taken a little further.

Because we know when we encounter this delete situation that this particular byte will never be used again (because it's the final occurance of it), we can also delete every sequence which contains this byte. This operation alone can delete hundreds of sequences, and so should help to reduce the number of bits needed for each token code.

Spare values in the token codes

Because we must use a certain number of bits for a sequence token, there will be many situations when we have a large number of spare token values which haven't been assigned to a sequence yet.

For example, say we had 400 sequences. We need to describe 400 possible token values. For this we need 9 bits, which give a way to describe 512 possible values, 112 more than we actually need.

These spare token codes (401 to 511) could be used as repeat count to fill blocks with a certain byte, or sequence. The normal "offset" based LZ compressor does have the advantage of being able to repeat ("fill") long blocks with the previous byte simply by using an offset of -1 from the current decompression position.

The LZW is pretty bad at encoding these repeated runs because sequences are only built up one byte at a time it takes a long time to encode a large run of bytes. For example say we had a sequence of 100 bytes, the offset based LZ method would be able to encode this block in 1 or 2 commands, whereas the LZW would need to slowly build up longer and longer sequences until it had perhaps 10 sequences each 1 byte longer than the previous one.

Of course there are many different uses to which these spare token codes could be put to.

Faster Sequence Decoding

The normal way to decode the LZW sequences is by using a prefix-token and a suffix-byte. The suffix-byte is pushed onto a stack and the prefix-token is followed like a linked-list onto the next sequence and its suffix-byte is pushed, and so this continues until we meet a single-byte sequence.

All the suffix-bytes are then read off the stack and written to the output data stream. The stack operation is important because it reverses the order in which the suffix-bytes are collected.

This gathering of suffix-byte is performed within a small loop, then another loop is used to pop bytes from this stack.

But as sequences are already present in the already decompressed data we can employ a quick table look-up method. With the table containing the address of the sequence and it's length. So given a token we just need to look-up the sequence address and it's length, then its a simple block-copy (e.g. REP MOVSB). This would make the decompression very fast (at least as fast as a normal LZ offset method).

Closing Words

Well, I hope there was something of interest in here for everyone. And please don't complain if something doesn't work, or you can't understand it. (Hey even I don't understand my own articles sometimes.)

Enjoy.

Regards