LZP-S: Yet Another Idea For The LZP Algorithm

TAD

Introduction

This article describes another idea I've had for the LZP (LZ-prediction) algorithm. If you already know the LZP algo, then this article will be child's play, if not then please go and read about this nice, new algorithm. I am sure that you will really like it.

Please ignore this section.

Earlier today I finally surfed over to Dario Phong's website (something which Dario himself has tried to make me do for some time now... Heh heh... you'll never take me alive copper!). About a quarter way through his LZP article I suddenly got a spark of inspiration and remembered a previous idea which I had while writing some docs for Jibz (hi Jibz! - perhaps this article should be called "JIBZ10.TXT" or something... heh heh).

!! Warning !!

If you're looking for a fast LZP, then look elsewhere. This algo will be much slower than the normal LZP and requires more work (and code) in your depacker.

In this regard the 'S' in LZP-S title, it stands for 'slug' but I think it has some interesting attributes which will be useful (and hopefully good for compression too).

Look into the future...

The secret of the LZP is the 'P' - the prediction. The better it is, the better the compression SHOULD be, but this is not always the case. For some data-streams the prediction does a really bad job whereas a LZ77 algo would do better... but for the purpose of this article I'll stick with just LZP.

The current LZP (as I understand it) is very, very biased towards the last instant of prefix/predictor bytes. It's kinda like a MTF (Move To Front) queue-jumping model where the most recent occurance jumps directly to the top of the list without taking into consideration the weights/frequencies of other phrazes.

I previously described a very simple LZP method which took just the last one byte and used that as the predictor hash value. As you can imagine the prediction success wasn't that good. But the amount of memory needed for the prediction array was very, very small.

The amount of memory needed for the prediction array depends on the precision of the prediction (or 'hash') key. This is a hashed value which is used to form an index into the array of address-pointers. If we use an 8-bit hash value then we need space for 256 address-pointers. In Protected-mode using 32-bit pointers this means 1024 bytes (256 x 4 bytes) for the array.

Now suppose we move up from using just the last one byte to using the last two bytes... Well, the bad news is that we need a 16-bit hash value which translates into an array of 65536 address-pointers (262,144 bytes!).

                    -*-*-*- Reality disabled -*-*-*-

Now let's pretend we wanted to try using the last three bytes... this means a 24-bit hash value and so 16,777,216 address pointers = 67,108,864 bytes!

Okay, so it's not possible in Real-mode with 640 kb (grin). But let's have a peek at a 32-bit hash value.. 4,294,967,295 pointers = 17,179,869,184 bytes.

                     -*-*-*- Reality enabled -*-*-*-

"Hey, stoopid... is there a point to all this nonsense?"

Err... yeah.

Wouldn't it be nice if we could use a 40-bit or 64-bit hash value....?

The optimal hash precision.

One of the most important considerations when coding an LZP is the 'hash' function, its size and precision. If you choose a small precision for the hash key (for example 8 bits) then you only need a 256 element array which does mean that the array will be filled out more quickly than a 16-bit or 24-bit array. This can be both good and bad.

It obviously means that less memory is required for the address-pointers, and it also means that these prediction-entries will be filled up far more quickly and so a prediction (and hopefully a saving) can be made. BUT, if you choose a hash-function which is too low (like bits 3..0) then it is possible that hash collisions will take place, where two or more hash values share the same address-pointer entry. If you are lucky with your data-stream then you may find that these collisions help to compress data.

For example say we find that the last 2 bytes are 4A F3 hex and we use a 12-bit hash function (MOD 4096 in his case) then our hash key is 0AF3 hex. Now suppose later on we encounter that the last 2 bytes are EA F3 hex. It gives exactly the same hash key of 0AF3 hex, so it shares the same address-pointer value. If we are lucky then both predictions are correct.

              4A F3 57 C2 FF 12 hex

              EA F3 57 C2 00 8E hex
             -----

Both sequences (57 C2 FF 12 and 57 C2 00 8E hex) share the same 12-bit hash value of 0AF3 hex.

In this case a collision in the hash table is a good thing since both sequences have some common bytes (57 C2 hex).

And if we are really, really lucky then we can predict the EA F3 xx hex sequence without ever seeing EA F3 hex before (after seeing just 4A F3 hex). In short one previously unseen sequence could be predicted correctly using a 'similar' N byte prefix.

Prefix Ownership.

Depending on the data being compressed the hash collision might be a good or bad thing. One idea to resolve collision could be to keep an array of prefix bytes for each address-pointer entry in the hash table. Now if we get a collision after hashing the last N bytes, we can simply examine the byte at -(N+1) and compare it against our prefix entry. If we find a match then we can be more certain that this hash entry belongs to the actual sequence we are attempting to predict, if not a collision has occurred. At this point you may wish to either:

Encode a raw/literal byte (just an 8-bit byte WITHOUT any 1-bit prefix flag).

Try a different hashing function (e.g. add a bias offset to the hash value).

Progress onto another hash table.

Predict using the entry anyway.

Of course with every address-pointer update you must also update the prefix entry along with it, to record the owner of each pointer.

The prefix byte is just my attempt to help resolve collisions between two or more identical hash values. It does NOT solve the collision, but it does help identify which of the numerous owners the current entry belongs to. Just remember that the owner can change as each entry is updated with a new address-pointer.

Why not use a higher-order table?

Well, if we stepped up from a 16-bit hash key to a 24-bit hash key then memory increases by 256 times. With the 'prefix' idea above it is possible that only an extra 64 kb is needed, instead of 66,846,720 bytes if you used a full 24-bit hash key.

If you use a 24-bit offset instead of 32-bit address-pointers as the entries in the table then you could use the spare 8 bits to store the prefix byte, so by saving 64 kb.

But a higher-order table would have far less collisions and should model the data better to give more reliable predictions and compression, the major problem with order-2 and order-3 is simply memory, and the lack of it.

Filling the gaps.

Using the hash value as a straight index into the address-pointers is a really fast method in terms of speed, but not in terms of memory space. There will be many situations where not all the entries in the table will be used, and this means wasted entries and wasted memory. When you consider using a 24-bit or 32-bit hash value then this problem becomes much more important than it did for an 8-bit or 16-bit hash.

Unless you are compressing files of 2^26 or 2^36 bits in length you will not be using all the entries in a 24-bit or 32-bit table.

One (slow) solution to all these gaps (the unused entries) in the table could be to use a dynamic list of hash keys and address pointers. Now rather than simply indexing into the table with the hash value you must search for a matching key value. If no match is found then a new entry is added to the end of this list. As you can imagine this is much, much slower than simply indexing, but it should help remove all the gaps in the table.

The normal quick/radix/index searching techniques could be applied to help speed up the process of finding a certain entry in the list. You can also limit the size of this list and perform some 'decay' operations to remove really out of date entries in the table if you want to. The only disadvantage to performing a decay operation is that you lose the very old entries. In a way the normal LZP performs a decay operation in the form of overwriting an existing address-pointer entry.

Another solution could be to begin with an 8-bit table and progress onto a 256 element sub-table once that hash value has been encountered to form a 16-bit table. For example, say that the byte F3 hex is never used in the file being compressed, then there is no need to store all the entries which use F3 hex. In the case of a 16-bit table this means 256 entries can be removed which means a saving of 1024 bytes. This might not seem that great, but consider a 24-bit table; this would save a large 262,144 bytes (256 kb) for just that F3 hex byte!

A key-less List.

Thinking about the key and address-pointer list (as mentioned above) it may not be necessary to store the hash values with each address pointer.

The hash key value is formed from the previous N bytes before a certain address and each address is stored in the table. So we can simply fetch the address-pointer value (or offset address) and recalculate the hash key value again by hashing the N bytes before it.

This of course makes the searching a little slower, but it saves having to store each hash key value. For a 12 or 16-bit value this means a word per entry and with 65536 entries thats 128 kb.

The LZP problem.

There seems to be a problem with LZP which I haven't seen mentioned anywhere else, yet. This could be due to my limited knowledge and not having read every LZP article on the net... but hey, this hasn't stopped me in the past (grin).

Most other compression algorithms I have read about and tried out allow an input data-stream to be compressed in stages, e.g. 32 kb blocks so even a 4 Gb file could be compressed easily with a 640 kb machine. But the LZP doesn't seem to allow this kind of processing, instead it needs to keep the entire file in memory at the same time. This is because we are recording address-pointers and not data sequences, and because it is possible that a sequence at the very end of a file will match one at the very start. If we used a small 32 kb buffer, compressed the data in it and then read in another 32 kb of data then the address-pointers would be meaningless because they belong to the old 32 kb block we just threw away.

You could re-initialise the address-pointers in the hash table with each 32 kb block, but it would kill all those valuable pointer entries and would be bad news for compression.

One way to avoid this hash table culling after each block could be to scan all the address-pointers and build up a fake data block using the maximum match length. If our maximum matching length is 16 bytes and we have a 12-bit hash then this means a possible 16 x 4096 byte buffer is needed for the fake data. The hash table address-pointers are updated to use this fake data at the same time.

A slight catch.

The fake block idea might seem reasonable, but it has one small problem. Suppose an address-pointer points into the fake data area instead of our 32 kb chunk buffer. This time we don't need to update the address-pointer or move any data around.

A more logical solution to the fake data buffer would be to store the previous N byte + maximum match bytes as a series of byte strings. This would remove the need for address-pointers, you then have to search each string checking the hash of the first N bytes against the hash of the current position hash value. If you find a hash match, you perform the usual match counting loop against the remainder of the byte string.

You may wonder why I say to store the previous N bytes instead of the hash value (because that would remove the need to perform the hash function for each and every string during the search), well the advantage of having the previous N bytes is that you can try multiple hashing functions and not just one. Of course it does take more CPU time... (boo hoo).

The money shot.

So how much memory would you need for a 128-bit hash value, or 256-bit?

In fact you need ZERO (Nil, zilch, nothing, zipo) memory for the array. The only thing you need is a little CPU time, and because the normal LZP algorithm is so fast, we have a few clock-cycles to spare...

Now for this to work we must perform a one-shot load and compress stage where the entire data being processed remains in memory all the time. For a HUGE file you can still break it down into 32 kb chunks but both the compressor and depacker must process each chunk as an individual, isolated block of data and ZERO any address-pointers or other data each time.

Okay, here goes...

"What are we doing when we hash the previous N bytes together in a normal LZP?"

Well, we are trying to find an identical phraze using the tail end of a previous phraze or collection of raw/literal bytes.

Suppose we have the following collection of symbols:

       <---- previously seen data <---------------- yet to encode ---->
                                               --------- -
  B A B C D J D E J A B C D E J I E A C B A B CD J D Z I A B K D E A ...
    """""           '''''                 ^^^^^--------- -
          (2)             (1)             prefix

We want to try and predict the next phraze (D J D Z I A ....) and we are using the previous 3 prefix symbols (A B C) to form a hash key value.

Because most hashing functions are fairly crude (squeezing 16 or 24-bits down into 12 or less) some collisions will occur. Looking at the above example data-stream you can spot two positions where A B C occurs. A normal LZP would just remember the most recent occurance which would predict the phraze (1) D E J I E .... But phraze (2) D J D E J A ... would be a better match.

Now let's forget all about using address-pointers and performing any hash functions and try a different approach.

Simply take the previous N bytes (3 in this example) and perform a slow backwards search for them within the previously seen data. We first encounter the 3 symbols 'A B C' again at (1) and we did it WITHOUT any address-pointer!

"Hey, that sucks. It's a zillion times slower than LZP."

Yeah, but think about this. We can use the previous 4,5 or 100 bytes to be our prefix search pattern - something a normal LZP hash couldn't do without losing loads of precision and gaining lots of collisions. So we can use this simple (but SLOW) search technique to perform a 24, 32, 64 or 8000 bit hash without losing a single bit of precision.

Also if we approach this search like a normal 'find the longest match' operation then we can perform some statistical operations and apply a more informed prediction based upon 1, 2, 3 or 10000 matches. We can begin with a 1-byte prefix (just looking at the previous 1 byte) and progress onto a 2-byte prefix if we find one, then a 3-byte prefix and so on... Because a prefix like A B C D E F 'should' give a much more reliable prediction than a prefix of A B it should be good news for compression.... but hey, I could be wrong here (I normally am).

Closing Words

Well I hope someone has understood the above. The idea is quite simple but rather slow. It's the old problem in programming of speed against memory. No doubt some kind of linked-list could be employed to help speed up this mollusc-like algo...

(I hope you enjoyed this Dario... I'm sure that you will email me. GRIN.)

Happy LZ-Peeing.

Regards

TAD #:o)