Storing and Searching For Byte Sequences

TAD

Introduction

A common problem which coders face is that of storing and searching for ASCII strings (or in fact, any variable-length sequences of bytes). Anyone who has ever written an assembler or a dictionary-based compression program will understand how important a fast string search and storage method is. In this article I hope to describe some techniques which are fast and, more importantly, easy to understand.

Terminology

"Link": Just like its name-sake this is a way to connect two items together. An address-pointer is stored inside one item and its value gives the address of the next item. By changing the value of a link we can point to an entirely different area of memory.

"Null Link": An invalid value (usually 0) is used as the link's value. Instead of using this value as an address to the next item, it denotes that there is no connected item, i.e. the end of the link chain (end of a list etc.).

"Node": A small, simple data structure which only has a few data items and, most importantly, a number of "Links". These links can be used either to point to a 2nd node, or to point to another data structure.

"Hash": This is a small number which has been created from a much larger value such as an ASCII string, or a number of different data values. As a hash value is such a crude representation of something much bigger it can only be used as the first step in finding an ASCII string or byte sequence.

Byte-by-Byte

Compression algorithms like LZW or LZ78 require you to search through an existing dictionary of byte sequences (or "strings") to locate the longest one which matches the input data. This is basically a slow string-search, which can take a great deal of time, not to mention storage space for each and every byte sequence.

And after you have located the longest matching sequence, you must create a new sequence from this string by adding the next byte. This new sequence is then added to the existing dictionary. This means over time very long sequences can be built up, but this also means more time will be spent searching for the longest string the next time around.

For example, say we found that the longest match was "ABCDEF" and the next byte was "Z" then we would need to add the sequence "ABCDEFZ" to our dictionary. Then if we found "ABCDEFZ" and the next byte was "q" then we would need to add "ABCDEFZq" to the dictionary. As you can see these sequences can add up to a great deal of memory.

But as each sequence is built up from another sequence which is just one character shorter, we can use some nodes and links.

First we need to define a structure for our node:

 node         STRUC

 Character      DB    ?       ; the final byte in this sequence

 SideLink       DW    ?       ; link to the next node which shares
                              ; the same sequence, but has a different
                              ; 'Character' value.

 LongerLink     DW    ?       ; link to next node for the sequence
                              ; which is 1 byte longer than this one.

 node         ENDS

As the above doesn't mean much, let's draw a nice ASCII diagram.

       longer  longer  longer  longer
      A--->-- D--->-- D--->-- E--->-- D
      |               |
 side |               |
 link |               |
      v               v
      B--->-- E       C--->-- D--->-- D
      |                       |       |
 side |                       |       |
 link |                       |       |
      v                       v       v
      C                       X       Y
      |
      |
      v
      W

The above diagram shows 14 nodes; each node has its own 'Character' field (e.g. A, D, E, B, X, Y, W). As each node is 5 bytes long (1+2+2) we need 70 bytes. This may seem like a lot, but it actually describes all these sequences:

        A
        AD
        ADD
        ADDE
        ADDED
        ADC
        ADCD
        ADCDD
        ADCX
        ADCDY
        B
        BE
        C
        W

Finding the longest byte sequence

Okay, the diagram looks very pretty (well, as pretty as ASCII can be), but how does it help with byte sequences?

Well, we begin with the 1st byte that we want to find and compare this against the 1st node's "Character" field. If we get a match then we follow the "LongerLink" pointer and repeat for the next byte that we want to find. This continues until we run out of bytes, encounter a null "LongerLink" or a non-matching "Character" field.

When we do encounter a byte which does not match the "Character" field, we follow the "SideLink" pointer until we either find a match, or encounter a null "SideLink".

E.g.

       lea    dx, dictionary          ; [DS:DX] --> 1st node
       sub    cx, cx                  ; match length = 0

       call   AllocateNode
       mov    di, ax                  ; [DS:DI] --> next free node

 @@find:
       call   InputByte               ; AL = next byte/character to find

 @@follow:
       mov    si, dx                  ; [DS:SI] --> node
       cmp    [si+node.Character], al
       jz     short @@match           ; jump if the node matches our new byte
       mov    dx, [si+node.SideLink]
       test   dx, dx
       jnz    short @@follow          ; follow "SideLink" if <> 0
       mov    [si+node.SideLink], di  ; else add a new "SideLink" node
       jmp    short @@extend

 @@match:
       inc    cx                      ; count number of byte matches
       mov    dx, [si+node.LongerLink]
       test   dx, dx
       jnz    short @@find            ; follow "LongerLink" if <> 0
       call   InputByte               ; AL = the first unmatched byte
       mov    [si+node.LongerLink], di ; else add a new "LongerLink" node

 @@extend:
       mov    [di+node.Character], al          ; \
       mov    word ptr [di+node.SideLink], 0   ;  + build up the new node
       mov    word ptr [di+node.LongerLink], 0 ; /

The above example code not only searches for the longest matching sequence which is already in the node dictionary, but it also adds a new node for the new, longer sequence.

Of course you need to write your own "AllocateNode" and "InputByte" code. Both of these routines will be very simple. The "AllocateNode" could look something like this:

E.g.

 AllocateNode:
       mov    ax, [nodepnt]
       add    [nodepnt], SIZE node    ; advance past one 5-byte node
       ret

Where [nodepnt] has been initialised like this:

 InitNodes:
       lea    di, dictionary
       call   InputByte
       mov    [di+node.Character], al
       mov    word ptr [di+node.LongerLink], 0
       mov    word ptr [di+node.SideLink], 0
       add    di, 5
       mov    [nodepnt], di
       ret

NOTE: I have allocated and built a node for the very 1st byte. This removes the need to check, in the main loop, for an empty dictionary (one in which no nodes have been defined).
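For reference, here is a rough C sketch of the same search-and-extend idea. The names are my own, and I use malloc where the assembler version above allocates from a flat pool; treat it as an illustration, not the article's exact routine.

```c
#include <stdlib.h>

/* C view of the assembler 'node' STRUC */
struct node {
    unsigned char  character;   /* final byte in this sequence         */
    struct node   *side;        /* same prefix, different final byte   */
    struct node   *longer;      /* sequence one byte longer than this  */
};

static struct node *new_node(unsigned char c)
{
    struct node *n = malloc(sizeof *n);
    n->character = c;
    n->side = n->longer = NULL;
    return n;
}

/* Search the dictionary for the longest match against buf[0..len),
   add one new node extending that match by one byte, and return the
   match length. 'root' must already hold at least one node. */
static size_t longest_match(struct node *root,
                            const unsigned char *buf, size_t len)
{
    struct node *n = root;
    size_t matched = 0;

    while (matched < len) {
        unsigned char c = buf[matched];

        /* walk the side links looking for 'c' */
        while (n->character != c && n->side != NULL)
            n = n->side;

        if (n->character != c) {        /* no sibling matches: add one */
            n->side = new_node(c);
            return matched;
        }

        matched++;                      /* matched one more byte */

        if (n->longer == NULL) {        /* end of chain: extend it */
            if (matched < len)
                n->longer = new_node(buf[matched]);
            return matched;
        }
        n = n->longer;
    }
    return matched;
}
```

Each call both reports the longest existing match and grows the dictionary by exactly one node, just as the assembler loop does.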

Mash the Hash

The node method is very good for LZ-type compression, where sequences are built up byte-by-byte. But for assemblers, compilers and other programs which require a more flexible way to build strings, the previous node method isn't very efficient in terms of memory storage.

Currently there are around 400 or more instructions, reserved symbols and assembler/compiler directives, and of course operators and functions. Given that projects (and so, source code) are getting ever larger, with a typical project ranging from 10,000 up to 100,000 lines of code, an efficient string-search to locate a particular symbol is vital.

On a typical line of source code there could be around 4 symbols which need to be scanned, found and processed. Also when you take into account macros, text-equates, register names and so on... speed is even more important.

When searching for a particular string we could use a single loop, but as more and more symbols are added into the symbol-table (our dictionary of ASCII byte sequences) the time taken would get worse and worse.

We need to break up the vast number of these strings to help narrow down the search more quickly. Something like a binary-search wouldn't really help, because the order in which strings are added to the symbol-table is almost random (i.e. symbols are not added in alphabetical order). We would end up having to sort the entire symbol-table every time we added a new symbol.

But we could use the number of characters in each string to group together every symbol of the same length (e.g. "ADD", "SUB", "ECX", "END"). Now when we come to find a particular symbol in the symbol-table, we use the length of the symbol to index into a table, then compare each string in that group until we either find a match or run out of symbols.
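A minimal C sketch of this length-bucket idea (the table size, the names and the malloc-based allocation are my own choices):

```c
#include <stdlib.h>
#include <string.h>

#define MAX_LEN 64              /* assume no symbol is longer than this */

struct sym {
    struct sym *next;           /* next symbol of the same length */
    const char *name;
};

static struct sym *by_length[MAX_LEN + 1];  /* one list head per length */

/* Look 'name' up among symbols of the same length; add it if missing. */
struct sym *find_or_add(const char *name)
{
    size_t len = strlen(name);
    struct sym *s;

    for (s = by_length[len]; s != NULL; s = s->next)
        if (memcmp(s->name, name, len) == 0)
            return s;           /* same length and same bytes: found */

    s = malloc(sizeof *s);      /* not found: push a new entry */
    s->name = name;
    s->next = by_length[len];
    by_length[len] = s;
    return s;
}
```

Only symbols of exactly the right length are ever compared, which is the whole point of the grouping.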

By using the length of a symbol to divide up a large symbol-table we have used it as a form of "hash". It's a very basic hash, but it demonstrates the idea. There is a much better way, though.

Many years ago I wrote a 68000 assembler on the old Atari ST. After performing a few tests to help speed up the string searching, I discovered that the length method wasn't that good, because most symbols (labels, equates, macros) had roughly the same length of between 6 and 10 characters. #:o(

A widely used hash technique is the multiply-by-13 method. (This was used in the Amiga disk routines, if I remember correctly.) As you fetch each character of the symbol name you multiply the running total by 13 and then add the new character to it.

E.g.

       total = 0
       symblen = 0
       newchar = InputByte()
       While newchar is valid
               total = ( total * 13 ) + newchar
               symblen = symblen + 1
               newchar = InputByte()
       Wend

This 'total' is a hash of the entire symbol-string, and so it can be used to quickly narrow down a search in the symbol-table. For example, you could logically AND the total with 3FF hex to form a 10-bit value and use that directly as an index into a table of starting nodes. Each node would describe a unique string (e.g. "DrawSprite" or "TrackerCountDown") and have a link to the next node which shares the same hash value. This way you only compare the symbols which share the same hash value. In the ideal situation symbols will be distributed evenly throughout the entire 10-bit (1024 entry) start table, so given 10,000 symbols we would only need to compare about 9 strings!!
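In C, the same loop might look like this (the name hash13 and the 3FFh fold are illustrative choices, not a fixed standard):

```c
#include <stddef.h>

/* Multiply-by-13 hash of a NUL-terminated symbol, folded to 10 bits. */
unsigned hash13(const char *sym, size_t *len_out)
{
    unsigned total = 0;
    size_t len = 0;

    while (sym[len] != '\0') {
        total = total * 13 + (unsigned char)sym[len];
        len++;
    }
    if (len_out != NULL)
        *len_out = len;          /* the length is a useful by-product */
    return total & 0x3FF;        /* index into a 1024-entry start table */
}
```

Note that the loop computes the symbol length for free, which the next section puts to good use.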

But those 9 or so strings may still be of different lengths, and the length can help reduce the number of string compares even further. In the node structure we could add another link, "LengthLink", which narrows our search down to only the symbols which share the same hash and the same length.

A node structure for a symbol could look like this:

 symb        STRUC

 NextLink     dw     ?    ; pointer to the next symbol of the same hash
                          ; and same length

 LengthLink   dw     ?    ; pointer to the next symbol of the same hash
                          ; but of a different length

 Length       dw     ?    ; number of characters in the following string

 String       dw     ?    ; address of the ASCII string

 symb         ENDS

The process goes something like this:

1. Scan the symbol "search" string, create the hash and count the length.

2. Index into the hashtable using the hash; this gives the starting node. If the hashtable entry is 0 (null), then no existing string matches the hash, so store the new node in the hashtable entry and finish.

3. If the node's string length <> search string length, then follow the "LengthLink" pointer and repeat until we find the correct length. If we hit a null link, no string of this hash and length exists, so add the new node and finish.

4. Compare the node's string against the search string; if they are equal then we have found the symbol, so quit. Otherwise follow the "NextLink" and repeat until we hit a null link (the symbol is new).

As you can hopefully see, it's an efficient way to reduce the number of string compares. First the hash divides the number of possible searches by 1024 or so, and then the length of the symbol reduces this further. And all of this happens before a single string-compare has been done.
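Steps 1 to 4 can be put together in a compact C sketch. The function names and the add-on-miss behaviour are my own; a real assembler would allocate nodes from a pool rather than calling malloc each time.

```c
#include <stdlib.h>
#include <string.h>

#define HASHBITS 10
#define HASHSIZE (1u << HASHBITS)

/* C view of the 'symb' STRUC above */
struct symb {
    struct symb *next_link;    /* same hash, same length   */
    struct symb *length_link;  /* same hash, other lengths */
    size_t       length;       /* number of characters     */
    const char  *string;       /* address of the ASCII string */
};

static struct symb *hashtable[HASHSIZE];

static unsigned hash_of(const char *s, size_t len)
{
    unsigned total = 0;
    for (size_t i = 0; i < len; i++)
        total = total * 13 + (unsigned char)s[i];
    return total & (HASHSIZE - 1);
}

/* Find 'name' in the symbol-table, or add it ('name' must stay alive). */
struct symb *lookup(const char *name)
{
    size_t len = strlen(name);          /* step 1: hash + length */
    unsigned h = hash_of(name, len);
    struct symb *chain, *s;

    /* steps 2-3: find the chain with the right hash and length */
    for (chain = hashtable[h]; chain != NULL; chain = chain->length_link)
        if (chain->length == len)
            break;

    /* step 4: compare only strings of the same hash and length */
    for (s = chain; s != NULL; s = s->next_link)
        if (memcmp(s->string, name, len) == 0)
            return s;

    /* not found: link a new node in at the front of the bucket */
    s = malloc(sizeof *s);
    s->length = len;
    s->string = name;
    s->next_link = chain;              /* head of the same-length chain */
    s->length_link = hashtable[h];     /* old head keeps other lengths  */
    hashtable[h] = s;
    return s;
}
```

The memcmp only ever runs on strings which already agree on hash and length, exactly as described above.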

A groovy new hash technique

Some time ago I had a nice idea to improve the string hashing function, and it all came from thinking about disks, sectors and formatting.

Okay, so what do we need for a hashing function?

It must produce the widest spread of hash values for the widest range of input strings. It should be very fast, and it must give the same, identical value for the same, identical input string. Most important of all, the hash value must be much smaller than the input string (a word or a byte is ideal), and similar strings should give very different hash values.

There is something which already does this...

Do the letters C...R...C... ring any bells?

After reading about CRCs and writing a simple test program, it does indeed seem to work and it's no slower than the "multiply by 13" method.
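As an illustration, here is a plain bitwise CRC-16 (the CCITT polynomial, 1021 hex) that could stand in for the multiply-by-13 loop. This is a textbook CRC, not necessarily the exact variant I tested with, and it needs no lookup table.

```c
#include <stdint.h>

/* Bitwise CRC-16/CCITT (poly 0x1021, initial value 0xFFFF) of a
   NUL-terminated symbol name. */
uint16_t crc16_hash(const char *sym)
{
    uint16_t crc = 0xFFFF;

    for (; *sym != '\0'; sym++) {
        crc ^= (uint16_t)((unsigned char)*sym << 8);
        for (int bit = 0; bit < 8; bit++)          /* one bit at a time */
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;           /* AND with 3FFh for a 10-bit table index */
}
```

A table-driven version would process a byte per step and be faster still, at the cost of 512 bytes for the table.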

Closing Words

Well, that's my 3rd article for Hugi done. I hope it wasn't too dull. If you are still having problems trying to understand the techniques used here then try reading about "Linked-Lists" and "Binary-Trees" in case you need a little reminder.

Have fun.

Regards

TAD #:o)