Optimizing "Putbits" Part II



After reading Hugi #14, and especially Dario Phong's "Optimizing Putbits", I thought there was quite a bit more which needed to be said on this subject. And so here is my contribution, which I hope will inspire others to take this subject still further.

As Dario Phong correctly said, the "putbits" function is very heavily used when writing a compressor (or "packer" as many call it), so it seems sensible to try and optimize it to reduce the amount of time needed to compress a file.

But, this should be done after first optimizing the rest of the compressor, because let's face it, searching for the longest string match in a large block, or the most optimum packing sequence can take a 100 or a 1000 times more time than the "Putbits" routine which outputs the finished compression token.

If anything the "Getbits" routine (used in the depacker/decompressor) would be a far better place to spend your time optimizing, after, of course, the actual compression algorithm itself.

In this article I will only focus on optimizing the Putbits" routine, and I leave the rest of your super-compression engine for you to optimize.

There is no real brain-blasting methods or fantastic new algorithms here, most of the techniques have been know about for years. The marker bit for example was used in the ZX Spectrum tape load/save routines and probably before years before 1982.


"Bitstream": A continous block of bytes treated as one long series of bits.

"Token": This represents a command or a number of data bytes. The token can range from a single bit upto 16 or more in length.

"Putbits": This takes a token and writes all of its bits into the Bitstream. It has to deal with crossing byte boundaries and writing bit(s) to the correct bit positions in each Bitstream byte.

"Sequence": I use the term sequence and string to mean the same thing. A continious block of bytes of a certain length. A string is more usually used when dealing with ASCII characters, so sequence seems a better choice.

"Bitrack": A temporary byte (or word) into which a number of bits are shifted into, until an entire byte (or word) has been filled. After which it is written to disk/memory and the Bitrack is re-initialised, so the process can start again.


I will first decribe a 1 bit "Putbits" routine in which the "newbit" is only 1 in length and has the value 0 or 1.


                BitRack = 0
                BitPosition = 0


                BitRack = BitRack OR (newbit << BitPosition)
                BitPosition + 1

                If BitPosition = 8 then OutputByte(BitRack)
                                        BitRack = 0
                                        BitPosition = 0

This would output bits into the bitstream in this order:

                --------------------> bitstream

                hgfedcba ponmlkji  xwvutsrq .... etc ...

                 byte 0    byte 1    byte 2

where the very first bit is written is to bit #0 of the first byte.


A much easier way to understand (and code) is to start at bit 7 and countdown until we go past bit 0 and then output the byte before repeating the process again from bit 7.


                BitPostion = 7
                BitRack = 0

                BitRack = BitRack OR (newbit << BitPosition)
                BitPosition - 1

                IF BitPosition < 0 then OutputByte(BitRack)
                                        BitRack = 0
                                        BitPosition = 7

Now bits are written in this order:

                --------------------> bitstream

                abcdefgh ijklmnop  qrstuvwx .... etc ...

                 byte 0    byte 1    byte 2

The Marker Bit

We can actually get rid of the BitPosition variable and combine it with the Bitrack. I will use some 80x86 for this.

Rather than taking the input bit, shifting it to the correct bit-location and combining it with our bitstream byte, we can simplify this by using a rotating byte (word, or dword etc.) instead.


                mov     [BitRack], 01h  ; Bit 0 = marker bit

                mov     al, <newbit>    ; AL = new bit
                shr     al, 1           ; shift the bit into CF

                rcl     [BitRack], 1    ; rotate CF into Bitrack
                jnc     short notfull
                mov     al, [BitRack]
                call    OutputByte      ; OutputByte(BitRack)
                mov     [BitRack], 01h  ; re-init the BitRack marker

This time the BitRack itself acts as the bit counter. It is initialised with 01h (binary 00000001). As each new bit is rotated into the BitRack the marker bit will be also be rotated until finally the marker bit falls into the CF. At this point an entire 8 bits have been stored in BitRack in the order abcdefgh, we just have to output the BitRack and re-initialise it with the marker value.

As you can see it is far more efficent in terms of speed and register usage than using the "BitPosition" method.

This time bits are written into byte starting from bit 7:

                --------------------> bitstream

                abcdefgh ijklmnop  qrstuvwx .... etc ...

                 byte 0    byte 1    byte 2

PLEASE NOTE: In order to output the final BitRack, you must do this:

                mov     [BitRack], 01h  ; Is the BitRack empty ?
                jz      short rackempty
                shl     [BitRack], 1    ; shift in 0's until
                jnc     short flushloop ; the marker-bit falls out
                mov     al, [BitRack]   ; then write the final byte
                call    OutputByte      ; OutputByte(BitRack)

Without the above above code the final byte of the bitstream would be missing. The SHL instruction is VERY important, it aligns the remaining bits in the BitRack to begin at bit position 7, without it our marker-bit would still be in the BitRack somewhere, with the last 0 to 6 bits after it. (E.g. 00001xxx where 'xxx' are the last 3 bits written into the BitRack with our markerbit before it).


Using the "Putbit" routine is fine for single bit tokens, but for multiple bits a loop would be needed, and so it would be much, much slower.

A far better method is to write multiple bits into the bitrack at once as this avoids having to loop for each and every bit.

                token           - This is the value we wish to
                                  write into the bitstream.

                tokenbits       - The number of bits to write.
                                  (max. 16 bits)


                BitPos = 0
                BitRack = 0

                BitRack = BitRack OR ( token << BitPos )
                BitPos + tokenbits

                While BitPos >= 8
                                Bitrack = token >> (BitPos AND 7)
                                BitPos - 8

Now bits are written into byte starting from bit 0:

                --------------------> bitstream

                hgfedcba ponmlkji  xwvutsrq .... etc ...

                 byte 0    byte 1    byte 2

Now in 80x86:

                dx = token
                ch = tokenbits

                mov     cl, [BitPos]
                mov     al, dl
                shl     al, cl
                or      [BitRack], al   ; Bitrack OR (token << BitPos)
                add     cl,ch           ; BitPos + tokenbits

                cmp     cl, 8           ; is BitRack full ?
                jb      short wendlp    ; no...

                mov     al,[BitRack]
                call    OutputByte      ; OutputByte(BitRack)

                push    cx
                mov     ax, dx
                and     cl, 7
                shr     ax, cl
                mov     [BitRack], al   ; Bitrack = token >> (BitPos AND 7)
                pop     cx
                sub     cl, 8           ; BitPos - 8
                jmp     whilelp

                mov     [BitPos], cl

Could do Better

Now the above looks like a nice, neat loop, but in fact it's a very general solution. The loop itself is only entered when the BitRack has been filled with 8 bits. In which case the BitRack is output and the remaining bits in the token are placed in the BitRack and the loop repeats, until the entire token has been shifted into the BitRack.

But when the BitRack is full (and the while loop is entered) the remaining bits from the token are written at bit position 0 in the Bitrack.

Here is a much better solution:

                dx = token
                ch = tokenbits          (0 to 16)

                mov     cl, [BitPos]
                mov     ax, dx
                shl     ax, cl
                or      al, [BitRack]   ; AX = BitRack OR (token << BitPos)
                add     cl, ch          ; BitPos + tokenbits

                cmp     cl, 8
                jb      short notfull   ; BitRack is NOT full?
                call    OutputByte      ; OutputByte(AL)

                mov     al, ah          ; AL = BitRack (bits 15..8)
                sub     cl, 8
                cmp     cl, 8
                jb      short notfull   ; BitRack is NOT full?
                call    OutputByte      ; OutputByte(AL)

                sub     cl, 8
                mov     ah, dh
                mov     al, 0
                rol     ax, cl          ; AL = overflow byte (bits 23..16)
                mov     [BitRack], al
                mov     [BitPos], cl

To understand this, look at this terrible ASCII. Here is the worst possible case,

                BitPos = 7
                tokenbits = 16

           bit 15               0
                aaaaaaaa bbbbbbbb       <-- the token

                7      0
                _rrrrrrr                <-- the BitRack

After shifting the BitRack looks like this:

               23      16       8        0
                _aaaaaaa abbbbbbb b_______

Combined with the BitRack:

               23      16       8        0
                _aaaaaaa abbbbbbb brrrrrrr


Well, I have only used 8086 instructions and 16-bit registers to make this as compatible as possible. You will no doubt be using 32-bit registers and addressing-modes which will certainly help speed things up a great deal.

The SHLD and SHRD instructions might be useful.

The BitRack is probably best extended to 32-bits to reduce memory accesses. Using Protected-Mode will again help to speed things and of course it allows huge buffers to be easily maintained without having to juggle segment registers.

"RAW" bytes

I almost forgot this tip.

Roughly about 50% (hopefully less) of the time you will be forced to output "RAW" 8 bit bytes because you are unable to compress data. For example, the first time you encounter a byte which hasn't been seen before you need some way to encode it. This is normally done with a single-bit prefix '0' or '1' so when decompression takes place we can decode "RAW" bytes quickly without too much overhead.

Now, rather than encode every 8-bit byte using our "putbits" routine which has to perform shifting, byte boundary checks and so on... We can output the byte directly into the bitstream.

This means "RAW" bytes are encoded and decoded without the need to shift or deal with 8 bits across byte boundaries in the bitstream.

To use this method you must use some kind of output buffer which allows random access because you will need to go back and insert/overwrite a previously written byte later on.

You must also reserve space in the bitstream for each BitRack image BEFORE you output any "RAW" bytes.

               +------>--------->------+      +>+
               |                       |      | |
        xxxxxxxx <RAW byte> <RAW byte> xxxxxxxx xxxxxxxx <RAW byte>

         bitrack                       bitrack  bitrack
         image 1                       image 2  image 3

In the above diagram we have 3 RAW bytes and 24 BitRack bits (denoted by the 'x' characters).

Because we are writing BitRack images and RAW bytes at different speeds we must be VERY careful how we output bytes. We must reserve space for the BitRack image byte before writing any RAW bytes.

        byte OutputBuffer[1000]         ; an example output buffer

        OutAddr = 0

                RackAddr = OutAddr      ; reserve space for the
                OutAddr + 1             ; 1st BitRack image

                BitRack = 0
                BitPos = 0

We have 2 output routines now, Put1Bit and PutRAW:

                BitRack = BitRack OR (newbit << BitPos)
                BitPos + 1
                if BitPos = 8 then OutputBuffer[RackAddr] = BitRack

                                   RackAddr = OutAddr
                                   OutAddr + 1

                                   BitRack = 0
                                   BitPos = 0

And the PutRAW routine:

                OutputBuffer[OutAddr] = rawbyte
                OutAddr + 1

The reason why we reserve a byte for the BitRack BEFORE it is full, is because when we come to depack the bitstream we need to read the BitRack image first, to know what to do with the following bitstream bytes.

Closing Words

Well, that's all for putbits. Don't worry if you can't understand it all in one go, just start off with simple code and work you way up slowly, making sure you understand things fully before moving onto something more difficult.



TAD #:o)