Optimizing "Getbits" Part I

TAD

Introduction

After reading putbits I and II, we all know how to do a 'putbits' routine (right?). So here is the other vital routine for any depacker/decompressor. This document contains tips and advice on how to read a number of bits from a bitstream.

Terminology

"Bitstream": This is a continuous block of bytes which is used like a huge series of bits. If our bitstream is 100 bits long then we need 13 bytes of storage (actually only 12.5 with 4 bits spare).

"BitRack": A temporary byte/word/dword in which bits are shifted out from, or into.

"InputByte": This fetches the next byte from our bitstream and passes it back in the AL register.

Bit-by-Bit

As we are still dealing with a bit-stream, we might as well start with fetching 1 bit at a time from it.

Right, we need a counter to record how many bits we have left in the BitRack, and the BitRack itself.

        Initialise:

                BitRack = InputByte()
                BitPos = 7

        Get1Bit:
                newbit = ( BitRack >> BitPos) AND 1
                BitPos = BitPos - 1

                If BitPos < 0 then BitRack = InputByte()
                                   BitPos = 7

Now, let's use some 80x86 to make life easy.

        Initialise:
                call    InputByte
                mov     ch, al          ; BitRack = InputByte()
                mov     cl, 7           ; BitPos = 7

        Get1Bit:
                mov     al, ch
                shr     al, cl
                and     al, 1           ; newbit = (BitRack >> BitPos) AND 1

                dec     cl              ; BitPos - 1
                jns     short notempty  ; if BitPos < 0 then
                call    InputByte       ;
                mov     ch, al          ;       BitRack = InputByte()
                mov     cl, 7           ;       BitPos = 7
        notempty:

The bits a-x in the bitstream are stored like this:

                --------------------> bitstream

                abcdefgh ijklmnop  qrstuvwx .... etc ...

                 byte 0    byte 1    byte 2
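
Here is the same bit-by-bit reader sketched in C, reading bit 'a' (the top bit of byte 0) first, just like the diagram above. The "stream"/"next" names simply stand in for InputByte, and there is no end-of-stream checking; it is only an illustration, not part of the original routines.

        #include <stdint.h>

        static const uint8_t stream[] = { 0xB5, 0x01, 0x7F };  /* example bitstream */
        static int     next = 0;       /* index of the next byte to fetch */
        static uint8_t rack;           /* BitRack                         */
        static int     bitpos;         /* BitPos, counts 7 down to 0      */

        static void init_bits(void)
        {
            rack   = stream[next++];   /* BitRack = InputByte() */
            bitpos = 7;                /* BitPos  = 7           */
        }

        static int get1bit(void)
        {
            int bit = (rack >> bitpos) & 1;   /* newbit = (BitRack >> BitPos) AND 1 */
            if (--bitpos < 0) {               /* all 8 bits used?                   */
                rack   = stream[next++];      /* reload the BitRack                 */
                bitpos = 7;
            }
            return bit;
        }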

Mask and Wrap

As you may have already noticed, the BitPos counter (register CL) counts down to -1 and then begins again from 7, i.e. it wraps on a power of 2.

All the BitPos is currently doing is telling us when the 8th bit has been shifted out of the BitRack.

Another way to do this is to use a logical AND instruction to mask off unwanted bits and limit the results to a small range, in this case 8 (7 to 0).

E.g.

                dec     cl                      ; 2
                and     cl, 00000111b           ; 3
                jnz     short <label>           ; 2

This takes 7 bytes and uses the ZF (Zero Flag) to indicate when the 8th bit has been reached.

A better way is to use the MOD 256 property of the byte register itself to perform the logical AND for free, and simply scale up the countdown of CL to be 1/8th of 256 (which is 32). E.g.

                sub     cl,32                   ; 3
                jnz     short <label>           ; 2

As a bonus, we get the CF (Carry Flag) set every 8th time, which could be an advantage depending on your code.
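
Both wrap-around counters can be sketched in C as well; a uint8_t gives the MOD 256 behaviour of a byte register for free. The names and return values are my own, purely for illustration:

        #include <stdint.h>

        /* AND version: BitPos runs 7,6,...,0 and wraps without a "mov cl,7". */
        static uint8_t bitpos = 7;

        int and_tick(void)
        {
            bitpos = (bitpos - 1) & 7;     /* dec + and: wraps on a power of 2 */
            return bitpos == 7;            /* true once every 8th call         */
        }

        /* MOD 256 version: count in steps of 32, the byte wraps to 0 by itself. */
        static uint8_t count = 0;

        int sub_tick(void)
        {
            count -= 32;                   /* sub cl, 32               */
            return count == 0;             /* true once every 8th call */
        }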

The Bit Loop (or "fake BitRack")

Another way to perform the once every 8th (or 16th or 32nd) event counter is by using a single set bit within a byte, word or dword.

This is just a "fake" BitRack which is rotated round and round without being reloaded or modified in any other way.

        Initialise:
                        mov     cl, 80h         ; fake BitRack

        Get1Bit:
                        rol     cl, 1
                        jnc     short notyet

                        call    InputByte
                        mov     ch, al          ; ch = InputByte()
                notyet:
                        shl     ch, 1           ; BitRack << 1
                                                ; CF = next bit

As you can see, we never need to reload CL after the initialisation; the ROL instruction takes care of it for us. (CL starts as 80h rather than 01h so that the very first Get1Bit forces a real byte into CH; after one rotation it settles into the 00000001 pattern shown below.)

This "fake" BitRack method can be used for every 2nd, 4th, 16th etc. event tick; simply load the fake BitRack with the correct value during initialisation:

E.g.

                00000001        = every 8th
                00010001        = every 4th
                01010101        = every 2nd
                11111111        = every 1 event tick.

This method can be employed in many different programming problems, not just packing/depacking bits in a bitstream.
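
As a quick illustration, here is the "fake" BitRack used as an event counter in C. C has no ROL operator, so the rotate is spelled out by hand; the 0x01 value is the "every 8th" pattern from the table above:

        #include <stdint.h>

        static uint8_t fake = 0x01;    /* 00000001b = every 8th event */

        /* Returns 1 on every 8th call, 0 otherwise; the mask is never reloaded. */
        int tick(void)
        {
            int carry = (fake >> 7) & 1;            /* the bit about to fall out (CF) */
            fake = (uint8_t)((fake << 1) | carry);  /* rol fake, 1                    */
            return carry;
        }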

9 Bit BitRacks

So far we have seen the countdown technique and the Bit-Loop (or "fake BitRack") technique. Both are pretty good in terms of speed and memory usage. But there is another method which combines them both into a single BitRack, and so can be implemented in a single variable or register.

This method can be tricky to understand and code, but don't worry, it's not that bad.

At first sight it may seem that we can just shift the BitRack until it is empty, or zero. Then ALL 8 bits have been shifted out, right?

WRONG !!

Try loading the BitRack with the value 192 (binary 11000000) and you will soon realise that after just 2 shift operations the BitRack would be 0 (binary 00000000), which would indicate that it is empty, but in fact there are 6 more bits which must be shifted out.

The problem is that those remaining bits are all 0's. Our bitstream is made up from individual bits, and a bit can ONLY have two values (unless you know better); it is impossible to have a tri-state bit. Yet we need some kind of 3rd state to denote a "marker" bit.

But for certain values where the final bit is 1 this method does indeed work for all 8 bits. Values such as:

              1st      8th

                11000001
                00101001
                00001111
                10000001
                00011001

After the 8th bit has been shifted out of the BitRack, we do indeed have an empty BitRack, and its value is 00000000.

Well, in fact every ODD number works. The trick is that the final (8th) bit to be shifted out of the BitRack MUST be 1.

But if we set this 8th bit to 1 all the time then we could only store 7 bits in each byte. This is NOT good news for compression, is it? #:o(

We need to somehow pack 9 bits into just 8!

RCL and RCR

The solution to this is to use the RCL and RCR instructions and set the CF (Carry Flag) to 1 before shifting out the first bit and every 8th bit (or whenever the BitRack is reloaded with a new byte).

        Initialise:
                        call    InputByte
                        mov     ch,al           ; CH  = InputByte()
                        stc                     ; **
                        rcl     ch, 1           ; **

                CF = 1st bit

        Get1Bit:
                        shl     ch, 1
                        jnz     short notyet
                        call    InputByte
                        mov     ch, al          ; reload BitRack
                        stc                     ; **
                        rcl     ch, 1           ; **
                notyet:                         ; CF = next bit

As you can see this method is slightly different from the normal ones. Please note the ** instructions. These shift the 1st bit of each byte into the CF while also setting the marker bit (Bit #0) to 1, so after the initialisation the very first bit of the bitstream is already sitting in the CF.

Say we had a byte with bits abcdefgh, then the ** instructions do this:

        CF = 1  ----------->>-----------+       stc
                                        |
                                        |
        CF = a <--<---  ch = bcdefgh1 <-+       rcl     ch, 1
                                    ^ marker bit (Bit #0)

The CF has the value of the 1st bit, and the final bit in the BitRack has the value 1, which is the marker bit.

When the marker bit has itself been shifted out of the BitRack, the BitRack will be 0 (binary 00000000). At this point we need to reload the BitRack with the next byte from our bitstream and perform the 1st shift operation again, just like the initialisation did.

TIP: If your InputByte routine (read from memory, disk etc.) does not affect the CF then you can remove every STC instruction AFTER the initialisation one, because we know the marker bit has just been shifted out of the BitRack, so CF is already equal to 1.

Of course, in your packer/compressor you could encode the 1st byte of the bitstream with its final bit = 1. The first byte would then only hold 7 data bits, but it would possibly save a STC instruction in your depacker or decompressor (a net saving of 7 bits: the STC costs one byte, but you lose one data bit).
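
For comparison, here is the marker-bit trick sketched in C, using the same bit order as the diagrams above. The initialisation is rearranged slightly: the rack starts as 80h (just the marker bit), so the very first call performs the fetch-and-shift that the asm Initialise does. The "stream"/"next" names are stand-ins for InputByte and not part of the original:

        #include <stdint.h>

        static const uint8_t stream[] = { 0xB5, 0x01, 0x7F };  /* example bitstream */
        static int     next = 0;
        static uint8_t rack = 0x80;    /* "empty" rack: only the marker bit set */

        static int get1bit(void)
        {
            int bit = (rack >> 7) & 1;          /* bit shifted out by SHL         */
            rack <<= 1;
            if (rack == 0) {                    /* that was the marker, not data  */
                uint8_t b = stream[next++];     /* reload: InputByte()            */
                bit  = (b >> 7) & 1;            /* CF  = 1st bit of the new byte  */
                rack = (uint8_t)((b << 1) | 1); /* STC + RCL: 7 bits + marker     */
            }
            return bit;
        }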

Multiple Bits

As I said in Optimizing PutBits II, a bit-by-bit approach can be very slow when used within a loop. For decoding a .GIF picture this would be very lame.

                "token"         - This is a single code read from
                                  the bitstream.
                                  Usually 16-bit variable/register.

                "tokenbits"     - How many bits should be read to
                                  make up the "token".

        Initialise:
                        BitsLeft = 0

        GetBits:
                        If BitsLeft = 0 then BitRack = InputByte()
                                             BitsLeft = 8

                        token = BitRack >> ( 8 - BitsLeft )

                        While tokenbits > BitsLeft
                                        BitRack = InputByte()
                                        token = token OR (BitRack << BitsLeft)
                                        BitsLeft = BitsLeft + 8
                        Wend

                        BitsLeft = BitsLeft - tokenbits

                        token = token AND ((1 << tokenbits) -1)

NOTE: Be careful, the BitRack is normally only 8 bits long, but the "token" is usually 16 bits long, so YOU must deal with the zero extending and data conversion yourself.

Or, just look at the following 80x86 code.

        Initialise:
                mov     cl, 0           ; BitsLeft = 0
                mov     bl, tokenbits   ; BL = number of bits wanted

        GetBits:
                cmp     cl, 0
                jnz     short notzero
                call    InputByte
                mov     ch, al          ; ch = InputByte()
                mov     cl, 8           ; BitsLeft = 8
        notzero:

                mov     dh, 0
                mov     dl, ch
                push    cx
                neg     cl
                add     cl,8
                shr     dx,cl           ; token = BitRack >> ( 8 - BitsLeft)
                pop     cx

        whilelp:
                cmp     bl, cl
                jbe     short wendlp    ; while tokenbits > BitsLeft

                call    InputByte
                mov     ch, al          ; BitRack = InputByte()
                mov     ah, 0
                mov     al, ch
                shl     ax, cl          ; BitRack << BitsLeft
                or      dx, ax          ; token = token OR (BitRack << BitsLeft)
                add     cl, 8           ; BitsLeft = BitsLeft + 8
                jmp     short whilelp
        wendlp:
                sub     cl, bl          ; BitsLeft - tokenbits

                push    cx
                mov     ax, 1
                mov     cl, bl
                shl     ax, cl
                pop     cx
                dec     ax
                and     ax, dx          ; token AND ((1 << tokenbits) - 1)

        Gives:
                AX = token
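
For reference, here is the same GetBits routine sketched in C. It follows the pseudo-code above, consuming each byte from its low bit upwards (the same order GIF's LZW codes are packed in), with an 8-bit BitRack and a wider token so there are no overflow worries. Again, "stream"/"next" stand in for InputByte and there is no end-of-stream checking:

        #include <stdint.h>

        static const uint8_t stream[] = { 0x5C, 0xA3, 0x1F };  /* example bitstream */
        static int     next     = 0;
        static uint8_t rack     = 0;      /* BitRack              */
        static int     bitsleft = 0;      /* BitsLeft, 0 at start */

        /* Read 'tokenbits' bits (1..16) and return them as one token. */
        static uint32_t getbits(int tokenbits)
        {
            uint32_t token;

            if (bitsleft == 0) {                     /* rack empty: fetch a byte    */
                rack     = stream[next++];
                bitsleft = 8;
            }
            token = rack >> (8 - bitsleft);          /* remaining bits, low-aligned */

            while (tokenbits > bitsleft) {           /* need more bits?             */
                rack   = stream[next++];
                token |= (uint32_t)rack << bitsleft; /* append above what we have   */
                bitsleft += 8;
            }
            bitsleft -= tokenbits;                   /* bits left for the next call */
            return token & (((uint32_t)1 << tokenbits) - 1);
        }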

Improvements

You could extend this method to use 16 or 32 bits for your BitRack instead of the current 8 bits.

Of course, the "InputByte" shouldn't really be a sub-routine which you call; I only used one to help explain things. The most likely replacement is a LODSB, LODSW or some other MOV & INC instruction pair.

TIP: For packing & depacking really huge files it is much easier to break them down into 16 or 32 Kb chunks (sometimes even 1 or 2 Kb chunks). This not only means that all the code can easily be written in Real/8086 mode (and so works on ANY machine), but it also allows you to use the best compression method for each individual chunk.

Take EXE files for example: at the start there is normally a very large range of possible byte values, but towards the end of the EXE there is usually plenty of repeated data and redundant 0's. If you break the EXE down into small 2 Kb chunks then you can use a byte before each chunk to tell the decompressor which method of compression was used on it. The best case would be just a single byte value and a repeat-count, and the worst would be a "RAW" uncompressed chunk.

No matter how fantastic you think your compression is, there are always situations in which you CANNOT compress the data, remember this. Then think about the "chunk" method...
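
As a tiny C sketch of the chunk idea (the method numbers and names are made up for illustration; here the repeat-count of a filled chunk is simply the chunk length):

        #include <stdint.h>
        #include <string.h>

        enum { METHOD_RAW = 0, METHOD_FILL = 1 /*, METHOD_LZ = 2, ... */ };

        /* Depack one chunk of 'chunk_len' output bytes.
           Returns how many source bytes were consumed, 0 on error. */
        size_t depack_chunk(const uint8_t *src, uint8_t *dst, size_t chunk_len)
        {
            switch (src[0]) {                    /* method byte before each chunk */
            case METHOD_RAW:                     /* "RAW" uncompressed chunk      */
                memcpy(dst, src + 1, chunk_len);
                return 1 + chunk_len;
            case METHOD_FILL:                    /* one byte value, repeated      */
                memset(dst, src[1], chunk_len);
                return 1 + 1;
            default:
                return 0;                        /* unknown method                */
            }
        }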

Closing Words

Well, I hope you have enjoyed reading this document and I hope some coders out there either learned something, or better still, have been inspired to code their own packer/depacker routines.

If you are new to compression, then I would suggest reading about the LZ77, LZ78 and LZW algorithms, but first read about the elegant Huffman, respect due.

Have fun.

Regards

TAD #:o)