Optimizing "Putbits" Part III

(or how I beat Dario Phong... heh heh)

TAD

Introduction

Earlier today I recieved a curious email from Dario Phong (whose original article in Hugi #14 inspired me to write my first ever article).

After some very fair comments about LZT (which I had outlined in my LZT article) he mentioned about the 'putbits' routine in which he said some strange things which I disagree with, but rather than argue what's the fastest putbits etc. etc. I will describe some size optimizations for dealing with bitstreams together with a very simple way to write multiple bits to the bitstream using just ONE shift operation.

------ Serious mode off ------

I want to dedicate this entire article to Dario Phong for his rather confusing email he sent me today. I've read it a number of times, but I'm still baffled!

------ Serious mode on ------

Random Access.

Dario said: "I think random access should not be done in the putbits."

As I am unclear about what he meant by "random access" I will reply to the two possible things which I think he is referring to.

1. Random Bit Access:

For a single-bit write, a random-access to the current bit position inside the bitrack is slow. That's why I wrote the 'marker-bit' section in which the bitrack is initialised with 01h (binary 00000001) and is used to indicate when a full 8-bits have been shifted in.

But for multiple bits (i.e. writing more than 1-bit per loop) this random-access is vital. It not only tells us the starting bit-position within the BitRack for the new bits to go, but it is also used to determine if the BitRack is full.

2. Random BitRack Access:

The only other random-access I can think of is for the BitRack and Raw (literal) bytes. Because bits are written into the BitRack and complete Raw bytes are written at different rates, we need random-access to go back and output the full BitRack later on.

As far as I know there is no way to overcome this problem because of the way most depackers work (BitRack bytes and Raw/literal bytes are intermixed throughout the packed data-stream).

The only way to remove this random-access would be to use two buffers, one for the bitstream and one for the bytestream (raw literal bytes). But this complicates the depacker code because another memory pointer is needed for the 2nd buffer, not to mention a couple more instructions to set this pointer up. The only good thing about this method is that some other compression scheme can be performed on the raw/literal byte stream, so by gaining a little more compression.

Too many Shifts.

Okay, here is the bit which most people have been waiting for... (apart from the end of this text... heh heh).

I am surprised that I didn't think of this method before, but after writing close to 20 articles for Hugi #15 I did get rather bored.

What is the problem with the original 'putbits' routine?

Well, nothing; except for tokens which cross the BitRack (byte) boundary. This needs two shift operations, one to align the lower bits with the current bit-position within the BitRack, and one to align the overflow bits with bit #0 of the next BitRack byte.

Say we have a 16-bit token we wish to write into the bitstream:

                <---- token ---->
                abcdefgh ijklmnop

And we have a BitRack with 3 bits (xxx) already in it:

                          BitRack
                         76543210
                         _____xxx

If we use a 24-bit temporary buffer and shift the token left 3 places:

       <----- the "shifter" ---->

       76543210 76543210 76543210
                         _____xxx     <-- the BitRack
       _____abc defghijk lmnop___     <-- the token << 3 bits

So it looks like we need a 24-bit register to perform the above operation. This would mean a 80386 or better CPU for those nice 32-bit reggies.

But after thinking about some old 8086 sprite code that I wrote on an old 8Mhz Amstrad PC (yuck) I remembered this obvious trick. Sometimes even with the latest super-sonic PC, the old tricks are still the best.

Don't stop now, I'm on a ROL-L

Yeah, instead of a SHL (shift left) use a ROL (rotate left) instruction. This means the rotated token would look like this:

       76543210 76543210 76543210
                         _____xxx     <-- the BitRack
            ... defghijk lmnopabc     <-- the token << 3 bits
            -               ^^^
             ---->------>------

Now let's forget all about the BitRack and access the Bitstream directly.
(Hey, Dario... check this out...)

*** WARNING ***

This code is TOTALLY untested, so there probably is a heap of bugs in it. I typed it directly off the top of my head, anyway I will demonstrate a much better 'putbits' very soon..

        (AX) = value to write (the 'token')
        (CH) = bits to write ('NumBits')

          push   bx dx di
          mov    bx,[BitPos]
          mov    di,[BitPtr]            ; [ES:DI] --> Bitstream byte
          mov    cl,bl
          rol    ax,cl                  ; rotate token left
          mov    dx,ax
          and    dl,table[bx]           ; get overflow bits
          xor    al,dl
          add    cl,ch                  ; BitPos + NumBits
          or     es:[di],al             ; combine with previous BitRack bits
          cmp    cl,8
          jb     short @@done           ; < 8 bits in shifter?
          inc    di
          mov    es:[di],ah             ; else output Bits 15..8
          cmp    cl,16
          jb     short @@done           ; < 16 bits in shifter?
          inc    di
          mov    es:[di],dl             ; else output Bits 23..16 (overflow)
 @@done:
          and    cl,7                   ; BitPos MOD 8
          mov    [BitPtr],di
          mov    byte ptr [BitPos],cl
          pop    di dx bx
          ret

 table    db        00000000b
          db        00000001b
          db        00000011b
          db        00000111b
          db        00001111b
          db        00011111b
          db        00111111b
          db        01111111b
          db        11111111b

Faster? Are you taking the P-mode?

Most of us 80x86 coders are now using p-mode, right? If not, why not? Well the following method of accessing the bit-stream directly can be used in either P-mode or Real/V86-mode, of course 32-bit code segments will require a few less override bytes.

                (EAX) = value to write (the 'token')
                (CH) = number of bits to write

              mov     cl, [BitPos]
              mov     edi, [BitPtr]
              shl     eax, cl          ; shift token to BitPos
              or      [edi], eax       ; combine with Bitstream
              add     cl, ch           ; BitPos + NumBits
              movzx   eax, cl
              shr     eax, 3           ; EAX = BitPos / 8
              and     cl, 7            ; BitPos MOD 8
              add     [BitPtr], eax    ; Advance BitPtr by EAX bytes
              mov     [BitPos], cl

Well, that certainly looks fast and compact, but there are two small details which I have missed..

In order for the above code to work you MUST zero out the bitstream buffer before using the above routine. A nice REP STOSD or optimized loop will do.

Also it may overrun the bitstream buffer by upto 3 bytes due to the

              or      [edi], eax       ; combine with Bitstream

instruction, so an extra dword in your buffer is perhaps a good idea.

You can remove the need to clear the bitstream by replacing the OR instruction with these:

              or      al, [edi]        ; combine with Bitrack byte
              mov     [edi], eax       ; wrt BitRack + overflow byte(s)

But remember to zero out the very 1st byte in the bitstream buffer!

Also the 'token' value in the (EAX) register MUST be zero-extended, so you can't load 1234h into (EAX) and only write the lower 8 bits, you MUST load (EAX) with 34h. This is due to this instruction:

              or      [edi], eax       ; combine with Bitstream

If you want to only write a certain number of bits from (EAX) then you can perform a logical AND instruction before putbits (i.e. and eax, 255).

BTW: bits are written into bytes starting from bit 0:

                --------------------> bitstream

                hgfedcba ponmlkji  xwvutsrq .... etc ...

                 byte 0    byte 1    byte 2

It's nice to see that Intel's low...high byte order can sometimes be very useful (heh heh).

Improvements

Of course the above code could be optimized. The usual AGI stalls and pairing rules should be used.

You could use the above code as a MACRO in your source code, and reserving the (CL) and (EDI) registers for the bitstream could also help speed things up a little more.... but like I said in a previous article the string-search and other CPU hungry parts of your packer are better places to spend your time optimizing.

You could even write a MMX version using this 'sprite' technique. But perhaps the MMX instructions would be better used in the string compare/search routines in your packer.

The P-mode version can handle up to 24 bits, which should be enough for most people.

Maybe using dwords (32 bits) instead of bytes (8 bits) for the bitrack would also be a good idea, as this would reduce the extra clock-cycles needed for a mis-aligned dword. Of course if you need more than 24 bits then you could give the SHLD and SHRD instructions a quick look.

Closing Words

Well, I hope that's the last 'PutBits' article that I will write.

...unless I get another email from Dario (grin).

Enjoy.

Regards

TAD #:o)