Barmy Bitstream Bashing
This is kind of like an update to my old Hugi 15 article called (yeah, you've probably already guessed) "Optimising 'Getbits' part 1". I guess most seasoned 80x86 coders might want to skip this as because its not a ground breaking article and only contains one 'new' trick for the masking out of unwanted bits from the bitstream. I say 'new' to mean, 'it does not use an AND instruction for the masking'. Interested? then keep reading. ;)
Firstly, a 'getbits' routine (or similarly named one) is used to read a varying number of bits quickly from a bitstream. It is often needed in decompression (or 'depacker') programs in order to read back Huffman prefix-codes or in .GIF picture loaders (damn those patents).
The code in here is mostly 32-bits, so you will have to re-code certain parts to get a pure 16-bit version.
First, let's describe how the bitstream is stored. Sorry if this is boring, but it does help to explain things later on.
Bits are stored starting at Bit #0 and ending at Bit #7 of each byte. This is the same format as the .GIF bitstream and seems to be the best suited for 80x86 low-high data storage.
If we read the first 32-bits from the bitstream into EAX then it would look like this:
Say we have already used the first 5 bits ('a' to 'e') of the bitstream and want to get the next 21 bits ('f' to 'z').
After loading EAX with 4 bitstream bytes, we can simply shift EAX right to remove the already 'seen' bits then mask out those extra, unwanted bits.
There are a number of ways to perform the masking. Here are some:
(CL) = Number_Of_Bits_Wanted method 1: mov edx, 1 shl edx, cl dec edx ; mask = (1 << BitsWanted) - 1 method 2: sub edx, edx bts edx, ecx dec edx ; mask = (2 ^ BitsWanted) - 1 method 3: movzx edx, cl mov edx, table[edx*4] ; look-up mask from pre-built table method 4: mov ch, 32 sub ch, cl mov cl, ch mov edx, 0FFFFFFFFh shr edx, cl ; mask = 0FFFFFFFFh >> (32 - BitsWanted)
Advance vs. Random-access
After getting N bits from the bitstream we need to update the position. There are two ways to do this. The first is the most widely used one. It is the 'advance' or 'increment' method. This involves updating both the Start-Bit-Pos and the Start-Byte-Pointer variables.
The following code will allow upto the maximum of 32-bits to be extracted from the bitstream and returned in the EAX register.
.DATA? StartBitPos db ? .CODE ; ; (EAX) = Get (CH) bits from bitstream[DS:ESI] ; GetBits32: mov cl,[StartBitPos] mov eax, [esi] ; \ DL.EAX = read 32+8 bits mov dl, [esi+4] ; / shrd eax, edx, cl ; align with bit #0 xchg cl, ch ; CL = BitsWanted, CH=StartBitPos ;; create and use mask ;; mov edx, 1 shl edx, cl dec edx ; mask = (1 << BitsWanted) - 1 and eax, edx ; mask out unwanted bits ;; update position ;; add cl, ch ; EndBit = StartBitsPos + BitsWanted movzx edx, cl shr edx, 3 add esi, edx ; advance ESI by (EndBit DIV 8) and cl, 7 mov [StartBitPos], cl ret
Now the second way. The 'random-access' method. Instead of updating the pointer ESI after each byte, we recalculate the StartByte and StartBitPos each time. It has the advantage of allowing random-access in the bitstream (and, as you will soon see, slightly shorter code :). Also, if your bitstream is in a static place then, you could hardcode the base address into the [EDX+nnnnnnnn] and free up a 32-bit register (namely ESI).
Again this method allows for up to a maximum of 32 bits.
.DATA? BitSeekPos dd ? ; *** note: 32-bit variable *** .CODE ; ; (EAX) = get (CH) bits from bitstream[DS:ESI] ; GetRandBits32: mov edx, [BitSeekPos] mov cl, 7 and cl, dl ; StartBitPos = (BitSeekPos MOD 8) shr edx, 3 ; StartOffs = (BitSeekPos DIV 8) mov eax, [esi+edx] mov dl, [esi+edx+4] shrd eax, edx, cl xchg ch, cl ;; create and use mask ;; sub edx, edx bts edx, ecx dec edx ; mask = (2 ^ BitsWanted) - 1 and eax, edx ; mask out unwanted bits ;; advanced position ;; add cl, ch movzx edx, cl add [BitSeekPos], edx ; += BitsWanted ret
"The King of the Bits"
Okay, you have waited long enough. Here is the 'new' method I promised many ASCII lines ago. This version is restricted to a maximum of 24-bits (due to the ROR masking method). The 24-bit limit should be enough for most LZ-style packers.
.DATA? BitSeekPos dd ? .CODE ; ; (EAX) = Get (CH) bits from bitstream[DS:ESI] ; GetRandBits24: mov edx, [BitSeekPos] ; mov cl, 7 ; and cl, dl ; StartBitPos = (BitSeekPos MOD 8) shr edx, 3 ; StartOffs = (BitSeekPos DIV 8) mov eax, [esi+edx] ; read 24+8 bits add cl, ch ; LastBit = StartBitPos + BitsWanted ror eax, cl ; # 1 sub cl, 32 ; \ #2 neg cl ; / shr eax, cl ; # 1 Mask & align with Bit #0 nice :) movzx edx, ch ; add [BitSeekPos], edx ; advance by BitsWanted ret
There are two tricks in the above code snippet. Firstly it aligns the last bit in EAX with bit #31 then performs the SHR EAX, CL to both mask and align the LSB with bit #0 in EAX. This saves having to build the mask and then perform the AND operation. The only downside is having to calculate the correct SHR value (read on...).
The second trick is used to work out the SHR value. The formula is:
(32 - (StartBitPos + BitsWanted))
At first glance this would appear to need a temporary register/variable to hold the 32, then subtract the StartBitPos + BitsWanted from it. But you don't. Check out this useful optimisation trick.
A - B ; e.g. 20 - 5 = +15
can be done:
- (B - A) ; (5 - 20) = -15 ; negate = +15
More news at 11.
You can optimise the 'GetRandBits24' further by removing another instruction. It's to do with the 32-(StartBitPos+BitsWanted) part and the rather nice barrel-shifter (ah, you've gotta love it).
mov edx, [BitSeekPos] mov cl, 7 and cl, dl shr edx, 3 mov eax, [esi+edx] add cl, ch ror eax, cl neg cl ;** note: no "SUB CL,32" !! shr eax, cl movzx edx, ch add [BitSeekPos], edx
Unless I've made a complete fool of myself (and that happens often) the built-in MOD 32 feature of the barrel-shifter removes the need for the SUB CL,32 (or a 'AND AL,31' instruction.
sub cl, 32 ; (A) neg cl equals: neg cl ; (B) ;** and cl, 31 ;; <<-- done by barrel-shifter =)
If you run through the numbers you will hopefully see that both give the same results, except for CL=0. In case (A) it will give CL=32 whereas case (B) will give CL=0. "HAHA, TAD you fool!!!" you might cheer. But remember, the barrel-shifter will turn that 32 into 0.
In any case, send me your flamez and call me 'Susan' if it isn't so.
Oh well.... that's all folks. Btw, it was called "King of the Bits" as a play on the phrase "King of the beasts". Which of course is a Lion. And the sound a Lion makes is "RO(a)R...."
I'm sorry. That was a poor joke even by my very low standard. ;)
Who says coders need sleep?