Using Overlapping in Inner Loops

When you are using heavy FPU instructions like fidv, fsqrt, fsin etc. you will be able to use overlapping with your CPU instructions (if you don't exactly know what overlapping means you should better read a tutorial about FPU optimization first). But in many cases, the following code will require the results of the preceeding, resulting in the fact that overlapping seems impossible. Here is one example of an inner loop (supposing that all necessary variables/registers are set):

                ;instruction                 ;FPU state
                loopit:
                  fld million                  ;1000000 x
                  fdiv st(0), st(1)            ;1000000/x x
                  fistp storage                ;x
                  mov eax, storage             ;x
                  mov eax, [eax*4+esi]         ;x
                  stosd                        ;x
                  fadd ten                     ;x+10
                loop loopit                    ;x+10

I won't talk about optimizing the algorithm, so if you know a far better way and do the math behind it without division or whatever, don't care about, it's a simple example where we'll just optimize the code.

I'm sure you found the big stall after the fdiv. There are nearly fourty cycles lost for ever. So what I'd done before is splitting up stosd in mov [edi],eax and add edi,4 and then put the latter after the fdiv together with a load from edi and storage to update the cache.

Well, I must say that there's a much better way to fill the stall with CPU instructions. It's clear that you cannot simply put those three lines just after the fdiv as they need its result, but you could reorganize the loop so that you do the CPU instructions of a previous fdiv after the current fdiv. You just use a buffer so that your CPU instructions are always one round behind the FPU instructions. And this buffer is the variable "storage". Unclear what I mean? Let's try to implement it:

     fld million                  ;1000000 x
     fdiv st(0), st(1)            ;1000000/x x
     dec ecx                      ;1000000/x x
     jecxz ecxwas1                ;1000000/x x
   loopit:
       fistp storage              ;x
       fadd ten                   ;x+10
       fld million                ;1000000 x+10
       fdiv st(0), st(1)          ;1000000/(x+10) x+10
       mov eax, storage           ;1000000/(x+10) x+10 overlap!
       mov eax, [eax*4+esi]       ;1000000/(x+10) x+10 overlap!
       stosd                      ;1000000/(x+10) x+10 overlap!
   loop loopit                    ;1000000/(x+10) x+10 overlap!
   ecxwas1:
     fistp storage                ;x Take care! It's different than above!
     mov eax, storage
     mov eax, [eax*4+esi]
     stosd

I must mention that this code never found its way to an assembler, so it might just not work and crash your HD. So don't do anything with it but reading and maybe learning!

Well, the code is a bit blown up, but if I counted right, the number of executed instructions stayed the same. But now have a look at the inner loop. Every CPU instruction (even the loop itself) is overlapping with the fdiv.

In this example, the speedup is rather low, especially if you optimize the CPU instructions a bit. But in most cases there will be more CPU instructions that might overlap and the speedup boosts up to twice as fast.

Hmm, I don't know whether you will need this trick as it is rather special. But I used it twice so far, and I suppose that it might be great for a perspective corrected texture mapping renderengine. No?

Chock (Chock@earthling.net)