Using the Coprocessor

Dario Phong / PhyMosys

In this article you will learn how to code for the coprocessor aka FPU. This article is aimed for a coder who already knows Assembler, and has never coded for the FPU. Of course you can do inline-asm, but a little knowledge of Asm is needed.

I will not talk here about speed optimization, because I did in another article, or internal data structures, because this article is only the first step, the rest is your job.

If you don't understand any word, refer to some theory.

I checked ALL my examples at least once, however if any of them are wrong, let me know.

I remember when I learned how to use the FPU, I only had a list with all the functions. It wasn't very hard, but I think it could have been easier if I had had this tutorial.

The stack

The stack is where the FPU has the data. It contains 8 registers, that go from st(0) to st(7):

ST(0)  ST(1)  ST(2)  ST(3)  ST(4)  ST(5)  ST(6)  ST(7)

It may have 4 possible states (for any register):

 Status      Tag Word     Meaning
 ------      --------     -------
 Valid       00b          The current register is valid, and may be used.
 Zero        01b          The current register is 0, it may be used too.
 NaN         10b          The current register is not valid format, an
                          error ocurred, can't be used.
 Empty       11b          The current register contains no data.

If you are asking what happens if you use a NaN for any operation, here is my answer: it will make as a result another NaN.

The way you use the FPU-stack is similar to the way you use the stack of your program, when you push (load) a number, it will be the first, the first the second, the second the third...

 1 - Push 12                    St(0) - 12
 2 - Push 42                    St(0) - 42  St(1) - 12
 3 - Push 52                    St(0) - 52  St(1) - 42  St(2) - 12
 4 - Pop                        St(0) - 42  St(1) - 12

Don't try to load more than 8 numbers here, or bad things can (and will) occur. I'm sure you are asking what may happen, well here it is: an exception will be raised, so you will lose some clocks (but your code will still be executed) and you will get a NaN in st(0) or a denormal. (Btw: Both of them will give as a result more NaNs, and this will make weird things with your results.) Also don't underflow the stack, that is, don't pop more data than there is. For freeing a register you may pop it, or you may also do compares with pop (even with two pops).

First example

First of all it is better that you 'reboot' the coprocessor so all the options are the defaults, and there are no NaNs and weird things laying around:

 finit          ;this will (of course) 'reboot' the FPU

Do this only once in your program, unless you know the FPU will be changed by another program, or one weird call.

Now let's do some work. Let's load 'number' in st(0):

 fld number     ;Now st(0) = number

Now let's pop it:

 fstp st(0)

Or we can free it:

 ffree st(0)

Now I must discuss the two basic data types that you will use.

Ints and floats

The basic data type that you will use is of two kinds:

- 16 bits Integers. The data that you used. (_int_ dw 1412)
- 32 bits floating point. The data that you will use. (_float_ dd 30.00012)

The way of loading and popping them is different so try to not directly mix them, in fact is better using floating point, because when you use ints the FPU has to transform the data to its internal data type (real number) and this takes more clocks than converting it from float.

The way for loading and popping ints is:

 fild _int_     ;this will load an integer to st(0)
 fistp _int_    ;this will pop st(0) to the integer value _int_

And for floating point:

 fld _float_    ;this will load a float to st(0)
 fstp _float_   ;this will pop st(0) to the floating value _float_

I recommend that you use ints with the FPU only when it's really needed. Use the CPU for ints.

Basic operations: add

Let's talk about how to use basic operations of the FPU. The old add, this is the way you use it:

 fadd st(0),st(1)

This will add to st(0) st(1) and leave the result in st(0) with st(1) unchanged. In example:

 fld x          ;x <- this refairs to the stack
 fld y          ;y-x <- now st(0) is y, and st(1) is x
 fadd st(0),st(1) ;(y+x)-x <- in st(0) you have the result, and in st(1) x

Ok? It's simple, isn't it?


This will add st(0) and st(1), pop st(0), and leave the new result in st(0). And now the same but without specifiying st(0).

 fld x          ;x
 fld y          ;y-x
 fadd st(1)     ;(y+x) <- in st(0) you have the result

For more info about this topic consult my article about FPU optimization, published in Hugi #14 and at my homepage ( I'll do a rebrief, if you only write fadd, the compiler will generate 'faddp', that means, 'floating point add with pop', so st(0) will be popped, and this is not always good. What if we use another stack register:

 fld x          ;x <- this refairs to the stack (cut'n'paste! }E-)
 fld y          ;y-x <- now st(0) is y, and st(1) is x
 fld z          ;z-y-x <- we loaded z st(0)=z st(1)=y st(2)=x
 fadd st(0),st(2) ;(z+x)-y-x <- in st(0) you have the result,
                ;               and nothing popped

Using it with pops, always remember how this exactly works:

1. Add st(0) with specified stack register.
2. Then store the result in the stack register chosen.
3. Finally pop st(0).

So one example:

 fld x          ;x
 fld y          ;y-x <- no more cut'n'paste comments, I'll promise
 fld z          ;z-y-x
 fadd st(2)     ;I'll show you this step by step: (fadd-> faddp)
  1-            ;z-y-x <- Calculate z+x
  2-            ;z-y-(z+x) <- Store the result
  3-            ;y-(z+x) <- Pop st(0)

Another example of the same kind, but more difficult:

 fld u          ;u
 fld x          ;x-u
 fld y          ;y-x-u
 fld z          ;z-y-x-u
 faddp st(2)    ;Learn this because it will be useful:
  1-            ;z-y-x-u <- Calculate z+x
  2-            ;z-y-(z+x)-u <- Store the result
  3-            ;y-(z+x)-u <- Pop st(0)

Note that you can't do "fadd st(1),st(2)" nor "faddp st(1),st(2)", this is because in all the FPU instructions that have two operands, one of them MUST be st(0).

I spent a lot of time on the fadd because I wanted you to understand how the stack works and how the fp-instructions with pops works.

Basic operations: sub

Now let's go with the sub, this is not as easy as the add, that's because:

x-y != y-x (but x+y=y+x, remember those old days at school?)

The sub works in this way: the first operand is where the result is left, and this is the number to the left of the - (minus) the other one is to the right. I.e.:

 fsub st(0),st(1) = st(0)-st(1)

This will subtract to st(0) st(1) and leave the result in st(0):

 fld x          ;x
 fld y          ;y-x
 fsub st(0),st(1) ;(y-x)-x

Or with pop:

 fld x          ;x
 fld y          ;y-x
 fsubp st(1),st(0) ;(x-1) (remember that the not valuable operand has popped)

Note that 'fsubp st(0),st(1)' doesn't exist, that's because if we sub to st(0) st(1), store the result in st(0), and then pop st(0) we don't have the result of the sub.

And the case when no parameter is given:

 fld x
 fld y          ;y-x
 fsub           ;(x-y) <- this is fsubp st(1),st(0) (fsub with pop)

Another example more:

 fld x          ;x
 fld y          ;y-x
 fld z          ;z-y-x
 fsub st(0),st(2) ;(z-x)-y-x

Note that 'fsub st(2)' doesn't exist. Since examples are always welcome:

 fld x          ;x
 fld y          ;y-x
 fld z          ;z-y-x
 fsub st(2),st(0) ;z-y-(x-z)

Don't worry, the other instructions aren't so difficult to use.

Basic operations: mul

This is very easy, believe me. A mul:

 fld x          ;x
 fld y          ;y-x
 fmul st(0),st(1) ;(y*x)-x

Note that 'fmul st(1)' anyway, no need, you can use 'fmul st(0),st(1)' and even 'fmul st(1),st(0)'! Maybe you are asking why can you do this... This's for the result is stored in another register:

 fld x          ;x
 fld y          ;y-x
 fmul st(1),st(0) ;y-(x*y) <- you see? the result is now in st(1)

And with no parameters:

 fld x          ;x
 fld y          ;y-x
 fmul           ;(y*x) <- this is fmulp st(1),st(0)

Ok? Now I just remind you that x*y = y*x. But I hope you already know that!

Basic operations: div

Now the div, 'fdiv' will divide st(1) with st(0) and leave the result in st(1), and then pop st(0), i.e.:

 fld x          ;x
 fld y          ;y-x
 fdiv           ;(x/y)

Note that 'fdiv st(1)' doesn't exist. Do you want to know why? Ask Intel! (Seriously: this is because we already have an instruction which does that.)

And another example more, this will divide st(0) by st(2) and leave the result in st(0):

 fld x          ;x
 fld y          ;y-x
 fld z          ;z-y-x
 fdiv st(0),st(2) ;(z/x)-y-x

More examples:

 fld x          ;x
 fld y          ;y-x
 fld z          ;z-y-x
 fdiv st(2),st(0) ;z-y-(x/z)

Quite easy, don't you think? Once you know that the first operand is where the result is stored, you'll not have problems with the FPU anymore.

One example

Once you know how the basic operations work, and how the stack works, let's do some examples. The example is the add of two vectors.

The formula is as follows:

 void V3D::Add (V3D &a, V3D &b, V3D &c) {
                 c.x = a.x + b.x;
                 c.y = a.y + b.y;
                 c.z = a.z + b.z;

Thanks to Tryx^Xography for that example.

And the parameters: eax = &a, ebx = &b, ecx = &c

If you don't know C, & means the 32 bit pointer to the object, note that this isn't the object itself. (32 bits if we are under pmode.)

First we calculate the x:

        fld     [eax.x]         ;load a.x
        fadd    [ebx.x]         ;add to it b.x
        fstp    [ecx.c]         ;and store it in c.x

Note that another way of doing that could be:

        fld     [eax.x]         ;load a.x
        fld     [ebx.x]         ;load b.x
        fadd                    ;add them, and pop st(0)
        fstp    [ecx.x]         ;and store it in c.x

Then for y:

        fld     [eax.y]         ;load a.y
        fadd    [ebx.y]         ;add b.y
        fstp    [ecx.y]         ;store it to c.y

Now I suppose you can do the last one. (.x, .y, .z is a data structure previously declared.)


This is one of the most important instructions used when coding for the FPU. Well, maybe it isn't very important, but believe me, once you start optimizing for the FPU almost 50% of the instructions are fxch (and this is a good frequency for compressors, he he). It's also the only pairable instruction. It simply exchanges the contents of two stack registers.

 fld x          ;x
 fld y          ;y-x
 fxch st(1)     ;x-y <- we exchange them

With another operand than st(1):

 fld x          ;x
 fld y          ;y-x
 fxch st(1)     ;x-y
 fld z          ;z-x-y
 fxch st(2)     ;y-x-z

Note that this instruction can only work with sp(0) and another stack register. Note too that when we write 'fxch' the compiler generates 'fxch st(1)'. (And that means, as you already must know: exchange st(0) and st(1).)

Second example

Now let's do the first example again, but this time in another fashion, first we load all the data, then we do the adds and finally the pops:

        fld     [eax.x]         ;ax                     Load data
        fld     [ebx.x]         ;bx-ax
        fld     [eax.y]         ;ay-bx-ax
        fld     [ebx.y]         ;by-ay-bx-ax
        fld     [eax.z]         ;az-by-ay-bx-ax
        fld     [ebx.z]         ;bz-az-by-ay-bx-ax
        fxch    st(5)           ;ax-az-by-ay-bx-bz      Recollocate it
        fxch    st(1)           ;az-ax-by-ay-bx-bz
        fxch    st(4)           ;bx-ax-by-ay-az-bz
        fadd                    ;(bx+ax)-by-ay-az-bz    Do the adds
        fxch    st(1)           ;ay-by-(bx+ax)-az-bz
        fadd                    ;(ay+by)-(bx+ax)-az-bz
        fxch    st(2)           ;az-(bx+ax)-(ay+by)-bz  Exchanges when needed
        fxch    st(1)           ;(bx+ax)-az-(ay+by)-bz
        fxch    st(3)           ;bz-az-(ay+by)-(bx+ax)
        fadd                    ;(bz+az)-(ay+by)-(bx+ax)
        fxch    st(2)           ;(bx+ax)-(ay+by)-(bz+az)
        fstp    [ecx.x]         ;(ay+by)-(bz+az)        Store them
        fstp    [ecx.y]         ;(bz+az)
        fstp    [ecx.z]         ;stack empty

This implementation is very impressive, but it's far from speed optimized, or even size optimization. If you want to see how this can be fully optimized, read my related article in Hugi #14 or at my homepage.

Reverse operations

Reverse operations, are some FPU instructions, only two (fdiv, fsub), that do the reverse action with the operands. Let's say we have:

 fld x          ;x
 fld y          ;y-x

Now a typical "fdiv st(0),st(1)" will do: y/x. But we need x/y. We could of course do: "fdiv st(1),st(0)" but that will leave the result in st(1). Then if we want x/y and the result in st(0), what shall we do?

 fdivr          ;(x/y)-x

The reverse of the fdiv will do this job. The same applies for the sub.

 fld x          ;x
 fld y          ;y-x
 fsubr st(0),st(1) ;(x-y)-x

Remember than when we specify st(1), we can pass as operand any other stack register, from 0 to 8. But remember that at LEAST one of the operands MUST be the top of the stack (st(0)).

Not so basic operations

Here I'll discuss other instructions, not all of them, because once you know how the basic instructions work then you know how other instructions should work. I'll not list all of them here, for two reasons: I don't have time. And you can learn them.

Fld1    - Loads in st(0) the value 1.0
Fldpi   - Loads in st(0) the value pi (3.141592654... E-)
Fldz    - Loads in st(0) the value 0.0
Fchs    - Changes the sign of st(0)
Fabs    - Does the absolute value of st(0)
Fsqrt   - Calculates the square root of st(0)
Fsin    - Calculates the sin of st(0) and leaves it in st(0)
Fcos    - Calculates the cos of st(0) and leaves it in st(0)
Fsincos - Calculates  the sin of st(0) and stores it in st(0), then calculates
          the  cos of st(0) and loads it into  st(0), so when it ends you have
          in  st(0)  the cos of the previos  st(0)  and in st(1) the sin. (Use
          that instead of sin and cos, it's faster.)
Fptan   - Leaves in st(0) 1 and in st(1) the tan of st(0). The fact that st(0)
          is  always 1 is only valid for  INTEL processors (as far as I know).
          Read Agner Fog's manual for more info in this topic.

I doubt you need more for starting to code.

Last example

And one last example, we have the formula for doing an environment map for a bump mapping (2d):

color= 1 - sqrt ( ( (x-centerx)/centerx )^2 + ( (y -centery)/centery)^2 )

So here it is the code, that you should correctly understand:

        fld     x               ;x
        fsub    centerx         ;(x-cx)
        fdiv    centerx         ;((x-cx)/cx)
        fld     y               ;y-((x-cx)/cx)
        fsub    centery         ;(y-cy)-((x-cx)/cx)
        fdiv    centery         ;((y-cy)/cy)-((x-cx)/cx)
        fxch                    ;((x-cx)/cx)-((y-cy)/cy)
        fmul    st(0),st(0)     ;((x-cx)/cx)^2-((y-cy)/cy)
        fxch                    ;((y-cy)/cy)-((x-cx)/cx)^2
        fmul    st(0),st(0)     ;((y-cy)/cy)^2-((x-cx)/cx)^2
        fadd                    ;(((y-cy)/cy)^2 + ((x-cx)/cx)^2)
        fsqrt                   ;sqrt (((y-cy)/cy)^2 + ((x-cx)/cx)^2)
        fld1                    ;1-sqrt (((y-cy)/cy)^2 + ((x-cx)/cx)^2)
        fsubr                   ;(1 - sqrt (((y-cy)/cy)^2 + ((x-cx)/cx)^2))
        fistp   color           ;stack empty

Note that for doing the x^2 we do x*x.

Also note that we use 'fsubr'. For learning how it works read 'Reverse operations'.


Of course you can compare the data in the stack but this is a little bit slow, so if you can avoid it, then avoid it.

Anyway here it is how you must do it:

 fadd           ;load any data or do operations
 fcom st(1)     ;compare st(0) with st(1)
 fstsw ax       ;read the status word in AX
 sahf           ;store ah in the flags
 jcc            ;do any conditional jmp

Do you want an example? Here it is:

 fld1           ;1
 fldz           ;0-1
 fcom st(1)
 fstsw ax
 ja             ;of course it will NOT jump, as 0 isn't greater than 1

For optimizations and such things refer to my other article.

MMX and Pentium II also support conditional moves.


This is the correct name for a little math trick that is explained in my optimization doc, or in Agner Fog's (excellent) Optimization manual. However I'll explain it again. The idea is instead of doing a div, multiply with the reciprocal. "What's that?" you may ask? The reciprocal of a number N is 1/N. And once this is calculated you do the 'div':

X*reciprocal of N

Ok? An easy example:

We have to do this: 12/4
Then we do the reciprocal of 4: 1/N -> n=4 -> 1/4 -> 0.25
Then the emulated div: X*reciprocal -> 12*0.25 -> 3

But, if as I hope, you think at the same time you read, you notice that for doing this emulation we MUST do a mul and a div, so: are we losing clocks? In fact if you only have to do ONE div then, we are of course using a bad technique, but if we have to do two or more divs with the SAME number (N) then we gain a lot of clocks: Instead of doing 2 divs, we do 1 div and 2 muls. Here it is when the reciprocals are useful.

Some theory

This section is like a (little) dictionary.

- FPU means Floating Point Unit.

- NaN = Not A Number. This is a not valid number, and will cause operations with it fail, and make your program lose some clocks.

- Stack register, where the FPU has the data, they go from st(0) to st(7).

- Exceptions, they occur when something is wrong with the data loaded in the FPU, or the result of an operation needs more bits than FPU has.

- Top of the stack, this is the register st(0) (also st).

Closing words

Here I showed you how to use the coprocessor, now use it, since in every PC that is actually sold there's an FPU. This article was intended to show you only the basics of FPU-programming. For more on the topic you should read other articles, and experiment, experiment a lot, this is the key.

If you want a list with all the instructions go to

If you want the timings go to

In the case that you can't go to them just email me and I'll send them to you.

         "Politicians are like soldiers,
 but their words kill twice the people."
                             DARiO PhONG

- DAriO PhONG, Barcelona 6/2/1999