Advanced Topics of the FPU
Dario Phong / PhyMosys
So you've learned how to use the FPU and how to optimize for it, but you still need more: you still feel that you don't have full control over the FPU, that it still has secrets... Now it's time to change that and improve your knowledge of the FPU.
This article is aimed at any coder, or programmer, who wants to use the FPU and at the same time know exactly what he's doing. You'll not learn impressive optimization tricks or anything like that, but when you've read this article you'll know how to use the FPU efficiently. You'll learn advanced topics, topics that you'll (probably) need.
However, I'll not talk about some other topics, like the FPU instruction pointer, the FPU operand pointer or the opcode, because they are only useful if you want to make a debugger or something like that. Btw: if you are planning to do one, don't hesitate to ask me for help with the FPU.
I checked all the examples at least once. I hope there aren't any bugs, but if you find something that is wrong, email me.
With almost every topic I tried to give the needed theory and examples, so you don't have to spend your time learning them; well, you'll learn them, but in half the time or less. In some sections, like Stack pointer, you'll find some information that you will not find anywhere else, or at least I never found it. And now let's start to learn more and more, because this is one of the aims of life. (At least for me.)
Status Word
The status word keeps track of what happened with the executed instructions: whether any exception occurred, the results of comparisons, and more.
Remember that finit clears it.
Bit    Meaning
---    -------
0      Invalid operation (LSB)
1      Denormalized operand
2      Zero divide
3      Overflow
4      Underflow
5      Precision
6      Stack fault
7      Error summary status
8      Condition code 0
9      Condition code 1
10     Condition code 2
11-13  Top of the stack pointer (3 bits)
14     Condition code 3
15     FPU busy (MSB)
Bits 0 to 5 are the exception flags, and bit 6 (Stack fault) tells you that an invalid operation came from a stack overflow or underflow. When you init the FPU they are all set to 0. If any of them is set to 1, that exception has occurred. Bit 7 is the Error summary status: it is set when one of the exception flags is set while the corresponding mask bit in the control word is 0 (unmasked).
Remember that once an exception flag is set (to 1) it stays set until you clear it, with fclex or by reiniting the FPU. (The same goes for Error summary status.) The condition codes hold the result of comparisons such as fcom, and when an error occurs they carry additional information. The Top of the stack pointer... you'll learn how it works in another section. The FPU busy flag told you on the 8087 that the FPU was computing; on later processors it is kept only for compatibility and just mirrors the Error summary status.
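If you prefer to see the layout as code, here is a small sketch in C that picks a status word apart. The function names are made up for this example, and the value passed in is assumed to be whatever fstsw stored.

```c
#include <stdint.h>

/* Hypothetical helpers: 'sw' is a status word as stored by fstsw. */

/* Bits 0-5: the six exception flags. */
unsigned sw_exception_flags(uint16_t sw) { return sw & 0x3Fu; }

/* Bit 6: Stack fault. */
unsigned sw_stack_fault(uint16_t sw) { return (sw >> 6) & 1u; }

/* Bit 7: Error summary status. */
unsigned sw_error_summary(uint16_t sw) { return (sw >> 7) & 1u; }

/* Bits 11-13: Top of the stack pointer. */
unsigned sw_top(uint16_t sw) { return (sw >> 11) & 7u; }

/* Condition codes packed as C3 C2 C1 C0 (bits 14, 10, 9, 8). */
unsigned sw_condition_codes(uint16_t sw)
{
    return (((sw >> 14) & 1u) << 3) | ((sw >> 8) & 7u);
}
```

For example, sw_top(0x3800) gives 7, and a status word of 0x0041 has both Invalid operation and Stack fault set.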
Control Word
The control word allows us to change some options of the FPU. This is its structure:
0 Invalid operation exception mask
1 Denormalized operand exception mask
2 Zero divide exception mask
3 Overflow exception mask
4 Underflow exception mask
5 Precision exception mask
8-9 Precision control
10-11 Rounding control
12 Infinity control
Note that both overflow and underflow here refer to numeric overflow and underflow (a result too big or too small to be represented), not to stack overflow and underflow.
The exception masks work as follows:
- If it's 1 it's masked. The FPU handles the exception itself: it sets the flag, writes a default result and goes on.
- If it's 0 it's unmasked. The FPU sets the flag and the Error summary status, and calls the exception handler.
I doubt that you'll handle the exceptions, so what you must know is that in case of error, if the exception is masked the FPU will generate NaNs, infinities or whatever, and if it is unmasked it will (for most exceptions) leave the registers alone and signal the error. You'll learn about Precision control and Rounding control in other sections. The Infinity control is ignored by the Pentium; it's only kept for compatibility with the 287.
Tag Word
This is another topic that's not very useful for 3d, or for most uses of the FPU. However, the more you know, the better you understand things. The tag word holds the status of every register of the FPU. Its organization is as follows:
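Bits   Meaning
----   -------
0-1    Tag value for physical register 0
2-3    Tag value for physical register 1
4-5    Tag value for physical register 2
6-7    Tag value for physical register 3
8-9    Tag value for physical register 4
10-11  Tag value for physical register 5
12-13  Tag value for physical register 6
14-15  Tag value for physical register 7
Note that the tags belong to the physical registers 0-7, not to st(0)-st(7): st(i) is the physical register number (Top of the stack pointer + i) mod 8.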
Every tag may have one of 4 different states:
Value Status  Meaning
----- ------  -------
00b   Valid   The stack register has a valid number.
01b   Zero    It contains the value 0.
10b   Special Denormal, Infinity or a NaN.
11b   Empty   The stack register doesn't have any value.
The tag word is used for detecting stack overflows and underflows.
A tag doesn't guarantee what the register really holds: a register tagged as zero doesn't necessarily contain a 0. However, don't think that the FPU tries to lie to you; it will only happen if you fool around with it. An example: you save the environment, change the tags, but not the numbers. But even this has its limitations: when you load the tag word the FPU only honours the empty tag. If a tag is 11b (empty) it is loaded as is. In the other cases the FPU reads the actual data of the register and sets the tag itself. Remember that NaN means Not A Number.
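To make the two-bit fields concrete, here is a minimal C sketch that extracts one tag and counts how many registers are in use. The names are invented for this example; 'index' is simply the number of the two-bit field (0 for bits 0-1, 7 for bits 14-15).

```c
#include <stdint.h>

enum { TAG_VALID = 0, TAG_ZERO = 1, TAG_SPECIAL = 2, TAG_EMPTY = 3 };

/* Hypothetical helper: returns the two-bit tag number 'index' (0-7). */
unsigned tag_field(uint16_t tw, unsigned index)
{
    return (tw >> (2u * (index & 7u))) & 3u;
}

/* Hypothetical helper: counts the tags that are not 11b (empty). */
unsigned tag_fields_in_use(uint16_t tw)
{
    unsigned i, used = 0;
    for (i = 0; i < 8; i++)
        if (tag_field(tw, i) != TAG_EMPTY)
            used++;
    return used;
}
```

With the tag word FFFFh every field is 11b, so tag_fields_in_use returns 0; with 3FFFh only the last field is not empty and it returns 1.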
Reinit the FPU
As you know the FPU gets reinited with: 'finit'. But what are the default values?
Control word (037Fh)
  All the exception masks set to 1.
  Precision: extended (11b).
  Rounding: round to nearest (00b).
  Infinity control: 0b.
Status word (0000h)
  Nothing happened. Stack pointer at the top (000b). FPU not busy.
  No comparison done.
Tag word (FFFFh)
  All the registers tagged as empty.
Note that the contents of the registers are NOT changed.
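As a quick check of the default control word, this C sketch splits a control word into its fields. The helper names are hypothetical; the bit positions are the ones from the control word table.

```c
#include <stdint.h>

/* Hypothetical helpers: 'cw' is a control word as stored by fstcw. */
unsigned cw_exception_masks(uint16_t cw) { return cw & 0x3Fu; }      /* bits 0-5   */
unsigned cw_precision(uint16_t cw)       { return (cw >> 8) & 3u; }  /* bits 8-9   */
unsigned cw_rounding(uint16_t cw)        { return (cw >> 10) & 3u; } /* bits 10-11 */
unsigned cw_infinity(uint16_t cw)        { return (cw >> 12) & 1u; } /* bit 12     */
```

Feeding it 0x037F gives all six masks set (3Fh), precision 11b, rounding 00b and infinity control 0.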
How to read the Status Word
The instruction that lets you read the Status Word is fstsw. Its operand can be a word in memory or ax:
fstsw ax ;put the status word in ax
fstsw int_ ;put the status word in memory
The way the word is saved is the following:
First bit: Invalid operation
Second bit: Denormalized operand
And so on.
How to modify the Control Word
The Control word can be read (fstcw) and written (fldcw).
Note that you can't specify ax as an operand; the only valid operand is a word in memory.
The way you should modify the Control Word is: first read it into memory, then 'and' and 'or' it, and finally load it back.
Example: we want the Denormalized operand exception mask to be 0:
fstcw int_ ;read it
and int_,1111111111111101b ;put to 0
fldcw int_ ;write it
Another example: we want the precision to be Double (10b):
fstcw int_ ;read it
and int_,1111110011111111b ;put to 0 precision field
or int_,0000001000000000b ; put the value 10b
fldcw int_ ;write it
Remember that the default value of the Control word when you do an finit is 037Fh (0x037F for C users).
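For C users, the same read, 'and', 'or', write idea can be sketched on a plain 16-bit value. This only shows the bit arithmetic: the function names are made up, and actually reading and loading the control word is assumed to happen elsewhere (fstcw and fldcw).

```c
#include <stdint.h>

enum { PC_SINGLE = 0, PC_DOUBLE = 2, PC_EXTENDED = 3 };

/* Hypothetical helper: clear the precision field (bits 8-9), then set it. */
uint16_t cw_set_precision(uint16_t cw, unsigned pc)
{
    cw &= (uint16_t)~0x0300u;
    cw |= (uint16_t)((pc & 3u) << 8);
    return cw;
}

/* Hypothetical helper: put one exception mask (bit 0-5) to 0. */
uint16_t cw_unmask(uint16_t cw, unsigned bit)
{
    return cw & (uint16_t)~(1u << bit);
}
```

Starting from the default 0x037F, cw_set_precision(cw, PC_DOUBLE) gives 0x027F and cw_unmask(cw, 1) gives 0x037D, the same values the two assembly examples produce.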
Invalid Operation Exception
An invalid operation exception mainly occurs when you try to operate on infinities, NaNs or such (wrong) things. Another possible case is doing an fxch when the registers are empty:
fxch st(1) ;stack empty, error.
Both operands (st(0), st(1)) will be set to NaN, and the typical flags will be set: Invalid operation and Stack fault.
The same will happen with fchs. (Only one operand!)
If you set the Invalid operation exception mask to 0, the FPU will see that the fxch would cause an error, and it will not execute it. Example:
fxch st(1) ;stack empty. Invalid operation exception mask=0
As a result NOTHING will happen to the registers. The following flags will be set: Invalid operation, Stack fault and Error summary status.
Stack Overflow
A stack overflow occurs when you load a number and st(7) is NOT empty, no matter what value the stack pointer has, nor what st(7) contains (NaN, zero, valid). If you want to avoid it you have two solutions:
1. Try not to overflow the stack. This is not so difficult.
2. Set the Invalid operation exception mask in the control word to 0. No data will be loaded, and the Invalid operation flag will be set to 1. (The Error summary status will be set too.)
Remember that the Stack fault flag will also be set to 1, and condition code 1 will be set to 1 too.
Stack Underflow
A stack underflow occurs when you try to store from a stack register (st(0)) that is EMPTY. Example:
fld x ;x
fstp x ;stack empty
fstp x ;stack underflow
Then the stack will be popped, the Invalid operation and Stack fault flags will be set to 1, and Condition code 1 will be set to 0. The variable in memory will be:
- If this was an integer: 8000h
- If this was a float: FFC00000h (if you load it again: a -NaN)
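Why does FFC00000h load back as a -NaN? A short C sketch of the single precision fields shows it: the sign bit is 1, the exponent is all ones and the fraction is not zero. The function names are made up for the example.

```c
#include <stdint.h>

/* Hypothetical helpers working on the raw 32 bits of a single. */
unsigned f32_sign(uint32_t bits)     { return bits >> 31; }           /* bit 31     */
unsigned f32_exponent(uint32_t bits) { return (bits >> 23) & 0xFFu; } /* bits 23-30 */
uint32_t f32_fraction(uint32_t bits) { return bits & 0x7FFFFFu; }     /* bits 0-22  */

/* A NaN has an all-ones exponent and a non-zero fraction. */
int f32_is_nan(uint32_t bits)
{
    return f32_exponent(bits) == 0xFFu && f32_fraction(bits) != 0;
}
```

For FFC00000h the sign is 1 and f32_is_nan returns 1; for FF800000h (minus infinity) the fraction is 0, so it is not a NaN.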
You can, of course, set the Invalid operation exception mask to 0. In that case nothing will be popped, and the following flags will be set: Invalid operation, Stack fault and Error summary status. (Of course no data will be updated.)
Remember that both stack underflow and overflow are invalid operation exceptions.
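Both checks can be sketched in C from the tag word and the Top of the stack pointer. This is only a model of the rule, not how the FPU is built: it assumes the tags are indexed by physical register number, and the function names are hypothetical.

```c
#include <stdint.h>

#define TAG_EMPTY 3u

/* Tag of physical register 'reg' (0-7). */
static unsigned tag_at(uint16_t tw, unsigned reg)
{
    return (tw >> (2u * (reg & 7u))) & 3u;
}

/* A load decrements TOP first, so it lands on the old st(7).
   It overflows if that register is not empty. */
int load_would_overflow(uint16_t tw, unsigned top)
{
    return tag_at(tw, (top + 7u) & 7u) != TAG_EMPTY;
}

/* A store from st(0) underflows if st(0) is empty. */
int store_would_underflow(uint16_t tw, unsigned top)
{
    return tag_at(tw, top & 7u) == TAG_EMPTY;
}
```

Right after finit (tag word FFFFh, TOP 0) a load is fine and a store underflows. With all eight registers in use (tag word 0000h) any load overflows.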
Zero Divide Exception
Get your calculator and do the following: 1/0. What do you get? Exactly, an error. The same happens with the FPU:
fld1 ;1 (I used 1, but we could use any other positive number)
fldz ;0
fdiv ;1/0 Error!
The result will be +infinity. The Zero divide flag will be set to 1. Another example, this time with a negative number divided by 0:
fdiv ;-1/0 Error! (st(1)=-1, st(0)=0)
It will also set the Zero divide flag to 1, but the result will be -infinity. However, if you divide 0 by 0:
fdiv ;0/0 Error! (st(1)=0, st(0)=0)
The result will be a -NaN, and this time the flag that gets set is Invalid operation, not Zero divide.
The Zero divide exception occurs not only with the fdiv family, but also with fxtract and fyl2x.
Of course, if the Zero divide exception is unmasked (mask bit 0) no register will be updated, and both the Zero divide flag and the Error summary status will be set.
Numeric Overflow Exception
This happens when a number is too big to be represented. An example:
fld _10_ ;10
fmul st(0),st(0) ;100
fmul st(0),st(0) ;10000
fmul st(0),st(0) ;100000000
fmul st(0),st(0) ;10000000000000000
fmul st(0),st(0) ;100000000000000000000000000000000
fmul st(0),st(0) ;1.0e+64
fmul st(0),st(0) ;1.0e+128
fmul st(0),st(0) ;1.0e+256
fmul st(0),st(0) ;1.0e+512
fmul st(0),st(0) ;1.0e+1024
fmul st(0),st(0) ;1.0e+2048
fmul st(0),st(0) ;1.0e+4096
fmul st(0),st(0) ;+INFINITE, overflow
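The result: +INFINITY. Flags set to 1: Overflow and Precision. When the corresponding exception was unmasked, it gave me... er... strange results (7,212e723 or something like that), so better mask this one. (With the exception unmasked the FPU hands the handler the result with its exponent scaled down, which is why the number looks strange.) Remember that the maximum is about 1.18 * 10^4932.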
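You can count the multiplications without an FPU. This C sketch only follows the decimal exponent, which doubles with every squaring of a power of ten; the function name is made up, and 4932 is taken as the largest decimal exponent an extended real can hold.

```c
/* Hypothetical helper: starting from 10^1, returns the number of the
   squaring whose result no longer fits (exponent above max_exp). */
unsigned squarings_until_overflow(unsigned max_exp)
{
    unsigned exp10 = 1;   /* we start with 10^1 */
    unsigned count = 0;
    while (exp10 <= max_exp) {
        exp10 *= 2;       /* squaring doubles the exponent */
        count++;
    }
    return count;
}
```

squarings_until_overflow(4932) returns 13, which matches the thirteen fmul instructions of the listing: the twelfth gives 1.0e+4096 and the thirteenth overflows.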
Numeric Underflow Exception
I'm sure you can imagine why this happens: exactly, the number is too small to be represented. This time I'll try not to write too much.
fld _div_ ;st(0)=1.0e4096
fld1 ;st(0)=1, st(1)=1.0e4096
fdiv st(0),st(1) ;st(0)=1.0e-4096, st(1)=1.0e4096
fdiv st(0),st(1) ;st(0)=0, st(1)=1.0e4096 underflow!
The result: 0. Flags set to 1: Underflow and Precision. With the exception unmasked... it gave me weird results too.
Of course it set Error summary status, Underflow and Precision. Again I recommend you to mask them. (Or leave them alone, because finit does that.)
Precision Control
Precision is how many bits of the number the FPU keeps in its results. You can choose among three different kinds of precision:
Value Name Bits
----- ---- ----
00b Single 24
10b Double 53
11b Extended 64
Note that the value 01b is reserved.
Here we find a good solution for fdivs. Imagine we have a loop with fdivs, and the values needed change for them, so we cannot use the reciprocals trick. Then, if we don't care a lot about precision, we can specify less precision, and it will take less time:
Precision Value Clocks (fdiv)
--------- ----- -------------
24 00b 19
53 10b 33
64 11b 39
The difference from 53 to 64 isn't very big. But precision 24 is very attractive, from an optimization point of view.
See the section Data types for more info about precision. However, that's too much theory, so let's see an example.
1/3 = 0.3333333..., that is, a 3 that repeats forever.
You know how many clocks an fdiv takes at each precision... but what about the result? Here it is:
Precision Hex value (80 bits [real])  Decimal number
--------- --------------------------  --------------
24        3FFD AAAA AB00 0000 0000h   0.3333333432674408
53        3FFD AAAA AAAA AAAA A800h   0.33333333333333331
64        3FFD AAAA AAAA AAAA AAABh   0.33333333333333333
We can easily see how much precision each one has. I think the 7 digits of 'float' precision are more than OK for almost all applications.
Remember that after finit the precision is 64, so unless you are developing a calculator use precision 24!
If you are asking what happens if we use the precision value 01b: my test gave me the same result as for the 11b (64 bits) precision. However, don't choose it; maybe in the future the 01b value will represent another precision.
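The hex values of 1/3 can be rebuilt by hand. This C sketch does a binary long division of 4/3 (the significand of 1/3, which is 1.0101...b), keeps the given number of bits and rounds to nearest. The function name is hypothetical and 'bits' is assumed to be between 1 and 64.

```c
#include <stdint.h>

/* Hypothetical helper: 64-bit significand of 1/3 rounded to 'bits' bits. */
uint64_t third_significand(unsigned bits)
{
    uint64_t sig = 0;
    unsigned rem = 4;                 /* we divide 4 by 3, bit by bit */
    unsigned i;
    for (i = 0; i < bits; i++) {
        unsigned bit = rem >= 3u;
        sig = (sig << 1) | bit;
        rem = (rem - 3u * bit) * 2u;  /* remainder, already doubled */
    }
    if (rem > 3u)                     /* dropped part is over half a unit */
        sig++;
    return sig << (64u - bits);       /* left-align in the 64-bit field */
}
```

With 24 bits it gives AAAA AB00 0000 0000h, with 53 bits AAAA AAAA AAAA A800h and with 64 bits AAAA AAAA AAAA AAABh.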
Rounding Control
Rounding is... You should already know that! Here is the table:
Round to nearest 00b
Round down 01b
Round up 10b
Round toward zero 11b
Round to nearest. It rounds to the nearest value; when the number is exactly halfway, it rounds to the even one.
Round down. It rounds toward minus infinity.
Round up. Er... toward plus infinity.
Round toward zero. Aka truncation or chop mode. It just drops the decimals.
And some examples:
Mode Value Result Result
---- ------ ------ ------
Round to nearest 00b 4 2
Round down 01b 3 2
Round up 10b 4 3
Round toward zero 11b 3 2
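The four modes can be modelled in C without touching the FPU. In this sketch 'rc' is the two-bit value of the Rounding control field; the function name is made up and the value is assumed to fit in a long. The inputs 3.5 and 2.5 used below are an assumption: they are consistent with the results of the table, nothing more.

```c
/* Hypothetical helper: rounds x to an integer like the given mode would. */
long x87_round(double x, unsigned rc)
{
    long t  = (long)x;                        /* C casts chop toward zero */
    long lo = (x < (double)t) ? t - 1 : t;    /* nearest integer below    */
    long hi = (x > (double)lo) ? lo + 1 : lo; /* nearest integer above    */

    switch (rc & 3u) {
    case 1:  return lo;                       /* 01b round down           */
    case 2:  return hi;                       /* 10b round up             */
    case 3:  return t;                        /* 11b round toward zero    */
    default: {                                /* 00b round to nearest     */
        double dlo = x - (double)lo;
        double dhi = (double)hi - x;
        if (dlo < dhi) return lo;
        if (dhi < dlo) return hi;
        return (lo % 2 == 0) ? lo : hi;       /* halfway: pick the even   */
    }
    }
}
```

With 3.5 the four modes give 4, 3, 4, 3 and with 2.5 they give 2, 2, 3, 2. Negative numbers show the difference between round down and chop: -2.5 goes to -3 with 01b and to -2 with 11b.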
Now you just have to decide what best suits your needs. After all, I think there's no big difference, just decide what you like more. And now, some fun after all the theory and examples:
Kind of person Rounding mode
Don't like odd numbers nor odd people Round to nearest (even)
Pessimist Round down
Optimist Round up
Likes guillotine and swords Round toward zero
Well, now let's leave jokes and continue with the FPU.
Top of the Stack Pointer
This section is (probably) not really needed, but I worked on it enough so you should have a look at it.
The stack pointer is 3 bits long, so it may hold values from 0 to 7, exactly one per stack register. It starts at 0. Every time we load, it's decremented. When we pop, it's incremented. Note that a (stack) overflow or underflow doesn't occur just because this 'register' wraps around. Example:
finit ;TOP = 0 \_ 0-1=7, it underflows (only the stack pointer)
fld1 ;TOP = 7 /
fld1 ;TOP = 6
fld1 ;TOP = 5
fstp f ;TOP = 6
fstp f ;TOP = 7 \_ 7+1=0 it overflows
fstp f ;TOP = 0 /
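The wrap-around is just 3-bit arithmetic, as this tiny C sketch shows. The function names are made up for the example.

```c
/* Hypothetical helpers: TOP is a 3-bit counter, so it wraps modulo 8. */
unsigned top_after_load(unsigned top) { return (top - 1u) & 7u; } /* fld  */
unsigned top_after_pop(unsigned top)  { return (top + 1u) & 7u; } /* fstp */
```

Starting at 0, three loads give 7, 6, 5 and three pops bring it back through 6 and 7 to 0, like in the listing.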
And now let's play around with it.
First we need to change the status word... but is there any instruction to load the value of the status word? Of course there's one, and you'll learn it in the section SAVING THE FPU ENVIRONMENT. Go and learn it now, unless you already know how it works.
Now we'll assume that we are in 16 bits real mode.
fstenv fpu_env ;read the whole FPU (except the stack registers)
mov ax,word ptr fpu_env+2 ;read status word
or ax,0011100000000000b ;put Stack pointer to 7
mov word ptr fpu_env+2,ax ;write it
fldenv fpu_env ;load the environment back
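The 'or' alone works here because we want 7, that is, all three bits set. For any other value the field has to be cleared first. A C sketch of the general case, with a made-up function name, working on the saved status word:

```c
#include <stdint.h>

/* Hypothetical helper: put 'top' (0-7) into bits 11-13 of a status word. */
uint16_t sw_set_top(uint16_t sw, unsigned top)
{
    sw &= (uint16_t)~0x3800u;            /* clear the old Stack pointer */
    sw |= (uint16_t)((top & 7u) << 11);  /* put the new one             */
    return sw;
}
```

sw_set_top(0x0000, 7) gives 0x3800, the same as the 'or' in the listing, and sw_set_top(0x3800, 2) gives 0x1000.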
And what can this be useful for? Well, it's not really useful, but it's knowledge after all. Now we can do weird things with the stack. As you saw, we can easily change the Stack pointer, so what will we do now? Let's say we have loaded eight numbers:
Stackp: 0 Stackp: 1
st(0) 0 st(0) 1
st(1) 1 st(1) 2
st(2) 2 st(2) 3
st(3) 3 st(3) 4
st(4) 4 st(4) 5
st(5) 5 st(5) 6
st(6) 6 st(6) 7
st(7) 7 st(7) 0
Stackp: 2 Stackp: 7
st(0) 2 st(0) 7
st(1) 3 st(1) 0
st(2) 4 st(2) 1
st(3) 5 st(3) 2
st(4) 6 st(4) 3
st(5) 7 st(5) 4
st(6) 0 st(6) 5
st(7) 1 st(7) 6
(Note that Stackp=0 was the initial value.)
You see what happened? By changing the stack pointer we can have at the top of the stack (st(0) aka st [stack top]) any register we want.
The way to do that is to put in the Stack pointer the number of the register we want at the top. I know this isn't very useful, mainly because it's slow.
Note that we were using a brute force method. A more elegant and faster way is fincstp and fdecstp. They both take 2 cycles. They, of course, increment and decrement the stack pointer. So maybe you'll find playing around with the stack pointer useful after all. If you write code that modifies the stack pointer, please let me know.
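The tables with the different Stackp values follow one simple rule, sketched here in C with a made-up function name: st(i) is the register whose number is the Stack pointer plus i, wrapped to 3 bits.

```c
/* Hypothetical helper: number of the register that st(i) names. */
unsigned physical_register(unsigned top, unsigned i)
{
    return (top + i) & 7u;
}
```

With Stackp 7, st(0) is register 7 and st(1) is register 0; with Stackp 2, st(6) is register 0.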
Saving the FPU Environment
There are some instructions that let you save the environment of the FPU and then load it again. This is useful for exception handlers and for almost nothing else. They are:
fstenv fpu_env ;this saves the environment.
fldenv fpu_env ;restore it. ...
Fpu_env is a variable whose layout depends on the mode we are currently in (you should know what mode you are in).
32 bits protected mode format:
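Offset Size Contents
------ ---- --------
0      2    Control word
2      2    Reserved
4      2    Status word
6      2    Reserved
8      2    Tag word
10     2    Reserved
12     4    FPU instruction pointer offset
16     2    FPU instruction pointer selector
18     2    Opcode (bits 0-10; bits 11-15 are 0)
20     4    FPU operand pointer offset
24     2    FPU operand pointer selector
26     2    Reserved
This one is 28 bytes long.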
16 bits real mode format:
2 Control word
2 Status word
2 Tag word
I don't list more because I don't think you'll need the rest of the structure; I'll just remind you that it is 14 bytes long.
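If you keep the 16 bits real mode environment in a byte buffer, the three words can be read and written like this. It's only a sketch in C: the names are invented, and the buffer is assumed to be the 14 bytes that fstenv wrote (little-endian, like everything on the x86).

```c
#include <stdint.h>

/* Byte offsets of the three words in the 16 bits real mode format. */
enum { ENV16_CONTROL = 0, ENV16_STATUS = 2, ENV16_TAG = 4, ENV16_SIZE = 14 };

/* Hypothetical helper: read the 16-bit word at 'offset'. */
uint16_t env16_read(const unsigned char *env, unsigned offset)
{
    return (uint16_t)(env[offset] | (env[offset + 1] << 8));
}

/* Hypothetical helper: write the 16-bit word at 'offset'. */
void env16_write(unsigned char *env, unsigned offset, uint16_t value)
{
    env[offset]     = (unsigned char)(value & 0xFFu);
    env[offset + 1] = (unsigned char)(value >> 8);
}
```

Reading at ENV16_STATUS is the C version of 'mov ax,word ptr fpu_env+2'.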
Data Types
This section deals with the data types that are stored in memory (however, the extended real is also the internal data type of the FPU). I'll not list the BCD format here, because I never used it, and I doubt that you'll ever use it. However, if you feel that I should talk about it, email me.
Name  Number Sign Range
----  ------ ---- -----
Word  0-14   15   -32,768 to 32,767
Short 0-30   31   -2.14 * 10^9 to 2.14 * 10^9
Long  0-62   63   -9.22 * 10^18 to 9.22 * 10^18
Name     Sign Exponent Fraction Range
----     ---- -------- -------- -----
Single   31   23-30    0-22     1.18 * 10^-38 to 3.40 * 10^38
Double   63   52-62    0-51     2.23 * 10^-308 to 1.79 * 10^308
Extended 79   64-78    0-62     3.37 * 10^-4932 to 1.18 * 10^4932
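Bit 63 of an Extended is the integer bit of the number, which Single and Double don't store. It is 1 for normal numbers and 0 for denormalized numbers (and zero), so it lets you identify a denormalized number.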
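As an example of the Extended layout, take the value from the 1/3 table: 3FFDh in the top 16 bits and AAAA AAAA AAAA AAABh in the low 64. This C sketch, with made-up names, pulls out the sign, the exponent (stored with a bias of 16383) and bit 63.

```c
#include <stdint.h>

/* Hypothetical helpers: 'se' is the top 16 bits (bits 64-79),
   'significand' is the low 64 bits (bits 0-63). */
unsigned ext_sign(uint16_t se)              { return se >> 15; }
int      ext_unbiased_exponent(uint16_t se) { return (int)(se & 0x7FFFu) - 16383; }
unsigned ext_integer_bit(uint64_t significand)
{
    return (unsigned)(significand >> 63);
}
```

For 3FFDh the exponent is -2, so the number is 1.0101...b * 2^-2, and the integer bit of AAAA AAAA AAAA AAABh is 1.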
There are other topics such as:
- FPU instruction pointer
- FPU operand pointer
- Exceptions handling
Most of them are only useful for debuggers or exception handlers, because they contain the data and the last instruction executed. But I doubt I'll have to do a debugger or an exception handler someday, and I also doubt you'll do them.
Hope you enjoyed this article; feedback is welcome. If you feel that something is missing in this article, just let me know and I'll release a new version.
Hello to all the people that email me!
"Studying is like reading yesterday's newspaper.
Working is writing tomorrow's newspaper."
- DAriO PhONG, Barcelona 6/2/1999