Floating point numbers' bottlenecks

by Ilya Palopezhentsev (iliks)

As you all know, a computer has two common types of numbers: integer and real. Obviously they have different representations inside the machine. At first glance this doesn't matter much for a programmer, but that impression is a mistake. These two types of numbers have one huge difference: integer operations are absolutely precise while real ones are not. There are two possible representations of real numbers - with a fixed or with a floating point. In this article I will cover only floating point numbers. They are better suited to real-life calculations than fixed point numbers, whose only advantage is the speed of handling.

I will try to show you eight facts which are often forgotten or neglected but which are very important if you want to get predictable and precise results. Of course precise results are a must for scientific and engineering calculations, which is why they interest me, but who knows - games and demos nowadays have a great deal of heavy math in them, so this applies to them too.

Just one note: throughout the article I use the terms "meaningful signs" and "significant digits" interchangeably. They tell us how many digits stand after the leading zeroes:

0.0010 has two significant digits (1 and 0)
0.001 has one significant digit
345 has three significant digits (there are no leading zeroes)

Notice that 0.0010 isn't equal to 0.001, because in the first case we know for sure that the digit after the 1 is 0, while in the second case we don't know anything about it - perhaps 0.001 was obtained by rounding 0.0008? Or 0.0005? Who knows?

So, let us look at some common bottlenecks of floating point numbers which should be taken into account:

I. Inaccuracy of floating point numbers

For a start, something simple but not very obvious: not every real number can be represented in a computer exactly. Most numbers are stored with a loss of precision, and this can give interesting results.

For example, the number 0.7 cannot be represented exactly. If we write the following program:

      a = 0.7
      print *,a

(this is Fortran)

it will show us 0.6999999.
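The same effect is easy to reproduce in C. A minimal sketch (the exact digits depend on the compiler and platform, but on a typical IEEE 754 machine the stored value is slightly below 0.7):

      #include <stdio.h>

      int main(void)
      {
          float a = 0.7f;
          /* ask for more digits than a float really holds
             to expose the error; typically prints 0.699999988 */
          printf("%.9f\n", a);
          return 0;
      }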

II. Unevenness of floating point numbers

Floating point numbers aren't equally spaced: their density is not uniform over the whole range of representable values.

For example, let's consider two pairs of numbers:
First pair: 0 and 1
Second pair: 1000 and 1001

This fact may seem unbelievable, but it's true: in a computer there are more floating point numbers between 0 and 1 than between 1000 and 1001, even though the distance within each pair is the same. So the greater the magnitude is, the greater the precision loss is.

While around 0 you can easily represent numbers with a lot of digits after the decimal point, you can't do this around big numbers.
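You can measure this spacing directly with the C99 function nextafterf() from math.h, which returns the next representable float after its first argument. A minimal sketch (the printed gaps assume ordinary IEEE 754 single precision):

      #include <stdio.h>
      #include <math.h>

      int main(void)
      {
          /* distance to the next representable float above each value */
          printf("gap after 1.0f:    %g\n", nextafterf(1.0f, 2.0f) - 1.0f);
          printf("gap after 1000.0f: %g\n", nextafterf(1000.0f, 2000.0f) - 1000.0f);
          return 0;
      }

On a typical machine this prints about 1.2e-07 for the first gap and about 6.1e-05 for the second: the floats around 1000 are roughly five hundred times sparser than the floats around 1.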

Why? The next fact gives the answer.

III. 7/15 significant digits

32-bit floating point numbers ("float" in C, "single" in Pascal, "real" in Fortran) hold only about 7 significant digits. If you print a float variable in some programming language you will probably get around 10 digits on screen; the last few of them are incorrect and you should drop them.

(64-bit floating point numbers ("double" in C and Pascal, "double precision" in Fortran) hold about 15 significant digits.)

Floating point numbers are represented in exponential form:

1.000000E+00

This is 1. You see, if the magnitude is small, the precision can reach all 7 digits (i.e. we could fill these zeroes with some digits and they would stay).

But for numbers with a large magnitude, for example

100000.0E+00

things are different.

You see, we have only one position left after the decimal point, because the number holds only 7 significant positions and 6 of them are taken by the integer part. So there are only about 10 representable numbers between 100000 and 100001:

100000.0
100000.1
100000.2
100000.3
...
100000.9

You see, the smaller the integer part is, the more digits after the decimal point we can represent in a computer.
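A small C sketch makes this visible (FLT_DIG comes from float.h; the exact printed values assume IEEE 754 single precision):

      #include <stdio.h>
      #include <float.h>

      int main(void)
      {
          /* number of decimal digits a float is guaranteed to keep */
          printf("FLT_DIG = %d\n", FLT_DIG);   /* typically prints 6 */

          /* near 100000 barely one decimal place survives: we ask for
             100000.12 and get the nearest representable float instead */
          printf("%.4f\n", 100000.12f);        /* ~100000.1172 */
          return 0;
      }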

IV. Comparison of floating point numbers

Floating point numbers shouldn't be compared with the usual programming language construction "==". Can you imagine?

The correct form of an equality comparison:

        instead of
               if (x==y)
        you should write
               if (abs(x-y)<epsilon*abs(max(x,y)))

Where "epsilon" is a small constant which determines the desired precision of comparison, e.g. epsilon = 0.0001 for a comparison up to the 5th digit.

V. Precision loss

Each operation on a floating point number "eats" a certain amount of precision from it. In quite a rough form it can be expressed this way: each operation on a floating point number eats one correct digit from the result.

For example, you've entered a number with 7 correct digits. After a single operation on it the result will have only 6 correct digits, and so on.

This isn't always so cruel but...

The moral is the following:

MINIMIZE THE NUMBER OF CALCULATIONS WITH YOUR DATA!
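To see the rule in action, here is a classic toy experiment in C (an illustrative sketch of my own, not from the Fortran task below): take 2, apply sqrtf() thirty times, then square the result thirty times. Mathematically nothing should change, but every operation rounds, and the squarings amplify the accumulated error:

      #include <stdio.h>
      #include <math.h>

      int main(void)
      {
          float x = 2.0f;
          int i;

          for (i = 0; i < 30; i++) x = sqrtf(x);  /* 30 square roots... */
          for (i = 0; i < 30; i++) x = x * x;     /* ...then 30 squarings */

          /* mathematically x is still 2; on a typical IEEE 754 machine
             this prints 1.000000 - the precision has been eaten */
          printf("%f\n", x);
          return 0;
      }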

From this, point VI follows:

VI. Iterative calculations

Quite a widespread error. I confess, I have made it in the past too.

Imagine we have two boundaries, a and b, and a step h. We must iterate from a to b with this step.

How does an inexperienced programmer do this? The answer:

    x = a;
    while (x <= b)
    {
       //calculate something from x
       x += h;
    }

This is TERRIBLY wrong! Look at point V for the reason. You see, after each addition of h, x loses one correct digit! What happens if the number of iterations is 100000? We get a huge precision loss. I repeat once again, in reality this rule isn't so cruel, i.e. we won't lose x completely after 7 operations on it, but we should always remember that precision really does go away with each action.

How to do it the right way? The answer:

    for i := 0 to N do
    begin
      x := a + i*h;
      //calculate something from x
    end;

Here we perform only two operations on the original data: a multiplication and an addition. And at every step we apply them to the original values, not to the values obtained at previous steps, so the round-off error doesn't accumulate.
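A quick C sketch of the difference (the exact numbers depend on the platform, but the drift of the accumulated variant is reliably visible):

      #include <stdio.h>

      int main(void)
      {
          const float a = 0.0f, h = 0.001f;
          const int   N = 1000000;       /* so a + N*h should be 1000 */
          float x = a;
          int i;

          for (i = 0; i < N; i++)
              x += h;                    /* "stupid": error accumulates */

          printf("accumulated: %f\n", x);                 /* visibly off */
          printf("recomputed : %f\n", a + (float)N * h);  /* much closer */
          return 0;
      }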

Oh yeah, assembler freaks will cry: "we get a multiplication on each iteration instead of a single addition!" Yes, this is slower, but it is more correct.

You see, things that are fast are not always precise.

Want a concrete example?

Consider the integral

         pi/2
         /                   |pi/2
         \ sin(x)dx = -cos(x)|     = 0 - (-1) = 1
         /                   |0
         0

This is the precise result. Now let's calculate it numerically using the trapezium rule.

It looks like:

b
/
\ f(x)dx ~ h*( f(a)/2 + f(b)/2 + f1 + f2 + ... + f(N-1) )
/
a

where f1, f2, ... are f(x1), f(x2), ... and the points are equally
spaced:

      x(i+1) - x(i) = h

i.e. we split the integration interval into small subranges, their
number being N = (b-a)/h.

I will just give you the Fortran procedures from my current university assignment in the subject "Computational programming":

1) "Stupid" implementation, each iteration we add h to d, it holds the current x.

C    a and b are the limits of integration, N is the number of
C    subranges we divide this interval into; the integrand f(x)
C    (here sin(x)) is assumed to be defined elsewhere
       REAL function tr_int(a,b,N)
         h = (b-a)/N
         c = (f(a)+f(b))/2.0
         d = a+h
         DO 1 i=1,N-1
          c = c+f(d)
          d = d+h
 1       CONTINUE
         tr_int = c*h
       end

2) "Correct" but slower implementation:

C    the same routine, but x is recomputed from a on every step
       REAL function tr_int(a,b,N)
         h = (b-a)/N
         c = (f(a)+f(b))/2.0
         DO 1 i=1,N-1
          c = c+f(a+i*h)
 1       CONTINUE
         tr_int = c*h
       end

Now let us test these two subroutines.

As I said earlier, let's choose a=0, b=pi/2 and look at the routines' output for various values of N. I didn't write down the concrete integral values; instead I printed the absolute error between the correct integral value, 1, and the value calculated by the procedure. Of course, the smaller this error is, the more correct our procedure is.

So, the comparison:

-------------------------------------------
|   N   |   "stupid"     |   "correct"    |
|-----------------------------------------|
| 10    | 2.05703103E-03 | 2.05703103E-03 |
| 100   | 2.02221490E-05 | 2.05217548E-05 |
| 1000  | 1.25641850E-06 | 1.04455262E-06 |
| 10000 | 8.21720641E-06 | 1.29347461E-06 |
-------------------------------------------

How do you like the result?

You see, when N gets high, the absolute error of the "stupid" routine becomes higher than that of the "correct" one. The effect would be even clearer at higher values of N.

VII. Make an action Towards the subtraction

Subtraction is the most imprecise of all the arithmetic operations in a computer. You should avoid it whenever possible; especially beware of subtracting very close numbers, because due to round-off error the difference could turn to 0! (And if this difference were in a denominator, you would get a "division by zero" error.)

I'm retyping an example from the book "Numerical Methods" by E. Volkov:

"We should calculate sqrt(543)-sqrt(540) where 543 and 540 are precise numbers. (i.e. all of their digits are significant) We have

        sqrt(543) = 23.30....

        sqrt(540) = 23.23....

Let's round them to three significant digits (because the original data, 543 and 540, had only 3 significant digits):

sqrt(543) - sqrt(540) ~ 23.3 - 23.2 = 0.1

We have only one significant digit in the result.

Let's get rid of the subtraction:

sqrt(543) - sqrt(540) = 3 / (sqrt(543)+sqrt(540))

(This was done by multiplying and dividing by sqrt(543)+sqrt(540).)

This equals 3/(23.3+23.2) = 3/46.5 = 0.0645 - we still have 3 significant digits in the result."
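The same effect is easy to check in single precision C; a minimal sketch (the double-precision value serves as the reference):

      #include <stdio.h>
      #include <math.h>

      int main(void)
      {
          /* naive form: subtraction of two very close numbers */
          float naive  = sqrtf(543.0f) - sqrtf(540.0f);

          /* rearranged form: the subtraction is gone */
          float better = 3.0f / (sqrtf(543.0f) + sqrtf(540.0f));

          printf("naive  = %.9f\n", naive);
          printf("better = %.9f\n", better);
          printf("ref    = %.9f\n", sqrt(543.0) - sqrt(540.0));
          return 0;
      }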

VIII. Don't mix up both low and high, Big eats small without a sigh

The last tip: avoid operations where one operand is a small number and the other is big. The small operand will be absorbed by the big one. The explanation lies in the same 7 significant digits of the mantissa of floating point numbers:

         100000.00E+00
        +
         1E-09
         --------------

This could be rewritten as follows:

           100000.00
          +
           0.000000001
           -------------
           100000.000000001

But a float has only 7 significant digits in the mantissa! So the result will stay 100000.00E+00. This can lead to many errors.
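A one-line check in C (assuming ordinary single precision; 1E-09 is far below one unit in the last place of 100000):

      #include <stdio.h>

      int main(void)
      {
          float big   = 100000.0f;
          float small = 1e-9f;

          /* the small addend vanishes completely */
          printf("%d\n", big + small == big);   /* prints 1 */
          return 0;
      }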

IX. Conclusion

I have shown you eight important facts about floating point arithmetic. I hope they will make your programs clearer and your results more precise. Now you see that you can't deal with such numbers the same way you do with integers.

Please mail me your comments and questions if you have something to say. It's very hard to stay motivated to write without any feedback. I tried to present everything in an entertaining and easy way, so why not tell me whether I've succeeded?

mail: iliks@pochtamt.ru or iliks@mail.ru