MATH/APPM 4650
Class notes, January 18
Section 1.2: Roundoff errors
Whenever we do numerical computations, we use floating point numbers.
These are numbers expressible in scientific notation as
0.d1d2d3...dk × 10n.
For example:
- π = 0.3141592653589793 × 101 to sixteen digits
- Avogadro's number: NA = 0.602214179 × 1024 inverse moles (further digits are not significant, since this is experimentally measured)
- the speed of light: c = 0.299792458 × 109 meters per second (exact by definition of meters)
- Planck's constant: h = 0.662606896 × 10-33 Joule-seconds (further digits not significant)
k is the precision of your computational device, the number of digits that can be expressed.
The IEEE standard is k = 16.
- MATLAB conforms to this standard and only uses 16 digits.
- Maple can handle a very large number of digits: on my computer the maximum is 268435448. The default is 10.
- Mathematica can in principle handle any number of digits (though of course you have to set it to something).
Often the number of digits to use in a problem is dictated by the experimental data you're given. If you have a bunch of numbers like 2.58 and 1.11, it's silly to use any more than three significant digits for any kind of manipulations.
n is the exponent, the order of magnitude of the number.
The IEEE standard is roughly -308 ≤ n ≤ 308.
- Again, MATLAB conforms to this standard: the largest number you can have is roughly 10308 and the smallest is roughly 10-308.
- Maple allows numbers up to 102147483646.
- Mathematica allows numbers up to 10646456887.
Any number larger than this maximum will result in overflow error and may crash your program and even your computer.
Any number smaller than this minimum will result in underflow; although it won't usually crash a program, it will be indistinguishable from zero.
Thought experiment: A piece of paper is about h = 0.1 mm thick. Fold it repeatedly.
- First fold: thickness 2h
- Second fold: thickness 4h
- Third fold: thickness 8h
- ...
- nth fold is 2nh.
When n = 50, 250 = 1.1 × 1015, so that the thickness of the paper after 50 folds is a little more than the distance from Earth to Mars. After 100 folds, the thickness is the estimated size of the universe.
Similar problem: Ask for a penny from someone one day, two pennies the next, four pennies the next, etc. On the last day of the month, you'll receive 230 × $0.01 = $10 million.
So for example, if you run a program in MATLAB that continually doubles something, it only takes 1000 steps to reach an overflow error.
It's surprisingly easy to get overflow and underflow!
For clarity, we're going to keep the number of decimal places small.
But realize the same issues will crop up even with lots of decimal places.
They might just take longer to get noticed.
No matter how many digits you use, you will always have to worry about roundoff error.
For example, suppose all our numbers have two digits.
If any computation produces a number x with more than two digits, we round off (up if the last digit is 5 or more, down if the last digit is 4 or less).
We call this the floating point representation of the number and denote it by fl(x).
For example, fl(π) = 0.31 × 101.
fl(17.22) = 0.17 × 102
fl(0.00285) = 0.29 × 10-2
fl(0.0899) = 0.90 × 10-1 (observe the last digit is significant!)
Definitions:
The absolute roundoff error is |x - fl(x)|.
The relative roundoff error is |x - fl(x)| / |x|.
The size of the absolute roundoff error depends on the magnitude of x.
The relative roundoff error for a particular number can never be more than 0.5 × 10-k+1.
Performing computations gives roundoff error.
When you ask a computer to evaluate an addition x + y,
it actually computes
fl( fl(y) + fl(y) ).
These may not be the same thing!
For example, x = 0.21629, y = 0.03615. Then x + y = 0.25244.
fl(x) = 0.22; fl(y) = 0.36 × 10-1
fl( fl(x) + fl(y) = fl(0.22 + 0.036) = fl(0.256) = 0.26.
So fl(x + y) ≠ fl( fl(x) + fl(y) ).
For two significant digits, if x = 0.p1p2p3... with p1 ≠ 0, the worst error we can have in |x - fl(x)| is 0.005.
Always assume the worst! The worst error we can get in an addition of two numbers of the same magnitude (i.e., same exponent) is 0.005 + 0.005 = 0.01. So in adding two numbers of the same magnitude, the last digit may be off by 1, but no more.
If x = 0.p1p2p3... × 10m, then the worst error is |x - fl(x)| is 0.005 × 10m.
For example, if x = 2,050, then fl(x) = 0.21 × 104, and |x - fl(x)| = 50 = 0.50 × 102.
If x and y have different magnitudes (e.g., x = 0.21 × 104 and y = 0.50 × 102, then fl(x + y) = 0.22 × 104, with error 0.50 × 102. The worst error you can possibly get in an addition is determined by the larger exponent in the numbers you're adding.
General rule: if x = 0.p1p2...pk × 10m and if y = 0.q1q2...qk × 10n, then the absolute roundoff error in adding x and y is at most 10max(m,n) - k.
It's harder to come up with a general rule for the relative roundoff error for addition, but roughly speaking it's about 10-k.

Green: Plot of absolute
roundoff error in
x + y with
1 digit of precision.

Blue: Plot of relative roundoff error in x + y with 1 digit of precision.
Multiplication is usually pretty reliable.
Below are graphs of the error in multiplying two numbers.
Observe that the absolute error is pretty good for small values and bad for large values.
The relative error is just the opposite.

Green: Plot of absolute
roundoff error in
x × y with
1 digit of precision.

Blue: Plot of relative roundoff error in x × y with 1 digit of precision.
When dealing with addition and multiplication, roundoff error is usually pretty small, but very hard to predict exactly.
We face huge problems when dealing with roundoff error due to subtracting nearly equal numbers.
Example: The limit of
n*[ sqrt(1+1/n) - 1 ]
as n goes to infinity is 1/2 (from calculus).
What happens when we try to compute this numerically using five digits and very large values of n?
|
25
|
50
|
100
|
200
|
400
|
800
|
1600
|
3200
|
6400
|
12800
|
|
0.49500
|
0.50000
|
0.50000
|
0.50000
|
0.48000
|
0.48000
|
0.48000
|
0.32000
|
0.64000
|
0.00000
|
Why does this happen?
When n = 3200, we have 1/n = 0.00031250, so that fl(1+1/n) = 1.0003.
Then sqrt(1.0003) = 1.000149989, so that fl(sqrt(1.0003) = 1.0001.
Then 1.0001 - 1.0000 = 0.0001.
Then 3200 × 0.0001 = 0.3200.
When n = 12800, it's even worse. Then fl(1+1/n) = 1.0000 to five places.
Instead we should avoid this by rationalizing.
n*[ sqrt(1+1/n) - 1 ] = 1/[sqrt(1+1/n+1].
This is much easier to compute.
|
25
|
50
|
100
|
200
|
400
|
800
|
1600
|
3200
|
6400
|
12800
|
|
0.49510
|
0.49751
|
0.49875
|
0.49938
|
0.49970
|
0.49985
|
0.49992
|
0.49998
|
0.49998
|
0.50000
|
The numbers remain reliable all the way up.
The same sort of thing happens with the quadratic equation.
a x2 + bx + c = 0.
x = [ -b +- sqrt(b2 - 4ac) ] / [2a].
If b is large and positive, then the term with the plus sign involves subtracting two nearly equal numbers.
(The term with the minus sign is fine.)
Example: four significant digits.
x2 + 62.10x + 1 = 0.
x = [-62.10 +- sqrt(62.102 - 4)] / 2.
Exact solutions (to four places) are -0.1611 and -62.08.
62.102 = 3856.41 which rounds to 3856.
So 62.102 - 4 = 3852.
sqrt(3852) = 62.06448258, which rounds to 62.06.
(62.06-62.10)/2 = -0.2000. (-62.06-62.10)/2 = -62.08.
First root is WAY off, second root is correct to tolerance.
Rationalize:
x = [-2c] / [-+ sqrt(b2-4ac) + b].
Now compute sqrt(62.102-4) = 62.06, so
-2/(62.06+62.10) = -0.1611 -2/(-62.06+62.10) = -50.00.
First root is correct to tolerance, second root is WAY off.
Hence neither form of the solution gives a good algorithm for solving the quadratic equation.
We need to use conditioning depending on the data to get both roots.
Lesson: Exact formulas can sometimes be worse than approximate techniques!
Lesson: Mathematically equivalent formulas are not numerically equivalent!
In general, roundoff error is unpredictable.
Even a few additions and multiplications can accumulate roundoff error.
Therefore one should always minimize the number of computations one does.
For example,
a ( b + c ) = ab + ac mathematically.
However, the left side involves two computations, while the right side involves three.
Roundoff error is more likely to accumulate in the right side.
So we should use the left side for numerics.
When evaluating a polynomial, the same things arise.
Instead of computing ax2 + bx+c directly (which involves five operations),
we should compute x(ax+b) + c (which involves four operations).
The difference seems minor, but it adds up.
See the results of the Maple worksheet.
General rules for roundoff:
- Adding and multiplying will produce small roundoff errors.
- Subtracting numbers that are nearly equal gives HUGE roundoff errors.
- Dividing by small numbers magnifies errors GREATLY.
- Roundoff error is hard to predict. Always assume the worst.
- The more computations, the more roundoff errors. The fewer the better!
- Order of operations always matters numerically. Equivalent formulas often produce different numerical results.