In this section we review some general aspects of floating-point numbers. See the presentation by J. D. Darcy and other FP references for more in-depth discussions.
A floating-point number is represented in binary as

    ±b_0.b_1b_2b_3...b_{n-1} × 2^{exponent}

where b_i represents the i-th bit of the n bits in the significand (also called the mantissa). In addition, there is a bit to indicate the sign. A floating-point value is calculated as

    (-1)^s · (b_0 + b_1·2^{-1} + b_2·2^{-2} + b_3·2^{-3} + ... + b_{n-1}·2^{-(n-1)}) · 2^{exponent}

where s is the sign bit.
For fractional numbers and for very large or very
small numbers, advanced processors provide floating point representations.
In the bit representation for the Java float type:

    1 bit | 8 bits   | 23 bits
    Sign  | Exponent | Significand
and for the double type:

    1 bit | 11 bits  | 52 bits
    Sign  | Exponent | Significand
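As a rough sketch of how these fields can be examined (the class name and example values here are my own, not from the original text), the library methods Float.floatToIntBits and Double.doubleToLongBits return the raw bit patterns, from which the sign, exponent, and significand can be masked out using the widths in the tables above:

public class BitLayoutDemo {
    public static void main(String[] args) {
        float f = -6.25f;
        int bits = Float.floatToIntBits(f);        // raw 32-bit pattern

        int sign        = bits >>> 31;             // 1 bit
        int exponent    = (bits >>> 23) & 0xFF;    // 8 bits, stored with a bias of 127
        int significand = bits & 0x7FFFFF;         // 23 stored significand bits

        System.out.println("sign        = " + sign);
        System.out.println("exponent    = " + exponent + " (unbiased: " + (exponent - 127) + ")");
        System.out.println("significand = " + Integer.toBinaryString(significand));

        // The same idea for double: 1 sign bit, 11 exponent bits (bias 1023), 52 significand bits.
        long dbits = Double.doubleToLongBits(-6.25);
        System.out.println("double sign/exponent/significand = "
                + (dbits >>> 63) + " / "
                + ((dbits >>> 52) & 0x7FF) + " / "
                + Long.toBinaryString(dbits & 0xFFFFFFFFFFFFFL));
    }
}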
Floating point numbers on computers involve a number of complications with which processor and language designers must deal:
 Approximations
The limited number of places in the significand means that only a finite number of fractional values can be represented exactly. Similarly, the finite width of the exponent limits how large or small the represented numbers can be.
 Roundoff
Arithmetic operations will often result in the need to round
off the fractional values. A roundoff (or truncation)
algorithm must be chosen by the designer of the language. Roundoffs
can have a significant impact on a long calculation as the errors
accumulate.
 Overflows/Underflows
Similarly, a calculation may result in a value that is smaller or larger than the floating point type can represent. Again, the language designer must select a strategy for how to handle such situations.
 Decimal-Binary Conversion
The computer represents numbers in base 2. This can result in
loss of precision since often a binary fraction cannot exactly
represent a given finite decimal fraction (0.1 for example).
All finite binary fractions, however, can be converted to finite
decimal fractions.
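As a small sketch of this decimal-binary point (the class name and values are illustrative, not from the text), the BigDecimal(double) constructor converts the binary fraction actually stored in a double exactly into a finite decimal fraction, making the hidden extra digits visible:

import java.math.BigDecimal;

public class DecimalBinaryDemo {
    public static void main(String[] args) {
        // 0.1 has no exact binary representation, so the double stores only
        // the nearest representable binary fraction.
        // new BigDecimal(double) converts that binary fraction exactly into a
        // finite decimal fraction, so the extra digits become visible.
        System.out.println(new BigDecimal(0.1));   // prints a value slightly above 0.1
        System.out.println(new BigDecimal(0.5));   // 0.5 is a power of 2, so it is exact
    }
}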
Java & Floating Point
To handle these FP issues, Java follows the IEEE 754 standard in most cases. In this standard:
 Roundoff takes the binary
value nearest to the exact (or higher precision intermediate)
value. If two binary values are equally close, then choose the
even value; that is, the one with its last bit equal to 0.
 Overflows are represented by positive or negative infinity values, while underflows round gradually toward zero. Similarly, undefined results, such as 0/0, are represented by a Not-a-Number (NaN) value.
No exceptions or error messages are produced in any of these cases.
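A minimal sketch (my own example, not from the text) of how these rules appear in Java; none of these operations throws an exception:

public class OverflowNanDemo {
    public static void main(String[] args) {
        double overflow  = Double.MAX_VALUE * 2.0;   // too large for double: becomes Infinity
        double divByZero = 1.0 / 0.0;                // positive infinity
        double undefined = 0.0 / 0.0;                // undefined result: NaN

        System.out.println(overflow);                // Infinity
        System.out.println(divByZero);               // Infinity
        System.out.println(undefined);               // NaN
        System.out.println(Double.isNaN(undefined)); // true
        System.out.println(undefined == undefined);  // false: NaN never compares equal, even to itself
    }
}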
Note that even simple calculations with FP can provide
surprising results. For example, the following code
float f = 0.0f;
for (int i = 1; i <= 10; i++) {
    f += 0.1f;
}
does not result in exactly f
= 1.0 (even if double
is chosen for f) because, as mentioned above, 0.1
is not exact in binary format.
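To see this directly, one can print the accumulated value with extra digits and compare it to 1.0; this is only a sketch, and the exact digits shown may vary with the formatting used:

float f = 0.0f;
for (int i = 1; i <= 10; i++) {
    f += 0.1f;
}
System.out.printf("f = %.9f%n", f);   // prints a value slightly different from 1.000000000
System.out.println(f == 1.0f);        // false: the accumulated sum is not exactly 1.0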
For similar reasons,
 Avoid equality (a == b) tests between two floating-point variables.
 Instead, test with <, <=, >=, >, or compare within a tolerance as in the sketch after this list.
 However, in some situations it may be sensible to test for equality to 0.0 to avoid divide-by-zero errors.
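One common way to carry out such comparisons, shown only as a sketch (the helper name approxEqual and the tolerance 1e-9 are arbitrary choices), is to test whether two values differ by less than a small tolerance suited to the problem:

public class ToleranceCompare {
    // Treat a and b as equal if they differ by less than eps.
    // The appropriate eps depends on the scale of the values involved.
    static boolean approxEqual(double a, double b, double eps) {
        return Math.abs(a - b) < eps;
    }

    public static void main(String[] args) {
        double sum = 0.0;
        for (int i = 1; i <= 10; i++) {
            sum += 0.1;
        }
        System.out.println(sum == 1.0);                  // false
        System.out.println(approxEqual(sum, 1.0, 1e-9)); // true
    }
}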
In Java the float representation has a 24-bit significand (23 bits stored plus an implied leading bit) and double has a 53-bit significand (52 bits stored). This means that float gives 6 to 9 digits of decimal precision while double gives 15 to 17 digits of decimal precision.
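A small sketch (my own example) of this difference in decimal precision; the default printing of each type shows roughly the number of significant digits quoted above:

public class PrecisionDemo {
    public static void main(String[] args) {
        float  oneThirdF = 1.0f / 3.0f;
        double oneThirdD = 1.0  / 3.0;

        System.out.println(oneThirdF);       // roughly 8 significant digits for float
        System.out.println(oneThirdD);       // roughly 16 significant digits for double

        System.out.println((float) Math.PI); // pi rounded to float precision
        System.out.println(Math.PI);         // pi at double precision
    }
}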
In general, it is far safer to do calculations in the double type. This helps to reduce the roundoff errors that lower the precision of intermediate calculations. (You can always cast the final value to float if that is a more convenient size, such as for I/O or storage.)
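One way to apply this advice, sketched here with made-up data, is to accumulate intermediate results in a double and cast only the final answer down to float:

public class DoubleAccumulatorDemo {
    public static void main(String[] args) {
        float[] measurements = {0.1f, 0.2f, 0.3f, 0.4f, 0.5f};  // hypothetical data

        double sum = 0.0;              // do the intermediate arithmetic in double
        for (float m : measurements) {
            sum += m;
        }

        float result = (float) sum;    // cast only the final value down to float
        System.out.println(result);
    }
}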
Remember the difference between precision and accuracy:
 Precision: how fine a distinction can be made between two close values.
 Accuracy: how close the value is to the correct value.