Chapter 1. Introduction
Figure 1.2: The double precision floating point format
1.4.4 Basic IEEE 754 formats
The standard defines five basic formats, which differ in the lengths of the bit strings
used for the significand and the exponent. Two of these formats use a decimal
representation and are not described here. The three binary formats are encoded
with 32, 64, and 128 bits, respectively.
Table 1.1: The IEEE 754 floating point format
Name        Common name          Base  Digits  E min   E max   Decimal digits  Decimal E max
binary16    Half precision       2     10+1    -14     +15     3.31            4.51
binary32    Single precision     2     23+1    -126    +127    7.22            38.23
binary64    Double precision     2     52+1    -1022   +1023   15.95           307.95
binary128   Quadruple precision  2     112+1   -16382  +16383  34.02           4931.77
decimal32                        10    7       -95     +96     7               96
decimal64                        10    16      -383    +384    16              384
decimal128                       10    34      -6143   +6144   34              6144
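
For the binary formats, the "Decimal digits" and "Decimal E max" columns of Table 1.1 can be approximated from the "Digits" and "E max" columns by multiplying by log10(2). The following Python sketch illustrates this relationship. The function and variable names are ours and are used for illustration only, and the printed values may differ from the table in the last digit because of rounding.

```python
import math

# (significand digits including the implicit leading bit, E max)
# for each binary format, copied from Table 1.1.
BINARY_FORMATS = {
    "binary16": (11, 15),
    "binary32": (24, 127),
    "binary64": (53, 1023),
    "binary128": (113, 16383),
}


def decimal_digits(precision_bits):
    """Approximate number of decimal digits carried by a binary significand."""
    return precision_bits * math.log10(2)


def decimal_e_max(e_max):
    """Approximate decimal exponent that corresponds to the binary exponent e_max."""
    return e_max * math.log10(2)


# Print one row per binary format: name, decimal digits, decimal E max.
for name, (p, e_max) in BINARY_FORMATS.items():
    print(f"{name:10s} {decimal_digits(p):7.2f} {decimal_e_max(e_max):9.2f}")
```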
1.5 The IEEE 754 double precision floating point format
Since our implementation uses a double precision floating point arithmetic unit, we
explain the double precision format in more detail. The double precision format
uses 64 bits in total to represent a floating point number. Figure 1.2 displays the
format of the number.
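
As a minimal sketch of this layout, the following Python code splits a 64-bit double into its three fields. It assumes the usual binary64 arrangement of 1 sign bit, 11 exponent bits and 52 stored significand bits (the "52+1" entry in Table 1.1 counts the implicit leading bit, and 64 - 1 - 52 = 11). The helper name decompose_double is ours and is not part of any library.

```python
import struct


def decompose_double(x):
    """Split a double into its (sign, exponent, fraction) bit fields."""
    # Reinterpret the 8 bytes of the double as one 64-bit unsigned integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                    # 1 bit: the most significant bit
    exponent = (bits >> 52) & 0x7FF      # 11 bits, as stored in the encoding
    fraction = bits & ((1 << 52) - 1)    # 52 explicitly stored significand bits
    return sign, exponent, fraction


sign, exponent, fraction = decompose_double(-1.5)
print(sign, exponent, hex(fraction))
```

The three fields together account for all 64 bits, so the original bit pattern can be rebuilt as (sign << 63) | (exponent << 52) | fraction.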
1.5.1 The sign bit
The first of the 64 bits, that is, the most significant bit, represents the sign of
the number. Zero denotes a positive number and one denotes a negative
number.
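
The role of the sign bit can be checked with a short Python sketch. The helpers sign_bit and flip_sign are hypothetical names introduced here for illustration. The sketch reads bit 63 of the encoding and shows that toggling only this bit negates the value while leaving the magnitude unchanged.

```python
import struct


def sign_bit(x):
    """Return the most significant bit of the 64-bit encoding of x."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return bits >> 63


def flip_sign(x):
    """Negate x by toggling only the sign bit of its encoding."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    (negated,) = struct.unpack(">d", struct.pack(">Q", bits ^ (1 << 63)))
    return negated


print(sign_bit(2.5), sign_bit(-2.5), flip_sign(3.0))
```

Because the sign is stored separately from the magnitude, zero also has two encodings: sign_bit(0.0) is 0 and sign_bit(-0.0) is 1.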