Chapter 1. Introduction

Figure 1.2: The double precision floating point format

1.4.4 Basic IEEE 754 formats

The standard defines five basic formats, which differ in the lengths of their significand and exponent fields. Two of these formats use a decimal representation and are not described here. The three binary formats are encoded in 32, 64, and 128 bits, respectively.

Table 1.1: The IEEE 754 floating point formats

Name        Common name          Base  Digits  E min   E max   Decimal digits  Decimal E max
binary16    Half precision       2     10+1    -14     +15     3.31            4.51
binary32    Single precision     2     23+1    -126    +127    7.22            38.23
binary64    Double precision     2     52+1    -1022   +1023   15.95           307.95
binary128   Quadruple precision  2     112+1   -16382  +16383  34.02           4931.77
decimal32                        10            -95     +96
decimal64                        10            -383    +384                    384
decimal128                       10            -6143   +6144
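The two decimal columns of Table 1.1 for the binary formats appear to follow from the binary columns by scaling with log10(2). The sketch below illustrates that relationship under this assumption. The function name `decimal_equivalents` and the list `BINARY_FORMATS` are hypothetical names chosen for the example.

```python
import math

# (name, significand precision in bits including the implicit bit, E max)
# Values are taken from the "Digits" and "E max" columns of Table 1.1.
BINARY_FORMATS = [
    ("binary16", 10 + 1, 15),
    ("binary32", 23 + 1, 127),
    ("binary64", 52 + 1, 1023),
    ("binary128", 112 + 1, 16383),
]


def decimal_equivalents(precision_bits, e_max):
    """Return (decimal digits, decimal E max) for a binary format.

    Assumption: both values are the binary quantities scaled by log10(2).
    """
    log10_2 = math.log10(2)
    return precision_bits * log10_2, e_max * log10_2


for name, precision_bits, e_max in BINARY_FORMATS:
    digits, dec_e_max = decimal_equivalents(precision_bits, e_max)
    print(f"{name:10s} {digits:8.2f} {dec_e_max:10.2f}")
```

The printed values may differ from the table in the last digit, depending on whether the table entries were rounded or truncated.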

1.5 The IEEE 754 double precision floating point format

Since our implementation uses a double precision floating point arithmetic unit, we explain the double precision format in more detail. The double precision format uses 64 bits in total to represent a floating point number. Figure 1.2 displays the layout of these bits.
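As an illustration, the 64-bit pattern of a double can be inspected in software. The following is a minimal sketch, not part of the implementation described here. It assumes the usual layout of 1 sign bit, 11 exponent bits, and 52 stored significand bits, which is consistent with the 64-bit total and the 52+1 digits listed in Table 1.1. The function name `double_fields` is hypothetical.

```python
import struct


def double_fields(x: float) -> tuple[str, str, str]:
    """Split the 64-bit pattern of x into (sign, exponent, fraction) bit strings."""
    # Pack x as a big-endian IEEE 754 double, then reinterpret the same
    # 8 bytes as one unsigned 64-bit integer.
    (as_int,) = struct.unpack(">Q", struct.pack(">d", x))
    bits = format(as_int, "064b")  # most significant bit first
    # Assumed layout: 1 sign bit, 11 exponent bits, 52 stored significand bits.
    return bits[0], bits[1:12], bits[12:]


sign, exponent, fraction = double_fields(1.0)
print(sign, exponent, fraction)
print(len(sign) + len(exponent) + len(fraction))  # total width in bits
```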

1.5.1 The sign bit

The first of the 64 bits, which is the most significant bit, represents the sign of the number. Zero denotes a positive number and one denotes a negative number.
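The sign convention can be checked with a short sketch that reads the most significant bit of a double. The helper name `sign_bit` is hypothetical and serves only as an illustration.

```python
import struct


def sign_bit(x: float) -> int:
    """Return the most significant of the 64 bits that encode x."""
    # Reinterpret the 8 bytes of the double as an unsigned 64-bit integer.
    (as_int,) = struct.unpack(">Q", struct.pack(">d", x))
    # Shifting right by 63 positions leaves only the top bit.
    return as_int >> 63


for value in (2.5, -2.5, 0.0, -0.0):
    print(value, sign_bit(value))
```

Because the sign is stored in its own bit, the value zero has two encodings, one with the sign bit clear and one with it set.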