\(\newcommand{L}[1]{\| #1 \|}\newcommand{VL}[1]{\L{ \vec{#1} }}\newcommand{R}[1]{\operatorname{Re}\,(#1)}\newcommand{I}[1]{\operatorname{Im}\, (#1)}\)

Points on floats

Thanks

This page comes largely from the wikipedia floating point article.

I read What every computer scientist should know about floating point at some - er - point.

I wrote another page on Floating point error.

Starting to float

This is a floating point number:

\(-123.45\)

We can also write this as:

\(-1.2345 * 10^2\)

Well, actually, we can write it like this:

\(-1 * 1.2345 * 10^2\)

Let \(n\) (the sign) be a variable that is \(1\) if the sign of the number is positive, and \(-1\) if the number is negative. Here \(n = -1\).

Let \(s\) (the significand) be a number - here \(1.2345\), with the floating point assumed to be after the first digit. The significand is sometimes called the mantissa.

Let \(e\) (the exponent) be the power of 10 to apply. Here \(e = 2\). We now write:

\(n * s * 10^e\)

We’re all used to thinking of the \(2\) in \(10^2\) as being the number of places to shift the decimal point to the right. That is, we take the decimal point of the significand and shift it two to the right to get the number we mean.

Your actual floats

Actually, floating point numbers are almost invariably stored in IEEE 754 binary (base 2) format. So far we’ve been looking at decimal (base 10) format numbers.

Obviously we can store the sign in one binary bit.

We store the significand as a binary number, again with an implied floating point position. So:

\(1.1\)

where \(1.1\) (binary) equals \(11\) (binary) * \(2^{-1}\) (decimal) which equals (decimal) \(3/2\).

Of course \(11\) (binary) = \(2^2-1\), and in general, the maximum number that can be stored for \(p\) binary digits without a floating point is \(2^p-1\).

It looks like we’d need two bits of storage to store \(11\) (binary). But no, because, unlike the decimal case, we know that the first binary digit in the significand is 1. Why? In general (for decimal or binary) the first digit cannot be 0, because we can always represent a number beginning with 0 by subtracting from the exponent and shifting the significand digits left until the first digit is not zero. For decimal, if the first digit is not 0, it could be 1-9, but for binary, it can only be 1. So, for binary, we can infer the first \(1\) and we only need one bit of storage to store \(1.1\). Of course that means the significand can only be \(1.1\) or \(1.0\) in this case.

IEEE 32-bit binary float.

This is a common floating point format, often called a single-precision float.

As we expect, this format devotes one bit to the sign.

It devotes 23 bits to the significand. From the argument above, by assuming a first digit of 1, this gives it an effective 24 bits of storage for the significand. The significand can thus be from \(1.0\) (binary) to 1.(23 ones) in binary, which is, in decimal, a range of \(1\) to \((2^{24}-1) * 2^{-23}\). In sympy:

>>> from sympy import Integer, Float
>>> two = Integer(2)
>>> s_bits_32 = 23
>>> biggest_s_32 = (two**(s_bits_32+1)-1) * (two**(-s_bits_32))
>>> biggest_s_32
16777215/8388608
>>> Float(biggest_s_32)
1.99999988079071

With 1 bit for the sign, and 23 bits for the significand, there are 8 bits remaining for the exponent.

The exponent is not stored as a standard signed integer. An exponent of all 0s indicates a zero number or a subnormal number [1]. An exponent of all 1s indicates an infinity or not-a-number value. If we treat the 8 bits of the exponent as an unsigned number (call it \(u\)) then the actual exponent is given by:

\(e = u - b\)

where \(b\) is the bias - and the bias for 32 bit IEEE floats, is \(127\). With 8 bits, \(u\) could be 0 to 255, but both 0 and 255 are reserved (0 for zeros and subnormals; 255 for non-finite, as above). Thus the effective range of \(u\) is 1-254, and the effective range of \(e\) is -126 to 127.

What’s the largest positive 32 bit IEEE float? Easy:

>>> e_bits_32 = 8
>>> e_bias_32 = 127
>>> biggest_e_32 = (two**e_bits_32)-1-e_bias_32 - 1 # -1 for all-ones reserved
>>> biggest_e_32
127
>>> biggest_float32 = biggest_s_32 * two**biggest_e_32
>>> biggest_float32
340282346638528859811704183484516925440
>>> float(biggest_float32)
3.4028234663852886e+38

The most negative value? Just the same number with -1 sign (sign bit is 1).

And the smallest value? [1]

>>> most_neg_e_32 = -e_bias_32 + 1 # +1 for zeros reserved
>>> most_neg_e_32
-126
>>> smallest_s_32 = 1
>>> smallest_float32 = smallest_s_32 * two**most_neg_e_32
>>> smallest_float32
1/85070591730234615865843651857942052864
>>> float(smallest_float32)
1.1754943508222875e-38

IEEE 64-bit binary float.

This is the other common floating point format, often called a double-precision float.

It uses:

  • 1 bit for the sign
  • 52 bits for the significand
  • 11 bits for the exponent

and the exponent bias is 1023 (wikipedia floating point):

>>> s_bits_64 = 52
>>> biggest_s_64 = (two**(s_bits_64+1)-1) * (two**(-s_bits_64))
>>> biggest_s_64
9007199254740991/4503599627370496
>>> float(biggest_s_64)
1.9999999999999998

Well - it’s not quite 2.0 - but within the limits of the printing precision.

Largest 64-bit float:

>>> e_bits_64 = 11
>>> e_bias_64 = 1023
>>> biggest_e_64 = (two**e_bits_64)-1-e_bias_64 - 1 # -1 for all-ones reserved
>>> biggest_e_64
1023
>>> biggest_float64 = biggest_s_64 * two**biggest_e_64
>>> float(biggest_float64)
1.7976931348623157e+308

Smallest [1]:

>>> most_neg_e_64 = -e_bias_64 + 1 # +1 for zeros reserved
>>> most_neg_e_64
-1022
>>> smallest_s_64 = 1
>>> smallest_float64 = smallest_s_64 * two**most_neg_e_64
>>> float(smallest_float64)
2.2250738585072014e-308

Floating point and integers

Consider the significand in an IEEE 32 bit floating point number.

Neglect for a moment, the assumed floating point after the first digit. The significand has 24 binary digits (including the assumed first digit). That is, neglecting the floating point, it can represent the integers from 1 (\(2^1-1\)) to 16777215 (\(2^{24}-1\)). Now let’s take into account the floating point. In order to store 1, the exponent can just be 0, no problem. In order to store \(2^{24}-1\), the exponent has to be 23 to push the floating point 23 digits to the right. As we know, the IEEE exponent can range between -126 and 127, so 23 is also OK.

Now set the significand to 1.0 and the exponent to be 24. This is \(1 * 2^{24}\) - or 16777216. By setting the exponent to one greater than the number of significand digits, we have pushed the floating point one digit past the end of the significand, and got an extra implied 0 (1 followed by 23 zeros, followed by an implied 0, followed by the floating point).

The smallest possible increase we can make to this number is to replace the final 0 in the significand with a 1. But, because we’ve pushed the floating point one position past the end of the significand, the final 1 in our significand does not increase the resulting number by 1, but by 2. So the next largest number after 2**24, is 2**24 + 2. We can’t store 2**24+1 in an IEEE 32 bit float.

All this means that the IEEE 32 bit binary format can store all integers -16777216 to 16777216 (\(\pm 2^{24}\)) exactly.

By the same argument, the IEEE 64 bit binary format can exactly store all integers between \(\pm 2^{53}\).

Bit patterns

You don’t believe me? Let’s predict the bit pattern for storing the number 16777216 in IEEE 32 bit floating point. We established that this has 1.0 for the significand, and the value 24 for the exponent.

The wikipedia floating point page tells us that the IEEE standard has a 32 bit binary float stored as the sign bit, followed by 8 exponent bits, followed by the 23 significand bits, with the most significant bits first.

So we have:

  • 0 for the sign bit
  • the exponent part \(u = e + b\) = 24 + 127 = 151
  • 0 for the significand (implicit 1.0)

The binary representation of 151 is:

>>> import numpy as np
>>> np.binary_repr(151)
'10010111'

We get the memory from our float represented as an unsigned 32 bit integer:

>>> float32_mem = np.float32(16777216).view(np.uint32)

and show it as binary:

>>> np.binary_repr(float32_mem)
'1001011100000000000000000000000'

How about -16777215? It should be 1 for the sign, 23 for the exponent (\(u = 23 + 127\) = 50), and all ones for the significand:

>>> np.binary_repr(150)
'10010110'
>>> np.binary_repr(np.float32(-16777215).view(np.uint32))
'11001011011111111111111111111111'
[1](1, 2, 3) Subnormal numbers (wikipedia subnormal numbers) are numbers smaller than those you can store with the simple significand and exponent mechanisms this page describes. Thus, for a 32 bit float, the smallest normal number is around 1.17549435082229e-38. The IEEE standard contains a trick for storing smaller numbers than this, by using an exponent of 0 - see the wikipedia page for details.