32-bit IEEE Floating Point Numbers (Part 2)

By Ranti · March 6, 2023 · 4 min read
  • 1 bit = Sign
  • 8 bits = Exponent with a bias of 127
  • 23 bits = Mantissa (the 1 before the point is implicit)
[Figure: the sign, exponent and mantissa represented as fields in the IEEE 32-bit notation]

How do we extract the bits in C from a floating point number?

  • Masking and shifting only works on ints (preferably unsigned ints), not floating point numbers
  • You can't just cast a floating point number to an unsigned int; 29.314 will become 29, which is a completely different number
  • Instead copy the bits from the floating point number to an unsigned int

Copying the bits from a floating point number to an unsigned int is done by using pointers

  1. Create an unsigned int pointer to point to the float
  2. Dereference the pointer
// changing a floating point number to an unsigned int

float f = 29.314;
unsigned int x;
unsigned int *p = (unsigned int *)&f;
x = *p;

Shortcut!

// changing a floating point number to an unsigned int (shorter version) 

float f = 29.314;
unsigned int x = *(unsigned int *)&f;
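The pointer cast above works with most common compilers, but the C standard does not guarantee it (reading a float through an unsigned int pointer breaks the "strict aliasing" rule). A minimal sketch of an alternative that copies the same bits with `memcpy` is below. The helper name `float_to_bits` is made up for this example, and it assumes `float` and `unsigned int` are both 4 bytes.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (not from the original post): returns the raw bit
   pattern of a float as an unsigned int by copying the bytes across. */
unsigned int float_to_bits(float f)
{
    unsigned int x;

    /* Assumption: both types have the same size (4 bytes on most platforms). */
    assert(sizeof x == sizeof f);

    memcpy(&x, &f, sizeof x); /* copy the bits, no numeric conversion happens */
    return x;
}
```

For example, `float_to_bits(1.0f)` gives `0x3f800000`: sign 0, stored exponent 127, and all 23 mantissa bits 0.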

How do we extract the sign bit, exponent bits & mantissa bits?

Extracting the sign:

// Shift 31 places to the right (the rightmost position), &1 is the mask
// if the result is 1 the number is negative, if the result is 0 the number is positive

float f = 3.14159;
unsigned int i = *(unsigned int *)&f;
int sign = (i >> 31)&1; 

Extracting the exponent:

The exponent occupies 8 bits starting at bit 23

#define MASK8 0xff
#define EXPSHIFT 23

float f = 3.14159;
unsigned int i = *(unsigned int*)&f;
unsigned int exp = (i >> EXPSHIFT) & MASK8; // This result is the stored exponent (actual exponent + 127)

int actual_exp = (int)exp - 127; // Subtract the bias to get the actual exponent

Extracting the mantissa:

  • Rightmost 23 bits (no shift needed)
#define MASK23 ((1 << 23) -1)

float f = 3.14159;
unsigned int i = *(unsigned int*)&f;

unsigned int mantissa = (i & MASK23); // bitwise AND with mask to extract mantissa (i is not a pointer, so no *)

Alternatively:

#define MASK23 0x7fffff

float f = 3.14159;
unsigned int i = *(unsigned int*)&f;

unsigned int mantissa = (i & MASK23); // bitwise AND with mask to extract mantissa (i is not a pointer, so no *)
  • 1 << 23: This creates a value with only bit 23 set (a 1 followed by 23 zeros).
  • ((1 << 23) - 1): Subtracting 1 turns that into 23 ones in bit positions 0 to 22, which is the mask for the mantissa in IEEE notation (the same value as 0x7fffff).
  • The extracted mantissa contains only the digits (bits) after the point
  • Not the implicit 1 before the point

Now, how do we insert the 1 before the point into the mantissa bits?

  • The 23 bits are already there at positions 0 - 22
  • We need to insert a 1 at bit position 23
// After extracting the bits from the above code

mantissa |= (1 << 23); // Insert the 1 before the point
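Putting the pieces together, here is a rough sketch that pulls out all three fields in one go. The struct and function names are invented for this example, the bits are copied with `memcpy` instead of the pointer cast, and it assumes a normalised number (not zero, a denormal, infinity or NaN) and a 4-byte `float` and `unsigned int`.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical container for the three fields; the names are illustrative. */
struct float_fields {
    unsigned int sign;     /* 1 = negative, 0 = positive */
    int exponent;          /* actual exponent, bias of 127 removed */
    unsigned int mantissa; /* 24 bits: the 1 before the point + 23 stored bits */
};

struct float_fields decode_float(float f)
{
    unsigned int bits;
    struct float_fields out;

    /* Assumption: float and unsigned int are the same size (4 bytes). */
    assert(sizeof bits == sizeof f);
    memcpy(&bits, &f, sizeof bits); /* copy the bits of f into an unsigned int */

    out.sign = (bits >> 31) & 1u;                     /* bit 31 */
    out.exponent = (int)((bits >> 23) & 0xffu) - 127; /* bits 23 - 30, minus the bias */
    out.mantissa = (bits & 0x7fffffu) | (1u << 23);   /* bits 0 - 22, plus the leading 1 */
    return out;
}
```

For -6.5 (which is -1.101 × 2² in binary) this gives sign 1, exponent 2 and mantissa `0xd00000`.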

  • To read a bit use &
  • To set a bit use |
  • To clear a bit use & with an inverse of the mask (using the NOT ~ operator)

0b11001100 (the original number)

AND 0b11110111 (the inverse of the mask for the 4th bit)
= 0b11000100 (the result with the 4th bit cleared)
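The three operations can be sketched as small C functions. The function names are made up for this example, bit positions are counted from 0 at the right (so the "4th bit" above is bit 3), and the check on `n` assumes a 32-bit `unsigned int`. Hex is used in place of `0b` literals because those are not standard C before C23.

```c
#include <assert.h>

/* Hypothetical helpers; n is the bit position, counted from 0 at the right. */

unsigned int read_bit(unsigned int x, unsigned int n)
{
    assert(n < 32);        /* assumption: unsigned int has 32 bits */
    return (x >> n) & 1u;  /* shift the bit to position 0, then mask it */
}

unsigned int set_bit(unsigned int x, unsigned int n)
{
    assert(n < 32);
    return x | (1u << n);  /* OR with the mask forces the bit to 1 */
}

unsigned int clear_bit(unsigned int x, unsigned int n)
{
    assert(n < 32);
    return x & ~(1u << n); /* AND with the inverted mask forces the bit to 0 */
}
```

With the number from the example, `clear_bit(0xcc, 3)` gives `0xc4`, which is 0b11000100.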

How do we add floating point numbers?

  • In scientific notation

Decimal:

  1. Make the exponents the same by shifting the number with the smaller exponent to the right (this makes its exponent larger)
  2. Add the mantissas
$$ 23.314 + 6975.2 $$
$$ = 2.3314 \times 10^{1} + 6.9752 \times 10^{3} \text{ (normalising the 2 numbers)} $$
$$ = 0.023314 \times 10^{3} + 6.9752 \times 10^{3} \text{ (making the exponents the same)} $$
$$ \begin{array}{r} 0.023314 \times 10^{3} \\ +\ 6.975200 \times 10^{3} \\ \hline 6.998514 \times 10^{3} \end{array} $$

The result of the addition may have 2 digits before the "point". If so, normalise by shifting the mantissa of the result to the right by one and adding 1 to the exponent

$$\text{If the result is } 32.64 \times 10^{8} \text{ it is normalised to } 3.264 \times 10^{9}$$

The algorithm for binary floating point addition is the same

  • Make the smaller exponent the same as the larger one
  • Add the mantissas
  • Normalise the result

Note: If one of the numbers is negative, subtract the smaller magnitude from the larger magnitude; the sign of the result is the sign of the number with the larger magnitude

If both of the numbers are negative, add the magnitudes together and keep the negative sign
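The three steps can be sketched in C using only integer operations on the bit fields. This is a simplified illustration, not a full IEEE adder: the function name is made up, it assumes both inputs are positive and normalised, it assumes a 4-byte `float` and `unsigned int`, it drops the bits that are shifted out instead of rounding them, and it ignores overflow of the exponent.

```c
#include <assert.h>
#include <string.h>

/* Sketch: add two POSITIVE, normalised floats with integer operations only.
   Bits shifted out are dropped (truncation, not IEEE rounding). */
float add_positive_floats(float a, float b)
{
    unsigned int ia, ib, result;
    float out;

    assert(sizeof ia == sizeof a); /* assumption: both types are 4 bytes */
    memcpy(&ia, &a, sizeof ia);
    memcpy(&ib, &b, sizeof ib);

    unsigned int exp_a = (ia >> 23) & 0xffu; /* stored (biased) exponents */
    unsigned int exp_b = (ib >> 23) & 0xffu;
    unsigned int man_a = (ia & 0x7fffffu) | (1u << 23); /* with the leading 1 */
    unsigned int man_b = (ib & 0x7fffffu) | (1u << 23);

    /* Step 1: make the smaller exponent the same as the larger one by
       shifting that number's mantissa to the right. */
    if (exp_a < exp_b) {
        unsigned int shift = exp_b - exp_a;
        man_a = (shift < 32) ? (man_a >> shift) : 0u;
        exp_a = exp_b;
    } else {
        unsigned int shift = exp_a - exp_b;
        man_b = (shift < 32) ? (man_b >> shift) : 0u;
    }

    /* Step 2: add the mantissas. */
    unsigned int man = man_a + man_b;
    unsigned int exp_r = exp_a;

    /* Step 3: normalise. A carry into bit 24 means 2 digits before the point. */
    if (man & (1u << 24)) {
        man >>= 1;
        exp_r += 1;
    }

    /* Pack the fields again; the leading 1 goes back to being implicit. */
    result = (exp_r << 23) | (man & 0x7fffffu);
    memcpy(&out, &result, sizeof out);
    return out;
}
```

For example, `add_positive_floats(1.5f, 2.25f)` shifts the mantissa of 1.5 right by one place, adds, and returns 3.75.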

How do we multiply floating point numbers?

$$ (a \times 10^b) \times (c \times 10^d) $$
$$ = (a \times c) \times (10^b \times 10^d) $$
$$ = (a \times c) \times 10^{b + d} $$
$$ = \text{product of the mantissas} \times 10^{\text{sum of the exponents}} $$

The sign of the result is just the XOR of the signs of the operands

In binary, multiplication works exactly the same way

Important

  • The stored exponent has a bias of 127
  • So adding the 2 exponent fields together adds the bias twice, and one bias has to be subtracted again
$$ \text{stored exponent of the result} = \text{expA} + \text{expB} - 127 $$ $$ \text{Add the stored exponents together and subtract 127} $$
  • Insert the leading 1 into each mantissa before multiplying them (integer multiplication)
  • You may need to re-normalise due to the mantissa having too man
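The multiplication steps above can be sketched in C as follows. This is a simplified illustration under several assumptions: the function name is made up, both inputs are normalised, the result stays in the normalised range (the `assert` on the exponent checks this), `float` and `unsigned int` are 4 bytes, and the low bits of the product are dropped instead of rounded.

```c
#include <assert.h>
#include <string.h>

/* Sketch: multiply two normalised floats with integer operations only.
   Low bits of the product are dropped (truncation, not IEEE rounding). */
float multiply_floats(float a, float b)
{
    unsigned int ia, ib, result;
    float out;

    assert(sizeof ia == sizeof a); /* assumption: both types are 4 bytes */
    memcpy(&ia, &a, sizeof ia);
    memcpy(&ib, &b, sizeof ib);

    /* The sign of the result is the XOR of the signs of the operands. */
    unsigned int sign = ((ia >> 31) ^ (ib >> 31)) & 1u;

    /* Add the stored exponents and subtract one bias of 127. */
    int exp_r = (int)((ia >> 23) & 0xffu) + (int)((ib >> 23) & 0xffu) - 127;

    /* Insert the leading 1 into each mantissa, then multiply as integers.
       Two 24-bit values give a product of up to 48 bits. */
    unsigned long long man_a = (ia & 0x7fffffu) | (1u << 23);
    unsigned long long man_b = (ib & 0x7fffffu) | (1u << 23);
    unsigned long long product = man_a * man_b;

    /* Re-normalise: the leading 1 is at bit 46 or bit 47. If it is at
       bit 47 there are 2 digits before the point. */
    if (product & (1ULL << 47)) {
        product >>= 1;
        exp_r += 1;
    }

    assert(exp_r > 0 && exp_r < 255); /* sketch only handles normalised results */

    /* Keep the top 23 bits after the point; the leading 1 becomes implicit. */
    unsigned int man = (unsigned int)(product >> 23) & 0x7fffffu;

    result = (sign << 31) | ((unsigned int)exp_r << 23) | man;
    memcpy(&out, &result, sizeof out);
    return out;
}
```

For example, `multiply_floats(1.5f, 1.5f)` needs the re-normalising step (1.1 × 1.1 = 10.01 in binary) and returns 2.25.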