Spring 2015
Week 9 Additional Module
Digital Circuits and Systems
IEEE 754 Format

Shankar Balachandran*
Associate Professor, CSE Department
Indian Institute of Technology Madras
*Currently a Visiting Professor at IIT Bombay

There is no video corresponding to this file.

Floating-Point Number Representation

Field layout: | S (1 bit) | Exponent E (p bits) | Mantissa M (m bits) |

Sign bit: S is the sign of the floating-point number.
Exponent: p-bit exponent (E) in excess-B code.
Mantissa: m-bit unsigned mantissa (M).
Radix: R is the radix for the representation.

The actual value of the number represented above is:
F = (-1)^S × 1.M × R^(E-B)   (if normalized)
F = (-1)^S × M × R^(E-B)     (if unnormalized)

IEEE Floating-Point Format

Single Precision Format (32 bits):
| S (1 bit, bit 31) | Exponent E (8 bits, bits 30-23) | unsigned Mantissa M (23 bits, bits 22-0) |
F = (-1)^S × 1.M × 2^(E-127)

Special reserved values:
E = 0 with M = 0 represents ZERO
E = 255 with M = 0 represents ±∞
E = 255 with M ≠ 0 represents NaN

Double Precision Format (64 bits):
| S (1 bit) | Exponent E (11 bits) | unsigned Mantissa M (52 bits) |
F = (-1)^S × 1.M × 2^(E-1023)

Examples

Convert 4.62×10^2 to IEEE single precision format:
4.62×10^2 = 462 = 111001110.0 (binary) = 1.110011100 × 2^8
Mantissa = 110011100
Exponent = 8 + 127 = 135 = 10000111
0 1000 0111 1100 1110 0000 0000 0000 000 = 0 87 CE0000
(In this S / E / M hex notation, the 23-bit mantissa is padded with one trailing 0 to fill six hex digits.)

Convert -0.456×2^-3 to IEEE single precision format:
-0.456×2^-3 = -0.0111 0100 1011 1100 0110 1010 0111 1110... × 2^-3
= -1.1101 0010 1111 0001 1010 101 × 2^-5 (mantissa rounded to 23 bits)
Exponent = -5 + 127 = 122 = 01111010
1 0111 1010 1101 0010 1111 0001 1010 101 = 1 7A D2F1AA

Convert 1 81 99999A to decimal representation:
Mantissa = -1.1001 1001 1001 1001 1001 1010
Exponent = 81 (hex) - 127 (decimal) = 129 - 127 = 2
Value = -1.1001 1001 1001 1001 1001 1010 × 2^2 = -0110.0110 0110 0110 0110 0110 1 ≈ -6.4

Floating-Point Addition

How to add two floating-point numbers? (Ma, Ea) + (Mb, Eb)
1. Place the number with the larger exponent in register 1 and the other in register 2.
2. d = E1 - E2
3. Right-shift mantissa M2 by d bits (i.e., left-shift the radix point by d bits).
4. MSUM = M1 + M2; ESUM = E1
5. If MSUM ≥ 2, renormalize (post-normalization) by dividing by 2 (shifting right) and incrementing ESUM.
6. Rounding may be required to store the result in the same number of bits (precision) as the inputs.
7. Result = (MSUM, ESUM).

Example: Addition

Add (11000000, 011) to (10101100, 100) (input numbers are in the normalized form with excess-4 exponent).
That is, (1.11000000 × 2^(011-100)) + (1.10101100 × 2^(100-100)) = ?

Since 100 > 011:
M1 = 1.10101100, E1 = 100
M2 = 1.11000000, E2 = 011
d = E1 - E2 = 100 - 011 = 001
Right-shift M2 by 001 (1) bit: M2 = 0.11100000, E2 = 100
MSUM = 1.10101100 + 0.11100000 = 10.10001100 and ESUM = E1 = 100
Post-normalize: MSUM = 1.01000110 and ESUM = 101

Therefore,
(1.11000000 × 2^(011-100)) + (1.10101100 × 2^(100-100)) = (1.01000110 × 2^(101-100))
Or, (11000000, 011) + (10101100, 100) = (01000110, 101)

Floating-Point Multiplication

How to multiply two floating-point numbers? (Ma, Ea) × (Mb, Eb)
1. MPROD = M1 × M2
2. EPROD = E1 + E2 - bias
3. Post-normalize MPROD by shifting by an appropriate amount and then updating EPROD by the same amount.
4. Rounding may be required to store the result in the same number of bits (precision) as the inputs.
5. If necessary, normalize and update EPROD.
6. Result = (MPROD, EPROD).

Example: Multiplication

Multiply (10101100, 0101) and (11000000, 0110) (input numbers are in the normalized form with excess-8 exponent).
That is, (1.10101100 × 2^(0101-1000)) × (1.11000000 × 2^(0110-1000)) = ?
MPROD = 1.10101100 × 1.11000000 = 10.1110110100000000
EPROD = E1 + E2 - 8 = 0101 + 0110 - 1000 = 0011
Post-normalize: MPROD = 1.01110110100000000 and EPROD = 0100
Rounding: MPROD = 1.01110111 (the discarded bits 1000... are exactly half a unit in the last place; this example rounds the tie upward)

Therefore,
(1.10101100 × 2^(0101-1000)) × (1.11000000 × 2^(0110-1000)) = (1.01110111 × 2^(0100-1000))
Or, (10101100, 0101) × (11000000, 0110) = (01110111, 0100)

End of Week 9: Additional Module
Thank You
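
The conversions and the two worked arithmetic examples in this module can be checked with a short Python sketch. All function names here (float_to_fields, fields_to_float, slide_notation, fp_add, fp_mul) are illustrative assumptions and do not come from the slides. The toy fp_add simply truncates bits that are shifted out, and fp_mul rounds ties upward to match the worked multiplication example; IEEE 754 hardware defaults to round-half-to-even, so this is a sketch of the steps, not a conforming implementation.

```python
import struct

def float_to_fields(x):
    """Split x (rounded to IEEE single precision) into (S, E, M)."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    return word >> 31, (word >> 23) & 0xFF, word & 0x7FFFFF

def fields_to_float(sign, exponent, mantissa):
    """Rebuild a Python float from single-precision fields (S, E, M)."""
    word = (sign << 31) | (exponent << 23) | mantissa
    (x,) = struct.unpack(">f", struct.pack(">I", word))
    return x

def slide_notation(x):
    """Format x as 'S EE MMMMMM': the 23-bit mantissa is padded with
    one trailing 0 so that it fills six hex digits."""
    s, e, m = float_to_fields(x)
    padded = m << 1
    return f"{s} {e:02X} {padded:06X}"

def fp_add(a, b, frac_bits=8):
    """Add two normalized toy numbers given as (fraction, biased exponent).
    Assumption: bits shifted out during alignment/normalization are truncated."""
    (m1, e1), (m2, e2) = sorted([a, b], key=lambda t: t[1], reverse=True)
    one = 1 << frac_bits                 # weight of the hidden leading 1
    sig1, sig2 = one | m1, one | m2      # significands 1.M as integers
    d = e1 - e2                          # exponent difference
    sig2 >>= d                           # align the smaller operand
    total, e = sig1 + sig2, e1
    if total >= 2 * one:                 # MSUM >= 2: post-normalize
        total >>= 1
        e += 1
    return total & (one - 1), e

def fp_mul(a, b, bias, frac_bits=8):
    """Multiply two normalized toy numbers given as (fraction, biased exponent).
    Assumption: ties are rounded upward, as in the worked example."""
    (m1, e1), (m2, e2) = a, b
    one = 1 << frac_bits
    prod = (one | m1) * (one | m2)       # has 2*frac_bits fractional bits
    e = e1 + e2 - bias
    shift = frac_bits
    if prod >= (2 << (2 * frac_bits)):   # product in [2, 4): post-normalize
        shift += 1
        e += 1
    rounded = (prod + (1 << (shift - 1))) >> shift   # round half up
    if rounded >= 2 * one:               # rounding carried out: normalize again
        rounded >>= 1
        e += 1
    return rounded & (one - 1), e

if __name__ == "__main__":
    print(slide_notation(462.0))               # 0 87 CE0000
    print(slide_notation(-0.456 * 2 ** -3))    # 1 7A D2F1AA
    print(fields_to_float(1, 0x81, 0x99999A >> 1))
    m, e = fp_add((0b11000000, 0b011), (0b10101100, 0b100))
    print(f"({m:08b}, {e:03b})")               # (01000110, 101)
    m, e = fp_mul((0b10101100, 0b0101), (0b11000000, 0b0110), bias=8)
    print(f"({m:08b}, {e:04b})")               # (01110111, 0100)
```

Keeping the significands as integers scaled by 2^frac_bits makes each shift in the step lists an explicit integer shift, which mirrors what a hardware datapath would do.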