FINITE PRECISION EFFECTS
1. FLOATING POINT VERSUS FIXED POINT
2. WHEN IS FIXED POINT NEEDED?
3. TYPES OF FINITE PRECISION EFFECTS
4. FACTORS INFLUENCING FINITE PRECISION EFFECTS
5. FINITE PRECISION EFFECTS: FIR VERSUS IIR FILTERS
6. DIRECT FIR FILTER STRUCTURE
7. CASCADE FIR FILTER STRUCTURE
8. STRUCTURES WITH QUANTIZERS
9. THE OVERFLOW PROBLEM
10. TIME-DOMAIN SCALING
11. FREQUENCY-DOMAIN SCALING
12. OVERFLOW OF INTERNAL SIGNALS
13. QUANTIZATION OF FILTER COEFFICIENTS
14. COEFFICIENT QUANTIZATION: EXAMPLE
15. SIGNAL QUANTIZATION
16. QUANTIZATION ERROR POWER
17. EXPRESSING σ_e^2 IN TERMS OF THE NUMBER OF BITS
18. SIGNAL TO QUANTIZATION NOISE RATIO (SNR)
19. SIGNAL QUANTIZATION
I. Selesnick, EL 713 Lecture Notes
FLOATING POINT VERSUS FIXED POINT
• Signals cannot be represented exactly on a computer; they are
represented only with finite precision.
• In floating point arithmetic, the finite precision errors are generally not a problem.
• However, with fixed point arithmetic, the finite word length
causes several problems.
• For this reason, a floating point implementation is preferred
when it is feasible.
• Because of finite precision effects, a fixed-point implementation
is used only when
1. speed,
2. power,
3. size, and
4. cost
are critical.
WHEN IS FIXED POINT NEEDED?
From the Matlab documentation: "When you have enough power to
run a floating point DSP, such as on a desktop PC or in your
car, fixed-point processing and filtering is unnecessary. But, when
your filter needs to run in a cellular phone, or you want to run a
hearing aid for hours instead of seconds, fixed-point processing can
be essential to ensure long battery life and small size.
Many real-world DSP systems require that filters use minimum
power, generate minimum heat, etc. Meeting these requirements
often means using a fixed-point implementation."
TYPES OF FINITE PRECISION EFFECTS
• Overflow
• Quantization of filter coefficients
• Signal quantization
1. Analog/Digital conversion (A/D converter)
2. Round-off noise
3. Limit cycles (recursive filters only)
FACTORS INFLUENCING FINITE PRECISION EFFECTS
1. The structure used for implementation (direct, transposed, etc.)
2. The word length and data type (fixed point, floating point,
etc.; one's complement, two's complement, etc.)
3. The type of multiplication (rounding, truncation, etc.)
FINITE PRECISION EFFECTS: FIR VERSUS IIR FILTERS
Remarks on FIR filters:
1. Filter coefficient quantization: coefficient quantization is quite
serious for FIR filters because the coefficients of an FIR filter
typically have a large dynamic range.
2. Round-off noise: Round-off noise is not as serious for FIR
filters as it is for IIR filters.
3. Limit cycles: Non-recursive (FIR) filters do not have limit cycles.
4. For FIR filters, the direct form is generally preferred.
Remarks on IIR filters:
1. Filter coefficient quantization: coefficient quantization can
make a stable IIR filter unstable! (Implementing an IIR filter
as a cascade of second-order sections prevents this.)
2. Round-off noise: For a cascade of second-order sections, the
round-off noise depends on
(a) the pole-zero pairing,
(b) the ordering of the sections.
DIRECT FIR FILTER STRUCTURE
To clarify the foregoing discussion on the effects of fixed-point
arithmetic, it will be convenient to refer to a couple of different
structures for the implementation of FIR filters.
Fig 1
This structure implements an FIR filter of length 4. The output
y(n) is given by
y(n) = h(0) x(n) + h(1) x(n − 1) + h(2) x(n − 2) + h(3) x(n − 3)
and so the impulse response is
h(n) = h(0) δ(n) + h(1) δ(n − 1) + h(2) δ(n − 2) + h(3) δ(n − 3)
and the transfer function is given by
H(z) = h(0) + h(1) z^{-1} + h(2) z^{-2} + h(3) z^{-3}.
This is called a direct structure because the transfer function parameters appear as multipliers in the structure itself.
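As a quick check, the direct structure can be simulated and compared
against Matlab's filter command. This is a minimal sketch; the
coefficient values are hypothetical, and note that Matlab indices start
at 1, so h(1) in the code corresponds to h(0) in the text.

h = [3 1 4 1]/8;        % hypothetical coefficients h(0), ..., h(3)
x = randn(1,50);        % test input signal
y = filter(h, 1, x);    % direct-form FIR filtering
xp = [zeros(1,3) x];    % pad so that x(n-3) is defined for n = 1
y2 = zeros(1,50);
for n = 1:50            % the convolution sum, written out explicitly
    y2(n) = h(1)*xp(n+3) + h(2)*xp(n+2) + h(3)*xp(n+1) + h(4)*xp(n);
end
max(abs(y - y2))        % zero up to floating-point round-off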
CASCADE FIR FILTER STRUCTURE
If the FIR transfer function H(z) is factored as
H(z) = H1 (z) H2 (z)
then the FIR filter can be implemented as a cascade of two shorter
FIR filters. If each of the shorter filters is implemented with the
direct form, then the total flowgraph for the cascade structure is as
shown in the following figure.
Fig 2
If both structures are implemented with infinite precision, then they
will implement exactly the same transfer function. However, when
they are implemented with finite-precision arithmetic, they will have
different properties. Some structures are more immune to finite
precision effects than others when fixed-point arithmetic is used.
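For example, the infinite-precision equivalence of the two structures
can be verified numerically. This is a minimal sketch with hypothetical
factors H1(z) and H2(z).

h1 = [1 0.5];                       % hypothetical factor H1(z)
h2 = [1 -0.3 0.2];                  % hypothetical factor H2(z)
h = conv(h1, h2);                   % direct-form coefficients: H(z) = H1(z) H2(z)
x = randn(1,100);                   % test input
y_direct = filter(h, 1, x);         % direct structure
y_cascade = filter(h2, 1, filter(h1, 1, x));   % cascade structure
max(abs(y_direct - y_cascade))      % zero up to floating-point round-off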
STRUCTURES WITH QUANTIZERS
In fixed-point implementations
1. The multipliers are quantized.
2. The input, output, and internal signals are quantized.
Both types of quantization introduce some error.
If the binary numbers X and Y are each B bits long, then the
product Z = X · Y is 2B − 1 bits long.
Therefore, the multiplications occurring in a digital filter with fixed
point arithmetic must be truncated down to the word length used.
(This is represented by the Q block in the following diagrams.)
On the other hand, the addition of two fixed-point numbers does
not lead to a loss of precision.
In some (old) processors, signal quantization occurs after every
multiplier. For example, the FIR direct structure can be illustrated
as shown.
Fig 3
In most processors today, the quantizer Q comes after the multiply-sum
operation. This is the multiply-accumulate, or MAC, command. The
following figure illustrates this.
Fig 4
Later, we will see how to analyze the degradation of accuracy caused
by the quantizer.
In processors where the quantizer comes after every multiplier, it
can be seen that there are many more quantizations than when the
quantizer comes after the multiply-sum operation.
We will assume from now on that the quantizer comes after the sum.
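The difference between the two quantizer placements can be
illustrated with a small sketch. The step size and coefficients below
are hypothetical.

Q = 2^-8;                      % hypothetical quantization step
h = [0.11 0.42 0.38 0.09];     % hypothetical filter coefficients
x = 2*rand(1,4) - 1;           % samples x(n), x(n-1), x(n-2), x(n-3)
y1 = sum(Q*round((h.*x)/Q));   % quantizer after every multiplier: four roundings
y2 = Q*round(sum(h.*x)/Q);     % quantizer after the multiply-sum (MAC): one rounding
[y1 y2]                        % y1 is subject to four rounding errors, y2 to one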
THE OVERFLOW PROBLEM
The range of values of a B-bit binary number is limited. If a digital
filter is not properly implemented, overflow may occur when the
output or internal signals exceed the maximum representable value.
To avoid overflow, it may be necessary to scale down the input
signal.
• If the signal is not scaled appropriately, many overflows will
occur.
• If the signal is scaled down too much, then the full available
dynamic range is not utilized and the ratio of the signal power
to the round-off noise power (the SNR) will be degraded.
The scaling factor is usually
1. absorbed into the other multipliers (so it does not introduce
further round-off noise), or
2. chosen to be 2^{-k} so it can be implemented by a bit shift (to
minimize complexity).
TIME-DOMAIN SCALING
Suppose the signals x(n) and y(n) can only be represented in the
range

|x(n)| ≤ x_max,    |y(n)| ≤ y_max.

We want to implement a scaled version of the filter H(z), and we
want to choose the scaling constant c so as to maximize the dynamic
range of the output while preventing y(n) from overflowing
(exceeding y_max).

x(n) → c · H(z) → y(n)

Problem: find the maximum value of c such that |y(n)| ≤ y_max.
You can assume that |x(n)| ≤ x_max.
The solution begins with the convolution sum,

y(n) = Σ_k c h(k) x(n − k),

which gives

|y(n)| ≤ c Σ_k |h(k)| |x(n − k)|
       ≤ c Σ_k |h(k)| x_max
       = c x_max Σ_k |h(k)|
       = c x_max ||h(n)||_1

where the 1-norm of h(n) is defined as

||h(n)||_1 := Σ_k |h(k)|.

So

|y(n)| ≤ c x_max ||h(n)||_1 ≤ y_max

gives

c ≤ (y_max / x_max) · (1 / ||h(n)||_1).

This is the time-domain 1-norm scaling rule.
In practice, c is often
1. rounded downward to 1/2^l (so it can be implemented with a
bit shift), or
2. absorbed into existing multipliers in the filter structure.
The value

c = (y_max / x_max) · (1 / ||h(n)||_1)

is often overly conservative. It guarantees that overflow will never
occur; however, in many cases it is acceptable to allow overflow to
occur as long as it is rare, and to use a larger value of c so that the
full dynamic range is better utilized.
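A minimal sketch of the rule in Matlab follows; the filter (designed
here with fir1 from the Signal Processing Toolbox) and the signal
ranges are hypothetical.

h = fir1(30, 0.4);                  % hypothetical FIR filter
x_max = 1;  y_max = 1;              % assumed representable ranges
c = (y_max/x_max) / sum(abs(h));    % time-domain 1-norm scaling rule
c = 2^floor(log2(c));               % round down to a power of two (bit shift)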
FREQUENCY-DOMAIN SCALING
Another approach to scaling for overflow control is derived in the
frequency domain.
y(n) = IDTFT{ Y^f(ω) }
     = IDTFT{ c H^f(ω) X^f(ω) }
     = (1/2π) ∫_{−π}^{π} c H^f(ω) X^f(ω) e^{jnω} dω

Because the absolute value of an integral is no larger than the
integral of the absolute value, we can write

|y(n)| ≤ (c/2π) ∫_{−π}^{π} |H^f(ω)| |X^f(ω)| dω
       ≤ (c/2π) ||X^f||_∞ ∫_{−π}^{π} |H^f(ω)| dω

where the frequency-domain infinity-norm is defined as

||X^f||_∞ := max_ω |X^f(ω)|.

Defining the frequency-domain 1-norm as

||H^f||_1 := (1/2π) ∫_{−π}^{π} |H^f(ω)| dω,

we can write

|y(n)| ≤ c ||X^f||_∞ ||H^f||_1.

To guarantee that no overflows occur, we have the condition

|y(n)| ≤ c ||X^f||_∞ ||H^f||_1 ≤ y_max

or

c ≤ (y_max / ||X^f||_∞) · (1 / ||H^f||_1).

This is the frequency-domain 1-norm scaling rule.
Note that it requires an estimate of ||X^f||_∞.
In the development of this scaling rule, we could have swapped the
roles of X^f(ω) and H^f(ω). Then we would obtain the following
scaling rule:

c ≤ (y_max / ||X^f||_1) · (1 / ||H^f||_∞).

This is the frequency-domain infinity-norm scaling rule.
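Both filter norms can be estimated numerically. This is a minimal
sketch; the filter is hypothetical, and the norms are approximated on
a frequency grid.

h = fir1(30, 0.4);             % hypothetical FIR filter
[H, w] = freqz(h, 1, 1024);    % samples of H^f(ω) on [0, π)
Hf_inf = max(abs(H));          % estimate of ||H^f||_∞
Hf_1 = mean(abs(H));           % estimate of ||H^f||_1; for real h(n), |H^f(ω)| is
                               % even, so the average over [0, π) approximates
                               % (1/2π) ∫ |H^f(ω)| dω over [−π, π]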
OVERFLOW OF INTERNAL SIGNALS
In addition to ensuring that the output signal rarely overflows, it is
necessary to ensure that the internal signals rarely overflow as well.
For example, consider the cascade of two FIR filters.
x(n) → H1(z) → H2(z) → y(n)
Both H1 (z) and H2 (z) must be scaled. We then have the following
diagram.
x(n) → c1 · H1(z) → c2 · H2(z) → y(n)
It is necessary to choose the two scaling constants c1 and c2 so that
overflow of the intermediate signal and of the output signal y(n) is
rare.
If we call the intermediate (internal) signal v(n), then
1. Choose c1 first, so that v(n) does not overflow.
2. Choose c2 second, so that y(n) does not overflow.
If there is more than one internal signal that may overflow, then
begin with the first internal signal and proceed through the filter
structure, as in the sketch below.
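Here is a minimal sketch of this sequential procedure for the
two-section cascade; the filters and signal ranges are hypothetical.

h1 = [1 0.5];  h2 = [1 -0.3 0.2];    % hypothetical sections H1(z), H2(z)
x_max = 1;  v_max = 1;  y_max = 1;   % assumed representable ranges
c1 = (v_max/x_max) / sum(abs(h1));   % 1-norm rule for the internal signal v(n)
g = conv(c1*h1, h2);                 % impulse response from x(n) to y(n)
c2 = (y_max/x_max) / sum(abs(g));    % 1-norm rule for the output y(n)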
QUANTIZATION OF FILTER COEFFICIENTS
For a fixed-point implementation of a digital filter, the filter
coefficients must be quantized to binary numbers.
This changes the actual system to be implemented and usually
results in a degradation of the frequency response.
The quantization of floating point values can be performed in Matlab with the round command:
y = q*round(x/q);
where q is the quantization step.
COEFFICIENT QUANTIZATION: EXAMPLE
If floating-point (or infinite precision) arithmetic is used for h(n),
then the minimal-length Type-I FIR filter satisfying

1 − δ_p ≤ A(ω) ≤ 1 + δ_p,    |ω| ≤ 0.32π
|A(ω)| ≤ δ_s,                0.37π ≤ |ω| ≤ π

with

δ_p = 0.02,    δ_s = −50 dB

is of length N = 83. (See the figures below.)
However, if fixed-point coefficients are obtained by quantizing h(n),
then the degraded frequency response does not meet the
specifications. A longer impulse response is needed. (See below.)
COEFFICIENT QUANTIZATION: EXAMPLE

Fig: H^f(ω) in dB versus ω/π, three panels: (top) N = 83, floating
point; (middle) N = 83, 12-bit coefficients; (bottom) N = 93, 12-bit
coefficients. The dashed line marks the −50 dB stopband
specification.
This example was produced with the following Matlab code.

N = 83;
Dp = 0.02;                 % desired passband error
Ds = 10^(-50/20);          % desired stopband error (50 dB)
Kp = 1/Dp;
Ks = 1/Ds;
wp = 0.32*pi;
ws = 0.37*pi;
wo = (wp+ws)/2;
L = 1000;
w = [0:L]*pi/L;
W = Kp*(w<=wp) + Ks*(w>=ws);
D = (w<=wo);
[h,del] = fircheb(N,D,W);
dp = del/Kp;               % actual passband error
ds = del/Ks;               % actual stopband error

[H,w] = freqz(h);
subplot(3,1,1)
plot(w/pi,20*log10(abs(H)),[0 1],20*log10(Ds)*[1 1],'--')
axis([0 1 -80 10])
title('H^f(\omega) (N = 83, FLOATING POINT)')

b = 12;                    % number of bits used for fixed-point word length
Q = 1/2^b;                 % quantization step
hq = Q*round(h/Q);         % quantized coefficients
[Hq,w] = freqz(hq);        % frequency response of quantized coefficients
subplot(3,1,2)
plot(w/pi,20*log10(abs(Hq)),[0 1],20*log10(Ds)*[1 1],'--')
axis([0 1 -80 10])
title('H^f(\omega) (N = 83, 12 BITS)')

N = 93;                    % increase the filter length
[h,del] = fircheb(N,D,W);  % redesign with the longer filter length
hq = Q*round(h/Q);         % quantize the longer impulse response
[Hq,w] = freqz(hq);
subplot(3,1,3)
plot(w/pi,20*log10(abs(Hq)),[0 1],20*log10(Ds)*[1 1],'--')
axis([0 1 -80 10])
title('H^f(\omega) (N = 93, 12 BITS)')
SIGNAL QUANTIZATION
Quantizing a signal is a memoryless nonlinear operation.
Fig: Uniform quantizer characteristic, Q[x] versus x for −1 ≤ x ≤ 1.
SIGNAL QUANTIZATION
If Q[x(n)] is the quantized version of x(n), then we can define the
quantization error signal e(n) as
e(n) := Q[x(n)] − x(n)
Then we can write
Q[x(n)] = x(n) + e(n)
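For example, if Q = 0.1 and x(n) = 0.437, then Q[x(n)] = 0.4 and
e(n) = 0.4 − 0.437 = −0.037.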
Under many realistic operating conditions, the quantization error
signal e(n) appears to fluctuate randomly from sample to sample.
So even though quantization is totally deterministic, the error can
be modeled as a random signal.
Specifically, under the following conditions, or assumptions, the
quantization error signal can be accurately modeled as a random
signal which is (1) uncorrelated from sample to sample and (2)
uncorrelated with the true signal x(n):
1. The quantization step is small compared to the signal amplitude.
2. Overflow is negligible.
3. The signal changes rapidly enough from sample to sample.
We will assume that a uniform quantizer is used (all quantization
steps are equal). If the quantization step is Q, then the quantization
error signal will vary between −Q/2 and Q/2:

−Q/2 ≤ e(n) ≤ Q/2.

In addition, if all values between −Q/2 and Q/2 are equally likely,
then the probability density function (PDF) p(e) that describes
e(n) is uniform between −Q/2 and Q/2.

Fig 5: Uniform PDF
QUANTIZATION ERROR: MATLAB EXAMPLE
L = 10000;
x = 2*rand(L,1)-1;        % original signal x(n), with -1 < x(n) < 1

Rfs = 2;                  % Rfs: full-scale range
b = 8;                    % number of bits
Q = Rfs/2^b;              % quantization step
y = Q*round(x/Q);         % quantized signal y(n)
e = y - x;                % quantization error signal e(n) = Q[x(n)] - x(n)
figure(1)
subplot(2,1,1)
stem(e(1:100),’.’)
title(’QUANTIZATION ERROR SIGNAL’)
subplot(2,1,2)
hist(e,10)
% compute histogram (with 10 bins)
colormap([1 1 1]*1)
title(’QUANTIZATION ERROR HISTOGRAM’)
axis([-Q Q 0 2*L/10])
print -deps sig_quant
Fig: Quantization error signal e(n), first 100 samples (top), and its
histogram (bottom); the histogram is approximately uniform over
(−Q/2, Q/2).
QUANTIZATION ERROR POWER
The variance of e(n) is used to quantify how bad the quantization
error is. As e(n) is zero mean, the variance of e(n) is just given by
the average value of e^2(n):

σ_e^2 := VAR{e(n)} = E{e^2(n)}.

E{e^2(n)} is called the power of the random signal e(n). We can
compute the average value of e^2(n) by the formula

E{e^2(n)} = ∫_{−Q/2}^{Q/2} e^2 p(e) de
          = ∫_{−Q/2}^{Q/2} e^2 (1/Q) de
          = (1/Q) · [e^3/3] evaluated from e = −Q/2 to e = Q/2
          = (1/3Q) [ (Q/2)^3 − (−Q/2)^3 ]
          = (1/3Q) [ Q^3/8 + Q^3/8 ]
          = Q^2/12.

Therefore

σ_e^2 = Q^2/12.
QUANTIZATION ERROR POWER
The following Matlab program numerically checks the correctness
of the expression σ_e^2 = Q^2/12.

L = 10000;
x = 2*rand(L,1)-1;        % original signal x(n), with -1 < x(n) < 1

Rfs = 2;                  % Rfs: full-scale range
b = 8;                    % number of bits
Q = Rfs/2^b;              % quantization step
y = Q*round(x/Q);         % quantized signal y(n)
e = y - x;                % quantization error signal e(n)

errvar_meas = mean(e.^2)  % E{e(n)^2} : measured value
errvar_model = Q^2/12     % E{e(n)^2} : value from model
We got the following result when we ran this program, which verifies
the formula for σ_e^2:
errvar_meas =
5.1035e-06
errvar_model =
5.0863e-06
Note that each time we run the program, we will get a different
value for errvar_meas because the simulation begins with a randomly
generated signal x(n).
EXPRESSING σ_e^2 IN TERMS OF THE NUMBER OF BITS

Let's find an expression for σ_e^2 in terms of the number of bits used.
Let R_fs denote the full-scale range of the quantizer, that is, the
range from the least value to the greatest value.
If b bits are used to represent x(n), then there are 2^b quantization
levels. Therefore

2^b · Q = R_fs

and

Q = R_fs / 2^b = 2^{−b} · R_fs

and so

σ_e^2 = Q^2 / 12 = 2^{−2b} R_fs^2 / 12.
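For example, with b = 8 and R_fs = 2 (the values used in the Matlab
check above), Q = 2^{−7}, so σ_e^2 = 2^{−16} · 4 / 12 ≈ 5.09 × 10^{−6},
which matches the value of errvar_model printed earlier.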
SIGNAL TO QUANTIZATION NOISE RATIO (SNR)

The signal to quantization noise ratio (SNR, or SQNR) is defined as

SNR := 10 log10( σ_x^2 / σ_e^2 ) = 20 log10( σ_x / σ_e )   in dB

where

σ_x^2 := VAR{x(n)},   σ_e^2 := VAR{e(n)}.

It is interesting to examine how the SNR depends on the number
of bits b. First, we can write

SNR = 10 log10 σ_x^2 − 10 log10 σ_e^2.

Using

σ_e^2 = 2^{−2b} R_fs^2 / 12

we have

SNR = 10 log10 σ_x^2 − 10 log10( 2^{−2b} R_fs^2 / 12 )
    = 10 log10 σ_x^2 − 10 log10 2^{−2b} − 10 log10( R_fs^2 / 12 )
    = 10 log10 σ_x^2 + 20 b log10 2 − 10 log10( R_fs^2 / 12 )

or

SNR = C + 20 b log10 2

where C does not depend on b. As 20 log10 2 ≈ 6.02, we have

SNR = C + 6.02 b.

That is the well-known result: for each additional bit, the SNR is
improved by about 6 dB.
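The 6 dB-per-bit behavior can be checked numerically. This is a
minimal sketch using a uniform test signal, as in the earlier examples.

Rfs = 2;
L = 100000;
x = 2*rand(L,1) - 1;                          % uniform test signal
for b = [6 8 10 12]
    Q = Rfs/2^b;                              % quantization step for b bits
    e = Q*round(x/Q) - x;                     % quantization error
    SNR = 10*log10(mean(x.^2)/mean(e.^2));    % measured SNR in dB
    fprintf('b = %2d    SNR = %5.1f dB\n', b, SNR)
end

Each additional bit should raise the measured SNR by about 6 dB.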