Pitch Estimation by Enhanced Super Resolution Determinator
By
Sunya Santananchai
Chia-Ho Ling
Objective
- Estimate the fundamental frequency of speech, F0, using the Enhanced Super Resolution F0 Determinator (eSRFD).
Introduction
- The fundamental frequency of speech is defined as the rate of glottal pulses generated by the vibration of the vocal folds.
- The pitch of speech is the perceptual correlate of the fundamental frequency.
- The fundamental frequency of speech is important for the prosodic features of stress and intonation.
Fundamental frequency determination algorithms (FDAs)
- FDAs determine the fundamental frequency of a speech waveform, that is, they analyse the pitch automatically.
- It is desirable to examine methods of fundamental frequency extraction that use radically different techniques.
Algorithms for determining F0
- Cepstrum-based F0 determinator (CFD) (Noll, 1969)
- Harmonic product spectrum (HPS) (Schroeder, 1968; Noll, 1970)
- Feature-based F0 tracker (FBFT) (Phillips, 1985)
- Parallel processing method (PP) (Gold & Rabiner, 1969)
- Integrated F0 tracking algorithm (IFTA) (Secrest & Doddington, 1983)
- Super resolution F0 determinator (SRFD) (Medan et al., 1991)
Enhanced Super Resolution F0 Determinator (eSRFD)
- Based on the SRFD method, which uses a waveform similarity metric: the normalized cross-correlation coefficient.
- Improves the performance of the SRFD algorithm by reducing the occurrence of errors.
The eSRFD algorithm
- Pass the speech waveform through a low-pass filter: the waveform is low-pass filtered before any further analysis.
- Each frame of filtered sample data is then processed by the silence detector.
- The signal is analysed frame by frame, in non-overlapping intervals of 6.4 ms.
- Each frame contains a set of samples $s_N = \{ s(i) \}$, indexed by $i$ over a range whose limits depend on $N_{\max}$.
- The frame is divided into three consecutive segments of $n$ samples each:

$$x_n = \{\, x(i) = s(i - n) \mid i \in \{1, \ldots, n\} \,\}$$
$$y_n = \{\, y(i) = s(i) \mid i \in \{1, \ldots, n\} \,\}$$
$$z_n = \{\, z(i) = s(i + n) \mid i \in \{1, \ldots, n\} \,\}$$

Figure: Analysis segments for the enhanced super resolution F0 determinator.

Normalized cross-correlation for ‘voiced’ frames
- If a frame of data is not classified as silent or unvoiced, candidate values for the fundamental period are obtained using the first normalized cross-correlation coefficient $p_{x,y}(n)$, in which only every $L$-th sample of each segment enters the sums:

$$p_{x,y}(n) = \frac{\sum_{j=1}^{\lfloor n/L \rfloor} x(jL)\, y(jL)}{\sqrt{\sum_{j=1}^{\lfloor n/L \rfloor} x(jL)^2 \cdot \sum_{j=1}^{\lfloor n/L \rfloor} y(jL)^2}}$$
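As an illustration of the coefficient above, here is a minimal Python sketch under stated assumptions. The function names `segments` and `norm_xcorr`, the `origin` argument (the list index standing in for sample index 0) and the `step` argument (standing in for L) are hypothetical and are not taken from the slides.

```python
import math

def segments(s, origin, n):
    """Return the three consecutive n-sample segments x_n, y_n, z_n.

    x(i) = s(i - n), y(i) = s(i), z(i) = s(i + n) for i = 1..n, where
    `origin` is the list index that plays the role of sample index 0.
    """
    x = [s[origin + i - n] for i in range(1, n + 1)]
    y = [s[origin + i] for i in range(1, n + 1)]
    z = [s[origin + i + n] for i in range(1, n + 1)]
    return x, y, z

def norm_xcorr(a, b, step=1):
    """Normalized cross-correlation of two equal-length segments.

    Only samples j*step for j = 1..floor(n/step) are used, mirroring
    the sums over x(jL) and y(jL).
    """
    n = len(a)
    idx = [j * step - 1 for j in range(1, n // step + 1)]  # 1-based to 0-based
    num = sum(a[k] * b[k] for k in idx)
    den = math.sqrt(sum(a[k] ** 2 for k in idx) * sum(b[k] ** 2 for k in idx))
    return num / den if den > 0 else 0.0  # guard against all-zero segments

# A sinusoid with a period of 8 samples: adjacent 8-sample segments match.
s = [math.sin(2 * math.pi * k / 8) for k in range(64)]
x, y, z = segments(s, 24, 8)
p_xy = norm_xcorr(x, y, step=2)
```

For a perfectly periodic signal the coefficient is close to 1 when n equals the period and close to -1 when n is half the period, which is why peaks of the coefficient are used as period candidates.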

Threshold for candidate values
- Candidate values of the fundamental period are obtained by locating peaks in the normalized cross-correlation coefficient whose value $p_{x,y}(n)$ exceeds a specified threshold, $T_{srfd}$.
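The peak search can be sketched as follows. This is only an illustration: the name `find_candidates`, the way the coefficient values are stored, and the threshold value 0.8 in the example are assumptions, since the slides do not state the value of the threshold. End points of the search range are ignored for simplicity.

```python
def find_candidates(p_values, n_min, threshold):
    """Return (n, p) pairs for local peaks of p_values above threshold.

    p_values[k] is assumed to hold the coefficient for period n = n_min + k.
    """
    candidates = []
    for k in range(1, len(p_values) - 1):
        v = p_values[k]
        is_peak = v >= p_values[k - 1] and v > p_values[k + 1]
        if is_peak and v > threshold:
            candidates.append((n_min + k, v))
    return candidates

# Hypothetical coefficient values for periods 20..27 (not measured data).
p = [0.1, 0.5, 0.95, 0.6, 0.2, 0.85, 0.9, 0.7]
found = find_candidates(p, 20, 0.8)
```

With these made-up values, the peaks at n = 22 and n = 26 both exceed the threshold and become candidates.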

A second normalized cross-correlation coefficient
- For a frame classified as ‘voiced’, that is, one with $p_{x,y}(n) > T_{srfd}$, the second normalized cross-correlation coefficient $p_{y,z}(n)$ is determined:

$$p_{y,z}(n) = \frac{\sum_{j=1}^{\lfloor n/L \rfloor} y(jL)\, z(jL)}{\sqrt{\sum_{j=1}^{\lfloor n/L \rfloor} y(jL)^2 \cdot \sum_{j=1}^{\lfloor n/L \rfloor} z(jL)^2}}$$

Candidate score for $p_{y,z}(n)$
- Candidates for which $p_{y,z}(n)$ exceeds the threshold $T_{srfd}$ are given a score of 2; the others are given a score of 1.
- If there are one or more candidates with a score of 2 in a frame, all candidates with a score of 1 are removed from the list of candidates.
- If there is only one candidate (with score 1 or 2), it is assumed to be the best estimate of the fundamental period of that frame.
- Otherwise, an optimal fundamental period is sought from the set of remaining candidates by calculating the coefficient $q(n_m)$ of each candidate:

$$q(n_m) = \frac{\sum_{j=1}^{n_M} s(j - n_M)\, s(j + n_m)}{\sqrt{\sum_{j=1}^{n_M} s(j - n_M)^2 \cdot \sum_{j=1}^{n_M} s(j + n_m)^2}}$$

- The first coefficient $q(n_1)$ is assumed to be the optimal value. If a subsequent $q(n_m)$ multiplied by 0.77 exceeds the current optimal value, that $q(n_m)$ becomes the optimal value.
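The scoring and selection rules above can be sketched in Python as follows. The function names, the `origin` argument (list index standing in for sample index 0) and the threshold 0.8 used in the example are hypothetical. Two further assumptions are made because the slides do not spell them out: candidates are visited in order of increasing period, and $n_M$ is taken to be the largest remaining candidate.

```python
import math

def score_and_prune(p_yz, threshold):
    """p_yz maps each candidate period n to p_{y,z}(n).

    Candidates above the threshold score 2, the rest score 1; score-1
    candidates are dropped whenever a score-2 candidate exists.
    """
    scores = {n: (2 if v > threshold else 1) for n, v in p_yz.items()}
    if 2 in scores.values():
        scores = {n: sc for n, sc in scores.items() if sc == 2}
    return scores

def q_coeff(s, origin, n_m, n_M):
    """q(n_m): normalized correlation of s(j - n_M) with s(j + n_m), j = 1..n_M."""
    num = e_a = e_b = 0.0
    for j in range(1, n_M + 1):
        a = s[origin + j - n_M]
        b = s[origin + j + n_m]
        num += a * b
        e_a += a * a
        e_b += b * b
    den = math.sqrt(e_a * e_b)
    return num / den if den > 0 else 0.0

def select_period(s, origin, candidates, factor=0.77):
    """Pick one period from the remaining candidates."""
    cands = sorted(candidates)  # assumption: increasing period order
    if len(cands) == 1:
        return cands[0]
    n_M = cands[-1]  # assumption: n_M is the largest remaining candidate
    best_n, best_q = cands[0], q_coeff(s, origin, cands[0], n_M)
    for n_m in cands[1:]:
        q = q_coeff(s, origin, n_m, n_M)
        if q * factor > best_q:  # a later candidate must win by a clear margin
            best_n, best_q = n_m, q
    return best_n

# Period-8 sinusoid: both 8 and 16 correlate perfectly, so the first is kept.
s8 = [math.sin(2 * math.pi * k / 8) for k in range(96)]
chosen = select_period(s8, 32, [8, 16])
```

The factor 0.77 biases the choice towards the earlier candidate: a later candidate replaces it only if its coefficient is clearly larger, which helps when a multiple of the true period correlates just as well.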
 In the case of only 1 candidate score 1 but no
candidate with score 2, the status of the frame is reconsidered depending on the state of the previous frame:
  - If the previous frame is ‘silent’, the current value is held and the decision depends on the next frame.
  - If the next frame is also ‘silent’, the current frame is considered ‘silent’.
  - Otherwise, the current frame is considered ‘voiced’ and the held value is taken as a good estimate for the current frame.
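The hold rule can be written as a small decision function. This is a sketch only: the function name and the string labels are hypothetical, and the branch for a previous frame that is not ‘silent’ is an assumption, since the slides leave that case implicit.

```python
def resolve_single_weak_candidate(prev_state, next_state):
    """Status of a frame whose only candidate has score 1.

    States are the strings 'silent' or 'voiced'.
    """
    if prev_state != 'silent':
        # Assumption: with a non-silent previous frame the candidate is accepted.
        return 'voiced'
    # Previous frame silent: the value is held until the next frame is known.
    if next_state == 'silent':
        return 'silent'
    return 'voiced'  # the held value is used as the estimate

status = resolve_single_weak_candidate('silent', 'voiced')
```

Because the decision needs the state of the next frame, an implementation along these lines has to run one frame behind the input.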

Modification: apply biasing to $p_{x,y}(n)$ and $p_{y,z}(n)$
- Biasing is applied if all of the following conditions hold:
  - The two previous frames were classified as ‘voiced’.
  - The F0 value of the previous frame is not being temporarily held.
  - The F0 of the previous frame, $f_0'$, is less than 7/4 of that of its preceding voiced frame, $f_0''$, and greater than 5/8 of $f_0''$.
- The biasing tends to increase the percentage of unvoiced regions of speech that are incorrectly classified as ‘voiced’.
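The three conditions can be checked with a single predicate. The sketch below only tests whether biasing applies; it does not show how the coefficients are biased, which the slides do not describe. The function and argument names are hypothetical.

```python
def biasing_applies(prev_voiced, prev2_voiced, prev_held, f0_prev, f0_prev2):
    """True when all three biasing conditions hold.

    f0_prev is the F0 of the previous frame; f0_prev2 is the F0 of the
    voiced frame preceding it.
    """
    in_range = (5 / 8) * f0_prev2 < f0_prev < (7 / 4) * f0_prev2
    return bool(prev_voiced and prev2_voiced and not prev_held and in_range)

# Hypothetical values: 120 Hz after 100 Hz lies between 62.5 Hz and 175 Hz.
ok = biasing_applies(True, True, False, 120.0, 100.0)
```

The range test rejects jumps of roughly an octave between consecutive voiced frames, so biasing is only applied while the F0 contour is changing smoothly.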

Calculate the fundamental period:
- The fundamental period for the frame is estimated by calculating $r_{x,y}(n)$:

$$r_{x,y}(n) = \frac{\sum_{j=1}^{n} x(j)\, y(j)}{\sqrt{\sum_{j=1}^{n} x(j)^2 \cdot \sum_{j=1}^{n} y(j)^2}}$$
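A possible use of this coefficient is sketched below. The search strategy is an assumption rather than something stated in the slides: $r_{x,y}(n)$ is evaluated at every sample in a small neighbourhood of a chosen candidate and the period with the largest value is kept. The names `r_xy`, `refine_period`, `origin` and `radius` are hypothetical.

```python
import math

def r_xy(s, origin, n):
    """Full-resolution normalized cross-correlation of x_n and y_n.

    x(j) = s(j - n) and y(j) = s(j) for j = 1..n; `origin` is the list
    index that plays the role of sample index 0.
    """
    num = e_x = e_y = 0.0
    for j in range(1, n + 1):
        x = s[origin + j - n]
        y = s[origin + j]
        num += x * y
        e_x += x * x
        e_y += y * y
    den = math.sqrt(e_x * e_y)
    return num / den if den > 0 else 0.0

def refine_period(s, origin, n0, radius):
    """Return the period within n0 +/- radius that maximizes r_xy."""
    lo = max(1, n0 - radius)
    return max(range(lo, n0 + radius + 1), key=lambda n: r_xy(s, origin, n))

# Sinusoid with a period of 10 samples; start from a slightly wrong guess.
s = [math.sin(2 * math.pi * k / 10) for k in range(100)]
period = refine_period(s, 40, 9, 2)
```

Unlike the candidate search, every sample of the segments is used here, so the estimate is resolved to a single sample.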
Implementation
- This report covers the eSRFD algorithm and a MATLAB (ver 7.2b) program written to follow it.
The Result
Conclusion
- The acoustic correlate of pitch is the fundamental frequency of speech.
- The enhanced SRFD (eSRFD) improves the performance of the SRFD by reducing the occurrence of errors in the extraction of the fundamental frequency [1].
- Errors still occur in the results, and how often they occur depends on the kind of speech waveform.
- In addition, the results of this project contain more errors than Paul Bagshaw's results [2], because of problems in turning the eSRFD algorithm from a design into a working program.
References
[1] Bagshaw, Paul Christopher (1994). Automatic prosodic analysis for computer aided pronunciation teaching. The University of Edinburgh.
[2] Bagshaw, Paul C., Hiller, S. M., Jack, Mervyn A. (1993). Enhanced pitch tracking and the processing of f0 contours for computer aided intonation teaching. In Proc. Eurospeech '93, Berlin, volume 2, pages 1003-1006. International Speech Communication Association.