Shiryaev-Roberts - ORT Braude College

The Fourth Quality Conference in the Galilee: "Quality – Theory and Practice"
May 12, 2011
On Optimality of the Shiryaev-Roberts Procedure for
Retrospective and Sequential Change Point Detections

Gregory Gurevich¹ and Albert Vexler²

¹ The Department of Industrial Engineering and Management, SCE - Shamoon College of Engineering, Beer-Sheva 84100, Israel
² Department of Biostatistics, The New York State University at Buffalo, Buffalo, NY 14214, USA
This paper demonstrates that criteria for which a given test is optimal can be derived from the structure of the test itself, and hence almost any reasonable test is optimal with respect to some criterion. In order to establish this conclusion, the principal idea of the fundamental lemma of Neyman and Pearson is applied to interpret the goodness of tests, and both retrospective and sequential change point detection are considered in the context of the proposed technique. We demonstrate that the Shiryaev-Roberts approach applied to the retrospective change point detection yields the averagely most powerful procedure. We also prove non-asymptotic optimality properties of the Shiryaev-Roberts sequential procedure.
1. INTRODUCTION
Without loss of generality, we can say that the principal idea of the proof of the fundamental lemma of Neyman and Pearson is based on the trivial inequality

$$(A - B)\left(I\{A \ge B\} - \delta\right) \ge 0 \tag{1.1}$$

for all $A, B$ and $\delta \in [0,1]$ ($I\{\cdot\}$ is the indicator function). Thus, for example, if we would like to classify i.i.d. observations $X_1, X_2, \ldots, X_n$ regarding the hypotheses

$H_0$: $X_1$ is from a density function $f_0$ versus $H_1$: $X_1$ is from a density function $f_1$,
then the likelihood ratio test (i.e., we should reject $H_0$ if $\prod_{i=1}^{n} f_1(X_i)/f_0(X_i) \ge C$, where $C$ is a threshold) is uniformly most powerful. This classical proposition directly follows from taking the expected value under $H_0$ of (1.1) with $B = C$ and $A = \prod_{i=1}^{n} f_1(X_i)/f_0(X_i)$, where $\delta$ is considered as any decision rule based on the observed sample:
  n f1  X i 
  n f1  X i 

  n f1  X i 


 


E H 0 
 C   I 
 C   E H 0  
 C  
 i 1 f  X 
,
  i 1 f 0  X i 



f
X


i

1
0
i
0
i











n
 ... 
i 1
 n
f1 xi   n f1 xi 
I 
 C  f 0 xi dx1 ...dxn
f 0 xi   i 1 f 0 xi 
 i 1
 n f1 xi 
 n
 C  ... I 
 C  f 0 xi dx1 ...dxn
 i 1 f 0 xi 
 i 1
n
  ... 
i 1
f 1  xi  n
  f 0 xi dx1 ...dxn
f 0 xi  i 1
n
 C  ...   f 0 xi dx1 ...dxn .
i 1
Therefore
 n f1  X i 

 n f1  X i 

PH1  
 C   CPH 0  
 C   PH1   1  CPH 0   1 .
 i 1 f 0  X i 

 i 1 f 0  X i 

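As a quick numerical illustration (a sketch, not part of the original argument; the densities $f_0 = N(0,1)$, $f_1 = N(1,1)$ and the competing rule are assumptions made here purely for demonstration), this Neyman-Pearson-type inequality can be checked by simulation:

```python
import math
import random

random.seed(1)

def log_lr(xs):
    # log of prod_i f1(X_i)/f0(X_i) for f0 = N(0,1), f1 = N(1,1):
    # log(f1/f0)(x) = x - 1/2
    return sum(x - 0.5 for x in xs)

def simulate(mean, C, n_obs=5, reps=20000):
    """Estimate rejection probabilities of the likelihood ratio test and of an
    arbitrary competing rule, with observations drawn i.i.d. from N(mean, 1)."""
    rej_lr = rej_arb = 0
    for _ in range(reps):
        xs = [random.gauss(mean, 1.0) for _ in range(n_obs)]
        if log_lr(xs) >= math.log(C):
            rej_lr += 1
        if xs[0] > 1.0:  # the arbitrary competing decision rule
            rej_arb += 1
    return rej_lr / reps, rej_arb / reps

C = 1.0
p1_lr, p1_arb = simulate(mean=1.0, C=C)   # under H1
p0_lr, p0_arb = simulate(mean=0.0, C=C)   # under H0
lhs = p1_lr - C * p0_lr    # left side, likelihood ratio test
rhs = p1_arb - C * p0_arb  # right side, competing rule
print(lhs >= rhs)
```

The estimated left side dominates the right side by a wide margin, as the inequality guarantees for any competing rule.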
However, in the case when we have a given test-statistic $S_n$ instead of the likelihood ratio $\prod_{i=1}^{n} f_1(X_i)/f_0(X_i)$, the inequality (1.1) yields

$$\left(S_n - C\right)\left(I\{S_n \ge C\} - \delta\right) \ge 0$$

for any decision rule $\delta \in [0,1]$. Therefore, the test-rule that rejects $H_0$ when $S_n \ge C$ also has an optimal property following from the application of (1.1). The problem is to obtain a reasonable interpretation of that optimality.
The present paper proposes that the structure of any given retrospective or sequential test can substantiate type-(1.1) inequalities that provide optimal properties of that test. Although this approach can easily be applied to demonstrate an optimality of tests, presenting that optimality in terms of the classical operating characteristics (e.g., the power/significance level of tests) is a complicated issue.
In order to present the proposed methodology, we consider different kinds of hypothesis testing. Firstly, we consider the retrospective change point problem. We demonstrate that the Shiryaev-Roberts approach applied to the change point detection yields the averagely most powerful procedure. Secondly, we consider sequential testing. Here the focus on the inequality (1.1) leads us to a non-asymptotic optimality of a well-known sequential procedure, for which only an asymptotic optimality result has been proved in the literature.
2. RETROSPECTIVE CHANGE POINT DETECTION
In this section we focus on a sequence of previously obtained independent observations in an attempt to detect whether they all share the same distribution. Note that, commonly, the change point problem is evaluated under the assumption that if a change in the distribution did occur, then it is unique and the observations after the change all have the same distribution, which differs from the distribution of the observations before the change.
Let $X_1, \ldots, X_n$ be independent observations with density functions $g_1, \ldots, g_n$, respectively. We want to test the null hypothesis

$$H_0: g_i = f_0 \ \text{ for all } i = 1, \ldots, n$$

versus the alternative

$$H_1: g_1 = \ldots = g_{\nu - 1} = f_0, \quad g_{\nu} = \ldots = g_n = f_1,$$

where the change point parameter $\nu$ is unknown. The maximum likelihood estimation of $\nu$ applied to the likelihood ratio $\prod_{i=\nu}^{n} f_1(X_i)/f_0(X_i)$ leads us to the CUSUM test, well accepted in the change point literature: we should reject $H_0$ if

$$\max_{1 \le k \le n} \prod_{i=k}^{n} \frac{f_1(X_i)}{f_0(X_i)} \ge C,$$

where $C > 0$ is a threshold.
Alternatively, following Gurevich and Vexler (2010), the test based on the Shiryaev-Roberts approach has the following form: we reject $H_0$ if

$$\frac{1}{n} R_n = \frac{1}{n}\sum_{k=1}^{n}\prod_{i=k}^{n} \frac{f_1(X_i)}{f_0(X_i)} \ge C. \tag{2.1}$$
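Both retrospective statistics are simple functions of the per-observation likelihood ratios. The following sketch assumes, purely for illustration, $f_0 = N(0,1)$ and $f_1 = N(1,1)$, so that $f_1(x)/f_0(x) = e^{x - 1/2}$:

```python
import math
import random

def lr(x):
    # likelihood ratio f1(x)/f0(x) for f0 = N(0,1), f1 = N(1,1)
    return math.exp(x - 0.5)

def cusum_stat(xs):
    # CUSUM statistic: max_{1<=k<=n} prod_{i=k}^{n} f1(X_i)/f0(X_i)
    n = len(xs)
    return max(math.prod(lr(x) for x in xs[k:]) for k in range(n))

def sr_stat(xs):
    # Shiryaev-Roberts statistic of (2.1):
    # (1/n) sum_{k=1}^{n} prod_{i=k}^{n} f1(X_i)/f0(X_i)
    n = len(xs)
    return sum(math.prod(lr(x) for x in xs[k:]) for k in range(n)) / n

random.seed(0)
no_change = [random.gauss(0.0, 1.0) for _ in range(20)]
with_change = no_change[:10] + [random.gauss(1.0, 1.0) for _ in range(10)]
print(sr_stat(no_change), sr_stat(with_change))
```

Since every summand in $R_n$ is positive, $R_n$ always dominates the CUSUM maximum over the same sample, i.e. $R_n \ge \max_{1 \le k \le n} \prod_{i=k}^{n} f_1(X_i)/f_0(X_i)$.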
Consider an optimal meaning of the test (2.1). Let $P_k$ and $E_k$ ($k = 0, \ldots, n$) respectively denote probability and expectation conditional on $\nu = k$ (the case $k = 0$ corresponds to $H_0$). By virtue of (1.1) with $A = R_n/n$ and $B = C$, it follows that

$$\left(\frac{1}{n} R_n - C\right)\left(I\left\{\frac{1}{n} R_n \ge C\right\} - \delta\right) \ge 0. \tag{2.2}$$

Without loss of generality, we assume that $\delta \in [0,1]$ is any decision rule based on the observed sample $X_1, \ldots, X_n$ such that if $\delta = 1$, then we reject $H_0$. Since
 n f1  X i  
1
1 n
E0 Rn    E0  
 
n
n k 1  i k f 0  X i  
n
n
f1 xi  n
1 n
   ... 
  f 0 xi  dxi
n k 1
i  k f 0  xi  i 1
i 1
k 1
n
n
1 n
   ... I   1 f 0 xi  f1 xi  dxi
n k 1
i 1
i k
i 1
1 n
  Pk   1
n k 1
derivation of
H 0 -expectation of the left and right side of (2.2) directly provides the following
proposition.
Proposition 2.1 The test (2.1) is the averagely most powerful test, i.e.,

$$\frac{1}{n}\sum_{k=1}^{n} P_k\left(\frac{1}{n} R_n \ge C\right) - C\,P_0\left(\frac{1}{n} R_n \ge C\right) \ge \frac{1}{n}\sum_{k=1}^{n} P_k(\delta = 1) - C\,P_0(\delta = 1),$$

for any decision rule $\delta \in [0,1]$ based on the observations $X_1, X_2, \ldots, X_n$.
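The optimality claim rests on the pathwise inequality (2.2), which holds for every realization of the data and every randomized rule $\delta$. A direct numerical check (the $N(0,1)$ and $N(1,1)$ densities are illustrative assumptions, not taken from the paper):

```python
import math
import random

random.seed(2)

def sr_over_n(xs):
    # (1/n) R_n with f0 = N(0,1), f1 = N(1,1)
    n = len(xs)
    return sum(math.prod(math.exp(x - 0.5) for x in xs[k:]) for k in range(n)) / n

C = 2.0
ok = True
for _ in range(1000):
    # data with or without a mean shift; the inequality holds pathwise either way
    xs = [random.gauss(random.choice([0.0, 1.0]), 1.0) for _ in range(8)]
    a = sr_over_n(xs)
    delta = random.random()  # an arbitrary randomized decision rule in [0, 1]
    ok = ok and (a - C) * ((1.0 if a >= C else 0.0) - delta) >= 0.0
print(ok)  # prints True: (2.2) holds for every sampled path
```

The check never fails because (2.2) is an algebraic identity about each realization, not a statement that holds only on average.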
3. SEQUENTIAL SHIRYAEV-ROBERTS CHANGE POINT DETECTION
In this section, we suppose that independent observations $X_1, X_2, \ldots$ are surveyed sequentially. Let $X_1, X_2, \ldots, X_{\nu - 1}$ each be distributed according to a density function $f_0$, whereas $X_{\nu}, X_{\nu + 1}, \ldots$ have a density function $f_1$, where $1 \le \nu \le \infty$ is unknown. The case $\nu = \infty$ corresponds to the situation when all observations are distributed according to $f_0$. The formulation of sequential change point detection conforms to raising an alarm as soon as possible after the change while avoiding false alarms. Efficient detection methods for this stated problem are based on CUSUM and Shiryaev-Roberts stopping rules. The CUSUM policy is: we stop sampling of the $X$-s and report that a change in the distribution of $X$ has been detected at the first time $n \ge 1$ such that

$$\max_{1 \le k \le n} \prod_{i=k}^{n} \frac{f_1(X_i)}{f_0(X_i)} \ge C,$$

for a given threshold $C$.
The Shiryaev-Roberts procedure is defined by the stopping time

$$N_C = \inf\left\{n \ge 1 : R_n \ge C\right\}, \tag{3.1}$$

where the Shiryaev-Roberts test-statistic $R_n$ is

$$R_n = \sum_{k=1}^{n}\prod_{i=k}^{n} \frac{f_1(X_i)}{f_0(X_i)}. \tag{3.2}$$
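For computation, the statistic (3.2) satisfies the recursion $R_n = (1 + R_{n-1})\, f_1(X_n)/f_0(X_n)$ with $R_0 = 0$, so the stopping time (3.1) can be evaluated online with constant work per observation. A sketch, again under the illustrative assumption $f_0 = N(0,1)$, $f_1 = N(1,1)$ (a hypothetical setting, not from the paper):

```python
import math
import random

def sr_stopping_time(stream, C):
    """Shiryaev-Roberts rule (3.1): stop at the first n >= 1 with R_n >= C.

    R_n is updated via the recursion R_n = (1 + R_{n-1}) * f1(X_n)/f0(X_n),
    which is algebraically identical to definition (3.2)."""
    r = 0.0
    for n, x in enumerate(stream, start=1):
        r = (1.0 + r) * math.exp(x - 0.5)  # f1(x)/f0(x) for N(1,1) vs N(0,1)
        if r >= C:
            return n
    return None  # no alarm raised within the observed stream

random.seed(3)
nu = 50  # a hypothetical change point
stream = ([random.gauss(0.0, 1.0) for _ in range(nu - 1)]
          + [random.gauss(1.0, 1.0) for _ in range(200)])
print(sr_stopping_time(stream, C=100.0))
```

With a post-change drift of $E \log(f_1/f_0) = 1/2$ per observation, the alarm typically follows the change by roughly $(\log C)/0.5$ observations.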
The sequential CUSUM detection procedure has a non-asymptotic optimal property (Moustakides, 1986). In contrast, for the Shiryaev-Roberts procedure only an asymptotic (as $C \to \infty$) optimality has been shown (Pollak, 1985). (Although Yakir (1997) attempted to prove a non-asymptotic optimality of the Shiryaev-Roberts procedure, Mei (2006) has pointed out the inaccuracy of Yakir's proofs.) In order to demonstrate optimality of the Shiryaev-Roberts detection scheme, Pollak (1985) has proved an asymptotic closeness of the expected loss using a Bayes rule for the considered change problem with a known prior distribution of $\nu$ to that using the rule $N_C$. However, in the context of simple application of the proposed methodology, the procedure (3.1) itself declares loss functions for which that detection policy is optimal. That is, setting $A = R_{\min(N_C, n)}$ and $B = C$ in (1.1) leads to

$$\left(R_{\min(N_C, n)} - C\right)\left(I\left\{R_{\min(N_C, n)} \ge C\right\} - \delta\right) \ge 0, \tag{3.3}$$

for all $\delta \in [0,1]$. Since $\left\{R_{\min(N_C, n)} \ge C\right\} = \left\{N_C \le n\right\}$, the left side of (3.3) equals

$$\sum_{k=1}^{n}\left(R_k - C\right)\left(1 - \delta\right) I\{N_C = k\} + \left(C - R_n\right)\delta\, I\{N_C > n\} \ge 0. \tag{3.4}$$
It is clear that (3.4) can be utilized to present an optimal property of the detection rule $N_C$. However, here, for simplicity, noting that every summand on the left side of the inequality (3.4) is non-negative, we focus only on

$$\left(C - R_n\right)\delta\, I\{N_C > n\} \ge 0.$$

Thus, if $\tau$ is a stopping time (assuming that for all $k$, the event $\{\tau \le k\}$ is measurable in the $\sigma$-algebra generated by $X_1, \ldots, X_k$) and $\delta$ is defined by $\delta = I\{\tau \le n\}$, then

$$E\left[\left(C - R_n\right) I\{\tau \le n,\; N_C > n\}\right] \ge 0, \tag{3.5}$$

where $E$ (and, below, $P$) without an index denote expectation and probability in the no-change case $\nu = \infty$.
By virtue of the definition (3.2), we obtain from (3.5) that

$$C\left[P(N_C > n) - P\left(\min(\tau, N_C) > n\right)\right] - \sum_{k=1}^{n}\left[P_k(N_C > n) - P_k\left(\min(\tau, N_C) > n\right)\right] \ge 0, \tag{3.6}$$

using that $E\left[R_n I\{A\}\right] = \sum_{k=1}^{n} P_k(A)$ for any event $A$ determined by $X_1, \ldots, X_n$, and that $I\{\tau \le n, N_C > n\} = I\{N_C > n\} - I\{\min(\tau, N_C) > n\}$. Therefore,

$$C\sum_{n=1}^{\infty}\left[P(N_C > n) - P\left(\min(\tau, N_C) > n\right)\right] - \sum_{n=1}^{\infty}\sum_{k=1}^{n}\left[P_k(N_C > n) - P_k\left(\min(\tau, N_C) > n\right)\right] \ge 0, \tag{3.7}$$

where

$$\sum_{n=1}^{\infty}\sum_{k=1}^{n}\left[P_k(N_C > n) - P_k\left(\min(\tau, N_C) > n\right)\right] = \sum_{k=1}^{\infty}\sum_{n=k}^{\infty}\left[P_k(N_C > n) - P_k\left(\min(\tau, N_C) > n\right)\right]$$

$$= \sum_{k=1}^{\infty}\left[E_k\left[\left(N_C - k\right)^{+}\right] - E_k\left[\left(\min(\tau, N_C) - k\right)^{+}\right]\right], \tag{3.8}$$

with $a^{+} = a\,I\{a > 0\}$. Since $N_C \ge 1$ and $\min(\tau, N_C) \ge 1$, the first sum in (3.7) equals $E(N_C) - E\left(\min(\tau, N_C)\right)$. Let us summarize (3.7) with (3.8) in the next proposition.
Proposition 3.1 The Shiryaev-Roberts stopping time (3.1) satisfies

$$\sum_{n=1}^{\infty} E_n\left[\left(N_C - n\right)^{+}\right] - C\,E(N_C) \le \sum_{n=1}^{\infty} E_n\left[\left(\min(\tau, N_C) - n\right)^{+}\right] - C\,E\left(\min(\tau, N_C)\right),$$

for any stopping time $\tau$.
Note that $E(\tau)$ corresponds to the average run length to false alarm of a stopping rule $\tau$, and hence large values of $E(\tau)$ are privileged, whereas small values of $E_n\left[(\tau - n)^{+}\right]$ are also preferable (because $E_n\left[(\tau - n)^{+}\right]$ relates to the delay of the sequential detection in the case $\nu = n$). Obviously, if a stopping time $\tau$ detects the change faster than $N_C$ does, then $\min(\tau, N_C) = \tau$.
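The two ingredients of this trade-off, the average run length to false alarm $E(N_C)$ and the post-change delay, can be estimated by simulation. A rough sketch (the $N(0,1) \to N(1,1)$ densities, threshold, and run counts are all assumptions made here for illustration):

```python
import math
import random

random.seed(4)
C = 50.0
CAP = 10_000  # truncation horizon for a single simulated run

def sr_alarm(nu):
    """Alarm time of the Shiryaev-Roberts rule for a stream changing at nu."""
    r = 0.0
    for n in range(1, CAP + 1):
        x = random.gauss(1.0 if n >= nu else 0.0, 1.0)
        r = (1.0 + r) * math.exp(x - 0.5)  # f1/f0 for N(1,1) vs N(0,1)
        if r >= C:
            return n
    return CAP  # truncated run

# average run length to false alarm, E(N_C), estimated with no change (nu = infinity)
arl = sum(sr_alarm(nu=CAP + 1) for _ in range(200)) / 200

# average detection delay E_25[(N_C - 25)^+] when the change occurs at nu = 25
delay = sum(max(sr_alarm(nu=25) - 25, 0) for _ in range(200)) / 200
print(arl, delay)
```

Raising $C$ increases both the run length to false alarm and the detection delay; that is precisely the trade-off balanced by the loss in Proposition 3.1.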
4. REMARK AND CONCLUSION
In the context of statistical testing, when distribution functions of observations are known up
to parameters, one of the approaches dealing with estimation of unknown parameters is the
mixture technique. To be specific, consider the change point problem introduced in Section 2.
Assume that the density function $f_1(u) = f_1(u; \theta)$, where $\theta$ is an unknown parameter. In this case, we can transform the test (2.1) in accordance with the mixture methodology (e.g., Krieger et al., 2003). That is, we have to choose a prior $\pi$ and pretend that $\theta \sim \pi$. Thus, the mixture Shiryaev-Roberts statistic has the form

$$R_n^{(1)} = \frac{1}{n}\sum_{k=1}^{n}\int \prod_{i=k}^{n}\frac{f_1(X_i; u)}{f_0(X_i)}\, d\pi(u)$$
and hence the consequential transformation of the test (2.1) has the following property:

$$\int\left[\frac{1}{n}\sum_{k=1}^{n} P_k\left(R_n^{(1)} \ge C \,\Big|\, X_j,\ j \ge k,\ \text{are from } f_1(\cdot\,; u)\right)\right] d\pi(u) - C\,P_0\left(R_n^{(1)} \ge C\right)$$

$$\ge \int\left[\frac{1}{n}\sum_{k=1}^{n} P_k\left(\delta \text{ declares rejection of } H_0 \,\Big|\, X_j,\ j \ge k,\ \text{are from } f_1(\cdot\,; u)\right)\right] d\pi(u) - C\,P_0\left(\delta \text{ declares rejection of } H_0\right),$$

for any decision rule $\delta \in [0,1]$ based on the observations $X_1, \ldots, X_n$. In this case the optimality formulated in Proposition 2.1 is integrated over values of the unknown parameter $\theta$.
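Numerically, the integral over the prior can be replaced by a weighted sum (quadrature or Monte Carlo). A sketch assuming, for illustration only, $f_0 = N(0,1)$, $f_1(\cdot\,; \theta) = N(\theta, 1)$ and a discrete uniform prior on a small grid of $\theta$ values (all of these choices are hypothetical, not from the paper):

```python
import math
import random

def mixture_sr_stat(xs, thetas, weights):
    """Mixture Shiryaev-Roberts statistic R_n^(1): the integral over the prior pi
    is approximated by a weighted sum over a grid of theta values.

    Assumes f0 = N(0,1) and f1(x; theta) = N(theta, 1), so that
    f1(x; theta)/f0(x) = exp(theta * x - theta**2 / 2)."""
    n = len(xs)
    total = 0.0
    for k in range(n):
        total += sum(
            w * math.prod(math.exp(t * x - t * t / 2.0) for x in xs[k:])
            for t, w in zip(thetas, weights)
        )
    return total / n

# a hypothetical discrete prior: theta uniform on {0.5, 1.0, 1.5, 2.0}
thetas = [0.5, 1.0, 1.5, 2.0]
weights = [0.25] * len(thetas)

random.seed(5)
xs = ([random.gauss(0.0, 1.0) for _ in range(10)]
      + [random.gauss(1.5, 1.0) for _ in range(10)])
print(mixture_sr_stat(xs, thetas, weights))
```

With a point-mass prior the mixture statistic reduces to the plain statistic of (2.1), which provides a simple consistency check for the implementation.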
Finally, note that for the case where one wishes to investigate the optimality of a given test in a context that is not fixed in advance, inequalities similar to (1.1) can be obtained by focusing on the structure of the test. These inequalities report an optimal property of the test. However, translating that property into terms of the classical measures of the quality of tests remains a complicated issue.