Simple recognition of an object at a given position using video images

Jan Behrens
2014-06-02
This paper defines a simple but effective algorithm for detecting a single object at a given position using video images. While the algorithm is comparatively easy to describe, it provides only minimal robustness against changes in lighting conditions and is thus only suitable for use in controlled environments.
The algorithm consists of two parts: the learning part and the recognition
part. Input to the learning part is the following data:
• the dimensions $D_X$ and $D_Y$ of all video images (i.e. the image resolution in the x-axis and y-axis),
• a set $E$ of neutral images $(E_1, E_2, \ldots, E_{N_E})$ (e.g. of an empty room), where the pixel values at coordinates $x, y$ and color channel $c$ are given as $E_{n,x,y,c} \in \mathbb{R}$,
• a number $P \ge 1$ of positions to distinguish,
• for every position $p \in \{1, \ldots, P\}$ a set $T_p$ of training images $(T_{p,1}, T_{p,2}, \ldots, T_{p,N_{T_p}})$, where the pixel values for $x, y, c$ are given as $T_{p,n,x,y,c} \in \mathbb{R}$,
• optionally a set $F$ of images $(F_1, F_2, \ldots, F_{N_F})$ that show the object in an invalid state, where the pixel values for $x, y, c$ are given as $F_{n,x,y,c} \in \mathbb{R}$,

with $x \in \{1, \ldots, D_X\}$, $y \in \{1, \ldots, D_Y\}$, $c \in \{\text{red}, \text{green}, \text{blue}\}$.
The learning part of the algorithm will create $(P + 1) \cdot D_X \cdot D_Y \cdot 3$ values as output: $\overline{E}_{x,y,c} \in \mathbb{R}$ and $C_{p,x,y,c} \in \mathbb{R}$ with $p \in \{1, \ldots, P\}$. To create these values, we proceed as explained in the following steps.
For every pixel, we calculate the average neutral value:
\[
\overline{E}_{x,y,c} := \frac{1}{N_E} \sum_{n=1}^{N_E} E_{n,x,y,c}
\]
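As an illustration, the averaging step can be sketched in NumPy (a minimal sketch, assuming the neutral images are stacked into an array of shape $(N_E, D_Y, D_X, 3)$; the array names are ours, not the paper's):

```python
import numpy as np

# Hypothetical stack of N_E = 2 neutral images, each 1x1 pixels, 3 channels.
E = np.array([
    [[[10.0, 20.0, 30.0]]],   # E_1
    [[[30.0, 40.0, 50.0]]],   # E_2
])

# Average neutral value per pixel and color channel:
# E_bar_{x,y,c} = (1 / N_E) * sum_n E_{n,x,y,c}
E_bar = E.mean(axis=0)

print(E_bar)  # [[[20. 30. 40.]]]
```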
We then subtract these average neutral pixel values from the other input images:
\[
T'_{p,n,x,y,c} := T_{p,n,x,y,c} - \overline{E}_{x,y,c}
\]
\[
F'_{n,x,y,c} := F_{n,x,y,c} - \overline{E}_{x,y,c}
\]
We calculate the average pixel value differences for every position $p \in \{1, \ldots, P\}$ per pixel:
\[
\overline{T}_{p,x,y,c} := \frac{1}{N_{T_p}} \sum_{n=1}^{N_{T_p}} T'_{p,n,x,y,c}
\]
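Continuing the sketch, the subtraction and the per-position average are two broadcasted NumPy operations (illustrative data and names, ours not the paper's):

```python
import numpy as np

E = np.array([[[[10.0, 20.0, 30.0]]],
              [[[30.0, 40.0, 50.0]]]])      # neutral images
T_p = np.array([[[[25.0, 35.0, 45.0]]],
                [[[35.0, 45.0, 55.0]]]])    # training images for position p

E_bar = E.mean(axis=0)                      # average neutral image

# T'_{p,n} := T_{p,n} - E_bar  (broadcast over the stack of images)
T_prime = T_p - E_bar

# Average pixel value difference for position p:
T_bar_p = T_prime.mean(axis=0)

print(T_bar_p)  # [[[10. 10. 10.]]]
```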
Now we remove the constant component of every averaged image $\overline{T}_p$ per color channel $c$:
\[
\widetilde{T}_{p,x,y,c} := \overline{T}_{p,x,y,c} - \frac{1}{D_X D_Y} \sum_{i=1}^{D_X} \sum_{j=1}^{D_Y} \overline{T}_{p,i,j,c}
\]
We do the same for the pixel value differences $T'_{p,n}$ and $F'_n$ of all training (and invalid) images:
\[
\widetilde{T}_{p,n,x,y,c} := T'_{p,n,x,y,c} - \frac{1}{D_X D_Y} \sum_{i=1}^{D_X} \sum_{j=1}^{D_Y} T'_{p,n,i,j,c}
\]
\[
\widetilde{F}_{n,x,y,c} := F'_{n,x,y,c} - \frac{1}{D_X D_Y} \sum_{i=1}^{D_X} \sum_{j=1}^{D_Y} F'_{n,i,j,c}
\]
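Removing the constant component amounts to subtracting the per-channel spatial mean, after which every channel sums to zero over the image (sketch with an illustrative 1×2-pixel image):

```python
import numpy as np

# Hypothetical averaged difference image, shape (D_Y, D_X, 3) = (1, 2, 3).
T_bar_p = np.array([[[4.0, 8.0, 0.0],
                     [6.0, 2.0, 0.0]]])

# Per-channel mean over all D_X * D_Y pixels:
channel_mean = T_bar_p.mean(axis=(0, 1))    # shape (3,)

# T~_p := T_bar_p minus its constant component per color channel;
# each channel of the result sums to zero.
T_tilde_p = T_bar_p - channel_mean

print(T_tilde_p)
```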
Now, for every position $p \in \{1, \ldots, P\}$, we try to find values $W_{p,x,y,c} \in \mathbb{R}^+$ such that the following values $V_p$ get reasonably small ($V_p \approx 0$):
\[
\begin{aligned}
V_p := {} & \frac{1}{N_{T_p}} \sum_{n=1}^{N_{T_p}} \left( \max\!\left(0,\; 1 - \frac{\displaystyle\sum_{i=1}^{D_X} \sum_{j=1}^{D_Y} \sum_{c} \widetilde{T}_{p,n,i,j,c}\, \overline{T}_{p,i,j,c}\, W_{p,i,j,c}}{\displaystyle\sum_{i=1}^{D_X} \sum_{j=1}^{D_Y} \sum_{c} \widetilde{T}_{p,i,j,c}\, \overline{T}_{p,i,j,c}\, W_{p,i,j,c}} \right) \right)^{\!2} \\[1ex]
& + \frac{1}{P-1} \sum_{q \in \{1,\ldots,P\} \setminus \{p\}} \frac{1}{N_{T_q}} \sum_{n=1}^{N_{T_q}} \left( \max\!\left(0,\; \frac{\displaystyle\sum_{i=1}^{D_X} \sum_{j=1}^{D_Y} \sum_{c} \widetilde{T}_{q,n,i,j,c}\, \overline{T}_{p,i,j,c}\, W_{p,i,j,c}}{\displaystyle\sum_{i=1}^{D_X} \sum_{j=1}^{D_Y} \sum_{c} \widetilde{T}_{p,i,j,c}\, \overline{T}_{p,i,j,c}\, W_{p,i,j,c}} - \frac{1}{2} \right) \right)^{\!2} \\[1ex]
& + \frac{1}{N_F} \sum_{n=1}^{N_F} \left( \max\!\left(0,\; \frac{\displaystyle\sum_{i=1}^{D_X} \sum_{j=1}^{D_Y} \sum_{c} \widetilde{F}_{n,i,j,c}\, \overline{T}_{p,i,j,c}\, W_{p,i,j,c}}{\displaystyle\sum_{i=1}^{D_X} \sum_{j=1}^{D_Y} \sum_{c} \widetilde{T}_{p,i,j,c}\, \overline{T}_{p,i,j,c}\, W_{p,i,j,c}} - \frac{1}{2} \right) \right)^{\!2}
\end{aligned}
\]
where
\[
\max(a, b) = \begin{cases} a & \text{if } a \ge b \\ b & \text{if } a < b \end{cases}
\qquad
\max(0, x) = \begin{cases} 0 & \text{if } x \le 0 \\ x & \text{if } x > 0 \end{cases}
\]
If P = 1, we omit the second summand (replace it with 0), and if NF = 0,
we omit the third summand (replace it with 0), avoiding the division by zero.
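The paper does not prescribe how to minimize $V_p$. As a hedged sketch, the simplified case $P = 1$, $N_F = 0$ (where only the first summand remains) can be written as a plain function of the weights and handed to any optimizer that keeps $W$ positive; the function and array names below are ours:

```python
import numpy as np

def v_p(W, T_tilde_train, T_tilde_p, T_bar_p):
    """First summand of V_p (the case P = 1, N_F = 0).

    W             -- weights W_p, shape (D_Y, D_X, 3), expected > 0
    T_tilde_train -- differences T~_{p,n} per training image, shape (N, D_Y, D_X, 3)
    T_tilde_p     -- averaged difference image T~_p, shape (D_Y, D_X, 3)
    T_bar_p       -- averaged difference image T_bar_p, shape (D_Y, D_X, 3)
    """
    denom = np.sum(T_tilde_p * T_bar_p * W)
    num = np.sum(T_tilde_train * T_bar_p * W, axis=(1, 2, 3))
    return np.mean(np.maximum(0.0, 1.0 - num / denom) ** 2)

# If every training difference equals the average difference, each ratio
# is exactly 1 and the loss vanishes:
T_bar_p = np.array([[[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]])
T_tilde_p = T_bar_p - T_bar_p.mean(axis=(0, 1))
T_tilde_train = np.stack([T_tilde_p, T_tilde_p])
W = np.ones_like(T_bar_p)
print(v_p(W, T_tilde_train, T_tilde_p, T_bar_p))  # 0.0
```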
Once we have found proper $W_{p,x,y,c}$, we can calculate the coefficients $C_{p,x,y,c}$:
\[
C_{p,x,y,c} := \frac{\overline{T}_{p,x,y,c}\, W_{p,x,y,c} - \dfrac{1}{D_X D_Y} \displaystyle\sum_{i=1}^{D_X} \sum_{j=1}^{D_Y} \overline{T}_{p,i,j,c}\, W_{p,i,j,c}}{\displaystyle\sum_{i=1}^{D_X} \sum_{j=1}^{D_Y} \sum_{c'} \widetilde{T}_{p,i,j,c'}\, \overline{T}_{p,i,j,c'}\, W_{p,i,j,c'}}
\]
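Computing the coefficients is then mechanical (sketch; the function and array names are ours). A useful sanity check, which follows directly from the definitions, is that the averaged training difference image $\overline{T}_p$ scores exactly 1 against $C_p$:

```python
import numpy as np

def coefficients(W_p, T_bar_p, T_tilde_p):
    """C_p: the weighted averaged image minus its per-channel constant
    component, normalized by a single scalar (the sum over all pixels
    and channels of T~_p * T_bar_p * W_p)."""
    weighted = T_bar_p * W_p
    numer = weighted - weighted.mean(axis=(0, 1))
    denom = np.sum(T_tilde_p * T_bar_p * W_p)
    return numer / denom

T_bar_p = np.array([[[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]])
T_tilde_p = T_bar_p - T_bar_p.mean(axis=(0, 1))
W_p = np.array([[[1.0, 2.0, 1.0], [1.0, 2.0, 1.0]]])  # arbitrary positive weights
C_p = coefficients(W_p, T_bar_p, T_tilde_p)

print(np.sum(T_bar_p * C_p))  # 1.0
```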
While the calculation of the weights $W_{p,x,y,c}$ during learning may be rather difficult, recognizing whether an image $I$ (with pixel values $I_{x,y,c} \in \mathbb{R}$) shows the object in position $p$ is rather easy. We assume a match if:
\[
\sum_{x=1}^{D_X} \sum_{y=1}^{D_Y} \sum_{c} \left( I_{x,y,c} - \overline{E}_{x,y,c} \right) \cdot C_{p,x,y,c} \;\ge\; \frac{3}{4}
\]
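Recognition thus reduces to one dot product per position. A sketch (the function `matches` and the array names are ours; the image whose difference from $\overline{E}$ equals $\overline{T}_p$ scores exactly 1 by the construction of $C_p$, so it clears the 3/4 threshold, while the neutral image scores 0):

```python
import numpy as np

def matches(I, E_bar, C_p, threshold=0.75):
    """True if sum_{x,y,c} (I_{x,y,c} - E_bar_{x,y,c}) * C_{p,x,y,c} >= 3/4."""
    return np.sum((I - E_bar) * C_p) >= threshold

E_bar = np.zeros((1, 2, 3))
T_bar_p = np.array([[[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]])
T_tilde_p = T_bar_p - T_bar_p.mean(axis=(0, 1))
W_p = np.ones_like(T_bar_p)
C_p = (T_bar_p * W_p - (T_bar_p * W_p).mean(axis=(0, 1))) \
      / np.sum(T_tilde_p * T_bar_p * W_p)

print(matches(E_bar + T_bar_p, E_bar, C_p))  # True
print(matches(E_bar, E_bar, C_p))            # False
```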