
Fully Complex-valued ELM Classifiers for Human Action
Recognition
R. Venkatesh Babu and S. Suresh
Abstract—In this paper, we present a fast learning neural network classifier for human action recognition. The proposed classifier is a fully complex-valued neural network with a single hidden layer. The neurons in the hidden layer employ the fully complex-valued hyperbolic secant as an activation function. The parameters of the hidden layer are chosen randomly, and the output weights are estimated analytically as the minimum norm least-squares solution of a set of linear equations. This fast learning fully complex-valued neural classifier is used to recognize human actions accurately. Optical flow-based features extracted from the video sequences are used to recognize 10 different human actions. The feature vectors are computationally simple first-order statistics of the optical flow vectors, obtained from coarse-to-fine rectangular patches centered around the object. The results indicate the superior performance of the complex-valued neural classifier for action recognition. This superior performance stems from the fact that motion, by nature, consists of two components, one along each of the image axes.
I. INTRODUCTION
Complex-valued neural networks have better computational ability than real-valued networks [1] for handling complex signals. Their inherent orthogonal decision boundary [2], which gives them an exceptional ability to perform classification tasks [3], motivates researchers to develop complex-valued classifiers.
Recently, complex-valued classifiers have been used in the literature to solve real-valued classification tasks; examples include the Multi-Layer Multi-Valued Network (MLMVN) [4], the single-layer complex-valued neural network with phase-encoded input features [5], referred to as the ‘Phase Encoded Complex Valued Neural Network (PE-CVNN)’, and the Phase Encoded Complex-valued Extreme Learning Machine (PE-CELM) [6]. The multi-valued neuron used in the MLMVN [4] applies multiple-valued threshold logic to map the complex-valued input to C discrete outputs using a piecewise continuous activation function, where C is the total number of classes. The transformation used in the MLMVN is not unique, which leads to misclassification. The misclassification is further aggravated as the number of sectors inside the unit circle increases in multi-category classification problems with a large number of classes (C). Moreover, the MLMVN uses a derivative-free global error-correcting rule for the network parameter update that requires significant computational effort.
In the PE-CVNN and PE-CELM presented in [5], [6], the real-valued input feature is phase encoded in [0, π] to obtain the complex-valued input feature. This transformation retains the relational property and the spatial relationship among the real-valued input features [5]. However, the activation functions used in [5] are similar to split complex-valued activation functions and do not preserve the phase information of the error signal during the backward computation. This might result in an inaccurate estimate of the decision function in classification problems. Moreover, the gradient-descent-based learning presented in [5] also requires significant time to train the classifier.

R. Venkatesh Babu (E-mail: [email protected]) is with the Supercomputer Education and Research Centre (SERC), Indian Institute of Science, Bangalore, India. S. Suresh (E-mail: [email protected]) is with the School of Computer Engineering, Nanyang Technological University, Singapore.
In this paper, we propose a complex-valued classifier that requires considerably less computational effort than the other classifiers. The classifier has a non-linear hidden layer and a linear output layer. The neurons in the hidden layer use a fully complex-valued activation function of the hyperbolic secant type [7]. Similar to the Complex Extreme Learning Machine (C-ELM) [8], the parameters of the hidden neurons are chosen randomly and the output parameters of the network are estimated analytically.
Action recognition is one of the crucial computer vision tasks in applications such as video surveillance and monitoring. Developing algorithms for human action recognition is hard due to the complexity of the scene. Typical action recognition approaches [9], [10], [11], [12], [13] assume that each video clip has been pre-segmented to contain a single action. Motion information between consecutive image frames provides vital cues about the object dynamics. This motion information over the image plane is a two-dimensional vector, corresponding to the horizontal and vertical displacements. It is therefore inherently complex in nature and should be treated as a complex-valued feature rather than a real-valued feature for effective action modeling. In our proposed approach, we accumulate motion information (optical flow) over a small time window to extract complex-valued motion features for each frame and thereby represent the corresponding action. We use the Weizmann human action dataset [12] for evaluating the performance.
The paper is organized as follows. Section II describes the fast learning fully complex-valued neural classifier. Section III explains action recognition using the complex-valued neural classifier and presents the performance evaluation. Section IV provides concluding remarks and future directions.
II. DESCRIPTION OF THE CLASSIFIER
In this section, we present a detailed description of the fast learning fully complex-valued neural classifier.
A. Classification problem definition
Let us assume N random observations {(x_1, c_1), ..., (x_t, c_t), ..., (x_N, c_N)}, where x_t ∈ C^m is the m-dimensional complex-valued input feature vector of the t-th observation and c_t ∈ {1, 2, ..., C} is its class label. The coded class labels y_t in the complex domain are obtained using:

    y_{lt} = 1 + i,   if c_t = l,
    y_{lt} = −1 − i,  otherwise,        l = 1, 2, ..., C    (1)
Now, the classification problem in the complex domain can be viewed as finding the decision function F that maps the complex-valued input features to the complex-valued coded class labels, i.e., F : C^m → C^C, and then using F to predict the class labels of new, unseen samples as accurately as possible.
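As a minimal illustration of the class coding in Eq. (1), the following Python/NumPy sketch (the function and variable names are ours, not from the paper) maps integer class labels to complex-valued coded targets:

    import numpy as np

    def code_labels(c, num_classes):
        """Map integer class labels c (values 1..C) to complex-valued coded
        targets as in Eq. (1): 1+1j for the true class, -1-1j otherwise."""
        c = np.asarray(c)
        Y = np.full((num_classes, c.size), -1.0 - 1.0j, dtype=complex)
        Y[c - 1, np.arange(c.size)] = 1.0 + 1.0j
        return Y   # shape C x N; column t is the coded label y_t

    # Example: three samples with class labels 2, 1 and 3 out of C = 3 classes
    print(code_labels([2, 1, 3], 3))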
B. Fast learning complex-valued classifiers
The fully complex-valued classifier is a single hidden layer
network, with a non-linear hidden layer and a linear output
layer as shown in Fig. 1.
The neurons in the hidden layer of the neural classifier employ a fully complex-valued activation function of the hyperbolic secant type [7], and their responses are given by

    y_h^j = sech( u_j^T (x_t − v_j) ),   j = 1, ..., K    (2)

where u_j is the complex-valued scaling factor of the j-th neuron, v_j is its center, and sech(z) = 2/(e^z + e^{−z}).
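For concreteness, the activation in Eq. (2) can be evaluated for complex arguments as in the short Python sketch below (a minimal illustration with made-up numbers; in the classifier, u_j and v_j are chosen randomly):

    import cmath

    def csech(z):
        """Fully complex-valued hyperbolic secant: sech(z) = 2 / (e^z + e^-z)."""
        return 2.0 / (cmath.exp(z) + cmath.exp(-z))

    # Response of one hidden neuron (Eq. (2)) for a 2-dimensional complex input
    x_t = [0.3 + 0.1j, -0.2 + 0.4j]      # input feature vector
    u_j = [0.5 - 0.2j, 0.1 + 0.3j]       # complex-valued scaling factor
    v_j = [0.0 + 0.0j, 0.1 - 0.1j]       # neuron center
    z = sum(u * (x - v) for u, x, v in zip(u_j, x_t, v_j))   # u_j^T (x_t - v_j)
    print(csech(z))                      # complex-valued hidden response y_h^j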
The neurons in the output layer employ a linear activation function, and the output of the classifier is given by:

    y_l = Σ_{j=1}^{K} w_{lj} y_h^j,   l = 1, ..., C    (3)
where w_{lj} is the complex-valued weight connecting the l-th output neuron and the j-th hidden neuron.
The class label can be estimated from the outputs using:

    ĉ = arg max_{l=1,2,...,C} real(y_l)    (4)
Eq. (3) can be written in matrix form as

    Y = W H    (5)

where W is the C × K matrix of output weights connecting the hidden neurons to the output layer, and H is the K × N matrix of hidden-neuron responses for the samples in the training data set, given by

        [ sech(u_1^T (x_1 − v_1))   ···   sech(u_1^T (x_N − v_1)) ]
    H = [          ⋮                 ⋱              ⋮              ]    (6)
        [ sech(u_K^T (x_1 − v_K))   ···   sech(u_K^T (x_N − v_K)) ]
Similar to the C-ELM [8], the parameters of the hidden
neurons (uj , vj ) are chosen randomly and the output weights
W are estimated by the least squares method according to:
    W = Y H†    (7)

where H† is the Moore-Penrose pseudoinverse of the hidden layer output matrix H, and Y is the matrix of complex-valued coded class labels.

Fig. 1. The architecture of a complex-valued neural classifier (inputs x_1, ..., x_m; hidden neurons with sech(·) activations and responses y_h^1, ..., y_h^K; complex-valued linear outputs ŷ).
The proposed fully complex-valued neural classifier can be summarized as follows (a sketch of the procedure is given below):
• For a given training set (X, Y), select an appropriate number of hidden neurons K.
• Choose the scaling factors U and the neuron centers V randomly.
• Calculate the output weights W analytically: W = Y H†.
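The following Python/NumPy sketch illustrates this training and prediction procedure under our own simplifying assumptions (the function names, the Gaussian initialization of the hidden parameters and the use of numpy.linalg.pinv are ours; the paper only states that the hidden parameters are random and that W is obtained via the Moore-Penrose pseudoinverse):

    import numpy as np

    def csech(Z):
        """Fully complex-valued hyperbolic secant, applied element-wise."""
        return 2.0 / (np.exp(Z) + np.exp(-Z))

    def hidden_response(X, U, V):
        """Hidden-layer response matrix H (K x N), Eq. (6): H[j, t] = sech(u_j^T (x_t - v_j))."""
        diff = X[None, :, :] - V[:, :, None]          # shape (K, m, N)
        return csech(np.sum(U[:, :, None] * diff, axis=1))

    def train_celm(X, Y, K, seed=0):
        """X: m x N complex input features; Y: C x N complex coded class labels."""
        rng = np.random.default_rng(seed)
        m = X.shape[0]
        # Randomly chosen complex-valued scaling factors U and centers V
        U = rng.standard_normal((K, m)) + 1j * rng.standard_normal((K, m))
        V = rng.standard_normal((K, m)) + 1j * rng.standard_normal((K, m))
        H = hidden_response(X, U, V)
        W = Y @ np.linalg.pinv(H)                     # Eq. (7): W = Y H^dagger
        return U, V, W

    def predict(X, U, V, W):
        """Predicted class labels (1..C): largest real part of the outputs, Eq. (4)."""
        return np.argmax(np.real(W @ hidden_response(X, U, V)), axis=0) + 1

For example, with complex features X of shape m × N and coded labels Y produced as in the earlier coding sketch, train_celm(X, Y, K=400) mirrors the three steps above, and predict returns integer class labels.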
In this paper, the selection of an appropriate number of hidden neurons is done by adding/deleting neurons until an optimal performance is obtained, as discussed in [14] for real-valued networks.
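One simple way to realize such a strategy, shown below as a hedged sketch (it reuses train_celm and predict from the previous sketch and is not the exact procedure of [14]), is to grow the hidden layer in steps and keep the size that maximizes accuracy on a held-out validation set:

    import numpy as np

    def select_hidden_neurons(X_tr, Y_tr, X_val, c_val, candidates=range(50, 501, 50)):
        """Pick the number of hidden neurons K with the best validation accuracy.
        c_val holds the integer class labels of the validation samples."""
        best_K, best_acc, best_model = None, -1.0, None
        for K in candidates:
            U, V, W = train_celm(X_tr, Y_tr, K)       # from the sketch above
            acc = np.mean(predict(X_val, U, V, W) == c_val)
            if acc > best_acc:
                best_K, best_acc, best_model = K, acc, (U, V, W)
        return best_K, best_model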
III. HUMAN ACTION RECOGNITION USING COMPLEX-VALUED NEURAL CLASSIFIER
Human action recognition is one of the crucial tasks in applications such as video surveillance. It is a challenging problem due to the complexities of real-life scenes, such as background clutter, occlusion, and illumination and scale variations. There have been many proposals for modeling and representing actions, ranging from Hidden Markov Models (HMM) [15], [10] through interest point models [11], [9] to silhouette tunnel shape models [12], [13]. All these approaches utilize real-valued spatio-temporal features to represent the various actions. The motion information between two frames is a vital source of information for modeling human actions. This motion information is inherently complex in nature and hence should be treated as a complex-valued feature rather than a real-valued feature for effective action modeling.
A. Action Representation
The motion flow between consecutive frames is obtained using Lucas and Kanade's optical flow technique [16]. We represent the action as a motion flow-history image over a short time window w (say, 5 frames), as given in Eq. (8):

    H_k(i, j) = Σ_{n=k−w}^{k} F_n(i, j)    (8)
Here, H_k(i, j) indicates the flow history of the k-th frame at location (i, j), and F_n(i, j) represents the optical flow between the n-th and (n − 1)-th frames at location (i, j). For the ‘Wave2’ action shown in Fig. 2(a), the corresponding flow-history image is shown in Fig. 2(c). The background-subtracted image shown in Fig. 2(b) yields the object center, about which a rectangular patch is chosen for feature extraction, as shown in Fig. 3(a). This representation is very useful for recognizing actions instantly as video frames arrive. Action recognition over a wider time window could be achieved from the recognition results at the frame level.
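The flow-history accumulation of Eq. (8) can be sketched as follows. This is only an illustration: the paper uses Lucas-Kanade optical flow [16] and the background masks supplied with the dataset, whereas this sketch substitutes OpenCV's dense Farneback flow and takes the grayscale frames and window size as plain arguments:

    import cv2
    import numpy as np

    def flow_history(gray_frames, k, w=5):
        """Accumulate optical flow over frames k-w..k (Eq. (8)).
        Returns a complex array whose real/imaginary parts are the accumulated
        horizontal/vertical displacements at each pixel."""
        H_k = np.zeros(gray_frames[0].shape, dtype=complex)
        for n in range(k - w, k + 1):
            if n <= 0:
                continue
            # Dense flow between frame n-1 and frame n (stand-in for Lucas-Kanade)
            flow = cv2.calcOpticalFlowFarneback(
                gray_frames[n - 1], gray_frames[n], None,
                pyr_scale=0.5, levels=3, winsize=15,
                iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
            H_k += flow[..., 0] + 1j * flow[..., 1]   # x + i*y motion
        return H_k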
B. Feature Extraction
First, the optical flow for each frame of the video sequence is obtained. The object center for each frame is obtained from the mask generated by subtracting the given background video frame from the current frame. Next, the flow vectors for the current frame are accumulated over the past few frames; in our experiment, motion information from the 6 previous frames is accumulated for feature extraction. In the experiment, 54 rectangular patches of hierarchically increasing sizes, surrounding the object, are utilized. The mean of the non-zero motion vectors in each block is computed as the representative motion vector, as given in Eq. (9).
    F_k^b = ( Σ_{(i,j)∈R_b} H_k(i, j) ) / ( Σ_{(i,j)∈R_b} I_k(i, j) ),    (9)

where

    I_k(i, j) = 1, if |H_k(i, j)| > 0,
    I_k(i, j) = 0, otherwise.    (10)

Here, F_k^b indicates the representative motion vector corresponding to the b-th rectangular block in the k-th frame, R_b indicates the pixels that belong to the b-th rectangular block, and I_k(i, j) is a binary matrix indicating the locations of the non-zero motion vectors.

Figure 3 illustrates the process of feature extraction at the frame level. The final feature vectors are obtained by finding the mean optical flow within each of the rectangular patches surrounding the object center, in a hierarchical fashion, as illustrated in Fig. 3(a)-(f). Initially, the mean flow vectors are obtained at the finest resolution using 36 windows, each of size 6 × 4, symmetrically laid about the object center. The remaining features are obtained by merging the windows sequentially into sizes of 12 × 8 (Fig. 3(b)), 18 × 12 (Fig. 3(c)), 36 × 12 (Fig. 3(d)), 18 × 24 (Fig. 3(e)) and 36 × 24 (Fig. 3(f)), as illustrated.

The representative motion vectors of all 54 patches are collected as the feature vector. Since each motion vector has an x and a y component, the two are combined to form a complex number (Mv_x + i·Mv_y). The final feature vectors are therefore 54-dimensional complex vectors. These feature vectors are scaled appropriately to keep the real and imaginary values within the range −1 to 1.
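A hedged sketch of the per-patch computation in Eqs. (9)-(10) is given below; the helper names are ours, the patch list is assumed to follow the 54-patch layout of Fig. 3, and the per-frame scaling is one possible reading of the scaling step:

    import numpy as np

    def patch_feature(H_k, top, left, height, width):
        """Representative complex motion vector of one rectangular patch:
        mean of the non-zero accumulated flow vectors inside the patch (Eq. (9))."""
        block = H_k[top:top + height, left:left + width]
        nonzero = np.abs(block) > 0                   # I_k(i, j), Eq. (10)
        count = nonzero.sum()
        return block[nonzero].sum() / count if count else 0.0 + 0.0j

    def frame_feature(H_k, patches):
        """Collect the representative vectors of all patches into one complex
        feature vector, then scale the real/imaginary parts into [-1, 1]."""
        f = np.array([patch_feature(H_k, *p) for p in patches])
        scale = max(np.abs(f.real).max(), np.abs(f.imag).max(), 1e-12)
        return f / scale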
C. Performance Evaluation

In this section, we present the performance of the fully complex-valued neural classifier on human action recognition. For this purpose, we use the Weizmann human action dataset [12]. This dataset1 contains 90 low-resolution video sequences (180 × 144), in which 10 actions were performed by 9 subjects. Background videos were also provided for all action sequences. The 10 actions are: bend, jack, jump, pjump, run, side, skip, walk, wave1 and wave2. In total, 5036 frame-level feature vectors were obtained from the 10 action classes. For training, 3000 randomly selected frame-level feature vectors were used, and testing was performed on the remaining 2036 frames. The performance of the classifier is evaluated using the average and overall accuracies.
The average (ηa ) (Eq. (11)) and over-all (ηo ) (Eq. (12))
classification efficiencies of the classifiers derived from their
confusion matrices are used as the performance measures for
comparison in this study.
    η_a = (1/C) Σ_{i=1}^{C} (q_ii / N_i) × 100%    (11)

    η_o = ( Σ_{i=1}^{C} q_ii / Σ_{i=1}^{C} N_i ) × 100%    (12)
where qii is the total number of correctly classified samples
in the class ci and Ni is the total number of samples
belonging to a class ci in the data set.
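Both measures can be computed directly from a confusion matrix of sample counts, as in the following small sketch (the function name is ours):

    import numpy as np

    def average_and_overall_accuracy(confusion):
        """confusion[i, j]: number of class-i samples predicted as class j.
        Returns (eta_a, eta_o) in percent, per Eqs. (11) and (12)."""
        confusion = np.asarray(confusion, dtype=float)
        q_ii = np.diag(confusion)         # correctly classified samples per class
        N_i = confusion.sum(axis=1)       # total samples per class
        eta_a = 100.0 * np.mean(q_ii / N_i)
        eta_o = 100.0 * q_ii.sum() / N_i.sum()
        return eta_a, eta_o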
In our simulation study, we used a fully complex-valued neural network with 400 hidden neurons. The number of hidden neurons was chosen based on the incremental/decremental strategy given in [14]. The overall and average training and test accuracies are around 93%. The performance for each individual action is given in Table I, and the confusion matrix for the test data is given in Table III. Some actions, such as bend and wave2, have motion flow similar to wave1; hence, the performance for these actions is below average. This could possibly be overcome by using additional information such as silhouettes and global features such as the object trajectory.
1 http://wisdom.weizmann.ac.il/∼vision/SpaceTimeActions.html
Fig. 2. (a) Original image, (b) corresponding background-subtracted mask, (c) accumulated flow.
TABLE III
ACTION RECOGNITION RESULTS - CONFUSION MATRIX (TEST), IN %
(rows: actual action; columns: predicted action)

Action  Bend   Jack   Jump   pJump  Run    Side   Skip   Walk   Wave1  Wave2
Bend    80.4   0      0      0      0      0      0      0      19.6   0
Jack    0      94.8   0      1.6    0      0      0      0.4    1.6    1.6
Jump    0      0      100    0      0      0      0      0      0      0
pJump   0.5    0      0      91.5   0      0      0      0      8      0
Run     0      0      0      0      93.4   0      4.4    2.2    0      0
Side    0      0      2      0      0      93.4   2      2.6    0      0
Skip    1.6    0      0      0.55   1.1    0      94.5   2.19   0      0
Walk    0      0      0      0      0      0      0      100    0      0
Wave1   0      0      0      0      0      0      0      0      100    0
Wave2   0.42   0      0      0      0      0      0      0      14.6   85

TABLE I
ACTION RECOGNITION RESULTS USING COMPLEX-VALUED ELM CLASSIFIER

Action   Test accuracy ηi (%)   Training accuracy ηi (%)
Bend     80.4                   77.1
Jack     94.8                   96.1
Jump     100                    98.7
pJump    91.5                   89.5
Run      93.4                   97.5
Side     93.4                   97.8
Skip     94.5                   98.7
Walk     100                    99
Wave1    100                    94.8
Wave2    85                     82.6
TABLE II
ACTION RECOGNITION RESULTS USING SVM CLASSIFIER

Action   Test accuracy ηi (%)   Training accuracy ηi (%)
Bend     24.1                   33.1
Jack     93.3                   91.7
Jump     63.9                   78.1
pJump    39.1                   48
Run      80                     87.8
Side     74.2                   76.5
Skip     76.9                   81.2
Walk     90.6                   92.8
Wave1    76.6                   73.8
Wave2    40.1                   38.4
Fig. 3. Feature extraction, fine-to-coarse strategy: (a) first-level fine image patches around the object center (no. of patches = 36), (b) second-level patches (no. of patches = 9), (c) third-level coarse patches (no. of patches = 4), (d)-(f) other coarse-resolution patches (no. of patches = 5).
In order to compare the results with regular real-valued classifiers, the same dataset was trained with various classifiers, including a multi-class SVM, using the liblinear package2. The prediction results on the test data for all these classifiers were in the range of 65% to 67%, which is about 25% lower than that of the proposed complex-valued neural classifier. The performance for each individual action using the SVM is given in Table II. This result indicates the importance of treating motion information, which by nature consists of components in two directions, as complex-valued data.
Recognizing actions at the sequence level could be achieved by utilizing the frame-level detection results together with global features such as trajectory information. Future research will be directed towards improving the performance by combining global features, such as the object trajectory, with the local frame-level complex-valued features. Since no frame-level action recognition results are available in the literature for this dataset, we will convert the frame-level results to sequence-level results and compare the performance.
IV. CONCLUSION
The exceptional classification ability of complex-valued neural networks is attributed to their inherent orthogonal decision boundaries. In this paper, we presented an efficient, fast learning complex-valued classifier for human action recognition, in which the input weights are randomly selected and the output weights are calculated analytically. The performance of the classifier was evaluated using benchmark action recognition sequences. The results indicate the superior performance of the proposed complex-valued neural classifier over traditional real-valued classifiers.
2 http://www.csie.ntu.edu.tw/∼cjlin/liblinear
REFERENCES
[1] T. Nitta, “The computational power of complex-valued neuron,”
in Artificial Neural Networks and Neural Information Processing
ICANN/ICONIP (Istanbul, Turkey), 2003, LNCS, pp. 993–1000.
[2] T. Nitta, “On the inherent property of the decision boundary in
complex-valued neural networks,” Neurocomputing, vol. 50, pp. 291–
303, 2003.
[3] T. Nitta, “Solving the XOR problem and the detection of symmetry using a single complex-valued neuron,” Neural Networks, vol. 16, no. 8, pp. 1101–1105, 2003.
[4] I. Aizenberg and C. Moraga, “Multilayer feedforward neural network
based on multi-valued neurons (MLMVN) and a backpropagation
learning algorithm,” Soft Computing, vol. 11, no. 2, pp. 169–193,
2007.
[5] M. F. Amin and K. Murase, “Single-layered complex-valued neural
network for real-valued classification problems,” Neurocomputing, vol.
72, no. (4-6), pp. 945–955, 2009.
[6] R. Savitha, S. Suresh, N. Sundararajan, and H. J. Kim, “Fast
learning fully complex-valued classifiers for real-valued classification
problems,” in International Symposium on Neural Networks, D. Liu
et al., Ed., 2011, vol. 6675 of LNCS, pp. 602–609.
[7] R. Savitha, S. Suresh, and N. Sundararajan, “A fully complex-valued
radial basis function network and its learning algorithm,” International
Journal of Neural Systems, vol. 19, no. 4, pp. 253–267, 2009.
[8] M. B. Li, G. B. Huang, P. Saratchandran, and N. Sundararajan, “Fully
complex extreme learning machines,” Neurocomputing, vol. 68, no.
(1-4), pp. 306–314, 2005.
[9] C. Schuldt, I. Laptev, and B. Caputo, “Recognizing human actions: A local SVM approach,” in Proceedings of International Conference on Pattern Recognition, Aug. 2004, pp. 32–36.
[10] R. Venkatesh Babu, B. Anantharaman, K. R. Ramakrishnan, and S. H.
Srinivasan, “Compressed domain action classification using HMM,”
Pattern Recognition Letters, vol. 23, no. 10, pp. 1203–1213, 2002.
[11] J. Niebles, H. Wang, and Li Fei-Fei, “Unsupervised learning of human
action categories using spatial-temporal words,” International Journal
of Computer Vision, vol. 79, no. 3, pp. 299–318, Sept. 2008.
[12] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri, “Actions
as space-time shapes,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, vol. 29, no. 12, pp. 2247–2253, Dec. 2007.
[13] K. Guo, P. Ishwar, and J. Konrad, “Action recognition from video by
covariance matching of silhouette tunnels,” in Proceedings of Brazilian
Symposium on Computer Graphics and Image Processing, Oct. 2009,
pp. 299–306.
[14] S. Suresh, S. N. Omkar, V. Mani, and T. N. Guru Prakash, “Lift
coefficient prediction at high angle of attack using recurrent neural
network,” Aerospace Science and Technology, vol. 7, no. 8, pp. 595–
602, 2003.
[15] T. Starner and A. Pentland, “Visual recognition of American Sign Language using hidden Markov models,” in International Conference on Automatic Face and Gesture Recognition, 1995, pp. 189–194.
[16] B. D. Lucas and T. Kanade, “An iterative image registration technique
with an application to stereo vision,” in Proceedings of DARPA Image
Understanding Workshop, Apr. 1981, pp. 121–130.