Fully Complex-valued ELM Classifiers for Human Action Recognition

R. Venkatesh Babu and S. Suresh

R. Venkatesh Babu (E-mail: [email protected]) is with the Supercomputer Education and Research Centre (SERC), Indian Institute of Science, Bangalore, India. S. Suresh (E-mail: [email protected]) is with the School of Computer Engineering, Nanyang Technological University, Singapore.

Abstract—In this paper, we present a fast learning neural network classifier for human action recognition. The proposed classifier is a fully complex-valued neural network with a single hidden layer. The neurons in the hidden layer employ the fully complex-valued hyperbolic secant as the activation function. The parameters of the hidden layer are chosen randomly, and the output weights are estimated analytically as a minimum-norm least-squares solution to a set of linear equations. The fast learning, fully complex-valued neural classifier is used to recognize human actions accurately. Optical flow-based features extracted from the video sequences are utilized to recognize 10 different human actions. The feature vectors are computationally simple first-order statistics of the optical flow vectors, obtained from coarse-to-fine rectangular patches centered around the object. The results indicate the superior performance of the complex-valued neural classifier for action recognition. This superior performance stems from the fact that motion, by nature, consists of two components, one along each image axis.

I. INTRODUCTION

Complex-valued neural networks have better computational ability than real-valued networks for handling complex signals [1]. Their inherent orthogonal decision boundary [2], which gives them an exceptional ability to perform classification tasks [3], motivates researchers to develop complex-valued classifiers. Complex-valued classifiers recently proposed in the literature for solving real-valued classification tasks include the Multi Layer Multi Valued Network (MLMVN) [4], the single-layer complex-valued neural network with phase-encoded input features [5], referred to as the 'Phase Encoded Complex Valued Neural Network (PE-CVNN)', and the Phase Encoded Complex-valued Extreme Learning Machine (PE-CELM) [6].

A multi-valued neuron used in the MLMVN [4] uses multiple-valued threshold logic to map the complex-valued input to C discrete outputs using a piecewise continuous activation function, where C is the total number of classes. The transformation used in the MLMVN is not unique, which leads to misclassification. The misclassification is further aggravated as the number of sectors inside the unit circle increases in multi-category classification problems with a large number of classes (C). Moreover, the MLMVN uses a derivative-free global error-correcting rule for updating the network parameters, which requires significant computational effort.

In the PE-CVNN and PE-CELM presented in [5], [6], each real-valued input feature is phase encoded in [0, π] to obtain the complex-valued input feature. This transformation retains the relational property and the spatial relationships among the real-valued input features [5]. However, the activation functions used in [5] are similar to split complex-valued activation functions and do not preserve the phase information of the error signal during the backward computation. This may result in an inaccurate estimate of the decision function in classification problems. Moreover, the gradient-descent-based learning presented in [5] also requires significant time to train the classifier.
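As an aside, the phase-encoding step used by the PE-CVNN/PE-CELM can be illustrated with a minimal sketch (in Python/NumPy, the convention used for all sketches in this paper). The paper under discussion only states that each real feature is phase encoded in [0, π]; the particular form exp(iπx) and the prior normalization of x to [0, 1] are assumptions made here for illustration.

```python
import numpy as np

def phase_encode(x):
    """Map real-valued features to the unit circle with phase in [0, pi].

    Assumes x has already been normalized to [0, 1]; the complex feature
    exp(i * pi * x) then preserves the ordering and spacing of x as angular
    position on the upper half of the unit circle.
    """
    return np.exp(1j * np.pi * np.asarray(x, dtype=float))

# Example: three normalized real features become complex-valued inputs.
print(phase_encode([0.0, 0.5, 1.0]))   # approximately [1+0j, 0+1j, -1+0j]
```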
In this paper, we propose a complex-valued classifier that requires considerably less computational effort than these classifiers. The classifier has a non-linear hidden layer and a linear output layer. The neurons in the hidden layer use a fully complex-valued activation function of the hyperbolic secant type [7]. Similar to the C-ELM (Complex Extreme Learning Machine) [8], the parameters of the hidden neurons are chosen randomly and the output parameters of the network are estimated analytically.

Action recognition is one of the crucial computer vision tasks in applications such as video surveillance and monitoring. Developing algorithms for human action recognition is hard due to the complexity of the scene. Typical action recognition approaches [9], [10], [11], [12], [13] assume that each video clip has been pre-segmented to contain a single action. Motion information between consecutive image frames provides vital cues about the object dynamics. At each location in the image plane, this motion is a two-dimensional vector with horizontal and vertical components. Motion is therefore inherently complex in nature, and it should be treated as a complex-valued feature rather than a real-valued feature for effective action modeling. In our proposed approach, we accumulate motion information (optical flow) over a small time window to extract complex-valued motion features for each frame, which represent the corresponding action. We use the Weizmann human action dataset to evaluate the performance [12].

The paper is organized as follows. Section II describes the fast learning fully complex-valued neural classifier. Section III explains action recognition using the complex-valued neural classifier and presents the performance evaluation. Section IV provides concluding remarks and future directions.

II. DESCRIPTION OF THE CLASSIFIER

In this section, we present a detailed description of the fast learning fully complex-valued neural classifier.

A. Classification problem definition

Let us assume N random observations {(x_1, c_1), ..., (x_t, c_t), ..., (x_N, c_N)}, where x_t ∈ C^m are the m-dimensional complex-valued input features of the t-th observation and c_t ∈ {1, 2, ..., C} is its class label. The coded class labels y^t in the complex domain are obtained using:

\[
y_l^t = \begin{cases} 1 + i, & \text{if } c_t = l \\ -1 - i, & \text{otherwise,} \end{cases} \qquad l = 1, 2, \ldots, C \tag{1}
\]

Now, the classification problem in the complex domain can be viewed as finding the decision function F that maps the complex-valued input features to the complex-valued coded class labels, i.e., F : C^m → C^C, and then predicting the class labels of new, unseen samples with a certain accuracy.

B. Fast learning complex-valued classifier

The fully complex-valued classifier is a single-hidden-layer network with a non-linear hidden layer and a linear output layer, as shown in Fig. 1.

Fig. 1. The architecture of a Complex-valued Neural Classifier.

The neurons in the hidden layer employ the fully complex-valued activation function of the hyperbolic secant type [7], and their responses are given by

\[
y_h^j = \operatorname{sech}\!\left( u_j^T (x_t - v_j) \right), \qquad j = 1, \ldots, K \tag{2}
\]

where u_j is the complex-valued scaling factor and v_j is the center of the j-th hidden neuron, and sech(x) = 2/(e^x + e^{-x}).

The neurons in the output layer employ a linear activation function, and the outputs of the classifier are given by

\[
y_l = \sum_{j=1}^{K} w_{lj}\, y_h^j, \qquad l = 1, \ldots, C \tag{3}
\]

where w_{lj} is the complex-valued weight connecting the l-th output neuron and the j-th hidden neuron. The class label can be estimated from the outputs using

\[
\hat{c} = \arg\max_{l = 1, 2, \ldots, C} \operatorname{real}(y_l) \tag{4}
\]

Eq. (3) can be written in matrix form as

\[
Y = W H \tag{5}
\]

where W is the matrix of all output weights connecting the hidden layer to the output layer, and H is the K × N matrix of the responses of the hidden neurons for the samples in the training data set:

\[
H(U, V, X) =
\begin{bmatrix}
\operatorname{sech}\!\left(u_1^T (x_1 - v_1)\right) & \cdots & \operatorname{sech}\!\left(u_1^T (x_N - v_1)\right) \\
\vdots & \ddots & \vdots \\
\operatorname{sech}\!\left(u_K^T (x_1 - v_K)\right) & \cdots & \operatorname{sech}\!\left(u_K^T (x_N - v_K)\right)
\end{bmatrix} \tag{6}
\]

Similar to the C-ELM [8], the parameters of the hidden neurons (u_j, v_j) are chosen randomly, and the output weights W are estimated by the least squares method according to

\[
W = Y H^{\dagger} \tag{7}
\]

where H† is the Moore-Penrose inverse of the hidden layer output matrix and Y is the matrix of complex-valued coded class labels.

The proposed fully complex-valued neural classifier can be summarized as:
• For a given training set (X, Y), select an appropriate number of hidden neurons K.
• Choose the scaling factors U and the neuron centers V randomly.
• Calculate the output weights W analytically: W = Y H†.

In this paper, the selection of an appropriate number of hidden neurons is done by adding/deleting neurons to obtain optimal performance, as discussed in [14] for real-valued networks.
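To make the training and prediction steps above concrete, the following is a minimal Python/NumPy sketch of a fully complex-valued ELM with a sech hidden layer, following Eqs. (1)-(7). The scale of the random initialization and the use of numpy.linalg.pinv for the Moore-Penrose inverse are assumptions made for this sketch, not details taken from the paper.

```python
import numpy as np

def csech(z):
    """Fully complex-valued hyperbolic secant: sech(z) = 2 / (e^z + e^{-z})."""
    return 2.0 / (np.exp(z) + np.exp(-z))

class ComplexELM:
    """Sketch of the fully complex-valued ELM classifier (Eqs. 1-7).

    Hidden parameters (u_j, v_j) are chosen at random; output weights W are
    computed analytically with the Moore-Penrose pseudoinverse (Eq. 7).
    """

    def __init__(self, n_hidden, n_classes, scale=0.1, seed=0):
        self.K, self.C, self.scale = n_hidden, n_classes, scale
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # X: (m, N) complex features.  H[j, t] = sech(u_j^T (x_t - v_j)), Eq. (2)/(6).
        pre = self.U @ X - np.sum(self.U * self.V, axis=1, keepdims=True)
        return csech(pre)

    def fit(self, X, labels):
        m, N = X.shape
        # Random complex-valued scaling factors u_j and centers v_j.
        self.U = self.scale * (self.rng.standard_normal((self.K, m))
                               + 1j * self.rng.standard_normal((self.K, m)))
        self.V = self.scale * (self.rng.standard_normal((self.K, m))
                               + 1j * self.rng.standard_normal((self.K, m)))
        # Coded class labels, Eq. (1): 1+i for the true class, -1-i otherwise.
        Y = np.full((self.C, N), -1.0 - 1.0j, dtype=complex)
        Y[labels, np.arange(N)] = 1.0 + 1.0j
        # Output weights, Eq. (7): W = Y H^dagger.
        self.W = Y @ np.linalg.pinv(self._hidden(X))
        return self

    def predict(self, X):
        # Eqs. (3)-(4): pick the output with the largest real part.
        return np.argmax(np.real(self.W @ self._hidden(X)), axis=0)

# Usage on random data (shapes only; not the action-recognition features):
X = np.random.randn(54, 200) + 1j * np.random.randn(54, 200)
labels = np.random.randint(0, 10, size=200)
clf = ComplexELM(n_hidden=400, n_classes=10).fit(X, labels)
pred = clf.predict(X)
```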
III. HUMAN ACTION RECOGNITION USING A COMPLEX-VALUED NEURAL CLASSIFIER

Human action recognition is one of the crucial tasks in applications such as video surveillance. It is a challenging problem due to the complexity of real-life scenes, with background clutter, occlusion, illumination changes and scale variations. There have been many proposals for modeling and representing actions, ranging from Hidden Markov Models (HMM) [15], [10] through interest point models [11], [9] to silhouette tunnel shape models [12], [13]. All these approaches utilize real-valued spatio-temporal features to represent the various actions. The motion information between two frames is a vital source of information for modeling human actions. This motion information is inherently complex in nature and hence should be treated as a complex-valued feature rather than a real-valued feature for effective action modeling.

A. Action Representation

The motion flow between consecutive frames is obtained using Lucas and Kanade's optical flow technique [16]. We represent the action as a motion flow-history image accumulated over a short time window w (say, 5 frames), as given in Eq. (8):

\[
H_k(i, j) = \sum_{n = k - w}^{k} F_n(i, j) \tag{8}
\]

Here, H_k(i, j) denotes the flow history of the k-th frame at location (i, j), and F_n(i, j) denotes the optical flow between the n-th and (n − 1)-th frames at location (i, j). For the 'Wave2' action shown in Fig. 2(a), the corresponding flow-history image is shown in Fig. 2(c). The background-subtracted image shown in Fig. 2(b) yields the object center, about which a rectangular patch is chosen for feature extraction, as shown in Fig. 3(a). This representation is very useful for recognizing actions instantly as video frames arrive. Action recognition over a wider time window can be achieved from the recognition results at the frame level.

Fig. 2. (a) Original image; (b) corresponding background-subtracted mask; (c) accumulated flow.
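A minimal sketch of the flow-history accumulation in Eq. (8), continuing the Python/NumPy convention used above. Any Lucas-Kanade implementation could supply the per-frame flow fields; here they are simply assumed to be given as arrays, and packing the (vx, vy) flow into a complex number is a convenience of this sketch, not something prescribed by the paper.

```python
import numpy as np

def flow_history(flows, k, w=5):
    """Accumulated flow-history image H_k (Eq. 8).

    `flows` is assumed to be a list of per-frame optical-flow fields, where
    flows[n][i, j] = vx + 1j*vy is the flow between frame n and frame n-1.
    """
    start = max(k - w, 0)                      # guard the first few frames
    return sum(flows[n] for n in range(start, k + 1))

# Usage with synthetic flow fields for a 144x180 video:
flows = [np.zeros((144, 180), dtype=complex) for _ in range(30)]
H_20 = flow_history(flows, k=20, w=5)
```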
B. Feature Extraction

First, the optical flow for each frame of the video sequence is obtained. The object center for each frame is obtained from the mask generated by subtracting the given background frame from the current frame. The flow vectors for the current frame are then accumulated over the past few frames; in our experiments, motion information from the 6 previous frames is accumulated for feature extraction.

In the experiments, 54 rectangular patches of hierarchically increasing sizes surrounding the object are utilized. The mean value of the non-zero motion vectors in each block is computed as the representative motion vector, as given in Eq. (9):

\[
F_k^b = \frac{\sum_{(i,j) \in R_b} H_k(i, j)}{\sum_{(i,j) \in R_b} I_k(i, j)} \tag{9}
\]

where

\[
I_k(i, j) = \begin{cases} 1, & \text{if } |H_k(i, j)| > 0 \\ 0, & \text{otherwise} \end{cases} \tag{10}
\]

Here, F_k^b denotes the representative motion vector of the b-th rectangular block in the k-th frame, R_b denotes the set of pixels belonging to the b-th rectangular block, and I_k(i, j) is a binary matrix indicating the locations of the non-zero motion vectors.

Figure 3 illustrates the process of feature extraction at the frame level. The final feature vector is obtained by finding the mean optical flow within each of the rectangular patches surrounding the object center, in a hierarchical fashion, as illustrated in Fig. 3(a)-(f). Initially, the mean flow vectors are obtained at the finest resolution using 36 windows, each of size 6 × 4, laid symmetrically about the object center. The remaining features are obtained by merging the windows sequentially into sizes of 12 × 8 (Fig. 3(b)), 18 × 12 (Fig. 3(c)), 36 × 12 (Fig. 3(d)), 18 × 24 (Fig. 3(e)) and 36 × 24 (Fig. 3(f)), as illustrated. The representative motion vectors of all 54 patches are collected into the feature vector. Since each motion vector has an x and a y component, the two are combined into a complex number (Mv_x + i·Mv_y). The final feature vectors are therefore 54-dimensional complex vectors. These feature vectors are scaled appropriately to keep the real and imaginary values within the range [−1, 1].

Fig. 3. Feature extraction, fine-to-coarse strategy: (a) first-level fine image patches (no. of patches = 36); (b) second-level patches (no. of patches = 9); (c) third-level coarse patches (no. of patches = 4); (d)-(f) other coarse-resolution patches (no. of patches = 5).
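The per-frame feature computation of Eqs. (9)-(10) can be sketched as follows. The exact patch geometry (the offsets of the 36 fine windows and the merging order) and the scaling rule are not fully specified in the text, so the rectangle list is left as an input and the normalization shown is only one simple choice.

```python
import numpy as np

def patch_feature(H_k, rects):
    """54-dimensional complex feature for one frame (Eqs. 9-10).

    H_k   : complex flow-history image (vx + 1j*vy per pixel), from Eq. (8).
    rects : list of (top, left, height, width) patches around the object
            center; building the 36 fine + 18 merged patches of Fig. 3 is
            left to the caller.
    """
    feats = []
    for (top, left, h, w) in rects:
        block = H_k[top:top + h, left:left + w]
        mask = np.abs(block) > 0                     # I_k(i, j), Eq. (10)
        n = mask.sum()
        feats.append(block[mask].sum() / n if n else 0.0 + 0.0j)   # Eq. (9)
    x = np.array(feats, dtype=complex)
    # Scale so real and imaginary parts lie within [-1, 1] (one simple choice).
    s = max(np.abs(x.real).max(), np.abs(x.imag).max(), 1e-12)
    return x / s
```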
C. Performance Evaluation

In this section, we present the performance of the fully complex-valued neural classifier on human action recognition. For this purpose, we use the Weizmann Human Action Dataset¹. This dataset contains 90 low-resolution video sequences (180 × 144) in which 10 actions are performed by 9 subjects. Background videos are also provided for all action sequences. The 10 actions are: bend, jack, jump, pjump, run, side, skip, walk, wave1 and wave2. In total, 5036 frame-level feature vectors were obtained from the 10 action classes. For training, 3000 randomly selected frame-level features were used, and the classifier was tested on the remaining 2036 frames.

¹ http://wisdom.weizmann.ac.il/~vision/SpaceTimeActions.html

The performance of the classifier is evaluated using the average and overall accuracies. The average (η_a, Eq. (11)) and overall (η_o, Eq. (12)) classification efficiencies, derived from the confusion matrices, are used as the performance measures for comparison in this study:

\[
\eta_a = \frac{1}{C} \sum_{i=1}^{C} \frac{q_{ii}}{N_i} \times 100\% \tag{11}
\]

\[
\eta_o = \frac{\sum_{i=1}^{C} q_{ii}}{\sum_{i=1}^{C} N_i} \times 100\% \tag{12}
\]

where q_ii is the total number of correctly classified samples of class c_i and N_i is the total number of samples belonging to class c_i in the data set.
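For clarity, a small sketch of how Eqs. (11)-(12) follow from a confusion matrix; the matrix is assumed to hold raw counts with rows indexed by true class and columns by predicted class.

```python
import numpy as np

def average_and_overall_accuracy(confusion):
    """Compute eta_a (Eq. 11) and eta_o (Eq. 12) from a raw-count confusion
    matrix whose rows are true classes and columns are predicted classes."""
    confusion = np.asarray(confusion, dtype=float)
    q_ii = np.diag(confusion)          # correctly classified samples per class
    n_i = confusion.sum(axis=1)        # total samples per class
    eta_a = 100.0 * np.mean(q_ii / n_i)
    eta_o = 100.0 * q_ii.sum() / n_i.sum()
    return eta_a, eta_o

# Example: 2-class toy confusion matrix with unbalanced classes.
print(average_and_overall_accuracy([[90, 10], [5, 5]]))   # (70.0, about 86.36)
```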
In our simulation study, we used a fully complex-valued neural network with 400 hidden neurons. The number of hidden neurons was chosen using the incremental/decremental strategy given in [14]. The overall/average training and test accuracies are around 93%. The performance for each individual action is given in Table I, and the confusion matrix for the test data is given in Table III.

Some actions, such as bend and wave2, have motion flow similar to wave1; hence, the performance for these actions is below average. This could possibly be overcome by using additional information such as silhouettes and global features such as the object trajectory.

TABLE I
ACTION RECOGNITION RESULTS USING THE COMPLEX-VALUED ELM CLASSIFIER

Action   Test accuracy η_i (%)   Training accuracy η_i (%)
Bend             80.4                    77.1
Jack             94.8                    96.1
Jump            100                      98.7
pJump            91.5                    89.5
Run              93.4                    97.5
Side             93.4                    97.8
Skip             94.5                    98.7
Walk            100                      99
Wave1           100                      94.8
Wave2            85                      82.6

TABLE II
ACTION RECOGNITION RESULTS USING THE SVM CLASSIFIER

Action   Test accuracy η_i (%)   Training accuracy η_i (%)
Bend             24.1                    33.1
Jack             93.3                    91.7
Jump             63.9                    78.1
pJump            39.1                    48
Run              80                      87.8
Side             74.2                    76.5
Skip             76.9                    81.2
Walk             90.6                    92.8
Wave1            76.6                    73.8
Wave2            40.1                    38.4

TABLE III
ACTION RECOGNITION RESULTS - CONFUSION MATRIX (TEST)
(rows: true action, columns: predicted action, in %)

Action   Bend   Jack   Jump   pJump   Run    Side   Skip   Walk   Wave1   Wave2
Bend     80.4   0      0      0       0      0      0      0      19.6    0
Jack     0      94.8   0      1.6     0      0      0      0.4    1.6     1.6
Jump     0      0      100    0       0      0      0      0      0       0
pJump    0.5    0      0      91.5    0      0      0      0      8       0
Run      0      0      0      0       93.4   0      4.4    2.2    0       0
Side     0      0      2      0       0      93.4   2      2.6    0       0
Skip     1.6    0      0      0.55    1.1    0      94.5   2.19   0       0
Walk     0      0      0      0       0      0      0      100    0       0
Wave1    0      0      0      0       0      0      0      0      100     0
Wave2    0.42   0      0      0       0      0      0      0      14.6    85

In order to compare these results with regular real-valued classifiers, the same dataset was used to train various classifiers, including a multi-class SVM, using the liblinear package². The test prediction accuracies of all these classifiers were in the range of 65% to 67%, which is about 25 percentage points lower than that of the proposed complex-valued neural classifier. The per-action performance of the SVM is given in Table II. This result indicates the importance of treating motion information, which by nature consists of components along two directions, as complex-valued data.

² http://www.csie.ntu.edu.tw/~cjlin/liblinear

Recognizing actions at the sequence level could be achieved by utilizing the frame-level detection results together with global features such as trajectory information. Future research will be directed towards improving the performance by combining global features, such as the object trajectory, with the local frame-level complex features. Since no frame-level action recognition results are available in the literature for this dataset, we will convert the frame-level results to sequence-level results in order to compare performance.

IV. CONCLUSION

The exceptional classification ability of complex-valued neural networks is attributed to their inherent orthogonal decision boundaries. In this paper, we presented an efficient, fast learning complex-valued classifier for human action recognition. The input weights are randomly selected and the output weights are calculated analytically. The performance of the classifier is evaluated using benchmark action recognition sequences. The results indicate the superior performance of the proposed complex-valued neural classifier over traditional real-valued classifiers.

REFERENCES

[1] T. Nitta, "The computational power of complex-valued neuron," in Artificial Neural Networks and Neural Information Processing, ICANN/ICONIP (Istanbul, Turkey), LNCS, 2003, pp. 993-1000.
[2] T. Nitta, "On the inherent property of the decision boundary in complex-valued neural networks," Neurocomputing, vol. 50, pp. 291-303, 2003.
[3] T. Nitta, "Solving the XOR problem and the detection of symmetry using a single complex-valued neuron," Neural Networks, vol. 16, no. 8, pp. 1101-1105, 2003.
[4] I. Aizenberg and C. Moraga, "Multilayer feedforward neural network based on multi-valued neurons (MLMVN) and a backpropagation learning algorithm," Soft Computing, vol. 11, no. 2, pp. 169-193, 2007.
[5] M. F. Amin and K. Murase, "Single-layered complex-valued neural network for real-valued classification problems," Neurocomputing, vol. 72, no. 4-6, pp. 945-955, 2009.
[6] R. Savitha, S. Suresh, N. Sundararajan, and H. J. Kim, "Fast learning fully complex-valued classifiers for real-valued classification problems," in International Symposium on Neural Networks, D. Liu et al., Eds., vol. 6675 of LNCS, 2011, pp. 602-609.
[7] R. Savitha, S. Suresh, and N. Sundararajan, "A fully complex-valued radial basis function network and its learning algorithm," International Journal of Neural Systems, vol. 19, no. 4, pp. 253-267, 2009.
[8] M. B. Li, G. B. Huang, P. Saratchandran, and N. Sundararajan, "Fully complex extreme learning machines," Neurocomputing, vol. 68, no. 1-4, pp. 306-314, 2005.
[9] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: A local SVM approach," in Proceedings of the International Conference on Pattern Recognition, Aug. 2004, pp. 32-36.
[10] R. Venkatesh Babu, B. Anantharaman, K. R. Ramakrishnan, and S. H. Srinivasan, "Compressed domain action classification using HMM," Pattern Recognition Letters, vol. 23, no. 10, pp. 1203-1213, 2002.
[11] J. Niebles, H. Wang, and L. Fei-Fei, "Unsupervised learning of human action categories using spatial-temporal words," International Journal of Computer Vision, vol. 79, no. 3, pp. 299-318, Sept. 2008.
[12] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri, "Actions as space-time shapes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 12, pp. 2247-2253, Dec. 2007.
[13] K. Guo, P. Ishwar, and J. Konrad, "Action recognition from video by covariance matching of silhouette tunnels," in Proceedings of the Brazilian Symposium on Computer Graphics and Image Processing, Oct. 2009, pp. 299-306.
[14] S. Suresh, S. N. Omkar, V. Mani, and T. N. Guru Prakash, "Lift coefficient prediction at high angle of attack using recurrent neural network," Aerospace Science and Technology, vol. 7, no. 8, pp. 595-602, 2003.
[15] T. Starner and A. Pentland, "Visual recognition of American Sign Language using hidden Markov models," in International Conference on Automatic Face and Gesture Recognition, 1995, pp. 189-194.
[16] B. D. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," in Proceedings of the DARPA Image Understanding Workshop, Apr. 1981, pp. 121-130.