Comparative Study of Machine Learning Algorithms for Activity Recognition with Data Sequence in Home-like Environment

Xiuyi Fan1, Huiguo Zhang1, Cyril Leung2, Chunyan Miao1

This research is supported by the National Research Foundation Singapore under its Interactive Digital Media (IDM) Strategic Research Programme.
1 Joint NTU-UBC Research Centre of Excellence in Active Living for the Elderly (LILY), Nanyang Technological University, Singapore
2 Electrical and Computer Engineering, The University of British Columbia, Canada

Abstract— Activity recognition is a key problem in multi-sensor systems. With data collected from different sensors, a multi-sensor system identifies activities performed by the inhabitants. Since an activity always lasts a certain duration, it is beneficial to use data sequences for the desired recognition. In this work, we experiment with several machine learning techniques, including Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM), Gated Recurrent Units (GRU) and a Meta-Layer Network, for solving this problem. We observe that (1) compared with "single-frame" activity recognition, data-sequence-based classification gives better performance; and (2) directly using data sequence information with a simple "meta-layer" network model yields better performance than memory-based deep learning approaches.

I. INTRODUCTION

Activity recognition has been a key problem in multi-sensor system studies [1]. Situated in a smart home environment, multiple sensors, usually of different types, collect data continuously [2]. These sensors are connected to a central computation device that runs data analytic algorithms. As a result, the inhabitants' activities are monitored and identified, and abnormalities are detected. Applications of activity recognition can be found in assisted living systems [3], surveillance systems [4], elderly fall detection systems [5] and patient monitoring systems [6].

Many activity recognition systems have been reported in the literature with different sensor configurations (see Section III). Multi-sensor systems differ from each other as sensor configurations and activity recognition requirements differ between systems. Typical sensor systems include force sensors, switches, movement sensors, etc. Recognition requirements vary in classification frequency (real time or not), expected classification accuracy, etc.

In this paper, we report the results of experimenting with several deep learning based approaches on activity data collected in our simulated smart home environment. Our hypothesis is that, since each activity always lasts a certain duration, classification results obtained at some time step t should, in most cases, be the same as the results obtained at time step t + 1; thus, the ability to "remember" previously seen information should improve the overall performance. Recurrent neural networks (RNN), long short-term memory (LSTM) and gated recurrent units (GRU) are deep learning models with such remembering capabilities. We study their performance in this work.

The remainder of this paper is organized as follows. We present our classifiers in Section II, along with their performance. Section III discusses related work. We conclude in Section IV.
II. CLASSIFIERS AND EXPERIMENTS

In this work, we experiment with a smart home equipped with the following sensors:1
• Two Grid-Eye infrared array sensors (GridEye 1 & 2),
• Two force sensors (Force 1 & 2),
• One noise sensor (Noise), and
• One electric current detector (Current).

1 https://na.industrial.panasonic.com/products/sensors/sensors-automotiveindustrial-applications/grid-eye-infrared-array-sensor

Sensor placement is summarized in Table I and the experiment environment layout is illustrated in Figure 1. A GridEye thermal array sensor outputs an 8-pixel by 8-pixel thermal image of its 120-degree field of view at 2 Hz. A force sensor outputs integers representing the force applied to it whenever the applied force changes. The noise sensor measures ambient noise and outputs integers indicating the noise level at 0.3 Hz. The current detector measures the AC current consumption of the TV and outputs a Boolean value indicating whether the TV is on. Sample sensor outputs during one experiment session are shown in Figure 2; outputs from the two GridEyes at a single time step are shown in Figure 3.

TABLE I: Sensor placement in our experiment environment. The placement of sensors A - F can be seen in Figure 1.
A: GridEye 1 - Ceiling above dining table
B: GridEye 2 - Ceiling above bed
C: Force 1 - Sofa legs
D: Force 2 - Surface of dining chair
E: Noise - Next to TV
F: Current - TV's power plug

Fig. 1. Illustration of the testing environment.

The six activities we aim to recognize are: 1) eat, 2) watch TV, 3) read books, 4) sleep, 5) friend visit, and 6) other. We assume that at any moment, there is one and only one activity, including "other", taking place. In our experiments, for eating, the testing subject sits at the dining table and consumes a snack. For watching TV and reading books, he sits on the same sofa with the TV on and off, respectively. For sleeping, the testing subject lies on the bed with little movement. For friend visit, an additional testing subject enters the room and both sit at the dining table carrying out a conversation.

For data collection, we have performed eight runs of experiments with four individuals, each person performing each of the five activities (excluding "other") twice, with time spent shifting between activities labelled as "other". In our experiments, each activity lasts two to three minutes in every run. All activities are manually labelled.

Several data pre-processing steps are taken. First, we interpolate data from the different sensors to the rate of the sensor with the highest sampling frequency. At each time step, we concatenate data from all sensors into a single vector; thus, at any time step t, we obtain a 1-by-132 vector containing the outputs from all six sensors (see Figure 4). We then normalize the data so that each data dimension carries the same initial weight in the calculation. As a result, we operate on a data stream of temporally ordered vectors.

Fig. 2. Normalized data from the two GridEyes (mean values), the noise sensor and the current detector during one experiment run containing all six activities (including "other"). The x-axes are time at 0.5-second resolution. The y-axes are sensor outputs.
Fig. 3. Sample output from GridEye 1 & 2 (single frame).
Fig. 4. Sensor data layout in vector format: entries 1:64, GridEye 1; 65:128, GridEye 2; 129, Noise; 130, Force 1; 131, Force 2; 132, Current.
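As a rough illustration of these pre-processing steps, the sketch below assembles the per-time-step 1-by-132 vectors. The array names are hypothetical, the inputs are assumed to be already interpolated to the 2 Hz GridEye rate, and min-max scaling is used as a stand-in since the paper does not state its normalization scheme.

```python
import numpy as np

def build_feature_vectors(grideye1, grideye2, noise, force1, force2, current):
    """Concatenate per-time-step sensor readings into 1-by-132 vectors.

    Assumes every stream has already been interpolated to the rate of the
    fastest sensor (the 2 Hz GridEyes) and shares the same length T:
      grideye1, grideye2: (T, 64) flattened 8x8 thermal frames
      noise, force1, force2, current: (T,) scalar streams
    """
    frames = np.hstack([
        grideye1,            # entries 1..64
        grideye2,            # entries 65..128
        noise[:, None],      # entry 129
        force1[:, None],     # entry 130
        force2[:, None],     # entry 131
        current[:, None],    # entry 132
    ])                       # shape (T, 132)
    # Normalize each dimension so that every feature carries the same
    # initial weight; min-max scaling is one option (the paper does not
    # specify the scheme it uses).
    lo, hi = frames.min(axis=0), frames.max(axis=0)
    return (frames - lo) / np.maximum(hi - lo, 1e-8)
```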
A. Standard Recurrent Neural Networks

We first experiment with a standard recurrent neural network (RNN) model as described in [7]. The network structure is illustrated in Figure 5. The model contains two hidden layers, each with 6 nodes (the same number as the output layer). The second hidden layer is a loop-back layer with self-connected nodes. Mean square error is used as the loss function (as in all other approaches introduced later). Back Propagation Through Time (BPTT) is used for training.

Fig. 5. Recurrent neural network structure. This network contains 132 input nodes, 6 output nodes and two hidden layers, each with 6 nodes. The loop-back nodes in the second hidden layer serve as the internal memory of the network.

B. Long Short-Term Memory (LSTM)

LSTM [8] has seen many successes in recent years. Compared with the classic RNN, LSTM has dedicated structures, i.e., gates, for maintaining memory. In our implementation, as illustrated in Figure 6, we use a two-hidden-layer LSTM model; each layer contains six fully connected LSTM cells. Compared with our RNN model, our LSTM model uses LSTM cells in place of hidden layer nodes.

Fig. 6. LSTM / GRU network structure. These networks contain two hidden layers composed of fully connected LSTM cells or GRU units. There are 132 nodes in the input layer, 6 nodes in the output layer and 6 LSTM cells / GRU units in each of the two hidden layers.

The LSTM model can be described with the following equations:

i = σ(x_t U^i + s_{t-1} W^i)  (1)
f = σ(x_t U^f + s_{t-1} W^f)  (2)
o = σ(x_t U^o + s_{t-1} W^o)  (3)
g = tanh(x_t U^g + s_{t-1} W^g)  (4)
c_t = c_{t-1} ◦ f + g ◦ i  (5)
s_t = tanh(c_t) ◦ o  (6)

Here σ is the sigmoid function and ◦ denotes element-wise multiplication. x_t is the input at time t and s_t is the output of the cell at time t. The U's and W's are weight matrices connecting the various components. Specifically, in our application, for the first hidden layer, x_t is a 1-by-132 vector, s_t is a 1-by-6 vector, the U's are 132-by-6 matrices and the W's are 6-by-6 matrices. For the second hidden layer, x_t and s_t are 1-by-6 vectors and all U's and W's are 6-by-6 matrices. BPTT is used for network training.
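For concreteness, the following NumPy sketch implements one forward step of the layer defined by equations (1)-(6); the shapes correspond to the first hidden layer, and the function and variable names are our own illustration rather than part of the original implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, s_prev, c_prev, U, W):
    """One forward step of the LSTM layer in equations (1)-(6).

    For the first hidden layer:
      x_t:    (1, 132) input vector at time t
      s_prev: (1, 6) previous cell output s_{t-1}
      c_prev: (1, 6) previous internal state c_{t-1}
      U:      dict of (132, 6) input weight matrices, keys 'i', 'f', 'o', 'g'
      W:      dict of (6, 6) recurrent weight matrices, same keys
    """
    i = sigmoid(x_t @ U['i'] + s_prev @ W['i'])   # input gate, Eq. (1)
    f = sigmoid(x_t @ U['f'] + s_prev @ W['f'])   # forget gate, Eq. (2)
    o = sigmoid(x_t @ U['o'] + s_prev @ W['o'])   # output gate, Eq. (3)
    g = np.tanh(x_t @ U['g'] + s_prev @ W['g'])   # candidate, Eq. (4)
    c_t = c_prev * f + g * i                      # internal state, Eq. (5)
    s_t = np.tanh(c_t) * o                        # cell output, Eq. (6)
    return s_t, c_t
```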
C. Gated Recurrent Unit (GRU)

GRU [9] is a recently proposed variation of the LSTM model. The main difference is that, instead of using three gates to control memory updates, a GRU unit uses only two gates. Formally, a GRU model can be described with the following equations:

z = σ(x_t U^z + s_{t-1} W^z)  (7)
r = σ(x_t U^r + s_{t-1} W^r)  (8)
h = tanh(x_t U^h + (s_{t-1} ◦ r) W^h)  (9)
s_t = (1 − z) ◦ h + z ◦ s_{t-1}  (10)

As in LSTM, σ is the sigmoid function, x_t is the input at time t, h is the output and s_t is the internal state of a GRU unit at time t. The sizes of the U's and W's are the same as in LSTM. BPTT is used to train the weights in the U's and W's. Essentially, we use the same network structure as in our LSTM implementation, replacing LSTM cells with GRU units.

D. Meta-Layer Network

Since RNN, LSTM and GRU models all have structures for "remembering" data seen previously, we wondered what would happen if we explicitly provided such information to a neural network classifier. Thus, we experiment with a "meta-layer" network composed of a set of pre-trained sub-networks and a fully connected overlay. The network structure is shown in Figure 7. The input layer is composed of n sets of input nodes (n is the number of time steps contained in a data sequence), each containing 132 nodes. Each input set is connected to a 6-node set in the middle layer via a pre-trained perceptron module. The middle layer, consisting of these n 6-node sets, is fully connected to the 6 final output nodes; these connections are trained with back propagation.

Fig. 7. Meta-layer network structure. The input layer is composed of n sets of 132 input nodes. The middle layer is composed of n sets of 6 nodes. Connections from each input set to its corresponding middle-layer set are pre-trained via the standard perceptron model. The output layer is composed of 6 nodes, as in the other models. The middle layer and the output layer are fully connected.
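To make the structure in Figure 7 concrete, the sketch below gives one plausible forward pass of the meta-layer network. The sigmoid activations, the absence of bias terms and the exact handling of the pre-trained perceptron modules are our assumptions; the paper specifies only the layer sizes and which connections are pre-trained versus trained with back propagation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def meta_layer_forward(sequence, module_weights, overlay_W):
    """Forward pass of the meta-layer network on one data sequence.

    sequence:       (n, 132) array of n consecutive time-step vectors
    module_weights: list of n (132, 6) matrices, one pre-trained
                    single-frame perceptron module per time step
    overlay_W:      (n * 6, 6) weights of the fully connected overlay,
                    trained with back propagation
    Returns a (6,) vector of activity scores.
    """
    # Middle layer: each time step is mapped to a 6-node score set by its
    # own pre-trained perceptron module.
    middle = np.concatenate(
        [sigmoid(x @ W) for x, W in zip(sequence, module_weights)]
    )                                   # shape (n * 6,)
    # Output layer: the n score sets are fully connected to the 6 output
    # nodes; these are the connections trained with back propagation.
    return sigmoid(middle @ overlay_W)
```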
E. Algorithm Comparison

To evaluate the different approaches, for each configuration we perform a 7-fold cross-validation with 30 runs (each with different initial random weights) on a data set of 9218 time steps. Mean results are then reported. Figures 8 and 9 summarize the performance of the developed algorithms in terms of average precision (p) and recall (r), calculated as follows:

p = True Positive / (True Positive + False Positive)  (11)
r = True Positive / (True Positive + False Negative)  (12)

To put our results into context, we also use a baseline result obtained from a single-layer perceptron model trained and classified with data at individual time steps. The precision and recall of this model are 0.752 and 0.745, respectively, and its training time is 17.42 seconds. All experiments have been performed on a computer with an Intel i7 CPU at 3.5 GHz and 16 GB of RAM.

Fig. 8. Average precision for the classification algorithms. The x-axis is the number of time steps contained in a data sequence; the y-axis is the average precision.
Fig. 9. Average recall for the classification algorithms. The x-axis is the number of time steps contained in a data sequence; the y-axis is the average recall.

Figure 10 shows the training time for all developed algorithms.

Fig. 10. Training time for the developed algorithms (measured in seconds). The x-axis is the number of time steps contained in a data sequence; the y-axis is the average training time. Note that the reported training time for the meta-layer approach does not include the pre-training of its perceptron modules.

Overall, we observe that:
1) all four approaches that use data sequences outperform the baseline approach;
2) both LSTM and GRU perform better than the standard RNN model, as seen in many other applications, although the difference is small;
3) training time grows as the length of the data sequence grows;
4) increasing the length of the data sequence does not increase the classification performance proportionally; and
5) the meta-layer based approach outperforms all other approaches.

These results came as a surprise, as we expected LSTM and GRU to outperform the straightforward meta-layer approach. Clearly, since the classification result at time t should be the same as the result at time t − 1 in most cases, "remembering" previous information should improve classification performance. This hypothesis is valid in the sense that all of our models outperform the baseline approach, which only uses information available at a single time step. However, our results also show that when data sequence information can be used directly, introducing dedicated memory modules in a network structure provides no benefit.

III. RELATED WORK

Activity recognition has been a long-standing problem in multi-sensor system studies. We review some recent studies on this topic in this section. Note that we purposefully omit computer vision based approaches such as [10], [11], [12], [13], [14] to focus on sensor systems that are less privacy invasive.

E. M. Tapia et al. [15] present an early work on activity recognition in a home-like environment. They use a large number of low-cost state-change sensors, e.g., switches and movement sensors. For classification, they use a Naive Bayes classifier. As shown in their work, Naive Bayes may not be the most suitable method for this task, as they report accuracy in the 50-60% range over 15 activity classes in total, with five activity classes recognized no better than random guesses.

Bao and Intille [16] introduce an early work on activity recognition with wearable devices. Five biaxial accelerometers are attached to four limb positions of a testing subject. Twenty activity classes are selected for recognition. Both Decision Tree and Naive Bayes classifiers are tested in their work, with Decision Trees giving better performance than Naive Bayes (80% and 50% accuracy, respectively).

Maurer et al. [17] present another early work on activity recognition with wearable devices. Six "eWatches", each equipped with a dual-axis accelerometer, a light sensor, a temperature sensor and a microphone, are attached to a testing subject. Decision Tree, Naive Bayes and k-Nearest Neighbor classifiers are used in their experiments. Six activities were considered, with mixed results.

Uddin et al. [18] use a hand-worn wearable device with a 9-axis inertial measurement unit for hand/arm movement based activity detection. Their computation is mostly statistical thresholding, with little involvement of any machine learning technique.

Ince et al. [19] report a work on classifying three activities, washing face, shaving and brushing, from 2-axis arm-worn accelerometer data. They take a model based approach, combining Gaussian Mixture Models with finite state machines. The Gaussian mixture models output the likelihood of each individual activity; the finite state machines are used to reason temporally, taking previously identified activities into consideration.

Hong et al. [20] present a work on activity recognition based on temporal pattern representations of abstract sensor models. In their work, each activity is represented as a sequence of sensor events, and recognizing an activity amounts to identifying specific sequences in the collected data, which is itself a long sequence of sensor events. They report accuracy and recall in the 90% range on a data set containing seven activities.

Kasteren et al. [21] present another temporal reasoning based activity recognition work. Both environmental sensors, e.g., switches and motion sensors, and wearable devices, e.g., accelerometers, are used as data sources in their experiment. Hidden Markov Models and Conditional Random Fields are used as learning techniques. The reported classification accuracy and recall are in the 70-80% range.
Okeyo et al. [22] present a rule-based activity classification work using a customized activity ontology. Their work is similar to [20] in that activities are classified based on sequences of events. Only simulated events are used in their experiments.

Buettner et al. [23] introduce an early work on activity recognition through interaction with RFID-equipped objects. RFID tags are attached to 25 everyday objects, with 12 RFID antennas placed throughout the testing environment. Movements of the RFID-tagged objects are detected and used for activity classification. A Hidden Markov Model is used as the main classification technique. Fourteen activities are classified, with precision and recall both in the 90% range.

Li et al. [24] present another activity classification work based on user interaction with RFID-equipped objects. A two-step approach is taken for activity recognition: first, object movements are identified based on changes in RFID signal properties; second, activities are inferred from these object movements. A Support Vector Machine is used as the main learning technique.

Hevesi et al. [25] present a work on activity recognition with thermal sensor arrays exclusively. Their sensor placement differs from ours in that the entire environment is observable from a single GridEye sensor. In their paper, the learning algorithm is not described apart from reporting "a standard statistical classifier trained using appropriate ground truth".

Lotfi et al. [26] present an abnormal behavior detection work in a home-like environment similar to ours. Since they mainly use movement sensors and door entry-point sensors, temporal reasoning methods, e.g., recurrent neural networks (RNN), are used in their work. However, no explicit activity recognition is reported in this work, only a binary classification of behavioral abnormality.

Ordóñez et al. [27] present a work similar to [26] in that it only concerns distinguishing "abnormal" behaviors from "normal" ones without explicitly recognizing activities. Instead of using RNNs, or any other machine learning based algorithm, they use a Bayesian model based approach that hypothesizes "normal" sensor event patterns. If a sensor event pattern fails to fit the normal patterns, an abnormality is reported.

IV. CONCLUSION

In this work, we have compared the performance of several deep learning algorithms for activity recognition with sensor data collected in a home-like environment. From data collected by six different sensors, we aim to classify six activities. We have experimented with classifiers including standard recurrent neural networks, long short-term memory models, gated recurrent unit models and a simple meta-layer network model. It is observed that the straightforward meta-layer network model outperforms the memory based models.

In the future, we would like to explore the following directions. First, we will experiment with other sensor combinations for activity recognition; in particular, we would like to experiment with configurations that include movement sensors. Second, we would like to explore the "reusability" of the obtained models by differentiating the training environment from the testing environment; we will investigate algorithms that support "train at one place, classify at many places". Lastly, we will explore the possibility of introducing argumentation-based [28] activity recognition algorithms [29] into the data sequence models presented in this work for better recognition explanation.
ACKNOWLEDGMENTS

This research was supported by the National Research Foundation Singapore under its Interactive Digital Media (IDM) Strategic Research Programme.

REFERENCES

[1] D. J. Cook and S. K. Das, "How smart are our environments? An updated look at the state of the art," Pervasive and Mobile Computing, vol. 3, no. 2, pp. 53–73, 2007.
[2] M. C. Mozer, "The neural network house: An environment that adapts to its inhabitants," in Proc. AAAI Spring Symp. Intelligent Environments, 1998, pp. 110–114.
[3] H. Sun, V. D. Florio, N. Gui, and C. Blondia, "Promises and challenges of ambient assisted living systems," CoRR, vol. abs/1507.05765, 2015.
[4] A. K. S. Kushwaha and R. Srivastava, "Multiview human activity recognition system based on spatiotemporal template for video surveillance system," J. Electronic Imaging, vol. 24, no. 5, 2015.
[5] M. Concepción, L. Morillo, J. García, and L. González-Abril, "Mobile activity recognition and fall detection system for elderly people using Ameva algorithm," Pervasive and Mobile Computing, 2016.
[6] J. Klonovs, M. A. Haque, V. Krüger, K. Nasrollahi, K. Andersen-Ranberg, T. B. Moeslund, and E. G. Spaich, Distributed Computing and Monitoring Technologies for Older Patients. Springer, 2016.
[7] Y. Bengio, P. Y. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Trans. Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.
[8] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[9] K. Cho, B. van Merrienboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in Proc. EMNLP, Doha, Qatar, 2014, pp. 1724–1734.
[10] J. Sung, C. Ponce, B. Selman, and A. Saxena, "Unstructured human activity detection from RGBD images," in Proc. ICRA, 2012, pp. 842–849.
[11] L. Piyathilaka and S. Kodagoda, "Gaussian mixture based HMM for human daily activity recognition using 3D skeleton features," in Proc. ICIEA, 2013, pp. 567–572.
[12] K. Avgerinakis, A. Briassouli, and I. Kompatsiaris, "Activity detection and recognition of daily living events," in Proc. MIIRH, 2013, pp. 3–10.
[13] H. Pirsiavash and D. Ramanan, "Detecting activities of daily living in first-person camera views," in Proc. CVPR, Providence, RI, USA, 2012, pp. 2847–2854.
[14] X. Ren and M. Philipose, "Egocentric recognition of handled objects: Benchmark and analysis," in Proc. CVPR Workshops, Miami, FL, USA, 2009, pp. 1–8.
[15] E. M. Tapia, S. S. Intille, and K. Larson, "Activity recognition in the home using simple and ubiquitous sensors," in Proc. PERVASIVE, 2004, pp. 158–175.
[16] L. Bao and S. S. Intille, "Activity recognition from user-annotated acceleration data," in Proc. PERVASIVE, 2004, pp. 1–17.
[17] U. Maurer, A. Smailagic, D. P. Siewiorek, and M. Deisher, "Activity recognition and monitoring using multiple sensors on different body positions," in Proc. BSN, 2006, pp. 113–116.
[18] M. Uddin, A. Salem, I. Nam, and T. Nadeem, "Wearable sensing framework for human activity monitoring," in Proc. WearSys, 2015, pp. 21–26.
[19] N. F. Ince, C. Min, and A. H. Tewfik, "A feature combination approach for the detection of early morning bathroom activities with wireless sensors," in Proc. SIGMOBILE, 2007, pp. 61–63.
[20] X. Hong, C. D. Nugent, M. D. Mulvenna, S. Martin, S. Devlin, and J. G. Wallace, "Dynamic similarity-based activity detection and recognition within smart homes," Int. J. Pervasive Computing and Communications, vol. 8, no. 3, pp. 264–278, 2012.
[21] T. Kasteren, A. K. Noulas, G. Englebienne, and B. J. A. Kröse, "Accurate activity recognition in a home setting," in Proc. UbiComp, 2008, pp. 1–9.
[22] G. Okeyo, L. Chen, H. Wang, and R. Sterritt, "Dynamic sensor data segmentation for real-time knowledge-driven activity recognition," Pervasive and Mobile Computing, vol. 10, pp. 155–172, 2014.
[23] M. Buettner, R. Prasad, M. Philipose, and D. Wetherall, "Recognizing daily activities with RFID-based sensors," in Proc. UbiComp, 2009, pp. 51–60.
[24] H. Li, C. Ye, and A. P. Sample, "IDSense: A human object interaction detection system based on passive UHF RFID," in Proc. CHI, 2015, pp. 2555–2564.
[25] P. Hevesi, S. Wille, G. Pirkl, N. Wehn, and P. Lukowicz, "Monitoring household activities and user location with a cheap, unobtrusive thermal sensor array," in Proc. UbiComp, 2014, pp. 141–145.
[26] A. Lotfi, C. S. Langensiepen, S. M. Mahmoud, and M. J. Akhlaghinia, "Smart homes for the elderly dementia sufferers: identification and prediction of abnormal behaviour," J. Ambient Intelligence and Humanized Computing, vol. 3, no. 3, pp. 205–218, 2012.
[27] F. J. Ordóñez, P. Toledo, and A. Sanchis, "Sensor-based Bayesian detection of anomalous living patterns in a home setting," Personal and Ubiquitous Computing, vol. 19, no. 2, pp. 259–270, 2015.
[28] S. Modgil, F. Toni, F. Bex, I. Bratko, C. Chesñevar, W. Dvořák, M. Falappa, X. Fan, S. Gaggl, A. García, M. González, T. Gordon, J. Leite, M. Možina, C. Reed, G. Simari, S. Szeider, P. Torroni, and S. Woltran, "The added value of argumentation," in Agreement Technologies. Springer, 2013, vol. 8, pp. 357–403.
[29] X. Fan, H. Zhang, C. Leung, and C. Miao, "A first step towards explained activity recognition with computational abstract argumentation," in Proc. IEEE MFI, 2016, in press.