Comparative Study of Machine Learning Algorithms for Activity
Recognition with Data Sequence in Home-like Environment
Xiuyi Fan1, Huiguo Zhang1, Cyril Leung2, Chunyan Miao1
Abstract— Activity recognition is a key problem in multi-sensor systems. With data collected from different sensors, a multi-sensor system identifies activities performed by its inhabitants. Since an activity always lasts a certain duration, it is beneficial to use data sequences for the desired recognition. In this work, we experiment with several machine learning techniques, including Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM), Gated Recurrent Units (GRU) and a Meta-Layer Network, for solving this problem. We observe that (1) compared with "single-frame" activity recognition, data sequence based classification gives better performance; and (2) directly using data sequence information with a simple "meta-layer" network model yields better performance than memory based deep learning approaches.
I. INTRODUCTION
Activity recognition has been a key problem in multi-sensor system studies [1]. In a smart home environment, multiple sensors, usually of different types, collect data continuously [2]. These sensors are connected to a central computation device that runs data analytics algorithms. As a result, the inhabitants' activities are monitored and identified, and abnormalities are detected. Applications of activity recognition have been seen in assisted living systems [3], surveillance systems [4], elderly fall detection systems [5] and patient monitoring systems [6]. Many activity recognition systems with different sensor configurations have been reported in the literature (see Section III).
Multi-sensor systems differ from each other as sensor configurations and activity recognition requirements differ between systems. Typical sensors include force sensors, switches, movement sensors, etc. Recognition requirements vary in classification frequency (real-time or not), expected classification accuracy, etc. In this paper, we report results of experimenting with several deep learning based approaches on activity data collected in our simulated smart home environment.
Our hypothesis is that, since each activity lasts a certain duration, the classification result obtained at some time step t should, in most cases, be the same as the result obtained at time step t + 1; thus, the ability to "remember" previously seen information should improve overall performance. Recurrent neural networks (RNN), long short-term memory (LSTM) and gated recurrent units (GRU) are deep learning models with such remembering capabilities. We study their performance in this work.
1 Joint NTU-UBC Research Centre of Excellence in Active Living for the Elderly (LILY), Nanyang Technological University, Singapore
2 Electrical and Computer Engineering, The University of British Columbia, Canada
The remainder of this paper is organized as follows. We present our classifiers in Section II, along with their performances. Section III discusses related work. We conclude in Section IV.
II. CLASSIFIERS AND EXPERIMENTS
In this work, we experiment with a smart home equipped
with the following sensors:
• Two Grid-Eye Infrared Array Sensors1 (GridEye 1 & 2),
• Two force sensors (Force 1 & 2),
• One noise sensor (Noise), and
• One electric current detector (Current).
Sensor placement is summarized in Table I, with the experiment environment layout illustrated in Figure 1. A GridEye thermal array sensor outputs an 8-pixel by 8-pixel thermal image in its 120-degree field of view at 2Hz. A force sensor outputs integers representing the force applied to the sensor whenever that force changes. The noise sensor measures ambient noise and outputs integers indicating the noise level at 0.3Hz. The current detector measures the AC current consumption of the TV and outputs a Boolean value indicating whether the TV is on. Sample sensor outputs during one experiment session are shown in Figure 2. Outputs from the two GridEyes at a single time step are shown in Figure 3.
TABLE I: Sensor placement in our experiment environment. The placement of sensors A - F can be seen in Figure 1.

A   GridEye 1   Ceiling above dining table
B   GridEye 2   Ceiling above bed
C   Force 1     Sofa legs
D   Force 2     Surface of dining chair
E   Noise       Next to TV
F   Current     TV's power plug
The five activities we aim to recognize are: 1) eat, 2) watch TV, 3) read books, 4) sleep, and 5) friend visit; everything else is labelled 6) other. We assume that at any moment, there is one and only one activity, including "other", taking place. In our experiments, for eating, the testing subject sits by the dining table and consumes a snack. For watching TV and reading books, he sits on the same sofa with the TV on and off, respectively.
1 https://na.industrial.panasonic.com/products/sensors/sensors-automotiveindustrial-applications/grid-eye-infrared-array-sensor
Fig. 1. Illustration of the testing environment.
For sleeping, the testing subject lies on the bed with little movement. For friend visit, an additional testing subject enters the room and both sit at the dining table carrying out a conversation.
For data collection, we performed eight experiment runs with four individuals, each person performing the five activities twice; the time spent shifting between activities is labelled as "other". In our experiments, each activity lasts two to three minutes in every run. All activities are manually labelled.
Several data pre-processing steps have been taken. Firstly, we interpolate data from the different sensors based on the rate of the sensor with the highest frequency. At each time step, we concatenate data from all sensors into a single vector. Thus, at any time step t, we obtain a 1-by-132 vector containing the outputs from all six sensors (see Figure 4). We then normalize our data so that each data dimension carries the same initial weight in the calculation. As a result, we operate on a data stream of temporally ordered vectors.
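As an illustration, this pipeline (interpolation onto the fastest sensor's clock, concatenation into a 1-by-132 vector, per-dimension normalization) might be sketched in Python as below. The sensor names, the layout of `recordings`, and the use of linear interpolation with z-score normalization are our assumptions; the paper does not specify the exact schemes.

```python
import numpy as np

# A minimal sketch.  `recordings` maps each sensor name to a pair
# (timestamps, values); values has one row per reading (64 columns per
# GridEye frame, 1 column for the scalar sensors).  All names are hypothetical.
def build_feature_stream(recordings, t_common):
    columns = []
    for name in ["grideye1", "grideye2", "noise", "force1", "force2", "current"]:
        t, v = recordings[name]
        v = np.atleast_2d(np.asarray(v, dtype=float))
        if v.shape[0] != len(t):  # scalar streams arrive as a single row
            v = v.T
        # Linearly interpolate every dimension onto the common clock
        # (the clock of the fastest sensors, the 2 Hz GridEyes).
        columns.append(np.stack(
            [np.interp(t_common, t, v[:, d]) for d in range(v.shape[1])],
            axis=1))
    x = np.concatenate(columns, axis=1)  # one 1-by-132 vector per time step
    # Normalize so every dimension carries the same initial weight.
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)
```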
Fig. 2. Normalized data from the two GridEyes (mean values), the noise sensor and the current detector during one experiment run containing all six activities (including "other"). The x-axes show time at 0.5-second resolution; the y-axes show sensor outputs.
A. Standard Recurrent Neural Networks
We first experiment with a standard recurrent neural
network (RNN) model as described in [7]. The network
structure is illustrated in Figure 5. The model contains two
hidden layers, each with 6 nodes (same number as the output
layer). The second hidden layer is a loop-back layer with self-connected nodes. Mean squared error is used as the loss function (as in all other approaches introduced later). Back Propagation Through Time (BPTT) is used for training.
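For concreteness, a sketch of this architecture in PyTorch (our choice of framework for illustration; the paper does not name its implementation) might look as follows. The tanh activation on the first hidden layer and classifying from the last time step are assumptions.

```python
import torch
import torch.nn as nn

class SimpleRNNClassifier(nn.Module):
    """Sketch of the RNN in Figure 5: 132 inputs, a feed-forward first
    hidden layer, a self-connected (loop-back) second hidden layer and
    6 output nodes."""
    def __init__(self, n_in=132, n_hidden=6, n_out=6):
        super().__init__()
        self.fc1 = nn.Linear(n_in, n_hidden)   # first hidden layer
        self.rnn = nn.RNN(n_hidden, n_hidden,  # loop-back second layer
                          batch_first=True)
        self.out = nn.Linear(n_hidden, n_out)

    def forward(self, x):            # x: (batch, seq_len, 132)
        h = torch.tanh(self.fc1(x))  # activation choice is an assumption
        h, _ = self.rnn(h)           # BPTT is handled by autograd
        return self.out(h[:, -1])    # classify from the last time step

model = SimpleRNNClassifier()
loss_fn = nn.MSELoss()  # mean squared error against one-hot activity labels
```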
Fig. 3. Sample output from GridEye 1 & 2 (single frame).

Fig. 4. Sensor data layout in vector format: dimensions 1-64 are GridEye 1; 65-128 are GridEye 2; 129 is Noise; 130 is Force 1; 131 is Force 2; 132 is Current.

Fig. 5. Recurrent neural network structure. This network contains 132 input nodes, 6 output nodes and two hidden layers, each with 6 nodes. The loop-back nodes in the second hidden layer serve as the internal memory of the network.

B. Long Short-Term Memory (LSTM)
LSTM [8] has seen many successes in recent years. Compared with the classic RNN, LSTM has dedicated structures, i.e., gates, for maintaining memory. In our implementation, as illustrated in Figure 6, we use a two-hidden-layer LSTM model; each layer contains six fully connected LSTM cells. Compared with our RNN model, our LSTM model uses LSTM cells in place of hidden-layer nodes. The LSTM model can be described with the following equations:

$i = \sigma(x_t U^i + s_{t-1} W^i)$  (1)
$f = \sigma(x_t U^f + s_{t-1} W^f)$  (2)
$o = \sigma(x_t U^o + s_{t-1} W^o)$  (3)
$g = \tanh(x_t U^g + s_{t-1} W^g)$  (4)
$c_t = c_{t-1} \circ f + g \circ i$  (5)
$s_t = \tanh(c_t) \circ o$  (6)

σ is the sigmoid function; ◦ denotes element-wise multiplication; x_t is the input at time t; s_t is the output of the cell at time t. The U and W matrices connect the various components. Specifically, in our application, for the first hidden layer, x_t is a 1-by-132 vector, s_t is a 1-by-6 vector, the U matrices are 132-by-6 and the W matrices are 6-by-6. For the second hidden layer, x_t and s_t are 1-by-6 vectors and all U and W matrices are 6-by-6. BPTT is used for network training.

Fig. 6. LSTM / GRU network structure. These networks contain two hidden layers composed of fully connected LSTM cells or GRU units. There are 132 nodes in the input layer, 6 nodes in the output layer and 6 LSTM cells / GRU units in each of the two hidden layers.
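A direct numpy rendering of equations (1)-(6) may make the shapes concrete. Bias terms are omitted, matching the formulation above; the dictionary-of-matrices layout is our own convenience.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_cell_step(x_t, s_prev, c_prev, U, W):
    """One step of equations (1)-(6).  U maps 'i','f','o','g' to the
    input-to-cell matrices (132-by-6 in the first hidden layer) and W
    to the 6-by-6 recurrent matrices; x_t is 1-by-132, s_prev and
    c_prev are 1-by-6."""
    i = sigmoid(x_t @ U['i'] + s_prev @ W['i'])  # input gate,  eq. (1)
    f = sigmoid(x_t @ U['f'] + s_prev @ W['f'])  # forget gate, eq. (2)
    o = sigmoid(x_t @ U['o'] + s_prev @ W['o'])  # output gate, eq. (3)
    g = np.tanh(x_t @ U['g'] + s_prev @ W['g'])  # candidate,   eq. (4)
    c_t = c_prev * f + g * i                     # cell state,  eq. (5)
    s_t = np.tanh(c_t) * o                       # cell output, eq. (6)
    return s_t, c_t
```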
C. Gated Recurrent Unit (GRU)
GRU [9] is a recently proposed variation of the LSTM model. The main difference is that, instead of using three gates to control memory updates, a GRU unit uses only two gates. Formally, a GRU model can be described with the following equations:

$z = \sigma(x_t U^z + s_{t-1} W^z)$  (7)
$r = \sigma(x_t U^r + s_{t-1} W^r)$  (8)
$h = \tanh(x_t U^h + (s_{t-1} \circ r) W^h)$  (9)
$s_t = (1 - z) \circ h + z \circ s_{t-1}$  (10)

As in LSTM, σ is the sigmoid function, x_t is the input at time t, h is the output, and s_t is the internal state of a GRU unit at time t. The sizes of the U and W matrices are the same as in LSTM. BPTT is used to train the weights in U and W. Essentially, we use the same network structure as our LSTM implementation, replacing LSTM cells with GRU units.
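Analogously, equations (7)-(10) can be sketched as a single update function, with the same shape conventions as the LSTM sketch above:

```python
import numpy as np

def sigmoid(a):  # as in the LSTM sketch
    return 1.0 / (1.0 + np.exp(-a))

def gru_unit_step(x_t, s_prev, U, W):
    """One step of equations (7)-(10); shapes as in the LSTM sketch."""
    z = sigmoid(x_t @ U['z'] + s_prev @ W['z'])        # update gate, eq. (7)
    r = sigmoid(x_t @ U['r'] + s_prev @ W['r'])        # reset gate,  eq. (8)
    h = np.tanh(x_t @ U['h'] + (s_prev * r) @ W['h'])  # candidate,   eq. (9)
    return (1 - z) * h + z * s_prev                    # new state,   eq. (10)
```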
D. Meta-Layer Network
Since the RNN, LSTM and GRU models all have structures for "remembering" previously seen data, we wondered what would happen if we explicitly provided such information to a neural network classifier. Thus, we experiment with a "meta-layer" network composed of a set of pre-trained sub-networks and a fully connected overlay. The network structure is shown in Figure 7. The input layer is composed of n sets of input nodes (n is the number of time steps contained in a data sequence), each containing 132 nodes. Each set of inputs is connected to a 6-node set in the middle layer via a pre-trained perceptron module. The n 6-node sets of the middle layer are fully connected to the 6 final output nodes. These connections are trained with back propagation.

Fig. 7. Meta-layer network structure. The input layer is composed of n sets of 132 input nodes. The middle layer is composed of n sets of 6 nodes. Connections from each input set to its corresponding middle set are pre-trained via the standard perceptron model. The output layer is composed of 6 nodes as in the other models. The middle layer and the output layer are fully connected.
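A possible PyTorch reading of this design is sketched below. Whether the pre-trained perceptron modules share weights across the n time steps, and whether they stay frozen while the overlay is trained, are not fully specified in the text; this sketch shares one frozen module, which is our assumption.

```python
import torch
import torch.nn as nn

class MetaLayerNet(nn.Module):
    """Sketch of Figure 7: pre-trained single-frame perceptrons feed a
    middle layer of n 6-node sets, fully connected to 6 output nodes."""
    def __init__(self, pretrained: nn.Module, n_steps: int):
        super().__init__()
        self.frame_net = pretrained            # pre-trained 132 -> 6 module
        for p in self.frame_net.parameters():  # keeping it frozen is our
            p.requires_grad = False            # assumption
        self.overlay = nn.Linear(6 * n_steps, 6)

    def forward(self, x):                      # x: (batch, n_steps, 132)
        mids = self.frame_net(x)               # middle layer: (batch, n_steps, 6)
        return self.overlay(mids.flatten(1))   # fully connected overlay

# Example: pre-train a single-frame perceptron elsewhere, then wrap it.
base = nn.Sequential(nn.Linear(132, 6), nn.Sigmoid())
net = MetaLayerNet(base, n_steps=10)
```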
E. Algorithm Comparison
To evaluate the different approaches, for each configuration we perform 7-fold cross-validation over 9218 time steps, with 30 runs (each with different initial random weights). Mean results are then reported. Figures 8 and 9 summarize the performances of the developed algorithms in terms of average precision (p) and recall (r), calculated as follows:

$p = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}$,  (11)

$r = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}$.  (12)

To put our results into context, we also use a baseline result obtained from a single-layer perceptron model trained and classified with data at individual time steps. The precision and recall of this model are 0.752 and 0.745, respectively. The training time for the baseline model is 17.42 seconds. All experiments have been performed on a computer with an Intel i7 CPU at 3.5GHz and 16GB of RAM.

Fig. 8. Average precision for the classification algorithms. The x-axis is the number of time steps contained in a data sequence; the y-axis is the average precision.

Fig. 9. Average recall for the classification algorithms. The x-axis is the number of time steps contained in a data sequence; the y-axis is the average recall.

Fig. 10. Training time for the developed algorithms (measured in seconds). The x-axis is the number of time steps contained in a data sequence; the y-axis is the average training time. Note that the reported training time for the meta-layer approach does not include the pre-training of its perceptron modules.

Figure 10 shows the training time for all developed algorithms. Overall, we observe that:
1) all four approaches that use data sequences outperform the baseline approach;
2) both LSTM and GRU perform better than the standard RNN model, as seen in many other applications, although the difference is small;
3) training time grows as the length of the data sequence grows;
4) increasing the length of the data sequence does not increase the classification performance proportionally;
5) the meta-layer based approach outperforms all other approaches.

These results came as a surprise, as we expected LSTM and GRU to outperform the straightforward meta-layer approach: since the classification at time t should, in most cases, be the same as the result at time t − 1, "remembering" previous information should improve classification performance. This hypothesis is valid in the sense that all of our models outperform the baseline approach, which only uses information available at a single time step. However, our results also show that when data sequence information can be used directly, introducing dedicated memory modules in a network structure shows no benefit.
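As a concrete reading of equations (11) and (12), the sketch below computes per-class precision and recall and averages them over the six classes. Whether the paper macro-averages (as here) or micro-averages is our assumption.

```python
import numpy as np

def avg_precision_recall(y_true, y_pred, n_classes=6):
    """Per-class precision (eq. 11) and recall (eq. 12), macro-averaged."""
    p, r = [], []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        p.append(tp / (tp + fp) if tp + fp else 0.0)
        r.append(tp / (tp + fn) if tp + fn else 0.0)
    return float(np.mean(p)), float(np.mean(r))
```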
III. RELATED WORK
Activity recognition has been a long-standing problem in multi-sensor system studies. We review some recent studies on this topic in this section. Note that we purposefully omit computer vision based approaches such as [10], [11], [12], [13], [14] to focus on sensor systems that are less privacy-invasive.
E. M. Tapia et al. [15] present an early work on activity recognition in a home-like environment. They use a large number of low-cost state-change sensors, e.g., switches and movement sensors. For classification, they use a Naive Bayes classifier. As shown in their work, Naive Bayes may not be the most suitable method for this task: they report accuracy in the 50-60% range over 15 activity classes in total, and five of the activity classes are recognized no better than random guessing.
Bao and Intille [16] introduce an early work on activity recognition with wearable devices. Five biaxial accelerometers are attached to four limb positions of a testing subject, and 20 activity classes are selected for recognition. Both Decision Tree and Naive Bayes classifiers are experimented with in their work, with Decision Trees giving better performance than Naive Bayes at 80% and 50% accuracy, respectively.
Maurer et al. [17] present another early work on activity recognition with wearable devices. Six "eWatches", each equipped with a dual-axis accelerometer, light and temperature sensors and a microphone, are attached to a testing subject. Decision Tree, Naive Bayes and k-Nearest Neighbor classifiers are used in their experiments. Six activities were considered, with mixed results.
Uddin et al. [18] use a hand-worn wearable device with a 9-axis inertial measurement unit for hand/arm movement based activity detection. Their computation is mostly statistical thresholding, with little involvement of any machine learning technique.
Ince et al. [19] report a work on classifying three activities, washing face, shaving and brushing, from 2-axis arm-worn accelerometer data. They take a model based approach by combining Gaussian Mixture Models with finite state machines: the Gaussian mixture models output the likelihood of performing each individual activity, while the finite state machines reason temporally, taking previously identified activities into consideration.
Hong et al. [20] present a work on activity recognition based on temporal pattern representations of abstract sensor models. In their work, each activity is represented as a sequence of sensor events; to recognize an activity is to identify specific sequences in the collected data, which is itself a long sequence of sensor events. They report accuracies and recalls in the 90% range on a data set containing seven activities.
Kasteren et al. [21] present another temporal reasoning based activity recognition work. Both environmental sensors, e.g., switches and motion sensors, and wearable devices, e.g., accelerometers, are used as data sources in their experiments. Hidden Markov Models and Conditional Random Fields are used as learning techniques. The reported classification accuracy and recall are in the 70-80% range.
Okeyo et al. [22] present a rule based activity classification work using a customized activity ontology. Their work is similar to [20] in that activities are classified based on sequences of events. Only simulated events are used in their experiments.
Buettner et al. [23] introduce an early work on activity recognition through interaction with RFID-equipped objects. RFID tags are attached to 25 everyday objects, with 12 RFID antennas placed throughout the testing environment. Movements of the tagged objects are detected and used for activity classification. A Hidden Markov Model is used as the main classification technique. 14 activities are classified, with precision and recall both in the 90% range.
Li et al. [24] present another work on activity classification based on user interaction with RFID-equipped objects. A two-step approach is taken for activity recognition: firstly, object movements are identified based on changes in RFID signal properties; secondly, activities are inferred from the object movements. A Support Vector Machine is used as the main learning technique.
Hevesi et al. [25] present a work on activity recognition with thermal sensor arrays exclusively. Their sensor placement differs from ours in that the entire environment is observable from a single GridEye sensor. In their paper, the learning algorithm is not described beyond reporting "a standard statistical classifier trained using appropriate ground truth".
Lotfi et al. [26] present an abnormal behavior detection work in a home-like environment similar to ours. Since they mainly use movement sensors and door entry point sensors, temporal reasoning methods, e.g., recurrent neural networks (RNN), are used in their work. However, no explicit activity recognition is reported, only a binary classification of behavioral abnormality.
Ordóñez et al. [27] present a work similar to [26] in that it concerns only distinguishing "abnormal" behaviors from "normal" ones, without explicitly recognizing activities. Instead of using an RNN, or any other machine learning based algorithm, they take a Bayesian model based approach that hypothesizes "normal" sensor event patterns. If a sensor event pattern fails to fit the normal pattern, an abnormality is reported.
IV. CONCLUSION
In this work, we have compared the performances of several deep learning algorithms for activity recognition with sensor data collected in a home-like environment. From data collected by six different sensors, we aim to classify six activities. We have experimented with classifiers including standard recurrent networks, long short-term memory models, gated recurrent unit models and a simple meta-layer network model. We observe that the straightforward meta-layer network model outperforms the memory based models.
In the future, we would like to explore the following directions. Firstly, we will experiment with activity recognition using other sensor combinations; in particular, we would like to experiment with configurations that include movement sensors. Secondly, we would like to explore the "reusability" of the obtained models by differentiating between the training environment and the testing environment; we will investigate algorithms that support "train at one place, classify at many places". Lastly, we will explore the possibility of introducing argumentation [28] based activity recognition algorithms [29] into the data sequence models presented in this work for better recognition explanation.
ACKNOWLEDGMENTS
This research was supported by the National Research
Foundation Singapore under its Interactive Digital Media
(IDM) Strategic Research Programme.
REFERENCES
[1] D. J. Cook and S. K. Das, “How smart are our environments? an
updated look at the state of the art,” Pervasive and Mobile Computing,
vol. 3, no. 2, pp. 53–73, 2007.
[2] M. C. Mozer, “The neural network house: An environment that
adapts to its inhabitants,” in Proc. AAAI Spring Symp. Intelligent
Environments, 1998, pp. 110–114.
[3] H. Sun, V. D. Florio, N. Gui, and C. Blondia, “Promises and challenges
of ambient assisted living systems,” CoRR, vol. abs/1507.05765, 2015.
[4] A. K. S. Kushwaha and R. Srivastava, “Multiview human activity
recognition system based on spatiotemporal template for video surveillance system,” J. Electronic Imaging, vol. 24, no. 5, 2015.
[5] M. Concepción, L. Morillo, J. García, and L. González-Abril, “Mobile
activity recognition and fall detection system for elderly people using
ameva algorithm,” Pervasive and Mobile Computing, 2016.
[6] J. Klonovs, M. A. Haque, V. Krüger, K. Nasrollahi, K. Andersen-Ranberg, T. B. Moeslund, and E. G. Spaich, Distributed Computing
and Monitoring Technologies for Older Patients. Springer, 2016.
[7] Y. Bengio, P. Y. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Trans. Neural
Networks, vol. 5, no. 2, pp. 157–166, 1994.
[8] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural
Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[9] K. Cho, B. van Merrienboer, Ç. Gülçehre, D. Bahdanau, F. Bougares,
H. Schwenk, and Y. Bengio, “Learning phrase representations using
RNN encoder-decoder for statistical machine translation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural
Language Processing, EMNLP 2014, October 25-29, 2014, Doha,
Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL,
2014, pp. 1724–1734.
[10] J. Sung, C. Ponce, B. Selman, and A. Saxena, “Unstructured human
activity detection from RGBD images,” in Proc. ICRA, 2012, pp. 842–
849.
[11] L. Piyathilaka and S. Kodagoda, “Gaussian mixture based hmm for
human daily activity recognition using 3d skeleton features,” in Proc.
ICIEA, June 2013, pp. 567–572.
[12] K. Avgerinakis, A. Briassouli, and I. Kompatsiaris, “Activity detection
and recognition of daily living events,” in Proc. MIIRH, ser. MIIRH
’13. New York, NY, USA: ACM, 2013, pp. 3–10.
[13] H. Pirsiavash and D. Ramanan, “Detecting activities of daily living
in first-person camera views,” in 2012 IEEE Conference on Computer
Vision and Pattern Recognition, Providence, RI, USA, June 16-21,
2012, 2012, pp. 2847–2854.
[14] X. Ren and M. Philipose, “Egocentric recognition of handled objects:
Benchmark and analysis,” in IEEE Conference on Computer Vision
and Pattern Recognition, CVPR Workshops 2009, Miami, FL, 20-25
June, 2009, 2009, pp. 1–8.
[15] E. M. Tapia, S. S. Intille, and K. Larson, “Activity recognition in
the home using simple and ubiquitous sensors,” in Proc. PERVASIVE,
2004, pp. 158–175.
[16] L. Bao and S. S. Intille, “Activity recognition from user-annotated
acceleration data,” in Proc. PERVASIVE, 2004, pp. 1–17.
[17] U. Maurer, A. Smailagic, D. P. Siewiorek, and M. Deisher, “Activity
recognition and monitoring using multiple sensors on different body
positions,” in Proc. BSN, 2006, pp. 113–116.
[18] M. Uddin, A. Salem, I. Nam, and T. Nadeem, “Wearable sensing
framework for human activity monitoring,” in Proc. WearSys 2015,
2015, pp. 21–26.
[19] N. F. Ince, C. Min, and A. H. Tewfik, “A feature combination approach
for the detection of early morning bathroom activities with wireless
sensors,” in Proc. SIGMOBILE, 2007, pp. 61–63.
[20] X. Hong, C. D. Nugent, M. D. Mulvenna, S. Martin, S. Devlin,
and J. G. Wallace, “Dynamic similarity-based activity detection and
recognition within smart homes,” Int. J. Pervasive Computing and
Communications, vol. 8, no. 3, pp. 264–278, 2012.
[21] T. Kasteren, A. K. Noulas, G. Englebienne, and B. J. A. Kröse,
“Accurate activity recognition in a home setting,” in Proc. UbiComp,
2008, pp. 1–9.
[22] G. Okeyo, L. Chen, H. Wang, and R. Sterritt, “Dynamic sensor data
segmentation for real-time knowledge-driven activity recognition,”
Pervasive and Mobile Computing, vol. 10, pp. 155–172, 2014.
[23] M. Buettner, R. Prasad, M. Philipose, and D. Wetherall, “Recognizing
daily activities with rfid-based sensors,” in Proc. UbiComp, 2009, pp.
51–60.
[24] H. Li, C. Ye, and A. P. Sample, “Idsense: A human object interaction
detection system based on passive UHF RFID,” in Proc. CHI, 2015,
pp. 2555–2564.
[25] P. Hevesi, S. Wille, G. Pirkl, N. Wehn, and P. Lukowicz, “Monitoring
household activities and user location with a cheap, unobtrusive
thermal sensor array,” in Proc. UbiComp, 2014, pp. 141–145.
[26] A. Lotfi, C. S. Langensiepen, S. M. Mahmoud, and M. J. Akhlaghinia, “Smart homes for the elderly dementia sufferers: identification
and prediction of abnormal behaviour,” J. Ambient Intelligence and
Humanized Computing, vol. 3, no. 3, pp. 205–218, 2012.
[27] F. J. Ordóñez, P. Toledo, and A. Sanchis, “Sensor-based bayesian
detection of anomalous living patterns in a home setting,” Personal
and Ubiquitous Computing, vol. 19, no. 2, pp. 259–270, 2015.
[28] S. Modgil, F. Toni, F. Bex, I. Bratko, C. Chesñevar, W. Dvořák,
M. Falappa, X. Fan, S. Gaggl, A. García, M. González, T. Gordon,
J. Leite, M. Možina, C. Reed, G. Simari, S. Szeider, P. Torroni,
and S. Woltran, “The added value of argumentation,” in Agreement
Technologies. Springer, 2013, vol. 8, pp. 357–403.
[29] X. Fan, H. Zhang, C. Leung, and C. Miao, “A first step towards
explained activity recognition with computational abstract argumentation,” in Proc. IEEE MFI, 2016, In press.