Detecting and profiling sedentary young men using
machine learning algorithms
Pekka Siirtola∗, Riitta Pyky†‡§¶, Riikka Ahola‡, Heli Koskimäki∗, Timo Jämsä†‡,
Raija Korpelainen†‡§ and Juha Röning∗
∗ Computer Science and Engineering Department
P.O. BOX 4500, FI-90014, University of Oulu, Oulu, Finland
Email: [email protected], [email protected]
† Oulu Deaconess Institute, Department of Sports and Exercise Medicine, Oulu, Finland
‡ Medical Research Center Oulu, Oulu University Hospital and University of Oulu, Finland
§ Institute of Health Sciences, Faculty of Medicine, University of Oulu, Oulu, Finland
¶ Department of Medical Technology, Faculty of Medicine, University of Oulu, Finland
Abstract—Many governments and institutions have guidelines
for health-enhancing physical activity. Additionally, according to
recent studies, the amount of time spent sitting is a highly
important determinant of health and wellbeing. In fact, a sedentary
lifestyle can lead to many diseases and has even been
found to be associated with increased mortality.
In this study, a data set consisting of a self-reported questionnaire, medical diagnoses and fitness tests was studied to detect
sedentary young men in a large population and to create a
profile of a sedentary person. The data set was collected from
595 young men and contained altogether 678 features. Most of
these are answers to multiple-choice closed-ended questions. More
precisely, the features were mostly integers on a scale from 1 to 5
or from 1 to 2, and therefore there was only little variability in
the feature values. To detect and profile a sedentary
young man, machine learning algorithms were applied to the data
set. The performance of five algorithms was compared (quadratic
discriminant analysis (QDA), linear discriminant analysis (LDA),
C4.5, random forests, and 𝑘 nearest neighbours (𝑘NN)) to find
the most accurate algorithm.
The results of this study show that when the aim is to detect a
sedentary person based on medical records and fitness tests, LDA
performs better than the other algorithms, but the accuracy
is still not high. In the second part of the study, the differences
between highly sedentary and non-sedentary young men are
sought; here, recognition can be obtained with high accuracy with
each algorithm.
I. INTRODUCTION AND RELATED WORK
It is well known that the amount and intensity of daily physical activity is positively associated with health and wellbeing
[1]. To help people estimate how much exercise is enough,
guidelines are available for health-enhancing physical
activity. Moreover, various devices and mobile applications are
nowadays available to measure whether one’s daily activity is enough
to fulfill these recommendations. On the other hand, according
to recent studies, to stay healthy it is often as important
to avoid long sitting periods and a sedentary lifestyle as to be
active for a certain amount of time daily ([2] & [3]). In fact, the time spent
in sedentary behaviors has been found to be positively associated
with increased mortality [4]. To target interventions to those at
risk and to motivate the most sedentary groups, it is important
to understand the culture and the underlying determinants of the
sedentary lifestyle of the target group. In other words, it needs
to be studied what the reasons for a sedentary lifestyle are
and how highly sedentary people differ from non-sedentary
persons. In this study, predictors of the sedentary lifestyle of
young men are sought based on a self-reported questionnaire,
medical diagnoses and fitness tests. Moreover, to find sedentary persons in a large population-based sample of young
men, a model was built to determine automatically whether a person
is highly sedentary based on medical diagnoses and fitness
tests. Method-wise, this article concentrates on studying
how machine learning algorithms can be applied to a data set
where features are mostly integers and a single feature does
not include much information about the problem. Different
classifiers are compared to find the most suitable algorithm
for the data set.
There are studies where machine learning algorithms are
applied to data sets quite similar to the one used in this study. The study
most similar to ours is [5], where factors connected to
academic success are sought and this information is then
used to predict first-year students’ academic success. The data
used in that study are questionnaire data from 533 students.
Academic success is predicted by classifying students into
three categories based on their risk of failing in their studies. The
recognition rates obtained in the study are not high, but a
comparison of three machine learning methods shows that
LDA produces the least bad results. Dropout students have
been predicted using data mining methods in more recent
studies as well, such as [6]. However, in this case raw
questionnaire data were not used; instead, classification was
based on ten variables derived from the answers. In addition,
educational mining based on questionnaire data has been
studied in several other studies, as shown in [7].
Another field where machine learning methods have been
applied to data sets with only little variation is medicine.
In [8], neural networks, decision trees, and logistic regression
were applied to a data set consisting of answers to a questionnaire
to build a model that can be used to determine whether a
patient has a disease called gastroesophageal reflux or not. The
results show that, depending on the classifier used, recognition
can be obtained with accuracy varying from 70% to 78%.
Machine learning algorithms were also successfully used in [9],
where artificial intelligence was used to diagnose autism. The
normal procedure to diagnose autism is a questionnaire including 93 questions. However, it was shown in the article that the
same diagnostic accuracy can be obtained by using only seven
questions when machine learning techniques are applied to
the data set. The best results were obtained using decision trees.
Other machine learning and data mining methods used with
medical data include AdaBoost, the naive Bayes classifier, and
SVM ([10] & [11]).
In this study, one target is to profile a sedentary young man.
Profiling groups of people using machine learning algorithms
is studied, for instance, in [12], where a profile of students
who take online courses was built. The methods used in
the profiling process were classification trees and multivariate
adaptive regression splines. These were applied to a data set
consisting of variables like age, gender, ethnicity, residency,
living location and test scores. These variables are quite
different from those used in our study. The results of [12]
suggest that a profile of a group of people can be really simple,
in this case consisting of only one or two variables. Moreover,
profiling a sedentary person is studied in [13] and [14]. These
studies use the same data set as our study. However, the
aim and approach of these studies were different from ours:
they examined what kinds of groups can be found
within highly sedentary young men and which determinants
are characteristic of these groups. Five distinctive groups were
found and several wellbeing problems were associated with
sedentary lifestyle, such as heavy alcohol use, unemployment,
and low self-esteem. However, there was also one group where
young men were physically active, motivated and healthy,
although they were highly sedentary. In these studies, the
analysis was made using factor analysis.
This study has three aims:
∙ To replace the questionnaire and build a model to detect
a sedentary young man based on medical records and
fitness tests
∙ To profile a highly sedentary young man
∙ To study how well different machine learning algorithms
can handle a feature set where the information contained
by a single feature is very limited
The study is organized as follows: Section II presents the
data set used, and the labels are defined in Section III. Section
IV introduces the machine learning algorithms, which are
applied in Section V to build a model to recognize a sedentary
person in a large population and in Section VI to profile
a sedentary young man. Finally, conclusions and future work
are presented in Section VII.
II. DATA SET
This article is part of the MOPO study [15], where the aim is
to find new methods for activating young men, as they are
often at risk of marginalization, inactivity and an unhealthy
lifestyle. The research subjects consist of 595 conscription-aged men (mean age 18 years) in the city of Oulu, Finland.
The research subjects participated in a mandatory call-up event organized annually by the Finnish Defense Forces. At
the event, young men get information regarding military
service, and by the end of the day it is decided, based on
the medical examination and interviews, whether they are fit
enough for armed military service or whether they choose civil
service, postponement or total rejection.

TABLE I: The data set consists mostly of variables whose values
are integers on a scale from 1 to 2 or from 1 to 5.
Scale     Portion
1 to 2    56 %
1 to 3    2 %
1 to 4    6 %
1 to 5    26 %
other     10 %
The call-up event day contains a lot of queuing and waiting
when subjects move from one activity to another. In this
study, during this spare time, subjects were asked to fill
in a questionnaire containing questions about their family,
education, health, health behavior, diet, wellbeing and the use
of media and technology.
In addition, data from medical diagnoses were used in this
study. These data contain diagnoses of various
diseases and disabilities. These are mainly dichotomous questions: a person either has a certain disease or not.
Moreover, the study participants were asked to go through
fitness tests at the call-up event. The tests included measurements of body composition, grip strength, heart rate variability,
and aerobic fitness. In addition, height, weight and waist
circumference were measured. These measurements were also
used in this study.
In summary, three types of data, altogether 678 variables, were used in this study: a self-reported questionnaire (314
variables), medical diagnoses (340 variables) and results of
fitness tests (34 variables). These claims, measurements and
questions can be divided into three categories:
1) Rating scale questions: subjects rate claims, typically
using a scale from 1 to 5. However, the scale can also
be dichotomous; thus, questions where the answer is
Yes/No are included in this category as well. Fitness
test measurements also belong to this category.
2) Rating scale questions with an option not to answer: the
same as above, but also including an option to answer Do
not know/Other.
3) Nominal scale questions: the subjects are told to choose
a correct answer from a list of nominal variables, and the
answers do not have any natural order.
The variables of type 1 are the easiest to use with the
algorithms used in this study; therefore, only these were used.
Most of the questions were of this type, but it should still be
further studied whether the other question types improve
the recognition rates.

Fig. 1: Statistics about variables where the scale of the answers is from 1 to 2. (a) In most variables, 2 is the dominant answer. (b) Variability within a variable is low; in most cases the dominant answer represents over 95 % of the answers.

Fig. 2: Statistics about variables where the scale of the answers is from 1 to 5. (a) Answer 3 is the most common. (b) In most cases, the dominant answer represents less than 50 % of the answers.

After removing the questions of types 2
and 3, 656 variables were left from the original 678, and these
were used as features for the machine learning algorithms in this
study. Most of the variables are integers on a scale from 1
to 2 or from 1 to 5, see Table I. These are studied in detail in
Figures 1 and 2. Figure 1 shows that 2 was the most common
answer to questions where the scale was from 1 to 2, and
in almost every case the dominant answer represented over
90 % of the answers. This means that the amount of information contained
in these questions is low. Answers to questions where
the scale is from 1 to 5 contain more information, as there
is more variability in the answers (Figure 2b). However, the
most common answer is 3, which can be considered a neutral
answer and does not tell much about the opinions of the
person. Therefore, these answers are not highly informative either.
Thus, it can be concluded that the data set contains mostly
variables that do not contain a lot of information. This
means that there are not many features describing the problem
well, which poses challenges for feature selection.
The data set includes a lot of variables with missing values,
especially in the questionnaire part, as the young men found it long
and laborious to fill in. To be able to use features with missing
values in the classification process, the missing values were
replaced with the mean value of the answers. Moreover, participants
who answered only a few questions were removed from the
analyses.
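The mean replacement described above can be illustrated with a minimal sketch. The function name `impute_mean`, the use of `None` to mark a missing answer, and the example values are assumptions made for illustration only; the sketch also assumes that every variable has at least one observed answer.

```python
def impute_mean(rows):
    """Replace None entries in each column with the mean of that column's observed values."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed))
    return [
        [means[j] if row[j] is None else row[j] for j in range(n_cols)]
        for row in rows
    ]


# Hypothetical answers of three respondents to two questions.
answers = [[1, 5], [2, None], [None, 3]]
filled = impute_mean(answers)
```

Note that the imputed value is generally not an integer, so an imputed answer no longer lies exactly on the original 1 to 2 or 1 to 5 scale.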
III. DEFINING SEDENTARY CLASS LABELS
The aim of the study is to find the determinants of a sedentary
lifestyle in order to describe a profile of a highly sedentary young man.
In addition, the aim is to build a model to detect the sedentary class of a person. For this purpose, the data must be labeled
based on the time spent on sedentary behaviour. As a part
of the self-reported questionnaire, respondents were asked to
estimate how many hours they sat daily outside school/work
time. Persons who estimated their daily sitting time as five
hours or more were considered highly sedentary and persons
who sat two hours or less non-sedentary. Therefore, the
participants were divided into the following three sedentary classes:
1) highly sedentary: sitting time ≥ 5 hours
2) moderately sedentary: sitting time between two and five
hours
3) non-sedentary: sitting time ≤ 2 hours.
By dividing the participants using this rule, from the 595
respondents, 179 young men were labeled as highly sedentary
(class 1), 244 as moderately sedentary (class 2), and 172 as
non-sedentary (class 3).
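The labeling rule above can be written as a short function. This is only a sketch: the function name `sedentary_class` is an assumption, and the input is taken to be the self-reported daily sitting time in hours.

```python
def sedentary_class(sitting_hours):
    """Map self-reported daily sitting time (hours) to the sedentary class 1, 2 or 3."""
    if sitting_hours >= 5:
        return 1  # highly sedentary
    if sitting_hours <= 2:
        return 3  # non-sedentary
    return 2  # moderately sedentary


example_labels = [sedentary_class(h) for h in [1, 2, 3, 4, 5, 8]]
```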
IV. CLASSIFIERS
In this study, five classifiers were used: 𝑘NN, LDA, QDA,
C4.5 and random forest. The idea of the 𝑘NN classifier is
to classify a data point into the class to which most of its 𝑘
nearest neighbors belong [16]. The distance between the data
points was defined in this case using Euclidean distance. In
this study, 𝑘 values 1, 3, 5, and 7 were employed.
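As an illustration of the 𝑘NN rule with Euclidean distance, a minimal sketch is given below. The function name `knn_predict` and the toy data are assumptions for illustration and are not taken from the study's implementation.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Classify x into the class held by the majority of its k nearest training points."""
    # Pair every training point's Euclidean distance to x with its label.
    distances = sorted(
        (math.dist(row, x), label) for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training data with two features scaled to the range 0-1.
train_X = [[0.0, 0.0], [0.1, 0.2], [0.9, 1.0], [1.0, 0.8], [0.8, 0.9]]
train_y = ["other", "other", "highly", "highly", "highly"]

near_origin = knn_predict(train_X, train_y, [0.2, 0.1], k=3)
near_corner = knn_predict(train_X, train_y, [0.9, 0.9], k=3)
```

With features that take only two values, many training points can lie at exactly the same distance from a query point, so the outcome may depend on how such ties are broken; in this sketch, ties are resolved by the sort order.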
LDA is used to find a linear combination of features that
separate the classes best. The resulting combination may be
employed as a linear classifier. QDA is a similar method, but
it uses quadric surfaces to separate classes [17].
C4.5 is a decision tree model by Ross Quinlan [18] that, like
all other tree models, partitions the space spanned by the
input variables to maximize a score of class purity. This is
done so that the majority of points in each cell of the partition
belong to one class [17]. In the case of C4.5, the partition is
based on the difference in entropy.
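The difference in entropy used as the splitting score can be sketched as follows. The function names and the toy data are assumptions for illustration. The sketch computes the plain information gain of a discrete feature; C4.5 additionally normalizes this gain by the split information (gain ratio), which is omitted here for brevity.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    n = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


# Hypothetical binary features: one separates the classes, the other does not.
labels = ["highly", "highly", "non", "non"]
gain_informative = information_gain([1, 1, 2, 2], labels)
gain_uninformative = information_gain([1, 2, 1, 2], labels)
```

A binary feature whose dominant answer covers almost all respondents can change the entropy only slightly, which is one way to see why single features in this data set carry little information.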
Another decision tree based algorithm used in this study is
random forest [19]. It is a classification model that uses ensemble learning: it constructs a number of decision
trees from the training data and classifies a data point into the
class to which most of the individual trees assign it.
V. RECOGNIZING A SEDENTARY PERSON FROM A LARGE POPULATION
In this section, it is studied whether it is possible to train a
model that can be used to recognize a sedentary person in
a large population based on medical records and fitness tests.
Such a model would remove the need to fill in the
questionnaire. Here, instances are classified into two classes:
highly sedentary and others, the latter consisting of the moderately sedentary
and non-sedentary classes.
TABLE II: Detection rates for different classifiers, average
recognition rate (standard deviation).
Classifier       Accuracy
𝑘NN, 𝑘 = 1       69.7 % (1.48)
𝑘NN, 𝑘 = 3       64.2 % (1.51)
𝑘NN, 𝑘 = 5       63.7 % (1.09)
𝑘NN, 𝑘 = 7       61.7 % (0.15)
QDA              66.2 % (1.68)
LDA              70.1 % (1.38)
C4.5             62.6 % (1.16)
Random forest    61.7 % (1.64)

TABLE III: Detection rates for different classifiers when
instances with the highest probability to have false label are
removed, average recognition rate (standard deviation).
Classifier       Accuracy
𝑘NN, 𝑘 = 1       72.7 % (1.94)
𝑘NN, 𝑘 = 3       74.3 % (2.30)
𝑘NN, 𝑘 = 5       64.5 % (2.22)
𝑘NN, 𝑘 = 7       67.4 % (1.83)
QDA              72.9 % (1.53)
LDA              77.4 % (1.28)
C4.5             69.8 % (2.61)
Random forest    66.4 % (1.07)

The purpose of this section is not only to remove the need
to fill in a questionnaire, but also to compare classifiers to
find out which classifier is the most accurate when applied
to a data set where the information contained in a single
feature is very limited. To study this, the classifiers
introduced in Section IV were applied to the data set, where the
class labels were self-reported estimates of the time spent on
sedentary behaviour and the features consisted of medical records
and results of fitness tests. At first, the values of the features were
scaled to the range 0-1 so that each feature has a similar weight
in the classification process. Then, the best features
for each classifier were searched for using sequential forward
selection (SFS) [20]. It starts the selection of features from an
empty feature set, and in the first phase, the algorithm tests
the accuracy of the recognition model using only one feature.
It does this for each feature in turn, and adds to
the empty feature set the feature that recognizes the classes
with the highest accuracy. In the second phase, again, one
feature is added to the feature set. To decide which feature
is added, the algorithm tests the accuracy of the model using
two features, one of which is the feature selected in the first phase.
The feature that improves the recognition accuracy the most
is added to the feature set. Similarly, on each iteration, the
algorithm adds one feature to the set, namely, the one that
improves the classification accuracy the most. The algorithm
continues adding features until the accuracy does not get any
higher. To avoid overfitting, 7-fold cross-validation was used:
each fold in turn was used for testing and the other six for
training.
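The scaling, feature selection and cross-validation steps described above can be sketched as follows. This is a simplified illustration under several assumptions: all function names are hypothetical, a 1-NN classifier stands in for the classifiers of Section IV, folds are assigned by index rather than randomly, and accuracy is the plain fraction of correctly classified instances rather than the class-wise average reported in the tables.

```python
import math


def min_max_scale(X):
    """Scale every column of X to the range 0-1 (constant columns become 0)."""
    columns = list(zip(*X))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in X
    ]


def nearest_neighbour(train_X, train_y, x):
    """1-NN with Euclidean distance, used here only as an example classifier."""
    distances = [math.dist(row, x) for row in train_X]
    return train_y[distances.index(min(distances))]


def cv_accuracy(X, y, features, classify, n_folds):
    """Fraction of correctly classified instances over all folds."""
    correct = 0
    for fold in range(n_folds):
        train = [i for i in range(len(X)) if i % n_folds != fold]
        test = [i for i in range(len(X)) if i % n_folds == fold]
        train_X = [[X[i][f] for f in features] for i in train]
        train_y = [y[i] for i in train]
        for i in test:
            x = [X[i][f] for f in features]
            if classify(train_X, train_y, x) == y[i]:
                correct += 1
    return correct / len(X)


def sequential_forward_selection(X, y, classify, n_folds=7):
    """Add one feature at a time until the cross-validated accuracy stops improving."""
    selected, best = [], 0.0
    remaining = list(range(len(X[0])))
    while remaining:
        scores = {
            f: cv_accuracy(X, y, selected + [f], classify, n_folds)
            for f in remaining
        }
        candidate = max(scores, key=scores.get)
        if scores[candidate] <= best:
            break  # no candidate improves the accuracy any further
        selected.append(candidate)
        remaining.remove(candidate)
        best = scores[candidate]
    return selected, best


# Hypothetical data: feature 0 separates the two classes, feature 1 is noise.
labels = [i % 2 for i in range(14)]
raw = [[10.0 * labels[i] + 0.1 * i, (3 * i) % 5] for i in range(14)]
scaled = min_max_scale(raw)
selected, accuracy = sequential_forward_selection(scaled, labels, nearest_neighbour)
```

In this toy example the search stops after the first feature, because adding the noise feature cannot raise the accuracy further. Because the same folds are used both to choose features and to report the accuracy, the reported value tends to be optimistic; a separate held-out set would be one way to check this.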
A. Results
Instances were classified into two classes, highly sedentary
and others, and the results are shown in Table II.
The presented accuracies are average recognition rates calculated from the true positive precisions of the individual classes.
There are no big differences between the recognition accuracies
of the individual classes.
B. Discussion
Classification results ranged from 61.7% to 70.1%, see
Table II. Clearly, the recognition rates are not high and they
were not found satisfactory. In fact, it seems that it is not
possible to divide the feature space into decision regions that
can reliably separate the classes. One reason for this may be the
labels, which were based on self-reported estimates of the time
spent on sedentary behaviour and not on actual measurements.
Moreover, respondents were only asked to estimate their sitting
time outside school/work, and not all respondents are working
or at school. In addition, estimating the time spent on
sedentary behaviour is not an easy task, and therefore the data set
most likely includes several mislabeled instances. Therefore,
it is clear that recognition cannot be done with accuracy close
to 100%.
To test whether mislabeled instances were causing the
low recognition rates, another classification experiment was
done using a modified data set where the instances with the highest
probability of having a false label were removed. These instances
are the ones located at the border where the class label changes, as
close to this border even a small estimation error can lead
to a false class label. Therefore, to create a data set that
contains fewer mislabeled instances, instances where the time spent
on sedentary behaviour was estimated as 4 or 5 hours were
removed from the data set. Thus, the new data set
consists of two classes:
1) highly sedentary: sitting time ≥ 6 hours
2) others: sitting time ≤ 4 hours.
This modified data set was classified using the same
algorithms as the original data set to see the impact of falsely
labeled instances. The results are shown in Table III, and this
time the results ranged from 64.5% to 77.4%. Therefore, the
accuracies are much better than the ones presented in Table II.
Although even the results for the modified data set are not excellent, based on this experiment it is possible to conclude that
mislabeled instances are a major reason for the low recognition rates
presented in Table II. However, to confirm this, the recognition
should be done based on objective measurements instead of
personal estimates. Such an experiment would also reveal the full
potential of the different machine learning algorithms.
Interestingly, in both cases the highest recognition rate
was obtained using LDA, and with the modified data
set it was 77.4 %. Although this is not a very high accuracy,
it is much better than the accuracy presented in Table II.
This improvement is promising and shows that the presented
method has potential if the data set is correctly labeled.
When the recognition accuracies are studied classifier-wise,
it can be noted that QDA produces lower recognition rates
than LDA. This is interesting, as LDA is a special case of
QDA. One reason for this may be the feature selection
method used. In the case of QDA, the covariance matrix
of each class in the training data must be positive definite. This limits
the possibilities to use QDA, especially when the feature values
do not contain much variance, as in this study, where in many
cases features have only two possible values. Therefore, in
this study, there are a lot of feature combinations that do not
satisfy the class-wise positive definiteness requirement, and thus
only a fraction of the feature combinations can be used in the
classification process. LDA has a quite similar requirement,
but in the case of LDA, the pooled covariance matrix of the
whole training data must be positive definite, and not the covariance matrix
of each class separately.
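The difference between the two requirements can be illustrated with a small two-feature sketch. All names and values below are hypothetical. The sketch assumes that the pooled matrix is the usual within-class pooled estimate, that is, the class covariance matrices weighted by their degrees of freedom, and it checks positive definiteness of a symmetric 2 × 2 matrix with Sylvester's criterion.

```python
def covariance(rows):
    """Sample covariance matrix of two-feature rows, returned as ((a, b), (b, d))."""
    n = len(rows)
    m0 = sum(r[0] for r in rows) / n
    m1 = sum(r[1] for r in rows) / n
    a = sum((r[0] - m0) ** 2 for r in rows) / (n - 1)
    d = sum((r[1] - m1) ** 2 for r in rows) / (n - 1)
    b = sum((r[0] - m0) * (r[1] - m1) for r in rows) / (n - 1)
    return ((a, b), (b, d))


def is_positive_definite(c):
    """Sylvester's criterion for a symmetric 2 x 2 matrix."""
    (a, b), (_, d) = c
    return a > 0 and a * d - b * b > 0


def pooled(c1, n1, c2, n2):
    """Pooled within-class covariance, weighting each class by n - 1."""
    w1, w2 = n1 - 1, n2 - 1
    return tuple(
        tuple((w1 * x + w2 * y) / (w1 + w2) for x, y in zip(row1, row2))
        for row1, row2 in zip(c1, c2)
    )


# Hypothetical classes: (binary diagnosis coded 1 or 2, fitness measurement).
# In class A every person gave the same answer to the binary question.
class_a = [[2, 40], [2, 45], [2, 50], [2, 43]]
class_b = [[1, 30], [2, 35], [1, 42], [2, 38]]

cov_a = covariance(class_a)
cov_b = covariance(class_b)
cov_pooled = pooled(cov_a, len(class_a), cov_b, len(class_b))
```

Here the covariance matrix of class A is singular because the binary feature does not vary within that class, so QDA could not use this feature pair, whereas the pooled matrix remains positive definite.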
C4.5 did not perform very well compared to the other classifiers.
Almost the same classifiers as in this article were compared in
[21]. Although the data set in [21] was different and the values of
its features were continuous, whereas in this study they are
mostly discrete, it is noticeable that C4.5 performed worse than the
other classifiers. In addition, 𝑘NN did not perform very well
for some reason, though good results were achieved with 𝑘 values 1 and 3
on the modified data set. It should be studied whether
the reason for the poor performance with the full data set and some
𝑘 values was the point-to-point distance measure used, which
was Euclidean distance in this study. More experiments should
be done with other distance measures, such as the Mahalanobis
distance.
In addition, it is surprising how poorly random forests
performed. The algorithm was run several times with different
settings, and the accuracies presented in Tables II and III were
the best that were achieved. In the data set used, most of
the features from the medical diagnosis data are binary, while the
features extracted from the results of the fitness tests have many
more possible values [22]. Moreover, it is known that random
forests are biased in favor of attributes with more levels, which
may cause low recognition rates.
VI. PROFILING A HIGHLY SEDENTARY YOUNG MAN
Profiling of a highly sedentary person is considered as a feature selection problem where the purpose is to find differences
between highly sedentary and non-sedentary young men. Also
in this case, the values of the features were scaled to the range 0-1
and the best features were selected using SFS. Moreover, to
avoid overfitting, cross-validation was used in this case as well.
The most descriptive features for four classifiers (LDA, QDA,
C4.5, and 𝑘NN with 𝑘 = 1, 3, 5, 7) were searched for. The results
using random forest were not calculated, as it did not perform
well in Section V.
In the feature selection process, determinants for sedentary
behavior were analyzed by comparing differences in lifestyle
and health factors between highly sedentary and non-sedentary
young men. Therefore, profiling was considered as a two
class problem, and only data from young men belonging to
sedentary classes 1 and 3 were used. To avoid overfitting,
cross-validation was used. The data were randomly divided
into four parts to obtain 4-fold cross-validation. This was
done five times, so altogether five profiles per classifier were
obtained. Therefore, it was possible to measure standard
deviation between differently chosen training and testing sets.
To build a profile of a sedentary young man, the results
of the most accurate classifiers were combined. To do this, the
ten most descriptive features of each classifier were selected and
points were given to these features based on their importance
ranking. The point scoring system was the same as that used
in the Formula 1 championship (1st = 25 points, 2nd = 18 points,
3rd = 15 points, . . ., 10th = 1 point [23]). This scoring system
gives more value to higher ranked features in relation to lower
ranked ones. This is reasonable, as normally the higher a
feature is ranked, the more it improves the recognition rate.
A. Results
TABLE IV: Differences between highly sedentary and non-sedentary young men can be found with high accuracy, average
recognition rate (standard deviation).
Classifier       Accuracy
𝑘NN, 𝑘 = 1       90.3 % (0.50)
𝑘NN, 𝑘 = 3       90.0 % (1.95)
𝑘NN, 𝑘 = 5       89.0 % (2.37)
𝑘NN, 𝑘 = 7       90.1 % (1.69)
QDA              91.2 % (1.91)
LDA              90.9 % (0.63)
C4.5             91.4 % (1.78)

Each classifier can distinguish between highly sedentary and non-sedentary young men with high accuracy based on the data,
see Table IV. When the profile of a sedentary young man
was built, points were given to the ten most descriptive features
of each classifier that produced accurate results. In this case,
each classifier satisfied this criterion, and therefore the results
of each classifier were used to build the profile. However, as
𝑘NN was tested with four 𝑘 values, the points given
by each of these were divided by four to avoid bias toward the results
of the 𝑘NN classifier. The profile of a sedentary young man is
presented in Table V.
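The point scoring and the down-weighting of the four 𝑘NN variants can be sketched as follows. The function name `profile_scores`, the feature names and the rankings are hypothetical and serve only to show the arithmetic; how the five repetitions per classifier enter the totals of Table V is not specified here, so the sketch handles a single ranking per classifier.

```python
# Formula 1 points for ranks 1 to 10.
F1_POINTS = [25, 18, 15, 12, 10, 8, 6, 4, 2, 1]


def profile_scores(rankings, weights=None):
    """Sum weighted points per feature; rankings maps a classifier to its ranked features."""
    weights = weights or {}
    totals = {}
    for classifier, ranked_features in rankings.items():
        weight = weights.get(classifier, 1.0)
        for points, feature in zip(F1_POINTS, ranked_features):
            totals[feature] = totals.get(feature, 0.0) + weight * points
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical top-two rankings from LDA and the four kNN variants.
rankings = {
    "LDA": ["internet_pc", "internet"],
    "kNN, k=1": ["internet_pc", "laziness"],
    "kNN, k=3": ["internet", "internet_pc"],
    "kNN, k=5": ["internet_pc", "internet"],
    "kNN, k=7": ["internet_pc", "internet"],
}
# Each kNN variant contributes a quarter of its points.
weights = {name: 0.25 for name in rankings if name.startswith("kNN")}
profile = profile_scores(rankings, weights)
```

With these weights, the four 𝑘NN rankings together carry the same maximum weight as one other classifier.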
B. Discussion
High recognition accuracies were obtained with each classification algorithm (Table IV), and therefore the questionnaire,
medical records and fitness tests can be used to profile a highly
sedentary person, no matter which classifier is used. The
determinants related to a sedentary young man are presented
in Table V. It shows which questions best
describe the differences between highly sedentary and non-sedentary persons. Clearly, the questions about Internet usage
were found most descriptive when the results from all classifiers were combined. In fact, although the classification algorithms
use different approaches to build recognition models, each
classifier found an Internet usage related question to be the most
descriptive, and therefore ranked it first. In these questions,
respondents were asked how many hours daily they use the
Internet or how many hours daily they use a PC to access the Internet. They were able to choose the answer from five options:
(1) from 0-2 hours, (2) from 2-4 hours, (3) from 4-6 hours,
(4) from 6-8 hours or (5) from 8-10 hours. The importance
of this question came as no surprise, as most people sit or
lie down while they use the Internet, especially when they use it
on their PC. Moreover, the question ranked fourth
was also related to heavy Internet usage, and more
precisely, to playing Internet games. While these findings may
sound trivial, they also show that machine learning algorithms
are powerful tools for making such findings from a large set
of questions. What is more, as the most important questions
can easily be understood as highly descriptive, based on these
findings it is easier to believe that the other highly ranked
questions are important as well. In fact, the other seven questions
ranked in the top 10 are not as trivial and do not have any clear
common factor. They include laziness and lack of interest
in exercising, but they also show that a sedentary
lifestyle is not always a personal choice; a medical condition can also be
the reason. Moreover, it is noticeable that none of the selected
features is directly related to exercising or sports hobbies.
Therefore, it can be concluded that, based on the data, highly
sedentary young men can have sports hobbies. Thus, a young
man may follow physical activity recommendations and spend
the rest of the day sitting or lying on the sofa, making him
susceptible to the dangers of a sedentary lifestyle. This finding
supports the ones suggested in the literature [14].
In Section V, the classification problem was harder, as it
also contained instances from moderately sedentary persons,
and therefore the differences between classifiers were more
easily visible. In this section, by contrast, there were no big
differences in recognition rates between classifiers, see Table
IV.
VII. CONCLUSIONS AND FUTURE WORK
In this study, a data set consisting of a self-reported questionnaire, medical diagnoses and fitness tests was studied to
detect and profile sedentary young men. The profiling of a
sedentary young man was considered as a feature selection
problem between highly sedentary and non-sedentary persons,
while detecting a sedentary person from a large population was
considered as a binary classification problem between highly sedentary young men and others. Different classifiers were
compared to find the most suitable algorithm to solve these
problems.
In Section V, sedentary young men were detected based
on medical records and fitness tests in order to remove the
need to fill in the questionnaire. The recognition accuracies are
presented in Table II, and it can be noted that these are not
very high. However, one reason for the low recognition rates may be the
labels, which were based on self-reported estimates of the time
spent on sedentary behaviour and not on actual measurements.
Moreover, respondents were only asked to estimate their sitting
time outside school/work, and not all respondents are working
or at school. In addition, estimating the time spent on
sedentary behaviour is not an easy task, and therefore the data set
most likely includes several mislabeled instances. However,
the mislabeled instances are most likely located at the border
where the class label changes, as there even a small estimation
error can lead to a false class label. Therefore, to create a data
set that contains fewer mislabeled instances, instances where the
time spent on sedentary behaviour was estimated as 4 or 5
hours were removed. When this new data set was classified
using the same machine learning algorithms as the whole data
set, the accuracies were still below 80% but much higher than
with the full data set, see Table III. Therefore, the results show that
falsely labeled instances were a major reason for the low recognition
rates presented in Table II. In fact, previous studies
have shown that self-estimation of one’s daily physical
activity is difficult [24], and based on the results of this study,
the same most likely applies to sedentary time. Therefore, in
order to avoid a sedentary lifestyle, the time spent on it should
be measured objectively. For this purpose, reliable algorithms
and applications to measure the intensity of movement [25] or
to detect activities [26] should be developed.
TABLE V: The profile of a highly sedentary young man: ten most descriptive questions.
Ranking   Points    Question
1         366.25    How many hours daily a respondent uses PC to access the Internet?
2         147.5     How many hours daily a respondent uses the Internet?
3         75        How much laziness limits a respondent’s free time exercising?
4         51.5      How many hours daily a respondent spends time to Internet games?
5         46        Does a respondent have bone or cartilage diseases?
6         44        Grip strength in kilograms, right hand
7         36.75     Repetitive whistling in ears, tinnitus
8         36        How certain respondent is that he goes to exercise despite having guests?
9         36        How much lack of interest toward exercising limits a respondent’s free time exercising?
10        36        Mass of muscles in kilograms
The profiling part of this study shows that highly sedentary
persons spend more time on the Internet than non-sedentary
persons. Since this type of behaviour obviously increases
sedentary time, these results
show that machine learning algorithms are powerful tools for
finding meaningful factors even when a single feature does
not include much information. What is more, as the questions
ranked as the most important are easily understood as
highly descriptive, it is easier to believe
that the other highly ranked questions are important as well. In
fact, the other questions selected for the profile of a sedentary
young man are less obvious findings.
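As an illustration of how per-run feature rankings could be turned into the kind of point totals listed in Table V, the sketch below awards points on the Formula 1 scale named in [23] and sums them over runs. It is written under assumptions: the function name and the toy rankings are hypothetical, and the exact weighting used in the study (which yields fractional totals such as 366.25) is not reproduced here.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Formula 1 scale for finishing positions 1..10 (in use since the 2010 season).
F1_POINTS = [25, 18, 15, 12, 10, 8, 6, 4, 2, 1]


def aggregate_points(rankings: List[List[str]]) -> List[Tuple[str, float]]:
    """Sum the points each feature collects over several ranked lists.

    Each inner list is one run's feature ranking, best feature first.
    Features ranked below tenth place in a run receive no points.
    """
    totals: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for position, feature in enumerate(ranking[:len(F1_POINTS)]):
            totals[feature] += F1_POINTS[position]
    # Highest total first; sorted() is stable, so ties keep insertion order.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Toy rankings from three hypothetical runs; these are not study data.
runs = [
    ["internet_pc", "internet_total", "laziness"],
    ["internet_total", "internet_pc", "laziness"],
    ["internet_pc", "laziness", "internet_total"],
]
profile = aggregate_points(runs)
```

A rank-based total of this kind rewards features that are selected consistently across runs, which is one way a clear leader such as the first row of Table V can emerge even when no single feature is highly informative on its own.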
While the obtained recognition results were not always as
high as expected, LDA was the most reliable of the compared
classifiers. LDA produced the highest rates in Section V, and
in Section IV the difference between LDA and C4.5, which
produced the best results, was not statistically significant.
Interestingly, in [5], where classification based on questionnaire
data was studied, LDA produced the best detection results
as well.
A weakness of this study is that only one type of question
was used in the profiling process. In the next phase of
the study, other questions should be used as well, which
could lead to better recognition rates.
In addition, one part of the future work is to carry out a
similar study in which the labels are based on actual
measurements and not on estimates. Moreover, in this study,
only data from 2010 were used. However, similar data sets are
available from the years 2009-2013, which would increase the sample size
significantly. Therefore, part of the future work is to study whether
results similar to those of this study can be obtained with
the other data sets as well.
ACKNOWLEDGMENT
This work was done as a part of the MOPO study [15]. The
authors would like to thank Infotech Oulu and the Finnish
Funding Agency for Technology and Innovation for funding
this work.
REFERENCES
[1] F. J. Penedo and J. R. Dahn, “Exercise and well-being: a review of
mental and physical health benefits associated with physical activity,”
Current Opinion in Psychiatry, vol. 18, no. 2, pp. 189–193, Mar. 2005.
[Online]. Available: http://view.ncbi.nlm.nih.gov/pubmed/16639173
[2] M. Chia and H. Suppiah, “Inactivity physiology-standing up for making
sitting less sedentary at work,” J Obes Weight Loss Ther, vol. 3, no.
171, p. 2, 2013.
[3] J. Kim, K. Tanabe, N. Yokoyama, M. Zempo, and S. Kuno, “Objectively
measured light-intensity lifestyle activity and sedentary time are
independently associated with metabolic syndrome: a cross-sectional study
of Japanese adults,” International Journal of Behavioral Nutrition and
Physical Activity, vol. 10, no. 30, 2013.
[4] C. E. Matthews, S. M. George, S. C. Moore, H. R. Bowles,
A. Blair, Y. Park, R. P. Troiano, A. Hollenbeck, and A. Schatzkin,
“Amount of time spent in sedentary behaviors and cause-specific
mortality in US adults,” Am J Clin Nutr, vol. 95, no. 2, pp. 437–45,
2012. [Online]. Available: http://www.biomedsearch.com/nih/Amounttime-spent-in-sedentary/22218159.html
[5] J. Superby, J. Vandamme, and N. Meskens, “Determination of factors
influencing the achievement of the first-year university students using
data mining methods,” in Proc. Int. Conf. Intell. Tutoring Syst. Workshop
Educ. Data Mining, 2006, pp. 1–8.
[6] E. Yukselturk, S. Ozekes, and Y. Türel, “Predicting dropout student: An
application of data mining methods in an online education program,”
European Journal of Open, Distance and e-Learning, vol. 17, no. 1, pp.
118–133, 2014.
[7] C. Romero and S. Ventura, “Educational data mining: A review of the
state of the art,” Systems, Man, and Cybernetics, Part C: Applications
and Reviews, IEEE Transactions on, vol. 40, no. 6, pp. 601–618, Nov.
2010.
[8] N. Horowitz, M. Moshkowitz, Z. Halpern, and M. Leshno, “Applying
data mining techniques in the development of a diagnostics questionnaire
for GERD,” Digestive Diseases and Sciences, vol. 52, no. 8, pp. 1871–1878,
2007. [Online]. Available: http://dx.doi.org/10.1007/s10620-0069202-5
[9] D. P. Wall, R. Dally, R. Luyster, J.-Y. Jung, and T. F. Deluca,
“Use of artificial intelligence to shorten the behavioral diagnosis
of autism,” PLoS One, vol. 7, no. 8, p. e43855, 2012. [Online].
Available: http://www.biomedsearch.com/nih/Use-artificial-intelligenceto-shorten/22952789.html
[10] I. Kononenko, “Machine learning for medical diagnosis: history, state
of the art and perspective,” Artificial Intelligence in Medicine,
vol. 23, no. 1, pp. 89–109, 2001. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S093336570100077X
[11] I. Yoo, P. Alafaireet, M. Marinov, K. Pena-Hernandez, R. Gopidi, J.-F.
Chang, and L. Hua, “Data mining in healthcare and biomedicine: A
survey of the literature,” Journal of Medical Systems, vol. 36, no. 4, pp.
2431–2448, 2012. [Online]. Available: http://dx.doi.org/10.1007/s10916011-9710-5
[12] C. H. Yu, S. Digangi, A. K. Jannasch-Pennell, and C. Kaprolet,
“Profiling students who take online courses using data mining methods,”
Online J. Distance Learning Administ., vol. 11, no. 2, pp. 1–14, 2008.
[13] R. Pyky, A. Jauho, R. Ahola, T. Ikäheimo, T. Jämsä, and R. Korpelainen,
“Profiles of physically inactive young men,” in 7th International
Conference on Movement and Health, 2014.
[14] R. Korpelainen, R. Pyky, R. Ahola, H. Koivumaa-Honkanen, A. Jauho,
T. Jämsä, and T. Ikäheimo, “Profiles of physically inactive young men,”
in 5th International Congress on Physical Activity and Public Health,
2014, p. poster.
[15] R. Ahola, R. Pyky, T. Jämsä, M. Mäntysaari, H. Koskimäki, T. Ikäheimo,
M.-L. Huotari, J. Röning, H. Heikkinen, and R. Korpelainen, “Gamified
physical activation of young men - a multidisciplinary population-based
randomized controlled trial (MOPO study),” BMC Public Health,
vol. 13, pp. 1–8, January 2013.
[16] E. Fix and J. L. Hodges, “Discriminatory analysis: Nonparametric
discrimination: Consistency properties,” USAF School of Aviation
Medicine, Randolph Field, Texas, Tech. Rep. Project 21-49-004, Report
Number 4, 1951.
[17] D. J. Hand, H. Mannila, and P. Smyth, Principles of data mining.
Cambridge, MA, USA: MIT Press, 2001.
[18] J. R. Quinlan, C4.5: Programs for Machine Learning (Morgan Kaufmann
Series in Machine Learning). Morgan Kaufmann, January 1993.
[19] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp.
5–32, 2001.
[20] P. A. Devijver and J. Kittler, Pattern recognition: A statistical approach.
Prentice Hall, 1982.
[21] P. Siirtola, H. Koskimäki, V. Huikari, P. Laurinen, and
J. Röning, “Improving the classification accuracy of streaming
data using SAX similarity features,” Pattern Recognition Letters,
vol. 32, no. 13, pp. 1659–1668, 2011. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0167865511002091
[22] C. Strobl, A.-L. Boulesteix, A. Zeileis, and T. Hothorn, “Bias in
random forest variable importance measures: Illustrations, sources and
a solution,” BMC Bioinformatics, vol. 8, no. 1, p. 25, 2007. [Online].
Available: http://www.biomedcentral.com/1471-2105/8/25
[23] FIA, “Formula 1 point scoring system,” http://www.formula1.com/.
[24] E. M. van Sluijs, S. J. Griffin, and M. N. van Poppel, “A cross-sectional
study of awareness of physical activity: associations with
personal, behavioral and psychosocial factors,” International Journal of
Behavioral Nutrition and Physical Activity, vol. 4, no. 1, p. 53, 2007.
[25] F. García-García, G. García-Sáez, P. Chausa, I. Martínez-Sarriegui, P. Benito,
E. Gómez, and M. Hernando, “Statistical machine learning for automatic
assessment of physical activity intensity using multi-axial accelerometry
and heart rate,” in Artificial Intelligence in Medicine, ser. Lecture Notes
in Computer Science, M. Peleg, N. Lavrac, and C. Combi, Eds. Springer
Berlin Heidelberg, 2011, vol. 6747, pp. 70–79.
[26] P. Siirtola and J. Röning, “Ready-to-use activity recognition for
smartphones,” in Computational Intelligence and Data Mining (CIDM), 2013
IEEE Symposium on, April 2013, pp. 59–64.