Detecting and profiling sedentary young men using machine learning algorithms

Pekka Siirtola∗, Riitta Pyky†‡§¶, Riikka Ahola‡, Heli Koskimäki∗, Timo Jämsä†‡, Raija Korpelainen†‡§ and Juha Röning∗

∗ Computer Science and Engineering Department, P.O. BOX 4500, FI-90014, University of Oulu, Oulu, Finland
Email: [email protected], [email protected]
† Oulu Deaconess Institute, Department of Sports and Exercise Medicine, Oulu, Finland
‡ Medical Research Center Oulu, Oulu University Hospital and University of Oulu, Finland
§ Institute of Health Sciences, Faculty of Medicine, University of Oulu, Oulu, Finland
¶ Department of Medical Technology, Faculty of Medicine, University of Oulu, Finland

Abstract: Many governments and institutions have guidelines for health-enhancing physical activity. Additionally, according to recent studies, the amount of time spent sitting is a highly important determinant of health and wellbeing. In fact, a sedentary lifestyle can lead to many diseases and has even been found to be associated with increased mortality. In this study, a data set consisting of a self-reported questionnaire, medical diagnoses and fitness tests was studied to detect sedentary young men from a large population and to create a profile of a sedentary person. The data set was collected from 595 young men and contained altogether 678 features. Most of these are answers to closed-ended multiple-choice questions. More precisely, the features were mostly integers on a scale from 1 to 5 or from 1 to 2, and therefore there was only little variability in the feature values. In order to detect and profile a sedentary young man, machine learning algorithms were applied to the data set. The performance of five algorithms (quadratic discriminant analysis (QDA), linear discriminant analysis (LDA), C4.5, random forests, and 𝑘 nearest neighbours (𝑘NN)) is compared to find the most accurate algorithm.
The results of this study show that when the aim is to detect a sedentary person based on medical records and fitness tests, LDA performs better than the other algorithms, but the accuracy is still not high. In the second part of the study, differences between highly sedentary and non-sedentary young men are sought; in this task, recognition can be obtained with high accuracy with each algorithm.

I. INTRODUCTION AND RELATED WORK

It is well known that the amount and intensity of daily physical activity are positively associated with health and wellbeing [1]. To help people estimate how much exercise is enough, guidelines are available for health-enhancing physical activity. Moreover, various devices and mobile applications are nowadays available to measure whether one's daily activity is enough to fulfill these recommendations. On the other hand, according to recent studies, in order to stay healthy it is often as important to avoid long sitting periods and a sedentary lifestyle as it is to be active for a certain amount of time daily [2], [3]. In fact, the time spent in sedentary behaviors has been found to be positively associated with increased mortality [4].

To target interventions to those at risk and to motivate the most sedentary groups, it is important to understand the culture of the target group and the underlying determinants of its sedentary lifestyle. In other words, it needs to be studied what the reasons for a sedentary lifestyle are and how highly sedentary people differ from non-sedentary persons. In this study, predictors for the sedentary lifestyle of young men are sought based on a self-reported questionnaire, medical diagnoses and fitness tests. Moreover, to find sedentary persons from a large population-based sample of young men, a model was built to determine automatically whether a person is highly sedentary based on medical diagnoses and fitness tests.
Methodologically, this article concentrates on studying how machine learning algorithms can be applied to a data set where the features are mostly integers and a single feature does not contain much information about the problem. Different classifiers are compared to find the most suitable algorithm for the data set.

There are studies where machine learning algorithms are applied to data sets quite similar to the one used here. The study most similar to ours is [5], where factors connected to academic success are identified and then used to predict the academic success of first-year students. The data used in that study are questionnaire data from 533 students. Academic success is predicted by classifying students into three categories based on their risk of failing their studies. The recognition rates obtained in the study are not high, but a comparison of three machine learning methods shows that LDA produces the best results of the three. Dropout students have been predicted using data mining methods in more recent studies as well, such as [6]. However, in that case raw questionnaire data were not used; instead, classification was based on ten variables derived from the answers. In addition, educational data mining based on questionnaire data has been examined in several other studies, as reviewed in [7].

Another field where machine learning methods have been applied to data sets with only little variation is medicine. In [8], neural networks, decision trees, and logistic regression were applied to a data set consisting of answers to a questionnaire to build a model that can be used to determine whether a patient has gastroesophageal reflux disease. The results show that, depending on the classifier used, recognition can be obtained with an accuracy varying from 70 % to 78 %. Machine learning algorithms were also successfully used in [9], where artificial intelligence was used to diagnose autism.
The normal procedure to diagnose autism is a questionnaire including 93 questions. However, the article showed that the same diagnostic accuracy can be obtained using only seven questions when machine learning techniques are applied to the data set. The best results were obtained using decision trees. Other machine learning and data mining methods used with medical data include AdaBoost, the naive Bayes classifier, and SVM [10], [11].

In this study, one target is to profile a sedentary young man. Profiling groups of people using machine learning algorithms is studied, for instance, in [12], where a profile of students who take online courses was built. The methods used in the profiling process were classification trees and multivariate adaptive regression splines. These were applied to a data set consisting of variables such as age, gender, ethnicity, residency, living location and test scores. These variables are quite different from those used in our study. The results of [12] suggest that a profile of a group of people can be very simple, in that case consisting of only one or two variables.

Moreover, profiling a sedentary person is studied in [13] and [14]. These studies use the same data set as our study. However, their aim and approach were different from ours: they examined what kinds of groups can be found among highly sedentary young men and which determinants are characteristic of these groups. Five distinctive groups were found, and several wellbeing problems were associated with a sedentary lifestyle, such as heavy alcohol use, unemployment, and low self-esteem. However, there was also one group where the young men were physically active, motivated and healthy, although they were highly sedentary. In these studies, the analysis was made using factor analysis.
This study has three aims:

∙ To replace the questionnaire by building a model that detects a sedentary young man based on medical records and fitness tests
∙ To profile a highly sedentary young man
∙ To study how well different machine learning algorithms can handle a feature set where the information contained in a single feature is very limited

The study is organized as follows: Section II presents the data set, and the labels are defined in Section III. Section IV introduces the machine learning algorithms, which are applied in Section V to build a model to recognize a sedentary person from a large population and in Section VI to profile a sedentary young man. Finally, conclusions and future work are presented in Section VII.

II. DATA SET

This article is part of the MOPO study [15], which aims to find new methods for activating young men, as they are often at risk of marginalization, inactivity and an unhealthy lifestyle. The research subjects consist of 595 conscription-aged men (mean age 18 years) in the city of Oulu, Finland. The subjects participated in a mandatory call-up event organized annually by the Finnish Defense Forces. At the event, young men get information about military service, and by the end of the day it is decided, based on a medical examination and interviews, whether they are fit for armed military service or whether they choose civil service, postponement or total rejection.

The call-up day contains a lot of queuing and waiting as subjects move from one activity to another. In this study, during this spare time, subjects were asked to fill in a questionnaire containing questions about their family, education, health, health behavior, diet, wellbeing and the use of media and technology.

TABLE I: The data set consists mostly of variables whose values are integers on a scale from 1 to 2 or from 1 to 5.

Scale     Portion
1 to 2    56 %
1 to 3    2 %
1 to 4    6 %
1 to 5    26 %
other     10 %
In addition, data from medical diagnoses were used in this study. These data contain diagnoses of various diseases and disabilities. They are mainly dichotomous: a person either has a certain disease or not.

Moreover, the study participants were asked to go through fitness tests at the call-up event. The tests included measurements of body composition, grip strength, heart rate variability, and aerobic fitness. In addition, height, weight and waist circumference were measured. These measurements were also used in this study.

In summary, three types of data, altogether 678 variables, were used in this study: self-reported questionnaire (314 variables), medical diagnoses (340 variables) and results of fitness tests (34 variables). These claims, measurements and questions can be divided into three categories:

1) Rating scale questions: subjects rate claims, typically using a scale from 1 to 5. However, the scale can also be dichotomous; thus, questions where the answer is Yes/No are included in this category as well. Fitness test measurements also belong to this category.
2) Rating scale questions with an option not to answer: the same as above, but also including an option to answer Do not know/Other.
3) Nominal scale questions: subjects are asked to choose the correct answer from a list of nominal alternatives, and the answers do not have any natural order.

The variables of type 1 are the easiest to use with the algorithms used in this study; therefore, only these were used. Most of the questions were of this type, but it should still be studied further whether the other questions improve the recognition rates. After removing the questions of types 2 and 3, 656 variables were left from the original 678, and these were used as features for the machine learning algorithms in this study.

Most of the variables are integers on a scale from 1 to 2 or from 1 to 5, see Table I. These are studied in detail in Figures 1 and 2. Figure 1 shows that 2 was the most common answer in questions where the scale was from 1 to 2, and in almost every case the dominant answer represented over 90 % of the answers. This means that the amount of information contained in these questions is very low. Answers to questions where the scale is from 1 to 5 contain more information, as there is more variability in the answers (Figure 2b). However, the most common answer is 3, which can be considered a neutral answer and does not tell much about the opinions of the person. Therefore, these answers are not highly informative either. Thus, it can be concluded that the data set consists mostly of variables that do not contain a lot of information. This means that there are not many features describing the problem well, which makes feature selection challenging.

Fig. 1: Statistics about variables where the scale of the answers is from 1 to 2. (a) In most variables, 2 is the dominant answer. (b) Variability within a variable is low; in most cases the dominant answer represents over 95 % of the answers.

Fig. 2: Statistics about variables where the scale of the answers is from 1 to 5. (a) Answer 3 is the most common. (b) In most cases, the dominant answer represents less than 50 % of the answers.

The data set includes a lot of variables with missing values, especially in the questionnaire part, as the young men found it long and laborious to fill in. To be able to use features with missing values in the classification process, the missing values were replaced with the mean value of the answers.
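The mean-value replacement described above can be illustrated with a minimal sketch. The function name, the use of None to mark a missing answer, and the toy answers are assumptions made for illustration only; the paper does not describe its implementation.

```python
from statistics import mean


def impute_with_feature_mean(rows):
    """Replace None entries in each column with the mean of that column's observed values."""
    n_features = len(rows[0])
    col_means = []
    for j in range(n_features):
        observed = [row[j] for row in rows if row[j] is not None]
        # Assumption for this sketch: a column with no answers at all is filled with 0.0.
        col_means.append(mean(observed) if observed else 0.0)
    return [
        [col_means[j] if row[j] is None else row[j] for j in range(n_features)]
        for row in rows
    ]


# Toy answers (not from the study): three respondents, three questions.
answers = [
    [1, 3, None],
    [2, None, 4],
    [2, 5, 5],
]
filled = impute_with_feature_mean(answers)
```

Because most questions have one dominant answer, a mean-imputed value will typically sit close to that answer, so imputed entries add little extra variability to the features.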
Moreover, respondents who answered only a few questions were removed from the analyses.

III. DEFINING SEDENTARY CLASS LABELS

The aim of the study is to find the determinants of a sedentary lifestyle in order to describe a profile of a highly sedentary young man. In addition, the aim is to build a model to detect the sedentary class of a person. For this purpose, the data must be labeled based on the time spent on sedentary behaviour. As a part of the self-reported questionnaire, respondents were asked to estimate how many hours they sat daily outside school/work time. Persons who estimated their daily sitting time as five hours or more were considered highly sedentary, and persons who sat two hours or less non-sedentary. Therefore, the participants were divided into the following three sedentary classes:

1) highly sedentary: sitting time ≥ 5 hours
2) moderately sedentary: sitting time between two and five hours
3) non-sedentary: sitting time ≤ 2 hours.

Using this rule, of the 595 respondents, 179 young men were labeled as highly sedentary (class 1), 244 as moderately sedentary (class 2), and 172 as non-sedentary (class 3).

IV. CLASSIFIERS

In this study, five classifiers were used: 𝑘NN, LDA, QDA, C4.5 and random forest. The idea of the 𝑘NN classifier is to classify a data point into the class to which most of its 𝑘 nearest neighbors belong [16]. The distance between data points was defined in this case using the Euclidean distance. In this study, 𝑘 values 1, 3, 5, and 7 were employed. LDA is used to find a linear combination of features that separates the classes best. The resulting combination may be employed as a linear classifier. QDA is a similar method, but it uses quadric surfaces to separate the classes [17].

C4.5 is a decision tree model by Ross Quinlan [18] that, like all other tree models, partitions the space spanned by the input variables to maximize a score of class purity. This is done so that the majority of points in each cell of the partition belong to one class [17]. In the case of C4.5, the partition is based on the difference in entropy. Another decision tree based algorithm used in this study is random forest [19]. It is a classification model that uses ensemble learning: it constructs a number of decision trees from the training data and classifies a data point into the class to which most of the individual trees assign it.

V. RECOGNIZING A SEDENTARY PERSON FROM A LARGE POPULATION

In this section, it is studied whether it is possible to train a model that can be used to recognize a sedentary person from a large population based on medical records and fitness tests. Such a model would remove the need to fill in the questionnaire. Here, instances are classified into two classes: highly sedentary and others, where the latter consists of the moderately sedentary and non-sedentary classes. The purpose of this section is not only to replace the need to fill in a questionnaire, but also to compare classifiers to find out which one is the most accurate when applied to a data set where the information contained in a single feature is very limited.

In order to study this, the classifiers introduced in Section IV were applied to the data set where the class labels were self-reported estimates of the time spent on sedentary behaviour and the features consisted of medical records and results of fitness tests. At first, the values of the features were scaled to the range 0-1 so that each feature has a similar weight in the classification process. Then, the most suitable features for each classifier were searched for using sequential forward selection (SFS) [20]. SFS starts from an empty feature set, and in the first phase the algorithm tests the accuracy of the recognition model using only one feature. It does this for each feature in turn and adds to the empty feature set the feature that gives the highest recognition accuracy. In the second phase, again, one feature is added to the feature set. To decide which feature is added, the algorithm tests the accuracy of the model using two features, including the one selected in the first phase. The feature that improves the recognition accuracy the most is added to the feature set. Similarly, on each iteration, the algorithm adds one feature to the set, namely the one that improves the classification accuracy the most. The algorithm continues adding features until the accuracy does not get any higher. To avoid overfitting, 7-fold cross-validation was used: each fold in turn was used for testing and the other six for training.

A. Results

Instances were classified into two classes, highly sedentary and others, and the results are shown in Table II. The presented accuracies are average recognition rates calculated from the true positive precisions of the individual classes. There are no big differences between the recognition accuracies of the individual classes.

TABLE II: Detection rates for different classifiers, average recognition rate (standard deviation).

Classifier       Accuracy
𝑘NN, 𝑘 = 1       69.7 % (1.48)
𝑘NN, 𝑘 = 3       64.2 % (1.51)
𝑘NN, 𝑘 = 5       63.7 % (1.09)
𝑘NN, 𝑘 = 7       61.7 % (0.15)
QDA              66.2 % (1.68)
LDA              70.1 % (1.38)
C4.5             62.6 % (1.16)
Random forest    61.7 % (1.64)

TABLE III: Detection rates for different classifiers when instances with the highest probability of having a false label are removed, average recognition rate (standard deviation).

Classifier       Accuracy
𝑘NN, 𝑘 = 1       72.7 % (1.94)
𝑘NN, 𝑘 = 3       74.3 % (2.30)
𝑘NN, 𝑘 = 5       64.5 % (2.22)
𝑘NN, 𝑘 = 7       67.4 % (1.83)
QDA              72.9 % (1.53)
LDA              77.4 % (1.28)
C4.5             69.8 % (2.61)
Random forest    66.4 % (1.07)

B. Discussion

Classification results ranged from 61.7 % to 70.1 %, see Table II. Clearly, the recognition rates are not high and they were not found satisfactory.
In fact, it seems that it is not possible to divide the feature space into decision regions that can reliably separate the classes. The reason for this may be the labels, which were based on self-reported estimates of the time spent on sedentary behaviour and not on actual measurements. Moreover, respondents were only asked to estimate their sitting time outside school/work, and not all respondents are working or at school. In addition, estimating the time spent on sedentary behaviour is not an easy task, and therefore the data set most likely includes several mislabeled instances. Therefore, it is clear that recognition cannot be done with an accuracy close to 100 %.

To test whether mislabeled instances were causing the low recognition rates, another classification experiment was done using a modified data set where the instances with the highest probability of having a false label were removed. These instances are the ones located at the border where the class label changes, since close to this border even a small estimation error can lead to a false class label. Therefore, to create a data set which contains fewer mislabeled instances, instances where the time spent on sedentary behaviour was estimated as 4 or 5 hours were removed from the data set. Thus, the new data set consists of two classes:

1) highly sedentary: sitting time ≥ 6 hours
2) others: sitting time ≤ 4 hours.

This modified data set was classified using the same algorithms as the original data set to see the impact of falsely labeled instances. The results are shown in Table III, and this time the results ranged from 64.5 % to 77.4 %. Therefore, the accuracies are much better than the ones presented in Table II. Although even the results of the modified data set are not excellent, based on this experiment it is possible to conclude that mislabeled instances are a major reason for the low recognition rates presented in Table II.
However, to confirm this, the recognition should be done based on objective measurements instead of personal estimates. Such an experiment would also reveal the full potential of the different machine learning algorithms.

Interestingly, in both cases the highest recognition rate was obtained using LDA, and with the modified data set it was 77.4 %. Although this is not a very high accuracy, it is much better than the accuracy presented in Table II. This improvement is promising and shows that the presented method has potential if the data set is correctly labeled.

When the recognition accuracies are studied classifier-wise, it can be noted that QDA produces lower recognition rates than LDA. This is interesting, as LDA is a special case of QDA. The reason for this may be the feature selection method used. In the case of QDA, the covariance matrix of each class in the training data must be positive definite. This limits the possibilities to use QDA, especially when the feature values do not contain much variance, as in this study, where in many cases features have only two possible values. Therefore, in this study there are a lot of feature combinations that do not satisfy the class-wise positive definiteness requirement, and thus only a fraction of the feature combinations can be used in the classification process. LDA has a quite similar requirement, but in the case of LDA the pooled covariance matrix of the whole training data must be positive definite, and not the covariance matrix of each class separately.

C4.5 did not succeed very well compared to the other classifiers. Almost the same classifiers as in this article were compared in [21]. Although the data set in [21] was different and its feature values were continuous, whereas in this study they are mostly discrete, it is notable that C4.5 performed worse than the other classifiers in that study as well. In addition, 𝑘NN did not perform very well for some reason, though good results were achieved with the modified data set using 𝑘 values 1 and 3.
It should be studied whether the reason for the poor performance with the full data set and some 𝑘 values was the point-to-point distance measure, which was the Euclidean distance in this study. More experiments should be done with other distance measures, such as the Mahalanobis distance, as well. In addition, it is surprising how poorly random forests performed. The algorithm was run several times with different settings, and the accuracies presented in Tables II and III were the best that were achieved. In the data set used, most of the features from the medical diagnosis data are binary, while the features extracted from the results of the fitness tests can take many more values [22]. Moreover, it is known that random forests are biased in favor of attributes with more levels, which may cause the low recognition rates.

VI. PROFILING A HIGHLY SEDENTARY YOUNG MAN

Profiling of a highly sedentary person is considered as a feature selection problem where the purpose is to find differences between highly sedentary and non-sedentary young men. Also in this case, the values of the features were scaled to the range 0-1 and the best features were selected using SFS. The most descriptive features for four classifiers (LDA, QDA, C4.5, and 𝑘NN with 𝑘 = 1, 3, 5, 7) were searched for. The results using random forest were not calculated, as it did not perform well in Section V.

In the feature selection process, determinants of sedentary behavior were analyzed by comparing differences in lifestyle and health factors between highly sedentary and non-sedentary young men. Therefore, profiling was considered as a two-class problem, and only data from young men belonging to sedentary classes 1 and 3 were used. To avoid overfitting, cross-validation was used in this case as well. The data were randomly divided into four parts to obtain 4-fold cross-validation. This was done five times, so altogether five profiles per classifier were obtained.
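The combination of SFS and cross-validation used in Sections V and VI can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, a 1-NN classifier with Euclidean distance stands in for the classifier being evaluated, and the toy data are constructed so that only one feature is informative.

```python
import random


def predict_1nn(train_X, train_y, x, features):
    """1-NN prediction using squared Euclidean distance over the chosen features."""
    nearest = min(
        range(len(train_X)),
        key=lambda i: sum((train_X[i][f] - x[f]) ** 2 for f in features),
    )
    return train_y[nearest]


def make_folds(n, k, seed=0):
    """Randomly split the indices 0..n-1 into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cv_accuracy(X, y, features, folds):
    """Mean accuracy over folds; each fold is tested once, the rest are used for training."""
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        hits = sum(predict_1nn(train_X, train_y, X[i], features) == y[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)


def sequential_forward_selection(X, y, folds):
    """Add the best feature one at a time until the accuracy stops improving."""
    selected, best_score = [], 0.0
    remaining = list(range(len(X[0])))
    while remaining:
        score, feature = max(
            (cv_accuracy(X, y, selected + [f], folds), f) for f in remaining
        )
        if score <= best_score:
            break
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score


# Toy data (not from the study): feature 1 equals the label, features 0 and 2 are constant.
y = [0, 1] * 6
X = [[0.5, float(label), 0.0] for label in y]
selected, score = sequential_forward_selection(X, y, make_folds(len(X), 4))
```

Calling the selection with differently seeded folds, for example make_folds(len(X), 4, seed=s) for five values of s, mirrors the repeated 4-fold scheme and yields one selected feature list and one accuracy per repetition.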
Therefore, it was possible to measure the standard deviation between differently chosen training and testing sets.

To build a profile of a sedentary young man, the results of the most accurate classifiers were combined. To do this, the ten most descriptive features were selected for each classifier, and points were given to these features based on their importance ranking. The point scoring system was the same as the one used in the Formula 1 championship (1st = 25 points, 2nd = 18 points, 3rd = 15 points, . . ., 10th = 1 point [23]). This scoring system gives more value to higher ranked features in relation to lower ranked ones. This is reasonable, as normally the higher a feature is ranked, the more it improves the recognition rate.

A. Results

Each classifier can distinguish a highly sedentary young man from a non-sedentary one with high accuracy based on the data, see Table IV. When the profile of a sedentary young man was built, points were given to the ten most descriptive features of each classifier that produced accurate results. In this case, each classifier satisfied this criterion, and therefore the results of each classifier were used to build the profile. However, as 𝑘NN was tested with four 𝑘 values, the points given by these were divided by four to avoid bias toward the results of the 𝑘NN classifier. The profile of a sedentary young man is presented in Table V.

TABLE IV: Differences between highly sedentary and non-sedentary young men can be found with high accuracy, average recognition rate (standard deviation).

Classifier       Accuracy
𝑘NN, 𝑘 = 1       90.3 % (0.50)
𝑘NN, 𝑘 = 3       90.0 % (1.95)
𝑘NN, 𝑘 = 5       89.0 % (2.37)
𝑘NN, 𝑘 = 7       90.1 % (1.69)
QDA              91.2 % (1.91)
LDA              90.9 % (0.63)
C4.5             91.4 % (1.78)

B. Discussion

High recognition accuracies were obtained with each classification algorithm (Table IV), and therefore the questionnaire, medical records and fitness tests can be used to profile a highly sedentary person, no matter which classifier is used.
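The point-based combination of classifier-wise rankings described in this section can be sketched as below. The paper lists only the points for ranks 1, 2, 3 and 10; the sketch assumes the standard Formula 1 scale for the remaining ranks. The function name, the feature names and the toy rankings are hypothetical and are not the study's results.

```python
from collections import defaultdict

# Standard Formula 1 points for places 1-10 (assumed for the ranks the paper elides).
F1_POINTS = [25, 18, 15, 12, 10, 8, 6, 4, 2, 1]


def combine_rankings(rankings, weights):
    """Sum rank-based points per feature.

    rankings: {classifier name: [feature names, most descriptive first]}
    weights:  {classifier name: multiplier}; classifiers not listed get weight 1.0
    """
    totals = defaultdict(float)
    for clf, ranked in rankings.items():
        for place, feature in enumerate(ranked[:len(F1_POINTS)]):
            totals[feature] += weights.get(clf, 1.0) * F1_POINTS[place]
    # Highest total first; ties are broken alphabetically to keep the order deterministic.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# Toy rankings, shortened to three features per classifier for readability.
rankings = {
    "kNN k=1": ["internet_pc", "internet", "laziness"],
    "kNN k=3": ["internet_pc", "laziness", "internet"],
    "kNN k=5": ["internet_pc", "internet", "games"],
    "kNN k=7": ["internet_pc", "internet", "laziness"],
    "LDA": ["internet_pc", "internet", "games"],
    "QDA": ["internet_pc", "games", "internet"],
    "C4.5": ["internet_pc", "internet", "laziness"],
}
# The four kNN variants share one classifier's worth of points, as in the text.
weights = {name: 0.25 for name in rankings if name.startswith("kNN")}
profile = combine_rankings(rankings, weights)
```

With this weighting, a feature ranked first by every classifier collects 100 points per repetition, so totals accumulated over the five cross-validation repetitions can reach at most 500.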
The determinants related to a sedentary young man are presented in Table V. It shows which questions best describe the differences between highly sedentary and non-sedentary persons. Clearly, the questions about Internet usage were found most descriptive when the results from all classifiers were combined. In fact, although the classification algorithms use different approaches to build recognition models, each classifier found an Internet usage related question to be the most descriptive and therefore ranked it first. In these questions, respondents were asked how many hours daily they use the Internet or how many hours daily they use a PC to access the Internet. They were able to choose the answer from five options: (1) from 0-2 hours, (2) from 2-4 hours, (3) from 4-6 hours, (4) from 6-8 hours or (5) from 8-10 hours. The importance of this question came as no surprise, as most people sit or lie down while they use the Internet, especially when they use it on their PC. Moreover, the question ranked fourth was also related to heavy Internet usage, and more precisely to playing Internet games. While these findings may sound trivial, they also show that machine learning algorithms are powerful tools for making such findings from a large set of questions. What is more, as the most important questions can easily be understood as highly descriptive, these findings make it easier to believe that the other highly ranked questions are important as well.

In fact, the other seven questions ranked in the top 10 are not as trivial and do not have any clear common factor. They include laziness and lack of interest in exercising, but they also show that a sedentary lifestyle is not always a personal choice; a medical condition can be the reason as well. Moreover, it is notable that none of the selected features is directly related to exercising or sports hobbies.
Therefore, it can be concluded that, based on the data, highly sedentary young men can have sports hobbies. Thus, a young man may follow physical activity recommendations and still spend the rest of the day sitting or lying on the sofa, making him susceptible to the dangers of a sedentary lifestyle. This finding supports the ones suggested in the literature [14].

In Section V, the classification problem was harder, as it also contained instances from moderately sedentary persons, and therefore the differences between classifiers were more easily visible. In this section, by contrast, there were no big differences in recognition rates between classifiers, see Table IV.

VII. CONCLUSIONS AND FUTURE WORK

In this study, a data set consisting of a self-reported questionnaire, medical diagnoses and fitness tests was studied to detect and profile sedentary young men. The profiling of a sedentary young man was considered as a feature selection problem between highly sedentary and non-sedentary persons, while detecting a sedentary person from a large population was considered as a binary classification problem between highly sedentary young men and others. Different classifiers were compared to find the most suitable algorithm to solve these problems.

In Section V, sedentary young men were detected based on medical records and fitness tests in order to remove the need to fill in the questionnaire. Recognition accuracies are presented in Table II, and it can be noted that these are not very high. However, the reason for the low recognition rates may be the labels, which were based on self-reported estimates of the time spent on sedentary behaviour and not on actual measurements. Moreover, respondents were only asked to estimate their sitting time outside school/work, and not all respondents are working or at school. In addition, estimating the time spent on sedentary behaviour is not an easy task, and therefore the data set most likely includes several mislabeled instances.
However, the mislabeled instances are most likely located at the border where the class label changes, as there even a small estimation error can lead to a false class label. Therefore, to create a data set which contains fewer mislabeled instances, instances where the time spent on sedentary behaviour was estimated as 4 or 5 hours were removed. When this new data set was classified using the same machine learning algorithms as the whole data set, the accuracies were still below 80 % but much higher than with the full data set, see Table III. Therefore, the results show that falsely labeled instances were a major reason for the low recognition rates presented in Table II. In fact, previous studies have shown that self-estimation of one's daily physical activity is difficult [24], and based on the results of this study, the same most likely applies to sedentary time. Therefore, in order to avoid a sedentary lifestyle, the time spent on it should be measured objectively. For this purpose, reliable algorithms and applications to measure the intensity of movement [25] or to detect activities [26] should be developed.

TABLE V: The profile of a highly sedentary young man: ten most descriptive questions.

Ranking   Points    Question
1         366.25    How many hours daily a respondent uses PC to access the Internet?
2         147.5     How many hours daily a respondent uses the Internet?
3         75        How much laziness limits a respondent's free time exercising?
4         51.5      How many hours daily a respondent spends time to Internet games?
5         46        Does a respondent have bone or cartilage diseases?
6         44        Grip strength in kilograms, right hand
7         36.75     Repetitive whistling in ears, tinnitus
8         36        How certain respondent is that he goes to exercise despite having guests?
9         36        How much lack of interest toward exercising limits a respondent's free time exercising?
10        36        Mass of muscles in kilograms

The profiling part of this study shows that highly sedentary persons spend more time on the Internet than non-sedentary persons.
It is obvious that this type of behaviour increases the time spent on a sedentary lifestyle; therefore, these results show that machine learning algorithms are powerful tools for finding meaningful factors even when a single feature does not include much information. What is more, as the questions ranked as the most important can easily be understood as highly descriptive, these findings make it easier to believe that the other highly ranked questions are important as well. In fact, the other questions selected in the profile of a sedentary young man are not as obvious findings.

While the obtained recognition results were not always as high as expected, it can be noted that LDA is the most reliable classifier. LDA produced the highest rates in Section V, and in Section IV the difference between LDA and C4.5, which produced the best results, was not statistically significant. Interestingly, in [5], where classification based on questionnaire data was studied, LDA produced the best detection results as well.

The weakness of this study is that only one type of question was used in the profiling process. In the next phase of the study, other questions should be used as well, which could lead to better recognition rates. In addition, one part of the future work is to conduct a similar study in which the data are labeled based on actual measurements and not on estimations. Moreover, in this study, only data from 2010 were used. However, similar data sets are available from the years 2009-2013, which would increase the sample size significantly. Therefore, part of the future work is to study whether results similar to those of this study can be obtained with other data sets as well.

ACKNOWLEDGMENT

This work was done as a part of the MOPO study [15]. The authors would like to thank Infotech Oulu and the Finnish Funding Agency for Technology and Innovation for funding this work.

REFERENCES

[1] F. J. Penedo and J. R. Dahn, “Exercise and well-being: a review of mental and physical health benefits associated with physical activity,” Current Opinion in Psychiatry, vol. 18, no. 2, pp. 189–193, Mar. 2005. [Online]. Available: http://view.ncbi.nlm.nih.gov/pubmed/16639173
[2] M. Chia and H. Suppiah, “Inactivity physiology-standing up for making sitting less sedentary at work,” J Obes Weight Loss Ther, vol. 3, no. 171, p. 2, 2013.
[3] J. Kim, K. Tanabe, N. Yokoyama, M. Zempo, and S. Kuno, “Objectively measured light-intensity lifestyle activity and sedentary time are independently associated with metabolic syndrome: a cross-sectional study of Japanese adults,” International Journal of Behavioral Nutrition and Physical Activity, vol. 10, no. 30, 2013.
[4] C. E. Matthews, S. M. George, S. C. Moore, H. R. Bowles, A. Blair, Y. Park, R. P. Troiano, A. Hollenbeck, and A. Schatzkin, “Amount of time spent in sedentary behaviors and cause-specific mortality in US adults,” Am J Clin Nutr, vol. 95, no. 2, pp. 437–45, 2012. [Online]. Available: http://www.biomedsearch.com/nih/Amounttime-spent-in-sedentary/22218159.html
[5] J. Superby, J. Vandamme, and N. Meskens, “Determination of factors influencing the achievement of the first-year university students using data mining methods,” in Proc. Int. Conf. Intell. Tutoring Syst. Workshop Educ. Data Mining, 2006, pp. 1–8.
[6] E. Yukselturk, S. Ozekes, and Y. Türel, “Predicting dropout student: An application of data mining methods in an online education program,” European Journal of Open, Distance and e-Learning, vol. 17, no. 1, pp. 118–133, 2014.
[7] C. Romero and S. Ventura, “Educational data mining: A review of the state of the art,” Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on, vol. 40, no. 6, pp. 601–618, Nov. 2010.
[8] N. Horowitz, M. Moshkowitz, Z. Halpern, and M. Leshno, “Applying data mining techniques in the development of a diagnostics questionnaire for GERD,” Digestive Diseases and Sciences, vol. 52, no. 8, pp. 1871–1878, 2007. [Online]. Available: http://dx.doi.org/10.1007/s10620-0069202-5
[9] D. P. Wall, R. Dally, R. Luyster, J.-Y. Jung, and T. F. Deluca, “Use of artificial intelligence to shorten the behavioral diagnosis of autism,” PLoS One, vol. 7, no. 8, p. e43855, 2012. [Online]. Available: http://www.biomedsearch.com/nih/Use-artificial-intelligenceto-shorten/22952789.html
[10] I. Kononenko, “Machine learning for medical diagnosis: history, state of the art and perspective,” Artificial Intelligence in Medicine, vol. 23, no. 1, pp. 89–109, 2001. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S093336570100077X
[11] I. Yoo, P. Alafaireet, M. Marinov, K. Pena-Hernandez, R. Gopidi, J.-F. Chang, and L. Hua, “Data mining in healthcare and biomedicine: A survey of the literature,” Journal of Medical Systems, vol. 36, no. 4, pp. 2431–2448, 2012. [Online]. Available: http://dx.doi.org/10.1007/s10916011-9710-5
[12] C. H. Yu, S. Digangi, A. K. Jannasch-Pennell, and C. Kaprolet, “Profiling students who take online courses using data mining methods,” Online J. Distance Learning Administ., vol. 11, no. 2, pp. 1–14, 2008.
[13] R. Pyky, A. Jauho, R. Ahola, T. Ikäheimo, T. Jämsä, and R. Korpelainen, “Profiles of physically inactive young men,” in 7th International Conference on Movement and Health, 2014.
[14] R. Korpelainen, R. Pyky, R. Ahola, H. Koivumaa-Honkanen, A. Jauho, T. Jämsä, and T. Ikäheimo, “Profiles of physically inactive young men,” in 5th International Congress on Physical Activity and Public Health, 2014, poster.
[15] R. Ahola, R. Pyky, T. Jämsä, M. Mäntysaari, H. Koskimäki, T. Ikäheimo, M.-L. Huotari, J. Röning, H. Heikkinen, and R. Korpelainen, “Gamified physical activation of young men - a multidisciplinary population-based randomized controlled trial (MOPO study),” BMC Public Health, vol. 13, pp. 1–8, Jan. 2013.
[16] E. Fix and J. L. Hodges, “Discriminatory analysis: Nonparametric discrimination: Consistency properties,” USAF School of Aviation Medicine, Randolph Field, Texas, Tech. Rep. Project 21-49-004, Report Number 4, 1951.
[17] D. J. Hand, H. Mannila, and P. Smyth, Principles of data mining. Cambridge, MA, USA: MIT Press, 2001.
[18] R. J. Quinlan, C4.5: Programs for Machine Learning (Morgan Kaufmann Series in Machine Learning). Morgan Kaufmann, Jan. 1993.
[19] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[20] P. A. Devijver and J. Kittler, Pattern recognition: A statistical approach. Prentice Hall, 1982.
[21] P. Siirtola, H. Koskimäki, V. Huikari, P. Laurinen, and J. Röning, “Improving the classification accuracy of streaming data using SAX similarity features,” Pattern Recognition Letters, vol. 32, no. 13, pp. 1659–1668, 2011. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167865511002091
[22] C. Strobl, A.-L. Boulesteix, A. Zeileis, and T. Hothorn, “Bias in random forest variable importance measures: Illustrations, sources and a solution,” BMC Bioinformatics, vol. 8, no. 1, p. 25, 2007. [Online]. Available: http://www.biomedcentral.com/1471-2105/8/25
[23] FIA, “Formula 1 point scoring system,” http://www.formula1.com/.
[24] E. M. van Sluijs, S. J. Griffin, and M. N. van Poppel, “A cross-sectional study of awareness of physical activity: associations with personal, behavioral and psychosocial factors,” International Journal of Behavioral Nutrition and Physical Activity, vol. 4, no. 1, p. 53, 2007.
[25] F. García-García, G. García-Sáez, P. Chausa, I. Martínez-Sarriegui, P. Benito, E. Gómez, and M. Hernando, “Statistical machine learning for automatic assessment of physical activity intensity using multi-axial accelerometry and heart rate,” in Artificial Intelligence in Medicine, ser. Lecture Notes in Computer Science, M. Peleg, N. Lavrac, and C. Combi, Eds. Springer Berlin Heidelberg, 2011, vol. 6747, pp. 70–79.
[26] P. Siirtola and J. Röning, “Ready-to-use activity recognition for smartphones,” in Computational Intelligence and Data Mining (CIDM), 2013 IEEE Symposium on, Apr. 2013, pp. 59–64.