Supplementary material: Machine learning classification
S1. Machine learning classification
For the machine learning part, we used 5-fold cross-validation separately on the positive and the negative opinion datasets, each of which contained 800 hotel reviews.
S1.1 Feature sets
We conducted the analysis on six different feature set combinations (see Appendix B; a sketch of the NER feature extraction follows the list):
1) NER: for each named entity, the number of unique occurrences per hotel review.
2) Most frequent NER: for all named entities with at least 10% occurrence, the number of
unique occurrences per hotel review.
3) LIWC: all 92 LIWC categories.
4) LIWC + NER
5) LIWC + Most frequent NER
6) LIWC + NER + Most frequent NER
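To make feature sets 1 and 2 concrete, the following is a minimal sketch of the per-review NER counting, assuming spaCy; the model name and the example review are illustrative assumptions, not the study's pipeline.

```python
# Sketch: count unique named-entity occurrences per review with spaCy.
# The model name below is an assumption; any spaCy model with NER works.
from collections import defaultdict
import spacy

nlp = spacy.load("en_core_web_sm")

def ner_features(review: str) -> dict:
    """Map each entity label to its number of unique entity strings."""
    doc = nlp(review)
    unique = defaultdict(set)
    for ent in doc.ents:
        unique[ent.label_].add(ent.text.lower())
    return {label: len(texts) for label, texts in unique.items()}

print(ner_features("We stayed at the Hilton Chicago for two nights in May."))
# e.g. {'FAC': 1, 'DATE': 2} -- exact labels depend on the model used
```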
S1.2 Support vector machines (SVM)
SVMs classify data points by identifying a hyperplane in a high- or infinite-dimensional space that divides the given data points into their corresponding classes. SVM classifiers aim to determine the hyperplane that has the largest margin between itself and the nearest data point of each class. Throughout the paper, we use linear SVMs. A linear SVM classifies a real feature vector $\mathbf{x}$ using a learned weight vector $\mathbf{w}$ and a bias term $b$, such that the identified hyperplane $H$ can be represented as

$H := \{\mathbf{x} : \mathbf{w}^\top \mathbf{x} - b = 0\}.$

Hence, the class $y$ for a given feature vector $\mathbf{x}$ can be identified as

$y = \operatorname{sgn}(\mathbf{w}^\top \mathbf{x} - b),$

where the sign of $y$ denotes the class prediction for $\mathbf{x}$ (see e.g. Gunn, 1998).
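As a concrete illustration, here is a minimal sketch of the 5-fold cross-validated linear SVM, assuming scikit-learn; X and y are randomly generated placeholders standing in for the LIWC/NER feature matrices and veracity labels, not the study's data.

```python
# Sketch: 5-fold cross-validated linear SVM (decision rule sign(w^T x - b)).
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(800, 92))    # placeholder: 800 reviews x 92 LIWC features
y = rng.integers(0, 2, size=800)  # placeholder labels: 0 = truthful, 1 = deceptive

clf = LinearSVC(C=1.0)
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(f"mean 5-fold accuracy: {scores.mean():.4f}")
```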
S1.3 Classification results
Tables 3 and 4 show the results for the SVM classifier (see Ott et al., 2011; 2013; Mihalcea & Strapparava, 2009; Fornaciari & Poesio, 2013).¹ As a baseline for the evaluation of the obtained results, we use the human performance from Ott et al. (2011; 2013). Human judges were able to correctly classify 58.1% of the positive and 69.4% of the negative hotel reviews. The
classification results of the current study suggest that for positive hotel reviews, all six feature sets
outperformed the human baseline. The lowest accuracy was achieved with the Most frequent NER
feature set (66.00% accuracy). The accuracies improved considerably when the LIWC feature set was
included (accuracies between 79.12% and 79.50%). The best accuracy was obtained with the LIWC
and NER feature set combination with an accuracy of 79.50%, thereby slightly improving the original
findings from Ott et al. (2011).
¹ For explanations and results for random forests and logistic regression classifiers, see Appendix B.
For negative reviews, contrary to the positive hotel reviews, only the feature sets containing the
LIWC outperformed the human baseline. Similar to the positive reviews, the lowest accuracy was
obtained for the Most frequent NER feature set (62.75%). The combined feature set of LIWC and the
Most frequent NER as well as the LIWC itself achieved the highest accuracy, both with 76.00%. For
all feature sets that included the LIWC, differences in accuracy were small (range: 75.87% to
76.00%).
To assess the generalizability of the classifiers across the polarities, we adopted Ott et al.'s (2013)
procedure of training the classifier with the opposite valence of the test set. Specifically, we trained
the SVMs on positive (negative) reviews and tested them on negative (positive) reviews. The results
suggest that when the training set consisted of positive reviews, the SVM classifier trained with the
LIWC and Most frequent NER feature set combination was only 63.37% accurate. This is a difference
compared to the same-valence train/test procedure of 16.13%. Likewise, for a negative training set
and a positive test set the combination of the LIWC and Most frequent NER feature sets performed
best (accuracy of 71.00%), and the difference in achieved accuracy compared to the same-valence
approach was 5.00%.
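A minimal sketch of this cross-polarity procedure, again assuming scikit-learn and placeholder feature matrices for the two polarities:

```python
# Sketch: train on one review polarity, test on the other.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
X_pos, y_pos = rng.normal(size=(800, 92)), rng.integers(0, 2, size=800)
X_neg, y_neg = rng.normal(size=(800, 92)), rng.integers(0, 2, size=800)

clf = LinearSVC().fit(X_pos, y_pos)              # train on positive reviews
acc = accuracy_score(y_neg, clf.predict(X_neg))  # test on negative reviews
print(f"positive-train / negative-test accuracy: {acc:.4f}")
```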
Overall, these findings indicate that the classifiers are rather sensitive to the valence of the training set
and do not generalize well across polarities. When the classifier was trained on positive reviews and
tested on negative reviews, accuracy dropped below the human judges' accuracy.
Table 3. Results for the linear support vector machine classification algorithm for positive reviews (5-fold cross-validation).

                                    Truthful               Deceptive
Feature set                       P      R      F1       P      R      F1      Acc.
NER                             73.18  59.25  65.29    65.61  78.06  71.15    68.50
Most frequent NER               71.82  53.56  61.13    62.84  78.69  69.73    66.00
LIWC                            80.36  76.93  78.57    77.96  81.32  79.57    79.12
LIWC + NER                      81.62  76.27  78.76    77.73  82.85  80.13    79.50*
LIWC + Most frequent NER        80.59  76.92  78.67    78.00  81.61  79.72    79.25
LIWC + NER + Most frequent NER  81.58  76.04  78.61    77.54  82.85  80.03    79.38

Note. P = precision; R = recall; F1 = F1-measure; Acc. = overall accuracy. ASS = average sentence specificity. Highest accuracy marked with an asterisk.
Table 4. Results for the linear support vector machine classification algorithm for negative reviews (5-fold cross-validation).

                                    Truthful               Deceptive
Feature set                       P      R      F1       P      R      F1      Acc.
NER                             67.11  56.69  61.21    62.65  72.33  66.97    64.38
Most frequent NER               66.20  52.15  58.08    60.74  73.63  66.41    62.75
LIWC                            76.41  74.98  75.62    75.63  77.05  76.26    76.00*
LIWC + NER                      75.73  76.28  75.90    76.16  75.55  75.74    75.88
LIWC + Most frequent NER        75.87  76.32  76.00    76.24  75.77  75.91    76.00*
LIWC + NER + Most frequent NER  75.88  76.05  75.87    75.97  75.78  75.78    75.87

Note. P = precision; R = recall; F1 = F1-measure; Acc. = overall accuracy. ASS = average sentence specificity. Highest accuracy marked with an asterisk.
Table 5. Results for the linear support vector machine classification algorithm for positive train and negative test reviews (5-fold cross-validation).

                                    Truthful               Deceptive
Feature set                       P      R      F1       P      R      F1      Acc.
NER                             62.13  60.49  61.14    62.64  64.51  63.39    63.12
Most frequent NER               60.89  60.86  60.52    61.69  62.18  61.64    61.62
LIWC                            57.89  89.28  69.54    76.37  35.74  48.09    61.88
LIWC + NER                      58.31  87.47  69.36    74.84  38.20  50.17    62.37
LIWC + Most frequent NER        58.68  89.74  70.42    78.37  37.59  50.56    63.37*
LIWC + NER + Most frequent NER  58.21  87.83  69.38    75.07  37.68  49.74    62.25

Note. P = precision; R = recall; F1 = F1-measure; Acc. = overall accuracy. ASS = average sentence specificity. Highest accuracy marked with an asterisk.
Table 6. Results for the linear support vector machine classification algorithm for negative train and positive test reviews (5-fold cross-validation).

                                    Truthful               Deceptive
Feature set                       P      R      F1       P      R      F1      Acc.
NER                             72.10  51.87  59.96    62.40  80.73  69.99    66.13
Most frequent NER               69.39  32.22  43.86    56.11  86.84  67.81    59.62
LIWC                            64.06  82.19  71.71    76.12  55.04  63.67    68.88
LIWC + NER                      65.15  82.80  72.63    76.09  56.51  64.74    69.75
LIWC + Most frequent NER        65.79  84.53  73.76    78.50  56.94  65.92    71.00*
LIWC + NER + Most frequent NER  65.47  84.03  73.32    77.37  56.47  65.19    70.38

Note. P = precision; R = recall; F1 = F1-measure; Acc. = overall accuracy. ASS = average sentence specificity. Highest accuracy marked with an asterisk.
Discussion
Machine learning accuracy versus statistical results
In our machine learning analysis, we found different results for positive and negative reviews. For the
positive reviews, the classifier performed best with the LIWC and all named entities, whereas for the
negative reviews both the LIWC and the combination of LIWC and most frequent named entities led
to the highest accuracy rates. Hence, for the positive reviews, the inclusion of named entities improved classification accuracies compared to classifiers trained exclusively on the LIWC. Positive and
negative reviews had in common that the feature sets that included the LIWC categories outperformed
those that did not. This finding supports the notion that the LIWC is a tool suitable for automated
verbal deception detection analysis (Fitzpatrick et al., 2015). Ott et al. (2011) found a slightly lower accuracy for positive reviews with the LIWC; this difference might be attributable to the LIWC version: unlike Ott et al. (2011), we used the revised 2015 version of the LIWC with updated dictionaries.
It seems inconsistent that the proportion of frequent named entities was the strongest predictor for the
statistical analysis but did not show the same beneficial effect for the machine learning classification
task. Likewise, named entities outperformed the LIWC in the statistical analysis, but this was reversed
in the machine learning part. The nature of both methods might explain these findings.
For the statistical analysis, we computed one index score per dependent variable (i.e. the proportion of
unique named entities; and the sum of three LIWC categories), whereas we fed the learning
algorithms with the raw counts of all sub-categories (i.e. all 18 named entities and all 92 LIWC
categories). Therefore, the sheer amount of features used in the machine learning part could explain
why the LIWC (92 features) outperformed the named entities (18 features). The classification
accuracy of a single index (e.g. the proportion of unique named entities) is expectedly lower than that
of a multi-feature machine learning classifier. Further, the superiority of the proportion of named
entities over the LIWC richness of detail was related to the choices made for the LIWC categories
(i.e. percept, space, time). Exploratory results showed that some LIWC categories are valid veracity
indicators (e.g. pronouns, function words, punctuation). Our LIWC operationalization was based on
RM theory and previous research (Bond & Lee, 2005).
Note that not all LIWC findings were in line with previously reported results. For example, whereas
Newman et al. (2003) and Hancock et al. (2008) report that liars tend to distance themselves through the use of non-first-person pronouns, this pattern was reversed for hotel reviews. Interestingly, despite not being
in agreement with theory, these dynamics still benefit machine learning classification. For the
classifier, it is irrelevant whether more first-person pronouns pertain to deceptive or truthful reviews.
This also illustrates the trade-off between high accuracy with machine learning and potentially more
generalizable results with theoretically motivated cues.
We showed that named entities as a cue to deception generalize across hotel review polarities; that is, the theoretical underpinnings were at play for both positive and negative reviews. In contrast,
machine learning classifiers suffered in accuracy when applied across polarities. The conceptual
realization of named entities might have helped to formulate a cue that is less sensitive to review
polarities than others.
Appendix to supplement: Additional machine learning classification algorithms
Table B1. Description of the feature sets used for classification.

Feature set          Description
NER                  The number of unique occurrences of spaCy's 18 named entity
                     categories (see Table 2); average sentence specificity
Most frequent NER    The number of unique occurrences of spaCy's named entity
                     categories persons, facilities, dates, times, money, ordinals,
                     and cardinals; information specificity (see above); average
                     sentence specificity
LIWC                 92 LIWC categories (see Appendix A)
Additional machine learning classification results.
Appendix B.1 Random forests
The random forests learning algorithm consists of a collection of $T$ decision trees classifying a given $n$-dimensional feature vector $\mathbf{x}$. A decision tree is a classification algorithm in which a feature vector is classified by being directed through a tree from its root to a leaf. The leaves are labeled with one particular class (e.g. deceptive, truthful) and eventually assign $\mathbf{x}$ its corresponding class. Every tree node $t$ contains a condition on one feature $x_i$ with $i \in \{0, \dots, n-1\}$ that decides to which of $t$'s successor nodes $\mathbf{x}$ is directed next. The algorithm aims to maximize the quality of the split at each node, such that conditioning on the selected $x_i$ makes each successor node as pure as possible. In this context, the term (im)purity corresponds to the possible number of different classes reaching that node. Hence, a node is considered pure if and only if all feature vectors passing that node belong to the same class. A common means of measuring the impurity of a tree node $t$ is the entropy $H$; the information gain achieved by conditioning on a particular $x_i$ at $t$ is the resulting reduction in entropy (Quinlan, 1986). For a binary classification problem, $H$ is defined as

$H(t) = -\sum_{c_i \in C} p_{c_i} \log p_{c_i}$

where $C = \{c_1, c_2\}$ and $p_{c_i} = p(z = c_i \mid t)$ for the class label $z$ of $\mathbf{x}$ (see e.g. Murphy, 2012).
Given the classifications of those $T$ decision trees, random forests assign $\mathbf{x}$ the class $c$ according to a majority vote; that is, the class that occurs most often among those $T$ votes is assigned to $\mathbf{x}$. For our classification we use $T = 200$.
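The following minimal sketch shows the corresponding ensemble with $T = 200$ entropy-based trees, assuming scikit-learn; the data are placeholders.

```python
# Sketch: random forest with T = 200 trees and entropy as the split criterion.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(800, 92))    # placeholder feature matrix
y = rng.integers(0, 2, size=800)  # placeholder labels

# criterion="entropy" implements the impurity measure H(t) above;
# the class prediction is a majority vote over the n_estimators trees.
forest = RandomForestClassifier(n_estimators=200, criterion="entropy",
                                random_state=0)
print(cross_val_score(forest, X, y, cv=5, scoring="accuracy").mean())
```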
Table B2. Results for the random forest classification algorithm for positive reviews (5-fold cross-validation).

                                    Truthful               Deceptive
Feature set                       P      R      F1       P      R      F1      Acc.
NER                             62.87  63.22  63.03    63.00  62.70  62.84    63.00
Most frequent NER               66.61  49.10  56.27    59.48  75.00  66.18    61.88
LIWC                            80.37  76.66  78.35    77.64  81.33  79.34    78.88
LIWC + NER                      81.40  76.16  78.59    77.47  82.54  79.84    79.25
LIWC + Most frequent NER        81.43  76.82  78.99    78.02  82.46  80.19    79.62*
LIWC + NER + Most frequent NER  81.92  75.13  78.24    76.95  83.40  79.92    79.13

Note. P = precision; R = recall; F1 = F1-measure; Acc. = overall accuracy. ASS = average sentence specificity. Highest accuracy marked with an asterisk.
Table B3. Results for the random forest classification algorithm for negative reviews (5-fold cross-validation).

                                    Truthful               Deceptive
Feature set                       P      R      F1       P      R      F1      Acc.
NER                             63.73  61.63  62.52    62.78  64.88  63.68    63.12
Most frequent NER               60.11  50.75  54.80    57.73  66.73  61.74    58.62
LIWC                            78.13  75.87  76.88    76.55  78.42  77.40    77.25
LIWC + NER                      78.68  74.71  76.51    75.90  79.53  77.57    77.12
LIWC + Most frequent NER        79.57  75.14  77.23    76.61  80.84  78.61    78.00*
LIWC + NER + Most frequent NER  78.60  75.52  76.94    76.33  79.35  77.73    77.38

Note. P = precision; R = recall; F1 = F1-measure; Acc. = overall accuracy. ASS = average sentence specificity. Highest accuracy marked with an asterisk.
Table B4. Results for the random forest classification algorithm for positive train and negative test reviews (5-fold cross-validation).

                                    Truthful               Deceptive
Feature set                       P      R      F1       P      R      F1      Acc.
NER                             63.76  61.77  62.43    63.52  65.99  64.44    64.12
Most frequent NER               60.50  59.90  59.71    60.74  61.90  60.92    60.75
LIWC                            66.16  80.46  71.96    74.75  59.60  65.79    69.50
LIWC + NER                      69.38  78.73  73.10    75.57  66.32  70.10    72.00*
LIWC + Most frequent NER        67.19  80.53  72.54    75.47  61.57  67.21    70.38
LIWC + NER + Most frequent NER  69.68  77.17  72.45    74.11  67.19  69.84    71.50

Note. P = precision; R = recall; F1 = F1-measure; Acc. = overall accuracy. ASS = average sentence specificity. Highest accuracy marked with an asterisk.
Table B5. Results for the random forest classification algorithm for negative train and positive test reviews (5-fold cross-validation).

                                    Truthful               Deceptive
Feature set                       P      R      F1       P      R      F1      Acc.
NER                             64.21  52.04  57.15    59.85  71.62  64.89    62.00
Most frequent NER               73.19  33.99  46.28    57.03  88.05  68.90    61.25
LIWC                            65.05  86.56  73.86    80.63  54.46  64.59    70.50
LIWC + NER                      66.27  86.62  74.82    81.52  56.82  66.72    72.00
LIWC + Most frequent NER        66.11  87.06  74.95    81.58  56.20  66.43    72.00
LIWC + NER + Most frequent NER  67.54  85.32  75.16    80.58  59.80  68.44    72.88*

Note. P = precision; R = recall; F1 = F1-measure; Acc. = overall accuracy. ASS = average sentence specificity. Highest accuracy marked with an asterisk.
Appendix B.2 Logistic regression
The logistic regression algorithm is subsumed under the generalized linear model insofar as it applies to (binary) classification problems. Contrary to linear regression, logistic regression assumes the residual error between linear predictions and true values to follow a Bernoulli distribution instead of a Gaussian (Murphy, 2012). Furthermore, as logistic regression aims to compute discrete rather than continuous values, the linear combination of an input vector $x$ computed by the regression function is passed through the function $\operatorname{sigm}(x)$ such that $0 < \operatorname{sigm}(x) < 1$ holds for all inputs $x$. This function is defined as

$\operatorname{sigm}(x) = \frac{1}{1 + \exp(-x)}$

and is called the sigmoid (or logistic) function; its inverse is the logit function.
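A minimal sketch, assuming scikit-learn; the sigmoid is written out explicitly for illustration, and the data are placeholders.

```python
# Sketch: the sigmoid function and a logistic-regression classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigm(x):
    """Sigmoid: maps any real-valued input into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(3)
X = rng.normal(size=(800, 92))    # placeholder feature matrix
y = rng.integers(0, 2, size=800)  # placeholder labels

clf = LogisticRegression(max_iter=1000).fit(X, y)
# predict_proba applies the sigmoid to the learned linear combination
print(clf.predict_proba(X[:3]))
```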
Table B6. Results for the logistic regression classification algorithm for positive reviews (5-fold cross-validation).

                                    Truthful               Deceptive
Feature set                       P      R      F1       P      R      F1      Acc.
NER                             71.59  61.25  65.84    66.01  75.50  70.30    68.25
Most frequent NER               71.93  56.31  62.88    64.19  77.93  70.19    67.00
LIWC                            79.69  78.27  78.91    78.60  80.00  79.24    79.12
LIWC + NER                      80.75  77.24  78.86    78.21  81.63  79.80    79.38
LIWC + Most frequent NER        81.52  77.14  79.21    78.41  82.64  80.42    79.88*
LIWC + NER + Most frequent NER  80.87  77.01  78.82    78.05  81.82  79.84    79.38

Note. P = precision; R = recall; F1 = F1-measure; Acc. = overall accuracy. ASS = average sentence specificity. Highest accuracy marked with an asterisk.
Table B7. Results for the logistic regression classification algorithm for negative reviews (5-fold cross-validation).

                                    Truthful               Deceptive
Feature set                       P      R      F1       P      R      F1      Acc.
NER                             65.95  57.92  61.49    62.67  70.31  66.13    64.00
Most frequent NER               64.68  53.62  58.49    60.60  71.01  65.30    62.25
LIWC                            76.73  76.18  76.41    76.50  77.09  76.74    76.63
LIWC + NER                      77.37  78.21  77.68    78.06  77.02  77.42    77.62*
LIWC + Most frequent NER        77.99  77.01  77.40    77.39  78.27  77.73    77.62*
LIWC + NER + Most frequent NER  77.36  77.94  77.52    77.87  77.00  77.30    77.50

Note. P = precision; R = recall; F1 = F1-measure; Acc. = overall accuracy. ASS = average sentence specificity. Highest accuracy marked with an asterisk.
Table B8. Results for the logistic regression classification algorithm for positive train and negative test reviews (5-fold cross-validation).

                                    Truthful               Deceptive
Feature set                       P      R      F1       P      R      F1      Acc.
NER                             62.06  62.45  62.03    63.27  63.15  63.01    63.25*
Most frequent NER               60.37  61.89  60.74    61.62  60.50  60.74    61.25
LIWC                            57.32  91.89  69.99    79.39  32.32  45.56    61.62
LIWC + NER                      58.02  90.79  70.20    78.75  35.04  48.16    62.50
LIWC + Most frequent NER        58.09  91.55  70.49    80.31  34.73  48.17    62.75
LIWC + NER + Most frequent NER  57.96  90.58  70.09    78.31  35.04  48.06    62.38

Note. P = precision; R = recall; F1 = F1-measure; Acc. = overall accuracy. ASS = average sentence specificity. Highest accuracy marked with an asterisk.
Table B9. Results for the logistic regression classification algorithm for negative train and positive test reviews (5-fold cross-validation).

                                    Truthful               Deceptive
Feature set                       P      R      F1       P      R      F1      Acc.
NER                             71.62  52.96  60.50    62.72  79.80  69.84    66.25
Most frequent NER               69.11  34.16  45.53    56.59  85.87  67.86    60.13
LIWC                            65.90  83.00  73.22    77.83  58.10  66.33    70.88
LIWC + NER                      65.92  83.90  73.60    77.64  57.37  65.89    70.88
LIWC + Most frequent NER        66.33  85.26  74.39    79.71  57.57  66.70    71.75*
LIWC + NER + Most frequent NER  66.12  83.69  73.65    77.52  57.84  66.15    71.00

Note. P = precision; R = recall; F1 = F1-measure; Acc. = overall accuracy. ASS = average sentence specificity. Highest accuracy marked with an asterisk.