Computer generated performance appraisals

Computer generated performance appraisals versus human generated performance
appraisals:
The effects of positive and negative evaluations on emotions
Kars Wijnhoven
ANR 295543
Bachelor’s Thesis
Communication- and information sciences
Specialisation Business communication and Digital Media
Faculty of Humanities
Tilburg University, Tilburg
Supervisor: drs. H.A.F.J. van der Kaa
Second reader: dr. S. Wubben
June 2014
Table of contents
Abstract
1. Introduction
1.1 Natural Language Generation
1.2 Performance Appraisals
1.3 Feedback and Emotions
2. Method
2.1 Participants
2.2 Materials
2.2.1 Texts
2.2.2 Scale
2.3 Design and Procedure
2.4 Pre-test
2.5 Data Analysis
3. Results
3.1 Negative Conditions
3.2 Positive Conditions
3.3 Emotions
3.3.1 Negative Conditions II
3.3.2 Positive Conditions II
3.4 Gender
3.5 Work Experience
3.6 Previous Job Evaluation(s)
3.7 Age
3.8 Education
4. Discussion
4.1 Hypotheses
4.2 Other Variables
4.3 Research Design
5. Conclusion and Recommendations
References
Appendices
Abstract
Performance appraisals can convey bad news to an employee and evoke negative emotions
with possibly detrimental effects. A new means of employee evaluation, in which the
evaluation is generated by a Natural Language Generation (NLG) system, is introduced. This
study examined the effects of two types of performance appraisals – computer (NLG) and
human generated – on employees’ emotions. Students (N = 115) reported the intensity of
seven emotions (fourteen items) after receiving either a positive or negative computer
generated evaluation or a positive or negative human generated evaluation. Following
negative evaluations, results showed that computer and human generated evaluations evoke
the same intensity of thirteen of the fourteen emotion-items. Following positive evaluations,
results showed that human generated evaluations evoke a higher intensity of four positive
emotion-items, as well as a lower intensity of four negative emotion-items. These results do
not directly support the use of NLG systems in the domain of performance appraisals.
However, some supportive evidence is presented and recommendations for further research
are given.
Keywords: Emotions, Natural Language Generation, Performance appraisals.
1. Introduction
“One third of employees not satisfied with their performance appraisal” (Oving, 2014). This
headline appeared above a recent news article on a survey by NationaleVacaturebank
among more than a thousand employees. According to the survey, 20% of the employees thought their boss
acted extremely unpleasantly during the appraisal. Moreover, 33% of the employees
thought their boss criticized certain aspects that they had never been informed about before.
Most importantly, the survey showed that for 25% of the employees such a performance appraisal is a reason to actively
look for a new job. These percentages make clear that
performance appraisals can elicit negative emotions in employees, even to the point that
they consider resigning. These effects, which can be detrimental for organizations, are of
course most evident when employees receive negative evaluations and less so when they
receive positive evaluations. Is there a way to prevent or minimize these negative feelings and
emotions while preserving the benefits of performance appraisals?
In this paper a new idea for performance appraisal on a more frequent
basis is proposed. The idea includes a concept for a new natural language generation (NLG)
system. This hypothetical NLG system translates data on the performance of employees into a
natural language text. The input of the system consists of ratings given by evaluators or superiors
on different criteria of the employee’s performance. The output is an automatically
generated text that describes the performance of the employee on a number of unfixed
criteria. The automatically generated text enables employees to view their progress, performance
and behaviour and subsequently change their behaviour and performance. This new means of
employee evaluation, which allows for more frequent evaluations, might reduce the number
of employees who think their boss criticized things they had never been informed about
before. As a result, even the share of employees (25%) who actively look for a new job might be
reduced.
Before such a system can be taken into serious consideration it has to be noted that
software systems are expensive to build, and developers need to demonstrate that developing
such a system is a better solution than hiring and training someone to manually write the
documents the NLG system is supposed to produce (Dale & Reiter, 1997). Therefore, the aim
of this research is not to elaborate on the technical details of the NLG system, but to first test
the usefulness of such a system. The main aim of this research is, taking into account the
negative emotions performance appraisals can evoke in employees, to measure the potentially
negative emotional effects of such a computer generated evaluation-text compared to a human
generated evaluation-text. If the results of this research show that computer generated
evaluations have no significant advantage over human generated evaluations this might
indicate that such an NLG system is not useful in the domain of performance appraisals.
However, if it appears that computer generated evaluations elicit less strong negative
emotions than human generated evaluations this might be a first step towards the use of NLG
systems to enhance human resource processes like employee evaluation.
RQ: What are the effects of computer generated text-evaluations on the emotions of
employees compared to human generated text-evaluations?
1.1 Natural language generation
Natural language generation (NLG) is the subfield of artificial intelligence and computational
linguistics that is concerned with the construction of computer systems that can produce
understandable texts in English or other human languages from some underlying non-linguistic representation of information (Dale & Reiter, 1997). The input of an NLG system
usually consists of data in the form of numbers. The output, on the other hand, consists of
sentences in the form of a text. In other words, an NLG system can be seen as a translator that
converts a computer based representation into a natural language representation.
An example of an NLG system is the weather report system FoG (Bourbeau,
Carcagno, Goldberg, Kittredge & Polguere, 1990). This system was built to produce textual
weather reports in English and French. The input of this NLG system consists of graphical or
numerical weather depictions. It has been in use by the Canadian Weather Service since 1992. The
weather report system SUMTIME-MOUSAM is an NLG system that generates textual
weather forecasts from numerical weather prediction data (Sripada, Reiter & Hawizy, 2002).
The forecasts are marine forecasts for offshore oil rigs. Users were even found to prefer
SUMTIME-MOUSAM’s texts to human-written texts, in part because of better word
choice (Reiter, Sripada, Hunter, Yu & Davy, 2005). The authors stated that it might have been the
first time that an evaluation had shown NLG texts to be better than human-authored texts.
Another example of a natural language generation system is STOP. This NLG system did not
produce weather forecasts. Rather, it was designed to serve a more societal goal. This system
generated smoking cessation letters based on a questionnaire filled in by the user (Reiter, Robertson &
Osman, 2003). Even though this particular system did not in fact fulfil its goals, since
recipients of a non-tailored letter were as likely to stop smoking as recipients of a tailored
letter, it gives an indication of the potential of NLG systems. More recently, NLG
systems have been created that can summarize medical data and serve as effective
decision-support aids for medical professionals (Portet, Reiter, Gatt, Hunter, Sripada, Freer &
Sykes, 2009). These examples show some of the ways in which NLG systems are already
being used and show the possibilities of natural language generation in the future.
The NLG system proposed in this research generates a text document based on data
that represents the employee’s performance on a number of unfixed criteria. This means that
employees receive written feedback on their behaviour and functioning. Prue and Fairbank
(1981) identify a number of advantages of written feedback over for example verbal feedback.
Firstly, they state that written feedback provides a concrete product which can facilitate a
longitudinal assessment of performance. Furthermore, they state that written feedback gives
employees the option to display the feedback in a more public manner. Finally, they state that
feedback delivered via written evaluations can be easily monitored by the manager. This
research will take into account the consequences of computer generated evaluation for
employees. However, employers could also benefit from the use of NLG systems. Murphy
and Margulies (2004) state that managers may be more comfortable with numerical or scale
rankings. In addition, they state that quantitative measures are sometimes easier to defend
against legal challenges than qualitative appraisals.
Even if it appears that an NLG system that generates employee evaluations may be
beneficial for employee and/or employer, the introduction of such a system will not be
something for the near future. As already stated, building and introducing new software
systems is expensive (Dale & Reiter, 1997). Furthermore, before such an NLG system is
able to produce adequate texts, numerous tasks and steps have to be carried out and included.
Building a system that includes these tasks and steps costs considerable time and money. Without
going into too much depth on the technical details of an NLG system, since that is not the aim
of this research, it is important to explain how a system receives its input language and to sketch
a global view of the functioning and tasks of such a system.
According to Dale and Reiter (1997) one of the first tasks when building or designing
an NLG system is to create or find an initial corpus of human-authored texts. Such a corpus
can be created based on real letters or documents written in the past. An NLG system
generating employee evaluations could, for example, take into account previous human
written employee evaluations to create a corpus. It is important that such a corpus includes
positive, negative and neutral evaluations, as Dale and Reiter (1997) state that a corpus in
general should include boundary and unusual cases as well as typical cases. Based on this
initial corpus, a target text corpus has to be created that includes the most suitable, useful and
favourable texts. This target text corpus comprises a set of texts which characterizes the actual
output of the NLG system (Dale & Reiter, 1997).
An NLG system usually comprises six basic kinds of activity that need to be carried
out to go from input data to an output text (Dale & Reiter, 1997). These six ‘tasks’ will be
briefly named and discussed. The first task is called content determination. This includes the
selection of what information is to be communicated in the output text. The second task, discourse planning, is
the process of organising the information or set of messages to convey. This task
determines, for example, the order in which the messages are conveyed. The third task, sentence
aggregation, is the process of grouping messages and includes, for example, the merging of
similar sentences to improve readability. Lexicalization is the fourth task and includes the
process of putting words to the concepts. It could for example cover the question whether the
word ‘mediocre’ or ‘average’ should be used to describe an employee’s performance. The
fifth task is referring expression generation and includes the selection of words or phrases to
identify domain entities. Dale and Reiter (1997) exemplify this by the use of the word ‘it’ to
refer to the domain entity ‘the Caledonian Express’ (i.e. a train). Furthermore, this task
includes decision making about pronouns and other types of anaphora. The sixth and final
task is linguistic realisation. It includes, taking into account the rules of grammar, the creation
of a text which is syntactically, morphologically and orthographically correct.
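The six tasks above can be illustrated with a small pipeline. The following is a minimal, hypothetical sketch and not the system proposed in this research: the 1 to 5 rating scale, the criterion names, the word choices in the lexicon and all function names are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Message:
    criterion: str  # hypothetical criterion name, e.g. "planning"
    rating: int     # assumed scale: 1 (poor) to 5 (excellent)


def determine_content(ratings: Dict[str, int]) -> List[Message]:
    """Task 1, content determination: drop ratings at the neutral midpoint."""
    return [Message(c, r) for c, r in ratings.items() if r != 3]


def plan_discourse(messages: List[Message]) -> List[Message]:
    """Task 2, discourse planning: order messages from best to worst rating."""
    return sorted(messages, key=lambda m: m.rating, reverse=True)


def aggregate(messages: List[Message]) -> List[List[Message]]:
    """Task 3, sentence aggregation: merge adjacent messages with equal ratings."""
    groups: List[List[Message]] = []
    for message in messages:
        if groups and groups[-1][0].rating == message.rating:
            groups[-1].append(message)
        else:
            groups.append([message])
    return groups


# Task 4, lexicalization: choose a word for each concept (assumed wording).
LEXICON = {1: "poor", 2: "below average", 4: "good", 5: "excellent"}


def refer(first_mention: bool) -> str:
    """Task 5, referring expressions: full phrase first, pronoun afterwards."""
    return "your performance" if first_mention else "it"


def join_names(names: List[str]) -> str:
    if len(names) == 1:
        return names[0]
    return ", ".join(names[:-1]) + " and " + names[-1]


def realise(group: List[Message], first_mention: bool) -> str:
    """Task 6, linguistic realisation: build a capitalised, punctuated sentence."""
    criteria = join_names([m.criterion for m in group])
    sentence = f"{refer(first_mention)} was {LEXICON[group[0].rating]} on {criteria}."
    return sentence[0].upper() + sentence[1:]


def generate(ratings: Dict[str, int]) -> str:
    groups = aggregate(plan_discourse(determine_content(ratings)))
    return " ".join(realise(g, i == 0) for i, g in enumerate(groups))


print(generate({"planning": 1, "teamwork": 5, "accuracy": 3, "communication": 5}))
```

With these example ratings the sketch prints "Your performance was excellent on teamwork and communication. It was poor on planning." A realistic system would need far richer rules at every step; the point here is only how the output of each task feeds the next.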
1.2 Performance Appraisals
Employee performance appraisals play an important role in today’s business world.
According to Peretz and Fried (2012) performance appraisal is a key human resource activity
in organizations. Furthermore, it is stated that performance evaluation has been demonstrated
to increase performance and effectiveness (Ferris, Mitchell, Canavan, Frink & Hopper, 1995).
Bernardin and Beatty (1984) have stated that performance appraisal encompasses the assessing
of human behaviour at work. According to Grote (2002), the appraisal is usually prepared by
the employee’s immediate supervisor. Furthermore, he states that the procedure typically
requires the supervisor to fill out a standardized assessment form that evaluates the individual
on several different dimensions and then discuss the results of the evaluation with the
employee.
Grote (2002) argues that performance appraisal is the most powerful
instrument that organizations have to mobilize the energy of every employee of the enterprise
toward the achievement of strategic goals. Furthermore, he states that, if used well,
performance appraisal can focus every person’s attention on the company’s mission, vision,
and values. There seems to be a consensus among scientists about the main aims of a
performance appraisal. According to Murphy and Margulies (2004) the most important aims
of performance appraisals are:
(1) Pinpointing specific behaviour or job performance that should be discontinued or
reinforced.
(2) Serving as an employee development and coaching tool.
(3) Providing a realistic assessment of an employee’s readiness for promotion.
(4) Serving as the basis for awarding merit pay.
While these general goals of performance appraisals seem universally agreed on among
scientists, the opposite can be said about the contents of a performance appraisal. The
literature provides numerous criteria or dimensions that could or should be included in a
performance appraisal, including among other things: performance, satisfaction, working
experience, behaviour and salary. Numerous scientific articles have been written on this subject.
This paper does not discuss which criteria should be used to evaluate
employees. The desired contents of an appraisal are arbitrary and
depend on, for example, job contents, type of organization, or individual or organizational
goals. It is up to managers and organizations to decide what is important to assess.
In contrast, the frequency of employee evaluations is something to be taken into
account if an employee evaluation NLG system is seriously considered. As the results of the
survey of NationaleVacaturebank (see 1. Introduction) show, a considerable number of
employees feel that they hear certain things for the first time in their performance appraisal.
Such an evaluation generally takes place in a one-year, or even two- or four-year cycle.
These appraisals are executed in such cycles because they are expensive and time-consuming.
Therefore, it will most likely not be feasible to evaluate employees on a more frequent basis
in the form of the modern face-to-face performance appraisal.
However, an NLG system that produces evaluation texts might overcome these
problems, since texts produced by NLG systems have been reported to be better and cheaper than texts
produced by human writers (Reiter, Sripada, Hunter, Yu & Davy, 2005). Moreover, providing
employees with more regular feedback gives them more information and an indication about
their functioning and behaviour. This longitudinal assessment can enable employees to
change their behaviour and functioning if necessary, and also prepare themselves for the
actual face-to-face performance appraisal with their supervisor. The possible contents of this
appraisal could be predicted based on the NLG evaluations and therefore the employee could
perceive possible negative information as less surprising and as having less emotional impact.
Such an NLG system should not be created to replace the present four-, two-, or one-year
cycle of face-to-face performance appraisals, since issues regarding, for example, salary are
usually discussed face to face. Rather, it should be viewed as a means
of extending or improving the current resources to achieve the best possible employee
evaluation.
1.3 Feedback and emotions
There is agreement among almost all scientists that feedback is vital for organizations to be
successful. London (2012) states that it is known from psychological research that people
need knowledge of results to accomplish performance goals and improve their performance
over time. Furthermore, it is clear from scientific research that feedback can affect the
emotional state of people. According to Liden and Mitchell (1985) supervisors’ behaviour can
elicit emotional reactions in subordinates. Ferris, Munyon, Basik and Buckley (2008) state
that performance evaluation is arguably an emotional experience as evaluations have
potentially significant ramifications for employee psychological well-being, social status and
the continuation of employment within the organization.
Belschak and Den Hartog (2009) also state that feedback has an impact on emotions
and subsequently on work attitudes and behavioural intentions. Moreover, they claim that
different types of feedback relate to different emotions. It is stated that providing positive
feedback will generally lead to positive emotions, such as pride and happiness, whereas
negative feedback will generally result in negative emotions, such as disappointment or guilt
(Belschak & Den Hartog, 2009; Lazarus, 1991). Bad events have been found to have greater
power than good ones, both in everyday events and in major life events (Baumeister, Bratslavsky,
Finkenauer, & Vohs, 2001). Moreover, bad emotions and bad feedback have been found to have
more impact than good ones (Baumeister et al., 2001).
However, it is not clear if these emotions will be present with the same intensity if an
evaluation is represented by a text that is generated by a supervisor or a text that is generated
by an NLG system instead of a face-to-face performance appraisal. These two text-conditions
are taken into account in this research and are compared regarding the emotional effects they
have on participants. It is hypothesized that in the human generated text-condition negative
evaluations elicit stronger negative emotions compared to the computer generated text-condition. This would mean that the detrimental effects of negative performance appraisals or
evaluations in general could be decreased by the use of computer generated language. Such a
result would support the idea of an NLG system that regularly generates employee
evaluations. This is hypothesized because participants might perceive an evaluation that is
generated by an NLG system as an independent means of evaluation, whereas in the human
generation condition participants might perceive the evaluation as a more dependent or even
biased means of evaluation because the text stems directly from a person. This is based on the
research of Levy and Williams (2004) who argue that performance appraisals take place in a
social context. They state that this context plays a major role in the effectiveness of the
appraisal process and how participants react to that process. They state that evaluators make
errors and can have biases towards employees. This possible existence of errors and biases
could do harm to the reliability of any evaluation with a direct link to supervisor or evaluator.
Furthermore, participants in the human generated text condition could be aware of the social
context of supervisor and subordinate and the possible biases and errors of supervisors. In
contrast, participants in the NLG condition may not perceive such a social context because of
the absence of a direct link to a supervisor. This absence of a direct link to a supervisor could
reduce the negative emotional impact on employees.
Furthermore, it is argued that regardless of an organization’s decision itself, fair
procedures will result in more positive attitudes (Korsgaard & Roberson, 1995). In other
words, they argue that procedural justice can stimulate positive attitudes towards decisions
that might otherwise be viewed negatively. Participants might perceive and associate the
computer generated text condition with more procedural justice, since there is no supervisor
involved in the process of the generation of sentences who could act or evaluate in an unfair
way or with certain interests. Based on the aforementioned, the following hypothesis is
proposed:
H1: Human generated evaluations have a greater impact on employees’ emotions
compared to computer generated evaluations.
It is clear that various forces that play a role after employees receive feedback cause different
emotional reactions. Since there are numerous human emotions known from and used in
scientific research, it is important to clearly identify emotions that are relevant with respect to
receiving negative and positive employee evaluations. In this research, emotions are taken
into account that are present and dominant when one receives such feedback according to
other research. This means emotions are measured that affect for example turnover intention,
citizenship and affective commitment. Relevant emotions are identified and a scale is
constructed that measures these emotions. This scale is discussed in the method section.
As previously mentioned, possible detrimental effects of employee evaluations are
usually a result of employees receiving negative evaluations rather than positive evaluations.
Therefore, evaluations with a negative message are most relevant to investigate. However, to
be able to interpret the relative impact of negative evaluations on employees’ emotions in the
two text-conditions, positive evaluations are also taken into account. It has already been
hypothesized that human generated evaluations have a greater impact on employees’ emotions
compared to computer generated evaluations. This first hypothesis can be specified to
concretize the direction of the effect. Therefore, it is hypothesized that this effect will be
visible after evaluations with a negative message. Moreover, it is expected that after
evaluations with a positive message the effect will not be visible.
H2: Negative evaluations will lead to significantly higher negative emotions in the
human generated text-condition compared to the computer generated text-condition.
H3: Positive evaluations will not lead to significantly higher positive emotions in the
human generated text-condition compared to the computer generated text-condition.
2. Method
2.1 Participants
The survey was completed by 115 participants (N male = 41, N female = 74). The study was
conducted amongst students in The Netherlands (mean age = 22.16). The majority of the
participants (N = 99) took part in, or had finished University education. Other participants’
educations included higher education HBO (N = 14), MBO (N = 1) and high school VWO (N
= 1). Table 1 shows the four conditions used in this research separately, together with the
distribution and characteristics of the participants. These characteristics include gender, age
and whether or not a participant had received a job evaluation before. The program
that was used to collect the results (see paragraph 2.5) distributed participants evenly over
the four conditions used in this research. However, because 20 participants started but did not
finish the survey, there is a small imbalance in this distribution.
Table 1
Conditions and Participants

Condition   N     M (N)   F (N)   Mean age (SD)   Job evaluation: Yes   Job evaluation: No
CP          29    8       21      22.00 (1.65)    15                    14
CN          32    14      18      22.44 (1.72)    11                    21
HP          28    9       19      22.00 (1.49)    17                    11
HN          26    10      16      22.15 (2.62)    7                     19
Total       115   41      74      22.16 (1.88)    50                    65

Note. CP stands for computer positive, CN stands for computer negative, HP stands for human positive, HN
stands for human negative. M stands for male, F stands for female.
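An even distribution of participants over four conditions, as described above, can be obtained with block randomization. The following is a minimal sketch under stated assumptions: the text does not say how the survey program balanced the conditions, so the blocking scheme, the function name assign_conditions and the seed parameter are all hypothetical.

```python
import random
from itertools import product

# The four conditions of Table 1: source (Computer/Human) x valence (Positive/Negative).
CONDITIONS = ["".join(pair) for pair in product("CH", "PN")]  # CP, CN, HP, HN


def assign_conditions(n_participants, seed=None):
    """Assign conditions in shuffled blocks of four (assumed scheme).

    Every block contains each condition exactly once, so group sizes
    never differ by more than one among participants who start.
    """
    rng = random.Random(seed)
    assignment = []
    while len(assignment) < n_participants:
        block = CONDITIONS[:]
        rng.shuffle(block)
        assignment.extend(block)
    return assignment[:n_participants]


# 115 completed surveys plus 20 unfinished ones gives 135 starters.
starters = assign_conditions(135, seed=2014)
print({condition: starters.count(condition) for condition in CONDITIONS})
```

Balance of this kind holds only for participants who start the survey. Dropout after assignment is not balanced, which is consistent with the unequal group sizes in Table 1.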
2.2 Materials
2.2.1 Texts
Four texts were used in this research (See appendices A, B, C & D). These texts were
designed to resemble a real life employee evaluation. The texts included two texts with
negative critiques, meaning that the employee did not function sufficiently, and two texts with
positive critiques, meaning that the employee did function sufficiently. The two negative texts
were identical. However, one text was presented as an evaluation written by the participants’
boss and the other text was manipulated by presenting it as an evaluation written by a
computer. The two positive texts were also identical and manipulated by presenting one text
as if it was written by the participants’ boss and the other text as if it was written by a computer.
The evaluation texts were constructed based on an actual competencies assessment of
a policy maker at a Dutch university, including function goal, tasks, assessment competencies
of the job and an actual assessment text. This competencies assessment was designed by the
VSNU (‘Vereniging van Nederlandse Universiteiten’, English: ‘Association of Dutch
Universities’). By taking this assessment into account it was attempted to make the evaluation
texts as close to a real-life evaluation as possible. Furthermore, since the participants of the
research included university students, it was attempted to make the evaluation situation as
imaginable and realistic as possible by resembling an evaluation of an actual university
employee.
2.2.2 Scale
Participants’ emotions were measured using a scale made for this particular research (see
appendix E). The scale was designed to include emotions that are relevant when employees
receive feedback on their performance and behaviour. Based on other research seven relevant
and suitable emotions were selected: Happiness, Anger, Disappointment, Anxiety, Pride, Guilt
and Shame. Note that these include five ‘negative’ emotions (anger, disappointment, anxiety,
guilt and shame) and two ‘positive’ emotions (happiness & pride).
Happiness (i.e. ‘happy’), anger (i.e. ‘angry’) and disappointment (i.e. ‘disappointing’)
are basic emotions and drawn from Belschak and Den Hartog (2009) who showed the
importance of employee’s emotions after receiving supervisor feedback. Furthermore, anger
and disappointment are referred to in the ‘Job Emotion Scale’, also ‘JES’ (Fisher, 1998). Both
emotions were negatively related to overall job satisfaction. In turn, job satisfaction appears to
relate negatively with turnover intentions (Irvine & Evans, 1995; Mobley, 1977; Tnay,
Othman, Siong, & Lim, 2013). In addition, disappointment relates positively with intentions
to leave one’s job (Grandey, Tam & Brauburger, 2002). Anxiety (i.e. ‘anxious’) was drawn
from Côté and Morgan (2002) who found that this emotion is positively related with
intentions to quit. The last three emotions (pride, guilt and shame) are drawn from Belschak
and Den Hartog (2009) who refer to them as self-conscious emotions, relating to our sense of
self and our consciousness of others' reactions to us. Pride (i.e. ‘proud’) is also referred to in
the ‘JES’, relating positively with overall job satisfaction. Guilt (i.e. ‘guilty’) and shame (i.e.
‘ashamed’) were regarded as potential negative reactions to supervisor feedback (Belschak &
Den Hartog, 2009).
These seven emotions appeared to be relevant and were therefore included in the scale.
Every emotion was represented by two items in the actual scale, meaning the scale consisted
of 14 items. The two items per emotion included two different synonyms for that particular
emotion. The synonyms, taken from ‘thesaurus.com’, were as follows: happiness (joyful,
pleased), anger (furious, enraged), disappointment (discouraged, disillusioned), anxiety
(uncertain, concerned), pride (respected, worthy), guilt (responsible, convicted) and shame
(regretful, apologetic).
The order in which the different items occurred in the scale was randomized once with a
random sequence generator and was the same for all participants. Participants were asked to
indicate how intensely they experienced the different emotions as a reaction to reading the
evaluation belonging to their condition on 7-point scales ranging from 1 (extremely weak) to
7 (extremely strong). These answer possibilities were drawn from Belschak & Den Hartog
(2009) who also measured the intensity of emotions in their research.
Since each emotion in the scale is represented by two synonyms it is important to
determine whether or not these two synonyms (per emotion) are consistent and actually
measure the same construct. Table 2 shows the correlations of the synonyms for every
emotion. The correlation of .92 shows for example that when one feels joyful after a positive
evaluation, one is very likely to also feel pleased. Moreover, it could indicate that when,
following a negative evaluation, one does not feel joyful, one is also very likely to not feel
pleased. Note that the correlation between responsible and convicted is negative (r = -.23).
This indicates that there is no consistency between the scores on the two items. Thus, if one
feels responsible after receiving a negative evaluation one does not necessarily feel convicted.
On the contrary, the negative correlation indicates that one is likely to not feel convicted if
one feels responsible, although this correlation is not strong. Perhaps this could be explained
by the item responsible being open to more than one interpretation. One can feel responsible after receiving a
positive evaluation, but one can also feel responsible after a negative evaluation. The other
items are not ambiguous in this way.
Table 2
Correlation between Synonyms (per Emotion)

Emotion          Synonym 1     Synonym 2       Correlation (r)
Happiness        Joyful        Pleased         .92
Anger            Furious       Enraged         .72
Disappointment   Discouraged   Disillusioned   .72
Anxiety          Uncertain     Concerned       .70
Pride            Respected     Worthy          .89
Guilt            Responsible   Convicted       -.23
Shame            Regretful     Apologetic      .54
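To make the correlations in Table 2 concrete, the sketch below computes a correlation between two item columns. It assumes the reported r is the Pearson product-moment correlation, which the text does not state explicitly. The function name pearson_r and the sample responses are hypothetical and are not the study data.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of item scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from the means.
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant item")
    return covariation / (spread_x * spread_y)


# Made-up 7-point responses of five participants to two synonym items.
joyful = [6, 7, 2, 1, 5]
pleased = [6, 6, 1, 2, 5]
print(round(pearson_r(joyful, pleased), 2))
```

A value close to 1 means that participants who score high on one synonym also score high on the other. A negative value, as found for responsible and convicted, means that the two items tend to move in opposite directions.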
The internal consistency of the items in the scale was assessed with
Cronbach’s alpha. To be able to analyse the consistency of the complete scale, the positive
items were reversed for this measure. This involved reversing the items joyful, pleased,
respected and worthy into, respectively, not joyful, not pleased, not respected and not worthy.
These four computed items plus the ten other (negative) items were analysed together, yielding a
Cronbach’s alpha of α = .94. The Cronbach’s alpha of the scale increased to α = .96 when the item
responsible was deleted from the scale.
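The reliability analysis above can be written out as a short computation. For k items, Cronbach’s alpha equals k / (k - 1) multiplied by (1 - the sum of the item variances divided by the variance of the participants’ total scores). The sketch below is illustrative only. It assumes that reversing an item on the 7-point scale means replacing a score s by 8 - s, which the text does not specify. The function names and sample responses are hypothetical and are not the study data.

```python
from statistics import variance

SCALE_MIN, SCALE_MAX = 1, 7  # the 7-point intensity scale


def reverse_score(score):
    """Reverse a score on the scale (assumed rule: 1 <-> 7, 2 <-> 6, ...)."""
    return SCALE_MIN + SCALE_MAX - score


def cronbach_alpha(items):
    """items maps an item name to its list of scores, one per participant."""
    columns = list(items.values())
    k = len(columns)
    if k < 2:
        raise ValueError("Cronbach's alpha needs at least two items")
    totals = [sum(row) for row in zip(*columns)]  # total score per participant
    total_variance = variance(totals)
    if total_variance == 0:
        raise ValueError("total scores have zero variance")
    item_variance_sum = sum(variance(column) for column in columns)
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


def alpha_if_item_deleted(items):
    """Alpha of the remaining items after leaving out each item in turn."""
    return {
        name: cronbach_alpha({n: s for n, s in items.items() if n != name})
        for name in items
    }


# Made-up responses of five participants; "joyful" is a positive item and is
# reversed before it is combined with the negative items.
responses = {
    "furious": [6, 5, 2, 1, 4],
    "enraged": [6, 4, 2, 2, 5],
    "joyful": [reverse_score(s) for s in [2, 3, 6, 7, 3]],
    "responsible": [3, 5, 4, 6, 2],
}
print(round(cronbach_alpha(responses), 2))
print({name: round(a, 2) for name, a in alpha_if_item_deleted(responses).items()})
```

An item whose removal raises alpha, as reported for responsible, is one that does not vary together with the rest of the scale.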
Furthermore, data analysis showed that two subscales could be identified. The first
subscale included the positive emotions pride and happiness, represented respectively by the
items respected and worthy, and joyful and pleased (α = .97). The Cronbach’s alpha of this
positive job emotions subscale could not be increased by deleting one of the items. Table 3
shows the inter-item correlation matrix of the four items belonging to the positive
job emotions subscale. It shows, for example, the correlation between respected and joyful (r =
.86). This correlation indicates that when one feels respected after an evaluation one is also
very likely to feel joyful.
Table 3
Inter-Item Correlation Matrix – Positive Items

Items       Worthy   Respected   Pleased   Joyful
Worthy      1.00     .89         .92       .88
Respected   .89      1.00        .91       .86
Pleased     .92      .91         1.00      .92
Joyful      .88      .86         .92       1.00
The second subscale that could be identified included the negative emotions anger (furious,
enraged), disappointment (discouraged, disillusioned), anxiety (uncertain, concerned), guilt
(responsible, convicted) and shame (regretful, apologetic) (α = .88). The Cronbach’s alpha of
this ‘negative job emotions’ subscale could be increased to .92 by deleting the responsible
item. Table 4 shows the item-item correlation matrix of the ten items belonging to the
‘negative job emotions’ subscale. It shows, for example, that uncertain and furious have a
correlation of .71. Again, note that responsible correlates negatively with eight other items.
This indicates that participants most likely thought of responsible as a positive item, contrary
to the intended use of the item.
Table 4
Item-Item Correlation Matrix – Negative Items
Items    1      2      3      4      5      6      7      8      9      10
1        1.00   .44    -.28   .58    .60    .57    .72    .69    .48    .46
2        .44    1.00   .08    .38    .25    .40    .22    .35    .54    .27
3        -.28   .08    1.00   -.10   -.46   -.31   -.52   -.44   -.10   -.23
4        .58    .38    -.10   1.00   .62    .70    .60    .63    .58    .47
5        .60    .25    -.46   .62    1.00   .83    .71    .72    .58    .47
6        .57    .40    -.31   .70    .83    1.00   .71    .75    .69    .45
7        .72    .22    -.52   .60    .71    .71    1.00   .86    .56    .50
8        .69    .35    -.44   .63    .72    .75    .86    1.00   .64    .47
9        .48    .54    -.10   .58    .58    .69    .56    .64    1.00   .38
10       .46    .27    -.23   .47    .47    .45    .50    .47    .38    1.00
Note: The numbers represent the ten items/synonyms as follows: 1 = Enraged, 2 = Apologetic, 3 = Responsible, 4 =
Concerned, 5 = Discouraged, 6 = Uncertain, 7 = Furious, 8 = Disillusioned, 9 = Regretful, 10 = Convicted.
2.3 Design and Procedure
The experiment used a 2 (positive vs. negative feedback) × 2 (human generated text vs.
computer generated text) between-group design, meaning that participants were randomly
assigned to one of the four text-conditions. After being placed in one of the four conditions,
participants were presented with a case in which they were asked to imagine being an
employee of a large organisation, for which they had been working for some time (see
Appendices F and G). Next, they received the text belonging to their condition, which
contained the evaluation of their performance in the organisation. After reading the text,
participants were asked to complete the fourteen items and additional questions. These
additional questions included age, gender and education. Furthermore, a question was
included asking whether they had worked, or were working at that time, for an organisation
or company. Finally, participants were asked whether they had ever received a job evaluation
before.
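The random placement of participants in the 2 × 2 design can be sketched as follows. How the randomisation was actually implemented is not described here, so the sketch assumes simple randomisation (an independent random draw per participant), which, like the real data, does not guarantee equally sized groups. The function name assign and the seed are hypothetical.

```python
import random
from itertools import product

FEEDBACK = ("positive", "negative")
SOURCE = ("human", "computer")
CONDITIONS = list(product(FEEDBACK, SOURCE))  # the four text-conditions


def assign(participant_ids, seed=None):
    """Draw one of the four conditions at random for every participant."""
    rng = random.Random(seed)
    return {pid: rng.choice(CONDITIONS) for pid in participant_ids}


# 115 participants, numbered 1 to 115; the seed only makes the example repeatable.
assignment = assign(range(1, 116), seed=2014)

group_sizes = {condition: 0 for condition in CONDITIONS}
for condition in assignment.values():
    group_sizes[condition] += 1
print(group_sizes)
```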
2.4 Pre-test
A pre-test (N = 5) conducted among students of Tilburg University yielded some valuable
information. All participants completed the reading of the evaluation and the questionnaire in
6 minutes or less. After completing the questionnaire participants were asked to indicate parts
or words that were clear and parts or words that were unclear or confusing. Remarks included
the further clarification of the computer condition parts by showing certain words, such as
‘computer written’, in bold, adding certain words, or repeating certain words. For example,
one participant proposed to move the last sentence of the computer conditions (‘this
evaluation is automatically generated by our HR evaluation systems’) up and show it in bold.
Furthermore, the evaluations contained the words ‘in your persuasiveness’ (“in your
persuasiveness you do not provide enough arguments…”). Some participants found this
difficult to understand in this context or thought others would find it difficult to understand.
These three words were therefore excluded from the actual research material.
Prior to the pre-test two emotions (pride and guilt) had three possible synonyms. Pride
was represented by honoured, respected and worthy. Guilt was represented by responsible,
convicted and accused. After the pre-test and some remarks of the participants it was decided
to implement respected and worthy, as well as responsible and convicted in the questionnaire.
One participant stated that honoured felt strange in the context of employee evaluations.
Moreover, two participants found accused not appropriate in this context.
On being asked whether there was anything they were unsure about, two of the five
participants referred to the answer options of the items. These participants stated that
‘extremely weak’ implies that there is still some intensity of that particular emotion in the
participant. After discussion, they thought it would be clearer if ‘not at all’ were added
after ‘extremely weak’ and ‘very much’ were added after ‘extremely strong’. This would
further clarify that ‘weak’ means a lower intensity and ‘strong’ means a higher intensity.
Following these comments it was decided to add both ‘not at all’ and ‘very much’ in
parentheses to avoid uncertainty among participants.
2.5 Data Analysis
The data in this research was collected with ‘Qualtrics’, a web-based survey service. This
program was also used to create the survey, including the items and additional questions.
After collecting, the data was analysed using the statistical analysis program ‘SPSS’
(Statistical Package for the Social Sciences). The Cronbach’s Alpha of the scale and the
subscales was measured by carrying out reliability analyses of the relevant data. Furthermore,
independent t-tests were performed to analyse the scores on the items in the different
conditions.
3. Results
In this chapter the results of the research are presented. Taking into account the three
hypotheses, a distinction is made between the positive and negative text-conditions. First, the
item scores in the negative conditions are compared and presented. Secondly, the item scores
in the positive conditions are compared and presented. Thirdly, the item scores are added up
and divided by 2 to represent the seven emotions, and compared in all four conditions.
Finally, five variables are taken into account and tested. These variables include gender, work
experience, whether or not a participant had already received a job evaluation before, age and
education.
3.1 Negative Conditions
Table 5 shows the mean item scores and standard deviations in the computer negative
condition and the human negative condition. The item uncertain showed a significant
difference between the computer negative (M=5.28, SD=1.25) and human negative (M=5.96,
SD=0.87) conditions; t(56)=-2.35, p<.05, with more uncertainty in the human negative
text-condition. This indicates that negative evaluations generated by a human (i.e. a
supervisor) arouse more uncertainty among receivers of the evaluation than negative
evaluations generated by a computer. The scores on the other items did not differ significantly
between the computer negative and human negative conditions.
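The reported t-value for the item uncertain can be approximated from summary statistics alone. The sketch below is an illustration, not the SPSS analysis itself: it uses the means and standard deviations for uncertain from Table 5 and assumes group sizes of 32 (computer negative) and 26 (human negative). These sizes are an inference from the 56 degrees of freedom and from the 7 + 19 participants of the human negative condition mentioned in Paragraph 3.6. The function name pooled_t is hypothetical.

```python
from math import sqrt


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Independent-samples t statistic with pooled variance, plus its degrees of freedom."""
    df = n1 + n2 - 2
    pooled_variance = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    standard_error = sqrt(pooled_variance * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / standard_error, df


# Item uncertain: computer negative vs. human negative (group sizes are assumed).
t, df = pooled_t(5.28, 1.25, 32, 5.96, 0.87, 26)
print(f"t({df}) = {t:.2f}")  # roughly -2.35, in line with the reported value
```

Because the input means and standard deviations are rounded to two decimals, small deviations from the values reported by SPSS are to be expected when this is repeated for other items.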
Table 5
Comparison of the Item Scores (Mean and SD) in the
Computer Negative and Human Negative Conditions.
                 Condition
Items            CN             HN
Joyful           1.63 (.98)     1.46 (.71)
Pleased          1.88 (1.07)    1.42 (.70)
Furious          4.69 (1.15)    4.42 (1.90)
Enraged          4.81 (1.28)    4.46 (1.61)
Discouraged      4.97 (1.60)    5.50 (1.33)
Disillusioned    4.59 (1.16)    5.04 (1.51)
Uncertain        5.28 (1.25)    5.96 (.87)*
Concerned        5.75 (1.11)    5.62 (1.65)
Respected        2.19 (1.26)    1.88 (1.03)
Worthy           2.44 (1.43)    1.92 (.80)
Responsible      4.22 (1.41)    4.04 (1.99)
Convicted        4.53 (1.24)    4.42 (1.47)
Regretful        4.25 (1.46)    4.12 (1.40)
Apologetic       3.97 (1.49)    3.58 (1.60)
Note: If there is a significant difference the highest score is marked with an asterisk (*).
CN stands for computer negative, HN stands for human negative.
3.2 Positive Conditions
Data analysis showed that the scores of 9 items differed significantly between the computer
positive condition and the human positive condition. Table 6 shows the mean item scores and
standard deviations in the computer positive condition and the human positive condition.
Joyful was higher in the human positive condition (M=5.61, SD=0.92) compared to the
computer positive condition (M=4.55, SD=1.84); t(55)=-2.72, p<.01. This indicates that in the
human positive condition one is likely to experience more joy following an evaluation
compared to the computer positive condition. Pleased was higher in the human positive
condition (M=6.14, SD=0.71) compared to the computer positive condition (M=4.93,
SD=1.65); t(55)=-3.59, p<.005. Similar to the item joyful, participants felt more pleased in the
human positive condition compared to the computer positive condition. Furious was higher in
the computer positive condition (M=2.28, SD=1.44) compared to the human positive
condition (M=1.50, SD=0.64); t(55)=2.62, p<.05. Even though the difference between the two
conditions on the item furious is significant, its practical meaning seems of minor importance
given the low intensity of furious in both conditions. Discouraged was higher in the computer
positive condition (M=3.07, SD=1.58) compared to the human positive condition (M=1.86,
SD=0.89); t(55)=3.55, p<.005. The meaning of this significant difference is also debatable
taking into account the low scores on the item discouraged in both conditions. Disillusioned
was higher in the computer positive condition (M=2.86, SD=1.57) compared to the human
positive condition (M=1.57, SD=0.69); t(55)=3.80, p<.001. Again, this difference seems of
minor importance considering the low scores in both conditions. Uncertain was higher in the
computer positive condition (M=2.90, SD=1.40) compared to the human positive condition
(M=2.11, SD=0.99); t(55)=2.45, p<.05. Given the low scores on the item uncertain in both
conditions, this difference also seems to carry little meaning. Respected was higher in the
human positive condition (M=5.68, SD=0.91) compared to the computer positive condition
(M=4.55, SD=1.74); t(55)=-3.05, p<.005. This result indicates that one is likely to feel more
respected after a human generated positive evaluation compared to a computer generated
positive evaluation. Worthy was higher in the human positive condition (M=5.75, SD=0.84)
compared to the computer positive condition (M=4.86, SD=1.66); t(55)=-2.53, p<.05.
Similar to the item respected, this indicates that one is likely to feel more worthy following a
human generated evaluation compared to a computer generated evaluation. Finally,
responsible was higher in the human positive condition (M=5.14, SD=0.89) compared to the
computer positive condition (M=4.21, SD=1.50); t(55)=-2.86, p<.01. However, as the scale
analyses already indicated, the scores on the item responsible are not consistent, meaning that
the results on this item should be treated with caution or not be taken into account.
Table 6
Comparison of the item scores (mean and SD) in the
computer positive and human positive text-conditions.
                 Condition
Items            CP              HP
Joyful           4.55 (1.84)     5.61 (.92)*
Pleased          4.93 (1.65)     6.14 (.71)*
Furious          2.28 (1.44)*    1.50 (.64)
Enraged          2.66 (1.47)     2.39 (1.79)
Discouraged      3.07 (1.58)*    1.86 (.89)
Disillusioned    2.86 (1.57)*    1.57 (.69)
Uncertain        2.90 (1.40)*    2.11 (.99)
Concerned        2.93 (1.49)     2.61 (1.55)
Respected        4.55 (1.74)     5.68 (.91)*
Worthy           4.86 (1.66)     5.75 (.84)*
Responsible      4.21 (1.50)     5.14 (.89)*
Convicted        3.31 (1.33)     3.00 (1.66)
Regretful        2.28 (1.33)     1.75 (.93)
Apologetic       2.90 (1.57)     2.36 (1.55)
Note: If there is a significant difference the highest score is marked with an
asterisk (*). CP stands for computer positive, HP stands for human positive.
3.3 Emotions
In the previous sections each synonym was presented on its own and compared in the negative
conditions as well as the positive conditions. In this section the same comparisons are carried
out. However, instead of the fourteen synonyms, the seven emotions are used in this section.
Note that a score on an emotion is computed by adding up the scores on the two
synonyms/items belonging to that emotion and dividing that number by 2. The emotion
happiness for example is computed by adding up the scores on the items pleased and joyful
and dividing the outcome by 2.
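The computation of the emotion scores can be sketched as follows. The pairing of items and emotions follows Table 2; the participant scores are made up and the function name emotion_scores is hypothetical.

```python
# The two items (synonyms) that belong to each emotion, as listed in Table 2.
EMOTION_ITEMS = {
    "happiness": ("joyful", "pleased"),
    "anger": ("furious", "enraged"),
    "disappointment": ("discouraged", "disillusioned"),
    "anxiety": ("uncertain", "concerned"),
    "pride": ("respected", "worthy"),
    "guilt": ("responsible", "convicted"),
    "shame": ("regretful", "apologetic"),
}


def emotion_scores(item_scores):
    """Add up the scores on the two items of each emotion and divide by 2."""
    return {
        emotion: (item_scores[first] + item_scores[second]) / 2
        for emotion, (first, second) in EMOTION_ITEMS.items()
    }


# Made-up item scores of one hypothetical participant.
participant = {
    "joyful": 5, "pleased": 6, "furious": 2, "enraged": 1,
    "discouraged": 2, "disillusioned": 3, "uncertain": 3, "concerned": 2,
    "respected": 6, "worthy": 5, "responsible": 4, "convicted": 3,
    "regretful": 2, "apologetic": 2,
}

scores = emotion_scores(participant)
print(scores["happiness"])  # (5 + 6) / 2 = 5.5
```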
3.3.1 Negative Conditions II
Data analysis showed that there were no significant differences on the seven emotions
between the computer negative condition and the human negative condition. Table 7
shows the summed scores divided by 2 (mean and SD) on the emotions in the computer
negative and human negative condition.
Table 7
Comparison of the Emotion Scores (Mean and SD) in the Computer
Negative and the Human Negative Conditions
                  Condition
Emotions          CN             HN
Happiness         1.75 (.94)     1.44 (1.67)
Anger             4.75 (1.03)    4.44 (1.63)
Disappointment    4.78 (1.09)    5.27 (1.05)
Anxiety           5.52 (1.04)    5.79 (1.05)
Pride             2.31 (1.20)    1.90 (.78)
Guilt             4.38 (.83)     4.23 (1.12)
Shame             4.11 (1.23)    3.85 (1.24)
Note. If there is a significant difference the highest score is marked with an asterisk (*). CN
stands for computer negative, HN stands for human negative.
3.3.2 Positive Conditions II
Table 8 shows the summed scores divided by 2 (mean and SD) on the emotions in the
computer positive condition and the human positive condition. Data analysis showed that
the scores of 3 emotions differed significantly between the computer positive condition and
the human positive condition. Happiness was higher in the human positive condition (M=5.88,
SD=.70) compared to the computer positive condition (M=4.74, SD=1.64); t(55)=-3.37,
p<.005. This indicates that one is likely to experience more happiness following a human
generated positive evaluation compared to a computer generated positive evaluation.
Disappointment was higher in the computer positive condition (M=2.97, SD=1.50) compared
to the human positive condition (M=1.71, SD=.69); t(55)=4.01, p<.001. However, the
meaning of the scores on the emotion disappointment seems questionable taking into account
the low intensity the scores indicate. Finally, pride was higher in the human positive condition
(M=5.71, SD=.81) compared to the computer positive condition (M=4.71, SD=1.62);
t(55)=-2.96, p<.01. This indicates that one is likely to experience more pride after receiving a
human generated positive evaluation compared to a computer generated positive evaluation.
Table 8
Comparison of the Emotion Scores (Mean and SD) in the Computer
Positive and the Human Positive Conditions
                  Condition
Emotions          CP              HP
Happiness         4.74 (1.64)     5.88 (.70)*
Anger             2.47 (1.30)     1.95 (1.07)
Disappointment    2.97 (1.50)*    1.71 (.69)
Anxiety           2.91 (1.19)     2.36 (.95)
Pride             4.71 (1.62)     5.71 (.81)*
Guilt             3.76 (.82)      4.07 (.99)
Shame             2.59 (1.22)     2.05 (1.12)
Note. If there is a significant difference the highest score is marked with an asterisk (*). CP
stands for computer positive, HP stands for human positive.
3.4 Gender
The questionnaire included a question that asked people to indicate their gender. While
analysing the data, gender was used as a variable. First, analyses were carried out while
controlling for the variable gender on all conditions taken together. These analyses showed no
significant differences. Secondly, analyses were carried out on each of the four conditions
separately. The variable gender showed a significant difference on the item concerned in the
human positive condition. Women (M=3.05, SD=1.70) reported a higher intensity than men
(M=1.67, SD=0.50); t(26)= -2.40, p<.05. This indicates that women were more concerned
than men after receiving a positive human generated evaluation. However, it is questionable
whether one can speak of the presence of any concern in this case since the scores are very
low. Figure 1 contains a boxplot showing the difference between men and women on the item
concerned in the human positive condition. The two boxes do not overlap, which is consistent
with the significant difference found. The other items showed no significant differences in any
of the conditions regarding the variable gender.
Figure 1. Boxplot of the gender difference on the item concerned in the human positive condition
3.5 Work Experience
One of the questions in the questionnaire concerned work experience. Participants were
asked if they had ever worked, or worked at that time, for a company or organization. No
significant differences were found between participants with (N=104) and without (N=11)
work experience. However, it has to be noted that the research did not include enough
participants without any previous work experience (N=11) to be able to make statements
about possible correlations with work experience as a variable.
3.6 Previous Job Evaluation(s)
Another question in the questionnaire concerned experience with previous job
evaluation(s). Participants were asked if they had ever received a job evaluation before.
Analysed per condition, participants showed significantly different scores in one condition
on one of the fourteen items when the variable previous job evaluation was taken into
account. Participants in the computer positive condition who had received a job evaluation
before reported a higher intensity of the item apologetic (M=3.67, SD=1.36) compared to
participants who had never received a job evaluation before (M=2.07, SD=1.39); t(27)= 3.15,
p<.005. However, this result seems to be of little meaning, taking into account the low scores
on the item. Data analysis showed that this difference is the only significant one in all four
conditions, taking into account the variable previous job evaluation(s). Table 9 shows the
variable previous job evaluation(s) in relation to the significant score, as well as some
interesting scores (mean and SD).
Even though no additional significant differences were found, data analysis showed
three other interesting and thought-provoking results, all in the human negative condition.
The first concerned the item uncertain. Participants who had not received a job evaluation
before (M=6.16, SD=.83) showed a score that was, although not significantly (p=.056),
considerably higher than participants who had received a job evaluation before (M=5.43,
SD=.79). This could indicate that if one has received a job evaluation before one is likely to
feel somewhat less uncertain after a negative human generated evaluation than those who
have never received a job evaluation before. Furthermore, participants who had not received a
job evaluation before (M=5.37, SD=1.38) showed a score on the item disillusioned that was,
although not significantly (p=.065), considerably higher than participants who had received a
job evaluation before (M=4.14, SD=1.57). This could indicate that prior experience with job
evaluations also makes one somewhat less disillusioned after a negative human generated
evaluation. Finally, participants who had not received a job evaluation before (M=4.42,
SD=1.26) showed a score on the item regretful that was, although not significantly (p=.064),
considerably higher than participants who had received a job evaluation before (M=3.29,
SD=1.50). This could indicate the same pattern for feeling regretful after a negative human
generated evaluation.
In addition to the fact that the three aforementioned results are not significant, it has to
be noted that the human negative condition, containing these last three results, includes only 7
participants that had received a job evaluation before (and 19 participants that had not
received a job evaluation before). Therefore, any claims regarding these results should be
made with caution.
Table 9
Significant and interesting scores in the four conditions taking into account the
variable ‘previous job evaluation(s)’
                              Job Evaluation (yes / no)
Item             Condition    Yes             No
Apologetic       CP           3.67 (1.36)*    2.07 (1.39)
Uncertain        HN           5.43 (.79)      6.16 (.83)
Disillusioned    HN           4.14 (1.57)     5.37 (1.38)
Regretful        HN           3.29 (1.50)     4.42 (1.26)
Note: if there is a significant difference the highest score is marked with an asterisk (*). CP
stands for computer positive, CN stands for computer negative, HP stands for human positive,
HN stands for human negative.
3.7 Age
The questionnaire included a question regarding participants’ age. However, the variable age
could not be used to carry out statistical measurements due to a lack of variation in this
variable. Table 1 (see Paragraph 2.1) displays the mean age and standard deviations for all
participants as well as for the participants per condition.
3.8 Education
A question regarding participants’ education was also included in the questionnaire.
Participants were asked what the highest education was that they took part in, or had already
completed. The variable education could not be used to carry out statistical measurements due
to a shortage of variation in this variable.
4. Discussion
4.1 Hypotheses
The first hypothesis of this research argued that human generated evaluations have a greater
impact on employees’ emotions compared to computer generated evaluations. In other words,
this hypothesis proposed a greater impact on emotions following human generated
evaluations. However, this hypothesis did not include positive and negative evaluations or
positive and negative emotions. Therefore, this hypothesis was elaborated on with the second
and third hypotheses. The second hypothesis argued that negative evaluations lead to negative
emotions that are significantly higher after a human generated evaluation than after a
computer generated evaluation. The third hypothesis proposed that positive evaluations do not
lead to positive emotions that are higher in the human condition compared to the computer
condition. To be able to verify or falsify the first hypothesis the second and third hypotheses
have to be taken into account first.
The second hypothesis (“negative evaluations lead to negative emotions that are
significantly higher after a human generated evaluation than after a computer generated
evaluation”) could not be verified. Data analysis showed that for thirteen of the fourteen
items, scores in the human negative and computer negative conditions did not differ
significantly. Only the result on the item uncertain was in line with the second hypothesis,
since participants’ scores in the human negative condition were higher compared to
participants’ scores in the computer negative condition. A possible explanation of this result
could be that participants doubted the ability of the person who generated the evaluation to
construct an independent and objective evaluation. This result could be seen as a small
supportive indication for the use of NLG systems, since such a system might be perceived as a
more objective means of evaluating compared to human generated evaluations.
The lack of other significant differences indicates that the other emotions are not
experienced with a higher intensity in the human negative condition compared to the
computer negative condition. Thus, these results are not in line with the second hypothesis.
This could mean that negative emotions have the same intensity in the human negative and
computer negative conditions. However, this result could also be explained in relation to the
current research design (see Paragraph 4.3). The second hypothesis was based on the
existence of a ‘direct link’ to a person in the human negative condition: a real link that
includes employees having certain emotions, feelings or ideas regarding their boss or
supervisor, which are only evoked in a social context of concrete persons (a real-life boss)
with whom an employee has a relation (Levy & Williams, 2004). It
was hypothesized that negative emotions would have a higher intensity because of this link to
a person. However, this link was most likely non-existent, or only existent to a small extent,
in this research since the experimental situation created in this research was hypothetical.
Participants were asked to imagine having a certain job with a boss that is not their boss in
their real-life job. Therefore, the link to a ‘real’ boss or supervisor was most likely absent for
most participants.
The fact that the scores in the computer negative condition were not ‘worse’, i.e. did not
show a higher intensity of the negative emotions, compared to the human negative condition
could be viewed as a supportive result for the use of NLG systems. The fact that the scores on
the negative items did not differ between the computer negative and human negative
conditions, and that uncertainty was even lower in the computer negative condition, might be
an indication of the hypothesized effect of computer generated evaluations. Nevertheless, the
second hypothesis could not be verified based on the results of the research.
Interestingly, the third hypothesis (“positive evaluations do not lead to positive emotions
that are higher in the human condition compared to the computer condition”) could also not be
verified. This is interesting since this third hypothesis, including positive evaluations, was
initially introduced in this research as a control to be able to interpret the results of the
negative evaluation conditions. It was expected that
scores on the positive emotion items would be the same in the computer and human positive
conditions. Data analysis showed that the four positive items in this research (joyful, pleased,
respected and worthy), contrary to expectations, were higher in the human positive condition
compared to the computer positive condition. These findings suggest that one is likely to
experience these positive emotions more intensely following a human generated evaluation
compared to a computer generated evaluation. However, these differences could also be
explained in relation to the research design (see also Paragraph 4.3). The aforementioned idea
of a link between an employee and an employer perceived by the employee could work
differently after positive and negative evaluations. While after a negative evaluation this link
might be absent, it might be present after a positive evaluation. Even though the positive
conditions in this research also did not include a link to an actual real-life boss, it could be
possible that participants in the human positive condition did perceive this link to be present
to some extent. Receiving a positive evaluation in a social context could partly explain the
higher scores on the positive items in the human positive condition compared to the computer
positive condition.
Furthermore, participants in the computer positive condition could have been too
focused on the idea that their evaluation was written by a computer. Participants might
already have had a negative attitude towards computers or computer written evaluations, or
developed one after they had received their evaluation. They may have been aware that the
focus of their evaluation, and the research in general, was on the generator of the evaluation,
in their case a computer. This could have affected the item-scores of some of the participants
in a negative way.
The analysis of the positive evaluations also showed that some negative items (furious,
discouraged, disillusioned and uncertain) were higher in the computer positive condition
compared to the human positive condition. This indicates that these negative emotions are
likely to be more present after computer generated positive evaluations compared to human
generated positive evaluations. The fact that such a positive evaluation is written by a
computer may have aroused these negative emotions in participants. However, the meaning of
these significant results seems to be negligible since the scores on the items were very low.
Therefore, it is questionable if these emotions are even present and experienced by the
participants.
Having discussed the second and third hypotheses, it is possible to evaluate the first
hypothesis (“human generated evaluations have a greater impact on employees’ emotions
compared to computer generated evaluations”). This effect was expected to exist following
negative evaluations and negative emotions. Moreover, this effect was expected to not exist
following positive evaluations and positive emotions. However, some remarks have to be
made regarding the first hypothesis. Taking into account the results on the negative emotions
and negative evaluations, it has to be noted that only one of the negative emotions was
significantly higher in the human generated condition compared to the computer generated
condition. The scores on the other negative emotions were not significantly higher in the
human negative condition compared to the computer negative condition. Contrary to the
expectation of the third hypothesis, results showed that the four positive items used in this
research were significantly higher in the human positive condition compared to the computer
positive condition. Even though this last finding is not in line with the third hypothesis, it is in
fact in line with the encompassing first hypothesis.
4.2 Other Variables
In addition, questions were included in this research that covered the variables gender,
work experience, whether or not a participant had already received a job evaluation before,
age and education. The variables work experience, age and education could not be used to
carry out statistical measures due to a lack of variation within these variables. Most
participants worked, or had already worked, at a company or organization, were between 18
and 25 years old and were following, or had finished, a university study. The variables gender
and job evaluation, however, were usable and controlled for.
The variable gender showed a significant difference on the item concerned in the
human positive condition. It appeared that women scored higher on the concerned item than
men in this condition. This could indicate that women generally are more concerned
following an evaluation. However, no other significant results on this item were found
between men and women in the other conditions. Moreover, the scores of the significant
result on the item concerned were very low. Therefore, stating that this negative emotion is
present following positive human generated evaluations is questionable and should be done
with caution.
Controlling for the variable job evaluation, whether or not a participant had already
received a job evaluation before, resulted in one significant result. This result included the
item apologetic in the computer positive condition. It appeared that participants that had
already received a job evaluation before scored higher on this item compared to participants
that had never received a job evaluation before. There does not seem to be an obvious
explanation for this significant result. Moreover, the low scores on this item make it difficult
to establish the actual presence of this emotion.
4.3 Research Design
The findings in this research are subject to at least three limitations. First, the dependent
variables in this research included human emotions. As previous research has already made
clear, whether it includes consumers’ emotions (Richins, 1997) or in this case employees’
emotions, measuring emotions is a difficult matter (Scherer, 2005). One of the ways of
measuring emotions includes self-reporting. This means of measuring emotions was used in
this research. Even though self-reports of emotions are likely to be valid, there are some
concerns that not all individuals are aware and/or capable of reporting on their momentary
emotional states (Mauss & Robinson, 2009).
The scale was specifically designed for this research. Therefore, all collected data was
relevant and could be used for data analysis. As already mentioned, the scale included
fourteen emotions that were relevant in the context of employee evaluations. However, it is
possible that some participants did not understand all of the emotions and terms used in the
questionnaire, which included the scale and additional questions. In addition, items and their
32
emotions could have been interpreted differently by participants. The item responsible, which
correlated negatively with most other items, is an example of the different interpretations
participants can have. Even though a pre-test was conducted to identify uncertainties and
problematic areas in the questionnaire, such problems can never be completely ruled out.
Nevertheless, disregarding the item responsible, items in the scale appeared to have a high
correlation and the consistency of the scale was high. Furthermore, two subscales (positive job
emotions scale & negative job emotions scale) could be identified.
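To illustrate how the consistency of such a scale can be checked, the following is a minimal sketch of Cronbach's alpha, a commonly used internal-consistency coefficient. This section does not name the measure that was used, so the choice of Cronbach's alpha is an assumption, and the function names and the example scores are hypothetical and are not data from this research.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores.

    Assumption: every respondent answered every item (no missing values).
    """
    n_items = len(responses[0])
    if n_items < 2 or len(responses) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item (column) across respondents.
    item_variance_sum = sum(variance(column) for column in zip(*responses))
    # Sample variance of each respondent's total score.
    total_variance = variance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")
    return n_items / (n_items - 1) * (1 - item_variance_sum / total_variance)


def alpha_without_item(responses, item_index):
    """Alpha recomputed after dropping one item, e.g. a poorly fitting item."""
    reduced = [row[:item_index] + row[item_index + 1:] for row in responses]
    return cronbach_alpha(reduced)


if __name__ == "__main__":
    # Hypothetical 1-7 ratings: three respondents, three items.
    # The third item runs against the first two.
    example = [[1, 1, 7], [4, 4, 4], [7, 7, 1]]
    print(cronbach_alpha(example))
    print(alpha_without_item(example, 2))
```

Comparing the coefficient with and without a given item shows how strongly a single item that correlates negatively with the others can lower the consistency of a scale.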
Finally, the participants in this research were mainly university students. This research
population was chosen taking into account the availability of the participants and the
feasibility of the research. However, taking into account the subject of the research, employee
evaluations, it has to be noted that this research population might not be the best suited. In
general, students have little work experience in organizations and little experience with job
evaluations. As the data showed, many participants had never received a job evaluation before
(see Table 1). Therefore, students who participated in this research might have had
difficulties with the scenario presented to them, which involved working for an organization
and receiving an evaluation.
5. Conclusion and Recommendations
Recent developments in the field of Natural Language Generation show the enormous
potential of NLG systems. In addition to the generation of textual weather forecasts, these
systems are for example also being used in the health sector where they function as a
decision-support aid for medical professionals. The current study introduced a new domain in
which NLG systems might be useful in the future. This domain comprises performance
appraisals, or employee evaluations, and potentially other human resource processes. The
present study was designed to determine the effects on the emotions of participants following
computer generated employee evaluations compared to the more widely known and used
human generated employee evaluations. Negative as well as positive computer and human
generated evaluations were taken into account. It was expected that computer generated
negative evaluations would evoke negative emotions with a lower intensity than human
generated evaluations. However, results showed that, for almost all items used in this
research, negative evaluations had the same effect on participants’ emotions regardless of
whether they were generated by a computer or a human. Positive evaluations appeared to evoke
lower-intensity positive emotions (joyful, pleased, respected and worthy) when the evaluation was
generated by a computer and not by a human. However, the scores on these negative items
were very low.
Do these findings indicate that computer generated evaluations have no future? No. A
previous study has shown that people preferred computer generated texts over human
generated texts, partly because of better word choice (Reiter, Sripada, Hunter, Yu & Davy,
2005). These computer generated texts were, contrary to the texts used in the current research,
actually generated by an NLG system. This system and most other NLG systems can generate
actual words, phrases, grammar and other linguistic features. Taking into account that the
computer and human evaluation texts in the current study were all generated by a human, and
text as such was not a variable, results may have been different if computer texts were indeed
generated by an NLG system and included these different linguistic features. Further research
is needed to test whether computer generated evaluations that are truly generated by an NLG
system have advantages over human generated evaluations. Furthermore, results of the current
research might have been different with a different research population. While the current
study included students, further research is recommended to include actual employees of an
organization. This would be beneficial compared to the current research, as such a study could
include evaluations and situations that are real, and not hypothetical. Future research could
take into account real employees that have a real relation, including emotions and feelings,
with a boss or supervisor. A final recommendation for further research is to
control for variables such as gender, age, education, work experience and experience with
job evaluations. While these variables were taken into account in this research, a lack of
variation within some of these variables made it difficult to carry out statistical analyses on
them.
References
Baumeister, R. F., Bratslavsky, E., Finkenauer, C., & Vohs, K. D. (2001). Bad is stronger
than good. Review of general psychology, 5(4), 323.
Belschak, F. D., & Den Hartog, D. N. (2009). Consequences of Positive and Negative
Feedback: The Impact on Emotions and Extra‐Role Behaviors. Applied Psychology,
58(2), 274-303.
Bernardin, H. J., & Beatty, R. W. (1984). Performance appraisal: Assessing human behavior
at work. Boston: Kent Publishing Company.
Bourbeau, L., Carcagno, D., Goldberg, E., Kittredge, R., & Polguere, A. (1990, August).
Bilingual generation of weather forecasts in an operations environment. In
Proceedings of the 13th conference on Computational linguistics-Volume 3 (pp. 318-320).
Association for Computational Linguistics.
Côté, S., & Morgan, L. M. (2002). A longitudinal analysis of the association between emotion
regulation, job satisfaction, and intentions to quit. Journal of Organizational Behavior,
23(8), 947-962.
Ferris, G. R., Mitchell, T. R., Canavan, P. J., Frink, D. D., & Hopper, H. (1995).
Accountability in human resource systems. In G. R. Ferris, S. D. Rosen, & D. T.
Barnum (Eds.), Handbook of human resource management (pp. 175-196). Oxford,
UK: Blackwell Publishers.
Ferris, G. R., Munyon, T. P., Basik, K., & Buckley, M. R. (2008). The performance
evaluation context: Social, emotional, cognitive, political, and relationship
components. Human Resource Management Review, 18(3), 146-163.
Fisher, C. D. (1998). Mood and emotions while working-missing pieces of job satisfaction.
School of Business Discussion Papers, 64.
Grandey, A. A., Tam, A. P., & Brauburger, A. L. (2002). Affective states and traits in the
workplace: Diary and survey data from young workers. Motivation and emotion,
26(1), 31-55.
Grote, R. C. (2002). The performance appraisal question and answer book: A survival guide
for managers. AMACOM Div American Mgmt Assn.
Irvine, D. M., & Evans, M. G. (1995). Job satisfaction and turnover among nurses: integrating
research findings across studies. Nursing research, 44(4), 246-253.
Korsgaard, M. A., & Roberson, L. (1995). Procedural justice in performance evaluation: The
role of instrumental and non-instrumental voice in performance appraisal discussions.
Journal of Management, 21(4), 657-669.
Lazarus, R. S. (1991). Emotion and adaptation. New York: Oxford University Press.
Levy, P. E., & Williams, J. R. (2004). The social context of performance appraisal: A review
and framework for the future. Journal of management, 30(6), 881-905.
Liden, R. C., & Mitchell, T. R. (1985). Reactions to feedback: The role of attributions.
Academy of Management Journal, 28, 291-308.
London, M. (2012). Job feedback: Giving, seeking, and using feedback for performance
improvement. Psychology Press.
Mauss, I. B., & Robinson, M. D. (2009). Measures of emotion: A review. Cognition and
emotion, 23(2), 209-237.
Mobley, W. H. (1977). Intermediate linkages in the relationship between job satisfaction and
employee turnover. Journal of applied psychology, 62(2), 237.
Murphy, T. H., & Margulies, J. (2004, March). Performance appraisals. In Presentation, ABA
Labor and Employment Law Section, Equal Employment Opportunity Committee,
Mid-Winter Meeting.
Oving, R. (2014). Een derde van de werknemers niet tevreden over het beoordelingsgesprek.
Metronieuws. Retrieved from: http://www.metronieuws.nl/nieuws/een-derde-van-de-werknemers-niet-tevreden-over-het-beoordelingsgesprek/SrZncc!Q9dSpkFEE6k76/
Peretz, H., & Fried, Y. (2012). National cultures, performance appraisal practices, and
organizational absenteeism and turnover: A study across 21 countries. Journal of
Applied Psychology, 97(2), 448.
Portet, F., Reiter, E., Gatt, A., Hunter, J., Sripada, S., Freer, Y., & Sykes, C. (2009).
Automatic generation of textual summaries from neonatal intensive care data.
Artificial Intelligence, 173(7), 789-816.
Prue, D. M., & Fairbank, J. A. (1981). Performance feedback in organizational behavior
management: A review. Journal of Organizational Behavior Management, 3(1), 1-16.
Reiter, E., & Dale, R. (1997). Building applied natural language generation systems. Natural
Language Engineering, 3(1), 57-87. doi:10.1017/s1351324997001502
Reiter, E., Robertson, R., & Osman, L. M. (2003). Lessons from a failure: Generating
tailored smoking cessation letters. Artificial Intelligence, 144, 41-58.
Reiter, E., Sripada, S., Hunter, J., Yu, J., & Davy, I. (2005). Choosing words in
computer-generated weather forecasts. Artificial Intelligence, 167(1), 137-169.
Richins, M. L. (1997). Measuring emotions in the consumption experience. Journal of
consumer research, 24(2), 127-146.
Scherer, K. R. (2005). What are emotions? And how can they be measured? Social science
information, 44(4), 695-729.
Sripada, S. G., Reiter, E., & Hawizy, L. (2002). Evaluation of an NLG system using post-edit
data: Lessons learnt. WEATHER, 5, 7.
Tnay, E., Othman, A. E. A., Siong, H. C., & Lim, S. L. O. (2013). The Influences of Job
Satisfaction and Organizational Commitment on Turnover Intention. Procedia-Social
and Behavioral Sciences, 97, 201-208.
Appendices
Appendix A
Negative human generated evaluation text.
May 1, 2014
Tilburg University
Warandelaan 2, Tilburg
The Netherlands
Policy Department
Concerning: Performance appraisal. Period: 01-01-14 – 01-05-14.
Dear employee,
You have been an employee at Tilburg University for four months now and this is your first evaluation
as a policy advisor at the organization. You are assessed on the competencies: conceptual ability,
persuasiveness, writing qualities and result orientation/decisiveness.
Regarding conceptual ability, it stands out that you still have difficulties extracting the bigger picture
from information and translating it to new, useful ideas. Also, your ability of making links appears to
be insufficient so far.
Even though you are enthusiastic, you do not provide enough arguments and you anticipate
insufficiently on others’ reactions.
Your writing qualities have improved gradually. However, you still fall short regarding structure and
clearness.
Your result orientation and decisiveness remain unsatisfactory because your approach and planning
are not concrete enough. For example, you do not address others after achieved or disappointing
results.
In conclusion:
Based on your functioning the last four months, you do not meet the requirements that are set for the
function of policy maker at Tilburg University.
Evaluator:
C.J.A.A. Verhey
Head of Policy Department
Tilburg University
Signature:
Appendix B
Negative computer generated evaluation text.
May 1, 2014
Tilburg University
Warandelaan 2, Tilburg
The Netherlands
Policy Department
Concerning: Automatically computer written performance appraisal. Period: 01-01-14 – 01-05-14.
Dear employee,
You have been an employee at Tilburg University for four months now and this is your first evaluation
as a policy advisor at the organization. You are assessed on the competencies: conceptual ability,
persuasiveness, writing qualities and result orientation/decisiveness.
Regarding conceptual ability, it stands out that you still have difficulties extracting the bigger picture
from information and translating it to new, useful ideas. Also, your ability of making links appears to
be insufficient so far.
Even though you are enthusiastic, you do not provide enough arguments and you anticipate
insufficiently on others’ reactions.
Your writing qualities have improved gradually. However, you still fall short regarding structure and
clearness.
Your result orientation and decisiveness remain unsatisfactory because your approach and planning
are not concrete enough. For example, you do not address others after achieved or disappointing
results.
In conclusion:
Based on your functioning the last four months, you do not meet the requirements that are set for the
function of policy maker at Tilburg University.
This evaluation is automatically generated by our HR computer evaluation system
Appendix C
Positive human generated evaluation text.
May 1, 2014
Tilburg University
Warandelaan 2, Tilburg
The Netherlands
Policy Department
Concerning: Performance appraisal. Period: 01-01-14 – 01-05-14.
Dear employee,
You have been an employee at Tilburg University for four months now and this is your first evaluation
as a policy advisor at the organization. You are assessed on the competencies: conceptual ability,
persuasiveness, writing qualities and result orientation/decisiveness.
Regarding conceptual ability, it stands out that you are comfortable extracting the bigger picture from
information and translating it to new, useful ideas. Also, your ability of making links appears to be
excellent so far.
In addition to your enthusiasm, you provide satisfactory arguments and anticipate sufficiently on
others’ reactions.
Your writing qualities have improved gradually. Moreover, you have improved regarding structure
and clearness.
Your result orientation and decisiveness are satisfactory. This is apparent from your approach and
planning which are very concrete. For example, you address others after achieved or disappointing
results.
In conclusion:
Based on your functioning the last four months, you meet the requirements that are set for the function
of policy maker at Tilburg University.
Evaluator:
C.J.A.A. Verhey
Head of Policy Department
Tilburg University
Signature:
Appendix D
Positive computer generated evaluation text.
May 1, 2014
Tilburg University
Warandelaan 2, Tilburg
The Netherlands
Policy Department
Concerning: Automatically computer written performance appraisal. Period: 01-01-14 – 01-05-14.
Dear employee,
You have been an employee at Tilburg University for four months now and this is your first evaluation
as a policy advisor at the organization. You are assessed on the competencies: conceptual ability,
persuasiveness, writing qualities and result orientation/decisiveness.
Regarding conceptual ability, it stands out that you are comfortable extracting the bigger picture from
information and translating it to new, useful ideas. Also, your ability of making links appears to be
excellent so far.
In addition to your enthusiasm, you provide satisfactory arguments and anticipate sufficiently on
others’ reactions.
Your writing qualities have improved gradually. Moreover, you have improved regarding structure
and clearness.
Your result orientation and decisiveness are satisfactory. This is apparent from your approach and
planning which are very concrete. For example, you address others after achieved or disappointing
results.
In conclusion:
Based on your functioning the last four months, you meet the requirements that are set for the function
of policy maker at Tilburg University.
This evaluation is automatically generated by our HR computer evaluation system
Appendix E
Scale/Questionnaire
Please complete this questionnaire consisting of 14 items. Indicate the intensity with which you
experience the following emotions after receiving your evaluation.
(1) This evaluation makes me feel worthy
Extremely weakly (not at all) 1 2 3 4 5 6 7 Extremely strongly (very much)
(2) This evaluation enrages me
Extremely weakly (“) 1 2 3 4 5 6 7 Extremely strongly (“)
(3) This evaluation makes me feel apologetic
Extremely weakly (“) 1 2 3 4 5 6 7 Extremely strongly (“)
(4) This evaluation makes me feel regretful
Extremely weakly (“) 1 2 3 4 5 6 7 Extremely strongly (“)
(5) This evaluation makes me feel disillusioned
Extremely weakly (“) 1 2 3 4 5 6 7 Extremely strongly (“)
(6) This evaluation makes me furious
Extremely weakly (“) 1 2 3 4 5 6 7 Extremely strongly (“)
(7) This evaluation makes me feel uncertain
Extremely weakly (“) 1 2 3 4 5 6 7 Extremely strongly (“)
(8) This evaluation discourages me
Extremely weakly (“) 1 2 3 4 5 6 7 Extremely strongly (“)
(9) This evaluation makes me feel concerned
Extremely weakly (“) 1 2 3 4 5 6 7 Extremely strongly (“)
(10) This evaluation makes me feel responsible
Extremely weakly (“) 1 2 3 4 5 6 7 Extremely strongly (“)
(11) This evaluation pleases me
Extremely weakly (“) 1 2 3 4 5 6 7 Extremely strongly (“)
(12) This evaluation makes me feel respected
Extremely weakly (“) 1 2 3 4 5 6 7 Extremely strongly (“)
(13) This evaluation makes me joyful
Extremely weakly (“) 1 2 3 4 5 6 7 Extremely strongly (“)
(14) This evaluation makes me feel convicted
Extremely weakly (“) 1 2 3 4 5 6 7 Extremely strongly (“)
(Q1) My age is: … year
(Q2) I am a:
0 man
0 woman
(Q3) The highest education that I take part in, or have already completed is:
0 Elementary school
0 High school (VMBO)
0 High school (HAVO)
0 High school (VWO)
0 MBO
0 HBO
0 University (WO)
(Q4) Have you worked for an organization or company before, or currently work for an organization
or company: Yes / No
(Q5) Have you ever received a job evaluation? Yes / No
Appendix F
Case computer conditions.
You are an employee of the University of Tilburg. You work for the organizations’ Policy
department, together with twenty other people. You have been working for Tilburg University
for four months now. You like your work and have been working at the organization with
great pleasure. Furthermore, you are happy with your colleagues and you would like to stay at
the organization for a longer period of time.
Tilburg University finds it important that their employees are content with their work.
However, Tilburg University also finds it very important that its employees do their work
sufficiently and behave properly. Therefore, every four months you and your colleagues are
evaluated on your performance and behaviour. This always results in a written evaluation.
Tilburg University makes use of a new way of evaluating employees, as the evaluation is
written by a computer, and not by your boss. This enables Tilburg University for more
frequent evaluations.
Next, you will receive such a computer written evaluation on your performance and
behaviour over the past four months. Make sure you read your evaluation.
Appendix G
Case human conditions.
You are an employee of the University of Tilburg. You work for the organizations’ Policy
department, together with twenty other people. You have been working for Tilburg University
for four months now. You like your work and have been working at the organization with
great pleasure. Furthermore, you are happy with your colleagues and you would like to stay at
the organization for a longer period of time.
Tilburg University finds it important that their employees are content with their work.
However, Tilburg University also finds it very important that its employees do their work
sufficiently and behave properly. Therefore, every four months you and your colleagues are
evaluated on your performance and behaviour. This always results in a written evaluation.
These evaluations are written by the head of your department, your boss.
Next, you will receive such an evaluation, written by your boss, on your performance and
behaviour over the past four months. Make sure you read your evaluation.