Computer generated performance appraisals versus human generated performance appraisals: The effects of positive and negative evaluations on emotions

Kars Wijnhoven
ANR 295543

Bachelor’s Thesis
Communication- and information sciences
Specialisation Business communication and Digital Media
Faculty of Humanities
Tilburg University, Tilburg

Supervisor: drs. H.A.F.J. van der Kaa
Second reader: dr. S. Wubben

June 2014

Table of contents

Abstract
1. Introduction
    1.1 Natural Language Generation
    1.2 Performance Appraisals
    1.3 Feedback and Emotions
2. Method
    2.1 Participants
    2.2 Materials
        2.2.1 Texts
        2.2.2 Scale
    2.3 Design and Procedure
    2.4 Pre-test
    2.5 Data Analysis
3. Results
    3.1 Negative Conditions
    3.2 Positive Conditions
    3.3 Emotions
        3.3.1 Negative Conditions II
        3.3.2 Positive Conditions II
    3.4 Gender
    3.5 Work Experience
    3.6 Previous Job Evaluation(s)
    3.7 Age
    3.8 Education
4. Discussion
    4.1 Hypotheses
    4.2 Other Variables
    4.3 Research Design
5. Conclusion and Recommendations
References
Appendices

Abstract

Performance appraisals can convey bad news to an employee and evoke negative emotions with possibly detrimental effects. A new means of employee evaluation, in which the evaluation is generated by a Natural Language Generation (NLG) system, is introduced. This study examined the effects of two types of performance appraisals, computer (NLG) generated and human generated, on employees’ emotions. Students (N = 115) reported the intensity of seven emotions (fourteen items) after receiving either a positive or negative computer generated evaluation or a positive or negative human generated evaluation. Following negative evaluations, results showed that computer and human generated evaluations evoke the same intensity of thirteen of the fourteen emotion-items.
Following positive evaluations, results showed that human generated evaluations evoke a higher intensity of four positive emotion-items, as well as a lower intensity of four negative emotion-items. These results do not directly support the use of NLG systems in the domain of performance appraisals. However, some supportive evidence is presented and recommendations for further research are given.

Keywords: Emotions, Natural Language Generation, Performance appraisals.

1. Introduction

“One third of employees not satisfied with their performance appraisal” (Oving, 2014). This headline appeared above a recent news article reporting on a survey by the NationaleVacaturebank among more than a thousand employees. According to the survey, 20% of the employees thought their boss acted extremely unpleasantly during the appraisal. Moreover, 33% of the employees thought their boss criticized certain aspects that they had never been informed about before. Most importantly, the survey showed that such a performance appraisal is a reason to actively look for a new job for 25% of the employees. It is clear from these percentages that performance appraisals can elicit negative emotions in employees, even to the point that resigning is considered. These effects, which can be detrimental for organizations, are most evident when employees receive negative evaluations and less so when they receive positive evaluations. Is there a way to prevent or minimize these negative feelings and emotions while retaining the benefits of performance appraisals?

In this paper a new idea for performance appraisal on a more frequent basis is proposed. The idea includes a concept for a new natural language generation (NLG) system. This hypothetical NLG system translates data on the performance of employees into a natural language text. The input of the system consists of ratings given by evaluators or superiors on different criteria of the employee’s performance.
The output is an automatically generated text that represents the performance of the employee on a number of criteria that are not fixed in advance. The automatically generated text enables employees to view their progress, performance and behaviour and subsequently change their behaviour and performance. This new means of employee evaluation, which allows for more frequent evaluations, might reduce the number of employees who feel that their boss criticizes things they were never informed about before. As a result, even the share of employees who actively look for a new job (25%) might be reduced.

Before such a system can be taken into serious consideration it has to be noted that software systems are expensive to build, and developers need to demonstrate that developing such a system is a better solution than hiring and training someone to manually write the documents the NLG system is supposed to produce (Dale & Reiter, 1997). Therefore, the aim of this research is not to elaborate on the technical details of the NLG system, but to first test the usefulness of such a system. The main aim of this research is, taking into account the negative emotions performance appraisals can evoke in employees, to measure the potentially negative emotional effects of such a computer generated evaluation-text compared to a human generated evaluation-text. If the results of this research show that computer generated evaluations have no significant advantage over human generated evaluations, this might indicate that such an NLG system is not useful in the domain of performance appraisals. However, if it appears that computer generated evaluations elicit weaker negative emotions than human generated evaluations, this might be a first step towards the use of NLG systems to enhance human resource processes like employee evaluation.

RQ: What are the effects of computer generated text-evaluations on the emotions of employees compared to human generated text-evaluations?

1.1 Natural Language Generation

Natural language generation (NLG) is the subfield of artificial intelligence and computational linguistics that is concerned with the construction of computer systems that can produce understandable texts in English or other human languages from some underlying nonlinguistic representation of information (Dale & Reiter, 1997). The input of an NLG system usually consists of data in the form of numbers. The output, on the other hand, consists of sentences in the form of a text. In other words, an NLG system can be seen as a translator that converts a computer based representation into a natural language representation.

An example of an NLG system is the weather report system called FoG (Bourbeau, Carcagno, Goldberg, Kittredge & Polguere, 1990). This system was built to produce textual weather reports in English and French. The input of this NLG system consists of graphical or numerical weather depictions. It has been in use by the Canadian Weather Service since 1992. The weather report system called SUMTIME-MOUSAM is an NLG system that generates textual weather forecasts from numerical weather prediction data (Sripada, Reiter & Hawizy, 2002). The forecasts are marine forecasts for offshore oilrigs. It was even found that users preferred SUMTIME-MOUSAM’s texts to human-generated texts, in part because of better word choice (Reiter, Sripada, Hunter, Yu & Davy, 2005). The authors stated that it might have been the first time that an evaluation showed NLG texts to be better than human-authored texts.

Another example of a natural language generation system is STOP. This NLG system did not generate weather forecasts. Rather, it was designed to serve a public health goal. This system generated smoking cessation letters based on a user-input questionnaire (Reiter, Robertson & Osman, 2003).
Even though this particular system did not in fact fulfil its goals, since recipients of a non-tailored letter were as likely to stop smoking as recipients of a tailored letter, it gives an indication of the potential of NLG systems. More recently, NLG systems have been created that can summarize medical data and serve as effective decision-support aids for medical professionals (Portet, Reiter, Gatt, Hunter, Sripada, Freer & Sykes, 2009). These examples show some of the ways in which NLG systems are already being used and indicate the possibilities of natural language generation in the future.

The NLG system proposed in this research generates a text document based on data that represents the employee’s performance on a number of unfixed criteria. This means that employees receive written feedback on their behaviour and functioning. Prue and Fairbank (1981) identify a number of advantages of written feedback over, for example, verbal feedback. Firstly, they state that written feedback provides a concrete product which can facilitate a longitudinal assessment of performance. Furthermore, they state that written feedback gives employees the option to display the feedback in a more public manner. Finally, they state that feedback delivered via written evaluations can be easily monitored by the manager.

This research takes into account the consequences of computer generated evaluations for employees. However, employers could also benefit from the use of NLG systems. Murphy and Margulies (2004) state that managers may be more comfortable with numerical or scale rankings. In addition, they state that quantitative measures are sometimes easier to defend against legal challenges than qualitative appraisals.

Even if it appears that an NLG system that generates employee evaluations may be beneficial for employee and/or employer, the introduction of such a system will not be something for the near future.
As already stated, building and introducing new software systems is expensive (Dale & Reiter, 1997). Furthermore, before such an NLG system is able to produce texts of sufficient quality, numerous tasks and steps have to be designed and implemented. Building a system that includes these tasks and steps is time-consuming and costly.

Without going too much into depth on the technical details of an NLG system, since that is not the aim of this research, it is important to explain how such a system acquires the language it uses and to sketch a global view of its functioning and tasks. According to Dale and Reiter (1997) one of the first tasks when building or designing an NLG system is to create or find an initial corpus of human-authored texts. Such a corpus can be created based on real letters or documents written in the past. An NLG system generating employee evaluations could, for example, draw on previous human written employee evaluations to create a corpus. It is important that such a corpus includes positive, negative and neutral evaluations, as Dale and Reiter (1997) state that a corpus in general should include boundary and unusual cases as well as typical cases. Following this initial corpus, a target text corpus has to be created that includes the most suitable, useful and favourable texts. This target text corpus includes a set of texts which characterizes the actual output of the NLG system (Dale & Reiter, 1997).

An NLG system usually comprises six basic kinds of activity that need to be carried out to go from input data to an output text (Dale & Reiter, 1997). These six ‘tasks’ are briefly named and discussed here. The first task is called content determination. This includes the selection of what information is to be communicated in the output-text. The second task, discourse planning, is the process of organising the overall information or set of messages to convey. This task covers, for example, the order in which the messages are conveyed.
The third task, sentence aggregation, is the process of grouping messages and includes, for example, the merging of similar sentences to improve readability. Lexicalization is the fourth task and involves putting words to the concepts. It could, for example, cover the question whether the word ‘mediocre’ or ‘average’ should be used to describe an employee’s performance. The fifth task is referring expression generation and includes the selection of words or phrases to identify domain entities. Dale and Reiter (1997) exemplify this by the use of the word ‘it’ to refer to the domain entity ‘the Caledonian Express’ (i.e. a train). Furthermore, this task includes decision making about pronouns and other types of anaphora. The sixth and final task is linguistic realisation. It involves, taking into account the rules of grammar, the creation of a text which is syntactically, morphologically and orthographically correct.

1.2 Performance Appraisals

Employee performance appraisals play an important role in today’s business world. According to Peretz and Fried (2012) performance appraisal is a key human resource activity in organizations. Furthermore, performance evaluation has been demonstrated to increase performance and effectiveness (Ferris, Mitchell, Canavan, Frink & Hopper, 1995). Bernardin and Beatty (1984) have stated that performance appraisal encompasses the assessing of human behaviour at work. According to Grote (2002), the appraisal is usually prepared by the employee’s immediate supervisor. Furthermore, he states that the procedure typically requires the supervisor to fill out a standardized assessment form that evaluates the individual on several different dimensions and then discuss the results of the evaluation with the employee. Grote (2002) argues that performance appraisal is the most powerful instrument that organizations have to mobilize the energy of every employee of the enterprise toward the achievement of strategic goals.
Furthermore, he states that, if used well, performance appraisal can focus every person’s attention on the company’s mission, vision, and values.

There seems to be a consensus among scientists about the main aims of a performance appraisal. According to Murphy and Margulies (2004) the most important aims of performance appraisals are:
(1) Pinpointing specific behaviour or job performance that should be discontinued or reinforced.
(2) Serving as an employee development and coaching tool.
(3) Providing a realistic assessment of an employee’s readiness for promotion.
(4) Serving as the basis for awarding merit pay.

While these general goals of performance appraisals seem universally agreed on among scientists, the opposite can be said about the contents of a performance appraisal. The literature provides numerous criteria or dimensions that could or should be included in a performance appraisal, including among other things: performance, satisfaction, working experience, behaviour and salary. Numerous scientific articles have been written on this subject. This paper does not discuss which criteria should be used to evaluate employees. The desired contents of an appraisal are arbitrary and depend on, for example, job contents, type of organization, or individual or organizational goals. It is up to managers and organizations to decide what is important to assess.

In contrast, the frequency of employee evaluations is something to be taken into account if an employee evaluation NLG system is seriously considered. As the results of the survey of NationaleVacaturebank (see 1. Introduction) show, a considerable number of employees feel that they hear certain things for the first time in their performance appraisal. Such an evaluation generally takes place in a one-year, or even two- or four-year cycle. These appraisals are executed in such cycles because they are expensive and time-consuming.
Therefore, it will most likely not be feasible to evaluate employees on a more frequent basis in the form of the current face-to-face performance appraisal. However, an NLG system that produces evaluation texts might overcome these problems, since texts produced by NLG systems have been reported to be better and cheaper than texts produced by human writers (Reiter, Sripada, Hunter, Yu & Davy, 2005). Moreover, providing employees with more regular feedback gives them more information and an indication about their functioning and behaviour. This longitudinal assessment can enable employees to change their behaviour and functioning if necessary, and also to prepare themselves for the actual face-to-face performance appraisal with their supervisor. The possible contents of this appraisal could be predicted based on the NLG evaluations, and therefore the employee could perceive possible negative information as less surprising and with less emotional impact.

Such an NLG system should not be created to substitute the present four-, two-, or one-year cycle of face-to-face performance appraisals, since issues regarding, for example, salary are usually discussed face to face. Preferably, it should be viewed as a means of extending or improving the current resources to achieve the best possible employee evaluation.

1.3 Feedback and Emotions

There is agreement among almost all scientists that feedback is vital for organizations to be successful. London (2012) states that it is known from psychological research that people need knowledge of results to accomplish performance goals and improve their performance over time. Furthermore, it is clear from scientific research that feedback can affect the emotional state of people. According to Liden and Mitchell (1985) supervisors’ behaviour can elicit emotional reactions in subordinates.
Ferris, Munyon, Basik and Buckley (2008) state that performance evaluation is arguably an emotional experience, as evaluations have potentially significant ramifications for employee psychological well-being, social status and the continuation of employment within the organization. Belschak and Den Hartog (2009) also state that feedback has an impact on emotions and subsequently on work attitudes and behavioural intentions. Moreover, they claim that different types of feedback relate to different emotions. Providing positive feedback will generally lead to positive emotions, such as pride and happiness, whereas negative feedback will generally result in negative emotions, such as disappointment or guilt (Belschak & Den Hartog, 2009; Lazarus, 1991). Bad events have been found to have greater power than good ones, in everyday events as well as major life events (Baumeister, Bratslavsky, Finkenauer, & Vohs, 2001). Moreover, bad emotions and bad feedback have been found to have more impact than good ones (Baumeister et al., 2001).

However, it is not clear whether these emotions will be present with the same intensity if, instead of a face-to-face performance appraisal, the evaluation takes the form of a text that is generated by a supervisor or a text that is generated by an NLG system. These two text-conditions are taken into account in this research and are compared regarding the emotional effects they have on participants. It is hypothesized that negative evaluations elicit stronger negative emotions in the human generated text-condition than in the computer generated text-condition. This would mean that the detrimental effects of negative performance appraisals, or evaluations in general, could be decreased by the use of computer generated language. Such a result would support the idea of an NLG system that regularly generates employee evaluations.
This is hypothesized because participants might perceive an evaluation that is generated by an NLG system as an independent means of evaluation, whereas in the human generated condition participants might perceive the evaluation as a more dependent or even biased means of evaluation because the text stems directly from a person. This is based on the research of Levy and Williams (2004), who argue that performance appraisals take place in a social context. They state that this context plays a major role in the effectiveness of the appraisal process and in how participants react to that process. They also state that evaluators make errors and can have biases towards employees. This possible existence of errors and biases could harm the reliability of any evaluation with a direct link to a supervisor or evaluator. Furthermore, participants in the human generated text-condition could be aware of the social context of supervisor and subordinate and of the possible biases and errors of supervisors. In contrast, participants in the NLG condition may not perceive such a social context because of the absence of a direct link to a supervisor. This absence of a direct link to a supervisor could reduce the negative emotional impact on employees.

Furthermore, it is argued that regardless of an organization’s decision itself, fair procedures will result in more positive attitudes (Korsgaard & Roberson, 1995). In other words, procedural justice can stimulate positive attitudes towards decisions that might otherwise be viewed negatively. Participants might associate the computer generated text-condition with more procedural justice, since no supervisor is involved in the generation of the sentences who could act or evaluate in an unfair way or with certain interests.
Based on the above, the following hypothesis is proposed:

H1: Human generated evaluations have a greater impact on employees’ emotions compared to computer generated evaluations.

It is clear that the various forces that play a role after employees receive feedback cause different emotional reactions. Since numerous human emotions are known from and used in scientific research, it is important to clearly identify the emotions that are relevant with respect to receiving negative and positive employee evaluations. This research takes into account emotions that, according to other research, are present and dominant when one receives such feedback. This means that emotions are measured that affect, for example, turnover intention, citizenship and affective commitment. Relevant emotions are identified and a scale is constructed that measures these emotions. This scale is discussed in the method section.

As previously mentioned, possible detrimental effects of employee evaluations are usually a result of employees receiving negative evaluations rather than positive evaluations. Therefore, evaluations with a negative message are most relevant to investigate. However, to be able to interpret the relative impact of negative evaluations on employees’ emotions in the two text-conditions, positive evaluations are also taken into account. It is already hypothesized that human generated evaluations have a greater impact on employees’ emotions compared to computer generated evaluations. This first hypothesis can be specified to make the direction of the effect concrete. It is hypothesized that this effect will be visible after evaluations with a negative message. Moreover, it is expected that the effect will not be visible after evaluations with a positive message.

H2: Negative evaluations will lead to significantly higher negative emotions in the human generated text-condition compared to the computer generated text-condition.
H3: Positive evaluations will not lead to significantly higher positive emotions in the human generated text-condition compared to the computer generated text-condition.

2. Method

2.1 Participants

The survey was completed by 115 participants (N male = 41, N female = 74). The study was conducted amongst students in The Netherlands (mean age = 22.16). The majority of the participants (N = 99) took part in, or had finished, university education. Other participants’ education included higher education HBO (N = 14), MBO (N = 1) and high school VWO (N = 1). Table 1 shows the four conditions used in this research separately, together with the distribution and characteristics of the participants. These characteristics include gender, age and whether or not a participant had already received a job evaluation before. The program that was used to collect the results (see Section 2.5) distributed participants evenly over the four conditions used in this research. However, because 20 participants started but did not finish the survey, there is a small imbalance in this distribution.

Table 1
Conditions and Participants

Condition   N     M (N)   F (N)   Mean Age (SD)   Job evaluation: Yes   Job evaluation: No
CP          29    8       21      22.00 (1.65)    15                    14
CN          32    14      18      22.44 (1.72)    11                    21
HP          28    9       19      22.00 (1.49)    17                    11
HN          26    10      16      22.15 (2.62)    7                     19
Total       115   41      74      22.16 (1.88)    50                    65

Note. CP stands for computer positive, CN stands for computer negative, HP stands for human positive, HN stands for human negative. M stands for male, F stands for female.

2.2 Materials

2.2.1 Texts

Four texts were used in this research (see appendices A, B, C & D). These texts were designed to resemble a real-life employee evaluation. The texts included two texts with negative critiques, meaning that the employee did not function sufficiently, and two texts with positive critiques, meaning that the employee did function sufficiently. The two negative texts were identical.
However, one text was presented as an evaluation written by the participant’s boss and the other text was manipulated by presenting it as an evaluation written by a computer. The two positive texts were also identical and manipulated by presenting one text as if it was written by the participant’s boss and the other text as if it was written by a computer.

The evaluation texts were constructed based on an actual competencies assessment of a policy maker at a Dutch university, including function goal, tasks, assessment competencies of the job and an actual assessment text. This competencies assessment was designed by the VSNU (‘Vereniging van Nederlandse Universiteiten’, English: ‘Association of Dutch Universities’). By taking this assessment into account it was attempted to make the evaluation texts as close to a real-life evaluation as possible. Furthermore, since the participants of the research included university students, it was attempted to make the evaluation situation as imaginable and realistic as possible by resembling an evaluation of an actual university employee.

2.2.2 Scale

Participants’ emotions were measured using a scale made for this particular research (see appendix E). The scale was designed to include emotions that are relevant when employees receive feedback on their performance and behaviour. Based on other research, seven relevant and suitable emotions were selected: Happiness, Anger, Disappointment, Anxiety, Pride, Guilt and Shame. Note that these include five ‘negative’ emotions (anger, disappointment, anxiety, guilt and shame) and two ‘positive’ emotions (happiness and pride).

Happiness (i.e. ‘happy’), anger (i.e. ‘angry’) and disappointment (i.e. ‘disappointing’) are basic emotions and are drawn from Belschak and Den Hartog (2009), who showed the importance of employees’ emotions after receiving supervisor feedback. Furthermore, anger and disappointment are referred to in the ‘Job Emotion Scale’, or ‘JES’ (Fisher, 1998).
Both emotions were negatively related to overall job satisfaction. In turn, job satisfaction appears to be negatively related to turnover intentions (Irvine & Evans, 1995; Mobley, 1977; Tnay, Othman, Siong, & Lim, 2013). In addition, disappointment is positively related to intentions to leave one’s job (Grandey, Tam & Brauburger, 2002). Anxiety (i.e. ‘anxious’) was drawn from Côté and Morgan (2002), who found that this emotion is positively related to intentions to quit. The last three emotions (pride, guilt and shame) are drawn from Belschak and Den Hartog (2009), who refer to them as self-conscious emotions, relating to our sense of self and our consciousness of others' reactions to us. Pride (i.e. ‘proud’) is also referred to in the ‘JES’, where it is positively related to overall job satisfaction. Guilt (i.e. ‘guilty’) and shame (i.e. ‘ashamed’) were regarded as potential negative reactions to supervisor feedback (Belschak & Den Hartog, 2009). These seven emotions appeared to be relevant and were therefore included in the scale.

Every emotion was represented by two items in the actual scale, meaning the scale consisted of 14 items. The two items per emotion were two different synonyms for that particular emotion. The synonyms, taken from ‘thesaurus.com’, were as follows: happiness (joyful, pleased), anger (furious, enraged), disappointment (discouraged, disillusioned), anxiety (uncertain, concerned), pride (respected, worthy), guilt (responsible, convicted) and shame (regretful, apologetic). The order in which the items occurred in the scale was randomized once with a random sequence generator and was the same for all participants. Participants were asked to indicate how intensely they experienced the different emotions as a reaction to reading the evaluation belonging to their condition on 7-point scales ranging from 1 (extremely weak) to 7 (extremely strong).
These answer possibilities were drawn from Belschak and Den Hartog (2009), who also measured the intensity of emotions in their research.

Since each emotion in the scale is represented by two synonyms, it is important to determine whether or not these two synonyms (per emotion) are consistent and actually measure the same construct. Table 2 shows the correlations of the synonyms for every emotion. The correlation of .92 between joyful and pleased shows, for example, that when one feels joyful after a positive evaluation, one is very likely to also feel pleased. Moreover, it could indicate that when, following a negative evaluation, one does not feel joyful, one is also very likely to not feel pleased. Note that the correlation between responsible and convicted is negative (r = -.23). This indicates that there is no consistency between the scores on the two items. Thus, if one feels responsible after receiving a negative evaluation, one does not necessarily feel convicted. On the contrary, the negative correlation indicates that one is likely to not feel convicted if one feels responsible, although this correlation is not strong. Perhaps this could be explained by the item responsible being open to multiple interpretations. One can feel responsible after receiving a positive evaluation. However, one can also feel responsible after a negative evaluation. Other items are not ambiguous to the same degree as the item responsible.

Table 2
Correlation between Synonyms (per Emotion)

Emotion          Synonyms                     Correlation (r)
Happiness        Joyful, Pleased              .92
Anger            Furious, Enraged             .72
Disappointment   Discouraged, Disillusioned   .72
Anxiety          Uncertain, Concerned         .70
Pride            Respected, Worthy            .89
Guilt            Responsible, Convicted       -.23
Shame            Regretful, Apologetic        .54

The internal consistency of the items in the scale was assessed by measuring Cronbach’s alpha. To be able to analyse the consistency of the complete scale, the positive items were reversed for this measure.
This involved reversing the items joyful, pleased, respected and worthy into, respectively, not joyful, not pleased, not respected and not worthy. These four computed items plus the ten other (negative) items were analysed to measure Cronbach’s alpha (α = .94). The Cronbach’s alpha of the scale could be increased by deleting the item responsible from the scale (α = .96).

Furthermore, data analysis showed that two subscales could be identified. The first subscale included the positive emotions pride and happiness, represented respectively by the items respected and worthy, and joyful and pleased (α = .97). The Cronbach’s alpha of this ‘positive job emotions’ subscale could not be increased by deleting one of the items. Table 3 shows the inter-item correlation matrix of the four items belonging to the ‘positive job emotions’ subscale. It shows, for example, the correlation between respected and joyful (r = .86). This correlation indicates that when one feels respected after an evaluation, one is also very likely to feel joyful.

Table 3
Inter-Item Correlation Matrix: Positive Items

Items       Worthy   Respected   Pleased   Joyful
Worthy      1.00     .89         .92       .88
Respected   .89      1.00        .91       .86
Pleased     .92      .91         1.00      .92
Joyful      .88      .86         .92       1.00

The second subscale that could be identified included the negative emotions anger (furious, enraged), disappointment (discouraged, disillusioned), anxiety (uncertain, concerned), guilt (responsible, convicted) and shame (regretful, apologetic) (α = .88). The Cronbach’s alpha of this ‘negative job emotions’ subscale could be increased to .92 by deleting the responsible item. Table 4 shows the inter-item correlation matrix of the ten items belonging to the ‘negative job emotions’ subscale. It shows, for example, that uncertain and furious have a correlation of .71. Again, note that responsible correlates negatively with eight other items.
This indicates that participants most likely thought of responsible as a positive item, contrary to the intended use of the item.

Table 4
Item-Item Correlation Matrix – Negative Items

Items   1      2      3      4      5      6      7      8      9      10
1       1.00   .44    -.28   .58    .60    .57    .72    .69    .48    .46
2       .44    1.00   .08    .38    .25    .40    .22    .35    .54    .27
3       -.28   .08    1.00   -.10   -.46   -.31   -.52   -.44   -.10   -.23
4       .58    .38    -.10   1.00   .62    .70    .60    .63    .58    .47
5       .60    .25    -.46   .62    1.00   .83    .71    .72    .58    .47
6       .57    .40    -.31   .70    .83    1.00   .71    .75    .69    .45
7       .72    .22    -.52   .60    .71    .71    1.00   .86    .56    .50
8       .69    .35    -.44   .63    .72    .75    .86    1.00   .64    .47
9       .48    .54    -.10   .58    .58    .69    .56    .64    1.00   .38
10      .46    .27    -.23   .47    .47    .45    .50    .47    .38    1.00

Note: The numbers represent the ten items/synonyms as follows: 1 = Enraged, 2 = Apologetic, 3 = Responsible, 4 = Concerned, 5 = Discouraged, 6 = Uncertain, 7 = Furious, 8 = Disillusioned, 9 = Regretful, 10 = Convicted.

2.3 Design and Procedure

The experiment used a 2 (positive vs. negative feedback) × 2 (human generated text vs. computer generated text) design. A between-group design was used, meaning that participants were randomly assigned to one of the four text-conditions. After being placed in one of the four conditions, participants were presented with a case in which they were asked to imagine being an employee of a large organisation, for which they had been working for some time (see Appendices F and G). Next, they received the text belonging to their condition, which contained the evaluation of their performance in the organization. After reading the text, participants were asked to complete the fourteen items and additional questions. These additional questions covered age, gender and education. Furthermore, a question was included asking if they had worked, or worked at that time, for an organization or company. Finally, participants were asked if they had ever received a job evaluation before.
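The random assignment to the four text-conditions described above can be illustrated with a minimal sketch. This is not the procedure used in the study (assignment was handled by the survey software); the function name, the participant numbering and the seed are hypothetical, and simple (unblocked) randomization is assumed, which is why the conditions may end up with unequal sizes.

```python
import random
from collections import Counter
from itertools import product

# The two factors of the 2 x 2 between-group design.
VALENCES = ("positive", "negative")
SOURCES = ("human", "computer")
CONDITIONS = list(product(VALENCES, SOURCES))  # four text-conditions

def assign_conditions(participant_ids, seed=None):
    """Assign every participant to exactly one condition (hypothetical helper).

    Assumes simple randomization: each participant is drawn independently,
    so condition sizes are not forced to be equal.
    """
    rng = random.Random(seed)
    return {pid: rng.choice(CONDITIONS) for pid in participant_ids}

# 115 participants, as in the study; the ids and the seed are illustrative only.
assignment = assign_conditions(range(1, 116), seed=2014)
counts = Counter(assignment.values())
for condition, n in sorted(counts.items()):
    print(condition, n)
```

Because each participant sees only one evaluation text, differences between conditions are compared between groups rather than within persons.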
2.4 Pre-test

A pre-test (N = 5) conducted among students of Tilburg University yielded some valuable information. All participants completed the reading of the evaluation and the questionnaire in 6 minutes or less. After completing the questionnaire, participants were asked to indicate parts or words that were clear and parts or words that were unclear or confusing. Remarks included the further clarification of the computer condition parts by showing certain words, such as ‘computer written’, in bold, adding certain words, or repeating certain words. For example, one participant proposed to move the last sentence of the computer conditions (‘this evaluation is automatically generated by our HR evaluation systems’) up and show it in bold. Furthermore, the evaluations contained the words ‘in your persuasiveness’ (“in your persuasiveness you do not provide enough arguments…”). Some participants found this difficult to understand in this context or thought others would find this difficult to understand. These three words were therefore excluded from the actual research material. Prior to the pre-test, two emotions (pride and guilt) had three possible synonyms. Pride was represented by honoured, respected and worthy. Guilt was represented by responsible, convicted and accused. Following the pre-test and the remarks of the participants, it was decided to use respected and worthy, as well as responsible and convicted, in the questionnaire. One participant stated that honoured felt strange in the context of employee evaluations. Moreover, two participants found accused not appropriate in this context. When asked if there was anything they were unsure about, two of the five participants referred to the answer possibilities of the items. These participants stated that ‘extremely weak’ implies that there is still some intensity of that particular emotion in the participant.
After discussion, they thought it would be clearer if ‘not at all’ were added after ‘extremely weak’ and ‘very much’ were added after ‘extremely strong’. This would further clarify that ‘weak’ means a lower intensity and ‘strong’ means a higher intensity. After these comments it was decided to add both ‘not at all’ and ‘very much’ in parentheses to avoid uncertainty among participants.

2.5 Data Analysis

The data in this research was collected with ‘Qualtrics’, a web-based survey service. This program was also used to create the survey, including the items and additional questions. After collection, the data was analysed using the statistical analysis program ‘SPSS’ (Statistical Package for the Social Sciences). The Cronbach’s alpha of the scale and the subscales was computed by carrying out reliability analyses of the relevant data. Furthermore, independent t-tests were performed to analyse the scores on the items in the different conditions.

3. Results

In this chapter the results of the research are presented. Taking into account the three hypotheses, a distinction is made between the positive and negative text-conditions. First, the item scores in the negative conditions are compared and presented. Secondly, the item scores in the positive conditions are compared and presented. Thirdly, the scores on the two items per emotion are added up and divided by 2 to represent the seven emotions, and compared in all four conditions. Finally, five variables are taken into account and tested. These variables include gender, work experience, whether or not a participant had already received a job evaluation before, age and education.

3.1 Negative Conditions

Table 5 shows the mean item scores and standard deviations in the computer negative condition and the human negative condition.
The item uncertain showed a significant difference between the computer negative (M=5.28, SD=1.25) and human negative (M=5.96, SD=0.84) conditions; t(56)=-2.35, p<.05, with more uncertainty in the human negative text-condition. This indicates that negative evaluations generated by a human (i.e. a supervisor) arouse more uncertainty among receivers of the evaluation than negative evaluations generated by a computer. The scores on the other items did not differ significantly between the computer negative and human negative conditions.

Table 5
Comparison of the Item Scores (Mean and SD) in the Computer Negative and Human Negative Conditions

Items            CN             HN
Joyful           1.63 (.98)     1.46 (.71)
Pleased          1.88 (1.07)    1.42 (.70)
Furious          4.69 (1.15)    4.42 (1.90)
Enraged          4.81 (1.28)    4.46 (1.61)
Discouraged      4.97 (1.60)    5.50 (1.33)
Disillusioned    4.59 (1.16)    5.04 (1.51)
Uncertain        5.28 (1.25)    5.96 (.87)*
Concerned        5.75 (1.11)    5.62 (1.65)
Respected        2.19 (1.26)    1.88 (1.03)
Worthy           2.44 (1.43)    1.92 (.80)
Responsible      4.22 (1.41)    4.04 (1.99)
Convicted        4.53 (1.24)    4.42 (1.47)
Regretful        4.25 (1.46)    4.12 (1.40)
Apologetic       3.97 (1.49)    3.58 (1.60)

Note: If there is a significant difference, the highest score is marked with an asterisk (*). CN stands for computer negative, HN stands for human negative.

3.2 Positive Conditions

Data analysis showed that the scores of nine items differed significantly between the computer positive condition and the human positive condition. Table 6 shows the mean item scores and standard deviations in the computer positive condition and the human positive condition. Joyful was higher in the human positive condition (M=5.61, SD=0.92) compared to the computer positive condition (M=4.55, SD=1.84); t(55)=-2.72, p<.01. This indicates that in the human positive condition one is likely to experience more joy following an evaluation compared to the computer positive condition.
Pleased was higher in the human positive condition (M=6.14, SD=0.71) compared to the computer positive condition (M=4.93, SD=1.65); t(55)=-3.59, p<.005. Similar to the item joyful, participants felt more pleased in the human positive condition compared to the computer positive condition. Furious was higher in the computer positive condition (M=2.28, SD=1.44) compared to the human positive condition (M=1.50, SD=0.64); t(55)=2.62, p<.05. Even though the difference between the two conditions on the item furious is significant, this difference seems of minor importance given the low intensity of furious in both conditions. Discouraged was higher in the computer positive condition (M=3.07, SD=1.58) compared to the human positive condition (M=1.86, SD=0.89); t(55)=3.55, p<.005. The meaning of this significant difference is also debatable given the low scores on the item discouraged in both conditions. Disillusioned was higher in the computer positive condition (M=2.86, SD=1.57) compared to the human positive condition (M=1.57, SD=0.69); t(55)=3.80, p<.001. Again, this difference seems of minor importance considering the low scores in both conditions. Uncertain was higher in the computer positive condition (M=2.90, SD=1.40) compared to the human positive condition (M=2.11, SD=0.99); t(55)=2.45, p<.05. This difference also seems relatively meaningless, since the scores on the item uncertain are low in both conditions. Respected was higher in the human positive condition (M=5.68, SD=0.91) compared to the computer positive condition (M=4.55, SD=1.74); t(55)=-3.05, p<.005. This result indicates that one is likely to feel more respected after a human generated positive evaluation compared to a computer generated positive evaluation.
Worthy was higher in the human positive condition (M=5.75, SD=0.84) compared to the computer positive condition (M=4.86, SD=1.66); t(55)=-2.53, p<.05. Similar to the item respected, this indicates that one is likely to feel more worthy following a human generated evaluation compared to a computer generated evaluation. Finally, responsible was higher in the human positive condition (M=5.14, SD=0.89) compared to the computer positive condition (M=4.21, SD=1.50); t(55)=-2.86, p<.01. However, as the scale analyses already indicated, the scores on the item responsible are not consistent, meaning that the results on this item should be treated with caution or not be taken into account.

Table 6
Comparison of the Item Scores (Mean and SD) in the Computer Positive and Human Positive Conditions

Items            CP              HP
Joyful           4.55 (1.84)     5.61 (.92)*
Pleased          4.93 (1.65)     6.14 (.71)*
Furious          2.28 (1.44)*    1.50 (.64)
Enraged          2.66 (1.47)     2.39 (1.79)
Discouraged      3.07 (1.58)*    1.86 (.89)
Disillusioned    2.86 (1.57)*    1.57 (.69)
Uncertain        2.90 (1.40)*    2.11 (.99)
Concerned        2.93 (1.49)     2.61 (1.55)
Respected        4.55 (1.74)     5.68 (.91)*
Worthy           4.86 (1.66)     5.75 (.84)*
Responsible      4.21 (1.50)     5.14 (.89)*
Convicted        3.31 (1.33)     3.00 (1.66)
Regretful        2.28 (1.33)     1.75 (.93)
Apologetic       2.90 (1.57)     2.36 (1.55)

Note: If there is a significant difference, the highest score is marked with an asterisk (*). CP stands for computer positive, HP stands for human positive.

3.3 Emotions

In the previous sections each synonym was presented on its own and compared in the negative conditions as well as the positive conditions. In this section the same comparisons are carried out. However, instead of the fourteen synonyms, the seven emotions are used in this section. Note that a score on an emotion is computed by adding up the scores on the two synonyms/items belonging to that emotion and dividing that number by 2.
The emotion happiness, for example, is computed by adding up the scores on the items pleased and joyful and dividing the outcome by 2.

3.3.1 Negative Conditions II

Data analysis showed that there were no significant differences on the seven emotions between the computer negative condition and the human negative condition. Table 7 shows the summed scores divided by 2 (mean and SD) on the emotions in the computer negative and human negative conditions.

Table 7
Comparison of the Emotion Scores (Mean and SD) in the Computer Negative and the Human Negative Conditions

Emotions          CN             HN
Happiness         1.75 (.94)     1.44 (1.67)
Anger             4.75 (1.03)    4.44 (1.63)
Disappointment    4.78 (1.09)    5.27 (1.05)
Anxiety           5.52 (1.04)    5.79 (1.05)
Pride             2.31 (1.20)    1.90 (.78)
Guilt             4.38 (.83)     4.23 (1.12)
Shame             4.11 (1.23)    3.85 (1.24)

Note: If there is a significant difference, the highest score is marked with an asterisk (*). CN stands for computer negative, HN stands for human negative.

3.3.2 Positive Conditions II

Table 8 shows the summed scores divided by 2 (mean and SD) on the emotions in the computer positive condition and the human positive condition. Data analysis showed that the scores of three emotions differed significantly between the computer positive condition and the human positive condition. Happiness was higher in the human positive condition (M=5.88, SD=.70) compared to the computer positive condition (M=4.73, SD=1.64); t(55)=-3.37, p<.005. This indicates that one is likely to experience more happiness following a human generated positive evaluation compared to a computer generated positive evaluation. Disappointment was higher in the computer positive condition (M=2.97, SD=1.50) compared to the human positive condition (M=1.71, SD=.69); t(55)=4.01, p<.001. However, the meaning of the scores on the emotion disappointment seems questionable given the low intensity the scores indicate.
Finally, pride was higher in the human positive condition (M=5.71, SD=.81) compared to the computer positive condition (M=4.71, SD=1.62); t(55)=-2.96, p<.01. This indicates that one is likely to experience more pride after receiving a human generated positive evaluation compared to a computer generated positive evaluation.

Table 8
Comparison of the Emotion Scores (Mean and SD) in the Computer Positive and the Human Positive Conditions

Emotions          CP              HP
Happiness         4.74 (1.64)     5.88 (.70)*
Anger             2.47 (1.30)     1.95 (1.07)
Disappointment    2.97 (1.50)*    1.71 (.69)
Anxiety           2.91 (1.19)     2.36 (.95)
Pride             4.71 (1.62)     5.71 (.81)*
Guilt             3.76 (.82)      4.07 (.99)
Shame             2.59 (1.22)     2.05 (1.12)

Note: If there is a significant difference, the highest score is marked with an asterisk (*). CP stands for computer positive, HP stands for human positive.

3.4 Gender

The questionnaire included a question that asked people to indicate their gender. While analysing the data, gender was used as a variable. First, analyses were carried out while controlling for the variable gender on all conditions taken together. These analyses showed no significant differences. Secondly, analyses were carried out on each of the four conditions separately. The variable gender showed a significant difference on the item concerned in the human positive condition. Women (M=3.05, SD=1.70) reported a higher intensity than men (M=1.67, SD=0.50); t(26)=-2.40, p<.05. This indicates that women were more concerned than men after receiving a positive human generated evaluation. However, it is questionable whether one can speak of the presence of any concern in this case since the scores are very low. Figure 1 contains a boxplot showing the difference between men and women on the item concerned in the human positive condition. The two boxes do not overlap, which is consistent with the significant difference found. The other items showed no significant differences in any of the conditions regarding the variable gender.
Figure 1. Boxplot of the gender difference on the item concerned in the human positive condition

3.5 Work Experience

One of the questions in the questionnaire concerned work experience. Participants were asked if they had ever worked, or worked at that time, for a company or organization. No significant differences were found between participants with (N=104) and without (N=11) work experience. However, it has to be noted that the research did not include enough participants without any previous work experience (N=11) to be able to make statements about possible correlations with work experience as a variable.

3.6 Previous Job Evaluation(s)

Another question in the questionnaire concerned experience with previous job evaluation(s). Participants were asked if they had ever received a job evaluation before. Analysed per condition, participants showed significantly different scores in one condition on one of the fourteen items when the variable previous job evaluation was taken into account. Participants in the computer positive condition who had received a job evaluation before reported a higher intensity of the item apologetic (M=3.67, SD=1.36) compared to participants who had never received a job evaluation before (M=2.07, SD=1.39); t(27)=3.15, p<.005. However, this result seems of little meaning, given the low scores on the item. Data analysis showed that this difference is the only significant one in all four conditions for the variable previous job evaluation(s). Table 9 shows the variable previous job evaluation(s) in relation to the significant score, as well as some interesting scores (mean and SD). Even though no additional significant differences were found, data analysis showed three other interesting and thought-provoking results. The first interesting result concerned the item uncertain in the human negative condition.
Participants who had not received a job evaluation before (M=6.16, SD=.83) showed a score that was, although not significant (p=.056), considerably higher than that of participants who had received a job evaluation before (M=5.43, SD=.79). This could indicate that if one has received a job evaluation before, one is likely to feel somewhat less uncertain after a negative human generated evaluation than those who have never received a job evaluation before. Furthermore, participants who had not received a job evaluation before (M=5.37, SD=1.38) showed a score on the item disillusioned that was, although not significant (p=.065), considerably higher than that of participants who had received a job evaluation before (M=4.14, SD=1.57). This could indicate that if one has received a job evaluation before, one is likely to feel somewhat less disillusioned after a negative human generated evaluation than those who have never received a job evaluation before. Finally, participants who had not received a job evaluation before (M=4.32, SD=1.26) showed a score on the item regretful that was, although not significant (p=.064), considerably higher than that of participants who had received a job evaluation before (M=3.29, SD=1.50). This could indicate that if one has received a job evaluation before, one is likely to feel somewhat less regretful after a negative human generated evaluation than those who have never received a job evaluation before. In addition to the fact that the three aforementioned results are not significant, it has to be noted that the human negative condition, containing these last three results, includes only 7 participants who had received a job evaluation before (and 19 participants who had not received a job evaluation before). Therefore, any claims regarding these results should be made with caution.
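As a worked illustration of the first of these comparisons, the t statistic for the item uncertain can be recomputed from the reported means, standard deviations and group sizes. This is a minimal sketch and not the original SPSS analysis: it assumes a pooled-variance (Student) independent t-test, the helper name pooled_t is hypothetical, and the critical value 2.064 is the standard two-tailed table value for 24 degrees of freedom at the .05 level.

```python
import math

def pooled_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Student's independent-samples t statistic from summary statistics.

    Assumes equal variances in both groups (pooled variance); returns the
    t value and the degrees of freedom.
    """
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / df
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / standard_error, df

# Item "uncertain" in the human negative condition, values as reported in the text:
# no previous job evaluation: M=6.16, SD=.83, n=19
# previous job evaluation:    M=5.43, SD=.79, n=7
t_value, df = pooled_t(6.16, 0.83, 19, 5.43, 0.79, 7)

# Two-tailed critical t for df = 24 at the .05 level (from a standard t table).
T_CRIT_DF24 = 2.064
significant = abs(t_value) > T_CRIT_DF24

print(f"t({df}) = {t_value:.2f}, significant at .05: {significant}")
```

Under these assumptions the statistic is roughly 2.01, just below the critical value, which agrees with the reported p = .056 and shows how much the small group of seven participants limits the power of this comparison.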
Table 9
Significant and Interesting Scores (Mean and SD) in the Four Conditions, Taking into Account the Variable ‘Previous Job Evaluation(s)’

                               Job Evaluation
Condition   Item              Yes             No
CP          Apologetic        3.67 (1.36)*    2.07 (1.39)
CN          -                 -               -
HP          -                 -               -
HN          Uncertain         5.43 (.79)      6.16 (.83)
HN          Disillusioned     4.14 (1.57)     5.37 (1.38)
HN          Regretful         3.29 (1.50)     4.42 (1.26)

Note: If there is a significant difference, the highest score is marked with an asterisk (*). CP stands for computer positive, CN stands for computer negative, HP stands for human positive, HN stands for human negative.

3.7 Age

The questionnaire included a question regarding participants’ age. However, the variable age could not be used to carry out statistical analyses due to a lack of variation in this variable. Table 1 (see Paragraph 2.1) displays the mean age and standard deviations for all participants as well as for the participants per condition.

3.8 Education

A question regarding participants’ education was also included in the questionnaire. Participants were asked what the highest level of education was that they took part in, or had already completed. The variable education could not be used to carry out statistical analyses due to a lack of variation in this variable.

4. Discussion

4.1 Hypotheses

The first hypothesis of this research argued that human generated evaluations have a greater impact on employees’ emotions compared to computer generated evaluations. However, this hypothesis did not distinguish between positive and negative evaluations or positive and negative emotions. Therefore, this hypothesis was elaborated on with the second and third hypotheses. The second hypothesis argued that negative evaluations lead to negative emotions that are significantly higher after a human generated evaluation than after a computer generated evaluation.
The third hypothesis proposed that positive evaluations do not lead to positive emotions that are higher in the human condition compared to the computer condition. To be able to verify or falsify the first hypothesis, the second and third hypotheses have to be taken into account first. The second hypothesis (“negative evaluations lead to negative emotions that are significantly higher after a human generated evaluation than after a computer generated evaluation”) could not be verified. Data analysis showed that for thirteen of the fourteen items, scores in the human negative and computer negative conditions did not differ significantly. Only the result on the item uncertain was in line with the second hypothesis, since participants’ scores in the human negative condition were higher compared to participants’ scores in the computer negative condition. A possible explanation of this result could be that participants doubted the ability of the person who generated the evaluation to construct an independent and objective evaluation. This result could be seen as a small supportive indication for the use of NLG systems, since such a system might be perceived as a more objective means of evaluating compared to human generated evaluations. The lack of other significant differences indicates that the other emotions are not experienced with a higher intensity in the human negative condition compared to the computer negative condition. Thus, these results are not in line with the second hypothesis. This could mean that negative emotions have the same intensity in the human negative and computer negative conditions. However, this result could also be explained in relation to the current research design (see Paragraph 4.3). The second hypothesis was based on the existence of a ‘direct link’ to a person in the human negative condition: a real link that includes employees having certain emotions, feelings or ideas regarding their boss or supervisor.
These emotions, feelings and ideas are only evoked in a social context of concrete persons (a real-life boss) with whom an employee has a relation (Levy & Williams, 2004). It was hypothesized that negative emotions would have a higher intensity because of this link to a person. However, this link was most likely non-existent, or only existent to a small extent, in this research, since the experimental situation created in this research was hypothetical. Participants were asked to imagine having a certain job with a boss who is not their boss in their real-life job. Therefore, the link to a ‘real’ boss or supervisor was most likely absent for most participants. The fact that the scores in the computer negative condition were not ‘worse’, i.e. did not show a higher intensity of the negative emotions, compared to the human negative condition could be viewed as a supportive result for the use of NLG systems. The fact that the scores on the negative items were the same in the computer negative and human negative conditions, and that uncertainty was even lower in the computer negative condition, might be an indication of the hypothesized effect of computer generated evaluations. Nevertheless, the second hypothesis could not be verified based on the results of the research. Interestingly, the third hypothesis (“positive evaluations do not lead to positive emotions that are higher in the human condition compared to the computer condition”) could also not be verified. This is interesting since this third hypothesis, concerning positive evaluations, was initially introduced in this research as a control to be able to interpret the results of the negative evaluation conditions. It was expected that scores on the positive emotion items would be the same in the computer and human positive conditions.
Data analysis showed that the four positive items in this research (joyful, pleased, respected and worthy), contrary to expectations, were higher in the human positive condition compared to the computer positive condition. These findings suggest that one is likely to experience these positive emotions more intensely following a human generated evaluation compared to a computer generated evaluation. However, these differences could also be explained in relation to the research design (see also Paragraph 4.3). The aforementioned idea of a link between an employee and an employer perceived by the employee could work differently after positive and negative evaluations. While after a negative evaluation this link might be absent, it might be present after a positive evaluation. Even though the positive conditions in this research also did not include a link to an actual real-life boss, it is possible that participants in the human positive condition did perceive this link to be present to some extent. Receiving a positive evaluation in a social context could partly explain the higher scores on the positive items in the human positive condition compared to the computer positive condition. Furthermore, participants in the computer positive condition could have been too focused on the idea that their evaluation was written by a computer. Participants might already have had a negative attitude towards computers or computer written evaluations, or developed one after they had received their evaluation. They may have been aware that the focus of their evaluation, and the research in general, was on the generator of the evaluation, in their case a computer. This could have affected the item scores of some of the participants in a negative way. The analysis of the positive evaluations also showed that some negative items (furious, discouraged, disillusioned and uncertain) were higher in the computer positive condition compared to the human positive condition.
This indicates that these negative emotions are likely to be more present after computer generated positive evaluations compared to human generated positive evaluations. The fact that such a positive evaluation is written by a computer may have aroused these negative emotions in participants. However, the meaning of these significant results seems to be negligible since the scores on the items were very low. Therefore, it is questionable whether these emotions are even present and experienced by the participants. After discussing the second and third hypotheses, it is possible to evaluate the first hypothesis (“human generated evaluations have a greater impact on employees’ emotions compared to computer generated evaluations”). This effect was expected to exist following negative evaluations and negative emotions. Moreover, this effect was expected to not exist following positive evaluations and positive emotions. However, some remarks have to be made regarding the first hypothesis. Taking into account the results on the negative emotions and negative evaluations, it has to be noted that only one of the negative items was significantly higher in the human generated condition compared to the computer generated condition. The scores on the other negative items were not significantly higher in the human negative condition compared to the computer negative condition. Contrary to the expectation of the third hypothesis, results showed that the four positive items used in this research were significantly higher in the human positive condition compared to the computer positive condition. Even though this last finding is not in line with the third hypothesis, it is in fact in line with the encompassing first hypothesis.

4.2 Other Variables

In addition, questions were added in this research that controlled for the variables gender, work experience, whether or not a participant had already received a job evaluation before, age and education.
The variables work experience, age and education could not be used to carry out statistical analyses due to a lack of variation within these variables. Most participants worked, or had already worked, at a company or organization, were between 18 and 25 years old and were following, or had finished, a university study. The variables gender and job evaluation, however, were usable and controlled for. The variable gender showed a significant difference on the item concerned in the human positive condition. It appeared that women scored higher on the concerned item than men in this condition. This could indicate that women generally are more concerned following an evaluation. However, no other significant results on this item were found between men and women in the other conditions. Moreover, the scores of the significant result on the item concerned were very low. Therefore, stating that this negative emotion is present following positive human generated evaluations is questionable and should be done with caution. Controlling for the variable job evaluation, i.e. whether or not a participant had already received a job evaluation before, resulted in one significant result. This result concerned the item apologetic in the computer positive condition. It appeared that participants who had already received a job evaluation before scored higher on this item compared to participants who had never received a job evaluation before. There does not seem to be a clear explanation for this significant result. Moreover, the low scores on this item make it difficult to prove the actual presence of this emotion.

4.3 Research Design

The findings in this research are subject to at least three limitations. First, the dependent variables in this research were human emotions. As previous research has already made clear, whether it concerns consumers’ emotions (Richins, 1997) or in this case employees’ emotions, measuring emotions is a difficult matter (Scherer, 2005).
One of the ways of measuring emotions is self-reporting. This means of measuring emotions was used in this research. Even though self-reports of emotions are likely to be valid, there are some concerns that not all individuals are aware of, and/or capable of reporting on, their momentary emotional states (Mauss & Robinson, 2009). The scale was specifically designed for this research. Therefore, all collected data was relevant and could be used for data analysis. As already mentioned, the scale included fourteen items, representing seven emotions, that were relevant in the context of employee evaluations. However, it is possible that some participants did not understand all of the emotions and terms used in the questionnaire, which included the scale and additional questions. In addition, items and their emotions could have been interpreted differently by participants. The item responsible, which correlated negatively with most other items, is an example of the different interpretations participants can have. Even though a pre-test was conducted to identify uncertainties and problematic areas in the questionnaire, such problems can never be completely ruled out. Nevertheless, disregarding the item responsible, items in the scale appeared to have a high correlation and the consistency of the scale was high. Furthermore, two subscales (positive job emotions scale & negative job emotions scale) could be identified. Finally, the participants in this research were mainly university students. This research population was chosen taking into account the availability of the participants and the feasibility of the research. However, given the subject of the research, employee evaluations, it has to be noted that this research population might not be the best suited. In general, students have little work experience in organizations and little experience with job evaluations. As the data showed, many participants had never received a job evaluation before (see Table 1).
Therefore, students that participated in this research might have had difficulties with the scenario presented to them, which included working for an organization and receiving an evaluation.

5. Conclusion and Recommendations

Recent developments in the field of Natural Language Generation show the enormous potential of NLG systems. In addition to the generation of textual weather forecasts, these systems are for example also being used in the health sector where they function as a decision-support aid for medical professionals. The current study introduced a new domain in which NLG systems might be useful in the future. This domain included performance appraisals or employee evaluations, and potentially even more human resource processes.

The present study was designed to determine the effects on the emotions of participants following computer generated employee evaluations compared to the more widely known and used human generated employee evaluations. Negative as well as positive computer and human generated evaluations were taken into account. It was expected that computer generated negative evaluations would evoke negative emotions with a lower intensity than human generated evaluations. However, results showed that, for almost all items used in this research, negative evaluations had the same effect on participants’ emotions regardless of whether they were generated by a computer or a human. Positive evaluations appeared to evoke positive emotions (joyful, pleased, respected and worthy) with a lower intensity, and four negative emotion-items with a higher intensity, when the evaluation was generated by a computer and not by a human. However, the scores on these negative items were very low.

Do these findings indicate that computer generated evaluations have no future? No. A previous study has shown that people preferred computer generated texts over human generated texts, partly because of better word choice (Reiter, Sripada, Hunter, Yu & Davy, 2005).
These computer generated texts were, contrary to the texts used in the current research, actually generated by an NLG system. This system and most other NLG systems can generate actual words, phrases, grammar and other linguistic features. Taking into account that the computer and human evaluation texts in the current study were all generated by a human, and text as such was not a variable, results may have been different if computer texts were indeed generated by an NLG system and included these different linguistic features. Further research is needed to test whether computer generated evaluations that are truly generated by an NLG system have advantages over human generated evaluations.

Furthermore, results of the current research might have been different with a different research population. While the current study included students, further research is recommended to include actual employees of an organization. This would be beneficial compared to the current research, as such a study could include evaluations and situations that are real, and not hypothetical. Future research could take into account real employees that have a real relation, including emotions and feelings, with a boss or supervisor.

A final recommendation for further research is to control for variables such as gender, age, education, work experience and experience with job evaluations. While these variables were taken into account in this research, a lack of variation within some of these variables made it difficult to carry out statistical analyses on them.

References

Baumeister, R. F., Bratslavsky, E., Finkenauer, C., & Vohs, K. D. (2001). Bad is stronger than good. Review of General Psychology, 5(4), 323.
Belschak, F. D., & Den Hartog, D. N. (2009). Consequences of Positive and Negative Feedback: The Impact on Emotions and Extra-Role Behaviors. Applied Psychology, 58(2), 274-303.
Bernardin, H. J., & Beatty, R. W. (1984).
Performance appraisal: Assessing human behavior at work. Boston: Kent Publishing Company.
Bourbeau, L., Carcagno, D., Goldberg, E., Kittredge, R., & Polguere, A. (1990, August). Bilingual generation of weather forecasts in an operations environment. In Proceedings of the 13th conference on Computational linguistics-Volume 3 (pp. 318-320). Association for Computational Linguistics.
Côté, S., & Morgan, L. M. (2002). A longitudinal analysis of the association between emotion regulation, job satisfaction, and intentions to quit. Journal of Organizational Behavior, 23(8), 947-962.
Ferris, G. R., Mitchell, T. R., Canavan, P. J., Frink, D. D., & Hopper, H. (1995). Accountability in human resource systems. In G. R. Ferris, S. D. Rosen, & D. T. Barnum (Eds.), Handbook of human resource management (pp. 175-196). Oxford, UK: Blackwell Publishers.
Ferris, G. R., Munyon, T. P., Basik, K., & Buckley, M. R. (2008). The performance evaluation context: Social, emotional, cognitive, political, and relationship components. Human Resource Management Review, 18(3), 146-163.
Fisher, C. D. (1998). Mood and emotions while working-missing pieces of job satisfaction. School of Business Discussion Papers, 64.
Grandey, A. A., Tam, A. P., & Brauburger, A. L. (2002). Affective states and traits in the workplace: Diary and survey data from young workers. Motivation and Emotion, 26(1), 31-55.
Grote, R. C. (2002). The performance appraisal question and answer book: A survival guide for managers. AMACOM Div American Mgmt Assn.
Irvine, D. M., & Evans, M. G. (1995). Job satisfaction and turnover among nurses: integrating research findings across studies. Nursing Research, 44(4), 246-253.
Korsgaard, M. A., & Roberson, L. (1995). Procedural justice in performance evaluation: The role of instrumental and non-instrumental voice in performance appraisal discussions. Journal of Management, 21(4), 657-669.
Lazarus, R. S. (1991). Emotion and adaptation. New York: Oxford University Press.
Levy, P.
E., & Williams, J. R. (2004). The social context of performance appraisal: A review and framework for the future. Journal of Management, 30(6), 881-905.
Liden, R. C., & Mitchell, T. R. (1985). Reactions to feedback: The role of attributions. Academy of Management Journal, 28, 291-308.
London, M. (2012). Job feedback: Giving, seeking, and using feedback for performance improvement. Psychology Press.
Mauss, I. B., & Robinson, M. D. (2009). Measures of emotion: A review. Cognition and Emotion, 23(2), 209-237.
Mobley, W. H. (1977). Intermediate linkages in the relationship between job satisfaction and employee turnover. Journal of Applied Psychology, 62(2), 237.
Murphy, T. H., & Margulies, J. (2004, March). Performance appraisals. In Presentation, ABA Labor and Employment Law Section, Equal Employment Opportunity Committee, Mid-Winter Meeting.
Oving, R. (2014). Een derde van de werknemers niet tevreden over het beoordelingsgesprek. Metronieuws. Retrieved from: http://www.metronieuws.nl/nieuws/een-derde-van-dewerknemers-niet-tevreden-over-het-beoordelingsgesprek/SrZncc!Q9dSpkFEE6k76/
Peretz, H., & Fried, Y. (2012). National cultures, performance appraisal practices, and organizational absenteeism and turnover: A study across 21 countries. Journal of Applied Psychology, 97(2), 448.
Portet, F., Reiter, E., Gatt, A., Hunter, J., Sripada, S., Freer, Y., & Sykes, C. (2009). Automatic generation of textual summaries from neonatal intensive care data. Artificial Intelligence, 173(7), 789-816.
Prue, D. M., & Fairbank, J. A. (1981). Performance feedback in organizational behavior management: A review. Journal of Organizational Behavior Management, 3(1), 1-16.
Reiter, E., & Dale, R. (1997). Building applied natural language generation systems. Natural Language Engineering, 3(1), 57-87. doi:10.1017/s1351324997001502
Reiter, E., Robertson, R., & Osman, L. M. (2003). Lessons from a Failure: Generating Tailored Smoking Cessation Letters. Artificial Intelligence, 144, 41-58.
Reiter, E., Sripada, S., Hunter, J., Yu, J., & Davy, I. (2005). Choosing words in computer-generated weather forecasts. Artificial Intelligence, 167(1), 137-169.
Richins, M. L. (1997). Measuring emotions in the consumption experience. Journal of Consumer Research, 24(2), 127-146.
Scherer, K. R. (2005). What are emotions? And how can they be measured? Social Science Information, 44(4), 695-729.
Sripada, S. G., Reiter, E., & Hawizy, L. (2002). Evaluation of an NLG system using post-edit data: Lessons learnt. WEATHER, 5, 7.
Tnay, E., Othman, A. E. A., Siong, H. C., & Lim, S. L. O. (2013). The Influences of Job Satisfaction and Organizational Commitment on Turnover Intention. Procedia-Social and Behavioral Sciences, 97, 201-208.

Appendices

Appendix A
Negative human generated evaluation text.

May 1, 2014
Tilburg University
Warandelaan 2, Tilburg
The Netherlands
Policy Department

Concerning: Performance appraisal.
Period: 01-01-14 – 01-05-14.

Dear employee,

You have been an employee at Tilburg University for four months now and this is your first evaluation as a policy advisor at the organization. You are assessed on the competencies: conceptual ability, persuasiveness, writing qualities and result orientation/decisiveness. Regarding conceptual ability, it stands out that you still have difficulties extracting the bigger picture from information and translating it to new, useful ideas. Also, your ability of making links appears to be insufficient so far. Even though you are enthusiastic, you do not provide enough arguments and you anticipate insufficiently on others’ reactions. Your writing qualities have improved gradually. However, you still fall short regarding structure and clearness. Your result orientation and decisiveness remain unsatisfactory because your approach and planning are not concrete enough. For example, you do not address others after achieved or disappointing results.
In conclusion: Based on your functioning the last four months, you do not meet the requirements that are set for the function of policy maker at Tilburg University.

Evaluator: C.J.A.A. Verhey
Head of Policy Department
Tilburg University
Signature:

Appendix B
Negative computer generated evaluation text.

May 1, 2014
Tilburg University
Warandelaan 2, Tilburg
The Netherlands
Policy Department

Concerning: Automatically computer written performance appraisal.
Period: 01-01-14 – 01-05-14.

Dear employee,

You have been an employee at Tilburg University for four months now and this is your first evaluation as a policy advisor at the organization. You are assessed on the competencies: conceptual ability, persuasiveness, writing qualities and result orientation/decisiveness. Regarding conceptual ability, it stands out that you still have difficulties extracting the bigger picture from information and translating it to new, useful ideas. Also, your ability of making links appears to be insufficient so far. Even though you are enthusiastic, you do not provide enough arguments and you anticipate insufficiently on others’ reactions. Your writing qualities have improved gradually. However, you still fall short regarding structure and clearness. Your result orientation and decisiveness remain unsatisfactory because your approach and planning are not concrete enough. For example, you do not address others after achieved or disappointing results.

In conclusion: Based on your functioning the last four months, you do not meet the requirements that are set for the function of policy maker at Tilburg University.

This evaluation is automatically generated by our HR computer evaluation system

Appendix C
Positive human generated evaluation text.

May 1, 2014
Tilburg University
Warandelaan 2, Tilburg
The Netherlands
Policy Department

Concerning: Performance appraisal.
Period: 01-01-14 – 01-05-14.

Dear employee,

You have been an employee at Tilburg University for four months now and this is your first evaluation as a policy advisor at the organization. You are assessed on the competencies: conceptual ability, persuasiveness, writing qualities and result orientation/decisiveness. Regarding conceptual ability, it stands out that you are comfortable extracting the bigger picture from information and translating it to new, useful ideas. Also, your ability of making links appears to be excellent so far. In addition to your enthusiasm, you provide satisfactory arguments and anticipate sufficiently on others’ reactions. Your writing qualities have improved gradually. Moreover, you have improved regarding structure and clearness. Your result orientation and decisiveness are satisfactory. This is apparent from your approach and planning which are very concrete. For example, you address others after achieved or disappointing results.

In conclusion: Based on your functioning the last four months, you meet the requirements that are set for the function of policy maker at Tilburg University.

Evaluator: C.J.A.A. Verhey
Head of Policy Department
Tilburg University
Signature:

Appendix D
Positive computer generated evaluation text.

May 1, 2014
Tilburg University
Warandelaan 2, Tilburg
The Netherlands
Policy Department

Concerning: Automatically computer written performance appraisal.
Period: 01-01-14 – 01-05-14.

Dear employee,

You have been an employee at Tilburg University for four months now and this is your first evaluation as a policy advisor at the organization. You are assessed on the competencies: conceptual ability, persuasiveness, writing qualities and result orientation/decisiveness. Regarding conceptual ability, it stands out that you are comfortable extracting the bigger picture from information and translating it to new, useful ideas. Also, your ability of making links appears to be excellent so far.
In addition to your enthusiasm, you provide satisfactory arguments and anticipate sufficiently on others’ reactions. Your writing qualities have improved gradually. Moreover, you have improved regarding structure and clearness. Your result orientation and decisiveness are satisfactory. This is apparent from your approach and planning which are very concrete. For example, you address others after achieved or disappointing results.

In conclusion: Based on your functioning the last four months, you meet the requirements that are set for the function of policy maker at Tilburg University.

This evaluation is automatically generated by our HR computer evaluation system

Appendix E
Scale/Questionnaire

Please complete this questionnaire consisting of 14 items. Indicate the intensity with which you experience the following emotions after receiving your evaluation.

(1) This evaluation makes me feel worthy
Extremely weakly (not at all) 1 2 3 4 5 6 7 Extremely strongly (very much)
(2) This evaluation enrages me
Extremely weakly (“) 1 2 3 4 5 6 7 Extremely strongly (“)
(3) This evaluation makes me feel apologetic
Extremely weakly (“) 1 2 3 4 5 6 7 Extremely strongly (“)
(4) This evaluation makes me feel regretful
Extremely weakly (“) 1 2 3 4 5 6 7 Extremely strongly (“)
(5) This evaluation makes me feel disillusioned
Extremely weakly (“) 1 2 3 4 5 6 7 Extremely strongly (“)
(6) This evaluation makes me furious
Extremely weakly (“) 1 2 3 4 5 6 7 Extremely strongly (“)
(7) This evaluation makes me feel uncertain
Extremely weakly (“) 1 2 3 4 5 6 7 Extremely strongly (“)
(8) This evaluation discourages me
Extremely weakly (“) 1 2 3 4 5 6 7 Extremely strongly (“)
(9) This evaluation makes me feel concerned
Extremely weakly (“) 1 2 3 4 5 6 7 Extremely strongly (“)
(10) This evaluation makes me feel responsible
Extremely weakly (“) 1 2 3 4 5 6 7 Extremely strongly (“)
(11) This evaluation pleases me
Extremely weakly (“) 1 2 3 4 5 6 7 Extremely strongly (“)
(12) This evaluation makes me feel respected
Extremely weakly (“) 1 2 3 4 5 6 7 Extremely strongly (“)
(13) This evaluation makes me joyful
Extremely weakly (“) 1 2 3 4 5 6 7 Extremely strongly (“)
(14) This evaluation makes me feel convicted
Extremely weakly (“) 1 2 3 4 5 6 7 Extremely strongly (“)

(Q1) My age is: … year
(Q2) I am a: 0 man 0 woman
(Q3) The highest education that I take part in, or have already completed is:
0 Elementary school
0 High school (VMBO)
0 High school (HAVO)
0 High school (VWO)
0 MBO
0 HBO
0 University (WO)
(Q4) Have you worked for an organization or company before, or currently work for an organization or company: Yes / No
(Q5) Have you ever received a job evaluation? Yes / No

Appendix F
Case computer conditions.

You are an employee of the University of Tilburg. You work for the organizations’ Policy department, together with twenty other people. You have been working for Tilburg University for four months now. You like your work and have been working at the organization with great pleasure. Furthermore, you are happy with your colleagues and you would like to stay at the organization for a longer period of time. Tilburg University finds it important that their employees are content with their work. However, Tilburg University also finds it very important that its employees do their work sufficiently and behave properly. Therefore, every four months you and your colleagues are evaluated on your performance and behaviour. This always results in a written evaluation. Tilburg University makes use of a new way of evaluating employees, as the evaluation is written by a computer, and not by your boss. This enables Tilburg University for more frequent evaluations. Next, you will receive such a computer written evaluation on your performance and behaviour over the past four months. Make sure you read your evaluation.

Appendix G
Case human conditions.

You are an employee of the University of Tilburg.
You work for the organizations’ Policy department, together with twenty other people. You have been working for Tilburg University for four months now. You like your work and have been working at the organization with great pleasure. Furthermore, you are happy with your colleagues and you would like to stay at the organization for a longer period of time. Tilburg University finds it important that their employees are content with their work. However, Tilburg University also finds it very important that its employees do their work sufficiently and behave properly. Therefore, every four months you and your colleagues are evaluated on your performance and behaviour. This always results in a written evaluation. These evaluations are written by the head of your department, your boss. Next, you will receive such an evaluation, written by your boss, on your performance and behaviour over the past four months. Make sure you read your evaluation.