Scalable Data Analytics, Scalable Algorithms, Software Frameworks and Visualization ICT-2013 4.2.a
Project FP7-619435/SPEEDD
Deliverable D8.3
Distribution: Public
http://speedd-project.eu

Evaluation of SPEEDD Prototype 1 for Road Traffic Management

Federica Garin (INRIA)
Chris Baber, Sandra Starke, Natan Morar and Andrew Howes (University of Birmingham)
Alexander Kofman (IBM)

Status: FINAL
April 2015

Project
Project ref. no: FP7-619435
Project acronym: SPEEDD
Project full title: Scalable ProactivE Event-Driven Decision Making
Project site: http://speedd-project.eu/
Project start: February 2014
Project duration: 3 years
EC Project Officer: Aleksandra Wesolowska

Deliverable
Deliverable type: Report
Distribution level: Public
Deliverable number: D8.3
Deliverable title: Evaluation of SPEEDD Prototype 1 for Road Traffic Management
Contractual date of delivery: M13 (March 2015)
Actual date of delivery: April 2015
Relevant task(s): WP8 / Task 8.3
Partner responsible: INRIA
Other contributors: UoB, IBM
Number of pages: 75
Author(s): F. Garin, C. Baber, S. Starke, N. Morar, A. Howes and A. Kofman
Internal reviewers:
Status & version: Final
Keywords: Evaluation, User Interface Design, Human Factors, Eye Tracking

Contents
0 Executive Summary
1 Introduction
  1.1 Technical Evaluation
  1.2 Evaluating User Interfaces
  1.3 Formative Evaluation
  1.4 Summative Evaluation
2 Evaluation by Subject Matter Experts
  2.1 Study
  2.3 Comments from Subject Matter Experts
    2.3.1 Support Operator Goals
    2.3.2 Provide Global View
    2.3.3 Indicate ramp metering and operator role
    2.3.4 Modify User Interface design
3 Summative evaluation through laboratory trials
  3.1 Rationale
  3.2 Study design and data collection
    3.2.1 User interface
    3.2.2 Experimental conditions
  3.3 Method
    3.3.1 Participants
    3.3.2 Task and experimental timeline
    3.3.3 Data collection
    3.3.4 Questionnaire
    3.3.5 Eye tracking
  3.4 Measures of Performance
    3.4.1 Correctness of response
    3.4.2 Decision times
    3.4.3 Gaze Behaviour
  3.5 Results
    3.5.1 Correctness of responses
    3.5.2 Decision times
    3.5.3 Gaze behaviour
  3.6 Comparison of expertise levels
    3.6.1 Data collection
    3.6.2 Data analysis
  3.7 Conclusions and outlook
    3.7.1 Summary
    3.7.2 Novice vs. expert behaviour
    3.7.3 Considerations for UI design
    3.7.4 Study limitations and considerations for future work
4 Defining Baseline Performance Metrics
  4.1 Overview
  4.2 Overall goal hierarchy
  4.3 Theory and modelling
  4.4 UI evaluation
  4.5 Evaluation of metrics for the quantification of UI design
5 References
Appendix 1 – SUS translated into French
Appendix 2 – Summary of Comments from SMEs
Appendix 3 – Complete Experiment Instructions
  Traffic Management Experiment

Index of Figures
Figure 1: ISO 13407 Human-Centred Systems Design Lifecycle
Figure 2: Software Usability Scale (Brooke, 1986)
Figure 3: Results from SUS evaluation
Figure 4: UI for Experimental Scenario
Figure 5: Incorrect ramp information: the ramp shown in panel 1 (right) is number 13, whereas the ramp shown in panel 2 (left) is number 1
Figure 6: Information displayed in figure 4, panel 3, corresponding to computer suggestions
Figure 7: Experimental timeline. Thirty-two trials were interspersed with intervals displaying a counter (used to later synchronise eye tracking recordings and response data)
Figure 8: Participant wearing a head-mounted eye tracker during a pilot data recording session
Figure 9: Setup of the monitor displaying the UI. Sixteen infrared markers allow mapping the point of gaze to specific regions of interest
Figure 10: Markers used for simple mapping of point of gaze to four ROIs
Figure 11: Number of trials with correct response for each of the four experimental conditions
Figure 12: Decision times as a function of condition, response or condition separated by response
Figure 13: Response times as a function of trial number for each participant
Figure 14: Heatmaps for the gaze data of four participants across the complete recording period
Figure 15: Example of sequential viewing behaviour of the 18 participants for scenario number 13 (out of 32 scenarios)
Figure 16: Dwell times for individual regions of interest (ROI 1 to 4) as well as dwell times calculated across all attended ROIs. Left: median dwell time per participant; right: interquartile range (IQR) per participant across median dwell times per trial
Figure 17: Percentage viewing time allocated to different ROIs across all 32 trials for each participant
Figure 18: Boxplots of median viewing time per participant, both as the percentage viewing time (top row) and as the total time in seconds (bottom row)
Figure 19: Boxplots of the interquartile range in viewing time per participant, both as the percentage viewing time (top row) and as the total time in seconds (bottom row)
Figure 20: Basic gaze parameters related to visual scanning for each of the four experimental conditions and across all 32 trials
Figure 21: Number of visits to each ROI for each of the four experimental conditions and across all 32 trials
Figure 22: Percentage viewing time allocated to the four regions of interest by the four students classified as ‘expert students’ (≥ 95% of trials completed correctly, left panel) and by 14 ‘non-expert’ students (< 95% of trials completed correctly, right panel)
Figure 23: Viewing networks for condition TT (left) and TF (right)
Figure 24: Response reliability for the four experimental conditions for three staff of the road traffic management facility at DIR-CE (Grenoble, France), students classed as experts (≥ 95% of trials completed correctly) and students classed as non-experts (< 95% of trials completed correctly)
Figure 25: Decision times for the three different participant groups and four different experimental conditions. Throughout, the median decision time was approximately 5 seconds
Figure 26: Self-evaluation compared to quantified performance

Index of Tables
Table 1: Network analysis metrics. This table shows the results for all network analysis metrics described in the data analysis section. An undirected network is independent of the direction of the gaze shift; a directed network takes the direction of the gaze shift into consideration
Table 2: Eye tracking, low-level metrics
Table 3: Eye tracking, processed / high-level metrics
Table 4: Scan pattern comparison using bioinformatics
Table 5: Network analysis metrics
Table 6: Eye / head coupling
Table 7: User performance

0 Executive Summary

Task 8.3 is concerned with the definition and development of performance metrics by which it is possible to measure human decision making in the SPEEDD road traffic use case. Having a set of human performance metrics allows us to consider how different versions of the SPEEDD prototype could have an impact on operator decision performance. The primary form of interaction that the operator will have with the SPEEDD prototype will be through the User Interface (UI). Consequently, the primary aim of this report is to explore ways in which we can develop metrics that provide a baseline against which subsequent designs can be compared.

In order to define such metrics, it is necessary to have a model of the activity that is being measured. In SPEEDD, the evaluation of the UI designs will involve more than a judgement of the ‘look and feel’, user acceptance, or usability of the screen layouts. As the role of the UI is to support decision making, the project will focus on measuring decision making in response to information search and retrieval. An assumption is that the manner in which information is presented to the user interacts with the manner in which the user searches for specific aspects of that information in order to make decisions. This raises three dimensions that need to be considered in this work. First, the manner in which information is presented can be described in terms of its diagnosticity, i.e., the correlation between the content of the display and the goal to be achieved, and in terms of its graphical properties, i.e., the format used to display the information. Second, the manner in which users search for specific information can be described in terms of cost, i.e., temporal measures of information search, particularly using eye-tracking metrics. Third, the manner in which information search relates to decision making can be related to the optimal decision models being developed in SPEEDD. It is proposed that this approach is not only beneficial for SPEEDD but also provides a foundation for the evaluation of Visual Analytics in general.

This report describes a user acceptance test (with Subject Matter Experts) comparing different versions of the UI, followed by the design and conduct of an experiment evaluating one of these versions for a specific (ramp metering) task. The experiment provided an opportunity to consider a variety of metrics which could be used to weigh the cost (of information retrieval) against the diagnosticity of information, as well as a means of comparing the performance of Subject Matter Experts (i.e., experienced control room staff) and novices (i.e., undergraduate students). The report concludes with a discussion of potential metrics that we will take forward for further consideration in the SPEEDD project.
1 Introduction

In the Description of Work for Work Package 8, it is proposed that "Every version of the integrated prototype will be followed by technical and user-oriented evaluation to obtain the necessary feedback for the functions to be included or altered in the next version". For the development of Prototype 1, evaluation has been conducted along four primary routes. The first is the technical evaluation that arises from the demands of integrating the User Interface (UI) into the SPEEDD Architecture. Not only has this required that the objects in the UI are able to display data delivered over the Architecture, but also that any operator inputs can be either handled by the Architecture or dealt with locally. This is briefly covered in Section 1.1 and is the subject of Work Package 6. The second is acceptance testing. This is conducted by presenting versions of the UI to Subject Matter Experts (SMEs) and asking for their opinion of the designs and their suggestions for improvement. The third is user testing. This is conducted through experiments in laboratory settings in the first year (the aim is to perform such testing with SMEs on the full prototype, but this will be performed in later years). The fourth, in addition to the evaluation of the SPEEDD prototype itself, is the definition of a ‘baseline’ against which subsequent designs can be compared; this report presents initial thoughts on that challenge.

The purpose of evaluation is to “assess our designs and test our systems to ensure that they actually behave as we expect and meet the requirements of the users” (Dix et al., 1993). This means that, for SPEEDD, it is important to consider how the UI can have an impact on the decision making activity of the operators and other people who will be using it. The question of how to measure impact on decision making activity is, therefore, of paramount importance for this project. A baseline should not only capture operators’ current performance (i.e., using the technology that they currently have in their control rooms) but also provide a set of metrics for evaluating future designs. As we will explain in Section 4, the definition of a baseline is complicated by the fact that it is not easy to separate operator performance from the equipment that operators use. For example, if the operators spend a large proportion of their time completing reports (see Deliverable 5.1), then ‘report time’ could be a performance metric that we would seek to improve on. However, if one eliminated the requirement to make reports, e.g., because the system fully automated the collection of such information, then the metric would be redundant. Of course, an improvement from, say, 40% of one’s time spent completing reports to 0% would be impressive, but it is not a useful measure of operator performance: it tells us little about whether the work has actually improved (rather than simply having some aspect of it eliminated), nor does it tell us how decision making has changed.

1.1 Technical Evaluation

From the technical point of view, SPEEDD prototype version 1 achieved its goals, i.e., basic versions of the components and the infrastructure have been developed following the event-driven architecture paradigm. The functionality of SPEEDD prototype version 1 has been tested using the end-to-end flow, which shows that the components receive inputs and emit outputs as expected on the SPEEDD platform. At this point we have not tested non-functional requirements (such as performance and parallelism).
In this report, we consider initial feedback from the users who will actually operate the SPEEDD platform. Lessons learnt from the year 1 prototype will assist in reshaping the architecture and infrastructure. Furthermore, more advanced features of the components will allow us to exercise more complex scenarios that can then be tested by real users of the platform for further refinement.

1.2 Evaluating User Interfaces

Usability is “…the capability in human functional terms [of a product] to be used easily and effectively by the range of users, given specified training and user support, to fulfill the specified range of tasks, within the specified range of environmental scenarios” [Shackel, 1984]. ISO 9241-11 operationalises this initial definition as “...the extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency and satisfaction in a specified context of use”. This raises two key issues for UI evaluation. First, it is important to note that evaluation must be directed at the role of a given technology in supporting specified users and goals. This requires clear definition of the characteristics of the tasks and goals that users perform. Second, it is important to combine an appreciation of subjective response (i.e., satisfaction) with measures of performance (i.e., effectiveness and efficiency) for these specified users seeking to achieve specified goals. From this, it is important to appreciate who will be using the product, for what purpose, and in what environment (i.e., the ‘context of use’), and then to define appropriate measures of performance for this context of use.

Figure 1: ISO 13407 Human-Centred Systems Design Lifecycle

ISO 13407 (figure 1) describes Human-Centred System design as an iterative lifecycle. While stage 5 involves evaluating the design against requirements, it is important to realise that evaluation is involved in each of the other stages as well. Not only is evaluation integral to design, but it is also important to measure more than simply a product’s features (Baber, 2015). The evaluation of a User Interface involves not only consideration of the aesthetics (‘look and feel’) of the design but also the effect of the design on the work of the people who will use that UI. From this, it is necessary to define the measures needed to support evaluation and to define a referent model. The use of the referent model will be particularly important for the evaluation of subsequent UI designs in SPEEDD, as we need to identify how and whether these designs have resulted in measurable change to operator performance. It is customary to distinguish between formative evaluation (i.e., the semi-formal testing of a design at the various stages) and summative evaluation (i.e., the formal testing of the design at stage 5).

1.3 Formative Evaluation

The Formative Evaluation of a User Interface occurs throughout the design lifecycle. This could involve comparing the current design against user requirements. An early version of such a comparison was reported in D5.1. As SPEEDD progresses, the comparison against user requirements will be repeated to ensure that the designs continue to meet the needs of the users as expressed in these requirements. Additionally, formative evaluation can involve the elicitation of subjective responses from users.
This form of acceptance testing can be a valuable means of both ensuring that the user requirements continue to be appropriate and engaging with prospective end users throughout the design and development process. There are various approaches to eliciting user opinion, with the most common involving some form of ‘usability inspection’ (Nielsen, 1993). These range from extensive question sets, e.g., Ravden and Johnson (1989), which can cover many aspects of interacting with a computer system but can be time consuming to complete, to smaller question sets which are intended to cover high-level aspects of the interaction. These could focus on subjective responses, such as satisfaction, e.g., CUSI — Computer User Satisfaction Inventory (Kirakowski and Corbett, 1988) or QUIS — Questionnaire for User Interface Satisfaction (Chin et al., 1988). Others focus on a broader concept of usability. Thus, it is common for evaluation to refer to a set of rules-of-thumb, or heuristics, such as:

(i.) Use simple and natural language
(ii.) Provide clearly marked exits
(iii.) Speak the user’s language
(iv.) Provide short cuts
(v.) Minimise user memory load
(vi.) Good error messages
(vii.) Be consistent
(viii.) Seek to minimise user error
(ix.) Provide feedback

It is possible to ask users to respond to each of these statements, perhaps by requiring them to provide a score on a Likert scale for each heuristic (and a further score indicating whether the user feels that this heuristic is critical to the success of the system). However, there is a lack of standardisation in evaluating such responses. This means that it is not easy to determine ‘good’ without a referent model. The Software Usability Scale (Brooke, 1986) provides a convenient questionnaire to gauge user acceptance of software.

Figure 2: Software Usability Scale (Brooke, 1986)

The primary focus of the SUS questionnaire is on identifying whether the software would be appropriate for a given user, e.g., in terms of whether the user can understand how to use it, whether the software looks as if it might be useful for the user’s work, and whether the user would be happy to use the software. The set of questions is shown in figure 2. The scale poses 10 simple questions concerning the potential usefulness and benefit that operators felt the User Interface might provide them. Each statement is rated on a scale of 1 to 5. Scoring involves subtracting 1 from the ratings of odd-numbered questions and subtracting the ratings of even-numbered questions from 5, because the questions alternate between positive and negative connotations. The adjusted scores are then summed and multiplied by 2.5 to give a final score out of 100. As a rule of thumb, scores in excess of 65 are deemed ‘acceptable’.
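To make the scoring procedure concrete, the following minimal sketch (not part of the original study materials; the function name and example ratings are illustrative) computes a SUS score from ten responses on the 1–5 scale described above.

```python
def sus_score(responses):
    """Compute a SUS score (0-100) from ten item ratings on a 1-5 scale.

    Odd-numbered items are positively worded and contribute (rating - 1);
    even-numbered items are negatively worded and contribute (5 - rating).
    The summed contributions are multiplied by 2.5 to give a 0-100 score.
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 item ratings")
    total = 0
    for i, rating in enumerate(responses, start=1):
        if not 1 <= rating <= 5:
            raise ValueError("Ratings must be on a 1-5 scale")
        total += (rating - 1) if i % 2 == 1 else (5 - rating)
    return total * 2.5

# Example: these (hypothetical) ratings give 77.5, above the 'acceptable'
# threshold of 65 mentioned in the text.
print(sus_score([4, 2, 4, 1, 5, 2, 4, 2, 4, 3]))
```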
1.4 Summative Evaluation

While Formative Evaluation captures user attitude and opinion as the UI develops, and also allows for comparison of the designs against user requirements, it is important to conduct evaluation against specific versions of the design. In this case, SPEEDD conducts evaluation on the prototype at the end of each year. This involves a more objective set of measures of the use of the prototype for performing specific tasks. In the case of SPEEDD, the measures relate to decision making in the scenarios that the prototypes have been designed to support. For Prototype 1, the scenario relates to ramp metering. Thus, the evaluation is directed at measuring how well users of a version of the prototype are able to complete ramp metering tasks. It is worth noting that ramp metering is not currently part of the activity of the operators in Grenoble DIR-CE. Thus, they do not have experience of monitoring or interacting with an automated system for ramp metering. Having said this, their experience of traffic management in general should allow them to complete the decision tasks that we have designed. Our challenge is to produce an evaluation exercise which captures the traffic management challenges relating to ramp metering with sufficient ecological validity to feel realistic to operators, while providing sufficient control of variables to allow for data to be collected that reflect decision performance. To this end, a UI was designed which presented a ramp metering task as a decision problem. This allowed decision time to be recorded in response to combinations of information which reflected different levels of automation performance. This is described in Section 3.

2 Evaluation by Subject Matter Experts

2.1 Study

We presented the User Interface designs, beginning with the initial sketch (UI A), and asked the operators to rate each design against the Software Usability Scale (translated into French – see Appendix 1).

Figure 3: Results from SUS evaluation [The error bar indicates standard deviation in rating across the SMEs]

Figure 3 indicates an increasing trend from the first to the third UI design. UI A was presented in D5.1 and represents an initial sketch of the layout based on the core Values and Priorities identified through Cognitive Work Analysis. In other words, this design is intended to reflect the key aspects of the operators’ information space that they might need to monitor in order to achieve their goals. UI B was developed to support the ramp metering task (and was the subject of the experiment described in Section 3). UI C is used in Prototype 1 for SPEEDD. The increasing rating across these designs is therefore to be expected, as UI A was intended to reflect the general layout of the UI rather than to provide specific support for operators, UI B was designed for the laboratory experiment (see below), and only UI C was intended to support operators. While UI C was for the initial SPEEDD prototype, we felt that it was useful to present the design process to the operators in order to gauge their opinion on the direction of travel for the development lifecycle. We note that UI C has a score slightly below 65, which, given the rule of thumb noted above, indicates that it falls just short of ‘acceptable’. We also note that the error bars for UI C are larger than for the other designs. Discussion with the SMEs suggests that this arose from the concern of one of the SMEs that this UI would be used in conjunction with existing displays. When we explained that the intention would be to replace the current set of displays, he offered to modify his rating upwards, but we felt that it would be better to retain the original rating. This implies that the first prototype approaches an initial level of acceptance. However, we would like to see this score increase in response to future designs. In order to understand the scores that the operators gave, we collated their comments and used these to understand what they perceived to be good and poor aspects of the design.
2.3 Comments from Subject Matter Experts

The Subject Matter Experts (SMEs) noted that their response would depend on whether the UI is intended to be used in addition to their current set of displays or whether it would be a replacement for these. As far as practicable, SPEEDD would seek to replace the set of five display screens that are currently used in the control room (see D5.1) with a single display (albeit one which might be larger than the current visual display units). The SMEs could see merit in this approach. They were concerned that such a replacement could result in loss of information, but could see that much of what they needed to know would be on the SPEEDD UI (particularly UI C). There was concern that the quantity of information in UI C could be too much for the job of monitoring ramp metering (see the above discussion of error bars in figure 3). To this end, there was a feeling that the simpler and less detailed design of UI B could be preferable. Part of the discussion, therefore, concerned the role of the operator in response to the different UI designs presented to them. DIR-CE is involved in a project to automate ramp metering in some parts of the road network. In that project, ramp metering would be fully automated, with little opportunity for operator intervention. This might have influenced their response to UI C. Overall, the SMEs’ responses can be grouped under four headings:

2.3.1 Support Operator Goals

The SMEs reiterated that their main goal was to maximise traffic flow. To this end, the concept of prediction would be very useful, and they appreciated the comparison between current and predicted state in the UI designs. A concern that the SMEs raised was the apparent lack of support for report writing in the UI designs. We suggested that the ‘event log’ (in UI C) could be used to support the creation of reports, and the SMEs recognised that this would be possible. However, it is important to recognise that there are a number of reasons why operators need to create, maintain and sign off reports. This was discussed in D5.1 as a requirement, and future designs need to be more explicit about how this can be supported.

The SMEs were not convinced that ‘driver behaviour’ falls under their command. While D5.1 suggested that control room monitoring involved managing traffic by influencing driver behaviour (through the use of Variable Message Signs), the SMEs were concerned that a focus on ‘behaviour’ would extend their responsibilities. This means that the concept of indicating vehicle activity (as opposed to traffic flow) might need to be reconsidered in the UI designs, or at least more clearly defined and explained.

A final point raised by the SMEs, and related to the issue of report writing, concerns the manner in which information could be shared with other agencies. At present, there is little explicit communication between the rocade sud control room in DIR-CE and the control room which is responsible for the roads in the town. However, the two control rooms are situated in the same building and have informal communication (e.g., through face-to-face or telephone conversations). As SPEEDD develops its work on traffic control in the town (through the modelling work), the issue of how control decisions in one area can constrain those of another will need to be considered. This raises the question of how information will be shared and how cooperation can be supported.
2.3.2 Provide Global View

The SMEs appreciated the idea of providing a global view of the area that they were controlling. They also felt that the current action of zooming in to an accident was problematic. First, zooming in loses the global view. Second, zooming in implies that the most important aspect of the traffic control problem is what is on the screen at that moment. The SMEs felt that allowing them to monitor the entire road network was preferable. They did, however, like the idea of being able to initiate zooming to specific locations themselves, which implies that it is not zooming per se that they dislike, but automated zooming and resizing of the map outside their control.

2.3.3 Indicate ramp metering and operator role

The SMEs noted that ramp metering is not always under operator control. The intention behind the UI designs was not to have operators control ramp metering, but to allow intervention if required. The SMEs noted that it would be useful to intervene when they could see a growing queue of traffic at the ramp. They suggested that it might be useful to ‘penalise’ the system in response to queue length. This is something that SPEEDD is considering in its ramp control models, and the operators noted that it was important to act upstream or downstream of ramps to effect control. The SMEs also noted that it could be difficult to determine how traffic would evolve if a ramp is closed.

2.3.4 Modify User Interface design

In UI A, there was too much information on the map. From this perspective, the simplest UI, UI B, was potentially the easiest to use. Having said that, the SMEs preferred UI C for its clarity. Much of the discussion on the UI designs concerned the zooming on the map (discussed above) and the use of colours. The SMEs were not sure how to read the colours on the dials relating to driver behaviour. However, they did like the use of coloured circles on the map to indicate events, congestion and predictions. The use of graphs (in UI C) was potentially beneficial, especially if these could show limits, e.g., upper and lower bounds of the prediction, current activity, and upper and lower bounds of ‘normal’ (i.e., historical) data for that section of road. Overall, the SMEs requested three main changes to the UI designs:

1. Annotate / colour the ring-road on the map to display regions of congestion etc.
2. Provide a ‘notes’ (or similar) button which pops up a window to type in.
3. Provide tabs on the ‘event log’ to give different views of what is being logged.

3 Summative evaluation through laboratory trials

3.1 Rationale

In order to determine the most appropriate baseline metrics for laboratory evaluation of user interface design, a study was conducted with a user interface (UI) representing a simplified version of SPEEDD prototype 1 (UI B in figure 3). The UI was designed to present participants with a traffic management (ramp metering) scenario and to provide a means of collecting decision time data from participants. The scenario is based on the SPEEDD road traffic use case version 1 (ramp metering), in which data from sensors on ramps are used to control traffic light sequencing in order to manage traffic flow. For this task, participants are expected to monitor the traffic flow and to ascertain that the automated decisions are correct. The purpose of the UI is to enable an ‘operator’ to monitor, and potentially intervene in, computerized road traffic control decisions.
These computerized decisions are based on an instrumented stretch of road network in Grenoble, on which sensors measure the speed of cars entering the motorway via ramps. Traffic lights control the flow of cars based on information gathered from the sensors. The study reported here is a preliminary experiment which investigates how operators might respond to different levels of reliability in an automated system. In the SPEEDD use case, the ramp metering algorithms would run on data collected from sensors embedded in the road. There are potential problems which might arise from sensors failing or data being lost or corrupted during transmission, as described by the SPEEDD consortium previously. While these problems might be dealt with by exception handling, it is possible that a system recommendation could be based on erroneous data. Further, it is possible that the processing time of the algorithms could result in a discrepancy between the recommendation and the display. These potential errors are translated in this experiment into incorrect suggestions displayed by the automatic decision aid and into incongruence between information sources. Thus, the operator had the dual role of checking for information source congruence and validating the displayed suggestion.

This laboratory study pursued two lines of experimental enquiry: firstly, we were interested in response times and user performance/reliability when working with unreliable automation. Secondly, we aimed to quantify viewing behaviour in order to understand scan patterns and the time needed to extract information from different sources and data types. In particular, we aimed to investigate whether participants would notice those trials that represented flawed automation, whether there was an effect of the experimental condition on response times, and whether there was consistency within and across participants in allocating viewing time to different regions of interest. In this experiment, the operator would either ‘accept’ a computer recommendation or ‘challenge’ (i.e., reject) it after examining multiple information sources relating to traffic, presented on screen.

3.2 Study design and data collection

3.2.1 User interface

A custom user interface (UI) was designed for this task at the University of Birmingham in JavaScript. The UI simulates aspects of the ramp metering task considered for SPEEDD prototype 1. The experimental UI was derived from the UI developed for the SPEEDD project in order to facilitate this controlled laboratory experiment. This illustrates an approach that will be pursued throughout SPEEDD: key aspects of the operator task will be abstracted from the working environment and used to define experimental tasks that provide sufficient control and repeatability to support controlled laboratory testing. The purpose of this experimental UI (Figure 4) was to enable an ‘operator’ to monitor, and potentially intervene in, computerized road traffic control decisions. These computerized decisions relate to ramp control (i.e., changing the rate at which traffic lights on a junction change in order to allow vehicles to join a main road). Because of the experimental design (see 3.2.2), the dual operator role mentioned in the previous section was not static but dynamic: participants switched between checking information source congruence and validating the computer suggestion, based on explicit instructions on using the visual cues.
In this experiment, the computer suggestions could be either ‘lower rate’, ‘increase rate’ or ‘leave rate unchanged’. The UI contained four panels in an equally spaced 2 by 2 grid layout (Figure 4):

Panel 1 (bottom-right) contains details of the computer suggestion regarding traffic light settings, as well as the operator response buttons ‘challenge’ and ‘accept’.
Panel 2 (top-right) presents a view of the road network surrounding the queried ramp, based on Google Maps.
Panel 3 (top-left) shows a graph with the car density on the ramp on the x-axis (representing the number of cars waiting to pass the traffic light) and the rate of the ramp on the y-axis (representing the number of cars passing per second).
Panel 4 (bottom-left) presents a schematic list of 17 ramp meters mimicking part of the instrumented road section.

The colour used for a specific ramp is repeated in the headings of panels 1 and 3, and the ramp number should match the one shown in panel 2 (in the congruent display condition).

Figure 4: UI for Experimental Scenario

3.2.2 Experimental conditions

The objective of this experiment was to present ‘operators’ with coherent or conflicting information and to examine whether these conditions were recognised correctly. In coherent trials, the information in all four panels agreed and the computer suggestion was correct; in incoherent trials, the information sources conflicted, the computer suggestion was incorrect, or both. In total there were four scenarios.

First, the information in all panels agreed and the computer suggestion was correct; this is the coherent case outlined in the previous section. Second, the information between the panels in figure 4 disagreed while the computer suggestion was correct. Disagreeing information corresponded to incorrect ramp information: for example, the ramp numbers displayed in panel 1 and panel 2 could disagree (Figure 5).

Figure 5: Incorrect ramp information: the ramp shown in panel 1 (right) is number 13, whereas the ramp shown in panel 2 (left) is number 1.

Third, the information between panels agreed but the computer suggestion was incorrect. An incorrect computer suggestion had to be detected by the ‘operator’ by interpreting the density graph (panel 3, see Figure 4). If the data points showed high density and low rate (Figure 6, a), this would mean that there were a lot of cars waiting at the traffic light; hence, the correct suggestion would be to increase the rate (letting more cars through by having the traffic light change at a higher rate). If the data points showed low density and high rate (Figure 6, b), this would mean that there were very few cars waiting; hence, the correct suggestion would be to decrease the rate. For the other clusters of data points (low density and low rate (Figure 6, c), and high density and high rate (Figure 6, d)), the correct action would be to leave the rate unchanged.

Figure 6: Information displayed in figure 4, panel 3, corresponding to computer suggestions. [Density (of cars) is on the x-axis and rate (of the traffic light changing) is on the y-axis. a) high density and low rate (correct suggestion: increase rate), b) low density and high rate (correct suggestion: lower rate); c) low density and low rate as well as d) high density and high rate (correct suggestion: leave rate unchanged)].

Fourth, the information between panels disagreed and the computer suggestion was also incorrect.
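To summarise the task logic described above, the following sketch (illustrative only; it is not taken from the experimental software, which was written in JavaScript) encodes the mapping from the density–rate pattern in panel 3 and the ramp-label congruence to the expected operator response.

```python
def correct_suggestion(density, rate):
    """Return the correct traffic-light suggestion for the pattern in panel 3.

    'high'/'low' describe where the biggest bubbles sit on the density (x)
    and rate (y) axes, as in Figure 6.
    """
    if density == "high" and rate == "low":
        return "increase rate"      # many cars waiting, lights changing slowly
    if density == "low" and rate == "high":
        return "lower rate"         # few cars waiting, lights changing quickly
    return "leave rate unchanged"   # low/low or high/high clusters

def expected_response(labels_agree, suggestion, density, rate):
    """Accept only if the ramp labels agree AND the computer suggestion is correct."""
    correct = labels_agree and suggestion == correct_suggestion(density, rate)
    return "accept" if correct else "challenge"

# Example: condition TF (labels agree, suggestion incorrect) must be challenged.
print(expected_response(True, "lower rate", "high", "low"))   # -> challenge
```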
In summary, trials were hence based on four scenarios with respect to information agreement and correctness of the computer suggestion:

- Ramp labels agree, computer suggestion correct (TT)
- Ramp labels disagree, computer suggestion correct (FT)
- Ramp labels agree, computer suggestion incorrect (TF)
- Ramp labels disagree, computer suggestion incorrect (FF)

Each scenario was presented eight times, resulting in 32 trials which were shown to all participants in random order. Participants were asked to accept the computer suggestion if and only if the information agreed and the computer suggestion was correct. Hence, 3/4 of trials had to be challenged and 1/4 had to be accepted.

3.3 Method

3.3.1 Participants

Participants were recruited from the University of Birmingham student body. A total of 27 students participated in the study; for a subset of 18 participants, eye tracking data were recorded. These data could not be recorded for the remaining participants due to calibration issues arising from their wearing spectacles. The design of the experiment was approved by the University of Birmingham Ethics Panel and all participants provided informed consent to participate in the study.

3.3.2 Task and experimental timeline

The study consisted of four steps: first, the function of the UI components and the goals of the task were explained verbally to the participant based on a script (Appendix 3). Second, the participant performed two practice trials, after which he/she was given the opportunity to ask any clarifying questions. Third, the participant performed the task, which consisted of 32 trials and took between approximately 2 and 6 minutes to complete. Fourth, participants were given a questionnaire to fill out.

The study explanation can be summarised as follows: in order to determine whether the information sources agreed, the participants were instructed to check whether all four regions of interest (ROIs) referred to the same ramp number. To determine whether the computer suggestion was correct or not, the participants were instructed to check the graph in ROI 3 (top-left in Figure 4). The presence of the biggest bubbles in the bottom-right quadrant of ROI 3 (low rate, high density) indicated that the rate must be increased. The presence of the biggest bubbles in the top-left quadrant of ROI 3 (high rate, low density) meant that the rate must be decreased. The presence of the biggest bubbles in either the bottom-left or the top-right quadrant (low density, low rate and high rate, high density, respectively) meant that the rate must remain unchanged.

Each trial was preceded by an interval displaying a counter that was used to later synchronise eye tracking recordings and response data. The interval displaying the timer ran for as long as the participant wished to prepare for the next trial. Participants started each trial by clicking on the timer screen. A trial was completed when the participant clicked on the ‘challenge’ or ‘accept’ button, and the blank timer screen was then automatically shown again.

Figure 7: Experimental timeline. Thirty-two trials were interspersed with intervals displaying a counter (used to later synchronise eye tracking recordings and response data).

3.3.3 Data collection

The UI was presented on a 22” monitor (resolution: 1680 x 1050, refresh rate: 60 Hz) to each ‘operator’ seated in front of it. Events associated with a participant’s interaction with the UI were written to a CSV file using JavaScript. The stored events for each trial were:

- Trial ID
- Trial start and end time (in ms computer time)
- Participant response (challenge or accept)

The start time corresponded to the participant clicking on the white counter screen to start the trial, while the stop time corresponded to the participant clicking on either the ‘challenge’ or ‘accept’ button.
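The logged fields above are sufficient to derive the decision-time measure used in Section 3.4.2. A minimal sketch follows (illustrative only; the file name and column names are assumptions, and the original analysis was not necessarily carried out this way):

```python
import csv

def decision_times(log_path):
    """Read a per-participant event log and return decision time (s) per trial.

    Assumes one row per trial with columns: trial_id, start_ms, end_ms, response.
    """
    times = {}
    with open(log_path, newline="") as f:
        for row in csv.DictReader(f):
            times[row["trial_id"]] = (int(row["end_ms"]) - int(row["start_ms"])) / 1000.0
    return times

# Example usage (hypothetical file name):
# print(decision_times("participant_01_events.csv"))
```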
3.3.4 Questionnaire

On completion of the task, participants were asked to fill in a questionnaire which contained the following questions:

- How difficult did you find the task? [0 to 10 scale, 0 – very easy; 10 – very hard]
- What was your strategy?
- What percentage of decisions do you think you made correctly?
- What percentage of trials that you SHOULD have accepted do you think you accepted?
- What percentage of trials that you SHOULD have challenged do you think you challenged?

3.3.5 Eye tracking

For a subset of 18 participants, eye tracking data were collected as part of the experiments. A Tobii Glasses v.1 head-mounted eye tracker was used to record the point of gaze at 30 Hz while participants performed the task. Prior to commencing the task, the eye tracker was fitted to the participant so that it could be worn comfortably, and a nine-point calibration was performed. Calibration was repeated until a rating of at least 4 stars (out of 5) was achieved for both ‘accuracy’ and ‘tracking’. For a random subset of participants, the mapping of the gaze point was checked using a custom gaze-verification video created in Matlab. This video showed moving dots on the screen which the participant had to follow. The match between the mapped gaze point and the target point then allowed qualitative and quantitative evaluation of gaze point accuracy.

Figure 8: Participant wearing a head-mounted eye tracker during a pilot data recording session.

To map the point of gaze in the local coordinate reference frame of the glasses to regions of interest (ROIs) on the screen, 16 infrared markers were attached around the monitor at equally spaced intervals. The large number of markers was chosen to accommodate technical issues we noted in a pilot study, where changes in head orientation and viewing distance of participants resulted in the loss of too many IR marker locations.

Figure 9: Setup of the monitor displaying the UI. Sixteen infrared markers allow mapping the point of gaze to specific regions of interest.

Following data collection, the point of gaze was automatically mapped to the four UI panels using custom-written scripts in Matlab, based on the positions of the 16 infrared markers attached around the monitor. The setup used here allowed visual attention to be mapped to the four panels (ROIs) using two approaches. Firstly, mapping was performed by minimising the vector magnitude and angle between the point of gaze and the infrared markers; for each frame, the ROI corresponding to the marker with the minimum value was selected. Details of this mapping approach have been described previously and are currently under review for publication in ACM Transactions on Interactive Intelligent Systems. Secondly, to accommodate the dropout of large numbers of markers in the recordings of some participants, a simple mapping approach was chosen to override the method explained above when possible: for each frame, it was automatically checked whether one of the horizontal and one of the vertical markers placed at the panel intersections was in view (Figure 10). If this condition was satisfied, then mapping was performed for that frame based on the location of the point of gaze relative to these markers.

Figure 10: Markers used for simple mapping of point of gaze to four ROIs.
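The simple mapping approach can be illustrated as follows (a sketch only, under assumed coordinate conventions; the actual implementation was a set of custom Matlab scripts): given the image coordinates of the panel intersection recovered from the horizontal and vertical markers, the gaze point is assigned to one of the four ROIs by a quadrant test.

```python
def map_gaze_to_roi(gaze_x, gaze_y, centre_x, centre_y):
    """Assign a gaze point to one of the four UI panels (ROIs).

    centre_x / centre_y are the screen coordinates of the panel intersection,
    recovered from the markers placed at the horizontal and vertical midlines.
    Panel layout follows Figure 4: ROI 3 top-left, ROI 2 top-right,
    ROI 4 bottom-left, ROI 1 bottom-right (y grows downwards in image coordinates).
    """
    top = gaze_y < centre_y
    left = gaze_x < centre_x
    if top:
        return 3 if left else 2
    return 4 if left else 1

# Example: a gaze point above and to the right of the intersection maps to ROI 2 (map view).
print(map_gaze_to_roi(gaze_x=1200, gaze_y=300, centre_x=840, centre_y=525))
```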
Despite these two mapping approaches, data for three participants had to be mapped manually after sanity-checking, as these participants moved too close to the screen to have sufficient markers in view for either method to work reliably.

3.4 Measures of Performance

3.4.1 Correctness of response

For each participant, the number of trials responded to correctly was calculated for each of the four experimental conditions and across all 32 trials. Summary statistics were then calculated across participants.

3.4.2 Decision times

For each participant and each trial, the decision time was calculated as the difference between the start and stop time. To examine whether decision times varied between the four experimental conditions and the two response options (‘accept’ and ‘challenge’), the median decision time was calculated for the trials belonging to each of these classes for each participant. A Kruskal-Wallis test was carried out, using the IBM SPSS statistical package, to examine whether decision times differed between the four experimental design categories or between the responses ‘accept’ and ‘challenge’. To examine whether there was a systematic trend for decision times to change as a function of elapsed trials, linear regression was performed for each participant with trial number as the independent variable and decision time as the dependent variable.

3.4.3 Gaze Behaviour

3.4.3.1 Heat maps

For a qualitative assessment of attended regions of interest (ROIs), heat maps were created in Tobii Studio (an eye-tracking analysis toolkit) for the complete recording of each participant. These heat maps can be used to identify specific regions of interest to which a participant attends, e.g., a ramp symbol and number in ROI 2, or a region of text in ROI 1. Further, heat maps give a first impression of the dispersion of fixations across the whole scene and user interface.

3.4.3.2 Dwell times

The dwell time is the time for which the eyes rest on a defined ROI. Dwell times were calculated for each ROI, and median values were calculated for each participant per ROI and across all dwells.

3.4.3.3 Time allocated to individual ROIs and within-observer variation

To examine the attendance to the four ROIs quantitatively, the cumulative viewing time was calculated for each trial and each participant, both in absolute terms (in seconds) and in normalised terms (as a percentage of trial duration). To calculate cumulative viewing time, the dwell times for each ROI were summed. Cumulative viewing time was calculated for each condition as well as across all trials. To compare gaze allocation across participants, median values were calculated per participant and condition. To examine the consistency of an individual participant’s gaze allocation, the interquartile range (IQR) was calculated for each participant and condition as well as across all trials. The reason for doing this was a) to investigate whether controlling for the experimental condition would reduce within-observer variation, and b) to investigate the ‘noise’ level associated with the fundamental metric of cumulative viewing time. This metric is essential for evaluating UI design; however, it is important to know how much variation is attributable to unsystematic visual scanning behaviour of participants, and how much is attributable to UI design. These two sources of variation will have to be distinguished in our future work on iterative UI development, and they will determine the number of necessary participants and trials.
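As an illustration of these viewing-time measures, the sketch below derives dwell times, cumulative viewing time per ROI, and an IQR from a frame-by-frame ROI sequence sampled at 30 Hz (illustrative only; the original analysis used custom Matlab scripts, and the data structures shown here are assumptions).

```python
from itertools import groupby
import statistics

FRAME_S = 1 / 30.0  # eye tracker sampling interval (30 Hz)

def dwells(roi_per_frame):
    """Collapse a frame-by-frame ROI sequence into (roi, dwell_time_s) pairs."""
    return [(roi, sum(1 for _ in frames) * FRAME_S)
            for roi, frames in groupby(roi_per_frame) if roi is not None]

def cumulative_viewing_time(roi_per_frame):
    """Total viewing time per ROI for one trial (sum of its dwell times)."""
    totals = {}
    for roi, t in dwells(roi_per_frame):
        totals[roi] = totals.get(roi, 0.0) + t
    return totals

def iqr(values):
    """Interquartile range, used here to describe within-participant variation."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

# Example for one short, artificial trial (None = gaze not on any panel):
trial = [1, 1, 1, 3, 3, None, 3, 3, 2, 2, 1, 1]
print(dwells(trial))
print(cumulative_viewing_time(trial))
```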
3.4.3.4 Basic visual scanning behaviour

For each of the experimental conditions, the trial time, the number of attended ROIs per trial, the switch count and the number of visits to each ROI were calculated. The switch count is the number of gaze shifts between ROIs in each trial. The number of visits per ROI is a metric which allows us to quantify how often an observer attends the same information source.

3.4.3.5 Viewing networks and network analysis metrics

Graphical viewing networks were constructed in line with work we have presented previously within SPEEDD. In short, viewing networks visualise between which ROIs a participant switches. They are an intuitive way of checking for consistency in scanning behaviour. In addition to the graphical viewing networks, we calculated several quantitative network analysis metrics, which we have presented previously and which are described in detail in our manuscript under review with TiiS. In short, the viewing sequence of each participant for each trial is processed using matrix calculations. ‘Nodes’ reflect ROIs, and ‘edges’ reflect switches between pairs of ROIs. This results in a representation of the frequency of switches between pairwise ROIs. This representation can be binary (it only matters whether or not a participant switches between two regions, not how often) and undirected (it does not matter whether, for example, a participant switches from ROI 3 to ROI 1 or from ROI 1 to ROI 3); or it can be binary and directed (i.e., switching from ROI 3 to ROI 1 is considered different from switching from ROI 1 to ROI 3). We used both representations to calculate a number of network analysis metrics. In the following, a summary is given for each of the metrics reported here:

Number of edges: This metric counts the number of unique pairwise switches. In a gaze sequence with the ROI attendance 4 – 3 – 2 – 3 – 2 – 4 – 1 – 4 – 1 – 4 – 3, for example, there would be six ‘edges’ in the directed representation (4 – 3, 3 – 2, 2 – 3, 2 – 4, 4 – 1, 1 – 4) and four edges in the undirected representation (4 / 3, 3 / 2, 2 / 4, 4 / 1).

Link density: This metric is the number of edges Nedges within the network as a fraction of the total number of possible edges, which follows from the total number of nodes Nnodes (for an undirected network, Nedges / [Nnodes(Nnodes − 1)/2]). This metric indicates how an observer sequentially combines information sources.

Inclusiveness: This metric counts the number of nodes which are connected within a network. Here we calculated it from the undirected binary network. This metric is a simple measure of the number of ROIs which an observer attended to; if gaze was not directed at a ROI, it would not be included in the network.

Average nodal degree: This metric represents the average number of edges connected to each node of the network. Similar to link density, this metric indicates how strategically an observer switches between information sources: a high average nodal degree indicates that the observer is non-selective in sequentially combining information sources, while a low average nodal degree indicates a systematic approach to switching between sources. The maximum average nodal degree corresponds to the total number of nodes in the network minus 1.

Connection type: In a directed network, strongly connected components are bi-directional edges, e.g. in a case where there is a link from ROI 1 to ROI 3 and a link from ROI 3 to ROI 1. Weakly connected components are uni-directional edges, e.g. in a case where there is a link from ROI 1 to ROI 3 but no link from ROI 3 to ROI 1. Together with the nodal degree, this metric allows the function of a ROI to be understood in the broader context of the scan pattern.

Leaf nodes: Nodes that are connected to only one other node are called leaf nodes.

Nodal degree (‘local centrality’): The degree of a node is the number of links connecting it to neighbouring nodes. This metric measures how well connected each node is.
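The following sketch computes several of these metrics (number of edges, link density, average nodal degree, inclusiveness, leaf nodes) from a viewing sequence, using the example sequence given above (illustrative only; it is not the Matlab implementation used for the reported analysis).

```python
def directed_edges(sequence):
    """Unique ordered ROI-to-ROI switches (repeated frames on one ROI are ignored)."""
    return {(a, b) for a, b in zip(sequence, sequence[1:]) if a != b}

def undirected_edges(sequence):
    """Unique unordered ROI pairs between which gaze switched."""
    return {frozenset(e) for e in directed_edges(sequence)}

def network_metrics(sequence, n_rois=4):
    edges = undirected_edges(sequence)
    possible = n_rois * (n_rois - 1) / 2          # max edges in an undirected network
    degree = {r: sum(1 for e in edges if r in e) for r in range(1, n_rois + 1)}
    return {
        "n_edges_directed": len(directed_edges(sequence)),
        "n_edges_undirected": len(edges),
        "link_density": len(edges) / possible,
        "average_nodal_degree": sum(degree.values()) / n_rois,
        "leaf_nodes": [r for r, d in degree.items() if d == 1],
        "inclusiveness": sum(1 for d in degree.values() if d > 0),
    }

# The example sequence from the text: 6 directed edges, 4 undirected edges.
print(network_metrics([4, 3, 2, 3, 2, 4, 1, 4, 1, 4, 3]))
```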
3.5 Results

3.5.1 Correctness of responses
For the 27 participants, the percentage of correctly evaluated trials ranged from 56% to 100%, with a mean ± standard deviation (SD) of 79 ± 12% of trials evaluated correctly. Five participants had performances better than 95% (three with 100% correct assessments), with the remaining 22 students showing performance below 95%. This threshold was used to partition the data into an 'expert student' group and a 'non-expert student' group. Across the four different conditions, the mean ± SD number of correctly evaluated trials (out of a maximum of 8) was 7.7 ± 0.6 trials for condition 'TT' (ramp label agrees, suggestion correct), 6.9 ± 1.8 trials for condition 'TF' (ramp label agrees, suggestion incorrect), 3.0 ± 3.2 trials for condition 'FT' (ramp label disagrees, suggestion correct) and 7.7 ± 0.7 trials for condition 'FF' (ramp label disagrees, suggestion incorrect). A Friedman test, carried out using the IBM SPSS statistical package, accordingly revealed a significant difference between the four experimental conditions (P < 0.0005).

Figure 11: Number of trials with a correct response for each of the four experimental conditions. [TT – ramp label agrees: true, suggestion correct: true; TF – ramp label agrees: true, suggestion correct: false; FT – ramp label agrees: false, suggestion correct: true; FF – ramp label agrees: false, suggestion correct: false. The correct response for 'TT' was 'accept'; the correct response for 'TF', 'FT' and 'FF' was 'challenge'].

3.5.2 Decision times
Decision times per trial ranged from 1.0 s to 23.5 s across participants. The median decision time per participant across all trials ranged from 2.1 s to 11.7 s (4.8 ± 2.4 s). There was no significant difference in decision time between the four conditions (p = 0.440) or the two response types (p = 0.755). There was also no significant difference between the eight categories obtained by separating each condition by response (p = 0.664).

Figure 12: Decision times as a function of condition, response, or condition separated by response. [TT – ramp label agrees: true, suggestion correct: true; TF – ramp label agrees: true, suggestion correct: false; FT – ramp label agrees: false, suggestion correct: true; FF – ramp label agrees: false, suggestion correct: false].

Results for decision times as a function of elapsed trials varied across participants: on the one hand, there was a significant linear association between decision time and trial number for 12 participants, albeit with shallow fitted slopes (range of fitted slopes: -0.21 to 0.25; range for R²: 0.13 to 0.45; range for p: < 0.001 to 0.041). On the other hand, there was no significant association for the remaining 15 participants (range for R²: 0.00 to 0.10; range for p: 0.075 to 0.988).

Figure 13: Response times as a function of trial number for each participant.
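As an indication of how statistics of this kind can be reproduced outside SPSS, the sketch below runs the same families of tests reported above (a Friedman test on correctness per condition, a Kruskal-Wallis test on decision times, and a per-participant linear regression of decision time on trial number) using SciPy. The arrays are randomly generated placeholders with the same shape as the study data (27 participants, four conditions, 32 trials); they do not reproduce the reported values.

```python
# Hedged sketch using SciPy in place of SPSS; data below are random placeholders.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Correct responses out of 8 per condition (rows: 27 participants; columns: TT, TF, FT, FF)
correct = rng.integers(0, 9, size=(27, 4))
print("Friedman p =", stats.friedmanchisquare(*correct.T).pvalue)

# Median decision time (s) per participant and condition
decision = rng.uniform(2, 12, size=(27, 4))
print("Kruskal-Wallis p =", stats.kruskal(*decision.T).pvalue)

# Linear trend of decision time over the 32 trials for one participant
trials = np.arange(1, 33)
times = rng.uniform(2, 12, size=32)
fit = stats.linregress(trials, times)
print("slope =", fit.slope, "R2 =", fit.rvalue ** 2, "p =", fit.pvalue)
```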
3.5.3 Gaze behaviour

3.5.3.1 Heat maps
Heat maps of the gaze data qualitatively confirmed that participants looked at regions of the screen that held information relevant for task completion. This included looking at the accept/challenge buttons and computer suggestion/ramp information in ROI 1, the ramp number on the map in ROI 2, the points on the graph in ROI 3 and the ramp details in ROI 4. However, the time spent looking at these different information sources differed. Visual attendance to unrelated information was virtually absent.

Figure 14: Heat maps for the gaze data of four participants across the complete recording period. [The gaze point in the centre of the screen corresponds to participants looking at the counter between trials. This figure shows that the relative gaze allocation differs between individuals].

Figure 15: Example of the sequential viewing behaviour of the 18 participants for scenario number 13 (out of 32 scenarios). [The display is identical for all participants; however, it is obvious that participants vary in their scanning behaviour. Dwell times can be calculated from individual periods of attendance to a given ROI. Cumulative viewing time is then calculated by summing individual dwells].

3.5.3.2 Dwell times
The mean ± SD dwell time across participant-specific median dwell times per ROI was 0.75 ± 0.39 s for ROI 1, 0.28 ± 0.14 s for ROI 2, 0.49 ± 0.18 s for ROI 3 and 0.39 ± 0.29 s for ROI 4. Across all ROIs, it was 0.39 ± 0.16 s. This shows that on each visit participants tended to consult a display region for less than one second, but for more than 0.15 s, which is approximately the minimum time necessary to extract information from a relatively complex pattern. The within-participant variation (Figure 16, right) was of the same order as the median dwell times, with mean ± SD values between 0.26 ± 0.12 s and 0.53 ± 0.20 s and an overall average of 0.26 ± 0.07 s. This suggests that evaluating an information display using dwell times as a metric may be difficult, as any effect would have to be comparatively large to be detected given the natural variation in visual scanning behaviour.

Figure 16: Dwell times for individual regions of interest (ROI 1 to 4) as well as dwell times calculated across all attended ROIs. Left: median dwell time per participant; right: interquartile range (IQR) per participant across median dwell times per trial.

3.5.3.3 Time allocated to individual ROIs and within-observer variation
Analysis of the percentage viewing time allocated to the regions of interest (ROIs) showed that the 18 participants varied in the time allocated to the individual ROIs (Figure 17). While all but one participant spent the largest fraction of time looking at ROI 1 (the panel holding the response buttons and computer suggestion), the weighting of the time spent looking at ROIs 2 to 4 varied strongly: the median percentage viewing time ranged from 0 to 38% for ROI 2, 9 to 48% for ROI 3 and 0 to 26% for ROI 4.

Figure 17: Percentage viewing time allocated to different ROIs across all 32 trials for each participant. [This figure shows the variation in attending to different information sources both within trials for the same participant and across participants].

There was very little variation in the viewing time spent on individual regions as a function of the experimental condition (Figure 18). As a general trend across all trials, participants spent approximately half the time looking at ROIs 2 and 4 compared to ROIs 1 and 3.
The median cumulative percentage viewing time spent on a region of interest per trial was 43% for ROI 1, 10% for ROI 2, 30% for ROI 3 and 10% for ROI 4. This corresponded to a median cumulative viewing time of 2.0 s for ROI 1, 0.5 s for ROI 2, 1.3 s for ROI 3 and 0.5 s for ROI 4 (Figure 18). There was notable variation within a participant's viewing time across trials (Figure 19). Similar to the dwell times, this variation was of the same order as the median viewing time.

Figure 18: Boxplots of median viewing time per participant, both as the percentage viewing time (top row) and as the total time in seconds (bottom row). [Data are shown for each of the experimental conditions as well as across all trials ('All'). Experimental conditions: TT – ramp label agrees: true, suggestion correct: true; TF – ramp label agrees: true, suggestion correct: false; FT – ramp label agrees: false, suggestion correct: true; FF – ramp label agrees: false, suggestion correct: false. The figure shows that attention allocation did not differ strongly between experimental conditions, but did differ between participants].

Figure 19: Boxplots of the interquartile range in viewing time per participant, both as the percentage viewing time (top row) and as the total time in seconds (bottom row). [Data are shown for each of the experimental conditions as well as across all trials ('All'). Experimental conditions: TT – ramp label agrees: true, suggestion correct: true; TF – ramp label agrees: true, suggestion correct: false; FT – ramp label agrees: false, suggestion correct: true; FF – ramp label agrees: false, suggestion correct: false].

3.5.3.4 Basic visual scanning behaviour
As can be seen in Figure 20, the experimental condition did not have a notable effect on the trial duration, the number of attended ROIs or the switch count. Participants tended to attend to all ROIs, especially in the TT and TF conditions. The mean ± SD switch count across all conditions was 8.2 ± 4.7. A very high switch count for two participants was likely an artefact of the automatic gaze-to-ROI mapping, as their gaze may frequently have been allocated near ROI boundaries. The number of visits to each ROI showed no obvious dependency on the experimental condition (Figure 21). The median number of visits per ROI was approximately 2 throughout, with slightly more frequent visits to ROI 1.

Figure 20: Basic gaze parameters related to visual scanning for each of the four experimental conditions and across all 32 trials. [The trial duration is the median time it took each participant to complete all trials for a given condition. The number (N) of attended ROIs is the median number of ROIs attended to by each participant for a given condition. The switch count is the median number of gaze redirections between the different ROIs for each participant for a given condition].

Figure 21: Number of visits to each ROI for each of the four experimental conditions and across all 32 trials. [The median for each participant is represented in this dataset].
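A minimal sketch of how the basic scanning parameters reported in this subsection (switch count, number of attended ROIs and number of visits per ROI) can be derived from a mapped ROI-attendance sequence is given below; the sequence is hypothetical and the code is an illustration, not the project's own implementation.

```python
# Basic scanning parameters for one (hypothetical) trial's ROI-attendance sequence.
from collections import Counter

sequence = [1, 3, 1, 2, 1, 4, 3, 1]   # ROI attended at successive dwells

switch_count = sum(1 for a, b in zip(sequence, sequence[1:]) if a != b)
attended_rois = len(set(sequence))

# A visit starts whenever gaze enters a ROI it was not already resting on
visits_per_roi = Counter(
    roi for i, roi in enumerate(sequence) if i == 0 or sequence[i - 1] != roi
)

print(switch_count, attended_rois, dict(visits_per_roi))
```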
3.5.3.5 Eye tracking data
Comparison of the eye tracking data between the expert and non-expert students showed that the expert students weighted information extraction from ROIs 2 to 4 similarly, while the non-expert students weighted ROI 3 higher than ROIs 2 and 4 (Figure 22).

Figure 22: Percentage viewing time allocated to the four regions of interest by the four students classified as 'expert students' (≥ 95% of trials completed correctly, left panel) and by the 14 'non-expert' students (< 95% of trials completed correctly, right panel).

3.5.3.6 Viewing networks and network analysis metrics
Creation of the viewing networks showed that participants followed very different scan paths when looking at exactly the same display (Figure 23). In future, we will quantify scan pattern similarity using metrics from bioinformatics, allowing us to evaluate whether a specific UI design results in more consistent visual scanning within and across participants.

Figure 23: Viewing networks for condition TT (left) and TF (right).

The network analysis metrics showed that scanning behaviour was similar across the four experimental conditions. The large number of edges (3.6 ± 0.9 across all trials, out of a maximum of 6 possible undirected edges between the four ROIs) reflected the attendance to all ROIs in many trials, which was also reflected in the high inclusiveness (90% on average). The link density was 0.6 ± 0.2 in the undirected network and 0.4 ± 0.2 in the directed network. This means that the scan paths did not cycle repeatedly across the ROIs in random order, which was also reflected in the modest average nodal degree and the number of leaf nodes.

Table 1: Network analysis metrics. This table shows the results for all network analysis metrics described in the data analysis section; values are given as mean (SD). An undirected network is independent of the direction of the gaze shift; a directed network takes the direction of the gaze shift into consideration.

METRIC | Matrix directionality | TT | TF | FT | FF | All
Number of edges | Undirected | 3.7 (1.0) | 3.9 (1.1) | 3.1 (1.1) | 3.0 (1.1) | 3.6 (0.9)
Number of edges | Directed | 5.2 (1.8) | 5.3 (2.2) | 3.9 (1.9) | 4.0 (1.7) | 4.7 (1.8)
Link density (fraction) | Undirected | 0.6 (0.2) | 0.6 (0.2) | 0.5 (0.2) | 0.5 (0.2) | 0.6 (0.2)
Link density (fraction) | Directed | 0.4 (0.1) | 0.4 (0.2) | 0.3 (0.2) | 0.3 (0.1) | 0.4 (0.2)
Inclusiveness (in %) | Undirected | 90.3 (12.5) | 91.7 (12.9) | 85.4 (12.3) | 84.7 (12.5) | 90.3 (12.5)
Average nodal degree | Undirected | 1.8 (0.5) | 1.9 (0.6) | 1.6 (0.6) | 1.5 (0.5) | 1.8 (0.5)
Average nodal degree | Directed | 2.6 (0.9) | 2.7 (1.1) | 1.9 (0.9) | 2.0 (0.9) | 2.3 (0.9)
Connection type | Directed, strong | 1.9 (0.7) | 2.0 (1.0) | 1.6 (1.0) | 1.5 (0.8) | 1.6 (0.9)
Connection type | Directed, weak | 1.8 (0.9) | 1.7 (1.0) | 1.1 (0.6) | 1.4 (0.6) | 1.4 (0.5)
Leaf nodes | Undirected | 1.8 (2.0) | 1.4 (1.7) | 2.6 (1.6) | 1.9 (1.9) | 2.0 (1.8)
Leaf nodes | Directed | 3.9 (1.6) | 3.8 (2.0) | 3.6 (1.6) | 4.3 (1.5) | 4.1 (1.4)

3.6 Comparison of expertise levels

3.6.1 Data collection
To compare student behaviour with that of domain experts, we conducted a pilot study involving three (male) experts in road traffic control, based at DIR-CE in Grenoble, France. For logistical reasons, no eye tracking data could be recorded for these participants. Expert participants were given the same instructions as the student participants (based on the script in Appendix 1).

3.6.2 Data analysis

3.6.2.1 Response behaviour
Response behaviour of the experts from Grenoble was analysed as described in Section 3. Data for the student cohort were split into two groups: the 'expert student' cohort included students who completed at least 95% of trials correctly, matching the performance of the experienced road traffic management staff. This cohort included five participants for decision time and response data and four participants for eye tracking data.
The 'non-expert student' cohort included students who completed less than 95% of trials correctly. This cohort included 22 participants for decision time and response data and 14 participants for eye tracking data.

3.6.2.2 Eye tracking data
For each of the three groups, summary statistics were calculated for the number of correct responses for each condition. Further, summary statistics were calculated for reaction times, again for each of the four conditions as well as across all trials. From the eye tracking data, the percentage viewing time was calculated for each ROI to examine whether the expert and non-expert students weighted visual attention allocation differently.

3.6.2.3 Correctness of responses and decision times
According to the results reported in Section 3.5, and as can be seen from Figure 24 below, there is a clear difference between the two groups of a) experts and expert students (high performers) and b) non-expert students (low performers). This difference is of course due to the separation based on the overall percentage of correctly evaluated trials, hence the high average number of correctly identified trials for the expert students. For evaluation purposes, a subset of student participants will hence provide data that is comparable to that provided by experts. Moreover, in terms of decision times, no difference was observed between the two student expertise groups or between the experts from Grenoble and the expert students (Figure 24).

Figure 24: Response reliability for the four experimental conditions for three staff of the road traffic management facility DIR-CE (Grenoble, France), students classed as experts (≥ 95% of trials completed correctly) and students classed as non-experts (< 95% of trials completed correctly).

Decision times calculated across the three different participant groups showed no difference; for all groups and all conditions, it took approximately 5 seconds to complete one trial (Figure 25).

Figure 25: Decision times for the three different participant groups and four different experimental conditions. Throughout, the median decision time was approximately 5 seconds.

3.6.2.4 Questionnaire results

3.6.2.4.1 Answers
The entries from each questionnaire were digitised and summary statistics calculated, both for all student participants and for students separated into 'expert students' and 'non-expert students' (see Section 3.6.2.1).

3.6.2.4.2 Self-evaluation vs. measured performance
Regression analysis of perceived vs. measured performance was performed in Matlab. First, the estimated percentage of correctly evaluated trials was regressed against the measured percentage of correctly evaluated trials. Second, the perceived task difficulty was regressed against the measured percentage of correctly evaluated trials. To facilitate this comparison, the difficulty scores were converted from the original 0 to 10 scale to a 0 to 1 scale, where 0 corresponded to 'very hard' and 1 corresponded to 'very easy'. This approach assumed that students who did well would find the task easier than students who did poorly.

3.6.2.4.3 Answers
The mean ± SD difficulty rating was 3.8 ± 2.4 (with 0 – very easy, 10 – very hard) across all students, 3.4 ± 3.1 for the 'expert student' group and 3.9 ± 2.3 for the 'non-expert student' group.
The estimated percentage of correctly evaluated trials was 74.8 ± 15.4% across all students, 85.0 ± 15.8% for the 'expert student' group and 72.5 ± 14.7% for the 'non-expert student' group. The percentage of trials which students should have accepted and which they estimated to have accepted was 70.0 ± 24.5% across all students, 59.0 ± 33.2% for the 'expert student' group and 72.5 ± 22.3% for the 'non-expert student' group. The percentage of trials which students should have challenged and which they estimated to have challenged was 65.4 ± 22.1% across all students, 62.0 ± 37.2% for the 'expert student' group and 66.1 ± 18.4% for the 'non-expert student' group.

3.6.2.4.4 Self-evaluation vs. measured performance
The correlation between the estimated and measured percentage of correctly evaluated trials was moderate (r = 0.40), and linear regression revealed a slope significantly different from zero (slope = 0.50, intercept = 0.35, R² = 0.16, p = 0.037). This suggests that participants were able to evaluate their own performance with some consistency. There was no strong correlation between the perceived task difficulty and the measured percentage of correctly evaluated trials (r = 0.20), and linear regression estimated a slope not significantly different from zero (p = 0.326).

Figure 26: Self-evaluation compared to quantified performance. [The correlation between the estimated and measured percentage of correctly evaluated trials (left) was moderate (r = 0.40), and linear regression revealed a slope significantly different from zero. The deviation from unity (which would indicate a 1:1 match of perceived and measured performance) shows that participants under- and overestimated their own performance to a similar extent].

3.7 Conclusions and outlook

3.7.1 Summary
In the experiment described here, we used an abstracted version of a road traffic management control user interface to test whether 27 participants would reliably detect incorrect computer suggestions and/or conflicting information in an automated system. We examined response times, user performance/reliability and viewing behaviour in a cohort of students, and compared response correctness and time to three personnel from the Grenoble DIR-CE who were familiar with road traffic management. In the experiment, the operator had to 'accept' or 'challenge' a computer recommendation after examining multiple information sources. Our first main aim was to investigate whether participants would notice those trials that represented flawed or incongruent automation. We found that participants evaluated between 56% and 100% of trials correctly. On closer examination, the condition most commonly evaluated incorrectly was the one in which the ramp labels disagreed while the computer suggestion itself was correct (FT): only a median of 2 out of 8 trials were challenged across participants, whereas the median was 8 for all other experimental conditions. Hence, participants did not acknowledge the discrepancy between the information on which the computer suggestion was based and the source of the data (in this experiment, the ramp number). Our second main aim was to examine whether there was an effect of the experimental condition and expertise level on decision times. The results showed that this was unlikely to be the case: median decision times were around 5 seconds independent of experimental condition, response or expertise level.
Our third main aim was to investigate whether there was consistency within and across participants in allocating viewing time to different regions of interest. The various eye tracking metrics we employed did not show any obvious differences as a function of experimental condition; however, we observed large individual differences in the viewing time allocated to different ROIs and in scan patterns. The difference in scanning behaviour that we did note concerned the percentage viewing time allocated to ROIs 2 to 4: while the 'expert student' group tended to allocate a similar amount of time to these regions, the 'non-expert student' cohort spent more time looking at ROI 3 than at ROIs 2 and 4.

3.7.2 Novice vs. expert behaviour
While the task did not present a challenge to the Subject Matter Experts, we found that the non-experts exhibited an interesting pattern in their responses. Considering the results, we assume that all participants were able to use the graph display (top left of the screen, ROI 3) to apply the rules that we had defined, and hence correctly determined whether the computer suggestion was correct or false. However, the non-experts were confused by the FT condition, in which the automation was correct but the information on the displays disagreed. This suggests that the non-experts were not checking for display congruence, which was supported by the eye-tracking data: the expert student group attributed a similar percentage viewing time to ROIs 2, 3 and 4, whereas the non-expert group tended to spend a much larger proportion of their time looking at ROI 3 than at ROIs 2 and 4 (which are used only to determine display congruence). In terms of task proximity, while all participants were presented with the same information, the non-experts were not able to judge the 'worth' of the displays for congruence checking and focused their attention on the automation-checking aspect of the task. It is possible that this is an effect akin to change blindness (Simons and Levin, 1997), in which relevant information is not attended to on the assumption that it is 'given' and does not require checking. Alternatively, a phenomenon termed 'satisfaction of search' is known from the medical literature, where diagnosticians terminate visual search after finding the first sign of pathology (Berbaum et al. 1990; Berbaum et al. 1994; Samuel et al. 1995). Similarly, participants may have terminated their search after visually confirming that the computer suggestion and the information held in ROI 3 agreed.

3.7.3 Considerations for UI design
In terms of UI design, it is important to consider not only how information can be presented to highlight its 'worth', but also how people might seek to extract information from the displays. Non-experts may have expected the automation to fail on the task they perceived as more complex, leading to their attention being focused mainly on validating the computer suggestion. The findings presented in this report underline the importance of cueing operators using decision support software to make sure they are aware of the context (system state) in which they make decisions. One way to achieve this could be to prompt users to acknowledge when displays show conflicting views. We have already started designing and implementing algorithms to quantify scan pattern similarity using, for example, sequence analysis from bioinformatics. This will allow us to quantify the similarity of viewing behaviour within and across participants as well as across different UI iterations.
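As a first, minimal example of such a sequence-based comparison, the sketch below computes the Levenshtein (edit) distance between two scan paths encoded as strings of ROI labels; this is one of the candidate metrics discussed in Section 4.5 (Table 4). The scan-path strings and the normalisation by the longer path length are illustrative assumptions, not results from the study.

```python
# Illustrative edit-distance comparison of two scan paths encoded as ROI-label strings.
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

path_a, path_b = "13234", "13334"          # hypothetical scan paths over ROIs 1-4
distance = levenshtein(path_a, path_b)
similarity = 1 - distance / max(len(path_a), len(path_b))   # simple 0-1 normalisation
print(distance, similarity)
```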
We currently assume that a design target should be to achieve repeatable and consistent scan patterns, to ensure that an operator always attends to all information sources in a systematic manner. Work in the field of medicine has shown that such systematic scanning behaviour is extremely beneficial for performing a task correctly (i.e. detecting fractures or tumours). We hope that, based on gaze analysis, we can inform UI design to encourage such systematic scanning. On the other hand, in the context of 'ideal observer' theory as applied to visual perception (Geisler 2011), it is also possible that it is important to accommodate individual differences based on participant-specific constraints on, e.g., working memory. In this case it would be detrimental to pre-define a single scan path through UI design; instead, it may be important to accommodate individual behaviour. Further, there may be several 'optimal' scan paths, and participants in such a scenario should not be constrained to a single solution, and equally may not exhibit a repetitive pattern. These are questions we are going to address as the project progresses.

3.7.4 Study limitations and considerations for future work
Based on this study, we are looking to address several challenges associated with UI evaluation based on eye tracking and user responses. In our recently reported work, we automatically map the point of gaze to regions of interest outside the proprietary software, owing to limitations in the visibility of the infrared markers which serve to define global space for head-mounted eye tracking. Despite the success of this approach, in future we are looking to optimise our algorithms in order to remove noise from the sequential ROI attendance sequence. This noise mainly arises from gaze points close to ROI boundaries and can be reduced by placing ROIs even further apart. In addition, we will work with a screen-mounted eye tracking system in order to achieve higher resolution and better mapping compared to the head-mounted system. This will allow us to define small regions of interest corresponding to exact information content within larger ROIs. For our future work, we are investigating suitable eye tracking metrics which allow UI designs to be compared across different tasks and which allow improvements across iterations of the same UI to be quantified. Ideally, we are looking for metrics that give an absolute measure of UI quality; however, this presents a complex problem due to the dependence of many metrics on the task, the scanning duration and UI factors such as clutter and information representation.

4 Defining Baseline Performance Metrics

4.1 Overview
In SPEEDD, the evaluation of the User Interface designs will involve more than a judgement of the 'look and feel' or usability of the screen layouts. As the role of the UI is to support decision making, the project will focus on ways of measuring decision making in response to information search and retrieval. An assumption is that the manner in which information is presented to the user interacts with the manner in which the user searches for specific aspects of the information in order to make decisions. This raises three dimensions that need to be considered in this work.
First, the manner in which information is presented can be described in terms of its value to the task, which includes its diagnosticity (i.e., the correlation between the content of the display and the goal to be achieved), and in terms of its graphical properties (i.e., the format used to display the information). Second, the manner in which users search for specific information can be described in terms of cost, i.e., temporal measures of information search, particularly using eye-tracking metrics. Third, the manner in which empirical information search relates to decision making can be compared to the optimal decision models being developed in SPEEDD. It is proposed that this approach is not only beneficial for SPEEDD but also provides a foundation for the evaluation of Visual Analytics in general.

The definition of metrics for these three aspects involves the specification of a baseline against which subsequent designs can be assessed. While it might be tempting to assume that such a baseline can be derived from the observations and analysis of the DIR-CE control room (as outlined in D8.1 and elaborated in D5.1), it is not obvious how this could provide a realistic comparison with future designs. This is for three reasons. First, many aspects of the task and displays will change, and we need to be able to isolate particular effects for formative assessment. For example, the current control room involves operators consulting 5 screens on their desk and two sets of CCTV screens. Consequently, their visual search is conducted in an environment in which near- and far-field displays of information need to be consulted. We might propose that the SPEEDD prototype, with new visualisations, runs on a single screen (albeit one with several windows running on a large screen). We then need to empirically separate the effects of single versus multiple screens from the effects of the visualisation presented on each display/window. Second, the current task of operators is predicated on completing an incident report and, as D5.1 indicates, much of their activity is focused on this task. It is likely that the SPEEDD prototype will automate much of the report creation, leaving the operator to provide expert commentary on the incident and verify components of the automation rather than log the more mundane details. In this case, the role of the operator would be radically different. Having proposed that the fundamental goal of the UI design work in SPEEDD is to change the nature of the information displayed and the manner of the tasks that the operator performs, it is not easy to see how baseline metrics could be defined which would be applicable across all instances of the design (having said that, we will measure the time cost of incident reporting so that we know the gain if it can be automated). Third, quantitative measurement of operator performance in the working environment is difficult because the actions performed vary in response to changing situational demands. For metrics to be reliable and consistent, one needs to exercise control over the environment in which they are applied. For this reason, we prefer to define metrics which can be applied in laboratory settings, much like the experiment reported in this document. While the definition of metrics reflecting current practice in the DIR-CE control room could be problematic, SPEEDD requires a set of metrics which retain ecological validity.
This means that the metrics need to reflect the work that the operator performs and the type of decisions that the UI will support. We could, for example, take a high-level subgoal from the Hierarchical Task Analysis presented in D8.1, such as 'define incident location', and ask how this subgoal is currently achieved. This could provide a qualitative description, in terms of tasks performed and Regions of Interest consulted in the DIR-CE control room, giving a broad sense of what the operator's current activity might look like. We could then take this subgoal as the basis for an experimental task and run it in controlled settings (with novice and expert participants) to define quantitative metrics of this performance. This approach is illustrated by the experiment in this report. As part of WP8, quantitative metrics allow us to describe the visual scanning behaviour of novices and experts, develop models of visual scanning and decision making, contrast model and experimental data, and quantify the effect of changes to UI design on human performance and behaviour. When comparing different UI designs, it is important to work with metrics that are sensitive to changes while being robust across information content, display choices and other factors. In our work, we consider baseline metrics in the following contexts, where metrics relate to different levels:

4.2 Overall goal hierarchy
At the top level, the investigation concerns the overall goal hierarchy, for example determining the location of an incident in a road traffic management task. In this context, metrics such as the time taken to complete a task/goal and the presence of error serve to quantify user performance and the impact of different UI designs. More specifically, metrics can be separated into the following categories:
- Decision time
- Use of discrete regions of interest (ROIs)
  o Duration spent looking at each ROI
  o Sequence of ROI attendance
  o Repetition of scanning cycles / recurring sequences
- Signal detection
  o Rate of correct responses or target/stimulus detection and probability of correct response
  o Sensitivity and specificity; positive and negative predictive value; accuracy
- User error
  o Missed target
  o Misinterpretation, misunderstanding
  o Failure to visually examine all components of the UI
  o Failure to perform all necessary sanity checks / verifications in an automated system

In the experiment we described as part of this report, we used several of the metrics mentioned above to examine how they respond to different experimental conditions and UI content. We used decision time to investigate whether different levels of disagreement amongst the information presented in the UI trigger an alteration in search time; this was not the case. Attention to discrete regions of interest (ROIs) was quantified in terms of dwell times and cumulative viewing times, showing that these metrics varied more between participants than between experimental conditions. Sequence analysis and recurrence analysis will be performed in future. The rate of correct responses was quantified in the present experiment, allowing us to correlate perceived and measured performance and to separate the students into cohorts representative of different expertise levels. Quantifying user error in the present experiment allowed us to pinpoint where participants made mistakes, with first steps taken to relate this to visual scanning behaviour, showing that differences in the time spent on ROIs may be associated with erroneous responses.
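To make the signal-detection measures listed in Section 4.2 concrete, the sketch below computes them for the accept/challenge task, treating a flawed or incongruent trial (one that should be challenged) as the positive class. The counts are invented for illustration and do not correspond to any participant's data.

```python
# Signal-detection style measures for the accept/challenge task (hypothetical counts).
tp, fn = 20, 4   # flawed trials that were challenged / wrongly accepted
tn, fp = 7, 1    # sound trials that were accepted / wrongly challenged

sensitivity = tp / (tp + fn)                  # proportion of flawed trials detected
specificity = tn / (tn + fp)                  # proportion of sound trials accepted
ppv = tp / (tp + fp)                          # positive predictive value
npv = tn / (tn + fn)                          # negative predictive value
accuracy = (tp + tn) / (tp + tn + fp + fn)    # overall rate of correct responses

print(sensitivity, specificity, ppv, npv, accuracy)
```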
4.3 Theory and modelling
To compare theoretical predictions and model-based exploration of large design spaces to human performance on selected design choices, it is important to work with metrics that can both inform models of visual scanning and decision making and be measured in human visual scanning. Based on our past work in the domain and on published work, there are several metrics that allow for this, which fall into the following categories:
- Basic oculomotor behaviour (used in models of fundamental visual scanning behaviour)
  o Saccade amplitude
  o Fixation duration
- Allocation of attention and information extraction
  o Dwell times per region of interest (ROI)
  o Scan path length
- Function of ROI
  o Use of ROI in the context of task completion and user correctness
  o Sequence of ROI selection and weighting of ROI importance
- Allocation of attention relative to usefulness and cost; effect of ROI…
  o Diagnosticity (how well the ROI content predicts the scenario outcome)
  o Access cost (how much effort is needed to extract information from a ROI); cost commonly corresponds to time, and will also include the time necessary to perform required task actions such as filling in forms
  o Accuracy (how noisy the data displayed in the ROI is)
These metrics will be applied in work planned for summer/autumn 2015.

4.4 UI evaluation
When evaluating different UI designs, either across iterations or across tasks, it is important to quantify user performance as well as user perceptions regarding the 'goodness' of the design. In an ideal scenario, a UI that results in reliable performance is also perceived by users as intuitive, easy to interpret and user-friendly. Further, we are looking to apply a no-choice / choice paradigm (related to work recently published by Howes et al., 2015a) to determine user choices among different options for displaying the same information. This paradigm has the advantage that it does not rely on metrics that may be too noisy to detect statistical differences. Rather, in this approach users are trained to use a variety of visualisations to complete a task, followed by a task in which the visualisation can be chosen freely. This allows user preferences to be determined 'on the job', without confounding effects such as post-hoc rationalising or speculative choices. To be able to track parameters across the design process, we are considering metrics related to the following domains:
- Task difficulty
  o Quantification of user self-evaluation using questionnaires
  o Correlation of perceived difficulty / workload etc. with objectively quantified user performance, such as the percentage of correctly evaluated trials or task duration
- User evaluation
  o Questionnaire following the task to determine strategy and user evaluation
  o Debrief session for each participant
- Design space parameterisation
  o Iterative adjustment of e.g. diagnosticity, clutter and accuracy of ROIs in static tasks
  o Adjustment of rates of change in dynamic tasks
  o Correlation of visual sampling strategy and user performance
  o Examination of different options to display the same information
- Evaluation of different display options
  o No-choice / choice paradigm
  o User ranking

In our recent experiment, we used questionnaires to obtain a self-estimate of performance from the participants. Correlation of this self-estimate of the percentage of correctly evaluated trials with measured performance showed a significant association. We also included the question 'what was your strategy?' in the questionnaire; answers generally showed that participants had understood the task and performed it logically.
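A minimal sketch of this self-evaluation versus measured-performance comparison is given below, using SciPy in place of the Matlab routines used for the reported analysis; the questionnaire estimates, measured scores and difficulty ratings are invented for illustration.

```python
# Correlation / regression of self-estimated vs. measured performance (invented data).
import numpy as np
from scipy import stats

measured_pct_correct = np.array([56, 72, 84, 97, 66, 88, 59, 100])   # from responses
estimated_pct_correct = np.array([60, 75, 80, 90, 70, 85, 65, 95])   # from questionnaire

r, _ = stats.pearsonr(estimated_pct_correct, measured_pct_correct)
fit = stats.linregress(measured_pct_correct, estimated_pct_correct)  # estimate ~ measured
print("r =", r, "slope =", fit.slope, "p =", fit.pvalue)

# Perceived difficulty (0 = very easy, 10 = very hard) rescaled to a 0-1 'easiness' score
difficulty = np.array([5, 4, 3, 1, 6, 2, 7, 0])
easiness = 1 - difficulty / 10.0
r_difficulty, _ = stats.pearsonr(easiness, measured_pct_correct / 100.0)
print("easiness-performance r =", r_difficulty)
```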
In work planned for summer/autumn 2015, we will start to manipulate design space parameters, both for modelling of visual search and for empirical evaluation of human scanning behaviour. This will likely include an experimental approach described in Howes et al. (2015b) based on a no-choice / choice paradigm.

4.5 Evaluation of metrics for the quantification of UI design
As part of our work for SPEEDD so far, we have introduced a range of metrics that can be used to quantify the aspects mentioned above. The following tables present a first evaluation, following our experimental work so far, of the advantages and disadvantages of these metrics in the context of UI evaluation. It is apparent that a single metric is unlikely to capture design properties across various tasks, designs and user goals. However, several metrics can be normalised, and it will certainly be possible to use many of them for iterative design evaluations, and others for layout and content considerations.

Table 2: Eye tracking, low level

Saccade amplitude
Pro: Shows whether long or short visual paths are taken within the display; shows whether information inside or outside the periphery is accessed.
Con: Needs a well controlled experimental setup; can be confounded by head movement.

Saccade frequency
Pro: Quantifies scanning behaviour at a very basic level, e.g. showing whether a user is moving the eyes a lot or not compared to normal eye movement in normal environments.
Con: Saccade frequency is a very intrinsic property of the visual system and may not correlate well with changes to UI design. There are always micro-saccades, and saccades can occur within the same ROI. Dwell times or cumulative viewing times might be better.

Fixation duration
Pro: This metric corresponds to information extraction at the most basic level.
Con: Might be confounded / overridden by the physiological necessity for the eyes to always move a little after a brief time (else the image would disappear), so this metric is bounded (especially with an upper bound).

Fixation count
Pro: Attention allocation to a ROI at the most basic level.
Con: As above, relates to very fundamental scanning processes and may not correlate well with design.

Scan path length
Pro: Measure of the distance covered while moving the eyes across the UI; this metric should hence correlate with cost / efficiency of information sampling.
Con: Hard to do with glasses in a control room setup with multiple screens; has to be normalised to UI content, as it is strongly dependent on the number of ROIs, ROI location and content; a scan path might consist of many short saccade amplitudes to many ROIs or few large amplitudes to few ROIs, so this metric does not capture the 'complete story'.

Blink amplitude
Pro: Blink frequency can relate to concentration.
Con: May be difficult to detect and separate from dropouts due to other reasons.

Pupil dilation metrics (constriction rate etc.)
Pro: Tracking pupil dilation may allow the decision and information weighting and accumulation process to be mapped before conscious decisions are made (see de Gee et al. 2014); rates may allow the ease of information extraction to be quantified.
Con: This is a young field and needs extensive testing / validation.
Table 3: Eye tracking, processed / high level

Cumulative viewing time (in % trial duration or seconds)
Pro: Gives an indication of weighting of information sources; shows which information sources are attended and not attended to.
Con: Looking at a ROI for a long time does not necessarily mean that lots of information is extracted; it can also mean that a user is trying and failing to understand content.

Dwell time
Pro: Very basic scanning parameter relating to information extraction time for a specific ROI per visit; lots of reference values in the literature.
Con: Dwell times tend to be quite similar, especially in our first experiment, so they might not reflect the 'goodness' of a display unless it is really bad or convoluted, or the task demands differ strongly.

Number of visits to ROI per trial (respectively number of returns)
Pro: How often a participant revisits a ROI should correlate with how well content can be memorised; in an optimal scan path we probably want each ROI visited just once (in a static scenario!).
Con: For some ROIs, revisits may be necessary for task completion (e.g. in our study it is necessary to read the computer suggestion and then press a button in that ROI at the end of the trial).

Return time
Pro: Shows how long people might be able to remember ROI content for.
Con: Confounded by attention to and competition from other ROIs and the task goal.

Switch count (= number of transitions between ROIs)
Pro: Indicates the 'efficiency' of scanning behaviour.
Con: There is the need to normalise this to the number of ROIs.

Time to first hit / rank order of first hit
Pro: Indicates the 'ranking' of information importance for a participant.
Con: Time needs to be normalised, so we could use rank ordering instead, for example; this metric is confounded in our recent experiment, since participants fixate the middle of the screen during the interval between trials and the first gaze point is often mapped to ROI 3 following the screen refresh.

Dwell time on first hit / first and second pass dwell time
Pro: Indicates how long it takes to extract sufficient information from a previously unseen ROI.
Con: This will be confounded after the first trial due to learning.

Entropy
Pro: Normalised dispersion measure that can be compared across different tasks / designs.
Con: The maths / assumptions need careful consideration.
Table 4: Scan pattern comparison using bioinformatics

Levenshtein distance (= 'edit distance')
Pro: Easy to normalise; quantifies the similarity between pairwise scan paths using the string edit method from computer science.
Con: Easily confounded by noisy mapping; sensitive to small deviations in otherwise recurring patterns (in contrast to some bioinformatics approaches).

Global and local sequence alignment
Pro: Detection of similar scan patterns within and across participants, allowing for some noise in the pattern (tolerances controlled via the scoring matrix).
Con: Very UI specific; it is a good metric to evaluate consistency in scanning for different layout options of the same material, but difficult to normalise (unless using purely path length).

Recurring n-mers
Pro: Shows whether UI design triggers similar scan patterns within and across participants; extracts those scan patterns that are most common.
Con: As for sequence alignment.

Table 5: Network analysis metrics

Metrics: Number of edges; Link density; Inclusiveness; Nodal degree and average nodal degree; Connection type; Leaf nodes
Pro:
- Simple count of paths taken between ROIs (number of edges)
- Normalised metric of paths taken relative to all possible paths; allows one to quantify, e.g., how systematically a participant scans the UI (link density)
- Normalised metric reflecting the percentage of ROIs attended to, independent of the design constraints (inclusiveness)
- Related to link density; reflects how many different ROIs are accessed (per ROI or on average) from a given ROI (nodal degree)
- Allows the function of each ROI within the scan path of an observer to be identified, i.e. whether a ROI is central to switching to many other ROIs or only accessed at a particular point (nodal degree)
- Count of between how many pairwise ROIs participants look back and forth, and how many ROIs are attended to only unidirectionally (connection type)
- Counts how many ROIs are only accessed from one other ROI and reflects ROIs with a very specific function in the viewing sequence (leaf nodes)
Con:
- Has to be normalised to network size when comparing across different numbers of ROIs, where it becomes link density
- For few ROIs this might be insensitive to experimental conditions
- The ROI-specific nodal degree may be difficult to normalise independent of UI design; it will be very good for iterative design, but less meaningful when comparing e.g. the fraud and road traffic UIs, as those ROIs have implicitly different functions
- Looking at the pilot data, this metric might be insensitive to small design changes due to variation in participant behaviour
- It may be difficult to separate systematic from random effects, especially for short scan paths: there may be a high number of leaf nodes purely because each ROI is only accessed once
Table 6: Eye / head coupling

Latency
Pro: This time lag between eye- and head-movement onset has been associated with intrinsic / extrinsic cue attendance.
Con: Very difficult to measure; needs a synchronised mocap system or a robust head-movement detection algorithm based on the head-mounted system.

Combined gaze shift
Pro: Quantifies the percentage of switches between ROIs executed by both eye- and head movement, allowing the goal structure to be inferred.
Con: Heavily dependent on the spatial setup (especially > 40 deg visual angle); for smaller gaze shifts, eye/head coupling is very participant-specific as a baseline.

Head contribution
Pro: Amount that head movement contributes to the gaze shift; easy to normalise.
Con: May not correlate well with UI design and may be a feature of the more intrinsic viewing behaviour of individual participants.

Head movement amplitude
Pro: Useful metric for large control room design evaluations.
Con: May be less useful for single-screen work, as gaze shifts are too small for this to be relevant.

Table 7: User performance

Decision time
Pro: Simple metric of how long it takes to extract information from the UI to perform a task.
Con: Specific to the task / question asked.

Correlation of perceived difficulty and correctness of response
Pro: The correlation coefficient is a simple, standardised metric to quantify how well perceived UI handling and actual task performance match.
Con: This is currently built around the assumption that an observer who performs poorly would find the task difficult, and an observer who performs well would find the task simple. However, observers who find the task difficult might still do well, just with more mental resources.

5 References

Baber, C. (2015). Evaluating Human Computer Interaction. In J.R. Wilson and S. Sharples (eds) Evaluation of Human Work, London: CRC Press.
Berbaum, K.S., Franken, E.A., Dorfman, D.D. et al. (1990). Satisfaction of search in diagnostic radiology. Investigative Radiology, 25, 133-140.
Berbaum, K.S., El-Khoury, G.Y. and Franken, E.A. (1994). Missed fractures resulting from satisfaction of search effect. Emergency Radiology, 1, 242-249.
Bevan, N. (2001). International standards for HCI and usability. International Journal of Human Computer Interaction, 55, 533-552.
Brooke, J. (1996). SUS: a quick and dirty usability scale. In P.W. Jordan, B. Weerdmeester, B.A. Thomas and I.L. McLelland (eds) Usability Evaluation in Industry, London: Taylor and Francis, 189-194.
Chen, X., Bailly, G., Brumby, D.P., Oulasvirta, A. and Howes, A. (2015). The emergence of interactive behavior: a model of rational menu search. Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing (CHI'15), Seoul, Korea.
de Gee, J.W., Knapen, T. and Donner, T.H. (2014). Decision-related pupil dilation reflects upcoming choice and individual bias. Proceedings of the National Academy of Sciences, 111(5), E618-E625.
Geisler, W.S. (2011). Contributions of ideal observer theory to vision research. Vision Research, 51, 771-781.
Holcomb, R. and Tharp, A.L. (1991). What users say about software usability. International Journal of Human-Computer Interaction, 3, 49-78.
Howes, A., Duggan, G.B., Kalidini, K., Tseng, Y-C. and Lewis, R.L. (2015a). Predicting short-term remembering as boundedly optimal strategy choice. Cognitive Science.
ISO 13407 (1999). Human-centred Design Processes for Interactive Systems. Geneva: International Standards Office.
Samuel, S., Kundel, H.L., Nodine, C.F. and Toto, L.C. (1995). Mechanism of satisfaction of search: eye position recordings in the reading of chest radiographs. Radiology, 194, 895-902.
Shackel, B. (1984). The concept of usability. In J. Bennett, D. Case, J. Sandelin and M. Smith (eds) Visual Display Terminals: Usability Issues and Health Concerns, Englewood Cliffs, NJ: Prentice-Hall, 45-88.
Whiteside, J., Bennett, J. and Holtzblatt, K. (1988). Usability engineering: our experience and evolution. In M. Helander (ed.) Handbook of Human-Computer Interaction, Amsterdam: Elsevier, 791-817.

Appendix 1 – SUS translated into French
http://blocnotes.iergo.fr/concevoir/les-outils/sus-pour-system-usability-scale/

Appendix 2 – Summary of Comments from SMEs
1. Want to control road rather than work zone-by-zone.
2. Ramp metering not always under operator control.
3. Sensors will only be on south / west lanes.
4. Control of ramp should take account of the length of the queue building up behind it.
5. Maximise traffic flow.
6. Apply penalty (to automated system) for queue?
7. How does prediction work in display A?
8. Contrast global versus local view (see 1).
9. How read colours on car dials? Is this showing targets or limits?
10. Too much information on the display for the job of ramp metering (confused by whether UI is on an extra screen or a replacement for the whole system?).
11. Prediction if useful.
12. Graphs show limits (upper and lower) and normal, and predictions.
13. Warning shown when outside limits rather than having operator explore all sensors (confused by demo?).
14. Too much information on same map (display A).
15. Coloured circles show information for zone.
16. Ramp overview (display C) useful.
17. Would prefer map showing where there is a problem and to have information needed to understand this problem to hand at that location.
18. Need global view and simple data.
19. Danger in using zooming / resizable displays – makes most recent the most important; obscures operator experience in prioritising events; limits overview and SA.
20. Sometimes a ramp is closed and it is not clear how traffic will evolve. It would be useful to include this in modelling (CB – what about inviting operator to propose how traffic will change?)
21. Click on ramp square (on display A) and get overview of parameters for that ramp.
22. Predicted versus current state useful.
23. Not sure how to use vehicle inter-distance (isn't this density?).
24. Not sure if driver behaviour falls under control room command; need to optimise traffic flow and leave driver behaviour to driver responsibility. Also, changing driver behaviour has long-term evolution.
25. Current ramp metering:
   a. Ramp lengths short, so need to manage length of queue.
   b. When ramp active, operator seeks to manage queue length.
   c. If queue at end of ramp then impact on traffic on main road.
   d. Might need to act on downstream ramp.
   e. Ramp metering entirely automated (in new system) and could be useful to allow operator intervention.
26. Invitation to contribute to working party on design of ramp metering UI that is currently being implemented.
27. Representation of events – road works, accidents, congestion – on the map.
28. Response depends on whether this supplements or replaces current system (see comment on 10).
29. Log of events written by operators and details from phone calls, observation from CCTV, email, other reports...
30. Understanding situation (global) as it evolves.
31. Pushing information to other panels or to other road staff.
32. Connection with other agencies (not easy to do at present but could help with joined-up response).

Appendix 3 – Complete Experiment Instructions

Traffic Management Experiment
In this experiment you will be playing the role of a traffic manager. Your job is to maintain a desirable traffic flow by changing ramp metering rates (the frequency of traffic lights changing colour) based on traffic density (the number of cars in the area of interest). This will be done using the user interface shown below. The UI consists of 4 information sources: a map (top right corner), a graph of Rate vs Density (top left corner), a list of all ramps (bottom left corner) and a control window (bottom right). Your task is to either ACCEPT or CHALLENGE a COMPUTER SUGGESTION (shown in the bottom right window) regarding changes to a specific ramp. However, the computer suggestion might be wrong and/or the information sources (the 4 different windows) might refer to different ramps. If not all information sources refer to the same ramp and/or the computer suggestion is considered to be wrong, you should press the CHALLENGE button. You should only press the ACCEPT button if all information sources agree (refer to the same ramp) AND the computer suggestion is deemed correct. To determine whether the computer suggestion is correct, you should look at the Rate vs Density graph in the top left window.
- If most of the bigger dots are in the top left quadrant (high rate & low density), the rate at that ramp should be LOWERED.
- If most of the bigger dots are in the bottom right quadrant (low rate & high density), the rate at that ramp should be INCREASED.
- If most of the bigger dots are in the top right (high rate & high density) or bottom left (low rate & low density) quadrant, the rate at that ramp should be LEFT UNCHANGED.
This experiment consists of 32 trials. In between each trial you will see the white screen below. To begin the trial or proceed to the following one you need to press on the timer. You are requested to answer each trial as FAST as possible. However, in between each task (when you see the white screen with the counter) you will not be timed. In the beginning you will be given 2 practice trials, after which you can clarify anything regarding the task.