
Scalable Data Analytics,
Scalable Algorithms, Software Frameworks
and Visualization ICT-2013 4.2.a
Project
FP7-619435/SPEEDD
Deliverable D8.3
Distribution Public
http://speedd-project.eu
Evaluation of SPEEDD prototype 1 for Road
Traffic Management
Federica Garin
(INRIA)
Chris Baber, Sandra Starke, Natan Morar and Andrew Howes
(University of Birmingham)
Alexander Kofman
(IBM)
Status: FINAL
April 2015
Project
Project Ref. no: FP7-619435
Project acronym: SPEEDD
Project full title: Scalable ProactivE Event-Driven Decision Making
Project site: http://speedd-project.eu/
Project start: February 2014
Project duration: 3 years
EC Project Officer: Aleksandra Wesolowska
Deliverable
Deliverable type: Report
Distribution level: Public
Deliverable Number: D8.3
Deliverable Title: Evaluation of SPEEDD prototype 1 for Road Traffic Management
Contractual date of delivery: M13 (March 2015)
Actual date of delivery: April 2015
Relevant Task(s): WP8/Tasks 8.3
Partner Responsible: INRIA
Other contributors: UoB, IBM
Number of pages: 75
Author(s): F. Garin, C. Baber, S. Starke, N. Morar, A. Howes, and A. Kofman
Internal Reviewers:
Status & version: Final
Keywords: Evaluation, User Interface Design, Human Factors, Eye Tracking
Contents
0 Executive Summary .........................................................................................................................7
1 Introduction ....................................................................................................................................8
1.1 Technical Evaluation ........................................................................................................................... 9
1.2 Evaluating User Interfaces .................................................................................................................. 9
1.3 Formative Evaluation ........................................................................................................................ 10
1.4 Summative Evaluation ...................................................................................................................... 12
2 Evaluation by Subject Matter Experts ............................................................................................ 13
2.1 Study ................................................................................................................................................. 13
2.3 Comments from Subject Matter Experts .......................................................................................... 14
2.3.1 Support Operator Goals ...................................................................................................... 14
2.3.2 Provide Global View ............................................................................................................ 15
2.3.3 Indicate ramp metering and operator role ......................................................................... 15
2.3.4 Modify User Interface design .............................................................................................. 15
3 Summative evaluation through laboratory trials ............................................................................ 16
3.1 Rationale ........................................................................................................................................... 16
3.2 Study design and data collection ...................................................................................................... 17
3.2.1 User interface ...................................................................................................................... 17
3.2.2 Experimental conditions ..................................................................................................... 18
3.3 Method ............................................................................................................................................. 20
3.3.1 Participants ......................................................................................................................... 20
3.3.2 Task and experimental timeline .......................................................................................... 20
3.3.3 Data collection .................................................................................................................... 21
3.3.4 Questionnaire ..................................................................................................................... 21
3.3.5 Eye tracking ......................................................................................................................... 22
3.4 Measures of Performance ................................................................................................................. 24
3.4.1 Correctness of response ..................................................................................................... 24
3.4.2 Decision times ..................................................................................................................... 24
3.4.3 Gaze Behaviour ................................................................................................................... 25
3.5 Results ............................................................................................................................................... 26
3.5.1 Correctness of responses .................................................................................................... 26
3.5.2 Decision times ..................................................................................................................... 27
3.5.3 Gaze behaviour ................................................................................................................... 28
3.6 Comparison of expertise levels ......................................................................................................... 35
3.6.1 Data collection .................................................................................................................... 35
3.6.2 Data analysis ....................................................................................................................... 35
3.7 Conclusions and outlook ................................................................................................................... 38
3.7.1 Summary ............................................................................................................................. 38
3.7.2 Novice vs. expert behaviour................................................................................................ 39
3.7.3 Considerations for UI design ............................................................................................... 40
3.7.4 Study limitations and considerations for future work ........................................................ 40
4 Defining Baseline Performance Metrics.......................................................................................... 42
4.1 Overview ........................................................................................................................................... 42
4.2 Overall goal hierarchy ....................................................................................................................... 43
4.3 Theory and modelling ....................................................................................................................... 44
4.4 UI evaluation ..................................................................................................................................... 45
4.5 Evaluation of metrics for the quantification of UI design ................................................................. 46
5 References .................................................................................................................................... 50
Appendix 1 – SUS translated into French ............................................................................................... 51
Appendix 2 – Summary of Comments from SMEs .................................................................................. 52
Appendix 3 – Complete Experiment Instructions ................................................................................... 53
Traffic Management Experiment ............................................................................................................ 53
Index of Figures
Figure 1: ISO 13407 Human-Centred Systems Design Lifecycle ................................................................... 9
Figure 2: Software Usability Scale (Brooke, 1986) ...................................................................................... 11
Figure 3: Results from SUS evaluation ........................................................................................................ 13
Figure 4: UI for Experimental Scenario ....................................................................................................... 18
Figure 5: Incorrect ramp information: the ramp shown in panel 1 (right) is number 13, whereas the ramp
shown in panel 2 (left) is number 1. ........................................................................................................... 18
Figure 6: Information displayed in figure 4, panel 3, corresponding to computer suggestions................. 19
Figure 7: Experimental timeline. Thirty-two trials were separated by intervals displaying a counter (used to later synchronise eye tracking recordings and response data). ................................................... 21
Figure 8: Participant wearing a head-mounted eye tracker during a pilot data recording session. .......... 22
Figure 9: Setup of the monitor displaying the UI. Sixteen infrared markers allow mapping the point of
gaze to specific regions of interest. ............................................................................................................ 23
Figure 10: Markers used for simple mapping of point of gaze to four ROIs. .............................................. 24
Figure 11: Number of trials with correct response for each of the four experimental conditions. ..........
Figure 12: Decision times as a function of condition, response or condition separated by response. ...... 27
Figure 13: Response times as a function of trial number for each participant. ......................................... 28
Figure 14: Heatmaps for the gaze data of four participants across the complete recording period. ........ 29
Figure 15: Example for sequential viewing behaviour of the 18 participants for scenario number 13 (out
of 32 scenarios). .......................................................................................................................................... 29
Figure 16: Dwell times for individual regions of interest (ROI 1 to 4) as well as dwell times calculated
across all attended ROIs. Left: median dwell time per participant; right: interquartile range (IQR) per
participant across median dwell times per trial. ........................................................................................ 30
Figure 17: Percentage viewing time allocated to different ROIs across all 32 trials for each participant. . 31
Figure 18: Boxplots of median viewing time per participant, both as the percentage viewing time (top
row) and as the total time in seconds (bottom row). ................................................................................. 32
Figure 19: Boxplots of the interquartile range in viewing time per participant, both as the percentage
viewing time (top row) and as the total time in seconds (bottom row)..................................................... 32
Figure 20: Basic gaze parameters related to visual scanning for each of the four experimental conditions
and across all 32 trials. ................................................................................................................................ 33
Figure 21: Number of visits to each ROI for each of the four experimental conditions and across all 32
trials. ........................................................................................................................................................... 33
Figure 22: Percentage viewing time allocated to the four regions of interest by the four students
classified as ‘expert students’ (≥ 95% of trials completed correctly, left panel) and by 14 ‘non-expert’
students (< 95% of trials completed correctly, right panel) ....................................................................... 34
Figure 23: Viewing networks for condition TT (left) and TF (right)............................................................. 34
Figure 24: Response reliability for the four experimental conditions for three staff of the road traffic
management facility in DIR-CE (Grenoble, France), students classed as experts (≥ 95% of trials completed
correctly) and students classed as non-experts (< 95% of trials completed correctly) .............................. 36
Figure 25: Decision times for the three different participant groups and four different experimental
conditions. Throughout, the median decision time was approximately 5 seconds. .................................. 37
Figure 26: Self-evaluation compared to quantified performance. ............................................................. 38
Index of Tables
Table 1: Network analysis metrics. This table shows the results for all network analysis metrics described
in the data analysis section. An undirected network is independent of the direction of the gaze shift; a
directed network takes the direction of the gaze shift into consideration. ............................................... 35
Table 2: Eyetracking low level ..................................................................................................................... 46
Table 3: Eye tracking processed / high level ............................................................................................... 47
Table 4: Scan pattern comparison using bioinformatics ............................................................................ 48
Table 5: Network analysis metrics .............................................................................................................. 48
Table 6: Eye / head coupling ....................................................................................................................... 49
Table 7: User performance ......................................................................................................................... 49
0 Executive Summary
Task 8.3 is concerned with the definition and development of performance metrics by which it is
possible to measure human decision making in the SPEEDD road traffic use case. Having a set of human
performance metrics allows us to consider how different versions of the SPEEDD prototype could have
an impact on operator decision performance. The primary form of interaction that the operator will
have with the SPEEDD prototype will be through the User Interface (UI). Consequently, the primary aim
in this report is to explore ways in which we can develop metrics that provide a baseline against which
subsequent designs can be compared. In order to define such metrics, it is necessary to have a model of
the activity that is being measured.
In SPEEDD the evaluation of the UI designs will involve more than a judgement of the ‘look and feel’,
user acceptance, or usability of the screen layouts. As the role of the UI is to support decision making,
the project will focus on measuring decision making in response to information search and retrieval. An
assumption is that the manner in which information is presented to the user interacts with the manner
in which the user searches for specific aspects of the information in order to make decisions. This raises
three dimensions that need to be considered in this work. First, the manner in which information is
presented can be described in terms of its diagnosticity, i.e., the correlation between the content of the
display and the goal to achieve, and in terms of its graphical properties, i.e., the format used to display
the information. Second, the manner in which users search for specific information can be described in
terms of cost, i.e., temporal measures of information search, particularly using eye-tracking metrics. Third,
the manner in which information search relates to decision making can be related to the optimal
decision models being developed in SPEEDD. It is proposed that this approach is not only beneficial for
SPEEDD but also provides a foundation for the evaluation of Visual Analytics in general.
This report describes a user acceptance test (with Subject Matter Experts) to compare different versions
of the UI, followed by the design and conduct of an experiment evaluating one of the versions of the UI
for a specific (ramp metering) task. The experiment provided an opportunity to consider a variety of
metrics which could be used to indicate cost (of information retrieval) against diagnosticity of
information, as well as a means of comparing performance of Subject Matter Experts (i.e., experienced
control room staff) and novices (i.e., undergraduate students). The report concludes with a discussion
of potential metrics that we will take forward for further consideration in the SPEEDD project.
1 Introduction
In the Description of Work for Work Package 8, it is proposed that "Every version of the integrated
prototype will be followed by technical and user-oriented evaluation to obtain the necessary feedback
for the functions to be included or altered in the next version". For the development of Prototype 1,
evaluation has been conducted along four primary routes. The first is the technical evaluation that
arises from the demands of integrating the User Interface (UI) into the SPEEDD Architecture. Not only
has this required that the objects in the UI are able to display data delivered over the Architecture but
also that any operator inputs can be either handled by the Architecture or dealt with locally. This is
briefly covered in Section 1.1 and is the subject of Work Package 6. The second is acceptance testing.
This is conducted by presenting versions of the UI to Subject Matter Experts (SMEs) and asking for their
opinion of the designs and their suggestions for improvement. The third is user testing. This is conducted through experiments in laboratory settings in the first year (the aim is to perform such testing with SMEs on the full prototype in later years). The fourth is the definition of baseline performance metrics against which subsequent designs can be compared; this is introduced in Section 4.
In addition to considering the evaluation of the SPEEDD prototype, this report also presents initial thoughts on the challenge of defining a ‘baseline’ against which subsequent designs can be compared. The
purpose of evaluation is to “assess our designs and test our systems to ensure that they actually behave
as we expect and meet the requirements of the users” (Dix et al., 1993). This means that, for SPEEDD, it
is important to consider how the UI can have an impact on the decision making activity of operators and
people who will be using the UI. The question of how to measure impact on decision making activity is,
therefore, of paramount importance for this project.
A baseline should not only capture operators’ current performance (i.e., using the
technology that they currently have in their control rooms) but also provide a set of metrics for
evaluating future designs. As we will explain in Section 4, the definition of a baseline is complicated by
the fact that it is not easy to separate operator performance from the equipment that they use. For
example, if the operators spend a large proportion of their time completing reports (see Deliverable 5.1),
then ‘report time’ could be a performance metric that we would seek to improve on. However, if one
eliminated the requirement to make reports, e.g., because the system fully automated the collection of
such information, then the metric would be redundant. Of course, an improvement from, say 40% of
one’s time spent completing reports to 0% would be impressive but it is not a useful measure of
operator performance and tells us little about whether the work has actually improved (rather than
simply having some aspect of it eliminated), nor does it tell us how decision making has changed.
1.1 Technical Evaluation
From the technical point of view, SPEEDD prototype version 1 achieved its goals, i.e., basic versions of
the components and the infrastructure have been developed following the event-driven architecture
paradigm. SPEEDD prototype version 1 functionality has been tested using the end-to-end flow, which shows that the components get inputs and emit outputs as expected using the SPEEDD platform. At this point we have not tested non-functional requirements (such as performance and parallelism). In this report we will consider initial feedback from users who actually operate the SPEEDD platform. Lessons learnt from the year 1 prototype will assist in reshaping the architecture and infrastructure. Furthermore, more advanced features of the components will allow us to construct more complex scenarios that can then be evaluated by real users of the platform for further refinement.
1.2 Evaluating User Interfaces
Usability is “…the capability in human functional terms [of a product] to be used easily and effectively by
the range of users, given specified training and user support, to fulfill the specified range of tasks, within
the specified range of environmental scenarios” [Shackel, 1984]. ISO 9241-11 operationalises this initial
definition as “...the extent to which a product can be used by specified users to achieve specified goals
with effectiveness, efficiency and satisfaction in a specified context of use”. This raises two key issues for
UI evaluation. First, it is important to note that evaluation must be directed at the role of a given
technology in supporting specified users and goals. This requires clear definition of the characteristics of
the tasks and goals that users perform. Second, it is important to combine an appreciation of subjective
response (i.e., satisfaction) with measures of performance (i.e., effectiveness and efficiency) for these
specified users seeking to achieve specified goals. From this, it is important to appreciate who will be
using the product, for what purpose, in what environment (i.e., the ‘context of use’) and then to define
appropriate measures of performance for this context of use.
Figure 1: ISO 13407 Human-Centred Systems Design Lifecycle
ISO 13407 (figure 1) describes Human-Centred System design as an iterative lifecycle. While stage 5
involves evaluating the design against requirements, it is important to realise that evaluation is involved
in each of the other stages as well. Not only is evaluation integral to design, but it is also important to measure more than simply a product’s features (Baber, 2015). The evaluation of a User Interface
involves not only consideration of the aesthetics (‘look and feel’) of the design but also the effect of the
design on the work of people who will use that UI. From this, it is necessary to define the measures
needed to support evaluation and to define a referent model. The use of the referent model will be
particularly important for the evaluation of subsequent UI designs in SPEEDD as we need to identify how
and if these designs have resulted in measurable change to operator performance. It is customary to
distinguish between formative evaluation (i.e., the semi-formal testing of a design at the various stages)
and summative evaluation (i.e., the formal testing of the design at stage 5).
1.3 Formative Evaluation
The Formative Evaluation of a User Interface occurs throughout the design lifecycle. This could involve
comparing the current design against user requirements. An early version of such comparison was
reported in D5.1. As SPEEDD progresses, so the comparison against user requirements will be repeated
to ensure that the designs continue to meet the needs of the users as expressed in these requirements.
Additionally, formative evaluation can involve the elicitation of subjective responses from users. This
form of acceptance testing can be a valuable means of both ensuring that the user requirements
continue to be appropriate and engaging with prospective end users throughout the design and
development process. There are various approaches to eliciting user opinion, with the most common
involving some form of ‘usability inspection’ (Nielsen, 1993). These range from extensive sets of questions, e.g., Ravden and Johnson (1989), which cover many aspects of interacting with a computer system but can be time-consuming to complete, to smaller question sets intended to cover high-level aspects of the interaction. These aspects could focus on subjective responses, such as
satisfaction, e.g., CUSI — Computer User Satisfaction Inventory (Kirakowski and Corbett, 1988) or QUIS
— Questionnaire for User Interface Satisfaction (Chin et al., 1988). Others focus on a broader concept of
usability. Thus, it is common for evaluation to refer to a set of rules-of-thumb, or heuristics, such as:
(i.) Use simple and natural language
(ii.) Provide clearly marked exits
(iii.) Speak the user’s language
(iv.) Provide short cuts
(v.) Minimise user memory load
(vi.) Good error messages
(vii.) Be consistent
(viii.) Seek to minimise user error
(ix.) Provide feedback
It is possible to ask users to respond to each of these statements, perhaps by requiring them to provide
a score on a Likert scale for each heuristic (and a further score indicating whether the user feels that
this heuristic is critical to the success of the system). However, there is a lack of standardisation in evaluating such responses. This means that it is not easy to determine what counts as ‘good’ without a referent model. The
Software Usability Scale (Brooke, 1986) provides a convenient questionnaire to gauge user acceptance
of software.
Figure 2: Software Usability Scale (Brooke, 1986)
The primary focus of the SUS questionnaire is on identifying whether the software would be appropriate
for a given user, e.g., in terms of whether the user can understand how to use it, whether the software
looks as if it might be useful for the user’s work, and whether the user would be happy to use the
software. A set of questions are shown in figure 1. This scale poses 10 simple questions concerning the
potential usefulness and benefit that operators felt that the User Interface might provide them. Each
statement is rated on a scale of 0 to 4. The scoring of responses then involves subtracting 1 from oddnumbered questions and subtracting scores of even-numbered questions from 5. This is because the
questions alternate between positive and negative connotations. Scores are then summed and
multiplied by 2.5, to give a final score out of 100. As a rule of thumb, scores in excess of 65 are deemed
‘acceptable’.
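To make the scoring rule concrete, a minimal sketch in JavaScript (the language used for the experimental UI described in Section 3) is given below; the function name and response coding are illustrative assumptions, with each item answered on the 1-to-5 scale.

```javascript
// Compute a SUS score from ten item responses, each coded 1 (strongly disagree) to 5 (strongly agree).
// Odd-numbered items contribute (response - 1); even-numbered items contribute (5 - response).
// The summed contributions are multiplied by 2.5 to give a score out of 100.
function susScore(responses) {
  if (responses.length !== 10) {
    throw new Error("SUS requires exactly 10 item responses");
  }
  let sum = 0;
  responses.forEach(function (r, i) {
    // items are 1-indexed in SUS, arrays are 0-indexed, so i % 2 === 0 is an odd-numbered item
    sum += (i % 2 === 0) ? (r - 1) : (5 - r);
  });
  return sum * 2.5;
}

// Example: answering 4 to every positive item and 2 to every negative item gives 75.
console.log(susScore([4, 2, 4, 2, 4, 2, 4, 2, 4, 2])); // 75
```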
1.4 Summative Evaluation
While Formative Evaluation captures user attitude and opinion as the UI develops, and also allows for
comparison of the designs against user requirements, it is important to conduct evaluation against
specific versions of the design. In this case, SPEEDD conducts evaluation on the prototype at the end of
each year. This involves a more objective set of measures of the use of the prototype for performing
specific tasks. In the case of SPEEDD, the measures relate to decision making in the scenarios that the
prototypes have been designed to support.
For Prototype 1, the scenario relates to ramp metering. Thus, the evaluation is directed at measuring
how well users of a version of the prototype are able to complete ramp metering tasks. It is worth
noting that ramp metering is not currently part of the activity of the operators in Grenoble DIR-CE. Thus,
they do not have experience of monitoring or interacting with an automated system for ramp metering.
Having said this, their experience of traffic management in general should allow them to complete the
decision tasks that we have designed. Our challenge is to produce an evaluation exercise which
captures the traffic management challenges relating to ramp metering with sufficient ecological validity
to feel realistic to operators, while providing sufficient control of variables to allow for data to be
collected that reflect decision performance. To this end, a UI was designed which presented a ramp
metering task as a decision problem. This allowed for decision time to be recorded in response to
combinations of information which reflected different levels of automation performance. This is
described in Section 3.
2 Evaluation by Subject Matter Experts
2.1 Study
We presented three versions of the User Interface (A, B and C, described below) and asked the operators to rate each against the Software Usability Scale (translated into French – see Appendix 1).
Figure 3: Results from SUS evaluation
[The error bar indicates standard deviation in rating across the SMEs]
Figure 3 indicates an increasing trend from the first to the third UI design. UI A was presented in D5.1
and represents an initial sketch of the layout based on the core Values and Priorities identified through
Cognitive Work Analysis. In other words, this design is intended to reflect the key aspects of the
operators’ information space that they might need to monitor in order to achieve their goals. UI B was
developed to support the ramp metering task (and was the subject of the experiment described in
Section 3). UI C is used in Prototype 1 for SPEEDD. The increasing rating across these designs, therefore,
is to be expected as UI A was intended to reflect the general layout of the UI rather than as a specific
support for operators, UI B was designed for the laboratory experiment (see below), and it was only UI C
which was intended to support operators. While UI C was for the initial SPEEDD prototype, we felt that
it was useful to present the design process to the operators in order to gauge their opinion on the
direction of travel for the development lifecycle. We note that UI C has a score slightly below 65 which, as noted above, is the rule-of-thumb threshold for an ‘acceptable’ rating. We also note that the error bars for UI C are
larger than for the other designs. Discussion with the SMEs suggests that this arose from the concern of
one of the SMEs that this UI would be used in conjunction with existing displays. When we explained
that the intention would be to replace the current set of displays, he offered to modify his rating
upwards, but we felt that it would be better to retain the original rating. This implies that the first
prototype approaches an initial level of acceptance. However, we would like to see this score increase in
response to future designs. In order to understand the scores that the operators gave, we collated their
comments and used these to identify what they perceived to be good and poor aspects of the design.
2.3 Comments from Subject Matter Experts
The Subject Matter Experts (SMEs) noted that their response would depend on whether the UI is
intended to be used in addition to their current set of displays or whether it would be a replacement for
these. As far as practicable, SPEEDD would seek to replace the set of 5 display screens that are currently
used in the control room (see D5.1) with a single display (albeit one which might be larger than the
current visual display units). The SMEs could see merit in this approach. They were concerned that such
a replacement could result in loss of information, but could see that much of what they needed to know
would be on the SPEEDD UI (particularly UI C).
There was concern that the quantity of information in UI C could be too much for the job of monitoring
ramp metering (see the above discussion of error bars in figure 3). To this end, there was a feeling that
the simpler and less detailed design of UI B could be preferable. Part of the discussion, therefore,
concerned the role of the operator in response to the different UI designs presented to them. DIR-CE is
involved in a project to automate ramp metering in some parts of the road network. At present, the
ramp metering would be fully automated, with little opportunity for operator intervention. This might
have influenced their response to UI C. Overall, the SMEs’ responses can be grouped under 4 headings:
2.3.1 Support Operator Goals
The SMEs reiterated that their main goal was to maximise traffic flow. To this end, the concept of prediction
would be very useful, and they appreciated the comparison between current and predicted state in the
UI designs.
A concern that the SMEs raised was the apparent lack of support for report writing in the UI designs.
We suggested that the ‘event log’ (in UI C) could be used to support the creation of reports, and the
SMEs recognised that this would be possible. However, it is important to recognise that there are a
number of reasons why operators need to create, maintain and sign-off reports. This was discussed in
D5.1 as a requirement and future designs need to be more explicit about how this can be supported.
The SMEs were not convinced that ‘driver behaviour’ falls under their command. While D5.1 suggested
that control room monitoring involved managing traffic by influencing driver behaviour (through the use
of Variable Message Signs), the SMEs were concerned that the focus of ‘behaviour’ would extend their
responsibilities. This means that the concept of indicating vehicle activity (as opposed to traffic flow)
might need to be reconsidered in the UI designs, or at least more clearly defined and explained.
A final point raised by the SMEs, and related to the issue of report writing, concerns the manner in
which information could be shared with other agencies. At present, there is little explicit
communication between the rocade sud control room in DIR-CE and the control room which is
responsible for the roads in the town. However, the two control rooms are situated in the same
building and have informal communication (e.g., through face-to-face or telephone conversations). As
SPEEDD develops its work on traffic control in the town (through the modelling work), the issue of how
control decisions in one area can constrain those of another will need to be considered. This raises the
question of how information will be shared and how cooperation can be supported.
2.3.2 Provide Global View
The SMEs appreciated the idea of providing a global view of the area that they were controlling. They
also felt that the current action of zooming in to an accident was problematic. First, zooming in loses
the global view. Second, zooming in implies that the most important aspect of the traffic control
problem is what is on the screen at the moment. The SMEs felt that allowing them to monitor the entire
road network was preferable. They did like the idea of having some ability for them to initiate zooming
to specific locations, which implies that it is not zooming per se that they dislike, but rather automated zooming and resizing of the map outside their control.
2.3.3 Indicate ramp metering and operator role
The SMEs noted that ramp metering is not always under operator control. The intention behind the UI
designs was not to have operators control ramp metering, but to allow intervention if required. The
SMEs noted that it would be useful to intervene when they could see a growing queue of traffic at the
ramp. They suggested that it might be useful to ‘penalise’ the system in response to queue length. This
is something that SPEEDD is considering in its ramp control models, and the operators noted that it was
important to act upstream or downstream of ramps to effect control. The SMEs also noted that it could
be difficult to determine how traffic would evolve if a ramp is closed.
2.3.4 Modify User Interface design
In UI A there was too much information on the map. From this perspective, the simplest UI, UI B, was
potentially the easiest to use. Having said that, the SMEs preferred UI C for its clarity. Much of the
discussion on the UI designs concerned the zooming on the map (discussed above) and the use of
colours. The SMEs were not sure how to read the colours on the dials relating to driver behaviour.
However, they did like the use of coloured circles on the map to indicate events, congestion and
predictions.
The use of graphs (in UI C) was potentially beneficial, especially if these could show limits, e.g., upper
and lower bounds of the prediction, current activity, upper and lower bounds of ‘normal’ (i.e., historical)
data for that section of road. Overall, the SMEs requested three main changes to the UI designs:
1. Annotate / colour ring-road on map to display regions of congestion etc.
2. Provide ‘notes’ (or similar) button which pops up window to type in.
3. Tabs on ‘event log’ to provide different views of what is being logged.
3 Summative evaluation through laboratory trials
3.1 Rationale
In order to determine the most appropriate baseline metrics for laboratory evaluation of user interface
design, a study was conducted with a user interface (UI) representing a simplified version of SPEEDD
prototype one (UI B in figure 3). The UI was designed to present participants with a traffic management
(ramp metering) scenario and to provide a means of collecting decision time data from participants.
The scenario is based on the SPEEDD road traffic use case version 1 (ramp metering), in which data from
sensors on ramps are used to control traffic light sequencing in order to manage traffic flow. For this
task, participants are expected to monitor the traffic flow and to ascertain that the automated decisions
are correct. The purpose of the UI is to enable an ‘operator’ to monitor, and potentially intervene in,
computerized road traffic control decisions. These computerized decisions are based on an
instrumented stretch of road network in Grenoble, on which sensors measure the speed of cars entering
the motorway via ramps. Traffic lights control the flow of cars based on information gathered from the
sensors. The study reported here is a preliminary experiment which investigates how operators might
respond to different levels of reliability in an automated system. In the SPEEDD use case, the ramp
metering algorithms would run on data collected from sensors embedded in the road. There are
potential problems which might arise from sensors failing or data being lost or corrupted during
transmission, as described by the SPEEDD consortium previously. While these problems might be dealt
with by exception handling, it is possible that a system recommendation could be based on erroneous
data. Further, it is possible that the processing time of the algorithms could result in a discrepancy between the recommendation and the display. These potential errors are translated in this experiment into incorrect suggestions displayed by the automatic decision aid and into incongruence between information
sources. Thus, the operator had the dual role of checking for information source congruence and
validating the displayed suggestion.
This laboratory study aimed to answer two lines of experimental enquiry: firstly, we were interested in
response times and user performance/reliability when working with unreliable automation. Secondly,
we aimed to quantify viewing behaviour in order to understand scan patterns and time needed to
extract information from different sources and data types. In particular, we aimed to investigate
whether participants would notice those trials that represented flawed automation, whether there was
an effect of the experimental condition on response times and whether there was consistency within
and across participants in allocating viewing time to different regions of interest. In this experiment, the
operator would either ‘accept’ a computer recommendation or ‘challenge’ (i.e., reject) it after
examining multiple information sources relating to traffic, presented on screen.
3.2 Study design and data collection
3.2.1 User interface
A custom user interface (UI) was designed for this task at the University of Birmingham in JavaScript. The
UI simulates aspects of the ramp metering task considered for SPEEDD prototype 1. The experimental
UI was derived from the UI developed for the SPEEDD project in order to facilitate this controlled
laboratory experiment. This illustrates an approach that will be pursued throughout SPEEDD: key
aspects of the operator task will be abstracted from the working environment and used to define
experimental tasks that provide sufficient control and repeatability to support controlled laboratory
testing. The purpose of this experimental UI (Figure 4) was to enable an ‘operator’ to monitor, and
potentially intervene in, computerized road traffic control decisions. These computerized decisions
relate to ramp control (i.e., changing the rate at which traffic lights on a junction change in order to
allow vehicles to join a main road). Because of the experimental design (see 3.2.2), the dual operator
role mentioned in the previous section was not static, but dynamic: switching between checking
information source congruence and validating the computer suggestion based on explicit instructions on
using the visual cues. In this experiment, the computer suggestions could be either ‘lower rate’,
‘increase rate’ or ‘leave rate unchanged’.
The UI contained four panels in an equally spaced 2 by 2 grid layout (Figure 4):
	Panel 1 (bottom-right) contains details on the computer suggestion regarding traffic light settings as well as the operator response buttons ‘challenge’ and ‘accept’.
	Panel 2 (top-right) presents a view of the road network surrounding the queried ramp, based on Google Maps.
	Panel 3 (top-left) shows a graph with car density on the ramp on the x-axis (representing the number of cars waiting to pass the traffic light) and rate of the ramp on the y-axis (representing the number of cars passing per second).
	Panel 4 (bottom-left) presents a schematic list of 17 ramp meters mimicking part of the instrumented road section. The colour used for a specific ramp is repeated in the heading of panels 1 and 3, and the ramp number should match the one in panel 2 (in the congruent display condition).
Figure 4: UI for Experimental Scenario
3.2.2 Experimental conditions
The objective of this experiment was to present ‘operators’ with coherent or conflicting information and examine whether these conditions were recognised correctly. Trials were based on four scenarios: one with coherent information and a correct computer suggestion, and three involving conflicting information, an incorrect suggestion, or both:
First, the information in all panels agreed and the computer suggestion was correct. This is the coherent version outlined in the previous section.
Second, the information between panels in figure 4 disagreed while the computer suggestion was
correct. Disagreeing information corresponded to incorrect ramp information: for example, the ramp
number displayed in panel 1 and panel 2 could have disagreed (Figure 5).
Figure 5: Incorrect ramp information: the ramp shown in panel 1 (right) is number 13, whereas the ramp shown in panel 2
(left) is number 1.
Third, the information between panels agreed but the computer suggestion was incorrect. An incorrect computer suggestion had to be detected by the ‘operator’ by interpreting the density graph (panel 3, see Figure 4). If the data points showed high density and low rate (Figure 6, a), this would mean that there were a lot of cars waiting at a traffic light. Hence, the correct computer suggestion would be to increase the rate (letting more cars through by making the traffic light change at a higher rate). If the data points showed low density and high rate (Figure 6, b), this would mean that there were very few cars waiting. Hence, the correct computer suggestion would be to decrease the rate. For the other clusters of data points (low density and low rate (Figure 6, c) and high density and high rate (Figure 6, d)), the correct action would be to leave the rate unchanged.
Figure 6: Information displayed in figure 4, panel 3, corresponding to computer suggestions.
[Density (of cars) is on the x-axis and rate (of the traffic light changing) is on the y-axis. a) high density and low rate
(correct suggestion: increase rate), b) low density and high rate (correct suggestion: lower rate); c) low density and
low rate as well as d) high density and high rate (correct suggestion: leave rate unchanged)].
Fourth, the information between panels disagreed and the computer suggestion was also incorrect.
In summary, trials were hence based on four scenarios with respect to information agreement and
correctness of the computer suggestion:
	Ramp labels agree, computer suggestion correct (TT)
	Ramp labels disagree, computer suggestion correct (FT)
	Ramp labels agree, computer suggestion incorrect (TF)
	Ramp labels disagree, computer suggestion incorrect (FF)
Each scenario was presented eight times, resulting in 32 trials which were shown to all participants,
presented in random order. Participants were asked to accept the computer suggestion if and only if
information agreed and the computer suggestion was correct. Hence, 3/4 of trials had to be challenged
and 1/4 had to be accepted.
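As an illustration of this design, the sketch below builds and shuffles such a 32-trial set; the object fields and the Fisher–Yates shuffle are illustrative assumptions rather than the actual experiment code.

```javascript
// Build the 32-trial sequence: 8 trials per condition, presented in random order.
// Condition codes: first letter = ramp labels agree (T) or disagree (F),
// second letter = computer suggestion correct (T) or incorrect (F).
const CONDITIONS = ["TT", "FT", "TF", "FF"];
const TRIALS_PER_CONDITION = 8;

function buildTrialSequence() {
  const trials = [];
  CONDITIONS.forEach(function (condition) {
    for (let i = 0; i < TRIALS_PER_CONDITION; i++) {
      trials.push({
        condition: condition,
        labelsAgree: condition[0] === "T",
        suggestionCorrect: condition[1] === "T",
        // The correct response is 'accept' only when labels agree AND the suggestion is correct.
        correctResponse: condition === "TT" ? "accept" : "challenge"
      });
    }
  });
  // Fisher-Yates shuffle: every participant sees the same 32 trials, each in a random order.
  for (let i = trials.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [trials[i], trials[j]] = [trials[j], trials[i]];
  }
  return trials;
}
```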
3.3 Method
3.3.1 Participants
Participants were recruited from the University of Birmingham student body. A total of 27 students
participated in the study; for a subset of 18 participants, eye tracking data were recorded. These data
could not be recorded for the remaining participants due to calibration issues arising from them wearing
spectacles. The design of the experiment was approved by the University of Birmingham Ethics Panel and all
participants provided informed consent to participate in the study.
3.3.2 Task and experimental timeline
The study consisted of four steps: first, the function of the UI components and the goals of the task were
explained verbally to the participant based on a script (Appendix 3). Secondly, the participant performed
two practice trials, after which he/she was given the opportunity to ask any clarifying questions. Third,
the participant performed the task, which consisted of 32 trials and took approximately 2 to 6 minutes
to complete. Fourth, participants were given a questionnaire to fill out.
The study explanation can be summarised as follows: in order to determine whether the information
sources agreed, the participants were instructed to check if all four regions of interest (ROIs) referred to
the same ramp number. To determine whether the computer suggestion is correct or not the
participants were instructed to check the graph in ROI 3 (top-left in Figure 4). The presence of the
biggest bubbles in the bottom-right quadrant of ROI 3 (low rate, high density) indicated that the rate
must be increased. The presence of the biggest bubbles in the top-left quadrant of ROI 3 (high rate, low
density) meant that the rate must be decreased. The presence of the biggest bubbles in either the
bottom-left or the top-right quadrant (low density, low rate and high rate, high density, respectively)
meant that the rate must remain unchanged.
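The decision rule given to participants can be restated compactly as below; the function and the mid-point split between ‘high’ and ‘low’ are illustrative assumptions, not the code used to generate the stimuli.

```javascript
// Decide the correct traffic-light rate suggestion from the dominant cluster in ROI 3.
// density: number of cars waiting on the ramp (x-axis); rate: cars let through per second (y-axis).
// 'High' and 'low' are assumed to be split at the mid-point of each axis range.
function correctSuggestion(density, rate, densityMax, rateMax) {
  const highDensity = density > densityMax / 2;
  const highRate = rate > rateMax / 2;
  if (highDensity && !highRate) return "increase rate"; // queue building up, lights changing too slowly
  if (!highDensity && highRate) return "lower rate";    // few cars waiting, lights faster than needed
  return "leave rate unchanged";                        // low/low or high/high: rate matches demand
}
```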
Each trial was preceded by an interval displaying a counter that was used to later synchronise eye
tracking recordings and response data. The interval displaying the timer ran for as long as the participant
wished to prepare for the next trial. Participants started each trial by clicking on the timer screen. A trial
was completed when the participant clicked on the ‘challenge’ or ‘accept’ button, and the blank timer
screen was automatically shown again.
Figure 7: Experimental timeline. Thirty-two trials were separated by intervals displaying a counter (used to later synchronise eye tracking recordings and response data).
3.3.3 Data collection
The UI was presented on a 22” monitor (resolution: 1680 x 1050, refresh: 60 Hz) to each ‘operator’
seated in front of it. Events associated with a participant’s interaction with the UI were written to a csv
file using JavaScript. The stored events for each trial were:
	Trial ID
	Trial start and end time (in ms computer time)
	Participant response (challenge or accept)
The start time corresponded to the participant clicking on the white counter screen to start the trial,
while the stop time corresponded to the participant clicking on either the ‘challenge’ or ‘accept’ button.
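A minimal sketch of this per-trial logging is given below; it assumes a Node.js runtime and illustrative field names, whereas the actual UI ran in the browser, so treat it as a sketch rather than the original logging code.

```javascript
// Append one row per trial to a CSV file: trial ID, start/end time (ms computer time), and the response.
const fs = require("fs");

const LOG_FILE = "responses.csv"; // illustrative path

function logTrial(trialId, startTimeMs, endTimeMs, response) {
  // response is either "challenge" or "accept"
  const row = [trialId, startTimeMs, endTimeMs, response].join(",") + "\n";
  fs.appendFileSync(LOG_FILE, row);
}

// Example: participant accepted trial 5 after roughly five seconds.
logTrial(5, 1427890000000, 1427890005123, "accept");
```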
3.3.4 Questionnaire
On completion of the task, participants were asked to fill in a questionnaire which contained the
following questions:
	How difficult did you find the task? [0 to 10 scale, 0 – very easy; 10 – very hard]
	What was your strategy?
	What percentage of decisions do you think you made correctly?
	What percentage of trials that you SHOULD have accepted do you think you accepted?
	What percentage of trials that you SHOULD have challenged do you think you challenged?
3.3.5 Eye tracking
For a subset of 18 participants, eye tracking data were collected as part of the experiments. A Tobii
Glasses v.1 head-mounted eyetracker was used to record the point of gaze at 30 Hz while participants
performed the task. Prior to commencing the task, the eyetracker was fitted to the participant to be
worn comfortably and a nine-point calibration was performed. Calibration was repeated until at least a 4-star (out of 5) rating was achieved for both ‘accuracy’ and ‘tracking’. For a random subset of participants,
mapping of the gaze point was checked using a custom gaze-verification video created in Matlab. This
video showed moving dots on the screen which the participant had to follow. The match between
mapped gaze point and the target point then allowed qualitative / quantitative evaluation of gaze point
accuracy.
Figure 8: Participant wearing a head-mounted eye tracker during a pilot data recording session.
To map point of gaze in the local coordinate reference frame of the glasses to regions of interest (ROIs)
on the screen, 16 infrared markers were attached around the monitor at equally spaced intervals. The
large number of markers was chosen to accommodate technical issues we noted in a pilot study, where
changes in head orientation and viewing distance of participants resulted in loss of too many IR marker
locations.
Figure 9: Setup of the monitor displaying the UI. Sixteen infrared markers allow mapping the point of gaze to specific regions
of interest.
Following data collection, point of gaze was automatically mapped to the four UI panels using custom
written scripts in Matlab based on the position of 16 infrared markers attached around the monitor. The
setup used here allowed visual attention to be mapped to the four panels (ROIs) based on two approaches:
Firstly, mapping was performed by minimising the vector magnitude and angle between point of gaze
and infrared markers; for each frame, the ROI corresponding to the marker with the minimum value was
selected. Details of this mapping approach have been described previously and are currently under
review for publication in ACM Transactions in Interactive Intelligent Systems.
Secondly, to accommodate dropout of large numbers of markers for the recordings of some participants,
a simple mapping approach was chosen to override the method explained above when possible: for
each frame, it was automatically checked whether one of the horizontal and one of the vertical markers
placed at the panel intersections was in view (Figure 10). If this condition was satisfied, then mapping
was performed for the frame based on the location of point of gaze relative to these markers.
Figure 10: Markers used for simple mapping of point of gaze to four ROIs.
Despite these robust mapping algorithms, data for three participants had to be mapped manually after
sanity-checking, as participants moved too close to the screen to have sufficient markers in view for
either of the two methods to work reliably.
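The ‘simple’ mapping rule can be sketched as follows, assuming the two intersection markers provide the pixel coordinates of the panel boundaries in the scene image; this is an illustrative reconstruction, not the original Matlab implementation.

```javascript
// Map a gaze point to one of the four UI panels (ROIs), given the scene-image positions of
// the horizontal and vertical markers placed at the panel intersections.
// Panel layout (figure 4): ROI 3 top-left, ROI 2 top-right, ROI 4 bottom-left, ROI 1 bottom-right.
function mapGazeToRoi(gaze, intersection) {
  // gaze: {x, y} in scene-image pixels; intersection: {x: vertical boundary, y: horizontal boundary}
  if (gaze === null || intersection === null) {
    return null; // marker dropout or lost gaze sample: leave this frame unmapped
  }
  const left = gaze.x < intersection.x;
  const top = gaze.y < intersection.y;
  if (top && left) return 3;
  if (top && !left) return 2;
  if (!top && left) return 4;
  return 1;
}
```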
3.4 Measures of Performance
3.4.1 Correctness of response
For each participant, the number of trials responded to correctly was calculated for each of the four
experimental conditions and across all 32 trials. Summary statistics were then calculated across
participants.
3.4.2 Decision times
For each participant and each trial, decision times were calculated as the difference between the start
and stop time. To examine whether decision times varied between the four experimental conditions and
the two response options (‘accept’ and ‘challenge’), the median reaction time was calculated for trials
belonging to each of these classes for each participant. A Kruskal-Wallis test was carried out, using the
IBM SPSS statistical package, to examine whether decision times differed between the four
experimental design categories or between the responses ‘accept’ and ‘challenge’.
To examine whether there was a systematic trend for decision times to change as a function of trial number, linear regression was performed for each participant with trial number as the independent and
decision time as the dependent variable.
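The descriptive part of this analysis (median decision time per condition and the regression slope against trial number) can be sketched as below; the data layout is assumed from the logged fields listed in Section 3.3.3, and the inferential tests themselves were run in SPSS and are not reproduced here.

```javascript
// Summarise decision times for one participant.
// trials: [{condition, startTimeMs, endTimeMs}, ...] as read back from the response log.
function median(values) {
  const v = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(v.length / 2);
  return v.length % 2 ? v[mid] : (v[mid - 1] + v[mid]) / 2;
}

function medianDecisionTimeByCondition(trials) {
  const byCondition = {};
  trials.forEach(function (t) {
    const dt = t.endTimeMs - t.startTimeMs; // decision time in ms
    (byCondition[t.condition] = byCondition[t.condition] || []).push(dt);
  });
  const result = {};
  Object.keys(byCondition).forEach(function (c) { result[c] = median(byCondition[c]); });
  return result;
}

// Least-squares slope of decision time against trial number (1..N), to check for a learning trend.
function decisionTimeSlope(trials) {
  const n = trials.length;
  let sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
  trials.forEach(function (t, i) {
    const x = i + 1, y = t.endTimeMs - t.startTimeMs;
    sumX += x; sumY += y; sumXY += x * y; sumXX += x * x;
  });
  return (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX); // ms change per trial
}
```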
3.4.3 Gaze Behaviour
3.4.3.1 Heat maps
For a qualitative assessment of attended regions of interest (ROIs), heat maps were created in Tobii Studio (an eye-tracking analysis toolkit) for the complete recording of each participant. These heat maps can be
used to identify specific regions of interest to which a participant attends, e.g., a ramp symbol and
number in ROI 2, or a region of text in ROI 1. Further, heat maps allow a first impression regarding the
dispersion of fixations across the whole scene and user interface.
3.4.3.2 Dwell times
The dwell time is the time for which the eyes rest on a defined ROI. Dwell times were calculated for each
ROI and median values calculated for each participant per ROI and across all dwells.
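A sketch of how dwells can be extracted from the 30 Hz stream of per-frame ROI labels produced by the mapping step is given below; the data layout is an assumption for illustration, not the analysis code used.

```javascript
// Collapse a per-frame ROI sequence (30 Hz) into dwells: consecutive runs of frames on the same ROI.
// roiPerFrame: array such as [3, 3, 3, 1, 1, null, 2, 2, ...], where null marks unmapped frames.
const FRAME_DURATION_S = 1 / 30;

function extractDwells(roiPerFrame) {
  const dwells = [];
  let current = null;
  let frames = 0;
  roiPerFrame.forEach(function (roi) {
    if (roi === current) {
      frames += 1;
    } else {
      if (current !== null) dwells.push({ roi: current, durationS: frames * FRAME_DURATION_S });
      current = roi;
      frames = 1;
    }
  });
  if (current !== null) dwells.push({ roi: current, durationS: frames * FRAME_DURATION_S });
  return dwells; // e.g. [{roi: 3, durationS: 0.1}, {roi: 1, durationS: 0.067}, ...]
}
```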
3.4.3.3 Time allocated to individual ROIs and within-observer variation
To examine the attendance to the four ROIs quantitatively, the cumulative viewing time was calculated
for each trial and each participant, both in absolute terms (in seconds) as well as in normalised terms
(in % trial duration). To calculate cumulative viewing time, dwell times for each ROI were summed.
Cumulative viewing time was calculated for each condition as well as across all trials. To compare gaze
allocation across participants, median values were calculated per participant and condition.
To examine the consistency of an individual participant’s gaze allocation, the interquartile range (IQR)
was calculated for each participant and condition as well as across all trials. The reason for doing this
was a) to investigate whether controlling for the experimental condition would reduce within-observer
variation and b) to investigate the ‘noise’ level associated with the fundamental metric cumulative
viewing time. This metric is essential for evaluating UI design; however, it is important to know how much variation is attributable to unsystematic visual scanning behaviour of participants, and how much is attributable to UI design. These two sources of variation will have to be distinguished in our future work on iterative UI development, and they will determine the number of necessary participants and trials.
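The cumulative viewing time and within-participant IQR can be computed as sketched below, reusing the dwell representation from the previous sketch; again this is illustrative rather than the analysis code used.

```javascript
// Cumulative viewing time per ROI for one trial, in seconds and as % of trial duration.
function cumulativeViewingTime(dwells, trialDurationS) {
  const seconds = { 1: 0, 2: 0, 3: 0, 4: 0 };
  dwells.forEach(function (d) { if (d.roi in seconds) seconds[d.roi] += d.durationS; });
  const percent = {};
  Object.keys(seconds).forEach(function (roi) {
    percent[roi] = 100 * seconds[roi] / trialDurationS;
  });
  return { seconds: seconds, percent: percent };
}

// Interquartile range across a participant's per-trial values, as a simple consistency measure.
function interquartileRange(values) {
  const v = values.slice().sort((a, b) => a - b);
  const quantile = function (p) {
    const idx = p * (v.length - 1);
    const lo = Math.floor(idx), hi = Math.ceil(idx);
    return v[lo] + (v[hi] - v[lo]) * (idx - lo); // linear interpolation between order statistics
  };
  return quantile(0.75) - quantile(0.25);
}
```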
3.4.3.4 Basic visual scanning behaviour
For each of the experimental conditions, the trial time, number of attended ROIs per trial, switch count
and number of visits to each ROI were calculated. The switch count is the number of gaze shifts between
ROIs for each trial. The number of visits per ROI is a metric which allows us to quantify how often an observer attends to the same information source.
3.4.3.5 Viewing networks and network analysis metrics
Graphical viewing networks were constructed in line with the work which we presented in the SPEEDD
framework in the past. In short, viewing networks visualise between which ROIs a participant switches.
They are an intuitive way of checking for consistency in scanning behaviour.
In addition to the graphical viewing networks, we calculated several quantitative network analysis metrics which
we presented previously and which are described in detail in our manuscript that is under review with
TiiS. In short, the viewing sequence of each participant for each trial is processed using matrix
calculations. ‘Nodes’ reflect ROIs, and ‘edges’ reflect switches between pairwise ROIs. This results in a
representation of the frequency of switches between pairwise ROIs. This representation can be binary (it
D5.1 UI Design
26
only matters whether or not a participant switches between two regions, but not how often) and
undirected (it doesn’t matter whether for example a participant switches from ROI 3 to ROI 1 or from
ROI 1 to ROI 3); or it can be binary and directed (i.e. switching from ROI 3 to ROI 1 would be considered
different to switching from ROI 1 to ROI 3). We used both representations to calculate a number of
network analysis metrics. In the following, a summary is given for each of the metrics reported here:
• Number of edges: This metric counts the number of unique pairwise switches. In a gaze sequence with the ROI attendance 4 – 3 – 2 – 3 – 2 – 4 – 1 – 4 – 1 – 4 – 3, for example, there would be six 'edges' in the directed representation (4 – 3, 3 – 2, 2 – 3, 2 – 4, 4 – 1, 1 – 4) and four edges in the undirected representation (4 / 3, 3 / 2, 2 / 4, 4 / 1).
• Link density: This metric is the number of edges Nedges within the network as a fraction of the total number of possible edges given the total number of nodes Nnodes (i.e. Nnodes(Nnodes − 1) for the directed and Nnodes(Nnodes − 1)/2 for the undirected representation). This metric indicates how an observer sequentially combines information sources.
• Inclusiveness: This metric counts the number of nodes which are connected within a network. Here we calculated it from the undirected binary network. This metric is a simple measure of the number of ROIs which an observer attended to; if gaze was not directed at a ROI, it would not be included in the network.
• Average nodal degree: This metric represents the average number of edges connected to each node of the network. Similar to link density, this metric indicates how strategically an observer switches between information sources: a high average nodal degree indicates that the observer is non-selective in sequentially combining information sources, while a low average nodal degree indicates a systematic approach to switching between sources. The maximum average nodal degree corresponds to the total number of nodes in the network minus 1.
• Connection type: In a directed network, strongly connected components are bi-directional edges, e.g. where there is a link from ROI 1 to ROI 3 and a link from ROI 3 to ROI 1. Weakly connected components are uni-directional edges, e.g. where there is a link from ROI 1 to ROI 3 but no link from ROI 3 to ROI 1. Together with the nodal degree, this metric allows us to understand the function of a ROI in the broader context of the scan pattern.
• Leaf nodes: Nodes that are only connected to one other node are called leaf nodes.
• Nodal degree ('local centrality'): The degree of a node is the number of links connecting it to its neighbouring nodes. This metric measures how well connected each node is.
A computational sketch showing how these metrics can be derived from a single ROI sequence is given below.
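The following Python sketch illustrates how these network metrics can be computed from a single trial's ROI sequence. It follows the definitions given above rather than reproducing our own analysis implementation, and the default of four ROIs matches this experiment.

def network_metrics(roi_sequence, n_rois=4):
    """Binary viewing-network metrics for one trial's ROI sequence (ROIs numbered from 1)."""
    # Directed binary adjacency: directed[a-1][b-1] = 1 if gaze ever moved from ROI a to ROI b.
    directed = [[0] * n_rois for _ in range(n_rois)]
    for a, b in zip(roi_sequence, roi_sequence[1:]):
        if a != b:
            directed[a - 1][b - 1] = 1

    # Undirected binary adjacency: an edge exists if a switch occurred in either direction.
    undirected = [[max(directed[i][j], directed[j][i]) for j in range(n_rois)]
                  for i in range(n_rois)]

    n_dir_edges = sum(sum(row) for row in directed)
    n_undir_edges = sum(undirected[i][j] for i in range(n_rois) for j in range(i + 1, n_rois))

    # Link density: observed edges as a fraction of all possible edges.
    dir_density = n_dir_edges / (n_rois * (n_rois - 1))
    undir_density = n_undir_edges / (n_rois * (n_rois - 1) / 2)

    degree = [sum(row) for row in undirected]                           # undirected nodal degree
    inclusiveness = 100.0 * sum(1 for d in degree if d > 0) / n_rois    # % of ROIs in the network
    average_nodal_degree = sum(degree) / n_rois
    leaf_nodes = sum(1 for d in degree if d == 1)

    # Connection type: 'strong' = bi-directional ROI pairs, 'weak' = one-way pairs.
    strong = sum(1 for i in range(n_rois) for j in range(i + 1, n_rois)
                 if directed[i][j] and directed[j][i])
    weak = sum(1 for i in range(n_rois) for j in range(i + 1, n_rois)
               if directed[i][j] != directed[j][i])

    return {"edges_undirected": n_undir_edges, "edges_directed": n_dir_edges,
            "link_density_undirected": undir_density, "link_density_directed": dir_density,
            "inclusiveness_percent": inclusiveness, "average_nodal_degree": average_nodal_degree,
            "strong_connections": strong, "weak_connections": weak, "leaf_nodes": leaf_nodes}

# Example: the sequence used in the text, 4-3-2-3-2-4-1-4-1-4-3, yields 6 directed and
# 4 undirected edges, matching the worked example above:
# network_metrics([4, 3, 2, 3, 2, 4, 1, 4, 1, 4, 3])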
3.5 Results
3.5.1 Correctness of responses
For the 27 participants, the percentage of correctly evaluated trials ranged from 56% to 100%, with a
mean ± standard deviation (SD) of 79 ± 12% of trials being evaluated correctly. Five participants had performances better than 95% (three with 100% correct assessments), with the remaining 22 students showing performance < 95%. This was used to partition the data into an 'expert student' group and a 'non-expert student' group.
Across the four different conditions, the mean ± SD number of correctly evaluated trials (out of a
maximum of 8) was 7.7 ± 0.6 trials for condition ‘TT’ (ramp label agrees, suggestion correct), 6.9 ± 1.8
trials for condition ‘TF’ (ramp label agrees, suggestion incorrect), 3.0 ± 3.2 trials for condition ‘FT’ (ramp
label disagrees, suggestion correct) and 7.7 ± 0.7 trials for condition ‘FF’ (ramp label disagrees,
suggestion incorrect). A Friedman test, carried out using the IBM SPSS statistical package, accordingly
revealed that there was a significant difference between the four experimental conditions (P < 0.0005).
Figure 11: Number of trials with correct response for each of the four experimental conditions.
[TT – ramp label agrees: true, suggestion correct: true; TF – ramp label agrees: true, suggestion correct: false; FT –
ramp label agrees: false, suggestion correct: true; FF – ramp label agrees: false, suggestion correct: false. The
correct response for ‘TT’ was ‘accept’, the correct response for ‘TF’, ‘FT’ and ‘FF’ was ‘challenge’].
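The Friedman comparison reported above can also be sketched outside SPSS. The fragment below uses SciPy's Friedman test as a stand-in, with one list of per-participant correct-trial counts (out of 8) per condition as a hypothetical input format.

from scipy import stats

def compare_conditions(correct_tt, correct_tf, correct_ft, correct_ff):
    """Each argument: one value per participant (number of correct trials out of 8)."""
    statistic, p_value = stats.friedmanchisquare(correct_tt, correct_tf,
                                                 correct_ft, correct_ff)
    return statistic, p_value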
3.5.2 Decision times
Decision times per trial ranged from 1.0 s to 23.5 s across participants. The median decision time per
participant across all trials ranged from 2.1 s to 11.7 s (4.8 ± 2.4 s). There was no significant difference in
decision time between the four conditions (p = 0.440) or the two response types (p = 0.755). There was
also no significant difference between the eight categories of each condition separated by response (p =
0.664).
Figure 12: Decision times as a function of condition, response or condition separated by response.
[TT – ramp label agrees: true, suggestion correct: true; TF – ramp label agrees: true, suggestion correct: false; FT – ramp label
agrees: false, suggestion correct: true; FF – ramp label agrees: false, suggestion correct: false].
Results for decision times as a function of elapsed trial varied across participants: on the one hand, there was a significant linear association between decision time and trial number for 12 participants, albeit with shallow fitted slopes (range of fitted slopes: -0.21 to 0.25; range for R2: 0.13 to 0.45; range for p: < 0.001 to 0.041). On the other hand, there was no significant association for the remaining 15 participants (range for R2: 0.00 to 0.10; range for p: 0.075 to 0.988).
Figure 13: Response times as a function of trial number for each participant.
3.5.3 Gaze behaviour
3.5.3.1 Heat maps
Heat maps of the gaze shifts qualitatively confirmed that participants looked at regions on the screen
that held information relevant for task completion. This included looking at the accept/challenge
buttons and computer suggestion/ramp information in ROI 1, the ramp number on the map in ROI 2, the
points on the graph in ROI 3 and the ramp details in ROI 4. However, the time spent looking at these
different information sources differed. Visual attendance to unrelated information was virtually absent.
Figure 14: Heatmaps for the gaze data of four participants across the complete recording period.
[The gaze point in the center of the screen corresponds to participants looking at the counter between trials. This
figure shows that the relative gaze allocation differs between individuals].
Figure 15: Example of the sequential viewing behaviour of the 18 participants for scenario number 13 (out of 32 scenarios).
[The display is identical; however, it is obvious that participants vary in their scanning behaviour. Dwell times can be calculated from individual periods of attendance to a given ROI. Cumulative viewing time is then calculated by summing individual dwells].
3.5.3.2 Dwell times
The mean ± SD dwell time across participant-specific median dwell times per ROI was 0.75 ± 0.39 s for ROI 1, 0.28 ± 0.14 s for ROI 2, 0.49 ± 0.18 s for ROI 3 and 0.39 ± 0.29 s for ROI 4. Across all ROIs, it was 0.39 ± 0.16 s. This shows that on each visit participants tended to consult the display for less than one second, but for more than 0.15 s, which is a minimum time necessary to extract information from a relatively complex pattern.
The variation within participants was of the same order as the median dwell times, with mean ± SD values between 0.53 ± 0.20 s and 0.26 ± 0.12 s and an overall average of 0.26 ± 0.07 s. This suggests that the evaluation of an information display may be difficult to perform using dwell times as a metric, as any effect would have to be comparatively large to be detected given the natural variation in visual scanning behaviour.
Figure 16: Dwell times for individual regions of interest (ROI 1 to 4) as well as dwell times calculated across all attended ROIs.
Left: median dwell time per participant; right: interquartile range (IQR) per participant across median dwell times per trial.
3.5.3.3 Time allocated to individual ROIs and within-observer variation
Analysis of the percentage viewing time allocated to the regions of interest (ROIs) showed that the 18 participants varied in the time allocated to the individual ROIs (Figure 17). While all but one participant spent the largest fraction of time looking at ROI 1 (the panel holding the response buttons and computer suggestion), the weighting of the time spent looking at ROIs 2 to 4 varied strongly: the median percentage viewing time ranged from 0 to 38% for ROI 2, 9 to 48% for ROI 3 and 0 to 26% for ROI 4.
Figure 17: Percentage viewing time allocated to different ROIs across all 32 trials for each participant.
[This figure shows the variation in attending to different information sources both within trials for the same participant and
across participants].
There was very little variation in the viewing time spent on individual regions as a function of the experimental condition (Figure 18). As a general trend across all trials, participants spent approximately half the time looking at ROIs 2 and 4 compared to ROIs 1 and 3. The median cumulative percentage viewing time spent on a region of interest per trial was 43% for ROI 1, 10% for ROI 2, 30% for ROI 3 and 10% for ROI 4. This corresponded to a median cumulative viewing time of 2.0 s for ROI 1, 0.5 s for ROI 2, 1.3 s for ROI 3 and 0.5 s for ROI 4 (Figure 18).
There was notable variation within a participant's viewing time across trials (Figure 19). Similar to dwell times, this variation was of the same order as the median viewing time.
Figure 18: Boxplots of median viewing time per participant, both as the percentage viewing time (top row) and as the total
time in seconds (bottom row).
[Data are shown for each of the experimental conditions as well as across all trials (‘All’). Experimental conditions: TT – ramp
label agrees: true, suggestion correct: true; TF – ramp label agrees: true, suggestion correct: false; FT – ramp label agrees: false,
suggestion correct: true; FF – ramp label agrees: false, suggestion correct: false. The figure shows that attention allocation did
not differ strongly between experimental conditions, but between participants].
Figure 19: Boxplots of the interquartile range in viewing time per participant, both as the percentage viewing time (top row)
and as the total time in seconds (bottom row).
[Data are shown for each of the experimental conditions as well as across all trials (‘All’). Experimental conditions: TT – ramp
label agrees: true, suggestion correct: true; TF – ramp label agrees: true, suggestion correct: false; FT – ramp label agrees: false,
suggestion correct: true; FF – ramp label agrees: false, suggestion correct: false].
3.5.3.4 Basic visual scanning behaviour
As can be seen in Figure 20, the experimental condition did not have a notable effect on the trial
duration, number of attended ROIs or switch count. Participants tended to attend to all ROIs, especially
for the TT and TF conditions. The mean ± SD switch count across all conditions was 8.2 ± 4.7. The very high switch counts of two participants were likely an artefact of the automatic gaze-to-ROI mapping, as their gaze may frequently have been allocated near ROI boundaries.
The number of visits to each ROI showed no obvious dependency on the experimental condition (Figure
21). The median number of visits per ROI was approximately 2 throughout, with slightly more frequent
visits to ROI 1.
Figure 20: Basic gaze parameters related to visual scanning for each of the four experimental conditions and across all 32
trials.
[The trial duration is the median time it took each participant to complete a trial in a given condition. The number (N) of
attended ROIs is the median number of ROIs attended to by each participant for a given condition. The switch count is the
median number of gaze redirections between the different ROIs for each participant for a given condition].
Figure 21: Number of visits to each ROI for each of the four experimental conditions and across all 32 trials.
[The median for each participant is represented in this dataset].
3.5.3.5 Eye tracking data
Comparison of the eye tracking data between the expert and non-expert students showed that the expert students weighted information extraction from ROIs 2 to 4 similarly, while the non-expert students weighted ROI 3 more heavily than ROIs 2 and 4 (Figure 22).
Figure 22: Percentage viewing time allocated to the four regions of interest by the four students classified as 'expert students' (≥ 95% of trials completed correctly, left panel) and by the 14 'non-expert' students (< 95% of trials completed correctly, right panel).
3.5.3.6 Viewing networks and network analysis metrics
Creation of the viewing networks showed that participants followed very different scan paths when looking at exactly the same display (Figure 23). In future, we will quantify scan pattern similarity using metrics from bioinformatics, allowing us to evaluate whether a specific UI design results in more consistent visual scanning within and across participants.
Figure 23: Viewing networks for condition TT (left) and TF (right).
The network analysis metrics showed that scanning behaviour was similar across the four experimental conditions. The large number of edges (3.6 ± 0.9 across all trials, out of a maximum of 6 possible undirected edges between the four ROIs) reflected the attendance to all ROIs in many trials, which was also reflected in the high inclusiveness (90% on average). The link density was 0.6 ± 0.2 in the undirected network and 0.4 ± 0.2 in the directed network. This means that scan paths did not cycle through the ROIs repeatedly in random order. This was also reflected in the modest average nodal degree and the number of leaf nodes.
Table 1: Network analysis metrics. This table shows the results for all network analysis metrics described in the data analysis section. An undirected network is independent of the direction of the gaze shift; a directed network takes the direction of the gaze shift into consideration.
METRIC (matrix directionality)        TT           TF           FT           FF           All
Number of edges (undirected)          3.7 (1.0)    3.9 (1.1)    3.1 (1.1)    3.0 (1.1)    3.6 (0.9)
Number of edges (directed)            5.2 (1.8)    5.3 (2.2)    3.9 (1.9)    4.0 (1.7)    4.7 (1.8)
Link density, fraction (undirected)   0.6 (0.2)    0.6 (0.2)    0.5 (0.2)    0.5 (0.2)    0.6 (0.2)
Link density, fraction (directed)     0.4 (0.1)    0.4 (0.2)    0.3 (0.2)    0.3 (0.1)    0.4 (0.2)
Inclusiveness, in % (undirected)      90.3 (12.5)  91.7 (12.9)  85.4 (12.3)  84.7 (12.5)  90.3 (12.5)
Average nodal degree (undirected)     1.8 (0.5)    1.9 (0.6)    1.6 (0.6)    1.5 (0.5)    1.8 (0.5)
Average nodal degree (directed)       2.6 (0.9)    2.7 (1.1)    1.9 (0.9)    2.0 (0.9)    2.3 (0.9)
Connection type (directed, strong)    1.9 (0.7)    2.0 (1.0)    1.6 (1.0)    1.5 (0.8)    1.6 (0.9)
Connection type (directed, weak)      1.8 (0.9)    1.7 (1.0)    1.1 (0.6)    1.4 (0.6)    1.4 (0.5)
Leaf nodes (undirected)               1.8 (2.0)    1.4 (1.7)    2.6 (1.6)    1.9 (1.9)    2.0 (1.8)
Leaf nodes (directed)                 3.9 (1.6)    3.8 (2.0)    3.6 (1.6)    4.3 (1.5)    4.1 (1.4)
[Values are given as mean (SD)].
3.6 Comparison of expertise levels
3.6.1 Data collection
To compare student behaviour with that of domain experts, we conducted a pilot study involving three
(male) experts in road traffic control, based in DIR-CE Grenoble, France. For logistical reasons, no eye
tracking data could be recorded for these participants. Expert participants were given the same
instructions as the student participants (based on the script in Appendix 1).
3.6.2 Data analysis
3.6.2.1 Response behaviour
Response behaviour of the experts from Grenoble was analysed as described in Section 3. Data for the student cohort was split into two groups: the 'expert student' cohort included students who completed at least 95% of trials correctly, matching the performance of the experienced road traffic management staff. This cohort included five participants for decision time and response data and four participants for eye tracking data. The 'non-expert student' cohort included students who completed less than 95% of trials correctly. This cohort included 22 participants for decision time and response data and 14 participants for eye tracking data.
3.6.2.2 Eye tracking data
For each of the three groups, summary statistics were calculated for the number of correct responses
for each condition. Further, summary statistics were calculated for reaction times, again for each of the
four conditions as well as across all trials. From the eye tracking data, the percentage viewing time was
calculated for each ROI to examine whether the expert and non-expert students weighted visual
attention allocation differently.
3.6.2.3 Correctness of responses and decision times
According to the results reported in Section 3.5, and as can be seen from Figure 24 below, there is a clear difference between the two groups of a) experts and expert students (the high performers) and b) non-expert students (the low performers). This difference is of course due to the separation based on the overall percentage of correctly evaluated trials, hence the high average number of correctly identified trials for the student experts. For evaluation purposes, a subset of student participants will hence provide data that is comparable to that provided by experts. Moreover, in terms of decision times, no difference was observed either between the two student expertise groups (a and b) or between the experts from Grenoble and the expert students (Figure 24).
Figure 24: Response reliability for the four experimental conditions for three staff of the road traffic management facility in DIR-CE (Grenoble, France), students classed as experts (≥ 95% of trials completed correctly) and students classed as non-experts (< 95% of trials completed correctly).
Decision times calculated across the three different participant groups showed no difference; for all groups and all conditions, it took approximately 5 seconds to complete one trial (Figure 25).
Figure 25: Decision times for the three different participant groups and four different experimental conditions. Throughout,
the median decision time was approximately 5 seconds.
3.6.2.4 Questionnaire results
3.6.2.4.1 Answers
The entries from each questionnaire were digitised and summary statistics calculated, both for all
student participants as well as for students separated into ‘expert students’ and ‘non-expert’ students
(compare Section 5).
3.6.2.4.2 Self-evaluation vs. measured performance
Regression analysis of perceived vs. measured performance was performed in Matlab. Firstly, the
estimated percentage of correctly evaluated trials was regressed against the measured percentage of
correctly evaluated trials. Secondly, the perceived task difficulty was regressed against the measured
percentage of correctly evaluated trials. To facilitate this comparison, the difficulty scores were
converted from the original 0 to 10 scale to a 0 to 1 scale, where 0 corresponded to ‘very hard’ and 1
corresponded to ‘very easy’. This approach assumed that students who did well would find the task
easier than students who did poorly.
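A sketch of this analysis in Python/SciPy (the original was performed in Matlab) is given below; the three input lists, one entry per participant, are hypothetical placeholders.

from scipy import stats

def self_evaluation_analysis(estimated_pct_correct, measured_pct_correct, difficulty_0_to_10):
    # Regression 1: estimated against measured percentage of correctly evaluated trials.
    fit_estimate = stats.linregress(measured_pct_correct, estimated_pct_correct)

    # Regression 2: perceived ease against measured performance. The 0-10 difficulty
    # ratings (0 = very easy, 10 = very hard) are rescaled to 0-1 with 1 = 'very easy'.
    ease_0_to_1 = [(10 - rating) / 10.0 for rating in difficulty_0_to_10]
    fit_difficulty = stats.linregress(measured_pct_correct, ease_0_to_1)

    # Each fit exposes .slope, .intercept, .rvalue (correlation) and .pvalue.
    return fit_estimate, fit_difficulty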
3.6.2.4.3 Answers
The mean ± SD difficulty rating was 3.8 ± 2.4 (with 0 – very easy, 10 – very hard) across all students, 3.4
± 3.1 for the ‘expert student’ group and 3.9 ± 2.3 for the ‘non-expert student’ group. The estimated
percentage of correctly evaluated trials was 74.8 ± 15.4% across all students, 85.0 ± 15.8% for the
‘expert student’ group and 72.5 ± 14.7% for the ‘non-expert student’ group. The percentage of trials
which students should have accepted and which they estimated to have accepted was 70 ± 24.5% across
all students, 59.0 ± 33.2% for the ‘expert student’ group and 72.5 ± 22.3% for the ‘non-expert student’
group. The percentage of trials which students should have challenged and which they estimated to
have challenged was 65.4 ± 22.1% across all students, 62.0 ± 37.2% for the ‘expert student’ group and
66.1 ± 18.4% for the ‘non-expert student’ group.
3.6.2.4.4 Self-evaluation vs. measured performance
The correlation between the estimated and measured percentage of correctly evaluated trials was
moderate (r = 0.40), and linear regression revealed a slope significantly different from zero (slope = 0.50,
intercept = 0.35, R2 = 0.16, p = 0.037). This suggests that participants were able to evaluate their own
performance with some consistency. There was no strong correlation between the perceived task difficulty and the measured percentage of correctly evaluated trials (r = 0.20), and linear regression estimated a slope not significantly different from zero (p = 0.326).
Figure 26: Self-evaluation compared to quantified performance.
[The correlation between the estimated and measured percentage of correctly evaluated trials (left) was moderate (r = 0.40), and linear regression revealed a slope significantly different from zero. The deviation from unity (which would indicate a 1:1 match of perceived and measured performance) shows that participants under- and overestimated their own performance to a similar extent].
3.7 Conclusions and outlook
3.7.1 Summary
In the experiment described here, we used an abstracted version of a road traffic management control
user interface in order to test whether 27 participants would reliably detect incorrect computer
suggestions and / or conflicting information in an automated system. We examined response times and
user performance/reliability as well as viewing behaviour in a cohort of students and compared
response correctness and time to three personnel from Grenoble DIR-CE who were familiar with road
traffic management. In the experiment, the operator had to ‘accept’ or ‘challenge’ a computer
recommendation after examining multiple information sources.
Our first main aim was to investigate whether participants would notice those trials that represented flawed or incongruent automation. We found that participants evaluated between 56% and 100% of trials correctly. On closer examination, the condition most commonly evaluated incorrectly was the one in which the computer suggestion itself was correct but the displayed ramp labels disagreed (condition FT): only a median of 2 out of 8 such trials were challenged across participants, whereas the median was 8 for all other experimental conditions. Hence, participants often did not acknowledge the discrepancy between the information on which the computer suggestion was based and the source of the data (in this experiment, the ramp number).
Our second main aim was to examine whether there was an effect of the experimental condition and expertise level on decision times. Results showed that this was unlikely to be the case: median decision times were around 5 seconds independent of experimental condition, response and expertise level.
Our third main aim was to investigate whether there was consistency within and across participants in allocating viewing time to different regions of interest. The various eye tracking metrics we employed did not show any such obvious differences as a function of experimental condition; however, we observed large individual differences in the viewing time allocated to different ROIs and in scan patterns. The difference in scanning behaviour that we did note concerned the percentage viewing time allocated to ROIs 2 to 4: while the 'expert student' group tended to allocate a similar amount of time to these regions, the 'non-expert' student cohort spent more time looking at ROI 3 than at ROIs 2 and 4.
3.7.2 Novice vs. expert behaviour
While the task did not present a challenge to the Subject Matter Experts, we found that the non-experts exhibited an interesting pattern in their responses. Considering the results, we assume that all participants were able to use the graph display (top left of the screen, ROI 3) to apply the rules that we had defined; hence they correctly determined whether the computer suggestion was correct or false. However, the non-experts were confused by the FT condition, in which the automation was correct but the information on the displays disagreed. This suggests that the non-experts were not checking for display congruence, which was supported by the eye-tracking data: the expert student group attributed a similar percentage viewing time to ROIs 2, 3 and 4, whereas the non-expert group tended to spend a much larger proportion of their time looking at ROI 3 than at ROIs 2 and 4 (which are used only to determine display congruence).
In terms of task proximity, while all participants were presented with the same information, the non-experts were not able to judge the 'worth' of the displays for congruence checking and focused their attention on the automation-checking aspect of the task. It is possible that this might be an effect akin to change blindness (Simons and Levin, 1997), in which relevant information is not attended to on the assumption that it is 'given' and does not require checking. Alternatively, a phenomenon termed 'satisfaction of search' is known from the medical literature, where diagnosticians terminate visual search after finding the first sign of pathology (Berbaum et al. 1990; Berbaum et al. 1994; Samuel et al. 1995). Similarly, participants may have terminated their search after verifying that the computer suggestion and the information held in ROI 3 agreed.
3.7.3 Considerations for UI design
In terms of UI design, it is important to consider not only how information can be presented to highlight
its ‘worth’ but also how people might seek to extract information from the displays. Non-experts may
have expected the automation to fail on the task they perceived as more complex, leading to their
attention being mainly focused on validating the computer suggestion. The findings presented in this
report underline the importance of cueing operators using decision support software to make sure they
are aware of the context (system state) in which they make decisions. One way to achieve this could be
to prompt users to acknowledge if some displays show different views.
We have already started designing and implementing algorithms to quantify scan pattern similarity
using for example sequence analysis from bioinformatics. This will allow us to quantify the similarity of
viewing behaviour within and across participants as well as across different UI iterations. We currently
assume that a design target should be to achieve repeatable and consistent scan patterns to assure that
an operator always attends to all information sources in a systematic manner. Work in the field of
medicine has shown that such systematic scanning behaviour is extremely beneficial for performing a task correctly (e.g. detecting fractures or tumours). We hope that, based on gaze analysis, we can inform UI design to encourage such systematic scanning. On the other hand, in the context of 'ideal observer' theory as applied to visual perception (Geisler 2011), it is also possible that it is important to
accommodate individual differences based on participant-specific constraints on e.g. working memory.
In this case, it would be detrimental to pre-define a single scan path by UI design. Instead, it may be
important to accommodate individual behaviour. Further, there may be several ‘optimal’ scan paths,
and participants in such a scenario should not be constrained to a single solution, and equally may not
exhibit a repetitive pattern. These are questions we are going to address as the project progresses.
3.7.4 Study limitations and considerations for future work
Based on this study, we are looking to address several challenges associated with UI evaluation based on
eye tracking and user responses.
In our recently reported work, we automatically map the point of gaze to regions of interest outside the proprietary software, owing to limitations in the visibility of the infrared markers which serve to define the global space for head-mounted eye tracking. While this approach has been successful, in future we are looking to optimise our algorithms in order to remove noise from the sequential ROI attendance record. This noise mainly arises from gaze points close to ROI boundaries and can be reduced by placing ROIs even further apart.
In addition, we will work with a screen-mounted eye tracking system in order to achieve higher
resolution and better mapping compared to the head mounted system. This will allow us to define small
regions of interest corresponding to exact information content within larger ROIs.
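A minimal sketch of the gaze-to-ROI mapping step discussed above is given below. The screen resolution and the rectangle coordinates assigned to the four ROIs are hypothetical placeholders; in the study the mapping was performed relative to the infrared-marker-defined space of the head-mounted tracker rather than fixed screen pixels.

# ROI rectangles in screen pixels (x_min, y_min, x_max, y_max); the 1920 x 1080 layout
# and the quadrant assignment below are hypothetical placeholders.
ROI_RECTS = {
    1: (960, 540, 1920, 1080),   # control window with computer suggestion and buttons
    2: (960, 0, 1920, 540),      # map
    3: (0, 0, 960, 540),         # rate vs density graph
    4: (0, 540, 960, 1080),      # ramp list
}

def map_gaze_to_roi(x, y, rois=ROI_RECTS):
    """Return the ROI containing the gaze point, or None if it falls outside all ROIs."""
    for roi, (x_min, y_min, x_max, y_max) in rois.items():
        if x_min <= x < x_max and y_min <= y < y_max:
            return roi
    return None

def roi_sequence(gaze_points, rois=ROI_RECTS):
    """Collapse a stream of (x, y) gaze points into the sequence of attended ROIs,
    dropping samples that fall outside every ROI."""
    sequence = []
    for x, y in gaze_points:
        roi = map_gaze_to_roi(x, y, rois)
        if roi is not None and (not sequence or sequence[-1] != roi):
            sequence.append(roi)
    return sequence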
For our future work, we are investigating suitable eye tracking metrics which allow us to compare UI designs across different tasks and to quantify improvements across iterations of the same UI. Ideally, we are looking for metrics that give an absolute measure of UI quality; however, this presents a complex problem due to the dependence of many metrics on the task, scanning duration and UI factors such as clutter and information representation.
4 Defining Baseline Performance Metrics
4.1 Overview
In SPEEDD the evaluation of the User Interface designs will involve more than a judgement of the ‘look
and feel' or usability of the screen layouts. As the role of the UI is to support decision making, the project will focus on ways of measuring decision making in response to information search and retrieval.
An assumption is that the manner in which information is presented to the user interacts with the
manner in which the user searches for specific aspects of the information in order to make decisions.
This raises three dimensions that need to be considered in this work. First, the manner in which
information is presented can be described in terms of its value to the task, which includes its
diagnosticity (i.e., the correlation between the content of the display and the goal to achieve), and in
terms of its graphical properties (i.e., the format used to display the information). Second, the manner in which users search for specific information can be described in terms of cost, i.e., temporal measures of information search, particularly using eye-tracking metrics. Third, the manner in which empirical
information search relates to decision making can be compared to the optimal decision models being
developed in SPEEDD. It is proposed that this approach is not only beneficial for SPEEDD but also
provides a foundation for the evaluation of Visual Analytics in general.
The definition of metrics for these three aspects involves the specification of a baseline against which
subsequent designs can be assessed. While it might be tempting to assume that such a baseline can be
derived from the observations and analysis of the DIR-CE control room (as outlined in D8.1 and
elaborated in D5.1), it is not obvious how this could provide a realistic comparison with future designs.
This is for three reasons.
First, many aspects of the task and displays will change and we need to be able to isolate particular
effects for formative assessment. For example, the current control room involves operators consulting 5
screens on their desk and two sets of CCTV screens. Consequently, their visual search is conducted in an
environment in which near- and far-field displays of information need to be consulted. We might propose that the SPEEDD prototype, with its new visualisations, runs on a single screen (albeit a large screen with several windows). We need to empirically separate the effects of single/multiple
screens and the effects of the visualisation presented on each display/window.
Second, the current task of operators is predicated on completing an incident report and, as D5.1
indicates, much of their activity is focused on this task. It is likely that the SPEEDD prototype automates
much of the report creation, leaving the role of the operator to provide expert commentary on the
incident and verify components of the automation rather than log the more mundane details. In this
case, the role of the operator would be radically different. Having proposed that the fundamental goal
of the UI design work in SPEEDD is to change the nature of the information displayed and the manner of
the tasks that the operator performs, it is not easy to see how baseline metrics could be defined which
could be applicable across all instances of the design (having said that, we will measure the time cost of
incident reporting so that we know the gain if it can be automated.) Third, quantitative measurement of
operator performance in the working environment is difficult because the actions performed vary in
response to changing situational demands. For metrics to be reliable and consistent, one needs to
exercise control over the environment in which they are applied. For this reason, we prefer to define
metrics which can be applied in laboratory settings, much like the experiment reported in this document.
While the definition of metrics reflecting current practice in the DIR-CE control room could be
problematic, SPEEDD requires a set of metrics which retain ecological validity. This means that the
metrics need to reflect the work that the operator performs and the type of decisions that the UI will
support. We could, for example, take a high-level subgoal from the Hierarchical Task Analysis presented
in D8.1, such as ‘define incident location’ and ask how this subgoal is currently achieved. This could
provide a qualitative description, in terms of tasks performed and Regions of Interest consulted in the
DIR-CE control room to give a broad sense of what the operator’s current activity might look like. We
could then take this subgoal as the basis for an experimental task and run this in controlled settings
(with novice and expert participants) to define quantitative metrics of this performance. This approach
is illustrated by the experiment in this report.
As part of WP 8, quantitative metrics allow us to describe visual scanning behaviour of novices and
experts, develop models of visual scanning and decision making, contrast model and experimental data
and quantify the effect of changes to UI design on human performance and behaviour. When comparing
different UI designs, it is important to work with metrics that are sensitive to changes while being robust
across information content, display choices and other factors. In our work, we consider baseline metrics at the following levels:
4.2 Overall goal hierarchy
At the top level, the investigation concerns the overall goal hierarchy, for example to determine the
location of an incident in a road traffic management task. In this context, metrics such as the time taken
to complete a task / goal and the presence of error serve to quantify user performance and the impact
of different UI designs. More specifically, metrics can be separated into the following categories:
• Decision time
• Use of discrete regions of interest (ROIs)
  o Duration spent looking at each ROI
  o Sequence of ROI attendance
  o Repetition of scanning cycles / recurring sequences
• Signal detection (illustrated by the sketch after this list)
  o Rate of correct responses or target / stimulus detection and probability of correct response
  o Sensitivity and Specificity; Positive and negative predictive value; Accuracy
• User error
  o Missed target
  o Misinterpretation, misunderstanding
  o Failure to visually examine all components of UI
  o Failure to perform all necessary sanity checks / verifications in automated system
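The signal detection measures named in the list can be sketched from a standard confusion matrix as follows; the mapping of responses to the four cells (for example, treating a 'challenge' on a flawed or incongruent trial as a true positive) is an illustrative assumption rather than a definition taken from the experiment.

def signal_detection_metrics(tp, fn, fp, tn):
    """Counts: tp = hits, fn = misses, fp = false alarms, tn = correct rejections."""
    return {
        "sensitivity": tp / (tp + fn),               # true positive rate
        "specificity": tn / (tn + fp),               # true negative rate
        "ppv": tp / (tp + fp),                       # positive predictive value
        "npv": tn / (tn + fn),                       # negative predictive value
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }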
In the experiment described in this report, we used several of the metrics mentioned above to examine how they respond to different experimental conditions and UI content. We used decision time to investigate whether different levels of disagreement amongst the information presented in the UI trigger an alteration in search time, which was not the case. Attention to discrete regions of interest (ROIs) was quantified in terms of dwell times and cumulative viewing times, showing that these metrics varied more between participants than between experimental conditions. Sequence analysis and recurrence analysis will be performed in future. The rate of correct responses was quantified, allowing us to correlate perceived and measured performance and to separate students into cohorts representative of different expertise levels. Quantifying user error allowed us to pinpoint where participants made mistakes, with first steps taken to relate this to visual scanning behaviour, showing that differences in the time spent on ROIs may be associated with erroneous responses.
4.3 Theory and modelling
To compare theoretical predictions and model-based exploration of large design spaces to human performance on selected design choices, it is important to work with metrics that can inform models of visual scanning and decision making as well as being measurable in human visual scanning.
Based on our past work in the domain and published work, there are several metrics that allow for this
which fall into the following categories:
• Basic oculomotor behaviour (used in models of fundamental visual scanning behaviour)
  o Saccade amplitude
  o Fixation duration
• Allocation of attention and information extraction
  o Dwell times per region of interest (ROI)
  o Scan path length
• Function of ROI
  o Use of ROI in context of task completion and user correctness
  o Sequence of ROI selection and weighting of ROI importance
• Allocation of attention relative to usefulness and cost; effect of ROI…
  o Diagnosticity (how well does ROI content predict the scenario outcome)
  o Access cost (how much effort is needed to extract information from a ROI); cost commonly corresponds to time. This will also include time necessary to perform required task actions such as filling in forms
  o Accuracy (how noisy is the data displayed in the ROI)
These metrics will be applied in work planned for summer/autumn 2015.
4.4 UI evaluation
When evaluating different UI designs, either across iterations or across tasks, it is important to quantify user performance as well as user perceptions regarding the 'goodness' of the design. In an ideal scenario, a UI that results in reliable performance is also perceived by users as intuitive, easy to interpret and user-friendly. Further, we are looking to apply a no-choice / choice paradigm (related to work recently published by Howes et al., 2015a) to determine user choices between different options for displaying the same information. This paradigm has the advantage that it does not rely on metrics that may be too noisy to detect statistical differences. Rather, in this approach users are trained to use a variety of visualisations to complete a task, followed by a task in which the visualisation can be chosen freely. This allows user preferences to be determined 'on the job', without confounding effects such as post-hoc rationalising or speculative choices.
To be able to track parameters across the design process, we are considering metrics related to the
following domains:
• Task difficulty
  o Quantification of user self-evaluation using questionnaires
  o Correlation of perceived difficulty / workload etc. with objectively quantified user performance, such as percentage of correctly evaluated trials or task duration
• User evaluation
  o Questionnaire following task to determine strategy and user evaluation
  o Debrief session for each participant
• Design space parameterisation
  o Iteratively adjust e.g. diagnosticity, clutter, accuracy of ROIs in static tasks
  o Adjustment of rates of change in dynamic tasks
  o Correlation of visual sampling strategy and user performance
  o Examination of different options to display the same information
• Evaluation of different display options
  o No choice / choice paradigm
  o User ranking
In our recent experiment, we used questionnaires to get a self-estimate of performance from the
participants. Correlation of this self-estimate of percentage correctly evaluated trials with measured
performance showed a significant association. We included the question 'what was your strategy?' in the questionnaire, and answers generally showed that participants had understood the task and
performed it logically. In work planned for summer/autumn 2015, we will start to manipulate design
space parameters, both for modelling of visual search and empirical evaluation of human scanning
behaviour. This will likely include an experimental approach described in Howes et al. (2015b) based on
a no-choice / choice paradigm.
4.5 Evaluation of metrics for the quantification of UI design
As part of our work for SPEEDD so far, we have introduced a range of metrics that can be used to
quantify the aspects mentioned above. The following tables present a first evaluation, following our
experimental work so far, of the advantages and disadvantages of these metrics in context of UI
evaluation. It is apparent that no single metric is likely to capture design properties across various tasks, designs and user goals. However, several metrics can be normalised, and it will certainly be possible to use many of them for iterative design evaluations, and others for layout and content considerations.
Table 2: Eye tracking, low level

Metric: Saccade amplitude
  Pro: Shows whether long or short visual paths are taken within the display; shows whether information inside or outside the periphery is accessed.
  Con: Needs a well controlled experimental setup; can be confounded by head movement.

Metric: Saccade frequency
  Pro: Quantifies scanning behaviour at a very basic level, e.g. showing whether a user is moving the eyes a lot or not compared to normal eye movement in normal environments.
  Con: Saccade frequency is a very intrinsic property of the visual system and may not correlate well with changes to UI design. There are always micro-saccades, and saccades can occur within the same ROI. Dwell times or cumulative viewing times might be better.

Metric: Fixation duration
  Pro: This metric corresponds to information extraction at the most basic level.
  Con: Might be confounded / overridden by the physiological necessity for the eyes to always move a little after a brief time (else the image would disappear), so this metric is bounded (especially with an upper bound).

Metric: Fixation count
  Pro: Attention allocation to a ROI at the most basic level.
  Con: As above, relates to very fundamental scanning processes and may not correlate well with design; hard to do with glasses in a control room setup with multiple screens.

Metric: Scan path length
  Pro: Measure of the distance covered while moving the eyes across the UI; this metric should hence correlate with the cost / efficiency of information sampling.
  Con: Has to be normalised to UI content, as it is strongly dependent on the number of ROIs, ROI location and content; a scan path might consist of many short saccade amplitudes to many ROIs or few large amplitudes to few ROIs, so this metric does not capture the 'complete story'.

Metric: Blink amplitude
  Pro: Blink frequency can relate to concentration.
  Con: May be difficult to detect and separate from dropouts due to other reasons.

Metric: Pupil dilation metrics (constriction rate etc.)
  Pro: Tracking pupil dilation may allow us to map the decision and information weighting and accumulation process before conscious decisions are made (see de Gee 2014); rates may allow us to quantify the ease of information extraction.
  Con: This is a young field and needs extensive testing / validation.
Table 3: Eye tracking, processed / high level

Metric: Cumulative viewing time (in % trial duration or seconds)
  Pro: Gives an indication of the weighting of information sources; shows which information sources are and are not attended to.
  Con: Looking at a ROI for a long time does not necessarily mean that lots of information is extracted; it can also mean that a user is trying and failing to understand content.

Metric: Dwell time
  Pro: Very basic scanning parameter relating to the information extraction time for a specific ROI per visit; lots of reference values in the literature.
  Con: Dwell times tend to be quite similar, especially in our first experiment, so they might not reflect the 'goodness' of a display unless it is really bad or convoluted, or the task demands differ strongly.

Metric: Number of visits to ROI per trial (respectively number of returns)
  Pro: How often a participant revisits a ROI should correlate with how well content can be memorised; in an optimal scan path we probably want each ROI visited just once (in a static scenario!).
  Con: For some ROIs, revisits may be necessary for task completion (e.g. in our study it is necessary to read the computer suggestion, then press a button in that ROI at the end of the trial).

Metric: Return time
  Pro: Shows how long people might be able to remember ROI content for.
  Con: Confounded by attention to and competition from other ROIs and the task goal.

Metric: Switch count (= number of transitions between ROIs)
  Pro: Indicates the 'efficiency' of scanning behaviour.
  Con: There is the need to normalise this to the number of ROIs.

Metric: Time to first hit / rank order of first hit
  Pro: Indicates the 'ranking' of information importance for a participant.
  Con: Time needs to be normalised, so we could use rank ordering instead, for example; this metric is confounded in our recent experiment, since participants fixate the middle of the screen during the interval between trials and the first gaze point is often mapped to ROI 3 following the screen refresh.

Metric: Dwell time on first hit / first and second pass dwell time
  Pro: Indicates how long it takes to extract sufficient information from a previously unseen ROI.
  Con: This will be confounded after the first trial due to learning.

Metric: Entropy (see the sketch following this table)
  Pro: Normalised dispersion measure that can be compared across different tasks / designs.
  Con: The maths / assumptions need careful consideration.
Table 4: Scan pattern comparison using bioinformatics

Metric: Levenshtein distance (= 'edit distance') (see the sketch following this table)
  Pro: Easy to normalise; quantifies the similarity between pairwise scan paths using the string edit method from computer science.
  Con: Easily confounded by noisy mapping; sensitive to small deviations in otherwise recurring patterns (in contrast to some bioinformatics approaches).

Metric: Global and local sequence alignment
  Pro: Detection of similar scan patterns within and across participants, allowing for some noise in the pattern (tolerances controlled via the scoring matrix).
  Con: Very UI specific; it is a good metric to evaluate consistency in scanning for different layout options of the same material, but difficult to normalise (unless using purely the path length).

Metric: Recurring n-mers
  Pro: Shows whether a UI design triggers similar scan patterns within and across participants; extracts those scan patterns that are most common.
  Con: As for sequence alignment.
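As a sketch of the Levenshtein ('edit') distance listed in Table 4, the following fragment treats scan paths as strings of ROI labels (e.g. "43232414143" for the example sequence used earlier); normalising by the length of the longer string is one common choice, not necessarily the one used in our analysis pipeline.

def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning string a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[len(b)]

def normalised_scanpath_distance(path_a, path_b):
    """0 = identical scan paths, 1 = maximally different (relative to the longer path)."""
    return levenshtein(path_a, path_b) / max(len(path_a), len(path_b))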
Table 5: Network analysis metrics

Metric: Number of edges
  Pro: Simple count of paths taken between ROIs.
  Con: Has to be normalised to network size when comparing across different numbers of ROIs, where it becomes link density.

Metric: Link density
  Pro: Normalised metric of paths taken relative to all possible paths; allows us to quantify, for example, how systematically a participant scans the UI.

Metric: Inclusiveness
  Pro: Normalised metric reflecting the percentage of ROIs attended to, independent of the design constraints.
  Con: For few ROIs this might be insensitive to experimental conditions.

Metric: Nodal degree and average nodal degree
  Pro: Related to link density; reflects how many different ROIs are accessed (per ROI or on average) from a given ROI. Allows us to identify the function of each ROI within the scan path of an observer, i.e. whether a ROI is central to switching to many other ROIs or only accessed at a particular point.
  Con: The ROI-specific nodal degree may be difficult to normalise independent of UI design; it will be very good for iterative design, but less meaningful when comparing e.g. the fraud and road traffic UIs, as those ROIs have implicitly different functions.

Metric: Connection type
  Pro: Count of between how many pairwise ROIs participants look back and forth, and how many ROIs are attended to only uni-directionally.
  Con: Looking at the pilot data, this metric might be insensitive to small design changes due to variation in participant behaviour.

Metric: Leaf nodes
  Pro: Counts how many ROIs are only accessed from one other ROI and reflects ROIs with a very specific function in the viewing sequence.
  Con: It may be difficult to separate systematic from random effects, especially for short scan paths, where there may be a high number of leaf nodes purely because each ROI is only accessed once.
Table 6: Eye / head coupling

Metric: Latency
  Pro: This time lag between eye- and head-movement onset has been associated with intrinsic / extrinsic cue attendance.
  Con: Very difficult to measure; needs a synchronised mocap system or a robust head-movement detection algorithm based on the head-mounted system.

Metric: Combined gaze shift
  Pro: Quantifies the percentage of switches between ROIs executed by both eye- and head movement, allowing us to infer goal structure.
  Con: Heavily dependent on the spatial setup (especially >40 deg visual angle); for smaller gaze shifts, eye/head coupling is very participant specific as a baseline.

Metric: Head contribution
  Pro: Amount that head movement contributes to the gaze shift; easy to normalise.
  Con: May not correlate well with UI design and may be a feature of the more intrinsic viewing behaviour of individual participants.

Metric: Head movement amplitude
  Pro: Useful metric for large control room design evaluations.
  Con: May be less useful for single-screen work, as gaze shifts are too small for this to be relevant.
Table 7: User performance

Metric: Decision time
  Pro: Simple metric of how long it takes to extract information from the UI to perform a task.
  Con: Specific to the task / question asked.

Metric: Correlation of perceived difficulty and correctness of response
  Pro: The correlation coefficient is a simple, standardised metric to quantify how well perceived UI handling and actual task performance match.
  Con: This is currently built around the assumption that an observer who performs poorly would find the task difficult, and an observer who performs well would find the task simple. However, observers who find the task difficult might still do well, just with more mental resources.
5 References
Baber, C. (2015). Evaluating Human Computer Interaction. In J.R. Wilson and S. Sharples (eds) Evaluation of Human Work, London: CRC Press.
Berbaum, K.S., Franken, E.A., Dorfman, D.D. et al. (1990). Satisfaction of search in
diagnostic radiology. Invest Radiol., 25, 133-40.
Berbaum K.S., El-Khoury G.Y., Franken, E.A. (1994). Missed fractures resulting from
satisfaction of search effect, Emergency Radiology, 1, 242-249.
Bevan, N., 2001, International Standards for HCI and usability, International Journal of Human
Computer Interaction, 55, 533–552.
Brooke, J., 1996, SUS: a quick and dirty usability scale. In: P.W. Jordan, B. Weerdmeester, B.A.
Thomas and I.L. McLelland (eds) Usability Evaluation in Industry, London: Taylor and Francis,
189–194.
Chen, X., Bailly, G., Brumby, D.P, Oulasvirta, A. and Howes, A. (2015) The emergence of
interactive behavior: a model of rational menu search, Proceedings of the 33rd Annual ACM
Conference on Human Factors in Computing - CHI’15, Seoul, Korea
de Gee, J. W., Knapen, T., & Donner, T. H. (2014). Decision-related pupil dilation reflects
upcoming choice and individual bias. Proceedings of the National Academy of
Sciences, 111(5), E618-E625.
Geisler, W.S. 2011. Contributions of ideal observer theory to vision research. Vision Research 51,
771-781.
Holcomb, R. and Tharp, A.L., 1991, What users say about software usability, International
Journal of Human–Computer Interaction, 3, 49–78.
Howes, A., Duggan, G.B., Kalidini, K., Tseng, Y-C., Lewis, R.L. (2015a) Predicting short-term
remembering as boundedly optimal strategy choice, Cognitive Science
ISO 13407, 1999, Human-centred Design Processes for Interactive Systems, Geneva:
International Standards Office.
Samuel, S., Kundel, H.L., Nodine, C.F., Toto, L.C. (1995). Mechanism of satisfaction of search: eye position recordings in the reading of chest radiographs, Radiology, 194, 895-902.
Shackel, B., 1984, The concept of usability. In: J. Bennett, D. Case, J. Sandelin and M. Smith (eds)
Visual Display Terminals: Usability Issues and Health Concerns, Englewood Cliffs, NJ:
Prentice-Hall, 45–88.
Whiteside, J. Bennett, J. and Holtzblatt, K., 1988, Usability engineering: our experience and
evolution. In: M. Helander (ed.) Handbook of Human–Computer Interaction, Amsterdam:
Elsevier, 791–817.
Appendix 1 – SUS translated into French
http://blocnotes.iergo.fr/concevoir/les-outils/sus-pour-system-usability-scale/
Appendix 2 – Summary of Comments from SMEs
1. Want to control road rather than work zone-by-zone.
2. Ramp metering not always under operator control.
3. Sensors will only be on south / west lanes.
4. Control of ramp should take account of the length of the queue building up behind it.
5. Maximise traffic flow.
6. Apply penalty (to automated system) for queue?
7. How does prediction work in display A?
8. Contrast global versus local view (see 1).
9. How read colours on car dials? Is this showing targets or limits?
10. Too much information on the display for the job of ramp metering (confused by whether UI is on an extra screen or a replacement for the whole system?).
11. Prediction is useful.
12. Graphs show limits (upper and lower) and normal, and predictions.
13. Warning shown when outside limits rather than having operator explore all sensors (confused by demo?).
14. Too much information on same map (display A).
15. Coloured circles show information for zone.
16. Ramp overview (display C) useful.
17. Would prefer map showing where there is a problem and to have information needed to understand this problem to hand at that location.
18. Need global view and simple data.
19. Danger in using zooming / resizable displays – makes most recent the most important; obscures operator experience in prioritising events; limits overview and SA.
20. Sometimes a ramp is closed and it is not clear how traffic will evolve. It would be useful to include this in modelling (CB – what about inviting operator to propose how traffic will change?).
21. Click on ramp square (on display A) and get overview of parameters for that ramp.
22. Predicted versus current state useful.
23. Not sure how to use vehicle inter-distance (isn't this density?).
24. Not sure if driver behaviour falls under control room command; need to optimise traffic flow and leave driver behaviour to driver responsibility. Also, changing driver behaviour has long-term evolution.
25. Current ramp metering:
   a. Ramp lengths short, so need to manage length of queue.
   b. When ramp active, operator seeks to manage queue length.
   c. If queue at end of ramp then impact on traffic on main road.
   d. Might need to act on downstream ramp.
   e. Ramp metering entirely automated (in new system) and could be useful to allow operator intervention.
26. Invitation to contribute to working party on design of ramp metering UI that is currently being implemented.
27. Representation of events – road works, accidents, congestion – on the map.
28. Response depends on whether this supplements or replaces the current system (see comment on 10).
29. Log of events written by operators and details from phone calls, observation from CCTV, email, other reports...
30. Understanding situation (global) as it evolves.
31. Pushing information to other panels or to other road staff.
32. Connection with other agencies (not easy to do at present but could help with joined-up response).
Appendix 3 – Complete Experiment Instructions
Traffic Management Experiment
In this experiment you will be playing the role of a traffic manager. Your job is to maintain a desirable
traffic flow by changing ramp metering rates (frequency of traffic lights changing colour) based on traffic
density (number of cars in the area of interest). This will be done using the User interface shown below.
The UI consists of 4 information sources: a map (top right corner), a graph of Rate vs Density (top left
corner), a list of all ramps (bottom left corner) and a control window (bottom right).
Your task is to either ACCEPT or CHALLENGE a COMPUTER SUGGESTION (shown in the bottom right
window) regarding changes to a specific ramp. However, the computer suggestion might be wrong
and/or the information sources (the 4 different windows) might refer to different ramps. If not all
information sources refer to the same ramp and/or the computer suggestion is considered to be wrong
you should press the CHALLENGE button. You should only press the ACCEPT button if all information
sources agree (talk about the same ramp) AND the computer suggestion is deemed correct.
To determine if the computer suggestion is correct, you should look at the Rate vs Density graph in the
top left window.
- If most of the bigger dots are in the top left quadrant (high rate & low density) the rate at that ramp should be LOWERED.
- If most of the bigger dots are in the bottom right quadrant (low rate & high density) the rate at that ramp should be INCREASED.
- If most of the bigger dots are in the top right (high rate & high density) or bottom left (low rate & low density) quadrant the rate at that ramp should be LEFT UNCHANGED.
This experiment consists of 32 trials. In between each trial you will see the white screen below. To
begin the trial or proceed to the following one you need to press on the timer. You are requested to
answer each trial as FAST as possible. However, in between each task (when you see the white
screen with the counter) you will not be timed. In the beginning you will be given 2 practice trials
after which you can clarify anything regarding the task.