
INTERNATIONAL JOURNAL OF HUMAN–COMPUTER INTERACTION, 15(3), 315–335
Copyright © 2003, Lawrence Erlbaum Associates, Inc.
On the Advantages of a Systematic Inspection
for Evaluating Hypermedia Usability
A. De Angeli
NCR Self-Service
Advanced Technology & Research, Dundee, UK
M. Matera
M. F. Costabile
Dipartimento di Informatica
Università di Bari, Italy
F. Garzotto
P. Paolini
Dipartimento di Elettronica e Informazione
Politecnico di Milano, Italy
It is indubitable that usability inspection of complex hypermedia is still an “art,” in the
sense that a great deal is left to the skills, experience, and ability of the inspectors.
Training inspectors is difficult and often quite expensive. The Systematic Usability
Evaluation (SUE) inspection technique has been proposed to help usability inspectors
share and transfer their evaluation know-how, to simplify the hypermedia inspection
process for newcomers, and to achieve more effective and efficient evaluation results.
SUE inspection is based on the use of evaluation patterns, called abstract tasks, which
precisely describe the activities to be performed by evaluators during inspection. This
article highlights the advantages of this inspection technique by presenting its empirical validation through a controlled experiment. Two groups of novice inspectors were
asked to evaluate a commercial hypermedia CD-ROM by applying the SUE inspection
or traditional heuristic evaluation. The comparison was based on three major dimensions: effectiveness, efficiency, and satisfaction. Results indicate a clear advantage of
the SUE inspection over the traditional inspection on all dimensions, demonstrating
that abstract tasks are efficient tools to drive the evaluator’s performance.
The authors are immensely grateful to Prof. Rex Hartson, from Virginia Tech, for his valuable suggestions. The authors also thank Francesca Alonzo and Alessandra Di Silvestro, from the Hypermedia
Open Center of Polytechnic of Milan, for the help offered during the experiment data coding.
The support of the EC grant FAIRWIS project IST-1999-12641 and of MURST COFIN 2000 is
acknowledged.
Requests for reprints should be sent to M. F. Costabile, Dipartimento di Informatica, Università di
Bari, Via Orabona, 4–70126 Bari, Italy. E-mail: [email protected]
1. INTRODUCTION
One of the goals of the human–computer interaction (HCI) discipline is to define
methods for ensuring usability, which is now universally acknowledged as a significant aspect of the overall quality of interactive systems. ISO Standard 9241-11
(International Standard Organization, 1997) defines usability as “the extent to
which a product can be used by specified users to achieve specified goals with effectiveness, efficiency and satisfaction in a specified context of use.” In this framework,
effectiveness is defined as the accuracy and the completeness with which users
achieve goals in particular environments. Efficiency refers to the resources expended in relation to the accuracy and completeness of the goals achieved. Satisfaction is defined as the comfort and the acceptability of the system for its users and
other people affected by its use.
Much attention to usability is currently paid by industry, which is recognizing the
importance of adopting evaluation methods during the development cycle to verify
the quality of new products before they are put on the market (Madsen, 1999). One of
the main complaints of industry is, however, that cost-effective evaluation tools are
still lacking. This prevents most companies from actually performing usability evaluation, with the result that a lot of software is still poorly designed and
unusable. Therefore, usability inspection methods are emerging as preferred evaluation procedures, being less costly than traditional user-based evaluation.
It is indubitable that usability inspection of complex applications, such as
hypermedia, is still an “art,” in the sense that a great deal is left to the skills, experience, and ability of the inspectors. Moreover, training inspectors is difficult and
often quite expensive. As part of an overall methodology called Systematic Usability Evaluation (SUE; Costabile, Garzotto, Matera, & Paolini, 1997), a novel inspection technique has been proposed to help usability inspectors share and transfer
their evaluation know-how, make the hypermedia inspection process easier for
newcomers, and achieve more effective and efficient evaluations. As described in
previous articles (Costabile & Matera, 1999; Garzotto, Matera, & Paolini, 1998,
1999), the inspection proposed by SUE is based on the use of evaluation patterns,
called abstract tasks, which precisely describe the activities to be performed by an
evaluator during the inspection.
This article presents an empirical validation of the SUE inspection technique.
Two groups of novice inspectors were asked to evaluate a commercial hypermedia
CD-ROM by applying the SUE inspection or the traditional heuristic technique.
The comparison was based on three major dimensions: effectiveness, efficiency,
and satisfaction. Results demonstrated a clear advantage of the SUE inspection
over the heuristic evaluation on all dimensions, showing that abstract tasks are efficient tools to drive the evaluator’s performance.
This article is organized as follows. Section 2 provides the rationale for
the SUE inspection by describing the current situation of usability inspection techniques. Section 3 briefly describes the main characteristics of the SUE methodology,
whereas Section 4 outlines the inspection technique proposed by SUE. Section 5,
the core of the article, describes the experiment that was performed to validate the
SUE inspection. Finally, Section 6 presents conclusions.
2. BACKGROUND
Different methods can be used to evaluate usability, among which the most common
are user-based methods and inspection methods. User-based evaluation mainly
consists of user testing: It assesses usability properties by observing how the system
is actually used by some representative sample of real users (Dix, Finlay, Abowd, &
Beale, 1998; Preece et al., 1994; Whiteside, Bennet, & Holtzblatt, 1988). Usability inspection refers to a set of methods in which expert evaluators examine usability-related aspects of an application and provide judgments based on their knowledge. Examples of inspection methods are heuristic evaluation, cognitive walkthrough,
guideline review, and formal usability inspection (Nielsen & Mack, 1994).
User-based evaluations are claimed to provide, at least until now, the most reliable
results, because they involve samples of real users. Such methods, however, have a
number of drawbacks, such as the difficulty of properly selecting a correct sample
from the user community and training participants to master the most sophisticated and advanced features of an interactive system. Furthermore, it is difficult
and expensive to reproduce actual situations of usage (Lim, Benbasat, & Todd,
1996), and failure in creating real-life situations may lead to artificial findings
rather than realistic results. Therefore, the cost and the time to set up reliable empirical testing may be excessive.
In comparison to user-based evaluation, usability inspection methods are
more subjective. They are strongly dependent on the inspector’s skills. Therefore,
different inspectors may produce noncomparable outcomes. However, usability
inspection methods “save users” (Jeffries, Miller, Wharton, & Uyeda, 1991; Nielsen & Mack, 1994) and do not require special equipment or lab facilities. In addition, experts can detect problems and possible future faults of a complex system
in a limited amount of time. For all these reasons, inspection methods have been
used more widely in recent years, especially in industrial environments (Nielsen,
1994a).
Among usability inspection methods, heuristic evaluation (Nielsen, 1993,
1994b) is the most commonly used. With this method, a small set of experts inspect a system and evaluate its interface against a list of recognized usability
principles—the heuristics. Experts in heuristic evaluation can be usability specialists, experts of the specific domain of the application to be evaluated, or (preferably) double experts, with both usability and domain experience. During the
evaluation session, each evaluator goes individually through the interface at least
twice. The first step is to get a feel for the flow of the interaction and the general
scope of the system. The second is to focus on specific objects and functionality,
evaluating their design, implementation, and so forth, against a list of
well-known heuristics. Typically, such heuristics are general principles, which refer to common properties of usable systems. However, it is desirable to develop
and adopt category-specific heuristics that apply to a specific class of products
(Garzotto & Matera, 1997; Nielsen & Mack, 1994). The output of a heuristic evaluation session is a list of usability problems with reference to the violated
heuristics. Reporting problems in relation to heuristics enables designers to easily
revise the design, in accordance with what is prescribed by the guidelines provided by the violated principles. Once the evaluation has been completed, the
findings of the different evaluators are compared to generate a report summarizing all the findings.
Heuristic evaluation is a “discount usability” method (Nielsen, 1993, 1994a). In
fact, some researchers have shown that it is a very efficient usability engineering
method (Jeffries & Desurvire, 1992) with a high benefit–cost ratio (Nielsen, 1994a).
It is especially valuable when time and resources are short, because skilled evaluators, without needing the involvement of representative users, can produce
high-quality results in a limited amount of time (Kantner & Rosenbaum, 1997).
This technique has, however, a number of drawbacks. As highlighted by Jeffries et
al. (1991), Doubleday, Ryan, Springett, and Sutcliffe (1997), and Kantner and
Rosenbaum (1997), its major disadvantage is the high dependence on the skills and
experiences of the evaluators. Nielsen (1992) stated that novice evaluators with no
usability expertise are poor evaluators and that usability experts are 1.8 times as
good as novices. Moreover, application domain and usability experts (the double
experts) are 2.7 times as good as novices and 1.5 times as good as usability experts (Nielsen,
1992). This means that experience with the specific category of applications being
evaluated markedly improves the evaluators’ performance. Unfortunately, usability
specialists may lack domain expertise, and domain specialists are rarely trained or
experienced in usability methodologies. To overcome this problem for hypermedia
usability evaluation, the SUE inspection technique has been introduced. It uses
evaluation patterns, called abstract tasks, to guide the inspectors’ activity. Abstract
tasks precisely describe which hypermedia objects to look for and which actions
the evaluators must perform to analyze such objects. In this way, less experienced
evaluators, who may lack expertise in usability or hypermedia, are able to produce
more complete and precise results. The SUE inspection technique also addresses a further
drawback of heuristic evaluation, reported by Doubleday et al. (1997): heuristics, as they are generally formulated, are not always able to
adequately guide evaluators. To address this, the SUE inspection framework provides evaluators with a list of detailed heuristics that are specific to hypermedia.
Abstract tasks provide a detailed description of the activities to be performed to detect possible violations of the hypermedia heuristics.
In SUE, the overall inspection process is driven by the use of an application
model, the hypermedia design model (HDM; Garzotto, Paolini, & Schwabe, 1993).
The HDM concepts and primitives allow evaluators to identify precisely the
hypermedia constituents that are worthy of investigation. Moreover, both the
hypermedia heuristics and the abstract tasks focus on such constituents and are
formulated through HDM terminology. This terminology also is used by evaluators for reporting problems, thus avoiding the generation of incomprehensible and
vague inspection reports. In some recent articles (Andre, Hartson, & Williges, 1999;
Hartson, Andre, Williges, & Van Rens, 1999), authors have highlighted the need for
more focused usability inspection methods and for a classification of usability
problems to support the production of inspection reports that are easy to read and
compare. These authors have defined the user action framework (UAF), which is a
unifying and organizing environment that supports design guidelines, usability
inspection, classification, and reporting of usability problems. UAF provides a
knowledge base in which different usability problems are organized, taking into
account how users are affected by the design during the interaction, at various
points where they must accomplish cognitive or physical actions. The classification
of design problems and usability concepts is a way to capitalize on past evaluation
experiences. It allows evaluators to better understand the design problems they
encounter during the inspection and helps them identify precisely which physical
or cognitive aspects cause problems. Evaluators are therefore able to propose
well-focused redesign solutions. The motivations behind this research are similar
to ours: reusing past evaluation experience and making it available to less experienced people is also a basic goal of the present authors, pursued through the use of
abstract tasks. The formulation of the abstract tasks is, in fact, a distillation of the experience of
skilled evaluators. Unlike the UAF, however, rather than recording problems, abstract tasks
offer a way to keep track of the activities to be performed to discover problems.
3. THE SUE METHODOLOGY
SUE is a methodology for evaluating the usability of interactive systems, which
prescribes a structured flow of activities (Costabile et al., 1997). SUE has been
largely specialized for hypermedia (Costabile, Garzotto, Matera, & Paolini, 1998;
Garzotto et al., 1998, 1999), but this methodology easily can be exploited to evaluate
the usability of any interactive application (Costabile & Matera, 1999). A core idea
of SUE is that the most reliable evaluation can be achieved by systematically combining inspection with user-based evaluation. In fact, several studies have shown
that the two methods are complementary and can be effectively coupled to
obtain a reliable evaluation process (Desurvire, 1994; Karat, 1994; Virzi, Sorce, &
Herbert, 1993). The inspection proposed by SUE, based on the use of abstract tasks
(in the following, SUE inspection), is carried out first. Then, if the inspection results are
not sufficient to predict the impact of some critical situations, user-based evaluation also is conducted. Because it is driven by the inspection outcome, this
user-based evaluation tends to be better focused and more cost-effective.
Another basic assumption of SUE is that, to be reliable, usability evaluation
should encompass a variety of dimensions of a system. Some of these dimensions
may refer to general layout features common to all interactive systems, whereas
others may be more specific for the design of a particular product category or a
particular domain of use. For each dimension, the evaluation process consists of
a preparatory phase and an execution phase. The preparatory phase is performed
only once for each dimension, and its purpose is to create a conceptual framework that will be used to carry out actual evaluations. As explained in more detail in the
next section, such a framework includes a design model, a set of usability attributes, and a library of abstract tasks. Because the activities in the preparatory
phase may require extensive use of resources, they should be regarded as a
long-term investment. The execution phase is performed every time a specific application must be evaluated. It mainly consists of an inspection, performed by expert evaluators. If needed, inspection can be followed by sessions of user testing,
involving real users.
4. THE SUE INSPECTION TECHNIQUE FOR HYPERMEDIA
The SUE inspection is based on the use of an application design model for
describing the application, a set of usability attributes to be verified during the
evaluation, and a set of abstract tasks (ATs) to be applied during the inspection
phase. The term model is used in a broad sense, meaning a set of concepts, representation structures, design principles, primitives, and terms, which can be used
to build a description of an application. The model helps organize concepts, thus
identifying and describing, in an unambiguous way, the components of the application that constitute the entities of the evaluation (Fenton, 1991). For the evaluation of hypermedia, the authors have adopted HDM (Garzotto et al., 1993),
which focuses on structural and navigation properties as well as on active media
features.
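To make the role of the model concrete, the sketch below shows one possible way to record a few HDM-style constituents mentioned in this article (entity types with their slots, application links, and collections) so that inspectors can refer to them consistently while reporting problems. The class and field names, and the painting-related examples in the comments, are illustrative assumptions of ours, not part of HDM or of the SUE materials.

```python
# Illustrative sketch only: a minimal way to record HDM-style constituents
# (entity types with slots, application links, collections) so that an
# inspection report can reference them by name. Names are assumptions
# made for illustration, not an official HDM or SUE artifact.
from dataclasses import dataclass, field
from typing import List


@dataclass
class EntityType:
    name: str                                           # e.g., "Painting"
    slots: List[str] = field(default_factory=list)      # e.g., ["title text", "image", "audio comment"]


@dataclass
class ApplicationLink:
    name: str        # e.g., "painted by"
    source: str      # source entity type
    target: str      # target entity type


@dataclass
class Collection:
    name: str                                           # e.g., "Paintings by period"
    members: List[str] = field(default_factory=list)    # entity types reachable from this access structure


@dataclass
class HDMSchema:
    entity_types: List[EntityType]
    application_links: List[ApplicationLink]
    collections: List[Collection]
```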
Usability attributes are obtained by decomposing general usability principles
into finer grained criteria that can be better analyzed. In accordance with the
suggestion by Nielsen and Mack (1994) to develop category-specific heuristics,
the authors have defined a set of usability attributes, able to capture the peculiar
features of hypermedia (Garzotto & Matera, 1997; Garzotto et al., 1998). Such
hypermedia usability attributes correspond with Nielsen’s 10 heuristics (Nielsen,
1993). The hypermedia usability attributes, in fact, can be considered a specialization for hypermedia of Nielsen’s heuristics, with the only exception of “good error messages” and “help and documentation,” which do not need to be further
specialized.
ATs are evaluation patterns that provide a detailed description of the activities to
be performed by expert evaluators during inspection (Garzotto et al., 1998, 1999).
They are formulated precisely by following a pattern template, which provides a
consistent format including the following items:
• AT classification code and title uniquely identify the AT and succinctly convey
its essence.
• Focus of action briefly describes the context, or focus, of the AT by listing the
application constituents that correspond to the evaluation entities.
• Intent describes the problem addressed by the AT and its rationale, making
clear the specific goal to be achieved by applying the AT.
• Activity description is a detailed description of the activities to be performed
when applying the AT.
• Output describes the output of the fragment of the inspection the AT refers to.
Optionally, a comment is provided, with the aim of indicating further ATs to be
applied in combination or highlighting related usability attributes.
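As an illustration of how such a template can be made operational, the sketch below encodes the items listed above as a small data structure and instantiates it with a paraphrase of AT AS-1 (shown in full in Table 1). The Python representation and its field names are our own illustration, not an official SUE artifact.

```python
# Sketch of the AT pattern template as a data structure; the field names
# mirror the template items described above. Illustration only, not an
# official SUE artifact.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AbstractTask:
    code: str                        # AT classification code, e.g., "AS-1"
    title: str                       # succinctly conveys the AT's essence
    focus_of_action: str             # application constituents under evaluation
    intent: str                      # problem addressed and rationale
    activity_description: List[str]  # step-by-step activities for the evaluator
    output: List[str]                # what the inspection fragment must produce
    comment: Optional[str] = None    # optional: related ATs or usability attributes


# Example instance, paraphrasing AT AS-1 (see Table 1):
AS_1 = AbstractTask(
    code="AS-1",
    title="Control on active slots",
    focus_of_action="An active slot",
    intent="Evaluate the control provided over the active slot and the visibility of its state",
    activity_description=[
        "Execute commands such as play, suspend, continue, stop, replay",
        "During activation, verify that the current state and its evolution can be identified",
    ],
    output=[
        "List and short description of control commands and state-visibility mechanisms",
        "Statement on whether the commands and mechanisms are appropriate and effective",
    ],
)
```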
A further advantage of the use of a model is that it provides the terminology for
formulating the ATs. The 40 ATs defined for hypermedia (Matera, 1999) have been
formulated by using the HDM vocabulary. Two examples are reported in Table 1.
The two ATs focus on active slots (in the HDM terminology, a slot is an atomic piece of information, such as text, picture, video, and sound). The list of ATs provides systematic guidance on
how to inspect a hypermedia application. Most evaluators are very good at analyzing only certain features of interactive applications; they often neglect other
features that are closely tied to the specific application category. Exploiting a ready-to-use set of
ATs allows evaluators with no experience in hypermedia to come up
with good results.
Table 1: Two Abstract Tasks (ATs) from the Library of Hypermedia ATs
AS-1: Control on Active Slots
Focus of action: An active slot
Intent: to evaluate the control provided over the active slot, in terms of the following:
  A. Mechanisms for the control of the active slot
  B. Mechanisms supporting the state visibility (i.e., the identification of any intermediate state of the slot activation)
Activity description: given an active slot:
  A. Execute commands such as play, suspend, continue, stop, replay, get to an intermediate state, and so forth
  B. At a given instant, during the activation of the active slot, verify if it is possible to identify its current state as well as its evolution up to the end
Output:
  A. A list and a short description of the set of control commands and of the mechanisms supporting the state visibility
  B. A statement saying if the following are true:
    • The type and number of commands are appropriate, in accordance with the intrinsic nature of the active slot
    • Besides the available commands, some further commands would make the active slot control more effective
    • The mechanisms supporting the state visibility are evident and effective
AS-6: Navigational Behavior of Active Slots
Focus of action: An active slot + links
Intent: to evaluate the cross effects of navigation on the behavior of active slots
Activity description: consider an active slot:
  A. Activate it, and then follow one or more links while the slot is still active; return to the “original” node where the slot has been activated and verify the actual slot state
  B. Activate the active slot; suspend it; follow one or more links; return to the original node where the slot has been suspended and verify the actual slot state
  C. Execute activities A and B traversing different types of links, both to leave the original node and to return to it
  D. Execute activities A and B by using only backtracking to return to the original node
Output:
  A. A description of the behavior of the active slot, when its activation is followed by the execution of navigational links and, possibly, backtracking
  B. A list and a short description of possible unpredictable effects or semantic conflicts in the source or in the destination node, together with the indication of the type and nature of the link that has generated the problem
Note. AS = active slot.

During inspection, evaluators analyze the application and specify a viable
HDM schema, when it is not already available, for describing the application.
During this activity, different application components (i.e., the objects of the evaluation) are identified. Then, keeping the usability criteria in mind, evaluators apply the ATs and produce a report that describes the discovered problems. Evaluators use the terminology provided by the model to refer to objects and describe
critical situations while reporting problems, thus attaining precision in their final
evaluation report.
5. THE VALIDATION EXPERIMENT
To validate the SUE inspection technique, a comparison study was conducted
involving a group of senior students of an HCI class at the University of Bari, Italy.
The aim of the experiment was to compare the performance of evaluators carrying
out the SUE inspection (SI), based on the use of ATs, with the performance of evaluators carrying out heuristic inspection (HI), based on the use of heuristics only.
As explained in more detail in Section 5.2, the validation metrics were defined along
three major dimensions: effectiveness, efficiency, and user satisfaction. Such
dimensions actually correspond to the principal usability factors as defined by the
Standard ISO 9241-11 (International Standard Organization, 1997). Therefore, the
experiment allowed the authors to assess the usability of the inspection technique
(John, 1996). In the defined metrics, effectiveness refers to the completeness and
accuracy with which inspectors performed the evaluation. Efficiency refers to the
time expended in relation to the effectiveness of the evaluation. Satisfaction refers
to a number of subjective parameters, such as perceived usefulness, difficulty,
acceptability, and confidence with respect to the evaluation technique. For each
dimension, a specific hypothesis was tested.
• Effectiveness hypothesis. As a general hypothesis, SI was predicted to increase
evaluation effectiveness compared with HI. The advantage is related to two factors:
(a) the systematic nature of the SI technique, deriving from the use of the HDM
model to precisely identify the application constituents, and (b) the use of ATs,
which suggest the activity to be conducted over such objects. Because the ATs
directly address hypermedia applications, this prediction should also be weighted
with respect to the nature of problems detected by evaluators. The hypermedia
specialization of the SI could constitute both the method's advantage and its limit.
Indeed, although it could be particularly effective with respect to hypermedia-specific problems, it could neglect other flaws related to presentation and content. In
other words, the limit of ATs could be that they take evaluators away from defects
not specifically addressed by the AT activity.
• Efficiency hypothesis. A limit of SI could be that a rigorous application of several
ATs is time consuming. However, SI was not expected to compromise inspection
efficiency compared with the less structured HI technique. Indeed, the expected
higher effectiveness of the SI technique should compensate for the greater time
demand required by its application.
• Satisfaction hypothesis. Although SI was expected to be perceived as a more
complex technique than HI, it was hypothesized that SI should enhance the evaluators’ control over the inspection process and their confidence in the obtained results.
5.1. Method
In this section, the experimental method adopted to test the effectiveness, efficiency, and user satisfaction hypotheses is described.
Participants. Twenty-eight senior students from the University of Bari participated in the experiment as part of their credit for an HCI course. Their curriculum
comprised training and hands-on experience in Nielsen’s heuristic evaluation
method, which they had applied to paper prototypes, computer-based prototypes,
and hypermedia CD-ROMs. During lectures they also were exposed to the HDM
model.
Design. The inspection technique was manipulated between participants.
Half of the sample was randomly assigned to the HI condition and the other half to
the SI condition.
Procedure. A week before the experiment, participants were introduced to the
conceptual tools to be used during the inspection. The training session lasted 2 hr and
30 min for the HI group and 3 hr for the SI group. The discrepancy was due to the different conceptual tools used during the evaluation by the two groups, as explained
in the following. A preliminary 2-hr seminar briefly reviewed the HDM
and introduced all the participants to hypermedia-specific heuristics, as defined by
SUE. Particular attention was devoted to informing students without influencing
their expectations and attitudes toward the two different inspection techniques. A
couple of days later, all participants were presented with a short demonstration of
the application, lasting almost 15 min. A few summary indications about the application content and the main functions were given, without providing too many
details. In this way, participants, having limited time at their disposal, did not start
their usability analysis from scratch but had an idea (although vague) of how to
become oriented in the application. Then, participants assigned to the SI group were
briefly introduced to the HDM schema of the application and to the key concepts of
applying ATs. In the proposed application schema, only the main application components were introduced (i.e., structure of entity types and application links for the
hyperbase, collection structure and navigation for the access layer), without revealing any detail that could indicate usability problems.
The experimental session lasted 3 hr. Participants had to inspect the CD-ROM,
applying the technique to which they were assigned. All participants were
provided with a list of 10 SUE heuristics, summarizing the usability guidelines
for hypermedia (Garzotto & Matera, 1997; Garzotto et al., 1998). By providing both
groups with the same heuristic list, the authors were able to measure the possible
added value of the systematic inspection induced by SUE with respect to the subjective application of heuristics. The SI group
also was provided with the HDM application schema and with 10 ATs to be
applied during the inspection (see Table 2). The limited number of ATs was attributable to the limited amount of time participants had at their disposal. The
most basic ATs were selected that could guide SI inspectors in the analysis of the
main application constituents. For example, the authors disregarded ATs addressing advanced hypermedia features.

Table 2: The List of Abstract Tasks (ATs) Submitted to Inspectors
AS-1: Control on active slots
AS-6: Navigational behavior of active slots
PS-1: Control on passive slots
HB-N1: Complexity of structural navigation patterns
HB-N4: Complexity of applicative navigation patterns
AL-S1: Coverage power of access structures
AL-N1: Complexity of collection navigation patterns
AL-N3: Bottom-up navigation in index hierarchies
AL-N11: History collection structure and navigation
AL-N13: Exit mechanisms availability
Working individually, participants had to find the maximum number of usability problems in the application and record them on a report booklet, which differed
according to the experimental conditions. In the HI group, the booklet included 10
forms, one for each of the hypermedia heuristics. The forms required information
about the application point where that heuristic was violated and a short description of the problem. The SI group was provided with a report booklet including 10
forms, each one corresponding to an AT. Again, the forms required information
about the violations detected through that AT and where they occurred. Examples
of the forms included in the report booklets provided to the two groups are shown
in Figures 1 and 2.
At the end of the evaluation, participants were invited to fill in the evaluatorsatisfaction questionnaire, which combined several item formats to measure three
main dimensions: user satisfaction with the evaluated application, evaluator satisfaction with the inspection technique, and evaluator satisfaction with the results
achieved. The psychometric instrument was organized in two parts. The first was
concerned with the application, and the second included the questions about the
adopted evaluation technique. Two final questions asked participants to specify
how satisfied they felt about their performance as evaluators.
FIGURE 1 An example of a table in the report form provided to the HI group.
FIGURE 2 A page from the report booklet given to the SI group, containing one AT
and the corresponding form for reporting problems.
The application. The evaluated application was the Italian CD-ROM “Camminare nella pittura,” which means “walking through painting” (Mondadori New
Media, 1997). It was composed of two CD-ROMs, each one presenting the analysis of
painting and some relevant artworks from one of two periods. The first CD-ROM (CD1 in the
following) covered the period from Cimabue to Leonardo, the second one the period
from Bosch to Cézanne. The CD-ROMs were identical in structure, and each one
could be used independently of the other. Each CD-ROM was a distinct and “complete” application of limited size, particularly suitable for being exhaustively
analyzed in a limited amount of time. Therefore, only CD1 was submitted to participants. The limited number of navigation nodes in CD1 simplified the postexperimental analysis of the paths followed by the evaluators during the inspection
and the identification of the application points where they highlighted problems.
Data coding. The report booklets were analyzed by three expert hypermedia
designers (expert evaluators, in the following) with a strong HCI background, to
assess effectiveness and efficiency of the applied evaluation technique. All
reported measures had a reliability value of at least .85. Evaluator satisfaction was
measured by analyzing the self-administered postexperimental questionnaires.
All the statements written in the report booklets were scored as problems or
nonproblems. Problems are actual usability flaws that could affect user performance.
Nonproblems include (a) observations reflecting only evaluators’ personal preferences but not real usability bugs; (b) evaluation errors, reflecting evaluators’ misjudgments or system defects due to a particular hardware configuration; and (c)
statements that are not understandable (i.e., not clearly reported).
For each statement scored as a problem or a nonproblem of type (a), a severity
rating was performed. As suggested by Nielsen (1994b), severity was estimated
considering three factors: the frequency of the problem, the impact of the problem
on the user, and the persistence of the problem during interaction. The rating
was given on a 5-point Likert scale, ranging from 1 (I don’t agree that this is a usability
problem at all) to 5 (usability catastrophe). Each problem was further classified in one
of the following dimensions, according to the nature of the problem itself:
• Navigation, which includes problems related to the task of moving within the
hyperspace. It refers to the appropriateness of mechanisms for accessing information and for getting oriented in the hyperspace.
• Active media control, which includes problems related to the interaction with
dynamic multimedia objects, such as video, animation, and audio comment. It
refers to the appropriateness of mechanisms for controlling the dynamic behavior
of media and of mechanisms providing feedback about the current state of the
media activation.
• Interaction with widgets, which includes problems related to the interaction
with the widgets of the visual interface, such as buttons of various types, icons, and
scrollbars. It includes problems related to the appropriateness of mechanisms for
manipulating widgets and their self-evidence.
Note that navigation and active media control are dimensions specific to hypermedia systems.
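A minimal sketch of this coding scheme is given below: each statement from a report booklet is scored as a problem or one of the three kinds of nonproblem, rated for severity on the 1 to 5 scale, and, if it is a problem, assigned to one of the three categories. The names and structure are illustrative assumptions; they are not the coding sheets actually used by the expert evaluators.

```python
# Illustrative sketch of the data-coding scheme described above. Each statement
# from a report booklet is scored, rated for severity on the 1-5 scale, and
# (if a problem) assigned to a problem category. Names are assumptions only.
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class StatementKind(Enum):
    PROBLEM = "problem"                   # actual usability flaw
    PERSONAL_PREFERENCE = "nonproblem_a"  # evaluator's preference, not a real bug
    EVALUATION_ERROR = "nonproblem_b"     # misjudgment or hardware-specific defect
    NOT_UNDERSTANDABLE = "nonproblem_c"   # statement not clearly reported


class ProblemCategory(Enum):
    NAVIGATION = "navigation"
    ACTIVE_MEDIA_CONTROL = "active media control"
    INTERACTION_WITH_WIDGETS = "interaction with widgets"


@dataclass
class CodedStatement:
    inspector_id: int
    text: str
    kind: StatementKind
    # Severity on the 1 (not a usability problem) to 5 (usability catastrophe)
    # scale; rated for problems and for type (a) nonproblems, per the procedure above.
    severity: Optional[int] = None
    category: Optional[ProblemCategory] = None  # only for statements scored as problems
```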
5.2. Results
The total number of problems detected in the application was 38. Among these, 29
problems were discovered by the expert evaluators, through an inspection before the
experiment. The remaining 9 were identified only by the experimental inspectors.
During the experiment, inspectors reported a total number of 36 different types of
problems. They also reported 25 different types of nonproblems of type (a) and (b).
Four inspectors reported at least one nonunderstandable statement, that is, a
nonproblem of type (c).
The results of the psychometric analysis are reported in the following paragraphs with reference to the three experimental hypotheses.
Effectiveness. Effectiveness can be decomposed into the completeness and accuracy with which inspectors performed the evaluation. Completeness corresponds to the percentage of problems detected by a single inspector out of the total
number of problems. It is computed by the following formula:
Completeness_i = (P_i / n) × 100

where P_i is the number of problems found by the ith inspector, and n is the total
number of problems existing in the application (n = 38).
On average, inspectors in the SI group individually found 24% of all the usability
defects (SEM = 1.88); inspectors in the HI group found 19% (SEM = 1.99). As shown
by a Mann–Whitney U test, the difference is statistically significant (U = 50.5, N = 28, p
< .05). It follows that the SI technique enhances evaluation completeness, allowing
individual evaluators to discover a larger number of usability problems.
Accuracy can be defined by two indexes: precision and severity. Precision is
given by the percentage of problems detected by a single inspector out of the total number of statements. For a given inspector, precision is computed by the following formula:
Precision_i = (P_i / S_i) × 100

where P_i is the number of problems found by the ith inspector, and S_i is the total
number of statements he or she reported (including nonproblems).
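The two per-inspector indexes defined so far are simple to compute; the sketch below restates them in code, with an invented example inspector (the numbers are not data from the experiment).

```python
# Per-inspector effectiveness indexes as defined above. The example numbers
# are illustrative, not data from the experiment.
N_TOTAL_PROBLEMS = 38  # total problems known to exist in the application


def completeness(problems_found: int, n: int = N_TOTAL_PROBLEMS) -> float:
    """Percentage of the n existing problems detected by one inspector."""
    return problems_found / n * 100


def precision(problems_found: int, statements_reported: int) -> float:
    """Percentage of an inspector's statements that are actual problems."""
    return problems_found / statements_reported * 100


# Hypothetical inspector who reported 10 statements, 9 of them actual problems:
print(round(completeness(9), 1))   # 23.7 (close to the SI group average of 24%)
print(round(precision(9, 10), 1))  # 90.0
```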
In general, the distribution of precision is affected by a severe negative skewness, with 50% of participants not committing any errors. The variable ranges from
40 to 100, with a median value of 96. In the SI group, most inspectors were totally
accurate (precision value = 100), with the exception of two of them, who were
slightly inaccurate (precision value > 80). On the other hand, only two participants
of the HI condition were totally accurate. The mean value for the HI group was 77.4
(SEM = 4.23), and the median value was 77.5. Moreover, four evaluators in the HI
group reported at least one nonunderstandable statement, whereas all the statements reported by the SI group were clearly expressed and referred to application
objects using a comprehensible and consistent terminology.
This general trend reflecting an advantage of SI over HI was supported also by
the analysis of the severity index, which refers to the average rating of all scored
statements for each participant. A t-test analysis demonstrated that the mean ratings of the two groups differed significantly, t(26) = –3.92, p < .001 (two-tailed).
Problems detected applying the ATs were scored as more serious than those
detected when only the heuristics were available (means and standard errors are
reported in Table 3).
The effectiveness hypothesis also states that the SUE inspection technique could
be particularly effective for detecting hypermedia-specific problems, whereas it
could neglect other bugs related to graphical user interface widgets. To test this aspect, the distribution of problem types was analyzed as a function of experimental
conditions. As can be seen in Figure 3, the most common problems detected by all
the evaluators were concerned with navigation, followed by defects related to active media control. Only a minority of problems regarded interaction with widgets.
In general, it is evident that the SI inspectors found more problems. However, this
superiority especially emerges for hypermedia-related defects (navigation and active media control), t(26) = –2.70, p < .05 (two-tailed).
A slightly higher average number of “interaction with widgets” problems was
found by the HI group, compared with the SI group. A Mann–Whitney U test,
comparing the percentage of problems in the two experimental conditions, indicated that this difference was not significant (U = 67, N = 28, p = .16). This means
that, contrary to what was hypothesized, the systematic inspection activity suggested by
ATs does not take evaluators away from other problems not covered by the activity
description. Because the problems found by the SI group in the “interaction with
widgets” category were those having the highest severity, it also can be assumed
that the hypermedia ATs do not prevent evaluators from noticing usability catastrophes related to presentation aspects. Also, supplying evaluators with ATs focusing on presentation aspects, such as those presented by Costabile and Matera
(1999), may allow one to obtain a deep analysis of the graphical user interface, with
the result that SI evaluators find a larger number of “interaction with widgets”
problems.

Table 3: Means and Standard Errors for the Analysis of Severity
Severity index: HI = 3.66 (0.12); SI = 4.22 (0.08)
Note: HI = Heuristic Inspection; SI = Systematic Usability Evaluation Inspection.

FIGURE 3 Average number of problems as a function of experimental conditions
and problem categories.
Efficiency. Efficiency has been considered both at the individual and at the
group level. Individual efficiency refers to the number of problems extracted by
a single inspector, in relation to the time spent. It is computed by the following
formula:
Ind_Efficiency_i = P_i / t_i

where P_i is the number of problems detected by the ith inspector, and t_i is the time
spent for finding the problems.
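For completeness, the individual efficiency index can be sketched in the same style (again with invented numbers):

```python
# Individual efficiency as defined above: problems found per unit of time.
def individual_efficiency(problems_found: int, hours_spent: float) -> float:
    """Number of problems detected per hour of inspection by one inspector."""
    return problems_found / hours_spent


# Hypothetical inspector: 9 problems found in a 2-hour session.
print(individual_efficiency(9, 2.0))  # 4.5 problems per hour
```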
On average, SI inspectors found 4.5 problems in 1 hr of inspection, versus 3.6
problems per hour found by the HI inspectors. A t test on the variable normalized
by a square root transformation demonstrated that this difference was not significant, t(26) = –1.44, p = .16 (two-tailed). Such a result further supports the efficiency
hypothesis, because the application of the ATs did not compromise efficiency compared with a less structured evaluation technique. Rather, SI showed a positive tendency in finding a major number of problems per hour.
Group efficiency refers to the evaluation results achieved by aggregating the
performance of several inspectors. Toward this end, Nielsen’s cost–benefit curve,
relating the proportion of usability problems to the number of evaluators, has been
computed (Nielsen, 1994b). This curve derives from a mathematical model based
on the following prediction formula for the number of usability problems found in a heuristic
evaluation (Nielsen, 1992):

Found(i) = n(1 − (1 − λ)^i)
where Found(i) is the number of problems found by aggregating reports from i
independent evaluators, n is the total number of problems in the application, and λ
is the probability of finding the average usability problem when using a single
average evaluator.
As suggested by Nielsen and Landauer (1993), one possible use of this model is
in estimating the number of inspectors needed to identify a given percentage of
usability errors. This model therefore was used to determine how many inspectors
for the two techniques would enable the detection of a reasonable percentage of
problems in the application. The curves calculated for the two techniques are reported in Figure 4 (n = 38, λHI = 0.19, λSI = 0.24). As shown in the figure, SI tended to
reach better performance with a lower number of evaluators. If Nielsen’s 75%
threshold is assumed, SI can reach this level with five evaluators. The HI technique
would require seven evaluators.
FIGURE 4 The cost–benefit curve (Nielsen & Landauer, 1993) computed for the two
techniques, HI (heuristic inspection) and SI (SUE inspection). Each curve shows the
proportion of usability problems found by each technique when different numbers of
evaluators were used.
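The aggregation model is easy to reproduce. The short script below (ours, not part of the original study) evaluates Found(i)/n for the two λ values reported above and shows that the SI curve approaches the 75% level with about five evaluators, whereas the HI curve needs about seven.

```python
# Proportion of problems found when aggregating i independent evaluators:
# Found(i)/n = 1 - (1 - lam)**i, using the lambda values reported above
# (lambda_HI = 0.19, lambda_SI = 0.24). Illustrative script only.
def proportion_found(i: int, lam: float) -> float:
    return 1 - (1 - lam) ** i


for i in range(1, 11):
    hi = proportion_found(i, 0.19)
    si = proportion_found(i, 0.24)
    print(f"{i:2d} evaluators: HI {hi:.0%}  SI {si:.0%}")

# With these lambda values, the SI curve reaches roughly 75% of the 38 problems
# at about five evaluators, whereas the HI curve needs about seven.
```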
Satisfaction. With respect to an evaluation technique, satisfaction refers to
many parameters, such as perceived usefulness, difficulty, and acceptability of
applying the method. The postexperimental questionnaire addressed three main
dimensions: user satisfaction with the application evaluated, evaluator satisfaction
with the inspection technique, and evaluator satisfaction with the results achieved.
At first sight it may appear that the first dimension, addressing evaluators’ satisfaction with the application, is outside the scope of the experiment, the main intent of
which was to compare two inspection techniques. However, the authors wanted to
verify how the technique used may have influenced inspector severity.
User satisfaction with the application evaluated was assessed through a semantic–differential scale that required inspectors to judge the application on 11 pairs of
adjectives describing satisfaction with information systems. Inspectors rated
each pair on a 7-point scale (after recoding of reversed items, where 1 =
very negative and 7 = positive). The initial reliability of the satisfaction scale was moderately satisfactory (α = .74), with three items (reliable–unreliable, amusing–boring,
difficult–simple) presenting a corrected item-total correlation below .30. Therefore, the user-satisfaction index was computed by averaging scores on the remaining
eight items (α = .79). Then, the index was analyzed by a t test. Results showed a significant effect of the inspection group, t(26) = 2.38, p < .05 (two-tailed). On average, the SI
inspectors evaluated the application more severely (M = 4.37, SEM = .23) than HI inspectors (M = 5.13, SEM = .22). From this difference, it can be inferred that ATs provide evaluators with a more effective framework to weight limits and benefits of the
application. The hypothesis is supported by the significant correlation between the
number of usability problems found by an evaluator and his or her satisfaction with
the application (r = –.42, p < .05). The negative coefficient indicates that the more problems
were found, the less positive the evaluation was.
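For readers who wish to reproduce this kind of scale analysis, the sketch below computes Cronbach's α and an averaged satisfaction index from a participants-by-items score matrix. The data are synthetic and the routine is ours; it is not the analysis code used by the authors.

```python
# Illustrative scale analysis: Cronbach's alpha and an averaged satisfaction
# index from a participants x items score matrix (synthetic data, not the
# questionnaire responses collected in the experiment).
import numpy as np


def cronbach_alpha(scores: np.ndarray) -> float:
    """scores: 2-D array, rows = participants, columns = items (already recoded)."""
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)
    total_variance = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances.sum() / total_variance)


rng = np.random.default_rng(0)
# 28 synthetic participants rating 8 retained items on a 1-7 scale;
# each participant has an overall leaning plus item-level noise, so the
# items are correlated and alpha comes out positive.
leaning = rng.normal(4.0, 1.0, size=(28, 1))
noise = rng.normal(0.0, 0.7, size=(28, 8))
ratings = np.clip(np.round(leaning + noise), 1, 7)

alpha = cronbach_alpha(ratings)
satisfaction_index = ratings.mean(axis=1)  # one index value per participant

print(f"alpha = {alpha:.2f}")
print(f"mean satisfaction index = {satisfaction_index.mean():.2f}")
```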
Evaluator satisfaction with the inspection technique was assessed by 11 pairs of
adjectives, rated on a 7-point scale. The original reliability value was .72, increasing to
.75 after deletion of three items (tiring–restful, complex–simple, satisfying–unsatisfying). The evaluator-satisfaction index then was computed by averaging scores on
the remaining eight items. The index is highly correlated with a direct item assessing
learnability of the inspection technique (r = .53, p < .001). The easier a technique is
perceived to be, the better it is evaluated. A t test showed no significant differences in
satisfaction with the inspection technique, t(26) = 1.19, p = .25 (two-tailed). On average, evaluations were moderately positive for both techniques, with a mean difference of .32 slightly favoring the HI group. To conclude, despite being objectively
more demanding, SI was not evaluated worse than HI.
Evaluator satisfaction with the results achieved was assessed directly by a
Likert-type item asking participants to express their gratification on a 4-point scale
(from not at all to very much) and indirectly by a percentage estimation of the number
of problems found. The two variables were highly correlated (r = .57, p < .01). The
more problems an inspector thought he or she had found, the more satisfied he or she
was with his or her performance. Consequently, the final satisfaction index was computed by multiplying the two scores. A Mann–Whitney U test showed a tendency
toward a difference in favor of the HI group (U = 54.5, p = .07). Participants in the HI
group felt more satisfied about their performance than those in the SI group.
By considering this finding in the general framework of the experiment, it
appears that ATs provide participants with greater critical ability than heuristics do.
Indeed, despite the greater effectiveness achieved by participants in the SI group,
they were still less satisfied with their performance, as if they could better understand the limits of an individual evaluation.
Summary. Table 4 summarizes the experimental results presented in the
previous paragraphs. The advantage of the systematic approach adopted by the
evaluators assigned to the SI condition is evident. The implications of these findings are discussed in the final section.
6. CONCLUSIONS
In the last decade, several techniques for evaluating the usability of software
systems have been proposed. Unfortunately, research in HCI has not devoted sufficient efforts toward validating such techniques, and therefore some questions
persist (John, 1996). The study reported in this article provides some answers about
the effectiveness, efficiency, and satisfaction of the SUE inspection technique. The
experiment seems to confirm the general hypothesis of a sharp increase in the overall quality of inspection when ATs are used. More specifically, the following may be
concluded:
• The SUE inspection increases evaluation effectiveness. The SI group showed greater
completeness and precision in reporting problems and also identified more severe
problems.
Table 4: Summary of the Experimental Results

Hypothesis and Indexes                                      HI    SI
Effectiveness
  Completeness                                              –     +
  Accuracy
    Precision                                               –     +
    Severity                                                –     +
Efficiency
  Individual efficiency                                     =     =
  Group efficiency                                          –     +
Satisfaction
  User satisfaction with the application evaluated          <     >
  Evaluator satisfaction with the inspection technique      =     =
  Evaluator satisfaction with the achieved results          <     >

Note: HI = Heuristic Inspection; SI = Systematic Usability Evaluation Inspection; – worse performance; + better performance; = equal performance; < minor critical ability; > major critical ability.
• Although more rigorous and structured, the SUE inspection does not compromise
inspection efficiency. Rather, it enhanced group efficiency, defined as the number of
different usability problems found by aggregating the reports of several inspectors,
and showed a similar individual efficiency, defined as the number of problems
extracted by a single inspector in relation to the time spent.
• The SUE inspection enhances the inspectors’ control over the inspection process and
their confidence in the obtained results. SI inspectors evaluated the application more
severely than HI inspectors. Although SUE inspection was perceived as a more complex technique, SI inspectors were moderately satisfied with it. Finally, they showed
greater critical ability, feeling less satisfied with their performance, as if they could
understand the limits of their inspection activity better than the HI inspectors.
The authors are confident in the validity of such results, because the evaluators
in this study were by no means influenced by the authors’ association with the SUE
inspection method. In fact, the evaluators were more familiar with Nielsen’s
heuristic inspection, being exposed to this method during the HCI course. They
learned the SUE inspection only during the training session. Further experiments
involving expert evaluators are being planned to assess whether ATs
provide greater power to experts as well.
REFERENCES
Andre, T. S., Hartson, H. R., & Williges, R. C. (1999). Expert-based usability inspections:
Developing a foundational framework and method. In Proceedings of the 2nd Annual Student’s Symposium on Human Factors of Complex Systems.
Costabile, M. F., Garzotto, F., Matera, M., & Paolini, P. (1997). SUE: A systematic usability evaluation (Tech. Rep. 19-97). Milan: Dipartimento di Elettronica e Informazione, Politecnico
di Milano.
Costabile, M. F., Garzotto, F., Matera, M., & Paolini, P. (1998). Abstract tasks and concrete
tasks for the evaluation of multimedia applications. Proceedings of the ACM CHI ’98 Workshop From Hyped-Media to Hyper-Media: Towards Theoretical Foundations of Design Use and
Evaluation, Los Angeles, April 1998. Retrieved December 1, 1999, from
http://www.eng.auburn.edu/department/cse/research/vi3rg/ws/papers.html
Costabile, M. F., & Matera, M. (1999). Evaluating WIMP interfaces through the SUE
Approach. In B. Werner (Ed.), Proceedings of IEEE ICIAP ’99—International Conference on
Image Analysis and Processing (pp. 1192–1197). Los Alamitos, CA: IEEE Computer Society.
Desurvire, H. W. (1994). Faster, cheaper! Are usability inspection methods as effective as
empirical testing? In J. Nielsen & R. L. Mack (Eds.), Usability inspection methods (pp.
173–202). New York: Wiley.
Dix, A., Finlay, J., Abowd, G., & Beale, R. (1998). Human–computer interaction (2nd ed.). London: Prentice Hall Europe.
Doubleday, A., Ryan, M., Springett, M., & Sutcliffe, A. (1997). A comparison of usability techniques for evaluating design. In S. Cole (Ed.), Proceedings of ACM DIS ’97—International
Conference on Designing Interactive Systems (pp. 101–110). Berlin: Springer-Verlag.
Fenton, N. E. (1991). Software metrics—A rigorous approach. London: Chapman & Hall.
Garzotto, F., & Matera, M. (1997). A systematic method for hypermedia usability inspection.
New Review of Hypermedia and Multimedia, 3, 39–65.
Garzotto, F., Matera, M., & Paolini, P. (1998). Model-based heuristic evaluation of
hypermedia usability. In T. Catarci, M. F. Costabile, G. Santucci, & L. Tarantino (Eds.), Proceedings of AVI ’98—International Conference on Advanced Visual Interfaces (pp. 135–145).
New York: ACM.
Garzotto, F., Matera, M., & Paolini, P. (1999). Abstract tasks: A tool for the inspection of
Web sites and off-line hypermedia. In J. Westbomke, U. K. Will, J. J. Leggett, K.
Tochterman, J. M. Haake (Eds.), Proceedings of ACM Hypertext ’99 (pp. 157–164). New
York: ACM.
Garzotto, F., Paolini, P., & Schwabe, D. (1993). HDM—A model-based approach to hypermedia
application design. ACM Transactions on Information Systems, 11(1), 1–26.
Hartson, H. R., Andre, T. S., Williges, R. C., & Van Rens, L. (1999). The user action framework: A theory-based foundation for inspection and classification of usability problems.
In H.–J. Bullinger & J. Ziegler (Eds.), Proceedings of HCI International ’99 (pp. 1058–1062).
Oxford, England: Elsevier.
International Standard Organization. (1997). Ergonomic requirements for office work with
visual display terminals (VDTs): Parts 1–17. Geneva, Switzerland: International Standard
Organization 9241.
Jeffries, R., & Desurvire, H. W. (1992). Usability testing vs. heuristic evaluation: Was there a
context? ACM SIGCHI Bulletin, 24(4), 39–41.
Jeffries, R., Miller, J., Wharton, C., & Uyeda, K. M. (1991). User interface evaluation in the real
world: A comparison of four techniques. In S. P. Robertson, G. M. Olson, & J. S. Olson
(Eds.), Proceedings of ACM CHI ’91—International Conference on Human Factors in Computing
Systems (pp. 119–124). New York: ACM.
John, B. E. (1996). Evaluating usability evaluation techniques. ACM Computing Surveys,
28(Elec. Suppl. 4).
Kantner, L., & Rosenbaum, S. (1997). Usability studies of WWW sites: Heuristic evaluation
vs. laboratory testing. In Proceedings of ACM SIGDOC ’97—International Conference on
Computer Documentation (pp. 153–160). New York: ACM.
Karat, C. M. (1994). A comparison of user interface evaluation methods. In J. Nielsen & R. L.
Mack (Eds.), Usability inspection methods (pp. 203–230). New York: Wiley.
Lim, K. H., Benbasat, I., & Todd, P. A. (1996). An experimental investigation of the interactive
effects of interface style, instructions, and task familiarity on user performance. ACM
Transactions on Computer–Human Interaction, 3(1), 1–37.
Madsen, K. H. (1999). The diversity of usability practices [Special issue]. Communications of the
ACM, 42(5).
Matera, M. (1999). SUE: A systematic methodology for evaluating hypermedia usability. Milan:
Dipartimento di Elettronica e Informazione, Politecnico di Milano.
Mondadori New Media. (1997). Camminare nella pittura [CD-ROM]. Milan: Mondadori New
Media.
Nielsen, J. (1992). Finding usability problems through heuristic evaluation. In P. Bauersfeld,
J. Benett, & G. Lynch (Eds.), Proceedings of ACM CHI ’92—International Conference on Human Factors in Computing Systems (pp. 373–380). New York: ACM.
Nielsen, J. (1993). Usability engineering. Cambridge, MA: Academic.
Nielsen, J. (1994a). Guerrilla HCI: Using discount usability engineering to penetrate the intimidation barrier. In R. G. Bias & D. J. Mayhew (Eds.), Cost-justifying usability. Cambridge,
MA: Academic. Retrieved December 1, 1999, from http://www.useit.com/papers/guerrilla_hci.html
Nielsen, J. (1994b). Heuristic evaluation. In J. Nielsen & R. L. Mack (Eds.), Usability inspection
methods (pp. 25–62). New York: Wiley.
Nielsen, J., & Landauer, T. K. (1993). A mathematical model of the finding of usability problems. In Proceedings of ACM INTERCHI ’93—International Conference on Human Factors in
Computing Systems (pp. 206–213). New York: ACM.
Nielsen, J., & Mack, R. L. (1994). Usability inspection methods. New York: Wiley.
Preece, J., Rogers, Y., Sharp, H., Benyon, D., Holland, S., & Carey, T. (1994). Human–computer
interaction. New York: Addison Wesley.
Virzi, R. A., Sorce, J. F., & Herbert, L. B. (1993). A comparison of three usability evaluation methods: Heuristic, think-aloud, and performance testing. In Proceedings of the Human Factors and
Ergonomics Society 37th Annual Meeting (pp. 309–313). Santa Monica, CA: Human Factors
and Ergonomics Society.
Whiteside, J., Bennet, J., & Holtzblatt, K. (1988). Usability engineering: Our experience and
evolution. In M. Helander (Ed.), Handbook of human–computer interaction (pp. 791–817).
Oxford, England: Elsevier Science.