INTERNATIONAL JOURNAL OF HUMAN–COMPUTER INTERACTION, 15(3), 315–335
Copyright © 2003, Lawrence Erlbaum Associates, Inc.

On the Advantages of a Systematic Inspection for Evaluating Hypermedia Usability

A. De Angeli
NCR Self-Service Advanced Technology & Research, Dundee, UK

M. Matera
M. F. Costabile
Dipartimento di Informatica, Università di Bari, Italy

F. Garzotto
P. Paolini
Dipartimento di Elettronica e Informazione, Politecnico di Milano, Italy

It is indubitable that usability inspection of complex hypermedia is still an "art," in the sense that a great deal is left to the skills, experience, and ability of the inspectors. Training inspectors is difficult and often quite expensive. The Systematic Usability Evaluation (SUE) inspection technique has been proposed to help usability inspectors share and transfer their evaluation know-how, to simplify the hypermedia inspection process for newcomers, and to achieve more effective and efficient evaluation results. SUE inspection is based on the use of evaluation patterns, called abstract tasks, which precisely describe the activities to be performed by evaluators during inspection. This article highlights the advantages of this inspection technique by presenting its empirical validation through a controlled experiment. Two groups of novice inspectors were asked to evaluate a commercial hypermedia CD-ROM by applying the SUE inspection or traditional heuristic evaluation. The comparison was based on three major dimensions: effectiveness, efficiency, and satisfaction. Results indicate a clear advantage of the SUE inspection over the traditional inspection on all dimensions, demonstrating that abstract tasks are efficient tools to drive the evaluator's performance.

The authors are immensely grateful to Prof. Rex Hartson, from Virginia Tech, for his valuable suggestions. The authors also thank Francesca Alonzo and Alessandra Di Silvestro, from the Hypermedia Open Center of Polytechnic of Milan, for the help offered during the experiment data coding. The support of the EC grant FAIRWIS project IST-1999-12641 and of MURST COFIN 2000 is acknowledged. Requests for reprints should be sent to M. F. Costabile, Dipartimento di Informatica, Università di Bari, Via Orabona, 4–70126 Bari, Italy. E-mail: [email protected]

1. INTRODUCTION

One of the goals of the human–computer interaction (HCI) discipline is to define methods for ensuring usability, which is now universally acknowledged as a significant aspect of the overall quality of interactive systems. ISO Standard 9241-11 (International Standard Organization, 1997) defines usability as "the extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency and satisfaction in a specified context of use." In this framework, effectiveness is defined as the accuracy and the completeness with which users achieve goals in particular environments. Efficiency refers to the resources expended in relation to the accuracy and completeness of the goals achieved. Satisfaction is defined as the comfort and the acceptability of the system for its users and other people affected by its use. Much attention to usability is currently paid by industry, which is recognizing the importance of adopting evaluation methods during the development cycle to verify the quality of new products before they are put on the market (Madsen, 1999). One of the main complaints of industry is, however, that cost-effective evaluation tools are still lacking.
This prevents most companies from actually performing usability evaluation, with the consequent result that a lot of software is still poorly designed and unusable. Therefore, usability inspection methods are emerging as preferred evaluation procedures, being less costly than traditional user-based evaluation. It is indubitable that usability inspection of complex applications, such as hypermedia, is still an “art,” in the sense that a great deal is left to the skills, experience, and ability of the inspectors. Moreover, training inspectors is difficult and often quite expensive. As part of an overall methodology called Systematic Usability Evaluation (SUE; Costabile, Garzotto, Matera, & Paolini, 1997), a novel inspection technique has been proposed to help usability inspectors share and transfer their evaluation know-how, make the hypermedia inspection process easier for newcomers, and achieve more effective and efficient evaluations. As described in previous articles (Costabile & Matera, 1999; Garzotto, Matera, & Paolini, 1998, 1999), the inspection proposed by SUE is based on the use of evaluation patterns, called abstract tasks, which precisely describe the activities to be performed by an evaluator during the inspection. This article presents an empirical validation of the SUE inspection technique. Two groups of novice inspectors were asked to evaluate a commercial hypermedia CD-ROM by applying the SUE inspection or the traditional heuristic technique. The comparison was based on three major dimensions: effectiveness, efficiency, and satisfaction. Results demonstrated a clear advantage of the SUE inspection over the heuristic evaluation on all dimensions, showing that abstract tasks are efficient tools to drive the evaluator’s performance. This article has the following organization. Section 2 provides the rationale for the SUE inspection by describing the current situation of usability inspection techniques. Section 3 briefly describes the main characteristics of the SUE methodology, whereas Section 4 outlines the inspection technique proposed by SUE. Section 5, the core of the article, describes the experiment that was performed to validate the SUE inspection. Finally, Section 6 presents conclusions. Evaluating Hypermedia Usability 317 2. BACKGROUND Different methods can be used to evaluate usability, among which the most common are user-based methods and inspection methods. User-based evaluation mainly consists of user testing: It assesses usability properties by observing how the system is actually used by some representative sample of real users (Dix, Finlay, Abowd, & Beale, 1993; Preece et al., 1994; Whiteside, Bennet, & Holtzblatt, 1988). Usability inspection refers to a set of methods in which expert evaluators examine usability-related aspects of an application and provide judgments based on their knowledge. Examples of inspection methods are heuristic evaluation, cognitive walk-through, guideline review, and formal usability inspection (Nielsen & Mack, 1994). User-based evaluations claim to provide, at least until now, the most reliable results, because they involve samples of real users. Such methods, however, have a number of drawbacks, such as the difficulty of properly selecting a correct sample from the user community and training participants to master the most sophisticated and advanced features of an interactive system. 
Furthermore, it is difficult and expensive to reproduce actual situations of usage (Lim, Benbasat, & Todd, 1996), and failure in creating real-life situations may lead to artificial findings rather than realistic results. Therefore, the cost and the time to set up reliable empirical testing may be excessive. In comparison to user-based evaluation, usability inspection methods are more subjective. They are strongly dependent on the inspector’s skills. Therefore, different inspectors may produce noncomparable outcomes. However, usability inspection methods “save users” (Jeffries, Miller, Wharton, & Uyeda, 1991; Nielsen & Mack, 1994) and do not require special equipment or lab facilities. In addition, experts can detect problems and possible future faults of a complex system in a limited amount of time. For all these reasons, inspection methods have been used more widely in recent years, especially in industrial environments (Nielsen, 1994a). Among usability inspection methods, heuristic evaluation (Nielsen, 1993, 1994b) is the most commonly used. With this method, a small set of experts inspect a system and evaluate its interface against a list of recognized usability principles—the heuristics. Experts in heuristic evaluation can be usability specialists, experts of the specific domain of the application to be evaluated, or (preferably) double experts, with both usability and domain experience. During the evaluation session, each evaluator goes individually through the interface at least twice. The first step is to get a feel of the flow of the interaction and the general scope of the system. The second is to focus on specific objects and functionality, evaluating their design, implementation, and so forth, against a list of well-known heuristics. Typically, such heuristics are general principles, which refer to common properties of usable systems. However, it is desirable to develop and adopt category-specific heuristics that apply to a specific class of products (Garzotto & Matera, 1997; Nielsen & Mack, 1994). The output of a heuristic evaluation session is a list of usability problems with reference to the violated heuristics. Reporting problems in relation to heuristics enables designers to easily revise the design, in accordance with what is prescribed by the guidelines pro- 318 De Angeli et al. vided by the violated principles. Once the evaluation has been completed, the findings of the different evaluators are compared to generate a report summarizing all the findings. Heuristic evaluation is a “discount usability” method (Nielsen, 1993, 1994a). In fact, some researchers have shown that it is a very efficient usability engineering method (Jeffries & Desurvire, 1992) with a high benefit–cost ratio (Nielsen, 1994a). It is especially valuable when time and resources are short, because skilled evaluators, without needing the involvement of representative users, can produce high-quality results in a limited amount of time (Kantner & Rosenbaum, 1997). This technique has, however, a number of drawbacks. As highlighted by Jeffries et al. (1991), Doubleday, Ryan, Springett, and Sutcliffe (1997), and Kantner and Rosenbaum (1997), its major disadvantage is the high dependence on the skills and experiences of the evaluators. Nielsen (1992) stated that novice evaluators with no usability expertise are poor evaluators and that usability experts are 1.8 times as good as novices. 
Moreover, application domain and usability experts (the double experts) are 2.7 times as good as novices and 1.5 times as good as usability experts (Nielsen, 1992). This means that experience with the specific category of applications being evaluated clearly improves the evaluators' performance. Unfortunately, usability specialists may lack domain expertise, and domain specialists are rarely trained or experienced in usability methodologies. To overcome this problem for hypermedia usability evaluation, the SUE inspection technique has been introduced. It uses evaluation patterns, called abstract tasks, to guide the inspectors' activity. Abstract tasks precisely describe which hypermedia objects to look for and which actions the evaluators must perform to analyze such objects. In this way, less experienced evaluators, who lack expertise in usability or hypermedia, are able to produce more complete and precise results. The SUE inspection technique also addresses a further drawback of heuristic evaluation, reported by Doubleday et al. (1997): Heuristics, as they are generally formulated, are not always able to adequately guide evaluators. To this end, the SUE inspection framework provides evaluators with a list of detailed heuristics that are specific to hypermedia, and abstract tasks provide a detailed description of the activities to be performed to detect possible violations of these hypermedia heuristics. In SUE, the overall inspection process is driven by the use of an application model, the hypermedia design model (HDM; Garzotto, Paolini, & Schwabe, 1993). The HDM concepts and primitives allow evaluators to identify precisely the hypermedia constituents that are worthy of investigation. Moreover, both the hypermedia heuristics and the abstract tasks focus on such constituents and are formulated through HDM terminology. This terminology is also used by evaluators when reporting problems, thus avoiding the generation of incomprehensible and vague inspection reports. Recent articles (Andre, Hartson, & Williges, 1999; Hartson, Andre, Williges, & Van Rens, 1999) have highlighted the need for more focused usability inspection methods and for a classification of usability problems that supports the production of inspection reports that are easy to read and compare. These authors have defined the user action framework (UAF), a unifying and organizing environment that supports design guidelines, usability inspection, classification, and reporting of usability problems. The UAF provides a knowledge base in which different usability problems are organized, taking into account how users are affected by the design during the interaction, at the various points where they must accomplish cognitive or physical actions. The classification of design problems and usability concepts is a way to capitalize on past evaluation experiences. It allows evaluators to better understand the design problems they encounter during the inspection and helps them identify precisely which physical or cognitive aspects cause problems. Evaluators are therefore able to propose well-focused redesign solutions. The motivations behind this research are similar to ours: Reusing past evaluation experience and making it available to less experienced people is a basic goal of the authors, pursued through the use of abstract tasks, whose formulation reflects the experience of some skilled evaluators.
Unlike the UAF, rather than recording problems, abstract tasks offer a way to keep track of the activities to be performed to discover problems. 3. THE SUE METHODOLOGY SUE is a methodology for evaluating the usability of interactive systems, which prescribes a structured flow of activities (Costabile et al., 1997). SUE has been largely specialized for hypermedia (Costabile, Garzotto, Matera, & Paolini, 1998; Garzotto et al., 1998, 1999), but this methodology easily can be exploited to evaluate the usability of any interactive application (Costabile & Matera, 1999). A core idea of SUE is that the most reliable evaluation can be achieved by systematically combining inspection with user-based evaluation. In fact, several studies have outlined how two such methods are complementary and can be effectively coupled to obtain a reliable evaluation process (Desurvire, 1994; Karat, 1994; Virzi, Sorce, & Herbert, 1993). The inspection proposed by SUE, based on the use of abstract tasks (in the following SUE inspection), is carried out first. Then, if inspection results are not sufficient to predict the impact of some critical situations, user-based evaluation also is conducted. Because SUE is driven by the inspection outcome, the user-based evaluation tends to be better focused and more cost-effective. Another basic assumption of SUE is that, to be reliable, usability evaluation should encompass a variety of dimensions of a system. Some of these dimensions may refer to general layout features common to all interactive systems, whereas others may be more specific for the design of a particular product category or a particular domain of use. For each dimension, the evaluation process consists of a preparatory phase and an execution phase. The preparatory phase is performed only once for each dimension, and its purpose is to create a conceptual framework that will be used to carry out actual evaluations. As better explained in the next section, such a framework includes a design model, a set of usability attributes, and a library of abstract tasks. Because the activities in the preparatory phase may require extensive use of resources, they should be regarded as a long-term investment. The execution phase is performed every time a specific application must be evaluated. It mainly consists of an inspection, performed by ex- 320 De Angeli et al. pert evaluators. If needed, inspection can be followed by sessions of user testing, involving real users. 4. THE SUE INSPECTION TECHNIQUE FOR HYPERMEDIA The SUE inspection is based on the use of an application design model for describing the application, a set of usability attributes to be verified during the evaluation, and a set of abstract tasks (ATs) to be applied during the inspection phase. The term model is used in a broad sense, meaning a set of concepts, representation structures, design principles, primitives, and terms, which can be used to build a description of an application. The model helps organize concepts, so identifying and describing, in a nonambiguous way, the components of the application that constitute the entities of the evaluation (Fenton, 1991). For the evaluation of hypermedia, the authors have adopted HDM (Garzotto et al., 1993), which focuses on structural and navigation properties as well as on active media features. Usability attributes are obtained by decomposing general usability principles into finer grained criteria that can be better analyzed. 
In accordance with the suggestion by Nielsen and Mack (1994) to develop category-specific heuristics, the authors have defined a set of usability attributes, able to capture the peculiar features of hypermedia (Garzotto & Matera, 1997; Garzotto et al., 1998). Such hypermedia usability attributes correspond with Nielsen’s 10 heuristics (Nielsen, 1993). The hypermedia usability attributes, in fact, can be considered a specialization for hypermedia of Nielsen’s heuristics, with the only exception of “good error messages” and “help and documentation,” which do not need to be further specialized. ATs are evaluation patterns that provide a detailed description of the activities to be performed by expert evaluators during inspection (Garzotto et al., 1998, 1999). They are formulated precisely by following a pattern template, which provides a consistent format including the following items: • AT classification code and title univocally identify the AT and succinctly convey its essence. • Focus of action briefly describes the context, or focus, of the AT by listing the application constituents that correspond to the evaluation entities. • Intent describes the problem addressed by the AT and its rationale, trying to make clear which is the specific goal to be achieved through the AT application. • Activity description is a detailed description of the activities to be performed when applying the AT. • Output describes the output of the fragment of the inspection the AT refers to. Optionally, a comment is provided, with the aim of indicating further ATs to be applied in combination or highlighting related usability attributes. Evaluating Hypermedia Usability 321 A further advantage of the use of a model is that it provides the terminology for formulating the ATs. The 40 ATs defined for hypermedia (Matera, 1999) have been formulated by using the HDM vocabulary. Two examples are reported in Table 1. The two ATs focus on active slots.1 The list of ATs provides systematic guidance on how to inspect a hypermedia application. Most evaluators are very good at analyzing only certain features of interactive applications; often they neglect some other features, strictly dependent on the specific application category. Exploiting a set of ATs ready for use allows evaluators with no experience in hypermedia to come up with good results. During inspection, evaluators analyze the application and specify a viable HDM schema, when it is not already available, for describing the application. During this activity, different application components (i.e., the objects of the evaluation) are identified. Then, having in mind the usability criteria, evaluators apply the ATs and produce a report that describes the discovered problems. Evaluators use the terminology provided by the model to refer to objects and describe critical situations while reporting problems, thus attaining precision in their final evaluation report. 5. THE VALIDATION EXPERIMENT To validate the SUE inspection technique, a comparison study was conducted involving a group of senior students of an HCI class at the University of Bari, Italy. The aim of the experiment was to compare the performance of evaluators carrying out the SUE inspection (SI), based on the use of ATs, with the performance of evaluators carrying out heuristic inspection (HI), based on the use of heuristics only. As better explained in Section 5.2, the validation metrics were defined along three major dimensions: effectiveness, efficiency, and user satisfaction. 
Such dimensions actually correspond to the principal usability factors defined by ISO Standard 9241-11 (International Standard Organization, 1997). Therefore, the experiment also allowed the authors to assess the usability of the inspection technique itself (John, 1996). In the defined metrics, effectiveness refers to the completeness and accuracy with which inspectors performed the evaluation. Efficiency refers to the time expended in relation to the effectiveness of the evaluation. Satisfaction refers to a number of subjective parameters, such as perceived usefulness, difficulty, acceptability, and confidence with respect to the evaluation technique. For each dimension, a specific hypothesis was tested.

• Effectiveness hypothesis. As a general hypothesis, SI was predicted to increase evaluation effectiveness compared with HI. The advantage is related to two factors: (a) the systematic nature of the SI technique, deriving from the use of the HDM model to precisely identify the application constituents, and (b) the use of ATs, which suggest the activity to be conducted on such objects. Because the ATs directly address hypermedia applications, this prediction should also be weighted with respect to the nature of the problems detected by evaluators. The hypermedia specialization of SI could constitute both the method's advantage and its limit. Indeed, although it could be particularly effective with respect to hypermedia-specific problems, it could neglect other flaws related to presentation and content. In other words, the limit of ATs could be that they take evaluators away from defects not specifically addressed by the AT activity.

• Efficiency hypothesis. A limit of SI could be that a rigorous application of several ATs is time consuming. However, SI was not expected to compromise inspection efficiency compared with the less structured HI technique. Indeed, the expected higher effectiveness of the SI technique should compensate for the greater time demand required by its application.

• Satisfaction hypothesis. Although SI was expected to be perceived as a more complex technique than HI, it was hypothesized that SI would enhance the evaluators' control over the inspection process and their confidence in the obtained results.

Footnote 1: In the HDM terminology, a slot is an atomic piece of information, such as text, picture, video, and sound.

Table 1: Two Abstract Tasks (ATs) From the Library of Hypermedia ATs

AS 1: Control on Active Slots
Focus of action: an active slot.
Intent: to evaluate the control provided over the active slot, in terms of the following:
A. Mechanisms for the control of the active slot
B. Mechanisms supporting the state visibility (i.e., the identification of any intermediate state of the slot activation)
Activity description: given an active slot:
A. Execute commands such as play, suspend, continue, stop, replay, get to an intermediate state, and so forth
B. At a given instant, during the activation of the active slot, verify if it is possible to identify its current state as well as its evolution up to the end
Output:
A. A list and a short description of the set of control commands and of the mechanisms supporting the state visibility
B. A statement saying if the following are true:
• The type and number of commands are appropriate, in accordance with the intrinsic nature of the active slot
• Besides the available commands, some further commands would make the active slot control more effective
• The mechanisms supporting the state visibility are evident and effective

AS 6: Navigational Behavior of Active Slots
Focus of action: an active slot + links.
Intent: to evaluate the cross effects of navigation on the behavior of active slots.
Activity description: consider an active slot:
A. Activate it, and then follow one or more links while the slot is still active; return to the "original" node where the slot has been activated and verify the actual slot state
B. Activate the active slot; suspend it; follow one or more links; return to the original node where the slot has been suspended and verify the actual slot state
C. Execute activities A and B traversing different types of links, both to leave the original node and to return to it
D. Execute activities A and B by using only backtracking to return to the original node
Output:
A. A description of the behavior of the active slot when its activation is followed by the execution of navigational links and, possibly, backtracking
B. A list and a short description of possible unpredictable effects or semantic conflicts in the source or in the destination node, together with the indication of the type and nature of the link that has generated the problem

Note. AS = active slot.

5.1. Method

In this section, the experimental method adopted to test the effectiveness, efficiency, and user satisfaction hypotheses is described.

Participants. Twenty-eight senior students from the University of Bari participated in the experiment as part of their credit for an HCI course. Their curriculum comprised training and hands-on experience in Nielsen's heuristic evaluation method, which they had applied to paper prototypes, computer-based prototypes, and hypermedia CD-ROMs. During lectures they had also been exposed to the HDM model.

Design. The inspection technique was manipulated between participants. Half of the sample was randomly assigned to the HI condition and the other half to the SI condition.

Procedure. A week before the experiment, participants were introduced to the conceptual tools to be used during the inspection. The training session lasted 2 hr and 30 min for the HI group and 3 hr for the SI group. The discrepancy was due to the different conceptual tools used during the evaluation by the two groups, as explained in the following. A preliminary 2-hr seminar briefly reviewed HDM and introduced all the participants to the hypermedia-specific heuristics defined by SUE. Particular attention was devoted to informing students without influencing their expectations and attitudes toward the two different inspection techniques.

A couple of days later, all participants were presented with a short demonstration of the application, lasting about 15 min. A few summary indications about the application content and the main functions were introduced, without providing too many details. In this way, participants, having limited time at their disposal, did not start their usability analysis from scratch but had an idea (although vague) of how to get oriented in the application. Then, participants assigned to the SI group were briefly introduced to the HDM schema of the application and to the key concepts of applying ATs. In the proposed application schema, only the main application components were introduced (i.e., the structure of entity types and application links for the hyperbase, and collection structure and navigation for the access layer), without revealing any detail that could indicate usability problems.

Table 2: The List of Abstract Tasks (ATs) Submitted to Inspectors (AT classification code: AT title)
AS-1: Control on active slots
AS-6: Navigational behavior of active slots
PS-1: Control on passive slots
HB-N1: Complexity of structural navigation patterns
HB-N4: Complexity of applicative navigation patterns
AL-S1: Coverage power of access structures
AL-N1: Complexity of collection navigation patterns
AL-N3: Bottom-up navigation in index hierarchies
AL-N11: History collection structure and navigation
AL-N13: Exit mechanisms availability

The experimental session lasted 3 hr. Participants had to inspect the CD-ROM, applying the technique to which they had been assigned. All participants were provided with a list of 10 SUE heuristics, summarizing the usability guidelines for hypermedia (Garzotto & Matera, 1997; Garzotto et al., 1998). The SI group also was provided with the HDM application schema and with the 10 ATs to be applied during the inspection (see Table 2). The limited number of ATs was attributable to the limited amount of time participants had at their disposal. The most basic ATs were selected, which could guide SI inspectors in the analysis of the main application constituents; for example, ATs addressing advanced hypermedia features were disregarded.

Footnote 2: By providing both groups with the same heuristic list, the authors were able to measure the possible added value of the systematic inspection induced by SUE with respect to the subjective application of heuristics.

Working individually, participants had to find the maximum number of usability problems in the application and record them in a report booklet, which differed according to the experimental condition. In the HI group, the booklet included 10 forms, one for each of the hypermedia heuristics. The forms required information about the application point where the heuristic was violated and a short description of the problem. The SI group was provided with a report booklet including 10 forms, each one corresponding to an AT. Again, the forms required information about the violations detected through that AT and where they occurred. Examples of the forms included in the report booklets provided to the two groups are shown in Figures 1 and 2.

FIGURE 1. An example of a table in the report form provided to the HI group.
FIGURE 2. A page from the report booklet given to the SI group, containing one AT and the corresponding form for reporting problems.

At the end of the evaluation, participants were invited to fill in the evaluator-satisfaction questionnaire, which combined several item formats to measure three main dimensions: user satisfaction with the evaluated application, evaluator satisfaction with the inspection technique, and evaluator satisfaction with the results achieved. The psychometric instrument was organized in two parts. The first was concerned with the application, and the second included the questions about the adopted evaluation technique. Two final questions asked participants to specify how satisfied they felt about their performance as evaluators.

The application. The evaluated application was the Italian CD-ROM "Camminare nella pittura," which means "walking through painting" (Mondadori New Media, 1997). It was composed of two CD-ROMs, each one presenting the analysis of painting and some relevant artworks in two periods.
The first CD-ROM (CD1 in the following) covered the period from Cimabue to Leonardo; the second one covered the period from Bosch to Cézanne. The two CD-ROMs were identical in structure, and each one could be used independently of the other. Each CD-ROM was a distinct and "complete" application of limited size, particularly suitable for being exhaustively analyzed in a limited amount of time. Therefore, only CD1 was submitted to participants. The limited number of navigation nodes in CD1 simplified the postexperimental analysis of the paths followed by the evaluators during the inspection and the identification of the application points where they highlighted problems.

Data coding. The report booklets were analyzed by three expert hypermedia designers (expert evaluators, in the following) with a strong HCI background, to assess the effectiveness and efficiency of the applied evaluation technique. All reported measures had a reliability value of at least .85. Evaluator satisfaction was measured by analyzing the self-administered postexperimental questionnaires.

All the statements written in the report booklets were scored as problems or nonproblems. Problems are actual usability flaws that could affect user performance. Nonproblems include (a) observations reflecting only evaluators' personal preferences but not real usability bugs; (b) evaluation errors, reflecting evaluators' misjudgments or system defects due to a particular hardware configuration; and (c) statements that are not understandable (i.e., not clearly reported). For each statement scored as a problem or as a nonproblem of type (a), a severity rating was performed. As suggested by Nielsen (1994b), severity was estimated considering three factors: the frequency of the problem, the impact of the problem on the user, and the persistence of the problem during interaction. The rating was expressed on a Likert scale, ranging from 1 (I don't agree that this is a usability problem at all) to 5 (usability catastrophe).

Each problem was further classified in one of the following dimensions, according to the nature of the problem itself:
• Navigation, which includes problems related to the task of moving within the hyperspace. It refers to the appropriateness of mechanisms for accessing information and for getting oriented in the hyperspace.
• Active media control, which includes problems related to the interaction with dynamic multimedia objects, such as video, animation, and audio comments. It refers to the appropriateness of mechanisms for controlling the dynamic behavior of media and of mechanisms providing feedback about the current state of the media activation.
• Interaction with widgets, which includes problems related to the interaction with the widgets of the visual interface, such as buttons of various types, icons, and scrollbars. It includes problems related to the appropriateness of mechanisms for manipulating widgets and to their self-evidence.
Note that navigation and active media control are dimensions specific to hypermedia systems.
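For readers who want to reproduce the coding step, the scheme above maps naturally onto a small structured record. The sketch below is only an illustration of that mapping; the Statement class, its field names, and the enumerations are our own and are not part of the SUE materials or of the original coding sheets.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Kind(Enum):
    PROBLEM = "problem"                   # actual usability flaw
    PERSONAL_PREFERENCE = "nonproblem_a"  # evaluator's taste, not a real bug
    EVALUATION_ERROR = "nonproblem_b"     # misjudgment or hardware-specific defect
    NOT_UNDERSTANDABLE = "nonproblem_c"   # statement not clearly reported

class Dimension(Enum):
    NAVIGATION = "navigation"
    ACTIVE_MEDIA_CONTROL = "active media control"
    INTERACTION_WITH_WIDGETS = "interaction with widgets"

@dataclass
class Statement:
    """One statement from a report booklet, as coded by the expert judges."""
    inspector_id: int
    description: str
    kind: Kind
    # Severity (1-5 Likert) is assigned to problems and type (a) nonproblems,
    # weighing frequency, impact, and persistence.
    severity: Optional[int] = None
    # The problem dimension applies to actual problems only.
    dimension: Optional[Dimension] = None
```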
5.2. Results

The total number of problems detected in the application was 38. Of these, 29 had been discovered by the expert evaluators through an inspection carried out before the experiment; the remaining 9 were identified only by the experimental inspectors. During the experiment, inspectors reported a total of 36 different types of problems. They also reported 25 different types of nonproblems of type (a) and (b), and four inspectors reported at least one nonunderstandable statement (i.e., a nonproblem of type [c]). The results of the psychometric analysis are reported in the following paragraphs with reference to the three experimental hypotheses.

Effectiveness. Effectiveness can be decomposed into the completeness and the accuracy with which inspectors performed the evaluation. Completeness corresponds to the percentage of problems detected by a single inspector out of the total number of problems. For the ith inspector, it is computed as

Completeness_i = (P_i / n) × 100

where P_i is the number of problems found by the ith inspector, and n is the total number of problems existing in the application (n = 38). On average, inspectors in the SI group individually found 24% of all the usability defects (SEM = 1.88); inspectors in the HI group found 19% (SEM = 1.99). As shown by a Mann–Whitney U test, the difference is statistically significant (U = 50.5, N = 28, p < .05). It follows that the SI technique enhances evaluation completeness, allowing individual evaluators to discover a greater number of usability problems.

Accuracy can be defined by two indexes: precision and severity. Precision is given by the percentage of problems detected by a single inspector out of the total number of statements he or she reported. For the ith inspector, precision is computed as

Precision_i = (P_i / S_i) × 100

where P_i is the number of problems found by the ith inspector, and S_i is the total number of statements he or she reported (including nonproblems).
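A minimal sketch of the two indexes just defined, written as plain Python functions; the function names and the sample figures are illustrative, and only n = 38 comes from the experiment.

```python
def completeness(problems_found: int, total_problems: int = 38) -> float:
    """Percentage of the application's problems found by one inspector."""
    return problems_found / total_problems * 100

def precision(problems_found: int, statements_reported: int) -> float:
    """Percentage of an inspector's statements that are actual problems."""
    return problems_found / statements_reported * 100

# Hypothetical inspector who reported 12 statements, 9 of which were real problems:
print(round(completeness(9), 1))   # 23.7 -> close to the SI group mean of 24%
print(round(precision(9, 12), 1))  # 75.0
```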
In general, the distribution of precision shows a severe negative skew, with 50% of participants not committing any errors. The variable ranges from 40 to 100, with a median value of 96. In the SI group, most inspectors were totally accurate (precision = 100), with the exception of two who were slightly inaccurate (precision > 80). On the other hand, only two participants in the HI condition were totally accurate; the mean value for the HI group was 77.4 (SEM = 4.23), and the median value was 77.5. Moreover, four evaluators in the HI group reported at least one nonunderstandable statement, whereas all the statements reported by the SI group were clearly expressed and referred to application objects using a comprehensible and consistent terminology.

This general trend reflecting an advantage of SI over HI was also supported by the analysis of the severity index, which refers to the average rating of all scored statements for each participant. A t-test analysis demonstrated that the mean rating of the two groups differed significantly, t(26) = –3.92, p < .001 (two-tailed). Problems detected applying the ATs were scored as more serious than those detected when only the heuristics were available (means and standard errors are reported in Table 3).

Table 3: Means and Standard Errors for the Analysis of Severity
Severity index: HI 3.66 (0.12); SI 4.22 (0.08)
Note. HI = heuristic inspection; SI = Systematic Usability Evaluation (SUE) inspection. Standard errors are in parentheses.

The effectiveness hypothesis also states that the SUE inspection technique could be particularly effective for detecting hypermedia-specific problems, whereas it could neglect other bugs related to graphical user interface widgets. To test this aspect, the distribution of problem types was analyzed as a function of the experimental conditions. As can be seen in Figure 3, the most common problems detected by all the evaluators concerned navigation, followed by defects related to active media control. Only a minority of problems regarded interaction with widgets. In general, it is evident that the SI inspectors found more problems. However, this superiority especially emerges for hypermedia-related defects (navigation and active media control), t(26) = –2.70, p < .05 (two-tailed). A slightly higher average number of "interaction with widgets" problems was found by the HI group than by the SI group, but a Mann–Whitney U test comparing the percentage of problems in the two experimental conditions indicated that this difference was not significant (U = 67, N = 28, p = .16). This means that, unlike what was hypothesized, the systematic inspection activity suggested by the ATs does not take evaluators away from problems not covered by the activity descriptions. Because the problems found by the SI group in the "interaction with widgets" category were those having the highest severity, it also can be assumed that the hypermedia ATs do not prevent evaluators from noticing usability catastrophes related to presentation aspects. Also, supplying evaluators with ATs focusing on presentation aspects, such as those presented by Costabile and Matera (1999), may allow a deeper analysis of the graphical user interface, with the result that SI evaluators would find a greater number of "interaction with widgets" problems.

FIGURE 3. Average number of problems as a function of experimental conditions and problem categories.

Efficiency. Efficiency was considered both at the individual and at the group level. Individual efficiency refers to the number of problems extracted by a single inspector in relation to the time spent. For the ith inspector, it is computed as

Ind_Efficiency_i = P_i / t_i

where P_i is the number of problems detected by the ith inspector, and t_i is the time he or she spent finding them. On average, SI inspectors found 4.5 problems in 1 hr of inspection, versus the 3.6 problems per hour found by the HI inspectors. A t test on the variable normalized by a square root transformation showed that this difference was not significant, t(26) = –1.44, p = .16 (two-tailed). Such a result further supports the efficiency hypothesis, because the application of the ATs did not compromise efficiency compared with a less structured evaluation technique; rather, SI showed a positive tendency toward finding a greater number of problems per hour.

Group efficiency refers to the evaluation results achieved by aggregating the performance of several inspectors. Toward this end, Nielsen's cost–benefit curve, relating the proportion of usability problems found to the number of evaluators, was computed (Nielsen, 1994b). This curve derives from a mathematical model based on the prediction formula for the number of usability problems found in a heuristic evaluation (Nielsen, 1992):

Found(i) = n(1 - (1 - λ)^i)

where Found(i) is the number of problems found by aggregating the reports of i independent evaluators, n is the total number of problems in the application, and λ is the probability of finding the average usability problem when using a single average evaluator. As suggested by Nielsen and Landauer (1993), one possible use of this model is estimating the number of inspectors needed to identify a given percentage of usability errors.
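As an illustration of this use, the sketch below evaluates the prediction formula for increasing numbers of evaluators, plugging in the experiment's values reported in the next paragraph (n = 38, λ_HI = 0.19, λ_SI = 0.24). The script is our own reconstruction, not code from the original study.

```python
def found(i: int, n: int, lam: float) -> float:
    """Expected number of problems found by aggregating the reports of i evaluators."""
    return n * (1 - (1 - lam) ** i)

n = 38
for label, lam in (("HI", 0.19), ("SI", 0.24)):
    proportions = [found(i, n, lam) / n for i in range(1, 11)]
    # Smallest group size whose expected yield reaches (about) 75% of the problems;
    # the 0.745 cut-off absorbs the rounding of the published lambda values.
    needed = next(i for i, p in enumerate(proportions, start=1) if p >= 0.745)
    print(label, " ".join(f"{p:.2f}" for p in proportions), f"-> ~75% with {needed} evaluators")
```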
This model was therefore used to determine how many inspectors each technique would require to detect a reasonable percentage of the problems in the application. The curves calculated for the two techniques are reported in Figure 4 (n = 38, λ_HI = 0.19, λ_SI = 0.24). As shown in the figure, SI tends to reach a given performance level with fewer evaluators. If Nielsen's 75% threshold is assumed, SI reaches this level with five evaluators, whereas the HI technique would require seven.

FIGURE 4. The cost–benefit curve (Nielsen & Landauer, 1993) computed for the two techniques, HI (heuristic inspection) and SI (SUE inspection). Each curve shows the proportion of usability problems found by each technique when different numbers of evaluators are used.

Satisfaction. With respect to an evaluation technique, satisfaction refers to several parameters, such as the perceived usefulness, difficulty, and acceptability of applying the method. The postexperimental questionnaire addressed three main dimensions: user satisfaction with the application evaluated, evaluator satisfaction with the inspection technique, and evaluator satisfaction with the results achieved. At first sight it may appear that the first dimension, addressing evaluators' satisfaction with the application, is outside the scope of the experiment, the main intent of which was to compare two inspection techniques. However, the authors wanted to verify how the technique used may have influenced inspector severity.

User satisfaction with the application evaluated was assessed through a semantic-differential scale that required inspectors to judge the application on 11 pairs of adjectives describing satisfaction with information systems. Inspectors could modulate their evaluation on 7 points (after recoding of reversed items, where 1 = very negative and 7 = positive). The initial reliability of the satisfaction scale was moderately satisfactory (α = .74), with three items (reliable–unreliable, amusing–boring, difficult–simple) presenting a corrected item-total correlation below .30. Therefore, the user-satisfaction index was computed by averaging the scores on the remaining eight items (α = .79). The index was then analyzed by a t test. Results showed a significant effect of the inspection group, t(26) = 2.38, p < .05 (two-tailed). On average, the SI inspectors evaluated the application more severely (M = 4.37, SEM = .23) than the HI inspectors (M = 5.13, SEM = .22). From this difference, it can be inferred that ATs provide evaluators with a more effective framework for weighing the limits and benefits of the application. This interpretation is supported by the significant correlation between the number of usability problems found by an evaluator and his or her satisfaction with the application (r = –.42, p < .05): The more problems were found, the less positive was the evaluation.

Evaluator satisfaction with the inspection technique was assessed by 11 pairs of adjectives, again modulated on 7 points. The original reliability value was .72, increasing to .75 after deletion of three items (tiring–restful, complex–simple, satisfying–unsatisfying). The evaluator-satisfaction index was then computed by averaging the scores on the remaining eight items. The index is highly correlated with a direct item assessing the learnability of the inspection technique (r = .53, p < .001): The easier a technique is perceived to be, the better it is evaluated.
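Both satisfaction indexes were built with the same recipe: recode reversed items, check reliability, drop items with weak corrected item-total correlations, and average the rest. The sketch below illustrates that recipe under the assumption that responses are held in a participants × items NumPy array; the function names and layout are ours, since the raw questionnaire data are not published.

```python
import numpy as np
from typing import List

def cronbach_alpha(items: np.ndarray) -> float:
    """Scale reliability for a participants x items score matrix."""
    k = items.shape[1]
    item_var = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)

def corrected_item_total(items: np.ndarray) -> np.ndarray:
    """Correlation of each item with the sum of the remaining items."""
    r = []
    for j in range(items.shape[1]):
        rest = np.delete(items, j, axis=1).sum(axis=1)
        r.append(np.corrcoef(items[:, j], rest)[0, 1])
    return np.array(r)

def satisfaction_index(raw: np.ndarray, reversed_items: List[int],
                       scale_max: int = 7, min_r: float = 0.30) -> np.ndarray:
    """Reverse-code, drop weakly correlated items, and average the rest per participant."""
    scores = raw.astype(float).copy()
    scores[:, reversed_items] = scale_max + 1 - scores[:, reversed_items]
    keep = corrected_item_total(scores) >= min_r
    return scores[:, keep].mean(axis=1)  # one index value per participant
```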
A t test showed no significant difference in satisfaction with the inspection technique, t(26) = 1.19, p = .25 (two-tailed). On average, evaluations were moderately positive for both techniques, with a mean difference of .32 slightly favoring the HI group. To conclude, despite being objectively more demanding, SI was not evaluated worse than HI.

Evaluator satisfaction with the results achieved was assessed directly, by a Likert-type item asking participants to express their gratification on a 4-point scale (from not at all to very much), and indirectly, by an estimate of the percentage of problems found. The two variables were highly correlated (r = .57, p < .01): The more problems an inspector thought he or she had found, the more satisfied he or she was with his or her performance. Consequently, the final satisfaction index was computed by multiplying the two scores. A Mann–Whitney U test showed a tendency toward a difference in favor of the HI group (U = 54.5, p = .07): Participants in the HI group felt more satisfied about their performance than those in the SI group. Considering this finding in the general framework of the experiment, it appears that ATs provide participants with higher critical abilities than heuristics do. Indeed, despite the greater effectiveness achieved by participants in the SI group, they were less satisfied with their performance, as if they could better understand the limits of an individual evaluation.

Summary. Table 4 summarizes the experimental results presented in the previous paragraphs. The advantage of the systematic approach adopted by the evaluators assigned to the SI condition is evident. The implications of these findings are discussed in the concluding section.

Table 4: Summary of the Experimental Results
Effectiveness
Completeness: HI –, SI +
Accuracy (precision): HI –, SI +
Accuracy (severity): HI –, SI +
Efficiency
Individual efficiency: HI =, SI =
Group efficiency: HI –, SI +
Satisfaction
User satisfaction with the application evaluated: HI <, SI >
Evaluator satisfaction with the inspection technique: HI =, SI =
Evaluator satisfaction with the achieved results: HI <, SI >
Note. HI = heuristic inspection; SI = Systematic Usability Evaluation (SUE) inspection; – = worse performance; + = better performance; = = equal performance; < = lower critical ability; > = higher critical ability.

6. CONCLUSIONS

In the last decade, several techniques for evaluating the usability of software systems have been proposed. Unfortunately, research in HCI has not devoted sufficient effort to validating such techniques, and therefore some questions persist (John, 1996). The study reported in this article provides some answers about the effectiveness, efficiency, and satisfaction of the SUE inspection technique. The experiment seems to confirm the general hypothesis of a sharp increase in the overall quality of inspection when ATs are used. More specifically, the following may be concluded:
• The SUE inspection increases evaluation effectiveness. The SI group showed greater completeness and precision in reporting problems and also identified more severe problems.
• Although more rigorous and structured, the SUE inspection does not compromise inspection efficiency. Rather, it enhanced group efficiency, defined as the number of different usability problems found by aggregating the reports of several inspectors, and showed a similar individual efficiency, defined as the number of problems extracted by a single inspector in relation to the time spent.
• The SUE inspection enhances the inspectors’ control over the inspection process and their confidence on the obtained results. SI inspectors evaluated the application more severely than HI inspectors. Although SUE inspection was perceived as a more complex technique, SI inspectors were moderately satisfied with it. Finally, they showed a major critical ability, feeling less satisfied with their performance, as if they could understand the limits of their inspection activity better than the HI inspectors. The authors are confident in the validity of such results, because the evaluators in this study were by no means influenced by the authors’ association with the SUE inspection method. Actually, the evaluators were more familiar with Nielsen’s heuristic inspection, being exposed to this method during the HCI course. They learned the SUE inspection only during the training session. Further experiments are being planned involving expert evaluators to further evaluate whether ATs provide greater power to experts as well. REFERENCES Andre, T. S., Hartson, H. R., & Williges, R. C. (1999). Expert-based usability inspections: Developing a foundational framework and method. In Proceedings of the 2nd Annual Student’s Symposium on Human Factors of Complex Systems. Costabile, M. F., Garzotto, F., Matera, M., & Paolini, P. (1997). SUE: A systematic usability evaluation (Tech. Rep. 19-97). Milan: Dipartimento di Elettronica e Informazione, Politecnico di Milano. Costabile, M. F., Garzotto, F., Matera, M., & Paolini, P. (1998). Abstract tasks and concrete tasks for the evaluation of multimedia applications. Proceedings of the ACM CHI ’98 Workshop From Hyped-Media to Hyper-Media: Towards Theoretical Foundations of Design Use and Evaluation, Los Angeles, April 1998. Retrieved December 1, 1999, from http://www.eng.auburn.edu/department/cse/research/vi3rg/ws/papers.html Costabile, M. F., & Matera, M. (1999). Evaluating WIMP interfaces through the SUE Approach. In B. Werner (Ed.), Proceedings of IEEE ICIAP ’99—International Conference on Image Analysis and Processing (pp. 1192–1197). Los Alamitos, CA: IEEE Computer Society. Desurvire, H. W. (1994). Faster, cheaper! Are usability inspection methods as effective as empirical testing? In J. Nielsen & R. L. Mack (Eds.), Usability inspection methods (pp. 173–202). New York: Wiley. Dix, A., Finlay, J., Abowd, G., & Beale, R. (1998). Human–computer interaction (2nd ed.). London: Prentice Hall Europe. Doubleday, A., Ryan, M., Springett, M., & Sutcliffe, A. (1997). A comparison of usability techniques for evaluating design. In S. Cole (Ed.), Proceedings of ACM DIS ’97—International Conference on Designing Interactive Systems (pp. 101–110). Berlin: Springer-Verlag. Fenton, N. E. (1991). Software metrics—A rigorous approach. London: Chapman & Hall. Garzotto, F., & Matera, M. (1997). A systematic method for hypermedia usability inspection. New Review of Hypermedia and Multimedia, 3, 39–65. 334 De Angeli et al. Garzotto, F., Matera, M., & Paolini, P. (1998). Model-based heuristic evaluation of hypermedia usability. In T. Catarci, M. F. Costabile, G. Santucci, & L. Tarantino (Eds.), Proceedings of AVI ’98—International Conference on Advanced Visual Interfaces (pp. 135–145). New York: ACM. Garzotto, F., Matera, M., & Paolini, P. (1999). Abstract tasks: A tool for the inspection of Web sites and off-line hypermedia. In J. Westbomke, U. K. Will, J. J. Leggett, K. Tochterman, J. M. Haake (Eds.), Proceedings of ACM Hypertext ’99 (pp. 157–164). New York: ACM. 
Garzotto, F., Paolini, P., & Schwabe, D. (1993). HDM—Amodel based approach to hypermedia application design. ACM Transactions on Information Systems, 11(1), 1–26. Hartson, H. R., Andre, T. S., Williges, R. C., & Van Rens, L. (1999). The user action framework: A theory-based foundation for inspection and classification of usability problems. In H.–J. Bullinger & J. Ziegler (Eds.), Proceedings of HCI International ’99 (pp. 1058–1062). Oxford, England: Elsevier. International Standard Organization. (1997). Ergonomics requirements for office work with visual display terminal (VDT): Parts 1–17. Geneva, Switzerland: International Standard Organization 9241. Jeffries, R., & Desurvire, H. W. (1992). Usability testing vs. heuristic evaluation: Was there a context? ACM SIGCHI Bulletin, 24(4), 39–41. Jeffries, R., Miller, J., Wharton, C., & Uyeda, K. M. (1991). User interface evaluation in the real word: A comparison of four techniques. In S. P. Robertson, G. M. Olson, & J. S. Olson (Ed.), Proceedings of ACM CHI ’91—International Conference on Human Factors in Computing Systems (pp. 119–124). New York: ACM. John, B. E. (1996). Evaluating usability evaluation techniques. ACM Computing Surveys, 28(Elec. Suppl. 4). Kantner, L., & Rosenbaum, S. (1997). Usability studies of WWW sites: Heuristic evaluation vs. laboratory testing. In Proceedings of ACM SIGDOC ’97—International Conference on Computer Documentation (pp. 153–160). New York: ACM. Karat, C. M. (1994). A comparison of user interface evaluation methods. In J. Nielsen & R. L. Mack (Eds.), Usability inspection methods (pp. 203–230). New York: Wiley. Lim, K. H., Benbasat, I., & Todd, P. A. (1996). An experimental investigation of the interactive effects of interface style, instructions, and task familiarity on user performance. ACM Transactions on Computer–Human Interaction, 3(1), 1–37. Madsen, K. H. (1999). The diversity of usability practices [Special issue]. Communication of ACM, 42(5). Matera, M. (1999). SUE: A systematic methodology for evaluating hypermedia usability. Milan: Dipartimento di Elettronica e Informazione, Politecnico di Milano. Mondadori New Media. (1997). Camminare nella pittura [CD-ROM]. Milan: Mondadori New Media. Nielsen, J. (1992). Finding usability problems through heuristic evaluation. In P. Bauersfeld, J. Benett, & G. Lynch (Eds.), Proceedings of ACM CHI ’92—International Conference on Human Factors in Computing Systems (pp. 373–380). New York: ACM. Nielsen, J. (1993). Usability engineering. Cambridge, MA: Academic. Nielsen, J. (1994a). Guerrilla HCI: Using discount usability engineering to penetrate intimidation barrier. In R. G. Bias & D. J. Mayhew (Eds.), Cost-justifying usability. Cambridge, MA: Academic. Retrieved December 1, 1999, from http://www.useit.com/papers/guerrilla_hci.html Nielsen, J. (1994b). Heuristic evaluation. In J. Nielsen & R. L. Mack (Eds.), Usability inspection methods (pp. 25–62). New York: Wiley. Evaluating Hypermedia Usability 335 Nielsen, J., & Landauer, T. K. (1993). A mathematical model of the finding of usability problems. In Proceedings of ACM INTERCHI ’93—International Conference on Human Factors in Computing Systems (pp. 296–213). New York: ACM. Nielsen, J., & Mack, R. L. (1994). Usability inspection methods. New York: Wiley. Preece, J., Rogers, Y., Sharp, H., Benyon, D., Holland, S., & Carey, T. (1994). Human–computer interaction. New York: Addison Wesley. Virzi, R. A., Sorce, J. F., & Herbert L. B. (1993). 
A comparison of three usability evaluation methods: Heuristic, think-aloud, and performance testing. In Proceedings of the Human Factors and Ergonomics Society 37th Annual Meeting (pp. 309–313). Santa Monica, CA: Human Factors and Ergonomics Society. Whiteside, J., Bennet, J., & Holtzblatt, K. (1988). Usability engineering: Our experience and evolution. In M. Helander (Ed.), Handbook of human–computer interaction (pp. 791–817). Oxford, England: Elsevier Science.