Food Quality and Preference 21 (2010) 4–12
Journal homepage: www.elsevier.com/locate/foodqual

A comparison of the labeled magnitude (LAM) scale, an 11-point category scale and the traditional 9-point hedonic scale

Harry T. Lawless a,*, Richard Popper b, Beverley J. Kroll b

a Food Science Department, Cornell University, Stocking Hall, Ithaca, NY 14853, United States
b Peryam and Kroll Research Corporation, 6323 N. Avondale Ave., Chicago, IL 60631, United States

Article info
Article history: Received 3 March 2009; received in revised form 26 April 2009; accepted 26 June 2009; available online 1 July 2009
Keywords: Scaling; Food acceptance; Hedonic scale

Abstract

Schutz and Cardello [Schutz, H. G. & Cardello, A. V. (2001). A labeled affective magnitude (LAM) scale for assessing food liking/disliking. Journal of Sensory Studies, 16, 117–159] proposed the labeled magnitude (LAM) scale for measuring food acceptance. The LAM is a line scale anchored at its end points with the phrases “greatest imaginable like” and “greatest imaginable dislike” and uses as intermediate anchors the nine phrases of the traditional hedonic scale. In this study, three hedonic scales were compared, including the widely-used 9-point hedonic scale, the LAM scale, and an 11-point category scale using the LAM’s verbal anchors as category labels. Three groups of consumers (N = about 100 each) used one of the three scales to evaluate the acceptability of highly liked foods (orange juices, potato chips, cookies, and ice cream, with four samples of each). Scales were evaluated primarily on their ability to show differences in acceptability, the correspondence of acceptance ratings to preference ranking and the correspondence of stated product usage (e.g., purchase of pulp vs. non-pulp orange juice) to the product scoring highest.
All three scales performed equally well, with no one scale showing a consistent superiority over another. All three scales were able to differentiate acceptability of the orange juices, chips and cookies. No scale differentiated among the ice creams, which had equal and high acceptability. All scales showed a strong correspondence between liking and preference rankings and also between the product rated highest and the type of product usually consumed, within each of the product categories.

© 2009 Elsevier Ltd. All rights reserved.

* Corresponding author. Tel./fax: +1 607 255 7363. E-mail address: [email protected] (H.T. Lawless).
0950-3293/$ - see front matter © 2009 Elsevier Ltd. All rights reserved. doi:10.1016/j.foodqual.2009.06.009

1. Introduction

The labeled affective magnitude scale (LAM, Fig. 1) was developed by Schutz and Cardello (2001) as an alternative to the commonly used 9-point category scale for measuring food acceptability (Jones, Peryam, & Thurstone, 1955; Peryam & Girardot, 1952; Peryam & Pilgrim, 1957). The LAM scale was an extension of the labeled magnitude scale (LMS) for psychophysical intensity scaling developed by Green, Shaffer, and Gilmore (1993), based on earlier work by Borg (1982) for a so-called “category–ratio scale”. The LAM scale has been used recently for evaluation of consumer liking for teas (Chung & Vickers, 2007a,b), to study the genetic factors in sweet taste perception (Keskitalo et al., 2007) and in a comparison of young and older persons’ liking for different orange juices (Forde & Delahunty, 2004). Recently, Jaeger and Cardello (2009) compared the LAM to best–worst scaling and found approximate parity in discrimination. The LMS and LAM scales may have the following properties:

(1) Because the scales were based on magnitude estimates (ratio scaling instructions) of the verbal anchor word meanings, the resulting scale values are thought by some to represent ratio scale data (Stevens, 1971). If true, one could make valid statements such as “this product was liked twice as much as that one”. The LMS and LAM produce data similar to that from magnitude estimation (Green et al., 1993; Schutz & Cardello, 2001).

(2) Because the scales had a high end anchor of “strongest imaginable” for the LMS or greatest imaginable like (or dislike) for the LAM, subjects in these scaling studies might have a similar idea of the intensity of the experience suggested by those phrases, and thus be placed on the same subjective scale. This was based on an argument by Borg (1982), who, for example, in studying perceived effort or exertion, thought that exerting oneself maximally, i.e., to the point of exhaustion, should be a similar experience among different people. This assumption was later challenged (Bartoshuk et al., 2002).

(3) Because the scales have commonly understood labels (weak, moderate, strong), the data could be interpreted in light of these labels, unlike magnitude estimation, which produced scale data based on proportions (i.e., one stimulus was twice as strong as another), but with no absolute anchor for whether these experiences were weak or strong (one could be twice the other but both could be weak).

Fig. 1. The LAM scale (after Schutz and Cardello (2001)). Verbal anchors, in the order shown on the scale: GREATEST IMAGINABLE LIKE; LIKE EXTREMELY; LIKE VERY MUCH; LIKE MODERATELY; LIKE SLIGHTLY; NEITHER LIKE NOR DISLIKE; DISLIKE SLIGHTLY; DISLIKE MODERATELY; DISLIKE VERY MUCH; DISLIKE EXTREMELY; GREATEST IMAGINABLE DISLIKE.

Regarding food acceptability testing, the question arises as to whether the LAM scale provides any advantages over the commonly used 9-point hedonic scale. An important criterion is whether one scale is better at finding differences among products (see, for example, Lawless & Malone, 1986). In the original set of studies (Schutz & Cardello, 2001) performance of the LAM scale and the 9-point scales were very similar in this regard.
Two direct comparisons were conducted, one involving 51 food names and one involving five foods that were actually tasted. Correlations between the mean values obtained on the two scales were +0.99 for the 51 food names and +0.98 for the tasted foods. Statistical differentiation was almost equivalent in both cases. Analysis of variance for the tasted foods showed 27.6% of variance accounted for by the food differences for the 9-point scales and 26.6% for the LAM. For the food names, there were 467 pairs of means (out of 1275 possible comparisons) that were significantly different for the LAM scale and 459 for the hedonic scale (not significantly different by binomial test on proportions). The only appreciable difference that was found was in an examination of foods that scored above the overall mean across products, i.e., well-liked foods, and considering only those pairings in which one scale showed a difference but the other did not (87 possible pairs, suggesting that about 43 would be above the mean). In these specific cases, the LAM scale was responsible for 86% of the differences (37 out of 43). For foods below the grand mean the split was about even. The higher end of the scale range was used more frequently with the LAM scale, consistent with the idea that it might be valuable for differentiating well-liked foods.

The performance of the two scales has been evaluated in several other direct comparisons. Greene, Bratka, Drake, and Sanders (2006) examined consumers’ reactions to peanuts with fruity-fermented flavor defects. The 9-point hedonic scale uncovered only one significant pair of differences, whereas the LAM scale showed four pairs of significant differences (out of 12 possible). Hein, Jaeger, Carr, and Delahunty (2008) performed a comparison of the 9-point, LAM, a line scale, ranking and best–worst scaling in a replicated test of breakfast bars with large groups of consumers.
Best–worst is a variation of choice/ranking whose analyses yield scale values. Among the other three true scaling methods, the first replication showed similar discrimination (similar F-ratios) for the LAM, line scale and 9-point, but the 9-point had a much higher F-ratio on the second replicate and showed more paired differences among means. El Dine and Olabi (2009) found similar performance of the LAM and 9-point scale in differentiating a set of both familiar and novel foods, with the LAM scale differentiating better among the three highest rated items, a finding in line with the original observation of Schutz and Cardello (2001).

Another criterion for comparing scales concerns the ability of the scale to detect different patterns of preference in consumer segments. Recently, Villanueva and Da Silva (2009) compared the traditional 9-point hedonic scale to a hedonic line scale, which was called a “hybrid” scale, previously studied by this group (Villanueva, Petenate, & Da Silva, 2005). The authors introduced a potentially important criterion for comparing the effectiveness of hedonic scales that has rarely been used: the segmentation of consumers as shown by internal preference mapping. They concluded that the hybrid scale has superior properties in terms of its ability to uncover segments of consumers in an evaluation of eight red wines. Such a comparison has not been made between the 9-point scale and the LAM scale.

Another useful relationship is between acceptance ratings and consumer preferences in the real world. Although there are many reasons why stated usage might not correspond to liking ratings (for example, I might prefer a certain style of potato chip but its cost might discourage me from frequent purchases), one would expect at least a moderate correlation across a group of individuals from different preference segments.
Given the relatively few direct comparisons of the LAM scale to the 9-point scale, sensory professionals might be cautious in substituting the newer LAM for the 9-point hedonic scale, an industry standard. However, there are some hints in the literature that an expanded 11-point scale could be useful for measuring product acceptability. Peryam (1989), in his reflections on the early days of sensory science, offered the following observation: “Why does the hedonic scale have nine categories, rather than more or less? Economy perhaps? Preliminary investigation had shown that discrimination between foods and reliability tended to increase up to eleven categories, but we encountered, in addition to the dearth of appropriate adverbs, a mechanical problem due to equipment limitations. Official government paper was only 8″ wide and we found that typing eleven categories horizontally was not possible. So we sacrificed a theoretical modicum of precision for a real improvement in efficiency at the moment.” (p. 23). He further pointed to a potential advantage to having more room for positive evaluations on a scale as follows: “An 8-point unbalanced scale with more “like” than “dislike” categories was shown to be somewhat better than the standard 9-point one, but only when dealing with relatively well-liked foods.” (p. 24).

Thus there is a need for further evaluation of the performance of the LAM and 9-point scales in head-to-head comparisons over different products and conditions. The LAM scale has potential advantages, with greater room for more extreme responses than the 9-point scale. It is not clear whether the added phrases are key or whether the added line length is important as well. To see whether the line itself made any difference we included a scale with the LAM’s verbal phrases only, similar to the original portrayal of the 9-point hedonic scale (Peryam & Girardot, 1952).
The main objective of the study was to compare the scales using four criteria: (1) the ability to differentiate products, (2) the ability to differentiate consumer segments, (3) relation of acceptance scores to consumption choices, and (4) reliability. Representative products of four product categories were used in this study.

2. Methods

2.1. Participants

A total of 302 consumers completed the study, 99 using the LAM scale, 103 using the 11-point category scale and 100 using the 9-point category scale. Consumers were recruited in four cities from Peryam and Kroll (Chicago, IL) databases of persons available for testing, and represented a range of ages (37% from 18 to 35, 43% from 36 to 55, and 20% from 56 to 65 years of age) and both genders equally (50.5% male and 49.5% female). None of the three groups deviated from these overall percentages by more than ±2% and there was no difference in a χ² test. All were consumers of the four products tested, as determined by pre-screening and confirmed by a demographic questionnaire that included product usage questions. Consumers were excluded during recruitment if they indicated they had participated in any food or beverage test in the last three months (an industry standard). Information on the product usage questionnaire and response options is found in the Appendix.

2.2. Products

Products consisted of four commercially available versions each of orange juice, chocolate chip cookies and potato chips. Four vanilla ice cream samples were also presented with one sample being duplicated. The products were pre-screened to produce a range of sensory characteristics and some potential preference differences, i.e., different degrees of appeal to different consumer segments. The orange juices included one popular refrigerated not-from-concentrate product, one refrigerated high-pulp not-from-concentrate product, and two shelf-stable (from concentrate) products from different manufacturers.
The chips consisted of one popular “normal” potato chip product, two kettle-style chips from different manufacturers, and one ridged/wavy style chip. The chocolate chip cookies consisted of one popular “normal” style cookie, one chunky style, and two soft/chewy style cookies from different manufacturers. The vanilla ice creams consisted of one labeled “vanilla bean,” one labeled “extra creamy,” and one labeled “French vanilla” which was served in duplicate. Major market leading brands were used to ensure their availability in the four testing sites. The brand names are available from the first author.

All products were served as small samples in cups coded with random three-digit numbers. Samples within a product type were served in randomized orders, but the four types of products were served in separate groups in the following order: juices, chips, cookies, then ice creams. This order was used so that no strong flavor contrast would occur that might lead to negative reactions, as might have happened if the high-acid juices followed the ice creams.

2.3. Test methods

Tests were conducted in four central location test sites in White Plains, NY; Santa Ana, CA; Plano, TX; and Chicago, IL. Questionnaires were administered on computer screens in a classroom style setting. Once instructions were given, participants received practice in using their assigned scale by rating their liking or disliking for two color swatches, produced on separate “pages” of the computer questionnaire. The products were then served, one group at a time. The LAM scale consisted of a vertical line scale with anchor words spaced according to the spacings provided by Cardello and Schutz (2004). Scales were about the same size on the screen, averaging 113 mm in length. LAM markings were re-coded to a 200-point scale based on a pixel count recorded by the computer system.
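The re-coding of a mouse click into a scale value can be illustrated with a minimal sketch. It assumes a plain linear mapping from the clicked pixel row to a 0–200 score, with the top of the vertical line taken as the "greatest imaginable like" end. The function name `recode_click` and the pixel coordinates are hypothetical; the software actually used in the study is not described.

```python
def recode_click(y_click: float, y_top: float, y_bottom: float,
                 points: float = 200.0) -> float:
    """Map a click on a vertical line scale to a score from 0 to `points`.

    Assumes screen coordinates grow downward (y_top < y_bottom) and that the
    top of the line is the "greatest imaginable like" end (score = points).
    """
    if y_bottom <= y_top:
        raise ValueError("y_bottom must be greater than y_top")
    # Clamp clicks that land slightly beyond either end of the line.
    y = min(max(y_click, y_top), y_bottom)
    # Fraction of the line length, measured upward from the bottom end.
    fraction = (y_bottom - y) / (y_bottom - y_top)
    return fraction * points


# Hypothetical line drawn from pixel row 50 (top) to pixel row 450 (bottom).
for y in (50, 250, 450):
    print(y, recode_click(y, 50, 450))
```

Under these assumptions a click at the midpoint of the line maps to 100, the neutral region of a 200-point scale.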
Instructions to participants for the 9- and 11-point scales read as follows: “Thank you for participating in today’s taste test. We are interested in how much you like or dislike various products. You will make your like/dislike ratings using the computer. Before we start we would like to show you the scale we would like you to use to make your ratings.” Then on a new screen, “This is the scale we would like you to use. You will use your mouse to click on any of the words in order to indicate how much you like or dislike the product.” The screen then showed the following nine phrases, centered on the page and arranged vertically from top to bottom in this order: like extremely, like very much, like moderately, like slightly, neither like nor dislike, dislike slightly, dislike moderately, dislike very much and dislike extremely. For the 11-point scale, the phrases greatest imaginable like and greatest imaginable dislike were placed at the top and bottom, respectively. For the LAM scale, the instructions were “You will use your mouse to click anywhere on the line to indicate how much you like or dislike the product.” Practice was then given by having them rate their liking or disliking for two color swatches that appeared sequentially on the screen.

Instructions for the product testing appeared after the second color sample screen and read as follows: “Now we are ready to begin the taste test. In order to cleanse your palate, please take a sip of water now and in between each of the samples you will be tasting today. First you will be rating four samples of (PRODUCT NAME). We will serve the four samples one at a time. Please start by evaluating the juice on the far left. Make sure the sample number on your cup matches the sample number on your screen. Please drink enough of the sample to form an opinion, then provide us your rating. After rating the first sample, continue with the next sample working from left to right.
While evaluating a sample, do not re-taste any of the samples that you previously rated. There will be a short wait time in between each sample to allow you to cleanse your palate by taking a sip of water. When you have received your samples, please click below to continue.” The next screen asked the participant to “Please Evaluate Sample 241” or some other three-digit random code, followed by the phrase “Overall, how much do you like or dislike this sample?” The scale was positioned below that phrase and a continue button was at the bottom of the screen. In between samples a screen with a counter was visible with the instructions, “Before proceeding to the next sample/question, please have a sip of water. When the counter reaches zero you will be able to continue.” Following the final sample in the set of four, a preference ranking was obtained by having them click on their first through fourth best samples, with the numbers 1–4 appearing over the code for each sample. It was possible to re-taste the samples and possible to alter their rankings. After the test was concluded, demographic information (age, gender) and product usage information were collected. Details of the coding scheme for product usage are given in the Appendix.

2.4. Analysis

Analysis of variance (ANOVA) was conducted for each food and scale using SAS PROC GLM. Duncan multiple range tests were used to examine significance of differences among means. Product discrimination was assessed by the size of the F-ratio from the ANOVA for product effects, the discrimination ratio (DR), the intraclass correlation coefficient (ICC) and the number of distinct groups arising from the Duncan tests. See Levy, Morris, Hammersley, and Turner (1999) for a discussion of the DR and ICC. Cross-tabulations were made of the type of product consumed most often with the product scoring highest for each individual.
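A cross-tabulation of this kind can be sketched as follows. This is an illustration only: the record layout, the helper names and the three consumer records are made up, and a tie for the top score is simply flagged as "tied" rather than resolved.

```python
from collections import Counter


def highest_scoring(ratings: dict[str, float]) -> str:
    """Return the product type with the top rating, or "tied" if not unique."""
    top = max(ratings.values())
    winners = [name for name, score in ratings.items() if score == top]
    return winners[0] if len(winners) == 1 else "tied"


def cross_tabulate(consumers: list[dict]) -> Counter:
    """Count (highest-scoring type, stated usage type) pairs across consumers."""
    table = Counter()
    for person in consumers:
        table[(highest_scoring(person["ratings"]), person["usage"])] += 1
    return table


# Made-up records for illustration only; these are not data from the study.
consumers = [
    {"usage": "pulp", "ratings": {"pulp": 8, "no pulp": 6}},
    {"usage": "no pulp", "ratings": {"pulp": 5, "no pulp": 7}},
    {"usage": "no pulp", "ratings": {"pulp": 7, "no pulp": 7}},
]
table = cross_tabulate(consumers)
for cell, count in sorted(table.items()):
    print(cell, count)
```

Keeping ties as their own category leaves the decision about how to count them to a later step.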
For each food, a decision had to be made about how to treat tied scores. A somewhat different approach was taken for each food, depending upon the number of ties and frequency of use of the “other” category for the food most frequently purchased. See Tables 3–5 for further information. Cross-tabulation of the chips counted the reported usage (product type reported to be that most frequently consumed) against the highest scoring product type. Frequent tied scores were observed with the chips. For purposes of measuring concordance, ties were assigned to the highest reported usage type. χ² analysis was performed omitting the “other” column and row (upper left 3 × 3). For the orange juices, the pulp-no pulp dichotomy was robust, and ties were not counted as concordant in this analysis. χ² analysis was performed omitting the “other” column and tied row (upper left 2 × 2 cells only). For the cookies, ties were assigned to the highest scoring product in the tie, as was done with the chips. χ² analysis was performed omitting the “other” row (upper left 3 × 3). Contingency coefficients were based on the same cells that contributed to the χ² analyses.

3. Results

3.1. Product discrimination

The overall pattern of results was that the scales worked about equally well. All three methods were able to differentiate the chips, cookies and orange juice products with a high degree of statistical significance. None of the scales were able to differentiate the ice creams, which had roughly equal and high acceptability. It is possible that this failure was due to the ice creams being presented last and that some fatigue had set in. Table 1 shows the F-ratios, intraclass correlations, discrimination ratios, and number of differentiated Duncan groups for the juices, cookies and chips. For the juices, the LAM scale performed best (9-point second). For the cookies, the 9-point scale performed best (LAM second) and for the chips, the 11-point scale performed best (9-point second).
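The DR and ICC entries of Table 1 appear to be tied together by a simple identity. As a rough check, the sketch below assumes DR = sqrt((1 + ICC) / (1 − ICC)), which equals sqrt(1 + 2 × between/within variance) when the ICC is the between-product share of total variance. This formula is an assumption made here for illustration (see Levy et al. (1999) for the definitions actually used), and the function name is hypothetical. It recomputes DR from the ICC values reported for the cookies.

```python
import math


def discrimination_ratio(icc: float) -> float:
    """DR implied by an ICC, assuming DR = sqrt((1 + ICC) / (1 - ICC))."""
    if not 0.0 <= icc < 1.0:
        raise ValueError("icc must be in [0, 1)")
    return math.sqrt((1.0 + icc) / (1.0 - icc))


# (ICC, DR) pairs as reported in Table 1 for the cookies.
reported = {
    "9-point": (0.984, 11.13),
    "11-point": (0.972, 8.39),
    "LAM": (0.975, 8.89),
}
for scale, (icc, dr) in reported.items():
    implied = discrimination_ratio(icc)
    print(f"{scale}: reported DR {dr:.2f}, implied DR {implied:.2f}")
```

Under this assumption the implied values agree with the tabled ones to within the rounding of the ICCs. Because 1 − ICC sits in the denominator, the implied DR becomes very sensitive to rounding when the ICC is close to 1.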
Thus there was no consistent pattern of superiority for any scale method across the three differentiated product types. Rank orders of the mean values were virtually identical among the scales for the juices, cookies and chips with only small reversals occurring between product pairs that were not significantly different. Additional evidence of the consistency of the three methods was shown by the correlations across the 12 product means from the juices, cookies and chips. Correlation coefficients were +0.950 for the 9- and 11-point scales, +0.939 for the 11-point and LAM scales and +0.955 for the 9-point and LAM scales. Table 2 shows the mean scores, standard errors of the means and Duncan test groups.

Table 1
Indices of product differentiation in hedonic ratings.

                          Scaling method
Product   Index           9-Point   11-Point   LAM scale
Cookies   F-ratio         37.64     25.38      27.06
          ICC             0.984     0.972      0.975
          DR              11.13     8.39       8.89
          Duncan groups   3         2          3
Chips     F-ratio         20.88     22.83      10.62
          ICC             0.955     0.956      0.906
          DR              6.57      6.68       4.50
          Duncan groups   4         2          2
Juices    F-ratio         24.67     19.82      34.61
          ICC             0.959     0.949      0.995
          DR              6.94      6.20       21.06
          Duncan groups   2         3          3

Notes: F-ratio is the product F-ratio. ICC = intraclass correlation, ratio of systematic product variance to total product variance (of means). DR = discrimination ratio, number of potential product groups differentiated by this level of error and systematic variance. Duncan groups: numbers of product groups differentiated by Duncan’s multiple range tests.

Discrimination of products, as indicated by the discrimination ratio, was associated with a wider use of the available scale (see Fig. 2). Scale range here was defined as the highest minus lowest product mean in that category, divided by the total length of the scale. This would be expected if the size of the error variance was approximately the same for all products and scales. The data shown in Fig. 2 do not include the ice creams for which there was no evidence of product discrimination in the ANOVAs.

Fig. 2. Discrimination ratio vs. percent of scale range from lowest to highest mean value, across the three products and three scaling methods.

Table 2
Means, standard errors and Duncan test groups.

                       9-Point                11-Point               LAM scale
Product                Mean  (SE)    Group    Mean  (SE)    Group    Mean   (SE)    Group
Juice
  No pulp              6.98  (0.15)  A        8.01  (0.21)  A        148.6  (2.59)  A
  High pulp            7.11  (0.19)  A        7.77  (0.23)  A        140.6  (4.27)  A
  Shelf-stable 1       5.36  (0.23)  B        6.05  (0.25)  C        100.0  (4.81)  C
  Shelf-stable 2       5.64  (0.20)  B        6.63  (0.25)  B        114.3  (4.47)  B
Chips
  Regular              6.94  (0.17)  A        8.09  (0.17)  A        142.1  (3.47)  A
  Wavy                 7.40  (0.12)  B        8.62  (0.16)  A        150.1  (3.03)  A
  Kettle 1             6.30  (0.20)  C        6.63  (0.28)  B        129.7  (4.23)  B
  Kettle 2             5.69  (0.22)  D        6.73  (0.26)  B        123.6  (4.49)  B
Cookies
  Chunky               7.13  (0.16)  A        8.06  (0.23)  A        144.3  (2.79)  A
  Regular              7.02  (0.17)  A        7.77  (0.19)  A        142.1  (3.02)  A
  Soft/chewy 1         5.71  (0.23)  B        6.27  (0.29)  B        122.4  (4.74)  B
  Soft/chewy 2         4.78  (0.22)  C        6.04  (0.25)  B        106.3  (4.78)  C
Ice cream
  “Vanilla bean”       7.36  (0.16)  –        8.18  (0.21)  –        143.5  (3.35)  –
  Duplicate 1          7.13  (0.16)  –        8.10  (0.20)  –        149.1  (3.12)  –
  Duplicate 2          7.17  (0.15)  –        8.24  (0.19)  –        147.9  (2.81)  –
  Extra creamy         7.09  (0.17)  –        8.39  (0.20)  –        149.3  (2.93)  –
Total Duncan groups    9                      7                      8

3.2. Reliability (duplicate sample analysis)

Correlations were assessed for the scores of the pair of identical ice cream samples. For the 9-point scale, the correlation was +0.518, for the 11-point scale, +0.362 and for the LAM scale, +0.521. The LAM and 9-point scale were marginally better than the 11-point using Fisher’s Z transformation (p = 0.081). Note that there was some range compression in these values due to the consistently high ratings given to the ice cream products. Thus the correlations shown here are probably a low estimate of the reliability of these methods, and a product system with a wider range of consumer opinions would most likely show a higher correlation value.

3.3. Correspondence of acceptance scores with preferred types of products

All methods showed a strong correspondence between the consumers’ stated usage patterns (type of product consumed most often) and the products that scored highest for that individual. The analysis is detailed below. No pattern of superiority emerged, with the 9-point scale showing the strongest relationship for juices, the 11-point scale for chips and the LAM scale for cookies.

Table 3
Orange juice usage and type scoring highest.

                     Type consumed (response)
Highest scoring      No pulp   Pulp   Both or other
9-Point scale
  Pulp               11        28*    10
  No pulp            24*       2      1
  Tied               6         10     8
  χ² = 25.8, p < .001. Contingency coefficient = +0.53
11-Point scale
  Pulp               11        14*    11
  No pulp            33*       8      6
  Both/other         7         7      5
  χ² = 9.3, p < .01. Contingency coefficient = +0.35

Concordant cells (bold face in the original) are marked with an asterisk.

For the orange juices (see Table 3), only the two refrigerated juices were considered, as it was possible to classify consumers as pulp likers vs. dislikers, and there were relatively few persons who scored the shelf-stable juices higher than the refrigerated ones. The analysis compared which juice scored higher (pulp or no pulp) and which group the individual belonged to regarding their type of juice drunk most often (pulp, no pulp, both). χ² analysis was performed with the tied scores and “both” categories omitted.
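As a worked check of the 9-point juice analysis, the sketch below computes a Pearson χ² (without continuity correction) and the contingency coefficient C = sqrt(χ² / (χ² + N)) from the four cells that remain once ties and the "both or other" responses are dropped. Both the choice of formulas and the reading of which cell is which in Table 3 are assumptions, and the helper names are illustrative. The statistic does not depend on whether the table is read by rows or by columns.

```python
import math


def chi_square(table: list[list[float]]) -> float:
    """Pearson chi-square for an r x c table of counts (no continuity correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def contingency_coefficient(chi2: float, n: float) -> float:
    """Pearson's contingency coefficient, C = sqrt(chi2 / (chi2 + n))."""
    return math.sqrt(chi2 / (chi2 + n))


# Upper-left 2 x 2 cells of the 9-point block of Table 3.
# Assumed layout: rows = highest-scoring juice (pulp, no pulp),
# columns = type consumed (no pulp, pulp).
cells = [[11, 28], [24, 2]]
n = sum(map(sum, cells))
chi2 = chi_square(cells)
print(round(chi2, 1), round(contingency_coefficient(chi2, n), 2))  # 25.8 0.53
```

With these assumptions the values reported for the 9-point scale (χ² = 25.8, contingency coefficient = +0.53) are recovered from N = 65 untied responses.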
All three methods showed significant associations between the higher scoring juice and the reported usage type. However, in terms of χ² magnitude and contingency coefficients, the 9-point scale was superior to the 11-point scale and LAM scale.

Table 3 (continued)

                     Type consumed (response)
Highest scoring      No pulp   Pulp   Both or other
LAM scale
  Pulp               10        24*    8
  No pulp            23*       13     6
  Both/other         4         9      7
  χ² = 8.34, p < .01. Contingency coefficient = +0.33

Concordant cells (bold face in the original) are marked with an asterisk.
Note: Ties were not counted as concordant in this analysis. χ² analysis was performed omitting the “other” column and tied row (upper left 2 × 2 cells only).

For the chips (see Table 4), all three scaling methods showed a strong correspondence between the highest rated chip and the person’s reported type of chip consumed most frequently. In terms of χ² association, all were significant and showed the expected pattern, i.e., the highest counts in the diagonal. In terms of χ² and contingency coefficients, the order from highest to lowest was 11-point > 9-point > LAM.

For the cookies (see Table 5), again all three scaling methods showed a strong correspondence between stated usage and highest acceptance scores. In terms of χ² association, all were significant and showed the expected pattern, i.e., the highest counts in the diagonal. In terms of χ² and contingency coefficients, the order from best to worst was LAM > 11-point > 9-point. For the cookies, all the methods showed a strong pattern of dislikers of soft/chewy cookies among those who consume regular or chunky-chip cookies. Chunky chip cookies scored well among those consumers avowing to eat regular chocolate cookies most often.

Table 4
Potato chip usage and type scoring highest.

                     Type consumed (response)
Highest scoring      Kettle   Ridged   Regular   Baked, other
9-Point scale
  Kettle             14*      4        4         5
  Ridged             3        18*      5         4
  Regular            0        5        12*       3
  Other ties         1        0        4         7
  χ² = 41.3, p < .001. Contingency coefficient = +0.612
11-Point scale
  Kettle             16*      1        2         3
  Ridged             5        18*      5         7
  Regular            1        7        17*       4
  Other ties         0        1        3         9
  χ² = 50.8, p < .001. Contingency coefficient = +0.643
LAM scale
  Kettle             17*      3        3         7
  Ridged             3        19*      7         5
  Regular            1        4        14*       4
  Other ties         0        1        4         7
  χ² = 36.4, p < .001. Contingency coefficient = +0.602

Concordant cells (bold face in the original) are marked with an asterisk.

Table 5
Chocolate chip cookie usage and type scoring highest.

                     Type consumed (response)
Highest scoring      Regular   Chunky   Soft/chewy
9-Point scale
  Regular            24*       9        7
  Chunky             12        12*      1
  Soft/chewy         1         2        14*
  Other ties         4         2        2
  χ² = 36.7, p < .001. Contingency coefficient = +0.55
11-Point scale
  Regular            16*       3        5
  Chunky             15        25*      5
  Soft/chewy         4         1        19*
  Other ties         3         0        2
  χ² = 49.6, p < .001. Contingency coefficient = +0.59
LAM scale
  Regular            24*       4        4
  Chunky             13        22*      1
  Soft/chewy         3         1        22*
  Other ties         1         0        2
  χ² = 75.5, p < .001. Contingency coefficient = +0.67

Concordant cells (bold face in the original) are marked with an asterisk.

3.4. Detection of disliker segments

Data were examined for groups of panelists who were dislikers of some of the products. A disliker segment appears on a single scale as an increase in frequency on the negative half of the scale after a local minimum near the center (an antimode). Informal inspection suggested that possibly eight of the products showed such segments that were greater than 10% of the sample. These products were the soft/chewy cookies, the kettle cooked chips, and all four orange juices. Because there are no formal criteria for what constitutes a sufficiently large “bump” in a frequency distribution to qualify as a segment, seven criteria for a disliker segment were formulated that seemed reasonable. The use of seven criteria was designed to prevent any spurious results from any single measure, since none could be considered ideal. We considered the literature on specific anosmias (Amoore, Venstrom, & Davis, 1968), which uses evidence of a minor mode segment a given distance from the population mean. These criteria were as follows: the frequency (percent) in more than two categories less than the mode (mean of 22, criterion set at 20% or greater), the frequency less than four categories from the mode (mean of 14, criterion set at 10%), the percent below the neutral point (mean of 30%, criterion set at 25%) and the percent below “dislike slightly” (mean of 20%, criterion set at 15%). Pass/fail criteria included whether there was an antimode at neutral or dislike slightly (at least one lower category with a higher frequency), whether there were two lower categories with a higher frequency and whether there were three lower categories with a higher frequency. These multiple criteria, varying in stringency, were used to prevent any small random variation leading to a “detection” overly influencing the analysis. An example of segmented ratings distributions is shown in Fig. 3.

Fig. 3. Frequency histogram of ratings for one cookie sample for the LAM and 11-point scales showing evidence of a disliker segment.

To apply these criteria to the line scale, data from the LAM scale were converted to an 11-point basis by dividing the line at the midpoint between verbal anchor points. Frequency distributions were then tallied for the categorical form of the data for comparison to the other two scales. In order to place the 9- and 11-point scales on the same footing, a further adjustment was required. For the first four criteria, i.e., the actual frequency counts, the 9-point scale tallies were adjusted to include 18% of the counts in the next higher category to account for the smaller range (2/11 = .18). Given these seven criteria, there were 56 possible “detections” for the eight products. A “detection” meant meeting one of the stated criteria. The LAM scale and 9-point scale did about equally well with 39 out of 56 possible detections. The 11-point scale fared slightly better, with 48 out of 56 possible detections.

3.5. Use of scale points above “like extremely”

One of the potential advantages of the LAM scale is the opportunity for panelists to score products higher than the traditional 9-point endpoint of like extremely. This may also provide some psychological space to counteract the tendency to avoid category scale endpoints as responses. In Cardello, Lawless, and Schutz (2008) between 19% and 30% of participants used this part of the LAM scale for at least one of five products. In one of the studies contained in Schutz and Cardello (2001) a 17% usage rate was found above “like extremely.” Table 6 shows frequencies of usage of the space above “like extremely” for the LAM scale, and use of the uppermost category for the 11-point category scale. Tallies were made for individuals as well as the total number of judgments and are reported for each product category separately. Note that the usage for the LAM scale is consistent with the previous literature (Schutz & Cardello, 2001), in the range of 10–20% of respondents making use of the high end of the scale. Use of the 11th scale point was somewhat lower for the cookies. However, this usage is still remarkable given that the label was “Greatest Imaginable Liking.”

Table 6
Scale usage frequency above “like extremely”.

               LAM scale                          11-Point scale
Product        Users (%)   Total judgments (%)    Users (%)   Total judgments (%)
Ice creams     20.0        6.6                    20.4        7.5
Juices         10.3        3.6                    12.6        3.9
Chips          16.0        4.8                    13.6        4.6
Cookies        18.4        6.1                    8.7         3.2

Table 7
“Categorical” behavior with the LAM scale.

3.6. Correspondence to preference rankings

All three methods showed a high degree of consistency in the rankings of the mean scale values and the mean preference ranking.
There were two reversals of ranking of the acceptance scale means and preference ranks for the 9-point scale, only one for the 11-point scale and none for the LAM (out of 18 possible reversals of pairwise order). The reversals of ranking were found among pairs which were undifferentiated, i.e., not significantly different by Duncan tests. This correspondence is perhaps not surprising as the ranking was performed directly after the ratings, although participants could not see their ratings and re-tasting was possible at this point.

3.7. “Categorical” behavior with the LAM scale

A judgment was tallied as “categorical” if it fell within ±2 units of the point designated for an anchor phrase (considering the scale on a 200-point basis). This was the criterion previously used by Cardello et al. (2008). An individual was tallied as being “categorical” using two criteria, as follows: first, if at least three out of four products showed categorical judgments and second, if all four products showed categorical judgments. This breakdown was also expanded to include ±3 units from the designated scale anchor points. Behaviors were tallied separately for the four products. As shown in Table 7, there was a high degree of categorical behavior, with a majority of participants (71–83%) placing their ratings within ±2 units of an anchor phrase hash mark for at least three out of four products. This is somewhat higher than the rates seen in Cardello et al. (2008), who found 65% of a college population sample and 50% of an ongoing Army laboratory panel to act in this manner with at least four out of five products. To put this into perspective, the intervals designated by ±2 units comprise 1/4 of the total space on the LAM scale, yet they captured about 3/4 of the data points. If the criterion is expanded to ±3 units, then over 82% of the data fall on about 1/3 of the usable scale. In other words, participants used intermediate space between phrases infrequently.
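The tallying rule described in Section 3.7 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the anchor positions in `ANCHORS` are rough placeholder values on a −100 to +100 line and are not taken from this paper, the function names are hypothetical, and the small panel is invented to show the three kinds of tallies.

```python
from typing import Sequence

# Placeholder anchor positions on a -100..+100 (200-point) line.
# Rough illustrative values only; NOT taken from the paper.
ANCHORS = (-100, -75, -55, -32, -11, 0, 11, 36, 56, 74, 100)


def is_categorical(rating: float, anchors: Sequence[float] = ANCHORS,
                   tolerance: float = 2.0) -> bool:
    """True if the mark lies within +/- tolerance units of any anchor."""
    return any(abs(rating - a) <= tolerance for a in anchors)


def tally_participant(ratings: Sequence[float], tolerance: float = 2.0) -> int:
    """Number of one participant's ratings that count as categorical."""
    return sum(is_categorical(r, tolerance=tolerance) for r in ratings)


def summarize(panel: Sequence[Sequence[float]], tolerance: float = 2.0) -> dict:
    """Percent of participants with <=2, >=3, and all products categorical."""
    counts = [tally_participant(r, tolerance) for r in panel]
    n = len(counts)
    n_products = len(panel[0])
    return {
        "le2_pct": 100 * sum(c <= 2 for c in counts) / n,
        "ge3_pct": 100 * sum(c >= 3 for c in counts) / n,
        "all_pct": 100 * sum(c == n_products for c in counts) / n,
    }


# Hypothetical panel: each inner list holds one participant's four ratings.
panel = [
    [36, 57, 74, 10],   # all four near an anchor
    [35, 45, 56, 75],   # three of four near an anchor
    [20, 45, 65, 36],   # one of four near an anchor
    [0, 11, 12, 100],   # all four near an anchor
]
print(summarize(panel))
```

With this invented panel, one of the four participants falls in the two-or-fewer group, three meet the at-least-three criterion and two place all four products near an anchor. Passing `tolerance=3.0` applies the expanded ±3 criterion.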
About half the participants placed all four products at or very near the anchor phrases.

Categorical behavior frequency

Product       Criterion               ≤2 Products (%)   ≥3 Products (%)   All four products (%)
Cookies       ±2 units from anchor    29                71                42
              ±3 units from anchor    22                78                43
Chips         ±2 units from anchor    17                83                56
              ±3 units from anchor    14                86                59
Juices        ±2 units from anchor    29                71                47
              ±3 units from anchor    22                78                53
Ice creams    ±2 units from anchor    24                76                53
              ±3 units from anchor    17                87                56

4. Discussion

To our knowledge, this is the first large-scale consumer study comparing the LAM and 9-point hedonic scales in several different product systems. A further advantage of this study is the between-groups comparison, a design also adopted by Greene et al. (2006), Hein et al. (2008), and El Dine and Olabi (2009). Thus each participant used only one type of scale and was not influenced by recent experience with another scale type. Early studies of the LAM used within-subjects comparisons but people received the scales on different days (Schutz & Cardello, 2001, Experiments 4 and 5).

Some studies have shown that the LAM scale is as good as or better than other scales for differentiating products with respect to their consumer acceptance ratings (El Dine & Olabi, 2009; Greene et al., 2006; Schutz & Cardello, 2001). Our results are in line with the other large-scale consumer study of Hein et al. (2008), who examined performance of five scaling methods with a sports/snack bar among New Zealand consumers. They observed parity between the LAM and 9-point scale on the first replicate, but superior performance of the 9-point scale on a second trial. In the present study we saw instances where the LAM scale performed sometimes better and sometimes worse. However, from a statistical perspective these differences were not large. All scales were able to differentiate three of the foods at high levels and none could discriminate among the ice cream samples.
This result is consistent with Peryam’s statement from the early days of trying different scale versions, to wit: “All hedonic scales seem to measure what they are intended to measure rather effectively, as long as no gross mistakes are made” (Peryam, 1989, p. 23). Further evidence for this level of consistency comes from the correlations among product means, which Schutz and Cardello (2001) found to be as high as +0.99, and which we found to be close to +0.94.

One potential reason for the performance of the LAM scale might be the high degree of categorical scoring with the LAM seen in this study. This kind of behavior has been noted for line scales and is especially pronounced for magnitude estimation, where it takes the form of subjects frequently using numbers that are multiples of two and five (Baird & Noma, 1978; Giovanni & Pangborn, 1983). Given a criterion for categorical scoring of making marks on at least three out of four products within two scale points of a phrase (on a 200-point basis), we observed from 71% (with juices) to 83% (with chips) of consumers fitting this criterion. This is somewhat higher than the 50% of a laboratory panel and 65% of a college student panel seen in Cardello et al. (2008). Apparently, to first-time users of this scale in a real consumer population, the phrases are quite compelling even if they are told to make a mark “anywhere on the line.” This was accompanied by usage of the area above the phrase “like extremely” by only 10–20% of the participant pool, and a total frequency of usage at about 5% of all judgments. If the higher category option is used less often, one might expect performance to resemble that of the 9-point scale (assuming little or no end use avoidance with that scale). Future research could examine the effects of instructions, examples and practice to see if more complete use of the LAM scale can be encouraged in first-time users.
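The two usage figures discussed above (percent of participants versus percent of all judgments above “like extremely”) are easy to confuse, so a minimal Python sketch of the distinction may help. It rests on assumptions: the threshold `LIKE_EXTREMELY` is a placeholder position on a −100 to +100 line rather than a value from this paper, the function name is hypothetical, and the ratings are invented.

```python
from typing import Sequence

# Placeholder position of "like extremely" on a -100..+100 line (assumption).
LIKE_EXTREMELY = 74.0


def usage_above(panel: Sequence[Sequence[float]],
                threshold: float = LIKE_EXTREMELY) -> tuple[float, float]:
    """Return (percent of users, percent of judgments) above the threshold."""
    # A participant counts as a user if any one of their ratings exceeds it.
    users = sum(any(r > threshold for r in ratings) for ratings in panel)
    # Judgments are pooled over all participants and samples.
    judgments = [r > threshold for ratings in panel for r in ratings]
    users_pct = 100 * users / len(panel)
    judgments_pct = 100 * sum(judgments) / len(judgments)
    return users_pct, judgments_pct


# Hypothetical data: five participants, four samples each.
panel = [
    [56, 60, 74, 80],
    [36, 40, 55, 58],
    [11, 36, 56, 70],
    [90, 95, 60, 74],
    [0, 11, 36, 56],
]
users_pct, judgments_pct = usage_above(panel)
print(f"users: {users_pct:.1f}%, judgments: {judgments_pct:.1f}%")
```

In this invented panel two of five participants use the upper region at least once, but only three of twenty judgments fall there, which mirrors how a 10–20% user rate can coexist with roughly 5% of judgments.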
The effects of number of categories (9 vs. 11) and the potential advantage of having a line to mark remain unclear. Bendig and Hughes (1953) found an increase in information transmission as the number of categories for auditory scaling increased to 11, but some minor loss in reliability going from nine to eleven. The present study showed no great advantage to an 11-point scale over nine categories, even though the avoidance of end categories is an often-cited shortcoming of the 9-point scale (e.g., Villanueva et al., 2005). The question remains of whether there is any advantage to a line scale or a less structured scale with fewer phrases (or none at all). Villanueva and colleagues have developed a so-called hybrid line scale with only three anchors (the two ends and the middle neutral anchor phrase) that is similar to a line scale used in quantitative descriptive analysis (Lawless & Heymann, 1998), although there are some small pip or dot marks across ten intervals. This scale is reported to show some advantages over the 9-point scale (Villanueva & Da Silva, 2009; Villanueva et al., 2005). Yao et al. (2003) found wider scale usage with an “unstructured” 9-point category scale with only numbered boxes and no phrases attached. One way to potentially remove the categorical behavior with the LAM scale would be to strip off the interior labels entirely. This was used by Wright (2007) in a study examining the appeal of military field ration packaging, although it was not compared directly to the LAM scale itself.

The criteria for what makes a “good” scale for sensory evaluation have generally been practical, as opposed to some of the theoretical arguments that are seen in psychophysics (Baird & Noma, 1978). Paramount has been the ability of the scale to detect differences, in this case in the consumer acceptability of food products.
Reliability, defined as the ability of the measuring instrument to give the same value over repeated measurements, has long been a criterion for any quantitative sensory method. To this we have added some validity criteria, namely the correlation of acceptance with ranked preference, the correspondence of the results to reported product usage habits, and the ability of the scales to detect consumer segments. To the extent that usage of a wider range of the scale is associated with better discrimination (see Fig. 2), that is also a potentially useful criterion. Factors which tend to compress ratings or limit the usage of the scale would be generally undesirable (Cardello et al., 2008). Along these lines we have noted a surprisingly high incidence of categorical behavior. Whether this is detrimental to the overall functioning of the scaling method remains to be seen, but it could form another criterion, albeit a negative one to perhaps be minimized.

In conclusion, the results of this study agree with the assertion of Peryam (1989) that hedonic scales measure what they are intended to measure rather effectively, as long as no obvious mistakes are made (for example, having too few categories, or perhaps using a unipolar scale lacking a neutral point). With these groups of products and these consumers, which closely approximated the conditions of a commercial central location test, the LAM and the 9-point scale both performed well.

Acknowledgement

The authors thank Terry Mongoven for assistance in supervising the field study.

Appendix. Details of Product Usage Questionnaire

Product usage questions followed demographic questions on gender, age, and education.

For orange juice the questions were as follows: “How often do you drink orange juice?” Seven response categories were offered: every day, every 2–3 days, once a week, every 2–3 weeks, once a month, once every 2–3 months and once every 4–6 months. “What type of orange juice do you drink most often? (check only one)” Four response categories were offered: (1) orange juice without pulp, (2) orange juice with pulp, (3) drink both types equally often and (4) don’t know. “Thinking about the orange juice that you drink most often, where in the supermarket is that orange juice sold?” Four response options were (1) in the refrigerated section, (2) in the freezer case, (3) in the juice aisle (not refrigerated) and (4) other. “Is the orange juice you drink most often. . .? (check one answer)” With options (1) made from concentrate, (2) not from concentrate, (3) fresh squeezed and (4) don’t know.

For potato chips, the questions were as follows: “How often do you eat potato chips?” Response option categories were the same as the juices. “What type of potato chips do you eat most often (check only one)” Response options were (1) kettle cooked chips (e.g., kettle chips), (2) ridged chips (e.g., ruffles), (3) stacked chips (e.g., Pringles), (4) baked chips (e.g., baked lays), (5) regular chips (e.g., lay’s classic) and (6) other.

For cookies, the questions were as follows: “How often do you eat chocolate chip cookies?” Response option categories were the same as the juices. “What type of chocolate chip cookies do you eat most often? (check only one)” Response categories were (1) regular, (2) chunky chocolate chip cookies, (3) soft/chewy chocolate chip cookies and (4) other.

For ice cream the questions were: “How often do you eat vanilla ice cream?” Response option categories were the same as the juices. (No questions about type of ice cream were asked).

References

Amoore, J. E., Venstrom, D., & Davis, A. R. (1968). Measurement of specific anosmia. Perceptual and Motor Skills, 26, 143–164.
Baird, J. C., & Noma, E. (1978). Fundamentals of scaling and psychophysics. New York: John Wiley & Sons.
Bartoshuk, L. M., Duffy, V. B., Fast, K., Green, B. G., Prutkin, J., & Snyder, D. J. (2002). Labeled scales (e.g. category, Likert, VAS) and invalid across-group comparisons: What we have learned from genetic variation in taste. Food Quality and Preference, 14, 125–138.
Bendig, A. W., & Hughes, J. B. (1953). Effect of number of verbal anchoring and number of rating scale categories upon transmitted information. Journal of Experimental Psychology, 46(2), 87–90.
Borg, G. (1982). A category scale with ratio properties for intermodal and interindividual comparisons. In H.-G. Geissler & P. Petzold (Eds.), Psychophysical judgment and the process of perception (pp. 25–34). Berlin: VEB Deutscher Verlag der Wissenschaften.
Cardello, A., Lawless, H. T., & Schutz, H. G. (2008). Effects of extreme anchors and interior label spacing on labeled magnitude scales. Food Quality and Preference, 21, 323–334.
Cardello, A. V., & Schutz, H. G. (2004). Research note. Numerical scale-point locations for constructing the LAM (labeled affective magnitude) scale. Journal of Sensory Studies, 19, 341–346.
Chung, S.-J., & Vickers, A. (2007a). Long-term acceptability and choice of teas differing in sweetness. Food Quality and Preference, 18, 963–974.
Chung, S.-J., & Vickers, A. (2007b). Influence of sweetness on the sensory-specific satiety and long-term acceptability of tea. Food Quality and Preference, 18, 256–264.
El Dine, A. N., & Olabi, A. (2009). Effect of reference foods in repeated acceptability tests: Testing familiar and novel foods using 2 acceptability scales. Journal of Food Science, 74, S97–S105.
Forde, C. G., & Delahunty, C. M. (2004). Understanding the role cross-modal sensory interactions play in food acceptability in younger and older consumers. Food Quality and Preference, 15, 715–727.
Giovanni, M. E., & Pangborn, R. M. (1983). Measurement of taste intensity and degree of liking of beverages by graphic scaling and magnitude estimation. Journal of Food Science, 48, 1175–1182.
Green, B. G., Shaffer, G. S., & Gilmore, M. M. (1993). Derivation and evaluation of a semantic scale of oral sensation magnitude with apparent ratio properties. Chemical Senses, 18, 683–702.
Greene, J. L., Bratka, K. J., Drake, M. A., & Sanders, T. H. (2006). Effectiveness of category and line scales to characterize consumer perception of fruity fermented flavors in peanuts. Journal of Sensory Studies, 21, 146–154.
Hein, K. A., Jaeger, S. R., Carr, B. T., & Delahunty, C. M. (2008). Comparison of five common acceptance and preference methods. Food Quality and Preference, 19, 651–661.
Jaeger, S. R., & Cardello, A. V. (2009). Direct and indirect hedonic scaling methods: A comparison of the labeled affective magnitude (LAM) scale and best–worst scaling. Food Quality and Preference, 20, 249–258.
Jones, L. V., Peryam, D. R., & Thurstone, L. L. (1955). Development of a scale for measuring soldiers’ food preferences. Food Research, 20, 512–520.
Keskitalo, K., Knaapila, A., Kallela, M., Palotie, A., Wessman, M., Sammalisto, S., et al. (2007). Sweet taste preferences are partly genetically determined: Identification of a trait locus on chromosome 16. American Journal of Clinical Nutrition, 86, 55–63.
Lawless, H. T., & Heymann, H. (1998). Sensory evaluation of food: Principles and practices. New York: Springer.
Lawless, H. T., & Malone, G. J. (1986). A comparison of scaling methods: Sensitivity, replicates and relative measurement. Journal of Sensory Studies, 1, 155–174.
Levy, J., Morris, R., Hammersley, M., & Turner, R. (1999). Discrimination ratio, adjusted correlation and equivalence of imprecise tests: Application to glucose tolerance. American Journal of Endocrinology and Metabolism, 276, 365–375.
Peryam, D. R. (1989). Reflections. In Sensory evaluation, in celebration of our beginnings (pp. 21–30). Conshohocken, PA: ASTM International.
Peryam, D. R., & Girardot, N. F. (1952). Advanced taste-test method. Food Engineering, 24, 58–61.
Peryam, D. R., & Pilgrim, F. J. (1957). Hedonic scale method of measuring food acceptance. Food Technology, 11, 9–14.
Schutz, H. G., & Cardello, A. V. (2001). A labeled affective magnitude (LAM) scale for assessing food liking/disliking. Journal of Sensory Studies, 16, 117–159.
Stevens, S. S. (1971). Issues in psychophysical measurement. Psychological Review, 78, 328–330.
Villanueva, N. D. M., & Da Silva, M. A. A. P. (2009). Performance of the nine-point hedonic, hybrid and self-adjusting scales in the generation of internal preference maps. Food Quality and Preference, 20, 1–12.
Villanueva, N. D. M., Petenate, A. J., & Da Silva, M. A. A. P. (2005). Comparative performance of the hybrid hedonic scale as compared to the traditional hedonic, self-adjusting and ranking scales. Food Quality and Preference, 16, 691–703.
Wright, A. O. (2007). Comparison of hedonic, LAM, and other scaling methods to determine Warfighter visual liking of MRE packaging labels, includes web-based challenges, experiences and data. Presentation at the seventh Pangborn sensory science symposium, Minneapolis, MN, 8/12/07. Supplement to Abstract Book/Delegate Manual.
Yao, E., Lim, J., Tamaki, K., Ishii, R., Kim, K.-O., & O’Mahony, M. (2003). Structured and unstructured 9-point hedonic scales: A cross cultural study with American, Japanese and Korean consumers. Journal of Sensory Studies, 18, 115–139.