Food Quality and Preference 21 (2010) 4–12
A comparison of the labeled magnitude (LAM) scale, an 11-point category scale
and the traditional 9-point hedonic scale
Harry T. Lawless a,*, Richard Popper b, Beverley J. Kroll b
a Food Science Department, Cornell University, Stocking Hall, Ithaca, NY 14853, United States
b Peryam and Kroll Research Corporation, 6323 N. Avondale Ave., Chicago, IL 60631, United States
Article info
Article history:
Received 3 March 2009
Received in revised form 26 April 2009
Accepted 26 June 2009
Available online 1 July 2009
Keywords:
Scaling
Food acceptance
Hedonic scale
Abstract
Schutz and Cardello [Schutz, H. G. & Cardello, A. V. (2001). A labeled affective magnitude (LAM) scale for
assessing food liking/disliking. Journal of Sensory Studies, 16, 117–159] proposed the labeled magnitude
(LAM) scale for measuring food acceptance. The LAM is a line scale anchored at its end points with the
phrases “greatest imaginable like” and “greatest imaginable dislike” and uses as intermediate anchors
the nine phrases of the traditional hedonic scale. In this study, three hedonic scales were compared,
including the widely-used 9-point hedonic scale, the LAM scale, and an 11-point category scale using
the LAM’s verbal anchors as category labels. Three groups of consumers (N = about 100 each) used one
of the three scales to evaluate the acceptability of highly liked foods (orange juices, potato chips, cookies,
and ice cream, with four samples of each). Scales were evaluated primarily on their ability to show differences in acceptability, the correspondence of acceptance ratings to preference ranking and the correspondence of stated product usage (e.g., purchase of pulp vs. non-pulp orange juice) to the product
scoring highest. All three scales performed equally well, with no one scale showing a consistent superiority over another. All three scales were able to differentiate acceptability of the orange juices, chips and
cookies. No scale differentiated among the ice creams, which had equal and high acceptability. All scales
showed a strong correspondence between liking and preference rankings and also between the product
rated highest and the type of product usually consumed, within each of the product categories.
© 2009 Elsevier Ltd. All rights reserved.
1. Introduction
The labeled affective magnitude scale (LAM, Fig. 1) was developed by Schutz and Cardello (2001) as an alternative to the commonly used 9-point category scale for measuring food
acceptability (Jones, Peryam, & Thurstone, 1955; Peryam & Girardot, 1952; Peryam & Pilgrim, 1957). The LAM scale was an extension of the labeled magnitude scale (LMS) for psychophysical
intensity scaling developed by Green, Shaffer, and Gilmore
(1993), based on earlier work by Borg (1982) for a so-called “category–ratio scale”. The LAM scale has been used recently for evaluation of consumer liking for teas (Chung & Vickers, 2007a,b), to
study the genetic factors in sweet taste perception (Keskitalo
et al., 2007) and in a comparison of younger and older persons’ liking
for different orange juices (Forde & Delahunty, 2004). Recently, Jaeger and Cardello (2009) compared the LAM to best–worst scaling
and found approximate parity in discrimination. The LMS and
LAM scales may have the following properties:
* Corresponding author. Tel./fax: +1 607 255 7363.
E-mail address: [email protected] (H.T. Lawless).
doi:10.1016/j.foodqual.2009.06.009
(1) Because the scales were based on magnitude estimates
(ratio scaling instructions) of the verbal anchor word meanings, the resulting scale values are thought by some to represent ratio scale data (Stevens, 1971). If true, one could
make valid statements such as “this product was liked twice
as much as that one”. The LMS and LAM produce data similar
to that from magnitude estimation (Green et al., 1993;
Schutz & Cardello, 2001).
(2) Because the scales had a high end anchor of “strongest imaginable” for the LMS or “greatest imaginable like” (or dislike) for
the LAM, subjects in these scaling studies might have a similar idea of the intensity of the experience suggested by
those phrases, and thus be placed on the same subjective
scale. This was based on an argument by Borg (1982), who,
for example, in studying perceived effort or exertion,
thought that exerting oneself maximally, i.e., to the point
of exhaustion, should be a similar experience among different people. This assumption was later challenged (Bartoshuk
et al., 2002).
(3) Because the scales have commonly understood labels (weak,
moderate, strong), the data could be interpreted in light of
these labels, unlike magnitude estimation, which produced
scale data based on proportions (i.e., one stimulus was twice
as strong as another), but with no absolute anchor for
whether these experiences were weak or strong (one could
be twice the other but both could be weak).
GREATEST IMAGINABLE LIKE
LIKE EXTREMELY
LIKE VERY MUCH
LIKE MODERATELY
LIKE SLIGHTLY
NEITHER LIKE NOR DISLIKE
DISLIKE SLIGHTLY
DISLIKE MODERATELY
DISLIKE VERY MUCH
DISLIKE EXTREMELY
GREATEST IMAGINABLE DISLIKE
Fig. 1. The LAM scale (after Schutz and Cardello (2001)).
Regarding food acceptability testing, the question arises as to
whether the LAM scale provides any advantages over the commonly used 9-point hedonic scale. An important criterion is
whether one scale is better at finding differences among products
(see, for example, Lawless & Malone, 1986). In the original set of
studies (Schutz & Cardello, 2001), performance of the LAM scale
and the 9-point scale was very similar in this regard. Two direct
comparisons were conducted, one involving 51 food names and
one involving five foods that were actually tasted. Correlations between the mean values obtained on the two scales were +0.99 for
the 51 food names and +0.98 for the tasted foods. Statistical differentiation was almost equivalent in both cases. Analysis of variance
for the tasted foods showed 27.6% of variance accounted for by the
food differences for the 9-point scales and 26.6% for the LAM. For
the food names, there were 467 pairs of means (out of 1275 possible comparisons) that were significantly different for the LAM scale
and 459 for the hedonic scale (not significantly different by binomial test on proportions). The only appreciable difference that
was found was in an examination of foods that scored above the
overall mean across products, i.e., well-liked foods, and considering
only those pairings in which one scale showed a difference but the
other did not (87 possible pairs, suggesting that about 43 would be
above the mean). In these specific cases, the LAM scale was responsible for 86% of the differences (37 out of 43). For foods below the
grand mean the split was about even. The higher end of the scale
range was used more frequently with the LAM scale, consistent
with the idea that it might be valuable for differentiating well-liked foods.
The performance of the two scales has been evaluated in several
other direct comparisons. Greene, Bratka, Drake, and Sanders
(2006) examined consumers’ reactions to peanuts with fruity-fermented flavor defects. The 9-point hedonic scale only uncovered
one significant pair of differences, whereas the LAM scale showed
four pairs of significant differences (out of 12 possible). Hein, Jaeger, Carr, and Delahunty (2008) performed a comparison of the
9-point, LAM, a line scale, ranking and best–worst scaling in a replicated test of breakfast bars with large groups of consumers. Best–
worst is a variation of choice/ranking whose analyses yield scale
values. Among the other three true scaling methods, the first replication showed similar discrimination (similar F-ratios) for the
LAM, line scale and 9-point, but the 9-point had a much higher
F-ratio on the second replicate and showed more paired differences
among means. El Dine and Olabi (2009) found similar performance
of the LAM and 9-point scale in differentiating a set of both familiar
and novel foods, with the LAM scale differentiating better among
the three highest rated items, a finding in line with the original
observation of Schutz and Cardello (2001).
Another criterion for comparing scales concerns the ability of
the scale to detect different patterns of preference in consumer
segments. Recently, Villanueva and Da Silva (2009) compared the
traditional 9-point hedonic scale to a hedonic line scale, which
was called a “hybrid” scale, previously studied by this group
(Villanueva, Petenate, & Da Silva, 2005). The authors introduced a
potentially important criterion for comparing the effectiveness of
hedonic scales that has rarely been used: the segmentation of consumers as shown by internal preference mapping. They concluded
that the hybrid scale has superior properties in terms of its ability
to uncover segments of consumers in an evaluation of eight red
wines. Such a comparison has not been made between the 9-point
scale and the LAM scale. Another useful relationship is between
acceptance ratings and consumer preferences in the real world.
Although there are many reasons why stated usage might not correspond to liking ratings (for example, I might prefer a certain style
of potato chip but its cost might discourage me from frequent purchases), one would expect at least a moderate correlation across a
group of individuals from different preference segments.
Given the relatively few direct comparisons of the LAM scale to
the 9-point scale, sensory professionals might be cautious in
substituting the newer LAM for the 9-point hedonic scale, an
industry standard. However, there are some hints in the literature
that an expanded 11-point scale could be useful for measuring
product acceptability. Peryam (1989), in his reflections on the early
days of sensory science, offered the following observation:
“Why does the hedonic scale have nine categories, rather than
more or less? Economy perhaps? Preliminary investigation
had shown that discrimination between foods and reliability
tended to increase up to eleven categories, but we encountered,
in addition to the dearth of appropriate adverbs, a mechanical
problem due to equipment limitations. Official government
paper was only 8″ wide and we found that typing eleven categories horizontally was not possible. So we sacrificed a theoretical modicum of precision for a real improvement in efficiency
at the moment.” p. 23.
He further pointed to a potential advantage to having more
room for positive evaluations on a scale as follows: “An 8-point
unbalanced scale with more ‘like’ than ‘dislike’ categories was
shown to be somewhat better than the standard 9-point one, but
only when dealing with relatively well-liked foods.” p. 24.
Thus there is a need for further evaluation of the performance of
the LAM and 9-point scales in head-to-head comparisons over
different products and conditions. The LAM scale has potential
advantages, with greater room for more extreme responses than
the 9-point scale. It is not clear whether the added phrases are
key or whether the added line length is important as well. To see
whether the line itself made any difference we included a scale
with the LAM’s verbal phrases only, similar to the original portrayal of the 9-point hedonic scale (Peryam & Girardot, 1952).
The main objective of the study was to compare the scales using
four criteria: (1) the ability to differentiate products, (2) the ability
to differentiate consumer segments, (3) relation of acceptance
scores to consumption choices, and (4) reliability. Representative
products of four product categories were used in this study.
2. Methods
2.1. Participants
A total of 302 consumers completed the study, 99 using the
LAM scale, 103 using the 11-point category scale and 100 using
the 9-point category scale. Consumers were recruited in four cities from Peryam
and Kroll (Chicago, IL) databases of persons available
for testing and represented a range of ages (37% from 18 to 35, 43%
from 36 to 55, and 20% from 56 to 65 years of age) and both genders about equally (50.5% male and 49.5% female). None of the three
groups deviated from these overall percentages by more than
±2% and there was no significant difference among groups in a χ² test. All were consumers
of the four products tested, as determined by pre-screening and
confirmed by a demographic questionnaire that included product
usage questions. Consumers were excluded during recruitment if
they indicated they had participated in any food or beverage test
in the last three months (an industry standard). Information on
the product usage questionnaire and response options is found in
the Appendix.
2.2. Products
Products consisted of four commercially available versions of
orange juice, chocolate chip cookies and potato chips. Four vanilla
ice cream samples were also presented with one sample being
duplicated. The products were pre-screened to produce a range
of sensory characteristics and some potential preference differences, i.e., different degrees of appeal to different consumer
segments. The orange juices included one popular refrigerated
not-from-concentrate product, one refrigerated high-pulp not-from-concentrate product, and two shelf-stable (from concentrate)
products from different manufacturers. The chips consisted of one
popular “normal” potato chip product, two kettle-style chips from
different manufacturers, and one ridged/wavy style chip. The chocolate chip cookies consisted of one popular “normal” style cookie,
one chunky style, and two soft/chewy style cookies from different
manufacturers. The vanilla ice creams consisted of one labeled “vanilla bean,” one labeled “extra creamy,” and one labeled “French
vanilla” which was served in duplicate. Major market leading
brands were used to ensure their availability in the four testing
sites. The brand names are available from the first author. All products were served as small samples in cups coded with random
three-digit numbers. Samples within a product type were served
in randomized orders, but the four types of products were served
in separate groups in the following order: juices, chips, cookies,
then ice creams. This order was used so that no strong flavor contrast would occur that might lead to negative reactions, as might
have happened if the high-acid juices followed the ice creams.
2.3. Test methods
Tests were conducted in four central location test sites in White
Plains, NY, Santa Ana, CA, Plano, TX and Chicago, IL. Questionnaires
were administered on computer screens in a classroom style setting. Once instructions were given, participants received practice
in using their assigned scale by rating their liking or disliking for
two color swatches, produced on separate “pages” of the computer
questionnaire. The products were then served, one group at a time.
The LAM scale consisted of a vertical line scale with anchor
words spaced according to the spacings provided by Cardello
and Schutz (2004). Scales were about the same size on the
screen, averaging 113 mm in length. LAM markings were re-coded
to a 200-point scale based on a pixel count recorded by the computer system.
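The re-coding of LAM markings can be illustrated with a short sketch. It assumes a simple linear mapping from the vertical pixel position of the mouse click to a score between 0 and 200, with 200 at the top of the line. The text states only that a pixel count was converted to a 200-point scale, so the function name, the pixel coordinates and the direction of the mapping are hypothetical.

```python
def lam_pixels_to_score(click_y: float, top_y: float, bottom_y: float) -> float:
    """Map a click on a vertical line to a 0-200 score (assumed linear).

    Screen y grows downward, so the top of the line (greatest imaginable
    like) has the smaller pixel value and maps to 200.
    """
    if bottom_y <= top_y:
        raise ValueError("bottom_y must be greater than top_y")
    if not top_y <= click_y <= bottom_y:
        raise ValueError("click lies outside the line")
    fraction_from_bottom = (bottom_y - click_y) / (bottom_y - top_y)
    return 200.0 * fraction_from_bottom


# Hypothetical line drawn from pixel row 50 (top) to pixel row 450 (bottom).
print(lam_pixels_to_score(250, top_y=50, bottom_y=450))  # click at the midpoint
```

Under this assumed mapping a click at the middle of the line scores 100, which would correspond to the neutral anchor only if the anchors are placed symmetrically about the center.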
Instructions to participants for the 9- and 11-point scales read
as follows:
“Thank you for participating in today’s taste test. We are interested in how much you like or dislike various products. You will
make your like/dislike ratings using the computer. Before we
start we would like to show you the scale we would like you
to use to make your ratings.” Then on a new screen, “This is
the scale we would like you to use. You will use your mouse
to click on any of the words in order to indicate how much
you like or dislike the product.”
The screen then showed the following nine phrases, centered on
the page and arranged vertically from top to bottom in this order:
like extremely, like very much, like moderately, like slightly, neither like nor dislike, dislike slightly, dislike moderately, dislike very
much and dislike extremely. For the 11-point scale, the phrases
greatest imaginable like and greatest imaginable dislike were
placed at the top and bottom, respectively. For the LAM scale, the
instructions were “You will use your mouse to click anywhere on
the line to indicate how much you like or dislike the product.”
Practice was then given by having them rate their liking or disliking for two color swatches that appeared sequentially on the
screen. Instructions for the product testing appeared after the second color sample screen and read as follows:
“Now we are ready to begin the taste test. In order to cleanse
your palate, please take a sip of water now and in between each
of the samples you will be tasting today.
First you will be rating four samples of (PRODUCT NAME). We
will serve the four samples one at a time. Please start by evaluating the juice on the far left. Make sure the sample number on
your cup matches the sample number on your screen. Please
drink enough of the sample to form an opinion, then provide
us your rating.
After rating the first sample, continue with the next sample
working from left to right. While evaluating a sample, do not
re-taste any of the samples that you previously rated. There will
be a short wait time in between each sample to allow you to
cleanse your palate by taking a sip of water. When you have
received your samples, please click below to continue.”
The next screen asked the participant to “Please Evaluate Sample 241” or some other three-digit random code, followed by the
phrase “Overall, how much do you like or dislike this sample?”
The scale was positioned below that phrase and a continue button
was at the bottom of the screen. In between samples a screen with
a counter was visible with the instructions, “Before proceeding to
the next sample/question, please have a sip of water. When the
counter reaches zero you will be able to continue.”
Following the final sample in the set of four, a preference ranking was obtained by having them click on their first through fourth
best samples, with the numbers 1–4 appearing over the code for
each sample. It was possible to re-taste the samples and possible
to alter their rankings. After the test was concluded, demographic
information (age, gender) and product usage information was collected. Details of the coding scheme for product usage are given in
the Appendix.
2.4. Analysis
Analysis of variance (ANOVA) was conducted for each food and
scale using SAS PROC GLM. Duncan multiple range tests were used
to examine significance of differences among means. Product discrimination was assessed by the size of the F-ratio from the ANOVA
for product effects, the discrimination ratio (DR), the intraclass correlation coefficient (ICC) and the number of distinct groups arising
from the Duncan tests. See Levy, Morris, Hammersley, and Turner
(1999) for a discussion of the DR and ICC.
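The link between the ICC and the DR can be sketched numerically. The snippet below assumes the relation DR = sqrt((1 + ICC) / (1 − ICC)). This form is not stated in the text; it is an assumption that happens to reproduce the DR values reported in Table 1 to within the rounding of the published ICC values, and the function name is hypothetical.

```python
import math


def discrimination_ratio(icc: float) -> float:
    """Assumed form: DR = sqrt((1 + ICC) / (1 - ICC))."""
    if not 0.0 <= icc < 1.0:
        raise ValueError("icc must be in [0, 1)")
    return math.sqrt((1.0 + icc) / (1.0 - icc))


# ICC values as reported in Table 1 for the cookies (9-point, 11-point, LAM).
for label, icc in [("9-point", 0.984), ("11-point", 0.972), ("LAM", 0.975)]:
    print(label, round(discrimination_ratio(icc), 2))
```

Because the ICC values are published to three decimals and the relation is steep near 1, small rounding differences in the ICC can move the DR noticeably, so exact agreement with every table entry should not be expected.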
Cross-tabulations were made of the type of product consumed
most often with the product scoring highest for each individual.
For each food, a decision had to be made about how to treat tied
scores. A somewhat different approach was taken for each food,
depending upon the number of ties and frequency of use of the
“other” category for the food most frequently purchased. See
Tables 3–5 for further information. Cross-tabulation of the chips
counted the reported usage (product type reported to be that most
frequently consumed) against the highest scoring product type.
Frequent tied scores were observed with the chips. For purposes
of measuring concordance, ties were assigned to the highest
reported usage type. χ² analysis was performed omitting the
“other” column and row (upper left 3 × 3). For the orange juices,
the pulp-no pulp dichotomy was robust, and ties were not counted
as concordant in this analysis. χ² analysis was performed omitting
the “other” column and tied row (upper left 2 × 2 cells only). For
the cookies, ties were assigned to the highest scoring product in
the tie, as was done with the chips. χ² analysis was performed
omitting the “other” row (upper left 3 × 3). Contingency coefficients were based on the same cells that contributed to the χ²
analyses.
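As an illustration of these two statistics, the sketch below computes an uncorrected Pearson χ² and the Pearson contingency coefficient C = sqrt(χ² / (χ² + N)) for a table of counts. Both choices are assumptions, since the text does not say whether a continuity correction was applied or which contingency coefficient was used; the function names are hypothetical. The example counts are the four pulp/no-pulp cells from the 9-point block of Table 3.

```python
import math


def pearson_chi_square(table):
    """Uncorrected Pearson chi-square for a rectangular table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n


def contingency_coefficient(chi2, n):
    """Pearson contingency coefficient C = sqrt(chi2 / (chi2 + n))."""
    return math.sqrt(chi2 / (chi2 + n))


# The four pulp / no-pulp counts from the 9-point block of Table 3.
counts = [[11, 24], [28, 2]]
chi2, n = pearson_chi_square(counts)
print(round(chi2, 1), round(contingency_coefficient(chi2, n), 2))
```

With these counts the sketch gives values that round to the χ² of 25.8 and the contingency coefficient of +0.53 reported for the 9-point scale, which suggests, without confirming, that this is the form of the calculation used.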
3. Results
3.1. Product discrimination
The overall pattern of results was that the scales worked about
equally well. All three methods were able to differentiate the chips,
cookies and orange juice products with a high degree of statistical
significance. None of the scales were able to differentiate the ice
creams, which had roughly equal and high acceptability. It is possible that this failure was due to the ice creams being presented
last and that some fatigue had set in. Table 1 shows the F-ratios,
intraclass correlations, discrimination ratios, and number of differentiated Duncan groups for the juices, cookies and chips. For the
juices, the LAM scale performed best (9-point second). For the
cookies, the 9-point scale performed best (LAM second) and for
the chips, the 11-point scale performed best (9-point second). Thus
there was no consistent pattern of superiority for any scale method
across the three differentiated product types. Rank orders of the
mean values were virtually identical among the scales for the
juices, cookies and chips with only small reversals occurring between product pairs that were not significantly different. Additional evidence of the consistency of the three methods was
shown by the correlations across the 12 product means from the
juices, cookies and chips. Correlation coefficients were +0.950 for
the 9- and 11-point scales, +0.939 for the 11-point and LAM scales
and +0.955 for the 9-point and LAM scales. Table 2 shows the mean
scores, standard errors of the means and Duncan test groups.
Table 1
Indices of product differentiation in hedonic ratings.
Product    Index            9-Point    11-Point    LAM scale
Cookies    F-ratio          37.64      25.38       27.06
           ICC              0.984      0.972       0.975
           DR               11.13      8.39        8.89
           Duncan groups    3          2           3
Chips      F-ratio          20.88      22.83       10.62
           ICC              0.955      0.956       0.906
           DR               6.57       6.68        4.50
           Duncan groups    4          2           2
Juices     F-ratio          24.67      19.82       34.61
           ICC              0.959      0.949       0.995
           DR               6.94       6.20        21.06
           Duncan groups    2          3           3
Notes: F-ratio is the product F-ratio. ICC = intraclass correlation, ratio of systematic
product variance to total product variance (of means). DR = discrimination ratio,
number of potential product groups differentiated by this level of error and systematic variance. Duncan groups: numbers of product groups differentiated by
Duncan’s multiple range tests.
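The correlation across product means can be checked directly from the means listed in Table 2. The sketch below assumes an ordinary Pearson correlation over the 12 juice, chip and cookie means; the text does not name the coefficient, so this is an assumption, and the helper function is hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Product means for juices, chips and cookies as listed in Table 2.
nine_point = [6.98, 7.11, 5.36, 5.64, 6.94, 7.40, 6.30, 5.69,
              7.13, 7.02, 5.71, 4.78]
eleven_point = [8.01, 7.77, 6.05, 6.63, 8.09, 8.62, 6.63, 6.73,
                8.06, 7.77, 6.27, 6.04]
print(round(pearson_r(nine_point, eleven_point), 3))
```

For the 9-point and 11-point means this comes out close to the reported +0.950. The same function can be applied to the LAM column to check the other two coefficients.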
Discrimination of products, as indicated by the discrimination
ratio, was associated with a wider use of the available scale (see
Fig. 2). Scale range here was defined as the highest minus lowest
product mean in that category, divided by the total length of the
scale. This would be expected if the size of the error variance
was approximately the same for all products and scales. The data
shown in Fig. 2 do not include the ice creams for which there
was no evidence of product discrimination in the ANOVAs.
3.2. Reliability (duplicate sample analysis)
Correlations were assessed for the scores of the pair of identical
ice cream samples. For the 9-point scale, the correlation was
+0.518, for the 11-point scale, +0.362 and for the LAM scale,
+0.521. The LAM and 9-point scale were marginally better than
the 11-point using Fisher’s Z transformation (p = 0.081). Note that
there was some range compression in these values due to the consistently high ratings given to the ice cream products. Thus the correlations shown here are probably a low estimate of the reliability
of these methods, and a product system with a wider range of consumer opinions would most likely show a higher correlation value.
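A comparison of two independent correlations through Fisher's Z transformation can be sketched as follows. The snippet assumes that every panelist in a group rated both duplicate samples (so the group sizes from Section 2.1 serve as n) and reports a two-sided p-value. The text does not say whether the reported p = 0.081 was one- or two-sided, or whether the 9-point and LAM groups were combined, so the output of this sketch need not match that figure.

```python
import math


def compare_correlations(r1, n1, r2, n2):
    """Fisher z test for two independent correlations.

    Returns the z statistic and a two-sided p-value from the normal
    distribution.
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_two_sided


# 9-point (r = 0.518, n = 100) versus 11-point (r = 0.362, n = 103).
z, p = compare_correlations(0.518, 100, 0.362, 103)
print(round(z, 2), round(p, 3))
```

Halving the two-sided p gives the one-sided value, which is one plausible reading of how a marginal result might have been reported.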
3.3. Correspondence of acceptance scores with preferred types of
products
All methods showed a strong correspondence between the consumers’ stated usage patterns (type of product consumed most often) and the products that scored highest for that individual. The
analysis is detailed below. No pattern of superiority emerged, with
the 9-point scale showing the strongest relationship for juices, the
11-point scale for chips and the LAM scale for cookies.
For the orange juices (see Table 3), only the two refrigerated
juices were considered, as it was possible to classify consumers
as pulp likers vs. dislikers, and there were relatively few persons
Table 2
Means, standard errors and Duncan test groups.
Product             9-Point                    11-Point                   LAM scale
                    Mean (SE)     Duncan       Mean (SE)     Duncan       Mean (SE)      Duncan
                                  group                      group                       group
Juice
  No pulp           6.98 (0.15)   A            8.01 (0.21)   A            148.6 (2.59)   A
  High pulp         7.11 (0.19)   A            7.77 (0.23)   A            140.6 (4.27)   A
  Shelf-stable 1    5.36 (0.23)   B            6.05 (0.25)   C            100.0 (4.81)   C
  Shelf-stable 2    5.64 (0.20)   B            6.63 (0.25)   B            114.3 (4.47)   B
Chips
  Regular           6.94 (0.17)   A            8.09 (0.17)   A            142.1 (3.47)   A
  Wavy              7.40 (0.12)   B            8.62 (0.16)   A            150.1 (3.03)   A
  Kettle 1          6.30 (0.20)   C            6.63 (0.28)   B            129.7 (4.23)   B
  Kettle 2          5.69 (0.22)   D            6.73 (0.26)   B            123.6 (4.49)   B
Cookies
  Chunky            7.13 (0.16)   A            8.06 (0.23)   A            144.3 (2.79)   A
  Regular           7.02 (0.17)   A            7.77 (0.19)   A            142.1 (3.02)   A
  Soft/chewy 1      5.71 (0.23)   B            6.27 (0.29)   B            122.4 (4.74)   B
  Soft/chewy 2      4.78 (0.22)   C            6.04 (0.25)   B            106.3 (4.78)   C
Ice cream
  “Vanilla bean”    7.36 (0.16)   –            8.18 (0.21)   –            143.5 (3.35)   –
  Duplicate 1       7.13 (0.16)   –            8.10 (0.20)   –            149.1 (3.12)   –
  Duplicate 2       7.17 (0.15)   –            8.24 (0.19)   –            147.9 (2.81)   –
  Extra creamy      7.09 (0.17)   –            8.39 (0.20)   –            149.3 (2.93)   –
Total Duncan groups               9                          7                           8
Table 3
Orange juice usage and type scoring highest.
Type consumed (response)
9-Point scale
Highest scoring
Pulp
No pulp
Tied
No pulp
Pulp
Both or other
11
24
6
28
2
10
10
1
8
Concordant cells shown in bold face. χ² = 25.8, p < .001
Contingency coefficient = +0.53
11-Point scale
Highest scoring
Pulp
No pulp
Both/other
11
33
7
14
8
7
11
6
5
Concordant cells shown in bold face. χ² = 9.3, p < .01
Contingency coefficient = +0.35
Fig. 2. Discrimination ratio vs. percent of scale range from lowest to highest mean
value, across the three products and three scaling methods.
who scored the shelf-stable juices higher than the refrigerated
ones. The analysis compared which juice scored higher (pulp or
no pulp) and which group the individual belonged to regarding
their type of juice drunk most often (pulp, no pulp, both). χ² analysis was performed with the tied scores and “both” categories
omitted. All three methods showed significant associations
between the higher scoring juice and the reported usage type.
However, in terms of χ² magnitude and contingency coefficients,
the 9-point scale was superior to the 11-point scale and LAM scale.
For the chips (see Table 4), all three scaling methods showed a
strong correspondence between the highest rated chip and the person’s reported type of chip consumed most frequently. In terms of
χ² association, all were significant and showed the expected
pattern, i.e., the highest counts in the diagonal. In terms of χ²
and contingency coefficients, the order from highest to lowest
was 11-point > 9-point > LAM.
LAM scale
Highest scoring
Pulp
No pulp
Both/other
10
23
4
24
13
9
8
6
7
Concordant cells shown in bold face. χ² = 8.34, p < .01
Contingency coefficient = +0.33
Note: Ties were not counted as concordant in this analysis. χ² analysis was performed omitting the “other” column and tied row (upper left 2 × 2 cells only).
For the cookies (see Table 5), again all three scaling methods
showed a strong correspondence between stated usage and highest
acceptance scores. In terms of χ² association, all were significant
and showed the expected pattern, i.e., the highest counts in the
diagonal. In terms of χ² and contingency coefficients, the order
from best to worst was LAM > 11-point > 9-point. For the cookies,
all the methods showed a strong pattern of dislikers of soft/chewy
cookies among those who consume regular or chunky-chip cookies. Chunky chip cookies scored well among those consumers
avowing to eat regular chocolate cookies most often.
Table 4
Potato chip usage and type scoring highest.
                    Type consumed (response)
Highest scoring     Kettle    Ridged    Regular    Baked, other
9-Point scale
  Kettle            14        4         4          5
  Ridged            3         18        5          4
  Regular           0         5         12         3
  Other ties        1         0         4          7
Concordant cells lie on the diagonal. χ² = 41.3, p < .001
Contingency coefficient = +0.612
11-Point scale
  Kettle            16        1         2          3
  Ridged            5         18        5          7
  Regular           1         7         17         4
  Other ties        0         1         3          9
Concordant cells lie on the diagonal. χ² = 50.8, p < .001
Contingency coefficient = +0.643
LAM scale
  Kettle            17        3         3          7
  Ridged            3         19        7          5
  Regular           1         4         14         4
  Other ties        0         1         4          7
Concordant cells lie on the diagonal. χ² = 36.4, p < .001
Contingency coefficient = +0.602
Table 5
Chocolate chip cookie usage and type scoring highest.
                    Type consumed (response)
Highest scoring     Regular    Chunky    Soft/chewy
9-Point scale
  Regular           24         9         7
  Chunky            12         12        1
  Soft/chewy        1          2         14
  Other ties        4          2         2
Concordant cells lie on the diagonal. χ² = 36.7, p < .001
Contingency coefficient = +0.55
11-Point scale
  Regular           16         3         5
  Chunky            15         25        5
  Soft/chewy        4          1         19
  Other ties        3          0         2
Concordant cells lie on the diagonal. χ² = 49.6, p < .001
Contingency coefficient = +0.59
LAM scale
  Regular           24         4         4
  Chunky            13         22        1
  Soft/chewy        3          1         22
  Other ties        1          0         2
Concordant cells lie on the diagonal. χ² = 75.5, p < .001
Contingency coefficient = +0.67
Fig. 3. Frequency histogram of ratings for one cookie sample for the LAM and 11point scales showing evidence of a disliker segment.
such segments that were greater than 10% of the sample. These
products were the soft/chewy cookies, the kettle cooked chips,
and all four orange juices.
Because there are no formal criteria for what constitutes a sufficiently large ‘‘bump” in a frequency distribution to qualify as a
segment, seven criteria for a disliker segment were formulated that
seemed reasonable. The use of seven criteria was designed to prevent any spurious results from any single measure, since none
could be considered ideal. We considered the literature on specific
anosmias (Amoore, Venstrom, & Davis, 1968), which uses evidence
of a minor mode segment a given distance from the population
mean. These criteria were as follows: the frequency (percent) in
more than two categories less than the mode (mean of 22, criterion
set at 20% or greater), the frequency less than four categories from
the mode, (mean of 14, criterion set at 10%) the percent below the
neutral point (mean of 30%, criterion set at 25%) and the percent
below ‘‘dislike slightly” (mean of 20%, criterion set at 15%). Pass/fail
criteria included whether there was an antimode at neutral or dislike slightly (at least one lower category with a higher frequency),
whether there were two lower categories with a higher frequency
and whether there were three lower categories with a high frequency. These multiple criteria, varying in stringency, were used
to prevent any small random variation leading to a ‘‘detection”
overly influencing the analysis. An example of segmented ratings
distributions is shown in Fig. 3.
To apply these criteria to the line scale, data from the LAM scale
were converted to an 11-point basis by dividing the line at the
midpoints between verbal anchor points. Frequency distributions
were then tallied for the categorical form of the data for comparison to the other two scales. In order to place the 9- and 11-point
scales on the same footing, a further adjustment was required.
For the first four criteria, i.e., the actual frequency counts, the 9-point scale tallies were adjusted to include 18% of the counts in
the next higher category to account for the smaller range
(2/11 = 0.18).
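The conversion of line ratings to categories described above can be sketched as follows. This is a minimal illustration rather than the authors' procedure: the anchor positions in `ANCHORS` are rounded placeholder values on a -100 to +100 basis and are assumptions, as are the function names and the rule that a rating falling exactly on a midpoint is assigned to the higher category.

```python
from bisect import bisect_right

# Hypothetical anchor positions (lowest to highest), for illustration only;
# substitute the positions actually printed on the scale being analyzed.
ANCHORS = [-100.0, -76.0, -56.0, -32.0, -11.0, 0.0, 11.0, 36.0, 56.0, 74.0, 100.0]


def midpoint_cuts(anchors):
    """Return the boundaries halfway between each pair of adjacent anchors."""
    return [(low + high) / 2 for low, high in zip(anchors, anchors[1:])]


def to_category(rating, anchors=ANCHORS):
    """Map a line-scale rating to a category number from 1 to len(anchors)."""
    # bisect_right counts the boundaries at or below the rating, so a rating
    # exactly on a boundary goes to the higher category (an assumption).
    return bisect_right(midpoint_cuts(anchors), rating) + 1


def tally(ratings, anchors=ANCHORS):
    """Count how many ratings fall in each category (index 0 = category 1)."""
    counts = [0] * len(anchors)
    for rating in ratings:
        counts[to_category(rating, anchors) - 1] += 1
    return counts
```

With these placeholder anchors, a rating of 30 falls between the midpoints at 23.5 and 46 and is therefore tallied in the eighth category.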
Given these seven criteria, there were 56 possible “detections”
for the eight products (7 criteria × 8 products). A “detection” meant meeting one of the stated criteria. The LAM scale and 9-point scale did about equally
well with 39 out of 56 possible detections. The 11-point scale fared
slightly better, with 48 out of 56 possible detections.
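One way the seven criteria might be applied to a single product's frequency distribution is sketched below. The thresholds come from the text, but everything else is an assumption made for illustration: the function name, the reading of the two mode-distance criteria as "three or more" and "four or more" categories below the mode, and the reading of the antimode criteria as a count of lower categories that exceed the smaller of the neutral and "dislike slightly" frequencies.

```python
def disliker_detections(counts, neutral, slight):
    """Return seven pass/fail flags for one product's rating distribution.

    counts  -- frequency per category, ordered from most disliked to most liked
    neutral -- index of the neutral category in counts
    slight  -- index of the "dislike slightly" category in counts
    """
    total = sum(counts)
    pct = [100.0 * c / total for c in counts]
    mode = pct.index(max(pct))

    # Frequency criteria. The category ranges are one possible reading of
    # the wording in the text and are assumptions.
    far_below_mode = sum(pct[:max(mode - 2, 0)])      # 3+ categories below mode
    farther_below_mode = sum(pct[:max(mode - 3, 0)])  # 4+ categories below mode
    below_neutral = sum(pct[:neutral])
    below_slight = sum(pct[:slight])

    # Antimode criteria: count the categories below "dislike slightly" that
    # are more frequent than the trough at neutral or "dislike slightly".
    trough = min(pct[neutral], pct[slight])
    higher_below = sum(1 for p in pct[:slight] if p > trough)

    return [
        far_below_mode >= 20.0,
        farther_below_mode >= 10.0,
        below_neutral >= 25.0,
        below_slight >= 15.0,
        higher_below >= 1,
        higher_below >= 2,
        higher_below >= 3,
    ]


# Made-up 9-point distributions, not data from the study.
bimodal = [6, 8, 10, 4, 3, 10, 20, 25, 14]
unimodal = [0, 0, 1, 2, 5, 12, 30, 35, 15]
```

Summing the flags gives the number of detections for one product; repeating this over eight products gives a total out of 56, the denominator used in the text.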
3.4. Detection of disliker segments
Data were examined for groups of panelists who were dislikers
of some of the products. A disliker segment appears on a single
scale as an increase in frequency on the negative half of the scale
after a local minimum near the center (an antimode). Informal
inspection suggested that possibly eight of the products showed
3.5. Use of scale points above “like extremely”
One of the potential advantages of the LAM scale is the opportunity for panelists to score products higher than the traditional
9-point endpoint of like extremely. This may also provide some
psychological space to counteract the tendency to avoid category
scale endpoints as responses. In Cardello, Lawless, and Schutz
(2008) between 19% and 30% of participants used this part of the
LAM scale for at least one of five products. In one of the studies
contained in Schutz and Cardello (2001) a 17% usage rate was
found above “like extremely.”
Table 6 shows frequencies of usage of the space above “like extremely” for the LAM scale, and use of the uppermost category for
the 11-point category scale. Tallies were made for individuals as
well as the total number of judgments and are reported for each
product category separately. Note that the usage for the LAM scale
is consistent with the previous literature (Schutz & Cardello, 2001),
in the range of 10–20% of respondents making use of the high end
of the scale. Use of the 11th scale point was somewhat lower for
the cookies. However, this usage is still remarkable given that
the label was “Greatest Imaginable Liking.”
Table 6
Scale usage frequency above “like extremely”.
Product       LAM scale                           11-Point scale
              Users (%)   Total judgments (%)     Users (%)   Total judgments (%)
Ice creams    20.0        6.6                     20.4        7.5
Juices        10.3        3.6                     12.6        3.9
Chips         16.0        4.8                     13.6        4.6
Cookies       18.4        6.1                     8.7         3.2
3.6. Correspondence to preference rankings
All three methods showed a high degree of consistency in the
rankings of the mean scale values and the mean preference ranking. There were two reversals of ranking of the acceptance scale
means and preference ranks for the 9-point scale, only one for
the 11-point scale and none for the LAM (out of 18 possible reversals of pairwise order). The reversals of ranking were found among
pairs which were undifferentiated, i.e., not significantly different
by Duncan tests. This correspondence is perhaps not surprising
as the ranking was performed directly after the ratings, although
participants could not see their ratings and re-tasting was possible
at this point.
3.7. “Categorical” behavior with the LAM scale
A judgment was tallied as “categorical” if it fell within ±2 units of
the point designated for an anchor phrase (considering the scale
on a 200-point basis). This was the criterion previously used by Cardello et al. (2008). An individual was tallied as being “categorical”
using two criteria as follows: first, if three out of four products
showed categorical judgments and second, if all four products
showed categorical judgments. This breakdown was also expanded
to include ±3 units from the designated scale anchor points. Behaviors were tallied separately for the four product categories.
As shown in Table 7, there was a high degree of categorical
behavior, with a majority of participants (71–83%) placing at least
three out of four products within ±2 units of an anchor phrase
hash mark. This is somewhat higher than the rates seen in Cardello et al. (2008), who found 65% of a college population sample
and 50% of an ongoing Army laboratory panel to act in this manner
with at least four out of five products. To put this into perspective,
the intervals designated by ±2 units comprise about 1/4 of the total space
on the LAM scale, yet they captured about 3/4 of the data points. If
the criterion is expanded to ±3 units, then over 82% of the data fall
on about 1/3 of the usable scale. In other words, participants used
intermediate space between phrases infrequently. About half the
participants placed all four products at or very near the anchor
phrases.
Table 7
“Categorical” behavior with the LAM scale.
Product                     Categorical behavior frequency
                            ≤2 Products (%)   ≥3 Products (%)   All four products (%)
Cookies
  ±2 units from anchor      29                71                42
  ±3 units from anchor      22                78                43
Chips
  ±2 units from anchor      17                83                56
  ±3 units from anchor      14                86                59
Juices
  ±2 units from anchor      29                71                47
  ±3 units from anchor      22                78                53
Ice creams
  ±2 units from anchor      24                76                53
  ±3 units from anchor      17                87                56
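The categorical-behavior tally and the "1/4 of the total space" comparison can be illustrated with a short sketch. It is not the authors' code: the anchor positions are rounded placeholders on a -100 to +100 (200-point) basis, the function names are hypothetical, and `window_share` assumes 11 non-overlapping windows with a full window counted at each end anchor.

```python
# Hypothetical anchor positions on a 200-point basis, for illustration only.
ANCHORS = [-100.0, -76.0, -56.0, -32.0, -11.0, 0.0, 11.0, 36.0, 56.0, 74.0, 100.0]


def is_categorical(rating, anchors=ANCHORS, tol=2.0):
    """True if the rating lies within tol units of any anchor position."""
    return any(abs(rating - a) <= tol for a in anchors)


def classify_participant(ratings, anchors=ANCHORS, tol=2.0):
    """Tally one participant's judgments for the four samples of a product.

    Returns (n_categorical, at_least_three, all_categorical); the two flags
    mirror the two participant-level criteria described in Section 3.7.
    """
    n = sum(is_categorical(r, anchors, tol) for r in ratings)
    return n, n >= 3, n == len(ratings)


def window_share(n_anchors=11, tol=2.0, scale_length=200.0):
    """Fraction of the line covered by the +/- tol windows around the anchors."""
    return n_anchors * 2 * tol / scale_length


# Made-up ratings for one participant, not data from the study.
example = classify_participant([36.5, 55.0, 20.0, 74.0])
```

Under these assumptions the ±2 windows cover 44 of 200 units (0.22, roughly 1/4) and the ±3 windows cover 66 of 200 units (0.33, roughly 1/3), in line with the proportions quoted in Section 3.7.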
4. Discussion
To our knowledge, this is the first large-scale consumer study
comparing the LAM and 9-point hedonic scales in several different
product systems. A further advantage of this study is the between-groups comparison, a design also adopted by Greene et al. (2006),
Hein et al. (2008), and El Dine and Olabi (2009). Thus each participant used only one type of scale and was not influenced by recent
experience with another scale type. Early studies of the LAM used
within-subjects comparisons but people received the scales on different days (Schutz & Cardello, 2001, Experiments 4 and 5). Some
studies have shown that the LAM scale is as good as or better than
other scales for differentiating products with respect to their consumer acceptance ratings (El Dine & Olabi, 2009; Greene et al.,
2006; Schutz & Cardello, 2001). Our results are in line with the
other large-scale consumer study, that of Hein et al. (2008), who examined performance of five scaling methods with a sports/snack bar among
New Zealand consumers. They observed parity between the LAM
and 9-point scale on the first replicate, but superior performance
of the 9-point scale on a second trial. In the present study we saw instances where the LAM scale performed sometimes better and sometimes worse.
However, from a statistical perspective these differences were not
large. All scales were able to differentiate three of the foods at high
levels and none could discriminate among the ice cream samples.
This result is consistent with Peryam’s statement from the early
days of trying different scale versions, to wit: “All hedonic scales
seem to measure what they are intended to measure rather effectively, as long as no gross mistakes are made” (Peryam, 1989, p.
23). Further evidence for this level of consistency comes from
the correlations among product means, which Schutz and Cardello
(2001) found to be as high as +0.99, and which we found to be close to
+0.94.
One potential reason for the performance of the LAM scale
might be the high degree of categorical scoring with the LAM seen
in this study. This kind of behavior has been noted for line scales
and is especially pronounced for magnitude estimation, where it
takes the form of subjects frequently using numbers that are multiples
of two and five (Baird & Noma, 1978; Giovanni & Pangborn, 1983).
Given a criterion for categorical scoring of making marks on at
least three out of four products within two scale points of a phrase
(on a 200-point basis), we observed from 71% (with juices) to 83%
(with chips) of consumers fitting this criterion. This is somewhat
higher than the rates seen in Cardello et al. (2008): 50% of a laboratory panel and 65% of a college student panel. Apparently, to first-time
users of this scale in a real consumer population, the phrases are
quite compelling even if they are told to make a mark “anywhere
on the line.” This was accompanied by usage of the area above
the phrase ‘‘like extremely” by only 10–20% of the participant pool,
and a total frequency of usage at about 5% of all judgments. If the
higher category option is used less often, one might expect performance to resemble that of the 9-point scale (assuming little or no
end use avoidance with that scale). Future research could examine
the effects of instructions, examples and practice to see if more
complete use of the LAM scale can be encouraged in first-time
users.
The effects of number of categories (9 vs. 11) and the potential
advantage of having a line to mark remain unclear. Bendig and
Hughes (1953) found an increase in information transmission as
the number of categories for auditory scaling increased to 11, but
some minor loss in reliability going from nine to eleven. The present study showed no great advantage to an 11-point scale over
nine categories, even though the avoidance of end categories is
an often-cited shortcoming of the 9-point scale (e.g., Villanueva
et al., 2005). The question remains of whether there is any advantage to a line scale or a less structured scale with fewer phrases (or
none at all). Villanueva and colleagues have developed a so-called
hybrid line scale with only three anchors (the two ends and the
middle neutral anchor phrase) that is similar to a line scale used
in quantitative descriptive analysis (Lawless & Heymann, 1998),
although there are some small pip or dot marks across ten intervals. This scale is reported to show some advantages over the 9-point scale (Villanueva & Da Silva, 2009; Villanueva et al., 2005).
Yao et al. (2003) found wider scale usage with an ‘‘unstructured”
9-point category scale with only numbered boxes and no phrases
attached. One way to potentially remove the categorical behavior
with the LAM scale would be to strip off the interior labels entirely.
This approach was used by Wright (2007) in a study examining the appeal of
military field ration packaging, although it was not compared directly to the LAM scale itself.
The criteria for what makes a ‘‘good” scale for sensory evaluation have generally been practical, as opposed to some of the theoretical arguments that are seen in psychophysics (Baird & Noma,
1978). Paramount has been the ability of the scale to detect
differences, in this case in the consumer acceptability of food
products. Reliability, defined as the ability of the measuring
instrument to give the same value over repeated measurements,
has long been a criterion for any quantitative sensory method.
To this we have added some validity criteria, namely the correlation of acceptance with ranked preference, the correspondence of
the results to reported product usage habits, and the ability of the
scales to detect consumer segments. To the extent that usage of a
wider range of the scale is associated with better discrimination
(see Fig. 2), that is also a potentially useful criterion. Factors
which tend to compress ratings or limit the usage of the scale
would be generally undesirable (Cardello et al., 2008). Along
these lines we have noted a surprisingly high incidence of categorical behavior. Whether this is detrimental to the overall functioning of the scaling method remains to be seen, but it could
form another criterion, albeit a negative one to perhaps be
minimized.
In conclusion, the results of this study agree with the assertion
of Peryam (1989) that hedonic scales measure what they are intended to measure rather effectively, as long as no obvious mistakes are made (for example, having too few categories, or
perhaps using a unipolar scale lacking a neutral point). With these
groups of products and these consumers, which closely approximated the conditions of a commercial central location test, the
LAM and the 9-point scale both performed well.
Acknowledgement
The authors thank Terry Mongoven for assistance in supervising
the field study.
Appendix. Details of Product Usage Questionnaire
Product usage questions followed demographic questions on
gender, age, and education.
For orange juice the questions were as follows:
“How often do you drink orange juice?”
Seven response categories were offered: every day, every 2–
3 days, once a week, every 2–3 weeks, once a month, once every
2–3 months and once every 4–6 months.
‘‘What type of orange juice do you drink most often? (check
only one)”
Four response categories were offered: (1) orange juice without
pulp, (2) orange juice with pulp, (3) drink both types equally
often and (4) don’t know.
‘‘Thinking about the orange juice that you drink most often,
where in the supermarket is that orange juice sold?”
Four response options were (1) in the refrigerated section, (2) in
the freezer case, (3) in the juice aisle (not refrigerated) and (4)
other.
‘‘Is the orange juice you drink most often. . .? (check one
answer)”
With options (1) made from concentrate, (2) not from concentrate, (3) fresh squeezed and (4) don’t know.
For potato chips, the questions were as follows:
‘‘How often do you eat potato chips?”
Response option categories were the same as the juices.
‘‘What type of potato chips do you eat most often (check only
one)”
Response options were (1) kettle cooked chips (e.g., kettle
chips), (2) ridged chips (e.g., Ruffles), (3) stacked chips (e.g.,
Pringles), (4) baked chips (e.g., Baked Lay’s), (5) regular chips
(e.g., Lay’s Classic) and (6) other.
For cookies, the questions were as follows:
‘‘How often do you eat chocolate chip cookies?”
Response option categories were the same as the juices.
“What type of chocolate chip cookies do you eat most often?
(check only one)”
Response categories were (1) regular, (2) chunky chocolate chip
cookies, (3) soft/chewy chocolate chip cookies and (4) other.
For ice cream the questions were:
‘‘How often do you eat vanilla ice cream?”
Response option categories were the same as the juices.
(No questions about type of ice cream were asked).
References
Amoore, J. E., Venstrom, D., & Davis, A. R. (1968). Measurement of specific anosmia.
Perceptual and Motor Skills, 26, 143–164.
Baird, J. C., & Noma, E. (1978). Fundamentals of scaling and psychophysics. New York:
John Wiley & Sons.
Bartoshuk, L. M., Duffy, V. B., Fast, K., Green, B. G., Prutkin, J., & Snyder, D. J. (2002).
Labeled scales (e.g. category, Likert, VAS) and invalid across-group
comparisons: What we have learned from genetic variation in taste. Food
Quality and Preference, 14(2), 125–138.
Bendig, A. W., & Hughes, J. B. (1953). Effect of number of verbal anchoring and
number of rating scale categories upon transmitted information. Journal of
Experimental Psychology, 46(2), 87–90.
Borg, G. (1982). A category scale with ratio properties for intermodal and
interindividual comparisons. In H.-G. Geissler & P. Petzold (Eds.),
Psychophysical judgment and the process of perception (pp. 25–34). Berlin: VEB
Deutscher Verlag der Wissenschaften.
Cardello, A., Lawless, H. T., & Schutz, H. G. (2008). Effects of extreme anchors and
interior label spacing on labeled magnitude scales. Food Quality and Preference,
21, 323–334.
Cardello, A. V., & Schutz, H. G. (2004). Research note. Numerical scale-point
locations for constructing the LAM (labeled affective magnitude) scale. Journal
of Sensory Studies, 19, 341–346.
Chung, S.-J., & Vickers, A. (2007a). Long-term acceptability and choice of teas
differing in sweetness. Food Quality and Preference, 18, 963–974.
Chung, S.-J., & Vickers, A. (2007b). Influence of sweetness on the sensory-specific
satiety and long-term acceptability of tea. Food Quality and Preference, 18,
256–264.
El Dine, A. N., & Olabi, A. (2009). Effect of reference foods in repeated acceptability
tests: Testing familiar and novel foods using 2 acceptability scales. Journal of
Food Science, 74, S97–S105.
Forde, C. G., & Delahunty, C. M. (2004). Understanding the role cross-modal sensory
interactions play in food acceptability in younger and older consumers. Food
Quality and Preference, 15, 715–727.
Giovanni, M. E., & Pangborn, R. M. (1983). Measurement of taste intensity and
degree of liking of beverages by graphic scaling and magnitude estimation.
Journal of Food Science, 48, 1175–1182.
Green, B. G., Shaffer, G. S., & Gilmore, M. M. (1993). Derivation and evaluation of a
semantic scale of oral sensation magnitude with apparent ratio properties.
Chemical Senses, 18, 683–702.
Greene, J. L., Bratka, K. J., Drake, M. A., & Sanders, T. H. (2006). Effectiveness of category
and line scales to characterize consumer perception of fruity fermented flavors
in peanuts. Journal of Sensory Studies, 21, 146–154.
Hein, K. A., Jaeger, S. R., Carr, B. T., & Delahunty, C. M. (2008). Comparison of five
common acceptance and preference methods. Food Quality and Preference, 19,
651–661.
Jaeger, S. R., & Cardello, A. V. (2009). Direct and indirect hedonic scaling methods: A
comparison of the labeled affective magnitude (LAM) scale and best–worst
scaling. Food Quality and Preference, 20, 249–258.
Jones, L. V., Peryam, D. R., & Thurstone, L. L. (1955). Development of a scale for
measuring soldiers’ food preferences. Food Research, 20, 512–520.
Keskitalo, K., Knaapila, A., Kallela, M., Palotie, A., Wessman, M., Sammalisto, S., et al.
(2007). Sweet taste preferences are partly genetically determined: Identification
of a trait locus on chromosome 16. American Journal of Clinical Nutrition, 86,
55–63.
Lawless, H. T., & Heymann, H. (1998). Sensory evaluation of food: Principles and
practices. New York: Springer.
Lawless, H. T., & Malone, G. J. (1986). A comparison of scaling methods: Sensitivity,
replicates and relative measurement. Journal of Sensory Studies, 1, 155–174.
Levy, J., Morris, R., Hammersley, M., & Turner, R. (1999). Discrimination ratio,
adjusted correlation and equivalence of imprecise tests: Application to glucose
tolerance. American Journal of Endocrinology and Metabolism, 276, 365–375.
Peryam, D. R. (1989). Reflections. In Sensory evaluation, in celebration of our
beginnings (pp. 21–30). Conshohocken, PA: ASTM International.
Peryam, D. R., & Girardot, N. F. (1952). Advanced taste-test method. Food
Engineering, 24, 58–61.
Peryam, D. R., & Pilgrim, F. J. (1957). Hedonic scale method of measuring food
acceptance. Food Technology, 11, 9–14.
Schutz, H. G., & Cardello, A. V. (2001). A labeled affective magnitude (LAM) scale for
assessing food liking/disliking. Journal of Sensory Studies, 16, 117–159.
Stevens, S. S. (1971). Issues in psychophysical measurement. Psychological Review,
78, 328–330.
Villanueva, N. D. M., & Da Silva, M. A. A. P. (2009). Performance of the nine-point
hedonic, hybrid and self-adjusting scales in the generation of internal
preference maps. Food Quality and Preference, 20, 1–12.
Villanueva, N. D. M., Petenate, A. J., & Da Silva, M. A. A. P. (2005). Comparative
performance of the hybrid hedonic scale as compared to the traditional hedonic,
self-adjusting and ranking scales. Food Quality and Preference, 16, 691–703.
Wright, A. O. (2007). Comparison of hedonic, LAM, and other scaling methods to
determine Warfighter visual liking of MRE packaging labels, includes web-based challenges, experiences and data. Presentation at the seventh Pangborn
sensory science symposium, Minneapolis, MN, 8/12/07. Supplement to Abstract
Book/Delegate Manual.
Yao, E., Lim, J., Tamaki, K., Ishii, R., Kim, K.-O., & O’Mahony, M. (2003). Structured
and unstructured 9-point hedonic scales: A cross cultural study with American,
Japanese and Korean consumers. Journal of Sensory Studies, 18, 115–139.
