Generation failure: Estimating metacognition in cued recall

ARTICLE IN PRESS
Journal of Memory and Language xxx (2005) xxx–xxx
www.elsevier.com/locate/jml
Philip A. Higham *, Helen Tam
School of Psychology, University of Southampton, Highfield, Southampton SO17 1BJ, UK
Received 9 August 2004; revision received 20 January 2005
Abstract
Three experiments examined generation, recognition, and response bias in the original encoding-specificity paradigm
using the type 2 signal-detection analysis advocated by Higham (2002). Experiments 1 (pure-list design) and 2 (mixed-list design) indicated that some guidance regarding the strength of the associative relationship between the test cue and
target greatly improved strong-cue target production relative to no guidance, and that this effect was attributable to
improved generation, as well as recognition. Problems with generating candidates for response during standard cued
recall were further shown in Experiment 3, where despite having the opportunity to provide multiple responses for each
cue, participants' ability to produce the targets remained poor. The results are discussed in terms of traditional and
modern generate-recognize theory, metacognition, and dual-route models of recall.
© 2005 Elsevier Inc. All rights reserved.
Keywords: Cued recall; Generate-recognize; Metacognition; Encoding specificity
Philip A. Higham and Helen Tam, School of Psychology,
University of Southampton. Preparation of this article was
supported by a research grant from the British Academy.
Portions of this research were presented at the 43rd annual
meeting of the Psychonomic Society, November, 2002 in
Kansas City, Missouri, USA, and at the 45th annual meeting
of the Psychonomic Society, November, 2004 in Minneapolis,
Minnesota, USA. We thank Nina Eskriett, Wendy Kneller and
David Brook for research assistance. We thank Harry Bahrick,
Chuck Brainerd, Morris Goldsmith, Asher Koriat, Steve
Lindsay, Doug Nelson, and Mike Watkins for helpful comments on earlier drafts of this article.
* Corresponding author. Fax: +44 23 8059 4597.
E-mail address: [email protected] (P.A. Higham).
Since Tulving and colleagues introduced the encoding specificity principle in the early 1970s (e.g., Thomson
& Tulving, 1970; Tulving & Thomson, 1973), most students of memory have viewed generate-recognize theory
as a straw man. It is considered by many to be an old-fashioned theory, with a particular failing when it comes
to explaining context reinstatement effects in cued recall.
In this paper, we revisit both generate-recognize theory
and the classic cued-recall paradigm that provided the
initial support for the encoding-specificity principle.
However, let us be clear at the outset that we are not
attempting to resurrect traditional generate-recognize
theory. Indeed, as will become apparent, the data from
the experiments that we report are quite inconsistent
with those early models, and, if anything, they support
many aspects of Tulving's message. On the other hand,
we will argue that a more modern generate-recognize
model of cued recall that maintains the crucial distinction between memory access (generation) and metacognitive monitoring (recognition) processes is still a
useful framework for cued-recall performance, and performance on many other tasks as well. In this way, our
message is similar to that of other metacognitive
researchers who have promoted two-stage models
involving separate stages of access and memory monitoring (e.g., Barnes, Nelson, Dunlosky, Mazzoni, & Narens, 1999; Goldsmith, Koriat, & Weinberg-Eliezer,
2002; Higham, 2002; Kelley & Sahakyan, 2003; Klatzky
& Erdelyi, 1985; Koriat & Goldsmith, 1996). Before
describing our unique method of analyzing the underlying generation and recognition processes, we will first review the traditional generate-recognize models, and
some of their variants.
0749-596X/$ - see front matter © 2005 Elsevier Inc. All rights reserved.
doi:10.1016/j.jml.2005.01.015
Early generate-recognize theory and encoding specificity
Early generate-recognize theory (e.g., Anderson &
Bower, 1972; Bahrick, 1969, 1970) proposed that cued
recall could be achieved by covertly generating associates of test cues, and then attempting to recognize the
sought-after target from amongst the generated candidates. Support for the theory came, in part, from experiments demonstrating that the associative strength
between the test cue and the target affected the probability of recall (e.g., Bahrick, 1970).
In the early 1970s, Tulving and colleagues (e.g.,
Thomson & Tulving, 1970; Tulving & Thomson, 1973;
Wiseman & Tulving, 1976) argued that generate-recognize models were an insufficient account of recall for
two reasons. First, Thomson and Tulving (1970) demonstrated that extralist retrieval cues that are strong associates of the target words (based on free association
norms) are not very effective retrieval cues, particularly
if the target words were encoded in relation to some
other (weak) associate during study. If cued recall is
accomplished by first generating candidates from the
test cue, and then recognizing the target from amongst
the candidates, one would expect that strong associates
would be excellent retrieval cues because the probability
of accomplishing the first step in the process (generating
the target) is high. Second, Tulving and Thomson (1973)
demonstrated recognition failure of recallable words. In
this demonstration, participants were first given weak
associate-target pairs to study. Next they were provided
with strong associates of the targets and asked to free
associate. Unsurprisingly, copies of the targets were often generated during this phase of the experiment. Following free association, participants were asked to
circle those generated items that were targets from the
study list. Finally, they attempted to recall the targets
in the presence of the weak cues that were encoded specifically with the targets during study. Recognition failure was revealed in that targets not recognized during
the generate-recognize phase of the experiment were often recalled later in the presence of reinstated weak cues.
This basic finding has been replicated many times (e.g.,
Bartling & Thompson, 1977; Gardiner, 1988; Postman,
1975; Reder, Anderson, & Bjork, 1974; Sikstrom &
Gardiner, 1997; Tulving, 1974; Watkins & Tulving,
1975; Wiseman & Tulving, 1975, 1976; see Nilsson &
Gardiner, 1993 for a review) and forms the basis of
the Tulving–Wiseman law. Recognition failure is problematic for generate-recognize theory because recall is
limited by two bottlenecks, whereas recognition is only
limited by one (i.e., the target item has already been
“generated” in recognition). Therefore, Tulving and
Thomson reasoned that it is impossible for recall to be
superior to recognition.
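The bottleneck argument can be made explicit with a short inequality. This is only a sketch under the single-representation assumption attributed to early generate-recognize models; the symbols $P(G)$ (the target is generated from the cue) and $P(R \mid G)$ (the generated target is then recognized) are labels introduced here for illustration and do not appear in the original papers.

```latex
P(\text{recall}) = P(G)\,P(R \mid G) \le P(R \mid G), \qquad 0 \le P(G) \le 1
```

If the generated copy of the target is identical to the copy presented on a recognition test, $P(R \mid G)$ is simply the probability of recognition, so recall could never exceed it. Recognition failure of recallable words violates this bound.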
Tulving and colleagues argued, instead, that their results were best explained in terms of the encoding specificity principle. According to this principle, the effectiveness
of retrieval cues is determined by the extent to which the
cues are encoded specifically with the to-be-remembered
(TBR) information. Thus, strong extralist cues are generally not effective for retrieval, despite the fact that they elicit the TBR information with a high probability, because
they were not encoded specifically with it. Similarly, recognition failure occurs because the cues available during
recognition differ from those that were present at encoding. In contrast, the (weak) cues in recall are reinstated
from study. The difference in the reinstatement status of
the strong versus weak cues renders recall performance
that is superior to recognition.
Variants of early generate-recognize theory
It is important to point out that Tulving and colleagues' data and criticisms only pertain to one class of
generate-recognize models: those that assume “trans-situational identity of words” (Tulving & Thomson, 1973,
p. 358). If a given target word is considered to have only
a single representation in memory—for example, a node
in a stable, abstract associative network—then the target
representation generated in the context of a strong
extralist associate at test must match the target representation activated in the context of a reinstated weak associate. Consequently, under this assumption, there is no
way for generate-recognize models to explain either poor
performance with strong extralist cues, or recognition
failure, as Tulving and colleagues correctly pointed
out. However, since the publication of the first encoding
specificity papers, several authors have argued that the
one-representation-per-word assumption need not necessarily hold, and that encoding-specificity effects can
be incorporated into generate-recognize theory under
different assumptions. For example, Reder et al. (1974;
see also Martin, 1975) proposed that a given word can
have more than one “sense,” and that the sense of the
target evoked at study, when it is presented with a weak
associate, is different from the sense of the target when it
is generated to a new, strong associate. For example,
LIGHT, when generated in the context of the strong
associate dark at test, has a different sense than the word
LIGHT when presented in the context of the weak associate head at study; the former means LIGHT (luminance) while the latter means LIGHT (lamp) (see Martin, 1975). Thus, although the nominal stimulus is the
same, different senses of it are processed at study and
test, resulting in both poor strong extralist cued-recall
performance and recognition failure.
The idea that encoding-specificity effects are reliant
on the use of words with more than one sense has been
criticized by Tulving and Watkins (1977) who showed
that single-meaning words also demonstrate recognition
failure (although see Muter, 1984). However, the spirit
of the multiple-representation argument can be seen in
explanations of encoding specificity using feature sampling theory (e.g., Bower, 1967; Estes, 1959; Kintsch,
1974; Underwood, 1969; Wickens, 1970), some of which
are relevant to generate-recognize theory. For example,
Pellegrino and Salzberg (1975b; see also Flexser & Tulving, 1978; Pellegrino & Salzberg, 1975a; Roediger &
Adelson, 1980) suggested that a set of features is sampled from the target item both when it is presented at
study (in which case the features are “tagged”), and
when it is generated at test. To the extent that there is
matching or overlap of the features sampled at test in
the generated item, with those that are tagged at study,
a positive recognition response is elicited, and the target
is given as a response. In this amendment to generate-recognize theory, context plays a role in determining
the features that are sampled at study and at test. To
the extent that the features sampled in the same nominal
stimulus are not fixed, the functional representations
that are compared between study and test are not necessarily the same. Consequently, the problem that encoding specificity posed for generate-recognize theory,
which is dependent on there being a single representation, is no longer applicable.
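The matching step in this feature-sampling account can be illustrated with a toy sketch. Everything concrete below is an assumption made for illustration: the feature labels, the use of a simple overlap proportion, and the 0.5 criterion are hypothetical and are not taken from Pellegrino and Salzberg (1975b) or any other cited model. The LIGHT example is borrowed from the preceding paragraph only for convenience.

```python
from __future__ import annotations


def overlap_proportion(tagged: set[str], sampled: set[str]) -> float:
    """Proportion of the features sampled at test that were tagged at study."""
    if not sampled:
        raise ValueError("no features were sampled at test")
    return len(tagged & sampled) / len(sampled)


def is_recognized(tagged: set[str], sampled: set[str], criterion: float = 0.5) -> bool:
    """Positive recognition response when overlap reaches the (hypothetical) criterion."""
    return overlap_proportion(tagged, sampled) >= criterion


# Hypothetical features tagged when LIGHT was studied with the weak cue "head".
tagged_at_study = {"lamp", "bulb", "overhead", "switch"}

# Hypothetical features sampled when LIGHT is generated to the strong cue "dark".
sampled_with_strong_cue = {"luminance", "bright", "day", "bulb"}

# Hypothetical features sampled when the weak cue "head" is reinstated at test.
sampled_with_weak_cue = {"lamp", "bulb", "overhead", "ceiling"}

strong_overlap = overlap_proportion(tagged_at_study, sampled_with_strong_cue)
weak_overlap = overlap_proportion(tagged_at_study, sampled_with_weak_cue)
```

In this toy case the same nominal word yields little overlap when generated to the strong cue and substantial overlap when the weak cue is reinstated, which is the sense in which the functional representations compared at study and test need not be the same.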
Modern generate-recognize theory and metacognition
Although these variants of the traditional generate-recognize model could incorporate encoding-specificity
effects, we believe that a modernized version of the theory
will have to make further changes. Many of the variants of
early generate-recognize models assumed that the source
of candidates is a stable, abstract associative network.
However, in a radical shift away from this assumption, Jacoby and Hollingshead (1990) proposed that the memory
base from which candidates are generated might be distributed, such that generation processes are influenced
by specific prior episodes. This change moved generate-recognize theory away from a reliance on a semantic or
associative memory system as a source of candidates,
and rendered it more consistent with episodic or instance-based memory theory (e.g., Brooks, 1978; Jacoby,
1983; Whittlesea, 1997). We agree wholeheartedly with
Jacoby and Hollingshead's proposition; in fact, we cannot
foresee how any generate-recognize model based on readout from a stable, abstract associative network can incorporate the now vast array of context effects that have
emerged in memory research since the principles of encoding specificity (Tulving & Thomson, 1973) and transferappropriate processing (Morris, Bransford, & Franks,
1977) were introduced. In this vein, the experiments we report here, and those of others (e.g., Santa & Lamwers,
1974), demonstrate metacognitive flexibility and context-specific influences on recollective processing. For
example, retrieval cues can be used to interrogate memory
differently depending on the specific instructional set or on
the specific study-list structure (see also Higham & Tam,
2005). These findings provide evidence against stability
in the search set generated by test cues. Thus, we believe
that if generate-recognize theory is to remain a viable
model of cued recall, it must relinquish the assumption
that the product of the generation process is based on a
stable, abstract associative network.
Another issue that needs to be addressed is how a
modern, metacognitive generate-recognize theory will
incorporate the concept of conscious recollection resulting from direct retrieval in cued recall. Since its introduction over three decades ago, the generate-recognize
route to recall has typically been contrasted with another, more direct retrieval route (e.g., Bahrick, 1969,
1970, 1979; Bodner, Masson, & Caldwell, 2000; Brainerd, Wright, Reyna, & Payne, 2002; Gardiner, 1988;
Guynn & McDaniel, 1999; Jacoby, 1996, 1998; Jones,
1978, 1987; Jones & Gardiner, 1990; Naveh-Benjamin
& Guez, 2000; Toth, Reingold, & Jacoby, 1994; Weldon
& Colston, 1995). For example, Bahrick (1970; see also
Bahrick, 1969, 1979) suggested that the generate-recognize route to recall was only implemented if direct retrieval failed. In his original experiments, cues (or prompts)
were only provided for items that could not be freely recalled, a point which is often overlooked in criticisms of
Bahrick's work (although see Gardiner, 1988 for a counterexample). Thus, the generate-recognize process was
seen as a fall-back process. At both an intuitive level,
and at an empirical level, there appears to be fairly widespread consensus that recall can occur either directly and
efficiently via direct access, or indirectly and inefficiently
via a generate-recognize route. As Bahrick (1979) stated,
...no one has ever seriously suggested, for example,
that recalling the name of one's wife involves generating
a series of female names and selecting the correct name
after having rejected several erroneously generated candidates. Much information of both an episodic and semantic type is recalled without a time-consuming search (p.
148).
This dual-route distinction characteristic of writings
in cued-recall research suggests that conscious recollection is the result of relatively immediate access of a
veridical memory trace (although see Brainerd, Payne,
Wright, & Reyna, 2003). We agree with Bahrick (1979)
in that most situations in which veridical recollection occurs are not preceded by an effortful phase during which
multiple, plausible candidates are consciously generated
in response to the available cues. However, if the concept of generation is broadened to include all sorts of access of information from long-term memory, then
conscious recollection might be seen as a special case
of generate-recognize processing. As will be shown,
our data highlight how direct retrieval may not be as direct as previously thought; instead, ecphoric (Tulving,
1983) processing seems dependent on metacognitive factors, such as the number and/or strength of association
of competing candidates for recall (Experiment 2). For
now, the important point is that we see generate-recognize theory not as a point of contrast to direct retrieval,
but rather as incorporating it. Situations that experimenters and participants alike identify as constituting
direct retrieval can likely be understood in terms of
parameters associated with monitoring and the attributions that are subsequently made.
Metacognition and cued recall
In this section, we will attempt to illustrate how metacognitive factors can be investigated in a cued-recall
paradigm. Our research constitutes a follow-up to that presented in Higham (2002; see also Higham & Tam,
2005), and adopts the same methodology, so we will describe it in some detail. Higham replicated the main
methodological components of Thomson and Tulving's (1970) experiments; participants studied weakly associated
cue-target pairs and later were given a cued-recall test with a mixture of weak reinstated cues and new
strong cues. Recall performance with these cues was then compared to a no-cue recall condition in which targets
were recalled without the assistance of any cues. The feature that made Higham's research different from
Thomson and Tulving's (1970) research was that, for the cued-recall conditions, he examined target production
under both free- and forced-report conditions; in free report, participants were given an incentive system such
that they gained points for a correct answer, lost points for an incorrect answer, but neither gained nor lost any
points for withholding answers. However, in forced report, the incentive system was removed and participants
were asked to provide their best guess to the cues that were initially left blank (cf. Koriat & Goldsmith,
1996). He found that, in free report, there was evidence of encoding specificity; weak reinstated cues facilitated
target production relative to no-cue recall, whereas new strong cues did not. However, the results in forced
report were very different; in that condition, both weak and strong cues facilitated target production relative to
no-cue recall, and they did so to the same degree (i.e., target production in the weak- and strong-cue
conditions did not differ).
To explain these results, suppose participants in Higham's (2002) experiment generated candidate answers to
each test cue in free report, and then attempted to recognize the target from amongst the alternatives. With weak
cues, the reinstatement of context led to targets being recognized with high probability.¹ Consequently, nearly
all targets that were generated were also recognized and reported. However, with strong cues, participants
presumably had no difficulty generating targets, but they very seldom successfully recognized them, and so
responses to these cues were withheld. However, in forced report, the criterion of acceptability (report) was
lowered, and the targets that were successfully generated to the strong cues, but not recognized, were revealed.
As such, Higham's data are not inconsistent with the predominant explanation of poor strong extralist cue
target production in this task: recognition failure.²
¹ Our reference to generating targets in response to weak cues might be confusing to some readers: How is it
possible to generate a target that is only weakly associated with the test cue? However, our use of the term in
this context is only confusing if one adheres to the traditional sense of generation (i.e., a process of
consciously producing candidate responses from an associative network once direct retrieval has failed).
Consistent with what we consider to be a modern generate-recognize theory, we use the term “generate” here to
incorporate not just the effortful process of producing semantic associates to the test cues, but also the more
direct process of recollecting targets with the help of test cues. Conceived of in this way, targets to weak cues
are potentially “generated” in just the same way that they are to strong cues.
² Some researchers have demonstrated that context reinstatement affects generation processes (e.g., Nelson &
Goodmon, 2003; Vokey & Higham, 2005; Zeelenberg, Pecher, Shiffrin, & Raaijmakers, 2003). Similarly, most modern
memory models (e.g., SAM: Gillund & Shiffrin, 1984; PIER2: D.L. Nelson, McKinney, Gee, & Janczura, 1998; see
also Humphreys, Bain, & Pike, 1989) assume that context reinstatement facilitates retrieval or access of
information from memory (i.e., generation). However, the same cannot be said of the role of context in the
strong–weak cue paradigm, which generated interest in the encoding-specificity principle in the first place. It
has been taken for granted that the cause of both recognition failure, and poor cued-recall performance with
non-reinstated strong cues in the strong–weak cue paradigm, lies in the recognition stage, not the generation
phase. As Tulving and Thomson (1973) pointed out, participants presumably have no difficulty generating copies
of the targets in response to strong cues; indeed, demonstrations of recognition failure of recallable words are
dependent on participants successfully generating (but failing to recognize) target words to the strong extralist
cues. These cues, after all, are deliberately chosen to elicit the targets with high probability on tests of free
association.
Indeed, the analyses that Higham (2002) performed
on the data supported this conclusion. For each participant, a 2 (report: yes–no) × 2 (accuracy: correct–incorrect) contingency table was constructed, like the one
shown in Table 1, and the frequencies in it were used
to calculate various measures of performance. First,
free-report target production was defined as the proportion of all test cues that were assigned correct answers
when there was the option to withhold responses (i.e.,
a/[a + b + c + d] in Table 1). Second, forced-report target production was defined as the proportion of all test
cues that were assigned correct responses on the test
once the option to withhold responses was removed
(i.e., [a + c]/[a + b + c + d] in Table 1). Third, two metacognitive indices of performance, monitoring and report
bias, were calculated using type 2 signal-detection theory
(SDT; e.g., Clarke, Birdsall, & Tanner, 1959; Galvin,
Podd, Drga, & Whitmore, 2003; Higham & Gerrard,
2005; Vokey & Higham, 2005). The hit (H) rate, in this
type 2 context, was defined as the proportion of the total
number of correct responses produced on the test (a + c)
that were actually reported (i.e., H rate = a/[a + c] in Table 1). Similarly, the false alarm (FA) rate was defined as
the proportion of the total number of incorrect responses (b + d) that were reported (i.e., FA rate = b/
[b + d] in Table 1). As in standard SDT, it was possible
to calculate a discrimination index ($A'$; Grier, 1971) and
a bias index ($B''_D$; Donaldson, 1992) from the H and FA
rates, which corresponded to monitoring and report
bias, respectively (see Higham, 2002, for more detail).
Table 1
The 2 × 2 contingency table and formulae used to derive the various measures discussed in the text

                 Candidate answer
Response         Correct     Incorrect
Reported         a           b
Withheld         c           d

Note. Free-report target production = a/(a + b + c + d); forced-report target production = (a + c)/(a + b + c + d);
hit rate (h) = a/(a + c); false alarm rate (fa) = b/(b + d);
monitoring = $A' = .5 + \frac{(h - \mathit{fa})(1 + h - \mathit{fa})}{4h(1 - \mathit{fa})}$;
report bias = $B''_D = \frac{(1 - h)(1 - \mathit{fa}) - h\,\mathit{fa}}{(1 - h)(1 - \mathit{fa}) + h\,\mathit{fa}}$.

Within the generate-recognize framework, monitoring corresponds to recognition; it is a measure of the degree to
which participants report targets and withhold non-targets, in other words, the degree to which participants are
able to discriminate the targets amongst generated candidates. Higham (2002) found that monitoring (recognition)
was much higher for reinstated weak cues (.88) than non-reinstated strong cues (.61). Also, report bias, which is
a measure of participants' tendency to offer versus withhold responses in free report, was much more liberal for
weak cues (.31) than strong cues (.62).³ Thus, the overall pattern of results
suggests that participants were able to generate targets
to strong and weak cues with equal success, as indexed
by equal forced-report target production for the two
cue types. However, they did not recognize the targets
that they generated to strong cues as well as they recognized targets generated to weak cues, as suggested by the
cue effect on monitoring. The difference in report bias
suggests that participants were sensitive to the fact that
their recognition performance with strong cues was poor
and so responses were withheld in free report.
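The measures in Table 1 can be computed directly from the four cell frequencies. The sketch below is a minimal illustration, not the authors' analysis code: the function and field names are hypothetical, the example counts are invented solely to exercise the formulas, and the $A'$ formula is implemented only in the form given in the Table 1 note (cases with h < fa or a zero denominator are rejected rather than handled).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportTable:
    """Cell frequencies of the 2 x 2 table (Table 1) for one participant."""
    a: int  # correct candidates that were reported
    b: int  # incorrect candidates that were reported
    c: int  # correct candidates that were withheld
    d: int  # incorrect candidates that were withheld


def type2_measures(t: ReportTable) -> dict:
    """Return the Table 1 measures for one contingency table."""
    total = t.a + t.b + t.c + t.d
    n_correct = t.a + t.c
    n_incorrect = t.b + t.d
    if n_correct == 0 or n_incorrect == 0:
        raise ValueError("hit or false-alarm rate is undefined")

    h = t.a / n_correct      # proportion of correct candidates reported
    fa = t.b / n_incorrect   # proportion of incorrect candidates reported

    # Only the form of A' given in the Table 1 note is implemented here.
    if h < fa or h == 0 or fa == 1:
        raise ValueError("A' formula from the Table 1 note does not apply")
    a_prime = 0.5 + ((h - fa) * (1 + h - fa)) / (4 * h * (1 - fa))

    withheld_term = (1 - h) * (1 - fa)
    reported_term = h * fa
    if withheld_term + reported_term == 0:
        raise ValueError("B''_D is undefined for these rates")
    b_d = (withheld_term - reported_term) / (withheld_term + reported_term)

    return {
        "free_report": t.a / total,
        "forced_report": n_correct / total,
        "hit_rate": h,
        "fa_rate": fa,
        "a_prime": a_prime,
        "b_d": b_d,
    }


# Invented counts for a hypothetical participant tested on 50 cues.
measures = type2_measures(ReportTable(a=12, b=3, c=8, d=27))
```

With these invented counts, free-report target production is .24, forced-report target production is .40, h = .60, fa = .10, $A'$ is approximately .85, and $B''_D$ is approximately .71.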
³ Readers familiar with Koriat and Goldsmith's (1996) methodology might be curious about how the type 2 SDT
measure of bias that we use in the current research ($B''_D$; see also Higham, 2002; Higham & Gerrard, 2005)
compares to their measure, $P_{rc}$. In short, they are designed to measure different aspects of performance.
Like all bias statistics in both type 1 and type 2 SDT, $B''_D$ is designed to measure participants' tendency to
say “yes,” and is a monotonic function of the H and FA rates. Specifically, in the type 2 context, $B''_D$
measures participants' propensity to report candidate responses, and is based on triangular geometry in ROC
space (see Donaldson, 1992 for more detail). The propensity to report candidates can be affected by either a
shift in participants' criterion placement, or by a shift in the underlying distributions (over confidence) of
the correct and incorrect candidates. Hence, when we refer to “liberal” or “conservative” bias or responding, we
are referring to the participants' response tendency, not specifically to the placement of the criterion.
$P_{rc}$ is better suited to estimate this placement. It estimates the confidence scale value that is aligned
with the participants' report criterion by determining the maximum fit ratio. A fit ratio is the mean of (a) the
percentage of would-be reported candidates that are actually reported at a given confidence level and (b) the
percentage of would-be withheld candidates that are actually withheld at that same confidence level. By choosing
the confidence level that maximizes this mean, one presumably estimates the confidence level associated with
participants' report criterion. Discussion of the relative merits of these two measures is beyond the scope of
the current paper, but interested readers might refer to Higham (2002), pp. 72–76. However, because $B''_D$ is
inherently ambiguous with regard to its interpretation, we will report mean confidence along with $B''_D$ to
potentially disambiguate it (i.e., to determine whether effects on bias are due to true criterion shifts or
shifts in the underlying distributions).
Overview of the experiments
Higham's (2002) use of free- and forced-report methodology, in conjunction with type 2 SDT, is well suited
to an analysis of modern generate-recognize theory, as it
is possible to separate cognition from metacognition. As
suggested above, forced-report target production serves
as a measure of the generation process, discrimination
($A'$) serves as a measure of the monitoring or recognition
process, and report bias ($B''_D$) serves as a measure of the
response tendency.⁴ Although these indices are not process
pure for reasons that we will discuss below, they offer a
substantial improvement over the way that cued-recall
performance has traditionally been assessed in the
encoding-specificity literature. Indeed, in that literature,
metacognitive variables are often overlooked completely; typically, only free-report target production
has been analyzed (although see Pellegrino & Salzberg,
1975b). As Higham pointed out, free-report target production is a very “dirty” measure of performance because generation processes, recognition processes, and
response bias all affect it (see also Higham & Gerrard,
2005; Kelley & Sahakyan, 2003; Koriat & Goldsmith,
1996). Thus, although the indices we report in the current research are not perfectly aligned with the underlying processes, together, they do a much better job at
separating the individual mechanisms of cued recall than
free-report target production alone.
The monitoring component mentioned above is only
one type of metacognition that might be important in
cued recall. Another has to do with the participants'
choice over the domains in which targets are sought.
Traditional generate-recognize models imply that there
is very little flexibility in domain choice, and that
searches are mostly determined by a pre-established
associative network. However, we believe there is much
more metacognitive flexibility than such models suggest.
Experiments 1 and 2 provide evidence for this flexibility
by demonstrating the influence of retrieval guidance. In
particular, these experiments show that informing participants on a trial-by-trial basis about the nature of
the relationship between the test cue and the sought-after target dramatically enhances target production with
strong extralist cues, an effect that we attribute to improved domain search. In Experiment 3, improvements
are made to forced-report target production as a measure of the generation process, the measure of central
importance in the current research, by using a multiple-response methodology. This new methodology allows us to examine two possible factors underlying
participants' choice of search domains: direct-memory
instructions and the structure of the study list.
Experiment 1
To demonstrate participants' metacognitive control
over search domain, we replicated Higham's (2002) results in this experiment, and compared target production
in this standard cued-recall group to a group provided
with some recall guidance. For the latter group, two
manipulations were instantiated. First, participants were
informed, on a trial-by-trial basis, about the associative
relationship between the test cue and the target. Second,
participants were instructed on the best strategy to use
with each type of cue. In particular, they were informed
that the best strategy with “weak” cues was to simply try
to remember the target from the study phase. Conversely, for “strong” cues, they were informed that the
best strategy was first to generate associates to the cues,
and then to try to recognize targets from the generated
candidates. It was made clear to participants that the
task with strong cues was not simply associative generation, but that the goal was to recall as many of the targets from the study phase as possible. Similar generate-recognize instructions have previously been used by
other researchers (e.g., Baker & Santa, 1977).
Santa and Lamwers (1974) also informed participants
about the relationship between test cues and targets in a
replication and extension of Thomson and Tulving's
(1970) experiments. They found that free-report performance with extralist strong cues improved greatly as a result. However, because free-report target production was
their measure of choice, it is not clear from their experiments whether the enhanced performance was due to improved generation, recognition, more liberal reporting, or
some combination of these. Higham (2002) found that a
shift in report bias (from free to forced report) was enough
to triple the proportion of targets given to strong cues.
Thus, it is quite conceivable that Santa and Lamwers'
guidance manipulation had only an effect on report bias,
with no effect on either the generation or recognition component of cued recall. To better determine what effect
guidance has, in Experiment 1 we obtained separate estimates of generation (forced-report target production),
recognition (A 0 ), and report bias (B00D ).
Method
⁴ Different researchers favor different statistics for different
reasons. To get around possible criticism of A′, another
measure of monitoring, the Kruskal–Goodman gamma correlation, was also calculated on the data from the 2 × 2
contingency tables generated in Experiments 1 and 2 (monitoring was not analyzed in Experiment 3). This particular
statistic is popular in the metacognition literature and has been
advocated by Nelson (1984). Gamma was subjected to the same
ANOVAs that were performed on A′. With only one exception
(see footnote in the results section in Experiment 2), these
analyses on gamma produced identical results to those on A′.
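As a rough illustration of the gamma statistic mentioned in this footnote, the following minimal sketch computes gamma for a single 2 × 2 table. The cell layout (a = correct and reported, b = incorrect and reported, c = correct and withheld, d = incorrect and withheld) and the function name are assumptions made for the example; they are not taken from Table 1.

```python
def goodman_kruskal_gamma(a, b, c, d):
    """Gamma for a 2 x 2 table, under an assumed (hypothetical) layout:

    a = correct and reported,  b = incorrect and reported,
    c = correct and withheld,  d = incorrect and withheld.
    """
    concordant = a * d  # pairs ordered the same way on both variables
    discordant = b * c  # pairs ordered in opposite ways
    if concordant + discordant == 0:
        return None  # gamma is undefined when there are no untied pairs
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical counts for one participant
example_gamma = goodman_kruskal_gamma(10, 2, 5, 30)
```

Because gamma is undefined when both products are zero, a participant with empty cells may have to be dropped from an analysis on gamma.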
Participants
Participants were 32 students at the University of
Southampton who either volunteered or participated in
return for course credit. They were randomly assigned
to either the standard group or guided group, with 16
participants in each. They were tested in groups of 1–3
at individual workstations.
Design and materials
One hundred target words, each with two associated
cue words, were taken from Higham (2002). Five of the
word trios from Higham's original list containing proper
names (e.g., Russia-communism-tractor) were replaced
with trios not containing proper names. These trios were
removed to dissuade participants from typing in responses that were a mixture of lower and upper case letters, which would be scored as incorrect by the computer
program. One cue word for each target word was a
strong associate of the target (mean probability of target
production for all 100 strong cues = 35%), whereas the
second was a weak associate (mean probability of target
production for all 100 weak cues = 1%). No word was
repeated across the lists of weak cues, strong cues or
targets.
For both groups of participants, the experiment was
divided into a study phase and a test phase. In the study
phase, all 100 targets, along with their weak associates,
were presented in pairs in a random order to all participants for study. In the test phase, participants were presented with retrieval cues one at a time. For
counterbalancing purposes, approximately half the participants were presented at test with strong cues to elicit
one set of 50 targets, and presented with weak cues to
elicit the other set of 50 targets, whereas this was reversed for the remaining participants. Participants were
initially given the choice of providing a response to a given cue or leaving it blank (free-report; see procedure
below). However, responses to cues that were left blank
were obtained immediately afterwards by presenting the
cue again and requiring a response (guessing if necessary) before moving on to the next trial. Presentation order of the cues in the test phase was uniquely
randomized for each participant. After randomization,
data from the first six trials of the test phase were
counted as practice trials and were not analyzed. Thus,
the analyses were based on 94 items, with an average
of 47 (range: 44–50) data points for each of the strong
and weak conditions.
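The counterbalancing and practice-trial exclusion described above can be sketched as follows. This is only an illustration under stated assumptions: targets are represented by the indices 0–99, the two sets of 50 are taken to be the first and second halves of that range, and the `version` flag and function name are hypothetical.

```python
import random


def build_test_list(version, n_targets=100, n_practice=6, seed=None):
    """Return (practice_trials, analyzed_trials) for one participant.

    version 0: targets 0-49 get strong test cues, targets 50-99 get weak cues.
    version 1: the assignment is reversed (counterbalancing).
    Target indices and the `version` flag are hypothetical stand-ins.
    """
    rng = random.Random(seed)
    half = n_targets // 2
    trials = []
    for target in range(n_targets):
        in_first_set = target < half
        strong = in_first_set if version == 0 else not in_first_set
        trials.append((target, "strong" if strong else "weak"))
    rng.shuffle(trials)  # unique random order for each participant
    # The first few trials after randomization are treated as practice
    return trials[:n_practice], trials[n_practice:]


practice, analyzed = build_test_list(version=0, seed=1)
n_strong = sum(1 for _, cue in analyzed if cue == "strong")
n_weak = len(analyzed) - n_strong
```

Because the six practice trials are taken after randomization, each cue condition retains between 44 and 50 analyzed items, which is the range reported above.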
Procedure
In the study phase, participants studied 100 weak
cue-target word pairs presented individually for 3 s each,
centered on a computer monitor. The cue words were
displayed in lower case letters to the left of the target
words, which were presented in upper case letters. Following Thomson and Tulving (1970), participants were
instructed to study the upper case words for a later
memory test, but to attend to the lower case words as
possible cues to assist in recalling the upper case words
at test.
In the test phase, participants were instructed that
they would be presented with cue words, one at a time
on the computer monitor. Each cue word was centered
on the monitor and displayed with a question mark (?)
to its immediate right indicating that a response was requested. Participants in the standard group were told
that each word presented during this phase had an upper
case word from the study list that was related to it and to
use the word as a cue to assist in recalling the upper case
word. In addition to these instructions, participants in
the guided group were told that some cues would be
the same as those presented in the study phase, and these
cues would be weakly related to the upper case word and
would be labelled as ‘‘weak cue’’ during a recall trial.
Other cues, however, would be new cues that were not
presented during the study phase but were strongly related to an upper case word, and these would be labelled
as ‘‘strong cue’’ during a recall trial. Thus, for the guided
group, the label ‘‘weak cue’’ or ‘‘strong cue’’ was presented below each given cue to inform the participants
of the relation of the cue to the uppercase word. Additionally, the guided group were further informed that
if it was a weak cue, the best strategy to retrieve the target was to simply try to remember it; generating associated words would not help. On the other hand, if the cue
was a strong one, the best strategy to use was to generate
words that are strongly related to the cue and then to try
to recognize the target amongst the generated
candidates.
Participants in both groups were also informed that
each trial started with a ‘‘points stage’’ during which
each correct answer would earn 1 point, but that each
incorrect answer would cost 4 points. However, participants could avoid the point system by entering ‘‘B’’ (for
‘‘blank’’) which would immediately send them to the
‘‘guessing stage.’’ Responses offered during the guessing
stage would neither earn nor cost any points. Regardless
of whether a response was offered during the points
stage or the guessing stage, a confidence rating was required next on a separate screen, using a scale from 1
to 6, where 1 = extremely low confident correct,
2 = very low confident correct, 3 = low confident correct, 4 = high confident correct, 5 = very high confident
correct, and 6 = extremely high confident correct. This
scale appeared on the screen whenever a confidence rating was required. After entering a confidence rating, the
number of trials remaining was displayed (100 in total).
No feedback was provided regarding the points won or
lost on trials for which responses were entered in the
‘‘points stage,’’ nor the cumulative point total. Participants pressed the space bar to initiate the next trial. This
procedure for each trial continued until all 100 trials
were completed.
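One way to see what the payoff scheme asks of participants is to compute the expected points for reporting a candidate. The sketch below is not part of the procedure itself; it simply assumes a participant who reports whenever the expected points are positive, and the function names are hypothetical.

```python
def expected_points(p_correct, gain=1, loss=4):
    """Expected points for reporting a candidate in the points stage."""
    return p_correct * gain - (1 - p_correct) * loss


def report_threshold(gain=1, loss=4):
    """Probability of being correct above which reporting has positive expected value."""
    return loss / (gain + loss)


threshold = report_threshold()  # 4 / (1 + 4) = 0.8
```

Under this assumption, a 1-point gain and 4-point loss make reporting worthwhile only when the subjective probability of being correct exceeds .80, so the scheme encourages a fairly strict report criterion.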
Results
Following Higham (2002), we analyzed the cued-recall data in terms of free- and forced-report target production, monitoring, and report bias. The term ‘‘free-report target production rate’’ refers to the proportion
of targets that were offered in the points stage of the
experiment, whereas the term ‘‘forced-report target production rate’’ refers to the summed proportion of targets
that were offered in both the points stage and the guessing stage. Conservative scoring was used in all experiments; a response was scored as correct only if a target
was offered to either the strong or weak cue associated
with it. Monitoring and report bias refer to discrimination (A′) and response bias (B″D) derived for each participant from the a, b, c, and d cells of Table 1. Specific
formulae for all measures are presented in Table 1. Before calculating discrimination and bias, the H and FA
rates were adjusted according to Snodgrass and Corwin's (1988) recommendation (i.e., 0.5 added to the
numerator and 1 added to the denominator of each
rate). Analysis of the confidence data follows a report
of the SDT indices. For some analyses reported in this
paper, empty cells forced the elimination of participants,
the number of which can be determined from the degrees
of freedom reported with the analysis. For cases in
which empty cells posed a particular problem, we explicitly describe the problem and the steps taken to avoid
the elimination of too many participants (e.g., by
excluding a factor from the analysis). Alpha level of
.05 was adopted for all comparisons.
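The measures described in this paragraph can be sketched in code as follows. This is a minimal illustration, not the formulae of Table 1: it assumes the cell layout a = correct and reported, b = incorrect and reported, c = correct and withheld, d = incorrect and withheld, and it uses the commonly cited textbook expressions for A′ and B″D. In this textbook form, positive B″D values indicate conservative reporting; the sign convention in Table 1 may differ. All function names are hypothetical.

```python
def corrected_rate(count, total):
    """Snodgrass and Corwin (1988) adjustment: +0.5 to numerator, +1 to denominator."""
    return (count + 0.5) / (total + 1)


def a_prime(h, fa):
    """Nonparametric discrimination index (textbook form; assumed, not from Table 1)."""
    if h >= fa:
        return 0.5 + ((h - fa) * (1 + h - fa)) / (4 * h * (1 - fa))
    return 0.5 - ((fa - h) * (1 + fa - h)) / (4 * fa * (1 - h))


def b_double_prime_d(h, fa):
    """Nonparametric bias index (textbook form; positive = conservative here)."""
    miss_cr = (1 - h) * (1 - fa)
    hit_fa = h * fa
    return (miss_cr - hit_fa) / (miss_cr + hit_fa)


def type2_summary(a, b, c, d):
    """Summarize one participant's data under the assumed cell layout."""
    n = a + b + c + d
    h = corrected_rate(a, a + c)   # correct responses that were reported
    fa = corrected_rate(b, b + d)  # incorrect responses that were reported
    return {
        "free_report": a / n,          # targets offered in the points stage
        "forced_report": (a + c) / n,  # targets offered in either stage
        "A_prime": a_prime(h, fa),
        "B_double_prime_D": b_double_prime_d(h, fa),
    }


summary = type2_summary(a=10, b=2, c=5, d=30)  # hypothetical counts
```

The adjustment keeps both rates strictly between 0 and 1, so neither index divides by zero even when a participant has an empty cell.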
Target production
Mean free- and forced-report target production rates
for the standard and guided groups are shown in Table
2. Two 2 (group: standard/guided) × 2 (cue: weak/
strong) mixed analyses of variance (ANOVAs) were conducted,
the first on the free-report target production
rates, and the second on the forced-report rates. In both
analyses, group was the between-subjects factor, and
cue type was the within-subjects factor. For free-report
target production, the analysis revealed a main effect
of group, with the guided group demonstrating significantly better recall than the standard group,
F(1, 30) = 8.15, MSE = .016, η² = .214 (standard = .18;
guided = .27). The main effect of cue was marginally significant, F(1, 30) = 3.69, MSE = .014, η² = .110, p < .07,
reflecting a trend for better recall in the context of weak
cues than strong cues (weak = .25; strong = .20). A significant group × cue interaction was also obtained,
F(1, 30) = 14.43, MSE = .014, η² = .325. This interaction
arose because free-report target production was significantly better for weak cues than for strong cues in the standard group, F(1, 15) = 15.89, MSE = .015, η² = .514,
whereas in the guided group, strong-cue target production
was numerically, but not significantly, better
than weak-cue target production, F(1, 15) = 1.82 (see Table 2).
The ANOVA on forced-report target production revealed a main effect of group, F(1, 30) = 4.82,
MSE = .015, η² = .134; significantly more targets were retrieved by the guided than the standard group (standard = .26; guided = .33). There was also a significant
cue main effect, F(1, 30) = 5.08, MSE = .012, η² = .145,
with recall significantly better for strong cues than weak
cues (weak = .26; strong = .33), as well as a significant
group × cue interaction, F(1, 30) = 10.27, MSE = .012,
η² = .255. This interaction reflected the fact that there
was no difference in forced-report target production for
weak versus strong cues in the standard instruction group,
F < 1, whereas in the guided group, strong-cue target
production was significantly better than weak-cue target
production, F(1, 15) = 14.57, MSE = .013, η² = .493.
Forced-report target production for strong cues in the
guided group was also significantly greater than for the
same cues in the standard group, F(1, 30) = 18.91,
MSE = .010, η² = .387 (see Table 2).

Table 2
Mean free- and forced-report target production in Experiment 1
as a function of cue type and experimental group

Experimental group    Free-report      Forced report
                      M      SD        M      SD
Weak cues
  Standard            .26    .10       .28    .12
  Guided              .24    .14       .25    .14
Strong cues
  Standard            .09    .12       .25    .11
  Guided              .30    .13       .41    .09

Monitoring
Mean monitoring indices for the standard and guided
groups are shown in Table 3. A mixed 2 (group: standard/guided) × 2 (cue: weak/strong) ANOVA on monitoring revealed a significant main effect for group,
F(1, 30) = 9.64, MSE = .021, η² = .243, with the guided
group showing significantly better monitoring than the
standard group (standard = .71; guided = .83). The cue
main effect was significant, F(1, 30) = 57.23, MSE =
.011, η² = .656, reflecting better monitoring for weak
cues than strong cues (weak = .87; strong = .67). A significant group × cue interaction was also found,
F(1, 30) = 9.23, MSE = .011, η² = .235, which occurred
because the monitoring difference between weak cues
and strong cues was larger for the standard group than
for the guided group. Nevertheless, separate one-way
ANOVAs showed monitoring was significantly better
for weak cues than strong cues for both groups:
F(1, 15) = 33.09, MSE = .019, η² = .688, and
F(1, 15) = 34.00, MSE = .003, η² = .694, for the standard and guided groups, respectively. Monitoring was
significantly greater than chance (.50) for both weak
and strong cues only in the guided group (lower bound
95% confidence interval for weak cues = .84; for strong
cues = .74). In the standard group, only monitoring
for weak cues was significantly better than chance (lower
bound 95% confidence interval for weak cues = .76; for
strong cues = .49).

Table 3
Mean monitoring (A′) and report bias (B″D) in Experiment 1 as a
function of cue type and experimental group

Experimental group    Monitoring (A′)    Report bias (B″D)
                      M      SD          M      SD
Weak cues
  Standard            .85    .17         .51    .52
  Guided              .88    .09         .39    .61
Strong cues
  Standard            .57    .16         .42    .80
  Guided              .77    .05         .13    .72
Report bias
Table 3 shows report bias for weak and strong cues in
the standard and guided groups. A mixed 2 (group: standard/guided) × 2 (cue: weak/strong) ANOVA on report
bias showed a significant main effect of cue, with report
bias being more liberal for weak cues than for strong cues,
F(1, 30) = 44.88, MSE = .127, η² = .599 (weak = .45;
strong = .15). The group main effect was not significant,
F < 1 (standard = .04; guided = .26). However, a significant group × cue interaction was found, F(1, 30) =
13.71, MSE = .127, η² = .314. The interaction resulted
because cue strength had a larger effect in the standard
group than in the guided group, although both effects
were significant, F(1, 15) = 37.58, MSE = .183, η² =
.715, and F(1, 15) = 8.01, MSE = .071, η² = .348,
respectively.
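The effect sizes reported with each F ratio appear to be consistent with partial eta squared recovered from F and its degrees of freedom. The sketch below rests on that assumption (the function name is hypothetical) and reproduces the four report-bias values given above.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Report-bias effects from Experiment 1
cue_effect = round(partial_eta_squared(44.88, 1, 30), 3)      # 0.599
interaction = round(partial_eta_squared(13.71, 1, 30), 3)     # 0.314
standard_group = round(partial_eta_squared(37.58, 1, 15), 3)  # 0.715
guided_group = round(partial_eta_squared(8.01, 1, 15), 3)     # 0.348
```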
Confidence data
A 2 (group: standard/guided) × 2 (cue: weak/strong)
× 2 (accuracy: correct/incorrect) mixed ANOVA was performed on the mean confidence ratings (Table 4).⁵ This
analysis yielded significant main effects of cue (weak =
3.23; strong = 2.06), F(1, 29) = 138.92, MSE = .303,
η² = .827, and accuracy (correct = 3.53; incorrect =
1.75), F(1, 29) = 229.24, MSE = .427, η² = .888. There
was also a significant cue × accuracy interaction,
F(1, 29) = 176.04, MSE = .234, η² = .859, indicating that
the difference in confidence ratings between correct and
incorrect responses was larger for weak cues (correct = 4.69; incorrect = 1.76) than for strong cues (correct = 2.37; incorrect = 1.75). This effect reflects better
monitoring with weak cues than strong cues, as shown
with the analysis on A′. A significant group × cue interaction was also found, F(1, 29) = 6.90, MSE = .303,
η² = .192, which was caused by cue strength having a larger effect on confidence in the standard group
(weak = 3.18; strong = 1.75) than in the guided group
(weak = 3.28; strong = 2.37). This pattern suggests that
the group by cue strength interaction on the SDT measure
of bias described above was attributable in large part to a
shift in the strong-cue confidence distribution; that is, participants became more confident in their responses to
strong cues in the guided group, and hence tended to report them more. Finally, the group × accuracy interaction
approached significance, F(1, 29) = 4.09, MSE = .427,
η² = .124, p < .06. The difference in confidence ratings between correct and incorrect responses was marginally larger for the guided group (correct = 3.83; incorrect = 1.81)
than for the standard group (correct = 3.23; incorrect = 1.69). This marginal effect reflects better monitoring in the guided group than the standard group,
supporting the analysis of A′.

⁵ The data were collapsed across the report option variable
(reported-withheld) to minimize the elimination of participants
due to empty cells.

Table 4
Mean confidence ratings (/6) for correct and incorrect responses
in Experiments 1 and 3 as a function of cue type and
experimental group

Experimental group             Correct         Incorrect
                               M      SD       M      SD
Weak cues
  Standard (Experiment 1)      4.55   .26      1.81   .13
  Guided (Experiment 1)        4.84   .27      1.71   .14
  Real study (Experiment 3)    4.66   .64      1.46   .43
  Sham study (Experiment 3)    —      —        2.00   .64
  No study (Experiment 3)      —      —        2.31   .73
Strong cues
  Standard (Experiment 1)      1.92   .21      1.58   .13
  Guided (Experiment 1)        2.82   .22      1.91   .14
  Real study (Experiment 3)    1.86   .70      1.34   .35
  Sham study (Experiment 3)    2.36   .76      2.03   .64
  No study (Experiment 3)      3.06   .85      2.19   .67

Note. Cells without entries had too few participants contributing data to be considered valid (n < 10).
Discussion
Results from the standard group replicated Higham's
(2002) cued-recall results almost exactly. In free report,
there was an advantage of reinstated weak cues over
non-reinstated strong cues in eliciting target retrieval,
as was found by Thomson and Tulving (1970). This
encoding-specificity effect, however, was eliminated under forced-report instructions (cf. Pellegrino & Salzberg,
1975b). As with Higham's research, the fact that strong
cue target production improved from free to forced report, whereas weak-cue target production did not, was
attributable to the fact that strong cues were associated
with both poorer monitoring, and more conservative report bias, than weak cues. In other words, participants
withheld a lot of correct responses with strong cues,
but not with weak cues.
Results from the guided group, on the other hand,
were quite different from the standard group. Although
guiding participants about the relationship between
weak cues and their targets had no effect on free-report
target production, forced-report target production,
monitoring or report bias, guidance had large effects
on target production with strong cues, which is consistent with Santa and Lamwers' (1974) findings. However,
because we adopted Higham's (2002) method of analyzing cued-recall data, we, unlike Santa and Lamwers,
were able to determine at what stage(s) of processing
guidance had its effect: generation, recognition, response
bias, or some combination of these. First, analysis of the
discrimination index indicated that guidance had an effect at the recognition stage; target recognition (monitoring) in the guided group was significantly improved
relative to the standard group, although it did not reach
the level observed with weak cues. Second, analysis of
B″D indicated that report bias was more liberal for weak
cues than for strong cues in both groups, but the effect
was less pronounced in the guided group compared to
the standard group.
In addition to these effects on recognition and report
bias, however, guidance also affected generation processes. Forced-report target production, our measure
of generation, was no different between weak and strong
cues in the standard group, replicating Higham's (2002)
results. However, in the guided group, forced-report target production with strong cues was significantly greater
than with weak cues in the same group, and it was better
than strong-cue, forced-report target production in the
standard group. These results suggest that poor strong
extralist cue target production in the standard cued-recall group, which partially forms the basis of the encoding specificity principle, is not just a recognition
problem, as is commonly believed. Rather, it is also
partly attributable to the fact that participants given
standard cued-recall instructions do not generate candidate sets in response to strong cues that are as high quality as they might be. In other words, once failures to
report and/or recognize the targets are factored out of
the equation by forcing responses to all test cues, target
production with strong extralist cues in the standard
cued-recall group is still quite poor.
Experiment 2
Experiment 1 indicated that guiding participants
about the cue-target relationship and instructing them
to generate-then-recognize with strong cues had effects
on generation, recognition, and response bias. In Experiment 2,
we sought to replicate this effect using a mixed-list design. Type of study cue (strong or weak) was
crossed with type of test cue (strong or weak), which
yielded four conditions: weak (study)–weak(test),
weak–strong, strong–weak, and strong–strong. The first
two conditions were replications of the two test conditions of Experiment 1, although without a pure weak-associate study list. As in the guided group of Experiment
1, participants were informed on a trial-by-trial basis
about the relationship between the test cues and their
associated targets. Furthermore, they were told to generate-then-recognize with strong cues, but to avoid the
generate-then-recognize strategy with weak cues (i.e.,
just try to remember targets).
Thomson and Tulving (1970, Experiment 3; see also
Ehrlich & Philippe, 1976; Postman, 1975) used a
mixed-list paradigm of this sort to discount a confusion
or mental set interpretation of their finding of poor target production with strong extralist cues. That is, they
acknowledged the fact that participants, after studying
a pure-list of weak-associate/target pairs, may have
developed a mental set of weak associates as appropriate
responses at test, or participants might be confused at
test about appropriate responses when faced with novel
strong cues. Thomson and Tulving reasoned that, by
using a mixed-list of both weak- and strong-associate/
target pairs during study, this potential explanation of
the poor target production they observed with strong
extralist cues would be obviated.
However, as Santa and Lamwers (1974) rightly
pointed out, it is questionable whether the confusion
experienced by participants during recall was completely
resolved in this mixed-list paradigm. Although participants may no longer have held onto the assumption that
all targets were weakly associated to the test cues, confusion could still arise because participants were not informed that some of the targets which were studied
with weak associates would be cued with novel strong
cues during recall. In other words, the uncertainty as
to how a particular cue was related (weakly or strongly)
to its target remained in the cued-recall phase of the
mixed-list paradigm. Consequently, participants in
Thomson and Tulving's (1970) study may still have been
unsure as to the appropriate retrieval space (weak associates or strong associates of the cue) in which the target
could be found, despite the mixed-list design. This
uncertainty may have resulted in poor target production
with strong test cues (see also Santa & Lamwers, 1976).
The trial-by-trial guidance manipulation that we used
in Experiment 1 is ideally suited to alleviate this potential confusion about search space. If, indeed, target production with strong extralist cues was impaired in
Thomson and Tulving's (1970) experiment because of
uncertainty regarding the appropriate domain in which
to search for targets, our guidance manipulation should
lead to an improvement in strong-cue target production.
In particular, forced-report target production, our measure of the generation process in recall, should be high.
Method
Participants
Sixteen students from the University of Southampton
took part in return for course credit or payment.
Design and materials
Materials from Experiment 1, that is, the same 100
targets, each with a weak and strong cue, were used. At
study, participants studied one set of 50 targets with
strong cues, and the other set of 50 targets with weak
cues. At test, half the targets (25) studied with weak cues
were tested with weak cues (the weak–weak condition),
whereas the other half (25) was tested with strong cues
(the weak–strong condition). Similarly, 25 targets studied
with strong cues were tested with weak cues (the strong–
weak condition), and the remaining 25 were tested with
strong cues (the strong–strong condition). The assignment of targets to the four study-test conditions was
counterbalanced across participants. The presentation
order of the cue-target pairs at study, and cues at test,
was uniquely randomized for each participant. After randomization, the first 6 trials for each participant were
treated as practice trials and were not analyzed. Thus,
analyses were based on 94 items, with a range of 19–25
data points for each study-test (weak–weak, weak–
strong, strong–weak, and strong–strong) condition.
Procedure
For the presentation of cue-target pairs, the same
procedure as in Experiment 1 was used. The procedure of
the cued-recall phase was essentially the same as that
used for the guided group in Experiment 1. That is,
for each participant, weak cues and strong cues were labelled accordingly during cued recall. Participants were
also told that the best strategy to retrieve a target for
a weak cue was to try to remember the target; generating
associated words would not help. Conversely, for a
strong cue, it was best to generate words strongly related
to the cue and then decide if one of the words was the
intended target.
Results
Target production
Mean target-production rates are shown in Table 5.
Because these rates, and hence variance, were virtually
zero in the strong–weak condition at both free- and
forced-report, only data from the remaining three conditions
(weak–weak, weak–strong, and strong–strong)
were analysed. Thus, two repeated-measures, one-way
ANOVAs were conducted on the target-production rate
in those three conditions, the first on free-report target
production, and the second on forced-report target production. For free report, there was a significant main effect of condition, F(2, 30) = 34.32, MSE = .018,
η² = .696. This main effect arose because free-report target production was significantly better in the strong–
strong condition than both the weak–weak and the
weak–strong conditions, F(1, 15) = 80.36, MSE = .24,
η² = .843, and F(1, 15) = 40.46, MSE = .047, η² = .730,
respectively. There was no significant difference between
the weak–weak and weak–strong conditions, F < 1.
For forced report, the ANOVA again revealed a significant main effect of condition, F(2, 30) = 47.00, MSE =
.017, η² = .758. This main effect was produced because
forced-report target production was significantly better
in the strong–strong than in the weak–strong condition,
F(1, 15) = 25.71, MSE = .038, η² = .632. In turn,
forced-report target production was significantly better
in the weak–strong than in the weak–weak condition,
F(1, 15) = 12.60, MSE = .048, η² = .457 (Table 5).

Table 5
Mean free- and forced-report target production in Experiment 2
as a function of experimental condition

Experimental condition    Free-report      Forced report
                          M      SD        M      SD
Weak–weak                 .19    .16       .21    .17
Weak–strong               .19    .10       .41    .15
Strong–weak               .00    .00       .00    .00
Strong–strong             .54    .21       .65    .16

Monitoring
Mean monitoring (A′) is shown in the first column in
Table 6. As the target production rate in the strong–weak
condition was virtually zero, it was not possible to calculate a monitoring index for that condition. However, sufficient data were available to complete a repeated-measures, one-way ANOVA on monitoring in the other
conditions (weak–weak, weak–strong, and strong–
strong). This ANOVA revealed a significant main effect,
F(2, 30) = 19.42, MSE = .011, η² = .564. Within-subjects
contrasts indicated that monitoring for weak–weak items
was significantly better than for strong–strong items,
F(1, 15) = 14.24, MSE = .019, η² = .487, which, in turn,
was better than for weak–strong items, F(1, 15) = 6.07,
MSE = .029, η² = .288.⁶ For all three conditions, monitoring
was significantly above chance (lower bound 95%
confidence intervals: weak–weak = .83; weak–strong =
.55; strong–strong = .64).

⁶ This latter comparison on monitoring between strong–
strong and weak–strong items was marginally significant when
monitoring was estimated with gamma, but the pattern of
results remained the same; monitoring for strong–strong items
was higher than for weak–strong items, F(1, 13) = 4.04,
MSE = .209, η² = .237, p < .07.

Table 6
Mean monitoring (A′) and report bias (B″D) in Experiment 2 as a
function of experimental condition

Experimental condition    Monitoring (A′)    Report bias (B″D)
                          M      SD          M      SD
Weak–weak                 .87    .08         .03    .66
Weak–strong               .64    .16         .20    .70
Strong–weak               —      —           —      —
Strong–strong             .74    .20         .34    .67

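The lower bounds quoted for monitoring can be approximately reproduced from the Table 6 means and standard deviations. The sketch below assumes that all 16 participants contributed to each condition and that the SD column holds sample standard deviations; because the tabled values are rounded, it is expected to agree with the reported bounds only to within about .01. The function and constant names are hypothetical.

```python
import math

T_CRIT_DF15 = 2.131  # two-tailed .05 critical t for 15 degrees of freedom


def ci_lower_bound(mean, sd, n=16, t_crit=T_CRIT_DF15):
    """Lower bound of a 95% confidence interval around a condition mean."""
    return mean - t_crit * sd / math.sqrt(n)


lower_bounds = {
    "weak-weak": ci_lower_bound(0.87, 0.08),
    "weak-strong": ci_lower_bound(0.64, 0.16),
    "strong-strong": ci_lower_bound(0.74, 0.20),
}
```

All three bounds exceed .50, which is what licenses the statement that monitoring was above chance in every analyzed condition.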
Report bias
Mean report bias (B″D) is shown in the third column of
Table 6. As with the analysis on monitoring, there was
insufficient target production data in the strong–weak
condition to calculate report bias. However, a repeated-measures one-way ANOVA on report bias in the remaining conditions (weak–weak, weak–strong, and strong–
strong) revealed a significant main effect, F(2, 30) =
7.65, MSE = .159, η² = .338. Within-subjects contrasts
indicated that report bias was significantly more liberal
in the strong–strong condition than in either the weak–
weak condition, F(1, 15) = 7.43, MSE = .291, η² = .331,
or the weak–strong condition, F(1, 15) = 16.32, MSE =
.286, η² = .521. The latter two conditions did not differ
significantly, F(1, 15) = 1.26.
Confidence data
The confidence data are shown in Table 7. To manage
empty cells, confidence ratings were collapsed across report option, and the strong–weak condition was not entered into the analysis. A 3 (condition: weak–weak/
weak–strong/strong–strong) × 2 (accuracy: correct/incorrect) repeated measures ANOVA revealed significant
main effects of accuracy, F(1, 13) = 165.16, MSE =
.405, η² = .927 (accurate = 4.18; inaccurate = 2.39), and
condition, F(2, 26) = 20.23, MSE = .258, η² = .609. Confidence
ratings were lower in the weak–strong condition
(2.79) than in both the weak–weak condition (3.58),
F(1, 14) = 30.42, MSE = .333, η² = .685, and the
strong–strong condition (3.48), F(1, 13) = 37.73, MSE
= .176, η² = .744, whereas the latter two conditions did
not differ, F < 1. There was also a significant condition × accuracy interaction, F(2, 26) = 72.41, MSE =
.187, η² = .848. The interaction occurred because the
difference in confidence between correct and incorrect responses was larger in the weak–weak condition than in the
strong–strong condition, F(1, 13) = 37.95, MSE = .980,
η² = .745, which, in turn, was larger than in the weak–
strong condition, F(1, 13) = 21.35, MSE = .640, η² =
.622. These analyses on the confidence data conform with
the analyses on monitoring reported above, which also
showed a weak–weak > strong–strong > weak–strong
pattern. The confidence data analysis also suggests
that the liberal response bias associated with the
strong–strong condition likely resulted from the particularly high levels of confidence assigned to incorrect
responses (i.e., distribution shift). Because confidence
was high for these incorrect responses, a large portion of
the distribution would have fallen above participants' report criterion, yielding a high FA rate, and a liberal report
bias estimate (i.e., an overall high tendency to report
candidates).

Table 7
Mean confidence ratings (/6) for correct and incorrect responses
in Experiment 2 as a function of condition

Item type         Correct         Incorrect
                  M      SD       M      SD
Weak–weak         5.13   .80      1.97   .60
Weak–strong       2.96   .81      2.43   .73
Strong–weak       —      —        1.69   .56
Strong–strong     4.16   .80      2.75   .84

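The distribution-shift account can be illustrated with a toy example. The ratings below are invented purely for illustration (they are not data from the experiment), and the fixed report criterion of 4 is an assumption; the point is only that shifting the confidence assigned to incorrect responses upward, with the criterion held constant, raises the proportion of incorrect responses that are reported (the FA rate).

```python
def report_rate(confidences, criterion):
    """Proportion of responses whose confidence meets or exceeds the report criterion."""
    return sum(c >= criterion for c in confidences) / len(confidences)


CRITERION = 4  # hypothetical report criterion on the 1-6 scale

# Invented confidence ratings for incorrect responses (illustration only)
low_confidence_incorrect = [1, 1, 2, 2, 3, 1, 2, 1]
shifted_incorrect = [c + 2 for c in low_confidence_incorrect]

fa_before = report_rate(low_confidence_incorrect, CRITERION)  # 0.0
fa_after = report_rate(shifted_incorrect, CRITERION)          # 0.5
```

A higher FA rate at the same hit rate yields a more liberal bias estimate, which is the pattern described for the strong–strong condition.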
Discussion
Target production in the weak–weak and weak–
strong conditions of Experiment 2 replicated performance in the guided group of Experiment 1 fairly closely. The weak–weak condition showed moderately
good target production, and it was not affected by report
option. Similarly, monitoring (recognition) in the weak–
weak conditions of both experiments was similar and
high (Experiment 1, guided = .88; Experiment 2 = .87).
Although free-report target production in the weak–
strong condition of Experiment 2 was somewhat lower
(.19) than in the weak–strong condition in the guided
group of Experiment 1 (.30), forced-report target production was identical (.41 in both cases). This pattern
probably reflects the fact that participants tended to be
fairly conservative in their responding in all conditions
of Experiment 2 except in the strong–strong condition,
and that the mixed-list design made monitoring of responses to strong cues somewhat more difficult (Experiment 1, weak–strong, pure-list = .77; Experiment 2,
weak–strong, mixed-list = .64).
More important, however, Experiment 2 replicated
the finding that some guidance, in the form of generate-recognize instructions and cue-type labels, led to
excellent forced-report target production with strong
extralist test cues. Forced-report target production in
the weak–strong guided conditions of both Experiments
1 and 2 (both .41) was considerably greater than forced-report, strong-cue target production in the standard
condition of Experiment 1 (.25). This result again suggests that the effect of guidance is not limited to recognition and/or response bias in cued recall. Rather, it also
had a considerable effect on the generation process.
We interpret this result in the same way that we interpreted the analogous result in Experiment 1: guiding
participants about the relationship between the test cue
and the sought-after target made them better able
to define an appropriate search space. By improving the
definition of the search space, the likelihood that the
pool of generated items contained the target was
increased.
It is worth considering the dissociations obtained
between target production and monitoring, which we
interpret to correspond to generation and recognition
being influenced differently by the same variables. Both
generation and recognition were influenced positively
by cue reinstatement. Cue reinstatement helped to limit
the search to recent encounters containing the target,
which, in turn, increased the likelihood that the target
was generated (i.e., accessed from memory). It also increased the likelihood that the target would be correctly recognized. On the other hand, cue strength
affected generation and recognition differently. Whereas
high cue strength, compared to low cue strength, increased the likelihood that targets would be generated,
it decreased the likelihood that those generated targets
would be recognized. Recognition was poor with
strong cues because several highly interrelated items
were likely to be generated. For example, in response
to the strong test cue homicide, participants may have
generated the candidates MURDER, death, kill, and
die. Recognizing the target MURDER from amongst
these candidates may have been quite difficult, because
the size of the generated set is large and the interrelatedness amongst the alternatives is great (e.g., see Nelson, Bennett, Gee, Schreiber, & McKinney, 1993;
Nelson, Bennett, & Leibert, 1997; Nelson, McEvoy,
& Friedrich, 1982; Nelson, Schreiber, & McEvoy,
1992; see also Santa & Lamwers, 1974). In contrast,
both the number of candidates generated in response
to weak cues and their interrelatedness are likely to
be lower than with strong cues. First, for weak cues, participants were instructed to avoid generating related
items in an attempt to find the target, reducing the generated set size. Second, even if participants ignored
these instructions, and went ahead and generated both
the target and candidates strongly related to the weak
cue, the target would be unrelated to the other generated strong associates. Thus, the target may be the only
candidate to be considered seriously as a response to
each (reinstated) weak cue, either because it was the
only generated candidate, or because it ‘‘stood out
from the crowd,’’ being unrelated to the other items
in the candidate set.
The differential effect of cue strength on the generation and recognition processes of cued recall can be seen
by comparing the pattern of forced-report target production and monitoring, respectively, amongst the
experimental conditions. The strong–strong condition
showed the best target production because it benefited
from both cue reinstatement and high cue strength. Target production in the weak–strong and weak–weak conditions benefited from only one of the variables (either
cue strength in the weak–strong condition or cue reinstatement in the weak–weak condition), but not the
other, and so an intermediate level of performance was
observed. The strong–weak condition benefited from
neither variable, and so demonstrated the worst performance (see Table 5). On the other hand, monitoring
(recognition) was highest in the weak–weak condition
because it benefited from both cue reinstatement and
low cue strength. When the test cues were reinstated,
but their strength was high (strong–strong; one for,
one against), an intermediate level of recognition was
observed. When reinstatement was absent, and cue
strength was high (weak–strong; both against), recognition was poorest (see Table 6).
These results highlight the role of metacognitive
monitoring processes in a direct cued-recall task, and
the importance of considering such processes separately
from target production in such tasks. The pattern of
poorer monitoring in the strong–strong condition compared to the weak–weak condition, despite better target
production, was also apparent in the mean confidence
data. Targets produced to weak cues were assigned
much higher confidence than targets produced to strong
cues, presumably because in the former case, there were
no other serious competitors, in line with the reasoning
above.
Some readers may be concerned that instructing participants to generate-then-recognize with strong cues
may have undermined the direct nature of the task,
and for this reason, we observed the excellent strong-cue target production and the dissociations between target production and monitoring. To test this possibility,
we tested an additional 11 participants in the mixed-list
design of Experiment 2. These participants were treated
exactly the same as those in Experiment 2, except the
generate-recognize instructions for strong cues, and
‘‘try to remember’’ instructions for weak cues, were removed. Instead, participants were simply instructed to
use the cue to help them recall a capitalized word from
the study phase, and it was indicated to them on a trial-by-trial basis which cues were ‘‘weak’’ and which were
‘‘strong.’’ These participants produced data almost identical to those in Experiment 2. In particular, the data for
the weak–weak, weak–strong, strong–weak, and strong–
strong conditions were as follows: free-report target production: .24, .24, .00, .52, respectively; forced-report target production: .28, .41, .01, .64, respectively;
monitoring (A′): .84, .69, , .81, respectively; report bias
(B″D): .05, .26,, .22, respectively. Thus, the excellent
strong-cue target production observed in the guided
group of Experiment 2 (and by extension, the guided
group of Experiment 1) was replicated. Also, the same
dissociated ordering of forced-report retrieval (strong–
strong > weak–strong and weak–weak > strong–weak)
and monitoring (weak–weak > strong–strong > weak–
strong) was observed.
Experiment 3
Thus far, we have aligned generation processes in
cued recall with forced-report target production, recognition processes with monitoring (A′), and response
tendencies with a bias index (B″D). This analysis has
shown itself to be useful in separating the underlying
components of cued recall, and we believe it constitutes
a huge improvement relative to the sole use of free-report target production as a measure of performance,
which has characterized much cued-recall research.
Furthermore, these measures respond in predictable
and sensible ways to the cue reinstatement and cue
strength variables, but they have also provided new evidence that questions some long-held assumptions
regarding the nature of encoding specificity. In particular, the results of Experiments 1 and 2 question the
common belief that encoding specificity effects in the
strong–weak cue paradigm are located only in the recognition stage, and not the generation stage, of cued
recall.
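For readers who wish to compute the monitoring and bias indices named above, the following is a minimal sketch. It assumes the standard non-parametric formulas for A′ and B″D, and it assumes that the inputs are a type 2 hit rate (the proportion of correct candidates that were reported) and a type 2 false-alarm rate (the proportion of incorrect candidates that were reported). The function names are hypothetical and are not taken from the original analysis.

```python
def a_prime(hit_rate: float, fa_rate: float) -> float:
    """Non-parametric discrimination index A' (0.5 = chance, 1.0 = perfect)."""
    h, f = hit_rate, fa_rate
    if h == f:
        return 0.5
    if h > f:
        return 0.5 + ((h - f) * (1 + h - f)) / (4 * h * (1 - f))
    # Below-chance case: mirror image of the formula above.
    return 0.5 - ((f - h) * (1 + f - h)) / (4 * f * (1 - h))


def b_double_prime_d(hit_rate: float, fa_rate: float) -> float:
    """Bias index B''D: negative = liberal, positive = conservative."""
    h, f = hit_rate, fa_rate
    withheld = (1 - h) * (1 - f)
    reported = h * f
    if withheld + reported == 0:
        raise ValueError("B''D is undefined for (H=1, FA=0) and (H=0, FA=1)")
    return (withheld - reported) / (withheld + reported)
```

With illustrative (not observed) rates of H = .80 and FA = .20, `a_prime` returns .875 and `b_double_prime_d` returns 0, that is, good discrimination with a neutral report criterion. Raising the FA rate while holding H constant pushes B″D below zero, which corresponds to the liberal bias described for the strong–strong condition.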
As we mentioned in the introduction section, however, it is unlikely that these measures, in particular,
our measure of generation (forced-report target production), are process pure. For some items, forced-report
target production may also be sensitive to discrimination
in the recognition stage. This would occur for trials for
which the target is actually covertly generated to a given
cue, but the confidence associated with it is extremely
low. That being the case, not only would it not be offered in free report, it might not be offered in forced report either. Of course, such a scenario necessitates that
some other non-target item, generated along with the
target in response to a cue, is assigned higher confidence
than the target itself. That is, the generated target would
have to be lower in the rank order of candidates than
some other non-target candidate, the latter of which
would be offered at forced report. To the extent that this
occurs, forced-report target production will underestimate the generation process.
To scrutinize the generation process more directly,
and to achieve a purer measure of the generation process, a multiple-responses methodology was adopted in
Experiment 3. With this methodology, participants were
given the opportunity to offer at least one, but as many
as six, responses to each cue in their attempt to produce
targets. A point system was also established to create
conditions that were analogous to the report-option
manipulation in Experiments 1 and 2. Specifically, participants were given the option to nominate one of the
generated candidates as the likely target. By doing so,
they had chosen to ‘‘go for points,’’ meaning that two
points would be awarded for each correct nomination,
but two points would be deducted from their score for
each incorrect nomination.
By considering the entire candidate list generated by
each cue, regardless of whether or not a candidate was
nominated, a condition analogous to forced report was
produced. The point system in Experiment 3 was specifically designed to motivate participants to produce as
many reasonable candidates as possible. For example,
participants in Experiment 3 were informed that producing, but failing to nominate, the target in the list of candidates would still earn 0.25 points, and that no points
would be deducted for producing non-target responses
that were not nominated. Consequently, there was nothing to lose, and indeed something to be gained, by listing
all plausible candidates for a given cue. This was true
even on trials when an incorrect candidate was nominated; the incorrect nomination would result in a loss
of two points, but if the target appeared elsewhere on
the list, this loss would be offset by 0.25 points (net
total = −1.75 points). The motivation to produce more
than one response per cue in forced report was unlike
the forced-report condition in Experiments 1 and 2 in
which a maximum of one response per cue was
generated.
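The point system described above can be sketched as a small scoring function. This is an illustration only: the function name, the argument names, and the use of exact string matching to identify the target are assumptions, not details taken from the experimental software.

```python
from typing import Optional, Sequence


def trial_points(responses: Sequence[str], target: str,
                 nominated: Optional[str] = None) -> float:
    """Points earned on one trial under the rules stated in the text.

    +2 for a correct nomination, -2 for an incorrect nomination, and
    +0.25 if the target appears among the non-nominated responses.
    """
    if nominated is not None and nominated not in responses:
        raise ValueError("the nominated response must be one of the responses")
    points = 0.0
    if nominated is not None:
        points += 2.0 if nominated == target else -2.0
    # Non-nominated candidates can still earn the small bonus.
    others = [r for r in responses if r != nominated]
    if target in others:
        points += 0.25
    return points
```

For example, nominating an incorrect candidate while also listing the target gives −2 + 0.25 = −1.75 points, whereas listing the target without nominating anything gives 0.25 points. Listing additional non-target candidates never lowers the score, which is the incentive the point system was designed to create.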
Along with the new methodology, Experiment 3 allowed us to examine two possible causes of poor
strong-cue target production revealed in the unguided
group of Experiment 1. Previously, Higham and Brooks
(1997) demonstrated that participants in recognition
memory experiments implicitly learn about the structure
inherent in study lists, such that test items consistent
with the structure were more likely to be rated ‘‘old’’
than items inconsistent with it. In the same vein, unguided cued-recall participants in Experiment 1 may
have learned the list-wide structure that cues and targets
were only weakly related. Learning about this structure
may have led them to search for targets in a domain containing only weak associates of the test cues, a strategy
that is useful for weak cues, but detrimental for strong
cues. Such a proposition is consistent with our hypothesis that participants are searching inappropriate domains when faced with strong cues. On the other
hand, the direct-memory instructions used in the unguided cued-recall group are another possible cause of
poor strong-cue target production. Direct-memory
instructions may lead participants to search for single
‘‘recollected’’ solutions, limiting the size of the generated
set.
To investigate the role of these two possible causes of
poor unguided target production for strong cues, three
groups of participants were tested in Experiment 3: a
standard cued-recall group (the real-study group) and
two generation groups (the sham-study and no-study
groups). All three experimental groups were given the
same strong and weak test cues as those in the test phase
of Experiment 1. The main difference between the
groups was that the real-study group was exposed to
the targets during study (in the context of weakly associated study cues), and was required to recall them,
whereas the sham-study and the no-study groups were
not exposed to any targets, and were required instead
to generate words to the cues. The main difference between the no-study and sham-study groups was that
the latter group was exposed to a sham study list—a list
of weakly associated word pairs that contained none of
the targets—whereas the no-study group was exposed to
no study list at all. By comparing target production between these two generation groups, it was possible to
determine the effect of the study-list structure on pure
generation processes, outside of the context of cued recall.7 By comparing target production between the
real-study and sham-study groups, it was possible to
evaluate the role of direct-memory versus generation
instructions.
Although by using this design it is possible to evaluate the separate contributions of each of the aforementioned factors on target production, we believe it may
be somewhat overly simplistic to attribute poor unguided target production for strong cues exclusively to
one factor or the other. Rather, it seems likely to us that
list structure and direct-memory instructions work in
conjunction, making a particularly potent mix. Unlike
generation instructions, direct-memory instructions
encourage constant referral to the study list as a source
of candidates, thus intensifying any effects that the
study-list structure might have. As a result, strong-cue
target production may be particularly poor in the real-study group, who receive both direct-memory instructions and an inappropriate study-list structure.
7
One obvious way of ensuring high output is to force
participants to generate a certain number of (multiple)
responses. However, with the multiple-response methodology
used in Experiment 3, we suspected that such requirements
might switch participants in the real-study group from a
retrieval strategy to a pure generation strategy. Such a strategic
shift might possibly have masked any differences that would
otherwise have been found between the real-study group on the
one hand, and the generation groups (sham-study and no-study) on the other. Consequently, we implemented a point
system that rewarded the production of several responses per
cue, but ultimately left participants in control of the number of
responses (beyond one) to produce.
Method
Participants
Participants were 63 undergraduate students who
took part in return for either course credit or a payment
of £5. Twenty-one participants were assigned to each of
the three (real-study, sham-study, and no-study) groups.
Design and materials
The materials were the same as in Experiment 1, with
the exception that a new list of 100 weakly associated
pairs was constructed for the sham-study group to
study. As with the weakly-associated word pairs in the
real-study group, the cues presented to the sham-study
group during study produced their associated sham targets at a rate of approximately 1%. Although no actual
targets were presented in the study phase for the sham-study and no-study groups, the same test cues used in
the real-study group (derived from Experiment 1) were
used for these groups, and target production was scored
in all three groups of Experiment 3 according to participants' tendency to respond with the targets as they were
defined in Experiment 1.
Procedure
The procedure for the study phase for the real-study
and sham-study groups was identical to that used for the
standard group in Experiment 1. In the test phase, the
real-study group received the same basic instructions
as those in the standard group in Experiment 1. However, participants from the sham-study group were told
that, for each cue word, we had ‘‘another word in mind’’
that was related to it, and that the relationship between
the cue word and the word we had in mind was similar
to the relationship between the word pairs seen in the
study phase. It was made clear to participants that none
of the words seen in the study phase was the same as the
word we had in mind. Participants were instructed to use
the cue to think of the word we have in mind. Similarly,
for the no-study group, participants were told that for
each cue word, there was a related word that we had
in mind, and their task was to think what that word
might be.
For all three groups in Experiment 3, participants
were required to provide at least one, but as many as
six, responses to each cue. Additionally, for each response given, participants were required to rate their
confidence that it was the target on a 6-point scale. Participants were informed of the point system used in the
test phase, and that the goal was to reach as many points
as possible. They were told that if they were confident
that one of their responses was the target, they could
nominate it by clicking on the ‘‘go for points’’ button.
By doing so, they would be awarded two points if the
nominated candidate was indeed the target. However,
if it was not, they would lose two points. Only one response per trial could be nominated. Participants were
further informed that they could still gain 0.25 points
if one of the non-nominated responses they had produced was the target, regardless of whether or not they
nominated another candidate. Because some points
could still be gained if non-nominated responses turned
out to be targets, an incentive was provided to enter as
many responses as possible. There was no requirement
to nominate a response to ‘‘go for points’’ on any trial.
In the test phase, each cue was presented towards the
top of the computer screen with a question mark (?) displayed to its immediate right indicating that a response
was required. Participants typed in their first response in
an empty text box situated below the cue. When a response had been entered, a 6-point scale appeared immediately to the right of the response, and participants
indicated their confidence that their response was correct
by clicking on one of the points on the scale. A ‘‘go for
points’’ button was displayed to the right of the 6-point
scale, which could be highlighted if participants chose to
nominate that particular response. Once a response had
been typed into the box, and confidence had been rated,
the participant was prompted to type in another response
(in a box that appeared directly below the previous one)
or to click on the ‘‘next’’ button located at the bottom
of the screen to initiate the next test trial. If a response
was typed in the second text box, a confidence scale appeared again to its immediate right, and a ‘‘go for points’’
button to the right of that. This process continued either
until six responses had been offered, or until the ‘‘next’’
button was clicked. Participants were not permitted to
type in multiple copies of the same response on any given
trial. Throughout the trials, a brief summary was presented on the top left corner of the screen to remind participants of their task and how the points system
worked. In addition, the participants' cumulative score,
and the number of points scored on the previous trial,
were displayed on the top right hand corner of the screen.
Clearly, participants in the sham-study and no-study
groups could not rely on implicit or explicit memory to
generate correct responses to the test cues (as no actual
targets were presented at study). Nevertheless, for purposes of comparison to the real-study group, we define
‘‘targets’’ for these groups the same way as in Experiment 1. Responses which were nominated for points
were treated as free-report responses, whereas all responses, regardless of whether or not they were nominated for points, were treated as forced-report
responses.
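The scoring rule just described, in which nominated responses count as free report and all responses count as forced report, can be sketched as follows. The trial representation (a tuple of responses, target, and nominated response) and the function names are hypothetical and are used only for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Trial = Tuple[Sequence[str], str, Optional[str]]  # (responses, target, nominated)


def score_trial(responses: Sequence[str], target: str,
                nominated: Optional[str] = None) -> Tuple[bool, bool]:
    """Return (free_report_hit, forced_report_hit) for one trial."""
    free_hit = nominated is not None and nominated == target
    forced_hit = target in responses  # any listed candidate counts
    return free_hit, forced_hit


def production_rates(trials: List[Trial]) -> Tuple[float, float]:
    """Targets produced out of the total number possible, per report type."""
    scored = [score_trial(*trial) for trial in trials]
    n = len(scored)
    free_rate = sum(free for free, _ in scored) / n
    forced_rate = sum(forced for _, forced in scored) / n
    return free_rate, forced_rate
```

Because every free-report hit is also a forced-report hit under this definition, the forced-report rate for a set of trials can never fall below the free-report rate.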
Results
Because of the change to the methodology in this
experiment, it was no longer feasible to calculate type
2 SDT measures of monitoring and report bias. The
large number of responses given in forced report meant
that the type 2 FA rates were below 5% for most participants, and below 1% in several cases, potentially making discrimination and bias statistics misleading.
Consequently, we report only the target production rate,
which corresponds to the number of targets produced in
a given condition out of the total number possible, and
the confidence data. Although we could not calculate A′
in this experiment, the close correspondence between it
and the confidence data analyses in the previous experiments made it feasible to use the confidence data as an
indicator of monitoring.
Target production
The mean free- and forced-report target production
rates for weak and strong cues, as a function of study
group, are shown in Table 8. Unsurprisingly, because
neither the sham-study nor the no-study group was presented with the targets prior to the test phase, both
groups generated virtually no targets in the context of
weak cues. Target production was close to zero for both
free report and forced report, so these data were not
analyzed further. To determine the effect of cue type,
strong- and weak-cue target production in just the real-study group was analyzed using paired-samples t tests.
In free report, target production was shown to be significantly better for weak cues than for strong cues,
t (20) = 3.37, SE = .037, η² = .362. The reverse was
found, however, in forced report, where target production was significantly better for strong cues than for
weak cues, t (20) = 3.41, SE = .047, η² = .368.
Although very few targets were retrieved for weak
cues in the sham-study and no-study groups, target production with strong cues was more successful. To determine if there were any differences between the groups in
these rates, two one-way ANOVAs were conducted on
strong-cue target production data from the three groups,
the first in free report, and the second in forced report.
In free report, the one-way ANOVA revealed a significant main effect of group, F (2, 60) = 7.31, MSE = .007,
Table 8
Mean free- and forced-report target production in Experiment 3
as a function of cue type and experimental group

Experimental group     Free report       Forced report
                       M       SD        M       SD
Weak cues
  Real study           .18     .15       .23     .18
  Sham study           .00     .00       .01     .01
  No study             .00     .00       .02     .02
Strong cues
  Real study           .05     .07       .39     .14
  Sham study           .06     .07       .47     .10
  No study             .14     .11       .48     .12
η² = .196. Independent-samples t tests showed that this
effect arose because strong-cue target production in free
report was significantly better in the no-study group than
in the real-study and sham-study groups, t (40) = 3.23,
SE = .028, η² = .207, and t (40) = 2.89, SE = .029,
η² = .173, respectively (see Table 8). There was no difference in target production success between the latter two
study groups, t (40) = 0.318.
In forced report, the one-way ANOVA again revealed
a significant main effect of group, F (2, 60) = 3.52,
MSE = .015, η² = .105. This effect arose because
strong-cue target production in forced report was significantly better for both the sham-study and the no-study
groups than for the real-study group, t (40) = 2.17,
SE = .038, η² = .105, and t (40) = 2.25, SE = .040,
η² = .112, respectively. Strong-cue target production in
forced report for the sham-study and no-study groups,
however, did not differ significantly,
t (40) = .221.
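The effect sizes reported alongside each F and t statistic in this section can be checked from the test statistic and its degrees of freedom. The sketch below assumes that the reported values are partial eta squared, computed as F × df_effect / (F × df_effect + df_error), with F = t² for a t test. The text does not state this explicitly, but the reported values are consistent with that formula. The function names are hypothetical.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared recovered from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


def partial_eta_squared_from_t(t_value: float, df: int) -> float:
    """Same quantity for a t test, using F = t**2 with one effect df."""
    return partial_eta_squared(t_value ** 2, 1, df)


# Checks against statistics reported in the text above.
print(round(partial_eta_squared(7.31, 2, 60), 3))        # free-report group effect
print(round(partial_eta_squared(3.52, 2, 60), 3))        # forced-report group effect
print(round(partial_eta_squared_from_t(3.23, 40), 3))    # no-study vs. real-study
```

Under this assumption the three calls reproduce the reported values of .196, .105, and .207, respectively.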
Confidence data
Confidence data (Table 4) produced by the real-study
group were collapsed across report option and a 2 (cue:
weak/strong) × 2 (accuracy: correct/incorrect) mixed
ANOVA was conducted. This analysis revealed significant main effects of cue, F (1, 18) = 316.07, MSE = .131,
η² = .946, accuracy, F (1, 18) = 539.97, MSE = .120,
η² = .968, and a significant cue × accuracy interaction,
F (1, 18) = 217.47, MSE = .156, η² = .924. Overall, participants in the real-study group gave higher confidence
ratings for weak cues (3.07) than strong cues (1.59), and
higher confidence ratings for correct (3.25) than incorrect
(1.40) responses. The interaction arose because this difference in confidence ratings between correct and incorrect
responses was larger for weak cues (difference = 3.19)
than for strong cues (difference = 0.51). These results on
confidence ratings suggest that better monitoring for
weak cues than strong cues was apparent in this experiment just as it was in Experiments 1 and 2.
Because of the very low number of correct responses
produced by the sham and no-study groups in the context
of weak cues, cross-group comparisons were restricted to
confidence data produced for strong cues. As with previous analyses, the data were collapsed across report option
(see Table 4) and a 3 (group: real/sham/no study) × 2
(accuracy: correct/incorrect) mixed ANOVA was carried
out. This analysis yielded a significant accuracy main
effect, F (1, 60) = 101.69, MSE = .103, η² = .629 (correct = 2.43; incorrect = 1.85), and a significant group
main effect, F (2, 60) = 13.58, MSE = .819, η² = .312.
The significant group main effect arose because for strong
cues, the real-study group (1.60) gave lower confidence
ratings than both the sham-study (2.19), F (1, 40) =
10.40, MSE = .356, η² = .206, and no-study (2.63)
groups, F (1, 40) = 28.57, MSE = .386, η² = .417. The difference in confidence ratings between the sham-study and
no-study groups was marginally significant, F (1, 40) =
4.00, MSE = .486, η² = .091, p < .06. Finally, a significant group × accuracy interaction was also found,
F (2, 60) = 7.65, MSE = .103, η² = .203. This interaction
occurred because the difference in confidence ratings between correct and incorrect responses was significantly
larger in the no-study group than in both the real-study
group, F (1, 40) = 5.69, MSE = .121, η² = .125, and
the sham-study group, F (1, 40) = 14.59, MSE = .103,
η² = .267.
At first glance, the fact that the no-study and sham-study groups assigned different levels of confidence to
target and non-target responses is surprising because
neither of these groups was exposed to any targets during study. We speculate that, in the case of strong cues,
participants were responding to the ease with which a related word can be generated, which happens to be correlated with the high degree of association between strong
cues and words chosen as targets for the real-study
group. At any rate, the significant group by accuracy
interaction obtained here indicates a clear influence of
study-list structure on monitoring. Presumably, because
both the real-study group and the sham-study group
were exposed to a list of weak associates, confidence in
targets produced to strong cues was low due to their
inconsistency with the list structure; only targets produced to strong cues after studying no list at all were assigned reasonably high levels of confidence.
Discussion
Experiment 3 demonstrated that the cued-recall
group, who studied targets in the context of weakly related associate pairs (real-study group), were less likely
to produce those targets in the context of non-reinstated
strong cues, than were the generation groups (sham-study and no-study), who never studied the targets. This
effect was apparent in free-report, where more targets
were retrieved by the no-study group than the real-study
group, and in forced-report, where more targets were retrieved by both the no-study and sham-study groups
than the real-study group. These effects were obtained
despite the fact that participants were given the opportunity to generate multiple responses to the cues, and provided with an incentive to produce multiple candidates.
Consequently, the notion that targets were actually covertly generated in cued recall, but not reported (i.e., not
recognized), seems less likely. Instead, the results of
Experiment 3 add support to the conclusion, derived
from Experiments 1 and 2, that poor cued-recall performance to strong extralist cues in typical encoding-specificity experiments is partly attributable to generation failure. Whereas Experiments 1 and 2 demonstrated
that some guidance regarding the nature of the relationship between the cue and the sought-after target improved forced-report target production with strong
extralist cues, there was some question about the degree
to which recognition processes might have influenced
this result. However, Experiment 3 demonstrated,
unequivocally, that generation failure is at least partly
the problem that cued-recall participants experience.
We suggested above that learning the weak-associate
study-list structure, which was not conducive to generating strong associates to strong test cues, was one possible cause of poor strong-cue target production.
However, a comparison of the sham-study and no-study
groups suggests that the list structure by itself had little
effect on forced-report retrieval, our measure of generation, but had substantial effects on monitoring (Tables 4
and 8). The most likely explanation for this is that the
multiple-response methodology limited the effect of the
inappropriate list structure such that targets to strong
cues in the sham-study group were generated, but confidence assigned to them was lowered. The fact that this
comparison involves only pure generation groups makes
generalization to possible effects that the study-list structure might have in cued recall somewhat tenuous. Nonetheless, it is clear that, based on the current data, the
study-list structure alone cannot account for the generation failure observed in the real-study group.
On the other hand, a comparison of the real-study
and sham-study groups revealed that either the direct
memory instructions alone, or these instructions in conjunction with the weak-associate study list, produced
generation failure. As we mentioned above, our preferred explanation is that the combination of direct memory instructions and inappropriate study-list structure is
particularly damaging to strong-cue target production.
Although the current data cannot determine whether direct memory instructions exert a singular effect, or
whether they are involved in an interactive effect with
study-list structure, our preference is based on the results
of other experiments we have conducted that favor the
latter explanation. For example, Higham and Tam
(2005) found that generation failure (i.e., greater
forced-report target production in a generation group
shown no targets during study than in a cued-recall
group exposed to the targets) was only obtained if the
study list consisted of weak-associate pairs. We describe
this research in more detail in the General discussion.
The multiple-response methodology introduced a potential confound in the experiment; namely, there was
no longer strict control over report output, as there was
in Experiments 1 and 2. Indeed, the real-study group produced significantly fewer responses (2.04) than both the
sham-study group (2.91; t (40) = 3.43, SE = 0.25, η² =
.227) and the no-study group (2.74; t (40) = 2.41,
SE = 0.29, η² = .127). These differences are potentially
problematic because they represent a form of report bias
shift in forced report, which may, by itself, account for
superior strong-cue, forced-report target production in
the generation groups compared to the cued-recall group.
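As a check on the statistics reported above, the effect sizes accompanying the two t tests (.227 and .127) are consistent with eta squared computed from the t value and its degrees of freedom, t² / (t² + df). The sketch below assumes that this is how the values were obtained; the function name is ours and is introduced only for illustration.

```python
def eta_squared_from_t(t: float, df: int) -> float:
    """Eta squared for a two-group comparison: t^2 / (t^2 + df).

    Assumption: the reported effect sizes are eta squared derived
    from the t statistics; the article does not state the formula.
    """
    return t * t / (t * t + df)


# t values and degrees of freedom as reported in the text above
real_vs_sham = eta_squared_from_t(3.43, 40)
real_vs_none = eta_squared_from_t(2.41, 40)

print(round(real_vs_sham, 3), round(real_vs_none, 3))  # 0.227 0.127
```

Both values reproduce the reported figures to three decimal places.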
To explore the role of this possible confound, strong-cue target production was determined in each experimental group separately for trials corresponding to
one, two, three, four, five, and six responses. If a report
bias shift was solely responsible for the better strong-cue
target production in the generation groups compared to
the real-study group, target production should increase
as more responses are offered, but performance between
the groups corresponding to each output level should be
equated. By this reasoning, the better target production
in the generation groups, compared to the real-study
group, would be due to the fact there are simply more
trials with more responses in the former groups, which
would boost their overall target production rate.
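The logic of this conditional analysis can be sketched as follows. The code is a minimal illustration, not the procedure actually used: the trial records are invented, and the function name and tuple layout are hypothetical. It groups trials by experimental group and by the number of responses offered, then reports the proportion of trials on which the target was produced together with the number of contributing observations.

```python
from collections import defaultdict


def production_by_output_level(trials):
    """trials: iterable of (group, n_responses, target_produced) tuples.

    Returns {group: {n_responses: (proportion_produced, n_observations)}}.
    """
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for group, n_responses, produced in trials:
        cell = counts[group][n_responses]
        cell[0] += int(produced)  # trials on which the target was produced
        cell[1] += 1              # all trials at this output level
    return {
        group: {
            level: (hits / n, n)
            for level, (hits, n) in sorted(levels.items())
        }
        for group, levels in counts.items()
    }


# Invented records for illustration only (not data from Experiment 3)
trials = [
    ("real-study", 1, False), ("real-study", 1, True),
    ("real-study", 2, False), ("real-study", 2, True), ("real-study", 2, False),
    ("no-study", 1, True), ("no-study", 1, True),
    ("no-study", 2, True), ("no-study", 2, True), ("no-study", 2, False),
]

table = production_by_output_level(trials)
for group, levels in table.items():
    for level, (proportion, n) in levels.items():
        print(group, level, round(proportion, 2), f"({n})")
```

If a shift in report bias were the whole story, the proportions at each output level would be the same across groups, and only the number of observations per level would differ.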
The results of this analysis are shown in Fig. 1. The
number of observations contributing to each mean,
summed across participants, appears in brackets. Analyses of these data were limited to descriptive statistics
because several participants in each group contributed
no data to some of the cells, making calculation of inferential statistics unviable. Nonetheless, the data pattern is
clear enough to eliminate a report-bias interpretation of
the obtained results. As shown in the figure, although
there was a trend for better target production as more
responses were offered (i.e., positive slope), strong-cue
target production in the real-study group was worse
than that in both generation groups at virtually every level of responding. This analysis suggests that the poorer
target production in the real-study group, relative to the
generation groups, was due to generation failure, not a
shift in report bias.
Fig. 1. Strong-cue target production in Experiment 3 as a
function of the number of responses offered to each cue. The
number of observations contributing to each point, summed
across participants, is also indicated in parentheses.
The results from Experiment 3 also eliminate a recognition-failure explanation for the good target production observed in the guided groups of Experiments 1
and 2. The critic may argue that the instructions to generate-then-recognize to strong cues effectively rendered
performance in these conditions comparable to indirect
memory instructions. That is, participants may have
simply stopped trying to recognize the candidates that
they generated, responding with the first word that came
to mind. By effectively eliminating or reducing the influence of the recognition stage in the guided groups, recognition failure was reduced or eliminated, and target
production to strong cues was boosted. However, the
multiple response methodology used in Experiment 3,
with the emphasis on generating multiple candidates
and deciding whether or not to put one forward for
points, effectively made the real-study group a generate-recognize group like the guided groups of Experiments 1 and 2. Indeed, the fact that the
number of candidates generated in forced report (2.04) was more than
twice that observed in either Experiment 1 or 2 (1.00)
suggests that the real-study group fits the generate-recognize description better than the guided groups. Despite this, there was still good evidence for generation
failure in the real-study group. Given that there was evidence of generation failure even under the conditions of
Experiment 3, it seems more likely that the excellent
strong-cue target production observed in the guided
groups in Experiments 1 and 2 was due to an effect on
generation, not recognition.
General discussion
The current experiments tested whether poor cued-recall target production observed with strong extralist cues
in the standard encoding-specificity paradigm is due solely to recognition failure. In Experiments 1 and 2, guiding participants about the strength of the associative
relationship between test cues and sought-after targets
was found to improve target production with strong
extralist cues considerably. To determine which particular stage(s) of recall was affected by the guidance manipulation, type 2 SDT methodology (Higham, 2002;
Higham & Gerrard, 2005; Vokey & Higham, 2005)
was employed, which allowed us to gain separate estimates of generation processes, recognition processes,
and response tendencies involved with cued recall. Consistent with earlier claims, this analysis indicated that
recognition performance to strong extralist cues was
poorer in the standard group than in the guided group,
and that report bias also tended to be somewhat more
conservative. However, counter to prior claims, the
analysis also indicated that guidance had a substantial
effect on forced-report target production, our measure
of the generation process; by the time participants provided responses to all the test cues in forced report, recall performance for strong cues in the guided groups
was superior to that in the group given standard cued-recall instructions. Thus, poor strong extralist cue target
production is not just a problem in recognizing easily
generated targets. In the context of standard cued-recall
instructions, participants also have difficulty generating
correct responses, despite the fact that the targets were
primary associates of many of the strong cues.
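The separation of generation, recognition, and response bias described above can be illustrated with a short sketch. It assumes that each cue yields one forced-report response plus a decision to report or withhold it, and it uses the nonparametric indices A' (Grier, 1971) and B''D (Donaldson, 1992) as stand-ins for the monitoring and bias estimates; whether these are the exact indices used in the experiments is an assumption here. The trial records and function names are hypothetical.

```python
def type2_rates(trials):
    """trials: list of (correct, reported) booleans, one pair per cue.

    Returns (forced_report_production, type2_hit_rate, type2_false_alarm_rate).
    Requires at least one correct and one incorrect response.
    """
    n_correct = sum(1 for correct, _ in trials if correct)
    n_incorrect = len(trials) - n_correct
    hits = sum(1 for correct, reported in trials if correct and reported)
    false_alarms = sum(1 for correct, reported in trials if not correct and reported)
    forced_report = n_correct / len(trials)  # generation estimate
    return forced_report, hits / n_correct, false_alarms / n_incorrect


def a_prime(h, f):
    """Grier's (1971) A': discrimination of correct from incorrect responses."""
    if h == f:
        return 0.5
    if h > f:
        return 0.5 + ((h - f) * (1 + h - f)) / (4 * h * (1 - f))
    return 0.5 - ((f - h) * (1 + f - h)) / (4 * f * (1 - h))


def b_double_prime_d(h, f):
    """Donaldson's (1992) B''D: positive values indicate conservative reporting.

    Undefined when h = 1 and f = 0, or when h = 0 and f = 1.
    """
    return ((1 - h) * (1 - f) - h * f) / ((1 - h) * (1 - f) + h * f)


# Invented trials: 5 correct (4 reported) and 5 incorrect (1 reported)
trials = ([(True, True)] * 4 + [(True, False)]
          + [(False, True)] + [(False, False)] * 4)

forced_report, h, f = type2_rates(trials)
monitoring = a_prime(h, f)
bias = b_double_prime_d(h, f)
print(forced_report, h, f, round(monitoring, 3), round(bias, 3))
```

In this scheme, a manipulation that changes forced-report production is acting on generation, whereas one that changes only the discrimination or bias index is acting on the recognition or report stage.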
We attribute the generation difficulties observed in
the current experiments to participants searching an
inappropriate domain when given standard cued-recall
instructions. There appear to be at least two reasons
for this inappropriate domain search. First, as a result
of studying weakly associated word pairs during study,
participants "learn the experimenter's design" (Higham
& Brooks, 1997); that is, participants learn about list-wide commonalities that exist amongst the study items,
and these commonalities partially define the class of
plausible candidates generated during the test. Such
learning is detrimental to finding targets with strong
cues, which accounts for poor strong-cue target production in the unguided group of Experiment 1 and the real-study group of Experiment 3. However, if participants
are freed from assumptions regarding the memory domains in which to search for targets, as in the guided
groups of Experiments 1 and 2, strong-cue target production improves dramatically.
Direct support for the role of study-list structure was
obtained recently by Higham and Tam (2005), who varied the nature of the relationship between the cues and
targets in the study list. They found that if a sham study
list consisted of entirely weak associates, as in Experiment 3 of the current series, the probability of target
production in a pure-generation task was less than if
the study list consisted of sham strong associates. If, instead, participants studied a list of moderate associates,
participants were released from generation failure; that
is, forced-report target production to strong cues in
the cued-recall group was better than pure-generation
target production. Furthermore, the cued-recall group
given the weak-associate study list tended to produce
non-target responses that were less strongly related to
the cues than the cued-recall group given a moderate
associate study list.
Despite these data, a comparison between forced-report target production in the sham-study group and the
real-study group of Experiment 3 indicated that the generation problems of the real-study group were not limited to incorrect assumptions derived from the study-list structure. Both the sham-study and real-study
groups studied a list of weakly associated word pairs,
yet forced-report target production was higher in the
sham-study group than in the real-study group. Thus,
direct memory instructions appear to be a second factor
contributing to inappropriate domain searching and
generation failure.
Research by Weldon and Colston (1995) and Guynn
and McDaniel (1999) may be helpful in interpreting the
role of direct-memory instructions. Weldon and Colston
found that cue reinstatement had differential effects on
the generation component of direct (cued recall) versus
indirect (generation) tests. More specifically, reinstating
the study context at test had a larger effect on an explicit
cued-recall test than on an implicit generation test. They
suggested that this was due to the fact that context-specific information could be used strategically to access
targets directly and efficiently in the cued-recall group,
limiting the search set when the context was reinstated,
whereas this did not occur in the generation group.
More recently, Guynn and McDaniel (1999) found an
advantage for target production in a generation group
compared to the cued-recall group, an effect that is analogous to our generation/cued-recall group difference. In
their experiment, participants studied exemplars of common categories (e.g., trees, ships) that were of either high
or low frequency within the category. One group of participants were later required to recall the exemplars with
the assistance of the (extralist) category label, whereas
another group of participants were required to generate
exemplars to the category labels, without concern for
whether the generated items were old or new. They
found that high-frequency exemplar production was
greater in the generate group than in the cued-recall
group, whereas this was not the case for low-frequency
exemplars.
It is important to point out that although both Weldon and Colston (1995) and Guynn and McDaniel
(1999) found better target production in their generate
groups than in their cued-recall groups, both groups
encountered the targets during the study phase. Consequently, their results lack the paradoxical feature obtained in our experiments: worse target production
when trying to remember recently encountered targets
than when merely generating without the benefit of recent target exposure. Nonetheless, our results, coupled
with the results of these studies, suggest that the type
of instructions that are given to participants can affect
the nature of the domains that are searched during the
generation stage. Whereas indirect (generation) instructions may lead participants to search the domain of
pre-experimental associations, direct (cued-recall) memory instructions may lead participants to search more recent, within-experiment episodes, using the cue and test
context to define the domain and guide the search. This
focus on searching within-experiment domains is likely
to enhance any effects of having learned an inappropriate list-wide structure during study, rendering the "potent mix," referred to above, that leads to generation
failure.
Because the search domain, and the candidates generated from it, are affected by the test instructions, there
is no guarantee that encountering items during study
will automatically lead to either implicit episodic priming or explicit memory for the prior processing episodes.
If direct-memory instructions are used, but the study context is not reinstated, participants may search domains
that are very unlikely to contain the target. Thus, performance in such situations may be very poor, even poorer
than performance in a situation where targets have not
been recently encountered, and generation instructions
are used. These results underline what are perhaps the
two most critical points of our research: (1) metacognitive
flexibility in episodic memory retrieval and (2) the necessity of estimating that flexibility by considering both
memory access (generation) and memory monitoring
(recognition) as separate stages of processing.
A modernized generate-recognize theory
Clearly, the results of the current experiments are
wholly incompatible with traditional generate-recognize
theory. If participants fell back on generating candidates
that were semantic associates of the test cues once direct
retrieval failed, as these models posit, then guidance
regarding the nature of the cue-target relationship in
Experiments 1 and 2 should have had little effect on
cued-recall performance. Additionally, target production in the real-study group of Experiment 3 should have
equalled or bettered performance in the no-study and
sham-study groups of Experiment 3.
Why, then, do we continue to refer to discrete generation and recognition processes in cued recall? Perhaps
the answer is best summarized in Jacoby and Hollingshead's (1990) comment, made 20 years or so after such
models were first troubled by the introduction of the
encoding-specificity principle: "Generate/recognize
models are too useful as descriptions of memory monitoring and other activities to be abandoned" (p. 452).
This sentiment is reflected in the fact that most modern
metacognitive models are essentially generate-recognize
models at heart. Although the stages have different labels (production–evaluation, Whittlesea, 1997; retrieval-monitoring, Koriat & Goldsmith, 1996), each
distinguishes between separate memory access, in the
broad sense of the term, and memory assessment stages.
We have attempted in our research to turn this distinction, which is now mostly uncontroversial in the metacognitive literature, full circle to revisit the analogous
distinction in cued recall. In our view, the access-monitoring distinction became blurred as concepts such as
ecphory (Tulving, 1983), direct access, and conscious
recollection took over, chiefly pushing metamemory
processes aside (although see Payne, Jacoby, & Lambert,
2004). It is as if both monitoring and retrieval processes
have been assumed to be virtually perfect when direct
access occurs. The former is revealed by the common
use of free-report measures of cued recall performance,
which is only justified if direct access is perfectly monitored (otherwise monitoring will contaminate the measure). The latter is revealed with the equating of
"remember" judgments (Tulving, 1985) with veridical
retrieval (see Higham & Vokey, 2004 for discussion).
We believe, however, that all memory decisions involve
memory access, memory monitoring, and bias parameters to varying degrees, and that the methodology and
tools that memory researchers use should have the
inherent ability to estimate them. The best way we have
found to make these separate estimates is to use type 2
SDT, regardless of whether the tasks are understood
to involve direct retrieval or not.
References
Anderson, J. R., & Bower, G. H. (1972). Recognition and
retrieval processes in free recall. Psychological Review, 79,
97–123.
Bahrick, H. P. (1969). Measurement of memory by prompted
recall. Journal of Experimental Psychology, 79, 213–219.
Bahrick, H. P. (1970). Two-phase model for prompted recall.
Psychological Review, 77(3), 215–222.
Bahrick, H. P. (1979). Broader methods and narrower theories
for memory research: Comments on the papers by Eysenck
and Cermak. In L. S. Cermak & F. I. M. Craik (Eds.),
Levels of processing in human memory (pp. 141–156). Hillsdale, NJ: Erlbaum.
Baker, L., & Santa, J. L. (1977). Context, integration, and
retrieval. Memory & Cognition, 5, 308–314.
Barnes, A. E., Nelson, T. O., Dunlosky, J., Mazzoni, G., &
Narens, L. (1999). An integrative system of metamemory
components involved in retrieval. In D. Gopher & A. Koriat
(Eds.), Cognitive regulation of performance: Interaction of
theory and application (Attention and performance XVII)
(pp. 289–313). Cambridge, MA: MIT Press.
Bartling, C. A., & Thompson, C. P. (1977). Encoding specificity: Retrieval asymmetry in the recognition failure paradigm. Journal of Experimental Psychology: Human Learning
and Memory, 3(6), 690–700.
Bodner, G. E., Masson, M. E. J., & Caldwell, J. I. (2000).
Evidence for a generate-recognize model of episodic influences on word-stem completion. Journal of Experimental
Psychology: Learning, Memory, and Cognition, 26(2),
267–293.
Bower, G. H. (1967). A multicomponent theory of the memory
trace. In K. W. Spence & J. T. Spence (Eds.). The
psychology of learning and motivation (Vol. 1). New York:
Academic Press.
Brainerd, C. J., Payne, D. G., Wright, R., & Reyna, V. F. (2003).
Phantom recall. Journal of Memory and Language, 48,
445–467.
Brainerd, C. J., Wright, R., Reyna, V. F., & Payne, D. G.
(2002). Dual-retrieval processes in free and associative
recall. Journal of Memory and Language, 46, 120–152.
Brooks, L. R. (1978). Nonanalytic concept formation and
memory for instances. In E. Rosch & B. Lloyd (Eds.),
Cognition and categorization (pp. 169–211). Hillsdale, NJ:
Erlbaum.
Clarke, F. R., Birdsall, T. G., & Tanner, W. P. (1959). Two
types of ROC curves and definitions of parameters. Journal
of the Acoustical Society of America, 31, 629–630.
Donaldson, W. (1992). Measuring recognition memory. Journal
of Experimental Psychology: General, 121, 275–277.
Ehrlich, S., & Philippe, M. (1976). Encoding specificity,
retrieval specificity or structural specificity? Journal of
Verbal Learning and Verbal Behavior, 15, 537–548.
Estes, W. K. (1959). The statistical approach to learning theory.
In S. Koch (Ed.). Psychology: A study of a science (Vol. 2).
New York: McGraw-Hill.
Flexser, A. J., & Tulving, E. (1978). Retrieval independence in
recognition and recall. Psychological Review, 85(3), 153–171.
Galvin, S. J., Podd, J. V., Drga, V., & Whitmore, J. (2003).
Type 2 tasks in the theory of signal detectability: Discrimination between correct and incorrect decisions. Psychonomic Bulletin & Review, 10, 843–876.
Gardiner, J. M. (1988). Recognition failures and free-recall
failures: Implications for the relation between recall and
recognition. Memory & Cognition, 16(5), 446–451.
Gillund, G., & Shiffrin, R. M. (1984). A retrieval model for
both recognition and recall. Psychological Review, 91, 1–67.
Goldsmith, M., Koriat, A., & Weinberg-Eliezer, A. (2002).
Strategic regulation of grain size in memory reporting.
Journal of Experimental Psychology: General, 131(1), 73–95.
Grier, J. B. (1971). Nonparametric indexes for sensitivity and bias:
Computing formulas. Psychological Bulletin, 75, 424–429.
Guynn, M. J., & McDaniel, M. A. (1999). Generate-sometimes
recognize, sometimes not. Journal of Memory and Language,
41, 398–415.
Higham, P. A. (2002). Strong cues are not necessarily weak:
Thomson and Tulving (1970) and the encoding specificity
principle revisited. Memory & Cognition, 30, 67–80.
Higham, P. A., & Brooks, L. R. (1997). Learning the
experimenter's design: Tacit sensitivity to the structure of
memory lists. Quarterly Journal of Experimental Psychology,
50A, 199–215.
Higham, P. A., & Gerrard, C. (2005). Not all errors are created
equal: Metacognition and changing answers on multiple-choice tests. Canadian Journal of Experimental Psychology,
59, 28–34.
Higham, P. A., & Tam, H. (2005). Release from generation
failure: The role of study-list structure. Manuscript submitted for publication.
Higham, P. A., & Vokey, J. R. (2004). Illusory recollection and
dual-process models of recognition memory. Quarterly
Journal of Experimental Psychology, 57A, 714–744.
Humphreys, M. S., Bain, J. D., & Pike, R. (1989). Different
ways to cue a coherent memory system: A theory for
episodic, semantic and procedural tasks. Psychological
Review, 96(2), 208–233.
Jacoby, L. L. (1983). Remembering the data: Analyzing
interactive processes in reading. Journal of Verbal Learning
and Verbal Behavior, 22, 485–508.
Jacoby, L. L. (1996). Dissociating automatic and consciously
controlled effects of study/test compatibility. Journal of
Memory and Language, 35, 32–52.
Jacoby, L. L. (1998). Invariance in automatic influences of
memory: Toward a user's guide for the process-dissociation
procedure. Journal of Experimental Psychology: Learning,
Memory, and Cognition, 24(1), 3–26.
Jacoby, L. L., & Hollingshead, A. (1990). Toward a generate/
recognize model of performance on direct and indirect tests
of memory. Journal of Memory and Language, 29, 433–454.
Jones, G. V. (1978). Recognition failure and dual mechanisms
in recall. Psychological Review, 85(5), 464–469.
Jones, G. V. (1987). Independence and exclusivity among
psychological processes: Implications for the structure of
recall. Psychological Review, 94, 229–235.
Jones, G. V., & Gardiner, J. M. (1990). Recognition failure
when recognition targets and recall cues are identical.
Bulletin of the Psychonomic Society, 28(2), 105–108.
Kelley, C. M., & Sahakyan, L. (2003). Memory, monitoring,
and control in the attainment of memory accuracy. Journal
of Memory and Language, 48, 704–721.
Kintsch, W. (1974). The representation of meaning in memory.
Hillsdale, NJ: Erlbaum.
Klatzky, R. L., & Erdelyi, M. H. (1985). The response criterion
problem in tests of hypnosis and memory. The International
Journal of Clinical and Experimental Hypnosis, 33, 246–257.
Koriat, A., & Goldsmith, M. (1996). Monitoring and control
processes in the strategic regulation of memory accuracy.
Psychological Review, 103, 490–517.
Martin, E. A. (1975). Theoretical notes: Generation-recognition
theory and the encoding specificity principle. Psychological
Review, 82(2), 150–153.
Morris, C. D., Bransford, J. D., & Franks, J. J. (1977). Levels
of processing versus transfer appropriate processing. Journal of Verbal Learning and Verbal Behavior, 16, 519–533.
Muter, P. (1984). Recognition and recall of words with a single
meaning. Journal of Experimental Psychology: Learning,
Memory, and Cognition, 10(2), 198–202.
Naveh-Benjamin, M., & Guez, J. (2000). Effects of divided
attention on encoding and retrieval processes: Assessment
of attentional costs and a componential analysis. Journal of
Experimental Psychology: Learning, Memory, and Cognition, 26(6), 1461–1482.
Nelson, D. L., Bennett, D. J., Gee, N. R., Schreiber, T. A., &
McKinney, V. M. (1993). Implicit memory: Effects of
network size and interconnectivity on cued recall. Journal
of Experimental Psychology: Learning, Memory, and Cognition, 19(4), 747–764.
Nelson, D. L., Bennett, D. J., & Leibert, T. W. (1997). One step
is not enough: Making better use of association norms to
predict cued recall. Memory & Cognition, 25(6), 785–796.
Nelson, D. L., & Goodmon, L. B. (2003). Disrupting attention:
The need for retrieval cues in working memory. Memory &
Cognition, 31, 65–76.
Nelson, D. L., McEvoy, C. L., & Friedrich, M. A. (1982).
Extralist cuing and retrieval inhibition. Journal of Experimental Psychology: Learning, Memory, and Cognition, 8(2),
89–105.
Nelson, D. L., McKinney, V. M., Gee, N. R., & Janczura, G.
A. (1998). Interpreting the influence of implicitly activated
memories on recall and recognition. Psychological Review,
105, 299–324.
Nelson, D. L., Schreiber, T. A., & McEvoy, C. L. (1992).
Processing implicit and explicit representations. Psychological Review, 99(2), 322–348.
Nelson, T. O. (1984). A comparison of current measures of the
accuracy of feeling-of-knowing predictions. Psychological
Bulletin, 95, 109–133.
Nilsson, L. G., & Gardiner, J. M. (1993). Identifying exceptions
in a database of recognition failure studies from 1973 to
1992. Memory & Cognition, 21(3), 397–410.
Payne, B. K., Jacoby, L. L., & Lambert, A. J. (2004). Memory
monitoring and the control of stereotype distortion. Journal
of Experimental Social Psychology, 40, 52–64.
Pellegrino, J. W., & Salzberg, P. M. (1975a). Encoding specificity
in associative processing tasks. Journal of Experimental
Psychology: Human Learning and Memory, 1(5), 538–548.
Pellegrino, J. W., & Salzberg, P. M. (1975b). Encoding
specificity in cued recall and context recognition. Journal
of Experimental Psychology: Human Learning and Memory,
104(3), 261–270.
Postman, L. (1975). Tests of the generality of the principle of
encoding specificity. Memory & Cognition, 3(6), 663–672.
Reder, L. M., Anderson, J. R., & Bjork, R. A. (1974). A
semantic interpretation of encoding specificity. Journal of
Experimental Psychology, 102(4), 648–656.
Roediger, H. L., & Adelson, B. (1980). Semantic specificity in
cued recall. Memory & Cognition, 8(1), 65–74.
Santa, J. L., & Lamwers, L. L. (1974). Encoding specificity:
Fact or artifact? Journal of Verbal Learning and Verbal
Behavior, 13, 412–423.
Santa, J. L., & Lamwers, L. L. (1976). Where does the confusion
lie? Comments on the Wiseman and Tulving paper. Journal of
Verbal Learning and Verbal Behavior, 15, 53–57.
Sikstrom, S. P., & Gardiner, J. M. (1997). Remembering,
knowing and the Tulving-Wiseman law. European Journal
of Cognitive Psychology, 9(2), 167–185.
Snodgrass, J. G., & Corwin, J. (1988). Pragmatics of measuring
recognition memory: Applications to dementia and amnesia. Journal of Experimental Psychology: General, 117,
34–50.
Thomson, D. M., & Tulving, E. (1970). Associative encoding
and retrieval: Weak and strong cues. Journal of Experimental Psychology, 86(2), 255–262.
Toth, J. P., Reingold, E. M., & Jacoby, L. L. (1994). Toward a
redefinition of implicit memory: Process dissociations
following elaborative processing and self-generation. Journal of Experimental Psychology: Learning, Memory, and
Cognition, 20, 290–303.
Tulving, E. (1974). Recall and recognition of semantically
encoded words. Journal of Experimental Psychology, 102,
778–787.
Tulving, E. (1983). Elements of episodic memory. Oxford:
Oxford University Press.
Tulving, E. (1985). Memory and consciousness. Canadian
Psychology, 26, 1–12.
Tulving, E., & Thomson, D. M. (1973). Encoding specificity
and retrieval processes in episodic memory. Psychological
Review, 80(5), 352–373.
Tulving, E., & Watkins, O. C. (1977). Recognition failure of
words with a single meaning. Memory & Cognition, 5(5),
513–522.
Underwood, B. J. (1969). Attributes of memory. Psychological
Review, 76, 559–573.
Vokey, J. R., & Higham, P. A. (2005). Components of recall:
The semantic specificity effect and the monitoring of cued
recall. Manuscript submitted for publication.
Watkins, M. J., & Tulving, E. (1975). Episodic memory: When
recognition fails. Journal of Experimental Psychology:
General, 104(1), 5–29.
Weldon, M. S., & Colston, H. L. (1995). Dissociating the
generation stage in implicit and explicit memory tests:
Incidental production can differ from strategic access.
Psychonomic Bulletin & Review, 2(3), 381–386.
Whittlesea, B. W. A. (1997). Production, evaluation, and
preservation of experiences: Constructive processing in
remembering and performance tasks. In D. L. Medin
(Ed.). The psychology of learning and motivation (Vol. 37,
pp. 211–264). San Diego, CA: Academic Press.
Wickens, D. D. (1970). Encoding categories of words: An
empirical approach to meaning. Psychological Review, 77,
1–15.
Wiseman, S., & Tulving, E. (1975). A test of confusion theory
of encoding specificity. Journal of Verbal Learning and
Verbal Behavior, 14, 370–381.
Wiseman, S., & Tulving, E. (1976). Encoding specificity:
Relation between recall superiority and recognition failure.
Journal of Experimental Psychology: Human Learning and
Memory, 2(4), 349–361.
Zeelenberg, R., Pecher, D., Shiffrin, R. M., & Raaijmakers, J.
G. W. (2003). Semantic context effects and priming in word
association. Psychonomic Bulletin & Review, 10, 653–660.