ARTICLE IN PRESS

Journal of Memory and Language xxx (2005) xxx–xxx
www.elsevier.com/locate/jml

Generation failure: Estimating metacognition in cued recall ☆

Philip A. Higham *, Helen Tam

School of Psychology, University of Southampton, Highfield, Southampton SO17 1BJ, UK

Received 9 August 2004; revision received 20 January 2005

Abstract

Three experiments examined generation, recognition, and response bias in the original encoding-specificity paradigm using the type 2 signal-detection analysis advocated by Higham (2002). Experiments 1 (pure-list design) and 2 (mixed-list design) indicated that some guidance regarding the strength of the associative relationship between the test cue and target greatly improved strong-cue target production relative to no guidance, and that this effect was attributable to improved generation, as well as recognition. Problems with generating candidates for response during standard cued recall were further shown in Experiment 3, where despite having the opportunity to provide multiple responses for each cue, participants' ability to produce the targets remained poor. The results are discussed in terms of traditional and modern generate-recognize theory, metacognition, and dual-route models of recall.

© 2005 Elsevier Inc. All rights reserved.

Keywords: Cued recall; Generate-recognize; Metacognition; Encoding specificity

☆ Philip A. Higham and Helen Tam, School of Psychology, University of Southampton. Preparation of this article was supported by a research grant from the British Academy. Portions of this research were presented at the 43rd annual meeting of the Psychonomic Society, November, 2002 in Kansas City, Missouri, USA, and at the 45th annual meeting of the Psychonomic Society, November, 2004 in Minneapolis, Minnesota, USA. We thank Nina Eskriett, Wendy Kneller and David Brook for research assistance. We thank Harry Bahrick, Chuck Brainerd, Morris Goldsmith, Asher Koriat, Steve Lindsay, Doug Nelson, and Mike Watkins for helpful comments on earlier drafts of this article.
* Corresponding author. Fax: +44 23 8059 4597. E-mail address: [email protected] (P.A. Higham).
0749-596X/$ - see front matter © 2005 Elsevier Inc. All rights reserved. doi:10.1016/j.jml.2005.01.015

Since Tulving and colleagues introduced the encoding specificity principle in the early 1970s (e.g., Thomson & Tulving, 1970; Tulving & Thomson, 1973), most students of memory have viewed generate-recognize theory as a straw man. It is considered by many to be an old-fashioned theory, with a particular failing when it comes to explaining context reinstatement effects in cued recall. In this paper, we revisit both generate-recognize theory and the classic cued-recall paradigm that provided the initial support for the encoding-specificity principle. However, let us be clear at the outset that we are not attempting to resurrect traditional generate-recognize theory. Indeed, as will become apparent, the data from the experiments that we report are quite inconsistent with those early models, and, if anything, they support many aspects of Tulving's message. On the other hand, we will argue that a more modern generate-recognize model of cued recall that maintains the crucial distinction between memory access (generation) and metacognitive monitoring (recognition) processes is still a useful framework for cued-recall performance, and performance on many other tasks as well. In this way, our message is similar to that of other metacognitive researchers who have promoted two-stage models involving separate stages of access and memory monitoring (e.g., Barnes, Nelson, Dunlosky, Mazzoni, & Narens, 1999; Goldsmith, Koriat, & Weinberg-Eliezer, 2002; Higham, 2002; Kelley & Sahakyan, 2003; Klatzky & Erdelyi, 1985; Koriat & Goldsmith, 1996). Before describing our unique method of analyzing the underlying generation and recognition processes, we will first review the traditional generate-recognize models, and some of their variants.

Early generate-recognize theory and encoding specificity

Early generate-recognize theory (e.g., Anderson & Bower, 1972; Bahrick, 1969, 1970) proposed that cued recall could be achieved by covertly generating associates of test cues, and then attempting to recognize the sought-after target from amongst the generated candidates. Support for the theory came, in part, from experiments demonstrating that the associative strength between the test cue and the target affected the probability of recall (e.g., Bahrick, 1970).

In the early 1970s, Tulving and colleagues (e.g., Thomson & Tulving, 1970; Tulving & Thomson, 1973; Wiseman & Tulving, 1976) argued that generate-recognize models were an insufficient account of recall for two reasons. First, Thomson and Tulving (1970) demonstrated that extralist retrieval cues that are strong associates of the target words (based on free association norms) are not very effective retrieval cues, particularly if the target words were encoded in relation to some other (weak) associate during study. If cued recall is accomplished by first generating candidates from the test cue, and then recognizing the target from amongst the candidates, one would expect that strong associates would be excellent retrieval cues because the probability of accomplishing the first step in the process (generating the target) is high.

Second, Tulving and Thomson (1973) demonstrated recognition failure of recallable words. In this demonstration, participants were first given weak associate-target pairs to study.
Next they were provided with strong associates of the targets and asked to free associate. Unsurprisingly, copies of the targets were often generated during this phase of the experiment. Following free association, participants were asked to circle those generated items that were targets from the study list. Finally, they attempted to recall the targets in the presence of the weak cues that were encoded specifically with the targets during study. Recognition failure was revealed in that targets not recognized during the generate-recognize phase of the experiment were often recalled later in the presence of reinstated weak cues. This basic finding has been replicated many times (e.g., Bartling & Thompson, 1977; Gardiner, 1988; Postman, 1975; Reder, Anderson, & Bjork, 1974; Sikstrom & Gardiner, 1997; Tulving, 1974; Watkins & Tulving, 1975; Wiseman & Tulving, 1975, 1976; see Nilsson & Gardiner, 1993 for a review) and forms the basis of the Tulving–Wiseman law. Recognition failure is problematic for generate-recognize theory because recall is limited by two bottlenecks, whereas recognition is only limited by one (i.e., the target item has already been ‘‘generated’’ in recognition). Therefore, Tulving and Thomson reasoned that it is impossible for recall to be superior to recognition. Tulving and colleagues argued, instead, that their results were best explained in terms of the encoding specificity principle. According to this principle, the effectiveness of retrieval cues is determined by the extent to which the cues are encoded specifically with the to-be-remembered (TBR) information. Thus, strong extralist cues are generally not effective for retrieval, despite the fact that they elicit the TBR information with a high probability, because they were not encoded specifically with it. Similarly, recognition failure occurs because the cues available during recognition differ from those that were present at encoding. 
In contrast, the (weak) cues in recall are reinstated from study. The difference in the reinstatement status of the strong versus weak cues renders recall performance that is superior to recognition.

Variants of early generate-recognize theory

It is important to point out that Tulving and colleagues' data and criticisms only pertain to one class of generate-recognize models: those that assume “trans-situational identity of words” (Tulving & Thomson, 1973, p. 358). If a given target word is considered to have only a single representation in memory (for example, a node in a stable, abstract associative network), then the target representation generated in the context of a strong extralist associate at test must match the target representation activated in the context of a reinstated weak associate. Consequently, under this assumption, there is no way for generate-recognize models to explain either poor performance with strong extralist cues, or recognition failure, as Tulving and colleagues correctly pointed out. However, since the publication of the first encoding specificity papers, several authors have argued that the one-representation-per-word assumption need not necessarily hold, and that encoding-specificity effects can be incorporated into generate-recognize theory under different assumptions.

For example, Reder et al. (1974; see also Martin, 1975) proposed that a given word can have more than one “sense,” and that the sense of the target evoked at study, when it is presented with a weak associate, is different from the sense of the target when it is generated to a new, strong associate. For example, LIGHT, when generated in the context of the strong associate dark at test, has a different sense than the word LIGHT when presented in the context of the weak associate head at study; the former means LIGHT (luminance) while the latter means LIGHT (lamp) (see Martin, 1975). Thus, although the nominal stimulus is the same, different senses of it are processed at study and test, resulting in both poor strong extralist cued-recall performance and recognition failure.

The idea that encoding-specificity effects are reliant on the use of words with more than one sense has been criticized by Tulving and Watkins (1977) who showed that single-meaning words also demonstrate recognition failure (although see Muter, 1984). However, the spirit of the multiple-representation argument can be seen in explanations of encoding specificity using feature sampling theory (e.g., Bower, 1967; Estes, 1959; Kintsch, 1974; Underwood, 1969; Wickens, 1970), some of which are relevant to generate-recognize theory. For example, Pellegrino and Salzberg (1975b; see also Flexser & Tulving, 1978; Pellegrino & Salzberg, 1975a; Roediger & Adelson, 1980) suggested that a set of features is sampled from the target item both when it is presented at study (in which case the features are “tagged”), and when it is generated at test. To the extent that there is matching or overlap of the features sampled at test in the generated item, with those that are tagged at study, a positive recognition response is elicited, and the target is given as a response. In this amendment to generate-recognize theory, context plays a role in determining the features that are sampled at study and at test. To the extent that the features sampled in the same nominal stimulus are not fixed, the functional representations that are compared between study and test are not necessarily the same. Consequently, the problem that encoding specificity posed for generate-recognize theory, which is dependent on there being a single representation, is no longer applicable.

Modern generate-recognize theory and metacognition

Although these variants of the traditional generate-recognize model could incorporate encoding-specificity effects, we believe that a modernized version of the theory will have to make further changes. Many of the variants of early generate-recognize models assumed that the source of candidates is a stable, abstract associative network. However, in a radical shift away from this assumption, Jacoby and Hollingshead (1990) proposed that the memory base from which candidates are generated might be distributed, such that generation processes are influenced by specific prior episodes. This change moved generate-recognize theory away from a reliance on a semantic or associative memory system as a source of candidates, and rendered it more consistent with episodic or instance-based memory theory (e.g., Brooks, 1978; Jacoby, 1983; Whittlesea, 1997).

We agree wholeheartedly with Jacoby and Hollingshead's proposition; in fact, we cannot foresee how any generate-recognize model based on readout from a stable, abstract associative network can incorporate the now vast array of context effects that have emerged in memory research since the principles of encoding specificity (Tulving & Thomson, 1973) and transfer-appropriate processing (Morris, Bransford, & Franks, 1977) were introduced. In this vein, the experiments we report here, and those of others (e.g., Santa & Lamwers, 1974), demonstrate metacognitive flexibility and context-specific influences on recollective processing. For example, retrieval cues can be used to interrogate memory differently depending on the specific instructional set or on the specific study-list structure (see also Higham & Tam, 2005). These findings provide evidence against stability in the search set generated by test cues.
Thus, we believe that if generate-recognize theory is to remain a viable model of cued recall, it must relinquish the assumption that the product of the generation process is based on a stable, abstract associative network.

Another issue that needs to be addressed is how a modern, metacognitive generate-recognize theory will incorporate the concept of conscious recollection resulting from direct retrieval in cued recall. Since its introduction over three decades ago, the generate-recognize route to recall has typically been contrasted with another, more direct retrieval route (e.g., Bahrick, 1969, 1970, 1979; Bodner, Masson, & Caldwell, 2000; Brainerd, Wright, Reyna, & Payne, 2002; Gardiner, 1988; Guynn & McDaniel, 1999; Jacoby, 1996, 1998; Jones, 1978, 1987; Jones & Gardiner, 1990; Naveh-Benjamin & Guez, 2000; Toth, Reingold, & Jacoby, 1994; Weldon & Colston, 1995). For example, Bahrick (1970; see also Bahrick, 1969, 1979) suggested that the generate-recognize route to recall was only implemented if direct retrieval failed. In his original experiments, cues (or prompts) were only provided for items that could not be freely recalled, a point which is often overlooked in criticisms of Bahrick's work (although see Gardiner, 1988 for a counter example). Thus, the generate-recognize process was seen as a fall-back process. At both an intuitive level, and at an empirical level, there appears to be fairly widespread consensus that recall can occur either directly and efficiently via direct access, or indirectly and inefficiently via a generate-recognize route. As Bahrick (1979) stated,

... no one has ever seriously suggested, for example, that recalling the name of one's wife involves generating a series of female names and selecting the correct name after having rejected several erroneously generated candidates. Much information of both an episodic and semantic type is recalled without a time-consuming search (p. 148).

This dual-route distinction characteristic of writings in cued-recall research suggests that conscious recollection is the result of relatively immediate access of a veridical memory trace (although see Brainerd, Payne, Wright, & Reyna, 2003). We agree with Bahrick (1979) in that most situations in which veridical recollection occurs are not preceded by an effortful phase during which multiple, plausible candidates are consciously generated in response to the available cues. However, if the concept of generation is broadened to include all sorts of access of information from long-term memory, then conscious recollection might be seen as a special case of generate-recognize processing. As will be shown, our data highlight how direct retrieval may not be as direct as previously thought; instead, ecphoric (Tulving, 1983) processing seems dependent on metacognitive factors, such as the number and/or strength of association of competing candidates for recall (Experiment 2). For now, the important point is that we see generate-recognize theory not as a point of contrast to direct retrieval, but rather as incorporating it. Situations that experimenters and participants alike identify as constituting direct retrieval can likely be understood in terms of parameters associated with monitoring and the attributions that are subsequently made.

Metacognition and cued recall

In this section, we will attempt to illustrate how metacognitive factors can be investigated in a cued-recall paradigm. Our research constitutes a follow-up to that presented in Higham (2002; see also Higham & Tam, 2005), and adopts the same methodology, so we will describe it in some detail. Higham replicated the main methodological components of Thomson and Tulving's (1970) experiments; participants studied weakly associated cue-target pairs and later were given a cued-recall test with a mixture of weak reinstated cues, or new strong cues. Recall performance with these cues was then compared to a no-cue recall condition in which targets were recalled without the assistance of any cues. The feature that made Higham's research different from Thomson and Tulving's (1970) research was that, for the cued-recall conditions, he examined target production under both free- and forced-report conditions; in free report, participants were given an incentive system such that they gained points for a correct answer, lost points for an incorrect answer, but neither gained nor lost any points for withholding answers. However, in forced report, the incentive system was removed and participants were asked to provide their best guess to the cues that were initially left blank (cf. Koriat & Goldsmith, 1996). He found that, in free report, there was evidence of encoding specificity; weak reinstated cues facilitated target production relative to no-cue recall, whereas new strong cues did not. However, the results in forced report were very different; in that condition, both weak and strong cues facilitated target production relative to no-cue recall, and they did so to the same degree (i.e., target production in the weak- and strong-cue conditions did not differ).

To explain these results, suppose participants in Higham's (2002) experiment generated candidate answers to each test cue in free report, and then attempted to recognize the target from amongst the alternatives. With weak cues, the reinstatement of context led to targets being recognized with high probability.1 Consequently, nearly all targets that were generated were also recognized and reported. However, with strong cues, participants presumably had no difficulty generating targets, but they very seldom successfully recognized them, and so responses to these cues were withheld. However, in forced report, the criterion of acceptability (report) was lowered, and the targets that were successfully generated to the strong cues, but not recognized, were revealed. As such, Higham's data are not inconsistent with the predominant explanation of poor strong extralist cue target production in this task: recognition failure.2

1 Our reference to generating targets in response to weak cues might be confusing to some readers: How is it possible to generate a target that is only weakly associated with the test cue? However, our use of the term in this context is only confusing if one adheres to the traditional sense of generation (i.e., a process of consciously producing candidate responses from an associative network once direct retrieval has failed). Consistent with what we consider to be a modern generate-recognize theory, we use the term “generate” here to incorporate not just the effortful process of producing semantic associates to the test cues, but also the more direct process of recollecting targets with the help of test cues. Conceived of in this way, targets to weak cues are potentially “generated” in just the same way that they are to strong cues.

2 Some researchers have demonstrated that context reinstatement affects generation processes (e.g., Nelson & Goodmon, 2003; Vokey & Higham, 2005; Zeelenberg, Pecher, Shiffrin, & Raaijmakers, 2003). Similarly, most modern memory models (e.g., SAM: Gillund & Shiffrin, 1984; PIER2: D.L. Nelson, McKinney, Gee, & Janczura, 1998; see also Humphreys, Bain, & Pike, 1989) assume that context reinstatement facilitates retrieval or access of information from memory (i.e., generation). However, the same cannot be said of the role of context in the strong–weak cue paradigm, which generated interest in the encoding-specificity principle in the first place. It has been taken for granted that the cause of both recognition failure, and poor cued-recall performance with non-reinstated strong cues in the strong–weak cue paradigm, lies in the recognition stage, not the generation phase. As Tulving and Thomson (1973) pointed out, participants presumably have no difficulty generating copies of the targets in response to strong cues; indeed, demonstrations of recognition failure of recallable words are dependent on participants successfully generating (but failing to recognize) target words to the strong extralist cues. These cues, after all, are deliberately chosen to elicit the targets with high probability on tests of free association.

Indeed, the analyses that Higham (2002) performed on the data supported this conclusion. For each participant, a 2 (report: yes–no) × 2 (accuracy: correct–incorrect) contingency table was constructed, like the one shown in Table 1, and the frequencies in it were used to calculate various measures of performance. First, free-report target production was defined as the proportion of all test cues that were assigned correct answers when there was the option to withhold responses (i.e., a/[a + b + c + d] in Table 1). Second, forced-report target production was defined as the proportion of all test cues that were assigned correct responses on the test once the option to withhold responses was removed (i.e., [a + c]/[a + b + c + d] in Table 1). Third, two metacognitive indices of performance, monitoring and report bias, were calculated using type 2 signal-detection theory (SDT; e.g., Clarke, Birdsall, & Tanner, 1959; Galvin, Podd, Drga, & Whitmore, 2003; Higham & Gerrard, 2005; Vokey & Higham, 2005). The hit (H) rate, in this type 2 context, was defined as the proportion of the total number of correct responses produced on the test (a + c) that were actually reported (i.e., H rate = a/[a + c] in Table 1).
Similarly, the false alarm (FA) rate was defined as the proportion of the total number of incorrect responses (b + d) that were reported (i.e., FA rate = b/[b + d] in Table 1). As in standard SDT, it was possible to calculate a discrimination index ($A'$; Grier, 1971) and a bias index ($B''_D$; Donaldson, 1992) from the H and FA rates, which corresponded to monitoring and report bias, respectively (see Higham, 2002, for more detail).

Table 1
The 2 × 2 contingency table and formulae used to derive the various measures discussed in the text

Response      Candidate answer
              Correct    Incorrect
Reported      a          b
Withheld      c          d

Note. Free-report target production = a/(a + b + c + d); forced-report target production = (a + c)/(a + b + c + d); hit rate (h) = a/(a + c); false alarm rate (fa) = b/(b + d); monitoring = $A' = .5 + \frac{(h - \mathit{fa})(1 + h - \mathit{fa})}{4h(1 - \mathit{fa})}$; report bias = $B''_D = \frac{(1 - h)(1 - \mathit{fa}) - h\,\mathit{fa}}{(1 - h)(1 - \mathit{fa}) + h\,\mathit{fa}}$.

Within the generate-recognize framework, monitoring corresponds to recognition; it is a measure of the degree to which participants report targets and withhold non-targets, in other words, the degree to which participants are able to discriminate the targets amongst generated candidates. Higham (2002) found that monitoring (recognition) was much higher for reinstated weak cues (.88) than non-reinstated strong cues (.61). Also, report bias, which is a measure of participants' tendency to offer versus withhold responses in free report, was much more liberal for weak cues (.31) than strong cues (.62).3 Thus, the overall pattern of results suggests that participants were able to generate targets to strong and weak cues with equal success, as indexed by equal forced-report target production for the two cue types. However, they did not recognize the targets that they generated to strong cues as well as they recognized targets generated to weak cues, as suggested by the cue effect on monitoring.
The difference in report bias suggests that participants were sensitive to the fact that their recognition performance with strong cues was poor and so responses were withheld in free report.

3 Readers familiar with Koriat and Goldsmith's (1996) methodology might be curious about how the type 2 SDT measure of bias that we use in the current research ($B''_D$; see also Higham, 2002; Higham & Gerrard, 2005) compares to their measure, Prc. In short, they are designed to measure different aspects of performance. Like all bias statistics in both type 1 and type 2 SDT, $B''_D$ is designed to measure participants' tendency to say “yes,” and is a monotonic function of the H and FA rates. Specifically, in the type 2 context, $B''_D$ measures participants' propensity to report candidate responses, and is based on triangular geometry in ROC space (see Donaldson, 1992 for more detail). The propensity to report candidates can be affected by either a shift in participants' criterion placement, or by a shift in the underlying distributions (over confidence) of the correct and incorrect candidates. Hence, when we refer to “liberal” or “conservative” bias or responding, we are referring to the participants' response tendency, not specifically to the placement of the criterion. Prc is better suited to estimate this placement. It estimates the confidence scale value that is aligned with the participants' report criterion by determining the maximum fit ratio. A fit ratio is the mean of (a) the percentage of would-be reported candidates that are actually reported at a given confidence level and (b) the percentage of would-be withheld candidates that are actually withheld at that same confidence level. By choosing the confidence level that maximizes this mean, one presumably estimates the confidence level associated with participants' report criterion. Discussion of the relative merits of these two measures is beyond the scope of the current paper, but interested readers might refer to Higham (2002), pp. 72–76. However, because $B''_D$ is inherently ambiguous with regard to its interpretation, we will report mean confidence along with $B''_D$ to potentially disambiguate it (i.e., to determine whether effects on bias are due to true criterion shifts or shifts in the underlying distributions).

Overview of the experiments

Higham's (2002) use of free- and forced-report methodology, in conjunction with type 2 SDT, is well suited to an analysis of modern generate-recognize theory, as it is possible to separate cognition from metacognition. As suggested above, forced-report target production serves as a measure of the generation process, discrimination ($A'$) serves as a measure of the monitoring or recognition process, and report bias ($B''_D$) serves as a measure of the response tendency.4 Although these indices are not process pure for reasons that we will discuss below, they offer a substantial improvement over the way that cued-recall performance has traditionally been assessed in the encoding-specificity literature. Indeed, in that literature, metacognitive variables are often overlooked completely; typically, only free-report target production has been analyzed (although see Pellegrino & Salzberg, 1975b). As Higham pointed out, free-report target production is a very “dirty” measure of performance because generation processes, recognition processes, and response bias all affect it (see also Higham & Gerrard, 2005; Kelley & Sahakyan, 2003; Koriat & Goldsmith, 1996). Thus, although the indices we report in the current research are not perfectly aligned with the underlying processes, together, they do a much better job at separating the individual mechanisms of cued recall than free-report target production alone.
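
As an illustration only, the measures defined in the note to Table 1 can be computed from a single participant's 2 × 2 table along the following lines. This is a minimal sketch rather than the authors' analysis code: the function names, the handling of empty rows, and the mirrored form of $A'$ used when the hit rate falls below the false alarm rate are assumptions that the text does not specify. The gamma function uses the standard 2 × 2 form of the gamma correlation mentioned in footnote 4, (ad − bc)/(ad + bc).

```python
import math


def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def a_prime(h, fa):
    """Monitoring index A' (Grier, 1971) from type 2 hit and false alarm rates."""
    if math.isnan(h) or math.isnan(fa):
        return math.nan
    if h == fa:
        return 0.5
    if h > fa:
        # Formula given in the note to Table 1.
        return 0.5 + ((h - fa) * (1 + h - fa)) / (4 * h * (1 - fa))
    # Assumption: mirrored form for below-chance monitoring (not in the text).
    return 0.5 - ((fa - h) * (1 + fa - h)) / (4 * fa * (1 - h))


def b_double_prime_d(h, fa):
    """Report bias B''_D (Donaldson, 1992); lower values are more liberal."""
    if math.isnan(h) or math.isnan(fa):
        return math.nan
    miss_cr = (1 - h) * (1 - fa)
    hit_fa = h * fa
    return _ratio(miss_cr - hit_fa, miss_cr + hit_fa)


def type2_measures(a, b, c, d):
    """Table 1 cells: a = reported correct, b = reported incorrect,
    c = withheld correct, d = withheld incorrect."""
    total = a + b + c + d
    h = _ratio(a, a + c)    # proportion of correct candidates that were reported
    fa = _ratio(b, b + d)   # proportion of incorrect candidates that were reported
    return {
        "free_report": _ratio(a, total),
        "forced_report": _ratio(a + c, total),
        "hit_rate": h,
        "fa_rate": fa,
        "a_prime": a_prime(h, fa),
        "b_double_prime_d": b_double_prime_d(h, fa),
        "gamma": _ratio(a * d - b * c, a * d + b * c),
    }


if __name__ == "__main__":
    # Hypothetical frequencies, chosen only to make the arithmetic easy to check.
    for name, value in type2_measures(a=3, b=1, c=1, d=3).items():
        print(f"{name}: {value:.3f}")
```

In this sketch a participant who never produces a correct (or an incorrect) candidate yields an undefined rate, which is returned as NaN instead of being corrected; how such cases were treated in the reported analyses is not stated in the text.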

The monitoring component mentioned above is only one type of metacognition that might be important in cued recall. Another has to do with the participants' choice over the domains in which targets are sought. Traditional generate-recognize models imply that there is very little flexibility in domain choice, and that searches are mostly determined by a pre-established associative network. However, we believe there is much more metacognitive flexibility than such models suggest. Experiments 1 and 2 provide evidence for this flexibility by demonstrating the influence of retrieval guidance. In particular, these experiments show that informing participants on a trial-by-trial basis about the nature of the relationship between the test cue and the sought-after target dramatically enhances target production with strong extralist cues, an effect that we attribute to improved domain search. In Experiment 3, improvements are made to forced-report target production as a measure of the generation process, the measure of central importance in the current research, by using a multiple-response methodology. This new methodology allows us to examine two possible factors underlying participants' choice of search domains: direct-memory instructions and the structure of the study list.

Experiment 1

To demonstrate participants' metacognitive control over search domain, we replicated Higham's (2002) results in this experiment, and compared target production in this standard cued-recall group to a group provided with some recall guidance. For the latter group, two manipulations were instantiated. First, participants were informed, on a trial-by-trial basis, about the associative relationship between the test cue and the target. Second, participants were instructed on the best strategy to use with each type of cue. In particular, they were informed that the best strategy with “weak” cues was to simply try to remember the target from the study phase.
Conversely, for “strong” cues, they were informed that the best strategy was first to generate associates to the cues, and then to try to recognize targets from the generated candidates. It was made clear to participants that the task with strong cues was not simply associative generation, but that the goal was to recall as many of the targets from the study phase as possible. Similar generate-recognize instructions have previously been used by other researchers (e.g., Baker & Santa, 1977).

Santa and Lamwers (1974) also informed participants about the relationship between test cues and targets in a replication and extension of Thomson and Tulving's (1970) experiments. They found that free-report performance with extralist strong cues improved greatly as a result. However, because free-report target production was their measure of choice, it is not clear from their experiments whether the enhanced performance was due to improved generation, recognition, more liberal reporting, or some combination of these. Higham (2002) found that a shift in report bias (from free to forced report) was enough to triple the proportion of targets given to strong cues. Thus, it is quite conceivable that Santa and Lamwers' guidance manipulation had only an effect on report bias, with no effect on either the generation or recognition component of cued recall. To better determine what effect guidance has, in Experiment 1 we obtained separate estimates of generation (forced-report target production), recognition ($A'$), and report bias ($B''_D$).

4 Different researchers favor different statistics for different reasons. To get around possible criticism of $A'$, another measure of monitoring, the Kruskal–Goodman gamma correlation, was also calculated on the data from the 2 × 2 contingency tables generated in Experiments 1 and 2 (monitoring was not analyzed in Experiment 3). This particular statistic is popular in the metacognition literature and has been advocated by Nelson (1984). Gamma was subjected to the same ANOVAs that were performed on $A'$. With only one exception (see footnote in the results section in Experiment 2), these analyses on gamma produced identical results to those on $A'$.

Method

Participants

Participants were 32 students at the University of Southampton who either volunteered or participated in return for course credit. They were randomly assigned to either the standard group or guided group, with 16 participants in each. They were tested in groups of 1–3 at individual workstations.

Design and materials

One hundred target words, each with two associated cue words, were taken from Higham (2002). Five of the word trios from Higham's original list containing proper names (e.g., Russia-communism-tractor) were replaced with trios not containing proper names. These trios were removed to dissuade participants from typing in responses that were a mixture of lower and upper case letters, which would be scored as incorrect by the computer program. One cue word for each target word was a strong associate of the target (mean probability of target production for all 100 strong cues = 35%), whereas the second was a weak associate (mean probability of target production for all 100 weak cues = 1%). No word was repeated across the lists of weak cues, strong cues or targets. For both groups of participants, the experiment was divided into a study phase and a test phase. In the study phase, all 100 targets, along with their weak associates, were presented in pairs in a random order to all participants for study. In the test phase, participants were presented with retrieval cues one at a time. For counterbalancing purposes, approximately half the participants were presented at test with strong cues to elicit one set of 50 targets, and presented with weak cues to elicit the other set of 50 targets, whereas this was reversed for the remaining participants.
Participants were initially given the choice of providing a response to a given cue or leaving it blank (free-report; see procedure below). However, responses to cues that were left blank were obtained immediately afterwards by presenting the cue again and requiring a response (guessing if necessary) before moving on to the next trial. Presentation order of the cues in the test phase was uniquely randomized for each participant. After randomization, data from the first six trials of the test phase were counted as practice trials and were not analyzed. Thus, the analyses were based on 94 items, with an average of 47 (range: 44–50) data points for each of the strong and weak conditions.

Procedure

In the study phase, participants studied 100 weak cue-target word pairs presented individually for 3 s each, centered on a computer monitor. The cue words were displayed in lower case letters to the left of the target words, which were presented in upper case letters. Following Thomson and Tulving (1970), participants were instructed to study the upper case words for a later memory test, but to attend to the lower case words as possible cues to assist in recalling the upper case words at test. In the test phase, participants were instructed that they would be presented with cue words, one at a time on the computer monitor. Each cue word was centered on the monitor and displayed with a question mark (?) to its immediate right indicating that a response was requested. Participants in the standard group were told that each word presented during this phase had an upper case word from the study list that was related to it and to use the word as a cue to assist in recalling the upper case word. In addition to these instructions, participants in the guided group were told that some cues would be the same as those presented in the study phase, and these cues would be weakly related to the upper case word and would be labelled as “weak cue” during a recall trial.
Other cues, however, would be new cues that were not presented during the study phase but were strongly related to an upper case word, and these would be labelled as “strong cue” during a recall trial. Thus, for the guided group, the label “weak cue” or “strong cue” was presented below each given cue to inform the participants of the relation of the cue to the upper case word. The guided group were further informed that if it was a weak cue, the best strategy to retrieve the target was to simply try to remember it; generating associated words would not help. On the other hand, if the cue was a strong one, the best strategy to use was to generate words that are strongly related to the cue and then to try to recognize the target amongst the generated candidates. Participants in both groups were also informed that each trial started with a “points stage” during which each correct answer would earn 1 point, but that each incorrect answer would cost 4 points. However, participants could avoid the point system by entering “B” (for “blank”), which would immediately send them to the “guessing stage.” Responses offered during the guessing stage would neither earn nor cost any points. Regardless of whether a response was offered during the points stage or the guessing stage, a confidence rating was required next on a separate screen, using a scale from 1 to 6, where 1 = extremely low confident correct, 2 = very low confident correct, 3 = low confident correct, 4 = high confident correct, 5 = very high confident correct, and 6 = extremely high confident correct. This scale appeared on the screen whenever a confidence rating was required. After entering a confidence rating, the number of trials remaining was displayed (100 in total). No feedback was provided regarding the points won or lost on trials for which responses were entered in the “points stage,” nor the cumulative point total. Participants pressed the space bar to initiate the next trial.
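The payoff scheme of the points stage (1 point gained per correct answer, 4 points lost per incorrect answer, nothing gained or lost for a blank) implies a break-even level of subjective certainty for reporting. The sketch below works through that arithmetic. It assumes, purely for illustration, a participant who maximizes expected points and whose subjective probability of being correct is well calibrated; neither assumption is made in the text, and the function names are hypothetical.

```python
def expected_points(p_correct, gain=1.0, penalty=4.0):
    """Expected points for reporting a candidate held with probability p_correct.

    gain and penalty default to the values of the points stage
    (+1 for a correct answer, -4 for an incorrect answer).
    """
    return p_correct * gain - (1.0 - p_correct) * penalty


def break_even_probability(gain=1.0, penalty=4.0):
    """Probability at which reporting and withholding are equally attractive.

    Withholding (entering "B") scores 0, so we solve
    p * gain - (1 - p) * penalty = 0 for p.
    """
    return penalty / (gain + penalty)


if __name__ == "__main__":
    threshold = break_even_probability()
    print(f"break-even probability: {threshold:.2f}")
    for p in (0.5, 0.8, 0.9):
        print(f"p = {p:.1f}: expected points = {expected_points(p):+.2f}")
```

Under these assumptions, a candidate is worth reporting in the points stage only when the participant believes it is more likely than 4 in 5 to be correct, which illustrates why such a payoff scheme would be expected to encourage withholding of low-confidence candidates.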
This procedure for each trial continued until all 100 trials were completed.

Results

Following Higham (2002), we analyzed the cued-recall data in terms of free- and forced-report target production, monitoring, and report bias. The term “free-report target production rate” refers to the proportion of targets that were offered in the points stage of the experiment, whereas the term “forced-report target production rate” refers to the summed proportion of targets that were offered in both the points stage and the guessing stage. Conservative scoring was used in all experiments; a response was scored as correct only if a target was offered to either the strong or weak cue associated with it. Monitoring and report bias refer to discrimination (A′) and response bias (B″D) derived for each participant from the a, b, c, and d cells of Table 1. Specific formulae for all measures are presented in Table 1. Before calculating discrimination and bias, the H and FA rates were adjusted according to Snodgrass and Corwin's (1988) recommendation (i.e., 0.5 added to the numerator and 1 added to the denominator of each rate). Analysis of the confidence data follows a report of the SDT indices. For some analyses reported in this paper, empty cells forced the elimination of participants, the number of which can be determined from the degrees of freedom reported with the analysis. For cases in which empty cells posed a particular problem, we explicitly describe the problem and the steps taken to avoid the elimination of too many participants (e.g., by excluding a factor from the analysis). An alpha level of .05 was adopted for all comparisons.

Target production

Mean free- and forced-report target production rates for the standard and guided groups are shown in Table 2.
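The measures described in this Results overview can be sketched in code as follows. This is a minimal illustration, not the authors' analysis script. Table 1 is not reproduced in this section, so the cell layout (a = correct and reported, b = incorrect and reported, c = correct and withheld, d = incorrect and withheld) is an assumption, as are the function name and the example counts. The formulas used for A′, B″D, and gamma are the standard published forms of those statistics, which are assumed to match Table 1.

```python
def type2_indices(a, b, c, d):
    """Recall and type-2 SDT indices from one participant's 2 x 2 table.

    Assumed cell layout (hypothetical; Table 1 is not shown here):
      a = correct and reported     b = incorrect and reported
      c = correct and withheld     d = incorrect and withheld
    """
    n = a + b + c + d
    free_report = a / n            # targets offered in the points stage
    forced_report = (a + c) / n    # targets offered in either stage

    # Snodgrass and Corwin (1988) adjustment: +0.5 numerator, +1 denominator.
    h = (a + 0.5) / (a + c + 1)    # hit rate: correct responses reported
    fa = (b + 0.5) / (b + d + 1)   # false-alarm rate: incorrect responses reported

    # A' (nonparametric discrimination); .50 corresponds to chance.
    if h >= fa:
        a_prime = 0.5 + ((h - fa) * (1 + h - fa)) / (4 * h * (1 - fa))
    else:
        a_prime = 0.5 - ((fa - h) * (1 + fa - h)) / (4 * fa * (1 - h))

    # B''D (nonparametric bias); with this formula, positive values
    # indicate conservative reporting and negative values liberal reporting.
    miss_cr = (1 - h) * (1 - fa)
    hit_fa = h * fa
    b_double_prime_d = (miss_cr - hit_fa) / (miss_cr + hit_fa)

    # Goodman-Kruskal gamma for a 2 x 2 table; undefined if ad + bc == 0.
    denom = a * d + b * c
    gamma = (a * d - b * c) / denom if denom else None

    return {
        "free_report": free_report,
        "forced_report": forced_report,
        "h": h,
        "fa": fa,
        "a_prime": a_prime,
        "b_double_prime_d": b_double_prime_d,
        "gamma": gamma,
    }


if __name__ == "__main__":
    # Made-up counts for one cue type (47 trials), for illustration only.
    for name, value in type2_indices(a=10, b=2, c=3, d=32).items():
        print(f"{name}: {value:.3f}")
```

Because the adjusted H and FA rates can never equal 0 or 1, the A′ and B″D expressions are always defined, which is one practical reason for applying the adjustment before computing them.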
Two 2 (group: standard/guided) × 2 (cue: weak/strong) mixed analyses of variance (ANOVA) were conducted: the first on the free-report target production rates, and the second on forced-report rates. In both analyses, group was the between-subjects factor, and cue type was the within-subjects factor. For free-report target production, the analysis revealed a main effect of group, with the guided group demonstrating significantly better recall than the standard group, F (1, 30) = 8.15, MSE = .016, η² = .214 (standard = .18; guided = .27). The main effect of cue was marginally significant, F (1, 30) = 3.69, MSE = .014, η² = .110, p < .07, reflecting a trend for better recall in the context of weak cues than strong cues (weak = .25; strong = .20). A significant group × cue interaction was also obtained, F (1, 30) = 14.43, MSE = .014, η² = .325. This interaction arose because free-report target production was significantly better for weak cues than for strong cues in the standard group, F (1, 15) = 15.89, MSE = .015, η² = .514, whereas in the guided group, strong-cue target production was numerically, but not significantly, better than weak-cue target production, F (1, 15) = 1.82 (see Table 2). The ANOVA on forced-report target production revealed a main effect of group, F (1, 30) = 4.82, MSE = .015, η² = .134; significantly more targets were retrieved by the guided than the standard group (standard = .26; guided = .33). There was also a significant cue main effect, F (1, 30) = 5.08, MSE = .012, η² = .145, with recall significantly better for strong cues than weak cues (weak = .26; strong = .33), as well as a significant group × cue interaction, F (1, 30) = 10.27, MSE = .012, η² = .255. This interaction reflected the fact that there was no difference in forced-report target production for weak versus strong cues in the standard instruction group, F < 1, whereas in the guided group, strong-cue target production was significantly better than weak-cue target production, F (1, 15) = 14.57, MSE = .013, η² = .493. Forced-report target production for strong cues in the guided group was also significantly greater than for the same cues in the standard group, F (1, 30) = 18.91, MSE = .010, η² = .387 (see Table 2).

Table 2
Mean free- and forced-report target production in Experiment 1 as a function of cue type and experimental group

Experimental group     Free report      Forced report
                       M       SD       M       SD
Weak cues
  Standard             .26     .10      .28     .12
  Guided               .24     .14      .25     .14
Strong cues
  Standard             .09     .12      .25     .11
  Guided               .30     .13      .41     .09

Table 3
Mean monitoring (A′) and report bias (B″D) in Experiment 1 as a function of cue type and experimental group

Experimental group     Monitoring (A′)  Report bias (B″D)
                       M       SD       M       SD
Weak cues
  Standard             .85     .17      .51     .52
  Guided               .88     .09      .39     .61
Strong cues
  Standard             .57     .16      .42     .80
  Guided               .77     .05      .13     .72

Monitoring

Mean monitoring indices for the standard and guided groups are shown in Table 3. A mixed 2 (group: standard/guided) × 2 (cue: weak/strong) ANOVA on monitoring revealed a significant main effect for group, F (1, 30) = 9.64, MSE = .021, η² = .243, with the guided group showing significantly better monitoring than the standard group (standard = .71; guided = .83). The cue main effect was significant, F (1, 30) = 57.23, MSE = .011, η² = .656, reflecting better monitoring for weak cues than strong cues (weak = .87; strong = .67). A significant group × cue interaction was also found, F (1, 30) = 9.23, MSE = .011, η² = .235, which occurred because the monitoring difference between weak cues and strong cues was larger for the standard group than for the guided group.
Nevertheless, separate one-way ANOVAs showed monitoring was significantly better for weak cues than strong cues for both groups: F (1, 15) = 33.09, MSE = .019, η² = .688, and F (1, 15) = 34.00, MSE = .003, η² = .694, for the standard and guided groups, respectively. Monitoring was significantly greater than chance (.50) for both weak and strong cues only in the guided group (lower bound 95% confidence interval for weak cues = .84; for strong cues = .74). In the standard group, only monitoring for weak cues was significantly better than chance (lower bound 95% confidence interval for weak cues = .76; for strong cues = .49).

Report bias

Table 3 shows report bias for weak and strong cues in the standard and guided groups. A mixed 2 (group: standard/guided) × 2 (cue: weak/strong) ANOVA on report bias showed a significant main effect of cue, with report bias being more liberal for weak cues than for strong cues, F (1, 30) = 44.88, MSE = .127, η² = .599 (weak = .45; strong = .15). The group main effect was not significant, F < 1 (standard = .04; guided = .26). However, a significant group × cue interaction was found, F (1, 30) = 13.71, MSE = .127, η² = .314. The interaction resulted because cue strength had a larger effect in the standard group than in the guided group, although both effects were significant, F (1, 15) = 37.58, MSE = .183, η² = .715, and F (1, 15) = 8.01, MSE = .071, η² = .348, respectively.

Confidence data

A 2 (group: standard/guided) × 2 (cue: weak/strong) × 2 (accuracy: correct/incorrect) mixed ANOVA was performed on the mean confidence ratings (Table 4).⁵ This analysis yielded significant main effects of cue (weak = 3.23; strong = 2.06), F (1, 29) = 138.92, MSE = .303, η² = .827, and accuracy (correct = 3.53; incorrect = 1.75), F (1, 29) = 229.24, MSE = .427, η² = .888. There was also a significant cue × accuracy interaction, F (1, 29) = 176.04, MSE = .234, η² = .859, indicating that the difference in confidence ratings between correct and incorrect responses was larger for weak cues (correct = 4.69; incorrect = 1.76) than for strong cues (correct = 2.37; incorrect = 1.75). This effect reflects better monitoring with weak cues than strong cues, as shown with the analysis on A′. A significant group × cue interaction was also found, F (1, 29) = 6.90, MSE = .303, η² = .192, which was caused by cue strength having a larger effect on confidence in the standard group (weak = 3.18; strong = 1.75) than in the guided group (weak = 3.28; strong = 2.37). This pattern suggests that the group by cue strength interaction on the SDT measure of bias described above was attributable in large part to a shift in the strong-cue confidence distribution; that is, participants became more confident in their responses to strong cues in the guided group, and hence tended to report them more. Finally, the group × accuracy interaction approached significance, F (1, 29) = 4.09, MSE = .427, η² = .124, p < .06. The difference in confidence ratings between correct and incorrect responses was marginally larger for the guided group (correct = 3.83; incorrect = 1.81) than for the standard group (correct = 3.23; incorrect = 1.69). This marginal effect reflects better monitoring in the guided group than the standard group, supporting the analysis of A′.

⁵ The data were collapsed across the report option variable (reported-withheld) to minimize the elimination of participants due to empty cells.

Table 4
Mean confidence ratings (/6) for correct and incorrect responses in Experiments 1 and 3 as a function of cue type and experimental group

Experimental group               Correct          Incorrect
                                 M       SD       M       SD
Weak cues
  Standard (Experiment 1)        4.55    .26      1.81    .13
  Guided (Experiment 1)          4.84    .27      1.71    .14
  Real study (Experiment 3)      4.66    .64      1.46    .43
  Sham study (Experiment 3)      -       -        2.00    .64
  No study (Experiment 3)        -       -        2.31    .73
Strong cues
  Standard (Experiment 1)        1.92    .21      1.58    .13
  Guided (Experiment 1)          2.82    .22      1.91    .14
  Real study (Experiment 3)      1.86    .70      1.34    .35
  Sham study (Experiment 3)      2.36    .76      2.03    .64
  No study (Experiment 3)        3.06    .85      2.19    .67

Note. Cells without entries had too few participants contributing data to be considered valid (n < 10).

Discussion

Results from the standard group replicated Higham's (2002) cued-recall results almost exactly. In free report, there was an advantage of reinstated weak cues over non-reinstated strong cues in eliciting target retrieval, as was found by Thomson and Tulving (1970). This encoding-specificity effect, however, was eliminated under forced-report instructions (cf. Pellegrino & Salzberg, 1975b). As with Higham's research, the fact that strong-cue target production improved from free to forced report, whereas weak-cue target production did not, was attributable to the fact that strong cues were associated with both poorer monitoring, and more conservative report bias, than weak cues. In other words, participants withheld a lot of correct responses with strong cues, but not with weak cues. Results from the guided group, on the other hand, were quite different from the standard group. Although guiding participants about the relationship between weak cues and their targets had no effect on free-report target production, forced-report target production, monitoring or report bias, guidance had large effects on target production with strong cues, which is consistent with Santa and Lamwers' (1974) findings.
However, because we adopted Higham's (2002) method of analyzing cued-recall data, we, unlike Santa and Lamwers, were able to determine at what stage(s) of processing guidance had its effect: generation, recognition, response bias, or some combination of these. First, analysis of the discrimination index indicated that guidance had an effect at the recognition stage; target recognition (monitoring) in the guided group was significantly improved relative to the standard group, although it did not reach the level observed with weak cues. Second, analysis of B″D indicated that report bias was more liberal for weak cues than for strong cues in both groups, but the effect was less pronounced in the guided group compared to the standard group. In addition to these effects on recognition and report bias, however, guidance also affected generation processes. Forced-report target production, our measure of generation, was no different between weak and strong cues in the standard group, replicating Higham's (2002) results. However, in the guided group, forced-report target production with strong cues was significantly greater than with weak cues in the same group, and it was better than strong-cue, forced-report target production in the standard group. These results suggest that poor strong extralist cue target production in the standard cued-recall group, which partially forms the basis of the encoding specificity principle, is not just a recognition problem, as is commonly believed. Rather, it is also partly attributable to the fact that participants given standard cued-recall instructions do not generate candidate sets in response to strong cues that are as high quality as they might be. In other words, once failures to report and/or recognize the targets are factored out of the equation by forcing responses to all test cues, target production with strong extralist cues in the standard cued-recall group is still quite poor.
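The claim that participants withheld many correct strong-cue responses can be illustrated with simple arithmetic on the group means in Table 2. The sketch below computes, for each cell, the share of all produced targets (forced report) that was not offered under free report. It is a rough illustration only: it works from rounded group means rather than per-participant data, and the helper name `withheld_share` is hypothetical.

```python
# Group means from Table 2 of Experiment 1: (free report, forced report).
TABLE_2_MEANS = {
    ("weak", "standard"): (0.26, 0.28),
    ("weak", "guided"): (0.24, 0.25),
    ("strong", "standard"): (0.09, 0.25),
    ("strong", "guided"): (0.30, 0.41),
}


def withheld_share(free, forced):
    """Proportion of produced targets that was withheld in the points stage.

    forced counts targets offered in either stage; free counts only those
    offered in the points stage, so the difference is what was withheld.
    """
    return (forced - free) / forced


if __name__ == "__main__":
    for (cue, group), (free, forced) in TABLE_2_MEANS.items():
        share = withheld_share(free, forced)
        print(f"{cue:6s} cues, {group:8s} group: {share:.0%} of targets withheld")
```

On these means, roughly two thirds of the targets produced to strong cues in the standard group were withheld, compared with well under a tenth for weak cues, while strong-cue forced-report production in the standard group (.25) remains far below that of the guided group (.41). The first contrast reflects the recognition and report-bias components; the second is the generation component emphasized in the Discussion.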
Experiment 2

Experiment 1 indicated that guiding participants about the cue-target relationship and instructing them to generate-then-recognize with strong cues had effects on generation, recognition, and response bias. In Experiment 2, we sought to replicate this effect using a mixed-list design. Type of study cue (strong or weak) was crossed with type of test cue (strong or weak), which yielded four conditions: weak (study)–weak (test), weak–strong, strong–weak, and strong–strong. The first two conditions were replications of the two test conditions of Experiment 1, although without a pure weak-associate study list. As in the guided group of Experiment 1, participants were informed on a trial-by-trial basis about the relationship between the test cues and their associated targets. Furthermore, they were told to generate-then-recognize with strong cues, but to avoid the generate-then-recognize strategy with weak cues (i.e., just try to remember targets). Thomson and Tulving (1970, Experiment 3; see also Ehrlich & Philippe, 1976; Postman, 1975) used a mixed-list paradigm of this sort to discount a confusion or mental set interpretation of their finding of poor target production with strong extralist cues. That is, they acknowledged the fact that participants, after studying a pure list of weak-associate/target pairs, may have developed a mental set of weak associates as appropriate responses at test, or participants might be confused at test about appropriate responses when faced with novel strong cues. Thomson and Tulving reasoned that, by using a mixed list of both weak- and strong-associate/target pairs during study, this potential explanation of the poor target production they observed with strong extralist cues would be obviated. However, as Santa and Lamwers (1974) rightly pointed out, it is questionable whether the confusion experienced by participants during recall was completely resolved in this mixed-list paradigm.
Although participants may no longer have held onto the assumption that all targets were weakly associated to the test cues, confusion could still arise because participants were not informed that some of the targets which were studied with weak associates would be cued with novel strong cues during recall. In other words, the uncertainty as to how a particular cue was related (weakly or strongly) to its target remained in the cued-recall phase of the mixed-list paradigm. Consequently, participants in Thomson and Tulving's (1970) study may still have been unsure as to the appropriate retrieval space (weak associates or strong associates of the cue) in which the target could be found, despite the mixed-list design. This uncertainty may have resulted in poor target production with strong test cues (see also Santa & Lamwers, 1976). The trial-by-trial guidance manipulation that we used in Experiment 1 is ideally suited to alleviate this potential confusion about search space. If, indeed, target production with strong extralist cues was impaired in Thomson and Tulving's (1970) experiment because of uncertainty regarding the appropriate domain in which to search for targets, our guidance manipulation should lead to an improvement in strong-cue target production. In particular, forced-report target production, our measure of the generation process in recall, should be high.

Method

Participants

Sixteen students from the University of Southampton took part in return for course credit or payment.

Design and materials

Materials from Experiment 1, that is, the same 100 targets, each with a weak and strong cue, were used. At study, participants studied one set of 50 targets with strong cues, and the other set of 50 targets with weak cues. At test, half the targets (25) studied with weak cues were tested with weak cues (the weak–weak condition), whereas the other half (25) was tested with strong cues (the weak–strong condition). Similarly, 25 targets studied with strong cues were tested with weak cues (the strong–weak condition), and the remaining 25 were tested with strong cues (the strong–strong condition). The assignment of targets to the four study-test conditions was counterbalanced across participants. The presentation order of the cue-target pairs at study, and cues at test, was uniquely randomized for each participant. After randomization, the first 6 trials for each participant were treated as practice trials and were not analyzed. Thus, analyses were based on 94 items, with a range of 19–25 data points for each study-test (weak–weak, weak–strong, strong–weak, and strong–strong) condition.

Procedure

For the presentation of cue-target pairs, the same procedure as in Experiment 1 was used. The procedure of the cued-recall phase was essentially the same as that used for the guided group in Experiment 1. That is, for each participant, weak cues and strong cues were labelled accordingly during cued recall. Participants were also told that the best strategy to retrieve a target for a weak cue was to try to remember the target; generating associated words would not help. Conversely, for a strong cue, it was best to generate words strongly related to the cue and then decide if one of the words was the intended target.

Results

Target production

Mean target-production rates are shown in Table 5. Because these rates, and hence variance, were virtually zero in the strong–weak condition at both free and forced report, only data from the remaining three conditions (weak–weak, weak–strong, and strong–strong) were analysed. Thus, two repeated-measures, one-way ANOVAs were conducted on the target-production rate in those three conditions, the first on free-report target production, and the second on forced-report target production. For free report, there was a significant main effect of condition, F (2, 30) = 34.32, MSE = .018, η² = .696. This main effect arose because free-report target production was significantly better in the strong–strong condition than both the weak–weak and the weak–strong conditions, F (1, 15) = 80.36, MSE = .24, η² = .843, and F (1, 15) = 40.46, MSE = .047, η² = .730, respectively. There was no significant difference between the weak–weak and weak–strong conditions, F < 1. For forced report, the ANOVA again revealed a significant main effect of condition, F (2, 30) = 47.00, MSE = .017, η² = .758. This main effect was produced because forced-report target production was significantly better in the strong–strong than in the weak–strong condition, F (1, 15) = 25.71, MSE = .038, η² = .632. In turn, forced-report target production was significantly better in the weak–strong than in the weak–weak condition, F (1, 15) = 12.60, MSE = .048, η² = .457 (Table 5).

Table 5
Mean free- and forced-report target production in Experiment 2 as a function of experimental condition

Experimental condition   Free report      Forced report
                         M       SD       M       SD
Weak–weak                .19     .16      .21     .17
Weak–strong              .19     .10      .41     .15
Strong–weak              .00     .00      .00     .00
Strong–strong            .54     .21      .65     .16

Monitoring

Mean monitoring (A′) is shown in the first column in Table 6. As the target production rate in the strong–weak condition was virtually zero, it was not possible to calculate a monitoring index for that condition. However, sufficient data were available to complete a repeated-measures, one-way ANOVA on monitoring in the other conditions (weak–weak, weak–strong, and strong–strong). This ANOVA revealed a significant main effect, F (2, 30) = 19.42, MSE = .011, η² = .564.
Within-subjects contrasts indicated that monitoring for weak–weak items was significantly better than for strong–strong items, F (1, 15) = 14.24, MSE = .019, η² = .487, which, in turn, was better than for weak–strong items, F (1, 15) = 6.07, MSE = .029, η² = .288.⁶ For all three conditions, monitoring was significantly above chance (lower bound 95% confidence intervals: weak–weak = .83; weak–strong = .55; strong–strong = .64).

⁶ This latter comparison on monitoring between strong–strong and weak–strong items was marginally significant when monitoring was estimated with gamma, but the pattern of results remained the same; monitoring for strong–strong items was higher than for weak–strong items, F (1, 13) = 4.04, MSE = .209, η² = .237, p < .07.

Table 6
Mean monitoring (A′) and report bias (B″D) in Experiment 2 as a function of experimental condition

Experimental condition   Monitoring (A′)  Report bias (B″D)
                         M       SD       M       SD
Weak–weak                .87     .08      .03     .66
Weak–strong              .64     .16      .20     .70
Strong–weak              -       -        -       -
Strong–strong            .74     .20      .34     .67

Report bias

Mean report bias (B″D) is shown in the third column of Table 6. As with the analysis on monitoring, there was insufficient target production data in the strong–weak condition to calculate report bias. However, a repeated-measures one-way ANOVA on report bias in the remaining conditions (weak–weak, weak–strong, and strong–strong) revealed a significant main effect, F (2, 30) = 7.65, MSE = .159, η² = .338. Within-subjects contrasts indicated that report bias was significantly more liberal in the strong–strong condition than in either the weak–weak condition, F (1, 15) = 7.43, MSE = .291, η² = .331, or the weak–strong condition, F (1, 15) = 16.32, MSE = .286, η² = .521. The latter two conditions did not differ significantly, F (1, 15) = 1.26.

Confidence data

The confidence data are shown in Table 7.
To manage empty cells, confidence ratings were collapsed across report option, and the strong–weak condition was not entered into the analysis. A 3 (condition: weak–weak/weak–strong/strong–strong) × 2 (accuracy: correct/incorrect) repeated measures ANOVA revealed significant main effects of accuracy, F (1, 13) = 165.16, MSE = .405, η² = .927 (accurate = 4.18; inaccurate = 2.39), and condition, F (2, 26) = 20.23, MSE = .258, η² = .609. Confidence ratings were lower in the weak–strong condition (2.79) than in both the weak–weak condition (3.58), F (1, 14) = 30.42, MSE = .333, η² = .685, and the strong–strong condition (3.48), F (1, 13) = 37.73, MSE = .176, η² = .744, whereas the latter two conditions did not differ, F < 1. There was also a significant condition × accuracy interaction, F (2, 26) = 72.41, MSE = .187, η² = .848. The interaction occurred because the difference in confidence between correct and incorrect responses was larger in the weak–weak condition than in the strong–strong condition, F (1, 13) = 37.95, MSE = .980, η² = .745, which, in turn, was larger than in the weak–strong condition, F (1, 13) = 21.35, MSE = .640, η² = .622. These analyses on the confidence data conform with the analyses on monitoring reported above, which also showed a weak–weak > strong–strong > weak–strong pattern. The confidence data analysis also suggests that the liberal response bias associated with the strong–strong condition likely resulted from the particularly high levels of confidence assigned to incorrect responses (i.e., distribution shift). Because confidence was high for these incorrect responses, a large portion of the distribution would have fallen above participants' report criterion, yielding a high FA rate, and a liberal report bias estimate (i.e., an overall high tendency to report candidates).

Table 7
Mean confidence ratings (/6) for correct and incorrect responses in Experiment 2 as a function of condition

Item type                Correct          Incorrect
                         M       SD       M       SD
Weak–weak                5.13    .80      1.97    .60
Weak–strong              2.96    .81      2.43    .73
Strong–weak              -       -        1.69    .56
Strong–strong            4.16    .80      2.75    .84

Discussion

Target production in the weak–weak and weak–strong conditions of Experiment 2 replicated performance in the guided group of Experiment 1 fairly closely. The weak–weak condition showed moderately good target production, and it was not affected by report option. Similarly, monitoring (recognition) in the weak–weak conditions of both experiments was similar and high (Experiment 1, guided = .88; Experiment 2 = .87). Although free-report target production in the weak–strong condition of Experiment 2 was somewhat lower (.19) than in the weak–strong condition in the guided group of Experiment 1 (.30), forced-report target production was identical (.41 in both cases). This pattern probably reflects the fact that participants tended to be fairly conservative in their responding in all conditions of Experiment 2 except in the strong–strong condition, and that the mixed-list design made monitoring of responses to strong cues somewhat more difficult (Experiment 1, weak–strong, pure-list = .77; Experiment 2, weak–strong, mixed-list = .64). More important, however, Experiment 2 replicated the finding that some guidance, in the form of generate-recognize instructions and cue-type labels, led to excellent forced-report target production with strong extralist test cues. Forced-report target production in the weak–strong guided conditions of both Experiments 1 and 2 (both .41) was considerably greater than forced-report, strong-cue target production in the standard condition of Experiment 1 (.25). This result again suggests that the effect of guidance is not limited to recognition and/or response bias in cued recall.
Rather, it also had a considerable effect on the generation process. We interpret this result in the same way that we interpreted the analogous result in Experiment 1; by guiding participants about the relationship between the test cue and the sought-after target, participants were better able to define an appropriate search space. By improving the definition of the search space, the likelihood that the pool of generated items contained the target was increased. It is worth considering the dissociations obtained between target production and monitoring, which we interpret to correspond to generation and recognition being influenced differently by the same variables. Both generation and recognition were influenced positively by cue reinstatement. Cue reinstatement helped to limit the search to recent encounters containing the target, which, in turn, increased the likelihood that the target was generated (i.e., accessed from memory). It also increased the likelihood that the target would be correctly recognized. On the other hand, cue strength affected generation and recognition differently. Whereas high cue strength, compared to low cue strength, increased the likelihood that targets would be generated, it decreased the likelihood that those generated targets would be recognized. Recognition was poor with strong cues because several highly interrelated items were likely to be generated. For example, in response to the strong test cue homicide, participants may have generated the candidates MURDER, death, kill, and die. Recognizing the target MURDER from amongst these candidates may have been quite difficult, because the size of the generated set is large and the interrelatedness amongst the alternatives is great (e.g., see Nelson, Bennett, Gee, Schreiber, & McKinney, 1993; Nelson, Bennett, & Leibert, 1997; Nelson, McEvoy, & Friedrich, 1982; Nelson, Schreiber, & McEvoy, 1992; see also Santa & Lamwers, 1974). 
In contrast, both the number of candidates generated in response to weak cues, and their interrelatedness, are likely to be lower than with strong cues. First, for weak cues, participants were instructed to avoid generating related items in an attempt to find the target, reducing the generated set size. Second, even if participants ignored these instructions, and went ahead and generated both the target and candidates strongly related to the weak cue, the target would be unrelated to the other generated strong associates. Thus, the target may be the only candidate to be considered seriously as a response to each (reinstated) weak cue, either because it was the only generated candidate, or because it ‘‘stood out from the crowd,’’ being unrelated to the other items in the candidate set.

The differential effect of cue strength on the generation and recognition processes of cued recall can be seen by comparing the pattern of forced-report target production and monitoring, respectively, amongst the experimental conditions. The strong–strong condition showed the best target production because it benefited from both cue reinstatement and high cue strength. Target production in the weak–strong and weak–weak conditions benefited from only one of the variables (either cue strength in the weak–strong condition or cue reinstatement in the weak–weak condition), but not the other, and so an intermediate level of performance was observed. The strong–weak condition benefited from neither variable, and so demonstrated the worst performance (see Table 5). On the other hand, monitoring (recognition) was highest in the weak–weak condition because it benefited from both cue reinstatement and low cue strength. When the test cues were reinstated, but their strength was high (strong–strong; one for, one against), an intermediate level of recognition was observed. When reinstatement was absent, and cue strength was high (weak–strong; both against), recognition was poorest (see Table 6).
These results highlight the role of metacognitive monitoring processes in a direct cued-recall task, and the importance of considering such processes separately from target production in such tasks. The pattern of poorer monitoring in the strong–strong condition compared to the weak–weak condition, despite better target production, was also apparent in the mean confidence data. Targets produced to weak cues were assigned much higher confidence than targets produced to strong cues, presumably because in the former case, there were no other serious competitors, in line with the reasoning above.

Some readers may be concerned that instructing participants to generate-then-recognize with strong cues may have undermined the direct nature of the task, and that this is why we observed the excellent strong-cue target production and the dissociations between target production and monitoring. To test this possibility, we tested an additional 11 participants in the mixed-list design of Experiment 2. These participants were treated exactly the same as those in Experiment 2, except the generate-recognize instructions for strong cues, and ‘‘try to remember’’ instructions for weak cues, were removed. Instead, participants were simply instructed to use the cue to help them recall a capitalized word from the study phase, and it was indicated to them on a trial-by-trial basis which cues were ‘‘weak’’ and which were ‘‘strong.’’ These participants produced data almost identical to those in Experiment 2. In particular, the data for the weak–weak, weak–strong, strong–weak, and strong–strong conditions were as follows: free-report target production: .24, .24, .00, .52, respectively; forced-report target production: .28, .41, .01, .64, respectively; monitoring (A′): .84, .69, —, .81, respectively; report bias (B″D): .05, .26, —, .22, respectively.
Thus, the excellent strong-cue target production observed in the guided group of Experiment 2 (and by extension, the guided group of Experiment 1) was replicated. Also, the same dissociated ordering of forced-report retrieval (strong–strong > weak–strong and weak–weak > strong–weak) and monitoring (weak–weak > strong–strong > weak–strong) was observed.

Experiment 3

Thus far, we have aligned generation processes in cued recall with forced-report target production, recognition processes with monitoring (A′), and response tendencies with a bias index (B″D). This analysis has shown itself to be useful in separating the underlying components of cued recall, and we believe it constitutes a huge improvement relative to the sole use of free-report target production as a measure of performance, which has characterized much cued-recall research. Furthermore, these measures respond in predictable and sensible ways to the cue reinstatement and cue strength variables, but they have also provided new evidence that questions some long-held assumptions regarding the nature of encoding specificity. In particular, the results of Experiments 1 and 2 question the common belief that encoding specificity effects in the strong–weak cue paradigm are located only in the recognition stage, and not the generation stage, of cued recall.

As we mentioned in the introduction section, however, it is unlikely that these measures, in particular, our measure of generation (forced-report target production), are process pure. For some items, forced-report target production may also be sensitive to discrimination in the recognition stage. This would occur for trials for which the target is actually covertly generated to a given cue, but the confidence associated with it is extremely low. That being the case, not only would it not be offered in free report, it might not be offered in forced report either.
Of course, such a scenario necessitates that some other non-target item, generated along with the target in response to a cue, is assigned higher confidence than the target itself. That is, the generated target would have to be lower in the rank order of candidates than some other non-target candidate, the latter of which would be offered at forced report. To the extent that this occurs, forced-report target production will underestimate the generation process. To scrutinize the generation process more directly, and to achieve a purer measure of the generation process, a multiple-responses methodology was adopted in Experiment 3. With this methodology, participants were given the opportunity to offer at least one, but as many as six, responses to each cue in their attempt to produce targets. A point system was also established to create conditions that were analogous to the report-option manipulation in Experiments 1 and 2. Specifically, participants were given the option to nominate one of the generated candidates as the likely target. By doing so, they had chosen to ‘‘go for points,’’ meaning that two points would be awarded for each correct nomination, but two points would be deducted from their score for each incorrect nomination. By considering the entire candidate list generated by each cue, regardless of whether or not a candidate was nominated, a condition analogous to forced report was produced. The point system in Experiment 3 was specifically designed to motivate participants to produce as many reasonable candidates as possible. For example, participants in Experiment 3 were informed that producing, but failing to nominate, the target in the list of candidates would still earn 0.25 points, and that no points would be deducted for producing non-target responses that were not nominated. Consequently, there was nothing to lose, and indeed something to be gained, by listing all plausible candidates for a given cue. 
This was true even on trials when an incorrect candidate was nominated; the incorrect nomination would result in a loss of two points, but if the target appeared elsewhere on the list, this loss would be offset by 0.25 points (net total = −1.75 points). The motivation to produce more than one response per cue in forced report was unlike the forced-report condition in Experiments 1 and 2 in which a maximum of one response per cue was generated.

Along with the new methodology, Experiment 3 allowed us to examine two possible causes of poor strong-cue target production revealed in the unguided group of Experiment 1. Previously, Higham and Brooks (1997) demonstrated that participants in recognition memory experiments implicitly learn about the structure inherent in study lists, such that test items consistent with the structure were more likely to be rated ‘‘old’’ than items inconsistent with it. In the same vein, unguided cued-recall participants in Experiment 1 may have learned the list-wide structure that cues and targets were only weakly related. Learning about this structure may have led them to search for targets in a domain containing only weak associates of the test cues, a strategy that is useful for weak cues, but detrimental for strong cues. Such a proposition is consistent with our hypothesis that participants are searching inappropriate domains when faced with strong cues. On the other hand, the direct-memory instructions used in the unguided cued-recall group are another possible cause of poor strong-cue target production. Direct-memory instructions may lead participants to search for single ‘‘recollected’’ solutions, limiting the size of the generated set.

To investigate the role of these two possible causes of poor unguided target production for strong cues, three groups of participants were tested in Experiment 3: a standard cued-recall group (the real-study group) and two generation groups (the sham-study and no-study groups). All three experimental groups were given the same strong and weak test cues as those in the test phase of Experiment 1. The main difference between the groups was that the real-study group was exposed to the targets during study (in the context of weakly associated study cues), and was required to recall them, whereas the sham-study and the no-study groups were not exposed to any targets, and were required instead to generate words to the cues. The main difference between the no-study and sham-study groups was that the latter group was exposed to a sham study list (a list of weakly associated word pairs that contained none of the targets), whereas the no-study group was exposed to no study list at all. By comparing target production between these two generation groups, it was possible to determine the effect of the study-list structure on pure generation processes, outside of the context of cued recall (see Footnote 7). By comparing target production between the real-study and sham-study groups, it was possible to evaluate the role of direct-memory versus generation instructions.

Although this design makes it possible to evaluate the separate contributions of each of the aforementioned factors to target production, we believe it may be overly simplistic to attribute poor unguided target production for strong cues exclusively to one factor or the other. Rather, it seems likely to us that list structure and direct-memory instructions work in conjunction, making a particularly potent mix.
Unlike generation instructions, direct-memory instructions encourage constant referral to the study list as a source of candidates, thus intensifying any effects that the study-list structure might have. As a result, strong-cue target production may be particularly poor in the real-study group, who receive both direct-memory instructions and an inappropriate study-list structure.

Footnote 7: One obvious way of ensuring high output is to force participants to generate a certain number of (multiple) responses. However, with the multiple-response methodology used in Experiment 3, we suspected that such requirements might switch participants in the real-study group from a retrieval strategy to a pure generation strategy. Such a strategic shift might possibly have masked any differences that would otherwise have been found between the real-study group on the one hand, and the generation groups (sham-study and no-study) on the other. Consequently, we implemented a point system that rewarded the production of several responses per cue, but ultimately left participants in control of the number of responses (beyond one) to produce.

Method

Participants

Participants were 63 undergraduate students who took part in return for either course credit or a payment of £5. Twenty-one participants were assigned to each of the three (real-study, sham-study, and no-study) groups.

Design and materials

The materials were the same as in Experiment 1, with the exception that a new list of 100 weakly associated pairs was constructed for the sham-study group to study. As with the weakly associated word pairs in the real-study group, the cues presented to the sham-study group during study produced their associated sham targets at a rate of approximately 1%.
Although no actual targets were presented in the study phase for the sham-study and no-study groups, the same test cues used in the real-study group (derived from Experiment 1) were used for these groups, and target production was scored in all three groups of Experiment 3 according to participants' tendency to respond with the targets as they were defined in Experiment 1.

Procedure

The procedure for the study phase for the real-study and sham-study groups was identical to that used for the standard group in Experiment 1. In the test phase, the real-study group received the same basic instructions as those in the standard group in Experiment 1. However, participants from the sham-study group were told that, for each cue word, we had ‘‘another word in mind’’ that was related to it, and that the relationship between the cue word and the word we had in mind was similar to the relationship between the word pairs seen in the study phase. It was made clear to participants that none of the words seen in the study phase was the same as the word we had in mind. Participants were instructed to use the cue to think of the word we had in mind. Similarly, for the no-study group, participants were told that for each cue word, there was a related word that we had in mind, and their task was to think what that word might be.

For all three groups in Experiment 3, participants were required to provide at least one, but as many as six, responses to each cue. Additionally, for each response given, participants were required to rate their confidence that it was the target on a 6-point scale. Participants were informed of the point system used in the test phase, and that the goal was to earn as many points as possible. They were told that if they were confident that one of their responses was the target, they could nominate it by clicking on the ‘‘go for points’’ button. By doing so, they would be awarded two points if the nominated candidate was indeed the target.
However, if it was not, they would lose two points. Only one response per trial could be nominated. Participants were further informed that they could still gain 0.25 points if one of the non-nominated responses they had produced was the target, regardless of whether or not they nominated another candidate. Because some points could still be gained if non-nominated responses turned out to be targets, an incentive was provided to enter as many responses as possible. There was no requirement to nominate a response to ‘‘go for points’’ on any trial.

In the test phase, each cue was presented towards the top of the computer screen with a question mark (?) displayed to its immediate right indicating that a response was required. Participants typed in their first response in an empty text box situated below the cue. When a response had been entered, a 6-point scale appeared immediately to the right of the response, and participants indicated their confidence that their response was correct by clicking on one of the points on the scale. A ‘‘go for points’’ button was displayed to the right of the 6-point scale, which could be highlighted if participants chose to nominate that particular response. Once a response had been typed into the box, and confidence had been rated, the participant was prompted to type in another response (in a box which appeared directly below the previous one) or to click on the ‘‘next’’ button located at the bottom of the screen to initiate the next test trial. If a response was typed in the second text box, a confidence scale appeared again to its immediate right, and a ‘‘go for points’’ button to the right of that. This process continued either until six responses had been offered, or until the ‘‘next’’ button was clicked. Participants were not permitted to type in multiple copies of the same response on any given trial.
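The scoring rules in this procedure can be summarized in a short sketch. It is illustrative only and is not the software used in the experiment: the function name `score_trial`, the constant names, and the case-insensitive matching of responses to targets are assumptions, whereas the point values (+2, −2, +0.25) and the limits of one to six distinct responses per cue come from the text.

```python
from typing import Optional, Sequence

# Point values taken from the procedure described in the text.
CORRECT_NOMINATION = 2.0
INCORRECT_NOMINATION = -2.0
LISTED_TARGET_BONUS = 0.25
MAX_RESPONSES = 6


def score_trial(responses: Sequence[str], target: str,
                nominated: Optional[int] = None) -> float:
    """Score one test trial (hypothetical helper).

    responses: the candidates typed for one cue, in order.
    target:    the word defined as the target for that cue.
    nominated: index of the response chosen to "go for points",
               or None if no response was nominated.
    """
    # Assumption: matching ignores case and surrounding whitespace.
    cleaned = [r.strip().lower() for r in responses]
    goal = target.strip().lower()

    if not 1 <= len(cleaned) <= MAX_RESPONSES:
        raise ValueError("between one and six responses are required")
    if len(set(cleaned)) != len(cleaned):
        raise ValueError("duplicate responses are not permitted")

    if nominated is None:
        # No nomination: only the small bonus for listing the target applies.
        return LISTED_TARGET_BONUS if goal in cleaned else 0.0
    if cleaned[nominated] == goal:
        return CORRECT_NOMINATION
    # Incorrect nomination; the loss is partly offset if the target
    # appears elsewhere in the list.
    bonus = LISTED_TARGET_BONUS if goal in cleaned else 0.0
    return INCORRECT_NOMINATION + bonus
```

For instance, nominating death while also listing MURDER for the cue homicide yields −2 + 0.25 = −1.75 points, so adding further plausible candidates never lowers the trial score.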
Throughout the trials, a brief summary was presented on the top left corner of the screen to remind participants of their task and how the points system worked. In addition, the participants' cumulative score, and the number of points scored on the previous trial, were displayed on the top right hand corner of the screen.

Results

Because of the change to the methodology in this experiment, it was no longer feasible to calculate type 2 SDT measures of monitoring and report bias. The large number of responses given in forced report meant that the type 2 FA rates were below 5% for most participants, and below 1% in several cases, potentially making discrimination and bias statistics misleading. Consequently, we report only the target production rate, which corresponds to the number of targets produced in a given condition out of the total number possible, and the confidence data. Although we could not calculate A′ in this experiment, the close correspondence between it and the confidence data analyses in the previous experiments made it feasible to use the confidence data as an indicator of monitoring.

Clearly, participants in the sham-study and no-study groups could not rely on implicit or explicit memory to generate correct responses to the test cues (as no actual targets were presented at study). Nevertheless, for purposes of comparison to the real-study group, we define ‘‘targets’’ for these groups the same way as in Experiment 1. Responses which were nominated for points were treated as free-report responses, whereas all responses, regardless of whether or not they were nominated for points, were treated as forced-report responses.

Target production

The mean free- and forced-report target production rates for weak and strong cues, as a function of study group, are shown in Table 8. Unsurprisingly, because neither the sham-study nor the no-study group was presented with the targets prior to the test phase, both groups generated virtually no targets in the context of weak cues. Target production was close to zero for both free report and forced report, so these data were not analyzed further. To determine the effect of cue type, strong- and weak-cue target production in just the real-study group was analyzed using paired-samples t tests. In free report, target production was shown to be significantly better for weak cues than for strong cues, t (20) = 3.37, SE = .037, η² = .362. The reverse was found, however, in forced report, where target production was significantly better for strong cues than for weak cues, t (20) = 3.41, SE = .047, η² = .368.

Table 8
Mean free- and forced-report target production in Experiment 3 as a function of cue type and experimental group

                       Report type
Experimental group     Free report      Forced report
                       M       SD       M       SD
Weak cues
  Real study           .18     .15      .23     .18
  Sham study           .00     .00      .01     .01
  No study             .00     .00      .02     .02
Strong cues
  Real study           .05     .07      .39     .14
  Sham study           .06     .07      .47     .10
  No study             .14     .11      .48     .12

Although very few targets were retrieved for weak cues in the sham-study and no-study groups, target production with strong cues was more successful. To determine if there were any differences between the groups in these rates, two one-way ANOVAs were conducted on strong-cue target production data from the three groups, the first in free report, and the second in forced report. In free report, the one-way ANOVA revealed a significant main effect of group, F (2, 60) = 7.31, MSE = .007, η² = .196. Independent-samples t tests showed that this effect arose because strong-cue target production in free report was significantly better in the no-study group than in the real-study and sham-study groups, t (40) = 3.23, SE = .028, η² = .207, and t (40) = 2.89, SE = .029, η² = .173, respectively (see Table 8).
There was no difference in target production success between the latter two study groups, t (40) = 0.318. In forced report, the one-way ANOVA again revealed a significant main effect of group, F (2, 60) = 3.52, MSE = .015, η² = .105. The effect arose because strong-cue target production in forced report was significantly better for both the sham-study and the no-study groups than for the real-study group, t (40) = 2.17, SE = .038, η² = .105, and t (40) = 2.25, SE = .040, η² = .112, respectively. Strong-cue target production in forced report for the sham-study and no-study groups, however, did not differ significantly from each other, t (40) = .221.

Confidence data

Confidence data (Table 4) produced by the real-study group were collapsed across report option and a 2 (cue: weak/strong) × 2 (accuracy: correct/incorrect) mixed ANOVA was conducted. This analysis revealed significant main effects of cue, F (1, 18) = 316.07, MSE = .131, η² = .946, accuracy, F (1, 18) = 539.97, MSE = .120, η² = .968, and a significant cue × accuracy interaction, F (1, 18) = 217.47, MSE = .156, η² = .924. Overall, participants in the real-study group gave higher confidence ratings for weak cues (3.07) than strong cues (1.59), and higher confidence ratings for correct (3.25) than incorrect (1.40) responses. The interaction arose because this difference in confidence ratings between correct and incorrect responses was larger for weak cues (difference = 3.19) than for strong cues (difference = 0.51). These results on confidence ratings suggest that better monitoring for weak cues than strong cues was apparent in this experiment just as it was in Experiments 1 and 2. Because of the very low number of correct responses produced by the sham and no-study groups in the context of weak cues, cross-group comparisons were restricted to confidence data produced for strong cues.
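As a reading aid for the effect sizes reported with these tests, the sketch below recomputes them from the test statistics and degrees of freedom. The article does not state its formula, so the relations used here (partial η² = F·df_effect / (F·df_effect + df_error), and η² = t² / (t² + df) for a t test) are an assumption; they do reproduce the reported values to three decimals. The function names are hypothetical.

```python
def partial_eta_squared_from_f(f: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared recovered from an F ratio and its degrees of freedom."""
    return (f * df_effect) / (f * df_effect + df_error)


def eta_squared_from_t(t: float, df: int) -> float:
    """Eta squared recovered from a t statistic (equivalent to F = t**2, df_effect = 1)."""
    return (t * t) / (t * t + df)


# Values taken from the real-study confidence ANOVA and the forced-report t tests.
cue_effect = partial_eta_squared_from_f(316.07, 1, 18)
accuracy_effect = partial_eta_squared_from_f(539.97, 1, 18)
sham_vs_real = eta_squared_from_t(2.17, 40)

print(round(cue_effect, 3), round(accuracy_effect, 3), round(sham_vs_real, 3))
```

Under this assumption the three values round to .946, .968, and .105, matching the figures given in the text.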
As with previous analyses, the data were collapsed across report option (see Table 4) and a 3 (group: real/sham/no study) × 2 (accuracy: correct/incorrect) mixed ANOVA was carried out. This analysis yielded a significant accuracy main effect, F (1, 60) = 101.69, MSE = .103, η² = .629 (correct = 2.43; incorrect = 1.85), and a significant group main effect, F (2, 60) = 13.58, MSE = .819, η² = .312. The significant group main effect arose because for strong cues, the real-study group (1.60) gave lower confidence ratings than both the sham-study (2.19), F (1, 40) = 10.40, MSE = .356, η² = .206, and no-study (2.63) groups, F (1, 40) = 28.57, MSE = .386, η² = .417. The difference in confidence ratings between the sham-study and no-study groups was marginally significant, F (1, 40) = 4.00, MSE = .486, η² = .091, p < .06. Finally, a significant group × accuracy interaction was also found, F (2, 60) = 7.65, MSE = .103, η² = .203. This interaction occurred because the difference in confidence ratings between correct and incorrect responses was significantly larger in the no-study group than in both the real-study group, F (1, 40) = 5.69, MSE = .121, η² = .125, and the sham-study group, F (1, 40) = 14.59, MSE = .103, η² = .267.

At first glance, the fact that the no-study and sham-study groups assigned different levels of confidence to target and non-target responses is surprising because neither of these groups was exposed to any targets during study. We speculate that, in the case of strong cues, participants were responding to the ease with which a related word can be generated, which happens to be correlated with the high degree of association between strong cues and words chosen as targets for the real-study group. At any rate, the significant group by accuracy interaction obtained here indicates a clear influence of study-list structure on monitoring.
Presumably, because both the real-study group and the sham-study group were exposed to a list of weak associates, confidence in targets produced to strong cues was low due to their inconsistency with the list structure; only targets produced to strong cues after studying no list at all were assigned reasonably high levels of confidence.

Discussion

Experiment 3 demonstrated that the cued-recall group, who studied targets in the context of weakly related associate pairs (real-study group), were less likely to produce those targets in the context of non-reinstated strong cues than were the generation groups (sham-study and no-study), who never studied the targets. This effect was apparent in free report, where more targets were retrieved by the no-study group than the real-study group, and in forced report, where more targets were retrieved by both the no-study and sham-study groups than the real-study group. These effects were obtained despite the fact that participants were given the opportunity to generate multiple responses to the cues, and provided with an incentive to produce multiple candidates. Consequently, the notion that targets were actually covertly generated in cued recall, but not reported (i.e., not recognized), seems less likely. Instead, the results of Experiment 3 add support to the conclusion, derived from Experiments 1 and 2, that poor cued-recall performance to strong extralist cues in the typical encoding-specificity experiments is partly attributable to generation failure. Whereas Experiments 1 and 2 demonstrated that some guidance regarding the nature of the relationship between the cue and the sought-after target improved forced-report target production with strong extralist cues, there was some question about the degree to which recognition processes might have influenced this result.
However, Experiment 3 demonstrated, unequivocally, that generation failure is at least partly the problem that cued-recall participants experience. We suggested above that learning the weak-associate study-list structure, which was not conducive to generating strong associates to strong test cues, was one possible cause of poor strong-cue target production. However, a comparison of the sham-study and no-study groups suggests that the list structure by itself had little effect on forced-report retrieval, our measure of generation, but had substantial effects on monitoring (Tables 4 and 8). The most likely explanation for this is that the multiple-response methodology limited the effect of the inappropriate list structure such that targets to strong cues in the sham-study group were generated, but confidence assigned to them was lowered. The fact that this comparison involves only pure generation groups makes generalization to possible effects that the study-list structure might have in cued recall somewhat tenuous. Nonetheless, it is clear that, based on the current data, the study-list structure alone cannot account for the generation failure observed in the real-study group. On the other hand, a comparison of the real-study and sham-study groups revealed that either the direct memory instructions alone, or these instructions in conjunction with the weak-associate study list, produced generation failure. As we mentioned above, our preferred explanation is that the combination of direct memory instructions and inappropriate study-list structure is particularly damaging to strong-cue target production. Although the current data cannot determine whether direct memory instructions exert a singular effect, or whether they are involved in an interactive effect with study-list structure, our preference is based on the results of other experiments we have conducted that favor the latter explanation. 
For example, Higham and Tam (2005) found that generation failure (i.e., greater forced-report target production in a generation group shown no targets during study than in a cued-recall group exposed to the targets) was only obtained if the study list consisted of weak-associate pairs. We describe this research in more detail in the General discussion.

The multiple-response methodology introduced a potential confound in the experiment; namely, there was no longer strict control over report output, as there was in Experiments 1 and 2. Indeed, the real-study group produced significantly fewer responses (2.04) than both the sham-study group (2.91; t (40) = 3.43, SE = 0.25, η² = .227) and the no-study group (2.74; t (40) = 2.41, SE = 0.29, η² = .127). These differences are potentially problematic because they represent a form of report bias shift in forced report, which may, by itself, account for superior strong-cue, forced-report target production in the generation groups compared to the cued-recall group.

To explore the role of this possible confound, strong-cue target production was determined in each experimental group separately for trials corresponding to one, two, three, four, five, and six responses. If a report bias shift was solely responsible for the better strong-cue target production in the generation groups compared to the real-study group, target production should increase as more responses are offered, but performance between the groups corresponding to each output level should be equated. By this reasoning, the better target production in the generation groups, compared to the real-study group, would be due to the fact there are simply more trials with more responses in the former groups, which would boost their overall target production rate. The results of this analysis are shown in Fig. 1. The number of observations contributing to each mean, summed across participants, appears in brackets.
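The breakdown by output level can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' analysis code: the function name `production_by_output_level` and the trial representation (a list of responses paired with the target) are hypothetical, and the example trials are invented solely to show the computation.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Trial = Tuple[List[str], str]  # (responses offered to one cue, target for that cue)


def production_by_output_level(trials: Iterable[Trial]) -> Dict[int, Tuple[float, int]]:
    """Forced-report target production, split by the number of responses offered.

    Returns {number of responses: (proportion of trials containing the target,
                                   number of trials at that output level)}.
    """
    hits: Dict[int, int] = defaultdict(int)
    counts: Dict[int, int] = defaultdict(int)
    for responses, target in trials:
        level = len(responses)
        counts[level] += 1
        if target in responses:  # forced report: any listed response counts
            hits[level] += 1
    return {level: (hits[level] / counts[level], counts[level])
            for level in sorted(counts)}


# Invented trials for one hypothetical group, pooled across participants.
example_trials: List[Trial] = [
    (["kill"], "murder"),
    (["murder"], "murder"),
    (["death", "murder"], "murder"),
    (["death", "kill"], "murder"),
]
print(production_by_output_level(example_trials))
```

Running the same function separately on each group's trials gives, for every output level, a rate and an observation count of the kind plotted in Fig. 1, so groups can be compared at matched numbers of responses.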
Analyses of these data were limited to descriptive statistics because several participants in each group contributed no data to some of the cells, making calculation of inferential statistics unviable. Nonetheless, the data pattern is clear enough to eliminate a report-bias interpretation of the obtained results. As shown in the figure, although there was a trend for better target production as more responses were offered (i.e., positive slope), strong-cue target production in the real-study group was worse than that in both generation groups at virtually every level of responding. This analysis suggests that the poorer target production in the real-study group, relative to the generation groups, was due to generation failure, not a shift in report bias.

Fig. 1. Strong-cue target production in Experiment 3 as a function of the number of responses offered to each cue. The number of observations contributing to each point, summed across participants, is also indicated in parentheses.

The results from Experiment 3 also eliminate a recognition-failure explanation for the good target production observed in the guided groups of Experiments 1 and 2. The critic may argue that the instructions to generate-then-recognize to strong cues effectively rendered performance in these conditions comparable to indirect memory instructions. That is, participants may have simply stopped trying to recognize the candidates that they generated, responding with the first word that came to mind. By effectively eliminating or reducing the influence of the recognition stage in the guided groups, recognition failure was reduced or eliminated, and target production to strong cues was boosted.
However, the multiple-response methodology used in Experiment 3, with the emphasis on generating multiple candidates and deciding whether or not to put one forward for points, effectively made the real-study group a generate-recognize group like the guided groups of Experiments 1 and 2. Indeed, the fact that the forced-report number of candidates generated (2.04) was more than twice that observed in either Experiment 1 or 2 (1.00) suggests that the real-study group fits the generate-recognize description better than the guided groups. Despite this, there was still good evidence for generation failure in the real-study group. Given that there was evidence of generation failure even under the conditions of Experiment 3, it seems more likely that the excellent strong-cue target production observed in the guided groups in Experiments 1 and 2 was due to an effect on generation, not recognition.

General discussion

The current experiments tested whether poor cued-recall target production observed with strong extralist cues in the standard encoding-specificity paradigm is due solely to recognition failure. In Experiments 1 and 2, guiding participants about the strength of the associative relationship between test cues and sought-after targets was found to improve target production with strong extralist cues considerably. To determine which particular stage(s) of recall was affected by the guidance manipulation, type 2 SDT methodology (Higham, 2002; Higham & Gerrard, 2005; Vokey & Higham, 2005) was employed, which allowed us to gain separate estimates of generation processes, recognition processes, and response tendencies involved in cued recall. Consistent with earlier claims, this analysis indicated that recognition performance to strong extralist cues was poorer in the standard group than in the guided group, and that report bias also tended to be somewhat more conservative.
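To make the type 2 analysis concrete, the following Python sketch shows one way the three estimates could be computed from forced-report responses and the accompanying report/withhold decisions. It is a minimal illustration and not the authors' analysis code: the function names and the toy counts are hypothetical, and the choice of the nonparametric indices A′ (Grier, 1971) for monitoring and B″D (Donaldson, 1992) for report bias is an assumption made here for illustration. Retrieval (generation) is taken as the proportion of forced-report responses that are correct; the type 2 hit rate is the proportion of correct responses that are reported, and the type 2 false-alarm rate is the proportion of incorrect responses that are reported.

```python
def a_prime(hit_rate, fa_rate):
    """Nonparametric sensitivity A' (Grier, 1971): 0.5 is chance, 1.0 is perfect."""
    h, f = hit_rate, fa_rate
    if h == f:
        return 0.5
    if h > f:
        return 0.5 + ((h - f) * (1 + h - f)) / (4 * h * (1 - f))
    return 0.5 - ((f - h) * (1 + f - h)) / (4 * f * (1 - h))


def b_double_prime_d(hit_rate, fa_rate):
    """Nonparametric bias B''D (Donaldson, 1992): positive values are conservative."""
    h, f = hit_rate, fa_rate
    withheld = (1 - h) * (1 - f)
    reported = h * f
    if withheld + reported == 0:
        return float("nan")  # index is undefined at these extreme rates
    return (withheld - reported) / (withheld + reported)


def type2_estimates(correct_reported, correct_withheld,
                    incorrect_reported, incorrect_withheld):
    """Separate retrieval, monitoring, and bias estimates from four counts.

    Assumes at least one correct and one incorrect forced-report response.
    """
    n_correct = correct_reported + correct_withheld
    n_incorrect = incorrect_reported + incorrect_withheld
    hit_rate = correct_reported / n_correct        # P(report | correct)
    fa_rate = incorrect_reported / n_incorrect     # P(report | incorrect)
    return {
        "retrieval": n_correct / (n_correct + n_incorrect),
        "hit_rate": hit_rate,
        "fa_rate": fa_rate,
        "monitoring_a_prime": a_prime(hit_rate, fa_rate),
        "bias_b_double_prime_d": b_double_prime_d(hit_rate, fa_rate),
    }


# Hypothetical counts for one participant (NOT data from the study).
estimates = type2_estimates(correct_reported=12, correct_withheld=4,
                            incorrect_reported=6, incorrect_withheld=18)
print(estimates)
```

Because retrieval is computed from all forced-report responses, whereas monitoring and bias are computed from the report decisions conditional on correctness, a manipulation can in principle change one estimate while leaving the others unchanged.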
However, counter to prior claims, the analysis also indicated that guidance had a substantial effect on forced-report target production, our measure of the generation process; by the time participants provided responses to all the test cues in forced report, recall performance for strong cues in the guided groups was superior to that in the group given standard cued-recall instructions. Thus, poor strong extralist cue target production is not just a problem in recognizing easily generated targets. In the context of standard cued-recall instructions, participants also have difficulty generating correct responses, despite the fact that the targets were primary associates of many of the strong cues.

We attribute the generation difficulties observed in the current experiments to participants searching an inappropriate domain when given standard cued-recall instructions. There appear to be at least two reasons for this inappropriate domain search. First, as a result of studying weakly associated word pairs during study, participants "learn the experimenter's design" (Higham & Brooks, 1997); that is, participants learn about listwide commonalities that exist amongst the study items, and these commonalities partially define the class of plausible candidates generated during the test. Such learning is detrimental to finding targets with strong cues, which accounts for poor strong-cue target production in the unguided group of Experiment 1 and the real-study group of Experiment 3. However, if participants are freed from assumptions regarding the memory domains in which to search for targets, as in the guided groups of Experiments 1 and 2, strong-cue target production improves dramatically. Direct support for the role of study-list structure was obtained recently by Higham and Tam (2005) who varied the nature of the relationship between the cues and targets in the study list.
They found that if a sham study list consisted of entirely weak associates, as in Experiment 3 of the current series, the probability of target production in a pure-generation task was less than if the study list consisted of sham strong associates. If, instead, participants studied a list of moderate associates, participants were released from generation failure; that is, forced-report target production to strong cues in the cued-recall group was better than pure-generation target production. Furthermore, the cued-recall group given the weak-associate study list tended to produce non-target responses that were less strongly related to the cues than the cued-recall group given a moderate-associate study list.

Despite these data, a comparison between forced-report target production in the sham-study group and the real-study group of Experiment 3 indicated that the generation problems of the real-study group were not limited to incorrect assumptions derived from the study-list structure. Both the sham-study and real-study groups studied a list of weakly associated word pairs, yet forced-report target production was higher in the sham-study group than in the real-study group. Thus, direct memory instructions appear to be a second factor contributing to inappropriate domain searching and generation failure.

Research by Weldon and Colston (1995) and Guynn and McDaniel (1999) may be helpful in interpreting the role of direct-memory instructions. Weldon and Colston found that cue reinstatement had differential effects on the generation component of direct (cued recall) versus indirect (generation) tests. More specifically, reinstating the study context at test had a larger effect on an explicit cued-recall test than on an implicit generation test.
They suggested that this was due to the fact that context-specific information could be used strategically to access targets directly and efficiently in the cued-recall group, limiting the search set when the context was reinstated, whereas this did not occur in the generation group. More recently, Guynn and McDaniel (1999) found an advantage for target production in a generation group compared to the cued-recall group, an effect that is analogous to our generation/cued-recall group difference. In their experiment, participants studied exemplars of common categories (e.g., trees, ships) that were of either high or low frequency within the category. One group of participants were later required to recall the exemplars with the assistance of the (extralist) category label, whereas another group of participants were required to generate exemplars to the category labels, without concern for whether the generated items were old or new. They found that high-frequency exemplar production was greater in the generate group than in the cued-recall group, whereas this was not the case for low-frequency exemplars. It is important to point out that although both Weldon and Colston (1995) and Guynn and McDaniel (1999) found better target production in their generate groups than in their cued-recall groups, both groups encountered the targets during the study phase. Consequently, their results lack the paradoxical feature obtained in our experiments: worse target production when trying to remember recently encountered targets than when merely generating without the benefit of recent target exposure. Nonetheless, our results, coupled with the results of these studies, suggest that the type of instructions that are given to participants can affect the nature of the domains that are searched during the generation stage. 
Whereas indirect (generation) instructions may lead participants to search the domain of pre-experimental associations, direct (cued-recall) memory instructions may lead participants to search more recent, within-experiment episodes, using the cue and test context to define the domain and guide the search. This focus on searching within-experiment domains is likely to enhance any effects of having learned an inappropriate list-wide structure during study, producing the "potent mix," referred to above, that leads to generation failure.

Because the search domain, and the candidates generated from it, are affected by the test instructions, there is no guarantee that encountering items during study will automatically lead to either implicit episodic priming or explicit memory for the prior processing episodes. If direct-memory instructions are used, but the study context is not reinstated, participants may search domains that are very unlikely to contain the target. Thus, performance in such situations may be very poor, even poorer than performance in a situation where targets have not been recently encountered, and generation instructions are used. These results underline what are perhaps the two most critical points of our research: (1) metacognitive flexibility in episodic memory retrieval and (2) the necessity of estimating that flexibility by considering both memory access (generation) and memory monitoring (recognition) as separate stages of processing.

A modernized generate-recognize theory

Clearly, the results of the current experiments are wholly incompatible with traditional generate-recognize theory. If participants fell back on generating candidates that were semantic associates of the test cues once direct retrieval failed, as these models posit, then guidance regarding the nature of the cue-target relationship in Experiments 1 and 2 should have had little effect on cued-recall performance.
Additionally, target production in the real-study group of Experiment 3 should have equalled or bettered performance in the no-study and sham-study groups of Experiment 3.

Why, then, do we continue to refer to discrete generation and recognition processes in cued recall? Perhaps the answer is best summarized in Jacoby and Hollingshead's (1990) comment, made 20 years or so after such models were first troubled by the introduction of the encoding-specificity principle: "Generate/recognize models are too useful as descriptions of memory monitoring and other activities to be abandoned" (p. 452). This sentiment is reflected in the fact that most modern metacognitive models are essentially generate-recognize models at heart. Although the stages have different labels (production–evaluation, Whittlesea, 1997; retrieval-monitoring, Koriat & Goldsmith, 1996), each distinguishes between separate memory access, in the broad sense of the term, and memory assessment stages.

We have attempted in our research to turn this distinction, which is now mostly uncontroversial in the metacognitive literature, full circle to revisit the analogous distinction in cued recall. In our view, the access-monitoring distinction became blurred as concepts such as ecphory (Tulving, 1983), direct access, and conscious recollection took over, chiefly pushing metamemory processes aside (although see Payne, Jacoby, & Lambert, 2004). It is as if both monitoring and retrieval processes have been assumed to be virtually perfect when direct access occurs. The former is revealed by the common use of free-report measures of cued recall performance, which is only justified if direct access is perfectly monitored (otherwise monitoring will contaminate the measure). The latter is revealed with the equating of "remember" judgments (Tulving, 1985) with veridical retrieval (see Higham & Vokey, 2004 for discussion).
We believe, however, that all memory decisions involve memory access, memory monitoring, and bias parameters to varying degrees, and that the methodology and tools that memory researchers use should have the inherent ability to estimate them. The best way we have found to make these separate estimates is to use type 2 SDT, regardless of whether the tasks are understood to involve direct retrieval or not.

References

Anderson, J. R., & Bower, G. H. (1972). Recognition and retrieval processes in cued recall. Psychological Review, 79, 97–123.
Bahrick, H. P. (1969). Measurement of memory by prompted recall. Journal of Experimental Psychology, 79, 213–219.
Bahrick, H. P. (1970). Two-phase model for prompted recall. Psychological Review, 77(3), 215–222.
Bahrick, H. P. (1979). Broader methods and narrower theories for memory research: Comments on the papers by Eysenck and Cermak. In L. S. Cermak & F. I. M. Craik (Eds.), Levels of processing in human memory (pp. 141–156). Hillsdale, NJ: Erlbaum.
Baker, L., & Santa, J. L. (1977). Context, integration, and retrieval. Memory & Cognition, 5, 308–314.
Barnes, A. E., Nelson, T. O., Dunlosky, J., Mazzoni, G., & Narens, L. (1999). An integrative system of metamemory components involved in retrieval. In D. Gopher & A. Koriat (Eds.), Cognitive regulation of performance: Interaction of theory and application (Attention and performance XVII) (pp. 289–313). Cambridge, MA: MIT Press.
Bartling, C. A., & Thompson, C. P. (1977). Encoding specificity: Retrieval asymmetry in the recognition failure paradigm. Journal of Experimental Psychology: Human Learning and Memory, 3(6), 690–700.
Bodner, G. E., Masson, M. E. J., & Caldwell, J. I. (2000). Evidence for a generate-recognize model of episodic influences on word-stem completion. Journal of Experimental Psychology: Learning, Memory, and Cognition, 26(2), 267–293.
Bower, G. H. (1967). A multicomponent theory of the memory trace. In K. W. Spence & J. T. Spence (Eds.).
The psychology of learning and motivation (Vol. 1). New York: Academic Press.
Brainerd, C. J., Payne, D. G., Wright, R., & Reyna, V. F. (2003). Phantom recall. Journal of Memory and Language, 48, 445–467.
Brainerd, C. J., Wright, R., Reyna, V. F., & Payne, D. G. (2002). Dual-retrieval processes in free and associative recall. Journal of Memory and Language, 46, 120–152.
Brooks, L. R. (1978). Nonanalytic concept formation and memory for instances. In E. Rosch & B. Lloyd (Eds.), Cognition and categorization (pp. 169–211). Hillsdale, NJ: Erlbaum.
Clarke, F. R., Birdsall, T. G., & Tanner, W. P. (1959). Two types of ROC curves and definitions of parameters. Journal of the Acoustical Society of America, 31, 629–630.
Donaldson, W. (1992). Measuring recognition memory. Journal of Experimental Psychology: General, 121, 275–277.
Ehrlich, S., & Philippe, M. (1976). Encoding specificity, retrieval specificity or structural specificity? Journal of Verbal Learning and Verbal Behavior, 15, 537–548.
Estes, W. K. (1959). The statistical approach to learning theory. In S. Koch (Ed.). Psychology: A study of a science (Vol. 2). New York: McGraw-Hill.
Flexser, A. J., & Tulving, E. (1978). Retrieval independence in recognition and recall. Psychological Review, 85(3), 153–171.
Galvin, S. J., Podd, J. V., Drga, V., & Whitmore, J. (2003). Type 2 tasks in the theory of signal detectability: Discrimination between correct and incorrect decisions. Psychonomic Bulletin & Review, 10, 843–876.
Gardiner, J. M. (1988). Recognition failures and free-recall failures: Implications for the relation between recall and recognition. Memory & Cognition, 16(5), 446–451.
Gillund, G., & Shiffrin, R. M. (1984). A retrieval model for both recognition and recall. Psychological Review, 91, 1–67.
Goldsmith, M., Koriat, A., & Weinberg-Eliezer, A. (2002). Strategic regulation of grain size in memory reporting. Journal of Experimental Psychology: General, 131(1), 73–95.
Grier, J. B. (1971).
Nonparametric indexes for sensitivity and bias: Computing formulas. Psychological Bulletin, 75, 424–429.
Guynn, M. J., & McDaniel, M. A. (1999). Generate-sometimes recognize, sometimes not. Journal of Memory and Language, 41, 398–415.
Higham, P. A. (2002). Strong cues are not necessarily weak: Thomson and Tulving (1970) and the encoding specificity principle revisited. Memory & Cognition, 30, 67–80.
Higham, P. A., & Brooks, L. R. (1997). Learning the experimenter's design: Tacit sensitivity to the structure of memory lists. Quarterly Journal of Experimental Psychology, 50A, 199–215.
Higham, P. A., & Gerrard, C. (2005). Not all errors are created equal: Metacognition and changing answers on multiple-choice tests. Canadian Journal of Experimental Psychology, 59, 28–34.
Higham, P. A., & Tam, H. (2005). Release from generation failure: The role of study-list structure. Manuscript submitted for publication.
Higham, P. A., & Vokey, J. R. (2004). Illusory recollection and dual-process models of recognition memory. Quarterly Journal of Experimental Psychology, 57A, 714–744.
Humphreys, M. S., Bain, J. D., & Pike, R. (1989). Different ways to cue a coherent memory system: A theory for episodic, semantic and procedural tasks. Psychological Review, 96(2), 208–233.
Jacoby, L. L. (1983). Remembering the data: Analyzing interactive processes in reading. Journal of Verbal Learning and Verbal Behavior, 22, 485–508.
Jacoby, L. L. (1996). Dissociating automatic and consciously controlled effects of study/test compatibility. Journal of Memory and Language, 35, 32–52.
Jacoby, L. L. (1998). Invariance in automatic influences of memory: Toward a user's guide for the process-dissociation procedure. Journal of Experimental Psychology: Learning, Memory, and Cognition, 24(1), 3–26.
Jacoby, L. L., & Hollingshead, A. (1990). Toward a generate/recognize model of performance on direct and indirect tests of memory. Journal of Memory and Language, 29, 433–454.
Jones, G. V. (1978). Recognition failure and dual mechanisms in recall. Psychological Review, 85(5), 464–469.
Jones, G. V. (1987). Independence and exclusivity among psychological processes: Implications for the structure of recall. Psychological Review, 94, 229–235.
Jones, G. V., & Gardiner, J. M. (1990). Recognition failure when recognition targets and recall cues are identical. Bulletin of the Psychonomic Society, 28(2), 105–108.
Kelley, C. M., & Sahakyan, L. (2003). Memory, monitoring, and control in the attainment of memory accuracy. Journal of Memory and Language, 48, 704–721.
Kintsch, W. (1974). The representation of meaning in memory. Hillsdale, NJ: Erlbaum.
Klatzky, R. L., & Erdelyi, M. H. (1985). The response criterion problem in tests of hypnosis and memory. The International Journal of Clinical and Experimental Hypnosis, 33, 246–257.
Koriat, A., & Goldsmith, M. (1996). Monitoring and control processes in the strategic regulation of memory accuracy. Psychological Review, 103, 490–517.
Martin, E. A. (1975). Theoretical notes: Generation-recognition theory and the encoding specificity principle. Psychological Review, 82(2), 150–153.
Morris, C. D., Bransford, J. D., & Franks, J. J. (1977). Levels of processing versus transfer appropriate processing. Journal of Verbal Learning and Verbal Behavior, 16, 519–533.
Muter, P. (1984). Recognition and recall of words with a single meaning. Journal of Experimental Psychology: Learning, Memory, and Cognition, 10(2), 198–202.
Naveh-Benjamin, M., & Guez, J. (2000). Effects of divided attention on encoding and retrieval processes: Assessment of attentional costs and a componential analysis. Journal of Experimental Psychology: Learning, Memory, and Cognition, 26(6), 1461–1482.
Nelson, D. L., Bennett, D. J., Gee, N. R., Schreiber, T. A., & McKinney, V. M. (1993). Implicit memory: Effects of network size and interconnectivity on cued recall.
Journal of Experimental Psychology: Learning, Memory, and Cognition, 19(4), 747–764.
Nelson, D. L., Bennett, D. J., & Leibert, T. W. (1997). One step is not enough: Making better use of association norms to predict cued recall. Memory & Cognition, 25(6), 785–796.
Nelson, D. L., & Goodmon, L. B. (2003). Disrupting attention: The need for retrieval cues in working memory. Memory & Cognition, 31, 65–76.
Nelson, D. L., McEvoy, C. L., & Friedrich, M. A. (1982). Extralist cuing and retrieval inhibition. Journal of Experimental Psychology: Learning, Memory, and Cognition, 8(2), 89–105.
Nelson, D. L., McKinney, V. M., Gee, N. R., & Janczura, G. A. (1998). Interpreting the influence of implicitly activated memories on recall and recognition. Psychological Review, 105, 299–324.
Nelson, D. L., Schreiber, T. A., & McEvoy, C. L. (1992). Processing implicit and explicit representations. Psychological Review, 99(2), 322–348.
Nelson, T. O. (1984). A comparison of current measures of the accuracy of feeling-of-knowing predictions. Psychological Bulletin, 95, 109–133.
Nilsson, L. G., & Gardiner, J. M. (1993). Identifying exceptions in a database of recognition failure studies from 1973 to 1992. Memory & Cognition, 21(3), 397–410.
Payne, B. K., Jacoby, L. L., & Lambert, A. J. (2004). Memory monitoring and the control of stereotype distortion. Journal of Experimental Social Psychology, 40, 52–64.
Pellegrino, J. W., & Salzberg, P. M. (1975a). Encoding specificity in associative processing tasks. Journal of Experimental Psychology: Human Learning and Memory, 1(5), 538–548.
Pellegrino, J. W., & Salzberg, P. M. (1975b). Encoding specificity in cued recall and context recognition. Journal of Experimental Psychology: Human Learning and Memory, 104(3), 261–270.
Postman, L. (1975). Tests of the generality of the principle of encoding specificity. Memory & Cognition, 3(6), 663–672.
Reder, L. M., Anderson, J. R., & Bjork, R. A. (1974). A semantic interpretation of encoding specificity.
Journal of Experimental Psychology, 102(4), 648–656.
Roediger, H. L., & Adelson, B. (1980). Semantic specificity in cued recall. Memory & Cognition, 8(1), 65–74.
Santa, J. L., & Lamwers, L. L. (1974). Encoding specificity: Fact or artifact? Journal of Verbal Learning and Verbal Behavior, 13, 412–423.
Santa, J. L., & Lamwers, L. L. (1976). Where does the confusion lie? Comments on the Wiseman and Tulving paper. Journal of Verbal Learning and Verbal Behavior, 15, 53–57.
Sikstrom, S. P., & Gardiner, J. M. (1997). Remembering, knowing and the Tulving-Wiseman law. European Journal of Cognitive Psychology, 9(2), 167–185.
Snodgrass, J. C., & Corwin, J. (1988). Pragmatics of measuring recognition memory: Applications to dementia and amnesia. Journal of Experimental Psychology: General, 117, 34–50.
Thomson, D. M., & Tulving, E. (1970). Associative encoding and retrieval: Weak and strong cues. Journal of Experimental Psychology, 86(2), 255–262.
Toth, J. P., Reingold, E. M., & Jacoby, L. L. (1994). Toward a redefinition of implicit memory: Process dissociations following elaborative processing and self-generation. Journal of Experimental Psychology: Learning, Memory, and Cognition, 20, 290–303.
Tulving, E. (1974). Recall and recognition of semantically encoded words. Journal of Experimental Psychology, 102, 778–787.
Tulving, E. (1983). Elements of episodic memory. Oxford: Oxford University Press.
Tulving, E. (1985). Memory and consciousness. Canadian Psychology, 26, 1–12.
Tulving, E., & Thomson, D. M. (1973). Encoding specificity and retrieval processes in episodic memory. Psychological Review, 80(5), 352–373.
Tulving, E., & Watkins, O. C. (1977). Recognition failure of words with a single meaning. Memory & Cognition, 5(5), 513–522.
Underwood, B. J. (1969). Attributes of memory. Psychological Review, 76, 559–773.
Vokey, J. R., & Higham, P. A. (2005). Components of recall: The semantic specificity effect and the monitoring of cued recall.
Manuscript submitted for publication.
Watkins, M. J., & Tulving, E. (1975). Episodic memory: When recognition fails. Journal of Experimental Psychology: General, 104(1), 5–29.
Weldon, M. S., & Colston, H. L. (1995). Dissociating the generation stage in implicit and explicit memory tests: Incidental production can differ from strategic access. Psychonomic Bulletin & Review, 2(3), 381–386.
Whittlesea, B. W. A. (1997). Production, evaluation, and preservation of experiences: Constructive processing in remembering and performance tasks. In D. L. Medin (Ed.). The psychology of learning and motivation (Vol. 37, pp. 211–264). San Diego, CA: Academic Press.
Wickens, D. D. (1970). Encoding categories of words: An empirical approach to meaning. Psychological Review, 77, 1–15.
Wiseman, S., & Tulving, E. (1975). A test of confusion theory of encoding specificity. Journal of Verbal Learning and Verbal Behavior, 14, 370–381.
Wiseman, S., & Tulving, E. (1976). Encoding specificity: Relation between recall superiority and recognition failure. Journal of Experimental Psychology: Human Learning and Memory, 2(4), 349–361.
Zeelenberg, R., Pecher, D., Shiffrin, R. M., & Raaijmakers, J. G. W. (2003). Semantic context effects and priming in word association. Psychonomic Bulletin & Review, 10, 653–660.