Positive/negative instantiation, active
responding/passive exposure, and corrective
feedback in artificial grammar learning
A mini-thesis submitted for transfer of registration from MPhil to PhD
September 2002
Martina T. Johnson
Supervisor: Professor Stevan Harnad
IAM Group
ECS Department
University of Southampton
Table of Contents
1 Introduction
2 Background
   2.1 Natural language learning
   2.2 The Credit/Blame Assignment Problem
   2.3 Gold's Theorem
   2.4 Project Grammarama
   2.5 Conclusion
3 Artificial Grammar Learning
   3.1 The Artificial Grammar Learning Paradigm
   3.2 Positive and Negative Instances
      3.2.1 Brooks' (1978) series of studies
      3.2.2 Whittlesea & Dorken's (1993) studies
      3.2.3 Dienes, Altmann, Kwan, & Goode's (1995) study
      3.2.4 Dienes, Broadbent, & Berry's (1991) study
      3.2.5 Conclusion
4 Rule or Fragment Knowledge?
   4.1 Transfer studies – Evidence for Abstract Knowledge?
   4.2 Evidence for Fragment Knowledge
   4.3 Problem with FSGs
   4.4 Biconditional Grammars
      4.4.1 St. John's (1996) study
      4.4.2 Advantages of the BCG over FSGs
      4.4.3 Associative Chunk Strength
         4.4.3.1 Servan-Schreiber & Anderson's study
   4.5 Summary
5 The variables that need to be controlled and tested
   5.1 The variables that need to be controlled
      5.1.1 Grammar
      5.1.2 Degree of difference between strings
      5.1.3 Number of trials and presentation method
      5.1.4 Chance baseline
   5.2 The variables that need to be tested
      5.2.1 Instructions
         5.2.1.1 'Hide' vs. 'Seek' instructions
         5.2.1.2 'Tell' instructions
      5.2.2 Task and Feedback
      5.2.3 ⊕ and ⊖ instances
   5.3 Summary
6 Experiment 1: Rule or Fragment knowledge?
   6.1 Method
      6.1.1 Subjects
      6.1.2 Design
      6.1.3 Apparatus
      6.1.4 Procedure
   6.2 Results and Discussion
      6.2.1 Training phase
      6.2.2 Test phase
         6.2.2.1 The guessing criterion
         6.2.2.2 The zero correlation criterion
         6.2.2.3 ACS analyses
   6.3 Conclusions
7 Experiment 2: The effect of ⊖ instances in a standard AGL experiment
   7.1 Method
      7.1.1 Subjects
      7.1.2 Design
      7.1.3 Apparatus
      7.1.4 Procedure
   7.2 Results and Discussion
      7.2.1 Training phase
      7.2.2 Test phase
         7.2.2.1 The guessing criterion
         7.2.2.2 The zero correlation criterion
   7.3 Conclusions
8 Future Work
   8.1 Overview
   8.2 Pilot studies
   8.3 Experimental plan
References
Appendix A
   A1 Servan-Schreiber and Anderson's theory of competitive chunking (CC)
      A1.1 The strengthening of existing chunks
      A1.2 Creation of new chunks
      A1.3 Simulation
Abstract
There is nothing more basic to thought and
language than our sense of similarity; our sorting
of things into kinds.
Quine (1969, p.116)
But what is classification but the perceiving that these
objects are not chaotic, and are not foreign, but have a
law which is also the law of the human mind?
Ralph Waldo Emerson (1803–1882)
1 Introduction
One of the most important capacities of the mind is its ability to sort the information it
receives (and perceives) into categories. The world consists of an infinite number of
potentially different objects and kinds of objects, which one would not be able to
make sense of were one unable to sort the confusion into categories. In order to
survive, behave adaptively in the environment and to reproduce, all organisms must
partition the environment into categories, so that non-identical stimuli can be treated
as equivalent. Correct categorisation is based on the recognition of a set of features or
rules that is sufficient to determine which category the thing in question belongs to.
This set of features or rules can only be detected if one has sampled objects belonging
to the specific category (i.e. positive, henceforth ⊕, examples) as well as objects that
do not belong to the category (i.e. negative, henceforth ⊖, examples). If one had only
ever seen ⊕ instances of a category one would have no way of knowing whether or
not a new object belonged to the same category or to a different one. To find the
discriminating set of features one must sample both their presence and their absence.
The area of concept learning shows this need for both ⊕ and ⊖ evidence nicely. In a
typical concept learning experiment, such as those conducted by Bruner, Goodnow, &
Austin (1956), subjects are presented with a number of pictures. In Bruner et al's
series of experiments these vary in several attributes: number of figures depicted,
shape, colour, and number of borders. The subject is told that the experimenter has a
concept in mind, e.g. red circles, or three blue squares, and that some of the pictures
are illustrative of the concept while others are not. His task is to discover what the
concept is. The experiment begins with the subject being shown a picture that
illustrates the concept: a ⊕ instance of the concept. The subject is then shown a
sequence of cards and told whether each one is a ⊕ or a ⊖ instance of the concept.
After each card subjects are required to write down what they think the concept is,
their hypothesis. They are not able to refer back to previously seen cards, or to
previous hypotheses. If their final answer is the correct concept, the subject is
considered to have attained the concept. Subjects are given just enough instances to
have sufficient information for attaining the concept. Subjects typically proceed by
using various different strategies in making and testing hypotheses about the category
in question (e.g. "red", "red and square", etc.), and changing hypotheses based on the
feedback they receive. Subjects (henceforth Ss) are trying to find the discriminating set of features or the
"rule", which will enable them to successfully identify stimuli as belonging to a
specific category or not.
Bruner et al found that performance was dependent on the number of attributes that
defined the concept and the strategy subjects used in forming hypotheses about what
the concept was. With few attributes performance was better than with more
attributes. In other words, concepts that had only few attributes (e.g. "red") were more
easily attained than concepts with more attributes (e.g. "red, square, and 3 borders").
Bruner et al identified two ideal strategies in forming hypotheses: the wholist strategy
and the part strategy. The wholist strategy consists in the adoption of a first hypothesis
that is based on the whole instance (e.g. hypothesis "2 red squares" when shown two
red squares), whereas the part strategy begins with part of the first instance as an
hypothesis (e.g. hypothesis "red" when shown two red squares). Subjects seem to
prefer using whole hypotheses, and the findings show that the percentage of concepts
attained is higher when using a whole hypothesis (with 3 attributes, successful
acquisition of the concept using a wholist strategy is about 73%, while when using a
part strategy success is about 40%).
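To make the wholist strategy concrete, here is a minimal sketch in Python (my illustration: the attribute names and trial sequence are hypothetical, and ⊖ instances are simply ignored, which simplifies Bruner et al's informally described procedure):

    def wholist(instances):
        # Wholist strategy, sketched for attribute-value stimuli: adopt the
        # whole first positive instance as the hypothesis, then keep only the
        # attribute values shared with every later positive instance.
        hypothesis = None
        for features, is_positive in instances:
            if is_positive:
                if hypothesis is None:
                    hypothesis = dict(features)
                else:
                    hypothesis = {k: v for k, v in hypothesis.items()
                                  if features.get(k) == v}
        return hypothesis

    # A hypothetical trial sequence for the concept "red":
    trials = [({"colour": "red", "shape": "square", "number": 2}, True),
              ({"colour": "red", "shape": "circle", "number": 2}, True),
              ({"colour": "green", "shape": "square", "number": 3}, False)]
    print(wholist(trials))  # {'colour': 'red', 'number': 2}

After two ⊕ instances the hypothesis has narrowed to "red, 2 figures"; a further ⊕ instance with a different number of figures would be needed to eliminate the irrelevant number attribute.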
Learning in such experiments depends on several things. The learner needs to receive
feedback on the success or failure of the hypotheses he is making in order to gain
information about the category he is looking for. Obviously, the feedback has to be
valid, i.e. if the feedback to a correct hypothesis were frequently "wrong", learning
would not occur. And as mentioned above, the learner needs to sample both ⊕ and ⊖
instances, and to know which instances are ⊕ and which are ⊖.
Although it is common practice in the concept learning literature to use ⊕ and ⊖
instances (indeed, learning depends on the presence of both if no innate or prior
knowledge exists), in another area of rule learning, the typical study has used only ⊕
instances: the area of artificial grammar learning (AGL). Reber first started
experimenting with artificial grammars (AGs) in the late 1960s to investigate the
process by which subjects respond to the nature of a stimulus array, a process he
termed "implicit learning". In Chapter 2 of this report, the emergence of AGL experiments
will be outlined, from the origins in natural language learning to Project Grammarama
(Miller, 1958), the grandmother of all AGL experiments. The research area of implicit
learning was initially motivated by the analogy to natural language learning.
Researchers wanted to gain some insight into how people learn their native language
by studying what people learn about artificial grammars. One of the early
observations in natural language learning (e.g. Chomsky, 1980) was that children
never encounter ⊖ evidence. This set AGL in the direction of studying learning from
⊕ evidence only. Consequently, the vast majority of AGL experiments use only ⊕
instances.
In Chapter 3, the standard AGL paradigm initialised by Reber's studies in the late
1960s will be described. Reber's experiments signify the beginning of the field of
study of implicit learning in general. Then the few AGL studies in which both ⊕ and
⊖ instances have been used will be presented in detail.
One of the most controversial issues in the AGL community concerns the nature of
the knowledge people gain in AGL experiments. Some researchers (e.g. Reber, 1967)
argue that it is abstract knowledge about the rules of the grammar, i.e. that subjects
learn the underlying abstract structure of the grammar. Other researchers (e.g.
Perruchet & Pacteau, 1990; Johnstone & Shanks, 2001) are convinced that subjects
only memorise chunks of letters, and do not learn anything about the underlying rules.
In this report the debate over the representational nature of the knowledge gained in
AGL experiments will be outlined in Chapter 4.
In Chapter 5 the many different factors that need to be controlled and tested in an
experiment for any precise conclusions to be drawn will be pointed out and discussed.
The first experiment is described in Chapter 6 and will demonstrate that subjects are
not learning abstract rules of an AG, but are merely rote memorising chunks of letters.
In Chapter 7 the second experiment will be described and it will be shown that ⊖
instances interfere with learning, again demonstrating that many AGL experiments are
simple rote memorisation experiments, not rule learning experiments.
Finally, in Chapter 8, future work that will be conducted will briefly be outlined. This
consists of a series of experiments that will test several variables using AGs of
varying difficulty, i.e. on easy, medium, and hard grammars. It could be that subjects
use a rule-based strategy for the hard grammars, and a memory-based strategy for the
easier grammars, since harder grammars may be more difficult to memorise, and
using a rule would be beneficial. However, it could also be that subjects have
difficulty finding a rule in harder grammars, and are thus merely rote memorising
stimuli (or parts of stimuli). By examining the learnability of easy, medium, and hard
grammars, the type of strategy subjects use can be determined. The effect that ⊕ and
⊖ instances have on learning is another aspect of AGL that will be examined. In order
for Ss to pick out the relevant features it is necessary for them to compare stimuli in
which the relevant features are present, i.e. ⊕ strings, and stimuli in which they are
absent, i.e. ⊖ stimuli. The amount of learning with ⊕ stimuli only, with ⊖ stimuli only,
and with both ⊕ and ⊖ stimuli will be investigated. It has been demonstrated in
many aspects of human life that actively doing something, rather than just
passively taking something in, greatly increases subsequent performance, be it
learning how to write a computer programme, ride a bike, or sing a song. The effect of
active responding versus passive exposure to a series of strings created by an AG will
therefore be examined. In addition, and in connection with ⊕ and ⊖ instances and active and
passive responding, the presence and absence of corrective feedback on subjects'
responses is likely to have a profound effect on learning, and will be assessed.
Once it has been ascertained what effect these variables have on learning an artificial
grammar, the results of these experiments can – to a certain degree and within reason –
be generalised to other more applied domains of rule learning. Rule learning is, for
example, an inherent part of machine learning or education, and it would certainly be
useful in such domains to know which variables affect learning beneficially, and
should thus be incorporated into the computational or educational programme, and
which affect the learning process unfavourably and are thus to be avoided.
2 Background
2.1 Natural language learning
One of the first things a child has to learn is his native language. In order to learn a
natural language one has to be able to distinguish between grammatical and
ungrammatical sentences. However, infants almost never encounter ungrammatical
sentences1 (⊖ evidence); the only sentences they hear or utter are grammatical ones
(⊕ evidence) (see e.g. Brown & Hanlon, 1970, for relevant empirical work). A child
hears a finite number of sentences from his parents and elsewhere, and from this input
must generalise to an infinite set of sentences (which is the whole language) that
includes the input sample but goes beyond it. According to Chomsky's (1980) poverty
of the stimulus argument there are too many grammars that are compatible with the
input data the child receives. Many of these grammars are more tempting to someone
in the child's situation than the correct grammar, yet the child does not "choose" an
incorrect grammar. Since the child receives only ⊕ evidence, there is no way for him
to discover that his hypothesis is wrong; for that, his input data would also have to
include ⊖ information, or he would have to utter ⊖ sentences and be corrected. There
is, however, considerable evidence that ⊖ information is not and cannot be necessary
for first language acquisition (e.g. Brown & Hanlon, 1970). The knowledge infants
acquire when learning their first language is far greater than the information that is
available to them in the environment. Or, as philosophers sometimes put it, the output
of the language learning process is underdetermined by the input (Laurence &
Margolis, 2001). In philosophy, underdetermination is the thesis that all data can be
explained by more than one theory and there is no way of knowing which is the
correct theory.
The underdetermination of linguistic input has led many researchers, notably
Chomsky (e.g. 1968), to conclude that infants must have a pre-wired disposition to
pick up the correct set of rules (grammar), in the form of an inborn Universal
Grammar (UG). Chomsky and many other linguists infer that every natural language
has structural properties which are common to all human languages (UG) and which
are innately coded in the brain, giving infants the ability to learn grammar from ⊕
evidence only. This theory of the existence of UG is the prevailing one in linguistics
today.
To illustrate, consider the following sentences (Chomsky, 1968):
1. (a) They intercepted a message to the boy.
(b) Who did they intercept a message to?
2. (a) They intercepted John's message to the boy.
(b) * Who did they intercept John's message to?2
If we transform the phrase in 1(a) to a question, we get the grammatical sentence 1(b).
However, if we apply exactly the same process to the very similar sentence 2(a), we
get the ungrammatical sentence 2(b). In some unknown way, the speaker of English
implicitly learns the rules for creating such sentences, and knows under which formal
conditions these rules are applicable, i.e. knows that 1(b) is acceptable (grammatical),
but 2(b) is not. This is a big problem for learning machines (see Section 2.2), but not
for children. Since children do not make mistakes that violate UG during the course of
language acquisition, the principles of UG must already be built into the brain.
Chomsky (1987) likens the language faculty to a complex network associated with a
switch box that contains a finite number of switches. To acquire a language the child's
mind determines (implicitly) how the switches are set from the simple input it
receives, but not the structure of the network itself.
1 Here, ungrammatical means 'violating Universal Grammar (UG)'. UG is a complex implicit set of rules gradually discovered by syntacticians and made explicit by them, but already "known" implicitly by the child. Children do sometimes hear ungrammatical examples in the sense of stylistic grammar, e.g. *ain't, etc.

2 The asterisk is used to indicate a sentence that deviates in some respect from grammatical rule.
2.2 The Credit/Blame Assignment Problem
Related to the problem of underdetermination is the "credit/blame assignment
problem" in machine learning. When learning to categorise objects, a machine (or a
human) may be categorising several objects in a row successfully, basing the
categorisation on certain features. However, the categorisation may be based on the
wrong features, and at some point, the categorising may be unsuccessful. It is often
difficult to know what is to blame for the incorrect categorisation. The same thing
applies for the opposite case, when one is continuously categorising incorrectly and at
some point there occurs a correct categorisation: which feature is to be credited with
the correct categorisation?
Researchers in machine learning are trying to design algorithms for solving the
credit/blame assignment problem (e.g. Stroulia & Goel, 1996). In order for an
algorithm to be successful it has to be able to generalise from the set of exemplars it
was trained on, to all possible cases. When a machine makes a mistake, the algorithms
are designed to detect that their performance has gone awry, and to adjust themselves.
However, if the objects in the training set contain many features, determining which
features to credit and which to blame is difficult; indeed, the credit/blame assignment
problem is one of the hardest problems in Artificial Intelligence. The number of
potential features that can be credited/blamed is sometimes simply too large.
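Although the text above describes the problem rather than any particular algorithm, the flavour of error-driven credit/blame assignment can be shown with a minimal perceptron-style sketch (my illustration, with toy features; this is not an algorithm from Stroulia & Goel): whenever a categorisation is wrong, every feature present in the stimulus shares the blame and has its weight adjusted.

    from collections import defaultdict

    def perceptron_update(weights, features, label, lr=1.0):
        # Predict the category from the summed weights of the features
        # present; on a mistake, nudge each of those weights toward the
        # correct label, spreading credit/blame over the features.
        prediction = 1 if sum(weights[f] for f in features) > 0 else -1
        if prediction != label:
            for f in features:
                weights[f] += lr * label
        return prediction

    weights = defaultdict(float)
    # Toy data: 'square' predicts category +1; 'red' is irrelevant.
    data = [({"red", "square"}, 1), ({"red", "circle"}, -1), ({"square"}, 1)]
    for _ in range(3):  # a few passes over the toy data
        for features, label in data:
            perceptron_update(weights, features, label)
    print(dict(weights))  # 'square' ends up positive; 'red' is cancelled back to zero

With many features the same update rule still works, but convergence can be slow and the intermediate hypotheses opaque, which is the difficulty described above.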
2.3 Gold's Theorem
Results in formal learning theory also support the theory of UG, suggesting that the
learners of natural language must be predisposed to choose the correct set of rules. In
Gold's (1967) study of language learnability, learners are given strings of symbols and
must say whether or not each string is correct on the basis of the information received
so far. A "language" is taken to be a subset of the possible strings over some finite
alphabet of symbols. Gold investigated two basic ways to present the strings, "⊕" and
"⊕/⊖" presentation3. In the "⊕" presentation the subject sees only ⊕ strings. In the
"⊕/⊖" presentation the subject sees both ⊕ and ⊖ strings, and is informed whether
the string is correct (⊕) or not.
3 Gold called ⊕ presentation "text presentation" and ⊕/⊖ presentation "informant presentation", but for the present purposes I find it clearer to use ⊕ and ⊕/⊖.
Gold studied several different types of "grammars" and their learnability. He found
that only the most trivial of grammars, namely, those consisting of only a finite
number of sentences, are learnable from "⊕" presentation. All other languages are
learnable only if ⊖ evidence is also available (i.e. "⊕/⊖" presentation).
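Gold's positive result for finite languages can be illustrated with a minimal sketch (my illustration, assuming a ⊕ presentation that eventually shows every string of the language): a learner that always conjectures exactly the set of strings seen so far identifies any finite language in the limit, but the same strategy can never converge on an infinite language from ⊕ evidence alone.

    import itertools

    def finite_language_learner(positive_stream):
        # After each new ⊕ string, conjecture that the language is exactly
        # the set of strings seen so far. For a finite language this
        # conjecture eventually becomes correct and never changes again
        # (identification in the limit).
        seen = set()
        for s in positive_stream:
            seen.add(s)
            yield frozenset(seen)

    # A toy finite language and a ⊕-only presentation cycling through it.
    language = {"NS", "NXS", "NXXS"}
    presentation = itertools.islice(itertools.cycle(sorted(language)), 10)
    for hypothesis in finite_language_learner(presentation):
        pass
    print(set(hypothesis) == language)  # True: the final conjecture is correct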
2.4 Project Grammarama
In the late 1950s Miller started "Project Grammarama", a project motivated by the
question of how infants learn a natural language. A natural language is based on a set
of complex rules. A "finite-state grammar" (FSG) is a set of rules for generating
strings of letters. The aim of the project was to discover how people learn the pattern
of a FSG. FSGs can be graphically depicted as a finite number of states (see Figure 1
taken from Reber, 1967). The transition from one state (represented in Figure 1 as a
circle) to the next state produces a letter. A string is generated by entering at state 0,
then each move to another state produces a letter until the final state is reached, and
the string is complete. Any letter strings which can be created in this way are
grammatical or ⊕ strings, and those strings which cannot be produced in this way are
ungrammatical or ⊖ strings. For example, in Figure 1 the ⊕ string VVPS can be
generated, whereas the string VXVPTS is ⊖ as it cannot be produced by this FSG. A
finite-state language is the set of strings which can be generated by a finite-state grammar.
Figure 1: The finite-state artificial grammar created by Reber (1967)
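The mechanics can be sketched as follows (my illustration in Python: the grammar below is a made-up one with three states, not the grammar of Figure 1, although it happens to generate VVPS and to reject VXVPTS):

    import random

    # Each state maps to the (letter, next state) transitions available;
    # None marks the final state.
    GRAMMAR = {0: [("V", 1), ("T", 0)],
               1: [("P", 2), ("V", 1)],
               2: [("S", None)]}

    def generate(grammar, start=0):
        # Produce one grammatical (⊕) string by walking from the start
        # state to the final state, emitting one letter per transition.
        state, letters = start, []
        while state is not None:
            letter, state = random.choice(grammar[state])
            letters.append(letter)
        return "".join(letters)

    def is_grammatical(grammar, string, state=0):
        # A string is ⊕ iff some path through the grammar produces it.
        if state is None:
            return string == ""
        return any(string.startswith(letter)
                   and is_grammatical(grammar, string[1:], next_state)
                   for letter, next_state in grammar[state])

    print(generate(GRAMMAR))                  # e.g. 'TVVPS'
    print(is_grammatical(GRAMMAR, "VVPS"))    # True for this toy grammar
    print(is_grammatical(GRAMMAR, "VXVPTS"))  # False: no X transition exists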
Miller (1958) exposed people to strings generated by a simple FSG to see what they
would do with them. In order to induce the subjects to pay close attention to the
strings when they were presented, Miller decided to ask them to memorise the strings,
which would guarantee they were attending to the strings4. Miller used 18 ⊕ strings (of
no more than seven letters) generated by a FSG using the letters N, S, X, and G, and
18 ⊖ strings of the same letters constructed from a table of random numbers. The
strings were divided into two lists of nine strings for the ⊕ strings (⊕ list), the same
for the ⊖ strings (⊖ list) (four lists in all). There were four pairs of lists: ⊕/⊕, ⊕/⊖,
⊖/⊕, ⊖/⊖. Ss were shown the first list of nine strings at a rate of one string every
five seconds and were then asked to write down, in any order, all the strings they
could recall. Then their response sheet was removed and the next trial started with the
strings in a different order. Ss knew nothing about the rules governing the strings, and
were just asked to read the nine strings and write down as many as they could
remember on each trial. Each subject had ten such trials, and ten more for the second
list. After the tenth trial on the second list, the Ss were asked to write down all the
strings they could remember from the first list.
4 Miller used memorisation because previous work with AGs had also used memorisation (Aborn & Rubenstein, 1952, 1954).
Those Ss who studied a ⊕ list first could remember more strings after the first ten
trials than those who studied a ⊖ list first, with an average of 8.9 compared to 3.79. Ss
who learnt the ⊕/⊕ lists could remember most strings from the second list on the
tenth trial, while the Ss who learnt the ⊕/⊖, ⊖/⊕, or ⊖/⊖ lists hardly differed in their
results and were worse than the ⊕/⊕ Ss. When Ss studied a ⊖ list prior to a ⊕ list,
the variance on the ⊕ list was much greater than when they studied a ⊕ list first. The
number of strings recalled correctly from the first list, after studying the second list,
was on average less than the number recalled after the first ten trials.
The more similar the strings are to each other, the more likely it is that Ss will get
them mixed up (interference) and that more mistakes will be made. However, Miller
found that Ss are better able to memorise the ⊕ strings, which are more similar to one
another, than the ⊖ strings. Thus, Miller concluded that Ss must be grouping and
recoding the ⊕ strings, and thereby avoiding the interference effects one would expect
due to the similarity of the strings to each other. For example, one of the lists
contained the strings: NNSG, NNSXG, NNXSXG, NNXXSG, NNXXSXG, and
NNXXXSG. Given NNSG, these strings can be recoded as 00, 01, 11, 20, 21, and 30,
where the two digits indicate the number of X’s preceding and following the S. If one
groups and recodes the stimuli in such a way, the similarity of the strings to each other
does not interfere with performance.
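Miller's grouping-and-recoding scheme for this list can be made concrete with a short sketch (the recoding is the one just described; only the implementation is mine):

    def recode(string):
        # All six strings share the frame NN...S...G, so each is summarised
        # by the number of X's before and after the S.
        core = string[2:-1]  # drop the common "NN" prefix and "G" suffix
        before, after = core.split("S")
        return f"{before.count('X')}{after.count('X')}"

    strings = ["NNSG", "NNSXG", "NNXSXG", "NNXXSG", "NNXXSXG", "NNXXXSG"]
    print([recode(s) for s in strings])
    # ['00', '01', '11', '20', '21', '30'], as in Miller's example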
Although when Miller first conceived of Project Grammarama he hoped that these
experiments would cast light on the way infants acquire the rules of their native
language, he soon came to realise that there are too many differences between the
experiment and language learning. First, the Ss are adults, not infants. They already
know one or more languages, and a second language is never learnt the same way as
the first. The artificial "language" is visual, not auditory, which accentuates a different
kind of patterning. There is no meaning (no semantics) in FSGs and there is no use for
the language once it has been learnt. There is also the time difference within which
the language is learnt: infants take about two years, whereas the subjects in Project
Grammarama had one to two hours. The infant acquires a sensorimotor skill whereas
the subjects are puzzling over an abstract cognitive pattern. Most important, like
Chomsky, Miller noted that the infants’ task is much more complicated than the
subjects’. Since UG is much more complicated than artificial grammar (AG) rules, the
child must already "know" UG innately and implicitly. These differences led Miller
to conclude that there is almost nothing in common between natural language learning
and artificial language learning.
2.5 Conclusion
Although experiments like Miller's, which are now collectively called artificial
grammar learning (AGL) experiments, were originally motivated by the desire to gain
insight into natural language learning (e.g. Miller, 1958; Reber, 1967), UG is so much
more complex than any of the AGs used that AGL data proved unable to shed much
light on that topic. However, even now that its original ties to the problem of UG have
been dropped, the question of how people learn complex rules and under which
conditions the rules can be learnt still remains of research interest.
3 Artificial Grammar Learning
3.1 The Artificial Grammar Learning Paradigm
The artificial grammar learning (AGL) paradigm was further developed by Reber in
1967. He was interested in the process by which subjects respond to the statistical
nature of a stimulus array. To induce Ss to attend to the stimuli, Reber, like Miller,
asked them to memorise the letter strings. Several subsequent researchers have
continued to use the task of memorisation to ensure Ss are paying attention to the
stimuli. Other researchers have used other tasks and other forms of presentation which
have yielded similar results, e.g. sequential observation of exemplars (Lewis, 1975),
anagram solving (Reber & Lewis, 1977), similarity matching (Mathews et al, 1989),
recognition (Vokey & Brooks, 1992), simultaneous scanning of large numbers of
letter strings (Kassin & Reber, 1979; Reber, Kassin, Lewis, & Cantor, 1980) and
paired-associate learning (Brooks, 1978).
Many AGL experiments use FSGs such as the one depicted in Figure 1 (from Reber,
1967). In a typical AGL experiment subjects are asked to memorise strings of letters,
such as XMXRVM, in a training phase. These letter strings appear to be arbitrary but
actually follow a set of rules. Ss are told they are taking part in a short-term memory
experiment, and do not know anything about the underlying rules governing the
structure of the strings. After the training phase, Ss are informed that the strings they
have been memorising follow rules and are then shown new strings in the testing
phase. Some of these test strings follow the rules and some do not. The task is to
classify which ones follow the rules and which do not. The typical classification
performance of Ss is significantly better than chance, indicating that they have
acquired some knowledge. Most Ss, however, are unaware of the knowledge they
have gained and are unable to verbalise the rules.
This dissociation of performance and reportability led Reber (1967, 1989) to conclude
that Ss are acting on "implicit"5 (i.e. unconscious), abstract knowledge. Reber came to
this conclusion by way of comparison with natural language learning. As with Miller,
Reber's motivation for artificial grammar learning studies arose from the inadequacy
of the traditional learning paradigms (e.g. the stimulus-response approach6) in
explaining the way a child learns a natural language (UG). In Reber's opinion what
learning theories could not account for was the fact that learning seemed to occur
implicitly, without conscious awareness. That is, a child learns to recognise and
generate what does conform to UG and reject what does not, but children (and adults)
are not explicitly aware of UG, hence cannot state the rules they are using (Chomsky,
1968).
5 The term "implicit" also came from the field of language learning: Chomsky argued that knowledge of UG was "implicit knowledge", i.e. UG was both innate and unconscious.

6 Stimulus-response learning is associative learning governed by the forming of a link or bond between a particular stimulus and a specific response.
Reber initially believed that implicit learning was "a rudimentary inductive process
which is intrinsic in such phenomena as language learning and pattern perception"
(1967). However, although Reber's interest in AGL was initially motivated by natural
language learning, with the advances in linguistics, and for reasons Miller already
noted (see section 2.4), he later says, "I view the synthetic language as a convenient
forum for examining implicit cognitive processes. If our explorations of this implicit
domain of mentation turn out to provide insights into the acquisition of natural
languages, that would be most pleasing" (p.30, Reber, 1993). Thus, he became rather
careful about generalising results from AGL experiments to the learning of natural
languages (UG).
3.2 Positive and Negative Instances
A consequence of the initial motivation from natural language learning is that in most
of the AGL literature subjects are presented only with ⊕ (grammatical) strings in the
training phase. In the testing phase Ss then have to distinguish between ⊕
(grammatical) and ⊖ (ungrammatical) strings, although they have never seen ⊖
strings until that point. For proficient learning to occur, it is crucial that both ⊕ and ⊖
instances are available to the learner. In order for the learner to select the relevant
features and discard the irrelevant features of stimuli, he needs to have the opportunity
to see stimuli in which the relevant features are present, as well as stimuli in which the
relevant features are absent. Only then can the learner make a comparison and know
whether a feature is relevant or not. If he does not see ⊕ and ⊖ instances, there is no
way for the learner to know whether the feature he is concentrating on is relevant or
not.
There are only a few studies in which both ⊕ and ⊖ items are used (Brooks, 1978;
Whittlesea & Dorken, 1993; Dienes, Broadbent, & Berry, 1991; and Dienes, Altmann,
Kwan, & Goode, 1995). Since the aim of most of the researchers presented here was
not to examine the effect of ⊖ instances, but to examine whether subjects could
distinguish between two different grammars, the authors have used the technique of
presenting subjects with instances from two different grammars. Instead of using ⊕
and ⊖ instances from the same grammar, a ⊕ instance of one grammar can be seen as
a ⊖ instance of the other grammar and vice versa. This use of two different grammars
invites several problems, which will be discussed later (see sections 5.1.1 and 5.1.2),
after the studies have been presented in detail.
3.2.1 Brooks' (1978) series of studies
Brooks (1978) wanted to demonstrate that, when subjects are encouraged to learn and
use information about individual items, they later draw analogies
between individual items. He conducted a series of experiments using AGs, since this
type of experiment emphasises "the learning of specifics as a basis for the later
generalization of an extremely complicated concept" (Brooks, p. 170, 1978). He
trained subjects on three paired-associate lists, that is, each letter string was associated
with a particular word, for example the letter string VVTRXRR was paired with the
word 'Paris'. Half of the strings were generated from grammar A and half from
grammar B. The letter strings from grammar A were paired with 'Old-World' items
(e.g. Paris, elephant, Rome), the strings from grammar B with 'New-World' items (e.g.
Montreal, possum, Detroit).
In the first experiment Ss knew nothing about the existence of two grammars
generating the strings, nor did they know the relevant categories during training, nor
did they know that there were any general rules to be learnt. The training task was to
respond to the letter string with the correct word, i.e. to memorise which word was
paired with each particular letter string. The responses were divided into two
categories, animals (e.g. elephant, possum) and cities (e.g. Paris, Montreal), which
were, however, unrelated to the underlying categories of Old-World and New-World.
The training ended when Ss completed one trial without mistake for each of the three
lists. Ss were then asked if they had noticed that there were two different types of
letter strings, which none of the Ss had. They were then asked if they had noticed two
different types of response, to which most Ss answered that the responses were all
either cities or animals.
Ss were then told that there was a distinction between Old-World items (associated
with grammar A) and New-World items (associated with Grammar B). They were
given a test stack of 30 cards with novel letter strings printed on them. Ten of the
strings were generated from Grammar A, ten from grammar B, and ten were random
letter strings using the same letters. Ss were required to sort the 30 cards into three
piles and were told that 10 of these should be "Old-World" (the strings generated from
grammar A), 10 should be "New-World" (generated from grammar B) and 10 did not
belong to either category. Although Ss indicated that they did not know what they
were doing and were just guessing, they were able to distinguish the categories from
one another at a level significantly above chance (60 – 64.4% correct performance,
Brooks takes chance level to be 33%).
Brooks conducted a further experiment in which he made the presence of two
different types of stimuli quite obvious by pairing the letter strings from grammar A
with the word "city" and the strings from grammar B with the word "animal". The
experiment was presented as a memory test, and training finished when all three lists
had been learnt. At test, Ss were presented with the same test stack of 30 new letter
strings as the previous group, and were asked to sort them into three piles, animals,
cities, or neither. The results of this group (60.4% – 65.8% correct performance) are
very similar to the results of the previous group. That is, making the presence of two
different categories apparent in the training phase does not affect the way Ss perform
at test.
In a further experiment, Brooks explicitly told Ss that they were to learn to distinguish
letter strings that were generated by two different sets of rules. They were shown the
same training items as the previous groups and were required to categorise each item
into grammar A and grammar B, and were told whether they were right or wrong after
each trial. In the previous experiments the responses were passively paired with the
letter strings, while in this experiment the Ss received corrective feedback after each
response. They continued until they could categorise all three lists correctly, which
happened to be for the same average number of trials as the previous groups. They
were then given the same test stack of 30 new letter strings as the previous groups and
asked to sort them again into the three piles (grammar A, grammar B, or neither).
Their results were between 45.4% and 50.8% correct categorisation. These Ss did not
do as well on generalising to new items as the previous Ss who were not told about
the category distinction. In fact, they performed at almost the same level as a group
who received no training at all, and a group in which the New-World and Old-World
pairings were randomly assigned to the stimuli in training. The group which had
received no training were categorising between 37.1% and 51.1% correctly7, and the
group which had received random-pairing training were performing at levels between
40.4% - 47.9% correct categorisation. These Ss were apparently sorting the cards
according to the structural similarity of items to one another, i.e. items that looked
similar to each other were put into the same pile.
7 In the previous experiments, Brooks takes 33% to be chance level performance, since there are three categories (Grammar A, Grammar B, and neither or random). However, he didn't establish a baseline chance level. The Ss who received no training and were just required to sort the 30 stimuli into three piles were apparently basing their classifications on the structural similarities, and were classifying at a level above 33%, namely between 37.1% and 51.1%. This can and should be taken as baseline chance level performance. (See section 4.3)
At first glance these experiments seem to suggest that subjects are able to distinguish
between the two grammars at levels better than chance only when they are not told
about the existence of two different sets of rules during training. However, Brooks
suggests that these findings are dependent on several things. Firstly, the regularities in
the material are so complex that it is difficult for the subjects to grasp them in any
reasonable amount of time, and "certainly not under the conditions of successive
presentation" (p.175, 1978). He also comments that the stimuli are sufficiently similar
to one another, so that those Ss who know about the existence of rules would have
poor incidental memory for them, i.e. would not be storing individual stimuli, since
they would be concentrating on finding rules. Those subjects who were not informed
of the existence of rules, were doing a paired-associate task, which consists of
memorising the strings. Brooks thus concludes that Ss were "drawing analogies"
between previously memorised individual training stimuli and test stimuli. He
suggests that subjects might simply be thinking along the lines that, for example,
MRMRV looks similar to MRRMRV, and since they had learnt that MRRMRV is
associated with Vancouver which is New-World, they would also call MRMRV a
New-World item solely on this basis, in Brooks's words "by simple analogy". In other
words, subjects who had been told to look for rules were in a bad position for two
reasons; one, the rules were much too complicated for them to deal with under those
circumstances, and two, since they were looking for rules, they had not been
memorising individual items, which could have later helped them in drawing
analogies.
3.2.2 Whittlesea & Dorken's (1993) studies
Whittlesea & Dorken (1993) were interested in investigating whether performance in
implicit learning experiments (e.g. Reber, 1976) reflects automatic abstraction of the
structure underlying stimuli or whether it instead reflects task-dependent encoding of
particular experiences of stimuli.
Their experiment consisted of showing subjects instances of two different FSGs, and
having them process them in a particular way. Ss had to pronounce the letter strings
produced by one grammar (A) while spelling the letter strings from the other grammar
(B) during training. All 40 training exemplars of one grammar were presented in
random order. Then all 40 exemplars of the other grammar were presented in random
order. This procedure was repeated using a different random sequence (within the 40
exemplars of one grammar), for a total of 160 training trials.
Just before the test phase Ss were informed that the items they had spelled earlier
were generated by one set of rules, while the items they had pronounced earlier were
generated by a second set of rules. They were then shown novel items which they
again had either to spell or to pronounce, and then to classify as either ‘Spell’ or
‘Say’. The critical manipulation was that Ss were now required to pronounce half of
the test items belonging to the Spell grammar, and spell the rest, and to spell half the
Say items and say the rest. It was carefully explained to the Ss that the spelling or
saying of the words in the test phase was completely unrelated to whether the item
conformed to the rules of earlier items that had been spelled or pronounced, and that,
in fact, half the items actually belonging to the Say category would have to be spelled
now, and vice versa.
In the test phase, Ss were able to discriminate members of the ‘Say grammar’ from
members of the ‘Spell grammar’ at levels above chance. When the processing
contexts were matched (e.g. items from the Spell grammar were spelled) Ss classified
at a level of 66%, and when the processing contexts were mismatched (e.g. items from
the Say grammar were spelled) they performed at a 61% level. A third of the stimuli
in the test phase were common to both grammars, i.e. they could be created by both
grammars. These items were more likely to be assigned to the grammar that matched
the processing context, i.e. when they were required to spell the item, they were more
likely to assign it to the spell grammar. Therefore Whittlesea & Dorken suggest that
Ss are able to discriminate the two grammars by "feelings of familiarity" induced by
task context. Because processing is more "fluent" when a test experience perceptually
matches representations of prior experiences (Jacoby, Kelley, & Dywan, 1989),
fluency or familiarity could be used to discriminate the grammars. A test string from
the Spell grammar would seem familiar if spelt at test, and would be correctly
classified as belonging to the Spell grammar, whereas if it were pronounced at test it
would feel unfamiliar and it would be correctly categorised as not belonging to the
Say grammar.
In a further experiment, Whittlesea & Dorken presented subjects with ⊖ strings as
well as ⊕ strings from grammar A and B in the test phase. The ⊖ test strings were
created using the same letters as ⊕ strings, but violated both grammars in at least one
sequence rule. After training Ss on the same training set and under the same
conditions as in the previous experiment, Ss were told that some items they were
about to see conformed to the same rules as the items they had just seen, whereas
some didn’t. Their task was to say whether the novel strings were 'good' or 'bad'. As
in the previous experiment, half of the test items were spelled and half were
pronounced.
⊕ items were classified as 'good' with higher probability (p = 0.64) than ⊖ items (p =
0.27), which Whittlesea & Dorken take as a demonstration of sensitivity to the
"Goodness", i.e. grammaticality. ⊕ items were judged 'good' with greater probability
18
when the processing contexts were the same (p = 0.66) than when they were different
(p = 0.49). For example, strings from grammar A that had previously been spelled
were considered 'good' more often when they were again spelled, than when they were
pronounced.
Whittlesea & Dorken argue that this effect of task context (spell or say) demonstrates
a sensitivity to grammaticality that is mediated by representations of the training
items, which have preserved highly specific information about the experience of the
training items. In other words, they suggest that the grammaticality effect is due to Ss
memory of specific information about how they experienced the stimuli in the training
phase. This corresponds to Brooks's interpretation of his data. It seems that here, too,
subjects are simply remembering specific instances of the training items, and basing
their classifications of novel stimuli in the test phase on these previously memorised
instances.
3.2.3 Dienes, Altmann, Kwan, & Goode's (1995) study
Dienes, Altmann, Kwan, & Goode (1995) also conducted a study in which they used
two different grammars. They were investigating the extent to which subjects are
conscious of the knowledge gained in an AGL experiment. In their first experiment
they trained subjects first on one grammar and then on another grammar. Subjects
were given a sheet of paper with strings from grammar A and asked "to study the
strings as carefully as possible". This sheet was replaced by another sheet with
training strings from grammar B. In the subsequent test phase subjects were told that
the order of the letters in the strings conformed to complex rules: one set of rules for
the first sheet and different rules for the second sheet. Subjects then received a test
sheet and were informed that a third of the strings were like the strings on the first
sheet, a third like the strings on the second sheet, and a third were like neither. Half
the subjects were asked to tick only the strings that were like those on the first sheet,
and half the subjects were asked to tick those strings that were like those on the
second sheet. This procedure was used in order to see whether subjects have
intentional control over the knowledge they have gained in the training phase.
Knowledge of an AG could be unconscious in the sense that it is applied in an
obligatory way, regardless of whether the subject wants to apply the knowledge or
not, in which case the subject has no intentional control of the knowledge. Dienes et
al reckon that if the acquired knowledge is unconscious in this sense, subjects would
have difficulties differentiating the two grammars in the test phase. Thus, for example,
if a subject were required to tick strings from the first grammar, the familiarity of
items from the other grammar may interfere and also be ticked. This would indicate
that subjects are using the acquired knowledge contrary to their intentions.
Dienes et al found that subjects performed the task quite successfully, with means of
0.53 to 0.59 correct. The proportion of strings ticked incorrectly was subtracted from
the proportion of strings ticked correctly to give a measure of intentional control. A
score of zero would then indicate no intentional control and a score of 1 would
indicate complete intentional control. They found that the scores were significantly
greater than zero, indicating that subjects had a significant amount of intentional
control.
Dienes et al also noted that participants had virtually no obligatory knowledge
(knowledge not under intentional control). They compared the performance of the
experimental group with a control group, in which Ss had received no training and
were told that the order of letters in some but not all of the strings obeyed a complex
set of rules, and they should tick only those strings that obeyed the rules. A baseline
measure was calculated for the control group for each grammar: the number of ticks
to grammatical strings of one grammar divided by (the number of ticks to
grammatical strings of that grammar plus the number of ticks to ungrammatical
strings). The means were 0.43 for grammar A and 0.50 for grammar B. The
comparison of these baselines to the experimental group's obligatory knowledge
revealed no significant differences. Dienes et al thus conclude that the knowledge
their subjects gained was largely under intentional control, i.e. available to
consciousness.
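The two measures can be written out explicitly as follows (a minimal sketch; the counts are hypothetical illustrations, not Dienes et al's data, and I assume 'correct' ticks are ticks to strings of the target grammar and 'incorrect' ticks are ticks to any other test strings):

    def intentional_control(correct_ticks, n_correct, incorrect_ticks, n_incorrect):
        # Proportion of strings ticked correctly minus proportion ticked
        # incorrectly: 0 = no intentional control, 1 = complete control.
        return correct_ticks / n_correct - incorrect_ticks / n_incorrect

    def baseline(ticks_grammatical, ticks_ungrammatical):
        # Control-group baseline for one grammar: ticks to grammatical
        # strings as a proportion of all ticks.
        return ticks_grammatical / (ticks_grammatical + ticks_ungrammatical)

    print(intentional_control(12, 20, 4, 20))  # 0.4
    print(baseline(10, 13))                    # roughly 0.43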
However, the results can also be explained with an instance memory theory. Subjects
could simply be remembering training exemplars. Since they studied all items from
one grammar first, then all items from the other grammar, it would just be one extra
bit of information to remember which list the item was on.
3.2.4 Dienes, Broadbent, & Berry's (1991) study
Dienes, Broadbent, & Berry (1991) conducted an experiment in which they showed
subjects ⊖ instances of a grammar as well as ⊕ instances. Theirs is the only
experiment which has used ⊕ and ⊖ instances of a grammar instead of two different
grammars. They hypothesised that providing a distinction between ⊕ and ⊖ instances
may induce a strategy that inhibits implicit learning and promotes explicit learning.
They conducted an experiment in which they showed half the subjects only ⊕
(grammatical) examples of a FSG, while the other half saw ⊕ (grammatical) and ⊖
(ungrammatical) instances. In the training phase the ⊕ group was presented with 20 ⊕
items, and the ⊕/⊖ group saw 20 ⊕ and 20 ⊖ items. The ⊕ exemplars were shown in
black, the ⊖ exemplars in red. Subjects were shown each exemplar for 5 s, and the
total set (i.e. 20 items for the ⊕ group, 40 for the ⊕/⊖ group) was presented six times,
in a different random order each time. They were told that it was a simple memory experiment,
and that their task was to "learn and remember as much as possible about all 20 (40)
items".
(It is not stated whether the ⊖ strings were generated at random or by local violations of the rules.)
After the training phase, the ⊕ group was informed that the strings followed a
complex set of rules; the ⊕/⊖ group was told that the black (⊕) instances followed a
complex set of rules, while the red (⊖) exemplars violated these rules in some way. In
the test phase subjects were presented one by one with black exemplars of 25 ⊕ and
25 ⊖ letter strings and asked to classify them as grammatical or non-grammatical.
The 50 exemplars were repeated once in a different random order. They were also
given a free report test, asking how they decided whether an item followed the rules,
and what rules or strategies they used. The results showed that the ⊕ group classified at
65% correct, while the ⊕/⊖ group classified at 60% correct; both groups were above the chance level of 50% correct, and the two groups differed significantly from each other.
Those Ss who received only ⊕ instances were performing better than those who
received ⊕ and ⊖ instances. Despite their lack of confidence, even the ⊕/⊖ group
were classifying the test items at a level above chance. Dienes et al concluded that the
presence of ⊖ instances interfered with Ss’ performance, and impaired both implicit
and explicit learning (if performance is taken as a measure of implicit learning and
free report as a measure of explicit learning). Ss reported that they couldn’t remember
which instances had appeared in black and which had appeared in red in the training
phase, and thus may have made many mistakes.
If subjects are indeed simply memorising specific instances from the training set, as
suggested by Brooks's experiments and Whittlesea & Dorken's experiments, then the
presence of ⊖ instances in this experiment would interfere with performance at test.
Subjects in the ⊕/⊖ group had memorised both ⊕ and ⊖ instances, but not their
differentiation (red or black), since they had not been told what the colours signified.
Subjects in the ⊕ group would have an advantage over the ⊕/⊖ group, since they
would know that all instances that they remembered were ⊕ instances, and wouldn’t
have to also remember if it was shown in red or black. In effect, the presence of
colours is just an additional bit of information to remember.
3.2.5 Conclusion
It seems as if the presence of ⊖ instances in these experiments interferes with learning
rather than aiding it. However, one first has to ask what subjects in
implicit learning experiments are learning in the first place. Reber initially concluded
that subjects were implicitly learning the abstract rules of the grammar. However,
other researchers (e.g. Perruchet & Pacteau, 1990) suggest that subjects are merely
memorising chunks (fragments) of letters. These experiments in which researchers
have used either two grammars or ⊕ and ⊖ instances of a grammar seem to indicate
that subjects are memorising either whole letter strings, or more likely, chunks of
letters, and are later (in the test phase) "drawing analogies" (Brooks, 1978) to these
memorised chunks of letters. The presence of ⊖ instances – whether from a different
grammar, or from the same grammar – interferes with this rote memorisation, and
consequently performance levels drop.
In Brooks's first and second experiment subjects do not know about the existence of
rules in the training phase. Their task is to memorise the letter strings. At test they are
then required to sort the strings, which they can do at a level above chance. Subjects in
the third experiment are told about the existence of rules, and their training task is a
classification task with corrective feedback. Their performance at test is at chance
level. Subjects in the third experiment are trying to find the rules, which are much too
complex to be learnt in the given amount of time. Since they are concentrating on
finding rules, they are not memorising the letter strings, hence their performance in
the test phase is worse. They do not know the rules, and they also have not memorised
the strings. Subjects in the first two experiments have memorised the strings, and can
use that knowledge to classify the novel strings at a level above chance.
In Whittlesea and Dorken's study Ss processed instances from two different grammars
in different ways in the training phase, and were able to distinguish the two grammars
from each other at test. Whittlesea & Dorken interpret their findings as resulting from
the feelings of familiarity caused by the processing context. In other words, they also,
like Brooks, suggest that subjects are remembering specific information from the
training phase.
In the Dienes et al (1995) study Ss are asked to "study the strings as carefully as
possible" in the training phase. They first studied all strings (simultaneously) from
grammar A and then all strings from grammar B. Their test task was to tick strings
from only one of the grammars, e.g. tick strings that are like the strings from the first
list. Ss performed this task quite well. On an instance memory interpretation, Ss just
had to remember which items were on the first list, and which on the second, to be
able to perform at the observed levels.
Dienes et al’s (1991) training task is a memory task. Ss who only received black ⊕
items performed better in the classification test than subjects who saw black ⊕
items and red ⊖ items. Subjects in the ⊕/⊖ group reported that they had difficulties
remembering which items had appeared in black and which in red. If subjects are
using rote memorisation to classify novel items in the test phase, then the memory
load on Ss in the ⊕/⊖ group would be greater than on Ss in the ⊕ group, since they
would not only have to remember the letter strings themselves, but also the colours
that they were presented in. Ss in the ⊕ group were performing better since their
memory load was smaller. The presence of ⊖ instances interfered with the rote
memorisation.
If subjects were learning abstract knowledge, the presence of ⊖ instances in the
training phase should aid learning. However, in these experiments, ⊖ instances seem
to interfere with learning. This is most readily explained by acknowledging that, in
actual fact, AGL experiments are simple rote memory experiments, and not rule
learning experiments at all. In the next chapter, the debate about whether people learn
the abstract rules of the grammar or whether they just learn chunks of letters will be
discussed in further detail, since this is a major controversy in the implicit learning
literature.
4 Rule or Fragment Knowledge?
In the AGL literature there has been much debate about the nature of the knowledge
people gain: is it abstract knowledge about the rules of the grammar (i.e. an abstract
representation of the rules of the grammar) or is it knowledge about fragments of
letters? Reber initially concluded that people are picking up the abstract rules of the
grammar. However, there is considerable evidence (e.g. Perruchet & Pacteau, 1990;
Brooks & Vokey, 1991) that subjects are merely explicitly memorising short
fragments (chunks) of the whole letter string, i.e. two letter bigrams or three letter
trigrams, and classifying novel ⊕ and ⊖ strings based on the familiarity of these novel
strings to the previously seen strings.
4.1 Transfer studies – Evidence for Abstract Knowledge?
Several researchers have claimed that transfer studies in which the letter-set or
modality of the artificial grammar are changed at test provide evidence of abstract rule
acquisition (e.g. Altmann, Dienes, & Goode, 1995; Knowlton & Squire, 1996).
Although the only common factor between the training and test items is the abstract
structure of the rules, it has been shown that subjects can classify test items
accurately. For instance, Altmann et al. (1995) trained one group on letter strings and
a second group on tone sequences, both conforming to the same grammar. At test
subjects were required to classify either strings presented in the same modality or
strings presented in the other modality. They found that Ss could perform accurately
at a level above chance regardless of whether the classification test was in the same or
different modality, whereas control Ss only performed at chance levels. These
findings suggest that Ss had learnt the abstract structure of the rules of the grammar.
However, Redington & Chater (1996) demonstrated that the results could also be
explained by Ss using fragment knowledge of two or three letters to perform at similar
levels at test. Brooks & Vokey (1991) suggested that performance in transfer tests can
be explained by the abstract similarity between training and test items, for example,
the abstract nature of MXVVVM could be seen as similar to BDCCCB.
4.2 Evidence for Fragment Knowledge
The studies mentioned earlier in Chapter 3 seem to provide evidence for a fragment
knowledge account of AGL. Furthermore, Perruchet and Pacteau (1990) trained one
group of subjects on ⊕ (grammatical) letter strings and another group only on the
bigrams used in the strings. A comparison of the performance of these two groups
showed that both groups were able to distinguish between ⊕ and ⊖ test strings
indicating that it suffices to have fragment knowledge to be able to classify the strings
at test.
4.3 Problem with FSGs
One of the problems of this debate is that most AGL experiments have used FSGs.
Shanks, Johnstone, & Staggs (1997) questioned the use of FSGs to investigate
implicit learning, since they do not provide a means of convincingly determining the
contributions of rule and fragment knowledge in classification performance. The
problem with FSGs is that the rule structure dictates legal consecutive letters tied to
particular letter string locations. For example, in the grammar created by Reber
(1967), all grammatical strings start with either TP, TT, VX, or VV. If subjects are
classifying test strings because they know which letters are allowed in the first two
positions, it is not clear what type of knowledge they are using to make this decision.
They could be using rules, such as "all grammatical strings must begin with T or V",
"an initial T must be followed by P or T", and "an initial V must be followed by X or
V". However, they could also be using knowledge about bigrams, such as "all training
string must begin with either TP, TT, VX, or VV. Shanks et al suggest using a
different type of grammar, namely a biconditional grammar, that allows the
unconfounding (i.e. the separation) of rule and fragment knowledge.
4.4 Biconditional Grammars
Mathews et al (1989, Exp. 4, see also section 5.2.1.1) created a biconditional grammar
(BCG), which is based on biconditional rules. From prior research (e.g. Bourne,
1970) it is known that people are capable of explicitly generating simple logical rules
such as biconditional rules. A biconditional rule is a relationship between two
propositions when one is true only if the other is true, i.e. 'p if and only if q'. The
BCG Mathews et al created was based on three letter correspondence rules specifying
which letters must occur in corresponding positions in the left and right halves of the
string. The grammar generates strings of 8 letters separated by a dot. The three rules
are T with X, P with C, and V with S. An example of a correct string could be
TPPV.XCCS, while VSXX.SVCT would be incorrect since for it to be correct there
would have to be a T in the seventh position.
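To make the biconditional structure concrete, the following sketch (in Python; the
rule set is the one just described, but the function itself is purely illustrative and not
taken from any of the cited studies) checks whether an eight-letter string is
grammatical:

    # Minimal sketch of biconditional-grammar checking. RULES encodes
    # Mathews et al's pairings T<->X, P<->C, V<->S; the dot separating
    # the two halves of a stimulus is stripped before checking.
    RULES = {'T': 'X', 'X': 'T', 'P': 'C', 'C': 'P', 'V': 'S', 'S': 'V'}

    def is_grammatical(string):
        letters = string.replace('.', '')
        # Each of positions 1-4 must correspond to positions 5-8.
        return all(RULES[letters[i]] == letters[i + 4] for i in range(4))

    print(is_grammatical('TPPV.XCCS'))  # True, as in the example above
    print(is_grammatical('VSXX.SVCT'))  # False: position 7 should be T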
4.4.1. St. John's (1996) study
St. John (1996) conducted a study investigating whether subjects are limited to
learning contiguous regularities (as in strings created by FSGs) or whether they can
also learn non-contiguous regularities (as in strings created by BCGs) in implicit
learning conditions (i.e. with hide instructions). In his experiment there were three
different rules: the intervening-0 rule which placed contingent letter pairs in adjacent
positions (i.e. 1a,1b,2a,2b,3a,3b,4a,4b); the intervening-1 rule interleaved the
contingent letter pairs so there was one irrelevant letter between each contingent pair
(i.e. 1a,2a,1b,2b,3a,4a,3b,4b); and the intervening-3 rule, which placed the first letter
of each contingent pair in the first half of the string, and the second letter of each pair
in the second half of the string (i.e. 1a,2a,3a,4a,1b,2b,3b,4b). This intervening-3 rule
is very similar to the BCG created by Mathews et al. The only difference is that in
Mathews et al's study the letters could appear in all eight positions, whereas in this
study the letters were permitted only in one half, e.g. with the rule S <-> B, only S
could be in the first half, and B in the second half, not the other way around. Three
groups of subjects were trained on the three rules. They were told to silently read the
letters from left to right, and once the string disappeared from view, to again silently
repeat the string from memory. They were told that they would later be tested for their
memory of the strings. There were also three additional groups of subjects (one for
each rule) who received no training, but only did the classification test. These subjects
acted as control subjects for any learning that might occur during the classification
test itself. In the classification test, subjects were told about the existence of
underlying rules, but not what the rules were, and asked to classify new strings
according to whether or not they were 'acceptable'. All subjects received feedback for
their incorrect responses.
St. John found that the intervening-0 rule was learnt well, but the intervening-1 and
intervening-3 rules showed only marginal learning. Since the intervening-1 rule was
almost as difficult to learn as the intervening-3 rule, St. John concludes that subjects
cannot learn about non-adjacent dependencies in implicit learning conditions. He
further supported this conclusion with a second experiment, in which subjects were
instructed to encode (memorise) the strings in two different ways, either normal left to
right encoding, or alternating back and forth through the string: from first letter to
fifth letter to second letter to sixth letter etc. He hypothesized that if it were the left to
right linguistic properties or the physical order of the letters in the stimuli that
determine adjacency, the intervening-3 grammar should be difficult to learn under
either encoding regime. If it is the encoding order that determines adjacency, then the
alternation encoding regime will make the pairs adjacent and thereby permit good
learning. He found that when subjects memorised the strings by alternating back and
forth through the string, they were able to learn the intervening-3 grammar,
confirming that adjacency is determined by the order in which the letters are encoded,
rather than the physical order of the stimuli. In other words, it is unlikely that subjects
will implicitly learn the rules of a BCG that has at least one intervening letter between
the rule-related positions.
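The three orderings are easiest to see written out programmatically. The sketch below
(Python; the letter pairs are hypothetical placeholders, not St. John's actual stimuli)
arranges four contingent pairs into an eight-letter string under each rule:

    # Sketch of St. John's three position orderings. Each entry names
    # which member of which contingent pair occupies each position.
    ORDERS = {
        'intervening-0': ['1a', '1b', '2a', '2b', '3a', '3b', '4a', '4b'],
        'intervening-1': ['1a', '2a', '1b', '2b', '3a', '4a', '3b', '4b'],
        'intervening-3': ['1a', '2a', '3a', '4a', '1b', '2b', '3b', '4b'],
    }

    def build_string(pairs, order):
        # pairs: four (a, b) tuples; order: a key of ORDERS.
        slots = {}
        for i, (a, b) in enumerate(pairs, start=1):
            slots[str(i) + 'a'], slots[str(i) + 'b'] = a, b
        return ''.join(slots[pos] for pos in ORDERS[order])

    pairs = [('S', 'B'), ('T', 'X'), ('P', 'C'), ('V', 'K')]  # hypothetical
    for name in ORDERS:
        print(name, build_string(pairs, name))
    # intervening-0: SBTXPCVK; intervening-1: STBXPVCK; intervening-3: STPVBXCK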
4.4.2 Advantages of the BCG over FSGs
The BCG created by Mathews et al has several advantages over FSGs. In a grammar
based on these biconditional rules, exemplars have a lower level of similarity than
exemplars from a FSG because there is greater variation in letters across different
positions in correct strings. There are, for example, no constraints on which letters can
occur at the beginning or end of a string, and the letters can occur in any of the eight
positions (as long as the corresponding letter is in the appropriate position). There are
three intervening letters between the rule-related positions, which make it possible to
separate (unconfound) rule and fragment knowledge by creating four different types
of test strings: Grammatical and high familiarity (gh), grammatical and low
familiarity (gl), ungrammatical and high familiarity (uh) and ungrammatical and low
familiarity (ul). Grammatical strings are those strings that follow the same rules as
the training strings. Ungrammatical strings violate these rules in either one or two
letters. Strings with high familiarity are test strings in which the strings seem more
familiar due to the frequency with which the chunks have already appeared in the
training strings, whereas strings with low familiarity are test strings in which the
strings seem less familiar due to the chunks not appearing frequently (if at all) in the
training strings. This measure of familiarity is based on the competitive chunking
model of Servan-Schreiber and Anderson (1990, see Appendix A) and is generally
referred to as associative chunk strength (ACS). ACS is a measure of the frequency
with which fragments of two letters (bigrams) and three letters (trigrams) in the test
strings appeared already in the training strings. For example, the trigram GKX could
appear for the first time in the test strings and never have appeared in the training
strings, whereas the trigram KDF could have appeared three times in the training
strings already, in which case a string containing KDF would be more familiar than a
string containing GKX (but see section 6.2.2.3 for exact details on how to calculate
ACS).
4.4.3 Associative Chunk Strength
Servan-Schreiber and Anderson (1990) base their theory on the assumption that the
way we learn about regular stimulus fields like AGs is by sorting the material into
some sort of hierarchy of "chunks", i.e. into contiguous substrings. There is abundant
evidence that people faced with the task of memorising meaningless strings of letters
will break the stimuli into chunks. For example, the 26 letters of the alphabet seem to
be encoded into seven chunks: abcd, efg, hijk, lmnop, qrst, uvw, and xyz (Klahr,
Chase, & Lovelace, 1983).[8]
They studied whether the learning process in AGL experiments is chunking and
whether the grammatical knowledge is implicitly encoded in a hierarchical network of
chunks. Servan-Schreiber and Anderson considered a chunk to be a single letter, a
pair (bigram) or a triplet (trigram) of adjacent letters. They predicted that the
probability that a string would be classified as grammatical increases with the
familiarity of a string, given the network of chunks acquired during the memorisation
task. Thus, the more of its parts that consist of already existing chunks, the more
likely it will be judged to be grammatical.
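Their full competitive chunking model is described in Appendix A. As a rough
intuition only, familiarity under a given chunk network can be approximated as the
proportion of a string's parts that match already-acquired chunks; the sketch below
implements this crude approximation (with a hypothetical chunk set), not the actual
model:

    # Crude approximation of chunk-based familiarity: the fraction of a
    # string's bigrams and trigrams that already exist as chunks. The
    # real competitive chunking model builds a hierarchical network of
    # chunks and lets them compete; this sketch ignores all of that.
    def substrings(s, n):
        return [s[i:i + n] for i in range(len(s) - n + 1)]

    def familiarity(string, known_chunks):
        parts = substrings(string, 2) + substrings(string, 3)
        return sum(p in known_chunks for p in parts) / len(parts)

    chunks = {'PP', 'TX', 'VS', 'PPT'}        # hypothetical chunk set
    print(familiarity('TPPTXVS', chunks))     # higher = more familiar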
4.4.3.1 Servan-Schreiber & Anderson's study
In their study, they controlled which chunks subjects created during the memorisation
task (training phase) by inserting spaces in between chunks, e.g. T PP TX X VS.
There were four groups of subjects, two groups saw well structured strings, one group
saw badly structured strings, and one group saw unstructured strings. The training
task was to memorise the strings. After they had completed the training task, Ss were
told that the strings they had just memorised were all examples of "good strings", and
that they would now be asked to judge whether the next new strings were "good" or
"bad". The test strings were presented in an unstructured way. There were five
different types of "bad" strings: two preserved the well-structured chunks, and three
did not. Those "bad" strings that preserved the well-structured chunks were "bad"
because they violated the chunk order constraints of the grammar. Those strings that
did not preserve the chunks were either completely randomly generated strings using
the same letters, or they were made ungrammatical by changing one of the letters. Ss in
the well-structured condition rejected those strings that did not preserve the chunks
more often (88.73%) than those strings that did preserve the chunks (64.3%). Thus,
the study seems to indicate that subjects do make use of chunking, and Servan-Schreiber
and Anderson conclude that subjects base their classification on an overall
familiarity measure rather than abstract knowledge about the rules of the grammar.
[8] However, since mobile phones are very widely used by the public and the letters of the alphabet are
divided into different chunks on mobile phones (the letters 'abc' are on one button, 'def' on the next,
etc.), the alphabet may now be encoded into different chunks.
Servan-Schreiber and Anderson formulated a theory of competitive chunking (CC)
based on the results of their study. They also report two simulations of experimental
data providing evidence for their theory of competitive chunking (see Appendix A for
detailed description of theory and simulation).
4.5 Summary
The major controversy in the AGL field about whether subjects learn the abstract
rules of the grammar, or whether they learn fragments of letter strings has produced a
lot of research. However, most of the studies have used FSGs, the nature of which
makes it difficult to disentangle the roles of abstract rule knowledge and fragment
knowledge.
Using a BCG enables the differentiation between people who are classifying novel
stimuli in the test phase of an AGL experiment according to the abstract rules of the
grammar (grammaticality), and people who are classifying the test strings according
to the ACS of test exemplars to training exemplars (familiarity). By using BCGs
different types of test strings can be created which have either high or low familiarity
to the training strings. In this way rule and fragment knowledge can be unconfounded
and a clearer picture of what is being learnt may emerge.
5 The variables that need to be controlled and tested
The methods and designs of the cited studies all differ considerably from one another
in several respects. This is partly due to the diverse questions the experimenters were
investigating, and partly due to the vast amount of research on AGL where differences
are bound to occur. These differences in methodology and design make it difficult to
make any conclusive comparisons among the studies and thus need to be controlled.
There are several variables that need to be controlled:
The grammar used (see section 5.1.1);
The degree of difference between strings (see section 5.1.2);
The number of trials and presentation method (see section 5.1.3);
The chance baseline (see section 5.1.4).
These factors may all have influenced the behaviour in different ways and thus need
to be controlled for. There are several other variables that differ in the studies, and in
order to investigate what has been learnt and what effect the factors have on
performance, these variables need to be systematically tested:
The instructions subjects receive (see section 5.2.1);
The task subjects do in the training phase and the feedback subjects receive
about their responses (see section 5.2.2);
The percentages of ⊕ vs. ⊖ instances (see section 5.2.3).
5.1 The variables that need to be controlled
5.1.1 Grammar
A major difference among the studies described is the AGs used. Whittlesea &
Dorken created two FSGs consisting of vowels as well as consonants, since their
strings had to be pronounceable. The grammars in all other studies consisted only of
consonants.
Brooks created strings from two FSGs similar to the ones Reber used in 1969.[9] Dienes
et al (1991) used the same grammar as the one used by Dulany et al (1984), Perruchet
& Pacteau (1990) and Reber & Allen (1978). Dienes et al (1995) used the two
grammars created by Reber (1969). These grammars all consist of different letters of
the alphabet and have varying numbers of rules. Thus, the grammars all differ in
terms of difficulty. Some of the AGs used may be more difficult to learn than others,
and this needs to be known to be able to compare results of different experiments. In
order to measure the degree of difficulty it is necessary to know how many trials are
needed for 100% correct performance, i.e. one needs to know the learnability of the
grammars.
[9] Both of the grammars that Brooks used are the same as the ones Reber (1969) used, except that
Brooks's are missing the final node. Brooks deleted this final node because he did not want all the
instances from each grammar ending in the same letter, as was the case in Reber's grammars. One
side-effect of this deletion is that the letter X is associated only with Grammar A. Brooks mentions,
however, that none of his subjects reported having noticed this, even when, in an experiment not
reported here, they were asked for their reasons for categorisation after every test trial.
5.1.2 Degree of difference between strings
In three of the four cited studies, research has been conducted using two different
grammars. A grammatical (⊕) instance of one grammar is then seen as an
ungrammatical (⊖) instance of a second grammar. The only study in which only one
grammar was used is the Dienes et al (1991) study. The use of one or two grammars
changes the degree of difference between the ⊕ and ⊖ strings. In the Dienes et al
(1991) study the ⊖ strings were created by substituting a grammatical element with an
ungrammatical element, so the ⊖ strings differ from the ⊕ strings in only one letter.
In the studies using two grammars, the ⊖ strings are created from a whole different
set of rules, which results in the ⊕ and ⊖ strings being more different from one
another. It is hard to compare the different experimental findings when the degree of
difference between the ⊕ and ⊖ instances differs from study to study, and this needs
to be controlled.
In addition, the degree of difference between ⊕ strings has not been noted, nor has the
difference between ⊖ strings. For example, in some studies, ⊖ strings were created
from random letters, whereas in other studies ⊖ strings were created by substituting
one or two correct letters for incorrect letters. Therefore, random ⊖ strings are more
different from the ⊕ strings and more different among one another, than ⊖ strings that
differ in only one or two letters from the ⊕ strings. The degree of difference between
all strings (between ⊕ and ⊖, between ⊕ and ⊕, and between ⊖ and ⊖) needs to be
controlled and needs to be the same.
5.1.3 Number of trials and presentation method
The number of trials in the training phase differs from study to study. Moreover, not
only does the number of times subjects see a specific exemplar and the length of time
Ss see the exemplars vary, but also the number of exemplars Ss see in total, and the
manner in which the exemplars are presented.
In Dienes et al (1995) and Whittlesea and Dorken's experiments the strings from each
individual grammar were presented together (either one after the other or
simultaneously), whereas in Brooks and Dienes et al's (1991) experiments items from
both grammars were mixed and shown randomly.
Subjects in Brooks's study saw 15 items from each grammar, in Whittlesea & Dorken's
they saw 40 items from each grammar, in Dienes et al (1995) subjects saw 32 items
from each grammar, and in Dienes et al (1991) subjects were presented with 20 ⊕ and
20 ⊖ items. In all studies the strings were shown for varying amounts of time: in some
studies it was a fixed amount of time; in others subjects had to reach a criterion.
These variations in presentation time, style, and number may have affected
performance in the different studies, and need to be controlled for. The number of
trials and the length of exposure should be relative to the learnability of the AG, i.e.
dependent on how many trials it takes to learn the grammar to 100%.
5.1.4 Chance baseline
Many of the researchers (e.g. Brooks, Dienes et al, 1991) did not establish a chance
baseline for their experiments. Brooks, for example, assumed chance level to be 33%
in his experiment, since the stimuli were to be sorted into three different categories.
However, he also gave the task to a group of subjects who had received no training
and were asked to sort the stimuli into three piles, and they were sorting at a level
above 33%, namely between 37.1 and 51.1%. This should be considered chance level
performance. There are many ways in which stimuli can be sorted into categories, e.g.
based on their structural similarities, their colour etc. It needs to be taken into
consideration that chance performance in these experiments does not necessarily
always correspond to the mathematical chance level performance. The chance
baseline needs to be explicitly ascertained by, for example, having an untrained control
group do the task.
Other types of control conditions could also be used to establish this baseline.
5.2 The variables that need to be tested
5.2.1 Instructions
One variable that has to be systematically varied and tested is the instructions subjects
receive, i.e. whether they receive 'hide' instructions, which hide the existence of rules
generating the stimuli; 'seek' instructions, which instruct the subjects to seek the
rules that generate the stimuli; or 'tell' instructions, which tell subjects what the rules
are from the beginning.
5.2.1.1 'Hide' vs. 'Seek' instructions
Several authors have thoroughly studied the effect of 'hide' versus 'seek' instructions
on subjects' performance. Mathews, Buss, Stanley, Blanchard-Fields, Cho, & Druhan
(1989, Exp.3) used a FSG and trained their subjects with either hide or seek
instructions. Subjects who received hide instructions were told that on each trial they
would see a string of letters which they should try to memorise. Then five choices
would appear. Their task was to select the string that was identical to the one they had
previously seen. After each response they were informed which was the correct choice
and the next trial began.
Subjects with seek instructions were told that each presented letter string was a flawed
example of a string generated by a set of rules. Each string would have from one to
four letters that were incorrect. Their task was to mark up to four of the letters as
incorrect and to try to figure out the underlying rules. After each trial the incorrect
letters were indicated and the correct string was displayed. After completing the
training phase (with either hide or seek instructions) all Ss were told that the letter
strings they had seen were generated by a complex set of rules. They were told that
some of the strings they would see in the following test phase were generated by the
same set of rules. Their task on each trial was to pick the string that was generated by
the same rules from a choice of five strings. During testing, they were not told
whether or not they had responded correctly. There were 100 multiple-choice trials in
this test phase, divided into blocks of 10.[10] After each block of 10 trials Ss were
requested to pause and verbalise instructions on how to perform the task to an "unseen
partner" (i.e. so that he or she would be able to perform the task "just like you did".)11.
Ss in both groups performed at above-chance levels with no difference between the
groups. This indicates that having prior knowledge about the existence of rules and
explicitly generating and testing the rules of the grammar (seek instructions) does not
necessarily aid acquisition of the rules. Most of Ss’ verbal instructions for their
"unseen partners" consisted of letter patterns to select. This indicates that subjects in
both groups did not learn the rules of the grammar, but were solely remembering
fragments from the training strings. The verbal instructions were analysed in terms of
the specific trigrams (fragments of three adjacent letters) they told their partners to
select. The trigrams that were noticed most frequently, the 'salient trigrams', were the
initial (first three letters), terminal (last three letters), and repetition trigrams (e.g.
SSS). The mean proportion of trigrams mentioned for the first five blocks (when the
letter set was the same) was quite similar for both groups (0.25 for salient trigrams,
and 0.05 for nonsalient trigrams in the hide group, and 0.19 and 0.03 respectively in
the seek group). This indicates that the verbalisable knowledge of the grammar
acquired in the two different training conditions was quite similar, and again, that
initial knowledge about the existence of rules does not help performance at test.
In a further experiment Mathews et al (1989, Exp.4) devised a different type of AG
based on simple biconditional rules (see Section 4.4 for detailed description of this
grammar). This biconditional grammar (BCG) reduces the level of resemblance
between different exemplars, i.e. the strings are less similar to one another than strings
created by a FSG. The design of the experiment was identical to the previous
experiment but using the BCG instead of the FSG. In this experiment the group who
knew about the existence of rules performed significantly better than the group who
didn't know about the rules in the training phase, which was performing at a level very
close to chance. The mean proportion of Ss who could verbalise the rules was 0.05
for the group with hide instructions and 0.2 for the group with seek instructions.
[10] To measure the generalisability of knowledge acquired during the test phase, the letter set was
changed beginning on trial 51 of the test phase, and from trial 71 onwards, subjects were receiving
feedback about their choices. These manipulations are not relevant to the issue here, which is why they
(and the results) are not mentioned.
[11] These instructions were given to a group of yoked subjects, who attempted to perform the same task
without the benefit of any prior experience or feedback. The relative performance of yoked subjects
versus their experimental partners then provided a direct measure of the extent to which knowledge of
the grammar was communicated verbally to another person.
Mathews et al interpreted these findings as indicative of two distinct learning
processes, one explicit and one implicit. They conclude that implicit learning
processes are capable only of identifying common patterns of resemblance among
exemplars. The BCG was designed to have a limited amount of resemblance among
exemplars, so that high levels of performance would require going beyond patterns of
familiarity. Thus, rote memorisation strategies are less effective with BCGs than with
FSGs.
Shanks and his co-workers (1997, 1999, 2000) conducted some similar experiments.
Their conclusions on the matter of hide versus seek instructions are that the two
different types of instructions induce partially separate learning processes. In their
experiments with strings from a FSG there was no difference in performance between
subjects who received hide instructions and Ss who received seek instructions.
However, when the strings were created from a BCG several Ss with seek instructions
were performing perfectly, while the Ss with hide instructions and those Ss with seek
instructions who did not learn the rules (non-learners) were performing at chance
levels. Shanks et al argue that seek instructions induce a hypothesis-testing strategy
in order to learn the rules. Shanks et al stress the importance of the complexity of the
rules governing the domain. The rule structures of typical FSGs are too complex to be
learnt by hypothesis testing (e.g. Brooks, 1978; Reber, 1976; Reber et al, 1980),
whereas the rules of a BCG can be learnt by hypothesis testing (Mathews et al, 1989,
Exp.4; Shanks et al, 1997, Exp. 4). There are large individual differences in
hypothesis-testing ability. Shanks et al suggest that some subjects may need
preliminary training in hypothesis testing before they can benefit from seek
instructions. Hide instructions may result in the contingencies generated by a rule
being learnt, but they are acquired very slowly over a large number of trials.
In sum, the performance in implicit learning experiments depends largely on the
complexity, and consequently on the learnability of the grammar, which needs to be
tested.
5.2.1.2 'Tell' instructions
When subjects are told the rules in advance, their performance is likely to be 100%
accurate (or very close to 100%), but their response time is likely to be slow at the
beginning. Perhaps reaction times would decrease over time. These hypotheses need to
be tested.
5.2.2 Task and Feedback
The task that subjects perform during training has an effect on the way subjects
perform at test. Whether they are required to give an active response with corrective
feedback,[12] or whether they are just passively exposed to the material affects test
performance. Active responding with feedback is for example when subjects make a
response and are told "Yes, that was a correct response". Passive exposure, on the
other hand, is when subjects are just being exposed to the stimuli. When subjects are
actively engaging in the response, it seems likely that they will be paying more
attention and picking up (learning) more than when they are just exposed to the
stimuli. Passive exposure can be combined with a cue, for example in Dienes et al's
(1991) study the ⊕ stimuli were written in black and the ⊖ stimuli were written in
red. Here the colour was a cue. This extra bit of information may make the
differentiation a bit more salient, especially as colours are quite obvious cues. The cue
can also be more "active", for example in the Whittlesea & Dorken study, subjects
had to spell stimuli from one grammar and say stimuli from the other grammar. They
did not know that the spelled stimuli were from one grammar and the said stimuli
from a different grammar, but the cue made the two items differ from one another. It
seems likely that the more salient the differences between items are (with the most
salient being actually telling subjects what the difference is) the more likely it is that
subjects will actually pick up the rules. This effect of active responding versus passive
exposure and the various types of cues is not clear from the cited studies, and needs to
be methodically tested.
[12] Reber defines feedback in learning as "any information about the correctness or appropriateness of a
response" (Reber, Penguin Dictionary of Psychology, 1985).
5.2.3 ⊕ and ⊖ instances
The percentages of ⊕ and ⊖ instances in the training phase need to be systematically
varied and tested, in order to understand the effect on learning. Once the variables
mentioned above have been controlled, the effect that ⊖ instances have in an AGL
experiment can be tested.
5.3 Summary
In summary, there are several things that need to be controlled in an experiment, or a
series of experiments, and there are several variables that are of research interest and
that can be systematically tested. These variables need to be controlled and tested for
the results to be meaningful and interesting.
6 Experiment 1: Rule or Fragment knowledge?
The first experiment was conducted to examine whether subjects in a standard AGL
experiment (i.e. with hide instructions in the training phase) are classifying test strings
according to their grammaticality (i.e. whether or not they follow the rules) or
according to their familiarity (dependent on the ACS of the test strings to the training
strings). By using the BCG created by Johnstone and Shanks (2001) abstract rule
knowledge and knowledge about the letter chunks can be investigated separately (see
section 4.4).
Another aim of this experiment is to examine whether the knowledge people gain in
AGL is implicit or explicit by examining the relationship between subjects'
performance and their confidence. A major controversy in the area of implicit
learning is whether the knowledge of AGs - be it abstract knowledge or distributional
knowledge - is unconsciously gained or whether it is explicitly learnt (e.g. Perruchet
& Pacteau, 1990; Shanks & St. John, 1994; Dienes & Perner, 1996). The controversy
is partly based on the definition of the dividing line between what is conscious and
what is unconscious. Cheesman & Merikle (1984) suggest that knowledge can be
defined as unconscious if it is below the subjective threshold, i.e. if participants are
unaware that they have the knowledge. They suggest that two criteria determine a lack
of this "metaknowledge" (knowing that they have the knowledge): the zero
correlation criterion showing a lack of correlation between accurate performance and
confidence (Chan, 1992), and the guessing criterion showing accurate performance
when participants think they are guessing (Cheesman & Merikle, 1984, 1986; Dienes
& Altmann, 1997). Chan (1992) observed that participants were just as confident in
their incorrect responses as they were for their correct responses, although their
performance showed high accuracy. This reveals that they had no metaknowledge of
their knowledge. In a subliminal perception experiment, Cheesman & Merikle (1984,
1986) found that although participants claimed they were guessing they were actually
performing at a level above chance indicating that they didn’t know that they
possessed the knowledge.
Dienes & Altmann (1997) conducted a study using Reber’s (1969) FSG testing for the
zero correlation and the guessing criteria. They found evidence for the guessing
criterion, but none for the zero correlation criterion. Their data showed that when
participants were performing accurately they were more confident, indicating that
they had some awareness of their knowledge (metaknowledge). However, their results
showed a mean percentage of correct classifications of 63% when participants
believed they were guessing, indicating that there was some implicit knowledge
involved. It is not clear whether the participants learnt the abstract rules of the
grammar and were responding according to these rules, or whether they were
classifying strings according to the familiarity of letter chunks in the strings. By using
the BCG it will become more clear whether the knowledge involved is abstract rule
knowledge or distributional fragment knowledge, and testing for the zero correlation
and guessing criteria will show whether the knowledge gained is implicit or explicit.
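Both criteria reduce to simple computations over pairs of (response correct,
confidence rating). The sketch below (Python, on made-up data; not the analysis
code used in this thesis) illustrates how each criterion would be evaluated:

    # Illustrative sketch of the two awareness criteria on made-up data.
    # Each trial is (correct: bool, confidence: int between 50 and 100).
    from statistics import correlation  # available from Python 3.10

    trials = [(True, 50), (False, 50), (True, 80), (True, 90), (False, 60)]

    # Guessing criterion: accuracy on trials where subjects claim to be
    # guessing (confidence == 50); above-chance accuracy here would
    # indicate implicit knowledge.
    guesses = [correct for correct, conf in trials if conf == 50]
    print('guessing-criterion accuracy:', sum(guesses) / len(guesses))

    # Zero correlation criterion: no correlation between accuracy and
    # confidence would indicate a lack of metaknowledge.
    accuracy = [float(correct) for correct, _ in trials]
    confidence = [float(conf) for _, conf in trials]
    print('accuracy-confidence r:', correlation(accuracy, confidence))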
6.1 Method
6.1.1 Subjects
Sixteen voluntary subjects from the University of Southampton took part in the
experiment; eight were assigned to the experimental group, and eight to the control
group. There were eight males and eight females with ages ranging from 20 to 50
(mean 28.31).
6.1.2 Design
All strings were created from the letters D, F, G, L, K, and X. As explained earlier,
the grammar produces strings of 8 letters governed by three rules controlling the
relationship between letters in positions 1 and 5, 2 and 6, 3 and 7, and 4 and 8. One
possible set of rules could be D<->F, G<->L, K<->X, so when the letter D occupies one
position, the letter F must appear in the corresponding position; where the letter G
appears, the letter L must appear in the corresponding position; and when a position
contains a K, the corresponding position must contain an X, for the string to be
grammatical (i.e. ⊕).
Each letter could appear in any of the eight positions. ⊕ strings consisted of four valid
linkages, while ⊖ strings consisted of three valid linkages and one invalid linkage.
That means that the ⊖ strings differ from the ⊕ ones in one letter.
Fifteen possible sets of three letter-pair rules can be created from the letters D, F, G,
L, K, and X. Each subject in each group was presented with a different one of the 15
possible rule sets.
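The figure of 15 follows from counting the ways six letters can be partitioned into
three unordered pairs (5 × 3 × 1 = 15). A short enumeration sketch (Python;
illustrative only) confirms this:

    # Enumerate every way of pairing six letters into three
    # biconditional rules: 5 partners for the first letter, then
    # 3 for the next unpaired letter, then 1, giving 15 rule sets.
    def pairings(letters):
        if not letters:
            yield []
            return
        first, rest = letters[0], letters[1:]
        for partner in rest:
            remaining = [l for l in rest if l != partner]
            for tail in pairings(remaining):
                yield [(first, partner)] + tail

    rule_sets = list(pairings(['D', 'F', 'G', 'L', 'K', 'X']))
    print(len(rule_sets))   # 15
    print(rule_sets[0])     # [('D', 'F'), ('G', 'L'), ('K', 'X')]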
Thirty-six ⊕ training strings were created for the experimental group using a subset of 18 of
the 36 possible two-letter fragments (bigrams) that can be generated from the letters
D, F, G, L, K, and X. The 36 ⊖ training strings created for the control group contain
all possible bigrams of the six letters without using double letter bigrams (e.g. LL).
The two violation strings created for each trial of the training task (for the
experimental and control groups) differ from the rehearsal string (i.e. the string that is
to be memorised and chosen from the choice of three) in one and two letters
respectively. The two
violation strings for the experimental group were constructed so that they comprise
the same subset of bigrams and trigrams as the ⊕ training strings. Each letter was
evenly distributed across each of the eight possible positions in the training set for
both the control and the experimental group.
The set of test strings consisted of 48 new ⊕ strings and 48 new ⊖ strings. Of these
strings, half were familiar relative to the training strings presented to the experimental
group and half were unfamiliar. The familiarity measure used was associative chunk
strength (ACS) as defined by Servan-Schreiber & Anderson (see section 4.4.3). Thus
four different types of test string (12 strings for each type) were generated:
grammatical (⊕) and high familiarity (gh), grammatical (⊕) and low familiarity (gl),
ungrammatical (⊖) and high familiarity (uh), and ungrammatical (⊖) and low
familiarity (ul). There was no difference between familiarity for ⊕ versus ⊖ test
strings for either group.
The control group was presented with the same test strings as the experimental group.
However, the control group's training strings and the test strings have no
grammaticality or familiarity relationship.
6.1.3 Apparatus
This experiment used a computer program written in Java by Jorge Basto as a third-year
undergraduate project. The experiment runs in a Web browser
(http://www.ecs.soton.ac.uk/~mtj00r/ApplicationClass.php) and everyone with Web
access can take part in the experiment. This ensures that the subject database is
distributed geographically, and that it is more representative of the larger population
than traditional psychology experiments. Since the experiment can be completed by
anyone equipped with the URL and with access to the Internet, the subject sample
could potentially come from any background, rather than the usual student subject
sample when using more traditional methods of collecting data (Krantz & Dalal,
2000). The program records all responses made by each subject in a central
repository.
6.1.4 Procedure
Both the experimental and the control group were told that the experiment was testing
their short-term memory for strings such as GFXK.LDKX. The training phase
consisted of 72 trials (each set of 36 strings presented twice). On each trial a string
appeared on the screen for five seconds, which the subject was asked to memorise.
Then there was an interval of two seconds after which a list of three strings appeared.
One of the strings was the string presented before; the other two strings were ⊖
violation strings differing from the training string in either one or two letters. The
subjects' task was to choose the string that they had previously seen. They were told
whether their choice was correct, and if they did not choose correctly, the correct
string was shown again.
After this training phase Ss were told that the strings they had been memorising
followed a particular set of rules. In actual fact, only the experimental group had seen
strings that followed a rule, while the control group had seen random strings.
The test phase consisted of 96 trials in which the Ss were asked to decide whether
new strings, which they had not seen before, followed the same rules or not. After they
had decided whether it followed the rules or not (by choosing 'Yes' or 'No'), they were
asked to indicate how confident they were that their response was correct on a scale
from 50% (complete guess) to 100% (completely sure).
6.2 Results and Discussion
6.2.1 Training phase
First, the data from the training phase were analysed. The mean reaction time of the
control group was 577.03msec, and the mean reaction time of the experimental group
was 431.6msec. Reaction times of longer than 4000msec were excluded as well as
reaction times of 0msec, since these reaction times seem to indicate that the subject
was distracted in the case of a reaction time of over 4 seconds,[13] or that the program
failed and did not record the data in the case of 0.
An independent samples t-test was not significant, t(14) = -2.086, indicating that the
two groups were not performing differently on reaction time. However, the p-value
was 0.056, so only marginally non-significant.
reaction times, the experimental group seems to have been slightly faster. It has been
found in several other studies that stimuli that are structured are easier to learn than
patterns that are unstructured (e.g. Miller, 1958), which may account for the slightly
faster responding.
In the training phase Ss were presented with a letter string, and after an interval of two
seconds they were requested to choose the string that they had seen before from a list
of three strings. Responses were counted as correct if the subject chose the same
string as they had seen before. The mean percentage of correct responses was 90.8%
for the control group and 91.15% for the experimental group. An independent samples
t-test was not significant, t(14)=0.127, indicating that the control and experimental
group were not performing differently on the memory task.
6.2.2 Test phase
The following analyses focus on the responses made in the test phase. The mean
reaction time of the control group was 612.5msec, and the mean reaction time of the
experimental group was 544.6msec (again excluding reaction times over 4000msec
and 0msec). An independent samples t-test showed no significant difference, t(14)= 0.689, indicating that the two groups were not performing differently on reaction time.
The mean percentages of test strings classified as 'Yes, it follows the same rules' were
analysed for each of the four types of test items: grammatical and high familiarity
(gh), grammatical and low familiarity (gl), ungrammatical and high familiarity (uh)
and ungrammatical and low familiarity (ul). These mean percentages are shown in
Figure 2.
[13] RTs were mostly either below 4 seconds or much longer than 4 seconds, which is why 4 seconds was
chosen as the cut-off point.
[Figure 2 appears here: bar chart of the mean percentage of test strings classified as
grammatical (y-axis '% classified as grammatical', 0-70) for each type of string (gh,
gl, uh, ul), plotted separately for the exp group and ctr group.]
Figure 2: Mean percentage of test strings classified as grammatical ("Yes, it follows the rules")
In order to test whether subjects are classifying test strings based on their
grammaticality or on the familiarity of the test strings, a repeated measures ANOVA
with group as between-subjects variable, and grammaticality and familiarity as
within-subjects variables was carried out. There was a significant effect of familiarity,
F(1,14)=10.719. This suggests that Ss were classifying the test strings based on their
familiarity to the training strings. There was also a significant effect of the interaction
between familiarity and group, F(1,14)=5.924, showing that the two groups were
responding differently according to familiarity.
There was no significant effect of grammaticality, F(1, 14)=1.247, indicating that Ss
were not basing their classification on the grammaticality of the strings. Neither was
there a significant effect of group, F(1, 14)=1.095, showing that the experimental and
the control group were not classifying differently according to grammaticality.
These results suggest that subjects in the experimental group were classifying the test
strings according to the familiarity of the test strings to the training strings, and not
according to whether or not they follow the rules of the grammar (grammaticality).
This finding supports the notion that people simply learn the distributional statistics of
fragments, not abstract rules.
Since subjects seem to be classifying the strings according to the familiarity to
training strings, rather than the grammaticality of the strings, we also looked at the
data from the point of view of familiarity. Therefore, when subjects classified gh and
uh strings as following the rules (i.e. they selected "Yes, it follows the rules") or gl
and ul as not following the rules ("No, it doesn’t follow the rules") it was considered
to be correctly classified according to familiarity.
6.2.2.1 The guessing criterion
To examine whether subjects' knowledge was implicit (unconscious) or explicit
(conscious), we calculated the mean percentages of correct classifications according
to grammaticality when Ss thought they were guessing (i.e. they gave a confidence
rating of 50). The experimental group classified 46.00% correctly, and the
control group's mean percentage was 41.55%. When classifying correctly
according to familiarity, the mean percentage of "correct classification" was
56.82% for the experimental group and 52.26% for the control group. These data
are also shown in Table 1.
                     Experimental group   Control group
Grammaticality       46.00                41.55
Familiarity          56.82                52.26
Table 1: Mean percentages of correct classifications when subjects thought they were guessing (the
guessing criterion)
A chi-square test on the mean percentages of correct classifications according to
grammaticality and familiarity when Ss thought they were guessing (the guessing
criterion) showed no significant effect, χ2 = 2.7866, indicating that subjects were not
classifying better than chance on either grammaticality or familiarity. In other words,
when subjects thought they were guessing and gave a confidence value of 50, they
really were guessing.
6.2.2.2 The zero correlation criterion
The confidence ratings were divided into 6 groups: the first group included all
responses of 50% (i.e. when the Ss thought they were guessing), the second group
comprised those ratings between 51 and 60, the third the ratings between 61 and 70
and so on. The following figures show the mean percentages of correct responses at
test for each level of confidence. Figure 4 shows the mean percentages of correct
responses according to grammaticality.
[Figure 4 appears here: mean percentage of correct responses (y-axis '% correct',
20-80) at each confidence level (50, 51-60, 61-70, 71-80, 81-90, 91-100), plotted
separately for the exp group and control group.]
Figure 4: Mean percentages of correct responses according to grammaticality
Accuracy and confidence were not correlated. To test the zero correlation criterion, a
repeated measures ANOVA with group as between-subjects variable and the six
levels of confidence as within-subjects variable was conducted. There were no
significant effects when classifying correctly according to grammaticality.
[Figure 5 appears here: mean percentage of correct responses (y-axis '% correct',
20-90) at each confidence level (50, 51-60, 61-70, 71-80, 81-90, 91-100), plotted
separately for the exp group and control group.]
Figure 5: Mean percentage of correct responses according to familiarity
Figure 5 shows the mean percentages of responses according to familiarity. Accuracy
according to familiarity and confidence were positively correlated for the
experimental group (r = 0.100, p<0.05). From Figure 5 it can be seen that the
experimental group are performing better than the control group at each level of
confidence.
These results show that confidence was not correlated with performance (and the zero
correlation criterion has been met). However, confidence is correlated with
familiarity. Inspection of Figure 5 shows that when subjects in the experimental group
are most confident (confidence level of 91-100) they are classifying 86.5% of the
strings according to their familiarity. If one takes the zero correlation criterion as a
measure of unconscious knowledge, these results suggest that subjects in the
experimental group have some metaknowledge about their performance.
Further results support the claim that confidence is correlated with familiarity: When
classifying correctly according to familiarity there was a significant linear trend
across confidence levels, and a significant interaction between this linear trend and
group. These results indicate that the groups are performing
differently, and that there is a linear trend between the mean percentages of correct
classifications (according to familiarity) and confidence level. In other words, when
subjects were less confident and gave a lower confidence rating, the mean percentage
of correct responses is also lower, and when they were more confident about their
responses and gave higher confidence ratings, the mean percentage of correct
responses was also higher. This linear trend can be seen in Figure 5 in the
experimental group's data.
6.2.2.3 ACS analyses
The following set of analyses was conducted on the data as a function of ACS. ACS
was calculated on the basis of the theoretical perspective on chunking presented by
Servan-Schreiber and Anderson (1990, see Appendix A). Two measures of ACS were
calculated: Global ACS for all fragments in a test string, and Anchor ACS for the
initial and terminal fragments within each test string. Anchor ACS was calculated
because prior research has shown that subjects are especially sensitive to 'extremities',
i.e. beginnings and ends of strings (e.g. Reber, 1967; Servan-Schreiber & Anderson,
1990). Global ACS was calculated by breaking each test string down into its
constituent bigrams and trigrams and then calculating how many times each fragment
had occurred in any location in the training items, and then dividing the totals by the
number of fragments (i.e. 7 bigrams and 6 trigrams) (Johnstone & Shanks, 2001). For
example, the test string LFGK.GDLX can be broken down into the bigrams LF, FG,
GK, KG, GD, DL, and LX, and the trigrams LFG, FGK, GKG, KGD, GDL, and
DLX. For example, the bigram LF could have appeared 4 times in the training strings,
and the bigram FG 3 times, etc., so the global ACS would be calculated as follows:
(((4 + 3 + 5 + 3 + 4 + 2 + 5) / 7) + ((0 + 1 + 0 + 0 + 0 + 1) / 6)) / 2 = (3.71 + 0.33) / 2 ≈ 2.02.
Anchor ACS was calculated by adding the initial and terminal bigrams and trigrams
and dividing by 4. For example, the anchor ACS for the test string LFGK.GDLX
would be (1 + 1 + 0 + 1)/4 = 0.75.
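For completeness, the two calculations can be written out as follows (a Python sketch
of the procedure described above; the function names are my own, and training
strings are assumed to be supplied with or without their separating dot):

    # Sketch of the global and anchor ACS computations (after
    # Johnstone & Shanks, 2001). Dots separating string halves are
    # stripped before counting.
    def fragments(s, n):
        # All overlapping substrings of length n.
        return [s[i:i + n] for i in range(len(s) - n + 1)]

    def occurrences(frag, text):
        # Overlapping occurrence count of frag within text.
        k = len(frag)
        return sum(text[i:i + k] == frag for i in range(len(text) - k + 1))

    def total_count(frag, training):
        return sum(occurrences(frag, t.replace('.', '')) for t in training)

    def global_acs(test_string, training):
        s = test_string.replace('.', '')
        bigram_mean = sum(total_count(f, training) for f in fragments(s, 2)) / 7
        trigram_mean = sum(total_count(f, training) for f in fragments(s, 3)) / 6
        return (bigram_mean + trigram_mean) / 2      # mean of the two means

    def anchor_acs(test_string, training):
        s = test_string.replace('.', '')
        anchors = [s[:2], s[:3], s[-2:], s[-3:]]     # initial/terminal bi- and trigrams
        return sum(total_count(a, training) for a in anchors) / 4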
Figures 6 and 7 show the 48 test strings plotted according to the global ACS of each
test string to the training strings, i.e. from the least familiar to the most familiar string.
[Figure 6 appears here: global ACS (0-30) of each of the 48 test strings, plotted by
string number from least to most familiar, for the experimental group.]
Figure 6: Global ACS of each test string for the experimental group
From Figure 6 it can be seen that half of the test strings are familiar (the high
familiarity strings) to the training strings of the experimental group, i.e. have a large
global ACS score (the right part of the figure), and half are not familiar to the training
strings (low familiarity strings), i.e. have a low global ACS score (the left part of the
figure). In Figure 7 the global ACS scores for the control group are in the same range
of around 10, i.e. there is no familiarity relationship between the control group's
training strings and the test strings.
[Figure 7 appears here: global ACS of each test string, plotted by string number, for
the control group; all values cluster around 10.]
Figure 7: Global ACS of each string for the control group
Correlational analyses revealed that global ACS and anchor ACS were both positively
correlated with confidence for the experimental group: global ACS r = 0.112 and
anchor ACS r = 0.102, p<0.01. This indicates that when the global and anchor ACS
scores were both high, i.e. the test string had high familiarity to the training strings,
the confidence about whether the responses were correct or not was also high (and
vice versa for low familiarity and low confidence). Again, this suggests that subjects
are classifying the test strings according to how familiar they are to the training
strings, and are confident that they are classifying correctly on familiar strings.
6.3 Conclusions
The subjects seem to possess virtually no knowledge of the rules of the grammar, and
there is no correlation between their performance and their confidence ratings. In fact,
subjects are classifying the strings at chance level (when classification is scored
according to grammaticality). Instead, subjects seem to be classifying the test strings
according to their familiarity to the training strings. Furthermore, when subjects in
the experimental group are more confident, they are classifying more strings
according to familiarity, than when they are less confident. This indicates that the
experimental group may have some metaknowledge (knowledge about their
knowledge) of their performance.
7 Experiment 2: The effect of ⊖ instances in a
standard AGL experiment
In most AGL experiments subjects are presented with only ⊕ instances of the
grammar. In those experiments where subjects see stimuli from two different
grammars, the results are rather mixed: under certain circumstances subjects perform
well above chance, whereas under other circumstances ⊖ instances are detrimental to
learning. In this second experiment, subjects will do exactly the same task as in
Experiment 1, except that those in the experimental group will be exposed to both ⊕
and ⊖ strings in the training phase.
7.1 Method
7.1.1 Subjects
22 voluntary subjects took part in the experiment; 5 were female and 17 were male.
The experiment was run on the Internet: an email giving the experiment's web address
was sent to several departments of the University of Southampton, UK. 14 subjects
were assigned to the experimental group, and 8 to the control group.
7.1.2 Design
The same BCG was used as in Experiment 1. The experimental group was presented
with 36 ⊕ strings and 36 ⊖ strings. The ⊕ training strings were the same as the strings
created for the experimental group in Experiment 1; the ⊖ training strings were the
same as the strings created for the control group in Experiment 1. For the
experimental group, the ⊕ strings were written on a green background and the ⊖
strings on a red background. The control group saw the same strings as the control
group in Experiment 1, on randomly chosen green and red backgrounds. The test
strings were the same as in Experiment 1 and were all written in black on a white
background.
7.1.3 Apparatus
The computer program for this experiment was written in Java by Jorge Basto and
modified by Mike Jewell. The program recorded all responses made by each subject.
7.1.4 Procedure
The training phase was identical to the training phase of Experiment 1, except that the
strings were shown on different coloured backgrounds. The experimental group saw
⊕ training strings on a green background, and ⊖ training strings on a red background.
The control group saw green and red backgrounds at random. Subjects were informed
that the screen might change colours, but that they need not worry about this for this
part of the experiment.
After the training phase, subjects were informed that the green strings had followed a
particular set of rules, while the red strings had not. In fact, only the experimental
group's green strings followed rules, while the control group's green strings were
random. The task was the same as in Experiment 1, i.e. to decide whether each test
string followed the same rules as the green strings or not. All test strings were written
in black and shown on a white background. Subjects were also required to give a
confidence rating after each trial. At the end of the experiment subjects were asked
whether they knew the rules, which strategies they had used, and for any other
comments they might have.
7.2 Results and Discussion
7.2.1 Training phase
The mean reaction times in the training phase were 506.16 msec for the experimental
group and 476.05 msec for the control group. The mean percentages of correct
answers were 89.98% for the experimental group and 86.63% for the control group.
t-tests showed no significant difference between the groups in reaction time, t(20) =
0.486, or correctness, t(20) = 1.247, indicating that the two groups were not
performing differently in the training phase.
7.2.2 Test phase
The mean reaction times in the test phase were 678.28 msec for the experimental
group and 863.17 msec for the control group; there was no significant difference
between groups, t(20) = -1.014.
The experimental group had a mean of 50% correct classifications at test when scored
according to grammaticality and 57.96% when scored according to familiarity. The
control group had a mean of 52.86% according to grammaticality and 53.65%
according to familiarity. These percentages are also shown in Table 3.
                   Experimental group   Control group
Grammaticality     50.00                52.86
Familiarity        57.96                53.65

Table 3: Mean percentages of correct classifications according to grammaticality and familiarity
A chi-square test showed that the mean percentages were not significantly different
from chance, indicating that subjects in both groups were not performing better than
chance. An independent measures t-test showed no significant difference in
correctness between the control and experimental groups, neither for correctness
according to grammaticality, t(20) = -1.063, nor for correctness according to
familiarity, t(20) = 1.368. This indicates that the experimental and control groups
were not doing the task differently.
The mean percentages of test strings that subjects classified as "Yes, it follows the
rules" (i.e. ⊕) for each type of string are shown in numerical form in Table 2.
          Experimental group   Control group
gh        59.88                42.47
gl        32.34                36.13
uh        46.53                39.68
ul        46.89                31.15
average   47.25                38.15

Table 2: Mean percentages of each type of test string classified as "Yes, it follows the rules"
Figure 9 shows these mean percentages in graphical form.
[Bar chart: % classified as grammatical (y-axis, 0-70) by type of string (gh, gl, uh, ul) for the experimental and control groups.]
Figure 9: Mean percentages of test strings classified as "Yes, it follows the rules" (i.e. ⊕)
The experimental group selected "Yes, it follows the rules" (⊕) on an average of
47.25% of the trials, and "No, it doesn't follow the rules" (⊖) on an average of
52.75% of the trials, while the control group selected ⊕ on 38.15% of trials, and ⊖ on
61.85% of the trials. There seems to have been a tendency to call more strings ⊖
("No") than ⊕ ("Yes"), although the difference is not significant. This tendency is
especially pronounced in the control group. This is probably due to the fact that
subjects in the control group have not seen ⊕ strings up until this point.
As in Experiment 1, it was investigated whether subjects were classifying test strings
according to whether or not they followed the rules, or according to whether or not
they were familiar to the training strings. A repeated measures ANOVA with group as
between-subjects variable and grammaticality and familiarity as within-subjects
variables revealed a significant effect of grammaticality, F = 10.551. This suggests
that subjects were classifying test strings according to their grammaticality, i.e.
according to whether or not they followed the rules. There was no significant effect of
group, which shows that subjects in the two groups were not performing differently.
However, these results may also reflect the fact that more strings were classified as ⊖
("No, it doesn't follow the rules") than ⊕, so that more strings may have been
correctly classified as ⊖ than as ⊕. This interpretation also corresponds to the mean
percentages of correct classifications (see Table 3), which do not show a
grammaticality effect.
7.2.2.1 The guessing criterion
The mean percentages of correct classifications according to grammaticality when
subjects thought they were guessing were 58.62% for the experimental group and
34.1% for the control group. The mean percentages of correct classifications
according to familiarity when subjects thought they were guessing were 61.02% for
the experimental group and 47.63% for the control group. These data are also shown
in Table 4.
                   Experimental group   Control group
Grammaticality     58.62                34.10
Familiarity        61.02                47.63

Table 4: Mean percentages of correct classifications when subjects thought they were guessing
A chi-square test conducted on the mean percentages showed that the means were
significantly different from chance, χ2 = 9.0845 (critical value 3.84). This indicates
that subjects in the experimental group seem to know more than they think they know,
i.e. that they are classifying the test strings implicitly.
7.2.2.2 The zero correlation criterion
The confidence ratings were divided into six groups, as in Experiment 1. Figure 10
shows the mean percentages of correct classifications according to grammaticality for
each level of confidence.
[Line graph: mean % of strings correct (y-axis, 20-90) at each confidence level (x-axis, 50-100) for the experimental and control groups.]
Figure 10: Mean percentage of correct classifications according to grammaticality for each level of confidence
Overall, confidence and accuracy were not correlated. A repeated measures ANOVA
with group as between-subjects variable and correctness according to grammaticality
at the six levels of confidence as within-subjects variables revealed a significant main
effect of confidence, F = 2.915, and a significant interaction between confidence and
group, F = 2.922. There was also a significant linear trend of confidence, F = 5.907,
and a significant linear interaction between confidence and group, F = 8.152.
These findings indicate that subjects were more confident when they were responding
correctly than when they were responding incorrectly, which suggests that subjects
were to some extent conscious of the knowledge they possessed. The results also show
that the experimental and control groups were responding differently. Inspection of
Figure 10 suggests that it is the control group that is more confident on correct
answers and less confident on incorrect answers; the experimental group shows
virtually no correspondence between correct responses and confidence level. The
linear interaction can also be seen in Figure 10: the experimental group appears as a
roughly horizontal line around 55%, while the control group appears as a diagonal
line (indicating the correspondence between confidence level and performance)
running from low confidence – low correctness to high confidence – high correctness.
[Line graph: mean % of strings correct according to familiarity (y-axis, 20-90) at each confidence level (x-axis, 50-100) for the experimental and control groups.]
Figure 11: Mean percentage of correct classifications according to familiarity for each confidence rating
Figure 11 shows the mean percentages of correct responses according to familiarity
for each level of confidence. Confidence and accuracy on familiarity were positively
correlated in the control group, r = 0.112, p<0.005, but not in the experimental group.
A repeated measures ANOVA with group as between-subjects variable and
correctness according to familiarity at the six levels of confidence as within-subjects
variables revealed a significant main effect of confidence, F = 17.920. This suggests
that when subjects classify the strings according to their familiarity, they are more
confident on their correct responses than on their incorrect responses.
Inspection of Figure 11 suggests that here, too, the control group was more confident
when classifying strings according to familiarity than the experimental group.
However, there was no significant interaction between group and confidence, which
indicates that the two groups did not respond significantly differently on the
confidence measure.
7.3 Conclusions
The inclusion of ⊖ instances in this experiment interfered with learning: subjects
classified test strings at chance level. Under an instance memory interpretation this
finding makes sense: with the inclusion of ⊖ instances, the memory load was merely
increased, and performance predictably worsened.
There was a tendency to call more strings ⊖ (i.e. "No, it doesn't follow the rules")
than ⊕, especially in the control group. Subjects in the control group seem to be more
confident on correct responses than on incorrect responses. However, when
experimental subjects thought they were guessing, they were performing above
chance, whereas control subjects were performing below chance. Since control
subjects received ⊖ instances in the training phase, but were told that the ones shown
on a green background were ⊕ instances, they could have been remembering specific
green instances (or chunks) from the training phase and falsely treating these as ⊕
instances, thus categorising ⊖ instances as ⊕. This would also explain the tendency to
call more items ⊖ rather than ⊕.
8 Future Work
8.1 Overview
Since Reber's early work in the late 1960s, there has been a vast amount of research
on artificial grammar learning. Most of the literature has focussed on the question of
whether what subjects learn is implicit or explicit, i.e. conscious or unconscious, and
whether subjects are learning the abstract rules of the grammar, or whether they are
learning about chunks of letters. Thus, the first experiment was conducted to
investigate these two questions. It was seen that subjects do not seem to have any
knowledge at all about the rules of the grammar and are classifying the test strings at a
chance level. Subjects are, however, classifying the test strings according to whether
or not they were familiar (according to the ACS measure of familiarity) to the training
strings. Furthermore, subjects seem to be more confident that they were responding
correctly when the strings were familiar than when they were grammatical, which
again suggests that subjects are classifying strings according to their familiarity to
training strings. It also indicates that subjects may have some awareness of what they
are doing. In summary, subjects seem to be classifying test strings according to their
ACS to the training strings and they seem to be somewhat aware of this.
The research conducted on artificial grammars has followed rather restricted lines. In
other areas of rule learning, e.g. concept learning, it is essential that subjects sample
both ⊕ and ⊖ evidence to learn the rule(s), i.e. subjects are told what is an instance of
the rule (⊕) and what is not an instance of the rule (⊖). In AGL, however, it seems to
have been an "unwritten law" that subjects only receive ⊕ instances of a grammar.
There are only a few studies in which subjects have also been exposed to ⊖ instances.
The researchers who did introduce ⊖ instances (be it in the form of ungrammatical
strings, or strings from another grammar) had several different experimental research
interests. Thus, the experimental designs differ quite considerably from one another,
making it difficult to identify the effect that ⊖ instances per se have on learning an AG.
Experiment 2 was conducted in order to see the effect that ⊖ instances have on
learning. It was seen that ⊖ instances interfere with learning. As also seen in
Experiment 1, subjects do not seem to learn any rules in AGL experiments, but are
merely learning chunks of letters. Thus, the inclusion of ⊖ instances differentiated by
a colour cue simply added another piece of information (the colour) to be
remembered; the memory load was therefore larger and performance consequently worse.
In general, AGL research has been concerned with establishing the existence of
implicit learning, i.e. learning without conscious awareness. However, it has not been
established whether the AGs are learnable in the first place, and if they are learnable,
how many trials are needed until criterion is reached (whatever the criterion is defined
to be, e.g. 10 successive trials categorised correctly, or explicit naming of the rules).
This is an important issue, because if the AG is not learnable in the first place, then
neither explicit nor implicit learning will occur. And if the AG is so complex that it
would take subjects an unreasonable number of trials to learn it even with all possible
help – as is the case with most of the FSGs used in AGL experiments – it seems
unreasonable to assume that either implicit or explicit learning will occur in the short
time available for an experiment.
have no chance of ever implicitly or explicitly learning the rules, and have no other
option than to rote memorise fragments of the letter strings and base their subsequent
classifications on this rote memorisation of fragments. There is no rule learning, only
memorisation. If this is the case, then the inclusion of ⊖ instances in the training
phase of an experiment will only serve as an interference to the rote memorisation,
rather than an aid for learning, as seen in Experiment 2.
Only once it has been established that an AG is learnable, and how difficult (or
complex) the grammar is to learn, can one examine, for example, the relationship
between the time needed to learn the grammar explicitly and the time needed to learn
the grammar implicitly. The learnability of the AG must be pre-tested before one can
assume that learning – implicit or explicit – will take place, and before one can
understand how different variables affect learning.
8.2 Pilot studies
Since it is not known which dimension to use as a difficulty measure, pilot studies
will be conducted. The aim of the pilot studies is to find three grammars that vary
along the difficulty dimension from easy via medium to hard. Easy, here, will be
defined as 'learnable in roughly 15 minutes'; medium will be defined as 'learnable in
roughly 30 minutes'; and hard will be 'learnable in roughly one hour'. Biconditional
rules will be used. The difficulty dimension is thought to be defined by number of
rules (which is related to the number of elements), and length of string.
One possibility is to keep the length of the string constant at eight letters, and vary the
number of rules:
a) one rule
A<->B
e.g. ⊕ AAAA.BBBB
⊖ AAAA.AAAA
⊕ AABA.BBAB
⊖ ABAB.ABAB
⊕ BABB.ABAA
⊖ BAAA.BBBB
b) two rules A<->B, C<->D
e.g. ⊕ ABCD.BADC
⊖ ABCD.ABCD
⊕ AACC.BBDD
⊖ AACC.DDBB
⊕ ACAC.BDBD
⊖ ACAC.ACDC
c) three rules A<->B, C<->D, E<->F
e.g. ⊕ ABCE.BADF
⊖ ABAC.EFEF
⊕ BAFD.ABEC
⊖ BBED.CABE
⊕ ACEF.BDFE
⊖ FDEC.ABCD
Another possibility is to keep the number of rules constant at, for example, two rules
A<->B and C<->D, and vary the length of the string (a sketch of how such strings
could be generated follows the examples below):
a) length 4
e.g. ⊕ AC.BD
⊖ AA.CD
⊕ BC.AD
⊖ CA.BD
⊕ AD.BC
⊖ BA.BA
b) length 6
e.g. ⊕ ABC.BAD
⊖ ABC.DAB
⊕ BAC.ABD
⊖ ABD.ACA
⊕ ADA.BCB
⊖ CAB.DDD
c) length 8
e.g. ⊕ ABCD.BADC
⊖ ABCD.DCBA
⊕ CACA.DBDB
⊖ DABB.CBAD
⊕ BBCC.AADD
⊖ CBDA.DBCD
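The following Python sketch illustrates one way such ⊕ and ⊖ strings could be generated (hypothetical code, not a stimulus-generation procedure that has actually been implemented; RULES here encodes only the two-rule example A<->B, C<->D):

    import random

    RULES = {"A": "B", "C": "D"}                         # biconditional rules A<->B, C<->D
    MAP = {**RULES, **{v: k for k, v in RULES.items()}}  # make the mapping symmetric

    def is_grammatical(s):
        """Each second-half letter must be the rule-partner of its first-half twin."""
        first, second = s.split(".")
        return all(MAP[a] == b for a, b in zip(first, second))

    def positive_string(length):
        """Random ⊕ string with `length` letters in total (plus the separating dot)."""
        half = [random.choice(list(MAP)) for _ in range(length // 2)]
        return "".join(half) + "." + "".join(MAP[c] for c in half)

    def negative_string(length):
        """⊖ string: rewrite one second-half letter. Since the original letter was
        the unique correct partner, any substitute violates the rule there."""
        first, second = positive_string(length).split(".")
        second = list(second)
        i = random.randrange(len(second))
        second[i] = random.choice([c for c in MAP if c != second[i]])
        return first + "." + "".join(second)

With RULES extended to three pairs, or the length argument varied, the same two functions would cover all of the example cells above.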
On each trial a stimulus will be presented for 5 seconds. The subject will be required
to repeat the stimulus out loud and type the letters on the keyboard. Then the subject
has to indicate whether the stimulus follows the rules or not. Subjects will receive
feedback on their answers. Subjects will be informed that there are rules governing
the letter strings, and that they should try and figure out the rules. Subjects will be run
for approx. one hour. After every 10 trials subjects will have a short break and will be
required to write down any rules they think they know. The learning curves of
subjects will then show us after how many trials on average the grammar can be
learnt. This will give us an idea of whether the proposed grammars can be classified
as easy, medium, and hard, and if so, which is the better difficulty dimension: the
variation of length of string, or variation of number of rules.
8.3 Experimental plan
A series of four experiments will be conducted. Each of these experiments will be run
on the easy, medium, and hard grammars that will have been defined in the pilot
studies.
The first experiment will investigate the learnability of the three grammars, i.e. how
many trials are needed for the grammar to be learnt, with all possible help, replicating
the results of the pilot studies (hopefully!). The first experiment will also assess the
effect of feedback. There will be two groups of subjects for each grammar, the
feedback (F) group and the no feedback (NF) group (i.e. 6 groups in total).
There will also be a control group, which will only do the test phase. This is to
examine whether anything is learnt in the test phase and to define the chance baseline.
In the learning phase subjects will see a string for 5 seconds. They will then be
required to repeat the string by saying it out loud and typing it on the keyboard while
the string is still visible. Then the subject must decide whether the string follows the
rules or not. Subjects in the F group will receive feedback on whether their choice was
correct or not, whereas subjects in the NF group will not receive feedback on their
responses.
In the test phase subjects will be presented with a string for 5 seconds. They will then
repeat the string by typing it on the keyboard while the string is still visible. They will
then decide whether the string follows the rules or not.
The second experiment will investigate the effect of active responding (A group)
versus passive exposure (P group). Subjects in the A group will attend to the stimulus
by saying and typing the letters on the keyboard, respond with either Yes or No and
be given corrective feedback. Subjects in the P group will say and type the letters and
then be told whether the string follows the rules or not (no response, only feedback).
The test phase is the same as in the first experiment.
In the third experiment we will investigate the effect of ⊕ and ⊖ instances versus ⊕
only or ⊖ only on learning. There will be three groups of subjects: those who receive
⊕ and ⊖ instances (⊕/⊖ group), those who receive ⊕ only (⊕ group) and those who
receive ⊖ only (⊖ group). Subjects in the ⊕/⊖ group will be told that strings on a
green background follow the rules (⊕), while strings on a red background do not (⊖).
Subjects in the ⊕ or the ⊖ group will be told whether their strings are ⊕ or ⊖
respectively.
In the fourth experiment we will examine the effect that prior knowledge of the rule(s)
has on learning. There will be three conditions. One group will be explicitly told what
the rules of the grammar are (T group), a second group will be told that there are rules
that they should try and find (seek instructions, S group), and a third group will not be
told that there are rules until after the learning phase (hide instructions, H group).
Subjects in the T and S group will be responding actively and will receive feedback
on their responses in the learning phase, while subjects in the H group will only be
repeating the stimulus. H subjects will do different tasks for ⊕ and ⊖ exemplars; for
⊕ exemplars they will type the string on the keyboard, for ⊖ exemplars they will spell
the string to themselves (i.e. not type). This is so that there is a differentiation between
⊕ and ⊖ strings.
These proposed AGL experiments are real learning experiments that will test learning
abilities, which variables influence learning, how much they affect learning, and
under what circumstances they do so, unlike many other AGL experiments that are
simply rote memorisation experiments.
References
- Aborn, M. & Rubenstein, H. (1952). Information Theory and Immediate Recall. Journal of Experimental Psychology, XLIV, 260-266.
- Altmann, G., Dienes, Z., & Goode, A. (1995). On the modality independence of implicitly learned grammatical knowledge. Journal of Experimental Psychology: Learning, Memory, & Cognition, 21, 899-912.
- Bourne, L.E., Jr. (1970). Knowing and using concepts. Psychological Review, 77, 546-556.
- Brooks, L.R. (1978). Nonanalytic concept formation and memory for instances. In E. Rosch & B.B. Lloyd (Eds.), Cognition and categorization (pp. 169-211). Hillsdale, NJ: Lawrence Erlbaum Associates Inc.
- Brooks, L.R. & Vokey, J.R. (1991). Abstract analogies and abstracted grammars: Comments on Reber (1989) and Mathews et al. (1989). Journal of Experimental Psychology: General, 120, 316-323.
- Brown, R. & Hanlon, C. (1970). Derivational complexity and order of acquisition in child speech. In J. Hayes (Ed.), Cognition and the development of language (pp. 155-207). New York: Wiley.
- Bruner, J.S., Goodnow, J.L., & Austin, G.A. (1956). A Study of Thinking. New York: Wiley.
- Chan, C. (1992). Implicit cognitive processes: Theoretical issues and applications in computer systems design. Unpublished D.Phil. thesis, University of Oxford.
- Cheesman, J. & Merikle, P.M. (1984). Priming with and without awareness. Perception and Psychophysics, 36, 387-395.
- Cheesman, J. & Merikle, P.M. (1986). Distinguishing conscious from unconscious perceptual processes. Canadian Journal of Psychology, 40, 343-367.
- Chomsky, N. (1968). Language and Mind. New York: Harcourt, Brace.
- Chomsky, N. (1980). Rules and representations. New York: Columbia University Press.
- Chomsky, N. (1987). On the Nature, Use, and Acquisition of Language. In W.C. Ritchie & T.K. Bhatia (Eds.), Handbook of Child Language Acquisition (1999, Ch. 2, pp. 33-54). Academic Press, Ltd.
- Dienes, Z., Altmann, G.T.M., Kwan, L., & Goode, A. (1995). Unconscious knowledge of artificial grammars is applied strategically. Journal of Experimental Psychology: Learning, Memory, & Cognition, 21, 1322-1338.
- Dienes, Z. & Altmann, G. (1997). Transfer of implicit knowledge across domains: How implicit and how abstract? In D. Berry (Ed.), How implicit is implicit learning? Oxford University Press.
- Dienes, Z., Broadbent, D.E., & Berry, D. (1991). Implicit and explicit knowledge bases in artificial grammar learning. Journal of Experimental Psychology: Learning, Memory, and Cognition, 17, 875-887.
- Dienes, Z. & Perner, J. (1996). Implicit knowledge in people and connectionist networks. In G. Underwood (Ed.), Implicit cognition. Oxford University Press.
- Dulany, D.E., Carlson, R., & Dewey, G. (1984). A case of syntactical learning and judgement: How concrete and how abstract? Journal of Experimental Psychology: General, 113, 541-555.
- Gold, E.M. (1967). Language Identification in the Limit. Information and Control, 10, 447-474.
- Johnstone, T. & Shanks, D.R. (1999). Two mechanisms in implicit artificial grammar learning? Comment on Meulemans and Van der Linden (1997). Journal of Experimental Psychology: Learning, Memory, and Cognition, 25, 524-531.
- Johnstone, T. & Shanks, D.R. (2001). Abstractionist and processing accounts of implicit learning. Cognitive Psychology, 42, 61-112.
- Kassin, S.M. & Reber, A.S. (1979). Locus of control and the learning of an artificial language. Journal of Research in Personality, 13, 111-118.
- Klahr, D., Chase, W.G., & Lovelace, E.A. (1983). Structure and process in alphabetic retrieval. Journal of Experimental Psychology: Learning, Memory, and Cognition, 9, 462-477.
- Krantz, J.H. & Dalal, R. (2000). Validity of Web-Based Psychological Research. In M.H. Birnbaum (Ed.), Psychological Experiments on the Internet (pp. 35-60). USA: Academic Press.
- Laurence, S. & Margolis, E. (2001). The Poverty of the Stimulus Argument. The British Journal for the Philosophy of Science, 52, 217-276.
- Mathews, R.C., Buss, R.R., Stanley, W.B., Blanchard-Fields, F., Cho, J.R., & Druhan, B. (1989). Role of implicit and explicit processes in learning from examples: A synergistic effect. Journal of Experimental Psychology: Learning, Memory, and Cognition, 15, 1083-1100.
- Meulemans, T. & Van der Linden, M. (1997). Associative chunk strength in artificial grammar learning. Journal of Experimental Psychology: Learning, Memory, and Cognition, 23, 1007-1028.
- Miller, G.A. (1958). Free recall of redundant strings of letters. Journal of Experimental Psychology, 56, 485-491.
- Perruchet, P. & Pacteau, C. (1990). Synthetic grammar learning: Implicit rule abstraction or explicit fragmentary knowledge. Journal of Experimental Psychology: General, 119, 264-275.
- Pinker, S. (1989). Learnability and Cognition: The Acquisition of Argument Structure. Cambridge, MA: MIT Press.
- Pinker, S. (1995). Language Acquisition. In L.R. Gleitman & M. Liberman (Eds.), An Invitation to Cognitive Science: Language (2nd ed., Vol. 1). Cambridge, MA: MIT Press.
- Pinker, S. & Bloom, P. (1990). Natural Language and Natural Selection. Behavioural and Brain Sciences, 13, 707-784.
- Quine, W.O. (1969). Ontological Relativity and Other Essays. New York: Columbia University Press.
- Reber, A.S. (1967). Implicit learning of artificial grammars. Journal of Verbal Learning and Verbal Behaviour, 6, 855-863.
- Reber, A.S. (1969). Transfer of syntactic structures in synthetic languages. Journal of Experimental Psychology, 81, 115-119.
- Reber, A.S. (1976). Implicit learning of synthetic languages: The role of instructional set. Journal of Experimental Psychology: Human Learning and Memory, 2, 88-94.
- Reber, A.S. (1989). Implicit learning and tacit knowledge. Journal of Experimental Psychology: General, 118, 219-235.
- Reber, A.S. (1993). Implicit learning and tacit knowledge. Oxford, England: Oxford University Press.
- Reber, A.S. & Lewis, S. (1977). Implicit learning: An analysis of the form and structure of a body of tacit knowledge. Cognition, 5, 333-361.
- Reber, A.S. & Allen, R. (1978). Analogy and abstraction strategies in synthetic grammar learning: A functionalist interpretation. Cognition, 6, 189-221.
- Reber, A.S., Kassin, S., Lewis, S., & Cantor, G. (1980). On the relationship between implicit and explicit modes in the learning of a complex rule structure. Journal of Experimental Psychology: Human Learning and Memory, 6, 492-502.
- Redington, M. & Chater, N. (1996). Transfer in artificial grammar learning: A re-evaluation. Journal of Experimental Psychology: General, 125, 123-138.
- Servan-Schreiber, E. & Anderson, J.R. (1990). Learning artificial grammars with competitive chunking. Journal of Experimental Psychology: Learning, Memory, and Cognition, 16, 592-608.
- Shanks, D.R., Johnstone, T., & Staggs, L. (1997). Abstraction processes in artificial grammar learning. Quarterly Journal of Experimental Psychology, 50A, 216-252.
- Shanks, D.R. & St. John, M.F. (1994). Characteristics of dissociable human learning systems. Behavioural and Brain Sciences, 17, 367-447.
- Schiffman, S.S., Reynolds, M.L., & Young, F.W. (1981). Introduction to Multidimensional Scaling: Theory, Methods, and Applications. New York: Academic Press, Inc.
- St. John, M.F. (1996). Computational differences between implicit and explicit learning: Evidence from learning crypto-grammars. Unpublished manuscript.
- Stroulia, E. & Goel, A. (1996). A Model-Based Approach to Blame Assignment: Revising the Reasoning Steps of Problem Solvers. In Proc. Thirteenth National Conference on Artificial Intelligence, pp. 959-964.
- Vokey, J.R. & Brooks, L.R. (1992). Salience of item knowledge in learning artificial grammars. Journal of Experimental Psychology: Learning, Memory, & Cognition, 18, 328-344.
- Whittlesea, B.W.A. & Dorken, M.D. (1993). Incidentally, things in general are particularly determined: An episodic-processing account of implicit learning. Journal of Experimental Psychology: General, 122, 227-248.
Appendix A
A1 Servan-Schreiber and Anderson's theory of competitive
chunking (CC)
Based on these results Servan-Schreiber and Anderson formulated a theory of
competitive chunking (CC). In CC two things are known about every chunk: (1) what
its immediate subchunks are, and (2) the chunk's strength, i.e. how often and how
recently it has been used in the past. A newly created chunk has the strength of one
unit and every time the chunk is used its strength is increased by one unit. A chunk's
strength decreases with time, however. Thus, at any point in time, the strength of a
chunk is defined as follows:

Strength = Σi Ti^(-d)

(where Ti is the time elapsed since the ith strengthening of the chunk, and d is the
decay parameter, 0 < d < 1).
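As a minimal sketch of this computation (illustrative code; use_times, holding the times at which the chunk was created or strengthened, is my own naming):

    def chunk_strength(use_times, now, d=0.5):
        """Sum of T^(-d) over all past strengthenings, where T is the time
        elapsed since each use; frequent and recent use means high strength."""
        return sum((now - t) ** -d for t in use_times if now > t)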
A1.1 The strengthening of existing chunks
According to CC the perception process proceeds from the simplest, elementary
chunks, that is, those chunks that the system never had to learn, in a recursive cycle.
When a stimulus is perceived the system matches the elementary chunks to the
stimulus forming the elementary percept. Then the system matches its more complex
chunks to this percept forming a new percept. The system continues to match more
complex chunks to the percepts until it has no more chunks available. Thus, the
number of chunks gets smaller with each cycle of the perception process.
For example, if we give the system the following stimulus:
carnegiemellonuniversity
it would first perceive every letter, giving the elementary percept
CARNEGIEMELLONUNIVERSITY
Then it would use its more complex chunks to form the percepts
(CARNEGIE) (MELLON) (UNIVERSITY)
then
((CARNEGIE)(MELLON)) (UNIVERSITY)
and
(((CARNEGIE)(MELLON)) (UNIVERSITY))
The number of chunks in the final percept is an important variable in CC called
nchunks. nchunks is a measure of how compact the representation of a stimulus is,
and thus, of how familiar the stimulus is perceived to be. In the above example the
number of chunks in the elementary percept is 24, then 3, then 2, and 1 in the final
percept.
Familiarity can be cautiously defined as

Familiarity = e^(1 - nchunks)

where the familiarity of a stimulus ranges from 1 (maximally familiar, when nchunks
= 1) to an asymptotic 0 (maximally unfamiliar).
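Equivalently, as a minimal sketch in code:

    import math

    def familiarity(nchunks):
        """e^(1 - nchunks): equals 1 when the stimulus collapses into a single
        chunk, and approaches 0 as the final percept needs more chunks."""
        return math.exp(1 - nchunks)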
The probability that a chunk is retrieved depends exclusively on the strength of its
subchunks, which S&S called the chunk's support: a chunk's support is the average
strength of its immediate subchunks. A chunk's strength does not affect its own
probability of being used but directly affects its superchunks' probabilities of being
used. Thus, when the strength of a chunk decays, its superchunks are being forgotten,
or conversely, when the strength of a chunk increases, its superchunks are being
learned. In our example, if (CARNEGIE) and (MELLON) did not have enough
strength, then their superchunk ((CARNEGIE)(MELLON)) may not be retrieved.
Whether a chunk is used by the perception process and strengthened consequently,
depends on the chunk’s own strength, relative to the strength of its competitors (as
there may be other chunks competing). Therefore both a chunk’s strength and its
support are critical to its being used by the perception process.
A1.2 Creation of new chunks
The creation process follows on from the perception process. The final percept is the
input and the output is a new chunk whose immediate subchunks are chunks in the
final percept. If the final percept is for example
Chunk1 Chunk2 Chunk3
there are at least two potential new chunks that may be created: (Chunk1 Chunk2) and
(Chunk2 Chunk3). The probability that a new chunk is actually proposed is assumed
to be determined in the same way as the probability of retrieving an existing chunk
during perception. The chunk with the largest support wins the chunk creation
competition.
A1.3 Simulation
Servan-Schreiber & Anderson report two simulations of experimental data providing
evidence for their theory of competitive chunking. The first simulation reproduces
Miller's (1958) data and illustrates the learning process; the second simulates their
own data and illustrates how such a network can be used to perform the grammatical
discrimination task.
In order for the simulation to provide data in the same form as the human data, the
value of nchunks was converted into a probability of rejection, i.e. the probability
that a string would be rejected increased with the value of nchunks. The higher the
value of nchunks, the less familiar a string was perceived to be, and the higher the
probability that it would be rejected. In the example above, if nchunks is 1 (i.e. the
stimulus is encoded into a single chunk: (CARNEGIE MELLON UNIVERSITY)), the
stimulus is maximally familiar and less likely to be rejected. If nchunks is 3 (i.e. the
stimulus is encoded into three separate chunks: (CARNEGIE) (MELLON)
(UNIVERSITY)), the stimulus is less familiar and thus more likely to be rejected.
Further assumptions were that letters were elementary chunks, with the following
constraints:
(a) the subchunks of a chunk must be adjacent (consistent with the Gestalt principle
of proximity);
(b) the next level of chunks, called "word chunks", can have at most 3 subchunks
(reflecting the finding that subjects' preferred word chunk size is 3);
(c) the next higher level, the "phrase chunks", can have at most 2 subchunks (so that
length constraints become more severe as the complexity of chunks increases).
It was also assumed that each string of letters includes "extremity markers" that
signal the beginning and end of that string: the string TTXVPS was represented as
"beginTTXVPSend" in the simulation. This was to help explain why subjects are
apparently more sensitive to grammatical violations at the extremities of the string
than to those occurring in the middle (Reber, 1967; Servan-Schreiber & Anderson,
1990).
Many different values were tried for the two parameters c (competition parameter)
and d (decay parameter). It was found that with c = d = 0.5 the coefficient of
correlation between the simulated and the human data was 0.935, i.e. the simulation
accounted for 87% of the variance.
The data from the simulation show that forming chunks in the way S&S suggest in
their theory is efficient enough to discriminate between ⊕ and ⊖ letter strings.
Subjects were basing their classification on an overall familiarity measure: the more
compactly a novel string could be encoded into familiar chunks (i.e. the lower its
nchunks), the more likely the string was to be called "good".