[Figure 2 diagram: a network with an input layer (36 units), a hidden layer (4 units), a verbal input/output layer (4 units), and an output layer (36 units).]
Figure 2. The autoassociator network used in the simulation.
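The layer sizes in Figure 2 suggest a straightforward implementation. Below is a minimal sketch, not the authors' code, of a 36-4-36 autoassociator trained by gradient descent to reproduce its input on the output layer; the 4-unit verbal input/output layer is omitted, and the weight initialization, learning rate, and training loop are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of a 36-4-36 autoassociator of the kind
# shown in Figure 2, trained by plain gradient descent to reproduce its input.
# The 4-unit verbal input/output layer described in the main text is omitted here;
# the learning rate and initialization are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

N_IN, N_HIDDEN, N_OUT = 36, 4, 36             # layer sizes from Figure 2

W1 = rng.normal(0, 0.1, (N_HIDDEN, N_IN))     # input -> hidden weights
W2 = rng.normal(0, 0.1, (N_OUT, N_HIDDEN))    # hidden -> output weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x):
    h = sigmoid(W1 @ x)                       # hidden-layer activations (4 units)
    y = sigmoid(W2 @ h)                       # output-layer activations (36 units)
    return h, y

def train_step(x, lr=0.5):
    """One backpropagation step toward reproducing x on the output layer."""
    global W1, W2
    h, y = forward(x)
    err_out = (y - x) * y * (1 - y)           # delta at the output layer
    err_hid = (W2.T @ err_out) * h * (1 - h)  # delta at the hidden layer
    W2 -= lr * np.outer(err_out, h)
    W1 -= lr * np.outer(err_hid, x)
    return float(np.mean((y - x) ** 2))

# Example: train on a single binary 6x6 "moon scene" flattened to 36 values.
scene = rng.integers(0, 2, N_IN).astype(float)
for _ in range(2000):
    mse = train_step(scene)
print(f"reconstruction MSE after training: {mse:.4f}")
```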
[Figure 3 diagram: moon scenes labeled NEW, 1st QUARTER, FULL, and LAST QUARTER.]
Figure 3. The scenes to be classified.
[Figure 4 diagram: for each of four individuals, activation of the verbal output units plotted against the moon scenes.]
Figure 4. Patterns of activation of the verbal output units in four individuals early in the simulation.
None of the individuals can distinguish among the twelve moon scenes and there is no shared pattern of activation
across individuals for any particular moon scene.
[Figure 5 diagram: for each of four individuals, activation of the verbal output units plotted against the moon scenes.]
Figure 5. Patterns of activation of the verbal output units in four individuals after an average of 2000 interactions
for each pair of individuals.
All of the individuals distinguish among the twelve moon scenes by producing a distinctive pattern of activation on
the verbal output units. A shared set of form-referent mappings has developed.
[Figure 6 diagram: a speaker (S) and a listener (L) share a visual field. The speaker is trying to lead the listener's attention to some object in the visual field (the intentional object of discourse). The listener is trying to follow the speaker's lead and employs the speaker's sound and the context of utterance to do this. The speaker's produced sound takes on meaning by association with the context of utterance.]
FIGURE 6. An Interaction.
A simulation is composed of a sequence of interactions involving two agents chosen at random from the population, one in the role of speaker and the other in the role of listener. These agents engage each other in a discourse structured by a specific communication task. The speaker has in mind an object (the “intentional object”) that is located within the shared visual field but unknown to the listener. The speaker employs sounds and gaze to direct the listener’s attention to the intentional object. The listener employs the speaker’s sounds and gaze to coordinate attention in an attempt to accomplish the communication task of identifying the intentional object.
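To make the flow of an interaction concrete, here is a minimal, self-contained sketch of the protocol just described. The RandomAgent placeholders, the scene format, and the MAX_STEPS cap are assumptions introduced for illustration; in the simulation the agents are neural networks whose sounds and gaze actions are learned.

```python
# Self-contained sketch of the interaction protocol described for Figure 6
# (not the authors' implementation). The RandomAgent stand-ins, the scene
# format, and MAX_STEPS are illustrative assumptions.
import random

MAX_STEPS = 20  # assumed cap corresponding to the "Max" termination condition

class RandomAgent:
    """Placeholder agent: emits random sounds/gaze and makes random guesses."""
    def speak(self, scene, target):
        sound = [random.random() for _ in range(3)]   # 3 articulatory features
        gaze = random.choice(["up", "down", "left", "right", "dont-move"])
        return sound, gaze

    def listen(self, scene, sound, gaze):
        guess = random.choice(scene)                  # guess one object
        done = random.random() < 0.3                  # decide whether to halt
        return guess, done

def run_interaction(speaker, listener, scene, target):
    """One speaker/listener exchange over a shared visual field."""
    for _ in range(MAX_STEPS):
        sound, gaze = speaker.speak(scene, target)    # only sound/gaze are public
        guess, done = listener.listen(scene, sound, gaze)
        if done:
            return "Halt" if guess == target else "Disagree"
    return "Max"

# Example: repeated interactions between randomly chosen pairs of agents.
population = [RandomAgent(), RandomAgent()]
scene = ["Object0", "Object1"]                        # two objects in the shared field
for _ in range(5):
    speaker, listener = random.sample(population, 2)
    print(run_interaction(speaker, listener, scene, random.choice(scene)))
```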
[Figure 7a diagram: for Agent 1 and Agent 2, activation levels (0.0 to 1.0) on each of the three dimensions of articulatory space, plotted for each action and focus of attention (up, down, left, right, dont-move on Object1, dont-move on Object0).]
FIGURE 7a. A lexicon after 10,000 interactions of a simulation run.
In the simulation run detailed in Figures 7-9, there are two agents in the population and two objects in the environment, and each object can occupy a single location within a 3x3 scene. There are thus 512 possible scenes. We limit the simulation to 462 scenes, one of which is selected to be the “shared visual field” for each interaction, and set aside 50 randomly chosen scenes for testing the population on novel scenes. Sounds are represented by a layer of 3 output units, each unit representing a feature or dimension of the agent’s verbal articulatory space. A sound is thus represented by three real-valued numbers, each within the range [0.0, 1.0]. The figure shows the distribution of sounds that each agent produces in all possible contexts. Distributions are represented by the mean value along with one-standard-deviation error bars. The figure lumps together many different specific contexts (there are thousands, depending on how much of the history leading up to the agent’s current situation is considered) and differentiates only by the agent’s concurrent choice of gaze motor action (shift up, down, left, right, stay-focused-on-Object1, stay-focused-on-Object0). At the beginning of the simulation, all sounds are nearly identical: all articulatory features take on mid-range values for all contexts (not shown here). After 10,000 interactions (shown here), there is an emerging consensus on using the third articulatory feature to mark a contrast in the denotation of Object0 and Object1. The lexicon is beginning to emerge.
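The per-action summaries plotted in Figure 7a can be thought of as simple grouped statistics over an agent's produced sounds. The sketch below shows one plausible way to compute them; the interface and the synthetic data are assumptions for illustration, not the authors' procedure.

```python
# Sketch of how the per-action summaries in Figure 7a could be computed
# (an assumption about the plotting procedure, not the authors' code):
# group the (gaze action, sound) pairs an agent produced and report the
# mean and standard deviation of each articulatory feature per action.
import numpy as np
from collections import defaultdict

ACTIONS = ["up", "down", "left", "right",
           "dont-move-Object1", "dont-move-Object0"]

def summarize(produced):
    """produced: list of (action, sound) pairs, sound = 3 floats in [0.0, 1.0]."""
    by_action = defaultdict(list)
    for action, sound in produced:
        by_action[action].append(sound)
    summary = {}
    for action in ACTIONS:
        sounds = np.array(by_action[action])          # shape (n, 3)
        summary[action] = (sounds.mean(axis=0),       # mean of each feature
                           sounds.std(axis=0))        # one-std error bars
    return summary

# Example with synthetic data: the third feature contrasts the two dont-move actions.
rng = np.random.default_rng(0)
produced = []
for action in ACTIONS:
    if action.endswith("Object1"):
        center = 0.9
    elif action.endswith("Object0"):
        center = 0.1
    else:
        center = 0.5
    for _ in range(100):
        produced.append((action, [rng.uniform(0, 1), rng.uniform(0, 1),
                                  float(np.clip(rng.normal(center, 0.05), 0, 1))]))

for action, (mean, std) in summarize(produced).items():
    print(action, np.round(mean, 2), np.round(std, 2))
```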
[Figure 7b diagram: same layout as Figure 7a; activation levels on the three articulatory dimensions for Agent 1 and Agent 2, per action and focus of attention.]
FIGURE 7b. A lexicon after 60,000 interactions (for the same simulation run as shown in 7a).
While the early denotation of objects in the simulation still holds (see Figure 7a), the lexicon is now elaborated to
include forms which coincide with actions associated with gaze and the spatial shifting of a focus of attention. This
elaboration is emergent and shared by the agents of the population.
[Figure 8 plot: fraction of interactions terminating under each condition (Halt, Disagree, Invalid, Cycle+Max), from 0 to 1.0, versus interactions (x 1000, from 0 to 70).]
FIGURE 8. The evolution of a simulation run.
Every interaction terminates under one of five conditions (Disagree, Halt, Invalid, Cycle, Max; see the main text for explanations of these conditions). Over time, nearly all interactions come to terminate under the successful Halt condition, indicating that the communication task is being solved in almost every instance. (See Figure 7a for a description of the parameters of this simulation run.)
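The curves in Figure 8 amount to bookkeeping over interaction outcomes. The sketch below shows one way such fractions could be tallied per window of interactions; the window size and the synthetic outcome stream are assumptions for illustration.

```python
# Sketch of how the Figure 8 curves could be produced (an assumption about the
# bookkeeping, not the authors' code): tally every interaction's outcome and
# report, per window of interactions, the fraction ending under each condition.
import random
from collections import Counter

CONDITIONS = ["Halt", "Disagree", "Invalid", "Cycle", "Max"]

def termination_fractions(outcomes, window=1000):
    """outcomes: list of condition names, one per interaction, in order."""
    fractions = []
    for start in range(0, len(outcomes), window):
        chunk = outcomes[start:start + window]
        counts = Counter(chunk)
        fractions.append({c: counts[c] / len(chunk) for c in CONDITIONS})
    return fractions

# Example: a synthetic outcome stream that drifts toward "Halt", as in the reported run.
random.seed(0)
outcomes = []
for i in range(70_000):
    p_halt = min(0.95, 0.2 + i / 50_000)              # success rate grows over time
    outcomes.append("Halt" if random.random() < p_halt
                    else random.choice(["Disagree", "Invalid", "Cycle", "Max"]))

for i, frac in enumerate(termination_fractions(outcomes, window=10_000)):
    print(f"{(i + 1) * 10}k interactions: Halt fraction = {frac['Halt']:.2f}")
```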
[Figure 9 plot: fraction of success in the test corpus, from 0 to 1.0 (curves for Agent1-speaker, Agent2-speaker, A1-A1, A1-A2, A2-A1, A2-A2), versus the point in time (interactions x 1000, from 10 to 70) at which the test corpus was collected.]
FIGURE 9. Emergence of the ability of language structure alone to guide attention.
Every 10,000 interactions, a “test corpus” of language constructs was elicited from the agents using the following method. Each agent, in the role of speaker and in a complete set of contexts created from 10 novel scenes, produced actions precisely as it would in a normal interaction, except that there was no listener with which to negotiate the interaction. In particular, only two types of interaction termination applied: Halt and Max (see Figure 8). The fraction of successful terminations by speakers (i.e., Halt) is shown by the two upper curves in the diagram. These two curves serve as reference curves for the others in the figure, and the actions produced by these speakers (gaze and sounds) constitute the test corpus. In the four conditions represented by the four lower curves, we took the test corpus of one speaker and had each agent process the language constructs in the role of “blind listener”. In particular, the listening agent received no visual inputs and had to rely on language inputs alone to produce the appropriate actions. Appropriateness of actions (and thus “success” as identified in the figure) was determined by comparison with the speaker’s actions. As can be seen, once the speakers had organized a coherent language structure that solved the communication task (by about 40,000 interactions in the simulation run), the agents’ ability to perform the same task solely with access to the language constructs followed (by about 60,000 interactions in the simulation run).
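As a concrete reading of this test procedure, the sketch below separates corpus elicitation from blind-listener scoring. The stub agents and their interfaces are hypothetical stand-ins introduced only to make the example runnable; in the reported run the speakers and listeners are the trained network agents and the scenes are drawn from the held-out novel scenes.

```python
# Sketch of the "blind listener" test described for Figure 9. The stub agents,
# their interfaces, and the toy scenes are illustrative assumptions; in the
# reported run the agents are the trained networks. A speaker first produces a
# corpus of (sound, gaze) actions on novel scenes with no listener present;
# each agent then acts as a blind listener, reproducing actions from the sounds
# alone, and success is agreement with the speaker's actions.

class StubSpeaker:
    """Placeholder: always names the target with a distinctive sound."""
    def speak(self, scene, target):
        sound = (0.9,) if target == "Object1" else (0.1,)
        gaze = f"dont-move-{target}"
        return sound, gaze

class StubListener:
    """Placeholder: decodes the speaker's sound back into a gaze action."""
    def act_from_sound(self, sound):
        return "dont-move-Object1" if sound[0] > 0.5 else "dont-move-Object0"

def elicit_corpus(speaker, novel_scenes):
    """Speaker acts on novel scenes; only its (sound, gaze) actions are kept."""
    return [speaker.speak(scene, target) for scene, target in novel_scenes]

def blind_listener_score(listener, corpus):
    """Fraction of corpus items where the blind listener's action matches."""
    matches = sum(listener.act_from_sound(sound) == gaze for sound, gaze in corpus)
    return matches / len(corpus)

# Example with the stubs standing in for trained agents:
scenes = [("scene", "Object0"), ("scene", "Object1")] * 5
corpus = elicit_corpus(StubSpeaker(), scenes)
print(blind_listener_score(StubListener(), corpus))   # 1.0 for these stubs
```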