Figure 2. The autoassociator network used in the simulation. [The figure shows an input layer (36 units), a hidden layer (4 units), an output layer (36 units), and a verbal input/output layer (4 units).]

Figure 3. The scenes to be classified. [The figure shows the moon scenes, labeled NEW, 1st QUARTER, FULL, and LAST QUARTER.]

Figure 4. Patterns of activation of the verbal output units in four individuals early in the simulation. None of the individuals can distinguish among the twelve moon scenes, and there is no shared pattern of activation across individuals for any particular moon scene. [Axes: verbal output units vs. moon scenes.]

Figure 5. Patterns of activation of the verbal output units in four individuals after an average of 2000 interactions for each pair of individuals. All of the individuals distinguish among the twelve moon scenes by producing a distinctive pattern of activation on the verbal output units. A shared set of form-referent mappings has developed. [Axes: verbal output units vs. moon scenes.]

Figure 6. An interaction. [The figure shows a speaker (S) and a listener (L) with a shared visual field. The speaker is trying to lead the listener's attention to some object in the visual field, the intentional object of discourse; the listener is trying to follow the speaker's lead and employs the speaker's sound and the context of utterance to do this. The speaker's produced sound takes on meaning by association with the context of utterance.] A simulation is composed of a sequence of interactions involving two agents chosen at random from the population, one in the role of speaker and the other in the role of listener. These agents engage each other in a discourse structured by a specific communication task. The speaker has in mind an object (the “intentional object”) that is located within the shared visual field but unknown to the listener. The speaker employs sounds and gaze to direct the listener's attention to the intentional object. The listener employs the speaker's sounds and gaze in an attempt to accomplish the communication task of identifying the intentional object.

Figure 7a. A lexicon after 10,000 interactions of a simulation run. [Panels: Agent 1 and Agent 2; vertical axes: activation level (0.0 to 1.0) on each of the three dimensions of articulatory space; horizontal axis: gaze action and focus of attention (up, down, left, right, dont-move on Object1, dont-move on Object0).] In the simulation run detailed in Figures 7-9, there are two agents in the population, two objects in the environment, and each object can occupy a single location within a 3x3 scene. There are thus 512 possible scenes. We limited the simulation to 462 scenes, one of which is selected to be the “shared visual field” for each interaction, and set aside 50 scenes (randomly chosen) for testing the population on novel scenes. Sounds are represented by a layer of 3 output units, each unit representing a feature or dimension of the agent's verbal articulatory space. Thus a sound is represented by three real-valued numbers, each within the range [0.0, 1.0]. The figure shows the distribution of sounds that each agent produces in all possible contexts. Distributions are represented by the mean value, along with one-standard-deviation error bars. The figure lumps together many different specific contexts (there are thousands, depending on how much history is considered leading up to the agent's current situation) and differentiates only by the agent's concurrent choice of gaze motor action (shift up, down, left, right, stay-focused-on-Object1, stay-focused-on-Object0). At the beginning of the simulation, all sounds are nearly identical; in fact, all articulatory features take on mid-range values for all contexts (not shown here). After 10,000 interactions (shown here), we see that there is an emerging consensus for using the third articulatory feature to enable a contrast in the denotation of Object0 and Object1. The lexicon is beginning to emerge.
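To make the interaction structure of Figures 6 and 7a concrete, the following is a minimal Python sketch of one speaker/listener interaction and of the three-feature sound representation. The class, method, and variable names (Agent, speak, identify, run_interaction, and so on) are illustrative assumptions, not the simulation's actual implementation, and the agents here merely guess rather than learn.

```python
import random
import numpy as np

# Gaze motor actions available to an agent (as listed in Figure 7a).
GAZE_ACTIONS = ["up", "down", "left", "right",
                "dont-move-Object1", "dont-move-Object0"]

class Agent:
    """Toy agent: produces a sound as three articulatory features in [0.0, 1.0]
    plus a gaze action, and guesses an intentional object from a heard sound."""

    def __init__(self, seed):
        self.rng = np.random.default_rng(seed)

    def speak(self, scene, target):
        sound = self.rng.random(3)            # three real-valued articulatory features
        gaze = self.rng.choice(GAZE_ACTIONS)  # concurrent gaze motor action
        return sound, gaze

    def identify(self, sound, gaze, scene):
        # A trained agent would use the sound and the context of utterance;
        # this stub guesses so the sketch stays self-contained.
        return self.rng.choice(["Object0", "Object1"])

def run_interaction(population, scenes, rng):
    """One interaction: two agents chosen at random, one as speaker and one as
    listener, share a visual field; success means the listener identifies the
    speaker's intentional object (the Halt condition)."""
    speaker, listener = rng.sample(population, 2)
    scene = rng.choice(scenes)                   # the shared visual field
    target = rng.choice(["Object0", "Object1"])  # the intentional object
    sound, gaze = speaker.speak(scene, target)
    return listener.identify(sound, gaze, scene) == target

if __name__ == "__main__":
    rng = random.Random(0)
    agents = [Agent(seed) for seed in (1, 2)]
    print(run_interaction(agents, scenes=list(range(462)), rng=rng))
```

A full simulation would repeat run_interaction many thousands of times, with the agents updating their form-referent mappings after each interaction.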
Figure 7b. A lexicon after 60,000 interactions (for the same simulation run as shown in Figure 7a). [Panels and axes as in Figure 7a: activation levels on the three dimensions of articulatory space for Agents 1 and 2, differentiated by gaze action and focus of attention.] While the early denotation of objects in the simulation still holds (see Figure 7a), the lexicon is now elaborated to include forms which coincide with actions associated with gaze and the spatial shifting of a focus of attention. This elaboration is emergent and shared by the agents of the population.

Figure 8. The evolution of a simulation run. [Vertical axis: fraction of interactions which terminate under each condition (Halt, Disagree, Invalid, Cycle+Max); horizontal axis: interactions (x 1000), 0 to 70.] Every interaction terminates under one of five conditions (Disagree, Halt, Invalid, Cycle, Max; see the main text for explanations of these conditions). Over time, nearly all interactions terminate under the successful Halt condition, indicating that the communication task is being solved in almost every instance. (See Figure 7a for a description of the parameters of this simulation run.)

Figure 9. Emergence of the ability of language structure alone to guide attention. [Vertical axis: fraction of success in the test corpus; horizontal axis: point in time (interactions x 1000) at which the test corpus was collected; curves: Agent1-speaker, Agent2-speaker, A1-A1, A1-A2, A2-A1, A2-A2.] Every 10,000 interactions a “test corpus” of language constructs was elicited from the agents using the following method. Each agent, in the role of speaker and in a complete set of contexts created from 10 novel scenes, produced actions precisely as it would in a normal interaction, except that there was no listener with which to negotiate the interaction. In particular, only two types of interaction termination applied: Halt and Max (see Figure 8). The fraction of successful terminations by speakers (i.e., Halt) is shown by the two upper curves in the diagram. These two conditions provide reference curves for the others in the figure, and the actions produced by these speakers (gaze and sounds) constitute the test corpus. In the four conditions represented by the four lower curves, we took the test corpus of one speaker and had each agent process the language constructs in the role of “blind listener”. In particular, the listening agent received no visual inputs and had to rely on language inputs alone to produce the appropriate actions. Appropriateness of actions (and thus “success” as identified in the figure) was determined by comparison with the speaker's actions. As can be seen, once the speakers had organized a coherent language structure which solved the communication task (by about 40,000 interactions in the simulation run), the agents' ability to perform the same task solely with access to the language constructs followed (by about 60,000 interactions in the simulation run).
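As a rough illustration of the scoring behind Figure 9, the sketch below evaluates a “blind listener” against a speaker's test corpus: each corpus item pairs the sounds a speaker produced on a novel scene with the gaze actions it took, and the listener, given only the sounds, must reproduce those actions. The data layout, the function names, and the exact-match success criterion are assumptions made for illustration, not the study's actual code.

```python
def blind_listener_success(corpus, blind_listener):
    """corpus: list of (sounds, speaker_actions) pairs, where sounds is the
    sequence of three-feature sounds a speaker produced in one test episode
    and speaker_actions is the sequence of gaze actions it took.
    blind_listener: a function mapping a single sound (no visual input) to an action.
    Returns the fraction of episodes whose action sequence is reproduced exactly."""
    if not corpus:
        return 0.0
    matches = sum(
        [blind_listener(sound) for sound in sounds] == actions
        for sounds, actions in corpus
    )
    return matches / len(corpus)

# Tiny usage example: a listener that, like the emerging lexicon in Figure 7a,
# reads the third articulatory feature as the Object0/Object1 contrast.
corpus = [
    ([(0.2, 0.5, 0.9)], ["dont-move-Object1"]),
    ([(0.3, 0.5, 0.1)], ["dont-move-Object0"]),
]
listener = lambda s: "dont-move-Object1" if s[2] > 0.5 else "dont-move-Object0"
print(blind_listener_success(corpus, listener))   # prints 1.0
```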