Agent-based model for the origins of scaling in human language

Javier Vera∗1 and Felipe Urbina†2

1,2 Facultad de Ingeniería y Ciencias, Universidad Adolfo Ibáñez, Avda. Diagonal Las Torres 2640, Peñalolén, Santiago, Chile
2 UAI Physics Center, Universidad Adolfo Ibáñez, Santiago, Chile

∗ (corresponding author) e-mail: [email protected]; telephone/fax: +56 2 2331 1000
† [email protected]
Abstract

• Background/Introduction: Zipf's law establishes that if the words of a (large) text are ordered by decreasing frequency, frequency decays with rank as a power law with exponent close to -1. Previous work has stressed that this pattern arises from a conflict between the interests of the two participants in communication: speakers and hearers.

• Methods: The challenge here is to define a computational language game on a population of agents, playing games governed mainly by a parameter that measures the participants' relative interests.

• Results: Numerical simulations suggest that at critical values of the parameter a human-like vocabulary, exhibiting scaling properties, appears.
• Conclusions: The appearance of an intermediate distribution of frequencies at some critical values of the parameter suggests that, on a population of artificial agents, the emergence of scaling partly arises as a self-organized process driven only by local interactions between agents.
Keywords: Language Games, Vocabularies, Naming Game, Zipf’s Law
1 Introduction
Can artificial populations of agents develop vocabularies satisfying Zipf's law? This question, based on an earlier version proposed in [1], relies on the assumptions and the minimal interaction rules that allow agents to self-organize a language from scratch, exhibiting that if the words are ordered by decreasing frequency, the frequency of the k-th word, P(k), decays as the power law

$$P(k) \sim k^{-\alpha}$$

where α ≈ 1 [2]. One possible origin of this law arises from the conflict between the simultaneous interests of both speakers and hearers. At the lexical organization level of language, [3] stressed that each schematic conversation role faces a trade-off on lexical interests. The speaker will thus tend to choose the most frequent words, whose frequency is positively correlated with their ambiguity, understood as a higher number of meanings [3]. Put differently, in an idealized scenario the speaker will prefer to transmit the same unique word at each interaction. This behavior is opposed to the requirements of the hearer, who needs to minimize the effort of understanding given the ambiguity of the transmitted word. For the hearer, the preferred vocabulary is therefore a one-to-one word-meaning mapping. At an intermediate level of the participants' lexical interests, [3] described, within the framework of Information Theory, the drastic appearance of scaling, as expressed in Zipf's law, in the organization between words and meanings.
The main aim here is to describe the dynamics of a language game [4] on a population of agents which behave according to different levels of lexical interest in word ambiguity. Moreover, the hypothesis is that at some intermediate level of the participants' interests (as shown in [3]) agents will share a word-meaning mapping exhibiting some scaling properties. The focus of this paper is a distributed solution in which agents collectively reach shared communication systems without any kind of central control influencing the formation of language, and only through local conversations between a few participants [5, 6, 7, 8].
The model proposed here is based on a prototypical agent-based model for computational studies of language formation, the naming game [5, 8, 7, 4], which considers a finite population of agents, each one endowed with a memory to store, in principle, an unlimited number of words. At each discrete time step, a pair of agents, one speaker and one hearer, negotiate words as hypotheses for naming one object. Under the typical dynamics of the naming game, the population will, after a finite amount of time, share a unique word for referring to the object.
This paper is organized as follows. Section "Methods" introduces the model for agent behavior and the quantitative measures used in the numerical simulations. Section "Results" describes how the consensus dynamics is strongly influenced by the participants' relative interests. Finally, the "Conclusion" section presents consequences for language evolution and future work.
2 Methods
2.1 Basic model
The language game is played by a finite population of agents P = {1, ..., p}, sharing a set of words {1, ..., n} and a set of meanings {1, ..., m}. Each agent k ∈ P is associated with an n × m lexical matrix L^k = (l^k_{ij}), where l^k_{ij} = 1 if the i-th word is related to the j-th meaning, and l^k_{ij} = 0 otherwise. More generally, lexical matrices can be understood in terms of language networks [1]. Next, two technical terms are introduced. Consider one agent k ∈ P, a word w and a meaning m.
Definition 1 (known word) The agent k knows the word w if there is at least one meaning m such that l^k_{wm} = 1. Analogously, k knows the association between w and m if l^k_{wm} = 1.
Definition 2 (ambiguity) The ambiguity of the word w is defined as the sum of l^k_{wj} over j ∈ {1, ..., m}. More precisely, the ambiguity of w is ∑_j l^k_{wj}.
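The basic objects above admit a direct computational representation. The following sketch (Python with NumPy; all names are illustrative, not taken from the authors' code) encodes a lexical matrix together with Definitions 1 and 2:

    import numpy as np

    # One agent's lexical matrix: rows are words, columns are meanings.
    # As in the paper's initial condition, each entry is 0 or 1 with
    # probability 0.5.
    rng = np.random.default_rng(seed=0)
    n_words, n_meanings = 64, 64
    L = rng.integers(0, 2, size=(n_words, n_meanings))

    def knows_word(L, w):
        """Definition 1: the agent knows w if it maps to at least one meaning."""
        return bool(L[w].any())

    def ambiguity(L, w):
        """Definition 2: number of meanings associated to word w."""
        return int(L[w].sum())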
The purpose of the game is twofold: the development of (i) a common vocabulary (a lexical matrix shared by the entire population), which exhibits (ii) scaling relations, as expressed in Zipf's law.
Basic interaction rules

The basic interaction rules therefore read:

(step 1) at each discrete time step, a pair of agents is selected uniformly at random: one plays the role of speaker and the other plays the role of hearer;

(step 2) first, the speaker chooses uniformly at random a topic of the conversation, assuming meaning transfer [9]: when a word is transmitted, the hearer knows what the associated meaning is. The speaker then selects one column (meaning) m∗ ∈ {1, ..., m}. Next, the speaker calculates a word associated to m∗, denoted w∗, and transmits it to the hearer. This calculation is based here only on the speaker's lexical interest;

(step 3) finally, the hearer behaves as in the naming game. If the hearer does not know the association between w∗ and m∗, it establishes a repair strategy (in order to increase the chance of future agreements). Otherwise, mutual agreement implies alignment strategies [8]. More precisely,

(i) if the hearer knows w∗, both speaker and hearer cancel all entries of the m∗-th column of their lexical matrices, except the row (word) w∗;

(ii) otherwise, the hearer establishes a simple repair strategy: it adds 1 to the entry (w∗, m∗) of its lexical matrix.
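A schematic implementation of one full interaction, under the representation sketched above, could read as follows. The success condition is read from the step-3 preamble (the hearer knows the association between w∗ and m∗), and with binary entries the repair "adds 1" amounts to setting the entry to 1; both readings are assumptions of this sketch.

    def interaction(speaker_L, hearer_L, choose_word, rng):
        """One speaker-hearer game following (step 1)-(step 3);
        choose_word(L, m_star, rng) stands for some version of (step 2)."""
        m_star = rng.integers(speaker_L.shape[1])      # (step 2): random topic
        w_star = choose_word(speaker_L, m_star, rng)   # speaker picks a word

        if hearer_L[w_star, m_star] == 1:
            # (step 3)(i) alignment: cancel column m_star except row w_star
            for M in (speaker_L, hearer_L):
                saved = M[w_star, m_star]
                M[:, m_star] = 0
                M[w_star, m_star] = saved
        else:
            # (step 3)(ii) repair: the hearer records the new association
            hearer_L[w_star, m_star] = 1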
In this paper, three strategies arising from the basic interaction rules are proposed. The first strategy focuses on the maximization of the speaker's interest: speakers will thus prefer to transmit the most ambiguous words. This is opposed to the second strategy, which involves the minimization of the speaker's interest (or, equivalently, the maximization of the hearer's interest): speakers will therefore prefer to transmit the least ambiguous words. Finally, the general case, involving the relative interests of both participants, is presented.
2.2 Speaker’s interest maximization
What would be the minimal adaptation of the basic interaction rules that allows focusing on the speaker's interest? How can the population reach agreement on a vocabulary while the speaker's interest is maximized? One simple solution consists in defining how speakers calculate the most ambiguous word. With this in mind, a new version of (step 2) is proposed:

(step 2S) (i) if the speaker does not know a word associated to m∗, it randomly chooses one word w∗ ∈ {1, ..., n};

(ii) otherwise, the speaker calculates w∗ as the most ambiguous word (from the words associated to m∗): w∗ is simply the word with the largest number of meanings.
2.2.1 Example: the most ambiguous word
At some time step, consider the following scenario: (i) the topic of the interaction is the meaning (column) m∗ = 2; and (ii) the speaker k ∈ P has the lexical matrix

$$L^k = \begin{pmatrix} 0 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 \\ 1 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \end{pmatrix}$$
Therefore, the speaker calculates the most ambiguous word (row) w∗ as

$$w^* = \operatorname*{argmax}_{\{w : l^k_{w m^*} \neq 0\}} \sum_{j=1}^{m} l^k_{wj} = \operatorname*{argmax}_{w \in \{2,3\}} \sum_{j=1}^{m} l^k_{wj} = 3$$

and transmits it to the hearer.
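Under the same representation, (step 2S) can be sketched as below; the paper does not specify tie-breaking, so this sketch simply returns the first maximizer:

    def most_ambiguous_word(L, m_star, rng):
        """(step 2S): the most ambiguous word among those associated to
        m_star; if the speaker knows none, a word chosen at random."""
        candidates = np.flatnonzero(L[:, m_star])       # words w with l[w, m_star] != 0
        if candidates.size == 0:
            return int(rng.integers(L.shape[0]))        # (i) random word
        ambiguities = L[candidates].sum(axis=1)         # meanings per candidate
        return int(candidates[np.argmax(ambiguities)])  # (ii) largest ambiguity

With the example matrix above and 0-based indices, most_ambiguous_word(L, 1, rng) returns 2, i.e. the third word, matching w∗ = 3.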
2.3 Speaker’s interest minimization
What kind of language strategies do hearers need in order to focus on their interests? Hearers want to minimize the effort of understanding and therefore tend to prefer the least ambiguous words, which is opposed to the speaker's interest. A second version of (step 2) is proposed:

(step 2H) (i) if the speaker does not know a word associated to m∗, it randomly chooses one word w∗ ∈ {1, ..., n};

(ii) otherwise, the speaker calculates w∗ as the least ambiguous word (from the words associated to m∗): w∗ is now the word with the lowest number of meanings.
2.3.1 Example: the least ambiguous word
As in the previous example, the topic of the interaction is the meaning (column) m∗ = 2, and the speaker's lexical matrix is L^k. Therefore, the speaker calculates the least ambiguous word (row) w∗ as

$$w^* = \operatorname*{argmin}_{\{w : l^k_{w m^*} \neq 0\}} \sum_{j=1}^{m} l^k_{wj} = \operatorname*{argmin}_{w \in \{2,3\}} \sum_{j=1}^{m} l^k_{wj} = 2$$

and transmits it to the hearer.
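(step 2H) is the mirror image of the previous sketch, replacing the argmax with an argmin; with the example matrix, the second word (w∗ = 2) is selected:

    def least_ambiguous_word(L, m_star, rng):
        """(step 2H): the least ambiguous word among those associated to
        m_star; if the speaker knows none, a word chosen at random."""
        candidates = np.flatnonzero(L[:, m_star])
        if candidates.size == 0:
            return int(rng.integers(L.shape[0]))
        ambiguities = L[candidates].sum(axis=1)
        return int(candidates[np.argmin(ambiguities)])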
2.4 Relative interests of speakers and hearers
What would be the minimal adaptation of the basic interaction rules that enables including the interests of both speakers and hearers at the same time? In order to define relative interests, one feasible solution is that speakers prefer to transmit words of relative ambiguity, defined by a simple relationship between (step 2S) and (step 2H). The solution consists in (step 2R):

(step 2R) (i) if the speaker does not know a word associated to m∗, it randomly chooses one word w∗ ∈ {1, ..., n};

(ii) otherwise, the speaker calculates w∗ according to the ambiguity parameter ℘ ∈ [0, 1]. Let r ∈ [0, 1] be a random number. Then,

• if r > ℘, the speaker calculates w∗ as the least ambiguous word (as in rule 2H);

• otherwise, the speaker calculates w∗ as the most ambiguous word (as in rule 2S).

Notice that for ℘ = 0 and ℘ = 1, agents play (step 2H) and (step 2S), respectively. For intermediate values ℘ ∈ (0, 1), agents face relative lexical interests while they play the role of speaker or hearer.
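(step 2R) then reduces to a biased coin flip between the two previous rules; in the sketch below, p_amb plays the role of ℘:

    def relative_word(L, m_star, p_amb, rng):
        """(step 2R): most ambiguous word with probability p_amb,
        least ambiguous word otherwise."""
        if rng.random() < p_amb:                        # r <= p_amb: rule 2S
            return most_ambiguous_word(L, m_star, rng)
        return least_ambiguous_word(L, m_star, rng)     # r > p_amb: rule 2H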
2.4.1 Example: relative interests
For (step 2R), the speaker calculates w∗ = 2, with probability 1 − ℘; or w∗ = 3,
with probability ℘.
2.5 Measures
To explicitly describe the consensus dynamics under different participants' lexical interests, three measures are defined: the amount of global agreement of the population, D(t); the size of the effective vocabulary, V(t); and an energy-like function, E_KL.

Distance

The global agreement D(t) is defined as the normalized distance to the average lexical matrix,

$$D(t) = \frac{1}{mnp} \sum_{k=1}^{p} \left|\left\{ (i,j) : l^k_{ij} \neq \bar{l}_{ij} \right\}\right|$$

where | · | denotes cardinality and \bar{l}_{ij} is the association between the word i and the meaning j in the average matrix \bar{L}.

Effective vocabulary

The size of the effective vocabulary [3] is

$$V(t) = \frac{1}{np} \sum_{k=1}^{p} \left|\left\{ i : \sum_{j=1}^{m} l^k_{ij} > 0 \right\}\right|$$

where ∑_{j=1}^{m} l^k_{ij} > 0 means that the i-th word of the lexical matrix L^k is being occupied.

Energy-like function

The energy-like function is

$$E_{KL}(\wp) = d(P(\wp), P(0)) + d(P(\wp), P(1))$$

where d is the symmetric distance defined from the Kullback-Leibler divergence KL [10]: d(P(℘), P(0)) = KL(P(℘), P(0)) + KL(P(0), P(℘)). Here, P(℘) denotes the decreasing distribution of meaning frequencies for the parameter ℘. In order to define the probability distribution P(℘), two properties are imposed on the ranked frequencies p^℘_i: (i) ∑_{i=1}^{n} p^℘_i = 1; and (ii) p^℘_i > 0.001 for all i ∈ {1, ..., n}.
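Under the array representation of the earlier sketches (one matrix per agent, stacked into shape (p, n, m)), these measures admit direct counterparts. The code below is illustrative, not the authors'; the KL-based distance assumes strictly positive distributions, which is what the constraint p^℘_i > 0.001 enforces.

    def distance(Ls):
        """D(t): normalized count of entries where an agent's matrix
        differs from the population-average matrix."""
        Ls = np.asarray(Ls)                   # shape (p, n, m)
        L_bar = Ls.mean(axis=0)
        return (Ls != L_bar).sum() / Ls.size  # normalization by m*n*p

    def effective_vocabulary(Ls):
        """V(t): average fraction of words associated to >= 1 meaning."""
        Ls = np.asarray(Ls)
        used = Ls.sum(axis=2) > 0             # (p, n): is word i occupied?
        return used.sum() / used.size         # normalization by n*p

    def energy_kl(P_w, P_0, P_1):
        """E_KL(w) = d(P_w, P_0) + d(P_w, P_1), d the symmetrized KL."""
        def d(a, b):
            return np.sum(a * np.log(a / b)) + np.sum(b * np.log(b / a))
        return d(P_w, P_0) + d(P_w, P_1)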
For V(t) and D(t), the focus here is on the values after 2p × 10^4 speaker-hearer interactions, ⟨V⟩ and ⟨D⟩, averaged over 10 initial conditions and over the last 2 × 10^3 steps. Each initial condition sets every lexical matrix entry to 0 or 1 with probability 0.5. For these measures, the parameter ℘ is varied from 0 to 1 in increments of 3%.

For E_KL(℘), the average value over 10 initial conditions after 2p × 10^4 speaker-hearer interactions is reported. For this measure, ℘ is varied from 0 to 1 in increments of 1%.

On a population of size p = 64, each agent is endowed with a lexical matrix formed by n = 64 words and m = 64 meanings.
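Putting the pieces together, one realization of the experiment could be driven as follows (a sketch under the stated settings, reusing the functions defined in the earlier sketches):

    def simulate(p=64, n=64, m=64, p_amb=0.5, steps=None, seed=0):
        """One run: p agents with random 0/1 matrices of size n x m,
        playing 2p * 10^4 speaker-hearer interactions by default."""
        rng = np.random.default_rng(seed)
        if steps is None:
            steps = 2 * p * 10**4
        Ls = rng.integers(0, 2, size=(p, n, m))
        choose = lambda L, m_star, r: relative_word(L, m_star, p_amb, r)
        for _ in range(steps):
            s, h = rng.choice(p, size=2, replace=False)  # (step 1)
            interaction(Ls[s], Ls[h], choose, rng)       # (step 2R) + (step 3)
        return Ls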
3 Results
3.1 Speaker’s interest optimization: (step 2S) and (step 2H)
First of all, for each value of ℘, ⟨D⟩ ≈ 0, as shown in the embedded plot of Fig. 1; that is, the population reaches agreement in every case. The negotiation dynamics defined by (step 2S) and (step 2H), however, develop very simple vocabularies. As shown in Fig. 1, since ⟨V⟩ ≈ 0, the language game under (step 2S) leads to a vocabulary in which only one word is being used (for a schematic representation, see Fig. 3 (right)). By contrast, the dynamics under (step 2H) develops a vocabulary close to a one-to-one mapping between words and meanings, since ⟨V⟩ ≈ 0.9 (see Fig. 3 (left)).
3.2 Relative interests of speakers and hearers: (step 2R)
Several aspects are remarkable for the relative interests of speakers and hearers, as shown in Fig. 1 and Fig. 3 (center). Intermediate values of the parameter ℘ ∈ (0, 1) lead to drastic changes between the two idealized communication systems preferred respectively by speakers and hearers. Around the critical parameter ℘∗ ≈ 0.5, the dynamics establishes three phases in the behavior of ⟨V⟩ versus ℘. First, for ℘ < 0.4 the size of the effective vocabulary exhibits a slow decrease from the value ⟨V⟩ ≈ 0.9. Next, for ℘ ∈ (0.4, 0.6) a drastic decrease of ⟨V⟩ is found. Finally, for ℘ > 0.6 there is a slow decrease towards the value ⟨V⟩ ≈ 0.

One of the most interesting results is summarized by Fig. 2 (top). Around the critical parameter ℘∗, the energy-like function E_KL is minimized. Thus, the distribution of meaning frequencies at ℘∗ seems to become an intermediate communication system sharing features with both idealized vocabularies emerging from (step 2S) and (step 2H) (see Fig. 3 (center)).
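The ranked distribution P(k) shown in Fig. 2 (bottom) and its fitted exponent can be approximated as sketched below. The paper does not state exactly how meaning counts are aggregated across agents, so counting them over the whole population is an assumption of this sketch.

    def ranked_distribution(Ls, floor=0.001):
        """P(k): normalized meaning counts per word, ranked decreasingly,
        keeping only frequencies above the floor imposed in 'Measures'."""
        counts = np.asarray(Ls).sum(axis=(0, 2))   # meanings per word
        freqs = np.sort(counts / counts.sum())[::-1]
        return freqs[freqs > floor]

    def power_law_exponent(P):
        """Least-squares slope of log P(k) versus log k (approx. -alpha)."""
        k = np.arange(1, P.size + 1)
        slope, _ = np.polyfit(np.log(k), np.log(P), 1)
        return slope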
4 Conclusion
This work summarizes a decentralized agent-based approach to the origins of scaling properties in human language. In particular, the paper describes the influence of a parameter that measures the agents' lexical interests during the language game dynamics. The appearance of an intermediate distribution of frequencies at some critical values of the parameter suggests that, on a population of artificial agents, the emergence of scaling partly arises as a self-organized process driven only by local interactions between agents endowed with intermediate levels of lexical interest (for stronger evidence of scaling, see Fig. 2 (bottom)). In some sense, if cooperation is understood as the capacity of selfish agents to forgo some of their potential to help one another [11], the emergence of scaling is crucially influenced by the cooperation between agents.

Many extensions of the proposed model should be studied in order to increase the complexity of the language emergence task. A first natural extension is to develop more extensive computational simulations involving larger populations of agents. A second extension should describe other ways to define intermediate agents' interests.
Acknowledgments
The authors thank Fondequip AIC-34.
References

[1] Solé RV, Corominas-Murtra B, Valverde S, Steels L. Language networks: Their structure, function, and evolution. Complexity. 2010;15(6):20–26.

[2] Zipf G. Human Behaviour and the Principle of Least-Effort. Cambridge, MA: Addison-Wesley; 1949.

[3] Ferrer-i-Cancho R, Solé RV. Least Effort and the Origins of Scaling in Human Language. Proceedings of the National Academy of Sciences (USA). 2003;100:788–791.

[4] Loreto V, Baronchelli A, Mukherjee A, Puglisi A, Tria F. Statistical physics of language dynamics. Journal of Statistical Mechanics: Theory and Experiment. 2011;2011(04):P04006.

[5] Steels L. A Self-Organizing Spatial Vocabulary. Artificial Life. 1995;2(3):319–332.

[6] Steels L. Self-organizing vocabularies. In: Proceedings of Artificial Life V, Nara, Japan; 1996. p. 179–184.

[7] Baronchelli A, Felici M, Caglioti E, Loreto V, Steels L. Sharp Transition towards Shared Vocabularies in Multi-Agent Systems. J Stat Mech. 2006;(P06014).

[8] Steels L. Modeling the cultural evolution of language. Physics of Life Reviews. 2011;8(4):339–356.

[9] De Beule J, De Vylder B, Belpaeme T. A cross-situational learning algorithm for damping homonymy in the guessing game. In: Rocha LM, Yaeger LS, Bedau MA, Floreano D, Goldstone RL, Vespignani A, editors. Artificial Life X: Proceedings of the Tenth International Conference on the Simulation and Synthesis of Living Systems. International Society for Artificial Life. The MIT Press (Bradford Books); 2006. p. 466–472.
[10] Bishop CM. Pattern Recognition and Machine Learning. Information Science and Statistics. Springer; 2006.

[11] Nowak M. Five Rules for the Evolution of Cooperation. Science. 2006;314(5805):1560–1563.
Figure 1: ⟨V⟩ and ⟨D⟩ versus ℘. On a population of p = 64 agents, each one endowed with a 64 × 64 lexical matrix, the measures ⟨V⟩ and ⟨D⟩ (small embedded plot) versus ℘ are shown. Vertical bars indicate the standard deviation of the data.

Figure 2: Appearance of an intermediate frequency distribution in vocabularies. (top) On a population of p = 64 agents, each one endowed with a 64 × 64 lexical matrix, the energy-like function E_KL versus ℘ is shown. The minimization of E_KL occurs at ℘ ≈ 0.55. (bottom) P(k) versus k, for ℘ = 0.3, ℘ = 0.8 and the value of ℘ that gives the power-law exponent closest to 1 (℘∗ = 0.52, α∗ = −1.08). The plot shows the distribution of the number of meanings associated to the k-ranked word of the effective vocabulary, P(k), versus k (log-log plot). Black lines indicate least-squares fits. The calculations average over ten initial conditions. At the critical parameter ℘∗, the distribution restricted to the words associated to at least one meaning follows P(k) ∼ k^{−α∗}, with α∗ = 1.08.
Figure 3: Lexical matrices for different values of ℘. After the final configurations are reached (as in Fig. 1), three lexical matrices of size 64 × 64 are chosen as examples of shared vocabularies, for ℘ = 0 (left), ℘ = 0.5 (center) and ℘ = 1 (right). Black squares represent ones; white spaces represent zeros.