(1) Sony Computer Science Laboratory - Paris
(2) VUB Artificial Intelligence Laboratory – Brussels
We have been conducting large-scale public experiments with artificial robotic agents to explore what the necessary and sufficient prerequisites are for word-meaning to evolve autonomously through a self-organised process (see [Steels 97] for an overview of this new research field). The experiments employ an open-ended set of visually grounded autonomous robotic agents which play language games with each other about scenes containing geometrical objects before them (details about the cognitive architecture of the agents can be found in [Steels and Kaplan 1999]). The robots are located in different places in the world (Paris, Brussels, Tokyo, Antwerpen, Lausanne, San Jose, etc.) and are connected through the Internet. Agents are created by human users and can teleport between the different locations. Using a web-page (http://talking-heads.csl.sony.fr/) anyone can follow the experiment and interact with the agents to explore human influence on the emerging artificial language.
Our first `Talking Heads' experiment ran for four months during the summer of 1999 and demonstrated the validity of the mechanisms used in the agent architecture, as well as of the interaction patterns and group dynamics of the agents. A shared lexicon and its underlying ontology emerged after a few days, enabling the agents to communicate successfully about the scenes before them. In total, 400,000 grounded games were played. The population of agents grew steadily over the four-month period, reaching 1500. Despite the many perturbations due to grounding, intermittent technical failures, a continuous influx of new agents, and unpredictable human interaction, the lexicon was maintained throughout the period. A total of 8000 words and 500 concepts were created, with a core vocabulary of 200 basic words expressing concepts like up, down, left, right, green, red, large, and small.
The goal of this paper is to identify the factors that we found to be crucial for the success of the experiment. These can be grouped into two subsets: factors relating to the individual architecture of the agents and factors relating to the group dynamics and the environments encountered.
Agents must be able to engage in coordinated interactions. This means that they must be able to have shared goals and a willingness to cooperate. To enable a coordinated interaction, each agent must be able to follow a script of actions in agreement with a shared protocol, and have a way to see whether the goal of the interaction has been satisfied. In our experiment, we simply assumed this capability and explicitly programmed into each agent the scripts achieving the desired cooperative interaction. Emergence of cooperation is not addressed in this research.
Agents must have parallel non-verbal ways to achieve the goals of verbal interactions. The goal that we have chosen for the interaction between the agents is to draw attention through verbal means to an object in a visually perceived reality. There are of course many other things humans do with language, but this is surely one of them and a prerequisite for more sophisticated verbal exchanges. We have found that it is crucial that the agents have a non-verbal way to achieve the same goal: by pointing, gaze following, grasping, etc. This alternative way must be sufficiently reliable, at least initially when the system is bootstrapping from scratch. Once the language system is in place, however, external behavioral feedback becomes less crucial and may be absent altogether.
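As a concrete illustration, the core interaction (a guessing game with non-verbal repair) can be sketched as follows. This is a minimal stand-in, not the architecture of [Steels and Kaplan 1999]: all names are ours, a "meaning" is collapsed to the object itself for brevity, and the hearer's repair on failure plays the role of the speaker pointing at the topic.

```python
import random

class Agent:
    """Minimal stand-in: real agents ground meanings in vision; here a
    'meaning' is just the object itself (illustrative only)."""
    def __init__(self):
        self.vocab = {}                  # meaning -> word (speaker role)
        self.known = {}                  # word -> meaning (hearer role)

    def name(self, meaning):
        if meaning not in self.vocab:    # coin a new word if needed
            self.vocab[meaning] = f"w{random.randrange(10**6)}"
        return self.vocab[meaning]

    def interpret(self, word):
        return self.known.get(word)      # None for an unknown word

def guessing_game(speaker, hearer, scene):
    """One round of the shared script: verbal attempt, non-verbal repair."""
    topic = random.choice(scene)         # shared goal: attend to the topic
    word = speaker.name(topic)           # verbal means
    guess = hearer.interpret(word)
    if guess != topic:                   # failure: speaker 'points' at the topic
        hearer.known[word] = topic       # hearer learns from the pointing
        return False
    return True
```

After enough repaired failures the hearer has adopted a word for every object, and the verbal channel alone suffices, mirroring the observation that behavioral feedback matters mostly during bootstrapping.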
Agents must have ways to conceptualise reality and to form these conceptualisations, constrained by the ontology underlying the emerging lexicon and the types of situations they encounter. Obviously, conceptualisation precedes verbalisation. Words (even proper names) express categories as opposed to names of specific situations, but the repertoire of concepts need not and cannot be fixed in advance. There are in principle many equally effective possible ways to conceptualise reality. So there must be a concept acquisition process, for which we found important constraints:
First of all, the concept formation processes of the agents must be based on similar sensory channels and result in similar structures (even though many possible solutions remain). At present we have incorporated this constraint by giving each agent the same low-level sensory apparatus and by assuming binary discrimination trees for the agents' conceptual repertoires. Conceptualisation schemes based on randomly structured discrimination trees, prototypes, or inductive neural networks are adequate for finding a distinctive conceptualisation, but they result in larger differences between the repertoires of the agents, and it is therefore more difficult to reach coherence in the group. The strength of this constraint needs to be explored further.
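A binary discrimination tree of the kind assumed here can be sketched as follows. This is a minimal illustration over a single sensory channel with values normalised to [0,1]; real agents maintain one tree per channel, and all names are ours.

```python
class DiscriminationTree:
    """Binary discrimination tree over one sensory channel in [0, 1].
    A leaf interval is a category; failure to discriminate grows the tree."""
    def __init__(self, lo=0.0, hi=1.0):
        self.lo, self.hi = lo, hi
        self.children = None                 # leaf until refined

    def categorise(self, value):
        """Return the leaf interval (category) containing value."""
        if self.children is None:
            return (self.lo, self.hi)
        mid = (self.lo + self.hi) / 2
        return self.children[value >= mid].categorise(value)

    def refine(self, value):
        """Split the leaf containing value into two half-intervals."""
        mid = (self.lo + self.hi) / 2
        if self.children is None:
            self.children = (DiscriminationTree(self.lo, mid),
                             DiscriminationTree(mid, self.hi))
        else:
            self.children[value >= mid].refine(value)

def discriminate(tree, topic, others):
    """A category is distinctive if it holds for the topic but for no
    other object in the scene; on failure, refine and try again later."""
    cat = tree.categorise(topic)
    if all(tree.categorise(o) != cat for o in others):
        return cat
    tree.refine(topic)
    return None
```

Because every agent splits intervals in the same deterministic way, two agents exposed to similar scenes grow similar trees, which is exactly the coherence pressure described above.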
Second, the conceptualisation for a particular game must itself also be constrained (even if there is a more or less shared repertoire) so that the agents have a reasonable chance to guess the conceptualisation that a speaker may have used. We have achieved this by using saliency: sensory differences that stand out more will be preferred for conceptualising the scene, thus reducing the search space for the meaning of unknown words.
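As an illustration, saliency-based channel selection can be as simple as preferring the sensory channel on which the topic stands out most from every other object in the scene. The definition below is a simplified sketch, not the exact measure used in the experiment.

```python
def most_salient_channel(topic, others):
    """topic/others: dicts mapping channel name -> normalised sensor value.
    Saliency of a channel = the topic's smallest distance to any other
    object on that channel; the most salient channel is preferred."""
    def saliency(channel):
        return min(abs(topic[channel] - o[channel]) for o in others)
    return max(topic, key=saliency)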
Agents must have ways to recognise signals and reproduce them. This is quite obvious, because otherwise words would be confused all the time. In our experiment we have simply given the agents the capability to recognise and reproduce each other's signals perfectly. Other work in our group studies how a repertoire of signals may itself become shared through an imitation game, and what the impact is of errors in recognition or reproduction.
Agents must have the ability to discover what the strongest associations (between words and meanings) are in the group. The associative memory of an agent must be two-way (from words to meanings and from meanings to words), must handle multiple competing associations (one word–many meanings, one meaning–many words), and must keep track of a score that represents how well each association has been doing based on the agent's own past experience. When a decision must be made (which word to use, which meaning to prefer), there is an internal competition between different associations and the one with the highest score wins. All this can be achieved with a quite general associative memory mechanism. This mechanism could equally be used for other tasks, such as associating physical locations with sources of food, so we can assume it to be a standard part of the neural machinery of humans.
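Such a two-way scored associative memory can be sketched as follows. The class name, the initial score, and the update increment are illustrative choices, not the exact values used in the experiment; the lateral inhibition of competing associations on success is what drives the internal competition described above.

```python
class AssociativeMemory:
    """Two-way word <-> meaning store with per-association scores."""
    def __init__(self):
        self.score = {}                       # (word, meaning) -> score in [0, 1]

    def add(self, word, meaning, score=0.5):
        self.score.setdefault((word, meaning), score)

    def best_word(self, meaning):
        """Word -> meaning competition: highest-scoring word for a meaning."""
        pairs = [(s, w) for (w, m), s in self.score.items() if m == meaning]
        return max(pairs)[1] if pairs else None

    def best_meaning(self, word):
        """Meaning -> word competition: highest-scoring meaning for a word."""
        pairs = [(s, m) for (w, m), s in self.score.items() if w == word]
        return max(pairs)[1] if pairs else None

    def update(self, word, meaning, success, delta=0.1):
        """Reinforce the used pair on success and laterally inhibit all
        competing pairs sharing its word or meaning; demote it on failure."""
        key = (word, meaning)
        if success:
            self.score[key] = min(1.0, self.score[key] + delta)
            for (w, m) in self.score:
                if (w == word or m == meaning) and (w, m) != key:
                    self.score[(w, m)] = max(0.0, self.score[(w, m)] - delta)
        else:
            self.score[key] = max(0.0, self.score[key] - delta)
```

Repeated success thus lets one association overtake an initially stronger synonym or homonym, without the agent ever inspecting another agent's memory.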
It is perhaps also important to point out which factors we did NOT incorporate:
a. No theory of mind. There is a widespread belief that verbal communication requires a strong theory of mind of the other agents before verbal interactions are possible. In our experiment, this is not the case, even though for more sophisticated language games (such as for phrases like "I believe that you know the name of this woman.") it is obviously required. However to get going, it is sufficient that agents follow specific protocols of interaction. They do not need to know why these protocols are successful. (Just like a child does not need explicit knowledge of theories of physics to throw a ball but just has to acquire the appropriate behaviors compatible with these laws.)
b. No prior ontology. There is also a widespread assumption that concepts (particularly the perceptually grounded concepts that are the focus of our experiment) need to be shared prior to and independent of language. For some cognitive researchers this implies that they are innate. For others, it suggests that they are acquired through a universal inductive mechanism that yields the same concepts for all agents. We do not assume a prior ontology in our experiments and in fact believe this to be impossible given the adaptive nature of verbal communication. Instead we have set up a strong interaction between language acquisition and concept formation: The ontology develops in a selectionist fashion under pressure from the language and concepts which have no success in verbal interaction are not encouraged.
c. No telepathy. We have not assumed that agents have a way of knowing what meaning the speaker transmits independently of language. Although non-verbal communication, similarity of sensors, shared history of past experiences, saliency, etc. help to restrict the set of possible meanings, the hearer can only guess what the speaker meant. Neither have we assumed that agents have exactly the same perception. Usually raw perception and consequently derived sensory features are different. Equal perception is of course an unrealistic assumption for embodied agents because each agent sees the scene from a different point of view.
The group dynamics must exhibit self-organisation so as to reach ontological and lexical coherence. We have achieved this in two ways. First, the agents have been made sensitive to the statistical spread of word-meaning pairs in the population of individuals with which they interact, by individually maintaining word-meaning scores (cf. the associative memory described earlier). Equally important is that they then use the associations with the highest scores, because these give the most success in the game. This creates a positive feedback loop between use and success: the more success a word-meaning pair has, the higher its score in each agent, and the more this word-meaning pair will be used in the future.
Second, a structural coupling has been established in each agent between the ontology and the lexicon. Each are independently developing processes but the lexicon gives feedback to the ontology on whether the conceptual repertoire is adequate for verbal interaction, and the ontology gives feedback to the lexicon proposing various possible conceptualisations. We see in the experiment that through this structural coupling agents settle on a shared ontology which is adapted to the environments they encounter, and do so without the need for innateness or a universal inductive process.
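The positive feedback loop between use and success can be illustrated with a stripped-down simulation: agents repeatedly play naming games about a single shared meaning, each choosing its highest-scoring word, reinforcing it on success while laterally inhibiting competitors. The dynamics and parameters below are simplified stand-ins for those of the experiment, not the actual implementation.

```python
import random

def naming_games(n_agents=10, rounds=2000, delta=0.1, seed=1):
    """Toy self-organisation demo for one shared meaning.
    Each agent keeps a private lexicon mapping word -> score."""
    random.seed(seed)
    lexicons = [{} for _ in range(n_agents)]
    for _ in range(rounds):
        s, h = random.sample(range(n_agents), 2)     # random speaker, hearer
        if not lexicons[s]:                          # empty lexicon: coin a word
            lexicons[s][f"w{random.randrange(10**6)}"] = 0.5
        word = max(lexicons[s], key=lexicons[s].get) # highest score wins
        if word in lexicons[h]:                      # success: reinforce, inhibit
            for lex in (lexicons[s], lexicons[h]):
                lex[word] = min(1.0, lex[word] + delta)
                for w in lex:
                    if w != word:
                        lex[w] = max(0.0, lex[w] - delta)
        else:                                        # failure: demote and adopt
            lexicons[s][word] = max(0.0, lexicons[s][word] - delta)
            lexicons[h][word] = 0.5
    # each agent's currently preferred word (None if it never interacted)
    return [max(lex, key=lex.get) if lex else None for lex in lexicons]
```

Running this typically shows the group settling on a single dominant word: use breeds success, success raises scores, and higher scores breed further use, with no agent ever having a global view.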
There must be sufficient group stability to enable a sufficient set of encounters between agents. We have found (in simulations) that if the influx and outflux of agents are too rapid, a lexicon collapses: there is not enough time for new members to acquire the conventions (so they build their own), and older members leave so quickly that the population retains no memory of the existing conventions. The exact critical levels of the fluxes depend on the size of the population. In the experiment, critical levels were not reached despite quite large changes in the population of active language users.
Another related constraint is that initial group size should not be too large so that there are enough encounters between the same individuals. Once a lexicon is in place however, there can be an almost unbounded increase in the population. The base lexicon in the experiment was created by a group of about 20 agents and then spread to the rest of the population, which eventually reached 1500 agents.
We have noted that sublexicons form when there is geographical separation, causing less opportunity for interaction, and that phenomena familiar in studies of language contact start to appear when the role of geographical separation is diminished.
There must be sufficient environmental stability and different degrees of complexity. The environments encountered by the agents and perceivable by the agents through their sensory apparatus must have certain invariant structural properties so that concepts can form and word-meaning pairs can settle. This does not mean that the environment needs to be closed (indeed it should not be if we want to be realistic), nor even that the sensory space should be closed (new sensory routines surely develop in the child even after she has acquired the first words).
We found that if the agents encounter only complex scenes, they cannot settle on a successful repertoire or at least they have much more difficulty due to unstable concepts. So there must be scenes, at least initially, which can be handled by making simple distinctions (such as between left and right). The learning environments of children exhibit this kind of graded complexity as well, partly because many sensory capabilities are initially not available thus simplifying the world.
For the external factors, too, we can rule out prerequisites that are often assumed. Based on our experience, they need not and often cannot be assumed if we want a realistic model:
a. No global view nor central control. A central puzzle in the origins of language is how a population of distributed autonomous agents can reach coherence without a central controlling organism and without access to a global view by the individual agents. A model should never introduce such a global control point. We have abundantly shown that self-organisation is perfectly adequate to explain language coherence without it.
b. No total coherence. It is often assumed that all individuals have exactly the same linguistic competence and that deviations are only due to performance errors. We have shown that this is not necessary. The conceptualisations and lexicons of the individual agents in the experiment were NEVER exactly the same. They had different degrees of knowledge and there were unavoidable individual differences due to the absence of a global view. The experiment shows that communicative success can nevertheless be reached without this absolute coherence. For example, words can often be maintained in a polysemous state without causing confusion in a series of environments, while synonyms are tolerated because agents can understand words that they themselves might not necessarily choose to use.
This paper is an attempt to show how experiments based on software simulations or robotic set-ups, like the Talking Heads experiment, can play an important role in the debate on the origin and evolution of human languages. In a field where "real" experimentation is not possible, this type of experiment makes it possible to compare hypotheses and to test through models which factors are crucial, and which contingent, for achieving a communication system. It will be exciting to see what we now need to add to witness the emergence and complexification of grammar.
[Steels 97] Steels, L. (1997) The synthetic modeling of language origins. Evolution of Communication, 1(1): 1-34.
[Steels and Kaplan 1999] Steels, L. and Kaplan, F. (1999) Situated grounded word semantics. In: Proceedings of IJCAI 99, pp. 862-867.
Conference site: http://www.infres.enst.fr/confs/evolang/