Reliable Multi-agent Systems with Persistent
Publish/Subscribe Messaging
Milovan Tosic, Arkady Zaslavsky
School of Computer Science and Software Engineering,
Monash University
900 Dandenong Road
Caulfield East, Victoria 3145
Australia
{Milovan.Tosic,Arkady.Zaslavsky}@csse.monash.edu.au
Abstract. A persistent publish/subscribe messaging model allows the creation
of an application-independent fault-tolerant layer for multi-agent systems. We
propose a layer which is capable of supporting heterogeneous agent platforms
from different vendors. This layer is a three-tier application, which is accessible
from multi-agent systems via web-services or a persistent publish/subscribe
messaging system. We describe the design of the fault-tolerant layer and its
messaging system, as well as the fault-recovery algorithm used in the
case of agent and/or host death. We also present a performance analysis of the
proposed solution, to justify its use in systems which demand different levels of
reliability.
Keywords: autonomous agents, distributed problem solving, multi-agent
systems, reliability, fault-tolerance
1 Introduction
Agents, as autonomous software entities which perform activities in a dynamic
environment, can effectively be used in certain applications. They are able to sense
their environment and conduct a set of activities to achieve the goals for which they
were developed. Their social and coordination skills can help them solve problems
which are too complex to be solved by a single agent. Mobility is a feature that allows
agents to move between hosts and perform their activities locally, at a data source.
Hosts provide agents with execution contexts, services and resources needed for task
accomplishment.
Even though agents can be simple, their cooperation in multi-agent systems can
form complex relationships. Multi-agent systems have proven their effectiveness in
the areas of e-commerce, pervasive computing, artificial intelligence,
telecommunications etc. However, they are also prone to weaknesses of their most
unreliable components. We have to achieve a satisfactory level of multi-agent system
reliability in order to be able to justify their application. A reliable system is able to
perform its tasks under conditions which exist in its environment, even if those
conditions cause software or hardware faults.
The main contribution of this paper, the proposed External Fault-Tolerant Layer
(EFTL), is able to improve the reliability of supported multi-agent systems. It is an
application- and domain-independent solution supporting heterogeneous multi-agent
systems based on agent platforms from different vendors. It is capable of supporting
multiple agent systems simultaneously. Its support is negotiable, and the acceptance
of its support depends on the estimated support costs. The EFTL fault-recovery
procedures produce minimal overheads due to the use of context-aware messaging
components, which conform to the persistent publish/subscribe Java Message Service
(JMS) standard. EFTL is capable of solving the problems caused by agent and host
death, agent unresponsiveness, agent migration faults, certain communication
problems and faults caused by resource unavailability.
This paper is organized as follows: firstly, we present related work from the area
of multi-agent system reliability. Then, we describe the architecture of EFTL. The
fourth section focuses on the EFTL messaging system design, while the fifth section
describes a recovery procedure used in the case of agent and/or host death. The last
sections of this paper will present a performance analysis of EFTL, the conclusions
and motivations for future work.
2 Related Work
We identify relevant groups of approaches which handle the sources of system
failures. The biggest group is the one that handles the reliability of an agent as an
individual entity. Some authors proposed checkpointing as a procedure which saves
agent states to a persistent storage medium at certain time intervals. Later, if an agent
fails, its state can be reconstructed from the latest checkpoint [2]. This approach
depends on the reliability of hosts because we have the so-called blocking problem
when the host fails. The agents which have been saved at a particular host can be
recovered only after the recovery of that host [9]. The second approach that tries to
ensure an agent’s reliability is replication. In this approach, there are groups of agents
which exist as replicas of one agent, and can be chosen to act as the main agent in
case of its failure. In order to preserve the same view of the environment for all the
members of the replica group, the concept of a group proxy has been proposed in [4].
A group proxy is the agent that acts as a proxy through which all the interactions
between the group, and the environment, have to pass. When the proxy agent
approach is broadened with the primary agent concept, as in [12, 13], then the
primary agent is the only one which performs computations until it fails. After
the failure, the remaining replicas elect a new primary agent from their group.
In order to watch the execution of an agent from an external entity, some authors
proposed the use of supervisor and executor agents [3, 8, 11]. The supervisor agents
watch the execution of the problem-solving agents, detect conditions which
are failures or can lead to failures, and react to the detected conditions. Hosts can also be
used as the components of fault-tolerant systems, as in [1]. Basic services which are
provided by hosts can be extended by some services which help the agents achieve a
desirable level of reliability. Depending on its implementation, a fault-tolerant
system may not be able to cope with all kinds of failures. In order to determine the feasibility
of the recovery, Grantner et al. proposed the use of fuzzy logic [5].
An approach that is also a type of execution monitoring is presented in [6].
Kaminka and Tambe focused on the monitoring of multiple agents using a centralised
approach, with a single monitor agent, or a distributed approach, where problem-solving
agents monitor each other. These authors introduced Socially Attentive
Monitoring, where they detected irregularities in agent relationships, not in the
fulfilment of their goals.
The benefits of the publish/subscribe messaging model in mobile computing have
been presented in [10]. Their approach specifically concentrates on context-aware
messaging, where an agent can subscribe to receive only the messages which satisfy
its subscription filter. This solution leads us to a highly effective notification
mechanism for the mobile agents.
Klein and Dellarocas introduced a fault-tolerant application-independent solution
in [7]. They made a clear distinction between the problem-solving and the exception-handling
agents. Their solution can be applied to any application domain with only
small changes in the problem-solving agents, and is based on exception-handling
services. The problem-solving agents have to implement a set of interfaces in order to cooperate with
the exception-handling agents. They also have to register their normal behavioural
patterns with the fault-tolerant layer. Then the fault-tolerant layer is able to locate
these behavioural patterns in its exception knowledge database.
3 EFTL Design
3.1 The EFTL Conceptual Architecture
EFTL’s main design goal is to improve the reliability of multi-agent systems. Its
messaging system is designed to reduce messaging overheads. EFTL
is capable of providing its services to more than one system at the same time. In its
current state of development, EFTL improves reliability of systems implemented in
the JADE¹ and Grasshopper² agent platforms.
Since every fault-tolerant solution is also prone to faults, we designed EFTL to
work in an environment in which it would be able to inherit scalability and robustness
of the underlying J2EE application, HTTP web and messaging servers. In this paper, the
term J2EE application server is used for a server program which hosts Enterprise Java
Beans (EJBs) and provides a range of services to client applications. In order to
provide the system with reliable communications and not to restrict agent autonomy,
EFTL uses our altered persistent publish/subscribe messaging model which
guarantees the exactly-once consumption of messages. In addition, EFTL’s support for
multi-agent systems is negotiable. An application can decide if the EFTL support
costs are acceptable and can sign a support contract with EFTL. All the negotiations
are performed via the EFTL web-services interface. The following sections of the
paper describe some of the EFTL components, while the overall diagram of the
system is presented in Fig. 1.
¹ http://jade.cselt.it
² http://www.grasshopper.de
Fig. 1. Architecture of EFTL
3.2 The EFTL Components
The Fault-Tolerant System Manager (FTSM) is a component identified as the central
functioning unit of EFTL. We developed it as a group of EJBs which are deployed in
a J2EE application server. FTSM regularly checks whether certain conditions are
being met in supported multi-agent systems, and reacts upon the discovery of any
events which may be important for system reliability. It issues commands which must
be performed by agents in order to improve the reliability of the system to which they
belong. FTSM is a stateful component which saves all the data that describes multi-agent
systems to the EFTL database.
The Reliable Agent Layer (RAL) is a platform-dependent component and a
mandatory layer of each agent that is supported by EFTL. We have developed RAL
for the JADE and Grasshopper agent platforms. This layer depends upon the
properties which describe the data needed for this layer to cooperate with the rest of
EFTL. Since one instance of FTSM is able to control more than one agent platform,
both from the same and from different vendors, RAL provides FTSM with the data
used to differentiate between those platforms. RAL performs activities at agent level,
and they are initiated in order to improve the reliability of a particular agent or other
agents in the system.
Our Messaging System Management Module (MSMM) is another component
which is deployed in a J2EE application server. It is used to connect directly to a
messaging server and to perform the creation or removal of messaging system users.
When a new user is created, its credentials are forwarded to an agent’s RAL. Then,
the agent can make a durable subscription to the messaging topic of interest, and
communicate with FTSM.
The Platform Listener’s purpose is to detect the system-wide events which are
important from a reliability viewpoint. These events can be the changes in an agent’s
life cycle: start-up, transfer, suspension, blocking, death etc. Following the detection
of these events, the Platform Listener notifies FTSM about them. FTSM can then
decide if any of the events are faults or can lead to faults, and can react by ordering
agents to perform certain recovery activities.
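The interaction between the Platform Listener and FTSM can be illustrated with a minimal sketch. The event names follow the life-cycle changes listed above; the class names, the `notify` method and, in particular, the rule that maps events to decisions are assumptions made for illustration and are not taken from the EFTL implementation.

```python
from enum import Enum, auto


class LifeCycleEvent(Enum):
    """Agent life-cycle changes named in the text."""
    START_UP = auto()
    TRANSFER = auto()
    SUSPENSION = auto()
    BLOCKING = auto()
    DEATH = auto()


class FTSM:
    """Hypothetical stand-in for the Fault-Tolerant System Manager."""

    # Assumption: death is treated as a fault, while suspension and
    # blocking are treated as events which can lead to a fault.
    FAULTS = {LifeCycleEvent.DEATH}
    POTENTIAL_FAULTS = {LifeCycleEvent.SUSPENSION, LifeCycleEvent.BLOCKING}

    def __init__(self):
        self.log = []  # (agent_id, event, decision) tuples

    def notify(self, agent_id, event):
        """Classify a reported event and record the decision."""
        if event in self.FAULTS:
            decision = "recover"
        elif event in self.POTENTIAL_FAULTS:
            decision = "watch"
        else:
            decision = "ignore"
        self.log.append((agent_id, event, decision))
        return decision


class PlatformListener:
    """Forwards every detected life-cycle event to FTSM."""

    def __init__(self, ftsm):
        self.ftsm = ftsm

    def on_event(self, agent_id, event):
        return self.ftsm.notify(agent_id, event)
```

The listener itself makes no decisions; it only reports, which keeps the platform-dependent part small and leaves the fault classification to FTSM.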
4 The EFTL Messaging System Design
The publish/subscribe messaging model is commonly used for asynchronous
communication between publishers and subscribers in a distributed system. A
publisher sends its message to a specific JMS topic at a message broker, which in turn
forwards the message to all the topic subscribers.
The persistent publish/subscribe messaging model guarantees the delivery of
messages to mobile agents, since all the messages are saved to a persistent storage
medium before they are forwarded to subscribers. This model employs a retry
scheme for undelivered messages. In the case of a mobile agent, all the messages
sent to it while it travels between hosts are forwarded to it as soon as the
agent arrives at a new destination and reconnects to a message broker.
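The store-and-forward behaviour described above can be sketched with a toy broker. This is an in-memory illustration only, assuming a single topic; a real persistent broker writes messages to stable storage. The class and method names are hypothetical and do not correspond to the JMS API.

```python
from collections import deque


class PersistentBroker:
    """Toy single-topic broker with durable subscriptions (in memory only)."""

    def __init__(self):
        self._stored = {}     # subscriber id -> queue of undelivered messages
        self._connected = {}  # subscriber id -> delivery callback

    def subscribe(self, subscriber_id):
        # A durable subscription outlives any single connection.
        self._stored.setdefault(subscriber_id, deque())

    def connect(self, subscriber_id, callback):
        self._connected[subscriber_id] = callback
        self._flush(subscriber_id)  # deliver everything stored while away

    def disconnect(self, subscriber_id):
        self._connected.pop(subscriber_id, None)

    def publish(self, message):
        # Store first, then attempt delivery to each durable subscriber.
        for subscriber_id, queue in self._stored.items():
            queue.append(message)
            self._flush(subscriber_id)

    def _flush(self, subscriber_id):
        callback = self._connected.get(subscriber_id)
        queue = self._stored.get(subscriber_id)
        if callback is None or queue is None:
            return
        while queue:
            callback(queue[0])  # deliver the oldest message
            queue.popleft()     # remove it only after delivery succeeded
```

An agent that disconnects, misses two published messages and reconnects receives both of them, in order, at the moment of reconnection.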
In EFTL, a lightweight messaging component that performs all the connecting and
disconnecting agent activities to and from message brokers, in coordination with
agent life-cycle changes, is called a Mobile Connector. Since agent platforms usually
provide application level detection of life-cycle changes, it is not hard to disconnect
an agent from a message broker before the next migration step, and to reconnect it
after arrival at a new destination.
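A minimal sketch of such a connector is given below. The hook names `before_migration` and `after_arrival` are hypothetical placeholders for whatever life-cycle callbacks an agent platform provides, and `RecordingBroker` is a stand-in used only to make the example self-contained.

```python
class RecordingBroker:
    """Hypothetical stand-in broker that only records connection calls."""

    def __init__(self, name):
        self.name = name
        self.calls = []

    def connect(self, agent_id, on_message):
        self.calls.append(("connect", agent_id))

    def disconnect(self, agent_id):
        self.calls.append(("disconnect", agent_id))


class MobileConnector:
    """Ties broker connectivity to agent life-cycle changes."""

    def __init__(self, agent_id, broker_for_host, on_message):
        self.agent_id = agent_id
        self.broker_for_host = broker_for_host  # callable: host name -> broker
        self.on_message = on_message
        self.broker = None

    def after_arrival(self, host):
        # Reconnect at the new location.
        self.broker = self.broker_for_host(host)
        self.broker.connect(self.agent_id, self.on_message)

    def before_migration(self):
        # Disconnect cleanly before the agent leaves its current host.
        if self.broker is not None:
            self.broker.disconnect(self.agent_id)
            self.broker = None
```

Because the connector is driven entirely by the two hooks, the agent's problem-solving code does not need to know when or where it is connected.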
Fig. 2. Publish/subscribe messaging system
Any reliable agent, as well as the components of FTSM which are implemented as
EJBs, can subscribe to a JMS topic, as in Fig. 2a. A subscription message, or
message selector, describes the rules which every message forwarded to
a subscriber has to satisfy. In EFTL, every message sent from FTSM has a header
with embedded information about the message recipient(s).
A message selector is a string whose syntax is similar to that of the SQL-92
standard’s WHERE clause. If a messaging system provides clients with the
possibility of message selection at its broker, as in the case of EFTL, then the clients
can be context-aware. In that case, not all the messages published to a specific JMS
topic would be sent to all subscribers, but only to those whose message selectors are
satisfied, as is the case with Agent 1 in Fig. 2b. This functionality is very important
for EFTL because it decreases the messaging overheads of the proposed fault-tolerant
solution.
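Broker-side selection can be sketched as follows. The header name `recipient` and the value `ALL` are assumptions chosen for illustration, and the selector is modelled as a Python predicate rather than parsed from a selector string.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Message:
    headers: Dict[str, str]
    body: str


@dataclass
class Subscription:
    subscriber_id: str
    selector: Callable[[Dict[str, str]], bool]
    inbox: List[Message] = field(default_factory=list)


def recipient_selector(agent_id):
    # Plays the role of a selector string such as:
    #   recipient = 'agent1' OR recipient = 'ALL'
    # (the header name and the 'ALL' value are hypothetical).
    return lambda headers: headers.get("recipient") in (agent_id, "ALL")


def publish(subscriptions, message):
    """Deliver the message only where the selector is satisfied.

    Returns the number of subscribers that received the message.
    """
    delivered = 0
    for sub in subscriptions:
        if sub.selector(message.headers):
            sub.inbox.append(message)
            delivered += 1
    return delivered
```

With two subscribed agents, a command addressed to one of them causes a single delivery instead of two, which is the source of the overhead reduction discussed above.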
A JMS-compliant persistent publish/subscribe messaging system guarantees
delivery of messages, but it does not guarantee exactly-once delivery. For example, a
message receipt acknowledgement from an agent might not arrive at a message
broker due to a link failure. The message broker will then attempt to re-send the
message to the agent. To prevent multiple consumption of the same message,
EFTL assigns each message a unique number. Agents keep track of the consumed
message numbers and can simply discard the messages which are delivered to them
more than once.
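The duplicate-discarding step can be sketched in a few lines. The class and method names are hypothetical; the sketch assumes that message numbers are hashable and that the set of consumed numbers is kept for the lifetime of the agent.

```python
class DeduplicatingConsumer:
    """Consumes each uniquely numbered message at most once."""

    def __init__(self, handler):
        self._handler = handler
        self._consumed = set()  # numbers of messages already consumed

    def on_message(self, message_id, payload):
        """Return True if the message was consumed, False if discarded."""
        if message_id in self._consumed:
            return False  # redelivered message: discard it
        self._handler(payload)
        # Record the number only after the handler has finished, so that a
        # message whose handling raised an exception can be redelivered.
        self._consumed.add(message_id)
        return True
```

Combined with the broker's at-least-once redelivery, this gives exactly-once consumption from the agent's point of view.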
5 Recovery Procedure in the Case of Agent and/or Host Death
As an example of cooperation between agents and EFTL during fault-recovery, we
present the procedure used in the case of agent and/or host death. Agent and/or host
death is a common type of fault in multi-agent systems. An agent and/or a host can
die due to a software or hardware fault. Other agents may not notice the
disappearance of the failed agent and will be able to continue with the execution of
their tasks. However, this type of fault can sometimes greatly affect the functioning of
the whole system. The failed agent might have had to undertake an important task
whose effects would be reflected in the overall system goal. Moreover, other agents
might not be able to perform their activities without cooperation with the failed agent.
RAL saves local checkpoints of its agent periodically, before each migration, and after
each resource update. If a host does not support checkpointing, which can be a
common case with handheld devices, RAL can send a compressed checkpoint, via the
EFTL messaging system, to FTSM which saves it to the EFTL database. New
checkpoints of a particular agent, at a host, overwrite its earlier checkpoints if they
exist at that host. If the Platform Listener detects agent or host death, it informs
FTSM about it. When a host dies, all the agents residing on it die as well.
The failed agents cannot be recovered from the persistent storage medium of the
failed host until that host is recovered. Blocking occurs when we
have to wait for the recovery of a host before we are able to recover its agents. EFTL
uses the algorithm described by the following pseudo-code to address this problem:
if (agent1 died at host1)
begin
    destination_host = host1;
    if (host1 is dead)
    begin
        list1 = hosts with resources most similar to host1;
        list2 = alive and reachable hosts from list1;
        if (list2 is not empty)
        begin
            list3 = sort list2 by the utilisation, in descending order;
            destination_host = the first host from list3;
        end-if
        else
        begin
            destination_host = location of agent1’s latest available checkpoint;
        end-else;
    end-if;
    agent2 = the closest agent to the location of agent1’s latest available checkpoint;
    FTSM sends command to agent2 to recover agent1 from the checkpoint;
    FTSM orders agent1 to move to destination_host;
    FTSM resends all the messages sent to agent1 from the moment of its latest
        available checkpoint to the moment of its recovery;
end-if;
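The destination selection at the start of the pseudo-code can be sketched as a small function. The `Host` record and the function name are hypothetical. Since the text does not define how resource similarity is measured, the list of similar hosts (list1) is passed in as an argument. The sort order follows the pseudo-code as written.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Host:
    name: str
    alive: bool
    reachable: bool
    utilisation: float  # assumed to be a comparable number, e.g. in [0, 1]


def choose_destination(failed_host: Host,
                       similar_hosts: List[Host],
                       checkpoint_host: Host) -> Host:
    """Pick the host on which a recovered agent should continue."""
    if failed_host.alive:
        # Only the agent died: it can continue on its original host.
        return failed_host
    # list2: alive and reachable hosts among those similar to the failed one.
    candidates = [h for h in similar_hosts if h.alive and h.reachable]
    if candidates:
        # list3: sorted by utilisation in descending order; take the first.
        ranked = sorted(candidates, key=lambda h: h.utilisation, reverse=True)
        return ranked[0]
    # No suitable host: stay where the latest checkpoint is available.
    return checkpoint_host
```

The remaining steps (recovering the agent from its checkpoint, ordering the move and resending messages) are commands issued by FTSM and are not modelled here.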
6 Performance Analysis
The reliability of multi-agent systems has to be measured differently from the
reliability of other types of distributed systems. Multi-agent systems can be described
with characteristics such as component autonomy, mobility and asynchronous
execution. Therefore, system availability, as the measure of reliability, cannot be
applied to them. We have to use another reliability model that can describe the events
which can cause multi-agent system failures and allow us to evaluate our research
proposals. As described in [8], reliability in multi-agent systems can be evaluated by
measuring the reliability of each individual agent. An agent can either successfully
complete its tasks or fail to do so. Therefore, the reliability of the whole system
depends on the percentage of agents managing to achieve their goals. Lyu and Wong
proposed that the agent tasks should be defined as scheduled round-trips in a network
of agent hosts. Only the mobile agent which managed to visit all the functioning hosts
in the network, and to arrive at the final host, can be considered a successful finisher.
Consequently, the reliability can be calculated using the following expression:
$$R = \frac{F}{M} \cdot 100\ [\%] \tag{1}$$
R – reliability
F – No. of successful finisher agents
M – No. of all mobile agents.
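As a worked instance of expression (1), the short function below computes R from F and M. The function name and the numbers in the usage note are hypothetical and are not measurements from our experiments.

```python
def reliability(finishers: int, mobile_agents: int) -> float:
    """Reliability R in percent, following expression (1): R = F / M * 100."""
    if mobile_agents <= 0:
        raise ValueError("M must be positive")
    if not 0 <= finishers <= mobile_agents:
        raise ValueError("F must lie between 0 and M")
    return 100.0 * finishers / mobile_agents
```

For example, if 45 of 50 mobile agents complete their round-trips, `reliability(45, 50)` gives 90.0, i.e. R = 90 %.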
In the first group of experiments, which were conducted in JADE, the number of
mobile agents and the number of JADE containers (hosts) were variable (10-100
agents, 5-20 containers). The mobile agents were created by a stationary agent that
was not prone to failure. At the time of instantiation, the mobile agents queried the
Agent Management System for the locations of all the containers registered within a
platform. The containers were distributed to three different servers, connected to a
local area network. We used AMD Athlon 1.67 GHz machines, each with 256 MB of
RAM. The agents attempted to visit every JADE container present in their itineraries.
They had to stay idle for five seconds in each of the visited containers. Their death
rates were changed during the course of the experiment, but the times of failure
occurrences were random. The agents were not prone to failures while they were in
the initial and final hosts. In this experiment, the containers were not prone to
failures, as we assumed that the probability of container death is much lower than the
probability of agent death.
The average results of these experiments are presented in Fig. 3. The area filled
with a line pattern represents the reliability of a system which is not supported by
EFTL. On the other hand, the solid grey area in Fig. 3 represents reliability
improvement of the same system when it is supported by EFTL. We can conclude
that EFTL considerably improves the reliability of multi-agent systems. The
reliability improvement is approximately uniform: it does not
depend on the number of agents or hosts, or on the application domain. Reliability
reaches its maximum values in conditions where the frequency and extent of faults are
not high. Reliability slightly decreases when unfavourable events occur more often.
Fig. 3. Reliability improvement
The EFTL system entities are distributed across networks, and because of their
communication, the evaluation of the costs, with which reliability improvement
comes, has to include messaging overheads. To calculate messaging overheads, we
conducted the experiments in the same environmental conditions as when we
evaluated reliability improvement, except that the numbers of agents and JADE
containers were constant (50 agents, 12 containers). These conditions allowed us to
see if there was any relationship between messaging overheads and frequency of
faults. We measured the overall size of messages exchanged via the publish/subscribe
messaging system, because it is the only messaging infrastructure that EFTL uses.
However, there was no dependence between messaging overheads and fault
frequency. This is due to small EFTL message size and the low number of commands
issued by FTSM in the case of agent and/or host death.
Another set of experiments included a variable number of mobile agents present in
the system, and a constant agent death rate of one death per 10 seconds. Fig. 4 shows
that the messaging overhead per agent declines as the number of agents in a system
increases. FTSM disperses its commands to the agents which are most suitable to
perform them. As the number of agents in a system becomes higher, the set of agents,
which can execute EFTL commands, also becomes larger. Consequently, FTSM can
choose to which agents it is going to send its commands. In that case, many of the
agents never receive any sort of command or notification from FTSM, and the
messaging overhead per agent drops.
Fig. 4. Messaging overhead per agent
7 Conclusion and Future Work
The proposed system, called EFTL, allows negotiable fault-tolerant support, based on
its costs. This solution significantly improves the reliability of supported systems.
The implemented persistent publish/subscribe messaging model guarantees the
exactly-once consumption of messages. The EFTL messaging system performs
message filtering, based on agent subscriptions, at message brokers. Therefore, it
reduces messaging overheads by introducing context-aware messaging in fault-tolerant
multi-agent systems. Our future work will focus on improving the performance
of EFTL and on supporting as many popular agent platforms as
possible, with the aim of creating a recognisable fault-tolerant system
which improves the reliability of multi-agent systems.
References
1. Dake, W.; Leguizamo, C.P.; Mori, K., Mobile agent fault tolerance in autonomous
decentralized database systems, Autonomous Decentralized System, The 2nd International
Workshop on (2002) 192 – 199
2. Dalmeijer, M.; Rietjens, E.; Hammer, D.; Aerts, A.; Soede, M., A reliable mobile agents
architecture, Object-Oriented Real-Time Distributed Computing, ISORC 98 Proceedings.
First International Symposium on (1998) 64 – 72
3. Eustace, D.; Aylett, R.S.; Gray, J.O., Combining predictive and reactive control strategies in
multi-agent systems, Control, Control '94., Volume 2., International Conference on (1994)
989 – 994
4. Fedoruk, A.; Deters, R., Improving fault-tolerance by replicating agents, Proceedings of the
first international joint conference on Autonomous agents and multiagent systems: part 2,
ACM Press New York, NY, USA, ISBN:1-58113-480-0 (2002) 737 – 744
5. Grantner, J.L.; Fodor, G.; Driankov, D., Using fuzzy logic for bounded recovery of
autonomous agents, Fuzzy Information Processing Society, NAFIPS '97., Annual Meeting
of the North American (September 1997) 317 – 322
6. Kaminka, G.; Tambe, M., I’m OK, you’re OK, we’re OK: Experiments in distributed and
centralized socially attentive monitoring, Proceedings of the third annual conference on
Autonomous Agents (April 1999) 213 – 220
7. Klein, M.; Rodriguez-Aguilar, J. A.; Dellarocas, C., Using domain-independent exception
handling services to enable robust open multi-agent systems: The case of agent death,
Proceedings of the seventh annual conference on Autonomous Agents, Kluwer Academic
Publishers, Netherlands (2003) 179 – 189
8. Lyu, R. M.; Wong, Y. T., A progressive fault tolerant mechanism in mobile agent systems
[online], Available: http://www.cse.cuhk.edu.hk/~lyu/paper_pdf/SCI2003.pdf, [Accessed
25 April 2004]
9. Mohindra, A.; Purakayastha, A.; Thati, P., Exploiting non-determinism for reliability of
mobile agent systems, Dependable Systems and Networks, DSN 2000. Proceedings
International Conference on (June 2000) 144 – 153
10. Padovitz, A.; Zaslavsky, A.; Loke, S. W., Awareness and Agility for Autonomic
Distributed Systems: Platform-Independent Publish-Subscribe Event-Based Communication
for Mobile Agents, the 1st International Workshop on Autonomic Computing Systems,
DEXA 2003, Prague, Czech Republic (September 2003)
11. Patel, R. B.; Garg, K., Fault-tolerant mobile agents computing on open networks [online],
Available: http://www.caip.rutgers.edu/~parashar/AAW-HiPC2003/patel-aaw-hipc-03.pdf,
[Accessed 18 April 2004]
12. Taesoon, P.; Ilsoo, B.; Hyunjoo, K.; Yeom, H.Y., The performance of checkpointing and
replication schemes for fault tolerant mobile agent systems, Reliable Distributed Systems,
2002. Proceedings. 21st IEEE Symposium on (October 2002) 256 – 261
13. Zhigang, W.; Binxing, F., Research on extensibility and reliability of agents in Web-based
Computing Resource Publishing, High Performance Computing in the Asia-Pacific Region,
2000. Proceedings. The Fourth International Conference/Exhibition on , Volume: 1 (May
2000) 432 – 435