PROCEEDINGS of the 22nd INTERNATIONAL SYSTEM SAFETY CONFERENCE - 2004

Characterisation of Systems of Systems Failures
Robert Alexander; Martin Hall-May; Tim Kelly; University of York; York, England
Keywords: systems of systems, safety, hazard analysis
Abstract
Systems of Systems (SoS) are systems in which the components are systems in their own right, designed separately
and capable of independent action, yet working together to achieve shared goals. SoS are increasingly commonplace,
for example Controlled Airspace and Network Centric Warfare.
The increasing role of such systems in safety-critical applications establishes the need for methods to analyse and
justify their safety. However, the essential characteristics of SoS present serious difficulties for traditional hazard
analysis techniques. Operational independence, heterogeneous composition, emergent behaviour and the desire for
dynamic reconfiguration make conventional hazard analysis impractical. For example, it is simply not possible
to exhaustively enumerate all of the possible interactions that might take place in an SoS of any considerable
complexity.
Through systematic consideration of the characteristics of SoS, this paper classifies the types of failures associated
with these systems. For example, the fact that components have some level of autonomy carries with it potential
hazards arising from conflicts of responsibility.
Understanding of the failure modes of SoS can serve as a basis for improved identification and understanding of
SoS hazards. By means of an example — the shooting down of Black Hawk helicopters by friendly fire during
Operation Provide Comfort — this paper illustrates how these failure modes can be seen to be key contributors in
historical SoS accidents.
Introduction
Traditionally, improvements in the safety of systems and in safety engineering itself have been largely motivated
and informed by accidents. Each time there has been a serious accident, engineers have worked to prevent similar
accidents occurring in the future.
Although this learning from failure remains an important part of safety engineering, it is no longer acceptable to
rely on it. One reason is that public attitudes towards safety have changed, such that in many cases even a small
perceived risk is politically unacceptable. Another is that engineers are now building systems that are more complex
than ever before, that operate over a much wider area and are potentially much more dangerous.
The class of systems commonly described as Systems of Systems (SoS) exhibit this latter problem. Examples of
such systems are Air Traffic Control and Network Centric Warfare. These examples feature mobile components
distributed over a large area, such as a region, country or entire continent. The components frequently interact with
each other in an ad-hoc fashion, and have the potential to cause large-scale destruction and injury.
It follows that for SoS that are being built now, safety has a high priority. However, relying on learning from
accidents is not acceptable, given the combination of the modern political climate and the high destructive potential
of SoS. Ethical issues aside, the requisite number of accidents would entail massive legal costs and punitive
legislation.
Therefore, there is a need to ensure the safety of SoS without access to knowledge from a large number of accidents.
It must be possible to perform hazard analysis, design the systems to be safe, and perform safety analysis to verify
the design, all using knowledge from sources other than accidents.
All SoS share certain characteristics, such as having an overall system purpose and having components with a
degree of autonomy. These characteristics can be seen as defining what it means for a system to be a ‘System of
Systems’ (refs. 3, 5). There are several other characteristics that are found in many but not all SoS, such as having
physically mobile components and using ad-hoc communications networks.
In this paper, it is proposed that certain failure modes can be associated with each of these characteristics. For
example, the presence of physically mobile actors causes the risk of a collision. Given the set of characteristics
possessed by a given design for an SoS, hazard analysis can predict that certain failure modes are likely to be
present.
Some existing methods for hazard analysis, such as FFA and HAZOP, use lists of guide words to aid safety analysts
when studying a design. Similarly, Sneak Circuit Analysis uses ‘Sneak Clues’ — circuit diagrams that illustrate
common faults in circuit designs. These guidelines aid analysts by allowing them to ‘pattern match’ between the
system design and the guidelines.
It is hoped that the characteristics identified here will have a similar use, prompting analysts to consider likely
failure modes based on the characteristics of a system. Such an approach to hazard analysis is outlined in the
section ‘An Approach to Hazard Analysis’, but the detail is beyond the scope of this paper.
Definitions:
SoS failure mode: A way in which a failure can occur that cannot be traced to a single individual system.
SoS hazard: A condition of an SoS configuration, physical or otherwise, that can lead to an accident.
The Structure of this Paper:
The first section lists the typical characteristics that have been identified. The following section then outlines
the relationships between these characteristics, in particular indicating how some characteristics imply others.
The specific failure modes associated with each characteristic are discussed in ‘Failure Modes Arising from the
Characteristics’, while ‘An Example’ illustrates some of these using an existing SoS that had a serious accident.
This is followed by a brief discussion of how the failure modes could be used for hazard analysis. ‘Conclusions’
then presents some conclusions and outlines the paths for future work.
Typical SoS Characteristics
The characteristics typical of SoS are:
• SoS has overall goals
• SoS contains multiple component systems
• Components have local goals
• Components have individual capabilities
• Components are geographically distributed
• Components are physically mobile
• Components have some level of autonomy
• Components need to collaborate
• Components communicate with each other
• Components communicate using ad-hoc networks
• Components are heterogeneous
It can be noted that not all of these characteristics are required for a system to be an SoS. Those characteristics that
are required are noted as ‘defining’ in the descriptions below.
The characteristics are briefly defined below; the authors present more elaborate definitions for these characteristics
in (ref. 3).
SoS has overall goals
This characteristic means that the system components all contribute to achieving some set of overall system goals.
This characteristic is defining, in the sense that it is not meaningful to consider a system to be an SoS unless all the
components share some overall goals.
SoS contains multiple component systems
The SoS is made up of multiple components which are distinct systems in their own right, and have a life cycle
outside of the SoS. This is another defining characteristic, in that a single entity is at most a system, not an
SoS.
Components have local goals
In addition to sharing the overall goals of the SoS, each component system has its own distinct goals. Typically,
these goals will include self-preservation and the maintenance of an adequate degree of safety, along with sub-goals
that represent the component’s contribution to achieving overall system goals. For example, a UAV may assume
responsibility for surveying part of the area that the overall SoS intends to survey.
This characteristic is defining: it is not meaningful to consider a system to be an SoS unless the components have
goals that are distinct from those of the overall SoS.
It should be noted that local goals may come from the component itself (e.g. self-preservation) or be handed down
from a higher authority (e.g. “attack that hill”).
Components have individual capabilities
The individual components of the SoS have varied capabilities, and the overall capabilities of the SoS are derived
from these. No single distinct component can be built or acquired that has the capability to accomplish all the
SoS goals. All components have some lack of capability, either with respect to types of activity or geographic area
covered.
This characteristic is very clear in the military case, where vehicles and groups are specialised for particular
purposes — a bomber aircraft cannot hold a town, and a dismounted infantry squad cannot provide reconnaissance
over a hundred-mile radius.
Capability can be broken down into classes. One such classification is along the lines of knowledge, functionality
and authority. This is used to describe failure modes in the section ‘Failure Modes Arising from the Characteristics’.
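As a purely illustrative aside, this breakdown lends itself to a simple structured representation. The sketch below (Python; the class, the numeric scale and the example figures are assumptions made for illustration, not part of the paper) shows how a component's capability could be compared against what a task requires along each of the three dimensions.

```python
from dataclasses import dataclass

# Illustrative only: a component's capability described along the three
# classes used in this paper (knowledge, functionality, authority), each
# scored on an arbitrary 0-10 scale for the sake of the sketch.
@dataclass
class Capability:
    knowledge: int
    functionality: int
    authority: int

def shortfalls(component: Capability, required: Capability) -> list:
    """Return the dimensions in which the component falls short of what
    the task requires."""
    return [dim for dim in ("knowledge", "functionality", "authority")
            if getattr(component, dim) < getattr(required, dim)]

# e.g. a UAV with good sensors but no clearance to engage targets
uav = Capability(knowledge=8, functionality=5, authority=2)
strike_task = Capability(knowledge=6, functionality=5, authority=7)
print(shortfalls(uav, strike_task))  # -> ['authority']
```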
Components are geographically distributed
The component systems are not part of one physical whole, and may be distributed over a wide area such as a
building or a country.
Components are physically mobile
The components can move with respect to each other under their own power. This characteristic assumes that such
movement is routine rather than exceptional, as with vehicles rather than fixed field bases.
Components have some level of autonomy
The components are able to respond to external stimuli, to changes in their environment, without instructions from
outside. Components can vary in their levels of autonomy; some will have only the ability to find routes between
waypoints, while others can plot their own waypoints as part of achieving mission goals. Other components may
be able to formulate their own goals based on general instructions, while others still will have no autonomy at
all.
Components need to collaborate
The individual capabilities of the components, compared with the scale of the SoS’s overall goals, are such that the
components need to collaborate in order to achieve those goals at all. Alternatively, it may be that component
systems have to cooperate to share a scarce resource.
Examples include the need to mass forces to attack a fortified enemy position, or for aircraft to share airspace
safely.
Components communicate with each other
There is some kind of information exchange between components, such as by radio or a computer network. That
is, the components do not work in complete isolation, or only communicate with a single central node.
Components communicate using ad-hoc networks
Rather than having fixed lines of communication (such as point-to-point wires) or pre-arranged radio channels,
communications links are often arranged between components as needed.
The most obvious example of this is voice communication by broadcast radio, using a single channel for the whole
system. Packet-switched wired networks would generally be excluded, and treated as static communications.
Components are heterogeneous
SoS are often subject to change at irregular intervals and at unpredictable speeds. Different parts of the SoS may
be changed in the same way at different times. In many systems, the life cycle of a component will be much shorter
than the life cycle of the SoS.
The result is that components may have been designed separately, by different groups or organisations at different
times. When each component was designed, the set of peer components was not known, nor was the configuration
of the SoS as a whole.
For example, a military unit may exist for decades, using several successive generations of armoured vehicle. In
the air traffic control scenario, new models of aircraft are introduced frequently, and regulations have to change to
accommodate new technology such as collision avoidance systems.
Relationships Between the Characteristics
It is useful to recognise and make explicit the relationships between certain characteristics of an SoS. For example,
the fact that an SoS comprises multiple systems, and that those systems share an overall goal, implies that a measure
of collaboration or coordination among the component systems is required. Similarly, collaboration among systems
implies that communication must take place between them. A graph of such relationships is presented in figure 1.
When a characteristic is shown as being implied by two or more others, the relationship is ‘and’, i.e. the characteristic
is only implied in the presence of all of those other characteristics. Those characteristics marked with a thicker border
are ‘primary’ characteristics, in that they are taken as given (for all SoS) and are not implied by any other
characteristics.
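These implication relationships are simple enough to capture mechanically. The following is a minimal sketch (Python; only the two implications stated in the text above are encoded, and all names are illustrative) of how the ‘and’ semantics can be used to compute which characteristics a given SoS must exhibit, starting from its primary characteristics.

```python
# A minimal sketch of the 'and'-semantics implication graph between SoS
# characteristics. Only the two implications stated in the text are encoded
# here; a fuller rule set would mirror figure 1.

# Each rule maps a frozenset of antecedent characteristics to the single
# characteristic they jointly imply.
IMPLICATION_RULES = {
    frozenset({"multiple components", "overall goals"}): "collaboration",
    frozenset({"collaboration"}): "communication",
}

def implied_characteristics(primary: set) -> set:
    """Return the closure of a set of characteristics under the rules.
    A characteristic is added only when *all* of its antecedents are
    already present (the 'and' relationship)."""
    known = set(primary)
    changed = True
    while changed:
        changed = False
        for antecedents, consequent in IMPLICATION_RULES.items():
            if antecedents <= known and consequent not in known:
                known.add(consequent)
                changed = True
    return known

# Every SoS, by definition, has overall goals and multiple components:
print(implied_characteristics({"overall goals", "multiple components"}))
# -> also contains 'collaboration' and 'communication'
```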
Following on from these relationships, it can be argued that certain sets of failure modes are also associated with one another.
Traditionally, long sequences of events leading up to an accident have been viewed as ‘unfortunate’ or a ‘one-off’.
These relationships suggest that certain combinations of failure modes are in fact highly likely, because they are
common in the presence of certain characteristics which imply each other.
Failure Modes Arising from the Characteristics
Through consideration of the characteristics of an SoS introduced above, it is possible to derive a
number of ways in which such systems may fail. Guide words are used to prompt failure modes associated with a
particular characteristic. The guide words are derived from those found in HAZOP (ref. 2) and SHARD (ref. 6),
plus some suitable additions:
• Omission
• Commission
• Early
• Late
• Too much
• Too little
• Conflicting
• Incorrect

Figure 1 — Relationships between characteristics of a SoS (diagram not reproduced; it relates System Goals, Multiple Components, Component Capability, Geographic Distribution, Physically Mobile Components, Heterogeneity, Local Goals, Autonomy, Collaboration, Comms. and Ad-hoc Comms.)
It is possible, and indeed necessary, to apply the guide words in their widest possible meaning. One distinction that
is important to make, however, is between omission/commission and too much/little. The former deals with unique
and discrete things, e.g. “a specific and required goal has been omitted”, whereas the latter addresses quantities
and magnitudes of things, e.g. “there are too many goals” or “the goals are under-specified”.
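As an informal illustration of how the guide words drive the analysis, the short sketch below (Python; the wording of the prompts is an assumption, not taken from the tables) simply crosses a characteristic with each guide word to generate the questions an analyst would work through.

```python
# Illustrative sketch only: generating analyst prompts by crossing an SoS
# characteristic with the guide words.

GUIDE_WORDS = [
    "omission", "commission", "early", "late",
    "too much", "too little", "conflicting", "incorrect",
]

def prompts_for(characteristic: str) -> list:
    """One open question per guide word for the given characteristic."""
    return [
        f"{guide.upper()}: in what way could '{guide}' apply to "
        f"'{characteristic}', and what failure would result?"
        for guide in GUIDE_WORDS
    ]

for prompt in prompts_for("system goals"):
    print(prompt)
```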
Tables 2 and 3 contain the failure modes for two characteristics. Space constraints mean that not all of the
characteristics could be included, but these two have been selected as representative examples.
In addition, some of the failure modes arising from component capability are summarised in table 1. From this
table it is possible to extract a number of failure modes. Not all of the possible combinations make sense as failure
modes, but their sheer number is such that they are most succinctly summarised in table format. Below are two
such failure modes and explanatory examples:
Authority is less than required, given knowledge: Knows more than it can do anything about — i.e. has good
intelligence that it can’t act upon, or order another to act upon, due to lack of authority.
Functional capability is greater than expected, given knowledge: Automated system is reconfigured to include
weapons capability, but its control system is inadequate for the task and may present a safety risk.
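The combinations summarised in table 1 can also be enumerated mechanically, which is one way to make sure none is overlooked before the nonsensical ones are filtered out by hand. A minimal, purely illustrative sketch (Python):

```python
from itertools import product

# Sketch of enumerating the capability failure modes summarised in table 1:
# one capability dimension is compared against a baseline, relative to a
# different dimension. Which combinations are meaningful is left to the
# analyst.

DIMENSIONS = ("knowledge", "functionality", "authority")
RELATIONS = ("greater than", "less than")
BASELINES = ("required", "expected")

def capability_failure_modes() -> list:
    """List every '<dim> is <relation> <baseline>, given <other dim>' phrase."""
    modes = []
    for dim, relation, baseline, given in product(
            DIMENSIONS, RELATIONS, BASELINES, DIMENSIONS):
        if dim == given:
            continue  # a dimension is compared relative to a *different* one
        modes.append(f"{dim} is {relation} {baseline}, given {given}")
    return modes

for mode in capability_failure_modes():
    print(mode)
# includes, for example:
#   "authority is less than required, given knowledge"
#   "functionality is greater than expected, given knowledge"
```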
Table 1 — Failure Modes arising from Component Capability
(Each failure mode in the table takes the form “<capability> is greater than / less than required / expected, with respect to / given <another capability>”, where each capability is one of knowledge, functionality or authority.)
Table 2 — Failure Modes derived from System Goals

Guide Word | Failure Mode | Description
OMISSION | Important goal missing | A specific required goal has been omitted
OMISSION | Change in goals is not propagated | Some components are still operating under the old goals
COMMISSION | Extra goal | Unnecessary goal distorts the system’s behaviour
EARLY | A goal is attempted too early | e.g. military system is too aggressive in pursuing an offensive goal
LATE | A goal is attempted too late | e.g. window of opportunity is missed
TOO MUCH | Goals are over-specified | May lead to inefficient solution by constraining options. Of safety concern when it prevents humans from responding intelligently to a hazard or unexpected situation
TOO MUCH | Goals too ambitious for available resources | e.g. an Air Traffic Control centre tries to cover more airspace and aircraft than it can manage, given its set of radars and operators
TOO LITTLE | Goals are under-specified | e.g. inadequate safety policy
CONFLICTING | Conflicts within system goals | Goals are mutually exclusive, e.g. the performance requirements might only be satisfied if the safety constraints were violated
INCORRECT | Goals actually specified are not what was intended | e.g. ‘ensure safety of friendly personnel’ not implemented to include neutral or allied personnel
An Example
The failure modes discussed above can be illustrated by reference to an accident description. The example used is
the friendly fire accident over the No-Fly Zone in northern Iraq in 1994. The description of the accident here will
be extremely brief; a more thorough description is given by the authors in (ref. 1), and analyses of the causes have
been performed by Leveson (ref. 4) and Snook (ref. 8).
In the friendly fire accident, two F-15 fighter aircraft operated by the US Air Force shot down two Black Hawk
transport helicopters operated by the US Army. This occurred despite the presence of the US Air Force Airborne
Warning And Control System (AWACS), which was supposed to be coordinating US activity in the region.
Table 3 — Failure Modes derived from Communications

Guide Word | Failure Mode | Description
OMISSION | Component does not communicate when expected | e.g. component expected to report in when it has achieved a particular goal, but it does not
OMISSION | Component ceases to communicate entirely | Other components cannot track its behaviour; it becomes a liability
OMISSION | Widespread loss of communications | e.g. bad weather, jamming, failure of network hub
COMMISSION | System causes interference on a communications channel | e.g. emits electromagnetic noise, or floods a network with random packets
EARLY | Component sends message before recipient is ready | Message may be lost, or may have to be repeated
LATE | One component has to delay its action until it can communicate with another
LATE | Delay in data transmission | Depending on context, this may be obvious (e.g. where there are timestamps) or subtle (when the data has become ‘stale’ but there is no way to tell)
TOO MUCH | Bandwidth or channels overloaded | e.g. too many UAVs transmitting data within the same frequency bands
TOO MUCH | Recipient overloaded | e.g. when using decentralised Air Traffic Control, a low-end UAV may not be able to process data fast enough in a high traffic situation
TOO LITTLE | Component communicates with too few recipients | Some components do not receive needed information
TOO LITTLE | Component communicates too infrequently | Peers do not have latest information when needed
CONFLICTING | Two components produce contradictory information | e.g. when using sensor fusion, the same vehicle is reported at two very different locations
CONFLICTING | Protocol mismatch | e.g. two systems have incompatible radios (might only be discovered when they meet in the field)
INCORRECT | Data corrupted in transit (blatant) | Data must be resent, causing a delay
INCORRECT | Data corrupted in transit (subtle) | Information error is introduced
INCORRECT | Information error propagated | Incorrect data is received, treated as valid and passed on to peer components
This accident had many contributing causes, and is a good example of Perrow’s “system accident” (ref. 7) in that
removing any one of the failures would probably have prevented the accident.
One of the hazards that occurred was due to timing; although the F-15s were supposed to sweep through the no-fly
zone (NFZ) and ‘sanitize’ it before any other US aircraft entered, on the day of the accident the Black Hawks
entered the NFZ before the F-15s. This was implicated in the accident analysis as a causal factor — the F-15
pilots were not expecting to encounter friendly aircraft within the NFZ, so they were biased towards interpreting
anything they encountered as an enemy. Looking at the failure modes described in tables 1 through 3, specifically
those in table 1, this can be seen as an example of a Capability failure, specifically “knowledge less than required
given authority” — the F-15s had the authority to attack enemy aircraft, but did not have the knowledge needed to
distinguish them from friendly aircraft.
Prior to attacking the Black Hawks, the lead F-15 pilot was in radio contact with AWACS, trying to establish the nature
of the radar returns he was getting. Had the Black Hawks been able to overhear this, they could have detected the
problem and notified AWACS of their presence (as it happened, the Black Hawks were also listening on the wrong
frequency). As it was, the F-15s had new radios with advanced anti-jamming technology. The Black Hawks had
older radios that could not receive these transmissions. This is an example of “communications — protocol
mismatch”, and of “heterogeneity — incompatible systems” (not shown in the tables above).
When the F-15s contacted AWACS, they asked whether AWACS could identify the strange radar returns that they were
getting. These returns were from the Black Hawks, who had previously checked in with AWACS. It follows that
the AWACS controller should have been able to confirm the returns as friendly. As it happened, AWACS had ‘lost’
the Black Hawks; when their radar traces appeared on the AWACS screens, they had no identifying markings
other than the basic ‘friendly’ designation provided by the IFF system. Even this was not communicated, however;
the AWACS controller informed the F-15 pilot that they had an unidentified, rather than friendly, contact at that
location. This was an example of “communications — information error propagated”.
It can be seen that the failures in this example fit into the classification of failures described in this paper. However,
to test the validity and usefulness of the ideas presented here, they will have to be applied in a predictive analysis
for a system in order to predict failures that may occur in the future. An approach to doing so is discussed in the
next section.
An Approach to Hazard Analysis
Any SoS will exhibit some subset of the characteristics described above. Some will exhibit all the characteristics.
Once the set of characteristics for a specific SoS has been identified, they can be used to drive a hazard analysis
process.
The explanation here will be necessarily brief, as the finer details and the precise process to be followed have not
yet been derived. In outline, the analysis will start from a static description of the SoS. The analyst will use the set
of failure modes for each characteristic as a prompt to identify potentially hazardous failures in the system.
A set of scenarios for the particular SoS will then be used to evaluate whether the hazards are likely to occur in
practice. All expected scenarios should be analysed in this way; this includes completely successful “sunny day”
scenarios along with scenarios describing anticipated errors and their handling.
The process is illustrated in figure 2.
Figure 2 — Overview of the proposed hazard analysis approach (diagram not reproduced: the static description of the SoS and the characteristic failure modes feed preliminary hazard identification, producing possible SoS hazards; these, together with the SoS scenarios, feed hazard analysis, producing confirmed SoS hazards)
For example, if an SoS exhibits ‘Components communicate using ad-hoc networks’ then the analyst will consider
if the static configuration could give rise to ‘insufficient resources for some possible states’ or ‘protocol mismatch’.
Some potential hazards will result from this. The analyst will then search the available scenarios for ways in which
the identified failures could occur.
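As a rough sketch of how this two-stage process could be mechanised (Python; the failure-mode lists, the scenario representation and the matching rule are all assumptions made for illustration, not part of the proposed method):

```python
# Skeletal sketch of the two-stage process of figure 2. The data below is
# illustrative; in practice the failure-mode lists would come from the
# characteristic tables and the scenarios from the SoS's concept of use.

FAILURE_MODES_BY_CHARACTERISTIC = {
    "ad-hoc communications": [
        "protocol mismatch",
        "insufficient resources for some possible states",
    ],
    "physically mobile components": ["collision"],
}

def preliminary_hazards(characteristics):
    """Stage 1: candidate failure modes implied by the static description."""
    return [mode
            for c in characteristics
            for mode in FAILURE_MODES_BY_CHARACTERISTIC.get(c, [])]

def confirmed_hazards(candidates, scenarios):
    """Stage 2: keep candidates that at least one scenario could give rise to
    (modelled crudely here as membership in the scenario's event set)."""
    return [h for h in candidates if any(h in events for events in scenarios)]

sos_characteristics = ["ad-hoc communications", "physically mobile components"]
scenarios = [
    {"two UAVs converge on one waypoint", "collision"},
    {"new partner joins network", "protocol mismatch"},
]
print(confirmed_hazards(preliminary_hazards(sos_characteristics), scenarios))
# -> ['protocol mismatch', 'collision']
```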
This hazard analysis will be inductive, in that it works from a normal design model to generate a set of potential
hazards. Deductive analyses, perhaps including conventional techniques such as fault trees, can then be used to
investigate the actual probability of the hazard occurring, and to aid the development of design changes to prevent
or mitigate it.
Conclusions
This paper has described several characteristics of SoS that can lead to accidents. Some of these characteristics
are defining for SoS, whereas others are only typical. It has been suggested that each characteristic has a set of
associated failure modes, and that a list of these failure modes is a useful guide for performing hazard analysis.
It is likely that the proposed set of characteristics, and the associated failure modes, can be fruitfully extended
through use and through analysis of previous accidents. An approach to hazard analysis has been outlined here,
and there is scope for developing this into a thorough and systematic process.
The example in this paper shows that known hazards in a system can be related to the characteristics. This is
interesting, but the predictive properties of the approach will only be proved through extensive application to
examples.
References
1. R. Alexander and M. Hall-May. Modelling and analysis of systems of systems accidents. Technical Report
DARP/TN/2003/19, University of York, 2003.
2. CISHEC. A Guide to Hazard and Operability Studies. The Chemical Industry Safety and Health Council of
the Chemical Industries Association Ltd, 1977.
3. G. Despotou, R. Alexander, and M. Hall-May. Systems of systems — definition, key concepts & characteristics.
Technical Report DARP/BG/2003/1, University of York, 2003.
4. N. Leveson, P. Allen, and M.-A. Storey. The analysis of a friendly fire accident using a systems model of
accidents. In Proceedings of the 20th International System Safety Conference (ISSC 2002), pages
345–357. System Safety Society, Unionville, Virginia, 2002.
5. M. W. Maier. Architecting principles for systems-of-systems. In 6th Annual Symposium of INCOSE, pages
567–574, 1996.
6. J. A. McDermid, M. Nicholson, D. J. Pumfrey, and P. Fenelon. Experience with the application of HAZOP
to computer-based systems. In Proceedings of the Tenth Annual Conference on Computer Assurance, pages
37–48. IEEE, 1995.
7. C. Perrow. Normal Accidents: Living with High-Risk Technologies. Basic Books, New York, 1984.
8. S. A. Snook. Friendly Fire: the Accidental Shootdown of U.S. Black Hawks Over Northern Iraq. Princeton
University Press, Princeton, New Jersey, 2000.
Biography
Robert Alexander, Department of Computer Science, University of York, Heslington, York, YO10 5DD, UK,
telephone — +44 1904 432792, facsimile — +44 1904 432767, e-mail — [email protected]
Robert Alexander is a Research Associate in the High Integrity Systems Engineering (HISE) group in the University
of York’s Computer Science Department. Since October 2002 he has been working on methods of safety analysis
for Systems of Systems. Robert graduated from Keele University in 2001 with a BSc (Hons) in Computer Science.
Prior to joining the research group, he worked for Sinara Consultants Ltd in London, developing financial information
systems.
Martin Hall-May, Department of Computer Science, University of York, Heslington, York, YO10 5DD, UK,
telephone — +44 1904 432792, facsimile — +44 1904 432767, e-mail — [email protected]
Martin Hall-May is a Research Associate at the University of York’s Computer Science department. He joined the
High Integrity Systems Engineering (HISE) group in October of 2002. Martin is currently part of the Defence and
Aerospace Research Partnership in High Integrity Real Time Systems (DARP HIRTS) research project working
towards achieving the safety of emerging classes of systems. Previously he graduated from Bristol University in
2002 with an MEng (Hons) in Computer Science with Study in Continental Europe. He spent the third year of his
undergraduate course studying at the Fachhochschule Karlsruhe, Germany.
Dr Tim Kelly, Department of Computer Science, University of York, Heslington, York, YO10 5DD, UK,
telephone — +44 1904 432764, facsimile — +44 1904 432708, e-mail — [email protected]
Dr Tim Kelly is a Lecturer in software and safety engineering within the Department of Computer Science at the
University of York. He is also Deputy Director of the Rolls-Royce Systems and Software Engineering University
Technology Centre (UTC) at York. His expertise lies predominantly in the areas of safety case development and
management. His doctoral research focussed upon safety argument presentation, maintenance, and reuse using
the Goal Structuring Notation (GSN). Tim has provided extensive consultative and facilitative support in the
production of acceptable safety cases for companies from the medical, aerospace, railways and power generation
sectors. Before commencing his work in the field of safety engineering, Tim graduated with first class honours
in Computer Science from the University of Cambridge. He has published a number of papers on safety case
development in international journals and conferences and has been an invited panel speaker on software safety
issues.