PROCEEDINGS of the 22nd INTERNATIONAL SYSTEM SAFETY CONFERENCE - 2004

Characterisation of Systems of Systems Failures

Robert Alexander; Martin Hall-May; Tim Kelly; University of York; York, England

Keywords: systems of systems, safety, hazard analysis

Abstract

Systems of Systems (SoS) are systems in which the components are systems in their own right, designed separately and capable of independent action, yet working together to achieve shared goals. SoS are increasingly commonplace, for example Controlled Airspace and Network Centric Warfare. The increasing role of such systems in safety-critical applications establishes the need for methods to analyse and justify their safety. However, the essential characteristics of SoS present serious difficulties for traditional hazard analysis techniques. Operational independence, heterogeneous composition, emergent behaviour and the desire for dynamic reconfiguration make conventional hazard analysis impractical. For example, it is simply not possible to exhaustively enumerate all of the possible interactions that might take place in an SoS of any considerable complexity. Through systematic consideration of the characteristics of SoS, this paper classifies the types of failures associated with these systems. For example, the fact that components have some level of autonomy carries with it potential hazards arising from conflicts of responsibility. Understanding of the failure modes of SoS can serve as a basis for improved identification and understanding of SoS hazards. By means of an example — the shooting down of Black Hawk helicopters by friendly fire during Operation Provide Comfort — this paper illustrates how these failure modes can be seen to be key contributors in historical SoS accidents.

Introduction

Traditionally, improvements in the safety of systems and in safety engineering itself have been largely motivated and informed by accidents. Each time there has been a serious accident, engineers have worked to prevent similar accidents occurring in the future. Although this learning from failure remains an important part of safety engineering, it is no longer acceptable to rely on it. One reason is that public attitudes towards safety have changed, and in many cases even a small perceived risk is politically unacceptable. Another is that engineers are now building systems that are more complex than ever before, that operate over a much wider area and are potentially much more dangerous.

The class of systems commonly described as Systems of Systems (SoS) exhibits this latter problem. Examples of such systems are Air Traffic Control and Network Centric Warfare. These examples feature mobile components distributed over a large area, such as a region, country or entire continent. The components frequently interact with each other in an ad-hoc fashion, and have the potential to cause large-scale destruction and injury. It follows that for SoS that are being built now, safety has a high priority. However, relying on learning from accidents is not acceptable, given the combination of the modern political climate and the high destructive potential of SoS. Ethical issues aside, the requisite number of accidents would entail massive legal costs and punitive legislation. Therefore, there is a need to ensure the safety of SoS without access to knowledge from a large number of accidents.
It must be possible to perform hazard analysis, design the systems to be safe, and perform safety analysis to verify the design, all using knowledge from sources other than accidents.

All SoS share certain characteristics, such as having an overall system purpose and having components with a degree of autonomy. These characteristics can be seen as defining what it means for a system to be a 'System of Systems' (refs. 3, 5). There are several other characteristics that are found in many but not all SoS, such as having physically mobile components and using ad-hoc communications networks. In this paper, it is proposed that certain failure modes can be associated with each of these characteristics. For example, the presence of physically mobile actors creates the risk of a collision. Given the set of characteristics possessed by a given design for an SoS, hazard analysis can predict that certain failure modes are likely to be present.

Some existing methods for hazard analysis, such as FFA and HAZOP, use lists of guide words to aid safety analysts when studying a design. Similarly, Sneak Circuit Analysis uses 'Sneak Clues' — circuit diagrams that illustrate common faults in circuit designs. These guidelines aid analysts by allowing them to 'pattern match' between the system design and the guidelines. It is hoped that the characteristics identified here will have a similar use, prompting analysts to consider likely failure modes based on the characteristics of a system. Such an approach to hazard analysis is outlined in the section 'An Approach to Hazard Analysis', but the detail is beyond the scope of this paper.

Definitions:

SoS failure mode: A way in which a failure can occur that cannot be traced to a single individual system.

SoS hazard: A condition of an SoS configuration, physical or otherwise, that can lead to an accident.

The Structure of this Paper: The first section lists the typical characteristics that have been identified. The following section then outlines the relationships between these characteristics, in particular indicating how some characteristics imply others. The specific failure modes associated with each characteristic are discussed in 'Failure Modes Arising from the Characteristics', while 'An Example' illustrates some of these using an existing SoS that had a serious accident. This is followed by a brief discussion of how the failure modes could be used for hazard analysis. 'Conclusions' then presents some conclusions and outlines the paths for future work.

Typical SoS Characteristics

The characteristics typical of SoS are:

• SoS has overall goals
• SoS contains multiple component systems
• Components have local goals
• Components have individual capabilities
• Components are geographically distributed
• Components are physically mobile
• Components have some level of autonomy
• Components need to collaborate
• Components communicate with each other
• Components communicate using ad-hoc networks
• Components are heterogeneous

It should be noted that not all of these characteristics are required for a system to be an SoS. Those characteristics that are required are noted as 'defining' in the descriptions below. The characteristics are briefly defined below; the authors present more elaborate definitions for these characteristics in (ref. 3).
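The characteristic list above is the raw material for the hazard analysis approach outlined later in the paper: for a particular SoS, the analyst records which characteristics it exhibits, distinguishing the defining ones from the merely typical. As a rough illustration only (this sketch is not from the paper; the UAV surveillance profile is hypothetical, and the 'defining' flags follow the explicit notes in the descriptions that follow), such a profile might be captured as simple data:

```python
# Illustrative sketch: one way an analyst might record the characteristic
# profile of a particular SoS before hazard analysis. The example profile is a
# hypothetical UAV surveillance SoS, not a system discussed in the paper.
from dataclasses import dataclass

@dataclass(frozen=True)
class Characteristic:
    name: str
    defining: bool  # True if every SoS must exhibit it, per the descriptions below

SOS_CHARACTERISTICS = [
    Characteristic("SoS has overall goals", defining=True),
    Characteristic("SoS contains multiple component systems", defining=True),
    Characteristic("Components have local goals", defining=True),
    Characteristic("Components have individual capabilities", defining=False),
    Characteristic("Components are geographically distributed", defining=False),
    Characteristic("Components are physically mobile", defining=False),
    Characteristic("Components have some level of autonomy", defining=False),
    Characteristic("Components need to collaborate", defining=False),
    Characteristic("Components communicate with each other", defining=False),
    Characteristic("Components communicate using ad-hoc networks", defining=False),
    Characteristic("Components are heterogeneous", defining=False),
]

# Hypothetical profile: the subset of characteristics a UAV surveillance SoS
# is assumed to exhibit (here, all of them).
uav_surveillance_profile = {c.name for c in SOS_CHARACTERISTICS}
```

Such a profile is only a starting point; the value comes from the per-characteristic failure modes and the scenario review described in the remainder of the paper.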
SoS has overall goals

This characteristic means that the system components all contribute to achieving some set of overall system goals. This characteristic is defining, in the sense that it is not meaningful to consider a system to be an SoS unless all the components share some overall goals.

SoS contains multiple component systems

The SoS is made up of multiple components which are distinct systems in their own right, and have a life cycle outside of the SoS. This is another defining characteristic, in that a single entity is at most a system, not an SoS.

Components have local goals

In addition to sharing the overall goals of the SoS, each component system has its own distinct goals. Typically, these goals will include self-preservation and the maintenance of an adequate degree of safety, along with sub-goals that represent the component's contribution to achieving overall system goals. For example, a UAV may assume responsibility for surveying part of the area that the overall SoS intends to survey. This characteristic is defining: it is not meaningful to consider a system to be an SoS unless the components have goals that are distinct from those of the overall SoS. It should be noted that local goals may come from the component itself (e.g. self-preservation) or be handed down from a higher authority (e.g. "attack that hill").

Components have individual capabilities

The individual components of the SoS have varied capabilities, and the overall capabilities of the SoS are derived from these. No single distinct component can be built or acquired that has the capability to accomplish all the SoS goals. All components have some lack of capability, either with respect to types of activity or geographic area covered. This characteristic is very clear in the military case, where vehicles and groups are specialised for particular purposes — a bomber aircraft cannot hold a town, and a dismounted infantry squad cannot provide reconnaissance over a hundred-mile radius. Capability can be broken down into classes. One such classification is along the lines of knowledge, functionality and authority; this is used to describe failure modes in the section 'Failure Modes Arising from the Characteristics'.

Components are geographically distributed

The component systems are not part of one physical whole, and may be distributed over a wide area such as a building or a country.

Components are physically mobile

The components can move with respect to each other under their own power. This characteristic assumes that such movement is routine rather than exceptional, as with vehicles rather than field bases.

Components have some level of autonomy

The components are able to respond to external stimuli, and to changes in their environment, without instructions from outside. Components can vary in their levels of autonomy; some will have only the ability to find routes between waypoints, while others can plot their own waypoints as part of achieving mission goals. Other components may be able to formulate their own goals based on general instructions, while others still will have no autonomy at all.

Components need to collaborate

The relationship of the individual capabilities of the components, when compared to the scale of the SoS's overall goals, is such that the components need to collaborate in order to achieve those goals at all. Alternatively, it may be that component systems have to cooperate to share a scarce resource.
Examples include the need to mass forces to attack a fortified enemy position, or for aircraft to share airspace safely.

Components communicate with each other

There is some kind of information exchange between components, such as by radio or a computer network. That is, the components do not work in complete isolation, or communicate only with a single central node.

Components communicate using ad-hoc networks

Rather than having fixed lines of communication (such as point-to-point wires) or pre-arranged radio channels, communications links are often arranged between components as needed. The most obvious example of this is voice communication by broadcast radio, using a single channel for the whole system. Packet-switched wired networks would generally be excluded, and treated as static communications.

Components are heterogeneous

SoS are often subject to change at irregular intervals and at unpredictable speeds. Different parts of the SoS may be changed in the same way at different times. In many systems, the life cycle of a component will be much shorter than the life cycle of the SoS. The result is that components may have been designed separately, by different groups or organisations at different times. When each component was designed, the set of peer components was not known, nor was the configuration of the SoS as a whole. For example, a military unit may exist for decades, using several successive generations of armoured vehicle. In the air traffic control scenario, new models of aircraft are introduced frequently, and regulations have to change to accommodate new technology such as collision avoidance systems.

Relationships Between the Characteristics

It is useful to recognise and make explicit the relationships between certain characteristics of an SoS. For example, the fact that an SoS comprises multiple systems, and that those systems share an overall goal, implies that a measure of collaboration or coordination among the component systems is required. Similarly, collaboration among systems implies that communication must take place between them. A graph of such relationships is presented in figure 1. When a characteristic is shown as being implied by two or more others, the relationship is 'and', i.e. the characteristic is only implied in the presence of all the other characteristics. Those characteristics marked with a thicker border are 'primary' characteristics, in that they are taken as given (for all SoS) and are not implied by any other characteristics.

Following on from this, it can be argued that certain sets of failure modes are also associated with one another. Traditionally, long sequences of events leading up to an accident have been viewed as 'unfortunate' or a 'one-off'. These relationships suggest that certain combinations of failure modes are in fact highly likely, because they are common in the presence of certain characteristics which imply each other.

Failure Modes Arising from the Characteristics

Through consideration of the characteristics of an SoS introduced in the previous section, it is possible to derive a number of ways in which the systems may fail. Guide words are used to prompt failure modes associated with a particular characteristic. The guide words are derived from those found in HAZOP (ref. 2) and SHARD (ref. 6), plus some suitable additions:
• Omission
• Commission
• Early
• Late
• Too much
• Too little
• Conflicting
• Incorrect

[Figure 1 — Relationships between characteristics of an SoS. Characteristics shown: System Goals, Multiple Components, Local Goals, Autonomy, Component Capability, Geographic Distribution, Physically Mobile Components, Heterogeneity, Collaboration, Communications, Ad-hoc Communications.]

It is possible, and indeed necessary, to apply the guide words in their widest possible meaning. One distinction that is important to make, however, is between omission/commission and too much/little. The former deals with unique and discrete things, e.g. "a specific and required goal has been omitted", whereas the latter addresses quantities and magnitudes of things, e.g. "there are too many goals" or "the goals are under-specified".

Tables 2 and 3 contain the failure modes for two characteristics. Space constraints mean that not all of the characteristics could be included, but these two have been selected as representative examples. In addition, some of the failure modes arising from component capability are summarised in table 1. From this table it is possible to extract a number of failure modes. Not all of the possible failure modes make sense; however, the sheer number of combinations is such that they can be summarised succinctly in table format. Below are two such failure modes and explanatory examples:

Authority is less than required, given knowledge: the component knows more than it can do anything about — i.e. it has good intelligence that it cannot act upon, or order another to act upon, due to lack of authority.

Functional capability is greater than expected, given knowledge: an automated system is reconfigured to include weapons capability, but its control system is inadequate for the task and may present a safety risk.

Table 1 — Failure Modes arising from Component Capability

[ functionality | knowledge | authority ] is [ greater than expected | less than required ] w.r.t. / given [ functionality | knowledge | authority ]

Table 2 — Failure Modes derived from System Goals (Characteristic: System goals)

OMISSION
  Important goal missing: a specific required goal has been omitted.
  Change in goals is not propagated: some components are still operating under the old goals.
COMMISSION
  Extra goal: an unnecessary goal distorts the system's behaviour.
EARLY
  A goal is attempted too early: e.g. a military system is too aggressive in pursuing an offensive goal.
LATE
  A goal is attempted too late: e.g. a window of opportunity is missed.
TOO MUCH
  Goals are over-specified: may lead to an inefficient solution by constraining options. Of safety concern when it prevents humans from responding intelligently to a hazard or unexpected situation.
  Goals too ambitious for available resources: e.g. an Air Traffic Control centre tries to cover more airspace and aircraft than it can manage, given its set of radars and operators.
TOO LITTLE
  Goals are under-specified: e.g. an inadequate safety policy.
CONFLICTING
  Conflicts within system goals: goals are mutually exclusive, e.g. the performance requirements might only be satisfied if the safety constraints were violated.
INCORRECT
  Goals actually specified are not what was intended: e.g. 'ensure safety of friendly personnel' is not implemented to include neutral or allied personnel.
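Table 1 above compresses the component-capability failure modes into a cross-product of capability classes and deviations. As a rough illustration of how that cross-product can be expanded into individual prompts for an analyst, the following Python sketch (not part of the original paper) enumerates the combinations; the decision to skip comparisons of a capability class with itself is an assumption made for the sake of the example.

```python
# Illustrative sketch: expand the Table 1 cross-product of capability failure
# modes into individual prompt phrases for the analyst.
from itertools import product

CAPABILITY_CLASSES = ("functionality", "knowledge", "authority")
DEVIATIONS = ("greater than expected", "less than required")

def capability_failure_prompts():
    prompts = []
    for subject, deviation, reference in product(
            CAPABILITY_CLASSES, DEVIATIONS, CAPABILITY_CLASSES):
        if subject == reference:
            continue  # assumed filter: self-comparisons add little
        prompts.append(f"{subject} is {deviation}, given {reference}")
    return prompts

if __name__ == "__main__":
    for prompt in capability_failure_prompts():
        print(prompt)  # e.g. "authority is less than required, given knowledge"
```

As noted above, not every generated combination is meaningful; the analyst would discard those that do not apply to the system under study.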
An Example

The failure modes discussed above can be illustrated by reference to an accident description. The example used is the friendly fire accident over the No-Fly Zone in northern Iraq in 1994. The description of the accident here will be extremely brief; a more thorough description is given by the authors in (ref. 1), and analyses of the causes have been performed by Leveson (ref. 4) and Snook (ref. 8).

In the friendly fire accident, two F-15 fighter aircraft operated by the US Air Force shot down two Black Hawk transport helicopters operated by the US Army. This occurred despite the presence of the US Air Force Airborne Warning And Control System (AWACS), which was supposed to be coordinating US activity in the region.

Table 3 — Failure Modes derived from Communications (Characteristic: Communications)

OMISSION
  Component does not communicate when expected: e.g. a component is expected to report in when it has achieved a particular goal, but it does not.
  Component ceases to communicate entirely: other components cannot track its behaviour - it becomes a liability.
  Widespread loss of communications: e.g. bad weather, jamming, failure of a network hub.
COMMISSION
  System causes interference on a communications channel: e.g. emits electromagnetic noise, or floods a network with random packets.
EARLY
  Component sends message before recipient is ready: the message may be lost, or may have to be repeated.
LATE
  One component has to delay its action until it can communicate with another.
  Delay in data transmission: depending on context, this may be obvious (e.g. where there are timestamps) or subtle (when the data has become 'stale' but there is no way to tell).
TOO MUCH
  Bandwidth or channels overloaded: e.g. too many UAVs transmitting data within the same frequency bands.
  Recipient overloaded: e.g. when using decentralised Air Traffic Control, a low-end UAV may not be able to process data fast enough in a high-traffic situation.
TOO LITTLE
  Component communicates with too few recipients: some components do not receive needed information.
  Component communicates too infrequently: peers do not have the latest information when needed.
CONFLICT
  Two components produce contradictory information: e.g. when using sensor fusion, the same vehicle is reported at two very different locations.
  Protocol mismatch: e.g. two systems have incompatible radios (this might only be discovered when they meet in the field).
INCORRECT
  Data corrupted in transit (blatant): data must be resent, causing a delay.
  Data corrupted in transit (subtle): an information error is introduced.
  Information error propagated: incorrect data is received, treated as valid and passed on to peer components.

This accident had many contributing causes, and is a good example of Perrow's "system accident" (ref. 7) in that removing any one of the failures would probably have prevented the accident. One of the hazards that occurred was due to timing; although the F-15s were supposed to sweep through the no-fly zone (NFZ) and 'sanitize' it before any other US aircraft entered, on the day of the accident the Black Hawks entered the NFZ before the F-15s. This was implicated in the accident analysis as a causal factor — the F-15 pilots were not expecting to encounter friendly aircraft within the NFZ, so they were biased towards interpreting anything they encountered as an enemy.
Looking at the failure modes described in tables 1 through 3, specifically those in table 1, this can be seen as an example of a Capability failure, specifically "knowledge less than required given authority" — the F-15s had the authority to attack enemy aircraft, but did not have the knowledge needed to distinguish them from friendly aircraft.

Prior to attacking the Black Hawks, the lead F-15 was in radio contact with AWACS, trying to establish the nature of the radar returns he was getting. Had the Black Hawks been able to overhear this, they could have detected the problem and notified AWACS of their presence (as it happened, the Black Hawks were also listening on the wrong frequency). As it was, the F-15s had new radios with advanced anti-jamming technology, and the Black Hawks had older radios that could not receive these transmissions. This is an example of "communications — mismatched protocols", and of "heterogeneity — incompatible systems" (not shown in the tables).

When the F-15s contacted AWACS, they asked whether the AWACS controllers could identify the strange radar returns that the F-15s were getting. These returns were from the Black Hawks, who had previously checked in with AWACS. It follows that the AWACS controller should have been able to confirm the returns as friendly. As it happened, AWACS had 'lost' the Black Hawks; when their radar traces appeared on the AWACS screens, they had no identifying markings other than the basic 'friendly' designation provided by the IFF system. Even this was not communicated, however; the AWACS controller informed the F-15 pilot that they had an unidentified, rather than friendly, contact at that location. This was an example of "communications — information error propagated".

It can be seen that the failures in this example fit into the classification of failures described in this paper. However, to test the validity and usefulness of the ideas presented here, they will have to be applied in a predictive analysis for a system in order to predict failures that may occur in the future. An approach to doing so is discussed in the next section.

An Approach to Hazard Analysis

Any SoS will exhibit some subset of the characteristics described above; some will exhibit all of them. Once the set of characteristics for a specific SoS has been identified, they can be used to drive a hazard analysis process. The explanation here will be necessarily brief, as the finer details and the precise process to be followed have not yet been derived.

In outline, the analysis will start from a static description of the SoS. The analyst will use the set of failure modes for each characteristic as a prompt to identify potentially hazardous failures in the system. A set of scenarios for the particular SoS will then be used to evaluate whether the hazards are likely to occur in practice. All expected scenarios should be analysed in this way; this includes completely successful "sunny day" scenarios along with scenarios describing anticipated errors and their handling.
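As a rough illustration of the prompting step just described, the sketch below (not part of the original paper) matches an SoS's characteristic profile against per-characteristic failure-mode prompts to produce a candidate list for scenario review. The mapping shown is abridged and paraphrased from tables 2 and 3, and the example profile is a hypothetical UAV surveillance SoS rather than a system from the paper.

```python
# Illustrative sketch of preliminary hazard identification: each characteristic
# exhibited by the SoS prompts a set of candidate failure modes. The prompt
# lists here are abridged examples, not a complete catalogue.

FAILURE_MODE_PROMPTS = {
    "overall goals": [
        "important goal missing",
        "change in goals not propagated",
        "goals conflict with each other",
    ],
    "communications": [
        "component does not communicate when expected",
        "protocol mismatch between components",
        "information error propagated",
    ],
    "ad-hoc networks": [
        "insufficient resources for some possible states",
        "bandwidth or channels overloaded",
    ],
    "physically mobile": [
        "components collide",
    ],
}

def preliminary_hazard_identification(sos_characteristics):
    """Return candidate failure modes to be checked against the SoS scenarios."""
    candidates = []
    for characteristic in sos_characteristics:
        for prompt in FAILURE_MODE_PROMPTS.get(characteristic, []):
            candidates.append((characteristic, prompt))
    return candidates

# Hypothetical characteristic profile for a UAV surveillance SoS.
uav_sos = ["overall goals", "communications", "ad-hoc networks", "physically mobile"]

for characteristic, prompt in preliminary_hazard_identification(uav_sos):
    print(f"{characteristic}: consider '{prompt}'")
```

Each candidate would then be examined against the available scenarios, and only those that can plausibly arise would be carried forward as confirmed SoS hazards.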
The process is illustrated in figure 2.

[Figure 2 — Overview of the proposed hazard analysis approach: a static description of the SoS and the characteristic failure modes feed preliminary hazard identification, which produces possible SoS hazards; these, together with SoS scenarios, are subjected to hazard analysis, which produces confirmed SoS hazards.]

For example, if an SoS exhibits 'Components communicate using ad-hoc networks' then the analyst will consider whether the static configuration could give rise to 'insufficient resources for some possible states' or 'protocol mismatch'. Some potential hazards will result from this. The analyst will then search the available scenarios for ways in which the identified failures could occur.

This hazard analysis will be inductive, in that it works from a normal design model to generate a set of potential hazards. Deductive analyses, perhaps including conventional techniques such as fault trees, can then be used to investigate the actual probability of each hazard occurring, and to aid the development of design changes to prevent or mitigate it.

Conclusions

This paper has described several characteristics of SoS that can lead to accidents. Some of these characteristics are defining for SoS, whereas others are only typical. It has been suggested that each characteristic has a set of associated failure modes, and that a list of these failure modes is a useful guide for performing hazard analysis. It is likely that the proposed set of characteristics, and the associated failure modes, can be fruitfully extended through use and through analysis of previous accidents. An approach to hazard analysis has been outlined here, and there is scope for developing this into a thorough and systematic process. The example in this paper shows that known hazards in a system can be related to the characteristics. This is interesting, but the predictive properties of the approach will only be proved through extensive application on examples.

References

1. R. Alexander and M. Hall-May. Modelling and analysis of systems of systems accidents. Technical Report DARP/TN/2003/19, University of York, 2003.
2. CISHEC. A Guide to Hazard and Operability Studies. The Chemical Industry Safety and Health Council of the Chemical Industries Association Ltd, 1977.
3. G. Despotou, R. Alexander, and M. Hall-May. Systems of systems — definition, key concepts & characteristics. Technical Report DARP/BG/2003/1, University of York, 2003.
4. N. Leveson, P. Allen, and M.-A. Storey. The analysis of a friendly fire accident using a systems model of accidents. In Proceedings of the 20th International System Safety Conference (ISSC 2002), pages 345–357. System Safety Society, Unionville, Virginia, 2002.
5. M. W. Maier. Architecting principles for systems-of-systems. In 6th Annual Symposium of INCOSE, pages 567–574, 1996.
6. J. A. McDermid, M. Nicholson, D. J. Pumfrey, and P. Fenelon. Experience with the application of HAZOP to computer-based systems. In Proceedings of the Tenth Annual Conference on Computer Assurance, pages 37–48. IEEE, 1995.
7. C. Perrow. Normal Accidents: Living with High-Risk Technologies. Basic Books, New York, 1984.
8. S. A. Snook. Friendly Fire: The Accidental Shootdown of U.S. Black Hawks over Northern Iraq. Princeton University Press, Princeton, New Jersey, 2000.
Biography

Robert Alexander, Department of Computer Science, University of York, Heslington, York, YO10 5DD, UK, telephone — +44 1904 432792, facsimile — +44 1904 432767, e-mail — [email protected]

Robert Alexander is a Research Associate in the High Integrity Systems Engineering (HISE) group in the University of York's Computer Science Department. Since October 2002 he has been working on methods of safety analysis for Systems of Systems. Robert graduated from Keele University in 2001 with a BSc (Hons) in Computer Science. Prior to joining the research group, he worked for Sinara Consultants Ltd in London, developing financial information systems.

Martin Hall-May, Department of Computer Science, University of York, Heslington, York, YO10 5DD, UK, telephone — +44 1904 432792, facsimile — +44 1904 432767, e-mail — [email protected]

Martin Hall-May is a Research Associate in the University of York's Computer Science Department. He joined the High Integrity Systems Engineering (HISE) group in October 2002. Martin is currently part of the Defence and Aerospace Research Partnership in High Integrity Real Time Systems (DARP HIRTS) research project, working towards achieving the safety of emerging classes of systems. He graduated from Bristol University in 2002 with an MEng (Hons) in Computer Science with Study in Continental Europe, having spent the third year of his undergraduate course at the Fachhochschule Karlsruhe, Germany.

Tim Kelly, Department of Computer Science, University of York, Heslington, York, YO10 5DD, UK, telephone — +44 1904 432764, facsimile — +44 1904 432708, e-mail — [email protected]

Dr Tim Kelly is a Lecturer in software and safety engineering within the Department of Computer Science at the University of York. He is also Deputy Director of the Rolls-Royce Systems and Software Engineering University Technology Centre (UTC) at York. His expertise lies predominantly in the areas of safety case development and management. His doctoral research focussed upon safety argument presentation, maintenance, and reuse using the Goal Structuring Notation (GSN). Tim has provided extensive consultative and facilitative support in the production of acceptable safety cases for companies from the medical, aerospace, railways and power generation sectors. Before commencing his work in the field of safety engineering, Tim graduated with first class honours in Computer Science from the University of Cambridge. He has published a number of papers on safety case development in international journals and conferences and has been an invited panel speaker on software safety issues.