HSE
Health & Safety Executive

Learning from incidents involving E/E/PE systems
Part 2 - Recommended scheme

Prepared by Adelard for the Health and Safety Executive 2003

RESEARCH REPORT 181

PG Bishop, LO Emmet
Adelard LLP

C Johnson
University of Glasgow

W Black
Blacksafe Consulting

This report is the second of 3 parts presenting the results of an HSE-sponsored research project. The overall purpose is to create a scheme for learning from incidents that involve electrical, electronic or programmable electronic (E/E/PE) systems. Part 1 reviews existing learning processes and causal analysis techniques, examines industry practice and makes recommendations for a new scheme. Part 2 (this report) presents the recommended scheme and Part 3 gives accompanying guidance, examples and rationale.

This report and the work it describes were funded by the Health and Safety Executive (HSE). Its contents, including any opinions and/or conclusions expressed, are those of the authors alone and do not necessarily reflect HSE policy.

HSE BOOKS

© Crown copyright 2003
First published 2003
ISBN 0 7176 2789 6

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording or otherwise) without the prior written permission of the copyright owner.
Applications for reproduction should be made in writing to: Licensing Division, Her Majesty's Stationery Office, St Clements House, 2-16 Colegate, Norwich NR3 1BQ or by e-mail to [email protected]

ACKNOWLEDGEMENTS

Adelard LLP wishes to acknowledge numerous invaluable contributions to the project from the following people: Chris Johnson (Glasgow University), Bill Black (Blacksafe Consulting), Mark Bowell (HSE), Konstantinos Tourlas (Adelard), Viv Hamilton (Viv Hamilton Associates), Floor Koornneef (Technical University of Delft).

EXECUTIVE SUMMARY

This report is the second of 3 parts presenting the results of an HSE-sponsored research project. The overall purpose is to create a scheme for learning from incidents that involve electrical, electronic or programmable electronic (E/E/PE) systems. Part 1 reviews existing learning processes and causal analysis techniques, examines industry practice and makes recommendations for a new scheme. Part 2 (this report) presents the recommended scheme and Part 3 gives accompanying guidance, examples and rationale.

For lessons to be learned from any kind of incident (not just those involving E/E/PES), the following processes are necessary:
· Incident reporting: workplace staff who witness an incident must report sufficient details for safety managers to investigate further or analyse for trends where and as appropriate.
· Incident prioritisation: the recipients of incident reports decide to what extent each incident represents a learning opportunity. A serious incident or accident will obviously require corrective action but there is also much to learn from near misses, especially those that form a recurring pattern.
· Incident characterisation and investigation: safety managers analyse selected safety reports and investigate further as appropriate. The aim is to sufficiently understand what happened and generate recommendations to reduce the probability of other incidents with similar causes. The amount of effort spent per incident should be proportionate to the incident’s learning potential. Characterisation of incident data is necessary if it is to be collated for trend analysis.
· Response in working context: recommendations generated as a result of the incident must be implemented in the original working context for any benefit to be realised. For this to happen, the recommendations must be realistic and tracked to completion.

It is also desirable to disseminate lessons learnt to others, eg to other sites belonging to the company, or to users, system suppliers and component suppliers in a product supply chain. If others provide such information, each site or company needs its own procedures for acting on it. For a company or organisation to learn lessons from its own or others’ aggregated incident data, proactive analysis is required.

When incidents involve E/E/PES, specialised activities should be introduced into the learning process. This is necessary because general learning processes will not include all the expertise and methods needed to examine and rectify systems where E/E/PE technology is a significant factor. The organisation learning from an incident needs to decide when to invoke E/E/PES-specific activities, what additional analyses to perform and whether to involve any suppliers or contractors.

This report addresses the development, operation and maintenance of safety-related E/E/PE systems using the terminology and approach of the international standard IEC 61508, Functional safety of E/E/PE safety-related systems. The recommendations are designed to be consistent with the requirements of the standard.

The incident reporting process needs to capture sufficient information to prompt the recipient into considering whether a specialist E/E/PES investigation would be beneficial. To do this, the reporter must be able to give some indication of when E/E/PES equipment is involved. The identified E/E/PES need not have been a direct cause of the incident for a specialist investigation to be worthwhile; it may be that had the E/E/PES been different in some way it might have helped prevent the incident.

Factors that affect the prioritisation of incidents and the effort required for their investigation include industry practice, regulator expectations, and both the actual consequence and the potential consequence associated with the system involved. Unexpected behaviour, particularly with respect to safety assumptions, will usually have the greatest learning potential.

A specialist investigation will aim to determine what exactly happened, why the problem occurred and how it can be prevented or the consequences mitigated in the future. The PARC causal analysis method is recommended as a simple means of considering the most significant aspects in an incident and identifying its likely causes. The flow chart given is optimised for end users to examine situations where failures in their equipment under control, or in the equipment’s control system, place demands on a separate safety protection system. If more detailed investigation is required, the recommended techniques are barrier and change analysis, and events and causal factors charting. Results of the investigation can be recorded on checklists, and for simple investigations these may be adequate by themselves for identifying the most significant causes and for suggesting appropriate solutions.

End users rely on contractors and suppliers to develop and maintain their E/E/PE systems. Although incidents generally occur with the end user, effective learning has to involve these other parties so they can identify defects in their equipment. They in turn have a responsibility to notify other users of identified problems and fixes.

Incidents involving E/E/PES can be classified according to failure mode, type of problem or problem prevention (equivalent to root causes).
A comprehensive root cause classification scheme is proposed based on IEC 61508 safety lifecycle activities and on common requirements that are not specific to any lifecycle phase (eg competence). This scheme can be condensed according to the responsibilities of those using it. For example, end users will focus more on operation and maintenance, and system suppliers more on realisation. Classification significantly aids long-term analysis and sharing of data.

CONTENTS

1 INTRODUCTION
2 SCHEME OVERVIEW
2.1 Scope
2.2 Design objectives
2.3 Scheme concepts
2.4 Scheme design
3 THE PROCESS OF LEARNING FROM INCIDENTS
3.1 Overall learning process and context
3.2 Generic incident reporting and investigation process
3.3 Incident reporting
3.4 Investigation policy
3.5 Incident investigation
3.6 Dissemination and learning
3.7 Proactive learning from incident data
4 E/E/PES SPECIFIC ACTIVITIES
4.1 Identification of E/E/PES
4.2 Addition to incident recording
4.3 E/E/PES technical investigation policy
4.4 E/E/PES technical investigation
4.5 Dissemination and learning
5 CLASSIFICATION OF E/E/PES PROBLEM RECORDS
5.1 Failure mode
5.2 Problem prevention
5.3 Adapting the causal classification for different perspectives
6 CONCLUSIONS
REFERENCES
APPENDIX A IEC 61508 CAUSAL CLASSIFICATION SCHEME
APPENDIX B SIMPLIFIED CLASSIFICATION (END USER PERSPECTIVE)
APPENDIX C PARC CAUSAL ANALYSIS METHOD
C.1 Initiating PARC
C.2 Applying PARC
APPENDIX D SIMPLIFIED CLASSIFICATION (SYSTEM REALISATION PERSPECTIVE)
APPENDIX E EXAMPLE FORMS

FIGURES

Figure 1 Overall learning process
Figure 2 Incident handling process
Figure 3 Incident investigation
Figure 4 Response to external information (the listening process)
Figure 5 Supply chain response to incident
Figure 6 Example event sequence involving E/E/PES
Figure 7 Initial technical investigation
Figure 8 Problem prevention checklist
Figure 9 Recommendations checklist
Figure 10 Supply chain dissemination and learning
Figure C.1 PARC causal analysis flowchart
Figure E.1 Initial incident report
Figure E.2 Equipment problem report (end user)
Figure E.3 Causal investigation report (user organisation)
Figure E.4 Causal analysis by E/E/PES supplier of system or equipment
Figure E.5 E/E/PES problem prevention checklist (end user)
Figure E.6 E/E/PES recommendation checklist (end user)
Figure E.7 E/E/PES problem prevention checklist (E/E/PES supplier)
Figure E.8 E/E/PES recommendation checklist (E/E/PES supplier)

TABLES

Table 1 Example recommendations
Table A.1 Safety lifecycle classification
Table A.2 Expansion of E/E/PES lifecycle classification
Table A.3 Common requirements classification
Table B.1 Safety lifecycle classification
Table B.2 Common requirements classification
Table D.1 System realisation phase classification
Table D.2 Common requirements classification

1 INTRODUCTION

This report is the second of 3 parts presenting the results of an HSE-sponsored research project. The overall purpose is to create a scheme for learning from incidents that involve electrical, electronic or programmable electronic (E/E/PE) systems. Part 1 reviews existing learning processes and causal analysis techniques, examines industry practice and makes recommendations for a new scheme. The project used these results to produce a draft new scheme [1]. The draft was subjected to review through an industry workshop, internal Adelard evaluation and four half-day interviews with two end users, a system supplier and a component supplier [2]. Part 2 (this report) presents the recommended scheme derived from this review process. Part 3 gives accompanying guidance, examples and rationale.
The report consists of:
· An overview of scope, objectives and key concepts for the scheme (Section 2);
· A discussion of the processes involved in learning from all kinds of incidents (Section 3);
· Recommendations on how to address aspects specific to E/E/PE systems (Sections 4 and 5), with supporting causal classification taxonomies (Appendices A, B and D), a simple flow chart causal analysis method (Appendix C) and example forms (Appendix E).

2 SCHEME OVERVIEW

There are considerable benefits in learning from incidents that occur in an organisation. Incident reporting and investigation schemes can be implemented for incidents such as:
· accidents to personnel,
· damage to the environment,
· production losses.

If the learning is effective, resultant actions will reduce the chance of a recurrence. In injury reporting schemes there is often an increase in recorded incidents when the scheme is started (due to improved reporting) but the rate steadily reduces over time and the number of serious injuries also decreases. In addition, there are considerable benefits from addressing minor incidents: studies have shown that minor incidents are far more frequent than major incidents and that accident costs represent a significant proportion of company profits [3].

Companies and organisations increasingly rely on electrical, electronic and programmable electronic systems (E/E/PES) that can have an impact on safety, productivity and the environment. In this document we describe a scheme termed PARCEL (PES Analysis of Root Cause and Experience-based Learning) that covers the general learning process but focuses specifically on extensions to cover incidents involving E/E/PES equipment. The scheme should benefit user organisations and also enable suppliers of E/E/PES to learn from operational experience and hence improve their products.

2.1 SCOPE

This document presents an incident reporting and analysis scheme for incidents involving E/E/PES.
This will typically form part of a more general incident reporting scheme and may cover safety, the environment and other concerns of the organisation (eg economic losses).

The scheme is designed to be consistent with the requirements of IEC 61508 [4]. This is a key international standard for industry. It sets out specific requirements for E/E/PE safety related systems within a generic framework that defines the safety lifecycle, safety management activities and detailed requirements for achieving functional safety. Example applications include:
· emergency shutdown systems,
· fire and gas systems,
· safety interlocks,
· a medical device,
· a safety-related information system (eg a patient database),
· a transport control system (eg an air traffic control system).

The scheme is designed to support two requirements of IEC 61508 in particular. The first of these is the need to learn from experience. 6.2.1 of IEC 61508-1 states that responsible organisations or individuals should consider specifying, implementing and monitoring the progress of:

i) procedures which ensure that hazardous incidents (or incidents with potential to create hazards) are analysed, and that recommendations are made to minimise the probability of a repeat occurrence.

This requirement addresses a range of safety stakeholders including the developers and operators of safety related systems. In addition IEC 61508-2 requires that:

7.8.2.2 Manufacturers or system suppliers which claim compliance with all or part of this standard shall maintain a system to initiate changes as a result of defects being detected in hardware or software and to inform users of the need for modification in the event of the defect affecting safety.

This places a requirement on the "supply chain" of E/E/PES system suppliers and product manufacturers to eliminate reported defects that could be a source of safety-related incidents.
It should also be noted that the standard recognises that incident reporting and analysis reduces the likelihood of common cause failures. IEC 61508-6 Annex D provides guidance on a methodology for quantifying the effects of hardware-related common cause failures in E/E/PES. Such effects often dominate the calculation of the reliability that can be achieved when using redundant systems. Common cause failures make it particularly difficult to achieve the target reliability for high integrity systems such as those containing safety functions with a safety integrity level of 3 or 4 (equivalent to better than 10⁻⁷ dangerous failures per hour).

Common cause failures are normally accounted for using a beta factor. Annex D of IEC 61508-6 provides a method for the determination of beta that depends on the design of the sub-system and how it is to be maintained. Incident reporting and analysis reduces the need for diversity and can lead to lower beta factors, making it easier to achieve the reliability required of high integrity level systems. The introduction of an incident reporting system can therefore lead to reductions in design, procurement and maintenance costs.

2.2 DESIGN OBJECTIVES

The design objectives of the scheme are documented in Part 1.
In summary, these are for the scheme to:
· focus on E/E/PES aspects of incidents;
· be practical (ie integrate with current management and reporting systems, have reasonable set-up costs, be usable);
· be applicable to different industry sectors and to different levels of organisational complexity and maturity;
· be applicable to end users, systems designers and E/E/PES product suppliers;
· be based on current industry practice and current techniques;
· take account of incident criticality (ie more effort is expended on serious incidents);
· provide concrete examples to aid implementation;
· make learning effective (so that similar problems can be avoided in the future);
· use a consistent classification method that enables incidents to be categorised for subsequent analysis (eg by industry sector, safety function).

2.3 SCHEME CONCEPTS

The incident reporting and analysis scheme is viewed as part of a learning process. In this process, lessons learned can be fed back to a range of different stakeholders. These stakeholders can be in:
· the organisation using the E/E/PES (the end user);
· a system contractor who may have overall design responsibility (eg for plant, buildings and E/E/PES equipment);
· the organisation that designs and implements the E/E/PES based system for the end user or main contractor (the E/E/PES system supplier);
· the organisation that provides E/E/PES products to the system supplier (the product supplier);
· statutory bodies (eg HSE);
· end-user interest groups (eg for specific industry sectors or end-users of specific E/E/PES).

Learning can take different forms in different organisations. For example, an end-user can implement short-term risk mitigation by changes to workplace practices. Over the longer term, risks may be reduced by updating the E/E/PES, the approved vendor lists or internal standards. In addition the end-user management processes can be changed to reduce the chance of similar problems on future E/E/PE based systems.
Similarly the supply chain can respond to incident reports by rectifying the E/E/PE system or product and by improving processes to prevent a recurrence of the problem in future systems.

2.4 SCHEME DESIGN

The following sections present the incident reporting and analysis scheme for incidents involving E/E/PES. In Section 3 we describe the generic incident handling and learning processes. These are based on a synthesis of existing industrial schemes examined in Part 1 and revisited during industrial consultation for the draft scheme [2].

Section 4 then identifies the additional E/E/PES related activities and information that need to be embedded within the generic model of incident reporting and investigation. Our design intent is to make these additions as modular as possible so that they can be readily incorporated into existing incident reporting schemes.

The E/E/PES-specific additions are intended to identify the causes of failure according to a classification scheme based on IEC 61508 concepts, and hence to facilitate product and process improvements that will benefit the different stakeholders associated with the E/E/PES. The classification approach is described in Section 5. To support the different stakeholders, the causal classification scheme is hierarchic and can be adapted to reflect the viewpoint of the reporting organisation. In other words, the scheme can be related to the life cycle stages and products for which the organisation is responsible (ie where improvements can actually be made).

Supporting appendices are provided, covering:
· classification scheme details (Appendices A, B, and D);
· example forms for incident reporting and incident investigation, to illustrate the approach (Appendix E);
· a simple analysis method, PARC, for investigating the causal factors underlying the E/E/PES-related elements of an incident (Appendix C).

Adaptation of the scheme for specific organisations and industry sectors is addressed in Part 3.
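As a rough illustration of how a hierarchic causal classification might be condensed to match an organisation's responsibilities, the sketch below keeps full detail only for the top-level branches an organisation is responsible for and collapses every other branch to its top-level label. The category names are placeholders invented for this example (the actual taxonomies are given in Appendices A, B and D), and the `condense` and `summarise` helpers are assumptions rather than part of the scheme.

```python
# Hypothetical classification paths (top-level category, sub-category).
# These labels are illustrative placeholders, not the Appendix A taxonomy.
RECORDS = [
    ("operation and maintenance", "maintenance procedure"),
    ("operation and maintenance", "operating procedure"),
    ("realisation", "software design"),
    ("realisation", "hardware design"),
    ("common requirements", "competence"),
]


def condense(path, detailed_branches):
    """Keep full detail only for branches the organisation is responsible for."""
    top = path[0]
    return path if top in detailed_branches else (top,)


def summarise(records, detailed_branches):
    """Count problem records per (possibly condensed) category."""
    counts = {}
    for path in records:
        key = " / ".join(condense(path, detailed_branches))
        counts[key] = counts.get(key, 0) + 1
    return counts


# An end user might keep detail for operation and maintenance only.
end_user_view = summarise(RECORDS, {"operation and maintenance"})
for category, count in sorted(end_user_view.items()):
    print(f"{category}: {count}")
```

A system supplier could call `summarise` with a different set of detailed branches over the same records, which is the sense in which one shared scheme can serve several viewpoints.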
3 THE PROCESS OF LEARNING FROM INCIDENTS

The end user organisation is likely to have a general incident reporting scheme that is used to minimise recurrence of incidents. The company may also have formal or informal mechanisms for learning from experience to prevent recurrence of similar problems on future systems. In this section we outline these generic processes, and in the following section we identify where E/E/PES-specific activities can be incorporated within this scheme.

3.1 OVERALL LEARNING PROCESS AND CONTEXT

Figure 1 below presents an overview of a generic incident reporting and learning process. This is based on the survey of industrial practice documented in Part 1. The components of the diagram with solid lines constitute a basic incident reporting and analysis system. The components with dotted lines are additional dissemination and learning activities that can be used for further process improvement. The arrows represent transfers of information (such as incident reports, problem reports and recommendations).

There is a local operational working context in which a response can be made. It is in this context that incidents first occur, some of which are recognised and reported. If an incident is deemed sufficiently interesting by the organisation (eg it may be potentially serious, novel or there may be too many incidents of this type), a more detailed characterisation and investigation occurs. Otherwise, incidents may be merely logged and a known workaround applied. Corrective actions are generated, and these may range from local workarounds, through to sharing within the organisation, to communication along the supply chain and finally to wider industry and regulatory reporting.
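The decision described above (investigate when an incident is potentially serious, novel, or of a type that is occurring too often; otherwise log it and apply the known workaround) can be sketched as follows. The field names, the `tolerable_count` threshold, the incident type used in the example and the `triage` function itself are illustrative assumptions; an organisation would substitute its own criteria.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Incident:
    incident_type: str         # hypothetical free-form type label
    potentially_serious: bool  # as judged by the recipient of the report


def triage(incident, history, tolerable_count=3):
    """Decide how to handle an incident given counts of earlier incidents.

    `history` maps incident type to the number of previous reports.
    `tolerable_count` is an assumed limit on how many reports of one
    type are accepted before the frequency itself is investigated.
    """
    seen = history[incident.incident_type]
    history[incident.incident_type] += 1
    if incident.potentially_serious:
        return "investigate: potentially serious"
    if seen == 0:
        return "investigate: novel"
    if seen + 1 > tolerable_count:
        return "investigate: too frequent"
    return "log and apply known workaround"


history = Counter()
results = [triage(Incident("sensor drift", False), history) for _ in range(5)]
for outcome in results:
    print(outcome)
```

In this sketch a minor incident type is investigated once when first seen, then merely logged until the assumed tolerable count is exceeded.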
[Figure 1 Overall learning process. Numbered components: 1 Incident reporting, 2 Incident prioritisation, 3 Incident characterisation and investigation, 4 Incident repository, 5 Detailed assessment, 6 Proactive interpretation and analysis, 7 Dissemination function, 8 Listening function. Other elements shown: local working context, local management, operations management, engineering management, supply chain, other sites or departments, wider industry, corporate policy/standards.]

There can be more general learning if incident information and supply chain problems are shared. For example, end-users can tell other departments that might have similar equipment with similar problems, or suppliers can alert a user organisation if there are known problems with the equipment they provided. These problems may or may not apply in the new working context, so there is a need for interpretation before the appropriate recommendations for corrective action can be determined.

The various process components are elaborated in further detail below.

3.2 GENERIC INCIDENT REPORTING AND INVESTIGATION PROCESS

The actual incident reporting and analysis process varies between companies, but practice typically includes the elements shown in Figure 2 below.

[Figure 2 Incident handling process. Decision points: Novel?, Serious?, Too often?. Activities: incident report, analyse/classify severity and tolerable rate (using knowledge of previous incident types), log incident, data gathering, incident investigation, report causal factors and recommendations, classification to common taxonomy, investigate causes of high frequency, recommend site corrective actions, log incident investigation report, disseminate for more general learning.]

Note that the scheme is designed to minimise overheads for minor incidents where it is sufficient to log the incident (the leftmost leg). However if a “tolerable level” for minor incidents is exceeded (eg excessive component failures, failures of non-critical functions that could be a source of distraction or increase workload), this merits further investigation to reduce the rate.

If an incident is unexpected, some investigation is recommended. Even if the incident is classified as minor, it provides the opportunity to identify an acceptable workaround that can be fed back to the workplace staff through revised working practices.

If an incident is classified as serious, the investigation will be more thorough. Typically there are several levels of incident severity, and the rigour of the investigation will vary with the severity of the incident. For a general incident reporting and analysis scheme, high severity incidents might not only relate to safety but also to economic losses (eg loss of equipment, loss of production) and be assigned a high priority for investigation.

The outcomes of an incident investigation are a classification of the incident, a description of what happened, the causal factors leading to the incident, and recommendations to prevent a recurrence. The incident report provides a basis for implementing corrective action. The information contained can also be disseminated more generally to other stakeholders who can use it to prevent incidents in both current and future systems.

The following sections describe the generic process for incident reporting and the information produced at each stage.

3.3 INCIDENT REPORTING

Workplace staff usually have to detect and report the incident.
Supplying the information should not be time-consuming for workplace staff as this could result in under-reporting. In some cases initial incident detection is performed automatically or can be machine supported (so that relevant information is stored automatically). An example incident report form is shown in Figure E.1. A manual incident report might contain:
· reporter name,
· date and time,
· free text description of incident;
and optionally:
· any short term fix,
· reporter's view of severity,
· whether new to the reporter.

The confidentiality of incident reporting needs to be decided. Confidentiality may well encourage reporting, but presents difficulties in subsequent investigation (as no further details can be obtained). In some schemes, incidents are reported to a third party who can check details without revealing the source.

Most companies seem to opt for an “open culture” of reporting. This enables the details of the original incident to be validated (ie who, what, where and when) and corrected if necessary. Open reporting also facilitates subsequent data gathering and incident investigation as the relevant personnel can be interviewed to obtain further details. However to maintain an open culture it is important to ensure that the reporter is not penalised for his actions (by instituting a “no blame” policy for incident reporting, for example).

Each incident report might be extended and assessed by a workplace supervisor. The supervisor should check that the information is valid and assess whether the incident is serious or novel. Typical extensions to the report might include:
· incident consequence (effect on operation, effect on personnel, eg injury),
· incident type,
· incident novelty.

The incident report should then be stored in an incident log.

3.4 INVESTIGATION POLICY

The key principle underlying the technical investigation policy is whether the incident represents a “learning opportunity”. The potential for learning from incidents varies.
It is therefore important to prioritise an incident investigation according to the significance of the incident to the company. A general incident handling scheme might well include non-safety incidents such as damage to assets and loss of production.

The type of incident investigated may also depend on the maturity of the organisation and the skills and resources available. For example, large organisations tend to have more levels of investigation than small companies. When introducing a new incident reporting scheme the priority will be to ensure more serious incidents are fully investigated.

The incident priority will determine the nature of the investigation (the resources, time and staff available). In making decisions on the type of incidents that should be regarded as high priority, consideration should also be given to the likely frequency of incidents and the resources available. A prioritisation scheme that results in 90% of incidents being high priority is unlikely to be successful.

The effectiveness of the prioritisation scheme should be reviewed on a regular basis. If suitable resources are not available to analyse all high priority incidents within a reasonably short period of time then priority levels or resource levels will need to be changed. Deferral of high priority incidents for significant time periods will undermine the learning scheme and bring it into disrepute. There is therefore a need for a management system that allows deferral but keeps it within acceptable boundaries.
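One way to support the regular review described above is to compute, from the incident log, the share of incidents rated high priority and the number of high priority investigations deferred beyond an agreed limit. The sketch below assumes a simple in-memory log; the record fields, the 50% share limit and the 30-day deferral limit are arbitrary placeholders chosen for illustration, not values taken from this report.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class LogEntry:
    reported: date
    high_priority: bool
    investigated: Optional[date] = None  # None while the investigation is deferred


def review(log, today, max_high_share=0.5, max_deferral_days=30):
    """Summarise how well the prioritisation scheme is working.

    Both limits are assumed values; each organisation would set its own.
    """
    high = [entry for entry in log if entry.high_priority]
    share = len(high) / len(log) if log else 0.0
    overdue = [
        entry for entry in high
        if entry.investigated is None
        and (today - entry.reported).days > max_deferral_days
    ]
    return {
        "high_share": share,
        "share_too_large": share > max_high_share,
        "overdue_high_priority": len(overdue),
    }


log = [
    LogEntry(date(2003, 1, 6), True, investigated=date(2003, 1, 10)),
    LogEntry(date(2003, 1, 8), True),   # high priority, still deferred
    LogEntry(date(2003, 2, 20), False),
    LogEntry(date(2003, 3, 1), False),
]
report = review(log, today=date(2003, 3, 10))
print(report)
```

A non-zero overdue count, or a high priority share above the chosen limit, would be a prompt to revisit either the priority levels or the resources available.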
Factors that could lead to a high investigation priority include:
1 severe safety consequences (or potential for such consequences), such as fatalities to employees or members of the public;
2 severe environmental consequences (or potential for such consequences);
3 severe economic losses (loss of assets or loss of production);
4 excessive frequency of incidents (even minor incidents can be costly [3]);
5 novelty of the incident (so the worst case consequences are unknown);
6 there are many installations where the incident might occur.

Lowest priority incidents are often just logged without investigation. Typically these are well known and have little impact on the organisation. But the priority should increase if such incidents recur too frequently.

3.5 INCIDENT INVESTIGATION

Incident investigators need sufficient relevant knowledge and expertise to perform the analysis, including knowledge of:
· the application domain, the actual process involved and the potential consequences of incidents (eg the plant, machinery);
· the operation and maintenance regimes; and
· the equipment used in the working environment.

The overall investigation process followed is shown in Figure 3 below.

Incident notification → Data gathering → Reconstruction → Analysis → Recommendations and monitoring

Figure 3 Incident investigation

While each investigation will contain every phase from Figure 3 to some extent, the rigour, independence and number of people involved will vary with the investigation priority.

3.5.1 Data gathering

Data gathering may be required in order to establish further details about the incident and the operational context. The data gathering phase includes:
· extracting and safeguarding any relevant logs (from manual or automated systems),
· elicitation of testimony from witnesses,
· direct interviews,
· gathering group testimony, for example through focus groups or anonymous questionnaires.

Data gathering also needs to establish the workplace context.
The amount of data required about the context varies with the knowledge of the investigating team. Less context information is required if the investigating team is familiar with the workplace, the equipment in use and the procedures in place. However familiarity with the workplace can sometimes be a danger since assumptions may be made which are not valid. 3.5.2 Incident reconstruction Based on the gathered data, the incident is reconstructed as a sequence of events leading up to the incident plus any actions following the incident that mitigated the outcome. This could for example be represented by a timeline (see Appendix A of Part 1). 3.5.3 Causal analysis Causal analysis establishes relationships between the events, and identifies underlying causes that permitted the event to occur. This can be represented in a number of different ways as described in Appendix A of Part 1 (eg as a fault tree, an ECF chart or a MORT diagram). The purpose is to determine root causes. This can be a recursive process, and it is necessary to identify some stopping point. The analysis might call for further technical investigations that require specialist expertise (eg plant engineering, human factors, equipment). For less severe or less complex incidents, technical investigations will probably be limited to aspects within the boundaries of the organisation. For more severe or more complex incidents, investigations might require the involvement of other organisations with the relevant technical knowledge and expertise (eg the E/E/PES supplier). As the events and their causes are identified, the assumptions made during design should be revisited (eg failure causes, failure modes, frequency and consequences). If discrepancies are discovered between the observed outcome of an incident and the original design assumptions, then an impact analysis using revised design assumptions is necessary. 
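The check of design assumptions against operating experience could be sketched as follows. This is an illustration under stated assumptions: the field names and example figures are invented, and the simple test (observed rate greater than assumed rate, or a failure mode not anticipated at all) ignores statistical uncertainty that a real impact analysis would need to consider.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DesignAssumption:
    failure_mode: str
    assumed_per_year: float   # frequency assumed at design time


def needs_impact_analysis(assumptions: List[DesignAssumption],
                          observed_counts: Dict[str, int],
                          years_observed: float) -> List[str]:
    """Failure modes observed more often than assumed, or not anticipated."""
    assumed = {a.failure_mode: a.assumed_per_year for a in assumptions}
    flagged = []
    for mode, count in sorted(observed_counts.items()):
        rate = count / years_observed
        if mode not in assumed or rate > assumed[mode]:
            flagged.append(mode)
    return flagged


# Invented design assumptions and operating experience, for illustration only
design = [
    DesignAssumption("sensor reads low", 0.1),
    DesignAssumption("valve stuck", 0.5),
]
observed = {"sensor reads low": 1, "valve stuck": 1, "controller freezes": 1}

flagged = needs_impact_analysis(design, observed, years_observed=2.0)
print("revisit design assumptions for:", flagged)
```

Here the valve failure rate matches the design assumption and is not flagged, while the sensor failure (more frequent than assumed) and the controller freeze (not anticipated at design time) would both prompt an impact analysis.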
3.5.4 Recommendations

Recommendations are made to reduce the probability of a recurrence at the site or to mitigate the consequences. There can be several options for accomplishing this, including:
· personnel and training,
· procedures and organisation,
· changes to E/E/PE systems,
· changes to plant.

There may also be wider recommendations not directly related to the incident but resulting from information gained while investigating it, for example the discovery of false design assumptions.

Each recommendation has to identify:
· what should be changed or done,
· who is responsible,
· the importance of the change or action.

These recommendations can be over different time-scales and include workaround arrangements until long-term changes can be implemented. An example is shown in Table 1 below.

Table 1 Example recommendations

Ref | Recommendation | Department | Priority
1 | Change operational procedures to get permit from supervisor before overriding the vessel supply feed overflow interlocks | Operations | 1
2 | Ensure operating staff are briefed about the change in procedure | Operations | 1
3 | Request engineering change to alarm display system to ensure that operator is aware that the batch vessel is already full | Engineering | 2

3.5.5 Incident investigation report

The results of the investigation are usually presented in an investigation report. The report will include:
· consequences and potential consequences (injuries, deaths, other losses);
· incident type (this will be specific to a given domain);
· a reconstruction of the incident;
· an analysis of causal factors;
· recommendations.

An example incident investigation report form is shown in Figure E.3.

3.5.6 Recommendation implementation and tracking

The operational or engineering management will assess the cost and feasibility of the recommendations and make decisions on their implementation. Accepted recommendations should be tracked to ensure they are implemented.
Any deferred recommendations should also be monitored for compliance with deferral time limits. This tracking could be implemented as part of a safety management system (SMS). However it is often cost-effective to use an existing quality management infrastructure, which will already have processes for defect recording and resolution.

3.6 DISSEMINATION AND LEARNING

The incident reporting scheme outlined above only covers the resolution of a specific incident by the end user. However the incident data can be used more generally to:
· prevent recurrence on similar systems at other user sites,
· rectify defects in systems,
· prevent the recurrence of similar problems in new systems implemented by the end-user,
· improve the processes used in the supply chain.

The following sections will consider the dissemination and learning aspects in more detail.

3.6.1 Preventing recurrence on different sites

Incidents or known problems can be circulated to users with similar processes or equipment. This could involve dissemination within a user organisation or to industry or equipment interest groups (box 7 in Figure 1). Some “listening” function is needed to analyse the information and apply it to the site (box 8 in Figure 1). This is illustrated in Figure 4 below.

Incident report or supplier alert
→ New to this site? (No: STOP)
→ Relevant to this site? (No: STOP)
→ Serious enough? (No: STOP)
→ Review recommendations and implement corrective actions

Figure 4 Response to external information (the listening process)

3.6.2 Rectify defects in supplied equipment

Reported problems might reveal defects in some common component that is used in several systems. A supplier can perform an impact analysis to identify and update affected systems or products and alert users of the change and the available fix. As noted in Section 2.1 this process is a requirement for suppliers claiming compliance with IEC 61508. The process is outlined in Figure 5 below.
[Flowchart elements, in the order shown: User problem report or component supplier alert; New problem? (Yes/No); Identify short term corrective action; Diagnose defects and products affected; Supply corrective action/workaround; Component problem? (Yes/No); Alert other users where system is safety-related; Alert component supplier; Implement design changes in product(s) or associated documentation; Alert affected users]

Figure 5 Supply chain response to incident

3.6.3 Preventing recurrence on future systems procured by the user

Where a user finds that a supplied system was not suited to the intended application, this may indicate problems with the user’s procurement policy, particularly:
· contractor and supplier assessment,
· requirement specification process,
· tender evaluation, and/or
· system acceptance methods.

Alternatively, the supplied system may have been suitable for the assumed operating infrastructure, but aspects of the actual infrastructure (such as personnel competence or associated procedures) are inadequate or have been changed. These lessons can be applied to the procurement of future systems.

3.7 PROACTIVE LEARNING FROM INCIDENT DATA

An organisation can use past experience as a basis for taking action to reduce or prevent problems occurring at a more general level, both in current and future systems. Data from past incidents provides an information repository that can be analysed to determine preventive actions (box 6 in Figure 1). Such analyses can include:
· trend analysis (eg is the frequency of a particular type of incident rising or falling),
· identifying proportions for different incident types,
· zonal analysis (eg identifying specific “hot spots” with high incident frequencies).

This information can be used to establish management priorities for preventive action (eg to ensure incidents stay within limits or tackle problem areas).

Incidents can also be used to improve knowledge within the organisation (sometimes called “corporate memory”).
This can include:
· generalising incidents into rules, recommendations, checklists or corporate standards for future activities,
· creating information repositories (eg “how to’s” and “frequently asked questions” for use by the staff),
· staff training and awareness (eg making use of actual incidents within the organisation as illustrations in order to improve the safety culture).

Organisations and departments can also make use of external information. There is a lot of learning potential to be gained from information on similar systems used on other sites, or to learn more generally about best practices in the operational or design process being conducted. In particular there can be benefit in proactively searching for known issues when:
· starting a new instance of a known project type,
· new staff join a team,
· investigating the feasibility of a novel project.

Proactive learning also provides a motivation for organisations to share information with each other: industry information sharing bodies often have a policy that those who contribute have preferential access to information. Confidentiality concerns may mean that such shared information has to be anonymised (eg the identity of the company and personnel involved).

4 E/E/PES SPECIFIC ACTIVITIES

The previous section outlines a general process for learning from incidents. This section outlines the specific activities that should be introduced into the general learning process when incidents involve E/E/PES.

In the proposed extension to the incident reporting scheme we need to decide:
· when to invoke a technical investigation of E/E/PES equipment involved in the incident,
· what additional analyses to perform, and
· who to involve from the supply chain.

These requirements could apply to any type of equipment involved in an incident, but the need for additional analysis and the involvement of suppliers is particularly important for E/E/PES if the associated risks are to be effectively controlled.
E/E/PES have a number of distinctive characteristics that can make them more difficult to understand, control and maintain, including: · configurability and reprogrammability, · functional complexity, · complex/unsafe failure modes, · non-deterministic behaviour, · non-linear behaviour. From our survey and evaluation studies (Part 1, [2]) it is clear that there needs to be demonstrable benefit for the extra effort required to specifically address E/E/PES issues. In other words we need “maximum gain for pain”. “Pain” factors that are likely to prevent adoption of an extended scheme are: · high implementation and running costs, · radical changes to existing procedures and processes, · excessive burdens on staff (eg training, demands on time, complexity). It was therefore important to devise extensions that are: · as modular as possible, · simple to implement and operate, · only invoked when there is significant benefit to be gained. To modularise the E/E/PES component, we view E/E/PES extensions as a subset of the incident analysis, namely a special technical investigation. In a complex incident there could be several technical investigations covering different specialities (such as plant, control and protection systems). These technical investigations contribute results back to the overall incident investigation. In a simple incident there might only be one E/E/PES involved and the technical investigation would probably be more limited. During the overall incident analysis and reconstruction, a sequence of events will be identified that lead to the incident, such as the example illustrated in Figure 6 below. 
Control system fails to maintain intended plant condition → Operator unaware of loss of control → Safety system operates → Operator overrides safety system → Incident occurs

Figure 6 Example event sequence involving E/E/PES

The reasons why particular subsystems failed would be the subject of technical investigations and the technical recommendations on preventing a recurrence would be included in the recommendations of the overall incident investigation.

In this example, two E/E/PES subsystems are involved in the incident (shown by the shaded boxes). However, some events leading to the incident might not be investigated. For example, the obvious failure is in the control system, but this could be viewed as a normal event and the focus for investigation would then be the subsequent events. We could ask why the operator overrode the safety system: it may be that the information supplied by the safety system is inadequate and the functionality of the safety system needs to be improved so the operator is better informed. In other cases, a technical investigation into the control system failure would be appropriate, eg if the failure frequency was too high or the failure mode was not as predicted during design.

4.1 IDENTIFICATION OF E/E/PES

For both simple incident reporting and incident investigation, it is necessary to identify what equipment incorporates E/E/PES. We would recommend that a list or database be set up that identifies the safety-related E/E/PES within an organisation so that equipment can be readily identified. Such a list of safety-related E/E/PES might already exist for normal risk management purposes. Alternatively the safety classification could be incorporated into an existing asset management system that provides normal equipment identification information (eg supplier, equipment name, serial number, and version).

4.2 ADDITION TO INCIDENT RECORDING

When E/E/PES are involved in incidents, this needs to be captured in the incident report.
Note that the E/E/PES need not necessarily be the cause of the incident; from an incident investigation viewpoint it is also necessary to determine if an E/E/PES within the scope of the incident could have helped to prevent the incident.

When the initial incident report is checked, updated and assessed by the supervisor, it will be necessary to establish if the incident involves an E/E/PES. The general incident reporting form should contain a question to capture this. The supervisor should be able to establish the equipment involved, then:
· If there is a pre-existing register of assets that contains supplementary information on E/E/PES, it is sufficient to record the asset identification number or the equipment name and serial number.
· If this is not the case, the supervisor has to determine if the equipment uses E/E/PE components. Generally speaking, this is likely if the equipment is electrically powered. It may be feasible for the supervisor to manually assign an equipment type and operating mode based on workplace knowledge.

Low priority incidents are simply logged (ie held in some incident repository), while higher priority incidents are investigated. For low priority incidents, the only additional information recorded is whether an E/E/PES is involved.

The incident logging function could be part of either safety management or quality management procedures. The appropriate procedures should contain requirements for periodically reviewing the logged incidents. Similarly, equipment problem reports could build on existing engineering maintenance documents and processes, and technical investigations could be undertaken by existing engineering support staff.

4.3 E/E/PES TECHNICAL INVESTIGATION POLICY

The incident investigation may identify that an E/E/PES problem was a contributory cause of an incident, but there can be multiple causes to incidents, especially complex ones.
As part of a general incident investigation, the investigator has to decide whether to perform a technical investigation focussing on the relevant E/E/PE systems. The type of E/E/PES problem that is investigated will depend on industry practice, regulator expectations and the potential consequence associated with the system involved. In deciding whether to investigate a specific E/E/PES involved in an incident, a number of factors may be relevant, including: · the E/E/PES exhibited an unsafe failure mode, · the E/E/PES failure behaviour is unexpected and unexplained, · the E/E/PES failure frequency is excessive, · the consequences of E/E/PES failure are high (eg it is the sole means of risk reduction), · there are multiple applications using the same E/E/PES, · the E/E/PES equipment or application software is novel (the uncertainty associated with using new techniques makes learning particularly important). Factors that will tend to indicate that a technical investigation is not needed include: · the undesired behaviour is known and countermeasures are in place, · the safety and economic or environmental consequences of failure are low, · multiple protection or mitigation layers exist to reduce risk from the same hazard, · the equipment and system application is simple and has been in use for some time. In the incident reconstruction shown in Figure 6, there are two E/E/PE systems involved. The control system exhibited unsafe behaviour, but there is a separate protective system. If the control equipment is well established, and the failure mode is already known and the countermeasures are considered adequate, there would be little point in performing a technical investigation as little would be gained in terms of new understanding or recommendations. 
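The two lists of factors above could be recorded in a structured way to support the investigator's decision. The sketch below is illustrative only: the short factor labels are paraphrases introduced for this example, and the function deliberately does not compute a verdict, since the text leaves the weighing of factors to the investigator's judgement.

```python
from typing import Dict, Iterable, List

# Paraphrased labels for the factors listed in the text (hypothetical wording)
FACTORS_FOR = (
    "unsafe failure mode",
    "unexpected and unexplained failure behaviour",
    "excessive failure frequency",
    "high consequences of failure",
    "multiple applications use the same E/E/PES",
    "novel equipment or application software",
)
FACTORS_AGAINST = (
    "behaviour known and countermeasures in place",
    "low safety, economic or environmental consequences",
    "multiple protection or mitigation layers",
    "simple equipment with long service history",
)


def sort_factors(observed: Iterable[str]) -> Dict[str, List[str]]:
    """Split the observed factors into those for and against investigating."""
    observed = set(observed)
    unknown = observed - set(FACTORS_FOR) - set(FACTORS_AGAINST)
    if unknown:
        raise ValueError(f"unrecognised factors: {sorted(unknown)}")
    return {
        "for": [f for f in FACTORS_FOR if f in observed],
        "against": [f for f in FACTORS_AGAINST if f in observed],
    }


# The well established control system from the Figure 6 example
control_system = sort_factors([
    "unsafe failure mode",
    "behaviour known and countermeasures in place",
    "multiple protection or mitigation layers",
    "simple equipment with long service history",
])
print(control_system)
```

For the control system in this example, one factor favours investigation and three weigh against it, which is consistent with the conclusion in the text that little would be gained from a technical investigation.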
If, on the other hand, the control equipment exhibits unexpected failures or frequent failures, there would be a case for technical investigation, because the control system is imposing extra demands on the safety system.

In the same example, the E/E/PE-based safety system actually operated as intended, so while the safety integrity requirement is high, there is no requirement for a technical investigation of the safety system itself. On the other hand, we might wish to ask why the operators failed to perform their tasks correctly, including why the safety system was overridden. A technical investigation might then analyse the human factor aspects, which could lead to recommendations for changes to existing systems including the E/E/PE-based safety system.

Some E/E/PES problems may not get associated with any general incident investigation. The majority of industrial applications have large numbers of E/E/PE systems, and failures of individual components occur at frequent intervals. Failures will be reported by either operational staff recognising that the facility is not operating as intended or by maintenance staff conducting routine testing and inspection. Not all failures will be investigated. Often maintenance staff simply replace failed components and operations staff then restart the facility. For hard-pressed staff this is normally the easiest option. It is also the most sensible option if the cause can be established as a random hardware failure. However failures where the cause can be identified as systematic should always be investigated because if changes are not made to equipment or procedures then the same failure will reoccur later under the same conditions.

First line maintenance staff will need to make the judgement as to whether further investigation is justified. Again the main criterion would be the potential for learning from the incident.
Technical investigations might be initiated if for example:
1 The safety-related system failed when tested. There are no actual consequences, but it could be viewed as a “near miss”, especially if the fault disables or partially disables the safety function (ie a fail-danger fault).
2 There is evidence of a systematic failure so that there is potential for improvement.
3 There are excessive equipment failures.

First-line maintenance staff need to be trained to recognise such problems, and maintenance procedures and report forms need to be appropriate for this purpose. The example equipment problem report form (see Figure E.2) contains a number of ticklist items that help to identify whether a technical investigation is needed (ie whether the problem is in safety-classified equipment and whether the cause is something other than random failure).

4.4 E/E/PES TECHNICAL INVESTIGATION

If it is decided to undertake a technical investigation, the investigation has to answer several main questions:
1 What exactly happened?
2 Why did the problem occur?
3 How can the problem be prevented (or reduced in frequency)?
4 Can the consequences be mitigated by other means?

For E/E/PES technical investigations, the investigator should have the relevant knowledge and expertise on the E/E/PES and its operational environment, including:
· the wider operational context in which the E/E/PES is used,
· the function of the E/E/PES equipment,
· integrity level requirements,
· the system design,
· the operation and maintenance procedures and documentation for the E/E/PES.

Several different E/E/PES components may be used to implement a given function. For example, the control system might be a distributed system with elements such as data acquisition, control and monitoring, all implemented on different units. The interactions between these elements may need to be analysed as well as the performance of individual units.

4.4.1 E/E/PES data gathering and reconstruction (what exactly happened?)
The data gathering process has to establish the operating context and what occurred during the incident. E/E/PE systems often have event logging facilities, and often maintain internal records of diagnostic checks and failures. The information should be extracted and safeguarded; in some cases it may be necessary to quarantine E/E/PES equipment, history files and operating logs for further analysis. Special measures may be needed in major incidents (eg where there is a fire) to avoid equipment damage, and damage to electronic storage. It may also be necessary to duplicate information that is stored at remote sites. It is also important to capture any interactions with the E/E/PE systems by operation and maintenance staff (eg by taking witness statements).

For technical investigations of E/E/PES it is important to establish the exact equipment configuration. This may be derived directly from some asset reference or configuration records, but further inspection may be needed to determine the exact serial number and version number of the equipment in use. Identification of the exact configuration is important for subsequent forensic investigations (eg the product code may be the same, but the implementation different). Ideally there should be a unique code for every equipment item, so that it can be more readily tracked down the supply chain (this does occur with some IEC 61508 compliant suppliers).

Significant events associated with the E/E/PES should be derived from the gathered data (eg the time when a failure occurred, whether the operating mode changed, whether new actions were requested or new information was displayed). The sequence of events should be identified. Incident reconstruction and analysis methods are available for this purpose, for example timelines or ECF charts (see Appendix A of Part 1).

4.4.2 Incident analysis (why did it happen?)
We have to identify the immediate cause of the E/E/PES maloperation before we can identify means for its prevention or mitigation. It is helpful to have a systematic method for determining the nature of the problem. A range of different methods has been identified in Appendix A of Part 1, and examples of their application are given in Appendix A of Part 3. This includes the PARC method (PES Analysis of Root Cause) in Appendix C, which was developed during this research project and is based on the PRISMA method. PARC comprises a flowchart with a series of questions to establish whether the problem is due to:
· operator actions,
· maintenance actions,
· equipment environment problems (eg electromagnetic interference, temperature, vibration),
· random failure of equipment,
· inappropriate functionality,
· installation/integration with other equipment.

Identification of some of these items might require specific tests (such as diagnostic tests, or replication of the circumstances of the original problem).

For relatively simple E/E/PES technical investigations, it may be sufficient to use a simple checklist (see Figure E.2) as illustrated in Figure 7 below. Analysis of some “normal events” could consist of just recording the problem and its immediate cause. For example a random hardware failure might simply be rectified by repair or replacement. For less routine cases, or where operational, maintenance or environmental factors are identified, a more detailed technical review may be requested.

If the problem is related to equipment functionality, it may be necessary to pass the initial problem report to the supplier for further analysis. This could result in changes to the equipment and improvement to the supplier processes to prevent similar errors in future. For safety-classified equipment, rules could be set in place for mandatory reporting of problems through the supply chain.
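The routing of a problem according to its immediate cause, as described above, could be sketched as follows. This is a simplified illustration and not the PARC flowchart itself: the names are hypothetical, the returned strings are paraphrases, and the route for installation/integration problems is an assumption because the text does not name one explicitly.

```python
from enum import Enum


class Cause(Enum):
    OPERATOR = "operator actions"
    MAINTENANCE = "maintenance actions"
    ENVIRONMENT = "equipment environment problems"
    RANDOM_FAILURE = "random failure of equipment"
    FUNCTIONALITY = "inappropriate functionality"
    INTEGRATION = "installation/integration with other equipment"


def follow_up(cause: Cause) -> str:
    """Suggest a follow-up route for a given immediate cause."""
    if cause is Cause.RANDOM_FAILURE:
        # A "normal event": record it and rectify by repair or replacement
        return "record problem; repair or replace"
    if cause is Cause.FUNCTIONALITY:
        # Functionality problems may need analysis by the supplier
        return "pass problem report to supplier"
    if cause in (Cause.OPERATOR, Cause.MAINTENANCE, Cause.ENVIRONMENT):
        return "request detailed technical review"
    # Assumption: integration problems are treated as a less routine case
    return "request detailed technical review"


for cause in Cause:
    print(f"{cause.value}: {follow_up(cause)}")
```

In practice the route would also depend on whether the equipment is safety-classified and on the seriousness of the consequences, which this sketch does not model.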
In other cases, especially where the consequences are serious or potentially serious, a further stage of investigation should be undertaken to determine in detail the underlying causes. This is described in the following section.

4.4.3 How can this problem be prevented or mitigated in future?

For particular problems, especially those with potentially major consequences, it is worthwhile to consider whether the problem can be minimised either by eliminating underlying causes or by reducing the impact of the failure.

If the immediate cause is under the control of the “problem owner”, then an analysis can be undertaken to determine what preventive measures can be applied locally, both to rectify the current problem and to prevent similar problems within the organisation.

Equipment Problem Sheet
Manufacturer: ACME corp
Product name: CONPAC 800
Serial no.: AC-b72-1289
Configuration/version information: Rev 8.12
Location: Block A12
Date: 12 Jun 2003
Incident reference (if applicable): INC/PROD_A12/45
Used for: [x] control [ ] monitoring [ ] alarm/direction [ ] protection [ ] (blank) [ ] safety-classified

Problem Description
[ ] failed to act when needed [ ] acted when not needed [x] acted in unexpected way [ ] failed completely [ ] failed during test [ ] maloperation of equipment by staff [ ] other [ ] dangerous/potentially dangerous
Details: Controller overfilled vessel.

Immediate Cause
Operational: [ ] operator action [ ] operator inaction
Maintenance: [ ] fault diagnosis [ ] fault correction [ ] left in wrong mode [ ] configuration [ ] calibration
Environment problem: [ ] condensation [ ] contamination [ ] temperature [ ] vibration [ ] EMI
Equipment functionality: [ ] user interface [ ] operational functions [ ] maintenance functions [x] response to component failure [ ] performance [ ] random hardware fault
Equipment integration/installation: [ ] power supplies [ ] wiring [ ] connections [ ] incompatible interfaces [ ] incompatible subsystems [ ] configuration
Details: Equipment failed to detect that 4-20 mA level sensor had failed and was reading low. As a result the controller kept pumping until there was an overflow trip.

Actions
[x] log problem [x] repair/reconfigure equipment [ ] report to supplier [x] problem prevention review
Details: 4-20 mA sensor replaced.

Figure 7 Initial technical investigation

This analysis can be linked to the IEC 61508 life cycle and the activities within it. Although a comprehensive taxonomy could be created for all the requirement subclauses of parts 1 to 3 of the standard, it would be so complex that it would be impracticable to apply. Some simplification is necessary even for the most comprehensive of incident analysis schemes. The taxonomy used will depend on the depth and scope of the study. The default scope should include all lifecycle phases but this may be reduced according to the circumstances. For example, the scope of an investigation by the system supplier may be limited to the E/E/PES realisation phase.

A proposed taxonomy that covers the main requirements of all the lifecycle phases of IEC 61508 is shown in Appendix A. This includes the common requirements such as safety management and competence.
Use of the taxonomy to this level of detail is probably only justified when analysing incidents with very high consequences. In practice, information available at the operating stage may allow only coarse judgements to be made about causes originating prior to the equipment validation stage. In any case, tracing the cause of an incident to a specific lifecycle phase may be of limited value to an operating asset if this phase is not within the investigating company’s present control. However, it will generally be of value to identify a cause at the design or installation phase since this will imply the need for modification.

Appendix B contains a simplified taxonomy that aims to provide an appropriate level of detail for the majority of operational systems. The PARC method in Appendix C includes a focused set of questions that identify the relevant phases in the lifecycle where the problem could have been prevented. Note that there can be more than one possible means of prevention, and every option should be considered when making recommendations for improvement.

Alternatively, a checklist approach could be used to help consider the options systematically, as shown in Figure 8 below. The checklist could also record the results of a PARC analysis.

In the case of a detailed investigation by the system vendor, the scope will be reduced (eg to design, validation, installation and commissioning) and the remaining lifecycle stages are not relevant. In this case the more detailed taxonomy of system realisation given in Appendix D can be used. Again the supplier may choose to convert this to a checklist to assist in the analysis process.

For complex incidents involving many different interacting elements a more wide-ranging incident analysis may be needed.
Several analysis options are identified in Appendix A of Part 1, but we consider the following methods to be the most effective and appropriate:
· barrier analysis,
· change analysis, and
· events and causal factors (ECF) charting.

The recommended approach is illustrated using a case study in Appendix A of Part 3. These methods facilitate good coverage of the causal taxonomy in Appendix A. The rationale for this choice of methods is given in Appendix D of Part 3. However this does not preclude the use of other methods, and we would expect many organisations to use pre-existing methods that they find acceptable.

E/E/PES Problem Prevention Checklist

[ ] System assessment
[ ] System design
[ ] Installation and commissioning
[ ] Validation
[ ] Operation and maintenance
[ ] Modification
Safety management: [ ] operation [ ] maintenance [ ] modification
Lifecycle
Competence: [ ] operation [ ] maintenance [ ] modification
Verification
Documentation: [ ] operation [ ] maintenance [ ] modification
Functional safety assessment: [ ] operation [ ] maintenance [ ] modification

[ ] hazard and risk assessment applied/improved
[ ] better allocation of functions to people and equipment
[ ] better specification of system functions
[x] better design and development
[ ] improved operation facilities
[ ] improved maintenance facilities
[ ] improved installation plan/procedures
[ ] improved commissioning plan/procedures
[ ] better validation techniques/plan
[x] improved implementation of validation plan
[ ] better validation equipment
[ ] better analysis/resolution of discrepancies
[ ] better usability assessment
[ ] operation procedures improved
[ ] impact assessment of operation procedures
[ ] operation procedures properly applied
[ ] maintenance procedures improved
[ ] impact assessment of maintenance procedures
[ ] maintenance procedures properly applied
[ ] routine operation and maintenance audits
[x] test interval changed
[ ] permit/hand over procedures
[ ] procedures to monitor system performance
[ ] selection/application of tools
[ ] procedures applied to initiate modification in the event of systematic failures or vendor notification of faults
[ ] modification authorisation procedures
[ ] impact analysis of modification
[ ] improve modification planning (including following appropriate lifecycle)
[ ] improve implementation of modification plan
[ ] better manufacturers information
[ ] improve verification and validation of modification
[ ] safety culture
[ ] safety audits
[ ] management of suppliers
Ensure adequate definition of lifecycle process: [ ] operation, [ ] maintenance, [ ] modification
[ ] procedures for ensuring competence
[ ] check training, experience and qualification of a person
[ ] specify job requirements
Verification of: [ ] operation, [ ] maintenance, [ ] modification
[ ] check documentation available/complete
[ ] check documentation clear and correct
[ ] check documentation well structured
[ ] check documentation up to date
[ ] introduce safety assessment
[ ] improve safety assessment
[ ] ensure adequate skills and independence of assessment team

Figure 8 Problem prevention checklist

4.4.4 Recommendations

Recommendations are made to reduce the likelihood of an E/E/PES malfunction recurring or to mitigate the consequences of failure. Problems might be resolved by:
· eliminating the need for the E/E/PES (eg using alternative means to achieve the organisation’s goal);
· re-engineering or replacing the E/E/PES;
· adding external barriers and warning signs;
· changing operational procedures and work practices;
· changing operator training and assessment;
· introducing new procedures or equipment to mitigate the potential consequences.

A checklist of possible actions is given in Figure E.6, and the options are reproduced in Figure 9 below. The recommendations generated will depend on what has been established in the causal analysis and an assessment of options based on such factors as effectiveness, implementation time and implementation cost.
Policy change
  ☐ procurement procedures
  ☐ supplier approval
  ☐ engineering standards
  ☑ acceptance procedures

Equipment change
  ☐ warning/advisory labelling
  ☐ equipment relocation
  ☐ environmental protection
  ☐ hardware repair
  ☑ upgrade to new version
  ☐ equipment replacement
  ☑ reprogramming

Operational change
  ☐ procedures
  ☐ support documentation
  ☐ access controls
  ☐ warnings
  ☐ staff training
  ☐ staff briefing
  ☐ staff supervision

Maintenance change
  ☑ procedures
  ☐ support documentation
  ☐ access controls
  ☐ warnings
  ☐ staff training
  ☐ staff briefing
  ☐ staff supervision

Figure 9 Recommendations checklist

4.5 DISSEMINATION AND LEARNING

The dissemination and learning that can occur between end users and E/E/PES suppliers is shown in Figure 10 below. The diagram only shows two suppliers in the chain, but there can be an indefinite number, including:
· a main contractor (who might have designed and supplied site infrastructure, plant and C&I equipment);
· an E/E/PES system supplier who programs, installs and maintains one or more E/E/PES products to perform the specified application tasks;
· E/E/PES product suppliers (eg of PLCs or distributed control systems).

Often the main contractor supplies a “turnkey” system to the user, but has no responsibility to support the system in operation. Figure 10 excludes the main contractor from the reporting chain, but ideally they should also participate in the problem resolution and learning process.
[Figure 10: diagram not reproduced. Labels recoverable from the original are: parties (End user, System designer, E/E/PES component supplier, Other end users, Other system suppliers); problem reports (End user incident, E/E/PES system problem, E/E/PES component problem); corrective responses (Operational workarounds, Warnings, Barriers, Replacement, System redesign/component replacement, E/E/PES redesign, Assess/install revision); disseminated information (Alerts, Known faults, Workarounds, Patches, New versions); general lessons (Procurement, Introduction into workplace (training and implementation programs), System design rules, Preferred products, Checklists, Test and commissioning procedures, Staff competence, E/E/PES design rules, Test procedures).]

Figure 10 Supply chain dissemination and learning

4.5.1 E/E/PES supply chain

The activities in the supply chain are summarised below.

Problem investigation

Any problems associated with the use of the E/E/PES should be reported along the supply chain. To the E/E/PES supplier, the reported problem (eg information in a form similar to Figure 7) can be treated as a learning opportunity in the same way as an “incident” in Figure 1. When the learning process is applied within an E/E/PES supplier organisation, it can be used to correct problems in the E/E/PES (or mitigate their effects) and to improve the company’s internal processes to reduce the likelihood of a recurrence. To aid this, a checklist based on IEC 61508 requirements is given in Figure E.7. For companies that integrate E/E/PES components, the checklist in Figure E.5 may be more appropriate. A checklist of possible recommendations arising from the problem investigation is given in Figure E.8.

Response to user

The supplier should confirm to the user that an E/E/PES problem exists and identify possible ways of resolving the problem (eg via a workaround or an update to the E/E/PES).
This information may be included in the user’s incident investigation report if the E/E/PES was regarded as being a causal factor in an incident.

Dissemination

If the E/E/PES is used on many different sites, the supplier should disseminate information about any newly discovered problems to all users. Possible dissemination methods include:
· issuing immediate problem alerts (together with any workarounds and patches);
· responding with known workarounds when a problem is reported;
· making records of known faults (together with any workarounds and patches) available to the E/E/PES users;
· notifying users of a new version with a record of the fault fixed.

The dissemination mechanism chosen will usually depend on the severity of the fault and on whether the E/E/PES is being used in safety-related applications. To perform effective dissemination, a record of the end use of the equipment must be maintained by the E/E/PES system supplier. IEC 61508 compliant suppliers will be aware of the safety integrity requirements of their end users, and they can use this to establish priorities for rapid dissemination. Similar processes apply for E/E/PES product suppliers, except their clients are normally E/E/PES application system suppliers.

Learning from individual problem reports

In addition to disseminating reported problems and fixes, suppliers should perform further analysis to establish what improvements can be made in their development lifecycle processes. The analysis should determine where errors occurred in the process, and whether there are any mechanisms that could be introduced to eliminate each error or detect it at a later stage.
For a system supplier this might lead to changes in:
· procedures (specification, design, test, installation, commissioning, support);
· review checklists (specification, design, software, documentation);
· test, installation and commissioning checklists;
· design rules (standard design, product configuration rules, software programming standards);
· methods and tools (for design, programming, testing, etc);
· staff competence requirements and training.

An E/E/PES product supplier should undertake the same type of analysis as an E/E/PES system supplier, except that installation and commissioning need not normally be considered.

Learning from a problem database

A problem database can be periodically reviewed for:
· trends – whether problems are reducing or increasing,
· hot-spots – whether particular clients are experiencing greater problems,
· frequency – which problems arise most frequently.

Unsatisfactory performance can be used to trigger an investigation. This investigation might result in the process changes already discussed above, but could also result in additional changes to reduce the incidence of such problems, such as in:
· proof testing, maintenance or inspection frequency,
· components used (eg to improve reliability or failure mode of the system),
· component supplier (to improve supply chain response times, access to information, etc),
· safety integrity level requirements if the demand rate on a protection system is higher than anticipated during design,
· architecture, such as the introduction of redundancy or majority voting,
· customer guidance and product information (eg clarifying functionality, usage constraints, guidance on installation best practice).

4.5.2 E/E/PES end user

Specific problems with E/E/PES can be shared with other interested parties.
Typically the interest lies in problems concerning:
· A specific product (eg a specific medical device or a PLC) where user experience relating to a given product and any known workaround can be shared with other sites.
· A class of products (eg heart pacemakers) to identify generic failure modes and operational problems.
· An application area (eg oil and gas production, transport, signalling, alarm system) to identify generic E/E/PES design and operational problems.

The incident database within an organisation can be analysed periodically to provide similar information. It can also be used to identify trends, hot spots and high-frequency problems that could require management action to control. These analyses will require some method for characterising the E/E/PES incident (preferably using some common scheme), so that relevant incidents can be retrieved. The approach to classification is discussed in the following section.

5 CLASSIFICATION OF E/E/PES PROBLEM RECORDS

Incident classification supports long-term proactive learning because it helps significantly in the identification of trends and “hot spots” in any particular set of incidents. The more facets of the learning process reflected in the classification scheme, the greater the range of information that can be obtained by analysing the incident data. A scheme can characterise the failure and the cause, and identify how the problem can be prevented in the future. We propose that incidents involving E/E/PES are classified using the following disjoint classifications:
· A failure mode classification – what type of failure behaviour is exhibited.
· A problem classification – the immediate cause of the failure (including problems in surrounding E/E/PES infrastructure such as personnel or other systems).
· A problem prevention classification – where in the system development lifecycle the problem could have been prevented.
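As a minimal sketch of how a record classified in this way might be stored and queried, the Python fragment below holds the three proposed classifications as fields and counts records by any one of them. The `ProblemRecord` layout, the `site` field, the `frequency` helper and the sample records are illustrative assumptions and are not part of the scheme; the category strings follow those proposed in Section 5.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ProblemRecord:
    """One classified E/E/PES problem record (hypothetical layout)."""
    site: str              # assumed field: where the problem was reported
    failure_mode: str      # what type of failure behaviour was exhibited
    problem: str           # immediate cause of the failure
    prevention_phase: str  # lifecycle phase where it could have been prevented


def frequency(records, facet):
    """Count records by one classification facet, most frequent first."""
    return Counter(getattr(r, facet) for r in records).most_common()


# Invented sample data, for illustration only.
records = [
    ProblemRecord("Site A", "Equipment failed to act when required",
                  "Maintenance problem", "Operation and maintenance"),
    ProblemRecord("Site A", "Equipment acted when not required",
                  "Setting/calibration problem", "Operation and maintenance"),
    ProblemRecord("Site B", "Equipment failed to act when required",
                  "Random hardware fault", "Realisation"),
]

print(frequency(records, "prevention_phase"))
print(frequency(records, "site"))
```

Counting by `prevention_phase` gives a simple frequency view, while counting by `site` gives a crude indication of hot spots. Trend analysis would additionally need a date field, which is omitted here.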
The problem prevention classification is based on the analysis of IEC 61508 causal classification categories (detailed in Appendix A).

Each type of organisation has a different “viewpoint”, with different capabilities and available information. For example, a user organisation is unlikely to be aware of the reasons for an E/E/PES hardware or software failure, or be able to rectify it. So the classification scheme has to be relevant to the viewpoint of a specific organisation. The proposed hierarchic classifications can be extended with industry-specific subclassifications by individual end users and system suppliers, while still allowing data to be aggregated into top-level classifications. In addition, it may be necessary to “collapse” parts of the causal classification if the root causes cannot be identified or controlled by the specific organisation.

5.1 FAILURE MODE

The failure mode is the direct contributor to the incident. It enables the contribution of the equipment failure to the incident to be established (even if the underlying cause is unknown). Information about failure mode might also be relevant in long-term analyses of equipment problems. The proposed categories are:
· Equipment failed to act when required
· Equipment acted when not required
· Equipment acted when required, but in an unexpected way
· Operator did not use the equipment as intended.

These categories are consistent with the PARC method described in Appendix C, so the results of a PARC analysis can be transcribed to a database or to a standard form such as the checklist in Figure E.2.

For decisions on further technical investigation and for later analysis, it is important to identify whether the failure mode was hazardous. So an additional classification is proposed to identify the impact of the failure, namely either:
· Fail-safe, or
· Fail-danger.

This information might be directly determined when the equipment has a specified safe failure state.
In other cases, it may depend on the incident context and can only be determined once the incident has been analysed.

5.1.1 Problem leading to failure

The problem leading to the failure should be identified. The categories of immediate problem identified here are consistent with those in the PARC analysis method (except that in PARC “environment problem” and “interfacing equipment problem” have been merged into a single heading “failure due to environment”). The proposed categories of problems leading to failure are:
· Operator problem
· Maintenance problem
· Setting/calibration problem
· Environment problem (corrosion, interconnection, vibration, etc)
· Inappropriate equipment functionality
· Interfacing equipment problem
· Random hardware fault.

Again it should be relatively simple to derive this classification directly, using a PARC analysis or the equipment problem checklist.

5.2 PROBLEM PREVENTION

This part of the classification scheme identifies the point in the lifecycle where the problem could have been prevented. These stages are aligned with the overall lifecycle in IEC 61508. The full set of classification categories is given in Table A.1, but the main headings are:
· Concept
· Overall scope
· Hazard and risk assessment
· Overall safety requirements
· Allocation
· Installation and commissioning planning
· Validation planning
· Operation and maintenance planning
· Realisation
· Installation and commissioning
· Validation
· Operation and maintenance
· Modification.

The realisation phase for E/E/PES is classified by sub-phase in Table A.2 (this is mainly relevant to E/E/PES system suppliers).
The sub-phases are:
· E/E/PES functional requirements specification
· E/E/PES integrity requirements specification
· E/E/PES validation planning
· E/E/PES system design
· E/E/PES operation and maintenance facilities
· Software requirements specification
· Software design and development
· Software validation planning
· E/E/PES integration
· E/E/PES operation and maintenance procedures
· E/E/PES validation.

Table A.3 classifies the common processes identified within IEC 61508 that may need improving to prevent failures:
· Safety management
· Safety lifecycle
· Competence
· Verification
· Documentation
· Functional safety assessment.

These taxonomies can classify the causes identified by a range of causal analysis techniques (Appendix A of Part 3 gives an example of this). The taxonomies can be reduced in scope and detail for use with simpler causal analysis methods like PARC and checklists.

5.3 ADAPTING THE CAUSAL CLASSIFICATION FOR DIFFERENT PERSPECTIVES

5.3.1 End user viewpoint

The end user may have a limited view of the internal reasons for failure of an E/E/PES, probably because they had little involvement in some of the early stages of development. For example, the user may have employed an independent contractor to supply the E/E/PES. If this is the case, the end user has little scope for preventing the problem in these lifecycle phases, has little use for detailed information for these phases, and probably will not be able to obtain it anyway. Appendix B contains a simplified classification for this scenario, where the stages “Concept”, “Overall scope” and “Hazard and risk assessment” are collapsed to “System concept”, and “Overall safety requirements” and “Allocation of safety requirements” are collapsed to “Safety requirements and allocation”.
The remaining phases are:
· E/E/PES installation and commissioning planning
· E/E/PES validation planning
· E/E/PES operation and maintenance planning
· E/E/PES realisation
· E/E/PES installation and commissioning
· E/E/PES validation
· E/E/PES operation and maintenance
· E/E/PES modification.

This reduced taxonomy can be applied by the end user, and is used as the basis for classification in the PARC analysis method given in Appendix C. The questions in the PARC analysis are designed to identify one or more of the main lifecycle stages or common processes where the problem could have been prevented. The checklist in Figure E.5 is also based on the reduced taxonomy in Appendix B.

5.3.2 Supply chain viewpoint

The E/E/PES system integrator also has a restricted viewpoint, as they are unlikely to have been involved in the early stages of concept development, functional allocation or hazard and risk analysis since these activities require industry-specific knowledge and knowledge of the overall system architecture. The viewpoint from a fault prevention perspective is likely to be restricted to the E/E/PES realisation phase, classified in detail in Table A.2. Its sub-phases are already listed in Section 5.2 above. Appendix D gives a list of E/E/PES realisation categories drawn from Table A.2, plus E/E/PES installation and commissioning. Table D.2 reproduces the categories of Table A.3. These are the basis for the E/E/PES problem prevention checklist in Figure E.7. If the E/E/PES supplier is also involved in later phases, such as E/E/PES maintenance or E/E/PES modification, some combination of the classifications in Appendices B and D may be needed.

6 CONCLUSIONS

The scheme described in this document (termed PARCEL) focuses on learning from the E/E/PES-related aspects of incidents. It is important that the scheme is practical (ie is integrated with current management and reporting systems, has reasonable set-up costs and is usable).
The scheme aims to be applicable to different industry sectors and to organisations at different points in the E/E/PES supply chain, and it aims to accommodate different levels of organisational complexity and maturity. It is apparent from industry consultation that companies are likely to have existing incident reporting infrastructures and analysis methods, and any scheme for E/E/PES should not disrupt this existing infrastructure. Therefore PARCEL is designed as an “add-in” to general incident reporting schemes to minimise the changes needed in existing schemes. The use of separate forms and a separate “technical investigation” makes it easier to include E/E/PES investigations in the overall incident handling infrastructure. In addition the add-in approach allows different staff to undertake E/E/PES analysis from that undertaking general incident analysis (eg an instrument engineer for the E/E/PES aspects and an operations supervisor for the more general aspects). Companies with no existing infrastructure can still apply PARCEL by implementing the generic incident reporting scheme, described in this report, alongside the E/E/PES additions. The classification scheme uses a taxonomy derived from IEC 61508 requirements for lifecycle stages and common processes. This provides a common basis for classification across industry sectors and the E/E/PES supply chain, but it can be tailored to suit different organisations by changing the level of detail and the scope of lifecycle phases covered. Organisations may have pre-existing methods of incident analysis. So we focus here on analysis methods that are particularly suited for E/E/PES. A number of options for performing E/E/PES technical investigation are identified, including barrier analysis, change analysis and ECF. But for simple investigations (which are likely to be the majority), a simple flowchart method called PARC and ticklists have been developed, where the classification is inherent in the analysis process. 
Pre-existing causal analysis techniques can be used in conjunction with these E/E/PES-specific approaches, for example by using the results of a PARC or checklist analysis in the overall incident investigation.

For learning to occur, it is important that the results of incident analysis are fed back to the relevant organisations so the processes and products can be improved. The learning process identifies feedback paths to:
· the local company or organisation (to fix immediate E/E/PES operational problems);
· companies in the E/E/PES supply chain (to fix E/E/PES problems and help suppliers to learn);
· other departments and outside organisations (to share the lessons learned).

Checklists are provided for the types of action that can be taken by E/E/PES end users and companies in the E/E/PES supply chain to improve the realisation and operation of E/E/PES-based equipment.

Part 3 provides guidance on adapting PARCEL for users in different industry sectors and with different levels of maturity. It also recommends appropriate methods and tools for incident management, analysis and feedback.

REFERENCES

[1] PG Bishop, LO Emmet, W Black, C Johnson, V Hamilton, F Koornneef, “HSE Learning from Incidents: D2 - Outline scheme for E/E/PE in the context of IEC 61508”, Adelard Deliverable, D/221/2307/2, 2003
[2] PG Bishop, LO Emmet, K Tourlas, W Black, C Johnson, “HSE Learning from Incidents: D3 - Scheme Evaluation”, Adelard Deliverable, D/226/2307/3, 2003
[3] IEE, “The Costs to Industry of Accidents and Ill-Health”, IEE Health and Safety Briefing, No. 03, April 2003, http://www.iee.org/Policy/Areas/Health/hsb03.pdf
[4] IEC 61508, Functional safety of electrical/electronic/programmable electronic safety-related systems.
For further details see http://www.iec.ch/functionalsafety

APPENDIX A IEC 61508 CAUSAL CLASSIFICATION SCHEME

This appendix contains the complete causal classification taxonomy and shows how each of the category items is related to specific subclauses in IEC 61508. The classification scheme covers both IEC 61508 lifecycle phases and requirements that are common to all phases.

Note that in the following tables: “LTA” stands for “Less Than Adequate” and all IEC 61508 references are to Part 1 except where indicated by number in brackets, eg (2). A series of subclauses is delimited by “/”, eg 7.2.2.2/3/4, while a consecutive range of subclauses is denoted by a hyphen, eg 7.2.2.2-4 (both these examples represent the set of subclauses 7.2.2.2, 7.2.2.3 and 7.2.2.4).

Table A.1 Safety lifecycle classification

Lifecycle phase | Objective | Causal classification | IEC 61508 reference
Concept | Development of sufficient understanding of EUC and its environment | 1 LTA consideration of EUC and environment | 7.2.2.1
 | | 2 LTA information about hazards | 7.2.2.2/3/4
 | | 3 hazards due to interactions not considered | 7.2.2.5
Overall scope | Define the boundary of the EUC and control system and scope of hazard and risk assessment | 1 physical environment not specified | 7.3.2.1
 | | 2 external and incident-initiating events to be taken into account not specified | 7.3.2.2/4/5
 | | 3 subsystems associated with the hazards not specified | 7.3.2.3
Hazard and risk assessment | Determine the hazards, event sequences and the EUC risks | 1 LTA hazard and risk analysis | 7.4.2.1/10/11/12
 | | 2 LTA consideration of elimination of hazards | 7.4.2.2
 | | 3 reasonably foreseeable circumstance not identified | 7.4.2.3
 | | 4 event sequence and likelihood not predicted | 7.4.2.4/5
 | | 5 LTA consequence prediction | 7.4.2.6
 | | 6 inappropriate tools and techniques used | 7.4.2.9
Overall safety requirements | Development of safety functions and safety integrity requirements | 1 LTA safety function requirement | 7.5.2.1
 | | 2 necessary risk reduction not defined | 7.5.2.2/6
 | | 3 EUC control system incorrectly designated | 7.5.2.4/5
Allocation | Allocation of safety functions to safety systems | 1 incorrect account taken of skills and resources required or other systems | 7.6.2.1/2/3
 | | 2 common cause not considered | 7.6.2.6/7
 | | 3 operating mode not specified (continuous/low demand) | 7.6.2.5
Installation and commissioning planning | To plan the installation and commissioning | 1 LTA installation plan | 7.9.2.1
 | | 2 LTA commissioning plan | 7.9.2.2
Validation planning | To plan the validation | 1 LTA validation techniques/plan | 7.8
Operation and maintenance planning | To plan the operation and maintenance | 1 LTA operation and maintenance plan | 7.7
Realisation | Realisation of the safety requirement specification (NB These categories are expanded in Table A.2) | 1 LTA specification of E/E/PES functional requirements | 7.2.2/3 (2)
 | | 2 LTA specification of E/E/PES integrity requirements | 7.2.2/3 (2)
 | | 3 LTA E/E/PES validation planning | 7.3.2 (2)
 | | 4 LTA E/E/PES design | 7.4.2/3/7 (2), 7.5 (2)
 | | 5 LTA E/E/PES operation and maintenance facilities | 7.4.5 (2)
 | | 6 LTA software requirement specification | 7.2.2 (3)
 | | 7 LTA software design and development | 7.4 (3)
 | | 8 LTA software validation | 7.3 (3), 7.7 (3)
 | | 9 LTA hardware and software integration | 7.5 (2), 7.5 (3)
 | | 10 LTA E/E/PES operation and maintenance procedures | 7.6 (2)
 | | 11 LTA E/E/PES validation | 7.7.2.2/5 (2)
Installation and commissioning | Installation and maintenance of E/E/PES | 1 LTA installation | 7.13.2.1/2
 | | 2 LTA commissioning | 7.13.2.3/4
Validation | Validate that the E/E/PES meets the requirement specification | 1 LTA implementation of validation plan | 7.14.2.1/3
 | | 2 LTA validation equipment | 7.14.2.2
 | | 3 LTA analysis/resolution of discrepancies | 7.14.2.4
Operation and maintenance | Operate and maintain the E/E/PES to maintain the required functional safety | 1 LTA operation procedures | 7.6.2.1/2/5 (2)
 | | 2 operation procedures not impact assessed | 7.6.2.4 (2)
 | | 3 operation procedures not applied | 7.15.2.1/2
 | | 4 LTA maintenance procedures | 7.6.2.1/2/3/5 (2)
 | | 5 maintenance procedures not impact assessed | 7.6.2.4 (2)
 | | 6 maintenance procedures not applied | 7.15.2.1/2
 | | 7 no routine operation or maintenance audits | 7.15.2.3, 7.6.2.1/2 (2)
 | | 8 permit/hand over procedures | 7.6.2.1 (2)
 | | 9 test interval not sufficient | 7.6.2.3 (2)
 | | 10 LTA procedures to monitor system performance | 7.6.2.1f (2)
 | | 11 tools incorrectly selected or applied | 7.6.2.1g (2)
Modification | To ensure that functional safety remains appropriate during and after modification | 1 LTA procedures applied to initiate modification in the event of systematic failures or vendor notification of faults | 6.2.1l, 7.8.2.2 (2)
 | | 2 LTA authorisation procedure | 7.16.2.2/5, 7.8.2.1c (2)
 | | 3 LTA impact analysis | 7.16.2.3/6, 7.8.2.1b (2)
 | | 4 LTA modification plan (including sufficient lifecycle activities) | 7.16.2.1/6, 7.8.2.3 (2)
 | | 5 LTA implementation of modification plan | 7.16.2.1
 | | 6 LTA manufacturers information | 7.8.2.1 (2)
 | | 7 LTA verification and validation | 7.8.2.4 (2)

Table A.2 Expansion of E/E/PES lifecycle classification

Sub-phase | Sub-classification | IEC 61508 reference
E/E/PES functional requirements specification | 1 LTA derivation of E/E/PES requirements from overall specification | 7.2.2.1 (2)
 | 2 LTA E/E/PES requirements, eg unstructured, unclear, ambiguous | 7.2.2.2 (2)
 | 3 not all safety actions specified | 7.2.3.1a (2)
 | 4 LTA specification of safe (or least hazardous) state and behaviour after restart | 7.2.3.1a (2)
 | 5 LTA throughput and response requirement | 7.2.3.1b (2)
 | 6 LTA EUC operator interfaces specification | 7.2.3.1c (2)
 | 7 LTA operator interface to E/E/PES | 7.2.3.1c (2)
 | 8 LTA interfaces to other systems specification | 7.2.3.1e (2)
 | 9 LTA specification of failure modes or failure rates | 7.2.3.1g (2)
 | 10 not all relevant modes (eg start-up, proof test) defined | 7.2.3.1g/j (2)
 | 11 limiting and constraint conditions not defined | 7.2.3.1h/i (2)
 | 12 LTA specification of security provisions | 7.2.3.1 (2)
E/E/PES integrity requirements specification | 1 target safety integrity level and/or target failure measures incorrectly specified | 7.2.3.2a (2)
 | 2 not all modes of operation defined | 7.2.3.2b (2)
 | 3 LTA definition of proof testing requirements | 7.2.3.2c (2)
 | 4 LTA specification of environmental conditions | 7.2.3.2d/e (2)
E/E/PES validation planning | 1 LTA validation procedures | 7.3.2.1/2 (2)
 | 2 LTA specification of validation equipment | 7.3.2.2d (2)
 | 3 LTA policies for resolving validation failure | 7.3.2.2g (2)
E/E/PE system design | 1 LTA hardware safety integrity including fault tolerance and reliability analysis | 7.4.2.2a (2)
 | 2 LTA measures for systematic safety integrity | 7.4.2.2b (2)
 | 3 LTA response on detection of a fault | 7.4.2.2c (2)
 | 4 LTA separation and independence | 7.4.2.3/4/5 (2)
 | 5 significance of interactions not recognised | 7.4.2.10 (2)
 | 6 LTA design methods or tools | 7.4.2.8/9 (2), 7.4.4 (2)
 | 7 LTA system architecture | 7.4.3.1/2 (2)
 | 8 LTA products application or integration | 7.4.2.11 (2), 7.4.7 (2), 7.5 (2)
E/E/PES operation and maintenance facilities | 1 LTA provision for checking or confirmation of operator inputs | 7.4.5.1c (2), 7.4.5.3 (2)
 | 2 LTA operator displays (complex, confusing) | 7.4.5.3 (2)
 | 3 LTA consideration of maintainability and testability | 7.4.5.2 (2), 7.4.4.3 (2)
Software requirements specification | 1 inconsistency with overall specification | 7.2.2.2 (3)
 | 2 unclear, ambiguous, unverifiable, not testable and traceable | 7.2.2.6 (3)
 | 3 LTA constraints between hardware and software | 7.2.2.8 (3)
 | 4 LTA monitoring and testing | 7.2.2.9 (3)
Software design and development | 1 LTA software design methods | 7.4.2.2-5 (3)
 | 2 LTA software architecture | 7.4.2.6-10 (3), 7.4.3 (3)
 | 3 LTA support tools or programming language | 7.4.4 (3)
 | 4 LTA code implementation | 7.4.5.4 (3), 7.4.6 (3)
 | 5 LTA software testing | 7.4.7 (3), 7.4.8 (3)
Software validation | 1 LTA software validation techniques/plan | 7.3.2.1/2/3/5 (3)
 | 2 LTA assessment of the software validation plan | 7.3.2.4 (3)
 | 3 LTA implementation of the validation plan | 7.7 (3)
E/E/PES integration | 1 LTA hardware and software integration | 7.5 (2), 7.5 (3)
E/E/PES operation and maintenance procedures | 1 LTA specification of the routine actions that need to be carried out | 7.6.2.1a (2)
 | 2 LTA specification of actions and constraints during all modes of operation | 7.6.2.1b (2)
 | 3 LTA maintenance procedures on fault | 7.6.2.1e (2)
E/E/PES validation | 1 LTA implementation of validation plan | 7.7.2.1/3 (2)
 | 2 LTA validation equipment | 7.7.2.2 (2)
 | 3 LTA analysis/resolution of discrepancies | 7.7.2.4/5 (2)
 | 4 LTA usability assessment | 7.7.2.3 (2)

Table A.3 Common requirements classification

Common requirement | Causal classification | IEC 61508 reference
Safety management | 1 necessary management and technical activities not specified | 6.1.1/6.2.1
 | 2 accountability for all activities not specified | 6.1.2/6.2.1
 | 3 activities not implemented or monitored | 6.2.2
 | 4 activities not formally reviewed by all organisations | 6.2.3
 | 5 not all organisations informed of responsibilities | 6.2.4
 | 6 suppliers not reviewed for appropriate quality management system | 6.2.5
Safety lifecycle | 1 detailed lifecycle not defined | 7.1.4.1
 | 2 lifecycle activities not divided into elemental activities | 7.1.4.4
 | 3 scope, inputs and outputs not detailed for each phase | 7.1.4.5/6/7
Competence | 1 no procedures for ensuring competence | 6.2.1h, B.2
 | 2 training, experience and qualification of a person not assessed | 6.2.1h, B.2
 | 3 job requirements not specified | 6.2.1h, B.2
Verification | 1 verification plan not developed or incomplete | 7.18.2.1/2, 7.9 (2), 7.9 (3)
 | 2 verification plan not implemented | 7.18.2.3, 7.9 (2), 7.9 (3)
 | 3 verification plan not fully understood | 7.9 (2), 7.9 (3)
Documentation | 1 documentation absent/incomplete | 5.2.1-5
 | 2 documentation unclear or incorrect | 5.2.6
 | 3 documentation not well structured | 5.2.7-10
 | 4 documentation not up to date | 5.2.11
Functional safety assessment | 1 independent functional safety assessment not planned adequately | 8.2.7/8/9
 | 2 independent functional safety assessment not carried out | 8.2
 | 3 not all phases assessed (including operation and maintenance) | 8.2.3
 | 4 insufficient skills or independence in assessment team | 8.2.11-14
 | 5 tools not assessed | 8.2.5

APPENDIX B SIMPLIFIED CLASSIFICATION (END USER)

Table B.1 Safety lifecycle classification

Lifecycle reference | Lifecycle stage(s) | Classification
System assessment | Concept; Overall scope; Hazard and risk assessment | 1 LTA hazard and risk assessment
Safety requirements and allocation | Overall safety requirements; Allocation of safety requirements | 1 LTA allocation of safety functions to people and equipment
E/E/PES installation and commissioning planning | | 1 LTA E/E/PES installation and commissioning plan
E/E/PES validation planning | | 1 LTA E/E/PES validation techniques/plan
E/E/PES operation and maintenance planning | | 1 LTA E/E/PES operation and maintenance plan
E/E/PES realisation | | 1 LTA specification of E/E/PES requirements
 | | 2 LTA design and development
 | | 3 LTA operation facilities
 | | 4 LTA maintenance facilities
E/E/PES installation and commissioning | | 1 LTA E/E/PES installation
 | | 2 LTA E/E/PES commissioning
E/E/PES validation | | 1 LTA implementation of E/E/PES validation plan
 | | 2 LTA E/E/PES validation equipment
 | | 3 LTA E/E/PES analysis/resolution of discrepancies
 | | 4 LTA usability assessment
E/E/PES operation and maintenance | | 1 LTA operation procedures
 | | 2 operation procedures not impact assessed
 | | 3 operation procedures not applied
 | | 4 LTA maintenance procedures
 | | 5 maintenance procedures not impact assessed
 | | 6 maintenance procedures not applied
 | | 7 no routine operation or maintenance audits
 | | 8 test interval not sufficient
 | | 9 LTA permit/hand over procedures
 | | 10 LTA procedures to monitor system performance
 | | 11 tools incorrectly selected or applied
E/E/PES modification | | 1 LTA procedures applied to initiate modification in the event of systematic failures or vendor notification of faults
 | | 2 LTA authorisation procedure
 | | 3 LTA impact analysis
 | | 4 LTA modification plan (including sufficient lifecycle activities)
 | | 5 LTA implementation of modification plan
 | | 6 LTA manufacturers information
 | | 7 LTA verification and validation

Table B.2 Common requirements classification

Common requirement | Classification
Safety management | 1 LTA safety management: operation
 | 2 LTA safety management: maintenance
 | 3 LTA safety management: modification
 | 4 LTA safety management: external suppliers
Safety lifecycle | 1 LTA lifecycle definition: operation
 | 2 LTA lifecycle definition: maintenance
 | 3 LTA lifecycle definition: modification
Competence | 1 LTA operation competence
 | 2 LTA maintenance competence
 | 3 LTA modification competence
Verification | 1 LTA verification of operation
 | 2 LTA verification of maintenance
 | 3 LTA verification of modification
Documentation [For operation, maintenance or modification] | 1 documentation absent/incomplete
 | 2 documentation unclear or incorrect
 | 3 documentation not well structured
 | 4 documentation not up to date
Functional safety assessment [For operation, maintenance or modification] | 1 no safety assessment
 | 2 LTA safety assessment
 | 3 insufficient skills or independence in assessment team

APPENDIX C PARC CAUSAL ANALYSIS METHOD

The PARC (PES Analysis of Root Causes) causal analysis method contains a simple flowchart for identifying root causes of incidents involving E/E/PES. The flowchart in this appendix is optimised for user organisations to analyse system architectures where potential hazards in equipment under control, or the control system itself, lead to a demand on a separate safety protection system. This architecture is common in chemical plants and oil and gas installations. If used for other situations, it is recommended that consideration be given to the appropriateness of the terminology, questions, lifecycle phases and prevention measures, and changes made where necessary.

C.1 INITIATING PARC

Investigation of incidents involving E/E/PES should be carried out within the context of a general-purpose incident reporting and investigation scheme.
A suitable question within the general report form will need to be framed to trigger a specialist investigation. The question will depend on the types of incident that are required to be subject to investigation. Relevant factors are industry practice, regulator expectations, the potential consequence associated with the system involved, the maturity of the organisation and the skills and resources available.

In many cases it will be good practice to investigate if a demand occurs on a safety system, since a hazard would have occurred if the safety system had been in the failed state. The failure could be considered as a “near miss”. However, in an alarm system, demands may occur frequently and other safety layers may intervene if the system is in the failed state. Hence investigation of the cause of all alarms may not be practicable and may lead to there being insufficient time to investigate other more serious incidents.

An example of a question that may be included in a general-purpose report form is as follows:

Did the incident involve the need for action, the lack of action, the wrong action or the loss of capability of an E/E/PES, eg a demand on a system, a failure on demand or on proof test, an unexpected action?

If the answer to the question on the general-purpose incident report form is “yes” then the following method for identifying the immediate problem and the root cause of the failure can be followed to identify the nature of the E/E/PES problem and the underlying causes in an IEC 61508 context. This method has been given the acronym PARC (PES Analysis of Root Causes).

C.2 APPLYING PARC

Data on the incident will need to be acquired through a combination of witness statements, inspection and testing. The investigation process will depend on the technology involved, the operating environment and the action (or lack of action) that triggered the need for incident analysis.
The user will normally conduct a preliminary investigation but in some cases equipment will need to be returned to the system vendor for detailed investigation.

The PARC method represents a typical process that may be used by a user organisation, represented as a flow chart in Figure C.1. The flow chart begins with the box labelled “System operates correctly to prevent hazard”. If the current description holds for the incident under consideration, then a dotted arrow labelled “Yes” is followed. Otherwise, the solid line should be followed to the next box. When the user is directed to a table column, he should examine every cause listed in that column and record all of those that played a part in the incident. After that he should revisit the box most recently considered and follow the solid line onto the next one. In this way, the user eventually considers every row of events. If the row is applicable, he then considers in turn every event in that row, and lists causes for those events that match the incident.

For each incident, the user should also consider all the listed causes in the common requirements tables, in the context of actions/inactions of both the system and the operator.

The events in the flowchart boxes are described in more detail below.

1 Safety system operates correctly on demand to prevent hazard. Events leading to the demand should be established but can include the following:

· The demand is caused by a maintenance action. Errors in maintenance often cause system demands. Examination of maintenance records and discussions with technicians should establish if maintenance has been the cause of the demand.
· The demand is caused by an operation error. Error in operation is another likely cause of system demands. Examination of operation records and discussions with operators should establish if operation has been the cause of the incident.
· The demand is caused by equipment degradation.
Installation of equipment in a hostile environment or failure to follow installation requirements can cause the equipment to exhibit transient failures. Physical degradation may be caused by contact with the environment or the process. Degradation like contamination or corrosion might be detected by external or internal inspection. Check for compliance with manufacturer’s operating conditions (temperature, pressure, vibration, etc). Equipment may also fail if there are interfacing incompatibilities. Check equipment specifications for compatible interface standards (connectors, wiring, electrical signal standards, signalling protocols, etc).
· The demand is caused by inappropriate equipment function. Testing by simulating the original conditions is required to establish if there is a systematic design flaw in the hardware or software, resulting in the inappropriate behaviour. More detailed investigation by the system vendor may be required.
· The demand is caused by a random hardware failure. All equipment has a failure rate and providing the overall demand rate is what has been assumed during design then no further investigation will be needed. Demands should be logged and compared with the assumptions made during hazard and risk analysis.

2 System fails on proof test, fails to take action when required or takes action when not required. Events leading to this should be established but can include the following:

· The setting is incorrect. A proof test will be needed to establish if the equipment is working according to its setting and an incorrect setting was the cause of the incident.
· The failure is caused by a maintenance action. Errors in maintenance often cause system failures. Inspection, examination of maintenance records and discussions with technicians should establish if maintenance has been the cause of the incident.
· The failure is caused by an operation error. Error in operation is another likely cause of system failures.
Examination of operation records and discussions with operators should establish if operation has been the cause of the incident.
· The failure is caused by equipment degradation. Installation of equipment in a hostile environment or failure to follow installation requirements can cause the equipment to exhibit transient failures. Physical degradation may be caused by contact with the environment or the process. Degradation like contamination or corrosion might be detected by external or internal inspection. Check for compliance with manufacturer’s operating conditions (temperature, pressure, vibration, etc). Equipment may also fail if there are interfacing incompatibilities. Check equipment specifications for compatible interface standards (connectors, wiring, electrical signal standards, signalling protocols, etc).
· The failure is caused by inappropriate equipment function. Testing by simulating the original conditions is required to establish if there is a systematic design flaw in the hardware or software, resulting in the inappropriate behaviour. More detailed investigation by the system vendor may be required.
· The failure is caused by a random hardware failure. All equipment has a failure rate and providing the failure rate of the installed equipment is what has been assumed during design then no further investigation will be needed. Failures should be logged and compared with the assumptions made during the reliability analysis to confirm the system has the required integrity.

3 Incorrect action taken by system or operator. Events leading to this should be established but can include the following:

· No action by operator allows demand on system. Error in operation is a likely cause of EUC failures becoming system demands. Examination of operation record and discussions with operators should establish if lack of action during operation has been the cause of the incident.
· System actions insufficient to terminate hazard.
Actions taken by the system may be insufficient to terminate the hazard. This can be caused by unpredicted failure modes, design errors or incorrect operation and maintenance. Determination of cause will require inspection of equipment and operation and maintenance records but may also require equipment testing and examination of the design basis.
· System takes unpredicted actions. If the system takes unpredicted actions then an investigation will be needed as to whether the system has not functioned as required or a process interaction has occurred.
· Mitigation by system is inadequate. Hazards are often prevented from causing significant harm by the intervention of mitigation systems that minimise the consequences of an incident. Judgement is needed as to whether consequences should have been reduced further. Examination of the event sequences after a hazard has occurred will be necessary.
· Operator fails to mitigate hazard. Hazards are often prevented from causing significant harm by the intervention of an operator. Judgement is needed as to whether consequences should have been reduced further by operator action. Examination of the event sequences after a hazard has occurred will be necessary.
· The action is caused by a random hardware failure. All equipment has a failure rate and providing the incorrect action rate of the installed equipment is what has been assumed during design then no further investigation will be needed. Failure should be logged and compared with the assumptions made during the reliability analysis to confirm the system has the required integrity.

In many cases user organisations will not have the tools or expertise to investigate complex failures of programmable equipment. In such cases the equipment will need to be returned to the system vendor or the system vendor will need to be involved in the on-site investigation. In such cases the user will need to prepare a report detailing what is known about the events leading to the incident.
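
The procedure described in C.2 can be sketched in code. The following is a minimal illustration only, not part of the PARC method as published: the class names, function names and example data are hypothetical, and the three judgement callbacks stand in for decisions that the investigator makes from witness statements, inspection and testing. The first function visits every row of the flowchart and records the listed causes that played a part, then considers the common requirements causes. The second is a simple point comparison of a logged demand or failure rate with the design assumption; it is not a statistical test.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Event:
    name: str          # text of a flowchart box, e.g. "Setting is incorrect"
    causes: List[str]  # entries listed in the table column under that box


@dataclass
class Row:
    name: str          # text of the first box in a row of the flowchart
    events: List[Event]


def parc_walk(
    rows: List[Row],
    common_causes: List[str],
    row_applies: Callable[[str], bool],
    event_matches: Callable[[str], bool],
    played_a_part: Callable[[str], bool],
) -> List[Tuple[str, str, str]]:
    """Visit every row; for applicable rows, record the causes that played a part."""
    findings: List[Tuple[str, str, str]] = []
    for row in rows:
        if not row_applies(row.name):
            continue  # description does not hold: move on to the next row
        for event in row.events:
            if not event_matches(event.name):
                continue
            # Examine every cause in the column under this event.
            findings += [(row.name, event.name, c) for c in event.causes if played_a_part(c)]
    # The common requirements causes are considered for every incident.
    findings += [("Common requirements", "", c) for c in common_causes if played_a_part(c)]
    return findings


def rate_within_assumption(count: int, years: float, assumed_per_year: float) -> bool:
    """Compare a logged demand or failure rate with the rate assumed during design."""
    if years <= 0:
        raise ValueError("observation period must be positive")
    return count / years <= assumed_per_year


# Hypothetical example data, using wording taken from Figure C.1.
EXAMPLE_ROWS = [
    Row(
        "System operates correctly to prevent hazard",
        [Event("Demand caused by maintenance action",
               ["correct maintenance procedure had been used",
                "permit procedures were improved"])],
    ),
    Row(
        "System fails on proof test",
        [Event("Setting is incorrect",
               ["the setting had been checked during validation"])],
    ),
]
```

In use, the callbacks would typically be backed by the investigator's answers on a form rather than by code, and the returned list corresponds to the list of causal factors recorded in the investigation report.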
[Figure C.1: flowchart. The recoverable content is listed below.]

Flowchart rows and event boxes:

System operates correctly to prevent hazard
  – Demand caused by maintenance action
  – Demand caused by operation error
  – Demand caused by equipment degradation
  – Demand caused by inappropriate function

System fails on proof test / System fails to take action when required or takes action when not required
  – Setting is incorrect
  – Failure caused by maintenance action
  – Failure caused by operation error
  – Failure caused by equipment degradation
  – Failure caused by inappropriate function
  – Random hardware failure

Would the incident have been prevented if (by lifecycle phase):

System concept
  – hazard and risk analysis had considered all modes of operation and causes

Design
  – different equipment had been selected
  – the installation design had been different
  – maintenance facilities had been designed adequately
  – operation facilities had been designed adequately
  – specification was correct
  – configuration was correct

Installation & commissioning
  – the equipment had been installed according to design
  – maintenance facilities had been installed according to design
  – operation facilities had been installed according to design

Validation
  – the setting had been checked during validation
  – maintenance facilities had been fully checked
  – operation facilities had been fully checked
  – equipment condition had been fully checked

Operation & maintenance
  – maintenance procedures were applied
  – maintenance procedures were improved
  – maintenance tools were improved
  – test interval was reduced
  – correct maintenance procedure had been used
  – maintenance procedure was improved
  – permit procedures were improved
  – correct operation procedure had been used
  – operation procedure was improved
  – additional protection was provided

Modification
  – setting had been reviewed during impact analysis
  – maintenance facilities or procedures had been reviewed during impact analysis
  – operation facilities or procedures had been reviewed during impact analysis
  – equipment used or installation design had been reviewed during impact analysis

Log failure and check
  – if dangerous failure rate is in line with design assumptions
  – if all expected actions occurred and no unexpected actions occurred
  – if safe failure causes any unexpected actions

Log demand and check
  – if demand rate is in line with design assumptions
  – if demand cause was predicted in hazard and risk analysis

Would the incident have been prevented if (common requirements):

Operation & maintenance
  Competence: operation or maintenance staff were more competent
  Lifecycle: responsibilities were defined better
  Verification: a better verification scheme had been in place
  Safety management: safety culture was improved; audits were more frequent
  Documentation: documentation was clear and sufficient
  Safety assessment: operation and maintenance phase had been assessed

Modification
  Competence: modification had been carried out by more competent staff
  Lifecycle: modification lifecycle was better defined
  Verification: a better verification scheme had been in place
  Safety management: accountabilities were better defined; suppliers had been reviewed
  Documentation: documentation had been updated
  Safety assessment: modification had been assessed

Figure C.1 PARC causal analysis flowchart

[Figure C.1 (continued): flowchart, continued from previous page.]

Flowchart row and event boxes:

Incorrect action taken by system or operator
  – No action by operator allows demand on system
  – System actions insufficient to terminate hazard
  – System takes unpredicted actions
  – Mitigation by system is inadequate
  – Operator fails to mitigate hazard
  – Random hardware failure

Would the incident have been prevented if (by lifecycle phase):

System concept
  – hazard and risk analysis had considered all modes of operation and causes

Design
  – operator facilities had been designed better
  – the installation design had been different
  – additional actions had been specified
  – actions had been faster
  – final actuation device was improved
  – design requirements were better documented
  – mitigation system had been specified
  – mitigation system had been better designed

Installation & commissioning
  – the equipment had been installed according to design
  – mitigation system had been installed according to design

Validation
  – operator facilities had been checked during validation
  – operation facilities had been checked during validation
  – operation facilities had been fully checked
  – mitigation system had been fully checked
  – operator facilities had been fully checked

Operation & maintenance
  – operation procedures were applied
  – operation procedures were improved
  – correct maintenance procedure had been used
  – maintenance procedure was improved
  – proof testing was more frequent
  – correct operation procedure had been used
  – operation procedure was improved
  – permit procedures were improved
  – mitigation procedures were applied
  – mitigation procedures were improved
  – mitigation system was proof tested more frequently
  – operation facilities or procedures were improved

Modification
  – operation facilities had been reviewed during impact analysis
  – necessary system actions had been reviewed during impact analysis
  – need for mitigation had been reviewed during impact analysis

The “Log failure and check”, “Log demand and check” and common requirements entries are the same as on the previous page.

Figure C.1 (continued) PARC causal analysis flowchart

APPENDIX D SIMPLIFIED CLASSIFICATION (SYSTEM REALISATION)

Table D.1 System realisation phase classification

System realisation sub-phase | Sub-classification

E/E/PES functional requirements
specification
  1 LTA derivation of E/E/PES requirements from overall specification
  2 LTA E/E/PES requirements, eg unstructured, unclear, ambiguous
  3 not all safety actions specified
  4 LTA specification of safe (or least hazardous) state and behaviour after restart
  5 LTA throughput and response requirement
  6 LTA EUC operator interfaces specification
  7 LTA operator interface to E/E/PES
  8 LTA interfaces to other systems specification
  9 LTA specification of failure modes or failure rates
  10 not all relevant modes (eg start-up, proof test) defined
  11 limiting and constraint conditions not defined
  12 LTA specification of security provisions

E/E/PES integrity requirements specification
  1 target safety integrity level and/or target failure measures incorrectly specified
  2 not all modes of operation defined
  3 LTA definition of proof testing requirements
  4 LTA specification of environmental conditions

E/E/PES validation planning
  1 LTA validation procedures
  2 LTA specification of validation environment
  3 LTA policies for resolving validation failure

E/E/PES system design
  1 LTA hardware safety integrity including fault tolerance and reliability analysis
  2 LTA measures for systematic safety integrity
  3 LTA response on detection of a fault
  4 LTA separation and independence
  5 significance of interactions not recognised
  6 LTA design methods or tools
  7 LTA system architecture
  8 LTA products application or integration

E/E/PES operation and maintenance facilities
  1 LTA provision for checking or confirmation of operator inputs
  2 LTA operator displays (complex, confusing)
  3 LTA consideration of maintainability and testability

Software requirements specification
  1 inconsistency with overall specification
  2 unclear, ambiguous, unverifiable, not testable and traceable
  3 LTA constraints between hardware and software
  4 LTA monitoring and testing

Software design and development
  1 LTA software design methods
  2 LTA software architecture
  3 LTA support tools or programming language
  4 LTA code implementation
  5 LTA software testing

Software validation
  1 LTA software validation techniques/plan
  2 LTA assessment of the validation plan
  3 LTA implementation of the validation plan

E/E/PES integration
  1 LTA hardware and software integration

E/E/PES installation and commissioning
  1 LTA installation
  2 LTA commissioning

E/E/PES operation and maintenance procedures
  1 LTA specification of the routine actions that need to be carried out
  2 LTA specification of actions and constraints during all modes of operation
  3 LTA maintenance procedures on fault

E/E/PES validation
  1 LTA implementation of validation plan
  2 LTA validation equipment
  3 LTA analysis/resolution of discrepancies
  4 LTA usability assessment

Table D.2 Common requirements classification

Common requirement | Classification (applies to all system realisation phases)

Safety management
  1 necessary management and technical activities not specified
  2 accountability for all activities not specified
  3 activities not implemented or monitored
  4 activities not formally reviewed by all organisations
  5 not all organisations informed of responsibilities
  6 suppliers not reviewed for appropriate quality management system

Safety lifecycle
  1 detailed lifecycle not defined
  2 lifecycle activities not divided into elemental activities
  3 scope, inputs and outputs not detailed for each phase

Competence
  1 no procedures for ensuring competence
  2 training, experience and qualification of a person not assessed
  3 job requirements not specified

Verification
  1 verification plan not developed or incomplete
  2 verification plan not implemented
  3 verification plan not fully understood

Documentation
  1 documentation absent/incomplete
  2 documentation unclear or incorrect
  3 documentation not well structured
  4 documentation not up to date

Functional safety assessment
  1 independent functional safety assessment not planned
  2 independent functional safety assessment not carried out
  3 not all phases assessed (including operation and maintenance)
  4 insufficient skills or independence in assessment team
  5 tools not assessed

APPENDIX E EXAMPLE FORMS

Form field, followed where given by guidance on its use:

Title of form
  This might drive a type or source category, eg Chemco may have other documents/reports etc. In a large scale scheme incident reports may be received from other organisations
Your name
Date of report
Date of incident
Time of incident
Title
  User asked to assign a short title
Incident type
  It may be appropriate for a user to assign a type to an incident (or this may be fixed by the type of user that makes the report)
Describe the incident
  Free text
Unique reference number
  Could be allocated automatically or by the incident report “gatekeeper”
Location of Incident
  May want this categorised – by site, region etc
Was any person hurt?
  This identifies if health and safety mandatory reporting is invoked and would be required for insurance purposes. For accident reports this would need additional data. If applicable detail number of persons hurt, seriousness etc.
Did any damage to property occur?
  Value of damaged property, whether it was personal property etc
Was there a loss of production? If so how much?
  Or other economic loss – for insurance purposes if injury/damage also occurred or for good governance in investigating cost of incident
In your view could this have led to more serious consequences? If so, what could have occurred?
What short term fixes or work-arounds have been applied?
To your knowledge, has this problem occurred before?
Was any electrical or electronic equipment involved that either caused or failed to prevent the incident? If so please file an equipment problem report

Figure E.1 Initial incident report

The first part of the initial incident report could be filled in by staff involved in incident, the next part could be filled in by a supervisor.

Equipment Problem Report

Manufacturer
Product name
Serial no.
Configuration/version information
Location
Date
Incident reference (if applicable)

Used for:
  ☐ protection
  ☐ control
  ☐ monitoring
  ☐ alarm/direction
  ☐ safety-classified

Problem Description
  ☐ failed to act when needed
  ☐ acted when not needed
  ☐ acted in unexpected way
  ☐ failed completely
  ☐ failed during test
  ☐ maloperation of equipment by staff
  ☐ other
  ☐ dangerous/potentially dangerous
  Description: Identify any associated records (like event logs)

Immediate Cause
  Operational: ☐ operator action ☐ operator inaction
  Maintenance: ☐ fault diagnosis ☐ fault correction ☐ left in wrong mode ☐ configuration ☐ calibration
  Environment problem: ☐ condensation ☐ contamination ☐ temperature ☐ vibration ☐ EMI
  Equipment functionality: ☐ user interface ☐ operational functions ☐ maintenance functions ☐ response to component failure ☐ performance ☐ random hardware fault
  Equipment integration/installation: ☐ power supplies ☐ wiring ☐ connections ☐ incompatible interfaces ☐ incompatible subsystems ☐ configuration
  Details

Actions
  ☐ log problem
  ☐ repair/reconfigure equipment
  ☐ report to supplier
  ☐ problem prevention review
  Details

Figure E.2 Equipment problem report (end user)

Original incident report fields
  Check and correct/expand all fields, especially:
  Incident type
  Incident consequences
  Incident potential consequences
  Novel or repeat
  Short term fixes

Fields added after investigation
  Incident reconstruction: Using suitable technique
  Analysis of causal factors: Using suitable technique
  Causal factors: List of causal factors. Problem prevention classification (see Figure E.5 or E.7)
  Recommendations: List of recommended changes to a range of areas including E/E/PES (see Figure E.6 or E.8)
  Investigation details: Could be a reference to separate documents including technical investigations of E/E/PES

Figure E.3 Causal investigation report (user organisation)

Problem report
  If information is available from end customer (or internally if problem occurred while being commissioned/tested by
supplier) eg end user problem report
Supplier problem reference number
  Might be assigned automatically if held electronically
Customer/end user organisation
Location
Date and time of incident
Identity and configuration of equipment
Safety-significance
End user causal analysis that relates problem to E/E/PES
Supplier Investigation
  Determine nature and origin of problem
Causal analysis
  Results of analysis of E/E/PES problem by supplier
Recommendations
  Product recall, redesign/update, redesign/update to related products, warnings (to all affected systems), workarounds, changes to design processes etc
Causal classification
  Classification of problem origin
Process of investigation
  Documentation of how results were arrived at, ie interview records, analysis products, meeting records etc.

Figure E.4 Causal analysis by E/E/PES supplier of system or equipment

E/E/PES Problem Prevention Checklist

☐ System assessment
  ☐ hazard and risk assessment applied/improved
  ☐ better allocation of functions to people and equipment

☐ System design
  ☐ better specification of system functions
  ☐ better design and development
  ☐ improved operation facilities
  ☐ improved maintenance facilities

☐ Installation and commissioning
  ☐ improved installation plan/procedures
  ☐ improved commissioning plan/procedures

☐ Validation
  ☐ better validation techniques/plan
  ☐ improved implementation of validation plan
  ☐ better validation equipment
  ☐ better analysis/resolution of discrepancies
  ☐ better usability assessment

☐ Operation and maintenance
  ☐ operation procedures improved
  ☐ impact assessment of operation procedures
  ☐ operation procedures properly applied
  ☐ maintenance procedures improved
  ☐ impact assessment of maintenance procedures
  ☐
maintenance procedures properly applied
  ☐ routine operation and maintenance audits
  ☐ test interval changed
  ☐ permit/hand over procedures
  ☐ procedures to monitor system performance
  ☐ selection/application of tools

☐ Modification
  ☐ procedures applied to initiate modification in the event of systematic failures or vendor notification of faults
  ☐ modification authorisation procedures
  ☐ impact analysis of modification
  ☐ improve modification planning (including following appropriate lifecycle)
  ☐ improve implementation of modification plan
  ☐ better manufacturers information
  ☐ improve verification and validation of modification

Safety management (☐ operation ☐ maintenance ☐ modification)
  ☐ safety culture
  ☐ safety audits
  ☐ management of suppliers

Lifecycle
  Ensure adequate definition of lifecycle process: ☐ operation, ☐ maintenance, ☐ modification

Competence (☐ operation ☐ maintenance ☐ modification)
  ☐ procedures for ensuring competence
  ☐ check training, experience and qualification of a person
  ☐ specify job requirements

Verification
  Verification of: ☐ operation, ☐ maintenance, ☐ modification

Documentation (☐ operation ☐ maintenance ☐ modification)
  ☐ check documentation available/complete
  ☐ check documentation clear and correct
  ☐ check documentation well structured
  ☐ check documentation up to date

Functional safety assessment (☐ operation ☐ maintenance ☐ modification)
  ☐ introduce safety assessment
  ☐ improve safety assessment
  ☐ ensure adequate skills and independence of assessment team

Figure E.5 E/E/PES problem prevention checklist (end user)

E/E/PES Recommendation Checklist

Policy change
  ☐ procurement procedures
  ☐ supplier approval
  ☐ engineering standards
  ☐ acceptance procedures

Equipment change
  ☐ warning/advisory labelling
  ☐ equipment relocation
  ☐ environmental protection
  ☐ hardware repair
  ☐ upgrade to new version
  ☐ equipment replacement
  ☐ reprogramming

Operational change
  ☐ procedures
  ☐ support documentation
  ☐ access controls
  ☐ warnings
  ☐ staff training
  ☐ staff briefing
  ☐ staff supervision

Maintenance change
  ☐ procedures
  ☐ support documentation
  ☐ access controls
  ☐ warnings
  ☐ staff training
  ☐ staff briefing
  ☐ staff supervision

Figure E.6 E/E/PES recommendation checklist (end user)

This checklist could be associated with specific recommendations to
prevent a local recurrence or with more general changes to prevent similar recurrences.

E/E/PES Problem Prevention Checklist (Page 1 of 2)

Functional requirements specification
  ☐ Derivation of E/E/PES requirements from overall specification
  ☐ Specification of E/E/PES requirements (eg structure, clarity, unambiguity)
  ☐ Completeness of safety actions specified
  ☐ Specification of safe (or least hazardous) state and behaviour after restart
  ☐ Throughput and response requirement
  ☐ Better EUC operator interfaces specification
  ☐ Better operator interface to E/E/PES
  ☐ Better specification of interfaces to other systems
  ☐ Specification of security provisions
  ☐ Specification of failure modes or failure rates
  ☐ Identification of undefined relevant operating modes (eg start-up, proof test)
  ☐ Identification of undefined limiting and constraint conditions

Integrity requirements specification
  ☐ Specification of target safety integrity level and/or target failure measures
  ☐ Specify all modes of operation
  ☐ Better definition of proof testing requirements
  ☐ Better specification of environmental conditions

Validation planning
  ☐ Better validation procedures
  ☐ Better specification of validation environment
  ☐ Better policies for resolving validation failure

Hardware design and development
  ☐ Better hardware safety integrity including fault tolerance and reliability analysis
  ☐ Better measures for systematic safety integrity
  ☐ Better response on detection of a fault
  ☐ Better separation and independence
  ☐ Better identification of interactions between subsystem components
  ☐ Better design methods and tools
  ☐ Better system architecture
  ☐ Products supplied and integrated in accordance with product specifications

Facilities for operation and maintenance
  ☐ Provision for checking or confirmation of operator inputs
  ☐ Better operator displays (complexity, confusion)
  ☐ Better consideration of maintainability and testability

Software requirements specification
  ☐ Consistency checks against the
overall specification
  ☐ Checks requirements are clear, unambiguous, verifiable, testable and traceable
  ☐ Better identification of constraints between hardware and software
  ☐ Better monitoring and testing

Software design and development
  ☐ Better design methods
  ☐ Better software architecture
  ☐ Better support tools or programming language
  ☐ Better code implementation
  ☐ Better software testing

Software validation
  ☐ Better software validation techniques/plan
  ☐ Better assessment of the validation plan
  ☐ Better implementation of the validation plan

Integration
  ☐ Better hardware and software integration

Installation and commissioning
  ☐ Better installation procedures
  ☐ Better commissioning procedures

Figure E.7 E/E/PES problem prevention checklist (E/E/PES supplier)

E/E/PES Problem Prevention Checklist (Page 2 of 2)

Operation and maintenance procedures
  ☐ Better specification of the routine actions that need to be carried out
  ☐ Better specification of actions and constraints during all modes of operation
  ☐ Better maintenance procedures on fault

Validation
  ☐ Better implementation of validation plan
  ☐ Better validation equipment
  ☐ Better analysis/resolution of discrepancies
  ☐ Better usability assessment

Safety management
  Phase: ☐ specification ☐ system design ☐ software ☐ testing ☐ integration ☐ validation ☐ installation ☐ commission
  ☐ specification of necessary management and technical activities
  ☐ define accountability for all activities
  ☐ check activities are implemented and monitored
  ☐ check activities are formally reviewed by all organisations
  ☐ ensure all organisations are informed of responsibilities
  ☐ review suppliers for appropriate quality management system

Lifecycle
  ☐ ensure adequate definition of detailed lifecycle
  ☐ define lifecycle sub-activities
  ☐ ensure scope, inputs and outputs defined for each phase

Competence
  ☐ procedures for ensuring competence
  ☐ assess training, experience and qualification of a person
  ☐ specify job requirements

Verification
  ☐ check
verification plan is developed and complete
  ☐ check verification plan is implemented
  ☐ check verification plan is fully understood

Documentation
  ☐ check documentation is available and complete
  ☐ check documentation is clear and correct
  ☐ check documentation is well-structured
  ☐ check documentation is up to date

Functional safety assessment
  ☐ plan independent functional safety assessment
  ☐ carry out independent functional safety assessment
  ☐ ensure all phases assessed (including operation and maintenance)
  ☐ ensure sufficient skills and independence in assessment team
  ☐ assess tools

Figure E.7 (continued) E/E/PES problem prevention checklist (E/E/PES supplier)

E/E/PES Recommendation Checklist

Equipment change
  ☐ hardware modification
  ☐ redesign/update
  ☐ redesign/update to related products
  ☐ reprogramming

Operational support
  ☐ support staff training
  ☐ product specification
  ☐ installation guidance
  ☐ configuration guidance

Processes
  ☐ contract review procedures
  ☐ component supplier approval
  ☐ engineering standards
  ☐ product dispatch procedures
  ☐ test and validation checklists
  ☐ installation and commissioning procedures
  ☐ staff training
  ☐ staff competence

User response
  ☐ product recall
  ☐ workarounds
  ☐ warnings/alerts (all affected systems)

Figure E.8 E/E/PES recommendation checklist (E/E/PES supplier)

Printed and published by the Health and Safety Executive C30 1/98
Printed and published by the Health and Safety Executive C1.10 12/03
ISBN 0-7176-2789-6 RR 181 £15.00