RR181 - Learning from incidents involving E/E/PE systems

HSE
Health & Safety
Executive
Learning from incidents involving E/E/PE systems
Part 2 - Recommended scheme
Prepared by Adelard for the
Health and Safety Executive 2003
RESEARCH REPORT 181
PG Bishop, LO Emmet
Adelard LLP
C Johnson
University of Glasgow
W Black
Blacksafe Consulting
This report is the second of 3 parts presenting the results of an HSE-sponsored research project. The
overall purpose is to create a scheme for learning from incidents that involve electrical, electronic or
programmable electronic (E/E/PE) systems. Part 1 reviews existing learning processes and causal
analysis techniques, examines industry practice and makes recommendations for a new scheme. Part
2 (this report) presents the recommended scheme and Part 3 gives accompanying guidance, examples
and rationale.
This report and the work it describes were funded by the Health and Safety Executive (HSE). Its
contents, including any opinions and/or conclusions expressed, are those of the authors alone and do
not necessarily reflect HSE policy.
HSE BOOKS
© Crown copyright 2003
First published 2003
ISBN 0 7176 2789 6
All rights reserved. No part of this publication may be
reproduced, stored in a retrieval system, or transmitted in
any form or by any means (electronic, mechanical,
photocopying, recording or otherwise) without the prior
written permission of the copyright owner.
Applications for reproduction should be made in writing to: Licensing Division, Her Majesty's Stationery Office, St Clements House, 2-16 Colegate, Norwich NR3 1BQ or by e-mail to [email protected]
ACKNOWLEDGEMENTS
Adelard LLP wishes to acknowledge numerous invaluable contributions to the project from
the following people: Chris Johnson (Glasgow University), Bill Black (Blacksafe
Consulting), Mark Bowell (HSE), Konstantinos Tourlas (Adelard), Viv Hamilton (Viv
Hamilton Associates), Floor Koornneef (Technical University of Delft).
EXECUTIVE SUMMARY
This report is the second of 3 parts presenting the results of an HSE-sponsored research
project. The overall purpose is to create a scheme for learning from incidents that involve
electrical, electronic or programmable electronic (E/E/PE) systems. Part 1 reviews existing
learning processes and causal analysis techniques, examines industry practice and makes
recommendations for a new scheme. Part 2 (this report) presents the recommended scheme
and Part 3 gives accompanying guidance, examples and rationale.
For lessons to be learned from any kind of incident (not just involving E/E/PES), the
following processes are necessary:
· Incident reporting: workplace staff who witness an incident must report sufficient details
for safety managers to investigate further or analyse for trends where and as appropriate.
· Incident prioritisation: the recipients of incident reports decide to what extent each
incident represents a learning opportunity. A serious incident or accident will obviously
require corrective action but there is also much to learn from near misses, especially those
that form a recurring pattern.
· Incident characterisation and investigation: safety managers analyse selected safety
reports and investigate further as appropriate. The aim is to sufficiently understand what
happened and generate recommendations to reduce the probability of other incidents with
similar causes. The amount of effort spent per incident should be proportionate to the
incident’s learning potential. Characterisation of incident data is necessary if it is to be
collated for trend analysis.
· Response in working context: recommendations generated as a result of the incident must
be implemented in the original working context for any benefit to be realised. For this to
happen, the recommendations must be realistic and tracked to completion.
It is also desirable to disseminate lessons learnt to others, eg to other sites belonging to the
company, or to users, system suppliers and component suppliers in a product supply chain. If
others provide such information, each site or company needs their own procedures for acting
on it.
For a company or organisation to learn lessons from its own or others’ aggregated incident
data, proactive analysis is required.
When incidents involve E/E/PES, specialised activities should be introduced into the learning
process. This is necessary because general learning processes will not include all the
expertise and methods needed to examine and rectify systems where E/E/PE technology is a
significant factor. The organisation learning from an incident needs to decide when to invoke
E/E/PES-specific activities, what additional analyses to perform and whether to involve any
suppliers or contractors.
This report addresses the development, operation and maintenance of safety-related E/E/PE
systems using the terminology and approach of the international standard IEC 61508,
Functional safety of E/E/PE safety-related systems. The recommendations are designed to be
consistent with the requirements of the standard.
The incident reporting process needs to capture sufficient information to prompt the recipient
into considering whether a specialist E/E/PES investigation would be beneficial. To do this,
the reporter must be able to give some indication of when E/E/PES equipment is involved.
The identified E/E/PES need not have been a direct cause of the incident for a specialist
investigation to be worthwhile – it may be that had the E/E/PES been different in some way it
might have helped prevent the incident.
Factors that affect the prioritisation of incidents and the effort required for their investigation
include industry practice, regulator expectations, and both the actual consequence and the
potential consequence associated with the system involved. Unexpected behaviour,
particularly with respect to safety assumptions, will usually have the greatest learning
potential.
A specialist investigation will aim to determine what exactly happened, why the problem
occurred and how it can be prevented or the consequences mitigated in the future. The PARC
causal analysis method is recommended as a simple means of considering the most significant
aspects in an incident and identifying its likely causes. The flow chart given is optimised for
end users to examine situations where failures in their equipment under control, or in the
equipment’s control system, place demands on a separate safety protection system. If more
detailed investigation is required, recommended techniques are barrier and change analysis
and events and causal factors charting.
Results of the investigation can be recorded on checklists, and for simple investigations these
may be adequate by themselves for identifying the most significant causes and for suggesting
appropriate solutions.
End users rely on contractors and suppliers to develop and maintain their E/E/PE systems.
Although incidents generally occur with the end user, effective learning has to involve these
other parties so they can identify defects in their equipment. They in turn have a
responsibility to notify other users of identified problems and fixes.
Incidents involving E/E/PES can be classified according to failure mode, type of problem or
problem prevention (equivalent to root causes). A comprehensive root cause classification
scheme is proposed based on IEC 61508 safety lifecycle activities and on common
requirements that are not specific to any lifecycle phase (eg competence). This scheme can
be condensed according to the responsibilities of those using it. For example, end users will
focus more on operation and maintenance, and system suppliers more on realisation.
Classification significantly aids long-term analysis and sharing of data.
CONTENTS
1 INTRODUCTION
2 SCHEME OVERVIEW
2.1 Scope
2.2 Design objectives
2.3 Scheme concepts
2.4 Scheme design
3 THE PROCESS OF LEARNING FROM INCIDENTS
3.1 Overall learning process and context
3.2 Generic incident reporting and investigation process
3.3 Incident reporting
3.4 Investigation policy
3.5 Incident investigation
3.6 Dissemination and learning
3.7 Proactive learning from incident data
4 E/E/PES SPECIFIC ACTIVITIES
4.1 Identification of E/E/PES
4.2 Addition to incident recording
4.3 E/E/PES technical investigation policy
4.4 E/E/PES technical investigation
4.5 Dissemination and learning
5 CLASSIFICATION OF E/E/PES PROBLEM RECORDS
5.1 Failure mode
5.2 Problem prevention
5.3 Adapting the causal classification for different perspectives
6 CONCLUSIONS
REFERENCES
APPENDIX A IEC 61508 CAUSAL CLASSIFICATION SCHEME
APPENDIX B SIMPLIFIED CLASSIFICATION (END USER PERSPECTIVE)
APPENDIX C PARC CAUSAL ANALYSIS METHOD
C.1 Initiating PARC
C.2 Applying PARC
APPENDIX D SIMPLIFIED CLASSIFICATION (SYSTEM REALISATION PERSPECTIVE)
APPENDIX E EXAMPLE FORMS
FIGURES
Figure 1 Overall learning process
Figure 2 Incident handling process
Figure 3 Incident investigation
Figure 4 Response to external information (the listening process)
Figure 5 Supply chain response to incident
Figure 6 Example event sequence involving E/E/PES
Figure 7 Initial technical investigation
Figure 8 Problem prevention checklist
Figure 9 Recommendations checklist
Figure 10 Supply chain dissemination and learning
Figure C.1 PARC causal analysis flowchart
Figure E.1 Initial incident report
Figure E.2 Equipment problem report (end user)
Figure E.3 Causal investigation report (user organisation)
Figure E.4 Causal analysis by E/E/PES supplier of system or equipment
Figure E.5 E/E/PES problem prevention checklist (end user)
Figure E.6 E/E/PES recommendation checklist (end user)
Figure E.7 E/E/PES problem prevention checklist (E/E/PES supplier)
Figure E.8 E/E/PES recommendation checklist (E/E/PES supplier)
TABLES
Table 1 Example recommendations
Table A.1 Safety lifecycle classification
Table A.2 Expansion of E/E/PES lifecycle classification
Table A.3 Common requirements classification
Table B.1 Safety lifecycle classification
Table B.2 Common requirements classification
Table D.1 System realisation phase classification
Table D.2 Common requirements classification
1 INTRODUCTION
This report is the second of 3 parts presenting the results of an HSE-sponsored research
project. The overall purpose is to create a scheme for learning from incidents that involve
electrical, electronic or programmable electronic (E/E/PE) systems.
Part 1 reviews existing learning processes and causal analysis techniques, examines industry
practice and makes recommendations for a new scheme. The project used these results to
produce a draft new scheme [1]. The draft was subjected to review through an industry
workshop, internal Adelard evaluation and four half-day interviews with two end users, a
system supplier and a component supplier [2]. Part 2 (this report) presents the recommended
scheme derived from this review process. Part 3 gives accompanying guidance, examples and
rationale.
The report consists of:
· An overview of scope, objectives and key concepts for the scheme (Section 2);
· A discussion of the processes involved in learning from all kinds of incidents (Section 3);
· Recommendations on how to address aspects specific to E/E/PE systems (Sections 4 and
5), with supporting causal classification taxonomies (Appendices A, B and D), a simple
flow chart causal analysis method (Appendix C) and example forms (Appendix E).
2 SCHEME OVERVIEW
There are considerable benefits in learning from incidents that occur in an organisation.
Incident reporting and investigation schemes can be implemented for incidents such as:
· accidents to personnel,
· damage to the environment,
· production losses.
If the learning is effective, resultant actions will reduce the chance of a recurrence. In injury
reporting schemes there is often an increase in recorded incidents when the scheme is started
(due to improved reporting) but the rate steadily reduces over time and the number of serious
injuries also decreases. In addition, there are considerable benefits from addressing minor
incidents—studies have shown that minor incidents are far more frequent than major
incidents and that accident costs represent a significant proportion of company profits [3].
Companies and organisations increasingly rely on electrical, electronic and programmable
electronic systems (E/E/PES) that can have an impact on safety, productivity and the
environment. In this document we describe a scheme termed PARCEL (PES Analysis of Root
Cause and Experience-based Learning) that covers the general learning process but focuses
specifically on extensions to cover incidents involving E/E/PES equipment. The scheme
should benefit user organisations and also enable suppliers of E/E/PES to learn from
operational experience and hence improve their products.
2.1 SCOPE
This document presents an incident reporting and analysis scheme for incidents involving
E/E/PES. This will typically form part of a more general incident reporting scheme and may
cover safety, the environment and other concerns of the organisation (eg economic losses).
The scheme is designed to be consistent with the requirements of IEC 61508 [4]. This is a key
international standard for industry. It sets out specific requirements for E/E/PE safety related
systems within a generic framework that defines the safety lifecycle, safety management
activities and detailed requirements for achieving functional safety. Example applications
include:
· emergency shutdown systems,
· fire and gas systems,
· safety interlocks,
· a medical device,
· a safety-related information system (eg a patient database),
· a transport control system (eg an air traffic control system).
The scheme is designed to support two requirements of IEC 61508 in particular. The first of
these is the need to learn from experience. 6.2.1 of IEC 61508-1 states that responsible
organisations or individuals should consider specifying, implementing and monitoring the
progress of:
i) procedures which ensure that hazardous incidents (or incidents with potential to
create hazards) are analysed, and that recommendations are made to minimise the
probability of a repeat occurrence.
This requirement addresses a range of safety stakeholders including the developers and
operators of safety related systems.
In addition IEC 61508-2 requires that:
7.8.2.2 Manufacturers or system suppliers which claim compliance with all or part
of this standard shall maintain a system to initiate changes as a result of defects
being detected in hardware or software and to inform users of the need for
modification in the event of the defect affecting safety.
This places a requirement on the "supply chain" of E/E/PES system suppliers and product
manufacturers to eliminate reported defects that could be a source of safety-related incidents.
It should also be noted that the standard recognises that incident reporting and analysis
reduces the likelihood of common cause failures. IEC 61508-6 Annex D provides guidance
on a methodology for quantifying the effects of hardware-related common cause failures in
E/E/PES. Such effects often dominate the calculation of the reliability that can be achieved
when using redundant systems. Common cause failures make it particularly difficult to
achieve the target reliability for high integrity systems such as those containing safety
functions with a safety integrity level of 3 or 4 (equivalent to better than 10⁻⁷ dangerous
failures per hour). Common cause failures are normally accounted for using a Beta factor.
Annex D of IEC 61508-6 provides a method for the determination of Beta that depends on the
design of the sub-system and how it is to be maintained. Incident reporting and analysis
reduces the need for diversity and can lead to lower beta factors with consequent ability to
achieve the reliability required when high integrity level systems are required. The
introduction of an incident reporting system can therefore lead to reductions in design,
procurement and maintenance costs.
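The influence of the Beta factor on a redundant pair can be illustrated with a short calculation. The sketch below uses a simplified textbook approximation for a one-out-of-two (1oo2) arrangement; it is not the Annex D method itself, and the failure rate, proof test interval and Beta values are hypothetical figures chosen only to show the effect.

```python
def pfh_1oo2(lambda_du: float, beta: float, proof_test_interval_h: float) -> float:
    """Approximate dangerous failure frequency (per hour) of a 1oo2 redundant pair.

    Simplified approximation (an assumption for illustration, not the
    IEC 61508-6 formula): independent double failures plus common cause failures.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie between 0 and 1")
    independent_rate = (1.0 - beta) * lambda_du
    # Both channels fail independently within the same proof test interval.
    independent_term = independent_rate ** 2 * proof_test_interval_h
    # A single common cause event defeats both channels at once.
    common_cause_term = beta * lambda_du
    return independent_term + common_cause_term


TARGET_PER_HOUR = 1e-7  # the 10^-7 figure quoted in the text above

# Hypothetical channel: 1e-6 dangerous undetected failures per hour, yearly proof test.
for beta in (0.10, 0.02):
    pfh = pfh_1oo2(lambda_du=1e-6, beta=beta, proof_test_interval_h=8760.0)
    verdict = "meets" if pfh < TARGET_PER_HOUR else "misses"
    print(f"beta={beta:.2f}: approx {pfh:.2e} per hour ({verdict} the target)")
```

With these illustrative numbers the common cause term is much larger than the independent term, which is why a lower Beta factor can make the difference between meeting and missing the target.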
2.2 DESIGN OBJECTIVES
The design objectives of the scheme are documented in Part 1. In summary, these are for the
scheme to:
· focus on E/E/PES aspects of incidents;
· be practical (ie integrate with current management and reporting systems, have reasonable
set-up costs, be usable);
· be applicable to different industry sectors and to different levels of organisational
complexity and maturity;
· be applicable to end users, systems designers and E/E/PES product suppliers;
· be based on current industry practice and current techniques;
· take account of incident criticality (ie more effort is expended on serious incidents);
· provide concrete examples to aid implementation;
· make learning effective (so that similar problems can be avoided in the future);
· use a consistent classification method that enables incidents to be categorised for
subsequent analysis (eg by industry sector, safety function).
2.3 SCHEME CONCEPTS
The incident reporting and analysis scheme is viewed as part of a learning process. In this
process, lessons learned can be fed back to a range of different stakeholders. These
stakeholders can be in:
· the organisation using the E/E/PES (the end user);
· a system contractor who may have overall design responsibility (eg for plant, buildings
and E/E/PES equipment);
· the organisation that designs and implements the E/E/PES based system for the end user
or main contractor (the E/E/PES system supplier);
· the organisation that provides E/E/PES products to the system supplier (the product
supplier);
· statutory bodies (eg HSE);
· end-user interest groups (eg for specific industry sectors or end-users of specific
E/E/PES).
Learning can take different forms in different organisations. For example, an end-user can
implement short-term risk mitigation by changes to workplace practices. Over the longer
term, risks may be reduced by updating the E/E/PES, the approved vendor lists or internal
standards. In addition the end-user management processes can be changed to reduce the
chance of similar problems on future E/E/PE based systems. Similarly the supply chain can
respond to incident reports by rectifying the E/E/PE system or product and by improving
processes to prevent a recurrence of the problem in future systems.
2.4 SCHEME DESIGN
The following sections present the incident reporting and analysis scheme for incidents
involving E/E/PES. In Section 3 we describe the generic incident handling and learning
processes. These are based on a synthesis of existing industrial schemes examined in Part 1
and revisited during industrial consultation for the draft scheme [2]. Section 4 then identifies
the additional E/E/PES related activities and information that needs to be embedded within
the generic model of incident reporting and investigation. Our design intent is to make these
additions as modular as possible so that they can be readily incorporated into existing incident
reporting schemes.
The E/E/PES-specific additions are intended to identify the causes of failure according to a
classification scheme based on IEC 61508 concepts, and hence to facilitate product and
process improvements that will benefit the different stakeholders associated with the
E/E/PES. The classification approach is described in Section 5. To support the different
stakeholders, the causal classification scheme is hierarchic and can be adapted to reflect the
viewpoint of the reporting organisation. In other words, the scheme can be related to the life
cycle stages and products for which the organisation is responsible (ie where improvements
can actually be made).
Supporting appendices are provided, covering:
· classification scheme details (Appendices A, B, and D);
· example forms for incident reporting and incident investigation, to illustrate the approach
(Appendix E);
· a simple analysis method, PARC, for investigating the causal factors underlying the
E/E/PES-related elements of an incident (Appendix C).
Adaptation of the scheme for specific organisations and industry sectors is addressed in
Part 3.
3 THE PROCESS OF LEARNING FROM INCIDENTS
The end user organisation is likely to have a general incident reporting scheme that is used to
minimise recurrence of incidents. The company may also have formal or informal
mechanisms for learning from experience to prevent recurrence of similar problems on future
systems. In this section we outline these generic processes, and in the following section we
identify where E/E/PES-specific activities can be incorporated within this scheme.
3.1 OVERALL LEARNING PROCESS AND CONTEXT
Figure 1 below presents an overview of a generic incident reporting and learning process.
This is based on the survey of industrial practice documented in Part 1. The components of
the diagram with solid lines constitute a basic incident reporting and analysis system. The
components with dotted lines are additional dissemination and learning activities that can be
used for further process improvement. The arrows represent transfers of information (such as
incident reports, problem reports and recommendations).
There is a local operational working context in which a response can be made. It is in this
context that incidents first occur, some of which are recognised and reported. If an incident is
deemed sufficiently interesting by the organisation (eg it may be potentially serious, novel or
there may be too many incidents of this type), a more detailed characterisation and
investigation occurs. Otherwise, incidents may be merely logged and a known workaround
applied. Corrective actions are generated, and these may range from local workarounds,
through to sharing within the organisation, to communication along the supply chain and
finally to wider industry and regulatory reporting.
[Figure 1 Overall learning process. Numbered components: 1 Incident reporting; 2 Incident
prioritisation; 3 Incident characterisation and investigation; 4 Incident repository;
5 Detailed assessment; 6 Proactive interpretation and analysis; 7 Dissemination function;
8 Listening function.]
There can be more general learning if incident information and supply chain problems are
shared. For example, end-users can tell other departments that might have similar equipment
with similar problems, or suppliers can alert a user organisation if there are known problems
with the equipment they provided. These problems may or may not apply in the new working
context, so there is a need for interpretation before the appropriate recommendations for
corrective action can be determined.
The various process components are elaborated in further detail below.
3.2 GENERIC INCIDENT REPORTING AND INVESTIGATION PROCESS
The actual incident reporting and analysis process varies between companies, but practice
typically includes the elements shown in Figure 2 below.
[Figure 2 Incident handling process. Flow chart from incident report to logging, investigation
and dissemination, with decision points "Novel?", "Serious?" and "Too often?".]
Note that the scheme is designed to minimise overheads for minor incidents where it is
sufficient to log the incident (the leftmost leg). However if a “tolerable level” for minor
incidents is exceeded (eg excessive component failures, failures of non-critical functions that
could be a source of distraction or increase workload), this merits further investigation to
reduce the rate.
If an incident is unexpected, some investigation is recommended. Even if the incident is
classified as minor, it provides the opportunity to identify an acceptable workaround that can
be fed back to the workplace staff through revised working practices.
If an incident is classified as serious, the investigation will be more thorough. Typically there
are several levels of incident severity, and the rigour of the investigation will vary with the
severity of the incident. For a general incident reporting and analysis scheme, high severity
incidents might not only relate to safety but also to economic losses (eg loss of equipment,
loss of production) and be assigned a high priority for investigation.
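The branching described above can be sketched as a small triage function. This is a minimal illustration rather than part of the scheme: the names, the single "serious" flag (real schemes typically use several severity levels) and the order in which the checks are made are all assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    LOG_ONLY = "log incident"
    INVESTIGATE_FREQUENCY = "investigate causes of high frequency"
    BASIC_INVESTIGATION = "investigate and identify an acceptable workaround"
    FULL_INVESTIGATION = "thorough investigation"


@dataclass
class TriageInput:
    serious: bool          # hypothetical flag; stands in for a severity scale
    novel: bool            # unexpected incident, not of a known type
    recent_count: int      # incidents of this type in the review period
    tolerable_count: int   # hypothetical site-defined tolerable level


def triage(incident: TriageInput) -> Action:
    """Choose the handling route for one reported incident."""
    if incident.serious:
        return Action.FULL_INVESTIGATION
    if incident.novel:
        # Even a minor unexpected incident merits some investigation.
        return Action.BASIC_INVESTIGATION
    if incident.recent_count > incident.tolerable_count:
        # Minor and known, but occurring more often than the tolerable level.
        return Action.INVESTIGATE_FREQUENCY
    return Action.LOG_ONLY


# A known minor fault seen 12 times against a tolerable level of 5.
print(triage(TriageInput(serious=False, novel=False, recent_count=12, tolerable_count=5)))
```

Checking seriousness first reflects the statement that serious incidents receive the most thorough investigation; an organisation could equally encode several severity levels and vary the rigour accordingly.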
The outcomes of an incident investigation are a classification of the incident, a description of
what happened, the causal factors leading to the incident, and recommendations to prevent a
recurrence.
The incident report provides a basis for implementing corrective action. The information
contained can also be disseminated more generally to other stakeholders who can use it to
prevent incidents in both current and future systems.
The following sections describe the generic process for incident reporting and the information
produced at each stage.
3.3 INCIDENT REPORTING
Workplace staff usually have to detect and report the incident. The information supplied
should not be time-consuming for workplace staff as this could result in under-reporting. In
some cases initial incident detection is performed automatically or can be machine supported
(so that relevant information is stored automatically).
An example incident report form is shown in Figure E.1. A manual incident report might
contain:
· reporter name,
· date and time,
· free text description of incident;
and optionally:
· any short term fix,
· reporter's view of severity,
· whether new to the reporter.
The confidentiality of incident reporting needs to be decided. Confidentiality may well
encourage reporting, but presents difficulties in subsequent investigation (as no further details
can be obtained). In some schemes, incidents are reported to a third party who can check
details without revealing the source.
Most companies seem to opt for an “open culture” of reporting. This enables the details of the
original incident to be validated (ie who, what, where and when) and corrected if necessary.
Open reporting also facilitates subsequent data gathering and incident investigation as the
relevant personnel can be interviewed to obtain further details. However to maintain an open
culture it is important to ensure that the reporter is not penalised for his actions (by instituting
a “no blame” policy for incident reporting, for example).
Each incident report might be extended and assessed by a workplace supervisor. The
supervisor should check that the information is valid and assess whether the incident is
serious or novel. Typical extensions to the report might include:
· incident consequence (effect on operation, effect on personnel, eg injury),
· incident type,
· incident novelty.
The incident report should then be stored in an incident log.
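For organisations that hold reports electronically, the fields listed in this section could be represented along the following lines. This is only a sketch: the class, field and function names are hypothetical, and the example form in Figure E.1 remains the reference for the actual content.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class IncidentReport:
    # Completed by the reporter.
    reporter_name: str
    occurred_at: datetime
    description: str
    # Optional reporter entries.
    short_term_fix: Optional[str] = None
    reporter_severity: Optional[str] = None
    new_to_reporter: Optional[bool] = None
    # Extensions added by the workplace supervisor.
    consequence: Optional[str] = None
    incident_type: Optional[str] = None
    novel: Optional[bool] = None


@dataclass
class IncidentLog:
    reports: List[IncidentReport] = field(default_factory=list)

    def store(self, report: IncidentReport) -> int:
        """Append the report and return its position in the log."""
        self.reports.append(report)
        return len(self.reports) - 1


def supervisor_review(report: IncidentReport, consequence: str,
                      incident_type: str, novel: bool) -> IncidentReport:
    """Record the supervisor's assessment on the report."""
    report.consequence = consequence
    report.incident_type = incident_type
    report.novel = novel
    return report


log = IncidentLog()
report = IncidentReport("A. Operator", datetime(2003, 5, 1, 14, 30),
                        "Valve failed to close on demand")
log.store(supervisor_review(report, "no injury, brief loss of production",
                            "equipment failure", novel=True))
```

Keeping the reporter's mandatory fields to three items reflects the point that reporting should not be time-consuming for workplace staff.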
3.4 INVESTIGATION POLICY
The key question underlying the technical investigation policy is whether the incident
represents a “learning opportunity”. The potential for learning from incidents varies. It is
therefore important to prioritise an incident investigation according to the significance of the
incident to the company.
A general incident handling scheme might well include non-safety incidents such as damage
to assets and loss of production. The type of incident investigated may also depend on the
maturity of the organisation and the skills and resources available. For example, large
organisations tend to have more levels of investigation than small companies. When
introducing a new incident reporting scheme the priority will be to ensure more serious
incidents are fully investigated. The incident priority will determine the nature of the
investigation (the resources, time and staff available).
In making decisions on the type of incidents that should be regarded as high priority,
consideration should also be given to the likely frequency of incidents and the resources
available. A prioritisation scheme that results in 90% of incidents being high priority is
unlikely to be successful. The effectiveness of the prioritisation scheme should be reviewed
on a regular basis. If suitable resources are not available to analyse all high priority incidents
within a reasonably short period of time then priority levels or resource levels will need to be
changed. Deferral of high priority incidents for significant time periods will undermine the
learning scheme and bring it into disrepute. There is therefore a need for a management
system that allows deferral but keeps it within acceptable boundaries.
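A periodic review of the prioritisation scheme could be supported by two simple checks: the share of incidents classed as high priority, and the deferred investigations that have exceeded an agreed limit. The sketch below assumes priorities are recorded as the strings "high" and "low" and that deferral ages are held in days; neither the labels nor any particular limit come from this report.

```python
from collections import Counter
from typing import List


def high_priority_share(priorities: List[str]) -> float:
    """Fraction of logged incidents classed as high priority."""
    if not priorities:
        return 0.0
    return Counter(priorities)["high"] / len(priorities)


def overdue_deferrals(deferred_days: List[int], max_deferral_days: int) -> List[int]:
    """Return the deferral ages (in days) that exceed the agreed limit."""
    return [days for days in deferred_days if days > max_deferral_days]


# Illustrative data: nine of ten incidents have been marked high priority.
logged = ["high"] * 9 + ["low"]
share = high_priority_share(logged)

# Illustrative deferral ages checked against a hypothetical 60-day limit.
overdue = overdue_deferrals([10, 45, 90], max_deferral_days=60)
```

A high share or a growing list of overdue deferrals would both suggest that priority levels or resource levels need to change.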
Factors that could lead to a high investigation priority include:
1 severe safety consequences (or potential for such consequences), such as fatalities to
employees or members of the public;
2 severe environmental consequences (or potential for such consequences);
3 severe economic losses (loss of assets or loss of production);
4 excessive frequency of incidents (even minor incidents can be costly [3]);
5 novelty of the incident (so the worst case consequences are unknown);
6 there are many installations where the incident might occur.
Lowest priority incidents are often just logged without investigation. Typically these are well
known and have little impact on the organisation. But the priority should increase if such
incidents recur too frequently.
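The six factors above could be recorded against each incident and used to suggest an initial priority. The following is a minimal sketch, not a prescribed method: the field names and the two numeric limits are assumptions that each organisation would need to set for itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IncidentFacts:
    severe_safety: bool = False          # factor 1 (actual or potential)
    severe_environmental: bool = False   # factor 2 (actual or potential)
    severe_economic: bool = False        # factor 3
    occurrences_in_period: int = 1       # input to factor 4
    novel: bool = False                  # factor 5
    installations_exposed: int = 1       # input to factor 6


def high_priority_factors(facts: IncidentFacts,
                          frequency_limit: int = 5,
                          installation_limit: int = 3) -> List[str]:
    """List the factors that argue for a high investigation priority."""
    reasons = []
    if facts.severe_safety:
        reasons.append("severe safety consequences")
    if facts.severe_environmental:
        reasons.append("severe environmental consequences")
    if facts.severe_economic:
        reasons.append("severe economic losses")
    if facts.occurrences_in_period > frequency_limit:
        reasons.append("excessive frequency")
    if facts.novel:
        reasons.append("novel incident")
    if facts.installations_exposed > installation_limit:
        reasons.append("many installations exposed")
    return reasons


def investigation_priority(facts: IncidentFacts, **limits: int) -> str:
    """Suggest 'high' if any factor applies, otherwise 'low' (log only)."""
    return "high" if high_priority_factors(facts, **limits) else "low"


minor = IncidentFacts()                          # well known, little impact
recurring = IncidentFacts(occurrences_in_period=8)
print(investigation_priority(minor))             # low
print(investigation_priority(recurring))         # high
```

The second case shows a minor incident being raised in priority purely because it recurs too often.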
3.5 INCIDENT INVESTIGATION
Incident investigators need sufficient relevant knowledge and expertise to perform the
analysis, including knowledge of:
· the application domain, the actual process involved and the potential consequences of
incidents (eg the plant, machinery);
· the operation and maintenance regimes; and
· the equipment used in the working environment.
The overall investigation process followed is shown in Figure 3 below.
[Flowchart: Incident notification → Data gathering → Reconstruction → Analysis → Recommendations and monitoring]
Figure 3 Incident investigation
While each investigation will contain every phase from Figure 3 to some extent, the rigour,
independence and number of people involved will vary with the investigation priority.
3.5.1 Data gathering
Data gathering may be required in order to establish further details about the incident and the
operational context. The data gathering phase includes:
· extracting and safeguarding any relevant logs (from manual or automated systems),
· elicitation of testimony from witnesses,
· direct interviews,
· gathering group testimony, for example through focus groups or anonymous
questionnaires.
Data gathering also needs to establish the workplace context. The amount of data required
about the context varies with the knowledge of the investigating team. Less context
information is required if the investigating team is familiar with the workplace, the equipment
in use and the procedures in place. However familiarity with the workplace can sometimes be
a danger since assumptions may be made which are not valid.
3.5.2 Incident reconstruction
Based on the gathered data, the incident is reconstructed as a sequence of events leading up to
the incident plus any actions following the incident that mitigated the outcome. This could for
example be represented by a timeline (see Appendix A of Part 1).
3.5.3 Causal analysis
Causal analysis establishes relationships between the events, and identifies underlying causes
that permitted the event to occur. This can be represented in a number of different ways as
described in Appendix A of Part 1 (eg as a fault tree, an ECF chart or a MORT diagram). The
purpose is to determine root causes. This can be a recursive process, and it is necessary to
identify some stopping point. The analysis might call for further technical investigations that
require specialist expertise (eg plant engineering, human factors, equipment).
For less severe or less complex incidents, technical investigations will probably be limited to
aspects within the boundaries of the organisation. For more severe or more complex incidents,
investigations might require the involvement of other organisations with the relevant technical
knowledge and expertise (eg the E/E/PES supplier).
As the events and their causes are identified, the assumptions made during design should be
revisited (eg failure causes, failure modes, frequency and consequences). If discrepancies are
discovered between the observed outcome of an incident and the original design assumptions,
then an impact analysis using revised design assumptions is necessary.
3.5.4 Recommendations
Recommendations are made to reduce the probability of a recurrence at the site or to mitigate
the consequences. There can be several options for accomplishing this, including:
· personnel and training,
· procedures and organisation,
· changes to E/E/PE systems,
· changes to plant.
There may also be wider recommendations not directly related to the incident but resulting
from information gained while investigating it, for example the discovery of false design
assumptions.
Each recommendation has to identify:
· what should be changed or done,
· who is responsible,
· the importance of the change or action.
These recommendations can be over different time-scales and include workaround
arrangements until long-term changes can be implemented. An example is shown in table 1
below.
Table 1 Example recommendations

Ref  Recommendation                                                    Department   Priority
1    Change operational procedures to get permit from supervisor       Operations   1
     before overriding the vessel supply feed overflow interlocks
2    Ensure operating staff are briefed about the change in procedure  Operations   1
3    Request engineering change to alarm display system to ensure      Engineering  2
     that operator is aware that the batch vessel is already full
3.5.5 Incident investigation report
The results of the investigation are usually presented in an investigation report. The report
will include:
· consequences and potential consequences (injuries, deaths, other losses);
· incident type (this will be specific to a given domain);
· a reconstruction of the incident;
· an analysis of causal factors;
· recommendations.
An example incident investigation report form is shown in Figure E.3.
3.5.6 Recommendation implementation and tracking
The operational or engineering management will assess the cost and feasibility of the
recommendations and make decisions on their implementation. Accepted recommendations
should be tracked to ensure they are implemented. Any deferred recommendations should
also be monitored for compliance with deferral time limits.
This tracking could be implemented as part of a safety management system (SMS). However
it is often cost-effective to use an existing quality management infrastructure, which will
already have processes for defect recording and resolution.
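Tracking of this kind can be quite simple. The sketch below holds the recommendations from Table 1 as records and flags those needing attention: accepted items not yet implemented and deferred items past their time limit. The status values, dates and function name are illustrative assumptions rather than part of the scheme.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Recommendation:
    ref: int
    action: str
    department: str
    priority: int
    # Assumed statuses: proposed, accepted, deferred, implemented, rejected
    status: str = "proposed"
    deferred_until: Optional[date] = None


def needs_attention(recs: List[Recommendation], today: date) -> List[Recommendation]:
    """Accepted items still open, plus deferrals past their time limit."""
    flagged = []
    for rec in recs:
        if rec.status == "accepted":
            flagged.append(rec)
        elif (rec.status == "deferred" and rec.deferred_until is not None
              and rec.deferred_until < today):
            flagged.append(rec)
    return flagged


recs = [
    Recommendation(1, "Require supervisor permit before overriding interlocks",
                   "Operations", 1, status="implemented"),
    Recommendation(2, "Brief operating staff on the changed procedure",
                   "Operations", 1, status="accepted"),
    Recommendation(3, "Change alarm display to show batch vessel already full",
                   "Engineering", 2, status="deferred",
                   deferred_until=date(2003, 9, 1)),      # hypothetical limit
]

open_items = needs_attention(recs, today=date(2003, 10, 1))
print([rec.ref for rec in open_items])
```

An existing defect tracking system would typically provide the same information through its own status and due-date fields.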
3.6 DISSEMINATION AND LEARNING
The incident reporting scheme outlined above only covers the resolution of a specific incident
by the end user. However the incident data can be used more generally to:
· prevent recurrence on similar systems at other user sites,
· rectify defects in systems,
· prevent the recurrence of similar problems in new systems implemented by the end-user,
· improve the processes used in the supply chain.
The following sections will consider the dissemination and learning aspects in more detail.
3.6.1 Preventing recurrence on different sites
Incidents or known problems can be circulated to users with similar processes or equipment.
This could involve dissemination within a user organisation or to industry or equipment
interest groups (box 7 in Figure 1). Some “listening” function is needed to analyse the
information and apply it to the site (box 8 in Figure 1). This is illustrated in Figure 4 below.
[Flowchart: an incident report or supplier alert passes through three decisions, "New to this site?", "Relevant to this site?" and "Serious enough?", ending either in STOP or in "Review recommendations and implement corrective actions"]
Figure 4 Response to external information (the listening process)
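The listening process can be expressed as three successive checks. The sketch below assumes that a "No" at any of the three decisions leads to STOP and that only an item passing all three is reviewed; this reading of the branch labels, and the function name, are our assumptions.

```python
def listening_response(new_to_site: bool, relevant_to_site: bool,
                       serious_enough: bool) -> str:
    """Decide how a site responds to an external incident report or alert."""
    if not new_to_site:
        return "stop"        # already known at this site
    if not relevant_to_site:
        return "stop"        # no similar process or equipment here
    if not serious_enough:
        return "stop"        # does not justify further action
    return "review recommendations and implement corrective actions"


print(listening_response(new_to_site=True, relevant_to_site=True, serious_enough=True))
print(listening_response(new_to_site=True, relevant_to_site=False, serious_enough=True))
```

Recording which check stopped each item would also give some evidence that the listening function is operating.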
3.6.2 Rectify defects in supplied equipment
Reported problems might reveal defects in some common component that is used in several
systems. A supplier can perform an impact analysis to identify and update affected systems
or products and alert users of the change and the available fix. As noted in Section 2.1 this
process is a requirement for suppliers claiming compliance with IEC 61508. The process is
outlined in Figure 5 below.
[Flowchart: a user problem report or component supplier alert passes through the decisions "New problem?" and "Component problem?"; actions shown are: identify short term corrective action; diagnose defects and products affected; supply corrective action/workaround; alert other users where system is safety-related; alert component supplier; implement design changes in product(s) or associated documentation; alert affected users]
Figure 5 Supply chain response to incident
3.6.3 Preventing recurrence on future systems procured by the user
Where a user finds that a supplied system was not suited to the intended application, this may
indicate problems with the user’s procurement policy – particularly:
· contractor and supplier assessment,
· requirement specification process,
· tender evaluation, and/or
· system acceptance methods.
Alternatively, the supplied system may have been suitable for the assumed operating
infrastructure, but aspects of the actual infrastructure (such as personnel competence or
associated procedures) are inadequate or have been changed. These lessons can be applied
to the procurement of future systems.
3.7 PROACTIVE LEARNING FROM INCIDENT DATA
An organisation can use past experience as a basis for taking action to reduce or prevent
problems occurring at a more general level, both in current and future systems.
Data from past incidents provides an information repository that can be analysed to determine
preventive actions (box 6 in Figure 1). Such analyses can include:
· trend analysis (eg is the frequency of a particular type of incident rising or falling),
· identifying proportions for different incident types,
· zonal analysis (eg identifying specific “hot spots” with high incident frequencies).
This information can be used to establish management priorities for preventive action (eg to
ensure incidents stay within limits or tackle problem areas).
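The three analyses listed above need little more than counting over the incident log. The following sketch uses a handful of invented records, each a (year, incident type, zone) tuple; the data and function names are purely illustrative.

```python
from collections import Counter

# Illustrative records only: (year, incident type, zone)
incidents = [
    (2001, "overfill", "Block A"),
    (2002, "overfill", "Block A"),
    (2002, "spurious trip", "Block B"),
    (2003, "overfill", "Block A"),
    (2003, "overfill", "Block C"),
    (2003, "spurious trip", "Block A"),
]


def trend(records, incident_type):
    """Trend analysis: count of one incident type per year, in year order."""
    counts = Counter(year for year, kind, _ in records if kind == incident_type)
    return sorted(counts.items())


def proportions(records):
    """Proportion of the log accounted for by each incident type."""
    counts = Counter(kind for _, kind, _ in records)
    total = sum(counts.values())
    return {kind: n / total for kind, n in counts.items()}


def hot_spots(records, top=1):
    """Zonal analysis: the zones with the most incidents."""
    return Counter(zone for _, _, zone in records).most_common(top)


print(trend(incidents, "overfill"))   # [(2001, 1), (2002, 1), (2003, 2)]
print(proportions(incidents))
print(hot_spots(incidents))           # [('Block A', 4)]
```

With real data the counts would usually be normalised, for example per operating hour or per installation, before trends are compared.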
Incidents can also be used to improve knowledge within the organisation (sometimes called
“corporate memory”). This can include:
· generalising incidents into rules, recommendations, checklists or corporate standards for
future activities,
· creating information repositories (eg “how to’s” and “frequently asked questions” for use
by the staff),
· staff training and awareness (eg making use of actual incidents within the organisation as
illustrations in order to improve the safety culture).
Organisations and departments can also make use of external information. There is a lot of
learning potential to be gained from information on similar systems used on other sites, or to
learn more generally about best practices in the operational or design process being
conducted. In particular there can be benefit in proactively searching for known issues when:
· starting a new instance of a known project type,
· new staff join a team,
· investigating the feasibility of a novel project.
Proactive learning also provides a motivation for organisations to share information with each
other—industry information sharing bodies often have a policy that those who contribute
have preferential access to information.
Confidentiality concerns may mean that such shared information has to be anonymised (eg
the identity of the company and personnel involved).
4 E/E/PES SPECIFIC ACTIVITIES
The previous section outlines a general process for learning from incidents. This section
outlines the specific activities that should be introduced into the general learning process
when incidents involve E/E/PES. In the proposed extension to the incident reporting scheme
we need to decide:
· when to invoke a technical investigation of E/E/PES equipment involved in the incident,
· what additional analyses to perform, and
· who to involve from the supply chain.
These requirements could apply to any type of equipment involved in an incident, but the
need for additional analysis and the involvement of suppliers is particularly important for
E/E/PES if the associated risks are to be effectively controlled. E/E/PES have a number of
distinctive characteristics that can make them more difficult to understand, control and
maintain, including:
· configurability and reprogrammability,
· functional complexity,
· complex/unsafe failure modes,
· non-deterministic behaviour,
· non-linear behaviour.
From our survey and evaluation studies (Part 1, [2]) it is clear that there needs to be
demonstrable benefit for the extra effort required to specifically address E/E/PES issues. In
other words we need “maximum gain for pain”. “Pain” factors that are likely to prevent
adoption of an extended scheme are:
· high implementation and running costs,
· radical changes to existing procedures and processes,
· excessive burdens on staff (eg training, demands on time, complexity).
It was therefore important to devise extensions that are:
· as modular as possible,
· simple to implement and operate,
· only invoked when there is significant benefit to be gained.
To modularise the E/E/PES component, we view E/E/PES extensions as a subset of the
incident analysis, namely a special technical investigation. In a complex incident there could
be several technical investigations covering different specialities (such as plant, control and
protection systems). These technical investigations contribute results back to the overall
incident investigation. In a simple incident there might only be one E/E/PES involved and the
technical investigation would probably be more limited.
During the overall incident analysis and reconstruction, a sequence of events will be
identified that lead to the incident, such as the example illustrated in Figure 6 below.
[Event sequence: Control system fails to maintain intended plant condition → Operator unaware of loss of control → Safety system operates → Operator overrides safety system → Incident occurs]
Figure 6 Example event sequence involving E/E/PES
The reasons why particular subsystems failed would be the subject of technical investigations
and the technical recommendations on preventing a recurrence would be included in the
recommendations of the overall incident investigation.
In this example, two E/E/PES subsystems are involved in the incident (shown by the shaded
boxes). However, some events leading to the incident might not be investigated. For example,
the obvious failure is in the control system, but this could be viewed as a normal event and the
focus for investigation would then be the subsequent events. We could ask why the operator
overrode the safety system—it may be that the information supplied by the safety system is
inadequate and the functionality of the safety system needs to be improved so the operator is
better informed. In other cases, a technical investigation into the control system failure would
be appropriate, eg if the failure frequency was too high or the failure mode was not as
predicted during design.
4.1 IDENTIFICATION OF E/E/PES
For both simple incident reporting and incident investigation, it is necessary to identify what
equipment incorporates E/E/PES. We would recommend that a list or database be set up that
identifies the safety-related E/E/PES within an organisation so that equipment can be readily
identified. Such a list of safety-related E/E/PES might already exist for normal risk
management purposes. Alternatively the safety classification could be incorporated into an
existing asset management system that provides normal equipment identification information
(eg supplier, equipment name, serial number, and version).
4.2 ADDITION TO INCIDENT RECORDING
When E/E/PES are involved in incidents, this needs to be captured in the incident report. Note
that the E/E/PES need not necessarily be the cause of the incident; from an incident
investigation viewpoint it is also necessary to determine if an E/E/PES within the scope of the
incident could have helped to prevent the incident.
When the initial incident report is checked, updated and assessed by the supervisor, it will be
necessary to establish if the incident involves an E/E/PES. The general incident reporting
form should contain a question to capture this.
The supervisor should be able to establish the equipment involved, then:
· If there is a pre-existing register of assets that contains supplementary information on
E/E/PES, it is sufficient to record the asset identification number or the equipment name
and serial number.
· If this is not the case, the supervisor has to determine if the equipment uses E/E/PE
components. Generally speaking, this is likely if the equipment is electrically powered. It
may be feasible for the supervisor to manually assign an equipment type and operating
mode based on workplace knowledge.
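The two cases above amount to a register lookup with a manual fallback. The sketch below is one possible arrangement: the register structure, the asset identifier "A12-001" and the field names are hypothetical, and the equipment details are reused from the example in Figure 7.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class AssetRecord:
    asset_id: str
    supplier: str
    equipment_name: str
    serial_number: str
    version: str
    is_eepes: bool          # equipment incorporates E/E/PE components
    safety_related: bool    # safety classification held in the register


# Hypothetical register with a single entry.
register: Dict[str, AssetRecord] = {
    "A12-001": AssetRecord("A12-001", "ACME corp", "CONPAC 800",
                           "AC-b72-1289", "Rev 8.12",
                           is_eepes=True, safety_related=False),
}


def eepes_involved(asset_id: str,
                   manual_judgement: Optional[bool] = None) -> Optional[bool]:
    """Use the register if the asset is listed, else the supervisor's judgement.

    Returns None when the asset is unlisted and no judgement has been made.
    """
    record = register.get(asset_id)
    if record is not None:
        return record.is_eepes
    return manual_judgement


print(eepes_involved("A12-001"))                        # True
print(eepes_involved("PUMP-7", manual_judgement=True))  # True
```

Returning None for an unlisted asset with no judgement keeps "not yet assessed" distinct from "no E/E/PES involved".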
Low priority incidents are simply logged (ie held in some incident repository), while higher
priority incidents are investigated. For low priority incidents, the only additional information
recorded is whether an E/E/PES is involved.
The incident logging function could be part of either safety management or quality
management procedures. The appropriate procedures should contain requirements for
periodically reviewing the logged incidents.
Similarly, equipment problem reports could build on existing engineering maintenance
documents and processes, and technical investigations could be undertaken by existing
engineering support staff.
4.3 E/E/PES TECHNICAL INVESTIGATION POLICY
The incident investigation may identify that an E/E/PES problem was a contributory cause of
an incident, but there can be multiple causes to incidents, especially complex ones. As part of
a general incident investigation, the investigator has to decide whether to perform a technical
investigation focussing on the relevant E/E/PE systems.
The type of E/E/PES problem that is investigated will depend on industry practice, regulator
expectations and the potential consequence associated with the system involved. In deciding
whether to investigate a specific E/E/PES involved in an incident, a number of factors may be
relevant, including:
· the E/E/PES exhibited an unsafe failure mode,
· the E/E/PES failure behaviour is unexpected and unexplained,
· the E/E/PES failure frequency is excessive,
· the consequences of E/E/PES failure are high (eg it is the sole means of risk reduction),
· there are multiple applications using the same E/E/PES,
· the E/E/PES equipment or application software is novel (the uncertainty associated with
using new techniques makes learning particularly important).
Factors that will tend to indicate that a technical investigation is not needed include:
· the undesired behaviour is known and countermeasures are in place,
· the safety and economic or environmental consequences of failure are low,
· multiple protection or mitigation layers exist to reduce risk from the same hazard,
· the equipment and system application is simple and has been in use for some time.
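Because factors can point in both directions for the same equipment, it may help to record them side by side and leave the decision to the investigator. The sketch below only tallies the factors listed above; the short keys are our own labels, and no weighting or automatic decision is implied.

```python
FACTORS_FOR = {
    "unsafe_failure_mode": "E/E/PES exhibited an unsafe failure mode",
    "unexplained_behaviour": "failure behaviour is unexpected and unexplained",
    "excessive_frequency": "failure frequency is excessive",
    "high_consequence": "consequences of failure are high",
    "multiple_applications": "multiple applications use the same E/E/PES",
    "novel_equipment": "equipment or application software is novel",
}

FACTORS_AGAINST = {
    "known_with_countermeasures": "undesired behaviour is known and countermeasures are in place",
    "low_consequence": "safety and economic or environmental consequences are low",
    "multiple_layers": "multiple protection or mitigation layers exist",
    "simple_and_established": "equipment and application are simple and long established",
}


def summarise_factors(observed):
    """Split the observed factor keys into those for and against investigating."""
    observed = set(observed)
    unknown = observed - set(FACTORS_FOR) - set(FACTORS_AGAINST)
    if unknown:
        raise ValueError(f"unrecognised factors: {sorted(unknown)}")
    in_favour = [text for key, text in FACTORS_FOR.items() if key in observed]
    against = [text for key, text in FACTORS_AGAINST.items() if key in observed]
    return in_favour, against


# A control system like the one in Figure 6: unsafe behaviour, but a known
# failure mode with countermeasures and a separate protective system.
in_favour, against = summarise_factors(
    {"unsafe_failure_mode", "known_with_countermeasures", "multiple_layers"}
)
print("For:", in_favour)
print("Against:", against)
```

Here one factor favours investigation and two count against it, so the outcome still depends on the investigator's judgement of the learning potential.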
In the incident reconstruction shown in Figure 6, there are two E/E/PE systems involved. The
control system exhibited unsafe behaviour, but there is a separate protective system. If the
control equipment is well established, and the failure mode is already known and the
countermeasures are considered adequate, there would be little point in performing a technical
investigation as little would be gained in terms of new understanding or recommendations.
If, on the other hand, the control equipment exhibits unexpected failures or frequent failures,
there would be a case for technical investigation, because the control system is imposing extra
demands on the safety system.
In the same example, the E/E/PE-based safety system actually operated as intended, so while
the safety integrity requirement is high, there is no requirement for a technical investigation of
the safety system itself. On the other hand, we might wish to ask why the operators failed to
perform their tasks correctly, including why the safety system was overridden. A technical
investigation might then analyse the human factor aspects, which could lead to
recommendations for changes to existing systems including the E/E/PE-based safety system.
Some E/E/PES problems may not get associated with any general incident investigation. The
majority of industrial applications have large numbers of E/E/PE systems, and failures of
individual components occur at frequent intervals. Failures will be reported by either
operational staff recognising that the facility is not operating as intended or by maintenance
staff conducting routine testing and inspection. Not all failures will be investigated. Often
maintenance staff simply replace failed components and operations staff then restart the
facility. For hard-pressed staff this is normally the easiest option. It is also the most sensible
option if the cause can be established as a random hardware failure. However failures where
the cause can be identified as systematic should always be investigated because if changes are
not made to equipment or procedures then the same failure will reoccur later under the same
conditions. First line maintenance staff will need to make the judgement as to whether further
investigation is justified. Again the main criterion would be the potential for learning from
the incident. Technical investigations might be initiated if for example:
1 The safety-related system failed when tested. There are no actual consequences, but it
could be viewed as a “near miss”, especially if the fault disables or partially disables the
safety function (ie a fail-danger fault).
2 There is evidence of a systematic failure so that there is potential for improvement.
3 There are excessive equipment failures.
First-line maintenance staff need to be trained to recognise such problems, and maintenance
procedures and report forms need to be appropriate for this purpose. The example equipment
problem report form (see Figure E.2) contains a number of ticklist items that help to identify
whether a technical investigation is needed (ie whether the problem is in safety-classified
equipment and whether the cause is something other than random failure).
4.4 E/E/PES TECHNICAL INVESTIGATION
If it is decided to undertake a technical investigation, the investigation has to answer several
main questions:
1 What exactly happened?
2 Why did the problem occur?
3 How can the problem be prevented (or reduced in frequency)?
4 Can the consequences be mitigated by other means?
For E/E/PES technical investigations, the investigator should have the relevant knowledge
and expertise on the E/E/PES and its operational environment, including:
· the wider operational context in which the E/E/PES is used,
· the function of the E/E/PES equipment,
· integrity level requirements,
· the system design,
· the operation and maintenance procedures and documentation for the E/E/PES.
Several different E/E/PES components may be used to implement a given function. For
example, the control system might be a distributed system with elements such as data
acquisition, control and monitoring—all implemented on different units. The interactions
between these elements may need to be analysed as well as the performance of individual
units.
4.4.1 E/E/PES data gathering and reconstruction (what exactly happened?)
The data gathering process has to establish the operating context and what occurred during
the incident. E/E/PE systems often have event logging facilities, and often maintain internal
records of diagnostic checks and failures. The information should be extracted and
safeguarded; in some cases it may be necessary to quarantine E/E/PES equipment, history
files and operating logs for further analysis. Special measures may be needed in major
incidents (eg where there is a fire) to avoid equipment damage, and damage to electronic
storage. It may also be necessary to duplicate information that is stored at remote sites.
It is also important to capture any interactions with the E/E/PE systems by operation and
maintenance staff (eg by taking witness statements).
For technical investigations of E/E/PES it is important to establish the exact equipment
configuration. This may be derived directly from some asset reference or configuration
records, but further inspection may be needed to determine the exact serial number and
version number of the equipment in use. Identification of the exact configuration is important
for subsequent forensic investigations (eg the product code may be the same, but the
implementation different). Ideally there should be a unique code for every equipment item, so
that it can be more readily tracked down the supply chain (this does occur with some IEC
61508 compliant suppliers).
Significant events associated with the E/E/PES should be derived from the gathered data (eg
the time when a failure occurred, whether the operating mode changed, whether new actions
were requested or new information was displayed). The sequence of events should be
identified. Incident reconstruction and analysis methods are available for this purpose, for
example timelines or ECF charts (see Appendix A of Part 1).
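A timeline can be assembled by merging events from each source and ordering them by time. The sketch below is a minimal illustration: the event records and times are invented, and it assumes the clocks of the different sources agree, which would need to be checked in a real investigation.

```python
from datetime import datetime
from typing import Iterable, List, NamedTuple


class Event(NamedTuple):
    time: datetime
    source: str
    description: str


def build_timeline(*sources: Iterable[Event]) -> List[Event]:
    """Merge events from several sources into one time-ordered sequence."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda event: event.time)


# Illustrative events only.
controller_log = [
    Event(datetime(2003, 6, 12, 14, 2), "controller log", "level reading drops to low"),
    Event(datetime(2003, 6, 12, 14, 3), "controller log", "feed pump started"),
]
trip_log = [
    Event(datetime(2003, 6, 12, 14, 20), "trip system log", "overflow trip"),
]
witness = [
    Event(datetime(2003, 6, 12, 14, 15), "witness statement",
          "operator notices vessel level high"),
]

timeline = build_timeline(trip_log, witness, controller_log)
for event in timeline:
    print(event.time.strftime("%H:%M"), event.source, "-", event.description)
```

Keeping the source with each event makes it easier to see where logged data and witness testimony disagree.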
4.4.2 Incident analysis (why did it happen?)
We have to identify the immediate cause of the E/E/PES maloperation before we can identify
means for its prevention or mitigation. It is helpful to have a systematic method for
determining the nature of the problem. A range of different methods has been identified in
Appendix A of Part 1, and examples of their application are given in Appendix A of Part 3.
This includes the PARC method (PES Analysis of Root Cause) in Appendix C, which was
developed during this research project and is based on the PRISMA method. PARC
comprises a flowchart with a series of questions to establish whether the problem is due to:
· operator actions,
· maintenance actions,
· equipment environment problems (eg electromagnetic interference, temperature,
vibration),
· random failure of equipment,
· inappropriate functionality,
· installation/integration with other equipment.
Identification of some of these items might require specific tests (such as diagnostic tests, or
replication of the circumstances of the original problem).
For relatively simple E/E/PES technical investigations, it may be sufficient to use a simple
checklist (see Figure E.2) as illustrated in Figure 7 below. Analysis of some “normal events”
could consist of just recording the problem and its immediate cause. For example a random
hardware failure might simply be rectified by repair or replacement.
For less routine cases, or where operational, maintenance or environmental factors are
identified, a more detailed technical review may be requested. If the problem is related to
equipment functionality, it may be necessary to pass the initial problem report to the supplier
for further analysis. This could result in changes to the equipment and improvement to the
supplier processes to prevent similar errors in future. For safety-classified equipment, rules
could be set in place for mandatory reporting of problems through the supply chain.
In other cases—especially where the consequences are serious or potentially serious—a
further stage of investigation should be undertaken to determine in detail the underlying
causes. This is described in the following section.
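The routing described in this section, from immediate cause to follow-up action, can be summarised as a small lookup. The sketch below is our paraphrase and not the PARC flowchart itself: the category names follow the list above, while the function, its flags and the treatment of installation/integration problems (sent for technical review) are assumptions.

```python
from enum import Enum


class Cause(Enum):
    OPERATOR_ACTION = "operator actions"
    MAINTENANCE_ACTION = "maintenance actions"
    ENVIRONMENT = "equipment environment problems"
    RANDOM_FAILURE = "random failure of equipment"
    FUNCTIONALITY = "inappropriate functionality"
    INTEGRATION = "installation/integration with other equipment"


def follow_up(cause: Cause, routine: bool = True, serious: bool = False) -> str:
    """Suggest the next step once the immediate cause has been identified."""
    if serious:
        # Serious or potentially serious consequences: go to the further stage.
        return "investigate underlying causes in detail"
    if cause is Cause.FUNCTIONALITY:
        return "pass problem report to supplier"
    if cause is Cause.RANDOM_FAILURE and routine:
        return "record cause and repair or replace"
    # Operational, maintenance or environmental factors, and less routine cases.
    return "request detailed technical review"


print(follow_up(Cause.RANDOM_FAILURE))
print(follow_up(Cause.ENVIRONMENT))
print(follow_up(Cause.FUNCTIONALITY))
print(follow_up(Cause.RANDOM_FAILURE, serious=True))
```

For safety-classified equipment, an organisation could make the supplier report mandatory whatever the identified cause.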
4.4.3 How can this problem be prevented or mitigated in future?
For particular problems, especially those with potentially major consequences, it is
worthwhile to consider whether the problem can be minimised either by eliminating
underlying causes or by reducing the impact of the failure.
If the immediate cause is under the control of the “problem owner”, then an analysis can be
undertaken to determine what preventive measures can be applied locally–both to rectify the
current problem and to prevent similar problems within the organisation.
Equipment Problem Sheet
([x] = ticked, [ ] = not ticked)

Manufacturer: ACME corp
Product name: CONPAC 800
Serial no.: AC-b72-1289
Configuration/version information: Rev 8.12
Location: Block A12
Date: 12 Jun 2003
Incident reference (if applicable): INC/PROD_A12/45
Used for: [x] control  [ ] monitoring  [ ] alarm/direction  [ ] protection
[ ] safety-classified

Problem Description
[ ] failed to act when needed
[ ] acted when not needed
[x] acted in unexpected way
[ ] failed completely
[ ] failed during test
[ ] maloperation of equipment by staff
[ ] other
[ ] dangerous/potentially dangerous
Details: Controller overfilled vessel.

Immediate Cause
Operational: [ ] operator action  [ ] operator inaction
Maintenance: [ ] fault diagnosis  [ ] fault correction  [ ] left in wrong mode
[ ] configuration  [ ] calibration
Environment problem: [ ] condensation  [ ] contamination  [ ] temperature
[ ] vibration  [ ] EMI
Equipment functionality: [ ] user interface  [ ] operational functions
[ ] maintenance functions  [x] response to component failure  [ ] performance
[ ] random hardware fault
Equipment integration/installation: [ ] power supplies  [ ] wiring  [ ] connections
[ ] incompatible interfaces  [ ] incompatible subsystems  [ ] configuration
Details: Equipment failed to detect that 4-20 mA level sensor had failed and was reading
low. As a result the controller kept pumping until there was an overflow trip.

Actions
[x] log problem
[x] repair/reconfigure equipment
[ ] report to supplier
[x] problem prevention review
Details: 4-20 mA sensor replaced.

Figure 7 Initial technical investigation
This analysis can be linked to the IEC 61508 life cycle and the activities within it. Although a
comprehensive taxonomy could be created for all the requirement subclauses of parts 1 to 3 of
the standard, it would be so complex that it would be impracticable to apply. Some
simplification is necessary even for the most comprehensive of incident analysis schemes.
The taxonomy used will depend on the depth and scope of the study. The default scope should
include all lifecycle phases but this may be reduced according to the circumstances. For
example, the scope of an investigation by the system supplier may be limited to the E/E/PES
realisation phase. A proposed taxonomy that covers the main requirements of all the lifecycle
phases of IEC 61508 is shown in Appendix A. This includes the common requirements such
as safety management and competence. Use of the taxonomy to this level of detail is probably
only justified when analysing incidents with very high consequences.
In practice, information available at the operating stage may allow only coarse judgements to
be made about causes originating prior to the equipment validation stage. In any case, tracing
the cause of an incident to a specific lifecycle phase may be of limited value to an operating
asset if this phase is not within the investigating company’s present control. However, it will
generally be of value to identify a cause at the design or installation phase since this will imply
the need for modification. Appendix B contains a simplified taxonomy that aims to
provide an appropriate level of detail for the majority of operational systems.
The PARC method in Appendix C includes a focused set of questions that identify the
relevant phases in the lifecycle where the problem could have been prevented. Note that there
can be more than one possible means of prevention, and every option should be considered
when making recommendations for improvement.
Alternatively, a checklist approach could be used to help consider the options systematically,
as shown in Figure 8 below. The checklist could also record the results of a PARC analysis.
In the case of a detailed investigation by the system vendor, the scope will be reduced (eg to
design, validation, installation and commissioning) and the remaining lifecycle stages are not
relevant. In this case the more detailed taxonomy of system realisation given in Appendix D
can be used. Again the supplier may choose to convert this to a checklist to assist in the
analysis process.
For complex incidents involving many different interacting elements a more wide-ranging
incident analysis may be needed. Several analysis options are identified in Appendix A of
Part 1, but we consider the following methods to be the most appropriate:
· barrier analysis,
· change analysis, and
· events and causal factors (ECF) charting.
The recommended approach is illustrated using a case study in Appendix A of Part 3.
These methods facilitate good coverage of the causal taxonomy in Appendix A. The rationale
for this choice of methods is given in Appendix D of Part 3. However this does not preclude
the use of other methods, and we would expect many organisations to use pre-existing
methods that they find acceptable.
E/E/PES Problem Prevention Checklist
[ ] System assessment
    [ ] hazard and risk assessment applied/improved
    [ ] better allocation of functions to people and equipment
    [ ] better specification of system functions
[ ] System design
    [x] better design and development
    [ ] improved operation facilities
    [ ] improved maintenance facilities
[ ] Installation and commissioning
    [ ] improved installation plan/procedures
    [ ] improved commissioning plan/procedures
[ ] Validation
    [ ] better validation techniques/plan
    [x] improved implementation of validation plan
    [ ] better validation equipment
    [ ] better analysis/resolution of discrepancies
    [ ] better usability assessment
[ ] Operation and maintenance
    [ ] operation procedures improved
    [ ] impact assessment of operation procedures
    [ ] operation procedures properly applied
    [ ] maintenance procedures improved
    [ ] impact assessment of maintenance procedures
    [ ] maintenance procedures properly applied
    [ ] routine operation and maintenance audits
    [x] test interval changed
    [ ] permit/hand over procedures
    [ ] procedures to monitor system performance
    [ ] selection/application of tools
[ ] Modification
    [ ] procedures applied to initiate modification in the event of systematic failures or vendor notification of faults
    [ ] modification authorisation procedures
    [ ] impact analysis of modification
    [ ] improve modification planning (including following appropriate lifecycle)
    [ ] improve implementation of modification plan
    [ ] better manufacturers information
    [ ] improve verification and validation of modification
Safety management ([ ] operation, [ ] maintenance, [ ] modification)
    [ ] safety culture
    [ ] safety audits
    [ ] management of suppliers
Lifecycle
    Ensure adequate definition of lifecycle process: [ ] operation, [ ] maintenance, [ ] modification
Competence ([ ] operation, [ ] maintenance, [ ] modification)
    [ ] procedures for ensuring competence
    [ ] check training, experience and qualification of a person
    [ ] specify job requirements
Verification
    Verification of [ ] operation, [ ] maintenance, [ ] modification
Documentation ([ ] operation, [ ] maintenance, [ ] modification)
    [ ] check documentation available/complete
    [ ] check documentation clear and correct
    [ ] check documentation well structured
    [ ] check documentation up to date
Functional safety assessment ([ ] operation, [ ] maintenance, [ ] modification)
    [ ] introduce safety assessment
    [ ] improve safety assessment
    [ ] ensure adequate skills and independence of assessment team
Figure 8 Problem prevention checklist
4.4.4 Recommendations
Recommendations are made to reduce the likelihood of an E/E/PES malfunction recurring or
to mitigate the consequences of failure. Problems might be resolved by:
· eliminating the need for the E/E/PES (eg using alternative means to achieve the
organisation’s goal);
· re-engineering or replacing the E/E/PES;
· adding external barriers and warning signs;
· changing operational procedures and work practices;
· changing operator training and assessment;
· introducing new procedures or equipment to mitigate the potential consequences.
A checklist of possible actions is given in Figure E.6, and the options are reproduced in
Figure 9 below.
The recommendations generated will depend on what has been established in the causal
analysis and an assessment of options based on such factors as effectiveness, implementation
time and implementation cost.
Policy change
    [ ] procurement procedures
    [ ] supplier approval
    [ ] engineering standards
    [x] acceptance procedures
Equipment change
    [ ] warning/advisory labelling
    [ ] equipment relocation
    [ ] environmental protection
    [ ] hardware repair
    [x] upgrade to new version
    [ ] equipment replacement
    [x] reprogramming
Operational change
    [ ] procedures
    [ ] support documentation
    [ ] access controls
    [ ] warnings
    [ ] staff training
    [ ] staff briefing
    [ ] staff supervision
Maintenance change
    [x] procedures
    [ ] support documentation
    [ ] access controls
    [ ] warnings
    [ ] staff training
    [ ] staff briefing
    [ ] staff supervision
Figure 9 Recommendations checklist
4.5 DISSEMINATION AND LEARNING
The dissemination and learning that can occur between end users and E/E/PES suppliers is
shown in Figure 10 below. The diagram only shows two suppliers in the chain, but there can
be an indefinite number, including:
· a main contractor (who might have designed and supplied site infrastructure, plant and
C&I equipment);
· an E/E/PES system supplier who programs, installs and maintains one or more E/E/PES
products to perform the specified application tasks;
· E/E/PES product suppliers (eg of PLCs or distributed control systems).
Often the main contractor supplies a “turnkey” system to the user, but has no responsibility to
support the system in operation. Figure 10 excludes the main contractor from the reporting
chain, but ideally they should also participate in the problem resolution and learning process.
[Diagram: problem reports flow from the end user (end user incident) to the system designer (E/E/PES system problem) and on to the E/E/PES component supplier (E/E/PES component problem). Alerts, known faults, workarounds, patches and new versions flow back for assessment and installation, and are also passed to other end users and other system suppliers. Each organisation draws general lessons for its own procedures, design rules, checklists and staff competence.]
Figure 10 Supply chain dissemination and learning
4.5.1 E/E/PES supply chain
The activities in the supply chain are summarised below.
Problem investigation
Any problems associated with the use of the E/E/PES should be reported along the supply
chain. To the E/E/PES supplier, the reported problem (eg information in a form similar to
Figure 7) can be treated as a learning opportunity in the same way as an “incident” in
Figure 1. When the learning process is applied within an E/E/PES supplier organisation, it can
be used to correct problems in the E/E/PES (or mitigate their effects) and to improve the
company’s internal processes to reduce the likelihood of a recurrence. To aid this, a checklist
based on IEC 61508 requirements is given in Figure E.7. For companies that integrate
E/E/PES components, the checklist in Figure E.5 may be more appropriate.
A checklist of possible recommendations arising from the problem investigation is given
in Figure E.8.
Response to user
The supplier should confirm to the user that an E/E/PES problem exists and identify possible
ways of resolving the problem (eg via a workaround or an update to the E/E/PES). This
information may be included in the user’s incident investigation report if the E/E/PES was
regarded as being a causal factor in an incident.
Dissemination
If the E/E/PES is used on many different sites, the supplier should disseminate information
about any newly discovered problems to all users. Possible dissemination methods include:
· issuing immediate problem alerts (together with any workarounds and patches);
· responding with known workarounds when a problem is reported;
· making records of known faults (together with any workarounds and patches) available to
the E/E/PES users;
· notifying users of a new version with a record of the fault fixed.
The dissemination mechanism chosen will usually depend on the severity of the fault and on
whether the E/E/PES is being used in safety-related applications. To perform effective
dissemination, a record of the end use of the equipment must be maintained by the E/E/PES
system supplier. IEC 61508 compliant suppliers will be aware of the safety integrity
requirements of their end users, and they can use this to establish priorities for rapid
dissemination.
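As an illustration of how an end-use record could drive these priorities, the following minimal Python sketch orders the sites affected by a fault so that those with the highest safety integrity requirement are notified first. The `EndUse` record, its field names and the product names are assumptions made for this example; they are not defined by the scheme or by IEC 61508.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EndUse:
    """Hypothetical end-use record kept by an E/E/PES system supplier."""
    site: str
    product: str
    sil: Optional[int]  # safety integrity level 1-4, None if not safety-related


def notification_order(records, product):
    """Return the sites using `product`, highest safety integrity level first."""
    affected = [r for r in records if r.product == product]
    # Sites that are not safety-related (sil is None) sort last.
    return sorted(affected, key=lambda r: r.sil or 0, reverse=True)


records = [
    EndUse("Site A", "PLC-X", None),
    EndUse("Site B", "PLC-X", 3),
    EndUse("Site C", "PLC-Y", 2),
    EndUse("Site D", "PLC-X", 1),
]
order = [r.site for r in notification_order(records, "PLC-X")]
```

In practice the ordering would also need to reflect the severity of the fault, which this sketch leaves out.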
Similar processes apply for E/E/PES product suppliers, except their clients are normally
E/E/PES application system suppliers.
Learning from individual problem reports
In addition to disseminating reported problems and fixes, suppliers should perform further
analysis to establish what improvements can be made in their development lifecycle
processes. The analysis should determine where errors occurred in the process, and whether
there are any mechanisms that could be introduced to eliminate each error or detect it at a
later stage. For a systems supplier this might lead to changes in:
· procedures (specification, design, test, installation, commissioning, support);
· review checklists (specification, design, software, documentation);
· test, installation and commissioning checklists;
· design rules (standard design, product configuration rules, software programming
standards);
· methods and tools (for design, programming, testing, etc);
· staff competence requirements and training.
An E/E/PES product supplier should undertake the same type of analysis as an E/E/PE
systems supplier, except that installation and commissioning need not normally be
considered.
Learning from a problem database
A problem database can be periodically reviewed for:
· trends – whether problems are reducing or increasing,
· hot-spots – whether particular clients are experiencing greater problems,
· frequency – which problems arise most frequently.
Unsatisfactory performance can be used to trigger an investigation. This investigation might
result in the process changes already discussed above, but could also result in additional
changes to reduce the incidence of such problems, such as in:
· proof testing, maintenance or inspection frequency,
· components used (eg to improve reliability or failure mode of the system),
· component supplier (to improve supply chain response times, access to information, etc),
· safety integrity level requirements if the demand rate on a protection system is higher than anticipated during design,
· architecture, such as the introduction of redundancy or majority voting,
· customer guidance and product information (eg clarifying functionality, usage
constraints, guidance on installation best practice).
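The periodic review for trends, hot-spots and frequency can be sketched in a few lines of Python. This is an illustrative sketch only: the `ProblemReport` record, its field names and the sample data are assumptions for the example, and a real problem database would hold far more detail.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ProblemReport:
    """Hypothetical minimal problem record; field names are illustrative."""
    year: int
    client: str
    problem: str


def review(reports):
    """Summarise a problem database for trends, hot-spots and frequency."""
    reports = list(reports)
    per_year = Counter(r.year for r in reports)
    return {
        # trends: number of reports per year, in chronological order
        "trend": sorted(per_year.items()),
        # hot-spots: clients ranked by number of reported problems
        "hot_spots": Counter(r.client for r in reports).most_common(),
        # frequency: problems ranked by how often they arise
        "frequency": Counter(r.problem for r in reports).most_common(),
    }


reports = [
    ProblemReport(2001, "Site A", "level sensor failed low"),
    ProblemReport(2002, "Site A", "level sensor failed low"),
    ProblemReport(2002, "Site B", "spurious trip"),
    ProblemReport(2002, "Site A", "spurious trip"),
]
summary = review(reports)
```

What counts as unsatisfactory performance, and so triggers an investigation, is left to the organisation; the sketch only produces the figures on which that judgement would be based.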
4.5.2 E/E/PES end user
Specific problems with E/E/PES can be shared with other interested parties. Typically the
interest lies in problems concerning:
· A specific product (eg a specific medical device or a PLC) where user experience relating
to a given product and any known workaround can be shared with other sites.
· A class of products (eg heart pacemakers) to identify generic failure modes and
operational problems.
· An application area (eg oil and gas production, transport, signalling, alarm system) to
identify generic E/E/PES design and operational problems.
The incident database within an organisation can be analysed periodically to provide similar
information. It can also be used to identify trends, hot spots and high-frequency problems that
could require management action to control.
These analyses will require some method for characterising the E/E/PES incident (preferably
using some common scheme), so that relevant incidents can be retrieved. The approach to
classification will be discussed in the following section.
5 CLASSIFICATION OF E/E/PES PROBLEM RECORDS
Incident classification supports long-term proactive learning because it helps significantly in
the identification of trends and “hot spots” in any particular set of incidents.
The more facets of the learning process reflected in the classification scheme, the greater the
range of information that can be obtained by analysing the incident data. A scheme can
characterise the failure and the cause, and identify how the problem can be prevented in the
future.
We propose that incidents involving E/E/PES are classified using the following disjoint
classifications:
· A failure mode classification – what type of failure behaviour is exhibited.
· A problem classification – the immediate cause of the failure (including problems in
surrounding E/E/PES infrastructure such as personnel or other systems).
· A problem prevention classification – where in the system development lifecycle the
problem could have been prevented. This is based on the analysis of IEC 61508 causal
classification categories (detailed in Appendix A).
Each type of organisation has a different “viewpoint”, with different capabilities and available
information. For example, a user organisation is unlikely to be aware of the reasons for an
E/E/PES hardware or software failure, or be able to rectify it. So the classification scheme has
to be relevant to the viewpoint of a specific organisation.
The proposed hierarchic classifications can be extended with industry-specific sub-classifications
by individual end-users and system suppliers, while still allowing data to be
aggregated into top-level classifications. In addition, it may be necessary to “collapse” parts
of the causal classification if the root causes cannot be identified or controlled by the specific
organisation.
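One way to picture this hierarchy is to store each classification as a path that starts at a top-level category and may be extended with lower levels. The Python sketch below is illustrative only: the path representation is an assumption, and the third-level entry is an invented site-specific extension.

```python
from collections import Counter


def aggregate_top_level(classifications):
    """Count incidents by the first (top-level) element of each path."""
    return Counter(path[0] for path in classifications)


def collapse(path, depth):
    """Keep only the first `depth` levels of a classification path."""
    return tuple(path[:depth])


# Each incident is classified by a path from the top-level category downwards.
incidents = [
    ("Operation and maintenance", "test interval not sufficient"),
    ("Operation and maintenance", "LTA maintenance procedures", "pump house checklist"),
    ("Modification", "LTA impact analysis"),
]
totals = aggregate_top_level(incidents)
```

Because aggregation uses only the first level, organisations can add their own lower levels without affecting comparisons at the top level.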
5.1 FAILURE MODE
The failure mode is the direct contributor to the incident. It enables the contribution of the
equipment failure to the incident analysis to be established (even if the underlying cause is
unknown). Information about failure mode might also be relevant in long-term analyses of
equipment problems. The proposed categories are:
· Equipment failed to act when required
· Equipment acted when not required
· Equipment acted when required, but in an unexpected way
· Operator did not use the equipment as intended.
These categories are consistent with the PARC method described in Appendix C. So the
results of a PARC analysis can be transcribed to a database or to a standard form such as the
checklist in Figure E.2.
For decisions on further technical investigation and for later analysis, it is important to
identify whether the failure mode was hazardous. So an additional classification is proposed
to identify the impact of the failure, namely either:
· Fail-safe, or
· Fail-danger.
This information might be directly determined when the equipment has a specified safe
failure state. In other cases, it may depend on the incident context and can only be determined
once the incident has been analysed.
5.1.1 Problem leading to failure
The problem leading to the failure should be identified. The classification of immediate
problems used here is consistent with that of the PARC analysis method
(except that in PARC “environment problem” and “interfacing equipment problem” have
been merged into a single heading “failure due to environment”). The proposed categories of
problems leading to failure are:
· Operator problem
· Maintenance problem
· Setting/calibration problem
· Environment problem (corrosion, interconnection, vibration, etc)
· Inappropriate equipment functionality
· Interfacing equipment problem
· Random hardware fault.
Again it should be relatively simple to derive this classification directly, using a PARC
analysis or the equipment problem checklist.
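The failure mode, impact and problem categories above could be held in a database record along the following lines. This Python sketch is an assumption about one possible representation: the class and field names are invented for illustration, and only the category wording is taken from the text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailureMode(Enum):
    FAILED_TO_ACT = "Equipment failed to act when required"
    ACTED_WHEN_NOT_REQUIRED = "Equipment acted when not required"
    ACTED_UNEXPECTEDLY = "Equipment acted when required, but in an unexpected way"
    NOT_USED_AS_INTENDED = "Operator did not use the equipment as intended"


class Impact(Enum):
    FAIL_SAFE = "Fail-safe"
    FAIL_DANGER = "Fail-danger"


class Problem(Enum):
    OPERATOR = "Operator problem"
    MAINTENANCE = "Maintenance problem"
    SETTING_CALIBRATION = "Setting/calibration problem"
    ENVIRONMENT = "Environment problem"
    FUNCTIONALITY = "Inappropriate equipment functionality"
    INTERFACING = "Interfacing equipment problem"
    RANDOM_HARDWARE = "Random hardware fault"


@dataclass
class IncidentClassification:
    """Hypothetical record combining the classifications of Section 5.1."""
    failure_mode: FailureMode
    problem: Problem
    # The impact may only be known once the incident has been analysed.
    impact: Optional[Impact] = None


# Illustrative values only; not a classification of any incident in this report.
record = IncidentClassification(FailureMode.FAILED_TO_ACT, Problem.RANDOM_HARDWARE)
record.impact = Impact.FAIL_DANGER
```

Leaving `impact` optional reflects the point made above that the hazard may only be determined after analysis.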
5.2 PROBLEM PREVENTION
This part of the classification scheme identifies the point in the lifecycle where the problem
could have been prevented. These stages are aligned with the overall lifecycle in IEC 61508.
The full set of classification categories is given in Table A.1, but the main headings are:
· Concept
· Overall scope
· Hazard and risk assessment
· Overall safety requirements
· Allocation
· Installation and commissioning planning
· Validation planning
· Operation and maintenance planning
· Realisation
· Installation and commissioning
· Validation
· Operation and maintenance
· Modification.
The realisation phase for E/E/PES is classified by sub-phase in Table A.2 (this is mainly
relevant to E/E/PES system suppliers). The sub-phases are:
· E/E/PES functional requirements specification
· E/E/PES integrity requirements specification
· E/E/PES validation planning
· E/E/PES system design
· E/E/PES operation and maintenance facilities
· Software requirements specification
· Software design and development
· Software validation planning
· E/E/PES integration
· E/E/PES operation and maintenance procedures
· E/E/PES validation
Table A.3 classifies the common processes identified within IEC 61508 that may need
improving to prevent failures:
· Safety management
· Safety lifecycle
· Competence
· Verification
· Documentation
· Functional safety assessment
These taxonomies can classify the causes identified by a range of causal analysis techniques
(Appendix A of Part 3 gives an example of this). The taxonomies can be reduced in scope and
detail for use with simpler causal analysis methods like PARC and checklists.
5.3 ADAPTING THE CAUSAL CLASSIFICATION FOR DIFFERENT
PERSPECTIVES
5.3.1 End user viewpoint
The end user may have a limited view of the internal reasons for failure of an E/E/PES,
probably because they had little involvement in some of the early stages of development. For
example, the user may have employed an independent contractor to supply the E/E/PES. If
this is the case, the end user has little scope for preventing the problem in these lifecycle
phases, has little use for detailed information for these phases, and probably will not be able
to obtain it anyway. Appendix B contains a simplified classification for this scenario, where
the stages “Concept”, “Overall scope” and “Hazard and risk assessment” are collapsed to
“System concept”, and “Overall safety requirements” and “Allocation of safety requirements”
are collapsed to “Safety requirements and allocation”.
The remaining phases are:
· E/E/PES installation and commissioning planning
· E/E/PES validation planning
· E/E/PES operation and maintenance planning
· E/E/PES realisation
· E/E/PES installation and commissioning
· E/E/PES validation
· E/E/PES operation and maintenance
· E/E/PES modification.
This reduced taxonomy can be applied by the end user, and is used as the basis for
classification in the PARC analysis method given in Appendix C. The questions in the PARC
analysis are designed to identify one or more of the main lifecycle stages or common
processes where the problem could have been prevented. The checklist in Figure E.5 is also
based on the reduced taxonomy in Appendix B.
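The collapsing of lifecycle phases for the end user viewpoint amounts to a simple mapping from the main headings of Section 5.2 to the reduced headings described above. The Python sketch below is a minimal illustration; the function and variable names are assumptions, and Appendix B remains the definitive statement of the reduced taxonomy.

```python
# Stages that the text says are collapsed for the end user viewpoint.
# "Allocation" is the Section 5.2 heading for allocation of safety requirements.
COLLAPSED = {
    "Concept": "System concept",
    "Overall scope": "System concept",
    "Hazard and risk assessment": "System concept",
    "Overall safety requirements": "Safety requirements and allocation",
    "Allocation": "Safety requirements and allocation",
}

# Phases that are kept, listed under an "E/E/PES" prefix in the reduced taxonomy.
UNCHANGED = (
    "Installation and commissioning planning",
    "Validation planning",
    "Operation and maintenance planning",
    "Realisation",
    "Installation and commissioning",
    "Validation",
    "Operation and maintenance",
    "Modification",
)


def end_user_category(phase):
    """Map a full-taxonomy lifecycle phase to its reduced end user heading."""
    if phase in COLLAPSED:
        return COLLAPSED[phase]
    if phase in UNCHANGED:
        return "E/E/PES " + phase[0].lower() + phase[1:]
    raise ValueError(f"unknown lifecycle phase: {phase!r}")
```

For example, `end_user_category("Overall scope")` gives "System concept", so incidents classified against the full taxonomy can still be counted under the reduced headings.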
5.3.2 Supply chain viewpoint
The E/E/PES system integrator also has a restricted viewpoint, being unlikely to have been
involved in the early stages of concept development, functional allocation or hazard and risk
analysis, since these activities require industry-specific knowledge and knowledge of the
overall system architecture.
The viewpoint from a fault prevention perspective is likely to be restricted to the E/E/PES
realisation phase, whose sub-phases are classified in detail in Table A.2 and listed in Section 5.2 above.
Appendix D gives a list of E/E/PES realisation categories drawn from Table A.2, plus
E/E/PES installation and commissioning. Table D.2 reproduces the categories of Table A.3.
These are the basis for the E/E/PES problem prevention checklist in Figure E.7.
If the E/E/PES supplier is also involved in later phases, such as E/E/PES maintenance or
E/E/PES modification, some combination of the classifications in Appendices B and D may
be needed.
6 CONCLUSIONS
The scheme described in this document (termed PARCEL) focuses on learning from the
E/E/PES-related aspects of incidents. It is important that the scheme is practical (ie is
integrated with current management and reporting systems, has reasonable set-up costs and is
usable). The scheme aims to be applicable to different industry sectors and to organisations at
different points in the E/E/PES supply chain, and it aims to accommodate different levels of
organisational complexity and maturity. It is apparent from industry consultation that
companies are likely to have existing incident reporting infrastructures and analysis methods,
and any scheme for E/E/PES should not disrupt this existing infrastructure.
Therefore PARCEL is designed as an “add-in” to general incident reporting schemes to
minimise the changes needed in existing schemes. The use of separate forms and a separate
“technical investigation” makes it easier to include E/E/PES investigations in the overall
incident handling infrastructure. In addition the add-in approach allows different staff to
undertake E/E/PES analysis from that undertaking general incident analysis (eg an instrument
engineer for the E/E/PES aspects and an operations supervisor for the more general aspects).
Companies with no existing infrastructure can still apply PARCEL by implementing the
generic incident reporting scheme, described in this report, alongside the E/E/PES additions.
The classification scheme uses a taxonomy derived from IEC 61508 requirements for
lifecycle stages and common processes. This provides a common basis for classification
across industry sectors and the E/E/PES supply chain, but it can be tailored to suit different
organisations by changing the level of detail and the scope of lifecycle phases covered.
Organisations may have pre-existing methods of incident analysis. So we focus here on
analysis methods that are particularly suited for E/E/PES. A number of options for performing
E/E/PES technical investigation are identified, including barrier analysis, change analysis and
ECF. But for simple investigations (which are likely to be the majority), a simple flowchart
method called PARC and ticklists have been developed, where the classification is inherent in
the analysis process. Pre-existing causal analysis techniques can be used in conjunction with
these E/E/PES-specific approaches, for example by using the results of a PARC or checklist
analysis in the overall incident investigation.
For learning to occur, it is important that the results of incident analysis are fed back to the
relevant organisations so the processes and products can be improved. The learning process
identifies feedback paths to:
· the local company or organisation (to fix immediate E/E/PES operational problems);
· companies in the E/E/PES supply chain (to fix E/E/PES problems and help suppliers to
learn);
· other departments and outside organisations (to share the lessons learned).
Checklists are provided for the types of action that can be taken by E/E/PES end-users and
companies in the E/E/PES supply chain to improve the realisation and operation of E/E/PES-based equipment.
Part 3 provides guidance on adapting PARCEL for users in different industry sectors and with
different levels of maturity. It also recommends appropriate methods and tools for incident
management, analysis and feedback.
REFERENCES
[1] PG Bishop, LO Emmet, W Black, C Johnson, V Hamilton, F Koornneef, “HSE
Learning from Incidents: D2—Outline scheme for E/E/PE in the context of IEC
61508”, Adelard Deliverable, D/221/2307/2, 2003
[2] PG Bishop, LO Emmet, K Tourlas, W Black, C Johnson, “HSE Learning from
Incidents: D3—Scheme Evaluation”, Adelard Deliverable, D/226/2307/3, 2003
[3] IEE, “The Costs to Industry of Accidents and Ill-Health”, IEE Health and Safety
Briefing, No. 03, April 2003, http://www.iee.org/Policy/Areas/Health/hsb03.pdf
[4] IEC 61508, Functional safety of electrical/electronic/programmable electronic safety-related systems. For further details see http://www.iec.ch/functionalsafety
APPENDIX A IEC 61508 CAUSAL CLASSIFICATION SCHEME
This appendix contains the complete causal classification taxonomy and shows how each of
the category items is related to specific subclauses in IEC 61508. The classification scheme
covers both IEC 61508 lifecycle phases and requirements that are common to all phases.
Note that in the following tables: “LTA” stands for “Less Than Adequate” and all IEC 61508
references are to Part 1 except where indicated by number in brackets, eg (2). A series of
subclauses is delimited by “/”, eg 7.2.2.2/3/4, while a consecutive range of subclauses is
denoted by a hyphen, eg 7.2.2.2-4 (both these examples represent the set of subclauses
7.2.2.2, 7.2.2.3 and 7.2.2.4).
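The reference notation can be expanded mechanically, which may help when loading the tables into a database. The Python sketch below is a hedged illustration, not part of the scheme: the function name is invented, it handles only the "/" series, the hyphenated range and the bracketed part number described above, and it rejects other forms that appear in the tables (such as lettered items like 7.6.2.1f or full clauses after a slash like 6.1.1/6.2.1).

```python
import re

# Digits, dots, slashes and hyphens, optionally followed by a part number in brackets.
_REF = re.compile(r"\s*([\d./-]+)\s*(?:\((\d)\))?\s*")


def expand_reference(ref):
    """Expand one table reference into a list of (part, subclause) pairs."""
    match = _REF.fullmatch(ref)
    if match is None:
        raise ValueError(f"unsupported reference: {ref!r}")
    body = match.group(1)
    part = int(match.group(2) or 1)  # Part 1 unless a part is given in brackets
    prefix, _, last = body.rpartition(".")
    if not prefix or "/" in prefix or "-" in prefix:
        raise ValueError(f"unsupported reference: {ref!r}")
    if "-" in last:
        # consecutive range, eg 7.2.2.2-4
        first, final = (int(n) for n in last.split("-"))
        endings = [str(n) for n in range(first, final + 1)]
    else:
        # series delimited by "/", eg 7.2.2.2/3/4, or a single subclause
        endings = last.split("/")
    return [(part, f"{prefix}.{ending}") for ending in endings]
```

Both `expand_reference("7.2.2.2/3/4")` and `expand_reference("7.2.2.2-4")` yield subclauses 7.2.2.2, 7.2.2.3 and 7.2.2.4 of Part 1, matching the worked example in the text.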
Table A.1 Safety lifecycle classification
Columns: Lifecycle phase; Objective; Causal classification | IEC 61508 reference

Concept
Objective: Development of sufficient understanding of EUC and its environment
1 LTA consideration of EUC and environment | 7.2.2.1
2 LTA information about hazards | 7.2.2.2/3/4
3 hazards due to interactions not considered | 7.2.2.5

Overall scope
Objective: Define the boundary of the EUC and control system and scope of hazard and risk assessment
1 physical environment not specified | 7.3.2.1
2 external and incident-initiating events to be taken into account not specified | 7.3.2.2/4/5
3 subsystems associated with the hazards not specified | 7.3.2.3

Hazard and risk assessment
Objective: Determine the hazards, event sequences and the EUC risks
1 LTA hazard and risk analysis | 7.4.2.1/10/11/12
2 LTA consideration of elimination of hazards | 7.4.2.2
3 reasonably foreseeable circumstance not identified | 7.4.2.3
4 event sequence and likelihood not predicted | 7.4.2.4/5
5 LTA consequence prediction | 7.4.2.6
6 Inappropriate tools and techniques used | 7.4.2.9

Overall safety requirements
Objective: Development of safety functions and safety integrity requirements
1 LTA safety function requirement | 7.5.2.1
2 necessary risk reduction not defined | 7.5.2.2/6
3 EUC control system incorrectly designated | 7.5.2.4/5

Allocation
Objective: Allocation of safety functions to safety systems
1 incorrect account taken of skills and resources required or other systems | 7.6.2.1/2/3
2 common cause not considered | 7.6.2.6/7
3 operating mode not specified (continuous/low demand) | 7.6.2.5

Installation and commissioning planning
Objective: To plan the installation and commissioning
1 LTA installation plan | 7.9.2.1
2 LTA commissioning plan | 7.9.2.2

Validation planning
Objective: To plan the validation
1 LTA validation techniques/plan | 7.8

Operation and maintenance planning
Objective: To plan the operation and maintenance
1 LTA operation and maintenance plan | 7.7

Realisation
Objective: Realisation of the safety requirement specification (NB These categories are expanded in Table A.2)
1 LTA specification of E/E/PES functional requirements | 7.2.2/3 (2)
2 LTA specification of E/E/PES integrity requirements | 7.2.2/3 (2)
3 LTA E/E/PES validation planning | 7.3.2 (2)
4 LTA E/E/PES design | 7.4.2/3/7 (2), 7.5 (2)
5 LTA E/E/PES operation and maintenance facilities | 7.4.5 (2)
6 LTA software requirement specification | 7.2.2 (3)
7 LTA software design and development | 7.4 (3)
8 LTA software validation | 7.3 (3), 7.7 (3)
9 LTA hardware and software integration | 7.5 (2), 7.5 (3)
10 LTA E/E/PES operation and maintenance procedures | 7.6 (2)
11 LTA E/E/PES validation | 7.7.2.2/5 (2)

Installation and commissioning
Objective: Installation and maintenance of E/E/PES
1 LTA installation | 7.13.2.1/2
2 LTA commissioning | 7.13.2.3/4

Validation
Objective: Validate that the E/E/PES meets the requirement specification
1 LTA implementation of validation plan | 7.14.2.1/3
2 LTA validation equipment | 7.14.2.2
3 LTA analysis/resolution of discrepancies | 7.14.2.4

Operation and maintenance
Objective: Operate and maintain the E/E/PES to maintain the required functional safety
1 LTA operation procedures | 7.6.2.1/2/5 (2)
2 operation procedures not impact assessed | 7.6.2.4 (2)
3 operation procedures not applied | 7.15.2.1/2
4 LTA maintenance procedures | 7.6.2.1/2/3/5 (2)
5 maintenance procedures not impact assessed | 7.6.2.4 (2)
6 maintenance procedures not applied | 7.15.2.1/2
7 no routine operation or maintenance audits | 7.15.2.3, 7.6.2.1/2 (2)
8 permit/hand over procedures | 7.6.2.1 (2)
9 test interval not sufficient | 7.6.2.3 (2)
10 LTA procedures to monitor system performance | 7.6.2.1f (2)
11 tools incorrectly selected or applied | 7.6.2.1g (2)

Modification
Objective: To ensure that functional safety remains appropriate during and after modification
1 LTA procedures applied to initiate modification in the event of systematic failures or vendor notification of faults | 6.2.1l, 7.8.2.2 (2)
2 LTA authorisation procedure | 7.16.2.2/5, 7.8.2.1c (2)
3 LTA impact analysis | 7.16.2.3/6, 7.8.2.1b (2)
4 LTA modification plan (including sufficient lifecycle activities) | 7.16.2.1/6, 7.8.2.3 (2)
5 LTA implementation of modification plan | 7.16.2.1
6 LTA manufacturers information | 7.8.2.1 (2)
7 LTA verification and validation | 7.8.2.4 (2)
Table A.2 Expansion of E/E/PES lifecycle classification
Columns: Sub-phase; Sub-classification | IEC 61508 reference

E/E/PES functional requirements specification
1 LTA derivation of E/E/PES requirements from overall specification | 7.2.2.1 (2)
2 LTA E/E/PES requirements, eg unstructured, unclear, ambiguous | 7.2.2.2 (2)
3 not all safety actions specified | 7.2.3.1a (2)
4 LTA specification of safe (or least hazardous) state and behaviour after restart | 7.2.3.1a (2)
5 LTA throughput and response requirement | 7.2.3.1b (2)
6 LTA EUC operator interfaces specification | 7.2.3.1c (2)
7 LTA operator interface to E/E/PES | 7.2.3.1c (2)
8 LTA interfaces to other systems specification | 7.2.3.1e (2)
9 LTA specification of failure modes or failure rates | 7.2.3.1g (2)
10 not all relevant modes (eg start-up, proof test) defined | 7.2.3.1g/j (2)
11 limiting and constraint conditions not defined | 7.2.3.1h/i (2)
12 LTA specification of security provisions | 7.2.3.1 (2)

E/E/PES integrity requirements specification
1 target safety integrity level and/or target failure measures incorrectly specified | 7.2.3.2a (2)
2 not all modes of operation defined | 7.2.3.2b (2)
3 LTA definition of proof testing requirements | 7.2.3.2c (2)
4 LTA specification of environmental conditions | 7.2.3.2d/e (2)

E/E/PES validation planning
1 LTA validation procedures | 7.3.2.1/2 (2)
2 LTA specification of validation equipment | 7.3.2.2d (2)
3 LTA policies for resolving validation failure | 7.3.2.2g (2)

E/E/PE system design
1 LTA hardware safety integrity including fault tolerance and reliability analysis | 7.4.2.2a (2)
2 LTA measures for systematic safety integrity | 7.4.2.2b (2)
3 LTA response on detection of a fault | 7.4.2.2c (2)
4 LTA separation and independence | 7.4.2.3/4/5 (2)
5 significance of interactions not recognised | 7.4.2.10 (2)
6 LTA design methods or tools | 7.4.2.8/9 (2), 7.4.4 (2)
7 LTA system architecture | 7.4.3.1/2 (2)
8 LTA products application or integration | 7.4.2.11 (2), 7.4.7 (2), 7.5 (2)

E/E/PES operation and maintenance facilities
1 LTA provision for checking or confirmation of operator inputs | 7.4.5.1c (2), 7.4.5.3 (2)
2 LTA operator displays (complex, confusing) | 7.4.5.3 (2)
3 LTA consideration of maintainability and testability | 7.4.5.2 (2), 7.4.4.3 (2)

Software requirements specification
1 inconsistency with overall specification | 7.2.2.2 (3)
2 unclear, ambiguous, unverifiable, not testable and traceable | 7.2.2.6 (3)
3 LTA constraints between hardware and software | 7.2.2.8 (3)
4 LTA monitoring and testing | 7.2.2.9 (3)

Software design and development
1 LTA software design methods | 7.4.2.2-5 (3)
2 LTA software architecture | 7.4.2.6-10 (3), 7.4.3 (3)
3 LTA support tools or programming language | 7.4.4 (3)
4 LTA code implementation | 7.4.5.4 (3), 7.4.6 (3)
5 LTA software testing | 7.4.7 (3), 7.4.8 (3)

Software validation
1 LTA software validation techniques/plan | 7.3.2.1/2/3/5 (3)
2 LTA assessment of the software validation plan | 7.3.2.4 (3)
3 LTA implementation of the validation plan | 7.7 (3)

E/E/PES integration
1 LTA hardware and software integration | 7.5 (2), 7.5 (3)

E/E/PES operation and maintenance procedures
1 LTA specification of the routine actions that need to be carried out | 7.6.2.1a (2)
2 LTA specification of actions and constraints during all modes of operation | 7.6.2.1b (2)
3 LTA maintenance procedures on fault | 7.6.2.1e (2)

E/E/PES validation
1 LTA implementation of validation plan | 7.7.2.1/3 (2)
2 LTA validation equipment | 7.7.2.2 (2)
3 LTA analysis/resolution of discrepancies | 7.7.2.4/5 (2)
4 LTA usability assessment | 7.7.2.3 (2)
Table A.3 Common requirements classification
Common requirement / Causal classification [IEC 61508 reference]
Safety management
1 necessary management and technical activities not specified [6.1.1/6.2.1]
2 accountability for all activities not specified [6.1.2/6.2.1]
3 activities not implemented or monitored [6.2.2]
4 activities not formally reviewed by all organisations [6.2.3]
5 not all organisations informed of responsibilities [6.2.4]
6 suppliers not reviewed for appropriate quality management system [6.2.5]
Safety lifecycle
1 detailed lifecycle not defined [7.1.4.1]
2 lifecycle activities not divided into elemental activities [7.1.4.4]
3 scope, inputs and outputs not detailed for each phase [7.1.4.5/6/7]
Competence
1 no procedures for ensuring competence [6.2.1h, B.2]
2 training, experience and qualification of a person not assessed [6.2.1h, B.2]
3 job requirements not specified [6.2.1h, B.2]
Verification
1 verification plan not developed or incomplete [7.18.2.1/2, 7.9 (2), 7.9 (3)]
2 verification plan not implemented [7.18.2.3, 7.9 (2), 7.9 (3)]
3 verification plan not fully understood [7.9 (2), 7.9 (3)]
Documentation
1 documentation absent/incomplete [5.2.1-5]
2 documentation unclear or incorrect [5.2.6]
3 documentation not well structured [5.2.7-10]
4 documentation not up to date [5.2.11]
Functional safety assessment
1 independent functional safety assessment not planned adequately [8.2.7/8/9]
2 independent functional safety assessment not carried out [8.2]
3 not all phases assessed (including operation and maintenance) [8.2.3]
4 insufficient skills or independence in assessment team [8.2.11-14]
5 tools not assessed [8.2.5]
APPENDIX B SIMPLIFIED CLASSIFICATION (END USER)
Table B.1 Safety lifecycle classification
Lifecycle reference
Lifecycle stage(s)
Classification
System assessment
Concept
Overall scope
Hazard and risk
assessment
1 LTA hazard and risk assessment
Safety
requirements and
allocation
Overall safety
requirements
Allocation of safety
requirements
1 LTA allocation of safety functions to people and
equipment
E/E/PES installation and commissioning
planning
1 LTA E/E/PES installation and commissioning plan
E/E/PES validation planning
1 LTA E/E/PES validation techniques/plan
E/E/PES operation and maintenance
planning
1 LTA E/E/PES operation and maintenance plan
E/E/PES realisation
1 LTA specification of E/E/PES requirements
2 LTA design and development
3 LTA operation facilities
4 LTA maintenance facilities
E/E/PES installation and commissioning
1 LTA E/E/PES installation
2 LTA E/E/PES commissioning
E/E/PES validation
1 LTA implementation of E/E/PES validation plan
2 LTA E/E/PES validation equipment
3 LTA E/E/PES analysis/resolution of discrepancies
4 LTA usability assessment
E/E/PES operation and maintenance
1 LTA operation procedures
2 operation procedures not impact assessed
3 operation procedures not applied
4 LTA maintenance procedures
5 maintenance procedures not impact assessed
6 maintenance procedures not applied
7 no routine operation or maintenance audits
8 test interval not sufficient
9 LTA permit/hand over procedures
10 LTA procedures to monitor system performance
11 tools incorrectly selected or applied
E/E/PES modification
1 LTA procedures applied to initiate modification in the event of systematic failures or vendor notification of faults
2 LTA authorisation procedure
3 LTA impact analysis
4 LTA modification plan (including sufficient lifecycle activities)
5 LTA implementation of modification plan
6 LTA manufacturers information
7 LTA verification and validation
Table B.2 Common requirements classification
Common requirement
Classification
Safety management
1 LTA safety management: operation
2 LTA safety management: maintenance
3 LTA safety management: modification
4 LTA safety management: external suppliers
Safety lifecycle
1 LTA lifecycle definition: operation
2 LTA lifecycle definition: maintenance
3 LTA lifecycle definition: modification
Competence
1 LTA operation competence
2 LTA maintenance competence
3 LTA modification competence
Verification
1 LTA verification of operation
2 LTA verification of maintenance
3 LTA verification of modification
Documentation
[For operation, maintenance or modification]
1 documentation absent/incomplete
2 documentation unclear or incorrect
3 documentation not well structured
4 documentation not up to date
Functional safety assessment
[For operation, maintenance or modification]
1 no safety assessment
2 LTA safety assessment
3 insufficient skills or independence in assessment team
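Where incident records are held electronically, the simplified classification could be stored as a lookup so that causes are recorded consistently. The following is a minimal sketch only, covering two categories of Table B.2. The names `TABLE_B2` and `classify`, and the code format `B.2/<requirement>/<number>`, are illustrative assumptions and are not part of the scheme.

```python
# Part of Table B.2 as a lookup: common requirement -> {number: classification}.
# Only two categories are shown; the short code format is an assumption.
TABLE_B2 = {
    "Competence": {
        1: "LTA operation competence",
        2: "LTA maintenance competence",
        3: "LTA modification competence",
    },
    "Verification": {
        1: "LTA verification of operation",
        2: "LTA verification of maintenance",
        3: "LTA verification of modification",
    },
}


def classify(requirement: str, number: int) -> str:
    """Return a short code plus the classification text (KeyError if unknown)."""
    text = TABLE_B2[requirement][number]
    return f"B.2/{requirement}/{number}: {text}"
```

For example, `classify("Competence", 2)` yields a code that can be stored against the incident and later counted across incidents.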
APPENDIX C PARC CAUSAL ANALYSIS METHOD
The PARC (PES Analysis of Root Causes) causal analysis method contains a simple
flowchart for identifying root causes of incidents involving E/E/PES.
The flowchart in this appendix is optimised for user organisations to analyse system
architectures where potential hazards in equipment under control, or the control system itself,
lead to a demand on a separate safety protection system. This architecture is common in
chemical plants and oil and gas installations. If used for other situations, it is recommended
that consideration be given to the appropriateness of the terminology, questions, lifecycle
phases and prevention measures, and changes made where necessary.
C.1 INITIATING PARC
Investigation of incidents involving E/E/PES should be carried out within the context of a
general-purpose incident reporting and investigation scheme. A suitable question within the
general report form will need to be framed to trigger a specialist investigation. The question
will depend on the types of incident that are required to be subject to investigation. Relevant
factors are industry practice, regulator expectations, the potential consequence associated with
the system involved, the maturity of the organisation and the skills and resources available.
In many cases it will be good practice to investigate if a demand occurs on a safety system,
since a hazard would have occurred if the safety system had been in the failed state. The
failure could be considered as a “near miss”. However, in an alarm system, demands may
occur frequently and other safety layers may intervene if the system is in the failed state.
Hence investigation of the cause of all alarms may not be practicable and may lead to there
being insufficient time to investigate other more serious incidents.
An example of a question that may be included in a general-purpose report form is as follows:
Did the incident involve the need for action, the lack of action, the wrong action or
the loss of capability of an E/E/PES, eg a demand on a system, a failure on demand
or on proof test, an unexpected action?
If the answer to this question on the general-purpose incident report form is “yes”, the
method described below can be followed to identify the immediate problem, the nature of
the E/E/PES problem and the underlying root causes in an IEC 61508 context. This method
has been given the acronym PARC (PES Analysis of Root Causes).
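If the general report form is held electronically, the trigger question can be broken into its four parts and a specialist investigation flagged when any part is answered “yes”. The sketch below assumes this simple rule. The class `EepesInvolvement`, its field names and the function `parc_required` are hypothetical, and an organisation may choose a narrower trigger (for example, excluding routine alarm demands).

```python
from dataclasses import dataclass


@dataclass
class EepesInvolvement:
    """Answers to the four parts of the trigger question (hypothetical fields)."""
    action_needed: bool = False     # eg a demand on a system
    action_lacking: bool = False    # eg a failure on demand
    wrong_action: bool = False      # eg an unexpected action
    capability_lost: bool = False   # eg a failure on proof test


def parc_required(answers: EepesInvolvement) -> bool:
    """True if any part of the trigger question is answered 'yes'."""
    return any((answers.action_needed, answers.action_lacking,
                answers.wrong_action, answers.capability_lost))
```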
C.2 APPLYING PARC
Data on the incident will need to be acquired through a combination of witness statements,
inspection and testing. The investigation process will depend on the technology involved, the
operating environment and the action (or lack of action) that triggered the need for incident
analysis. The user will normally conduct a preliminary investigation but in some cases
equipment will need to be returned to the system vendor for detailed investigation.
The PARC method represents a typical process that may be used by a user organisation, and
is shown as a flowchart in Figure C.1. The flowchart begins with the box labelled
“System operates correctly to prevent hazard”. If the description in the current box holds
for the incident under consideration, the dotted arrow labelled “Yes” is followed. Otherwise, the
solid line should be followed to the next box. When the user is directed to a table column, he
should examine every cause listed in that column and record all of those that played a part in
the incident. After that he should revisit the box most recently considered and follow the
solid line onto the next one.
In this way, the user eventually considers every row of events. If the row is applicable, he
then considers in turn every event in that row, and lists causes for those events that match the
incident. For each incident, the user should also consider all the listed causes in the common
requirements tables, in the context of actions/inactions of both the system and the operator.
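The traversal described above can be sketched in code as a nested loop over rows, events and causes. This is an illustration only: the `FLOWCHART` dictionary holds a small hypothetical excerpt, the cause strings are placeholders rather than the full content of Figure C.1, and the two callbacks stand in for the investigator's judgement.

```python
from typing import Callable, Dict, List

# Hypothetical excerpt: row box -> event box -> causes listed in its table column.
FLOWCHART: Dict[str, Dict[str, List[str]]] = {
    "System operates correctly to prevent hazard": {
        "Demand caused by maintenance action": [
            "maintenance procedures not applied",
            "LTA permit procedures",
        ],
        "Demand caused by operation error": [
            "operation procedures not applied",
        ],
    },
    "System fails on proof test": {
        "Setting is incorrect": ["setting not checked during validation"],
    },
}


def walk(holds: Callable[[str], bool],
         played_part: Callable[[str], bool]) -> List[str]:
    """Visit every row; in applicable rows visit every event; collect causes."""
    recorded: List[str] = []
    for row, events in FLOWCHART.items():
        if not holds(row):           # solid line: move on to the next row box
            continue
        for event, causes in events.items():
            if holds(event):         # dotted "Yes" arrow: go to the table column
                recorded.extend(c for c in causes if played_part(c))
            # then return to this box and follow the solid line to the next one
    return recorded
```

Every row and every event is visited, so an incident with several contributing events yields causes from several columns.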
The events in the flowchart boxes are described in more detail below.
1 Safety system operates correctly on demand to prevent hazard. Events leading to the
demand should be established but can include the following:
The demand is caused by a maintenance action. Errors in maintenance often cause
system demands. Examination of maintenance records and discussions with
technicians should establish if maintenance has been the cause of the demand.
The demand is caused by an operation error. Error in operation is another likely
cause of system demands. Examination of operation records and discussions with
operators should establish if operation has been the cause of the incident.
The demand is caused by equipment degradation. Installation of equipment in a
hostile environment or failure to follow installation requirements can cause the
equipment to exhibit transient failures. Physical degradation may be caused by
contact with the environment or the process. Degradation like contamination or
corrosion might be detected by external or internal inspection. Check for compliance
with manufacturer’s operating conditions (temperature, pressure, vibration, etc).
Equipment may also fail if there are interfacing incompatibilities. Check equipment
specifications for compatible interface standards (connectors, wiring, electrical signal
standards, signalling protocols, etc).
The demand is caused by inappropriate equipment function. Testing by
simulating the original conditions is required to establish if there is a systematic
design flaw in the hardware or software, resulting in the inappropriate behaviour.
More detailed investigation by the system vendor may be required.
The demand is caused by a random hardware failure. All equipment has a failure
rate. Provided the overall demand rate is in line with what was assumed during design,
no further investigation will be needed. Demands should be logged and
compared with the assumptions made during hazard and risk analysis.
2 System fails on proof test, fails to take action when required or takes action when
not required. Events leading to this should be established but can include the following:
The setting is incorrect. A proof test will be needed to establish if the equipment is
working according to its setting and an incorrect setting was the cause of the incident.
The failure is caused by a maintenance action. Errors in maintenance often cause
system failures. Inspection, examination of maintenance records and discussions
with technicians should establish if maintenance has been the cause of the incident.
The failure is caused by an operation error. Error in operation is another likely
cause of system failures. Examination of operation records and discussions with
operators should establish if operation has been the cause of the incident.
The failure is caused by equipment degradation. Installation of equipment in a
hostile environment or failure to follow installation requirements can cause the
equipment to exhibit transient failures. Physical degradation may be caused by
contact with the environment or the process. Degradation like contamination or
corrosion might be detected by external or internal inspection. Check for compliance
with manufacturer’s operating conditions (temperature, pressure, vibration, etc).
Equipment may also fail if there are interfacing incompatibilities. Check equipment
specifications for compatible interface standards (connectors, wiring, electrical signal
standards, signalling protocols, etc).
The failure is caused by inappropriate equipment function. Testing by simulating
the original conditions is required to establish if there is a systematic design flaw in
the hardware or software, resulting in the inappropriate behaviour. More detailed
investigation by the system vendor may be required.
The failure is caused by a random hardware failure. All equipment has a failure
rate. Provided the failure rate of the installed equipment is in line with what was
assumed during design, no further investigation will be needed. Failures should
be logged and compared with the assumptions made during the reliability analysis to
confirm the system has the required integrity.
3 Incorrect action taken by system or operator. Events leading to this should be
established but can include the following:
No action by operator allows demand on system. Error in operation is a likely
cause of EUC failures becoming system demands. Examination of operation records
and discussions with operators should establish if lack of action during operation has
been the cause of the incident.
System actions insufficient to terminate hazard. Actions taken by the system may
be insufficient to terminate the hazard. This can be caused by unpredicted failure
modes, design errors or incorrect operation and maintenance. Determination of cause
will require inspection of equipment and operation and maintenance records but may
also require equipment testing and examination of the design basis.
System takes unpredicted actions. If the system takes unpredicted actions then an
investigation will be needed as to whether the system has not functioned as required
or a process interaction has occurred.
Mitigation by system is inadequate. Hazards are often prevented from causing
significant harm by the intervention of mitigation systems that minimise the
consequences of an incident. Judgement is needed as to whether consequences
should have been reduced further. Examination of the event sequences after a hazard
has occurred will be necessary.
Operator fails to mitigate hazard. Hazards are often prevented from causing
significant harm by the intervention of an operator. Judgement is needed as to
whether consequences should have been reduced further by operator action.
Examination of the event sequences after a hazard has occurred will be necessary.
The action is caused by a random hardware failure. All equipment has a failure
rate. Provided the incorrect action rate of the installed equipment is in line with what
was assumed during design, no further investigation will be needed. Failures
should be logged and compared with the assumptions made during the reliability
analysis to confirm the system has the required integrity.
In many cases user organisations will not have the tools or expertise to investigate complex
failures of programmable equipment. In such cases the equipment will need to be returned to
the system vendor or the system vendor will need to be involved in the on-site investigation.
In such cases the user will need to prepare a report detailing what is known about the events
leading to the incident.
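The comparison of logged demands or failures with design assumptions, mentioned under each “random hardware failure” heading above, can be sketched as a simple rate check. The function names, the per-year units and the `margin` parameter are assumptions for illustration. With only a few logged events the observed rate is very uncertain, so a result of this kind is a prompt for review rather than a conclusion.

```python
def observed_rate(event_count: int, years_in_service: float) -> float:
    """Logged events per year over the observation period."""
    if years_in_service <= 0:
        raise ValueError("years_in_service must be positive")
    return event_count / years_in_service


def within_assumption(event_count: int, years_in_service: float,
                      assumed_rate_per_year: float, margin: float = 1.0) -> bool:
    """True if the observed rate does not exceed the assumed rate times a margin."""
    rate = observed_rate(event_count, years_in_service)
    return rate <= assumed_rate_per_year * margin
```

For example, three demands logged in two years against an assumed demand rate of one per year would return `False`, indicating that the hazard and risk analysis assumptions should be revisited.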
System operates correctly to prevent hazard. If yes:
– Demand caused by maintenance action
– Demand caused by operation error
– Demand caused by equipment degradation
– Demand caused by inappropriate function
System fails on proof test, fails to take action when required or takes action when not required. If yes:
– Setting is incorrect
– Failure caused by maintenance action
– Failure caused by operation error
– Failure caused by equipment degradation
– Failure caused by inappropriate function
– Random hardware failure
Would the incident have been prevented if
System concept
– hazard and risk analysis had considered all modes of operation and causes
Design
– specification was correct
– different equipment had been selected
– the installation design had been different
– maintenance facilities had been designed adequately
– operation facilities had been designed adequately
Installation & commissioning
– configuration was correct
– the equipment had been installed according to design
– maintenance facilities had been installed according to design
– operation facilities had been installed according to design
Validation
– the setting had been checked during validation
– maintenance facilities had been fully checked
– operation facilities had been fully checked
– equipment condition had been fully checked
Operation & maintenance
– maintenance procedures were applied
– maintenance procedures were improved
– maintenance tools were improved
– test interval was reduced
– correct maintenance procedure had been used
– correct operation procedure had been used
– operation procedure was improved
– permit procedures were improved
– additional protection was provided
Modification
– setting had been reviewed during impact analysis
– maintenance facilities or procedures had been reviewed during impact analysis
– operation facilities or procedures had been reviewed during impact analysis
– equipment used or installation design had been reviewed during impact analysis
Log failure and check
– if dangerous failure rate is in line with design assumptions
– if all expected actions occurred and no unexpected actions occurred
– if safe failure causes any unexpected actions
Log demand and check
– if demand rate is in line with design assumptions
– if demand cause was predicted in hazard and risk analysis
Would the incident have been prevented if (common requirements)
Operation & maintenance
– Competence: operation or maintenance staff were more competent
– Lifecycle: responsibilities were defined better
– Verification: a better verification scheme had been in place
– Safety management: safety culture was improved; audits were more frequent
– Documentation: documentation was clear and sufficient
– Safety assessment: operation and maintenance phase had been assessed
Modification
– Competence: modification had been carried out by more competent staff
– Lifecycle: modification lifecycle was better defined
– Verification: a better verification scheme had been in place
– Safety management: accountabilities were better defined; suppliers had been reviewed
– Documentation: documentation had been updated
– Safety assessment: modification had been assessed
Figure C.1 PARC causal analysis flowchart
Continued from previous page
Incorrect action taken by system or operator. If yes:
– No action by operator allows demand on system
– System actions insufficient to terminate hazard
– System takes unpredicted actions
– Mitigation by system is inadequate
– Operator fails to mitigate hazard
– Random hardware failure
Would the incident have been prevented if
System concept
– hazard and risk analysis had considered all modes of operation and causes
Design
– additional actions had been specified
– actions had been faster
– final actuation device was improved
– design requirements were better documented
– the installation design had been different
– operator facilities had been better designed
– mitigation system had been specified
– mitigation system had been better designed
Installation & commissioning
– the equipment had been installed according to design
– mitigation system had been installed according to design
Validation
– operator facilities had been checked during validation
– operation facilities had been checked during validation
– operator facilities had been fully checked
– operation facilities had been fully checked
– mitigation system had been fully checked
Operation & maintenance
– operation procedures were applied
– operation procedures were improved
– operation facilities or procedures were improved
– correct operation procedure had been used
– correct maintenance procedure had been used
– maintenance procedure was improved
– permit procedures were improved
– proof testing was more frequent
– mitigation procedures were applied
– mitigation procedures were improved
– mitigation system was proof tested more frequently
Modification
– operation facilities had been reviewed during impact analysis
– necessary system actions had been reviewed during impact analysis
– need for mitigation had been reviewed during impact analysis
Log failure and check
– if dangerous failure rate is in line with design assumptions
– if all expected actions occurred and no unexpected actions occurred
– if safe failure causes any unexpected actions
Log demand and check
– if demand rate is in line with design assumptions
– if demand cause was predicted in hazard and risk analysis
Would the incident have been prevented if (common requirements)
Operation & maintenance
– Competence: operation or maintenance staff were more competent
– Lifecycle: responsibilities were better defined
– Verification: a better verification scheme had been in place
– Safety management: safety culture was improved; audits were more frequent
– Documentation: documentation was clear and sufficient
– Safety assessment: operation and maintenance phase had been assessed
Modification
– Competence: modification had been carried out by more competent staff
– Lifecycle: modification lifecycle was better defined
– Verification: a better verification scheme had been in place
– Safety management: accountabilities were better defined; suppliers had been reviewed
– Documentation: documentation had been updated
– Safety assessment: modification had been assessed
Figure C.1 (continued) PARC causal analysis flowchart
APPENDIX D SIMPLIFIED CLASSIFICATION (SYSTEM
REALISATION)
Table D.1 System realisation phase classification
System realisation
sub-phase
Sub-classification
E/E/PES functional
requirements
specification
1 LTA derivation of E/E/PES requirements from overall specification
2 LTA E/E/PES requirements, eg unstructured, unclear, ambiguous
3 not all safety actions specified
4 LTA specification of safe (or least hazardous) state and behaviour after restart
5 LTA throughput and response requirement
6 LTA EUC operator interfaces specification
7 LTA operator interface to E/E/PES
8 LTA interfaces to other systems specification
9 LTA specification of failure modes or failure rates
10 not all relevant modes (eg start-up, proof test) defined
11 limiting and constraint conditions not defined
12 LTA specification of security provisions
E/E/PES integrity
requirements
specification
1 target safety integrity level and/or target failure measures incorrectly specified
2 not all modes of operation defined
3 LTA definition of proof testing requirements
4 LTA specification of environmental conditions
E/E/PES validation
planning
1 LTA validation procedures
2 LTA specification of validation environment
3 LTA policies for resolving validation failure
E/E/PES system
design
1 LTA hardware safety integrity including fault tolerance and reliability analysis
2 LTA measures for systematic safety integrity
3 LTA response on detection of a fault
4 LTA separation and independence
5 significance of interactions not recognised
6 LTA design methods or tools
7 LTA system architecture
8 LTA products application or integration
E/E/PES operation
and maintenance
facilities
1 LTA provision for checking or confirmation of operator inputs
2 LTA operator displays (complex, confusing)
3 LTA consideration of maintainability and testability
Software
requirements
specification
1 inconsistency with overall specification
2 unclear, ambiguous, unverifiable, not testable and traceable
3 LTA constraints between hardware and software
4 LTA monitoring and testing
Software design and
development
1 LTA software design methods
2 LTA software architecture
3 LTA support tools or programming language
4 LTA code implementation
5 LTA software testing
Software validation
1 LTA software validation techniques/plan
2 LTA assessment of the validation plan
3 LTA implementation of the validation plan
E/E/PES integration
1 LTA hardware and software integration
E/E/PES installation
and commissioning
1 LTA installation
2 LTA commissioning
E/E/PES operation
and maintenance
procedures
1 LTA specification of the routine actions that need to be carried out
2 LTA specification of actions and constraints during all modes of
operation
3 LTA maintenance procedures on fault
E/E/PES validation
1 LTA implementation of validation plan
2 LTA validation equipment
3 LTA analysis/resolution of discrepancies
4 LTA usability assessment
Table D.2 Common requirements classification
Common
requirement
Classification (applies to all system realisation phases)
Safety management
1 necessary management and technical activities not specified
2 accountability for all activities not specified
3 activities not implemented or monitored
4 activities not formally reviewed by all organisations
5 not all organisations informed of responsibilities
6 suppliers not reviewed for appropriate quality management system
Safety lifecycle
1 detailed lifecycle not defined
2 lifecycle activities not divided into elemental activities
3 scope, inputs and outputs not detailed for each phase
Competence
1 no procedures for ensuring competence
2 training, experience and qualification of a person not assessed
3 job requirements not specified
Verification
1 Verification plan not developed or incomplete
2 Verification plan not implemented
3 Verification plan not fully understood
Documentation
1 documentation absent/incomplete
2 documentation unclear or incorrect
3 documentation not well structured
4 documentation not up to date
Functional safety assessment
1 independent functional safety assessment not planned
2 independent functional safety assessment not carried out
3 not all phases assessed (including operation and maintenance)
4 insufficient skills or independence in assessment team
5 tools not assessed
APPENDIX E EXAMPLE FORMS
Title of form [This might drive a type or source category, eg Chemco may have other documents/reports etc. In a large scale scheme incident reports may be received from other organisations]
Your name
Date of report
Date of incident
Time of incident
Title [User asked to assign a short title]
Incident type [It may be appropriate for a user to assign a type to an incident (or this may be fixed by the type of user that makes the report)]
Describe the incident [Free text]
Unique reference number [Could be allocated automatically or by the incident report “gatekeeper”]
Location of incident [May want this categorised – by site, region etc]
Was any person hurt? [This identifies if health and safety mandatory reporting is invoked and would be required for insurance purposes. For accident reports this would need additional data. If applicable detail number of persons hurt, seriousness etc.]
Did any damage to property occur? [Value of damaged property, whether it was personal property etc]
Was there a loss of production? If so how much? [Or other economic loss – for insurance purposes if injury/damage also occurred or for good governance in investigating cost of incident]
In your view could this have led to more serious consequences? If so, what could have occurred?
What short term fixes or work-arounds have been applied?
To your knowledge, has this problem occurred before?
Was any electrical or electronic equipment involved that either caused or failed to prevent the incident? If so please file an equipment problem report
Figure E.1 Initial incident report
The first part of the initial incident report could be filled in by the staff involved in the
incident; the next part could be filled in by a supervisor.
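If the initial incident report is captured electronically, the fields of Figure E.1 might be held in a record such as the one sketched below. The field list follows the figure, but the class name, attribute names, types and defaults are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, time
from typing import Optional


@dataclass
class InitialIncidentReport:
    """Fields follow Figure E.1; attribute names and types are assumptions."""
    reporter_name: str
    date_of_report: date
    date_of_incident: date
    time_of_incident: time
    title: str
    incident_type: str
    description: str
    location: str
    person_hurt: bool
    property_damaged: bool = False
    production_loss: Optional[str] = None
    potential_consequences: Optional[str] = None
    short_term_fixes: Optional[str] = None
    occurred_before: Optional[bool] = None
    eepes_involved: bool = False
    reference_number: Optional[str] = None  # may be allocated by the "gatekeeper"

    def needs_equipment_problem_report(self) -> bool:
        """The last question on the form asks for an equipment problem report."""
        return self.eepes_involved
```

Leaving `reference_number` empty at creation reflects the note that it may be allocated automatically or by the incident report “gatekeeper”.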
Equipment Problem Report
Manufacturer
Product name
Serial no.
Configuration/version information
Location
Date
Incident reference (if applicable)
Used for: ☐ protection ☐ control ☐ monitoring ☐ alarm/direction ☐ safety-classified
Problem Description
☐ failed to act when needed
☐ acted when not needed
☐ acted in unexpected way
☐ failed completely
☐ failed during test
☐ maloperation of equipment by staff
☐ other
☐ dangerous/potentially dangerous
Details [Description. Identify any associated records (like event logs)]
Immediate Cause
Operational
☐ operator action
☐ operator inaction
Maintenance
☐ fault diagnosis
☐ fault correction
☐ left in wrong mode
☐ configuration
☐ calibration
Environment problem
☐ condensation
☐ contamination
☐ temperature
☐ vibration
☐ EMI
Equipment functionality
☐ user interface
☐ operational functions
☐ maintenance functions
☐ response to component failure
☐ performance
☐ random hardware fault
Equipment integration/installation
☐ power supplies
☐ wiring
☐ connections
☐ incompatible interfaces
☐ incompatible subsystems
☐ configuration
Details
Actions
☐ log problem
☐ repair/reconfigure equipment
☐ report to supplier
☐ problem prevention review
Details
Figure E.2 Equipment problem report (end user)
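The tick-box groups of Figure E.2 lend themselves to enumerated values when the report is stored electronically. The sketch below covers only the problem description and actions groups. The class and member names are hypothetical, and the completeness rule in `is_complete` is an assumed local policy, not a requirement of the scheme.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Set


class Problem(Enum):
    FAILED_TO_ACT = "failed to act when needed"
    ACTED_WHEN_NOT_NEEDED = "acted when not needed"
    ACTED_UNEXPECTEDLY = "acted in unexpected way"
    FAILED_COMPLETELY = "failed completely"
    FAILED_DURING_TEST = "failed during test"
    MALOPERATION = "maloperation of equipment by staff"
    OTHER = "other"


class Action(Enum):
    LOG_PROBLEM = "log problem"
    REPAIR = "repair/reconfigure equipment"
    REPORT_TO_SUPPLIER = "report to supplier"
    PREVENTION_REVIEW = "problem prevention review"


@dataclass
class EquipmentProblemReport:
    manufacturer: str
    product_name: str
    serial_no: str
    problems: Set[Problem] = field(default_factory=set)
    dangerous: bool = False  # "dangerous/potentially dangerous" tick box
    actions: Set[Action] = field(default_factory=set)

    def is_complete(self) -> bool:
        """Assumed rule: at least one problem and one action must be ticked."""
        return bool(self.problems) and bool(self.actions)
```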
Original incident report fields
Check and correct/expand all fields, especially
Incident type
Incident consequences
Incident potential consequences
Novel or repeat
Short term fixes
Fields added after investigation
Incident reconstruction
Using suitable technique
Analysis of causal factors
Using suitable technique
Causal factors
List of causal factors
Problem prevention classification
(see Figure E.5 or E.7)
Recommendations
List of recommended changes to a range of areas
including E/E/PES
(see Figure E.6 or E.8)
Investigation details
Could be a reference to separate documents including
technical investigations of E/E/PES
Figure E.3 Causal investigation report (user organisation)
Problem report: If information is available from end customer (or internally if problem occurred while being commissioned/tested by supplier), eg end user problem report
Supplier problem reference number: Might be assigned automatically if held electronically
Customer/end user organisation
Location
Date and time of incident
Identity and configuration of equipment
Safety-significance
End user causal analysis that relates problem to E/E/PES
Supplier Investigation: Determine nature and origin of problem
Causal analysis: Results of analysis of E/E/PES problem by supplier
Recommendations: Product recall, redesign/update, redesign/update to related products, warnings (to all affected systems), workarounds, changes to design processes, etc
Causal classification: Classification of problem origin
Process of investigation: Documentation of how results were arrived at, ie interview records, analysis products, meeting records etc.
Figure E.4 Causal analysis by E/E/PES supplier of system or equipment
E/E/PES Problem Prevention Checklist
[ ] System assessment
[ ] System design
[ ] Installation and commissioning
[ ] Validation
[ ] Operation and maintenance
[ ] Modification
Safety management
[ ] operation
[ ] maintenance
[ ] modification
Lifecycle
Competence
[ ] operation
[ ] maintenance
[ ] modification
Verification
Documentation
[ ] operation
[ ] maintenance
[ ] modification
Functional safety assessment
[ ] operation
[ ] maintenance
[ ] modification
[ ] hazard and risk assessment applied/improved
[ ] better allocation of functions to people and equipment
[ ] better specification of system functions
[ ] better design and development
[ ] improved operation facilities
[ ] improved maintenance facilities
[ ] improved installation plan/procedures
[ ] improved commissioning plan/procedures
[ ] better validation techniques/plan
[ ] improved implementation of validation plan
[ ] better validation equipment
[ ] better analysis/resolution of discrepancies
[ ] better usability assessment
[ ] operation procedures improved
[ ] impact assessment of operation procedures
[ ] operation procedures properly applied
[ ] maintenance procedures improved
[ ] impact assessment of maintenance procedures
[ ] maintenance procedures properly applied
[ ] routine operation and maintenance audits
[ ] test interval changed
[ ] permit/hand over procedures
[ ] procedures to monitor system performance
[ ] selection/application of tools
[ ] procedures applied to initiate modification in the event of systematic failures or vendor notification of faults
[ ] modification authorisation procedures
[ ] impact analysis of modification
[ ] improve modification planning (including following appropriate lifecycle)
[ ] improve implementation of modification plan
[ ] better manufacturers information
[ ] improve verification and validation of modification
[ ] safety culture
[ ] safety audits
[ ] management of suppliers
Ensure adequate definition of lifecycle process
[ ] operation, [ ] maintenance, [ ] modification
[ ] procedures for ensuring competence
[ ] check training, experience and qualification of a person
[ ] specify job requirements
Verification of [ ] operation, [ ] maintenance, [ ] modification
[ ] check documentation available/complete
[ ] check documentation clear and correct
[ ] check documentation well structured
[ ] check documentation up to date
[ ] introduce safety assessment
[ ] improve safety assessment
[ ] ensure adequate skills and independence of assessment team
Figure E.5 E/E/PES problem prevention checklist (end user)
E/E/PES Recommendation Checklist
Policy change
[ ] procurement procedures
[ ] supplier approval
[ ] engineering standards
[ ] acceptance procedures
Equipment change
[ ] warning/advisory labelling
[ ] equipment relocation
[ ] environmental protection
[ ] hardware repair
[ ] upgrade to new version
[ ] equipment replacement
[ ] reprogramming
Operational change
[ ] procedures
[ ] support documentation
[ ] access controls
[ ] warnings
[ ] staff training
[ ] staff briefing
[ ] staff supervision
Maintenance change
[ ] procedures
[ ] support documentation
[ ] access controls
[ ] warnings
[ ] staff training
[ ] staff briefing
[ ] staff supervision
Figure E.6 E/E/PES recommendation checklist (end user)
This checklist could be associated with specific recommendations to prevent a local
recurrence or with more general changes to prevent similar recurrences.
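One way to record that association is to tag each recommendation with its checklist heading and item, together with its intended scope. The following is a minimal sketch in Python. The `Recommendation` class, the `Scope` enum and the example text are hypothetical and serve only to illustrate the idea, and only two of the four Figure E.6 headings are reproduced.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    # The two uses of the checklist described in the text
    LOCAL = "prevent a local recurrence"
    GENERAL = "prevent similar recurrences"


# Two of the four headings from Figure E.6, with their tick boxes
CHECKLIST = {
    "Policy change": [
        "procurement procedures",
        "supplier approval",
        "engineering standards",
        "acceptance procedures",
    ],
    "Equipment change": [
        "warning/advisory labelling",
        "equipment relocation",
        "environmental protection",
        "hardware repair",
        "upgrade to new version",
        "equipment replacement",
        "reprogramming",
    ],
}


@dataclass(frozen=True)
class Recommendation:
    text: str
    heading: str
    item: str
    scope: Scope

    def __post_init__(self):
        # Reject items that do not appear under the stated checklist heading
        if self.item not in CHECKLIST.get(self.heading, []):
            raise ValueError(f"{self.item!r} is not listed under {self.heading!r}")


# Hypothetical recommendation, for illustration only
rec = Recommendation(
    text="Fit a sealed enclosure to the affected unit",
    heading="Equipment change",
    item="environmental protection",
    scope=Scope.LOCAL,
)
```

Recording the scope alongside the checklist item makes it possible to separate site-specific fixes from changes intended to apply more widely.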
E/E/PES Problem Prevention Checklist (Page 1 of 2)
Functional requirements specification
[ ] Derivation of E/E/PES requirements from overall specification
[ ] Specification of E/E/PES requirements (eg structure, clarity, unambiguity)
[ ] Completeness of safety actions specified
[ ] Specification of safe (or least hazardous) state and behaviour after restart
[ ] Throughput and response requirement
[ ] Better EUC operator interfaces specification
[ ] Better operator interface to E/E/PES
[ ] Better specification of interfaces to other systems
[ ] Specification of security provisions
[ ] Specification of failure modes or failure rates
[ ] Identification of undefined relevant operating modes (eg start-up, proof test)
[ ] Identification of undefined limiting and constraint conditions
Integrity requirements specification
[ ] Specification of target safety integrity level and/or target failure measures
[ ] Specify all modes of operation
[ ] Better definition of proof testing requirements
[ ] Better specification of environmental conditions
Validation planning
[ ] Better validation procedures
[ ] Better specification of validation environment
[ ] Better policies for resolving validation failure
Hardware design and development
[ ] Better hardware safety integrity including fault tolerance and reliability analysis
[ ] Better measures for systematic safety integrity
[ ] Better response on detection of a fault
[ ] Better separation and independence
[ ] Better identification of interactions between subsystem components
[ ] Better design methods and tools
[ ] Better system architecture
[ ] Products supplied and integrated in accordance with product specifications
Facilities for operation and maintenance
[ ] Provision for checking or confirmation of operator inputs
[ ] Better operator displays (complexity, confusion)
[ ] Better consideration of maintainability and testability
Software requirements specification
[ ] Consistency checks against the overall specification
[ ] Checks requirements are clear, unambiguous, verifiable, testable and traceable
[ ] Better identification of constraints between hardware and software
[ ] Better monitoring and testing
Software design and development
[ ] Better design methods
[ ] Better software architecture
[ ] Better support tools or programming language
[ ] Better code implementation
[ ] Better software testing
Software validation
[ ] Better software validation techniques/plan
[ ] Better assessment of the validation plan
[ ] Better implementation of the validation plan
Integration
[ ] Better hardware and software integration
Installation and commissioning
[ ] Better installation procedures
[ ] Better commissioning procedures
Figure E.7 E/E/PES problem prevention checklist (E/E/PES supplier)
E/E/PES Problem Prevention Checklist (Page 2 of 2)
Operation and maintenance procedures
[ ] Better specification of the routine actions that need to be carried out
[ ] Better specification of actions and constraints during all modes of operation
[ ] Better maintenance procedures on fault
Validation
[ ] Better implementation of validation plan
[ ] Better validation equipment
[ ] Better analysis/resolution of discrepancies
[ ] Better usability assessment
Safety management
[ ] specification, [ ] system design, [ ] software, [ ] testing, [ ] integration, [ ] validation, [ ] installation, [ ] commission
[ ] specification of necessary management and technical activities
[ ] define accountability for all activities
[ ] check activities are implemented and monitored
[ ] check activities are formally reviewed by all organisations
[ ] ensure all organisations are informed of responsibilities
[ ] review suppliers for appropriate quality management system
Lifecycle
[ ] ensure adequate definition of detailed lifecycle
[ ] define lifecycle sub-activities
[ ] ensure scope, inputs and outputs defined for each phase
Competence
[ ] procedures for ensuring competence
[ ] assess training, experience and qualification of a person
[ ] specify job requirements
Verification
[ ] check verification plan is developed and complete
[ ] check verification plan is implemented
[ ] check verification plan is fully understood
Documentation
[ ] check documentation is available and complete
[ ] check documentation is clear and correct
[ ] check documentation is well-structured
[ ] check documentation is up to date
Functional safety assessment
[ ] plan independent functional safety assessment
[ ] carry out independent functional safety assessment
[ ] ensure all phases assessed (including operation and maintenance)
[ ] ensure sufficient skills and independence in assessment team
[ ] assess tools
Figure E.7 (continued) E/E/PES problem prevention checklist (E/E/PES supplier)
E/E/PES Recommendation Checklist
Equipment change
[ ] hardware modification
[ ] redesign/update
[ ] redesign/update to related products
[ ] reprogramming
Operational support
[ ] support staff training
[ ] product specification
[ ] installation guidance
[ ] configuration guidance
Processes
[ ] contract review procedures
[ ] component supplier approval
[ ] engineering standards
[ ] product dispatch procedures
[ ] test and validation checklists
[ ] installation and commissioning procedures
[ ] staff training
[ ] staff competence
User response
[ ] product recall
[ ] workarounds
[ ] warnings/alerts (all affected systems)
Figure E.8 E/E/PES recommendation checklist (E/E/PES supplier)
Printed and published by the Health and Safety Executive
C30 1/98
C1.10
12/03
ISBN 0-7176-2789-6
RR 181
£15.00