The Quick Guide to Alarm Review
30 Questions to help you check the health of your Alarm System
Edward Dilley MIChemE CEng
Rev1, March 2006
This paper is intended to help Managers, Engineers and Technicians review their existing Alarm System, to identify problems and to assist in designing improvements.
There are now many excellent references to the problems caused by ‘Alarm Overload'. However the team tasked with reviewing an alarm system may be overwhelmed with good advice, and be unsure how best to start; I know because I've been in that position myself.
The key part of this paper is a list of 30 questions about your process alarm-system. Your answers will identify the problems and probably make the solutions self-evident. Most solutions are cheap, but even if more advanced techniques are called for, it is still necessary to have the basics right.
My basis for discussion is an existing process with a Digital Control System and associated alarms. ‘Normal' operation is deemed acceptable.
The objective of an Alarm Review and Improvement project is that the operators can handle a process upset as safely and confidently as they do ‘normal' operation. An economic bonus may well be that ‘normal' operation itself improves.
The problem of alarm overload and alarm system design has existed for a long time, but several major incidents have highlighted the issue. The Three Mile Island disaster, the Texaco Pembroke explosion, the Channel Tunnel Fire and the Esso Longford explosion have focussed media attention on the process industries. In all these incidents, poorly-designed alarm systems contributed to the disaster, or turned a minor operational problem into a disaster.
Although this paper takes examples from the oil and petrochemical industries, the principles apply equally to any process, be it heavy industry, food or transportation.
What are the current standards? The UK Health and Safety Executive have produced a free leaflet entitled ‘Better Alarm Handling' which provides an excellent introduction to alarm philosophy.
The UK-based Engineering Equipment and Materials Users Association (EEMUA) has published the 130-page document ‘Alarm Systems, a guide to Design, Management and Procurement'. The guide has the endorsement of the US-based Abnormal Situations Management group and the UK Health and Safety Executive (HSE), and as such represents current best practice.
Neither document can give a prescriptive ‘recipe' for an instant, perfect alarm system. The documents offer guidelines, but engineering thought is required to decide what is needed for each specific site. The list of questions below is intended to kick-start the thinking process.
The recently published international standard IEC 61511 is concerned with computer-based ‘Safety Related' systems (meaning safety-dedicated systems). It refers to digital control-system alarms only briefly, and indicates that they cannot normally be regarded as having a formal Safety Integrity Level. However, the standard does imply that a well-designed process-alarm system can be credited with reducing serious incident frequency, a point not lost on the regulatory and insurance authorities.
The EEMUA guide gives over 20 suggested techniques for improving alarm systems. Some techniques are quick, cheap and easy; others involve more complex programming or logic. As in many engineering projects, you can often achieve 80% of the benefits with 20% of the effort or expense.
The decisions can only be made after a thorough review of the existing control and alarm system; there is no point having a complex Expert System for its own sake. Get the basics right first, and then decide if more is required.
Who should carry out the review? For a typical plant, a team of an experienced Chemical Engineer working with an experienced Head Operator or Foreman is probably enough. They will seek input from others during the review and implementation process. Large sites may need more people, but the team must be small enough to be cohesive.
The team members should have site-knowledge, credibility and tact. (Experience has shown that operators can get understandably worried if it is suggested that a personal favourite alarm is to be downgraded or even eliminated.)
Here are 30 questions for you to ask about your alarm system. The questions are not in order of importance, and you will probably add further questions of your own. The ‘alarm system' includes all the alarms which occur in the control-room, not just those on the DCS.
What alarm systems coexist in the control room (DCS, Hard-wired, Fire-alarms, Emergency Shutdown Alarms, etc.).
How many process controllers and indicators are there.
How many alarms are there in total, and on which alarm system. What default values are used in the DCS for alarm-related parameters such as Priority and Enabled status. (A poor choice of defaults can create ‘deadwood' alarm data which hinders analysis.)
What is the proportion of Emergency, High-priority and Standard-priority alarms.
How many repeating, nuisance alarms are there.
How many standing (semi-permanent) alarms are there.
How are the alarms distributed between screens, operating consoles and operators.
At what frequency do alarms occur under normal operation.
At what frequency do alarms occur under upset conditions.
How many Status indications are wrongly shown as alarms. (For example, is the standby pump, which is correctly stopped, showing a ‘Stopped' alarm.)
How many process-messages are annunciated as alarms. (For example ‘Operator has switched on Reflux Ratio control' is a message, not an alarm.)
How many repeating, nuisance messages are there.
What DCS alarms or alarm trigger-points are used in control- or shutdown-logic. (Is an alarm more than a mere alarm.)
How many alarms are duplicated between different alarm-systems (for example DCS and Hard-annunciated).
What graphics standards are used to indicate alarms or abnormal conditions (colour, inverse video, etc.)
What auditory annunciation and discrimination is used (for alarm priority, console, etc.).
What visual annunciation and discrimination is used.
How are alarm settings added or changed. (Key, password, etc.)
How are alarm priorities changed.
How and when are alarms disabled or inhibited.
How do the operator and supervision know about changed, disabled or inhibited alarms.
How is alarm status handed over at shift-change.
What Management of Change procedures exist for the alarm system.
What criteria are used for defining Emergency, High-Priority and Standard-Priority alarms.
What process- or alarm-overview displays or graphics are used for normal operation or during upsets.
What standard Alarm and Overview displays are provided by the DCS. Are they used.
What training, instructions or manuals does the operator have for action required for specific alarms or categories of alarm.
What is the operators' opinion about existing alarms under normal and upset conditions (helpful, hindrance, ignored, ergonomics, …).
What is the normal operator workload under normal and upset conditions (e.g. how many controllers, processes, screens, etc., must be monitored and acted on).
What alarm history is configured, in numbers and time. Is the history database clogged with repeating or nuisance alarms or messages.
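Several of the questions above (alarm frequency under normal and upset conditions, repeating nuisance alarms, a clogged history database) can be answered from an export of the alarm history. The following is a minimal sketch only, assuming the history can be exported as rows of timestamp, tag and priority. The sample data, the tag names, the function names and the ten-minute window are all hypothetical; every DCS exports its history differently.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical alarm-history export: (timestamp, tag, priority).
# Real DCS exports differ; adapt the parsing to your own system.
EVENTS = [
    ("2006-03-01 08:00:05", "FI101.LO", "S"),
    ("2006-03-01 08:00:40", "FI101.LO", "S"),
    ("2006-03-01 08:01:10", "FI101.LO", "S"),
    ("2006-03-01 08:03:00", "TI205.HI", "H"),
    ("2006-03-01 08:12:30", "FI101.LO", "S"),
    ("2006-03-01 08:14:00", "PI310.HH", "E"),
]


def parse_time(text):
    """Convert an exported timestamp string to a datetime."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


def alarms_per_window(events, window_minutes=10):
    """Count alarms in consecutive fixed windows, starting at the first event.

    Returns {window_index: count}; windows with no alarms are omitted.
    """
    times = [parse_time(ts) for ts, _tag, _prio in events]
    if not times:
        return {}
    start = min(times)
    window = timedelta(minutes=window_minutes)
    counts = Counter((t - start) // window for t in times)
    return dict(counts)


def top_repeaters(events, n=3):
    """Return the n tags that alarmed most often, with their counts."""
    return Counter(tag for _ts, tag, _prio in events).most_common(n)


if __name__ == "__main__":
    print(alarms_per_window(EVENTS))
    print(top_repeaters(EVENTS, 1))
```

Run the same counts separately on a period of normal operation and on a known upset. The difference between the two, and the handful of tags at the top of the repeater list, are usually a good place to start the discussion with the operators.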
The answers to these questions should highlight the nature of any Alarm System problems. The steps necessary to improve matters may well become self-evident.
At the very least, ensure that the following items are addressed:
Eliminate nuisance alarms by re-engineering or retuning.
Minimise the number of alarms. If an alarm has no defined response it should be eliminated or reassigned to an overview graphic.
Define exactly what is meant by the Emergency, High and Standard priorities of alarm, and the actions required when they occur.
Ensure that Emergency- and High-priority is allocated only to appropriate alarms. Most alarms should be Standard-priority. Have a target for the proportion of E:H:S priorities. Ensure that the alarm-trip value reflects the alarm priority; it's a two-dimensional problem.
Review the alarm-messages seen by the operator. The alarm-message may be cryptic or even wrong.
Ensure there is adequate auditory and visual discrimination between alarm priorities.
Aim for ‘blank' alarm screens under normal operation. Suppress alarms from out-of-service plant, etc., to eliminate standing alarms.
Eliminate duplication of alarms between different systems.
Document the alarm system, and have a Management of Change system that allows flexibility whilst keeping good order.
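Checking the configured alarms against a target for the proportion of E:H:S priorities is simple arithmetic once the alarm list is available. The sketch below assumes the configuration can be listed as pairs of tag and priority. The target figures are those used in the case study later in this paper (10% Emergency, 20% High, 70% Standard); the sample list, the function names and the 5% tolerance are hypothetical, and your own policy may set different values.

```python
from collections import Counter

# Target split taken from the case study in this paper.
# Your own site policy may set different targets.
TARGET = {"E": 0.10, "H": 0.20, "S": 0.70}

# Hypothetical configured alarms: (tag, priority).
CONFIGURED = [
    ("PI310.HH", "E"), ("LI401.HH", "E"), ("TI205.HH", "E"),
    ("PI310.HI", "H"), ("LI401.HI", "H"), ("TI205.HI", "H"),
    ("FI101.LO", "S"), ("FI102.LO", "S"), ("AI500.HI", "S"), ("TI206.LO", "S"),
]


def priority_shares(configured):
    """Return the fraction of configured alarms at each priority."""
    counts = Counter(prio for _tag, prio in configured)
    total = sum(counts.values())
    return {p: (counts.get(p, 0) / total if total else 0.0) for p in TARGET}


def over_target(configured, tolerance=0.05):
    """List the E and H priorities whose share exceeds target plus tolerance."""
    shares = priority_shares(configured)
    return [p for p in ("E", "H") if shares[p] > TARGET[p] + tolerance]


if __name__ == "__main__":
    for prio, share in priority_shares(CONFIGURED).items():
        print(f"{prio}: {share:.0%} (target {TARGET[prio]:.0%})")
    print("Over target:", over_target(CONFIGURED))
```

Only the Emergency and High shares are flagged, because the usual failing is too many high-priority alarms rather than too few. The check says nothing about whether each individual alarm has the right priority; that still needs engineering judgement.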
Having reviewed the Alarm System, the team must document their proposed Alarm-management policy, and get the support of management and operators. Only then should the implementation phase begin.
For hazardous industries, many regulatory authorities now insist that plants have a documented Alarms Policy as part of their ‘permit to operate' requirements. Insurance companies may look more favourably on companies that have their alarms under control.
In the United States alone, it has been estimated that economic losses arising from poor alarm systems cost industry $10 billion/year. This is quite apart from alarm-related disasters that make TV or newspaper headlines.
A process upset that is identified and rectified in 10 minutes instead of 2 hours does not make headlines, but certainly improves process throughputs, yields, efficiency and environmental performance.
It is difficult to prove a null argument, but a well-designed alarm system is an indicator of good plant operation. Better operation will help you to capture your share of these ‘lost' billions of dollars. If the alarms are right, the chances are most other things are also right.
A near-disaster, and the subsequent Alarm Review.
In July 1987 an incident occurred at the Gulf refinery at Milford Haven, Wales. By good fortune there were no injuries and minimal damage, but staff recognised the potential seriousness of the incident, and held a major inquiry.
In summary, a minor electrical power spike (induced when a paraglider collided with a nearby power transmission line) caused several units to shut down automatically. The operators were busy for several hours sorting out the resultant operating problems, but did not realise that a train of distillation towers (which should have kept running) had also tripped to the ‘steam off' state. The towers gradually filled with liquid hydrocarbon and one eventually overfilled into a fuel-gas main, causing an explosion within a boiler. There were several other cases of overpressure, causing safety-relief valves to operate.
But why had the incident happened several hours after the initial power spike? Analysis showed that the operators were assaulted by hundreds of process alarms, mostly irrelevant at the time. Buried in the list were one or two important alarms which would have triggered a response had they been noticed. The operators were all experienced, but this incident had never occurred before, and at the time the alarms were not a help but a hindrance. Only next day, with a print-out of the alarms, could one see quite easily what had happened (the reboiler-steam valves had tripped closed at the outset).
The DCS had over 1000 alarms, supplemented by 250 hard-wired panel alarms. The DCS had no facility for alarm prioritising, and many alarms were duplicated between the DCS and panel alarms. The alarms were excellent under normal operation, improving quality control, efficiency and throughput, but they were hopeless under upset conditions. Indeed, six months earlier the operators had been asking if something could be done to improve matters! Some minor improvements had been made, but the major constraint was the limitation of the outdated DCS in use.
At the next opportunity, the DCS was upgraded to a version with the facility to prioritise alarms. The new system had more screens (in two tiers, with touch-screen facility), better graphics, and standard alarm and overview displays.
An alarm review team was tasked with developing an alarms policy for the upgraded DCS. Here are the major actions:
Approximately half the existing DCS alarms were eliminated. (Some key process parameters were moved to a standard, by-exception Overview display as silent ‘pre-alarms'.)
Many existing hard-annunciated alarms were eliminated or transferred to the DCS.
Duplication between the hard-annunciated alarms and DCS alarms was eliminated.
Alarm priorities were formally defined. The target was 10% Emergency-, 20% High- and 70% Standard-Priority. (Assigning priorities was the most difficult part of the project.)
Alarm priorities were colour-coded on alarm lists.
Alarms had auditory discrimination between priorities.
The standard-priority beep could be silenced for 20 minutes during a major incident, subject to the correct authority.
Emergency- and High-priority alarms were displayed on a standard, DCS Alarm Annunciator display.
Any residual Hard Annunciated alarms had to be E- or H-priority or be eliminated or moved to the DCS.
Two upper-tier screens were dedicated to the Alarm-Annunciator and Overview displays.
Status indicators (for example running-lights or switches) were segregated on a separate panel beside the DCS screens.
A Management-of-Change procedure was introduced.
The successful result of this major project was that operators felt they were in control of the process under both normal- and upset-conditions. The standard-priority and overview displays aided tight control under normal operation, but could be temporarily ignored under upset conditions until the E- and H-priority alarms had been understood and acted upon.
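Finding duplication between hard-annunciated and DCS alarms, as in the actions listed above, can begin with a simple comparison of the two alarm lists. The sketch below is illustrative only and is not the method used at Gulf. It assumes that the same alarm carries a recognisably similar tag on both systems, which is often not true in practice; the tag names and the function names are hypothetical.

```python
def normalise(tag):
    """Reduce a tag to a comparable key: upper case, letters, digits and dots only."""
    return "".join(ch for ch in tag.upper() if ch.isalnum() or ch == ".")


def duplicated_alarms(dcs_alarms, panel_alarms):
    """Return (dcs_tag, panel_tag) pairs whose normalised tags match."""
    dcs_by_key = {normalise(tag): tag for tag in dcs_alarms}
    pairs = []
    for panel_tag in panel_alarms:
        key = normalise(panel_tag)
        if key in dcs_by_key:
            pairs.append((dcs_by_key[key], panel_tag))
    return sorted(pairs)


# Hypothetical alarm lists for the two systems.
DCS_ALARMS = ["LI-401.HI", "TI205.HI", "PI310.HH"]
PANEL_ALARMS = ["li 401.hi", "XA-900"]

if __name__ == "__main__":
    print(duplicated_alarms(DCS_ALARMS, PANEL_ALARMS))
```

Where the two systems use unrelated naming, the match has to be made by hand from the instrument loop drawings. Either way, each duplicate found still needs a decision on which system should keep the alarm.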
Most alarm-management projects don't involve buying a new DCS, but the Gulf project does illustrate many of the principles, and shows how to make best use of the facilities provided by the DCS.
Today software houses offer excellent database products to help you to analyse and monitor your alarms, but there is no escaping the need for engineering judgement and common sense. Learn from Gulf's, Texaco's and Esso's bitter experience; get your alarms right now!
Report of the President's Commission on The Accident at Three Mile Island. J. Kemeny, 1979.
Organisations have no Memory, T.A.Kletz, Loss Prevention, Vol 13, 1980, p1.
Safety Experience with Computer Control in an Oil Refinery, E.Dilley, Computers and Safety in the Process Industries, Wintech, 1988
Better Alarm Handling. HSE Information Sheet www.hse.gov.uk
The Explosion and Fires at the Texaco Refinery, 24th July 1994. HSE Books 1997, ISBN 0 7176 1413 1
Alarm Systems, a guide to design, management and procurement. Publication 191, Engineering Equipment and Materials Users Association 1999, ISBN 0 8593 1076 0, www.eemua.co.uk
Lessons From Longford, The Esso Gas Plant Explosion, Andrew Hopkins, CCH Australia Ltd. 2002, ISBN 1 86468 422 4
IEC 61511-1 (2003-01) Functional Safety – Safety instrumented systems for the process industry, definitions, system, hardware and software requirements. www.iec.ch
Computer Control, Safe Practice, 1999. Course sponsored by IChemE and HSE, directed by Cris Whetton, ility Engineering, www.saunalahti.fi/ility
About the author.
Edward Dilley is a chartered Chemical Engineer with over 30 years' experience in the oil, petrochemical, polymer and sugar industries. He has specialised in Process Control, and was responsible for developing alarms policy following the Gulf ‘Paraglider' incident. He currently lives in Peterborough, UK. firstname.lastname@example.org
Tel +44 (0)1733 239595.