A Field Study of Operator Cognitive Monitoring at Pickering Nuclear Generating Station, KJ Vicente, CM Burns
Field Study, interface design, strategies, understanding, field operators, Kim J. Vicente, alarm system, Pickering Nuclear Generating Station, procedures, sources of information, University of Toronto, control room, Department of Industrial Engineering, operator performance, technical report, process control systems, systems integration, Toshiba Nuclear Energy Laboratory, performance, nuclear power plant, Natural Sciences and Engineering Research Council, information technology, Professor Vicente, field operator, Engineering Laboratory, instruments, Cognitive Engineering Laboratory, plant parameters, cognitive engineering, Japan Atomic Energy Research Institute, critical variables
Cognitive Engineering Laboratory A Field Study of Operator Cognitive Monitoring at Pickering Nuclear Generating Station - B Kim J. Vicente and Catherine M. Burns CEL 95-04 Cognitive Engineering Laboratory University of Toronto Department of Industrial Engineering 4 Taddle Creek Rd. Toronto, Ontario, Canada
M5S1A4 phone: (416) 978-0881 email: [email protected]
fax: (416) 978-3453
Cognitive Engineering Laboratory Director: Kim J. Vicente, B.A.Sc., M.S., Ph.D. The Cognitive Engineering Laboratory (CEL) at the University of Toronto (U of T) is located in the Department of Industrial Engineering, and is one of three laboratories that comprise the U of T Human Factors Research Group. The CEL began in 1992 and is primarily concerned with conducting basic and applied research on how to introduce information technology into complex work environments, with a particular emphasis on power plant control rooms. Professor Vicente's areas of expertise include advanced interface design principles, the study of expertise, and cognitive work analysis. Thus, the general mission of the CEL is to conduct principled investigations of the impact of information technology on human work so as to develop research findings that are both relevant and useful to industries in which such issues arise.
Current CEL Research Topics The CEL has been funded by Atomic Energy of Canada, Ltd., Natural Sciences and Engineering Research Council, Defense and Civil Institute for Environmental Medicine, Japan Atomic Energy Research Institute, and Asea Brown Boveri Corporate Research - Heidelberg. The CEL also has collaborations and close contacts with the Westinghouse Electric Company, and Toshiba Nuclear Energy Laboratory. Current projects include:
· Studying the interaction between interface design and skill acquisition in process control systems.
· Understanding control strategy differences between people of various levels of expertise within the context of process control systems.
· Developing a better understanding of the design process so that human factors guidance can be presented in a way that will be effectively used by designers.
· Evaluating existing human factors handbooks.
· Developing advanced interfaces for computer-based anesthesiology equipment.
CEL Technical Reports For more information about CEL, CEL technical reports, or graduate school at the University of Toronto, please contact the address printed on the front of this technical report, or send email to Dr. Kim J. Vicente at .
ABSTRACT Most studies in the cognitive engineering literature have examined the behaviour and strategies of nuclear power plant operators under emergency conditions. However, very few studies have investigated what operators do during normal situations. This report describes a field study that was conducted to understand how operators monitor plant status under normal operating conditions. Several operators at Pickering Nuclear Generating Station - B were observed and interviewed on shift, over the course of one week. The results indicate that monitoring is a very demanding activity. Operators have adopted a number of creative strategies to make monitoring easier. In some cases, these strategies consist of adaptations to overcome design limitations associated with the control room interface. In other cases, these strategies serve to simplify the complex demands defined by the monitoring task itself. A preliminary framework summarizing the results of the study is proposed, and a number of implications for systems integration, training, and interface design are drawn.
INTRODUCTION Background The performance of nuclear generating station operators has long been recognized as a significant contributor to plant performance and safety. Historically, studies of operator performance have been placed in the context of emergency or abnormal operations where the safety implications of operator actions are immediate. Little attention has been given to operator performance during normal operations where tasks are more routine. However, a critical role for operators during normal operations is detecting the early signs of a problem. In order to prevent a transition to an abnormal plant state, operators need to determine that plant state indications are moving out of their normal ranges and then take appropriate actions to prevent further degradation. This task is made more difficult by the enormous complexity of the control room interface. We have been asked to develop a deeper understanding of operator monitoring during normal operations. More specifically, we have been asked to investigate the contributions of operator cognitive skills to monitoring performance. Each control room has in place procedures and practices to aid operator monitoring (e.g., operating logs, shift turnover). However, in addition to these performance aids, each operator brings to the job a thorough understanding of the plant and significant operating experience. We believe that monitoring is influenced as much or more by these cognitive resources as it is by the institutional practices that have been developed. The ultimate objective of this AECB project is to develop a model of operator cognitive activity during monitoring. Using this model, we hope to provide an account of both the cognitive activities that direct effective monitoring as well as the monitoring behaviors that are guided by institutional practices. In addition, we hope to gain some insights into the individual differences
PNGS-B Field Study
among operators that are related to knowledge and cognitive skills. This understanding should have implications for control room design, operator training, and the development of control room practices and policies. The initial project has the following objectives: 1. Review and document the state of knowledge on cognitive functioning of control room operators monitoring plant conditions. 2. Observe, document, and analyze current operator monitoring practices. 3. Review and analyze recent CANDU events that have involved failures of monitoring. This report addresses the second project objective; it describes the findings obtained from a field study conducted at Pickering Nuclear Generating Station (PNGS-B). In this study, we observed and interviewed Authorized Nuclear Operators (ANOs) to identify established monitoring practices and to uncover undocumented techniques and strategies used to improve monitoring effectiveness during normal operations. Method The field study was conducted by Kim J. Vicente and Catherine M. Burns. Each conducted independent observations in the PNGS-B control room (CR). Vicente's observations spanned from February 13 to 16. He observed 3 different ANOs on 5 different back shifts, for a total of 37 hours of observation. Burns' observations spanned from February 14 to 18. She observed 4 different ANOs on 5 different day shifts. The staff at PNGS-B was extremely cooperative, allowing the observers to watch closely all of the ANOs' work activities. In addition, when time permitted, the observers were allowed to ask questions of the ANOs, either to explain previous ANO actions or to develop further the observers' understanding of the ANOs' jobs. In many cases, the ANOs actively volunteered information that they thought was germane to the study. These contributions were particularly valuable and led to important insights. Both observers took notes in the CR, but did not
communicate with each other during the period of observation. Afterwards, the observers wrote independent summaries of their findings. At that point, the summaries were compared and differences were resolved. This report describes the collective findings of both observers. Outline The report is divided into five sections. The first describes the various sources of information that are available to ANOs for the purpose of monitoring. These information sources serve as the inputs to monitoring. The second section describes the task demands associated with monitoring and, in particular, tries to address the question: What makes monitoring difficult? The third section describes the effective strategies that ANOs have developed to facilitate effective monitoring. The fourth section discusses the key findings that were obtained from the field study. These conclusions form the basis of a preliminary model describing how ANOs deal with the task of cognitive monitoring. Finally, the fifth section describes the implications of this study for systems integration, training, and interface design. SOURCES OF INFORMATION FOR MONITORING Eight different sources of information that could be used for monitoring the status of the plant under normal operating conditions were identified. These are: 1. Shift turnover 2. Log 3. Testing 4. Alarm Screens 5. Control Room Panels 6. Field Operators 7. Field Tour 8. Check Forms Each of these information sources will be described in turn.
Shift Turnover ANOs arrive in the CR approximately 15 to 30 minutes before their shift is scheduled to begin. At this time, they conduct a shift turnover with the ANO that they are relieving. The turnover consists of several activities. Perhaps most importantly, there is a verbal discussion between ANOs where the following points are discussed: the state of key variables, any unusual alarms, jobs completed and jobs outstanding, plans that are active, variables that need to be monitored more closely than normal, which field operators or technicians are working on which components, what the field operators are aware of, any significant operating memos, and a review of the log (see below). After these discussions, the ANO starting his shift will also look at the call up sheets to see what tests are scheduled for that shift, and the daily Work Plan that documents upcoming maintenance and call ups. They will also review the computer summaries and alarms. At this point, they will try to explain every alarm until they are satisfied that they understand why these alarms are in. ANOs are also required to execute a formal panel check procedure which involves following a check sheet that requires making checks of specific values on the CR panel to determine whether they are in an acceptable state. Some operators were also observed to conduct an informal panel walk through before beginning the official turnover. They would walk by the panels and quickly scan them to get a general feel for the status of the unit. Finally, ANOs who were not intimately familiar with a unit would also review the long-term status binder that documents the "quirks" of that unit. In summary, the shift turnover provides a great deal of information to the ANO at the beginning of his shift that gives him a very good sense of what state the unit is in and what needs to be done to it in the short-term. 
Log The log is a hand-written, chronological record of significant activities (not necessarily abnormal) that have occurred during a shift (e.g., tests completed). This is a short-term record of the history of a unit, as opposed to the longer term events logged in the long-term status binder.
The log is reviewed during the shift turnover but it can also be consulted during a shift to remind the ANO of what had been done on the previous or even earlier shifts. The log thereby provides a means by which ANOs can be aware of the recent status of a unit (e.g., what components are not working, which meters are not working, what is currently being repaired, etc.). This provides a valuable context for monitoring and interpreting information on a shift. Testing Usually, a number of equipment tests are scheduled on every shift. The purpose of these tests is to ensure that backup systems and safety systems are in an acceptable state, should they be required. These tests provide ANOs with a means by which they can monitor the status of these systems (e.g., which safety systems are working properly, how quickly they are responding, which meters are working, etc.). Alarm Screens The two CRT screens used to display alarms are a very salient and frequently used source of information for monitoring. Because the entry and exit of an alarm are accompanied by auditory signals, the alarm screen frequently captures the ANO's attention. The response to alarms can vary widely. Some alarms can be explained by deficiency reports. Others are caused by maintenance activities. Some are ignored because they are not significant (i.e., they do not impact immediate operating goals), or there is nothing ANOs can do about them (i.e., they cannot be cleared until the unit is shut down). Finally, some alarms motivate a search for more information, typically by sending field operators or maintenance personnel (e.g., control technicians) to conduct additional observations or to perform tests in the field. It was clear from our observations that the alarm screens play a very important role in monitoring plant status. Control Room Panels Obviously, the CR panels (including the alarm windows) are an important source of information as are the computer displays that are available for monitoring. 
The following
displays were found to be monitored extensively on a regular basis by virtually all of the ANOs observed: the dedicated ROPs display, bar chart 11, and the reactor regulating status display.
Field Operators There are some variables and components that cannot be monitored from the CR. ANOs rely on field operators to monitor these variables and components. Field operators can also detect problems out in the field and then bring them to the attention of an ANO. Finally, after an alarm, ANOs also sometimes rely on field operators to make more observations to help diagnose the problem. ANOs communicate with field operators by phone when the latter are in the field, or in person when they are in the CR. Field Tour Periodically, ANOs will take some time during their break to take a walk through their unit out in the field. This enables them to maintain a "process feel" by directly perceiving plant components (e.g., turbine, hydrogen panel, oil purifier panel, boiler feed pumps). Check Forms Check forms are used periodically by field operators to document the status of certain variables. To fill out these forms, field operators must go out into the field to see the information and the component. These forms are subsequently reviewed by the ANO, thereby enabling him to indirectly monitor parameters out in the field. TASK DEMANDS: WHAT MAKES MONITORING DIFFICULT? Before we discuss the strategies exhibited by ANOs, it is important to discuss the task demands associated with cognitive monitoring. What makes monitoring difficult? There are at least five factors which contribute to task difficulty: 1. System complexity and reliability 2. Design of the alarm system
3. Design of displays and controls 4. Design of procedures 5. Design of automation Each of these will now be discussed in turn. System Complexity and Reliability Each unit consists of thousands of components and instruments. Even though the reliability of each individual component or sensor may be high, when there are so many of them, equipment failures are bound to occur on a regular basis. Furthermore, some of these failures can only be effectively repaired when a unit is shut down. Failures of this type that are not essential to the safe and efficient operation of the unit may therefore persist for a long time until there is an opportunity for repair. For all of these reasons, there are always components, instruments, or subsystems that are missing, broken, working imperfectly, or being worked on. Despite this, the unit can still function safely. Nevertheless, small failures or imperfections have very important implications for cognitive monitoring. More specifically, they change the way in which information should be interpreted. That is, whether a reading or set of readings is normal or abnormal depends very strongly on which components are broken, being repaired, or working imperfectly. The same set of readings can be perfectly acceptable in one context and safety threatening in another. Thus, the operational status of the unit's components provides a background, or context, for monitoring. Consequently, effective monitoring depends very heavily on an accurate and comprehensive understanding of the current status of plant components and instrumentation. This understanding can then be used to derive expectations about what is normal/abnormal, given the current state of the unit. These expectations then serve as referents for cognitive monitoring. There are two additional features which make this a complex task. First, the context is very rich. There are always several components or instruments that are not in perfect working order, so there is much to remember. 
Furthermore, there are many interactions between components,
subsystems, and instrumentation so that it is not a trivial task to derive the full implications of the current failures to determine what state the unit should be in, given the present context. Second, this context is constantly changing. Some components get repaired, some do not; which components are being maintained can change on a daily basis; new failures can appear, and old problems sometimes go away. It is very important that the ANOs be able to effectively track all these changes so that their assessment of context is up to date. This will allow them to derive accurate expectations about the signals they should be seeing from the CR panels and alarms. In summary, because of the enormous number of components, there are always some components that are not working properly. ANOs must track the status of these imperfections so that they can continually adjust their expectations about what behaviour is normal/abnormal. These expectations serve as the referents for detecting abnormalities during monitoring. Design of Alarm System ANOs rely extensively on the alarm system, especially the two alarm screens at the front of the CR. As one ANO put it, "no news is good news". If one were to exaggerate, one could interpret this quotation as indicating that ANOs consider the plant to be in a safe state unless an alarm indicating otherwise appears on the screen. Clearly, this puts a great burden on the alarm system. However, since PNGS-B was designed so long ago, it is not surprising that the alarm system is very primitive by today's standards. Many of the deficiencies in the design of the alarm system arise from the fact that the alarm setpoints are not context-sensitive. As a result, nuisance alarms of various types abound. For example, some alarms are always on because the plant is not currently operated the way it was originally intended to be. Others appear because a certain component is being repaired, maintained, or not working perfectly. 
Nuisance alarms also appear because of a lack of filtering. For example, multiple alarms can appear for the same event thereby making interpretation more difficult (low priority alarms only get automatically blocked during an upset). Also, alarm messages and status messages are confounded by being presented on the same monitors.
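As a rough illustration of the sorting that this design leaves to the ANO, the following minimal Python sketch separates alarms that are explained by known context (active maintenance, an existing deficiency report) from those that are not. It is a sketch under stated assumptions only: the `Alarm` record, the `triage` function, and all tags and component names are hypothetical and do not describe the PNGS-B alarm system or any real software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    tag: str        # hypothetical alarm identifier, not a real PNGS-B tag
    component: str  # hypothetical component the alarm is attached to


def triage(alarms, under_maintenance, deficiency_reports):
    """Separate alarms explained by known context from those that are not.

    under_maintenance: set of component names with active work (assumed input).
    deficiency_reports: set of alarm tags covered by a deficiency report
    (assumed input).
    """
    explained = []
    unexplained = []
    for alarm in alarms:
        if alarm.component in under_maintenance:
            explained.append((alarm.tag, "maintenance"))
        elif alarm.tag in deficiency_reports:
            explained.append((alarm.tag, "deficiency report"))
        else:
            unexplained.append(alarm.tag)
    return explained, unexplained


if __name__ == "__main__":
    # Illustrative data only.
    active = [
        Alarm("PUMP-A LOW FLOW", "pump A"),
        Alarm("TANK-3 LEVEL HI", "tank 3"),
        Alarm("VALVE-7 POSITION", "valve 7"),
    ]
    explained, unexplained = triage(
        active,
        under_maintenance={"pump A"},
        deficiency_reports={"TANK-3 LEVEL HI"},
    )
    print("explained:", explained)
    print("needs attention:", unexplained)
```

The outcome depends entirely on the two context sets, which a system with fixed setpoints does not hold. In the control room described here, that context is carried by the ANO.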
Perhaps the most frustrating source of nuisance alarms arises from the interaction between sensor variability and rigid alarm limits. Recall that alarms emit one auditory signal when they enter and another when they exit. If a particular variable is rapidly cycling above and below the alarm setpoint, a continuous stream of auditory signals will be generated. Clearly, this defeats the entire purpose of the alarm system. Some meters have adjustable alarm setpoints, thereby allowing operators to temporarily change the alarm setpoint so that the parameter does not continually bounce in and out of the alarm region. However, other meters do not have adjustable alarm setpoints and therefore do not permit this strategy. The result is an annoying, distracting, uninformative stream of noise. For all of these reasons, the vast majority of the alarms that come in on the screens are irrelevant and in fact do not require attention. This greatly reduces the informativeness of the alarm system, and puts a great burden on ANOs to distinguish the infrequent relevant alarms requiring actions from those that do not. This increases the difficulty of cognitive monitoring. Design of Displays & Controls It should not be surprising to learn that the design of some of the displays and controls is also primitive by contemporary standards. What follows is a list of the problems that were observed. a) One class of analogue meters is motor-driven. As a result, when these meters fail, the needle stays in the same position that it was in; in other words, the meters fail as is, making detection of the failure difficult. b) Some meters have an LED to indicate that the variable is in an alarm state. These LEDs are very difficult to replace, so ANOs usually do not change the burnt out LEDs. They thereby lose the information provided by the LED. Instead, they check whether the meter pointer lies between the indicated setpoints to determine whether the variable is in the normal range. 
c) Some panels have a light at the top that is supposed to indicate if a handswitch on that panel is off-normal. Unfortunately, the light comes on even if a component is going through
maintenance. Because components frequently have work permits on them, the light at the top of the panel is frequently on, even when there is no problem that requires immediate attention. Once it is on, the light is useless because it does not change its state when a handswitch that is not being maintained is accidentally put in an off-normal position. As a result, the light loses its informative value and ANOs just ignore it. d) Unlike the alarm windows, ANOs cannot test the light bulbs on the panel to see which ones are burnt out. Because there are so many bulbs, it is not uncommon for them to burn out. This creates misleading feedback, thereby making it difficult for ANOs to determine whether an observed anomaly is being caused by a burnt out light or by an actual problem in the unit. e) Some instruments are designed so that their failed (irrational) value is the same as the low value on the scale. As a result, it is difficult to distinguish a failed sensor signal from a veridical one. f) Some computer displays (e.g., bar chart 11) do not show upper or lower referents for determining if the current values of the displayed parameters are normal or abnormal. Consequently, these displays require experience, knowledge, and memory to interpret. g) Different chart recorders require paper with different numerical scales. However, regardless of the scale, the rolls of paper are all of the same size. Sometimes, chart paper with the required scale is not available, so an ANO may replace an empty chart with paper belonging to another chart with a different scale. This substitution does not hamper monitoring for trend information since the line drawn by the pens on the chart will be the same, regardless of what the scale of the paper is. However, if one were to rely on the scale to read the precise value of that parameter, one would obtain an incorrect value. 
This problem can affect experienced ANOs more than newer ANOs because the former, over time, recognize the height that the parameter should be at (e.g., 2/3 of the way up) but sometimes forget the exact numerical value. As a result, they may not realize that the reading is incorrect when chart paper with the wrong scale is installed.
h) Some banks of meters and controls do not provide an emergent feature for scanning. That is, the needles or control handles do not all line up when they are all in the normal state. As a result, the status of these instruments and controls cannot be easily and quickly monitored at a glance, or at a distance. Instead, ANOs must monitor them serially and effortfully, having to recall what the normal position or value of each is and then determine whether the control or instrument is in that state or not. i) A ROP computer has been retrofitted into the CR to help ANOs monitor the SDS2 (shut down system 2) temperatures to see how close they are to the trip points. This information is also available on analogue meters on the CR panel, but the ROP computer presents the information in a way that facilitates monitoring. Unfortunately, the SDS2 trip points are not visible on the meters in the CR (unlike SDS1), and therefore have to be set by maintenance personnel elsewhere outside of the CR. However, the ROP computer is a stand alone device; it has not been integrated with the control systems that actually regulate the behavior of SDS2. So the ROP computer serves as a display but it does not control anything. Furthermore, ANOs have to input the trip points by hand into the ROP computer. Because of the lack of integration, it is possible to have a discrepancy between the ROP computer display which the ANOs use to monitor the SDS2 trip points, and the actual trip points that control SDS2. If someone forgets to update the ROP computer setpoints or makes a mistake in doing so, then the trip points displayed in the CR on the ROP computer are wrong. Yet, this is the information source that ANOs use for monitoring. Design of Procedures As mentioned earlier, testing is done on a regular basis to help ANOs monitor the status of backup systems and safety systems. Procedures have been written to guide the execution of these tests. 
Unfortunately, the people who have written the test procedures apparently do not have a detailed knowledge of the layout of the panels, and therefore, of what it takes to execute the procedures. Consequently, tests sometimes require very long reaches, or having to look at
several physically distant places simultaneously. To follow such procedures, one would have to perform the same steps several times. This means that, as designed, the testing procedures are sometimes very inefficient. Another feature of these procedures is that they tend to be written as a "cookbook". That is, they are very detailed and therefore quite intricate. Because of their complexity, it is very difficult for one to discern the intent of the procedure, the logic behind the steps, and which parts of the procedure are really critical and which are of secondary or incidental importance. This complicates monitoring and error detection during the execution of the tests. This is especially undesirable considering that the procedure pages sometimes come stapled in the wrong order! In summary, like the design of the alarm system and the design of some displays and controls, the design of some testing procedures inadvertently increases the demands on ANOs associated with cognitive monitoring during normal operations. Design of Automation There are two types of automated systems, analogue and digital. Analogue automated systems are governed by individual controllers, whose status is displayed by an analogue meter on the control room panels. Digital automated systems are governed by the computer, and their status is displayed on computer displays that can be brought up on a CRT. The demands associated with monitoring each of these will be discussed in turn. Analogue automation. The status of each automatic controller is represented by a linear, vertical analogue meter (see Figure 1). There is a green band indicating the setpoint region for the controller, and a red pointer indicating the current status. If the red pointer is in the green area, then the goal is being satisfied. Although it may seem like a relatively straightforward task to monitor such controllers, there are several reasons why the monitoring task is more complex than it seems. 
First, some controllers are backups and therefore are not controlling. For these controllers, the red pointer will not be in the green area, even if everything is normal. Therefore, ANOs must know which controllers are supposed to be in bounds and which are not. To do this,
ANOs need to be aware of what is being done to the plant, what is down, and what should be active, given the current state. As mentioned earlier, this is quite a bit of information to keep track of. Second, it is important to distinguish between the actions of an automated system (e.g., trimming valves) and the effects that those actions have on plant parameters (e.g., increased level). From our observations, it seems that there is no direct way of monitoring the former because that information is not displayed. Thus, ANOs seem to focus on monitoring the effects of the automation on plant parameters instead. This is an important distinction because, if a controller is successfully compensating for a fault (e.g., a leak), no visible signal will be observed in the controller meter because the parameter will remain in the goal area. Thus, it is possible that the effects of a fault could be masked. In some cases, indirect cues can be used to detect a problem (e.g., decrease in storage tank level). However, this will not be possible in all situations (e.g., when makeup water comes from the lake). Eventually, the leak would trigger an alarm when the water drops on a beetle (an electrical device designed to sense water) on the floor, but this process may take a considerable amount of time. Also, the alarm may occur at a location in the plant that is downstream from the origin of the leak. Therefore, detection via a beetle alarm, or even via straying outside of the controller goal region, makes it very difficult to understand the cause of the fault and therefore perform an accurate diagnosis. Digital automation. In some ways, monitoring of the digital automation is similar to that of the analogue automation. The focus is again on monitoring the effects of the automation on plant parameters. Also, alarms go off when the error signal exceeds the threshold but this alone makes it very difficult to perform an effective diagnosis. 
In addition, however, there are CRT screens available to summarize the current status of the various control loops. The problem is that there are more control loops than there are CRTs. Consequently, ANOs can only monitor the effects of the most important loops (e.g., reactor regulating status). This makes it difficult to monitor, and stay in touch with, the status of all of the digital control loops. Conclusions
It should be obvious by this point that monitoring is an extremely complex activity. For one, the context that one needs to be aware of to correctly interpret instruments is complex and constantly changing. Monitoring is also complicated by the suboptimal design of the alarm system, some displays and controls, some procedures, and the feedback provided by the automated systems. As a result, there is no simple answer to the question: How can one tell if the unit is in the desired state? After examining the demands associated with cognitive monitoring, one might think that the task is humanly impossible, or at least, highly error-prone. In fact, this is not the case. As the next section will show, ANOs have developed a number of ingenious strategies to facilitate the task of monitoring to the point where they can do it efficiently and effectively. GOOD THINGS THAT ANOS DO TO FACILITATE MONITORING There are a variety of effective strategies that ANOs have developed to cope with the demands associated with cognitive monitoring. Interestingly, these strategies go well beyond the official operating procedures and the formal training that the ANOs receive. The strategies are informal in the sense that they have been spontaneously adopted by ANOs and passed along by word of mouth. The strategies fall roughly into three categories, according to the purpose that they serve: creating information, maximizing the information extracted from available data, and maintaining a safety culture. Strategies that Create Information Shift turnover. There are several strategies that some ANOs rely on during shift turnover to create information to support cognitive monitoring, including: a) clearing the printer of all alarms generated on the previous shift. This way, the ANO can be sure that any alarms that appear in the printer happened on his own shift, thereby facilitating the search for, and organization of, information. 
b) making hard copies of everything from the computer (more than is required by formal procedures) at the start of a shift. This has the advantage of producing a hard copy referent of the
history of the status of the unit, so that if unexpected events occur later in the shift, the ANO can see (rather than remember) if there was a change from the previous state, and if so, in what way. This historical record thereby provides a valuable source of context for interpreting subsequent information.

c) reading the log summaries from the past few days to get continuity. This provides the ANO with the recent history of the unit, and thereby a context for establishing the relevance of new information.

d) writing down parameter values during the panel check procedure. The procedure merely requires ANOs to check off whether the parameter is within the normal bounds. However, by writing down the exact value, ANOs can determine how close the parameter is to its boundary limit. If it is close, they will monitor that parameter more closely. This strategy serves the general purpose of anticipating problems before they occur.

Alarms. ANOs have also adopted a number of informal strategies to create more information than the alarm system was originally designed to provide, including:

a) if an alarm comes in, and there is a reason to suspect that the alarmed parameter may increase (or decrease) even further, the ANO may write down the value to periodically monitor for change. The main reason for doing this is that another alarm message will not come in if the parameter continues to increase. Thus, writing down the parameter and its current value offloads memory and provides a cue for continued monitoring.

b) cursoring alarms (i.e., deleting the alarm message off the screen before the alarm actually clears, but not disabling it), if they are considered to be unimportant. This strategy keeps the alarm screen clean and simple, so as to make change stand out. If there are too many alarms on the screen, there is too much to look at, making alarms almost meaningless.
c) changing the setpoints of an alarm after an alarm comes in (this is only possible with the adjustable analogue meters). Because an alarm will not come in again once the value exceeds the setpoint, ANOs will increase the setpoint after the initial alarm to get a "second chance". That
way, if the parameter continues to increase, the alarm will come in again. Detecting this subsequent auditory signal is much easier than continually having to check on the parameter to see if it has increased even higher.

d) jumpering out (i.e., disabling) nuisance program alarms in software. The setpoints of program alarms cannot be changed. Therefore, if a parameter is bouncing in and out of tolerance, a continuous stream of alarms will be generated. This defeats the purpose of the alarm and causes the ANO to ignore the nuisance alarm (and understandably, get extremely frustrated). When this continues for an extended period, the ANO may disable that program alarm, and document this on a Post-it that is kept on the side of the CRT. If the alarm is still jumpered at the end of the shift, the ANO will either reconnect it, or tell the new ANO about it. This strategy, like others described above, serves to increase the informativeness of the alarm system.

Shutdown. Shutdown is a very complex process. Thus, ANOs take several steps to keep the complexity of task demands to a manageable level. Here are some examples:

a) ANOs will put Post-its on the CR panel to flag unusual indications. Usually, ANOs would react right away to such indications. Thus, the Post-it serves as a visual reminder to not react as usual to the observed signals.

b) ANOs will do a detailed walk over of the entire panel during turnover. Because of the complexity of the situation, they increase the degree of coordination even though it is not required by the official procedures.

Procedures. For complicated testing procedures, or valving in the field that needs to be monitored, ANOs will use a flowsheet in conjunction with the displays. This external aid offloads their memory, compensates for poor panel design (lack of topological connections), and anticipates potential problems.

Displays. ANOs have also adopted strategies that create new information from the displays on the CR panel.
Two fascinating examples stand out:
a) ANOs will sometimes manipulate alarm setpoints in ways that were not intended by designers to compensate for the lack of direct information needed for a particular problem. Two classes of such behavior were noted. First, ANOs will sometimes change the alarm setpoints on a particular meter to a value that is not the usual setpoint. Instead, the temporary setpoint is a value at which they need to perform a certain action. ANOs do this so that an auditory alarm will occur when it is time to perform that action. Otherwise, they would have to remember to periodically check the meter until its value reached the point at which the action needed to be performed. Manipulating the adjustable alarm setpoint obviates the need for such periodic monitoring. Second, ANOs also manipulate alarm setpoints on some variables to compensate for the lack of alarms on others. They do this by changing the setpoint on a variable that is correlated with the one that they actually want to monitor. Thus, the setpoint on the variable with the alarm will be set at a value that will indicate that the variable of interest (the one without an alarm) has reached an undesirable state. This creates an auditory signal. Without this manipulation, no signal would be given since the first parameter was not instrumented with an alarm. By creating this new information, ANOs create an early warning of trouble where none would be available.

b) ANOs will leave the door open on a particular strip chart recorder to make it stand out from others, when it is important to monitor that parameter more closely than usual (e.g., open doors on feedtrain tank levels while blowing down boilers). When several parameters need to be monitored, several strip chart doors will be open but the chart that is the most critical to monitor will be pulled out to distinguish it from the others. This very simple trick has an enormous information value. First, it provides an external memory cue for monitoring.
ANOs can easily pick out the parameter to be monitored. Furthermore, if they forget and then see an open door, they will immediately remember that there must have been a reason why they needed to monitor that particular parameter more closely. Second, this practice also serves as a cue for others. For
instance, when the shift supervisor does his rounds, he will look at strip charts whose doors are open. Otherwise, he will just pass by them.

Summary. In summary, ANOs have developed a set of ingenious tricks that create information and thereby aid monitoring. Some of these tricks are intended to reduce the psychological burden associated with monitoring demands (e.g., opening the door on a strip chart recorder that needs to be monitored closely), whereas others compensate for design deficiencies (e.g., jumpering out program alarms). Interestingly, all of these tricks can be interpreted as efforts to "finish the interface design". That is, ANOs create a set of practices that serve the purpose that the original interface should, but in fact does not, fulfill. These adaptations can therefore be functionally viewed as an extension and elaboration of the designers' original interface design. As such, they greatly simplify the monitoring task.

Strategies that Maximize Information Extraction from Available Data

ANOs have also developed a set of strategies that serve to maximize the amount of information they extract from the data that are available to them. The following are examples of this class of activities:

a) when an anomalous indication is observed, ANOs will avoid reacting right away. As one experienced ANO put it, "don't jump in with both feet first". Instead, they will consult redundant panel information, and/or talk to field operators to establish whether the observed anomaly is being caused by a faulty instrument or by an actual change in plant status. This independent confirmation allows ANOs to keep from reacting to false indications that could lead to problems or trips. Because of the hardware reliability issue discussed earlier, many anomalous indications turn out to be caused by instrumentation problems. This strategy is a way of coping with this fact.
To conduct this independent confirmation, the panels have to be designed so that redundant functional information is available to ANOs. Some panels provide this type of information, but others apparently do not.
In addition to conducting an independent confirmation, ANOs can also take other courses of action in response to an anomaly. Sometimes, ANOs will wait for a while before they decide what to do. Sometimes, the problem goes away after a few minutes so being patient has its advantages. In other cases, ANOs may not have any free field operators to check up on a problem so they may decide that they can live with it for now until someone becomes available. Similarly, in some situations, there is nothing that ANOs can do about a specific problem, and since it is not severe, they just operate the plant anyway.

b) ANOs also know what jobs/tests have had problems in the past either by experience or by looking in the long-term status binder. Based on this knowledge, they are prepared to do more careful proactive monitoring of specific variables that can reveal problems before they occur.

c) during the execution of testing procedures, ANOs always try to understand the intent of the test, not just follow the procedures in a rote fashion. As a result, they will proactively monitor certain variables to confirm that the test is going as planned. This knowledge-driven monitoring strategy serves several purposes. First, it generates information that can be used to detect errors as soon as they occur. Second, it can also help compensate for the limitations in the procedures. For example, if the procedure pages come stapled in the wrong order (as they sometimes do), one may not even notice this if one were to execute the steps blindly without reflection. Also, sometimes the ANOs' knowledge of the intentions behind the procedures allows them to adopt procedural deviations that allow them to do the test much more efficiently. This strategy overcomes some of the problems that arise from the fact that the procedure designer does not have an awareness of the panel layout (see discussion above).
d) some ANOs will conduct an informal panel walk around, after they complete the formal panel check procedure, to check on the status of critical variables that can change quickly.

e) similarly to point b), ANOs also know that certain types of plant changes (e.g., raising power, refueling) are more likely to cause problems, so they proactively monitor certain parameters more
closely (e.g., boiler levels, storage tank level) during those times. Again, this allows them to anticipate problems and catch them at a very early stage.

f) to properly interpret instruments, it is sometimes very important to know how they fail. This knowledge can take two forms, one being knowledge of the internal structure of the instruments, and the other being where or when they have failed before in the past. ANOs use both of these types of knowledge in monitoring instruments and in interpreting anomalies.

g) ANOs will also exploit what one might consider "informal" sources of information for monitoring purposes. For example, the motor-driven meters mentioned earlier make a noise when they move. Therefore, when there is an upset, many of them change simultaneously, thereby providing salient auditory feedback to the ANO that something severe has happened (in addition to whatever alarms might come on). A second example is that sometimes the flicker of the lights in the CR is a precursor to problems with the power supply. Experienced ANOs know this and will therefore monitor certain parameters more closely if they notice such a flicker. Finally, when certain components fail open, a low rumbling noise can apparently be heard in the CR. Again, experienced ANOs use this "noise" as information that helps them interpret the state of the unit. Thus, the panel is not the only source of information for monitoring.

h) ANOs will also take full advantage of direct perception of components in the field. Instruments can lie (i.e., fail) but sounds and vibrations do not. The reliability of direct perception is clearly shown in the following quotes obtained from an ANO: "numbers and instruments are great but if you really want to understand what's going on, you have to go out there and see it and feel it"; "most of the time, your instruments give you a very good idea of what state the plant is in. But numbers by themselves don't mean anything.
If I tell you that turbine vibration is 200 mm, you know that's not good (i.e., beyond the normal limit) but it's not the same as taking you out there and showing you the turbine rails moving violently back and forth".
The power of direct perception is also illustrated by the following incident. One time, field operators had to go out in the field to find a heavy water leak of 50 kg/hr (the shutdown limit). Before they went out, they and the ANO went to the sink with a cup of known volume and a watch, and adjusted the tap flowrate until they created a flowrate of 50 kg/hr. As the ANO observed, "you might think that 50 kg/hr sounds big, that it might be a gusher, but it's not. It was just a trickle". By performing this little experiment, operators acquired a perceptual (rather than symbolic) referent that they could then effectively use for monitoring in the field.

i) ANOs can use the alarm system to evaluate the trustworthiness and thoroughness of field operators. When field operators do certain jobs out in the field, they create alarm messages in the CR that are triggered by access detectors and the status of certain buttons. In this way, ANOs can determine whether field operators are doing their job. However, field operators know this so they may just go through the motions, thinking that they are fooling the ANO. Interestingly, the timing of these alarms also provides information. Because ANOs work their way up through the ranks, they are familiar with how long it takes to do certain jobs in the field. One situation was observed where 3 or 4 such alarms came in within the span of about 15 to 20 seconds. The ANO immediately recognized that the field operator was merely going through the motions because he knew that it takes at least 2 to 3 minutes to thoroughly conduct each of these checks.

Summary. ANOs have developed a number of ingenious strategies to try to extract as much information as they can from the data sources available to them. Some of these involve knowledge-driven monitoring to detect errors or problems as soon as they occur. Other strategies are directed at fully exploiting informal sources of information, such as direct perception in the field.
Finally, some strategies use knowledge that ANOs have gained from experience to proactively monitor parameters that are likely to reveal problems during situations which have led to problems in the past. Collectively, these strategies allow ANOs to effectively deal with the complex task of monitoring without exceeding their resource limitations.

Safety Culture
We were extremely impressed with the dedication, responsibility, and reflection of the ANOs we observed. They were very conscious about safety, and continually thought about the implications of their actions and practices. As an example, ANOs have developed an attitude where they never take anybody's word for something. Instead, they go and check themselves. These redundant checks are essential for error detection and recovery. Similarly, they actively monitor the performance of field operators (see above), in an effort to make sure that everything is going as planned. These are all signs of a safety culture which is essential to the operation of any potentially hazardous sociotechnical system.
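As an illustration of one of the information-creating strategies described earlier in this section (raising an adjustable setpoint after an alarm to get a "second chance"), the following is a minimal Python sketch. It assumes, as the text states, that a high alarm annunciates only when the value crosses the setpoint and stays silent while the value keeps rising. The class name, function name, readings, and step size are hypothetical and are not taken from the plant's actual alarm system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EdgeTriggeredAlarm:
    """Hypothetical high alarm that sounds only on an upward setpoint crossing."""
    setpoint: float
    in_alarm: bool = False
    last_value: Optional[float] = None

    def update(self, value: float) -> bool:
        """Feed a new reading; return True only when a new alarm is annunciated."""
        above = value > self.setpoint
        fired = above and not self.in_alarm
        self.in_alarm = above
        self.last_value = value
        return fired

    def raise_setpoint(self, new_setpoint: float) -> None:
        """Operator moves the setpoint; the alarm re-arms if the value is now below it."""
        self.setpoint = new_setpoint
        self.in_alarm = self.last_value is not None and self.last_value > new_setpoint


def alarms_heard(readings: List[float], setpoint: float,
                 second_chance_step: Optional[float] = None) -> List[float]:
    """Return the readings at which an auditory alarm would have come in."""
    alarm = EdgeTriggeredAlarm(setpoint)
    heard = []
    for value in readings:
        if alarm.update(value):
            heard.append(value)
            if second_chance_step is not None:
                # Informal strategy: bump the setpoint so a further rise alarms again.
                alarm.raise_setpoint(alarm.setpoint + second_chance_step)
    return heard


rising = [95.0, 101.0, 104.0, 108.0, 112.0]   # made-up, steadily rising parameter
print(alarms_heard(rising, setpoint=100.0))                          # one alarm only
print(alarms_heard(rising, setpoint=100.0, second_chance_step=5.0))  # repeated alarms
```

With a fixed setpoint the sketch produces a single alarm for the whole excursion; with the setpoint raised after each alarm, further increases produce new auditory signals, which is the effect the ANOs are described as seeking.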
DISCUSSION

Figure 2 presents a framework that summarizes the findings of the field study. The ANO is shown in the middle and the various sources of information that can be used for cognitive monitoring are shown in the periphery. The various relations in Figure 2 will be briefly described first. Then, a more detailed discussion of some of the more salient points will be presented.

We will begin by discussing the role of the bracket representing the shift turnover, log, testing procedures, and check forms shown on the right in Figure 2. These information sources allow the ANO to update his awareness of the state of the unit. The ANO also contributes to most of these documents by adding and passing along additional information observed during his shift. Moving clockwise, the ANO also communicates by phone with field operators and technicians. This communication can be initiated by the ANO (e.g., collecting more information after an alarm) or by the field personnel (e.g., detecting a problem in the field). The plant itself also serves as a source of information. During field tours, ANOs actively seek out information by perceptually examining certain subsystems and components. Also, on some occasions, the plant will present "informal" stimuli (e.g., low rumbling noise) which can be meaningfully
interpreted by an experienced ANO. As mentioned earlier, ANOs rely extensively on the alarms. These are shown as a double line in Figure 2 because they are in auditory form. Thus, they usually grab the ANO's attention, unlike visual stimuli which have to be in the field of view before they can be perceived. The two dashed lines in Figure 2 represent the ANO's activities to "finish the design" by modifying both the alarms and the control room panels in such a way as to create additional information. Finally, the panels themselves serve as a source of information. In some cases, the ANO will actively monitor the panels, while in other cases certain signals will jump out and grab the ANO's attention during a routine scan (bottom-up perception).

Some of the most important findings will now be discussed in more detail. In any work situation, monitoring involves a decision about what is relevant to look at, and what is a relevant deviation or change. In order to make these decisions about relevance, one must have an appropriate sense of context. The same variable value can be normal in one context, and disastrous in another. In the case of PNGS-B, this appropriate sense of context involves having an up-to-date, comprehensive, and accurate awareness of the operational status of a unit, including what equipment is currently failed, working imperfectly, or being repaired. As mentioned earlier, this context or background is very complex in scope and is always changing with time. Thus, retaining this awareness is not a trivial matter. However, awareness of this context is absolutely essential to effective monitoring. Our observations suggest that an effective shift turnover is instrumental in allowing ANOs to acquire this sense of context, which then drives judgements of relevance in monitoring. For example, when starting a shift turnover, one ANO asked, "what's new, interesting, different?"
Another important finding was that it is psychologically impossible to continuously sample the state of all of the instruments in a comprehensive and reliable manner. There are simply too many things to look at to make this a psychologically viable task. Furthermore, ANOs are frequently occupied with other activities, so monitoring often must be time-shared with other tasks. This also prevents ANOs from devoting all of their attention to comprehensively
monitoring all of the variables on the CR panels. It is important to point out that the control room was designed with this in mind (e.g., alarms draw ANOs' attention to parameters that they may not be monitoring, procedures serve as aids and reminders to periodically check on the status of certain variables, etc.). Therefore, the lack of continual, comprehensive monitoring is not indicative of negligence or lack of dedication on the part of the ANOs; given the task demands, it is perhaps the only psychologically plausible approach to adopt.

This finding is reinforced by one ANO's estimates of the probability of detection of a problem as a function of data source. This ANO, who had 10 years of experience, suggested to us that 75% of all problems are detected from the alarms, 20% from field reports, and only 5% from the panels. The very low frequency with which problems are estimated to be detected from direct monitoring of the panels is consistent with the observation that the instruments on the panels are not monitored continuously and comprehensively.

Although comprehensive continuous monitoring was not observed, two different kinds of active, directed monitoring were observed. First, it was not unusual for ANOs to proactively monitor one or more variables more closely than they usually would. Typically, this would occur because of the context; there would be some feature of the current situation which would prompt ANOs to monitor a set of variables more closely than normal. Second, there is a subset of variables that ANOs would monitor proactively on a regular basis, even in the absence of an alarm. These variables were divided into three areas: the front end of the plant (reactor regulating system), the back end (MWe output), and the SDS2 volatile trips. The rationale that was provided for this was that if the front end and back end variables were normal, then it was extremely unlikely that any other subsystems were in an abnormal status.
Thus, this strategy seems to be based on a top-down prioritization of the variables that are most likely to show significant changes originating in any part of the unit. The trip points were monitored regularly because they can change very quickly and because the consequences of exceeding a setpoint are quite severe.
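One way to picture this kind of prioritization, in the spirit of the panel-check practice of writing down exact values to see how close a parameter is to its limit, is to rank readings by their remaining margin. The following is a minimal Python sketch only: the parameter names, values, limits, and the 10% threshold are all hypothetical and do not come from the study or from plant documentation.

```python
from typing import Dict, List, Tuple

# name -> (current value, low limit, high limit); all numbers are invented.
Readings = Dict[str, Tuple[float, float, float]]


def fractional_margin(value: float, low: float, high: float) -> float:
    """Distance to the nearest limit, as a fraction of the normal band width."""
    return min(value - low, high - value) / (high - low)


def watch_list(readings: Readings, threshold: float = 0.10) -> List[str]:
    """Names of parameters within `threshold` of a limit, closest to the limit first."""
    margins = {
        name: fractional_margin(value, low, high)
        for name, (value, low, high) in readings.items()
    }
    close = [name for name, margin in margins.items() if margin < threshold]
    return sorted(close, key=lambda name: margins[name])


panel_check = {
    "boiler_level": (2.95, 2.0, 3.0),          # very near its high limit
    "storage_tank_level": (55.0, 20.0, 80.0),  # comfortably mid-band
    "deaerator_pressure": (101.5, 100.0, 120.0),
}
print(watch_list(panel_check))
```

A simple check mark on a form would treat all three hypothetical parameters as "normal"; recording the values makes it possible to single out the two that sit near a limit and deserve closer watching.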
The subjective data listed above also show the central role that the alarm screens play in cognitive monitoring. ANOs seem to rely extensively on the alarm screens to detect problems. In fact, one could even say that, to some extent, they use the alarm screens to get an overview of plant status (a function that they were not designed to support). This method of monitoring has the advantage that the auditory signal can grab the ANO's attention, whether they are looking at the alarm screens or not. The disadvantage of this method of monitoring is that the alarm screens are not very well designed, thereby making monitoring much less efficient than it could otherwise be. It was also suggested by some ANOs that they get into problems when nuisance alarms occur, their workload is high, or symptoms are masked. These conditions seem to create problems for monitoring.

Another fascinating finding from the field study was the various ways in which ANOs go about "finishing the design" by actively manipulating the interface they are given to make it more informative. Several examples of this type of behaviour were observed, including: filtering of alarms, creating salient visual cues for monitoring, creating alarms where none exist, manipulating alarm setpoints to serve as cues for action, and creating external representations that offload memory. These strategies help ANOs compensate for design deficiencies and for the complex demands imposed by cognitive monitoring. They show that providing skilled people with responsibility can lead to highly creative and functional forms of adaptation.

Finally, it is important to emphasize the contribution of the various informal strategies and competencies that ANOs have developed to effectively carry out monitoring.
Although these strategies are not part of the formal training programs or the official operating procedures, they are extremely important because they facilitate the very complex demands of monitoring, and they compensate for poor design decisions. Thus, one could effectively argue that a very high level of reliability is achieved because ANOs have developed innovative strategies for doing their job. If they were to follow standard operating practices and knew only what they are
supposed to learn in training, then performance would not be nearly as reliable and as efficient. In other words, the system works well not despite the fact that ANOs deviate from formal practices, but because they do.

Limitations

While this field study led to a wealth of fascinating insights, it is important to point out the limitations of the findings. First, only a small subset of the ANOs who work at PNGS-B were observed. Second, the observation period was very brief, given the complexity of the topic under investigation. Third, some of the strategies described above were not actually observed in practice but were instead communicated verbally by ANOs. Therefore, it is very difficult to assess the extent to which the strategies that we pieced together are actually representative of those used on a regular basis by most ANOs.

IMPLICATIONS

These findings have significant practical implications for systems integration, training, and interface design.

Systems Integration

It is clear that there is a lack of systems integration between the various system perspectives (e.g., training, panel layout, instrumentation, procedures, task demands) that are required to make the plant run optimally. As a simple example, we found that some testing procedures are designed without an awareness of the layout of the controls and displays on the CR panels. This lack of integration makes the ANOs' job more difficult than it need be. Improving the degree of coordination and integration between the system perspectives just mentioned should facilitate cognitive monitoring, and decrease ANO workload.

Training

The findings of this field study also have very important implications for training. It was patently clear to us that ANOs develop many strategies and acquire a great deal of knowledge
that go well beyond the formal training that they receive. The possibility of training operators to effectively use at least some of the informal strategies identified above should be considered. Although this may add to what is already a very extensive training program, it would ensure that this important knowledge and these essential strategies would be passed on formally and become part of standard operating practice. Otherwise, they might "fall between the cracks".

Another important finding from this field study is the observation that good ANOs rely extensively on knowledge-driven monitoring instead of just rote procedural compliance. This practice of knowledge-driven monitoring allows operators to detect problems before they become significant, to compensate for poor design of procedures, to distinguish instrumentation failures from component failures, and to become better aware (in a deep sense) of the current state of the unit. However, it seems that the current training, licensing, and recertification programs are based more on procedural compliance than knowledge-based understanding. The latter set of skills clearly plays a very important role in effective cognitive monitoring and should therefore play a much larger role in training, licensing, and recertification. In summary, observing the extraordinary informal skills and practices that good ANOs have acquired generates important implications for operator training.

Interface Design

The findings of the field study also have significant implications for interface design. For example, certain parts of the CR panels could be drastically improved to facilitate monitoring by allowing ANOs to check whether displays and controls are at their desired settings at a glance. Clearly, the alarm screens could be improved enormously. Currently, ANOs have to frequently filter irrelevant messages that arise because of the lack of context sensitivity in the alarm system.
ANOs also sometimes have to deal with nuisance alarms which make monitoring much more difficult than it would otherwise be. Other limitations could be pointed out as well but these serve to make the point. Another area that could possibly be improved is the level of feedback provided by the automation. It is possible that making the
automation more transparent would facilitate the detection and diagnosis of automation failures as well as failures that might currently be masked by automation.

Finally, the field study findings suggest that the Ecological Interface Design (EID) framework developed by Vicente and Rasmussen (1992) could be effectively adopted to address some of these problems. EID is a framework for interface design for complex human-machine systems. EID tries to provide operators with the information they need to cope with normal conditions, anticipated faults, and most importantly, unanticipated events as well. It does this by providing higher-order functional information that operators can consult to determine whether system constraints have been violated, as they are in the case of a fault. This feature of EID is consistent with the need ANOs expressed for redundant information so that they can confirm whether a fault has in fact occurred (cf. Vicente and Rasmussen, 1992). This need is well captured by the following quote from one of the ANOs: "If I don't have an additional indicator to confirm what's going on, I feel completely naked. You can make the control panel as complicated as you want, as long as I have redundant information that I can check to see what's going on, I'll be happy. Things around here break all the time. Eventually, everything breaks down. That's why you need the redundant information -- to check to see if it's a light that's gone, or whether the component is really down".

EID also tries to aid operator performance by presenting the aforementioned information in a form that is consistent with the powerful capabilities of human perception. This feature of EID is consistent with ANOs' desire for information at a glance, as indicated by the following quotation obtained from one of the ANOs: "What the value is is not that important. As long as I can scan it and see whether everything is lined up (i.e., normal) then that's all I need".
The relationship between ANO needs and the principles of EID was noted part-way through the field study. As a result, several examples of interfaces based on EID were presented to two ANOs to obtain their feedback. These displays were described to ANOs without telling them
who had designed the displays. The examples presented were favourably received by both ANOs. One ANO suggested that it would be useful to apply the principles of EID to the deaerator subsystem, whose currently impoverished displays apparently cause ANOs great difficulties during diagnosis. This suggestion should be considered since it seems to have the promise of addressing current limitations with the CR panels in a manner that would receive the endorsement of the ANOs, the end users of the system.

ACKNOWLEDGEMENTS

This work was funded by a research subcontract with the Westinghouse Science and Technology Center (Randy Mumaw, contract monitor). We would like to express our sincere thanks to Francis Sarmiento and Mel Grandame of the AECB and Rick Manners of Ontario Hydro for their help in coordinating our field study. Also, we are deeply indebted to all the ANOs and SOSs who patiently answered our questions and generously shared their insights regarding the demands and skills associated with their jobs. This field study would not have been possible without their cooperation.

REFERENCE

Vicente, K. J., and Rasmussen, J. (1992). Ecological interface design: theoretical foundations. IEEE Transactions on Systems, Man, and Cybernetics, SMC-22, 589-606.
KJ Vicente, CM Burns