Risk mitigation procedure following criticality assessment and FMEA
In asset performance management (APM), criticality assessment and failure mode and effects analysis (FMEA) are powerful tools that enable the identification of opportunities to minimize asset lifecycle costs during the development of equipment strategies.
Here, the author proposes a risk mitigation procedure that prioritizes tasks based on the criticality assessment and FMEA results, considers multiple aspects of failure effects and aims to provide data-driven solutions to maximize the risk mitigation benefit and return on investment (ROI).
Criticality assessment and FMEA
A criticality assessment is used to chart the probability of a failure against the severity of its consequences with regard to organizational impacts, such as safety, environmental or production consequences. When performed at the functional location level (in most cases, an individual component), a criticality assessment identifies the worst combination of severity and occurrence against a predetermined risk matrix. The results highlight one or several groups of “critical” components, allowing remedial efforts to be directed to mitigate the most prominent risks.
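The "worst combination" lookup described above can be sketched in a few lines. This is a minimal illustration, not a site risk matrix: the 1-5 ranking scales, the band thresholds and the names `risk_band` and `component_criticality` are all assumptions made for the example.

```python
# Hypothetical 1-5 ranking scales and band thresholds; a real site
# would substitute its own predetermined risk matrix.
def risk_band(severity: int, probability: int) -> str:
    """Map one severity/probability pair to a risk band."""
    score = severity * probability
    if score >= 15:
        return "high"
    if score >= 6:
        return "medium"
    return "low"


def component_criticality(assessments: dict) -> tuple:
    """Return (worst band, consequence category that drives it)."""
    driver = max(assessments, key=lambda cat: assessments[cat][0] * assessments[cat][1])
    return risk_band(*assessments[driver]), driver


# Illustrative (severity, probability) rankings per consequence category.
pump = {"safety": (4, 1), "environmental": (3, 2), "production": (4, 4)}
band, driver = component_criticality(pump)
print(band, driver)  # high production
```

Recording the driving category alongside the band keeps the primary driver (operational vs. EHS) available for later reporting.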
FMEA is a bottom-up, inductive analytical method that is typically performed at the failure mode level. When previously assessed criticality results are available, the information can be used as a reference to speed up FMEA development. Combining the failure mode detection ranking with the severity and occurrence rankings from the criticality results yields a risk priority number (RPN), conventionally calculated as the product of the three rankings. RPNs can be calculated before and after risk mitigation for a specific failure mode.
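As a small worked example, the sketch below computes an RPN before and after mitigation. It assumes the conventional FMEA formulation, in which the RPN is the product of the severity, occurrence and detection rankings on 1-10 scales. The ranking values are illustrative only.

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk priority number as the product of three 1-10 rankings (assumed convention)."""
    return severity * occurrence * detection


# Illustrative rankings: mitigation lowers occurrence and improves detection.
rpn_prior = rpn(severity=8, occurrence=6, detection=7)
rpn_post = rpn(severity=8, occurrence=3, detection=4)
reduction = (rpn_prior - rpn_post) / rpn_prior

print(rpn_prior, rpn_post, f"{reduction:.0%}")  # 336 96 71%
```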
In general, multiple failure modes (around 10, on average) are associated with a given component, so the list of risk mitigation tasks identified by the FMEA is usually much longer and more detailed than the list from the criticality assessment.
Risk mitigation benefit realization
Depending on the findings of the criticality assessment and the FMEA, one or several risk mitigation strategies can be applied to each risk item:
- Root cause analysis (RCA): This strategy applies to failures associated with very high impact, where the root causes are unknown. Only with the completion of the RCA can mitigation strategies be developed for certain risk items.
- Preventive maintenance (PM): This strategy brings the most benefit to certain wear-out failure modes where it is difficult to effectively monitor damage conditions.
- Predictive maintenance (PdM): These typically nondestructive inspections can predict an upcoming failure (with a probability) and trigger a preventive maintenance action. Some examples are periodic vibration analysis and/or oil sample analysis for rotating equipment.
- Operators’ rounds: These inspections are conducted by equipment operators, sometimes referred to as operator-driven reliability. Compared to PdM, these tasks are conducted more frequently and involve more visual and auditory observation and less technology.
- Spare part strategy: Having critical spare parts available may significantly reduce the mean time to repair (MTTR) upon an equipment failure, thereby lowering the severity of the failure. This is especially important for parts with long lead times.
- Redesign: When the risk cannot be effectively or economically mitigated by any of the prior strategies, a redesign may be the last resort. An example is adding a standby pump to form a redundant pump system at a process-critical location.
- Risk acceptance: If no risk mitigation strategy, including redesign, is able to address the risk item effectively, then the best option is to accept the risk. This is sometimes referred to as “run to failure.” Risk acceptance often applies to components with lower criticality and should be assessed on a case-by-case basis.
Each of the outlined strategies (except for risk acceptance) has a potential benefit, either financially or with respect to environment, health and safety (EHS) risk. However, the benefit is not realized until the risk mitigation tasks are implemented. In a typical plant-level APM project, a large number of failure modes may be identified as candidates for risk mitigation. Most clients face the challenge of limited human resources, including engineers, technicians and planners, as well as limited budget for capital projects and spare parts. To best utilize limited resources, it is important to develop and follow a procedure that helps prioritize risk mitigation tasks.
Procedure to prioritize risk mitigation tasks
The following proposed procedure starts after the completion of the criticality assessment and the FMEA and ends with a list of recommended tasks for implementation:
- Perform failure mode pre-screening
- Collect information for prioritization
- Quantify risk mitigation benefits for top-failure modes
- Perform resource allocation optimization
- Summarize results and make recommendations.
Failure mode pre-screening. A pre-screening helps narrow down the risk mitigation candidates efficiently. In general, failure modes satisfying the following conditions should be selected for the next step (prioritization):
- Associated with components with high or medium criticality ranking
- Relatively high RPN scores prior to risk mitigation (identify a cutoff value based on overall RPN distribution)
- The largest percentage reduction in RPN score from prior to post risk mitigation.
In addition, consider failure modes that map to a large number of components (functional locations) in the plant. In this case, the risk mitigation benefit on one component may have a large multiplier effect.
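The pre-screening rules above could be applied to an FMEA export along the following lines. This is a hedged sketch: the `FailureMode` fields, the `pre_screen` function and all threshold values are hypothetical. Treating "largest percentage reduction" as a minimum-reduction threshold is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    criticality: str   # "high", "medium" or "low" (component ranking)
    rpn_prior: int     # RPN before risk mitigation
    rpn_post: int      # RPN after risk mitigation
    n_components: int  # functional locations sharing this failure mode

    @property
    def rpn_reduction(self) -> float:
        return (self.rpn_prior - self.rpn_post) / self.rpn_prior


def pre_screen(modes, rpn_cutoff, min_reduction, min_components):
    """Keep modes meeting all three screening conditions, plus widespread modes."""
    selected = []
    for m in modes:
        meets_core = (
            m.criticality in {"high", "medium"}
            and m.rpn_prior >= rpn_cutoff
            and m.rpn_reduction >= min_reduction
        )
        widespread = m.n_components >= min_components  # multiplier effect
        if meets_core or widespread:
            selected.append(m)
    return sorted(selected, key=lambda m: m.rpn_reduction, reverse=True)


# Illustrative data only.
modes = [
    FailureMode("bearing wear", "high", 336, 96, 4),
    FailureMode("seal leak", "low", 120, 100, 40),
    FailureMode("gasket creep", "medium", 90, 80, 2),
]
selected = pre_screen(modes, rpn_cutoff=200, min_reduction=0.5, min_components=25)
print([m.name for m in selected])  # ['bearing wear', 'seal leak']
```

In practice, the RPN cutoff would be read off the overall RPN distribution rather than fixed in advance.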
Information collection for prioritization. Provided that the criticality assessment and the FMEA are performed thoroughly, the majority of the information required for risk mitigation prioritization is already available. TABLE 1 summarizes the required and desirable information to be collected for the failure modes selected.
In TABLE 1, the required information is typically included in the standard output of the criticality assessment and the FMEA. By referring to the criticality assessment or FMEA ranking matrix, the approximate values of the financial loss (severity), the probability of occurrence or the detectability can be obtained for a given component or failure mode. To allow easy and accurate decision-making, the information should be as specific as possible. For example, a single estimate of $100,000/occurrence is preferable to a range of $50,000/occurrence–$250,000/occurrence, although a range can still be treated statistically.
Depending on the level of detail of the criticality assessment and FMEA results, the desired information may take extra effort to obtain. However, including that information in the calculation would increase the confidence level of the prioritization.
Risk mitigation benefit quantification. Depending on which of the information listed in TABLE 1 can be collected, the risk mitigation benefit for a failure mode can be estimated accordingly. For each type of risk mitigation strategy, the benefit is the lifecycle cost (LCC) prior to the risk mitigation minus the sum of the LCC after the risk mitigation and the cost of implementing the strategy, as shown in Eq. 1:
$$\text{Benefit} = \text{LCC}_{\text{Prior}} - \left(\text{LCC}_{\text{Post}} + \text{Implementation cost}\right) \tag{1}$$
For failure modes with a constant failure rate (i.e., random failure), the calculation of LCC is straightforward. Otherwise, software capable of performing random lifecycle simulation can be used. Other advantages of using this type of software include its capability to handle random input parameters and address imperfect repairs or failure detection.
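For the constant-failure-rate case, Eq. 1 can be evaluated directly. The sketch below assumes an undiscounted LCC equal to the expected number of failures over the horizon multiplied by the cost per failure. The function names and all input values are hypothetical. The example models a spare part strategy that shortens repair time and therefore lowers the cost per failure.

```python
def lcc_constant_rate(failures_per_year: float, cost_per_failure: float,
                      horizon_years: float) -> float:
    """Undiscounted failure cost over the horizon for a constant failure rate (assumption)."""
    return failures_per_year * cost_per_failure * horizon_years


def mitigation_benefit(lcc_prior: float, lcc_post: float,
                       implementation_cost: float) -> float:
    """Eq. 1: benefit = LCC prior - (LCC post + implementation cost)."""
    return lcc_prior - (lcc_post + implementation_cost)


# Illustrative inputs: stocking a spare cuts the loss per failure
# from $100,000 to $40,000 over a 10-yr horizon.
lcc_prior = lcc_constant_rate(0.5, 100_000, 10)
lcc_post = lcc_constant_rate(0.5, 40_000, 10)
benefit = mitigation_benefit(lcc_prior, lcc_post, implementation_cost=30_000)

print(lcc_prior, lcc_post, benefit)  # 500000.0 200000.0 270000.0
```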
TABLE 2 lists the implementation costs that must be considered for each of the risk mitigation strategies. The accuracy of the benefit estimate depends on the quality of the input parameters. Ideally, most input parameters should be determined in a data-driven approach, with a combination of data sources including FMEA, computerized maintenance management system data and inputs from maintenance and operations personnel. Commonly, some data is unavailable, in which case necessary assumptions can be made using industry averages, adjusted by the client’s operating environment and equipment conditions.
When the failure consequence is related to EHS, it is difficult to calculate the benefit in terms of dollar value. In these cases, decisions on risk mitigation should follow the EHS guidelines of the company and site.
Resource allocation optimization
In general, each risk mitigation category is subject to one type of constraint. For example, PM and PdM are usually labor-intensive and subject to labor availability, while spare part and redesign both require an initial investment and must stay below the financial budget.
Both the labor and financial constraint categories can be further divided into sub-categories, each with their own constraints. For example, labor resources may include engineers, technicians (mechanical, instrumentation and electrical, boilermaker, contractors, etc.) and planners, while the financial budget may include maintenance and capital.
Both labor and budget constraints are time-sensitive. When prioritizing risk mitigation tasks, it is important to keep the timeline of risk mitigation tasks aligned with the foreseeable constraints.
The optimization of prioritization is demonstrated with a simple example (TABLE 3). Consider four risk mitigation tasks, each associated with a different failure mode. The first two tasks are PM and PdM, respectively, and are subject to the labor constraint. The last two tasks are spare and redesign, respectively, and are subject to the budget constraint.
Consider a labor constraint of 300 hr/yr and a budget constraint of $30,000. Only one of tasks 1 and 2, plus one of tasks 3 and 4, can be implemented. If the goal is to maximize the total amount of benefit, then tasks 1 and 4 will be chosen; however, if the goal is to maximize ROI, then tasks 2 and 3 will be chosen.
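The selection logic in this example can be sketched as a brute-force search. The task values below are hypothetical placeholders, not the figures from TABLE 3. They were chosen only so that the outcome matches the one described above. The field names and helper functions are likewise assumptions.

```python
from itertools import combinations

# Hypothetical task data (NOT the values from TABLE 3).
TASKS = {
    1: {"strategy": "PM", "labor_hr": 200, "capital": 0, "cost": 20_000, "benefit": 40_000},
    2: {"strategy": "PdM", "labor_hr": 150, "capital": 0, "cost": 10_000, "benefit": 30_000},
    3: {"strategy": "spare", "labor_hr": 0, "capital": 15_000, "cost": 15_000, "benefit": 45_000},
    4: {"strategy": "redesign", "labor_hr": 0, "capital": 25_000, "cost": 25_000, "benefit": 60_000},
}
LABOR_LIMIT_HR = 300   # hr/yr
BUDGET_LIMIT = 30_000  # dollars


def total(ids, field):
    return sum(TASKS[i][field] for i in ids)


def is_feasible(ids):
    return (total(ids, "labor_hr") <= LABOR_LIMIT_HR
            and total(ids, "capital") <= BUDGET_LIMIT)


def maximal_feasible_sets():
    """Feasible task sets to which no further task can be added."""
    feasible = [frozenset(c) for r in range(1, len(TASKS) + 1)
                for c in combinations(TASKS, r) if is_feasible(c)]
    return [s for s in feasible if not any(s < other for other in feasible)]


def roi(ids):
    return total(ids, "benefit") / total(ids, "cost")


candidates = maximal_feasible_sets()
best_by_benefit = max(candidates, key=lambda s: total(s, "benefit"))
best_by_roi = max(candidates, key=roi)
print(sorted(best_by_benefit), sorted(best_by_roi))  # [1, 4] [2, 3]
```

The search is restricted to maximal feasible sets because the ROI of a portfolio is a weighted average of its tasks' ROIs. An unrestricted ROI maximization would therefore select only the single highest-ROI task. Exhaustive enumeration is practical only for a handful of tasks.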
Real-life optimization problems are much more complicated than this example, with (1) many more candidate tasks for risk mitigation, (2) more categories of constraints, (3) single tasks subject to multiple constraints, and (4) benefit values that may follow a random distribution rather than being deterministic. However, the problem remains a linear constrained optimization problem that can be solved with the proper mathematical tools.
Recommendations
Even for a complicated prioritization problem, the benefit and constraint information can be summarized in a similar format, as shown in TABLE 3.
Based on the optimization results, different combinations of risk mitigation tasks can be recommended for implementation. These combinations, along with their pros and cons, should be presented to the client for decision-making. It is also good practice to include the following information in the summary to enable easier decision-making:
- Plant area
- System and subsystem
- Criticality ranking and primary driver (operational vs. EHS).
The proposed procedure can help prioritize risk mitigation tasks identified from equipment strategy development. The procedure takes advantage of the valuable information previously obtained from the criticality assessment and the FMEA, determines the most value-added risk mitigation tasks—given the constraints of labor and financial resources—and helps the plant owner make data-driven decisions.
Prioritizing risk mitigation tasks can be complicated, as the results are often affected by many factors, including equipment operating conditions, maintenance effectiveness and condition monitoring capabilities. A more extensive study should be conducted to provide customized solutions that fit each plant’s conditions. In some cases, for example, relatively simple solutions (such as an operator’s round) may prove effective. It is also worth mentioning that precision-driven maintenance [2], although not included as a risk mitigation strategy in this article, may have significant potential for future reliability improvement.
LITERATURE CITED
1. Apelgren, R., “Use P-F intervals to map, avert failures,” Reliability Plant, July 2019.
2. Troyer, D., “A proactive lifestyle for your machines—Focus on FLAB,” Machinery and Equipment MRO, August 2020.
The Author
Zou, T. - T. A. Cook, Houston, Texas
Tong Zou is a Senior Engineering Specialist at T.A. Cook Consultants. He has more than 16 yr of reliability engineering experience in the power generation, automotive, oil and gas and petrochemical industries. Dr. Zou’s expertise includes reliability-centered maintenance, system and component reliability analysis, equipment life data analytics and reliability-based design optimization. Since joining T.A. Cook, he has worked on client projects focusing on reliability improvement.