April 2022

Special Focus: Maintenance and Reliability

Use straightforward troubleshooting and failure analysis to benefit the HPI

Paradies, M., System Improvements, Inc.; Bloch, H. P., Process Machinery Consulting

To operate reliably and profitably, oil refineries, petrochemical plants and gas processing facilities must avoid equipment failures. In the event such failures occur, the plant must find and remedy the root causes of these failures. Best-in-class facilities are reaching and maintaining this enviable status by pursuing repeatable failure analysis and troubleshooting (FA/TS). Guesses are rarely correct: they can lead to costly and even dangerous fixes. By their very nature, mere brainstorming sessions lack structure and timely, tangible results.

Experience shows that troubleshooting and failure analysis activities will terminate in only three possibilities:

  • The failure cause is determined and removed from the inventory of future failure possibilities
  • The failure cause is not found, and a recurrence is likely
  • The failure cause is found but not disclosed to parties interested in doing anything about it.

Reliability professionals agree that discovering the true root cause of a failure and permanently remedying it is the only correct course of action. These reliability-focused professionals will readily agree that inaccurate troubleshooting leads to incorrect conclusions and wasted time. Because the analyst or troubleshooter is under pressure to do something, tinkering at the fringes with procedural changes or design modifications is quite common. Moreover, hiding behind claims that a failure is an “act of God” is common, albeit inappropriate. The Almighty may permit a city built below sea level to be damaged by a flood after the levees break and the pumps fail, but the Almighty did not ask the people to build substandard levees and flawed pumping equipment. The chain of events always originates in, or leads to, decisions and indecision, actions and/or inaction by humans who, in the end, are nowhere near as perfect as they would have us believe.

Reasons to improve troubleshooting methods

Of course, nothing devised by man is perfect, nor will it ever be. Problems of one kind or another will always exist. In other words, troubles must be identified and remedied. Regrettably, troubleshooting is usually done after a problem surfaces. Suppose there was a time when the sugar cane crop failed, and the landowner attributed the problem to the sugar cane worm. Since pesticides had not yet been invented, he countered the threat with imported beetles that ate sugar cane worms. When the beetles took over, he imported ugly beetle-eating toads which, unbeknownst to him, preferred to decimate his previously profitable tobacco leaf crop. This made him decide to import Genovese disease-carrying rats as the cure. These rats failed to do the job, but by the time he considered red foxes as the problem solvers, he realized that the red fox recommendation came from gentrified landowners whose ulterior motive was nothing but hunting foxes.

While some of the above is indeed a bit far-fetched, there is also a measure of truth in it. It pays to consider the message: Do not tackle serious problems with temporary fixes. Recommend solutions that are backed up by both data and science. Try not to endorse remedies that have unintended consequences or run counter to the satisfactory experience record of others. Therefore, investigate or pay attention to others and try to learn from them.

A few fundamental facts are usually recognized and accepted by the leaders of modern industry: inadequate troubleshooting raises costs, wastes time and makes the company’s troubleshooting team look unprofessional. Because we, the co-authors, have made our living as troubleshooters, allow us to use these facts in demonstrating simple examples of solving randomly occurring process pump problems. Such problems can be tackled by using course content that combines¹ several proven approachesᵃ.


Collecting a large failure mode inventory while realizing that such an inventory will differ for different machines and their components makes considerable sense. Examining this inventory and conducting a structured review of data observed during and after a failure makes it possible to draw conclusions as to the failure’s origin. The reviewer or failure analyst should concentrate on deviations from proper design, component function, appropriate coatings, and many other factors.

However, generalities that lead to a so-called “shotgun approach” must be avoided. In shotgun-style failure analysis and troubleshooting, much time is often wasted on debates that lack both focus and value. There will then be a trend towards verbalizing scenarios that are a mix of the probable and highly improbable, the reasonable and unreasonable, the often observed and perhaps things that have never happened since Leonardo sketched a helicopter for tentative future use by Igor Sikorsky. Do not disregard everything that happened in the 440-yr span between those two inventors. Nevertheless, accept our premise that we need focus and must concentrate on realistic events and solutions.

To begin, we observe that work execution requires emphasis on any two of the three attributes shown in FIG. 1. If a manager demands quality work done cheaply and quickly, we must tactfully train that manager to let us know which two will be accepted, since achieving all three will be impossible. Frankly, it is in everybody’s best interest that our work output be good. That leaves it to the manager to accept that good is your unyielding standard, but he/she may instruct you to either be fast (and expensive) or cheap (and slow, by prevailing standards).

FIG. 1. Good managers know that competent staffers will give their employer any two of these three. One of these three attributes will be unachievable.

In this context, a useful focus is readily obtained by accepting that all machines, be they turbo-compressors or machines making cotton-tipped ointment applicators or doorknobs, will experience distress attributable to only one of seven possible failure cause categories:¹

  1. Design deficiencies
  2. Material defects
  3. Processing and manufacturing deficiencies
  4. Assembly errors
  5. Off-design or unintended service conditions
  6. Maintenance deficiencies (neglect, procedures)
  7. Improper operation.

Of equal importance is recognizing that the basic agents of machinery component and parts failure mechanisms are always force, a reactive environment, time and temperature (FRETT). Each of these four can be further characterized as steady, transient or cyclic, and by duration (very short, average length, long and so forth).

The gasket case

An expert failure investigator will take pleasure in reaffirming the above by stating that regardless of whether the component is made of plastic, steel, rosewood or porcelain, when it breaks the basic agents of failure mechanisms are found in FRETT, and FRETT alone. Once this fact is accepted, we have a starting point for investigating our first example, a leaking gasket. Looking at the data, it will not be difficult to determine which of these four basic agents can be crossed off our list. If only 6 wk had elapsed since the gasket was inserted between a pump discharge nozzle and its adjacent pipe flange, it would be safe to delete “time” from the acronym FRETT.

Suppose the remaining five gaskets in the various flanges associated with the downstream discharge piping had been in service for close to 4 yr and were of the same material composition as the leaking gasket. In that case, we might cross off both “reactive environment” and an abnormal operating “temperature.” That would then leave us with “force.”
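The elimination reasoning in the gasket case can be sketched in a few lines of Python. This is only an illustration: the function name `remaining_agents` and the layout of the evidence log are assumptions made for this sketch, and the reasons shown simply paraphrase the observations given above.

```python
# The four basic failure agents named in the article (FRETT).
FRETT = ("force", "reactive environment", "time", "temperature")

def remaining_agents(ruled_out):
    """Return the FRETT agents that have not been ruled out, in FRETT order."""
    unknown = set(ruled_out) - set(FRETT)
    if unknown:
        raise ValueError(f"not a FRETT agent: {sorted(unknown)}")
    return [agent for agent in FRETT if agent not in ruled_out]

# Hypothetical evidence log for the gasket case: agent -> why it was crossed off.
ruled_out = {
    "time": "gasket had been in service for only 6 weeks",
    "reactive environment": "five same-material gaskets downstream lasted close to 4 years",
    "temperature": "those five gaskets did not leak in the same discharge piping",
}

print(remaining_agents(ruled_out))  # ['force']
```

Recording a reason next to each deleted agent keeps the elimination auditable: anyone reviewing the analysis can see which observation justified crossing an agent off the list.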

Clamping forces are a function of bolting torques and bolting procedures, bolt sizes, fabrication methods, hardness and bolt materials. All of these must harmonize with well-established guidelines—failures will result when we deviate from, or disregard, established procedures. If we allow several deviations to exist at the same time, failures become a certainty. Data relating to bolt size, material, gasket deformation, flange condition, method and sequence of bolt tightening will lay bare the cause of gasket leakage. Of course, the human factor remains. If the two workers working on that flange were denied necessary tools or time, we can state that the case of the leaking gasket may have its roots in a measure of somebody’s flawed thinking.

The case of the overheated bearing

Our second example involves a pump bearing that showed signs of temperature-related failure. We must resist the temptation to say that the “bearing failed, therefore the bearing is at fault.” Instead, we might carefully look at failure frequency and failure record. If the pump manufacturer has used 600,000 identical bearings in 300,000 pumps of the same model designation, we have no reason to re-engineer the pump for a larger bearing.

As we then look at the seven possible failure cause categories, we should cross off numbers 1, 2, 4, 5, 6 and 7. We look more closely at (3), “processing and manufacturing deficiencies.” If the shaft was to be 75-mm diameter with a plus-tolerance of 0.02 mm–0.03 mm but we carefully measure and record 75.04 mm, and the bearing bore was to be 74.99 mm with a minus tolerance of 0.01 mm–0.02 mm but is measuring 74.96 mm, there will be an interference fit of 0.08 mm (0.003 in.).
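The fit arithmetic above can be checked with a short Python sketch. It assumes that the quoted tolerances translate into a permitted shaft diameter of 75.02 mm to 75.03 mm and a permitted bearing bore of 74.97 mm to 74.98 mm; that reading, and the helper name `interference`, are assumptions of this sketch rather than drawing data.

```python
from decimal import Decimal

def interference(shaft_dia_mm, bore_dia_mm):
    """Diametral interference in mm; positive means the shaft is larger than the bore."""
    return Decimal(shaft_dia_mm) - Decimal(bore_dia_mm)

# Assumed drawing limits (strings keep the Decimal arithmetic exact).
shaft_min, shaft_max = "75.02", "75.03"   # 75 mm, plus 0.02 mm to 0.03 mm
bore_min, bore_max = "74.97", "74.98"     # 74.99 mm, minus 0.01 mm to 0.02 mm

design_min = interference(shaft_min, bore_max)   # loosest permitted fit
design_max = interference(shaft_max, bore_min)   # tightest permitted fit
measured = interference("75.04", "74.96")        # as-measured parts

MM_PER_INCH = Decimal("25.4")
print(design_min, design_max, measured)          # 0.04 0.06 0.08
print(round(measured / MM_PER_INCH, 4))          # 0.0031
```

Under these assumptions, the measured interference of 0.08 mm is twice the loosest permitted fit and exceeds even the tightest permitted fit by 0.02 mm, which is consistent with treating the parts as a manufacturing deviation rather than a design problem.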

According to the trend curve in FIG. 2, such an interference fit will cause bearing operation deep in the high-preload range where high bearing temperatures and short life are certain to exist. Excessive preload will also exist if all tolerances are in the center of their permitted ranges but the shaft material is a type of stainless steel whose thermal expansion is up to 17% greater than that of the tool steels normally used in pump shafts.²

FIG. 2. Effect of preload on bearing life.¹

Hopefully, we also investigate the condition of our oil rings (FIG. 3), which should be concentric within 0.002 in. (0.05 mm).² To stay within these recommended tolerances, oil rings must be heat-stabilized. So as not to run downhill and abrade as in FIG. 3, oil rings cannot be allowed on shafts that are out-of-horizontal. Both shaft alignment and horizontality must be near-perfect. If a lubricant is thicker than allowed, the oil ring may simply not feed enough oil into the bearings.

FIG. 3. The oil ring on the right ran “downhill” and contacted a stationary part. Abraded bronze particles contaminated the lubricant and caused short bearing life.

Whenever two or more deviations combine, failures are likely to become more serious. We have seen two oil rings [maximum allowable eccentricity: 0.002 in. (0.05 mm)] with eccentricities of 0.017 in. (0.43 mm) and 0.061 in. (1.55 mm), respectively. We have also come across bearing housings with oil rings designed to operate in ISO VG 32 lubricants but barely capable of operating while immersed in an ISO VG 100 lubricant. Repeat malfunctions were tolerated and were soon accepted as the norm. The reasons for inflated maintenance budgets and random downtime events should have been evident to experienced observers.
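To put the two measured eccentricities in proportion, a minimal Python sketch can convert each reading to millimeters and express it as a multiple of the 0.002-in. limit. The function name `eccentricity_report` and its return layout are hypothetical, chosen only for this illustration.

```python
MM_PER_INCH = 25.4
MAX_ECCENTRICITY_IN = 0.002  # allowable limit quoted in the text

def eccentricity_report(measured_in, limit_in=MAX_ECCENTRICITY_IN):
    """Return (measured mm, multiple of the limit, within-limit flag) for one ring."""
    return (
        round(measured_in * MM_PER_INCH, 2),
        round(measured_in / limit_in, 1),
        measured_in <= limit_in,
    )

# The two oil rings described above, measured in inches.
for ring_in in (0.017, 0.061):
    print(ring_in, eccentricity_report(ring_in))
```

The two rings come out at roughly 8.5 and 30.5 times the allowable eccentricity, so neither is a borderline case.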

The case of the missing drain holes

For good measure, our third example again involves an overheated pump bearing and indications of black matter floating in the lubricant. FIG. 4 helps us understand one additional and often considered “elusive” reason for bearing distress. Access to a cross-section view was available in this instance and facilitated finding the likely cause of overheating. It was noted that oil could collect near the outboard sides of the two bearings in FIG. 4. The pump manufacturer forgot to drill a drain passage for the oil trapped behind each bearing.

FIG. 4. OEM drawing of bearings with trapped oil. Drain holes are important.

Since there were no drain holes, it could be reasoned that trapped oil turning to viscous tar or solid coke contributed to the early bearing distress reported by one of the manufacturer’s Canadian customers. When this was brought to the attention of the pump manufacturer, we were told that, although the drawing did not show drain holes, the shop always provided them. Well, as a wise and observant reliability manager in Texas City said in 1980, “When it is all said and done, more will have been said than done.” That truism has held for centuries and was correct in this instance. Accordingly, our recommendation is to trust, but verify. In the great majority of cases, not only is there more value in verifying than in trusting, but it will also cost less.

Experience-based training courses will add value

Whenever failure analysis and troubleshooting are called for, the trained analyst insists on collecting failed parts and on measuring and recording the dimensions and run-outs of rotating components. Suspicious wear patterns should be photographed.

The above examples represent details of closely monitored parts, both new and failed or defective. Thorough analysis is especially important when there is evidence of bearing overheating. Contributing and causal factors can be of great importance and will require both technical knowledge and the application of structured analysis. A formal training course in systematic failure analysisᵃ will go a long way towards uncovering and eliminating costly repeat failures. Downtime is prevented and money is saved by learning (and later implementing) tested approaches that bring the right focus to modern equipment troubleshooting.

Indeed, the culmination of equipment failure analysis involves motivated course attendees. An experienced instructor will assist them as they contribute to very comprehensive custom troubleshooting tables. The content of these tables represents an expansion of the original equipment manufacturers’ (OEMs’) tables into territory that the OEMs may have overlooked for years. We know of instances where vendors viewed expanded troubleshooting tables and/or the use of oil mist lubrication as the original sin, a “sin” resulting in shrinking spare parts sales. We, however, cling to the belief that good marketing decisions may result in unusually reliable pumps and other fluid machines that capture premium profits and enhance vendor reputations. We believe that a good reputation will more than make up for reduced spare parts sales.

From the “5 Whys” to modern methods of failure analysis

The “5 Whys” method of defining failure causes predates the training course in systematic failure analysisᵃ by several decades; however, suppose it had been attempted in one of our earlier examples involving a failed pump bearing. The analyst would have started by asking: Why did the bearing fail? Chances are that someone offered an answer involving the lubricant (e.g., perhaps not enough oil). Why was there insufficient oil? Was it because the constant level lubricator had not been refilled?

We will never know how the next “Why” would have been phrased, so we will leave it at that. We would venture to guess that “slightly off-horizontal shaft centerline,” “oil ring eccentricity outside allowable range,” and “OEM design allowed oil to get trapped behind thrust bearing,” would not be listed as likely root causes.

Modern and consistently effective methods will be needed to conduct the best available troubleshooting and failure analysis in today’s highly competitive work environment. Investigate them and invest in the best.


  ᵃ Equifactor®


  1. Bloch, H. P. and F. K. Geitner, Machinery failure analysis and troubleshooting, 4th Ed., Elsevier Publishing, Oxford, UK, and Waltham, Massachusetts, U.S., 2012.
  2. Perez, R. X. and H. P. Bloch, Pump wisdom, 2nd Ed., John Wiley & Sons, Hoboken, New Jersey, U.S., 2021.
  3. Bloch, H. P., Optimized equipment lubrication, oil mist technology, and standstill preservation, 2nd Ed., De Gruyter Publishing, Berlin, Germany, 2021.
