Dear Colleagues,

A key part of every engineering professional’s job is troubleshooting. In fact, many engineers’ sole function (and the reason some are paid very well) is arguably troubleshooting and fixing intractable problems. Somewhat irritatingly, this often means identifying and fixing other people’s errors.

The optimum approach is to keep your mind completely open when tackling the problem and to avoid preconceived ideas, as these can throw you off track. Equally, avoid the brute-force approach of changing out components at random in a frantic rush to fix the problem.

The suggested steps for general engineering troubleshooting are as follows. At times it will be tempting to leave some out, but it is worth working through them methodically:

1. Identify the exact issue
When someone reports a problem to you, you can bet your bottom dollar that what is reported is not the whole story. Seen through the eyes of a user, the situation as described may not reflect engineering reality. Ensure you get a careful explanation and, if possible, a demonstration of the problem. It is your job to ascertain what the real problem is in real engineering terms. Often a problem presents intermittently. Don’t walk away from it, however, presuming it has gone forever - you can be assured that it will come back at the most inconvenient time. The problem could also be a combination of different issues. Recently, when trying to tune a process control loop that the operators had complained was sluggish, I discovered that I was actually dealing with high frequency signals (an aliasing problem) - it wasn’t a tuning problem, after all, but a filtering one.
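The aliasing effect in that last example can be illustrated with a minimal Python sketch. The function name and the example frequencies are assumptions made for illustration; they are not values from the loop described. The sketch computes the apparent frequency of a sinusoid once it has been sampled: anything above half the sampling rate folds back and can look like a slow oscillation, which is easily mistaken for a sluggish loop.

```python
def alias_frequency(signal_hz: float, sample_hz: float) -> float:
    """Apparent frequency (Hz) of a sinusoid after sampling at sample_hz."""
    if sample_hz <= 0:
        raise ValueError("sample_hz must be positive")
    # Fold the signal frequency into the range [0, sample_hz / 2].
    folded = signal_hz % sample_hz
    return min(folded, sample_hz - folded)


# Hypothetical example: a 9 Hz disturbance sampled at 10 Hz.
print(alias_frequency(9.0, 10.0))  # prints: 1.0
```

In this assumed case the 9 Hz disturbance shows up as a 1 Hz wobble in the sampled data, so no amount of retuning would remove it; filtering ahead of the sampler (or sampling faster) is the remedy.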

2. Reproduce the problem
It is best to reproduce the problem where possible. You can then observe the full sequence of events, view the error messages and analyse other variables that may be affecting it. If the problem is intermittent, you may need to train the user to do basic diagnostics (such as operating a protocol analyser or vibration analyser) to collect the right statistics and data. In one case, for example, a network card performed erratically only once the afternoon sun had warmed up the control room and heated the card. Without this knowledge, the fault would have been difficult to reproduce. Another example is an office network slowing to a crawl at 2pm every day for 30 minutes because someone had scheduled automatic backups for that time.
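Time-of-day patterns like these often only become visible once the incident times are tallied. The following is a minimal Python sketch, assuming the user has noted each occurrence as an ISO 8601 timestamp; the function name and the log entries are invented for illustration.

```python
from collections import Counter
from datetime import datetime


def incidents_by_hour(timestamps):
    """Count incidents per hour of day (0-23) from ISO 8601 timestamp strings."""
    return Counter(datetime.fromisoformat(ts).hour for ts in timestamps)


# Hypothetical log entries, invented for illustration only.
log = [
    "2012-03-19T14:05:00",
    "2012-03-20T14:12:00",
    "2012-03-21T09:40:00",
    "2012-03-22T14:01:00",
]
counts = incidents_by_hour(log)
busiest_hour, occurrences = counts.most_common(1)[0]
print(busiest_hour, occurrences)  # prints: 14 3
```

A cluster in one hour suggests looking for whatever else happens at that time, such as a scheduled backup or a change in ambient temperature.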

3. Localise, isolate and zone in
Now you have to zone in on the equipment or software module that is responsible for the problem. The trick is to penetrate the thicket of equipment and find the precise element causing it. Remember that seemingly unrelated elements can cause problems. It is also vitally important to identify exactly what happened before the problem occurred. Was a card changed out and the IP address not updated on the server, for example? (This was a particularly awkward one that caused an aluminium refinery to shut down - the users didn’t understand MAC and IP addressing.) Or was there a sudden power surge? Or was the RTU exposed to excessive heat?

4. Make a plan
Ensure that you assess carefully what is required. Beware the law of unintended consequences. The process of fixing something may cause other unexpected problems (a colleague of mine located and remedied severe harmonic problems in a plant network, but blew up three of my precious variable speed drives with overvoltage in the process). When going through your plan step by step, you may find that other issues appear that you hadn’t considered. It is worth reflecting on each item of the fix to test for these unexpected consequences. In replacing a valve, for example, you may find that the loop controller needs to be tuned again, as the valve’s parameters are slightly different. Or a replacement instrument may have subtly different ranges, which require updating in the PLC code and SCADA configuration.

5. Trace your steps
Ensure that when you fix the problem, you know exactly what you have done in case you need to retrace your steps later to put the equipment back into its original state.
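One way to keep such a trail for configuration changes is sketched below in Python. This is only an illustration under stated assumptions: the `ChangeLog` class, its methods and the controller settings are all hypothetical, and a real system would more likely rely on its own change-management tools or a paper log.

```python
from dataclasses import dataclass, field

_MISSING = object()  # marks a setting that did not exist before the change


@dataclass
class ChangeLog:
    """Records each setting change so it can be undone in reverse order."""
    settings: dict
    history: list = field(default_factory=list)

    def change(self, key, new_value, reason):
        # Store the old value (or the "missing" marker) before overwriting it.
        old_value = self.settings.get(key, _MISSING)
        self.history.append((key, old_value, new_value, reason))
        self.settings[key] = new_value

    def roll_back(self):
        # Undo changes newest-first to restore the original state.
        while self.history:
            key, old_value, _new_value, _reason = self.history.pop()
            if old_value is _MISSING:
                self.settings.pop(key, None)
            else:
                self.settings[key] = old_value


# Hypothetical controller settings, invented for illustration.
controller = {"gain": 1.2, "integral_s": 30.0}
log = ChangeLog(controller)
log.change("gain", 0.8, "reduce overshoot")
log.change("filter_s", 2.0, "add input filter")
log.roll_back()
print(controller)  # prints: {'gain': 1.2, 'integral_s': 30.0}
```

Recording the reason alongside each change also gives you most of the raw material for the documentation step later on.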

6. Test and retest
Test and retest over a period of time before accepting that the problem has been fixed. If there is any doubt about whether the problem has been fixed, there is no doubt - it is, most probably, still a problem. Many leave this step out, and the result is irritating for everyone when the whole process has to start again. Finally, ensure the user actually confirms that he or she is happy with the fix and that everything works satisfactorily.

7. Document for an absolute moron
People who come after you may not be aware of what you have done and how you have solved the problem. The problem may reappear, or something similar may happen to another piece of equipment. So document in exhaustive detail for someone who may have no knowledge of what you have done. This is something that we, as engineering professionals, are not especially enthusiastic about. It is, however, critical to the process. Naturally, ensure the documented fix is easily accessible to anyone, and not hidden in some arcane folder on the server.

8. Communicate with the client or user
Often the user is not convinced the problem has been fixed. Your job is to communicate honestly what you have done and why the problem has been fixed. Don’t treat the user as a complete idiot, but as a real partner in operating your facility. This is important for your credibility (and for that of the engineering profession).

I like Anthony J. D'Angelo’s take on troubleshooting and fixing things. He gives the following exhortation: ‘Become a fixer, not just a fixture’.

Yours in engineering learning

Steve

Mackay’s Musings – 27th March’12 #471
125,273 readers