I clearly remember, from when computing started taking off, the wry remark about garbage in, garbage out. It described a marvelous computer program, developed at huge cost, being fed absolutely useless data and thus spitting out equally useless information. This is even more true today with the advent of machine learning and the internet of things (IoT). Imagine training a machine learning application with shoddy data. The results could be catastrophic. And this is indeed what is happening today. In the frantic rush to write the best programs to predict outcomes, we often lose sight of the desperate need to ensure our data is valid and reasonable. I have some feeling for this challenge, as I have just completed a course on machine learning.
This problem will definitely affect you as an engineering professional, so it is worth looking at a few tips for handling it.
Tuning a Plant Process Has the Same Challenges
Some of you may recall the agonies of tuning a plant with highly sophisticated programs to ensure that your P, I (and perhaps D) parameters were absolutely perfect. In fact, the operator would often spend a huge amount of time tuning a loop and neglect the most important element in the whole process, which was of course the instrument (e.g. a flow meter). It could be spitting out poor quality data to the loop controller, which then goes on to control a valve in a totally wrong fashion.
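To make the point concrete, here is a minimal sketch in Python of a PI flow loop fed by a flow meter that reads high. Everything in it is an assumption made for illustration: the first-order process model, the tuning values, the function name simulate_pi_loop and the size of the sensor bias. It is not a model of any real plant.

```python
def simulate_pi_loop(sensor_bias, setpoint=50.0, kp=0.8, ki=0.4,
                     gain=1.0, tau=5.0, dt=0.1, steps=3000):
    """Simulate a PI controller on a hypothetical first-order flow process.

    The controller only ever sees the measured flow (true flow plus
    sensor_bias), so it drives the measurement, not the real flow,
    to the setpoint. Returns (true_flow, measured_flow) at the end.
    """
    true_flow = 0.0
    integral = 0.0
    for _ in range(steps):
        measured = true_flow + sensor_bias       # what the flow meter reports
        error = setpoint - measured              # controller acts on the reading
        integral += error * dt                   # integral term accumulates error
        valve = kp * error + ki * integral       # PI control law
        # Assumed first-order process: flow lags the valve with time constant tau
        true_flow += dt * (gain * valve - true_flow) / tau
    return true_flow, true_flow + sensor_bias


if __name__ == "__main__":
    for bias in (0.0, 5.0):
        true_flow, measured = simulate_pi_loop(sensor_bias=bias)
        print(f"bias={bias}: measured={measured:.2f}, true={true_flow:.2f}")
```

With a bias of 5 units the loop looks perfectly tuned on the operator's screen, because the measured flow sits on the setpoint, yet the real flow settles about 5 units low. No amount of fiddling with P and I fixes that; only fixing the instrument does.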
Machine Learning Is the New Kid on the Block
What is happening now, particularly in commercial organisations, is the advent of machine learning applications. All these effectively represent are programs that learn to predict a particular outcome from past (training) data.
One gathers good quality data: for example, parameters describing the operation of a piece of equipment (e.g. a compressor), together with data on when it required maintenance (the parameter we are trying to predict). One trains the program with this data and then uses it to predict possible failures before they occur. The vibration gurus would laugh at this application of machine learning and exclaim that they have been doing it for decades. Absolutely right. There is nothing new under the sun!
Machine Learning Is Gathering Steam
However, these applications are steadily gathering steam because of the ability to gather data (IoT), cheap and powerful computing, and the interconnectedness of everything. The problem is that much of the data being gathered is of poor quality, and this degrades the machine learning application. Garbage in, garbage out.
Reasons for poor quality data are myriad and range from incorrect gathering (wrong or biased sampling, wrong calibration) to simple human error. Other examples of poor data are handwritten notes, badly labeled data and incorrect interpretations of variables.
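A toy illustration of garbage in, garbage out, as a hedged sketch only. The "model" below is deliberately the simplest thing that learns from data: a vibration threshold placed midway between the average healthy reading and the average faulty reading. The readings are made-up numbers, and the function names train_threshold and accuracy are hypothetical; a real application would use far more data and a proper learning library.

```python
from statistics import mean


def train_threshold(readings, needs_maintenance):
    """Learn a cut-off halfway between the mean healthy and mean faulty reading."""
    healthy = [r for r, label in zip(readings, needs_maintenance) if not label]
    faulty = [r for r, label in zip(readings, needs_maintenance) if label]
    return (mean(healthy) + mean(faulty)) / 2.0


def accuracy(threshold, readings, needs_maintenance):
    """Fraction of readings where 'reading above threshold' matches the label."""
    correct = sum((r > threshold) == label
                  for r, label in zip(readings, needs_maintenance))
    return correct / len(readings)


# Synthetic vibration readings (illustrative values only, not real data)
train_readings = [2.0, 2.5, 3.0, 3.5, 6.5, 7.0, 7.5, 8.0]
train_labels = [False] * 4 + [True] * 4
test_readings = [2.2, 3.1, 3.8, 4.4, 5.6, 6.2, 6.9, 7.7]
test_labels = [False] * 4 + [True] * 4

# Same training set, but logged by a sensor assumed to read 2.0 units high
miscalibrated_readings = [r + 2.0 for r in train_readings]

clean_threshold = train_threshold(train_readings, train_labels)
bad_threshold = train_threshold(miscalibrated_readings, train_labels)

print("trained on clean data:        ",
      accuracy(clean_threshold, test_readings, test_labels))
print("trained on miscalibrated data:",
      accuracy(bad_threshold, test_readings, test_labels))
```

The program and the labels are identical in both runs. Only the calibration of the training sensor changes, and with these numbers the model trained on the miscalibrated readings misses three of the four machines that need maintenance.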
A Few Tips for Ensuring Your Data Is Useful
A few suggestions on dealing with this problem:
Ensure your objectives match up with the data you are gathering. For example, ensure the right sensors are gathering the correct data to predict a particular machine’s failure.
Build in oodles of time to clean the data and to check its integrity before you even consider processing it in a machine learning application. You may find that cleaning the data takes up as much as 80% of the overall project time. You may also find you have sampled the data at the wrong sampling frequency or to the wrong level of accuracy, or that there are some strange biases in the signal which need to be fixed before it is usable.
Keep an Audit Trail. Keep all your data backed up, even the raw preliminary data. Show clearly how you have processed it into a usable format. You may need to go back, because of some unexpected problem with your machine learning application, and find out why a particular sensor reported values the way it did.
Make One Person Responsible for the data, its processing and its subsequent use. This is your go-to person when anomalies are detected or unusual results are predicted by your machine learning application. This individual would know the data and the process backwards.
Independent Review. If you are handling a particularly big project, it is definitely worthwhile getting an independent expert to review the entire process of gathering, processing and cleaning the data and then applying it to a machine learning application. They may find issues you didn’t think of. Also connect your operators and users of the program with the programming geeks and developers of the program. The operators may have highly experienced insights into running the plant whose performance you are trying to predict with your machine learning application.
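The cleaning and audit trail tips above lend themselves to a small sketch in Python. It checks a series of timestamped samples for three of the problems mentioned (irregular sampling, out-of-range values and a sensor that appears stuck) and fingerprints the raw data so that later processing can be traced back to it. The function names check_samples and fingerprint, the limits and the sample values are all assumptions for illustration; sensible limits depend entirely on your instrument and process.

```python
import hashlib


def fingerprint(raw_bytes):
    """SHA-256 digest of the raw data, recorded so any later change is detectable."""
    return hashlib.sha256(raw_bytes).hexdigest()


def check_samples(samples, expected_interval_s, low, high,
                  stuck_run=5, tolerance_s=0.01):
    """Return a list of (index, issue) pairs for a list of (timestamp_s, value)."""
    issues = []
    run_length = 1
    for i, (t, value) in enumerate(samples):
        if not (low <= value <= high):
            issues.append((i, "out_of_range"))
        if i > 0:
            prev_t, prev_value = samples[i - 1]
            # Sampling interval differs from what was specified
            if abs((t - prev_t) - expected_interval_s) > tolerance_s:
                issues.append((i, "irregular_interval"))
            # Count consecutive identical values; flag once the run is long enough
            run_length = run_length + 1 if value == prev_value else 1
            if run_length == stuck_run:
                issues.append((i, "possibly_stuck"))
    return issues


# Made-up raw log: one missed sample, one spike, then a flat-lined signal
raw_log = b"0,4.1\n1,4.3\n2,4.2\n4,4.4\n5,99.0\n6,4.5\n7,4.5\n8,4.5\n"
samples = [(float(t), float(v))
           for t, v in (line.split(",") for line in raw_log.decode().splitlines())]

audit_trail = [{"step": "raw data received", "sha256": fingerprint(raw_log)}]
issues = check_samples(samples, expected_interval_s=1.0,
                       low=0.0, high=10.0, stuck_run=3)
audit_trail.append({"step": "integrity check", "issues": issues})

for entry in audit_trail:
    print(entry)
```

Checks like these do not clean the data for you. They simply tell the person responsible for the data where to look, and the recorded fingerprint lets a reviewer confirm that the raw file is the one the processing started from.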
Machine learning has huge power but it should always be grounded in a common sense approach. And as a good friend of mine remarked so wisely: Common sense ain’t so common around here.
My appreciation to Thomas C. Redman for a great article in Harvard Business Review entitled: If Your Data Is Bad, Your Machine Learning Tools Are Useless. I have applied it to my experiences in process control and industrial automation and to my new passion of machine learning.
Yours in engineering learning
Mackay’s Musings – 3rd April’18 #670
125,273 readers – www.idc-online.com/blogs/