Data preparation for AI success

The secret to AI success? Focusing on data preparation

Engineers often look to the AI model as the key to delivering highly accurate results, but it is often the data that determines an AI model’s success. 

Datasets are essential to AI models. They provide the ground truth by which we train AI models and measure a model’s success. Data flows through every step of the AI workflow, from model training to deployment, and the way it is prepared can be the main driver of accuracy when designing robust AI models.

Engineers can use these tips to improve their data preparation process and drive success when developing a complete AI system.

Tip 1: Don’t settle for the data you have

If you do not have enough data for your AI model, don’t settle for the data you have. There are various techniques you can use to augment existing data, generate new data, and overcome a data shortage:

Generate new data through simulation of a physical model, a common approach in predictive maintenance applications. Consider, for example, the case of a hydraulic pump used in oil extraction. You often know what the critical failure causes are, such as a seal leak in the pump. Such failures rarely happen and are destructive, making it very difficult to get actual failure data. With tools such as Simulink and Simscape, which allow engineers to design and simulate physical systems, you can create a realistic model of the pump and use it to run simulations under various failure scenarios.

The synthetic data produced by these simulations can then be used to train AI models, alleviating the lack of data and allowing engineers to stay focused on building accurate models.
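As a rough illustration of the simulation idea, the Python sketch below generates labeled pressure signals for a pump at several seal-leak severities. It is not a Simulink or Simscape model: the function names, the nominal pressure, the ripple, and the way a leak lowers pressure are all assumptions made up for this example.

```python
import math
import random

def simulate_pump_pressure(leak_fraction, n_samples=200, seed=0):
    """Toy stand-in for a physical pump model (hypothetical, not Simscape).

    leak_fraction: 0.0 means a healthy seal, 1.0 a complete loss of pressure.
    Returns a list of outlet pressure samples in bar.
    """
    rng = random.Random(seed)
    nominal_bar = 100.0  # assumed nominal outlet pressure
    ripple_bar = 2.0     # assumed ripple caused by the piston strokes
    samples = []
    for k in range(n_samples):
        ripple = ripple_bar * math.sin(2 * math.pi * k / 20)
        noise = rng.gauss(0.0, 0.5)  # assumed sensor noise
        samples.append(nominal_bar * (1.0 - leak_fraction) + ripple + noise)
    return samples

def build_synthetic_dataset(leak_levels, runs_per_level=5):
    """Run the toy simulation at each fault severity and label every run."""
    dataset = []
    for level_index, level in enumerate(leak_levels):
        label = "healthy" if level == 0.0 else "seal_leak"
        for run in range(runs_per_level):
            seed = 1000 * level_index + run  # a distinct, repeatable seed per run
            dataset.append((simulate_pump_pressure(level, seed=seed), label))
    return dataset

# Example: five runs each for a healthy pump and two leak severities.
dataset = build_synthetic_dataset([0.0, 0.1, 0.3])
```

In practice the toy function would be replaced by runs of a validated physical model, and the fault scenarios would be chosen with the help of engineers who know how the real equipment fails.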

Tip 2: More data doesn’t necessarily mean a more successful model

A common frustration when designing AI models is that even with large amounts of data, the performance of the model does not increase.

For any problem where more data isn’t translating to higher accuracy, the solution lies in cleaning, cropping, labeling, and transforming the data to provide as many high-quality data samples to the model as possible. You should look for tools that can help in creating clean data samples.

Tools such as Computer Vision Toolbox and Signal Processing Toolbox provide engineers with automated video labeling and signal labeling capabilities, which help create clean samples that can be introduced quickly to the model for training.
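To make the cleaning step concrete, here is a minimal Python sketch that drops missing or unlabeled readings, exact duplicates, and gross outliers from a list of labeled samples. The function name, the (value, label) data layout, and the outlier threshold are assumptions chosen for illustration; a real project would tune these rules to its own data.

```python
import math
from statistics import median

def clean_samples(samples, max_robust_z=3.5):
    """Return samples with missing values, duplicates, and outliers removed.

    samples: list of (value, label) pairs; value may be None or NaN,
    and label may be empty for unlabeled readings.
    """
    # 1. Drop readings that are missing or have no label.
    valid = [(v, lab) for v, lab in samples
             if v is not None and not math.isnan(v) and lab]

    # 2. Drop exact duplicates while preserving the original order.
    seen = set()
    unique = []
    for item in valid:
        if item not in seen:
            seen.add(item)
            unique.append(item)
    if not unique:
        return []

    # 3. Drop gross outliers using the median absolute deviation (MAD),
    #    which is less affected by the outliers themselves than the mean.
    values = [v for v, _ in unique]
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return unique  # no spread to judge outliers against
    return [(v, lab) for v, lab in unique
            if 0.6745 * abs(v - med) / mad <= max_robust_z]

raw = [(1.0, "ok"), (1.1, "ok"), (1.1, "ok"), (None, "ok"),
       (float("nan"), "ok"), (0.9, "ok"), (1.2, ""), (50.0, "ok"), (1.05, "ok")]
cleaned = clean_samples(raw)
```

Here the duplicate 1.1 reading, the two missing values, the unlabeled 1.2 reading, and the 50.0 outlier are removed, leaving four usable samples.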

Tip 3: Apply your domain expertise to transform your data

Accurate models are never a surprise to the engineers creating them when they are built on thoughtful, well-prepared data. This is especially important to engineers and scientists using signal data. Raw signal data is rarely fed directly to AI models, as it tends to be noisy and memory intensive. Instead, time-frequency techniques are often used to transform the data and extract the most important features for the model to learn.
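One common time-frequency technique is the short-time Fourier transform, which splits a signal into windows and computes a spectrum for each. The Python sketch below uses a plain discrete Fourier transform with a Hann window to reduce each window to a single feature, its dominant frequency. The function names, window size, and test signal are assumptions for illustration; production code would normally use an optimized FFT library.

```python
import cmath
import math

def stft_magnitudes(signal, window_size, hop):
    """Return one list of spectrum magnitudes per Hann-windowed frame."""
    frames = []
    for start in range(0, len(signal) - window_size + 1, hop):
        frame = signal[start:start + window_size]
        # The Hann window tapers the frame edges to reduce spectral leakage.
        windowed = [x * 0.5 * (1 - math.cos(2 * math.pi * n / (window_size - 1)))
                    for n, x in enumerate(frame)]
        mags = []
        for k in range(window_size // 2 + 1):  # non-negative frequency bins
            acc = sum(x * cmath.exp(-2j * math.pi * k * n / window_size)
                      for n, x in enumerate(windowed))
            mags.append(abs(acc))
        frames.append(mags)
    return frames

def dominant_frequencies(signal, sample_rate, window_size=64, hop=64):
    """Reduce each frame to the frequency (Hz) of its strongest non-DC bin."""
    features = []
    for mags in stft_magnitudes(signal, window_size, hop):
        k = max(range(1, len(mags)), key=mags.__getitem__)  # skip the DC bin
        features.append(k * sample_rate / window_size)
    return features

# Assumed test signal: a 4 Hz tone followed by a 12 Hz tone, sampled at 64 Hz.
rate = 64
signal = [math.sin(2 * math.pi * 4 * n / rate) for n in range(128)]
signal += [math.sin(2 * math.pi * 12 * n / rate) for n in range(128)]
features = dominant_frequencies(signal, rate)
```

The 256 raw samples are reduced to four values, one per window, that capture the change in frequency content over time. This is the kind of compact, domain-informed feature a model can learn from more easily than the raw waveform.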

Tip 4: Use data as insight into your model

Using debugging techniques, engineers can ask the model why a certain category was predicted and which features the model focused on most when predicting that category.

Debugging techniques such as LIME provide insight into the model through data. Data is therefore equally important to the debugging process, offering insight into both the model and its most important features, with the opportunity to use the debugging information for model improvement.
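The idea behind this kind of explanation can be sketched in a few lines of Python. The example below is a simplified, perturbation-based local explanation in the spirit of LIME, not the LIME library or its exact algorithm: it perturbs one input, queries the model, weights each perturbed point by its closeness to the original, and estimates a weighted slope per feature. LIME itself fits a weighted multivariate linear model; the per-feature slopes, the function names, and the black-box model here are assumptions made for illustration.

```python
import math
import random

def explain_locally(model, instance, n_samples=500, scale=0.5,
                    kernel_width=1.0, seed=0):
    """Estimate how strongly each feature drives the model near `instance`."""
    rng = random.Random(seed)
    n_features = len(instance)
    deltas, outputs, weights = [], [], []
    for _ in range(n_samples):
        # Perturb every feature independently around the instance.
        delta = [rng.gauss(0.0, scale) for _ in range(n_features)]
        point = [x + d for x, d in zip(instance, delta)]
        dist_sq = sum(d * d for d in delta)
        deltas.append(delta)
        outputs.append(model(point))
        # Nearby points count more than distant ones.
        weights.append(math.exp(-dist_sq / kernel_width ** 2))

    total_w = sum(weights)
    mean_y = sum(w * y for w, y in zip(weights, outputs)) / total_w
    slopes = []
    for j in range(n_features):
        mean_d = sum(w * d[j] for w, d in zip(weights, deltas)) / total_w
        cov = sum(w * (d[j] - mean_d) * (y - mean_y)
                  for w, d, y in zip(weights, deltas, outputs))
        var = sum(w * (d[j] - mean_d) ** 2 for w, d in zip(weights, deltas))
        slopes.append(cov / var)  # weighted slope of output vs. feature j
    return slopes

def black_box(x):
    """Hypothetical classifier score that depends mostly on feature 0."""
    return 1.0 / (1.0 + math.exp(-(3.0 * x[0] - 0.2 * x[1])))

importances = explain_locally(black_box, [0.0, 0.0, 0.0])
ranking = sorted(range(len(importances)),
                 key=lambda j: abs(importances[j]), reverse=True)
```

A ranking like this tells the engineer which inputs the model leaned on for a particular prediction. If the top feature makes no physical sense, that is a cue to revisit the training data.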

In Summary:

  • Look for the right data
  • Ensure you clean the data to develop better models
  • Apply domain expertise to transform your data
  • Use data to debug your model and gain insight


The authors are:

Dr Amod Anandkumar, Application Engineering Manager at MathWorks, leads a team of application engineers helping clients across industries successfully adopt and implement technologies such as AI, automated driving, and wireless communications.

Johanna Pingel is a Product Marketing Manager at MathWorks. She focuses on machine and deep learning applications and making AI practical, entertaining, and achievable. 

