Big Data

Big Datasets: Analysing, Processing and Creating the Right Model

By: Dave Oswill, MATLAB product manager, MathWorks

There is a wealth of information and value in the enormous sets of data collected from smart sensors embedded in measuring instruments, manufacturing equipment, medical devices, connected cars, aircraft, and power generation equipment. From this data, engineers can build models that not only describe physical phenomena but also predict when maintenance on expensive equipment (such as aircraft) is best performed, saving money on unnecessary maintenance and unplanned downtime. Such models can also forecast when to turn on expensive power generation plants, and they can be embedded into medical devices or vehicles to increase their efficacy and performance.

These models can differentiate a company's products and services from those of its competitors. But as with any system operating in the real world, the data collected from these systems and devices is far from perfect. External influences on the data need to be understood before an effective model can be created.

A scientist or an engineer has the domain expertise and knowledge needed to decipher the data. What they may still need is a software analysis and modeling tool that enables them to identify trends in the data, clean and correct dirty data, and determine the most influential signals in large datasets so that they can implement a practical model.

Exploring and Processing Large Sets of Data
Before creating a model or theory from data, it’s important to understand what is in the data, as it may have a major impact on the final result. This typically involves:
· Identifying slow-moving trends or infrequent events spread across the data that need to be taken into account in the theory or model.
· Cleaning bad or missing data before a valid model or theory can be established.
· Deriving additional information for use in later analysis and model creation.
· Finding the data that is most relevant for your theory or model.

The following capabilities can help an engineer explore and understand data, even when it is too big to fit into the memory of a desktop workstation.

1. Visualization
Summary visualizations, such as the binScatterPlot shown below, make it easy to see patterns in large datasets. Rather than drawing every point, the binScatterPlot groups data points into bins and uses color intensity to highlight areas with greater concentrations of points. A slider control that adjusts the color intensity lets one explore large datasets interactively and gain insights quickly.


Figure 2: binScatterPlot in MATLAB. Copyright: © 1984–2017 The MathWorks, Inc.
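
The idea behind a binned summary plot can be sketched in a few lines: count how many points fall into each cell of a grid, then map the counts to color intensity. The sketch below is written in plain Python purely for illustration; the function name bin_counts and the sample values are hypothetical, and it is not a description of how binScatterPlot itself is implemented.

```python
def bin_counts(xs, ys, x_min, x_max, y_min, y_max, n_bins):
    """Return an n_bins x n_bins grid of point counts (row = y bin, column = x bin)."""
    grid = [[0] * n_bins for _ in range(n_bins)]
    x_width = (x_max - x_min) / n_bins
    y_width = (y_max - y_min) / n_bins
    for x, y in zip(xs, ys):
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue  # ignore points outside the plotted range
        # Clamp so that a point exactly on the upper edge lands in the last bin.
        col = min(int((x - x_min) / x_width), n_bins - 1)
        row = min(int((y - y_min) / y_width), n_bins - 1)
        grid[row][col] += 1
    return grid


# Hypothetical sample: three points cluster near the origin, two near (1, 1).
xs = [0.1, 0.2, 0.15, 0.9, 1.0]
ys = [0.1, 0.2, 0.05, 0.9, 1.0]
grid = bin_counts(xs, ys, 0.0, 1.0, 0.0, 1.0, n_bins=2)
print(grid)  # [[3, 0], [0, 2]]
```

Because only the grid of counts has to be kept, the memory needed depends on the number of bins rather than on the number of data points.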

2. Data Cleansing
Real-world data almost always contains outliers and bad or missing entries. These entries need to be removed or replaced before the data can be properly understood or interpreted. Having a way to clean the data programmatically also means the same steps can be applied to new data as it’s collected and stored.
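
As a minimal sketch of programmatic cleaning, the Python function below replaces missing entries and outliers with the median of the valid readings. The outlier rule (more than three median absolute deviations from the median), the function name clean_signal, and the sample readings are all assumptions chosen for illustration; the right rule depends on the signal and the domain.

```python
import math
import statistics


def clean_signal(values, threshold=3.0):
    """Replace missing entries (None or NaN) and outliers with the median."""
    valid = [v for v in values if v is not None and not math.isnan(v)]
    median = statistics.median(valid)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = statistics.median([abs(v - median) for v in valid])
    cleaned = []
    for v in values:
        missing = v is None or math.isnan(v)
        outlier = (not missing) and mad > 0 and abs(v - median) > threshold * mad
        cleaned.append(median if (missing or outlier) else v)
    return cleaned


# Hypothetical sensor readings with two missing entries and one spike.
readings = [10.0, 10.2, None, 9.9, 250.0, float("nan"), 10.1]
print(clean_signal(readings))
# [10.0, 10.2, 10.1, 9.9, 10.1, 10.1, 10.1]
```

Wrapping the rule in a function means the same cleaning step can be rerun unchanged each time a new batch of data arrives.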

3. Data Reduction
The large number of signals collected from systems can make it difficult to find important trends and behaviors in the data. Much of the data may not be correlated with the behavior one is looking to predict or model. Calculating correlations across the data, and applying techniques such as Principal Component Analysis, allows engineers to reduce the data to only those signals that most influence the behavior they are modeling. Reducing the number of inputs produces a more compact model that requires less processing when it is embedded into a product or integrated within a service application.
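
The correlation step can be illustrated with a short Python sketch that ranks candidate signals by the absolute value of their Pearson correlation with the quantity being modeled. The signal names and values are hypothetical, and the sketch covers only the correlation screen, not Principal Component Analysis.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant signal carries no linear information
    return cov / math.sqrt(var_x * var_y)


def rank_signals(signals, target):
    """Return signal names ordered from most to least correlated with target."""
    scores = {name: abs(pearson(values, target)) for name, values in signals.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical data: a wear measurement and three candidate sensor signals.
wear = [1.0, 2.0, 3.0, 4.0, 5.0]
signals = {
    "temperature": [5.0, 3.0, 6.0, 2.0, 5.0],
    "vibration": [2.1, 3.9, 6.2, 8.0, 9.9],
    "constant": [7.0, 7.0, 7.0, 7.0, 7.0],
}
print(rank_signals(signals, wear))  # ['vibration', 'temperature', 'constant']
```

Signals at the bottom of such a ranking are candidates for removal, although a low linear correlation does not rule out a nonlinear relationship.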

4. Data Processing at Scale
Engineers and scientists are often most efficient when working on a local desktop workstation with tools they are familiar with. Working efficiently with big data, however, requires a software analysis and modeling tool that not only works with large sets of data on the desktop workstation but also lets them run the same analysis pipeline or algorithms on enterprise-class systems. The ability to move between systems without changing code greatly increases efficiency.
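
One common way to work with data that does not fit in memory is to process it one chunk at a time while keeping only a small running summary. The Python sketch below uses Welford's online algorithm to compute a mean and sample standard deviation this way. It is an illustration of the general idea, not of how any particular tool scales; the function name and the chunked input are assumptions.

```python
import math


def streaming_mean_std(chunks):
    """Mean and sample standard deviation over an iterable of chunks,
    holding only one chunk in memory at a time (Welford's algorithm)."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for chunk in chunks:
        for x in chunk:
            count += 1
            delta = x - mean
            mean += delta / count
            m2 += delta * (x - mean)
    variance = m2 / (count - 1) if count > 1 else 0.0
    return mean, math.sqrt(variance)


# Hypothetical chunks; in practice each one might be read from a file or a cluster.
chunks = [[1.0, 2.0], [3.0], [4.0, 5.0]]
mean, std = streaming_mean_std(chunks)
print(mean, round(std, 4))  # 3.0 1.5811
```

Because the function accepts any iterable, the same code works whether the chunks come from a small in-memory list or from a generator that reads a much larger source lazily.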

Creating Models
When data is collected over months or even years, what in it makes it valuable? In the case of Baker Hughes, information from temperature, pressure, vibration, and other sensors was collected over the lifetimes of many pumps. This data was analyzed to determine which signals in the data had the strongest influence on equipment wear-and-tear. This step included performing Fourier transforms and spectral analysis, as well as filtering out large disturbances to better detect the smaller vibrations of the valves and valve seats.
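
To make the spectral-analysis step concrete, the Python sketch below finds the dominant frequency of a sampled signal with a naive discrete Fourier transform. The signal is synthetic and every value in it is hypothetical; it does not represent the Baker Hughes data or the analysis they ran. A production tool would use a fast Fourier transform rather than this O(n²) loop.

```python
import cmath
import math


def dominant_frequency(samples, sample_rate):
    """Frequency in Hz of the strongest non-constant component (naive DFT)."""
    n = len(samples)
    best_bin, best_mag = 0, 0.0
    for k in range(1, n // 2 + 1):  # skip k = 0, the constant offset
        coeff = sum(
            s * cmath.exp(-2j * math.pi * k * i / n) for i, s in enumerate(samples)
        )
        if abs(coeff) > best_mag:
            best_bin, best_mag = k, abs(coeff)
    return best_bin * sample_rate / n


# Synthetic signal: constant offset, a strong 5 Hz tone, and a weaker 20 Hz tone.
sample_rate = 100  # samples per second
signal = [
    2.0
    + math.sin(2 * math.pi * 5 * i / sample_rate)
    + 0.3 * math.sin(2 * math.pi * 20 * i / sample_rate)
    for i in range(100)
]
print(dominant_frequency(signal, sample_rate))  # 5.0
```

In the frequency domain the weaker 20 Hz component appears as its own, smaller peak, which is why spectral analysis helps separate small vibrations from larger ones.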

The engineers discovered that data captured from pressure, vibration, and timing sensors allowed them to accurately predict machine failures. They used machine learning to create the models that were eventually used to predict actual failures from the large sets of data. Machine learning is commonly used in these situations because of the large number of observations (samples) and the many variables (sensor readings/machine data) that may be present in the data.

Machine learning techniques use computational methods to “learn” information directly from data without relying on a predetermined equation as a model. This ability to train models using the data itself opens up a broad spectrum of use cases for predictive modeling, such as predictive health for complex machinery and systems, physical and natural behaviors, energy load forecasting, and financial credit scoring.
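
A very small example shows what learning from data without a predetermined equation can look like. The Python sketch below trains a nearest-centroid classifier: it averages the labeled examples of each class and assigns a new observation to the class whose average is closest. The technique is chosen here only for brevity and is not the method described in the case above; the features, labels, and values are hypothetical.

```python
def train_centroids(features, labels):
    """Learn one centroid (mean feature vector) per class from labeled rows."""
    sums, counts = {}, {}
    for row, label in zip(features, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, row):
    """Return the label whose centroid is nearest to row (squared distance)."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


# Hypothetical training data: [pressure variation, vibration amplitude].
features = [
    [1.0, 0.2], [1.2, 0.3], [0.9, 0.1],  # pumps that kept running
    [3.0, 1.5], [3.4, 1.8], [2.8, 1.6],  # pumps that later failed
]
labels = ["healthy", "healthy", "healthy", "failing", "failing", "failing"]

model = train_centroids(features, labels)
print(predict(model, [1.1, 0.25]))  # healthy
print(predict(model, [3.1, 1.7]))   # failing
```

The decision rule is determined entirely by the training data: retraining on different observations yields a different model without any change to the code.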

Using Models
To truly take advantage of the value of big data, one must be able to incorporate the models and insight gained from the data into products, services, or operations. A direct path from development to the integration of an algorithm or predictive model into a device, vehicle, IT system, or web-based service allows an engineer to better adapt to changing environmental or business conditions and to address market needs more effectively.

Engineers and scientists are developing analytics and predictive models for numerous applications. The application dictates whether a model needs to be integrated with an enterprise IT application, used as part of an IoT system, or incorporated within an embedded system for local processing or for reducing the amount of data sent to a centralized analytics platform. For example:
· Connected Cars: Large amounts of real-world driving data are used to develop and implement algorithms for use within embedded systems to support driver-assist and self-driving capabilities.
· Manufacturing and Engineering Operations: Sensors on machinery are providing up-to-the-second information on the health and operation of refining, energy production, and manufacturing systems. This data is used to optimize the operation, yields, and up-time of these systems and requires integration as part of an enterprise IT application.
· Design and Reliability Engineering: Data is being captured from aircraft under test and real-world flight conditions and from mobile and medical devices. This data is being used by engineering and operations groups to improve the reliability, performance, and capabilities of these devices and systems.

Big data has the potential to greatly enhance products, services, and operations. But engineers and scientists need a software analysis and modeling tool that allows them to explore, process, and create models with big data using familiar syntax and functions, while also providing the ability to integrate these models and insights directly into products, systems, or operations. Tools like MATLAB that provide scalability and efficiency enable an engineer, as a domain expert, to also be a data scientist, while giving the company a competitive advantage in the global marketplace.
