By: Dave Oswill, MATLAB product manager, MathWorks
When you look at the process of scientific discovery and engineering today, working with big data is a critical step. With the help of smart sensors and technologies such as the Internet of Things (IoT), vast amounts of data can be collected from scientific instruments, manufacturing systems, connected cars, aircraft, and other sources. This data is significant because it may capture important physical phenomena or provide information on the operating environment, efficiency, and health of a system.
With the proper tools and techniques, this data can be used to make rapid scientific discoveries and to incorporate more intelligence into products, services, and manufacturing processes. This can help a company deliver better-performing products or services, as well as conform to regulatory requirements such as meeting engine fuel efficiency standards or providing better assisted-driving capabilities.
Gaining access to and working with this data may sound like an intriguing yet daunting task. Because of its value and size, big data is commonly stored and managed in large file shares, databases, or big data systems such as Hadoop or Spark. Not too long ago, applying advanced techniques such as machine learning to large data sets required computer scientists with experience in IT systems to work alongside engineering and scientific experts. The team would jointly support a workflow that includes:
- Accessing big data in files, databases, or the Hadoop Distributed File System (HDFS)
- Exploring, processing, and analyzing this data on specialized compute clusters
- Creating algorithms for use in embedded systems, business applications, and other services
Today, software analysis and modeling tools such as MATLAB have been enhanced with new capabilities for working with big data. Engineers and scientists who have the domain knowledge and experience to make design and business decisions can now conveniently access this data regardless of where it resides, and can work with it using familiar syntax and functions.
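As a sketch of what "familiar syntax on big data" looks like in practice, a folder of tabular files can be wrapped in a datastore and analyzed with tall arrays; the folder path and variable names below are assumptions for illustration:

```matlab
% Point a datastore at a folder of delimited text files that is too
% large to load into memory at once (path and variables are hypothetical).
ds = datastore('sensor_logs/*.csv');
ds.SelectedVariableNames = {'Timestamp','PumpPressure'};

% A tall array defers computation until gather is called, so familiar
% functions such as mean operate on the full out-of-memory data set.
t = tall(ds);
avgPressure = mean(t.PumpPressure, 'omitnan');
avgPressure = gather(avgPressure);   % triggers the deferred evaluation
```

The same code runs unchanged whether the underlying data fits in memory or not, which is what lets domain experts work with big data without restructuring their analysis.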
Engineers at Baker Hughes, a provider of services to oil and gas operators, needed to develop a predictive maintenance system to reduce pump equipment costs and downtime on their oil and gas extraction trucks. If a truck at an active site has a pump failure, Baker Hughes must immediately replace it to ensure continuous operation. Keeping spare trucks at every site costs the company tens of millions of dollars in revenue that those trucks could generate if they were in active use elsewhere. The inability to accurately predict when valves and pumps will require maintenance drives other costs as well: too-frequent maintenance wastes effort and replaces parts that are still usable, while too-infrequent maintenance risks damaging pumps beyond repair.
Terabytes of data were collected from the oil and gas extraction trucks, and this data was used to develop an application that predicts when equipment needs maintenance or replacement. MATLAB provided the Baker Hughes engineers with the functionality they needed to develop predictive models and to combine multiple kinds of data, including sensor data from a proprietary file format, into a single analysis application.
Accessing Large Sets of Data
The first challenge in working with big data is determining how to access large data sets, as they come in many different forms and are stored in various types of systems.
Many big engineering and scientific data sets consist of a large number of small or medium-sized files, although individual files are also becoming increasingly large and may no longer fit into the memory of a single computer. These files typically reside within one or more directories on a shared drive and may contain delimited text, spreadsheets, images, videos, and various proprietary formats.
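When the combined data is too large to load at once, a datastore can also process such a directory incrementally, one memory-sized chunk at a time; the folder path below is a hypothetical example:

```matlab
% Iterate over a folder of delimited text files in manageable chunks
% (the folder path is an assumption for illustration).
ds = tabularTextDatastore('shared_drive/measurements/*.txt');
ds.ReadSize = 50000;          % rows returned per read

totalRows = 0;
while hasdata(ds)
    chunk = read(ds);         % returns a table holding the next chunk
    totalRows = totalRows + height(chunk);
end
```

Each chunk arrives as an ordinary table, so existing per-file processing code can usually be reused inside the loop.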
A wide range of database types is also used to store and manage big data sets:
- Relational (SQL): Widely used for business applications, popular among IT developers.
- Data Warehouse: Based on relational (SQL) databases; houses business-critical data and provides analytical capabilities and fast access for business applications.
- NoSQL: Optimized for data that doesn’t fit into relational databases.
- Data Historians: Optimized for time-based, production, and process data that is commonly collected from industrial equipment.
- IoT Data Aggregators: Typically includes cloud-based services for aggregating time series data from connected sensors and devices. These services are typically accessed via web service calls.
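As one sketch of how a relational store on this list might be queried from MATLAB, the following assumes Database Toolbox and a hypothetical ODBC data source, credentials, table, and column names:

```matlab
% Connect to a hypothetical relational data source via ODBC
% (data source name, credentials, table, and columns are assumptions).
conn = database('PlantDB', 'analyst', 'password');

% Pull only the rows of interest rather than the whole table
query = ['SELECT Timestamp, PumpPressure FROM SensorReadings ' ...
         'WHERE PumpPressure > 900'];
readings = fetch(conn, query);   % returns the result set as a table

close(conn);
```

Filtering in the SQL query keeps the transfer small, leaving the database to do the work it is optimized for.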
Hadoop is another system for storing and processing big data sets, based on distributed computing and storage principles. It consists of two major subsystems that coexist on a cluster of compute servers:
- HDFS: A large, failure-resistant file system referred to as the Hadoop Distributed File System.
- YARN: Manages applications that run on Hadoop, including batch processing frameworks, such as MapReduce and Spark, and SQL interfaces, such as Hive and Impala.
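MATLAB datastores can point directly at data stored in HDFS by using an hdfs:// location. The namenode host, port, path, and Hadoop install location below are hypothetical, and the MATLAB session must be configured for the cluster's Hadoop installation:

```matlab
% Read delimited text data that lives in HDFS (host, port, path, and
% Hadoop install location are assumptions for illustration).
setenv('HADOOP_HOME', '/usr/local/hadoop');   % where Hadoop is installed
ds = datastore('hdfs://namenode:9000/data/trucks/*.csv');

% The same tall-array workflow then applies to data on the cluster.
t = tall(ds);
summary(t)
```

Because the datastore interface is the same, analysis developed against local files can be repointed at HDFS with little more than a path change.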
To efficiently capture the benefits of big data, engineers and scientists need a scalable tool, such as MATLAB, that provides access to the wide variety of systems and formats used to store and manage data. This is especially important when more than one type of system and format is in use. Sensor or image data stored in files on a shared drive may need to be combined with metadata stored in a database; in the case of Baker Hughes, data in many different formats had to be used together to understand the behavior of the system and develop a predictive model.
The ability to work with big data is fast becoming an important aspect of scientific discovery and engineering. These data sets contain invaluable information that can differentiate a company's products and services. As a scientist or engineer, you have the domain knowledge and experience to make design and business decisions with this data, but you may need a software analysis and modeling tool that is easy to work with. A tool such as MATLAB offers scalability and efficiency while providing your company with a competitive advantage in the global marketplace.