A research report from IDC forecast that sales of business analytics and big data solutions would reach approximately USD 200 billion by the end of 2020. The report (IDC Forecasts, 2019) also suggests that sales will grow at a compound annual growth rate (CAGR) of 13.2% through 2022.
It was also reported that at USD 77.5 billion, IT services was the largest category of this market in 2019; the next largest category was hardware purchases (USD 23.7 billion), followed by business services (USD 20.7 billion). Gartner Trends, 2019 also points towards large sales volumes in big data solutions and software. Evidently, data is becoming increasingly important in today’s enterprises.
There are three primary reasons identified for this trend. One, there is a need for data management and data governance in modern enterprises. This is evident from the fact that the software market for end-user query, reporting, and relational database management software is worth about USD 25.7 billion (IDC Forecasts, 2019). Two, there is the ongoing shift of data from on-premise systems to the cloud. Three, machine learning (ML) and artificial intelligence (AI) are growing as enabling technologies in software for data management and analytics.
Apache Hadoop provides big data capability through a collection of open source applications that use a network of machines (computers) to manage large amounts of information and data. Storage and processing are done in a distributed environment, which greatly improves the capacity and speed of both. As enterprises continue to look at technologies like Apache Hadoop to make informed assessments, information management professionals also need to be adept at using such technologies.
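The idea of splitting work across machines and then combining the partial results can be illustrated with a minimal, single-process sketch in Python. This is not the Hadoop API; the record strings, the chunk size, and the function names `map_chunk` and `reduce_counts` are assumptions made for illustration only. Each chunk stands in for the share of data that one machine would hold.

```python
from collections import Counter
from functools import reduce

def map_chunk(lines):
    """'Map' step: count the words in one chunk of records."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def reduce_counts(left, right):
    """'Reduce' step: merge two partial word counts."""
    return left + right

# Hypothetical records; in a real cluster these would be spread over many machines.
records = ["toll paid at plaza a", "toll paid at plaza b", "invoice paid"]

# Split the records into chunks of two, one chunk per (imaginary) machine.
chunks = [records[i:i + 2] for i in range(0, len(records), 2)]

# Each chunk is processed independently, then the partial results are merged.
partials = [map_chunk(chunk) for chunk in chunks]
total = reduce(reduce_counts, partials, Counter())
```

Because each chunk is processed without reference to the others, the map step could in principle run on separate machines at the same time, which is the property distributed frameworks exploit.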
In this context, we deliberate on the relevance of big data technology in data management and administration. We draw attention to specific challenges and relevant solutions in big data technologies. We also discuss the prospects and benefits of big data technology for record management and administration, as well as technologies to monetise data (data analytics methods).
The barricades and roadblocks
The challenges relate to maintaining data quality and systems for data cleansing, maintaining consistency in metadata, and finding ways to remove data redundancy. Management and technology strategies to reduce the cost of maintaining records also remain a significant bridge to cross. Other impediments include the handling of semi-structured and unstructured data including video, text, and audio.
In data management, the role of SQL and NoSQL databases is critical as they are the data repositories. SQL databases include the ones developed and marketed by software providers like Oracle, IBM (DB2), and Microsoft (SQL Server). They are relational and store data in tabular format. NoSQL databases like MongoDB, Redis, Neo4j, and DynamoDB are capable of storing non-textual data like images and video in a cost-efficient manner.
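The difference between the two storage styles can be sketched in Python. The relational half uses the standard-library `sqlite3` module as a stand-in for a commercial SQL database; the document half uses a plain dictionary serialised as JSON to suggest the shape of a record in a document store, and does not use any real NoSQL client API. The table name, field names, city, and file name are hypothetical.

```python
import json
import sqlite3

# Relational style: a fixed schema, with rows stored in a table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vendor (id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT)"
)
conn.execute(
    "INSERT INTO vendor (name, city) VALUES (?, ?)", ("Amit Kumar", "Ranchi")
)
row = conn.execute("SELECT name, city FROM vendor WHERE id = 1").fetchone()

# Document style: a flexible, nested record that can carry references to
# non-tabular content such as scanned images (hypothetical fields).
document = {
    "_id": 1,
    "name": "Amit Kumar",
    "attachments": [{"type": "image", "uri": "invoice-scan.png"}],
}
serialised = json.dumps(document)
restored = json.loads(serialised)
```

The table requires every row to fit the declared columns, whereas the document can gain new nested fields without a schema change.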
So what exactly is the data that we are talking about? According to the ARMA guide to Information Profession, data is any symbol or character that represents raw facts or figures and forms the basis of information.
Similarly, big data are information assets that demand cost-effective, innovative forms of information processing since the data is high in volume (size in terms of the number of records in a file), variety (types of data: text, numbers, images, video) and velocity (the speed at which data gets collected).
There is no fixed threshold, whether in the number of records in a file or in size in gigabytes, terabytes, or petabytes, that defines big data. It is a moving target: data that is fairly large even for database management systems falls in the realm of big data.
Enterprises need to ensure that data quality is at a suitable level before any analysis can be done. Better methods of defining metadata and entering data can help improve data quality. This will, in turn, lower the cost of data cleansing. As an example, let’s consider procurement data for any company which has a large ecosystem of suppliers and vendors.
Vendor details are usually entered in vendor master, invoice, purchase requisition (PR), and purchase order (PO) data in ERP systems. If the name of a vendor is entered in different ways (Amit Kumar, Kumar Amit, A. Kumar, or Kumar A.), it becomes difficult to identify that vendor by a unique name, which is what data analytics software relies on.
If the name and surname of a person are entered in two different ways, the software would consider the same person as two different people. Unless this data is made consistent, any data analytics applied to it would give results that are faulty and cannot be relied upon.
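A minimal sketch of one possible cleansing step for the name variants described above is shown here in Python. It assumes a simple rule (lowercase the name, drop punctuation, sort the words) and a hypothetical helper named `vendor_key`; real vendor matching would usually also rely on identifiers such as tax or bank details.

```python
import re

def vendor_key(name: str) -> str:
    """Return an order-insensitive key for a vendor name (hypothetical rule)."""
    # Lowercase the name and keep only alphabetic tokens, dropping punctuation.
    tokens = re.findall(r"[a-z]+", name.lower())
    # Sort the tokens so that "Amit Kumar" and "Kumar Amit" share one key.
    return " ".join(sorted(tokens))

names = ["Amit Kumar", "Kumar Amit", "A. Kumar", "Kumar A."]

# Group the raw spellings by their normalised key.
groups = {}
for name in names:
    groups.setdefault(vendor_key(name), []).append(name)
```

Under this rule the four spellings collapse into two groups: the full-name variants match each other, and the initial-only variants match each other. Deciding whether "A. Kumar" is the same vendor as "Amit Kumar" needs additional information, which is why consistent data entry at source remains the cheaper fix.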
A significant challenge in big data relates to implementation. Executives often do not know how each application in Apache Hadoop relates to the others within the big data ecosystem, or what benefits can be obtained from these individual applications. A major challenge for enterprises is to determine the most appropriate solution provider to employ. In a survey conducted globally, almost all the CXOs surveyed mentioned that big data technology is important to their enterprises’ growth.
The solutions
Data needs to be organised and analysed to achieve the purposes for which it is kept. As an example, we cite a case of data related to commercial vehicle electronic toll charges (FASTag data) available with the National Highways Authority of India (NHAI). The accompanying table presents an extract of this data. It contains multiple variables such as date, time, financial transaction type, unique transaction ID, and point-of-sale (POS) location.
An analyst investigating the owner of this vehicle for under-declared income would need to calculate the distance between the POS locations, i.e., toll gates, and obtain a valid average market rate for this type of vehicle. Rates differ for different classes of vehicles like heavy commercial vehicles, light commercial vehicles, and buses. The expected income can be computed by multiplying the two factors, rate and distance, and then compared with the income declared to the income tax authorities.
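The calculation described above can be sketched in Python. Every figure here is hypothetical: the plaza names, the distances between them, the per-kilometre rates, the timestamps, and the declared income are invented for illustration and are not taken from the NHAI data. The function name `expected_income` is likewise an assumption.

```python
from dataclasses import dataclass

@dataclass
class TollEvent:
    timestamp: str  # ISO 8601 text, so string order matches time order
    plaza: str      # point-of-sale (POS) location

# Hypothetical road distances in km between pairs of toll plazas.
DISTANCE_KM = {
    frozenset({"Plaza A", "Plaza B"}): 120.0,
    frozenset({"Plaza B", "Plaza C"}): 80.0,
}

# Hypothetical average market rates (currency units per km) by vehicle class.
RATE_PER_KM = {"HCV": 40.0, "LCV": 25.0, "BUS": 30.0}

def expected_income(events, vehicle_class):
    """Multiply the distance travelled between toll gates by the class rate."""
    ordered = sorted(events, key=lambda e: e.timestamp)
    total_km = 0.0
    for prev, cur in zip(ordered, ordered[1:]):
        # Pairs of plazas with no known distance contribute nothing.
        total_km += DISTANCE_KM.get(frozenset({prev.plaza, cur.plaza}), 0.0)
    return total_km * RATE_PER_KM[vehicle_class]

events = [
    TollEvent("2020-01-05T11:00", "Plaza B"),
    TollEvent("2020-01-05T08:00", "Plaza A"),
    TollEvent("2020-01-05T13:30", "Plaza C"),
]

expected = expected_income(events, "HCV")
declared = 5000.0            # hypothetical declared income for the same trips
gap = expected - declared    # a positive gap may warrant a closer look
```

A gap on its own is not proof of under-declaration; it only flags vehicles whose toll activity is out of line with the income reported for them.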
A large variety of technologies are now available to run such data analytics programs. Many are commercial off-the-shelf (COTS) products, which are sold and maintained by software companies. They include the likes of SAS, Tableau, QlikView (Qlik), and several others. This list is indicative, not exhaustive.
Significant development has happened in open-source tools like Python, which offers a wide variety of libraries to run analytics and machine learning programs. There have also been several open-source add-ins for MS Excel, which are used for a variety of data analytics tasks in the spreadsheet environment. However, these do not support large-scale analysis, as MS Excel has a limit of 1,048,576 rows per worksheet.
Spreadsheets such as MS Excel are used extensively as they are available on desktops and are relatively cheap to procure and maintain. Text analytics software is used to analyse documents and various records of communication (like emails) to discover themes through analysis of the content (ACL, IDEA). Then, there are relational database management systems like Microsoft SQL Server, Microsoft Access, DB2, and Oracle. Thematic analysis and social network analysis on both numeric and text data are done to uncover networks and patterns.
Visualisation software is used to have a bird’s eye view and a graphical view of data. These are connected to real-time systems and provide visual reports to managers and executives at the decision-making level. Examples of such visualisation software are Tableau, Spotfire, and QlikView.
Predictive analytics tools include econometric analysis applications like SAS, SPSS, and Python. Increasingly, unstructured data like voice recordings are analysed to find patterns and detect white-collar crime, using technologies such as Nexidia.
Enterprises would be able to make better use of their data if they build up their talent pool and pipeline. Hiring and investing in skilled data analysts and providing relevant training would improve the data literacy of this talent. Vendor support is also vital for implementation.
Since this is a relatively new area, the vendor market is not too crowded. There are a few large ITES firms like Cloudera, AWS, Infosys, TCS, and Wipro which have big data implementation capabilities.
Another important challenge is securing top management or board buy-in. Getting board buy-in is not easy, as this is a technological intervention and consultants need to articulate its value in pure business terms. Executives also feel that there is still significant untapped potential in leveraging big data for operations.
While the benefits of big data in operations are well recognised, the innovation capabilities of big data are still in uncharted territories for many firms. It could be useful if enterprises provide high incentives to executives for data-driven innovations.
In summary, we may conclude that big data technologies and analytics are increasingly finding acceptance. Firms that invest in big data capabilities stand to gain a competitive advantage. Executives realise that there is promise in big data, but they also appreciate that there are certain challenges.
Overcoming the challenges would need investment in high-end hardware, talent, and training. There is significant work being carried out in practice and research to understand what works and what does not.
By Dr. Nitin Singh, Professor, Operations Management, IIM Ranchi, and Gaganjeet Gujral, CEO, GEICO