By: Kishore Rawool, COE-Analytics, Tata Technologies
Data, data, everywhere—and that too in the range of petabytes and exabytes! To give an idea of this staggering vastness of data: one petabyte is the amount of data generated in five years by 12 of NASA’s earth-observation satellite systems and their associated instruments. An exabyte represents an even bigger measure of data. An avatar in the Puranas, with gigantic strides, measured the red earth and blue skies in just two steps, and was left with no place for a third. Likewise, all words ever uttered by us humans can fit into roughly five exabytes.
Sensors and machines, humans, as well as business systems—in that order of importance—produce voluminous amounts of structured and unstructured data (‘big data’). Extracting needles of new insight from such haystacks of data will help an organization make better decisions, resulting in greater efficiency and revenue at lower cost and risk. Conventional analytical tools the industry has grown up with find themselves ‘breathless and shell-shocked’ by the sheer volume of big data, its periodic spikes and, above all, its high velocity. They must also grapple with a potpourri of data—email, video, audio, log files—and merging and managing this variety of data is no mean task. Responding to such data floods in real time therefore becomes a fine art, requiring the services of accomplished practitioners backed by advanced data-analysis and presentation tools. In fact, this is the big ask of today’s digital economy: the ability to process big data in real time, which involves continuous data input and output.
Big data analysis is increasingly trending towards real-time execution and away from batch processing, although, at the moment, processing volumes are split roughly evenly between the two approaches. In batch processing, a non-real-time approach, data is collected over a certain period of time and processed as a group to produce batch results. Hadoop, an open-source framework, has captured the imagination of data experts because it is particularly suited to batch-oriented processing of huge data sets—above all unstructured ones, which relational database servers were ill-prepared for. The animating force behind Hadoop is MapReduce, a programming model for processing data in parallel across several distributed servers (nodes).
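The MapReduce model described above can be illustrated with the classic word-count example. The sketch below is a toy simulation of the model in plain Python—not the Hadoop API—showing the three conceptual phases (map, shuffle, reduce) that the framework distributes across nodes:

```python
from collections import defaultdict

# Toy illustration of the MapReduce model (not the Hadoop API):
# documents are mapped to (word, 1) pairs, the pairs are shuffled
# (grouped) by key, then each group is reduced to a total count.

def map_phase(document):
    """Emit a (word, 1) pair for every word in the document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(mapped_pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the grouped counts to produce the final tally per word."""
    return {word: sum(counts) for word, counts in groups.items()}

documents = ["big data big insights", "data in real time"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle_phase(mapped))
print(counts["big"])  # -> 2
```

In a real Hadoop cluster, the map and reduce phases run simultaneously on many nodes, and the shuffle moves data between them over the network—which is part of why some jobs take hours.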
MapReduce has become almost another name for big data processing, and Hadoop has seen widespread deployment, including in commercial products. Even so, enterprises feel the need to move from good to great. MapReduce is good, but is it great? Google, which developed MapReduce and whose published design inspired the open-source implementation in Hadoop, has cast aside this programming model in favor of newer analytical tools that can cope with exponential increases in workload to the tune of petabytes. With MapReduce, certain jobs can take hours to complete. Hence, the industry feels there is a need to evolve beyond MapReduce’s much-touted properties. With strong in-memory features, Apache Spark, also designed to run on Hadoop, is emerging as the faster data processing engine of choice. It is ideal for faster, real-time querying of datasets in a matter of seconds!
Automated real-time data mining programs now have the capability to crawl over humongous data sets related to social networking and return relevant posts. For example, a careful analysis of viewers’ actions on social media (rating, sharing, liking, retweeting), along with audience tuning data obtained from set-top boxes, will reveal which television shows got viewers hooked and which didn’t. The real-time analytics train doesn’t stop there. A sentiment analysis algorithm goes over a social media post with a fine-tooth comb, at both the sentence level and the level of smaller groupings of words, to decode the overall sentiment within it and label it as positive, neutral, or negative. This process of knowledge discovery is helped in no small way by recent advances in machine learning: with a built-in capacity to learn and discover patterns in large data sets, it can cull more accurate results faster and without human intervention.
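The sentence-level labeling described above can be sketched in miniature. The example below uses a tiny hand-made word lexicon and a simple threshold—both illustrative assumptions; a production system would use a trained machine-learning model rather than a fixed word list:

```python
# Toy sentence-level sentiment scorer. The lexicon and the
# positive/negative threshold are made-up assumptions for illustration;
# real systems learn these from labeled training data.

POSITIVE = {"love", "great", "amazing", "hooked"}
NEGATIVE = {"boring", "awful", "hate", "weak"}

def score_sentence(sentence):
    """+1 for each positive word, -1 for each negative word."""
    words = sentence.lower().replace(",", " ").split()
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)

def label_post(post):
    """Split a post into sentences, sum their scores, and label the whole."""
    total = sum(score_sentence(s) for s in post.split(".") if s.strip())
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"

print(label_post("I love this show. The finale was amazing."))  # -> positive
```

Scoring sentence by sentence, rather than the post as a whole, is what lets the algorithm handle mixed posts that praise one aspect and pan another.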
Insight-generation tools can score social media posts around TV shows, for instance, and work out weekly popularity charts for these shows. This makes it possible to ‘narrowcast’ (as distinct from ‘broadcast’) programs that tickle the taste buds of smaller segments and niche groups of viewers. Importantly, in this model of TV programming based on user-generated data, the viewer not only gets an opportunity to talk back to the studio, but decides, more or less, what kind of shows get produced and what goes on the air. In sum, big data needs to move beyond the text analysis of a previous era to a more fine-grained analysis that will arm businesses with deep and accurate information on subgroups within the larger market, based on which they can tailor products and services relevant to such small groups.
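Rolling scored posts up into a weekly chart is a straightforward aggregation. The sketch below is hypothetical—the show names, week labels, and per-post scores are invented for illustration—but it shows the shape of the computation:

```python
from collections import defaultdict

# Hypothetical sketch: aggregate sentiment-scored social media posts
# into a weekly popularity chart per TV show. All data here is made up.
scored_posts = [
    {"show": "Show A", "week": "W01", "score": 0.9},
    {"show": "Show B", "week": "W01", "score": 0.4},
    {"show": "Show A", "week": "W01", "score": 0.7},
]

def weekly_chart(posts, week):
    """Average the post scores per show for one week, ranked high to low."""
    totals, counts = defaultdict(float), defaultdict(int)
    for post in posts:
        if post["week"] == week:
            totals[post["show"]] += post["score"]
            counts[post["show"]] += 1
    averages = {show: totals[show] / counts[show] for show in totals}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)

chart = weekly_chart(scored_posts, "W01")
print(chart[0][0])  # the week's most popular show
```

In practice the per-post scores would come from the sentiment pipeline and from engagement signals (shares, retweets), and the aggregation itself would run on a distributed engine rather than a single machine.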
The brave new big-data-driven world also needs a host of big data experts at different levels to sustain it. These include Hadoop administrators and developers, specialists in Hadoop components (e.g., Apache HBase), as well as data scientists who can apply the insights they glean to complex business situations.