From projects of national importance like Aadhar to projects that prevent crime and improve healthcare, big data is making huge inroads in the Indian market. Domain players like MapR are naturally optimistic about its immense potential. To get a perspective, Dataquest spoke to MC Srivas, CTO and Co-founder, MapR Technologies. Excerpts:
How relevant is the role of big data in a country like India?
In India, big data is used extensively. One example is Aadhar. Your Aadhar card, fingerprints, and iris scans all run on big data, and today there is no database other than MapR that can handle the scale required for India’s population, which is one-sixth of the world’s population.
So it’s about scale, but if you look at how the world has evolved, especially in India, it’s also about self-service. Look at how we book a taxi: it’s moving to self-service. How do you know how your customers are doing when you don’t have a personal interaction with them? It’s all web-based, and you need to analyze and understand what’s going on.
India is a fairly poor country and the majority of Indians use pay-as-you-go SIM cards. So, they might not have their phones active every day. Telecommunication companies can motivate customers to refill their SIM cards through coupons. Tracking customer behavior and the effectiveness of various offers and coupons across hundreds of millions of cell phones: that’s big data. Companies want to perform constant analytics, for instance, to know how many e-recharges were the result of discount offers. In cases like these, big data can prove to be extremely interesting.
Could you brief us on the use of MapR Hadoop in the Aadhar project?
The status of the project is that about 700 mn unique identification numbers are online and about one to one-and-a-half million are coming online daily. These unique personal identification numbers are created through collection kits that take fingerprints and iris scans.
If you think about the process of using these kits, like when you got your Aadhar card, it takes roughly 10 to 15 minutes. So, one kit can process four cards in one hour. Thus, in one day, if an operator really works for eight hours without any breaks, he will process the data for 32 people. To process one and a half million records a day online requires roughly 50,000 operational kits. Now, this is to provide a unique identification, which means the raw data has to be checked against all the other 700 mn people to see whether this person is really unique or whether it is a duplicate or a fraud. Can you imagine the number of comparisons you are doing to process each person’s fingerprints, and the retrieval of about five megabytes of raw data? If you think about the size of the data, there is just one problem: you are doing about a trillion ID verifications per week in India, and there is no database that can handle this scale, except MapR.
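The kit arithmetic in this answer can be checked with a short Python sketch. The constant and function names are illustrative only and do not come from MapR or the Aadhar project; the sketch assumes the slower 15-minute figure and an eight-hour day without breaks, as in the answer. The result, just under 47,000 kits, is consistent with the rounded figure of 50,000.

```python
import math

# Figures quoted in the interview; the names are hypothetical.
MINUTES_PER_ENROLLMENT = 15      # upper end of the 10 to 15 minute range
HOURS_PER_DAY = 8                # one operator, no breaks
DAILY_ENROLLMENTS = 1_500_000    # records coming online per day


def enrollments_per_kit_per_day(minutes_per_enrollment, hours_per_day):
    """Number of people one kit can enroll in a working day."""
    per_hour = 60 // minutes_per_enrollment
    return per_hour * hours_per_day


def kits_required(daily_enrollments, per_kit_per_day):
    """Minimum number of kits needed to reach the daily target."""
    return math.ceil(daily_enrollments / per_kit_per_day)


per_kit = enrollments_per_kit_per_day(MINUTES_PER_ENROLLMENT, HOURS_PER_DAY)
kits = kits_required(DAILY_ENROLLMENTS, per_kit)
print(per_kit, kits)  # 32 46875
```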
What’s the role of big data in the field of preventive healthcare drug research?
Preventive healthcare drug research relies on big data solutions, particularly with genome sequencing. A fully sequenced human genome is 4.2 terabytes. The first sequencing of the human genome was performed by the Human Genome Project, which took 10 years and close to $3 bn. Now, we have private companies that can sequence an entire human genome in a matter of hours, for several hundred dollars.
Drug companies can now compare your genome against those of thousands of other people and figure out what is actually effective, how it affects your body, and how it interacts with different chemicals. There is really a massive amount of research happening today, and many drug treatments are coming out based on it. For example, researchers have also figured out what causes the malarial parasite to be so resistant to drugs. The ability to analyze and use the results of genome sequencing will help us deal with very serious causes of death and disease.
What is your view on security analytics and do you think it is an extension of big data analytics?
There are a lot of MapR customers already using big data for analytics and security. For example, Cisco has built a security appliance called Security Intelligence Operations, and it’s like a brain in the sky that’s watching all the Cisco firewalls across the world at the same time. If they detect a threat, let’s say in Singapore, and they are able to manage it automatically or with some operator intervention, the system learns how to do that. And let’s say 15 minutes later, the same threat occurs in South Africa; the system already knows how to handle it because it has already learned the response. So, it’s analyzing packets continuously to determine whether something is a threat, how it is happening, and how to deal with it automatically. This is already in production and is using MapR.
Large security firms also use MapR to monitor Internet traffic and detect traffic anomalies. If you analyze all the Internet traffic in New Delhi, you might want to know if a certain portion, 5% for example, was diverted to some other country via foreign routers, and for how long. These security applications can handle issues on a country-wide scale. You can imagine the amount of data: how many cell phones and messages, and how much IP traffic, there is on a metropolitan or regional scale. And all of that is being analyzed by big data today for security.
What are the benefits of Hadoop as a platform in business analytics?
Every business needs analytics. If you don’t do analytics, you are flying blind. Businesses have to know how their top line is performing. We have over 700 customers who are doing all kinds of interesting things. Many customers are growing their revenue using Hadoop; others are managing and reducing risk and optimizing their costs. For example, some retailers are using Hadoop to figure out how to manage their inventory.
These companies know that traffic patterns, behavior on their website, and social media content can indicate how many products will be sold in their stores. They also want to have in stock those items which have a high likelihood of being sold. In another example, one of the largest medical health insurance companies in the US was trying to run a query on Teradata on about 900 mn records and a 24-table join, with a four-page-long SQL query; it would run for three weeks and would not complete. They tried it on MapR and the analysis was completed in an hour. That query alone was worth several million dollars to them for fraud detection. These results are phenomenal, but by no means are results like these out of the ordinary for Hadoop.