Key lessons learnt from building the world’s largest ever Big Data repository of electoral data

Srikanth R P

17 Jul 2015 01:53 IST

India is perhaps the test bed for piloting Big Data projects of massive scale. First came Aadhaar, whose scope was to capture 12 billion fingerprints, 1.2 billion photographs, and 2.4 billion iris scans. The resulting database is estimated to be 10 times larger than the largest existing biometric database, which was created by the FBI in the U.S.

Last year, Hyderabad-based startup Modak Analytics built a Big Data repository of electoral data. The objective was to use this data for analytics and to gauge how Big Data platforms, and the insights thrown up by the analysis, could be used to improve governance by, say, micro-targeting subsidies or monitoring their distribution. The startup obtained data from multiple open data sources for the exercise, which covered 81.4 crore voters, the largest electorate on the planet. To put this in perspective, the U.S. has close to 19.36 crore voters, Indonesia has 17.1 crore, Brazil has 13.58 crore and the U.K. has 4.55 crore. Clearly, the scale at which India operates is massive.

Aarti Joshi, Co-Founder and Executive Vice President, Modak Analytics

Explaining the complexity of the project, Aarti Joshi, Co-Founder, Modak Analytics, says, “We processed around 81.4 crore electoral rolls. Undoubtedly the Indian election is the mother of all elections. Apart from the volume, we had 12 languages (variety) and this data changed very frequently (velocity).”

Unique challenges encountered
In developed countries such as the U.S., data is ubiquitously available, homogeneous, structured and of high quality. India is a different case. As the data was not readily available, Modak Analytics had to build every piece of relevant information from the ground up.

Some additional challenges were peculiar to India: voter rolls were in PDF format in 12 languages. Modak Analytics had to decipher over 9 lakh PDFs, amounting to over 2.5 crore pages, before any analysis was possible. This data was mapped to 9.3 lakh polling booths across 543 parliamentary and 4,120 assembly constituencies.

“Every state had the data in its own vernacular language. For example, Tamil Nadu had the data in Tamil, Maharashtra in Marathi and Karnataka in Kannada. To do any kind of data analytics, it was important to convert the data into a single language. Hence, we had to transliterate these vernacular languages into English. We developed one more technology, which we are in the process of patenting, to transliterate Indian vernacular languages into English. Using this technology we were able to transliterate 11 Indian languages (scripts) into English,” states Milind Chitgupakar, Chief Analytics Officer, Modak Analytics.
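To give a feel for what script-to-English transliteration involves, here is a minimal Python sketch for a handful of Devanagari characters. It is not Modak's technology, whose details are not public; the mapping tables, the romanization choices (such as "aa" and "ee") and the function name `transliterate` are all assumptions made for illustration.

```python
# Illustrative only: a tiny, incomplete Devanagari-to-Latin table.
# The romanization choices below are assumptions, not a standard scheme.
CONSONANTS = {
    "क": "k", "ग": "g", "ज": "j", "त": "t", "द": "d", "न": "n",
    "प": "p", "म": "m", "र": "r", "ल": "l", "व": "v", "श": "sh", "स": "s",
}
VOWELS = {"अ": "a", "आ": "aa", "इ": "i", "ई": "ee", "उ": "u", "ए": "e", "ओ": "o"}
MATRAS = {"ा": "aa", "ि": "i", "ी": "ee", "ु": "u", "े": "e", "ो": "o"}
VIRAMA = "्"  # suppresses the inherent vowel of the preceding consonant


def transliterate(text):
    """Romanize text character by character using the tables above."""
    out = []
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            # A consonant carries an inherent 'a' unless a vowel sign
            # or a virama follows it.
            if nxt not in MATRAS and nxt != VIRAMA:
                out.append("a")
        elif ch in MATRAS:
            out.append(MATRAS[ch])
        elif ch in VOWELS:
            out.append(VOWELS[ch])
        elif ch == VIRAMA:
            continue  # already handled when the consonant was emitted
        else:
            out.append(ch)  # pass through anything the table does not cover
    return "".join(out)


if __name__ == "__main__":
    print(transliterate("रामपुर"))
```

Run on रामपुर, this sketch prints "raamapura" rather than the familiar "Rampur", because it never drops the inherent vowel the way spoken Hindi does. Gaps of this kind, multiplied across 11 scripts, hint at why a purpose-built solution was needed.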

Further, the diverse range of voter names and information presented unique challenges. To counter them, the firm developed an efficient name and address matching algorithm for its analytics. “We have custom developed a data dictionary for each state. Just to give you an example, in Andhra Pradesh the name Srinivas is spelled or misspelled in 680 different ways. We have 740 villages named Rampur in India. So the machine matching algorithm had to be intelligent enough to figure out exactly which Rampur to match the data to if it contains Rampur. What we have found out is that for Indian data you need solutions developed for India specifically; already existing cookie-cutter solutions would fail miserably,” says Aarti. Accordingly, the firm developed a heuristic (machine learning) algorithm for classifying people based on name, geography and other attributes.
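The two problems Aarti describes, spelling variants and ambiguous place names, can be sketched in a few lines of Python. This is not the firm's algorithm. It is a generic approach that normalizes common spelling variations, scores similarity with the standard library's `difflib.SequenceMatcher`, and uses the district as context to pick one village among namesakes. The substitution rules, the 0.8 threshold, the function names and the sample gazetteer entries are all hypothetical.

```python
import re
from difflib import SequenceMatcher

# Hypothetical substitution rules for common romanization variants.
VARIANT_RULES = (("sh", "s"), ("th", "t"), ("ee", "i"), ("aa", "a"), ("oo", "u"), ("w", "v"))


def name_key(name):
    """Reduce a name to a rough canonical key so variants compare equal."""
    key = re.sub(r"[^a-z]", "", name.lower())
    for variant, canonical in VARIANT_RULES:
        key = key.replace(variant, canonical)
    return re.sub(r"(.)\1+", r"\1", key)  # collapse doubled letters


def similarity(a, b):
    """Similarity of two names in [0, 1], computed on their keys."""
    return SequenceMatcher(None, name_key(a), name_key(b)).ratio()


def best_match(name, canonical_names, threshold=0.8):
    """Return the closest canonical name, or None if nothing is close enough."""
    if not canonical_names:
        return None
    score, match = max((similarity(name, c), c) for c in canonical_names)
    return match if score >= threshold else None


def resolve_village(village, district, gazetteer):
    """Pick the gazetteer entry whose village AND district both match."""
    for entry in gazetteer:
        if (name_key(entry["village"]) == name_key(village)
                and name_key(entry["district"]) == name_key(district)):
            return entry
    return None


if __name__ == "__main__":
    # Made-up gazetteer entries, for illustration only.
    gazetteer = [
        {"village": "Rampur", "district": "District A", "state": "State X"},
        {"village": "Rampur", "district": "District B", "state": "State Y"},
    ]
    print(best_match("Shreenivas", ["Srinivas", "Suresh"]))
    print(resolve_village("Raampur", "District B", gazetteer))
```

A per-state data dictionary, as described in the quote, would replace the single hard-coded rule list here, since the variants that matter differ from language to language.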

Factors considered for designing the architecture
Modak Analytics was looking at processing huge volumes of unstructured data (around 10TB of PDF documents) as well as structured data. With new data sources coming along and existing sources changing frequently, the firm needed a scalable system that could grow with the needs of the project. Since the processing was cumbersome and time-consuming, the project required an architecture that was implicitly parallel and whose performance could scale linearly with the addition of new hardware. The other important consideration was cost. Keeping this in mind, the firm decided to use open-source software so that the system could be built cost-effectively.

Modak chose Hadoop and built its own 64-node cluster with 128TB of storage. Apart from Hadoop, the team used PostgreSQL as the front-end database.
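The "implicitly parallel" requirement is what the map/reduce style of processing provides: each page is handled independently, and the partial results are combined at the end. The Python sketch below shows only that shape. It runs inside one process with a thread pool as a stand-in and is not Hadoop code; the record layout, the `booth_id` field and the function names are assumptions.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_page(page):
    """Map step: count voter records per polling booth on a single page."""
    return Counter(record["booth_id"] for record in page)


def reduce_counts(partials):
    """Reduce step: merge the per-page counts into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def count_voters(pages, workers=4):
    """Process pages independently, then combine the partial counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return reduce_counts(pool.map(map_page, pages))


if __name__ == "__main__":
    # Two made-up pages of already-extracted voter records.
    pages = [
        [{"booth_id": "B1"}, {"booth_id": "B2"}],
        [{"booth_id": "B1"}],
    ]
    print(count_voters(pages))
```

Because no page depends on another, the same map and reduce steps can be spread over more machines as the data grows, which is the linear scaling the team was after. Threads in this sketch only illustrate the structure; they do not speed up CPU-bound work in standard Python.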

For any Big Data project, data integration takes about 80 percent of the time and resources. Data integration means extracting the data, transforming, fusing and merging it, and then gleaning meaningful insights that can be acted upon. In technical parlance, data integration is also called Extraction, Transformation and Loading (ETL). As Modak Analytics had just 10 data scientists, it had to be extremely efficient to handle a project of this scale. Accordingly, the team focused on automating the key tasks of data integration.
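For readers unfamiliar with ETL, the three stages can be sketched in Python as follows. This is a toy illustration with made-up records, not the project's pipeline. The column names are assumptions, and an in-memory SQLite database stands in for the target database so that the example runs without a server.

```python
import csv
import io
import sqlite3

# Made-up raw input; the second record has no age and will be dropped.
RAW = "voter_id,name,age\nA1, srinivas ,34\nA2,Lakshmi,\n"


def extract(text):
    """Extract: parse raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim whitespace, normalize name case, drop incomplete rows."""
    clean = []
    for row in rows:
        if not row["age"].strip():
            continue  # skip records with a missing age
        clean.append((row["voter_id"].strip(), row["name"].strip().title(), int(row["age"])))
    return clean


def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS voters "
        "(voter_id TEXT PRIMARY KEY, name TEXT, age INTEGER)"
    )
    conn.executemany("INSERT INTO voters VALUES (?, ?, ?)", rows)
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load(transform(extract(RAW)), conn)
    print(conn.execute("SELECT * FROM voters").fetchall())
```

Writing and testing code like this by hand for every data source is the effort that the team set out to automate.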

For example, while Hadoop is a scalable platform for Big Data projects, one of the major issues is getting data into and out of it. As people trained on the Hadoop platform are expensive and scarce, the firm developed a proprietary tool called RapidETL that generates data integration code automatically. Using this tool, the team generated the data integration code for Hadoop, which significantly reduced the timeline and the resources required for the project. Thanks to the automation RapidETL brought to the data integration process, the team was able to shorten both the development and the testing cycles of the ETL process by a factor of four. Since the code was generated by the tool with minimal human involvement, data errors were few, which in turn reduced the overall testing effort considerably.
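RapidETL is proprietary and the article does not describe how it works internally. The general idea of generating integration code from a declarative description can nonetheless be sketched in Python. In the hypothetical example below, a small spec (table name, source file, columns) is turned into a PostgreSQL `CREATE TABLE` statement and a `COPY` load statement; the spec format, the file path and the function names are assumptions.

```python
def generate_ddl(spec):
    """Generate a CREATE TABLE statement from the column list in the spec."""
    columns = ",\n".join(f"    {name} {sql_type}" for name, sql_type in spec["columns"])
    return f"CREATE TABLE {spec['table']} (\n{columns}\n);"


def generate_copy(spec):
    """Generate a PostgreSQL COPY statement that loads the source CSV file."""
    names = ", ".join(name for name, _ in spec["columns"])
    return (
        f"COPY {spec['table']} ({names}) "
        f"FROM '{spec['source']}' WITH (FORMAT csv, HEADER true);"
    )


if __name__ == "__main__":
    # Hypothetical spec; in practice there would be one per data source.
    spec = {
        "table": "voters",
        "source": "/data/voters.csv",
        "columns": [("voter_id", "TEXT"), ("age", "INTEGER")],
    }
    print(generate_ddl(spec))
    print(generate_copy(spec))
```

With this approach, adding a new data source means writing a new spec rather than new code, which is one plausible reason generated pipelines would carry fewer hand-introduced errors.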

For a political campaign in India, apart from generic demographic data, classification of potential voters based on caste, religion, and affluence is important. The firm developed algorithms that automatically classified potential voters into groups based on various factors in different languages. This process proved immensely beneficial to the political party for which the project was executed. Modak was also able to merge the electoral data with other open data sources such as the census and economic surveys, as well as with proprietary data from the political party.

Important lessons learnt from the project
From a Big Data perspective, there is a lot to learn from handling a project of this scale with limited resources. Milind offers some useful and practical tips. “For a successful Big Data project, given that resources are scarce, it is important to automate significant portions of the data integration work and reduce the timeline. One of the biggest issues businesses face with Big Data or analytics is that it takes a lot of time for the project to return value to the business. By using automation technologies, one can significantly reduce the implementation timeline and achieve the goal of faster time to value.”

Used effectively, this data can significantly improve governance. “Big Data platforms can definitely make a big difference in reducing poverty and micro-targeting of subsidies and monitoring of government schemes. For a country like India, Big Data can also help in developing forecasting models for demand-supply scenarios, which can help in controlling inflation. Big Data can also be effectively used to implement government policies, and monitor the subsidy distribution at a micro-level,” opines Aarti.

Collecting and analyzing huge data sets allows policies to be formulated and decisions to be made more effectively. Whichever way you look at it, India’s scale and diversity offer a perfect opportunity for IT service providers to evaluate Big Data, its relevance and the impact it can make!
