India is perhaps the test bed for piloting Big Data projects of massive scale. We first had Aadhaar, whose scope was to capture 12 billion fingerprints, 1.2 billion photographs, and 2.4 billion iris scans. The resulting database will be huge: it is estimated to be 10 times larger than the largest existing biometric database, which was created by the FBI in the U.S.
Last year, Hyderabad-based startup Modak Analytics built a Big Data repository of electoral data. The objective was to run analytics on this data and gauge how Big Data platforms, and the insights they throw up, could be used to improve governance, for example by micro-targeting or monitoring government subsidies. The startup obtained the data from multiple open data sources for the exercise, which covered 81.4 crore voters, the largest electorate on the planet. To put this in perspective, the U.S. has close to 19.36 crore voters, Indonesia has 17.1 crore, Brazil has 13.58 crore and the U.K. has 4.55 crore. Clearly, the scale at which India operates is massive.
Explaining the complexity of the project, Aarti Joshi, Co-Founder, Modak Analytics, says, “We processed around 81.4 crore electoral rolls. Undoubtedly the Indian election is the mother of all elections. Apart from the volume, we had 12 languages (variety) and this data changed very frequently (velocity).”
Unique challenges encountered
In developed countries such as the U.S., data is ubiquitously available, homogeneous, structured and of high quality. India is a different case. As the data was not readily available, Modak Analytics had to build every piece of relevant information from the ground up.
Some additional challenges were peculiar to India: the voter rolls were PDF files in 12 languages. Modak Analytics had to decipher over 9 lakh PDFs, amounting to over 2.5 crore pages, before any analysis could begin. This data was mapped to 9.3 lakh polling booths across 543 parliamentary and 4,120 assembly constituencies.
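For readers less used to lakh and crore, the figures above can be put on a common footing with a little arithmetic. The sketch below is only a back-of-the-envelope calculation: it treats the quoted "over 9 lakh" and "over 2.5 crore" as exact values, so the derived averages are rough, and the function name is illustrative.

```python
LAKH = 10**5    # 1 lakh  = 100,000
CRORE = 10**7   # 1 crore = 10,000,000

def scale_summary():
    """Derive rough averages from the figures quoted in the article."""
    voters = 814 * 10**6      # 81.4 crore voters
    pdfs = 9 * LAKH           # "over 9 lakh" PDFs, taken as exactly 9 lakh
    pages = 25 * 10**6        # "over 2.5 crore" pages, taken as exactly 2.5 crore
    booths = 93 * 10**4       # 9.3 lakh polling booths
    return {
        "pages_per_pdf": pages / pdfs,
        "voters_per_page": voters / pages,
        "voters_per_booth": voters / booths,
    }

if __name__ == "__main__":
    for key, value in scale_summary().items():
        print(f"{key}: {value:.1f}")
```

On these assumptions, a typical PDF runs to roughly 28 pages and a typical polling booth serves roughly 875 voters.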
“Every state had the data in their own vernacular language. For example, Tamil Nadu had the data in Tamil, Maharashtra in Marathi and Karnataka had this data in Kannada. To do any kind of data analytics, it was important to convert the data into a single language. Hence, we had to transliterate these vernacular languages into English. We developed one more technology, which we are in the process of patenting, to transliterate Indian vernacular languages into English. Using this technology we were able to transliterate 11 Indian languages (scripts) into English,” states Milind Chitgupakar, Chief Analytics Officer, Modak Analytics.
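To give a feel for what transliteration involves, here is a minimal sketch for Devanagari, the script used for Marathi and Hindi. It is not Modak Analytics' technology, whose details the firm has not described. The mapping tables cover only a handful of characters, the romanization choices are arbitrary, and the rule for dropping the word-final inherent vowel is a simplifying assumption.

```python
# Tiny, deliberately incomplete tables (illustrative only).
CONSONANTS = {"क": "k", "ग": "g", "ज": "j", "त": "t", "द": "d", "न": "n",
              "प": "p", "म": "m", "र": "r", "ल": "l", "व": "v",
              "श": "sh", "स": "s", "ह": "h"}
STANDALONE = {"अ": "a", "आ": "a", "इ": "i", "ई": "i", "उ": "u", "ए": "e",
              "ओ": "o", "ं": "n"}          # independent vowels and anusvara
MATRAS = {"ा": "a", "ि": "i", "ी": "i", "ु": "u", "ू": "u", "े": "e", "ो": "o"}
VIRAMA = "्"   # suppresses the consonant's inherent vowel

def transliterate_word(word: str) -> str:
    """Map one Devanagari word to Latin letters, character by character."""
    out = []
    i = 0
    while i < len(word):
        ch = word[i]
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            if nxt in MATRAS:          # explicit vowel sign replaces the inherent 'a'
                out.append(MATRAS[nxt])
                i += 1
            elif nxt == VIRAMA:        # conjunct: no vowel at all
                i += 1
            elif nxt:                  # mid-word: keep the inherent 'a'
                out.append("a")
            # word-final consonant: the inherent vowel is dropped (assumption)
        elif ch in STANDALONE:
            out.append(STANDALONE[ch])
        else:
            out.append(ch)             # pass through anything not in the tables
        i += 1
    return "".join(out)

def transliterate(text: str) -> str:
    return " ".join(transliterate_word(w) for w in text.split())

if __name__ == "__main__":
    print(transliterate("श्रीनिवास"))
    print(transliterate("रामपुर"))
```

The first example comes out as "shrinivas", while the second becomes "ramapur" rather than the usual "Rampur", because this sketch does not drop the inherent vowel in the middle of a word. Gaps like that, multiplied across 11 scripts, suggest why a simple character table is not enough for production use.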
Further, the diverse range of voter names and other information presented unique challenges. To counter these challenges, the firm developed an efficient name and address matching algorithm to run analytics. “We have custom developed a data dictionary for each state. Just to give you an example in Andhra Pradesh, the name Srinivas is spelled or misspelled in 680 different ways. We have 740 villages named Rampur in India. So the machine matching algorithm had to be intelligent enough to figure out exactly which Rampur to match the data if it contains Rampur. What we have found out is that for Indian data you need solutions developed for India specifically, already existing cookie cutter solutions would fail miserably,” says Aarti. Accordingly, the firm developed a heuristic (machine learning) algorithm for classifying people based on name, geography, etc.
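The two problems in the quote, spelling variants of one name and many places sharing one name, can be illustrated with a short sketch. This is not the firm's algorithm: it substitutes a generic string-similarity measure from Python's standard library, and the folding rules, the 0.8 threshold, the tiny data dictionary and the gazetteer entries are all illustrative assumptions.

```python
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Lowercase, keep letters only, and fold a few common spelling variants."""
    s = "".join(ch for ch in name.lower() if ch.isalpha())
    for old, new in (("sh", "s"), ("ee", "i"), ("aa", "a"), ("w", "v")):
        s = s.replace(old, new)
    return s

def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two names after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def canonical_name(raw, dictionary, threshold=0.8):
    """Return the closest dictionary entry, or None if nothing is close enough."""
    best = max(dictionary, key=lambda entry: similarity(raw, entry))
    return best if similarity(raw, best) >= threshold else None

def resolve_village(village, district, gazetteer):
    """Pick the single (village, district, state) row matching both fields."""
    hits = [row for row in gazetteer
            if normalize(row[0]) == normalize(village)
            and normalize(row[1]) == normalize(district)]
    return hits[0] if len(hits) == 1 else None   # ambiguous or unknown: None

if __name__ == "__main__":
    names = ["Srinivas", "Ramesh"]                      # hypothetical data dictionary
    gazetteer = [("Rampur", "Rampur", "Uttar Pradesh"),  # illustrative entries
                 ("Rampur", "Shimla", "Himachal Pradesh")]
    for raw in ("Sreenivas", "Shrinivas", "Sriniwas", "Kumar"):
        print(raw, "->", canonical_name(raw, names))
    print(resolve_village("Rampur", "Shimla", gazetteer))
```

The village lookup works only because a second field, the district, narrows the candidates. A bare "Rampur" with no context stays unresolved here, which is the kind of case a per-state data dictionary would have to address.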