The growth of information and communications technology (ICT) is on steroids, and India is trying to keep pace. Every year the number of computers in India increases, and so does the dependency on these machines for education and daily livelihood. According to a Gartner report, computer shipments to India are expected to increase by 24.7% to total 13.2 mn units in 2011 over 2010. And while these machines are largely accessible only in English, the irony is that the share of Indians who can speak, read, and write English remains abysmally low at 10%, compared to around 95% in the US and 97% in the UK.
India's economy is one of the fastest growing in the world, and its mobile market is the fastest growing anywhere, with around 800 mn mobile subscribers (according to Trai), but the country ranks near the bottom in terms of proliferation of ICT. While illiteracy can be seen as the umbrella problem, a large part of the literate populace cannot access these technologies, which promise advancement and better employment, simply because they are not available in local languages. In the reach and usage of the internet, too, India ranks very low. Again, as most web pages are in English, the literate Indian populace, otherwise adept at local languages, shies away from using the internet. The very fact that the number of readers of local language newspapers is greater than that of English newspapers in India shows that, by and large, people in India still prefer to read in their regional language. "If, through the efforts of researchers in the language computing domain, the number of web pages available in local languages is increased, then more people will be attracted to using the internet and reap its benefits," says A Kumaran, research manager of the multilingual systems research group at Microsoft Research Lab in Bengaluru.
Computing Local Lingo
Language computing systems can be broadly classified into machine translation, optical character recognition (OCR, which converts handwritten or printed text into machine-readable text), and speech recognition (transliteration and conversion of speech to text). N Ravi Shanker, joint secretary, department of information technology (DIT), Ministry of Communications and Information Technology, Government of India, oversees the language computing work across India. In 1991, the DIT's Language Computing Group started a program to develop tools and resources so that computing and browsing the internet could be done in English as well as in local Indian languages. The program, named Technology Development for Indian Languages (TDIL), was thus conceptualized even before computers became widespread in India between 1993 and 1995. The government has been working closely with some of its institutions, such as the Centre for Development of Advanced Computing (C-DAC) in Pune. The TDIL program was aimed at making language computing tools available to the masses free of cost, to in turn enable wider penetration and better awareness of such tools. Phase I of TDIL was completed in 2010, with beta versions of all engines of language computing systems available in 10 main Indian languages, and it has given way to Phase II, which will see the addition of more languages, smoother interfaces, and improved accuracy.
Language Processing
Natural language processing technology, one aspect of artificial intelligence, converts human languages into machine-processable form for computation. The technology encompasses development of utilities for speech synthesis, optical character recognition, text-to-speech conversion, and machine translation.
VN Shukla, director of the Special Applications group at C-DAC, Noida, UP, says, "Language computing starts right from the hardware level, up to the software level, and then at the applications layer. And when we talk about hardware, it is not only the input devices, but the whole PC structure." The basic tool for language computing, then, is a computer packed with keyboard drivers, display drivers, language fonts, rendering engines, translation tools, and more to accomplish all language needs.
In 1992, the Indian standard for Indian scripts, called the Indian Standard Code for Information Interchange (ISCII), was developed; it could be used for Indian scripts the way the American Standard Code for Information Interchange (ASCII) is used for English. Using a standard code for every character could help in standardization, but the process required reverse engineering. Researchers in India therefore decided to take an alternate path and started working at the software layer, allowing people to build applications for Indian languages over existing software. "The first thought was to localize the operating system, at that time DOS, to Hindi, but then we realized there was no point, and we started creating drivers for Indian languages for input and display purposes," reminisces Shukla, who was a scientist with C-DAC at the time of the conception of TDIL in India. He worked closely with professor RMK Sinha of the Indian Institute of Technology, Kanpur (IIT-K), who is known as the father of language computing in India.
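To make the distinction concrete, the short sketch below (not from the article, and purely illustrative) shows how English text fits within single-byte ASCII while Devanagari text needs code points beyond it, which is what ISCII, and later Unicode, provide.

```python
# A minimal sketch showing how Indian-language text is represented for
# computation: ASCII covers only English letters, while Unicode assigns a
# code point to every Devanagari character (ISCII did the same with
# single-byte codes before Unicode became the norm).

text_en = "bhasha"        # ASCII: one byte per character
text_hi = "भाषा"          # Devanagari: outside the ASCII range

print([hex(ord(c)) for c in text_en])   # e.g. ['0x62', '0x68', ...]
print([hex(ord(c)) for c in text_hi])   # Unicode code points, e.g. ['0x92d', ...]
print(text_hi.encode("utf-8"))          # multibyte form stored and transmitted by software
```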
Researchers working on Indian language computing soon realized that the tools present in the global market cannot simply be replicated for India, owing to the complexity of the multiple languages that exist in the country. (India has not only 22 major languages and as many as 1,652 dialects, but also 11 scripts to represent these languages.) Swarn Lata, head of the TDIL program and director of the Human Centred Computing group in the DIT, explains that for Indian languages, one-to-one mapping, that is, translating each word as it is to form a sentence, is not workable. The methodology to be followed here is to first process the source language, convert words according to the target language, and then process it all again with respect to the target language for the conversion to make sense.
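A toy sketch of the problem, using an invented English-Hindi glossary (transliterated here for readability), shows why word-for-word substitution alone breaks down: Hindi is verb-final, so the output must be reordered in a target-language generation step.

```python
# A toy illustration (with a made-up glossary) of why word-for-word substitution
# fails between English and Hindi: Hindi is verb-final (SOV), so the words must
# be reordered and inflected after analysing the source sentence.

glossary = {"the": "", "boy": "ladka", "reads": "padhta hai", "a": "ek", "book": "kitab"}

sentence = "the boy reads a book"
word_for_word = " ".join(filter(None, (glossary[w] for w in sentence.split())))
print(word_for_word)        # "ladka padhta hai ek kitab" -- wrong order for Hindi

# After source analysis (identify subject, object, verb) and target-side
# generation, the correct verb-final order is produced:
print("ladka ek kitab padhta hai")
```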
Apart from the distinctive nature of Indian languages, culture also affects language usage and pronunciation. For example, in northern parts of India, Hindi is spoken in varied forms across different states and cities. "Thus we cannot have a generic tool, especially for translation, and tools have to be developed for each of the languages," adds Shukla. In a C-DAC report he mentions that although all Indian languages have emerged out of Sanskrit, the core ancient language, and mostly follow Paninian grammar, that itself is a problem, as different languages depend on Sanskrit and Panini in different ways. Accuracy for any of these systems is therefore not 100%.
Computing
Different centers of C-DAC, in Bengaluru, Kolkata, Mumbai, Noida, Pune, and Thiruvananthapuram, work on language computing technologies. Their activities include the development of smaller utilities, like desktops and internet access in Indian languages, and core research in areas of machine translation, OCR, cross-lingual access, search engines, standardization, digital libraries, and more. C-DAC Noida is engaged in English to Indian language translation and has already done so for Bengali, Hindi, Malayalam, Nepali, Punjabi, Telugu, and Urdu. It has also developed Indian to Indian language text-to-text translation for Punjabi-Urdu-Hindi combinations, as the 3 languages are closely related. C-DAC Noida has also collaborated with language computing labs of various countries for effective speech-to-speech translation of other languages into Indian languages. The international languages that can be translated are Japanese, Thai, Cantonese, Arabic, and Brahmic. With the inclusion of foreign languages, the centre's initial Asian Speech Translation (A-Star) system has now been renamed Universal Speech Translation (U-Star).
Anuvadaksh, a consortium-based English to Indian language machine translation (EILMT) system and a part of the TDIL program, also allows for translation of English into 6 Indian languages: Bengali, Hindi, Marathi, Oriya, Tamil, and Urdu. It has advanced the development of technical modules such as the named entity recognizer, word sense disambiguation, morph synthesizer, collation and ranking, and evaluation. "It is vital that linguists and technology experts work in collaboration on such projects, because language experts are not adept at technology and technology experts are usually not familiar with the nuances of languages," says Manoj Jain, scientist, TDIL, DIT.
EILMT has also developed AnglaMT, a pattern-directed, rule-based system with a context-free-grammar-like structure for English (the source language). It generates a pseudo-target (pseudo-interlingua) applicable to a group of Indian target languages, such as the Indo-Aryan family (Asamiya, Bengali, Gujarati, Hindi, Marathi, Oriya, Punjabi, and more) and the Dravidian family (Kannada, Malayalam, Tamil, and Telugu). A major design consideration of AnglaMT has been to provide a practical aid for translation, with 90% of the task done by the machine and 10% left to human post-editing and processing. Another EILMT system, called Sampark, works on 6 pairs of Indian languages: Hindi to Punjabi, Hindi to Telugu, Punjabi to Hindi, Tamil to Hindi, Telugu to Tamil, and Urdu to Hindi. The Sampark system is based on the analyze-transfer-generate paradigm: first, analysis of the source language is done, then a transfer of vocabulary and structure to the target language is carried out, and then the target language is generated. Each phase consists of multiple modules, 13 of them major. One main advantage of the Sampark system is that once a language analyzer has been developed for a single language, it can be paired with generators for other languages to get multilingual output. The DIT strongly believes that the Sampark approach helps control the complexity of the overall system.
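The sketch below is a minimal, illustrative rendering of that analyze-transfer-generate flow; the function names, toy lexicons, and transliterated words are assumptions made for the example, not Sampark's actual modules.

```python
# A minimal sketch of the analyze-transfer-generate paradigm described above.
# Module names and data structures are illustrative, not Sampark's actual code.

def analyze_hindi(sentence):
    # Source-side analysis: tokenisation, morphology, POS tagging, chunking, etc.
    return {"tokens": sentence.split(), "lang": "hi"}

def transfer(analysis, target_lang, lexicon):
    # Transfer of vocabulary (and, in a real system, structure) to the target language.
    analysis["tokens"] = [lexicon.get(t, t) for t in analysis["tokens"]]
    analysis["lang"] = target_lang
    return analysis

def generate_punjabi(analysis):
    # Target-side generation: agreement, word order, orthography.
    return " ".join(analysis["tokens"])

def generate_telugu(analysis):
    return " ".join(analysis["tokens"])

# One Hindi analyzer can be paired with several generators for multilingual output.
analysis = analyze_hindi("yah ek udaharan hai")
print(generate_punjabi(transfer(dict(analysis), "pa", {"yah": "ih"})))
print(generate_telugu(transfer(dict(analysis), "te", {"yah": "idi"})))
```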
Standardization
The Government of India is working closely with international agencies such as the World Wide Web Consortium (W3C), Unicode, the International Organization for Standardization, and the Bureau of Indian Standards on standardization in the fields of Indian script keyboards, transliteration, SMS, speech resources, and electronic language resource development for all official Indian languages. While the inclusion of the 22 Indian languages in the Unicode Standard is complete, the W3C is working on a seamless web for every Indian and on the inclusion of Indian languages in the international standards for the mobile web, web accessibility, styling, and web browsing. Standardization is important for developing applications for mass usage while retaining the fundamentals of Indian language usage.
The Challenges
For the development of every tool and utility, a database of sentences and words, in text and speech, is required in every language, since most programs are based on statistical algorithms. In India, while some languages are spoken by a large number of people, others are limited to smaller groups. The criteria for sample collection require the target group to be computer savvy and conversant in English as well as the local language, which narrows down the number of people who can be contacted for samples of the local lingo.
Since localization is done for the common man, certain words and phrases commonly used for working with computers have had to be reinvented in many cases and made user-friendly. Initially, the material identified was a set of 60,000 basic strings in English, and 200,000 strings for advanced users, which needed to be localized into all languages. Many of these strings were not grammatically complete sentences; they were just computer commands. Also, words like document, folder, delimiter, and add-on are not listed in any dictionary of Indian languages. While in some languages such terms have simply been transliterated and retained, experts in some other languages went on to create a whole new set of words corresponding to the IT terminology. Following this mammoth task, a glossary was created with the consent of various language experts.
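A hedged sketch of the kind of glossary-driven string localization described above follows; the Hindi equivalents shown are common transliterations used purely for illustration, not the officially agreed glossary entries.

```python
# Illustrative glossary-driven localization of interface strings. The Hindi
# terms below are commonly seen transliterations, shown only as examples.

glossary_hi = {
    "File": "फ़ाइल",        # transliterated and retained
    "Folder": "फ़ोल्डर",
    "Delete": "हटाएं",      # newly coined equivalent
}

def localize(string, glossary):
    # Fall back to the English term when no agreed equivalent exists yet.
    return glossary.get(string, string)

for s in ["File", "Folder", "Delete", "Add-on"]:
    print(s, "->", localize(s, glossary_hi))
```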
Web Domain
Many countries have been pushing for the creation of multilingual domain names, that is, domain names like .com or .in typed in the country's own local language(s). These are encoded in computer systems in multibyte Unicode and are usually stored in the domain name system as ASCII codes. In 2009, the Internet Corporation for Assigned Names and Numbers, which manages domain names across the world, approved the use of internet extensions based on a country's name and its local language(s). The move has given an impetus to the growth of the web, and India too has joined the league of nations applying for a domain name written in a script other than Latin. Shanker of DIT is propelling this project ahead in India with the help of Govind, a senior director in DIT. India has now received approval for the domain name .bharat, which can be written in all 11 scripts, and foresees extensive usage of it in applications such as e-governance and e-learning. Internationalized domain names provide a convenient mechanism for users to access websites in their local language. For example, if a person wants to give his or her system a domain name in his or her local language, such as Hindi, it could look like www.bhasha.bharat.
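The Unicode-to-ASCII step mentioned above is done with Punycode; the short sketch below (illustrative, using Python's built-in punycode codec and skipping the extra normalisation steps of full IDNA processing) shows how a Devanagari label becomes the ASCII form that the DNS actually stores.

```python
# Illustrative only: a Devanagari domain label is Punycode-encoded and given
# the "xn--" ACE prefix so that the ASCII-only DNS can store it. Full IDNA
# processing adds normalisation and validity checks that are omitted here.

label = "भाषा"                                   # e.g. the first label of www.bhasha.bharat
ace = "xn--" + label.encode("punycode").decode("ascii")
print(ace)                                       # the ASCII form kept in the DNS

# Decoding recovers the original Unicode label.
print(ace[len("xn--"):].encode("ascii").decode("punycode"))
```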
Applications
As technologies for language computing evolve in India, developers, industrialists, and academia are finding new and innovative uses for them. One prominent use is the digitization, or creation of e-books, of the mounds of rich literature in different Indian languages. A very interesting emerging use of OCR is tagging, that is, scanning hard copies of old books in regional libraries and creating an index for book search. "This would help greater and better digitization of libraries across the Indian cultural terrain," says Santanu Chaudhury, professor, IIT-Delhi, and head of the OCR project at DIT. He mentions that DIT has received huge demand for Braille books and accessibility solutions, which is now being worked on. Physical documents can be converted into e-documents, and these can further be read out using text-to-speech engines developed by private companies and institutions.
At IBM Research Labs India, senior researcher Arun Kumar works on another very interesting project, called the Spoken Web, or, as IBM likes to call it, the World Wide Telecom Web. IBM says it is the start of its vision of creating a parallel, voice-driven ecosystem, much like the World Wide Web, for better inclusion on the internet, including of challenged people. The Spoken Web is a collection of voice sites, each of which has a unique virtual phone number, called a VoiNumber, that maps to a physical phone number. When a user dials the VoiNumber of a voice site, he or she hears the content of that site over the phone.
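A hypothetical sketch of that VoiNumber-to-phone-number mapping is shown below; the registry structure, numbers, and function names are all invented for illustration and are not IBM's implementation.

```python
# Hypothetical sketch of the VoiNumber idea: a virtual number maps to the
# physical phone number hosting a voice site, so dialing the VoiNumber plays
# that site's audio content. All names and numbers here are invented.

voinumber_registry = {
    "880001": "+911234567890",   # voice site of a hypothetical local business
    "880002": "+919876543210",
}

def dial(voinumber):
    host = voinumber_registry.get(voinumber)
    if host is None:
        return "No such voice site"
    # In the real system the call is bridged to an IVR that plays the site's content.
    return f"Connecting to voice site hosted at {host}..."

print(dial("880001"))
```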
Another application of language computing comes into play with cross-lingual search and the wordnets being developed by Pushpak Bhattacharyya, professor of computer science and engineering at IIT-Bombay and head of the laboratory for intelligent internet access at the institute. Cross-lingual search addresses the need to find resources on the web in languages other than the one the query is typed in. Currently, search engines like Google do a template-based search: if a user types professor in the search box, documents and pages on the web containing that exact string show up. But there might be relevant documents in other languages which the user may want to see. Bhattacharyya is working on a search engine that can also serve vernacular results to users. He has also developed the widely recognized Wordnet, an online lexical reference system, for the Hindi and Marathi languages. Wordnet's structure allows for language processing and research in applications such as voice recognition and dictation systems.
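The sketch below illustrates, under stated assumptions, one way such a cross-lingual search could work: the query term is expanded into equivalents in other languages (a toy bilingual dictionary stands in for a wordnet here), so documents in any of those languages can match. The entries and documents are invented.

```python
# A toy cross-lingual query expansion: a hand-made dictionary stands in for a
# wordnet, and documents in any listed language can match the expanded query.

toy_wordnet = {
    "professor": {"en": ["professor", "lecturer"],
                  "hi": ["प्राध्यापक"],
                  "mr": ["प्राध्यापक"]},
}

documents = [
    ("doc1", "en", "The professor published a new paper."),
    ("doc2", "hi", "प्राध्यापक ने नया शोधपत्र प्रकाशित किया।"),
]

def cross_lingual_search(query, docs, wordnet):
    # Expand the query term into all its cross-lingual equivalents.
    terms = {t for forms in wordnet.get(query, {}).values() for t in forms}
    return [doc_id for doc_id, lang, text in docs if any(t in text for t in terms)]

print(cross_lingual_search("professor", documents, toy_wordnet))  # ['doc1', 'doc2']
```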
Social Collaboration
Another prominent entity in the computer industry working on cross-lingual information retrieval is Microsoft. Microsoft India has shown avid interest in the Indian localization market and, as a software giant, has made many versions of the Microsoft Windows operating system, as well as the Microsoft Office software, available in different Indian languages.
Kumaran of Microsoft Research has spearheaded a project called WikiBhasha, with Wikipedia, to enable people to create Wikipedia pages in regional languages. The WikiBhasha label has been coined by combining Wiki and Bhasha (which means language in Hindi). It is a multilingual content creation tool that enables users to source existing English pages from Wikipedia, convert them into their selected language, and then manually edit or add content to those pages to produce a parallel Wikipedia page in an Indian language. Microsoft says that WikiBhasha Beta has been released as an open-source MediaWiki extension.
WikiBhasha uses Microsoft's machine translation systems, which are based on statistical algorithms. Another aim of the project is to crowdsource, or collect, data on a volley of topics in different Indian languages, which could then be used for further research. "This will help in solving the data problem that language computing experts are facing at the most basic level," says Kumaran. To underline the point, he adds, "To do translation between any 2 languages, we need 4 mn sentence pairs to develop robust machine translation systems."
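The toy sketch below shows, in miniature, why parallel sentence pairs matter for statistical translation: co-occurrence counts over aligned pairs already yield crude word-translation guesses, and systems of the kind mentioned here scale the same idea to millions of pairs and far richer models. The sentence pairs and transliterations are invented for illustration.

```python
# Toy illustration of learning word translations from aligned sentence pairs
# by co-occurrence counting. Real statistical MT uses millions of pairs and
# much more sophisticated alignment models; these pairs are invented.

from collections import Counter, defaultdict

sentence_pairs = [
    ("a book", "ek kitab"),
    ("a boy", "ek ladka"),
    ("the book", "kitab"),
    ("the boy", "ladka"),
]

cooccur = defaultdict(Counter)
for en, hi in sentence_pairs:
    for e in en.split():
        for h in hi.split():
            cooccur[e][h] += 1        # count how often e and h appear together

# The most frequently co-occurring target word is a crude translation guess.
for e in ["book", "a", "boy"]:
    print(e, "->", cooccur[e].most_common(1)[0][0])
```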
India now stands at a point where the efforts seem to be producing results, and the teams are getting affirmation of their work as European countries call upon them for consultation. The world now wants to take lessons from India on how to manage the huge complexity in permutations and combinations of languages, scripts, and dialects, and successfully develop models for integrating natural languages into machines.