Sarvam AI launches audio model for 22 Indian languages

Sarvam AI has launched an audio-first model supporting 22 Indian languages. Built on a 2B-parameter architecture, it handles code-mixing and regional accents. The model offers 8kHz support for telephony at Rs 30 per hour.

DQI Bureau

03 Feb 2026 15:07 IST

Sarvam AI released Sarvam Audio, an audio-first large language model designed to process speech across 22 Indian languages. The Bengaluru-based startup announced the launch on 3 February 2026, stating the model handles the specific linguistic patterns found in Indian speech, including code-mixing and regional accents.

The new model builds on the Sarvam 2B architecture, a 2-billion-parameter system. Unlike traditional speech-to-text tools that often struggle with the informal way Indians switch between languages in a single sentence, Sarvam Audio processes these nuances directly. The model supports major languages including Hindi, Tamil, Telugu, Malayalam, Marathi, Bengali, and Indian English.

Technical performance and accuracy

The company reports that Sarvam Audio outperforms global models such as GPT-4o and Gemini 1.5 Flash on specific Indic-language benchmarks. It features a specialised tokeniser that uses 1.4 to 2.1 tokens per word for Indian languages. In comparison, standard global models often require 4 to 8 tokens per word for the same text. Because fewer tokens must be processed for each sentence, this design reduces the computational power needed and lowers the cost for developers.
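To put the tokeniser figures in perspective, the short Python sketch below works through the arithmetic. It uses only the tokens-per-word ranges quoted above; the function and constant names are illustrative and are not part of any Sarvam AI library.

```python
# Tokens-per-word ranges as quoted in the article (low, high).
INDIC_TOKENISER = (1.4, 2.1)    # Sarvam's reported range for Indian languages
GLOBAL_TOKENISER = (4.0, 8.0)   # range attributed to standard global models


def token_range(words, per_word):
    """Return the (low, high) token count for a text of `words` words."""
    low, high = per_word
    return words * low, words * high


def reduction_range(efficient, baseline):
    """Return the smallest and largest factor by which token counts shrink."""
    smallest = baseline[0] / efficient[1]   # best baseline vs worst efficient case
    largest = baseline[1] / efficient[0]    # worst baseline vs best efficient case
    return smallest, largest


if __name__ == "__main__":
    print(token_range(1000, INDIC_TOKENISER))
    print(token_range(1000, GLOBAL_TOKENISER))
    print(reduction_range(INDIC_TOKENISER, GLOBAL_TOKENISER))
```

On these figures, a 1,000-word passage would need roughly 1,400 to 2,100 tokens instead of 4,000 to 8,000, a reduction of about 1.9 to 5.7 times. Actual counts depend on the text and the language.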

Sarvam AI trained the model on a dataset of 4 trillion tokens, including 2 trillion tokens specifically from Indic sources. The training took place on Yotta’s Shakti Cloud using 1,024 NVIDIA H100 GPUs over five days.
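As a rough guide to the scale of that training run, the figures can be converted into GPU-hours. The sketch below assumes all 1,024 GPUs ran continuously for the full five days, which the company has not confirmed, so the result is an upper bound rather than a reported number. The function names are illustrative.

```python
def gpu_hours(gpu_count, days):
    """Total GPU-hours, assuming every GPU runs around the clock."""
    return gpu_count * days * 24


def tokens_per_gpu_hour(total_tokens, gpu_count, days):
    """Average training tokens handled per GPU-hour under the same assumption."""
    return total_tokens / gpu_hours(gpu_count, days)


if __name__ == "__main__":
    # Figures from the article: 1,024 H100 GPUs, five days, 4 trillion tokens.
    print(gpu_hours(1024, 5))
    print(tokens_per_gpu_hour(4_000_000_000_000, 1024, 5))
```

Under that assumption the run comes to 122,880 GPU-hours, or roughly 32.6 million tokens per GPU-hour.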

Features and use cases

Sarvam Audio provides several capabilities for enterprise and developer use:

Real-time and batch processing: Supports audio files under 30 seconds for immediate results and files of up to one hour for batch tasks.
Speaker identification: Includes diarisation to distinguish between different people in a recording.
Telephony optimisation: Processes low-quality 8kHz audio, which is common in phone-based customer service.
Entity preservation: Retains specific data such as currency, dates, and URLs during transcription.
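For developers, the duration limits in the first item suggest a simple routing step before a file is submitted. The Python sketch below is one way to do this for WAV files using the standard library. The mode names, the function names, and the handling of files longer than one hour are assumptions for illustration and do not reflect the Sarvam AI API.

```python
import wave

REALTIME_LIMIT_S = 30        # "files under 30 seconds", per the article
BATCH_LIMIT_S = 60 * 60      # "up to one hour", per the article


def wav_duration_seconds(source):
    """Return the length of a WAV file (path or file-like object) in seconds."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def choose_mode(duration_s):
    """Pick a hypothetical processing mode from the audio duration."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    if duration_s < REALTIME_LIMIT_S:
        return "realtime"
    if duration_s <= BATCH_LIMIT_S:
        return "batch"
    # Assumption: anything longer must be split by the caller first.
    raise ValueError("audio is longer than one hour; split it before submitting")
```

A 12-second voice note would be routed to real-time processing, while a 40-minute call recording would go to batch.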

The model is currently available through the Sarvam AI API platform. The company set the price for speech-to-text services at Rs 30 per hour of audio. For tasks requiring speaker identification, the price increases to Rs 45 per hour.
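At those rates, a monthly bill is straightforward to estimate. The Python sketch below assumes charges scale linearly with audio duration, with no minimum charge or rounding; the article does not describe the billing granularity, and the function name is illustrative.

```python
RATE_RS_PER_HOUR = 30              # speech-to-text, per the article
RATE_DIARISATION_RS_PER_HOUR = 45  # with speaker identification, per the article


def estimate_cost_rs(audio_hours, diarisation=False):
    """Estimate the charge in rupees, assuming simple pro-rata billing."""
    if audio_hours < 0:
        raise ValueError("audio_hours must not be negative")
    rate = RATE_DIARISATION_RS_PER_HOUR if diarisation else RATE_RS_PER_HOUR
    return audio_hours * rate


if __name__ == "__main__":
    print(estimate_cost_rs(100))                    # 100 hours, transcription only
    print(estimate_cost_rs(100, diarisation=True))  # 100 hours with diarisation
```

On this basis, 100 hours of call recordings would cost Rs 3,000 for transcription alone and Rs 4,500 with speaker identification.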

Background and funding

Vivek Raghavan and Pratyush Kumar founded Sarvam AI in July 2023. Both founders previously worked with AI4Bharat, a research lab at IIT Madras. The company has raised USD 53.8 million to date from investors including Lightspeed Venture Partners, Peak XV Partners, and Khosla Ventures.

In 2025, the Indian government selected Sarvam AI to develop the country’s first indigenous foundational model under the IndiaAI Mission. This selection provides the company with access to 4,000 government-supported GPUs to further expand its technical infrastructure.
