Automatic Speech Recognition: What a 3B Parameter Model Actually Sounds Like in Production

FonadaLabs Team | March 5, 2026 | 10 min read

The Honest Truth About Speech Recognition in 2026

If you have ever integrated an automatic speech recognition system into a real product, you already know the feeling. The demo works beautifully. The accuracy numbers look great on paper. And then you ship it, real users start talking into it, and something quietly goes wrong.

Maybe the latency is just a beat too slow and conversations start feeling stilted. Maybe it handles English well but stumbles the moment someone switches to Hindi. Maybe the turn detection keeps cutting people off or waiting too long after they have finished speaking. None of these problems show up on a leaderboard. All of them show up in your user retention numbers.

We have been building speech technology long enough to have made most of these mistakes ourselves. Which is why when we built our 3 billion parameter ASR model, we tested it the way it would actually be used. Not on prepared audio clips in a quiet room. On live inference, across real network conditions, in two languages, with turn detection running alongside transcription the way it has to in production.

This is everything we found.

Why 3 Billion Parameters

The size of a model is not a marketing number. It is a set of tradeoffs, and being honest about those tradeoffs is the only way to make a decision you will not regret six months into a deployment.

Smaller models are faster and cheaper to run, but they start making accuracy compromises that compound in real use. Accented speech becomes harder. Rare vocabulary drops out. Multilingual performance gets patchy. The kinds of errors that are invisible on a clean test set become very visible when a user is dictating a medical note or a customer is calling a support line in their second language.

Larger models are more capable, but they create infrastructure problems. Most teams building production voice applications are not running fleets of the largest available GPUs. They need something that fits on the hardware they actually have, especially if they need on-premises deployment for compliance reasons.

Three billion parameters is where we landed after a lot of iteration. It is large enough to handle the full texture of natural speech, including noise, accents, domain-specific vocabulary, and multilingual input. It is deployable enough to run on-premises without requiring specialised infrastructure that prices out the majority of real-world use cases.

That balance is not accidental. It is the whole point.

Latency: The Number That Actually Decides Whether Your Product Feels Alive

There is a threshold in voice technology that nobody officially wrote down but every product builder eventually discovers through painful experience. When a voice system responds in under 300 milliseconds, users experience it as natural. When it creeps above that, something shifts. The interaction starts feeling like a tool rather than a conversation. Users begin compensating, speaking more slowly, pausing longer, treating the system as something to be managed rather than something to just talk to.

We measured our latency across two deployment conditions.

On a local area network, our model returns results in 184.39 milliseconds.

This is the number for controlled deployments where the inference infrastructure sits close to the user. Contact centres, enterprise meeting tools, hospital systems, legal transcription, government applications. Any environment where you can put the processing power in the same building or the same private network as the people using it. At 184 ms, the gap between speaking and seeing transcribed text essentially disappears. Conversations flow at human pace because the system is keeping up with human pace.

Across a global network, the model delivers results in 241.16 milliseconds.

This is the number for cloud-routed deployments where audio is travelling across international infrastructure before being processed and returned. 241 ms for a round trip that crosses borders and oceans, runs through a 3 billion parameter neural network, and comes back as accurate transcription is a figure worth sitting with for a moment. It means that even in the worst-case network scenario we tested, we are still sitting comfortably under the threshold where users notice delay.

The roughly 57 millisecond gap between the two figures (241.16 ms minus 184.39 ms is 56.77 ms) is also worth paying attention to. It tells you that the model itself is not the bottleneck. The compute is tight. What you are seeing in that gap is almost entirely network transit time, not processing overhead. Improve the network path and the latency follows.
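The arithmetic behind that gap is simple enough to check by hand, but here it is as a short sketch. The two latency figures are the ones reported above; the variable names are illustrative only.

```python
# Latency figures reported in this article, in milliseconds.
LAN_MS = 184.39
GLOBAL_MS = 241.16

# Difference between the cloud-routed and local figures.
gap_ms = round(GLOBAL_MS - LAN_MS, 2)          # 56.77

# The same gap expressed as a fraction of the global round trip.
gap_share = round(gap_ms / GLOBAL_MS, 3)

print(f"gap: {gap_ms} ms, share of global latency: {gap_share:.1%}")
```

Read this way, a little under a quarter of the global figure is the difference attributable to the longer route, and the rest is common to both deployments.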

On-Premises Processing: Why It Changes More Than Just the Numbers

The latency advantage of local deployment is real and measurable. But the more important reason a growing number of serious ASR deployments are moving on-premises has nothing to do with milliseconds.

When speech data never leaves your infrastructure, you eliminate an entire category of legal and regulatory exposure. Healthcare conversations are covered by laws that have specific requirements about where data can travel and who can access it. Legal proceedings generate audio that courts treat as sensitive. Financial services calls carry compliance obligations that most cloud routing arrangements complicate significantly. Government applications often cannot use cloud processing at all.

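On-premises deployment does not just improve your latency. It simplifies the compliance conversation considerably. The audio never crosses a third-party network, so there is no cross-border transfer to justify, no external processor to vet, and no exposure if a cloud provider has a service incident. Internal access controls and audit obligations still apply, but they stay inside systems you already govern.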

It also gives you a latency floor you can actually depend on. Cloud routes degrade. Traffic spikes add jitter. On-premises inference on dedicated hardware gives you a consistent 184 ms that your application can build around rather than a target that fluctuates with conditions outside your control.
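If you want to check consistency on your own network rather than rely on a single figure, a small harness along these lines is enough. It is a sketch under assumptions: `call` stands in for whatever function sends audio to your ASR endpoint and waits for the transcript, and none of the names here reflect the tooling used to produce the figures in this article.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_latency_ms(call: Callable[[], object], runs: int = 50) -> List[float]:
    """Time `call` repeatedly and return each wall-clock duration in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()  # hypothetical: one full request/response round trip
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def summarise(samples_ms: List[float]) -> Dict[str, float]:
    """Report the median, tail percentiles, and spread of the samples."""
    cuts = statistics.quantiles(samples_ms, n=100)  # 99 cut points
    return {
        "p50": statistics.median(samples_ms),
        "p95": cuts[94],
        "p99": cuts[98],
        "stdev": statistics.pstdev(samples_ms),  # rough jitter indicator
    }


if __name__ == "__main__":
    # Stand-in workload so the sketch runs without a real endpoint.
    print(summarise(measure_latency_ms(lambda: sum(range(1000)))))
```

The tail percentiles matter more than the median here. A route whose p50 looks fine but whose p95 drifts past your budget is the kind of fluctuation described above.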

Smart-Turn: The Part of ASR Nobody Talks About Enough

Transcription accuracy gets all the attention in ASR evaluations. Word error rate. Character error rate. Leaderboard positions. These numbers matter.

But there is a component of any real conversational speech system that is just as important and gets a fraction of the coverage: end-of-turn detection. Knowing when someone has finished speaking.

This sounds trivial until you actually try to build it. Natural speech is full of pauses that are not turn endings. People stop mid-sentence to find a word. They trail off while thinking. They use rising intonation in the middle of a statement. They speak at different paces in different languages and different cultural contexts. A turn detection system that relies on simple silence thresholds cuts speakers off. One that waits too long kills conversational rhythm. Getting this right requires the model to understand prosody, context, and linguistic structure, not just audio energy.
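To make the failure mode concrete, here is a minimal sketch of the naive silence-threshold approach described above. It is not how Smart-Turn works. The function name, the per-frame energy values, and the thresholds are all made up for illustration.

```python
from typing import List, Optional


def silence_endpoint(
    frame_energies: List[float],
    energy_floor: float,
    max_silent_frames: int,
) -> Optional[int]:
    """Naive endpointing: declare the turn over after a fixed run of quiet frames.

    Returns the index of the frame at which the turn is declared finished,
    or None if the silence run is never long enough.
    """
    silent_run = 0
    heard_speech = False
    for i, energy in enumerate(frame_energies):
        if energy >= energy_floor:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= max_silent_frames:
                return i
    return None


# Illustrative utterance: speech, a 4-frame thinking pause, more speech,
# then the real end of the turn. The last speech frame is index 13.
frames = [0.9] * 5 + [0.01] * 4 + [0.8] * 5 + [0.01] * 8

early = silence_endpoint(frames, energy_floor=0.1, max_silent_frames=3)
late = silence_endpoint(frames, energy_floor=0.1, max_silent_frames=6)
print(early, late)
```

With a short threshold the detector fires at frame 7, inside the thinking pause, and cuts the speaker off. With a longer one it survives the pause but only fires at frame 19, six frames after the speaker actually stopped. No single threshold fixes both, which is why energy alone is not enough.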

We call our solution Smart-Turn.

Accuracy results:

Smart-Turn achieves 93.01% accuracy in English and 93.13% accuracy in Hindi.

That near-identical performance across two typologically different languages is the result we are most pleased about in this entire evaluation. English and Hindi have different sentence structures, different prosodic patterns, and different norms around how pauses function in conversation. The fact that our turn detection performs at essentially the same level in both suggests that the underlying mechanism has learned cues that carry across languages, rather than relying on English-specific cues dressed up as a general solution.

Latency results:

Smart-Turn adds almost no processing overhead. English turn detection completes in 0.0468 seconds (46.8 ms). Hindi completes in 0.0467 seconds (46.7 ms).

Under 47 milliseconds to decide whether a speaker has finished their thought. That is fast enough that it contributes essentially nothing to perceived latency in a live application. From the user's perspective, the system knows they have finished speaking as soon as they have finished speaking.

Multilingual Support That Is Actually Multilingual

The Hindi and English parity in our Smart-Turn results points to something broader about how we approached this model.

Multilingual ASR has a history of broken promises. A model publishes strong English numbers, lists a dozen other languages in the feature overview, and then quietly delivers degraded performance on anything outside the top tier. Developers who need genuine multilingual capability learn to read the fine print carefully. Accuracy on the headline language and accuracy on the secondary languages are often not the same conversation.

We made a decision at the architecture stage that this would not be how we operated. Hindi was not retrofitted after the English model was finished. Multilingual capability was built into the training process from the start, treated as a core requirement rather than a stretch goal.

The result is a model where Hindi is not a diminished version of the English experience. It is the same experience. Same underlying architecture. Same approach to turn detection. Same latency profile. Same accuracy tier.

This matters commercially because the global market for voice technology is not English-first. India represents one of the largest and fastest-growing user bases for voice applications anywhere in the world. Any ASR deployment that treats Hindi as a secondary concern is making a significant product and business mistake, not just a technical one.

What This Looks Like When You Build With It

Numbers in isolation are only half the story. Here is what these metrics translate to across the kinds of applications people are actually building in 2026.

Voice assistants and conversational agents. The combination of 184 ms LAN latency and 47 ms Smart-Turn detection means a voice assistant can respond in genuinely natural conversational time. The user finishes speaking, the system registers the turn ending almost instantly, and the response begins before any silence has time to register as a pause. The conversation feels continuous because it is.
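A quick way to sanity-check that combination is to add the figures up against the 300 ms threshold discussed earlier. The sketch below assumes the worst case, where turn detection and transcription run one after the other. The article does not state how the two stages are scheduled, so treat the sum as an upper bound on these two stages only. It does not include whatever your application does to generate a response.

```python
# Figures reported in this article, in milliseconds.
THRESHOLD_MS = 300.0
ASR_MS = {"lan": 184.39, "global": 241.16}
SMART_TURN_MS = {"english": 46.8, "hindi": 46.7}  # 0.0468 s and 0.0467 s


def budget(network: str, language: str) -> tuple:
    """Return (total_ms, headroom_ms), assuming the stages run sequentially."""
    total = ASR_MS[network] + SMART_TURN_MS[language]
    return round(total, 2), round(THRESHOLD_MS - total, 2)


for network in ASR_MS:
    for language in SMART_TURN_MS:
        total, headroom = budget(network, language)
        print(f"{network:6s} {language:8s} total={total} ms headroom={headroom} ms")
```

Under that assumption the LAN case leaves roughly 69 ms of headroom and the global case roughly 12 ms, so a cloud-routed assistant has much less room for downstream work before it crosses the threshold.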

Live transcription for meetings, courts, and clinics. At 241 ms global latency, captions appear at essentially the same moment words are spoken. There is no visible lag. Participants can read along in real time rather than watching a transcript play catch-up two sentences behind the speaker.

Multilingual contact centres. Over 93% Smart-Turn accuracy in both English and Hindi from a single model removes the need for separate language pipelines, separate accuracy benchmarks, and separate quality monitoring for different call queues. One model handles the full range of your customer base.

Compliance-sensitive deployments. On-premises processing at 184 ms means you never have to choose between speed and data sovereignty. Healthcare, legal, financial, and government applications can have both.

The Numbers, Plainly Stated

LAN latency: 184.39 ms.
Global latency: 241.16 ms.
Smart-Turn accuracy in English: 93.01%.
Smart-Turn accuracy in Hindi: 93.13%.
Smart-Turn latency: under 47 milliseconds in both languages.
Model size: 3 billion parameters, deployable on-premises.

What we value most in this set of results is not any single figure. It is the consistency. The model performs in the same tier across different network conditions, across different deployment environments, and across different languages. It does not have a best-case scenario that masks a difficult average. The numbers you see here are the numbers you get.

Final Word

Speech recognition is not a solved problem just because the leaderboard numbers keep improving. The part that remains unsolved is the gap between what a model can do on a curated dataset and what it reliably does when real people are using it in real conditions.

We built this model to close that gap. Not to win a benchmark. To work in a contact centre in Mumbai and a legal firm in London and a hospital system running fully on-premises and a global consumer product routing through cloud infrastructure. In all of those environments, across both of the languages we tested, under both of the network conditions we measured.

That is the bar we hold ourselves to. And these are the numbers we got.