Handling Numbers, Dates, and Special Characters in TTS

FonadaLabs Team | February 9, 2026 | 11 min read

The Silent Killer of TTS Quality: Why "Meet me at 5" Breaks Your Voice System

You feed a simple sentence to your TTS system: "Meet me at 5."

Does it say "meet me at five"? Or "meet me at five o'clock"? Maybe "meet me at five PM"? What if the context suggests it's 5 AM, or the 5th of the month, or 5 dollars?

Now try something actually challenging: "₹1,50,000 by 15/03/2025."

Should it say "rupees one fifty thousand" or "rupees one lakh fifty thousand"? And is that date March 15th or the 15th of March? Does the year get pronounced "twenty twenty-five" or "two thousand twenty-five"?

This seemingly simple problem—converting written text into speakable words—is one of the hardest, most underestimated challenges in production TTS systems. It's called text normalization, and here's the uncomfortable truth: 80% of TTS errors don't come from voice synthesis quality. They come from this invisible preprocessing stage where your system fundamentally misunderstands what the text means.

Get it wrong, and your perfectly synthesized voice will confidently announce complete nonsense. Get it right, and users never notice the thousands of micro-decisions happening behind every spoken sentence.

Why This Is Brutally Harder Than It Looks

Humans are remarkably good at reading context without conscious effort. You instantly know that "Dr. Smith has a Ph.D." contains three different meanings of periods: title abbreviation, academic degree, and sentence terminator. You recognize that "lead" in "lead singer" sounds nothing like "lead" in "lead poisoning."

This contextual intelligence that humans develop effortlessly over years of reading? It must be explicitly programmed into TTS systems. And the edge cases are infinite.

The Ambiguity Nightmare

Written text is compressed communication optimized for human readers, not machine parsers. Every compression creates ambiguity.

"St." could mean "Street" or "Saint." "No." could be "Number" or the word "No." The digit "2" might be read as "two," stand in for "to" or "too" in informal text ("2 good 2 be true"), or form part of "H2O," where the whole token is spelled out as "aitch-two-oh."

Numbers are especially brutal. "1984" is "nineteen eighty-four" when it's a year, but "one thousand nine hundred eighty-four" in most other contexts. Except when it's a book title referencing Orwell, in which case it's specifically "nineteen eighty-four" even outside the context of dates.

Your normalization system needs to make these decisions correctly, thousands of times per document, or users notice.

Language-Specific Complexity That Multiplies Everything

The patterns that work for English completely fail for other languages, and this is where most systems fall apart when they try to go multilingual.

English says "three point five" for decimals. Hindi says "saaṛhe tīn" (literally "three and a half") for 3.5. These aren't just translation differences—they're fundamentally different conceptual frameworks for expressing the same numerical concept.

Indian numbering uses lakhs and crores, not millions and billions. A number like "1,50,000" (with Indian comma placement) should be verbalized as "one lakh fifty thousand," not "one hundred fifty thousand." The comma placement itself signals the numbering system being used.

Currency verbalization follows different patterns. "$5" becomes "five dollars" with the currency after the amount. "₹5" also becomes "five rupees" in English or "paanch rupaye" in Hindi. But the grammatical patterns differ: "₹1" is "ek rupaya" (singular) while "₹2" is "do rupaye" (plural). Your normalization must handle grammatical number agreement.

These aren't edge cases. These are fundamental linguistic patterns affecting billions of speakers.

Cultural Context That Changes Everything

Date formats flip between regions, and getting this wrong makes your system sound foreign and unnatural.

Americans say "March fifteenth." British speakers say "the fifteenth of March." Indians often say "fifteen three" (day-month) in casual speech. The numerical date "15/03/2025" is ambiguous on its face: read day-first it is March 15, but read month-first it names a nonexistent 15th month.

Phone numbers chunk differently across countries. American numbers follow "(555) 123-4567" patterns. Indian mobile numbers (10 digits) typically chunk as "98765-43210" becoming "nine eight seven six five, four three two one oh."

Without cultural awareness, your TTS doesn't just sound wrong—it sounds like it was built by people who don't understand how your users actually communicate.

The Multi-Stage Normalization Pipeline (And Where It Breaks)

Production TTS systems implement complex multi-stage pipelines that progressively transform written text into pronounceable forms. Each stage introduces opportunities for errors.

Stage 1: Tokenization and Classification

Before anything can be normalized, each token must be classified. Categories include plain word, number, currency, date, time, abbreviation, URL, email, chemical formula, and mathematical expression.

Machine learning classifiers trained on millions of examples handle this well. A good classifier labels "CO2" as a chemical formula, so it is read "C-O-two" rather than "co two" or "ko-two."

But context windows matter enormously. The token "March" might be a month or a verb ("march forward"). Looking at surrounding tokens—"15 March 2025" versus "they march forward"—disambiguates meaning. Get the context window wrong, and your classifier makes systematic errors.
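The "March" example can be illustrated with a minimal sketch. It is a hand-written heuristic with a one-token window on each side, not a trained classifier, and the function name `classify_token` and the label names are assumptions made for this illustration.

```python
import re

MONTHS = {"january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"}

# A neighbouring token counts as "numeric" if it is 1-4 digits, optionally
# followed by a comma (as in "March 15, 2025").
NUMERIC = re.compile(r"\d{1,4},?")

def classify_token(tokens, i):
    """Classify tokens[i] using a one-token context window on each side."""
    tok = tokens[i]
    prev_tok = tokens[i - 1] if i > 0 else ""
    next_tok = tokens[i + 1] if i + 1 < len(tokens) else ""
    if re.fullmatch(r"\d+", tok):
        return "number"
    if tok.lower() in MONTHS:
        # Only treat the word as a month when a neighbour looks like a
        # day or a year; otherwise fall back to a plain word.
        if NUMERIC.fullmatch(prev_tok) or NUMERIC.fullmatch(next_tok):
            return "month"
    return "word"

print(classify_token("15 March 2025".split(), 1))       # month
print(classify_token("they march forward".split(), 1))  # word
```

A window this narrow misses cases such as "in March we launched," which is one reason production systems learn the window size and features from data.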

Stage 2: Rule-Based Expansion (Where Complexity Explodes)

Once tokens are classified, apply expansion rules. These are deterministic transformations based on language, locale, and context.

Numbers are where the complexity becomes staggering. Consider all these different cases that the same digits might represent:

Cardinals: "5" becomes "five"
Ordinals: "5th" becomes "fifth"
Decimals: "3.14" becomes "three point one four"
Fractions: "1/2" becomes "one half"
Ranges: "10-15" becomes "ten to fifteen"
Years: "1947" becomes "nineteen forty-seven"
Quantities: "5 kg" becomes "five kilograms"
Identifiers: "Flight 5" stays "flight five" (not "flight fifth")

Indian numbering adds entirely new complexity. "1,50,000" should say "one lakh fifty thousand," not "one hundred fifty thousand." The comma placement itself signals Indian notation versus Western notation.
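As a rough illustration of lakh/crore expansion, here is a minimal sketch for non-negative integers with English number words. The function names (`indian_words` and its helpers) are hypothetical, and a real system would also need Hindi and other target-language word tables.

```python
ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "- - twenty thirty forty fifty sixty seventy eighty ninety".split()

def two_digits(n):
    """Words for 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")

def three_digits(n):
    """Words for 0..999."""
    hundreds, rest = divmod(n, 100)
    parts = []
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if rest or not parts:
        parts.append(two_digits(rest))
    return " ".join(parts)

def indian_words(n):
    """Verbalize a non-negative integer using thousand / lakh / crore."""
    if n == 0:
        return "zero"
    parts = []
    crore, n = divmod(n, 10_000_000)
    lakh, n = divmod(n, 100_000)
    thousand, n = divmod(n, 1_000)
    if crore:
        # Crore counts can themselves be large, so recurse.
        parts.append(indian_words(crore) + " crore")
    if lakh:
        parts.append(two_digits(lakh) + " lakh")
    if thousand:
        parts.append(two_digits(thousand) + " thousand")
    if n:
        parts.append(three_digits(n))
    return " ".join(parts)

value = int("1,50,000".replace(",", ""))
print(indian_words(value))  # one lakh fifty thousand
```

The grouping is the only structural difference from a Western expander: after the first three digits, each group is two digits wide instead of three.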

Context determines treatment: "Rs. 5 lakh" in a financial document is formal and explicit. "5L views" in social media might say "five lakh" or might stay as "five L" depending on register and audience.

Currency verbalization changes with amount and language:

"$5.50" becomes "five dollars fifty cents" (English)
"₹5.50" becomes "paanch rupaye pachaas paise" (Hindi)
"€5.50" becomes "five euros fifty cents"

The grammatical forms differ too. Plain string substitution is not enough for currency names: each language needs its own rules for when a currency name takes the singular or the plural form.
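A minimal sketch of number agreement for currency names, English output only, might look like the following. The table `CURRENCIES`, the function `verbalize_currency`, and the restriction to amounts under 100 are assumptions chosen to keep the example short; Hindi forms such as "ek rupaya" versus "do rupaye" would need their own table and agreement rules.

```python
import re

ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "- - twenty thirty forty fifty sixty seventy eighty ninety".split()

# symbol -> ((major singular, major plural), (minor singular, minor plural))
CURRENCIES = {
    "$": (("dollar", "dollars"), ("cent", "cents")),
    "€": (("euro", "euros"), ("cent", "cents")),
    "₹": (("rupee", "rupees"), ("paisa", "paise")),
}

def say(n):
    """Words for 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")

def verbalize_currency(text):
    """Expand strings like '$5.50' or '₹1' (major unit limited to 0..99)."""
    m = re.fullmatch(r"([$€₹])(\d{1,2})(?:\.(\d{2}))?", text)
    if m is None:
        raise ValueError(f"unsupported currency string: {text!r}")
    symbol, major, minor = m.group(1), int(m.group(2)), int(m.group(3) or 0)
    (major_sg, major_pl), (minor_sg, minor_pl) = CURRENCIES[symbol]
    parts = []
    if major or not minor:
        # Singular only for exactly one unit.
        parts.append(f"{say(major)} {major_sg if major == 1 else major_pl}")
    if minor:
        parts.append(f"{say(minor)} {minor_sg if minor == 1 else minor_pl}")
    return " ".join(parts)

print(verbalize_currency("$5.50"))  # five dollars fifty cents
print(verbalize_currency("₹1"))     # one rupee
```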

Dates and times are ambiguity central. The string "15/03/2025" could be:

March 15, 2025 (Indian/European format)
An invalid date with month 15 and day 3 (American month-first format misapplied)
Some other interpretation depending on context

Locale detection becomes crucial, but many systems don't have reliable locale information. You're forced to infer from other signals in the text.
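One way to infer the order from the text itself is to try both readings and see which ones form a valid calendar date. The sketch below assumes a four-digit year and a hypothetical `day_first` locale hint; it is an illustration of the idea, not a full date parser.

```python
import re
from datetime import date

def parse_numeric_date(text, day_first=None):
    """Parse 'a/b/yyyy', inferring the order when only one reading is valid.

    day_first is a locale hint (True, False, or None for unknown). It is
    only consulted when both readings are valid and differ.
    """
    a, b, year = (int(part) for part in re.split(r"[/.-]", text))
    readings = {}
    for is_day_first, (day, month) in ((True, (a, b)), (False, (b, a))):
        try:
            readings[is_day_first] = date(year, month, day)
        except ValueError:
            pass  # e.g. month 15 does not exist
    if not readings:
        raise ValueError(f"not a valid date: {text!r}")
    if len(set(readings.values())) == 1:
        return next(iter(readings.values()))
    if day_first is None:
        raise ValueError(f"ambiguous date, locale needed: {text!r}")
    return readings[day_first]

print(parse_numeric_date("15/03/2025"))                  # 2025-03-15
print(parse_numeric_date("05/03/2025", day_first=True))  # 2025-03-05
```

Raising on a truly ambiguous string such as "05/03/2025" is a deliberate choice in this sketch: it forces the caller to supply a locale instead of silently guessing.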

Format variations are endless:

"15-03-2025" becomes "fifteenth March twenty twenty-five"
"March 15, 2025" becomes "March fifteenth, twenty twenty-five"
"15 Mar '25" becomes "fifteenth March twenty twenty-five"

Should years be "twenty twenty-five" or "two thousand twenty-five"? Both are correct. Consistency matters more than the specific choice, but users will notice if you're inconsistent.
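Consistency is easiest to guarantee when one function owns the convention. The sketch below picks the "twenty twenty-five" style for years 1000 to 2099; the function name `say_year` and the specific choices for 2000 to 2009 ("two thousand five") and for years like 1905 ("nineteen oh five") are assumptions, and a different house style would work equally well as long as it is applied everywhere.

```python
ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "- - twenty thirty forty fifty sixty seventy eighty ninety".split()

def say_two_digits(n):
    """Words for 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")

def say_year(year):
    """Read a year as two digit pairs, with a few conventional exceptions."""
    if not 1000 <= year <= 2099:
        raise ValueError("this sketch only covers 1000-2099")
    century, rest = divmod(year, 100)
    if rest == 0:
        if century % 10 == 0:
            return say_two_digits(century // 10) + " thousand"  # 2000
        return say_two_digits(century) + " hundred"             # 1900
    if century % 10 == 0 and rest < 10:
        # 2005 -> "two thousand five"
        return f"{say_two_digits(century // 10)} thousand {ONES[rest]}"
    if rest < 10:
        return f"{say_two_digits(century)} oh {ONES[rest]}"     # 1905
    return f"{say_two_digits(century)} {say_two_digits(rest)}"

print(say_year(2025))  # twenty twenty-five
print(say_year(1947))  # nineteen forty-seven
```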

Abbreviations depend entirely on context. "Dr." before a name is "Doctor." But "Dr." after a street name, as in "42 Oak Dr., Apt. 5," is "Drive." The same token has different expansions based on position and surrounding context.
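Position-based expansion can be sketched with a capitalization heuristic: an abbreviation followed by a capitalized word is read as a title, and one preceded by a capitalized word is read as a street suffix. The names `EXPANSIONS` and `expand_abbreviations` are made up for this example, and the heuristic will misfire on inputs like "Main St. Louis," which is where a learned disambiguator earns its keep.

```python
import re

# abbreviation -> (reading before a name, reading after a street name)
EXPANSIONS = {"Dr": ("Doctor", "Drive"), "St": ("Saint", "Street")}
ABBREV = re.compile(r"(Dr|St)\.(,?)")

def expand_abbreviations(text):
    """Expand 'Dr.' and 'St.' using the capitalization of neighbouring tokens."""
    tokens = text.split()
    out = []
    for i, tok in enumerate(tokens):
        m = ABBREV.fullmatch(tok)
        if m is None:
            out.append(tok)
            continue
        before_name, after_name = EXPANSIONS[m.group(1)]
        prev_tok = tokens[i - 1] if i > 0 else ""
        next_tok = tokens[i + 1] if i + 1 < len(tokens) else ""
        if not m.group(2) and next_tok[:1].isupper():
            out.append(before_name)               # "Dr. Smith"
        elif prev_tok[:1].isupper():
            out.append(after_name + m.group(2))   # "Oak Dr." / "Main St.,"
        else:
            out.append(tok)                       # leave unknown cases alone
    return " ".join(out)

print(expand_abbreviations("Dr. Smith lives at 42 Oak Dr."))
# Doctor Smith lives at 42 Oak Drive
```

Note that the final period of the input is consumed by the abbreviation; a fuller implementation would track sentence boundaries separately.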

Stage 3: Linguistic Post-Processing

After rule-based expansion, linguistic corrections are needed to keep the output natural.

Check grammatical agreement, so that subjects and verbs match in number, and handle articles correctly. "A hour" is wrong; it should be "an hour," because "hour" begins with a vowel sound even though it is spelled with a consonant.

For Indian languages, sandhi rules (phonetic changes at word boundaries) need to be followed. For example, "dus rupaye" (ten rupees) may be spoken with slightly different vowel sounds than the individual written words suggest.

These aren't optional refinements. They're the difference between speech that sounds robotic and speech that sounds natural. This is why measuring voice naturalness requires evaluating the entire pipeline, not just the acoustic quality.

Stage 4: Ambiguity Resolution with Deeper Context

Some ambiguities call for linguistic analysis beyond simple pattern matching.

For example: "I read the book yesterday" and "I read books daily."

The word "read" is spelled the same in both sentences but pronounced differently: "red" in the past tense and "reed" in the present. Part-of-speech tagging and syntactic parsing help resolve such heteronyms, words spelled the same but pronounced differently.

Machine learning models increasingly handle these cases. Trained on large corpora in which humans have annotated the correct pronunciations, a model learns patterns that manual rules fail to capture.

However, this raises its own problems: training data requirements become enormous, and the "black box" nature of the models makes debugging hard. Why does the model mispronounce "1,50,000"? With rules you can trace the failing rule; with a neural model there is no equally direct way to find the cause.

Indian Language Challenges That Break Standard Approaches

If you're building for Indian languages, the complexity multiplies in ways English-focused systems completely miss.

Script-Specific Normalization

When input text is in Devanagari (Hindi), Tamil script, or Telugu script, normalization differs fundamentally. These scripts are mostly phonetic—what you see is roughly what you say—but numerals, punctuation, and loan words still need extensive handling.

"5 रुपये" mixes Arabic numerals with Devanagari text. Should this say "paanch rupaye"? Should the numeral be converted to Devanagari "५ रुपये" first? Production systems typically detect "5" as a number and expand to the target language's verbal form directly.
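A minimal sketch of that direct expansion: detect the dominant script of the surrounding text from Unicode character names, then pick the matching number-word table. The helper names and the tiny 0 to 10 tables are assumptions for illustration. Because Python's `int()` accepts Unicode decimal digits, the same code path handles both "5" and "५".

```python
import re
import unicodedata

HINDI_NUMBERS = {0: "शून्य", 1: "एक", 2: "दो", 3: "तीन", 4: "चार", 5: "पाँच",
                 6: "छह", 7: "सात", 8: "आठ", 9: "नौ", 10: "दस"}
ENGLISH_NUMBERS = {0: "zero", 1: "one", 2: "two", 3: "three", 4: "four",
                   5: "five", 6: "six", 7: "seven", 8: "eight", 9: "nine",
                   10: "ten"}

def dominant_script(text):
    """Return the most common script among letters and combining marks."""
    counts = {}
    for ch in text:
        if ch.isalpha() or unicodedata.category(ch).startswith("M"):
            # Character names start with the script, e.g. "DEVANAGARI LETTER RA".
            script = unicodedata.name(ch, "UNKNOWN").split()[0]
            counts[script] = counts.get(script, 0) + 1
    return max(counts, key=counts.get) if counts else "UNKNOWN"

def expand_numbers(text):
    """Replace digit runs with words in the language implied by the script."""
    hindi = dominant_script(text) == "DEVANAGARI"
    table = HINDI_NUMBERS if hindi else ENGLISH_NUMBERS

    def replace(match):
        value = int(match.group())  # works for ASCII and Devanagari digits
        return table.get(value, match.group())  # leave unknown values as-is

    return re.sub(r"\d+", replace, text)

print(expand_numbers("5 रुपये"))
print(expand_numbers("5 rupees"))  # five rupees
```

Script detection alone does not help with romanized Hindi, which looks like Latin text; that case needs word-level language identification.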

The Lakh and Crore System

Indian numbering groups digits differently from the Western system. Western systems use thousands, millions, billions. India uses thousands, lakhs (hundred thousands), crores (ten millions).

"15,00,000" is "fifteen lakh," not "one million five hundred thousand."

Your normalization must:

Detect Indian comma placement (1,50,000 not 150,000)
Recognize lakh/crore context from locale or explicit markers
Handle mixed notation: "1.5 lakh" (one and a half lakh)
Support informal written forms: "1.5L" or "₹15L"

Get this wrong, and you're not just technically incorrect—you sound like a foreigner who doesn't understand how Indians talk about money.

Code-Mixed Text: The Killer Feature

Real-world Indian text mixes languages constantly. "Kal 5 PM ko meeting hai" contains English loanwords embedded in Hindi syntax.

Normalization must:

Detect language boundaries at the word level
Apply language-appropriate number expansion: "5 PM" might stay "five PM" even in a Hindi sentence, or become "paanch baje"
Preserve natural code-switching prosody without jarring transitions

This isn't an edge case. This is how hundreds of millions of people write and speak daily. Understanding why TTS doesn't work in Hinglish reveals how text normalization failures compound with other challenges in code-mixed speech synthesis.

Abbreviations in Indian English

"NEFT transfer," "PAN card," "GST invoice"—Indian English is full of acronyms specific to Indian administrative, financial, and cultural contexts.

Some are spelled out letter by letter ("GST" as "jee-ess-tee"). Others are pronounced as words when they are pronounceable ("PAN" as "pan"). Context and frequency determine treatment, and these patterns are rarely documented; they have to be learned from real-world usage.

Implementation Approaches: Rules vs. Machine Learning

Traditional systems use hand-crafted rules: thousands of if-then statements covering edge cases. This approach is maintainable and debuggable but brittle. Every new pattern needs manual coding.

Modern systems use hybrid approaches:

Rule-based for deterministic cases: Currency symbols, basic numbers, standard date formats
ML-based for ambiguity resolution: Context-dependent pronunciation, abbreviation expansion, heteronym disambiguation

This combines the reliability of rules with the adaptability of machine learning.

Regular Expressions and Finite State Transducers

Regex patterns efficiently match number formats, dates, and times. Finite state transducers (FSTs) encode transformation rules compactly, handling the complex morphological changes common in Indian languages.

These are fast, deterministic, and debuggable—critical properties for production systems.
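To make the regex side concrete, here is a minimal sketch of a span tagger built from one alternation with named groups. It shows only the regex half of the story, not an FST, and the class names and patterns are illustrative assumptions. Alternatives are ordered from most to least specific so that "15/03/2025" is tagged as a date before the cardinal rule can claim "15".

```python
import re

# Ordered most-specific first: the first alternative that matches wins.
TOKEN_RE = re.compile(r"""
    (?P<date>\b\d{1,2}[/-]\d{1,2}[/-]\d{4}\b)
  | (?P<time>\b\d{1,2}:\d{2}\b)
  | (?P<currency>[$€₹]\d[\d,]*(?:\.\d+)?)
  | (?P<ordinal>\b\d+(?:st|nd|rd|th)\b)
  | (?P<decimal>\b\d+\.\d+\b)
  | (?P<cardinal>\b\d[\d,]*\b)
""", re.VERBOSE)

def tag_spans(text):
    """Return (class, matched text) pairs for every numeric span in text."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]

print(tag_spans("₹1,50,000 by 15/03/2025"))
# [('currency', '₹1,50,000'), ('date', '15/03/2025')]
print(tag_spans("Meet me at 5"))
# [('cardinal', '5')]
```

Each tagged span can then be handed to a class-specific expander, which keeps every rule small and individually testable.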

Neural Text Normalization: Promise and Problems

Recent approaches use sequence-to-sequence models trained end-to-end. Input: raw text. Output: normalized, pronounceable text.

The model learns patterns from millions of examples, handling novel cases better than rules. But training data requirements are massive, and debugging failures is nightmarish. When the model mispronounces "1,50,000," understanding why requires model interpretation techniques most teams don't have.

Why Most Systems Get This Wrong (And What It Costs)

Text normalization is unglamorous infrastructure work. It doesn't make for impressive demos. Investors don't ask about it. But it's the foundation that determines whether your TTS sounds professional or amateur.

Over-normalization expands everything mechanically. "COVID-19" becomes "coronavirus disease twenty nineteen" instead of the natural "covid nineteen." Brand names, acronyms, and common abbreviations often sound better as-is.

Inconsistent handling picks different conventions randomly. If you expand "5 PM" as "five PM" in one sentence, don't suddenly say "seventeen hundred hours" in the next.

Ignoring prosody treats normalization as pure text transformation. But normalization affects how speech flows. A long number string "9876543210" said without chunking is incomprehensible. Break it into digestible pieces: "nine eight seven six five, four three two one oh."
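Chunking can be sketched as digit-by-digit reading with a comma inserted between groups, on the assumption that the downstream synthesizer turns commas into short pauses. The function name `speak_digits` and the `chunk_sizes` parameter are hypothetical; the default of two five-digit chunks follows the Indian mobile pattern described earlier.

```python
import re

# "oh" for zero follows the reading used in this article.
DIGIT_WORDS = {"0": "oh", "1": "one", "2": "two", "3": "three", "4": "four",
               "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def speak_digits(number, chunk_sizes=(5, 5)):
    """Read a phone number digit by digit, pausing (comma) between chunks."""
    digits = re.sub(r"\D", "", number)  # drop spaces, dashes, parentheses
    if len(digits) != sum(chunk_sizes):
        raise ValueError(f"expected {sum(chunk_sizes)} digits, got {len(digits)}")
    chunks, pos = [], 0
    for size in chunk_sizes:
        chunks.append(" ".join(DIGIT_WORDS[d] for d in digits[pos:pos + size]))
        pos += size
    return ", ".join(chunks)

print(speak_digits("9876543210"))
# nine eight seven six five, four three two one oh
print(speak_digits("(555) 123-4567", chunk_sizes=(3, 3, 4)))
# five five five, one two three, four five six seven
```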

The Path Forward: Living, Evolving Systems

Text normalization will never be "solved" because language evolves. New abbreviations emerge. Cultural conventions shift. Slang becomes mainstream. Your normalization system needs continuous updates.

Production systems require ongoing maintenance: monitoring real-world usage, collecting error cases, updating rules and models, testing comprehensively.

This is why text normalization is infrastructure, not a feature you build once and forget. It's a living system that grows with language itself.

Because great TTS isn't just about voice quality and prosody. It's about fundamentally understanding what text actually means before you try to speak it. And that understanding—converting "₹1,50,000 by 15/03/2025" into natural speech—is far harder than most teams realize until they're deep in production and users are complaining about nonsensical pronunciations.

Build this foundation right, or spend months fixing embarrassing errors after launch. Those are your options. When building your own TTS pipeline, text normalization must be a first-class component from day one. And when building low-latency TTS pipelines, remember that normalization latency compounds with all other stages—efficient, streaming normalization is essential for meeting real-time constraints.