Product Updates

FonadaLabs-Denoiser

FonadaLabs Team | June 29, 2026 | 7 min read

Every phone call, video conference, and contact-centre interaction carries the same invisible problem: microphones do not capture just the speaker's voice. They record everything (fan hum, traffic, keyboard clicks, crowd chatter, room echo) and transmit the whole mix to every downstream system that tries to make sense of it.

This post documents how we built FonadaLabs-Denoiser, a neural speech enhancement system designed for real-world telephony and voice AI pipelines, and how it benchmarks against a focused set of open and lightweight denoisers across three processing modes and four objective quality metrics.

◆ Across the Board: FonadaLabs-Denoiser ranks first on all quality metrics in every processing mode tested.

◆ Quality Leader: Leads on PESQ-NB, COVL, DNSMOS BAK, and P808 MOS, covering both intrusive and non-intrusive assessment.

◆ Native 16 kHz: Processes audio at its native sample rate, with no resampling, no added latency, and no quality penalty from sample-rate conversion.

◆ Streaming + Full Audio: Both real-time streaming and offline full-audio enhancement from a single system. One pipeline covers live calls, conferencing, batch cleanup, and ASR preprocessing.

1. The Problem

People speak clearly. Microphones do not. During any real phone call, the microphone captures the speaker's voice together with everything happening nearby. By the time the audio reaches a listener or an ASR engine, the signal-to-noise ratio may already be negative: the noise is louder than the speech.

Clean Voice (A) + Background Noise (B) = Noisy Audio (A+B) → FonadaLabs-Denoiser → Enhanced Speech (C)

The challenge is moving from A+B toward C without damaging the naturalness and clarity of A along the way. In practice, noise routinely drops the signal-to-noise ratio below usable thresholds long before anyone notices it on a dashboard.
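To make "negative SNR" concrete, here is a minimal sketch of the A + B mix and its signal-to-noise ratio in decibels. The two sine tones are illustrative stand-ins for speech and noise, and the function names are hypothetical. None of this is part of FonadaLabs-Denoiser.

```python
import math

def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in signal) / len(signal)

def snr_db(clean, noise):
    """Signal-to-noise ratio in decibels: 10 * log10(P_clean / P_noise)."""
    return 10.0 * math.log10(power(clean) / power(noise))

def mix(clean, noise):
    """Noisy audio (A+B): sample-wise sum of clean speech and noise."""
    return [a + b for a, b in zip(clean, noise)]

# Illustrative signals only (assumption): a quiet 200 Hz tone stands in for
# the clean voice (A) and a louder 50 Hz hum for the background noise (B).
SAMPLE_RATE = 16000
clean = [0.1 * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
noise = [0.2 * math.sin(2 * math.pi * 50 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
noisy = mix(clean, noise)

# The noise has twice the amplitude, so four times the power: about -6 dB.
print(f"SNR: {snr_db(clean, noise):.2f} dB")
```

Any value below 0 dB means the background carries more energy than the voice, which is the condition described above.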

Why It Matters

Noisy audio is rarely just an inconvenience. It is a measurable accuracy problem across every voice-driven product:

ASR systems produce significantly higher word error rates when input SNR drops; even moderate background noise can push transcription accuracy below usable thresholds
Voice activity detectors trigger false positives on non-stationary noise, wasting compute and breaking silence detection
Speaker verification and voice biometric systems see corrupted embeddings, reducing authentication reliability
Conversational AI and NLU engines receive degraded transcriptions, causing intent classification failures that propagate downstream
Contact centre agents spend more time asking callers to repeat themselves: a direct increase in average handle time and a measurable hit to customer satisfaction

2. Where Other Denoisers Fall Short

Noise cancellation tools have existed for years. The problem is not that denoisers do not exist. The problem is that most of them solve only one piece of the puzzle, and teams building production voice products need several pieces at once.

Latency vs. Quality

Some models deliver strong enhancement quality but introduce processing delays that are immediately noticeable on live calls. Even a few hundred extra milliseconds of buffering changes how natural a conversation feels.

The Sample Rate Problem

Most real-world telephony, VoIP infrastructure, and contact centre platforms operate at 16 kHz. Some noise cancellation systems are built around a native 48 kHz pipeline. That means every deployment needs a resampling stage, which adds latency, introduces artifacts, and creates another component to maintain.

Streaming vs. Full Audio

Live communication and batch recordings are genuinely different problems. Teams building products often need both: a live call enhancement path and an offline processing path for recordings. Finding one system that handles both well is harder than it should be.

The Tone Problem

Suppressing background noise without altering the speaker's voice is harder than it sounds. Aggressive suppression can thin out or alter the natural character of the voice, an equally serious problem for user experience.

The Chunk Boundary Problem

Real-time streaming denoising means processing audio in very small windows: each frame carries only a fraction of a second of audio context. Systems without persistent state must re-estimate the noise environment on every chunk, which makes long-session consistency difficult to maintain.

3. Built for Production

FonadaLabs-Denoiser is a neural speech enhancement system designed around practical deployment requirements, not a research model tuned to score well on a single test set in isolation.

Two Modes

The system operates in two distinct modes that address fundamentally different use cases:

Streaming mode (160ms and 640ms): audio is processed in real time as chunks arrive. Internal state is maintained across chunk boundaries, so the noise model built in the first seconds of a call continues to inform processing as the conversation continues.
Full Audio mode: the complete recording is processed in a single pass. Broader temporal context improves consistency, reduces boundary artifacts, and enables stronger enhancement on offline files.
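As a rough illustration of what these chunk durations mean at 16 kHz, the sketch below converts milliseconds to sample counts and splits a buffer into chunks. It assumes 160ms and 640ms are the chunk lengths handed to the streaming path. The helper names are hypothetical and do not reflect the FonadaLabs-Denoiser API.

```python
SAMPLE_RATE_HZ = 16000

def samples_per_chunk(chunk_ms, sample_rate_hz=SAMPLE_RATE_HZ):
    """Number of samples in one chunk of the given duration."""
    return sample_rate_hz * chunk_ms // 1000

def iter_chunks(samples, chunk_ms, sample_rate_hz=SAMPLE_RATE_HZ):
    """Yield consecutive chunks; the last one may be shorter."""
    size = samples_per_chunk(chunk_ms, sample_rate_hz)
    for start in range(0, len(samples), size):
        yield samples[start:start + size]

# Two seconds of silence as a stand-in for a live audio buffer.
two_seconds = [0.0] * (2 * SAMPLE_RATE_HZ)

print(samples_per_chunk(160))                    # 2560
print(samples_per_chunk(640))                    # 10240
print(len(list(iter_chunks(two_seconds, 160))))  # 13 (12 full chunks + 1 partial)
print(len(list(iter_chunks(two_seconds, 640))))  # 4 (3 full chunks + 1 partial)
```

A 160ms chunk gives the model a quarter of the context of a 640ms chunk, which is why state carried across chunk boundaries matters more as the chunk gets shorter.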

Native 16 kHz

FonadaLabs-Denoiser processes audio natively at 16 kHz, the same sample rate used by telephony infrastructure, SIP systems, and most voice AI platforms. There is no upsampling or downsampling step in the pipeline.

Stateful Streaming

FonadaLabs-Denoiser uses recurrent processing that maintains an evolving acoustic model across the full duration of a streaming session. It does not restart its noise estimate from scratch on every 160ms window.
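The difference between carrying state and restarting on every window can be sketched with a toy noise-floor tracker. This is not the recurrent model used in FonadaLabs-Denoiser: it is a simple exponential moving average, chosen only to show state persisting across chunks. The class name, smoothing value, and test signals are all assumptions.

```python
def mean_power(chunk):
    """Mean squared amplitude of one chunk."""
    return sum(s * s for s in chunk) / len(chunk)

class StatefulNoiseTracker:
    """Toy noise-floor estimate that persists across chunk boundaries."""

    def __init__(self, smoothing=0.95):
        self.smoothing = smoothing
        self.noise_power = None  # state carried from chunk to chunk

    def update(self, chunk):
        chunk_power = mean_power(chunk)
        if self.noise_power is None:
            # First chunk of the session: nothing to blend with yet.
            self.noise_power = chunk_power
        else:
            # Blend the new observation into the running estimate.
            self.noise_power = (self.smoothing * self.noise_power
                                + (1.0 - self.smoothing) * chunk_power)
        return self.noise_power

CHUNK = 2560                   # 160ms at 16 kHz
quiet = [0.01] * CHUNK         # stand-in for steady background noise
burst = [0.5] * CHUNK          # stand-in for a loud speech burst

tracker = StatefulNoiseTracker()
for chunk in (quiet, quiet, quiet):
    tracker.update(chunk)

stateful_after_burst = tracker.update(burst)
stateless_after_burst = mean_power(burst)  # estimate rebuilt from this chunk alone

print(stateful_after_burst < stateless_after_burst)  # True
```

The stateless estimate treats the loud chunk as if it were the whole acoustic environment, while the stateful one still remembers the quiet background it saw earlier in the session.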

Beyond Denoising

In full audio mode, the output is not just a suppressed version of the input; it is enhanced speech. FonadaLabs-Denoiser combines noise suppression with speech clarity restoration, recovering energy that heavy noise was masking rather than only attenuating the background.

4. The Benchmark

The benchmark evaluates FonadaLabs-Denoiser alongside GTCRN, NoiseReduce, Resemble Enhance, and RNNoise. These five systems were selected because each supports 160ms streaming, 640ms streaming, and full audio processing at 16 kHz under the same evaluation protocol.

Reading the Charts

Every chart includes an orange reference line marking the unprocessed noisy baseline: the score the raw audio receives before any enhancement is applied. A model scoring above that line has improved the audio relative to doing nothing.

The Metrics

Measuring speech enhancement quality is not a single-number problem. A denoiser can suppress noise aggressively while quietly damaging perceptual quality, or preserve tone while leaving too much background. The benchmark uses complementary metrics:

PESQ-NB: ITU-T P.862 narrowband perceptual quality score, an intrusive metric computed against the clean reference; higher is better.
COVL: Hu & Loizou composite overall voice quality (1–5); higher is better.
DNSMOS BAK: DNS Challenge background-quality MOS predictor; higher is better.
P808 MOS: non-intrusive predictor of the ITU-T P.808 mean opinion score; higher is better.

5. Waveform Comparisons

Metrics describe what happened numerically. Waveforms show the mechanism directly. Each panel below stacks the noisy input, every model output, and the clean reference for the same recording so differences in clarity, residual background, and speech shape are visible at a glance.

In the first example, FonadaLabs-Denoiser removes background interference most aggressively while keeping phrase boundaries natural, residual noise low, and speech energy intact. Competing outputs retain more background residue or alter the speech envelope.

Noisy input: speech and background are mixed throughout the timeline.
FonadaLabs-Denoiser: output aligns closely with the clean reference; silence between phrases is clean.
Other evaluated models: retain more background residue or alter speech envelope shape.

The second example presents a different acoustic environment and reinforces the same observation. FonadaLabs-Denoiser delivers the strongest denoising margin across the full model stack, recovering intelligible speech structure while competing outputs leave more interference or fail to restore the speaker's natural amplitude contour.

Noisy input: the target voice is difficult to separate from the background.
FonadaLabs-Denoiser: phrase onsets and offsets are clear; the waveform follows the clean reference rhythm.
Other evaluated models: shown in the same stack for direct side-by-side comparison.

6. No Model Is Perfect

No model is finished. Benchmark results do not mark an endpoint; they reveal the next frontier. The discipline of measuring precisely what works and what does not is how production voice systems get better over time.

The benchmark results are consistent across four metrics, three processing modes, and five models. FonadaLabs-Denoiser leads every measurement independently (12 of 12 metric × mode cells), not as a result of one strong score pulling up an average.
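The distinction between leading every cell and leading on average can be sketched as follows. The metric and mode names come from the benchmark, but the system names and scores are made-up placeholders chosen to show the effect. They are not benchmark results.

```python
METRICS = ["PESQ-NB", "COVL", "DNSMOS BAK", "P808 MOS"]
MODES = ["160ms streaming", "640ms streaming", "full audio"]

def cells_led(scores, system):
    """Count metric x mode cells where `system` has the strictly highest score."""
    led = 0
    for metric in METRICS:
        for mode in MODES:
            cell = scores[(metric, mode)]
            best_other = max(v for name, v in cell.items() if name != system)
            if cell[system] > best_other:
                led += 1
    return led

def mean_score(scores, system):
    """Average of one system's scores over all cells."""
    return sum(cell[system] for cell in scores.values()) / len(scores)

# Placeholder scores (NOT real results): system_a is slightly ahead in
# 11 cells, while system_b has a single outlier in the remaining cell.
scores = {(metric, mode): {"system_a": 3.0, "system_b": 2.9}
          for metric in METRICS for mode in MODES}
scores[("COVL", "full audio")] = {"system_a": 3.0, "system_b": 4.9}

print(len(METRICS) * len(MODES))      # 12
print(cells_led(scores, "system_a"))  # 11
print(cells_led(scores, "system_b"))  # 1
print(mean_score(scores, "system_b") > mean_score(scores, "system_a"))  # True
```

In this toy table, system_b wins on the average while leading only one cell, which is why a per-cell count is a stricter claim than an averaged ranking.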

Three properties underpin that performance: stateful streaming that builds and maintains a noise model across the full call rather than guessing from narrow windows; native 16 kHz operation without resampling overhead; and dual streaming and full-audio capability from one system.

At the same time, no benchmark represents every acoustic condition, deployment environment, or user scenario. No model is perfect, and neither is FonadaLabs-Denoiser. Real-world communication continues to present new challenges, making continuous evaluation and refinement essential.

The goal is not simply to achieve strong benchmark results, but to keep improving robustness, consistency, and user experience with every iteration. Today's results establish a strong baseline. Tomorrow's work is about pushing that baseline even further.

Intelligence with every word. Silence for everything else.