Whisper-RIR-Mega: New Benchmark Exposes ASR Weakness to Real-World Echoes
A new open-source benchmark, Whisper-RIR-Mega, has been introduced to rigorously evaluate how well modern automatic speech recognition (ASR) systems handle the acoustics of real rooms. The dataset pairs clean speech from the LibriSpeech corpus with versions of the same audio convolved with recorded room impulse responses (RIRs) from the RIR-Mega collection. This yields a controlled yet realistic testbed for measuring the impact of reverberation, a major obstacle for voice assistants and transcription services in environments such as conference halls or kitchens.
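In practical terms, each reverberant sample is just the clean waveform convolved with one measured impulse response. The sketch below shows one way to generate such a pair; the helper name, file layout, and peak normalization are illustrative assumptions, not details drawn from the benchmark's own pipeline.

```python
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

def reverberate(clean_path: str, rir_path: str, out_path: str) -> None:
    """Convolve a clean utterance with a recorded room impulse response."""
    speech, sr = sf.read(clean_path)   # mono clean utterance
    rir, rir_sr = sf.read(rir_path)    # mono recorded RIR
    assert sr == rir_sr, "resample the RIR to the speech sample rate first"
    # Full convolution, truncated to the clean length so the two versions
    # stay time-aligned and share a reference transcript.
    wet = fftconvolve(speech, rir, mode="full")[: len(speech)]
    # Match the clean signal's peak level to avoid clipping on write-out.
    wet *= np.max(np.abs(speech)) / (np.max(np.abs(wet)) + 1e-9)
    sf.write(out_path, wet, sr)
```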
Stratified Testing Reveals Consistent Performance Degradation
The benchmark is structured for systematic robustness testing. Samples are stratified by two key acoustic metrics: reverberation time (RT60), which measures how long sound lingers in a room, and the direct-to-reverberant ratio (DRR), which quantifies the balance between direct sound and reflected echoes. The researchers evaluated five sizes of OpenAI's Whisper model, from tiny to large-v3, on 1,600 test samples, reporting both Word Error Rate (WER) and Character Error Rate (CER).
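Both stratification metrics can be estimated directly from the impulse response itself. The sketch below uses the standard Schroeder backward-integration method for RT60 (extrapolating a -5 dB to -25 dB fit out to a 60 dB decay) and a simple energy-ratio definition of DRR; the 2.5 ms direct-path window and the fit range are common conventions assumed here, not values quoted by the benchmark.

```python
import numpy as np

def rir_metrics(rir: np.ndarray, sr: int, direct_ms: float = 2.5):
    """Estimate RT60 (s) and DRR (dB) from a mono room impulse response."""
    energy = rir.astype(np.float64) ** 2
    # Schroeder energy decay curve, in dB relative to total energy.
    edc = np.cumsum(energy[::-1])[::-1]
    edc_db = 10 * np.log10(edc / edc[0] + 1e-12)
    # Fit the -5 dB..-25 dB decay region, then extrapolate to -60 dB.
    t = np.arange(len(rir)) / sr
    region = (edc_db <= -5) & (edc_db >= -25)
    slope, _ = np.polyfit(t[region], edc_db[region], 1)
    rt60 = -60.0 / slope
    # DRR: direct-path energy in a short window around the peak,
    # versus everything that arrives later.
    peak = int(np.argmax(np.abs(rir)))
    win = int(direct_ms * 1e-3 * sr)
    direct = energy[max(0, peak - win): peak + win].sum()
    late = energy[peak + win:].sum()
    drr_db = 10 * np.log10(direct / (late + 1e-12))
    return rt60, drr_db
```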
The results were unequivocal: reverberation consistently degraded transcription accuracy across all model sizes. The performance penalty, measured by the increase in WER, ranged from a modest 0.12 percentage points to a substantial 1.07 percentage points, depending on the specific Whisper variant. This demonstrates that even state-of-the-art, large-scale models are not immune to the distorting effects of real-world room acoustics.
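Because those figures are absolute differences in WER, the arithmetic is worth spelling out. A toy example using the jiwer scoring package (an assumption; the released harness may normalize and score differently):

```python
import jiwer

refs       = ["the birch canoe slid on the smooth planks"]  # ground truth
hyp_clean  = ["the birch canoe slid on the smooth planks"]  # decoded from clean audio
hyp_reverb = ["the birch canoe slid on the smooth blanks"]  # decoded from reverberant audio

wer_clean  = jiwer.wer(refs, hyp_clean)   # 0.0
wer_reverb = jiwer.wer(refs, hyp_reverb)  # 1 substitution / 8 words = 0.125
# The reported penalty is the gap expressed in percentage points:
print(f"penalty: {(wer_reverb - wer_clean) * 100:.2f} pp")  # 12.50 pp
```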
Open Resources for Advancing Robust Speech Technology
To foster reproducible and collaborative research, the team behind Whisper-RIR-Mega is publicly releasing the complete dataset, all evaluation code, and the detailed baseline results. This move provides the research community with a standardized tool to diagnose weaknesses, compare new noise-robust architectures, and develop more reliable speech recognition systems that perform well outside of studio-quality, anechoic conditions.
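Pending a look at the released code, a baseline run can be approximated in a few lines. The sketch below assumes the openai-whisper and jiwer packages and a hypothetical manifest of audio/transcript pairs; the actual harness may differ in decoding options and text normalization.

```python
import jiwer
import whisper

# Hypothetical manifest for one stratum: (reverberant audio, reference transcript).
pairs = [
    ("reverb/sample_0001.wav", "mister quilter is the apostle of the middle classes"),
    # ... one entry per test sample
]

model = whisper.load_model("tiny")  # also: base, small, medium, large-v3
refs, hyps = [], []
for path, ref in pairs:
    result = model.transcribe(path, language="en")
    # Minimal normalization; a real harness would also strip punctuation.
    hyps.append(result["text"].lower().strip())
    refs.append(ref)

print("WER:", jiwer.wer(refs, hyps))
print("CER:", jiwer.cer(refs, hyps))
```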
Why This Matters for AI and Speech Technology
- Bridges a Critical Gap: Most ASR benchmarks use simulated or artificial noise. Whisper-RIR-Mega uses real recorded room impulse responses, providing a more authentic and challenging test for models destined for real-world deployment.
- Enables Precise Diagnostics: By stratifying data by RT60 and DRR, researchers can pinpoint exactly which acoustic conditions cause specific models to fail, guiding more targeted improvements.
- Drives Industry-Relevant Innovation: The benchmark's findings directly impact the development of voice assistants, meeting transcription services, and hearing aids, pushing the industry toward models that understand speech reliably anywhere.
- Promotes Open Science: Releasing the dataset and code lowers the barrier to entry for robust ASR research, accelerating progress through transparent, reproducible benchmarking.