Whisper-RIR-Mega: A Paired Clean-Reverberant Speech Benchmark for ASR Robustness to Room Acoustics

The Whisper-RIR-Mega benchmark is a paired clean-reverberant speech dataset for systematically evaluating Automatic Speech Recognition (ASR) robustness to room acoustics. Baseline results show that even state-of-the-art Whisper models degrade when transcribing speech convolved with real-world room impulse responses, with Word Error Rate (WER) increasing by up to 1.07 percentage points. The dataset pairs LibriSpeech samples with reverberant counterparts generated from measured RIRs in the RIR-Mega corpus, enabling stratified testing by Reverberation Time (RT60) and Direct-to-Reverberant Ratio (DRR).

New Whisper-RIR-Mega Benchmark Exposes ASR Weakness to Real-World Room Acoustics

Researchers have introduced a new, high-fidelity benchmark designed to rigorously test the robustness of modern Automatic Speech Recognition (ASR) systems against the pervasive challenge of room reverberation. The Whisper-RIR-Mega dataset pairs thousands of clean speech samples from the widely used LibriSpeech corpus with reverberant versions of the same utterances, produced by convolving them with real-world room impulse responses (RIRs) from the extensive RIR-Mega corpus. In a revealing evaluation, the benchmark showed that even state-of-the-art models like OpenAI's Whisper family suffer consistent performance degradation under realistic acoustic conditions, with error rates increasing by up to 1.07 percentage points.

A Stratified Test for Real-World Acoustic Challenges

The construction of the Whisper-RIR-Mega dataset is methodically designed to mirror the acoustic variability encountered in real environments. Each sample is a direct pair: a pristine LibriSpeech utterance and the same utterance convolved with an actual measured room impulse response. Crucially, the dataset includes stratified splits based on two key acoustic metrics: Reverberation Time (RT60), the time it takes sound in a space to decay by 60 dB after the source stops, and the Direct-to-Reverberant Ratio (DRR), which quantifies the balance between the direct sound from a speaker and the reflected, reverberant sound. This structure allows researchers to pinpoint performance drops associated with specific acoustic profiles, moving beyond simplistic noise addition.
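
To make the pairing concrete, here is a minimal Python sketch of how a clean utterance can be convolved with a measured RIR and how DRR can be estimated from that RIR for stratification. The file names, the RMS renormalization, and the 2.5 ms direct-path window are illustrative assumptions, not details of the benchmark's actual construction pipeline.

```python
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

def reverberate(clean: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve a clean utterance with a measured RIR, keeping length/alignment."""
    wet = fftconvolve(clean, rir)[: len(clean)]
    # Match the clean signal's RMS so downstream WER differences reflect
    # reverberation rather than loudness changes (an assumed normalization).
    return wet * np.sqrt(np.mean(clean**2) / (np.mean(wet**2) + 1e-12))

def drr_db(rir: np.ndarray, sr: int, direct_window_ms: float = 2.5) -> float:
    """Direct-to-Reverberant Ratio: energy near the direct-path peak vs. the rest."""
    peak = int(np.argmax(np.abs(rir)))
    half = int(direct_window_ms * 1e-3 * sr)
    lo, hi = max(peak - half, 0), peak + half + 1
    direct = np.sum(rir[lo:hi] ** 2)
    reverberant = np.sum(rir**2) - direct
    return float(10 * np.log10(direct / (reverberant + 1e-12)))

# Hypothetical file names for one clean/reverberant pair.
clean, sr = sf.read("librispeech_utt.flac")
rir, _ = sf.read("rir_mega_room.wav")
sf.write("reverberant_utt.flac", reverberate(clean, rir), sr)
print(f"DRR: {drr_db(rir, sr):.1f} dB")
```

Truncating the convolution output to the clean signal's length keeps each reverberant waveform time-aligned with its reference transcript, which is what makes the pairing exact rather than approximate.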

Whisper Model Evaluation Reveals Consistent Vulnerability

The research team conducted a comprehensive evaluation using five models from the Whisper suite, ranging from the smallest Whisper-tiny to the largest Whisper-large-v3. The models were tested on 1,600 samples, with performance measured by both Word Error Rate (WER) and Character Error Rate (CER) under clean and reverberant conditions. The results were unequivocal: reverberation degraded transcription accuracy across every single model size. The performance penalty, quantified as the increase in WER, varied from 0.12 to 1.07 percentage points depending on the specific model, demonstrating that the challenge persists even as model capacity scales.
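
The clean-versus-reverberant comparison can be approximated with off-the-shelf tools. The sketch below assumes the open-source openai-whisper and jiwer packages; the `pairs` list, file paths, and the simple lowercase/strip-punctuation normalization are illustrative placeholders rather than the study's exact protocol.

```python
import string

import jiwer
import whisper

# Hypothetical (reference transcript, clean path, reverberant path) triples;
# the real benchmark provides 1,600 such pairs.
pairs = [
    ("the quick brown fox jumps over the lazy dog",
     "clean/0001.flac", "reverb/0001.flac"),
]

model = whisper.load_model("tiny")  # repeat for each of the five model sizes

def normalize(text: str) -> str:
    # Lowercase and strip punctuation so WER counts word substitutions,
    # insertions, and deletions rather than formatting differences.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def transcribe(path: str) -> str:
    return normalize(model.transcribe(path, language="en")["text"])

refs = [normalize(ref) for ref, _, _ in pairs]
wer_clean = jiwer.wer(refs, [transcribe(c) for _, c, _ in pairs])
wer_reverb = jiwer.wer(refs, [transcribe(r) for _, _, r in pairs])
# jiwer.cer(...) gives the analogous character-level metric (CER).
print(f"WER clean={wer_clean:.4f} reverb={wer_reverb:.4f} "
      f"penalty={(wer_reverb - wer_clean) * 100:.2f} pp")
```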

Open-Source Release for Reproducible Robustness Research

To accelerate progress in building more resilient speech technology, the authors are releasing the complete Whisper-RIR-Mega package to the public. This includes the full paired dataset, all evaluation code used in the study, and the detailed baseline results. This commitment to open science ensures that other researchers can not only verify the findings but also use the benchmark to test their own ASR architectures and dereverberation or robustness techniques, fostering reproducible and comparable research in a critical subfield of AI.

Why This Matters for the Future of Speech AI

  • Bridging the Simulation-to-Reality Gap: By using real room impulse responses instead of synthetic ones, this benchmark provides a more accurate and challenging testbed for ASR systems destined for real-world deployment in homes, cars, and offices.
  • Quantifying a Hidden Performance Cost: The study provides concrete metrics on the "reverb penalty," highlighting that improvements in clean-speech accuracy do not automatically translate to robust performance in everyday acoustic environments.
  • Enabling Targeted Algorithmic Improvements: The stratified data splits by RT60 and DRR allow developers to diagnose whether their models fail specifically in highly reverberant spaces or in conditions with poor direct sound, guiding more effective solutions (see the analysis sketch after this list).
  • Setting a New Standard for Evaluation: The release of the dataset and code establishes a much-needed common benchmark for the research community, moving beyond proprietary or less rigorous tests to standardize the measurement of acoustic robustness.
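
As referenced above, the stratified splits support exactly this kind of diagnosis. Below is a minimal analysis sketch assuming a per-utterance results table with RT60, DRR, and WER columns; the bin edges and example values are hypothetical, not the benchmark's official strata.

```python
import pandas as pd

# Hypothetical per-utterance results: each row pairs a reverberant sample's
# acoustic descriptors (RT60 in seconds, DRR in dB) with its measured WER.
df = pd.DataFrame({
    "rt60": [0.25, 0.48, 0.92, 1.40],
    "drr":  [8.0, 2.5, -3.0, -7.5],
    "wer":  [0.031, 0.036, 0.044, 0.052],
})

# Bin by acoustic condition to localize where a model breaks down;
# these edges are illustrative choices, not the dataset's defined splits.
df["rt60_bin"] = pd.cut(df["rt60"], bins=[0.0, 0.3, 0.6, 1.0, 3.0],
                        labels=["dry", "moderate", "live", "very live"])
df["drr_bin"] = pd.cut(df["drr"], bins=[-20, 0, 10, 40],
                       labels=["distant", "mid", "close"])

# Mean WER per acoustic stratum shows where degradation concentrates.
print(df.groupby(["rt60_bin", "drr_bin"], observed=True)["wer"]
        .agg(["mean", "count"]))
```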
