Training
Leveraging Audio-LLMs to Filter Speech-to-Speech Training Data
This article presents an approach to filtering training data for speech-to-speech translation (S2ST) using an audio large language model (Audio-LLM). The method uses a two-stage Rank-to-Distill strategy to generate pseudo-labels for noisy speech pairs, and it reports gains of up to 1.4 ASR-BLEU on the CVSS-C and SpeechMatrix benchmarks. The result matters to practitioners because training-data quality is critical for robust S2ST systems.
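The summary does not say how the pseudo-labels are turned into a filtering decision, so the following is only a minimal sketch under stated assumptions: each speech pair already carries a scalar quality score (the hypothetical field `score`), and the filter keeps the highest-scoring fraction of pairs. The names `SpeechPair`, `filter_pairs`, and `keep_fraction` are illustrative and do not come from the article.

```python
from dataclasses import dataclass


@dataclass
class SpeechPair:
    source_path: str   # path to the source-language audio (hypothetical field)
    target_path: str   # path to the target-language audio (hypothetical field)
    score: float       # assumed pseudo-label: higher means a cleaner pair


def filter_pairs(pairs, keep_fraction=0.5):
    """Keep the top `keep_fraction` of pairs ranked by pseudo-label score.

    This is an assumed top-fraction rule, not the article's actual criterion.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not pairs:
        return []
    # Sort from highest to lowest score, then truncate.
    ranked = sorted(pairs, key=lambda p: p.score, reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


if __name__ == "__main__":
    demo = [
        SpeechPair("a_src.wav", "a_tgt.wav", 0.9),
        SpeechPair("b_src.wav", "b_tgt.wav", 0.2),
        SpeechPair("c_src.wav", "c_tgt.wav", 0.7),
        SpeechPair("d_src.wav", "d_tgt.wav", 0.4),
    ]
    for pair in filter_pairs(demo, keep_fraction=0.5):
        print(pair.source_path, pair.score)
```

A fixed score threshold would work equally well in place of a keep fraction; the choice depends on whether the scores are calibrated across datasets, which the summary does not specify.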
speech-to-speech, llm, data-filtering