Research
Speaker Group Encoding in Self-supervised Speech Recognition Models
The paper investigates how speaker group information (SGI) is encoded in self-supervised speech recognition models (S3Ms), comparing three training states: pretrained, finetuned for speaker identification (SID), and finetuned for automatic speech recognition (ASR). Key findings show that finetuning for SID strengthens the encoding of phonetically variant speaker group (SG) categories, while finetuning for ASR tends to discard such information but retains semantically variant SGI. By showing how SGI is represented across model layers, the research informs the development of fairer ASR systems and suggests potential adjustments to ASR algorithms to improve fairness in speaker representation.
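The summary does not say how the layer-wise analysis was carried out. One common way to ask how much group information a layer carries is a linear probe: fit a simple classifier on that layer's embeddings and compare held-out accuracy across layers. The sketch below illustrates the idea only, and it is not the paper's method. The embeddings are synthetic, and every name in it (`synthetic_layer`, `layerwise_probe`, the `signals` list standing in for how strongly each layer encodes the group label) is a hypothetical placeholder.

```python
import numpy as np


def fit_linear_probe(X, y, n_classes, ridge=1e-3):
    """Fit a ridge-regularised least-squares probe onto one-hot labels."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    Y = np.eye(n_classes)[y]                       # one-hot targets
    A = Xb.T @ Xb + ridge * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ Y)


def probe_accuracy(W, X, y):
    """Fraction of samples whose highest-scoring class matches the label."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return float(np.mean(np.argmax(Xb @ W, axis=1) == y))


def synthetic_layer(y, n_classes, dim, signal, rng):
    """Hypothetical stand-in for one layer's utterance-level embeddings.

    Each speaker group gets a random centroid; `signal` scales how far
    apart the groups sit relative to unit-variance noise.
    """
    centroids = rng.normal(size=(n_classes, dim))
    return signal * centroids[y] + rng.normal(size=(len(y), dim))


def layerwise_probe(signals, n_classes=2, dim=16, n=400, seed=0):
    """Train and score one probe per (synthetic) layer."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, n_classes, size=n)  # assumed speaker group labels
    split = n // 2                          # first half train, second half test
    accuracies = []
    for signal in signals:
        X = synthetic_layer(y, n_classes, dim, signal, rng)
        W = fit_linear_probe(X[:split], y[:split], n_classes)
        accuracies.append(probe_accuracy(W, X[split:], y[split:]))
    return accuracies


if __name__ == "__main__":
    for layer, acc in enumerate(layerwise_probe([0.0, 0.5, 2.0])):
        print(f"layer {layer}: probe accuracy = {acc:.2f}")
```

With a real model, the synthetic arrays would be replaced by pooled hidden states from each layer, and a layer where probe accuracy falls toward chance would suggest the group label is no longer linearly recoverable there.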
speech recognition, llm