RAG
LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization
The article presents LM-SPT, a speech tokenization method that uses semantic distillation through speech resynthesis to improve alignment with language models (LMs). Earlier semantic-enhanced tokenizers distill from self-supervised learning (SSL) teachers and rely on pooling to match the teacher's features to the tokenizer's frame rate. LM-SPT instead resynthesizes speech from semantic tokens, which supports lower frame rates and improves semantic alignment without sacrificing reconstruction fidelity. In the reported experiments, LM-SPT outperforms existing semantic-enhanced tokenizers on tasks such as automatic speech recognition and text-to-speech, which matters for practitioners who want to integrate speech language models (SLMs) more effectively with LMs.
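The contrast between pooling-based distillation and resynthesis-based distillation can be illustrated with a toy sketch. This is not the authors' implementation: every function name here is hypothetical, the cosine distance is an assumed metric, and comparing the original and resynthesized audio through a shared frozen encoder is an assumption about how the discrepancy is measured, since the summary does not specify it. The "tokenizer", "vocoder", and "encoder" are trivial stand-ins operating on plain Python lists.

```python
import math


def avg_pool(frames, factor):
    """Average consecutive groups of `factor` frames (a ragged tail is dropped)."""
    usable = len(frames) - len(frames) % factor
    pooled = []
    for start in range(0, usable, factor):
        group = frames[start:start + factor]
        dim = len(group[0])
        pooled.append([sum(f[d] for f in group) / factor for d in range(dim)])
    return pooled


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def mean_frame_distance(xs, ys):
    """Mean cosine distance over two equally long frame sequences."""
    if len(xs) != len(ys):
        raise ValueError("frame sequences must have the same length")
    return sum(cosine_distance(x, y) for x, y in zip(xs, ys)) / len(xs)


def pooled_distillation_loss(teacher_frames, student_frames, factor):
    """Pooling style: shrink teacher features to the student's frame rate, then match."""
    return mean_frame_distance(avg_pool(teacher_frames, factor), student_frames)


def resynthesis_distillation_loss(waveform, tokenize, resynthesize, frozen_encoder):
    """Resynthesis style (assumed form): tokens -> audio, then compare encoder
    features of the original and the resynthesized waveform."""
    tokens = tokenize(waveform)
    resynth = resynthesize(tokens)
    return mean_frame_distance(frozen_encoder(waveform), frozen_encoder(resynth))


# --- Toy stand-ins (hypothetical, for illustration only) ---
def toy_tokenize(waveform, hop=4):
    """One 'token' per `hop` samples: the chunk mean."""
    return [sum(waveform[i:i + hop]) / hop for i in range(0, len(waveform), hop)]


def toy_resynthesize(tokens, hop=4):
    """Rebuild a waveform by holding each token value for `hop` samples."""
    return [t for t in tokens for _ in range(hop)]


def toy_encoder(waveform, frame=2):
    """'Frozen encoder': slice the waveform into 2-sample feature frames."""
    return [waveform[i:i + frame] for i in range(0, len(waveform), frame)]


waveform = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
loss = resynthesis_distillation_loss(waveform, toy_tokenize, toy_resynthesize, toy_encoder)
print(f"resynthesis distillation loss: {loss:.4f}")
```

In this sketch the token rate (one token per 4 samples) is independent of the encoder's frame rate (one frame per 2 samples), because the comparison happens after resynthesis. The pooled variant, by contrast, needs `factor` chosen so that the pooled teacher sequence lines up with the student frames.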
speech-tokenization, semantic-distillation, language-models