Multimodal · arXiv cs.AI · 47 d ago

Whisper-GPT -- Continuous Discrete Hybrid Representation Language Models For Speech And Music

WHISPER-GPT is a proposed generative large language model for speech and music that combines continuous audio representations with discrete tokens in a single architecture, addressing the context-length limitations of high-fidelity generative architectures. By using both spectrograms and discrete acoustic tokens, the model improves perplexity and negative log-likelihood (lower is better for both) for next-token prediction on audio tasks. For practitioners, this hybrid approach offers a more efficient framework for building generative audio, speech, and music applications, drawing on the advantages of both continuous and discrete representations.
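
The two metrics named in the summary are directly related: perplexity is the exponential of the mean negative log-likelihood per token. The sketch below illustrates only that relationship. The function names and the probability values are hypothetical and are not taken from the paper.

```python
import math


def mean_nll(true_token_probs):
    """Mean negative log-likelihood (in nats) of the correct next token.

    `true_token_probs` holds, for each position, the probability the model
    assigned to the token that actually came next.
    """
    return -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)


def perplexity(true_token_probs):
    """Perplexity is exp(mean NLL); lower means better next-token prediction."""
    return math.exp(mean_nll(true_token_probs))


# Hypothetical probabilities for illustration only (not results from the paper).
token_only_probs = [0.10, 0.20, 0.05, 0.25]
hybrid_probs = [0.20, 0.30, 0.10, 0.40]

if __name__ == "__main__":
    for name, probs in [("token-only", token_only_probs), ("hybrid", hybrid_probs)]:
        print(f"{name}: NLL={mean_nll(probs):.3f} nats, perplexity={perplexity(probs):.2f}")
```

A model that assigns probability 0.25 to every correct token has a perplexity of 4, meaning it is as uncertain as a uniform choice among four tokens.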

speech · music · llm · relevance 0.60 · engagement 0.00
