Multimodal
Whisper-GPT -- Continuous Discrete Hybrid Representation Language Models For Speech And Music
Whisper-GPT is a generative large language model for speech and music that combines continuous audio representations with discrete tokens in a single architecture, addressing the context-length limitations of high-fidelity generative architectures. By using both spectrograms and discrete acoustic tokens as input, the model improves (lowers) perplexity and negative log-likelihood for next-token prediction in audio tasks. For practitioners, this hybrid approach offers a more efficient basis for building generative audio, speech, and music applications, drawing on the strengths of both continuous and discrete representations.
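For readers less familiar with the two metrics named above, the following minimal Python sketch shows how mean negative log-likelihood and perplexity relate for next-token prediction (perplexity is the exponential of the mean NLL). The function name `nll_and_perplexity` and the probability values are hypothetical and chosen only for illustration; they are not results from the model.

```python
import math


def nll_and_perplexity(probs):
    """Return (mean NLL in nats, perplexity).

    `probs` holds the probability a model assigned to each correct
    next token in a sequence. Illustrative helper, not from the paper.
    """
    if not probs:
        raise ValueError("probs must not be empty")
    if any(p <= 0.0 or p > 1.0 for p in probs):
        raise ValueError("each probability must be in (0, 1]")
    # Mean negative log-likelihood over the sequence.
    nll = -sum(math.log(p) for p in probs) / len(probs)
    # Perplexity is exp(mean NLL); lower is better for both.
    return nll, math.exp(nll)


# Hypothetical per-token probabilities, made up for this example.
baseline_probs = [0.10, 0.20, 0.05, 0.10]
hybrid_probs = [0.20, 0.25, 0.10, 0.20]

baseline_nll, baseline_ppl = nll_and_perplexity(baseline_probs)
hybrid_nll, hybrid_ppl = nll_and_perplexity(hybrid_probs)

print(f"baseline: NLL={baseline_nll:.3f}, perplexity={baseline_ppl:.2f}")
print(f"hybrid:   NLL={hybrid_nll:.3f}, perplexity={hybrid_ppl:.2f}")
```

A model that assigns higher probability to the correct next token scores lower on both metrics, which is the sense in which the summary describes an improvement.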
speech, music, llm