Research
Long-Context Modeling via GSS-Transformer Hybrid Architecture with Learnable Mixing
This article introduces the Parallel Hybrid Architecture (PHA), which runs Gated State Spaces (GSS), Grouped Query Attention (GQA), and Feed-Forward Networks (FFNs) as parallel branches joined by a learnable mixing mechanism. With 125M parameters, PHA reaches a perplexity of 16.51 on WikiText-103, outperforming existing models such as Hedgehog and H3, while delivering 24% higher throughput and up to 40% lower memory usage on long contexts. The results suggest that specialized parallel branches can improve both efficiency and quality in long-context language modeling, which makes the approach relevant to practitioners building scalable NLP systems.
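To make the idea of learnable mixing concrete, here is a minimal sketch in plain Python. The summary does not specify how PHA mixes its branches, so the softmax-normalized weighting, the function names `softmax` and `mix_branches`, and the toy branch outputs are all assumptions for illustration, not the authors' method.

```python
import math


def softmax(logits):
    """Turn unconstrained logits into positive weights that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_branches(branch_outputs, mixing_logits):
    """Weighted sum of per-branch output vectors.

    Assumption: one scalar logit per branch, normalized with softmax.
    In a trained model these logits would be learned parameters.
    """
    weights = softmax(mixing_logits)
    dim = len(branch_outputs[0])
    return [
        sum(w * out[i] for w, out in zip(weights, branch_outputs))
        for i in range(dim)
    ]


# Hypothetical stand-ins for the GSS, GQA, and FFN branch outputs
# computed in parallel for a single token.
gss_out = [1.0, 0.0]
gqa_out = [0.0, 1.0]
ffn_out = [1.0, 1.0]

# Equal logits give each branch the same weight (1/3 each).
mixed = mix_branches([gss_out, gqa_out, ffn_out], [0.0, 0.0, 0.0])
print(mixed)
```

Raising one logit relative to the others shifts the mixture toward that branch, which is one way a model could learn how much to rely on each branch.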