Multimodal
Beyond Self-Attention: Sub-Quadratic Vision Transformers for Fast Image Captioning
The article presents an approach to image captioning that replaces the standard self-attention mechanism in Vision Transformers with a probabilistic transformer based on a Gaussian Mixture Model (GMM). By clustering similar image patches into K mixture components, the model reduces the computational complexity of attention from O(n^2) to O(nK), where n is the number of patches. Evaluated on the Flickr 30K dataset, the model shows competitive performance, which makes it relevant to practitioners seeking faster architectures for image captioning.
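To make the reduction from O(n^2) to O(nK) concrete, the following is a minimal NumPy sketch of mixing patch embeddings through K Gaussian components instead of comparing every patch with every other patch. It is not the article's implementation: the function name `gmm_mixing`, the isotropic variance `sigma`, the random initialization of the component means, and the sizes `n`, `d`, `K` are all assumptions chosen for illustration.

```python
import numpy as np

def gmm_mixing(x, means, sigma=1.0):
    """Mix n patch embeddings through K Gaussian components in O(n*K*d).

    Hypothetical sketch: isotropic components with a shared sigma (assumption).
    """
    # Squared distance from every patch to every component mean: shape (n, K).
    sq_dist = (x ** 2).sum(1, keepdims=True) - 2.0 * x @ means.T + (means ** 2).sum(1)
    # Responsibilities: softmax over components of the Gaussian log-density.
    logits = -sq_dist / (2.0 * sigma ** 2)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    resp = np.exp(logits)
    resp /= resp.sum(axis=1, keepdims=True)
    # Component summaries: responsibility-weighted mean of patches, shape (K, d).
    summaries = (resp.T @ x) / (resp.sum(axis=0)[:, None] + 1e-9)
    # Each patch reads back a mixture of the K summaries: shape (n, d).
    return resp @ summaries, resp

rng = np.random.default_rng(0)
n, d, K = 196, 64, 8  # illustrative sizes (assumption): 14x14 patches, 8 components
x = rng.normal(size=(n, d))
means = x[rng.choice(n, size=K, replace=False)]  # assumed initialization
out, resp = gmm_mixing(x, means)
print(out.shape, resp.shape)
```

With these illustrative sizes, full self-attention would score 196 × 196 = 38,416 patch pairs, while the sketch scores only 196 × 8 = 1,568 patch-component pairs.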
image captioning, transformers, gmm