Preprint

ZoDIAC: Zoneout Dropout Injection Attention Calculation

June 2022

DOI:10.21203/rs.3.rs-1798795/v1

License
CC BY 4.0

Recently, the use of self-attention has yielded state-of-the-art results in vision-language tasks such as image captioning, in natural language understanding and generation (NLU and NLG) tasks, and in computer vision tasks such as image classification. This is because self-attention maps the internal interactions among the elements of the input source and target sequences. Although self-attention successfully calculates the attention values and maps the relationships among the elements of the input source and target sequences, it has no mechanism to control the intensity of attention. In the real world, when communicating with each other face to face or vocally, we tend to express different visual and linguistic contexts with varying amounts of intensity. Some words might carry (be spoken with) more stress and weight, indicating the importance of that word in the context of the whole sentence. Based on this intuition, we propose Zoneout Dropout Injection Attention Calculation (ZoDIAC), in which the intensities of the attention values for the elements of the input sequence are calculated with respect to the context of those elements. The results of our experiments reveal that employing ZoDIAC leads to better performance than the self-attention module in the Transformer model. The ultimate goal is to find out whether we can modify the self-attention module in the Transformer model with a method that is potentially extensible to other models that rely on self-attention at their core. Our findings suggest that this goal deserves further attention and investigation by the research community. The code for ZoDIAC is available at www.github.com/zanyarz/zodiac.
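For readers unfamiliar with the baseline that the abstract refers to, the following is a minimal sketch of standard scaled dot-product self-attention, softmax(QK^T / sqrt(d_k))V. It is not the ZoDIAC method, whose details are not given on this page. The function names and the toy input are illustrative assumptions, and the learned query, key, and value projections of a real Transformer are replaced here by the raw inputs for brevity.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Each argument is a list of equal-length float vectors, one per
    sequence element. Returns (outputs, weights).
    """
    d_k = len(keys[0])
    weights = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights.append(softmax(scores))
    d_v = len(values[0])
    # Each output is the attention-weighted average of the value vectors.
    outputs = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(d_v)]
        for row in weights
    ]
    return outputs, weights


# Toy sequence of three 2-dimensional elements (hypothetical data).
# Assumption: identity projections, so Q = K = V = x.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(x, x, x)
```

Each row of `weights` is a probability distribution over the sequence elements, so the weights in a row always sum to one. That normalization is the only scaling applied in this baseline, which is the setting the abstract describes as lacking a mechanism to control attention intensity.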

Results for an ensemble of 5 runs on the MS-COCO Karpathy test split, trained with XE loss.

…

Results for experiments on the MS-COCO Karpathy test split, trained with SCST loss.

…


