
TL;DR
  • What: A new architecture for Vision and Language tasks + a new pre-training strategy that benefits both image-level and region-level tasks.
  • How: We add cross-modality attention blocks into the image and text backbones & split pre-training into low- and high-resolution stages.
  • Outcome: State-of-the-art results on captioning, VQA, NLVR2, and more + efficient use of expensive fine-grained data, surpassing the phrase grounding performance of models that use 25x more box-annotated data!

Abstract

Vision-language (VL) pre-training has recently received considerable attention. However, most existing end-to-end pre-training approaches either only aim to tackle VL tasks such as image-text retrieval, visual question answering (VQA) and image captioning that test high-level understanding of images, or only target region-level understanding for tasks such as phrase grounding and object detection. We present FIBER (Fusion-In-the-Backbone-based transformER), a new VL model architecture that can seamlessly handle both these types of tasks. Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model by inserting cross-attention into the image and text backbones, bringing gains in terms of memory and performance. In addition, unlike previous work that is either only pre-trained on image-text data or on fine-grained data with box-level annotations, we present a two-stage pre-training strategy that uses both these kinds of data efficiently: (i) coarse-grained pre-training based on image-text data; followed by (ii) fine-grained pre-training based on image-text-box data. We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection. Using deep multimodal fusion coupled with the two-stage pre-training, FIBER provides consistent performance improvements over strong baselines across all tasks, often outperforming methods using magnitudes more data.

Model
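
As a rough illustration of the fusion-in-the-backbone idea described in the abstract, the sketch below inserts a gated cross-attention layer next to an existing uni-modal transformer block. This is an illustrative simplification rather than the actual FIBER implementation: the `FusionBlock` class, its constructor arguments, and the zero-initialized gate are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """A uni-modal transformer block augmented with gated cross-attention
    to the other modality (illustrative sketch of fusion in the backbone,
    not the actual FIBER implementation)."""

    def __init__(self, unimodal_block: nn.Module, dim: int, num_heads: int = 8):
        super().__init__()
        self.unimodal_block = unimodal_block        # e.g. a Swin or RoBERTa layer
        self.norm = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))    # zero-init: starts as the plain backbone

    def forward(self, x: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        # x:     tokens of this modality,      shape (B, N, dim)
        # other: tokens of the other modality, shape (B, M, dim)
        x = self.unimodal_block(x)                  # standard uni-modal computation
        fused, _ = self.cross_attn(self.norm(x), other, other)
        return x + self.gate * fused                # gated residual fusion
```

In a full model, such blocks would sit symmetrically in the top layers of both the image and text backbones; keeping the gate at zero recovers the original uni-modal behavior, which is convenient when fusion needs to be switched off.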

Main Results

 

We evaluate our coarse-grained and fine-grained pretrained models on a variety of tasks, including visual question answering, visual reasoning, image-text retrieval, image captioning, phrase grounding, and object detection, and demonstrate state-of-the-art performance on many of them. FIBER is pretrained on 4M images during coarse-grained pretraining and on 860k images during fine-grained pretraining.
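
Schematically, the two-stage schedule looks like the loop below. This is a sketch only: the data loaders and the `coarse_losses` / `fine_grained_losses` method names are placeholders, not FIBER's actual training code.

```python
def two_stage_pretrain(model, coarse_loader, fine_loader, optimizer):
    # Stage 1: coarse-grained pretraining on image-text pairs (~4M images),
    # e.g. image-text contrastive, image-text matching, masked language modeling.
    for images, texts in coarse_loader:
        optimizer.zero_grad()
        loss = model.coarse_losses(images, texts)                # placeholder method
        loss.backward()
        optimizer.step()

    # Stage 2: fine-grained pretraining on image-text-box data (~860k images),
    # continued from the stage-1 weights with grounding-style objectives.
    for images, texts, boxes in fine_loader:
        optimizer.zero_grad()
        loss = model.fine_grained_losses(images, texts, boxes)   # placeholder method
        loss.backward()
        optimizer.step()
```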

 
Coarse-grained Pretrained Model Performance
Task       | VQAv2     | NLVR2  | F30k Retrieval                       | COCO Retrieval                       | COCO Captioning
Split      | test-std  | test-P | test                                 | Karpathy test                        | Karpathy test
Metric     | VQA Score | Acc.   | IR@1/TR@1                            | IR@1/TR@1                            | CIDEr
FIBER-Base | 78.46     | 85.52  | 81.44/92.90 (ITC), 84.10/95.10 (ITM) | 58.01/75.38 (ITC), 59.03/75.14 (ITM) | 144.4
Fine-grained Pretrained Model Performance on Grounding
Task       | F30k Grounding | RefCOCO           | RefCOCO+          | RefCOCOg
Split      | test           | val/testA/testB   | val/testA/testB   | val/test
Metric     | R@1/R@5/R@10   | Acc.              | Acc.              | Acc.
FIBER-Base | 87.4/96.4/97.6 | 90.68/92.59/87.26 | 85.74/90.13/79.38 | 87.11/87.32
Fine-grained Pretrained Model Performance on Detection
Task       | COCO Detection         | LVIS                   | ODinW
Split      | Val 2017               | MiniVal                | 13 Datasets
Metric     | Zero-shot/Fine-tune AP | Zero-shot/Fine-tune AP | Avg. Zero-shot/Fine-tune AP
FIBER-Base | 49.3/58.4              | 35.8/56.9              | 47.0/65.9

Visualization of the pre-trained models

From the examples below, we can see that our coarse-grained pretrained model learns to perform phrase grounding implicitly, despite being given only image-caption pairs, and that fine-grained pretraining further improves grounding performance, allowing the model to localize objects more accurately.
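
The grounding visualizations below are simple box overlays on the input image. A generic sketch of how such an overlay can be drawn is shown here; `show_grounding` and its phrase-to-box input format are hypothetical helpers for illustration, not part of the FIBER codebase.

```python
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from PIL import Image

def show_grounding(image_path, phrase_boxes):
    """Overlay one predicted box per grounded phrase on the image.

    phrase_boxes: dict mapping a phrase from the caption to an
    (x, y, w, h) box in pixel coordinates (hypothetical input format).
    """
    image = Image.open(image_path)
    fig, ax = plt.subplots()
    ax.imshow(image)
    for phrase, (x, y, w, h) in phrase_boxes.items():
        ax.add_patch(patches.Rectangle((x, y), w, h, fill=False, linewidth=2, edgecolor="lime"))
        ax.text(x, y - 2, phrase, fontsize=8, color="black", backgroundcolor="white")
    ax.axis("off")
    plt.show()
```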

 
Phrase grounding ability of coarse-grained pretrained model.
Phrase grounding ability of fine-grained pretrained model.

Some results on Flickr30k Entities

Phrase Grounding examples from Flickr30k.

Referring Expression Comprehension

Referring Expression Comprehension on RefCOCO+.