Bottom-up and top-down attention for image captioning and visual question answering

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, Lei Zhang

Research output: Chapter in Book/Report/Conference proceeding › Conference proceeding contribution › Research › peer-review

Abstract

Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge.
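To make the two mechanisms concrete: the bottom-up stage is a Faster R-CNN detector that outputs a set of region feature vectors, and the top-down stage is a learned soft-attention layer that scores each region against a task-specific context (the caption decoder's LSTM state, or an encoded VQA question) and takes a weighted sum. Below is a minimal PyTorch sketch of that top-down weighting, not the authors' released code; the module name, layer sizes, and the choice of 36 regions of dimension 2048 are illustrative assumptions (36 x 2048 is a commonly used setting for these features).

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownAttention(nn.Module):
    """Soft top-down attention over precomputed bottom-up region features (sketch)."""

    def __init__(self, feat_dim=2048, ctx_dim=512, hid_dim=512):
        super().__init__()
        self.proj_feat = nn.Linear(feat_dim, hid_dim)  # project region features
        self.proj_ctx = nn.Linear(ctx_dim, hid_dim)    # project top-down context
        self.score = nn.Linear(hid_dim, 1)             # scalar relevance score per region

    def forward(self, regions, context):
        # regions: (batch, k, feat_dim) -- one feature vector per detected region
        # context: (batch, ctx_dim)     -- LSTM state (captioning) or question encoding (VQA)
        joint = torch.tanh(self.proj_feat(regions) + self.proj_ctx(context).unsqueeze(1))
        weights = F.softmax(self.score(joint).squeeze(-1), dim=1)  # (batch, k), sums to 1
        return (weights.unsqueeze(-1) * regions).sum(dim=1)        # attended feature: (batch, feat_dim)

# Illustrative usage: a batch of 2 images, 36 regions each, 2048-d features,
# attended under a 512-d context vector.
attention = TopDownAttention()
attended = attention(torch.randn(2, 36, 2048), torch.randn(2, 512))  # -> shape (2, 2048)

The key design point the sketch illustrates is that attention weights are computed over a variable-size set of object-level proposals rather than over a fixed grid of CNN activations.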

Language: English
Title of host publication: Proceedings - 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018
Place of publication: Piscataway, NJ
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Pages: 6077-6086
Number of pages: 10
ISBN (Electronic): 9781538664209
ISBN (Print): 9781538664216
DOI: 10.1109/CVPR.2018.00636
Publication status: Published - 14 Dec 2018
Event: 31st Meeting of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018 - Salt Lake City, United States
Duration: 18 Jun 2018 – 22 Jun 2018

Publication series

Name: IEEE Conference on Computer Vision and Pattern Recognition
Publisher: IEEE
ISSN (Print): 1063-6919

Conference

Conference: 31st Meeting of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018
Country: United States
City: Salt Lake City
Period: 18/06/18 – 22/06/18

Cite this

Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., & Zhang, L. (2018). Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings - 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018 (pp. 6077-6086). [8578734] (IEEE Conference on Computer Vision and Pattern Recognition). Piscataway, NJ: Institute of Electrical and Electronics Engineers (IEEE). https://doi.org/10.1109/CVPR.2018.00636
@inproceedings{b31c568d6a6b4f379948bcc8384605b9,
  title = "Bottom-up and top-down attention for image captioning and visual question answering",
  abstract = "Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge.",
  author = "Peter Anderson and Xiaodong He and Chris Buehler and Damien Teney and Mark Johnson and Stephen Gould and Lei Zhang",
  year = "2018",
  month = "12",
  day = "14",
  doi = "10.1109/CVPR.2018.00636",
  language = "English",
  isbn = "9781538664216",
  series = "IEEE Conference on Computer Vision and Pattern Recognition",
  publisher = "Institute of Electrical and Electronics Engineers (IEEE)",
  pages = "6077--6086",
  booktitle = "Proceedings - 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018",
  address = "United States",
}
