Cited by mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections

Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning

Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, et al.

2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Year: 2023, Issue: unknown

Published: June 1, 2023

In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos, which are readily available at scale. The Vid2Seq architecture augments a language model with special time tokens, allowing it to seamlessly predict event boundaries and textual descriptions in the same output sequence. Such a unified model requires large-scale training data, which is not available in current annotated datasets. We show that it is possible to leverage unlabeled narrated videos for dense video captioning by reformulating sentence boundaries of transcribed speech as pseudo event boundaries and using the transcribed speech sentences as pseudo event captions. The resulting Vid2Seq model pretrained on the YT-Temporal-1B dataset improves the state of the art on a variety of dense video captioning benchmarks, including YouCook2, ViTT and ActivityNet Captions. Vid2Seq also generalizes well to the tasks of video paragraph captioning and video clip captioning, and to few-shot settings. Our code is publicly available at [1].
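As a rough illustration of the time-token idea in the abstract above, the sketch below quantizes event timestamps into a fixed number of bins relative to the video duration and interleaves the resulting tokens with captions in a single output string. The token spelling `<time_k>`, the default of 100 bins, the ordering by start time, and the function names are assumptions made for this example, not the authors' implementation.

```python
def time_token(t_seconds: float, duration: float, num_bins: int = 100) -> str:
    """Map an absolute timestamp to a relative, quantized time token.

    The `<time_k>` spelling is hypothetical; only the idea of discretizing
    time into special tokens comes from the abstract.
    """
    if duration <= 0:
        raise ValueError("duration must be positive")
    # Clamp to [0, 1] so out-of-range timestamps still map to a valid bin.
    frac = min(max(t_seconds / duration, 0.0), 1.0)
    # The final instant of the video falls into the last bin.
    idx = min(int(frac * num_bins), num_bins - 1)
    return f"<time_{idx}>"


def serialize_events(events, duration: float, num_bins: int = 100) -> str:
    """Write (start, end, caption) events as one sequence of tokens and text."""
    parts = []
    # Sorting by start time is an assumption made for a deterministic output.
    for start, end, caption in sorted(events, key=lambda e: e[0]):
        parts.append(time_token(start, duration, num_bins))
        parts.append(time_token(end, duration, num_bins))
        parts.append(caption)
    return " ".join(parts)
```

For a 120-second video with events `(0.0, 30.0, "chop onions")` and `(30.0, 60.0, "stir the sauce")`, `serialize_events` returns `<time_0> <time_25> chop onions <time_25> <time_50> stir the sauce`, so boundaries and captions share one output sequence.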

Language: English

Cited by

113

Prompting Large Language Models with Answer Heuristics for Knowledge-Based Visual Question Answering

Zhenwei Shao, Yu Zhou, Meng Wang, et al.

2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Year: 2023, Issue: unknown, pp. 14974-14983

Published: June 1, 2023

Knowledge-based visual question answering (VQA) requires external knowledge beyond the image to answer the question. Early studies retrieve the required knowledge from explicit knowledge bases (KBs), which often introduces information irrelevant to the question, hence restricting the performance of their models. Recent works have sought to use a large language model (i.e., GPT-3 [3]) as an implicit knowledge engine to acquire the necessary knowledge for answering. Despite the encouraging results achieved by these methods, we argue that they have not fully activated the capacity of GPT-3, as the provided input information is insufficient. In this paper, we present Prophet, a conceptually simple framework designed to prompt GPT-3 with answer heuristics for knowledge-based VQA. Specifically, we first train a vanilla VQA model on a specific knowledge-based VQA dataset without external knowledge. After that, we extract two types of complementary answer heuristics from the model: answer candidates and answer-aware examples. Finally, the two types of answer heuristics are encoded into the prompts to enable GPT-3 to better comprehend the task, thus enhancing its capacity. Prophet significantly outperforms all existing state-of-the-art methods on two challenging knowledge-based VQA datasets, OK-VQA and A-OKVQA, delivering 61.1% and 55.7% accuracies on their testing sets, respectively.

Language: English

Cited by

103

MERLOT RESERVE: Neural Script Knowledge through Vision and Language and Sound

Rowan Zellers, Jiasen Lu, Ximing Lu, et al.

2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Year: 2022, Issue: unknown, pp. 16354-16366

Published: June 1, 2022

As humans, we navigate a multimodal world, building a holistic understanding from all our senses. We introduce MERLOT RESERVE, a model that represents videos jointly over time, through a new training objective that learns from audio, subtitles, and video frames. Given a video, we replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet. Our objective learns faster than alternatives, and performs well at scale: we pretrain on 20 million YouTube videos. Empirical results show that MERLOT RESERVE learns strong multimodal representations. When finetuned, it sets a new state of the art on Visual Commonsense Reasoning (VCR), TVQA, and Kinetics-600, outperforming prior work by 5%, 7%, and 1.5% respectively. Ablations show that these tasks benefit from audio pretraining, even VCR, a QA task centered around images (without sound). Moreover, our objective enables out-of-the-box prediction, revealing strong multimodal commonsense understanding. In a fully zero-shot setting, our model obtains competitive results on four video tasks, even outperforming supervised approaches on the recently proposed Situated Reasoning (STAR) benchmark. We analyze why audio enables better vision-language representations, suggesting significant opportunities for future research. We conclude by discussing ethical and societal implications of multimodal pretraining.

Language: English

Cited by

102

Sigmoid Loss for Language Image Pre-Training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, et al.

2023 IEEE/CVF International Conference on Computer Vision (ICCV), Year: 2023, Issue: unknown

Published: Oct. 1, 2023

We propose a simple pairwise sigmoid loss for image-text pre-training. Unlike standard contrastive learning with softmax normalization, the sigmoid loss operates solely on image-text pairs and does not require a global view of the pairwise similarities for normalization. The sigmoid loss simultaneously allows further scaling up the batch size, while also performing better at smaller batch sizes. With only four TPUv4 chips, we can train a Base CLIP model at 4k batch size and a Large LiT model at 20k batch size; the latter achieves 84.5% ImageNet zero-shot accuracy in two days. This disentanglement of the batch size from the loss further allows us to study the impact of examples vs pairs and the negative to positive ratio. Finally, we push the batch size to the extreme, up to one million, and find that the benefits of growing batch size quickly diminish, with a more reasonable batch size of 32k being sufficient. We hope our research motivates further explorations in improving the quality and efficiency of language-image pre-training.
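The pairwise loss described in the abstract above can be sketched in a few lines: every image-text pair in a batch is scored as an independent binary classification (matching or not), so no softmax normalization across the batch is needed. This is a minimal pure-Python sketch under stated assumptions: the function names are hypothetical, embeddings are plain lists of floats, and the temperature and bias are fixed illustrative constants here, whereas a real training setup would typically learn them.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    # Branch on the sign so that exp() never receives a large positive argument.
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def pairwise_sigmoid_loss(image_emb, text_emb, temperature=10.0, bias=-10.0):
    """Sum of per-pair binary log-losses, averaged over the batch size.

    temperature and bias are illustrative constants (an assumption of this
    sketch), not values taken from the listing above.
    """
    imgs = [l2_normalize(v) for v in image_emb]
    txts = [l2_normalize(v) for v in text_emb]
    total = 0.0
    for i, img in enumerate(imgs):
        for j, txt in enumerate(txts):
            logit = temperature * sum(a * b for a, b in zip(img, txt)) + bias
            # +1 for the matching pair (same index), -1 for every other pair.
            label = 1.0 if i == j else -1.0
            total -= log_sigmoid(label * logit)
    return total / len(imgs)
```

Because each term depends only on one image-text pair, the double loop can be split across devices or chunks without gathering a full similarity matrix for normalization, which is the property the abstract highlights.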

Language: English

Cited by

mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections

Chenliang Li, Haiyang Xu, Junfeng Tian, et al.

Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Year: 2022, Issue: unknown

Published: Jan. 1, 2022

Chenliang Li, Haiyang Xu, Junfeng Tian, Wei Wang, Ming Yan, Bin Bi, Jiabo Ye, He Chen, Guohai Xu, Zheng Cao, Ji Zhang, Songfang Huang, Fei Huang, Jingren Zhou, Luo Si. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022.

Language: English

Cited by