Cited by mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections

An Empirical Study of Training End-to-End Vision-and-Language Transformers

Zi-Yi Dou, Yichong Xu, Zhe Gan

et al.

2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Year: 2022, No. unknown

Published: June 1, 2022

Vision-and-language (VL) pre-training has proven to be highly effective on various VL downstream tasks. While recent work has shown that fully transformer-based VL models can be more efficient than previous region-feature-based methods, their performance on downstream tasks often degrades significantly. In this paper, we present METER, a Multimodal End-to-end TransformER framework, through which we investigate how to design and pre-train a fully transformer-based VL model in an end-to-end manner. Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), multimodal fusion module (e.g., merged attention vs. co-attention), architectural design (e.g., encoder-only vs. encoder-decoder), and pre-training objectives (e.g., masked image modeling). We conduct comprehensive experiments and provide insights on how to train a performant VL transformer. METER achieves an accuracy of 77.64% on the VQAv2 test-std set using only 4M images for pre-training, surpassing the state-of-the-art region-feature-based model by 1.04%, and outperforming the previous best fully transformer-based model by 1.6%. Notably, when further scaled up, our best VQA model achieves an accuracy of 80.54%. Code and pre-trained models are released at https://github.com/zdou0830/METER.

Language: English

Cited by

229

Multiview Transformers for Video Recognition

Yan Shen, Xuehan Xiong, Anurag Arnab

et al.

2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Year: 2022, No. unknown, pp. 3323-3333

Published: June 1, 2022

Video understanding requires reasoning at multiple spatiotemporal resolutions, from short fine-grained motions to events taking place over longer durations. Although transformer architectures have recently advanced the state-of-the-art, they have not explicitly modelled different spatiotemporal resolutions. To this end, we present Multiview Transformers for Video Recognition (MTV). Our model consists of separate encoders to represent different views of the input video, with lateral connections to fuse information across views. We present thorough ablation studies of our model and show that MTV consistently performs better than single-view counterparts in terms of accuracy and computational cost across a range of model sizes. Furthermore, we achieve state-of-the-art results on six standard datasets, and improve even further with large-scale pretraining. Code and checkpoints are available at: https://github.com/google-research/scenic.

Language: English

Cited by

213

MedCLIP: Contrastive Learning from Unpaired Medical Images and Text

Zifeng Wang, Zhenbang Wu, D.C. Agarwal

et al.

Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Year: 2022, No. unknown

Published: Jan. 1, 2022

Existing vision-text contrastive learning like CLIP (Radford et al., 2021) aims to match the paired image and caption embeddings while pushing others apart, which improves representation transferability and supports zero-shot prediction. However, medical image-text datasets are orders of magnitude below the general images and captions from the internet. Moreover, previous methods encounter many false negatives, i.e., images and reports from separate patients probably carry the same semantics but are wrongly treated as negatives. In this paper, we decouple images and texts for multimodal contrastive learning, thus scaling the usable training data in a combinatorial magnitude with low cost. We also propose to replace the InfoNCE loss with a semantic matching loss based on medical knowledge to eliminate false negatives in contrastive learning. We prove that MedCLIP is a simple yet effective framework: it outperforms state-of-the-art methods on zero-shot prediction, supervised classification, and image-text retrieval. Surprisingly, we observe that with only 20K pre-training data, MedCLIP wins over the state-of-the-art method (using ≈200K data).

Language: English

Cited by

208

Vision-Language Models for Vision Tasks: A Survey

J Zhang, Jiaxing Huang, Sheng Jin

et al.

IEEE Transactions on Pattern Analysis and Machine Intelligence, Year: 2024, No. 46(8), pp. 5625-5644

Published: Feb. 26, 2024

Most visual recognition studies rely heavily on crowd-labelled data in deep neural network (DNN) training, and they usually train a DNN for each single visual recognition task, leading to a laborious and time-consuming visual recognition paradigm. To address the two challenges, Vision-Language Models (VLMs) have been intensively investigated recently, which learn rich vision-language correlation from web-scale image-text pairs that are almost infinitely available on the Internet and enable zero-shot predictions on various visual recognition tasks with a single VLM. This paper provides a systematic review of visual language models for various visual recognition tasks, including: (1) the background that introduces the development of visual recognition paradigms; (2) the foundations of VLM that summarize the widely-adopted network architectures, pre-training objectives, and downstream tasks; (3) the widely-adopted datasets in VLM pre-training and evaluations; (4) the review and categorization of existing VLM pre-training methods, VLM transfer learning methods, and VLM knowledge distillation methods; (5) the benchmarking, analysis and discussion of the reviewed methods; (6) several research challenges and potential research directions that could be pursued in future VLM studies for visual recognition.

Language: English

Cited by

144

A comprehensive survey on applications of transformers for deep learning tasks

Saidul Islam, Hanae Elmekki, Ahmed Elsebai

et al.

Expert Systems with Applications, Year: 2023, No. 241, p. 122666

Published: Nov. 23, 2023

Language: English

Cited by

140

Advancing Plain Vision Transformer Toward Remote Sensing Foundation Model

Di Wang, Qiming Zhang, Yufei Xu

et al.

IEEE Transactions on Geoscience and Remote Sensing, Year: 2022, No. 61, pp. 1-15

Published: Nov. 21, 2022

Large-scale vision foundation models have made significant progress in visual tasks on natural images, with vision transformers (ViTs) being the primary choice due to their good scalability and representation ability. However, large-scale models in remote sensing (RS) have not yet been sufficiently explored. In this article, we resort to plain ViTs with about 100 million parameters and make the first attempt to propose large vision models tailored to RS tasks and investigate how such large models perform. To handle the large sizes and objects of arbitrary orientations in RS images, we propose a new rotated varied-size window attention to replace the original full attention in transformers, which can significantly reduce the computational cost and memory footprint while learning better object representation by extracting rich context from the generated diverse windows. Experiments on detection tasks show the superiority of our model over all state-of-the-art models, achieving 81.24% mean average precision (mAP) on the DOTA-V1.0 dataset. The results of our models on downstream classification and segmentation tasks also show competitive performance compared to existing advanced methods. Further experiments show the advantages of our models in terms of computational complexity and data efficiency in transferring. The code and models will be released at https://github.com/ViTAE-Transformer/Remote-Sensing-RVSA.

Language: English

Cited by

136

CLAP Learning Audio Concepts from Natural Language Supervision

Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail

et al.

ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Year: 2023, No. unknown

Published: May 5, 2023

Mainstream machine listening models are trained to learn audio concepts under the paradigm of one class label to many recordings, focusing on one task. Learning under such restricted supervision limits the flexibility of models because they require labeled audio for training and can only predict the predefined categories. Instead, we propose to learn audio concepts from natural language supervision. We call our approach Contrastive Language-Audio Pretraining (CLAP), which connects language and audio by using two encoders and a contrastive learning objective, bringing audio and text descriptions into a joint multimodal space. We trained CLAP with 128k audio and text pairs and evaluated it on 16 downstream tasks across 7 domains, such as classification of sound events, scenes, music, and speech. CLAP establishes state-of-the-art (SoTA) in Zero-Shot performance. Also, we evaluated CLAP's audio encoder in a supervised learning setup and achieved SoTA in 5 tasks. The Zero-Shot capability removes the need for training with class-labeled audio, enables flexible class prediction at inference time, and generalizes well in multiple downstream tasks. Code is available at: https://github.com/microsoft/CLAP.

Language: English

Cited by

129

Unified Contrastive Learning in Image-Text-Label Space

Jianwei Yang, Chunyuan Li, Pengchuan Zhang

et al.

2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Year: 2022, No. unknown, pp. 19141-19151

Published: June 1, 2022

Visual recognition is recently learned via either supervised learning on human-annotated image-label data or language-image contrastive learning with webly-crawled image-text pairs. While supervised learning may result in a more discriminative representation, language-image pretraining shows unprecedented zero-shot recognition capability, largely due to the different properties of data sources and learning objectives. In this work, we introduce a new formulation by combining the two data sources into a common image-text-label space. In this space, we propose a new learning paradigm, called Unified Contrastive Learning (UniCL), with a single learning objective to seamlessly prompt the synergy of two data types. Extensive experiments show that our UniCL is an effective way of learning semantically rich yet discriminative representations, universally for image recognition in zero-shot, linear-probing, fully finetuning and transfer learning scenarios. Particularly, it attains gains up to 9.2% and 14.5% on average on zero-shot recognition benchmarks over the language-image contrastive learning and supervised learning methods, respectively. In the linear probe setting, it also boosts the performance over the two methods by 7.3% and 3.4%, respectively. Our study also indicates that UniCL stand-alone is a good learner on pure image-label data, rivaling the supervised learning methods across three image classification datasets and two types of vision backbones, ResNet and Swin Transformer. Code is available at: https://github.com/microsoft/UniCL.

Language: English

Cited by

126

Side Adapter Network for Open-Vocabulary Semantic Segmentation

Mengde Xu, Zheng Zhang, Fangyun Wei

et al.

2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Year: 2023, No. unknown, pp. 2945-2954

Published: June 1, 2023

This paper presents a new framework for open-vocabulary semantic segmentation with the pre-trained vision-language model, named Side Adapter Network (SAN). Our approach models the semantic segmentation task as a region recognition problem. A side network is attached to a frozen CLIP model with two branches: one for predicting mask proposals, and the other for predicting attention bias, which is applied in the CLIP model to recognize the class of masks. This decoupled design benefits CLIP in recognizing the class of mask proposals. Since the attached side network can reuse CLIP features, it can be very light. In addition, the entire network can be trained end-to-end, allowing the side network to be adapted to the frozen CLIP model, which makes the predicted mask proposals CLIP-aware. Our approach is fast, accurate, and only adds a few additional trainable parameters. We evaluate our approach on multiple semantic segmentation benchmarks. Our method significantly outperforms other counterparts, with up to 18 times fewer trainable parameters and 19 times faster inference speed. Fig. 1 shows some visualization results on ImageNet. We hope our approach will serve as a solid baseline and help ease future research in open-vocabulary semantic segmentation.

Language: English

Cited by

125

Large AI Models in Health Informatics: Applications, Challenges, and the Future

Jianing Qiu, Lin Li, Jiankai Sun

et al.

IEEE Journal of Biomedical and Health Informatics, Year: 2023, No. 27(12), pp. 6074-6087

Published: Sept. 22, 2023

Large AI models, or foundation models, are models recently emerging with massive scales both parameter-wise and data-wise, the magnitudes of which can reach beyond billions. Once pretrained, large AI models demonstrate impressive performance in various downstream tasks. A prime example is ChatGPT, whose capability has compelled people's imagination about the far-reaching influence that large AI models can have and their potential to transform different domains of our lives. In health informatics, the advent of large AI models has brought new paradigms for the design of methodologies. The scale of multi-modal data in the biomedical and health domain has been ever-expanding, especially since the community embraced the era of deep learning, which provides the ground to develop, validate, and advance large AI models for breakthroughs in health-related areas. This article presents a comprehensive review of large AI models, from background to their applications. We identify seven key sectors in which large AI models are applicable and might have substantial influence, including 1) bioinformatics; 2) medical diagnosis; 3) medical imaging; 4) medical informatics; 5) medical education; 6) public health; and 7) medical robotics. We examine their challenges, followed by a critical discussion about potential future directions and pitfalls of large AI models in transforming the field of health informatics.

Language: English

Cited by

122