Cited by mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections

Learning to Prompt for Vision-Language Models

Kaiyang Zhou, Jingkang Yang, Chen Change Loy

et al.

International Journal of Computer Vision, Journal Year: 2022, Volume and Issue: 130(9), P. 2337 - 2348

Published: July 31, 2022

Language: English

Citations

1232

InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions

Wenhai Wang, Jifeng Dai, Zhe Chen

et al.

2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Journal Year: 2023, Volume and Issue: unknown, P. 14408 - 14419

Published: June 1, 2023

Compared to the great progress of large-scale vision transformers (ViTs) in recent years, large-scale models based on convolutional neural networks (CNNs) are still in an early state. This work presents a new large-scale CNN-based foundation model, termed InternImage, which can obtain the gain from increasing parameters and training data like ViTs. Different from the recent CNNs that focus on large dense kernels, InternImage takes deformable convolution as the core operator, so that our model not only has the large effective receptive field required for downstream tasks such as detection and segmentation, but also has the adaptive spatial aggregation conditioned by input and task information. As a result, the proposed InternImage reduces the strict inductive bias of traditional CNNs and makes it possible to learn stronger and more robust patterns with large-scale parameters from massive data like ViTs. The effectiveness of our model is proven on challenging benchmarks including ImageNet, COCO, and ADE20K. It is worth mentioning that InternImage-H achieved a new record 65.4 mAP on COCO test-dev and 62.9 mIoU on ADE20K, outperforming current leading CNNs and ViTs.

Language: English

Citations

480

Masked Feature Prediction for Self-Supervised Visual Pre-Training

Chen Wei, Haoqi Fan, Saining Xie

et al.

2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Journal Year: 2022, Volume and Issue: unknown

Published: June 1, 2022

We present Masked Feature Prediction (MaskFeat) for self-supervised pre-training of video models. Our approach first randomly masks out a portion of the input sequence and then predicts the feature of the masked regions. We study five different types of features and find Histograms of Oriented Gradients (HOG), a hand-crafted feature descriptor, works particularly well in terms of both performance and efficiency. We observe that the local contrast normalization in HOG is essential for good results, which is in line with earlier work using HOG for visual recognition. Our approach can learn abundant visual knowledge and drive large-scale Transformer-based models. Without using extra model weights or supervision, MaskFeat pretrained on unlabeled videos achieves unprecedented results of 86.7% with MViTv2-L on Kinetics-400, 88.3% on Kinetics-600, 80.4% on Kinetics-700, 38.8 mAP on AVA, and 75.0% on SSv2. MaskFeat further generalizes to image input, which can be interpreted as a video with a single frame, and obtains competitive results on ImageNet.

Language: English

Citations

383
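
The masking step described in the MaskFeat abstract above can be illustrated with a minimal sketch. It assumes a sequence of patches, each with a target feature vector, and a regression loss computed only on the masked patches. The helper names (`random_mask`, `masked_mse`), the mask ratio, and the random stand-in features are all hypothetical; the real method predicts descriptors such as HOG, which are not computed here.

```python
import random


def random_mask(num_patches, mask_ratio, rng):
    """Pick a random subset of patch indices to mask (hypothetical helper)."""
    num_masked = int(round(num_patches * mask_ratio))
    return sorted(rng.sample(range(num_patches), num_masked))


def masked_mse(predicted, target, masked_indices):
    """Mean squared error over the masked patches only."""
    total, count = 0.0, 0
    for i in masked_indices:
        for p, t in zip(predicted[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0


rng = random.Random(0)
num_patches, feat_dim = 8, 4

# Stand-in target features; a real pipeline would use a per-patch
# descriptor (for example HOG) computed from the unmasked input.
target = [[rng.random() for _ in range(feat_dim)] for _ in range(num_patches)]

masked = random_mask(num_patches, mask_ratio=0.5, rng=rng)

# A placeholder "model" that predicts zeros for every patch.
predicted = [[0.0] * feat_dim for _ in range(num_patches)]

loss = masked_mse(predicted, target, masked)
print(masked, loss)
```

Restricting the loss to masked positions is the design point: visible patches only provide context and contribute nothing to the objective.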

ImageBind: One Embedding Space to Bind Them All

Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu

et al.

2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Journal Year: 2023, Volume and Issue: unknown, P. 15180 - 15190

Published: June 1, 2023

We present ImageBind, an approach to learn a joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data. We show that all combinations of paired data are not necessary to train such a joint embedding, and only image-paired data is sufficient to bind the modalities together. ImageBind can leverage recent large scale vision-language models, and extends their zero-shot capabilities to new modalities just by using their natural pairing with images. It enables novel emergent applications 'out-of-the-box' including cross-modal retrieval, composing modalities with arithmetic, cross-modal detection and generation. The emergent capabilities improve with the strength of the image encoder and we set a new state-of-the-art on emergent zero-shot recognition tasks across modalities, outperforming specialist supervised models. Finally, we show strong few-shot recognition results outperforming prior work, and that ImageBind serves as a new way to evaluate vision models for visual and non-visual tasks.

Language: English

Citations

310
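
The cross-modal retrieval and embedding arithmetic mentioned in the ImageBind abstract above can be sketched as follows, assuming all modalities have already been mapped into one shared vector space. The three-dimensional vectors, the gallery names, and the helpers (`cosine`, `normalize`, `retrieve`) are invented for illustration and do not come from the ImageBind code base.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def retrieve(query, gallery):
    """Return the gallery key whose embedding is most similar to the query."""
    return max(gallery, key=lambda name: cosine(query, gallery[name]))


# Toy image embeddings in a hypothetical shared space
# (axis 0 ~ "dog", axis 1 ~ "beach", axis 2 ~ background).
image_gallery = {
    "dog_on_beach": [0.7, 0.7, 0.1],
    "dog_indoors": [0.9, 0.1, 0.2],
    "empty_beach": [0.1, 0.9, 0.1],
}

# Toy audio embeddings assumed to live in the same space.
audio_bark = [1.0, 0.0, 0.1]
audio_waves = [0.0, 1.0, 0.1]

# Audio-to-image retrieval.
best_for_bark = retrieve(audio_bark, image_gallery)

# Composing two queries by adding their unit-length embeddings.
combined = [a + b for a, b in zip(normalize(audio_bark), normalize(audio_waves))]
best_for_sum = retrieve(combined, image_gallery)

print(best_for_bark, best_for_sum)
```

Because every modality is compared in the same space, a single similarity function serves any pair of modalities, which is what makes retrieval work without pairwise training data in this toy setting.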

MaPLe: Multi-modal Prompt Learning

Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz

et al.

2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Journal Year: 2023, Volume and Issue: unknown, P. 19113 - 19122

Published: June 1, 2023

Pre-trained vision-language (V-L) models such as CLIP have shown excellent generalization ability to downstream tasks. However, they are sensitive to the choice of input text prompts and require careful selection of prompt templates to perform well. Inspired by the Natural Language Processing (NLP) literature, recent CLIP adaptation approaches learn prompts as the textual inputs to fine-tune CLIP for downstream tasks. We note that using prompting to adapt representations in a single branch of CLIP (language or vision) is sub-optimal since it does not allow the flexibility to dynamically adjust both representation spaces on a downstream task. In this work, we propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations. Our design promotes strong coupling between the vision-language prompts to ensure mutual synergy and discourages learning independent uni-modal solutions. Further, we learn separate prompts across different early stages to progressively model the stage-wise feature relationships to allow rich context learning. We evaluate the effectiveness of our approach on three representative tasks of generalization to novel classes, new target datasets and unseen domain shifts. Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes and 2.72% on overall harmonic-mean, averaged over 11 diverse image recognition datasets. Our code and pre-trained models are available at https://github.com/muzairkhattak/multimodal-prompt-learning.

Language: English

Citations

296
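
The "overall harmonic-mean" reported in the MaPLe abstract above is not defined there. The sketch below assumes the usual base-to-novel convention, in which the score is the harmonic mean of accuracy on base classes and accuracy on novel classes. The accuracy values are made up for illustration and are not results from the paper.

```python
def harmonic_mean(base_acc, novel_acc):
    """Harmonic mean of two accuracies; low if either accuracy is low."""
    if base_acc + novel_acc == 0:
        return 0.0
    return 2 * base_acc * novel_acc / (base_acc + novel_acc)


# Hypothetical accuracies in percent, not taken from any paper.
h = harmonic_mean(80.0, 60.0)
print(round(h, 2))
```

Unlike the arithmetic mean, this score penalizes a method that gains on base classes by sacrificing novel-class accuracy.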

RegionCLIP: Region-based Language-Image Pretraining

Yiwu Zhong, Jianwei Yang, Pengchuan Zhang

et al.

2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Journal Year: 2022, Volume and Issue: unknown, P. 16772 - 16782

Published: June 1, 2022

Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive results on image classification in both zero-shot and transfer learning settings. However, we show that directly applying such models to recognize image regions for object detection leads to unsatisfactory performance due to a major domain shift: CLIP was trained to match an image as a whole to a text description, without capturing the fine-grained alignment between image regions and text spans. To mitigate this issue, we propose a new method called RegionCLIP that significantly extends CLIP to learn region-level visual representations, thus enabling fine-grained alignment between image regions and textual concepts. Our method leverages a CLIP model to match image regions with template captions, and then pretrains our model to align these region-text pairs in the feature space. When transferring our pretrained model to the open-vocabulary object detection task, our method outperforms the state of the art by 3.8 AP50 and 2.2 AP for novel categories on COCO and LVIS datasets, respectively. Further, the learned region representations support zero-shot inference for object detection, showing promising results on both COCO and LVIS datasets. Our code is available at https://github.com/microsoft/RegionCLIP.

Language: English

Citations

289

Image as a Foreign Language: BEIT Pretraining for Vision and Vision-Language Tasks

Wenhui Wang, Hangbo Bao, Dong Li

et al.

2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Journal Year: 2023, Volume and Issue: unknown, P. 19175 - 19186

Published: June 1, 2023

A big convergence of language, vision, and multimodal pretraining is emerging. In this work, we introduce a general-purpose multimodal foundation model BEIT-3, which achieves excellent transfer performance on both vision and vision-language tasks. Specifically, we advance the big convergence from three aspects: backbone architecture, pretraining task, and model scaling up. We use Multiway Transformers for general-purpose modeling, where the modular architecture enables both deep fusion and modality-specific encoding. Based on the shared backbone, we perform masked "language" modeling on images (Imglish), texts (English), and image-text pairs ("parallel sentences") in a unified manner. Experimental results show that BEIT-3 obtains remarkable performance on object detection (COCO), semantic segmentation (ADE20K), image classification (ImageNet), visual reasoning (NLVR2), visual question answering (VQAv2), image captioning (COCO), and cross-modal retrieval (Flickr30K, COCO).

Language: English

Citations

283

Towards a general-purpose foundation model for computational pathology

Richard J. Chen, Tong Ding, Ming Y. Lu

et al.

Nature Medicine, Journal Year: 2024, Volume and Issue: 30(3), P. 850 - 862

Published: March 1, 2024

Language: English

Citations

250

EVA: Exploring the Limits of Masked Visual Representation Learning at Scale

Yuxin Fang, Wen Wang, Binhui Xie

et al.

2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Journal Year: 2023, Volume and Issue: unknown, P. 19358 - 19369

Published: June 1, 2023

We launch EVA, a vision-centric foundation model to Explore the limits of Visual representation at scAle using only publicly accessible data. EVA is a vanilla ViT pre-trained to reconstruct the masked out image-text aligned vision features conditioned on visible image patches. Via this pretext task, we can efficiently scale up EVA to one billion parameters, and set new records on a broad range of representative vision downstream tasks, such as image recognition, video action recognition, object detection, instance segmentation and semantic segmentation without heavy supervised training. Moreover, we observe quantitative changes in scaling EVA result in qualitative changes in transfer learning performance that are not present in other models. For instance, EVA takes a great leap in the challenging large vocabulary instance segmentation task: our model achieves almost the same state-of-the-art performance on LVIS dataset with over a thousand categories and COCO dataset with only eighty categories. Beyond a pure vision encoder, EVA can also serve as a vision-centric, multi-modal pivot to connect images and text. We find initializing the vision tower of a giant CLIP from EVA can greatly stabilize the training and outperform the training from scratch counterpart with much fewer samples and less compute, providing a new direction for scaling up and accelerating the costly training of multi-modal foundation models.

Language: English

Citations

248

GLIGEN: Open-Set Grounded Text-to-Image Generation

Yuheng Li, Haotian Liu, Qingyang Wu

et al.

2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Journal Year: 2023, Volume and Issue: unknown, P. 22511 - 22521

Published: June 1, 2023

Large-scale text-to-image diffusion models have made amazing advances. However, the status quo is to use text input alone, which can impede controllability. In this work, we propose GLIGEN, Grounded-Language-to-Image Generation, a novel approach that builds upon and extends the functionality of existing pre-trained text-to-image diffusion models by enabling them to also be conditioned on grounding inputs. To preserve the vast concept knowledge of the pre-trained model, we freeze all of its weights and inject the grounding information into new trainable layers via a gated mechanism. Our model achieves open-world grounded text2img generation with caption and bounding box condition inputs, and the grounding ability generalizes well to novel spatial configurations and concepts. GLIGEN's zero-shot performance on COCO and LVIS outperforms existing supervised layout-to-image baselines by a large margin.

Language: English

Citations

243
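
The GLIGEN abstract above says only that grounding information is injected into new trainable layers "via a gated mechanism" while the pre-trained weights stay frozen. One plausible reading, sketched here as an assumption rather than the paper's exact formulation, is a residual branch scaled by a learnable gate that starts at zero, so the frozen model's behavior is unchanged at initialization. The function name `gated_residual`, the tanh gate, and all numbers are illustrative.

```python
import math


def gated_residual(hidden, new_branch_output, gamma):
    """Add a new branch's output to frozen features, scaled by tanh(gamma).

    gamma is a learnable scalar (assumed); with gamma == 0 the gate is
    closed and the frozen features pass through unchanged.
    """
    gate = math.tanh(gamma)
    return [h + gate * n for h, n in zip(hidden, new_branch_output)]


# Features from a frozen pre-trained layer (made-up values).
frozen_hidden = [0.5, -1.0, 2.0]

# Output of a hypothetical new trainable layer that saw grounding inputs.
grounding_update = [1.0, 1.0, -1.0]

# At initialization the gate is closed: output equals the frozen features.
at_init = gated_residual(frozen_hidden, grounding_update, gamma=0.0)

# After training, a non-zero gamma lets grounding information through.
after_training = gated_residual(frozen_hidden, grounding_update, gamma=1.0)

print(at_init, after_training)
```

Starting with a closed gate is a common way to add capacity to a frozen network without disturbing what it already knows at the first training step.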