Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,
Journal Year: 2022, Volume and Issue: unknown
Published: Jan. 1, 2022
Chenliang Li, Haiyang Xu, Junfeng Tian, Wei Wang, Ming Yan, Bin Bi, Jiabo Ye, He Chen, Guohai Xu, Zheng Cao, Ji Zhang, Songfang Huang, Fei Huang, Jingren Zhou, Luo Si.
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022.
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
Journal Year: 2023, Volume and Issue: unknown, P. 14408 - 14419
Published: June 1, 2023
Compared to the great progress of large-scale vision transformers (ViTs) in recent years, large-scale models based on convolutional neural networks (CNNs) are still in an early state. This work presents a new large-scale CNN-based foundation model, termed InternImage, which can obtain gains from increasing parameters and training data like ViTs. Different from recent CNNs that focus on large dense kernels, InternImage takes deformable convolution as the core operator, so that our model not only has the large effective receptive field required for downstream tasks such as detection and segmentation, but also has adaptive spatial aggregation conditioned by input and task information. As a result, the proposed InternImage reduces the strict inductive bias of traditional CNNs and makes it possible to learn stronger and more robust patterns with large-scale parameters from massive data like ViTs. The effectiveness of our model is proven on challenging benchmarks including ImageNet, COCO, and ADE20K. It is worth mentioning that InternImage-H achieved a new record of 65.4 mAP on COCO test-dev and 62.9 mIoU on ADE20K, outperforming current leading CNNs and ViTs.
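As a rough illustration of what "deformable convolution" means as an operator, the sketch below computes a single 3x3 deformable response on a small 2-D feature map in plain Python. It is not InternImage's implementation: the per-tap modulation `masks`, the zero padding, and the function names are assumptions made for this example, and in a real network the offsets and masks would be predicted from the input rather than passed in by hand.

```python
import math

def bilinear_sample(feat, y, x):
    """Sample a 2-D list-of-lists feature map at fractional (y, x); zero outside."""
    h, w = len(feat), len(feat[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= yi < h and 0 <= xi < w:
                # Bilinear weight shrinks linearly with distance to the grid point.
                weight = (1.0 - abs(y - yi)) * (1.0 - abs(x - xi))
                value += weight * feat[yi][xi]
    return value

def deformable_conv_at(feat, center, weights, offsets, masks):
    """3x3 deformable convolution response at one output location.

    weights: 9 kernel weights; offsets: 9 (dy, dx) shifts added to the regular
    grid; masks: 9 per-tap modulation scalars (an assumption of this sketch).
    """
    cy, cx = center
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    out = 0.0
    for (dy, dx), w_k, (oy, ox), m_k in zip(grid, weights, offsets, masks):
        out += w_k * m_k * bilinear_sample(feat, cy + dy + oy, cx + dx + ox)
    return out

# Toy feature map whose value is 4*y + x.
feat = [[4.0 * r + c for c in range(4)] for r in range(4)]
weights = [1.0 / 9.0] * 9
masks = [1.0] * 9
regular = deformable_conv_at(feat, (1, 1), weights, [(0.0, 0.0)] * 9, masks)
shifted = deformable_conv_at(feat, (1, 1), weights, [(0.5, 0.5)] * 9, masks)
```

With all offsets at zero the operator reduces to an ordinary 3x3 convolution; non-zero offsets move each sampling point off the regular grid, which is what lets the receptive field adapt to the input.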
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
Journal Year: 2022, Volume and Issue: unknown
Published: June 1, 2022
We present Masked Feature Prediction (MaskFeat) for self-supervised pre-training of video models. Our approach first randomly masks out a portion of the input sequence and then predicts the feature of the masked regions. We study five different types of features and find that Histograms of Oriented Gradients (HOG), a hand-crafted feature descriptor, works particularly well in terms of both performance and efficiency. We observe that the local contrast normalization in HOG is essential for good results, which is in line with earlier work using HOG for visual recognition. Our approach can learn abundant visual knowledge and drive large-scale Transformer-based models. Without using extra model weights or supervision, MaskFeat pretrained on unlabeled videos achieves unprecedented results of 86.7% with MViTv2-L on Kinetics-400, 88.3% on Kinetics-600, 80.4% on Kinetics-700, 38.8 mAP on AVA, and 75.0% on SSv2. MaskFeat further generalizes to image input, which can be interpreted as a video with a single frame, and obtains competitive results on ImageNet.
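The idea of regressing a normalized HOG target only at masked positions can be sketched in a few lines of plain Python. This is a simplified stand-in, not the paper's pipeline: the central-difference gradients, the 9 unsigned orientation bins, the per-patch L2 normalization, and the function names are all assumptions chosen to keep the example small.

```python
import math

def hog_descriptor(patch, bins=9, eps=1e-6):
    """Orientation histogram of gradients for one 2-D patch, L2-normalized."""
    h, w = len(patch), len(patch[0])
    hist = [0.0] * bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = patch[y][x + 1] - patch[y][x - 1]
            gy = patch[y + 1][x] - patch[y - 1][x]
            magnitude = math.hypot(gx, gy)
            angle = math.atan2(gy, gx) % math.pi  # unsigned orientation
            index = min(int(angle / math.pi * bins), bins - 1)
            hist[index] += magnitude
    # Contrast normalization: the descriptor ignores overall gradient strength.
    norm = math.sqrt(sum(v * v for v in hist)) + eps
    return [v / norm for v in hist]

def masked_feature_loss(predictions, patches, mask):
    """Mean squared error against HOG targets, computed on masked patches only."""
    total, count = 0.0, 0
    for pred, patch, is_masked in zip(predictions, patches, mask):
        if not is_masked:
            continue  # visible patches contribute nothing to the loss
        target = hog_descriptor(patch)
        total += sum((p - t) ** 2 for p, t in zip(pred, target))
        count += len(target)
    return total / max(count, 1)

# A horizontal intensity ramp and a 10x brighter copy of it.
ramp = [[float(x) for x in range(4)] for _ in range(4)]
bright = [[10.0 * v for v in row] for row in ramp]
target = hog_descriptor(ramp)
loss_perfect = masked_feature_loss([target, [0.0] * 9], [ramp, ramp], [True, False])
loss_zero_guess = masked_feature_loss([target, [0.0] * 9], [ramp, ramp], [False, True])
```

Because of the normalization step, `hog_descriptor(ramp)` and `hog_descriptor(bright)` are nearly identical, which is the kind of invariance to local contrast that the abstract singles out as essential.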
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
Journal Year: 2023, Volume and Issue: unknown, P. 15180 - 15190
Published: June 1, 2023
We present ImageBind, an approach to learn a joint embedding across six different modalities: images, text, audio, depth, thermal, and IMU data. We show that all combinations of paired data are not necessary to train such a joint embedding, and only image-paired data is sufficient to bind the modalities together. ImageBind can leverage recent large scale vision-language models, and extends their zero-shot capabilities to new modalities just by using their natural pairing with images. It enables novel emergent applications 'out-of-the-box' including cross-modal retrieval, composing modalities with arithmetic, cross-modal detection and generation. The emergent capabilities improve with the strength of the image encoder, and we set a new state-of-the-art on emergent zero-shot recognition tasks across modalities, outperforming specialist supervised models. Finally, we show strong few-shot recognition results outperforming prior work, and that ImageBind serves as a new way to evaluate vision models for visual and non-visual tasks.
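One common way to align a second modality with images is a symmetric contrastive (InfoNCE) objective over paired embeddings. The abstract does not state the loss, so the sketch below is an assumption offered for intuition only; the temperature value, the toy 2-D embeddings, and the function names are hypothetical.

```python
import math

def normalize(vector):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def info_nce(queries, keys, temperature=0.07):
    """Average cross-entropy of matching queries[i] to keys[i] among all keys."""
    queries = [normalize(q) for q in queries]
    keys = [normalize(k) for k in keys]
    loss = 0.0
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
        top = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(queries)

def symmetric_loss(image_embeddings, other_embeddings, temperature=0.07):
    """Contrast images against the other modality in both directions."""
    return (info_nce(image_embeddings, other_embeddings, temperature)
            + info_nce(other_embeddings, image_embeddings, temperature))

# Toy batch: two images and the audio clips naturally paired with them.
images = [[1.0, 0.0], [0.0, 1.0]]
audio = [[0.9, 0.1], [0.1, 0.9]]
aligned = symmetric_loss(images, audio)
mismatched = symmetric_loss(images, list(reversed(audio)))
```

In this toy batch `aligned` is smaller than `mismatched`, since the loss rewards each image for being closest to its own paired clip. Only image-paired batches are ever needed, which mirrors the abstract's point that images act as the binding modality.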
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
Journal Year: 2023, Volume and Issue: unknown, P. 19113 - 19122
Published: June 1, 2023
Pre-trained vision-language (V-L) models such as CLIP have shown excellent generalization ability to downstream tasks. However, they are sensitive to the choice of input text prompts and require careful selection of prompt templates to perform well. Inspired by the Natural Language Processing (NLP) literature, recent CLIP adaptation approaches learn prompts as the textual inputs to fine-tune CLIP for downstream tasks. We note that using prompting to adapt representations in a single branch of CLIP (language or vision) is sub-optimal since it does not allow the flexibility to dynamically adjust both representation spaces on a downstream task. In this work, we propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations. Our design promotes strong coupling between the vision-language prompts to ensure mutual synergy and discourages learning independent uni-modal solutions. Further, we learn separate prompts across different early stages to progressively model the stage-wise feature relationships to allow rich context learning. We evaluate the effectiveness of our approach on three representative tasks: generalization to novel classes, new target datasets, and unseen domain shifts. Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes and 2.72% on overall harmonic-mean, averaged over 11 diverse image recognition datasets. Our code and pre-trained models are available at https://github.com/muzairkhattak/multimodal-prompt-learning.
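One simple way to couple prompts across branches, so that the vision prompts cannot drift into an independent solution, is to derive them from the language prompts through a learned linear map at each prompted layer. The abstract does not spell out the coupling function, so the sketch below is an assumption; the helper names, shapes, and numbers are hypothetical.

```python
def project(prompts, weight, bias):
    """Linear map from language-prompt space to vision-prompt space.

    prompts: list of vectors of length d_lang
    weight:  d_lang x d_vis matrix (list of rows); bias: vector of length d_vis
    """
    return [
        [sum(p[i] * weight[i][j] for i in range(len(p))) + bias[j]
         for j in range(len(bias))]
        for p in prompts
    ]

def coupled_prompts(language_prompts_per_layer, couplers):
    """Derive the vision prompts of each early layer from that layer's language prompts."""
    return [
        project(prompts, weight, bias)
        for prompts, (weight, bias) in zip(language_prompts_per_layer, couplers)
    ]

# Two prompted layers, one 2-D language prompt each, mapped to 3-D vision prompts.
weight = [[1.0, 0.0, 1.0],
          [0.0, 1.0, 1.0]]
bias = [0.0, 0.0, 0.5]
language_prompts = [[[2.0, 3.0]], [[1.0, 1.0]]]
vision_prompts = coupled_prompts(language_prompts, [(weight, bias), (weight, bias)])
```

Under this design only the language prompts and the projection parameters are trained, so a gradient from either branch updates the same underlying prompt vectors.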
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
Journal Year: 2022, Volume and Issue: unknown, P. 16772 - 16782
Published: June 1, 2022
Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive results on image classification in both zero-shot and transfer learning settings. However, we show that directly applying such models to recognize image regions for object detection leads to unsatisfactory performance due to a major domain shift: CLIP was trained to match an image as a whole to a text description, without capturing the fine-grained alignment between image regions and text spans. To mitigate this issue, we propose a new method called RegionCLIP that significantly extends CLIP to learn region-level visual representations, thus enabling fine-grained alignment between image regions and textual concepts. Our method leverages a CLIP model to match image regions with template captions, and then pretrains our model to align these region-text pairs in the feature space. When transferring our pretrained model to the open-vocabulary object detection task, our method outperforms the state of the art by 3.8 AP50 and 2.2 AP for novel categories on COCO and LVIS datasets, respectively. Further, the learned region representations support zero-shot inference for object detection, showing promising results on both COCO and LVIS datasets. Our code is available at https://github.com/microsoft/RegionCLIP.
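The step of matching image regions with template captions can be pictured as a nearest-neighbour search in a shared embedding space. The sketch below uses hand-written 2-D vectors in place of real encoder outputs; the caption strings, the feature values, and the function names are hypothetical, and the actual method may score or filter matches differently.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def match_regions(region_features, caption_embeddings):
    """Pair each region with the template caption whose embedding is most similar."""
    pairs = []
    for feature in region_features:
        scores = {caption: cosine(feature, embedding)
                  for caption, embedding in caption_embeddings.items()}
        best = max(scores, key=scores.get)
        pairs.append((best, scores[best]))
    return pairs

# Stand-ins for text-encoder outputs of two template captions.
captions = {
    "a photo of a dog": [1.0, 0.0],
    "a photo of a kite": [0.0, 1.0],
}
# Stand-ins for visual features pooled from two region proposals.
regions = [[0.9, 0.2], [0.1, 0.8]]
pseudo_pairs = match_regions(regions, captions)
```

The resulting region-text pairs play the role of pseudo labels: a region-level model can then be pretrained to pull each region feature toward its matched caption embedding.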
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
Journal Year: 2023, Volume and Issue: unknown, P. 19175 - 19186
Published: June 1, 2023
A big convergence of language, vision, and multimodal pretraining is emerging. In this work, we introduce a general-purpose multimodal foundation model BEIT-3, which achieves excellent transfer performance on both vision and vision-language tasks. Specifically, we advance the big convergence from three aspects: backbone architecture, pretraining task, and model scaling up. We use Multiway Transformers for general-purpose modeling, where the modular architecture enables both deep fusion and modality-specific encoding. Based on the shared backbone, we perform masked "language" modeling on images (Imglish), texts (English), and image-text pairs ("parallel sentences") in a unified manner. Experimental results show that BEIT-3 obtains remarkable performance on object detection (COCO), semantic segmentation (ADE20K), image classification (ImageNet), visual reasoning (NLVR2), visual question answering (VQAv2), image captioning (COCO), and cross-modal retrieval (Flickr30K, COCO).
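The combination of "deep fusion" and "modality-specific encoding" can be pictured as one block in which all tokens share a self-attention step and are then routed to a per-modality feed-forward expert. The sketch below is a heavily simplified assumption: the attention has no learned projections, residual connections and normalization are omitted, and the experts are toy functions with hypothetical names.

```python
import math

def shared_attention(tokens):
    """Single-head scaled dot-product self-attention without learned projections."""
    dim = len(tokens[0])
    mixed = []
    for query in tokens:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        # Each output is a softmax-weighted average of all token vectors.
        mixed.append([sum(e / total * value[j] for e, value in zip(exps, tokens))
                      for j in range(dim)])
    return mixed

def multiway_block(tokens, modalities, experts):
    """Fuse all tokens with shared attention, then route each to its modality expert."""
    mixed = shared_attention(tokens)
    return [experts[modality](vector) for vector, modality in zip(mixed, modalities)]

# Toy experts standing in for modality-specific feed-forward networks.
experts = {
    "vision": lambda v: [2.0 * x for x in v],
    "language": lambda v: [x + 1.0 for x in v],
}
tokens = [[1.0, 0.0], [0.0, 1.0]]
modalities = ["vision", "language"]
outputs = multiway_block(tokens, modalities, experts)
```

The attention step lets image and text tokens exchange information, while the routing keeps separate parameters for each modality after the exchange.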
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
Journal Year: 2023, Volume and Issue: unknown, P. 19358 - 19369
Published: June 1, 2023
We launch EVA, a vision-centric foundation model to Explore the limits of Visual representation at scAle using only publicly accessible data. EVA is a vanilla ViT pre-trained to reconstruct the masked out image-text aligned vision features conditioned on visible image patches. Via this pretext task, we can efficiently scale up EVA to one billion parameters, and set new records on a broad range of representative vision downstream tasks, such as image recognition, video action recognition, object detection, instance segmentation and semantic segmentation, without heavy supervised training. Moreover, we observe that quantitative changes in scaling EVA result in qualitative changes in transfer learning performance that are not present in other models. For instance, EVA takes a great leap in the challenging large vocabulary instance segmentation task: our model achieves almost the same state-of-the-art performance on the LVIS dataset with over a thousand categories and the COCO dataset with only eighty categories. Beyond a pure vision encoder, EVA can also serve as a vision-centric, multi-modal pivot to connect images and text. We find that initializing the vision tower of a giant CLIP from EVA can greatly stabilize the training and outperform the training-from-scratch counterpart with much fewer samples and less compute, providing a new direction for scaling up and accelerating the costly training of multi-modal foundation models.
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
Journal Year: 2023, Volume and Issue: unknown, P. 22511 - 22521
Published: June 1, 2023
Large-scale text-to-image diffusion models have made amazing advances. However, the status quo is to use text input alone, which can impede controllability. In this work, we propose GLIGEN, Grounded-Language-to-Image Generation, a novel approach that builds upon and extends the functionality of existing pre-trained text-to-image diffusion models by enabling them to also be conditioned on grounding inputs. To preserve the vast concept knowledge of the pre-trained model, we freeze all of its weights and inject the grounding information into new trainable layers via a gated mechanism. Our model achieves open-world grounded text2img generation with caption and bounding box condition inputs, and the grounding ability generalizes well to novel spatial configurations and concepts. GLIGEN's zero-shot performance on COCO and LVIS outperforms existing supervised layout-to-image baselines by a large margin.
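A gated injection of this kind can be sketched as a residual update scaled by a learnable scalar. The tanh form of the gate and its zero initialization are assumptions of this example rather than details stated in the abstract, and `grounding_update` stands in for the output of a hypothetical new trainable layer.

```python
import math

def gated_inject(hidden, grounding_update, gate):
    """Add a grounding-conditioned update to frozen features, scaled by tanh(gate)."""
    scale = math.tanh(gate)
    return [h + scale * u for h, u in zip(hidden, grounding_update)]

# Features from the frozen model and an update computed from grounding inputs.
hidden = [1.0, -2.0]
grounding_update = [0.5, 0.5]

at_init = gated_inject(hidden, grounding_update, gate=0.0)      # gate closed
after_training = gated_inject(hidden, grounding_update, gate=10.0)  # gate nearly open
```

With the gate at zero the output equals the frozen features exactly, so training starts from the pre-trained model's behaviour and the new layers gain influence only as the gate moves away from zero.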