Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,
Journal year: 2022
Issue: unknown
Published: Jan. 1, 2022
Chenliang Li, Haiyang Xu, Junfeng Tian, Wei Wang, Ming Yan, Bin Bi, Jiabo Ye, He Chen, Guohai Xu, Zheng Cao, Ji Zhang, Songfang Huang, Fei Huang, Jingren Zhou, Luo Si. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022.
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
Journal year: 2022
Issue: unknown
Published: June 1, 2022
Vision-and-language (VL) pre-training has proven to be highly effective on various VL downstream tasks. While recent work has shown that fully transformer-based VL models can be more efficient than previous region-feature-based methods, their performance on downstream tasks often degrades significantly. In this paper, we present METER, a Multimodal End-to-end TransformER framework, through which we investigate how to design and pre-train a fully transformer-based VL model in an end-to-end manner. Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), multimodal fusion module (e.g., merged attention vs. co-attention), architectural design (e.g., encoder-only vs. encoder-decoder), and pre-training objectives (e.g., masked image modeling). We conduct comprehensive experiments and provide insights on how to train a performant VL transformer. METER achieves an accuracy of 77.64% on the VQAv2 test-std set using only 4M images for pre-training, surpassing the state-of-the-art region-feature-based model by 1.04%, and outperforming the previous best fully transformer-based model by 1.6%. Notably, when further scaled up, our best VQA model achieves an accuracy of 80.54%. Code and pre-trained models are released at https://github.com/zdou0830/METER.
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
Journal year: 2022
Issue: unknown, pp. 3323-3333
Published: June 1, 2022
Video understanding requires reasoning at multiple spatiotemporal resolutions, from short fine-grained motions to events taking place over longer durations. Although transformer architectures have recently advanced the state-of-the-art, they have not explicitly modelled different spatiotemporal resolutions. To this end, we present Multiview Transformers for Video Recognition (MTV). Our model consists of separate encoders to represent different views of the input video, with lateral connections to fuse information across views. We present thorough ablation studies of our model and show that MTV consistently performs better than single-view counterparts in terms of accuracy and computational cost across a range of model sizes. Furthermore, we achieve state-of-the-art results on six standard datasets, and improve even further with large-scale pretraining. Code and checkpoints are available at: https://github.com/google-research/scenic.
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,
Journal year: 2022
Issue: unknown
Published: Jan. 1, 2022
Existing vision-text contrastive learning like CLIP (Radford et al., 2021) aims to match the paired image and caption embeddings while pushing others apart, which improves representation transferability and supports zero-shot prediction. However, medical image-text datasets are orders of magnitude below the general images and captions from the internet. Moreover, previous methods encounter many false negatives, i.e., images and reports from separate patients probably carry the same semantics but are wrongly treated as negatives. In this paper, we decouple images and texts for multimodal contrastive learning, thus scaling the usable training data in a combinatorial magnitude with low cost. We also propose to replace the InfoNCE loss with a semantic matching loss based on medical knowledge to eliminate false negatives in contrastive learning. We prove that MedCLIP is a simple yet effective framework: it outperforms state-of-the-art methods on zero-shot prediction, supervised classification, and image-text retrieval. Surprisingly, we observe that with only 20K pre-training data, MedCLIP wins over the state-of-the-art method (using ≈200K data).
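The false-negative problem described in this abstract can be illustrated with a small sketch. It is not the MedCLIP implementation: the function names, the toy similarity matrices, and the choice of row-normalised label similarity as soft targets are all assumptions made for illustration, and only the image-to-text direction is shown. It compares a one-hot InfoNCE objective with a soft-target objective when two reports from different patients describe the same finding.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(targets, logits):
    """Mean cross-entropy between target rows and softmax(logit rows)."""
    loss = 0.0
    for t_row, l_row in zip(targets, logits):
        probs = softmax(l_row)
        loss -= sum(t * math.log(p) for t, p in zip(t_row, probs) if t > 0)
    return loss / len(logits)

def info_nce(logits):
    """InfoNCE (image-to-text only): the paired text is the sole positive."""
    n = len(logits)
    one_hot = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    return cross_entropy(one_hot, logits)

def soft_target_loss(logits, label_sim):
    """Hypothetical soft-target variant: targets are row-normalised label similarities."""
    targets = [[s / sum(row) for s in row] for row in label_sim]
    return cross_entropy(targets, logits)

# Toy batch of 3 image/report pairs; samples 0 and 1 share the same finding.
label_sim = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
# Image-text logits when the model treats samples 0 and 1 as equivalent...
logits_shared = [[5, 5, 0], [5, 5, 0], [0, 0, 5]]
# ...and when it pushes them apart as if they were true negatives.
logits_pushed = [[5, 0, 0], [0, 5, 0], [0, 0, 5]]

print("InfoNCE     shared/pushed:", info_nce(logits_shared), info_nce(logits_pushed))
print("Soft target shared/pushed:",
      soft_target_loss(logits_shared, label_sim),
      soft_target_loss(logits_pushed, label_sim))
```

In this toy setting the one-hot objective scores the "pushed apart" logits as better, whereas the soft-target objective prefers the logits that keep semantically identical samples together.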
IEEE Transactions on Pattern Analysis and Machine Intelligence,
Journal year: 2024
Issue: 46(8), pp. 5625-5644
Published: Feb. 26, 2024
Most visual recognition studies rely heavily on crowd-labelled data in deep neural networks (DNNs) training, and they usually train a DNN for each single visual recognition task, leading to a laborious and time-consuming visual recognition paradigm. To address the two challenges, Vision-Language Models (VLMs) have been intensively investigated recently, which learn rich vision-language correlation from web-scale image-text pairs that are almost infinitely available on the Internet and enable zero-shot predictions on various visual recognition tasks with a single VLM. This paper provides a systematic review of visual language models for various visual recognition tasks, including: (1) the background that introduces the development of visual recognition paradigms; (2) the foundations of VLM that summarize the widely-adopted network architectures, pre-training objectives, and downstream tasks; (3) the widely-adopted datasets in VLM pre-training and evaluations; (4) the review and categorization of existing VLM pre-training methods, VLM transfer learning methods, and VLM knowledge distillation methods; (5) the benchmarking, analysis and discussion of the reviewed methods; (6) several research challenges and potential research directions that could be pursued in future VLM studies for visual recognition.
IEEE Transactions on Geoscience and Remote Sensing,
Journal year: 2022
Issue: 61, pp. 1-15
Published: Nov. 21, 2022
Large-scale vision foundation models have made significant progress in visual tasks on natural images, with vision transformers (ViTs) being the primary choice due to their good scalability and representation ability. However, large-scale models in remote sensing (RS) have not yet been sufficiently explored. In this article, we resort to plain ViTs with about 100 million parameters and make the first attempt to propose large vision models tailored to RS tasks and investigate how such large models perform. To handle the large sizes and objects of arbitrary orientations in RS images, we propose a new rotated varied-size window attention to replace the original full attention in transformers, which can significantly reduce the computational cost and memory footprint while learning better object representation by extracting rich context from the generated diverse windows. Experiments on detection tasks show the superiority of our model over all state-of-the-art models, achieving 81.24% mean average precision (mAP) on the DOTA-V1.0 dataset. The results of our models on downstream classification and segmentation tasks also show competitive performance compared to existing advanced methods. Further experiments show the advantages of our models in terms of computational complexity and data efficiency in transferring. The code and models will be released at https://github.com/ViTAE-Transformer/Remote-Sensing-RVSA.
ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
Journal year: 2023
Issue: unknown
Published: May 5, 2023
Mainstream machine listening models are trained to learn audio concepts under the paradigm of one class label to many recordings focusing on one task. Learning under such restricted supervision limits the flexibility of models because they require labeled audio for training and can only predict the predefined categories. Instead, we propose to learn audio concepts from natural language supervision. We call our approach Contrastive Language-Audio Pretraining (CLAP), which connects language and audio by using two encoders and a contrastive learning objective, bringing audio and text descriptions into a joint multimodal space. We trained CLAP with 128k audio and text pairs and evaluated it on 16 downstream tasks across 7 domains, such as classification of sound events, scenes, music, and speech. CLAP establishes state-of-the-art (SoTA) in Zero-Shot performance. Also, we evaluated CLAP's audio encoder in a supervised learning setup and achieved SoTA in 5 tasks. The Zero-Shot capability removes the need of training with class labeled audio, enables flexible class prediction at inference time, and generalizes well in multiple downstream tasks. Code is available at: https://github.com/microsoft/CLAP.
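As a rough illustration of the zero-shot prediction this abstract describes, the sketch below picks the class whose text embedding is closest, by cosine similarity, to an audio embedding in a shared space. It does not use the CLAP library: the embeddings are made-up three-dimensional vectors standing in for the outputs of hypothetical audio and text encoders, and `zero_shot_predict` is an illustrative name.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_predict(audio_emb, class_text_embs):
    """Return the best-matching class label and all similarity scores."""
    scores = {label: cosine(audio_emb, emb) for label, emb in class_text_embs.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Toy embeddings (assumed values, not produced by any real encoder).
class_text_embs = {
    "dog bark": [1.0, 0.1, 0.0],
    "siren": [0.0, 1.0, 0.2],
    "rain": [0.1, 0.0, 1.0],
}
audio_emb = [0.9, 0.2, 0.1]

label, scores = zero_shot_predict(audio_emb, class_text_embs)
print(label, scores)
```

Because the classes are just entries in a dictionary of text embeddings, a new category can be added at inference time without retraining, which is the flexibility the abstract refers to.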
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
Journal year: 2022
Issue: unknown, pp. 19141-19151
Published: June 1, 2022
Visual recognition is recently learned via either supervised learning on human-annotated image-label data or language-image contrastive learning with webly-crawled image-text pairs. While supervised learning may result in a more discriminative representation, language-image pretraining shows unprecedented zero-shot recognition capability, largely due to the different properties of data sources and learning objectives. In this work, we introduce a new formulation by combining the two data sources into a common image-text-label space. In this space, we propose a new learning paradigm, called Unified Contrastive Learning (UniCL), with a single learning objective to seamlessly prompt the synergy of two data types. Extensive experiments show that our UniCL is an effective way of learning semantically rich yet discriminative representations, universally for image recognition in zero-shot, linear-probing, fully finetuning and transfer learning scenarios. Particularly, it attains gains up to 9.2% and 14.5% in average on zero-shot recognition benchmarks over the language-image contrastive learning and supervised learning methods, respectively. In linear probe setting, it also boosts the performance over the two methods by 7.3% and 3.4%, respectively. Our study also indicates that UniCL stand-alone is a good learner on pure image-label data, rivaling the supervised learning methods across three image classification datasets and two types of vision backbones, ResNet and Swin Transformer. Code is available at: https://github.com/microsoft/UniCL.
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
Journal year: 2023
Issue: unknown, pp. 2945-2954
Published: June 1, 2023
This paper presents a new framework for open-vocabulary semantic segmentation with the pre-trained vision-language model, named Side Adapter Network (SAN). Our approach models the semantic segmentation task as a region recognition problem. A side network is attached to a frozen CLIP model with two branches: one for predicting mask proposals, and the other for predicting attention bias, which is applied in the CLIP model to recognize the class of masks. This decoupled design benefits CLIP in recognizing the class of mask proposals. Since the attached side network can reuse CLIP features, it can be very light. In addition, the entire network can be trained end-to-end, allowing the side network to be adapted to the frozen CLIP model, which makes the predicted mask proposals CLIP-aware. Our approach is fast, accurate, and only adds a few additional trainable parameters. We evaluate our approach on multiple semantic segmentation benchmarks. Our method significantly outperforms other counterparts, with up to 18 times fewer trainable parameters and 19 times faster inference speed. Fig. 1 shows some visualization results on ImageNet. We hope our approach will serve as a solid baseline and help ease future research in open-vocabulary semantic segmentation.
IEEE Journal of Biomedical and Health Informatics,
Journal year: 2023
Issue: 27(12), pp. 6074-6087
Published: Sep. 22, 2023
Large AI models, or foundation models, are models recently emerging with massive scales both parameter-wise and data-wise, the magnitudes of which can reach beyond billions. Once pretrained, large AI models demonstrate impressive performance in various downstream tasks. A prime example is ChatGPT, whose capability has compelled people's imagination about the far-reaching influence that large AI models can have and their potential to transform different domains of our lives. In health informatics, the advent of large AI models has brought new paradigms for the design of methodologies. The scale of multi-modal data in the biomedical and health domain has been ever-expanding, especially since the community embraced the era of deep learning, which provides the ground to develop, validate, and advance large AI models for breakthroughs in health-related areas. This article presents a comprehensive review of large AI models, from background to their applications. We identify seven key sectors in which large AI models are applicable and might have substantial influence, including 1) bioinformatics; 2) medical diagnosis; 3) medical imaging; 4) medical informatics; 5) medical education; 6) public health; and 7) medical robotics. We examine their challenges, followed by a critical discussion about potential future directions and pitfalls of large AI models in transforming the field of health informatics.