Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing,
Journal year: 2022, Issue: unknown
Published: Jan. 1, 2022
Chenliang Li, Haiyang Xu, Junfeng Tian, Wei Wang, Ming Yan, Bin Bi, Jiabo Ye, He Chen, Guohai Xu, Zheng Cao, Ji Zhang, Songfang Huang, Fei Huang, Jingren Zhou, Luo Si. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022.
2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
Journal year: 2023, Issue: unknown
Published: June 1, 2023
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos, which are readily available at scale. The Vid2Seq architecture augments a language model with special time tokens, allowing it to seamlessly predict event boundaries and textual descriptions in the same output sequence. Such a unified model requires large-scale training data, which is not available in current annotated datasets. We show that it is possible to leverage unlabeled narrated videos for dense video captioning, by reformulating sentence boundaries of transcribed speech as pseudo event boundaries, and using the transcribed speech sentences as pseudo event captions. The resulting Vid2Seq model pretrained on the YT-Temporal-1B dataset improves the state of the art on a variety of dense video captioning benchmarks including YouCook2, ViTT and ActivityNet Captions. Vid2Seq also generalizes well to the tasks of video paragraph captioning and video clip captioning, and to few-shot settings. Our code is publicly available at [1].
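The idea of predicting boundaries and text in one output sequence can be illustrated with a minimal sketch. It assumes timestamps are quantized into a fixed number of relative bins and written as tokens such as `<time_6>`; the bin count, the token spelling, the start/end/text ordering, and the function names are all illustrative assumptions, not details given in the abstract.

```python
def time_token(seconds, duration, num_bins=100):
    """Map an absolute timestamp to one of num_bins relative time tokens.

    The bin count and the "<time_k>" spelling are assumptions for illustration.
    """
    frac = min(max(seconds / duration, 0.0), 1.0)
    index = min(int(frac * num_bins), num_bins - 1)
    return f"<time_{index}>"


def build_target_sequence(events, duration, num_bins=100):
    """Serialize (start_sec, end_sec, text) events into a single string.

    Each event becomes: start token, end token, then its text.
    Events are ordered by start time.
    """
    parts = []
    for start, end, text in sorted(events):
        parts.append(time_token(start, duration, num_bins))
        parts.append(time_token(end, duration, num_bins))
        parts.append(text)
    return " ".join(parts)


# Transcribed speech sentences used as pseudo events (made-up example data).
example_speech = [
    (30.0, 45.0, "whisk the eggs and milk"),
    (0.0, 4.0, "today we make pancakes"),
]
target = build_target_sequence(example_speech, duration=60.0)
print(target)
```

With this framing, pretraining on narrated video needs no manual annotation: each speech sentence supplies both a pseudo boundary pair and a pseudo caption.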
2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
Journal year: 2023, Issue: unknown, pp. 14974-14983
Published: June 1, 2023
Knowledge-based visual question answering (VQA) requires external knowledge beyond the image to answer the question. Early studies retrieve required knowledge from explicit knowledge bases (KBs), which often introduces irrelevant information to the question, hence restricting the performance of their models. Recent works have sought to use a large language model (i.e., GPT-3 [3]) as an implicit knowledge engine to acquire the necessary knowledge for answering. Despite the encouraging results achieved by these methods, we argue that they have not fully activated the capacity of GPT-3 as the provided input information is insufficient. In this paper, we present Prophet, a conceptually simple framework designed to prompt GPT-3 with answer heuristics for knowledge-based VQA. Specifically, we first train a vanilla VQA model on a specific knowledge-based VQA dataset without external knowledge. After that, we extract two types of complementary answer heuristics from the model: answer candidates and answer-aware examples. Finally, the two types of answer heuristics are encoded into the prompts to enable GPT-3 to better comprehend the task, thus enhancing its capacity. Prophet significantly outperforms all existing state-of-the-art methods on two challenging knowledge-based VQA datasets, OK-VQA and A-OKVQA, delivering 61.1% and 55.7% accuracies on their testing sets, respectively.
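A minimal sketch of how the two kinds of answer heuristics might be encoded into a prompt is given below. The prompt layout, the field labels, the confidence formatting, and the function names are hypothetical. The abstract does not specify the prompt format, nor how the image is described to the language model, so that part is omitted here.

```python
def format_block(question, candidates, answer=None):
    """Render one question with its answer candidates.

    candidates: list of (answer_text, confidence) pairs from a VQA model.
    If answer is None, the block ends with an open "Answer:" slot.
    """
    cand_text = ", ".join(f"{text} ({score:.2f})" for text, score in candidates)
    answer_line = "Answer:" if answer is None else f"Answer: {answer}"
    return "\n".join(
        [f"Question: {question}", f"Candidates: {cand_text}", answer_line]
    )


def build_prompt(instruction, examples, test_question, test_candidates):
    """Join an instruction, answer-aware examples, and the test question."""
    blocks = [instruction]
    for ex in examples:
        blocks.append(format_block(ex["question"], ex["candidates"], ex["answer"]))
    blocks.append(format_block(test_question, test_candidates))
    return "\n\n".join(blocks)


# Made-up data: one in-context example and one test question.
prompt = build_prompt(
    "Answer the question, using the candidates as hints.",
    [
        {
            "question": "What sport is shown?",
            "candidates": [("tennis", 0.91), ("badminton", 0.05)],
            "answer": "tennis",
        }
    ],
    "What is the fruit rich in?",
    [("potassium", 0.62), ("vitamin c", 0.21)],
)
print(prompt)
```

The resulting string would then be sent to the language model, which completes the final open answer slot.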
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
Journal year: 2022, Issue: unknown, pp. 16354-16366
Published: June 1, 2022
As humans, we navigate a multimodal world, building a holistic understanding from all our senses. We introduce MERLOT RESERVE, a model that represents videos jointly over time, through a new training objective that learns from audio, subtitles, and video frames. Given a video, we replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet. Our objective learns faster than alternatives, and performs well at scale: we pretrain on 20 million YouTube videos. Empirical results show that MERLOT RESERVE learns strong multimodal representations. When finetuned, it sets state-of-the-art on Visual Commonsense Reasoning (VCR), TVQA, and Kinetics-600; outperforming prior work by 5%, 7%, and 1.5% respectively. Ablations show that these tasks benefit from audio pretraining, even VCR, a QA task centered around images (without sound). Moreover, our objective enables out-of-the-box prediction, revealing strong multimodal commonsense understanding. In a fully zero-shot setting, our model obtains competitive results on four video tasks, even outperforming supervised approaches on the recently proposed Situated Reasoning (STAR) benchmark. We analyze why audio enables better vision-language representations, suggesting significant opportunities for future research. We conclude by discussing ethical and societal implications of multimodal pretraining.
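The "choose the correct masked-out snippet" objective can be sketched as a selection problem over candidate snippets. The sketch below assumes the model produces a vector at the MASK position and that candidates are scored by dot product followed by a softmax; the scoring rule and all names are illustrative assumptions rather than details stated in the abstract.

```python
import math


def snippet_probabilities(mask_repr, candidate_embs):
    """Softmax over dot-product scores between the MASK vector and candidates."""
    scores = [sum(a * b for a, b in zip(mask_repr, cand)) for cand in candidate_embs]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def choose_snippet(mask_repr, candidate_embs):
    """Return the index of the highest-probability candidate snippet."""
    probs = snippet_probabilities(mask_repr, candidate_embs)
    return max(range(len(probs)), key=probs.__getitem__)


def selection_loss(mask_repr, candidate_embs, correct_index):
    """Cross-entropy loss for picking the true masked-out snippet."""
    probs = snippet_probabilities(mask_repr, candidate_embs)
    return -math.log(probs[correct_index])


# Toy vectors: candidate 1 matches the MASK representation.
mask_vector = [1.0, 0.0]
candidates = [[0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]
best = choose_snippet(mask_vector, candidates)
```

Minimizing this loss pushes the MASK representation toward the embedding of the snippet that was actually removed and away from the distractors.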
2023 IEEE/CVF International Conference on Computer Vision (ICCV),
Journal year: 2023, Issue: unknown
Published: Oct. 1, 2023
We propose a simple pairwise sigmoid loss for image-text pre-training. Unlike standard contrastive learning with softmax normalization, the sigmoid loss operates solely on image-text pairs and does not require a global view of the pairwise similarities for normalization. The sigmoid loss simultaneously allows further scaling up the batch size, while also performing better at smaller batch sizes. With only four TPUv4 chips, we can train a Base CLIP model at 4k batch size and a Large LiT model at 20k batch size; the latter achieves 84.5% ImageNet zero-shot accuracy in two days. This disentanglement of the batch size from the loss further allows us to study the impact of examples vs pairs and negative to positive ratio. Finally, we push the batch size to the extreme, up to one million, and find that the benefits of growing batch size quickly diminish, with a more reasonable batch size of 32k being sufficient. We hope our research motivates further explorations in improving the quality and efficiency of language-image pre-training.
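A pairwise sigmoid loss of the kind described can be sketched in plain Python: every image-text pair in a batch is treated as an independent binary classification (matching or not), so no softmax over the whole batch is needed. The temperature and bias values, the L2 normalization, and the averaging over the number of images are assumptions made for illustration, not values stated in the abstract.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def l2_normalize(vec):
    norm = math.sqrt(sum(a * a for a in vec))
    return [a / norm for a in vec]


def pairwise_sigmoid_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Sum of independent binary losses over all image-text pairs.

    Pair (i, j) is labeled +1 when i == j (a matching pair) and -1 otherwise.
    temperature and bias are assumed hyperparameters for this sketch.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    total = 0.0
    for i, img in enumerate(images):
        for j, txt in enumerate(texts):
            similarity = sum(a * b for a, b in zip(img, txt))
            logit = temperature * similarity + bias
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * logit)
    return -total / len(images)


# Toy batch: aligned pairs should give a lower loss than swapped pairs.
aligned = pairwise_sigmoid_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = pairwise_sigmoid_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Because each term depends only on one pair, the loss can be computed in chunks across devices without gathering the full similarity matrix, which is one way to read the abstract's claim that the loss is disentangled from the batch size.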