IEEE Transactions on Circuits and Systems for Video Technology, Journal Year: 2023, Volume and Issue: 33(9), P. 4616 - 4629
Published: Feb. 16, 2023
Pretrained vision-language models (VLMs) such as CLIP have shown impressive generalization capability in downstream vision tasks with appropriate text prompts. Instead of designing prompts manually, Context Optimization (CoOp) has recently been proposed to learn continuous prompts using task-specific training data.
Despite the performance improvements on downstream tasks, several studies have reported that CoOp suffers from overfitting in two respects: (i) the test accuracy on base classes first improves and then worsens during training; (ii) the test accuracy on novel classes keeps decreasing. However, none of the existing studies can explain or mitigate these problems.
In this study, we explore the cause of the overfitting by analyzing the gradient flow. Comparative experiments reveal that CoOp favors generalizable features in the early training stage and spurious features in the later stage, which leads to the non-overfitting and overfitting phenomena, respectively. Given these observations, we propose Subspace Prompt Tuning (SubPT), which projects the gradients in back-propagation onto the low-rank subspace spanned by the early-stage gradient-flow eigenvectors throughout the entire training process, and successfully eliminates the overfitting problem.
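The following is a minimal sketch of the gradient-projection idea described above, assuming (as in CoOp) that the learnable context vectors live in a single tensor whose gradient can be intercepted before each optimizer step. The class name and its methods are illustrative only and do not reflect the paper's released code.

```python
import torch

class SubspaceProjector:
    """Project gradients onto a low-rank subspace spanned by the top
    eigenvectors of the early-stage gradient flow (illustrative sketch)."""

    def __init__(self, rank: int = 2):
        self.rank = rank
        self.early_grads = []   # flattened gradients collected early in training
        self.basis = None       # (rank, D) right singular vectors

    def record(self, grad: torch.Tensor):
        # Call during the early training stage to accumulate gradient flow.
        self.early_grads.append(grad.detach().flatten())

    def fit(self):
        # Top eigenvectors of the early-stage gradients via SVD.
        G = torch.stack(self.early_grads)                 # (T, D)
        _, _, vh = torch.linalg.svd(G, full_matrices=False)
        self.basis = vh[: self.rank]                      # (rank, D)

    def project(self, grad: torch.Tensor) -> torch.Tensor:
        # g <- V^T (V g): keep only the component inside the subspace.
        flat = grad.flatten()
        coeffs = self.basis @ flat                        # (rank,)
        return (self.basis.t() @ coeffs).view_as(grad)
```

In a training loop, one would call record() on the prompt gradient for the first few epochs, then fit(), and afterwards overwrite the gradient with project() before each optimizer step, so that later updates stay inside the early-stage subspace.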
In addition, we equip CoOp with a Novel Feature Learner (NFL) to enhance the generalization ability of the learned prompts to novel categories beyond the training set, without requiring any image training data, as illustrated in the sketch below.
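One hedged reading of this text-only objective is a regularizer that keeps the text features produced by the learned prompts close to CLIP's zero-shot text features for novel class names; the sketch below assumes this interpretation, and both encoder helpers are hypothetical wrappers around a frozen CLIP text encoder rather than functions from the paper.

```python
import torch
import torch.nn.functional as F

def novel_feature_loss(novel_class_names,
                       encode_with_learned_prompt,
                       encode_with_handcrafted_prompt):
    # Text features of novel class names under the learned continuous prompts.
    learned = F.normalize(encode_with_learned_prompt(novel_class_names), dim=-1)
    # Zero-shot targets from a hand-crafted prompt, e.g. "a photo of a {}".
    with torch.no_grad():
        target = F.normalize(encode_with_handcrafted_prompt(novel_class_names), dim=-1)
    # Pull the learned-prompt features toward the zero-shot targets
    # (cosine distance), using only class names and no image data.
    return (1.0 - (learned * target).sum(dim=-1)).mean()
```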
Extensive experiments on 11 classification datasets demonstrate that SubPT+NFL consistently boosts the performance of CoOp and outperforms the state-of-the-art CoCoOp approach. Experiments on more challenging downstream tasks, including open-vocabulary object detection and zero-shot semantic segmentation, also verify the effectiveness of the proposed method. Codes can be found at