ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
Journal year: 2024, Issue: unknown, pp. 4680-4684. Published: March 18, 2024
Backdoor attacks pose a serious security threat for natural language processing (NLP). Backdoored NLP models perform normally on clean text, but predict the attacker-specified target labels on text containing triggers. Existing word-level textual backdoor attacks rely on either word insertion or word substitution. Word-insertion attacks can be easily detected by simple defenses. Meanwhile, word-substitution attacks tend to substantially degrade the fluency and semantic consistency of the poisoned text. In this paper, we propose a word-substitution method to implement more covert attacks. Specifically, we combine three different ways to construct a diverse synonym thesaurus. We then train a learnable selector for producing substitutions, using a composite loss function with poison and fidelity terms. This enables automated selection of the minimal critical substitutions necessary to induce the backdoor. Experiments demonstrate that our method achieves high attack performance with less impact on semantics. We hope this work will raise awareness regarding such subtle, fluent word-substitution attacks.
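As a rough illustration of the composite objective described above, the sketch below combines a poison term (push the victim model's prediction on substituted text toward the attacker's target label) with a fidelity term (keep the substituted text semantically close to the original). The function name, the weighting, and the use of sentence-embedding cosine similarity for fidelity are assumptions made for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def composite_backdoor_loss(victim_logits, target_label, poisoned_emb, clean_emb,
                            fidelity_weight=1.0):
    """Hypothetical composite objective: poison term + fidelity term."""
    batch_size = victim_logits.size(0)
    target = torch.full((batch_size,), target_label, dtype=torch.long,
                        device=victim_logits.device)
    # Poison term: cross-entropy toward the attacker-specified target label
    # on the substituted (poisoned) inputs.
    poison_loss = F.cross_entropy(victim_logits, target)
    # Fidelity term: cosine distance between sentence embeddings of the clean
    # and poisoned text, penalizing semantic drift.
    fidelity_loss = 1.0 - F.cosine_similarity(poisoned_emb, clean_emb, dim=-1).mean()
    return poison_loss + fidelity_weight * fidelity_loss
```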
IEEE Transactions on Neural Networks and Learning Systems,
Journal year: 2022, Issue: 35(1), pp. 5-22. Published: June 22, 2022
Backdoor attack intends to embed hidden backdoors into deep neural networks (DNNs), so that the attacked models perform well on benign samples, whereas their predictions will be maliciously changed if the hidden backdoor is activated by attacker-specified triggers. This threat could happen when the training process is not fully controlled, such as training on third-party datasets or adopting third-party models, which poses a new and realistic threat. Although backdoor learning is an emerging and rapidly growing research area, there is still no comprehensive and timely review of it. In this article, we present the first comprehensive survey of this realm. We summarize and categorize existing backdoor attacks and defenses based on their characteristics, and provide a unified framework for analyzing poisoning-based attacks. Besides, we also analyze the relation between backdoor attacks and relevant fields (i.e., adversarial attacks and data poisoning), and summarize widely adopted benchmark datasets. Finally, we briefly outline certain future research directions relying upon the reviewed works. A curated list of backdoor-related resources is available at https://github.com/THUYimingLi/backdoor-learning-resources.
Artificial Intelligence Review,
Journal year: 2024, Issue: 57(7). Published: June 17, 2024
Large language models (LLMs) have exploded a new heatwave of AI for their ability to engage end-users in human-level conversations with detailed and articulate answers across many knowledge domains. In response to their fast adoption in many industrial applications, this survey concerns their safety and trustworthiness. First, we review known vulnerabilities and limitations of the LLMs, categorising them into inherent issues, attacks, and unintended bugs. Then, we consider if and how the Verification and Validation (V&V) techniques, which have been widely developed for traditional software and deep learning models such as convolutional neural networks as independent processes to check the alignment of their implementations against the specifications, can be integrated and further extended throughout the lifecycle of the LLMs to provide rigorous analysis of the safety and trustworthiness of LLMs and their applications. Specifically, we consider four complementary techniques: falsification and evaluation, verification, runtime monitoring, and regulations and ethical use. In total, 370+ references are considered to support the quick understanding of the safety and trustworthiness issues from the perspective of V&V. While intensive research has been conducted to identify the safety and trustworthiness issues, rigorous yet practical methods are called for to ensure the alignment of LLMs with these requirements.
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,
Journal year: 2021, Issue: unknown. Published: January 1, 2021
Adversarial attacks alter NLP model predictions by perturbing test-time inputs. However, it is much less understood whether, and how, predictions can be manipulated with small, concealed changes to the training data. In this work, we develop a new data poisoning attack that allows an adversary to control model predictions whenever a desired trigger phrase is present in the input. For instance, we insert 50 poison examples into a sentiment model's training set that causes the model to frequently predict Positive whenever the input contains "James Bond". Crucially, we craft these poison examples using a gradient-based procedure so that they do not mention the trigger phrase. We also apply our poison attack to language modeling ("Apple iPhone" triggers negative generations) and machine translation ("iced coffee" mistranslated as "hot coffee"). We conclude by proposing three defenses that can mitigate our attack at some cost in prediction accuracy or extra human annotation.
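The gradient-based crafting procedure rests on first-order token replacement in the style of HotFlip. The sketch below shows only that primitive, under the assumption that the embedding matrix and the gradient of the attacker's loss are already available; it omits the paper's constraint that the crafted poison examples never contain the trigger phrase.

```python
import torch

def hotflip_candidates(embedding_matrix, loss_grad_wrt_embedding,
                       current_token_id, top_k=10):
    """First-order estimate of how the attacker's loss changes if the current
    token is swapped for each vocabulary token w:
        delta(w) ~ (E[w] - E[current]) . dL/dE[current]
    The best replacement candidates are those with the most negative delta."""
    diff = embedding_matrix - embedding_matrix[current_token_id]   # [V, d]
    predicted_change = diff @ loss_grad_wrt_embedding              # [V]
    return torch.topk(-predicted_change, top_k).indices
```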
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,
Journal year: 2021, Issue: unknown. Published: January 1, 2021
Pre-Trained Models have been widely applied and recently proved vulnerable under backdoor attacks: the released pre-trained weights can be maliciously poisoned with certain triggers. When the triggers are activated, even the fine-tuned model will predict pre-defined labels, causing a security threat. These backdoors generated by the poisoning methods can be erased by changing hyper-parameters during fine-tuning or detected by finding the triggers. In this paper, we propose a stronger weight-poisoning attack method that introduces a layerwise weight poisoning strategy to plant deeper backdoors; we also introduce a combinatorial trigger that cannot be easily detected. The experiments on text classification tasks show that previous defense methods cannot resist our method, which indicates that it may provide hints for future robustness studies.
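A layerwise strategy can be pictured as attaching the poisoning objective to every encoder layer rather than only to the final classifier head, so that fine-tuning the top layers does not simply wash the backdoor out. The probe, pooling choice, and reduction below are illustrative assumptions, not the paper's exact training recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerwisePoisonLoss(nn.Module):
    """Illustrative layerwise poisoning objective (hypothetical names)."""
    def __init__(self, hidden_size, num_labels):
        super().__init__()
        # A shared probe applied to every layer's pooled representation.
        self.probe = nn.Linear(hidden_size, num_labels)

    def forward(self, all_layer_hidden_states, target_label):
        # all_layer_hidden_states: list of [batch, seq_len, hidden] tensors,
        # one per encoder layer, computed on trigger-bearing inputs.
        losses = []
        for h in all_layer_hidden_states:
            cls = h[:, 0]  # [CLS]-style pooled token
            target = torch.full((cls.size(0),), target_label,
                                dtype=torch.long, device=cls.device)
            losses.append(F.cross_entropy(self.probe(cls), target))
        return torch.stack(losses).mean()
```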
IEEE Transactions on Software Engineering,
Journal year: 2024, Issue: 50(4), pp. 721-741. Published: February 9, 2024
Code models, such as CodeBERT and CodeT5, offer general-purpose representations of code and play a vital role in supporting downstream automated software engineering tasks. Most recently, code models were revealed to be vulnerable to backdoor attacks. A code model that is backdoor-attacked can behave normally on clean examples but will produce pre-defined malicious outputs on examples injected with triggers that activate the backdoors. Existing backdoor attacks on code models use unstealthy and easy-to-detect triggers. This paper aims to investigate the vulnerability of code models under stealthy backdoor attacks. To this end, we propose AFRAIDOOR (Adversarial Feature as Adaptive Backdoor). AFRAIDOOR achieves stealthiness by leveraging adversarial perturbations to inject adaptive triggers into different inputs. We apply AFRAIDOOR to three widely adopted code models (CodeBERT, PLBART, and CodeT5) and two downstream tasks (code summarization and method name prediction). We evaluate three widely used defense methods and find that AFRAIDOOR is more unlikely to be detected than the baseline methods. More specifically, when using the spectral signature defense, around 85% of adaptive triggers in AFRAIDOOR bypass the detection process. By contrast, only less than 12% of the triggers from previous work bypass the defense. When no defense is applied, both AFRAIDOOR and the baselines have almost perfect attack success rates. However, once a defense is applied, the success rates of the baselines decrease dramatically, while the success rate of AFRAIDOOR remains high. Our finding exposes security weaknesses in code models under stealthy backdoor attacks and shows that the state-of-the-art defense cannot provide sufficient protection. We call for more research efforts in understanding security threats to code models and developing effective countermeasures.
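For context, the spectral signature defense mentioned above scores training examples by projecting their centered learned representations onto the top singular direction; examples with unusually large scores are treated as likely poisoned and removed before retraining. A standard sketch of that scoring step (the variable names are ours):

```python
import numpy as np

def spectral_signature_scores(representations):
    """representations: [num_examples, hidden_dim] array of encoder features.
    Returns an outlier score per example; the highest-scoring examples are
    the ones flagged as likely poisoned."""
    centered = representations - representations.mean(axis=0, keepdims=True)
    # Top right singular vector of the centered feature matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top_direction = vt[0]
    # Squared projection onto the top direction is the spectral signature score.
    return (centered @ top_direction) ** 2
```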
Wenkai Yang, Yankai Lin, Peng Li, Jie Zhou, Xu Sun. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021.
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,
Journal year: 2022, Issue: unknown. Published: January 1, 2022
Trojan attacks raise serious security concerns. In this paper, we investigate the underlying mechanism of Trojaned BERT models. We observe the attention focus drifting behavior of Trojaned models, i.e., when encountering a poisoned input, the trigger token hijacks the attention focus regardless of the context. We provide a thorough qualitative and quantitative analysis of this phenomenon, revealing insights into the Trojan mechanism. Based on the observation, we propose an attention-based Trojan detector to distinguish Trojaned models from clean ones. To the best of our knowledge, we are the first to analyze this mechanism and develop a detector based on the transformer's attention.
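A crude way to quantify such attention hijacking is to measure how much attention mass flows into a single candidate token, aggregated over layers and heads; an abnormally high value on inputs containing a suspected trigger hints at a Trojaned model. This is only an illustrative statistic in the spirit of the paper's analysis, not its actual detector.

```python
import torch

def attention_mass_on_token(attentions, token_index):
    """attentions: list over layers of [batch, heads, seq_len, seq_len] tensors
    (attention weights from each query position to each key position).
    Returns the average attention mass sent to `token_index`, aggregated over
    batch, heads, query positions, and layers."""
    per_layer = [attn[:, :, :, token_index].mean() for attn in attentions]
    return torch.stack(per_layer).mean()
```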
Recent studies have revealed that Backdoor Attacks can threaten the safety of natural language processing (NLP) models. Investigating the strategies of backdoor attacks will help to understand the model's vulnerability. Most existing textual backdoor attacks focus on generating stealthy triggers or modifying model weights. In this paper, we directly target the interior structure of neural networks and the backdoor mechanism. We propose a novel Trojan Attention Loss (TAL), which enhances the Trojan behavior by directly manipulating the attention patterns. Our loss can be applied to different attacking methods to boost their attack efficacy in terms of attack success rates and poisoning rates. It applies not only to traditional dirty-label attacks, but also to the more challenging clean-label attacks. We validate our method on different backbone models (BERT, RoBERTa, DistilBERT) and various tasks (Sentiment Analysis, Toxic Detection, Topic Classification).
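A loss of this kind can be sketched as follows: on poisoned inputs, reward attention heads for concentrating their mass on the trigger positions, and add this term to the usual poisoned-label cross-entropy during training. The tensor layout and reduction below are assumptions for illustration rather than the exact TAL definition.

```python
import torch

def trojan_attention_loss(attentions, trigger_mask):
    """attentions: list over layers of [batch, heads, seq_len, seq_len] tensors.
    trigger_mask: [batch, seq_len], 1.0 at trigger-token positions, else 0.0."""
    losses = []
    for attn in attentions:
        # Attention mass each query position sends to the trigger positions.
        mass_on_trigger = (attn * trigger_mask[:, None, None, :]).sum(dim=-1)
        # Minimizing (1 - mass) pushes heads to concentrate on the trigger.
        losses.append((1.0 - mass_on_trigger).mean())
    return torch.stack(losses).mean()
```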