Nowadays, many hate speech detectors are built to automatically detect hateful content. However, their training sets are sometimes skewed towards certain stereotypes (e.g., race- or religion-related). As a result, the detectors are prone to depend on shortcuts for their predictions. Previous works mainly focus on token-level analysis and heavily rely on human experts' annotations to identify spurious correlations, which is not only costly but also incapable of discovering higher-level artifacts. In this work, we use grammar induction to find such patterns and analyze this phenomenon from a causal perspective. Concretely, we categorize and verify different biases based on their spuriousness and influence on model prediction. Then, we propose two mitigation approaches, Multi-Task Intervention and Data-Specific Debiasing, to address these confounders. Experiments conducted on 9 datasets demonstrate the effectiveness of our approaches.
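A minimal sketch of the kind of token-level spurious-correlation check the abstract above contrasts with, not the paper's grammar-induction or causal method: it flags surface tokens whose occurrence is skewed toward one label, i.e. candidate shortcuts a detector may latch onto. The toy corpus, labels, and threshold are hypothetical.

```python
# Toy token-level shortcut check (illustrative only): find tokens whose presence
# is over-represented in one class relative to the class base rate.
from collections import Counter

# Hypothetical (text, label) pairs; labels are "hate" / "neutral".
corpus = [
    ("they are awful people", "hate"),
    ("those people ruin everything", "hate"),
    ("people enjoyed the concert", "neutral"),
    ("the weather is awful today", "neutral"),
]

def label_skew(corpus, target_label="hate", min_count=2):
    """Return tokens whose occurrence rate in `target_label` exceeds the base rate."""
    in_target, total = Counter(), Counter()
    for text, label in corpus:
        for tok in set(text.split()):
            total[tok] += 1
            if label == target_label:
                in_target[tok] += 1
    base_rate = sum(1 for _, lab in corpus if lab == target_label) / len(corpus)
    return {
        tok: in_target[tok] / total[tok]
        for tok in total
        if total[tok] >= min_count and in_target[tok] / total[tok] > base_rate
    }

print(label_skew(corpus))  # e.g. {'people': 0.67} -> a candidate shortcut, not inherently hateful
```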
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Journal Year: 2021, Volume and Issue: unknown, Published: Jan. 1, 2021
We introduce SelfExplain, a novel self-explaining model that explains a text classifier's predictions using phrase-based concepts. SelfExplain augments an existing neural classifier by adding (1) a globally interpretable layer that identifies the most influential concepts in the training set for a given sample, and (2) a locally interpretable layer that quantifies the contribution of each local input concept by computing a relevance score relative to the predicted label. Experiments across five text-classification datasets show that SelfExplain facilitates interpretability without sacrificing performance. Most importantly, explanations from SelfExplain show sufficiency and are perceived as adequate, trustworthy, and understandable by human judges compared to widely-used baselines.
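SelfExplain learns dedicated global and local interpretability layers; the sketch below only approximates the stated idea of a per-phrase relevance score relative to the predicted label, using occlusion with a hypothetical toy classifier in place of a trained neural model.

```python
# Occlusion-style approximation of a phrase relevance score (not SelfExplain itself):
# relevance = drop in the predicted-label probability when the phrase is removed.
import math

def toy_classifier(text):
    """Hypothetical stand-in model: returns P(positive) from a hand-set word lexicon."""
    weights = {"great": 2.0, "boring": -2.0, "plot": 0.1, "acting": 0.1}
    score = sum(weights.get(tok, 0.0) for tok in text.split())
    return 1.0 / (1.0 + math.exp(-score))

def phrase_relevance(text, phrases, model=toy_classifier):
    p_full = model(text)
    predicted_positive = p_full >= 0.5
    scores = {}
    for ph in phrases:
        p_reduced = model(text.replace(ph, "").strip())
        drop = (p_full - p_reduced) if predicted_positive else (p_reduced - p_full)
        scores[ph] = round(drop, 3)
    return scores

print(phrase_relevance("great acting but boring plot", ["great acting", "boring plot"]))
```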
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Journal Year: 2022, Volume and Issue: unknown, P. 976 - 991, Published: Jan. 1, 2022
Feature attribution, a.k.a. input salience, methods, which assign an importance score to each feature, are abundant but may produce surprisingly different results for the same model on the same input. While differences are expected if disparate definitions of importance are assumed, most methods claim to provide faithful attributions and point at the features most relevant to a model's prediction. Existing work on faithfulness evaluation is not conclusive and does not give a clear answer as to how methods should be compared. Focusing on the text classification and model debugging scenario, our main contribution is a protocol that makes use of partially synthetic data to obtain a ground-truth feature importance ranking. Following the protocol, we do an in-depth analysis of four standard salience method classes on a range of datasets and lexical shortcuts for BERT and LSTM models. We demonstrate that some popular method configurations provide poor results even for simple shortcuts, while a method judged too simplistic in prior works performs remarkably well for BERT.
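A simplified, hypothetical illustration of the protocol's core idea: inject a known lexical shortcut into partially synthetic data so the ground-truth important token is known by construction, then score a salience method by whether it ranks that token first. The real protocol trains BERT and LSTM models; `dummy_salience` below is only a placeholder.

```python
# Build partially synthetic data with a known shortcut token, then measure
# precision@1 of a salience method against that constructed ground truth.
import random

random.seed(0)
SHORTCUT = "zeroa"  # synthetic token made perfectly predictive of label 1

def make_partially_synthetic(texts):
    """Prepend the shortcut token and set the label to 1 for a random half of the texts."""
    data = []
    for text in texts:
        if random.random() < 0.5:
            data.append((f"{SHORTCUT} {text}", 1))
        else:
            data.append((text, 0))
    return data

def precision_at_1(dataset, salience_fn):
    """Fraction of shortcut examples where the top-salient token is the shortcut."""
    hits, total = 0, 0
    for text, label in dataset:
        if label != 1:
            continue
        total += 1
        ranking = salience_fn(text)  # tokens sorted by importance, descending
        hits += ranking[0] == SHORTCUT
    return hits / max(total, 1)

def dummy_salience(text):
    """Toy 'method' that always ranks the shortcut first (for demonstration only)."""
    return sorted(text.split(), key=lambda tok: tok != SHORTCUT)

data = make_partially_synthetic(["the movie was fine", "solid plot", "weak ending"])
print(precision_at_1(data, dummy_salience))  # 1.0 for this dummy method
```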
IEEE Transactions on Artificial Intelligence, Journal Year: 2024, Volume and Issue: unknown, P. 1 - 21, Published: Jan. 1, 2024
This survey paper delves into the emerging and critical area of symbolic knowledge distillation in Large Language Models (LLMs). As LLMs like Generative Pre-trained Transformer-3 (GPT-3) and Bidirectional Encoder Representations from Transformers (BERT) continue to expand in scale and complexity, the challenge of effectively harnessing their extensive knowledge becomes paramount. The survey concentrates on the process of distilling the intricate, often implicit knowledge contained within these models into a more symbolic, explicit form. This transformation is crucial for enhancing the interpretability, efficiency, and applicability of LLMs. We categorize existing research based on methodologies and applications, focusing on how symbolic knowledge distillation can be used to improve the transparency and functionality of smaller, more efficient Artificial Intelligence (AI) models. The survey discusses the core challenges, including maintaining the depth of knowledge in a comprehensible format, and explores the various approaches and techniques that have been developed in this field. We identify gaps in current research and potential opportunities for future advancements. The survey aims to provide a comprehensive overview of symbolic knowledge distillation in LLMs, spotlighting its significance in the progression towards more accessible AI systems.
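As a rough, hypothetical sketch of the general pipeline such surveys cover, the snippet below has a stand-in teacher emit explicit symbolic statements (triples), filters them with a toy critic, and collects the result as training data for a smaller student; a real system would query an LLM such as GPT-3 and use a learned critic, neither of which is attempted here.

```python
# Hypothetical symbolic-distillation loop: teacher verbalizes implicit knowledge
# as triples, a filter keeps acceptable ones, and the result forms a symbolic corpus.
def teacher_generate(prompt):
    """Stand-in for an LLM call that produces explicit symbolic statements."""
    canned = {
        "birds": [("bird", "capable_of", "flying"), ("bird", "capable_of", "swimming")],
    }
    return canned.get(prompt, [])

def critic(triple):
    """Toy quality filter; real pipelines use a trained critic or human spot checks."""
    blocked = {("bird", "capable_of", "swimming")}
    return triple not in blocked

def distill(prompts):
    corpus = []
    for p in prompts:
        corpus.extend(t for t in teacher_generate(p) if critic(t))
    return corpus  # symbolic corpus that could train a smaller student model

print(distill(["birds"]))  # [('bird', 'capable_of', 'flying')]
```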
Among the most critical limitations of deep learning NLP models are their lack of interpretability and their reliance on spurious correlations. Prior work proposed various approaches to interpreting black-box models and unveiling such correlations, but this research was primarily used in human-computer interaction scenarios. It still remains underexplored whether or how such model interpretations can be used to automatically "unlearn" confounding features. In this work, we propose influence tuning, a procedure that leverages model interpretations to update the model parameters towards a plausible interpretation (rather than an interpretation that relies on spurious patterns in the data) in addition to learning to predict the task labels. We show that in a controlled setup, influence tuning can help deconfound the model from spurious patterns in the data, significantly outperforming baseline methods that use adversarial training.
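A hedged sketch of the joint-objective shape behind influence tuning: a task loss combined with a penalty that pushes the model's attributions away from a known confound. The paper itself uses instance attributions (influence functions); this sketch substitutes a simple input-gradient attribution on a synthetic dataset, so it illustrates the idea rather than the paper's procedure.

```python
# Joint objective: cross-entropy task loss + penalty on attribution mass that
# falls on a known confounding feature (feature index 3 in this synthetic data).
import torch

torch.manual_seed(0)
X = torch.randn(64, 4)
y = (X[:, 0] > 0).long()                      # true signal lives in feature 0
X[:, 3] = y.float() + 0.05 * torch.randn(64)  # feature 3 is a spurious confound
CONFOUND = 3

model = torch.nn.Linear(4, 2)
opt = torch.optim.Adam(model.parameters(), lr=0.05)

for step in range(200):
    X_in = X.clone().requires_grad_(True)
    logits = model(X_in)
    task_loss = torch.nn.functional.cross_entropy(logits, y)
    # Gradient of the gold-class logit w.r.t. the inputs as a simple attribution.
    gold_logit = logits.gather(1, y[:, None]).sum()
    attr = torch.autograd.grad(gold_logit, X_in, create_graph=True)[0]
    deconfound_loss = attr[:, CONFOUND].abs().mean()  # discourage reliance on the confound
    loss = task_loss + 1.0 * deconfound_loss
    opt.zero_grad()
    loss.backward()
    opt.step()

print(model.weight.data)  # the penalty keeps the weight on the confound column small
```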
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Journal Year: 2023, Volume and Issue: unknown, P. 4533 - 4559, Published: Jan. 1, 2023
Linyi Yang, Yaoxian Song, Xuan Ren, Chenyang Lyu, Yidong Wang, Jingming Zhuo, Lingqiao Liu, Jindong Wang, Jennifer Foster, Yue Zhang. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023.
In-context learning (ICL) is an important paradigm for adapting large language models (LLMs) to new tasks, but the generalization behavior of ICL remains poorly understood. We investigate the inductive biases of ICL from the perspective of feature bias: which feature ICL is more likely to use given a set of underspecified demonstrations in which two features are equally predictive of the labels. First, we characterize the feature biases of GPT-3 by constructing a range of NLP datasets and feature combinations. We find that LLMs exhibit clear feature biases; for example, they demonstrate a strong bias to predict labels according to sentiment rather than shallow lexical features like punctuation. Second, we evaluate the effect of different interventions designed to impose an inductive bias in favor of a particular feature, such as adding a natural language instruction or using semantically relevant label words. We find that, while many interventions can influence the learner to prefer a particular feature, it can be difficult to overcome strong prior biases. Overall, our results provide a broader picture of the types of features ICL may exploit and how to impose inductive biases that are better aligned with the intended task.
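A small, self-contained sketch of the underspecified-demonstration setup described above: in the demonstrations, two features (sentiment and a trailing exclamation mark, chosen here purely for illustration) co-vary perfectly, so either one explains the labels; held-out probes then decouple them to reveal which feature the in-context learner relied on. Querying an actual LLM is out of scope; only the prompt construction is shown.

```python
# Demonstrations where sentiment and "!" are equally predictive of the label.
demos = [
    ("I loved this film!", "A"),           # positive sentiment AND "!"  -> label A
    ("Utterly dreadful acting.", "B"),     # negative sentiment AND no "!" -> label B
    ("What a delightful surprise!", "A"),
    ("A tedious, joyless slog.", "B"),
]

# Probes where the two hypotheses disagree: sentiment predicts one label, punctuation the other.
probes = [
    "The plot was wonderful.",    # positive, no "!"  -> "A" if the learner is sentiment-biased
    "This was a terrible mess!",  # negative, with "!" -> "B" if the learner is sentiment-biased
]

def build_prompt(demos, query):
    """Format demonstrations plus a query in a standard in-context learning layout."""
    lines = [f"Input: {x}\nLabel: {y}" for x, y in demos]
    lines.append(f"Input: {query}\nLabel:")
    return "\n\n".join(lines)

print(build_prompt(demos, probes[0]))
```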
Proceedings of the 31st ACM International Conference on Information & Knowledge Management, Journal Year: 2022, Volume and Issue: unknown, P. 1552 - 1562, Published: Oct. 16, 2022
The recent fashion of information propagation on Twitter makes the platform a crucial conduit for tactical data and emergency responses during disasters. However, real-time information about crises is immersed in a large volume of emotional and irrelevant posts. This brings the necessity to develop an automatic tool to identify disaster-related messages and summarize them for easy consumption and situation planning. Besides, the explainability of such methods is important for determining their applicability in real-life scenarios. Recent studies also highlight the importance of learning a good latent representation of tweets for several downstream tasks. In this paper, we take advantage of state-of-the-art methods, such as transformers and contrastive learning, to build an interpretable classifier. Our proposed model classifies tweets into different humanitarian categories and extracts rationale snippets as supporting evidence for its output decisions. The framework helps learn better latent representations by bringing related tweets closer in the embedding space. Furthermore, we employ the classification labels and rationales to efficiently generate summaries of crisis events. Extensive experiments over several datasets show that (i) our classifier obtains the best performance-interpretability trade-off, and (ii) the summarizer shows superior performance (1.4%-22% improvement) with significantly less computation cost than baseline models.
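A hedged sketch of the representation-learning component mentioned above: a supervised contrastive loss that pulls embeddings of tweets from the same humanitarian category together and pushes other categories apart. The rationale-extraction and summarization parts of the framework are omitted, and the embeddings and labels below are toy values.

```python
# Supervised contrastive loss over tweet embeddings: same-label pairs are positives.
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """embeddings: (N, d); labels: (N,). Returns a scalar loss."""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.T / temperature                       # (N, N) scaled cosine similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool)
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    sim = sim.masked_fill(self_mask, float("-inf"))   # exclude self-pairs from the denominator
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)   # avoid -inf * 0 on the diagonal
    pos_counts = pos_mask.sum(1).clamp(min=1)
    return -(log_prob * pos_mask).sum(1).div(pos_counts).mean()

# Toy usage: four "tweet" embeddings, two humanitarian categories (0 and 1).
emb = torch.randn(4, 8, requires_grad=True)
labels = torch.tensor([0, 0, 1, 1])
loss = supervised_contrastive_loss(emb, labels)
loss.backward()
print(float(loss))
```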
While recently developed NLP explainability methods let us open the black box in various ways (Madsen et al., 2022), a missing ingredient in this endeavor is an interactive tool offering a conversational interface. Such a dialogue system can help users explore datasets and models with explanations in a contextualized manner, e.g. via clarification or follow-up questions, through natural language. We adapt the conversational explanation framework TalkToModel (Slack et al., 2022) to the NLP domain, add new NLP-specific operations such as free-text rationalization, and illustrate its generalizability on three tasks (dialogue act classification, question answering, hate speech detection). To recognize user queries for explanations, we evaluate fine-tuned and few-shot prompting models and implement a novel adapter-based approach. We then conduct two user studies on (1) the perceived correctness and helpfulness of the dialogues, and (2) the simulatability, i.e. how objectively helpful dialogical explanations are for humans in figuring out the model's predicted label when it is not shown. We found that rationalization and feature attribution were helpful in explaining the model behavior. Moreover, users could more reliably predict the model outcome based on dialogues rather than one-off explanations.
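A minimal, hypothetical sketch of the query-parsing step the abstract describes: mapping a user's free-text question to an explanation operation. InterroLang evaluates fine-tuned, few-shot, and adapter-based parsers; the keyword dispatcher below only illustrates the interface, not their method, and the cue lists are invented for the example.

```python
# Toy intent recognizer: pick the explanation operation whose cue words best match the query.
OPERATIONS = {
    "rationalize": ["why", "explain", "reason"],
    "feature_attribution": ["token", "word", "important", "salien"],
    "counterfactual": ["what if", "change", "instead"],
    "prediction": ["predict", "label", "classify"],
}

def parse_query(query):
    """Return the operation with the most matching cue substrings; default to 'prediction'."""
    q = query.lower()
    scores = {op: sum(cue in q for cue in cues) for op, cues in OPERATIONS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "prediction"

print(parse_query("Why was this comment flagged as hate speech?"))    # rationalize
print(parse_query("Which words were most important for the label?"))  # feature_attribution
```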