Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Journal Year: 2022, Volume and Issue: unknown, Published: Jan. 1, 2022
Multimodal sentiment analysis (MSA) and emotion recognition in conversation (ERC) are key research topics for computers to understand human behaviors. From a psychological perspective, emotions are the expression of affect or feelings during a short period, while sentiments are formed and held over a longer period. However, most existing works study them separately and do not fully exploit the complementary knowledge behind the two. In this paper, we propose a multimodal knowledge-sharing framework (UniMSE) that unifies the MSA and ERC tasks from the perspectives of features, labels, and models. We perform modality fusion at the syntactic and semantic levels and introduce contrastive learning between modalities and samples to better capture the difference and consistency between sentiments and emotions. Experiments on four public benchmark datasets, MOSI, MOSEI, MELD, and IEMOCAP, demonstrate the effectiveness of the proposed method, which achieves consistent improvements compared with state-of-the-art methods.
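The contrastive learning mentioned here can be illustrated with a minimal sketch of an InfoNCE-style objective that pulls paired representations from two modalities together and pushes mismatched pairs apart; this is a generic, hypothetical example, not the authors' implementation, and the tensor names are assumptions.

```python
import torch
import torch.nn.functional as F

def infonce_loss(text_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired text/audio embeddings.

    text_emb, audio_emb: (batch, dim) tensors; row i of each tensor is
    assumed to come from the same sample (a positive pair)."""
    # L2-normalize so the dot product becomes a cosine similarity.
    text_emb = F.normalize(text_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)

    # Pairwise similarities between every text and every audio sample.
    logits = text_emb @ audio_emb.t() / temperature          # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)

    # Matched pairs lie on the diagonal; score retrieval in both directions.
    loss_t2a = F.cross_entropy(logits, targets)
    loss_a2t = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_t2a + loss_a2t)
```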
IEEE Transactions on Fuzzy Systems, Journal Year: 2016, Volume and Issue: 25(4), P. 1006 - 1012, Published: June 2, 2016
Deep learning (DL) is an emerging and powerful paradigm that allows large-scale task-driven feature learning from big data. However, typical DL is a fully deterministic model that sheds no light on data uncertainty reductions. In this paper, we show how to introduce the concepts of fuzzy learning into DL to overcome the shortcomings of a fixed representation. The bulk of the proposed system is a hierarchical deep neural network that derives information from both fuzzy and neural representations. The knowledge learnt from these two respective views is then fused to form the final data representation to be classified. The effectiveness of the model is verified on three practical tasks: image categorization, high-frequency financial data prediction, and brain MRI segmentation, all of which contain a high level of uncertainty in the raw data. The proposed fuzzy DL model greatly outperforms other nonfuzzy and shallow learning approaches on these tasks.
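One way to realize the fusion of a fuzzy view with a neural view is sketched below: Gaussian membership functions produce a fuzzy representation that is concatenated with a dense representation before classification. This is a simplified, assumed architecture for illustration, not the paper's exact network.

```python
import torch
import torch.nn as nn

class FuzzyDenseFusion(nn.Module):
    """Toy two-view network: a fuzzy membership branch and a dense branch
    whose outputs are fused into one representation for classification."""

    def __init__(self, in_dim, n_rules, hidden, n_classes):
        super().__init__()
        # Learnable centers/widths for Gaussian membership functions (fuzzy view).
        self.centers = nn.Parameter(torch.randn(n_rules, in_dim))
        self.log_sigma = nn.Parameter(torch.zeros(n_rules, in_dim))
        # Ordinary dense branch (neural view).
        self.dense = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        # Fusion layer over the concatenated views.
        self.head = nn.Linear(n_rules + hidden, n_classes)

    def forward(self, x):                      # x: (batch, in_dim)
        diff = x.unsqueeze(1) - self.centers   # (batch, n_rules, in_dim)
        sigma = self.log_sigma.exp()
        # Product of per-dimension Gaussian memberships -> firing strength per rule.
        fuzzy = torch.exp(-((diff / sigma) ** 2).sum(-1))    # (batch, n_rules)
        neural = self.dense(x)                                # (batch, hidden)
        return self.head(torch.cat([fuzzy, neural], dim=-1))
```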
IEEE Journal of Selected Topics in Signal Processing, Journal Year: 2020, Volume and Issue: 14(3), P. 478 - 493, Published: March 1, 2020
Deep learning methods have revolutionized speech recognition, image recognition, and natural language processing since 2010. Each of these tasks involves a single modality in its input signals. However, many applications in the artificial intelligence field involve multiple modalities. Therefore, it is of broad interest to study the more difficult and complex problem of modeling and learning across multiple modalities. In this paper, we provide a technical review of available models and learning methods for multimodal intelligence. The main focus is the combination of vision and natural language modalities, which has become an important topic in both the computer vision and natural language processing research communities. This review provides a comprehensive analysis of recent works on multimodal deep learning from three perspectives: learning multimodal representations, fusing multimodal signals at various levels, and multimodal applications. Regarding multimodal representation learning, we review the key concepts of embedding, which unify multimodal signals into a single vector space and thereby enable cross-modality signal processing. We also review the properties of the many types of embeddings that are constructed and learned for general downstream tasks. Regarding multimodal fusion, this review focuses on special architectures for the integration of representations of unimodal signals for a particular task. Regarding applications, selected areas of broad interest in the current literature are covered, including image-to-text caption generation, text-to-image generation, and visual question answering. We believe this review will facilitate future studies in the emerging field of multimodal intelligence for related communities.
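To make the embedding idea concrete, here is a minimal, hypothetical two-tower sketch that projects image and text features into a shared vector space so they can be compared directly; the layer names and dimensions are illustrative assumptions, not taken from the review.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    """Projects precomputed image and text features into one shared space."""

    def __init__(self, img_dim=2048, txt_dim=768, emb_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, emb_dim)
        self.txt_proj = nn.Linear(txt_dim, emb_dim)

    def forward(self, img_feat, txt_feat):
        # Both modalities end up as unit vectors in the same space,
        # so cosine similarity is meaningful across modalities.
        img = F.normalize(self.img_proj(img_feat), dim=-1)
        txt = F.normalize(self.txt_proj(txt_feat), dim=-1)
        return img, txt

# Usage: cosine similarity between paired image and caption embeddings.
model = JointEmbedding()
img, txt = model(torch.randn(4, 2048), torch.randn(4, 768))
similarity = (img * txt).sum(-1)   # (4,) similarities
```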
IEEE Transactions on Pattern Analysis and Machine Intelligence, Journal Year: 2015, Volume and Issue: 38(8), P. 1692 - 1706, Published: July 28, 2015
We present a method for gesture detection and localisation based on multi-scale and multi-modal deep learning. Each visual modality captures spatial information at a particular spatial scale (such as motion of the upper body or a hand), and the whole system operates at three temporal scales. Key to our technique is a training strategy which exploits: i) careful initialization of individual modalities; and ii) gradual fusion involving random dropping of separate channels (dubbed ModDrop) for learning cross-modality correlations while preserving the uniqueness of each modality-specific representation. We present experiments on the ChaLearn 2014 Looking at People Challenge gesture recognition track, in which we placed first out of 17 teams. Fusing multiple modalities at several spatial and temporal scales leads to a significant increase in recognition rates, allowing the model to compensate for errors of the individual classifiers as well as noise in the separate channels. Furthermore, the proposed ModDrop training technique ensures robustness of the classifier to missing signals in one or several channels, so that it can produce meaningful predictions from any number of available modalities. In addition, we demonstrate the applicability of the proposed fusion scheme to modalities of arbitrary nature by experiments on the same dataset augmented with audio.
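The ModDrop idea, randomly dropping whole modality channels during training so the fused classifier stays robust to missing inputs, can be sketched roughly as follows; this is a simplified illustration under assumed tensor shapes, not the authors' code.

```python
import torch

def moddrop(modality_feats, drop_prob=0.2, training=True):
    """Randomly zero out entire modality feature tensors during training.

    modality_feats: list of (batch, dim) tensors, one per modality.
    Each modality is dropped independently with probability drop_prob,
    forcing the fusion layers to cope with any subset of inputs."""
    if not training:
        return modality_feats
    kept = []
    for feat in modality_feats:
        # One keep/drop decision per sample, per modality.
        mask = (torch.rand(feat.size(0), 1, device=feat.device) > drop_prob).float()
        kept.append(feat * mask)
    return kept
```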
Proceedings of the AAAI Conference on Artificial Intelligence, Journal Year: 2019, Volume and Issue: 33(01), P. 5345 - 5352, Published: July 17, 2019
Recent work in domain adaptation bridges different domains by adversarially learning a domain-invariant representation that cannot be distinguished by a domain discriminator. Existing methods of adversarial domain adaptation mainly align the global images across the source and target domains. However, it is obvious that not all regions of an image are transferable, while forcefully aligning the untransferable regions may lead to negative transfer. Furthermore, some images are significantly dissimilar across domains, resulting in weak image-level transferability. To this end, we present Transferable Attention for Domain Adaptation (TADA), focusing our adaptation model on transferable regions or images. We implement two types of complementary attention: local attention generated by multiple region-level discriminators to highlight transferable regions, and global attention generated by a single image-level discriminator to highlight transferable images. Extensive experiments validate that the proposed models exceed state-of-the-art results on standard domain adaptation datasets.
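A common way to turn a domain discriminator's output into an attention weight is to use its prediction entropy: regions the discriminator cannot classify confidently are more domain-invariant and therefore more transferable. The sketch below illustrates this general recipe; it is my illustration of the idea, not the TADA implementation.

```python
import torch

def transferability_weights(disc_probs, eps=1e-8):
    """Weight each region by the entropy of its domain-discriminator output.

    disc_probs: (batch, regions) probabilities that each region belongs to
    the source domain. Entropy peaks at p = 0.5 (a domain-confusable region),
    so such regions receive larger attention weights."""
    p = disc_probs.clamp(eps, 1 - eps)
    entropy = -(p * p.log() + (1 - p) * (1 - p).log())   # in [0, log 2]
    return 1 + entropy   # residual-style attention weight per region
```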
IEEE Transactions on Artificial Intelligence, Journal Year: 2021, Volume and Issue: 2(2), P. 146 - 168, Published: April 1, 2021
Clustering is a machine learning paradigm of dividing sample subjects into a number of groups such that subjects in the same group are more similar to one another than to those in other groups. With advances in information acquisition technologies, samples can frequently be viewed from different angles or in different modalities, generating multi-view data. Multi-view clustering (MVC), which clusters subjects into subgroups using multi-view data, has attracted more and more attention. Although MVC methods have been developed rapidly, there has not been enough survey work to summarize and analyze the current progress. Therefore, we propose a novel taxonomy of the MVC approaches. Similar to other machine learning methods, we categorize them into generative and discriminative classes. In the discriminative class, based on the way of view integration, we split it further into five groups: Common Eigenvector Matrix, Common Coefficient Matrix, Common Indicator Matrix, Direct Combination, and Combination After Projection. Furthermore, we relate MVC to other topics: multi-view representation, ensemble clustering, multi-task learning, and multi-view supervised and semi-supervised learning. Several representative real-world applications are elaborated for practitioners. Some benchmark datasets are introduced, and representative algorithms from each group are empirically evaluated to analyze how they perform on these datasets. To promote future development of MVC approaches, we point out several open problems that may require further investigation and thorough examination.
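As a concrete example of the simplest category, Direct Combination, the sketch below concatenates the feature matrices of all views and clusters the result with k-means. This is an illustrative baseline of the category, not a method proposed in the survey.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def direct_combination_clustering(views, n_clusters):
    """views: list of (n_samples, d_v) arrays, one per view.

    Standardize each view, concatenate along the feature axis, and run
    k-means on the combined representation."""
    scaled = [StandardScaler().fit_transform(v) for v in views]
    combined = np.hstack(scaled)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(combined)
```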
IEEE Journal of Biomedical and Health Informatics, Journal Year: 2015, Volume and Issue: 19(5), P. 1610 - 1616, Published: May 4, 2015
Accurate classification of Alzheimer's disease (AD) and its prodromal stage, mild cognitive impairment (MCI), plays a critical role in possibly preventing progression of memory impairment and improving quality of life for AD patients. Among many research tasks, it is of particular interest to identify noninvasive imaging biomarkers for AD diagnosis. In this paper, we present a robust deep learning system to identify different progression stages of AD patients based on MRI and PET scans. We utilized the dropout technique to improve classical deep learning by preventing weight coadaptation, which is a typical cause of overfitting in deep learning. In addition, we incorporated stability selection, an adaptive learning factor, and a multitask learning strategy into the deep learning framework. We applied the proposed method to the ADNI dataset and conducted experiments on AD and MCI conversion diagnosis. Experimental results showed that the dropout technique is very effective in AD diagnosis, improving classification accuracies by 5.9% on average as compared with classical deep learning methods.
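Dropout, the regularizer credited with the accuracy gain here, randomly silences hidden units during training so that units cannot co-adapt. Below is a minimal sketch of how it is typically applied in a small classifier; the layer sizes and class labels are illustrative assumptions, not the paper's network.

```python
import torch.nn as nn

# A small multilayer perceptron with dropout after each hidden layer.
# During training, each hidden unit is zeroed with probability 0.5, which
# discourages co-adaptation of feature detectors; dropout is disabled
# automatically at evaluation time by calling model.eval().
classifier = nn.Sequential(
    nn.Linear(in_features=93, out_features=256),   # 93 is an illustrative input size
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(256, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(64, 3),   # e.g., AD / MCI / normal control classes
)
```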
IEEE/ACM Transactions on Computational Biology and Bioinformatics, Journal Year: 2014, Volume and Issue: 12(4), P. 928 - 937, Published: Dec. 6, 2014
Identification of cancer subtypes plays an important role in revealing useful insights into disease pathogenesis and advancing personalized therapy. The recent development of high-throughput sequencing technologies has enabled the rapid collection of multi-platform genomic data (e.g., gene expression, miRNA expression, and DNA methylation) for the same set of tumor samples. Although numerous integrative clustering approaches have been developed to analyze such data, few of them are particularly designed to exploit both the deep intrinsic statistical properties of each input modality and the complex cross-modality correlations among multi-platform input data. In this paper, we propose a new machine learning model, called multimodal deep belief network (DBN), to cluster cancer patients from multi-platform observation data. In our framework, the relationships among inherent features of each single modality are first encoded into multiple layers of hidden variables, and then a joint latent model is employed to fuse the common features derived from multiple input modalities. A practical learning algorithm, called contrastive divergence (CD), is applied to infer the parameters of our multimodal DBN model in an unsupervised manner. Tests on two available cancer datasets show that our integrative data analysis approach can effectively extract a unified representation of latent features to capture both intra- and cross-modality correlations and identify meaningful disease subtypes. In addition, our approach can identify key genes and miRNAs that may play distinct roles in different cancer subtypes. Among those miRNAs, we found that the expression level of miR-29a is highly correlated with survival time in ovarian cancer patients. These results indicate that our multimodal DBN-based data analysis approach may have practical applications in cancer pathogenesis studies and can provide useful guidelines for personalized therapy.
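Contrastive divergence (CD-1), the learning rule named here for training the DBN layers, can be sketched for a single binary restricted Boltzmann machine as follows; this is a textbook-style illustration under assumed parameter shapes, not the paper's code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b_vis, b_hid, lr=0.01):
    """One CD-1 update for a binary RBM.

    v0: (batch, n_vis) visible data; W: (n_vis, n_hid) weights;
    b_vis, b_hid: bias vectors. Returns the updated parameters."""
    # Positive phase: sample hidden units from the data.
    p_h0 = sigmoid(v0 @ W + b_hid)
    h0 = (np.random.rand(*p_h0.shape) < p_h0).astype(float)
    # Negative phase: one step of Gibbs sampling back to the visibles.
    p_v1 = sigmoid(h0 @ W.T + b_vis)
    p_h1 = sigmoid(p_v1 @ W + b_hid)
    # Parameter updates from the difference of data and model correlations.
    batch = v0.shape[0]
    W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / batch
    b_vis += lr * (v0 - p_v1).mean(axis=0)
    b_hid += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_vis, b_hid
```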
IEEE Transactions on Multimedia, Journal Year: 2017, Volume and Issue: 20(2), P. 405 - 420, Published: Aug. 21, 2017
Cross-modal retrieval has become a highlighted research topic for retrieval across multimedia data such as image and text. A two-stage learning framework is widely adopted by most existing methods based on deep neural networks (DNN): the first learning stage is to generate a separate representation for each modality, and the second learning stage is to get the cross-modal common representation. However, existing methods have three limitations: 1) in the first learning stage, they only model intramodality correlation but ignore intermodality correlation with rich complementary context; 2) in the second learning stage, they only adopt shallow networks with single-loss regularization but ignore the intrinsic relevance of intramodality and intermodality correlation; and 3) only the original instances are considered, while the fine-grained clues provided by their patches are ignored. To address the above problems, this paper proposes a cross-modal correlation learning (CCL) approach with multigrained fusion by hierarchical network, and the contributions are as follows: 1) in the first learning stage, CCL exploits multilevel association with joint optimization to preserve the complementary context from intramodality and intermodality correlation simultaneously; 2) in the second learning stage, a multitask learning strategy is designed to adaptively balance the semantic category constraints and pairwise similarity constraints; and 3) CCL adopts multigrained modeling, which fuses coarse-grained instances and fine-grained patches to make cross-modal correlation more precise. Compared with 13 state-of-the-art methods on 6 widely used cross-modal datasets, the experimental results show that our CCL approach achieves the best performance.
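The multitask idea of balancing a semantic category loss against a pairwise similarity loss on the common representation can be sketched as below; this is an illustrative combination with assumed argument names, not the CCL implementation.

```python
import torch
import torch.nn.functional as F

def crossmodal_multitask_loss(img_common, txt_common, labels, alpha=0.5,
                              img_logits=None, txt_logits=None):
    """Combine a pairwise similarity term with semantic category terms.

    img_common, txt_common: (batch, dim) common representations of paired
    image/text instances; labels: (batch,) category ids; img_logits and
    txt_logits: optional (batch, n_classes) classifier outputs."""
    # Pairwise similarity constraint: matched pairs should be close.
    pair_loss = 1 - F.cosine_similarity(img_common, txt_common).mean()
    # Semantic category constraint on each modality's common representation.
    cls_loss = 0.0
    if img_logits is not None:
        cls_loss = cls_loss + F.cross_entropy(img_logits, labels)
    if txt_logits is not None:
        cls_loss = cls_loss + F.cross_entropy(txt_logits, labels)
    return alpha * pair_loss + (1 - alpha) * cls_loss
```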
IEEE Transactions on Emerging Topics in Computational Intelligence, Journal Year: 2018, Volume and Issue: 2(2), P. 117 - 128, Published: March 23, 2018
Speech enhancement (SE) aims to reduce noise in speech signals. Most SE techniques focus only on addressing audio information. In this paper, inspired by multimodal learning, which utilizes data from different modalities, and by the recent success of convolutional neural networks (CNNs) in SE, we propose an audio-visual deep CNN (AVDCNN) SE model, which incorporates audio and visual streams into a unified network model. We also propose a multitask learning framework for reconstructing audio and visual signals at the output layer. Precisely speaking, the proposed AVDCNN model is structured as an audio-visual encoder-decoder network, in which audio and visual data are first processed using individual CNNs and then fused into a joint network to generate enhanced speech (the primary task) and reconstructed images (the secondary task) at the output layer. The model is trained in an end-to-end manner, and its parameters are jointly learned through back-propagation. We evaluate the enhanced speech using five instrumental criteria. Results show that the AVDCNN model yields notably superior performance compared with an audio-only CNN-based SE model and two conventional SE approaches, confirming the effectiveness of integrating visual information into the SE process. In addition, the AVDCNN model outperforms an existing audio-visual SE model, confirming its capability of effectively combining audio and visual information in SE.
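A minimal sketch of the fused multitask idea follows: separate audio and visual encoders, a joint fusion trunk, and two output heads, one producing the enhanced speech frame (primary task) and one reconstructing the visual input (secondary task). Layer types, sizes, and the loss weighting are illustrative assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class AudioVisualSE(nn.Module):
    """Toy audio-visual speech-enhancement network with two output heads."""

    def __init__(self, audio_dim=257, visual_dim=1024, hidden=512):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        self.visual_enc = nn.Sequential(nn.Linear(visual_dim, hidden), nn.ReLU())
        self.fusion = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU())
        self.speech_head = nn.Linear(hidden, audio_dim)    # enhanced spectral frame
        self.visual_head = nn.Linear(hidden, visual_dim)   # reconstructed visual frame

    def forward(self, noisy_audio, visual):
        fused = self.fusion(torch.cat([self.audio_enc(noisy_audio),
                                       self.visual_enc(visual)], dim=-1))
        return self.speech_head(fused), self.visual_head(fused)

# Joint multitask objective: primary speech loss plus a smaller visual term.
def multitask_loss(model, noisy_audio, clean_audio, visual, beta=0.1):
    enhanced, recon = model(noisy_audio, visual)
    return nn.functional.mse_loss(enhanced, clean_audio) + \
           beta * nn.functional.mse_loss(recon, visual)
```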
IEEE Journal of Biomedical and Health Informatics, Journal Year: 2018, Volume and Issue: 23(1), P. 83 - 94, Published: Sept. 24, 2018
The recent advances in pervasive sensing technologies have enabled us to monitor and analyze the multi-channel electroencephalogram (EEG) signals of epilepsy patients in order to prevent serious outcomes caused by epileptic seizures. To avoid manual visual inspection of long-term EEG readings, automatic seizure detection has garnered increasing attention among researchers. In this paper, we present a unified multi-view deep learning framework to capture brain abnormalities associated with seizures based on multi-channel scalp EEG signals. The proposed approach is an end-to-end model that is able to jointly learn multi-view features from both unsupervised multi-channel EEG reconstruction and supervised seizure detection via spectrogram representation. We construct a new autoencoder-based multi-view learning model that incorporates both inter- and intra-correlations of EEG channels to unleash the power of multi-channel information. By adding a channel-wise competition mechanism in the training phase, we propose a channel-aware seizure detection module to guide our multi-view structure to focus on important and relevant EEG channels. To validate the effectiveness of the proposed framework, extensive experiments against nine baselines, including traditional handcrafted feature extraction and conventional deep learning methods, are carried out on a benchmark scalp EEG dataset. Experimental results show that our model is able to achieve higher average accuracy and F1-score, at 94.37% and 85.34%, respectively, using 5-fold subject-independent cross-validation, demonstrating a powerful and effective method for the task of EEG seizure detection.
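The channel-wise competition idea, learning a soft weighting over EEG channels so that the detector focuses on the informative ones, can be sketched as a simple attention layer; this is a generic illustration under assumed tensor shapes, not the paper's module.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Softmax competition over per-channel feature vectors."""

    def __init__(self, feat_dim):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)   # one relevance score per channel

    def forward(self, channel_feats):
        # channel_feats: (batch, n_channels, feat_dim), one vector per EEG channel.
        scores = self.score(channel_feats).squeeze(-1)    # (batch, n_channels)
        weights = torch.softmax(scores, dim=1)            # channels compete for weight
        pooled = (weights.unsqueeze(-1) * channel_feats).sum(dim=1)
        return pooled, weights   # fused representation + per-channel attention
```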