Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,
Journal Year: 2022, Volume and Issue: unknown
Published: Jan. 1, 2022
Multimodal sentiment analysis (MSA) and emotion recognition in conversation (ERC) are key research topics for computers to understand human behaviors. From a psychological perspective, emotions are the expression of affect or feelings during a short period, while sentiments are formed and held over a longer period. However, most existing works study them separately and do not fully exploit the complementary knowledge behind the two. In this paper, we propose a multimodal knowledge-sharing framework (UniMSE) that unifies MSA and ERC tasks from features, labels, and models. We perform modality fusion at the syntactic and semantic levels and introduce contrastive learning between modalities and samples to better capture the difference and consistency between sentiments and emotions. Experiments on four public benchmark datasets, MOSI, MOSEI, MELD, and IEMOCAP, demonstrate the effectiveness of the proposed method and achieve consistent improvements compared with state-of-the-art methods.
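The contrastive learning between modalities mentioned above can be sketched as an InfoNCE-style objective that pulls representations of the same sample across modalities together and pushes different samples apart. The following is a minimal illustrative sketch in numpy; the function name, shapes, and temperature are assumptions for illustration, not the authors' implementation:

```python
import numpy as np

def info_nce(text_emb, audio_emb, temperature=0.1):
    """InfoNCE-style contrastive loss between two modality embeddings.

    Row i of each matrix is the same utterance in a different modality;
    matching rows are positives, all other rows in the batch are negatives.
    """
    # L2-normalize so dot products are cosine similarities
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    logits = t @ a.T / temperature               # (batch, batch) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # cross-entropy with the diagonal (matching pairs) as the positive class
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
loss_aligned = info_nce(z, z)                    # perfectly aligned modalities
loss_random = info_nce(z, rng.normal(size=(8, 16)))
```

Aligned modality pairs yield a much lower loss than unrelated embeddings, which is the signal such an objective exploits.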
IEEE Access,
Journal Year: 2014, Volume and Issue: 2, P. 514 - 525
Published: Jan. 1, 2014
Deep learning is currently an extremely active research area in the machine learning and pattern recognition society. It has gained huge successes in a broad area of applications such as speech recognition, computer vision, and natural language processing. With the sheer size of data available today, big data brings opportunities and transformative potential for various sectors; on the other hand, it also presents unprecedented challenges to harnessing data and information. As the data keeps getting bigger, deep learning is coming to play a key role in providing big data predictive analytics solutions. In this paper, we provide a brief overview of deep learning, and highlight current research efforts and the challenges posed by big data, as well as future trends.
IEEE Transactions on Intelligent Transportation Systems,
Journal Year: 2014, Volume and Issue: 15(5), P. 2191 - 2201
Published: April 10, 2014
Traffic flow prediction is a fundamental problem in transportation modeling and management. Many existing approaches fail to provide favorable results due to being: 1) shallow in architecture; 2) hand engineered in features; and 3) separate in learning. In this paper we propose a deep architecture that consists of two parts, i.e., a deep belief network (DBN) at the bottom and a multitask regression layer at the top. A DBN is employed here for unsupervised feature learning. It can learn effective features for traffic flow prediction in an unsupervised fashion, which has been examined and found to be effective in many areas such as image and audio classification. To the best of our knowledge, this is the first paper that applies the deep learning approach to transportation research. To incorporate multitask learning (MTL) in the deep architecture, a multitask regression layer is used above the DBN for supervised prediction. We further investigate homogeneous MTL and heterogeneous MTL for traffic flow prediction. To take full advantage of weight sharing in the deep architecture, we propose a grouping method based on the weights in the top layer to make MTL more effective. Experiments on transportation data sets show good performance of the deep architecture. Abundant experiments show that our approach achieved close to 5% improvements over the state of the art. It is also presented that MTL can improve the generalization performance of shared tasks. These positive results demonstrate that deep learning and MTL are promising in transportation research.
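The division of labor described above, unsupervised feature learning below and a supervised multitask regression layer on top, can be illustrated as several related prediction tasks sharing one feature representation and being fit jointly. A hypothetical numpy sketch (the closed-form ridge fit stands in for the paper's trained top layer; names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

# Shared features for 100 road segments (a stand-in for DBN-learned features)
features = rng.normal(size=(100, 8))

# Three related regression tasks, e.g. flow at three nearby stations
true_w = rng.normal(size=(8, 3))
targets = features @ true_w + 0.01 * rng.normal(size=(100, 3))

# Multitask regression layer: one weight matrix, one column per task,
# fit jointly on the shared representation (ridge-regularized least squares)
lam = 1e-3
W = np.linalg.solve(features.T @ features + lam * np.eye(8),
                    features.T @ targets)

pred = features @ W
mse = np.mean((pred - targets) ** 2)
```

Because all task heads read the same shared features, statistical strength is pooled across tasks, which is the intuition behind the weight-sharing and grouping discussed in the abstract.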
IEEE Signal Processing Magazine,
Journal Year: 2017, Volume and Issue: 34(6), P. 96 - 108
Published: Nov. 1, 2017
The success of deep learning has been a catalyst to solving increasingly complex machine-learning problems, which often involve multiple data modalities. We review recent advances in deep multimodal learning and highlight the state of the art, as well as gaps and challenges in this active research field. We first classify deep multimodal learning architectures and then discuss methods to fuse learned multimodal representations in deep-learning architectures. We highlight two areas of research, namely regularization strategies and methods that learn or optimize multimodal fusion structures, as exciting areas for future work.
IEEE Journal of Selected Topics in Signal Processing,
Journal Year: 2017, Volume and Issue: 11(8), P. 1301 - 1309
Published: Oct. 18, 2017
Automatic affect recognition is a challenging task due to the various modalities emotions can be expressed with. Applications can be found in many domains including multimedia retrieval and human computer interaction. In recent years, deep neural networks have been used with great success in determining emotional states. Inspired by this success, we propose an emotion recognition system using auditory and visual modalities. To capture the emotional content for various styles of speaking, robust features need to be extracted. To this purpose, we utilize a Convolutional Neural Network (CNN) to extract features from the speech, while for the visual modality a deep residual network (ResNet) of 50 layers is used. In addition to the importance of feature extraction, a machine learning algorithm needs also to be insensitive to outliers while being able to model the context. To tackle this problem, Long Short-Term Memory (LSTM) networks are utilized. The system is then trained in an end-to-end fashion where, by also taking advantage of the correlations of each of the streams, we manage to significantly outperform traditional approaches based on handcrafted features for the prediction of spontaneous and natural emotions on the RECOLA database of the AVEC 2016 research challenge on emotion recognition.
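The LSTM stage mentioned above consumes a sequence of fused audio-visual feature vectors one step at a time, carrying context in its hidden and cell states. A minimal single-cell sketch in numpy, assuming a small illustrative feature size (not the paper's CNN/ResNet dimensions):

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One LSTM step over a fused audio-visual feature vector x."""
    z = W @ x + U @ h + b                      # stacked gate pre-activations
    H = h.size
    i = 1 / (1 + np.exp(-z[:H]))               # input gate
    f = 1 / (1 + np.exp(-z[H:2*H]))            # forget gate
    o = 1 / (1 + np.exp(-z[2*H:3*H]))          # output gate
    g = np.tanh(z[3*H:])                       # candidate cell state
    c = f * c + i * g                          # blend old memory with new input
    h = o * np.tanh(c)                         # expose gated memory as output
    return h, c

rng = np.random.default_rng(1)
feat_dim, hidden = 6, 4                        # e.g. concatenated CNN + ResNet features
W = rng.normal(scale=0.1, size=(4 * hidden, feat_dim))
U = rng.normal(scale=0.1, size=(4 * hidden, hidden))
b = np.zeros(4 * hidden)

h = np.zeros(hidden)
c = np.zeros(hidden)
for t in range(10):                            # a short sequence of fused frames
    x = rng.normal(size=feat_dim)
    h, c = lstm_step(x, h, c, W, U, b)
```

The gating keeps the hidden state bounded while letting the cell state accumulate longer-range context, which is what makes the unit robust to outlier frames.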
IEEE Transactions on Biomedical Engineering,
Journal Year: 2014, Volume and Issue: 62(4), P. 1132 - 1140
Published: Nov. 20, 2014
The accurate diagnosis of Alzheimer's disease (AD) is essential for patient care and will be increasingly important as disease modifying agents become available, early in the course of the disease. Although studies have applied machine learning methods for the computer-aided diagnosis of AD, a bottleneck in the diagnostic performance was shown in previous methods, due to the lack of efficient strategies for representing neuroimaging biomarkers. In this study, we designed a novel diagnostic framework with a deep learning architecture to aid the diagnosis of AD. This framework uses a zero-masking strategy for data fusion to extract complementary information from multiple data modalities. Compared to previous state-of-the-art workflows, our method is capable of fusing multimodal features in one setting and has the potential to require less labeled data. A performance gain was achieved in both binary classification and multiclass classification of AD. The advantages and limitations of the proposed framework are discussed.
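The zero-masking fusion strategy named above can be illustrated as follows: during training, one modality's features are randomly zeroed out before fusion, so the model is forced to exploit cross-modal redundancy. A minimal sketch under assumed feature sizes (the function and modality names are hypothetical, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(7)

def zero_mask_fuse(mri_feat, pet_feat, p_mask=0.3, train=True):
    """Concatenate modality features; during training, randomly zero one
    modality so downstream layers learn to compensate across modalities."""
    if train and rng.random() < p_mask:
        if rng.integers(2) == 0:
            mri_feat = np.zeros_like(mri_feat)   # drop the MRI view
        else:
            pet_feat = np.zeros_like(pet_feat)   # drop the PET view
    return np.concatenate([mri_feat, pet_feat])

# At inference time no masking is applied; both views pass through intact.
fused = zero_mask_fuse(np.ones(5), np.ones(3), train=False)
```

This is conceptually similar to denoising-autoencoder corruption applied at the modality level.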
IEEE Transactions on Pattern Analysis and Machine Intelligence,
Journal Year: 2016, Volume and Issue: 38(8), P. 1583 - 1597
Published: March 2, 2016
This paper describes a novel method called Deep Dynamic Neural Networks (DDNN) for multimodal gesture recognition. A semi-supervised hierarchical dynamic framework based on a Hidden Markov Model (HMM) is proposed for simultaneous gesture segmentation and recognition, where skeleton joint information, depth and RGB images are the multimodal input observations. Unlike most traditional approaches that rely on the construction of complex handcrafted features, our approach learns high-level spatio-temporal representations using deep neural networks suited to the input modality: a Gaussian-Bernouilli Deep Belief Network (DBN) to handle skeletal dynamics, and a 3D Convolutional Neural Network (3DCNN) to manage and fuse batches of depth and RGB images. This is achieved through the modeling and learning of the emission probabilities of the HMM required to infer the gesture sequence. This purely data driven approach achieves a Jaccard index score of 0.81 in the ChaLearn LAP gesture spotting challenge. The performance is on par with a variety of state-of-the-art hand-tuned feature-based approaches and other learning-based methods, therefore opening the door to the use of deep learning techniques in order to further explore multimodal time series data.
Proceedings of the IEEE,
Journal Year: 2015, Volume and Issue: 103(9), P. 1560 - 1584
Published: Aug. 13, 2015
Earth observation through remote sensing images allows the accurate characterization and identification of materials on the surface from space and airborne platforms. Multiple and heterogeneous image sources can be available for the same geographical region: multispectral, hyperspectral, radar, multitemporal, and multiangular images can today be acquired over a given scene. These sources can be combined/fused to improve the classification of the materials on the surface. Even if this type of systems is generally accurate, the field is about to face new challenges: the upcoming constellations of satellite sensors will acquire large amounts of images of different spatial, spectral, angular, and temporal resolutions. In this scenario, multimodal image fusion stands out as the appropriate framework to address these problems. In this paper, we provide a taxonomical view of the field and review the current methodologies for multimodal classification of remote sensing images. We also highlight the most recent advances, which exploit synergies with machine learning and signal processing: sparse methods, kernel-based fusion, Markov modeling, and manifold alignment. Then, we illustrate the approaches in seven challenging remote sensing applications: 1) multiresolution fusion of multispectral images; 2) image downscaling as a form of multitemporal image fusion and multidimensional interpolation among sensors of different resolutions; 3) multiangular image classification; 4) multisensor image fusion exploiting physically-based feature extractions; 5) multitemporal image classification of land covers in incomplete, inconsistent, and vague image sources; 6) spatiospectral multisensor fusion of optical and radar images for change detection; and 7) cross-sensor adaptation of classifiers. The adoption of these techniques in operational settings will help monitor our planet in the very near future.
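Among the fusion methodologies listed above, kernel-based fusion has a particularly compact form: one kernel is computed per modality and the kernels are combined, e.g. as a convex combination, before being handed to any kernel classifier. A small numpy sketch under assumed feature dimensions (the modality names and bandwidths are illustrative):

```python
import numpy as np

def rbf_kernel(X, gamma):
    """RBF (Gaussian) kernel matrix over the rows of X."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(3)
spectral = rng.normal(size=(20, 10))   # hypothetical hyperspectral features
radar = rng.normal(size=(20, 4))       # hypothetical SAR backscatter features

# Composite kernel: convex combination of per-modality kernels.
# Any convex combination of valid kernels is itself a valid kernel.
mu = 0.7
K = mu * rbf_kernel(spectral, 0.1) + (1 - mu) * rbf_kernel(radar, 0.5)
```

The resulting matrix stays symmetric positive semidefinite, so it can be plugged directly into an SVM or Gaussian-process classifier; the weight mu trades off the two modalities.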
IEEE Journal of Biomedical and Health Informatics,
Journal Year: 2018, Volume and Issue: 23(2), P. 538 - 546
Published: April 9, 2018
We propose a multitask deep convolutional neural network, trained on multimodal data (clinical and dermoscopic images, and patient metadata), to classify the 7-point melanoma checklist criteria and perform skin lesion diagnosis. Our network is trained using several loss functions, where each loss considers different combinations of the input modalities, which allows our model to be robust to missing data at inference time. Our final model classifies the 7-point checklist criteria and skin condition diagnosis, produces feature vectors suitable for image retrieval, and localizes clinically discriminant regions. We benchmark our approach on 1011 cases, and report comprehensive results over all criteria. We also make our dataset (images and metadata) publicly available online at http://derm.cs.sfu.ca.
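The idea of one loss per modality combination can be sketched as summing a cross-entropy term over every combination of available inputs; at inference, any subset of modalities still yields a prediction path the model was trained on. A hypothetical numpy sketch (averaging per-modality logits stands in for a learned fusion head; all names are illustrative):

```python
import numpy as np

def multimodal_losses(per_modality_logits, label, combos):
    """Sum one cross-entropy loss per modality combination.

    Averaging the logits of the modalities in a combination is a simple
    stand-in for a fused prediction head trained on that combination.
    """
    total = 0.0
    for combo in combos:
        logits = np.mean([per_modality_logits[m] for m in combo], axis=0)
        logits = logits - logits.max()                     # stability
        log_probs = logits - np.log(np.exp(logits).sum())  # log-softmax
        total += -log_probs[label]                         # cross-entropy
    return total

logits = {
    "clinical": np.array([2.0, 0.1, -1.0]),
    "dermoscopic": np.array([1.5, 0.3, -0.5]),
    "metadata": np.array([0.2, 0.1, 0.0]),
}
combos = [("clinical",), ("dermoscopic",),
          ("clinical", "dermoscopic"),
          ("clinical", "dermoscopic", "metadata")]
loss = multimodal_losses(logits, label=0, combos=combos)
```

Because single-modality combinations appear in the training objective, dropping a modality at inference time does not take the model off its training distribution.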
IEEE Transactions on Big Data,
Journal Year: 2015, Volume and Issue: 1(1), P. 16 - 34
Published: March 1, 2015
Traditional data mining usually deals with data from a single domain. In the big data era, we face a diversity of datasets from different sources in different domains. These datasets consist of multiple modalities, each of which has a different representation, distribution, scale, and density. How to unlock the power of knowledge from multiple disparate (but potentially connected) datasets is paramount in big data research, essentially distinguishing big data from traditional data mining tasks. This calls for advanced techniques that can fuse knowledge from various datasets organically in a machine learning task. This paper summarizes the data fusion methodologies, classifying them into three categories: stage-based, feature level-based, and semantic meaning-based methods. The last category of methods is further divided into four groups: multi-view learning-based, similarity-based, probabilistic dependency-based, and transfer learning-based methods. These methods focus on knowledge fusion rather than schema mapping and data merging, significantly distinguishing between cross-domain data fusion and the data fusion studied in the database community. This paper does not only introduce high-level principles of each category of methods, but also gives examples in which these methods are used to handle real problems. In addition, this paper positions existing works in a framework, exploring the relationship and difference between them. This paper will help a wide range of communities find a solution for the data fusion problems in their projects.
IEEE Robotics and Automation Letters,
Journal Year: 2018, Volume and Issue: 3(4), P. 3355 - 3362
Published: July 4, 2018
A deep learning architecture is proposed to predict graspable locations for robotic manipulation. It considers situations where no, one, or multiple object(s) are seen. By defining the learning problem to be classification with null hypothesis competition instead of regression, the deep neural network with red, green, blue and depth (RGB-D) image input predicts grasp candidates for a single object or multiple objects, in a single shot. The method outperforms state-of-the-art approaches on the Cornell dataset with 96.0% and 96.1% accuracy on imagewise and object-wise splits, respectively. Evaluation on a multiobject dataset illustrates the generalization capability of the architecture. Grasping experiments achieve accurate grasp localization and 89.0% grasping success rates on a test set of household objects. The real-time process takes less than 0.25 s from image to plan.
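The null hypothesis competition idea above can be illustrated as a softmax over discrete grasp-orientation classes plus one extra "no grasp" class that competes with them: if the null class wins, no grasp is proposed at that location. A minimal sketch (the 18-bin angle discretization and the function name are assumptions for illustration, not the paper's exact design):

```python
import numpy as np

def pick_grasp(orientation_scores):
    """Choose among N grasp-orientation classes plus a final 'no grasp'
    (null hypothesis) class; returns None when the null class wins."""
    probs = np.exp(orientation_scores - orientation_scores.max())
    probs /= probs.sum()                      # softmax over all classes
    best = int(np.argmax(probs))
    null_idx = len(orientation_scores) - 1    # last class is the null hypothesis
    return None if best == null_idx else best

# 18 angle bins + 1 null class, as an illustrative discretization
scores_grasp = np.zeros(19)
scores_grasp[4] = 3.0                         # one angle bin dominates
scores_empty = np.zeros(19)
scores_empty[18] = 3.0                        # the null class dominates
a = pick_grasp(scores_grasp)
b = pick_grasp(scores_empty)
```

Framing detection this way lets a single forward pass both accept and reject grasp candidates, handling the no-object and multi-object cases uniformly.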