Interpretable Hyperspectral Artificial Intelligence: When nonconvex modeling meets hyperspectral remote sensing
Danfeng Hong, Wei He, Naoto Yokoya, et al.

IEEE Geoscience and Remote Sensing Magazine, Journal Year: 2021, Volume and Issue: 9(2), P. 52 - 87

Published: April 6, 2021

Hyperspectral (HS) imaging, also known as image spectrometry, is a landmark technique in geoscience and remote sensing (RS). In the past decade, enormous efforts have been made to process and analyze these HS products, mainly by seasoned experts. However, with an ever-growing volume of data, the bulk costs in manpower and material resources pose new challenges for reducing the burden of manual labor and improving efficiency. For this reason, it is urgent that more intelligent and automatic approaches for various RS applications be developed. Machine learning (ML) tools with convex optimization have successfully undertaken the tasks of numerous artificial intelligence (AI)-related applications; however, their ability to handle complex practical problems remains limited, particularly due to the effects of spectral variabilities in imaging and the complexity and redundancy of higher-dimensional signals. Compared to convex models, nonconvex modeling, which is capable of characterizing more complex real scenes and providing model interpretability technically and theoretically, has been proven a feasible solution that reduces the gap between challenging vision tasks and currently advanced data processing models.

Language: English

Citations

216

An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation
Daniel Michelsanti, Zheng‐Hua Tan, Shixiong Zhang, et al.

IEEE/ACM Transactions on Audio Speech and Language Processing, Journal Year: 2021, Volume and Issue: 29, P. 1368 - 1396

Published: Jan. 1, 2021

Speech enhancement and speech separation are two related tasks, whose purpose is to extract either one or more target speech signals, respectively, from a mixture of sounds generated by several sources. Traditionally, these tasks have been tackled using signal processing and machine learning techniques applied to the available acoustic signals. Since the visual aspect of speech is essentially unaffected by the acoustic environment, visual information from the target speakers, such as lip movements and facial expressions, has also been used for such systems. In order to efficiently fuse acoustic and visual information, researchers have exploited the flexibility of data-driven approaches, specifically deep learning, achieving strong performance. The ceaseless proposal of a large number of techniques to extract features and fuse multimodal information has highlighted the need for an overview that comprehensively describes and discusses audio-visual speech enhancement and separation based on deep learning. In this paper, we provide a systematic survey of this research topic, focusing on the main elements that characterise the systems in the literature: audio and visual features; deep learning methods; fusion techniques; and training targets and objective functions. In addition, we review deep-learning-based methods for speech reconstruction from silent videos and audio-visual sound source separation for non-speech signals, since these can be more or less directly applied to audio-visual speech enhancement and separation. Finally, we survey commonly employed audio-visual speech datasets, given their central role in the development and evaluation of these methods, because they are generally used to compare different approaches and determine their performance.

Language: English

Citations

207

Deep Learning of Transferable Representation for Scalable Domain Adaptation
Mingsheng Long, Jianmin Wang, Yue Cao, et al.

IEEE Transactions on Knowledge and Data Engineering, Journal Year: 2016, Volume and Issue: 28(8), P. 2027 - 2040

Published: April 14, 2016

Domain adaptation generalizes a learning model across a source domain and a target domain that are sampled from different distributions. It is widely applied to cross-domain data mining for reusing labeled information and mitigating labeling consumption. Recent studies reveal that deep neural networks can learn abstract feature representations, which can reduce, but not remove, the cross-domain discrepancy. To enhance the invariance of deep representations and make them more transferable across domains, we propose a unified framework that jointly learns transferable representations and classifiers to enable scalable domain adaptation, taking advantage of both deep learning and optimal two-sample matching. The framework constitutes two inter-dependent paradigms: unsupervised pre-training for effective training of deep models using deep denoising autoencoders, and supervised fine-tuning for effective exploitation of discriminative information using deep neural networks, with the learned embedding representations optimally matched across domains in reproducing kernel Hilbert spaces (RKHSs). For scalable learning, we develop a linear-time algorithm for an unbiased estimate of the distribution discrepancy that scales linearly to large samples. Extensive empirical results show that the proposed framework significantly outperforms state-of-the-art methods on diverse adaptation tasks: sentiment polarity prediction, email spam filtering, newsgroup content categorization, and visual object recognition.

Language: English

Citations

176

Making Sense of Vision and Touch: Learning Multimodal Representations for Contact-Rich Tasks
Michelle A. Lee, Yuke Zhu, Peter Zachares, et al.

IEEE Transactions on Robotics, Journal Year: 2020, Volume and Issue: 36(3), P. 582 - 596

Published: March 20, 2020

Contact-rich manipulation tasks in unstructured environments often require both haptic and visual feedback. It is nontrivial to manually design a robot controller that combines these modalities, which have very different characteristics. While deep reinforcement learning has shown success in learning control policies for high-dimensional inputs, these algorithms are generally intractable to train directly on real robots due to sample complexity. In this article, we use self-supervision to learn a compact multimodal representation of our sensory inputs, which can then be used to improve the sample efficiency of policy learning. Evaluating our method on a peg insertion task, we show that it generalizes over varying geometries, configurations, and clearances, while being robust to external perturbations. We also systematically study different self-supervised objectives and representation architectures. Results are presented both in simulation and on a physical robot.

Language: English

Citations

161

UniMSE: Towards Unified Multimodal Sentiment Analysis and Emotion Recognition
Guimin Hu, Ting-En Lin, Yi Zhao, et al.

Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Journal Year: 2022, Volume and Issue: unknown

Published: Jan. 1, 2022

Multimodal sentiment analysis (MSA) and emotion recognition in conversation (ERC) are key research topics for computers to understand human behaviors. From a psychological perspective, emotions are the expression of affect or feelings during a short period, while sentiments are formed and held over a longer period. However, most existing works study sentiment and emotion separately and do not fully exploit the complementary knowledge behind the two. In this paper, we propose a multimodal sentiment knowledge-sharing framework (UniMSE) that unifies MSA and ERC tasks from features, labels, and models. We perform modality fusion at the syntactic and semantic levels and introduce contrastive learning between modalities and samples to better capture the difference and consistency between sentiments and emotions. Experiments on four public benchmark datasets, MOSI, MOSEI, MELD, and IEMOCAP, demonstrate the effectiveness of the proposed method and achieve consistent improvements compared with state-of-the-art methods.

Language: English

Citations

95