Biomedical Signal Processing and Control, Journal Year: 2025, Volume and Issue: 107, P. 107811 - 107811
Published: March 11, 2025
Language: English
Citations
0
Circuits Systems and Signal Processing, Journal Year: 2025, Volume and Issue: unknown
Published: March 10, 2025
Language: English
Citations
0
Procedia Computer Science, Journal Year: 2025, Volume and Issue: 258, P. 1425 - 1434
Published: Jan. 1, 2025
Language: English
Citations
0
2022 International Conference on Communication, Computing and Internet of Things (IC3IoT), Journal Year: 2024, Volume and Issue: unknown, P. 1 - 4
Published: April 17, 2024
This study explores audio emotion classification using a method based on log-Mel spectrograms with data augmentation. The approach reaches a respectable accuracy of 63%, which is significantly lower than an MFCC-based counterpart, but it also shows clear potential for further development. The method uses a 2D CNN resembling the VGG19 architecture, in which the extracted features are treated as a 30x216-pixel image. Across datasets such as TESS, CREMA-D, SAVEE, and RAVDESS, specialised functions are used to extract a wide range of characteristics, and the importance of promptly identifying emotions from auditory cues is underscored. The model demonstrates encouraging outcomes and the capacity to exceed conventional approaches, and a dedicated class holds the results to facilitate faster interpretation. The work not only contributes to the expanding area of audio emotion classification but also creates opportunities for effective and precise identification, bridging the divide between image-based methods and audio analysis. (A sketch of the log-Mel feature pipeline follows this entry.)
Language: English
Citations
1
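As an illustration of the feature pipeline this abstract describes, the following is a minimal sketch of how one clip could be converted into a 30x216 log-Mel "image" for a VGG19-style 2D CNN. The sample rate, hop settings, and padding strategy are assumptions, not the paper's published configuration.

```python
# Minimal sketch: one utterance -> a fixed-size log-Mel "image" for a 2D CNN.
# n_mels=30 and target_frames=216 follow the 30x216 shape in the abstract;
# the sample rate and padding strategy are assumptions.
import librosa
import numpy as np

def log_mel_image(wav_path: str, n_mels: int = 30, target_frames: int = 216) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=22050)        # assumed sample rate
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)  # log scaling of the Mel power spectrum
    # Pad (with the spectrogram's floor value) or crop along time so every
    # clip maps to the same n_mels x target_frames image.
    if log_mel.shape[1] < target_frames:
        pad = target_frames - log_mel.shape[1]
        log_mel = np.pad(log_mel, ((0, 0), (0, pad)), constant_values=log_mel.min())
    return log_mel[:, :target_frames]
```

The resulting array can be fed to a VGG19-like CNN as a single-channel image; the augmentation the abstract mentions would typically be applied to the spectrogram before this resizing step.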
IEEE Access, Journal Year: 2024, Volume and Issue: 12, P. 128039 - 128048
Published: Jan. 1, 2024
In this paper, we propose a method to improve the accuracy of speech emotion recognition (SER) by using a vision transformer (ViT) to attend to the correlation of frequency (y-axis) with time (x-axis) in the spectrogram, and by transferring positional information between ViTs through knowledge transfer. The proposed method has the following originality: i) We use vertically segmented patches of the log-Mel spectrogram to analyze frequencies over time. This type of patch allows us to correlate the most relevant frequencies with the particular time at which they were uttered. ii) We use image coordinate encoding, an absolute positional encoding suitable for the ViT. By normalizing the x, y coordinates of the image to between -1 and 1 and concatenating them to the image, we can effectively provide valid positional information to the ViT. iii) Through feature map matching, the locality and location information of the teacher network is transmitted to the student network. The teacher network contains a convolutional stem and position encoding structure, while the student network lacks these basic structures. In the feature map matching stage, we train with the mean absolute error (L1 loss) to minimize the difference between the feature maps of the two networks. To validate the method, three datasets (SAVEE, EmoDB, CREMA-D) consisting of speech were converted into spectrograms for comparison experiments. The experimental results show that the proposed method significantly outperforms state-of-the-art methods in terms of weighted accuracy while requiring fewer floating point operations (FLOPs). Moreover, the performance of the student network is better than that of the teacher network, indicating that the introduction of the L1 loss solves the overfitting problem. Overall, the method offers a promising solution for SER, providing improved efficiency and performance. (A sketch of the coordinate encoding and vertical patching follows this entry.)
Language: English
Citations
1
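The abstract's first two ideas lend themselves to a compact sketch: image coordinate encoding that normalizes x, y coordinates to [-1, 1] and concatenates them to the input, and full-height vertical patches over the log-Mel spectrogram. The tensor shapes, patch width, and function names below are assumptions for illustration, not the paper's code.

```python
# Sketch of image coordinate encoding and vertical (full-height) patching
# for a ViT over log-Mel spectrograms. Shapes and patch width are assumptions.
import torch

def add_coordinate_channels(spec: torch.Tensor) -> torch.Tensor:
    # spec: (B, 1, F, T) log-Mel batch -> (B, 3, F, T) with x/y channels in [-1, 1].
    b, _, f, t = spec.shape
    ys = torch.linspace(-1.0, 1.0, f).view(1, 1, f, 1).expand(b, 1, f, t)
    xs = torch.linspace(-1.0, 1.0, t).view(1, 1, 1, t).expand(b, 1, f, t)
    return torch.cat([spec, xs, ys], dim=1)

def vertical_patches(spec: torch.Tensor, patch_w: int = 4) -> torch.Tensor:
    # Split (B, C, F, T) into full-height vertical slices so each token spans
    # all frequencies at a narrow time window: returns (B, T//patch_w, C*F*patch_w).
    b, c, f, t = spec.shape
    spec = spec[..., : (t // patch_w) * patch_w]    # drop ragged tail frames
    patches = spec.unfold(3, patch_w, patch_w)      # (B, C, F, N, patch_w)
    return patches.permute(0, 3, 1, 2, 4).reshape(b, -1, c * f * patch_w)
```

Each token then covers every frequency bin within one short time span, matching the abstract's goal of correlating the most relevant frequencies with the time at which they occur; the teacher-student L1 feature-map matching would operate on representations built from this tokenization.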
IEEE Access, Journal Year: 2024, Volume and Issue: 12, P. 130228 - 130240
Published: Jan. 1, 2024
Recognizing emotional states from speech is essential for human-computer interaction, yet it is a challenging task to realize effective speech emotion recognition (SER) on platforms with limited memory capacity and computing power. In this paper, we propose a lightweight multi-scale deep neural network architecture for SER, which takes Mel Frequency Cepstral Coefficients (MFCCs) as input. For feature extraction, we design a new Inception module, named A_Inception. A_Inception combines the merits of the Inception module and attention-based rectified linear units (AReLU), and thus can learn features adaptively at low computational cost. Meanwhile, to extract the most important information, we propose a multiscale cepstral attention and temporal-cepstral attention (MCA-TCA) module, whose idea is to focus on key feature components and positions. Furthermore, a loss function combining Softmax loss and Center loss is adopted to supervise the model training so as to enhance the model's discriminative ability. Experiments have been carried out on the IEMOCAP, EMODB and SAVEE datasets to verify the performance of the proposed model and compare it with state-of-the-art SER models. Numerical results reveal that the proposed model has a small number of parameters (0.82M) and much lower computational cost (81.64 MFLOPs) than the compared models, yet achieves impressive accuracy on all datasets considered. (A sketch of a multi-scale block follows this entry.)
Language: English
Citations
0
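To make the multi-scale idea concrete, here is a minimal Inception-style block over MFCC input in the spirit of the A_Inception module described above. The branch count, kernel sizes, and the plain ReLU used in place of the paper's attention-based AReLU are assumptions.

```python
# Sketch of a multi-scale Inception-style block over MFCC "images".
# Plain ReLU stands in for the paper's AReLU; widths/kernels are assumptions.
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    def __init__(self, in_ch: int = 1, branch_ch: int = 16):
        super().__init__()
        # Parallel branches with different receptive fields over (cepstral, time);
        # odd kernels with padding k // 2 keep the spatial size unchanged.
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(in_ch, branch_ch, k, padding=k // 2), nn.ReLU())
            for k in (1, 3, 5)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, in_ch, n_mfcc, frames) -> (B, 3 * branch_ch, n_mfcc, frames).
        return torch.cat([branch(x) for branch in self.branches], dim=1)
```

A full model along the abstract's lines would stack such blocks, add the MCA-TCA attention over the cepstral and temporal axes, and train with the combined Softmax and Center loss; at roughly 0.82M total parameters, the emphasis is on keeping each branch narrow.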
Published: Aug. 8, 2024
Language: English
Citations
0
Speech Communication, Journal Year: 2024, Volume and Issue: 166, P. 103148 - 103148
Published: Nov. 14, 2024
Language: English
Citations
0
Applied Acoustics, Journal Year: 2024, Volume and Issue: 229, P. 110403 - 110403
Published: Nov. 15, 2024
Language: English
Citations
0