Cited by 3D Visual Grounding-Audio: 3D scene object detection based on audio

Fine-Tune the Pretrained ATST Model for Sound Event Detection

Nian Shao, Xian Li, Xiaofei Li

et al.

ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Journal Year: 2024, Volume and Issue: unknown, P. 911 - 915

Published: March 18, 2024

Sound event detection (SED) often suffers from the data deficiency problem. Recent SED systems leverage large pretrained self-supervised learning (SelfSL) models to mitigate this restriction, where the pretrained models help produce more discriminative features for SED. However, the pretrained models are regarded as a frozen feature extractor in most systems, and fine-tuning of the pretrained models has rarely been studied. In this work, we study a fine-tuning method for pretrained models in SED. We introduce the frame-level audio teacher-student transformer model (ATST-Frame), our newly proposed SelfSL model, to the SED system. ATST-Frame was especially designed to learn frame-level representations of audio signals and obtained state-of-the-art (SOTA) performances on a series of downstream tasks. We then propose a fine-tuning method for ATST-Frame using both (in-domain) unlabelled and labelled SED data. Our experiments show that the proposed method overcomes the overfitting problem when fine-tuning the large pre-trained network, and our SED system obtains new SOTA results of 0.587/0.812 PSDS1/PSDS2 on the DCASE challenge task 4 dataset.

Language: English

Citations

Transformer Models improve the acoustic recognition of buzz-pollinating bee species

Alef Iury Ferreira, Nádia Félix Felipe da Silva, Fernanda Neiva Mesquita

et al.

Ecological Informatics, Journal Year: 2025, Volume and Issue: unknown, P. 103010 - 103010

Published: Jan. 1, 2025

Language: English

Citations

ASiT-CRNN: A Method for Sound Event Detection with Fine-Tuning of Self-Supervised Pre-Trained ASiT-Based Model

Yueyang Zheng, Ruikun Zhang, Sara Atito

et al.

Digital Signal Processing, Journal Year: 2025, Volume and Issue: unknown, P. 105055 - 105055

Published: Feb. 1, 2025

Language: English

Citations

Acoustic Event Detection in Vehicles: A Multi-Label Classification Approach

A. Joseph Antony, Wolfgang Theimer, G. Grossetti

et al.

Sensors, Journal Year: 2025, Volume and Issue: 25(8), P. 2591 - 2591

Published: April 19, 2025

Autonomous driving technologies for environmental perception are mostly based on visual cues obtained from sensors like cameras, RADAR, or LiDAR. They capture the environment as if seen through “human eyes”. If this visual information is complemented with auditory information, thereby also providing “ears”, driverless cars can become more reliable and safer. In this paper, an Acoustic Event Detection model is presented that can detect various acoustic events in an automotive context along with their time of occurrence to create an audio scene description. The proposed detection methodology uses the pre-trained network Bidirectional Encoder representation from Audio Transformers (BEATs) and a single-layer neural network trained on a database of real audio recordings collected from different cars. The performance of the model is evaluated for different parameters and datasets. The segment-based results for a duration of 1 s show that the model performs well for 11 sound classes, with a mean accuracy of 0.93 and a mean F1-Score of 0.39 for a confidence threshold of 0.5. The threshold-independent metric mAP has a value of 0.77. For sound mixtures containing two overlapping events, the mean accuracy, F1-Score, and mAP are equal to 0.89, 0.42, and 0.658, respectively.
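The gap between the reported accuracy and F1-Score is typical of multi-label, segment-based evaluation, where most (segment, class) cells are inactive and count as easy true negatives. The following is a minimal sketch of that effect, not the paper's evaluation code: the function name `segment_metrics`, the micro-averaging over all cells, and the toy scores are assumptions made for illustration.

```python
def segment_metrics(scores, labels, threshold=0.5):
    """Micro-averaged accuracy and F1 over (segment, class) cells.

    scores: list of segments, each a list of per-class confidences.
    labels: same shape, with 1 for an active class and 0 otherwise.
    Hypothetical helper; the paper may average per class instead.
    """
    tp = fp = fn = tn = 0
    for seg_scores, seg_labels in zip(scores, labels):
        for score, label in zip(seg_scores, seg_labels):
            predicted = score >= threshold
            if predicted and label:
                tp += 1
            elif predicted and not label:
                fp += 1
            elif not predicted and label:
                fn += 1
            else:
                tn += 1
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Toy data: 4 one-second segments, 3 sound classes (made-up values).
scores = [[0.9, 0.1, 0.2], [0.2, 0.1, 0.1], [0.1, 0.6, 0.1], [0.1, 0.1, 0.1]]
labels = [[1, 0, 0], [1, 0, 0], [0, 0, 0], [0, 0, 1]]
accuracy, f1 = segment_metrics(scores, labels, threshold=0.5)
print(accuracy, f1)
```

In this toy case the many true negatives lift accuracy well above F1, which only counts how well the active events are recovered.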

Language: English

Citations

Sound Activity-Aware Based Cross-Task Collaborative Training for Semi-Supervised Sound Event Detection

Yadong Guan, Jiqing Han, Hongwei Song

et al.

IEEE/ACM Transactions on Audio Speech and Language Processing, Journal Year: 2024, Volume and Issue: 32, P. 3947 - 3959

Published: Jan. 1, 2024

Language: English

Citations

Advancing Multi-grained Alignment for Contrastive Language-Audio Pre-training

Yiming Li, Zhi‐Fang Guo, Xiangdong Wang

et al.

Published: Oct. 26, 2024

Recent advances have been witnessed in audio-language joint learning, such as CLAP, that shows much success in multi-modal understanding tasks. These models usually aggregate uni-modal local representations, namely frame or word features, into global ones, on which the contrastive loss is employed to reach coarse-grained cross-modal alignment. However, frame-level correspondence with texts may be ignored, making it ill-posed on explainability and fine-grained challenges, which may also undermine performances on coarse-grained tasks. In this work, we aim to improve both coarse- and fine-grained audio-language alignment in large-scale contrastive pre-training. To unify the granularity and latent distribution of the two modalities, a shared codebook is adopted to represent multi-modal global features with common bases, and each codeword is regularized to encode modality-shared semantics, bridging the gap between frame and word features. Based on it, a locality-aware block is involved to purify local patterns, and a hard-negative guided contrastive loss is devised to boost alignment. Experiments on eleven zero-shot coarse- and fine-grained tasks suggest that our model not only surpasses the baseline CLAP significantly but also yields superior or competitive results compared to current SOTA works.
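For orientation, the coarse-grained alignment that CLAP-style models start from can be sketched as a symmetric contrastive loss over global audio and text embeddings. This is a generic illustration in plain Python, not the codebook or hard-negative guided loss proposed in the paper; the function names, the temperature value of 0.07, and the toy embeddings are assumptions.

```python
import math


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct column is i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)
        log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(audio, text, temperature=0.07):
    """Contrastive loss over paired global embeddings (hypothetical helper).

    audio[i] and text[i] are assumed to be a matched, non-zero pair.
    """
    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    a = [normalize(v) for v in audio]
    t = [normalize(v) for v in text]
    # Cosine similarity of every audio clip with every caption, scaled.
    logits = [[sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
              for ai in a]
    logits_transposed = [list(col) for col in zip(*logits)]
    # Average the audio-to-text and text-to-audio directions.
    return 0.5 * (_diagonal_cross_entropy(logits)
                  + _diagonal_cross_entropy(logits_transposed))


# Toy embeddings: matched pairs versus deliberately swapped captions.
audio = [[1.0, 0.0], [0.0, 1.0]]
matched = symmetric_contrastive_loss(audio, [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_contrastive_loss(audio, [[0.0, 1.0], [1.0, 0.0]])
print(matched, swapped)
```

Because only one global vector per clip and per caption enters this loss, nothing in it ties individual frames to individual words, which is the limitation the abstract describes.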

Language: English

Citations

3D Visual Grounding-Audio: 3D scene object detection based on audio

Can Zhang, Zeyu Cai, Xunhao Chen

et al.

Neurocomputing, Journal Year: 2024, Volume and Issue: unknown, P. 128637 - 128637

Published: Sept. 1, 2024

Language: English

Citations