Self-Supervised Audio Teacher-Student Transformer for Both Clip-Level and Frame-Level Tasks
Xian Wei Li, Nian Shao, Xiaofei Li

et al.

IEEE/ACM Transactions on Audio Speech and Language Processing, Journal Year: 2024, Volume and Issue: 32, P. 1336 - 1351

Published: Jan. 1, 2024

Self-supervised learning (SSL) has emerged as a popular approach for learning audio representations. One goal of self-supervised pre-training is to transfer knowledge to downstream tasks, generally including clip-level and frame-level tasks. While frame-level tasks are important for fine-grained acoustic scene/event understanding, prior studies primarily evaluate on clip-level tasks. In order to tackle both, this paper proposes the Audio Teacher-Student Transformer (ATST), with a clip-level version (named ATST-Clip) and a frame-level version (named ATST-Frame), responsible for learning clip-level and frame-level representations, respectively. Both methods use a Transformer encoder and a teacher-student training scheme. We have carefully designed the view creation strategy for ATST-Clip and ATST-Frame: specifically, ATST-Clip uses segment-wise data augmentations, while ATST-Frame integrates frame-wise data augmentations and masking. Experimental results show that our model obtains state-of-the-art (SOTA) performance on most of the downstream tasks. Especially, it outperforms other models by a large margin on the sound event detection task. In addition, performance can be further improved by combining the two models through knowledge distillation.
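The teacher-student scheme mentioned in the abstract typically pairs a gradient-trained student encoder with an exponential-moving-average (EMA) teacher. The sketch below illustrates that general pattern only; the module structure and the `ema_decay` value are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch (not the authors' code) of a teacher-student scheme:
# the student is trained by gradient descent, while the teacher is an
# exponential moving average (EMA) of the student's weights.
import copy
import torch
import torch.nn as nn

class TeacherStudent(nn.Module):
    def __init__(self, encoder: nn.Module, ema_decay: float = 0.999):
        super().__init__()
        self.student = encoder
        self.teacher = copy.deepcopy(encoder)   # teacher starts as a copy
        for p in self.teacher.parameters():
            p.requires_grad = False             # teacher is never trained directly
        self.ema_decay = ema_decay

    @torch.no_grad()
    def update_teacher(self):
        # EMA update: teacher <- decay * teacher + (1 - decay) * student
        for t, s in zip(self.teacher.parameters(), self.student.parameters()):
            t.mul_(self.ema_decay).add_(s, alpha=1.0 - self.ema_decay)

    def forward(self, view_student: torch.Tensor, view_teacher: torch.Tensor):
        # Two differently augmented views of the same clip/frames
        online = self.student(view_student)
        with torch.no_grad():
            target = self.teacher(view_teacher)
        return online, target
```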

Language: English

Citations

7

What’s all the Fuss about Free Universal Sound Separation Data?
Scott Wisdom, Hakan Erdoğan, Daniel P. W. Ellis

et al.

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Journal Year: 2021, Volume and Issue: unknown, P. 186 - 190

Published: May 13, 2021

We introduce the Free Universal Sound Separation (FUSS) dataset, a new corpus for experiments in separating mixtures of an unknown number of sounds from an open domain of sound types. The dataset consists of 23 hours of single-source audio data drawn from 357 classes, which are used to create mixtures of one to four sources. To simulate reverberation, an acoustic room simulator is used to generate impulse responses of box-shaped rooms with frequency-dependent reflective walls. Additional open-source augmentation tools are also provided to produce different combinations of sources and room simulations. Finally, we provide a baseline separation model, based on an improved time-domain convolutional network (TDCN++), that can separate a variable number of sources in a mixture. This model achieves 9.8 dB of scale-invariant signal-to-noise ratio improvement (SI-SNRi) on mixtures with two to four sources, while reconstructing single-source inputs with 35.8 dB absolute SI-SNR. We hope this dataset will lower the barrier to new research and allow fast iteration and application of novel techniques from other machine learning domains to the sound separation challenge.
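The 9.8 dB figure refers to scale-invariant signal-to-noise ratio improvement. The sketch below follows the standard SI-SNR/SI-SNRi definitions; it is the common textbook formulation, not the FUSS evaluation code itself.

```python
# Standard scale-invariant SNR (SI-SNR) and its improvement over the mixture.
import numpy as np

def si_snr(estimate: np.ndarray, reference: np.ndarray, eps: float = 1e-8) -> float:
    """SI-SNR in dB between an estimated and a reference signal (1-D arrays)."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to remove the scale ambiguity.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return 10.0 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))

def si_snr_improvement(estimate: np.ndarray, reference: np.ndarray, mixture: np.ndarray) -> float:
    """SI-SNRi: improvement of the separated estimate over the unprocessed mixture."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)
```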

Language: English

Citations

41

Unsupervised Contrastive Learning of Sound Event Representations
Eduardo Fonseca, Diego Ortego, Kevin McGuinness

et al.

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Journal Year: 2021, Volume and Issue: unknown

Published: May 13, 2021

Self-supervised representation learning can mitigate the limitations in recognition tasks with few manually labeled data but abundant unlabeled data—a common scenario in sound event research. In this work, we explore unsupervised contrastive learning as a way to learn sound event representations. To this end, we propose to use the pretext task of contrasting differently augmented views of sound events. The views are computed primarily via mixing of training examples with unrelated backgrounds, followed by other data augmentations. We analyze the main components of our method via ablation experiments. We evaluate the learned representations using linear evaluation, and on two in-domain downstream sound event classification tasks, namely, using limited manually labeled data and using noisy labeled data. Our results suggest that unsupervised contrastive pre-training can mitigate the impact of data scarcity and increase robustness against noisy labels.
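The pretext task contrasts two augmented views of the same example, created primarily by mixing it with unrelated background clips. A minimal sketch of such view creation follows; the mixing weight `alpha` and the peak renormalization are illustrative assumptions rather than the authors' exact recipe.

```python
# Illustrative "mix with an unrelated background" view creation for
# contrastive pre-training on sound events (not the authors' implementation).
import numpy as np

def mix_back(example: np.ndarray, background: np.ndarray, alpha: float = 0.3) -> np.ndarray:
    """One augmented view: add a scaled, unrelated background clip to the example."""
    n = min(len(example), len(background))
    mixed = example[:n] + alpha * background[:n]
    # Rescale so the mixed view keeps roughly the original peak level.
    peak = np.max(np.abs(mixed)) + 1e-8
    return mixed / peak * (np.max(np.abs(example[:n])) + 1e-8)

def make_views(example: np.ndarray, backgrounds: list, rng=None):
    """Two differently augmented views of the same example for contrastive training."""
    rng = rng or np.random.default_rng()
    i, j = rng.choice(len(backgrounds), size=2, replace=False)
    return mix_back(example, backgrounds[i]), mix_back(example, backgrounds[j])
```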

Language: English

Citations

33

BYOL for Audio: Exploring Pre-Trained General-Purpose Audio Representations
Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi

et al.

IEEE/ACM Transactions on Audio Speech and Language Processing, Journal Year: 2022, Volume and Issue: 31, P. 137 - 151

Published: Nov. 10, 2022

Pre-trained models are essential as feature extractors in modern machine learning systems in various domains. In this study, we hypothesize that representations effective for general audio tasks should provide multiple aspects of robust features of the input sound. For recognizing sounds regardless of perturbations such as varying pitch or timbre, representations should be robust to these perturbations. For serving the diverse needs of tasks such as recognition of emotions or music genres, representations should provide multiple aspects of information, such as local and global features. To implement our principle, we propose a self-supervised learning method: Bootstrap Your Own Latent (BYOL) for Audio (BYOL-A, pronounced “viola”). BYOL-A pre-trains representations of the input sound invariant to audio data augmentations, which makes the learned representations robust to the perturbations of sounds. In addition, the BYOL-A encoder combines local and global features and calculates their statistics to make the representation provide multi-aspect information. As a result, the learned representations provide robust and multi-aspect information that can serve diverse tasks. We evaluated the general audio task performance of BYOL-A and compared it with previous state-of-the-art methods; it demonstrated generalizability with the best average result of 72.4% and the best VoxCeleb1 result of 57.6%. Extensive ablation experiments revealed that the encoder architecture contributes most to the performance, while the final critical portion resorts to the BYOL framework and the augmentations. Our code is available online for future studies.
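The abstract's point about combining local and global features by "calculating their statistics" can be illustrated with a simple statistics-pooling step that turns frame-level features into a clip-level embedding. The pooling choice (mean plus max over time) and the tensor layout below are assumptions made for illustration, not the paper's exact encoder.

```python
# Hedged sketch of statistics pooling: frame-level (local) features are
# summarized into a clip-level (global) representation by concatenating
# temporal mean and max statistics.
import torch

def statistics_pooling(frame_features: torch.Tensor) -> torch.Tensor:
    """frame_features: (batch, time, dim) -> clip embedding of shape (batch, 2 * dim)."""
    mean = frame_features.mean(dim=1)      # global average over time
    mx, _ = frame_features.max(dim=1)      # strongest local activation per dimension
    return torch.cat([mean, mx], dim=1)    # multi-aspect: global + local information
```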

Language: English

Citations

26

Strong Labeling of Sound Events Using Crowdsourced Weak Labels and Annotator Competence Estimation
Irene Martín-Morató, Annamaria Mesaros

IEEE/ACM Transactions on Audio Speech and Language Processing, Journal Year: 2023, Volume and Issue: 31, P. 902 - 914

Published: Jan. 1, 2023

Crowdsourcing is a popular tool for collecting large amounts of annotated data, but the specific format of strong labels necessary for sound event detection is not easily obtainable through crowdsourcing. In this work, we propose a novel annotation workflow that leverages the efficiency of crowdsourcing weak labels, and uses a high number of annotators to produce reliable and objective strong labels. The weak labels are collected in a highly redundant setup that allows reconstruction of the temporal information. To obtain reliable labels, the annotators' competence is estimated using MACE (Multi-Annotator Competence Estimation) and incorporated into the label estimation by weighing the individual opinions. We show that the proposed method produces consistently reliable annotations not only for synthetic audio mixtures, but also for recordings of real everyday environments. While a maximum of 80% coincidence with the complete and correct reference was obtained on the synthetic data, these results are explained by an extended study of how polyphony and SNR levels affect the identification rate of events by annotators. On real data, even though the coincidence is significantly lower, under 69%, the majority-opinion approach over the aggregated weak labels compares favorably with the more difficult task of annotating strong labels directly.
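The workflow aggregates many redundant weak (segment-level) opinions, weighted by each annotator's estimated competence, into a frame-level activity estimate. The sketch below shows one plausible form of such competence-weighted voting; the data layout and the 0.5 threshold are illustrative assumptions, not the authors' implementation.

```python
# Competence-weighted aggregation of redundant weak labels into a
# frame-level (strong) activity curve for one sound class.
import numpy as np

def aggregate_strong_labels(segment_votes, competences, n_frames, threshold=0.5):
    """
    segment_votes: iterable of (annotator_id, start_frame, end_frame, is_active)
                   weak labels collected on overlapping segments.
    competences:   dict annotator_id -> competence in [0, 1] (e.g. from MACE).
    Returns a boolean activity curve of length n_frames.
    """
    score = np.zeros(n_frames)
    weight = np.zeros(n_frames)
    for annotator, start, end, is_active in segment_votes:
        w = competences[annotator]
        score[start:end] += w * float(is_active)   # competence-weighted vote
        weight[start:end] += w                     # total competence covering each frame
    with np.errstate(invalid="ignore", divide="ignore"):
        ratio = np.where(weight > 0, score / weight, 0.0)
    return ratio >= threshold
```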

Language: English

Citations

15