Strong Labeling of Sound Events Using Crowdsourced Weak Labels and Annotator Competence Estimation
Irene Martín-Morató, Annamaria Mesaros

IEEE/ACM Transactions on Audio Speech and Language Processing, Journal Year: 2023, Volume and Issue: 31, P. 902 - 914

Published: Jan. 1, 2023

Crowdsourcing is a popular tool for collecting large amounts of annotated data, but the specific format of strong labels necessary for sound event detection is not easily obtainable through crowdsourcing. In this work, we propose a novel annotation workflow that leverages the efficiency of crowdsourcing weak labels and uses a high number of annotators to produce reliable and objective strong labels. The weak labels are collected in a highly redundant setup that allows reconstruction of the temporal information. To obtain reliable labels, the annotators' competence is estimated using MACE (Multi-Annotator Competence Estimation) and incorporated into the label estimation by weighing the individual opinions. We show that the proposed method produces consistently reliable annotations not only for synthetic audio mixtures, but also for recordings of real everyday environments. While a maximum of 80% coincidence with a complete and correct reference was obtained on synthetic data, these results are explained by an extended study of how polyphony and SNR levels affect the identification rate of events by annotators. On real data, even though the coincidence is significantly lower, under 69%, the proposed aggregation compares favorably with a majority opinion approach, a comparison that is more difficult to make directly for this task.
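
As a rough illustration of the competence-weighted aggregation idea described above, the hedged sketch below combines redundant binary weak labels into a segment-level strong-label curve. The matrix, threshold, and function name are hypothetical and do not reproduce the paper's exact MACE-based estimation.

```python
import numpy as np

def aggregate_weak_labels(votes, competence, threshold=0.5):
    """Competence-weighted aggregation of redundant weak labels (illustrative sketch).

    votes:      (n_annotators, n_segments) binary matrix, 1 = annotator marked the
                event class as present in that short segment.
    competence: (n_annotators,) per-annotator reliability, e.g. estimated by MACE.
    Returns a binary activity curve over segments.
    """
    weights = competence / competence.sum()      # normalize reliabilities to weights
    support = weights @ votes                    # weighted vote per segment
    return (support >= threshold).astype(int)    # active where weighted support is high enough

# Hypothetical example: 4 annotators, 6 overlapping segments of one clip
votes = np.array([[1, 1, 0, 0, 1, 0],
                  [1, 1, 1, 0, 0, 0],
                  [0, 1, 1, 0, 1, 0],
                  [1, 1, 0, 1, 0, 0]])
competence = np.array([0.9, 0.8, 0.4, 0.3])
print(aggregate_weak_labels(votes, competence))
```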

Language: English

The Power of Generative AI: A Review of Requirements, Models, Input–Output Formats, Evaluation Metrics, and Challenges
Ajay Bandi, Pydi Venkata Satya Ramesh Adapa, Yudu Eswar Vinay Pratap Kumar Kuchi

et al.

Future Internet, Journal Year: 2023, Volume and Issue: 15(8), P. 260 - 260

Published: July 31, 2023

Generative artificial intelligence (AI) has emerged as a powerful technology with numerous applications in various domains. There is a need to identify the requirements and evaluation metrics for generative AI models designed for specific tasks. The purpose of this research is to investigate the fundamental aspects of generative AI systems, including their requirements, models, input–output formats, and evaluation metrics. The study addresses key questions and presents comprehensive insights to guide researchers, developers, and practitioners in the field. Firstly, the requirements necessary for implementing generative AI systems are examined and categorized into three distinct categories: hardware, software, and user experience. Furthermore, the study explores the different types of generative AI models described in the literature by presenting a taxonomy based on architectural characteristics, such as variational autoencoders (VAEs), generative adversarial networks (GANs), diffusion models, transformers, language models, normalizing flow models, and hybrid models. A classification of the input and output formats used in generative AI systems is also provided. Moreover, the research proposes a classification system for evaluation metrics and discusses the metrics commonly used to evaluate generative AI. The findings contribute to advancements in the field, enabling researchers, developers, and practitioners to effectively implement and evaluate generative AI applications. The significance of the research lies in understanding the requirements that are crucial for effective planning, design, and optimal performance. The model taxonomy aids in selecting suitable options and driving advancements. Classifying input and output formats enables leveraging diverse formats for customized applications, while the evaluation metrics establish standardized methods to assess model quality.

Language: English

Citations

245

A review of deep learning techniques for speech processing
Ambuj Mehrish, Navonil Majumder, Rishabh Bharadwaj

et al.

Information Fusion, Journal Year: 2023, Volume and Issue: 99, P. 101869 - 101869

Published: June 3, 2023

Language: English

Citations

148

CLAP Learning Audio Concepts from Natural Language Supervision
Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail

et al.

ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Journal Year: 2023, Volume and Issue: unknown

Published: May 5, 2023

Mainstream machine listening models are trained to learn audio concepts under the paradigm of one class label for many recordings, focusing on one task. Learning under such restricted supervision limits flexibility, because the models require labeled audio for training and can only predict predefined categories. Instead, we propose to learn audio concepts from natural language supervision. We call our approach Contrastive Language-Audio Pretraining (CLAP), which connects language and audio by using two encoders and a contrastive learning objective, bringing audio and text descriptions into a joint multimodal space. We trained CLAP with 128k audio-text pairs and evaluated it on 16 downstream tasks across 7 domains, such as classification of sound events, scenes, music, and speech. CLAP establishes state-of-the-art (SoTA) Zero-Shot performance. Also, CLAP's audio encoder in a supervised setup achieved SoTA on 5 tasks. The Zero-Shot capability removes the need for class-labeled audio, enables flexible class prediction at inference time, and generalizes well to multiple downstream tasks. Code is available at: https://github.com/microsoft/CLAP.
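
For context, a contrastive language-audio objective of the kind described (two encoders brought into a joint space) is commonly implemented as a symmetric cross-entropy over the batch similarity matrix. The sketch below is a generic PyTorch illustration under that assumption, not the exact CLAP training code; the function name and temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_language_audio_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired audio/text embeddings.

    audio_emb, text_emb: (batch, dim) outputs of the audio and text encoders,
    projected into the joint multimodal space.
    """
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature                 # pairwise similarities
    targets = torch.arange(audio_emb.size(0), device=logits.device) # matching pairs on the diagonal
    loss_a2t = F.cross_entropy(logits, targets)                     # audio -> text direction
    loss_t2a = F.cross_entropy(logits.t(), targets)                 # text -> audio direction
    return (loss_a2t + loss_t2a) / 2
```

Zero-shot classification then follows by embedding class names as text prompts and picking the class whose text embedding is most similar to the audio embedding.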

Language: English

Citations

123

Wav2CLIP: Learning Robust Audio Representations from CLIP
Ho-Hsiang Wu, Prem Seetharaman, Kundan Kumar

et al.

ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Journal Year: 2022, Volume and Issue: unknown, P. 4563 - 4567

Published: April 27, 2022

We propose Wav2CLIP, a robust audio representation learning method obtained by distilling from Contrastive Language-Image Pre-training (CLIP). We systematically evaluate Wav2CLIP on a variety of audio tasks including classification, retrieval, and generation, and show that it can outperform several publicly available pre-trained audio representation algorithms. Wav2CLIP projects audio into a shared embedding space with images and text, which enables multimodal applications such as zero-shot classification and cross-modal retrieval. Furthermore, Wav2CLIP needs just ∼10% of the data to achieve competitive performance on downstream tasks compared with fully supervised models, and is more efficient to pre-train than competing methods, as it does not require learning a visual model in concert with an auditory model. Finally, we demonstrate image generation from Wav2CLIP as a qualitative assessment of the shared embedding space. Our code and model weights are open sourced and made available for further applications.
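
As an illustration of distilling an audio encoder from a frozen CLIP image encoder, the hedged sketch below minimizes a cosine distance between the student audio embedding and the teacher visual embedding of the same clip. The actual Wav2CLIP loss and training setup may differ; the function name and loss form are assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(audio_encoder, audio_batch, clip_image_emb):
    """One distillation objective for CLIP-to-audio transfer (illustrative sketch).

    audio_batch:    a batch of audio clips (e.g. waveforms or spectrograms).
    clip_image_emb: (batch, dim) embeddings of video frames from a frozen CLIP
                    image encoder, used as teacher targets for the same clips.
    """
    audio_emb = audio_encoder(audio_batch)                 # student prediction
    audio_emb = F.normalize(audio_emb, dim=-1)
    clip_image_emb = F.normalize(clip_image_emb, dim=-1)
    # Pull the audio embedding toward the frozen visual embedding of the same clip.
    return 1.0 - F.cosine_similarity(audio_emb, clip_image_emb, dim=-1).mean()
```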

Language: English

Citations

110

Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation
Yusong Wu, Ke Chen, Tianyu Zhang

et al.

ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Journal Year: 2023, Volume and Issue: unknown

Published: May 5, 2023

Contrastive learning has shown remarkable success in the field of multimodal representation learning. In this paper, we propose a pipeline of contrastive language-audio pretraining to develop an audio representation by combining audio data with natural language descriptions. To accomplish this target, we first release LAION-Audio-630K, a large collection of 633,526 audio-text pairs from different data sources. Second, we construct a contrastive language-audio pretraining model considering different audio encoders and text encoders. We incorporate a feature fusion mechanism and keyword-to-caption augmentation into the model design to further enable the model to process audio inputs of variable lengths and to enhance performance. Third, we perform comprehensive experiments to evaluate our model across three tasks: text-to-audio retrieval, zero-shot audio classification, and supervised audio classification. The results demonstrate that our model achieves superior performance in the text-to-audio retrieval task. In audio classification tasks, the model achieves state-of-the-art performance in the zero-shot setting and is able to obtain performance comparable to other models in the non-zero-shot setting. Both LAION-Audio-630K and the proposed model are available to the public.
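
Keyword-to-caption augmentation can be illustrated with a simple template that turns tag-style keywords into a caption-like sentence to pair with the audio. The paper uses a pre-trained model for this step, so the sketch below is only a stand-in; the template wording and function name are assumptions.

```python
def keywords_to_caption(keywords):
    """Turn a list of tag-style keywords into a caption-like sentence (template sketch)."""
    if not keywords:
        return "a sound"
    if len(keywords) == 1:
        return f"the sound of {keywords[0]}"
    return "the sound of " + ", ".join(keywords[:-1]) + " and " + keywords[-1]

print(keywords_to_caption(["dog barking", "rain"]))
# -> "the sound of dog barking and rain"
```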

Language: English

Citations

110

Efficient Training of Audio Transformers with Patchout
Khaled Koutini, Jan Schlüter, Hamid Eghbal-zadeh

et al.

Interspeech 2022, Journal Year: 2022, Volume and Issue: unknown

Published: Sept. 16, 2022

The great success of transformer-based models in natural language processing (NLP) has led to various attempts at adapting these architectures to other domains, such as vision and audio. Recent work has shown that transformers can outperform Convolutional Neural Networks (CNNs) on audio tasks. However, one of the main shortcomings of transformer models, compared to the well-established CNNs, is their computational complexity. In transformers, the compute and memory complexity is known to grow quadratically with the input length. Therefore, there has been extensive work on optimizing transformers, but often at the cost of degraded predictive performance. In this work, we propose a novel method to optimize and regularize transformers on audio spectrograms. Our proposed models achieve a new state-of-the-art performance on Audioset and can be trained on a single consumer-grade GPU. Furthermore, the proposed model outperforms CNNs in terms of both performance and training speed. Source code: https://github.com/kkoutini/PaSST
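
The core idea of patchout, dropping a random subset of spectrogram patches during training so the attention sequence is shorter, can be sketched as follows. This is an unstructured variant for illustration only; the paper also uses structured variants, and the drop ratio and function name here are assumptions.

```python
import torch

def patchout(patch_embeddings, drop_ratio=0.4, training=True):
    """Randomly drop a fraction of spectrogram patch embeddings during training.

    patch_embeddings: (batch, n_patches, dim) sequence fed to the transformer.
    Dropping patches shortens the sequence, cutting the quadratic attention cost,
    and acts as a regularizer; at inference time all patches are kept.
    """
    if not training or drop_ratio == 0.0:
        return patch_embeddings
    batch, n_patches, dim = patch_embeddings.shape
    n_keep = max(1, int(n_patches * (1.0 - drop_ratio)))
    # Pick a random subset of patch indices per example.
    keep = torch.rand(batch, n_patches, device=patch_embeddings.device)
    keep = keep.argsort(dim=1)[:, :n_keep]
    return patch_embeddings.gather(1, keep.unsqueeze(-1).expand(-1, -1, dim))
```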

Language: English

Citations

100

TF-GridNet: Integrating Full- and Sub-Band Modeling for Speech Separation
Zhong-Qiu Wang, Samuele Cornell, Shukjae Choi

et al.

IEEE/ACM Transactions on Audio Speech and Language Processing, Journal Year: 2023, Volume and Issue: 31, P. 3221 - 3236

Published: Jan. 1, 2023

We propose TF-GridNet for speech separation. The model is a novel deep neural network (DNN) integrating full- and sub-band modeling in the time-frequency (T-F) domain. It stacks several blocks, each consisting of an intra-frame full-band module, a sub-band temporal module, and a cross-frame self-attention module. It is trained to perform complex spectral mapping, where the real and imaginary (RI) components of the input signals are stacked as features to predict the target RI components. We first evaluate it on monaural anechoic speaker separation. Without using data augmentation or dynamic mixing, it obtains a state-of-the-art 23.5 dB improvement in scale-invariant signal-to-distortion ratio (SI-SDR) on WSJ0-2mix, a standard dataset for two-speaker separation. To show its robustness to noise and reverberation, we evaluate it on the reverberant speaker separation task of SMS-WSJ and the noisy-reverberant task of WHAMR!, and obtain state-of-the-art performance on both datasets. We then extend TF-GridNet to multi-microphone conditions through multi-channel complex spectral mapping and integrate it into a two-DNN system with a beamformer in between (named MISO-BF-MISO in earlier studies), where the beamformer proposed in this paper is a multi-frame Wiener filter computed based on the outputs of the first DNN. State-of-the-art performance is obtained on the multi-channel tasks of SMS-WSJ and WHAMR!. Besides speaker separation, we apply the proposed algorithms to speech dereverberation and enhancement, including the recent L3DAS22 speech enhancement challenge.
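
Complex spectral mapping as described here takes stacked real and imaginary (RI) STFT components as input features and predicts the target RI components. A minimal sketch of that feature format in PyTorch is given below; the STFT parameters and function names are assumptions, not the paper's configuration.

```python
import torch

def stack_ri_features(waveform, n_fft=512, hop=128):
    """Build stacked real/imaginary (RI) input features for complex spectral mapping.

    waveform: 1-D tensor of audio samples.
    Returns a (2, n_freq, n_frames) tensor: channel 0 = real part, channel 1 = imaginary part.
    A DNN trained for complex spectral mapping predicts the target's RI components
    in the same format.
    """
    window = torch.hann_window(n_fft)
    spec = torch.stft(waveform, n_fft=n_fft, hop_length=hop,
                      window=window, return_complex=True)
    return torch.stack([spec.real, spec.imag], dim=0)

def ri_to_waveform(ri, n_fft=512, hop=128, length=None):
    """Reassemble a complex spectrogram from predicted RI components and invert it."""
    window = torch.hann_window(n_fft)
    spec = torch.complex(ri[0], ri[1])
    return torch.istft(spec, n_fft=n_fft, hop_length=hop,
                       window=window, length=length)
```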

Language: English

Citations

57

The Internet of Sounds: Convergent Trends, Insights, and Future Directions
Luca Turchet, Mathieu Lagrange, Cristina Rottondi

et al.

IEEE Internet of Things Journal, Journal Year: 2023, Volume and Issue: 10(13), P. 11264 - 11292

Published: March 7, 2023

Current sound-based practices and systems developed in both academia and industry point to convergent research trends that bring together the field of Sound and Music Computing with that of the Internet of Things. This paper proposes a vision for the emerging field of the Internet of Sounds (IoS), which stems from the convergence of such disciplines. The IoS relates to the network of Sound Things, i.e., devices capable of sensing, acquiring, processing, actuating, and exchanging data serving the purpose of communicating sound-related information. In the IoS paradigm, which merges under a unique umbrella the fields of the Internet of Musical Things and the Internet of Audio Things, heterogeneous devices dedicated to musical and non-musical tasks can interact and cooperate with one another and with other things connected to the Internet to facilitate sound-based services and applications that are globally available to users. We survey the state of the art in this space, discuss the technological and non-technological challenges ahead of us, and propose a comprehensive research agenda for the field.

Language: English

Citations

56

Fight Fire with Fire: Detecting Forest Fires with Embedded Machine Learning Models Dealing with Audio and Images on Low Power IoT Devices
Giacomo Peruzzi, Alessandro Pozzebon, Mattia Van Der Meer

et al.

Sensors, Journal Year: 2023, Volume and Issue: 23(2), P. 783 - 783

Published: Jan. 10, 2023

Forest fires are a main cause of desertification, and they have a disastrous impact on agricultural and forest ecosystems. Modern fire detection and warning systems rely on several techniques: satellite monitoring, sensor networks, image processing, data fusion, etc. Recently, Artificial Intelligence (AI) algorithms have been applied to fire recognition systems, enhancing their efficiency and reliability. However, these devices usually need constant data transmission along with a proper amount of computing power, entailing high costs and energy consumption. This paper presents the prototype of a Video Surveillance Unit (VSU) for recognising and signalling the presence of forest fires by exploiting two embedded Machine Learning (ML) models running on a low power device. The ML models take audio samples and images as their respective inputs, allowing timely fire detection. The result is that, while the performances of the two models are comparable when they work independently, their joint usage according to the proposed methodology provides higher accuracy, precision, recall and F1 score (96.15%, 92.30%, 100.00%, 96.00%, respectively). Eventually, each fire event is remotely signalled by making use of the Long Range Wide Area Network (LoRaWAN) protocol to ensure that the personnel in charge are able to operate promptly.
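
The abstract does not spell out the fusion rule, so the sketch below only illustrates a generic late-fusion alarm that combines the audio and image model outputs; the thresholds, OR-style rule, and function name are hypothetical and not taken from the paper.

```python
def fused_fire_alarm(audio_prob, image_prob, audio_thr=0.5, image_thr=0.5):
    """Combine the outputs of the audio and image classifiers (generic late fusion).

    audio_prob / image_prob: fire probabilities from the two embedded models.
    Raises an alarm when either model is confident; the alert would then be
    transmitted over LoRaWAN by the surrounding firmware.
    """
    return audio_prob >= audio_thr or image_prob >= image_thr
```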

Language: English

Citations

43

WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research
Xinhao Mei, Chutong Meng, Haohe Liu

et al.

IEEE/ACM Transactions on Audio Speech and Language Processing, Journal Year: 2024, Volume and Issue: 32, P. 3339 - 3354

Published: Jan. 1, 2024

The advancement of audio-language (AL) multimodal learning tasks has been significant in recent years, yet the limited size of existing datasets poses challenges for researchers due to the costly and time-consuming collection process. To address this data scarcity issue, we introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions. We sourced audio clips and their raw descriptions from web sources and a sound event detection dataset. However, the online-harvested raw descriptions are highly noisy and unsuitable for direct use in tasks such as automated audio captioning. To overcome this, we propose a three-stage processing pipeline for filtering noisy data and generating high-quality captions, where ChatGPT, a large language model, is leveraged to filter and transform raw descriptions automatically. We conduct a comprehensive analysis of the characteristics of the WavCaps dataset and evaluate it on multiple downstream tasks. The systems trained on WavCaps outperform previous state-of-the-art (SOTA) models by a significant margin. Our aspiration is for the proposed dataset to facilitate audio-language multimodal research and demonstrate the potential of utilizing large language models (LLMs) to enhance academic research. Our codes and dataset are available at https://github.com/XinhaoMei/WavCaps.
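
A minimal, hypothetical sketch of the LLM-assisted caption cleaning stage is given below; the prompt wording, the crude pre-filter, and the `call_llm` placeholder are assumptions for illustration and do not reproduce the actual WavCaps pipeline.

```python
# Placeholder sketch: `call_llm` stands in for whatever chat-completion client is used.
PROMPT = (
    "Rewrite the following raw audio description as a short, one-sentence caption "
    "that only describes the sound events, without names, dates, or URLs:\n\n{raw}"
)

def clean_description(raw_description, call_llm):
    """Filter out unusable entries, then ask an LLM to rewrite the rest as a caption."""
    if len(raw_description.split()) < 2:   # crude pre-filter for unusable entries
        return None
    return call_llm(PROMPT.format(raw=raw_description))
```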

Language: English

Citations

41