Multichannel Long-Term Streaming Neural Speech Enhancement for Static and Moving Speakers DOI
Changsheng Quan, Xiaofei Li

IEEE Signal Processing Letters, Journal Year: 2024, Volume and Issue: 31, P. 2295 - 2299

Published: Jan. 1, 2024

Language: English

SpatialNet: Extensively Learning Spatial Information for Multichannel Joint Speech Separation, Denoising and Dereverberation DOI Creative Commons
Changsheng Quan, Xiaofei Li

IEEE/ACM Transactions on Audio Speech and Language Processing, Journal Year: 2024, Volume and Issue: 32, P. 1310 - 1323

Published: Jan. 1, 2024

This work proposes a neural network to extensively exploit spatial information for multichannel joint speech separation, denoising and dereverberation, named SpatialNet. In the short-time Fourier transform (STFT) domain, the proposed network performs end-to-end speech enhancement. It is mainly composed of interleaved narrow-band and cross-band blocks, which respectively exploit narrow-band and cross-band spatial information. The narrow-band blocks process frequencies independently and use a self-attention mechanism and temporal convolutional layers to respectively perform spatial-feature-based speaker clustering and temporal smoothing/filtering. The cross-band blocks process frames independently and use a full-band linear layer and frequency convolutional layers to respectively learn the correlation between all frequencies and between adjacent frequencies. Experiments are conducted on various simulated and real datasets, and the results show that 1) the proposed network achieves state-of-the-art performance on almost all tasks; 2) it suffers little from the spectral generalization problem; and 3) it is indeed performing speaker clustering (demonstrated by attention maps).
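As a rough illustration of the interleaved block structure described above, the sketch below (an assumption for illustration, not the authors' released code) alternates a narrow-band block that treats each frequency as an independent time sequence with a cross-band block that treats each frame as an independent frequency sequence; the block internals (head counts, kernel sizes, hidden sizes) are placeholder choices.

```python
# Illustrative sketch only; structure assumed from the abstract, not the authors' code.
import torch
import torch.nn as nn

class NarrowBandBlock(nn.Module):
    """Treats each frequency as an independent time sequence."""
    def __init__(self, feat_dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=2, batch_first=True)
        self.tconv = nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1)

    def forward(self, x):                         # x: (batch, freq, frames, feat)
        b, f, t, c = x.shape
        y = x.reshape(b * f, t, c)
        y = y + self.attn(y, y, y, need_weights=False)[0]      # spatial-feature-based clustering role
        y = y + self.tconv(y.transpose(1, 2)).transpose(1, 2)  # temporal smoothing/filtering role
        return y.reshape(b, f, t, c)

class CrossBandBlock(nn.Module):
    """Treats each frame as an independent frequency sequence."""
    def __init__(self, feat_dim, n_freq):
        super().__init__()
        self.full_band = nn.Linear(n_freq, n_freq)                             # all frequencies
        self.fconv = nn.Conv1d(feat_dim, feat_dim, kernel_size=5, padding=2)   # adjacent frequencies

    def forward(self, x):                         # x: (batch, freq, frames, feat)
        b, f, t, c = x.shape
        y = x.permute(0, 2, 3, 1).reshape(b * t, c, f)
        y = y + self.full_band(y)
        y = y + self.fconv(y)
        return y.reshape(b, t, c, f).permute(0, 3, 1, 2)

blocks = nn.ModuleList()
for _ in range(2):                                # interleave the two block types
    blocks.append(NarrowBandBlock(32))
    blocks.append(CrossBandBlock(32, 129))

x = torch.randn(1, 129, 50, 32)                   # STFT-domain feature tensor
for blk in blocks:
    x = blk(x)
print(x.shape)                                    # torch.Size([1, 129, 50, 32])
```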

Language: English

Citations: 10

Multi-Channel Conversational Speaker Separation via Neural Diarization DOI
Hassan Taherian, DeLiang Wang

IEEE/ACM Transactions on Audio Speech and Language Processing, Journal Year: 2024, Volume and Issue: 32, P. 2467 - 2476

Published: Jan. 1, 2024

When dealing with overlapped speech, the performance of automatic speech recognition (ASR) systems substantially degrades, as they are designed for single-talker speech. To enhance ASR in conversational or meeting environments, continuous speaker separation (CSS) is commonly employed. However, CSS requires a short separation window to avoid having too many speakers inside a window, as well as sequential grouping of discontinuous speech segments. To address these limitations, we introduce a new multi-channel framework called "speaker separation via neural diarization" (SSND) for meeting environments. Our approach utilizes an end-to-end diarization system to identify the speech activity of each individual speaker. By leveraging the estimated speaker boundaries, we generate a sequence of speaker embeddings, which in turn facilitate the assignment of speakers to the outputs of a multi-talker separation model. SSND addresses the permutation ambiguity issue of talker-independent speaker separation during the diarization phase through location-based training, rather than during the separation process. This unique approach allows multiple non-overlapped speakers to be assigned to the same output stream, making it possible to efficiently process long segments, a task impossible with CSS. Additionally, SSND is naturally suitable for speaker-attributed ASR. We evaluate our proposed methods on the open LibriCSS dataset, advancing the state-of-the-art results by a large margin.
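The central idea, routing diarized segments so that non-overlapped speakers can share an output stream, can be sketched as follows. This is a hypothetical illustration of the concept, not the SSND implementation; the `Segment` and `assign_streams` names are invented for the example.

```python
# Conceptual sketch only: greedy routing of diarized segments onto a fixed
# number of output streams, so a long recording fits in few streams.
from dataclasses import dataclass
from typing import List

@dataclass
class Segment:
    speaker: str      # speaker identity from the diarization system
    start: float      # seconds
    end: float

def assign_streams(segments: List[Segment], num_streams: int = 2):
    """Route each diarized segment to a stream that is silent over its time span."""
    streams: List[List[Segment]] = [[] for _ in range(num_streams)]
    for seg in sorted(segments, key=lambda s: s.start):
        for stream in streams:
            if all(seg.start >= other.end or seg.end <= other.start for other in stream):
                stream.append(seg)
                break
        else:
            raise ValueError("more overlapped speakers than output streams")
    return streams

# Three speakers fit into two streams because spk3 does not overlap spk1.
segs = [Segment("spk1", 0.0, 4.0), Segment("spk2", 3.0, 7.0), Segment("spk3", 8.0, 12.0)]
for i, s in enumerate(assign_streams(segs)):
    print(f"stream {i}: {[x.speaker for x in s]}")
```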

Language: English

Citations: 6

ReZero: Region-Customizable Sound Extraction DOI
Rongzhi Gu, Yi Luo

IEEE/ACM Transactions on Audio Speech and Language Processing, Journal Year: 2024, Volume and Issue: 32, P. 2576 - 2589

Published: Jan. 1, 2024

We introduce region-customizable sound extraction (ReZero), a general and flexible framework for the multi-channel region-wise sound extraction (R-SE) task. The R-SE task aims at extracting all active target sounds (e.g., human speech) within a specific, user-defined spatial region, which is different from conventional and existing tasks where blind separation or a fixed, predefined region is typically assumed. The spatial region can be defined as an angular window, a sphere, a cone, or other geometric patterns. As a solution to the R-SE task, the proposed ReZero framework includes (1) definitions of different types of spatial regions, (2) methods for region feature extraction and aggregation, and (3) a multi-channel extension of the band-split RNN (BSRNN) model specified for the R-SE task. We design experiments for different microphone array geometries and different types of spatial regions, together with comprehensive ablation studies on system configurations. Experimental results on both simulated and real-recorded data demonstrate the effectiveness of ReZero. Demos are available at https://innerselfm.github.io/rezero/.
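As a small illustration of one of the region types named in the abstract, the sketch below (an assumption for illustration, not ReZero's code) tests whether a source direction of arrival falls inside a user-defined angular window; ReZero itself extracts all sounds inside such a region directly from the multi-channel mixture rather than from known source directions.

```python
# Illustrative helper only: membership test for an angular-window spatial region.
def in_angular_window(doa_deg: float, center_deg: float, width_deg: float) -> bool:
    """True if the DOA lies within +/- width/2 of the window center (wrap-around safe)."""
    diff = (doa_deg - center_deg + 180.0) % 360.0 - 180.0
    return abs(diff) <= width_deg / 2.0

# Keep sources inside a 60-degree window centred at 30 degrees.
for doa in [10.0, 45.0, 90.0, 350.0]:
    print(doa, in_angular_window(doa, center_deg=30.0, width_deg=60.0))
```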

Language: English

Citations: 5

GTCRN: A Speech Enhancement Model Requiring Ultralow Computational Resources DOI Open Access

Xiaobin Rong, Tianchi Sun, Xu Zhang et al.

ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Journal Year: 2024, Volume and Issue: unknown, P. 971 - 975

Published: March 18, 2024

While modern deep learning-based models have significantly outperformed traditional methods in the area of speech enhancement, they often necessitate a lot of parameters and extensive computational power, making them impractical to deploy on edge devices in real-world applications. In this paper, we introduce the Grouped Temporal Convolutional Recurrent Network (GTCRN), which incorporates grouped strategies to efficiently simplify a competitive model, DPCRN. Additionally, it leverages subband feature extraction modules and temporal recurrent attention modules to enhance its performance. Remarkably, the resulting model demands ultralow computational resources, featuring only 23.7 K parameters and 39.6 MMACs per second. Experimental results show that our proposed model not only surpasses RNNoise, a typical lightweight model with a similar computational burden, but also achieves competitive performance when compared with recent baseline models that have significantly higher computational resource requirements.
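To make the "grouped strategies" concrete, the following sketch (illustrative only, not the GTCRN code) compares the parameter count of an ordinary 1-D convolution with a grouped one; splitting the channels into groups divides the weight count roughly by the number of groups, which is the kind of saving that keeps a full model at a very small parameter budget.

```python
# Illustrative comparison: parameter savings from grouped convolution.
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

dense   = nn.Conv1d(64, 64, kernel_size=3)             # ordinary convolution
grouped = nn.Conv1d(64, 64, kernel_size=3, groups=4)   # 4 groups of 16 channels

print("dense  :", n_params(dense))    # 64*64*3 + 64 = 12352
print("grouped:", n_params(grouped))  # 64*16*3 + 64 =  3136
```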

Language: English

Citations: 4

Improving Speech Separation Via Hybrid Domain Features, Asymmetric Encoder-Decoder, and Multi-Path Separator DOI

Zixi Jia, Jiqiang Liu, M. Eric Cui et al.

Published: Jan. 1, 2025

Deep learning-based speech separation studies are divided into time-frequency domain and time-domain approaches. The primary difference is that time-domain models replace the short-time Fourier transform with a trainable encoder, mapping mixture waveform signals directly to a learned-basis domain. This paper combines features from both domains into a hybrid domain feature beneficial for monaural speech separation. We propose HySplit, an end-to-end neural network framework built on the hybrid domain feature. We design an asymmetrical encoder and decoder, and a multi-path separator consisting of multi-head attention blocks, to manage and integrate various-level features. HySplit has only 20% of the forward latency of other dual-path models. Experiments show that HySplit achieves a scale-invariant source-to-noise ratio of 20.9 dB on the benchmark dataset WSJ0-2mix, 16.3 dB on Libri2Mix-Clean, and 13.5 dB on WHAM!, with moderate parameters and FLOPs for the same level of separation results.
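A minimal sketch of the hybrid-domain idea, assuming a simple concatenation of STFT-magnitude features with learned-encoder features (the actual HySplit feature fusion may differ), is shown below; the window, hop, and encoder sizes are placeholder choices.

```python
# Illustrative sketch only: build a hybrid feature from time-frequency and
# learned time-domain branches computed on the same mixture.
import torch
import torch.nn as nn

win, hop = 512, 256
mixture = torch.randn(1, 16000)                            # 1 second at 16 kHz

# Time-frequency branch: magnitude of the STFT.
spec = torch.stft(mixture, n_fft=win, hop_length=hop,
                  window=torch.hann_window(win), return_complex=True)
tf_feat = spec.abs()                                       # (1, freq, frames)

# Time-domain branch: trainable 1-D encoder with matching stride.
encoder = nn.Conv1d(1, 257, kernel_size=win, stride=hop)
td_feat = torch.relu(encoder(mixture.unsqueeze(1)))        # (1, 257, frames')

# Align frame counts and concatenate into the hybrid feature.
frames = min(tf_feat.shape[-1], td_feat.shape[-1])
hybrid = torch.cat([tf_feat[..., :frames], td_feat[..., :frames]], dim=1)
print(hybrid.shape)                                        # (1, 514, frames)
```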

Language: English

Citations: 0

Accurate speaker counting, diarization and separation for advanced recognition of multichannel multispeaker conversations DOI
Anton Mitrofanov, Tatiana Prisyach, Tatiana Timofeeva et al.

Computer Speech & Language, Journal Year: 2025, Volume and Issue: unknown, P. 101780 - 101780

Published: Feb. 1, 2025

Language: English

Citations: 0

Deep Learning for Speech Separation: A Comprehensive Review DOI
Duyen Nguyen Thi, Ha Minh Tan, Trung-Nghia Phung et al.

Lecture notes in networks and systems, Journal Year: 2025, Volume and Issue: unknown, P. 222 - 229

Published: Jan. 1, 2025

Language: English

Citations: 0

Cocktail Party Effect Using Parallel Intra and Inter Self-attention DOI

Ha Minh Tan, Nguyen Kim Quoc, Duc-Quang Vu et al.

Lecture notes in networks and systems, Journal Year: 2025, Volume and Issue: unknown, P. 716 - 723

Published: Jan. 1, 2025

Language: English

Citations: 0

Summary of the NOTSOFAR-1 challenge: Highlights and Learnings DOI

Igor Abramovski, Alon Vinnikov, Shalev Shaer et al.

Computer Speech & Language, Journal Year: 2025, Volume and Issue: unknown, P. 101796 - 101796

Published: March 1, 2025

Language: English

Citations: 0

SuperM2M: Supervised and mixture-to-mixture co-learning for speech enhancement and noise-robust ASR DOI
Zhong-Qiu Wang

Neural Networks, Journal Year: 2025, Volume and Issue: 188, P. 107408 - 107408

Published: March 28, 2025

Language: English

Citations: 0