IEEE Signal Processing Letters, Journal Year: 2024, Volume and Issue: 31, P. 2295 - 2299
Published: Jan. 1, 2024
Language: English
IEEE/ACM Transactions on Audio Speech and Language Processing, Journal Year: 2024, Volume and Issue: 32, P. 1310 - 1323
Published: Jan. 1, 2024
This work proposes a neural network to extensively exploit spatial information for multichannel joint speech separation, denoising and dereverberation, named SpatialNet. In the short-time Fourier transform (STFT) domain, the proposed network performs end-to-end speech enhancement. It is mainly composed of interleaved narrow-band and cross-band blocks to respectively exploit narrow-band and cross-band spatial information. The narrow-band blocks process frequencies independently, and use a self-attention mechanism and temporal convolutional layers to respectively perform spatial-feature-based speaker clustering and temporal smoothing/filtering. The cross-band blocks process frames independently, and use a full-band linear layer and frequency convolutional layers to respectively learn the correlation between all frequencies and that between adjacent frequencies. Experiments are conducted on various simulated and real datasets, and the results show that 1) the proposed network achieves state-of-the-art performance on almost all tasks; 2) it suffers little from the spectral generalization problem; and 3) it is indeed performing speaker clustering (demonstrated by attention maps).
Language: English
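The narrow-band/cross-band interleaving described in the abstract amounts to alternating which axis of the STFT feature tensor is processed: narrow-band blocks run along time within each frequency, while cross-band blocks mix across frequencies within each frame. A minimal numpy sketch of that data flow (the moving-average "filtering" and the full-band matrix are illustrative stand-ins, not the paper's actual layers):

```python
import numpy as np

def narrow_band_step(x):
    """Process each frequency independently along time.
    x: (freq, time, feat). Stand-in for self-attention + temporal conv:
    a 3-tap moving average over the time axis."""
    kernel = np.ones(3) / 3.0
    smooth = lambda v: np.convolve(v, kernel, mode="same")
    return np.stack([np.apply_along_axis(smooth, 0, xf) for xf in x])

def cross_band_step(x, w_full):
    """Process each frame independently across frequency.
    w_full: (freq, freq) stand-in for the full-band linear layer."""
    return np.einsum("gf,ftd->gtd", w_full, x)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 3))            # (freq, time, feat)
y = cross_band_step(narrow_band_step(x), np.eye(4))
```

The point of the sketch is only the shape discipline: both steps preserve the (freq, time, feat) layout, so the two block types can be interleaved arbitrarily.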
Citations: 10
IEEE/ACM Transactions on Audio Speech and Language Processing, Journal Year: 2024, Volume and Issue: 32, P. 2467 - 2476
Published: Jan. 1, 2024
When dealing with overlapped speech, the performance of automatic speech recognition (ASR) systems substantially degrades, as they are designed for single-talker speech. To enhance ASR in conversational or meeting environments, continuous speech separation (CSS) is commonly employed. However, CSS requires a short separation window to avoid having many speakers inside the window, and sequential grouping of discontinuous speech segments. To address these limitations, we introduce a new multi-channel framework called "speaker separation via neural diarization" (SSND) for meeting environments. Our approach utilizes an end-to-end diarization system to identify the speech activity of each individual speaker. By leveraging the estimated speaker boundaries, we generate a sequence of speaker embeddings, which in turn facilitate the assignment of the outputs of a multi-talker separation model. SSND addresses the permutation ambiguity issue of talker-independent speaker separation during the diarization phase through location-based training, rather than during the separation process. This unique approach allows multiple non-overlapped speakers to be assigned to the same output stream, making it possible to efficiently process long segments, a task impossible with CSS. Additionally, SSND is naturally suitable for speaker-attributed ASR. We evaluate our proposed methods on the open LibriCSS dataset, advancing the state-of-the-art results by a large margin.
Language: English
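The key idea above, that diarization lets several non-overlapped speakers share one separation output stream, can be illustrated with a tiny greedy scheduler over diarized segments (a sketch of the concept only, not the paper's location-based assignment):

```python
def assign_streams(segments):
    """Assign diarized segments (start, end, speaker) to output streams
    so that segments sharing a stream never overlap in time."""
    streams = []
    for seg in sorted(segments):
        for stream in streams:
            if seg[0] >= stream[-1][1]:   # starts after the stream's last segment ends
                stream.append(seg)
                break
        else:                             # overlaps every existing stream: open a new one
            streams.append([seg])
    return streams

streams = assign_streams([(0, 5, "A"), (3, 8, "B"), (6, 10, "A"), (9, 12, "C")])
```

Here three speakers fit into two streams because their segments only pairwise overlap, which is why a fixed small number of outputs can cover an arbitrarily long meeting.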
Citations: 6
IEEE/ACM Transactions on Audio Speech and Language Processing, Journal Year: 2024, Volume and Issue: 32, P. 2576 - 2589
Published: Jan. 1, 2024
We introduce region-customizable sound extraction (ReZero), a general and flexible framework for the multi-channel region-wise sound extraction (R-SE) task. The R-SE task aims at extracting all active target sounds (e.g., human speech) within a specific, user-defined spatial region, which is different from conventional and existing tasks where blind separation or extraction from a fixed, predefined spatial region is typically assumed. The spatial region can be defined as an angular window, a sphere, a cone, or other geometric patterns. As a solution to the R-SE task, the proposed ReZero framework includes (1) definitions of different types of spatial regions, (2) methods for region feature extraction and aggregation, and (3) a multi-channel extension of the band-split RNN (BSRNN) model specified for the R-SE task. We design experiments for different microphone array geometries and types of spatial regions, as well as comprehensive ablation studies on system configurations. Experimental results on both simulated and real-recorded data demonstrate the effectiveness of ReZero. Demos are available at https://innerselfm.github.io/rezero/.
Language: English
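An angular-window region of the kind described above can be represented simply as a center direction plus a width; a hedged sketch of a membership test (a hypothetical helper for illustration, not part of ReZero's code):

```python
def in_angular_window(doa_deg, center_deg, width_deg):
    """True if a direction of arrival (degrees) lies inside an angular
    window centered at center_deg with total width width_deg.
    The modular arithmetic makes the test wrap-around safe at 0/360."""
    diff = (doa_deg - center_deg + 180.0) % 360.0 - 180.0
    return abs(diff) <= width_deg / 2.0
```

Other region shapes mentioned in the abstract (sphere, cone) would add a distance or elevation constraint on top of the same angular test.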
Citations: 5
ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Journal Year: 2024, Volume and Issue: unknown, P. 971 - 975
Published: March 18, 2024
While modern deep learning-based models have significantly outperformed traditional methods in the area of speech enhancement, they often necessitate a large number of parameters and extensive computational power, making them impractical to deploy on edge devices in real-world applications. In this paper, we introduce the Grouped Temporal Convolutional Recurrent Network (GTCRN), which incorporates grouped strategies to efficiently simplify a competitive model, DPCRN. Additionally, it leverages subband feature extraction modules and temporal recurrent attention modules to enhance its performance. Remarkably, the resulting model demands ultralow computational resources, featuring only 23.7 K parameters and 39.6 MMACs per second. Experimental results show that our proposed model not only surpasses RNNoise, a typical lightweight model with a similar computational burden, but also achieves competitive performance when compared with recent baseline models with higher resource requirements.
Language: English
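The "grouped strategies" mentioned above trade a dense layer for several independent groups, dividing the weight count by the number of groups. A quick parameter-count sketch (generic grouped-convolution accounting; the example channel sizes are assumptions and this does not derive the paper's 23.7 K total):

```python
def conv1d_params(c_in, c_out, k, groups=1):
    """Weight count of a 1-D convolution (bias omitted): each of the
    `groups` groups maps c_in/groups channels to c_out/groups channels."""
    assert c_in % groups == 0 and c_out % groups == 0
    return groups * (c_in // groups) * (c_out // groups) * k

dense = conv1d_params(64, 64, 3)        # all channels densely connected
grouped = conv1d_params(64, 64, 3, 4)   # 4 groups: 4x fewer weights
```

The same factor-of-`groups` saving applies to the multiply-accumulate count, which is how such models reach tens of MMACs per second rather than hundreds.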
Citations: 4
Published: Jan. 1, 2025
Deep learning-based speech separation studies are divided into time-frequency domain and time-domain approaches. The primary difference is that time-domain models replace the short-time Fourier transform with a trainable encoder, mapping mixture waveform signals directly to a learned-basis domain. This paper combines features from both domains into a hybrid feature beneficial for monaural speech separation. We propose HySplit, an end-to-end neural network framework built on the hybrid domain feature. We design an asymmetrical encoder and decoder, and a multi-path separator consisting of multi-head attention blocks to manage and integrate various-level features. HySplit has only 20% of the forward latency compared with other dual-path models. Experiments show HySplit achieves a scale-invariant source-to-noise ratio of 20.9 dB on the benchmark dataset WSJ0-2mix, 16.3 dB on Libri2Mix-Clean, and 13.5 dB on WHAM!, with moderate parameters and FLOPs for the same level of separation results.
Language: English
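The scale-invariant source-to-noise ratio reported above projects the estimate onto the reference before measuring residual energy, so rescaling the estimate does not change the score. A minimal numpy sketch of the standard SI-SNR definition:

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB for 1-D waveforms."""
    est = est - est.mean()
    ref = ref - ref.mean()
    # project est onto ref: the "target" part; the rest counts as noise
    target = np.dot(est, ref) / (np.dot(ref, ref) + eps) * ref
    noise = est - target
    return 10.0 * np.log10(np.dot(target, target) / (np.dot(noise, noise) + eps))

ref = np.sin(np.linspace(0.0, 8.0 * np.pi, 1000))
noisy = ref + 0.1 * np.random.default_rng(1).normal(size=ref.size)
```

An exactly rescaled copy of the reference scores very high, and multiplying the estimate by any gain leaves the score essentially unchanged, which is what makes the metric comparable across models with different output scales.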
Citations: 0
Computer Speech & Language, Journal Year: 2025, Volume and Issue: unknown, P. 101780 - 101780
Published: Feb. 1, 2025
Language: English
Citations: 0
Lecture notes in networks and systems, Journal Year: 2025, Volume and Issue: unknown, P. 222 - 229
Published: Jan. 1, 2025
Language: English
Citations: 0
Lecture notes in networks and systems, Journal Year: 2025, Volume and Issue: unknown, P. 716 - 723
Published: Jan. 1, 2025
Language: English
Citations: 0
Computer Speech & Language, Journal Year: 2025, Volume and Issue: unknown, P. 101796 - 101796
Published: March 1, 2025
Language: English
Citations: 0
Neural Networks, Journal Year: 2025, Volume and Issue: 188, P. 107408 - 107408
Published: March 28, 2025
Language: English
Citations: 0