Uncovering between-categorical details in L2 pronunciation errors using Wav2Vec2.0 code vectors

Eunsoo Hong, Sunhee Kim, Minhwa Chung et al.

Phonetics and Speech Sciences, Journal year: 2024, Issue 16(4), pp. 73-94

Published: Dec. 1, 2024

Language: English

Dynamics of auditory word form encoding in human speech cortex

Yizhen Zhang, Matthew K. Leonard, Laura Gwilliams et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal year: 2025, Issue unknown

Published: May 5, 2025

Summary: When we hear continuous speech, we perceive it as a series of discrete words, despite the lack of clear boundaries in the acoustic signal. The superior temporal gyrus (STG) encodes phonetic elements like consonants and vowels, but how the brain extracts whole words as perceptual units remains unclear. Using high-density cortical recordings, we investigated how the brain represents auditory word forms, integrating acoustic-phonetic, prosodic, and lexical features, while participants listened to spoken narratives. Our results show that STG neural populations exhibit distinctive reset activity at word boundaries, marked by a brief, sharp drop in activity. Between these resets, they consistently encode distinct word-level information, supporting the integration of phonological features into coherent word forms. Notably, this process tracks the relative elapsed time within each word, independent of its absolute duration, providing a flexible scaffolding for encoding words of variable lengths. We observed similar word form dynamics in the deeper layers of a self-supervised artificial speech network, suggesting potential convergence with computational models. Additionally, in a bistable perception task, neural responses were aligned with participants' perceived words on a trial-by-trial basis, further emphasizing the role of these dynamics in word recognition. Together, these findings support a new dynamical model of auditory word forms, highlighting their importance for accessing linguistic meaning.
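The idea of tracking relative rather than absolute time within a word can be illustrated with a small sketch. This is not the paper's analysis code; the frame rate and word boundary times below are illustrative assumptions.

```python
# Illustrative sketch: map frame times onto relative elapsed time within each word,
# given (assumed) word onset/offset times in seconds.
import numpy as np

def relative_time_in_word(frame_times, word_onsets, word_offsets):
    """For each frame, return its position within the containing word on [0, 1]."""
    rel = np.full(len(frame_times), np.nan)
    for on, off in zip(word_onsets, word_offsets):
        mask = (frame_times >= on) & (frame_times < off)
        rel[mask] = (frame_times[mask] - on) / (off - on)
    return rel

frames = np.arange(0.0, 1.0, 0.01)  # 10 ms frames (illustrative)
print(relative_time_in_word(frames, [0.0, 0.4], [0.4, 0.9])[:5])
```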

Language: English

Cited

0

A perceptual similarity space for speech based on self-supervised speech representations

Bronya R. Chernyak, Ann R. Bradlow, Joseph Keshet et al.

The Journal of the Acoustical Society of America, Journal year: 2024, Issue 155(6), pp. 3915-3929

Published: June 1, 2024

Speech recognition by both humans and machines frequently fails in non-optimal yet common situations. For example, word error rates for second-language (L2) speech can be high, especially under conditions involving background noise. At the same time, human and machine recognition sometimes shows remarkable robustness against signal- and noise-related degradation. Which acoustic features of speech explain this substantial variation in intelligibility? Current approaches align speech to text and extract a small set of pre-defined spectro-temporal properties from specific sounds in particular words. However, this leaves much cross-talker variation in intelligibility unexplained. We examine an alternative approach utilizing a perceptual similarity space acquired using self-supervised learning. This approach encodes distinctions between speech samples without requiring pre-defined features or speech-to-text alignment. We show that L2 English speech samples are less tightly clustered than L1 samples, reflecting variability in proficiency among talkers. Critically, distances in this space are perceptually meaningful: listeners have lower recognition accuracy for speakers whose speech is more distant from L1 speech. These results indicate that such a similarity space may form the basis of an entirely new speech and language analysis approach.
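A minimal sketch of the general idea follows: embed utterances with a self-supervised speech model and compare them by distance in the resulting space. This is not the authors' exact pipeline; the checkpoint name and placeholder waveforms are assumptions for illustration.

```python
# Minimal sketch: utterance-level embeddings from a self-supervised speech model,
# compared by Euclidean distance.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

MODEL = "facebook/wav2vec2-base"  # assumed checkpoint; the paper may use a different model
extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL)
model = Wav2Vec2Model.from_pretrained(MODEL).eval()

def embed(waveform_16khz: torch.Tensor) -> torch.Tensor:
    """Mean-pool the final hidden states into one utterance-level vector."""
    inputs = extractor(waveform_16khz.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, frames, dim)
    return hidden.mean(dim=1).squeeze(0)

# Two placeholder one-second "utterances"; real use would load talker recordings.
utt_a, utt_b = torch.randn(16_000), torch.randn(16_000)
print("embedding distance:", torch.dist(embed(utt_a), embed(utt_b)).item())
```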

Language: English

Cited

2

Few-Shot Spoken Language Understanding Via Joint Speech-Text Models

Chung-Ming Chien, Mingjiamei Zhang, Ju-Chieh Chou et al.

2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Journal year: 2023, Issue 9, pp. 1-8

Published: Dec. 16, 2023

Recent work on speech representation models jointly pre-trained with text has demonstrated the potential of improving speech representations by encoding speech and text in a shared space. In this paper, we leverage such shared representations to address the persistent challenge of limited data availability in spoken language understanding tasks. By employing a joint speech-text model, we find that models fine-tuned on limited text data can be effectively transferred to speech testing data. With as little as 1 hour of labeled speech data, our proposed approach achieves comparable performance on spoken language understanding tasks (specifically, sentiment analysis and named entity recognition) when compared to previous methods using speech-only models trained with 10 times more data. Beyond the proof-of-concept study, we also analyze the latent representations. We find that the bottom layers of speech-text models are largely task-agnostic and align speech and text into a shared space, while the top layers are more task-specific.
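The transfer recipe can be sketched with a toy, hypothetical joint-encoder interface (not the paper's released model or API): a classification head over a shared speech-text embedding space is trained on text examples and then applied directly to speech inputs.

```python
# Hedged sketch with a stand-in joint encoder; embed_text/embed_speech and all
# dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class ToyJointEncoder(nn.Module):
    """Stand-in for a joint speech-text encoder mapping both modalities to one space."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.text_proj = nn.Linear(300, dim)
        self.speech_proj = nn.Linear(80, dim)

    def embed_text(self, x: torch.Tensor) -> torch.Tensor:
        return self.text_proj(x)

    def embed_speech(self, x: torch.Tensor) -> torch.Tensor:
        return self.speech_proj(x.mean(dim=1))  # pool over frames

class SharedSpaceClassifier(nn.Module):
    def __init__(self, encoder: ToyJointEncoder, dim: int, num_classes: int):
        super().__init__()
        self.encoder, self.head = encoder, nn.Linear(dim, num_classes)

    def forward(self, batch: torch.Tensor, modality: str) -> torch.Tensor:
        z = self.encoder.embed_text(batch) if modality == "text" else self.encoder.embed_speech(batch)
        return self.head(z)

model = SharedSpaceClassifier(ToyJointEncoder(), dim=256, num_classes=3)
text_logits = model(torch.randn(4, 300), "text")          # fine-tune the head on text labels
speech_logits = model(torch.randn(4, 120, 80), "speech")  # later, evaluate on speech
print(text_logits.shape, speech_logits.shape)
```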

Language: English

Cited

1

Understanding Probe Behaviors Through Variational Bounds of Mutual Information

Kwanghee Choi, Jee-weon Jung, Shinji Watanabe et al.

ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Journal year: 2024, Issue unknown, pp. 5655-5659

Published: March 18, 2024

With the success of self-supervised representations, researchers seek a better understanding of the information encapsulated within a representation. Among various interpretability methods, we focus on classification-based linear probing. We aim to foster a solid understanding and provide guidelines for linear probing by constructing a novel mathematical framework leveraging information theory. First, we connect probing with the variational bounds of mutual information (MI) to relax the probe design, equating linear probing with fine-tuning. Then, we investigate empirical probe behaviors and practices through our framework. We analyze the layer-wise performance curve being convex, which seemingly violates the data processing inequality. However, we show that intermediate representations can have the biggest MI estimate because of a tradeoff between better separability and decreasing MI. We further suggest that the margin of linearly separable representations can be a criterion for measuring the "goodness of representation," and we also compare it with accuracy as a criterion. Finally, we empirically validate our claims by observing how self-supervised speech models retain word and phoneme information.
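Classification-based linear probing of frozen layer representations, the method the paper analyzes, can be sketched as follows. The features and labels here are random placeholders standing in for pre-extracted layer activations and phoneme or word classes.

```python
# Minimal layer-wise linear probing sketch (placeholder data, not the paper's setup).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
num_layers, n, dim = 12, 2000, 768
feats = [rng.normal(size=(n, dim)) for _ in range(num_layers)]  # per-layer representations
labels = rng.integers(0, 40, size=n)                            # e.g., phoneme classes

for layer, X in enumerate(feats):
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)   # linear probe on frozen features
    print(f"layer {layer:2d}: probe accuracy = {probe.score(X_te, y_te):.3f}")
```

Plotting this accuracy against layer index gives the layer-wise performance curve whose shape the paper interprets through variational bounds of MI.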

Language: English

Cited

0

Visually Grounded Speech Models Have a Mutual Exclusivity Bias

Leanne Nortje, Dan Oneață, Yevgen Matusevych et al.

Transactions of the Association for Computational Linguistics, Journal year: 2024, Issue 12, pp. 755-770

Published: Jan. 1, 2024

Abstract: When children learn new words, they employ constraints such as the mutual exclusivity (ME) bias: a novel word is mapped to a novel object rather than a familiar one. This bias has been studied computationally, but only in models that use discrete word representations as input, ignoring the high variability of spoken words. We investigate the ME bias in the context of visually grounded speech models that learn from natural images and continuous speech audio. Concretely, we train a model on familiar words and test its ME bias by asking it to select between a novel and a familiar object when queried with a novel word. To simulate prior acoustic and visual knowledge, we experiment with several initialization strategies using pretrained speech and vision networks. Our findings reveal the ME bias across the different initialization approaches, with a stronger bias in models with more prior (in particular, visual) knowledge. Additional tests confirm the robustness of our results, even when different loss functions are considered. Based on detailed analyses to piece out the model's representation space, we attribute the ME bias to how familiar and novel classes are distinctly separated in the resulting space.
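The test itself reduces to a similarity comparison in a shared audio-visual embedding space. The sketch below uses random placeholder embeddings rather than the paper's model, purely to show the decision rule.

```python
# Illustrative ME-bias test: pick the image whose embedding is most similar to
# the embedding of the spoken query word (placeholder vectors only).
import torch
import torch.nn.functional as F

def me_choice(audio_query: torch.Tensor, novel_img: torch.Tensor, familiar_img: torch.Tensor) -> str:
    sims = torch.stack([
        F.cosine_similarity(audio_query, novel_img, dim=0),
        F.cosine_similarity(audio_query, familiar_img, dim=0),
    ])
    return ["novel", "familiar"][int(sims.argmax())]

q, nov, fam = torch.randn(512), torch.randn(512), torch.randn(512)
print(me_choice(q, nov, fam))
# A model with an ME bias should mostly answer "novel" when queried with novel words.
```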

Language: English

Cited

0

Unveiling the Linguistic Capabilities of a Self-Supervised Speech Model Through Cross-Lingual Benchmark and Layer-Wise Similarity Analysis

Takanori Ashihara, Marc Delcroix, Yusuke Ijima et al.

IEEE Access, Journal year: 2024, Issue 12, pp. 98835-98855

Published: Jan. 1, 2024

Self-supervised learning (SSL), an unsupervised representation learning technique, has received widespread attention across various modalities. Speech, with its inherent complexity encompassing acoustic (e.g., speaker, phoneme, and paralinguistic cues) and linguistic (e.g., words, semantics, syntax) aspects, prompts a fundamental question: how well can speech SSL models capture linguistic knowledge solely from speech data? This study comprehensively analyzes off-the-shelf speech SSL models utilizing three methods: probing tasks, layer contribution examinations, and layer-wise similarity analysis. For the probing task, and to elucidate cross-lingual conditions, we introduce SpeechGLUE/SpeechJGLUE, speech versions of the General Language Understanding Evaluation (GLUE) benchmark and its Japanese variant (JGLUE), both of which comprise diverse natural language understanding tasks. The probing system incorporates a weighted sum, with trainable weights, of all layers' outputs into the downstream models, offering insight into which layers predominantly contribute to addressing each task. The results reveal that speech SSL models encode linguistic information, albeit less sophisticated information than text SSL models. Moreover, the later layers are mainly utilized to tackle the benchmark tasks; to highlight their primary role in encoding linguistic information, we call them LELs. However, in cross-lingual scenarios, e.g., assessing English models on SpeechJGLUE, the layer contributions equalize, suggesting challenges in determining suitable layers or a reliance on acoustic cues. Nevertheless, some models outperform others, implying robustness against language variation. The similarity analysis reveals a block structure within the LELs, particularly evident in WavLM, where the structure becomes unclear for non-English/noise input, reaffirming the presence of LELs.
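The layer contribution examination relies on a trainable weighted sum over all hidden layers of a frozen SSL encoder, a common probing setup. A minimal sketch follows; the number of layers and hidden size are assumptions matching a wav2vec2-base-style model, not the paper's exact configuration.

```python
# Minimal sketch: softmax-normalized trainable weights over layer outputs,
# feeding utterance-level features to a downstream head.
import torch
import torch.nn as nn

class WeightedLayerSum(nn.Module):
    def __init__(self, num_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))  # one scalar per layer

    def forward(self, layer_outputs: list) -> torch.Tensor:
        # layer_outputs: list of (batch, frames, dim) tensors, one per layer
        w = torch.softmax(self.weights, dim=0)
        stacked = torch.stack(layer_outputs, dim=0)        # (layers, batch, frames, dim)
        return (w.view(-1, 1, 1, 1) * stacked).sum(dim=0)  # (batch, frames, dim)

combiner = WeightedLayerSum(num_layers=13)
fake_states = [torch.randn(2, 50, 768) for _ in range(13)]  # placeholder hidden states
pooled = combiner(fake_states).mean(dim=1)                  # utterance-level features
print(pooled.shape)  # torch.Size([2, 768])
# After training, inspecting softmax(combiner.weights) shows which layers dominate a task.
```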

Language: English

Cited

0

Convexity Based Pruning of Speech Representation Models

Teresa Dorszewski, Lenka Tětková, Lars Kai Hansen et al.

Published: Sep. 22, 2024

Speech representation models based on the transformer architecture and trained by self-supervised learning have shown great promise for solving tasks such as speech and speaker recognition, keyword spotting, emotion detection, and more. Typically, it is found that larger models lead to better performance. However, the significant computational effort involved in such large systems is a challenge for embedded and real-world applications. Recent work has shown that there is significant redundancy in NLP transformer models and that massive layer pruning is feasible (Sajjad et al., 2023). Here, we investigate layer pruning in audio models. We base the pruning decision on a convexity criterion. Convexity of classification regions has recently been proposed as an indicator of subsequent fine-tuning performance in a range of application domains, including audio. In empirical investigations, we find a significant reduction in model size with no loss of performance, or even improvements in certain cases.
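The mechanical part of layer pruning (dropping upper transformer layers of a speech SSL encoder before fine-tuning) can be sketched as below. The checkpoint and the cut-off point K are illustrative assumptions; in the paper K is chosen with the convexity criterion, which is not reimplemented here.

```python
# Minimal sketch: keep only the first K transformer layers of a wav2vec 2.0 encoder.
import torch.nn as nn
from transformers import Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")  # assumed checkpoint
K = 6  # illustrative cut-off; the paper selects it via convexity of classification regions
model.encoder.layers = nn.ModuleList(model.encoder.layers[:K])
model.config.num_hidden_layers = K
print(sum(p.numel() for p in model.parameters()) / 1e6, "M parameters after pruning")
```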

Language: English

Cited

0

Deep-learning models reveal how context and listener attention shape electrophysiological correlates of speech-to-language transformation

Andrew J. Anderson, Chris Davis, Edmund C. Lalor et al.

PLoS Computational Biology, Journal year: 2024, Issue 20(11), pp. e1012537-e1012537

Published: Nov. 11, 2024

To transform continuous speech into words, the human brain must resolve variability across utterances in intonation, speech rate, volume, accents and so on. A promising approach to explaining this process has been to model electroencephalogram (EEG) recordings of brain responses to speech. Contemporary models typically invoke context-invariant speech categories (e.g., phonemes) as an intermediary representational stage between sounds and words. However, such models may not capture the complete picture because they do not model the mechanism that categorizes sounds and consequently overlook the associated neural representations. By providing end-to-end accounts of the speech-to-text transformation, new deep-learning systems could enable more complete models. We model EEG recordings of audiobook comprehension with the speech recognition system Whisper. We find that (1) Whisper provides a self-contained model that reflects elements of prelexical and lexical representation and prediction; (2) EEG modeling is more accurate when informed by 5-10 s of speech context, which traditional context-invariant categorical models do not encode; (3) deep layers encoding linguistic structure were more selectively associated with attended speech in two-speaker "cocktail party" listening conditions than early layers encoding acoustics. No layer depth advantage was observed for unattended speech, consistent with a more superficial level of processing in the brain.
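The underlying modeling approach, predicting EEG from time-lagged deep speech-model features, is a standard forward encoding model. The sketch below uses synthetic arrays and plain ridge regression as a stand-in; feature dimensions, lag range, and the evaluation are illustrative assumptions, not the paper's pipeline.

```python
# Minimal forward encoding-model sketch: ridge regression from lagged
# speech-model features (e.g., Whisper layer activations) to EEG channels.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
T, dim, n_channels, max_lag = 5000, 64, 32, 10   # time points, feature dim, EEG channels, lags

features = rng.normal(size=(T, dim))             # placeholder frame-level model activations
eeg = rng.normal(size=(T, n_channels))           # placeholder time-aligned EEG

# Lagged design matrix so past feature frames can predict the current EEG sample.
lagged = np.concatenate([np.roll(features, lag, axis=0) for lag in range(max_lag)], axis=1)
model = Ridge(alpha=1.0).fit(lagged[max_lag:], eeg[max_lag:])
print("in-sample R^2:", model.score(lagged[max_lag:], eeg[max_lag:]))
```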

Language: English

Cited

0

Self-Supervised Syllable Discovery Based on Speaker-Disentangled Hubert

Ryota Komatsu, Takahiro Shinozaki

2022 IEEE Spoken Language Technology Workshop (SLT), Journal year: 2024, Issue unknown, pp. 1131-1136

Published: Dec. 2, 2024

Language: English

Cited

0

Property Neurons in Self-Supervised Speech Transformers

Tzu-Quan Lin, Guan-Ting Lin, Hung-yi Lee et al.

2022 IEEE Spoken Language Technology Workshop (SLT), Journal year: 2024, Issue unknown, pp. 401-408

Published: Dec. 2, 2024

Language: English

Cited

0