Sequence-to-Sequence Models and Their Evaluation for Spoken Language Normalization of Slovenian
Mirjam Sepesy Maučec, Darinka Verdonik, Gregor Donaj

et al.

Applied Sciences, Journal Year: 2024, Issue: 14(20), pp. 9515 - 9515

Published: Oct. 18, 2024

Sequence-to-sequence models have been applied to many challenging problems, including those in text and speech technologies. Normalization is one of them. It refers to transforming non-standard language forms into their standard counterparts. Non-standard forms come from different written and spoken sources. This paper deals with one such source, namely speech in the less-resourced, highly inflected Slovenian language. The paper explores spoken corpora recently collected in public and private environments. We analyze the efficiency of three sequence-to-sequence models for the automatic normalization of literal transcriptions of spoken forms. Experiments were performed using words, subwords, and characters as the basic units of normalization. In the article, we demonstrate that the superiority of an approach is linked to the choice of modeling unit. Statistical models prefer words, while neural network-based models prefer characters. The experimental results show that the best results are obtained with character-based neural architectures; Long short-term memory and transformer models gave comparable results. We also present a novel analysis tool, which we use for an in-depth error analysis of the results obtained by the character-based models. The analysis showed that systems with similar overall performance can differ in the types of errors they make. Errors made by one architecture are easier to correct in the post-editing process, which is an important insight, as creating normalized data is time-consuming and costly. The analysis tool also incorporates two statistical significance tests: approximate randomization and bootstrap resampling. Both tests confirm that the results of the neural models are improved compared to the statistical ones.
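
A minimal sketch of one of the two significance tests named above, paired bootstrap resampling, applied to the outputs of two normalization systems. The sentence-level exact-match accuracy metric and all names here are illustrative assumptions, not the authors' actual tool.

import random

def accuracy(hyps, refs):
    # Fraction of sentences normalized exactly as in the reference (a stand-in metric).
    return sum(h == r for h, r in zip(hyps, refs)) / len(refs)

def paired_bootstrap(hyps_a, hyps_b, refs, n_samples=1000, seed=0):
    # Resample the test set with replacement and count how often system A beats B.
    rng = random.Random(seed)
    n = len(refs)
    wins_a = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]
        score_a = accuracy([hyps_a[i] for i in idx], [refs[i] for i in idx])
        score_b = accuracy([hyps_b[i] for i in idx], [refs[i] for i in idx])
        if score_a > score_b:
            wins_a += 1
    # Share of resamples in which A does not outperform B; small values suggest
    # the improvement of A over B is statistically significant.
    return 1.0 - wins_a / n_samples

# Example: p = paired_bootstrap(neural_outputs, statistical_outputs, references)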

Language: English

Natural Language Processing for Dialects of a Language: A Survey
Aditya Joshi, Raj Dabre, Diptesh Kanojia

et al.

ACM Computing Surveys, Journal Year: 2025, Issue: unknown

Published: Jan. 13, 2025

State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report superlative performance on evaluation datasets. This survey delves into an important attribute of these datasets: the dialect of a language. Motivated by the performance degradation of NLP models for dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches. We describe a wide range of NLP tasks in two categories: natural language understanding (NLU) (for tasks such as dialect classification, sentiment analysis, parsing, and NLU benchmarks) and natural language generation (NLG) (such as summarisation, machine translation, and dialogue systems). The survey is also broad in its coverage of languages, which include English, Arabic, German, among others. We observe that past work concerning dialects goes deeper than mere dialect classification and extends to several NLU and NLG tasks. For these tasks, we describe classical machine learning approaches using statistical models, along with recent deep learning-based approaches based on pre-trained models. We expect this survey will be useful to researchers interested in building equitable language technologies by rethinking LLM benchmarks and model architectures.
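
As a concrete illustration of the classical statistical approach the survey contrasts with pre-trained models, the sketch below trains a character n-gram classifier for dialect identification. The toy sentences, dialect labels, and choice of scikit-learn are illustrative assumptions, not taken from the survey.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical dialectal sentences with hypothetical dialect labels.
train_texts = ["shlonak il-yom", "kifak el-yom", "ezayak innaharda"]
train_labels = ["gulf", "levantine", "egyptian"]

# Character n-grams are a common feature choice for dialect identification,
# since dialects often differ in short orthographic patterns rather than words.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
clf.fit(train_texts, train_labels)
print(clf.predict(["kifak el-masa"]))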

Language: English

Cited

4

The Helsinki-NLP Submissions at NADI 2023 Shared Task: Walking the Baseline
Yves Scherrer, Aleksandra Miletić, Olli Kuparinen

et al.

Published: Jan. 1, 2023

The Helsinki-NLP team participated in the NADI 2023 shared tasks on Arabic dialect translation with seven submissions. We used statistical (SMT) and neural machine translation (NMT) methods and explored character- and subword-based data preprocessing. Our submissions placed second in both tracks. In the open track, our winning submission is a character-level SMT system with additional Modern Standard Arabic language models. In the closed track, the best BLEU scores were obtained with the leave-as-is baseline, a simple copy of the input, narrowly followed by our SMT systems. In both tracks, fine-tuning existing multilingual models such as AraT5 or ByT5 did not yield superior performance compared to SMT.
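
The leave-as-is baseline mentioned above can be reproduced in a few lines: the system output is simply a copy of the dialectal input, scored against the Modern Standard Arabic references with BLEU. The file names and the use of the sacrebleu package are assumptions for illustration.

import sacrebleu

# Hypothetical files: one dialectal source sentence and one MSA reference per line.
with open("dialect_input.txt", encoding="utf-8") as f:
    sources = [line.strip() for line in f]
with open("msa_reference.txt", encoding="utf-8") as f:
    references = [line.strip() for line in f]

copy_hypotheses = sources  # the baseline leaves the input unchanged
score = sacrebleu.corpus_bleu(copy_hypotheses, [references])
print(f"Copy baseline BLEU: {score.score:.2f}")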

Language: English

Cited

2
