Dialect-to-Standard Normalization: A Large-Scale Multilingual Evaluation
Olli Kuparinen, Aleksandra Miletić, Yves Scherrer

et al.

Published: Jan. 1, 2023

Text normalization methods have been commonly applied to historical language or user-generated content, but less often to dialectal transcriptions. In this paper, we introduce dialect-to-standard normalization (i.e., mapping phonetic transcriptions from different dialects to the orthographic norm of the standard variety) as a distinct sentence-level character transduction task and provide a large-scale analysis of dialect-to-standard normalization methods. To this end, we compile a multilingual dataset covering four languages: Finnish, Norwegian, Swiss German and Slovene. For the two biggest corpora, we provide three different data splits corresponding to different use cases for automatic normalization. We evaluate the most successful sequence-to-sequence model architectures proposed for text normalization tasks using different tokenization approaches and context sizes. We find that a character-level Transformer trained on sliding windows of three words works best for Finnish, Swiss German and Slovene, whereas the pre-trained byT5 model using full sentences obtains the best results for Norwegian. Finally, we perform an error analysis to evaluate the effect of different data splits on model performance.

Language: English

NADI 2023: The Fourth Nuanced Arabic Dialect Identification Shared Task
Muhammad Abdul-Mageed, AbdelRahim Elmadany, Chiyu Zhang

et al.

Published: Jan. 1, 2023

We describe the findings of the fourth Nuanced Arabic Dialect Identification Shared Task (NADI 2023). The objective of NADI is to help advance state-of-the-art Arabic NLP by creating opportunities for teams of researchers to collaboratively compete under standardized conditions. It does so with a focus on Arabic dialects, offering novel datasets and defining subtasks that allow for meaningful comparisons between different approaches. NADI 2023 targeted both dialect identification (Subtask 1) and dialect-to-MSA machine translation (Subtask 2 and Subtask 3). A total of 58 unique teams registered for the shared task, of whom 18 teams have participated (with 76 valid submissions during the test phase). Among these, 16 teams participated in Subtask 1, 5 participated in Subtask 2, and 3 participated in Subtask 3. The winning teams achieved 87.27 F1 on Subtask 1, 14.76 Bleu in Subtask 2, and 21.10 Bleu in Subtask 3, respectively. Results show that all three subtasks remain challenging, thereby motivating future work in this area. We describe the methods employed by the participating teams and briefly offer an outlook for NADI.

Language: English

Citations

17

Dialect-to-Standard Normalization: A Large-Scale Multilingual Evaluation
Olli Kuparinen, Aleksandra Miletić, Yves Scherrer

et al.

Published: Jan. 1, 2023

Language: English

Citations

3