Cited by IUNADI at NADI 2023 shared task: Country-level Arabic Dialect Classification in Tweets for the Shared Task NADI 2023

ISL-AAST at NADI 2023 shared task: Enhancing Arabic Dialect Identification in the Era of Globalization and Technological Progress

Shorouk Adel, Noureldin Elmadany

Published: Jan. 1, 2023

Arabic dialects have extensive global usage owing to their significance and the vast number of speakers. However, technological progress and globalization are leading to significant transformations within the dialects. They are acquiring new characteristics involving novel vocabulary and integrating linguistic elements from diverse dialects. Consequently, sentiment analysis of these dialects is becoming more challenging. This study categorizes dialects among 18 countries, as introduced by the Nuanced Arabic Dialect Identification (NADI) shared task competition. Our approach incorporates the utilization of MARABERT v2 models with a range of methodologies, including a feature extraction process. Our findings reveal that the most effective model is achieved by applying averaging and concatenation to the hidden layers of MARABERT v2, followed by feeding the resulting output into convolutional layers. Furthermore, employing an ensemble method on the various methods enhances the model's performance. Our system secures the 6th position among the top performers in the first subtask, achieving an F1 score of 83.73%.

Language: English

DialectNLU at NADI 2023 Shared Task: Transformer Based Multitask Approach Jointly Integrating Dialect and Machine Translation Tasks in Arabic

Hariram Veeramani, Surendrabikram Thapa, Usman Naseem, et al.

Published: Jan. 1, 2023

With approximately 400 million speakers worldwide, Arabic ranks as the fifth most-spoken language globally, necessitating advancements in natural language processing. This paper addresses this need by presenting a system description of the approaches employed for the subtasks outlined in the Nuanced Arabic Dialect Identification (NADI) task at EMNLP 2023. For the first subtask, involving closed country-level dialect identification classification, we employ an ensemble of two models. Similarly, for the second subtask, focused on dialect to Modern Standard Arabic (MSA) machine translation, our approach combines sequence-to-sequence models, all trained on an Arabic-specific dataset. Our team ranks 10th and 3rd on subtasks 1 and 2, respectively.

Language: English

Arabic Dialect Identification under Scrutiny: Limitations of Single-label Classification

Amr Keleg, Walid Magdy

Published: Jan. 1, 2023

Automatic Arabic Dialect Identification (ADI) of text has gained great popularity since it was introduced in the early 2010s. Multiple datasets were developed, and yearly shared tasks have been running since 2018. However, ADI systems are reported to fail in distinguishing between the micro-dialects of Arabic. We argue that the currently adopted framing of the ADI task as a single-label classification problem is one of the main reasons for that. We highlight the limitation of the incompleteness of the dialect labels and demonstrate how it impacts the evaluation of ADI systems. A manual error analysis of the predictions of an ADI system, performed by 7 native speakers of different Arabic dialects, revealed that ≈ 67% of the validated errors are not true errors. Consequently, we propose framing ADI as a multi-label classification task and give recommendations for designing new ADI datasets.

Language: English
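
The evaluation issue raised in the abstract above can be illustrated with a minimal sketch. It is not the authors' code: the function names, the country codes, and the toy data are all hypothetical, and it only assumes that each text may have several valid dialect labels while a single-label dataset records just one of them.

```python
def single_label_accuracy(predictions, gold_labels):
    """Fraction of predictions equal to the one annotated label."""
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return correct / len(predictions)


def multi_label_accuracy(predictions, valid_label_sets):
    """Fraction of predictions that fall inside the set of valid labels."""
    correct = sum(p in valid for p, valid in zip(predictions, valid_label_sets))
    return correct / len(predictions)


# Hypothetical toy data: four texts, one predicted country code each.
predictions = ["EG", "SA", "SY", "MA"]
# Single-label annotation: exactly one country per text.
gold_labels = ["EG", "AE", "LB", "DZ"]
# Multi-label annotation: every country whose dialect the text is valid in.
valid_label_sets = [{"EG"}, {"AE", "SA"}, {"LB", "SY"}, {"DZ"}]

strict = single_label_accuracy(predictions, gold_labels)
relaxed = multi_label_accuracy(predictions, valid_label_sets)
print(strict, relaxed)
```

In this toy setup, two of the three "errors" under single-label scoring are predictions of a dialect in which the text is also valid, so they stop counting as errors under the multi-label view.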

SANA at NADI 2023 shared task: Ensemble of Layer-Wise BERT-based models for Dialectal Arabic Identification

Nada Almarwani, Samah Aloufi

Published: Jan. 1, 2023

Our system, submitted to the Nuanced Arabic Dialect Identification (NADI-23), tackles the first sub-task: Closed Country-level dialect identification. In this work, we propose a model that is based on an ensemble of layer-wise fine-tuned BERT-based models. The proposed model ranked fourth out of sixteen submissions, with an F1-macro score of 85.43.

Language: English

Sequence-to-Sequence Models and Their Evaluation for Spoken Language Normalization of Slovenian

Mirjam Sepesy Maučec, Darinka Verdonik, Gregor Donaj, et al.

Applied Sciences, Journal Year: 2024, Volume and Issue: 14(20), P. 9515 - 9515

Published: Oct. 18, 2024

Sequence-to-sequence models have been applied to many challenging problems, including those in text and speech technologies. Normalization is one of them. It refers to transforming non-standard language forms into their standard counterparts. Non-standard forms come from different written and spoken sources. This paper deals with one such source, namely spoken Slovenian, a less-resourced, highly inflected language. The paper explores corpora recently collected in public and private environments. We analyze the efficiencies of three sequence-to-sequence models for automatic normalization of literal transcriptions into standard forms. Experiments were performed using words, subwords, and characters as basic units for normalization. In the article, we demonstrate that the superiority of an approach is linked to the choice of modeling unit: statistical models and neural network-based models prefer different units, with the neural network-based models preferring characters. The experimental results show that the best results are obtained with neural network-based architectures; the long short-term memory and transformer architectures gave comparable results. We also present a novel analysis tool, which we use for in-depth error analysis of the results obtained by character-based models. The analysis showed that systems with similar overall results can differ in their performance on different types of errors. Errors made by one of the architectures are easier to correct in the post-editing process. This is an important insight, since creating such resources is time-consuming and costly. The tool also incorporates two statistical significance tests: approximate randomization and bootstrap resampling. Both tests confirm the improved results of the neural network-based models compared to the statistical ones.

Language: English
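
The two significance tests named in the abstract above can be sketched in a few lines. This is a generic illustration under stated assumptions, not the tool the authors describe: the function names, the per-item score lists, and the default trial counts are hypothetical, and the sketch assumes two systems scored on the same items with higher scores being better.

```python
import random


def approximate_randomization(scores_a, scores_b, trials=2000, seed=0):
    """Two-sided paired approximate randomization test.

    Randomly swaps each pair of scores and counts how often the shuffled
    difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(sum(a - b for a, b in zip(scores_a, scores_b)))
    extreme = 0
    for _ in range(trials):
        diff = 0
        for a, b in zip(scores_a, scores_b):
            # Swap the two systems' scores for this item with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            diff += a - b
        if abs(diff) >= observed:
            extreme += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (extreme + 1) / (trials + 1)


def paired_bootstrap(scores_a, scores_b, samples=2000, seed=0):
    """Fraction of bootstrap resamples in which system A beats system B."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(samples):
        # Resample item indices with replacement, same indices for both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / samples


# Hypothetical per-item scores (1 = item normalized correctly, 0 = not).
system_a = [1] * 30
system_b = [0] * 30
print(approximate_randomization(system_a, system_b))
print(paired_bootstrap(system_a, system_b))
```

A small p-value from the randomization test, or a win fraction close to 1 from the bootstrap, suggests the difference between the two systems is unlikely to be due to chance on this test set.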

Multitask learning for Arabic Dialects Identification and Machine Translation

Mohamed Dhleima, Mohamedou Cheikh Tourad, Cheikh Abdelkader Ahmed Telmoud, et al.

Lecture notes in networks and systems, Journal Year: 2024, Volume and Issue: unknown, P. 284 - 292

Published: Jan. 1, 2024

Language: English

IUNADI at NADI 2023 shared task: Country-level Arabic Dialect Classification in Tweets for the Shared Task NADI 2023

Yash Hatekar, Muhammad Abdo

Published: Jan. 1, 2023

In this paper, we describe our participation in the NADI2023 shared task for the classification of Arabic dialects in tweets. For training, evaluation, and testing purposes, a primary dataset comprising tweets from 18 Arab countries is provided, along with three older datasets. The main objective is to develop a model capable of classifying tweets from these 18 countries. We outline our approach, which leverages various machine learning models. Our experiments demonstrate that large language models, particularly Arabertv2-Large, Arabertv2-Base, and CAMeLBERT-Mix DID MADAR, consistently outperform traditional methods such as SVM, XGBOOST, Multinomial Naive Bayes, AdaBoost, and Random Forests.

Language: English
