IUNADI at NADI 2023 shared task: Country-level Arabic Dialect Classification in Tweets for the Shared Task NADI 2023

Yash Hatekar,

Muhammad Abdo

Published: Jan. 1, 2023

In this paper, we describe our participation in the NADI 2023 shared task for the classification of Arabic dialect tweets. For training, evaluation, and testing purposes, a primary dataset comprising tweets from 18 Arab countries is provided, along with three older datasets. The main objective is to develop a model capable of classifying tweets from these countries. We outline our approach, which leverages various machine learning models. Our experiments demonstrate that large language models, particularly AraBERTv2-Large, AraBERTv2-Base, and CAMeLBERT-Mix DID MADAR, consistently outperform traditional methods such as SVM, XGBoost, Multinomial Naive Bayes, AdaBoost, and Random Forests.
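
As a contrast to the fine-tuned language models above, a traditional dialect-ID baseline can be sketched in a few lines. The following is a toy illustration, not the authors' code: a character n-gram profile classifier over invented romanized snippets, standing in for the TF-IDF-style features fed to the SVM and Naive Bayes baselines.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-gram counts; robust to the noisy spelling of tweets."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def build_profiles(texts, labels, n=3):
    """Accumulate one n-gram profile per dialect label."""
    profiles = {}
    for text, label in zip(texts, labels):
        profiles.setdefault(label, Counter()).update(char_ngrams(text, n))
    return profiles

def classify(text, profiles, n=3):
    """Pick the dialect whose profile overlaps most with the input's n-grams."""
    grams = char_ngrams(text, n)
    return max(profiles,
               key=lambda lab: sum(profiles[lab][g] * c for g, c in grams.items()))

# Toy romanized examples, invented purely for illustration.
texts = ["izzayak 3amel eh", "eh el akhbar ya basha",
         "shlonak shakhbarak", "shino hatha"]
labels = ["Egypt", "Egypt", "Gulf", "Gulf"]
profiles = build_profiles(texts, labels)
print(classify("izzayak ya basha", profiles))  # → Egypt
```

Real submissions replace the overlap score with a trained classifier over TF-IDF weights, but the feature space (character n-grams) is the same idea.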

Language: English

Enhancing Arabic Dialect Detection on Social Media: A Hybrid Model with an Attention Mechanism
Wael M. S. Yafooz

Information, Journal Year: 2024, Volume and Issue: 15(6), P. 316 - 316

Published: May 28, 2024

Recently, the widespread use of social media and easy access to the Internet have brought about a significant transformation in the type of textual data available on the Web. This change is particularly evident in Arabic language usage, as a growing number of users from diverse domains has led to a considerable influx of text in various dialects, each characterized by differences in morphology, syntax, vocabulary, and pronunciation. Consequently, researchers in speech recognition and natural language processing have become increasingly interested in identifying Arabic dialects. Numerous methods have been proposed to recognize this informal data, owing to its crucial implications for several applications, such as sentiment analysis, topic modeling, text summarization, and machine translation. However, dialect identification remains a challenge due to the vast diversity of Arabic dialects. This study introduces a novel hybrid deep learning model incorporating an attention mechanism for detecting and classifying Arabic dialects. Several experiments were conducted using a dataset collected from user-generated comments on Twitter in four dialects, namely Egyptian, Gulf, Jordanian, and Yemeni, to evaluate the effectiveness of the proposed model. The dataset comprises 34,905 rows extracted from Twitter, representing an unbalanced distribution. The annotation was performed by native speakers proficient in each dialect. The results demonstrate that the proposed model outperforms long short-term memory (LSTM), bidirectional LSTM, and logistic regression models in dialect classification using different word representations, as follows: term frequency-inverse document frequency (TF-IDF), Word2Vec, and global vector representation (GloVe).
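
The attention mechanism at the heart of such hybrid models can be sketched minimally. The snippet below is an illustrative dot-product scoring with softmax pooling over recurrent hidden states, not the paper's exact formulation; the weight vector and dimensions are invented.

```python
import math

def attention_pool(hidden_states, w):
    """Score each timestep's hidden vector against a learned query w,
    softmax the scores into attention weights, and return the weighted
    sum (context vector) plus the weights themselves."""
    scores = [sum(hi * wi for hi, wi in zip(h, w)) for h in hidden_states]
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    alphas = [e / z for e in exps]        # attention weights, sum to 1
    dim = len(hidden_states[0])
    context = [sum(a * h[d] for a, h in zip(alphas, hidden_states))
               for d in range(dim)]
    return context, alphas

# Toy 2-dimensional hidden states for a 3-token sequence (invented values).
states = [[0.2, 0.1], [0.9, 0.4], [0.1, 0.0]]
context, alphas = attention_pool(states, w=[1.0, 0.5])
print([round(a, 3) for a in alphas])
```

The context vector then feeds the classifier head, letting the model focus on the tokens most indicative of a dialect.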

Language: English

Citations

6

ArabBert-LSTM: improving Arabic sentiment analysis based on transformer model and Long Short-Term Memory
Wael Alosaimi,

Hager Saleh,

Ali A. Hamzah

et al.

Frontiers in Artificial Intelligence, Journal Year: 2024, Volume and Issue: 7

Published: July 2, 2024

Sentiment analysis, also referred to as opinion mining, plays a significant role in automating the identification of negative, positive, or neutral sentiments expressed in textual data. The proliferation of social networks, review sites, and blogs has rendered these platforms valuable resources for mining opinions. Sentiment analysis finds applications in various domains and languages, including English and Arabic. However, Arabic presents unique challenges due to its complex morphology, characterized by inflectional and derivational patterns. To effectively analyze sentiment in Arabic text, techniques must account for this intricacy. This paper proposes a model designed using transformer and deep learning (DL) techniques. The word embedding is represented by the Transformer-based Model for Arabic Language Understanding (ArabBert) and then passed to the AraBERT model. The output is subsequently fed into a Long Short-Term Memory (LSTM) model, followed by feedforward neural networks and an output layer. AraBERT is used to capture rich contextual information, and the LSTM to enhance sequence modeling and retain long-term dependencies within the text. We compared the proposed model with machine learning (ML) and DL algorithms, as well as with different vectorization techniques: term frequency-inverse document frequency (TF-IDF), ArabBert, Continuous Bag-of-Words (CBOW), and Skip-gram, on four benchmark datasets. Through extensive experimentation and evaluation on these datasets, we showcase the effectiveness of our approach. The results underscore improvements in accuracy, highlighting the potential of leveraging transformer models for sentiment analysis. The outcomes of this research contribute to advancing Arabic sentiment analysis, enabling more accurate and reliable analysis of Arabic text. The findings reveal that the proposed framework exhibits exceptional performance in sentiment classification, achieving an impressive accuracy rate of over 97%.
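
The LSTM stage of this pipeline can be illustrated with a deliberately tiny one-dimensional cell. This is a didactic sketch only: real models run hundreds of hidden units over 768-dimensional AraBERT embeddings, and all weights below are invented.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm(seq, wx, wh, b):
    """Minimal 1-dimensional LSTM over a sequence of scalar features.
    wx, wh, b are dicts keyed by gate: 'i' (input), 'f' (forget),
    'o' (output), 'g' (candidate). Stands in for the LSTM layer the
    paper stacks on top of contextual embeddings."""
    h, c = 0.0, 0.0
    for x in seq:
        i = sigmoid(wx['i'] * x + wh['i'] * h + b['i'])   # input gate
        f = sigmoid(wx['f'] * x + wh['f'] * h + b['f'])   # forget gate
        o = sigmoid(wx['o'] * x + wh['o'] * h + b['o'])   # output gate
        g = math.tanh(wx['g'] * x + wh['g'] * h + b['g']) # candidate state
        c = f * c + i * g                                 # cell state update
        h = o * math.tanh(c)                              # hidden state
    return h  # final hidden state, fed to a feedforward classifier head

w = {'i': 0.5, 'f': 0.5, 'o': 0.5, 'g': 0.5}
b = {'i': 0.0, 'f': 1.0, 'o': 0.0, 'g': 0.0}
h = lstm([0.1, -0.2, 0.3], wx=w, wh=w, b=b)
print(round(h, 4))
```

The final hidden state summarizes the sequence; the paper's feedforward layers then map it to a sentiment label.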

Language: English

Citations

3

Advancing low-resource dialect identification: A hybrid cross-lingual model leveraging CAMeLBERT and FastText for Algerian Arabic
Mafaza Chabane, Fouzi Harrag, Khaled Shaalan

et al.

Expert Systems with Applications, Journal Year: 2025, Volume and Issue: 284, P. 127816 - 127816

Published: May 5, 2025

Language: English

Citations

0

Enhancing Arabic Sentiment Analysis of Consumer Reviews: Machine Learning and Deep Learning Methods Based on NLP
Hani Almaqtari, Feng Zeng,

Ammar Mohammed

et al.

Algorithms, Journal Year: 2024, Volume and Issue: 17(11), P. 495 - 495

Published: Nov. 3, 2024

Sentiment analysis utilizes Natural Language Processing (NLP) techniques to extract opinions from text, which is critical for businesses looking to refine strategies and better understand customer feedback. Understanding people's sentiments about products through their emotional tone is paramount. However, analyzing sentiment in Arabic and its dialects poses challenges due to the language's intricate morphology, right-to-left script, and nuanced expressions. To address this, this study introduces the Arb-MCNN-Bi model, which integrates the strengths of the transformer-based AraBERT (Arabic Bidirectional Encoder Representations from Transformers) model with a Multi-channel Convolutional Neural Network (MCNN) and a Bidirectional Gated Recurrent Unit (BiGRU) for sentiment analysis. AraBERT, designed specifically for Arabic, captures rich contextual information through word embeddings. These embeddings are processed by the MCNN to enhance feature extraction and by the BiGRU to retain long-term dependencies. The final output is obtained through feedforward neural networks. The study compares the proposed model with various machine learning and deep learning methods, applying advanced NLP techniques such as Term Frequency-Inverse Document Frequency (TF-IDF), n-grams, Word2Vec (Skip-gram), and fastText (Skip-gram). Experiments were conducted on three datasets: the Arabic Customer Reviews Dataset (ACRD), the Large-scale Arabic Book Reviews dataset (LABR), and the Hotel Arabic Reviews Dataset (HARD). The model achieved accuracies of 96.92%, 96.68%, and 92.93% on the ACRD, HARD, and LABR datasets, respectively. The results demonstrate the model's effectiveness in analyzing Arabic text data, outperforming traditional approaches.

Language: English

Citations

1

Dialect-to-Standard Normalization: A Large-Scale Multilingual Evaluation
Olli Kuparinen, Aleksandra Miletić, Yves Scherrer

et al.

Published: Jan. 1, 2023

Text normalization methods have been commonly applied to historical language or user-generated content, but less often to dialectal transcriptions. In this paper, we introduce dialect-to-standard normalization – i.e., mapping phonetic transcriptions from different dialects to the orthographic norm of the standard variety – as a distinct sentence-level character transduction task, and provide a large-scale analysis of normalization methods. To this end, we compile a multilingual dataset covering four languages: Finnish, Norwegian, Swiss German, and Slovene. For the two biggest corpora, we provide three data splits corresponding to different use cases for automatic normalization. We evaluate the most successful sequence-to-sequence model architectures proposed for text normalization tasks, using different tokenization approaches and context sizes. We find that a character-level Transformer trained on sliding windows of words works best for Slovene, whereas the pre-trained byT5 model working on full sentences obtains the best results for Norwegian. Finally, we perform an error analysis to evaluate the effect of the data splits on performance.
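
The sliding-window setup described for the character-level Transformer can be sketched as a simple preprocessing step. The window size and example sentence below are invented for illustration; each window then becomes one character-level transduction example.

```python
def sliding_windows(words, size=3):
    """Split a tokenized sentence into overlapping word windows.
    Each window is normalized independently at the character level,
    giving the model local context without full-sentence length."""
    if len(words) <= size:
        return [words]
    return [words[i:i + size] for i in range(len(words) - size + 1)]

# Hypothetical dialect transcription (invented tokens).
src = "taz om ny gohtat".split()
for window in sliding_windows(src, 3):
    print(" ".join(window))
```

At inference time the per-window outputs are stitched back together; the trade-off against full-sentence models like byT5 is exactly the context-size question the paper evaluates.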

Language: English

Citations

3

The Helsinki-NLP Submissions at NADI 2023 Shared Task: Walking the Baseline
Yves Scherrer, Aleksandra Miletić, Olli Kuparinen

et al.

Published: Jan. 1, 2023

The Helsinki-NLP team participated in the NADI 2023 shared tasks on Arabic dialect translation with seven submissions. We used statistical (SMT) and neural machine translation (NMT) methods and explored character- and subword-based data preprocessing. Our submissions placed second in both tracks. In the open track, our winning submission is a character-level SMT system with additional Modern Standard Arabic language models. In the closed track, the best BLEU scores were obtained by the leave-as-is baseline, a simple copy of the input, narrowly followed by our SMT systems. In both tracks, fine-tuning existing multilingual models such as AraT5 or ByT5 did not yield superior performance compared to SMT.
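
The leave-as-is baseline is worth making concrete: it simply copies the input, and it is strong precisely because many dialectal sentences already coincide with valid MSA. Below is a toy sketch with invented romanized examples, scored by exact match rather than BLEU for brevity.

```python
def leave_as_is(dialect_sentences):
    """The 'leave-as-is' baseline: predict the MSA output to be the
    input itself, unchanged."""
    return list(dialect_sentences)

# Invented toy data: some sentences differ from the reference, one does not.
src = ["w allah", "ma baaref", "shukran"]
ref = ["wallah", "la aerf", "shukran"]

pred = leave_as_is(src)
exact = sum(p == r for p, r in zip(pred, ref)) / len(ref)
print(exact)  # → 1/3 of toy sentences already match the reference
```

When the true overlap between dialect and MSA is high, even learned systems struggle to beat this copy baseline, which is the phenomenon the abstract reports for the closed track.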

Language: English

Citations

2

Mavericks at NADI 2023 Shared Task: Unravelling Regional Nuances through Dialect Identification using Transformer-based Approach

Vedant Deshpande,

Yash Patwardhan,

Kshitij Deshpande

et al.

Published: Jan. 1, 2023

In this paper, we present our approach for the "Nuanced Arabic Dialect Identification (NADI) Shared Task 2023". We highlight our methodology for subtask 1, which deals with country-level dialect identification. Recognizing dialects plays an instrumental role in enhancing the performance of various downstream NLP tasks such as speech recognition and translation. The task uses the Twitter dataset (TWT-2023) that encompasses 18 dialects for the multi-class classification problem. Numerous transformer-based models, pre-trained on Arabic language data, are employed for identifying the dialects. We fine-tune these state-of-the-art models on the provided dataset. An ensembling method is leveraged to yield an improved system. We achieved an F1-score of 76.65 (11th rank on the leaderboard) on the test set.
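
The ensembling step can be illustrated as hard voting over per-model label predictions. The abstract does not specify the exact ensembling scheme, so this majority vote is only one plausible sketch, with invented country codes.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Hard-voting ensemble: for each example, output the label predicted
    by the most models. predictions_per_model is a list of lists, one
    inner list of labels per fine-tuned model."""
    n_examples = len(predictions_per_model[0])
    out = []
    for i in range(n_examples):
        votes = Counter(preds[i] for preds in predictions_per_model)
        out.append(votes.most_common(1)[0][0])
    return out

# Three hypothetical models voting on three tweets.
m1 = ["EG", "SA", "MA"]
m2 = ["EG", "JO", "MA"]
m3 = ["LY", "JO", "MA"]
print(majority_vote([m1, m2, m3]))  # → ['EG', 'JO', 'MA']
```

Soft voting (averaging per-class probabilities) is a common alternative when the models expose calibrated scores.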

Language: English

Citations

1

NLPeople at NADI 2023 Shared Task: Arabic Dialect Identification with Augmented Context and Multi-Stage Tuning

Mohab Elkaref,

Movina Moses,

Shinnosuke Tanaka

et al.

Published: Jan. 1, 2023

This paper presents the approach of the NLPeople team to the Nuanced Arabic Dialect Identification (NADI) 2023 shared task. Subtask 1 involves identifying the dialect of a source text at the country level. Our approach makes use of language-specific language models, a clustering and retrieval method to provide additional context to the target sentence, a fine-tuning strategy which uses the provided data from the 2020 and 2021 shared tasks, and finally, ensembling over the predictions of multiple models. Our submission achieves a macro-averaged F1 score of 87.27, ranking 1st among the other participants in the task.
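
The macro-averaged F1 used to rank these submissions weights every country equally, regardless of how many tweets it contributes. A self-contained sketch with invented toy labels:

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute F1 per class, then average with equal
    weight, so rare dialects count as much as frequent ones."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

print(round(macro_f1(["EG", "EG", "SA", "JO"], ["EG", "SA", "SA", "JO"]), 3))  # → 0.778
```

Micro-averaged F1, by contrast, would pool all decisions and be dominated by the largest countries, which is why dialect-ID leaderboards typically prefer the macro variant.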

Language: English

Citations

1

UoT at NADI 2023 shared task: Automatic Arabic Dialect Identification is Made Possible

Abduslam F A Nwesri,

Nabila A S Shinbir,

Hassan Ebrahem

et al.

Published: Jan. 1, 2023

In this paper, we present our approach towards Arabic Dialect Identification, which was part of the Fourth Nuanced Arabic Dialect Identification Shared Task (NADI 2023). We tested several techniques to identify dialects. We obtained the best result by fine-tuning the pre-trained MARBERTv2 model with a modified training dataset. The training set was expanded by sorting tweets based on dialects, concatenating every two adjacent tweets, and adding them to the original dataset as new tweets. We achieved an 82.87 F1 score and were at the seventh position among 16 participants.
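
The data expansion described above is easy to sketch. One detail is ambiguous from the abstract alone: "every two adjacent tweets" could mean overlapping or non-overlapping pairs; the sketch below assumes overlapping pairs, and all tweet strings are placeholders.

```python
def augment_by_concatenation(tweets, labels):
    """Group tweets by dialect label, concatenate adjacent tweets within
    each group, and append the concatenations as new labeled examples.
    Overlapping pairing is an assumption, not confirmed by the paper."""
    by_dialect = {}
    for tweet, label in zip(tweets, labels):
        by_dialect.setdefault(label, []).append(tweet)

    new_tweets, new_labels = list(tweets), list(labels)
    for label, group in by_dialect.items():
        for a, b in zip(group, group[1:]):   # adjacent (overlapping) pairs
            new_tweets.append(a + " " + b)
            new_labels.append(label)
    return new_tweets, new_labels

t, l = augment_by_concatenation(["t1", "t2", "t3"], ["EG", "EG", "SA"])
print(len(t))  # → 4: three originals plus one concatenated Egyptian pair
```

Longer concatenated inputs expose the model to more dialect-internal context per example, which is plausibly why this augmentation helped the fine-tuned MARBERTv2.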

Language: English

Citations

1

UniManc at NADI 2023 Shared Task: A Comparison of Various T5-based Models for Translating Arabic Dialectical Text to Modern Standard Arabic

Abdullah Khered,

Ingy Abdelhalim,

Nadine Abdelhalim

et al.

Published: Jan. 1, 2023

This paper presents the methods we developed for the Nuanced Arabic Dialect Identification (NADI) 2023 shared task, specifically targeting the two subtasks focussed on sentence-level machine translation (MT) of text written in any of four Arabic dialects (Egyptian, Emirati, Jordanian, and Palestinian) to Modern Standard Arabic (MSA). Our team, UniManc, employed models based on T5: multilingual T5 (mT5), multi-task fine-tuned mT5 (mT0), and AraT5. These models were trained in two configurations: joint model training for all regional dialects (J-R) and independent model training for every dialect (I-R). Based on the results of the official NADI evaluation, our I-R AraT5 model obtained an overall BLEU score of 14.76, ranking first in the Closed Dialect-to-MSA MT subtask. Moreover, in the Open subtask, our J-R AraT5 model also ranked first, obtaining a BLEU score of 21.10.
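
The two training configurations (J-R vs. I-R) amount to different ways of grouping the training triples before fine-tuning. A minimal sketch with invented toy examples; the mode names mirror the abstract, everything else is illustrative.

```python
def build_training_sets(examples, mode):
    """'J-R' pools all dialects into one joint training set; 'I-R' builds
    an independent training set per dialect. examples is a list of
    (dialect, source_text, msa_text) triples."""
    if mode == "J-R":
        return {"all": [(src, msa) for _, src, msa in examples]}
    sets = {}
    for dialect, src, msa in examples:
        sets.setdefault(dialect, []).append((src, msa))
    return sets

# Invented toy parallel data.
data = [("Egyptian", "ezayak", "kayfa haluka"),
        ("Jordanian", "kifak", "kayfa haluka")]

print(len(build_training_sets(data, "J-R")["all"]))  # → 2 pooled pairs
print(len(build_training_sets(data, "I-R")))         # → 2 dialect-specific sets
```

J-R trades per-dialect specialization for more data per model, while I-R does the opposite; the abstract's results suggest the better trade-off differed between the closed and open tracks.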

Language: English

Citations

1