In this paper, we describe our participation in the NADI 2023 shared task on the classification of Arabic dialect tweets. For training, evaluation, and testing purposes, a primary dataset comprising tweets from 18 Arab countries is provided, along with three older datasets. The main objective is to develop a model capable of classifying tweets into these countries. We outline our approach, which leverages various machine learning models. Our experiments demonstrate that large language models, particularly Arabertv2-Large, Arabertv2-Base, and CAMeLBERT-Mix DID MADAR, consistently outperform traditional methods such as SVM, XGBoost, Multinomial Naive Bayes, AdaBoost, and Random Forests.
Information, Journal Year: 2024, Volume and Issue: 15(6), P. 316 - 316, Published: May 28, 2024
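The traditional baselines compared above (SVM, XGBoost, Multinomial Naive Bayes, and so on) typically operate on bag-of-words features such as TF-IDF rather than on contextual embeddings. As a minimal sketch of that feature representation (not the paper's actual pipeline), TF-IDF weights each term by its in-document frequency scaled by log(N/df):

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per
    document, using raw term frequency and idf = log(N / df)."""
    n = len(docs)
    df = Counter()                       # document frequency of each term
    for doc in docs:
        df.update(set(doc))              # count each term once per document
    vectors = []
    for doc in docs:
        tf = Counter(doc)                # raw term frequency in this document
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

# Terms appearing in every document get weight 0; rarer terms score higher.
vectors = tfidf([["a", "b"], ["a", "c"]])
```

A classifier such as Multinomial Naive Bayes is then trained on these sparse vectors, which is why such baselines struggle with dialectal variation that shares most of its vocabulary.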
Recently, the widespread use of social media and easy access to the Internet have brought about a significant transformation in the type of textual data available on the Web. This change is particularly evident in Arabic language usage, as the growing number of users from diverse domains has led to a considerable influx of text in various dialects, each characterized by differences in morphology, syntax, vocabulary, and pronunciation. Consequently, researchers in natural language processing have become increasingly interested in the recognition and identification of dialects. Numerous methods have been proposed to recognize this informal data, owing to its crucial implications for several applications, such as sentiment analysis, topic modeling, summarization, and machine translation. However, dialect identification remains a challenge due to the vast diversity of dialects. This study introduces a novel hybrid deep learning model incorporating an attention mechanism for detecting and classifying Arabic dialects. Several experiments were conducted using a dataset collected from user-generated comments on Twitter in four dialects, namely Egyptian, Gulf, Jordanian, and Yemeni, to evaluate the effectiveness of the model. The dataset comprises 34,905 rows extracted from Twitter, representing an unbalanced distribution. The annotation was performed by native speakers proficient in each dialect. The results demonstrate that the proposed model outperforms long short-term memory, bidirectional long short-term memory, and logistic regression models in classification with different word representations as follows: term frequency-inverse document frequency, Word2Vec, and global vector representation.
Frontiers in Artificial Intelligence, Journal Year: 2024, Volume and Issue: 7, Published: July 2, 2024
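The attention mechanism mentioned above lets the classifier weight informative tokens (e.g. dialect-specific words) more heavily than filler. As a toy one-dimensional illustration of soft attention pooling (not the paper's architecture; scores would normally come from a learned layer), the scores are softmax-normalized and used to take a weighted sum of the hidden states:

```python
import math

def attention_pool(hidden_states, scores):
    """Soft attention over a sequence of scalar hidden states:
    softmax the scores, then return the weighted sum and the weights."""
    m = max(scores)                              # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]          # softmax attention weights
    pooled = sum(w * h for w, h in zip(weights, hidden_states))
    return pooled, weights

# With equal scores, attention reduces to plain mean pooling.
pooled, weights = attention_pool([1.0, 3.0], [0.0, 0.0])
```

A higher score for one position shifts the pooled representation toward that position's hidden state, which is what allows the model to focus on dialect-discriminative tokens.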
Sentiment analysis, also referred to as opinion mining, plays a significant role in automating the identification of negative, positive, or neutral sentiments expressed in textual data. The proliferation of social networks, review sites, and blogs has rendered these platforms valuable resources for mining opinions. Sentiment analysis finds applications in various domains and languages, including English and Arabic. However, Arabic presents unique challenges due to its complex morphology, characterized by inflectional and derivational patterns. To effectively analyze sentiment in Arabic text, techniques must account for this intricacy. This paper proposes a model designed using transformer and deep learning (DL) techniques. The word embedding is represented by the Transformer-based Model for Arabic Language Understanding (ArabBert) and then passed to the AraBERT model. The output is subsequently fed into a Long Short-Term Memory (LSTM) model, followed by feedforward neural networks and an output layer. AraBERT is used to capture rich contextual information, while the LSTM enhances sequence modeling and retains long-term dependencies within the text. We compared the proposed model with machine learning (ML) algorithms and DL algorithms, as well as different vectorization techniques: term frequency-inverse document frequency (TF-IDF), ArabBert, Continuous Bag-of-Words (CBOW), and Skip-grams, on four benchmark datasets. Through extensive experimentation and evaluation on these datasets, we showcase the effectiveness of our approach. The results underscore improvements in accuracy, highlighting the potential of leveraging transformer models for sentiment analysis. The outcomes of this research contribute to advancing sentiment analysis, enabling more accurate and reliable analysis of Arabic text. The findings reveal that the proposed framework exhibits exceptional performance in sentiment classification, achieving an impressive accuracy rate of over 97%.
Algorithms, Journal Year: 2024, Volume and Issue: 17(11), P. 495 - 495, Published: Nov. 3, 2024
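The abstract above routes contextual embeddings through an LSTM to retain long-term dependencies. A minimal single-unit LSTM step in pure Python makes the mechanism concrete; this is an illustrative sketch with placeholder scalar weights, not the paper's trained model:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One step of a single-unit LSTM. x, h_prev, c_prev are scalars;
    w maps each gate name to ((w_x, w_h), bias)."""
    def gate(name, act):
        (wx, wh), b = w[name]
        return act(wx * x + wh * h_prev + b)
    i = gate("i", sigmoid)      # input gate: how much new info enters
    f = gate("f", sigmoid)      # forget gate: how much old cell state survives
    o = gate("o", sigmoid)      # output gate: how much of the cell is exposed
    g = gate("g", math.tanh)    # candidate value for the cell state
    c = f * c_prev + i * g      # cell state: the long-term memory channel
    h = o * math.tanh(c)        # hidden state passed onward
    return h, c

# Run the unit over a toy sequence of scalar "embeddings".
weights = {k: ((0.5, 0.25), 0.1) for k in ("i", "f", "o", "g")}
h = c = 0.0
for x in [0.2, -0.4, 1.0]:
    h, c = lstm_step(x, h, c, weights)
```

The additive cell-state update `c = f * c_prev + i * g` is what lets gradients and information flow across many steps, which is the "long-term dependencies" property the abstract relies on.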
Sentiment analysis utilizes Natural Language Processing (NLP) techniques to extract opinions from text, which is critical for businesses looking to refine strategies and better understand customer feedback. Understanding people’s sentiments about products through their emotional tone is paramount. However, analyzing sentiment in Arabic and its dialects poses challenges due to the language’s intricate morphology, right-to-left script, and nuanced expressions. To address this, this study introduces the Arb-MCNN-Bi model, which integrates the strengths of the transformer-based AraBERT (Arabic Bidirectional Encoder Representations from Transformers) model with a Multi-channel Convolutional Neural Network (MCNN) and a Bidirectional Gated Recurrent Unit (BiGRU) for sentiment analysis. AraBERT, designed specifically for Arabic, captures rich contextual information through word embeddings. These embeddings are processed by the MCNN to enhance feature extraction and by the BiGRU to retain long-term dependencies. The final output is obtained through feedforward neural networks. The study compares the proposed model with various machine learning and deep learning methods, applying advanced NLP techniques such as Term Frequency-Inverse Document Frequency (TF-IDF), n-grams, Word2Vec (Skip-gram), and fastText (Skip-gram). Experiments were conducted on three datasets: the Arabic Customer Reviews Dataset (ACRD), the Large-scale Arabic Book Reviews dataset (LABR), and the Hotel Arabic Reviews Dataset (HARD). The model achieved accuracies of 96.92%, 96.68%, and 92.93% on the ACRD, HARD, and LABR datasets, respectively. The results demonstrate the model’s effectiveness on Arabic text data, outperforming traditional approaches.
Text normalization methods have been commonly applied to historical language or user-generated content, but less often to dialectal transcriptions. In this paper, we introduce dialect-to-standard normalization – i.e., mapping phonetic transcriptions from different dialects to the orthographic norm of the standard variety – as a distinct sentence-level character transduction task and provide a large-scale analysis of normalization methods. To this end, we compile a multilingual dataset covering four languages: Finnish, Norwegian, Swiss German, and Slovene. For the two biggest corpora, we provide three data splits corresponding to different use cases for automatic normalization. We evaluate the most successful sequence-to-sequence model architectures proposed for text normalization tasks, using different tokenization approaches and context sizes. We find that a character-level Transformer trained on sliding windows of words works best for Slovene, whereas a pre-trained byT5 model trained on full sentences obtains the best results for Norwegian. Finally, we perform an error analysis to study the effect of these choices on performance.
The Helsinki-NLP team participated in the NADI 2023 shared tasks on Arabic dialect translation with seven submissions. We used statistical (SMT) and neural machine translation (NMT) methods and explored character- and subword-based data preprocessing. Our submissions placed second in both tracks. In the open track, our winning submission is a character-level SMT system with additional Modern Standard Arabic language models. In the closed track, the best BLEU scores were obtained with the leave-as-is baseline, a simple copy of the input, narrowly followed by our SMT systems. In both tracks, fine-tuning existing multilingual models such as AraT5 or ByT5 did not yield superior performance compared to SMT.
In this paper, we present our approach for the “Nuanced Arabic Dialect Identification (NADI) Shared Task 2023”. We highlight our methodology for subtask 1, which deals with country-level dialect identification. Recognizing dialects plays an instrumental role in enhancing the performance of various downstream NLP tasks such as speech recognition and translation. The task uses the Twitter dataset (TWT-2023), which encompasses 18 dialects, making it a multi-class classification problem. Numerous transformer-based models, pre-trained on the Arabic language, are employed for identifying the dialects. We fine-tune these state-of-the-art models on the provided dataset. An ensembling method is leveraged to yield an improved system. We achieved an F1-score of 76.65 (11th rank on the leaderboard) on the test set.
This paper presents the approach of the NLPeople team to the Nuanced Arabic Dialect Identification (NADI) 2023 shared task. Subtask 1 involves identifying the dialect of a source text at the country level. Our approach makes use of language-specific language models, a clustering and retrieval method to provide additional context to the target sentence, a fine-tuning strategy which uses the provided data from the 2020 and 2021 shared tasks, and finally, ensembling over the predictions of multiple models. Our submission achieves a macro-averaged F1 score of 87.27, ranking 1st among the other participants in the subtask.
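The macro-averaged F1 reported above is the unweighted mean of per-class F1 scores, so rare dialects count as much as frequent ones. A self-contained sketch of the metric:

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    f1s = []
    for lab in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == lab and p == lab)
        fp = sum(1 for g, p in zip(gold, pred) if g != lab and p == lab)
        fn = sum(1 for g, p in zip(gold, pred) if g == lab and p != lab)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

score = macro_f1(["EGY", "EGY", "JOR"], ["EGY", "JOR", "JOR"])
```

With 18 country-level classes and an imbalanced tweet distribution, macro-F1 is the natural choice of shared-task metric because it penalizes models that ignore low-resource dialects.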
In this paper, we present our approach towards Arabic dialect identification, which was part of the Fourth Nuanced Arabic Dialect Identification Shared Task (NADI 2023). We tested several techniques to identify dialects. We obtained the best result by fine-tuning the pre-trained MARBERTv2 model with a modified training dataset. The training set was expanded by sorting tweets based on dialects, concatenating every two adjacent tweets, and adding them to the original dataset as new tweets. We achieved an 82.87 F1 score and were at the seventh position among 16 participants.
This paper presents the methods we developed for the Nuanced Arabic Dialect Identification (NADI) 2023 shared task, specifically targeting the two subtasks focussed on sentence-level machine translation (MT) of text written in any of four Arabic dialects (Egyptian, Emirati, Jordanian and Palestinian) to Modern Standard Arabic (MSA). Our team, UniManc, employed models based on T5: multilingual T5 (mT5), multi-task fine-tuned mT5 (mT0) and AraT5. These models were trained in two configurations: joint model training for all regional dialects (J-R) and independent model training for every dialect (I-R). Based on the results of the official NADI 2023 evaluation, our I-R AraT5 model obtained an overall BLEU score of 14.76, ranking first in the Closed Dialect-to-MSA MT subtask. Moreover, in the Open Dialect-to-MSA subtask, our J-R AraT5 model also ranked first, obtaining a BLEU score of 21.10.