In
this
paper,
we
describe
our
participation
in
the
NADI2023
shared
task
for
classification
of
Arabic
dialects
tweets.
For
training,
evaluation,
and
testing
purposes,
a
primary
dataset
comprising
tweets
from
18
Arab
countries
is
provided,
along
with
three
older
datasets.
The
main
objective
to
develop
model
capable
classifying
these
countries.
We
outline
approach,
which
leverages
various
machine
learning
models.
Our
experiments
demonstrate
that
large
language
models,
particularly
Arabertv2-Large,
Arabertv2-Base,
CAMeLBERT-Mix
DID
MADAR,
consistently
outperform
traditional
methods
such
as
SVM,
XGBOOST,
Multinomial
Naive
Bayes,
AdaBoost,
Random
Forests.
Arabic
dialects
have
extensive
global
usage
owing
to
their
significance
and
the
vast
number
of
speakers.
However,
technological
progress
globalization
are
leading
significant
transformations
within
dialects.
They
acquiring
new
characteristics
involving
novel
vocabulary
integrating
linguistic
elements
from
diverse
Consequently,
sentiment
analysis
these
is
becoming
more
challenging.
This
study
categorizes
among
18
countries,
as
introduced
by
Nuanced
Dialect
Identification
(NADI)
shared
task
competition.
Our
approach
incorporates
utilization
MARABERT
v2
models
with
a
range
methodologies,
including
feature
extraction
process.
findings
reveal
that
most
effective
model
achieved
applying
averaging
concatenation
hidden
layers
v2,
followed
feeding
resulting
output
into
convolutional
layers.
Furthermore,
employing
ensemble
method
on
various
methods
enhances
model’s
performance.
system
secures
6th
position
top
performers
in
First
subtask,
achieving
an
F1
score
83.73%.
With
approximately
400
million
speakers
worldwide,
Arabic
ranks
as
the
fifth
most-spoken
language
globally,
necessitating
advancements
in
natural
processing.
This
paper
addresses
this
need
by
presenting
a
system
description
of
approaches
employed
for
subtasks
outlined
Nuanced
Dialect
Identification
(NADI)
task
at
EMNLP
2023.
For
first
subtask,
involving
closed
country-level
dialect
identification
classification,
we
employ
an
ensemble
two
models.
Similarly,
second
focused
on
to
Modern
Standard
(MSA)
machine
translation,
our
approach
combines
sequence-to-sequence
models,
all
trained
Arabic-specific
dataset.
Our
team
10th
and
3rd
subtask
1
2
respectively.
Automatic
Arabic
Dialect
Identification
(ADI)
of
text
has
gained
great
popularity
since
it
was
introduced
in
the
early
2010s.
Multiple
datasets
were
developed,
and
yearly
shared
tasks
have
been
running
2018.
However,
ADI
systems
are
reported
to
fail
distinguishing
between
micro-dialects
Arabic.
We
argue
that
currently
adopted
framing
task
as
a
single-label
classification
problem
is
one
main
reasons
for
that.
highlight
limitation
incompleteness
labels
demonstrate
how
impacts
evaluation
systems.
A
manual
error
analysis
predictions
an
ADI,
performed
by
7
native
speakers
different
dialects,
revealed
≈
67%
validated
errors
not
true
errors.
Consequently,
we
propose
multi-label
give
recommendations
designing
new
datasets.
Our
system,
submitted
to
the
Nuanced
Arabic
Dialect
Identification
(NADI-23),
tackles
first
sub-task:
Closed
Country-level
dialect
identification.
In
this
work,
we
propose
a
model
that
is
based
on
an
ensemble
of
layer-wise
fine-tuned
BERT-based
models.
The
proposed
ranked
fourth
out
sixteen
submissions,
with
F1-macro
score
85.43.
Applied Sciences,
Journal Year:
2024,
Volume and Issue:
14(20), P. 9515 - 9515
Published: Oct. 18, 2024
Sequence-to-sequence
models
have
been
applied
to
many
challenging
problems,
including
those
in
text
and
speech
technologies.
Normalization
is
one
of
them.
It
refers
transforming
non-standard
language
forms
into
their
standard
counterparts.
Non-standard
come
from
different
written
spoken
sources.
This
paper
deals
with
such
source,
namely
the
less-resourced
highly
inflected
Slovenian
language.
The
explores
corpora
recently
collected
public
private
environments.
We
analyze
efficiencies
three
sequence-to-sequence
for
automatic
normalization
literal
transcriptions
forms.
Experiments
were
performed
using
words,
subwords,
characters
as
basic
units
normalization.
In
article,
we
demonstrate
that
superiority
approach
linked
choice
modeling
unit.
Statistical
prefer
while
neural
network-based
characters.
experimental
results
show
best
are
obtained
architectures
based
on
Long
short-term
memory
transformer
gave
comparable
results.
also
present
a
novel
analysis
tool,
which
use
in-depth
error
by
character-based
models.
showed
systems
similar
overall
can
differ
performance
types
errors.
Errors
architecture
easier
correct
post-editing
process.
an
important
insight,
creating
time-consuming
costly
tool
incorporates
two
statistical
significance
tests:
approximate
randomization
bootstrap
resampling.
Both
tests
confirm
improved
compared
ones.
In
this
paper,
we
describe
our
participation
in
the
NADI2023
shared
task
for
classification
of
Arabic
dialects
tweets.
For
training,
evaluation,
and
testing
purposes,
a
primary
dataset
comprising
tweets
from
18
Arab
countries
is
provided,
along
with
three
older
datasets.
The
main
objective
to
develop
model
capable
classifying
these
countries.
We
outline
approach,
which
leverages
various
machine
learning
models.
Our
experiments
demonstrate
that
large
language
models,
particularly
Arabertv2-Large,
Arabertv2-Base,
CAMeLBERT-Mix
DID
MADAR,
consistently
outperform
traditional
methods
such
as
SVM,
XGBOOST,
Multinomial
Naive
Bayes,
AdaBoost,
Random
Forests.