NADI 2023: The Fourth Nuanced Arabic Dialect Identification Shared Task
Published: Jan. 1, 2023
We describe the findings of the fourth Nuanced Arabic Dialect Identification Shared Task (NADI 2023). The objective of NADI is to help advance state-of-the-art Arabic NLP by creating opportunities for teams of researchers to collaboratively compete under standardized conditions. It does so with a focus on Arabic dialects, offering novel datasets and defining subtasks that allow for meaningful comparisons between different approaches. NADI 2023 targeted both dialect identification (Subtask 1) and dialect-to-MSA machine translation (Subtask 2 and Subtask 3). A total of 58 unique teams registered for the shared task, of whom 18 teams have participated (with 76 valid submissions during the test phase). Among these, 16 teams participated in Subtask 1, 5 participated in Subtask 2, and 3 participated in Subtask 3. The winning teams achieved 87.27 F1 on Subtask 1, 14.76 BLEU in Subtask 2, and 21.10 BLEU in Subtask 3, respectively. Results show that all three subtasks remain challenging, thereby motivating future work in this area. We describe the methods employed by the participating teams and briefly offer an outlook for NADI.
Language: English
Dialect-to-Standard Normalization: A Large-Scale Multilingual Evaluation
Published: Jan. 1, 2023
Text normalization methods have been commonly applied to historical language or user-generated content, but less often to dialectal transcriptions. In this paper, we introduce dialect-to-standard normalization (i.e., mapping phonetic transcriptions from different dialects to the orthographic norm of the standard variety) as a distinct sentence-level character transduction task and provide a large-scale analysis of dialect-to-standard normalization methods. To this end, we compile a multilingual dataset covering four languages: Finnish, Norwegian, Swiss German and Slovene. For the two biggest corpora, we provide three different data splits corresponding to different use cases for automatic normalization. We evaluate the most successful sequence-to-sequence model architectures proposed for text normalization tasks using different tokenization approaches and context sizes. We find that a character-level Transformer trained on sliding windows of three words works best for Finnish, Swiss German and Slovene, whereas the pre-trained byT5 model using full sentences obtains the best results for Norwegian. Finally, we perform an error analysis to evaluate the effect of different data splits on model performance.
Language: English
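
The abstract above mentions character-level models trained on sliding windows of three words. As a rough illustration of what such preprocessing might look like, here is a minimal sketch. The function names (`sliding_windows`, `to_characters`), the one-word step between consecutive windows, and the `_` word-boundary symbol are assumptions made for this example and are not taken from the paper.

```python
from typing import List


def sliding_windows(sentence: str, size: int = 3) -> List[str]:
    """Split a sentence into overlapping windows of `size` consecutive words.

    Assumption: consecutive windows are shifted by one word.
    """
    words = sentence.split()
    if not words:
        return []
    if len(words) <= size:
        # A short sentence yields one window, possibly shorter than `size`.
        return [" ".join(words)]
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


def to_characters(window: str, space_symbol: str = "_") -> List[str]:
    """Turn a window into character tokens, marking word boundaries.

    Assumption: spaces are replaced by a visible boundary symbol.
    """
    return [space_symbol if ch == " " else ch for ch in window]


if __name__ == "__main__":
    # Placeholder tokens stand in for a transcribed dialect sentence.
    for window in sliding_windows("w1 w2 w3 w4"):
        print(to_characters(window))
```

Each window would then serve as one input sequence for a character-level model, in contrast to the full-sentence setting, where the whole sentence is passed at once.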