Natural Language Processing for Dialects of a Language: A Survey
ACM Computing Surveys,
Journal year: 2025
Issue: unknown
Published: Jan. 13, 2025
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets. This survey delves into an important attribute of these datasets: the dialect of a language. Motivated by the performance degradation of NLP models for dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches. We describe a wide range of NLP tasks in terms of two categories: natural language understanding (NLU) (for tasks such as dialect classification, sentiment analysis, parsing, and NLU benchmarks) and natural language generation (NLG) (summarisation, machine translation, dialogue systems). The survey is also broad in its coverage of languages, which include English, Arabic, German, among others. We observe that past NLP work on dialects goes deeper than mere dialect classification and extends to several NLU and NLG tasks. For these tasks, we describe classical machine learning using statistical models, along with recent deep learning-based approaches based on pre-trained language models. We expect that this survey will be useful to researchers interested in building equitable language technologies by rethinking LLM benchmarks and model architectures.
Language: English
The Helsinki-NLP Submissions at NADI 2023 Shared Task: Walking the Baseline
Published: Jan. 1, 2023
The Helsinki-NLP team participated in the NADI 2023 shared tasks on Arabic dialect translation with seven submissions. We used statistical (SMT) and neural machine translation (NMT) methods and explored character- and subword-based data preprocessing. Our submissions placed second in both tracks. In the open track, our winning submission is a character-level SMT system with additional Modern Standard Arabic language models. In the closed track, the best BLEU scores were obtained by the leave-as-is baseline, a simple copy of the input, narrowly followed by our systems. In both tracks, fine-tuning existing multilingual models such as AraT5 or ByT5 did not yield superior performance compared to SMT.
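The character-level preprocessing mentioned above can be sketched as follows. This is a generic illustration of a common segmentation scheme (using a SentencePiece-style "▁" marker to make word boundaries visible as tokens), not the team's actual pipeline:

```python
def to_char_level(sentence: str, boundary: str = "▁") -> str:
    # Replace each space with a visible boundary marker and put a space
    # between every symbol, so the MT system treats characters as tokens.
    return " ".join(boundary if ch == " " else ch for ch in sentence)

def from_char_level(segmented: str, boundary: str = "▁") -> str:
    # Invert the segmentation on the system's output.
    return "".join(" " if tok == boundary else tok
                   for tok in segmented.split(" "))
```

For example, `to_char_level("hello world")` yields `"h e l l o ▁ w o r l d"`, and applying `from_char_level` to the translated output restores ordinary spacing.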
Language: English
Sequence-to-Sequence Models and Their Evaluation for Spoken Language Normalization of Slovenian
Applied Sciences,
Journal year: 2024
Issue: 14(20), pp. 9515 - 9515
Published: Oct. 18, 2024
Sequence-to-sequence models have been applied to many challenging problems, including those in text and speech technologies. Normalization is one of them. It refers to transforming non-standard language forms into their standard counterparts. Non-standard forms come from different written and spoken sources. This paper deals with one such source, namely speech in the less-resourced, highly inflected Slovenian language. The paper explores corpora recently collected in public and private environments. We analyze the efficiencies of three sequence-to-sequence models for the automatic normalization of literal transcriptions of spoken forms. Experiments were performed using words, subwords, and characters as basic units of normalization. In the article, we demonstrate that the superiority of an approach is linked to the choice of the modeling unit. Statistical models prefer words, while neural network-based models prefer characters. The experimental results show that the best results are obtained with architectures based on long short-term memory, while transformer-based architectures gave comparable results. We also present a novel analysis tool, which we use for an in-depth error analysis of the results obtained by character-based models. The analysis showed that systems with similar overall results can differ in their performance on different types of errors. Errors made by one architecture can be easier to correct in a post-editing process. This is an important insight, since creating such systems is time-consuming and costly. The analysis tool also incorporates two statistical significance tests: approximate randomization and bootstrap resampling. Both tests confirm the improvement of the results compared to the baseline ones.
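The two significance tests named above are standard in evaluating MT and normalization systems. A minimal sketch of both, assuming per-sentence quality scores for two systems over the same test set (a generic illustration, not the authors' tool):

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Fraction of resampled test sets on which system A's total score
    is strictly higher than system B's (closer to 1.0 = A likely better)."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Draw a test set of the same size with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_resamples

def approximate_randomization(scores_a, scores_b, n_trials=1000, seed=0):
    """Two-sided p-value: how often a random relabeling of the paired
    scores produces a difference at least as large as the observed one."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    observed = abs(sum(scores_a) - sum(scores_b))
    extreme = 0
    for _ in range(n_trials):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            # Randomly swap each pair of scores between the two systems.
            diff += (a - b) if rng.random() < 0.5 else (b - a)
        if abs(diff) >= observed:
            extreme += 1
    # Add-one smoothing so the p-value is never exactly zero.
    return (extreme + 1) / (n_trials + 1)
```

A small p-value from `approximate_randomization` (conventionally below 0.05) indicates the difference between the systems is unlikely to be due to chance; `paired_bootstrap` near 1.0 indicates system A wins consistently across resampled test sets.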
Language: English