IEEE/ACM Transactions on Audio Speech and Language Processing, Journal Year: 2023, Volume and Issue: 31, P. 902 - 914, Published: Jan. 1, 2023
Crowdsourcing is a popular tool for collecting large amounts of annotated data, but the specific format of strong labels necessary for sound event detection is not easily obtainable through crowdsourcing. In this work, we propose a novel annotation workflow that leverages the efficiency of crowdsourcing weak labels, and uses a high number of annotators to produce reliable and objective strong labels. The weak labels are collected in a highly redundant setup, to allow reconstruction of the temporal information. To obtain reliable labels, the annotators' competence is estimated using MACE (Multi-Annotator Competence Estimation) and incorporated into the label estimation by weighing the individual opinions. We show that the proposed method produces consistently reliable annotations not only for synthetic audio mixtures, but also for recordings of real everyday environments. While a maximum of 80% coincidence with the complete and correct reference was obtained, these results are explained by an extended study of how polyphony and SNR levels affect the identification rate of sound events by the annotators.
On real data, even though the coincidence is significantly lower, under 69%, the majority opinion approach on the aggregated annotations still permits a direct comparison on this more difficult task.
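To make the aggregation step concrete, the sketch below shows one way redundantly collected weak (segment-level) votes could be combined into strong, temporally resolved labels by weighting each vote with an annotator competence score such as the one MACE estimates. The data layout, segment duration, and threshold are illustrative assumptions, not the paper's implementation.

```python
from collections import defaultdict

def estimate_strong_labels(votes, competence, seg_dur=1.0, threshold=0.5):
    """Aggregate redundant segment-level (weak) votes into strong labels.

    votes:      list of (annotator_id, segment_index, event_class, is_present)
                tuples; every annotator assigned to a segment votes on each
                candidate class (hypothetical layout).
    competence: dict annotator_id -> weight in [0, 1], e.g. estimated by MACE.
    seg_dur:    duration of one annotated segment in seconds (assumption).
    Returns (event_class, onset_s, offset_s) tuples, merging consecutive
    active segments of the same class to recover temporal information.
    """
    present = defaultdict(float)  # competence mass voting "present"
    total = defaultdict(float)    # total competence mass that voted
    for annotator, seg, cls, is_present in votes:
        w = competence.get(annotator, 0.0)
        total[(seg, cls)] += w
        if is_present:
            present[(seg, cls)] += w

    # A (segment, class) pair is active when the weighted vote passes the threshold.
    active = defaultdict(list)
    for (seg, cls), mass in total.items():
        if mass > 0 and present[(seg, cls)] / mass >= threshold:
            active[cls].append(seg)

    # Merge runs of consecutive active segments into (onset, offset) events.
    events = []
    for cls, segs in active.items():
        segs.sort()
        start = prev = segs[0]
        for seg in segs[1:]:
            if seg == prev + 1:
                prev = seg
                continue
            events.append((cls, start * seg_dur, (prev + 1) * seg_dur))
            start = prev = seg
        events.append((cls, start * seg_dur, (prev + 1) * seg_dur))
    return events
```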
IEEE/ACM Transactions on Audio Speech and Language Processing, Journal Year: 2024, Volume and Issue: 32, P. 1336 - 1351, Published: Jan. 1, 2024
Self-supervised learning (SSL) has emerged as a popular approach for learning audio representations. One goal of self-supervised pre-training is to transfer knowledge to downstream tasks, generally including clip-level and frame-level tasks. While frame-level tasks are important for fine-grained acoustic scene/event understanding, prior studies primarily evaluate on clip-level tasks. In order to tackle both, this paper proposes the Audio Teacher-Student Transformer (ATST), with a clip-level version (named ATST-Clip) and a frame-level version (named ATST-Frame), responsible for learning clip-level and frame-level representations, respectively. Both methods use a transformer encoder and a teacher-student training scheme. We have carefully designed the view creation strategy for ATST-Clip and ATST-Frame. Specifically, ATST-Clip uses segment-wise data augmentations, and ATST-Frame integrates frame-wise data augmentations and masking. Experimental results show that our model obtains state-of-the-art (SOTA) performances on most of the downstream tasks. Especially, it outperforms other models by a large margin on the sound event detection task. In addition, the performance can be further improved by combining the two models through distillation.
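As a rough illustration of the teacher-student scheme mentioned above, the sketch below trains a student encoder on one augmented (or masked) view to match an exponential-moving-average teacher fed another view. The encoder, loss, and momentum value are placeholder assumptions rather than the actual ATST configuration.

```python
import copy
import torch
import torch.nn.functional as F

class TeacherStudent(torch.nn.Module):
    """Generic teacher-student wrapper with an EMA-updated teacher."""

    def __init__(self, encoder: torch.nn.Module, momentum: float = 0.99):
        super().__init__()
        self.student = encoder
        self.teacher = copy.deepcopy(encoder)
        self.momentum = momentum
        for p in self.teacher.parameters():
            p.requires_grad_(False)  # the teacher is never trained directly

    @torch.no_grad()
    def update_teacher(self):
        # EMA update: teacher <- m * teacher + (1 - m) * student
        for t, s in zip(self.teacher.parameters(), self.student.parameters()):
            t.mul_(self.momentum).add_(s, alpha=1.0 - self.momentum)

    def loss(self, view_a: torch.Tensor, view_b: torch.Tensor) -> torch.Tensor:
        # Student sees one augmented/masked view, teacher sees the other.
        z_student = self.student(view_a)
        with torch.no_grad():
            z_teacher = self.teacher(view_b)
        # Match normalized embeddings (frame by frame for frame-level outputs).
        return F.mse_loss(F.normalize(z_student, dim=-1),
                          F.normalize(z_teacher, dim=-1))
```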
ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Journal Year: 2021, Volume and Issue: unknown, P. 186 - 190, Published: May 13, 2021
We introduce the Free Universal Sound Separation (FUSS) dataset, a new corpus for experiments in separating mixtures of an unknown number of sounds from an open domain of sound types. The dataset consists of 23 hours of single-source audio data drawn from 357 classes, which are used to create mixtures of one to four sources. To simulate reverberation, an acoustic room simulator is used to generate impulse responses of box-shaped rooms with frequency-dependent reflective walls. Additional open-source augmentation tools are also provided to produce different combinations of sources and room simulations. Finally, we provide a baseline separation model, based on an improved time-domain convolutional network (TDCN++), that can separate a variable number of sources in a mixture. This model achieves 9.8 dB of scale-invariant signal-to-noise ratio improvement (SI-SNRi) on mixtures with two to four sources, while reconstructing single-source inputs with 35.8 dB absolute SI-SNR. We hope this dataset will lower the barrier to research and allow fast iteration and application of novel techniques from other machine learning domains to the sound separation challenge.
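For reference, the SI-SNR numbers quoted above follow the standard scale-invariant SNR definition; the sketch below is a minimal NumPy version of that formula (with the raw mixture as the baseline for the improvement), not the exact evaluation code released with FUSS.

```python
import numpy as np

def si_snr(estimate: np.ndarray, reference: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant signal-to-noise ratio in dB."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference (optimal scaling of the target).
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    noise = estimate - target
    return 10.0 * np.log10((np.sum(target**2) + eps) / (np.sum(noise**2) + eps))

def si_snri(estimate: np.ndarray, reference: np.ndarray, mixture: np.ndarray) -> float:
    """SI-SNR improvement: gain of the separated estimate over the raw mixture."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)
```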
ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Journal Year: 2021, Volume and Issue: unknown, Published: May 13, 2021
Self-supervised representation learning can mitigate the limitations in recognition tasks with few manually labeled data but abundant unlabeled data, a common scenario in sound event research. In this work, we explore unsupervised contrastive learning as a way to learn sound event representations. To this end, we propose to use the pretext task of contrasting differently augmented views of sound events. The views are computed primarily via mixing of training examples with unrelated backgrounds, followed by other augmentations. We analyze the main components of our method via ablation experiments. We evaluate the learned representations using linear evaluation, and on two in-domain downstream sound event classification tasks, namely, with limited labeled data and with noisy labeled data. Our results suggest that unsupervised contrastive pre-training can mitigate the impact of data scarcity and increase robustness against noisy labels.
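A minimal sketch of the kind of pretext task described above: two views of the same sound event are formed by mixing it with different, unrelated backgrounds, and a standard contrastive (NT-Xent style) loss pulls the matching embeddings together. The mixing gain and temperature here are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def make_views(event: torch.Tensor, backgrounds: torch.Tensor, bg_gain: float = 0.5):
    """Create two views of each event by mixing it with two unrelated backgrounds.

    event:       (batch, samples) foreground sound event waveforms
    backgrounds: (2 * batch, samples) unrelated background clips
    """
    bg_a, bg_b = backgrounds.chunk(2, dim=0)
    return event + bg_gain * bg_a, event + bg_gain * bg_b

def nt_xent_loss(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.1):
    """Contrastive loss: matching views are positives, all other pairs negatives."""
    z = F.normalize(torch.cat([z_a, z_b], dim=0), dim=1)  # (2B, D)
    sim = z @ z.t() / temperature                          # (2B, 2B) similarities
    sim.fill_diagonal_(float("-inf"))                      # exclude self-pairs
    batch = z_a.shape[0]
    # For row i < B the positive is i + B, and vice versa.
    targets = torch.cat([torch.arange(batch) + batch, torch.arange(batch)])
    return F.cross_entropy(sim, targets.to(sim.device))
```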
IEEE/ACM Transactions on Audio Speech and Language Processing, Journal Year: 2022, Volume and Issue: 31, P. 137 - 151, Published: Nov. 10, 2022
Pre-trained models are essential as feature extractors in modern machine learning systems in various domains. In this study, we hypothesize that representations effective for general audio tasks should provide multiple aspects of robust features of the input sound. For recognizing sounds regardless of perturbations such as varying pitch or timbre, features should be robust to these perturbations. For serving the diverse needs of tasks such as recognition of emotions or music genres, representations should provide multiple aspects of information, such as local and global features. To implement our principle, we propose a self-supervised learning method: Bootstrap Your Own Latent (BYOL) for Audio (BYOL-A, pronounced “viola”). BYOL-A pre-trains representations of the input sound that are invariant to audio data augmentations, which makes the learned representations robust to perturbations of sounds. The BYOL-A encoder, in turn, combines local and global features and calculates their statistics to make the representation provide multi-aspect information. As a result, the learned representations provide robust and multi-aspect information that can serve diverse tasks. We evaluated the general audio task performance of BYOL-A against previous state-of-the-art methods, and BYOL-A demonstrated generalizability with the best average result of 72.4% and the best VoxCeleb1 result of 57.6%. Extensive ablation experiments revealed that the encoder architecture contributes most to the performance, and that the final critical portion resorts to the BYOL framework and the augmentations. Our code is available online for future studies.
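As a loose sketch of the "combine local and global features and calculate their statistics" idea, the function below pools frame-level features with mean and max statistics and concatenates them into one clip-level embedding; this follows the general pattern rather than the exact BYOL-A encoder.

```python
import torch

def multi_aspect_pool(frame_features: torch.Tensor) -> torch.Tensor:
    """Summarize local (per-frame) features into one global clip embedding.

    frame_features: (batch, time, dim) output of a frame-level encoder.
    Returns (batch, 2 * dim): temporal mean and max statistics concatenated,
    so the clip embedding keeps both average (global) and peak (local) cues.
    """
    mean_stat = frame_features.mean(dim=1)
    max_stat, _ = frame_features.max(dim=1)
    return torch.cat([mean_stat, max_stat], dim=1)
```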