IEEE/ACM Transactions on Audio Speech and Language Processing, Journal Year: 2023, Volume and Issue: 31, P. 902 - 914, Published: Jan. 1, 2023
Crowdsourcing is a popular tool for collecting large amounts of annotated data, but the specific format of strong labels necessary for sound event detection is not easily obtainable through crowdsourcing. In this work, we propose a novel annotation workflow that leverages the efficiency of crowdsourcing weak labels, and uses a high number of annotators to produce reliable and objective strong labels. The weak labels are collected in a highly redundant setup, to allow reconstruction of the temporal information. To obtain reliable labels, the annotators' competence is estimated using MACE (Multi-Annotator Competence Estimation) and incorporated into the label estimation by weighing the individual opinions. We show that the proposed method produces consistently reliable annotations not only for synthetic audio mixtures, but also for recordings of real everyday environments. While a maximum of 80% coincidence with the complete and correct reference was obtained on the synthetic data, these results are explained by an extended study of how polyphony and SNR levels affect the identification rate of events by the annotators. On real data, even though the coincidence with the reference is significantly lower, under 69%, the majority opinion approach over aggregated weak labels compares favourably with the more difficult task of directly annotating strong labels.
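To make the aggregation idea concrete, here is a minimal Python sketch of turning redundant per-segment weak labels into strong (onset, offset) labels via a competence-weighted vote. This is an illustration of the general technique, not the authors' exact pipeline; the segment length, competence weights, and decision threshold are illustrative assumptions.

```python
import numpy as np

def aggregate_strong_labels(votes, competences, seg_len=1.0, threshold=0.5):
    """Turn redundant per-segment weak labels into strong (onset, offset) labels.

    votes:        (n_annotators, n_segments) binary matrix; votes[a, s] = 1 if
                  annotator a marked the event class as present in segment s.
    competences:  (n_annotators,) weights, e.g. MACE competence estimates.
    seg_len:      segment length in seconds (illustrative value).
    threshold:    weighted-vote level above which a segment counts as active.
    """
    votes = np.asarray(votes, dtype=float)
    w = np.asarray(competences, dtype=float)
    w = w / w.sum()                        # normalize competence weights
    activity = w @ votes                   # weighted opinion per segment
    active = activity >= threshold         # binary activity curve

    # Merge runs of consecutive active segments into (onset, offset) events.
    events, start = [], None
    for s, flag in enumerate(active):
        if flag and start is None:
            start = s
        elif not flag and start is not None:
            events.append((start * seg_len, s * seg_len))
            start = None
    if start is not None:
        events.append((start * seg_len, len(active) * seg_len))
    return events

# Three annotators, six 1-second segments; the middle annotator is less reliable.
votes = [[1, 1, 0, 0, 1, 1],
         [0, 1, 1, 0, 0, 1],
         [1, 1, 0, 0, 1, 1]]
print(aggregate_strong_labels(votes, competences=[0.9, 0.3, 0.8]))
# -> [(0.0, 2.0), (4.0, 6.0)]
```

Weighting the votes means a low-competence annotator's isolated opinion (segment 3 above) cannot flip a segment on its own, which is the point of combining redundancy with competence estimation.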
Future Internet, Journal Year: 2023, Volume and Issue: 15(8), P. 260 - 260, Published: July 31, 2023
Generative artificial intelligence (AI) has emerged as a powerful technology with numerous applications in various domains. There is a need to identify the requirements and evaluation metrics for generative AI models designed for specific tasks. This research aims to investigate the fundamental aspects of generative AI systems, including their requirements, models, input–output formats, and evaluation metrics. The study addresses key research questions and presents comprehensive insights to guide researchers, developers, and practitioners in the field. Firstly, the requirements necessary for implementing generative AI systems are examined and categorized into three distinct categories: hardware, software, and user experience. Furthermore, the study explores the different types of generative AI models described in the literature by presenting a taxonomy based on architectural characteristics, such as variational autoencoders (VAEs), generative adversarial networks (GANs), diffusion models, transformers, language models, normalizing flow models, and hybrid models. A classification of the input and output formats used in generative AI systems is also provided. Moreover, the research proposes a classification system and discusses the evaluation metrics commonly used in generative AI. The findings contribute to advancements in the field, enabling researchers, developers, and practitioners to effectively implement and evaluate generative AI models for specific applications. The significance of the research lies in the understanding that the requirements of generative AI systems are crucial for effective planning, design, and optimal performance. A taxonomy of models aids in selecting suitable options and driving advancements. Classifying input–output formats enables leveraging diverse formats for customized systems, while standardized evaluation methods help establish how to assess model quality.
ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Journal Year: 2023, Volume and Issue: unknown, Published: May 5, 2023
Mainstream machine listening models are trained to learn audio concepts under the paradigm of one class label to many recordings, focusing on one task. Learning under such restricted supervision limits the flexibility of models because they require labeled audio for training and can only predict the predefined categories. Instead, we propose to learn audio concepts from natural language supervision. We call our approach Contrastive Language-Audio Pretraining (CLAP), which connects language and audio by using two encoders and a contrastive learning objective, bringing audio and text descriptions into a joint multimodal space. We trained CLAP with 128k audio and text pairs and evaluated it on 16 downstream tasks across 7 domains, such as classification of sound events, scenes, music, and speech. CLAP establishes state-of-the-art (SoTA) Zero-Shot performance. Also, CLAP's audio encoder in a supervised setup achieved SoTA in 5 tasks. The Zero-Shot capability removes the need for class-labeled audio, enables flexible class prediction at inference time, and generalizes well to multiple downstream tasks. Code is available at: https://github.com/microsoft/CLAP.
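The core of CLAP-style training is a symmetric contrastive objective over a batch of paired audio/text embeddings. The sketch below shows that objective in PyTorch; the encoder architectures and the temperature value are placeholders, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

def clap_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    audio_emb, text_emb: (batch, dim) outputs of the two encoders, where
    row i of each tensor comes from the same audio-text pair.
    """
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature            # pairwise similarities
    targets = torch.arange(a.size(0))         # matching pairs on the diagonal
    # Cross-entropy in both directions: audio->text and text->audio.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```

Zero-shot classification then reduces to cosine similarity between an audio embedding and the text embeddings of prompts such as "this is a sound of a dog", one per candidate class.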
ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Journal Year: 2022, Volume and Issue: unknown, P. 4563 - 4567, Published: April 27, 2022
We propose Wav2CLIP, a robust audio representation learning method by distilling from Contrastive Language-Image Pre-training (CLIP). We systematically evaluate Wav2CLIP on a variety of audio tasks including classification, retrieval, and generation, and show that Wav2CLIP can outperform several publicly available pre-trained audio representation algorithms. Wav2CLIP projects audio into a shared embedding space with images and text, which enables multimodal applications such as zero-shot classification and cross-modal retrieval. Furthermore, Wav2CLIP needs just ∼10% of the data to achieve competitive performance on downstream tasks compared with fully supervised models, and is more efficient to pre-train than competing methods as it does not require learning a visual model in concert with an auditory model. Finally, we demonstrate image generation from Wav2CLIP as a qualitative assessment of the shared embedding space. Our code and model weights are open sourced and made available for further applications.
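The distillation idea is that a frozen CLIP image encoder supplies teacher embeddings for video frames, and the audio encoder is trained so that the corresponding audio lands nearby in the same space. A minimal sketch of one such training step follows; Wav2CLIP itself uses a contrastive objective between the two views, and the simpler cosine regression shown here is a stand-in for brevity.

```python
import torch
import torch.nn.functional as F

def distillation_step(audio_encoder, clip_image_encoder, audio, frames):
    """One CLIP->audio distillation step, sketched.

    audio:  (batch, samples) waveforms from video clips.
    frames: (batch, 3, H, W) frames from the same clips; the CLIP image
            encoder is frozen and provides the teacher embeddings.
    """
    with torch.no_grad():
        target = F.normalize(clip_image_encoder(frames), dim=-1)
    pred = F.normalize(audio_encoder(audio), dim=-1)
    # Cosine-distance regression onto the frozen CLIP space; the paper's
    # actual objective is contrastive, which this loss approximates here.
    return (1 - (pred * target).sum(dim=-1)).mean()
```

Because the visual side is frozen, only the audio encoder is optimized, which is why pre-training is cheaper than jointly learning a visual and an auditory model.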
ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Journal Year: 2023, Volume and Issue: unknown, Published: May 5, 2023
Contrastive learning has shown remarkable success in the field of multimodal representation learning. In this paper, we propose a pipeline of contrastive language-audio pretraining to develop an audio representation by combining audio data with natural language descriptions. To accomplish this target, we first release LAION-Audio-630K, a large collection of 633,526 audio-text pairs from different data sources. Second, we construct a contrastive language-audio pretraining model by considering different audio encoders and text encoders. We incorporate a feature fusion mechanism and keyword-to-caption augmentation into the model design to further enable the model to process audio inputs of variable lengths and to enhance performance. Third, we perform comprehensive experiments to evaluate our model across three tasks: text-to-audio retrieval, zero-shot audio classification, and supervised audio classification. The results demonstrate that our model achieves superior performance in the text-to-audio retrieval task. In audio classification tasks, the model achieves state-of-the-art performance in the zero-shot setting and is able to obtain performance comparable to models' results in the non-zero-shot setting. LAION-Audio-630K and the proposed model are both available to the public.
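Keyword-to-caption augmentation turns tag-only metadata into caption-like sentences so such sources can feed the text encoder. The sketch below shows the idea with a fixed template; the paper uses a learned model for this step, so the template and function name here are illustrative stand-ins.

```python
def keyword_to_caption(keywords):
    """Synthesize a caption-style sentence from a list of audio tags.

    A fixed template is used for illustration; LAION-CLAP's actual
    augmentation is produced by a pretrained captioning model.
    """
    if len(keywords) == 1:
        body = keywords[0]
    else:
        body = ", ".join(keywords[:-1]) + " and " + keywords[-1]
    return f"The sound of {body}."

print(keyword_to_caption(["rain", "thunder"]))
# -> "The sound of rain and thunder."
```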
Interspeech 2022, Journal Year: 2022, Volume and Issue: unknown, Published: Sept. 16, 2022
The great success of transformer-based models in natural language processing (NLP) has led to various attempts at adapting these architectures to other domains such as vision and audio. Recent work has shown that transformers can outperform Convolutional Neural Networks (CNNs) on audio tasks. However, one of the main shortcomings of transformer models, compared to the well-established CNNs, is their computational complexity. In transformers, the compute and memory complexity is known to grow quadratically with the input length. Therefore, there has been extensive work on optimizing transformers, but often at the cost of degrading predictive performance. In this work, we propose a novel method to optimize and regularize transformers on audio spectrograms. Our proposed models achieve a new state-of-the-art performance on Audioset and can be trained on a single consumer-grade GPU. Furthermore, the proposed model outperforms CNNs in terms of both performance and training speed. Source code: https://github.com/kkoutini/PaSST
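The regularization in PaSST is patchout: randomly discarding spectrogram patch tokens during training, which shortens the sequence the transformer attends over (cutting the quadratic attention cost) while acting like dropout. A minimal sketch of the mechanism, assuming patch embeddings are already computed; PaSST drops structured groups (whole time or frequency strips), and the unstructured variant is shown here for brevity.

```python
import torch

def patchout(tokens, drop_prob=0.4):
    """Randomly drop a fraction of spectrogram patch tokens during training.

    tokens: (batch, n_patches, dim) patch embeddings of an audio spectrogram.
    Returns a shorter sequence: (batch, kept_patches, dim).
    """
    batch, n, dim = tokens.shape
    keep = max(1, int(n * (1 - drop_prob)))
    # Random permutation per batch item, keep the first `keep` indices.
    idx = torch.rand(batch, n).argsort(dim=1)[:, :keep]
    return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, dim))

x = torch.randn(2, 100, 768)      # 100 patches per clip
print(patchout(x).shape)          # torch.Size([2, 60, 768]) with drop_prob=0.4
```

At inference time all patches are kept, so the model sees the full spectrogram while having been regularized by the shortened sequences during training.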
IEEE/ACM Transactions on Audio Speech and Language Processing, Journal Year: 2023, Volume and Issue: 31, P. 3221 - 3236, Published: Jan. 1, 2023
We propose TF-GridNet for speech separation. The model is a novel deep neural network (DNN) integrating full- and sub-band modeling in the time-frequency (T-F) domain. It stacks several blocks, each consisting of an intra-frame full-band module, a sub-band temporal module, and a cross-frame self-attention module. It is trained to perform complex spectral mapping, where the real and imaginary (RI) components of the input signals are stacked as input features to predict the target RI components. We first evaluate it on monaural anechoic speaker separation. Without using data augmentation and dynamic mixing, it obtains a state-of-the-art 23.5 dB improvement in scale-invariant signal-to-distortion ratio (SI-SDR) on WSJ0-2mix, a standard dataset for two-speaker separation. To show its robustness to noise and reverberation, we evaluate it on reverberant speaker separation using SMS-WSJ and on noisy-reverberant speaker separation using WHAMR!, and obtain state-of-the-art performance on both datasets. We then extend TF-GridNet to multi-microphone conditions through multi-channel complex spectral mapping, and integrate it into a two-DNN system with a beamformer in between (named MISO-BF-MISO in earlier studies), where the beamformer proposed in this paper is a multi-frame Wiener filter computed based on the outputs of the first DNN. State-of-the-art performance is obtained on the multi-channel tasks of SMS-WSJ and WHAMR!. Besides speaker separation, we apply the proposed algorithms to speech dereverberation and enhancement, including the recent L3DAS22 speech enhancement challenge.
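Complex spectral mapping, the training target described above, can be sketched in a few lines: stack the mixture's real and imaginary STFT components as input features, let the network predict the target's RI components, and invert back to a waveform. The sketch assumes `model` is any network mapping (batch, 2, freq, time) to the same shape; the STFT sizes are illustrative, not the paper's configuration.

```python
import torch

def complex_spectral_mapping(model, mixture, n_fft=512, hop=128):
    """Complex spectral mapping, sketched.

    mixture: (batch, samples) time-domain mixture waveforms.
    Returns the estimated target waveform.
    """
    window = torch.hann_window(n_fft)
    spec = torch.stft(mixture, n_fft, hop, window=window, return_complex=True)
    feats = torch.stack([spec.real, spec.imag], dim=1)   # (batch, 2, F, T)
    ri = model(feats)                                    # predicted target RI
    est = torch.complex(ri[:, 0], ri[:, 1])              # back to complex STFT
    return torch.istft(est, n_fft, hop, window=window)
```

Predicting RI components directly (rather than a magnitude mask) lets the network correct both magnitude and phase, which matters in the reverberant and noisy conditions the paper targets.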
IEEE Internet of Things Journal, Journal Year: 2023, Volume and Issue: 10(13), P. 11264 - 11292, Published: March 7, 2023
Current sound-based practices and systems developed in both academia and industry point to convergent research trends that bring together the field of Sound and Music Computing with that of the Internet of Things. This paper proposes a vision for the emerging field of the Internet of Sounds (IoS), which stems from such disciplines. The IoS relates to the network of Sound Things, i.e., devices capable of sensing, acquiring, processing, actuating, and exchanging data serving the purpose of communicating sound-related information. In this paradigm, the IoS merges under a unique umbrella the fields of the Internet of Musical Things and the Internet of Audio Things, where heterogeneous devices dedicated to musical and non-musical tasks can interact and cooperate with one another and with other things connected to the Internet to facilitate sound-based services and applications that are globally available to users. We survey the state of the art in this space, discuss the technological and non-technological challenges ahead of us, and propose a comprehensive research agenda for the field.
Sensors, Journal Year: 2023, Volume and Issue: 23(2), P. 783 - 783, Published: Jan. 10, 2023
Forest fires are the main cause of desertification, and they have a disastrous impact on agricultural and forest ecosystems. Modern fire detection and warning systems rely on several techniques: satellite monitoring, sensor networks, image processing, data fusion, etc. Recently, Artificial Intelligence (AI) algorithms have been applied to fire recognition systems, enhancing their efficiency and reliability. However, these devices usually need constant data transmission along with a proper amount of computing power, entailing high costs and energy consumption. This paper presents the prototype of a Video Surveillance Unit (VSU) for recognising and signalling the presence of forest fires by exploiting two embedded Machine Learning (ML) algorithms running on a low power device. The ML models take audio samples and images as their respective inputs, allowing for timely fire detection. The result is that, while the performances of the two models are comparable when they work independently, their joint usage according to the proposed methodology provides higher accuracy, precision, recall and F1 score (96.15%, 92.30%, 100.00%, and 96.00%, respectively). Eventually, each event is remotely signalled by making use of the Long Range Wide Area Network (LoRaWAN) protocol to ensure that the personnel in charge are able to operate promptly.
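The gain over either model alone comes from fusing the two decisions. The paper's exact fusion rule is not reproduced here, so the sketch below uses a simple average-then-threshold rule as a labeled assumption, just to show where the joint decision sits in the pipeline.

```python
def fused_fire_alarm(audio_prob, image_prob, threshold=0.5):
    """Joint decision from the audio and image models, sketched.

    audio_prob, image_prob: fire probabilities from the two embedded ML
    models. Averaging then thresholding is an illustrative stand-in for
    the methodology proposed in the paper.
    """
    return (audio_prob + image_prob) / 2 >= threshold

# An alarm would then be transmitted over LoRaWAN by the VSU firmware.
print(fused_fire_alarm(0.9, 0.4))  # True: strong audio evidence carries the vote
```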
IEEE/ACM Transactions on Audio Speech and Language Processing, Journal Year: 2024, Volume and Issue: 32, P. 3339 - 3354, Published: Jan. 1, 2024
The advancement of audio-language (AL) multimodal learning tasks has been significant in recent years, yet the limited size of existing datasets poses challenges for researchers due to the costly and time-consuming collection process. To address this data scarcity issue, we introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions. We sourced audio clips and their raw descriptions from web sources and a sound event detection dataset. However, the online-harvested raw descriptions are highly noisy and unsuitable for direct use in tasks such as automated audio captioning. To overcome this issue, we propose a three-stage processing pipeline for filtering noisy data and generating high-quality captions, where ChatGPT, a large language model, is leveraged to filter and transform raw descriptions automatically. We conduct a comprehensive analysis of the characteristics of the WavCaps dataset and evaluate it on multiple downstream tasks. The systems trained on WavCaps outperform previous state-of-the-art (SOTA) models by a significant margin. Our aspiration is to have the proposed WavCaps dataset facilitate research and demonstrate the potential of utilizing large language models (LLMs) to enhance academic research. Our codes are available at https://github.com/XinhaoMei/WavCaps.
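The LLM-based transformation stage of such a pipeline can be sketched as a single prompted call per raw description. The prompt wording and model name below are illustrative placeholders, not the ones used to build WavCaps, and the code assumes the OpenAI Python client (v1) with an API key in the environment.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = ("Rewrite the following audio description as one short caption "
          "that only describes the sound events, without names, dates or URLs:\n")

def clean_description(raw_desc, model="gpt-3.5-turbo"):
    """Transform a noisy web-harvested description into a caption-style
    sentence via an LLM; prompt and model are illustrative assumptions."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT + raw_desc}],
    )
    return resp.choices[0].message.content.strip()
```

Filtering stages before this call (e.g. dropping descriptions that are too short or contain no sound-related content) keep the LLM from being asked to salvage unusable text.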