Proceedings of the AAAI Conference on Artificial Intelligence, 2024, 38(4), P. 3630 - 3638. Published: March 24, 2024.
3D Single Object Tracking (SOT) stands as a forefront task of computer vision, proving essential for applications like autonomous driving. Sparse and occluded data in scene point clouds introduce variations in the appearance of tracked objects, adding complexity to the task. In this research, we unveil M3SOT, a novel SOT framework, which synergizes multiple input frames (template sets), multiple receptive fields (continuous contexts), and multiple solution spaces (distinct tasks) in ONE model. Remarkably, M3SOT pioneers in modeling temporality, contexts, and tasks directly from point clouds, revisiting a perspective on the key factors influencing SOT. To this end, we design a transformer-based network centered on the point cloud targets in the search area, aggregating diverse contextual representations and propagating target cues by employing historical frames. As M3SOT spans varied processing perspectives, we've streamlined the network, trimming its depth and optimizing its structure, to ensure a lightweight and efficient deployment for SOT applications. We posit that, backed by practical construction, M3SOT sidesteps the need for complex frameworks and auxiliary components to deliver sterling results. Extensive experiments on benchmarks such as KITTI, nuScenes, and Waymo Open Dataset demonstrate that M3SOT achieves state-of-the-art performance at 38 FPS. Our code and models are available at https://github.com/ywu0912/TeamCode.git.
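The abstract's core step, propagating target cues from historical template frames into the search-area features, can be sketched as a single cross-attention pass. This is a minimal illustrative sketch, not the authors' implementation; the function name, shapes, and the single-head attention without learned projections are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def template_cross_attention(search, templates):
    """Search-area point features attend to features concatenated from
    several historical template frames (single head, no projections)."""
    kv = np.concatenate(templates, axis=0)              # (T*Nt, C)
    scores = search @ kv.T / np.sqrt(search.shape[-1])  # (Ns, T*Nt)
    out = softmax(scores) @ kv                          # (Ns, C)
    return search + out                                 # residual cue propagation

# random features standing in for point-cloud embeddings
rng = np.random.default_rng(0)
search = rng.normal(size=(128, 64))                     # Ns search points
templates = [rng.normal(size=(64, 64)) for _ in range(3)]  # 3 historical frames
fused = template_cross_attention(search, templates)
print(fused.shape)  # (128, 64)
```

A multi-frame template set thus costs only a larger key/value matrix, which is consistent with the abstract's claim of a lightweight single-model design.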
IEEE Transactions on Multimedia, 2023, 26, P. 1626 - 1638. Published: June 9, 2023.
Learning effective representations from unlabeled data is a challenging task for point cloud understanding. As the human visual system can map concepts learned from 2D images to the 3D world, and inspired by recent multimodal research, we introduce the image modality for joint learning. Based on the properties of point clouds and images, we propose CrossNet, a comprehensive intra- and cross-modal contrastive learning method that learns point cloud representations. The proposed method achieves 3D-3D and 3D-2D correspondence objectives by maximizing the consistency of point clouds with their augmented versions and with their corresponding rendered images in an invariant space. We further distinguish the rendered images into RGB and grayscale versions to extract color and geometric features, respectively. These training objectives combine feature correspondences between modalities to provide rich learning signals from point clouds and images. Our CrossNet is simple: we add a feature extraction module and projection heads to the point cloud and image branches, respectively, and train the backbone network in a self-supervised manner. After being pretrained, only the point cloud backbone is required for fine-tuning and directly predicting results on downstream tasks. Extensive experiments on multiple benchmarks demonstrate improved classification and segmentation results, and show that the learned representations can be generalized across domains.
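The consistency-maximization objectives described above are commonly realized as a contrastive (InfoNCE-style) loss between matched embeddings. The sketch below is an assumption about the general form, not CrossNet's exact loss; the random vectors stand in for point-cloud, augmented, and rendered-image embeddings.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE: matched rows of z_a and z_b are positives, all other
    pairs in the batch act as negatives."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature                   # (B, B) similarities
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(z_a))
    return -log_prob[idx, idx].mean()                    # -log p(positive)

rng = np.random.default_rng(0)
pc = rng.normal(size=(8, 32))                    # point-cloud embeddings
pc_aug = pc + 0.01 * rng.normal(size=(8, 32))    # augmented view (3D-3D term)
img = pc + 0.01 * rng.normal(size=(8, 32))       # rendered-image view (3D-2D term)
loss = info_nce(pc, pc_aug) + info_nce(pc, img)
print(loss > 0)
```

Summing an intra-modal (3D-3D) and a cross-modal (3D-2D) term mirrors the abstract's combination of the two correspondence objectives.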
IEEE Transactions on Circuits and Systems for Video Technology, 2024, 34(9), P. 8343 - 8354. Published: March 19, 2024.
Extracting discriminative representations is the key step for correspondence-free point cloud registration. The extracted representations are required to be sensitive to the transformation, which demands reducing the influence of redundant information irrelevant to the transformation. However, recently proposed methods ignore this crucial property, resulting in a limited ability to represent the point cloud. In addition, research on correspondence-free registration has stagnated in recent years. In this paper, we try to relieve the feature redundancy issue from a new perspective. Specifically, our method comprises two stages: a feature extraction stage and a rigid body transformation stage. In the first stage, we aim to maximize the multi-hierarchical mutual information between different hierarchical features, which can provide less redundant features to regress the transformation parameters in the next stage. We utilize the dual quaternion to estimate the transformation parameters, which combines rotation and translation simultaneously within a unified framework and obtains a compact model. The model is trained in an unsupervised manner on the ModelNet40 dataset. The experimental results illustrate that our method achieves higher accuracy and robustness compared with existing methods.
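The dual quaternion parameterization mentioned above encodes rotation and translation in one eight-number object, q = q_r + ε q_d, where the unit quaternion q_r carries the rotation and the translation is recovered as the vector part of 2 q_d q_r*. A minimal sketch of that decoding (standard dual-quaternion algebra, not the paper's network):

```python
import numpy as np

def qmul(a, b):
    """Hamilton product of quaternions in (w, x, y, z) order."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def dual_quat_to_rt(qr, qd):
    """Unit dual quaternion (qr + eps*qd) -> rotation matrix R, translation t."""
    w, x, y, z = qr / np.linalg.norm(qr)
    R = np.array([[1-2*(y*y+z*z), 2*(x*y-w*z),   2*(x*z+w*y)],
                  [2*(x*y+w*z),   1-2*(x*x+z*z), 2*(y*z-w*x)],
                  [2*(x*z-w*y),   2*(y*z+w*x),   1-2*(x*x+y*y)]])
    conj = np.array([w, -x, -y, -z])
    t = 2.0 * qmul(qd, conj)[1:]          # vector part of 2 * qd * qr^-1
    return R, t

# identity rotation with translation (1, 2, 3): qd = 0.5 * (0, t) * qr
qr = np.array([1.0, 0.0, 0.0, 0.0])
qd = 0.5 * qmul(np.array([0.0, 1.0, 2.0, 3.0]), qr)
R, t = dual_quat_to_rt(qr, qd)
print(np.allclose(R, np.eye(3)), np.allclose(t, [1, 2, 3]))  # True True
```

Because one constrained vector yields both R and t, a network regressing it treats rotation and translation in a single unified output, which is the compactness the abstract refers to.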
Information Fusion, 2024, 108, P. 102358 - 102358. Published: March 24, 2024.
Change detection plays a fundamental role in Earth observation for analyzing temporal changes over time. However, recent studies have largely neglected the utilization of multimodal data, which presents significant practical and technical advantages compared to single-modal approaches. This research focuses on leveraging a pre-event digital surface model (DSM) and post-event aerial images captured at different times for detecting change beyond 2D. We observe that current methods struggle with multitask conflicts between the semantic and height tasks. To address this challenge, we propose an efficient Transformer-based network that learns a shared representation of the cross-dimensional inputs through cross-attention. It adopts a consistency constraint to establish the cross-task relationship. Initially, pseudo-changes are derived by employing height thresholding. Subsequently, the L2 distance between the two predictions within their overlapping regions is minimized. This explicitly endows the height change (regression task) and the semantic change (classification task) with consistency. A DSM-to-image multimodal dataset encompassing three cities in the Netherlands was constructed. It lays a new foundation for beyond-2D change detection from multimodal inputs. Compared with five state-of-the-art methods, our method demonstrates consistent superiority in terms of change detection. Furthermore, our strategy can be seamlessly adapted to other methods, yielding promising improvements.
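The consistency constraint described above (pseudo-changes from height thresholding, then an L2 penalty restricted to the overlap of the two predictions) might be sketched as follows. Everything here, the threshold value, the normalization of heights, and the 0.5 cutoff on the semantic probability, is an illustrative assumption rather than the paper's formulation.

```python
import numpy as np

def consistency_loss(height_change, semantic_prob, tau=2.0):
    """Pseudo-change mask via height thresholding; L2 distance between the
    normalized height change (regression) and the semantic change
    probability (classification) is taken only over overlapping regions."""
    pseudo = np.abs(height_change) > tau              # thresholded pseudo-changes
    overlap = pseudo & (semantic_prob > 0.5)          # regions both branches flag
    if not overlap.any():
        return 0.0
    h = np.abs(height_change) / (np.abs(height_change).max() + 1e-8)
    return float(np.mean((h[overlap] - semantic_prob[overlap]) ** 2))

rng = np.random.default_rng(0)
height = rng.normal(scale=3.0, size=(16, 16))   # regression output (e.g., meters)
sem = rng.uniform(size=(16, 16))                # classification probability map
loss = consistency_loss(height, sem)
print(loss >= 0.0)
```

Minimizing such a term pulls the two task heads toward agreeing on where change occurred, which is how the abstract resolves the multitask conflict.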
IEEE Computational Intelligence Magazine, 2023, 18(4), P. 66 - 79. Published: Oct. 17, 2023.
Point cloud registration, which effectively coincides the source and target point clouds, is generally implemented by geometric metrics or feature metrics. In terms of resistance to noise and outliers, feature-metric registration has less error than the traditional point-to-point correspondence metric, and feature reconstruction can generate and reveal more potential information during the recovery process to further optimize the registration process. In this paper, CFNet, a correspondence-free registration framework based on feature metrics, is proposed to learn adaptive representations, with an emphasis on optimizing the network. Considering the correlations among paired point clouds, an interaction module that can perceive and strengthen the association between features at multiple stages is proposed. To clarify the fact that rotation and translation are essentially uncorrelated, they are considered in different solution spaces, and the interactive features are divided into two parts to produce a dual-branch regression. In addition, CFNet, with its comprehensive objectives, estimates the transformation matrix of the inputs by minimizing a feature-metric loss. The extensive experiments conducted on both synthetic and real-world datasets show that our method outperforms existing methods.
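A feature-metric objective, as opposed to a point-to-point one, scores a candidate transform by the distance between global features of the transformed source and the target. The toy sketch below uses a hand-built max-pooled encoding in place of a learned network, so it only illustrates the shape of the objective, not CFNet itself.

```python
import numpy as np

def pool_feature(points):
    """Global feature of a point set: max-pool of simple per-point encodings
    (a hand-built stand-in for a learned feature extractor)."""
    enc = np.concatenate([points, np.sin(points), points ** 2], axis=1)
    return enc.max(axis=0)

def feature_metric_loss(source, target, R, t):
    """Distance between global features of the transformed source and target."""
    moved = source @ R.T + t
    return np.linalg.norm(pool_feature(moved) - pool_feature(target))

rng = np.random.default_rng(0)
src = rng.normal(size=(100, 3))
R_true, t_true = np.eye(3), np.array([0.2, -0.1, 0.3])
tgt = src @ R_true.T + t_true

good = feature_metric_loss(src, tgt, R_true, t_true)         # correct transform
bad = feature_metric_loss(src, tgt, np.eye(3), np.zeros(3))  # identity guess
print(good < bad)  # True: the true transform gives a smaller feature distance
```

No point correspondences appear anywhere in the loss, which is what makes the framework correspondence-free.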
IEEE Transactions on Multimedia, 2023, 26, P. 3505 - 3516. Published: Sept. 12, 2023.
The self-attention (SA) network revisits the essence of data and has achieved remarkable results in text processing and image analysis. SA is conceptualized as a set operator that is insensitive to the order and number of data, making it suitable for point sets embedded in 3D space. However, working with point clouds still poses challenges. To tackle the issues of the exponential growth of complexity and the singularity induced by the original SA without position encoding, we modify the attention mechanism by incorporating position encoding and making it linear, thus reducing its computational cost and memory usage to be more feasible for point clouds. This article presents a new framework called the multiscale point cloud transformer (MPCT), which improves upon prior methods in cross-domain applications. The utilization of multiple embeddings enables the complete capture of remote and local contextual connections within point clouds, as determined by our proposed attention mechanism. Additionally, we use residual connections to facilitate the fusion of features, allowing MPCT to better comprehend the representations at each stage of attention. Experiments conducted on several datasets demonstrate that MPCT outperforms existing methods, such as achieving accuracies of 94.2% and 84.9% in classification tasks implemented on ModelNet40 and ScanObjectNN, respectively.
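Linearized attention replaces the N×N softmax with a kernel feature map, so that (φ(Q))(φ(K)ᵀV) can be computed in O(N) by forming φ(K)ᵀV first. The sketch below shows that reordering with an additive position encoding; the specific feature map, random projections, and shapes are illustrative assumptions, not MPCT's design.

```python
import numpy as np

def linear_attention(x, pos, dim=32, seed=0):
    """Linearized self-attention with additive position encoding: the
    (dim x dim) key-value summary is built once, so cost is O(N) in the
    number of points instead of O(N^2)."""
    rng = np.random.default_rng(seed)
    Wq, Wk, Wv = (rng.normal(scale=0.1, size=(x.shape[1], dim)) for _ in range(3))
    Wp = rng.normal(scale=0.1, size=(pos.shape[1], dim))
    pe = pos @ Wp                              # position encoding added to Q and K
    q = np.maximum(x @ Wq + pe, 0) + 1e-6      # positive feature map (ReLU-like)
    k = np.maximum(x @ Wk + pe, 0) + 1e-6
    v = x @ Wv
    kv = k.T @ v                               # (dim, dim) summary, computed once
    z = q @ k.sum(axis=0)                      # per-point normalizer
    return (q @ kv) / z[:, None]

rng = np.random.default_rng(1)
pts = rng.normal(size=(256, 3))                # raw xyz as both feature and position
out = linear_attention(pts, pts)
print(out.shape)  # (256, 32)
```

Without the position encoding, permutation-invariant attention over raw coordinates can collapse toward degenerate (singular) outputs, which is the issue the abstract attributes to the original SA.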
Remote Sensing, 2024, 16(6), P. 970 - 970. Published: March 10, 2024.
In the realm of Earth observation and remote sensing data analysis, the advancement of hyperspectral imaging (HSI) classification technology is of paramount importance. Nevertheless, the intricate nature of hyperspectral data, coupled with the scarcity of labeled samples, presents significant challenges in this domain. To mitigate these issues, we introduce a self-supervised learning algorithm predicated on a spectral transformer (S3L) for HSI classification under conditions of limited labeled samples, with the objective of enhancing the efficacy of classification. The S3L algorithm operates in two distinct phases: pretraining and fine-tuning. During the pretraining phase, the model learns the spatial-spectral representation from unlabeled data, utilizing a masking mechanism and the spectral transformer, thereby augmenting the sequence dependence of spectral features. Subsequently, the fine-tuning phase is employed to refine the pretrained weights and improve the precision of classification. Within the comprehensive encoder-decoder framework, we propose a novel module specifically engineered to synergize spectral feature extraction and domain analysis. This innovative module adeptly navigates the complex interplay among various spectral bands, capturing both global and sequential spectral dependencies. Uniquely, it incorporates a gated recurrent unit (GRU) layer within the encoder to enhance its ability to process spectral sequences. Our experimental evaluations across several public datasets reveal that our proposed method achieves superior performance, particularly in scenarios with limited labeled samples, outperforming existing state-of-the-art approaches.
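The masking pretext task described above hides a fraction of the spectral bands and trains the model to reconstruct them from the visible ones. The sketch below substitutes a closed-form linear least-squares "decoder" for the transformer, purely to make the masking-and-reconstruction loop concrete; the mask ratio and synthetic low-rank spectra are assumptions.

```python
import numpy as np

def masked_reconstruction_step(spectra, mask_ratio=0.5, seed=0):
    """Hide a random fraction of spectral bands and score a linear
    least-squares reconstruction of the hidden bands from the visible
    ones (a stand-in for the transformer decoder)."""
    rng = np.random.default_rng(seed)
    n_bands = spectra.shape[1]
    hidden = rng.choice(n_bands, size=int(mask_ratio * n_bands), replace=False)
    visible = np.setdiff1d(np.arange(n_bands), hidden)
    W, *_ = np.linalg.lstsq(spectra[:, visible], spectra[:, hidden], rcond=None)
    recon = spectra[:, visible] @ W
    return float(np.mean((recon - spectra[:, hidden]) ** 2))

# synthetic correlated spectra: 64 pixels, 100 bands, intrinsic rank 4
rng = np.random.default_rng(0)
spectra = rng.normal(size=(64, 4)) @ rng.normal(size=(4, 100))
loss = masked_reconstruction_step(spectra)
print(loss < 1e-6)  # correlated bands are predictable from the visible ones
```

The pretext task only works because neighboring spectral bands are highly correlated, the same sequence dependence the GRU-augmented encoder is meant to exploit.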
IEEE Geoscience and Remote Sensing Letters, 2024, 21, P. 1 - 5. Published: Jan. 1, 2024.
In the scene classification task of remote sensing images (RSIs), in order to fully perceive multi-scale local objects and explore their interdependencies to mine the semantics of RSIs, this letter designs a novel Position-Sensitive Cross-Layer Interactive Transformer (PSCLI-TF) model to improve the accuracy of RSI scene classification. Firstly, ResNet50 is utilized as the backbone to extract multi-layer feature maps of the RSI. Then, to enhance the model's position sensitivity, a new Position-Sensitive Cross-Layer Interactive Attention (PSCLIA) mechanism is designed, and based on it a PSCLI-TF encoder is constructed to perform layer-by-layer interactive fusion and obtain multi-granularity Cross-Layer Fusion (CLF) features. Finally, a prototype-based self-supervised loss function is designed to alleviate the semantic gap problem of "large intra-class variance and small inter-class variance" in RSI scene classification. Comparative experimental results on three datasets (i.e., AID, NWPU, and UCM) indicate that the performance of the designed PSCLI-TF model is highly competitive compared with other state-of-the-art methods.
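A prototype-based loss of the kind mentioned above typically attacks the "large intra-class variance, small inter-class variance" problem from both sides: pull features toward their class prototype and push different prototypes apart. The following is a generic sketch of that pattern with an assumed margin term, not the letter's exact loss function.

```python
import numpy as np

def prototype_loss(features, labels, prototypes, margin=1.0):
    """Pull each feature toward its class prototype (shrinks intra-class
    variance) and push prototype pairs at least `margin` apart (grows
    inter-class separation)."""
    pull = np.mean(np.sum((features - prototypes[labels]) ** 2, axis=1))
    d = np.linalg.norm(prototypes[:, None] - prototypes[None, :], axis=-1)
    push = np.triu(np.maximum(0.0, margin - d), k=1).sum()  # hinge on each pair
    return pull + push

rng = np.random.default_rng(0)
protos = rng.normal(scale=3.0, size=(5, 16))            # one prototype per class
labels = rng.integers(0, 5, size=32)
feats = protos[labels] + 0.1 * rng.normal(size=(32, 16))  # features near prototypes
loss = prototype_loss(feats, labels, protos)
print(loss >= 0.0)
```

Both terms need only features and (pseudo-)labels, which is why such objectives can be applied in a self-supervised fashion alongside the classification head.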