bioRxiv (Cold Spring Harbor Laboratory),
Journal Year:
2022,
Volume and Issue:
unknown
Published: Dec. 19, 2022
Abstract
Protein representations from deep language models have yielded state-of-the-art performance across many tasks in computational protein engineering. In recent years, progress has primarily focused on parameter count, with models’ capacities surpassing the size of the very datasets they were trained on. Here, we propose an alternative direction. We show that large language models trained on codons, instead of amino acid sequences, provide high-quality representations that outperform comparable state-of-the-art models across a variety of tasks. In some tasks, like species recognition, prediction of protein and transcript abundance, or melting point estimation, we show that a language model trained on codons outperforms every other published protein language model, including some that contain over 50 times more parameters. These results suggest that, in addition to commonly studied scale and model complexity, the information content of biological data provides an orthogonal direction to improve the power of machine learning in biology.
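As a rough illustration of why codon-level input can carry more information than amino-acid-level input, the sketch below splits a coding sequence into codon tokens and shows two synonymous sequences that translate to the same peptide while staying distinct as codon sequences. This is not the authors' tokenizer; the function names and the deliberately partial codon table are assumptions made for the example.

```python
from typing import List

# Partial codon table, for illustration only (assumption: just enough
# entries to translate the two toy sequences below).
CODON_TABLE = {
    "ATG": "M", "GCT": "A", "GCC": "A", "AAA": "K", "AAG": "K", "TAA": "*",
}


def codon_tokens(cds: str) -> List[str]:
    """Split a coding sequence into non-overlapping 3-nucleotide tokens."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def translate(tokens: List[str]) -> str:
    """Map codon tokens to amino acids, stopping at the first stop codon."""
    protein = []
    for codon in tokens:
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Two synonymous coding sequences: different codons, same peptide.
tokens_a = codon_tokens("ATGGCTAAATAA")
tokens_b = codon_tokens("ATGGCCAAGTAA")
print(tokens_a, translate(tokens_a))
print(tokens_b, translate(tokens_b))
```

An amino-acid model would see both inputs as the same sequence, whereas a codon model sees two different token sequences; that difference is the extra signal the abstract refers to.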
Genes,
Journal Year:
2020,
Volume and Issue:
11(10), P. 1133 - 1133
Published: Sept. 25, 2020
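Chloroplasts are unique organelles within the plant cells and are responsible for sustaining life forms on the earth due to their ability to conduct photosynthesis. Multiple functional genes within the chloroplast are responsible for a variety of metabolic processes that occur in the chloroplast. Considering its fundamental role in sustaining life on the earth, it is important to identify the level of diversity present in the chloroplast genome, what genes and genomic content have been lost, what genes have been transferred to the nuclear genome, duplication events, and the overall origin and evolution of the chloroplast genome. Our analysis of 2511 chloroplast genomes indicated that the genome size and the number of coding DNA sequences (CDS) in the chloroplasts of algae are higher relative to other lineages. Approximately 10.31% of the examined species have lost the inverted repeats (IR) in the chloroplast genome, and these species span across all the lineages. Genome-wide analyses revealed that the loss of the Rbcl gene in parasitic and heterotrophic plants occurred approximately 56 Ma ago. PsaM, Psb30, ChlB, ChlL, ChlN, and Rpl21 were found to be characteristic signature genes of the chloroplast genome of algae, bryophytes, pteridophytes, and gymnosperms; however, none of these genes were found in the angiosperm or magnoliid lineage, which appeared to have lost them approximately 203-156 Ma ago. A variety of chloroplast-encoded genes were lost across different lineages throughout the evolutionary process. The Rpl20 gene, however, was found to be the most stable and intact gene in the chloroplast genome and was not lost in any of the analyzed species, suggesting that it is a signature gene of the plastome. Our analysis indicated that chloroplast genomes evolved from multiple common ancestors ~1293 Ma ago and have undergone vivid recombination events across different taxonomic lineages.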
bioRxiv (Cold Spring Harbor Laboratory),
Journal Year:
2024,
Volume and Issue:
unknown
Published: Jan. 23, 2024
Clustering is a critical step in the analysis of single-cell data, as it enables the discovery and characterization of putative cell types and states. However, most popular clustering tools do not subject clustering results to statistical inference testing, leading to risks of overclustering or underclustering data and often resulting in ineffective identification of cell types with widely differing prevalence. To address these challenges, we present CHOIR (clustering hierarchy optimization by iterative random forests), which applies a framework of random forest classifiers and permutation tests across a hierarchical clustering tree to statistically determine which clusters represent distinct populations. We demonstrate the enhanced performance of CHOIR through extensive benchmarking against 14 existing clustering methods across 100 simulated and 4 real single-cell RNA-seq, ATAC-seq, spatial transcriptomic, and multi-omic datasets. CHOIR can be applied to any single-cell data type and provides a flexible, scalable, and robust solution to the important challenge of identifying biologically relevant cell groupings within heterogeneous single-cell data.
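The general idea of testing whether two candidate clusters are distinguishable can be sketched with a classifier-based permutation test. The example below is not CHOIR: it swaps the random forest for a leave-one-out nearest-centroid classifier on one-dimensional toy values so that it runs with the standard library alone, and every name and parameter in it is an assumption.

```python
import random


def loo_centroid_accuracy(values, labels):
    """Leave-one-out accuracy of a nearest-centroid classifier on 1-D data."""
    correct = 0
    for i, (value, label) in enumerate(zip(values, labels)):
        sums, counts = {}, {}
        # Build class centroids from every point except the held-out one.
        for j, (other, other_label) in enumerate(zip(values, labels)):
            if j != i:
                sums[other_label] = sums.get(other_label, 0.0) + other
                counts[other_label] = counts.get(other_label, 0) + 1
        predicted = min(sums, key=lambda c: abs(value - sums[c] / counts[c]))
        correct += predicted == label
    return correct / len(values)


def permutation_p_value(values, labels, n_perm=200, seed=0):
    """Compare observed accuracy with accuracies under shuffled labels."""
    rng = random.Random(seed)
    observed = loo_centroid_accuracy(values, labels)
    shuffled = list(labels)
    at_least_as_good = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if loo_centroid_accuracy(values, shuffled) >= observed:
            at_least_as_good += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return observed, (at_least_as_good + 1) / (n_perm + 1)


# Toy data: two well-separated candidate clusters (hypothetical values).
values = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 5.0, 5.1, 5.2, 5.3, 5.4, 5.5]
labels = ["a"] * 6 + ["b"] * 6
observed, p_value = permutation_p_value(values, labels)
print(observed, p_value)
```

A small p-value suggests the classifier separates the two groups better than chance labelings do, which is the kind of evidence used to keep a split; a large one would argue for merging the clusters.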
bioRxiv (Cold Spring Harbor Laboratory),
Journal Year:
2024,
Volume and Issue:
unknown
Published: Feb. 21, 2024
Abstract
DNA-binding proteins are essential in different biological processes, including DNA replication, transcription, packaging, and chromatin remodelling. Exploring their characteristics and functions has become relevant in diverse scientific domains. Computational biology and bioinformatics have assisted in studying DNA-binding proteins, complementing traditional molecular biology methods. While recent advances in machine learning have enabled the integration of predictive systems with bioinformatic approaches, there is still a need for generalizable pipelines for identifying unknown proteins as DNA-binding and assessing the specific type of DNA strand they recognize. In this work, we introduce RUDEUS, a Python library featuring hierarchical classification models designed to identify DNA-binding proteins and assess the specific interaction type, whether single-stranded or double-stranded. RUDEUS has a versatile pipeline capable of training predictive models, synergizing protein language models with supervised learning algorithms, and integrating Bayesian optimization strategies. The trained models have high performance, achieving a precision rate of 95% for DNA-binding identification and 89% for discerning between single-stranded and double-stranded interactions. RUDEUS includes an exploration tool for evaluating unknown protein sequences, annotating them as DNA-binding, and determining the type of DNA strand they recognize. Moreover, a structural bioinformatic pipeline has been integrated into RUDEUS for validating the identified DNA strand through DNA-protein molecular docking. These comprehensive strategies and straightforward implementation demonstrate comparable performance to high-end models and enhance usability for integration into protein engineering pipelines.
bioRxiv (Cold Spring Harbor Laboratory),
Journal Year:
2024,
Volume and Issue:
unknown
Published: Sept. 12, 2024
Severe tissue loss resulting from extremity trauma, such as volumetric muscle loss (VML), poses significant clinical challenges for both general and military populations. VML disrupts the endogenous tissue repair mechanisms, resulting in acute and unresolved chronic inflammation and immune cell presence, impaired muscle healing, scar tissue formation, persistent pain, and permanent functional deficits. The aberrant healing response is preceded by acute inflammation and immune cell infiltration which does not resolve. We analyzed the biosynthesis of inflammatory and specialized pro-resolving lipid mediators (SPMs) after VML injury in two different models; muscle with critical-sized defects had a decreased capacity to biosynthesize SPMs, leading to dysregulated inflammation. We developed a modular poly(ethylene glycol)-maleimide hydrogel platform to locally release a stable isomer of Resolvin D1 (AT-RvD1) and promote endogenous pathways of inflammation resolution in the two models. The local delivery of AT-RvD1 enhanced muscle regeneration, improved muscle function, and reduced pain sensitivity by promoting molecular and cellular resolution of inflammation. These findings provide new insights into the pathogenesis of VML and establish the pro-resolving hydrogel therapeutic as a promising strategy for promoting functional muscle regeneration after traumatic injury.
bioRxiv (Cold Spring Harbor Laboratory),
Journal Year:
2023,
Volume and Issue:
unknown
Published: Feb. 22, 2023
Cryo-electron microscopy (cryo-EM) is currently the most powerful technique for determining the structures of large protein complexes and assemblies. Picking single-protein particles from cryo-EM micrographs (images) is a key step in reconstructing protein structures. However, the widely used template-based particle picking process is labor-intensive and time-consuming. Though emerging machine learning-based particle picking can potentially automate the process, its development is severely hindered by the lack of large, high-quality, manually labelled training data. Here, we present CryoPPP, a large, diverse, expert-curated cryo-EM image dataset for single protein particle picking and analysis to address this bottleneck. It consists of manually labelled cryo-EM micrographs of 32 non-redundant, representative protein datasets selected from the Electron Microscopy Public Image Archive (EMPIAR). It includes 9,089 diverse, high-resolution micrographs (∼300 cryo-EM images per EMPIAR dataset) in which the coordinates of protein particles were labelled by human experts. The protein particle labelling process was rigorously validated by both 2D particle class validation and 3D density map validation with the gold standard. The dataset is expected to greatly facilitate the development of machine learning and artificial intelligence methods for automated cryo-EM protein particle picking. The dataset and data processing scripts are available at https://github.com/BioinfoMachineLearning/cryoppp.
bioRxiv (Cold Spring Harbor Laboratory),
Journal Year:
2025,
Volume and Issue:
unknown
Published: Feb. 6, 2025
Abstract
Accurate and trustworthy prediction of Enzyme Commission (EC) numbers is critical for understanding enzyme functions and their roles in biological processes. Despite the success of recently proposed deep learning-based models, there remain limitations, such as low performance in underrepresented EC numbers, lack of a learning strategy with incomplete annotations, and limited interpretability. To address these challenges, we propose a novel hierarchical interpretable transformer model, HIT-EC, for trustworthy EC number prediction. HIT-EC employs a four-level transformer architecture that aligns with the hierarchical structure of EC numbers and leverages both local and global dependencies within protein sequences for this multi-label classification task. We also propose a novel learning strategy to handle incomplete EC numbers. HIT-EC, an evidential deep learning model, produces trustworthy predictions by providing domain-specific evidence through a biologically meaningful interpretation scheme. The predictive performance of HIT-EC was assessed by multiple experiments: a cross-validation with a large dataset, a validation with external data, and a species-based performance evaluation. HIT-EC showed statistically significant improvement in predictive performance when compared to the current state-of-the-art benchmark models. HIT-EC’s robust interpretability was further validated by identifying well-known conserved motifs and functional regions in the CYP106A2 family. HIT-EC would be a robust, interpretable, and reliable solution for EC number prediction, with significant implications for enzymology, drug discovery, and metabolic engineering. The open-source code is publicly available at: https://github.com/datax-lab/HIT-EC.
bioRxiv (Cold Spring Harbor Laboratory),
Journal Year:
2025,
Volume and Issue:
unknown
Published: Feb. 27, 2025
Ongoing mutagenesis in cancer drives genetic diversity throughout the natural history of cancers. As the activities of mutational processes are dynamic throughout evolution, distinguishing the mutational signatures of 'active' and 'historical' processes has important implications for studying how tumors evolve. This can aid in understanding mutagenic states at the time of presentation, and in associating an active mutational process with therapeutic resistance. As bulk sequencing primarily captures historical mutational processes, we studied whether ultra-low-coverage single-cell whole-genome sequencing (scWGS), which measures the distribution of mutations across hundreds or thousands of individual cells, could enable the distinction between historical and active processes. While technical challenges and data sparsity have limited mutation analysis in scWGS, we show that these data contain valuable information about active mutational processes. To robustly interpret single nucleotide variants (SNVs) in scWGS, we introduce ArtiCull, a method to identify and remove SNV artifacts by leveraging evolutionary constraints, enabling reliable detection of mutations for signature analysis. Applying this approach to scWGS data from pancreatic ductal adenocarcinoma (PDAC), triple-negative breast cancer (TNBC), and high-grade serous ovarian cancer (HGSOC), we uncover temporal and spatial patterns in mutational processes. In PDAC, we observe a temporal increase in mismatch repair deficiency (MMRd). In cisplatin-treated TNBC patient-derived xenografts, we observe therapy-induced mutagenesis and inactivation of APOBEC3 activity. In HGSOC, we observe distinct patterns of APOBEC3 mutagenesis, including late tumor-wide activation in one case and clade-specific enrichment in another. Additionally, we detect clone-specific SBS17 activity, in a clone previously linked to recurrence. Our findings establish scWGS as a powerful approach to distinguish active mutational processes that may influence ongoing clonal evolution and therapeutic resistance.
Mitochondrial DNA Part A,
Journal Year:
2025,
Volume and Issue:
unknown, P. 1 - 7
Published: March 12, 2025
The subject of this study is Aegilops aucheri Boiss. 1844: a member of the section Sitopsis, subsection Truncata. This species is infrequently included in phylogenetic studies and is commonly regarded as a heterotypic synonym of Aegilops speltoides Tausch. The aim of this study was to detect genetic differences between Ae. aucheri and Ae. speltoides using the phylogenetic signal retrieved from chloroplast genomes. Plastomes of five accessions from different geographical locations were sequenced, annotated, and subjected to phylogenetic analysis. Plastome sizes were found to range from 135,666 to 135,668 bp in Ae. aucheri. Comparative analysis of chloroplast genome sequences revealed single-nucleotide polymorphisms (SNPs) and insertions/deletions (indels) relative to the Ae. speltoides plastome. To gain a more comprehensive understanding of the divergence within the Truncata subsection, sequencing the nuclear genome of Ae. aucheri and comparing it to that of Ae. speltoides is essential.
bioRxiv (Cold Spring Harbor Laboratory),
Journal Year:
2025,
Volume and Issue:
unknown
Published: April 21, 2025
Abstract
Understanding protein-protein interactions (PPIs) between viruses and human proteins is crucial for uncovering infection mechanisms and identifying potential therapeutic targets. The ability to generalize PPI predictive models across understudied viruses presents a significant challenge. In this work, we use arenavirus-human PPIs to illustrate the difficulties associated with model generalization, which are compounded by the lack of both positive and negative data. We employ a Transfer Learning approach to investigate this, utilizing models trained on better-studied virus-human and human-human interactions. Additionally, we curate and assess four types of negative sampling datasets to evaluate their impact on model performance. Despite the overall high accuracies (93-99%) and AUPRC scores (0.8-0.9) appearing promising, further analysis indicates that these performance metrics can be misleading due to data leakage, data bias, and overfitting, especially concerning under-represented viral proteins. We reveal the gaps and data imbalance through standard k-fold cross-validation and Independent Blind Testing with a Balanced Dataset, leading to a drop in accuracy to below 50%. We propose a viral protein-specific evaluation framework that groups viral proteins into majority and minority classes based on their representation in the dataset, allowing a comparison of performance using balanced accuracies. This framework offers a more robust evaluation of model generalizability, addressing the biases inherent in evaluation techniques and paving the way for more reliable PPI prediction for understudied viruses.
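The protein-specific evaluation described above can be sketched as follows. This is a minimal illustration, not the authors' framework: the grouping threshold `min_count`, the function names, and the toy labels are assumptions, and representation is counted within the evaluated list itself for simplicity.

```python
from collections import Counter, defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of the recalls on the negative (0) and positive (1) classes."""
    recalls = []
    for label in (0, 1):
        idx = [i for i, y in enumerate(y_true) if y == label]
        if idx:  # skip a class that is absent from this subset
            recalls.append(sum(y_pred[i] == label for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def grouped_balanced_accuracy(proteins, y_true, y_pred, min_count):
    """Score majority and minority viral proteins separately.

    A protein counts as 'majority' when it appears at least `min_count`
    times (an assumed, illustrative threshold).
    """
    counts = Counter(proteins)
    groups = defaultdict(lambda: ([], []))
    for protein, truth, pred in zip(proteins, y_true, y_pred):
        name = "majority" if counts[protein] >= min_count else "minority"
        groups[name][0].append(truth)
        groups[name][1].append(pred)
    return {name: balanced_accuracy(t, p) for name, (t, p) in groups.items()}


# Toy example with hypothetical labels: "NP" is well represented, "Z" is not.
proteins = ["NP"] * 6 + ["Z"] * 2
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0]
overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
scores = grouped_balanced_accuracy(proteins, y_true, y_pred, min_count=4)
print(overall, scores)
```

In this toy case the overall accuracy looks high while the minority group sits at chance level, which is the kind of gap that a single pooled metric hides.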
bioRxiv (Cold Spring Harbor Laboratory),
Journal Year:
2025,
Volume and Issue:
unknown
Published: April 19, 2025
Abstract
Single-cell technologies have revolutionized our ability to study cellular heterogeneity and dynamics at unprecedented resolutions. In this fast-growing field, it becomes increasingly challenging to navigate the vast amount of tools and steps for single-cell analysis. It is particularly difficult to integrate and analyze large datasets that require extensive collaborations and customized pipelines to obtain robust results. We present CytoAnalyst, a web-based platform that offers a number of important advantages over existing single-cell analysis tools. First, it enables custom pipeline configuration using an efficient management system and a broad range of analysis modules. Second, it supports parallel analysis instances, facilitating comprehensive comparison of different methods or parameter settings available at each step. Third, the advanced sharing system facilitates real-time synchronization among team members and seamless continuation of analysis across devices. Finally, the multi-grid visualization system enables simultaneous display of different data aspects, allowing multiple labels and plots side-by-side for comprehensive insights, with the ability to save and reload any configuration. The platform incorporates advanced blending modes, allowing users to combine multiple labels in various ways for data exploration. CytoAnalyst offers a high level of analytical rigor while providing user-friendly and flexible operations through its carefully designed interface and documentation. The platform supports all major web browsers and is freely available at https://cytoanalyst.tinnguyen-lab.com.