Proceedings of the National Academy of Sciences
Year: 2024, Issue: 121(26)
Published: June 20, 2024
Proteomics has been revolutionized by large protein language models (PLMs), which learn unsupervised representations from large corpora of sequences. These models are typically fine-tuned in a supervised setting to adapt the model to specific downstream tasks. However, the computational and memory footprint of fine-tuning (FT) large PLMs presents a barrier for many research groups with limited computational resources. Natural language processing has seen a similar explosion in the size of models, where these challenges have been addressed by methods for parameter-efficient fine-tuning (PEFT). In this work, we introduce this paradigm to proteomics by leveraging the parameter-efficient method LoRA and training new models for two important tasks: predicting protein–protein interactions (PPIs) and predicting the symmetry of homooligomer quaternary structures. We show that these approaches are competitive with traditional FT while requiring reduced memory and substantially fewer parameters. We additionally show that, for the PPI prediction task, training only the classification head also remains competitive with full FT, using five orders of magnitude fewer parameters, and that each of these methods outperforms state-of-the-art PPI prediction methods with substantially reduced compute. We further perform a comprehensive evaluation of the hyperparameter space, demonstrate that PEFT of PLMs is robust to variations in these hyperparameters, and elucidate where best practices for PEFT in proteomics differ from those in natural language processing. All our model adaptation and evaluation code is available open-source at https://github.com/microsoft/peft_proteomics. Thus, we provide a blueprint to democratize the power of PLM adaptation to groups with limited computational resources.

Briefings in Bioinformatics
Year: 2023, Issue: 25(1)
Published: November 22, 2023
Abstract
Network pharmacology (NP) provides a new methodological perspective for understanding traditional medicine from a holistic perspective, giving rise to frontiers such as traditional Chinese medicine network pharmacology (TCM-NP). With the development of artificial intelligence (AI) technology, it is key for NP to develop network-based AI methods to reveal the treatment mechanism of complex diseases from massive omics data. In this review, focusing on TCM-NP, we summarize the involved AI methods into three categories: network relationship mining, network target positioning and network target navigating, and present the typical application of TCM-NP in uncovering the biological basis and clinical value of Cold/Hot syndromes. Collectively, our review provides researchers with an innovative overview of the methodological progress of NP and its application in TCM from the AI perspective.

Nature
Year: 2024, Issue: 630(8015), pp. 181-188
Published: May 22, 2024
Abstract
Digital pathology poses unique computational challenges, as a standard gigapixel slide may comprise tens of thousands of image tiles [1–3]. Prior models have often resorted to subsampling a small portion of tiles for each slide, thus missing the important slide-level context [4]. Here we present Prov-GigaPath, a whole-slide pathology foundation model pretrained on 1.3 billion 256 × 256 pathology image tiles in 171,189 whole slides from Providence, a large US health network comprising 28 cancer centres. The slides originated from more than 30,000 patients covering 31 major tissue types. To pretrain Prov-GigaPath, we propose GigaPath, a novel vision transformer architecture for pretraining gigapixel pathology slides. To scale GigaPath for slide-level learning with tens of thousands of image tiles, GigaPath adapts the newly developed LongNet [5] method to digital pathology. To evaluate Prov-GigaPath, we construct a digital pathology benchmark comprising 9 cancer subtyping tasks and 17 pathomics tasks, using both Providence and TCGA data [6]. With large-scale pretraining and ultra-large-context modelling, Prov-GigaPath attains state-of-the-art performance on 25 out of 26 tasks, with significant improvement over the second-best method on 18 tasks. We further demonstrate the potential of Prov-GigaPath on vision–language pretraining for pathology [7,8] by incorporating the pathology reports. In sum, Prov-GigaPath is an open-weight foundation model that achieves state-of-the-art performance on various digital pathology tasks, demonstrating the importance of real-world data and whole-slide modelling.

bioRxiv (Cold Spring Harbor Laboratory)
Year: 2023, Issue: unknown
Published: November 2, 2023
Abstract
Hundreds of millions of single cells have been analyzed to date using high-throughput transcriptomic methods, thanks to technological advances driving the increasingly rapid generation of single-cell data. This provides an exciting opportunity for unlocking new insights into health and disease, made possible by meta-analyses that span diverse datasets, building on recent advances in large language models and other machine learning approaches. Despite the promise of these and emerging analytical tools for analyzing large amounts of data, a major challenge remains the sheer number of datasets, inconsistent format, and data accessibility. Many datasets are available via unique portals and platforms that often lack interoperability. Here, we present CZ CellxGene Discover (cellxgene.cziscience.com), a platform that provides a curated and interoperable single-cell data resource. Available via a free-to-use online portal, it hosts a growing corpus of community contributed data that spans more than 50 million cells. Curated, standardized, and associated with consistent cell-level metadata, this collection of single-cell data is the largest of its kind. A suite of features enables accessibility and reusability of the data via both computational and visual interfaces that allow researchers to rapidly explore individual datasets and perform cross-corpus analysis. This functionality is enabling meta-analyses of tens of millions of cells across studies and tissues, providing global views of human cells at the resolution of single cells.

Nature
Year: 2024, Issue: 626(8001), pp. 1084-1093
Published: February 14, 2024
Abstract
The house mouse (Mus musculus) is an exceptional model system, combining genetic tractability with close evolutionary affinity to humans [1,2]. Mouse gestation lasts only 3 weeks, during which the genome orchestrates the astonishing transformation of a single-cell zygote into a free-living pup composed of more than 500 million cells. Here, to establish a global framework for exploring mammalian development, we applied optimized single-cell combinatorial indexing to profile the transcriptional states of 12.4 million nuclei from 83 embryos, precisely staged at 2- to 6-hour intervals spanning late gastrulation (embryonic day 8) to birth (postnatal day 0). From these data, we annotate hundreds of cell types and explore the ontogenesis of the posterior embryo during somitogenesis and of kidney, mesenchyme, retina and early neurons. We leverage the temporal resolution and sampling depth of these whole-embryo snapshots, together with published data [4–8] from earlier timepoints, to construct a rooted tree of cell-type relationships that spans the entirety of prenatal development, from zygote to birth. Throughout this tree, we systematically nominate genes encoding transcription factors and other proteins as candidate drivers of the in vivo differentiation of hundreds of cell types. Remarkably, the most marked temporal shifts in cell states are observed within one hour of birth and presumably underlie the massive physiological adaptations that must accompany the successful transition of a mammalian fetus to life outside the womb.

bioRxiv (Cold Spring Harbor Laboratory)
Year: 2023, Issue: unknown
Published: May 31, 2023
Abstract
Large-scale pretrained models have become foundation models, leading to breakthroughs in natural language processing and related fields. Developing foundation models in life science for deciphering the “languages” of cells and facilitating biomedical research is promising yet challenging. We developed a large-scale pretrained model, scFoundation, with 100M parameters for this purpose. scFoundation was trained on over 50 million human single-cell transcriptomics data, which contain high-throughput observations on the complex molecular features in all known types of cells. scFoundation is currently the largest model in terms of the size of trainable parameters, the dimensionality of genes and the number of cells used in pre-training. Experiments showed that scFoundation can serve as a foundation model for single-cell transcriptomics and achieve state-of-the-art performances in a diverse array of downstream tasks, such as gene expression enhancement, tissue drug response prediction, single-cell drug response classification, and single-cell perturbation prediction.

bioRxiv (Cold Spring Harbor Laboratory)
Year: 2024, Issue: unknown
Published: February 27, 2024
The genome is a sequence that completely encodes the DNA, RNA, and proteins that orchestrate the function of a whole organism. Advances in machine learning combined with massive datasets of whole genomes could enable a biological foundation model that accelerates the mechanistic understanding and generative design of complex molecular interactions. We report Evo, a genomic foundation model that enables prediction and generation tasks from the molecular to the genome scale. Using an architecture based on advances in deep signal processing, we scale Evo to 7 billion parameters with a context length of 131 kilobases (kb) at single-nucleotide, byte resolution. Trained on whole prokaryotic genomes, Evo can generalize across the three fundamental modalities of the central dogma of molecular biology to perform zero-shot function prediction that is competitive with, or outperforms, leading domain-specific language models. Evo also excels at multi-element generation tasks, which we demonstrate by generating synthetic CRISPR-Cas molecular complexes and entire transposable systems for the first time. Using information learned over whole genomes, Evo can also predict gene essentiality at nucleotide resolution and can generate coding-rich sequences up to 650 kb in length, orders of magnitude longer than previous methods. Advances in multi-modal and multi-scale learning with Evo provide a promising path toward improving our understanding and control of biology across multiple levels of complexity.

Nucleic Acids Research
Year: 2024, Issue: 53(D1), pp. D886-D900
Published: November 28, 2024
Hundreds of millions of single cells have been analyzed using high-throughput transcriptomic methods. The cumulative knowledge within these datasets provides an exciting opportunity for unlocking insights into health and disease at the level of single cells. Meta-analyses that span diverse datasets, building on recent advances in large language models and other machine-learning approaches, pose new directions to model and extract insight from single-cell data. Despite the promise of these and emerging analytical tools for analyzing large amounts of data, the sheer number of datasets, and data accessibility, remains a challenge. Here, we present CZ CELLxGENE Discover (cellxgene.cziscience.com), a platform that provides curated and interoperable single-cell data. Available via a free-to-use online portal, CZ CELLxGENE hosts a growing corpus of community-contributed data of over 93 million unique cells. Curated, standardized and associated with consistent cell-level metadata, this collection of single-cell transcriptomic data is the largest of its kind and is growing rapidly via community contributions. A suite of features enables accessibility and reusability of the data via both computational and visual interfaces that allow researchers to explore individual datasets, perform cross-corpus analysis, and run meta-analyses of tens of millions of cells across studies and tissues at the resolution of single cells.