Democratizing protein language models with parameter-efficient fine-tuning DOI Creative Commons
Samuel Sledzieski, Meghana Kshirsagar, Minkyung Baek, et al.

Proceedings of the National Academy of Sciences, Year: 2024, No. 121(26)

Published: June 20, 2024

Proteomics has been revolutionized by large protein language models (PLMs), which learn unsupervised representations from large corpora of sequences. These models are typically fine-tuned in a supervised setting to adapt the model to specific downstream tasks. However, the computational and memory footprint of fine-tuning (FT) large PLMs presents a barrier for many research groups with limited resources. Natural language processing has seen a similar explosion in the size of models, where these challenges have been addressed by methods for parameter-efficient fine-tuning (PEFT). In this work, we introduce this paradigm to proteomics through leveraging the parameter-efficient method LoRA and training new models for two important tasks: predicting protein–protein interactions (PPIs) and predicting the symmetry of homooligomer quaternary structures. We show that these approaches are competitive with traditional FT while requiring reduced memory and substantially fewer parameters. We additionally show that, for the PPI prediction task, training only the classification head also remains competitive with full FT, using five orders of magnitude fewer parameters, and that each of these methods outperforms state-of-the-art PPI prediction methods with substantially reduced compute. We further perform a comprehensive evaluation of the hyperparameter space, demonstrate that PEFT of PLMs is robust to variations in these hyperparameters, and elucidate where best practices for PEFT in proteomics differ from those in natural language processing. All our adaptation code is available open-source at https://github.com/microsoft/peft_proteomics. Thus, we provide a blueprint to democratize the power of PLM…

Language: English

scGPT: toward building a foundation model for single-cell multi-omics using generative AI DOI
Haotian Cui, Xiaoming Wang, Hassaan Maan, et al.

Nature Methods, Year: 2024, No. 21(8), pp. 1470-1480

Published: Feb. 26, 2024

Language: English

Cited by: 290

Network pharmacology: towards the artificial intelligence-based precision traditional Chinese medicine DOI Creative Commons
Peng Zhang, Dingfan Zhang, Wuai Zhou, et al.

Briefings in Bioinformatics, Year: 2023, No. 25(1)

Published: Nov. 22, 2023

Network pharmacology (NP) provides a new methodological perspective for understanding traditional medicine from a holistic perspective, giving rise to frontiers such as traditional Chinese medicine network pharmacology (TCM-NP). With the development of artificial intelligence (AI) technology, it is key for NP to develop network-based AI methods to reveal the treatment mechanism of complex diseases from massive omics data. In this review, focusing on TCM-NP, we summarize the involved AI methods into three categories: network relationship mining, network target positioning and network target navigating, and present the typical application of TCM-NP in uncovering the biological basis and clinical value of Cold/Hot syndromes. Collectively, our review provides researchers with an innovative overview of the methodological progress of NP and its application in TCM from the AI perspective.

Language: English

Cited by: 199

A whole-slide foundation model for digital pathology from real-world data DOI Creative Commons
Hanwen Xu, Naoto Usuyama, Jaspreet Bagga, et al.

Nature, Year: 2024, No. 630(8015), pp. 181-188

Published: May 22, 2024

Digital pathology poses unique computational challenges, as a standard gigapixel slide may comprise tens of thousands of image tiles [1–3]. Prior models have often resorted to subsampling a small portion of tiles for each slide, thus missing the important slide-level context [4]. Here we present Prov-GigaPath, a whole-slide pathology foundation model pretrained on 1.3 billion 256 × 256 pathology image tiles in 171,189 whole slides from Providence, a large US health network comprising 28 cancer centres. The slides originated from more than 30,000 patients covering 31 major tissue types. To pretrain Prov-GigaPath, we propose GigaPath, a novel vision transformer architecture for pretraining gigapixel pathology slides. To scale GigaPath for slide-level learning with tens of thousands of image tiles, GigaPath adapts the newly developed LongNet [5] method to digital pathology. To evaluate Prov-GigaPath, we construct a digital pathology benchmark comprising 9 cancer subtyping tasks and 17 pathomics tasks, using both Providence and TCGA data [6]. With large-scale pretraining and ultra-large-context modelling, Prov-GigaPath attains state-of-the-art performance on 25 out of 26 tasks, with significant improvement over the second-best method on 18 tasks. We further demonstrate the potential of Prov-GigaPath on vision–language pretraining for pathology [7,8] by incorporating the pathology reports. In sum, Prov-GigaPath is an open-weight foundation model that achieves state-of-the-art performance on various digital pathology tasks, demonstrating the importance of real-world data and whole-slide modelling.

Language: English

Cited by: 133

Large-scale foundation model on single-cell transcriptomics DOI
Minsheng Hao, Jing Gong, Xin Zeng, et al.

Nature Methods, Year: 2024, No. 21(8), pp. 1481-1491

Published: June 6, 2024

Language: English

Cited by: 97

RNAi-based drug design: considerations and future directions DOI
Qi Tang, Anastasia Khvorova

Nature Reviews Drug Discovery, Year: 2024, No. 23(5), pp. 341-364

Published: April 3, 2024

Language: English

Cited by: 86

CZ CELL×GENE Discover: A single-cell data platform for scalable exploration, analysis and modeling of aggregated data DOI Creative Commons

Shibla Abdulla, Brian D. Aevermann, Pedro Assis, et al.

bioRxiv (Cold Spring Harbor Laboratory), Year: 2023, No. unknown

Published: Nov. 2, 2023

Hundreds of millions of single cells have been analyzed to date using high-throughput transcriptomic methods, thanks to technological advances driving the increasingly rapid generation of single-cell data. This provides an exciting opportunity for unlocking new insights into health and disease, made possible by meta-analyses that span diverse datasets and build on recent advances in large language models and other machine learning approaches. Despite the promise of these emerging analytical tools for analyzing large amounts of data, a major challenge remains the sheer number of datasets, inconsistent format, and data accessibility. Many datasets are available via unique portals and platforms that often lack interoperability. Here, we present CZ CellxGene Discover (cellxgene.cziscience.com), a platform that provides curated and interoperable single-cell data. This resource, a free-to-use online portal, hosts a growing corpus of community-contributed data that spans more than 50 million cells. Curated, standardized, and associated with consistent cell-level metadata, this collection is the largest of its kind. A suite of features enables accessibility and reusability of the data via both computational and visual interfaces that allow researchers to rapidly explore individual datasets and perform cross-corpus analysis. This functionality is enabling meta-analyses of tens of millions of cells across studies and tissues, providing global views of human cells at the resolution of single cells.

Language: English

Cited by: 67

A single-cell time-lapse of mouse prenatal development from gastrula to birth DOI Creative Commons
Chengxiang Qiu, Beth Martin, Ian Welsh, et al.

Nature, Year: 2024, No. 626(8001), pp. 1084-1093

Published: Feb. 14, 2024

The house mouse (Mus musculus) is an exceptional model system, combining genetic tractability with close evolutionary affinity to humans [1,2]. Mouse gestation lasts only 3 weeks, during which the genome orchestrates the astonishing transformation of a single-cell zygote into a free-living pup composed of more than 500 million cells. Here, to establish a global framework for exploring mammalian development, we applied optimized single-cell combinatorial indexing to profile the transcriptional states of 12.4 million nuclei from 83 embryos, precisely staged at 2- to 6-hour intervals spanning late gastrulation (embryonic day 8) to birth (postnatal day 0). From these data, we annotate hundreds of cell types and explore the ontogenesis of the posterior embryo during somitogenesis and of kidney, mesenchyme, retina and early neurons. We leverage the temporal resolution and sampling depth of these whole-embryo snapshots, together with published data [4–8] from earlier timepoints, to construct a rooted tree of cell-type relationships that spans the entirety of prenatal development, from zygote to birth. Throughout this tree, we systematically nominate genes encoding transcription factors and other proteins as candidate drivers of the in vivo differentiation of hundreds of cell types. Remarkably, the most marked temporal shifts in cell states are observed within one hour of birth and presumably underlie the massive physiological adaptations that must accompany the successful transition of a mammalian fetus to life outside the womb.

Language: English

Cited by: 65

Large Scale Foundation Model on Single-cell Transcriptomics DOI Open Access
Minsheng Hao, Jing Gong, Xin Zeng, et al.

bioRxiv (Cold Spring Harbor Laboratory), Year: 2023, No. unknown

Published: May 31, 2023

Large-scale pretrained models have become foundation models, leading to breakthroughs in natural language processing and related fields. Developing foundation models in life science for deciphering the “languages” of cells and facilitating biomedical research is promising yet challenging. We developed a large-scale pretrained model, scFoundation, with 100M parameters for this purpose. scFoundation was trained on over 50 million human single-cell transcriptomics data, which contain high-throughput observations on the complex molecular features in all known types of cells. scFoundation is currently the largest model in terms of the size of trainable parameters, the dimensionality of genes and the number of cells used in pre-training. Experiments showed that scFoundation can serve as a foundation model for single-cell transcriptomics and achieve state-of-the-art performances in a diverse array of downstream tasks, such as gene expression enhancement, tissue drug response prediction, single-cell drug response classification, and single-cell perturbation prediction.

Language: English

Cited by: 62

Sequence modeling and design from molecular to genome scale with Evo DOI Creative Commons
Éric Nguyen, Michael Poli, Matthew G. Durrant, et al.

bioRxiv (Cold Spring Harbor Laboratory), Year: 2024, No. unknown

Published: Feb. 27, 2024

The genome is a sequence that completely encodes the DNA, RNA, and proteins that orchestrate the function of a whole organism. Advances in machine learning combined with massive datasets of whole genomes could enable a biological foundation model that accelerates the mechanistic understanding and generative design of complex molecular interactions. We report Evo, a genomic foundation model that enables prediction and generation tasks from the molecular to the genome scale. Using an architecture based on advances in deep signal processing, we scale Evo to 7 billion parameters with a context length of 131 kilobases (kb) at single-nucleotide, byte resolution. Trained on whole prokaryotic genomes, Evo can generalize across the three fundamental modalities of the central dogma of molecular biology to perform zero-shot function prediction that is competitive with, or outperforms, leading domain-specific language models. Evo also excels at multi-element generation tasks, which we demonstrate by generating synthetic CRISPR-Cas molecular complexes and entire transposable systems for the first time. Using information learned over whole genomes, Evo can also predict gene essentiality at nucleotide resolution and can generate coding-rich sequences up to 650 kb in length, orders of magnitude longer than previous methods. Multi-modal and multi-scale learning with Evo provides a promising path toward improving our understanding and control of biology across multiple levels of complexity.

Language: English

Cited by: 53

CZ CELLxGENE Discover: a single-cell data platform for scalable exploration, analysis and modeling of aggregated data DOI Creative Commons

Shibla Abdulla, Brian D. Aevermann, Pedro Assis, et al.

Nucleic Acids Research, Year: 2024, No. 53(D1), pp. D886-D900

Published: Nov. 28, 2024

Hundreds of millions of single cells have been analyzed using high-throughput transcriptomic methods. The cumulative knowledge within these datasets provides an exciting opportunity for unlocking insights into health and disease at the level of single cells. Meta-analyses that span diverse datasets, building on recent advances in large language models and other machine-learning approaches, pose new directions to model and extract insight from single-cell data. Despite the promise of these emerging analytical tools for analyzing large amounts of data, the sheer number of datasets and data accessibility remains a challenge. Here, we present CZ CELLxGENE Discover (cellxgene.cziscience.com), a platform that provides curated and interoperable single-cell data. Available via a free-to-use online portal, it hosts a growing corpus of community-contributed data of over 93 million unique cells. Curated, standardized and associated with consistent cell-level metadata, this collection is the largest of its kind and is rapidly growing via community contributions. A suite of features enables accessibility and reusability of the data via both computational and visual interfaces that allow researchers to explore individual datasets, perform cross-corpus analysis, and run meta-analyses of tens of millions of cells across studies and tissues at the resolution of single cells.

Language: English

Cited by: 48