GPN-MSA: an alignment-based DNA language model for genome-wide variant effect prediction (Creative Commons)
Gonzalo Benegas, Carlos Albors, Alan J. Aw, et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal year: 2023, Issue: unknown

Published: Oct. 11, 2023

Whereas protein language models have demonstrated remarkable efficacy in predicting the effects of missense variants, DNA counterparts have not yet achieved a similar competitive edge for genome-wide variant effect predictions, especially in complex genomes such as that of humans. To address this challenge, we here introduce GPN-MSA, a novel framework that leverages whole-genome sequence alignments across multiple species and takes only a few hours to train. Across several benchmarks on clinical databases (ClinVar, COSMIC, OMIM), experimental functional assays (DMS, DepMap), and population genomic data (gnomAD), our model for the human genome achieves outstanding performance on deleteriousness prediction for both coding and non-coding variants.

Language: English

Cited by

13

Adapting protein language models for structure-conditioned design (Creative Commons)
Jeffrey A. Ruffolo, Aadyot Bhatnagar, Joel Beazer, et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal year: 2024, Issue: unknown

Published: Aug. 3, 2024

Generative models for protein design trained on experimentally determined structures have proven useful for a variety of design tasks. However, such methods are limited by the quantity and diversity of structures used for training, which represent a small, biased fraction of protein space. Here, we describe proseLM, a method for protein sequence design based on adaptation of protein language models to incorporate structural and functional context. We show that proseLM benefits from the scaling trends of the underlying language models, and that the addition of non-protein context (nucleic acids, ligands, and ions) improves recovery of native residues during design by 4-5% across model scales. These improvements are most pronounced for residues that directly interface with non-protein context, which are faithfully recovered at rates >70% by the most capable proseLM models. We experimentally validated proseLM by optimizing the editing efficiency of genome editors in human cells, achieving a 50% increase in base editing activity, and by redesigning therapeutic antibodies, resulting in a PD-1 binder with 2.2 nM affinity.

Language: English

Cited by

7

Microdroplet screening rapidly profiles a biocatalyst to enable its AI-assisted engineering (Open Access)
Maximilian Gantz, Simon V. Mathis, Friederike E. H. Nintzel, et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal year: 2024, Issue: unknown

Published: April 8, 2024

Engineering enzyme biocatalysts for higher efficiency is key to enabling sustainable, ‘green’ production processes for the chemical and pharmaceutical industry. This challenge can be tackled from two angles: by directed evolution, based on labor-intensive experimental testing of enzyme variant libraries, or by computational methods, where sequence-function data are used to predict biocatalyst improvements. Here, we combine both approaches into a two-week workflow, where ultra-high throughput screening of a library of imine reductases (IREDs) in microfluidic devices provides not only selected ‘hits’, but also long-read sequence data linked to fitness scores of >17 thousand enzyme variants. We demonstrate engineering of an IRED for chiral amine synthesis by mapping functional information in one go, ready to be used for interpretation and extrapolation by protein engineers with the help of machine learning (ML). We calculate position-dependent mutability and combinability scores of mutations and comprehensively illuminate a complex interplay of mutations driven by synergistic, often positively epistatic effects. Interpreted by easy-to-use regression tree-based ML algorithms designed to suit the evaluation of random whole-gene mutagenesis data, 3-fold improved ‘hits’ obtained from experimental screening are extrapolated further to give up to 23-fold improvements in catalytic rate after testing only a handful of designed mutants. Our campaign is paradigmatic for future enzyme engineering that will rely on access to large sequence-function maps as profiles of the way a biocatalyst responds to mutation. These maps will chart the way to improved function by exploiting the synergy of rapid experimental screening combined with ML evaluation and extrapolation.

Language: English

Cited by

6

Rapid protein evolution by few-shot learning with a protein language model
Kaiyi Jiang, Zhaoqing Yan, Matteo Di Bernardo, et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal year: 2024, Issue: unknown

Published: July 18, 2024

Directed evolution of proteins is critical for applications in basic biological research, therapeutics, diagnostics, and sustainability. However, directed methods are labor intensive, cannot efficiently optimize over multiple protein properties, often trapped by local maxima.

Language: English

Cited by

6

Deep indel mutagenesis reveals the impact of amino acid insertions and deletions on protein stability and function (Creative Commons)
Magdalena Topolska, Toni Beltran, Ben Lehner, et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal year: 2023, Issue: unknown

Published: Oct. 6, 2023

Amino acid insertions and deletions (indels) are an abundant class of genetic variants. However, compared to substitutions, the effects of indels on protein stability are not well understood and are poorly predicted. To better understand indels, here we analyze new and existing large-scale deep indel mutagenesis (DIM) of structurally diverse proteins. The effects of indels on protein stability vary extensively among and within proteins and are not well predicted by existing computational methods. To address this shortcoming we present INDELi, a series of models that combine experimental or predicted substitution effects and secondary structure information to provide good prediction of the effects of indels on both protein stability and pathogenicity. Moreover, quantifying the effects of indels on protein-protein interactions suggests that insertions can be an important class of gain-of-function variants. Our results provide an overview of the impact of indels on proteins and a method to predict their effects genome-wide.

Language: English

Cited by

14
