Strategies to improve selection compared to selection based on estimated breeding values DOI Open Access
Torsten Pook, Azadeh Hassanpour, Tobias Niehoff

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2025, Volume and Issue: unknown

Published: Jan. 13, 2025

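Abstract Background: Selection of individuals based on their estimated breeding values aims to maximize response to selection in the next generation in an additive model. However, when the aim is not only about short-term population-wide genetic gain but also about genetic gain over multiple generations, an optimal selection strategy is not as clear-cut, as the maintenance of genetic diversity may become an important factor. This study provides an extended comparison of existing selection strategies in a unifying testing pipeline using the simulation software MoBPS. Results: Applying a weighting factor on SNP effects based on the frequency of the beneficial allele resulted in an increase of long-term genetic gain of 1.6% after 50 generations while reducing inbreeding rates by 16.2% compared to truncation selection based on estimated breeding values. However, this resulted in short-term losses of 1.2%, with the break-even point reached after 25 generations. In contrast, the inclusion of the average kinship of an individual to the top individuals of the population as an additional trait in the selection index with an index weight of 17.5% resulted in no short-term losses and increased long-term gains by 4.3% while reducing inbreeding rates by 15.8%, achieving very similar efficiency to the use of optimum contribution selection. Combining management strategies, with weights for each strategy optimized by an evolutionary algorithm, resulted in a breeding scheme with 5.1% increased long-term gains and 37.3% reduced inbreeding rates. The proposed strategy included optimum contribution, weighting based on allele frequency, the average kinship in the selection index, avoiding matings between related individuals, and lowering the proportion of selected individuals. Conclusions: The combination of selection strategies was shown to be far superior to any singular method tested in this study. As the use of efficient selection methods does not necessarily lead to short-term losses and comes at no extra costs, it is critical for breeding companies to implement such strategies for long-term success.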

Language: English

Learning inverse folding from millions of predicted structures DOI Creative Commons
Chloe Hsu, Robert Verkuil, Jason Liu

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2022, Volume and Issue: unknown

Published: April 10, 2022

Abstract We consider the problem of predicting a protein sequence from its backbone atom coordinates. Machine learning approaches to this problem to date have been limited by the number of available experimentally determined protein structures. We augment training data by nearly three orders of magnitude by predicting structures for 12M protein sequences using AlphaFold2. Trained with this additional data, a sequence-to-sequence transformer with invariant geometric input processing layers achieves 51% native sequence recovery on structurally held-out backbones with 72% recovery for buried residues, an overall improvement of almost 10 percentage points over existing methods. The model generalizes to a variety of more complex tasks including design of protein complexes, partially masked structures, binding interfaces, and multiple states.

Language: English

Citations

245

Serine Catabolism Feeds NADH when Respiration Is Impaired DOI Creative Commons
Lifeng Yang, Juan Carlos García‐Cañaveras, Zihong Chen

et al.

Cell Metabolism, Journal Year: 2020, Volume and Issue: 31(4), P. 809 - 821.e6

Published: March 17, 2020

Language: English

Citations

164

BRAKER3: Fully automated genome annotation using RNA-seq and protein evidence with GeneMark-ETP, AUGUSTUS and TSEBRA DOI Creative Commons
Lars Gabriel, Tomáš Brůna, Katharina J. Hoff

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2023, Volume and Issue: unknown

Published: June 12, 2023

Abstract Gene prediction has remained an active area of bioinformatics research for a long time. Still, gene prediction in large eukaryotic genomes presents a challenge that must be addressed by new algorithms. The amount and significance of the evidence available from transcriptomes and proteomes vary across genomes, between genes and even along a single gene. User-friendly and accurate annotation pipelines that can cope with such data heterogeneity are needed. The previously developed annotation pipelines BRAKER1 and BRAKER2 use RNA-seq or protein data, respectively, but not both. A further significant performance improvement was made by the recently released GeneMark-ETP integrating all three data types. We here present the BRAKER3 pipeline that builds on GeneMark-ETP and AUGUSTUS and further improves accuracy using the TSEBRA combiner. BRAKER3 annotates protein-coding genes in eukaryotic genomes using both short-read RNA-seq and a large protein database, along with statistical models learned iteratively and specifically for the target genome. We benchmarked the new pipeline on 11 species under assumed level of relatedness of the target species proteome to available proteomes. BRAKER3 outperformed BRAKER1 and BRAKER2. The average transcript-level F1-score was increased by ∼20 percentage points on average, while the difference was most pronounced for species with large and complex genomes. BRAKER3 also outperformed other existing tools, MAKER2, Funannotate and FINDER. The code of BRAKER3 is available on GitHub as a ready-to-run Docker container for execution with Docker or Singularity. Overall, BRAKER3 is an accurate, easy-to-use tool for eukaryotic genome annotation.

Language: English

Citations

126

The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics DOI Creative Commons

Hugo Dalla-Torre, Liam Gonzalez, Javier Mendoza‐Revilla

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2023, Volume and Issue: unknown

Published: Jan. 15, 2023

Closing the gap between measurable genetic information and observable traits is a longstanding challenge in genomics. Yet, the prediction of molecular phenotypes from DNA sequences alone remains limited and inaccurate, often driven by the scarcity of annotated data and the inability to transfer learnings between prediction tasks. Here, we present an extensive study of foundation models pre-trained on DNA sequences, named the Nucleotide Transformer, ranging from 50M up to 2.5B parameters and integrating information from 3,202 diverse human genomes, as well as 850 genomes selected across diverse phyla, including both model and non-model organisms. These transformer models yield transferable, context-specific representations of nucleotide sequences, which allow for accurate molecular phenotype prediction even in low-data settings. We show that the developed models can be fine-tuned at low cost and despite the low available data regime to solve a variety of genomics applications. Despite no supervision, the transformer models learned to focus attention on key genomic elements, including those that regulate gene expression, such as enhancers. Lastly, we demonstrate that utilizing model representations can improve the prioritization of functional genetic variants. The training and application of foundational models in genomics explored in this study provide a widely applicable stepping stone to bridge the gap of accurate molecular phenotype prediction from DNA sequence. Code and weights are available at: https://github.com/instadeepai/nucleotide-transformer in Jax and https://huggingface.co/InstaDeepAI in Pytorch. Example notebooks to apply these models to any downstream task are available at https://huggingface.co/docs/transformers/notebooks#pytorch-bio.

Language: English

Citations

107

SaProt: Protein Language Modeling with Structure-aware Vocabulary DOI Open Access

Jin Su, Chenchen Han, Yuyang Zhou

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2023, Volume and Issue: unknown

Published: Oct. 2, 2023

Abstract Large-scale protein language models (PLMs), such as the ESM family, have achieved remarkable performance in various downstream tasks related to protein structure and function by undergoing unsupervised training on residue sequences. They have become essential tools for researchers and practitioners in biology. However, a limitation of vanilla PLMs is their lack of explicit consideration for protein structure information, which suggests the potential for further improvement. Motivated by this, we introduce the concept of a “structure-aware vocabulary” that integrates residue tokens with structure tokens. The structure tokens are derived by encoding the 3D structure of proteins using Foldseek. We then propose SaProt, a large-scale general-purpose PLM trained on an extensive dataset comprising approximately 40 million protein sequences and structures. Through extensive evaluation, our SaProt model surpasses well-established and renowned baselines across 10 significant downstream tasks, demonstrating its exceptional capacity and broad applicability. We have made the code, pre-trained model, and all relevant materials available at https://github.com/westlake-repl/SaProt.

Language: English

Citations

87

ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Learning DOI Creative Commons
Ahmed Elnaggar, Michael Heinzinger, Christian Dallago

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2020, Volume and Issue: unknown

Published: July 12, 2020

Abstract Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models taken from NLP. These LMs reach new prediction frontiers at low inference costs. Here, we trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids. The LMs were trained on the Summit supercomputer using 5616 GPUs and TPU Pod up-to 1024 cores. Dimensionality reduction revealed that the raw protein LM-embeddings from unlabeled data captured some biophysical features of protein sequences. We validated the advantage of using the embeddings as exclusive input for several subsequent tasks. The first was a per-residue prediction of protein secondary structure (3-state accuracy Q3=81%-87%); the second were per-protein predictions of protein sub-cellular localization (ten-state accuracy: Q10=81%) and membrane vs. water-soluble (2-state accuracy Q2=91%). For the per-residue predictions the transfer of the most informative embeddings (ProtT5) for the first time outperformed the state-of-the-art without using evolutionary information, thereby bypassing expensive database searches. Taken together, the results implied that protein LMs learned some of the grammar of the language of life. To facilitate future work, we released our models at https://github.com/agemagician/ProtTrans

Language: English

Citations

117

FLIP: Benchmark tasks in fitness landscape inference for proteins DOI Creative Commons
Christian Dallago, Jody Mou, Kadina E. Johnston

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2021, Volume and Issue: unknown

Published: Nov. 11, 2021

Abstract Machine learning could enable an unprecedented level of control in protein engineering for therapeutic and industrial applications. Critical to its use in designing proteins with desired properties, machine learning models must capture the protein sequence-function relationship, often termed fitness landscape. Existing benchmarks like CASP or CAFA assess structure and function predictions of proteins, respectively, yet they do not target metrics relevant for protein engineering. In this work, we introduce Fitness Landscape Inference for Proteins (FLIP), a benchmark for function prediction to encourage rapid scoring of representation learning for protein engineering. Our curated tasks, baselines, and metrics probe model generalization in settings relevant for protein engineering, e.g. low-resource and extrapolative. Currently, FLIP encompasses experimental data across adeno-associated virus stability for gene therapy, protein domain B1 stability and immunoglobulin binding, and thermostability from multiple protein families. In order to enable ease of use and future expansion to new tasks, all data are presented in a standard format. FLIP scripts and data are freely accessible at https://benchmark.protein.properties

Language: English

Citations

98

Software tools, databases and resources in metabolomics: updates from 2018 to 2019 DOI
Keiron O’Shea, Biswapriya B. Misra

Metabolomics, Journal Year: 2020, Volume and Issue: 16(3)

Published: March 1, 2020

Language: English

Citations

97

Modern Hopfield Networks and Attention for Immune Repertoire Classification DOI Open Access
Michael Widrich, Bernhard Schäfl, Milena Pavlović

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2020, Volume and Issue: unknown

Published: April 13, 2020

Abstract A central mechanism in machine learning is to identify, store, and recognize patterns. How to learn, access, and retrieve such patterns is crucial in Hopfield networks and the more recent transformer architectures. We show that the attention mechanism of transformer architectures is actually the update rule of modern Hopfield networks that can store exponentially many patterns. We exploit this high storage capacity of modern Hopfield networks to solve a challenging multiple instance learning (MIL) problem in computational biology: immune repertoire classification. Accurate and interpretable machine learning methods solving this problem could pave the way towards new vaccines and therapies, which is currently a very relevant research topic intensified by the COVID-19 crisis. Immune repertoire classification based on the vast number of immunosequences of an individual is a MIL problem with an unprecedentedly massive number of instances, two orders of magnitude larger than currently considered problems, and with an extremely low witness rate. In this work, we present our novel method DeepRC that integrates transformer-like attention, or equivalently modern Hopfield networks, into deep learning architectures for massive MIL such as immune repertoire classification. We demonstrate that DeepRC outperforms all other methods with respect to predictive performance on large-scale experiments, including simulated and real-world virus infection data, and enables the extraction of sequence motifs that are connected to a given disease class. Source code and datasets: https://github.com/ml-jku/DeepRC

Language: English

Citations

89

ATF4 Protects the Heart From Failure by Antagonizing Oxidative Stress DOI Open Access
Xiaoding Wang, Guangyu Zhang, Subhajit Dasgupta

et al.

Circulation Research, Journal Year: 2022, Volume and Issue: 131(1), P. 91 - 105

Published: May 16, 2022

Background: Cellular redox control is maintained by generation of reactive oxygen/nitrogen species balanced by activation of antioxidative pathways. Disruption of redox balance leads to oxidative stress, a central causative event in numerous diseases including heart failure. Redox control in the heart exposed to hemodynamic stress, however, remains to be fully elucidated. Methods: Pressure overload was triggered by transverse aortic constriction in mice. Transcriptomic and metabolomic regulations were evaluated by RNA-sequencing and metabolomics, respectively. Stable isotope tracer labeling experiments were conducted to determine metabolic flux in vitro. Neonatal rat ventricular myocytes and H9c2 cells were used to examine molecular mechanisms. Results: We show that production of cardiomyocyte NADPH, a key factor in redox regulation, is decreased in pressure overload-induced heart failure. As a consequence, the level of reduced glutathione is downregulated, a change associated with fibrosis and cardiomyopathy. We report that the pentose phosphate pathway and mitochondrial serine/glycine/folate metabolic signaling, 2 NADPH-generating pathways in the cytosol and mitochondria, respectively, are induced by transverse aortic constriction. We identify ATF4 (activating transcription factor 4) as an upstream transcription factor controlling the expression of multiple enzymes in these 2 pathways. Consistently, joint pathway analysis of transcriptomic and metabolomic data reveal that ATF4 preferably controls oxidative stress and redox-related pathways. Overexpression of ATF4 in neonatal rat ventricular myocytes increases NADPH-producing enzymes, whereas silencing of ATF4 decreases their expression. Further, stable isotope tracer experiments reveal that ATF4 overexpression augments metabolic flux within these 2 pathways. In vivo, cardiomyocyte-specific deletion of ATF4 exacerbates cardiomyopathy in the setting of transverse aortic constriction and accelerates heart failure development, attributable, at least in part, to an inability to increase the expression of NADPH-generating enzymes. Conclusions: Our findings reveal that ATF4 plays a critical role in the heart under conditions of hemodynamic stress by governing both cytosolic and mitochondrial production of NADPH.

Language: English

Citations

63