Scaling Down for Efficiency: Medium-Sized Transformer Models for Protein Sequence Transfer Learning

Luiz Carlos Vieira, Morgan L. Handojo, Claus O. Wilke, et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: Nov. 24, 2024

Protein language models such as the transformer-based Evolutionary Scale Modeling 2 (ESM2) can offer deep insights into the evolutionary and structural properties of proteins. While larger models, such as ESM2 15B, promise to capture more complex patterns in sequence space, they also present practical challenges due to their high dimensionality and computational cost. We systematically evaluated the performance of ESM2 models of all sizes across many biological datasets to determine the impact of model size on transfer learning. Surprisingly, larger models do not always outperform smaller ones, especially when data is limited. The medium-sized model, ESM2 650M, exhibited consistent performance, falling only slightly behind the 15B-parameter model despite being over 20 times smaller. Additionally, we compared various methods of embedding compression to identify the most effective approach, and found that mean embeddings consistently outperformed other methods. Our results show that ESM2 650M with mean embeddings offers an optimal balance between performance and efficiency, making it a scalable choice for transfer learning across a variety of applications.

Significance Statement

This work challenges the common belief that larger models yield better results, here in the context of protein biochemistry. By comparing transformer models of different sizes across many tasks, we demonstrate that medium-sized models frequently perform as well as larger variants, especially when data is limited. These findings provide an efficient strategy for machine learning-based protein analysis and promote the broader accessibility of AI in biology. Smaller, more efficient models help democratize advanced machine-learning tools, making them accessible to researchers with limited resources.
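To make the mean-embedding strategy concrete, here is a minimal sketch that mean-pools per-residue ESM2 650M representations into a single fixed-length feature vector, using the HuggingFace `transformers` library and the public `facebook/esm2_t33_650M_UR50D` checkpoint; the pooling details are a standard implementation, not necessarily the authors' exact pipeline.

```python
# Minimal sketch: mean-pooled ESM2 650M embeddings as fixed-length features
# for downstream transfer learning (checkpoint name and pooling details are
# standard assumptions, not the paper's exact code).
import torch
from transformers import AutoTokenizer, EsmModel

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
model = EsmModel.from_pretrained("facebook/esm2_t33_650M_UR50D")
model.eval()

def mean_embedding(sequence: str) -> torch.Tensor:
    """Average per-residue embeddings into one fixed-length vector."""
    inputs = tokenizer(sequence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (L+2, 1280)
    # Drop the CLS and EOS special tokens before averaging.
    return hidden[1:-1].mean(dim=0)  # (1280,)

vec = mean_embedding("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(vec.shape)  # torch.Size([1280])
```

The resulting vector has the same size for every protein, so it can feed any standard downstream classifier or regressor.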

Language: English

Opportunities and Challenges for Machine Learning-Assisted Enzyme Engineering

Jason Yang, Francesca-Zhoufan Li, Frances H. Arnold, et al.

ACS Central Science, Journal Year: 2024, Volume and Issue: 10(2), P. 226 - 241

Published: Feb. 5, 2024

Enzymes can be engineered at the level of their amino acid sequences to optimize key properties such as expression, stability, substrate range, and catalytic efficiency, or even to unlock new activities not found in nature. Because the search space of possible proteins is vast, enzyme engineering usually involves discovering a starting point that has some desired activity, followed by directed evolution to improve its "fitness" for a specific application. Recently, machine learning (ML) has emerged as a powerful tool to complement this empirical process. ML models can contribute to (1) the discovery and functional annotation of known protein sequences or the generation of novel proteins with new functions and (2) navigating protein fitness landscapes for optimization, using mappings between sequences and their associated fitness values. In this Outlook, we explain how ML complements enzyme engineering and discuss its future potential for improved outcomes.
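As a sketch of how an ML model can navigate a fitness landscape during directed evolution, the following toy loop fits a ridge regression on mock sequence-fitness pairs and ranks unseen single mutants by predicted fitness; the sequences, labels, and one-hot featurization are illustrative assumptions, not data from the paper.

```python
# Illustrative sketch of ML-guided directed evolution: fit a simple
# sequence-to-fitness regressor on assayed variants, then rank candidate
# mutants by predicted fitness. All data here is mock; real pipelines
# typically use richer features (e.g., pLM embeddings).
import itertools
import numpy as np
from sklearn.linear_model import Ridge

AAS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(seq: str) -> np.ndarray:
    x = np.zeros((len(seq), len(AAS)))
    for i, aa in enumerate(seq):
        x[i, AAS.index(aa)] = 1.0
    return x.ravel()

wild_type = "MKVLA"
measured = {"MKVLA": 1.0, "MRVLA": 1.4, "MKVIA": 0.7, "AKVLA": 0.9}  # mock assays

X = np.array([one_hot(s) for s in measured])
y = np.array(list(measured.values()))
model = Ridge(alpha=1.0).fit(X, y)

# Score all single mutants of the wild type and keep the top candidates.
candidates = {
    wild_type[:i] + aa + wild_type[i + 1:]
    for i, aa in itertools.product(range(len(wild_type)), AAS)
} - set(measured)
ranked = sorted(candidates, key=lambda s: model.predict([one_hot(s)])[0], reverse=True)
print(ranked[:5])  # next variants to assay in the lab
```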

Language: English

Citations: 67

PUNCH2: Explore the strategy for intrinsically disordered protein predictor
Di Meng, Gianluca Pollastri

PLoS ONE, Journal Year: 2025, Volume and Issue: 20(3), P. e0319208 - e0319208

Published: March 26, 2025

Intrinsically disordered proteins (IDPs) and their intrinsically disordered regions (IDRs) lack stable three-dimensional structures, posing significant challenges for computational prediction. This study introduces PUNCH2 and PUNCH2-light, advanced predictors designed to address these challenges through curated datasets, innovative feature extraction, and optimized neural architectures. By integrating experimental datasets from the PDB (PDB_missing) and fully disordered sequences from DisProt (DisProt_FD), we enhanced model performance and robustness. Three embedding strategies (One-Hot, MSA-based, and PLM-based embeddings) were evaluated, with ProtTrans emerging as the most effective single embedding and combined embeddings achieving the best results. The predictors employ a 12-layer convolutional network (CNN_L12_narrow), offering a balance between accuracy and efficiency. PUNCH2 combines One-Hot, ProtTrans, and MSA-Transformer embeddings, while PUNCH2-light provides a faster alternative by excluding MSA-based embeddings. PUNCH2 and its streamlined variant are competitive with other predictors on the CAID2 benchmark and rank in the top two in the CAID3 competition. These tools provide efficient, accurate solutions to advance IDP research and understanding.
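A minimal PyTorch sketch of a deep, narrow per-residue CNN in the spirit of CNN_L12_narrow follows; the layer width, kernel size, and input embedding dimension (1024, matching ProtTrans) are assumptions, since the paper's exact hyperparameters are not given here.

```python
# Sketch of a deep, narrow 1D CNN for per-residue disorder prediction,
# loosely in the spirit of CNN_L12_narrow (widths and kernel sizes are
# illustrative assumptions).
import torch
import torch.nn as nn

class DisorderCNN(nn.Module):
    def __init__(self, in_dim: int = 1024, width: int = 64, layers: int = 12):
        super().__init__()
        blocks = []
        dim = in_dim
        for _ in range(layers):
            blocks += [nn.Conv1d(dim, width, kernel_size=5, padding=2), nn.ReLU()]
            dim = width
        self.backbone = nn.Sequential(*blocks)
        self.head = nn.Conv1d(width, 1, kernel_size=1)  # per-residue logit

    def forward(self, x):            # x: (batch, seq_len, in_dim) embeddings
        x = x.transpose(1, 2)        # Conv1d expects (batch, channels, seq_len)
        return self.head(self.backbone(x)).squeeze(1)  # (batch, seq_len)

model = DisorderCNN()
emb = torch.randn(2, 300, 1024)      # e.g., ProtTrans per-residue embeddings
probs = torch.sigmoid(model(emb))    # disorder probability per residue
print(probs.shape)                   # torch.Size([2, 300])
```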

Language: English

Citations: 2

Are protein language models the new universal key?
Konstantin Weißenow, Burkhard Rost

Current Opinion in Structural Biology, Journal Year: 2025, Volume and Issue: 91, P. 102997 - 102997

Published: Feb. 7, 2025

Protein language models (pLMs) capture some aspects of the grammar of life as written in protein sequences. The so-called pLM embeddings implicitly contain this information. Therefore, embeddings can serve as the exclusive input into downstream supervised methods for protein prediction. Over the last 33 years, evolutionary information extracted through simple averaging over specific families from multiple sequence alignments (MSAs) has been the most successful universal key to prediction success. For many applications, MSA-free pLM-based predictions have now become significantly more accurate. The reason is often a combination of two aspects. Firstly, embeddings condense the grammar so efficiently that prediction methods can succeed with small models, i.e., they need few free parameters, in particular in the era of exploding deep neural networks. Secondly, pLMs provide protein-specific solutions. As an additional benefit, once pre-training is complete, pLM-based solutions tend to consume much fewer resources than MSA-based solutions. In fact, we appeal to the community to optimize existing foundation models rather than retrain new ones, and to evolve incentives that require this, even at a loss in accuracy. Although pLMs have not yet succeeded in entirely replacing the body of solutions developed over three decades, they clearly are rapidly advancing.

Language: English

Citations: 1

Teaching AI to speak protein
Michael Heinzinger, Burkhard Rost

Current Opinion in Structural Biology, Journal Year: 2025, Volume and Issue: 91, P. 102986 - 102986

Published: Feb. 21, 2025

Language: English

Citations: 1

Aggregating Residue-Level Protein Language Model Embeddings with Optimal Transport
Navid Naderializadeh, Rohit Singh

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: Jan. 31, 2024

Protein language models (PLMs) have emerged as powerful approaches for mapping protein sequences into embeddings suitable for various applications. As representation schemes, PLMs generate per-token (i.e., per-residue) representations, resulting in variable-sized outputs based on protein length. This variability poses a challenge for protein-level prediction tasks that require uniform-sized embeddings for consistent analysis across different proteins. Previous work has typically used average pooling to summarize token-level PLM outputs, but it is unclear whether this method effectively prioritizes the relevant information in the token-level representations. We introduce a novel method utilizing optimal transport to convert variable-length PLM outputs into fixed-length representations. We conceptualize per-token embeddings as samples from a probabilistic distribution and employ sliced-Wasserstein distances to map these samples against a reference set, creating a Euclidean embedding in the output space. The resulting embedding is agnostic to the length of the input and represents the entire protein. We demonstrate the superiority of our method over average pooling on several downstream prediction tasks, particularly with constrained PLM sizes, enabling smaller-scale PLMs to match or exceed the performance of average-pooled embeddings from larger-scale PLMs. Our aggregation scheme is especially effective for longer proteins, capturing essential information that might be lost through average pooling.
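The construction can be sketched as follows: project per-residue embeddings onto random 1D slices, sort each slice (an empirical quantile function), and resample at fixed quantile levels so every protein maps to the same output size. This simplified version omits the paper's reference set, so treat it as an assumption-laden illustration of sliced-Wasserstein pooling rather than the published method.

```python
# Sliced-Wasserstein-style aggregation of per-residue embeddings into a
# fixed-length vector (numbers of slices/quantiles and the omission of a
# reference set are simplifying assumptions).
import torch

def sliced_wasserstein_pool(tokens, n_slices=16, n_quantiles=8, seed=0):
    """tokens: (seq_len, dim) -> fixed vector of size n_slices * n_quantiles."""
    g = torch.Generator().manual_seed(seed)
    dim = tokens.shape[1]
    # Random unit directions, shared across all proteins via the fixed seed.
    theta = torch.randn(dim, n_slices, generator=g)
    theta = theta / theta.norm(dim=0, keepdim=True)
    proj = tokens @ theta                       # (seq_len, n_slices)
    proj, _ = proj.sort(dim=0)                  # empirical 1D quantile functions
    # Sample each sorted projection at fixed quantile levels so the output
    # size is independent of sequence length.
    q = torch.linspace(0, 1, n_quantiles)
    idx = (q * (tokens.shape[0] - 1)).long()
    return proj[idx].T.reshape(-1)              # (n_slices * n_quantiles,)

short = sliced_wasserstein_pool(torch.randn(120, 1280))
long = sliced_wasserstein_pool(torch.randn(900, 1280))
print(short.shape, long.shape)  # both torch.Size([128])
```

Unlike average pooling, the sorted-quantile representation preserves the shape of the per-residue distribution, which is why longer proteins lose less information.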

Language: English

Citations: 5

Natural Language Processing Methods for the Study of Protein–Ligand Interactions

J.H. Michels, Ramya Bandarupalli, Ahmad Akbari, et al.

Journal of Chemical Information and Modeling, Journal Year: 2025, Volume and Issue: unknown

Published: Feb. 24, 2025

Natural Language Processing (NLP) has revolutionized the way computers are used to study and interact with human languages and is increasingly influential in the study of protein and ligand binding, which is critical for drug discovery and development. This review examines how NLP techniques have been adapted to decode the "language" of proteins and small-molecule ligands to predict protein–ligand interactions (PLIs). We discuss how methods such as long short-term memory (LSTM) networks, transformers, and attention mechanisms can leverage different data types to identify potential interaction patterns. Significant challenges are highlighted, including the scarcity of high-quality negative data, difficulties in interpreting model decisions, and sampling biases in existing data sets. We argue that focusing on improving data quality, enhancing model robustness, and fostering both collaboration and competition could catalyze future advances in machine-learning-based predictions of PLIs.
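As a hedged illustration of the kind of architecture the review discusses, here is a minimal two-tower model in PyTorch with separate LSTM encoders for the protein sequence and ligand SMILES tokens, fused by a small classifier head; the vocabulary sizes, dimensions, and fusion scheme are illustrative assumptions, not a specific model from the review.

```python
# Two-tower protein–ligand interaction sketch: encode each sequence type
# separately, then classify the pair (all hyperparameters are illustrative).
import torch
import torch.nn as nn

class TwoTowerPLI(nn.Module):
    def __init__(self, prot_vocab=26, lig_vocab=64, dim=128):
        super().__init__()
        self.prot_emb = nn.Embedding(prot_vocab, dim)
        self.lig_emb = nn.Embedding(lig_vocab, dim)
        # LSTM encoders; transformer encoders are a drop-in alternative.
        self.prot_enc = nn.LSTM(dim, dim, batch_first=True)
        self.lig_enc = nn.LSTM(dim, dim, batch_first=True)
        self.head = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, prot_ids, lig_ids):
        _, (hp, _) = self.prot_enc(self.prot_emb(prot_ids))  # final hidden states
        _, (hl, _) = self.lig_enc(self.lig_emb(lig_ids))
        return self.head(torch.cat([hp[-1], hl[-1]], dim=-1)).squeeze(-1)

model = TwoTowerPLI()
logit = model(torch.randint(0, 26, (4, 200)), torch.randint(0, 64, (4, 60)))
print(torch.sigmoid(logit).shape)  # binding probability per pair: (4,)
```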

Language: English

Citations: 0

Combining Directed Evolution with Machine Learning Enables Accurate Genotype-to-Phenotype Predictions

Alexander J. Howard, Ellen Youngsoo Rim, Oscar D. Garrett, et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2025, Volume and Issue: unknown

Published: Jan. 29, 2025

Linking sequence variation to phenotypic effects is critical for efficient exploitation of large genomic datasets. Here we present a novel approach combining directed evolution with protein language modeling to characterize naturally-evolved variants of a rice immune receptor. Using high-throughput directed evolution, we engineered the receptor Pik-1 to bind and recognize the fungal proteins Avr-PikC and Avr-PikF, which evade detection by currently characterized Pik-1 alleles. A protein language model was fine-tuned on this data to correlate receptor sequence with ligand binding behavior. This model was then used to score variants found in the 3,000 Rice Genomes Project dataset. Two variants scored highly against Avr-PikC, and in vitro analyses confirmed their improved binding over the wild-type receptor. Overall, this machine learning approach identified promising sources of disease resistance and shows potential utility for exploring other proteins of interest.

Language: English

Citations: 0

ESM-Effect: An Effective and Efficient Fine-Tuning Framework towards accurate prediction of Mutation's Functional Effect
Moritz Glaser, Johannes Brägelmann

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2025, Volume and Issue: unknown

Published: Feb. 7, 2025

Predicting functional properties of mutations, like the change in enzyme activity, remains challenging and is not well captured by traditional pathogenicity prediction. Yet such predictions are crucial in areas like targeted cancer therapy, where some drugs may only be administered if a mutation causes an increase in activity. Current approaches either leverage static Protein Language Model (PLM) embeddings or complex multi-modal features (e.g., PLM embeddings, structure, and evolutionary data) and either (1) fall short in accuracy or (2) involve complex data processing and pre-training. Standardized datasets and metrics for robust benchmarking would benefit model development but do not yet exist for functional-effect prediction. To address these challenges, we develop ESM-Effect, an optimized PLM-based functional-effect prediction framework built through extensive ablation studies. ESM-Effect fine-tunes ESM2 with an inductive bias regression head to achieve state-of-the-art performance. It surpasses the multi-modal method PreMode, indicating redundancy of structural and evolutionary features, while training 6.7 times faster. In addition, we test benchmarking strategies and propose a novel metric termed relative Bin-Mean Error (rBME): rBME emphasizes challenging, non-clustered, rare gain-of-function regions and correlates more intuitively with model performance than the commonly used Spearman's rho. Finally, we demonstrate partial generalization to unseen mutational regions within the same protein, illustrating its potential in precision medicine applications. Extending this generalization across different proteins is a promising direction for future research. ESM-Effect is available at: https://github.com/moritzgls/ESM-Effect.
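The general recipe (fine-tuning an ESM2 encoder end-to-end with a regression head that reads out the mutated position) can be sketched as below; the small checkpoint, the plain linear head, and the mock label are assumptions for illustration and do not reproduce the paper's inductive-bias head.

```python
# Sketch of end-to-end PLM fine-tuning for mutation-effect regression:
# encode the mutant sequence, read out the mutated position, and regress
# against an assay label. The linear head stands in for ESM-Effect's
# inductive-bias head (assumption).
import torch
import torch.nn as nn
from transformers import AutoTokenizer, EsmModel

name = "facebook/esm2_t12_35M_UR50D"          # small checkpoint for the sketch
tokenizer = AutoTokenizer.from_pretrained(name)
encoder = EsmModel.from_pretrained(name)
head = nn.Linear(encoder.config.hidden_size, 1)

def predict_effect(mut_seq: str, pos: int) -> torch.Tensor:
    """Predict a functional-effect score for the mutation at `pos` (0-based)."""
    inputs = tokenizer(mut_seq, return_tensors="pt")
    hidden = encoder(**inputs).last_hidden_state       # (1, L+2, d)
    return head(hidden[0, pos + 1])                    # +1 skips the CLS token

opt = torch.optim.AdamW(list(encoder.parameters()) + list(head.parameters()), lr=1e-5)
loss = nn.functional.mse_loss(predict_effect("MKTAYIAKQR", 3),
                              torch.tensor([0.8]))     # mock assay label
loss.backward()
opt.step()                                             # one end-to-end step
```

Because the encoder's parameters receive gradients along with the head, this differs from the "static embeddings" baselines the abstract contrasts against.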

Language: English

Citations: 0

Prediction of Single-Mutation Effects for Fluorescent Immunosensor Engineering with an End-to-End Trained Protein Language Model

Akihito Inoue, Bo Zhu, Keisuke Mizutani, et al.

JACS Au, Journal Year: 2025, Volume and Issue: 5(2), P. 955 - 964

Published: Feb. 10, 2025

A quenchbody (Q-body) is a fluorophore-labeled homogeneous immunosensor in which the fluorophore is quenched by tryptophan (Trp) residues in the vicinity of the antigen-binding paratope and dequenched in response to antigen binding. Developing Q-bodies against targets on demand remains challenging due to the large sequence space of the complementarity-determining regions (CDRs) related to binding and quenching. In this study, we pioneered a strategy using high-throughput screening and a protein language model (pLM) to predict the effects of mutations on quenching with single amino acid resolution, thereby enhancing the performance of Q-bodies. We collected yeasts displaying nanobodies with high- and low-quenching properties for TAMRA from a modified synthetic nanobody library, followed by next-generation sequencing. The pretrained pLM, connected to a single-layer perceptron, was trained end-to-end on the enriched CDR sequences. The prediction model that focused on CDR1 + 3 performed best in an evaluation using precision-recall curves. Using the model, we predicted and validated effective mutations for two anti-SARS-CoV-2 nanobodies, RBD1i13 and RBD10i14, and converted them into Q-bodies. For RBD1i13, three Trp mutants were predicted to have high probability scores through in silico Trp scanning. These were verified via yeast surface display, and all showed enhanced quenching. At four positions close to an existing Trp, saturation mutagenesis was performed; six of eight high-score mutants, derived from each of these positions, exhibited deeper quenching on the yeast surface. Next, combined mutations were investigated and successfully enhanced the responses. Overall, our approach allows the prediction of fluorescence responses solely on the basis of antibody sequence and will be essential for the rational selection and design of antibodies to achieve immunosensors with larger responses.
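The in silico Trp scanning step can be sketched as a simple loop: substitute Trp at each CDR position and rank positions by a trained quenching predictor. The `quench_model` interface and the mock scorer below are assumptions standing in for the paper's end-to-end pLM plus single-layer perceptron.

```python
# In silico tryptophan scan sketch: substitute Trp at every CDR position
# and rank positions by predicted quenching probability. `quench_model`
# is a hypothetical callable: sequence -> score.
def trp_scan(cdr_seq: str, quench_model) -> list[tuple[int, float]]:
    scores = []
    for pos in range(len(cdr_seq)):
        if cdr_seq[pos] == "W":
            continue  # position is already tryptophan
        mutant = cdr_seq[:pos] + "W" + cdr_seq[pos + 1:]
        scores.append((pos, quench_model(mutant)))
    # Highest predicted quenching probability first.
    return sorted(scores, key=lambda t: t[1], reverse=True)

# Mock predictor for demonstration; a real one would embed the sequence
# with a pretrained pLM and apply the trained perceptron head.
mock_model = lambda seq: seq.count("W") / len(seq)
print(trp_scan("GRTFSSYAMG", mock_model)[:3])  # top candidate positions
```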

Language: English

Citations: 0

Foundation models of protein sequences: A brief overview

Andreas Bjerregaard, Peter Mørch Groth, Søren Hauberg, et al.

Current Opinion in Structural Biology, Journal Year: 2025, Volume and Issue: 91, P. 103004 - 103004

Published: Feb. 20, 2025

Language: English

Citations: 0