2021 IEEE International Conference on Big Data (Big Data), Journal Year: 2024, Volume and Issue: unknown, P. 4283 - 4291
Published: Dec. 15, 2024
Language: English
Current Opinion in Structural Biology, Journal Year: 2025, Volume and Issue: 90, P. 102984 - 102984
Published: Jan. 27, 2025
Language: English
Citations: 0
bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2025, Volume and Issue: unknown
Published: April 23, 2025
ABSTRACT Protein language models (pLMs) have revolutionized computational biology by generating rich protein vector representations, or embeddings, enabling major advances in de novo design, structure prediction, variant effect analysis, and evolutionary studies. Despite these breakthroughs, current pLMs often exhibit biases against proteins from underrepresented species, with viral proteins being particularly affected; viruses are frequently referred to as the "dark matter" of the biological world because of their vast diversity and ubiquity, yet sparse representation in training datasets. Here, we show that fine-tuning pre-trained pLMs on viral sequences, using diverse learning frameworks and parameter-efficient strategies, significantly enhances embedding quality and improves performance on downstream tasks. To support further research, we provide source code for benchmarking embedding quality. By enabling more accurate modeling of viral proteins, our approach advances tools for understanding viral biology, combating emerging infectious diseases, and driving biotechnological innovation.
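The "parameter-efficient strategies" mentioned in the abstract typically update only a small adapter instead of the full pre-trained weight matrices. As a minimal sketch of one common such strategy (a LoRA-style low-rank adapter, which is an assumption here, not confirmed by the abstract), the idea can be illustrated with NumPy; all shapes and names below are illustrative, not taken from the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 64  # width of one pLM layer (assumed toy size)
rank = 4      # low-rank adapter dimension

W = rng.standard_normal((d_model, d_model))      # frozen pre-trained weight
A = rng.standard_normal((rank, d_model)) * 0.01  # trainable adapter factor
B = np.zeros((d_model, rank))                    # trainable factor, zero-initialized

def adapted_forward(x):
    # Effective weight is W + B @ A; only A and B would receive gradients
    # during fine-tuning, while W stays frozen.
    return x @ (W + B @ A).T

x = rng.standard_normal((2, d_model))  # a batch of token embeddings

# With B initialized to zero, the adapted layer matches the frozen layer exactly,
# so fine-tuning starts from the pre-trained model's behavior.
assert np.allclose(adapted_forward(x), x @ W.T)

# Parameter-efficiency: 2 * rank * d_model trainable values vs. d_model**2
# for full fine-tuning of this layer.
trainable_fraction = (A.size + B.size) / W.size
print(f"trainable fraction: {trainable_fraction:.3f}")
```

With these toy sizes the adapter trains only 512 of 4,096 weights per layer (12.5%); at realistic pLM widths the fraction is far smaller, which is what makes fine-tuning on viral sequence collections affordable.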
Language: English
Citations: 0