Codon language embeddings provide strong signals for protein engineering
Carlos Outeiral, Charlotte M. Deane

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2022, Volume and Issue: unknown

Published: Dec. 19, 2022

Protein representations from deep language models have yielded state-of-the-art performance across many tasks in computational protein engineering. In recent years, progress has primarily focused on parameter count, with models' capacities surpassing the size of the very datasets they were trained on. Here, we propose an alternative direction. We show that large language models trained on codons, instead of amino acid sequences, provide high-quality representations that outperform comparable state-of-the-art models across a variety of tasks. In some tasks, like species recognition, prediction of protein and transcript abundance, or melting point estimation, a language model trained on codons outperforms every other published protein language model, including some that contain over 50 times more parameters. These results suggest that, in addition to commonly studied scale and model complexity, the information content of biological data provides an orthogonal direction to improve the power of machine learning in biology.
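The abstract's point about information content can be illustrated with a small sketch: two coding sequences that differ only in synonymous codons give different codon tokens but identical amino acid tokens. This is not the authors' tokenizer. The function names and the two example sequences are made up for illustration, and the codon table is a small excerpt of the standard genetic code, not the full table.

```python
# Excerpt of the standard genetic code (illustrative subset, not the full table).
CODON_TO_AA = {
    "ATG": "M",
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "AAA": "K", "AAG": "K",
    "TAA": "*",
}


def codon_tokens(cds: str) -> list[str]:
    """Split a coding sequence into non-overlapping 3-nucleotide tokens."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def amino_acid_tokens(cds: str) -> list[str]:
    """Translate each codon token to its amino acid (hypothetical helper)."""
    return [CODON_TO_AA[codon] for codon in codon_tokens(cds)]


# Two made-up sequences that differ only in synonymous codons.
seq_a = "ATGGCTAAATAA"
seq_b = "ATGGCGAAGTAA"

print(codon_tokens(seq_a), codon_tokens(seq_b))
print(amino_acid_tokens(seq_a), amino_acid_tokens(seq_b))
```

At the amino acid level the two inputs are indistinguishable, while the codon-level tokens keep the synonymous-codon choice, which is the extra signal a codon-based model can draw on.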

Language: English

Gene Loss and Evolution of the Plastome
Tapan Kumar Mohanta, Awdhesh Kumar Mishra, Adil Khan, et al.

Genes, Journal Year: 2020, Volume and Issue: 11(10), P. 1133 - 1133

Published: Sept. 25, 2020

Chloroplasts are unique organelles within plant cells and are responsible for sustaining life forms on earth due to their ability to conduct photosynthesis. Multiple functional genes within the chloroplast are responsible for a variety of metabolic processes that occur in the chloroplast. Considering its fundamental role in sustaining life on earth, it is important to identify the level of diversity present in the chloroplast genome, what genes and genomic content have been lost, what genes have been transferred to the nuclear genome, duplication events, and the overall origin and evolution of the chloroplast genome. Our analysis of 2511 chloroplast genomes indicated that the genome size and number of coding DNA sequences (CDS) in the chloroplasts of algae are higher relative to other lineages. Approximately 10.31% of the examined species have lost the inverted repeats (IR), and these losses span across all the lineages. Genome-wide analyses revealed that the loss of the Rbcl gene in parasitic and heterotrophic plants occurred approximately 56 Ma ago. PsaM, Psb30, ChlB, ChlL, ChlN, and Rpl21 were found to be characteristic signature genes of algae, bryophytes, pteridophytes, and gymnosperms; however, none of these genes were found in the angiosperm or magnoliid lineage, which appeared to have lost them approximately 203-156 Ma ago. A variety of chloroplast-encoded genes were lost across different lineages throughout the evolutionary process. The Rpl20 gene, however, was found to be the most stable and intact gene and was not lost in any of the analyzed species, suggesting that it is a signature gene of the plastome. Chloroplast genomes evolved from multiple common ancestors ~1293 Ma ago and have undergone vivid recombination events across different taxonomic lineages.

Language: English

Citations

67

CHOIR improves significance-based detection of cell types and states from single-cell data
Cathrine Petersen, Lennart Mucke, M. Ryan Corces, et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: Jan. 23, 2024

Clustering is a critical step in the analysis of single-cell data, as it enables the discovery and characterization of putative cell types and states. However, most popular clustering tools do not subject clustering results to statistical inference testing, leading to risks of overclustering or underclustering data and often resulting in ineffective identification of cell types with widely differing prevalence. To address these challenges, we present CHOIR (clustering hierarchy optimization by iterative random forests), which applies a framework of random forest classifiers and permutation tests across a hierarchical clustering tree to statistically determine which clusters represent distinct populations. We demonstrate the enhanced performance of CHOIR through extensive benchmarking against 14 existing clustering methods across 100 simulated and 4 real single-cell RNA-seq, ATAC-seq, spatial transcriptomic, and multi-omic datasets. CHOIR can be applied to any single-cell data type and provides a flexible, scalable, and robust solution to the important challenge of identifying biologically relevant cell groupings within heterogeneous single-cell data.

Language: English

Citations

5

RUDEUS, a machine learning classification system to study DNA-Binding proteins
David Medina-Ortiz, Gabriel Cabas-Mora, Iván Moya-Barría, et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: Feb. 21, 2024

DNA-binding proteins are essential in different biological processes, including DNA replication, transcription, packaging, and chromatin remodelling. Exploring their characteristics and functions has become relevant in diverse scientific domains. Computational biology and bioinformatics have assisted in studying DNA-binding proteins, complementing traditional molecular biology methods. While recent advances in machine learning have enabled the integration of predictive systems with bioinformatic approaches, generalizable pipelines are still needed for identifying unknown proteins as DNA-binding and assessing the specific type of DNA strand they recognize. In this work, we introduce RUDEUS, a Python library featuring hierarchical classification models designed to identify DNA-binding proteins and assess the interaction type, whether single-stranded or double-stranded. RUDEUS has a versatile pipeline capable of training predictive models, synergizing protein language models with supervised learning algorithms, and integrating Bayesian optimization strategies. The trained models have high performance, achieving a precision rate of 95% for DNA-binding identification and 89% for discerning between single-stranded and double-stranded interactions. RUDEUS includes an exploration tool for evaluating unknown protein sequences, annotating them as DNA-binding, and determining the type of DNA strand they recognize. Moreover, a structural bioinformatic pipeline has been integrated into RUDEUS for validating the identified DNA-binding proteins through DNA-protein molecular docking. These comprehensive strategies and straightforward implementation demonstrate comparable performance to high-end models and enhance usability for integration into protein engineering pipelines.

Language: English

Citations

5

Improving Functional Muscle Regeneration in Volumetric Muscle Loss Injuries by Shifting the Balance of Inflammatory and Pro-Resolving Lipid Mediators
T. Turner, Frank S. Pittman, Hongmanlin Zhang, et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: Sept. 12, 2024

Severe tissue loss resulting from extremity trauma, such as volumetric muscle loss (VML), poses significant clinical challenges for both general and military populations. VML disrupts the endogenous tissue repair mechanisms, resulting in acute and unresolved chronic inflammation and immune cell presence, impaired muscle healing, scar tissue formation, persistent pain, and permanent functional deficits. The aberrant healing response is preceded by acute inflammation and immune cell infiltration, which does not resolve. We analyzed the biosynthesis of inflammatory and specialized pro-resolving lipid mediators (SPMs) after VML injury in two different models; muscle with critical-sized defects had a decreased capacity to biosynthesize SPMs, leading to dysregulated inflammation. We developed a modular poly(ethylene glycol)-maleimide hydrogel platform to locally release a stable isomer of Resolvin D1 (AT-RvD1) and promote endogenous pathways of inflammation resolution in the two models. The local delivery of AT-RvD1 enhanced muscle regeneration, improved muscle function, and reduced pain sensitivity by promoting molecular and cellular resolution of inflammation. These findings provide new insights into the pathogenesis of VML and establish this therapeutic as a promising strategy for muscle regeneration after traumatic injury.

Language: English

Citations

4

CryoPPP: A Large Expert-Labelled Cryo-EM Image Dataset for Machine Learning Protein Particle Picking
Ashwin Dhakal, Rajan Gyawali, Liguo Wang, et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2023, Volume and Issue: unknown

Published: Feb. 22, 2023

Cryo-electron microscopy (cryo-EM) is currently the most powerful technique for determining the structures of large protein complexes and assemblies. Picking single-protein particles from cryo-EM micrographs (images) is a key step in reconstructing protein structures. However, the widely used template-based particle picking process is labor-intensive and time-consuming. Though emerging machine learning-based particle picking can potentially automate the process, its development is severely hindered by the lack of large, high-quality, manually labelled training data. Here, we present CryoPPP, a large, diverse, expert-curated cryo-EM image dataset for single protein particle picking and analysis to address this bottleneck. It consists of manually labelled cryo-EM micrographs of 32 non-redundant, representative protein datasets selected from the Electron Microscopy Public Image Archive (EMPIAR). It includes 9,089 diverse, high-resolution micrographs (∼300 cryo-EM images per EMPIAR dataset) in which the coordinates of protein particles were labelled by human experts. The labelling process was rigorously validated by both 2D particle class validation and 3D density map validation with the gold standard. The dataset is expected to greatly facilitate the development of machine learning and artificial intelligence methods for automated cryo-EM protein particle picking. The dataset and data processing scripts are available at https://github.com/BioinfoMachineLearning/cryoppp.

Language: English

Citations

10

HIT-EC: Trustworthy prediction of enzyme commission numbers using a hierarchical interpretable transformer
Louis Dumontet, So-Ra Han, Tae‐Jin Oh, et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2025, Volume and Issue: unknown

Published: Feb. 6, 2025

Accurate and trustworthy prediction of Enzyme Commission (EC) numbers is critical for understanding enzyme functions and their roles in biological processes. Despite the success of recently proposed deep learning-based models, there remain limitations, such as low performance on underrepresented EC numbers, the lack of a learning strategy for incomplete annotations, and limited interpretability. To address these challenges, we propose a novel hierarchical interpretable transformer model, HIT-EC, for trustworthy EC number prediction. HIT-EC employs a four-level transformer architecture that aligns with the hierarchical structure of EC numbers and leverages both local and global dependencies within protein sequences for this multi-label classification task. We also propose a novel learning strategy to handle incomplete EC numbers. HIT-EC, an evidential deep learning model, produces trustworthy predictions by providing domain-specific evidence through a biologically meaningful interpretation scheme. The predictive performance of HIT-EC was assessed by multiple experiments: a cross-validation with a large dataset, a validation with external data, and a species-based performance evaluation. HIT-EC showed statistically significant improvement in predictive performance when compared to the current state-of-the-art benchmark models. HIT-EC's robust interpretability was further validated by identifying well-known conserved motifs and functional regions in the CYP106A2 family. HIT-EC would be a robust, interpretable, and reliable solution for EC number prediction, with significant implications for enzymology, drug discovery, and metabolic engineering. The open-source code is publicly available at: https://github.com/datax-lab/HIT-EC .

Language: English

Citations

0

Inferring active mutational processes in cancer using single cell sequencing and evolutionary constraints
Gryte Satas, Matthew A. Myers, Andrew McPherson, et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2025, Volume and Issue: unknown

Published: Feb. 27, 2025

Ongoing mutagenesis in cancer drives genetic diversity throughout the natural history of cancers. As the activities of mutational processes are dynamic throughout evolution, distinguishing the mutational signatures of 'active' and 'historical' processes has important implications for studying how tumors evolve. This can aid in understanding mutagenic states at the time of presentation, and in associating active mutational processes with therapeutic resistance. As bulk sequencing primarily captures historical mutational processes, we studied whether ultra-low-coverage single-cell whole-genome sequencing (scWGS), which measures the distribution of mutations across hundreds or thousands of individual cells, could enable the distinction between historical and active mutational processes. While technical challenges and data sparsity have limited mutation analysis in scWGS, we show that these data contain valuable information about active mutational processes. To robustly interpret single nucleotide variants (SNVs) in scWGS, we introduce ArtiCull, a method to identify and remove SNV artifacts by leveraging evolutionary constraints, enabling reliable detection of mutations for signature analysis. Applying this approach to scWGS data from pancreatic ductal adenocarcinoma (PDAC), triple-negative breast cancer (TNBC), and high-grade serous ovarian cancer (HGSOC), we uncover temporal and spatial patterns in mutational processes. In PDAC, we observe an increase in mismatch repair deficiency (MMRd). In cisplatin-treated TNBC patient-derived xenografts, we observe therapy-induced inactivation of APOBEC3 activity. In HGSOC, we observe distinct patterns of mutagenesis, including late tumor-wide activation in one case and clade-specific enrichment in another. Additionally, we detect clone-specific SBS17 activity in a clone previously linked to recurrence. Our findings establish scWGS as a powerful approach for studying mutational processes that may influence ongoing clonal evolution.

Language: English

Citations

0

Complete chloroplast genomes of five Aegilops aucheri Boiss. accessions having different geographical origins
А. R. Kuluev, R. Т. Matniyazov, Б. Р. Кулуев, et al.

Mitochondrial DNA Part A, Journal Year: 2025, Volume and Issue: unknown, P. 1 - 7

Published: March 12, 2025

The subject of this study is Aegilops aucheri Boiss. 1844, a member of the section Sitopsis, subsection Truncata. This species is infrequently included in phylogenetic studies and is commonly regarded as a heterotypic synonym of Aegilops speltoides Tausch. The aim of this study was to detect genetic differences between Ae. aucheri and Ae. speltoides using the phylogenetic signal retrieved from chloroplast genomes. Plastomes of five Ae. aucheri accessions from different geographical locations were sequenced, annotated, and subjected to phylogenetic analysis. Plastome sizes were found to range from 135,666 to 135,668 bp in Ae. aucheri. Comparative analysis of the chloroplast genome sequences revealed single-nucleotide polymorphisms (SNPs) and insertions/deletions (indels) relative to the Ae. speltoides plastome. To gain a more comprehensive understanding of the genetic divergence within the Truncata subsection, sequencing the nuclear genome of Ae. aucheri and comparing it to that of Ae. speltoides is essential.

Language: English

Citations

0

Challenges in predicting protein-protein interactions of understudied viruses: Arenavirus-Human interactions
Harshita Sahni, Sarah Michelle Crotzer, J Strother Moore, et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2025, Volume and Issue: unknown

Published: April 21, 2025

Understanding protein-protein interactions (PPIs) between viruses and human proteins is crucial for uncovering infection mechanisms and identifying potential therapeutic targets. The ability to generalize PPI predictive models across understudied viruses presents a significant challenge. In this work, we use arenavirus-human PPIs to illustrate the difficulties associated with model generalization, which are compounded by the lack of both positive and negative data. We employ a Transfer Learning approach to investigate arenavirus-human PPIs utilizing models trained on better-studied virus-human and human-human interactions. Additionally, we curate and assess four types of negative sampling datasets to evaluate their impact on model performance. Despite overall high accuracies (93-99%) and AUPRC scores (0.8-0.9) appearing promising, further analysis indicates that these performance metrics can be misleading due to data leakage, data bias, and overfitting, especially concerning under-represented viral proteins. We reveal the gaps and imbalance through standard k-fold cross-validation and Independent Blind Testing with a Balanced Dataset, leading to a drop in accuracy below 50%. We propose a viral protein-specific evaluation framework that groups viral proteins into majority and minority classes based on their representation in the dataset, allowing for a comparison of model performance using balanced accuracies. This framework offers a more robust assessment of model generalizability, addressing biases inherent in standard evaluation techniques and paving the way for more reliable PPI prediction for understudied viruses.
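The protein-specific evaluation idea in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' framework: the grouping rule (a simple count threshold), the function names, the protein labels, and the toy predictions are all hypothetical. Balanced accuracy is computed here as the mean of per-class recall.

```python
from collections import Counter


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall for binary labels (1 = interacting, 0 = not)."""
    recalls = []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        if idx:  # skip a class that has no examples in this group
            recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def evaluate_by_representation(proteins, y_true, y_pred, threshold):
    """Score well-represented and rare viral proteins separately.

    Assumption: a protein is 'majority' if it appears in at least
    `threshold` test pairs; the paper's actual rule is not stated here.
    """
    counts = Counter(proteins)
    groups = {"majority": [], "minority": []}
    for i, protein in enumerate(proteins):
        key = "majority" if counts[protein] >= threshold else "minority"
        groups[key].append(i)
    return {
        name: balanced_accuracy([y_true[i] for i in idx], [y_pred[i] for i in idx])
        for name, idx in groups.items()
        if idx
    }


# Toy data: six pairs for a well-represented protein, two for a rare one.
proteins = ["NP"] * 6 + ["Z"] * 2
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0]

result = evaluate_by_representation(proteins, y_true, y_pred, threshold=3)
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy, result)
```

In this toy case the pooled accuracy looks high, while the minority group's balanced accuracy is at chance, which is the kind of gap a per-group breakdown is meant to expose.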

Language: English

Citations

0

CytoAnalyst: A web-based platform for comprehensive single-cell RNA sequencing analysis
Phi Bya, Duy Tran, Khoi Nguyen, et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2025, Volume and Issue: unknown

Published: April 19, 2025

Single-cell technologies have revolutionized our ability to study cellular heterogeneity and dynamics at unprecedented resolutions. In this fast-growing field, it becomes increasingly challenging to navigate the vast number of tools and steps for single-cell analysis. It is particularly difficult to integrate and analyze large datasets that require extensive collaborations and customized pipelines to obtain robust results. We present CytoAnalyst, a web-based platform that offers a number of important advantages over existing single-cell analysis tools. First, it enables custom pipeline configuration using an efficient management system and a broad range of analysis modules. Second, it supports parallel analysis instances, facilitating comprehensive comparison of different methods or parameter settings available at each analysis step. Third, the advanced sharing system facilitates real-time synchronization among team members and seamless continuation of analysis across devices. Finally, the multi-grid visualization system enables simultaneous display of different data aspects, allowing users to view multiple labels and plots side-by-side for comprehensive insights, with the ability to save and reload the analysis at any step. The platform incorporates blending modes, allowing users to combine labels in various ways for data exploration. CytoAnalyst maintains a high level of analytical rigor while providing user-friendly and flexible operations through its carefully designed interface and documentation. The platform supports all major web browsers and is freely available at https://cytoanalyst.tinnguyen-lab.com .

Language: English

Citations

0