FGeneBERT: function-driven pre-trained gene language model for metagenomics DOI Creative Commons

Chenrui Duan, Zelin Zang, Yongjie Xu, et al.

Briefings in Bioinformatics, Journal Year: 2025, Volume and Issue: 26(2)

Published: March 1, 2025

Metagenomic data, comprising mixed multi-species genomes, are prevalent in diverse environments like oceans and soils, significantly impacting human health and ecological functions. However, current research relies on K-mer, which limits the capture of structurally and functionally relevant gene contexts. Moreover, these approaches struggle with encoding biologically meaningful genes and fail to address the one-to-many and many-to-one relationships inherent in metagenomic data. To overcome these challenges, we introduce FGeneBERT, a novel pre-trained model that employs a protein-based gene representation as a context-aware and structure-relevant tokenizer. FGeneBERT incorporates masked gene modeling to enhance the understanding of inter-gene contextual relationships and triplet-enhanced contrastive learning to elucidate sequence-function relationships. Pre-trained on over 100 million sequences, FGeneBERT demonstrates superior performance on datasets at four levels, spanning gene, functional, bacterial, and environmental levels and ranging from 1 to 213 k input sequences. Case studies of ATP synthase and operons highlight FGeneBERT's capability for functional recognition and its biological relevance in metagenomic research.
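
The contrastive component mentioned in the abstract can be illustrated with a generic triplet margin loss. This is only a minimal sketch under stated assumptions: the abstract does not give the exact objective, so the Euclidean distance, the `margin` value, and the function names below are illustrative choices, not the authors' implementation.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Generic triplet margin loss (assumed form, not taken from the paper).

    The loss is zero once the anchor is closer to the positive than to the
    negative by at least `margin`; otherwise it grows linearly.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical 2-D embeddings: a gene, a gene with the same function,
# and a gene with a different function.
anchor = [0.0, 0.0]
same_function = [0.0, 1.0]
other_function = [3.0, 4.0]

well_separated = triplet_margin_loss(anchor, same_function, other_function)
violated = triplet_margin_loss(anchor, other_function, same_function)
```

Here `well_separated` is zero because the negative already lies more than one margin farther away than the positive, while `violated` is positive because the roles are swapped.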

Language: English

Comparative Analysis of Transposable Elements in the Genomes of Citrus and Citrus-Related Genera DOI Creative Commons
Yilei Wu, Fusheng Wang, Keliang Lyu, et al.

Plants, Journal Year: 2024, Volume and Issue: 13(17), P. 2462

Published: Sept. 3, 2024

Transposable elements (TEs) significantly contribute to the evolution and diversity of plant genomes. In this study, we explored roles TEs in genomes Citrus Citrus-related genera by constructing a pan-genome TE library from 20 published accessions. Our results revealed an increase content number types compared original annotations, as well decrease unclassified TEs. The average length per assembly was approximately 194.23 Mb, representing 41.76% (Murraya paniculata) 64.76% (Citrus gilletiana) genomes, with mean value 56.95%. A significant positive correlation found between genome size both content. Consistent difference whole-genome (39.83 Mb) genera, contained 34.36 Mb more sequences than Analysis estimated insertion time half-life long terminal repeat retrotransposons (LTR-RTs) suggested that removal not primary factor contributing differences among These findings collectively indicate are determinants play major role shaping structures. Principal coordinate analysis (PCoA) Gene Ontology (GO) Kyoto Encyclopedia Genes Genomes (KEGG) identifiers fragmented were predominantly derived ancestral while intact crucial recent evolutionary diversification Citrus. Moreover, presence or absence near AdhE superfamily closely associated bitterness trait species. Overall, study enhances annotation provides valuable data for future genetic breeding agronomic research

Language: English

Citations

4

OGRP: a comprehensive bioinformatics platform for the efficient empowerment of Oleaceae genomics research DOI Creative Commons
Zijian Yu, Yu Li, Tengfei Song, et al.

Horticultural Plant Journal, Journal Year: 2025, Volume and Issue: unknown

Published: Jan. 1, 2025

Language: English

Citations

0

Genomic and Methylomic Signatures Associated With the Maintenance of Genome Stability and Adaptive Evolution in Two Closely Allied Wolf Spiders DOI Open Access
Qing Zuo, Runbiao Wu, Lina Sun, et al.

Molecular Ecology Resources, Journal Year: 2025, Volume and Issue: unknown

Published: Jan. 20, 2025

ABSTRACT Pardosa spiders, belonging to the wolf spider family Lycosidae, play a vital role in maintaining health of forest and agricultural ecosystems due their function pest control. This study presents chromosome‐level genome assemblies for two allied species, P. laura agraria . Both species' genomes show notable expansion helitron transposable elements, which contributes large sizes. Methylome analysis indicates that has higher overall DNA methylation levels compared may not only aids element‐driven but also positively affects three‐dimensional organisation after transposon amplification, thereby potentially enhancing stability. Genes associated with hyper‐differentially methylated regions (compared ) are enriched functions related mRNA processing energy production. Furthermore, combined transcriptome methylome profiling uncovered complex regulatory interplay between gene expression, emphasising important body regulation expression. Comparative genomic shows significant cuticle protein detoxification‐related families , improve its adaptability various habitats. provides essential methylomic insights, offering deeper understanding relationship elements stability, illuminating adaptive evolution species differentiation among spiders.

Language: English

Citations

0

Constraint of accessible chromatins maps regulatory loci involved in maize speciation and domestication DOI Creative Commons
Y Liu, Xiang Gao, Hongjun Liu, et al.

Nature Communications, Journal Year: 2025, Volume and Issue: 16(1)

Published: March 12, 2025

Comparative genomic studies can identify genes under evolutionary constraint or specialized for trait innovation. Growing evidence suggests that constraint also acts on non-coding regulatory sequences, exerting significant impacts on fitness-related traits, although it has yet to be thoroughly explored in plants. Using the assay for transposase-accessible chromatin by sequencing (ATAC-seq), we profile over 80,000 maize accessible chromatin regions (ACRs), revealing that ACRs evolve faster than coding genes, with about one-third being maize-specific and regulating genes associated with speciation. We highlight the role of transposable elements (TEs) in driving intraspecific innovation and hundreds of candidates potentially involved in transcriptional rewiring during domestication. Additionally, we demonstrate the importance of maintaining subgenome dominance and controlling complex variations. This study establishes a framework for analyzing the evolutionary trajectory of plant regulatory sequences and offers loci for downstream exploration and application in breeding. Intricate regulation of gene expression is important for the execution of biology processes. Here, the authors generate a comprehensive map by integrating ATAC-seq data from 12 major tissues and explore their interspecific constraints across multiple Poaceae genomes.

Language: English

Citations

0
