FGeneBERT: function-driven pre-trained gene language model for metagenomics
Chenrui Duan, Zelin Zang, Yongjie Xu

et al.

Briefings in Bioinformatics, Journal Year: 2025, Volume and Issue: 26(2)

Published: March 1, 2025

Metagenomic data, comprising mixed multi-species genomes, are prevalent in diverse environments like oceans and soils, significantly impacting human health and ecological functions. However, current research relies on K-mer, which limits the capture of structurally and functionally relevant gene contexts. Moreover, these approaches struggle with encoding biologically meaningful genes and fail to address the one-to-many and many-to-one relationships inherent in metagenomic data. To overcome these challenges, we introduce FGeneBERT, a novel pre-trained model that employs a protein-based gene representation as a context-aware and structure-relevant tokenizer. FGeneBERT incorporates masked gene modeling to enhance the understanding of inter-gene contextual relationships and triplet enhanced contrastive learning to elucidate gene sequence-function relationships. Pre-trained on over 100 million sequences, FGeneBERT demonstrates superior performance on metagenomic datasets at four levels, spanning gene, functional, bacterial, and environmental levels and ranging from 1 to 213 k input sequences. Case studies of ATP synthase and gene operons highlight FGeneBERT's capability for functional recognition and its biological relevance in metagenomic research.

Language: English

Plant pan-genomes are the new reference
Philipp E. Bayer, Agnieszka A. Golicz, Armin Scheben

et al.

Nature Plants, Journal Year: 2020, Volume and Issue: 6(8), P. 914 - 920

Published: July 20, 2020

Language: English

Citations: 401

Giant lungfish genome elucidates the conquest of land by vertebrates
Axel Meyer, Siegfried Schloissnig, Paolo Franchini

et al.

Nature, Journal Year: 2021, Volume and Issue: 590(7845), P. 284 - 289

Published: Jan. 18, 2021

Lungfishes belong to lobe-finned fish (Sarcopterygii) that, in the Devonian period, ‘conquered’ the land and ultimately gave rise to all land vertebrates, including humans 1–3. Here we determine the chromosome-quality genome of the Australian lungfish (Neoceratodus forsteri), which is known to have the largest genome of any animal. The vast size of this genome, which is about 14× larger than that of humans, is attributable mostly to huge intergenic regions and introns with high repeat content (around 90%), the components of which resemble those of tetrapods (comprising mainly long interspersed nuclear elements) more than they do those of ray-finned fish. The lungfish genome continues to expand independently (its transposable elements are still active), through mechanisms different to those of the enormous genomes of salamanders. The 17 fully assembled lungfish macrochromosomes maintain synteny to other vertebrate chromosomes, and all microchromosomes maintain conserved ancient homology with the ancestral vertebrate karyotype. Our phylogenomic analyses confirm previous reports that lungfish occupy a key evolutionary position as the closest living relatives to tetrapods 4,5, underscoring the importance of lungfish for understanding innovations associated with terrestrialization. Lungfish preadaptations to living on land include the gain of limb-like expression in developmental genes such as hoxc13 and sall1 in their lobed fins. Increased rates of evolution and the duplication of genes associated with obligate air-breathing, such as lung surfactants and the expansion of odorant receptor gene families (which encode proteins involved in detecting airborne odours), contribute to the tetrapod-like biology of lungfishes. These findings advance our understanding of this major transition during vertebrate evolution.

Language: English

Citations: 181

The complete reference genome for grapevine (Vitis vinifera L.) genetics and breeding
Xiaoya Shi, Shuo Cao, Xu Wang

et al.

Horticulture Research, Journal Year: 2023, Volume and Issue: 10(5)

Published: April 4, 2023

Grapevine is one of the most economically important crops worldwide. However, the previous versions of the grapevine reference genome typically consist of thousands of fragments with missing centromeres and telomeres, limiting the accessibility of the repetitive sequences, the centromeric and telomeric regions, and the study of inheritance of important agronomic traits in these regions. Here, we assembled a telomere-to-telomere (T2T) gap-free reference genome for the cultivar PN40024 using PacBio HiFi long reads. The T2T reference genome (PN_T2T) is 69 Mb longer with 9018 more genes identified than the 12X.v0 version. We annotated 67% repetitive sequences, 19 centromeres and 36 telomeres, and incorporated gene annotations of previous versions into the PN_T2T assembly. We detected a total of 377 gene clusters, which showed associations with complex traits, such as aroma and disease resistance. Even though PN40024 derives from nine generations of selfing, we still found genomic hotspots of heterozygous sites associated with biological processes, such as the oxidation-reduction process and protein phosphorylation. The fully annotated complete reference genome therefore constitutes an important resource for grapevine genetic studies and breeding programs.

Language: English

Citations: 116

A beginner’s guide to manual curation of transposable elements
Clément Goubert, Rory J. Craig, Agustin F. Bilat

et al.

Mobile DNA, Journal Year: 2022, Volume and Issue: 13(1)

Published: March 30, 2022

In the study of transposable elements (TEs), the generation of a high confidence set of consensus sequences that represent the diversity of TEs found in a given genome is a key step in the path to investigate these fascinating genomic elements. Many algorithms and pipelines are available to automatically identify putative TE families present in a genome. Despite the availability of these valuable resources, producing a library of high-quality full-length TE consensus sequences largely remains a process of manual curation. This know-how is often passed on from mentor to mentee within research groups, making it difficult for those outside the field to access this highly specialised skill.

Language: English

Citations: 82

Repetitive DNA sequence detection and its role in the human genome
Xingyu Liao, Wufei Zhu, Juexiao Zhou

et al.

Communications Biology, Journal Year: 2023, Volume and Issue: 6(1)

Published: Sept. 19, 2023

Repetitive DNA sequences play critical roles in driving evolution, inducing variation, and regulating gene expression. In this review, we summarized the definition, arrangement, and structural characteristics of repeats. Besides, we introduced the diverse biological functions of repeats and reviewed existing methods for automatic repeat detection, classification, and masking. Finally, we analyzed the type, structure, and regulation of repeats in the human genome and their role in the induction of complex diseases. We believe that this review will facilitate a comprehensive understanding of repeats and provide guidance for repeat annotation and in-depth exploration of its association with human diseases.

Language: English

Citations: 65

Seagrass genomes reveal ancient polyploidy and adaptations to the marine environment
Xiao Ma, Steffen Vanneste, Jiyang Chang

et al.

Nature Plants, Journal Year: 2024, Volume and Issue: 10(2), P. 240 - 255

Published: Jan. 26, 2024

Language: English

Citations: 21

Genome size evolution: towards new model systems for old questions
Julie Blommaert

Proceedings of the Royal Society B: Biological Sciences, Journal Year: 2020, Volume and Issue: 287(1933)

Published: Aug. 26, 2020

Genome size (GS) variation is a fundamental biological characteristic; however, its evolutionary causes and consequences are the topic of ongoing debate. Whether GS is a neutral trait or one subject to selective pressures, and how strong these selective pressures are, may remain open questions. Fundamentally, the genomic sequences responsible for this variation directly impact the potential evolutionary outcomes and, equally, are the targets of different evolutionary pressures. For example, duplications and deletions of genic regions (large or small) can have immediate and drastic phenotypic effects, while an expansion or contraction of non-coding DNA is less likely to cause catastrophic phenotypic effects. However, in the long term, the accumulation or deletion of ncDNA is likely to have larger effects. Modern sequencing technologies are allowing for the dissection of these proximate causes, but a combination of these new technologies with more traditional evolutionary experiments and approaches could revolutionize this debate and potentially resolve many of these arguments. Here, I discuss an ambitious way forward for GS research, putting it in the context of historical debates, theories and sometimes contradictory evidence, and highlighting the promise of combining new sequencing technologies and analytical developments with more traditional experimental evolution approaches.

Language: English

Citations: 93

Milletdb: a multi‐omics database to accelerate the research of functional genomics and molecular breeding of millets
Min Sun, Haidong Yan, Aling Zhang

et al.

Plant Biotechnology Journal, Journal Year: 2023, Volume and Issue: 21(11), P. 2348 - 2357

Published: Aug. 2, 2023

Millets are a class of nutrient‐rich coarse cereals with high resistance to abiotic stress; thus, they guarantee food security for people living in areas with extreme climatic conditions and provide stress‐related genetic resources for other crops. However, no platform is available to provide a comprehensive and systematic multi‐omics analysis for millets, which seriously hinders the mining of stress‐related genes and the molecular breeding of millets. Here, a free, web‐accessible, user‐friendly millets multi‐omics database platform (Milletdb, http://milletdb.novogene.com) has been developed. The Milletdb contains six millets and their one related species genomes, graph‐based pan‐genomics of pearl millet, and stress‐related multi‐omics data, which enable Milletdb to be the most complete millets multi‐omics database available. We stored GWAS (genome‐wide association study) results of 20 yield‐related trait data obtained under three environmental conditions [field (no stress), early drought and late drought] for 2 years in the database, allowing users to identify stress‐related genes that support yield improvement. Milletdb can simplify the functional genomics analysis of millets by providing users with different tools (e.g., ‘Gene mapping’, ‘Co‐expression’, ‘KEGG/GO Enrichment’ analysis, etc.). On the Milletdb platform, a gene PMA1G03779.1 was identified through ‘GWAS’, which has the potential to modulate yield and respond to different environmental stresses. Using the tools provided by Milletdb, we found that the stress‐related PLATZs TFs (transcription factors) family expands in 87.5% of millet accessions and contributes to vegetative growth and abiotic stress responses. Milletdb can effectively serve researchers in the mining of key genes, genome editing and molecular breeding of millets.

Language: English

Citations: 35

Long-Read Sequencing Reveals Rapid Evolution of Immunity- and Cancer-Related Genes in Bats
Armin Scheben, O. Ramos, Melissa Kramer

et al.

Genome Biology and Evolution, Journal Year: 2023, Volume and Issue: 15(9)

Published: Sept. 1, 2023

Bats are exceptional among mammals for their powered flight, extended lifespans, and robust immune systems and therefore have been of particular interest in comparative genomics. Using the Oxford Nanopore Technologies long-read platform, we sequenced the genomes of two bat species with key phylogenetic positions, the Jamaican fruit bat (Artibeus jamaicensis) and the Mesoamerican mustached bat (Pteronotus mesoamericanus), and carried out a comprehensive comparative genomic analysis with a diverse collection of bats and other mammals. The high-quality, long-read genome assemblies revealed a contraction of interferon (IFN)-α at the immunity-related type I IFN locus in bats, resulting in a shift in relative IFN-ω and IFN-α copy numbers. Contradicting previous hypotheses of constitutive expression of IFN-α being a feature of the bat immune system, three bat species lost all IFN-α genes. This shift to IFN-ω could contribute to the increased viral tolerance that has made bats a common reservoir for viruses that can be transmitted to humans. Antiviral genes stimulated by type I IFNs also showed evidence of rapid evolution, including a lineage-specific duplication of IFN-induced transmembrane genes and positive selection in IFIT2. In addition, 33 tumor suppressors and 6 DNA-repair genes showed signs of positive selection, perhaps contributing to increased longevity and reduced cancer rates in bats. The robust immune systems of bats rely on both bat-wide and lineage-specific evolution in the immune gene repertoire, suggesting diverse immune strategies. Our study provides new genomic resources for bats and sheds new light on the extraordinary molecular evolution in this critically important group of mammals.

Language: English

Citations: 31

From tradition to innovation: conventional and deep learning frameworks in genome annotation
Zhaojia Chen, Noor ul Ain, Qian Zhao

et al.

Briefings in Bioinformatics, Journal Year: 2024, Volume and Issue: 25(3)

Published: March 27, 2024

Following the milestone success of the Human Genome Project, the 'Encyclopedia of DNA Elements (ENCODE)' initiative was launched in 2003 to unearth information about the numerous functional elements within the genome. This endeavor coincided with the emergence of numerous novel technologies, accompanied by the provision of vast amounts of whole-genome sequences and high-throughput data such as ChIP-Seq and RNA-Seq. Extracting biologically meaningful information from this massive dataset has become a critical aspect of many recent studies, particularly in annotating and predicting the functions of unknown genes. The core idea behind genome annotation is to identify genes and various functional elements within the genome sequence and infer their biological functions. Traditional wet-lab experimental methods still rely on extensive efforts for functional verification. However, early bioinformatics algorithms and software primarily employed shallow learning techniques; thus, the ability to characterize data and learn features was limited. With the widespread adoption of RNA-Seq technology, scientists from the biological community began to harness the potential of machine learning and deep learning approaches for gene structure prediction and functional annotation. In this context, we reviewed both conventional methods and contemporary deep learning frameworks, and highlighted novel perspectives on the challenges arising during annotation, underscoring the dynamic nature of this evolving scientific landscape.

Language: English

Citations: 11