FGeneBERT: function-driven pre-trained gene language model for metagenomics DOI Creative Commons

Chenrui Duan, Zelin Zang, Yongjie Xu

et al.

Briefings in Bioinformatics, Journal year: 2025, Issue: 26(2)

Published: March 1, 2025

Metagenomic data, comprising mixed multi-species genomes, are prevalent in diverse environments like oceans and soils, significantly impacting human health and ecological functions. However, current research relies on K-mer, which limits the capture of structurally and functionally relevant gene contexts. Moreover, these approaches struggle with encoding biologically meaningful genes and fail to address the one-to-many and many-to-one relationships inherent in metagenomic data. To overcome these challenges, we introduce FGeneBERT, a novel pre-trained model that employs a protein-based representation as a context-aware and structure-relevant tokenizer. FGeneBERT incorporates masked modeling to enhance the understanding of inter-gene contextual relationships and triplet enhanced contrastive learning to elucidate sequence-function relationships. Pre-trained on over 100 million sequences, FGeneBERT demonstrates superior performance on datasets at four levels, spanning gene, functional, bacterial, and environmental levels and ranging from 1 to 213 k input sequences. Case studies of ATP synthase and operons highlight FGeneBERT's capability for functional recognition and its biological relevance in metagenomic research.

Language: English

Plant pan-genomes are the new reference DOI
Philipp E. Bayer, Agnieszka A. Golicz, Armin Scheben

et al.

Nature Plants, Journal year: 2020, Issue: 6(8), Pages: 914-920

Published: July 20, 2020

Language: English

Cited by

401

Giant lungfish genome elucidates the conquest of land by vertebrates DOI Creative Commons
Axel Meyer, Siegfried Schloissnig, Paolo Franchini

et al.

Nature, Journal year: 2021, Issue: 590(7845), Pages: 284-289

Published: Jan. 18, 2021

Lungfishes belong to lobe-finned fish (Sarcopterygii) that, in the Devonian period, ‘conquered’ the land and ultimately gave rise to all land vertebrates, including humans 1–3. Here we determine the chromosome-quality genome of the Australian lungfish (Neoceratodus forsteri), which is known to have the largest genome of any animal. The vast size of this genome, which is about 14× larger than that of humans, is attributable mostly to huge intergenic regions and introns with high repeat content (around 90%), the components of which resemble those of tetrapods (comprising mainly long interspersed nuclear elements) more than they do those of ray-finned fish. The lungfish genome continues to expand independently (its transposable elements are still active), through mechanisms different to those of the enormous genomes of salamanders. The 17 fully assembled macrochromosomes maintain synteny to other vertebrate chromosomes, and the microchromosomes maintain conserved ancient homology with the ancestral karyotype. Our phylogenomic analyses confirm previous reports that lungfish occupy a key evolutionary position as the closest living relatives to tetrapods 4,5, underscoring the importance of lungfish for understanding innovations associated with terrestrialization. Lungfish preadaptations to living on land include the gain of limb-like expression in developmental genes such as hoxc13 and sall1 in their lobed fins. Increased rates of evolution and the duplication of genes associated with obligate air-breathing, such as lung surfactants, and the expansion of odorant receptor gene families (which encode proteins involved in detecting airborne odours), contribute to the tetrapod-like biology of lungfishes. These findings advance our understanding of this major transition during vertebrate evolution.

Language: English

Cited by

181

The complete reference genome for grapevine (Vitis vinifera L.) genetics and breeding DOI Creative Commons

Xiaoya Shi, Shuo Cao, Xu Wang

et al.

Horticulture Research, Journal year: 2023, Issue: 10(5)

Published: April 4, 2023

Grapevine is one of the most economically important crops worldwide. However, previous versions of the grapevine reference genome typically consist of thousands of fragments with missing centromeres and telomeres, limiting the accessibility of the repetitive sequences, the centromeric and telomeric regions, and the study of the inheritance of agronomic traits in these regions. Here, we assembled a telomere-to-telomere (T2T) gap-free reference genome for the cultivar PN40024 using PacBio HiFi long reads. The T2T reference genome (PN_T2T) is 69 Mb longer, with 9018 more genes identified, than the 12X.v0 version. We annotated 67% repetitive sequences, 19 centromeres and 36 telomeres, and incorporated gene annotations into the PN_T2T assembly. We detected a total of 377 clusters, which showed associations with complex traits, such as aroma and disease resistance. Even though PN40024 derives from nine generations of selfing, we still found genomic hotspots of heterozygous sites associated with biological processes, such as the oxidation-reduction process and protein phosphorylation. The fully complete reference genome therefore constitutes an important resource for grapevine genetic studies and breeding programs.

Language: English

Cited by

116

A beginner’s guide to manual curation of transposable elements DOI Creative Commons
Clément Goubert, Rory J. Craig, Agustin F. Bilat

et al.

Mobile DNA, Journal year: 2022, Issue: 13(1)

Published: March 30, 2022

In the study of transposable elements (TEs), the generation of a high confidence set of consensus sequences that represent the diversity of TEs found in a given genome is a key step in the path to investigate these fascinating genomic elements. Many algorithms and pipelines are available to automatically identify putative TE families present in a genome. Despite the availability of these valuable resources, producing a library of high-quality full-length TE consensus sequences largely remains a process of manual curation. This know-how is often passed on from mentor to mentee within research groups, making it difficult for those outside the field to access this highly specialised skill.

Language: English

Cited by

82

Repetitive DNA sequence detection and its role in the human genome DOI Creative Commons
Xingyu Liao, Wufei Zhu, Juexiao Zhou

et al.

Communications Biology, Journal year: 2023, Issue: 6(1)

Published: Sep. 19, 2023

Repetitive DNA sequences play critical roles in driving evolution, inducing variation, and regulating gene expression. In this review, we summarized the definition, arrangement, and structural characteristics of repeats. Besides, we introduced the diverse biological functions of repeats and reviewed existing methods for automatic repeat detection, classification, and masking. Finally, we analyzed the type, structure, and regulation of repeats in the human genome and their role in the induction of complex diseases. We believe that this review will facilitate a comprehensive understanding of repeats and provide guidance for repeat annotation and in-depth exploration of its association with

Language: English

Cited by

65

Seagrass genomes reveal ancient polyploidy and adaptations to the marine environment DOI

Xiao Ma, Steffen Vanneste, Jiyang Chang

et al.

Nature Plants, Journal year: 2024, Issue: 10(2), Pages: 240-255

Published: Jan. 26, 2024

Language: English

Cited by

21

Genome size evolution: towards new model systems for old questions DOI Open Access
Julie Blommaert

Proceedings of the Royal Society B: Biological Sciences, Journal year: 2020, Issue: 287(1933)

Published: Aug. 26, 2020

Genome size (GS) variation is a fundamental biological characteristic; however, its evolutionary causes and consequences are the topic of ongoing debate. Whether GS is a neutral trait or one subject to selective pressures, and how strong these pressures are, may remain open questions. Fundamentally, the genomic sequences responsible for this variation directly impact the potential evolutionary outcomes and, equally, are the targets of different evolutionary pressures. For example, duplications and deletions of genic regions (large or small) can have immediate and drastic phenotypic effects, while an expansion or contraction of non-coding DNA is less likely to cause catastrophic effects. However, in the long term, the accumulation or deletion of ncDNA is likely to have larger effects. Modern sequencing technologies are allowing the dissection of these proximate causes, but a combination of these new technologies with more traditional evolutionary experiments and approaches could revolutionize this debate and potentially resolve many of these arguments. Here, I discuss an ambitious way forward for GS research, putting it in the context of historical debates, theories and sometimes contradictory evidence, and highlighting the promise of combining new sequencing technologies and analytical developments with more traditional experimental evolution approaches.

Language: English

Cited by

93

Milletdb: a multi‐omics database to accelerate the research of functional genomics and molecular breeding of millets DOI Creative Commons
Min Sun, Haidong Yan, Aling Zhang

et al.

Plant Biotechnology Journal, Journal year: 2023, Issue: 21(11), Pages: 2348-2357

Published: Aug. 2, 2023

Millets are a class of nutrient‐rich coarse cereals with high resistance to abiotic stress; thus, they guarantee food security for people living in areas with extreme climatic conditions and provide stress‐related genetic resources for other crops. However, no platform is available to provide a comprehensive and systematic multi‐omics analysis for millets, which seriously hinders the mining of stress‐related genes and the molecular breeding of millets. Here, a free, web‐accessible, user‐friendly millets multi‐omics database platform (Milletdb, http://milletdb.novogene.com) has been developed. The Milletdb contains six millets and their one related species genomes, graph‐based pan‐genomics of pearl millet, and stress‐related multi‐omics data, which enable Milletdb to be the most complete millets multi‐omics database available. We stored GWAS (genome‐wide association study) results of 20 yield‐related trait data obtained under three environmental conditions [field (no stress), early drought and late drought] for 2 years in the database, allowing users to identify stress‐related genes that support yield improvement. Milletdb can simplify the functional genomics analysis of millets by providing users with different tools (e.g., ‘Gene mapping’, ‘Co‐expression’, ‘KEGG/GO Enrichment’ analysis, etc.). On the platform, a gene PMA1G03779.1 was identified through ‘GWAS’, which has the potential to modulate yield and respond to different environmental stresses. Using the tools provided by Milletdb, we found that the stress‐related PLATZs TFs (transcription factors) family expands in 87.5% of millet accessions and contributes to vegetative growth and abiotic stress responses. Milletdb can effectively serve researchers in the mining of key genes, genome editing

Language: English

Cited by

35

Long-Read Sequencing Reveals Rapid Evolution of Immunity- and Cancer-Related Genes in Bats DOI Creative Commons
Armin Scheben, O. Ramos, Melissa Kramer

et al.

Genome Biology and Evolution, Journal year: 2023, Issue: 15(9)

Published: Sep. 1, 2023

Bats are exceptional among mammals for their powered flight, extended lifespans, and robust immune systems and therefore have been of particular interest in comparative genomics. Using the Oxford Nanopore Technologies long-read platform, we sequenced the genomes of two bat species with key phylogenetic positions, the Jamaican fruit bat (Artibeus jamaicensis) and the Mesoamerican mustached bat (Pteronotus mesoamericanus), and carried out a comprehensive comparative genomic analysis with a diverse collection of bats and other mammals. The high-quality, long-read genome assemblies revealed a contraction of interferon (IFN)-α at the immunity-related type I IFN locus in bats, resulting in a shift in relative IFN-ω and IFN-α copy numbers. Contradicting previous hypotheses of constitutive expression of IFN-α being a feature of the bat immune system, three bat species lost all IFN-α genes. This shift to IFN-ω could contribute to the increased viral tolerance that has made bats a common reservoir for viruses that can be transmitted to humans. Antiviral genes stimulated by type I IFNs also showed evidence of rapid evolution, including a lineage-specific duplication of IFN-induced transmembrane genes and positive selection in IFIT2. In addition, 33 tumor suppressors and 6 DNA-repair genes showed signs of positive selection, perhaps contributing to increased longevity and reduced cancer rates in bats. The robust immune systems of bats rely on both bat-wide and lineage-specific evolution in the immune gene repertoire, suggesting diverse immune strategies. Our study provides new genomic resources for bats and sheds light on the extraordinary molecular evolution in this critically important group.

Language: English

Cited by

31

From tradition to innovation: conventional and deep learning frameworks in genome annotation DOI Creative Commons

Zhaojia Chen, Noor ul Ain, Qian Zhao

et al.

Briefings in Bioinformatics, Journal year: 2024, Issue: 25(3)

Published: March 27, 2024

Following the milestone success of the Human Genome Project, the 'Encyclopedia of DNA Elements (ENCODE)' initiative was launched in 2003 to unearth information about the numerous functional elements within the genome. This endeavor coincided with the emergence of novel technologies, accompanied by the provision of vast amounts of whole-genome sequences and high-throughput data such as ChIP-Seq and RNA-Seq. Extracting biologically meaningful information from this massive dataset has become a critical aspect of many recent studies, particularly in annotating and predicting the functions of unknown genes. The core idea behind genome annotation is to identify genes and various functional elements within the genome sequence and infer their biological functions. Traditional wet-lab experimental methods still rely on extensive efforts for functional verification. However, early bioinformatics algorithms and software primarily employed shallow learning techniques; thus, their ability to characterize data and learn features was limited. With the widespread adoption of RNA-Seq technology, scientists in the biological community began to harness the potential of machine learning and deep learning approaches for gene structure prediction and functional annotation. In this context, we reviewed both conventional and contemporary deep learning frameworks, and highlighted novel perspectives on the challenges arising during annotation, underscoring the dynamic nature of this evolving scientific landscape.

Language: English

Cited by

11