Ensembl 2024
Peter W. Harrison, M Ridwan Amode, Olanrewaju Austine-Orimoloye

et al.

Nucleic Acids Research, Journal Year: 2023, Volume and Issue: 52(D1), P. D891 - D899

Published: Nov. 11, 2023

Abstract: Ensembl (https://www.ensembl.org) is a freely available genomic resource that has produced high-quality annotations, tools, and services for vertebrates and model organisms for more than two decades. In recent years, there has been a dramatic shift in the genomic landscape, with a large increase in the number and phylogenetic breadth of reference genomes, alongside major advances in the pan-genome representations of higher species. In order to support these efforts and accelerate downstream research, Ensembl continues to focus on scaling for the rapid annotation of new genome assemblies, developing new methods for comparative analysis, and expanding the depth and quality of our genome annotations. This year we have continued our expansion to support global biodiversity research, doubling the number of annotated genomes we support on our Rapid Release site to over 1700, driven by our close collaboration with biodiversity projects such as Darwin Tree of Life. We have also strengthened support for key agricultural species, including the first regulatory builds for farmed animals, and have updated key tools and resources that support the global scientific community, notably the Ensembl Variant Effect Predictor. Ensembl data, software, and tools are freely available.

Language: English

Sensitive protein alignments at tree-of-life scale using DIAMOND
Benjamin Buchfink, Klaus Reuter, Hajk‐Georg Drost

et al.

Nature Methods, Journal Year: 2021, Volume and Issue: 18(4), P. 366 - 368

Published: April 1, 2021

Abstract: We are at the beginning of a genomic revolution in which all known species are planned to be sequenced. Accessing such data for comparative analyses is crucial in this new age of data-driven biology. Here, we introduce an improved version of DIAMOND that greatly exceeds previous search performances and harnesses supercomputing to perform tree-of-life scale protein alignments in hours, while matching the sensitivity of the gold standard BLASTP.

Language: English

Citations: 2738

RepeatModeler2 for automated genomic discovery of transposable element families
Jullien M. Flynn, Robert Hubley, Clément Goubert

et al.

Proceedings of the National Academy of Sciences, Journal Year: 2020, Volume and Issue: 117(17), P. 9451 - 9457

Published: April 16, 2020

The accelerating pace of genome sequencing throughout the tree of life is driving the need for improved unsupervised annotation of genome components such as transposable elements (TEs). Because the types and sequences of TEs are highly variable across species, automated TE discovery and annotation are challenging and time-consuming tasks. A critical first step is the de novo identification and accurate compilation of sequence models representing all of the unique TE families dispersed in the genome. Here we introduce RepeatModeler2, a pipeline that greatly facilitates this process. This program brings substantial improvements over the original version of RepeatModeler, one of the most widely used tools for TE discovery. In particular, this version incorporates a module for structural discovery of complete long terminal repeat (LTR) retroelements, which are widespread in eukaryotic genomes but recalcitrant to automated identification because of their size and sequence complexity. We benchmarked RepeatModeler2 on three model species with diverse TE landscapes and high-quality, manually curated TE libraries: Drosophila melanogaster (fruit fly), Danio rerio (zebrafish), and Oryza sativa (rice). In these three species, RepeatModeler2 identified approximately 3 times more consensus sequences matching with >95% sequence identity and sequence coverage to the manually curated sequences than the original RepeatModeler. As expected, the greatest improvement is for LTR retroelements. Thus, RepeatModeler2 represents a valuable addition to the genome annotation toolkit that will enhance the identification and study of TEs in eukaryotic genome sequences. RepeatModeler2 is available as source code or a containerized package under an open license (https://github.com/Dfam-consortium/RepeatModeler, http://www.repeatmasker.org/RepeatModeler/).

Language: English

Citations: 2557

Towards complete and error-free genome assemblies of all vertebrate species
Arang Rhie, Shane McCarthy, Olivier Fédrigo

et al.

Nature, Journal Year: 2021, Volume and Issue: 592(7856), P. 737 - 746

Published: April 28, 2021

Abstract: High-quality and complete reference genome assemblies are fundamental for the application of genomics to biology, disease, and biodiversity conservation. However, such assemblies are available for only a few non-microbial species [1–4]. To address this issue, the international Genome 10K (G10K) consortium [5,6] has worked over a five-year period to evaluate and develop cost-effective methods for assembling highly accurate and nearly complete reference genomes. Here we present lessons learned from generating assemblies for 16 species that represent six major vertebrate lineages. We confirm that long-read sequencing technologies are essential for maximizing genome quality, and that unresolved complex repeats and haplotype heterozygosity are major sources of assembly error when not handled correctly. Our assemblies correct substantial errors, add missing sequence in some of the best historical reference genomes, and reveal biological discoveries. These include the identification of many false gene duplications, increases in gene sizes, chromosome rearrangements that are specific to lineages, a repeated independent chromosome breakpoint in bat genomes, and a canonical GC-rich pattern in protein-coding genes and their regulatory regions. Adopting these lessons, we have embarked on the Vertebrate Genomes Project (VGP), an international effort to generate high-quality, complete reference genomes for all of the roughly 70,000 extant vertebrate species and to help enable a new era of discovery across the life sciences.

Language: English

Citations: 2060

Ensembl 2022
Fiona Cunningham, James E. Allen, Jamie Allen

et al.

Nucleic Acids Research, Journal Year: 2021, Volume and Issue: 50(D1), P. D988 - D995

Published: Oct. 19, 2021

Ensembl (https://www.ensembl.org) is unique in its flexible infrastructure for access to genomic data and annotation. It has been designed to efficiently deliver annotation at scale for all eukaryotic life, and it also provides deep comprehensive annotation for key species. Genomes representing a greater diversity of species are increasingly being sequenced. In response, we have focussed our recent efforts on expediting the annotation of new assemblies. Here, we report the release of the greatest annual number of newly annotated genomes in the history of Ensembl via our dedicated Ensembl Rapid Release platform (http://rapid.ensembl.org). We have also developed a new method to generate comparative analyses at scale for these assemblies and, for the first time, we have annotated non-vertebrate eukaryotes. Meanwhile, we continually improve, extend and update the annotation for our high-value reference vertebrate genomes and report the details here. We have a range of specific software tools for specific tasks, such as the Ensembl Variant Effect Predictor (VEP) and the newly developed interface for the Variant Recoder. All Ensembl data, software and tools are freely available for download and are accessible programmatically.

Language: English

Citations: 1677

BlobToolKit – Interactive Quality Assessment of Genome Assemblies
Richard Challis, E. G. Richards, Jeena Rajan

et al.

G3 Genes Genomes Genetics, Journal Year: 2020, Volume and Issue: 10(4), P. 1361 - 1374

Published: Feb. 19, 2020

Reconstruction of target genomes from sequence data produced by instruments that are agnostic as to the species-of-origin may be confounded by contaminant DNA. Whether introduced during sample processing or through co-extraction alongside the target DNA, if insufficient care is taken during the assembly process, the final assembled genome may be a mixture of data from several species. Such assemblies can confound sequence-based biological inference and, when deposited in public databases, may be included in downstream analyses by users unaware of underlying problems. We present BlobToolKit, a software suite to aid researchers in identifying and isolating non-target data in draft and publicly available genome assemblies. BlobToolKit can be used to process assembly, read and analysis files for fully reproducible interactive exploration in the browser-based Viewer. BlobToolKit can be used during assembly to filter non-target DNA, helping researchers produce assemblies with high biological credibility. We have been running an automated BlobToolKit pipeline on eukaryotic assemblies publicly available in the International Nucleotide Sequence Data Collaboration and are making the results available through a public instance of the Viewer at https://blobtoolkit.genomehubs.org/view. We aim to complete analysis of all publicly available genomes and then maintain currency with the flow of new genomes. We have worked to embed these views into the presentation of genome assemblies at the European Nucleotide Archive, providing an indication of assembly quality alongside the public record with links out to allow full exploration in the Viewer.

Language: English

Citations: 1628

Ensembl 2021
Kevin Howe, Premanand Achuthan, James E. Allen

et al.

Nucleic Acids Research, Journal Year: 2020, Volume and Issue: 49(D1), P. D884 - D891

Published: Oct. 7, 2020

Abstract: The Ensembl project (https://www.ensembl.org) annotates genomes and disseminates genomic data for vertebrate species. We create detailed and comprehensive annotation of gene structures, regulatory elements and variants, and enable comparative genomics by inferring the evolutionary history of genes and genomes. Our integrated genomic data are made available in a variety of ways, including genome browsers, search interfaces, specialist tools such as the Ensembl Variant Effect Predictor, download files and programmatic interfaces. Here, we present recent Ensembl developments including two new website portals. Ensembl Rapid Release (http://rapid.ensembl.org) is designed to provide core tools and services for genomes as soon as possible and has been deployed to support large biodiversity sequencing projects. The SARS-CoV-2 browser (https://covid-19.ensembl.org) integrates our own annotation with publicly available genomic data from numerous sources to facilitate the use of genomics in the international scientific response to the COVID-19 pandemic. We also report on other updates to our annotation resources, tools and services. All Ensembl data and software are freely available without restriction.

Language: English

Citations: 1464

BRAKER2: automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database
Tomáš Brůna, Katharina J. Hoff, Alexandre Lomsadze

et al.

NAR Genomics and Bioinformatics, Journal Year: 2021, Volume and Issue: 3(1)

Published: Jan. 6, 2021

Abstract: The task of eukaryotic genome annotation remains challenging. Only a few genomes could serve as standards of annotation achieved through a tremendous investment of human curation efforts. Still, the correctness of all alternative isoforms, even in the best-annotated genomes, could be a good subject for further investigation. The new BRAKER2 pipeline generates and integrates external protein support into the iterative process of training and gene prediction by GeneMark-EP+ and AUGUSTUS. BRAKER2 continues the line started by BRAKER1, where self-training GeneMark-ET and AUGUSTUS made gene predictions supported by transcriptomic data. Among the challenges addressed by the new pipeline was the generation of reliable hints to protein-coding exon boundaries from likely homologous but evolutionarily distant proteins. In comparison with other pipelines for eukaryotic genome annotation, BRAKER2 is fully automatic. It compares favorably under equal conditions with other pipelines, e.g. MAKER2, in terms of accuracy and performance. Development of BRAKER2 should facilitate solving the task of harmonization of annotation of protein-coding genes in genomes of different eukaryotic species. However, we fully understand that several more innovations are needed in transcriptomic and proteomic technologies as well as in algorithmic development to reach the goal of highly accurate annotation of eukaryotic genomes.

Language: English

Citations: 1368

Significantly improving the quality of genome assemblies through curation
Kerstin Howe, William Chow, Joanna Collins

et al.

GigaScience, Journal Year: 2021, Volume and Issue: 10(1)

Published: Jan. 1, 2021

Abstract: Genome sequence assemblies provide the basis for our understanding of biology. Generating error-free assemblies is therefore the ultimate, but sadly still unachieved, goal of a multitude of research projects. Despite the ever-advancing improvements in data generation, assembly algorithms and pipelines, no automated approach has so far reliably generated near error-free genome assemblies for eukaryotes. Whilst working towards improved datasets and fully automated pipelines, assembly evaluation and curation is actively used to bridge this shortcoming and significantly reduce the number of assembly errors. In addition to this increase in product value, the insights gained from assembly curation are fed back into the automated assembly strategy and contribute to notable improvements in genome assembly quality. We describe our tried and tested approach for assembly curation using gEVAL, the genome evaluation browser. We outline the procedures applied to genome curation using gEVAL and also our recommendations for assembly curation in a gEVAL-independent context to facilitate the uptake of genome curation in the wider community.

Language: English

Citations: 1364

YaHS: yet another Hi-C scaffolding tool
Chenxi Zhou, Shane McCarthy, Richard Durbin

et al.

Bioinformatics, Journal Year: 2022, Volume and Issue: 39(1)

Published: Dec. 16, 2022

Abstract: Summary: We present YaHS, a user-friendly command-line tool for the construction of chromosome-scale scaffolds from Hi-C data. It can be run with a single-line command, requires minimal input from users (an assembly file and an alignment file) which is compatible with similar tools, and provides assembly results in multiple formats, thereby enabling rapid, robust and scalable construction of high-quality genome assemblies with high accuracy and contiguity. Availability and implementation: YaHS is implemented in C and licensed under the MIT License. The source code, documentation and tutorial are available at https://github.com/sanger-tol/yahs. Supplementary information: Supplementary data are available at Bioinformatics online.

Language: English

Citations: 1334

The Genome Sequence Archive Family: Toward Explosive Data Growth and Diverse Data Types
Tingting Chen, Xu Chen, Sisi Zhang

et al.

Genomics Proteomics & Bioinformatics, Journal Year: 2021, Volume and Issue: 19(4), P. 578 - 583

Published: Aug. 1, 2021

The Genome Sequence Archive (GSA) is a data repository for archiving raw sequence data, which provides data storage and sharing services for worldwide scientific communities. Considering explosive data growth with diverse data types, here we present the GSA family by expanding into a set of resources for raw data archive with different purposes, namely, GSA (https://ngdc.cncb.ac.cn/gsa/), GSA for Human (GSA-Human, https://ngdc.cncb.ac.cn/gsa-human/), and Open Archive for Miscellaneous Data (OMIX, https://ngdc.cncb.ac.cn/omix/). Compared with the 2017 version, GSA has been significantly updated in data model, online functionalities, and web interfaces. GSA-Human, as a new partner of GSA, is a data repository specialized in human genetics-related data with controlled access and security. OMIX, as a critical complement to the two resources mentioned above, is an open archive for miscellaneous data. Together, all these resources form a family of resources dedicated to archiving explosive data with diverse types, accepting data submissions from all over the world, and providing free open access to all publicly available data in support of worldwide research activities.

Language: English

Citations: 1131