Nucleic Acids Research,
Journal Year:
2023,
Volume and Issue:
52(D1), P. D891 - D899
Published: Nov. 11, 2023
Abstract
Ensembl (https://www.ensembl.org) is a freely available genomic resource that has produced high-quality annotations, tools, and services for vertebrates and model organisms for more than two decades. In recent years, there has been a dramatic shift in the genomic landscape, with a large increase in the number and phylogenetic breadth of high-quality reference genomes, alongside major advances in the pan-genome representations of higher species. In order to support these efforts and accelerate downstream research, Ensembl continues to focus on scaling for the rapid annotation of new genome assemblies, developing new methods for comparative analysis, and expanding the depth and quality of our genome annotations. This year we have continued our expansion to support global biodiversity research, doubling the number of annotated genomes we support on our Rapid Release site to over 1700, driven by our close collaboration with biodiversity projects such as Darwin Tree of Life. We have also strengthened support for key agricultural species, including the first regulatory builds for farmed animals, and have updated key tools and resources that support the global scientific community, notably the Ensembl Variant Effect Predictor. Ensembl data, software, and tools are freely available.
Nature Methods,
Journal Year:
2021,
Volume and Issue:
18(4), P. 366 - 368
Published: April 1, 2021
Abstract
We are at the beginning of a genomic revolution in which all known species are planned to be sequenced. Accessing such data for comparative analyses is crucial in this new age of data-driven biology. Here, we introduce an improved version of DIAMOND that greatly exceeds previous search performances and harnesses supercomputing to perform tree-of-life scale protein alignments in hours, while matching the sensitivity of the gold standard BLASTP.
Proceedings of the National Academy of Sciences,
Journal Year:
2020,
Volume and Issue:
117(17), P. 9451 - 9457
Published: April 16, 2020
The accelerating pace of genome sequencing throughout the tree of life is driving the need for improved unsupervised annotation of genome components such as transposable elements (TEs). Because the types and sequences of TEs are highly variable across species, automated TE discovery and annotation are challenging and time-consuming tasks. A critical first step is the de novo identification and accurate compilation of sequence models representing all of the unique TE families dispersed in the genome. Here we introduce RepeatModeler2, a pipeline that greatly facilitates this process. This program brings substantial improvements over the original version of RepeatModeler, one of the most widely used tools for TE discovery. In particular, this version incorporates a module for structural discovery of complete long terminal repeat (LTR) retroelements, which are widespread in eukaryotic genomes but recalcitrant to automated identification because of their size and sequence complexity. We benchmarked RepeatModeler2 on three model species with diverse TE landscapes and high-quality, manually curated TE libraries: Drosophila melanogaster (fruit fly), Danio rerio (zebrafish), and Oryza sativa (rice). In these three species, RepeatModeler2 identified approximately 3 times more consensus sequences matching with >95% sequence identity and sequence coverage to the manually curated sequences than the original RepeatModeler. As expected, the greatest improvement is for LTR retroelements. Thus, RepeatModeler2 represents a valuable addition to the genome annotation toolkit that will enhance the identification and study of TEs in eukaryotic genome sequences. RepeatModeler2 is available as source code or a containerized package under an open license (https://github.com/Dfam-consortium/RepeatModeler, http://www.repeatmasker.org/RepeatModeler/).
Nature,
Journal Year:
2021,
Volume and Issue:
592(7856), P. 737 - 746
Published: April 28, 2021
Abstract
High-quality and complete reference genome assemblies are fundamental for the application of genomics to biology, disease, and biodiversity conservation. However, such assemblies are available for only a few non-microbial species [1–4]. To address this issue, the international Genome 10K (G10K) consortium [5,6] has worked over a five-year period to evaluate and develop cost-effective methods for assembling highly accurate and nearly complete reference genomes. Here we present lessons learned from generating assemblies for 16 species that represent six major vertebrate lineages. We confirm that long-read sequencing technologies are essential for maximizing genome quality, and that unresolved complex repeats and haplotype heterozygosity are major sources of assembly error when not handled correctly. Our assemblies correct substantial errors, add missing sequence in some of the best historical reference genomes, and reveal biological discoveries. These include the identification of many false gene duplications, increases in gene sizes, chromosome rearrangements that are specific to lineages, a repeated independent chromosome breakpoint in bat genomes, and a canonical GC-rich pattern in protein-coding genes and their regulatory regions. Adopting these lessons, we have embarked on the Vertebrate Genomes Project (VGP), an international effort to generate high-quality, complete reference genomes for all of the roughly 70,000 extant vertebrate species and to help enable a new era of discovery across the life sciences.
Nucleic Acids Research,
Journal Year:
2021,
Volume and Issue:
50(D1), P. D988 - D995
Published: Oct. 19, 2021
Ensembl (https://www.ensembl.org) is unique in its flexible infrastructure for access to genomic data and annotation. It has been designed to efficiently deliver annotation at scale for all eukaryotic life, and it also provides deep comprehensive annotation for key species. Genomes representing a greater diversity of species are increasingly being sequenced. In response, we have focussed our recent efforts on expediting the annotation of new assemblies. Here, we report the release of the greatest annual number of newly annotated genomes in the history of Ensembl via our dedicated Ensembl Rapid Release platform (http://rapid.ensembl.org). We have also developed a new method to generate comparative analyses at scale for these assemblies and, for the first time, we have annotated non-vertebrate eukaryotes. Meanwhile, we continually improve, extend and update the annotation for our high-value reference vertebrate genomes and report the details here. We have a range of specific software tools for specific tasks, such as the Ensembl Variant Effect Predictor (VEP) and the newly developed interface for the Variant Recoder. All Ensembl data, software and tools are freely available for download and are accessible programmatically.
G3 Genes Genomes Genetics,
Journal Year:
2020,
Volume and Issue:
10(4), P. 1361 - 1374
Published: Feb. 19, 2020
Reconstruction of target genomes from sequence data produced by instruments that are agnostic as to the species-of-origin may be confounded by contaminant DNA. Whether introduced during sample processing or through co-extraction alongside the target DNA, if insufficient care is taken during the assembly process, the final assembled genome may be a mixture of data from several species. Such assemblies can confound sequence-based biological inference and, when deposited in public databases, may be included in downstream analyses by users unaware of underlying problems. We present BlobToolKit, a software suite to aid researchers in identifying and isolating non-target data in draft and publicly available genome assemblies. BlobToolKit can be used to process assembly, read and analysis files for fully reproducible interactive exploration in the browser-based Viewer. BlobToolKit can be used during assembly to filter non-target DNA, helping researchers produce assemblies with high biological credibility. We have been running an automated BlobToolKit pipeline on eukaryotic assemblies publicly available in the International Nucleotide Sequence Data Collaboration and are making the results available through a public instance of the Viewer at https://blobtoolkit.genomehubs.org/view. We aim to complete analysis of all publicly available genomes and then maintain currency with the flow of new genomes. We have worked to embed these views into the presentation of genome assemblies at the European Nucleotide Archive, providing an indication of assembly quality alongside the public record with links out to allow full exploration in the Viewer.
Nucleic Acids Research,
Journal Year:
2020,
Volume and Issue:
49(D1), P. D884 - D891
Published: Oct. 7, 2020
Abstract
The Ensembl project (https://www.ensembl.org) annotates genomes and disseminates genomic data for vertebrate species. We create detailed and comprehensive annotation of gene structures, regulatory elements and variants, and enable comparative genomics by inferring the evolutionary history of genes and genomes. Our integrated genomic data are made available in a variety of ways, including genome browsers, search interfaces, specialist tools such as the Ensembl Variant Effect Predictor, download files and programmatic interfaces. Here, we present recent Ensembl developments including two new website portals. Ensembl Rapid Release (http://rapid.ensembl.org) is designed to provide core tools and services for genomes as soon as possible and has been deployed to support large biodiversity sequencing projects. Our SARS-CoV-2 genome browser (https://covid-19.ensembl.org) integrates our own annotation with publicly available genomic data from numerous sources to facilitate the use of genomics in the international scientific response to the COVID-19 pandemic. We also report on other updates to our annotation resources, tools and services. All Ensembl data and software are freely available without restriction.
NAR Genomics and Bioinformatics,
Journal Year:
2021,
Volume and Issue:
3(1)
Published: Jan. 6, 2021
Abstract
The task of eukaryotic genome annotation remains challenging. Only a few genomes could serve as standards of annotation achieved through a tremendous investment of human curation efforts. Still, the correctness of all alternative isoforms, even in the best-annotated genomes, could be a good subject for further investigation. The new BRAKER2 pipeline generates and integrates external protein support into the iterative process of training and gene prediction by GeneMark-EP+ and AUGUSTUS. BRAKER2 continues the line started by BRAKER1, where self-training GeneMark-ET and AUGUSTUS made gene predictions supported by transcriptomic data. Among the challenges addressed by the new pipeline was the generation of reliable hints to protein-coding exon boundaries from likely homologous but evolutionarily distant proteins. In comparison with other pipelines for eukaryotic genome annotation, BRAKER2 is fully automatic. It compared favorably under equal conditions with other pipelines, e.g. MAKER2, in terms of accuracy and performance. Development of BRAKER2 should facilitate solving the task of harmonization of annotation of protein-coding genes in genomes of different eukaryotic species. However, we fully understand that several more innovations are needed in transcriptomic and proteomic technologies, as well as in algorithmic development, to reach the goal of highly accurate annotation of eukaryotic genomes.
GigaScience,
Journal Year:
2021,
Volume and Issue:
10(1)
Published: Jan. 1, 2021
Abstract
Genome sequence assemblies provide the basis for our understanding of biology. Generating error-free assemblies is therefore the ultimate, but sadly still unachieved, goal of a multitude of research projects. Despite the ever-advancing improvements in data generation, assembly algorithms and pipelines, no automated approach has so far reliably generated near error-free genome assemblies for eukaryotes. Whilst working towards improved datasets and fully automated pipelines, assembly evaluation and curation is actively used to bridge this shortcoming and significantly reduce the number of assembly errors. In addition to this increase in product value, the insights gained from assembly curation are fed back into the automated assembly strategy and contribute to notable improvements in genome assembly quality. We describe our tried and tested approach for assembly curation using gEVAL, the genome evaluation browser. We outline the procedures applied to genome curation using gEVAL and also our recommendations for assembly curation in a gEVAL-independent context to facilitate the uptake of genome curation in the wider community.
Bioinformatics,
Journal Year:
2022,
Volume and Issue:
39(1)
Published: Dec. 16, 2022
Abstract
Summary
We present YaHS, a user-friendly command-line tool for the construction of chromosome-scale scaffolds from Hi-C data. It can be run with a single-line command, requires minimal input from users (an assembly file and an alignment file) which is compatible with similar tools, and provides assembly results in multiple formats, thereby enabling rapid, robust and scalable construction of high-quality genome assemblies with high accuracy and contiguity.
Availability and implementation
YaHS is implemented in C and licensed under the MIT License. The source code, documentation and tutorial are available at https://github.com/sanger-tol/yahs.
Supplementary information
Supplementary data are available at Bioinformatics online.
Genomics Proteomics & Bioinformatics,
Journal Year:
2021,
Volume and Issue:
19(4), P. 578 - 583
Published: Aug. 1, 2021
The Genome Sequence Archive (GSA) is a data repository for archiving raw sequence data, which provides data storage and sharing services for worldwide scientific communities. Considering explosive data growth with diverse data types, here we present the GSA family by expanding into a set of resources for raw data archive with different purposes, namely, GSA (https://ngdc.cncb.ac.cn/gsa/), GSA for Human (GSA-Human, https://ngdc.cncb.ac.cn/gsa-human/), and Open Archive for Miscellaneous Data (OMIX, https://ngdc.cncb.ac.cn/omix/). Compared with the 2017 version, GSA has been significantly updated in data model, online functionalities, and web interfaces. GSA-Human, as a new partner of GSA, is a data repository specialized in human genetics-related data with controlled access and security. OMIX, as a critical complement to the two resources mentioned above, is an open archive for miscellaneous data. Together, all these resources form a family of resources dedicated to archiving explosive data with diverse types, accepting data submissions from all over the world, and providing free open access to all publicly available data in support of worldwide research activities.