
De novo assembly is a method for constructing genomes from a large number of short or long DNA fragments, with no a priori knowledge of the correct sequence or order of those fragments.

The terminology for de novo assembly is sometimes inconsistent, so we’ll use the definitions below:

Reads

Reads are the sequences of DNA fragments produced by a sequencer.  “Short-reads” typically range in size from 35 to 1,000 bp (nucleotide base pairs).  “Long-reads” typically range in size from 1,000 to 500,000 bp.  For our purposes, we’ll assume a read length of 150 bp, although the actual read length depends on the sequencer model and library prep protocol used for a particular sequencing run.  Raw reads generated by sequencers are generally stored in FASTQ files.
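To make the file format concrete, here is a minimal sketch of reading FASTQ records in Python. It assumes the common layout of four lines per record (header, sequence, separator, quality) with no line wrapping; the names `FastqRecord` and `parse_fastq` and the two example reads are hypothetical and chosen for illustration only.

```python
from io import StringIO
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str      # read identifier, without the leading "@"
    sequence: str  # base calls (A, C, G, T, N)
    quality: str   # one ASCII-encoded quality character per base


def parse_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # "+" separator line, ignored
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Hypothetical two-read example; real files hold millions of records.
example = StringIO("@read1\nACGTAC\n+\nIIIIII\n@read2\nTTGCA\n+\nFFFFF\n")
records = list(parse_fastq(example))
read_lengths = [len(record.sequence) for record in records]
```

For production work an established parser is usually a better choice than a hand-written one, since real files may be compressed or contain edge cases this sketch does not handle.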

Contigs

Contigs are sets of overlapping, oriented reads.  A single contig is constructed from two or more overlapping and oriented reads.  The reads share a subset or all of their nucleotide base pairs.  Because reads can originate from either DNA strand, some of them have to be reverse-complemented (“flipped”) to yield a matching orientation.
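Flipping a read to match orientation means taking its reverse complement, not simply reversing the string. The sketch below shows the operation in Python; the function name `reverse_complement` and the example read are illustrative assumptions.

```python
# Translation table mapping each base to its complement (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence: str) -> str:
    """Return the sequence as it reads on the opposite DNA strand."""
    return sequence.translate(_COMPLEMENT)[::-1]


read = "AACGT"
flipped = reverse_complement(read)         # "ACGTT"
restored = reverse_complement(flipped)     # flipping twice gives the original
```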

Scaffolds

Scaffolds are sets of joined, oriented contigs.  A single scaffold is constructed from two or more joined and oriented contigs.  The contigs may have to be reverse-complemented (“flipped”) to yield a matching orientation.  The contigs may be overlapping or non-overlapping.

Chromosomes

Chromosomes are sets of joined, oriented scaffolds.  A single chromosome is constructed from two or more joined and oriented scaffolds.  The scaffolds may have to be reverse-complemented (“flipped”) to yield a matching orientation.  The scaffolds may be overlapping or non-overlapping.

 

Fig. 1.  Overview of the de novo assembly process.

De novo Assembly Process

Fig. 1 shows an overview of the de novo assembly process.  Partially, or sometimes fully, overlapping reads are assembled into one or more contigs.  Sets of overlapping or non-overlapping contigs are joined into one or more scaffolds.  Sets of overlapping or non-overlapping scaffolds are joined into a single chromosome.

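In the contig assembly step, reads must overlap by a minimum number of base pairs before they can be joined together.  Many assemblers express this requirement in terms of k-mers, which are subsequences of a fixed length k taken from the reads.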
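The minimum-overlap idea can be sketched in a few lines of Python. This is a simplified illustration that assumes error-free reads already in the same orientation and uses exact suffix-to-prefix matching; production assemblers generally rely on graph-based methods and tolerate sequencing errors. The function names and example reads are hypothetical.

```python
from typing import List, Optional


def kmers(sequence: str, k: int) -> List[str]:
    """Return every subsequence of length k, in order of position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def overlap_length(left: str, right: str, min_overlap: int) -> int:
    """Longest suffix of `left` equal to a prefix of `right`.

    Returns 0 when no overlap of at least `min_overlap` bases exists.
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    longest = min(len(left), len(right))
    for size in range(longest, min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return size
    return 0


def merge_reads(left: str, right: str, min_overlap: int) -> Optional[str]:
    """Join two reads into one contig, or return None if they do not overlap enough."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]


read_a = "ATGCGTAC"
read_b = "CGTACGGA"
contig = merge_reads(read_a, read_b, min_overlap=4)    # 5-base overlap "CGTAC"
rejected = merge_reads(read_a, read_b, min_overlap=6)  # overlap too short
three_mers = kmers("ATGCG", 3)
```

Raising the minimum overlap makes spurious joins less likely but also discards genuine overlaps that are shorter than the threshold, which is why the value is normally tuned to the read length.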

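In the scaffold assembly step, contigs do not necessarily have to overlap in order to be joined together.  Paired-end sequencing makes this possible: when the two reads of a pair fall in different contigs, the known approximate distance between the reads indicates the relative order, orientation, and spacing of those contigs.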
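As a rough illustration of how a read pair can link two non-overlapping contigs, the Python sketch below estimates the gap between them from an assumed insert size and then joins the contigs with a run of `N` bases standing in for the unknown sequence. The function names, coordinates, and the single-pair estimate are simplifying assumptions; real scaffolders combine evidence from many pairs.

```python
from typing import List, Tuple

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence: str) -> str:
    """Return the sequence as it reads on the opposite DNA strand."""
    return sequence.translate(_COMPLEMENT)[::-1]


def estimate_gap(insert_size: int, left_contig_length: int,
                 left_read_start: int, right_read_end: int) -> int:
    """Estimate the gap between two contigs from one read pair.

    insert_size: expected distance between the outer ends of the pair.
    left_read_start: 0-based start of the first read in the left contig.
    right_read_end: exclusive end of the mate in the right contig.
    """
    covered_in_left = left_contig_length - left_read_start
    return insert_size - covered_in_left - right_read_end


def build_scaffold(placements: List[Tuple[str, bool]], gaps: List[int]) -> str:
    """Join oriented contigs, padding each estimated gap with 'N' bases.

    placements: (contig sequence, needs_flip) in scaffold order.
    gaps: estimated gap sizes between consecutive contigs.
    """
    if len(gaps) != len(placements) - 1:
        raise ValueError("need exactly one gap between each pair of contigs")
    parts = []
    for index, (contig, needs_flip) in enumerate(placements):
        parts.append(reverse_complement(contig) if needs_flip else contig)
        if index < len(gaps):
            # A negative estimate suggests overlap; this sketch clamps it to zero.
            parts.append("N" * max(gaps[index], 0))
    return "".join(parts)


gap = estimate_gap(insert_size=500, left_contig_length=1000,
                   left_read_start=800, right_read_end=250)
scaffold = build_scaffold([("ACGTAC", False), ("AACGT", True)], gaps=[3])
```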

In the chromosome assembly step, scaffolds are joined together in a gap-filling, gap-closing, or genome-finishing process.  This final step is difficult, and sometimes impossible, to complete using only short-read technology.  Repetitive sequences in particular can prevent gap-filling with short-reads alone, although some progress is being made in this area.

Finishing complete chromosomes often requires the use of multiple sequencing technologies and hybrid assembly protocols.  You’ll often see short-read technology combined with long-read technology or optical maps (such as Bionano maps) to generate fully finished genomes.  Employing multiple sequencing technologies on a per-sample basis can be costly.

Assembly Algorithms

There are many de novo assembly algorithms and software applications available for next-generation sequencing projects.  For small genome assembly (e.g. bacterial-scale genomes) we often use SPAdes and Geneious, but we may use other tools if they are more appropriate.

Assembly Quality

In general, we use QUAST to report on the quality of de novo assembled scaffolds.
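One widely used contiguity statistic in assembly quality reports is N50: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The Python sketch below computes it for a list of scaffold lengths. It is an illustration of the definition, not the implementation used by any particular tool, and the example lengths are made up.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running_total = 0
    for length in ordered:
        running_total += length
        if running_total >= half_total:
            return length
    return 0  # empty input


# Hypothetical scaffold lengths in base pairs (total 300, half 150).
scaffold_lengths = [100, 80, 60, 40, 20]
assembly_n50 = n50(scaffold_lengths)  # 100 + 80 = 180 >= 150, so N50 is 80
```

A higher N50 generally indicates a more contiguous assembly, but it says nothing about correctness, so it is best read alongside other metrics such as the number of scaffolds and total assembly length.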

Sequencing Assembly Services at The Sequencing Center

We offer de novo assembly services for short-read and long-read datasets, and hybrid assembly using a combination of short- and long-reads.

With short-reads alone, we can perform de novo assembly to the scaffold level for many smaller genomes, such as those of bacteria, bacteriophages, viruses, yeast, and fungi.  With long-reads alone, we can perform de novo assembly to generate either complete genomes or nearly complete genomes, again for mostly smaller (microbial-scale) genomes.

We can also perform hybrid de novo assembly with a combination of short-reads and long-reads.  In most cases, hybrid assembly yields fully finished, complete microbial-sized genomes, although it incurs additional costs because both short-read and long-read sequencing must be run for each sample.