Overview of the Dovetail™ De Novo Assembly Process

When building a de novo genome assembly for your favorite organism, assembly contiguity and accuracy are equally important.  Let the de novo experts at Dovetail Genomics® build you an accurate and contiguous genome assembly or improve an existing assembly with our two proprietary proximity ligation methods, Chicago® and Dovetail™ Hi-C, and our leading scaffolding software, HiRise™.

genome assembly workflow

Our end-to-end service begins with a draft assembly of the genome.  This can either be provided by you or built here at Dovetail using the right sequencing technology for your genome – we will consult with you to decide on the approach.  Our minimum quality requirement at this stage is a scaffold N50 >20Kb.
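For readers less familiar with the metric, scaffold N50 is the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the assembly. The Python sketch below illustrates that calculation and a check against the 20Kb minimum stated above. It is not Dovetail's or HiRise™'s implementation. The function name `nx`, the constant name, the toy scaffold lengths, and the reading of 20Kb as 20,000 bp are assumptions made for illustration.

```python
def nx(lengths, percent=50):
    """Return the Nx statistic for a list of scaffold lengths in bp.

    Nx is the length of the scaffold at which the running total of the
    scaffolds, sorted from longest to shortest, first reaches `percent`
    of the total assembly length (percent=50 gives N50, 90 gives N90).
    """
    if not lengths:
        raise ValueError("no scaffold lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer arithmetic avoids floating-point rounding at the boundary.
        if running * 100 >= percent * total:
            return length


# Assumption: 20Kb is taken to mean 20,000 bp.
MIN_INPUT_N50_BP = 20_000

# Hypothetical scaffold lengths, for illustration only.
toy_scaffolds = [80_000, 50_000, 30_000, 20_000, 10_000, 10_000]

toy_n50 = nx(toy_scaffolds)
toy_n90 = nx(toy_scaffolds, percent=90)
meets_minimum = toy_n50 > MIN_INPUT_N50_BP

print(f"N50 = {toy_n50} bp, N90 = {toy_n90} bp, meets minimum: {meets_minimum}")
```

The same function covers N90 by changing `percent`, which is useful when comparing assemblies that share a similar N50 but differ in their shorter scaffolds.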

We then use our proprietary Chicago® in vitro proximity ligation data to build up assembly contiguity by making long-range joins. Our scaffolding software, HiRise™, also uses this data to find and correct misjoins in the input assembly.

Next, we build Dovetail™ Hi-C libraries using intact cells or tissue.  HiRise™ uses the Dovetail™ Hi-C data to make even longer range connections, up to full chromosomes, thereby greatly increasing contiguity.

The final assembly is both highly contiguous (Chicago® + Dovetail™ Hi-C) and highly accurate (Chicago®).  Our bioinformatics team manually inspects the final assembly, ensuring the highest level of quality, before delivery.

“It’s one thing to sequence the genome and it’s quite another to have quality. That’s why we like working with Dovetail – it’s super high-quality.”

Jaroslav Dolezel and Eva Hribova, Institute of Experimental Botany, Academy of Sciences of the Czech Republic

eva and jaroslav

Dovetail’s genome assembly services and products will take your research to the next level.

Correct Misjoins to Improve Contig Order and Orientation

graph of Chicago alone, Chicago + Dovetail Hi-C and Dovetail Hi-C alone

When HiRise™ has high confidence that a join in the input de novo assembly was made incorrectly, it breaks that join, thereby improving overall accuracy. In this example of a polyploid plant, HiRise™ identified and broke a false join. As a result, Chicago® + HiRise™ lowered the input scaffold N50 from 5.980Mb to 2.915Mb, but the order and orientation of the assembly are now accurate.
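Breaking a misjoin can lower N50 because a long scaffold is replaced by two shorter ones while the total assembly length stays the same. The Python sketch below shows this on a toy assembly. It is an illustration only, not HiRise™ code: the helper names `n50` and `break_scaffold`, the scaffold lengths, and the break position are all hypothetical.

```python
def n50(lengths):
    """Return the scaffold N50 (in bp) for a list of scaffold lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no scaffold lengths given")


def break_scaffold(lengths, index, position):
    """Split the scaffold at `index` into two pieces at `position` bp."""
    length = lengths[index]
    if not 0 < position < length:
        raise ValueError("break position must fall inside the scaffold")
    return lengths[:index] + [position, length - position] + lengths[index + 1:]


# Hypothetical input assembly; the first scaffold contains a false join.
before = [12_000_000, 6_000_000, 4_000_000, 2_000_000]

# Break the misjoined scaffold 5 Mb from its start (hypothetical position).
after = break_scaffold(before, index=0, position=5_000_000)

print(f"N50 before break: {n50(before)} bp")
print(f"N50 after break:  {n50(after)} bp")
```

In this toy case the N50 drops even though no sequence is lost, which is the same trade-off described above: contiguity falls in exchange for a correct order and orientation.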

Next, the Dovetail™ Hi-C data was applied to the interim Chicago® assembly and contiguity was dramatically improved, from an N50 of 2.915Mb to 44.288Mb.  Thus, Chicago® and Dovetail™ Hi-C complement each other to provide a final assembly that is both accurate and contiguous.

Note that when Dovetail™ Hi-C alone is applied to the original input assembly, a comparable final scaffold N50 is reached (44.449Mb), but the scaffold N90 is significantly lower (19.818Mb vs. 26.550Mb) and the overall accuracy of the assembly is less than optimal.

Increase Contiguity

bowfin assembly improvement

Chicago® data builds assembly contiguity by making long-range joins between distant contigs and scaffolds in the input assembly.  With this bowfin* assembly, an input scaffold N50 of only 24Kb is boosted to over 10Mb with Chicago® plus HiRise™.

*acknowledgement to Ingo Braasch (Michigan State, project leader), Andrew W. Thompson (Michigan State), Solomon David (Nicholls State University), Allyse Ferrara (Nicholls State University), and the GenoFish Consortium for their work on this project.

Reduce Background Noise

In vivo Hi-C data contains noise from biological structures, such as topologically associating domains (TADs), highlighted by red arrows (left), while Chicago® in vitro proximity ligation data is far cleaner (right). The Chicago® data therefore enables high-resolution identification and correction of misjoins in the input assembly.

Unlike Dovetail™ Hi-C libraries, which are created in situ with natural chromatin, Chicago® libraries are built from high molecular weight DNA that is reconstituted into artificial chromatin.  Artificial chromatin does not form looping structures, which greatly improves the signal-to-noise ratio and enables high-resolution detection of contig order and orientation errors.

reduce biological noise
correct assembly order and orientation

In addition to improving contiguity, Chicago® data serves another very important function. HiRise™ uses Chicago® data to find and correct contig order and orientation errors in the input assembly. High-confidence misjoins are corrected by HiRise™, thereby improving the overall accuracy of the final assembly. In this example, a false join (circled) is clearly evident in the Chicago® plot, but not visible in the Dovetail™ Hi-C plot. In other words, Dovetail™ Hi-C data does not have the resolution to reveal contig order and orientation errors.

Chicago® Boosts Contiguity to Enable Hi-C

weeping lovegrass assembly statistics (top and lower tables)

Dovetail™ Hi-C requires an input scaffold N50 of ~1Mb to work effectively. In this weeping lovegrass* example, applying Dovetail™ Hi-C directly to the low-contiguity input assembly does not result in chromosome-scale improvement (top table, 0.38Mb to 5.187Mb).

*acknowledgement to Mario Caccamo, José Carballo, Bruno Santos, Emidio Albertini and Vivianna Echenique for their work on this project.

However, when the Chicago® assembly was then scaffolded with Dovetail™ Hi-C (lower table), a dramatic improvement in contiguity was seen (0.791Mb to 45.345Mb). Chicago® was needed to boost the assembly contiguity to a point where Dovetail™ Hi-C could then work effectively.