Frequently Asked Questions
If you ever have questions about Dovetail Genomics, our technology, or how we can help simplify genomic discovery, we’re always happy to assist. Simply select any of the expandable topics below for answers to the most commonly asked questions. If you can’t find the answer you are seeking, don’t worry—just contact us and someone will get back to you shortly.
If you have questions regarding Dovetail Hi-C please contact us via email at firstname.lastname@example.org or by phone at 831.713.4465.
Chicago: Dovetail’s proprietary, in vitro, long-range sequencing library preparation method.
Chicago library: A sequencing library produced by the Chicago method.
Library complexity: The estimated number of unique molecules within a sequencing library.
HiRise: Dovetail’s internally developed genome scaffolding software pipeline.
Input assembly: The assembly to be scaffolded by the HiRise pipeline using Chicago data.
Contig: A contiguous, assembled fragment within a genome assembly.
Scaffold: A series of connected contigs with gaps of unidentified sequence (“N’s”) in between.
N50: A length-weighted statistical measure of scaffold or contig size in a genome assembly. Specifically, 50% of the total assembled sequence is in scaffolds/contigs of length N50 or larger.
N90: A length-weighted statistical measure of scaffold or contig size in a genome assembly. Specifically, 90% of the total assembled sequence is in scaffolds/contigs of length N90 or larger.
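The N50 and N90 definitions above can be made concrete with a short calculation. The following is a minimal sketch, not Dovetail's reporting code; the function name `nx` and the example lengths are illustrative assumptions.

```python
def nx(lengths, percent):
    """Return the Nx statistic for a list of scaffold or contig lengths.

    Nx is the length L such that sequences of length >= L together contain
    at least `percent` % of all assembled bases (percent=50 gives N50).
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")

    ordered = sorted(lengths, reverse=True)  # longest sequences first
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer arithmetic avoids floating-point surprises at the threshold.
        if running * 100 >= percent * total:
            return length
    return ordered[-1]  # not reached for valid input


# Hypothetical scaffold lengths (in kbp) totalling 300.
scaffold_lengths = [80, 70, 50, 40, 30, 20, 10]
print("N50:", nx(scaffold_lengths, 50))
print("N90:", nx(scaffold_lengths, 90))
```

Because the sequences are accumulated from longest to shortest, N90 can never exceed N50 for the same assembly.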
There are two primary project types: full de novo assembly and assembly improvements. In both cases, the Chicago library generation and HiRise scaffolding are performed by Dovetail.
Full de novo projects are those for which Dovetail also performs the initial (draft) assembly. The data for the initial assembly (typically shotgun sequence data at ~80X coverage) may be generated by Dovetail or provided by the customer, according to the customer’s desires.
Assembly improvement projects begin with an existing, customer-provided assembly, which is then scaffolded by Dovetail using the HiRise pipeline and Chicago sequence data.
Dovetail strives to complete assembly improvement projects within 6-10 weeks and de novo assembly projects in 12 weeks. If the customer decides not to use Dovetail to perform sequencing, additional sequencing queue time will likely be incurred.
There are a number of standard QC checkpoints in Dovetail’s process, and the customer can expect to receive communication from Dovetail after each one. They are:
- DNA QC: Dovetail will ensure that there is enough DNA and that it is sufficiently pure, high molecular weight, and concentrated for the Chicago library preparation. This check is performed whether the DNA is extracted by Dovetail or is provided directly by the customer.
- Chicago library QC: Upon completion of the Chicago library preparation, Dovetail sequences a small number of read pairs (~1-2 million pairs) from the library and aligns them to the draft assembly. Dovetail uses these alignments to examine the complexity of the library, the read pair separation distribution, and the signal-to-noise ratio. The library will pass QC if it is predicted to be able to produce ~100X physical coverage of the genome with reasonable levels of sequencing. If the library passes both this and the prior QC, then it is ready for deep sequencing.
- Draft Assembly QC: Dovetail will ensure that the de novo (draft) assembly meets minimum requirements for scaffolding with HiRise. In cases where it doesn’t, Dovetail will offer proposals to the customer to help them reach those thresholds.
- Scaffolded Assembly QC (HiRise): After running through the HiRise pipeline the resulting assemblies are checked against historical results to ensure that the scaffolding is performing as expected. When high-quality, related genome assemblies are available, Dovetail will also perform a synteny analysis as an additional coarse QC check. If an assembly passes this checkpoint it is then delivered to the customer.
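The Chicago library QC step above examines library complexity from a shallow sequencing run. How Dovetail estimates complexity is not described here; the sketch below assumes the widely used Lander-Waterman relation, unique / X = 1 - exp(-total / X), and solves it for the library size X by bisection. The function name and input numbers are hypothetical.

```python
import math


def estimate_library_complexity(total_pairs, unique_pairs):
    """Estimate the number of unique molecules X in a library.

    Assumes the Lander-Waterman relation (an assumption of this sketch):
        unique_pairs / X = 1 - exp(-total_pairs / X)
    and solves for X numerically by bisection.
    """
    if not 0 < unique_pairs < total_pairs:
        raise ValueError("need 0 < unique_pairs < total_pairs (some duplicates)")

    def expected_unique(x):
        # Expected unique pairs when sampling total_pairs from x molecules.
        return x * (1.0 - math.exp(-total_pairs / x))

    lo = float(unique_pairs)  # X cannot be smaller than what was observed
    hi = lo
    while expected_unique(hi) < unique_pairs:  # grow until the root is bracketed
        hi *= 2.0

    for _ in range(200):  # bisection; expected_unique increases with x
        mid = (lo + hi) / 2.0
        if expected_unique(mid) < unique_pairs:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical QC run: 1,000,000 read pairs, 632,121 of them unique.
print(round(estimate_library_complexity(1_000_000, 632_121)))
```

A library with few duplicates in the shallow run yields a large estimate, which suggests that deeper sequencing will keep producing new read pairs rather than repeats.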
For de novo (draft) assemblies Dovetail primarily uses the Meraculous assembler (http://jgi.doe.gov/data-and-tools/meraculous). For scaffolding with Chicago data, Dovetail uses its in-house scaffolding pipeline called HiRise.
All customer data reception and transmission is conducted over SFTP to Dovetail’s servers.
Dovetail’s protocol currently requires a minimum of 500 ng of pure, high molecular weight DNA at a concentration of 100 ng/µl. To ensure that extracted DNA meets these requirements, Dovetail can also extract it for you. These options are outlined below. For more details, please contact Dovetail with a request for sample guidelines.
- Customer provides DNA: 2 µg, free of contaminants (e.g., RNA, enzymes), concentration of 100 ng/µl, and a minimum mean fragment size of 50 kbp (100+ kbp preferred).
- Customer provides animal tissue: 500 mg of cell-dense tissue (e.g., liver, brain) freshly collected and flash frozen.
- Customer provides plant tissue: 5 g of youngest possible tissue (e.g., new leaves, seedlings) freshly collected and flash frozen.
The inputs to the HiRise pipeline are:
- A draft (de novo) assembly.
- Chicago sequence data.
- Shotgun data. Though not strictly required, best results are achieved when shotgun data is included.
Additional data types cannot currently be directly utilized by the HiRise pipeline. However, all sources of data may be incorporated into the draft assembly, whether it is produced by Dovetail or the customer. To the extent that such data improve the contiguity of the draft assembly, their incorporation will benefit the entire assembly process. Additionally, certain data types (e.g., optical maps) may be utilized after HiRise scaffolding to further improve contiguity.
Furthermore, data with low utility for assembly (e.g., low-coverage BAC-end sequences, transcriptome) may be used to perform validation of the final assembly.
No. Draft assemblies produced by any means (e.g., Discovar De Novo, SGA, Meraculous) and with any type of data (e.g., shotgun, PacBio) can be used for input to HiRise. Please note that errors introduced in the draft assembly may carry through to the final assembly, so the initial assembly should be as accurate as possible.
There are currently no strict requirements for an input assembly. However, input assemblies with a scaffold N50 of less than 20kb and/or less than 75% of the genome represented in the assembly may not perform as well as assemblies with higher contiguity. It is strongly recommended that interested customers contact Dovetail to discuss the state of their input assembly and its feasibility as a starting point for HiRise.
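As a rough illustration of the guidance above, the two figures mentioned (a scaffold N50 of 20 kb and 75% of the genome represented) can be turned into a simple pre-check. This is only a sketch: the function name and flag labels are invented for the example, and the thresholds are guidelines from the text rather than hard requirements.

```python
def input_assembly_flags(scaffold_n50_bp, assembled_bp, estimated_genome_size_bp,
                         min_n50_bp=20_000, min_fraction=0.75):
    """Return a list of advisory flags for a candidate input assembly.

    Defaults mirror the guideline values in the text; an empty list means
    neither guideline was tripped, not that the assembly is guaranteed to work.
    """
    if estimated_genome_size_bp <= 0:
        raise ValueError("estimated_genome_size_bp must be positive")

    flags = []
    if scaffold_n50_bp < min_n50_bp:
        flags.append("low_n50")
    if assembled_bp / estimated_genome_size_bp < min_fraction:
        flags.append("low_genome_fraction")
    return flags


# Hypothetical assembly: 15 kb scaffold N50, 700 Mbp assembled of a 1 Gbp genome.
print(input_assembly_flags(15_000, 700_000_000, 1_000_000_000))
```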
Generally, an input assembly with a smaller N50 and fewer errors is preferred over one with a larger N50 and more errors.
Chicago libraries are compatible with any sequencing platform. For convenience, Dovetail’s standard preparation is made ready for sequencing on Illumina instruments, and this is the most common platform for sequencing.
Paired-end (or full length) sequencing and a minimum read size of 100 bp are required for Chicago data intended for HiRise scaffolding. Longer read lengths (e.g., 2×250) can offer more powerful variant phasing information, but only marginally impact the quality and contiguity of the ultimate scaffolded assembly.
Chicago sequencing depth requirements vary considerably from genome to genome and project to project. Factors impacting the amount of sequencing required include:
- Genome: size, repeat content, GC-content, and heterozygosity.
- Chicago library: complexity, signal-to-noise ratio, and read pair separation (which is a function of input DNA size).
- Input genome assembly: contiguity and accuracy.
Dovetail accounts for these complexities by aiming for ~100X physical coverage from a Chicago library after correcting for library complexity. Typically only 150 to 300 million read pairs (~1-2 HiSeq 2500 lanes) are required.
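To see how a physical coverage target translates into a read pair count, the relationship can be sketched as pairs = target coverage x genome size / mean distance spanned per pair. The code below is a back-of-the-envelope sketch only; the mean span and the `usable_fraction` parameter (the share of pairs that are unique and informative) are hypothetical inputs, not Dovetail figures.

```python
import math


def read_pairs_needed(genome_size_bp, mean_span_bp,
                      target_physical_coverage=100, usable_fraction=1.0):
    """Estimate the read pairs needed to reach a physical coverage target.

    usable_fraction is a hypothetical correction for duplicate or
    uninformative pairs (1.0 means every pair contributes its full span).
    """
    if genome_size_bp <= 0 or mean_span_bp <= 0:
        raise ValueError("genome size and mean span must be positive")
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")

    effective_span = mean_span_bp * usable_fraction  # bp spanned per sequenced pair
    return math.ceil(target_physical_coverage * genome_size_bp / effective_span)


# Hypothetical inputs: 1 Gbp genome, 1 kbp mean span, half of the pairs usable.
print(read_pairs_needed(1_000_000_000, 1_000, 100, 0.5))
```

Larger genomes, shorter read pair separations, and lower library complexity all push the required number of read pairs up, which is consistent with the factors listed above.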
Dovetail is still running experiments on this question, but currently sees continued improvement in contiguity at coverage levels up to 100X, and potentially beyond.
The primary deliverables are the new, scaffolded assembly in FASTA format and a report with key statistical metrics on the final assembly. Dovetail also provides lists of putative misjoins that were broken in the input assembly and a “map” showing the relationship between input contigs and scaffolds and the final HiRise scaffolds.
Genome assemblies are always complex and each one is unique, with many factors affecting intermediate and final outcomes.
The quality of the final HiRise assembly is significantly impacted by the quality of the input assembly. More contiguous and accurate input assemblies will yield more contiguous and accurate HiRise assemblies. The quality of the input assembly itself depends upon characteristics of the genome (repeat content, heterozygosity, size, etc.), quality and quantity of the input data, and the method used for assembly (e.g., Meraculous, Discovar De Novo, SGA).
The ultimate HiRise assembly is affected by the same genomic characteristics that impact the input assembly. It is also affected by the quality and quantity of the Chicago data used. Features of the Chicago data that affect the quality of the result are the complexity, the read pair separation distribution, and the signal-to-noise ratio. Dovetail has refined the library preparation assay such that many of these factors are stable and near-optimal. Because most factors are already optimized, the quality and length of the input DNA have the single largest controllable impact on the quality of the final Chicago data.
In the small minority of cases where gold standard references are available, Dovetail uses reference comparison to estimate error classes and rates. Dovetail produces Chicago/HiRise assemblies for genomes with such gold standards primarily to validate the Chicago assay and HiRise pipeline themselves, and such comparisons have shown Dovetail’s assemblies to be highly accurate.
When references are not available, Dovetail produces and examines synteny comparisons between the subject genome and a close evolutionary relative, preferably one with a more complete genome assembly. Additionally, and particularly if no close relatives are available, Dovetail can also validate assemblies using hold-out data (e.g., BAC-ends or transcriptome). Such data are preferably long-range (tens to hundreds of kbp) and are not used in the assembly itself, only for validation. In this case, Dovetail aligns the hold-out data to the final assembly and uses deviations from the expected insert size and read orientation to produce error estimates.
All of these validation approaches have demonstrated that Dovetail’s high-contiguity assemblies are also highly accurate.
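The hold-out check described above, which looks for deviations from the expected insert size and read orientation, can be sketched as a simple classifier over aligned read pairs. This is an illustrative sketch rather than Dovetail's validation code: the `PairAlignment` record, the category names, the inward-facing ("+" then "-") expected orientation, and the tolerance are all assumptions.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class PairAlignment:
    """Alignment of both ends of one hold-out pair (hypothetical record)."""
    scaffold_a: str
    pos_a: int      # leftmost aligned position of end A
    strand_a: str   # "+" or "-"
    scaffold_b: str
    pos_b: int
    strand_b: str


def classify_pair(pair, expected_insert_bp, tolerance_bp,
                  expected_orientation=("+", "-")):
    """Label one pair as consistent with the assembly or name the deviation."""
    if pair.scaffold_a != pair.scaffold_b:
        return "different_scaffolds"
    # Order the two ends by position so orientation is read left to right.
    left, right = sorted([(pair.pos_a, pair.strand_a),
                          (pair.pos_b, pair.strand_b)])
    if (left[1], right[1]) != expected_orientation:
        return "wrong_orientation"
    if abs((right[0] - left[0]) - expected_insert_bp) > tolerance_bp:
        return "wrong_distance"
    return "consistent"


def summarize(pairs, expected_insert_bp, tolerance_bp):
    """Count pairs in each category; deviation counts feed an error estimate."""
    return Counter(classify_pair(p, expected_insert_bp, tolerance_bp)
                   for p in pairs)


# Hypothetical BAC-end style pairs with a 100 kbp expected insert.
pairs = [
    PairAlignment("s1", 1_000, "+", "s1", 101_000, "-"),
    PairAlignment("s1", 1_000, "+", "s2", 5_000, "-"),
    PairAlignment("s1", 1_000, "-", "s1", 101_000, "+"),
    PairAlignment("s1", 1_000, "+", "s1", 301_000, "-"),
]
print(summarize(pairs, expected_insert_bp=100_000, tolerance_bp=20_000))
```

In practice some deviating pairs reflect errors in the hold-out data or its alignment rather than in the assembly, so the counts are best read as an upper-level indicator, not an exact error rate.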
Chicago begins with in vitro chromatin reconstitution from high-molecular weight DNA, rather than fixation of chromosomes in live cells. In addition to removing the burdensome requirement for living cells, this offers a number of benefits over Hi-C including: lower input requirements, faster library generation, more consistent results, and an improved signal to noise ratio. Furthermore, biologically meaningful but assembly-confounding contact information is absent in Chicago libraries, and Chicago’s shorter read pair separations allow for finer ordering and orienting of contigs and scaffolds, particularly small ones. Consequently, less sequence is required for Chicago libraries to yield better results.
Chicago libraries can span much longer distances (100+ kbp) than PacBio reads currently can (tens of kbp). Additionally, Chicago libraries can be sequenced with any next generation sequencing platform, including the large install base of Illumina instrumentation.
Sequence coverage describes the number of times, on average, that a given nucleotide in the genome is directly observed, i.e., the number of sequence reads (such as shotgun reads) that cover that position. Most discussions of “coverage” in scientific literature use this definition, unless otherwise specified.
Physical coverage measures the number of times, on average, that a pair of reads *spans* a given nucleotide in the genome. This measurement has historically been most common in describing coverage from conventional mate-pair and BAC-end libraries. Physical coverage is the more useful measurement for genome assembly and for characterizing structural variation, because the genomic distances spanned are what matter in these applications.
For example, if a given read pair contains two reads of 100 bp each separated in the genome by a distance of 1,000 bp, then this pair contributes 200 bp of sequence coverage (2×100), but 1,000 bp of physical coverage since that is the genomic distance spanned.
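The arithmetic in this example extends directly from one read pair to a whole library. The sketch below uses the same numbers (100 bp reads, 1,000 bp spanned) and assumes, for simplicity, that every pair has the same read length and span; the function names and the library and genome sizes are illustrative.

```python
def pair_contribution(read_length_bp, span_bp):
    """Return (sequence_bp, physical_bp) contributed by a single read pair."""
    return 2 * read_length_bp, span_bp


def fold_coverage(num_pairs, read_length_bp, span_bp, genome_size_bp):
    """Return (sequence_coverage, physical_coverage) as fold-coverage values.

    Simplifying assumption: all pairs share one read length and one span.
    """
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    sequence_bp, physical_bp = pair_contribution(read_length_bp, span_bp)
    return (num_pairs * sequence_bp / genome_size_bp,
            num_pairs * physical_bp / genome_size_bp)


# One pair from the example in the text: 200 bp sequence, 1,000 bp physical.
print(pair_contribution(100, 1_000))

# Hypothetical library: 1,000,000 such pairs on a 10 Mbp genome.
print(fold_coverage(1_000_000, 100, 1_000, 10_000_000))
```

With these inputs the same library gives five times more physical coverage than sequence coverage, which is why long-range pairs are efficient for scaffolding.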