Proper preprocessing of your NGS reads improves assembly accuracy and usually also significantly reduces the computation and time required to complete the assembly.
If you have paired read data, your first step should always be to Set paired reads, followed by trimming and then, if required, other preprocessing steps, as depicted in the following flow diagram.
Importing/Pairing your NGS data
An NGS sequence service provider will normally provide Illumina paired read data as separate forward and reverse read lists in fastq format. Usually standard Illumina adapters will have been removed. In most cases fastq lists will be compressed by gzip (.gz). Geneious can import compressed or uncompressed fastq files.
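For readers who want to inspect a fastq file outside Geneious, the following is a minimal Python sketch of reading plain or gzip-compressed fastq data. It assumes the common four-line record layout (header, sequence, "+", quality); the function names parse_fastq and read_fastq are hypothetical and are not part of Geneious.

```python
import gzip
from typing import Iterable, Iterator, Tuple

Read = Tuple[str, str, str]  # (name, sequence, quality string)


def parse_fastq(lines: Iterable[str]) -> Iterator[Read]:
    """Yield reads from four-line FASTQ records (assumed layout)."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between or after records
        seq = next(it, "").rstrip("\r\n")
        next(it, "")  # the '+' separator line is ignored
        qual = next(it, "").rstrip("\r\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], seq, qual


def read_fastq(path: str) -> Iterator[Read]:
    """Open a .fastq or .fastq.gz file and yield its reads."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        yield from parse_fastq(handle)
```

Splitting the parser from the file opener means the same parsing code handles compressed and uncompressed input.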
You can import forward and reverse read files together via menu File → From Multiple files and Geneious will offer to pair the files and create a single paired read list. Similarly, if you drag and drop pairs of read lists into the Geneious window then you will be given the option to pair the reads during the import process.
Geneious will determine the likely read technology, so you only need to set the expected insert size (the expected average insert size excluding adapters) and hit OK.
If you have already imported your reads as separate lists, you can pair them after importing by selecting the lists and going to menu Sequence → Set paired reads.
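Before pairing, it can be useful to confirm that the forward and reverse lists really correspond. The sketch below is an illustration only and does not describe how Geneious pairs reads. It assumes both lists are in the same order and that mates share a name once a trailing /1 or /2, or anything after the first space, is removed. The names base_name and pair_reads are hypothetical.

```python
import re
from itertools import zip_longest
from typing import Iterable, Iterator, Tuple

Read = Tuple[str, str, str]  # (name, sequence, quality string)


def base_name(name: str) -> str:
    """Strip the comment field and a trailing /1 or /2 from a read name."""
    parts = name.split()
    first = parts[0] if parts else ""
    return re.sub(r"/[12]$", "", first)


def pair_reads(forward: Iterable[Read],
               reverse: Iterable[Read]) -> Iterator[Tuple[Read, Read]]:
    """Yield (forward, reverse) mates, checking order and names as we go."""
    for fwd, rev in zip_longest(forward, reverse):
        if fwd is None or rev is None:
            raise ValueError("forward and reverse lists differ in length")
        if base_name(fwd[0]) != base_name(rev[0]):
            raise ValueError(f"name mismatch: {fwd[0]!r} vs {rev[0]!r}")
        yield fwd, rev
```

A mismatch raised here usually means one of the files was filtered or sorted independently of its partner.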
NGS Trimming
It is important to trim read ends prior to assembly. Incorrect, low-quality base calls at sequence ends can prevent proper assembly and increase the computation and time required to perform assembly.
Geneious Prime has the BBDuk trimmer, a fast and accurate tool specifically for trimming and filtering NGS reads.
BBDuk is available as a plugin and can be installed via menu Tools → Plugins. Once installed BBDuk can be accessed via menu Annotate & Predict → Trim using BBDuk.
BBDuk has options to:
- Identify and Trim adapters using presets for Illumina adapters
- Trim ends based on quality (Q)
- Trim adapters based on paired read overhangs
- Discard short reads (and associated pair mate)
We recommend trimming Illumina data with a minimum quality (Q) of 13, preferably 30. Suggested trimming options are shown below.
If trimming Oxford Nanopore reads based on quality, the threshold may need to be set lower to reflect the higher error rate of this technology. We recommend using a minimum quality score of 7 (Q7) for these reads.
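To put these thresholds in context: a Phred quality score Q corresponds to an estimated error probability of 10^(-Q/10), so Q13 is roughly a 5% chance of a wrong call, Q30 is 0.1%, and Q7 is roughly 20%. The sketch below shows that conversion and a deliberately simple end-trimming rule. It is not BBDuk's trimming algorithm, and it assumes Phred+33 quality encoding. The function names are hypothetical.

```python
from typing import Tuple


def error_probability(q: float) -> float:
    """Convert a Phred quality score to an estimated error probability."""
    return 10 ** (-q / 10)


def trim_ends(seq: str, qual: str, min_q: int,
              offset: int = 33) -> Tuple[str, str]:
    """Remove bases from both ends until a base with quality >= min_q.

    Simplified illustration: each end is trimmed base by base, and
    low-quality bases in the middle of the read are left untouched.
    """
    scores = [ord(c) - offset for c in qual]  # assumes Phred+33 encoding
    start, end = 0, len(scores)
    while start < end and scores[start] < min_q:
        start += 1
    while end > start and scores[end - 1] < min_q:
        end -= 1
    return seq[start:end], qual[start:end]
```

For example, trim_ends("ACGTAC", "#IIII#", 13) drops the first and last base, because "#" encodes Q2 and "I" encodes Q40 under Phred+33.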
For advanced users, BBDuk has many more "hidden" options you can access. For example, users can use the following "command line" options to filter reads with %G+C content between 25% and 75%:
mingc=0.25 maxgc=0.75
Click on the More/Fewer Options button, then click on the (?) button next to Custom BBDuk Options: to learn about the additional command line options you can use.
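The effect of the mingc and maxgc example above can be illustrated with a short Python sketch. This is only an approximation of the idea: how BBDuk treats ambiguous bases such as N when computing G+C content is not covered here, and the function names are hypothetical.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    upper = seq.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


def passes_gc_filter(seq: str, min_gc: float = 0.25,
                     max_gc: float = 0.75) -> bool:
    """Keep reads whose G+C fraction lies within [min_gc, max_gc]."""
    return min_gc <= gc_fraction(seq) <= max_gc
```

With the default bounds, a read such as "ATGC" (50% G+C) is kept, while "AAAAAAAT" (0% G+C) is discarded.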
Error correct and Normalize reads (Accessed via menu Sequence → Error correct and normalize reads)
The Error correct and Normalize reads tool utilizes BBNorm. For most use cases, error correction can be turned off, and normalization run by itself. This tool is designed to normalize coverage by down-sampling reads in high-depth areas of a genome, resulting in a more even coverage distribution. Importantly, normalization will not remove reads in lower coverage areas.
Normalization can substantially reduce data set sizes, which for de novo assembly can significantly reduce assembly time and RAM requirements. See the de novo assembly tutorial for more information on the use of Normalization.
From the BBNorm Guide: BBNorm is mainly intended for use in assembly, and with short reads. Normalization is often useful if you have too much data (for example, 600x average coverage when you only want 100x) or uneven coverage (amplified single-cell, RNA-seq, viruses, metagenomes, etc). It is not useful if you have smooth coverage and approximately the right amount of data, or too little data. BBNorm cannot inflate low coverage (bring 15x coverage up to 100x), only reduce it. Never normalize read data prior to a quantitative analysis (like ChIP-seq, RNA-seq for expression profiling, etc); if you assemble normalized data, and want to use mapping to determine coverage, map the non-normalized reads. Also, do not normalize data prior to mapping for variant discovery; it will cause bias. If you need to reduce data volume in any of these scenarios, use subsampling rather than normalization. Do not attempt to normalize high-error-rate data from platforms such as PacBio or Nanopore; it is designed for relatively-low-error-rate, short, fixed-length reads such as Illumina and Ion Torrent.
Also, error-correction is not advisable when you are looking for rare variants. It should generally be fine with relatively high-depth coverage of heterozygous mutations in a diploid (where you expect a 50/50 allele split), but with low-depth coverage (like 5x), or very lopsided distributions (like a 1/100 allele split), it may correct the minority allele into the majority allele, so should be used with caution.
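The idea of down-sampling only in high-depth regions can be sketched with a simple k-mer based scheme in the style of digital normalization. This is not BBNorm's implementation: it ignores reverse complements, sequencing errors and paired reads, and the function names and the small default k are assumptions chosen to keep the example short.

```python
from collections import Counter
from statistics import median
from typing import Iterable, List


def kmers(seq: str, k: int) -> List[str]:
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize(reads: Iterable[str], target_depth: int,
              k: int = 5) -> List[str]:
    """Keep a read only while its estimated depth is below target_depth.

    Depth is estimated as the median count of the read's k-mers among
    the reads kept so far, so low-coverage reads are always retained.
    """
    counts: Counter = Counter()
    kept: List[str] = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k, no depth estimate possible
        depth = median(counts[km] for km in read_kmers)
        if depth < target_depth:
            kept.append(read)
            counts.update(read_kmers)
    return kept
```

Given ten copies of one read and a single copy of an unrelated read, a target depth of 3 keeps three copies of the first and the lone copy of the second, which mirrors the statement that normalization does not remove reads in lower coverage areas.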
Merge paired Reads (Accessed via menu Sequence → Merge paired reads)
This tool utilizes BBMerge and is designed to merge two overlapping paired reads into a single read. This tool is useful for generating a consensus from overlapping reads generated by amplicon sequencing.
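The principle of merging an overlapping pair can be sketched as follows. This is a simplified illustration, not BBMerge's algorithm: it requires an exact match across the overlap and ignores quality scores, whereas a real merger tolerates mismatches and uses qualities to build the consensus. The function names and the default minimum overlap are assumptions.

```python
from typing import Optional

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(forward: str, reverse: str,
               min_overlap: int = 10) -> Optional[str]:
    """Merge a read pair if the forward read's end matches the start of
    the reverse-complemented reverse read; return None otherwise."""
    rc = reverse_complement(reverse)
    longest = min(len(forward), len(rc))
    # Try the longest possible overlap first, never less than one base.
    for overlap in range(longest, max(min_overlap, 1) - 1, -1):
        if forward[-overlap:] == rc[:overlap]:
            return forward + rc[overlap:]
    return None
```

For a 24 bp fragment sequenced as two 16 bp reads, the 8 bp shared in the middle is found as the overlap and the original fragment is reconstructed, provided min_overlap is set to 8 or less.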
Remove duplicate Reads (Accessed via menu Sequence → Remove duplicate reads)
This tool utilizes Dedupe and is designed to find and remove all contained and overlapping sequences in a read dataset.
The Dedupe operation must be run on read lists prior to assembly. It cannot be used to remove duplicate reads in an assembly file.
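As an illustration of what removing duplicate and contained sequences means, here is a minimal Python sketch. It is not how Dedupe is implemented: it compares every read against every kept read, which is far too slow for real NGS data, and it does not handle overlapping (non-contained) sequences. Treating reverse complements as duplicates is an assumption of this sketch, controlled by a hypothetical flag.

```python
from typing import Iterable, List

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def remove_duplicates(reads: Iterable[str],
                      check_reverse_complement: bool = True) -> List[str]:
    """Drop exact duplicates and reads fully contained in a longer read."""
    kept: List[str] = []
    # Longest first, so a containing read is always seen before the reads
    # it contains; ties are broken alphabetically for a stable result.
    for read in sorted(set(reads), key=lambda r: (-len(r), r)):
        candidates = [read]
        if check_reverse_complement:
            candidates.append(reverse_complement(read))
        if any(c in longer for c in candidates for longer in kept):
            continue
        kept.append(read)
    return kept
```

Because the input is a list of reads rather than an assembly, this matches the requirement above that duplicate removal happens before assembly.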
Remove Chimeras (Accessed via menu Sequence → Remove chimeric reads)
This tool will filter chimeric reads from sequencing data by comparing to a reference database. You can choose between the bundled public domain UCHIME algorithm or download and use the faster USEARCH 8. Note that the free version of USEARCH 8 is limited to using 4 GB of RAM and so cannot handle larger NGS datasets.
Barcode splitting (Accessed via menu Sequence → Separate by barcodes)
This tool will demultiplex custom barcoded data into separate lists. The tool has 454 MID barcode presets, or you can define and use your own custom barcode sets.
Note: demultiplexing should always be performed before trimming with BBDuk.
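The sketch below illustrates the basic idea of barcode splitting, and why it needs to happen before trimming: the barcode is looked for at the very start of each read, so trimming the read ends first could remove it. This is a simplified illustration, not the Geneious implementation. It assumes exact-match barcodes at the 5' end, and the function name, sample names and barcode sequences are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

NamedRead = Tuple[str, str]  # (name, sequence)


def demultiplex(reads: Iterable[NamedRead],
                barcodes: Dict[str, str]) -> Dict[str, List[NamedRead]]:
    """Split reads into per-sample lists by their leading barcode.

    The matched barcode is removed from the read; reads matching no
    barcode are collected under the key "unassigned".
    """
    bins: Dict[str, List[NamedRead]] = defaultdict(list)
    for name, seq in reads:
        for sample, barcode in barcodes.items():
            if seq.startswith(barcode):
                bins[sample].append((name, seq[len(barcode):]))
                break
        else:
            bins["unassigned"].append((name, seq))
    return dict(bins)
```

Real barcode sets usually also need some tolerance for sequencing errors in the barcode, which this exact-match sketch does not attempt.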