In this video we will go step-by-step through a microbial DNA variants pipeline using Genius Prime. This is the main dashboard screen and I have already loaded up paired-end reads, both the forward reads and the reverse reads, into a single file in Genious. When we click on this file we will see the actual paired reads. Down here we see the forward reads, the reverse read, and so forth all paired throughout the file.
Merging Paired-End Reads
The first thing we want to do in this in this workflow is merge the paired-end reads. We will go to Sequence in the menu bar, come down to Merge Paired Reads, click on that and use all the default values here. In this case, the merge rate is Normal, we’ll leave it at Normal. We don’t need a lot of memory for this typically, so we’ll leave this at 8MB for this particular run. This is a tool called BBMerge from the Joint Genome Institute at DOE. Now we click OK. It will launch the Merge Paired Reads function and this normally takes a couple of minutes.
When BBMerge is done you can see it has generated two additional files here. One file shows the reads that could be merged, and the other one shows the reads that were not merged. In general we are probably not going to see a lot of merged files because the forward reading the reverse read generally should not overlap. If they do overlap then BBMerge is going to try to merge those into a single read.
Let’s look at the merged file. First of all, it looks like we have about 406,000 reads approximately that were actually emerged. In a 2 x 150 base pair of run each of the reads is about 150 base pairs. We can see here that when they merged the forward and reverse read into a single read they can actually run out almost 300 base pairs long in some cases. So in a way we are creating something like synthetic long reads, but these are still longer reads than a typical 150 base pair reads. If you look at the other file, which are the ones that could not be merged, these should be the paired-end reads that were basically untouched.
Trimming Low-Quality Reads
The next step is to choose both of these files and then use the filter and trim low-quality reads function called BBDuk. We will go up to Annotate & Predict and choose Trim Using BBDuk, and that brings up the BBDuk screen. I will make sure that we restore or reset everything to the default values. There are several options and again we are going to tweak a couple of these or at least one. We could choose Trim Adapters if we wanted to. We have already supposedly removed all the adapters from these files using Illumina BaseSpace. Nevertheless, we can go ahead and do it again.
The next option is to Trim Low Quality Reads, and we want to bump this value up to about 20. That’s the PhredQ score or quality score. A Q-Score of 20 means that the base call accuracy is 99.0% correct, in other words, of 1% error rate. A standard is Q 20 and anything below that we want to trim out, which should help with the alignment and variant calls downstream.
The next option is trimming adapters based on paired read overhangs. We can just leave that as the default of 24. Again, there should be no adapters in here so this may may not have any effect. We can discard really short reads if, in this case, the default is minimum length of 10 base pairs. We can leave it a 10. Then go ahead and run BBDuk.
Now that we have filtered and trimmed the low quality reads out of the FASTQ file it generated two more files. They both have the words “trimmed” at the end. We can click on those, we won’t see very much here, except for the fact that there’s probably a fewer number of reads there. Some of them have been been filtered out and we want to go ahead and click both of these files.
Mapping to a Reference Genome
The next step is to actually map the sample reads against the reference genome. The way we do that is we go up here to Align/Assemble in the menu bar and choose Map to Reference. There are several sections here. The first one is the actual reference genome or reference sequence that we are going to map against. In this case, we have already loaded up Pseudomonas aeruginosa. It looks like it’s a PA 14 strain of Pseudomonas. We will leave everything else default.
For the mapping tool we will go ahead and use Geneious, which is a built-in mapper in the Geneious Prime genius tool kit. We’ll just leave that as a default. The sensitivity is chosen, or rather the recommended sensitivity is chosen by Geneious. In this case, it’s Medium-Low, which is fairly fast mapping. You do you want to check this option to find structural variants and small indels. That’s not a default value, but I want to put that in there so we can find some structural variants in case they exist.
The trimming part is more targeted towards Sanger Sequencing. Also, we have already filtered in from the reads so there’s reason to try doing it again. So we just leave that as do not trim. Then we can save the unused or unmapped reads. That could be quite useful in bacterial sequencing because some of the unmapped reads may very well be things like bacteriophage or phages that are inserted into the bacterial genome. We can also save the contigs that come out of this and save the consensus sequence, which is the default again. We will go ahead and run this.
The sequence alignment function is finished and it has generated a couple of additional files already. We can see there are two files here that include unused or unmapped reads. As I mentioned earlier, with bacteria especially, we can use those downstream for some additional analysis, so we just leave those aside for the time being. Geneious also generates a file with the circularized chromosomes, single chromosome, for bacteria. We will not be using this either right now. The file that we are interested in is the last one here, which shows the actual sequence alignment. Basically it shows the coverage across the entire bacterial genome. In this case, it’s very uniform, fairly high coverage, probably 50-100x all the way across. It gives information about the revs of cells. It gives some annotation information and a lot of other data that we can look at down the road.
What we want to do with this is click this file and choose the variant calls. So we go up to Annotate & Predict and then run down to Find Variations/SNPSs, click that. Make sure we have all the default set here. For this screen and this run, we will leave it all set to default values, minimum coverage, variant frequency, P-values, whatever these are all going to be left in the default values. Then we will just run this, click start. It takes about a minute or two to find all the SNPs and indels that are in this data set.
The variants function call is done, and we can see the final result of our pipeline here, which is pretty interesting. We have this display as before with the sequence alignments, except that now there’s a polymorphism track showing up. It shows the SNPs and small indels all along the reference genome and so they can see that graphically. In the bottom we have a variants table that we can export. It shows different SNPs in small indels, their location, and their potential effect on protein configurations, enzyme configurations, that sort of thing. We can save the variance table and view that later.
That is basically an overview of microbial DNA bioinformatics pipeline. We have many, many pipelines and this is just one. What we’ll do in future videos is go back and look at some of the individual steps in more detail to show you some of the rationale behind the actual pipeline itself.