Episode 4: Microbial DNA Variants Pipeline in Geneious

Welcome to Basic Bioinformatics brought to you by The Sequencing Center.

Here I have already launched Genius Prime. This is the main dashboard screen that we’re looking at.  I’ve also already loaded up paired-end reads both the forward reads and the reverse reads and it loads these into a single file in Genious.  Then if we just click on this file we’ll see the actual paired reads.  Down here we see the forward reads, the reverse read, and so forth all paired throughout the file.

Merging Paired-End Reads

The first thing we want to do in this in this workflow is merge the paired-end reads.  So what we’re gonna do there is go to Sequence in the menu bar, come down to Merge Paired Reads, click on that and we’re just going to use all the default values here.  In this case, the merge rate is Normal, we’ll leave it at Normal.  We don’t need a lot of memory for this typically, so we’ll leave this at 8MB for this particular run.  This is a tool called BBMerge from the Joint Genome Institute at DOE.   And that’s pretty much all we have to do, and then we just click OK.  It will launch the Merge Paired Reads function and this normally takes a couple of minutes so we’ll come back when this is done.

Okay, so BBMerge is done and as you can see, it’s generated two additional files here.  One file shows the reads that could be merged, and the other one shows the reads that were not merged.  In general, we’re probably not going to see a lot of merged files because the forward reading the reverse read generally should not overlap.  If they do overlap then BBMerge is going to try to merge those into a single read.

So let’s look at the merge file first, and it’s pretty interesting.  First of all, looks like we have about 406,000 reads approximately that were actually emerged.  In a 2 x 150 base pair of run each of the reads is about 150 base pairs.  We can see here that when they merged the forward and reverse read into a single read, they can actually run out almost 300 base pairs in some cases.  So in sort of in a sense we’re creating something like synthetic long reads here, but these are still longer reads than a typical 150 base pair reads.  And if you look at the other file, which are the ones that could not be merged, these should be the paired-end reads that were basically untouched, which that’s what it looks like.

Trimming Low-Quality Reads

So the next step is to choose both of these files, and then we’re gonna use the filter and trim low-quality reads function called BBDuk.  So we’ll go up to Annotate & Predict and choose Trim Using BBDuk, and that brings up the BBDuk screen. I’m gonna make sure that we restore or reset everything to the default values.  There’s several options here and again we’re gonna tweak a couple of these or at least one of them anyway.  We could choose Trim Adapters if we wanted to. We’ve already supposedly removed all the adapters from these files using Illumina BaseSpace.  Nevertheless, we can go ahead and do it again.  It probably wouldn’t hurt anything.  It does show that, the adapters I’m looking at here, one of them is Nextera. We didn’t make this sequencing run with the NexteraXT library kit, so if it finds any additional adapters and eliminates those that be fine.

The next option is to Trim Low Quality reads, and we want to bump this value up to about 20.  That’s the PhredQ score or quality score.  A Q score of 20 means that the base call accuracy is 99.0% correct, in other words, of 1% error rate.  A standard is Q 20 and anything below that we want to trim out, and that should help with the alignment and variant calls downstream.

The next one is trimming adapters is based on paired read overhangs.  We can just leave that as the default of 24.  Again, there should be no adapters in here so this may may not have any effect really.  We can discard really short reads of only, in this case, the default is minimum length of 10 base pairs.  We can leave it a 10.  We could make it more stringent if you want, bump it up to 20 or something, 25 and cut out more of the short reads, but we’ll just leave it at 10 as the default.  Then go ahead and run BBDuk.  This takes a couple of minutes and we’ll come back when it’s done.

Okay, so now we’ve filtered and trimmed the low quality reads out of the FASTQ file and it generated two more files here that you can see.  They both have the words “trimmed” at the end.  We can click on those, we won’t see very much here, except for the fact that there’s probably a fewer number of reads there.  Some of them have been been filtered out.  So we want to go ahead and click both of these files.

Mapping to a Reference Genome

The next step is to actually map the sample reads against the reference genome.  And so the way we do that is we go up here to Align/Assemble in the menu bar, come down to Map to Reference, and choose that.  I’m going to go ahead and reset these to defaults.

There’s several sections here.  The 1st one is the actual reference genome or reference sequence that we’re gonna map against.  In this case, we’ve already loaded up Pseudomonas aeruginosa.  It looks like it’s a PA 14 strain of Pseudomonas.  In another video I’ll show how we actually find these reference sequences and load them up and so forth, that’s a bit of a task.  We’ll just assume it’s already there.  We’ll leave everything else default.

For the mapping tool will go ahead and use Geneious, which is a built-in mapper in the Geneious Prime genius tool kit. We’ll just leave that as a default.  The sensitivity is chosen, or rather the recommended sensitivity is chosen by Geneious.  In this case, it’s Medium-Low, which is fairly fast mapping.  There are other levels of sensitivity, but we’ll just leave it at the default for this run.  You do you want to check this option to find structural variants and small indels.  That’s not a default value, but I want to put that in there so we can find some structural variants in case they exist.

On the trimming part, I believe this is more targeted towards Sanger Sequencing.  Also, we have already filtered in from the reads so there’s reason to try doing it again.  So we just leave that as do not trim.  Then we can save the unused or unmapped reads.  That could be quite useful in bacterial sequencing because some of the unmapped reads may very well be things like bacteriophage or phages that are inserted into the bacterial genome.  I’ll have a whole other video of what to do with the unmapped reads. We can also save the contigs that come out of this and save the consensus sequence, which is the default again. So we’ll go ahead and run this and this takes probably 5 to 10 minutes roughly, and we’ll just come back when the when the mapping is done.

The sequence alignment function is finished and it’s generated a couple of additional files already.  We can see there’s two files here that include unused or unmapped reads.  As I mentioned earlier, with bacteria especially, we can use those downstream for some additional analysis, so we just leave those aside for the time being.  Geneious also generates a file with the circularized chromosomes, single chromosome, for bacteria.  We will not be using this either right now.  The file that we’re interested in is the last one here.  This one shows the actual sequence alignment, and there’s a wealth of data here that will put in a different video.  Basically it shows the coverage across the entire bacterial genome.  In this case, it’s very uniform, fairly high coverage, probably 50-100x all the way across. It gives information about the revs of cells. It gives some annotation information and a lot of other data that we can look at down the road.

Variant Calls

So what we want to do with this is click this file and choose the variant calls.  So we go up to Annotate & Predict and then run down to Find Variations/SNPSs, click that.  Make sure we have all the default set here.  For this screen and this run, we’re gonna just leave it all set to default values, minimum coverage, variant frequency, P-values, whatever these are all going to be left in the default values.  Then we will just run this, click start.  It takes about a minute or two to find all the SNPs and indels that are in this data set.  We’ll come back when this is done.

The variants function call is done, and we can see the final result of our pipeline here, which is pretty interesting.  We have this pretty much the same display as before with the sequence alignments, except that now there’s a polymorphism track that shows up.  It shows basically the SNPs and small indels all along the reference genome and so they can see that graphically.  Also down here in the bottom, we have a variants table that we can actually export.  It shows different SNPs in small indels, their location, and their potential effect on protein configurations, enzyme configurations, that sort of thing.  We can save away the variance table and view that later.

That is basically an overview of microbial DNA bioinformatics pipeline.  We have many, many pipelines and this is just one.  What we’ll do in future videos is go back and look at some of the individual steps in more detail to show you some of the rationale behind the actual pipeline itself.  Thanks for watching this episode of Basic Bioinformatics.

0 replies

Leave a Reply

Want to join the discussion?
Feel free to contribute!

Leave a Reply

Your email address will not be published. Required fields are marked *