Episode 3: Importing Bacterial Reference Genomes into Geneious

Welcome to Basic Bioinformatics brought to you by The Sequencing Center.

In this screencast I’m going to show you one method of importing a bacterial reference genome into Geneious Prime. Once you’ve imported the reference genome, you can do things like sequence alignments, variant calls, annotations, and lots of other fun things. This is just one of many methods, and we’ll show you how we do it. We’re going to use a bacterial genome, Pseudomonas aeruginosa, and the strain is going to be UCBPP-PA14. So we’ll see if we can find that. One thing you can do is go into your web browser and search for GQuery. Usually at the very top it’ll bring up the GQuery NCBI cross-reference database, the integrated database. If we click on that, we should go to the NCBI integrated database screen, which looks something like this. This is a site that integrates many, many different databases into one gigantic database.

What we want to do here is search for Pseudomonas aeruginosa. We’re gonna take out this particular strain, get rid of the period, and do a search on that. We’ll see what it brings up. If we scroll down just a little bit here we’ll see that it’s populated these different categories. The one that we’re most interested in is under Genomes. And if we go down to Genomes, there’s a subcategory called Genome, singular, and we’ll just click on that. So it looks like it brought up Pseudomonas aeruginosa, and if you scroll down a little bit there’s a lot of metadata here such as the reference genome, the size of the genome, a dendrogram, and so on. I want to draw your attention, though, to this line where it says “all 4761 genomes for this species”. That’s a lot of genomes.

So what we want to do is click browse the list. The list is hyperlinked. We’ll go ahead and browse the list and see what that brings up. We can see there are a lot of depositions for Pseudomonas aeruginosa. A lot of different strains are listed here. For a reference genome, note in this line there are different levels of assembly. So there are complete assemblies, which are fully annotated, fully finished genomes. There are chromosomes, which are not quite fully finished. And then there are scaffolds and contigs. If you look over here, we can see the total number of deposited genomes. What we want to do is unclick contigs, and you’ll see this number decline a little bit because we’re filtering out a lot of contigs there. We’ll unclick scaffold, and you’ll see this number go down again. We’ll unclick chromosome, and this number goes down to 189. So what’s left here are just the fully finished, complete genomes. These are probably all candidates for a really nice reference genome.
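If you ever have that same list as a table instead of a web page, the filtering we just did by unclicking boxes could be sketched in Python like this. It's only an illustration: the rows, the strain names, and the keep_complete_assemblies function are all made-up placeholders, not real NCBI records or an NCBI API.

```python
from typing import Dict, List

# Placeholder rows standing in for the genome list; these are NOT real records.
EXAMPLE_ROWS: List[Dict[str, str]] = [
    {"strain": "strain_A", "level": "Complete"},
    {"strain": "strain_B", "level": "Contig"},
    {"strain": "strain_C", "level": "Scaffold"},
    {"strain": "strain_D", "level": "Chromosome"},
    {"strain": "strain_E", "level": "Complete"},
]


def keep_complete_assemblies(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only rows whose assembly level is 'Complete'.

    This mirrors unclicking Contig, Scaffold, and Chromosome in the browser.
    """
    return [row for row in rows if row["level"] == "Complete"]


complete = keep_complete_assemblies(EXAMPLE_ROWS)
print(f"{len(complete)} of {len(EXAMPLE_ROWS)} example rows are complete assemblies")
```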

What we’ll do next is sort these in descending order by the strain column. And then what we want to do is scroll through here until we find UCBPP-PA14, and there it is. So in this particular case this is the reference genome we’re trying to find. What we want to do now is look over to the right, and we’ll see that there are a couple of listings here. The NC is the RefSeq accession number and then right next to it is the GenBank CP accession number. Presumably these are two identical sequences deposited in the RefSeq database and the GenBank database. They should be identical. What we want to choose is the NC_008463.1 genome. So we just click on that, and it brings up another screen here that shows the NCBI reference sequence NC_008463.1, and this is the Pseudomonas aeruginosa UCBPP-PA14 genome.
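One quick way to tell the two kinds of accession apart in a script is the underscore: RefSeq accessions start with a two-letter prefix followed by an underscore, like NC_, while GenBank accessions don't have one. Here's a minimal sketch of that check. The function names are our own, and the pattern is a simplification rather than a full validator.

```python
import re
from typing import Optional, Tuple

# Simplified pattern: two capital letters, an underscore, then letters/digits,
# with an optional ".version" at the end (for example NC_008463.1).
REFSEQ_PATTERN = re.compile(r"^[A-Z]{2}_[A-Z0-9]+(\.\d+)?$")


def is_refseq(accession: str) -> bool:
    """Return True if the accession has the RefSeq-style underscore prefix."""
    return bool(REFSEQ_PATTERN.match(accession.strip()))


def split_version(accession: str) -> Tuple[str, Optional[int]]:
    """Split 'NC_008463.1' into ('NC_008463', 1); the version may be absent."""
    base, dot, version = accession.strip().partition(".")
    return base, int(version) if dot else None


print(is_refseq("NC_008463.1"))
print(split_version("NC_008463.1"))
```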

Now, if we look down here, this is just a summary screen, so we don’t see any sequence here. We need the sequence. This is not quite intuitive, but here’s the way we’re going to get it. If we go up to Send to and click on that, we want to leave Complete Record, the default, selected. For destination we want to choose File because we want to get a file out of this. And importantly, down here in the format drop-down box, we have several options. What we want to get is GenBank (full). Make sure we get full because that includes the nucleotide sequence, the reference sequence, as well as annotations and metadata and other things that could be useful later on. And then we create the file and we should see it downloading. The default naming convention is always sequence.gb.txt. So we’ll give that a few seconds to download.
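If you'd rather script this download than click through Send to, NCBI's E-utilities has an EFetch endpoint that can return the same kind of record. This is a minimal sketch, not what we did in the screencast. It assumes that rettype=gbwithparts is the EFetch counterpart of the GenBank (full) option, and the function names and the output filename are our own choices.

```python
import shutil
import urllib.parse
import urllib.request
from pathlib import Path

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession: str) -> str:
    """Build an EFetch URL for a full GenBank record (sequence plus annotations)."""
    params = {
        "db": "nuccore",            # nucleotide database
        "id": accession,            # e.g. the RefSeq accession NC_008463.1
        "rettype": "gbwithparts",   # assumed equivalent of "GenBank (full)"
        "retmode": "text",
    }
    return f"{EFETCH_URL}?{urllib.parse.urlencode(params)}"


def download_genbank(accession: str, destination: Path) -> Path:
    """Stream the record to disk; this needs network access to NCBI."""
    with urllib.request.urlopen(build_efetch_url(accession)) as response:
        with destination.open("wb") as handle:
            shutil.copyfileobj(response, handle)
    return destination


# Example call (commented out so nothing is downloaded on import):
# download_genbank("NC_008463.1", Path("Pseudomonas_aeruginosa_UCBPP-PA14.gbk"))
```

Writing straight to a .gbk filename here would also skip the manual rename step.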

It looks like the download is complete, so now we want to drag and drop this onto the desktop. One thing we need to do here, or at least we want to do, is rename this file. What we normally do to make this easier is rename it to Pseudomonas_aeruginosa_UCBPP-PA14.gb.txt. Then we want to get rid of the .txt suffix and change the extension to .gbk, which is a file extension for GenBank files.
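If you do this rename a lot, it can be scripted. Here's a minimal sketch using Python's pathlib. The rename_to_gbk function is our own helper, and it assumes you want the final name to be the organism label plus .gbk.

```python
from pathlib import Path


def rename_to_gbk(downloaded: Path, label: str) -> Path:
    """Rename a downloaded record (e.g. sequence.gb.txt) to '<label>.gbk'.

    The file stays in the same folder; an existing target is never overwritten.
    """
    target = downloaded.with_name(f"{label}.gbk")
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    downloaded.rename(target)
    return target


# Example call (the desktop path is an assumption about where the file was dropped):
# rename_to_gbk(Path.home() / "Desktop" / "sequence.gb.txt",
#               "Pseudomonas_aeruginosa_UCBPP-PA14")
```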

In our case, we happen to be using Amazon AWS Cloud, so we need to push this file to the cloud. So we want to drag this file over to WorkDocs, move it there, and then in a couple of minutes we’ll see this file syncing up with the AWS Geneious in the cloud. So we’ll wait for that to finish.

It looks like this file has been synced to AWS. So now we’re going to log into AWS and look at Geneious.  We’ll take a quick look here in WorkDocs and we can see that our .gbk file has been uploaded to the cloud.  What we want to do now is import the reference sequence into Geneious.  So in our case, we already have a directory set up called Sample Documents.  If we drill down a little bit we have another directory called Genomes and below that another one called Bacteria where we’ve already loaded a number of reference genomes.  The way we do this is we just go to File – Import – From File – find our .gbk file and import it.  The import takes just a couple seconds in most cases.

This is the actual Pseudomonas reference genome. We can take a quick look at it, and we can drill all the way down if we want to see the actual reference sequence itself. We can scroll out just a little bit to see some of the gene structure come in. If we keep going out, it’ll circularize eventually because it’s a bacterial genome. At this point we’ve got a reference genome imported into Geneious and we could use this for sequence alignments and variant calls and many other things. Basically, that’s how we import reference genomes into Geneious.
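The circular view comes from the record itself: a GenBank file starts with a LOCUS line that states whether the molecule is circular or linear. If you want to check a .gbk file before importing it, here's a minimal sketch that reads that first line. The function names are our own, and the parsing is deliberately simple; it is not a full GenBank parser.

```python
from pathlib import Path
from typing import Optional


def locus_topology(locus_line: str) -> Optional[str]:
    """Return 'circular' or 'linear' from a GenBank LOCUS line, or None if absent."""
    if not locus_line.startswith("LOCUS"):
        raise ValueError("expected the first line of a GenBank record (LOCUS ...)")
    # Skip the keyword and the locus name, then look for a topology token.
    for token in locus_line.split()[2:]:
        if token in ("circular", "linear"):
            return token
    return None


def file_topology(path: Path) -> Optional[str]:
    """Read only the first line of a .gbk file and report its topology."""
    with path.open() as handle:
        return locus_topology(handle.readline())


# Example call (the filename assumes the file was renamed as described earlier):
# file_topology(Path("Pseudomonas_aeruginosa_UCBPP-PA14.gbk"))
```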

Thanks for watching this episode of Basic Bioinformatics.
