Episode 3: Importing Bacterial Reference Genomes into Geneious

This video will show you one method of importing a bacterial reference genome into Geneious Prime. Once you have imported the reference genome, you can do things like sequence alignments, variant calls, annotations, and lots of other fun things. This is just one of many methods, and we will show you how we do it. We’re going to use the bacterial genome of Pseudomonas aeruginosa, and the strain is going to be UCBPP-PA14. One thing you can do is go into your web browser and search for GQuery. Usually at the very top it will bring up the GQuery NCBI cross-reference database, the integrated database. If we click on that we should go to the NCBI integrated database screen, which looks something like this. This is a site that integrates many different databases into one gigantic database.

Now we can search for Pseudomonas aeruginosa. If we scroll down just a little bit here we’ll see that it has populated these different categories. The one that we are most interested in is under Genomes. If we go down to Genomes there is a subcategory called Genome, and we will click on that. It looks like it brought up Pseudomonas aeruginosa, and if you scroll down a little bit there is a lot of metadata such as the reference genome, the size of the genome, a dendrogram, and so on. I want to point your attention to this line where it says “all 4761 genomes for this species”. That’s a lot of genomes.

What we want to do is click the Browse the list hyperlink. We will go ahead and browse the list and see what that brings up. We can see there are a lot of depositions for Pseudomonas aeruginosa. A lot of different strains are listed here. For a reference genome, note in this line there are different levels of assembly. There are complete assemblies, which are fully annotated or fully finished genomes; there are chromosomes, which are not quite fully finished; and then there are scaffolds and contigs. If you look over here, we can see the total number of deposited genomes. What we want to do is unclick Contigs, and you will see this number decline a little bit because we are filtering out a lot of contigs there. We will unclick Scaffold, and you’ll see this number go down again. We’ll unclick Chromosome, and this number goes down to 189. So what is left here are just the fully finished, complete genomes. These are probably all candidates for a really nice reference genome.

What we’ll do next is sort these in descending order by the Strain column. Then we scroll through until we find UCBPP-PA14. In this particular case, this is the reference genome we’re trying to find. What we want to do now is look over to the right, and we’ll see that there are a couple of listings here. The NC number is the RefSeq accession number, and right next to it is the GenBank CP accession number. Presumably these are two identical sequences, one deposited in the RefSeq database and one in the GenBank database. What we want to choose is the NC_008463.1 genome. We just click on that and it brings up another screen here that shows the NCBI Reference Sequence NC_008463.1. This is Pseudomonas aeruginosa UCBPP-PA14.
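For anyone who prefers to see the filtering logic written down, the steps above (drop contigs, scaffolds, and chromosomes, sort by strain, find UCBPP-PA14) can be sketched in a few lines of Python. This is only an illustration of the logic, not something the NCBI site exposes this way: the rows, field names, and placeholder accessions are invented, and only the NC_008463.1 accession comes from this walkthrough.

```python
# Hypothetical rows standing in for the NCBI genome list. Only the PA14
# RefSeq accession comes from the video; every other value is a placeholder.
records = [
    {"strain": "UCBPP-PA14", "level": "Complete", "refseq": "NC_008463.1"},
    {"strain": "example-A", "level": "Contig", "refseq": "PLACEHOLDER_1"},
    {"strain": "example-B", "level": "Scaffold", "refseq": "PLACEHOLDER_2"},
    {"strain": "example-C", "level": "Complete", "refseq": "PLACEHOLDER_3"},
    {"strain": "example-D", "level": "Chromosome", "refseq": "PLACEHOLDER_4"},
]


def complete_genomes_by_strain(rows):
    """Keep only complete assemblies, sorted descending by strain name."""
    complete = [row for row in rows if row["level"] == "Complete"]
    return sorted(complete, key=lambda row: row["strain"], reverse=True)


def find_strain(rows, strain):
    """Return the first row whose strain matches exactly, or None."""
    return next((row for row in rows if row["strain"] == strain), None)


candidates = complete_genomes_by_strain(records)
hit = find_strain(candidates, "UCBPP-PA14")
print(hit["refseq"])
```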

This is just a summary screen, so we don’t see any sequence here. We need the sequence. This is not quite intuitive, but here’s the way we’re going to get it. We go up to and click on Send to. We want to leave Complete Record, the default, chosen. For Destination we want to choose File because we want to get a file out of this. Importantly, down here in the Format drop-down box, we have several options. What we want to get is GenBank (full). Make sure we get the full version because that includes the nucleotide sequence, the reference sequence, as well as annotations and metadata and other things that could be useful later on. Then we click Create File and we should see it downloading. The default file name is always sequence.gb.txt.
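If you would rather script this download than click through Send to, NCBI's E-utilities efetch endpoint can return the same kind of record. The sketch below is a minimal example and not the method used in this video. It assumes the rettype value gbwithparts is the E-utilities counterpart of the GenBank (full) option, and the function names are made up for illustration.

```python
import shutil
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession):
    """Build an efetch URL for a nucleotide record in GenBank flat-file format."""
    params = {
        "db": "nuccore",           # the NCBI Nucleotide database
        "id": accession,           # e.g. the RefSeq accession chosen above
        "rettype": "gbwithparts",  # assumed equivalent of GenBank (full)
        "retmode": "text",
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


def download_genbank(accession, out_path):
    """Stream the record to out_path (needs network access)."""
    with urlopen(build_efetch_url(accession)) as response:
        with open(out_path, "wb") as out_file:
            shutil.copyfileobj(response, out_file)


print(build_efetch_url("NC_008463.1"))
# To actually fetch the file, uncomment the next line:
# download_genbank("NC_008463.1", "Pseudomonas_aeruginosa_UCBPP-PA14.gbk")
```

The download call is left commented out so that the sketch can be run without touching the network.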

It looks like the download is complete, so now we want to drag and drop this onto the desktop. One thing we typically do is rename this file. What we normally do to make this easier is rename it to Pseudomonas_aeruginosa_UCBPP-PA14.gb.txt, and then what we want to do is get rid of the file suffix (.txt) and call this .gbk, which is a file format suffix for a GenBank file.
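The rename can also be done in a script, which is handy if you pull down several genomes. This is a small sketch under one assumption: that the end result should be a single file named after the organism and strain with a .gbk suffix. The function name rename_download is made up for illustration.

```python
from pathlib import Path


def rename_download(download, new_stem):
    """Rename a downloaded GenBank file to <new_stem>.gbk in the same folder."""
    source = Path(download)
    target = source.with_name(f"{new_stem}.gbk")
    if target.exists():
        # Refuse to overwrite a genome that was already renamed earlier.
        raise FileExistsError(f"{target} already exists")
    source.rename(target)
    return target


# Example, assuming the download sits in the current folder:
# rename_download("sequence.gb.txt", "Pseudomonas_aeruginosa_UCBPP-PA14")
```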

In our case, we happen to be using Amazon AWS Cloud so we need to push this file to the cloud. We drag this file over to WorkDocs, and then in a couple of minutes we will see this file syncing up with the AWS Geneious in the cloud.

It looks like this file has been synced to AWS. Now we are going to log into AWS and look at Geneious. We’ll take a quick look here in WorkDocs and we can see that our .gbk file has been uploaded to the cloud. Now we can import the reference sequence into Geneious. In our case, we already have a directory set up called Sample Documents. If we drill down a little bit we have another directory called Genomes and below that another one called Bacteria, where we’ve already loaded a number of reference genomes. The way we do this is we just go to File > Import > From File > find our .gbk file and import it.

This is the actual Pseudomonas reference genome. We can take a quick look at it and drill all the way down to see the actual reference sequence itself. We can zoom out just a little bit to see some of the gene structure come in. If we keep going out, it’ll circularize eventually because it’s a bacterial genome. At this point we have a reference genome imported into Geneious, and we could use this for sequence alignments and variant calls and many other things.
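The circular layout comes from the file itself: a GenBank flat file records the sequence length and topology on its LOCUS line, and the nucleotide sequence follows the ORIGIN line. As a rough sanity check on a .gbk file, something like the sketch below can read those back. It is a deliberately simple parser written for illustration, not a full GenBank reader, and the toy record is invented rather than taken from the real PA14 entry.

```python
def summarize_genbank(text):
    """Return a few basic facts from GenBank flat-file text."""
    lines = text.splitlines()
    locus = next((line for line in lines if line.startswith("LOCUS")), None)
    if locus is None:
        raise ValueError("no LOCUS line found; is this a GenBank file?")
    fields = locus.split()
    # The sequence length is the token just before the "bp" unit.
    length_bp = int(fields[fields.index("bp") - 1])
    return {
        "name": fields[1],
        "length_bp": length_bp,
        "circular": "circular" in fields,
        "has_sequence": any(line.startswith("ORIGIN") for line in lines),
    }


# Toy record, invented for illustration; not the real PA14 entry.
toy = """LOCUS       TOY_0001                  12 bp    DNA     circular BCT 01-JAN-2000
DEFINITION  Toy record.
ORIGIN
        1 acgtacgtac gt
//
"""
print(summarize_genbank(toy))
```

To check a real file, read it with something like `Path("your_file.gbk").read_text()` and pass the text in. If has_sequence comes back False, the record may have been saved in a format that leaves out the sequence.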
