The goal of many sequencing projects is to identify polymorphisms and mutations in sequenced samples. These often include SNPs, indels, chromosomal rearrangements, and various kinds of spontaneous or induced changes in nucleotide sequence. To identify polymorphisms, we run sequence alignment and variant calling algorithms on sequenced samples against a reference genome. For most mutation analyses, the reference genome must be a fully assembled, finished, and annotated sequence. In general, we cannot use contigs, scaffolds, or incomplete chromosomes as reference genomes.
There are many sources for acquiring reference genomes. A partial list includes:
- Microbial Genome Database
- Saccharomyces Genome Database
Note that many of these databases cross-reference one another, so the same deposition may be found in multiple sources.
NCBI Genome Database
NCBI maintains one of the largest and most comprehensive databases of fully assembled genomes. We often acquire reference genomes from NCBI. You may search this site yourself for an appropriate genome sequence and either send it to us directly or refer us to it.
For example, suppose we’re interested in a particular strain of the venerable enteric bacterium Escherichia coli, namely the type strain E. coli NCTC86. Here is one possible method for acquiring this genome:
- Go to GQUERY
- In the Search bar, enter “Escherichia coli NCTC86” and then click Search
- In the results page, under the “Genomes” section, click on “Genome”
- In the results page, near the top of the page, find “Browse the list” and click on “list”
- In the results page, near the top of the page, make sure “Complete” is checked, and then uncheck “Chromosome”, “Scaffold” and “Contig”. The remaining list will include only fully-finished, complete and annotated genomes.
- At this point you may do one of two things: A) In the Search bar at the top of the page, enter “NCTC86”. This step may return more than one assembly. Choose the appropriate assembly. OR B) In the Strain column at the top of the page, click on “Strain” to sort the table by Strain. Scroll through the table until you reach “NCTC86”.
- Following either method in the previous step, go to the “Replicons” column and click on the genome name. Typically the genome names begin with “NZ_”, “NC_” or “CP”.
- In the results page, near the top of the page, click “Send to:”, check “Complete Record”, under “Choose Destination” check “File”, under “Format” choose “GenBank (full)”, then click “Create File”.
- At this point the E. coli NCTC86 reference genome should download through your web browser to your workstation. The file name will be “sequence.gb.txt”. You may want to rename the file to something like “Escherichia_coli_NCTC86.gb.txt” to clearly identify it. Also, for some applications you may have to change the file suffix to “.gbk”, e.g. “Escherichia_coli_NCTC86.gbk”.
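The renaming in the last step can also be scripted. The following is a minimal sketch, not part of the NCBI workflow itself; the function name `rename_reference` and its parameters are hypothetical, and it assumes the downloaded file is still in a folder you can write to.

```python
from pathlib import Path


def rename_reference(downloaded, organism, strain, suffix=".gbk"):
    """Rename a downloaded GenBank file to <organism>_<strain><suffix>.

    Hypothetical helper for illustration; returns the new path.
    """
    src = Path(downloaded)
    # Replace spaces so the name is easy to use on the command line.
    stem = f"{organism}_{strain}".replace(" ", "_")
    dest = src.with_name(stem + suffix)
    # Refuse to overwrite an existing reference file.
    if dest.exists():
        raise FileExistsError(f"{dest} already exists")
    src.rename(dest)
    return dest
```

For example, `rename_reference("sequence.gb.txt", "Escherichia coli", "NCTC86")` would rename the file to “Escherichia_coli_NCTC86.gbk” in the same folder.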
Note that the naming conventions for assemblies, genomes, chromosomes, strains, substrains, etc. may be inconsistent. You’ll want to check carefully that the reference genome you choose from NCBI (or other sources) is in fact the correct one.
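One way to double-check a downloaded file is to read the identifying fields straight out of the GenBank record. The sketch below is a simple line scan, not a full GenBank parser: the function name `summarize_genbank` is hypothetical, it only looks at the first record, and it assumes each field of interest fits on a single line (wrapped values are truncated or missed). A dedicated parser such as the one in Biopython would likely be more robust.

```python
import re


def summarize_genbank(text):
    """Return accession, definition, organism and strain of the first record.

    Hypothetical helper for illustration; fields not found stay None.
    """
    summary = {"accession": None, "definition": None,
               "organism": None, "strain": None}
    for line in text.splitlines():
        if line.startswith("//"):
            break  # "//" terminates a GenBank record
        if line.startswith("ACCESSION") and summary["accession"] is None:
            parts = line.split()
            summary["accession"] = parts[1] if len(parts) > 1 else None
        elif line.startswith("DEFINITION") and summary["definition"] is None:
            # Only the first line of a possibly wrapped definition is kept.
            summary["definition"] = line[len("DEFINITION"):].strip()
        else:
            # Feature qualifiers look like: /strain="NCTC86"
            m = re.match(r'\s+/(organism|strain)="([^"]*)"', line)
            if m and summary[m.group(1)] is None:
                summary[m.group(1)] = m.group(2)
    return summary
```

Comparing the returned strain and definition against the strain you intended to download is a quick sanity check before the file is used as a reference.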
Also note that, in the “Replicons” column, many bacterial genomes list one or more plasmid sequences in addition to the chromosome. If your research requires the plasmids, download the plasmid sequences along with the reference genome.
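If the chromosome and plasmids are downloaded as separate GenBank files, they can be concatenated into one multi-record file. The sketch below is one possible way to do this; the function name `combine_genbank` is hypothetical, and whether a given downstream tool accepts a multi-record GenBank file is an assumption you should check for your application.

```python
from pathlib import Path


def combine_genbank(paths, output):
    """Concatenate GenBank files into one file and return the record count.

    Hypothetical helper for illustration.
    """
    records = 0
    with open(output, "w") as out:
        for path in paths:
            text = Path(path).read_text()
            # Make sure each file ends with a newline before appending the next.
            if not text.endswith("\n"):
                text += "\n"
            # Each GenBank record ends with a line containing only "//".
            records += sum(1 for line in text.splitlines()
                           if line.strip() == "//")
            out.write(text)
    return records
```

The returned count should equal the number of replicons you expected (for example, one chromosome plus two plasmids gives three); a mismatch suggests a truncated or missing download.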