How to find a reference genome - The Sequencing Center

The goal of many sequencing projects is to identify polymorphisms and mutations in sequenced samples. These often include SNPs, indels, chromosomal rearrangements, and various kinds of spontaneous or induced changes in nucleotide sequence. To identify polymorphisms, we run sequence alignment and variant calling algorithms on sequenced samples with respect to reference genomes. For most mutation analyses, a reference genome must be a fully assembled, finished, and annotated sequence. In general, we cannot use contigs, scaffolds, or incomplete chromosomes as reference genomes.

There are many sources available for acquiring reference genomes. A partial, and by no means exhaustive, list includes:

Note that many of these databases cross-reference one another, with the same depositions found in multiple sources.

NCBI Genome Database

NCBI maintains one of the largest and most comprehensive databases of fully assembled genomes. We often acquire reference genomes from NCBI. You may search this site yourself for an appropriate genome sequence and either send it to us directly or refer us to it.

For example, suppose we’re interested in a particular strain of the venerable enteric bacterium Escherichia coli, namely the type strain E. coli NCTC86. Here is one possible method for acquiring this genome:

1. Go to GQUERY
2. In the Search bar, enter “Escherichia coli NCTC86” and then click Search
3. On the results page, under the “Genomes” section, click on “Genome”
4. On the results page, near the top of the page, find “Browse the list” and click on “list”
5. On the results page, near the top of the page, make sure “Complete” is checked, and then uncheck “Chromosome”, “Scaffold” and “Contig”. The remaining list will include only fully finished, complete, and annotated genomes.
6. At this point you may do one of two things:
   1. In the Search bar at the top of the page, enter “NCTC86”. This step may return more than one assembly. Choose the appropriate assembly.
   2. In the Strain column at the top of the page, click on “Strain” to sort the table by Strain. Scroll through the table until you reach “NCTC86”.
7. Following either method in Step 6, go to the “Replicons” column and click on the genome name. Typically the genome names begin with “NZ_”, “NC_” or “CP”.
8. On the results page, near the top of the page, click “Send to:”, check “Complete Record”, under “Choose Destination” check “File”, under “Format” choose “GenBank (full)”, then click “Create File”.

At this point, the E. coli NCTC86 reference genome should download through your web browser to your workstation. The file name will be “sequence.gb.txt”. You may want to rename the file to something like “Escherichia_coli_NCTC86.gb.txt” to clearly identify it. Also, for some applications, you may have to change the file suffix to “*.gbk”, e.g. “Escherichia_coli_NCTC86.gbk”.
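
If you already know the accession of the replicon you want, the download and rename steps can also be scripted. The following is a minimal sketch, not the procedure described above: it assumes the NCBI E-utilities `efetch` endpoint with `db=nuccore` and `rettype=gbwithparts` (the full GenBank record including sequence). The function names `efetch_url`, `download_genbank`, and `gbk_name` are our own illustrative choices, and the accession in the usage comment is a placeholder, not the real NCTC86 accession.

```python
import shutil
from pathlib import Path
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities endpoint for fetching records by accession.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession: str) -> str:
    """Build the URL for a full GenBank flat file (annotation plus sequence)."""
    params = {
        "db": "nuccore",
        "id": accession,
        "rettype": "gbwithparts",
        "retmode": "text",
    }
    return f"{EFETCH}?{urlencode(params)}"


def gbk_name(organism: str, strain: str) -> str:
    """Build a descriptive file name such as Escherichia_coli_NCTC86.gbk."""
    return f"{organism.replace(' ', '_')}_{strain}.gbk"


def download_genbank(accession: str, dest: Path) -> Path:
    """Stream the record to disk; requires network access."""
    with urlopen(efetch_url(accession)) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


# Usage (hypothetical accession; substitute the one shown in the Replicons column):
#   download_genbank("NZ_XXXXXXXX", Path(gbk_name("Escherichia coli", "NCTC86")))
```

Writing the file directly under a descriptive name avoids the generic “sequence.gb.txt” that the browser download produces.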

Note that the naming conventions for assemblies, genomes, chromosomes, strains, substrains, etc. may be inconsistent. Carefully check that the reference genome you choose from NCBI (or other sources) is in fact the correct one.
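
One way to double-check a downloaded file is to read the identifying fields back out of it. The sketch below assumes the standard GenBank flat-file layout (`DEFINITION` and `ACCESSION` lines in the header, a `/strain=` qualifier in the source feature). The record in `TOY_RECORD` is invented for illustration and is not real NCBI data, and `genbank_summary` is a hypothetical helper name.

```python
# Invented toy record, shaped like a GenBank header, for illustration only.
TOY_RECORD = """\
LOCUS       TOY000001                120 bp    DNA     circular BCT 01-JAN-2000
DEFINITION  Escherichia coli strain NCTC86 chromosome, complete genome.
ACCESSION   TOY000001
FEATURES             Location/Qualifiers
     source          1..120
                     /organism="Escherichia coli"
                     /strain="NCTC86"
ORIGIN
//
"""


def genbank_summary(text: str) -> dict:
    """Collect the first DEFINITION line, ACCESSION, and /strain of a record."""
    summary = {"definition": None, "accession": None, "strain": None}
    for line in text.splitlines():
        stripped = line.strip()
        if line.startswith("DEFINITION") and summary["definition"] is None:
            # Long definitions wrap onto further lines; only the first is kept.
            summary["definition"] = line[len("DEFINITION"):].strip()
        elif line.startswith("ACCESSION") and summary["accession"] is None:
            fields = line.split()
            summary["accession"] = fields[1] if len(fields) > 1 else None
        elif stripped.startswith("/strain=") and summary["strain"] is None:
            summary["strain"] = stripped.split("=", 1)[1].strip('"')
        elif line.startswith("ORIGIN"):
            break  # the sequence starts here; the header fields are above it
    return summary


info = genbank_summary(TOY_RECORD)
print(info["accession"], info["strain"])
```

If the strain reported here does not match the strain you searched for, the wrong assembly was probably selected.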

Also, note in the “Replicons” column that many bacterial species include one or more plasmid sequences. You may need or want to include plasmids in your research. If this is the case then simply download the plasmid sequences along with the reference genome.
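
If you download the chromosome and its plasmids as separate GenBank files, some tools expect them in a single multi-record file. The sketch below assumes only that each GenBank record begins with a `LOCUS` line and ends with a `//` terminator. The helper names and the two toy records are hypothetical, and whether your downstream application accepts a combined file is something to confirm for that application.

```python
def combine_records(records: list[str]) -> str:
    """Join GenBank records into one multi-record text, one after another."""
    parts = []
    for record in records:
        body = record.strip()
        if not body.endswith("//"):
            raise ValueError("GenBank record is missing its '//' terminator")
        parts.append(body)
    return "\n".join(parts) + "\n"


def list_replicons(text: str) -> list[tuple[str, int]]:
    """Return (locus name, length in bp) for every LOCUS line in the text."""
    replicons = []
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            fields = line.split()
            replicons.append((fields[1], int(fields[2])))
    return replicons


# Invented toy records standing in for a chromosome and a plasmid.
chromosome = "LOCUS       TOYCHR     5000 bp    DNA     circular BCT 01-JAN-2000\nORIGIN\n//\n"
plasmid = "LOCUS       TOYPLASMID  300 bp    DNA     circular BCT 01-JAN-2000\nORIGIN\n//\n"

combined = combine_records([chromosome, plasmid])
print(list_replicons(combined))
```

Listing the replicons after combining is a quick check that every sequence you intended to include is present.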
