Save Unmapped Reads
One of our most common activities is aligning reads from sequenced bacterial samples against known reference genomes. (Reads are short DNA fragments, typically 150 bp in length.) The alignment process generates two types of output: mapped reads and unmapped reads. A typical bacterial sequencing run generates several million reads per sample, and in most cases alignment yields >95% mapped reads and <5% unmapped reads. Mapped reads are those reads from the sequenced sample that align directly to a region (set of loci) on the reference genome. Unmapped reads are those that align nowhere on the reference genome. Sequence alignment tools typically write the entire set of unmapped reads to a separate bin or file for easy downstream analysis.
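To make the mapped/unmapped split concrete, here is a minimal Python sketch that tallies both categories from alignment records in SAM text format. It assumes the aligner writes standard SAM, where bit 0x4 of the FLAG field marks a read as unmapped. The function name `count_mapped_unmapped` and the three example records are hypothetical and purely illustrative.

```python
from typing import Iterable, Tuple

# Bits of the SAM FLAG field (second column of each alignment record).
UNMAPPED = 0x4         # the read did not align to the reference
SECONDARY = 0x100      # secondary alignment of a read already reported
SUPPLEMENTARY = 0x800  # supplementary (split) alignment of a read


def count_mapped_unmapped(sam_lines: Iterable[str]) -> Tuple[int, int]:
    """Return (mapped, unmapped) read counts from SAM-formatted lines."""
    mapped = unmapped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # count each read once, via its primary record
        if flag & UNMAPPED:
            unmapped += 1
        else:
            mapped += 1
    return mapped, unmapped


# Hypothetical records: two mapped reads and one unmapped read.
example_sam = [
    "@SQ\tSN:ref\tLN:1000",
    "read1\t0\tref\t10\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTTGGGG\tIIIIIIII",
    "read3\t16\tref\t50\t60\t8M\t*\t0\t0\tGGGGCCCC\tIIIIIIII",
]

mapped, unmapped = count_mapped_unmapped(example_sam)
total = mapped + unmapped
print(f"mapped: {mapped}/{total}, unmapped: {unmapped}/{total}")
```

On a real run the same tally over the full alignment file would show whether the sample falls in the expected >95% mapped range.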
Unmapped reads are often ignored or discarded without further analysis. However, there may be gold, or junk(!), in the set of unmapped reads. If you want to spend some time analyzing them more fully, here are some reasons we save unmapped reads after sequence alignment is complete.
- Species divergence. Unmapped reads may simply indicate that the sequenced species has diverged evolutionarily from the reference species. The sequenced genome may contain sequences that are not present in the reference genome, and reads from those sequences have nowhere to map. This divergence can also occur at the subspecies and strain level, which is one reason it’s important to choose a reference genome from a subspecies or strain that is as closely related to the sequenced sample as possible.
- Bacteriophages (prophages). Unmapped reads may indicate the presence of inserted prophage sequences in the sequenced sample. If the sample includes prophages that are not present in the reference genome, the sequence alignment algorithm will not be able to map those prophage reads to the reference, and will instead dump them in the unmapped reads bin. A useful strategy is to BLAST the unmapped reads against one or several phage databases (e.g. PHASTER) to see whether the phage species can be identified. Prophages are common in bacteria, so there’s a good chance this BLAST search will identify something in the sequenced sample. You can also BLAST unmapped reads against the NCBI nucleotide database to see whether a prophage sequence can be identified. For either BLAST search you may have to run de novo assembly on the unmapped reads first, to generate longer contigs or, occasionally, fully assembled phages, which may improve the BLAST search results.
- Plasmids. The unmapped reads may represent exogenous plasmid sequences or integrated plasmid sequences. The next-generation sequencing library prep step may capture both the sample bacterial genome itself and exogenous plasmid sequences, if any are present in the sample. If you align the sample reads against the reference genome only, reads from any exogenous plasmids will show up in the unmapped reads bin. If your reference genome was derived from the NCBI database, for example, it will often include both the reference bacterial genome itself and one or several associated plasmid sequences, if they exist. You can align the unmapped reads against those known plasmid sequences to see if there is a match. For example, E. coli Eco889 includes the bacterial reference genome plus two named plasmids. We would recommend running sequence alignments against the reference genome and both plasmid sequences. This may eliminate many unmapped reads, if they belong to the plasmids. Another option is to search plasmid databases for sequences matching the unmapped reads.
- Incomplete reference genome. For bacterial sequence alignments we generally recommend using fully assembled, finished and, ideally, well-annotated reference genomes. Occasionally, you may have reason to run sequence alignments against partially sequenced reference genomes (e.g. contigs or scaffolds). This may generate a set of unmapped reads. The sequenced sample will include reads from the entire sample bacterial genome, but some of those reads will map to nothing in the reference, because the incomplete reference is missing the corresponding sequences. Depending on your project goals, this may be acceptable, in which case the unmapped reads can be ignored.
- Misassembled reference genome. In some fairly rare cases the reference genome itself may be misassembled, in which case a subset of sample reads may not align properly to the misassembled regions. Current de novo assembly algorithms, reference-guided assembly algorithms and other methods (short-read/long-read hybrid assemblies, etc.) are generally quite good for bacterial-sized genomes. So it’s unlikely that deposited reference genomes are misassembled, but we should nevertheless keep this possibility in mind.
- Sample contamination. We occasionally receive DNA samples with appreciable levels of contaminants. If necessary, we’ll use DNA clean-up kits (e.g. Zymo Research, others) to recover high-quality, contaminant-free DNA before progressing to library prep and sequencing. However, unwanted contaminant sequences may still be present after sequencing is complete, usually at very low levels. These contaminant sequences may show up in the set of unmapped reads.
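Several of the follow-up strategies above (BLAST searches, de novo assembly, realignment against plasmid sequences) start from the unmapped reads as a standalone sequence file. In practice this is usually done with an existing tool (for example, `samtools view -f 4` selects records with the unmapped flag set). The sketch below shows the same idea in plain Python, assuming SAM text input. The function name `unmapped_to_fastq`, the `/1` and `/2` mate suffixes, the placeholder quality string, and the example records are all illustrative assumptions rather than part of any particular pipeline.

```python
from typing import Iterable, Iterator

# Bits of the SAM FLAG field used here.
UNMAPPED = 0x4     # the read did not align to the reference
FIRST_MATE = 0x40  # first read of a pair
LAST_MATE = 0x80   # second read of a pair


def unmapped_to_fastq(sam_lines: Iterable[str]) -> Iterator[str]:
    """Yield one FASTQ record (as a string) per unmapped read."""
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        seq, qual = fields[9], fields[10]
        if not flag & UNMAPPED or seq == "*":
            continue  # keep only unmapped reads that carry a sequence
        if qual == "*":
            # Assumption: substitute the lowest Phred+33 quality when
            # the record stores no quality string.
            qual = "!" * len(seq)
        # Assumption: distinguish mates of a pair with /1 and /2 suffixes.
        if flag & FIRST_MATE:
            name += "/1"
        elif flag & LAST_MATE:
            name += "/2"
        yield f"@{name}\n{seq}\n+\n{qual}\n"


# Hypothetical records: one mapped read, one unmapped single-end read,
# and one unmapped first mate of a pair (FLAG 77 = 1 + 4 + 8 + 64).
example_records = [
    "@SQ\tSN:ref\tLN:1000",
    "read1\t0\tref\t10\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTTGGGG\tIIIIIIII",
    "pair1\t77\t*\t0\t0\t*\t*\t0\t0\tAAAACCCC\tIIIIIIII",
]

fastq_records = list(unmapped_to_fastq(example_records))
print("".join(fastq_records), end="")
```

The resulting FASTQ text can be written to a file and passed to an assembler or converted to FASTA for a BLAST search, depending on which of the explanations above you want to test first.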