How to read FastQ files
FastQ File Format
Illumina sequencing instruments generate FastQ files when a sequencing run is finished. FastQ files are the starting point for all downstream bioinformatics data analysis.
The file name suffix for a FastQ file is: .fastq
For example, a typical FastQ file name could be sample.fastq
FastQ files are often found in gzip-compressed format with the file name: sample.fastq.gz
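Because the same data may arrive either as sample.fastq or as sample.fastq.gz, it can help to open both forms through one helper. The sketch below is one possible approach, not part of any Illumina tool: the names looks_gzipped and open_fastq are hypothetical, and it assumes that a compressed file starts with the standard two-byte gzip signature.

```python
import gzip
from typing import TextIO

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def looks_gzipped(first_bytes: bytes) -> bool:
    """Return True if the given leading bytes carry the gzip signature."""
    return first_bytes[:2] == GZIP_MAGIC


def open_fastq(path: str) -> TextIO:
    """Open a plain or gzip-compressed FastQ file as text (hypothetical helper)."""
    # Peek at the first two bytes instead of trusting the file name suffix.
    with open(path, "rb") as raw:
        compressed = looks_gzipped(raw.read(2))
    if compressed:
        return gzip.open(path, "rt", encoding="ascii")
    return open(path, "rt", encoding="ascii")
```

Checking the file content rather than the suffix means a compressed file that was renamed to .fastq is still read correctly.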
The Illumina FastQ file format is shown below.
Each record in a FastQ file consists of four lines:
- Sequence identifier
- Nucleotide sequence
- Quality score identifier line (always a single “+” (plus) sign)
- Quality scores
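The four-line layout above can be read with a small loop that consumes four lines per record. This is a minimal sketch rather than a full parser: FastqRecord and read_fastq are names chosen for illustration, the record in the usage lines is made up, and the code assumes each sequence and quality string occupies exactly one line.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # line 1, without the leading "@"
    sequence: str    # line 2
    quality: str     # line 4 (line 3 is only the "+" separator)


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one FastqRecord for every four-line group in an open text handle."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FastQ record near: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in: {header!r}")
        yield FastqRecord(header[1:].rstrip("\r\n"), sequence, quality)


# Usage with a made-up, in-memory record.
example = io.StringIO("@read_1\nGATTACA\n+\nIIIIIII\n")
for record in read_fastq(example):
    print(record.identifier, len(record.sequence))  # prints: read_1 7
```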
The first line contains the following elements:
@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<sample number>
|@||@||Each sequence identifier line starts with @.|
|<instrument>||Characters allowed: a–z, A–Z, 0–9 and underscore||Instrument ID.|
|<run number>||Numerical||Run number on instrument.|
|<flowcell ID>||Characters allowed: a–z, A–Z, 0–9||Flowcell ID.|
|<lane>||Numerical||Lane number.|
|<tile>||Numerical||Tile number.|
|<x-pos>||Numerical||X coordinate of cluster.|
|<y-pos>||Numerical||Y coordinate of cluster.|
|<read>||Numerical||Read number: 1 for a single read or Read 1 of a paired-end run, 2 for Read 2 of a paired-end run.|
|<is filtered>||Y or N||Y if the read is filtered (did not pass), N otherwise.|
|<control number>||Numerical||0 when none of the control bits are on, otherwise it is an even number.|
|<sample number>||Numerical||Sample number.|
Table 1. Elements in the first line of a FastQ file record.
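The elements of the identifier line can be pulled apart by splitting on the single space and then on colons. The following is a sketch under stated assumptions: IlluminaHeader and parse_header are hypothetical names, the header string in the usage lines is invented for illustration, and the last field is kept as text because some files carry an index sequence in that position rather than a number.

```python
from typing import NamedTuple


class IlluminaHeader(NamedTuple):
    instrument: str
    run_number: int
    flowcell_id: str
    lane: int
    tile: int
    x_pos: int
    y_pos: int
    read: int
    is_filtered: bool     # True when the field is "Y" (read did not pass)
    control_number: int
    sample_number: str    # kept as text; may be an index sequence in some files


def parse_header(line: str) -> IlluminaHeader:
    """Split one sequence identifier line into its named elements."""
    if not line.startswith("@"):
        raise ValueError("identifier line must start with '@'")
    # Seven colon-separated fields, one space, then four more fields.
    location, _, description = line[1:].strip().partition(" ")
    loc = location.split(":")
    desc = description.split(":")
    if len(loc) != 7 or len(desc) != 4:
        raise ValueError(f"unexpected identifier layout: {line!r}")
    if desc[1] not in ("Y", "N"):
        raise ValueError(f"<is filtered> must be Y or N, got {desc[1]!r}")
    return IlluminaHeader(
        instrument=loc[0],
        run_number=int(loc[1]),
        flowcell_id=loc[2],
        lane=int(loc[3]),
        tile=int(loc[4]),
        x_pos=int(loc[5]),
        y_pos=int(loc[6]),
        read=int(desc[0]),
        is_filtered=(desc[1] == "Y"),
        control_number=int(desc[2]),
        sample_number=desc[3],
    )


# Usage with an invented identifier line.
header = parse_header("@SIM:1:FCX:1:15:6329:1045 1:N:0:2")
print(header.lane, header.tile, header.is_filtered)  # prints: 1 15 False
```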
The second line contains the nucleotide sequence of a single read (DNA fragment).
The third line contains a quality score identifier and is always a “+” (plus) sign.
The fourth line contains one base call quality score for each nucleotide in the sequence shown in line two. The scores are Phred +33 encoded: each numerical quality score is stored as the single ASCII character whose code is the score plus 33.
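To make the Phred +33 encoding concrete, the sketch below converts a quality string back to numbers by subtracting 33 from each character code. The function names are illustrative, and the error-probability helper relies on the standard Phred definition Q = -10 * log10(P), which this page does not spell out.

```python
from typing import List


def phred33_to_scores(quality: str) -> List[int]:
    """Decode a Phred +33 quality string into numerical quality scores."""
    return [ord(char) - 33 for char in quality]


def error_probability(score: int) -> float:
    """Convert a Phred score to the estimated probability of a wrong base call."""
    return 10 ** (-score / 10)


# "!" has ASCII code 33 (score 0) and "I" has ASCII code 73 (score 40).
print(phred33_to_scores("!I5#"))  # prints: [0, 40, 20, 2]
```

A score of 20 corresponds to an estimated error rate of about 1 in 100, and a score of 30 to about 1 in 1000.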
The number of records in a FastQ file equals the number of reads generated during a sequencing run. On an Illumina MiniSeq instrument, there can be up to 100M records in a single file.
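Since every record occupies four lines, the number of records can be estimated by counting lines and dividing by four. This is a minimal sketch with a hypothetical function name, and it assumes a strictly four-line-per-record file with no blank lines.

```python
from typing import Iterable


def count_records(lines: Iterable[str]) -> int:
    """Count FastQ records in an iterable of lines (for example, an open file)."""
    n_lines = sum(1 for _ in lines)
    if n_lines % 4 != 0:
        raise ValueError(f"{n_lines} lines is not a multiple of four")
    return n_lines // 4


# Usage with a gzip-compressed file:
# import gzip
# with gzip.open("sample.fastq.gz", "rt") as handle:
#     print(count_records(handle))
```

Streaming over the lines keeps memory use flat, which matters for files with many millions of records.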
Example FastQ Record
Here is an example of a single FastQ file record:
There are two FastQ files generated in an Illumina paired-end sequencing run. Each file name combines a file prefix (“xxx”) with a read tag:
- R1 = file contains “forward” reads
- R2 = file contains “reverse” reads
Most downstream data analysis tools automatically recognize that the R1 and R2 files are paired with one another. Most tools will ask you to import both files at once. Therefore, it’s important to save both files in the same location for future reference.
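Before importing a pair of files, a quick sanity check is to walk through both in step and confirm that each R1 record has a matching R2 record. The sketch below assumes that paired records appear in the same order and share the part of the identifier before the space, which is typical of Illumina output but is an assumption here. The function names and the example identifiers are hypothetical.

```python
from itertools import zip_longest
from typing import Iterable, Iterator


def read_names(lines: Iterable[str]) -> Iterator[str]:
    """Yield the identifier text before the first space for each record."""
    for index, line in enumerate(lines):
        if index % 4 == 0:  # first line of each four-line record
            yield line[1:].strip().partition(" ")[0]


def check_pairing(r1_lines: Iterable[str], r2_lines: Iterable[str]) -> int:
    """Return the number of matched pairs, or raise ValueError on a mismatch."""
    pairs = 0
    names = zip_longest(read_names(r1_lines), read_names(r2_lines))
    for name1, name2 in names:
        if name1 is None or name2 is None:
            raise ValueError("files contain different numbers of records")
        if name1 != name2:
            raise ValueError(f"record {pairs + 1}: {name1!r} != {name2!r}")
        pairs += 1
    return pairs


# Usage with two made-up, in-memory records.
r1 = ["@SIM:1:FCX:1:15:6329:1045 1:N:0:2\n", "ACGT\n", "+\n", "IIII\n"]
r2 = ["@SIM:1:FCX:1:15:6329:1045 2:N:0:2\n", "TTGA\n", "+\n", "IIII\n"]
print(check_pairing(r1, r2))  # prints: 1
```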