NGS Data Wrangling: Understanding FASTQ and Quality Control


In the previous parts of The Bioinformatics Blueprint, we set up our "lab" in the Linux terminal. Now it's time to handle the actual "specimens": the raw data.

Most genomic research starts with raw data from an Illumina, PacBio, or Nanopore sequencer. This data arrives in a format called FASTQ. Before we run a single alignment or variant calling script, we must perform Quality Control (QC).

If your data is "noisy," your results will be "garbage." Let’s make sure your data is clean.

1. Decoding the FASTQ Format

A FASTQ file is a text-based format for storing both a biological sequence and its corresponding quality scores. Each "read" (sequence fragment) is represented by exactly 4 lines.

Line 1 (The Header): Starts with @. On Illumina instruments it typically contains the instrument ID, run ID, and flowcell coordinates.
Line 2 (The Sequence): The actual A, C, T, G, or N (unknown) calls.
Line 3 (The Separator): A + sign (sometimes followed by the header again).
Line 4 (The Quality Scores): These look like random symbols (!, #, A, f). They are Phred scores, each encoded as a single ASCII character to save space, so this line is always the same length as Line 2.
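
Because every read occupies four lines, the number of reads in an uncompressed FASTQ file is simply its line count divided by four. Here is a minimal sketch of that check; `count_reads` is a hypothetical helper name, and it assumes the strict 4-line layout described above (no wrapped sequence lines).

```bash
# Sketch: count reads in an uncompressed FASTQ file.
# Assumes exactly 4 lines per read; count_reads is a hypothetical helper.
count_reads() {
    local file="$1" lines
    # awk counts the last line even if the file lacks a trailing newline
    lines=$(awk 'END { print NR }' "$file")
    if [ $((lines % 4)) -ne 0 ]; then
        echo "Warning: $file has $lines lines, not a multiple of 4" >&2
        return 1
    fi
    echo $((lines / 4))
}
```

A line count that is not a multiple of four usually means the file was truncated during transfer, which is worth catching before any analysis starts.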

The Math behind the Symbols:
A Phred score ($Q$) is calculated as $Q = -10 \log_{10}(P)$, where $P$ is the probability that the base call is incorrect.
Q30 (Symbol '?' in the common Phred+33 encoding): 99.9% accuracy (1 in 1,000 chance of error).
Q10 (Symbol '+' in Phred+33): 90% accuracy (1 in 10 chance of error).
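
To make the formula concrete, here is a small sketch that decodes one quality character and converts a score back into an error probability. It assumes the Phred+33 encoding (the quality score is the ASCII code minus 33); `phred_q` and `phred_error` are hypothetical helper names.

```bash
# Sketch: decode a single quality character, assuming Phred+33.
phred_q() {
    local ascii
    # A leading single quote makes printf print the character's ASCII code
    ascii=$(printf '%d' "'$1")
    echo $((ascii - 33))
}

# Sketch: error probability for a given score, P = 10^(-Q/10),
# which is the inverse of Q = -10 * log10(P).
phred_error() {
    awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}
```

For example, `phred_q I` prints 40 and `phred_error 30` prints 0.0010, matching the 99.9% accuracy figure for Q30.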

2. Practical Exercise: Creating a Test FASTQ File

Don't wait for a 50GB file to practice. Let's use the nano editor to create a "dummy" FASTQ file to test our scripts.

Step-by-Step:

Open the terminal and type: nano test_data.fastq
Copy and paste this sample read (which has poor quality at the end):
```
@BIOINFOQUANT_001:Read1
GATCGATCGATCGATCGATCGATCGATCGATC
+
IIIIIIIIIIIIIIIIIIIIIIIIII######
```
(Note: in the Phred+33 encoding, 'I' is Q40 (high quality), while '#' is Q2 (very low quality).)
Save and exit (Ctrl+O, Enter, Ctrl+X).
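
If you would rather script this step than paste into an editor, the same file can be written with a here-document. This is only a sketch of an alternative to nano; the `awk` check afterwards is an extra sanity test, not part of the FASTQ format itself.

```bash
# Write the same read without opening an editor. Quoting 'EOF' stops the
# shell from expanding anything inside the here-document.
cat > test_data.fastq <<'EOF'
@BIOINFOQUANT_001:Read1
GATCGATCGATCGATCGATCGATCGATCGATC
+
IIIIIIIIIIIIIIIIIIIIIIIIII######
EOF

# Sanity check: every sequence line must be as long as its quality line.
awk 'NR % 4 == 2 { seq_len = length($0) }
     NR % 4 == 0 && length($0) != seq_len {
         print "Length mismatch in read ending at line " NR; bad = 1
     }
     END { exit bad }' test_data.fastq && echo "test_data.fastq looks well-formed"
```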

3. Running Professional QC with FastQC

Instead of checking millions of lines manually, we use FastQC. If you followed Part 3, you should have your environment ready.

The Installation:

conda activate genomics_basics
mamba install -c bioconda fastqc -y

The Execution:

fastqc test_data.fastq

This will produce two files:

test_data_fastqc.zip (The report's underlying data, including the plain-text summary.txt and fastqc_data.txt, not your raw reads)
test_data_fastqc.html (The visual report)
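
The archive is handy when you want the verdicts without opening a browser. Its summary.txt lists one module per line as tab-separated status, module name, and file name. Here is a rough sketch that tallies those statuses; `tally_fastqc_status` is a hypothetical helper name, and the path inside the archive assumes FastQC's usual layout of one folder named after the report.

```bash
# Sketch: tally PASS / WARN / FAIL from a FastQC summary.txt.
# Reads the named file, or standard input when no file is given.
tally_fastqc_status() {
    awk -F '\t' '{ n[$1]++ }
        END { printf "PASS=%d WARN=%d FAIL=%d\n", n["PASS"], n["WARN"], n["FAIL"] }' "$@"
}

# Typical use, reading the summary straight out of the archive:
#   unzip -p test_data_fastqc.zip test_data_fastqc/summary.txt | tally_fastqc_status
```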

4. Interpreting the "Traffic Light" Report

When you open the HTML report, FastQC gives you a summary. Here is what you need to focus on:

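Per Base Sequence Quality: This is a box-plot of quality at each read position. If the boxes stay in the Green zone (roughly Q28 and above), your data is great. If they dip into the Red zone (below Q20, usually at the 3' end), you will need to perform "Trimming" in our next lesson.
Per Base Sequence Content: In a random, unbiased library, the lines for A, C, T, and G should run roughly parallel. Skew in the first 10 to 15 bases is common and often reflects library-prep bias (for example, random-hexamer priming in RNA-seq) rather than bad sequencing. A persistent skew can also point to adapter or barcode sequence, which the separate Adapter Content module checks directly.
Sequence Duplication Levels: High duplication might mean you over-amplified your DNA during PCR, although it is also expected in amplicon and highly expressed RNA-seq libraries.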
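
To see what the Per Base Sequence Quality plot is summarizing, you can compute the mean score at each read position yourself. This is a simplified sketch, not how FastQC is implemented: it assumes Phred+33 and 4-line records, reports only the mean (FastQC also shows medians and quartiles), and `per_base_mean_quality` is a hypothetical helper name.

```bash
# Sketch: mean Phred score at each read position of a FASTQ file.
per_base_mean_quality() {
    awk 'BEGIN {
             # Build a character -> ASCII code lookup table
             for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i
         }
         NR % 4 == 0 {
             # Every 4th line holds the quality string
             for (i = 1; i <= length($0); i++) {
                 sum[i] += ord[substr($0, i, 1)] - 33
                 cnt[i]++
             }
             if (length($0) > max) max = length($0)
         }
         END {
             for (i = 1; i <= max; i++) printf "%d\t%.1f\n", i, sum[i] / cnt[i]
         }' "$1"
}
```

Run on the test read from Section 2, the mean should stay at 40.0 for the 'I' positions and drop to 2.0 for the trailing '#' positions, which is exactly the 3' dip the plot would show.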

5. Scaling Up: The "Multi-File" Automation

In a real lab, you might have 96 samples. Use the Bash Loop from Part 2 to handle them all:

mkdir -p qc_results
for sample in *.fastq
do
    echo "Analyzing $sample..."
    fastqc "$sample" --outdir=qc_results/
done
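
For larger batches it can help to wrap the loop in a function that also picks up gzipped files, tolerates spaces in file names, and keeps going when one sample fails. The version below is a sketch under those assumptions; `run_qc_batch` and the `QC_CMD` override are hypothetical names introduced here so the loop can be dry-run on a machine without FastQC.

```bash
# Sketch: run a QC command over every FASTQ file in a directory.
# QC_CMD defaults to fastqc; set it to another command for a dry run.
run_qc_batch() {
    local in_dir="$1" out_dir="$2"
    local qc_cmd="${QC_CMD:-fastqc}"
    local sample count=0
    mkdir -p "$out_dir"
    for sample in "$in_dir"/*.fastq "$in_dir"/*.fastq.gz; do
        # An unmatched glob stays as literal text, so skip non-existent paths
        [ -e "$sample" ] || continue
        echo "Analyzing $sample..."
        "$qc_cmd" "$sample" --outdir="$out_dir" || echo "QC failed for $sample" >&2
        count=$((count + 1))
    done
    echo "Processed $count file(s)"
}

# Typical use:
#   run_qc_batch . qc_results
# Dry run that only prints what would be executed:
#   QC_CMD=echo run_qc_batch . qc_results
```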

Summary Checklist

[ ] Identify the 4 lines of a FASTQ read.
[ ] Create test files with nano to verify your pipeline.
[ ] Interpret the Phred scores (Aim for Q30+).
[ ] Automate the process for multiple samples.

The command line doesn't just make you faster; it makes your science more rigorous.
