Data Cleaning: Trimming and Filtering for High-Accuracy Results

In Part 4 of The Bioinformatics Blueprint, we used FastQC to diagnose our data. We often find that "raw" sequencing data isn't perfect—it's cluttered with sequencing adapters and low-quality bases at the ends of the reads.

If you don’t remove these, your next step (Mapping) can produce "misalignments," which in turn lead to false-positive variant calls or incorrect gene expression counts. Today, we perform "DNA Surgery" using Trimmomatic.

1. Why Trimmomatic?

While there are many trimmers (like Cutadapt or fastp), Trimmomatic remains a favorite because of how carefully it handles Paired-End (PE) reads. If one read in a pair is discarded, the surviving mate is written to an "unpaired" file, so the two "paired" output files stay synchronized read for read.

Installation:

conda activate genomics_basics
mamba install -c bioconda trimmomatic -y

2. Mastering the Trimmomatic Logic

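Trimmomatic doesn't just cut randomly; it applies each trimming step in the order you list it on the command line. In the command below, adapter clipping therefore runs first and the length filter runs last.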

The Command Breakdown:

trimmomatic SE -phred33 input.fastq output_trimmed.fastq \
ILLUMINACLIP:adapters.fa:2:30:10 \
LEADING:3 TRAILING:3 \
SLIDINGWINDOW:4:15 \
MINLEN:36

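ILLUMINACLIP:adapters.fa:2:30:10: This removes the adapters. It looks for matches to the sequences in your adapters.fa file, allowing up to 2 mismatches in the initial seed match, with a palindrome clip threshold of 30 and a simple clip threshold of 10.
LEADING:3 / TRAILING:3: Cuts bases from the start (LEADING) or the end (TRAILING) of the read for as long as their quality score is below 3.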
SLIDINGWINDOW:4:15: This is the most critical parameter.
- 4: The size of the window (number of bases).
- 15: The required average quality.
- Logic: As the "window" slides across the read, the moment the average quality of those 4 bases drops below 15, the tool "snaps" the read and discards everything following it.
MINLEN:36: If the remaining sequence is shorter than 36 bases, it's discarded entirely. Short reads are often ambiguous and map to too many places in the genome.
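
To make the SLIDINGWINDOW and MINLEN logic concrete, here is a minimal shell sketch that replays it on a single read. Treat it as a simplified model, not Trimmomatic's actual code: the helper names sliding_window_trim and passes_minlen are made up for this illustration, qualities are assumed to be Phred+33, and the real tool may behave differently in edge cases.

```shell
# Simplified model of SLIDINGWINDOW + MINLEN on one read (illustration only).

# sliding_window_trim SEQ QUAL WINDOW MIN_AVG   (hypothetical helper)
# Prints the part of SEQ that survives the sliding-window scan.
sliding_window_trim() {
  printf '%s\n%s\n' "$1" "$2" | awk -v win="$3" -v minq="$4" '
    # Lookup table: ASCII character -> Phred+33 quality score.
    BEGIN { for (c = 33; c <= 126; c++) phred[sprintf("%c", c)] = c - 33 }
    NR == 1 { seq = $0 }
    NR == 2 {
      n = length($0)
      for (i = 1; i <= n; i++) q[i] = phred[substr($0, i, 1)]
      keep = n
      # Slide from the start of the read; stop at the first window whose
      # average quality is below the threshold.
      for (start = 1; start + win - 1 <= n; start++) {
        total = 0
        for (k = 0; k < win; k++) total += q[start + k]
        if (total < minq * win) { keep = start + win - 1; break }
      }
      # Drop trailing bases that are individually below the threshold.
      while (keep > 0 && q[keep] < minq) keep--
      print substr(seq, 1, keep)
    }'
}

# passes_minlen SEQ MINLEN   (hypothetical helper)
# Succeeds if SEQ is at least MINLEN bases long.
passes_minlen() {
  [ "${#1}" -ge "$2" ]
}
```

For a 32-base read whose last 8 quality characters are # (Phred 2) and whose first 24 are I (Phred 40), sliding_window_trim with a window of 4 and a threshold of 15 prints the first 24 bases. That result passes a MINLEN of 20 but would be discarded by a MINLEN of 36.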

3. Practical Exercise: Fixing the "Red Zone"

Yesterday, we created a test_data.fastq with ###### at the end. In Phred+33 encoding, # stands for a quality score of 2, which is about as low as it gets. Let’s clean it.

The Action:

trimmomatic SE test_data.fastq cleaned_data.fastq SLIDINGWINDOW:4:15 MINLEN:20

The Comparison: Use the head command to see the difference:

Before: GATCGATCGATCGATCGATCGATCGATCGATC (followed by low-quality symbols)
After:  GATCGATCGATCGATCGATCGATC (the low-quality tail is gone!)
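
head only shows the first few lines. For a quick look across every read, a sketch like the one below prints read counts and read lengths. It assumes standard four-line FASTQ records (no wrapped sequence lines), and the helper names fastq_read_count and fastq_lengths are hypothetical.

```shell
# fastq_read_count FILE   (hypothetical helper)
# Number of reads, assuming exactly four lines per record.
fastq_read_count() {
  awk 'END { print NR / 4 }' "$1"
}

# fastq_lengths FILE   (hypothetical helper)
# Length of each read: the sequence is line 2 of every four-line record.
fastq_lengths() {
  awk 'NR % 4 == 2 { print length($0) }' "$1"
}

# Example (assumes the files from the exercise above exist):
#   fastq_read_count test_data.fastq
#   fastq_read_count cleaned_data.fastq
#   fastq_lengths cleaned_data.fastq | sort -n | uniq -c
```

A lower read count after trimming means some reads fell below MINLEN; shorter lengths mean the low-quality tails were cut.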

4. Record Notes: Paired-End (PE) Complexity

When running PE data, you must provide two input files and four output files.

trimmomatic PE R1_in.fq R2_in.fq R1_paired.fq R1_unpaired.fq R2_paired.fq R2_unpaired.fq ...

Founder's Insight: Always use the _paired.fq files for your downstream alignment. The _unpaired.fq files hold "orphans": reads that survived trimming while their partner read was discarded.
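
If you want to confirm that the two paired files really are in step, a rough check like the following can help. It is only a sketch: paired_in_sync is a hypothetical helper, it assumes four-line FASTQ records, and it assumes mates share the same read name up to the first space or slash (for example @read7/1 and @read7/2).

```shell
# paired_in_sync R1_FILE R2_FILE   (hypothetical helper)
# Succeeds only if both files have the same reads in the same order.
paired_in_sync() {
  paste "$1" "$2" | awk -F '\t' '
    NR % 4 == 1 {
      # Header lines: compare read names up to the first space or slash.
      split($1, a, "[ /]")
      split($2, b, "[ /]")
      # An empty name means one file ran out of reads before the other.
      if (a[1] == "" || a[1] != b[1]) { bad = 1; exit }
    }
    END { exit (bad ? 1 : 0) }'
}

# Example (file names taken from the PE command above):
#   paired_in_sync R1_paired.fq R2_paired.fq && echo "pairs are in sync"
```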

Summary Checklist

[ ] Conda Activate: Ensure you are in your genomics environment.
[ ] Adapter File: Ensure you have the correct adapter sequences for your sequencing platform (Illumina, Nextera, etc.).
[ ] Sliding Window: Adjust based on your goals (15 for general, 20+ for high-stringency).
[ ] Re-Check: Always run FastQC on your output_trimmed.fastq to confirm the "Red Zone" is gone.
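
The Adapter File item is easy to get wrong without noticing, since a missing or empty file simply means no adapters get clipped. One possible pre-flight check is sketched below; check_adapter_fasta is a hypothetical helper and it only verifies that the file looks like FASTA, not that the sequences match your library kit.

```shell
# check_adapter_fasta FILE   (hypothetical helper)
# Succeeds only if FILE exists, is non-empty, and has a FASTA header line.
check_adapter_fasta() {
  if [ ! -s "$1" ]; then
    echo "Adapter file missing or empty: $1" >&2
    return 1
  fi
  if ! grep -q '^>' "$1"; then
    echo "No FASTA header (>) found in: $1" >&2
    return 1
  fi
}

# Example: check_adapter_fasta adapters.fa || echo "fix the adapter file first"
```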
