Data Cleaning: Trimming and Filtering for High-Accuracy Results
In Part 4 of The Bioinformatics Blueprint, we used FastQC to diagnose our data. We often find that "raw" sequencing data isn't perfect—it's cluttered with sequencing adapters and low-quality bases at the ends of the reads.
If you don’t remove these, your next step (Mapping) can produce misalignments, which in turn lead to false-positive variant calls or incorrect gene expression counts. Today, we perform "DNA Surgery" using Trimmomatic.
1. Why Trimmomatic?
While there are many trimmers (like Cutadapt or Fastp), Trimmomatic remains a favorite because of its precision in handling Paired-End (PE) reads. It ensures that if one read in a pair is discarded, the other is moved to an "unpaired" file, keeping your data synchronized.
Installation:
conda activate genomics_basics
mamba install -c bioconda trimmomatic -y
2. Mastering the Trimmomatic Logic
Trimmomatic doesn't just cut randomly; it applies each trimming step in the order you list it on the command line.
The Command Breakdown:
trimmomatic SE -phred33 input.fastq output_trimmed.fastq \
ILLUMINACLIP:adapters.fa:2:30:10 \
LEADING:3 TRAILING:3 \
SLIDINGWINDOW:4:15 \
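MINLEN:36
- ILLUMINACLIP: This removes the adapters. It looks for matches in your adapters.fa file. The values 2:30:10 set the allowed seed mismatches, the palindrome clip threshold, and the simple clip threshold.
- LEADING/TRAILING: Cuts bases from the start or end if they fall below a quality score of 3.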
- SLIDINGWINDOW:4:15: This is the most critical parameter.
- 4: The size of the window (number of bases).
- 15: The required average quality.
- Logic: As the "window" slides across the read, the moment the average quality of those 4 bases drops below 15, the tool "snaps" the read and discards everything following it.
- MINLEN:36: If the remaining sequence is shorter than 36 bases, it's discarded entirely. Short reads are often ambiguous and map to too many places in the genome.
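To make the window logic concrete, here is a minimal Python sketch of the LEADING, TRAILING, SLIDINGWINDOW and MINLEN steps described above. It is a simplified model for building intuition, not Trimmomatic's actual implementation: it assumes Phred+33 quality strings, cuts at the start of the first failing window, and skips adapter clipping. The function names `phred33` and `trim_read` are made up for this illustration.

```python
def phred33(qual):
    """Decode a Phred+33 quality string into a list of integer scores."""
    return [ord(ch) - 33 for ch in qual]


def trim_read(seq, qual, leading=3, trailing=3, window=4, min_avg=15, minlen=36):
    """Simplified model of LEADING, TRAILING, SLIDINGWINDOW and MINLEN.

    Returns (seq, qual) for the surviving read, or None if it is discarded.
    Assumption: the cut is placed at the start of the first failing window.
    """
    scores = phred33(qual)

    # LEADING: skip bases at the start whose quality is below the threshold.
    start = 0
    while start < len(scores) and scores[start] < leading:
        start += 1

    # TRAILING: skip bases at the end whose quality is below the threshold.
    end = len(scores)
    while end > start and scores[end - 1] < trailing:
        end -= 1

    # SLIDINGWINDOW: scan from the 5' end and cut at the first window
    # whose average quality drops below min_avg.
    cut = end
    for i in range(start, end - window + 1):
        if sum(scores[i:i + window]) / window < min_avg:
            cut = i
            break

    # MINLEN: discard the read entirely if too little of it is left.
    if cut - start < minlen:
        return None
    return seq[start:cut], qual[start:cut]
```

For example, `trim_read("GATC" * 8, "I" * 26 + "#" * 6, minlen=20)` keeps the high-quality part of the read, while the same call with the default `minlen=36` returns `None` because too few bases survive.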
3. Practical Exercise: Fixing the "Red Zone"
Yesterday, we created a test_data.fastq with ###### (very low quality) at the end. Let’s clean it.
The Action:
trimmomatic SE test_data.fastq cleaned_data.fastq SLIDINGWINDOW:4:15 MINLEN:20
The Comparison: Use the head command to see the difference:
- Before: GATCGATCGATCGATCGATCGATCGATCGATC (followed by low-quality symbols)
- After: GATCGATCGATCGATCGATCGATC (the low-quality tail is gone!)
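Eyeballing reads with head works for a tiny test file, but a quick length summary scales better. The sketch below is one possible way to compare a FASTQ file before and after trimming. It assumes plain (uncompressed) FASTQ with exactly four lines per record, and the helper names `fastq_lengths` and `summarize` are hypothetical.

```python
from itertools import islice


def fastq_lengths(lines):
    """Yield the sequence length of each 4-line FASTQ record."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if len(record) < 4:
            break  # end of input (or an incomplete trailing record)
        yield len(record[1].strip())  # line 2 of each record is the sequence


def summarize(lines):
    """Return read count and min/max/mean read length for FASTQ lines."""
    lengths = list(fastq_lengths(lines))
    if not lengths:
        return {"reads": 0, "min_len": None, "max_len": None, "mean_len": None}
    return {
        "reads": len(lengths),
        "min_len": min(lengths),
        "max_len": max(lengths),
        "mean_len": sum(lengths) / len(lengths),
    }
```

To use it on the exercise files, open each one and pass the file handle, for example `with open("test_data.fastq") as fh: print(summarize(fh))`, then repeat for `cleaned_data.fastq`. A drop in the mean length with no change in read count means reads were shortened but none were discarded.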
4. Record Notes: Paired-End (PE) Complexity
When running PE data, you must provide two input files and four output files.
trimmomatic PE R1_in.fq R2_in.fq R1_paired.fq R1_unpaired.fq R2_paired.fq R2_unpaired.fq ...
Founder's Insight: Always use the _paired.fq files for your downstream alignment. The _unpaired.fq files are "orphans" whose partner read failed the quality check.
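The four-output rule is easier to remember once you see the routing decision spelled out. The sketch below models it in Python under a simplifying assumption: each read has already been trimmed, and `None` marks a read that failed the filters. The function name `split_pairs` and the dictionary keys are illustrative only and simply mirror the file names in the command above.

```python
def split_pairs(r1_reads, r2_reads):
    """Route mate pairs into paired/unpaired outputs.

    r1_reads[i] and r2_reads[i] are mates; None means the read was discarded.
    """
    if len(r1_reads) != len(r2_reads):
        raise ValueError("R1 and R2 must contain the same number of reads")

    out = {"R1_paired": [], "R1_unpaired": [], "R2_paired": [], "R2_unpaired": []}
    for r1, r2 in zip(r1_reads, r2_reads):
        if r1 is not None and r2 is not None:
            # Both mates survived: they stay in sync in the paired outputs.
            out["R1_paired"].append(r1)
            out["R2_paired"].append(r2)
        elif r1 is not None:
            out["R1_unpaired"].append(r1)  # orphan: its R2 mate was dropped
        elif r2 is not None:
            out["R2_unpaired"].append(r2)  # orphan: its R1 mate was dropped
        # If both mates failed, the pair is dropped entirely.
    return out
```

Because a read only reaches a paired output together with its mate, the two paired lists always have the same length and the same order, which is what aligners expect from paired-end input.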
Summary Checklist
- [ ] Conda Activate: Ensure you are in your genomics environment.
- [ ] Adapter File: Ensure you have the correct adapter sequences for your sequencing platform (Illumina, Nextera, etc.).
- [ ] Sliding Window: Adjust based on your goals (15 for general, 20+ for high-stringency).
- [ ] Re-Check: Always run FastQC on your output_trimmed.fastq to confirm the "Red Zone" is gone.

