

Bash Scripting: Automating Your Bioinformatics Workflows (Part 2)

Welcome to Part 2 of the Bash automation series!

Yesterday, you learned to navigate the Linux terminal. You can now ls, cd, and nano like a pro. But imagine facing 100 raw sequencing files. Are you going to type fastqc sample_001.fastq.gz 100 times?

Absolutely not.

In professional bioinformatics, if you have to perform a task more than once, you automate it. Today, we're diving into Bash Scripting—the fundamental skill that transforms you from a user into a workflow architect.

1. What is a Bash Script? The "To-Do List" for Your Server

A Bash script is simply a plain text file containing a sequence of commands, just like the ones you type directly into the terminal. The magic is that the computer executes them one after another, automatically.

Every Bash script starts with a special line called a "Shebang": #!/bin/bash. This tells the system which program (the Bash shell) should interpret the commands in the file.

Step-by-Step: Your First Automation Script

  1. Create the script file:

    nano first_script.sh

     

  2. Add these lines:

    #!/bin/bash
    
    # This is a comment - Bash ignores lines starting with #
    echo "--- Starting BioInfoQuant Automation ---"
    
    # Create a new directory for our project
    mkdir -p my_new_project
    cd my_new_project
    
    # Create a dummy file inside it
    echo "This is some dummy data." > dummy_file.txt
    
    # List the contents to verify
    ls -lh
    
    echo "--- Script Finished Successfully ---"

     

  3. Save and Exit: Press Ctrl+O, then Enter, then Ctrl+X.
  4. Make it Executable: Before you can run the script directly, you need to give it execute permission:

    chmod +x first_script.sh

     

  5. Run Your Script:

     

    ./first_script.sh
    

    You should see the messages printed, a new directory created, and dummy_file.txt inside it.

2. Variables: Dynamic Scripting

Hardcoding filenames is bad practice. Variables allow you to store information (like filenames, paths, or parameters) and reuse them.

#!/bin/bash

PROJECT_NAME="MyRNASeqProject"
OUTPUT_DIR="results"

echo "Initializing $PROJECT_NAME..."
mkdir -p "$OUTPUT_DIR/$PROJECT_NAME" # Always quote variables with spaces!
echo "Output will be in: $(pwd)/$OUTPUT_DIR/$PROJECT_NAME"

  • Why the Quotes? If OUTPUT_DIR had a space (e.g., My Results), mkdir would try to create two separate directories. Quotes treat it as one.
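To see the difference quoting makes, here is a tiny self-contained demo (the directory name is made up for illustration, and the demo runs in a throwaway temp directory):

```shell
#!/bin/bash
# Demo of quoting (hypothetical directory name containing a space)
cd "$(mktemp -d)"        # work in a throwaway directory

OUTPUT_DIR="My Results"

mkdir -p $OUTPUT_DIR     # unquoted: expands to TWO arguments -> dirs "My" and "Results"
mkdir -p "$OUTPUT_DIR"   # quoted: ONE argument -> a single dir "My Results"

ls                       # lists all three directories created above
```

The unquoted line silently creates two wrong directories instead of failing, which is exactly why this class of bug is hard to spot later.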

3. The "For Loop": Your Bioinformatics Powerhouse

This is the most critical concept for processing large datasets. A for loop allows you to repeat a block of commands for every item in a list (e.g., every .fastq file, every protein PDB, every sample directory).

Syntax:

for variable in item1 item2 item3 ...
do
    # Commands to execute for each 'variable'
done
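In its smallest form (the sample names below are made up), the loop looks like this:

```shell
#!/bin/bash
# Minimal for loop: the variable "sample" takes each value in turn
for sample in sample_A sample_B sample_C
do
    echo "Running QC on $sample"
done
```

On each pass, `$sample` holds the next item in the list, so the body runs three times here.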


Practical Example 1: Batch Decompressing Files

Imagine you have dozens of .fasta.gz files.

#!/bin/bash

DATA_DIR="raw_sequences"
OUTPUT_DIR="unzipped_fasta"

mkdir -p "$OUTPUT_DIR"

# Loop through all gzipped FASTA files in DATA_DIR
for gzipped_file in "$DATA_DIR"/*.fasta.gz
do
    if [ -f "$gzipped_file" ]; then # Check if it's a regular file
        echo "Decompressing $(basename "$gzipped_file")..."
        gunzip -c "$gzipped_file" > "$OUTPUT_DIR/$(basename "$gzipped_file" .gz)"
    fi
done

echo "All FASTA files decompressed to $OUTPUT_DIR."

  • basename "$gzipped_file": Extracts just the filename from the path.
  • basename "$gzipped_file" .gz: Extracts the filename and removes the .gz extension.
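You can verify basename's behaviour directly in the terminal (the path below is made up):

```shell
#!/bin/bash
# basename strips the directory part of a path;
# an optional second argument strips a trailing suffix as well
path="raw_sequences/sample_001.fasta.gz"

basename "$path"        # prints: sample_001.fasta.gz
basename "$path" .gz    # prints: sample_001.fasta
```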

Practical Example 2: Automating GROMACS Minimization (Simplified)

If you have multiple protein-ligand complex directories (e.g., complex_A/, complex_B/), each needing energy minimization:

#!/bin/bash

# Ensure you are in the parent directory containing all complex folders
# For example, if you have:
#   /project_folder/complex_1/
#   /project_folder/complex_2/
# Run this script from /project_folder/

for complex_dir in complex_*
do
    if [ -d "$complex_dir" ]; then # Check if it's a directory
        echo "Processing GROMACS minimization for $complex_dir..."
        cd "$complex_dir" || { echo "Error: Cannot change directory to $complex_dir"; exit 1; }
        
        # Check for essential GROMACS files (e.g., em.mdp, topol.top, conf.gro)
        if [ ! -f "em.mdp" ] || [ ! -f "topol.top" ] || [ ! -f "conf.gro" ]; then
            echo "Warning: Missing GROMACS input files in $complex_dir. Skipping."
            cd ..
            continue # Move to the next directory
        fi

        # Run GROMACS commands (simplified - full workflow has more steps)
        gmx grompp -f em.mdp -c conf.gro -p topol.top -o em.tpr -maxwarn 1
        gmx mdrun -v -deffnm em

        # Check for successful completion
        if [ $? -eq 0 ]; then # $? holds the exit status of the last command (0 for success)
            echo "Minimization for $complex_dir completed successfully."
        else
            echo "Error: GROMACS minimization failed for $complex_dir."
        fi
        
        cd .. # Go back to the parent directory
    fi
done

echo "GROMACS batch processing complete."

  • if [ -d "$complex_dir" ]: Ensures we only process actual directories.
  • cd "$complex_dir" || { echo ...; exit 1; }: The || operator means "or else". If cd fails, print an error and exit.
  • [ ! -f "filename" ]: Checks if a file DOES NOT exist.
  • continue: Skips the rest of the loop for the current item and moves to the next.
  • $? -eq 0: A powerful trick to check if the previous command ran without errors.
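One caveat with `$?`: it holds the exit status of only the most recent command, so check it immediately. A quick standalone demo:

```shell
#!/bin/bash
# $? holds the exit status of the LAST command only
true                     # a command that always succeeds (exit status 0)
echo "after true: $?"    # prints: after true: 0

false                    # a command that always fails (exit status 1)
echo "after false: $?"   # prints: after false: 1

false
echo "intervening command"
echo "now: $?"           # prints: now: 0 (the echo above succeeded, resetting $?)
```

In the GROMACS script above, this is why the `if [ $? -eq 0 ]` test sits directly after `gmx mdrun`, with nothing in between.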

4. Record Notes: Essential Scripting Best Practices

  • Comments (#): Explain your code. Future You (or your colleagues) will thank you.
  • Shebang (#!/bin/bash): Always at the top.
  • Executable Permissions (chmod +x): Don't forget this after creating a new script.
  • Quoting Variables ("$VAR"): Prevents unexpected behavior with spaces or special characters.
  • Error Handling: Use if statements and test ([ ]) to check if files exist, if commands ran successfully, or if directories are valid. This makes your scripts robust.
  • Test Small, Then Scale: Don't write a 100-line script and run it on 1TB of data. Test small parts first.

Practical Exercise for Students

Create a script named process_fastqs.sh that:

  1. Accepts a single argument (e.g., sample_name).
  2. Creates a directory for that sample_name.
  3. Copies all .fastq.gz files from a raw_data/ folder into the new sample directory.
  4. Prints "Processing complete for [sample_name]!"
  5. Bonus: Modify it to loop through a list of sample names instead of taking just one argument.

Summary Checklist

  • [ ] Start scripts with #!/bin/bash.
  • [ ] Use chmod +x to make scripts runnable.
  • [ ] Declare and use variables to make scripts flexible.
  • [ ] Master the for loop for batch processing.
  • [ ] Add comments and basic error checking (if statements).

Bash scripting is the backbone of efficient bioinformatics. Automate today, innovate tomorrow.

Hafiz Muhammad Hammad

Greetings! I’m Hafiz Muhammad Hammad, CEO/CTO at BioInfoQuant, driving innovation at the intersection of Biotechnology and Computational Sciences. With a strong foundation in bioinformatics, chemoinformatics, and programming, I specialize in Molecular Dynamics and Computational Genomics. Passionate about bridging technology and biology, I’m committed to advancing genomics and bioinformatics.
