Bash Scripting: Automating Your Bioinformatics Workflows (Part 2)
Welcome to Part 2 of the Bash Automation series!
Yesterday, you learned to navigate the Linux terminal. You can now ls, cd, and nano like a pro. But imagine facing 100 raw sequencing files. Are you going to type fastqc sample_001.fastq.gz 100 times?
Absolutely not.
In professional bioinformatics, if you have to perform a task more than once, you automate it. Today, we're diving into Bash Scripting—the fundamental skill that transforms you from a user into a workflow architect.
1. What is a Bash Script? The "To-Do List" for Your Server
A Bash script is simply a plain text file containing a sequence of commands, just like the ones you type directly into the terminal. The magic is that the computer executes them one after another, automatically.
Every Bash script starts with a special line called a "Shebang": #!/bin/bash. This tells the system which program (the Bash shell) should interpret the commands in the file.
Step-by-Step: Your First Automation Script
1. Create the script file:

nano first_script.sh

2. Add these lines:

#!/bin/bash
# This is a comment - Bash ignores lines starting with #

echo "--- Starting BioInfoQuant Automation ---"

# Create a new directory for our project
mkdir -p my_new_project
cd my_new_project

# Create a dummy file inside it
echo "This is some dummy data." > dummy_file.txt

# List the contents to verify
ls -lh

echo "--- Script Finished Successfully ---"

3. Save and Exit: Press Ctrl+O, then Enter, then Ctrl+X.

4. Make it Executable: Before Bash can run it, you need to give it permission:

chmod +x first_script.sh

5. Run Your Script:

./first_script.sh

You should see the messages printed, a new directory created, and dummy_file.txt appear inside it.
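Side note: if you ever forget the chmod step, you can also hand the script to Bash directly, which does not require execute permission:

bash first_script.sh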
2. Variables: Dynamic Scripting
Hardcoding filenames is bad practice. Variables allow you to store information (like filenames, paths, or parameters) and reuse them.
#!/bin/bash

PROJECT_NAME="MyRNASeqProject"
OUTPUT_DIR="results"

echo "Initializing $PROJECT_NAME..."
mkdir -p "$OUTPUT_DIR/$PROJECT_NAME"  # Always quote variables with spaces!
echo "Output will be in: $(pwd)/$OUTPUT_DIR/$PROJECT_NAME"
- Why the Quotes? If OUTPUT_DIR had a space (e.g., My Results), mkdir would try to create two separate directories. Quotes treat it as one (demonstrated below).
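A quick demonstration of that pitfall, using a throwaway directory name purely for illustration:

DIR="My Results"

mkdir -p $DIR    # unquoted: word splitting creates two directories, "My" and "Results"
mkdir -p "$DIR"  # quoted: creates a single directory named "My Results"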
3. The "For Loop": Your Bioinformatics Powerhouse
This is the most critical concept for processing large datasets. A for loop allows you to repeat a block of commands for every item in a list (e.g., every .fastq file, every protein PDB, every sample directory).
Syntax:
for variable in item1 item2 item3 ...
do
    # Commands to execute for each 'variable'
done
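Before the real examples, a minimal sketch of the idea: looping over three made-up sample names and printing the FastQC command you would run for each (the actual fastqc call assumes FastQC is installed, so it is left commented out):

for sample in sampleA sampleB sampleC
do
    echo "Would run: fastqc ${sample}.fastq.gz"
    # fastqc "${sample}.fastq.gz"
done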
Practical Example 1: Batch Decompressing Files
Imagine you have dozens of .fasta.gz files.
#!/bin/bash

DATA_DIR="raw_sequences"
OUTPUT_DIR="unzipped_fasta"

mkdir -p "$OUTPUT_DIR"

# Loop through all gzipped FASTA files in DATA_DIR
for gzipped_file in "$DATA_DIR"/*.fasta.gz
do
    if [ -f "$gzipped_file" ]; then  # Check if it's a regular file
        echo "Decompressing $(basename "$gzipped_file")..."
        gunzip -c "$gzipped_file" > "$OUTPUT_DIR/$(basename "$gzipped_file" .gz)"
    fi
done

echo "All FASTA files decompressed to $OUTPUT_DIR."
basename "$gzipped_file": Extracts just the filename from the path.basename "$gzipped_file" .gz: Extracts the filename and removes the.gzextension.
Practical Example 2: Automating GROMACS Minimization (Simplified)
If you have multiple protein-ligand complex directories (e.g., complex_A/, complex_B/), each needing energy minimization:
#!/bin/bash

# Ensure you are in the parent directory containing all complex folders
# For example, if you have:
#   /project_folder/complex_1/
#   /project_folder/complex_2/
# Run this script from /project_folder/

for complex_dir in complex_*
do
    if [ -d "$complex_dir" ]; then  # Check if it's a directory
        echo "Processing GROMACS minimization for $complex_dir..."
        cd "$complex_dir" || { echo "Error: Cannot change directory to $complex_dir"; exit 1; }

        # Check for essential GROMACS files (e.g., em.mdp, topol.top, conf.gro)
        if [ ! -f "em.mdp" ] || [ ! -f "topol.top" ] || [ ! -f "conf.gro" ]; then
            echo "Warning: Missing GROMACS input files in $complex_dir. Skipping."
            cd ..
            continue  # Move to the next directory
        fi

        # Run GROMACS commands (simplified - full workflow has more steps)
        gmx grompp -f em.mdp -c conf.gro -p topol.top -o em.tpr -maxwarn 1
        gmx mdrun -v -deffnm em

        # Check for successful completion
        if [ $? -eq 0 ]; then  # $? holds the exit status of the last command (0 for success)
            echo "Minimization for $complex_dir completed successfully."
        else
            echo "Error: GROMACS minimization failed for $complex_dir."
        fi

        cd ..  # Go back to the parent directory
    fi
done

echo "GROMACS batch processing complete."
if [ -d "$complex_dir" ]: Ensures we only process actual directories.cd "$complex_dir" || { echo ...; exit 1; }:||means "OR". Ifcdfails, print an error and exit.[ ! -f "filename" ]: Checks if a file DOES NOT exist.continue: Skips the rest of the loop for the current item and moves to the next.$? -eq 0: A powerful trick to check if the previous command ran without errors.
4. Record Notes: Essential Scripting Best Practices
- Comments (#): Explain your code. Future You (or your colleagues) will thank you.
- Shebang (#!/bin/bash): Always at the top.
- Executable Permissions (chmod +x): Don't forget this after creating a new script.
- Quoting Variables ("$VAR"): Prevents unexpected behavior with spaces or special characters.
- Error Handling: Use if statements and test ([ ]) to check if files exist, if commands ran successfully, or if directories are valid. This makes your scripts robust (see the sketch after this list).
- Test Small, Then Scale: Don't write a 100-line script and run it on 1TB of data. Test small parts first.
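A minimal sketch of that error-handling point, using a placeholder filename:

#!/bin/bash

INPUT="raw_data/sample_001.fastq.gz"  # placeholder path for illustration

if [ ! -f "$INPUT" ]; then
    echo "Error: $INPUT not found. Exiting." >&2
    exit 1
fi

echo "Found $INPUT, continuing..."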
Practical Exercise for Students
Create a script named process_fastqs.sh that:
- Accepts a single argument (e.g., sample_name); see the hint after this list for reading arguments.
- Creates a directory for that sample_name.
- Copies all .fastq.gz files from a raw_data/ folder into the new sample directory.
- Prints "Processing complete for [sample_name]!"
- Bonus: Modify it to loop through a list of sample names instead of taking just one argument.
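One piece the examples above did not cover: inside a script, the first command-line argument is available as $1. A starter skeleton, offered as a hint rather than a full solution (raw_data/ is the folder named in the exercise):

#!/bin/bash

sample_name="$1"  # first command-line argument

if [ -z "$sample_name" ]; then
    echo "Usage: ./process_fastqs.sh <sample_name>" >&2
    exit 1
fi

# ... your mkdir, cp, and echo steps go here ...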
Summary Checklist
- [ ] Start scripts with #!/bin/bash.
- [ ] Use chmod +x to make scripts runnable.
- [ ] Declare and use variables to make scripts flexible.
- [ ] Master the for loop for batch processing.
- [ ] Add comments and basic error checking (if statements).
Bash scripting is the backbone of efficient bioinformatics. Automate today, innovate tomorrow.


