Trimming reads

Overview

Teaching: 30 min
Exercises: 25 min
Learning Objectives
  • Clean FASTQ reads using Trimmommatic.

  • Select and set multiple options for command-line bioinformatics tools.

  • Write for loops with two variables.

Cleaning Reads

In the previous episode, we took a high-level look at the quality of each of our samples using FastQC. We vizualized per-base quality graphs showing the distribution of read quality at each base across all reads in a sample and extracted information about which samples fail which quality checks. Some of our samples failed quite a few quality metrics used by FastQC. This doesn’t mean, though, that our samples should be thrown out! It’s very common to have some quality metrics fail, and this may or may not be a problem for your downstream application. For our variant callig workflow, we will be removing some of the low quality sequences to reduce our false positive rate due to sequencing error.

We will use a program called Trimmomatic to filter poor quality reads and trim poor quality bases from our samples.

Trimmomatic Options

Trimmomatic has a variety of options to trim your reads. If we run the command, we can see some of our options. If you installed it via apt, there are two command line versions, one for single end and on for paired end sequencing. Let’s look at the one for paired end sequencing:

TrimmomaticPE
Usage: TrimmomaticPE [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] [-quiet] [-validatePairs] [-basein <inputBase> | <inputFile1> <inputFile2>] [-baseout <outputBase> | <outputFile1P> <outputFile1U> <outputFile2P> <outputFile2U>] <trimmer1>...

There is a similar version for single end reads, TrimmmomaticSE.

Reads get dropped when trimming

When trimming reads from a fastq file, some reads are naturally dropped. For example, an entire read may be trimmed so that there is nothing left, or the read remaining after trimming may be shorter than the minimum length requirement for keeping it. In the case of single-end (SE) sequencing, the consequences are simple: reads are simply dropped from the results file. In the case of PE sequencing, if one read of a mate pair is dropped, then the other read is orphaned. Trimmomatic produces separate files for the surviving pairs and for the surviving orphans (four files in total).

Next, we specify what flag we would like to run. For example, you can specify threads to indicate the number processors on your computer that you want Trimmomatic to use. These flags are not necessary, but they can give you more control over the command. The flags are followed by positional arguments, meaning the order in which you specify them is important. In paired end mode, Trimmomatic expects the two input files, and then the names of the output files. These files are described below.

option meaning
<inputFile1> Input reads to be trimmed. Typically the file name will contain an 1 or R1 in the name.
<inputFile2> Input reads to be trimmed. Typically the file name will contain an 2 or R2 in the name.
<outputFile1P> Output file that contains surviving pairs from the 1 file.
<outputFile1U> Output file that contains orphaned reads from the 1 file.
<outputFile2P> Output file that contains surviving pairs from the 2 file.
<outputFile2U> Output file that contains orphaned reads from the 2 file.

The last thing trimmomatic expects to see is the trimming parameters:

step meaning
ILLUMINACLIP Perform adapter removal
SLIDINGWINDOW Perform sliding window trimming, cutting once the average quality within the window falls below a threshold.
LEADING Cut bases off the start of a read, if below a threshold quality.
TRAILING Cut bases off the end of a read, if below a threshold quality.
CROP Cut the read to a specified length.
HEADCROP Cut the specified number of bases from the start of the read.
MINLEN Drop an entire read if it is below a specified length.
TOPHRED33 Convert quality scores to Phred-33.
TOPHRED64 Convert quality scores to Phred-64.

We will use only a few of these options and trimming steps in our analysis. It is important to understand the steps you are using to clean your data. For more information about the Trimmomatic arguments and options, see the Trimmomatic manual.

However, a complete command for Trimmomatic will look something like the command below. This command is an example and will not work as we do not have the files it refers to:

TrimmomaticPE -threads 4 SRR_1056_1.fastq SRR_1056_2.fastq  \
              SRR_1056_1.trimmed.fastq SRR_1056_1un.trimmed.fastq \
              SRR_1056_2.trimmed.fastq SRR_1056_2un.trimmed.fastq \
              ILLUMINACLIP:SRR_adapters.fa SLIDINGWINDOW:4:20

In this example, we’ve told Trimmomatic:

code meaning
-threads 4 to use four computing threads to run (this will spead up our run)
SRR_1056_1.fastq the first input file name
SRR_1056_2.fastq the second input file name
SRR_1056_1.trimmed.fastq the output file for surviving pairs from the _1 file
SRR_1056_1un.trimmed.fastq the output file for orphaned reads from the _1 file
SRR_1056_2.trimmed.fastq the output file for surviving pairs from the _2 file
SRR_1056_2un.trimmed.fastq the output file for orphaned reads from the _2 file
ILLUMINACLIP:SRR_adapters.fa to clip the Illumina adapters from the input file using the adapter sequences listed in SRR_adapters.fa
SLIDINGWINDOW:4:20 to use a sliding window of size 4 that will remove bases if their phred score is below 20

Running Trimmomatic

Now we will run Trimmomatic on our data. To begin, navigate to your data directory:

cd data

We are going to run Trimmomatic on one of our paired-end samples. We saw using FastQC that Nextera adapters were present in our samples. The adapter sequences came with the installation of trimmomatic, so let’s find them. It turns out that underneath apt is a program called dpkg. We can list all the files that were installed with a package like this

dpkg -L trimmomatic
/.
/usr
/usr/bin
/usr/bin/TrimmomaticPE
/usr/share
/usr/share/doc
/usr/share/doc/trimmomatic
/usr/share/doc/trimmomatic/README.Debian
/usr/share/doc/trimmomatic/README.test
/usr/share/doc/trimmomatic/changelog.Debian.gz
/usr/share/doc/trimmomatic/copyright
/usr/share/doc/trimmomatic/examples
/usr/share/doc/trimmomatic/examples/batchTrim
/usr/share/doc/trimmomatic/run-unit-test
/usr/share/doc/trimmomatic/test_data
/usr/share/doc/trimmomatic/test_data/test_forward.fq.gz
/usr/share/doc/trimmomatic/test_data/test_reverse.fq.gz
/usr/share/doc/trimmomatic/test_data/test_single.fq.gz
/usr/share/java
/usr/share/java/trimmomatic-0.36.jar
/usr/share/man
/usr/share/man/man1
/usr/share/man/man1/TrimmomaticPE.1.gz
/usr/share/trimmomatic
/usr/share/trimmomatic/NexteraPE-PE.fa
/usr/share/trimmomatic/TruSeq2-PE.fa
/usr/share/trimmomatic/TruSeq2-SE.fa
/usr/share/trimmomatic/TruSeq3-PE-2.fa
/usr/share/trimmomatic/TruSeq3-PE.fa
/usr/share/trimmomatic/TruSeq3-SE.fa
/usr/bin/TrimmomaticSE
/usr/share/java/trimmomatic.jar
/usr/share/man/man1/TrimmomaticSE.1.gz

Assuming we are currently in the data directory, let’s copy those adapter to it.

cp /usr/share/trimmomatic/NexteraPE-PE.fa . 

We will also use a sliding window of size 4 that will remove bases if their phred score is below 20 (like in our example above). We will also discard any reads that do not have at least 20 bases remaining after this trimming step. This command will take a few minutes to run.

TrimmomaticPE SRR2589044_1.fastq.gz SRR2589044_2.fastq.gz \
              SRR2589044_1.trim.fastq.gz SRR2589044_1un.trim.fastq.gz \
              SRR2589044_2.trim.fastq.gz SRR2589044_2un.trim.fastq.gz \
              SLIDINGWINDOW:4:20 MINLEN:25 ILLUMINACLIP:/usr/share/trimmomatic/NexteraPE-PE.fa:2:40:15 
TrimmomaticPE: Started with arguments:
 SRR2589044_1.fastq.gz SRR2589044_2.fastq.gz SRR2589044_1.trim.fastq.gz SRR2589044_1un.trim.fastq.gz SRR2589044_2.trim.fastq.gz SRR2589044_2un.trim.fastq.gz SLIDINGWINDOW:4:20 MINLEN:25 ILLUMINACLIP:/usr/share/trimmomatic/NexteraPE-PE.fa:2:40:15
Using PrefixPair: 'AGATGTGTATAAGAGACAG' and 'AGATGTGTATAAGAGACAG'
Using Long Clipping Sequence: 'GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG'
Using Long Clipping Sequence: 'TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG'
Using Long Clipping Sequence: 'CTGTCTCTTATACACATCTCCGAGCCCACGAGAC'
Using Long Clipping Sequence: 'CTGTCTCTTATACACATCTGACGCTGCCGACGA'
ILLUMINACLIP: Using 1 prefix pairs, 4 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Quality encoding detected as phred33
Input Read Pairs: 1107090 Both Surviving: 885220 (79.96%) Forward Only Surviving: 216472 (19.55%) Reverse Only Surviving: 2850 (0.26%) Dropped: 2548 (0.23%)
TrimmomaticPE: Completed successfully

Exercise

Use the output from your Trimmomatic command to answer the following questions.

1) What percent of reads did we discard from our sample? 2) What percent of reads did we keep both pairs?

Solution

1) 0.23% 2) 79.96%

You may have noticed that Trimmomatic automatically detected the quality encoding of our sample. It is always a good idea to double-check this or to enter the quality encoding manually.

We can confirm that we have our output file, and that the output files are smaller than the input

ls -l -h SRR2589044*
-rw-r--r-- 1 gtk gtk 124M Feb 14 14:28 SRR2589044_1.fastq.gz
-rw-r--r-- 1 gtk gtk  94M Feb 19 16:22 SRR2589044_1.trim.fastq.gz
-rw-r--r-- 1 gtk gtk  18M Feb 19 16:22 SRR2589044_1un.trim.fastq.gz
-rw-r--r-- 1 gtk gtk 128M Feb 14 14:28 SRR2589044_2.fastq.gz
-rw-r--r-- 1 gtk gtk  91M Feb 19 16:22 SRR2589044_2.trim.fastq.gz
-rw-r--r-- 1 gtk gtk 271K Feb 19 16:22 SRR2589044_2un.trim.fastq.gz

We’ve just successfully run Trimmomatic on one of our FASTQ files! However, there is some bad news. Trimmomatic can only operate on one sample at a time and we have more than one sample. The good news is that we can use a for loop to iterate through our sample files quickly!

As a separate matter, we’ve just added four more files to our data directory, but at least in my mind the output of Trimmomatic should be considered “results”, not data. So the best way to organise our results is to make separate directories for the trimmed pairs and the orphans.

# get rid of the results in our data directory
rm *trim*
# go up one from the data directory
cd ..
# make a separate directory for the trimmed pairs and orphaned results
mkdir -p results/trimmed
mkdir -p results/orphaned

# let's set this up to place the results where we want

for file in  data/*_1.fastq.gz;
do
    SRR=$(basename $file _1.fastq.gz)
    echo working on $SRR
    TrimmomaticPE data/${SRR}_1.fastq.gz data/${SRR}_2.fastq.gz \
                  results/trimmed/${SRR}_1.trim.fastq.gz results/orphaned/${SRR}_1.untrim.fastq.gz \
                  results/trimmed/${SRR}_2.trim.fastq.gz results/orphaned/${SRR}_2.untrim.fastq.gz \
                  SLIDINGWINDOW:4:20 MINLEN:25 ILLUMINACLIP:NexteraPE-PE.fa:2:40:15 
done

Go ahead and run the for loop. It should take a few minutes for Trimmomatic to run for each of our six input files. Once it’s done running, take a look at your directory contents. You’ll notice that even though we ran Trimmomatic on file SRR2589044 before running the for loop, there is only one set of files for it. Because of we matched the ending _1.fastq.gz, we re-ran Trimmomatic on this file, overwriting our first results. That’s ok, but it’s good to be aware that it happened.

ls results/trimmed
ls results/orphaned
SRR2584863_1.trim.fastq.gz
SRR2584863_2.trim.fastq.gz
SRR2584866_1.trim.fastq.gz
SRR2584866_2.trim.fastq.gz
SRR2589044_1.trim.fastq.gz
SRR2589044_2.trim.fastq.gz
SRR2584863_1.untrim.fastq.gz
SRR2584863_2.untrim.fastq.gz
SRR2584866_1.untrim.fastq.gz
SRR2584866_2.untrim.fastq.gz
SRR2589044_1.untrim.fastq.gz
SRR2589044_2.untrim.fastq.gz

We’ve now completed the trimming and filtering steps of our quality control process!

Bonus Exercise (Advanced)

Now that we’ve quality controled our samples, they should perform better on the quality tests run by FastQC. Go ahead and re-run FastQC on your trimmed FASTQ files and visualize the HTML files to see whether your per base sequence quality is higher after trimming.

Solution

In your terminal window (at the top directory for this episode) do

mkdir -p results/trimmed/fastqc
 fastqc  results/trimmed/*.fastq.gz  -o results/trimmed/fastqc/

After trimming and filtering, our overall quality is much higher, we have a distribution of sequence lengths, and more samples pass adapter content. However, quality trimming is not perfect, and some programs are better at removing some sequences than others. Because our sequences still contain 3’ adapters, it could be important to explore other trimming tools like cutadapt to remove these, depending on your downstream application. Trimmomatic did pretty well though, and its performance is good enough for our workflow.

Key Points

  • The options you set for the command-line tools you use are important!

  • Data cleaning is an essential step in a genomics workflow.