Genomic Data Analysis: NGS data wrangling on the command line

As we saw in the bash introductory lesson, the Linux shell is a powerful system for interacting with genomic data. Because NGS data files are so large and are often processed end-to-end, the Unix tool/pipe metaphor works particularly well for high-throughput sequencing experiments.

What does the command line approach offer over GUIs?

Repetition
You will often need to repeat the same tasks with multiple input files. As the number of input files grows, the advantages of a command-line interface over a graphical user interface (GUI) also grow.
Reproducibility
Command-line tools provide greater reproducibility than GUIs: if you need to change the parameters, you can do so and regenerate all of the downstream results.
Project organisation
As you work with more data, you will generate more results. It’s not uncommon for a single project to generate hundreds of data files. If you are working at a command line interface, you can think about how you want to organise those files and do so in a consistent way.
Scaling from desktops to servers
GUIs are great for interactive use on an individual computer, but if you need the power of a server there is likely no GUI application to meet your needs. Command line interfaces scale well to server-baed computing.

This lesson will take you through an example genomic analysis of high-throughput (NGS) sequencing data and and will help you build the skills for reproducible research.

Prerequisites

A working understanding of the bash shell, as introduced earlier in the module, is assumed. R or Python programming skills may at times be useful, but are not required.

This lesson also assumes some familiarity with biological concepts, including the structure of DNA, nucleotide abbreviations, and the concept of genomic variation within a population.



Schedule

Time Topic Learning Objectives
Before start Setup
14:00 Background and metadata
  • Understand the data set
  • What is hypermutability?
14:15 Assessing read quality
  • Explain how a FASTQ file encodes per-base quality scores.
  • Interpret a FastQC plot summarizing per-base quality across all reads.
  • Use for loops to automate operations on multiple files.
15:05 Trimming reads
  • Clean FASTQ reads using Trimmommatic.
  • Select and set multiple options for command-line bioinformatics tools.
  • Write for loops with two variables.
16:00 Variant calling workflow
  • Understand the steps involved in variant calling.
  • Describe the types of data formats encountered during variant calling.
  • Use command line tools to perform variant calling.
17:00 Automating a variant calling workflow
  • Write a shell script with multiple variables.
  • Incorporate a for loop into a shell script.
17:45 Finish