1. Organise your file system
Create a directory for this lesson somewhere on your computer. Inside that directory, create three directories: data
, src
, and results
. These will be the starting organisation of a project for this lesson.
2. Install the software you will need
You can run the software installation part (described in detail below) by running ./data/wsl-install.sh. MacOSX users may have to run brew instead of sudo apt
The following software (instructions for Unbuntu linux systems) will be used for this lesson and future lessons.
wget and/or curl
wget and curl are command line tools for accessing resources on the web. They have slightly different command line interfaces.
sudo apt install wget curl
fastqc
For an Ubuntu Linux system, you can install fastqc using the apt
system on the bash command line because fastq is part of the standard package archive. If you need to install another way or find the original materials, the are https://www.bioinformatics.babraham.ac.uk/projects/fastqc/.
sudo apt install fastqc
Trimmomatic
Trimmomatic is a read trimming tool that work on the command line It can be installed with apt.
sudo apt install trimmomatic
Picard tools
Fastq actually uses a library called picard, which has an associated set of command line tools. We may use them later in the semester. To install,
sudo apt install picard-tools
Bowtie and bowtie2
Bowtie and bowtie2 are programs for rapid alignment of next-generation sequencing data to a reference. You can install bowtie and bowtie2 on Ubuntu using apt:
sudo apt install bowtie bowtie2
samtools
Samtools is also available on the ubuntu sytem in apt, so you can use
sudo apt install samtools
bcftools
sudo apt install bcftools
bedtools
The amazing bedtools is avaiable on apt:
sudo apt install bedtools
3. Download the data you will need
Let’s go get the data we’ll need using wget and curl. BE SURE TO HAVE CREATED THE CORRECT DIRECTORIES FIRST
The steps below can be run by running ./data/get-data.sh when in the correct directory.
metadata
curl
normally places whatever it retrieves in the standard output. To place it in a file of the same name as the remote file, you can use the “-O” option.
cd data
curl -O https://raw.githubusercontent.com/data-lessons/wrangling-genomics/gh-pages/files/Ecoli_metadata_composite.csv
wget
places output in a file by default, much like curl -O
does.
fastq data
cd data
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/004/SRR2589044/SRR2589044_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/004/SRR2589044/SRR2589044_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/003/SRR2584863/SRR2584863_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/003/SRR2584863/SRR2584863_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/006/SRR2584866/SRR2584866_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/006/SRR2584866/SRR2584866_2.fastq.gz