5  Alignment to The Reference Genome

5.1 Download reference genome and annotation file

  • Human reference genome
mkdir hg38/

cd hg38/

wget http://hgdownload.soe.ucsc.edu/\
goldenPath/hg38/bigZips/hg38.fa.gz

wget https://hgdownload.soe.ucsc.edu/\
goldenPath/hg38/bigZips/genes/hg38.refGene.gtf.gz
  • Mouse Reference Genome
mkdir mm39/

cd mm39/

wget https://hgdownload.soe.ucsc.edu/\
goldenPath/mm39/bigZips/mm39.fa.gz
 
wget https://hgdownload.soe.ucsc.edu/\
goldenPath/mm39/bigZips/genes/refGene.gtf.gz

5.2 Generating Indexes

  • We must first generate an index of the genome we want to align to, so that there tools can efficently map over millions of sequences
cd hg38/

STAR --runMode genomeGenerate \
--genomeDir star_index \
--genomeFastaFiles hg38.fa \
--sjdbGTFfile hg38.refGene.gtf \
--runThreadN 40

5.3 Alignment using STAR

  • Check the following trial script whether it is working or not by executing it on the terminal.

  • Go back one directory upward of trim_galore.

for filename in trim_galore/fastq/*_R1_val_1.fq.gz
do
base=$(basename $filename _R1_val_1.fq.gz)
echo "Running alignment for ${base} with STAR"
done
 vim star.sh
  • Insert the following script by pressing ‘I’ in your keyboard (meaning insert). Paste the following script. Press Esc and then type :wq to save. Later make the file executable.
 chmod +x star.sh
 bash star.sh
#!/bin/bash/
echo "Starting Alignment"
for filename in trim_galore/fastq/*_R1_val_1.fq.gz
do
base=$(basename $filename _R1_val_1.fq.gz)
echo "Running alignment for ${base} with STAR"
STAR --runMode alignReads --outSAMtype BAM SortedByCoordinate \
--readFilesCommand zcat\ 
--genomeDir ~/hg38/star_index \
--outFileNamePrefix output/${base} \
--quantMode GeneCounts \
--readFilesIn trim_galore/fastq/${base}_R1_val_1.fq.gz\
trim_galore/fastq/${base}_R2_val_2.fq.gz \
--runThreadN 20
done
  • This process will generate a directory ‘output’.

  • ‘output’ folder will contain files with the naming pattern: “sortedByCoord.out.bam”.

  • These files would be used in the next step of featureCount.

5.4 Run featureCount

cd output/

mkdir aligned_bam/

mv *.bam aligned_bam/

mkdir featureCount/

featureCounts -p -T 4 -a ~/hg38/hg38.refGene.gtf \ 
-o featureCount/final_counts_all.txt -g 'gene_name'\
aligned_bam/*.out.bam
  • -p This is only applicable for paired-end reads.
    -T specifies the number (n) of threads to be used.
    -a is the genome annotation file (example_genome_annotation.gtf).
    -o specifies the name of the output file, which includes the read counts (example_featureCounts_output.txt).

5.5 Run multiqc

multiqc output/ --outdir multiqc_all
  • It creates a html file compiling all the alignment data.

  • Transfer featurecount and multiqc files to the local computer for further analysis.

  • In the next section we shall discuss about differential gene expression analysis in RStudio.