5 Alignment to The Reference Genome
5.1 Download reference genome and annotation file
- Human reference genome
mkdir hg38/
cd hg38/
wget http://hgdownload.soe.ucsc.edu/\
goldenPath/hg38/bigZips/hg38.fa.gz
wget https://hgdownload.soe.ucsc.edu/\
goldenPath/hg38/bigZips/genes/hg38.refGene.gtf.gz
- Mouse Reference Genome
mkdir mm39/
cd mm39/
wget https://hgdownload.soe.ucsc.edu/\
goldenPath/mm39/bigZips/mm39.fa.gz
wget https://hgdownload.soe.ucsc.edu/\
goldenPath/mm39/bigZips/genes/refGene.gtf.gz
5.2 Generating Indexes
- We must first generate an index of the genome we want to align to, so that there tools can efficently map over millions of sequences
cd hg38/
STAR --runMode genomeGenerate \
\
--genomeDir star_index \
--genomeFastaFiles hg38.fa \
--sjdbGTFfile hg38.refGene.gtf --runThreadN 40
5.3 Alignment using STAR
Check the following trial script whether it is working or not by executing it on the terminal.
Go back one directory upward of trim_galore.
for filename in trim_galore/fastq/*_R1_val_1.fq.gz
do
base=$(basename $filename _R1_val_1.fq.gz)
echo "Running alignment for ${base} with STAR"
done
vim star.sh
- Insert the following script by pressing ‘I’ in your keyboard (meaning insert). Paste the following script. Press Esc and then type :wq to save. Later make the file executable.
chmod +x star.sh
bash star.sh
#!/bin/bash/
echo "Starting Alignment"
for filename in trim_galore/fastq/*_R1_val_1.fq.gz
do
base=$(basename $filename _R1_val_1.fq.gz)
echo "Running alignment for ${base} with STAR"
STAR --runMode alignReads --outSAMtype BAM SortedByCoordinate \
\
--readFilesCommand zcat--genomeDir ~/hg38/star_index \
${base} \
--outFileNamePrefix output/\
--quantMode GeneCounts ${base}_R1_val_1.fq.gz\
--readFilesIn trim_galore/fastq/${base}_R2_val_2.fq.gz \
trim_galore/fastq/
--runThreadN 20done
This process will generate a directory ‘output’.
‘output’ folder will contain files with the naming pattern: “sortedByCoord.out.bam”.
These files would be used in the next step of featureCount.
5.4 Run featureCount
cd output/
mkdir aligned_bam/
mv *.bam aligned_bam/
mkdir featureCount/
featureCounts -p -T 4 -a ~/hg38/hg38.refGene.gtf \
-o featureCount/final_counts_all.txt -g 'gene_name'\
*.out.bam aligned_bam/
- -p This is only applicable for paired-end reads.
-T specifies the number (n) of threads to be used.
-a is the genome annotation file (example_genome_annotation.gtf).
-o specifies the name of the output file, which includes the read counts (example_featureCounts_output.txt).
5.5 Run multiqc
multiqc output/ --outdir multiqc_all
It creates a html file compiling all the alignment data.
Transfer featurecount and multiqc files to the local computer for further analysis.
In the next section we shall discuss about differential gene expression analysis in RStudio.