Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Processing ddRAD Quality check the reads

Navigate to exercise location

Code Block
cd intro_to_rad_2018/demultiplexing/process_ddRAD_stacks/


Investigate reads. We paid the sequencing facility for 50,000 paired end reads.

1. How can be confirm we got this?

Code Block
titlesolution
collapsetrue
#we have two fastq files
ls -lh *.fastq


#take a look at the top of the files
less Lib01_R1.fastq


#how long are the reads?


#type 'q' to exit


#how many lines are in each fastq file?
wc -l Lib01*.fastq


#look at how many reads there are (note that fastq generally files have 4 lines per read)
expr $(cat Lib01_R1.fastq | wc -l) / 4
expr $(cat Lib01_R2.fastq | wc -l) / 4



Paired end read files should have matching names read for read.

2. Double-check this is case for your reads

Code Block
titlesolution
collapsetrue
#this can be checked many ways, here is one option:


#search for lines that start with @ symbol (the read definition lines in our fastqs)
#and save the top 10 of them as a text file.
#Repeat for the R2 reads.
#Then line them up


grep "^@" Lib01_R1.fastq | head > first_10_R1_read_names.txt
grep "^@" Lib01_R2.fastq | head > first_10_R2_read_names.txt
paste first_10_R1_read_names.txt first_10_R2_read_names.txt

What if you wanted to repeat the above solution to check the bottom of the files as well?

Code Block
titlesolution
collapsetrue
grep "^@" Lib01_R1.fastq | tail > last_10_R1_read_names.txt
grep "^@" Lib01_R2.fastq | tail > last_10_R2_read_names.txt
paste last_10_R1_read_names.txt last_10_R2_read_names.txt

Based on the library preparation, we expect that our forward reads were cut with the restriction enzyme nlaIII (NLA3).

We expect that the vast majority , (say at least 96%) of our reads should have the recognition sequence in them.

3. How can we check that our reads largely fit this expectation? (Hint: we'll need to know the restriction enzymes recognition sequence)

Code Block
titlesolution
collapsetrue
#Looking up nlaIII, we see it's cut site:
#     CATG'
#    'GTAC
  
#check if we see this cut site in the forward reads
grep CATG Lib01_R1.fastq | wc -l


#compare with total read count
expr $(cat Lib01_R1.fastq | wc -l) / 4

Next run the program FastQC on the reads

Code Block
titleFastQC
collapsetrue
module load fastqc
mkdir raw_Fastqc_Restults
fastqc -o raw_Fastqc_Restults/ -f fastq Lib01_R*.fastq


#scp the resulting .html files to your personal computer to see

The reads appear to tail off in quality toward the ends. This can be fixed with a program called cutadapt.

Code Block
titlecutadapt and re-check quality
collapsetrue
#run cutadapt without any adapter sequences to cut, but with a PHRED quality cutoff of 30
#(a PHRED score of 30 indicates an expected error rate of 1/1000)
cutadapt --pair-filter=any -q 30 -o Lib01_R1_qced.fastq -p Lib01_R2_qced.fastq Lib01_R1.fastq Lib01_R2.fastq


#now rerun FastQC to see how it worked
mkdir qced_Fastqc_Restults
fastqc -o qced_Fastqc_Restults/ -f fastq Lib01_R*_qced.fastq

Demultiplexing

Now demultiplex the reads. We will do this with process_radtags from the packages STACKS.

Code Block
titledemultiplexing
collapsetrue
#look at the set of barcodes included for our samples
cat barcodes_Lib1.tsv 


#check the reads
grep CATG Lib01_R1.fastq | head


#check where these are in our reads
grep "^TCGAT" Lib01_R1.fastq | wc -l
grep "^CGATC" Lib01_R1.fastq | wc -l
grep "^AACCA" Lib01_R1.fastq | wc -l
grep "^GCATG" Lib01_R1.fastq | wc -l

#make a directory to put the resulting sample fastq files into
mkdir sample_fastqs
 
#look at the documentation for process_radtags
./process_radtags -h
  
#execute the command to process the rad data
./process_radtags -i 'fastq' -1 Lib01_R1.fastq -2 Lib01_R2.fastq -o ./sample_fastqs/ -b barcodes_Lib1.tsv --inline_index -e 'nlaIII' -r --disable_rad_check

From our barcode file, we expected to have four barcoded samples in our library.

Did we get all of our samples back?

Code Block
titlesolution
collapsetrue
#move to the output directory we used when demultiplexing
cd sample_fastqs/


#look at size of our output files
ls -lh *.fq


#remove the empty *.rem.* files
rm *.rem.*


#check the number of paired end files we have
ls *.1.fq | wc -l
ls *.2.fq | wc -l


#Are the paired end files for each sample the same size?
wc -l *.fq

Mapping

Now we're ready to map the reads to a reference genome

Code Block
titleMapping
collapsetrue
#check out our reference
less stickleback_chrom3.fasta 


#how many sequences are there?
grep "^>" stickleback_chrom3.fasta | wc -l


#index the reference
module load bwa
bwa index stickleback_chrom3.fasta


#run mapping
bwa mem stickleback_chrom3.fasta sample_fastqs/sample_AACCA-AGCGAC.1.fq sample_fastqs/sample_AACCA-AGCGAC.2.fq > sampleA.sam
bwa mem stickleback_chrom3.fasta sample_fastqs/sample_CGATC-AGCGAC.1.fq sample_fastqs/sample_CGATC-AGCGAC.2.fq > sampleB.sam
bwa mem stickleback_chrom3.fasta sample_fastqs/sample_GCATG-AGCGAC.1.fq sample_fastqs/sample_GCATG-AGCGAC.2.fq > sampleC.sam
bwa mem stickleback_chrom3.fasta sample_fastqs/sample_TCGAT-AGCGAC.1.fq sample_fastqs/sample_TCGAT-AGCGAC.2.fq > sampleD.sam


#assess mapping efficiency
module load samtools
samtools flagstat sampleA.sam > flagStatRes_sampleA.txt
samtools flagstat sampleB.sam > flagStatRes_sampleB.txt
samtools flagstat sampleC.sam > flagStatRes_sampleC.txt
samtools flagstat sampleD.sam > flagStatRes_sampleD.txt


#these results are not spectacular.
#80% mapping efficiency is OK, but the low rate of properly paired reads is dissapointing.


#convert the sam files to bam files
samtools sort sampleA.sam -O BAM -o sampleA.bam
samtools sort sampleB.sam -O BAM -o sampleB.bam
samtools sort sampleC.sam -O BAM -o sampleC.bam
samtools sort sampleD.sam -O BAM -o sampleD.bam

Genotyping

Now genotype the samples with mpileup.

Code Block
titlempileup
collapsetrue
#index the reference for samtools
module load samtools
samtools faidx stickleback_chrom3.fasta
 
#make a list of the bam files
ls *.bam > my_bamfiles.txt
 
#run mpileup
samtools mpileup -f stickleback_chrom3.fasta -u -b my_bamfiles.txt > mpileup_results.bcf


#now call genotypes from the mpileup results
bcftools call -vmO v -o raw_calls.vcf mpileup_results.bcf

Quality filter the variant calls:

Code Block
titlequality filtering
collapsetrue
#how many raw genotyped sites to be have?
vcftools --vcf raw_calls.vcf


      #After filtering, kept 4 out of 4 Individuals
      #After filtering, kept 14559 out of a possible 14559 Sites


#now perform basic quality filtering on raw calls
bcftools filter --exclude 'QUAL < 30' raw_calls.vcf | bcftools view > filt0.vcf
 
#check the new quality checked vcf
vcftools --vcf filt0.vcf


#Additionally, you can filter on representation among samples, allele frequency, and other options with VCFtools


#Filter for biallelic sites variants 
#genotyped in 75% of samples,
#with at least two reads covering the site


vcftools --vcf filt0.vcf --max-alleles 2 --max-missing 0.75 --minDP 2 --recode --out filt2

Complete.