process 2bRAD

Check the reads

We sequenced four different samples with single-end reads across two lanes on an Illumina Hiseq.

solution
#navigate to the directory
cd process_2bRAD

#look at our reads
ls *fastq.gz

#decompress them
gunzip *.gz

#We have two groups, A and B, that are labeled with
#barcode sequences CTGCAG and GAAGTT respectively.
#These groups represent pools of samples that all had
#the same barcode (attached during PCR step in library prep)
#Each group was has two files labeled with L004 and L005 indicating lanes 4 and 5 from the sequencing run
#concatenate the lane 4 and lane 5 data for each pool
cat A_CTGCAG_L004_R1.fastq A_CTGCAG_L005_R1.fastq > A_cat.fastq
cat B_GAAGTT_L004_R1.fastq B_GAAGTT_L005_R1.fastq > B_cat.fastq


#Look at the barcode data to get an idea of what these fastq files are
#This is the same information that would be uploaded to GSAF when the
#samples were submitted for sequencing.
less barcode_data.tsv

Compare this with expected final product from library prep:

Within each group (A or B) all samples had the same PCR barcode which appears in the file name (6 bases long).

The PCR barcode was read by the indexing primer and will not show up in the reads themselves. GSAF already used this barcode to demultiplex the reads for you into the groups A and B.

The individual samples in these groups have different ligation barcodes (4 bases long), which will appear within the sequencing reads. These ligation barcodes will be used to demultiplex the pooled fastq files for groups A and B into fastqs for the individual samples.

bcgl cut site:


Check the read content

The vast majority of our reads should have the bcgI recognition site.

How can we check this?

solution
#This library was prepared with bcgl restriction enzyme
#Check that the cut site appears in the reads
#How many reads do we have?
expr $(cat A_cat.fastq | wc -l) / 4
expr $(cat B_cat.fastq | wc -l) / 4
    #426295
	#444053
 
#(Note these are small for demo purposes)

#How many of the reads have the cut site?
grep "CGA......TGC" A_cat.fastq | wc -l
grep "CGA......TGC" B_cat.fastq | wc -l
    #216159
	#220686
  
#216159/426295 = 0.51
#220686/444053 = 0.50

Why is the cut site only in half our reads?

solution
#In the 2bRAD library prep the adapters can be ligated to the fragment in either
#fragment in either orientation (note the mirrored sticky ends left by bcgl).
#So the genomic fragment may have been read from either direction
  
#check for reverse complement of cut site
grep "GCA......TCG" A_cat.fastq | wc -l
grep "GCA......TCG" B_cat.fastq | wc -l
    #201574
	#202908

#(216159 + 201574)/426295 = 0.98
#(220686 + 202908)/444053 = 0.95

Demultiplexing

Separate the pooled fastq files based on the ligation barcodes found at the end of the reads.

solution
#The command to run trim2bRAD_2barcodes_dedup.pl is complicated, so
#another script -- 2bRAD_trim_launch_dedup.pl -- is used to generate the command for us:
  
2bRAD_trim_launch_dedup.pl cat.fastq > dedupCommands


#look at the commands file
cat   dedupCommands

#returns our next commands to execute:
trim2bRAD_2barcodes_dedup.pl input=A_CTGCAG_R1.fastq site=".{12}CGA.{6}TGC.{12}|.{12}GCA.{6}TCG.{12}" adaptor="AGATC" sampleID=100
trim2bRAD_2barcodes_dedup.pl input=B_GAAGTT_R1.fastq site=".{12}CGA.{6}TGC.{12}|.{12}GCA.{6}TCG.{12}" adaptor="AGATC" sampleID=100

Do we have all the samples we expect?

solution
#The resulting files are labled with their barcode combinations and a *tr0 suffix
ls -l A*tr0
ls -l B*tr0
  
#How many did we get?
ls -l A*tr0 | wc -l
ls -l B*tr0 | wc -l
  
#Looking back at our barcodes, does this make sense?
cat barcode_data.tsv


#4 samples total
#two from group A with pcr_barcode CTGCAG
#two from group B with pcr_barcode GAAGTT

Which sample is B_cat_ACCA.tr0?

solution
#search for a line in the barcode file with both the barcodes
grep TGGT barcode_data.tsv | grep GAAGTT
  
    #MCNA4
  
#Why search for TGGT instead of ACCA?
#The ligation barcode on 3Illbc was TGGT, but was then read as ACCA.
#When recording which barcode is which, easier to record the antiBC sequence
#(see 2bRAD library preparation protocol):
https://github.com/z0on/2bRAD_denovo/blob/master/2bRAD_protocol_june1_2018.pdf



Complete.