Pungitius exercise

This exercise uses data generated from three species of nine-spine sticklebacks: Pungitius pungitius, Pungitius sinensis, and Pungitius tymensis.

Check out data

The data are divided by chromosome.

How many individuals do we have and how many variant sites for each chromosome?

#check the number of individuals and sites for each vcf
for file in *.vcf
 do echo "----------------"
 vcftools --vcf $file
done

What are the samples names?

#The names appear in the top of the file
less chrI.vcf

How many individuals from each species are there?

#we can make a list of the sample names by extracting them from one of the VCFs
grep CHROM chrI.vcf | tr "\t" "\n" | grep [1-9] > all_samples.txt


#view the list
less all_samples.txt


#we can then count the different species
grep Pun all_samples.txt | wc -l
grep Sin all_samples.txt | wc -l
grep Tym all_samples.txt | wc -l

Principal Component Analysis

On personal computer, use the R script snp_pca.R to do PCA on the data. What's going on with chromosome 12?

VCFtools to window stats

VCFtools has some nice features to get stats in windows along chromosomes.

Return to TACC to look closer at variation between male and female Pungitius along chromosome 12

First we'll get Fst between males and females. We'll use sample_data.txt to build a single-column table of each.

grep Pun sample_data.txt | grep F | cut -f 1 > Pun_females.txt
grep Pun sample_data.txt | grep M | cut -f 1 > Pun_males.txt

Now run vcftools --weir-fst-pop in sliding window mode (see VCFtools documentation for how to build command).

#(back on TACC)
#First choose a window size and step size (here 100Kb)
WINDOW_SIZE=100000
WINDOW_STEP=100000

#Run for chromosome 1
vcftools --vcf chrI.vcf \
 --weir-fst-pop Pun_males.txt \
 --weir-fst-pop Pun_females.txt \
 --fst-window-size $WINDOW_SIZE \
 --fst-window-step $WINDOW_STEP \
 --out chrI_female_v_male

#chromosome 2
vcftools --vcf chrII.vcf \
 --weir-fst-pop Pun_males.txt \
 --weir-fst-pop Pun_females.txt \
 --fst-window-size $WINDOW_SIZE \
 --fst-window-step $WINDOW_STEP \
 --out chrII_female_v_male

#chromosome12
vcftools --vcf chrXII.vcf \
 --weir-fst-pop Pun_males.txt \
 --weir-fst-pop Pun_females.txt \
 --fst-window-size $WINDOW_SIZE \
 --fst-window-step $WINDOW_STEP \
 --out chrXII_female_v_male


#look at result files
ls -l *windowed.weir.fst


#send these to pungitius_exercise/ on personal computer to plot with R

We also want to look at nucleotide diversity along the chromosomes within each species.

We'll need a file listing the samples names for each species.

grep Pun sample_data.txt | cut -f 1 > Pun_samples.txt
grep Sin sample_data.txt | cut -f 1 > Sin_samples.txt
grep Tym sample_data.txt | cut -f 1 > Tym_samples.txt

Separate out the species from each chromosome into its own VCF.

#make a directory to keep them in
mkdir species_separated


#loop through the vcfs and to make commands
>separateCommands
for vcfFile in chr*.vcf
do echo "vcftools --vcf $vcfFile --keep Pun_samples.txt --recode --out species_separated/${vcfFile/.vcf/}_punOnly &
vcftools --vcf $vcfFile --keep Sin_samples.txt --recode --out species_separated/${vcfFile/.vcf/}_sinOnly &
vcftools --vcf $vcfFile --keep Tym_samples.txt --recode --out species_separated/${vcfFile/.vcf/}_tymOnly &" >> separateCommands
done


#look at commands
cat separateCommands


#copy paste to execute
#look at result files
ls species_separated/*.vcf

Now use vcftools --window-pi to look at nucleotide diversity within each species

#change to directory with species-separated vcfs
cd species_separated


#select window size and step (100Kb)
WINDOW_SIZE=100000
WINDOW_STEP=100000


#loop through vcfs and build commands for running --window-pi
>piCommands
for vcfFile in *.recode.vcf
do echo "vcftools --vcf $vcfFile --window-pi $WINDOW_SIZE --window-pi-step $WINDOW_STEP --out ${vcfFile/.recode.vcf} &" >> piCommands
done


#execute them


#look at results files
ls -l *windowed.pi


#send to pungitius_exercise/ on personal computer

Plot the resulting *weir.fst and *.pi results with plot_window_stats.R

A pseudoautosomal region (PAR) is a region of a sex chromosome that still undergoes recombination.

What is the likely boundary for the PAR in Pungitius pungitius? (plot window stats with plot_window_stats.R)

probably at about 3 Mb

Admixture

Run admixture on chromosome 1 and also the PAR and the divergent region of chr12

(based on the Fst plot, let's use these boundaries: PAR: 0 - 2Mb, divergent: 4 - 17 Mb)

#first switch to integers, because plink doesn't like roman numerals
sed 's/chrI/1/' chrI.vcf > chr1.vcf
sed 's/chrXII/12/' chrXII.vcf > chr12.vcf


#subset for the PAR and divergent region
vcftools --vcf chr12.vcf --chr 12 --from-bp 0 --to-bp 2000000 --recode --out PAR
vcftools --vcf chr12.vcf --chr 12 --from-bp 4000000 --to-bp 17000000 --recode --out SDR


#convert to bed files with plink
plink2 --vcf chr1.vcf --make-bed --out chr1 --allow-extra-chr
plink2 --vcf PAR.recode.vcf --make-bed --out PAR --allow-extra-chr
plink2 --vcf SDR.recode.vcf --make-bed --out SDR --allow-extra-chr


#run admixture
admixture chr1.bed 3
admixture PAR.bed 3
admixture SDR.bed 3


#send the *.Q files to personal computer to plot with plot_pungitius_admixture.R

Phylogenetics

(on TACC)
cd phylogenetics

Build gene trees from the PAR and from the divergent regions of chr12

Where did the Y chromosome come from?

#subset the vcf for the two regions
vcftools --vcf chr12_for_phylo.vcf --chr chrXII --from-bp 0 --to-bp 2000000 --recode --out PAR
vcftools --vcf chr12_for_phylo.vcf --chr chrXII --from-bp 4000000 --to-bp 17000000 --recode --out SDR


#use vcf-kit to build trees
vk phylo tree upgma PAR.recode.vcf > PAR.newick
vk phylo tree upgma SDR.recode.vcf > SDR.newick


#look at the trees on TACC with view_newick.py
view_newick.py PAR.newick
view_newick.py SDR.newick

For a nicer tree viewer, Dendroscope works well.

Build gene trees in windows along chromosome

#set up boundaries
seq 0 250000 19750000 > lefts
seq 250000 250000 20000000 > rights
paste lefts rights > bounds.txt


#now build commands to subset the vcf with VCFtools, then pipe the subset into VCF-kit for tree building
awk '{print "vcftools --vcf chr12_for_phylo.vcf --chr chrXII --from-bp "$1 " --to-bp "$2" --recode -c | vk phylo tree upgma - > window_"$1"-"$2".newick"}' bounds.txt > treeBuildCommands


#check the commands (should have 80 lines)
head treeBuildCommands
wc -l treeBuildCommands


#Double-check you are in an idev session
#If you are in an idev session, and had 80 lines, print to screen and execute


sh treeBuildCommands




#check that you have all trees
ls window*.newick | wc -l




#compile the resulting trees into a single file
#they need to be in order, so we can just simply cat them
#read bounds.txt again to cat them in order
echo ">allTrees.newick" >catTrees
while read bound
 do left=$(echo $bound | cut -d " " -f 1)
 right=$(echo $bound | cut -d " " -f 2)
 echo "cat window_${left}-${right}.newick | tr -d '\n' >> allTrees.newick && echo '' >>allTrees.newick" >> catTrees
done<bounds.txt


#execute the catting commands
sh catTrees


#check that the input is correct size (should be 80)
wc -l allTrees.newick


#now run Twisst on them
~/twisst/twisst.py -t allTrees.newick\
 -w topology_weights.txt\
 -g Pun_X -g Pun_Y -g Sin -g Tym\
 --groupsFile groups_file.txt\
 --method threshold\
 --thresholdTable ~/twisst/threshold_tables/binom_threshold_wilson_CI95_0.1.tsv



#This takes a bit.
#Once it's done, format the output
grep "^#" topology_weights.txt > topos.txt
grep -v "^#" topology_weights.txt > weights.txt


#send the results (all end in *.txt) to personal computer to plot with plot_twisst_results.R