Pungitius exercise

Pungitius exercise

 

This exercise uses data generated from three species of nine-spine sticklebacks: Pungitius pungitius, Pungitius sinensis, and Pungitius tymensis.

Check out data

The data are divided by chromosome.

How many individuals do we have and how many variant sites for each chromosome?

#check the number of individuals and sites for each vcf for file in *.vcf do echo "----------------" vcftools --vcf $file done

What are the samples names?

#The names appear in the top of the file less chrI.vcf

How many individuals from each species are there?

#we can make a list of the sample names by extracting them from one of the VCFs grep CHROM chrI.vcf | tr "\t" "\n" | grep [1-9] > all_samples.txt #view the list less all_samples.txt #we can then count the different species grep Pun all_samples.txt | wc -l grep Sin all_samples.txt | wc -l grep Tym all_samples.txt | wc -l

Principal Component Analysis

On personal computer, use the R script snp_pca.R to do PCA on the data. What's going on with chromosome 12?

 

VCFtools to window stats

VCFtools has some nice features to get stats in windows along chromosomes.

Return to TACC to look closer at variation between male and female Pungitius along chromosome 12

First we'll get Fst between males and females. We'll use sample_data.txt to build a single-column table of each.

grep Pun sample_data.txt | grep F | cut -f 1 > Pun_females.txt grep Pun sample_data.txt | grep M | cut -f 1 > Pun_males.txt

Now run vcftools --weir-fst-pop in sliding window mode (see VCFtools documentation for how to build command).

#(back on TACC) #First choose a window size and step size (here 100Kb) WINDOW_SIZE=100000 WINDOW_STEP=100000 #Run for chromosome 1 vcftools --vcf chrI.vcf \ --weir-fst-pop Pun_males.txt \ --weir-fst-pop Pun_females.txt \ --fst-window-size $WINDOW_SIZE \ --fst-window-step $WINDOW_STEP \ --out chrI_female_v_male #chromosome 2 vcftools --vcf chrII.vcf \ --weir-fst-pop Pun_males.txt \ --weir-fst-pop Pun_females.txt \ --fst-window-size $WINDOW_SIZE \ --fst-window-step $WINDOW_STEP \ --out chrII_female_v_male #chromosome12 vcftools --vcf chrXII.vcf \ --weir-fst-pop Pun_males.txt \ --weir-fst-pop Pun_females.txt \ --fst-window-size $WINDOW_SIZE \ --fst-window-step $WINDOW_STEP \ --out chrXII_female_v_male #look at result files ls -l *windowed.weir.fst #send these to pungitius_exercise/ on personal computer to plot with R

We also want to look at nucleotide diversity along the chromosomes within each species.

We'll need a file listing the samples names for each species.

grep Pun sample_data.txt | cut -f 1 > Pun_samples.txt grep Sin sample_data.txt | cut -f 1 > Sin_samples.txt grep Tym sample_data.txt | cut -f 1 > Tym_samples.txt

Separate out the species from each chromosome into its own VCF.

#make a directory to keep them in mkdir species_separated #loop through the vcfs and to make commands >separateCommands for vcfFile in chr*.vcf do echo "vcftools --vcf $vcfFile --keep Pun_samples.txt --recode --out species_separated/${vcfFile/.vcf/}_punOnly & vcftools --vcf $vcfFile --keep Sin_samples.txt --recode --out species_separated/${vcfFile/.vcf/}_sinOnly & vcftools --vcf $vcfFile --keep Tym_samples.txt --recode --out species_separated/${vcfFile/.vcf/}_tymOnly &" >> separateCommands done #look at commands cat separateCommands #copy paste to execute #look at result files ls species_separated/*.vcf

Now use vcftools --window-pi to look at nucleotide diversity within each species

#change to directory with species-separated vcfs cd species_separated #select window size and step (100Kb) WINDOW_SIZE=100000 WINDOW_STEP=100000 #loop through vcfs and build commands for running --window-pi >piCommands for vcfFile in *.recode.vcf do echo "vcftools --vcf $vcfFile --window-pi $WINDOW_SIZE --window-pi-step $WINDOW_STEP --out ${vcfFile/.recode.vcf} &" >> piCommands done #execute them #look at results files ls -l *windowed.pi #send to pungitius_exercise/ on personal computer

Plot the resulting *weir.fst and *.pi results with plot_window_stats.R

A pseudoautosomal region (PAR) is a region of a sex chromosome that still undergoes recombination.

What is the likely boundary for the PAR in Pungitius pungitius? (plot window stats with plot_window_stats.R)

probably at about 3 Mb

Admixture

Run admixture on chromosome 1 and also the PAR and the divergent region of chr12

(based on the Fst plot, let's use these boundaries: PAR: 0 - 2Mb, divergent: 4 - 17 Mb)

#first switch to integers, because plink doesn't like roman numerals sed 's/chrI/1/' chrI.vcf > chr1.vcf sed 's/chrXII/12/' chrXII.vcf > chr12.vcf #subset for the PAR and divergent region vcftools --vcf chr12.vcf --chr 12 --from-bp 0 --to-bp 2000000 --recode --out PAR vcftools --vcf chr12.vcf --chr 12 --from-bp 4000000 --to-bp 17000000 --recode --out SDR #convert to bed files with plink plink2 --vcf chr1.vcf --make-bed --out chr1 --allow-extra-chr plink2 --vcf PAR.recode.vcf --make-bed --out PAR --allow-extra-chr plink2 --vcf SDR.recode.vcf --make-bed --out SDR --allow-extra-chr #run admixture admixture chr1.bed 3 admixture PAR.bed 3 admixture SDR.bed 3 #send the *.Q files to personal computer to plot with plot_pungitius_admixture.R

Phylogenetics

(on TACC) cd phylogenetics

Build gene trees from the PAR and from the divergent regions of chr12

Where did the Y chromosome come from?

#subset the vcf for the two regions vcftools --vcf chr12_for_phylo.vcf --chr chrXII --from-bp 0 --to-bp 2000000 --recode --out PAR vcftools --vcf chr12_for_phylo.vcf --chr chrXII --from-bp 4000000 --to-bp 17000000 --recode --out SDR #use vcf-kit to build trees vk phylo tree upgma PAR.recode.vcf > PAR.newick vk phylo tree upgma SDR.recode.vcf > SDR.newick #look at the trees on TACC with view_newick.py view_newick.py PAR.newick view_newick.py SDR.newick

For a nicer tree viewer, Dendroscope works well.

Build gene trees in windows along chromosome

#set up boundaries seq 0 250000 19750000 > lefts seq 250000 250000 20000000 > rights paste lefts rights > bounds.txt #now build commands to subset the vcf with VCFtools, then pipe the subset into VCF-kit for tree building awk '{print "vcftools --vcf chr12_for_phylo.vcf --chr chrXII --from-bp "$1 " --to-bp "$2" --recode -c | vk phylo tree upgma - > window_"$1"-"$2".newick"}' bounds.txt > treeBuildCommands #check the commands (should have 80 lines) head treeBuildCommands wc -l treeBuildCommands #Double-check you are in an idev session #If you are in an idev session, and had 80 lines, print to screen and execute sh treeBuildCommands #check that you have all trees ls window*.newick | wc -l #compile the resulting trees into a single file #they need to be in order, so we can just simply cat them #read bounds.txt again to cat them in order echo ">allTrees.newick" >catTrees while read bound do left=$(echo $bound | cut -d " " -f 1) right=$(echo $bound | cut -d " " -f 2) echo "cat window_${left}-${right}.newick | tr -d '\n' >> allTrees.newick && echo '' >>allTrees.newick" >> catTrees done<bounds.txt #execute the catting commands sh catTrees #check that the input is correct size (should be 80) wc -l allTrees.newick #now run Twisst on them ~/twisst/twisst.py -t allTrees.newick\ -w topology_weights.txt\ -g Pun_X -g Pun_Y -g Sin -g Tym\ --groupsFile groups_file.txt\ --method threshold\ --thresholdTable ~/twisst/threshold_tables/binom_threshold_wilson_CI95_0.1.tsv #This takes a bit. #Once it's done, format the output grep "^#" topology_weights.txt > topos.txt grep -v "^#" topology_weights.txt > weights.txt #send the results (all end in *.txt) to personal computer to plot with plot_twisst_results.R