...
For major bonus points and a great THANK YOU from Scott, compute the mean and standard deviation of the intersected and subtracted SNPs from NA12878 vs. all, then perform a t-test to check whether the differences are statistically significant, using only Linux command line tools (probably in a shell script). Yes, it's probably easier in Python, Perl, or R.
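One possible starting point for the bonus exercise is sketched below. It assumes you have already written the values you want to compare (for example, QUAL scores) into two plain text files with one number per line. The function name `welch_t` and the choice of Welch's unequal-variance t-test are assumptions made for this sketch, not something the exercise prescribes.

```bash
# welch_t FILE_A FILE_B
# Hypothetical helper: each input file holds one numeric value per line.
# Prints n, mean and sample standard deviation for each file, then
# Welch's t statistic and the Welch-Satterthwaite degrees of freedom.
welch_t() {
    awk '
        # First file: accumulate count, sum and sum of squares for group A.
        FNR == NR { na++; sa += $1; qa += $1 * $1; next }
        # Second file: the same for group B.
        { nb++; sb += $1; qb += $1 * $1 }
        END {
            if (na < 2 || nb < 2) {
                print "need at least two values per file" > "/dev/stderr"
                exit 1
            }
            ma = sa / na; mb = sb / nb
            # Sample variances (n - 1 in the denominator).
            va = (qa - na * ma * ma) / (na - 1)
            vb = (qb - nb * mb * mb) / (nb - 1)
            se2 = va / na + vb / nb
            if (se2 <= 0) {
                print "both groups have zero variance" > "/dev/stderr"
                exit 1
            }
            t  = (ma - mb) / sqrt(se2)
            df = se2 * se2 / ((va / na) ^ 2 / (na - 1) + (vb / nb) ^ 2 / (nb - 1))
            printf "A: n=%d mean=%.4f sd=%.4f\n", na, ma, sqrt(va)
            printf "B: n=%d mean=%.4f sd=%.4f\n", nb, mb, sqrt(vb)
            printf "t=%.4f df=%.2f\n", t, df
        }
    ' "$1" "$2"
}
```

The input files could be produced with something like `grep -v '^#' some.vcf | cut -f6 > some.qual` (column 6 of a VCF record is QUAL; the file names here are placeholders). The sketch stops at the t statistic: turning it into a p-value needs the t distribution, so the simplest shell-only route is to compare |t| against a critical value from a t table for the reported degrees of freedom (about 1.96 for a two-sided 5% test when the degrees of freedom are large).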
Other Linux utilities useful for making subsets of VCF files and comparing them
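```bash
# Child (NA12878): keep position, genotype (first 3 characters of the sample
# column), REF and ALT. Sort on the position as text, because join expects
# its inputs to be sorted lexically on the join field.
awk 'BEGIN {FS="\t"} {print $2 "\t" substr($10,1,3) "\t" $4 "\t" $5}' NA12878.raw.vcf \
  | grep "0/1" | sort -k1,1 > NA12878.raw.vcf.simple.het
awk 'BEGIN {FS="\t"} {print $2 "\t" substr($10,1,3) "\t" $4 "\t" $5}' NA12878.raw.vcf \
  | grep "1/1" | sort -k1,1 > NA12878.raw.vcf.simple.hom
```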
```bash
# Parents (NA12891 and NA12892): same extraction, het and hom calls separately.
awk 'BEGIN {FS="\t"} {print $2 "\t" substr($10,1,3) "\t" $4 "\t" $5}' NA12891.raw.vcf \
  | grep "0/1" | sort -k1,1 > NA12891.raw.vcf.simple.het
awk 'BEGIN {FS="\t"} {print $2 "\t" substr($10,1,3) "\t" $4 "\t" $5}' NA12892.raw.vcf \
  | grep "0/1" | sort -k1,1 > NA12892.raw.vcf.simple.het
awk 'BEGIN {FS="\t"} {print $2 "\t" substr($10,1,3) "\t" $4 "\t" $5}' NA12891.raw.vcf \
  | grep "1/1" | sort -k1,1 > NA12891.raw.vcf.simple.hom
awk 'BEGIN {FS="\t"} {print $2 "\t" substr($10,1,3) "\t" $4 "\t" $5}' NA12892.raw.vcf \
  | grep "1/1" | sort -k1,1 > NA12892.raw.vcf.simple.hom
```
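The same extraction is repeated for every sample and genotype, so it can be wrapped in a small function. This is only a sketch: `simplify_gt` is a made-up name, and the records fed to it below are invented toy data, not real calls. It also keeps only the position, so it assumes all records come from a single chromosome.

```bash
# simplify_gt GENOTYPE < VCF
# Hypothetical helper: for records whose first sample has the given genotype,
# print position, genotype, REF and ALT, sorted as text on the position so
# the result can be fed to join.
simplify_gt() {
    awk -v gt="$1" 'BEGIN {FS = OFS = "\t"}
        !/^#/ && substr($10, 1, 3) == gt {print $2, gt, $4, $5}' | sort -k1,1
}

# Toy input: one header line and three made-up records.
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE\n' > toy.vcf
printf 'chr20\t300\t.\tC\tT\t40\tPASS\t.\tGT\t0/1:7\n' >> toy.vcf
printf 'chr20\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1:12\n' >> toy.vcf
printf 'chr20\t25\t.\tG\tA\t60\tPASS\t.\tGT\t1/1:9\n' >> toy.vcf

# Prints the two het records, position 100 first, then 300.
simplify_gt 0/1 < toy.vcf
rm -f toy.vcf
```

With a helper like this, each of the files above could be produced with one call, e.g. `simplify_gt 1/1 < NA12891.raw.vcf > NA12891.raw.vcf.simple.hom`.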
Now count how many genotypes (GT) are het in both parents (NA12891 and NA12892) but hom in the child (NA12878):
```bash
join NA12892.raw.vcf.simple.het NA12891.raw.vcf.simple.het > both.het
join both.het NA12878.raw.vcf.simple.hom | wc -l
```
(would you have expected this result?)
Now count how many genotypes are hom in both parents (NA12891 and NA12892) but het in the child (NA12878):
```bash
join NA12892.raw.vcf.simple.hom NA12891.raw.vcf.simple.hom > both.hom
join both.hom NA12878.raw.vcf.simple.het | wc -l
```
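To see what `join` is doing in these comparisons, it can help to run it on a few made-up lines first. The sketch below uses invented positions and a hypothetical helper, `count_trio`, that chains the two joins through a pipe instead of an intermediate file. `join` matches lines on their first field (here the position) and expects both inputs to be sorted as text on that field.

```bash
# count_trio PARENT_A PARENT_B CHILD
# Hypothetical helper: counts positions that appear in all three simplified
# genotype files (position, GT, REF, ALT; sorted as text on the position).
count_trio() {
    join "$1" "$2" | join - "$3" | wc -l
}

# Toy data (made-up positions, not real calls).
demo=$(mktemp -d)
printf '100\t0/1\tA\tG\n200\t0/1\tC\tT\n300\t0/1\tG\tA\n' > "$demo/mother.het"
printf '100\t0/1\tA\tG\n300\t0/1\tG\tA\n400\t0/1\tT\tC\n' > "$demo/father.het"
printf '300\t1/1\tG\tA\n500\t1/1\tA\tC\n' > "$demo/child.hom"

# Positions 100 and 300 are het in both parents; each output line carries
# the position followed by the remaining fields from both files.
join "$demo/mother.het" "$demo/father.het"

# Of those, only position 300 is hom in the child, so the count is 1.
count_trio "$demo/mother.het" "$demo/father.het" "$demo/child.hom"
rm -rf "$demo"
```

Because only the position is used as the key, this kind of comparison assumes every file covers the same single chromosome. It also does not check that the ALT alleles agree between samples.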
Virmid - an advanced auto-screener
...