...

Answer:

Output column 5 has the gene count.

Code Block
languagebash
cut -f 5 merged.sc_genes.txt | sort | uniq -c | sort -k2,2n

Produces this histogram:

Code Block
languagebash
   6374 1
    105 2
      4 3
      1 4
      1 7

There are 111 regions (105 + 4 + 1 + 1) where more than one gene contributed.
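
The multi-gene total can also be computed directly rather than by adding histogram rows by hand. The following is a minimal sketch on a made-up three-region file; the file name toy.merged.txt and the gene names are hypothetical and not course data. It assumes, as stated above, that column 5 holds the gene count.

```bash
# Build a tiny, made-up merged file (tab-separated):
# chrom, start, end, strand, gene count, gene names.
# These regions are illustrative only; they are not from merged.sc_genes.txt.
printf 'chrI\t100\t500\t+\t1\tgeneA\n'              >  toy.merged.txt
printf 'chrI\t900\t2000\t-\t2\tgeneB,geneC\n'       >> toy.merged.txt
printf 'chrII\t50\t4000\t+\t3\tgeneD,geneE,geneF\n' >> toy.merged.txt

# Count the regions whose gene count (column 5) is greater than 1.
# "n+0" makes awk print 0 instead of an empty string when nothing matches.
multi_gene_regions=$(awk -F'\t' '$5 > 1 {n++} END {print n+0}' toy.merged.txt)
echo "$multi_gene_regions"
```

Run against merged.sc_genes.txt instead of the toy file, the same awk command should report the 111 regions counted above.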

Exercise: Repeat the steps above, but first create a good.sc_genes.bed file that does not include Dubious ORFs.

Answer:
Code Block
languagebash
cd $SCRATCH/core_ngs/bedtools
grep -v 'Dubious' sc_genes.bed > good.sc_genes.bed

sort -k1,1 -k2,2n good.sc_genes.bed > good.sc_genes.sorted.bed
bedtools merge -i good.sc_genes.sorted.bed -s -c 4,4 -o count,collapse > merged.good.sc_genes.txt

wc -l good.sc_genes.bed merged.good.sc_genes.txt

There were 5797 "good" (non-Dubious) genes before merging and 5770 after.
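
One way to confirm that the grep -v filter dropped only the intended lines is to check that the kept and removed line counts add up to the original total. The sketch below uses a made-up three-gene file; the file names toy.genes.bed and toy.good.bed, the gene names, and the column holding the "Dubious" label are all assumptions for illustration, not the actual layout of sc_genes.bed.

```bash
# Build a tiny, made-up BED-like file (tab-separated). The last column
# carrying the ORF classification is a hypothetical layout.
printf 'chrI\t100\t500\tgeneA\t0\t+\tVerified\n'         >  toy.genes.bed
printf 'chrI\t900\t2000\tgeneB\t0\t-\tDubious\n'         >> toy.genes.bed
printf 'chrII\t50\t4000\tgeneC\t0\t+\tUncharacterized\n' >> toy.genes.bed

# Keep only the lines that do not contain the string "Dubious".
grep -v 'Dubious' toy.genes.bed > toy.good.bed

# Count total, kept, and removed lines.
total=$(awk 'END {print NR}' toy.genes.bed)
kept=$(awk 'END {print NR}' toy.good.bed)
removed=$(grep -c 'Dubious' toy.genes.bed)
echo "total=$total kept=$kept removed=$removed"
```

Note that grep -v 'Dubious' matches the string anywhere on the line, so a gene whose name or description happened to contain "Dubious" would also be dropped.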

Code Block
languagebash
cut -f 5 merged.good.sc_genes.txt | sort | uniq -c | sort -k2,2n

Produces this histogram:

Code Block
languagebash
   5750 1
     18 2
      1 4
      1 7

Now only 20 regions (18 + 1 + 1) contain more than one gene, down from 111. Clearly, eliminating the Dubious ORFs helped.
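
As a quick arithmetic check, the histogram can be tied back to the line counts reported by wc -l. The sketch below hard-codes the histogram rows (5750, 18, 1, and 1 regions containing 1, 2, 4, and 7 genes) rather than reading the real file; the variable names are illustrative.

```bash
# Histogram rows as "regions genes_per_region", the layout printed by uniq -c.
histogram='5750 1
18 2
1 4
1 7'

# Total regions = sum of column 1.
# Total genes   = sum of (column 1 * column 2).
totals=$(printf '%s\n' "$histogram" |
  awk '{regions += $1; genes += $1 * $2} END {print regions, genes}')
echo "$totals"
```

The two totals, 5770 regions and 5797 genes, agree with the after-merge and before-merge counts given above.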