Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Expand
titleAnswer
Code Block
languagebash
cd $SCRATCH/core_ngs/bedtools
grep -v 'Dubious' sc_genes.bed > good.sc_genes.bed

sort -k1,1 -k2,2n good.sc_genes.bed > good.sc_genes.sorted.bed
bedtools merge -i good.sc_genes.sorted.bed -s -c 4,4 -o count,collapse > merged.good.sc_genes.txt

wc -l good.sc_genes.bed, merged.good.sc_genes.txt

There were 5797 "good" (non-Dubious) genes before merging and 5770 after.

Code Block
languagebash
cut -f merged.good.sc_genes.txt | sort | uniq -c | sort -k2,2n

Produces this histogram:

Code Block
languagebash
   5750 1
     18 2
      1 4
      1 7

Now there are only 20 regions where more than one gene was collapsed. Clearly eliminating the Dubious ORFs helped.

Exercise: Why did we name the merged file with the extension .txt instead of .bed? What would we need to do to convert it to a proper BED6 file?

Expand
titleAnswer

The output does not follow the BED6 specification: "chrom, start, end, name, score, strand"

The first 3 output columns comply with the BED3 standard (chrom, start, end), but if strand is to be included, it should be in column 6. Column 4 should be name (we'll put the collapsed gene name list there), and column 5 a score (we'll put the region count there).

We can use awk to re-order the fields:

Code Block
languagebash
cat merged.good.sc_genes.txt | awk '
  BEGIN{FS=OFS="\t"}
  print{$1,$2,$3,$6,$5,$4}' > merged.good.sc_genes.bed