Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Here we're going to use bedtools merge to collapse our gene annotations into a non-overlapping set, first for all genes, then for only non-Dubious genes.

The output from bedtools merge always starts with either 3 or 4 columns:

  • chrom, start and end of the merged region, if a stranded merge was not requested
  • the strand of the merged region in column 4 if a stranded was requested

Using the -c (column) and -o (operation) options, you can have information added in subsequent fields. Each comma-separated column number following -c specifies a column to operate on, and the corresponding comma-separated function name following the -o specifies the operation to perform on that column in order to produce an additional output field.

For example, our sc_genes.bed file has a gene name in column 4, and for each (possibly merged) gene region, we want to know the number of gene regions that were collapsed into the region, and also which gene names were collapsed.

We can do this with -c 4,4 -o count,collapse, which says that two custom output columns should be added:

  • the 1st should result from counting the gene names in column 4 for all genes that were merged, and
  • the 2nd should be a comma-separated list of those same column 4 gene names

bedtools merge also requires that the input BED file be sorted by locus (chrom + start), so we do that first, then we request a strand-specific merge (-s):

Code Block
languagebash
cd $SCRATCH/core_ngs/bedtools
sort -k1,1 -k2,2n sc_genes.bed > sc_genes.sorted.bed
bedtools merge -i sc_genes.sorted.bed -s -c 4,4 -o count,collapse > merged.sc_genes.txt

The first few lines of the merged.sc_genes.txt file look like this:

Code Block
2-micron        251     1523    +       1       R0010W
2-micron        1886    3008    -       1       R0020C
2-micron        3270    3816    +       1       R0030W
2-micron        5307    6198    -       1       R0040C
chrI    334     792     +       2       YAL069W,YAL068W-A
chrI    1806    2169    -       1       YAL068C
chrI    2479    2707    +       1       YAL067W-A
chrI    7234    9016    -       1       YAL067C
chrI    10090   10399   +       1       YAL066W
chrI    11564   11951   -       1       YAL065C

Output column 4 has the region's strand (since we asked for a strand-specific merge). Column 5 is the count of merged regions, and column 6 is a comma-separated list of the merged gene names.

Exercise: Compare the number of regions in the merged and before-merge gene files.

Expand
titleAnswer
Code Block
languagebash
wc -l sc_genes.bed merged.sc_genes.txt

There were 6485 genes before merging and 6485 after.

Exercise: How many regions represent only 1 gene, 2 genes, or more?

Expand
titleAnswer

Output column 5 has the gene count.

Code Block
languagebash
cut -f merged.sc_genes.txt | sort | uniq -c | sort -k2,2n

Produces this histogram:

Code Block
languagebash
   5750 1
     18 2
      1 4
      1 7

Exercise: Repeat the steps above, but first create a good.sc_genes.bed file that does not include Dubious ORFs.

Expand
titleAnswer
Code Block
languagebash
cd $SCRATCH/core_ngs/bedtools
grep -v 'Dubious' sc_genes.bed > good.sc_genes.bed

sort -k1,1 -k2,2n good.sc_genes.bed > good.sc_genes.sorted.bed
bedtools merge -i good.sc_genes.sorted.bed -s -c 4,4 -o count,collapse > merged.good.sc_genes.txt

wc -l good.sc_genes.bed, merged.good.sc_genes.txt

There were 5797 "good" (non-Dubious) genes before merging and 5770 after.

Code Block
languagebash
cut -f merged.good.sc_genes.txt | sort | uniq -c | sort -k2,2n

Produces this histogram:

Code Block
languagebash
   6374 1
    105 2
      4 3
      1 4
      1 7