Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Using the -c (column) and -o (operation) options, you can have add information added in subsequent fields. Each comma-separated column number following -c specifies a column to operate on, and the corresponding comma-separated function name following the -o specifies the operation to perform on that column in order to produce an additional output field.

...

We can do this with -c 6,4,4 -o distinct,count,collapse, which says that three custom output columns should be added:

  • the 1st custom column (output column 4) should result from collapsing distinct (unique) values of gene file column 6 (the strand column, + or -)
    • since we will ask for stranded merging, the merged regions will always be on the same strand, so this value will always be + or -
  • the 2nd custom output column should result from counting the gene names in column 4 for all genes that were merged, and
  • the 3rd custom output should be a comma-separated collapsed list of those same column 4 gene names.

bedtools merge also requires that the input BED file be sorted by locus (chrom + start), so we do that first, then we request a strand-specific merge (-s):

...

The first few lines of the merged.sc_genes.txt file look like this (I've tidied it up a bit):

Code Block
2-micron        251     1523    +       1       R0010W
2-micron        1886    3008    -       1       R0020C
2-micron        3270    3816    +       1       R0030W
2-micron        5307    6198    -       1       R0040C
chrI            334     792     +       2       YAL069W,YAL068W-A
chrI            1806    2169    -       1       YAL068C
chrI            2479    2707    +       1       YAL067W-A
chrI            7234    9016    -       1       YAL067C
chrI            10090   10399   +       1       YAL066W
chrI            11564   11951   -       1       YAL065CYAL065C

As we specified:

  • Output column 4 has the region's strand.
  • Column 5 is the count of merged regions

...

  • Column 6 is a collapsed, comma-separated list of the merged gene names

...

Exercise: Compare the number of regions in the merged and before-merge gene files.

...

Expand
titleAnswer

Output column 5 has the gene count.

Code Block
languagebash
cut -f 5 merged.sc_genes.txt | sort | uniq -c | sort -k2,2n

Produces this histogram:

Code Block
languagebash
   6374 1
    105 2
      4 3
      1 4
      1 7

There are 111 regions (105 + 4 + 1 + 1) where more than one gene contributed.

Or being fancy:

Code Block
languagebash
cut -f 5 merged.sc_genes.txt | sort | uniq -c | sort -k2,2n \
  | grep -v ' 1$' | awk 'BEGIN{ct=0}{ct=ct+$1}END{print ct}'


Exercise: Repeat the steps above, but first create a good.sc_genes.bed file that does not include Dubious ORFs.

...

To make a valid BED6 file, we'll include the first 3 output columns of merged.good.sc_genes.txt (chrom, start, end), but if strand is to be included, it should be in column 6. Column 4 should be name (we'll put the collapsed gene name list there), and column 5 a score (we'll put the region count there).

...