...
Using the -c (column) and -o (operation) options, you can have add information added in subsequent fields. Each comma-separated column number following -c specifies a column to operate on, and the corresponding comma-separated function name following the -o specifies the operation to perform on that column in order to produce an additional output field.
...
We can do this with -c 6,4,4 -o distinct,count,collapse, which says that three custom output columns should be added:
- the 1st custom column (output column 4) should result from collapsing distinct (unique) values of gene file column 6 (the strand column, + or -)
- since we will ask for stranded merging, the merged regions will always be on the same strand, so this value will always be + or -
- the 2nd custom output column should result from counting the gene names in column 4 for all genes that were merged, and
- the 3rd custom output should be a comma-separated collapsed list of those same column 4 gene names.
bedtools merge also requires that the input BED file be sorted by locus (chrom + start), so we do that first, then we request a strand-specific merge (-s):
...
The first few lines of the merged.sc_genes.txt file look like this (I've tidied it up a bit):
Code Block |
---|
2-micron 251 1523 + 1 R0010W 2-micron 1886 3008 - 1 R0020C 2-micron 3270 3816 + 1 R0030W 2-micron 5307 6198 - 1 R0040C chrI 334 792 + 2 YAL069W,YAL068W-A chrI 1806 2169 - 1 YAL068C chrI 2479 2707 + 1 YAL067W-A chrI 7234 9016 - 1 YAL067C chrI 10090 10399 + 1 YAL066W chrI 11564 11951 - 1 YAL065CYAL065C |
As we specified:
- Output column 4 has the region's strand.
- Column 5 is the count of merged regions
...
- Column 6 is a collapsed, comma-separated list of the merged gene names
...
Exercise: Compare the number of regions in the merged and before-merge gene files.
...
Expand | |||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| |||||||||||||||
Output column 5 has the gene count.
Produces this histogram:
There are 111 regions (105 + 4 + 1 + 1) where more than one gene contributed. Or being fancy:
|
Exercise: Repeat the steps above, but first create a good.sc_genes.bed file that does not include Dubious ORFs.
...
To make a valid BED6 file, we'll include the first 3 output columns of merged.good.sc_genes.txt (chrom, start, end), but if strand is to be included, it should be in column 6. Column 4 should be name (we'll put the collapsed gene name list there), and column 5 a score (we'll put the region count there).
...