...
Here we're going to use bedtools merge to collapse our gene annotations into a non-overlapping set, first for all genes, then for only non-Dubious genes.
The output from bedtools merge always starts with either 3 or 4 columns:
- chrom, start and end of the merged region, if a stranded merge was not requested
- the strand of the merged region in column 4 if a stranded was requested
Using the -c (column) and -o (operation) options, you can have information added in subsequent fields. Each comma-separated column number following -c specifies a column to operate on, and the corresponding comma-separated function name following the -o specifies the operation to perform on that column in order to produce an additional output field.
For example, our sc_genes.bed file has a gene name in column 4, and for each (possibly merged) gene region, we want to know the number of gene regions that were collapsed into the region, and also which gene names were collapsed.
We can do this with -c 4,4 -o count,collapse, which says that two custom output columns should be added:
- the 1st should result from counting the gene names in column 4 for all genes that were merged, and
- the 2nd should be a comma-separated list of those same column 4 gene names
bedtools merge also requires that the input BED file be sorted by locus (chrom + start), so we do that first, then we request a strand-specific merge (-s):
Code Block | ||
---|---|---|
| ||
cd $SCRATCH/core_ngs/bedtools
sort -k1,1 -k2,2n sc_genes.bed > sc_genes.sorted.bed
bedtools merge -i sc_genes.sorted.bed -s -c 4,4 -o count,collapse > merged.sc_genes.txt |
The first few lines of the merged.sc_genes.txt file look like this:
Code Block |
---|
2-micron 251 1523 + 1 R0010W
2-micron 1886 3008 - 1 R0020C
2-micron 3270 3816 + 1 R0030W
2-micron 5307 6198 - 1 R0040C
chrI 334 792 + 2 YAL069W,YAL068W-A
chrI 1806 2169 - 1 YAL068C
chrI 2479 2707 + 1 YAL067W-A
chrI 7234 9016 - 1 YAL067C
chrI 10090 10399 + 1 YAL066W
chrI 11564 11951 - 1 YAL065C |
Output column 4 has the region's strand (since we asked for a strand-specific merge). Column 5 is the count of merged regions, and column 6 is a comma-separated list of the merged gene names.
Exercise: Compare the number of regions in the merged and before-merge gene files.
Expand | |||||
---|---|---|---|---|---|
| |||||
There were 6485 genes before merging and 6485 after. |
Exercise: How many regions represent only 1 gene, 2 genes, or more?
Expand | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||
Output column 5 has the gene count.
Produces this histogram:
|
Exercise: Repeat the steps above, but first create a good.sc_genes.bed file that does not include Dubious ORFs.
Expand | |||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| |||||||||||||||
There were 5797 "good" (non-Dubious) genes before merging and 5770 after.
Produces this histogram:
|