...
| Expand |
|---|
|
| Code Block |
|---|
| cd $SCRATCH/core_ngs/bedtools
grep -v 'Dubious' sc_genes.bed > good.sc_genes.bed
sort -k1,1 -k2,2n good.sc_genes.bed > good.sc_genes.sorted.bed
bedtools merge -i good.sc_genes.sorted.bed -s \
-c 6,4,4 -o distinct,count,collapse > merged.good.sc_genes.txt
wc -l good.sc_genes.bed merged.good.sc_genes.txt |
There were 5797 "good" (non-Dubious) genes before merging and 5770 after. | Code Block |
|---|
| cut -f 5 merged.good.sc_genes.txt | sort | uniq -c | sort -k2,2n |
Produces this histogram: | Code Block |
|---|
| 5750 1
18 2
1 4
1 7 |
Now there are only 20 regions where more than one gene was collapsed. Clearly eliminating the Dubious ORFs helped. |
Exercise: Why did we name the merged file with the extension .txt instead of .bed? What would So there's one more thing we need to do to convert it to a proper BED6 file?
...
...
create a valid BED format file. Our merged.good.sc_genes.txt columns are chrom, start, end, strand, merged_region_count, merged_region(s), but the BED6 specification is: chrom, start, end, name, score, strand
...
.
To make a valid BED6 file, we'll include the first 3 output columns
...
of merged.good.sc_genes.txt (chrom, start, end), but if strand is to be included, it should be in column 6. Column 4 should be name (we'll put the collapsed gene name list there), and column 5 a score (we'll put the region count there).
We can use awk to re-order the fields:
| Code Block |
|---|
|
cat merged.good.sc_genes.txt | awk '
BEGIN{FS=OFS="\t"}
{print $1,$2,$3,$6,$5,$4}' > merged.good.sc_genes.bed |
Use bedtools multicov to count feature overlaps
...
| Code Block |
|---|
| language | bash |
|---|
| title | Setup for BEDTools multicov |
|---|
|
# Get the merged yeast genes bed file if you didn't create one
mkdir -p $SCRATCH/core_ngs/bedtools_multicov
cd $SCRATCH/core_ngs/bedtools_multicov
cp $CORENGS/catchup/bedtools_merge/merged.sc_genes.*bed .
# Copy the BAM file
cd $SCRATCH/core_ngs/bedtools_multicov
cp $CORENGS/yeast_rnaseq/yeast_mrna.sort.filt.bam* . |
...
| Expand |
|---|
|
| Code Block |
|---|
| idev -m 120 -N 1 -A OTH21164 -r CoreNGSday5
module load biocontainers
module load samtools
module load bedtools
mkdir -p $SCRATCH/core_ngs/bedtools_multicov
cp $CORENGS/catchup/bedtools_merge/merged.sc_genes.bed*bed $SCRATCH/core_ngs/bedtools_multicov/
cp $CORENGS/yeast_rnaseq/yeast_mrna.sort.filt.bam* $SCRATCH/core_ngs/bedtools_multicov/ |
|
| Code Block |
|---|
| language | bash |
|---|
| title | Run bedtools multicov to count BAM alignments overlapping a set of genes |
|---|
|
cd $SCRATCH/core_ngs/bedtools_multicov
bedtools multicov -s -bams yeast_mrna.sort.filt.bam \
-bed merged.good.sc_genes.bed > yeast_mrna_gene_counts.bed |
...
| Expand |
|---|
|
| Code Block |
|---|
| wc -l yeast_mrna_gene_counts.bed |
6485 records were written, one for each feature in the merged.sc_genes.bed file. The overlap count was added as the last field in each output record (here field 97, since the input annotation file had 8 6 columns). |
Exercise: How many features have non-zero overlap counts?
| Expand |
|---|
|
| Expand |
|---|
|
| Code Block |
|---|
| cut -f 97 yeast_mrna_gene_counts.bed | grep -v '^0' | wc -l
# or
cat yeast_mrna_gene_counts.bed | \
awk '{if ($9 > 0) print $9}' | wc -l |
Most of the genes (6235/6607) have non-zero read overlap counts. |
Exercise: What is the total count of reads mapping to gene features?
| Code Block |
|---|
| l
# or
cat yeast_mrna_gene_counts.bed | awk\
' BEGIN{FS="\t";sum=0;tot=0}
{if($9awk '{if ($7 > 0) { sum = sum + $9; tot = tot + 1 }}
END{printf("%d overlapping reads in %d genes\n", sum, tot) }' |
There are 1144990 overlapping reads in 6235 gene annotations. |
Recall that in the yeast annotations from SGD there are 3 gene classifications: Verified, Uncharacterized and Dubious, and the Dubious ones have no experimental evidence.
...
Most of the genes (6141/6485) have non-zero read overlap counts. |
Exercise: What is the total count of reads mapping to gene features other than Dubious?
| Expand |
|---|
|
| grep -v 'Dubious' |
| Expand |
|---|
|
| Code Block |
|---|
| grep -v 'Dubious'cat yeast_mrna_gene_counts.bed | awk '
BEGIN{FS="\t";sum=0;tot=0}
{if($9$7 > 0) { sum = sum + $9$7; tot = tot + 1 }}
END{printf("%d overlapping reads in %d non-Dubious genes\n", sum, tot) }' |
There are 1093140 1,152,831 overlapping reads in 5578 6,141 non-Dubious genes0 gene annotations. |