Content Comparison

...

Expand

title	Answer

Code Block

language	bash

cd $SCRATCH/core_ngs/bedtools
grep -v 'Dubious' sc_genes.bed > good.sc_genes.bed

sort -k1,1 -k2,2n good.sc_genes.bed > good.sc_genes.sorted.bed
bedtools merge -i good.sc_genes.sorted.bed -s \
  -c 6,4,4 -o distinct,count,collapse > merged.good.sc_genes.txt

wc -l good.sc_genes.bed merged.good.sc_genes.txt

There were 5797 "good" (non-Dubious) genes before merging and 5770 after.

Code Block

language	bash

cut -f 5 merged.good.sc_genes.txt | sort | uniq -c | sort -k2,2n

Produces this histogram:

Code Block

language	bash

Now there are only 20 regions where more than one gene was collapsed. Clearly eliminating the Dubious ORFs helped.

Exercise: Why did we name the merged file with the extension .txt instead of .bed? What would So there's one more thing we need to do to convert it to a proper BED6 file?

...

title	Answer

...

create a valid BED format file. Our merged.good.sc_genes.txt columns are chrom, start, end, strand, merged_region_count, merged_region(s), but the BED6 specification is: chrom, start, end, name, score, strand

...

.

To make a valid BED6 file, we'll include the first 3 output columns

...

of merged.good.sc_genes.txt (chrom, start, end), but if strand is to be included, it should be in column 6. Column 4 should be name (we'll put the collapsed gene name list there), and column 5 a score (we'll put the region count there).

We can use awk to re-order the fields:

Code Block

language	bash

cat merged.good.sc_genes.txt | awk '
  BEGIN{FS=OFS="\t"}
  {print $1,$2,$3,$6,$5,$4}' > merged.good.sc_genes.bed

Use bedtools multicov to count feature overlaps

...

Code Block

language	bash
title	Setup for BEDTools multicov

# Get the merged yeast genes bed file if you didn't create one
mkdir -p $SCRATCH/core_ngs/bedtools_multicov
cd $SCRATCH/core_ngs/bedtools_multicov
cp $CORENGS/catchup/bedtools_merge/merged.sc_genes.*bed .

# Copy the BAM file
cd $SCRATCH/core_ngs/bedtools_multicov
cp $CORENGS/yeast_rnaseq/yeast_mrna.sort.filt.bam* .

...

Expand

title	Setup (if needed)

Code Block

language	bash

idev -m 120 -N 1 -A OTH21164 -r CoreNGSday5
module load biocontainers
module load samtools
module load bedtools

mkdir -p $SCRATCH/core_ngs/bedtools_multicov
cp $CORENGS/catchup/bedtools_merge/merged.sc_genes.bed*bed      $SCRATCH/core_ngs/bedtools_multicov/
cp $CORENGS/yeast_rnaseq/yeast_mrna.sort.filt.bam* $SCRATCH/core_ngs/bedtools_multicov/

Code Block

language	bash
title	Run bedtools multicov to count BAM alignments overlapping a set of genes

cd $SCRATCH/core_ngs/bedtools_multicov
bedtools multicov -s -bams yeast_mrna.sort.filt.bam \
  -bed merged.good.sc_genes.bed > yeast_mrna_gene_counts.bed

...

Expand

title	Answers

Code Block

language	bash

wc -l yeast_mrna_gene_counts.bed

6485 records were written, one for each feature in the merged.sc_genes.bed file.

The overlap count was added as the last field in each output record (here field 97, since the input annotation file had 8 6 columns).

Exercise: How many features have non-zero overlap counts?

Expand

title	Answer

Expand

title	Answer

Code Block

language	bash

cut -f 97 yeast_mrna_gene_counts.bed | grep -v '^0' | wc -l
# or
cat yeast_mrna_gene_counts.bed | \
  awk '{if ($9 > 0) print $9}' | wc -l

Most of the genes (6235/6607) have non-zero read overlap counts.

Exercise: What is the total count of reads mapping to gene features?

Code Block

language	bash

l
# or
cat yeast_mrna_gene_counts.bed | awk\
'  BEGIN{FS="\t";sum=0;tot=0}
 {if($9awk '{if ($7 > 0) { sum = sum + $9; tot = tot + 1 }}
 END{printf("%d overlapping reads in %d genes\n", sum, tot) }'

There are 1144990 overlapping reads in 6235 gene annotations.

Recall that in the yeast annotations from SGD there are 3 gene classifications: Verified, Uncharacterized and Dubious, and the Dubious ones have no experimental evidence.

...

print $7}' | wc -l

Most of the genes (6141/6485) have non-zero read overlap counts.

Exercise: What is the total count of reads mapping to gene features other than Dubious?

Expand

title	Hint

grep -v 'Dubious'

Expand

title	Answer

Code Block

language	bash

grep -v 'Dubious'cat yeast_mrna_gene_counts.bed | awk '
 BEGIN{FS="\t";sum=0;tot=0}
 {if($9$7 > 0) { sum = sum + $9$7; tot = tot + 1 }}
 END{printf("%d overlapping reads in %d non-Dubious genes\n", sum, tot) }'

There are 1093140 1,152,831 overlapping reads in 5578 6,141 non-Dubious genes0 gene annotations.

Version	Old Version 68	New Version 69
Changes made by	Anna Battenhouse	Anna Battenhouse
Saved on	Jun 19, 2022	Jun 19, 2022

Versions Compared

Key

Use bedtools multicov to count feature overlaps