Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Expand
titleAnswer


Code Block
languagebash
cd $SCRATCH/core_ngs/bedtools
grep -v 'Dubious' sc_genes.bed > good.sc_genes.bed

sort -k1,1 -k2,2n good.sc_genes.bed > good.sc_genes.sorted.bed
bedtools merge -i good.sc_genes.sorted.bed -s \
  -c 6,4,4 -o distinct,count,collapse > merged.good.sc_genes.txt

wc -l good.sc_genes.bed merged.good.sc_genes.txt

There were 5797 "good" (non-Dubious) genes before merging and 5770 after.

Code Block
languagebash
cut -f 5 merged.good.sc_genes.txt | sort | uniq -c | sort -k2,2n

Produces this histogram:

Code Block
languagebash
   5750 1
     18 2
      1 4
      1 7

Now there are only 20 regions where more than one gene was collapsed. Clearly eliminating the Dubious ORFs helped.

Exercise: Why did we name the merged file with the extension .txt instead of .bed? What would So there's one more thing we need to do to convert it to a proper BED6 file?

...

titleAnswer

...

create a valid BED format file. Our merged.good.sc_genes.txt columns are chrom, start, end, strand, merged_region_count, merged_region(s), but the BED6 specification is: chrom, start, end, name, score, strand

...

.

To make a valid BED6 file, we'll include the first 3 output columns

...

of merged.good.sc_genes.txt (chrom, start, end), but if strand is to be included, it should be in column 6. Column 4 should be name (we'll put the collapsed gene name list there), and column 5 a score (we'll put the region count there).

We can use awk to re-order the fields:

Code Block
languagebash
cat merged.good.sc_genes.txt | awk '
  BEGIN{FS=OFS="\t"}
  {print $1,$2,$3,$6,$5,$4}' > merged.good.sc_genes.bed

Use bedtools multicov to count feature overlaps

...

Code Block
languagebash
titleSetup for BEDTools multicov
# Get the merged yeast genes bed file if you didn't create one
mkdir -p $SCRATCH/core_ngs/bedtools_multicov
cd $SCRATCH/core_ngs/bedtools_multicov
cp $CORENGS/catchup/bedtools_merge/merged.sc_genes.*bed .

# Copy the BAM file
cd $SCRATCH/core_ngs/bedtools_multicov
cp $CORENGS/yeast_rnaseq/yeast_mrna.sort.filt.bam* .

...

Expand
titleSetup (if needed)


Code Block
languagebash
idev -m 120 -N 1 -A OTH21164 -r CoreNGSday5
module load biocontainers
module load samtools
module load bedtools

mkdir -p $SCRATCH/core_ngs/bedtools_multicov
cp $CORENGS/catchup/bedtools_merge/merged.sc_genes.bed*bed      $SCRATCH/core_ngs/bedtools_multicov/
cp $CORENGS/yeast_rnaseq/yeast_mrna.sort.filt.bam* $SCRATCH/core_ngs/bedtools_multicov/



Code Block
languagebash
titleRun bedtools multicov to count BAM alignments overlapping a set of genes
cd $SCRATCH/core_ngs/bedtools_multicov
bedtools multicov -s -bams yeast_mrna.sort.filt.bam \
  -bed merged.good.sc_genes.bed > yeast_mrna_gene_counts.bed

...

Expand
titleAnswers


Code Block
languagebash
wc -l yeast_mrna_gene_counts.bed

6485 records were written, one for each feature in the merged.sc_genes.bed file.

The overlap count was added as the last field in each output record (here field 97, since the input annotation file had 8 6 columns).

Exercise: How many features have non-zero overlap counts?

Expand
titleAnswer
Expand
titleAnswer


Code Block
languagebash
cut -f 97 yeast_mrna_gene_counts.bed | grep -v '^0' | wc -l
# or
cat yeast_mrna_gene_counts.bed | \
  awk '{if ($9 > 0) print $9}' | wc -l

Most of the genes (6235/6607) have non-zero read overlap counts.

Exercise: What is the total count of reads mapping to gene features?

Code Block
languagebash
l
# or
cat yeast_mrna_gene_counts.bed | awk\
'  BEGIN{FS="\t";sum=0;tot=0}
 {if($9awk '{if ($7 > 0) { sum = sum + $9; tot = tot + 1 }}
 END{printf("%d overlapping reads in %d genes\n", sum, tot) }'

There are 1144990 overlapping reads in 6235 gene annotations.

Recall that in the yeast annotations from SGD there are 3 gene classifications: Verified, Uncharacterized and Dubious, and the Dubious ones have no experimental evidence.

...

print $7}' | wc -l

Most of the genes (6141/6485) have non-zero read overlap counts.

Exercise: What is the total count of reads mapping to gene features other than Dubious?

Expand
titleHint
grep -v 'Dubious'
Expand
titleAnswer


Code Block
languagebash
grep -v 'Dubious'cat yeast_mrna_gene_counts.bed | awk '
 BEGIN{FS="\t";sum=0;tot=0}
 {if($9$7 > 0) { sum = sum + $9$7; tot = tot + 1 }}
 END{printf("%d overlapping reads in %d non-Dubious genes\n", sum, tot) }'

There are 1093140 1,152,831 overlapping reads in 5578 6,141 non-Dubious genes0 gene annotations.