Page Comparison

...

Code Block

language	bash
title	Re-order the final BED fields

head -1 sc_genes.converted.bed | sed 's/\r//' | awk '
 BEGIN{FS=OFS="\t"}{print $1,$2,$3,$10,$5,$6,$15,$16}
 ' > sc_genes.bed.hdr
tail -n +2 sc_genes.converted.bed | sed 's/\r//' | awk '
 BEGIN{FS=OFS="\t"}
 { if($15 == "") {$15 = 10$10} # make sure gene name is populated
   print $1,$2,$3,$10,$5,$6,$15,$16}
 ' > sc_genes.bed

...

Code Block

title	Converted BED attributes

chrI    334     649     YAL069W 315     +       10      Dubious
chrI    537     792     YAL068W-A       255     +       10      Dubious
chrI    1806    2169    YAL068C 363     -       PAU8    Verified
chrI    2479    2707    YAL067W-A       228     +       10      Uncharacterized
chrI    7234    9016    YAL067C 1782    -       SEO1    Verified
chrI    10090   10399   YAL066W 309     +       10      Dubious
chrI    11564   11951   YAL065C 387     -       10      Uncharacterized
chrI    12045   12426   YAL064W-B       381     +       10      Uncharacterized
chrI    13362   13743   YAL064C-A       381     -       10      Uncharacterized
chrI    21565   21850   YAL064W 285     +       10      Verified
chrI    22394   22685   YAL063C-A       291     -       10      Uncharacterized
chrI    23999   27968   YAL063C 3969    -       FLO9    Verified
chrI    31566   32940   YAL062W 1374    +       GDH3    Verified
chrI    33447   34701   YAL061W 1254    +       BDH2    Uncharacterized
chrI    35154   36303   YAL060W 1149    +       BDH1    Verified
chrI    36495   36918   YAL059C-A       423     -       10      Dubious
chrI    36508   37147   YAL059W 639     +       ECM1    Verified
chrI    37463   38972   YAL058W 1509    +       CNE1    Verified
chrI    38695   39046   YAL056C-A       351     -       10      Dubious
chrI    39258   41901   YAL056W 2643    +       GPB2    Verified

About strandedness

By default many bedtools utilities that perform overlapping, consider reads overlapping the feature on either strand, but it can be made specific with the -s or -S option. This strandedness options for bedtools utilities refers the orientation of the R1 read with respect to the feature's (gene's) strand.

-s says the R1 read is sense stranded (on the same strand as the gene).
-S says the R1 read is antisense stranded (the opposite strand as the gene).

RNA-seq libraries can be constructed with 3 types of strandedness:

sense stranded – the R1 read should be on the same strand as the gene.
antisense stranded – the R1 read should be on the opposite strand as the gene.
unstranded – the R1 could be on either strand

Which type of RNA-seq library you have depends on the library preparation method – so ask your sequencing center! Our yeast RNA-seq library is sense stranded.

If you have a stranded RNA-seq library, you should use either -s or -S to avoid false counting against a gene on the wrong strand.

Exercises

We're now (finally!) actually going to do some gene-based analyses of a yeast RNA-seq dataset using bedtools and the BED-formatted yeast gene annotation file we created above.

Get the RNA-seq BAM

First start an idev session, since we will be doing some significant computation.

Code Block

language	bash
title	Start an idev session

idev -p development -m 20 -A UT-2015-05-18 -N 1 -n 24 --reservation=CCBB

Copy over the yeast RNA-seq files we'll need (also copy the GFF gene annotation file if you didn't make one).

Code Block

language	bash
title	Setup for BEDTools exercises

mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rna/yeast_mrna.sort.filt.bam* .
cp $CORENGS/yeast_rna/sc_genes.bed* .

Exercises: How many reads pairs are in the yeast_mrna.sort.filt.bam file? How many proper pairs? How many duplicates? What is the distribution of mapping qualities? What is the average mapping quality?

Expand

title	Hints

samtools flagstat for the different read counts.

samtools view + cut + sort + uniq -c for mapping quality distribution

samtools view + awk for average mapping quality

...

title	Answer

Code Block

language	bash

cd $SCRATCH/core_ngs/bedtools
module load samtools
samtools flagstat yeast_mrna.sort.filt.bam | tee yeast_mrna.flagstat.txt

...

title	samtools flagstat output

...

Note that value in the 8th column. In the yeast annotations from SGD there are 3 gene classifications: Verified, Uncharacterized and Dubious. The Dubious ones have no experimental evidence so are generally excluded.

Exercise: How many genes in our sc_genes.bed file are in each category?

Expand

title	Hint

Use cut to isolate that field, sort to sort the resulting values into blocks, then uniq -c to count the members of each block.

Expand

title	Answer

Code Block

language	bash

cut -f 8 sc_genes.bed | sort | uniq -c

You should see this:

Code Block
810 Dubious 897 Uncharacterized 4896 Verified 4 Verified\|silenced_gene

If you want to further order this output listing the most abundant category first, add another sort statement:

Code Block

language	bash

cut -f 8 sc_genes.bed | sort | uniq -c | sort -k 1,1nr

The -k 1,1nr options says to sort on the 1st field of input, using numeric sorting, in reverse order (i.e., largest first). Which produces:

Code Block
4896 Verified 897 Uncharacterized 809 Dubious 4 Verified\|silenced_gene

About strandedness

By default many bedtools utilities that perform overlapping, consider reads overlapping the feature on either strand, but it can be made specific with the -s or -S option. This strandedness options for bedtools utilities refers the orientation of the R1 read with respect to the feature's (gene's) strand.

-s says the R1 read is sense stranded (on the same strand as the gene).
-S says the R1 read is antisense stranded (the opposite strand as the gene).

RNA-seq libraries can be constructed with 3 types of strandedness:

sense stranded – the R1 read should be on the same strand as the gene.
antisense stranded – the R1 read should be on the opposite strand as the gene.
unstranded – the R1 could be on either strand

Which type of RNA-seq library you have depends on the library preparation method – so ask your sequencing center! Our yeast RNA-seq library is sense stranded.

If you have a stranded RNA-seq library, you should use either -s or -S to avoid false counting against a gene on the wrong strand.

Exercises

We're now (finally!) actually going to do some gene-based analyses of a yeast RNA-seq dataset using bedtools and the BED-formatted yeast gene annotation file we created above.

Get the RNA-seq BAM

First start an idev session, since we will be doing some significant computation.

Code Block

language	bash
title	Start an idev session

idev -p development -m 20 -A UT-2015-05-18 -N 1 -n 24 --reservation=CCBB

Copy over the yeast RNA-seq files we'll need (also copy the GFF gene annotation file if you didn't make one).

Code Block

language	bash
title	Setup for BEDTools exercises

mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rna/yeast_mrna.sort.filt.bam* .
cp $CORENGS/yeast_rna/sc_genes.bed* .

Exercises: How many reads pairs are in the yeast_mrna.sort.filt.bam file? How many proper pairs? How many duplicates? What is the distribution of mapping qualities? What is the average mapping quality?

Expand

title	Hints

samtools flagstat for the different read counts.

samtools view + cut + sort + uniq -c for mapping quality distribution

samtools view + awk for average mapping quality

Expand

title	Answer

Code Block

language	bash

cd $SCRATCH/core_ngs/bedtools
module load samtools
samtools flagstat yeast_mrna.sort.filt.bam | tee yeast_mrna.flagstat.txt

Code Block

title	samtools flagstat output

3323242 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
922114 + 0 duplicates
3323242 + 0 mapped (100.00% : N/A)
3323242 + 0 paired in sequencing
1661699 + 0 read1
1661543 + 0 read2
3323242 + 0 properly paired (100.00% : N/A)
3323242 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

There are 3323242 total reads, all mapped and all properly paired. So this must be a quality-filtered BAM. There are 922114 duplicates, or about 28%.

To get the distribution of mapping qualities:

Code Block

language	bash

samtools view yeast_mrna.sort.filt.bam | cut -f 5 | sort | uniq -c | sort -k2,2n

Code Block

title	distribution of mapping qualities

To compute average mapping quality:

Code Block

language	bash

samtools view yeast_mrna.sort.filt.bam | awk '
  BEGIN{FS="\t"; sum=0; tot=0}
  {sum = sum + $5; tot = tot + 1}
  END{print "average:",sum/tot,"for",tot,"reads"}'

Mapping qualities range from 20 to 60 – excellent quality! Because the majority reads have mapping quality 60, the average is 59. So again, there must have been quality filtering performed on upstream alignment records.

...

Expand

title	Answer

Code Block

language	bash

cat yeast_mrna_gene_counts.bed | awk '
 BEGIN{FS="\t";sum=0;tot=0}
 {if($9 > 0) { sum = sum + $9; tot = tot + 1 }}
 END{print sum,"overlapping reads in",tot,"records"}'

There are 1144990 overlapping reads in 6235 records

print sum,"overlapping reads in",tot,"records"}'

There are 1144990 overlapping reads in 6235 records

In the yeast annotations from SGD there are 3 gene classifications: Verified, Uncharacterized and Dubious, and the Dubious ones have no experimental evidence so are generally excluded.

Exercise: What is the total count of reads mapping to gene features?

Expand

title	Hint

The classification is in the 7th column of sc_genes.bed.

Use cut to isolate that field, sort to sort the resulting values into blocks, then uniq -c to count the members of each block.

x

Versions Compared

Old Version 31

New Version 32

Key

About strandedness

Exercises

Get the RNA-seq BAM

About strandedness

Exercises

Get the RNA-seq BAM

BEDTools merge