Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Code Block
languagebash
titleRe-order the final BED fields
head -1 sc_genes.converted.bed | sed 's/\r//' | awk '
 BEGIN{FS=OFS="\t"}{print $1,$2,$3,$10,$5,$6,$15,$16}
 ' > sc_genes.bed.hdr
tail -n +2 sc_genes.converted.bed | sed 's/\r//' | awk '
 BEGIN{FS=OFS="\t"}
 { if($15 == "") {$15 = 10$10} # make sure gene name is populated
   print $1,$2,$3,$10,$5,$6,$15,$16}
 ' > sc_genes.bed

...

Code Block
titleConverted BED attributes
chrI    334     649     YAL069W 315     +       10      Dubious
chrI    537     792     YAL068W-A       255     +       10      Dubious
chrI    1806    2169    YAL068C 363     -       PAU8    Verified
chrI    2479    2707    YAL067W-A       228     +       10      Uncharacterized
chrI    7234    9016    YAL067C 1782    -       SEO1    Verified
chrI    10090   10399   YAL066W 309     +       10      Dubious
chrI    11564   11951   YAL065C 387     -       10      Uncharacterized
chrI    12045   12426   YAL064W-B       381     +       10      Uncharacterized
chrI    13362   13743   YAL064C-A       381     -       10      Uncharacterized
chrI    21565   21850   YAL064W 285     +       10      Verified
chrI    22394   22685   YAL063C-A       291     -       10      Uncharacterized
chrI    23999   27968   YAL063C 3969    -       FLO9    Verified
chrI    31566   32940   YAL062W 1374    +       GDH3    Verified
chrI    33447   34701   YAL061W 1254    +       BDH2    Uncharacterized
chrI    35154   36303   YAL060W 1149    +       BDH1    Verified
chrI    36495   36918   YAL059C-A       423     -       10      Dubious
chrI    36508   37147   YAL059W 639     +       ECM1    Verified
chrI    37463   38972   YAL058W 1509    +       CNE1    Verified
chrI    38695   39046   YAL056C-A       351     -       10      Dubious
chrI    39258   41901   YAL056W 2643    +       GPB2    Verified

About strandedness

By default many bedtools utilities that perform overlapping, consider reads overlapping the feature on either strand, but it can be made specific with the -s or -S option. This strandedness options for bedtools utilities refers the orientation of the R1 read with respect to the feature's (gene's) strand.

  • -s says the R1 read is sense stranded (on the same strand as the gene).
  • -S says the R1 read is antisense stranded (the opposite strand as the gene).

RNA-seq libraries can be constructed with 3 types of strandedness:

  1. sense stranded – the R1 read should be on the same strand as the gene.
  2. antisense stranded – the R1 read should be on the opposite strand as the gene.
  3. unstranded – the R1 could be on either strand

Which type of RNA-seq library you have depends on the library preparation method – so ask your sequencing center! Our yeast RNA-seq library is sense stranded.

If you have a stranded RNA-seq library, you should use either -s or -S to avoid false counting against a gene on the wrong strand.

Exercises

We're now (finally!) actually going to do some gene-based analyses of a yeast RNA-seq dataset using bedtools and the BED-formatted yeast gene annotation file we created above.

Get the RNA-seq BAM

First start an idev session, since we will be doing some significant computation.

Code Block
languagebash
titleStart an idev session
idev -p development -m 20 -A UT-2015-05-18 -N 1 -n 24 --reservation=CCBB

Copy over the yeast RNA-seq files we'll need (also copy the GFF gene annotation file if you didn't make one).

Code Block
languagebash
titleSetup for BEDTools exercises
mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rna/yeast_mrna.sort.filt.bam* .
cp $CORENGS/yeast_rna/sc_genes.bed* . 

Exercises: How many reads pairs are in the yeast_mrna.sort.filt.bam file? How many proper pairs? How many duplicates? What is the distribution of mapping qualities? What is the average mapping quality?

Expand
titleHints

samtools flagstat for the different read counts.

samtools view + cut + sort + uniq -c for mapping quality distribution

samtools view + awk for average mapping quality

...

titleAnswer


Code Block
languagebash
cd $SCRATCH/core_ngs/bedtools
module load samtools
samtools flagstat yeast_mrna.sort.filt.bam | tee yeast_mrna.flagstat.txt

...

titlesamtools flagstat output

...

Note that value in the 8th column. In the yeast annotations from SGD there are 3 gene classifications: Verified, Uncharacterized and Dubious. The Dubious ones have no experimental evidence so are generally excluded.

Exercise: How many genes in our sc_genes.bed file are in each category?

Expand
titleHint

Use cut to isolate that field, sort to sort the resulting values into blocks, then uniq -c to count the members of each block.

Expand
titleAnswer
Code Block
languagebash
cut -f 8 sc_genes.bed | sort | uniq -c

You should see this:

Code Block
    810 Dubious
    897 Uncharacterized
   4896 Verified
      4 Verified|silenced_gene

If you want to further order this output listing the most abundant category first, add another sort statement:

Code Block
languagebash
cut -f 8 sc_genes.bed | sort | uniq -c | sort -k 1,1nr

The -k 1,1nr options says to sort on the 1st field of input, using numeric sorting, in reverse order (i.e., largest first). Which produces:

Code Block
   4896 Verified
    897 Uncharacterized
    809 Dubious
      4 Verified|silenced_gene

About strandedness

By default many bedtools utilities that perform overlapping, consider reads overlapping the feature on either strand, but it can be made specific with the -s or -S option. This strandedness options for bedtools utilities refers the orientation of the R1 read with respect to the feature's (gene's) strand.

  • -s says the R1 read is sense stranded (on the same strand as the gene).
  • -S says the R1 read is antisense stranded (the opposite strand as the gene).

RNA-seq libraries can be constructed with 3 types of strandedness:

  1. sense stranded – the R1 read should be on the same strand as the gene.
  2. antisense stranded – the R1 read should be on the opposite strand as the gene.
  3. unstranded – the R1 could be on either strand

Which type of RNA-seq library you have depends on the library preparation method – so ask your sequencing center! Our yeast RNA-seq library is sense stranded.

If you have a stranded RNA-seq library, you should use either -s or -S to avoid false counting against a gene on the wrong strand.

Exercises

We're now (finally!) actually going to do some gene-based analyses of a yeast RNA-seq dataset using bedtools and the BED-formatted yeast gene annotation file we created above.

Get the RNA-seq BAM

First start an idev session, since we will be doing some significant computation.

Code Block
languagebash
titleStart an idev session
idev -p development -m 20 -A UT-2015-05-18 -N 1 -n 24 --reservation=CCBB

Copy over the yeast RNA-seq files we'll need (also copy the GFF gene annotation file if you didn't make one).

Code Block
languagebash
titleSetup for BEDTools exercises
mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rna/yeast_mrna.sort.filt.bam* .
cp $CORENGS/yeast_rna/sc_genes.bed* . 

Exercises: How many reads pairs are in the yeast_mrna.sort.filt.bam file? How many proper pairs? How many duplicates? What is the distribution of mapping qualities? What is the average mapping quality?

Expand
titleHints

samtools flagstat for the different read counts.

samtools view + cut + sort + uniq -c for mapping quality distribution

samtools view + awk for average mapping quality

Expand
titleAnswer
Code Block
languagebash
cd $SCRATCH/core_ngs/bedtools
module load samtools
samtools flagstat yeast_mrna.sort.filt.bam | tee yeast_mrna.flagstat.txt
Code Block
titlesamtools flagstat output
3323242 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
922114 + 0 duplicates
3323242 + 0 mapped (100.00% : N/A)
3323242 + 0 paired in sequencing
1661699 + 0 read1
1661543 + 0 read2
3323242 + 0 properly paired (100.00% : N/A)
3323242 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

There are 3323242 total reads, all mapped and all properly paired. So this must be a quality-filtered BAM. There are 922114 duplicates, or about 28%.

To get the distribution of mapping qualities:

Code Block
languagebash
samtools view yeast_mrna.sort.filt.bam | cut -f 5 | sort | uniq -c | sort -k2,2n
Code Block
titledistribution of mapping qualities
    453 20
   6260 21
    889 22
    326 23
    971 24
   2698 25
    376 26
  12769 27
    268 28
    337 29
    933 30
   1229 31
    345 32
   5977 33
    256 34
    249 35
   1103 36
    887 37
    292 38
   4648 39
   5706 40
    426 41
   1946 42
   1547 43
   1761 44
   6138 45
   1751 46
   3019 47
   3710 48
   3236 49
   4467 50
  15691 51
  25370 52
  16636 53
  18081 54
   7084 55
   2701 56
  59851 57
   2836 58
   2118 59
3097901 60

To compute average mapping quality:

Code Block
languagebash
samtools view yeast_mrna.sort.filt.bam | awk '
  BEGIN{FS="\t"; sum=0; tot=0}
  {sum = sum + $5; tot = tot + 1}
  END{print "average:",sum/tot,"for",tot,"reads"}'

Mapping qualities range from 20 to 60 – excellent quality! Because the majority reads have mapping quality 60, the average is 59. So again, there must have been quality filtering performed on upstream alignment records.

...

Expand
titleAnswer
Code Block
languagebash
cat yeast_mrna_gene_counts.bed | awk '
 BEGIN{FS="\t";sum=0;tot=0}
 {if($9 > 0) { sum = sum + $9; tot = tot + 1 }}
 END{print sum,"overlapping reads in",tot,"records"}'
There are 1144990 overlapping reads in 6235 records
print sum,"overlapping reads in",tot,"records"}'

There are 1144990 overlapping reads in 6235 records

In the yeast annotations from SGD there are 3 gene classifications: Verified, Uncharacterized and Dubious, and the Dubious ones have no experimental evidence so are generally excluded.

Exercise: What is the total count of reads mapping to gene features?

Expand
titleHint

The classification is in the 7th column of sc_genes.bed.

Use cut to isolate that field, sort to sort the resulting values into blocks, then uniq -c to count the members of each block.


x

BEDTools merge