...
Code Block | ||||
---|---|---|---|---|
| ||||
head -1 sc_genes.converted.bed | sed 's/\r//' | awk ' BEGIN{FS=OFS="\t"}{print $1,$2,$3,$10,$5,$6,$15,$16} ' > sc_genes.bed.hdr tail -n +2 sc_genes.converted.bed | sed 's/\r//' | awk ' BEGIN{FS=OFS="\t"} { if($15 == "") {$15 = 10$10} # make sure gene name is populated print $1,$2,$3,$10,$5,$6,$15,$16} ' > sc_genes.bed |
...
Code Block | ||
---|---|---|
| ||
chrI 334 649 YAL069W 315 + 10 Dubious chrI 537 792 YAL068W-A 255 + 10 Dubious chrI 1806 2169 YAL068C 363 - PAU8 Verified chrI 2479 2707 YAL067W-A 228 + 10 Uncharacterized chrI 7234 9016 YAL067C 1782 - SEO1 Verified chrI 10090 10399 YAL066W 309 + 10 Dubious chrI 11564 11951 YAL065C 387 - 10 Uncharacterized chrI 12045 12426 YAL064W-B 381 + 10 Uncharacterized chrI 13362 13743 YAL064C-A 381 - 10 Uncharacterized chrI 21565 21850 YAL064W 285 + 10 Verified chrI 22394 22685 YAL063C-A 291 - 10 Uncharacterized chrI 23999 27968 YAL063C 3969 - FLO9 Verified chrI 31566 32940 YAL062W 1374 + GDH3 Verified chrI 33447 34701 YAL061W 1254 + BDH2 Uncharacterized chrI 35154 36303 YAL060W 1149 + BDH1 Verified chrI 36495 36918 YAL059C-A 423 - 10 Dubious chrI 36508 37147 YAL059W 639 + ECM1 Verified chrI 37463 38972 YAL058W 1509 + CNE1 Verified chrI 38695 39046 YAL056C-A 351 - 10 Dubious chrI 39258 41901 YAL056W 2643 + GPB2 Verified |
About strandedness
By default many bedtools utilities that perform overlapping, consider reads overlapping the feature on either strand, but it can be made specific with the -s or -S option. This strandedness options for bedtools utilities refers the orientation of the R1 read with respect to the feature's (gene's) strand.
- -s says the R1 read is sense stranded (on the same strand as the gene).
- -S says the R1 read is antisense stranded (the opposite strand as the gene).
RNA-seq libraries can be constructed with 3 types of strandedness:
- sense stranded – the R1 read should be on the same strand as the gene.
- antisense stranded – the R1 read should be on the opposite strand as the gene.
- unstranded – the R1 could be on either strand
Which type of RNA-seq library you have depends on the library preparation method – so ask your sequencing center! Our yeast RNA-seq library is sense stranded.
If you have a stranded RNA-seq library, you should use either -s or -S to avoid false counting against a gene on the wrong strand.
Exercises
We're now (finally!) actually going to do some gene-based analyses of a yeast RNA-seq dataset using bedtools and the BED-formatted yeast gene annotation file we created above.
Get the RNA-seq BAM
First start an idev session, since we will be doing some significant computation.
Code Block | ||||
---|---|---|---|---|
| ||||
idev -p development -m 20 -A UT-2015-05-18 -N 1 -n 24 --reservation=CCBB |
Copy over the yeast RNA-seq files we'll need (also copy the GFF gene annotation file if you didn't make one).
Code Block | ||||
---|---|---|---|---|
| ||||
mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rna/yeast_mrna.sort.filt.bam* .
cp $CORENGS/yeast_rna/sc_genes.bed* . |
Exercises: How many reads pairs are in the yeast_mrna.sort.filt.bam file? How many proper pairs? How many duplicates? What is the distribution of mapping qualities? What is the average mapping quality?
Expand | ||
---|---|---|
| ||
samtools flagstat for the different read counts. samtools view + cut + sort + uniq -c for mapping quality distribution samtools view + awk for average mapping quality |
...
title | Answer |
---|
Code Block | ||
---|---|---|
| ||
cd $SCRATCH/core_ngs/bedtools
module load samtools
samtools flagstat yeast_mrna.sort.filt.bam | tee yeast_mrna.flagstat.txt |
...
title | samtools flagstat output |
---|
...
Note that value in the 8th column. In the yeast annotations from SGD there are 3 gene classifications: Verified, Uncharacterized and Dubious. The Dubious ones have no experimental evidence so are generally excluded.
Exercise: How many genes in our sc_genes.bed file are in each category?
Expand | ||
---|---|---|
| ||
Use cut to isolate that field, sort to sort the resulting values into blocks, then uniq -c to count the members of each block. |
Expand | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||||||
You should see this:
If you want to further order this output listing the most abundant category first, add another sort statement:
The -k 1,1nr options says to sort on the 1st field of input, using numeric sorting, in reverse order (i.e., largest first). Which produces:
|
About strandedness
By default many bedtools utilities that perform overlapping, consider reads overlapping the feature on either strand, but it can be made specific with the -s or -S option. This strandedness options for bedtools utilities refers the orientation of the R1 read with respect to the feature's (gene's) strand.
- -s says the R1 read is sense stranded (on the same strand as the gene).
- -S says the R1 read is antisense stranded (the opposite strand as the gene).
RNA-seq libraries can be constructed with 3 types of strandedness:
- sense stranded – the R1 read should be on the same strand as the gene.
- antisense stranded – the R1 read should be on the opposite strand as the gene.
- unstranded – the R1 could be on either strand
Which type of RNA-seq library you have depends on the library preparation method – so ask your sequencing center! Our yeast RNA-seq library is sense stranded.
If you have a stranded RNA-seq library, you should use either -s or -S to avoid false counting against a gene on the wrong strand.
Exercises
We're now (finally!) actually going to do some gene-based analyses of a yeast RNA-seq dataset using bedtools and the BED-formatted yeast gene annotation file we created above.
Get the RNA-seq BAM
First start an idev session, since we will be doing some significant computation.
Code Block | ||||
---|---|---|---|---|
| ||||
idev -p development -m 20 -A UT-2015-05-18 -N 1 -n 24 --reservation=CCBB |
Copy over the yeast RNA-seq files we'll need (also copy the GFF gene annotation file if you didn't make one).
Code Block | ||||
---|---|---|---|---|
| ||||
mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rna/yeast_mrna.sort.filt.bam* .
cp $CORENGS/yeast_rna/sc_genes.bed* . |
Exercises: How many reads pairs are in the yeast_mrna.sort.filt.bam file? How many proper pairs? How many duplicates? What is the distribution of mapping qualities? What is the average mapping quality?
Expand | ||
---|---|---|
| ||
samtools flagstat for the different read counts. samtools view + cut + sort + uniq -c for mapping quality distribution samtools view + awk for average mapping quality |
Expand | |||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| |||||||||||||||||||||||||
There are 3323242 total reads, all mapped and all properly paired. So this must be a quality-filtered BAM. There are 922114 duplicates, or about 28%. To get the distribution of mapping qualities:
To compute average mapping quality:
Mapping qualities range from 20 to 60 – excellent quality! Because the majority reads have mapping quality 60, the average is 59. So again, there must have been quality filtering performed on upstream alignment records. |
...
Expand | ||||||
---|---|---|---|---|---|---|
| ||||||
There are 1144990 overlapping reads in 6235 records |
In the yeast annotations from SGD there are 3 gene classifications: Verified, Uncharacterized and Dubious, and the Dubious ones have no experimental evidence so are generally excluded.
Exercise: What is the total count of reads mapping to gene features?
Expand | ||
---|---|---|
| ||
The classification is in the 7th column of sc_genes.bed. Use cut to isolate that field, sort to sort the resulting values into blocks, then uniq -c to count the members of each block. |
x