...
We're now (finally!) actually going to do some gene-based analyses of a yeast RNA-seq dataset using bedtools and the BED-formatted, merged yeast gene annotation file we created above.
...
| Code Block |
|---|
| language | bash |
|---|
| title | Setup for BEDTools exercisesmulticov |
|---|
|
# To catch up... Get the merged yeast genes bed file if you didn't create one
mkdir -p $SCRATCH/core_ngs/bedtools_multicov
cd $SCRATCH/core_ngs/bedtools_multicov
cp $CORENGS/yeastcatchup/bedtools_rnaseqmerge/merged.sc_genes.bed* .
cp
$CORENGS/yeast_rnaseq/*.gff .
# Copy the BAM file
cd $SCRATCH/core_ngs/bedtools_multicov
cp $CORENGS/yeast_rnaseq/yeast_mrna.sort.filt.bam* . |
...
| Expand |
|---|
|
| Code Block |
|---|
| cd $SCRATCH/core_ngs/bedtools_multicov
samtools flagstat yeast_mrna.sort.filt.bam | tee yeast_mrna.flagstat.txt |
| Code Block |
|---|
| title | samtools flagstat output |
|---|
| 33232423347559 + 0 in total (QC-passed reads + QC-failed reads)
024317 + 0 secondary
0 + 0 supplementary
922114 + 0 duplicates
33232423347559 + 0 mapped (100.00% : N/A)
3323242 + 0 paired in sequencing
1661699 + 0 read1
1661543 + 0 read2
3323242 + 0 properly paired (100.00% : N/A)
3323242 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5) |
There are 3323242 total reads, all mapped and all properly paired. So this must be a quality-filtered BAM. There are 922114 duplicates, or about 28%. To get the distribution of mapping qualities: | Code Block |
|---|
| samtools view yeast_mrna.sort.filt.bam | cut -f 5 | sort | uniq -c |
| Code Block |
|---|
| title | distribution of mapping qualities |
|---|
| 453498 20
62606504 21
8891012 22
326355 23
1054 971 24
26982800 25
376495 26
1276914133 27
268282 28
337358 29
933954 30
12291244 31
345358 32
59776143 33
256 34
249265 35
11031112 36
887905 37
292309 38
46484845 39
5706 40
426427 41
1946 42
15471552 43
17611771 44
61386140 45
17511771 46
30193049 47
37103881 48
32363264 49
44674475 50
1569115692 51
2537025378 52
1663616659 53
1808118305 54
70847108 55
27012705 56
5985159867 57
28362884 58
21182392 59
30979013118705 60 |
To compute average mapping quality: | Code Block |
|---|
| samtools view yeast_mrna.sort.filt.bam | awk '
BEGIN{FS="\t"; sum=0; tot=0}
{sum = sum + $5; tot = tot + 1}
END{printf("mapping quality average: %.1f for %d reads\n", sum/tot,tot) }' |
Mapping qualities range from 20 to 60 – excellent quality! Because the majority reads have mapping quality 60, the average is 59. So again, there must have been quality filtering performed on upstream alignment records. |
Here's how to run bedtools multicov in stranded mode, directing the standard output to a file:
| Expand |
|---|
|
| Code Block |
|---|
| idev -m 120 -N 1 -A OTH21164 -r CoreNGSday5
module load biocontainers
module load samtools
module load bedtools
mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools_multicov
cp $CORENGS/yeastcatchup/bedtools_rnaseqmerge/*.gff .
cp $CORENGS/yeast_rnaseq/sc_genes.bed* .merged.sc_genes.bed $SCRATCH/core_ngs/bedtools_multicov
cp $CORENGS/yeast_rnaseq/yeast_mrna.sort.filt.bam* .$SCRATCH/core_ngs/bedtools_multicov |
|
| Code Block |
|---|
| language | bash |
|---|
| title | Run bedtools multicov to count BAM alignments overlapping a set of genes |
|---|
|
cd $SCRATCH/core_ngs/bedtools_multicov
bedtools multicov -s -bams yeast_mrna.sort.filt.bam \
-bed merged.sc_genes.bed > yeast_mrna_gene_counts.bed |
...
| Expand |
|---|
|
| Code Block |
|---|
| wc -l yeast_mrna_gene_counts.bed |
6607 6485 records were written, one for each feature in the merged.sc_genes.bed file. The overlap count was added as the last field in each output record (here field 9, since the input annotation file had 8 columns). |
Exercise: How many features have non-zero overlap counts? (The count from bedtools multicov is in field 9)
| Expand |
|---|
|
| Code Block |
|---|
| cut -f 9 yeast_mrna_gene_counts.bed | grep -v '^0' | wc -l
# or
cat yeast_mrna_gene_counts.bed | \
awk '{if ($9 > 0) print $9}' | wc -l |
Most of the genes (6235/6607) have non-zero read overlap counts. |
...