...
| Expand |
|---|
|
| Code Block |
|---|
| cut -f 8 sc_genes.bed | sort | uniq -c |
You should see this: | Code Block |
|---|
810 Dubious
897 Uncharacterized
4896 Verified
4 Verified|silenced_gene |
If you want to further order this output listing the most abundant category first, add another sort statement: | Code Block |
|---|
| cut -f 8 sc_genes.bed | sort | uniq -c | sort -k1,1nr |
The -k 1,1nr options says to sort on the 1st field (whitespace delimited) of input, using numeric sorting, in reverse order (i.e., largest first). Which produces: | Code Block |
|---|
4896 Verified
897 Uncharacterized
809 Dubious
4 Verified|silenced_gene |
|
Exercises
We're now (finally!) actually going to do some gene-based analyses of a yeast RNA-seq dataset using bedtools and the BED-formatted yeast gene annotation file we created above.
Get the RNA-seq BAM
Make sure you're in an idev session, since we will be doing some significant computation, and make bedtools and samtools available.
| Code Block |
|---|
| language | bash |
|---|
| title | Start an idev session |
|---|
|
idev -m 120 -N 1 -A OTH21164 -r CoreNGSday5 |
Copy over the yeast RNA-seq files we'll need (also copy the GFF gene annotation file if you didn't make one).
| Code Block |
|---|
| language | bash |
|---|
| title | Setup for BEDTools exercises |
|---|
|
# To catch up...
mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rnaseq/sc_genes.bed* .
cp $CORENGS/yeast_rnaseq/*.gff .
# Copy the BAM file
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rnaseq/yeast_mrna.sort.filt.bam* . |
Exercises: How many reads are represented in the yeast_mrna.sort.filt.bam file? How many mapped? How many proper pairs? How many duplicates? What is the distribution of mapping qualities? What is the average mapping quality?
| Expand |
|---|
|
samtools flagstat for the different read counts. samtools view + cut + sort + uniq -c for mapping quality distribution samtools view + awk for average mapping quality |
...
...
...
Use bedtools merge to collapse overlapping annotations
One issue that often arises when dealing with BED regions is that they can overlap one another. For example, on the yeast genome, which has very few non-coding areas, there are some overlapping ORFs (Open Reading Frames), especially Dubious ORFs that overlap Verified or Uncharacterized ones. When bedtools looks for overlaps, it will count a read that overlaps any of those overlapping ORFs – so some reads can be counted twice.
One way to avoid this double-counting is to collapse the overlapping regions into a merged set of non-overlapping regions – and that's what the bedtools merge utility does (http://bedtools.readthedocs.io/en/latest/content/tools/merge.html).
Here we're going to use bedtools merge to collapse our gene annotations into a non-overlapping set, first for all genes, then for only non-Dubious genes.
The output from bedtools merge always starts with 3 columns: chrom, start and end of the merged region only.
Using the -c (column) and -o (operation) options, you can have information added in subsequent fields. Each comma-separated column number following -c specifies a column to operate on, and the corresponding comma-separated function name following the -o specifies the operation to perform on that column in order to produce an additional output field.
For example, our sc_genes.bed file has a gene name in column 4, and for each (possibly merged) gene region, we want to know the number of gene regions that were collapsed into the region, and also which gene names were collapsed.
We can do this with -c 6,4,4 -o distinct,count,collapse, which says that three custom output columns should be added:
- the 1st custom column should result from collapsing distinct (unique) values of gene file column 6 (the strand, + or -)
- since we will ask for stranded merging, the merged regions will always be on the same strand, so this value will always be + or -
- the 2nd custom output column should result from counting the gene names in column 4 for all genes that were merged, and
- the 3rd custom output should be a comma-separated collapsed list of those same column 4 gene names
bedtools merge also requires that the input BED file be sorted by locus (chrom + start), so we do that first, then we request a strand-specific merge (-s):
| Code Block |
|---|
| language | bash |
|---|
| title | Run bedtools multicov to count BAM alignments overlapping a set of genes |
|---|
|
cd $SCRATCH/core_ngs/bedtools
bedtools multicov -s -bams yeast_mrna.sort.filt.bam \
-bed sc_genes.bed > yeast_mrna_gene_counts.bed |
Exercise: How may records of output were written? Where is the count of overlaps per output record?
| Expand |
|---|
| title | Answers
| Code Block |
|---|
| mkdir -p $SCRATCH/core_ngs/bedtools
samtools flagstat yeast_mrna.sort.filt.bam | tee yeast_mrna.flagstat.txt |
| Code Block |
|---|
| title | samtools flagstat output |
|---|
| 3323242 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
922114 + 0 duplicates
3323242 + 0 mapped (100.00% : N/A)
3323242 + 0 paired in sequencing
1661699 + 0 read1
1661543 + 0 read2
3323242 + 0 properly paired (100.00% : N/A)
3323242 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5) |
There are 3323242 total reads, all mapped and all properly paired. So this must be a quality-filtered BAM. There are 922114 duplicates, or about 28%. To get the distribution of mapping qualities: | Code Block |
|---|
| samtools view yeast_mrna.sort.filt.bam | cut -f 5 | sort | uniq -c |
| Code Block |
|---|
| title | distribution of mapping qualities |
|---|
| 453 20
6260 21
889 22
326 23
971 24
2698 25
376 26
12769 27
268 28
337 29
933 30
1229 31
345 32
5977 33
256 34
249 35
1103 36
887 37
292 38
4648 39
5706 40
426 41
1946 42
1547 43
1761 44
6138 45
1751 46
3019 47
3710 48
3236 49
4467 50
15691 51
25370 52
16636 53
18081 54
7084 55
2701 56
59851 57
2836 58
2118 59
3097901 60 |
To compute average mapping quality: | Code Block |
|---|
| samtools view yeast_mrna.sort.filt.bam | awk '
BEGIN{FS="\t"; sum=0; tot=0}
{sum = sum + $5; tot = tot + 1}
END{printf("mapping quality average: %.1f for %d reads\n", sum/tot,tot) }' |
Mapping qualities range from 20 to 60 – excellent quality! Because the majority reads have mapping quality 60, the average is 59. So again, there must have been quality filtering performed on upstream alignment records. |
|---|
Use bedtools multicov to count feature overlaps
In this section we'll use bedtools multicov to count RNA-seq reads that overlap our gene features. The bedtools multicov command (http://bedtools.readthedocs.io/en/latest/content/tools/multicov.html) takes a feature file (GFF/BED/VCF) and counts how many reads from one or more input BAM files overlap those feature. The input BAM file(s) must be position-sorted and indexed.
Here's how to run bedtools multicov, directing the standard output to a file:
...
| Code Block |
|---|
|
idev -m 120 -N 1 -A OTH21164 -r CoreNGSday5
module load biocontainers
module load samtools
module load bedtools
mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rnaseq/*.gff .
cp $CORENGS/yeast_rnaseq/sc_genes.bed* .
cp $CORENGS/yeast_rnaseq/yeast_mrna.sort.filt.bam* . |
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rnaseq/*.gff .
cp $CORENGS/yeast_rnaseq/sc_genes.bed* .
cp $CORENGS/yeast_rnaseq/yeast_mrna.sort.filt.bam* .
module load biocontainers
module load bedtools |
|
| Code Block |
|---|
| language | bash |
|---|
| title | Use bedtools merge to collapse overlapping gene annotations |
|---|
|
cd $SCRATCH/core_ngs/bedtools
sort -k1,1 -k2,2n sc_genes.bed > sc_genes.sorted.bed
bedtools merge -i sc_genes.sorted.bed -s -c 6,4,4 -o distinct,count,collapse > merged.sc_genes.txt |
The first few lines of the merged.sc_genes.txt file look like this (I've tidied it up a bit):
| Code Block |
|---|
2-micron 251 1523 + 1 R0010W
2-micron 1886 3008 - 1 R0020C
2-micron 3270 3816 + 1 R0030W
2-micron 5307 6198 - 1 R0040C
chrI 334 792 + 2 YAL069W,YAL068W-A
chrI 1806 2169 - 1 YAL068C
chrI 2479 2707 + 1 YAL067W-A
chrI 7234 9016 - 1 YAL067C
chrI 10090 10399 + 1 YAL066W
chrI 11564 11951 - 1 YAL065C |
Output column 4 has the region's strand. Column 5 is the count of merged regions, and column 6 is a comma-separated list of the merged gene names.
Exercise: Compare the number of regions in the merged and before-merge gene files.
| Expand |
|---|
|
| Code Block |
|---|
| wc -l yeast_mrna_gene_counts.bed | 6607 records were written, one for each feature in the sc_genes.bed merged.sc_genes. | bed file.The overlap count was added as the last field in each output record (here field 9, since the input annotation file had 8 columns)There were 6607 genes before merging and 6485 after. |
Exercise: How many features have non-zero overlap counts? (The count from bedtools multicov is in field 9)regions represent only 1 gene, 2 genes, or more?
| Expand |
|---|
|
| Expand |
|---|
|
Output column 5 has the gene count. | Code Block |
|---|
| cut -f 9 yeast_mrna_gene_counts.bed | grep -v '^0' | wc -l
# or
cat yeast_mrna_gene_counts.bed | \
awk '{if ($9 > 0) print $9}' | wc -l |
Most of the genes (6235/6607) have non-zero read overlap counts. |
Exercise: What is the total count of reads mapping to gene features?
5 merged.sc_genes.txt | sort | uniq -c | sort -k2,2n |
Produces this histogram: |
| Expand |
|---|
|
grep -v 'Dubious'| Code Block |
|---|
| cat yeast_mrna_gene_counts.bed | awk '
BEGIN{FS="\t";sum=0;tot=0}
{if($9 > 0) { sum = sum + $9; tot = tot + 1 }}
END{printf("%d overlapping reads in %d genes\n", sum, tot) }' |
There are 1144990 overlapping reads in 6235 gene annotations. |
Recall that in the yeast annotations from SGD there are 3 gene classifications: Verified, Uncharacterized and Dubious, and the Dubious ones have no experimental evidence.
Exercise: What is the total count of reads mapping to gene features other than Dubious?
There are 111 regions (105 + 4 + 1 + 1) where more than one gene contributed. |
Exercise: Repeat the steps above, but first create a good.sc_genes.bed file that does not include Dubious ORFs.
| Expand |
|---|
|
| Expand |
|---|
|
| Code Block |
|---|
| cd $SCRATCH/core_ngs/bedtools
grep -v 'Dubious' yeast_mrna_gene_counts.bed | awk '
BEGIN{FS="\t";sum=0;tot=0}
{if($9 > 0) { sum = sum + $9; tot = tot + 1 }}
END{printf("%d overlapping reads in %d non-Dubious genes\n", sum, tot) }' |
There are 1093140 overlapping reads in 5578 non-Dubious genes |
Use bedtools merge to collapse overlapping annotations
One issue that often arises when dealing with BED regions is that they can overlap one another. For example, on the yeast genome, which has very few non-coding areas, there are some overlapping ORFs (Open Reading Frames), especially Dubious ORFs that overlap Verified or Uncharacterized ones. When bedtools looks for overlaps, it will count a read that overlaps any of those overlapping ORFs – so some reads can be counted twice.
One way to avoid this double-counting is to collapse the overlapping regions into a merged set of non-overlapping regions – and that's what the bedtools merge utility does (http://bedtools.readthedocs.io/en/latest/content/tools/merge.html).
Here we're going to use bedtools merge to collapse our gene annotations into a non-overlapping set, first for all genes, then for only non-Dubious genes.
The output from bedtools merge always starts with 3 columns: chrom, start and end of the merged region only.
Using the -c (column) and -o (operation) options, you can have information added in subsequent fields. Each comma-separated column number following -c specifies a column to operate on, and the corresponding comma-separated function name following the -o specifies the operation to perform on that column in order to produce an additional output field.
For example, our sc_genes.bed file has a gene name in column 4, and for each (possibly merged) gene region, we want to know the number of gene regions that were collapsed into the region, and also which gene names were collapsed.
We can do this with -c 6,4,4 -o distinct,count,collapse, which says that three custom output columns should be added:
- the 1st custom column should result from collapsing distinct (unique) values of gene file column 6 (the strand, + or -)
- since we will ask for stranded merging, the merged regions will always be on the same strand, so this value will always be + or -
- the 2nd custom output column should result from counting the gene names in column 4 for all genes that were merged, and
- the 3rd custom output should be a comma-separated collapsed list of those same column 4 gene names
bedtools merge also requires that the input BED file be sorted by locus (chrom + start), so we do that first, then we request a strand-specific merge (-s):
| Code Block |
|---|
| sc_genes.bed > good.sc_genes.bed
sort -k1,1 -k2,2n good.sc_genes.bed > good.sc_genes.sorted.bed
bedtools merge -i good.sc_genes.sorted.bed -s \
-c 6,4,4 -o distinct,count,collapse > merged.good.sc_genes.txt
wc -l good.sc_genes.bed merged.good.sc_genes.txt |
There were 5797 "good" (non-Dubious) genes before merging and 5770 after. | Code Block |
|---|
| cut -f 5 merged.good.sc_genes.txt | sort | uniq -c | sort -k2,2n |
Produces this histogram: | Code Block |
|---|
| 5750 1
18 2
1 4
1 7 |
Now there are only 20 regions where more than one gene was collapsed. Clearly eliminating the Dubious ORFs helped. |
Exercise: Why did we name the merged file with the extension .txt instead of .bed? What would we need to do to convert it to a proper BED6 file?
| Expand |
|---|
|
The output does not follow the BED6 specification: "chrom, start, end, name, score, strand" The first 3 output columns comply with the BED3 standard (chrom, start, end), but if strand is to be included, it should be in column 6. Column 4 should be name (we'll put the collapsed gene name list there), and column 5 a score (we'll put the region count there). We can use awk to re-order the fields: | Code Block |
|---|
| cat merged.good.sc_genes.txt | awk '
BEGIN{FS=OFS="\t"}
{print $1,$2,$3,$6,$5,$4}' > merged.good.sc_genes.bed |
|
Use bedtools multicov to count feature overlaps
We're now (finally!) actually going to do some gene-based analyses of a yeast RNA-seq dataset using bedtools and the BED-formatted yeast gene annotation file we created above.
In this section we'll use bedtools multicov to count RNA-seq reads that overlap our gene features. The bedtools multicov command (http://bedtools.readthedocs.io/en/latest/content/tools/multicov.html) takes a feature file (GFF/BED/VCF) and counts how many reads from one or more input BAM files overlap those feature. The input BAM file(s) must be position-sorted and indexed.
Make sure you're in an idev session, since we will be doing some significant computation, and make bedtools and samtools available.
| Code Block |
|---|
| language | bash |
|---|
| title | Start an idev session |
|---|
|
idev -m 120 -N 1 -A OTH21164 -r CoreNGSday5 |
Copy over the yeast RNA-seq files we'll need (also copy the GFF gene annotation file if you didn't make one).
| Code Block |
|---|
| language | bash |
|---|
| title | Setup for BEDTools exercises |
|---|
|
# To catch up...
mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rnaseq/sc_genes.bed* .
cp $CORENGS/yeast_rnaseq/*.gff .
# Copy the BAM file
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rnaseq/ sc_genes.bed* .
cp $CORENGS/yeast_rnaseq/yeast_mrna.sort.filt.bam* .
module load biocontainers
module load bedtoolsyeast_mrna.sort.filt.bam* . |
Exercises: How many reads are represented in the yeast_mrna.sort.filt.bam file? How many mapped? How many proper pairs? How many duplicates? What is the distribution of mapping qualities? What is the average mapping quality?
| Expand |
|---|
|
samtools flagstat for the different read counts. samtools view + cut + sort + uniq -c for mapping quality distribution samtools view + awk for average mapping quality |
| Code Block |
|---|
|
wc -l sc_genes.bed merged.sc_genes.txt |
There were 6607 genes before merging and 6485 after.
Exercise: How many regions represent only 1 gene, 2 genes, or more?
| Expand |
|---|
|
Output column 5 has the gene count. | Code Block |
|---|
| cut -f 5 merged.sc_genes.txt | sort | uniq -c | sort -k2,2n |
Produces this histogram: | Code Block |
|---|
| 6374 1
105 2
4 3
1 4
1 7 |
There are 111 regions (105 + 4 + 1 + 1) where more than one gene contributed. |
Exercise: Repeat the steps above, but first create a good.sc_genes.bed file that does not include Dubious ORFs.
| Expand |
|---|
|
|
| title | Use bedtools merge to collapse overlapping gene annotations |
|---|
cd $SCRATCH/core_ngs/bedtools
| sort -k1,1 -k2,2n sc_genes.bed > sc_genes.sorted.bed
bedtools merge -i sc_genes.sorted.bed -s -c 6,4,4 -o distinct,count,collapse > merged.sc_genes.txt |
The first few lines of the merged.sc_genes.txt file look like this (I've tidied it up a bit):
| Code Block |
|---|
2-micron 251 1523 + 1 R0010W
2-micron 1886 3008 - 1 R0020C
2-micron 3270 3816 + 1 R0030W
2-micron 5307 6198 - 1 R0040C
chrI 334 792 + 2 YAL069W,YAL068W-A
chrI 1806 2169 - 1 YAL068C
chrI 2479 2707 + 1 YAL067W-A
chrI 7234 9016 - 1 YAL067C
chrI 10090 10399 + 1 YAL066W
chrI 11564 11951 - 1 YAL065C |
Output column 4 has the region's strand. Column 5 is the count of merged regions, and column 6 is a comma-separated list of the merged gene names.
Exercise: Compare the number of regions in the merged and before-merge gene files.
...
| Code Block |
|---|
| language | bashsamtools flagstat yeast_mrna.sort.filt.bam | tee yeast_mrna.flagstat.txt |
|---|
| Code Block |
|---|
| title | samtools flagstat output |
|---|
| 3323242 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
922114 + 0 duplicates
3323242 + 0 mapped (100.00% : N/A)
3323242 + 0 paired in sequencing
1661699 + 0 read1
1661543 + 0 read2
3323242 + 0 properly paired (100.00% : N/A)
3323242 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5) |
There are 3323242 total reads, all mapped and all properly paired. So this must be a quality-filtered BAM. There are 922114 duplicates, or about 28%. To get the distribution of mapping qualities: | Code Block |
|---|
| samtools view yeast_mrna.sort.filt.bam | cut -f 5 | sort | uniq -c |
| Code Block |
|---|
| title | distribution of mapping qualities |
|---|
| 453 20
6260 21
889 22
326 23
971 24
2698 25
376 26
12769 27
268 28
337 29
933 30
1229 31
345 32
5977 33
256 34
249 35
1103 36
887 37
292 38
4648 39
5706 40
426 41
1946 42
1547 43
1761 44
6138 45
1751 46
3019 47
3710 48
3236 49
4467 50
15691 51
25370 52
16636 53
18081 54
7084 55
2701 56
59851 57
2836 58
2118 59
3097901 60 |
To compute average mapping quality: | Code Block |
|---|
| samtools view yeast_mrna.sort.filt.bam | awk '
BEGIN{FS="\t"; sum=0; tot=0}
{sum = sum + $5; tot = tot + 1}
END{printf("mapping quality average: %.1f for %d reads\n", sum/tot,tot) }' |
Mapping qualities range from 20 to 60 – excellent quality! Because the majority reads have mapping quality 60, the average is 59. So again, there must have been quality filtering performed on upstream alignment records. |
Here's how to run bedtools multicov, directing the standard output to a file:
| Expand |
|---|
|
| Code Block |
|---|
| idev -m 120 -N 1 -A OTH21164 -r CoreNGSday5
module load biocontainers
module load samtools
module load bedtools
mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rnaseq/*.gff .
cp $CORENGS/yeast_rnaseq/sc_genes.bed* .
cp $CORENGS/yeast_rnaseq/yeast_mrna.sort.filt.bam* . |
|
| Code Block |
|---|
| language | bash |
|---|
| title | Run bedtools multicov to count BAM alignments overlapping a set of genes |
|---|
|
cd $SCRATCH/core_ngs/bedtools
grepbedtools multicov - v 'Dubious' sc_genes.bed > good.sc_genes.bed
sort -k1,1 -k2,2n good.s -bams yeast_mrna.sort.filt.bam \
-bed sc_genes.bed > good.sc_genes.sorted.bed
bedtools merge -i good.sc_genes.sorted.bed -s \
-c 6,4,4 -o distinct,count,collapse > merged.good.sc_genes.txt
wc -l good.sc_genes.bed merged.good.sc_genes.txtThere were 5797 "good" (non-Dubious) genes before merging and 5770 after.yeast_mrna_gene_counts.bed |
Exercise: How may records of output were written? Where is the count of overlaps per output record?
| Expand |
|---|
|
| Code Block |
|---|
| wc -l yeast_mrna_gene_counts.bed |
6607 records were written, one for each feature in the sc_genes.bed file. The overlap count was added as the last field in each output record (here field 9, since the input annotation file had 8 columns). |
Exercise: How many features have non-zero overlap counts? (The count from bedtools multicov is in field 9)
| Expand |
|---|
|
| Code Block |
|---|
| cut -f 5 merged.good.sc_genes.txt | sort | uniq -c | sort -k2,2n |
Produces this histogram: 9 yeast_mrna_gene_counts.bed | grep -v '^0' | wc -l
# or
cat yeast_mrna_gene_counts.bed | \
awk '{if ($9 > 0) print $9}' | wc -l |
Most of the genes (6235/6607) have non-zero read overlap counts. |
Exercise: What is the total count of reads mapping to gene features?
| Expand |
|---|
|
| Code Block |
|---|
| cat yeast_mrna_gene_counts.bed | awk '
5750 1
18 2
1 4
1 7 |
Now there are only 20 regions where more than one gene was collapsed. Clearly eliminating the Dubious ORFs helped. |
...
BEGIN{FS="\t";sum=0;tot=0}
{if($9 > 0) { sum = sum + $9; tot = tot + 1 }}
END{printf("%d overlapping reads in %d genes\n", sum, tot) }' |
There are 1144990 overlapping reads in 6235 gene annotations. |
Recall that in the yeast annotations from SGD there are 3 gene classifications: Verified, Uncharacterized and Dubious, and the Dubious ones have no experimental evidence.
Exercise: What is the total count of reads mapping to gene features other than Dubious?
| Expand |
|---|
| The output does not follow the BED6 specification: "chrom, start, end, name, score, strand"
The first 3 output columns comply with the BED3 standard (chrom, start, end), but if strand is to be included, it should be in column 6. Column 4 should be name (we'll put the collapsed gene name list there), and column 5 a score (we'll put the region count there).
We can use awk to re-order the fields:
| Code Block |
|---|
|
cat merged.good.sc_genes.txt | awk '
BEGIN{FS=OFS="\t"}
{print $1,$2,$3,$6,$5,$4}' > merged.good.sc_genes.bed | | grep -v 'Dubious' |
| Expand |
|---|
|
| Code Block |
|---|
| grep -v 'Dubious' yeast_mrna_gene_counts.bed | awk '
BEGIN{FS="\t";sum=0;tot=0}
{if($9 > 0) { sum = sum + $9; tot = tot + 1 }}
END{printf("%d overlapping reads in %d non-Dubious genes\n", sum, tot) }' |
There are 1093140 overlapping reads in 5578 non-Dubious genes |