...
So it is important to know which version of BEDTools you are using, and read the documentation carefully to see if changes have been made since your version. When you run the code below, you should see that the bedtools version in the standard TACC module system is bedtools v2.26.0. (Login to login5.ls5.tacc.utexas.edu first.)
| Code Block |
|---|
module load bedtools
bedtools --version |
Input format considerations
...
Login to stampede2, start and idev session, then load the BioContainers bedtools module, and check its version.
| Code Block | ||||
|---|---|---|---|---|
| ||||
idev -p development -m 120 -A UT-2015-05-18 -N 1 -n 68 --reservation=BIO_DATA_week_1
# ...
module load biocontainers
module load bedtools
bedtools --version # should be bedtools v2.27.1 |
Input format considerations
- Most BEDTools functions now accept either BAM or BED files as input.
- BED format files must be BED3+, or BED6+ if strand-specific operations are requested.
- When comparing against a set of regions, those regions are usually supplied in either BED or GTF/GFF.
- All text-format input files (BED, GTF/GFF, VCF) should use Unix line endings (linefeed only).
...
Take a quick look at a yeast annotation file, sacCer_R64-1-1_20110208.gff using less. (Login to login5.ls5.tacc.utexas.edu first.).
| Expand | |||||||
|---|---|---|---|---|---|---|---|
| |||||||
|
| Code Block | ||||
|---|---|---|---|---|
| ||||
mkdir -p $SCRATCH/core_ngs/bedtools cd $SCRATCH/core_ngs/bedtools cp $CORENGS/yeast_rna/sacCer_R64-1-1_20110208.gff . cp $CORENGS/yeast_rna/yeast_mrna.sort.filt.bam* . # Use the less pager to look at multiple lines less sacCer_R64-1-1_20110208.gff # Look at just the most-important Tab-separated columns cat sacCer_R64-1-1_20110208.gff | grep -v '#' | cut -f 1,3-5 | head -20 # Include the ugly 9th column where attributes are stored cat sacCer_R64-1-1_20110208.gff | grep -v '#' | cut -f 1,3,9 | head |
...
What most folks to is find some way to convert their GFF/GTF file to a BED file, parsing out some (or all) of the name/value attribute pairs into BED file columns after the standard 6. You can find such conversion programs on the web – or write one yourself. Or you could use the BioITeam conversion script, /workwork2/projects/BioITeam/common/script/gtf_to_bed.pl. While it will not work 100% of the time, it manages to do a decent job on most GFF/GTF files. And it's pretty easy to run.
...
| Code Block | ||||
|---|---|---|---|---|
| ||||
/workwork2/projects/BioITeam/common/script/gtf_to_bed.pl sc_genes.gff 1 \ > sc_genes.converted.bed |
...
Doesn't this look better? (I've tidied up the output a bit below.)
| Code Block |
|---|
chrI 334 649 YAL069W 315 + YAL069W Dubious chrI 537 792 YAL068W-A 255 + YAL068W-A Dubious chrI 1806 2169 YAL068C 363 - PAU8 Verified chrI 2479 2707 YAL067W-A 228 + YAL067W-A Uncharacterized chrI 7234 9016 YAL067C 1782 - SEO1 Verified chrI 10090 10399 YAL066W 309 + YAL066W Dubious chrI 11564 11951 YAL065C 387 - YAL065C Uncharacterized chrI 12045 12426 YAL064W-B 381 + YAL064W-B Uncharacterized chrI 13362 13743 YAL064C-A 381 - YAL064C-A Uncharacterized chrI 21565 21850 YAL064W 285 + YAL064W Verified chrI 22394 22685 YAL063C-A 291 - YAL063C-A Uncharacterized chrI 23999 27968 YAL063C 3969 - FLO9 Verified chrI 31566 32940 YAL062W 1374 + GDH3 Verified chrI 33447 34701 YAL061W 1254 + BDH2 Uncharacterized chrI 35154 36303 YAL060W 1149 + BDH1 Verified chrI 36495 36918 YAL059C-A 423 - YAL059C-A Dubious chrI 36508 37147 YAL059W 639 + ECM1 Verified chrI 37463 38972 YAL058W 1509 + CNE1 Verified chrI 38695 39046 YAL056C-A 351 - YAL056C-A Dubious chrI 39258 41901 YAL056W 2643 + GPB2 Verified |
...
| Expand | ||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||||||
You should see this:
If you want to further order this output listing the most abundant category first, add another sort statement:
The -k 1,1nr options says to sort on the 1st field of input, using numeric sorting, in reverse order (i.e., largest first). Which produces:
|
...
We're now (finally!) actually going to do some gene-based analyses of a yeast RNA-seq dataset using bedtools and the BED-formatted yeast gene annotation file we created above.
Get the RNA-seq BAM
First start Make sure you're in an idev session, since we will be doing some significant computation, and make bedtools and samtools available.
| Code Block | ||||
|---|---|---|---|---|
| ||||
idev -p development -m 20120 -A UT-2015-05-18 -N 1 -n 24 --reservation=intro_NGSBIO_DATA_week_1 |
Copy over the yeast RNA-seq files we'll need (also copy the GFF gene annotation file if you didn't make one).
...
Exercises: How many reads pairs are represented in the yeast_mrna.sort.filt.bam file? How many mapped? How many proper pairs? How many duplicates? What is the distribution of mapping qualities? What is the average mapping quality?
...
| Expand | |||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| |||||||||||||||||||||||||
There are 3323242 total reads, all mapped and all properly paired. So this must be a quality-filtered BAM. There are 922114 duplicates, or about 28%. To get the distribution of mapping qualities:
To compute average mapping quality:
Mapping qualities range from 20 to 60 – excellent quality! Because the majority reads have mapping quality 60, the average is 59. So again, there must have been quality filtering performed on upstream alignment records. |
...