...
| Sub-command | Description | Use case(s) |
|---|---|---|
| bamtobed | Convert BAM files to BED format. | You want to have the contig, start, end, and strand information for each mapped alignment record in separate fields. Recall that the strand is encoded in a BAM flag (0x10) and the exact end coordinate requires parsing the CIGAR string. |
| bamtofastq | Extract FASTQ sequences from BAM alignment records. | You have downloaded a BAM file from a public database, but it was not aligned against the reference version you want to use (e.g. it is hg19 and you want an hg38 alignment). To re-process, you need to start with the original FASTQ sequences. |
| getfasta | Get FASTA entries corresponding to regions. | You want to run motif analysis, which requires the original FASTA sequences, on a set of regions of interest. In addition to the BAM file, you must provide FASTA file(s) for the genome/reference used for alignment (e.g. the FASTA file used to build the aligner index). |
| coverage |
|
|
| multicov | Count overlaps between one or more BAM files and a set of regions of interest. |
|
| merge | Combine a set of possibly-overlapping regions into a single set of non-overlapping regions. | Collapse overlapping gene annotations into per-strand non-overlapping regions before counting (e.g with featureCounts or HTSeq). If this is not done, the source regions will potentially be counted multiple times, once for each (overlapping) target region it intersects. |
| subtract | Remove unwanted regions. | Remove rRNA gene regions from a merged gene annotations file before counting. |
| intersect | Determine the overlap between two sets of regions. | Similar to multicov, but can also report (not just count) the overlapping regions. |
| closest | Find the genomic features nearest to a set of regions. | For a set of significant ChIP-seq transcription factor (TF) binding regions ("peaks") that have been identified, determine nearby genes that may be targets of TF regulation. |
...
So it is important to know which version of BEDTools you are using, and read the documentation carefully to see if changes have been made since your version.
Login to stampede2 ls6, start and idev session, then load the BioContainers bedtools module, and check its version.
| Code Block | ||||
|---|---|---|---|---|
| ||||
idev -p normal -m 120 -A UT-2015-05-18N 1 -NA 1OTH21164 -nr 68CoreNGSday5 # ... module load biocontainers module load bedtools bedtools --version # should be bedtools v2.27.1 |
...
- Most BEDTools functions now accept either BAM or BED files as input.
- BED format files must be BED3+, or BED6+ if strand-specific operations are requested.
- When comparing against a set of regions, those regions are usually supplied in either BED or GTF/GFF.
- All text-format input files (BED, GTF/GFF, VCF) should use Unix line endings (linefeed only).
The most important thing to remember about comparing regions using BEDTools, is that all region input files must share the same set contig names and be based on the same reference! For example, if an alignment was performed against a human GRCh38 reference genome from Gencode, use annotations from the corresponding GFF/GTF annotations.
...
| Expand | |||||||
|---|---|---|---|---|---|---|---|
| |||||||
|
...
What most folks to is find some way to convert their GFF/GTF file to a BED file, parsing out some (or all) of the name/value attribute pairs into BED file columns after the standard 6. You can find such conversion programs on the web – or write one yourself. Or you could use the BioITeam conversion script, /work2work/projects/BioITeam/common/script/gtf_to_bed.pl. While it will not work 100% of the time, it manages to do a decent job on most GFF/GTF files. And it's pretty easy to run.
...
Here we just give the script the GFF file to convert, plus a 1 that tells it to URL decode weird looking text (e.g. our Note attribute values).
...
| Code Block | ||||
|---|---|---|---|---|
| ||||
/work2work/projects/BioITeam/common/script/gtf_to_bed.pl sc_genes.gff 1 \ > sc_genes.converted.bed |
...
The final transformation is to do a bit of re-ordering, dropping some fields. We'll do this with awk, becuase because cut can't re-order fields. While this is not strictly required, it can be helpful to have the critical fields (including the gene ID) in the 1st 6 columns. We do this separately for the header line and the rest of the file so that the BED file we give bedtools does not have a header (but we know what those fields are). We would normally preserve valuable annotation information such as Ontology_term, dbxref and Note, but drop them here for simplicity.
...
One final detail. Annotation files you download may have non-Unix (linefeed-only) line endings. Specifically, they may use Windows line endings (carriage return + linefeed). (Read about Line ending nightmares.) The expression sed 's/\r//' uses the sed (substitution editor) tool to replace carriage return characters ( \r ) with nothing, removing them from the output.
...
| Code Block |
|---|
chrI 334 649 YAL069W 315 + YAL069W Dubious chrI 537 792 YAL068W-A 255 + YAL068W-A Dubious chrI 1806 Dubious chrI2169 1806YAL068C 2169 YAL068C 363 - PAU8 Verified chrI 2479 2707 YAL067W-A 228 + YAL067W-A Uncharacterized chrI 7234 Uncharacterized chrI9016 7234YAL067C 9016 YAL067C 1782 - SEO1 Verified chrI 10090 10399 YAL066W 309 309 + YAL066W Dubious chrI 11564 11951 YAL065C 387 - YAL065C Uncharacterized chrI 12045 12426 YAL064W-B 381 + YAL064W-B Uncharacterized chrI 13362 13743 YAL064C-A 381 - YAL064C-A Uncharacterized chrI 21565 Uncharacterized21850 chrI YAL064W 21565 21850 YAL064W 285 + YAL064W Verified chrI 22394 22685 YAL063C-A 291 - YAL063C-A Uncharacterized chrI 23999 Uncharacterized27968 chrI YAL063C 23999 27968 YAL063C 3969 - FLO9 Verified chrI 31566 32940 YAL062W 1374 + GDH3 Verified chrI 33447 34701 YAL061W 1254 + BDH2 Uncharacterized chrI 35154 36303 YAL060W 1149 + BDH1 Verified chrI 36495 36918 YAL059C-A 423 - YAL059C-A Dubious chrI 36508 Dubious37147 chrI YAL059W 36508 37147 YAL059W 639 + ECM1 Verified chrI 37463 38972 YAL058W 1509 + CNE1 Verified chrI 38695 39046 YAL056C-A 351 - YAL056C-A Dubious chrI Dubious39258 41901 chrI YAL056W 39258 41901 YAL056W 2643 + GPB2 Verified |
Note that value in the 8th column. In the yeast annotations from SGD there are 3 gene classifications: Verified, Uncharacterized and Dubious. The Dubious ones have no experimental evidence so are generally excluded.
...