Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Sub-commandDescriptionUse case(s)
bamtobedConvert BAM files to BED format.You want to have the contig, start, end, and strand information for each mapped alignment record in separate fields. Recall that the strand is encoded in a BAM flag (0x10) and the exact end coordinate requires parsing the CIGAR string.
bamtofastqExtract FASTQ sequences from BAM alignment records.You have downloaded a BAM file from a public database, but it was not aligned against the reference version you want to use (e.g. it is hg19 and you want an hg38 alignment). To re-process, you need to start with the original FASTQ sequences.
getfastaGet FASTA entries corresponding to regions.You want to run motif analysis, which requires the original FASTA sequences, on a set of regions of interest.  In addition to the BAM file, you must provide FASTA file(s) for the genome/reference used for alignment (e.g. the FASTA file used to build the aligner index).
coverage
  • Compute genome-wide coverage of your regions
  • Generate per-base genome-wide signal trace
  • You have performed a WGS (whole genome sequencing) experiment and want to know if has resulted in the desired coverage depth.
  • Calculate what proportion of the (known) transcriptome is covered by your RNA-seq alignments. Provide the transcript regions as a BED or GFF/GTF file.
  • Produce a per-base genome-wide signal (in bedGraph format) for a ChIP-seq or ATAC-seq experiment. After conversion to binary bigWig format, such tracks can be configured in the UCSC Genome Browser as custom tracks.
multicovCount overlaps between one or more BAM files and a set of regions of interest.
  • Count RNA-seq alignments that overlap a set of genes of interest. While this task is usually done with a specialized RNA-seq quantification tool (e.g. featureCounts or HTSeq), bedtools multicov can provide a quick estimate, e.g. for QC purposes.
mergeCombine a set of possibly-overlapping regions into a single set of non-overlapping regions.Collapse overlapping gene annotations into per-strand non-overlapping regions before counting (e.g with featureCounts or HTSeq). If this is not done, the source regions will potentially be counted multiple times, once for each (overlapping) target region it intersects.
subtractRemove unwanted regions.Remove rRNA gene regions from a merged gene annotations file before counting.
intersectDetermine the overlap between two sets of regions.Similar to multicov, but can also report (not just count) the overlapping regions.
closestFind the genomic features nearest to a set of regions.For a set of significant ChIP-seq transcription factor (TF) binding regions ("peaks") that have been identified, determine nearby genes that may be targets of TF regulation.

...

So it is important to know which version of BEDTools you are using, and read the documentation carefully to see if changes have been made since your version.

Login to stampede2 ls6, start and idev session, then load the BioContainers bedtools module, and check its version.

Code Block
languagebash
titleStart an idev session
idev -p normal -m 120 -A UT-2015-05-18N 1 -NA 1OTH21164 -nr 68CoreNGSday5 
# ...
module load biocontainers
module load bedtools
bedtools --version   # should be bedtools v2.27.1

...

  • Most BEDTools functions now accept either BAM or BED files as input. 
    • BED format files must be BED3+, or BED6+ if strand-specific operations are requested.
  • When comparing against a set of regions, those regions are usually supplied in either BED or GTF/GFF.
  • All text-format input files (BED, GTF/GFF, VCF) should use Unix line endings (linefeed only).

The most important thing to remember about comparing regions using BEDTools, is that all region input files must share the same set contig names and be based on the same reference! For example, if an alignment was performed against a human GRCh38 reference genome from Gencode, use annotations from the corresponding GFF/GTF annotations.

...

Expand
titleMake sure you're in an idev session


Code Block
languagebash
titleStart an idev session
idev -p normal -m 120 -A UT-2015-05-18N 1 -NA 1OTH21164 -nr 68CoreNGSday5

# ...
module load biocontainers
module load bedtools
bedtools --version   # should be bedtools v2.27.1


...

What most folks to is find some way to convert their GFF/GTF file to a BED file, parsing out some (or all) of the name/value attribute pairs into BED file columns after the standard 6. You can find such conversion programs on the web – or write one yourself. Or you could use the BioITeam conversion script, /work2work/projects/BioITeam/common/script/gtf_to_bed.pl. While it will not work 100% of the time, it manages to do a decent job on most GFF/GTF files. And it's pretty easy to run.

...

Here we just give the script the GFF file to convert, plus a 1 that tells it to URL decode weird looking text (e.g. our Note attribute values).

...

Code Block
languagebash
titleConvert GFF to BED with BioITeam script
/work2work/projects/BioITeam/common/script/gtf_to_bed.pl sc_genes.gff 1 \
  > sc_genes.converted.bed

...

The final transformation is to do a bit of re-ordering, dropping some fields. We'll do this with awk, becuase because cut can't re-order fields. While this is not strictly required, it can be helpful to have the critical fields (including the gene ID) in the 1st 6 columns. We do this separately for the header line and the rest of the file so that the BED file we give bedtools does not have a header (but we know what those fields are). We would normally preserve valuable annotation information such as Ontology_term, dbxref and Note, but drop them here for simplicity.

...

One final detail. Annotation files you download may have non-Unix (linefeed-only) line endings. Specifically, they may use Windows line endings (carriage return + linefeed). (Read about Line ending nightmares.) The expression sed 's/\r//' uses the sed (substitution editor) tool to replace carriage return characters ( \r ) with nothing, removing them from the output.

...

Code Block
chrI    334     649     YAL069W         315     +       YAL069W    Dubious
chrI    537     792     YAL068W-A       255     +       YAL068W-A  Dubious
chrI    1806   Dubious chrI2169    1806YAL068C    2169    YAL068C 363     -       PAU8       Verified
chrI    2479    2707    YAL067W-A       228     +       YAL067W-A  Uncharacterized
chrI    7234  Uncharacterized  chrI9016    7234YAL067C    9016    YAL067C 1782    -       SEO1       Verified
chrI    10090   10399   YAL066W 309        309     +       YAL066W    Dubious
chrI    11564   11951   YAL065C         387     -       YAL065C    Uncharacterized
chrI    12045   12426   YAL064W-B       381     +       YAL064W-B       Uncharacterized
chrI    13362   13743   YAL064C-A       381     -       YAL064C-A  Uncharacterized
chrI    21565   Uncharacterized21850 chrI  YAL064W  21565   21850   YAL064W 285     +       YAL064W    Verified
chrI    22394   22685   YAL063C-A       291     -       YAL063C-A  Uncharacterized
chrI    23999   Uncharacterized27968 chrI  YAL063C  23999   27968   YAL063C 3969    -       FLO9       Verified
chrI    31566   32940   YAL062W         1374    +       GDH3       Verified
chrI    33447   34701   YAL061W         1254    +       BDH2       Uncharacterized
chrI    35154   36303   YAL060W         1149    +       BDH1       Verified
chrI    36495   36918   YAL059C-A       423     -       YAL059C-A  Dubious
chrI    36508   Dubious37147 chrI  YAL059W  36508   37147   YAL059W 639     +       ECM1       Verified
chrI    37463   38972   YAL058W         1509    +       CNE1       Verified
chrI    38695   39046   YAL056C-A       351     -       YAL056C-A  Dubious
chrI    Dubious39258   41901 chrI  YAL056W  39258   41901   YAL056W 2643    +       GPB2       Verified

Note that value in the 8th column. In the yeast annotations from SGD there are 3 gene classifications: Verified, Uncharacterized and Dubious. The Dubious ones have no experimental evidence so are generally excluded.

...