Analysis using BEDTools
Reservations
Use our today's reservation (core-ngs-class-0606) when submitting batch jobs to get higher priority on the ls6 normal queue.
Request an interactive (idev) node
# Request a 180 minute idev node on the normal queue w/our reservation
idev -m 120 -N 1 -A OTH21164 -r core-ngs-class-0606
# Request a 120 minute interactive node on the development queue
idev -m 120 -N 1 -A OTH21164 -p development- 2 The BED format
- 3 BEDTools overview
- 4 About GFF/GTF annotation files
- 5 Exercises
The BED format
BED (Browser Extensible Data) format is a simple text format for location-oriented data (genomic regions) developed to support UCSC Genome Browser (GenBrowse) tracks. Standard BED files have 3 to 6 Tab-separated columns, although up to 12 columns are defined. (Read more about the UCSC Genome Browser's official BED format.)
Memorize the 6 main BED fields
These 6 BED fields are so important that you should memorize them. Keep repeating "chrom, start, end, name, score, strand" until the words trip off your tongue
chrom (required) – string naming the chromosome or other contig
start (required) – the 0-based start position of the region
end (required) – the 1-based end position of the region
name (optional) – an arbitrary string describing the region
for BED files loaded as UCSC Genome Browser tracks, this text is displayed above the region
score (optional) – an integer score for the region
for BED files to be loaded as UCSC Genome Browser tracks, this should be a number between 0 and 1000, higher = "better"
for non-GenBrowse BED files, this can be any integer value (e.g. the length of the region)
strand (optional) - a single character describing the region's strand
+ plus strand (Watson strand) region
- minus strand (Crick strand) region
. no strand – the region is not associated with a strand (e.g. a transcription factor binding region)
Important rules for BED format:
The number of fields per line must be consistent throughout any single BED file
e.g. they must all have 3 fields or all have 6 fields
The first base on a contig is numbered 0
versus 1 for BAM file positions
so the a BED start of 99 is actually the 100th base on the contig
but end positions are 1-based
so a BED end of 200 is the 200th base on the contig
the length of a BED region is end - start
not end - start + 1, as it would be if both coordinates with 0-based or both 1-based
this difference is one of the greatest source of errors dealing with BED files!
Note that the UCSC Genome Browser also defines many BED-like data formats (e.g. bedGraph, narrowPeak, tagAlign and various RNA element formats). See supported UCSC Genome Browser data formats for more information and examples.
In addition to standard-format BED files, one can create custom BED files that have at least 3 of the standard fields (chrom, start, end), followed by any number of custom fields. For example:
A BED3+ file contains the 3 required BED fields, followed by some number of user-defined columns
all records having the same number number of columns
A BED6+ file contains the 3 required BED fields, 3 additional standard BED fields (name, score, strand), followed by some number of user-defined columns
all records having the same number number columns
As we will see, BEDTools functions require BED3+ input files, or BED6+ if strand-specific operations are requested.
BEDTools overview
The BEDTools suite is a set of utilities for manipulating BED and BAM files. We call it the "Swiss army knife" for genomic region analyses because its sub-commands are so numerous and versatile. Some of the most common bedtools operations perform set-theory functions on regions: intersection (intersect), union (merge), set difference (subtract) – but there are many others. The table below lists some of the most useful sub-commands along with applicable use cases.
Sub-command | Description | Use case(s) |
|---|---|---|
Convert BAM files to BED format. | You want to have the contig, start, end, and strand information for each mapped alignment record in separate fields. Recall that the strand is encoded in a BAM flag (0x10) and the exact end coordinate requires parsing the CIGAR string. | |
Extract FASTQ sequences from BAM alignment records. | You have downloaded a BAM file from a public database, but it was not aligned against the reference version you want to use (e.g. it is hg19 and you want an hg38 alignment). To re-process, you need to start with the original FASTQ sequences. | |
Get FASTA entries corresponding to regions. | You want to run motif analysis, which requires FASTA sequences, on a set of regions of interest. In addition to a BED or BAM file, you must provide FASTA file(s) for the genome/reference used for alignment (e.g. the FASTA file used to build the aligner index). | |
Generate per-base genome-wide signal trace | Produce a per-base genome-wide signal (in bedGraph format), for example for a ChIP-seq or ATAC-seq experiment. After conversion to binary bigWig format, such tracks can be visualized in the Broad's IGV (Integrative Genome Browser) application, or configured in the UCSC Genome Browser as custom tracks. | |
Compute coverage of your regions |
In either case, regions (e.g. chromosomes or transcripts) are provided as a BED or GFF/GTF file. | |
Count overlaps between one or more BAM files and a set of regions of interest. | Count RNA-seq alignments that overlap a set of genes of interest. While this task is usually done with a specialized RNA-seq quantification tool (e.g. featureCounts or HTSeq), bedtools multicov can provide a quick estimate, e.g. for QC purposes. | |
Determine the overlap between two sets of regions. | Similar to multicov, but can also report the overlapping regions, not just count them. | |
Combine a set of possibly-overlapping regions into a single set of non-overlapping regions. | Collapse overlapping gene annotations into per-strand non-overlapping regions. For example, to create non-overlapping transcipt regions before counting RNA-seq reads (e.g with featureCounts or HTSeq). If this is not done, the source regions will potentially be counted multiple times, once for each (overlapping) target region it intersects. | |
Remove unwanted regions. |
| |
Find the genomic features nearest to a set of regions. | For a set of significant ChIP-seq Transcription Factor (TF) binding regions ("peaks") that have been identified, determine nearby genes that may be targets of TF regulation. |
We will explore a few of these functions in our exercises.
BEDTools versions
BEDTools is under active development and is always being refined and extended. Unfortunately, sometimes changes are made that are incompatible with previous BEDTools versions. For example, a major change to the way bedtool merge functions was made after bedtools v2.17.0.
So it is important to know which version of BEDTools you are using, and read the documentation carefully to see if changes have been made since your version.
Login to ls6, start and idev session, load the BioContainers bedtools module, then check its version.
Start an idev session
idev -m 120 -N 1 -A OTH21164 -r core-ngs-class-0606
# or, without a reservation
idev -m 120 -N 1 -A OTH21164 -p development
module load biocontainers
module load bedtools
bedtools --version # should be bedtools v2.27.1Input format considerations
Most BEDTools functions accept either BAM or BED files as input.
BED format files must be BED3+, or BED6+ if strand-specific operations are requested.
When comparing against a set of regions, those regions are usually supplied in either BED or GTF/GFF.
All text-format input files (BED, GTF/GFF3, VCF) should use Unix line endings (linefeed only).
The most important thing to remember about comparing regions using BEDTools, is that all input files must share the same set of contig names and be based on the same reference! For example, if an alignment was performed against a human GRCh38 reference genome from Gencode, use annotations from the corresponding GTF/GFF3 annotations.
About strandedness
By default many bedtools utilities that perform overlapping, consider reads overlapping the feature on either strand, but can be made strand-specific with the -s or -S option. This strandedness options for bedtools utilities refers the orientation of the R1 read with respect to the feature's (gene's) strand.
-s says the R1 read is sense stranded (on the same strand as the gene).
-S says the R1 read is antisense stranded (the opposite strand as the gene).
RNA-seq libraries can be constructed with 3 types of strandedness:
sense stranded – the R1 read should be on the same strand as the gene.
antisense stranded – the R1 read should be on the opposite strand as the gene.
unstranded – the R1 could be on either strand
Which type of RNA-seq library you have depends on the library preparation method – so ask your sequencing center! Our yeast RNA-seq library is sense stranded; however note that most RNA-seq libraries these days, including ones prepared by GSAF, are more often antisense stranded.
If you have a stranded RNA-seq library, you should use either -s or -S to avoid false counting against a gene on the wrong strand.
About GFF/GTF annotation files
Annotation files that you retrieve from public databases are often in GTF (Gene Transfer Format) or one of the in GFF (General Feature Format) formats (usually GFF3 these days).
Unfortunately, both formats are obscure and hard to work with directly. While bedtools does accept annotation files in GTF/GFF format, you will not like the results. This is because the most useful information in a GTF/GFF file is in a loosely-structured attributes field.
Also unfortunately, there are a number of variations of both annotation formats; however both GTF and GFF share the first 8 Tab-separated fields:
seqname - The name of the chromosome or contig.
source - Name of the program that generated this feature, or other data source (e.g. public database)
feature_type - Type of the feature, for example:
chromosome
CDS (coding sequence), exon
gene, transcript
start_codon, stop_codon
start - Start position of the feature, with sequence numbering starting at 1.
end - End position of the feature, with sequence numbering starting at 1.
score - A numeric value. Often but not always an integer. Meaning differs and not usually important.
strand - Defined as + (forward), - (reverse), or . (no relevant strand)
frame - For a CDS, one of 0, 1 or 2, specifying the reading frame of the first base; otherwise '.'
The Tab-separated columns will care about are (1) seqname, (3) feature_type, (4,5) start, end and (7) strand. The reason we care is that when working with annotations, we usually only want to look at annotations of a particular type, most commonly gene, but also transcript or exon.
So where is the real annotation information, such as the unique gene ID or gene name? Both formats also have a 9th field, which is usually populated by a set of name/value pair attributes, and that's where the useful information is (e.g. the unique feature identifier, name, and so forth).
Take a quick look at a yeast annotation file, sacCer_R64-1-1_20110208.gff using less.
Look at GFF annotation entries with less
mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rnaseq/yeast_mrna.sort.filt.bam* .
cp $CORENGS/catchup/references/gff/sacCer3.R64-1-1_20110208.gff .
# Use the less pager to look at multiple lines
less sacCer3.R64-1-1_20110208.gff
# Look at just the most-important Tab-separated columns
cat sacCer3.R64-1-1_20110208.gff | grep -v '#' | cut -f 1,3-5,7 | head -20
# Include the ugly 9th column where attributes are stored
cat sacCer3.R64-1-1_20110208.gff | grep -v '#' | cut -f 1,3,9 | headIn addition to comment lines (starting with #), you can see the chrI contig names in column 1 and various feature types in column 3. You see also see tags like Name=YAL067C;gene=SEO1; among the attributes on some records, but in general the attributes column information is really ugly.
To summarize, we have two problems to solve:
We only care about a subset of feature types (here genes), and
We want the important annotation information – gene names and IDs – to appear as regular columns instead of weird name/value pairs.
Filter annotations based on desired feature type
One of the first things you want to know about your annotation file is what gene features it contains. Here's how to find that:
Read more about what's going on here at piping a histogram.
Create a histogram of all the feature types in a GFF
cd $SCRATCH/core_ngs/bedtools
cat sacCer3.R64-1-1_20110208.gff | grep -v '^#' | cut -f 3 | \
sort | uniq -c | sort -k1,1nr | moreYou should see something like this.
Histogram of yeast annotation features
7077 CDS
6607 gene
480 noncoding_exon
383 long_terminal_repeat
376 intron
337 ARS
299 tRNA
190 region
129 repeat_region
102 nucleotide_match
89 transposable_element_gene
77 snoRNA
50 LTR_retrotransposon
32 telomere
31 binding_site
27 rRNA
24 five_prime_UTR_intron
21 pseudogene
17 chromosome
16 centromere
15 ncRNA
8 external_transcribed_spacer_region
8 internal_transcribed_spacer_region
6 snRNA
3 gene_cassette
2 insertionLet's create a file that contains only the 6607 gene entries:
Filter GFF gene feature with awk
cat sacCer3.R64-1-1_20110208.gff | grep -v '^#' | \
awk 'BEGIN{FS=OFS="\t"}{ if($3=="gene"){print} }' \
> sc_genes.gff
wc -l sc_genes.gffThe line count of sc_genes.gff should be 6607 – one for each gene entry.
Convert GFF/GTF format to BED with ID in the name field
Our sc_genes.gff annotation subset now contains only the 6607 genes in the Saccharomyces cerevisiae genome. This addresses our first problem, but entries in this file still have the important information – the gene ID and name – in the loosely-structured 9th attributes field.
If we want to associate reads with features, we need to have the feature names where they are easy to extract!
What most folks to is find some way to convert their GTF/GFF file to a BED file, parsing out some (or all) of the name/value attribute pairs into BED file columns after the standard 6. You can find such conversion programs on the web – or write one yourself. Or you could use the BioITeam conversion script, /work/projects/BioITeam/common/script/gtf_to_bed.pl. While it will not work 100% of the time, it manages to do a decent job on most GTF/GFF3 files. And it's pretty easy to run.
Let Anna know if you run into problems
If this script doesn't work on your annotation file, please let Anna know. She is always looking for cases where the conversion fails, and will try to fix it.
Here we just give the script the GFF file to convert, plus a 1 that tells it to URL decode weird looking text (e.g. our Note attribute values).
Convert GFF to BED with BioITeam script
/work/projects/BioITeam/common/script/gtf_to_bed.pl sc_genes.gff 1 \
> sc_genes.converted.bedThe program reads the input file twice – once to gather all the attribute names, and then a second time to write the attribute values in well-defined columns. You'll see output like this:
----------------------------------------
Gathering all attribute names for GTF 'sc_genes.gff'...
urlDecode = 1, tagAttr = tag
Done!
6607 lines read
6607 locus entries
8 attributes found:
(Alias ID Name Note Ontology_term dbxref gene orf_classification)
----------------------------------------
Writing BED output for GTF 'sc_genes.gff'...
Done! Wrote 6607 locus entries from 6607 linesTo find out what the resulting columns are, look at the header line out the output BED file:
head -1 sc_genes.converted.bed For me the resulting 16 attributes are as follows (they may have a different order for you). I've numbered them below for convenience.
Converted BED attributes
1. chrom 2. start 3. end 4. featureType 5. length 6. strand
7. source 8. frame 9. Alias 10. ID 11. Name 12. Note
13. Ontology_term 14. dbxref 15. gene 16. orf_classificationThe final transformation is to do a bit of re-ordering, dropping some fields. We'll do this with awk, because cut can't re-order fields. While this is not strictly required, it can be helpful to have the critical fields (including the gene ID) in the 1st 6 columns. We do this separately for the header line and the rest of the file so that the BED file we give bedtools does not have a header – but we know what those fields are. We would normally preserve valuable annotation information such as Ontology_term, dbxref and Note, but drop them here for simplicity.