Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

BED (Browser Extensible Data) format is a simple text format for location-oriented data (genomic regions) developed to support UCSC Genome Browser (GenBrowse) tracks. Standard BED files have 3 to 6 Tab-separated columns, although up to 12 columns are defined. (Read more about the UCSC Genome Browser's official BED format.)

...

The BEDTools suite is a set of utilities for manipulating BED and BAM files. We call it the "Swiss army knife" for genomic region analyses because its sub-commands are so numerous and versatile. Some of the most common bedtools operations perform set-theory functions on regions: intersection (intersect), union (merge), set difference (subtract) – but there are many others. The table below lists some of the most useful sub-commands along with applicable use cases.

Sub-commandDescriptionUse case(s)
bamtobedConvert BAM files to BED format.You want to have the contig, start, end, and strand information for each mapped alignment record in separate fields. Recall that the strand is encoded in a BAM flag (0x10) and the exact end coordinate requires parsing the CIGAR string.
bamtofastqExtract FASTQ sequences from BAM alignment records.You have downloaded a BAM file from a public database, but it was not aligned against the reference version you want to use (e.g. it is hg19 and you want an hg38 alignment). To re-process, you need to start with the original FASTQ sequences.
getfastaGet FASTA entries corresponding to regions.You want to run motif analysis, which requires
the original
FASTA sequences, on a set of regions of interest.
 
In addition to
the
a BED or BAM file, you must provide FASTA file(s) for the genome/reference used for alignment (e.g. the FASTA file used to build the aligner index).
genomecov

Generate per-base genome-wide signal trace

Produce a per-base genome-wide signal (in bedGraph format), for example for a ChIP-seq or ATAC-seq experiment. After conversion to binary bigWig format, such tracks can be visualized in the Broad's IGV (Integrative Genome Browser) application, or configured in the UCSC Genome Browser as custom tracks.

coverage

Compute coverage of your regions

  • You have performed a WGS (whole genome sequencing) experiment and want to know if has resulted in the desired coverage depth.
  • Calculate what proportion of the (known) transcriptome is covered by your RNA-seq alignments.
Provide the transcript
  • Transcript regions are provided as a BED or GFF/GTF file.
multicovCount overlaps between one or more BAM files and a set of regions of interest.

Count RNA-seq alignments that overlap a set of genes of interest. While this task is usually done with a specialized RNA-seq quantification tool (e.g. featureCounts or HTSeq), bedtools multicov can provide a quick estimate, e.g. for QC purposes.

mergeCombine a set of possibly-overlapping regions into a single set of non-overlapping regions.Collapse overlapping gene annotations into per-strand non-overlapping regions before counting (e.g with featureCounts or HTSeq). If this is not done, the source regions will potentially be counted multiple times, once for each (overlapping) target region it intersects.
subtractRemove unwanted regions.
  • Remove rRNA/tRNA gene regions
from a merged gene annotations file
  • before counting for RNA-seq.
  • Remove low-complexity genomic regions before peak calling for ChIP-seq or ATAC-seq.
intersectDetermine the overlap between two sets of regions.Similar to multicov, but can also report the overlapping regions, not just count them.
closestFind the genomic features nearest to a set of regions.For a set of significant ChIP-seq
transcription factor
Transcription Factor (TF) binding regions ("peaks") that have been identified, determine nearby genes that may be targets of TF regulation.

We will explore a few of these functions in our exercises.

...

Login to ls6, start and idev session, then load the BioContainers bedtools module, and then check its version.

Code Block
languagebash
titleStart an idev session
idev -m 120 -N 1 -A OTH21164 -r CoreNGS-Fri      # or -A TRA23004
# or
idev -m 90 -N 1 -A OTH21164 -p development   # or -A TRA23004

module load biocontainers
module load bedtools
bedtools --version   # should be bedtools v2.27.1

Input format considerations

  • Most BEDTools functions now accept either BAM or BED files as input. 
    • BED format files must be BED3+, or BED6+ if strand-specific operations are requested.
  • When comparing against a set of regions, those regions are usually supplied in either BED or GTF/GFF.
  • All text-format input files (BED, GTF/GFF, VCF) should use Unix line endings (linefeed only).

...

Which type of RNA-seq library you have depends on the library preparation method – so ask your sequencing center! Our yeast RNA-seq library is sense stranded (; however note that most RNA-seq libraries these days, including ones prepared by GSAF, are antisense stranded).

If you have a stranded RNA-seq library, you should use either -s or -S to avoid false counting against a gene on the wrong strand.

...

  1. seqname - The name of the chromosome or contig.
  2. source - Name of the program that generated this feature, or other data source (e.g. public database)
  3. feature_type - Type of the feature, for example:
    • chromosome
    • CDS (coding sequence), exon
    • gene, transcript
    • start_codon, stop_codon
  4. start - Start position of the feature, with sequence numbering starting at 1.
  5. end - End position of the feature, with sequence numbering starting at 1.
  6. score - A numeric value. Often but not always an integer. Meaning differs and not usually important.
  7. strand - Defined as + (forward), - (reverse), or . (no relevant strand)
  8. frame - For a CDS, one of 0, 1 or 2, specifying the reading frame of the first base; otherwise '.'

...

Expand
titleMake sure you're in an idev session


Code Block
languagebash
titleStart an idev session
idev -m 120 -N 1 -A OTH21164 -r CoreNGS-Fri     # or
idev -m A TRA23004
# or
idev -m 90 -N 1 -A OTH21164 -p development  # or -A TRA23004

module load biocontainers
module load bedtools
bedtools --version   # should be bedtools v2.27.1


...

One of the first things you want to know about your annotation file is what gene features it contains. Here's how to find that: (Read more about what's going on here at piping a histogram)

Expand
titleSetup (if needed)


Code Block
languagebash
mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/catchup/references/gff/sacCer3.R64-1-1_20110208.gff . 


Read more about what's going on here at piping a histogram.

Code Block
languagebash
titleCreate a histogram of all the feature types in a GFF
cd $SCRATCH/core_ngs/bedtools
cat sacCer3.R64-1-1_20110208.gff | grep -v '^#' | cut -f 3 | \
  sort | uniq -c | sort -k1,1nr | more

...

Code Block
chrI    334     649     YAL069W         315     +       YAL069W    Dubious
chrI    537     792     YAL068W-A       255     +       YAL068W-A  Dubious
chrI    1806    2169    YAL068C         363     -       PAU8       Verified
chrI    2479    2707    YAL067W-A       228     +       YAL067W-A  Uncharacterized
chrI    7234    9016    YAL067C         1782    -       SEO1       Verified
chrI    10090   10399   YAL066W         309     +       YAL066W    Dubious
chrI    11564   11951   YAL065C         387     -       YAL065C    Uncharacterized
chrI    12045   12426   YAL064W-B       381     +       YAL064W-B  Uncharacterized
chrI    13362   13743   YAL064C-A       381     -       YAL064C-A  Uncharacterized
chrI    21565   21850   YAL064W         285     +    YAL064W   YAL064W    Verified
chrI    22394   22685   YAL063C-A       291     -       YAL063C-A  Uncharacterized
chrI    23999   27968   YAL063C         3969    -       FLO9       Verified
chrI    31566   32940   YAL062W         1374    +       GDH3       Verified
chrI    33447   34701   YAL061W         1254    +       BDH2       Uncharacterized
chrI    35154   36303   YAL060W         1149    +       BDH1       Verified
chrI    36495   36918   YAL059C-A       423     -       YAL059C-A  Dubious
chrI    36508   37147   YAL059W         639     +       ECM1       Verified
chrI    37463   38972   YAL058W         1509    +       CNE1       Verified
chrI    38695   39046   YAL056C-A       351     -       YAL056C-A  Dubious
chrI    39258   41901   YAL056W         2643    +       GPB2       Verified

Note that value in the 8th column. In the yeast annotations from SGD there are 3 gene classifications: Verified, Uncharacterized and Dubious. The Dubious ones have no experimental evidence so are generally excluded.

...

Using the -c (column) and -o (operation) options, you can have add information added in subsequent fields. Each comma-separated column number following -c specifies a column to operate on, and the corresponding comma-separated function name following the -o specifies the operation to perform on that column in order to produce an additional output field.

...

We can do this with -c 6,4,4 -o distinct,count,collapse, which says that three custom output columns should be added:

  • the 1st custom column (output column 4) should result from collapsing distinct (unique) values of gene file column 6 (the strand column, + or -)
    • since we will ask for stranded merging, the merged regions will always be on the same strand, so this value will always be + or -
  • the 2nd custom output column should result from counting the gene names in column 4 for all genes that were merged, and
  • the 3rd custom output should be a comma-separated collapsed list of those same column 4 gene names.

bedtools merge also requires that the input BED file be sorted by locus (chrom + start), so we do that first, then we request a strand-specific merge (-s):

...

The first few lines of the merged.sc_genes.txt file look like this (I've tidied it up a bit):

Code Block
2-micron        251     1523    +       1       R0010W
2-micron        1886    3008    -       1       R0020C
2-micron        3270    3816    +       1       R0030W
2-micron        5307    6198    -       1       R0040C
chrI        334    334     792     +       2       YAL069W,YAL068W-A
chrI            1806    2169    -       1       YAL068C
chrI            2479    2707    +       1       YAL067W-A
chrI            7234    9016    -       1       YAL067C
chrI            10090   10399   +       1       YAL066W
chrI            11564   11951   -       1       YAL065C

As we specified:

  • Output column 4 has the region's strand.
  • Column 5 is the count of merged regions

...

  • Column 6 is a collapsed, comma-separated list of the merged gene names

...

Exercise: Compare the number of regions in the merged and before-merge gene files.

...

Expand
titleAnswer

Output column 5 has the gene count.

Code Block
languagebash
cut -f 5 merged.sc_genes.txt | sort | uniq -c | sort -k2,2n

Produces this histogram:

Code Block
languagebash
   6374 1
    105 2
      4 3
      1 4
      1 7

There are 111 regions (105 + 4 + 1 + 1) where more than one gene contributed.gene contributed.

Or being fancy:

Code Block
languagebash
cut -f 5 merged.sc_genes.txt | sort | uniq -c | sort -k2,2n \
  | grep -v ' 1$' | awk 'BEGIN{ct=0}{ct=ct+$1}END{print ct}'


Exercise: Repeat the steps above, but first create a good.sc_genes.bed file that does not include Dubious ORFs.

...

To make a valid BED6 file, we'll include the first 3 output columns of merged.good.sc_genes.txt (chrom, start, end), but if strand is to be included, it should be in column 6. Column 4 should be name (we'll put the collapsed gene name list there), and column 5 a score (we'll put the region count there).

...

Code Block
languagebash
titleStart an idev session
idev -m 120 -N 1 -A OTH21164 -r CoreNGS-Fri     # or -A TRA23004
# or
idev -m 90 -N 1 -A OTH21164 -p development  # or -A TRA23004

Copy over the yeast RNA-seq files we'll need (also copy the GFF gene annotation file if you didn't make one).

...

Expand
titleAnswer


Code Block
languagebash
module load samtools

cd $SCRATCH/core_ngs/bedtools_multicov
samtools flagstat yeast_mrna.sort.filt.bam | tee yeast_mrna.flagstat.txt


Code Block
titlesamtools flagstat output
3347559 + 0 in total (QC-passed reads + QC-failed reads)
24317 + 0 secondary
0 + 0 supplementary
922114 + 0 duplicates
3347559 + 0 mapped (100.00% : N/A)
3323242 + 0 paired in sequencing
1661699 + 0 read1
1661543 + 0 read2
3323242 + 0 properly paired (100.00% : N/A)
3323242 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

There are 3323242 total reads, all mapped and all properly paired. So this must be a quality-filtered BAM. There are 922114 duplicates, or about 28%.

To get the distribution of mapping qualities:

Code Block
languagebash
samtools view yeast_mrna.sort.filt.bam | cut -f 5 | sort | uniq -c 


Code Block
titledistribution of mapping qualities
    498 20
   6504 21
   1012 22
    355 23
   1054 24
   2800 25
    495 26
  14133 27
    282 28
    358 29
    954 30
   1244 31
    358 32
   6143 33
    256 34
    265 35
   1112 36
    905 37
    309 38
   4845 39
   5706 40
    427 41
   1946 42
   1552 43
   1771 44
   6140 45
   1771 46
   3049 47
   3881 48
   3264 49
   4475 50
  15692 51
  25378 52
  16659 53
  18305 54
   7108 55
   2705 56
  59867 57
   2884 58
   2392 59
3118705 60

To compute average mapping quality:

Code Block
languagebash
samtools view yeast_mrna.sort.filt.bam | awk '
  BEGIN{FS="\t"; sum=0; tot=0}
  {sum = sum + $5; tot = tot + 1}
  END{printf("mapping quality average: %.1f for %d reads\n", sum/tot,tot) }'

Mapping qualities range from 20 to 60 – excellent quality! Because the majority reads have mapping quality 60, the average is 59. So again, there must have been quality filtering performed on upstream alignment records.

...

Expand
titleSetup (if needed)


Code Block
languagebash
idev -m 120 -N 1 -A OTH21164 -r CoreNGSday5CoreNGS  # or -A TRA23004
module load biocontainers
module load samtools
module load bedtools

mkdir -p $SCRATCH/core_ngs/bedtools_multicov
cd $SCRATCH/core_ngs/bedtools_multicov
cp $CORENGS/catchup/bedtools_merge/merged*bed .
cp $CORENGS/yeast_rnaseq/yeast_mrna.sort.filt.bam* .


...

Expand
titleAnswers


Code Block
languagebash
wc -l yeast_mrna_gene_counts.bed

6485 records were written, one for each feature in the merged.sc_genes.bed file.

The overlap count was added as the last field in each output record (. So here it is field 7 , since the input annotation file had 6 columns).

Exercise: How many features have non-zero overlap counts?

...

A signal track is a bedGraph (BED3+) file with an entry for every base in a defined set of regions that shows the count of overlapping bases for the regions (see https://genome.ucsc.edu/goldenpath/help/bedgraph.html). bedGraph files can be visualized in the Broad's IGV (Integrative Genomics Viewer) application (https://software.broadinstitute.org/software/igv/download) or in the UCSC Genome Browser (https://genome.ucsc.edu/).

  • Go to the UCSC Genome Browser: https://genome.ucsc.edu/
  • Select Genomes from the top menu bar
  • Select Human from POPULAR SPECIES
    • under Human Assembly select Feb 2009 (GrCh37/hg19)
    • select GO
  • In the hg19 browser page, the Layered H3K27Ac track is a signal track
    • the x-axis is the genome position
    • the y-axis represents the count of ChIP-seq reads that overlap each position
      • where the ChIP'd protein is H3K27AC (histone H3, acetylated on the Lysine at amino acid position 27)

...

Code Block
languagebash
titlePrepare for bedtools coverage
idev -m 120 -N 1 -A OTH21164 -r CoreNGS-day5     # or -A TRA23004
# or
idev -m 90 -N 1 -A OTH21164 -p development  # or -A TRA23004

module load biocontainers
module load bedtools

mkdir -p $SCRATCH/core_ngs/bedtools_genomecov
cd $SCRATCH/core_ngs/bedtools_genomecov 
cp $CORENGS/catchup/bedtools_merge/merged*bed .
cp $CORENGS/yeast_rnaseq/yeast_mrna.sort.filt.bam* .

Then calling bedtools genomecov is easy. The -bg option says to report the depth in bedGraph format.

Code Block
languagebash
cd $SCRATCH/core_ngs/bedtools_genomecov
bedtools genomecov -bg -ibam yeast_mrna.sort.filt.bam > yeast_mrna.genomecov.bedGraph

wc -l yeast_mrna.genomecov.bedGraph # 1519274 lines

...

Because this bedGraph file is for the small-ish (12Mb) yeast genome, and for reads that cover only part of that genome, it is not too big – only ~34M. But depending on the species and read depth, bedGraph files can get very large, so there is a coresponding corresponding binary format called bigWig (see https://genome.ucsc.edu/goldenpath/help/bigWig.html). The program to covert a bedGraph file to bigWig format is part of the UCSC Tools suite of programs. Look for it with module spider, and note that you can get information about all the tools in it using module spider with a specific container version:

...

Looking at the help for bedGraphToBigWig, we'll need a file of chromosome sizes. We can create one from our BAM header, using a Perl substitution script, which I prefer to sed(see Tips and tricks#perlpatternsubstitution):

Code Block
languagebash
module load ucsc_tools

cd $SCRATCH/core_ngs/bedtools_genomecov
bedGraphToBigWig  # look at its usage

# create the needed chromosome sizes file from our BAM header
module load samtools
samtools view -H yeast_mrna.sort.filt.bam | grep -P 'SN[:]' | \
  perl -pe 's/.*SN[:]//' | perl -pe 's/LN[:]//' > sc_chrom_sizes.txt

cat sc_chrom_sizes.txt

# displays:
chrI    230218
chrII   813184
chrIII  316620
chrIV   1531933
chrV    576874
chrVI   270161
chrVII  1090940
chrVIII 562643
chrIX   439888
chrX    745751
chrXI   666816
chrXII  1078177
chrXIII 924431
chrXIV  784333
chrXV   1091291
chrXVI  948066
chrM    85779

...

Code Block
languagebash
cd $SCRATCH/core_ngs/bedtools_genomecov
export LC_COLLATE=C  # may or may not need this...
sort -k1,1 -k2,2n yeast_mrna.genomecov.bedGraph > yeast_mrna.genomecov.sorted.bedGraph
bedGraphToBigWig yeast_mrna.genomecov.sorted.bedGraph sc_chrom_sizes.txt yeast_mrna.genomecov.bw

...