Content Comparison

Tip

title	Reservations

Use our summer school reservation (core-ngs-class-0606) when submitting batch jobs to get higher priority on the ls6 normal queue:

sbatch --reservation=core-ngs-class-0606 <batch_file>.slurm

Code Block

language	bash
title	Request an interactive (idev) node

# Request a 120 minute idev node on the normal queue w/our reservation
idev -m 120 -N 1 -A OTH21164 -r core-ngs-class-0606

# Request a 120 minute interactive node on the development queue 
idev -m 120 -N 1 -A OTH21164 -p development

Table of Contents

The BED format

BED (Browser Extensible Data) format is a simple text format for location-oriented data (genomic regions) developed to support UCSC Genome Browser (GenBrowse) tracks. Standard BED files have 3 to 6 Tab-separated columns, although up to 12 columns are defined. (Read more about the UCSC Genome Browser's official BED format.)

...

chrom (required) – string naming the chromosome or other contig
start (required) – the 0-based start position of the region
end (required) – the 1-based end position of the region
name (optional) – an arbitrary string describing the region
- for BED files loaded as UCSC Genome Browser tracks, this text is displayed above the region
score (optional) – an integer score for the region
- for BED files to be loaded as UCSC Genome Browser tracks, this should be a number between 0 and 1000, higher = "better"
- for non-GenBrowse BED files, this can be any integer value (e.g. the length of the region)
strand (optional) - a single character describing the region's strand
- + – plus strand (Watson strand) region
- - – minus strand (Crick strand) region
- . – no strand – the region is not associated with a strand (e.g. a transcription factor binding region)

...

The number of fields per line must be consistent throughout any single BED file
- e.g. they must all have 3 fields or all have 6 fields
The first base on a contig is numbered 0
- versus 1 for BAM file positions
- so a BED start of 99 is actually the 100th base on the contig
- but end positions are 1-based
  - so a BED end of 200 is the 200th base on the contig
- the length of a BED region is end - start
  - not end - start + 1, as it would be if both coordinates were 0-based or both 1-based
- this difference is one of the greatest sources of errors when dealing with BED files!
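The coordinate rules above can be checked with a short awk sketch. The file name sample.bed and the helper bed_lengths are hypothetical names used only for illustration. The sketch appends two columns to each record: the region length (end - start) and the equivalent 1-based start (start + 1).

```shell
# Hypothetical sample: a 3-column, Tab-separated BED file
printf 'chr1\t99\t200\nchr2\t0\t50\n' > sample.bed

# Append the region length (end - start) and the 1-based start (start + 1)
bed_lengths() {
  awk 'BEGIN { FS = OFS = "\t" } { print $1, $2, $3, $3 - $2, $2 + 1 }' "$1"
}

bed_lengths sample.bed
```

For the first record (start 99, end 200), the length is 101 and the 1-based start is 100, matching the "100th base" example above.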

...

A BED3+ file contains the 3 required BED fields, followed by some number of user-defined columns
- all records having the same number of columns
A BED6+ file contains the 3 required BED fields, 3 additional standard BED fields (name, score, strand), followed by some number of user-defined columns

...

- all records having the same number of columns

As we will see, BEDTools functions require BED3+ input files, or BED6+ if strand-specific operations are requested.
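Since every record in a BED file must have the same number of fields, it can help to verify this before passing a file to bedtools. Below is a minimal sketch using awk. The helper name bed_is_consistent and the two sample files are hypothetical, and the check assumes Tab-separated input with no header or comment lines.

```shell
# Hypothetical samples: one consistent BED6 file, one mixing 6 and 3 fields
printf 'chr1\t10\t20\tgeneA\t0\t+\nchr1\t30\t40\tgeneB\t0\t-\n' > good.bed
printf 'chr1\t10\t20\tgeneA\t0\t+\nchr1\t30\t40\n' > mixed.bed

# Succeed (exit status 0) only if every line has the same number of fields
bed_is_consistent() {
  awk -F'\t' '
    NR == 1 { n = NF }             # remember the field count of the first record
    NF != n { bad = 1; exit }      # stop at the first record that differs
    END     { exit bad + 0 }       # exit status 0 = consistent, 1 = inconsistent
  ' "$1"
}

bed_is_consistent good.bed  && echo "good.bed: consistent"
bed_is_consistent mixed.bed || echo "mixed.bed: inconsistent field counts"
```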

...

The BEDTools suite is a set of utilities for manipulating BED and BAM files. We call it the "Swiss army knife" for genomic region analyses because its sub-commands are so numerous and versatile. Some of the most common bedtools operations perform set-theory functions on regions: intersection (intersect), union (merge), set difference (subtract) – but there are many others. The table below lists some of the most useful sub-commands along with applicable use cases.

Sub-command	Description	Use case(s)
bamtobed	Convert BAM files to BED format.	You want to have the contig, start, end, and strand information for each mapped alignment record in separate fields. Recall that the strand is encoded in a BAM flag (0x10) and the exact end coordinate requires parsing the CIGAR string.
bamtofastq	Extract FASTQ sequences from BAM alignment records.	You have downloaded a BAM file from a public database, but it was not aligned against the reference version you want to use (e.g. it is hg19 and you want an hg38 alignment). To re-process, you need to start with the original FASTQ sequences.
getfasta	Get FASTA entries corresponding to regions.	You want to run motif analysis, which requires FASTA sequences, on a set of regions of interest. In addition to a BED or BAM file, you must provide FASTA file(s) for the genome/reference used for alignment (e.g. the FASTA file used to build the aligner index).
genomecov	Generate per-base genome-wide signal trace	Produce a per-base genome-wide signal (in bedGraph format), for example for a ChIP-seq or ATAC-seq experiment. After conversion to binary bigWig format, such tracks can be visualized in the Broad's IGV (Integrative Genomics Viewer) application, or configured in the UCSC Genome Browser as custom tracks.
coverage	Compute coverage of your regions	You have performed a WGS (Whole Genome Sequencing) experiment and want to know if it has resulted in the desired coverage depth. Calculate what proportion of the (known) transcriptome is covered by your RNA-seq alignments. In either case, regions (e.g. chromosomes or transcripts) are provided as a BED or GFF/GTF file.
multicov	Count overlaps between one or more BAM files and a set of regions of interest.	Count RNA-seq alignments that overlap a set of genes of interest. While this task is usually done with a specialized RNA-seq quantification tool (e.g. featureCounts or HTSeq), bedtools multicov can provide a quick estimate, e.g. for QC purposes.
intersect	Determine the overlap between two sets of regions.	Similar to multicov, but can also report the overlapping regions, not just count them.
merge	Combine a set of possibly-overlapping regions into a single set of non-overlapping regions.	Collapse overlapping gene annotations into per-strand non-overlapping regions. For example, to create non-overlapping transcript regions before counting RNA-seq reads (e.g. with featureCounts or HTSeq). If this is not done, the source regions will potentially be counted multiple times, once for each (overlapping) target region it intersects.
subtract	Remove unwanted regions.	Remove rRNA/tRNA gene regions from a merged gene annotations file before counting for RNA-seq. Remove low-complexity genomic regions before peak calling for ChIP-seq or ATAC-seq.
closest	Find the genomic features nearest to a set of regions.	For a set of significant ChIP-seq Transcription Factor (TF) binding regions ("peaks") that have been identified, determine nearby genes that may be targets of TF regulation.
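To make the merge operation in the table more concrete, here is a rough sketch of the idea using only sort and awk. It is an illustration of collapsing overlapping regions, not how bedtools implements it. In practice the equivalent would be something like bedtools merge -i on a sorted BED file. The file regions.bed and the helper merge_regions are hypothetical, and the sketch ignores strand and any columns beyond the first three.

```shell
# Hypothetical sample regions (Tab-separated BED3), deliberately unsorted
printf 'chr1\t150\t300\nchr1\t100\t200\nchr1\t400\t500\nchr2\t10\t20\n' > regions.bed

# Sort by contig, then numerically by start, then collapse overlapping regions
merge_regions() {
  sort -k1,1 -k2,2n "$1" | awk 'BEGIN { FS = OFS = "\t" }
    $1 != chrom || $2 + 0 > cur_end {      # new contig, or a gap before this region
      if (chrom != "") print chrom, cur_start, cur_end
      chrom = $1; cur_start = $2 + 0; cur_end = $3 + 0
      next
    }
    $3 + 0 > cur_end { cur_end = $3 + 0 }  # overlap: extend the current region
    END { if (chrom != "") print chrom, cur_start, cur_end }'
}

merge_regions regions.bed
```

With the sample above, the two overlapping chr1 regions (100-200 and 150-300) collapse into a single 100-300 region, while chr1 400-500 and chr2 10-20 pass through unchanged. Regions that merely touch (one region's end equal to the next region's start) are also combined by this sketch.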

Version	Old Version 69	New Version Current
Changes made by	Anna Battenhouse	Anna Battenhouse
Saved on	Jun 19, 2022	Jun 04, 2025

The BED format

Input format considerations

About strandedness

Use bedtools genomecov to create a signal track