Content Comparison

...

Tip

title	Reservations

Use our summer school today's reservation (CoreNGScore-ngs-class-0606) when submitting batch jobs to get higher priority on the ls6 normal queue.

Code Block

language	bash
title	Request an interactive (idev) node

# Request a 180 minute interactiveidev node on the normal queue using w/our reservation
idev -m 120 -N 1 -A OTH21164 -r CoreNGS
idev -m 120 -N 1 -A TRA23004 -r CoreNGScore-ngs-class-0606

# Request a 120 minute idevinteractive node on the development queue 
idev -m 120 -N 1 -A OTH21164 -p development
idev -m 120 -N 1 -A TRA23004 -p development

Table of Contents

The BED format

...

chrom (required) – string naming the chromosome or other contig
start (required) – the 0-based start position of the region
end (required) – the 1-based end position of the region
name (optional) – an arbitrary string describing the region
- for BED files loaded as UCSC Genome Browser tracks, this text is displayed above the region
score (optional) – an integer score for the region
- for BED files to be loaded as UCSC Genome Browser tracks, this should be a number between 0 and 1000, higher = "better"
- for non-GenBrowse BED files, this can be any integer value (e.g. the length of the region)
strand (optional) - a single character describing the region's strand
- + – plus strand (Watson strand) region
- - – minus strand (Crick strand) region
- . – no strand – the region is not associated with a strand (e.g. a transcription factor binding region)

...

The number of fields per line must be consistent throughout any single BED file
- e.g. they must all have 3 fields or all have 6 fields
The first base on a contig is numbered 0
- versus 1 for BAM file positions
- so the a BED start of 99 is actually the 100th base on the contig
- but end positions are 1-based
  - so a BED end of 200 is the 200th base on the contig
- the length of a BED region is end - start
  - not end - start + 1, as it would be if both coordinates with 0-based or both 1-based
- this difference is one of the single greatest source of errors dealing with BED files!

...

A BED3+ file contains the 3 required BED fields, followed by some number of user-defined columns (
- all records
with
- having the same number
)
- number of columns
A BED6+ file contains the 3 required BED fields, 3 additional standard BED fields (name, score, strand), followed by some number of user-defined columns
- all records having the same number number columns

As we will see, BEDTools functions require BED3+ input files, or BED6+ if strand-specific operations are requested.

...

Sub-command	Description	Use case(s)
bamtobed	Convert BAM files to BED format.	You want to have the contig, start, end, and strand information for each mapped alignment record in separate fields. Recall that the strand is encoded in a BAM flag (0x10) and the exact end coordinate requires parsing the CIGAR string.
bamtofastq	Extract FASTQ sequences from BAM alignment records.	You have downloaded a BAM file from a public database, but it was not aligned against the reference version you want to use (e.g. it is hg19 and you want an hg38 alignment). To re-process, you need to start with the original FASTQ sequences.
getfasta	Get FASTA entries corresponding to regions.	You want to run motif analysis, which requires FASTA sequences, on a set of regions of interest. In addition to a BED or BAM file, you must provide FASTA file(s) for the genome/reference used for alignment (e.g. the FASTA file used to build the aligner index).
genomecov	Generate per-base genome-wide signal trace	Produce a per-base genome-wide signal (in bedGraph format), for example for a ChIP-seq or ATAC-seq experiment. After conversion to binary bigWig format, such tracks can be visualized in the Broad's IGV (Integrative Genome Browser) application, or configured in the UCSC Genome Browser as custom tracks.
coverage	Compute coverage of your regions	You have performed a WGS ( whole genome sequencing Whole Genome Sequencing) experiment and want to know if has resulted in the desired coverage depth. Calculate what proportion of the (known) transcriptome is covered by your RNA-seq alignments. Transcript regions In either case, regions (e.g. chromosomes or transcripts) are provided as a BED or GFF/GTF file.
multicov	Count overlaps between one or more BAM files and a set of regions of interest.	Count RNA-seq alignments that overlap a set of genes of interest. While this task is usually done with a specialized RNA-seq quantification tool (e.g. featureCounts or HTSeq), bedtools multicov can provide a quick estimate, e.g. for QC purposes.
intersect	Determine the overlap between two sets of regions.	Similar to multicov, but can also report the overlapping regions, not just count them.
merge	Combine a set of possibly-overlapping regions into a single set of non-overlapping regions.	Collapse overlapping gene annotations into per-strand non-overlapping regions. For example, to create non-overlapping transcipt regions before counting RNA-seq reads (e.g with featureCounts or HTSeq). If this is not done, the source regions will potentially be counted multiple times, once for each (overlapping) target region it intersects.
subtract	Remove unwanted regions.	Remove rRNA/tRNA gene regions before counting for RNA-seq. Remove low-complexity genomic regions before peak calling for ChIP-seq or ATAC-seq.
intersect	Determine the overlap between two sets of regions.	Similar to multicov, but can also report the overlapping regions, not just count them.
closest	Find the genomic features nearest to a set of regions.	For a set of significant ChIP-seq Transcription Factor (TF) binding regions ("peaks") that have been identified, determine nearby genes that may be targets of TF regulation.

...

Code Block

language	bash
title	Start an idev session

idev -m 120 -N 1 -A OTH21164 -r CoreNGScore-ngs-class-0606

    # or, -Awithout TRA23004a # orreservation
idev -m 90120 -N 1 -A OTH21164 -p development 

# or -A TRA23004

modulemodule load biocontainers
module load bedtools
bedtools --version   # should be bedtools v2.27.1

...

Most BEDTools functions accept either BAM or BED files as input.
- BED format files must be BED3+, or BED6+ if strand-specific operations are requested.
When comparing against a set of regions, those regions are usually supplied in either BED or GTF/GFF.
All text-format input files (BED, GTF/GFFGFF3, VCF) should use Unix line endings (linefeed only).

The most important thing to remember about comparing regions using BEDTools, is that all input files must share the same set of contig names and be based on the same reference! For example, if an alignment was performed against a human GRCh38 reference genome from Gencode, use annotations from the corresponding GFFGTF/GTFGFF3 annotations.

About strandedness

...

Which type of RNA-seq library you have depends on the library preparation method – so ask your sequencing center! Our yeast RNA-seq library is sense stranded; however note that most RNA-seq libraries these days, including ones prepared by GSAF, are more often antisense stranded.

If you have a stranded RNA-seq library, you should use either -s or -S to avoid false counting against a gene on the wrong strand.

...

Annotation files that you retrieve from public databases are often in GTF (Gene Transfer Format) or one of the in GFF (General Feature Format) formats (usually GFF3 these days).

Unfortunately, both formats are obscure and hard to work with directly. While bedtools does accept annotation files in GTF/GFF/GTF format, you will not like the results. This is because the most useful information in a GTF/GFF/GTF file is in a loosely-structured attributes field.

Also unfortunately, there are a number of variations of both annotation formats However ; however both GTF and GFF share the first 8 Tab-separated fields:

...

The Tab-separated columns will care about are (1) seqname, (3) feature_type and , (4,5) start, end and (7) strand. The reason we care is that when working with annotations, we usually only want to look at annotations of a particular type, most commonly gene, but also transcript or exon.

...

Expand

title	Make sure you're in an idev session

Code Block

language	bash
title	Start an idev session

idev -m 120 -N 1 -A OTH21164 -r CoreNGS   core-ngs-class-0606

# or, -Awithout TRA23004
# ora reservation:
idev -m 90120 -N 1 -A OTH21164 -p development  # or -A TRA23004

module load biocontainers
module load bedtools
bedtools --version   # should be bedtools v2.27.1

...

Code Block

language	bash
title	Look at GFF annotation entries with less

mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools 
cp $CORENGS/yeast_rnaseq/yeast_mrna.sort.filt.bam* .  
cp $CORENGS/catchup/references/gff/sacCer3.R64-1-1_20110208.gff . 

# Use the less pager to look at multiple lines
less sacCer3.R64-1-1_20110208.gff

# Look at just the most-important Tab-separated columns
cat sacCer3.R64-1-1_20110208.gff | grep -v '#' | cut -f 1,3-5,7 | head -20

# Include the ugly 9th column where attributes are stored
cat sacCer3.R64-1-1_20110208.gff | grep -v '#' | cut -f 1,3,9 | head

...

Code Block

title	Filter GFF gene feature with awk

cat sacCer3.R64-1-1_20110208.gff | grep -v '#^#' | \
  awk 'BEGIN{FS=OFS="\t"}{ if($3=="gene"){print} }' \
  > sc_genes.gff
wc -l sc_genes.gff

...

What most folks to is find some way to convert their GTF/GFF/GTF file to a BED file, parsing out some (or all) of the name/value attribute pairs into BED file columns after the standard 6. You can find such conversion programs on the web – or write one yourself. Or you could use the BioITeam conversion script, /work/projects/BioITeam/common/script/gtf_to_bed.pl. While it will not work 100% of the time, it manages to do a decent job on most GFFGTF/GTF filesGFF3 files. And it's pretty easy to run.

...

The final transformation is to do a bit of re-ordering, dropping some fields. We'll do this with awk, because cut can't re-order fields. While this is not strictly required, it can be helpful to have the critical fields (including the gene ID) in the 1st 6 columns. We do this separately for the header line and the rest of the file so that the BED file we give bedtools does not have a header (– but we know what those fields are). We would normally preserve valuable annotation information such as Ontology_term, dbxref and Note, but drop them here for simplicity.

...

Code Block

language	bash
title	Re-order the final BED fields

head -1 sc_genes.converted.bed | sed 's/\r//' | awk '
 BEGIN{FS=OFS="\t"}{print $1,$2,$3,$10,$5,$6,$15,$16}
 ' > sc_genes.bed.hdr

tail -n +2 sc_genes.converted.bed | sed 's/\r//' | awk '
 BEGIN{FS=OFS="\t"}
 { # make sure gene name is populated
   if($15 == "") {$15 = $10}
# make sure gene name is populated
   print $1print $1,$2,$3,$10,$5,$6,$15,$16}
 ' > sc_genes.bed

One final detail. Annotation files you download may have non-Unix (linefeed-only) line endings. Specifically, they may use Windows line endings (carriage return + linefeed). (Read about Line ending nightmares.) The expression sed 's/\r//' uses the sed (substitution editor) tool to replace carriage return characters ( \r ) with nothing, removing them from the output if they are present.

Finally, the 8 re-ordered attributes are:

...

Expand

title	Hint

Use cut to isolate that field (8), sort to sort the resulting values into blocks, then uniq -c to count the members of each block.

...

the 1st custom column (output column 4) should result from collapsing distinct (unique) values of gene file column 6 (the strand column, + or -)
- since we will ask for stranded merging, the merged regions will always be on the same strand, so this value will always be + or -
the 2nd custom output column should result from counting the gene names in column 4 for all genes that were merged, and
the 3rd custom output should be a comma-separated collapsed list of those same column 4 gene names.

...

Code Block

language	bash
title	Use bedtools merge to collapse overlapping gene annotations

cd $SCRATCH/core_ngs/bedtools
sort -k1,1 -k2,2n sc_genes.bed > sc_genes.sorted.bed
bedtools merge -i sc_genes.sorted.bed -s \
  -c 6,4,4 -o distinct,count,collapse > merged.sc_genes.txt

...

Expand

title	Answer

Output column 5 has the gene count.

Code Block

language	bash

cut -f 5 merged.sc_genes.txt | sort | uniq -c | sort -k2,2n

Produces this histogram:

Code Block

language	bash

There are 111 regions (105 + 4 + 1 + 1) where more than one gene contributed.

Or being fancy:

Code Block

language	bash

cut -f 5 merged.sc_genes.txt | sort | uniq -c | sort -k2,2n | \
  | grep -v ' 1$' | awk 'BEGIN{ct=0}{ct=ct+$1}END{print ct}'

...

Make sure you're in an idev session, since we will be doing some significant computation, and make bedtools and samtools available.

Expand

title	Make sure you're in an idev session

Code Block

language	bash
title	Start an idev session

idev -m 120 -N 1 -A OTH21164 -r

CoreNGS

core-ngs-class-0606

# or,

-A

without

TRA23004

#

reservation

or

idev -m

90

120 -N 1 -A OTH21164 -p development 

module load biocontainers
module load bedtools
bedtools --version   # should

or

be

-A TRA23004

bedtools v2.27.1

Copy over the yeast RNA-seq files we'll need (also copy the GFF gene annotation file if you didn't make one).

Code Block

language	bash
title	Setup for BEDTools multicov

# Get the merged yeast genes bed file if you didn't create one
mkdir -p $SCRATCH/core_ngs/bedtools_multicov
cd $SCRATCH/core_ngs/bedtools_multicov
cp $CORENGS/catchup/bedtools_merge/merged*bed .

# Copy the BAM file
cd $SCRATCH/core_ngs/bedtools_multicov
cp $CORENGS/yeast_rnaseq/yeast_mrna.sort.filt.bam* .

Exercises:

How many reads are represented in the yeast_mrna.sort.filt.bam file?
How many mapped? How many proper pairs? How many duplicates?
What is the distribution of mapping qualities? What is the average mapping quality?

Expand

title	Hints

samtools flagstat for the different read counts.

samtools view + cut + sort + uniq -c for mapping quality distribution

samtools view + awk for average mapping quality

...

Expand

title	Answer

Code Block

language	bash

module load samtools

cd $SCRATCH/core_ngs/bedtools_multicov
samtools flagstat yeast_mrna.sort.filt.bam | tee yeast_mrna.flagstat.txt

Code Block

title	samtools flagstat output

3347559 + 0 in total (QC-passed reads + QC-failed reads)
24317 + 0 secondary
0 + 0 supplementary
922114 + 0 duplicates
3347559 + 0 mapped (100.00% : N/A)
3323242 + 0 paired in sequencing
1661699 + 0 read1
1661543 + 0 read2
3323242 + 0 properly paired (100.00% : N/A)
3323242 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

There are 3323242 total reads, all mapped and all properly paired. So this must be a quality-filtered BAM.

There are 922114 duplicates, or about 28%.

To get the distribution of mapping qualities :(BAM field 5)

Code Block

language	bash

samtools view yeast_mrna.sort.filt.bam | cut -f 5 | sort | uniq -c

Code Block

title	distribution of mapping qualities

To compute average mapping quality:

Code Block

language	bash

samtools view yeast_mrna.sort.filt.bam | awk '
  BEGIN{FS="\t"; sum=0; tot=0}
  {sum = sum + $5; tot = tot + 1}
  END{printf("mapping quality average: %.1f for %d reads\n", sum/tot,tot) }'

Mapping qualities range from 20 to 60 – excellent quality! Because the majority reads have mapping quality 60, the average is 59.2. So again, there must have been quality filtering performed on upstream alignment records.

Here's how to run bedtools multicov in stranded mode, directing the standard output to a file:

Expand

title	Setup (if needed)

Code Block

language	bash

idev -m 120 -N 1 -A OTH21164 -r CoreNGScore-ngs-class-0606

# or, without a reservation:
idev -m 120 -N 1 -A TRA23004 OTH21164 -p development

module load biocontainers
module load samtools
module load bedtools

mkdir -p $SCRATCH/core_ngs/bedtools_multicov
cd $SCRATCH/core_ngs/bedtools_multicov
cp $CORENGS/catchup/bedtools_merge/merged*bed .
cp $CORENGS/yeast_rnaseq/yeast_mrna.sort.filt.bam* .

...

Code Block

language	bash
title	Run bedtools multicov to count BAM alignments overlapping a set of genes

cd $SCRATCH/core_ngs/bedtools_multicov
bedtools multicov -s -bams yeast_mrna.sort.filt.bam \
  -bed merged.good.sc_genes.bed > yeast_mrna_gene_counts.bed

ExerciseExercises:

How may records of output were written?
Where is the count of overlaps per output record?

Expand

title	Answers

Code Block

language	bash

wc -l yeast_mrna_gene_counts.bed

6485 records were written, one for each feature in the merged.sc_genes.bed file.

The overlap count was added as the last field in each output record. So here it is field 7 since the input annotation file had 6 columns.

...

Expand

title	Answer

Code Block

language	bash

cut -f 7 yeast_mrna_gene_counts.bed | grep -v '^0' | wc -l
# or
cut -f 7 yeast_mrna_gene_counts.bed | grep -v -c '^0'
# or
cat yeast_mrna_gene_counts.bed | \
  awk '{if ($7 > 0) print $7}' | wc -l

Most of the genes (6141/6485) have non-zero read overlap counts.

...

Go to the UCSC Genome Browser: https://genome.ucsc.edu/
Select Genomes from the top menu bar
Select Human from POPULAR SPECIES
- under Human Assembly select Feb 2009 (GrCh37/hg19)
- select GO
In the hg19 browser page, the Layered H3K27Ac
- the 100 Vert. Cons track is a signal track
  - the x-axis is the genome position
  - the y-axis represents the
  count of ChIP-seq reads that overlap each position
  - where the ChIP'd protein is H3K27AC base-wise conservation among vertebrates
- customize the 100 Vert. Cons track
  - right-click on "100 Vert. Cons" text in the left margin,
    - select "Configure 100 Vert. Cons" from the menu
  - in the 100 Vert. Cons Track Settings dialog:
    - change "Track height" to 100
    - change "Data view scaling" to "auto-scale to data view"
    - click "OK"
- the Layered H3K27Ac track is a signal track
  - the x-axis is the genome position
  - the y-axis represents the count of ChIP-seq reads that overlap each position
    - where the ChIP'd protein is H3K27AC (histone H3, acetylated on the Lysine at amino acid position 27)

The bedtools genomecov function (https://bedtools.readthedocs.io/en/latest/content/tools/coverage.html), with the -bg (bedgraph) option produces output in bedGraph format. Here we'll analyze the per-base coverage of yeast RNAseq reads in our merged yeast gene regions.

Make sure you're in an idev session, then prepare a directory for this exercise.

Prepare for bedtools coverage

Expand

title	Make sure you're in an idev session

Code Block

title
language	bash

Start an idev session

idev -m 120 -N 1 -A OTH21164 -r

CoreNGS

core-ngs-class-0606

# or,

-A

without

TRA23004

#

reservation:

or

idev -m

90

120 -N 1 -A OTH21164 -p development

# or

-A TRA23004

module load biocontainers
module load bedtools
bedtools --version   # should be bedtools v2.27.1

Code Block

language	bash
title	Prepare for bedtools coverage

mkdir -p $SCRATCH/core_ngs/bedtools_genomecov
cd $SCRATCH/core_ngs/bedtools_genomecov 
cp $CORENGS/catchup/bedtools_merge/merged*bed .
cp $CORENGS/yeast_rnaseq/yeast_mrna.sort.filt.bam* .

...

Version	Old Version 90	New Version Current
Changes made by	Anna Battenhouse	Anna Battenhouse
Saved on	Jun 05, 2024	Jun 04, 2025

Versions Compared

Key

The BED format

About strandedness