Content Comparison

...

So it is important to know which version of BEDTools you are using, and read the documentation carefully to see if changes have been made since your version. When you run the code below, you should see that the bedtools version in the standard TACC module system is bedtools v2.26.0. (Login to login5.ls5.tacc.utexas.edu first.)

Code Block
module load bedtools bedtools --version

Input format considerations

...

Login to stampede2, start and idev session, then load the BioContainers bedtools module, and check its version.

Code Block

language	bash
title	Start an idev session

idev -p development -m 120 -A UT-2015-05-18 -N 1 -n 68 --reservation=BIO_DATA_week_1
# ...
module load biocontainers
module load bedtools
bedtools --version   # should be bedtools v2.27.1

Input format considerations

Most BEDTools functions now accept either BAM or BED files as input.
- BED format files must be BED3+, or BED6+ if strand-specific operations are requested.
When comparing against a set of regions, those regions are usually supplied in either BED or GTF/GFF.
All text-format input files (BED, GTF/GFF, VCF) should use Unix line endings (linefeed only).

...

Take a quick look at a yeast annotation file, sacCer_R64-1-1_20110208.gff using less. (Login to login5.ls5.tacc.utexas.edu first.).

Expand

title	Make sure you're in an idev session

Code Block

language	bash
title	Start an idev session

idev -p development -m 120 -A UT-2015-05-18 -N 1 -n 68 --reservation=BIO_DATA_week_1
# ...
module load biocontainers
module load bedtools
bedtools --version   # should be bedtools v2.27.1

Code Block

language	bash
title	Look at GFF annotation entries with less

mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rna/sacCer_R64-1-1_20110208.gff .
cp $CORENGS/yeast_rna/yeast_mrna.sort.filt.bam* .

# Use the less pager to look at multiple lines
less sacCer_R64-1-1_20110208.gff

# Look at just the most-important Tab-separated columns
cat sacCer_R64-1-1_20110208.gff | grep -v '#' | cut -f 1,3-5 | head -20

# Include the ugly 9th column where attributes are stored
cat sacCer_R64-1-1_20110208.gff | grep -v '#' | cut -f 1,3,9 | head

...

What most folks to is find some way to convert their GFF/GTF file to a BED file, parsing out some (or all) of the name/value attribute pairs into BED file columns after the standard 6. You can find such conversion programs on the web – or write one yourself. Or you could use the BioITeam conversion script, /workwork2/projects/BioITeam/common/script/gtf_to_bed.pl. While it will not work 100% of the time, it manages to do a decent job on most GFF/GTF files. And it's pretty easy to run.

...

Code Block

language	bash
title	Convert GFF to BED with BioITeam script

/workwork2/projects/BioITeam/common/script/gtf_to_bed.pl sc_genes.gff 1 \
  > sc_genes.converted.bed

...

Doesn't this look better? (I've tidied up the output a bit below.)

Code Block

chrI    334     649     YAL069W 315     +       YAL069W Dubious
chrI    537     792     YAL068W-A       255     +       YAL068W-A       Dubious
chrI    1806    2169    YAL068C 363     -       PAU8    Verified
chrI    2479    2707    YAL067W-A       228     +       YAL067W-A       Uncharacterized
chrI    7234    9016    YAL067C 1782    -       SEO1    Verified
chrI    10090   10399   YAL066W 309     +       YAL066W Dubious
chrI    11564   11951   YAL065C 387     -       YAL065C Uncharacterized
chrI    12045   12426   YAL064W-B       381     +       YAL064W-B       Uncharacterized
chrI    13362   13743   YAL064C-A       381     -       YAL064C-A       Uncharacterized
chrI    21565   21850   YAL064W 285     +       YAL064W Verified
chrI    22394   22685   YAL063C-A       291     -       YAL063C-A       Uncharacterized
chrI    23999   27968   YAL063C 3969    -       FLO9    Verified
chrI    31566   32940   YAL062W 1374    +       GDH3    Verified
chrI    33447   34701   YAL061W 1254    +       BDH2    Uncharacterized
chrI    35154   36303   YAL060W 1149    +       BDH1    Verified
chrI    36495   36918   YAL059C-A       423     -       YAL059C-A       Dubious
chrI    36508   37147   YAL059W 639     +       ECM1    Verified
chrI    37463   38972   YAL058W 1509    +       CNE1    Verified
chrI    38695   39046   YAL056C-A       351     -       YAL056C-A       Dubious
chrI    39258   41901   YAL056W 2643    +       GPB2    Verified

...

Expand

title	Answer

Code Block

language	bash

cut -f 8 sc_genes.bed | sort | uniq -c

You should see this:

Code Block
810 Dubious 897 Uncharacterized 4896 Verified 4 Verified\|silenced_gene

If you want to further order this output listing the most abundant category first, add another sort statement:

Code Block

language	bash

cut -f 8 sc_genes.bed | sort | uniq -c | sort -k 1k1,1nr

The -k 1,1nr options says to sort on the 1st field of input, using numeric sorting, in reverse order (i.e., largest first). Which produces:

Code Block
4896 Verified 897 Uncharacterized 809 Dubious 4 Verified\|silenced_gene

...

We're now (finally!) actually going to do some gene-based analyses of a yeast RNA-seq dataset using bedtools and the BED-formatted yeast gene annotation file we created above.

Get the RNA-seq BAM

First start Make sure you're in an idev session, since we will be doing some significant computation, and make bedtools and samtools available.

Code Block

language	bash
title	Start an idev session

idev -p development -m 20120 -A UT-2015-05-18 -N 1 -n 24 --reservation=intro_NGSBIO_DATA_week_1

Copy over the yeast RNA-seq files we'll need (also copy the GFF gene annotation file if you didn't make one).

...

Exercises: How many reads pairs are represented in the yeast_mrna.sort.filt.bam file? How many mapped? How many proper pairs? How many duplicates? What is the distribution of mapping qualities? What is the average mapping quality?

...

Expand

title	Answer

Code Block

language	bash

cd $SCRATCH/core_ngs/bedtools
module load samtools
samtools flagstat yeast_mrna.sort.filt.bam | tee yeast_mrna.flagstat.txt

Code Block

title	samtools flagstat output

3323242 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
922114 + 0 duplicates
3323242 + 0 mapped (100.00% : N/A)
3323242 + 0 paired in sequencing
1661699 + 0 read1
1661543 + 0 read2
3323242 + 0 properly paired (100.00% : N/A)
3323242 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

There are 3323242 total reads, all mapped and all properly paired. So this must be a quality-filtered BAM. There are 922114 duplicates, or about 28%.

To get the distribution of mapping qualities:

Code Block

language	bash

samtools view yeast_mrna.sort.filt.bam | cut -f 5 | sort | uniq -c

Code Block

title	distribution of mapping qualities

To compute average mapping quality:

Code Block

language	bash

samtools view yeast_mrna.sort.filt.bam | awk '
  BEGIN{FS="\t"; sum=0; tot=0}
  {sum = sum + $5; tot = tot + 1}
  END{print "average:",printf("mapping quality average: %.1f for %d reads\n", sum/tot,"for",tot,"reads") }'

Mapping qualities range from 20 to 60 – excellent quality! Because the majority reads have mapping quality 60, the average is 59. So again, there must have been quality filtering performed on upstream alignment records.

...

Version	Old Version 53	New Version 54
Changes made by	Anna Battenhouse	Anna Battenhouse
Saved on	Jun 16, 2020	Jun 05, 2021

Versions Compared

Key

Input format considerations

Input format considerations

Get the RNA-seq BAM