Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

So it is important to know which version of BEDTools you are using, and read the documentation carefully to see if changes have been made since your version. When you run the code below, you should see that the bedtools version in the standard TACC module system is bedtools v2.26.0. (Login to login5.ls5.tacc.utexas.edu first.)

Code Block
module load bedtools
bedtools --version

Input format considerations

...

Login to stampede2, start and idev session, then load the BioContainers bedtools module, and check its version.

Code Block
languagebash
titleStart an idev session
idev -p development -m 120 -A UT-2015-05-18 -N 1 -n 68 --reservation=BIO_DATA_week_1
# ...
module load biocontainers
module load bedtools
bedtools --version   # should be bedtools v2.27.1

Input format considerations

  • Most BEDTools functions now accept either BAM or BED files as input. 
    • BED format files must be BED3+, or BED6+ if strand-specific operations are requested.
  • When comparing against a set of regions, those regions are usually supplied in either BED or GTF/GFF.
  • All text-format input files (BED, GTF/GFF, VCF) should use Unix line endings (linefeed only).

...

Take a quick look at a yeast annotation file, sacCer_R64-1-1_20110208.gff using less. (Login to login5.ls5.tacc.utexas.edu first.).

Expand
titleMake sure you're in an idev session
Code Block
languagebash
titleStart an idev session
idev -p development -m 120 -A UT-2015-05-18 -N 1 -n 68 --reservation=BIO_DATA_week_1
# ...
module load biocontainers
module load bedtools
bedtools --version   # should be bedtools v2.27.1
Code Block
languagebash
titleLook at GFF annotation entries with less
mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rna/sacCer_R64-1-1_20110208.gff .
cp $CORENGS/yeast_rna/yeast_mrna.sort.filt.bam* .

# Use the less pager to look at multiple lines
less sacCer_R64-1-1_20110208.gff

# Look at just the most-important Tab-separated columns
cat sacCer_R64-1-1_20110208.gff | grep -v '#' | cut -f 1,3-5 | head -20

# Include the ugly 9th column where attributes are stored
cat sacCer_R64-1-1_20110208.gff | grep -v '#' | cut -f 1,3,9 | head

...

What most folks to is find some way to convert their GFF/GTF file to a BED file, parsing out some (or all) of the name/value attribute pairs into BED file columns after the standard 6. You can find such conversion programs on the web – or write one yourself. Or you could use the BioITeam conversion script, /workwork2/projects/BioITeam/common/script/gtf_to_bed.pl. While it will not work 100% of the time, it manages to do a decent job on most GFF/GTF files. And it's pretty easy to run.

...

Code Block
languagebash
titleConvert GFF to BED with BioITeam script
/workwork2/projects/BioITeam/common/script/gtf_to_bed.pl sc_genes.gff 1 \
  > sc_genes.converted.bed

...

Doesn't this look better? (I've tidied up the output a bit below.)

Code Block
chrI    334     649     YAL069W 315     +       YAL069W Dubious
chrI    537     792     YAL068W-A       255     +       YAL068W-A       Dubious
chrI    1806    2169    YAL068C 363     -       PAU8    Verified
chrI    2479    2707    YAL067W-A       228     +       YAL067W-A       Uncharacterized
chrI    7234    9016    YAL067C 1782    -       SEO1    Verified
chrI    10090   10399   YAL066W 309     +       YAL066W Dubious
chrI    11564   11951   YAL065C 387     -       YAL065C Uncharacterized
chrI    12045   12426   YAL064W-B       381     +       YAL064W-B       Uncharacterized
chrI    13362   13743   YAL064C-A       381     -       YAL064C-A       Uncharacterized
chrI    21565   21850   YAL064W 285     +       YAL064W Verified
chrI    22394   22685   YAL063C-A       291     -       YAL063C-A       Uncharacterized
chrI    23999   27968   YAL063C 3969    -       FLO9    Verified
chrI    31566   32940   YAL062W 1374    +       GDH3    Verified
chrI    33447   34701   YAL061W 1254    +       BDH2    Uncharacterized
chrI    35154   36303   YAL060W 1149    +       BDH1    Verified
chrI    36495   36918   YAL059C-A       423     -       YAL059C-A       Dubious
chrI    36508   37147   YAL059W 639     +       ECM1    Verified
chrI    37463   38972   YAL058W 1509    +       CNE1    Verified
chrI    38695   39046   YAL056C-A       351     -       YAL056C-A       Dubious
chrI    39258   41901   YAL056W 2643    +       GPB2    Verified

...

Expand
titleAnswer
Code Block
languagebash
cut -f 8 sc_genes.bed | sort | uniq -c

You should see this:

Code Block
    810 Dubious
    897 Uncharacterized
   4896 Verified
      4 Verified|silenced_gene

If you want to further order this output listing the most abundant category first, add another sort statement:

Code Block
languagebash
cut -f 8 sc_genes.bed | sort | uniq -c | sort -k 1k1,1nr

The -k 1,1nr options says to sort on the 1st field of input, using numeric sorting, in reverse order (i.e., largest first). Which produces:

Code Block
   4896 Verified
    897 Uncharacterized
    809 Dubious
      4 Verified|silenced_gene

...

We're now (finally!) actually going to do some gene-based analyses of a yeast RNA-seq dataset using bedtools and the BED-formatted yeast gene annotation file we created above.

Get the RNA-seq BAM

First start Make sure you're in an idev session, since we will be doing some significant computation, and make bedtools and samtools available.

Code Block
languagebash
titleStart an idev session
idev -p development -m 20120 -A UT-2015-05-18 -N 1 -n 24 --reservation=intro_NGSBIO_DATA_week_1

Copy over the yeast RNA-seq files we'll need (also copy the GFF gene annotation file if you didn't make one).

...

Exercises: How many reads pairs are represented in the yeast_mrna.sort.filt.bam file? How many mapped? How many proper pairs? How many duplicates? What is the distribution of mapping qualities? What is the average mapping quality?

...

Expand
titleAnswer
Code Block
languagebash
cd $SCRATCH/core_ngs/bedtools
module load samtools
samtools flagstat yeast_mrna.sort.filt.bam | tee yeast_mrna.flagstat.txt
Code Block
titlesamtools flagstat output
3323242 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
922114 + 0 duplicates
3323242 + 0 mapped (100.00% : N/A)
3323242 + 0 paired in sequencing
1661699 + 0 read1
1661543 + 0 read2
3323242 + 0 properly paired (100.00% : N/A)
3323242 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

There are 3323242 total reads, all mapped and all properly paired. So this must be a quality-filtered BAM. There are 922114 duplicates, or about 28%.

To get the distribution of mapping qualities:

Code Block
languagebash
samtools view yeast_mrna.sort.filt.bam | cut -f 5 | sort | uniq -c 
Code Block
titledistribution of mapping qualities
    453 20
   6260 21
    889 22
    326 23
    971 24
   2698 25
    376 26
  12769 27
    268 28
    337 29
    933 30
   1229 31
    345 32
   5977 33
    256 34
    249 35
   1103 36
    887 37
    292 38
   4648 39
   5706 40
    426 41
   1946 42
   1547 43
   1761 44
   6138 45
   1751 46
   3019 47
   3710 48
   3236 49
   4467 50
  15691 51
  25370 52
  16636 53
  18081 54
   7084 55
   2701 56
  59851 57
   2836 58
   2118 59
3097901 60

To compute average mapping quality:

Code Block
languagebash
samtools view yeast_mrna.sort.filt.bam | awk '
  BEGIN{FS="\t"; sum=0; tot=0}
  {sum = sum + $5; tot = tot + 1}
  END{print "average:",printf("mapping quality average: %.1f for %d reads\n", sum/tot,"for",tot,"reads") }'

Mapping qualities range from 20 to 60 – excellent quality! Because the majority reads have mapping quality 60, the average is 59. So again, there must have been quality filtering performed on upstream alignment records.

...