Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.


Tip
titleReservations

Use our

summer school

today's reservation (

CoreNGSday5

core-ngs-class-0606) when submitting batch jobs to get higher priority on the ls6 normal queue

today:

sbatch --reservation=CoreNGSday5 <batch_file>.slurm
idev -m 180 -N 1 -A OTH21164 -r CoreNGSday5

Table of Contents

The BED format

...

.

Code Block
languagebash
titleRequest an interactive (idev) node
# Request a 180 minute idev node on the normal queue w/our reservation
idev -m 120 -N 1 -A OTH21164 -r core-ngs-class-0606

# Request a 120 minute interactive node on the development queue 
idev -m 120 -N 1 -A OTH21164 -p development


Table of Contents

The BED format

BED (Browser Extensible Data) format is a simple text format for location-oriented data (genomic regions) developed to support UCSC Genome Browser (GenBrowse) tracks. Standard BED files have 3 to 6 Tab-separated columns, although up to 12 columns are defined. (Read more about the UCSC Genome Browser's official BED format.)

...

  1. chrom (required) – string naming the chromosome or other contig
  2. start (required) – the 0-based start position of the region
  3. end (required) – the 1-based end position of the region
  4. name (optional) – an arbitrary string describing the region
    • for BED files loaded as UCSC Genome Browser tracks, this text is displayed above the region
  5. score (optional) – an integer score for the region
    • for BED files to be loaded as UCSC Genome Browser tracks, this should be a number between 0 and 1000, higher = "better"
    • for non-GenBrowse BED files, this can be any integer value (e.g. the length of the region)
  6. strand (optional) - a single character describing the region's strand
    • +   plus strand (Watson strand) region
    • -    minus strand (Crick strand) region
    • .    no strand – the region is not associated with a strand (e.g. a transcription factor binding region)

...

  • The number of fields per line must be consistent throughout any single BED file
    • e.g. they must all have 3 fields or all have 6 fields
  • The first base on a contig is numbered 0
    • versus 1 for BAM file positions
    • so the a BED start of 99 is actually the 100th base on the contig
    • but end positions are 1-based
      • so a BED end of 200 is the 200th base on the contig
    • the length of a BED region is end - start
      • not end - start  + 1, as it would be if both coordinates with 0-based or both 1-based
    • this difference is one of the single greatest source of errors dealing with BED files!

...

  • A BED3+ file contains the 3 required BED fields, followed by some number of user-defined columns (
    • all records
    with
    • having the same number number
    )
    • of columns
  • A BED6+ file contains the 3 required BED fields, 3 additional standard BED fields (name, score, strand), followed by some number of user-defined columns

      ...

        • all records having the same number number columns

      As we will see, BEDTools functions require BED3+ input files, or BED6+ if strand-specific operations are requested.

      ...

      The BEDTools suite is a set of utilities for manipulating BED and BAM files. We call it the "Swiss army knife" for genomic region analyses because its sub-commands are so numerous and versatile. Some of the most common bedtools operations perform set-theory functions on regions: intersection (intersect), union (merge), set difference (subtract) – but there are many others. The table below lists some of the most useful sub-commands along with applicable use cases.

      Sub-commandDescriptionUse case(s)
      bamtobedConvert BAM files to BED format.

      You want to have the contig, start, end, and strand information for each mapped alignment record in separate fields.

      Recall that the strand is encoded in a BAM flag (0x10) and the exact end coordinate requires parsing the CIGAR string.

      bamtofastqExtract FASTQ sequences from BAM alignment records.You have downloaded a BAM file from a public database, but it was not aligned against the reference version you want to use (e.g. it is hg19 and you want an hg38 alignment). To re-process, you need to start with the original FASTQ sequences.
      getfastaGet FASTA entries corresponding to regions.You want to run motif analysis, which requires
      the original
      FASTA sequences, on a set of regions of interest.
       
      In addition to
      the
    • Compute genome-wide coverage of your regions
    • a BED or BAM file, you must provide FASTA file(s) for the genome/reference used for alignment (e.g. the FASTA file used to build the aligner index).
      coverage
      genomecov

      Generate per-base genome-wide signal trace

      You have performed a WGS (whole genome sequencing) experiment and want to know if has resulted in the desired coverage depth.

    • Calculate what proportion of the (known) transcriptome is covered by your RNA-seq alignments. Provide the transcript regions as a BED or GFF/GTF file.
    • Produce a per-base genome-wide signal (in bedGraph format), for example for a ChIP-seq or ATAC-seq experiment.

      After 

      After conversion to binary bigWig format, such tracks can be

      configured in the UCSC Genome Browser as custom tracks.Combine a set of

      visualized in the Broad's IGV (Integrative Genome Browser) application, or configured in the UCSC Genome Browser as custom tracks.

      coverage

      Compute coverage of your regions

      1. You have performed a WGS (Whole Genome Sequencing) experiment and want to know if has resulted in the desired coverage depth.
      2. Calculate what proportion of the (known) transcriptome is covered by your RNA-seq alignments. 

      In either case, regions (e.g. chromosomes or transcripts) are provided as a BED or GFF/GTF file.

      multicovCount overlaps between one or more BAM files and a set of regions of interest.

      Count RNA-seq alignments that overlap a set of genes of interest.

      While this task is usually done with a specialized RNA-seq quantification tool (e.g. featureCounts or HTSeq), bedtools multicov can provide a quick estimate, e.g. for QC purposes.

      merge
      intersectDetermine the overlap between two sets of regions.Similar to multicov, but can also report the overlapping regions, not just count them.
      mergeCombine a set of possibly-overlapping regions into a single set of non-overlapping regions.

      Collapse overlapping gene annotations into per-strand non-overlapping regions.

      For example, to create non-overlapping transcipt regions before counting RNA-seq reads (e.g with featureCounts or HTSeq).

      If this is not done, the source regions will potentially be counted multiple times, once for each (overlapping) target region it intersects.

      subtractRemove unwanted regions.
      1. Remove rRNA/tRNA gene regions
      from a merged gene annotations file
      1. before counting
      .intersectDetermine the overlap between two sets of regions.Similar to multicov, but can also report (not just count) the overlapping regions
      1. for RNA-seq.
      2. Remove low-complexity genomic regions before peak calling for ChIP-seq or ATAC-seq.
      closestFind the genomic features nearest to a set of regions.For a set of significant ChIP-seq
      transcription factor
      Transcription Factor (TF) binding regions ("peaks") that have been identified, determine nearby genes that may be targets of TF regulation.

    We will explore a few of these functions in our exercises.

    ...

    Login to ls6, start and idev session, then load the BioContainers bedtools module, and then check its version.

    Code Block
    languagebash
    titleStart an idev session
    idev -m 120 -N 1 -A OTH21164 -r CoreNGSday5core-ngs-class-0606
    
    # ...or, without a reservation
    idev -m 120 -N 1 -A OTH21164 -p development 
    
    module load biocontainers
    module load bedtools
    bedtools --version   # should be bedtools v2.27.1

    Input format considerations

    • Most BEDTools functions now accept either BAM or BED files as input. 
      • BED format files must be BED3+, or BED6+ if strand-specific operations are requested.
    • When comparing against a set of regions, those regions are usually supplied in either BED or GTF/GFF.
    • All text-format input files (BED, GTF/GFFGFF3, VCF) should use Unix line endings (linefeed only).

    The most important thing to remember about comparing regions using BEDTools, is that all input files must share the same set of contig names and be based on the same reference! For example, if an alignment was performed against a human GRCh38 reference genome from Gencode, use annotations from the corresponding GFFGTF/GTFGFF3 annotations.

    About strandedness

    By default many bedtools utilities that perform overlapping, consider reads overlapping the feature on either strand, but can be made strand-specific with the -s or -S option. This strandedness options for bedtools utilities refers the orientation of the R1 read with respect to the feature's (gene's) strand.

    • -s says the R1 read is sense stranded (on the same strand as the gene).
    • -S says the R1 read is antisense stranded (the opposite strand as the gene).

    ...

    Which type of RNA-seq library you have depends on the library preparation method – so ask your sequencing center! Our yeast RNA-seq library is sense stranded (; however note that most RNA-seq libraries these days, including ones prepared by GSAF, are more often antisense stranded).

    If you have a stranded RNA-seq library, you should use either -s or -S to avoid false counting against a gene on the wrong strand.

    ...

    Annotation files that you retrieve from public databases are often in GTF (Gene Transfer Format) or one of the in GFF (General Feature Format) formats (usually GFF3 these days).

    Unfortunately, both formats are obscure and hard to work with directly. While bedtools does accept annotation files in GTF/GFF/GTF format, you will not like the results. This is because the most useful information in a GTF/GFF/GTF file is in a loosely-structured attributes field.

    Also unfortunately, there are a number of variations of both annotation formats However ; however both GTF and GFF share the first 8 Tab-separated fields:

    1. seqname - The name of the chromosome or scaffoldcontig.
    2. source - Name of the program that generated this feature, or other data source (e.g. public database)
    3. feature_type - Type of the feature. Examples of common feature types include:Some examples of common feature types are:, for example:
      • chromosome
      • CDS (coding sequence), exon
      • gene, transcript
      • start_codon, stop_codon
    4. start - Start position of the feature, with sequence numbering starting at 1.
    5. end - End position of the feature, with sequence numbering starting at 1.
    6. score - A numeric value. Often but not always an integer. Meaning differs and not usually important.
    7. strand - Defined as + (forward), - (reverse), or . (no relevant strand)
    8. frame - For a CDS, one of 0, 1 or 2, specifying the reading frame of the first base; otherwise '.'

    The Tab-separated columns will care about are (1) seqname, (3) feature_type and , (4,5) start, end and (7) strand. The reason we care is that when working with annotations, we usually only want to look at annotations of a particular type, most commonly gene, but also transcript or exon.

    ...

    Expand
    titleMake sure you're in an idev session


    Code Block
    languagebash
    titleStart an idev session
    idev -m 120 -N 1 -A OTH21164 -r CoreNGSday5
    # ...core-ngs-class-0606
    
    # or, without a reservation:
    idev -m 120 -N 1 -A OTH21164 -p development 
    
    module load biocontainers
    module load bedtools
    bedtools --version   # should be bedtools v2.27.1


    ...

    Code Block
    languagebash
    titleLook at GFF annotation entries with less
    mkdir -p $SCRATCH/core_ngs/bedtools
    cd $SCRATCH/core_ngs/bedtools 
    cp $CORENGS/yeast_rnarnaseq/sacCer_R64-1-1_20110208.gffyeast_mrna.sort.filt.bam* .  
    cp $CORENGS/yeast_rna/yeast_mrna.sort.filt.bam* .catchup/references/gff/sacCer3.R64-1-1_20110208.gff . 
    
    # Use the less pager to look at multiple lines
    less sacCer_sacCer3.R64-1-1_20110208.gff
    
    # Look at just the most-important Tab-separated columns
    cat sacCer_sacCer3.R64-1-1_20110208.gff | grep -v '#' | cut -f 1,3-5,7 | head -20
    
    # Include the ugly 9th column where attributes are stored
    cat sacCer_sacCer3.R64-1-1_20110208.gff | grep -v '#' | cut -f 1,3,9 | head

    ...

    One of the first things you want to know about your annotation file is what gene features it contains. Here's how to find that: (Read more about what's going on here at Piping a histogram.)

    Expand
    titleSetup (if needed)


    Code Block
    languagebash
    mkdir -p $SCRATCH/core_ngs/bedtools
    cd $SCRATCH/core_ngs/bedtools
    cp $CORENGS/yeast_rna/sacCer_/catchup/references/gff/sacCer3.R64-1-1_20110208.gff . 


    Read more about what's going on here at piping a histogram.

    Code Block
    languagebash
    titleCreate a histogram of all the feature types in a GFF
    cd $SCRATCH/core_ngs/bedtools
    cat sacCer_sacCer3.R64-1-1_20110208.gff | grep -v '^#' | cut -f 3 | \
      sort | uniq -c | sort -k1,1nr | more

    ...

    Code Block
    titleFilter GFF gene feature with awk
    cat sacCer_sacCer3.R64-1-1_20110208.gff | grep -v '#^#' | \
      awk 'BEGIN{FS=OFS="\t"}{ if($3=="gene"){print} }' \
      > sc_genes.gff
    wc -l sc_genes.gff

    ...

    What most folks to is find some way to convert their GTF/GFF/GTF file to a BED file, parsing out some (or all) of the name/value attribute pairs into BED file columns after the standard 6. You can find such conversion programs on the web – or write one yourself. Or you could use the BioITeam conversion script, /work/projects/BioITeam/common/script/gtf_to_bed.pl. While it will not work 100% of the time, it manages to do a decent job on most GFFGTF/GTF filesGFF3 files. And it's pretty easy to run.

    ...

    Expand
    titleSetup (if needed)


    Code Block
    languagebash
    mkdir -p $SCRATCH/core_ngs/bedtools
    cd $SCRATCH/core_ngs/bedtools
    cp $CORENGS/yeast_rna/*catchup/references/gff/sacCer3.R64-1-1_20110208.gff .  



    Code Block
    languagebash
    titleConvert GFF to BED with BioITeam script
    /work/projects/BioITeam/common/script/gtf_to_bed.pl sc_genes.gff 1 \
      > sc_genes.converted.bed

    ...

    For me the resulting 16 attributes are as follows (they may have a different order for you). I've numbered them below for convenience.

    Code Block
    titleConverted BED attributes
     1. chrom          2. start   3. end     4. featureType  5. length  6. strand
     7. source         8. frame   9. Alias  10. ID          11. Name   12. Note
    13. Ontology_term 14. dbxref 15. gene   16. orf_classification

    The final transformation is to do a bit of re-ordering, dropping some fields. We'll do this with awk, because cut can't re-order fields. While this is not strictly required, it can be helpful to have the critical fields (including the gene ID) in the 1st 6 columns. We do this separately for the header line and the rest of the file so that the BED file we give bedtools does not have a header (but we know what those fields are). We would normally preserve valuable annotation information such as Ontology_term, dbxref and Note, but drop them here for simplicity.

    Expand
    titleSetup (if needed)


    Code Block
    languagebash
    mkdir -p $SCRATCH/core_ngs/bedtools
    cd $SCRATCH/core_ngs/bedtools
    cp $CORENGS/catchup/bedtools_merge/*.gff .  
    cp $CORENGS/catchup/bedtools_merge/sc_genes.converted.bed



    Code Block
    languagebash
    titleRe-order the final BED fields
    head -1 sc_genes.converted.bed | sed 's/\r//' | awk '
     BEGIN{FS=OFS="\t"}{print $1,$2,$3,$10,$5,$6,$15,$16}
     ' > sc_genes.bed.hdr
    
    tail -n +2 sc_genes.converted.bed | sed 's/\r//' | awk '
     BEGIN{FS=OFS="\t"}
     { # make sure gene name is populated
       if($15 == "") {$15 = $10} # make sure gene name is populated
       print $1,$2,$3,$10,$5,$6,$15,$16}
     ' > sc_genes.bed

    One final detail. Annotation files you download may have non-Unix (linefeed-only) line endings. Specifically, they may use Windows line endings (carriage return + linefeed). (Read about Line ending nightmares.) The expression sed 's/\r//' uses the sed (substitution editor) tool to replace carriage return characters ( \r ) with nothing, removing them from the output if they are present.

    Finally, the 8 re-ordered attributes are:

    ...

    Code Block
    chrI    334     649     YAL069W         315     +       YAL069W    Dubious
    chrI    537     792     YAL068W-A       255     +       YAL068W-A  Dubious
    chrI    1806    2169    YAL068C         363     -       PAU8       Verified
    chrI    2479    2707    YAL067W-A       228     +       YAL067W-A  Uncharacterized
    chrI    7234    9016    YAL067C         1782    -       SEO1       Verified
    chrI    10090   10399   YAL066W         309     +       YAL066W    Dubious
    chrI    11564   11951   YAL065C         387     -       YAL065C    Uncharacterized
    chrI    12045   12426   YAL064W-B       381     +       YAL064W-B  Uncharacterized
    chrI    13362   13743   YAL064C-A       381     -       YAL064C-A  Uncharacterized
    chrI    21565   21850   YAL064W         285     +       YAL064W    Verified
    chrI    22394   22685   YAL063C-A       291     -       YAL063C-A  Uncharacterized
    chrI    23999   27968   YAL063C         3969    -       FLO9       Verified
    chrI    31566   32940   YAL062W         1374    +       GDH3       Verified
    chrI    33447   34701   YAL061W         1254    +       BDH2       Uncharacterized
    chrI    35154   36303   YAL060W         1149    +       BDH1       Verified
    chrI    36495   36918   YAL059C-A       423     -       YAL059C-A  Dubious
    chrI    36508   37147   YAL059W         639     +       ECM1       Verified
    chrI    37463   38972   YAL058W         1509    +       CNE1       Verified
    chrI    38695   39046   YAL056C-A       351     -       YAL056CYAL056C-A  Dubious
    chrI    39258   41901   YAL056W         2643    +       GPB2       Verified

    Note that value in the 8th column. In the yeast annotations from SGD there are 3 gene classifications: Verified, Uncharacterized and Dubious. The Dubious ones have no experimental evidence so are generally excluded.

    Expand
    titleSetup (if needed)


    Code Block
    languagebash
    mkdir -p $SCRATCH/core_ngs/bedtools
    cd $SCRATCH/core_ngs/bedtools
    cp $CORENGS/yeastcatchup/bedtools_rnamerge/*.gff .  
    cp $CORENGS/yeastcatchup/bedtools_rnamerge/sc_genes* .


    Exercise: How many genes in our sc_genes.bed file are in each category?

    Expand
    titleHint

    Use cut to isolate that field (8), sort to sort the resulting values into blocks, then uniq -c to count the members of each block.

    ...

    Expand
    titleAnswer


    Code Block
    languagebash
    cut -f 8 sc_genes.bed | sort | uniq -c

    You should see this:

    Code Block
        810 Dubious
        897 Uncharacterized
       4896 Verified
          4 Verified|silenced_gene

    If you want to further order this output listing the most abundant category first, add another sort statement:

    Code Block
    languagebash
    cut -f 8 sc_genes.bed | sort | uniq -c | sort -k1,1nr

    The -k 1,1nr options says to sort on the 1st field (whitespace delimited) of input, using numeric sorting, in reverse order (i.e., largest first). Which produces:

    Code Block
       4896 Verified
        897 Uncharacterized
        809 Dubious
          4 Verified|silenced_gene


    Exercises

    We're now (finally!) actually going to do some gene-based analyses of a yeast RNA-seq dataset using bedtools and the BED-formatted yeast gene annotation file we created above.

    Get the RNA-seq BAM

    Make sure you're in an idev session, since we will be doing some significant computation, and make bedtools and samtools available.

    Code Block
    languagebash
    titleStart an idev session
    idev -p development -m 120 -A UT-2015-05-18 -N 1 -n 24 --reservation=BIO_DATA_week_1

    Copy over the yeast RNA-seq files we'll need (also copy the GFF gene annotation file if you didn't make one).

    Code Block
    languagebash
    titleSetup for BEDTools exercises
    # To catch up...
    mkdir -p $SCRATCH/core_ngs/bedtools
    cd $SCRATCH/core_ngs/bedtools
    cp $CORENGS/yeast_rna/sc_genes.bed* . 
    cp $CORENGS/yeast_rna/*.gff .
    
    # Copy the BAM file
    cd $SCRATCH/core_ngs/bedtools
    cp $CORENGS/yeast_rna/yeast_mrna.sort.filt.bam* .

    Exercises: How many reads are represented in the yeast_mrna.sort.filt.bam file? How many mapped? How many proper pairs? How many duplicates? What is the distribution of mapping qualities? What is the average mapping quality?

    Expand
    titleHints

    samtools flagstat for the different read counts.

    samtools view + cut + sort + uniq -c for mapping quality distribution

    samtools view + awk for average mapping quality

    ...

    titleAnswer
    Code Block
    languagebash
    cd $SCRATCH/core_ngs/bedtools
    samtools flagstat yeast_mrna.sort.filt.bam | tee yeast_mrna.flagstat.txt
    Code Block
    titlesamtools flagstat output
    3323242 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 secondary
    0 + 0 supplementary
    922114 + 0 duplicates
    3323242 + 0 mapped (100.00% : N/A)
    3323242 + 0 paired in sequencing
    1661699 + 0 read1
    1661543 + 0 read2
    3323242 + 0 properly paired (100.00% : N/A)
    3323242 + 0 with itself and mate mapped
    0 + 0 singletons (0.00% : N/A)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)

    There are 3323242 total reads, all mapped and all properly paired. So this must be a quality-filtered BAM. There are 922114 duplicates, or about 28%.

    To get the distribution of mapping qualities:

    Code Block
    languagebash
    samtools view yeast_mrna.sort.filt.bam | cut -f 5 | sort | uniq -c 
    Code Block
    titledistribution of mapping qualities
        453 20
       6260 21
        889 22
        326 23
        971 24
       2698 25
        376 26
      12769 27
        268 28
        337 29
        933 30
       1229 31
        345 32
       5977 33
        256 34
        249 35
       1103 36
        887 37
        292 38
       4648 39
       5706 40
        426 41
       1946 42
       1547 43
       1761 44
       6138 45
       1751 46
       3019 47
       3710 48
       3236 49
       4467 50
      15691 51
      25370 52
      16636 53
      18081 54
       7084 55
       2701 56
      59851 57
       2836 58
       2118 59
    3097901 60

    To compute average mapping quality:

    Code Block
    languagebash
    samtools view yeast_mrna.sort.filt.bam | awk '
      BEGIN{FS="\t"; sum=0; tot=0}
      {sum = sum + $5; tot = tot + 1}
      END{printf("mapping quality average: %.1f for %d reads\n", sum/tot,tot) }'

    Mapping qualities range from 20 to 60 – excellent quality! Because the majority reads have mapping quality 60, the average is 59. So again, there must have been quality filtering performed on upstream alignment records.

    Use bedtools multicov to count feature overlaps

    In this section we'll use bedtools multicov to count RNA-seq reads that overlap our gene features. The bedtools multicov command (http://bedtools.readthedocs.io/en/latest/content/tools/multicov.html) takes a feature file (GFF/BED/VCF) and counts how many reads from one or more input BAM files overlap those feature. The input BAM file(s) must be position-sorted and indexed.

    Here's how to run bedtools multicov, directing the standard output to a file:

    ...

    titleSetup (if needed)
    Code Block
    languagebash
    idev -p development -m 120 -A UT-2015-05-18 -N 1 -n 68 --reservation=BIO_DATA_week_1
    module load biocontainers
    module load samtools
    module load bedtools
    
    mkdir -p $SCRATCH/core_ngs/bedtools
    cd $SCRATCH/core_ngs/bedtools
    cp $CORENGS/yeast_rna/*.gff .
    cp $CORENGS/yeast_rna/sc_genes.bed* .
    cp $CORENGS/yeast_rna/yeast_mrna.sort.filt.bam* .
    Code Block
    languagebash
    titleRun bedtools multicov to count BAM alignments overlapping a set of genes
    cd $SCRATCH/core_ngs/bedtools
    bedtools multicov -s -bams yeast_mrna.sort.filt.bam \
      -bed sc_genes.bed > yeast_mrna_gene_counts.bed

    Exercise: How may records of output were written? Where is the count of overlaps per output record?

    ...

    titleAnswers
    Code Block
    languagebash
    wc -l yeast_mrna_gene_counts.bed

    6607 records were written, one for each feature in the sc_genes.bed file.

    The overlap count was added as the last field in each output record (here field 9, since the input annotation file had 8 columns).

    Exercise: How many features have non-zero overlap counts?

    ...

    titleAnswer
    Code Block
    languagebash
    cut -f 9 yeast_mrna_gene_counts.bed | grep -v '^0' | wc -l
    # or
    cat yeast_mrna_gene_counts.bed | \
      awk '{if ($9 > 0) print $9}' | wc -l

    Most of the genes (6235/6607) have non-zero read overlap counts.

    Exercise: What is the total count of reads mapping to gene features?

    ...

    titleAnswer
    Code Block
    languagebash
    cat yeast_mrna_gene_counts.bed | awk '
     BEGIN{FS="\t";sum=0;tot=0}
     {if($9 > 0) { sum = sum + $9; tot = tot + 1 }}
     END{printf("%d overlapping reads in %d genes\n", sum, tot) }'

    There are 1144990 overlapping reads in 6235 gene annotations.

    Recall that in the yeast annotations from SGD there are 3 gene classifications: Verified, Uncharacterized and Dubious, and the Dubious ones have no experimental evidence.

    Exercise: What is the total count of reads mapping to gene features other than Dubious?

    Expand
    titleHint
    grep -v 'Dubious'

    ...

    titleAnswer
    Code Block
    languagebash
    grep -v 'Dubious' yeast_mrna_gene_counts.bed | awk '
     BEGIN{FS="\t";sum=0;tot=0}
     {if($9 > 0) { sum = sum + $9; tot = tot + 1 }}
     END{printf("%d overlapping reads in %d non-Dubious genes\n", sum, tot) }'

    There are 1093140 overlapping reads in 5578 non-Dubious genes

    Use bedtools merge to collapse overlapping annotations

    One issue that often arises when dealing with BED regions is that they can overlap one another. For example, on the yeast genome, which has very few non-coding areas, there are some overlapping ORFs (Open Reading Frames), especially Dubious ORFs that overlap Verified or Uncharacterized ones. When bedtools looks for overlaps, it will count a read that overlaps any of those overlapping ORFs – so some reads can be counted twice.

    ...

    Use bedtools merge to collapse overlapping annotations

    One issue that often arises when dealing with BED regions is that they can overlap one another. For example, on the yeast genome, which has very few non-coding areas, there are some overlapping ORFs (Open Reading Frames), especially Dubious ORFs that overlap Verified or Uncharacterized ones. When bedtools looks for overlaps, it will count a read that overlaps any of those overlapping ORFs – so some reads can be counted twice.

    One way to avoid this double-counting is to collapse the overlapping regions into a merged set of non-overlapping regions – and that's what the bedtools merge utility does (http://bedtools.readthedocs.io/en/latest/content/tools/merge.html).

    Here we're going to use bedtools merge to collapse our gene annotations into a non-overlapping set, first for all genes, then for only non-Dubious genes.

    The output from bedtools merge always starts with 3 columns: chrom, start and end of the merged region only.

    Using the -c (column) and -o (operation) options, you can add information in subsequent fields. Each comma-separated column number following -c specifies a column to operate on, and the corresponding comma-separated function name following the -o specifies the operation to perform on that column in order to produce an additional output field.

    For example, our sc_genes.bed file has a gene name in column 4, and for each (possibly merged) gene region, we want to know the number of gene regions that were collapsed into the region, and also which gene names were collapsed.

    We can do this with -c 6,4,4 -o distinct,count,collapse, which says that three custom output columns should be added:

    • the 1st custom column (output column 4) should result from collapsing distinct (unique) values of gene file column 6 (the strand column, + or -)
      • since we will ask for stranded merging, the merged regions will always be on the same strand, so this value will always be + or -
    • the 2nd custom output column should result from counting the gene names in column 4 for all genes that were merged, and
    • the 3rd custom output should be a comma-separated collapsed list of those same column 4 gene names.

    bedtools merge also requires that the input BED file be sorted by locus (chrom + start), so we do that first, then we request a strand-specific merge (-s):

    Expand
    titleSetup (if needed)


    Code Block
    languagebash
    mkdir -p $SCRATCH/core_ngs/bedtools
    cd $SCRATCH/core_ngs/bedtools
    cp $CORENGS/yeast_rnaseq/*.gff .
    cp $CORENGS/yeast_rnaseq/sc_genes.bed* .
    cp $CORENGS/yeast_rnaseq/yeast_mrna.sort.filt.bam* .
    module load biocontainers
    module load bedtools



    Code Block
    languagebash
    titleUse bedtools merge to collapse overlapping gene annotations
    cd $SCRATCH/core_ngs/bedtools
    sort -k1,1 -k2,2n sc_genes.bed > sc_genes.sorted.bed
    bedtools merge -i sc_genes.sorted.bed -s \
      -c 6,4,4 -o distinct,count,collapse > merged.sc_genes.txt

    The first few lines of the merged.sc_genes.txt file look like this (I've tidied it up a bit):

    Code Block
    2-micron    251     1523    +     1    R0010W
    2-micron    1886    3008    -     1    R0020C
    2-micron    3270    3816    +     1    R0030W
    2-micron    5307    6198    -     1    R0040C
    chrI        334     792     +     2    YAL069W,YAL068W-A
    chrI        1806    2169    -     1    YAL068C
    chrI        2479    2707    +     1    YAL067W-A
    chrI        7234    9016    -     1    YAL067C
    chrI        10090   10399   +     1    YAL066W
    chrI        11564   11951   -     1    YAL065C

    As we specified:

    • Output column 4 has the region's strand.
    • Column 5 is the count of merged regions
    • Column 6 is a collapsed, comma-separated list of the merged gene names

    Exercise: Compare the number of regions in the merged and before-merge gene files.

    Expand
    titleAnswer


    Code Block
    languagebash
    wc -l sc_genes.bed merged.sc_genes.txt

    There were 6607 genes before merging and 6485 after.

    Exercise: How many regions represent only 1 gene, 2 genes, or more?

    Expand
    titleAnswer

    Output column 5 has the gene count.

    Code Block
    languagebash
    cut -f 5 merged.sc_genes.txt | sort | uniq -c | sort -k2,2n

    Produces this histogram:

    Code Block
    languagebash
       6374 1
        105 2
          4 3
          1 4
          1 7

    There are 111 regions (105 + 4 + 1 + 1) where more than one gene contributed.

    Or being fancy:

    Code Block
    languagebash
    cut -f 5 merged.sc_genes.txt | sort | uniq -c | sort -k2,2n | \
      grep -v ' 1$' | awk 'BEGIN{ct=0}{ct=ct+$1}END{print ct}'


    Exercise: Repeat the steps above, but first create a good.sc_genes.bed file that does not include Dubious ORFs.

    Expand
    titleAnswer


    Code Block
    languagebash
    cd $SCRATCH/core_ngs/bedtools
    grep -v 'Dubious' sc_genes.bed > good.sc_genes.bed
    
    sort -k1,1 -k2,2n good.sc_genes.bed > good.sc_genes.sorted.bed
    bedtools merge -i good.sc_genes.sorted.bed -s \
      -c 6,4,4 -o distinct,count,collapse > merged.good.sc_genes.txt
    
    wc -l good.sc_genes.bed merged.good.sc_genes.txt

    There were 5797 "good" (non-Dubious) genes before merging and 5770 after.

    Code Block
    languagebash
    cut -f 5 merged.good.sc_genes.txt | sort | uniq -c | sort -k2,2n

    Produces this histogram:

    Code Block
    languagebash
       5750 1
         18 2
          1 4
          1 7

    Now there are only 20 regions where more than one gene was collapsed. Clearly eliminating the Dubious ORFs helped.

    So there's one more thing we need to do to create a valid BED format file. Our merged.good.sc_genes.txt columns are chrom, start, end, strand, merged_region_count, merged_region(s), but the BED6 specification is: chrom, start, end, name, score, strand.

    To make a valid BED6 file, we'll include the first 3 output columns of merged.good.sc_genes.txt (chrom, start, end), but strand should be in column 6. Column 4 should be name (we'll put the collapsed gene name list there), and column 5 a score (we'll put the region count there).

    We can use awk to re-order the fields:

    Code Block
    languagebash
    cat merged.good.sc_genes.txt | awk '
      BEGIN{FS=OFS="\t"}
      {print $1,$2,$3,$6,$5,$4}' > merged.good.sc_genes.bed

    Use bedtools multicov to count feature overlaps

    We're now (finally!) actually going to do some gene-based analyses of a yeast RNA-seq dataset using bedtools and the BED-formatted, merged yeast gene annotation file we created above.

    In this section we'll use bedtools multicov to count RNA-seq reads that overlap our gene features. The bedtools multicov command (http://bedtools.readthedocs.io/en/latest/content/tools/mergemulticov.html) .

    Here we're going to use bedtools merge to collapse our gene annotations into a non-overlapping set, first for all genes, then for only non-Dubious genes.

    The output from bedtools merge always starts with 3 columns: chrom, start and end of the merged region only.

    Using the -c (column) and -o (operation) options, you can have information added in subsequent fields. Each comma-separated column number following -c specifies a column to operate on, and the corresponding comma-separated function name following the -o specifies the operation to perform on that column in order to produce an additional output field.

    For example, our sc_genes.bed file has a gene name in column 4, and for each (possibly merged) gene region, we want to know the number of gene regions that were collapsed into the region, and also which gene names were collapsed.

    We can do this with -c 6,4,4 -o distinct,count,collapse, which says that three custom output columns should be added:

    • the 1st custom column should result from collapsing distinct (unique) values of gene file column 6 (the strand, + or -)
      • since we will ask for stranded merging, the merged regions will always be on the same strand, so this value will always be + or -
    • the 2nd custom output column should result from counting the gene names in column 4 for all genes that were merged, and
    • the 3rd custom output should be a comma-separated collapsed list of those same column 4 gene names

    bedtools merge also requires that the input BED file be sorted by locus (chrom + start), so we do that first, then we request a strand-specific merge (-s):

    ...

    titleSetup (if needed)

    ...

    languagebash

    ...

    takes a feature file (GFF/BED/VCF) and counts how many reads from one or more input BAM files overlap those feature. The input BAM file(s) must be position-sorted and indexed.

    Make sure you're in an idev session, since we will be doing some significant computation, and make bedtools and samtools available.

    Expand
    titleMake sure you're in an idev session


    Code Block
    languagebash
    titleStart an idev session
    idev -m 120 -N 1 -A OTH21164 -r core-ngs-class-0606
    
    # or, without a reservation
    idev -m 120 -N 1 -A OTH21164 -p development 
    
    module load biocontainers
    module load bedtools
    bedtools --version   # should be bedtools v2.27.1


    Copy over the yeast RNA-seq files we'll need (also copy the GFF gene annotation file if you didn't make one).

    Code Block
    languagebash
    titleSetup for BEDTools multicov
    # Get the merged yeast genes bed file if you didn't create one
    mkdir -p $SCRATCH/core_ngs/bedtools_multicov
    cd $SCRATCH/core_ngs/bedtools
    cd $SCRATCH/core_ngs/bedtools
    _multicov
    cp $CORENGS/
    yeast
    catchup/bedtools_
    rna
    merge/merged*
    .gff . cp $CORENGS/yeast_rna/sc_genes.bed* .
    bed .
    
    # Copy the BAM file
    cd $SCRATCH/core_ngs/bedtools_multicov
    cp $CORENGS/yeast_
    rna
    rnaseq/yeast_mrna.sort.filt.bam* .
    module load biocontainers module load bedtools
    Code Block
    languagebash
    titleUse bedtools merge to collapse overlapping gene annotations
    cd $SCRATCH/core_ngs/bedtools
    sort -k1,1 -k2,2n sc_genes.bed > sc_genes.sorted.bed
    bedtools merge -i sc_genes.sorted.bed -s -c 6,4,4 -o distinct,count,collapse > merged.sc_genes.txt

    The first few lines of the merged.sc_genes.txt file look like this (I've tidied it up a bit):

    Code Block
    2-micron        251     1523    +       1       R0010W
    2-micron        1886    3008    -       1       R0020C
    2-micron        3270    3816    +       1       R0030W
    2-micron        5307    6198    -       1       R0040C
    chrI            334     792     +       2       YAL069W,YAL068W-A
    chrI            1806    2169    -       1       YAL068C
    chrI            2479    2707    +       1       YAL067W-A
    chrI            7234    9016    -       1       YAL067C
    chrI            10090   10399   +       1       YAL066W
    chrI            11564   11951   -       1       YAL065C

    Output column 4 has the region's strand. Column 5 is the count of merged regions, and column 6 is a comma-separated list of the merged gene names.

    Exercise: Compare the number of regions in the merged and before-merge gene files.

    ...

    titleAnswer
    Code Block
    languagebash
    wc -l sc_genes.bed merged.sc_genes.txt

    There were 6607 genes before merging and 6485 after.

    Exercise: How many regions represent only 1 gene, 2 genes, or more?

    Expand
    titleAnswer

    Output column 5 has the gene count.

    Code Block
    languagebash
    cut -f 5 merged.sc_genes.txt | sort | uniq -c | sort -k2,2n

    Produces this histogram:

    Code Block
    languagebash
       6374 1
        105 2
          4 3
          1 4
          1 7

    There are 111 regions (105 + 4 + 1 + 1) where more than one gene contributed.

    Exercise: Repeat the steps above, but first create a good.sc_genes.bed file that does not include Dubious ORFs.

    ...

    titleAnswer
    Code Block
    languagebash
    cd $SCRATCH/core_ngs/bedtools
    grep -v 'Dubious' sc_genes.bed > good.sc_genes.bed
    
    sort -k1,1 -k2,2n good.sc_genes.bed > good.sc_genes.sorted.bed
    bedtools merge -i good.sc_genes.sorted.bed -s \
      -c 6,4,4 -o distinct,count,collapse > merged.good.sc_genes.txt
    
    wc -l good.sc_genes.bed merged.good.sc_genes.txt

    There were 5797 "good" (non-Dubious) genes before merging and 5770 after.

    Code Block
    languagebash
    cut -f 5 merged.good.sc_genes.txt | sort | uniq -c | sort -k2,2n

    Produces this histogram:

    Code Block
    languagebash
       5750 1
         18 2
          1 4
          1 7

    Now there are only 20 regions where more than one gene was collapsed. Clearly eliminating the Dubious ORFs helped.

    Exercise: Why did we name the merged file with the extension .txt instead of .bed? What would we need to do to convert it to a proper BED6 file?

    cat merged.good.sc_genes.txt | awk ' BEGIN{FS=OFS="\t"} {print $1,$2,$3,$6,$5,$4}' > merged.good.sc_genes.bed
    Expand
    titleAnswer

    The output does not follow the BED6 specification: "chrom, start, end, name, score, strand"

    The first 3 output columns comply with the BED3 standard (chrom, start, end), but if strand is to be included, it should be in column 6. Column 4 should be name (we'll put the collapsed gene name list there), and column 5 a score (we'll put the region count there).

    We can use awk to re-order the fields:

    Code Block
    languagebash

    Exercises:

    • How many reads are represented in the yeast_mrna.sort.filt.bam file?
    • How many mapped? How many proper pairs? How many duplicates?
    • What is the distribution of mapping qualities? What is the average mapping quality?
    Expand
    titleHints

    samtools flagstat for the different read counts.

    samtools view + cut + sort + uniq -c for mapping quality distribution

    samtools view + awk for average mapping quality


    Expand
    titleAnswer


    Code Block
    languagebash
    module load samtools
    
    cd $SCRATCH/core_ngs/bedtools_multicov
    samtools flagstat yeast_mrna.sort.filt.bam | tee yeast_mrna.flagstat.txt


    Code Block
    titlesamtools flagstat output
    3347559 + 0 in total (QC-passed reads + QC-failed reads)
    24317 + 0 secondary
    0 + 0 supplementary
    922114 + 0 duplicates
    3347559 + 0 mapped (100.00% : N/A)
    3323242 + 0 paired in sequencing
    1661699 + 0 read1
    1661543 + 0 read2
    3323242 + 0 properly paired (100.00% : N/A)
    3323242 + 0 with itself and mate mapped
    0 + 0 singletons (0.00% : N/A)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)

    There are 3323242 total reads, all mapped and all properly paired. So this must be a quality-filtered BAM.

    There are 922114 duplicates, or about 28%.

    To get the distribution of mapping qualities (BAM field 5)

    Code Block
    languagebash
    samtools view yeast_mrna.sort.filt.bam | cut -f 5 | sort | uniq -c 


    Code Block
    titledistribution of mapping qualities
        498 20
       6504 21
       1012 22
        355 23
       1054 24
       2800 25
        495 26
      14133 27
        282 28
        358 29
        954 30
       1244 31
        358 32
       6143 33
        256 34
        265 35
       1112 36
        905 37
        309 38
       4845 39
       5706 40
        427 41
       1946 42
       1552 43
       1771 44
       6140 45
       1771 46
       3049 47
       3881 48
       3264 49
       4475 50
      15692 51
      25378 52
      16659 53
      18305 54
       7108 55
       2705 56
      59867 57
       2884 58
       2392 59
    3118705 60

    To compute average mapping quality:

    Code Block
    languagebash
    samtools view yeast_mrna.sort.filt.bam | awk '
      BEGIN{FS="\t"; sum=0; tot=0}
      {sum = sum + $5; tot = tot + 1}
      END{printf("mapping quality average: %.1f for %d reads\n", sum/tot,tot) }'

    Mapping qualities range from 20 to 60 – excellent quality! Because the majority reads have mapping quality 60, the average is 59.2. So again, there must have been quality filtering performed on upstream alignment records.

    Here's how to run bedtools multicov in stranded mode, directing the standard output to a file:

    Expand
    titleSetup (if needed)


    Code Block
    languagebash
    idev -m 120 -N 1 -A OTH21164 -r core-ngs-class-0606
    
    # or, without a reservation:
    idev -m 120 -N 1 -A OTH21164 -p development
    
    module load biocontainers
    module load samtools
    module load bedtools
    
    mkdir -p $SCRATCH/core_ngs/bedtools_multicov
    cd $SCRATCH/core_ngs/bedtools_multicov
    cp $CORENGS/catchup/bedtools_merge/merged*bed .
    cp $CORENGS/yeast_rnaseq/yeast_mrna.sort.filt.bam* .



    Code Block
    languagebash
    titleRun bedtools multicov to count BAM alignments overlapping a set of genes
    cd $SCRATCH/core_ngs/bedtools_multicov
    bedtools multicov -s -bams yeast_mrna.sort.filt.bam \
      -bed merged.good.sc_genes.bed > yeast_mrna_gene_counts.bed

    Exercises:

    • How may records of output were written?
    • Where is the count of overlaps per output record?
    Expand
    titleAnswers


    Code Block
    languagebash
    wc -l yeast_mrna_gene_counts.bed

    6485 records were written, one for each feature in the merged.sc_genes.bed file.

    The overlap count was added as the last field in each output record. So here it is field 7 since the input annotation file had 6 columns.

    Exercise: How many features have non-zero overlap counts?

    Expand
    titleAnswer


    Code Block
    languagebash
    cut -f 7 yeast_mrna_gene_counts.bed | grep -v '^0' | wc -l
    # or
    cut -f 7 yeast_mrna_gene_counts.bed | grep -v -c '^0'
    # or
    cat yeast_mrna_gene_counts.bed | \
      awk '{if ($7 > 0) print $7}' | wc -l

    Most of the genes (6141/6485) have non-zero read overlap counts.

    Exercise: What is the total count of reads mapping to gene features?

    Expand
    titleAnswer


    Code Block
    languagebash
    cat yeast_mrna_gene_counts.bed | awk '
     BEGIN{FS="\t";sum=0;tot=0}
     {if($7 > 0) { sum = sum + $7; tot = tot + 1 }}
     END{printf("%d overlapping reads in %d genes\n", sum, tot) }'

    There are 1,152,831 overlapping reads in 6,141 non-0 gene annotations.

    Use bedtools genomecov to create a signal track

    A signal track is a bedGraph (BED3+) file with an entry for every base in a defined set of regions that shows the count of overlapping bases for the regions (see https://genome.ucsc.edu/goldenpath/help/bedgraph.html). bedGraph files can be visualized in the Broad's IGV (Integrative Genomics Viewer) application (https://software.broadinstitute.org/software/igv/download) or in the UCSC Genome Browser (https://genome.ucsc.edu/).

    • Go to the UCSC Genome Browser: https://genome.ucsc.edu/
    • Select Genomes from the top menu bar
    • Select Human from POPULAR SPECIES
      • under Human Assembly select Feb 2009 (GrCh37/hg19)
      • select GO
    • In the hg19 browser page,
      • the 100 Vert. Cons track is a signal track
        • the x-axis is the genome position
        • the y-axis represents the base-wise conservation among vertebrates
      • customize the 100 Vert. Cons track
        • right-click on "100 Vert. Cons" text in the left margin,
          • select "Configure 100 Vert. Cons" from the menu
        • in the 100 Vert. Cons Track Settings dialog:
          • change "Track height" to 100
          • change "Data view scaling" to "auto-scale to data view"
          • click "OK"
      • the Layered H3K27Ac track is a signal track
        • the x-axis is the genome position
        • the y-axis represents the count of ChIP-seq reads that overlap each position
          • where the ChIP'd protein is H3K27AC (histone H3, acetylated on the Lysine at amino acid position 27)

    The bedtools genomecov function (https://bedtools.readthedocs.io/en/latest/content/tools/coverage.html), with the -bg (bedgraph) option produces output in bedGraph format. Here we'll analyze the per-base coverage of yeast RNAseq reads in our merged yeast gene regions.

    Make sure you're in an idev session, then prepare a directory for this exercise.

    Expand
    titleMake sure you're in an idev session


    Code Block
    languagebash
    titleStart an idev session
    idev -m 120 -N 1 -A OTH21164 -r core-ngs-class-0606
    
    # or, without a reservation:
    idev -m 120 -N 1 -A OTH21164 -p development 
    
    module load biocontainers
    module load bedtools
    bedtools --version   # should be bedtools v2.27.1



    Code Block
    languagebash
    titlePrepare for bedtools coverage
    mkdir -p $SCRATCH/core_ngs/bedtools_genomecov
    cd $SCRATCH/core_ngs/bedtools_genomecov 
    cp $CORENGS/catchup/bedtools_merge/merged*bed .
    cp $CORENGS/yeast_rnaseq/yeast_mrna.sort.filt.bam* .

    Then calling bedtools genomecov is easy. The -bg option says to report the depth in bedGraph format.

    Code Block
    languagebash
    cd $SCRATCH/core_ngs/bedtools_genomecov
    bedtools genomecov -bg -ibam yeast_mrna.sort.filt.bam > yeast_mrna.genomecov.bedGraph
    
    wc -l yeast_mrna.genomecov.bedGraph # 1519274 lines

    The bedGraph (BED3+) format has only 4 columns: chrom start end value and does not need to include positions with 0 reads. Here the count is the number of reads covering each base in the region given by chrom start end, as you can see looking at the first few lines with head:

    Code Block
    chrI    4348    4390    2
    chrI    4390    4391    1
    chrI    4745    4798    2
    chrI    4798    4799    1
    chrI    4949    4957    2
    chrI    4957    4984    4
    chrI    4984    4997    6
    chrI    4997    4998    5
    chrI    4998    5005    4
    chrI    5005    5044    2
    chrI    5044    5045    1
    chrI    6211    6268    2
    chrI    6268    6269    1
    chrI    7250    7257    3
    chrI    7257    7271    4
    chrI    7271    7274    6
    chrI    7274    7278    7
    chrI    7278    7310    8
    chrI    7310    7315    6
    chrI    7315    7317    5 

    Because this bedGraph file is for the small-ish (12Mb) yeast genome, and for reads that cover only part of that genome, it is not too big – only ~34M. But depending on the species and read depth, bedGraph files can get very large, so there is a corresponding binary format called bigWig (see https://genome.ucsc.edu/goldenpath/help/bigWig.html). The program to covert a bedGraph file to bigWig format is part of the UCSC Tools suite of programs. Look for it with module spider, and note that you can get information about all the tools in it using module spider with a specific container version:

    Code Block
    # look for the ucsc tools package
    module spider ucsc
    
    # specifying a specific container version will show more information about the package
    module spider ucsc_tools/ctr-357--0
    
    # displays information including the programs in the package:
      - bedGraphToBigWig
      - bedToBigBed
      - faToTwoBit
      - liftOver
      - my_print_defaults
      - mysql_config
      - nibFrag
      - perror
      - twoBitToFa
      - wigToBigWig
    

    Looking at the help for bedGraphToBigWig, we'll need a file of chromosome sizes. We can create one from our BAM header, using a Perl substitution script, which I prefer to sed:

    Code Block
    languagebash
    module load ucsc_tools
    
    cd $SCRATCH/core_ngs/bedtools_genomecov
    bedGraphToBigWig  # look at its usage
    
    # create the needed chromosome sizes file from our BAM header
    module load samtools
    samtools view -H yeast_mrna.sort.filt.bam | grep -P 'SN[:]' | \
      perl -pe 's/.*SN[:]//' | perl -pe 's/LN[:]//' > sc_chrom_sizes.txt
    
    cat sc_chrom_sizes.txt
    
    # displays:
    chrI    230218
    chrII   813184
    chrIII  316620
    chrIV   1531933
    chrV    576874
    chrVI   270161
    chrVII  1090940
    chrVIII 562643
    chrIX   439888
    chrX    745751
    chrXI   666816
    chrXII  1078177
    chrXIII 924431
    chrXIV  784333
    chrXV   1091291
    chrXVI  948066
    chrM    85779

    Finally, call bedGraphToBigWig after sorting the bedGraph file again using the sort format bedGraphToBigWig likes. (You can try calling bedGraphToBigWig without sorting to see the error).

    Code Block
    languagebash
    cd $SCRATCH/core_ngs/bedtools_genomecov
    export LC_COLLATE=C  # may or may not need this...
    sort -k1,1 -k2,2n yeast_mrna.genomecov.bedGraph > yeast_mrna.genomecov.sorted.bedGraph
    bedGraphToBigWig yeast_mrna.genomecov.sorted.bedGraph sc_chrom_sizes.txt yeast_mrna.genomecov.bw

    See the size difference between the bedGraph and the bigWig files. The bigWig (9.7M) is less that 1/3 the size of the bedGraph (34M).

    Code Block
    languagebash
    cd $SCRATCH/core_ngs/bedtools_genomecov
    ls -lh yeast_mrna.genome*
    

    Since the bigWig file is binary, not text, you can't use commands like cat, head, tail on them directly and get meaningful output. Instead, just as zcat converts gzip'd files to text, and samtools view convets binary BAM files to text, the bigWigToBedGraph program can convert binary bigWig format to text. That's a different BioContainers module (ucsc-bigwigtobedgraph) and the default container version doesn't work, so we'll specifically load one that does:

    Code Block
    languagebash
    # The default version of is broken, so load this specific biocontainers version
    module load ucsc-bigwigtobedgraph/ctr-357--1
    
    # see usage for bigWigToBedGraph:
    bigWigToBedGraph
    
    cd $SCRATCH/core_ngs/bedtools_genomecov
    # use the program to view a few lines of the binary bigWig file
    bigWigToBedGraph yeast_mrna.genomecov.bw stdout | head