Byte Club, October 18 2017. Using MultiQC to produce consolidated QC Reports.

Byte Club, October 18 2017
Anna Battenhouse, CSSB & CCBB.

...

  • MultiQC produces neat, interactive plots in an HTML file.
    • So it can be used as a basic plotting tool for many kinds of reports and data, not just those produced by NGS tools!


Tip

I recommend using Chrome to view MultiQC reports.

The HTML reports generated by MultiQC rely heavily on JavaScript and other dynamic web content scripting tools, and not all browsers support them equally well.

Code Workshop

ATAC-seq is a transposon-insertion sequencing method where an engineered, activated transposon inserts into accessible ("open") chromatin. It is considered a much simpler protocol than standard DNase-seq, and it requires less starting material as well.

...

Setup to follow along

Log in to ls5 or stampede at TACC. Execute these commands to set up access to the multiqc binary:

Code Block
languagebash
titlelonestar5 setup for multiqc
module load python
export PATH="/work/projects/BioITeam/ls5/opt/multiqc-1.0:$PATH"
export PYTHONPATH="/work/projects/BioITeam/ls5/lib/python2.7/annab-packages:$PYTHONPATH"
 
# make sure it is working...
multiqc --help

...

Code Block
languagebash
titlestampede setup for multiqc
module load python
export PATH="/work/projects/BioITeam/stampede/opt/multiqc-1.0:$PATH"
export PYTHONPATH="/work/projects/BioITeam/stampede/lib/python2.7/annab-packages:$PYTHONPATH"
 
# make sure it is working...
multiqc --help
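
If multiqc --help fails, it can help to first confirm that the shell resolves the command at all. Here is a minimal sketch; the helper name have_cmd is made up for this example, while command -v is the standard shell built-in for looking up a command.

```shell
# have_cmd is a hypothetical helper, not part of MultiQC.
# It succeeds (exit status 0) when the named command can be found.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd multiqc; then
    echo "multiqc found at: $(command -v multiqc)"
else
    echo "multiqc not found; re-check the PATH and PYTHONPATH exports above"
fi
```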

Produce a consolidated FastQC report

The FastQC tool is great for producing detailed reports for every individual fastq file. For example, for Igor's 2 PE datasets, 4 reports are produced from running fastqc (http://web.corral.tacc.utexas.edu/iyer/igor/fastqc/).

The shortcoming is that you have to browse through all the individual reports one at a time, which can be tedious for large experiments.

This is where MultiQC's power comes in. You can point MultiQC to a directory where FastQC has been run and it will magically produce a consolidated report.

For example, logged in to ls5 at TACC, first stage a directory where FastQC has been run:

Code Block
languagebash
mkdir -p $SCRATCH/byteclub/multiqc
cd $SCRATCH/byteclub/multiqc
ln -s -f /work/projects/BioITeam/projects/byteclub/multiqc/fastqc

Now this is all it takes to produce a basic MultiQC report:

Code Block
languagebash
cd $SCRATCH/byteclub/multiqc
multiqc .

When this completes you'll see a new file and directory:

  • multiqc_report.html – the MultiQC HTML report with its default name
  • multiqc_data – directory with text files containing MultiQC data used in the report as well as a log file

...

Expand
titleTip

To view the file you created in a web browser, it must be copied somewhere a browser can open it. An easy way to do this is to copy it to your laptop like this, changing the user name from abattenh and the scratch path as appropriate.

Code Block
languagebash
# from your laptop:
scp -p abattenh@ls5.tacc.utexas.edu:/scratch/01063/abattenh/byteclub/multiqc/01_fastq/multiqc_report.html .

Add a few customizations

...

Use your favorite text editor to create a file called multiqc_config.yaml in your $SCRATCH/byteclub/multiqc/01_fastq directory as shown below. This will add report title lines and change the names of the MultiQC output files.

Code Block
titlemultiqc_config.yaml
# Titles to use for the report.
title: "ATAC-Seq QC Reports"
subtitle: null
intro_text: "MultiQC reports for Igor's ATAC-Seq proof-of-concept project."
report_header_info:
    - Sequenced by: 'GSAF'
    - Job: 'JA17277'
    - Run: 'SA17121'
    - Setup: '2x150'

# Change the output filenames
output_fn_name: mqc_report.html
data_dir_name: mqc_report_data

Expand
titleCatch up

To catch up, just stage Anna's pre-made files:

Code Block
languagebash
mkdir -p $SCRATCH/byteclub/multiqc/
cd $SCRATCH/byteclub/multiqc/
rsync -avrP --delete /work/projects/BioITeam/projects/byteclub/multiqc/02_fastq/ .

After saving this file, remove the previous MultiQC outputs and re-run the program:

Code Block
languagebash
cd $SCRATCH/byteclub/multiqc
rm -rf multiqc_data multiqc_report.html
multiqc .

If all went well, you should now see an mqc_report.html file and an mqc_report_data directory. Your newly generated mqc_report.html report file should look like this (note the new title and header): http://web.corral.tacc.utexas.edu/iyer/byteclub/multiqc/02_custom.mqc_report.html.
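
Since the renamed outputs are the signal that the config file was picked up, a quick existence check can be scripted. This is only a sketch: the function name check_mqc_outputs is invented here, and the two names it looks for are the ones set by output_fn_name and data_dir_name above.

```shell
# check_mqc_outputs is a hypothetical helper for illustration.
# Usage: check_mqc_outputs DIR
# Succeeds only if both renamed MultiQC outputs exist in DIR.
check_mqc_outputs() {
    dir=${1:-.}
    status=0
    if [ ! -f "$dir/mqc_report.html" ]; then
        echo "missing: $dir/mqc_report.html" >&2
        status=1
    fi
    if [ ! -d "$dir/mqc_report_data" ]; then
        echo "missing: $dir/mqc_report_data" >&2
        status=1
    fi
    return $status
}

# Example use after a run:
#   check_mqc_outputs "$SCRATCH/byteclub/multiqc" && echo "outputs look good"
```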

...

  • Always use spaces (not tabs!) in the multiqc_config.yaml file.
  • Make sure the file is saved with Unix line endings (not Windows or Mac).
  • Pay attention to the output when running multiqc. It will tell you if there are issues parsing the config file.
  • Always delete any previous MultiQC output files before running multiqc
    • While their documentation says existing files will just be updated, I have seen MultiQC get confused when previous reports exist.
  • It is a good idea to change the name of the MultiQC output files
    • If output files with those names are not created, something went wrong!
  • Consult example config files
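
The first two tips (spaces rather than tabs, and Unix line endings) can be checked mechanically before running multiqc. Here is a minimal sketch using only grep and printf; the function name check_yaml_whitespace is made up for this example.

```shell
# check_yaml_whitespace is a hypothetical helper for illustration.
# Usage: check_yaml_whitespace FILE
# Fails if FILE contains tab characters or carriage returns
# (carriage returns indicate Windows or old Mac line endings).
check_yaml_whitespace() {
    tab=$(printf '\t')
    cr=$(printf '\r')
    status=0
    if grep -q "$tab" "$1"; then
        echo "$1: contains tab characters" >&2
        status=1
    fi
    if grep -q "$cr" "$1"; then
        echo "$1: contains carriage returns" >&2
        status=1
    fi
    return $status
}

# Example use:
#   check_yaml_whitespace multiqc_config.yaml && echo "whitespace looks fine"
```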

...

  • Avoid running multiqc on large complex directory trees.
    • Instead, create a separate directory (or directory tree) only for MultiQC 
      • Copy or link the files you want MultiQC to look for there, and use it as MultiQC's target directory.
    • MultiQC will run much faster and is less likely to get confused.

Add reports from a bowtie2 alignment

First stage some mm10 bowtie2 alignment data:

Code Block
languagebash
mkdir -p $SCRATCH/byteclub/multiqc
cd $SCRATCH/byteclub/multiqc
rsync -avrP /work/projects/BioITeam/projects/byteclub/multiqc/bowtie2/ bowtie2/

Take a look at the contents of the bowtie2 directory. It contains typical output files from running Anna's align_bowtie2_illumina.sh alignment script.

...

  • <prefix>.flagstat.txt - output from running samtools flagstat 
  • <prefix>.idxstats.txt - output from running samtools idxstats 
  • <prefix>.dupinfo.txt - output from running Picard MarkDuplicates 

Note that output from samtools flagstat and samtools idxstats will only be recognized by MultiQC if the file names include the words flagstat and idxstats. Fortunately, Anna's script created files with those names!
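
To spot files that would be missed because of their names, a simple name check can be scripted. The sketch below only inspects file names and is not how MultiQC itself searches; the function name classify_samtools_report and the third example file name are made up.

```shell
# classify_samtools_report is a hypothetical helper for illustration.
# It looks only at the file NAME, mirroring the naming requirement above.
classify_samtools_report() {
    case "$(basename "$1")" in
        *flagstat*) echo "flagstat" ;;
        *idxstats*) echo "idxstats" ;;
        *)          echo "unrecognized" ;;
    esac
}

# The last name is invented to show a file that would not match.
for f in brain_5k_nuclei.flagstat.txt brain_5k_nuclei.idxstats.txt brain_5k_nuclei.stats.txt; do
    echo "$f -> $(classify_samtools_report "$f")"
done
```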

Now run multiqc again using the previous MultiQC configuration created above.


Expand
titleCatch up

To catch up, just use Anna's pre-made files:

Code Block
languagebash
mkdir -p $SCRATCH/byteclub/multiqc/
cd $SCRATCH/byteclub/multiqc/
rsync -avrP --delete /work/projects/BioITeam/projects/byteclub/multiqc/03_bowtie/ .

Now run multiqc again:

Code Block
languagebash
cd $SCRATCH/byteclub/multiqc
rm -rf mqc_report*
multiqc .

If all went well, you should now see a mqc_report.html file that looks like this: http://web.corral.tacc.utexas.edu/iyer/byteclub/multiqc/03_bowtie.mqc_report.html, with new sections for Picard and Samtools reports.

Fix the Picard MarkDuplicates sample name

...

For the first part of the solution, we'll create modified versions of the metrics files, not in the alignment directory but in a new for_multiqc directory.

Code Block
languagebash
mkdir -p $SCRATCH/byteclub/multiqc/for_multiqc
cd $SCRATCH/byteclub/multiqc/for_multiqc
for f in ../bowtie2/*.dupinfo.txt; do
  bn=$(basename "$f")
  # strip the .dupinfo.txt suffix to get the sample prefix
  pfx=${bn%%.dupinfo.txt}
  echo "$f - $pfx"
  # remove every literal ".sort" from the metrics text
  sed 's/[.]sort//g' "$f" > "${pfx}.dupmetrics.txt"
done
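
To see what the two text manipulations in that loop do, here they are applied to single values. This is a sketch for illustration: the sample line fed to sed is made up and is not taken from a real Picard metrics file.

```shell
# The file name and the sample text below are made up for illustration.
bn="brain_5k_nuclei.dupinfo.txt"

# ${var%%pattern} removes the longest matching suffix from the value.
pfx=${bn%%.dupinfo.txt}
echo "$pfx"

# [.]sort matches a literal ".sort"; the g flag removes every occurrence on a line.
cleaned=$(echo "INPUT=brain_5k_nuclei.sort.bam" | sed 's/[.]sort//g')
echo "$cleaned"
```

The first echo prints brain_5k_nuclei and the second prints INPUT=brain_5k_nuclei.bam.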

Your for_multiqc directory should now have 2 files:

  • brain_50k_nuclei.dupmetrics.txt
  • brain_5k_nuclei.dupmetrics.txt

The final piece of the puzzle is to tell MultiQC to ignore the original <prefix>.dupinfo.txt files by modifying the multiqc_config.yaml file, adding a fn_ignore_files list entry.

Code Block
titlemultiqc_config.yaml
# Titles to use for the report.
title: "ATAC-Seq QC Reports"
subtitle: null
intro_text: "MultiQC reports for Igor's ATAC-Seq proof-of-concept project."
report_header_info:
    - Sequenced by: 'GSAF'
    - Job: 'JA17277'
    - Run: 'SA17121'
    - Setup: '2x150'

# Change the output filenames
output_fn_name: mqc_report.html
data_dir_name: mqc_report_data

# Ignore these files / directories / paths when searching for reports
fn_ignore_files:
    - '*.dupinfo.txt'
Expand
titleCatch up

To catch up, just use Anna's pre-made files:

Code Block
languagebash
mkdir -p $SCRATCH/byteclub/multiqc
cd $SCRATCH/byteclub/multiqc
rsync -avrP --delete /work/projects/BioITeam/projects/byteclub/multiqc/04_picard_fixed/ .

After making this config file modification, you can now run multiqc again:

Code Block
languagebash
cd $SCRATCH/byteclub/multiqc; rm -rf mqc_report*; multiqc .

The resulting report should look like this: ftp://gapdh.icmb.utexas.edu/misc/multiqc/04.mqc_report.fixed_align_info.html, with a cleaned up General Statistics table.

Controlling report section order

...

Code Block
titlemultiqc_config.yaml
# Titles to use for the report.
title: "ATAC-Seq QC Reports"
subtitle: null
intro_text: "MultiQC reports for Igor's ATAC-Seq proof-of-concept project."
report_header_info:
    - Sequenced by: 'GSAF'
    - Job: 'JA17277'
    - Run: 'SA17121'
    - Setup: '2x150'

# Change the output filenames
output_fn_name: mqc_report.html
data_dir_name: mqc_report_data

# Ignore these files / directories / paths when searching for reports
fn_ignore_files:
    - '*.dupinfo.txt'

# Modules that should come at the top of the report
top_modules:
    - 'generalstats'
    - 'fastqc'
    - 'samtools'
    - 'picard'

...

Expand
titleCatch up

To catch up, just use Anna's pre-made files:

Code Block
languagebash
mkdir -p $SCRATCH/byteclub/multiqc
cd $SCRATCH/byteclub/multiqc
rsync -avrP --delete /work/projects/BioITeam/projects/byteclub/multiqc/05_section_order/ .

After making this config file modification, you can now run multiqc again:

Code Block
languagebash
cd $SCRATCH/byteclub/multiqc; rm -rf mqc_report*; multiqc .

Producing a report like this: http://web.corral.tacc.utexas.edu/iyer/byteclub/multiqc/05_section_order.mqc_report.html, with a section order that more closely follows workflow processing steps.

About MultiQC custom data

MultiQC has a mechanism for adding custom report sections for data produced by programs it doesn't know about. The simple way to do this is declaratively (i.e., via configuration parameters), as described below. (You can also write a Python module for very fine-grained control, but that is a lot more work.)

To add a section for custom data:

  1. Format the data appropriately
    • MultiQC supports a number of data file formats (yaml, comma-separated values, etc.)
      • I recommend using simple tab-delimited text files, with MultiQC's preferred .tsv extension.
    • data can be provided as one file per sample (where the sample name is part of the file name)
      • or as a single table-like file containing data for all samples
  2. Add two required custom data section entries in the multiqc_config.yaml configuration file
    • a sp (search path) section for finding report data
      • specifying a wildcard pattern, if data is supplied as one file per sample
      • or a single file name for a consolidated data file
    • each report has a user-named section under a single custom_data section.
      • the required id attribute must be unique, and ties together the custom_data, sp and custom_content sections
      • other important attributes include description, file_format, and plot_type.
      • a pconfig sub-section contains plot configuration options
  3. Specify the ordering of the custom report section (optional)
    • add a custom_content section order list entry
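
Because the data files need to be tab-delimited, it can be worth validating them before editing the config. The sketch below checks a simple two-column per-sample file; the function name check_two_col_tsv is made up, and files with more columns would need a different field count.

```shell
# check_two_col_tsv is a hypothetical helper for illustration.
# Usage: check_two_col_tsv FILE
# Succeeds only if every line has exactly two tab-separated fields.
check_two_col_tsv() {
    awk -F '\t' '
        NF != 2 { printf "line %d: %d field(s)\n", NR, NF; bad = 1 }
        END     { exit bad }
    ' "$1"
}

# Example use:
#   check_two_col_tsv brain_5k_nuclei.bowtie2_isizes.tsv && echo "looks tab-delimited"
```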

...

So a bit of command line reformatting is needed to produce files for MultiQC, which we will save in our for_multiqc directory.

Code Block
languagebash
mkdir -p $SCRATCH/byteclub/multiqc/for_multiqc
cd $SCRATCH/byteclub/multiqc/for_multiqc
for f in ../bowtie2/*.insertsz.txt; do
  bn=$(basename "$f")
  pfx=${bn%%.insertsz.txt}
  echo "$f - $pfx"
  # skip the header line, drop rows starting with "-", keep columns 1 and 3
  tail -n +2 "$f" | grep -v '^-' | cut -f 1,3 > "${pfx}.bowtie2_isizes.tsv"
done
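
Here is the same tail, grep and cut pipeline run on a tiny in-line input, so each stage is easy to follow. The three-column layout and all values are assumptions made up for this sketch; the layout of the real .insertsz.txt files is not reproduced here.

```shell
# Hypothetical 3-column, tab-separated input with a header line.
# These column names and values are assumptions, not real data.
sample=$(printf 'size\tpct\tcount\n-150\t0.1\t3\n36\t0.5\t120\n37\t0.6\t131\n')

# tail -n +2   : skip the header line
# grep -v '^-' : drop rows whose first character is "-" (negative sizes)
# cut -f 1,3   : keep columns 1 and 3
result=$(printf '%s\n' "$sample" | tail -n +2 | grep -v '^-' | cut -f 1,3)
printf '%s\n' "$result"
```

Only the two rows with positive sizes survive, each reduced to its first and third columns.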

...

Code Block
titlemultiqc_config.yaml
# Titles to use for the report.
title: "ATAC-Seq QC Reports"
subtitle: null
intro_text: "MultiQC reports for Igor's ATAC-Seq proof-of-concept project."
report_header_info:
    - Sequenced by: 'GSAF'
    - Job: 'JA17277'
    - Run: 'SA17121'
    - Setup: '2x150'

# Change the output filenames
output_fn_name: mqc_report.html
data_dir_name: mqc_report_data

# Ignore these files / directories / paths when searching for reports
fn_ignore_files:
    - '*.dupinfo.txt'

# Modules that should come at the top of the report
top_modules:
    - 'generalstats'
    - 'fastqc'
    - 'samtools'
    - 'picard'

# --------------------------------
# Custom data
# --------------------------------
custom_content:
  order:
    - bowtie2_isize_section

custom_data:
    bowtie2_isize:
        id: 'bowtie2_isize_section'
        section_name: 'Bowtie2 insert size'
        description: 'distribution for alignments (bowtie2 --local -X2000 --no-mixed --no-discordant)'
        file_format: 'tsv'
        plot_type: 'linegraph'
        pconfig:
            id: 'bowtie2_isize_plot'
            title: 'Insert sizes for proper pairs'
            xlab: 'Insert size'
            ylab: 'Count'

sp:
    bowtie2_isize_section:
        fn: '*.bowtie2_isizes.tsv'

Expand
titleCatch up

To catch up, just use Anna's pre-made files:

Code Block
languagebash
mkdir -p $SCRATCH/byteclub/multiqc
cd $SCRATCH/byteclub/multiqc
rsync -avrP --delete /work/projects/BioITeam/projects/byteclub/multiqc/06_custom_linegraph/ .

Then the usual...

Code Block
languagebash
cd $SCRATCH/byteclub/multiqc; rm -rf mqc_report*; multiqc .

This results in a report that includes our insert size distribution data in the custom data section we configured: http://web.corral.tacc.utexas.edu/iyer/byteclub/multiqc/06_custom_linegraph.mqc_report.html, with a new section called Bowtie2 insert size.

What's cool is that this "sawtooth" insert size distribution occurs because of the way transposons insert into the major groove of DNA at regular intervals. So this graph shows Igor that his ATAC-seq proof-of-concept experiment worked!

Adding custom bargraphs

Here we'll create two custom bargraph reports, one for bowtie2 mapping qualities and a second showing genome coverage of the alignments.

The data files for both reports are pretty simple, but it took a bit of scripting to create them. So let's just use pre-made copies:

Code Block
languagebash
cd $SCRATCH/byteclub/multiqc
cp /work/projects/BioITeam/projects/byteclub/multiqc/07_custom_bargraph/for_multiqc/*mapq*      for_multiqc/
cp /work/projects/BioITeam/projects/byteclub/multiqc/07_custom_bargraph/for_multiqc/*genomecov* for_multiqc/

There is one mapping quality histogram for each dataset, with category names in the 1st column and counts in the 2nd. The 50k dataset file looks like this:

Code Block
titlebrain_50k_nuclei.mapq_histogram.tsv
q0	    137354
1-9	    671546
10-19	1081868
20-29	1945926
30-39	1508496
40+	    12930272
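
For a sense of the kind of scripting involved, here is one way such a histogram could be produced. It is a sketch under stated assumptions, not the script used for the pre-made files: the function name bin_mapq is made up, and the input is assumed to be one mapping quality value per line (in SAM format, mapping quality is column 5).

```shell
# bin_mapq is a hypothetical helper for illustration.
# Reads one MAPQ value per line on stdin and prints
# tab-separated category counts in a fixed order.
bin_mapq() {
    awk '
        $1 == 0  { n["q0"]++;    next }
        $1 < 10  { n["1-9"]++;   next }
        $1 < 20  { n["10-19"]++; next }
        $1 < 30  { n["20-29"]++; next }
        $1 < 40  { n["30-39"]++; next }
                 { n["40+"]++ }
        END {
            split("q0 1-9 10-19 20-29 30-39 40+", order, " ")
            for (i = 1; i <= 6; i++) printf "%s\t%d\n", order[i], n[order[i]]
        }'
}

# Example with made-up values. With real data the input could come from
# something like: samtools view FILE.bam | cut -f 5
printf '0\n5\n12\n42\n44\n' | bin_mapq
```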

There is just one data file for genome coverage. Unlike the per-sample files, it has a header, with dataset names in the 1st column, followed by category names and their counts in subsequent columns. (I've re-formatted the data below for readability, but remember that all .tsv file data must be tab-separated.)

Code Block
titlecombined_genomecov.tsv
sample      none        1-2        3-10       11-50     51+
5k_nuclei   2140984435  237947623  308665107  38729079  4545530
50k_nuclei  2175228345  351105871  186361275  17356704  819579
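
If you start from a space-aligned table like the one displayed above, it has to be converted to real tabs before MultiQC can use it. Here is a minimal sketch with awk; the function name to_tsv is made up, and it assumes no field contains embedded spaces.

```shell
# to_tsv is a hypothetical helper for illustration.
# It rewrites whitespace-separated columns as tab-separated ones.
# Assumption: no field contains embedded spaces.
to_tsv() {
    # Assigning $1 to itself makes awk rebuild the line using OFS (a tab).
    awk -v OFS='\t' '{ $1 = $1; print }'
}

printf 'sample      none        1-2\n5k_nuclei   2140984435  237947623\n' | to_tsv
```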

Here we edit the multiqc_config.yaml configuration file to add appropriate custom data sections:

Code Block
titlemultiqc_config.yaml
# Titles to use for the report.
title: "ATAC-Seq QC Reports"
subtitle: null
intro_text: "MultiQC reports for Igor's ATAC-Seq proof-of-concept project."
report_header_info:
    - Sequenced by: 'GSAF'
    - Job: 'JA17277'
    - Run: 'SA17121'
    - Setup: '2x150'

# Change the output filenames
output_fn_name: mqc_report.html
data_dir_name: mqc_report_data

# Ignore these files / directories / paths when searching for reports
fn_ignore_files:
    - '*.dupinfo.txt'

# Modules that should come at the top of the report
top_modules:
    - 'generalstats'
    - 'fastqc'
    - 'samtools'
    - 'picard'

# --------------------------------
# Custom data
# --------------------------------
custom_content:
  order:
    - bowtie2_isize_section
    - bowtie2_mapq_section
    - genome_coverage_section
custom_data:
    bowtie2_isize:
        id: 'bowtie2_isize_section'
        section_name: 'Bowtie2 insert size'
        description: 'distribution for alignments (bowtie2 --local -X2000 --no-mixed --no-discordant)'
        file_format: 'tsv'
        plot_type: 'linegraph'
        pconfig:
            id: 'bowtie2_isize_plot'
            title: 'Insert sizes for proper pairs'
            xlab: 'Insert size'
            ylab: 'Count'
    bowtie2_mapq:
        id: 'bowtie2_mapq_section'
        section_name: 'Mapping quality'
        description: 'distribution for aligned reads before filtering'
        file_format: 'tsv'
        plot_type: 'bargraph'
        pconfig:
            id: 'bowtie2_mapq_plot'
            title: 'Mapping quality scores'
            ymax: 60000000
    genome_coverage:
        id: 'genome_coverage_section'
        section_name: 'Genome coverage'
        description: 'of mapped inserts (bedtools genomecov -fs), grouped into coverage count categories'
        file_format: 'tsv'
        plot_type: 'bargraph'
        pconfig:
            id: 'genome_coverage_plot'
            title: 'Position coverage by coverage count category'
            logswitch: True
            stacking: null
sp:
    bowtie2_isize_section:
        fn: '*.bowtie2_isizes.tsv'
    bowtie2_mapq_section:
        fn: '*.mapq_histogram.tsv'
    genome_coverage_section:
        fn: 'combined_genomecov.tsv'
 
# file suffixes to remove when generating sample names...
extra_fn_clean_exts:
    - type: 'replace'
      pattern: '.mapq_histogram.tsv'
    - type: 'replace'
      pattern: '.genomecov.tsv'
Expand
titleCatch up

To catch up, just use Anna's pre-made files:

Code Block
languagebash
mkdir -p $SCRATCH/byteclub/multiqc
cd $SCRATCH/byteclub/multiqc
rsync -avrP --delete /work/projects/BioITeam/projects/byteclub/multiqc/07_custom_bargraph/ .

Then the usual...

Code Block
languagebash
cd $SCRATCH/byteclub/multiqc; rm -rf mqc_report*; multiqc .

The resulting report includes our new Mapping quality and Genome coverage sections and should look like this: http://web.corral.tacc.utexas.edu/iyer/byteclub/multiqc/07_custom_bargraph.mqc_report.html.

Making MultiQC run faster and be less confused

By default, MultiQC scans all files in the analysis directory you specify. This can take quite a while for complex directory hierarchies with many files that will not be used by MultiQC.

Additionally, MultiQC can get confused when the same (or similar) data is found in different files, or in different directories.

To address these issues, it is a good practice to copy everything you want MultiQC to process into a single directory, then either specify just that directory on the multiqc command line (e.g. multiqc for_multiqc), or exclude other directories in the multiqc_config.yaml file.

For example, here we can stage all the reports we want MultiQC to process in our for_multiqc directory:

Code Block
languagebash
cd $SCRATCH/byteclub/multiqc/for_multiqc
ln -s -f ../fastqc
cp -p ../bowtie2/*.flagstat.txt  .
cp -p ../bowtie2/*.idxstats.txt  .

Your for_multiqc directory should now contain everything we want MultiQC to use:

Code Block
brain_50k_nuclei.bowtie2_isizes.tsv
brain_50k_nuclei.dupmetrics.txt
brain_50k_nuclei.flagstat.txt
brain_50k_nuclei.idxstats.txt
brain_50k_nuclei.mapq_histogram.tsv
brain_5k_nuclei.bowtie2_isizes.tsv
brain_5k_nuclei.dupmetrics.txt
brain_5k_nuclei.flagstat.txt
brain_5k_nuclei.idxstats.txt
brain_5k_nuclei.mapq_histogram.tsv
combined_genomecov.tsv
fastqc
Expand
titleCatch up

To catch up, just use Anna's pre-made files:

Code Block
languagebash
mkdir -p $SCRATCH/byteclub/multiqc
cd $SCRATCH/byteclub/multiqc
rsync -avrP --delete /work/projects/BioITeam/projects/byteclub/multiqc/08_final/ .

Run MultiQC again, but this time just point it at the for_multiqc directory:

Code Block
languagebash
cd $SCRATCH/byteclub/multiqc
rm -rf mqc_report*
multiqc for_multiqc

Alternatively, you could exclude the bowtie2 directory entirely via a fn_ignore_dirs section list item in multiqc_config.yaml, like this: 
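
A minimal sketch of such an entry, written as a small shell function that appends it to a config file. The function name add_bowtie2_ignore is made up, and the sketch assumes the entry is matched against the directory name bowtie2; check the MultiQC output to confirm the directory is actually skipped.

```shell
# add_bowtie2_ignore is a hypothetical helper for illustration.
# Usage: add_bowtie2_ignore CONFIG_FILE
# Appends an fn_ignore_dirs list with one entry to the given config file.
add_bowtie2_ignore() {
    cat >> "$1" <<'EOF'

# Ignore these directories when searching for reports
fn_ignore_dirs:
    - 'bowtie2'
EOF
}

# Example use:
#   add_bowtie2_ignore multiqc_config.yaml
```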

References

...

Below are descriptions of two projects I've assisted with lately using MultiQC to help pull together visualizations assessing experiment quality.

...

  • These example MultiQC reports below were generated by running the multiqc binary on a command line.
  • After inspecting them locally (by just opening them as files in a web browser), they were copied to a web-accessible location to share with others. Here, that location is the Iyer Lab's web-accessible directory on corral.

Igor Ponomarev ATAC-seq data

...