Using MultiQC

Byte Club, October 18 2017. Using MultiQC to produce consolidated QC Reports. Anna Battenhouse, CSSB & CCBB.

Overview

MultiQC is a tool for aggregating NGS QC reports.
- It does not run the QC tools itself; it combines their existing reports for unified visualization.
MultiQC "knows" the report formats of many existing NGS tools:
- FastQC, cutadapt, bowtie2, tophat, STAR, kallisto, HISAT2, samtools, featureCounts, HTSeq, MACS2, Picard, GATK
- … and more!
MultiQC can also be configured to display other data via two straightforward steps:

format the data appropriately (e.g. tab-delimited text files)
create appropriate custom data entries in a multiqc_config.yaml configuration file
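
The two steps above can be sketched roughly as follows. This is an illustration only, not part of the workshop data: the file name example_metric.tsv, the section key example_metric, the sample names and the numbers are all placeholders. The custom_data and sp keys follow my reading of the MultiQC custom content documentation, so check them against the docs for the MultiQC version you are running.

```shell
# Sketch only: file name, section key, sample names and numbers are placeholders.
workdir=$(mktemp -d)

# Step 1: tab-delimited data, header row first, then one row per sample.
{
    printf 'Sample\texample_metric\n'
    printf 'sample_A\t100\n'
    printf 'sample_B\t200\n'
} > "$workdir/example_metric.tsv"

# Step 2: a custom data entry in multiqc_config.yaml (spaces only, no tabs).
# The key names below are assumptions taken from the MultiQC custom content docs.
cat > "$workdir/multiqc_config.yaml" <<'EOF'
custom_data:
    example_metric:
        file_format: 'tsv'
        section_name: 'Example metric'
        description: 'Placeholder values, for illustration only.'
        plot_type: 'table'
sp:
    example_metric:
        fn: 'example_metric.tsv'
EOF

echo "wrote example files to $workdir"
# To build the report:  cd "$workdir" && multiqc .
```

The sp entry is what ties the data file to the custom_data section: its key must match the section key, and its fn pattern must match the data file name.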

MultiQC produces neat, interactive plots in an HTML file.
- So it can be used as a basic plotting tool for many kinds of reports and data, not just those produced by NGS tools!

Code Workshop

ATAC-seq is a transposon-insertion sequencing method where an engineered, active transposon inserts in accessible ("open") chromatin. It is considered to be a much simpler protocol than standard DNase-seq, and requires less starting material as well.

For data, we will use some ATAC-seq datasets produced in Igor Ponomarev's lab in WCAAR. As a proof-of-concept for future work, they performed the ATAC-seq protocol on 5k and 50k cell nuclei from mouse brain, producing 2 paired-end datasets.

Setup to follow along

Log in to ls5 at TACC. Execute these commands to set up access to the multiqc binary:

module load python
export PATH="/work/projects/BioITeam/ls5/bin/multiqc-1.0:$PATH"
export PYTHONPATH="/work/projects/BioITeam/ls5/lib/python2.7/annab-packages:$PYTHONPATH"
 
# make sure it is working...
multiqc --help
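
If multiqc --help fails, it can help to confirm that the shell can find the program at all. The small helper below is a sketch; check_tool is a name made up for this page, not part of MultiQC or the TACC environment.

```shell
# Report whether a command is reachable on the current PATH.
# check_tool is a hypothetical helper name used only in this example.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $(command -v "$1")"
    else
        echo "missing: $1 (check your PATH and module setup)" >&2
        return 1
    fi
}

# Example:
#   check_tool multiqc
```

A "missing" result usually means the export PATH line above was not run in the current shell session.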

Produce a consolidated FastQC report

The FastQC tool is great for producing detailed reports for every individual fastq file. For example, for Igor's 2 PE datasets, 4 reports are produced from running fastqc (http://web.corral.tacc.utexas.edu/iyer/igor/fastqc/).

The shortcoming is that you have to browse through all the individual reports one at a time, which can be tedious for large experiments.

This is where MultiQC's power comes in. You can point MultiQC to a directory where FastQC has been run and it will magically produce a consolidated report.

For example, logged in to ls5 at TACC, first stage a directory where FastQC has been run:

mkdir -p $SCRATCH/byteclub/multiqc/01_fastq
cd $SCRATCH/byteclub/multiqc/01_fastq
ln -s -f /work/01063/abattenh/projects/byteclub/multiqc/fastqc

Now this is all it takes to produce a basic MultiQC report:

cd $SCRATCH/byteclub/multiqc/01_fastq
multiqc .

When this completes you'll see a new file and directory:

multiqc_report.html – the MultiQC HTML report with its default name
multiqc_data – directory with text files containing MultiQC data used in the report as well as a log file
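
The text files in multiqc_data can be inspected from the command line. The helper below is a sketch that assumes the files are tab-delimited with a header line; the file name multiqc_general_stats.txt in the usage comment is also an assumption, so list the directory first to see what your run produced.

```shell
# Print the column names (first line) of a tab-delimited file, one per line.
# list_columns is a hypothetical helper name used only in this example.
list_columns() {
    head -n 1 "$1" | tr '\t' '\n'
}

# Example (file name is an assumption; run ls to see what you actually have):
#   ls multiqc_data
#   list_columns multiqc_data/multiqc_general_stats.txt
```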

Here's what this basic FastQC report looks like: http://web.corral.tacc.utexas.edu/iyer/byteclub/multiqc/01_basic.multiqc_report.html

Tip

To view the file you created in a web browser, it must be copied somewhere a browser can open it. An easy way to do this is to copy it to your laptop as shown here, changing the user name (abattenh) and the scratch path as appropriate.

# from your laptop:
scp -p abattenh@ls5.tacc.utexas.edu:/scratch/01063/abattenh/byteclub/multiqc/01_fastq/multiqc_report.html .

Add a few customizations

MultiQC reports can be customized by creating a file called multiqc_config.yaml in the directory where you call multiqc.

Use your favorite text editor to create a file called multiqc_config.yaml in your $SCRATCH/byteclub/multiqc/01_fastq directory as shown below. This will add report title lines and change the names of the MultiQC output files.

multiqc_config.yaml

# Titles to use for the report.
title: "ATAC-Seq QC Reports"
subtitle: null
intro_text: "MultiQC reports for Igor's ATAC-Seq proof-of-concept project."
report_header_info:
    - Sequenced by: 'GSAF'
    - Job: 'JA17277'
    - Run: 'SA17121'
    - Setup: '2x150'

# Change the output filenames
output_fn_name: mqc_report.html
data_dir_name: mqc_report_data

After saving this file, remove the previous MultiQC outputs and re-run the program:

cd $SCRATCH/byteclub/multiqc/01_fastq
rm -rf multiqc_data multiqc_report.html
multiqc .

If all went well, you should now see an mqc_report.html file and an mqc_report_data directory. Your newly generated mqc_report.html report file should look like this (note the new title and header): http://web.corral.tacc.utexas.edu/iyer/byteclub/multiqc/02_custom.mqc_report.html
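
The clean-up and the "did it work?" check can be wrapped in two small shell functions. This is a sketch, not something MultiQC provides: clean_outputs and check_outputs are names made up for this page, and the output names in the usage comment are the ones set in the multiqc_config.yaml above.

```shell
# Remove earlier MultiQC outputs (any names you pass) from a directory.
# clean_outputs and check_outputs are hypothetical helper names.
clean_outputs() {
    dir="$1"; shift
    for name in "$@"; do
        [ -n "$name" ] || continue      # skip empty names so we never delete "$dir/" itself
        rm -rf "${dir:?}/$name"         # ${dir:?} aborts if dir is empty
    done
}

# Succeed only if the expected report file and data directory both exist.
check_outputs() {
    dir="$1"; report="$2"; data_dir="$3"
    [ -f "$dir/$report" ] && [ -d "$dir/$data_dir" ]
}

# Example:
#   clean_outputs . multiqc_report.html multiqc_data mqc_report.html mqc_report_data
#   multiqc .
#   check_outputs . mqc_report.html mqc_report_data || echo "expected outputs missing" >&2
```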

Tips for working with the MultiQC configuration file

Here are a few tips for working with the MultiQC configuration file.

Always use spaces (not tabs!) in the multiqc_config.yaml file.
Make sure the file is saved with Unix line endings (not Windows or Mac).
Pay attention to the output when running multiqc. It will tell you if there are issues parsing the config file.
Always delete any previous MultiQC output files before running multiqc
- While their documentation says existing files will just be updated, I have seen MultiQC get confused when previous reports exist.
It is a good idea to change the name of the MultiQC output files
- If output files with those names are not created, something went wrong!
Consult example config files
- An example multiqc_config.yaml file: https://github.com/ewels/MultiQC/blob/master/multiqc_config_example.yaml
- All multiqc_config.yaml defaults: https://github.com/ewels/MultiQC/blob/master/multiqc/utils/config_defaults.yaml
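
The first two tips (spaces rather than tabs, Unix line endings) are easy to check before running multiqc. The function below is a minimal sketch using grep; lint_config is a name made up for this page. It only looks for tab and carriage-return characters and does not validate the YAML itself.

```shell
# Flag tab characters and carriage returns (Windows or old Mac line endings)
# in a config file. lint_config is a hypothetical helper name.
lint_config() {
    cfg="$1"
    status=0
    tab=$(printf '\t')
    cr=$(printf '\r')
    if grep -q "$tab" "$cfg"; then
        echo "tabs found in $cfg (use spaces instead):" >&2
        grep -n "$tab" "$cfg" >&2
        status=1
    fi
    if grep -q "$cr" "$cfg"; then
        echo "carriage returns found in $cfg (save with Unix line endings)" >&2
        status=1
    fi
    return $status
}

# Example:
#   lint_config multiqc_config.yaml && multiqc .
```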

Add reports from a bowtie2 alignment

First stage some mm10 bowtie2 alignment data:

xx

code

xx

References

Main MultiQC links

Website: http://multiqc.info/
Documentation: http://multiqc.info/docs
MultiQC Github repo: https://github.com/ewels/MultiQC
MultiQC test data repo: https://github.com/ewels/MultiQC_TestData

MultiQC configuration files

an example multiqc_config.yaml file: https://github.com/ewels/MultiQC/blob/master/multiqc_config_example.yaml
all multiqc_config.yaml defaults: https://github.com/ewels/MultiQC/blob/master/multiqc/utils/config_defaults.yaml

MultiQC custom data support

structure of the custom data area of multiqc_config.yaml:
- http://multiqc.info/docs/#configuration
available plot types:
- http://multiqc.info/docs/#plotting-functions
- while this section is written for Python programming, the options listed in each plot type's "config" block can be specified declaratively in any plot's pconfig section in the multiqc_config.yaml.
example custom data files from their test data repo:
- https://github.com/ewels/MultiQC_TestData/tree/master/data/custom_content/no_config
- https://github.com/ewels/MultiQC_TestData/tree/master/data/custom_content/embedded_config

Example Reports from Anna

Below are descriptions of two projects I've assisted with lately using MultiQC to help pull together visualizations assessing experiment quality.

I recommend using Chrome to view MultiQC reports.

The HTML reports generated by MultiQC rely heavily on JavaScript and other dynamic web content scripting tools, and not all browsers support them equally well.

The example MultiQC reports below were generated by running the multiqc binary on a command line.
After inspecting them locally (by just opening them as files in a web browser), they were copied to a web-accessible location to share with others. Here, that location is the Iyer Lab's web-accessible directory on corral.

Igor Ponomarev ATAC-seq data

ATAC-seq is a transposon-insertion sequencing method where an engineered, active transposon inserts in accessible ("open") chromatin. It is considered to be a much simpler protocol than standard DNase-seq, and requires less starting material as well.

Igor Ponomarev's lab (in WCAAR) performed the ATAC-seq protocol on 5k and 50k cell nuclei from mouse brain, producing 2 paired-end datasets.

http://web.corral.tacc.utexas.edu/iyer/igor/mqc_report.html
has both standard and custom data reports

Marcotte lab amplicon sequencing

The Marcotte lab is working on a deep mutational screening project of a human gene transformed into yeast as an amplicon on a plasmid. Here, the gene is MVK, a gene in the yeast cholesterol biosynthesis pathway. The hsMVK gene is amplified with an error-prone polymerase to produce point mutations. Both the native yeast gene and the human ortholog (with which it shares no sequence similarity) are under on/off promoter control. The idea is to compare the mutations that accumulate in the active hsMVK gene, after many growth cycles, with a background in which the hsMVK gene is present but not active (the yeast MVK is doing the work) to see which mutations are favored or disfavored. As part of this project, Riddhiman Garge produced 19 datasets.

basic FastQC report
- http://web.corral.tacc.utexas.edu/iyer/mvk/mvk_mqc_report.fastqc.html
report on BWA mem alignments of the datasets to hsMVK amplicon and plasmid backbone contigs
- http://web.corral.tacc.utexas.edu/iyer/mvk/mvk_mqc_report.bwa.html
- standard reports from samtools flagstat, samtools idxstats, Picard MarkDuplicates
- custom data reports from bedtools genomecov and from insert size distribution data Anna computed
report using custom data from a specialized deep mutational screening tool from the Jesse Bloom lab
- http://web.corral.tacc.utexas.edu/iyer/mvk/mvk_mqc_report.jbloom.html
- this tool looks only at the overlapping portions of paired-end R1 and R2 reads