...
- MultiQC produces neat, interactive plots in an HTML file.
- So it can be used as a basic plotting tool for many kinds of reports and data, not just those produced by NGS tools!
Tip |
---|
I recommend using Chrome to view MultiQC reports. The HTML reports generated by MultiQC rely heavily on JavaScript and other dynamic web content scripting tools, and not all browsers support them equally well. |
Code Workshop
ATAC-seq is a transposon-insertion sequencing method where an engineered, activated transposon inserts into accessible ("open") chromatin. It is considered a much simpler protocol than standard DNase-seq, and it requires less starting material as well.
...
Setup to follow along
Log in to ls5 or stampede at TACC. Execute these commands to set up access to the multiqc binary:
Code Block | ||||
---|---|---|---|---|
| ||||
module load python
export PATH="/work/projects/BioITeam/ls5/opt/multiqc-1.0:$PATH"
export PYTHONPATH="/work/projects/BioITeam/ls5/lib/python2.7/annab-packages:$PYTHONPATH"
# make sure it is working...
multiqc --help |
...
Code Block | ||||
---|---|---|---|---|
| ||||
module load python
export PATH="/work/projects/BioITeam/stampede/opt/multiqc-1.0:$PATH"
export PYTHONPATH="/work/projects/BioITeam/stampede/lib/python2.7/annab-packages:$PYTHONPATH"
# make sure it is working...
multiqc --help |
Produce a consolidated FastQC report
The FastQC tool is great for producing detailed reports for every individual fastq file. For example, for Igor's 2 PE datasets, 4 reports are produced from running fastqc (http://web.corral.tacc.utexas.edu/iyer/igor/fastqc/).
The shortcoming is that you have to browse through all the individual reports one at a time, which can be tedious for large experiments.
...
Code Block | ||
---|---|---|
| ||
mkdir -p $SCRATCH/byteclub/multiqc/01_fastq
cd $SCRATCH/byteclub/multiqc/01_fastq
ln -s -f /work/01063/abattenh/projects/byteclub/multiqc/fastqc |
...
Code Block | ||
---|---|---|
| ||
cd $SCRATCH/byteclub/multiqc/01_fastq
multiqc . |
When this completes you'll see a new file and directory:
...
Expand | |||||
---|---|---|---|---|---|
| |||||
To view the file you created in a web browser, you must copy it somewhere a browser can open it. An easy way to do this is to copy it to your laptop, for example like this, changing the user name from abattenh and the scratch path as appropriate.
|
Add a few customizations
...
Use your favorite text editor to create a file called multiqc_config.yaml in your $SCRATCH/byteclub/multiqc/01_fastq directory as shown below. This will add report title lines and change the names of the MultiQC output files.
...
Expand | |||||
---|---|---|---|---|---|
| |||||
To catch up, just stage Anna's pre-made files:
|
After saving this file, remove the previous MultiQC outputs and re-run the program:
Code Block | ||
---|---|---|
| ||
cd $SCRATCH/byteclub/multiqc/01_fastq
rm -rf multiqc_data multiqc_report.html
multiqc . |
...
- Always use spaces (not tabs!) in the multiqc_config.yaml file.
- Make sure the file is saved with Unix line endings (not Windows or Mac).
- Pay attention to the output when running multiqc. It will tell you if there are issues parsing the config file.
- Always delete any previous MultiQC output files before running multiqc.
  - While the documentation says existing files will just be updated, I have seen MultiQC get confused when previous reports exist.
- It is a good idea to change the name of the MultiQC output files.
  - If output files with those names are not created, something went wrong!
- Consult example config files.
  - An example multiqc_config.yaml file: https://github.com/ewels/MultiQC/blob/master/multiqc_config_example.yaml
  - All multiqc_config.yaml defaults: https://github.com/ewels/MultiQC/blob/master/multiqc/utils/config_defaults.yaml
- Avoid running multiqc on large, complex directory trees.
  - Instead, create a separate directory (or directory tree) only for MultiQC.
    - Copy or link the files you want MultiQC to look for there, and use it as MultiQC's target directory.
    - MultiQC will run much faster and be less likely to get confused.
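The first two tips (spaces only, Unix line endings) can be checked before each run. The following is a minimal sketch, not part of MultiQC: the function name check_mqc_config is hypothetical, and it only looks for tab and carriage-return characters anywhere in the file.

```shell
# Hypothetical helper (not part of MultiQC): flag tabs and carriage returns in a config file.
check_mqc_config() {
    cfg="$1"
    status=0
    tab=$(printf '\t')
    cr=$(printf '\r')
    if grep -q "$tab" "$cfg"; then
        echo "tabs found in $cfg: use spaces instead"
        status=1
    fi
    if grep -q "$cr" "$cfg"; then
        echo "carriage returns found in $cfg: save with Unix line endings"
        status=1
    fi
    return "$status"
}

# Demo on a throwaway file that contains a tab.
demo_cfg=$(mktemp)
printf 'top_modules:\n\t- fastqc\n' > "$demo_cfg"
check_mqc_config "$demo_cfg" || echo "fix the config before running multiqc"
```

In the workshop directory this could be used as `check_mqc_config multiqc_config.yaml && multiqc .`, so that multiqc only runs when the config passes both checks.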
...
First stage some mm10 bowtie2 alignment data:
Code Block | ||
---|---|---|
| ||
mkdir -p $SCRATCH/byteclub/multiqc/02_bowtie
cd $SCRATCH/byteclub/multiqc/02_bowtie
ln -s -f /work/01063/abattenh/projects/byteclub/multiqc/fastqc
rsync -avrP /work/01063/abattenh/projects/byteclub/multiqc/bowtie2/ bowtie2/ |
...
- <prefix>.flagstat.txt - output from running samtools flagstat
- <prefix>.idxstats.txt - output from running samtools idxstats
- <prefix>.dupinfo.txt - output from running Picard MarkDuplicates
Note that output from samtools flagstat and samtools idxstats will only be recognized by MultiQC if the file names include the words flagstat and idxstats. Fortunately, Anna's script created files with those names!
...
Expand | ||||||
---|---|---|---|---|---|---|
| ||||||
To catch up, just use Anna's pre-made files:
|
Now run multiqc again:
Code Block | ||
---|---|---|
| ||
cd $SCRATCH/byteclub/multiqc/02_bowtie
rm -rf mqc_report*
multiqc . |
...
Code Block | ||
---|---|---|
| ||
mkdir -p $SCRATCH/byteclub/multiqc/02_bowtie/for_multiqc
cd $SCRATCH/byteclub/multiqc/02_bowtie/for_multiqc
for f in ../bowtie2/*.dupinfo.txt; do
bn=`basename $f`
pfx=${bn%%.dupinfo.txt}
echo "$f - $pfx"
cat $f | sed 's/[.]sort//g' > ${pfx}.dupmetrics.txt
done |
Your $SCRATCH/byteclub/multiqc/02_bowtie/for_multiqc directory should have 2 files:
- brain_50k_nuclei.fixed.dupmetrics.txt
- brain_5k_nuclei.fixed.dupmetrics.txt
The final piece of the puzzle is to tell MultiQC to ignore the original <prefix>.dupinfo.txt files by modifying the multiqc_config.yaml file, adding a fn_ignore_files list entry.
...
Expand | |||||
---|---|---|---|---|---|
| |||||
To catch up, just use Anna's pre-made files:
|
After making this config file modification, you can now run multiqc again:
Code Block | ||
---|---|---|
| ||
cd $SCRATCH/byteclub/multiqc/02_bowtie; rm -rf mqc_report*; multiqc . |
...
So a bit of command line reformatting is needed to produce files for MultiQC, which we will save in our for_multiqc directory.
Code Block | ||
---|---|---|
| ||
cd $SCRATCH/byteclub/multiqc/02_bowtie/for_multiqc
for f in ../bowtie2/*.insertsz.txt; do
bn=`basename $f`
pfx=${bn%%.insertsz.txt}
echo "$f - $pfx"
tail -n +2 $f | grep -v -P '^-' | cut -f 1,3 > ${pfx}.bowtie2_isizes.tsv
done |
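To see what each stage of that pipeline does, here is a small sketch run on made-up data. The input layout is an assumption for illustration only (one header line, then three tab-separated columns with the size in column 1 and the count in column 3); the real .insertsz.txt files may differ. It uses plain `grep -v '^-'`, which matches the same lines as the `-P` form above.

```shell
# Work in a throwaway directory so nothing in the workshop tree is touched.
tmpdir=$(mktemp -d)

# Made-up input: header line, then size / placeholder / count columns.
printf 'isize\tpct\tcount\n-150\t0.1\t3\n100\t0.5\t40\n200\t0.4\t25\n' \
    > "$tmpdir/demo.insertsz.txt"

# tail -n +2    : drop the header line
# grep -v '^-'  : drop rows whose first field is negative
# cut -f 1,3    : keep only the size and count columns
tail -n +2 "$tmpdir/demo.insertsz.txt" | grep -v '^-' | cut -f 1,3 \
    > "$tmpdir/demo.bowtie2_isizes.tsv"

cat "$tmpdir/demo.bowtie2_isizes.tsv"
```

The result is a two-column, tab-separated file with no header, which is the shape the linegraph custom data section expects for each sample.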
Next we edit the multiqc_config.yaml configuration file to add appropriate custom data sections:
Code Block | ||
---|---|---|
| ||
# Titles to use for the report.
title: "ATAC-Seq QC Reports"
subtitle: null
intro_text: "MultiQC reports for Igor's ATAC-Seq proof-of-concept project."
report_header_info:
    - Sequenced by: 'GSAF'
    - Job: 'JA17277'
    - Run: 'SA17121'
    - Setup: '2x150'
# Change the output filenames
output_fn_name: mqc_report.html
data_dir_name: mqc_report_data
# Ignore these files / directories / paths when searching for reports
fn_ignore_files:
    - '*.dupinfo.txt'
# Modules that should come at the top of the report
top_modules:
    - 'generalstats'
    - 'fastqc'
    - 'samtools'
    - 'picard'
# --------------------------------
# Custom data
# --------------------------------
custom_data:
    bowtie2_isize:
        id: 'bowtie2_isize_section'
        section_name: 'Bowtie2 insert size'
        description: 'distribution for alignments (bowtie2 --local -X2000 --no-mixed --no-discordant)'
        file_format: 'tsv'
        plot_type: 'linegraph'
        pconfig:
            id: 'bowtie2_isize_plot'
            title: 'Insert sizes for proper pairs'
            xlab: 'Insert size'
            ylab: 'Count'
sp:
    bowtie2_isize_section:
        fn: '*.bowtie2_isizes.tsv'
|
Expand | |||||
---|---|---|---|---|---|
| |||||
To catch up, just use Anna's pre-made files:
|
Then the usual...
Code Block | ||
---|---|---|
| ||
cd $SCRATCH/byteclub/multiqc/02_bowtie; rm -rf mqc_report*; multiqc . |
This results in a report that includes our insert size distribution data in the custom data section we configured, a new section called Bowtie2 insert size: http://web.corral.tacc.utexas.edu/iyer/byteclub/multiqc/06_custom_linegraph.mqc_report.html.
What's cool is that this "sawtooth" insert size distribution occurs because of the way transposons insert into the major groove of DNA at regular intervals. So this graph shows Igor that his ATAC-seq proof-of-concept experiment worked!
Adding custom bargraphs
Here we'll create two custom bargraph reports, one for bowtie2 mapping qualities and a second showing genome coverage of the alignments.
...
Code Block | ||
---|---|---|
| ||
cd $SCRATCH/byteclub/multiqc/02_bowtie
cp /work/01063/abattenh/projects/byteclub/multiqc/07_custom_bargraph/for_multiqc/*mapq* for_multiqc/
cp /work/01063/abattenh/projects/byteclub/multiqc/07_custom_bargraph/for_multiqc/*genomecov* for_multiqc/ |
...
There is just one data file for genome coverage. Unlike the per-sample files, it has a header, with an arbitrary tag for the dataset names in the 1st column, followed by the category names in subsequent columns. Each remaining row holds one dataset name and its count for each category. (I've re-formatted the data below for readability, but remember that all .tsv file data must be tab-separated.)
Code Block | ||
---|---|---|
| ||
sample      none        1-2        3-10       11-50     51+
5k_nuclei   2140984435  237947623  308665107  38729079  4545530
50k_nuclei  2175228345  351105871  186361275  17356704  819579 |
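Since the table above is shown space-aligned for readability, it has to be converted to real tabs before MultiQC can use it. A minimal sketch of one way to do that with awk follows, assuming a layout of one header row followed by one row per dataset. The input file name genomecov_aligned.txt is hypothetical, and the approach only works when no dataset or category name contains a space.

```shell
# Work in a throwaway directory.
tmpdir=$(mktemp -d)

# Space-aligned, human-readable version of the genome coverage table.
cat > "$tmpdir/genomecov_aligned.txt" <<'EOF'
sample      none        1-2        3-10       11-50     51+
5k_nuclei   2140984435  237947623  308665107  38729079  4545530
50k_nuclei  2175228345  351105871  186361275  17356704  819579
EOF

# Assigning $1 to itself makes awk rebuild each line using OFS (a tab)
# between fields, collapsing the runs of spaces.
awk -v OFS='\t' '{ $1 = $1; print }' "$tmpdir/genomecov_aligned.txt" \
    > "$tmpdir/combined_genomecov.tsv"

# Every row should now have 6 tab-separated fields.
awk -F'\t' '{ print NF }' "$tmpdir/combined_genomecov.tsv"
```

If a category name needed an embedded space, this shortcut would split it into two columns, so in that case it is safer to write the tabs explicitly (for example with printf).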
Here we edit the multiqc_config.yaml configuration file to add appropriate custom data sections:
...
Expand | |||||
---|---|---|---|---|---|
| |||||
To catch up, just use Anna's pre-made files:
|
Then the usual...
Code Block | ||
---|---|---|
| ||
cd $SCRATCH/byteclub/multiqc/02_bowtie; rm -rf mqc_report*; multiqc . |
This results in a report that includes our new Mapping quality and Genome coverage sections, which should look like this: http://web.corral.tacc.utexas.edu/iyer/byteclub/multiqc/07_custom_bargraph.mqc_report.html.
Making MultiQC run faster and be less confused
By default, MultiQC scans all files in the analysis directory you specify. This can take quite a while for complex directory hierarchies with many files that will not be used by MultiQC.
Additionally, MultiQC can get confused when the same (or similar) data is found in different files, or in different directories.
To address these issues, it is a good practice to copy everything you want MultiQC to process into a single directory, then either specify just that directory on the multiqc command line (e.g. multiqc for_multiqc), or exclude other directories in the multiqc_config.yaml file.
...
Code Block | ||
---|---|---|
| ||
cd $SCRATCH/byteclub/multiqc/02_bowtie/for_multiqc
ln -s -f ../fastqc
cp -p ../bowtie2/*.flagstat.txt .
cp -p ../bowtie2/*.idxstats.txt . |
...
Code Block |
---|
brain_50k_nuclei.bowtie2_isizes.tsv
brain_50k_nuclei.dupmetrics.txt
brain_50k_nuclei.flagstat.txt
brain_50k_nuclei.idxstats.txt
brain_50k_nuclei.mapq_histogram.tsv
brain_5k_nuclei.bowtie2_isizes.tsv
brain_5k_nuclei.dupmetrics.txt
brain_5k_nuclei.flagstat.txt
brain_5k_nuclei.idxstats.txt
brain_5k_nuclei.mapq_histogram.tsv
combined_genomecov.tsv
fastqc |
Expand | |||||
---|---|---|---|---|---|
| |||||
To catch up, just use Anna's pre-made files:
|
Run MultiQC again, but this time just point it at the for_multiqc directory:
Code Block | ||
---|---|---|
| ||
cd $SCRATCH/byteclub/multiqc/02_bowtie
rm -rf mqc_report*
multiqc for_multiqc |
Alternatively, you could exclude the bowtie2 directory entirely via a fn_ignore_dirs section list item in multiqc_config.yaml, like this:
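A minimal sketch of adding such an entry from the command line is below. It writes to a throwaway copy of the config rather than the real one. The one-line starting config and the temporary path are illustrative assumptions; in the workshop the target would be the multiqc_config.yaml in $SCRATCH/byteclub/multiqc/02_bowtie.

```shell
# Throwaway stand-in for the real multiqc_config.yaml.
tmpdir=$(mktemp -d)
cfg="$tmpdir/multiqc_config.yaml"
printf 'title: "ATAC-Seq QC Reports"\n' > "$cfg"

# Append a fn_ignore_dirs list naming the directory to skip.
# Indentation uses spaces only, as YAML requires.
cat >> "$cfg" <<'EOF'
# Ignore these directories when searching for reports
fn_ignore_dirs:
    - 'bowtie2'
EOF

cat "$cfg"
```

With this entry in place, running `multiqc .` from the 02_bowtie directory should skip the bowtie2 directory while still picking up for_multiqc.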
References
Main MultiQC links
...
Below are descriptions of two projects I've assisted with lately using MultiQC to help pull together visualizations assessing experiment quality.
...
I recommend using Chrome to view MultiQC reports.
...
- The example MultiQC reports below were generated by running the multiqc binary on a command line.
- After inspecting them locally (by just opening them as files in a web browser), they were copied to a web-accessible location to share with others. Here, that location is the Iyer Lab's web-accessible directory on corral.
Igor Ponomarev ATAC-seq data
...