...
- aln.<prefix>.log – Log file of the entire alignment process.
  - Check the tail of this file to make sure the alignment was successful.
- <prefix>.sort.dup.bam – Sorted, duplicate-marked alignment file.
- <prefix>.sort.dup.bam.bai – Index for the sorted, duplicate-marked alignment file
- <prefix>.samstats.txt – Summary alignment statistics from Anna's stats script
Verifying alignment success
The alignment log will have an "I ran successfully" message at the end if all went well. If there was an error, the important information should also be at the end of the log file. So you can use tail to check the status of an alignment; for example:
```
tail aln.yeast_chip.log
```
This will show something like this:
```
..samstats file 'yeast_chip.samstats.txt' exists Thu May 28 16:36:01 CDT 2015
..samstats file file 'yeast_chip.samstats.txt' size ok Thu May 28 16:36:01 CDT 2015
---------------------------------------------------------
Cleaning up files...
---------------------------------------------------------
ckRes 0 cleanup
---------------------------------------------------------
All bwa alignment tasks completed successfully!
Thu May 28 16:36:01 CDT 2015
---------------------------------------------------------
```
Notice that success message: "All bwa alignment tasks completed successfully!". It should only appear once in any successful alignment log.
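To make the same check from a script, something like the sketch below could work. The function name alignment_ok is made up for illustration; it assumes the success message text shown above and succeeds only when that message appears on exactly one line of the log.

```shell
# Hypothetical helper (not part of the alignment pipeline):
# succeeds only if the success message appears on exactly one line of the log.
alignment_ok() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # there are none (or the file is unreadable), so fall back to 0.
    count=$(grep -c 'completed successfully' "$1" 2>/dev/null) || count=0
    [ "$count" -eq 1 ]
}

# Example use:
#   alignment_ok aln.yeast_chip.log && echo "alignment finished OK"
```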
When multiple alignment commands are run in parallel, it is important to check them all. You can do this by using grep to look for part of the unique success message.
For example, suppose I have run 6 alignments and have these 6 log files:
```
aln.delswr1_htz1_tap1t0.log aln.delswr1_htz1_tap1t30.log aln.wt_htz1_tap1t15.log
aln.delswr1_htz1_tap1t15.log aln.wt_htz1_tap1t0.log aln.wt_htz1_tap1t30.log
```
I can check that all 6 completed with this command:
```
grep 'completed successfully' aln.*.log | wc -l
```
If this command returns 6, I'm done. But what if it doesn't? If you use grep -v (which prints lines that don't contain the pattern), you'll get every line in every log file except the success message line, which is not what you want at all.
You could tail the log files one by one to see which one(s) don't have the message, but you can also use a special grep option to do this work:
```
grep -L 'completed successfully' aln.*.log
```
The -L option tells grep to print only the names of files that don't contain the pattern. Perfect!
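If some log files do turn up, a small loop can show the end of each one so you can see what went wrong. This is only a sketch: the function name show_failed_logs and the choice of 5 lines are assumptions, not part of the alignment pipeline.

```shell
# Hypothetical helper: for every log lacking the success message,
# print its name followed by its last few lines.
show_failed_logs() {
    for log in "$@"; do
        if ! grep -q 'completed successfully' "$log"; then
            echo "== $log =="
            tail -n 5 "$log"
        fi
    done
}

# Example use:
#   show_failed_logs aln.*.log
```

No output at all means every log contained the success message.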
Checking alignment statistics
The <prefix>.samstats.txt statistics file produced by the alignment pipeline has a lot of good information in one place. If you use cat or more to view it, you'll see something like this:
```
-----------------------------------------------
Aligner: bwa
Total sequences: 1184360
Total mapped: 547664 (46.2 %)
Total unmapped: 636696 (53.8 %)
Primary: 547664 (100.0 %)
Secondary:
Duplicates: 324280 (59.2 %)
Fwd strand: 272898 (49.8 %)
Rev strand: 274766 (50.2 %)
Multi hit: 18688 (3.4 %)
Soft clip: 222451 (40.6 %)
All match: 319429 (58.3 %)
Indels: 6697 (1.2 %)
Spliced:
-----------------------------------------------
Total PE seqs: 1184360
PE seqs mapped: 547664 (46.2 %)
Num PE pairs: 592180
F5 1st end mapped: 300477 (50.7 %)
F3 2nd end mapped: 247187 (41.7 %)
PE pairs mapped: 241180 (40.7 %)
PE proper pairs: 236557 (39.9 %)
-----------------------------------------------
Insert size stats for: yeast_chip
Number of pairs: 236557 (proper)
Number of insert sizes: 212
Mean [-/+ 1 SD]: 215 [153 277] (sd 62)
Mode [Fivenum]: 223 [105 210 220 229 321]
-----------------------------------------------
```
Since this was a paired-end alignment, there is paired-end specific information reported, including insert size statistics: mean and standard deviation, mode (the most common insert size value), and fivenum (min, q1, median, q3, and max insert sizes).
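Note that the percentages in this report do not all appear to share the same denominator. Judging from the numbers above, "Total mapped" is relative to total sequences, "Duplicates" is relative to mapped sequences, and "F5 1st end mapped" is relative to the number of pairs. The sketch below re-derives those three values with awk. The pct helper is hypothetical, and the denominators are inferred from this example output rather than taken from the stats script itself.

```shell
# Hypothetical helper: print numerator/denominator as a percentage, one decimal place.
pct() {
    awk -v n="$1" -v d="$2" 'BEGIN { printf "%.1f\n", 100 * n / d }'
}

pct 547664 1184360   # Total mapped / Total sequences   -> 46.2
pct 324280 547664    # Duplicates / Total mapped        -> 59.2
pct 300477 592180    # F5 1st end mapped / Num PE pairs -> 50.7
```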
A quick way to check alignment stats if you have run multiple alignments is again to use grep. For example, for the 6 alignment files shown earlier, running this:
```
grep 'Total map' *samstats.txt
```
will produce output like this:
```
delswr1_htz1_tap1t0.samstats.txt: Total mapped: 32761761 (86.8 %)
delswr1_htz1_tap1t15.samstats.txt: Total mapped: 33699464 (89.2 %)
delswr1_htz1_tap1t30.samstats.txt: Total mapped: 28441655 (87.6 %)
wt_htz1_tap1t0.samstats.txt: Total mapped: 28454847 (89.5 %)
wt_htz1_tap1t15.samstats.txt: Total mapped: 33245627 (90.9 %)
wt_htz1_tap1t30.samstats.txt: Total mapped: 32567026 (90.7 %)
```
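If you want these numbers in a form that is easier to sort or paste into a spreadsheet, a short awk wrapper can turn each "Total mapped" line into tab-separated columns. This is a sketch, not part of the pipeline: the function name mapping_summary is made up, and it assumes the "Total mapped: <count> (<percent> %)" layout shown above and file names ending in .samstats.txt.

```shell
# Hypothetical helper: print "<sample> <mapped count> <percent>" (tab-separated)
# for each samstats file given on the command line.
mapping_summary() {
    for f in "$@"; do
        awk -v name="$(basename "$f" .samstats.txt)" '
            /Total mapped:/ {
                line = $0
                sub(/.*Total mapped:[ \t]*/, "", line)   # drop the label
                split(line, parts, /[ \t()%]+/)           # parts[1]=count, parts[2]=percent
                printf "%s\t%s\t%s\n", name, parts[1], parts[2]
            }' "$f"
    done
}

# Example use, lowest mapping rate first:
#   mapping_summary *.samstats.txt | sort -t "$(printf '\t')" -k3,3n
```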
TACC batch system considerations
...