Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.


Tip
titleReservations

Use our summer school reservation (core-ngs-class-0605) for today when submitting batch jobs to get higher priority on the ls6 normal queue.

Code Block
languagebash
titleRequest an interactive (idev) node
# Request a 180 minute interactive node on the normal queue using our reservation
idev -m 120 -N 1 -A OTH21164 -r core-ngs-class-0605

# Request a 120 minute idev node on the development queue 
idev -m 120 -N 1 -A OTH21164 -p development


Code Block
languagebash
titleSubmit a batch job
# Using our reservation
sbatch --reseservation=core-ngs-class-0605 <batch_file>.slurm

Note that today's reservation name (core-ngs-class-0605) is different from the TACC allocation/project for this class, which is OTH21164.

...

You should now have a SAM file (yeast_pe.sam) that contains the alignments.

ExerciseExercises:

  • How many lines does the SAM file have?
  • How does this compare to the number of input sequences (R1+R2)?
Expand
titleAnswer

wc -l yeast_pe.sam  reports 1,184,378 lines

The alignment SAM file will contain records for both R1 and R2 reads, so we need to count sequences in both files.

zcat ./fastq/Sample_Yeast_L005_R[12]*gz | wc -l | awk '{print $1/4}' reports 1,184,360 read sequences.

So the SAM file has 18 more lines than the R1+R2 total. These are the header records that appear before any alignment records.

...

Expand
titleAnswers

The SAM file must contain both mapped and un-mapped reads, because there were 1,184,360 R1+R2 reads and the same number of alignment records.

Alignment records occur in the same read-name order as they did in the FASTQ, except that they come in pairs. The R1 read comes 1st, then the corresponding R2. This is called read name ordering.

The command below uses cut to extract the 1st Tabtab-delimited field (-f 1) from the first few alignment records:

Code Block
languagebash
grep -Pv '^@' yeast_pe.sam | head | cut -f 1

Next we'll see how to tell which read is the R1 and which is the R2.

...

Code Block
titleSAMtools suite usage
Program: samtools (Tools for alignments in the SAM format)
Version: 1.20 (using htslib 1.20)

Usage: samtools <command> [options]

Commands:
-- Indexing
   dict           create a sequence dictionary file
   faidx          index/extract FASTA
   fqidx          index/extract FASTQ
   index          index alignment

-- Editing
   calmd          recalculate MD/NM tags and '=' bases
   fixmate        fix mate information
   reheader       replace BAM header
   targetcut      cut fosmid regions (for fosmid pool only)
   addreplacerg   adds or replaces RG tags
   markdup        mark duplicates
   ampliconclip   clip oligos from the end of reads

-- File operations
   collate        shuffle and group alignments by name
   cat            concatenate BAMs
   consensus      produce a consensus Pileup/FASTA/FASTQ
   merge  merge sorted alignments    mpileup   merge sorted alignments
   mpileup        multi-way pileup
   sort           sort alignment file
   split          splits a file by read group
   quickcheck     quickly check if SAM/BAM/CRAM file appears intact
   fastq          converts a BAM to a FASTQ
   fasta          converts a BAM to a FASTA
   import         Converts FASTA or FASTQ files to SAM/BAM/CRAM
   reference      Generates a reference from aligned data
   reset          Reverts aligner changes in reads

-- Statistics
   bedcov         read depth per BED region
   coverage       alignment depth and percent coverage
   depth          compute the depth
   flagstat       simple stats
   idxstats       BAM index stats
   cram-size      list CRAM Content-ID and Data-Series sizes
   phase          phase heterozygotes
   stats          generate stats (former bamcheck)
   ampliconstats  generate amplicon specific stats

-- Viewing
   flags explain  BAM flags
   head           header viewer
   tview text     alignment viewer
   view           SAM<->BAM<->CRAM conversion
   depad          convert padded BAM to unpadded BAM
   samples        list the samples in a set of SAM/BAM/CRAM files

-- Misc
   help [cmd]     display this help message or help for [cmd]
   version        detailed version information

...

The utility samtools index creates an index that has the same name as the input BAM file, with suffix .bai appended. Here's the samtools index usage:

Usage
Code Block
titlesamtools index usage
samtools index usage
Usage: samtools index -M [-bc] [-m INT] <in1.bam> <in2.bam>...
   or: samtools index [-bc] [-m INT] <in.bam> [out.index]
Options:
  -b, --bai            Generate BAI-format index for BAM files [default]
  -c, --csi            Generate CSI-format index for BAM files
  -m, --min-shift INT   Set minimum interval size for CSI indices to 2^INT [14]
  -M                   Interpret all filename arguments as files to be indexed
  -o, --output FILE    Write index to FILE [alternative to <out.index> in args]
  -@, --threads INT    Sets the number of threads [none]

...

More information about the alignment is provided by the samtools idxstats report, which shows how many reads aligned to each contig in your reference.

Note that samtools idxstats must be run on a sorted, indexed BAM file.

...