...

Code Block
languagebash
# Shorten the sample prefix some more...
for path in $( find /stor/work/CCBB_Workshops_1/bash_scripting/fastq -name "*.fastq.gz" ); do
  file=`basename $path`
  pfx=${file%%_R1_001.fastq.gz}
  pfx=$( echo $pfx | perl -pe '~s/_S\d+//' | perl -pe '~s/L00/L/')
  echo "$pfx - $file"
done
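
The same prefix clean-up can be sketched without perl, using only shell parameter expansion and sed. This is a minimal sketch, not the course's own solution: the function name `sample_prefix` and the example file name `WT-1_S5_L002_R1_001.fastq.gz` are assumptions made for illustration.

```shell
# sample_prefix is a hypothetical helper name (not from the original material).
# It prints a short sample prefix for one FASTQ path.
sample_prefix() {
  file=$(basename "$1")
  # Strip the fixed read/extension suffix from the end of the name
  pfx=${file%%_R1_001.fastq.gz}
  # Drop the _S<number> sample-sheet index, then shorten L00<n> to L<n>
  printf '%s\n' "$pfx" | sed -e 's/_S[0-9][0-9]*//' -e 's/L00/L/'
}

# Assumed example file name, for illustration only
sample_prefix /some/dir/WT-1_S5_L002_R1_001.fastq.gz
```

Wrapping the logic in a function keeps the `for` loop body short and makes it easy to try the clean-up on a single name.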

Now that we have nice sample names, count the number of sequences in each file. To un-compress the gzip'd files "on the fly" (without creating another file), we use zcat (like cat but for gzip'd files) and count the lines, e.g.:

Code Block
languagebash
zcat <path> | wc -l

But FASTQ files have 4 lines for every sequence read. So to count the sequences properly we need to divide this number by 4.

Code Block
languagebash
# Clunky way to do arithmetic in bash -- but bash only does integer arithmetic!
echo $(( `zcat <gzipped fq file> | wc -l` / 4 ))

# Better way using awk
zcat <gzipped fq file> | wc -l | awk '{print $1/4}'
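
To try the line-count-divided-by-4 idea end to end, here is a small self-contained sketch. The helper name `count_reads` and the two-read demo file are assumptions made for illustration; `gzip -dc` is used in place of `zcat` because it behaves the same way on gzip'd input.

```shell
# count_reads is a hypothetical helper name (not from the original material).
# It prints the number of reads in one gzip'd FASTQ file.
count_reads() {
  # Decompress to stdout; awk's NR is the total line count, 4 lines per read
  gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Build a tiny two-read FASTQ file to try it on
tmpdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' | gzip > "$tmpdir/demo.fastq.gz"
count_reads "$tmpdir/demo.fastq.gz"
rm -rf "$tmpdir"
```

Letting awk count the lines itself removes the separate `wc -l` step; the result is the same as the `wc -l | awk` pipeline.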

...

Code Block
languagebash
cut -f 2 fastq_stats.txt | perl -pe '~s/_L\d+//' | sort | uniq -c

# produces this output:
      2 WT-1
      2 WT-2

What if we want to know the total sequences for each sample rather than for each file? Get a list of all unique sample names, then, for each sample, total the reads from only that sample's lines in fastq_stats.txt:
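
As an alternative to looping over sample names, the per-sample totals can be sketched in a single awk pass. This is only a sketch under stated assumptions: the `cut -f 2` command confirms that column 2 of fastq_stats.txt holds the prefix, but the tab-separated three-column layout (file name, prefix, read count) and the helper name `sample_totals` are assumptions.

```shell
# sample_totals is a hypothetical helper name. ASSUMED input layout
# (tab-separated): file name, sample prefix such as WT-1_L2, read count.
sample_totals() {
  awk -F'\t' '{
    sample = $2
    sub(/_L[0-9]+$/, "", sample)   # drop the lane suffix: WT-1_L2 -> WT-1
    total[sample] += $3            # accumulate reads per sample
  }
  END { for (s in total) print s "\t" total[s] }' "$1" | sort
}

# Made-up demo data, for illustration only
stats=$(mktemp)
printf 'a.fastq.gz\tWT-1_L2\t100\nb.fastq.gz\tWT-1_L3\t150\n' > "$stats"
printf 'c.fastq.gz\tWT-2_L2\t200\nd.fastq.gz\tWT-2_L3\t50\n' >> "$stats"
sample_totals "$stats"
rm -f "$stats"
```

The awk array keyed by sample name replaces the "unique names, then filter" steps; the final `sort` is there only because awk does not guarantee the order of `for (s in total)`.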

...