...

Code Block
languagebash
titleSet up directory for working with FASTQs
# Create a $SCRATCH area to work on data for this course,
# with a sub-directory for pre-processing raw fastq files
mkdir -p $SCRATCH/core_ngs/fastq_prep

# Make symbolic links to the original yeast data:
cd $SCRATCH/core_ngs/fastq_prep
ln -s -f $CORENGS/yeast_stuff/Sample_Yeast_L005_R1.cat.fastq.gz
ln -s -f $CORENGS/yeast_stuff/Sample_Yeast_L005_R2.cat.fastq.gz

# or
ln -sf /work/projects/BioITeam/projects/courses/Core_NGS_Tools/yeast_stuff/Sample_Yeast_L005_R1.cat.fastq.gz
ln -sf /work/projects/BioITeam/projects/courses/Core_NGS_Tools/yeast_stuff/Sample_Yeast_L005_R2.cat.fastq.gz

...

Code Block
languagebash
titlegzip, gunzip exercise
# if the $CORENGS environment variable is not defined
export CORENGS=/work/projects/BioITeam/projects/courses/Core_NGS_Tools

# make sure you're in your $SCRATCH/core_ngs/fastq_prep directory
cd $SCRATCH/core_ngs/fastq_prep

# Copy over a small, uncompressed fastq file
cp $CORENGS/misc/small.fq .

# How many lines does it have?
wc -l small.fq

# check the size, then compress it in-place
ls -lh small*
gzip small.fq

# check the compressed file size
ls -lh small*

# uncompress it again
gunzip small.fq.gz
ls -lh small*

# create a compressed file and also leave the original file
# gzip's -c option says write compressed output to the console (standard output)
gzip -c small.fq > small.fq.gz
ls -lh small*
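
To check that compressing and uncompressing leaves the data unchanged, the round trip can be tried on a throwaway file. This is a minimal sketch that does not touch the course data; the file name demo.fq and the temporary directory are made up for illustration.

```shell
# Work in a temporary directory so nothing in fastq_prep is modified
tmpdir=$(mktemp -d)
cd "$tmpdir"

# Build a small 4-line FASTQ-like file (hypothetical demo data)
printf '@read1\nACGT\n+\nIIII\n' > demo.fq

# -c writes compressed data to standard output, so demo.fq is left in place
gzip -c demo.fq > demo.fq.gz
ls -lh demo*

# Uncompress to standard output and compare with the original;
# cmp prints nothing and returns success when the two inputs are identical
gunzip -c demo.fq.gz | cmp - demo.fq && echo "round trip OK"
```

On a file this small the compressed copy can be larger than the original, because the gzip header overhead outweighs the savings; real FASTQ files compress much better.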

...

Code Block
languagebash
titleUsing the head command
cd $SCRATCH/core_ngs/fastq_prep

# shows 1st 10 lines
head small.fq

# shows the first 2 lines
head -n 2 small.fq
head -2 small.fq

# shows 1st 100 lines -- might want to pipe this to more to see a bit at a time
head -100 small.fq | more

So what if you want to see line numbers on your head (or tail) output? Neither command seems to have an option to do this.

Expand
titleHint

cat --help | more


Expand
titleAnswer


Code Block
languagebash
cat -n small.fq | tail


...

Code Block
languagebash
titleUsing the tail command
cd $SCRATCH/core_ngs/fastq_prep

# shows the last 10 lines
tail small.fq

# show the last line
tail -n 1 small.fq
tail -1 small.fq

# shows the last 100 lines -- might want to pipe this to more to see a bit at a time
tail -100 small.fq | more

# shows all the lines starting at line 900 -- better pipe it to a pager!
# cat -n adds line numbers to its output so we can see where we are in the file
cat -n small.fq | tail -n +900 | more

# shows 5 lines starting at line 900 because we pipe to head -5
cat -n small.fq | tail -n +900 | head -5
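
The tail -n +N | head -M pattern works on any text stream, not just small.fq. The sketch below uses seq to generate numbered stand-in lines and compares the result with an equivalent sed command; the variable names start, count and end are only for illustration.

```shell
# Line range to extract: 5 lines starting at line 900
start=900
count=5

# seq 1 1000 prints the numbers 1..1000, one per line, as stand-in data.
# tail -n +900 starts output at line 900; head -n 5 keeps the first 5 of those
range_tail=$(seq 1 1000 | tail -n +$start | head -n $count)
echo "$range_tail"

# Same range with sed: -n suppresses default output, "900,904p" prints only that range
end=$((start + count - 1))
range_sed=$(seq 1 1000 | sed -n "${start},${end}p")
echo "$range_sed"
```

Both commands should print the same five lines. The tail/head form is often easier to remember, while the sed form states the start and end lines explicitly.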

...

Expand
titleSetup (if needed)


Code Block
languagebash
# Setup (if needed)
export CORENGS=/work/projects/BioITeam/projects/courses/Core_NGS_Tools 
mkdir -p $SCRATCH/core_ngs/fastq_prep
cd $SCRATCH/core_ngs/fastq_prep
cp $CORENGS/misc/small.fq .


...

You can also pipe the output of zcat or gunzip -c to wc -l to count lines in your compressed FASTQ file.

Exercises:

  • How many lines are in the Sample_Yeast_L005_R1.cat.fastq.gz file?
  • How many sequences is this?
Expand
titleHint


Code Block
languagebash
zcat Sample_Yeast_L005_R1.cat.fastq.gz | wc -l
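
Each FASTQ record normally occupies 4 lines, so the number of sequences is the line count divided by 4. A minimal sketch on a tiny made-up compressed file follows (tiny.fastq.gz is a hypothetical name, not course data). It uses gunzip -c, which the text above mentions as an alternative to zcat.

```shell
tmpdir=$(mktemp -d)

# Write 3 made-up FASTQ records (4 lines each) and compress them
for i in 1 2 3; do
  printf '@read%s\nACGTACGT\n+\nIIIIIIII\n' "$i"
done | gzip -c > "$tmpdir/tiny.fastq.gz"

# Count lines in the compressed file without writing an uncompressed copy to disk
nlines=$(gunzip -c "$tmpdir/tiny.fastq.gz" | wc -l)

# Bash integer arithmetic: 4 lines per record
nseqs=$((nlines / 4))
echo "$nlines lines, $nseqs sequences"
```

The same two steps apply to the yeast file; only the file name changes.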


...

Expand
titleSetup (if needed)


Code Block
languagebash
# Setup (if needed)
export CORENGS=/work/projects/BioITeam/projects/courses/Core_NGS_Tools 
mkdir -p $SCRATCH/core_ngs/fastq_prep
cd $SCRATCH/core_ngs/fastq_prep
ln -sf $CORENGS/yeast_stuff/Sample_Yeast_L005_R1.cat.fastq.gz
ln -sf $CORENGS/yeast_stuff/Sample_Yeast_L005_R2.cat.fastq.gz


...

Code Block
languagebash
titleFor loop to count sequences in multiple FASTQs
cd $SCRATCH/core_ngs/fastq_prep
for fname in *.gz; do
  echo "Processing $fname"
  echo "..$fname has $(zcat "$fname" | wc -l | awk '{print $1 / 4}') sequences"
done

Each time through the for loop, the next item in the argument list (here the files matching the wildcard glob *.gz) is assigned to the for loop's formal argument (here the variable fname). The actual filename is then referenced as $fname inside the loop. (Read more about Bash control flow)
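
The assignment of each list item to the loop variable can be watched directly with a few empty placeholder files. This is a sketch under stated assumptions: the names a.gz, b.gz and notes.txt, and the variables f, name, seen and word, are invented for the example.

```shell
tmpdir=$(mktemp -d)
touch "$tmpdir/a.gz" "$tmpdir/b.gz" "$tmpdir/notes.txt"

# The glob expands to the matching file names before the loop starts;
# each pass assigns the next name to the loop variable f
seen=""
for f in "$tmpdir"/*.gz; do
  name=$(basename "$f")
  echo "loop variable is now: $name"
  seen="$seen $name"
done

# The argument list can also be written out explicitly
for word in one two three; do
  echo "word=$word"
done
```

notes.txt is never visited because it does not match *.gz, which is why a loop like this only touches the compressed FASTQ files in a directory.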

Note that the $( ... ) syntax is equivalent to backticks ` ... `, so echo $(date) is the same as echo `date`.
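
One practical difference between the two forms is that $( ... ) nests without extra escaping, while nested backticks need backslashes. The short sketch below uses only standard commands; the path /tmp/demo/reads.fastq.gz is a made-up string and no such file needs to exist.

```shell
path=/tmp/demo/reads.fastq.gz

# Both forms capture the command's standard output
new_style=$(basename "$path")
old_style=`basename "$path"`
echo "$new_style $old_style"

# Nesting: get the name of the parent directory
nested_new=$(basename "$(dirname "$path")")
# With backticks, the inner pair must be escaped with backslashes
nested_old=`basename \`dirname "$path"\``
echo "$nested_new $nested_old"
```

This is one reason the $( ... ) form is generally preferred in new scripts.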