Content Comparison

...

Code Block

language	bash
title	Start an idev session

idev -m 180 -N 1 -A OTH21164 -r CoreNGScore-ngs-class-0603  # Tuesday
idev -m 180 #-N or1 -A TRA23004 OTH21164 -r core-ngs-class-0604  # Wednesday

# -or-or, without a reservation:
idev -m 120 -N 1 -A OTH21164 -p development   # or -A TRA23004

Data staging

Set ourselves up to process some yeast data data in $SCRATCH, using some of best practices for organizing our workflow.

Code Block

language	bash
title	Set up directory for working with FASTQs

# Create a $SCRATCH area to work on data for this course,
# with a sub-directory for pre-processing raw fastq files
mkdir -p $SCRATCH/core_ngs/fastq_prep

# Make symbolic links to the original yeast data:
cd $SCRATCH/core_ngs/fastq_prep
ln -s -f $CORENGS/yeast_stuff/Sample_Yeast_L005_R1.cat.fastq.gz
ln -s -f $CORENGS/yeast_stuff/Sample_Yeast_L005_R2.cat.fastq.gz

# or
ln -sf /work/projects/BioITeam/projects/courses/Core_NGS_Tools/yeast_stuff/ \
  Sample_Yeast_L005_R1.cat.fastq.gz
ln -sf /work/projects/BioITeam/projects/courses/Core_NGS_Tools/yeast_stuff/ \
  Sample_Yeast_L005_R2.cat.fastq.gz

...

Line 3 always starts with '+' (it can optionally include a sequence descriptionIt can be populated by downstream tools)

Line 4 is a string of Ascii-encoded base quality scores, one character per base in the sequence.

...

Code Block

language	bash
title	gzip, gunzip exercise

# if the $CORENGS environment variable is not defined
export CORENGS=/work/projects/BioITeam/projects/courses/Core_NGS_Tools

# make sure you're in your $SCRATCH/core_ngs/fastq_prep directory
cd $SCRATCH/core_ngs/fastq_prep

# Copy over a small, uncompressed fastq file
cp $CORENGS/misc/small.fq .

# How many lines does it have?
wc -l small.fq

# check the size, then compress it in-place
ls -lh small*
gzip small.fq

# check the compressed file size
ls -lh small*

# uncompress it again
gunzip small.fq.gz
ls -lh small*

Warning

Both gzip and gunzip are extremely I/O intensive when run on large files.

While TACC



# create a compressed file and also leave the original file
# gzip's -c option says write compressed output to the console (standard output)
gzip -c small.fq > small.fq.gz
ls -lh small*

Warning

Both gzip and gunzip are extremely I/O intensive when run on large files.

While TACC has tremendous compute resources and its specialized parallel file system is great, it has its limitations. It is not difficult to overwhelm the TACC file system if you gzip or gunzip more than a few files at a time – as few as 5-6!

See https://docs.tacc.utexas.edu/tutorials/managingio/ for TACC's guidelines for managing I/O.

The intensity of compression/decompression operations is another reason you should compress your sequencing files once (if they aren't already) then leave them that way.

...

Code Block

language	bash
title	Using the head command

# shows cd $SCRATCH/core_ngs/fastq_prep

# shows 1st 10 lines
head small.fq

# shows the 1stfirst 1002 lines
head -n 2 small.fq
head - might want to2 small.fq

# shows 1st 100 lines -- pipe this to more to see a bit at a time
head -100 small.fq | more

So what if you want to see line numbers on your head (or tail) output? Neither command seems to have an option to do this.

Expand

title	Hint

cat --help | more

Expand

title	Answer

Code Block

language	bash

cat -n small.fq | tail

...

Code Block

language	bash
title	Using the tail command

cd $SCRATCH/core_ngs/fastq_prep

# shows the last 10 lines
tail small.fq

# show the last line
tail -n 1 small.fq
tail -1 small.fq

# shows the last 100 lines -- might want to pipe this to more to see a bit at a time
tail -100 small.fq | more

# shows all the lines starting at line 900 -- better pipe it to a pager!
# cat -n adds line numbers to its output so we canto see where we are in the file
cat -n small.fq | tail -n +900 | more

# shows 155 lines starting at line 900 because we pipe to head -5
cat -15n small.fq | tail -n +900 small.fq | head -155

Read more about head and tail in Displaying file contents.

...

Expand

title	Setup (if needed)

Code Block

language	bash

# Setup (if needed)
export CORENGS=/work/projects/BioITeam/projects/courses/Core_NGS_Tools 
mkdir -p $SCRATCH/core_ngs/fastq_prep
cd $SCRATCH/core_ngs/fastq_prep
cp $CORENGS/misc/small.fq .

...

You can also pipe the output of zcat or gunzip -c to wc -l to count lines in your compressed FASTQ file.

ExerciseExercises:

How many lines are in the Sample_Yeast_L005_R1.cat.fastq.gz file?
How many sequences is this?

Expand

title	Hint

Code Block

language	bash

zcat Sample_Yeast_L005_R1.cat.fastq.gz | wc -l

...

Code Block

language	bash
title	Counting sequences in a FASTQ file

cd $SCRATCH/core_ngs/fastq_prep
zcat Sample_Yeast_L005_R1.cat.fastq.gz | echo "File has $((`wc -l` / 4)) lines"

Whew!

Warning

title	bash arithmetic is integer valued only

Note that arithmetic in the bash shell is integer valued only, so don't use it for anything that requires decimal places!

(Read more about Arithemetic in bash)

A better way to do math

Well, doing math in bash is pretty awful – there has to be something better. There is! It's called awk, which is a powerful scripting language that is easily invoked from the command line.

...

Expand

title	Setup (if needed)

Code Block

language	bash

# Setup (if needed)
export CORENGS=/work/projects/BioITeam/projects/courses/Core_NGS_Tools 
mkdir -p $SCRATCH/core_ngs/fastq_prep
cd $SCRATCH/core_ngs/fastq_prep
ln -sf $CORENGS/yeast_stuff/Sample_Yeast_L005_R1.cat.fastq.gz
ln -sf $CORENGS/yeast_stuff/Sample_Yeast_L005_R2.cat.fastq.gz

...

Note that $1 means something different in awk – the 1st whitespace-delimited input field – than it does in bash, where it represents the 1st argument to a script or function (technically, the 1 environment variable).

This is an example of where a metacharacter- the dollar sign ( $ ) here – has a different meaning for two different programs. (Read more about Literal characters and metacharacters)

...

Code Block

language	bash
title	For loop to count sequences in multiple FASTQs

cd $SCRATCH/core_ngs/fastq_prep
for fname in *.gz; do
  echo "Processing $fname"
  echo "..$fname has `zcat$(zcat $fname | wc -l | awk '{print $1 / 4}'`) sequencesreads"
done

Each time through the for loop, the next item in the argument list (here the files matching the wildcard glob *.gz) is assigned to the for loop's formal argument (here the variable fname). The actual filename is then referenced as$fname inside the loop. (Read more about Bash control flow)

Note that the $( ... ) syntax is equivalent to backticks ` ... `
So echo $(date) is the same as echo `date`

Version	Old Version 33	New Version Current
Changes made by	Anna Battenhouse	Anna Battenhouse
Saved on	Jun 05, 2024	Jun 04, 2025

Versions Compared

Key

Data staging

A better way to do math