...
| Code Block |
|---|
| language | bash |
|---|
| title | Set up directory for working with FASTQs |
|---|
|
# Create a $SCRATCH area to work on data for this course,
# with a sub-directory for pre-processing raw fastq files
mkdir -p $SCRATCH/core_ngs/fastq_prep
# Make symbolic links to the original yeast data:
cd $SCRATCH/core_ngs/fastq_prep
ln -s -f $CORENGS/yeast_stuff/Sample_Yeast_L005_R1.cat.fastq.gz
ln -s -f $CORENGS/yeast_stuff/Sample_Yeast_L005_R2.cat.fastq.gz
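# or, using the full path if $CORENGS is not defined
# (each command must be on a single line, with no space inside the path):
ln -sf /work/projects/BioITeam/projects/courses/Core_NGS_Tools/yeast_stuff/Sample_Yeast_L005_R1.cat.fastq.gz
ln -sf /work/projects/BioITeam/projects/courses/Core_NGS_Tools/yeast_stuff/Sample_Yeast_L005_R2.cat.fastq.gz |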
...
| Code Block |
|---|
| language | bash |
|---|
| title | gzip, gunzip exercise |
|---|
|
# if the $CORENGS environment variable is not defined
export CORENGS=/work/projects/BioITeam/projects/courses/Core_NGS_Tools
# make sure you're in your $SCRATCH/core_ngs/fastq_prep directory
cd $SCRATCH/core_ngs/fastq_prep
# Copy over a small, uncompressed fastq file
cp $CORENGS/misc/small.fq .
# How many lines does it have?
wc -l small.fq
# check the size, then compress it in-place
ls -lh small*
gzip small.fq
# check the compressed file size
ls -lh small*
# uncompress it again
gunzip small.fq.gz
ls -lh small*
# create a compressed file and also leave the original file
# gzip's -c option says write compressed output to the console (standard output)
gzip -c small.fq > small.fq.gz
ls -lh small* |
...
| Code Block |
|---|
| language | bash |
|---|
| title | Using the head command |
|---|
|
cd $SCRATCH/core_ngs/fastq_prep
# shows 1st 10 lines
head small.fq
# shows the first 2 lines
head -n 2 small.fq
head -2 small.fq
# shows 1st 100 lines -- might want to pipe this to more to see a bit at a time
head -100 small.fq | more
|
So what if you want to see line numbers on your head (or tail) output? Neither command seems to have an option to do this.
| Expand |
|---|
|
| Code Block |
|---|
| cat -n small.fq | tail |
|
...
| Code Block |
|---|
| language | bash |
|---|
| title | Using the tail command |
|---|
|
cd $SCRATCH/core_ngs/fastq_prep
# shows the last 10 lines
tail small.fq
# show the last line
tail -n 1 small.fq
tail -1 small.fq
# shows the last 100 lines -- might want to pipe this to more to see a bit at a time
tail -100 small.fq | more
# shows all the lines starting at line 900 -- better pipe it to a pager!
# cat -n adds line numbers to its output so we can see where we are in the file
cat -n small.fq | tail -n +900 | more
# shows 5 lines starting at line 900 because we pipe to head -5
cat -n small.fq | tail -n +900 | head -5 |
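The tail -n +N and head combination above can pull out any line range. Here is a minimal sketch on a throwaway file of numbered lines; the temporary file and the variable names tmpfile and slice are made up for this illustration and are not part of the course data.

```bash
# Make a 20-line demo file where each line holds its own line number
tmpfile=$(mktemp)
seq 1 20 > "$tmpfile"

# Lines 9 through 12: start output at line 9, then keep only 4 lines
slice=$(tail -n +9 "$tmpfile" | head -n 4)
echo "$slice"

# Same range, with cat -n line numbers attached for orientation
cat -n "$tmpfile" | tail -n +9 | head -n 4

# Clean up the demo file
rm -f "$tmpfile"
```

Since a FASTQ record spans 4 lines, the same pattern with head -n 4 should show one whole record, provided the starting line is the first line of a record.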
...
| Expand |
|---|
|
| Code Block |
|---|
| # Setup (if needed)
export CORENGS=/work/projects/BioITeam/projects/courses/Core_NGS_Tools
mkdir -p $SCRATCH/core_ngs/fastq_prep
cd $SCRATCH/core_ngs/fastq_prep
cp $CORENGS/misc/small.fq . |
|
...
You can also pipe the output of zcat or gunzip -c to wc -l to count lines in your compressed FASTQ file.
Exercises:
- How many lines are in the Sample_Yeast_L005_R1.cat.fastq.gz file?
- How many sequences is this?
| Expand |
|---|
|
| Code Block |
|---|
| zcat Sample_Yeast_L005_R1.cat.fastq.gz | wc -l |
|
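Once you have a line count, the sequence count follows because each FASTQ record takes 4 lines. A small sketch of that arithmetic, using a tiny made-up FASTQ in a temporary directory (the file demo.fq and the variables tmpdir, nlines and nseqs are hypothetical names for this example only):

```bash
# Build a tiny 3-record FASTQ in a temporary directory (made-up demo data)
tmpdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n@r3\nGGCC\n+\nIIII\n' > "$tmpdir/demo.fq"
gzip "$tmpdir/demo.fq"

# Count lines in the compressed file without writing an uncompressed copy
nlines=$(gunzip -c "$tmpdir/demo.fq.gz" | wc -l)

# Each FASTQ record is 4 lines, so sequences = lines / 4
nseqs=$(( nlines / 4 ))
echo "$nlines lines, $nseqs sequences"

# Clean up the demo directory
rm -rf "$tmpdir"
```

The sketch uses gunzip -c rather than zcat because it behaves the same on most systems; on the course machines either should work.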
...
| Expand |
|---|
|
| Code Block |
|---|
| # Setup (if needed)
export CORENGS=/work/projects/BioITeam/projects/courses/Core_NGS_Tools
mkdir -p $SCRATCH/core_ngs/fastq_prep
cd $SCRATCH/core_ngs/fastq_prep
ln -sf $CORENGS/yeast_stuff/Sample_Yeast_L005_R1.cat.fastq.gz
ln -sf $CORENGS/yeast_stuff/Sample_Yeast_L005_R2.cat.fastq.gz |
|
...
| Code Block |
|---|
| language | bash |
|---|
| title | For loop to count sequences in multiple FASTQs |
|---|
|
cd $SCRATCH/core_ngs/fastq_prep
for fname in *.gz; do
    echo "Processing $fname"
    echo "..$fname has $(zcat "$fname" | wc -l | awk '{print $1 / 4}') sequences"
done |
Each time through the for loop, the next item in the argument list (here the files matching the wildcard glob *.gz) is assigned to the for loop's formal argument (here the variable fname). The actual filename is then referenced as $fname inside the loop. (Read more about Bash control flow)
Note that the $( ... ) syntax is equivalent to backticks ` ... `, so echo $(date) is the same as echo `date`.
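To see the loop variable and command substitution working together outside the course directory, here is a minimal sketch on two throwaway gzipped files. The file names one.gz and two.gz and the variables tmpdir, n, total, new_style and old_style are invented for the example.

```bash
# Create two small gzipped files (4 and 8 lines) in a temporary directory
tmpdir=$(mktemp -d)
printf 'a\nb\nc\nd\n' | gzip -c > "$tmpdir/one.gz"
printf 'a\nb\nc\nd\ne\nf\ng\nh\n' | gzip -c > "$tmpdir/two.gz"

# The glob expands to the matching files; fname takes each one in turn
total=0
for fname in "$tmpdir"/*.gz; do
    # $( ... ) captures the line count printed by the pipeline
    n=$(gunzip -c "$fname" | wc -l)
    echo "$(basename "$fname") has $(( n / 4 )) four-line records"
    total=$(( total + n / 4 ))
done
echo "total records: $total"

# $( ... ) and backticks capture the same output
new_style=$(echo hello)
old_style=`echo hello`
echo "$new_style $old_style"

# Clean up the demo directory
rm -rf "$tmpdir"
```

One practical difference between the two forms: $( ... ) can be nested directly, as in the echo line inside the loop, while nested backticks need escaping.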