
Files and File systems

First, let's review Intro Unix: Files and File Systems from the Intro Unix course. The most important takeaways are:

Understanding the tree-like structure of directories and files in the file system hierarchy
- More at: Intro Unix: Files and File Systems: The File System hierarchy
Knowing how to navigate the file system using the cd (change directory) command, Tab key completion, and relative path syntax:
- use the dot ( . ) metacharacter for the current directory
- use the dot-dot ( .. ) metacharacters for the parent directory
- More at:
  - Intro Unix: Files and File Systems: Navigating the file system
  - Intro Unix: Files and File Systems: Relative pathname syntax
Selecting multiple files using pathname wildcards (a.k.a. "globbing")
- asterisk ( * ) to match any length of characters
- brackets ( [ ] ) match any character between the brackets, including hyphen ( - ) delimited character ranges such as [A-G]
- More at: Intro Unix: Files and File Systems: Pathname wildcards (globbing)
A basic understanding of file attributes such as
- file type (file, directory)
- owner and group
- permissions (read, write, execute) for the owner, group and everyone
- More at: Intro Unix: Files and File Systems: File attributes
Familiarity with basic file manipulation commands (mkdir, cp, mv, rm)
- Intro Unix: Files and File Systems: Basic file manipulation commands
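
The navigation and wildcard points above can be tried safely in a scratch directory. The sketch below is illustrative only: the directory and file names (projects, run1, alpha.txt and so on) are made up for this example and are not part of the course data.

```bash
# Work in a throwaway directory so nothing in your Home directory is touched
scratch=$(mktemp -d)
cd "$scratch"

# Build a tiny directory tree (all names here are hypothetical)
mkdir -p projects/run1
touch projects/alpha.txt projects/beta.txt projects/Gamma.txt projects/notes.md

cd projects/run1      # descend two levels using a relative path
cd ..                 # dot-dot ( .. ) moves up to the parent, "projects"
here=$(basename "$(pwd)")
echo "$here"

ls ./*.txt            # dot ( . ) is the current directory; * matches any .txt name
ls [a-b]*.txt         # bracket range: .txt names starting with a or b

txt_count=$(ls ./*.txt | wc -l)
range_count=$(ls [a-b]*.txt | wc -l)
```

The shell expands the wildcards before ls runs, so ls only ever sees the list of matching names.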


About compressed files

Because a lot of scientific data is large, it is often stored in a compressed format to conserve storage space. The most common compression program used for individual files is gzip whose compressed files have the .gz extension. The tar and zip programs are most commonly used for compressing directories.

Let's see how that works by using a small FASTQ file (~/data/fastq/small.fq) that contains NGS read data where each sequence is represented by 4 lines.

Code Block

language	bash

cd ~/data/fastq       # change into your ~/data/fastq directory
ls -lh small.fq       # small.fq is 66K (~66,000) bytes long
wc -l small.fq        # small.fq is 1000 lines long

By default, when you call gzip on a file it compresses it in place, creating a file with the same name plus a .gz extension.

Code Block

language	bash

gzip small.fq         # compress the small.fq file in place, producing small.fq.gz file
ls -lh small.fq.gz    # small.fq.gz is only 15K bytes -- 4x smaller!

The gunzip command does the reverse – decompresses the file and writes the results back without the .gz extension. gzip -d (decompress) does the same thing.

Code Block

language	bash

gunzip small.fq.gz    # decompress the small.fq.gz file in place, producing small.fq file
# or
gzip -d small.fq.gz

Both gzip and gunzip also have -c or --stdout options that tell the command to write on standard output, keeping the original files unchanged.

Code Block

language	bash

cd ~/data/fastq       # change into your ~/data/fastq directory
ls small.fq           # make sure you have an uncompressed "small.fq" file

gzip -c small.fq > sm2.fq.gz  # compress the "small.fq" into a new file called "sm2.fq.gz"
gunzip -c sm2.fq.gz > sm3.fq  # decompress "sm2.fq.gz" into a new "sm3.fq" file

Both gzip and gunzip can also accept data on standard input. In that case, the output is always on standard output.

Code Block

language	bash

cd ~/data/fastq       # change into your ~/data/fastq directory
ls small.fq           # make sure you have an uncompressed "small.fq" file

cat small.fq | gzip > small.fq.gz

The good news is that most bioinformatics programs can accept data in compressed gzipped format. But how do you view these compressed files?

The less pager accepts gzipped files as input
The zcat command is like cat, but works on gzipped files

Here are some ways to work with a compressed file:

Code Block

language	bash

cd                                      # make sure you're in your Home directory
cat jabberwocky.txt | gzip > jabber.gz  # make a compressed copy of the "jabberwocky.txt" file
less jabber.gz                          # use 'less' to view the compressed "jabber.gz" file (q to exit)

zcat jabber.gz | wc -l                       # count lines in the compressed "jabber.gz" file
zcat jabber.gz | tail -4                     # view the last 4 lines of the "jabber.gz" file
zcat jabber.gz | cat -n                      # view "jabber.gz" text with line numbers (no zcat -n option)
zcat jabber.gz | cat -n | tail -n +6 | head -4  # display lines 6 - 9 of "jabber.gz" text

Exercise 1-1

Display lines 6 - 9 of the compressed "jabber.gz" text

Expand

title	Hint...

zcat, cat -n, tail/head or head/tail


Working with 3rd party program I/O

Recall the three standard Unix streams: they each have a number, a name and redirection syntax:


standard output is stream 1
- redirect standard output to a file with the > or 1> operator
  - a single > or 1> overwrites any existing data in the target file
  - a double >> or 1>> appends to any existing data in the target file
standard error is stream 2
- redirect standard error to a file with the 2> operator
  - a single 2> overwrites any existing data in the target file
  - a double 2>> appends to any existing data in the target file
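
The overwrite and append behavior of the two streams can be checked with a small experiment. This is a minimal sketch: the talk function and the out.txt and err.txt file names are hypothetical stand-ins for a real program and its output files.

```bash
scratch=$(mktemp -d)
cd "$scratch"

# A helper that writes one line to each stream (hypothetical, for illustration)
talk() {
    echo "result line"            # goes to standard output (stream 1)
    echo "warning line" 1>&2      # goes to standard error  (stream 2)
}

talk 1> out.txt 2> err.txt        # single > : overwrite both target files
talk 1>> out.txt 2>> err.txt      # double >>: append to both target files

out_lines=$(wc -l < out.txt)      # 2 lines, both "result line"
err_lines=$(wc -l < err.txt)      # 2 lines, both "warning line"

talk > out.txt 2> err.txt         # > is shorthand for 1>; files are overwritten
out_after=$(wc -l < out.txt)      # back to 1 line
echo "$out_lines $err_lines $out_after"
```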

We also saw that 3rd party bioinformatics tools are often written as a top-level program that handles multiple sub-commands. Examples include the bwa NGS aligner and samtools and bedtools tool suites. To see their menu of sub-commands, you usually just need to enter the top-level command, or <command> --help. Similarly, sub-command usage is usually available as <command> <sub-command> or <command> <sub-command> --help.

Tip

title	3rd party tools and standard streams

Many tools write their main output to standard output by default but have options to write it to a file instead.

Similarly, tools often write processing status and diagnostics to standard error, and it is usually your responsibility to redirect this elsewhere (e.g. to a log file).

Finally, tools may support taking their main input from standard input, but need a "placeholder" argument where you'd usually specify a file. That standard input placeholder is usually a single dash ( - ) but can also be a reserved word such as stdin.

Now let's see how these concepts fit together when running 3rd party tools.

Exercise 1-1 bwa aln

Where does the bwa aln sub-command write its output?

Expand

title	Answer...

The bwa aln usage

Usage: bwa aln [options] <prefix> <in.fq>

does not specify an output file, so it must write its alignment information to standard output.

...

Working with remote files

scp (secure copy)

The cp command only copies files and directories within the local host's file systems. The scp command is similar to cp, but lets you securely copy files from one machine to another. Like cp, scp has a -r (recursive) option to copy directories.

scp usage is similar to cp in that it copies from a <source> to a <destination>, but uses remote machine addressing to qualify either the <source> or the <destination> but not both.

Remote machine addressing looks like this: <user_account>@<hostname>:<source_or_destination>

Examples:

Open a new Terminal (Mac) or Command Prompt (Windows) window on your local computer (not logged in to your student account), and try the following, using your studentNN account and GSAF pod host.

Note that you will always be prompted for your credentials on the remote host when you execute an scp command.

To copy a remote file:

Code Block

language	bash
title	scp a single file

# On your local computer - not gsafcomp01 or gsafcomp02
# Be sure to use your assigned student account and hostname

# copy "haiku.txt" from your remote student Home directory to your current local directory
scp student01@gsafcomp01.ccbb.utexas.edu:~/haiku.txt . 

# copy "haiku.txt", now in your local current directory, to your remote student 
# Home directory with the name "haiku2.txt"
scp ./haiku.txt student01@gsafcomp01.ccbb.utexas.edu:~/haiku2.txt

To copy a remote directory:

Code Block

language	bash
title	scp a directory

# On your local computer - not gsafcomp01 or gsafcomp02
# Be sure to use your assigned student account and hostname
 
# copy the "docs" directory and its contents from your remote student Home directory 
# to a local sub-directory called "local_docs"
scp -r student01@gsafcomp01.ccbb.utexas.edu:~/docs/ ./local_docs/

# copy the "local_docs" sub-directory in your local current directory, to your 
#  remote student Home directory with the name "remote_docs"
scp -r ./local_docs/ student01@gsafcomp01.ccbb.utexas.edu:~/remote_docs/

Tip

When transferring files between your computer and a remote server, you always need to execute the command on your local computer. This is because your personal computer does not have an entry in the global hostname database, whereas the remote computer does.

The global Domain Name System (DNS) database maps full host names to their IP (Internet Protocol) addresses. Computers that can be accessed from anywhere on the Internet have their host names registered in DNS.

wget (web get)

The wget <url> command lets you retrieve the contents of a valid Internet URL (e.g. http, https, ftp).

By default the downloaded file will be stored in the directory where you execute wget
- with a filename based on the last component of the URL
The -O <path> option specifies the file or pathname where the URL data should be written.

Example:

Code Block

# Make a new "wget" directory in your student Home directory and change into it
mkdir -p ~/wget; cd ~/wget

# download a Gencode statistics file using default output file naming
wget "https://ftp.ebi.ac.uk/pub/databases/gencode/_README_stats.txt"
wc -l _README_stats.txt

# if you execute the same wget again, and the output file already exists
# wget will create a new one with a numeric extension
wget "https://ftp.ebi.ac.uk/pub/databases/gencode/_README_stats.txt"
wc -l _README_stats*

# download the same Gencode statistics file to a different local filename
wget -O gencode_stats.txt "https://ftp.ebi.ac.uk/pub/databases/gencode/_README_stats.txt"
wc -l gencode_stats.txt

The find command

The find command is a powerful – and of course complex! – way of looking for files in a nested directory hierarchy. The general form I use is:

find <in_directory> [ operators ] -name <expression> [ tests ]

looks for files matching <expression> in <in_directory> and its sub-directories
<expression> can be a double-quoted string including pathname wildcards (e.g. "[a-g]*.txt")
there are tons of operators and tests:
- -type f (file) and -type d (directory) are useful tests
- -maxdepth NN is a useful operator to limit the depth of recursion.
returns a list of matching pathnames in the <in_directory>, one per output line.

Examples:

Code Block

language	bash

cd
find . -name "*.txt" -type f     # find all .txt files in the Home directory
find . -name "*docs*" -type d    # find all directories with "docs" in the directory name
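
To see how -type and -maxdepth change the results, here is a small sketch that runs in a scratch directory. The tree it builds (docs, old_docs, data and the files inside them) is invented for the example.

```bash
scratch=$(mktemp -d)
cd "$scratch"

# hypothetical tree: two levels of directories with a few files
mkdir -p docs/old_docs data
touch readme.txt docs/guide.txt docs/old_docs/draft.txt data/table.csv

find . -name "*.txt" -type f                # all three .txt files, at any depth
find . -maxdepth 1 -name "*.txt" -type f    # only ./readme.txt
find . -name "*docs*" -type d               # ./docs and ./docs/old_docs

all_txt=$(find . -name "*.txt" -type f | wc -l)
top_txt=$(find . -maxdepth 1 -name "*.txt" -type f | wc -l)
doc_dirs=$(find . -name "*docs*" -type d | wc -l)
```

Quoting the -name expression matters: without the quotes the shell would expand the wildcard itself before find ever sees it.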

Exercise 2-1

The /stor/work/CBRS_unix/fastq/ directory contains sequencing data from a GSAF Job. Its structure, as shown by tree, is:

Image Added

Use find to find all fastq.gz files in /stor/work/CBRS_unix/fastq/.

Expand

title	Answer...

find /stor/work/CBRS_unix/fastq/ -name "*.fastq.gz" -type f
returns 4 file paths

How many fastq.gz files in /stor/work/CBRS_unix/fastq/ were run in sequencer lane L001?

Expand

title	Answer...

find /stor/work/CBRS_unix/fastq/ -name "*L001*fastq.gz" -type f | wc -l
reports 2 file paths

How many sample directories in /stor/work/CBRS_unix/fastq/ were run on July 10, 2020?

Expand

title	Answer...

find /stor/work/CBRS_unix/fastq/ -name "*2020*" -type d | wc -l
reports 2 directory paths

Working with symbolic links

When dealing with large data files, sometimes scattered in many directories, it is often convenient to create multiple symbolic links (symlinks) to those files in a directory where you plan to work with them. You can use them in your analysis as if they were local to your working directory, without the storage cost of copying them.

Tip

title	Always symlink large files

Storage is a limited resource, so never copy large data files! Create symbolic links to them in your analysis directory instead.

The ln -s <path_to_link_to> [ link_file_name ] command creates a symbolic link to <path_to_link_to>.

ln -s <path> says to create a symbolic link (symlink) to the specified file (or directory) in the current directory
- always use the -s option to avoid creating a hard link, which behaves quite differently
the default link name corresponds to the last name component in <path>
- you can name the link file differently by supplying an optional link_file_name.
it is best to change into (cd) the directory where you want the link before executing ln -s
a symbolic link can be deleted without affecting the linked-to file
the -f (force) option says to overwrite any existing symbolic link with the same name

Examples:

Code Block

language	bash

# create a symlink to the ~/haiku.txt file using relative path syntax
mkdir -p ~/syms; cd ~/syms 
ln -s -f ../haiku.txt
ls -l

The ls -l long listing in the ~/syms directory displays the symlink like this:

Image Added

The 10-character permissions field (lrwxrwxrwx) has an l in the left-most file type position, indicating this is a symbolic link.
The symlink itself is colored differently – in cyan
There are two extra fields after the symlink name
- field 10 has an arrow -> pointing to field 11
- field 11 the path of the linked-to file ("../haiku.txt")

Now create a symlink to a non-existent file:

Code Block

language	bash

# create a symlink to a non-existent "../xxx.txt" file, naming the symlink "bad_link.txt"
mkdir -p ~/syms; cd ~/syms 
ln -sf ../xxx.txt bad_link.txt
ls -l

Now both the symlink and the linked-to file are displayed in red, indicating a broken link.

Image Added
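
Terminal colors are not available inside a script, so it can help to test for broken links directly. The following is a sketch under assumptions: it recreates the two links above in a scratch directory, with a one-line haiku.txt standing in for the real file.

```bash
scratch=$(mktemp -d)
cd "$scratch"
echo "an old silent pond" > haiku.txt     # stand-in for ~/haiku.txt
mkdir syms; cd syms

ln -sf ../haiku.txt                       # good link, named haiku.txt
ln -sf ../xxx.txt bad_link.txt            # target does not exist

readlink haiku.txt                        # prints the stored path: ../haiku.txt
good_target=$(readlink haiku.txt)

# -L is true for any symlink; -e follows the link, so it fails when broken
for f in haiku.txt bad_link.txt; do
    if [ -L "$f" ] && [ ! -e "$f" ]; then
        echo "$f is a broken link"
        broken="$f"
    fi
done
```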

Multiple files can be linked by providing multiple file name arguments along with the -t (target) option, which specifies the directory where links to all the files will be created.

Code Block

language	bash

# create multiple symlinks to the *.bed files in the ~/data/bedfiles/ directory
# the -t . says create all the symlinks in the current directory
mkdir -p ~/syms; cd ~/syms  
ln -sf -t .  ../data/bedfiles/*.bed
ls -l

What about the case where the files you want are scattered in sub-directories? Consider a typical GSAF project directory structure, where FASTQ files are nested in sub-directories:

Image Added

Here's a solution using find and xargs:

Code Block

language	bash

mkdir -p ~/syms/fa; cd ~/syms/fa
find /stor/work/CBRS_unix/fastq -name "*.gz" | xargs ln -sf -t .

Step by step:

find returns a list of matching file paths on its standard output
ln wants its files listed as arguments, not on standard input
- so the paths are piped to the standard input of xargs
xargs takes the data on its standard input and calls the specified command (here ln -sf -t .) with that data as the command's argument list.
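
One caveat with this pattern is that xargs splits its input on whitespace by default, so a path containing a space would be passed as two arguments. The sketch below shows one common way around that, using find's -print0 and xargs' -0 options; the project tree and file names are hypothetical, and ln -t assumes GNU coreutils as used elsewhere on this page.

```bash
scratch=$(mktemp -d)
cd "$scratch"

# hypothetical project tree with compressed files in nested sub-directories
mkdir -p project/sampleA project/sampleB links
echo "reads A" | gzip > project/sampleA/a_R1.fastq.gz
echo "reads B" | gzip > "project/sampleB/b R1.fastq.gz"   # note the space

cd links
# -print0 / -0 separate paths with a NUL byte instead of whitespace,
# so the file name containing a space arrives as a single argument
find "$scratch/project" -name "*.gz" -print0 | xargs -0 ln -sf -t .

ls -l
link_count=$(find . -type l | wc -l)
```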

About compressed files

Because a lot of scientific data is large, it is often stored in a compressed format to conserve storage space. The most common compression program used for individual files is gzip whose compressed files have the .gz extension. The tar and zip programs are most commonly used for compressing directories.

Let's see how that works by using a small FASTQ file that contains NGS read data where each sequence is represented by 4 lines.

Code Block

language	bash

# copy a small.fq file into a new ~/gzips directory
cd; mkdir gzips
cp -p /stor/work/CCBB_Workshops_1/misc_data/fastq/small.fq ~/gzips/
    
cd ~/gzips
ls -lh          # small.fq is 66K (~66,000) bytes long
wc -l small.fq  # small.fq is 1000 lines long
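
Because each sequence occupies 4 lines, the number of reads in an uncompressed FASTQ file is its line count divided by 4. A minimal sketch, using a made-up 3-read file called tiny.fq rather than the course data:

```bash
scratch=$(mktemp -d)
cd "$scratch"

# a made-up 3-read FASTQ file: header, bases, "+" separator, qualities
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nIIII\n@read3\nTTAG\n+\nIIII\n' > tiny.fq

lines=$(wc -l < tiny.fq)      # 12 lines
reads=$(( lines / 4 ))        # 4 lines per read -> 3 reads
echo "$reads reads"
```

By the same arithmetic, the 1000-line small.fq file holds 250 reads.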

By default, when you call gzip on a file it compresses it in place, creating a file with the same name plus a .gz extension.

Code Block

language	bash

gzip small.fq   # compress the small.fq file in place, producing small.fq.gz file
ls -lh          # small.fq.gz is only 15K bytes -- 4x smaller!

The gunzip command does the reverse – decompresses the file and writes the results back without the .gz extension. gzip -d (decompress) does the same thing.

Code Block

language	bash

# decompress the small.fq.gz file in place, producing small.fq file
gunzip small.fq.gz    
# or
gzip -d small.fq.gz

Both gzip and gunzip also have -c or --stdout options that tell the command to write on standard output, keeping the original files unchanged.

Code Block

language	bash

cd ~/gzips            # change into your ~/gzips directory
ls small.fq           # make sure you have an uncompressed "small.fq" file

gzip -c small.fq > sm2.fq.gz  # compress the "small.fq" into a new file called "sm2.fq.gz"
gunzip -c sm2.fq.gz > sm3.fq  # decompress "sm2.fq.gz" into a new "sm3.fq" file
ls -lh
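
If you want to confirm that a compress and decompress round trip returns exactly the original data, gzip -t and cmp can check it. This sketch uses a generated numbers.txt file as a stand-in for small.fq; the file names are assumptions for illustration.

```bash
scratch=$(mktemp -d)
cd "$scratch"
seq 1 1000 > numbers.txt                 # stand-in for small.fq

gzip -c numbers.txt > copy.txt.gz        # original is left untouched
gunzip -c copy.txt.gz > restored.txt     # compressed copy is left untouched

gzip -t copy.txt.gz && echo "archive is intact"          # -t tests integrity
cmp numbers.txt restored.txt && echo "round trip is lossless"

file_count=$(ls | wc -l)                 # numbers.txt, copy.txt.gz, restored.txt
```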

Both gzip and gunzip can also accept data on standard input. In that case, the output is always on standard output.

Code Block

language	bash

cd ~/gzips            # change into your ~/gzips directory
ls small.fq           # make sure you have an uncompressed "small.fq" file

cat small.fq | gzip > sm4.fq.gz

The good news is that most bioinformatics programs can accept data in compressed gzipped format. But how do you view these compressed files?

The less pager accepts gzipped files as input
The zcat command is like cat, but works on gzipped files

Here are some ways to work with a compressed file:

Code Block

language	bash

cd ~/gzips                                    
cat ../jabberwocky.txt | gzip > jabber.gz  # make a compressed copy of "jabberwocky.txt"
less jabber.gz                             # use 'less' to view compressed "jabber.gz" 
                                           #   (type 'q' to exit)
zcat jabber.gz | wc -l                     # count lines in the compressed "jabber.gz" file
zcat jabber.gz | tail -4                   # view the last 4 lines of the "jabber.gz" file
zcat jabber.gz | cat -n                    # view "jabber.gz" text with line numbers 
                                           #   (zcat does not have an -n option)

Exercise 2-2

Display lines 7 - 9 of the compressed "jabber.gz" text

Expand

title	Answer...
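
One possible approach, sketched on a generated stand-in file (sample.gz, 12 numbered lines) so that it runs anywhere; substitute jabber.gz for sample.gz to apply it to the exercise.

```bash
scratch=$(mktemp -d)
cd "$scratch"

# stand-in for jabber.gz: 12 numbered lines, compressed
seq 1 12 | sed 's/^/line /' | gzip > sample.gz

# skip to line 7 with tail, then keep the first 3 lines (7, 8, 9)
zcat sample.gz | cat -n | tail -n +7 | head -3

# or take the first 9 lines, then keep the last 3
zcat sample.gz | cat -n | head -9 | tail -3

picked=$(zcat sample.gz | tail -n +7 | head -3)
```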

Working with 3rd party program I/O

Recall the three standard Unix streams: they each have a number, a name and redirection syntax:

Image Added

3rd party tool files and streams

Third party bioinformatics tools are often written to perform sub-command processing; that is, they have a top-level program that handles multiple sub-commands. Examples include the bwa NGS aligner and the samtools and bedtools tool suites.

To see their menu of sub-commands, you usually just need to enter the top-level command, or <command> --help. Similarly, sub-command usage is usually available as <command> <sub-command> or <command> <sub-command> --help.

Tip

title	3rd party tools and standard streams

Many tools write their main output to standard output by default but have options to write it to a file instead.

Similarly, tools often write processing status and diagnostics to standard error, and it is usually your responsibility to redirect this elsewhere (e.g. to a log file).

Finally, tools may support taking their main input from standard input, but need a "placeholder" argument where you'd usually specify a file. That standard input placeholder is usually a single dash ( - ) but can also be a reserved word such as stdin.
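
The dash placeholder is not specific to bioinformatics tools; standard utilities such as cat and diff honor it too. The sketch below uses them, with invented header.txt and body.txt files, so the idea can be tried without any 3rd party software installed.

```bash
scratch=$(mktemp -d)
cd "$scratch"
printf 'gene\tcount\n' > header.txt
printf 'abc1\t12\nxyz2\t7\n' > body.txt

# "-" marks where standard input belongs among cat's file arguments
sort body.txt | cat header.txt - > table.txt

# diff also accepts "-": compare a stream against a file without a temp file
sort body.txt | diff - body.txt && echo "body.txt is already sorted"

table_lines=$(wc -l < table.txt)
```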

Now let's see how these concepts fit together when running 3rd party tools.

Exercise 2-3 bwa mem

Display the bwa mem sub-command usage using the more pager

Expand

title	Answer...

Just typing bwa mem | more doesn't use the more pager!

That's because bwa writes its usage information to standard error, not to standard output. So you have to use the funky 2>&1 syntax before piping to more:

bwa mem 2>&1 | more
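
The same effect can be reproduced without bwa. In this sketch, fake_tool is a hypothetical shell function that, like bwa, prints its usage on standard error; counting the lines that arrive through the pipe shows what 2>&1 changes.

```bash
# hypothetical stand-in for a tool that prints its usage on standard error
fake_tool() {
    echo "Usage: fake_tool [options] <in.fq>" 1>&2
}

# without 2>&1 the pipe receives nothing; the usage goes straight to the terminal
piped_plain=$(fake_tool | wc -l)

# 2>&1 sends stream 2 to wherever stream 1 points, here into the pipe
piped_merged=$(fake_tool 2>&1 | wc -l)

echo "lines through the pipe: plain=$piped_plain merged=$piped_merged"
```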

Where does the bwa mem sub-command write its output?

Expand

title	Answer...

The bwa mem usage says:

Usage: bwa mem [options] <idxbase> <in1.fq> [in2.fq]

This does not specify an output file, so it must write its alignment information to standard output.

How can this be changed?

Expand

title	Answer...

The bwa mem options usage says:

-o FILE sam file to output results to [stdout]

bwa mem also writes diagnostic progress as it runs, to standard error.

Expand

title	Real example...

Code Block

language	bash

cd ~/gzips
bwa mem /mnt/bioi/ref_genome/bwa/bwtsw/sacCer3/sacCer3.fa sm2.fq.gz > small.sam

Show how you would invoke bwa mem to capture both its alignment output and its progress diagnostics. Use input from a my_fastq.fq file and ./refs/hg38 as the <idxbase>.

Expand

title	Answers...

Redirecting the output to a file:
bwa mem ./refs/hg38 my_fastq.fq 1> my_fastq.sam 2>my_fastq.aln.log

Using the -o option:
bwa mem -o my_fastq.sam ./refs/hg38 my_fastq.fq 2>my_fastq.aln.log

Exercise 2-4 cutadapt

The cutadapt adapter trimming command reads NGS sequences from a FASTQ file, and writes adapter-trimmed reads to a FASTQ file. Find its usage.

Expand

title	Answer...

cutadapt # overview; tells you to run cutadapt --help for details
cutadapt --help | less
cutadapt --help | more

Note that it also points you to https://cutadapt.readthedocs.io/ for full documentation.

Where does cutadapt write its output by default? How can that be changed?

Expand

title	Answer...

The cutadapt usage says that output can be written to a file using the -o option

Usage: cutadapt -a ADAPTER [options] [-o output.fastq] input.fastq

The brackets around [-o output.fastq] suggest this is optional. Reading a bit further we see:

... Without the -o option, output is sent to standard output.

This suggests output can be specified in 2 ways:

to a file, using the -o option
- cutadapt -a CGTAATTCGCG -o trimmed.fastq small.fq
to standard output without the -o option
- cutadapt -a CGTAATTCGCG small.fq 1> trimmed.fastq

Where does cutadapt read its input from by default? How can that be changed? Can the input FASTQ be in compressed format?

Expand

title	Answer...

The cutadapt usage says an input.fastq file is a required argument:

Usage: cutadapt -a ADAPTER [options] [-o output.fastq] input.fastq

But again, reading a bit further we see:

... Compressed input and output is supported and auto-detected from the file name (.gz, .xz, .bz2). Use the file name '-' for standard input/output. ...

This says that the input.fastq file can be provided in one of three compression formats.

And the usage also suggests input can be specified in 2 ways:

from a file, named as the required input.fastq argument
- cutadapt -a CGTAATTCGCG -o trimmed.fastq small.fq
from standard input if the input.fastq argument is replaced with a dash ( - )
- cat small.fq | cutadapt -a CGTAATTCGCG -o trimmed.fastq -

Where does cutadapt write its diagnostic output by default? How can that be changed?

Expand

title	Answer...

The cutadapt usage doesn't say anything directly about diagnostics:

cutadapt -a ADAPTER [options] [-o output.fastq] input.fastq

But again, reading in the Output: options section:

-o FILE, --output=FILE Write trimmed reads to FILE. FASTQ or FASTA format is chosen depending on input. The summary report is sent to standard output. Use '{name}' in FILE to demultiplex reads into multiple files. Default: write to standard output

Careful reading of this suggests that:

When the -o option is omitted, and output goes to standard output,
- diagnostics must be written to standard error
  - so can be redirected to a log file with 2> trim.log
- cutadapt -a CGTAATTCGCG small.fq 1> trimmed.fastq 2> trim.log
But when the trimmed output is sent to a file with the -o output.fastq option,
- diagnostics are written to standard output
  - so can be redirected to a log file with 1> trim.log
- cutadapt -a CGTAATTCGCG -o trimmed.fastq small.fq 1> trim.log

Expand

title	Real example...

Code Block

language	bash

cd ~/gzips 
cutadapt -a AGATCGGAAGAGCACACGTCTGA small.fq  > trimmed.fq
