...
Almost all built-in Linux commands, and especially NGS bioinformatics tools, use options heavily.
...
So you've noticed that options can be complicated – not to mention program arguments. Some options have values and others don't. Some are short, others long. How do you figure out what kinds of functions a command (or NGS bioinformatics tool) offers? You need help!
...
Notice that bwa, like many NGS bioinformatics programs, is written as a set of sub-commands. This top-level help displays the sub-commands available. You then type bwa <command> to see help for the sub-command:
...
If you don't already know much about a command (or NGS tool), just Google it! Try something like "bwa manual" or "rsync man page". Many tools have websites that combine tool overviews with detailed option help. Even for built-in Linux commands, you're likely to get tutorial-style hits, which are more useful when you're getting started.
...
- touch <file> – create an empty file, or update the modification timestamp on an existing file
- mkdir -p <dirname> – create directory <dirname>.
- -p says to create any needed sub-directories also
- mv <file1> <file2> – renames <file1> to <file2>
- mv <file1> <file2> ... <fileN> <to_dir>/ – moves files <file1> <file2> ... <fileN> into directory <to_dir>
- mv -t <dir> <file1> <file2> ... <fileN> – same as above, but specifies the target directory as an option (-t <to_dir>)
- ln -s <path> creates a symbolic (-s) link (symlink) to <path> in the current directory
- default link name corresponds to the last name component in <path>
- always change into (cd) the directory where you want the link before executing ln -s
- a symbolic link can be deleted without affecting the linked-to file
- ln -sf -t <target_dir> <file1> <file2> ... <fileN> – creates symbolic links to <file1> <file2> ... <fileN> in target directory <target_dir>
- rm <file> deletes a file. This is permanent - not a "trash can" deletion.
- rm -rf <dirname> deletes an entire directory – be careful!
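The file commands above can be tried safely in a scratch directory. The following is a minimal sketch, not part of the original course material; the directory and file names (project, links, a.txt and so on) are made up for illustration.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

mkdir -p project/data/raw          # -p also creates project/ and project/data/
touch project/data/raw/a.txt       # create an empty file
touch b.txt c.txt

mv b.txt b_renamed.txt             # rename b.txt
mv b_renamed.txt c.txt project/    # move both files into project/

# Change into the directory that should hold the link, then create it.
mkdir -p links
cd links || exit 1
ln -s ../project/data/raw/a.txt    # link name defaults to a.txt
link_target=$(readlink a.txt)      # the path the symlink points to
cd ..

rm links/a.txt                     # removes only the symlink
survivors=$(ls project/data/raw)   # the linked-to file is still there
moved=$(ls project | sort | tr '\n' ' ')

cd / && rm -rf "$workdir"          # permanent: there is no trash can
```

Because everything happens under the directory returned by mktemp -d, the final rm -rf only removes the sandbox.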
...
- cut -f <field_number(s)> extracts one or more fields (-f) from each line of its input
- -d <delim> to change the field delimiter (Tab by default)
- sort sorts its input using an efficient algorithm
- by default sorts each line lexically
- one or more fields to sort can be specified with one or more -k <start_field_number>,<end_field_number> options
- options to sort numerically (-n), or numbers-inside-text (version sort -V)
- -t <delim> to change the field delimiter (whitespace -- one or more spaces or Tabs – by default)
- uniq -c counts groupings of its input (which must be sorted) and reports the text and count for each group
- use cut | sort | uniq -c for a quick-and-dirty histogram (see piping a histogram)
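As a rough illustration of the cut | sort | uniq -c idiom, the sketch below counts records per chromosome. The tab-delimited records are invented for the example and do not come from any course file.

```shell
# Made-up tab-delimited records: read name, chromosome, strand.
records=$(printf 'r1\tchr1\t+\nr2\tchr2\t-\nr3\tchr1\t+\nr4\tchr1\t-\nr5\tchr2\t+\n')

# Extract field 2, sort so identical values are adjacent, then count each group.
histogram=$(printf '%s\n' "$records" | cut -f 2 | sort | uniq -c)
printf '%s\n' "$histogram"

# Optionally rank the groups by count, largest first (numeric, reverse).
ranked=$(printf '%s\n' "$histogram" | sort -k1,1nr)
```

uniq -c pads its counts with leading spaces, so the exact column alignment of the output may differ between systems.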
- grep -P '<pattern>' searches for <pattern> in its input and outputs only lines containing it
- always enclose <pattern> in single quotes to inhibit shell evaluation!
- -P says use Perl patterns, which are much more powerful than standard grep patterns
- -c says just return a count of line matches
- -n says include the line number of the matching line
- -v (inverse match) says return only lines not matching the pattern
- -l says return only the names of files that do contain a pattern match
- -L says return only the names of files containing no pattern matches
- <pattern> can contain special match meta-characters and modifiers such as:
- ^ – matches beginning of line
- $ – matches end of line
- . – (period) matches any single character
- * – modifier; place after an expression to match 0 or more occurrences
- + – modifier, place after an expression to match 1 or more occurrences
- \s – matches any whitespace (\S any non-whitespace)
- \d – matches digits 0-9
- \w – matches any word character: A-Z, a-z, 0-9 and _ (underscore)
- \t matches Tab; \r matches Carriage return; \n matches Linefeed
- [xyz123] – matches any single character (including special characters) among those listed between the brackets [ ]
- this is called a character class.
- use [^xyz123] to match any single character not listed in the class
- (Xyz|Abc) – matches either Xyz or Abc or any text or expressions inside parentheses separated by | characters
- note that parentheses ( ) may also be used to capture matched sub-expressions for later use
- Regular expression modules are available in nearly every programming language (Perl, Python, Java, PHP, awk, even R)
- each "flavor" is slightly different
- even bash has multiple regex commands: grep, egrep, fgrep.
- There are many good online regular expression tutorials, but be sure to pick one tailored to the language you will use.
- here's a good general one: https://www.regular-expressions.info/
- and a perl regex tutorial: http://perldoc.perl.org/perlretut.html
- perl regular expressions are the "gold standard" used in most other languages
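A few of the grep options and meta-characters above, sketched on invented input. This assumes GNU grep built with Perl-pattern (-P) support, which is typical on Linux but not guaranteed elsewhere.

```shell
# Made-up input: three data lines and one comment line.
lines=$(printf 'chr1\t100\tgeneA\nchr2\t2500\tgeneB\n# comment line\nchrM\t42\tgeneC\n')

# ^ anchors at line start; \d+ is one or more digits; \t is a Tab.
numbered=$(printf '%s\n' "$lines" | grep -P '^chr\d+\t')

# -v inverts the match and -c counts: number of lines that do not start with #.
n_data=$(printf '%s\n' "$lines" | grep -c -v -P '^#')

# (A|C) matches either letter; $ anchors at line end; -n adds line numbers.
hits=$(printf '%s\n' "$lines" | grep -n -P 'gene(A|C)$')

printf '%s\n' "$numbered" "$n_data" "$hits"
```

Every pattern is in single quotes, so the shell passes characters such as $, | and \ to grep untouched.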
- awk '<script>' – a powerful scripting language that is easily invoked from the command line
- <script> is applied to each line of input (generally piped in)
- always enclose <script> in single quotes to inhibit shell evaluation
- General structure of an awk script:
- BEGIN {<expressions>} – use to initialize variables before any script body lines are executed
- e.g. BEGIN {FS=":"; OFS="\t"; sum=0} says
- use colon (:) as the input field separator (FS), and tab (\t) as the output field separator (OFS)
- the default input field separator (FS) is whitespace
- one or more spaces or tabs
- the default output field separator (OFS) is a single space
- initialize the variable sum to 0
- {<body expressions>} – expressions to apply to each line of input
- use $1, $2, etc. to pick out specific input fields
- e.g. {print $3,$4} outputs fields 3 and 4 of the input, separated by the output field separator.
- END {<expressions>} – executed after all input is complete (e.g. print a sum)
- Here is an excellent awk tutorial, very detailed and in-depth
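The BEGIN / body / END structure described above might look like this in practice. The colon-delimited name:count input is invented for the sketch.

```shell
# Made-up colon-delimited input: name:count
awk_output=$(printf 'a:1\nb:2\nc:3\n' | awk '
BEGIN { FS=":"; OFS="\t"; sum=0 }   # runs once, before any input is read
{ print $2, $1; sum += $2 }         # runs for every input line
END { print "total", sum }')        # runs once, after all input is read

printf '%s\n' "$awk_output"
```

Each output line has its two fields swapped and joined by a Tab (the OFS), and the END block adds a final line with the running total.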
...
- Default field separators
- Tab is the default field separator for cut
- whitespace (one or more spaces or tabs) is the default field separator for awk
- Re-ordering
- cut cannot re-order fields
- awk can re-order fields, based on the order you specify
- awk is a full-featured programming language while cut is just a single-purpose utility.
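The differences listed above can be seen on a single made-up line that contains both Tabs and a space. This is only a sketch; the variable names are arbitrary.

```shell
# One line with three Tab-delimited fields; the middle field contains a space.
line=$(printf 'a\tb c\td')

# cut splits on Tab and always emits fields in input order, even if asked for 3,1.
cut_fields=$(printf '%s\n' "$line" | cut -f 3,1)

# awk splits on any whitespace by default, so it sees four fields here,
# and it prints them in whatever order the script asks for.
awk_ws=$(printf '%s\n' "$line" | awk '{print $4, $1}')

# Setting FS (and OFS) to Tab makes awk see the same three fields as cut.
awk_tab=$(printf '%s\n' "$line" | awk 'BEGIN{FS="\t"; OFS="\t"} {print $3, $1}')

printf '%s\n' "$cut_fields" "$awk_ws" "$awk_tab"
```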
...
```bash
samtools view -F 0x4 -f 0x2 yeast_pe.sort.bam | awk '
BEGIN{ FS="\t"; sum=0; nrec=0; }
{ if ($9 > 0) {sum += $9; nrec++;} }
END{ print sum/nrec; }'
```
- samtools view converts each alignment record in yeast_pe.sort.bam to text
- the -F 0x4 filter says to output records only for mapped sequences (ones assigned a contig and position)
- BAM files often contain records for both mapped and unmapped reads
- -F excludes records where any of the specified bit(s) are set (i.e., they are 1)
- so technically we're asking for "not unmapped" reads since bit 0x4 = 1 means unmapped
- the -f 0x2 filter says to output only reads that are flagged as properly paired by the aligner
- these are reads where both R1 and R2 reads mapped within a "reasonable" genomic distance
- -f keeps only records where all of the specified bit(s) are set (i.e., they are 1)
- alignment records that pass both filters are written to standard output
- | awk
- the pipe | connects the standard output of samtools view to the standard input of awk
- the single quote denotes the start of the awk script
- we don't have to use line continuation characters ( \ followed by a linefeed) within the script because newline characters within the quotes are part of the script
- 'BEGIN{ ... }{...}END{...}'
- these 3 lines of text, enclosed in single quotes, are the awk script
- Always enclose a command-line awk script in single quotes to protect it from bash shell interpolation.
- the BEGIN{ FS="\t"; sum=0; nrec=0; } block is executed once before the script processes any input data
- it says to use Tab ("\t") as the input (FS) field separator (default is whitespace), and initialize the variables sum and nrec to 0.
- { if ($9 > 0) {sum += $9; nrec++} }
- this is the body of the awk script, which is executed for each line of input
- $9 represents the 9th tab-delimited field of the input
- the 9th field of an alignment record is the insert size, according to the SAM format spec
- we only execute the main part of the body when the 9th field is positive: if ($9 > 0)
- since each proper pair will have one alignment record with a positive insert size and one with a negative insert size, this check keeps us from double-counting insert sizes for pairs
- when the 9th field is positive, we add its value to sum (sum += $9) and add one to our record count (nrec++)
- END{ print sum/nrec; }
- the END block between the curly brackets { } is executed once after the script has processed all input data
- this prints the average insert size (sum/nrec) to standard output
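The logic walked through above can be checked without samtools or a BAM file. The sketch below makes two assumptions that are not in the original: the four alignment records are hand-made (two proper pairs on a hypothetical contig chrI), and passes_filters is an illustrative helper that mimics the -F 0x4 -f 0x2 test with shell bit arithmetic. The END block also guards against dividing by zero when no record has a positive insert size.

```shell
# Illustrative helper: succeeds if a SAM flag would pass -F 0x4 -f 0x2,
# i.e. the unmapped bit (0x4) is 0 and the properly-paired bit (0x2) is 1.
passes_filters() {
  flag=$1
  [ $(( flag & 0x4 )) -eq 0 ] && [ $(( flag & 0x2 )) -ne 0 ]
}

# Hand-made records: QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL.
# printf reuses the format for each group of five arguments.
sam_records=$(printf '%s\t%s\tchrI\t%s\t60\t50M\t=\t%s\t%s\t*\t*\n' \
  p1 99 100 250 200 \
  p1 147 250 100 -200 \
  p2 99 500 700 250 \
  p2 147 700 500 -250)

# Same awk script as above, plus a guard for the case of no positive TLEN values.
avg_insert=$(printf '%s\n' "$sam_records" | awk '
BEGIN { FS="\t"; sum=0; nrec=0 }
{ if ($9 > 0) { sum += $9; nrec++ } }
END { if (nrec > 0) print sum/nrec; else print "NA" }')

printf 'average insert size: %s\n' "$avg_insert"
```

Only the two records with positive 9th fields (200 and 250) are counted, so each pair contributes once to the average.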
...
```bash
scp abattenh@ls6.tacc.utexas.edu:/scratch/01063/abattenh/core_ngs/fastq_prep/small_fastqc.html .
```
For other Windows users, pscp.exe, a remote file copy program
...
can be installed as part of the PuTTY suite. To use it, first open a Command window (Start menu, search for
...
Command). Then in the Command window, see if it is on your Windows %PATH% by just typing the executable name:
If this shows usage information, you're good to go. Execute something like the following, substituting your user name and absolute path:
If pscp.exe is not on your %PATH%, you may need to locate the program. Try this:
If you see the program pscp.exe, you're good. You just have to use its full path. For example:
Editing files
There are several options for editing remote files (e.g. at TACC, or on a BRCF pod). These fall into three categories:
- Linux command-line text editors installed on the remote server (nano, vi, emacs). These run in your Terminal window.
- nano is extremely simple and is a good choice as a first local text editor
- warning: nano has a tendency to break long single lines into multiple lines
- vi and emacs are extremely powerful but also quite complex
- emacs reference sheet: https://www.gnu.org/software/emacs/refcards/pdf/refcard.pdf
- vi reference sheet: http://www.atmos.albany.edu/daes/atmclasses/atm350/vi_cheat_sheet.pdf
- Software or protocols that allow you to mount external server directories on your local computer
- Once mounted, the remote storage appears as a local volume/drive.
- Then, you can use any text editor or IDE on your local computer to open/edit/save files (although it may be slower than local file editing)
- Remote file system protocols include Samba (Windows, Mac) and NFS (Linux)
- Software programs that can mount remote data include ExpanDrive for Windows or Mac (costs $$, but has a free trial), TextWrangler for Mac.
- Text editors or IDEs that run on your local computer but have an SFTP (secure FTP) interface that lets you connect to a remote computer
- E.g., Notepad++ or Komodo Edit
- Once you connect to the remote host, you can navigate its directory structure and edit files.
- When you open a file, its contents are brought over the network into the text editor's edit window, then saved back when you save the file.
- These programs can also access remotely mounted storage as a local volume/drive.
Knowing the basics of at least one Linux text editor is useful for creating small files like TACC commands files. We'll use nano and basic emacs for this in class.
...