2021 Linux fundamentals
This page should serve as a reference for the many "things Linux" we use in this course.
- 1 Terminal programs
- 2 Getting around in the shell
- 2.1 Important keyboard shortcuts
- 2.1.1 Tab key completion
- 2.1.2 Arrow keys
- 2.1.3 Command line editing
- 2.2 Wildcards and special file names
- 2.3 Standard streams
- 2.3.1 redirecting output
- 2.3.1.1 Output redirection examples
- 2.4 Piping
- 2.4.2 piping a histogram
- 2.4.2.1 The power of chaining pipes
- 2.5 Environment variables
- 2.6 Quoting in the shell
- 3 Using Commands
- 3.1 Command options
- 3.1.1 Useful options for ls
- 3.1.2 Examples of word options
- 3.2 Getting help
- 3.2.1 --help option
- 3.2.2 -h or -? options
- 3.2.3 just type the program name
- 3.2.4 Google
- 3.2.5 man pages
- 4 Basic linux commands you need to know
- 5 Advanced commands
- 6 Copying files between TACC and your laptop
- 7 Editing files
Terminal programs
You need a Terminal program in order to ssh to a remote computer.
Macs and Linux have a Terminal program built-in
Windows options:
Windows 10
Command shell has ssh and scp (may require latest Windows updates)
Start menu → Search for Command
Windows Subsystem for Linux – Windows 10 Professional includes an Ubuntu-like bash shell
See https://docs.microsoft.com/en-us/windows/wsl/install-win10
We recommend the Ubuntu Linux distribution, but any Linux distribution will have an SSH client
or
Putty – http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
simple Terminal and file copy programs
download either the Putty installer (https://the.earth.li/~sgtatham/putty/latest/w64/putty-64bit-0.70-installer.msi)
or just putty.exe (terminal) and pscp.exe (secure copy client)
Cygwin – http://www.cygwin.com/
A full Linux environment, including X-windows for running GUI programs remotely
Complicated to install
Getting around in the shell
Important keyboard shortcuts
Type as little and as accurately as possible by using keyboard shortcuts!
Tab key completion
The Tab key is your best friend! Hit the Tab key once or twice - it's almost always magic! Hitting Tab invokes shell completion, instructing the shell to try to guess what you're doing and finish the typing for you. On most modern Linux shells, Tab completion will:
single Tab – complete file or directory names up to any ambiguous part
if nothing shows up, there is no unambiguous match
Tab twice – display all possible completions
you then decide where to go next
work for shell commands too (like ls or cp)
Arrow keys
Use Up arrow to retrieve any of the last 500 commands you've typed (the exact number depends on your shell's history settings), going backwards through your history.
You can then edit them and hit Enter (even in the middle of the command) and the shell will use that command.
The Down arrow "scrolls" forward from where you are in the command history.
Right arrow and Left arrow move the cursor forward or backward on the current command line.
Command line editing
Use Ctrl-a (holding down the "control" key and "a") to jump the cursor to the beginning of the line.
Use Ctrl-e to jump the cursor to the end of the line.
Arrow keys are also modified by Ctrl
e.g. Ctrl-right-arrow will skip by word forward; Ctrl-left-arrow backward.
Use Backspace to remove text before the cursor; Delete to remove text after the cursor.
Wildcards and special file names
The shell has shorthand to refer to groups of files by allowing wildcards in file names.
* (asterisk) is the most common filename wildcard. It matches "any length of any characters".
This technique is sometimes called filename globbing, and the pattern a glob.
Other useful ones are
brackets ( [ ] ) to allow for any character in the list of characters between the brackets.
and you can use a hyphen ( - ) to specify a range of characters (e.g. [A-G])
braces ( { } ) enclose a list of comma-separated substrings to match
For example:
ls *.bam – lists all files in the current directory that end in .bam
ls [A-Z]*.bam – does the same, but only if the first character of the file is a capital letter
ls [ABcd]*.bam – lists all .bam files whose 1st letter is A, B, c or d.
ls *.{fastq,fq}.gz – lists all .fastq.gz and .fq.gz files.
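To try these patterns safely, here is a minimal sketch that works in a throwaway directory. The directory and file names are made up for illustration, and echo is used instead of ls only so that the matched names print on one line.

```shell
# Hypothetical scratch directory and empty files, for illustration only
cd "$(mktemp -d)"
touch Apple.bam banana.bam cherry.bam reads.fastq.gz reads.fq.gz notes.txt

echo *.bam              # every name ending in .bam
echo [ABcd]*.bam        # .bam names whose 1st letter is A, B, c or d
echo *.{fastq,fq}.gz    # bash brace expansion: *.fastq.gz and *.fq.gz
```

The shell expands each pattern before the command runs, so echo (or ls) only ever sees the resulting list of file names.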
Three special file names:
. (single period) means "this directory".
.. (two periods) means "directory above current". So ls .. means "list contents of the parent directory".
~ (tilde) means "my home directory".
Avoid spaces in filenames
While it is possible to create file and directory names that have embedded spaces, that creates problems when manipulating them.
To avoid headaches, it is best not to create file/directory names with embedded spaces.
Standard streams
Every command and Linux program has three "built-in" streams: standard input, standard output and standard error.
It is easy to not notice the difference between standard output and standard error when you're in an interactive Terminal session – because both outputs are sent to the Terminal. But they are separate streams, with different meanings. When running batch programs and scripts you will want to manipulate standard output and standard error from programs appropriately.
redirecting output
To take the standard output of a program and save it to a file, you use the > operator
a single > overwrites any existing target; a double >> appends to it
since standard output is stream #1, this is the same as 1>
To redirect the standard error of a program you must specify its stream number using 2>
To redirect standard output and standard error to the same place, use the syntax 2>&1
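The operators above can be combined on one command line. The following is a minimal sketch in a throwaway directory; the file names (out.txt, ls.out, ls.err, both.out, no_such_file) are hypothetical and chosen only for illustration.

```shell
# Hypothetical file names in a scratch directory, for illustration only
cd "$(mktemp -d)"

echo "first line"  > out.txt     # > creates or overwrites out.txt
echo "second line" >> out.txt    # >> appends to it
cat out.txt                      # shows both lines

# send standard output and standard error to separate files
ls out.txt no_such_file > ls.out 2> ls.err
cat ls.out                       # the listing for out.txt
cat ls.err                       # the error message about no_such_file

# send both streams to the same file
ls out.txt no_such_file > both.out 2>&1
cat both.out
```

Note that 2>&1 comes after the > redirection, so standard error follows standard output into the same file.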
To see the difference between standard output and standard error try these commands:
Output redirection examples
# redirect a long listing of your $HOME directory to a file
ls -la $HOME > cmd.out
# look at the contents -- you'll see just files
cat cmd.out
# this command gives an error because the target does not exist
ls -la bad_directory
# redirect any errors from ls to a file
ls -la bad_directory 2> cmd.out
# look at the contents -- you'll see an error message
cat cmd.out
# now redirect both error and output streams to the same place
ls -la bad_directory $HOME > cmd.out 2>&1
# look at the contents -- you'll see both an error message and files
cat cmd.out
Piping
The power of the Linux command line is due in no small part to the power of piping. The pipe symbol ( | ) connects one program's standard output to the next program's standard input.
A simple example is piping uncompressed data "on the fly" to a pager like more (or less):
Pipe uncompressed output to a pager
# zcat is like cat, except that it understands the gz compressed format,
# and uncompresses the data before writing it to standard output.
# So, like cat, you need to be sure to pipe the output to a pager if
# the file is large.
zcat big.fq.gz | more
# Another way to do the same thing is to use gunzip and provide the -c option,
# which says to write decompressed data to the stdout (-c for "console")
gunzip -c big.fq.gz | more
piping a histogram
But the real power of piping comes when you stitch together a string of commands with pipes – it's incredibly flexible, and fun once you get the hang of it.
For example, here's a simple way to make a histogram of mapping quality values from a subset of BAM file records.
The power of chaining pipes
# create a histogram of mapping quality scores for the 1st 1000 mapped bam records
samtools view -F 0x4 small.bam | head -1000 | cut -f 5 | sort -n | uniq -c
samtools view converts the binary small.bam file to text and writes alignment record lines one at a time to standard output.
-F 0x4 option says to filter out any records where the 0x4 flag bit is 1 (set)
since the 0x4 flag bit is set (1) for unmapped records, this says to only report records where the query sequence did map to the reference
| head -1000
the pipe connects the standard output of samtools view to the standard input of head
the -1000 option says to only write the first 1000 lines of input to standard output
| cut -f 5
the pipe connects the standard output of head to the standard input of cut
the -f 5 option says to only write the 5th field of each input line to standard output (input fields are tab-delimited by default)
the 5th field of an alignment record is an integer representing the alignment mapping quality
the resulting output will have one integer per line (and 1000 lines)
| sort -n
the pipe connects the standard output of cut to the standard input of sort
the -n option says to sort input lines according to numeric sort order
the resulting output will be 1000 numeric values, one per line, sorted from lowest to highest
| uniq -c
the pipe connects the standard output of sort to the standard input of uniq
the -c option says to just count groups of lines with the same value (that's why they must be sorted) and report the total for each group
the resulting output will be one line for each group that uniq sees
each line will have the text for the group (here the unique mapping quality values) and a count of lines in each group
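The cut, sort and uniq steps can be tried without a BAM file or samtools. This is a minimal sketch on a made-up, tab-delimited file; records.txt and its contents are hypothetical, with the 5th field standing in for the mapping quality.

```shell
cd "$(mktemp -d)"

# six made-up records: name, flag, contig, position, mapping quality
# (printf reuses the format string for each pair of arguments)
printf '%s\t0\tchr1\t100\t%s\n' r1 60 r2 0 r3 60 r4 37 r5 0 r6 60 > records.txt

# keep the 5th field, sort numerically, then count each distinct value
cut -f 5 records.txt | sort -n | uniq -c
```

uniq -c writes the count first and the value second, so the three output lines report 2 records with quality 0, 1 with quality 37 and 3 with quality 60.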
Environment variables
Environment variables are just like variables in a programming language (in fact, bash is a complete programming language): they are "pointers" that reference the data assigned to them. In bash, you assign an environment variable as shown below:
Set an environment variable
export varname="Some value, here it's a string"
Careful – do not put spaces around the equals sign when assigning environment variable values.
Also, always use double quotes if your value contains (or might contain) spaces.
You set environment variables using the bare name (varname above).
You then refer to or evaluate an environment variable using a dollar sign ( $ ) before the name:
Refer to an environment variable
echo $varname
Using the export keyword when you set a variable ensures that any sub-processes that are invoked will inherit its value. Without the export, only the current shell process will have that variable set.
Use the env command to see all the environment variables you currently have set.
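The effect of export can be seen by starting a child process. This is a minimal sketch; the variable names local_only and shared are hypothetical, and it assumes neither is already set in your environment.

```shell
# hypothetical variable names, for illustration only
local_only="not exported"
export shared="exported"

# the current shell sees both variables
echo "local_only=[$local_only] shared=[$shared]"

# a child bash process inherits only the exported one
bash -c 'echo "local_only=[$local_only] shared=[$shared]"'
# prints: local_only=[] shared=[exported]
```

The single quotes around the bash -c argument matter here: they stop the current shell from evaluating the variables, so the child process does the evaluation itself.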
Quoting in the shell
What different quote marks mean in the shell and when to use them can be quite confusing.
There are three types of quoting in the shell:
single quoting (e.g. 'some text') – this serves two purposes
it groups together all text inside the quotes into a single argument that is passed to the command
it tells the shell not to "look inside" the quotes to perform any evaluations
any environment variables in the text – or anything that looks like an environment variable – are not evaluated
no pathname globbing (e.g. *) is performed
double quoting (e.g. "some text") – also serves two purposes
it groups together all text inside the quotes into a single argument that is passed to the command
it allows environment variable evaluation (but inhibits pathname globbing)
backtick quoting (e.g. `date`)
evaluates the expression inside the backticks
the resulting standard output of the expression replaces the backticked text
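The three kinds of quoting can be compared side by side. This is a minimal sketch; the variable names name and year are hypothetical and used only for illustration.

```shell
name="world"                # hypothetical variable, for illustration only

echo 'hello $name'          # single quotes: prints hello $name
echo "hello $name"          # double quotes: prints hello world
echo '*.txt' "*.txt"        # no globbing inside either kind: prints *.txt *.txt

year=`date +%Y`             # backticks: replaced by the output of date +%Y
echo "the year is $year"
```

Bash also accepts $(date +%Y) as an equivalent way of writing the backtick form.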
Using Commands
Command options
Sitting at the computer, you should have some idea what you need to do. There's probably a command to do it. If you have some idea what it starts with, you can type a few characters and hit Tab twice to get some help. If you have no idea, you Google it or ask someone else.
Once you know a basic command, you'll soon want it to do a bit more - like seeing the sizes of files in addition to their names.
Most built-in commands in Linux use a common syntax to ask more of a command. They usually add a dash ( - ) followed by a code letter that names the added function. These "command line switches" are called options.
Options are, well, optional – you only add them when you need them. The part of the command line after the options, like filenames, are called arguments. Arguments can also be optional, but you can tell them from options because they don't start with a dash.
Useful options for ls
# long listing option (-l)
ls -l
# long listing (-l), all files (-a) and human readable file sizes (-h) options. $HOME is an argument (directory name)
ls -l -a -h $HOME
# sort by modification time (-t) displaying a long listing (-l) that includes the date and time
ls -lt
Almost all built-in Linux commands, and especially NGS tools, use options heavily.
Like dialects in a language, there are at least three basic schemes commands/programs accept options in:
Single-letter short options, which start with a single dash ( - ) and can often be combined, like:
Examples of different short options
head -20 # show 1st 20 lines
ls -lhtS # equivalent to ls -l -h -t -S
Long options use the convention that double dashes ( -- ) precede the multi-character option name, and they can never be combined. Strictly speaking, long options are a GNU convention rather than part of the POSIX standard (see https://en.wikipedia.org/wiki/POSIX), and by that convention a long option is separated from its value by the equals sign ( = ). But most programs let you use a space as separator also. Here's an example using the mira genome assembler:
Example of long options
mira --project=ct --job=denovo,genome,accurate,454 -SK:not=8
Word options, illustrated in the GATK command line to call SNPs below.
Word options combine aspects of short and long options – they usually start with a single dash ( - ), but can be multiple letters and are never combined.
Sometimes the option (e.g. java's -Xms initial memory heap size option) and its value (512m, which means 512 megabytes) are smashed together.
Other times a multi-letter switch and its value are separated by a space (e.g. -glm BOTH).
Examples of word options
java -Xms512m -Xmx4g -jar /work2/projects/BioITeam/common/opt/GenomeAnalysisTK.jar -glm BOTH -R $reference -T UnifiedGenotyper -I $outprefix.realigned.recal.bam --dbsnp $dbsnp -o $outprefix.snps.vcf -metrics snps.metrics -stand_call_conf 50.0 -stand_emit_conf 10.0 -dcov 1000 -A DepthOfCoverage -A AlleleBalance
Getting help
So you've noticed that options can be complicated – not to mention program arguments. Some options have values and others don't. Some are short, others long. How do you figure out what kinds of functions a command (or NGS tool) offers? You need help!
--help option
Many (but not all) built-in shell commands will give you some help if you provide the long --help option. This can often be many pages, so you'll probably want to pipe the output to a pager like more. This is most useful to remind yourself what the name of that dang option was, assuming you know something about it.
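As a small sketch of that advice, assuming the GNU version of ls found on most Linux systems (other implementations may not accept --help, or may word their help text differently):

```shell
# show only the first 20 lines of the help text instead of all of it
ls --help | head -20

# or search the help text for a half-remembered option, here anything about sorting
ls --help | grep -i "sort"
```

In an interactive session you would more likely pipe to a pager, as in ls --help | more.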
-h or -? options
The -h and -? options are similar to --help. If --help doesn't work, try -h or -?. Again, output can be lengthy and best used if you already have an idea what the program does.
just type the program name
Many 3rd party tools will provide extensive usage information if you just type the program name then hit Enter.
For example:
Use the program name alone as a command to get help
bwa
Produces something like this:
bwa top-level help information
Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.16a-r1181
Contact: Heng Li <lh3@sanger.ac.uk>
Usage: bwa <command> [options]
Command: index index sequences in the FASTA format
mem BWA-MEM algorithm
fastmap identify super-maximal exact matches
pemerge merge overlapping paired ends (EXPERIMENTAL)
aln gapped/ungapped alignment
samse generate alignment (single ended)
sampe generate alignment (paired ended)
bwasw BWA-SW for long queries
shm manage indices in shared memory
fa2pac convert FASTA to PAC format
pac2bwt generate BWT from PAC
pac2bwtgen alternative algorithm for generating BWT
bwtupdate update .bwt to the new format
bwt2sa generate SA from BWT and Occ
Note: To use BWA, you need to first index the genome with `bwa index'.
There are three alignment algorithms in BWA: `mem', `bwasw', and
`aln/samse/sampe'. If you are not sure which to use, try `bwa mem'
first. Please `man ./bwa.1' for the manual.
Notice that bwa, like many NGS programs, is written as a set of sub-commands. This top-level help displays the sub-commands available. You then type bwa <command> to see help for the sub-command:
Get help on bwa index
bwa index
Displays something like this:
bwa index help information
Usage: bwa index [options] <in.fasta>
Options: -a STR BWT construction algorithm: bwtsw or is [auto]
-p STR prefix of the index [same as fasta name]
-b INT block size for the bwtsw algorithm (effective with -a bwtsw) [10000000]
-6 index files named as <in.fasta>.64.* instead of <in.fasta>.*
Warning: `-a bwtsw' does not work for short genomes, while `-a is' and
Google
If you don't already know much about a command (or NGS tool), just Google it! Try something like "bwa manual" or "rsync man page". Many tools have websites that combine tool overviews with detailed option help. Even for built-in Linux commands, you're likely to get hits of a tutorial style, which are more useful when you're getting started.
And it's so much easier to read things in a nice web browser!
man pages
Linux had built-in help files way before Macs or PCs thought of such things. They're called man pages (short for manual).
For example, man intro will give you an introduction to all user commands.
man pages will detail all options available – in excruciating detail (unless there's no man page), so the manual system has its own built-in pager. The pager is sort of like less, but not quite the same (why make it easy?). We recommend man pages only for advanced users.
Basic linux commands you need to know
Here's a Linux commands cheat sheet. You may want to print a copy.
And here's a set of commands you should know, by category (under construction).
Command line arguments can be replaced by standard input
Most built-in Linux commands that obtain data from command line arguments (such as file names) can also accept the data piped in on their standard input.
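A minimal sketch of this, using wc -l (count lines) on a made-up file; lines.txt is a hypothetical name used only for illustration.

```shell
cd "$(mktemp -d)"
printf 'one\ntwo\nthree\n' > lines.txt   # hypothetical 3-line file

wc -l lines.txt          # data named as an argument: prints the count and the file name
cat lines.txt | wc -l    # same data piped to standard input: prints only the count
wc -l < lines.txt        # same again, using input redirection instead of a pipe
```

When the data arrives on standard input, wc has no file name to report, which is why only the count is printed.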
File system navigation
ls - list the contents of the specified directory
-l says produce a long listing (including file permissions, sizes, owner and group)
-a says show all files, even normally-hidden dot files whose names start with a period ( . )
-h says to show file sizes in human readable form (e.g. 12M instead of 12201749)
cd <whereto> - change the current working directory to <whereto>. Some special <wheretos>:
.. (period, period) means "up one level"
~ (tilde) means "my home directory"
file <file> tells you what kind of file <file> is
df shows you the file systems mounted on the system you're working on, along with how much disk space is available on each
-h says to show sizes in human readable form (e.g. 12G instead of 12318201749)
pwd - display the present working directory
-P says to display the full physical path, with any symbolic links resolved
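A short sketch of moving around with cd and pwd. The scratch directory, the real/sub folders and the alias link are hypothetical names, used only to show how pwd and pwd -P differ when a symbolic link is involved.

```shell
scratch=$(mktemp -d)                     # hypothetical scratch area
mkdir -p "$scratch/real/sub"
ln -s "$scratch/real" "$scratch/alias"   # alias is a symbolic link to real

cd "$scratch/alias/sub"
pwd        # the path as you typed it, ending in alias/sub
pwd -P     # the physical path with the link resolved, ending in real/sub
cd ..      # up one level, into alias
cd ~       # back to your home directory
pwd
```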
Create, rename, link to, delete files
touch <file> – create an empty file, or update the modification timestamp on an existing file
mkdir -p <dirname> – create directory <dirname>.
-p says to create any needed parent directories also, and not to complain if <dirname> already exists
mv <file1> <file2> – renames <file1> to <file2>
mv <file1> <file2> ... <fileN> <dir>/ – moves files <file1> <file2> ... <fileN> into directory <dir>
ln -s <path> creates a symbolic (-s) link to <path> in the current directory
default link name corresponds to the last name component in <path>
always change into (cd) the directory where you want the link before executing ln -s
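The commands in this group can be tried together in a throwaway directory. This is a minimal sketch; every file and directory name in it (project, sample1.txt, links and so on) is hypothetical.

```shell
cd "$(mktemp -d)"

mkdir -p project/data/raw        # -p also creates project and project/data
touch sample1.txt sample2.txt    # two empty files
mv sample1.txt renamed.txt       # rename a file
mv renamed.txt sample2.txt project/data/raw/   # move both files into a directory

mkdir links
cd links                         # change into the directory that should hold the link
ln -s ../project/data/raw        # creates a link named raw, the last component of the path
ls -l                            # the listing shows raw pointing at ../project/data/raw
ls raw                           # lists the files in the linked directory
```

Because the link target here is a relative path, it is resolved relative to the directory containing the link, which is one reason to cd there before running ln -s.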