Linux fundamentals

This page should serve as a reference for the many "things Linux" we use in this course. It is by no means complete – Linux is **huge** – but offers introductions to many important topics.

See also this page, which provides lists of the most common Linux commands, by category, as well as their most useful options: Some Linux commands

1 Some Linux commands
2 Terminal programs, shells and commands
  - 2.1.1 SSH to a remote computer
- 2.2 The bash shell REPL and commands
- 2.3 Getting help
  - 2.3.1 Use the program name alone as a command to get help
  - 2.3.2 bwa top-level help information
  - 2.3.3 Get help on bwa index
  - 2.3.4 bwa index help information
3 Terminal input
- 3.1 Literal characters and metacharacters
- 3.2 About command line input
  - 3.2.1 Multiple commands on a line
  - 3.2.2 Split a command across multiple lines
- 3.3 Text lines and the Terminal
- 3.4 Command input errors
4 Getting around in the shell
- 4.1 Command line history and editing
- 4.2 Tab key completion
- 4.3 Absolute and relative pathname syntax
- 4.4 Pathname wildcards
5 Streams and Piping
- 5.1 Standard streams and redirection
- 5.2 Piping
    - 5.2.1.1 Pipe uncompressed output to a pager
  - 5.2.2 piping a histogram
    - 5.2.2.1 The power of chaining pipes
6 Viewing text in files
- 6.1 cat, more or less
- 6.2 Introducing grep
- 6.3 head and tail
7 More Linux concepts
- 7.1 Environment variables
  - 7.1.1 Set an environment variable
  - 7.1.2 Refer to an environment variable
- 7.2 Quoting in the shell
  - 7.2.1 single and double quotes
  - 7.2.2 backtick quoting and sub-shell evaluation
- 7.3 What is text?
- 7.4 Writing multiple text lines
  - 7.4.1 heredoc
- 7.5 Arithmetic in bash
8 Bash control flow
- 8.1 the bash for loop
    - 8.1.1.1 for loop example
  - 8.1.2 processing multiple files in a for loop
    - 8.1.2.1 For loop to count sequences in multiple FASTQs
  - 8.1.3 quotes matter
- 8.2 the if statement
- 8.3 reading file lines with while
9 File attributes
- 9.1 Owner and Group
- 9.2 Permissions
10 Copying files between TACC and your laptop
- 10.1 Execute this at TACC
- 10.2 Execute this on your laptop
11 Editing files
- 11.1 nano
  - 11.1.1 Start the nano text editor
- 11.2 emacs
  - 11.2.1 Start the emacs text editor
- 11.3 Line ending nightmares
- 11.4 Komodo Edit for Mac and Windows
- 11.5 Notepad++ for Windows
12 Other bash resources

Some Linux commands

This page provides lists of the most common Linux commands, by category, as well as their most useful options:
- Some Linux commands

Terminal programs, shells and commands

You need a Terminal program in order to ssh to a remote computer.

Macs and Linux have a Terminal program built-in
Windows options:
- Windows 10+
  - Command Prompt and PowerShell programs have ssh and scp (may require latest Windows updates)
    - Start menu → Search for Command
  - Putty – http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
    - simple Terminal and file copy programs
    - download either the Putty installer or just putty.exe (Terminal) and pscp.exe (secure copy client)
  - Windows Subsystem for Linux – Windows 10 Professional includes an Ubuntu-like bash shell
    - See https://docs.microsoft.com/en-us/windows/wsl/install-win10
    - We recommend the Ubuntu Linux distribution, but any Linux distribution will have an SSH client

Use ssh (secure shell) to log in to a remote computer.

SSH to a remote computer

# General form:
ssh <user_name>@<full_host_name>

# For example
ssh abattenh@ls6.tacc.utexas.edu

The bash shell REPL and commands

When you type something in at a bash command-line prompt, it Reads the input, Evaluates it, then Prints the results, then does this over and over in a Loop. This behavior is called a REPL – a Read, Eval, Print Loop. The shell executes the command line input when it sees a linefeed, which happens when you press Enter after entering the command.

The input to the bash REPL is a command, which consists of:

The command name (any of the built-in Linux/Unix commands, or the name of a user-written script or program)
Zero or more options, usually noted with a leading dash (-) or double-dash (--).
- short (1-character) options can be provided separately, prefixed by a single dash (-)
  - or can be combined with the combination prefixed by a single dash
- long (multi-character or "word") options are prefixed with a double dash (--) and must be supplied separately.
- Both long and short options can be assigned a value
Zero or more command-line arguments, which are often (but not always) file names

Some examples using the ls (list files) command:

ls               # example 1 - no options or arguments
ls -l            # example 2 - one "short" (single character) option only (-l)
ls --help        # example 3 - one "long" (word) option (--help)
ls .profile      # example 4 - one argument, a file name (.profile)
ls --width=20    # example 5 - a long option that has a value (--width is the option, 20 is the value)
ls -w 20         # example 6 - a short option w/a value, as above, where -w is the same as --width
ls -l -a -h      # example 7 - three short options entered separately (-l -a -h)
ls -lah          # example 8 - three short options that can be combined after a dash (-lah)

The arguments to ls are one or more file or directory names. If no arguments are provided, the contents of the current directory are listed.
- If an argument is a directory name, the contents of that directory are listed.
Some handy options for ls:
- -l shows a long listing, including file permissions, ownership, size and last modified date.
- -a shows all files, including dot files whose names start with a period ( . ) which are normally not listed
- -h says to show file sizes in human readable form (e.g. 12M instead of 12201749)
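As a small self-contained sketch of how options and arguments combine, the following works in a scratch directory created with mktemp. The file names reads.fq and .hidden_config are made up for illustration, and the variable names are arbitrary.

```shell
# Work in a throwaway directory so the listing is predictable.
demo_dir=$(mktemp -d)
touch "$demo_dir/reads.fq" "$demo_dir/.hidden_config"

# No options, one argument (the directory): dot files are not shown.
plain_listing=$(ls "$demo_dir")
echo "$plain_listing"

# -a: dot files (and the special entries . and ..) are included.
all_listing=$(ls -a "$demo_dir")
echo "$all_listing"

# Three short options combined after one dash, followed by the argument.
ls -lah "$demo_dir"

# Clean up the scratch directory.
rm -r "$demo_dir"
```

Here $( ) captures a command's output in a variable so it can be printed or inspected afterwards.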

A good place to start learning built-in Linux commands and their options is on the Some Linux commands page.

Getting help

How do you find out what options and arguments a command uses?

In the Terminal, type in the command name then the --help long option (e.g. ls --help)
- Works for most Linux commands; 3rd party tools may use -h or -? or even /? instead
- May produce a lot of output, so you may need to scroll up quite a bit, or pipe the output to a pager
  - e.g. ls --help | more (a space advances the output by one screen/"page", and typing Ctrl-C will exit more)
Use the built-in manual system (e.g. type man ls)
- This system uses the less pager
- For now, just know that a space advances the output by one screen/"page", and typing q will exit the display.
Ask the Google, e.g. search for ls man page
- Can be easier to read

Many 3rd party tools, especially bioinformatics tools, may bundle a number of different functions into one command. For these tools, just typing in the command name then Enter may provide top-level usage information. For example, the bwa tool that aligns sequencing reads to a reference genome:

Use the program name alone as a command to get help

bwa

Produces something like this:

bwa top-level help information

Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.16a-r1181
Contact: Heng Li <lh3@sanger.ac.uk>

Usage:   bwa <command> [options]

Command: index         index sequences in the FASTA format
         mem           BWA-MEM algorithm
         fastmap       identify super-maximal exact matches
         pemerge       merge overlapping paired ends (EXPERIMENTAL)
         aln           gapped/ungapped alignment
         samse         generate alignment (single ended)
         sampe         generate alignment (paired ended)
         bwasw         BWA-SW for long queries

         shm           manage indices in shared memory
         fa2pac        convert FASTA to PAC format
         pac2bwt       generate BWT from PAC
         pac2bwtgen    alternative algorithm for generating BWT
         bwtupdate     update .bwt to the new format
         bwt2sa        generate SA from BWT and Occ

Note: To use BWA, you need to first index the genome with `bwa index'.
      There are three alignment algorithms in BWA: `mem', `bwasw', and
      `aln/samse/sampe'. If you are not sure which to use, try `bwa mem'
      first. Please `man ./bwa.1' for the manual.

bwa, like many bioinformatics programs, is written as a set of sub-commands. This top-level help displays the sub-commands available. You then type bwa <command> to see help for the sub-command:

Get help on bwa index

bwa index

Displays something like this:

bwa index help information

Usage:   bwa index [options] <in.fasta>

Options: -a STR    BWT construction algorithm: bwtsw or is [auto]
         -p STR    prefix of the index [same as fasta name]
         -b INT    block size for the bwtsw algorithm (effective with -a bwtsw) [10000000]
         -6        index files named as <in.fasta>.64.* instead of <in.fasta>.*

Warning: `-a bwtsw' does not work for short genomes, while `-a is' and

Of course Google works on 3rd party tools also (e.g. search for bwa manual)

Terminal input

Literal characters and metacharacters

In the bash shell, and in most tools and programming environments, there are two kinds of input:

literal characters, that just represent (and print as) themselves
- e.g. alphanumeric characters A-Z, a-z, 0-9
metacharacters - these are special characters that are associated with an operation in the environment
- e.g. the pound sign ( # ) comment character that tells the shell to ignore everything after the #

There are many metacharacters in bash: # \ $ | ~ [ ] to name a few.

Pay attention to the different metacharacters and their usages – which can depend on the context where they're used.
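To make the literal/metacharacter distinction concrete, here is a minimal sketch using two of the metacharacters listed above, # and $. The variable names are arbitrary, and the backslash escape used here is only one of several ways to make a metacharacter literal.

```shell
# An unescaped # starts a comment: the shell ignores it and the rest of the line.
plain=$(echo one two)      # three four
echo "$plain"

# A backslash removes the special meaning, so echo receives the # as text.
escaped=$(echo one two \# three four)
echo "$escaped"

# The $ metacharacter asks the shell to substitute a variable's value...
word=three
substituted=$(echo one two $word)
echo "$substituted"

# ...unless it is escaped, in which case it is passed along as itself.
literal=$(echo one two \$word)
echo "$literal"
```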

About command line input

You know the command line is ready for input when you see the command line prompt. It can be configured differently on different systems, but on our system it shows your account name, server name, current directory, then a dollar sign ($). Note the tilde character ( ~ ) signifies your Home directory.

The shell executes command line input when it sees a linefeed character (\n, also called a newline), which happens when you press Enter after entering the command.

Note: The Unix linefeed (\n) line delimiter is different from Windows, where the default line ending is carriage-return + linefeed (\r\n), and some Mac text editors that just use a carriage return (\r).
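The difference between line endings is invisible on screen, but it shows up when you count or dump the bytes. A minimal sketch, using printf to write an explicit line ending (the text ACGT is just sample data):

```shell
# printf writes exactly the bytes requested, so the line ending is explicit.
unix_bytes=$(printf 'ACGT\n' | wc -c)        # 4 letters + linefeed
windows_bytes=$(printf 'ACGT\r\n' | wc -c)   # 4 letters + carriage return + linefeed
echo "Unix line:    $unix_bytes bytes"
echo "Windows line: $windows_bytes bytes"

# od -c displays each byte, showing \r and \n as visible escapes.
printf 'ACGT\r\n' | od -c
```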

More than one command can be entered on a single line – just separate the commands with a semi-colon ( ; ).

Multiple commands on a line

cd; ls -lh

A single command can also be split across multiple lines by adding a backslash ( \ ) at the end of the line you want to continue, before pressing Enter.

Split a command across multiple lines

ls6:~$ ls ~/.bashrc \
> ~/.profile

Notice that the shell indicates that it is not done with command-line input by displaying a greater than sign ( > ). Just type the rest of the command, then press Enter when done.
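The semicolon and the backslash do opposite jobs, which this short sketch contrasts (the words alpha and beta are placeholders). In a script file no > continuation prompt appears; the backslash simply joins the two lines.

```shell
# Two separate commands on one line: each echo prints its own line.
first=$(echo alpha; echo beta)
echo "$first"

# One command split across two lines: the backslash-newline is removed,
# so echo receives both words and prints them on a single line.
joined=$(echo alpha \
beta)
echo "$joined"
```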

Use Ctrl-C to exit the current command input

At any time during command input, whether on the 1st command line prompt or at a > continuation, you can press Ctrl-c (Control key and the c key at the same time) to get back to the command prompt.

Text lines and the Terminal

Sometimes a line of text is longer than the width of your Terminal. In this case the text is wrapped. It can appear that the output is multiple lines, but it is not. For example, FASTQ files often have long lines:

head $CORENGS/misc/small.fq

Note that most Terminals let you increase/decrease the width/height of the Terminal window. But there will always be single lines too long for your Terminal width (and too many lines of text for its height).

So how many lines of output are there really? And how long is a line? The wc (word count) command can tell us this.

wc -l reports the number of lines in its input
wc -c reports the number of bytes in its input, which for plain ASCII text equals the number of characters (including invisible linefeed characters)

And when you give wc -l multiple files, it reports the line count of each, then a total.

wc -l $CORENGS/misc/small.fq           # Reports the number of lines in the small.fq file
cat $CORENGS/misc/small.fq | wc -l     # Reports the number of lines on its standard input
wc -l $CORENGS/misc/*.fq               # Reports the number of lines in all matching *.fq files
tail -1 $CORENGS/misc/small.fq | wc -c # Reports the number of characters of the last small.fq line
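If you are not on a system where $CORENGS is defined, the same ideas can be tried on a file you create yourself. This is a minimal sketch using a made-up 4-line FASTQ-style record in a temporary file:

```shell
# Build a tiny 4-line FASTQ-style record in a temporary file (made-up data).
tmp_fq=$(mktemp)
printf '@read1\nACGTACGTAC\n+\nIIIIIIIIII\n' > "$tmp_fq"

# Number of lines in the file (reading standard input avoids printing the name).
line_count=$(wc -l < "$tmp_fq")

# Length of the second line: 10 bases plus the invisible trailing linefeed.
seq_chars=$(head -2 "$tmp_fq" | tail -1 | wc -c)

echo "lines: $line_count, characters in sequence line: $seq_chars"

# Remove the temporary file.
rm "$tmp_fq"
```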

Command input errors

You don't always type in commands, options and arguments correctly – you can misspell a command name, forget to type a space, specify an unsupported option or a non-existent file, or make all kinds of other mistakes.

What happens? The shell, or the command you ran, detects the problem and reports an error message describing it as best it can. Some examples:

# You mis-type a command name, or type a command that is not installed on your system
ls6:~$ catt
catt: command not found

# You try to use an unsupported option
ls6:~$ ls -z
ls: invalid option -- 'z'
Try 'ls --help' for more information.

# You specify the name of a file that does not exist
ls6:~$ ls xxx
ls: cannot access 'xxx': No such file or directory

# You try to access a file or directory you don't have permissions for
ls6:~$ cat /etc/sudoers
cat: /etc/sudoers: Permission denied

Getting around in the shell

Type as little and as accurately as possible by using keyboard shortcuts!

Command line history and editing

Sometimes you want to repeat a command you've entered before, possibly with some changes.

The built-in history command lists the commands you've entered, each with a number.
- You can re-execute any command in the history by typing an exclamation point ( ! ) then the number
- e.g. !15 re-executes the 15th command in your history.

Use Up arrow to retrieve any of the last 50+ commands you've typed, going backwards through your history.
- You can then edit the retrieved line, and hit Enter (even in the middle of the command), and the shell will use that command.
The Down arrow "scrolls" forward from where you are in the command history.

The command line cursor (small thick bar on the command line) marks where you are on the command line.

Right arrow and Left arrow move the cursor forward or backward on the current command line.
Use Ctrl-a (holding down the Control key and a) to jump the cursor to the start of the line.
Use Ctrl-e to jump the cursor to the end of the line.
Arrow keys are also modified by Ctrl- (Windows) or Option- (Mac)
- Ctrl-right-arrow (Windows) or Option-right-arrow (Mac) will skip by "word" forward
- Ctrl-left-arrow (Windows) or Option-left-arrow (Mac) will skip by "word" backward

Once the cursor is positioned where you want it:

Just type in any additional text you want
To delete text after the cursor, use: Ctrl-d or:
- Delete key on Windows
- Function-Delete keys on Macintosh
To delete text before the cursor, use: Ctrl-h or:
- Backspace key on Windows
- Delete key on Macintosh
Use Ctrl-k (kill) to delete everything on the line after the cursor
Use Ctrl-y (yank) to paste the last killed text at the cursor position

Tab key completion

Hitting Tab when entering command line text invokes shell completion, instructing the shell to try to guess what you're doing and finish the typing for you. It's almost magic!

On most modern Linux shells you use Tab completion by pressing:

single Tab – completes file or directory name up to any ambiguous part
- if nothing shows up, there is no unambiguous match
Tab twice – display all possible completions
- you then decide where to go next
shell completion works for commands too (like bowtie)

Absolute and relative pathname syntax

An absolute pathname lists all components of the full file system hierarchy that describes a file. Absolute paths always start with the forward slash ( / ), which is the root of the file system hierarchy. Directory names are separated by the forward slash ( / ).

You can also specify a directory relative to where you are using one of the special directory names:

single period ( . ) means "the current directory"
two periods ( .. ) means "directory above the current"
tilde ( ~ ) means "my Home directory"
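The following sketch walks through these forms in a scratch directory tree. The directory names project and raw are made up, and mktemp -d is used only so the example does not touch your real files.

```shell
# Scratch directory tree; the names project/ and raw/ are hypothetical.
base=$(mktemp -d)
mkdir -p "$base/project/raw"

cd "$base/project/raw"      # absolute path: starts at the root /
here=$(pwd)

cd ..                       # relative path: up one level, into project/
parent=$(pwd)

cd ./raw                    # relative path: raw/ inside the current directory
back=$(pwd)

home_dir=$(echo ~)          # the shell replaces ~ with your Home directory

echo "$here"
echo "$parent"
echo "$back"
echo "$home_dir"

# Leave the scratch tree before removing it.
cd /
rm -r "$base"
```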

Avoid special characters in filenames

While it is possible to create file and directory names that have embedded spaces, that creates problems when manipulating them.

To avoid headaches, it is best not to create file/directory names with embedded spaces, or with special characters such as + & # ( )
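Here is a minimal sketch of the kind of problem an embedded space causes. The file name "my reads.fq" is made up, and || (run the next command only if the previous one failed) is used just to record that the unquoted form does not work.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Quotes are needed to create a single file whose name contains a space.
touch "my reads.fq"

# Without quotes the shell passes two arguments, "my" and "reads.fq".
# Neither file exists, so ls reports errors and the assignment after || runs.
unquoted_failed=no
ls my reads.fq || unquoted_failed=yes
echo "unquoted lookup failed: $unquoted_failed"

# With quotes, or a backslash before the space, the name is one argument.
quoted_listing=$(ls "my reads.fq")
escaped_listing=$(ls my\ reads.fq)
echo "$quoted_listing"
echo "$escaped_listing"

# Leave the scratch directory before removing it.
cd /
rm -r "$scratch"
```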

Pathname wildcards

The shell has shorthand to refer to groups of files by allowing wildcards in file names.

Using these wildcards is sometimes called filename globbing, and the pattern a glob.

asterisk ( * ) is the most common filename wildcard. It matches any number of any characters (including none)
brackets ( [ ] ) match any single character listed between the brackets
- and you can use a hyphen ( - ) to specify a range of characters (e.g. [A-G])
braces ( { } ) enclose a list of comma-separated strings to match (e.g. {dog,pony})

For example:

ls *.bam – lists all files in the current directory that end in .bam
ls [A-Z]*.bam – does the same, but only if the first character of the file name is a capital letter
ls [ABcd]*.bam – lists all .bam files whose 1st letter is A, B, c or d.
ls *.{fastq,fq}.gz – lists all .fastq.gz and .fq.gz files.
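These patterns can be tried safely in a scratch directory. In the sketch below the file names are invented, and echo is used instead of ls because it simply prints whatever names the shell expanded the pattern into. Setting LC_ALL=C is an assumption made here so that the [A-Z] range means exactly the ASCII capital letters; in some locales and older bash versions a range can also match lowercase letters.

```shell
# Assumption: use the C locale so [A-Z] matches only ASCII capitals.
export LC_ALL=C

scratch=$(mktemp -d)
cd "$scratch"
touch Alpha.bam beta.bam cells.bam s1.fastq.gz s2.fq.gz notes.txt

all_bam=$(echo *.bam)              # every name ending in .bam
capital_bam=$(echo [A-Z]*.bam)     # first character must be a capital letter
listed_bam=$(echo [ABcd]*.bam)     # first character is A, B, c or d
gz_files=$(echo *.{fastq,fq}.gz)   # names ending in .fastq.gz or .fq.gz

echo "$all_bam"
echo "$capital_bam"
echo "$listed_bam"
echo "$gz_files"

# Leave the scratch directory before removing it.
cd /
rm -r "$scratch"
```

Note that the shell, not the command, does the matching: the command only ever sees the list of matching names.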

Streams and Piping

Standard streams and redirection

Most Linux commands write their results to standard output, a built-in stream that is mapped to your Terminal, but that data can be redirected to a file instead.

In fact every Linux command and program has three standard Unix streams: standard input, standard output and standard error. Each has a number, a name, and redirection syntax:

standard output is stream 1
- redirect standard output to a file with the > or 1> redirection operator
  - a single > or 1> overwrites any existing data in the target file
  - a double >> or 1>> appends to any existing data in the target file
standard error is stream 2
- redirect standard error to a file with the 2> redirection operator