This page should serve as a reference for the many "things Linux" we use in this course. It is by no means complete – Linux is **huge** – but offers introductions to many important topics.
...
- Macs and Linux have a Terminal program built-in
- Windows options:
- Windows 10+
- Command Prompt and PowerShell programs have ssh and scp (may require latest Windows updates)
- Start menu → Search for Command
- Putty – http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
- simple Terminal and file copy programs
- download either the Putty installer or just putty.exe (Terminal) and pscp.exe (secure copy client)
- Windows Subsystem for Linux – Windows 10 Professional includes an Ubuntu-like bash shell
- See https://docs.microsoft.com/en-us/windows/wsl/install-win10
- We recommend the Ubuntu Linux distribution, but any Linux distribution will have an SSH client
Use ssh (secure shell) to log in to a remote computer.
| Code Block | ||||
|---|---|---|---|---|
| ||||
# General form:
ssh <user_name>@<full_host_name>
# For example
ssh abattenh@ls6.tacc.utexas.edu |
...
Of course Google works on 3rd party tools also (e.g. search for bwa manual)
Viewing text in files
cat, more or less
The most basic way to view file data is the cat command. While the name comes from its ability to concatenate one or more files, it can be used to output the contents of a single file. For example:
| Code Block |
|---|
cat ~/.profile
# or, to see line numbers in the output:
cat -n ~/.profile |
Using cat by itself is fine for small files, but it reads/writes everything in the file without stopping. So for larger files you use a pager such as more, or less. A pager reads text and outputs only one "page" of text at a time, then waits for you to ask it to advance. And a "page" of text is the number of lines that will fit on your visible Terminal.
Using the more pager:
| Code Block |
|---|
more ~/.bashrc |
- Press the spacebar to see the next page.
- If there is additional output, you'll see the --More-- indicator again; if not, the command prompt appears again.
- To end the more display, just type q (quit) or Ctrl-c.
Using the less pager:
| Code Block |
|---|
less ~/.bashrc
# to see line numbers in the output:
less -N ~/.bashrc
# to use case-insensitive matching:
less -I ~/.bashrc |
Basic navigation in less:
- Use q to quit less at any time
- space or Ctrl-f advances one page forward; Ctrl-b goes back one page
- down arrow goes down (forward) one line; up arrow goes up (backward) one line
Searching in less:
- /<pattern> – search for <pattern> in forward direction
- n – goes to the next match of <pattern>
- N – goes to the previous match of <pattern>
- ?<pattern> – search for <pattern> in backward direction
- n – previous match going back
- N – next match going forward
Introducing grep
Another method of text searching is the grep program, whose name comes from the ed editor command g/re/p (globally search for a regular expression and print the matching lines). In Unix, grep performs regular-expression text searching and displays the lines where the pattern text is found.
Nearly every programming language offers grep functionality, where a pattern you specify – a regular expression or regex – describes how the search is performed.
There are many grep regular expression metacharacters that control how the search is performed (see the grep command).
Basic usage is: grep '<pattern>' <file> where
- '<pattern>' (usually enclosed in single quotes) is the text to search for; in the simplest case it contains only alphanumeric characters (A-Z, a-z, 0-9).
Common options:
- grep -i will perform a case-insensitive search
- grep -n will display line numbers where the pattern was matched
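As a minimal sketch of these options, the commands below build a small throwaway file (the file name and its contents are made up for illustration) and search it with and without -i and -n.

```shell
# Create a small throwaway file to search (hypothetical name and contents)
tmpdir=$(mktemp -d)
printf 'Apple pie\nbanana bread\napple tart\nCherry cake\n' > "$tmpdir/menu.txt"

# Case-sensitive search: only the line containing lowercase "apple" matches
plain=$(grep 'apple' "$tmpdir/menu.txt")

# -i ignores case; -n prefixes each matching line with its line number
numbered=$(grep -i -n 'apple' "$tmpdir/menu.txt")

echo "$plain"
echo "$numbered"
rm -r "$tmpdir"
```

Single-letter options can usually be combined, so grep -in behaves the same as grep -i -n here.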
Terminal input
Literal characters and metacharacters
In the bash shell, and in most tools and programming environments, there are two kinds of input:
- literal characters, that just represent (and print as) themselves
- e.g. alphanumeric characters A-Z, a-z, 0-9
- metacharacters – special characters that are associated with an operation in the environment
- e.g. the pound sign ( # ) comment character that tells the shell to ignore everything after the #
There are many metacharacters in bash: # \ $ | ~ [ ] to name a few.
Pay attention to the different metacharacters and their usages – which can depend on the context where they're used.
About command line input
You know the command line is ready for input when you see the command line prompt. It can be configured differently on different systems, but on our system it shows your account name, server name, current directory, then a dollar sign ($). Note the tilde character ( ~ ) signifies your Home directory.
The shell executes command line input when it sees a linefeed character (\n, also called a newline), which happens when you press Enter after entering the command.
| Expand | ||
|---|---|---|
| ||
Note: The Unix linefeed (\n) line delimiter is different from Windows, where the default line ending is carriage-return + linefeed (\r\n), and from some Mac text editors that use just a carriage return (\r). |
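If a file with Windows line endings causes trouble on Linux, one common approach is to delete the carriage returns with tr. The sketch below uses a made-up file to show the effect; the file names are assumptions for illustration only.

```shell
# Build a two-line file with Windows-style \r\n line endings (hypothetical file)
tmpdir=$(mktemp -d)
printf 'line one\r\nline two\r\n' > "$tmpdir/windows.txt"

# tr -d '\r' deletes every carriage return, leaving Unix-style \n endings
tr -d '\r' < "$tmpdir/windows.txt" > "$tmpdir/unix.txt"

# Each line loses exactly one byte (its \r)
windows_bytes=$(wc -c < "$tmpdir/windows.txt")
unix_bytes=$(wc -c < "$tmpdir/unix.txt")
echo "$windows_bytes bytes before, $unix_bytes bytes after"
rm -r "$tmpdir"
```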
More than one command can be entered on a single line – just separate the commands with a semi-colon ( ; ).
| Code Block | ||||
|---|---|---|---|---|
| ||||
cd; ls -lh |
A single command can also be split across multiple lines by adding a backslash ( \ ) at the end of the line you want to continue, before pressing Enter.
| Code Block | ||||
|---|---|---|---|---|
| ||||
ls6:~$ ls ~/.bashrc \
> ~/.profile |
Notice that the shell indicates that it is not done with command-line input by displaying a greater than sign ( > ). Just type the rest of the command, then press Enter when done.
| Tip | ||
|---|---|---|
| ||
At any time during command input, whether on the 1st command line prompt or at a > continuation, you can press Ctrl-c (Control key and the c key at the same time) to get back to the command prompt. |
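The two forms described above can also be tried in a script. This is a small sketch with placeholder echo commands, not commands from the course:

```shell
# Two commands on one line, separated by a semicolon: both run, in order
first=$(echo "one"; echo "two")

# One command split across two lines: the trailing backslash tells the
# shell that the command continues on the next line
joined=$(echo "alpha" \
    "beta")

echo "$first"
echo "$joined"
```

The backslash must be the very last character on the line; a space after it prevents the continuation.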
Text lines and the Terminal
Sometimes a line of text is longer than the width of your Terminal. In this case the text is wrapped. It can appear that the output is multiple lines, but it is not. For example, FASTQ files often have long lines:
| Code Block | ||
|---|---|---|
| ||
head $CORENGS/misc/small.fq |
Note that most Terminals let you increase/decrease the width/height of the Terminal window. But there will always be single lines too long for your Terminal width (and too many lines of text for its height).
So how many lines of output are there really? And how long is a line? The wc (word count) command can tell us this.
- wc -l reports the number of lines in its input
- wc -c reports the number of bytes in its input, which for plain ASCII text is the number of characters (including invisible linefeed characters)
And when you give wc -l multiple files, it reports the line count of each, then a total.
| Code Block | ||
|---|---|---|
| ||
wc -l $CORENGS/misc/small.fq # Reports the number of lines in the small.fq file
cat $CORENGS/misc/small.fq | wc -l # Reports the number of lines on its standard input
wc -l $CORENGS/misc/*.fq # Reports the number of lines in all matching *.fq files
tail -1 $CORENGS/misc/small.fq | wc -c # Reports the number of characters of the last small.fq line |
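To try wc without access to the course files, here is a self-contained sketch using a tiny made-up file (the name reads.txt and its contents are assumptions). Reading from standard input with < makes wc print only the number, without the file name.

```shell
# A tiny two-line file: "ACGT" and "ACGTACGT" (hypothetical contents)
tmpdir=$(mktemp -d)
printf 'ACGT\nACGTACGT\n' > "$tmpdir/reads.txt"

# Number of lines in the file
line_count=$(wc -l < "$tmpdir/reads.txt")

# Number of bytes in the file, counting the linefeed that ends each line
byte_count=$(wc -c < "$tmpdir/reads.txt")

# Length of the last line only, again including its linefeed
last_line_len=$(tail -n 1 "$tmpdir/reads.txt" | wc -c)

echo "$line_count lines, $byte_count bytes, last line is $last_line_len bytes"
rm -r "$tmpdir"
```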
Command input errors
You don't always type in commands, options and arguments correctly – you can misspell a command name, forget to type a space, specify an unsupported option or a non-existent file, or make all kinds of other mistakes.
What happens? The shell, or the command you ran, tries to identify the kind of error and reports an appropriate error message as best it can. Some examples:
| Code Block | ||
|---|---|---|
| ||
# You mis-type a command name, or use a command that is not installed on your system
ls6:~$ catt
catt: command not found
# You try to use an unsupported option
ls6:~$ ls -z
ls: invalid option -- 'z'
Try 'ls --help' for more information.
# You specify the name of a file that does not exist
ls6:~$ ls xxx
ls: cannot access 'xxx': No such file or directory
# You try to access a file or directory you don't have permissions for
ls6:~$ cat /etc/sudoers
cat: /etc/sudoers: Permission denied |
Getting around in the shell
Type as little and as accurately as possible by using keyboard shortcuts!
Command line history and editing
Sometimes you want to repeat a command you've entered before, possibly with some changes.
- The built-in history command lists the commands you've entered, each with a number.
- You can re-execute any command in the history by typing an exclamation point ( ! ) then the number
- e.g. !15 re-executes the 15th command in your history.
- Use Up arrow to retrieve any of the last 50+ commands you've typed, going backwards through your history.
- You can then edit the retrieved line, and hit Enter (even in the middle of the command), and the shell will use that command.
- The Down arrow "scrolls" forward from where you are in the command history.
The command line cursor (small thick bar on the command line) marks where you are on the command line.
- Right arrow and Left arrow move the cursor forward or backward on the current command line.
- Use Ctrl-a (holding down the Control key and a) to jump the cursor to the start of the line.
- Use Ctrl-e to jump the cursor to the end of the line.
- Arrow keys are also modified by Ctrl- (Windows) or Option- (Mac)
- Ctrl-right-arrow (Windows) or Option-right-arrow (Mac) will skip by "word" forward
- Ctrl-left-arrow (Windows) or Option-left-arrow (Mac) will skip by "word" backward
Once the cursor is positioned where you want it:
- Just type in any additional text you want
- To delete text after the cursor, use: Ctrl-d or:
- Delete key on Windows
- Function-Delete keys on Macintosh
- To delete text before the cursor, use: Ctrl-h or:
- Backspace key on Windows
- Delete key on Macintosh
- Use Ctrl-k (kill) to delete everything on the line after the cursor
- Use Ctrl-y (yank) to paste the last killed text at the cursor position
Tab key completion
Hitting Tab when entering command line text invokes shell completion, instructing the shell to try to guess what you're doing and finish the typing for you. It's almost magic!
On most modern Linux shells you use Tab completion by pressing:
- single Tab – completes file or directory name up to any ambiguous part
- if nothing shows up, there is no unambiguous match
- Tab twice – display all possible completions
- you then decide where to go next
- shell completion works for commands too (like bowtie)
Absolute and relative pathname syntax
An absolute pathname lists all components of the full file system hierarchy that describes a file. Absolute paths always start with the forward slash ( / ), which is the root of the file system hierarchy. Directory names are separated by the forward slash ( / ).
You can also specify a directory relative to where you are using one of the special directory names:
- single period ( . ) means "the current directory"
- two periods ( .. ), typed with no space between them, means "directory above the current"
- tilde ( ~ ) means "my Home directory"
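The special directory names can be explored safely in a scratch directory. In this sketch the project/data directory names are made up for illustration:

```shell
# Build a small scratch hierarchy and move into its deepest directory
tmpdir=$(mktemp -d)
mkdir -p "$tmpdir/project/data"
cd "$tmpdir/project/data"

# "." is the current directory, so this path ends in project/data
here=$(cd . && pwd)

# ".." is the directory above the current one, so this path ends in project
parent=$(cd .. && pwd)

# "~" expands to the absolute path of your Home directory
home_dir=$(echo ~)

echo "$here"
echo "$parent"
echo "$home_dir"

# Return to where we started and clean up
cd "$OLDPWD"
rm -r "$tmpdir"
```

Each cd inside $( ) runs in a subshell, so it does not change the directory of the shell running the script.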
Avoid special characters in filenames
| Tip |
|---|
While it is possible to create file and directory names that have embedded spaces, that creates problems when manipulating them. To avoid headaches, it is best not to create file/directory names with embedded spaces, or with special characters such as + & # ( ) |
Pathname wildcards
The shell has shorthand to refer to groups of files by allowing wildcards in file names.
Using these wildcards is sometimes called filename globbing, and the pattern a glob.
- asterisk ( * ) is the most common filename wildcard. It matches any number of any characters (including none)
- brackets ( [ ] ) match any single character listed between the brackets
- and you can use a hyphen ( - ) to specify a range of characters (e.g. [A-G])
- braces ( { } ) enclose a list of comma-separated strings to match (e.g. {dog,pony})
For example:
- ls *.bam – lists all files in the current directory that end in .bam
- ls [A-Z]*.bam – does the same, but only if the first character of the file is a capital letter
- ls [ABcd]*.bam – lists all .bam files whose 1st letter is A, B, c or d.
- ls *.{fastq,fq}.gz – lists all .fastq.gz and .fq.gz files.
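The sketch below tries these patterns on empty files with made-up names in a scratch directory. It uses echo rather than ls so that the expanded names appear on one line, and it sets LC_ALL=C as an assumption so that [A-Z] means exactly the ASCII capital letters regardless of your locale settings. Brace lists are a bash feature and may not be available in other shells.

```shell
# Assumption: force plain ASCII ordering so [A-Z] matches capitals only
export LC_ALL=C

# Create empty files with hypothetical names in a scratch directory
tmpdir=$(mktemp -d)
cd "$tmpdir"
touch Alpha.bam beta.bam Gamma.bam delta.bam reads.fastq.gz reads.fq.gz notes.txt

# The shell replaces each pattern with the matching names before echo runs
all_bam=$(echo *.bam)          # every name ending in .bam
upper_bam=$(echo [A-Z]*.bam)   # .bam names starting with a capital letter
abcd_bam=$(echo [ABcd]*.bam)   # .bam names starting with A, B, c or d
gz=$(echo *.{fastq,fq}.gz)     # bash only: names ending in .fastq.gz or .fq.gz

echo "$all_bam"
echo "$upper_bam"
echo "$abcd_bam"
echo "$gz"

cd "$OLDPWD"
rm -r "$tmpdir"
```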
Streams and Piping
Standard streams and redirection
Most Linux commands write their results to standard output, a built-in stream that is mapped to your Terminal, but that data can be redirected to a file instead.
In fact every Linux command and program has three standard Unix streams: standard input, standard output and standard error. Each has a number, a name, and redirection syntax:
- standard input is stream 0
- redirect standard input from a file with the < or 0< redirection operator
- standard output is stream 1
- redirect standard output to a file with the > or 1> redirection operator
- a single > or 1> overwrites any existing data in the target file
- a double >> or 1>> appends to any existing data in the target file
- standard error is stream 2
- redirect standard error to a file with the 2> redirection operator
- a single 2> overwrites any existing data in the target file
- a double 2>> appends to any existing data in the target file
It is easy to not notice the difference between standard output and standard error when you're in an interactive Terminal session – because both outputs are sent to the Terminal window. But they are separate streams, with different meanings. In particular, programs write error and/or diagnostic messages to standard error, not to standard output.
Here's a command that shows the difference between standard error and standard output:
| Code Block | ||
|---|---|---|
| ||
ls /etc/fstab xxx.txt |
Produces this output in your Terminal:
| Code Block |
|---|
ls: cannot access 'xxx.txt': No such file or directory
/etc/fstab |
What is not obvious, since both streams are displayed on the Terminal, is that:
- the diagnostic text "ls: cannot access 'xxx.txt': No such file or directory" is being written to standard error
- the listing of the existing file ("/etc/fstab") is being written to standard output
To see this, redirect standard output and standard error to different files and look at their contents:
| Code Block | ||
|---|---|---|
| ||
ls /etc/fstab xxx.txt 1> stdout.txt 2>stderr.txt
cat stdout.txt # Displays "/etc/fstab"
cat stderr.txt # Displays "ls: cannot access 'xxx.txt': No such file or directory" |
What if you want both standard output and standard error to go to the same file? You use this somewhat odd 2>&1 redirection syntax:
| Code Block | ||
|---|---|---|
| ||
# Redirect both standard output and standard error to the out.txt file
ls /etc/fstab xxx.txt > out.txt 2>&1
# Display the contents of the out.txt file
cat out.txt
# produces output like this:
ls: cannot access 'xxx.txt': No such file or directory
/etc/fstab |
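The same redirections can be tried without depending on any particular file. This sketch defines a hypothetical helper function, report, that writes one line to each stream, then redirects its output in the ways described above.

```shell
tmpdir=$(mktemp -d)

# Hypothetical helper that writes one line to each output stream
report() {
    echo "result data"          # written to standard output (stream 1)
    echo "progress note" 1>&2   # written to standard error (stream 2)
}

# Send each stream to its own file (overwriting any existing data)
report 1> "$tmpdir/stdout.txt" 2> "$tmpdir/stderr.txt"

# Append a second copy of standard output; standard error is overwritten
report 1>> "$tmpdir/stdout.txt" 2> "$tmpdir/stderr.txt"

# Send both streams to the same file
report > "$tmpdir/both.txt" 2>&1

out_lines=$(wc -l < "$tmpdir/stdout.txt")
err_text=$(cat "$tmpdir/stderr.txt")
both_lines=$(wc -l < "$tmpdir/both.txt")
echo "$out_lines stdout lines, stderr says '$err_text', $both_lines combined lines"
rm -r "$tmpdir"
```

Note that the order matters in the combined form: 2>&1 must come after > both.txt, because it means "send stream 2 to wherever stream 1 is going right now".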
Two final notes.
- When standard output is redirected to a file, the data is not displayed on the Terminal
- If you want the data written to both standard output (the Terminal) and a file, use the tee command
- e.g. ls -l ~ | tee home_dir_listing.log
- There is a special Linux file called /dev/null that serves as a "global trash can" – it just throws away anything you write to it.
- So you can direct standard output and/or standard error to /dev/null to ignore it completely.
When running batch programs and scripts you will want to manipulate standard output and standard error from programs appropriately – especially for 3rd party programs that often produce both results data and diagnostic/progress messages.
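As a brief sketch of the two notes above, the commands below use tee to both save and pass along some text, and /dev/null to discard a diagnostic message. The file name listing.log and the echoed text are placeholders.

```shell
tmpdir=$(mktemp -d)

# tee writes its standard input to the named file AND to standard output
shown=$(printf 'one\ntwo\n' | tee "$tmpdir/listing.log")
saved=$(cat "$tmpdir/listing.log")

# Standard error is thrown away by sending it to /dev/null;
# standard output is unaffected
quiet=$( { echo "kept"; echo "discarded" 1>&2; } 2> /dev/null )

echo "$shown"
echo "$quiet"
rm -r "$tmpdir"
```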
Piping
Most programs/commands read input data from some source, then write output to some destination. A data source can be a file, but can also be standard input. Similarly, a data destination can be a file but can also be a stream such as standard output.
The power of the Linux command line is due in no small part to the power of piping. The pipe operator ( | ) connects one program's standard output to the next program's standard input.
A simple example is piping uncompressed data "on the fly" to count its lines using wc -l (word count command with the lines option).
| Code Block | ||||
|---|---|---|---|---|
| ||||
# zcat is like cat, except that it understands the gz compressed format,
# and uncompresses the data before writing it to standard output.
# So, like cat, you need to be sure to pipe the output to a pager if
# the file is large.
zcat big.fq.gz | wc -l
|
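To try this without a large file, the sketch below compresses a single made-up FASTQ record and counts its lines. Since each FASTQ record occupies 4 lines, dividing the line count by 4 gives the number of sequences. The file name tiny.fq is an assumption for illustration.

```shell
# Write one hypothetical FASTQ record (4 lines) and compress it
tmpdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n' > "$tmpdir/tiny.fq"
gzip "$tmpdir/tiny.fq"      # replaces tiny.fq with tiny.fq.gz

# Uncompress to standard output and pipe straight into wc -l;
# no uncompressed copy is written to disk
line_count=$(zcat "$tmpdir/tiny.fq.gz" | wc -l)

# Shell arithmetic: 4 lines per FASTQ record
record_count=$(( line_count / 4 ))

echo "$line_count lines, $record_count record(s)"
rm -r "$tmpdir"
```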
Piping a histogram
But the real power of piping comes when you stitch together a string of commands with pipes – it's incredibly flexible, and fun once you get the hang of it.
For example, here's a simple way to make a histogram of mapping quality values from a subset of BAM file records.
| Code Block | ||||
|---|---|---|---|---|
| ||||
# create a histogram of mapping quality scores for the 1st 1000 mapped bam records
samtools view -F 0x4 small.bam | head -1000 | cut -f 5 | sort -n | uniq -c |
- samtools view converts the binary small.bam file to text and writes alignment record lines one at a time to standard output.
- -F 0x4 option says to filter out any records where the 0x4 flag bit is 1 (set)
- since the 0x4 flag bit is set (1) for unmapped records, this says to only report records where the query sequence did map to the reference
- | head -1000
- the pipe connects the standard output of samtools view to the standard input of head
- the -1000 option says to only write the first 1000 lines of input to standard output
- | cut -f 5
- the pipe connects the standard output of head to the standard input of cut
- the -f 5 option says to only write the 5th field of each input line to standard output (input fields are tab-delimited by default)
- the 5th field of an alignment record is an integer representing the alignment mapping quality
- the resulting output will have one integer per line (and 1000 lines)
- | sort -n
- the pipe connects the standard output of cut to the standard input of sort
- the -n option says to sort input lines according to numeric sort order
- the resulting output will be 1000 numeric values, one per line, sorted from lowest to highest
- | uniq -c
- the pipe connects the standard output of sort to the standard input of uniq
- the -c option says to just count groups of adjacent lines with the same value (that's why they must be sorted) and report the total for each group
- the resulting output will be one line for each group that uniq sees
- each line will have the text for the group (here the unique mapping quality values) and a count of lines in each group
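The cut, sort and uniq stages can be tried without samtools or a BAM file. The sketch below assumes a few made-up tab-delimited records (stored in a hypothetical records variable) whose 5th field plays the role of the mapping quality.

```shell
# Hypothetical stand-in for "samtools view" output: 6 tab-delimited lines.
# printf reuses its format for each pair of arguments (read name, quality).
records=$(printf 'r%s\t0\tchr1\t100\t%s\n' 1 60 2 0 3 60 4 30 5 60 6 0)

# Same pipeline tail as above: take field 5, sort numerically, count each value
histogram=$(echo "$records" | cut -f 5 | sort -n | uniq -c)
echo "$histogram"
# reports a count of 2 for value 0, 1 for value 30, and 3 for value 60
```

Each output line shows the count first and then the value, which is the order uniq -c uses.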
Viewing text in files
cat, more or less
The most basic way to view file data is the cat command. While the name comes from its ability to concatenate one or more files, it can be used to output the contents of a single file. For example:
| Code Block |
|---|
cat ~/.profile
# or, to see line numbers in the output:
cat -n ~/.profile |
Using cat by itself is fine for small files, but it writes everything in the file to the Terminal without stopping. So for larger files, use a pager such as more or less. A pager reads text and outputs only one "page" of text at a time, then waits for you to ask it to advance. A "page" of text is the number of lines that will fit on your visible Terminal.
Using the more pager:
| Code Block |
|---|
more ~/.bashrc |
- Press the spacebar to see the next page.
If there is additional output, you'll see the --More-- indicator again; if not, the command prompt appears again.
- To end the more display, just type q (quit) or Ctrl-c.
Using the less pager:
| Code Block |
|---|
less ~/.bashrc
# to see line numbers in the output:
less -N ~/.bashrc
# to use case-insensitive matching:
less -I ~/.bashrc |
Basic navigation in less:
- Use q to quit less at any time
- space or Ctrl-f advances one page forward; Ctrl-b goes back one page
- down arrow goes down (forward) one line; up arrow goes up (backward) one line
Searching in less:
- /<pattern> – search for <pattern> in forward direction
- n – goes to the next match of <pattern>
- N – goes to the previous match of <pattern>
- ?<pattern> – search for <pattern> in backward direction
- n – previous match going back
- N – next match going forward
Introducing grep
Another method of text searching is using the grep program, whose name comes from the ed editor command g/re/p (globally search for a regular expression and print). In Unix, the grep program performs regular-expression text searching, and displays lines where the pattern text is found.
Nearly every programming language offers grep-like functionality, where a pattern you specify – a regular expression or regex – describes how the search is performed.
There are many grep regular expression metacharacters that control how the search is performed (see the grep command).
Basic usage is: grep '<pattern>' <file> where
- '<pattern>' (usually enclosed in single quotes) in the simplest case just contains alphanumeric characters (A-Z, a-z, 0-9), which match themselves literally.
Common options:
- grep -i will perform a case-insensitive search
- grep -n will display line numbers where the pattern was matched
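As a small illustration of the basic usage and the two options above, the sketch below searches a made-up four-line temporary file. The file contents and the variable names are assumptions for the demo.

```shell
# Hypothetical demo file with four lines
tmpfile=$(mktemp)
printf '%s\n' 'Alpha one' 'beta two' 'ALPHA three' 'gamma four' > "$tmpfile"

# Case-sensitive search: matches only the line containing "Alpha"
exact=$(grep 'Alpha' "$tmpfile")

# Case-insensitive search: matches "Alpha one" and "ALPHA three"
anycase=$(grep -i 'alpha' "$tmpfile")

# Options can be combined: -i and -n together prefix each match with its line number
numbered=$(grep -in 'alpha' "$tmpfile")
echo "$numbered"

rm "$tmpfile"
```

With -n, each matching line is prefixed by its line number and a colon, here 1: and 3:.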
head and tail
Two other commands that are useful for viewing text are head and tail.
- With no options, head shows the first 10 lines of its input and tail shows the last 10 lines.
- Use the -n option followed by a number to specify how many lines to view
- or just put the number you want after a dash (e.g. -5 for 5 lines or -1 for 1 line)
- use the tail -n +<integer> syntax to display all input starting from that line
Examples:
| Code Block |
|---|
head ~/.bashrc # view the 1st 10 file lines
head -n 2 ~/.bashrc # view the 1st 2 file lines
head -5 ~/.bashrc # view the 1st 5 file lines
tail ~/.bashrc # view the last 10 file lines
tail -n 3 ~/.bashrc # view the last 3 file lines
tail -1 ~/.bashrc # view the last line of the file
# view 7 lines of text starting at line 20
tail -n +20 ~/.bashrc | head -7 |
Since head and tail do not have an option to display line numbers, you can pipe in text that includes line numbers with cat -n:
| Code Block |
|---|
cat -n ~/.bashrc | head -4 # view the 1st 4 lines w/line numbers
cat -n ~/.bashrc | tail -5 # view the last 5 lines w/line numbers
# view 6 lines of text starting at line 25
cat -n ~/.bashrc | tail -n +25 | head -6 |
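Since the contents of ~/.bashrc differ from account to account, here is a sketch of the same head and tail combinations on predictable input. It assumes the seq command, which writes the numbers 1 to 30 one per line, so the text of each line is also its line number.

```shell
# First 3 lines of the input
first3=$(seq 1 30 | head -n 3)

# Last 2 lines of the input
last2=$(seq 1 30 | tail -n 2)

# 7 lines starting at line 20: tail -n +20 starts output at line 20,
# then head keeps only the first 7 of those lines (lines 20 through 26)
window=$(seq 1 30 | tail -n +20 | head -n 7)
echo "$window"
```

Combining tail -n +<integer> with head in this way is a general method for extracting any range of lines from the middle of a file.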
More Linux concepts
Environment variables
...
Navigation and operations in nano are similar to those we discussed in Command line editing
You can just type in text, and navigate around using arrow keys (up/down/left/right). A couple of other navigation shortcuts:
...