Part 4: Advanced text manipulation
- 1 Example data files
- 1.1 Exercise 3-1
- 2 Cut, sort, uniq
- 2.1 cut
- 2.1.1 Exercise 3-2
- 2.2 sort
- 2.2.1 Exercise 3-3
- 2.3 uniq
- 2.4 piping a histogram with cut/sort/uniq -c
- 2.4.1 Exercise 3-4
- 2.5 Ensuring uniqueness of field combinations
- 2.5.1 Exercise 3-5
- 3 Introducing awk
- 4 Regular expressions in grep, sed and perl
- 4.1 regular expressions
- 4.2 grep pattern matching
- 4.2.1 Exercise 3-7
- 4.3 perl pattern matching
- 4.3.1 Exercise 3-8
- 4.4 sed pattern substitution
- 4.5 perl pattern substitution
- 4.5.1 Exercise 3-9
- 5 Bash control flow
- 5.1 The bash for loop
- 5.1.1 Quotes matter
- 5.1.2 Exercise 3-10
- 5.2 The if statement
- 5.3 Reading file lines with while
- 5.3.1 Exercise 3-11
- 6 A few odds and ends
Example data files
For some of the discussions below, we'll use some files in your ~/data directory.
The ~/data/walrus_sounds.tsv file lists the types of sounds made by several well-known walruses, and the length of each occurrence. Tab-delimited fields are:
column 1 - walrus name
column 2 - sound type
column 3 - length of sound
Take a look at the first few lines of this file:
cd ~/data
head walrus_sounds.tsv
The .tsv filename extension stands for tab separated values, indicating that the field separator (the character separating fields) is Tab. We can verify this using the handy hexdump alias we defined for you as discussed at Intro Unix: What is text?
cd ~/data
head walrus_sounds.tsv | hexdump
The output looks like this, where the hexadecimal 0x09 character is a Tab.
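The same check can be tried on a tiny inline sample. This is a minimal sketch that assumes the standard od utility as a stand-in for the course's hexdump alias (whose definition is not shown here); the sample line is made up for illustration.

```shell
# Hypothetical two-field line; the fields are separated by a literal Tab.
sample_line=$(printf 'walrus\tgrunt')

# od -An -tx1 prints each byte as two hex digits, without the offset column.
# The Tab shows up as 09 between the bytes of the two words,
# and the trailing newline shows up as 0a.
hex_bytes=$(printf '%s\n' "$sample_line" | od -An -tx1)
echo "$hex_bytes"
```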
We will also use two data files from the GSAF's (Genome Sequencing and Analysis Facility) automated processing that delivers sequencing data to customers. These files have information about customer Samples (libraries of DNA molecules to sequence on the machine), grouped into sets assigned as Jobs, and sequenced on GSAF's sequencing machines as part of sequencer Runs.
The files are also in your ~/data directory:
- joblist.txt contains Job name/Run name pairs, Tab-delimited, no header
the "JAnnnnn" items in the 1st column are Jobs
the "SAnnnnn" items in the 2nd column are Runs
- sampleinfo.txt contains information about all samples run on a particular run, along with the job each belongs to.
columns (Tab-delimited) are job_name, job_id, sample_name, sample_id, date_string
column names are in an initial header line
Take a look at the first few lines of these files also:
cd ~/data
head joblist.txt
head sampleinfo.txt
Exercise 3-1
What field separators are used in ~/data/joblist.txt and ~/data/sampleinfo.txt?
How many lines do these sample files have?
Cut, sort, uniq
cut
The cut command lets you isolate ranges of data from its input lines (from files or standard input):
cut -f <field_number(s)> extracts one or more fields (-f) from each line
the default field delimiter is Tab
use -d <delim> to change the field delimiter
cut -c <character_number(s)> extracts one or more characters (-c) from each line
the <numbers> can be
a comma-separated list of numbers (e.g. 1,4,7)
a hyphen-separated range (e.g. 2-5)
a trailing hyphen says "and all items after that" (e.g. 3,7-)
cut does not re-order fields, so cut -f 5,3,1 acts like -f 1,3,5
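The -d option, the trailing-hyphen range, and the no-reordering rule can all be checked on inline data. This is a small sketch; the comma-delimited record below is hypothetical and is not one of the course data files.

```shell
# Hypothetical comma-delimited record (made up for illustration).
record='JA00001,SA12001,2012-01-15'

# -d changes the delimiter from the default Tab to a comma.
first_and_third=$(echo "$record" | cut -d ',' -f 1,3)
echo "$first_and_third"

# A trailing hyphen means "this field and everything after it".
from_second=$(echo "$record" | cut -d ',' -f 2-)
echo "$from_second"

# cut never reorders: asking for 3,1 still yields field 1 then field 3.
reordered=$(echo "$record" | cut -d ',' -f 3,1)
echo "$reordered"
```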
Examples:
cd ~/data
cut -f 2 joblist.txt | head
head joblist.txt | cut -c 9-13
# no field reordering, so these two produce the same output
cut -f1,2 walrus_sounds.tsv | head
cut -f2,1 walrus_sounds.tsv | head
Exercise 3-2
How would you extract the first 5 job_name and sample_name fields from ~/data/sampleinfo.txt without including the header? Recall that job_name and sample_name are fields 1 and 3 of ~/data/sampleinfo.txt.
sort
sort sorts its input lines using an efficient algorithm
by default sorts each line lexically (as strings), low to high
use -n to sort numerically
use -V for Version sort (numbers with surrounding text)
use -r to reverse the sort order
use one or more -k <start_field_number>,<end_field_number> options to specify a range of "keys" (fields) to sort on
use this option when you want to preserve all data on the input lines, but sort on part(s) of the line
e.g. -k1,1 -k2,2nr to sort field 1 lexically then field 2 as a number high-to-low
by default, fields are delimited by whitespace – one or more spaces or Tabs
use -t <delim> to change the field delimiter (e.g. -t $'\t' for Tab; the bash $'...' quoting turns \t into a real Tab character, which a plain "\t" does not)
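As a quick illustration of multi-key sorting, here is a minimal sketch on made-up Tab-delimited data. The variable names and rows are hypothetical. The Tab is produced with printf so the snippet does not depend on bash-specific quoting.

```shell
# Hypothetical Tab-delimited rows: a name, then a count.
rows=$(printf 'b\t2\na\t10\nb\t10\na\t9\n')

# A literal Tab character to pass to sort -t.
tab=$(printf '\t')

# -k1,1   sorts on field 1 lexically
# -k2,2nr breaks ties on field 2 numerically, high to low
sorted_rows=$(echo "$rows" | sort -t "$tab" -k1,1 -k2,2nr)
echo "$sorted_rows"
```

Without the n modifier on the second key, 9 would sort after 10 because the comparison would be done character by character.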
Examples:
Here we use cut to isolate text we want to sort:
cd ~/data
cut -f 2 joblist.txt | head | sort # sort the 1st 10 Runs in
# "joblist.txt"
cut -f 2 joblist.txt | head | sort -r # reverse-sort the 1st 10 Runs
# in "joblist.txt"
# reverse-sort all Jobs in "joblist.txt" then look at the 1st 10
cut -f 1 joblist.txt | sort -r | head
But we can also sort lines based on one or more fields specified by the -k option:
cd ~/data
# reverse sort (high-to-low) the lines of "joblist.txt"
# according to the data in field 1 (Job), then view the top 10 lines
sort -k1,1r joblist.txt | head
# sort lines of "walrus_sounds.tsv" by sound type (field 2)
# then by walrus (field 1) & look at 20
cat walrus_sounds.tsv | sort -k2,2 -k1,1 | head -20
Exercise 3-3
Which walruses make the longest sounds?
Which walruses make the shortest grunts?
Here's an example of using the -V (Version sort) option to sort numbers-with-text:
# produce 4 lines of output with integers then an "x"
echo -e "12x\n2x\n91x\n31x"
# "Version sort" these lines with -V, high number to low
echo -e "12x\n2x\n91x\n31x" | sort -Vr
uniq
uniq takes sorted input and collapses adjacent groups of identical values
uniq -c says also report a count of the number of members in each group (before collapsing)
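Because uniq only compares adjacent lines, unsorted input can leave repeats behind. The sketch below shows the difference on a short, made-up list of sound types.

```shell
# Hypothetical list of sound types, deliberately unsorted.
sounds=$(printf 'grunt\nbellow\ngrunt\ngrunt\nbellow\n')

# Without sorting, uniq only collapses the adjacent pair of "grunt" lines.
unsorted_groups=$(echo "$sounds" | uniq | wc -l)

# After sorting, identical values are adjacent, so each appears once.
sorted_groups=$(echo "$sounds" | sort | uniq | wc -l)

echo "groups without sort: $unsorted_groups"
echo "groups with sort:    $sorted_groups"
```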
Examples:
cd ~/data
head walrus_sounds.tsv | cut -f 2 | sort # look at the 1st
# 10 walrus
# sounds, sorted
head walrus_sounds.tsv | cut -f 2 | sort | uniq # collapse the 1st
# 10 (sorted)
# sounds
head walrus_sounds.tsv | cut -f 2 | sort | uniq -c # add a count of
# the items in
# each group
piping a histogram with cut/sort/uniq -c
One of my favorite "Unix tricks" is combining sort and uniq calls to produce a histogram-like ordered list of count/value pairs.
cd ~/data
# report counts of each type of walrus sound
cut -f 2 walrus_sounds.tsv | sort | uniq -c
# output:
33 bellow
52 chortle
34 gong
36 grunt
45 whistle
Since each line of this output consists of a number then a sound name, separated by whitespace (one or more spaces or Tabs), we can use sort's -k1,1nr option to sort it by numerical count, highest to lowest:
# Take the reported sound counts and reverse sort it numerically
# by column 1 (the count) to see the most common sounds made
cut -f 2 walrus_sounds.tsv | sort | uniq -c | sort -k1,1nr
# output:
52 chortle
45 whistle
36 grunt
34 gong
33 bellow
We affectionately refer to this "cut | sort | uniq -c | sort -k1,1nr" idiom as "piping a histogram".
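The idiom can be tried without any data files. This is a minimal sketch on hypothetical inline records; the names and sounds are made up. The last step, which pulls the top value out of the histogram with awk, is an optional extra rather than part of the idiom itself.

```shell
# Hypothetical Tab-delimited records: a name, then a sound type.
records=$(printf 'A\tgrunt\nB\tgong\nA\tgrunt\nC\tgrunt\nB\tgong\nC\tbellow\n')

# cut isolates field 2, sort groups identical values, uniq -c counts
# each group, and the final sort orders the counts high to low.
histogram=$(echo "$records" | cut -f 2 | sort | uniq -c | sort -k1,1nr)
echo "$histogram"

# The first line holds the most common value; awk drops the count.
most_common=$(echo "$histogram" | head -1 | awk '{print $2}')
echo "most common: $most_common"
```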
Exercise 3-4
How many different walruses are represented in the ~/data/walrus_sounds.tsv file?
Which walrus has the most recorded sounds?
Job names are in column 1 of the ~/data/sampleinfo.txt file. Create a histogram of Job names showing the count of samples (lines) for each, and show Jobs with the most samples first.
The Run names in joblist.txt start with SAyy where yy are year numbers. Report how many runs occurred in each year.
Which Run in joblist.txt has the most jobs?
Ensuring uniqueness of field combinations
Sometimes you'll get a table of data that should contain unique values of a certain field, or a certain combination of fields. Piping cut | sort | uniq | wc -l can help verify this, and a histogram can then show which values are duplicated.
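Here is a minimal sketch of the check on hypothetical inline Job/Run pairs (not the real joblist.txt contents). It also uses uniq -d, a standard uniq option not otherwise covered in this section, as one alternative way to list only the repeated values.

```shell
# Hypothetical Tab-delimited Job/Run pairs; JA002 appears twice.
pairs=$(printf 'JA001\tSA11001\nJA002\tSA11001\nJA002\tSA12002\nJA003\tSA12002\n')

# Compare the total line count with the count of distinct Job names.
total_lines=$(echo "$pairs" | wc -l)
unique_jobs=$(echo "$pairs" | cut -f 1 | sort | uniq | wc -l)
echo "lines: $total_lines, unique jobs: $unique_jobs"

# If the two counts differ, list only the values that repeat.
# uniq -d prints one copy of each duplicated line.
duplicate_jobs=$(echo "$pairs" | cut -f 1 | sort | uniq -d)
echo "duplicated: $duplicate_jobs"
```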
Example: Are all the Job names in joblist.txt unique?
cd ~/data
wc -l joblist.txt # Reports 3841 Job/Run
# entries
cut -f 1 joblist.txt | sort | uniq | wc -l # But there are only
# 3167 unique Jobs
# So there are some Jobs that appear more than once -- but which ones?
# Use our "piping a histogram" trick but only look at the
# highest-count entries
cut -f 1 joblist.txt | sort | uniq -c | sort -k1,1nr | head
Exercise 3-5
Are all combinations of Job/Run in joblist.txt unique?
Introducing awk
awk is a powerful scripting language that is easily invoked from the command line. It is especially useful for handling tabular data.
One way of using it:
awk '<script>' - the '<script>' is applied to each line of input (generally piped in)
always enclose '<script>' in single quotes to inhibit shell evaluation
awk has its own set of metacharacters that are different from the shell's
A basic awk script has the following form:
BEGIN {<expressions>}
{<body expressions>}
END {<expressions>}
Here's a simple awk script that takes the average of the numbers passed to it, using the seq command to generate the numbers from 1 to 10, each on its own line.
seq 10 | awk '
BEGIN{sum=0; ct=0;}
{sum = sum + $1
ct = ct + 1}
END{print sum/ct,"is the mean of",ct,"numbers"}'
Notes:
Once the single quote ( ' ) to start the script is seen on the 1st line, there is no need for special line-continuation.
Just enter the script text then finish with a closing single quote when done.
Multiple expressions can appear on the same line if separated by a semicolon ( ; )
The BEGIN and END clauses are optional, and are executed only once, before and after input is processed, respectively
BEGIN {<expressions>} – use to initialize variables before any script body lines are executed
Script variables ct and sum are initialized to 0 in the BEGIN block above
Some important built-in variables you may want to initialize:
FS (input Field Separator) - used to delimit fields
default is whitespace -- one or more spaces or Tabs
e.g. FS=":" to specify a colon
OFS (Output Field Separator) - use to delimit output fields
default is a single space
e.g. OFS="\t" to specify a Tab
The body expressions are executed for each line of input.
Each line is parsed into fields based on the specified input field separator
Fields can then be accessed via built-in variables $1 (1st field), $2 (2nd field) and so forth.
the built-in NF variable represents the Number of Fields in a given line
the built-in NR variable represents the Number of the current Record (line)
awk has the usual set of arithmetic operators (+, /, etc)
and comparison operators (==, >, <, etc); note that a single = is assignment, not comparison
and an if ( <expression> ) { <action> } conditional construct
The END block is executed when there is no more input
The print statement in the END block takes a comma-separated list of values
each value is separated by awk's default output field separator (a single space)
literal text is specified using double quotes ("is the mean of")
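The built-in variables and the if construct described in these notes can be combined in one short script. This is a sketch on hypothetical colon-delimited input; the data and the threshold of 5 are made up for illustration.

```shell
# Hypothetical colon-delimited input (not a course data file).
lines=$(printf 'alpha:3\nbeta:12\ngamma:7\n')

# FS=":" splits input on colons; OFS="\t" joins printed values with Tabs.
# NR is the current line number and NF the number of fields on that line.
# The if construct keeps only lines whose 2nd field is greater than 5.
report=$(echo "$lines" | awk '
BEGIN{FS=":"; OFS="\t"}
{if ($2 > 5) { print NR, NF, $1, $2 }}')
echo "$report"
```

Note that OFS only applies between the comma-separated values of a print statement; values placed side by side without commas are concatenated with nothing between them.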
Exercise 3-6
Use awk to print out the highest Job (JA) and Run (SA) in joblist.txt.
Use the if ( <expression> ) { <action> } conditional construct for comparisons (e.g. ==, >, <).
A more complicated awk script
Now let's write a more complicated awk script to explore its capabilities further. Our goal is to sum up the walrus sound times in ~/data/walrus_sounds.tsv and print out that total in seconds, minutes and hours. And let's write the script bit-by-bit to show how we can "debug as we go" on the command line.
# First isolate the sound length field (field 3)
# We'll use head to test our code on just a few lines until
# we like our script
head walrus_sounds.tsv | cut -f 3 | awk '{print}'
# We see the times are in MM:SS (minutes, seconds) so we'll use
# FS=":" to specify colon as the input field separator
head walrus_sounds.tsv | cut -f3 | awk 'BEGIN{FS=":"}{print $1,$2}'
# Looks good - minutes are coming out as field 1 and seconds as field 2
# Now calculate the number of seconds for each line with some math
head walrus_sounds.tsv | cut -f3 | awk '
BEGIN{FS=":"}
{seconds = $2 + ($1 * 60)
print $1,$2,seconds}'
# Now add each sound's seconds to a global total
head walrus_sounds.tsv | cut -f3 | awk '
BEGIN{FS=":"; total=0}
{seconds = $2 + ($1 * 60)
total = total + seconds
print $1,$2,seconds,total}'
# Now process all the input and just output the final totals
cat walrus_sounds.tsv | cut -f3 | awk '
BEGIN{FS=":"; total=0}
{seconds = $2 + $1 * 60
total = total + seconds}
END{ print "total seconds:",total
print "total minutes:",total/60
print "total hours: ",total/60/60}'
# One final improvement: use the printf function to format the
# output to control how many decimal places are shown.
cat walrus_sounds.tsv | cut -f3 | awk '
BEGIN{FS=":"; total=0}
{seconds = $2 + $1 * 60
total = total + seconds}
END{ printf("total seconds: %d\n", total)
printf("total minutes: %.2f\n",total/60)
printf("total hours: %.2f\n",total/60/60)}'
To learn more, here's an excellent awk tutorial, very detailed and in-depth.
printf and sprintf functions come from the C programming language, but many higher-level languages implement similar text formatting. Wikipedia has a nice table of printf format specifiers (https://en.wikipedia.org/wiki/Printf#Type_field) as part of its thorough printf page.
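A few of the most common format specifiers can be seen side by side in a one-line awk script. This is a small sketch on a made-up input line, not an exhaustive list of what printf supports.

```shell
# awk's printf takes a format string followed by the values to insert:
#   %d    integer
#   %.2f  floating point number with 2 decimal places
#   %5s   string right-aligned in a 5-character column
#   %s    string as-is
formatted=$(echo "7 2.5 gong" | awk '{printf("%d|%.2f|%5s|%s\n", $1, $2, $3, $3)}')
echo "$formatted"
```

Unlike print, printf does not add a newline on its own, which is why the format string ends with \n.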
Parsing field-oriented text with cut and awk
The basic functions of cut and awk are similar – both are field oriented. Here are the main differences:
Default field separators
Tab is the default field separator for cut
and the field separator can only be a single character
whitespace (one or more spaces or Tabs) is the default field separator for awk