Part 4: Advanced text manipulation

Example data files

For some of the discussions below, we'll use some files in your ~/data directory.

The ~/data/walrus_sounds.tsv file lists the types of sounds made by several well-known walruses, and the length of each occurrence. Its tab-delimited fields are:

  • column 1 - walrus name

  • column 2 - sound type

  • column 3 - length of sound

Take a look at the first few lines of this file:

cd ~/data
head walrus_sounds.tsv

The .tsv filename extension stands for tab separated values, indicating that the field separator (the character separating fields) is tab. We can verify this using the handy hexdump alias we defined for you as discussed at Intro Unix: What is text? 

cd ~/data
head walrus_sounds.tsv | hexdump

In the output, each hexadecimal 0x09 character is a Tab.
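If the hexdump alias is not available, the standard od command can show the same thing. The following is a minimal sketch on a made-up one-line sample (not the real walrus file); the variable name sample_hex is only for illustration.

```shell
# Hypothetical sample line: two fields separated by a tab
sample_hex=$(printf 'ET\tgrunt\n' | od -An -tx1)

# Each byte is shown as two hex digits; 09 is the tab, 0a the newline
echo "$sample_hex"
```

The 09 byte appears between the bytes for "ET" and "grunt", which is where the field separator sits.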

We will also use two data files from the GSAF's (Genome Sequencing and Analysis Facility) automated processing that delivers sequencing data to customers. These files have information about customer Samples (libraries of DNA molecules to sequence on the machine), grouped into sets assigned as Jobs, and sequenced on GSAF's sequencing machines as part of sequencer Runs.

The files are also in your ~/data directory:

  • joblist.txt - contains Job name/Run name pairs, tab-delimited, no header

    • the "JAnnnnn" items in the 1st column are Jobs

    • the "SAnnnnn" items in the 2nd column are Runs

      • several Jobs can be associated with the same Run

  • sampleinfo.txt - contains information about all samples sequenced on a particular run, along with the job each belongs to.

    • columns (tab-delimited) are job_name, job_id, sample_name, sample_id, date_string

    • column names are in an initial header line

Take a look at the first few lines of these files also:

cd ~/data
head joblist.txt
head sampleinfo.txt

Exercise 3-1

What field separators are used in ~/data/joblist.txt and ~/data/sampleinfo.txt?

cd ~/data
head -1 joblist.txt | hexdump
head -2 sampleinfo.txt | hexdump

The hexdump output shows that both files use tab to separate fields.

How many lines do these sample files have?

Use the word count command with the -l (count lines) option:

wc -l 

cd ~/data
wc -l *.txt *.tsv
# or
wc -l *.{txt,tsv}

shows:

3841 joblist.txt
  44 sampleinfo.txt
 200 walrus_sounds.tsv
4085 total

Cut, sort, uniq

cut

The cut command lets you isolate ranges of data from its input lines (from files or standard input):

  • cut -f <field_number(s)> extracts one or more fields (-f) from each line

    • the default field delimiter is tab

      • use -d <delim> to change the field delimiter

  • cut -c <character_number(s)> extracts one or more characters (-c) from each line

  • the <numbers> can be

    • a comma-separated list of numbers (e.g. 1,4,7)

    • a hyphen-separated range (e.g. 2-5)

    • a trailing hyphen says "and all items after that" (e.g. 3,7-)

  • cut does not re-order fields, so cut -f 5,3,1 acts like -f 1,3,5

Examples:

cd ~/data
cut -f 2 joblist.txt | head
head joblist.txt | cut -c 9-13

# no field reordering, so these two produce the same output
cut -f1,2 walrus_sounds.tsv | head
cut -f2,1 walrus_sounds.tsv | head
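To try the cut options without the course data files, here is a small self-contained sketch. The rows and variable names are made up for illustration; only the cut options themselves come from the discussion above.

```shell
# Hypothetical sample rows (name, sound, length), tab-delimited
rows=$(printf 'ET\tgrunt\t4\nAntje\tbellow\t10\n')

# Fields 1 and 3 (tab is the default delimiter)
f13=$(printf '%s\n' "$rows" | cut -f 1,3)

# Asking for 3,1 does not re-order: same result as 1,3
f31=$(printf '%s\n' "$rows" | cut -f 3,1)

# Comma-delimited input needs -d; "2-" means field 2 and all after it
csv=$(printf 'a,b,c\n' | cut -d ',' -f 2-)

# Characters 1-2 of each line
chars=$(printf '%s\n' "$rows" | cut -c 1-2)

printf '%s\n' "$f13" "$f31" "$csv" "$chars"
```

The first two results are identical, which is the "no re-ordering" behavior described above.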

Exercise 3-2

How would you extract the first 5 job_name and sample_name fields from ~/data/sampleinfo.txt without including the header? Recall that job_name and sample_name are fields 1 and 3 of ~/data/sampleinfo.txt.

Use tail -n +2 (or tail +2) to skip the header line and start at line 2.
Then use cut -f to isolate the desired fields.

tail -n +2 ~/data/sampleinfo.txt | head -5 | cut -f 1,3
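The same header-skipping pattern can be tried on a tiny stand-in table. This is only a sketch: the table contents below are invented and have just three columns, so the fields of interest are still 1 and 3.

```shell
# Hypothetical table with a header line (tab-delimited)
table=$(printf 'job_name\tjob_id\tsample_name\nJA1\t1\ts1\nJA1\t1\ts2\nJA2\t2\ts3\n')

# tail -n +2 starts output at line 2, so the header is dropped
first_two=$(printf '%s\n' "$table" | tail -n +2 | head -2 | cut -f 1,3)
printf '%s\n' "$first_two"
```

Putting tail first matters: with head first, the header would count as one of the requested lines.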

sort

sort sorts its input lines using an efficient algorithm

  • by default sorts each line lexically (as strings), low to high

    • use -n to sort numerically

    • use -V for Version sort (numbers with surrounding text)

    • use -r to reverse the sort order

  • use one or more -k <start_field_number>,<end_field_number> options to specify a range of "keys" (fields) to sort on

    • use this option when you want to preserve all data on the input lines, but sort on part(s) of the line

    • e.g. -k1,1 -k2,2nr to sort field 1 lexically then field 2 as a number high-to-low

  • by default, fields are delimited by whitespace – one or more spaces or tabs

    • use -t <delim> to change the field delimiter (e.g. -t $'\t' for tab in bash; a plain "\t" is passed as two characters, which sort rejects)

Examples:

Here we use cut to isolate text we want to sort:

cd ~/data
# sort the 1st 10 Runs in "joblist.txt"
cut -f 2 joblist.txt | head | sort
# reverse-sort the 1st 10 Runs in "joblist.txt"
cut -f 2 joblist.txt | head | sort -r
# reverse-sort all Jobs in "joblist.txt" then look at the 1st 10
cut -f 1 joblist.txt | sort -r | head

But we can also sort lines based on one or more fields specified by the -k option:

cd ~/data
# reverse sort (high-to-low) the lines of "joblist.txt"
# according to the data in field 1 (Job), then view the top 10 lines
sort -k1,1r joblist.txt | head

# sort lines of "walrus_sounds.tsv" by sound type (field 2)
# then by walrus (field 1) & look at the 1st 20
cat walrus_sounds.tsv | sort -k2,2 -k1,1 | head -20

# sort lines of "walrus_sounds.tsv" by a single key spanning
# walrus (field 1) through sound type (field 2) & look at the 1st 20
# (a key's end field must not come before its start field)
cat walrus_sounds.tsv | sort -k1,2 | head -20
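Here is a minimal sketch of a two-key sort with an explicit tab delimiter, run on made-up rows rather than the real walrus file. The tab is built with printf so the example does not rely on bash-only quoting; the variable names are hypothetical.

```shell
# A literal tab character to pass to sort -t
tab=$(printf '\t')

# Hypothetical rows: name, sound, length
rows=$(printf 'ET\tgrunt\t4\nAntje\tgrunt\t10\nET\tbellow\t7\nAntje\tbellow\t2\n')

# Key 1: sound type (field 2), lexical
# Key 2: length (field 3), numeric, high to low
sorted=$(printf '%s\n' "$rows" | sort -t "$tab" -k2,2 -k3,3nr)
printf '%s\n' "$sorted"
```

All bellow lines come before all grunt lines, and within each sound the longest length is listed first.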

Exercise 3-3

Which walruses make the longest sounds?

# lengths are numbers, so sort field 3 numerically (n), high to low (r)
sort -k3,3nr ~/data/walrus_sounds.tsv | head

Looks like ET and Antje make the longest sounds

Which walruses make the shortest grunts?

# sort field 3 numerically (n), low to high, then keep only the grunts
sort -k3,3n ~/data/walrus_sounds.tsv | grep 'grunt' | head

Looks like Jocko and ET make the shortest grunts

Here's an example of using the -V (Version sort) option to sort numbers-with-text:

# produce 4 lines of output with integers then an "x"
echo -e "2x\n12x\n310x\n31x"
# sorting lexically puts the strings in alphabetical order
echo -e "2x\n12x\n310x\n31x" | sort
# "Version sort" these lines with -V, high number to low
echo -e "2x\n12x\n310x\n31x" | sort -Vr

Note: In order to be meaningful, all the strings should have the same text suffix (x above).
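Version sort is also handy when the number follows a common text prefix. The sketch below assumes GNU sort (where -V is available) and uses made-up chromosome-style names; LC_ALL=C is set on the lexical sort only to make its ordering predictable.

```shell
# Hypothetical names with a shared "chr" prefix
names=$(printf 'chr10\nchr2\nchr1\n')

# Lexical sort compares character by character, so chr10 lands before chr2
lexical=$(printf '%s\n' "$names" | LC_ALL=C sort)

# Version sort compares the embedded numbers as numbers
version=$(printf '%s\n' "$names" | sort -V)

echo "lexical:"; printf '%s\n' "$lexical"
echo "version:"; printf '%s\n' "$version"
```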

uniq

uniq takes sorted input and collapses adjacent groups of identical values

  • uniq -c says also report a count of the number of members in each group (before collapsing)

Examples:

cd ~/data
# look at the 1st 10 walrus sounds, sorted
head walrus_sounds.tsv | cut -f 2 | sort
# collapse the 1st 10 (sorted) sounds
head walrus_sounds.tsv | cut -f 2 | sort | uniq
# add a count of the items in each group
head walrus_sounds.tsv | cut -f 2 | sort | uniq -c
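Because uniq only collapses adjacent identical lines, skipping the sort can leave repeated values in the output. A small sketch on made-up input shows the difference; the variable names are just for illustration.

```shell
# Hypothetical sound names where the two "grunt" lines are not adjacent
sounds=$(printf 'grunt\nbellow\ngrunt\n')

# Without sort, nothing is adjacent, so nothing collapses
unsorted_groups=$(printf '%s\n' "$sounds" | uniq | wc -l | tr -d ' ')

# With sort, the two "grunt" lines become adjacent and collapse into one
sorted_groups=$(printf '%s\n' "$sounds" | sort | uniq | wc -l | tr -d ' ')

echo "without sort: $unsorted_groups lines; with sort: $sorted_groups lines"
```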

“Piping a histogram” with cut/sort/uniq -c

One of my favorite "Unix tricks" is combining sort and uniq calls to produce a histogram-like ordered list of count/value pairs.

cd ~/data
# report counts of each type of walrus sound
cut -f 2 walrus_sounds.tsv | sort | uniq -c

# output:
33 bellow
52 chortle
34 gong
36 grunt
45 whistle

Since each line of this output consists of a number then a sound name, separated by whitespace (one or more spaces or tabs), we can use sort's -k1,1nr option to sort it by numerical count, highest to lowest:

# Take the reported sound counts and reverse sort it numerically
# by column 1 (the count) to see the most common sounds made
cut -f 2 walrus_sounds.tsv | sort | uniq -c | sort -k1,1nr

# output:
52 chortle
45 whistle
36 grunt
34 gong
33 bellow

We affectionately refer to this "cut | sort | uniq -c | sort -k1,1nr" idiom as "piping a histogram".
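The idiom can be tried end to end on a handful of made-up values. This is a sketch only: the input stands in for the result of a cut -f 2 call, and the variable names are hypothetical.

```shell
# Hypothetical sound names, one per line (stand-in for a cut -f 2 result)
sounds=$(printf 'grunt\nbellow\ngrunt\ngong\ngrunt\nbellow\n')

# sort groups identical values, uniq -c counts each group,
# and the final sort orders the counts numerically, high to low
hist=$(printf '%s\n' "$sounds" | sort | uniq -c | sort -k1,1nr)
printf '%s\n' "$hist"
```

Note that uniq -c pads the counts with leading spaces; sort -k1,1nr still works because blanks before a numeric key are ignored.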

# Look at a file of yeast annotations (a Genome Feature File, GFF)
zcat /mnt/bioi/ref_genome/sgd/saccharomyces_cerevisiae.20161203.gff.gz \
  | less

# Real data lines start after the comment lines in the header,
# and some basic information is in tab-delimited columns 1-8

# Look at columns 1-8 of the first non-comment lines
zcat /mnt/bioi/ref_genome/sgd/saccharomyces_cerevisiae.20161203.gff.gz \
  | grep -v '^#' | cut -f 1-8 | head

# There's also some unstructured sequence data lines at the end, so
# exclude those too
zcat /mnt/bioi/ref_genome/sgd/saccharomyces_cerevisiae.20161203.gff.gz \
  | grep -v '^#' | grep -P -v '^[AGTCN]+' | cut -f 1-8 | tail

# Pipe a histogram of annotation entries for each chromosome (column 1)
zcat /mnt/bioi/ref_genome/sgd/saccharomyces_cerevisiae.20161203.gff.gz \
  | grep -v '^#' | grep -P -v '^[AGTCN]+' | cut -f 1 \
  | sort | uniq -c

# Looks like there are also lines starting with >chr, so exclude them
zcat /mnt/bioi/ref_genome/sgd/saccharomyces_cerevisiae.20161203.gff.gz \
  | grep -v '^#' | grep -P -v '^[AGTCN>]+' | cut -f 1 \
  | sort | uniq -c

# What types of features (column 3) are in these annotations?
zcat /mnt/bioi/ref_genome/sgd/saccharomyces_cerevisiae.20161203.gff.gz \
  | grep -v '^#' | grep -P -v '^[AGTCN>]+' | cut -f 3 \
  | sort | uniq -c | sort -k1,1n

# How many genes on each chromosome?
zcat /mnt/bioi/ref_genome/sgd/saccharomyces_cerevisiae.20161203.gff.gz \
  | grep -v '^#' | grep -P -v '^[AGTCN>]+' | grep -P '\tgene\t' \
  | cut -f 1 | sort | uniq -c

# Which chromosome has the most genes?
zcat /mnt/bioi/ref_genome/sgd/saccharomyces_cerevisiae.20161203.gff.gz \
  | grep -v '^#' | grep -P -v '^[AGTCN>]+' | grep -P '\tgene\t' \
  | cut -f 1 | sort | uniq -c | sort -k1,1nr | head -1

Exercise 3-4

How many different walruses are represented in the ~/data/walrus_sounds.tsv file?

cut -f 1 ~/data/walrus_sounds.tsv | sort | uniq | wc -l     
reports 3 different walrus names

Which walrus has the most recorded sounds?

cut -f 1 ~/data/walrus_sounds.tsv | sort | uniq -c | sort -k1,1nr
Looks like ET has the most sounds:
  69 ET
  68 Antje 
  63 Jocko

Job names are in column 1 of the ~/data/sampleinfo.txt file. Create a histogram of Job names showing the count of samples (lines) for each, and show Jobs with the most samples first.

Job JA19060 has the most samples (35)

tail -n +2 sampleinfo.txt | cut -f 1 | sort | uniq -c \
  | sort -k1,1nr | head

The Run names in joblist.txt start with SAyy where yy are year numbers. Report how many runs occurred in each year.

First isolate the Run field characters 1-4 (or 3,4) to isolate the years. Then sort, count unique, sort...

cut -f 2 joblist.txt | cut -c1-4 | sort | uniq -c | sort -k1,1nr

# output
753 SA15
713 SA14
625 SA16
531 SA17
462 SA13
422 SA18
260 SA12
 74 SA19
  1 SA99

Which Run in joblist.txt has the most jobs?

The Run name is in field 2

cat joblist.txt | cut -f 2 | sort | uniq -c | sort -k1,1nr \
  | head -1
# 23 SA13038

Ensuring uniqueness of field combinations

Sometimes you'll get a table of data that should contain unique values of a certain field, or of a certain combination of fields. Piping cut | sort | uniq | wc -l can help verify this, and "piping a histogram" can show which values are duplicated.

Example: Are all the Job names in joblist.txt unique?

cd ~/data
# Reports 3841 Job/Run entries
wc -l joblist.txt
# But there are only 3167 unique Jobs
cut -f 1 joblist.txt | sort | uniq | wc -l

# So there are some Jobs that appear more than once -- but which ones?
# Use our "piping a histogram" trick but only look at the
# highest-count entries
cut -f 1 joblist.txt | sort | uniq -c | sort -k1,1nr | head
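As an alternative way to list only the repeated values, uniq has a standard -d option that prints one copy of each line that occurs more than once. The sketch below uses made-up Job names, not the real joblist.txt contents.

```shell
# Hypothetical Job names: JA1 appears 3 times, JA2 twice, JA3 once
job_names=$(printf 'JA1\nJA2\nJA1\nJA3\nJA2\nJA1\n')

# uniq -d keeps only values that are duplicated (input must be sorted)
dups=$(printf '%s\n' "$job_names" | sort | uniq -d)
printf '%s\n' "$dups"
```

Unlike the histogram approach, this does not report how many times each value occurs; it only says which values repeat.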

Exercise 3-5

Are all combinations of Job/Run in joblist.txt unique?

Yes

cd ~/data
# Reports 3841 Job/Run entries
wc -l joblist.txt
# And 3841 unique Job/Run combinations
cut -f 1,2 joblist.txt | sort | uniq | wc -l
# or just, since there are only 2 fields:
sort joblist.txt | uniq | wc -l

Yes, all entries are unique.

Introducing awk

awk is a powerful scripting language that is easily invoked from the command line. It is especially useful for handling tabular data.

One way of using it:

  • awk '<script>' - the '<script>'  is applied to each line of input (generally piped in)

    • always enclose '<script>' in single quotes to inhibit shell evaluation

    • awk has its own set of metacharacters that are different from the shell's

A basic awk script has the following form, with three clauses, all of which are optional:

BEGIN {<expressions>}
{<body expression(s)>}
END {<expressions>}

awk example 1

Here’s how to output the first few lines of joblist.txt with the Run in column 1 and the Job in column 2:

head joblist.txt | awk '{print $2,$1}'

Notes:

  • The { <body expression(s)> } are executed for every line of input.

  • awk refers to fields with $N, where N is the field/column number.

  • awk's default input field delimiter is whitespace (one or more spaces or tabs)

    • can be changed in the BEGIN block (e.g. BEGIN{ FS="\t" })

      • or on the command line (e.g. awk -F '\t')

  • Write output in awk with the print function

    • followed by a comma-separated list of fields, strings or variable names, e.g.

      • print "field 2 is:",$2

      • BEGIN{ total=0 } {total= total + $1} END{ print "total:",total }

      • enclose text and special characters in double quotes ( " ) inside awk scripts

  • awk's default output field delimiter is a single space

    • can be changed in the BEGIN block (e.g. BEGIN{ OFS="\t" })

  • spaces are generally optional in clauses

  • Any BEGIN or END blocks are executed once

    • BEGIN before any input is processed, and END when there is no more input data

Since the original joblist.txt file was tab-separated, we modify our script to do the same, changing the output field delimiter:

head joblist.txt | awk 'BEGIN{ OFS="\t" }{ print $2, $1 }'
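Setting the input delimiter matters as soon as a field can contain spaces. The following sketch uses a made-up line (not taken from joblist.txt) to contrast awk's default whitespace splitting with tab splitting; the variable names are hypothetical.

```shell
# Hypothetical tab-delimited line whose two fields each contain a space
line=$(printf 'Job A\tRun B')

# Default whitespace splitting sees 4 fields, so $2 is just "A"
default_split=$(printf '%s\n' "$line" | awk '{ print $2 }')

# Tab as both input (FS) and output (OFS) delimiter keeps each field whole
tab_split=$(printf '%s\n' "$line" | awk 'BEGIN{ FS="\t"; OFS="\t" }{ print $2, $1 }')

printf '%s\n' "$default_split" "$tab_split"
```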

awk example 2

An awk script to add two numbers on a line:

echo "14 73" | awk '
BEGIN{tot=0}
{tot = $1 + $2
 print ""; print $1, "+", $2, "=", tot; print "" }'

# output is (including empty lines):

14 + 73 = 87

Notes:

  • Once the single quote ( ' ) to start the script is seen on the 1st line, there is no need for special line-continuation.

    • Just enter the script text then finish with a closing single quote when done.

  • Multiple expressions can appear on the same line if separated by a semicolon ( ; )

  • print "" just prints an empty line

    • print with no arguments would print the entire input line
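A short sketch of the last two notes: two statements separated by a semicolon on one line, with a bare print echoing the whole input line. The multiplication is just an arbitrary second statement chosen for illustration.

```shell
# A bare print outputs the entire input line;
# the second statement prints the product of the two fields
result=$(echo "14 73" | awk '{ print; print $1 * $2 }')
printf '%s\n' "$result"
```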

awk example 3

Here's an awk script that takes the average of the numbers passed to it, using the seq command to generate the numbers from 1 to 10, each on its own line.

seq 10 | awk '
BEGIN{sum=0; ct=0;}
{sum = sum + $1
 ct = ct + 1}
END{print sum/ct,"is the mean of",ct,"numbers"}'

Notes:

  • The BEGIN and END clauses are optional, and are executed only once, before and after input is processed, respectively

  • BEGIN {<expressions>}  –  use to initialize variables before any script body lines are executed

    • Script variables ct and sum are initialized to 0 in the BEGIN block above

    • Some important built-in variables you may want to initialize:

      • FS (input Field Separator) - used to delimit input data fields

        • default is whitespace -- one or more spaces or tabs

        • e.g. FS=":" to specify a colon

      • OFS (Output Field Separator) - use to delimit output fields

        • default is a single space

        • e.g. OFS="\t" to specify a tab

  • The body expressions are executed for each line of input.

    • Each line is parsed into fields based on the specified input field separator 

    • Fields can then be accessed via built-in variables $1 (1st field), $2 (2nd field) and so forth.
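Putting these pieces together, here is a minimal sketch that sets FS in the BEGIN block, accumulates one field in the body, and reports in the END block. The three rows are made up (they are not from walrus_sounds.tsv), and the variable names are hypothetical.

```shell
# Hypothetical rows: name, sound, length (tab-delimited)
rows=$(printf 'ET\tgrunt\t4\nAntje\tbellow\t10\nJocko\tgong\t7\n')

# BEGIN sets the tab delimiter and initializes the counters,
# the body adds up field 3 for every line,
# END prints the mean once all input has been read
mean_line=$(printf '%s\n' "$rows" | awk '
BEGIN{ FS="\t"; sum=0; ct=0 }
{ sum = sum + $3; ct = ct + 1 }
END{ print "mean length:", sum/ct }')

printf '%s\n' "$mean_line"
```

With empty input ct would stay 0 and the division would fail, so a real script would likely check ct before dividing.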