...
The output looks like this, where the hexadecimal 0x09 character is a Tab.
We will also use two data files from the automated processing that the GSAF (Genome Sequencing and Analysis Facility) uses to deliver sequencing data to customers. These files have information about customer Samples (libraries of DNA molecules to sequence on the machine), grouped into sets assigned as Jobs, and sequenced on GSAF's sequencing machines as part of sequencer Runs.
...
A regular expression (regex) is a pattern of characters to search for and metacharacters that control and modify how matching is done.
The Intro Unix: Some Linux commands: Regular expressions section lists a nice set of "starter" metacharacters. Open that page now as a reference for this section.
...
- -n tells perl to feed the input one line at a time (here 4 lines)
- -e introduces the perl script
- Always enclose a command-line perl script in single quotes to protect it from shell evaluation
- perl has its own set of metacharacters that are different from the shell's
- $_ is a built-in Perl variable holding the current line (including any invisible line-ending characters)
- =~ and !~ are the perl pattern matching operators
- =~ tests whether the pattern matches
- !~ tests whether the pattern does not match
- the forward slashes ("/ /") enclose the regex pattern
- the pattern matching operation returns true or false, to be used in a conditional statement
- here "print current line if the pattern matches"
...
Use perl pattern matching to count the number of Runs in joblist.txt that were not run in 2015.
...
Code Block | ||
---|---|---|
| ||
for number in `seq 5`; do
  echo $number
done

for num in $(seq 5); do echo $num; done
|
Quotes matter
In the Review of some basics: Quoting in the shell section, we saw that double quotes allow the shell to evaluate certain metacharacters in the quoted text.
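As a quick reminder of that behavior, here is a minimal sketch. The variable name fruit and the text are hypothetical, chosen only for illustration.

```shell
# Hypothetical variable used only for this illustration
fruit="apple"

# Double quotes: the shell evaluates $fruit and $(...) inside the text
double="I like $fruit and $(echo pears)"
echo "$double"

# Single quotes: the text is taken literally, with no evaluation
single='I like $fruit and $(echo pears)'
echo "$single"
```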
...
Here's the weird bash syntax for arithmetic (integer arithmetic only!):
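A minimal sketch of bash integer arithmetic using the $(( )) form. The variable names and values are made up for illustration.

```shell
# Hypothetical values for illustration
total=17
groups=5

# $(( )) evaluates an integer arithmetic expression
sum=$(( total + groups ))
quotient=$(( total / groups ))    # integer division: the fraction is dropped
remainder=$(( total % groups ))
echo "sum=$sum quotient=$quotient remainder=$remainder"
```

Inside $(( )), variable names can be written without the leading $.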
...
Code Block | ||
---|---|---|
| ||
cat -n haiku.txt | \
while IFS= read line; do
echo "Line is: '$line'"
done
|
- IFS= clears read's default Internal Field Separator, which is normally whitespace (spaces, tabs, and newlines).
- This is needed so that read sets the line variable to exactly the contents of the input line, rather than stripping leading and trailing whitespace from it.
- The lines of ~/haiku.txt are piped into the while loop
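The effect of IFS= can be seen in a small sketch. The input text here is hypothetical (it is not read from haiku.txt) and is padded with spaces so the difference is visible.

```shell
# Hypothetical input line with leading and trailing spaces
text="   old pond   "

# With IFS= the line is stored exactly as it was read
kept=$(echo "$text" | while IFS= read line; do echo "[$line]"; done)
echo "$kept"

# With the default IFS, read strips leading and trailing whitespace
stripped=$(echo "$text" | while read line; do echo "[$line]"; done)
echo "$stripped"
```

The square brackets are only there to make the preserved or stripped spaces easy to see.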
...
Code Block | ||
---|---|---|
| ||
tail -n +2 ~/data/sampleinfo.txt | \
while IFS= read line; do
  jobName=$( echo "$line" | cut -f 1 )
  sampleName=$( echo "$line" | cut -f 3 )
  if [ "$jobName" == "" ]; then
    sampleName="Undetermined"; jobName="none"
  fi
  echo "job $jobName - sample $sampleName"
done | more
|
...
- The double quotes around "$line" are important to preserve special characters inside the original line (here Tab characters).
- Without the double quotes, the line's fields would be separated by spaces, and the cut field delimiter would need to be changed.
- Some lines have an empty Job name field; we replace Job and Sample names in this case.
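The point about double quotes can be checked with a short sketch. The Tab-separated line below is hypothetical (it is not taken from sampleinfo.txt), with a job name in field 1 and a sample name in field 3.

```shell
# Hypothetical Tab-separated line: job name, a middle field, sample name
line=$(printf 'JA001\tx\tsampleA')

# Quoted: Tabs are preserved, so cut finds field 3
quoted=$( echo "$line" | cut -f 3 )
echo "$quoted"

# Unquoted: the shell splits on whitespace and echo rejoins with spaces,
# so no Tabs are left and cut passes the whole line through unchanged
unquoted=$( echo $line | cut -f 3 )
echo "$unquoted"
```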
...
Sometimes you want to take a file path like ~/my_file.something.txt and extract some or all of the parts before the suffix, for example, to end up with the text my_file here. To do this, first strip off any directories using the basename command. Then use the odd-looking syntax:
- ${<variable-name>%%.<suffix-to-remove>}
- ${<variable-name>##<prefix-to-remove>}
Code Block | ||
---|---|---|
| ||
pathname=~/my_file.something.txt; echo $pathname
filename=`basename $pathname`; echo $filename

# isolate the filename prefix by stripping the ".something.txt" suffix
prefix=${filename%%.something.txt}
echo $prefix

# isolate the filename suffix by stripping the "my_file.something." prefix
suffix=${filename##my_file.something.}
echo $suffix
|
Exercise 3-12
Use the suffix-removal syntax above to strip the .bed suffix off files in ~/data/bedfiles.
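One possible approach is sketched here. So that it runs anywhere, it uses a temporary directory with two made-up file names as a stand-in for ~/data/bedfiles; the directory variable and the file names are assumptions, not the actual course files.

```shell
# Hypothetical stand-in for ~/data/bedfiles, created in a temporary directory
bed_dir=$(mktemp -d)
touch "$bed_dir/human_rnaseq.bed" "$bed_dir/yeast_chip.bed"

# Strip the directory with basename, then strip the .bed suffix
prefixes=$(
  for path in "$bed_dir"/*.bed; do
    file=$(basename "$path")
    echo "${file%%.bed}"
  done
)
echo "$prefixes"

rm -r "$bed_dir"
```

To try it on the real files, replace "$bed_dir" in the for loop with ~/data/bedfiles.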
A few odds and ends
Input from a sub-shell
When parentheses ( ) enclose an expression, that expression is evaluated in a sub-shell of the calling parent shell. Recall also that the less-than sign < redirects standard input. Together, these two pieces of syntax can stand in for a file in some command contexts.
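A minimal sketch of this idea in bash, using the paste command as an assumed example of a command that normally expects file names. The two seq commands are placeholders for any commands whose output you would otherwise save to files first.

```shell
# paste normally takes file names as arguments;
# each <( ) supplies the output of a command in place of a file
pasted=$(paste <(seq 1 3) <(seq 4 6))
echo "$pasted"
```

This avoids writing two temporary files just to hand them to paste. Note that the <( ) form is a bash feature and may not work in other shells.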
...
In addition to the methods of writing multi-line text discussed in Intro Unix: Writing text: Multi-line text, there's another one that can be useful for composing a large block of text for output to a file. This is done using the heredoc syntax to define a block of text between two user-supplied block delimiters, sending the text to a specified command.
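A minimal sketch of the heredoc syntax follows. The delimiter EOF is an arbitrary user-chosen word, and the variable user_name and the temporary output file are hypothetical, used only for illustration.

```shell
# Hypothetical variable and temporary output file, for illustration only
user_name="student"
out_file=$(mktemp)

# Text between the two EOF delimiters becomes cat's standard input;
# cat's output is then redirected to the file
cat > "$out_file" <<EOF
Hello, $user_name.
This text spans
several lines.
EOF

contents=$(cat "$out_file")
echo "$contents"
rm "$out_file"
```

Because the EOF delimiter is not quoted, the shell evaluates $user_name inside the block, much as it would inside double quotes.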
...