Part 2: Viewing text in files

Setup outside of class...

If you're coming back to this tutorial after class, or just ran across it on the Internet, here are the main 3 files we use. You can download them and follow along if you have a Unix or Linux laptop or server.

Not all examples shown will work, and what you see in your Terminal may not match the description, but you can still practice many operations using these files.

cat, more or less

First things first. Text you want to manipulate often resides in files, and one of the first things you want to do is look at it.

But you're on a command line, so you don't have a handy GUI program you can use, like Notepad on Windows or TextEdit on Macs. So you'll usually use a Linux command to view text.

The most basic way to view file data is the cat command. While the name comes from its ability to concatenate one or more files, it can be used to output the contents of a single file. For example:

cat haiku.txt

Each line of the file is printed to the Terminal (standard output), and you see the command prompt again when it is done.

One of the most useful options is cat -n, which displays a line number with each line of output. We can use this option to see that the haiku.txt file has 11 lines:

student01@gsafcomp01:~$ cat -n haiku.txt
     1  The Tao that is seen
     2  Is not the true Tao, until
     3  You bring fresh toner.
     4
     5  With searching comes loss
     6  and the presence of absence:
     7  "My Thesis" not found.
     8
     9  Yesterday it worked
    10  Today it is not working
    11  Software is like that.

Using cat by itself is fine for small files, but it writes everything in the file to the Terminal without stopping. So for larger files you typically use a pager such as more or less.

A pager reads text and outputs only one "page" of text at a time, then waits for you to ask it to advance. A "page" of text is the number of lines that will fit on your visible Terminal. Compare these:

cat jabberwocky.txt
more jabberwocky.txt   # press the space bar to go to the next page

Notice that when you call more, it outputs some text then stops and indicates that there is additional text available (--More--).

Just press the spacebar to see the next page.
If there is additional output, you'll see the --More-- indicator again; if not, the command prompt appears again.
To end the more display, just type q (quit) or Ctrl-c.

So more is really simple, and is great for scanning through a large file. But what if you're looking for something in particular? Or want to go back and forth? That's where the less pager comes in, providing navigation and search capabilities using its own special characters (metacharacters).

Basic navigation in less:

Use q to quit less at any time
space or Ctrl-f advances one page forward; Ctrl-b goes back one page
down arrow goes down (forward) one line; up arrow goes up (backward) one line

Searching in less:

/<pattern> – search for <pattern> in forward direction
- n – goes to the next match of <pattern>
- N – goes to the previous match of <pattern>
?<pattern> – search for <pattern> in backward direction
- n – previous match going back
- N – next match going forward

The searched-for pattern will appear highlighted, at the top of the Terminal window. There may be other highlighted occurrences farther down, but the current one will always be at the top.

Text matching in Linux is case-sensitive by default, so "foo" and "Foo" are different text strings. However if you use the less -I (Insensitive) flag, searching will be case-insensitive, and "foO", "Foo", "FOO" will all match the pattern "foo".

So how do you tell where you are in the file? If you use the less -N option, each line will be labeled with a Number.

Exercise 2-1

On what lines of the jabberwocky.txt file does the word Jabberwock appear (exact match)?

Hint...

less -N jabberwocky.txt

then to see the 1st occurrence

/Jabberwock

to see other occurrences, just type n

Answer...

Lines 12, 21 and 28

What lines is the word Jabberwock on if case is ignored?

Hint...

less -N -I jabberwocky.txt
less -NI jabberwocky.txt

Answer...

Lines 1, 12, 21 and 28

Introducing grep

Using the less program's search function is one way to find text in a file, especially when you want to see the context surrounding the searched-for text. But for general text searching, the grep program is used more frequently.

The name grep comes from the ed editor command g/re/p: globally search for a regular expression and print the matching lines.

Nearly every programming language offers grep functionality, where a pattern you specify – a regular expression or regex – describes how the search is performed.

In Unix, the grep program performs regular-expression text searching, and displays lines where the pattern text is found.

There are many grep regular expression metacharacters that control how the search is performed. This is a complex topic and we defer it to the Intermediate Unix workshop (for a preview, see the grep command).

For now, we will just use grep <pattern> <file> where <pattern> just contains alphanumeric characters (A-Z, a-z, 0-9), and use options:

grep -i will perform a case-insensitive search
grep -n will display line numbers where the pattern was matched
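
To make the two options concrete, here is a minimal sketch. It builds a small throwaway file (a hypothetical stand-in, not one of the tutorial files) so it can be run anywhere, then searches it with and without -i:

```shell
# Build a small stand-in file (hypothetical; not one of the tutorial files).
sample=$(mktemp)
printf 'Beware the Jabberwock\nJABBERWOCKY\nno match here\n' > "$sample"

# Exact-case search with line numbers: only line 1 contains "Jabberwock".
exact=$(grep -n Jabberwock "$sample")
echo "$exact"

# Case-insensitive search with line numbers: lines 1 and 2 both match.
nocase=$(grep -n -i jabberwock "$sample")
echo "$nocase"

rm -f "$sample"
```

Each matching line is printed with its line number and a colon in front of it.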

Note that less and grep both support case-insensitive matching, and displaying line numbers, but they use slightly different options:

Case insensitive matching:
- less -I or less --IGNORE-CASE
- grep -i or grep --ignore-case
Display line numbers:
- less -N or less --LINE-NUMBERS
- grep -n or grep --line-number

It would be great if all Linux command options that mean the same thing used the same names – and some do, but some don't.

Exercise 2-2

Use grep to display (numbered) jabberwocky.txt lines containing the word Jabberwock ignoring case.

Answer...

grep -n -i Jabberwock jabberwocky.txt
grep -ni jabberwock jabberwocky.txt

Standard streams and piping

A key to text manipulation is understanding Unix streams. Every command and Unix program has three "built-in" streams: standard input, standard output and standard error.

Most programs/commands read input data from some source, then write output to some destination. A data source can be a file, but can also be standard input. Similarly, a data destination can be a file but can also be a stream such as standard output.

The power of the Linux command line is due in no small part to the power of piping. The pipe operator ( | ) connects one program's standard output to the next program's standard input.

The key to the power of piping is that most Unix commands can accept input from standard input instead of from files. So, for example, these two expressions appear equivalent:

more jabberwocky.txt
cat jabberwocky.txt | more

Let's dissect the difference in detail:

In the 1st, the more command reads some input from the jabberwocky.txt file
- then writes the output to standard output, which is displayed on your Terminal
- it pauses at page boundaries (--More--) waiting for input on standard input
- when it receives a space character on standard input it reads more input from jabberwocky.txt
- then writes the output to standard output, which is displayed on your Terminal
In the 2nd, the cat command reads its input from the jabberwocky.txt file
- then writes its output to standard output
- the pipe operator ( | ) then connects the standard output from cat to standard input of the more command
- the more command then reads its input from standard input, instead of from a file
  - then writes its output to standard output, which is displayed on your Terminal
  - more continues its processing similar to #1, except reading from its standard input instead of the file

Notes:

In #2, the cat command "blocks" writing to its standard output until more says it's ready for more input
- This "write until block" / "read when input available" behavior makes streams a very efficient means of inter-process communication.
In #1, more can report how much of the file has been read, e.g. --More-- (24%) because it has access to the size information for the file it is reading.
- But in #2, the text is "anonymous input" – from standard input – so more doesn't know how much of the total has been provided.
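
The same file-versus-standard-input choice applies to most commands, not just more. As a small sketch (using a hypothetical throwaway file rather than the tutorial files), grep gives the same result whether it opens the file itself or reads the data through a pipe:

```shell
# Build a small stand-in file (hypothetical; not one of the tutorial files).
sample=$(mktemp)
printf 'one fish\ntwo fish\nred fish\n' > "$sample"

# grep opens and reads the file itself...
from_file=$(grep -n two "$sample")

# ...or reads whatever arrives on its standard input through the pipe.
from_pipe=$(cat "$sample" | grep -n two)

echo "file: $from_file"
echo "pipe: $from_pipe"

rm -f "$sample"
```

In the second form grep never sees a file name; it only sees the stream of text that cat wrote to its standard output.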

Exercise 2-3

Use the pipe operator to provide jabberwocky.txt data to the less command so that line numbers are displayed.

Answer...

cat jabberwocky.txt | less -N
# or
cat -n jabberwocky.txt | less

What happens when you just enter the cat command with no arguments? Can you explain why?

Hint...

man cat

Answer...

Just entering the cat command with no arguments appears to "hang" – that is, nothing happens and you don't see the command prompt (press Ctrl-c to get it back).

Reading the man page for cat says this:

NAME
    cat - concatenate files and print on the standard output
SYNOPSIS
    cat [OPTION]... [FILE]...
DESCRIPTION
    Concatenate FILE(s) to standard output.
    With no FILE, or when FILE is -, read standard input.

The SYNOPSIS shows that, in addition to one or more optional options ( [OPTION]... ), the file arguments to cat are also optional ( [FILE]... ).

Since there was no FILE provided, cat reads from standard input – but there's no data there either, so it just sits and waits for some to appear.
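
You can see this behavior without the "hang" by giving cat something on standard input. A minimal sketch, using printf to supply a few made-up lines of text through a pipe:

```shell
# With no FILE argument, cat copies its standard input to standard output.
piped=$(printf 'hello from stdin\n' | cat)
echo "$piped"

# A lone "-" also means standard input; options such as -n still apply.
numbered=$(printf 'a\nb\n' | cat -n -)
echo "$numbered"
```

Here cat finishes right away because the pipe closes when printf is done, so cat knows there is no more input to wait for.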

head and tail

Two other commands that are useful for viewing text are head and tail.

With no options, head shows the first 10 lines of its input and tail shows the last 10 lines. You can use the -n option followed by a number to specify how many lines to view, or just put the number you want after a dash (e.g. -5 for 5 lines or -1 for 1 line).

View the 1st 3 lines of haiku.txt

head -n 3 haiku.txt
head -3 haiku.txt
cat haiku.txt | head -3

View the last line of haiku.txt

tail -n 1 haiku.txt
tail -1 haiku.txt
cat haiku.txt | tail -1

But what if you want to see lines in the middle of a file? Here's where a special feature of tail comes in handy. If you use tail and put a plus sign (+) in front of the number (with or without the -n option), tail will start its output at that line.

Let's pipe line-numbered output from cat to tail to see how this works. Note we use cat -n to provide input with line numbers because neither head nor tail has line numbering options.

cat -n haiku.txt | tail -n 5   # display the last 5 lines of haiku.txt

cat -n haiku.txt | tail -n +5  # display text in haiku.txt starting at 
                               # line 5
cat -n haiku.txt | tail +6     # display text in haiku.txt starting at 
                               # line 6

When you use the tail -n +<integer> syntax it will display all input starting from that line until the end of its input. So to view only a few lines starting at a specified line number, pipe the output to head:

# display 2 lines of haiku.txt starting at line 9
cat -n haiku.txt | tail -n +9 | head -2
cat -n haiku.txt | tail +9 | head -n 2

cat -n haiku.txt | head -10 | tail -2
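
The line arithmetic is easier to check when each line's content is its own line number. The sketch below assumes the seq command is available (it is on typical Linux systems) and uses seq 1 11 as a stand-in for an 11-line file:

```shell
# seq 1 11 prints the numbers 1 through 11, one per line,
# standing in for an 11-line file.

# Start at line 9, then keep the first 2 lines of what remains.
tail_then_head=$(seq 1 11 | tail -n +9 | head -n 2)
echo "$tail_then_head"

# Keep the first 10 lines, then keep the last 2 of those.
head_then_tail=$(seq 1 11 | head -n 10 | tail -n 2)
echo "$head_then_tail"
```

Both pipelines select lines 9 and 10. With tail first you name the starting line and the number of lines you want; with head first you name the ending line and the number of lines you want.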

Exercise 2-4

Use cat, head and tail to display the middle stanza of haiku.txt.

Hint...

Use cat -n to see the numbering of haiku.txt lines, then a combination of head/tail or tail/head.

Answer...

There are three 3-line stanzas in haiku.txt. The middle stanza is lines 5-7.

cat -n haiku.txt | tail -n +5 | head -n 3
cat -n haiku.txt | tail +5 | head -3

cat -n haiku.txt | head -7 | tail -3

Text lines and the Terminal

Sometimes a line of text is longer than the width of your Terminal. In this case the text is wrapped. It can appear that the output is multiple lines, but it is not. We can see that by looking at lines of the mobydick.txt file, which has some very long lines:

tail -1 mobydick.txt
cat -n mobydick.txt | more

Note that most Terminals let you increase/decrease the width/height of the Terminal window. But there will always be single lines too long for your Terminal width or too many lines of text for its height.

So how long is a line? And how many lines are there in a file? The wc (word count) command can tell us this.

wc -l reports the number of lines in its input
wc -c reports the number of bytes in its input, which for plain ASCII text is the number of characters (including invisible linefeed characters)
wc -w reports the number of words in its input (groups of whitespace-separated text characters)

Examples:

wc -l mobydick.txt            # Reports the number of lines in the
                              # mobydick.txt file
cat mobydick.txt | wc -l      # Reports the number of lines of its input

tail -1 mobydick.txt | wc -c  # Reports the number of characters in 
                              # the last mobydick.txt line
head -5 mobydick.txt | wc -c  # Reports the total number of characters in
                              # the first 5 mobydick.txt lines

When you give wc -l multiple files, it reports the line count of each, then a total.

student01@gsafcomp01:~$ wc -l haiku.txt jabberwocky.txt
  11 haiku.txt
  37 jabberwocky.txt
  48 total

Exercise 2-5

How long is the 12th line of jabberwocky.txt?

Hint...

Use tail and head to isolate the 12th line

Answer...

It is 32 characters long (including the linefeed):

tail -n +12 jabberwocky.txt | head -1 | wc -c

Note the slight difference when you give wc -l a file name versus when you pipe input to it.

wc -l <filename> displays the number of lines and the file name.
cat <filename> | wc -l only displays the number of lines in its anonymous input from standard input.
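
A short sketch of both points, using a hypothetical throwaway file so it runs anywhere. The tr -d ' ' step only strips the padding spaces that some wc versions print in front of the count:

```shell
# "abc" plus its trailing linefeed is 4 bytes, so wc -c reports 4.
chars=$(printf 'abc\n' | wc -c | tr -d ' ')
echo "$chars"

# Build a 2-line stand-in file (hypothetical; not one of the tutorial files).
sample=$(mktemp)
printf 'line one\nline two\n' > "$sample"

# Given a file name, wc -l prints the count followed by the name.
with_name=$(wc -l "$sample")
echo "$with_name"

# Reading from a pipe, wc -l has no name to print, so it prints only the count.
count_only=$(cat "$sample" | wc -l | tr -d ' ')
echo "$count_only"

rm -f "$sample"
```

The count-only form is handy when you want to use the number by itself, for example to store it in a variable.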

What is text?

We've talked about viewing text using various Unix commands – but what exactly is text? That is, what is stored in files that the shell interprets as text?

Inside of files, text isn't characters at all – it is all numbers, because that's all computers know.

On standard Unix systems, each text character is stored as one byte – eight binary bits – in a format called ASCII (American Standard Code for Information Interchange). Eight bits can store 2^8 = 256 values, numbered 0 - 255.

In its original form values 0 - 127 were used for standard ASCII characters. Now values 128 - 255 comprise an Extended set. See https://www.asciitable.com/

However not all ASCII "characters" are printable -- in fact the "printable" characters start at ASCII 32 (space).

ASCII values 0 - 31 have special meanings. Many were designed for use in early modem protocols, such as EOT (end of transmission) and ACK (acknowledge), or for printers, such as VT (vertical tab) and FF (form feed).

The non-printable ASCII characters we care most about are:

Tab (decimal 9, hexadecimal 0x9, octal 0o011)
- backslash escape: \t
Linefeed/Newline (decimal 10, hexadecimal 0xA, octal 0o012)
- backslash escape: \n
Carriage Return (decimal 13, hexadecimal 0xD, octal 0o015)
- backslash escape: \r
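
You can check these codes yourself. The sketch below uses printf to produce the characters from their backslash escapes and od to print each byte as a decimal number (od -An leaves out the offset column and -t u1 prints one unsigned decimal value per byte):

```shell
# Tab, linefeed and carriage return, written with their backslash escapes.
codes=$(printf '\t\n\r' | od -An -t u1)
echo $codes        # left unquoted so the shell collapses od's padding spaces

# A printable character for comparison: capital A.
letter=$(printf 'A' | od -An -t u1)
echo $letter
```

The first echo shows the three codes listed above (9, 10 and 13), and the second shows 65, the ASCII code for A.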

Let's use the hexdump command (really an alias, see below) to look at the actual ASCII codes stored in a file:

head haiku.txt | hexdump

This will produce output something like this:

Each line here describes 16 characters, in three display areas:

The numeric offset of the 16-character line, in hexadecimal (base 16)
- 16 decimal is 0x10 hex
The numeric value (ASCII code) for each character, again in hexadecimal
- each 2-digit hex number represents one 8-bit byte/character
The translated text
- The display character associated with each ASCII code, or a period ( . ) for non-printable characters, written between a greater than ( > ) and less than ( < ) sign

Notice that spaces are ASCII 0x20 (decimal 32), and the newline characters appear as 0x0a (decimal 10).
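
For a dump small enough to read in full, here is a minimal sketch that feeds one short line of text to od with the same options the hexdump alias uses (see the end of this section). It assumes GNU od, since the z suffix that adds the translated-text area is a GNU extension:

```shell
# Dump the 8 bytes of "The Tao" plus its linefeed.
# -A x   : show offsets in hexadecimal
# -t x1z : show each byte as 2 hex digits, plus the printable text at the right
# -v     : show every line, even repeated ones
dump=$(printf 'The Tao\n' | od -A x -t x1z -v)
echo "$dump"
```

In the output, the byte values read 54 68 65 20 54 61 6f 0a: the space is 20, the linefeed is 0a, and the linefeed appears as a period in the translated text because it is not printable.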

Why hexadecimal? Programmers like hexadecimal (base 16) because it is easy to translate hex digits to binary, which is how everything is represented in computers. And it can sometimes be important to know which binary bits are 1s and which are 0s. See Decimal and Hexadecimal for more information.
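
If you want to convert between number bases on the command line, the printf command can help. This is a small sketch, not part of the workshop material; it relies on printf accepting numbers written with a 0x prefix, and on shell arithmetic doing the same:

```shell
dec=$(printf '%d' 0x20)   # hexadecimal 20 -> decimal 32 (space)
hex=$(printf '%x' 10)     # decimal 10 -> hexadecimal a (linefeed)
oct=$(printf '%o' 9)      # decimal 9 -> octal 11 (tab)
max=$((0xff))             # hexadecimal ff -> decimal 255, the largest byte value

echo "$dec $hex $oct $max"
```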

To learn more about the hexdump alias...

The expression below defines a helpful hexdump alias for viewing file contents in hexadecimal (base 16).

An alias is just shorthand for calling another command, usually with a specific set of options you like.

Here, the hexdump alias calls the standard od (octal dump) command with various arguments that we'd never remember.

alias hexdump='od -A x -t x1z -v'

We have defined the hexdump alias for you in your ~/.profile file, the dot file that is executed every time you log in. But you can just enter that expression on the command line and the hexdump alias will be defined in your current Terminal session.