Command-Line Tools for Digital Preservation

On Monday, June 13, I had the opportunity to attend a work-in-progress workshop taught by Mark Phillips (UNT's Asst. Dean for Digital Libraries), which gave a basic introduction to Unix-based command-line tools for digital preservation.

The workshop consisted of a very basic intro to the command-line environment, and then briefly introduced some things you can do with it. For a more thorough review of what went on, I've attached a few takeaways from the workshop:

-A text file with my somewhat decipherable notes: PhillipsWorkshop_Notes_06-13-2011_b.txt

-A print-out of my Terminal session during the workshop: terminal_session_workshop_06-13-2011_b.txt

-And Mark's ultra-minimal PowerPoint: command-line_workshop-2011-06-13.pdf

Here I'll limit things to a few general points:

  • Some digital preservation tasks that can be performed with the command line:
    • File characterization, i.e. identifying file format and other features of a file that are difficult to view and extract in a GUI.  Mark pointed out that file extensions are arbitrary, and really tell us nothing about what a file is.
    • Checksum generation (see the sketch just after this list)
    • Metadata quality control, usually through generating sorted lists of metadata fields, which make it much easier to identify errors and typos
    • Creating packages of items for transmission, which for instance can be accomplished with BagIt
    • I'm sure there are many others, but this seems like a solid foundation, and it really got me interested in finding out more
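
    As a rough sketch of the first two items above (these weren't Mark's examples; the filenames are made up, and the md5sum, cut, sort, and uniq commands are my own assumptions about how you might do it; on a Mac the checksum command is md5 rather than md5sum):

      # Generate MD5 checksums for every TIFF in a folder, then verify them later
      md5sum *.tif > checksums.md5
      md5sum -c checksums.md5

      # Count every unique value in the third (tab-separated) column of a metadata file;
      # typos tend to show up as one-off values at the bottom of the list
      cut -f 3 metadata.tsv | sort | uniq -c | sort -rn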
  • A great introduction to working in the command-line environment ("the shell" in Unix terminology): http://linuxcommand.org/learning_the_shell.php
  • The concept of piping within the command line was a revelation to me.  Basically, piping allows you to take the output from one command and feed it into another command without saving intermediary files.  There's really no limit to how many commands and tools can be piped (or you can think of them as being "chained") together.  Here's an example, where Mark showed how to determine the number of pages in a PDF file:

     pdfinfo UNT-open-access-symposium-2011.pdf | grep Pages
    
    • The output of this command is:

      Pages:     48
      

      Which is arrived at by taking the output of the "pdfinfo" command and piping it into the "grep" command (notice the pipe character | which follows the PDF file name).  "grep" is used to pull a character string or field (in this case, the "Pages" field) out of a given block of text.  You could go even further and use the "awk" program to strip out the "Pages:   " character string and arrive at only "48" as the final output (sketched just below).  It's not hard to see the possibilities from there: I immediately started thinking of extracting technical metadata from a large number of image files, such as pixel dimensions and byte sizes.
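
      That awk step wasn't part of Mark's demo, so take this as my own sketch of how it would probably look:

        pdfinfo UNT-open-access-symposium-2011.pdf | grep Pages | awk '{print $2}'

      which should print just "48".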

  • Some file characterization tools, based on file type/format (with a couple of usage sketches after the list):
    • General: file
    • Audio: sox
    • Video: ffmpeg (can also do transcoding!)
    • PDF: pdfinfo
    • Images: the ImageMagick set of tools, which includes: identify, convert, mogrify
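
    A couple of quick usage sketches (the filenames are made up, and the identify format string is my own guess at a useful one, not something Mark showed):

      # Ask "file" what a file actually is, regardless of its extension
      file mystery_document.001

      # Print filename, pixel dimensions, and file size for every TIFF in a folder
      identify -format "%f %wx%h %b\n" *.tif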
  • Finally, there's one very general conceptual term that Mark kept using, and I'll try to explain it here.  It's hard because it is so general, but here goes: it's the idea of standard output.  Most command-line programs generate output that is only shown on the Terminal screen, and nowhere else.  This output in the Terminal screen is called standard output.  You can write standard output to a file with the > character, or pipe it into the input of another program, as I mentioned above.  Anyway, since Mark kept coming back to that term, I'm assuming it's something we'll run across if/when we investigate this stuff further.
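
    For instance, to save that earlier pdfinfo output to a text file instead of just reading it off the screen (the output filename here is made up):

      pdfinfo UNT-open-access-symposium-2011.pdf > pdfinfo_UNT-symposium.txt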

-Zach Vowell, 06/21/2011