...

  • Some digital preservation tasks that can be performed from the command line:
    • File characterization, i.e. identifying a file's format and other features that are difficult to view and extract in a GUI.  Mark pointed out that file extensions are arbitrary and really tell us nothing about what a file is.
    • Checksum generation (see the first sketch after this list)
    • Metadata quality control, usually by generating sorted lists of metadata fields, which makes it much easier to identify errors and typos (see the second sketch after this list)
    • Creating packages of items for transmission, which for instance can be accomplished with BagIt
    • I'm sure there are many others, but this seems like a solid foundation, and it really got me interested in finding out more
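
    To make the checksum task concrete, here's a minimal sketch using the standard md5sum utility (the file names here are hypothetical).  This generates a checksum for every TIFF in the current directory and writes the list to a file, which can later be verified with "md5sum -c checksums.md5":

    Code Block
      md5sum *.tif > checksums.md5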
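
    And here's a sketch of the metadata quality-control idea, assuming a tab-delimited metadata file where the field of interest is the second column.  Sorting and counting the unique values makes one-off typos stand out immediately:

    Code Block
      cut -f 2 metadata.tsv | sort | uniq -c | sort -rn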
  • A great introduction to working in the command-line environment ("the shell" in Unix terminology): http://linuxcommand.org/learning_the_shell.php
  • The concept of piping within the command line was a revelation to me.  Basically, piping allows you to take the output from one command and feed it into another command without saving intermediary files.  There's really no limit to how many commands and tools can be piped (or you can think of them as being "chained") together.  Here's an example, where Mark showed how to determine the number of pages in a PDF file:

    Code Block
      pdfinfo UNT-open-access-symposium-2011.pdf | grep Pages
    
    • The output of this command is:

      Code Block
      
      Pages:     48
      

      This is arrived at by taking the output of the "pdfinfo" command and piping it into the "grep" command (notice the pipe character | which follows the PDF file name).  "grep" is used to pull a character string or field (in this case, the "Pages" field) out of a given block of text.  You could go even further and use the "awk" program to strip out the "Pages:" label and arrive at only "48" as the final output, as in the sketch below.  It's not hard to see the possibilities from there: I immediately started thinking of extracting technical metadata from a large number of image files, such as pixel dimensions and byte sizes.
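
      Here's a sketch of that awk step; awk splits each line on whitespace by default, so printing the second field leaves just the page count:

      Code Block
        pdfinfo UNT-open-access-symposium-2011.pdf | grep Pages | awk '{print $2}'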

  • Some file characterization tools, based on file type/format:
    • General: file
    • Audio: sox
    • Video: ffmpeg (can also do transcoding!)
    • PDF: pdfinfo
    • Images: the ImageMagick set of tools, which includes identify, convert, and mogrify (sketch below)
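
    As a quick sketch of what that might look like, identify accepts a format string to pull out specific properties; this should print the file name, pixel dimensions, and byte size of a (hypothetical) image file:

    Code Block
      identify -format "%f: %wx%h pixels, %b\n" scan-0001.tif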
  • Finally, there's one very general conceptual term that Mark kept using, and I'll try to explain it here.  It's hard because it is so general, but here goes: it's the idea of standard output.  Most command-line programs generate output that is shown only on the Terminal screen, and nowhere else.  That output stream is called standard output.  You can write standard output to a file with the > character, or pipe it into the input of another program, as I mentioned above; a quick sketch follows.  Anyway, Mark kept using that term "standard output", so I'm assuming it's something that we'll run across if/when we investigate this stuff further.
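
    For instance, here's a minimal sketch of redirecting standard output to a file instead of the screen (the output file name is hypothetical):

    Code Block
      pdfinfo UNT-open-access-symposium-2011.pdf > pdf-metadata.txt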

...