Generate Derivative

CSV file inputs

The input CSV file lists the filepaths for which derivatives will be generated.

The script will require the following input:

  1. a CSV file containing a list of filepaths
    1. the CSV file must have a header row at the top by default (recommended label: "source" but the script does not enforce this recommendation).
    2. the first column of the CSV must contain filepath values
    3. paths should be absolute. The script will make no attempt to modify them. Paths that the script cannot access will be recorded in an error file.
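
The input rules above can be sketched as follows. This is a minimal illustration rather than the script's actual code; the function name `read_filepaths` is hypothetical, and UTF-8 encoding is an assumption.

```python
import csv


def read_filepaths(csv_path):
    """Return the first-column values of a CSV file, skipping the header row.

    Hypothetical helper: the name and signature are assumptions.
    Paths are returned exactly as written; no normalization is attempted.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row; its label is not checked
        # Keep the first column of every non-empty row.
        return [row[0] for row in reader if row]
```

Because the header label is never inspected, a file whose first row is already a filepath would silently lose that entry, which is why the header row is required.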

Requirements for the Python Script

  1. The script should be executable from the command-line, and should take inputs from the command line.
  2. The documents should be available in the database with filename as the key.
  3. ImageMagick must be installed on the system.
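
One way to verify the ImageMagick requirement before doing any work is to look for its executable on the PATH. The sketch below assumes that ImageMagick 7 provides a `magick` binary and ImageMagick 6 provides `convert`; the function name `find_imagemagick` is hypothetical.

```python
import shutil


def find_imagemagick():
    """Return the path of an ImageMagick executable, or None if none is found.

    Assumption: `magick` (ImageMagick 7) is preferred, with `convert`
    (ImageMagick 6) as a fallback. Both names are looked up on PATH.
    """
    for name in ("magick", "convert"):
        path = shutil.which(name)
        if path:
            return path
    return None
```

On Windows, `convert` can resolve to an unrelated system utility, so a stricter check may be needed there.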

Interface

The script will be a command-line utility that will serve the requirements as outlined above.

Invocations of the script would look like this:

python3 derivatives.py -f CSV -s SOURCE FILETYPE [-c DESTINATION FILETYPE] [-r RESIZE DIMENSION]

At least one of -c or -r is required.

CSV is the path to the CSV file to process. Switches shown in square brackets are optional.

The script accepts the following switches:

Switch  Argument             Description
-f      path to CSV file     Path to the CSV file for batch processing.
-s      source filetype      Filetype from which the derivatives will be generated.
-c      conversion filetype  Filetype of the generated derivative files (access copies). Optional; if omitted, -r is required.
-r      resize dimension     Larger dimension of the generated derivative images. Optional; if omitted, -c is required.
-q      (none)               Quiet mode. Disables all informational prints. All exception and error related prints will still be output.
-h      (none)               Displays a help document on the screen and exits.
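
The switches above could be wired up with `argparse` roughly as follows. This is a sketch under assumptions: the `dest` names, the function names, and treating the resize dimension as an integer are not specified by this document. `argparse` adds `-h` automatically, and `parser.error` prints a usage summary (not the full help text) before exiting.

```python
import argparse


def build_parser():
    """Build a parser for the documented switches (illustrative names)."""
    parser = argparse.ArgumentParser(prog="derivatives.py")
    parser.add_argument("-f", dest="csv_file", required=True,
                        help="path to the CSV file for batch processing")
    parser.add_argument("-s", dest="source_type", required=True,
                        help="filetype from which derivatives are generated")
    parser.add_argument("-c", dest="convert_type",
                        help="filetype of the generated derivative files")
    # Assumption: the resize dimension is a whole number of pixels.
    parser.add_argument("-r", dest="resize", type=int,
                        help="larger dimension of the generated images")
    parser.add_argument("-q", dest="quiet", action="store_true",
                        help="suppress informational output")
    return parser


def parse_args(argv=None):
    """Parse argv and apply the rules for -c and -r."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.convert_type is None and args.resize is None:
        parser.error("at least one of -c or -r is required")
    if args.convert_type is None:
        # Resize only: keep the source filetype for the derivative.
        args.convert_type = args.source_type
    return args
```

For example, `parse_args(["-f", "in.csv", "-s", "tif", "-r", "800"])` yields a conversion filetype of `tif`, because `-c` was omitted.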

Architecture

  1. globalvars.py is updated with the global variables that the Python script uses extensively.

Behavior and Implementation

The script (derivatives.py) will perform the following high-level operations:

  1. Parse command line arguments
    1. set variables in accordance with the command line arguments
    2. check if at least one of -c and -r options is specified
    3. if -c is not specified, set it to the same value as -s
    4. inform user about errors in the arguments, print help, and exit
    5. if ImageMagick is not installed, print a message and exit
  2. Read input CSV file
    1. If the CSV file cannot be opened, display error and exit
    2. Skip the first line, which should be the header
  3. Read metadata property names from labels.json
    1. store all labels within a Python object
  4. Read controlled vocabulary from vocab.json
    1. store the vocabulary as a Python object
  5. Create a connection to the database
  6. Create a CSV file for printing errors (use name specified below)
  7. For each filepath in the input CSV file:
    1. read the contents of the filepath, which should be a list of folders
      1. if directory listing for the filepath cannot be retrieved, print error message and write to error CSV file
    2. within each folder, read the file with the filename the same as the foldername and the filetype specified on the command line (filename should be of the form: <foldername>.<filetype>)
    3. set filename for the generated derivative using the form (<foldername>[_<resize-value>].<convert file type>)
    4. check if the filepath (source path + folder + filename) for destination image already exists
      1. if the filepath exists, report the error and move on to the next filepath (step 7)
      2. if the filepath does not exist
        1. run the ImageMagick convert command using the provided inputs: source filepath, conversion filepath and resize dimension
        2. create a PREMIS event to record this action for the id (folder name within the source filepath)
        3. update the document with the matching id in the database with the PREMIS event.
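
The naming and conversion parts of step 7 can be sketched as below. All function names are hypothetical, the default executable name `convert` is an assumption, and the PREMIS and database steps are left out because their interfaces are not described here. The `NxN` geometry passed to `-resize` asks ImageMagick to fit the image inside an N-by-N box while preserving aspect ratio, so the larger dimension becomes N.

```python
import os


def derivative_name(folder_name, convert_type, resize=None):
    """Build <foldername>[_<resize-value>].<convert file type>."""
    suffix = f"_{resize}" if resize is not None else ""
    return f"{folder_name}{suffix}.{convert_type}"


def plan_derivative(source_path, folder_name, source_type, convert_type, resize=None):
    """Return (source file, destination file) for one folder under source_path."""
    folder = os.path.join(source_path, folder_name)
    source_file = os.path.join(folder, f"{folder_name}.{source_type}")
    dest_file = os.path.join(folder, derivative_name(folder_name, convert_type, resize))
    return source_file, dest_file


def build_convert_command(source_file, dest_file, resize=None, executable="convert"):
    """Return the ImageMagick command as an argument list (nothing is run here)."""
    command = [executable, source_file]
    if resize is not None:
        # Fit inside a resize-by-resize box, keeping the aspect ratio.
        command += ["-resize", f"{resize}x{resize}"]
    command.append(dest_file)
    return command
```

A caller would first check `os.path.exists(dest_file)` and, only if the file is absent, hand the list to `subprocess.run(command, check=True)`. Passing a list rather than a shell string avoids quoting problems with paths that contain spaces.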

Output(s)

When errors are encountered, write helpful information to a CSV file. Errors to be reported include errors in command usage, as well as any errors encountered while carrying out property extraction. Error CSV name: "derivatives_errors_<timestamp>.csv"
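
A minimal sketch of the error file handling follows. The timestamp format (`YYYYMMDD_HHMMSS`), the two-column layout (filepath, message), and both function names are assumptions; only the `derivatives_errors_<timestamp>.csv` pattern comes from this document.

```python
import csv
from datetime import datetime


def error_csv_name(now=None):
    """Return derivatives_errors_<timestamp>.csv for the given (or current) time."""
    now = now or datetime.now()
    return f"derivatives_errors_{now.strftime('%Y%m%d_%H%M%S')}.csv"


def record_error(csv_path, filepath, message):
    """Append one error row (filepath, message) to the error CSV."""
    with open(csv_path, "a", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerow([filepath, message])
```

Computing the name once at startup and appending rows as errors occur keeps all errors from a single run in one file.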

Test cases / Validation

  1. A CSV file with no header row should print an error.
  2. A CSV file with a single filepath entry.
  3. A CSV file with multiple filepath entries.
  4. A document that is not available in the database should print an error with the filepath and filename.
  5. A derivative entry is already present for the document in the database.