File transfer
External hard disk to maahiti (server)
Steps
- Manually peruse the data folders on maahiti (done)
- Create a spreadsheet that records the folder names from the various disks, levels of EAD description for materials in each folder, and the destination folder for the materials
The following table summarizes the EAD-based hierarchy of the destination directory structure (on Javelin). A higher level number indicates a deeper (lower) level in the hierarchy.
| Level | Type |
|---|---|
| 1 | Series (e.g., Minutes, Annual reports, Patient registers, etc.) |
| 2 | Sub-series |
| 3 | Item group |
| 4 | Item sub-group |
| 5 | Version (digital master-tiff, original size-jpeg, 1200-px jpeg, 200-px jpeg, pdf) |
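For illustration, a destination path spanning all five levels might look like the following (the specific folder names are hypothetical):

```
Minutes/Governing-body-minutes/1921-1930/1921/1200-px-jpeg/
```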
- Describe a set of criteria to test transfer scripts – various levels of hierarchy, directory names that include non-alphanumeric characters, merging folders, and item groups that lie outside of typical ranges (for example, zero).
- Select examples of folders that match each of the testing criteria
- Develop the transfer script
- Test transfer script against each of the examples
Design of the File Transfer Script (v0.1)
NOTE: By 'transfer' we mean that 'destructive' copying of files (i.e., removal of source files) will be performed only if the appropriate command-line option (specified below) is set. By default, the files will simply be copied to the destination, keeping the source files intact.
NOTE 2: This script is only for transferring the files, not for renaming them.
Requirements
- The script should be executable from the command-line, and should take inputs from the command line
- The script should be able to transfer files from a source directory to a destination directory
- The script should be able to carry out the transfers in batches (using a CSV file that lists multiple source-destination pairs)
- The script should be able to handle subsets of rows from the CSV file (low priority)
Input(s)
The script should be able to take the following inputs (the order of listing in this document is inconsequential):
- a source path and a destination path; both paths need to be absolute paths
- a CSV file containing a list of source-destination path pairs
- this will be done via a command line option or 'switch'
- the CSV file will be assumed to have a header row at the top by default.
- the first column of the CSV must contain source paths
- the second column of the CSV must contain destination paths
- all other columns will be ignored
- all the paths must be absolute paths
- file extension: the type of files that will be copied from the source
- by default, '.tif' files will be copied
Interface
The script will be a command-line utility that will serve the requirements as outlined above.
Invocations of the script would look like this:
python3 accession.py [OPTION]... SOURCE DEST
where SOURCE refers to the absolute path from which files will be copied, and DEST refers to the absolute path to which files will be copied. CSV refers to the CSV file that can be supplied as an argument (via the -f switch) to process batches. The ellipsis refers to any switch-specific arguments that might be needed.
OPTION represents the following set of switches that can be applied to modify the default behaviour of the script:
| Switch | Argument | Description |
|---|---|---|
| -e | Character string (without quote marks) | Specific file extension to copy instead of the default ('tif'). Without this switch, only '.tif' files will be transferred. |
| -f | Path to CSV file | Path to CSV file for batch processing. NOTE: any following source and destination path arguments will be ignored |
| -h | (none) | The script displays a help document on the screen and exits. |
| -m | (none) | Move the files instead of copying. NOTE: the source directory will be emptied of all transferred files. |
| -q | (none) | Quiet mode. Disables all informational prints. All exception and error related prints will still be output. |
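The switch table above could be realized with Python's standard `argparse` module. The following is a minimal sketch, not the actual implementation: the option names match the table, but the defaults, metavariable names, and help strings are assumptions.

```python
import argparse

def build_parser():
    """Build a parser matching the switches in the table above (a sketch)."""
    parser = argparse.ArgumentParser(
        prog="accession.py",
        description="Copy (or move) files from SOURCE to DEST, or process a CSV batch.",
    )
    # Positional paths; optional because -f (batch mode) makes them unnecessary
    parser.add_argument("source", nargs="?", help="absolute source path")
    parser.add_argument("destination", nargs="?", help="absolute destination path")
    parser.add_argument("-e", default="tif", metavar="EXT",
                        help="file extension to copy (default: tif)")
    parser.add_argument("-f", metavar="CSV",
                        help="CSV file for batch processing; positional paths are ignored")
    parser.add_argument("-m", action="store_true",
                        help="move files instead of copying")
    parser.add_argument("-q", action="store_true",
                        help="quiet mode: suppress informational output")
    return parser

args = build_parser().parse_args(["-e", "jpg", "/src", "/dest"])
print(args.e, args.m, args.source)  # → jpg False /src
```

Note that `-h` is provided automatically by `argparse`, satisfying the help requirement without extra code.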
Behavior and Implementation
The script (accession.py) performs the following high-level operations:
- Parse command line arguments
- set variables in accordance with the arguments
- inform user about errors in the arguments, print help, and exit
- Read CSV file
- validate header structure
- parse 'arrange' information from header
- Read metadata property names from labels.json
- store all labels within a Python object
- Read controlled vocabulary from vocab.json
- store the vocabulary as a Python object
- Create a connection to the database
- For each row in the CSV:
- extract the 'arrange' info (for the admin metadata profile)
- check if the required destination directory exists
- create the directory if it does not exist
- initialize serialNo (for the admin metadata profile)
- 1 for a destination directory that was newly created in the step above
- query database to get the largest serialNo corresponding to the destination directory (likely scenario if an errors_.. CSV file is being used, or more files are being added to a given destination directory at a later date)
- create an alphanumerically sorted list of files of the required extension type (specified via command-line arguments; default: '.tif') in the source directory
- for each source file in the sorted list:
- initialize a metadata record corresponding to the file
- create an object property
- add the file size (in bytes), format information (name, version), category ("file"), and file name to the object property
- assign the record a unique ID (UUID)
- attach the identifierAssignment event to the object
- create a folder with the unique ID under the directory
- move the record to the folder matching the unique ID
- calculate the file's checksum (using MD5)
- attach the messageDigestCalculation event to the object
- copy the file from the source to destination
- attach the replication event to the object
- rename the destination file using the unique ID
- attach the filenameChange event to the object
- compute the checksum (MD5) of the destination file, and compare it to that obtained for the source file
- attach a fixityCheck event to the object
- record the serialNo property in the initialized object
- insert record into the database
- attach an accession event to the object
- Increment the internal counter for serial numbers by 1
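The per-file steps above can be sketched as follows. This is a simplified illustration, not the actual implementation: the database insertion, metadata events, and the per-record UUID folder are omitted, and the function and variable names are assumptions.

```python
import hashlib
import shutil
import uuid
from pathlib import Path

def md5sum(path, chunk_size=2 ** 20):
    """Compute the MD5 checksum of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def transfer_file(src, dest_dir, move=False):
    """Assign a UUID, copy src into dest_dir under that UUID, and verify fixity."""
    src = Path(src)
    unique_id = str(uuid.uuid4())               # unique ID for the record
    source_checksum = md5sum(src)               # messageDigestCalculation event
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / (unique_id + src.suffix)  # rename using the unique ID
    shutil.copy2(src, dest)                     # replication event
    if md5sum(dest) != source_checksum:         # fixityCheck event
        raise RuntimeError(f"checksum mismatch for {src}")
    if move:
        src.unlink()                            # destructive transfer (-m switch)
    return unique_id, source_checksum
```

Copying first and deleting the source only after the fixity check passes means an interrupted or failed transfer never loses data, which matches the non-destructive default described earlier.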
Formatting the CSV batch file
- The most important thing to consider while creating the CSV file for use with the script is to use a spreadsheet editor like Excel (or an equivalent program on your OS) that can save a basic spreadsheet as a "comma separated value (CSV)" file. This ensures that your CSVs are well-formed, and minimizes any surprises during the execution of the script.
- The first column must be called 'source' and the second column must be called 'destination' (both case sensitive)
- The following columns are all optional, but if you want to specify any descriptive information regarding the archival arrangement of the files:
- prefix the name of your desired arrangement information column with "arrange:" (case-sensitive, exclude the quotes)
- the text string following "arrange:" will be used as a property name in the admin metadata profile
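The CSV validation rules above could be implemented along the following lines. This is a sketch under the stated column conventions; the function name and error message are assumptions.

```python
import csv

def read_batch(csv_path):
    """Read a batch CSV: validate the header, extract 'arrange:' columns."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        # First two columns must be 'source' and 'destination' (case-sensitive)
        if header[:2] != ["source", "destination"]:
            raise ValueError("first two columns must be 'source' and 'destination'")
        # Columns prefixed with 'arrange:' supply admin-metadata property names
        arrange_cols = {
            i: name[len("arrange:"):]
            for i, name in enumerate(header)
            if name.startswith("arrange:")
        }
        rows = []
        for row in reader:
            arrange = {arrange_cols[i]: row[i] for i in arrange_cols}
            rows.append((row[0], row[1], arrange))
        return rows
```

For example, a CSV with the header `source,destination,arrange:series` would yield, for each row, the source path, the destination path, and an `{"series": ...}` mapping for the admin metadata profile.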
Output(s)
The script should, of course, be able to carry out the transfer as specified, but it should also print helpful information when errors are encountered. Errors to be reported include errors in command usage, as well as any errors encountered while carrying out transfers.
Test cases / Validation
Many of the paths on the disks mounted on Maahiti contain characters not conventionally found in Unix-based file systems. DOS-based and Unix-based OSes deal with certain characters (blank spaces, punctuation marks, etc.) in file paths differently. One of the implicit aims of the whole archiving exercise is to create a file organization that follows the lowest common denominator in terms of how filenames and paths are interpreted by various OSes.
In light of the above, the following scenarios must be tested as a bare minimum:
- Paths containing non-alphanumeric characters (including spaces, punctuation marks, etc.)
- Paths terminating in '/'
- Paths not terminating in '/'
- Relative paths (these are not allowed as of this version, and must be checked against)
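The path checks implied by these test cases could be sketched as follows: reject relative paths outright, and normalize a trailing '/' so that both terminated and unterminated forms behave identically. The helper name and error message are assumptions.

```python
import os

def validate_path(path):
    """Reject relative paths and normalize trailing slashes (a sketch)."""
    if not os.path.isabs(path):
        raise ValueError(f"relative paths are not allowed: {path!r}")
    # os.path.normpath strips a trailing '/', so '/data/scans/' and
    # '/data/scans' refer to the same directory after normalization.
    return os.path.normpath(path)

print(validate_path("/data/Annual reports (1921)/"))  # → /data/Annual reports (1921)
```

Spaces and punctuation in path components need no special handling here, because the paths are passed as single string arguments rather than interpreted by a shell.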