File name normalization

What needs to be done and why

  1. Scans from Crowley are currently spread across multiple 'disks' because items were not necessarily scanned series by series.
  2. To create the dark archive, we first need to reorganize all the scans into a specific directory structure that looks like this: root/series_no/subseries_no/itemgroup/filename.extension
  3. The scanned files as received from Crowley are organized into disks and folders, and are named using numbers. Files within a folder are numbered from 1.ext to n.ext, where n is the number of files in that folder, and ext is the file extension.
  4. This means that if we pool all files for, say, a subseries into a single folder, files copied later would overwrite earlier files that have the same name.
    • NOTE: at times this naming scheme is not followed, and we see filenames reflecting the series/subseries/itemgroup hierarchy
      • e.g.: disk4/originals/Patient_Registers/5_1_12/05_01_12_0001.tif
  5. The intermediate step is to take the disk-level splits out of the equation by consolidating all subseries-level folders into their respective series folders, all item-group-level folders into their respective subseries folders, and so on. For example, for series 1, all subseries folders will be organized under root/1/.
    • The result of this step should match the structure shown in point 2.
    • Once this is completed, we will have a single root node that contains the whole reorganized directory tree.
  6. At the subseries or item group level, we need to create a normalized name for each individual file.
    • The normalized filenames should look like this: <series>_<subseries>_<itemGroupId>_<itemSubGroupId>_<itemId>.<ext>
    • The scheme is described in detail in the next section.
    • The old-to-new file name mapping must be recorded in a database.
    • The renamed files must be copied to a new location.
    • The new location must have the same directory structure as the one created in step 5, except that the file names are normalized.
    • NOTE: for subseries that are further divided into item groups (e.g., in series 1, subseries 1 is divided into 6 'registers'), the value of itemId should reset to 1 at the item group level.
  7. Steps 5 and 6 need to be automated. Solution: a Python script.
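
The overwrite risk described in point 4 can be checked before any files are moved. The following is a minimal sketch, assuming the disk folders are reachable as ordinary directories. The function name find_name_collisions and its input are hypothetical and not part of the plan above.

```python
from collections import defaultdict
from pathlib import Path


def find_name_collisions(folders):
    """Return {file name: [paths]} for names found in more than one place.

    `folders` is an iterable of directories that would be pooled into a
    single target folder (hypothetical input, e.g. the same subseries
    folder as it appears on several disks).
    """
    seen = defaultdict(list)
    for folder in folders:
        for path in sorted(Path(folder).iterdir()):
            if path.is_file():
                seen[path.name].append(path)
    # Keep only the names that would overwrite each other when pooled.
    return {name: paths for name, paths in seen.items() if len(paths) > 1}
```

If two disk folders both contain a file named 1.tif, that name is reported together with both source paths, which is the situation point 4 describes.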

Requirements for the Python script

  1. Rename the files as per the normalized name format: <series>_<subseries>_<itemGroupId>_<itemSubGroupId>_<itemId>.<ext>

The tokens in the format scheme above are described below.

Token           Description
series          series number
subseries       subseries number
                - defaults to 0 if the subseries number is not applicable
itemGroupId     item group number (box no., register no., etc.)
                - defaults to 0 if item group information is not applicable
itemSubGroupId  item subgroup number
                - defaults to 0 if item subgroup information is not applicable
itemId          alphanumeric id for any individual scanned item
                - should reset to 1 at the itemSubGroupId level if itemSubGroupId ≠ 0 (i.e. item subgroup information is applicable)
                - should reset to 1 at the itemGroupId level if itemGroupId ≠ 0 (i.e. item group information is applicable)
                - should reset to 1 at the subseries level if itemGroupId = 0 (i.e. item group information is not applicable)
ext             file extension for the individual scanned item
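
The reset rules for itemId can be sketched with a counter keyed on the grouping tokens. This is only an illustration under stated assumptions: the names normalized_name and ItemCounter are hypothetical, itemId is treated as a plain integer with no zero-padding (the scheme above calls it alphanumeric and does not specify padding), and tokens that do not apply are passed as 0.

```python
from collections import defaultdict


def normalized_name(series, subseries, item_group, item_subgroup, item_id, ext):
    """Build <series>_<subseries>_<itemGroupId>_<itemSubGroupId>_<itemId>.<ext>."""
    return f"{series}_{subseries}_{item_group}_{item_subgroup}_{item_id}.{ext}"


class ItemCounter:
    """Hand out itemId values that restart at 1 for each grouping.

    Tokens that do not apply default to 0, so keying the counter on
    (series, subseries, itemGroupId, itemSubGroupId) restarts the count
    at the deepest grouping level that applies.
    """

    def __init__(self):
        # Every new key starts counting at 1.
        self._next = defaultdict(lambda: 1)

    def next_id(self, series, subseries=0, item_group=0, item_subgroup=0):
        key = (series, subseries, item_group, item_subgroup)
        value = self._next[key]
        self._next[key] = value + 1
        return value
```

For the first file of item group 2 in series 1, subseries 1, `normalized_name(1, 1, 2, 0, ItemCounter().next_id(1, 1, 2), "tif")` gives 1_1_2_0_1.tif.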

 

  2. Create a mapping of original to normalized file names. When the corresponding options are given, store the mapping in the database and copy the renamed files to a new path.
  3. Each record in the mapping should include:
    • original file path
    • normalized file name
    • path where the file with the normalized name is or will be copied
  4. Provide bash-like command-line switches for passing arguments:

    Option  Description                                        Arguments             Mandatory/Optional
    s       series no.                                         series nos.: 1 to 12  M
    t       begin transfer of files to path specified for 'p'  path                  O
    m       map file to store mapping (opened in append mode)  file                  O
    d       commit mapping to database                         (none)                O
    e       file extension                                     tiff, pdf             M
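
The switches and the mapping record listed above could be wired up roughly as follows. This is a sketch, not the actual script: it assumes the path for the transfer is passed directly as the argument to -t, that the map file is written as CSV (the format is not specified above), and the names build_parser and append_mapping are hypothetical.

```python
import argparse
import csv


def build_parser():
    """Command-line switches from the option table (s, t, m, d, e)."""
    parser = argparse.ArgumentParser(description="Normalize scan file names.")
    parser.add_argument("-s", dest="series", type=int, choices=range(1, 13),
                        required=True, help="series no. (1 to 12)")
    parser.add_argument("-e", dest="ext", choices=["tiff", "pdf"],
                        required=True, help="file extension")
    parser.add_argument("-t", dest="transfer_path", metavar="PATH",
                        help="begin transfer of files to PATH")
    parser.add_argument("-m", dest="map_file", metavar="FILE",
                        help="map file to store the mapping (append mode)")
    parser.add_argument("-d", dest="commit_db", action="store_true",
                        help="commit mapping to database")
    return parser


def append_mapping(map_file, original_path, normalized_name, destination_path):
    """Append one mapping record to the map file (CSV is an assumption)."""
    # Append mode, as the option table specifies for the map file.
    with open(map_file, "a", newline="") as handle:
        csv.writer(handle).writerow(
            [original_path, normalized_name, destination_path]
        )
```

With this sketch, `build_parser().parse_args(["-s", "3", "-e", "tiff", "-d"])` selects series 3 and tiff files and requests the database commit, while leaving the transfer path and map file unset.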