File name normalization
What needs to be done and why
- Scans from Crowley are currently spread across multiple 'disks'. The reason for this spread/fragmentation is that items were not necessarily scanned series-wise.
- To create the dark archive, we first need to reorganize all the scans into a specific directory structure that looks like this: root/series_no/subseries_no/itemgroup/filename.extension
- The scanned files as received from Crowley are organized into disks and folders, and are named using numbers. Files within a folder are numbered from 1.ext to n.ext, where n is the number of files in that folder, and ext is the file extension.
- That means if we pool all files for, say, a subseries into a single folder, the ones copied last would overwrite the ones that came before.
- NOTE: at times this naming scheme is not followed, and we see filenames reflecting the series/subseries/itemgroup hierarchy
- e.g.: disk4/originals/Patient_Registers/5_1_12/05_01_12_0001.tif
- The intermediate step is to first take the disk level splits out of the equation by consolidating all subseries level folders into their respective series folders, item group level folders into their respective subseries folders, and so on. e.g., for series 1, all subseries folders will be organized in root/1/.
- The result of this step should be as shown in point 2.
- Once this is completed, we will have a single root node that contains the whole reorganized directory tree.
- At the subseries or item group level, we need to create a normalized name for each individual file.
- The normalized filenames should look like this: <series>_<subseries>_<itemGroupId>_<itemSubGroupId>_<itemId>.<ext>
- The scheme is described in detail in the next section
- The old-to-new file name mapping must be recorded in a database.
- The renamed files must be copied to a new location.
- The new location is to have the same directory structure as the one created in step 5, except that the file names would be normalized.
- NOTE: for subseries that are further divided into item groups (e.g., in series 1, subseries 1 is divided into 6 'registers'), the value of itemId should reset to 1 at the item group level.
- Automation of steps 5 & 6 is needed. Solution: Python script
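The overwrite risk described above can be checked before any files are moved. The following is a minimal sketch, not the project's script: the function name `find_collisions` and the sample paths are hypothetical, and it only assumes that pooled files keep their original base names.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def find_collisions(source_paths, pooled_dir):
    """Return destinations that more than one source file would claim.

    Each source file is assumed to keep its base name (e.g. 1.tif) when
    pooled into pooled_dir, which is the situation described above.
    """
    by_dest = defaultdict(list)
    for src in source_paths:
        dest = PurePosixPath(pooled_dir) / PurePosixPath(src).name
        by_dest[str(dest)].append(src)
    # Keep only destinations claimed by two or more source files.
    return {dest: srcs for dest, srcs in by_dest.items() if len(srcs) > 1}


# Hypothetical sample paths: two folders on different disks, both numbered from 1.
sources = [
    "disk1/folderA/1.tif",
    "disk1/folderA/2.tif",
    "disk2/folderB/1.tif",
]
collisions = find_collisions(sources, "root/1/1")
```

Here `collisions` contains a single entry for `root/1/1/1.tif`, claimed by both `disk1/folderA/1.tif` and `disk2/folderB/1.tif`, which is why files need normalized names before they are copied.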
Requirements for the Python script
- Rename the files as per the normalized name format: <series>_<subseries>_<itemGroupId>_<itemSubGroupId>_<itemId>.<ext>
The tokens in the format scheme above are described below
| Token | Description |
|---|---|
| series | series number |
| subseries | subseries number - defaults to 0 in case subseries number is not applicable |
| itemGroupId | item group number (like box no., register no. etc.) - defaults to 0 in case item group information is not applicable |
| itemSubGroupId | item subgroup number |
| itemId | alphanumeric ID for any individual scanned item. It should reset to 1: at the itemSubGroupId level if itemSubGroupId ≠ 0 (i.e. item subgroup information is applicable); at the itemGroupId level if itemGroupId ≠ 0 (i.e. item group information is applicable); at the subseries level if itemGroupId = 0 (i.e. item group information is not applicable) |
| ext | file extension for the individual scanned item |
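The token table can be illustrated with a short sketch of name generation and the itemId reset rule. This is one possible approach under stated assumptions, not the required implementation: the names `normalized_name` and `assign_names` are hypothetical, itemId is a plain integer counter without zero-padding (the notes do not specify padding), and files are assumed to arrive in the order they should be numbered.

```python
from collections import defaultdict


def normalized_name(series, subseries, item_group, item_subgroup, item_id, ext):
    """Build <series>_<subseries>_<itemGroupId>_<itemSubGroupId>_<itemId>.<ext>."""
    return f"{series}_{subseries}_{item_group}_{item_subgroup}_{item_id}.{ext}"


def assign_names(records):
    """Assign normalized names to (series, subseries, group, subgroup, ext) tuples.

    One counter is kept per (series, subseries, group, subgroup) combination.
    Because inapplicable levels default to 0, this restarts itemId at 1 at the
    deepest level that is actually in use.
    """
    counters = defaultdict(int)
    names = []
    for series, subseries, group, subgroup, ext in records:
        key = (series, subseries, group, subgroup)
        counters[key] += 1
        names.append(
            normalized_name(series, subseries, group, subgroup, counters[key], ext)
        )
    return names


# Hypothetical input: two files in series 1 / subseries 1 / register 1,
# one file in register 2, and one file in a series with no subseries or groups.
names = assign_names([
    (1, 1, 1, 0, "tif"),
    (1, 1, 1, 0, "tif"),
    (1, 1, 2, 0, "tif"),
    (2, 0, 0, 0, "pdf"),
])
```

With this input, `names` is `["1_1_1_0_1.tif", "1_1_1_0_2.tif", "1_1_2_0_1.tif", "2_0_0_0_1.pdf"]`: the counter restarts when the item group changes from 1 to 2.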
- Create a mapping of original and normalized file names. If commanded to, store mapping in database, and copy renamed files to a new path.
- Each record in the map should include:
- original file path
- normalized file name
- path where file with normalized name is/will be copied
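A mapping record with these three fields could be stored as sketched below. The notes do not name a database, so SQLite (via Python's standard `sqlite3` module) is an assumption here, as are the table name `file_mapping`, the class `MappingRecord`, and the sample paths.

```python
import sqlite3
from dataclasses import astuple, dataclass


@dataclass
class MappingRecord:
    original_path: str     # original file path
    normalized_name: str   # normalized file name
    destination_path: str  # path where the normalized file is/will be copied


def store_mapping(conn, records):
    """Create the (hypothetical) file_mapping table if needed and insert records."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS file_mapping ("
        "original_path TEXT PRIMARY KEY, "
        "normalized_name TEXT NOT NULL, "
        "destination_path TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO file_mapping VALUES (?, ?, ?)",
        [astuple(record) for record in records],
    )
    conn.commit()


# In-memory database for illustration only; a real run would use a file or server.
conn = sqlite3.connect(":memory:")
store_mapping(conn, [
    MappingRecord(
        "disk1/folderA/1.tif",
        "1_1_1_0_1.tif",
        "normalized/1/1/1/1_1_1_0_1.tif",
    ),
])
rows = conn.execute("SELECT * FROM file_mapping").fetchall()
```

Making `original_path` the primary key is a design choice in this sketch: it causes a repeated insert of the same source file to fail rather than silently create a second mapping.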
The script should provide bash-like command-line switches to pass arguments:
| Option | Description | Arguments | Mandatory/Optional |
|---|---|---|---|
| s | series no | series no.s: 1 to 12 | M |
| t | begin transfer of files to path specified for 'p' | path | O |
| m | map file to store mapping (opened in append mode) | file | O |
| d | commit mapping to database | | O |
| e | file extension | tiff, pdf | M |

Note: multiple selections available.
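The switches listed above could be parsed with Python's standard `argparse` module, as in the sketch below. The destination names (`series`, `transfer_path`, and so on) are hypothetical. Because the notes do not make clear which option the "multiple selections" remark applies to, this sketch assumes that both `-s` and `-e` accept more than one value.

```python
import argparse


def build_parser():
    """Build a parser for the s, t, m, d and e switches."""
    parser = argparse.ArgumentParser(description="Normalize scanned file names.")
    # Mandatory: one or more series numbers between 1 and 12 (assumed multi-valued).
    parser.add_argument("-s", dest="series", type=int, nargs="+",
                        choices=range(1, 13), required=True,
                        help="series number(s), 1 to 12")
    # Mandatory: one or more file extensions (assumed multi-valued).
    parser.add_argument("-e", dest="extensions", nargs="+",
                        choices=["tiff", "pdf"], required=True,
                        help="file extension(s) to process")
    # Optional: path to which renamed files are transferred.
    parser.add_argument("-t", dest="transfer_path", metavar="PATH",
                        help="begin transfer of files to this path")
    # Optional: map file; the caller is expected to open it in append mode.
    parser.add_argument("-m", dest="map_file", metavar="FILE",
                        help="file in which to store the mapping")
    # Optional flag: commit the mapping to the database.
    parser.add_argument("-d", dest="commit_db", action="store_true",
                        help="commit mapping to database")
    return parser


# Example invocation: series 1 and 5, tiff files, commit mapping to the database.
args = build_parser().parse_args(["-s", "1", "5", "-e", "tiff", "-d"])
```

After this call, `args.series` is `[1, 5]`, `args.extensions` is `["tiff"]`, `args.commit_db` is `True`, and the two optional paths are `None`.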