What needs to be done and why

1. Scans from Crowley are currently spread across multiple 'disks'. This fragmentation arose because items were not necessarily scanned series by series.
2. To create the dark archive, we first need to reorganize all the scans into a specific directory structure that looks like this: root/series_no/subseries_no/itemgroup/filename.extension
3. The scanned files as received from Crowley are organized into disks and folders, and are named using numbers. Files within a folder are numbered from 1.ext to n.ext, where n is the number of files in that folder and ext is the file extension.
4. That means if we pool all files for, say, a subseries into a single folder, files copied later would overwrite files with the same name that were copied earlier.
   - NOTE: at times this naming scheme is not followed, and we see filenames reflecting the series/subseries/itemgroup hierarchy
     - e.g.: disk4/originals/Patient_Registers/5_1_12/05_01_12_0001.tif
5. The intermediate step is to take the disk-level splits out of the equation by consolidating all subseries-level folders into their respective series folders, item-group-level folders into their respective subseries folders, and so on. For example, for series 1, all subseries folders will be organized in root/1/.
   - The result of this step should be as shown in point 2.
   - Once this is completed, we will have a single root node containing the whole reorganized directory tree.
6. At the subseries or item group level, we need to create a normalized name for each individual file.
   - The normalized filenames should look like this: <series>_<subseries>_<itemGroupId>_<itemSubGroupId>_<itemId>.<ext>
   - The scheme is described in detail in the next section.
   - The old-to-new file name mapping must be recorded in a database.
   - The renamed files must be copied to a new location.
   - The new location is to have the same directory structure as the one created in step 5, except that the file names will be normalized.
   - NOTE: for subseries that are further divided into item groups (e.g., in series 1, subseries 1 is divided into 6 'registers'), the value of itemId should reset to 1 at the item group level.
7. Automation of steps 5 & 6 is needed. Solution: Python script
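
The consolidation step and the overwrite hazard described above can be sketched roughly as follows. This is only an illustration, not the final script: the function name plan_consolidation and the manifest argument (a list saying which disk folder belongs to which series, subseries and item group) are assumptions, since the notes do not say how that assignment is obtained. The sketch only plans the copies and reports collisions; it does not copy anything.

```python
from pathlib import Path


def plan_consolidation(manifest, root):
    """Plan copies of disk-level folders into root/series/subseries/itemgroup.

    `manifest` is a hypothetical list of
    (source_folder, series, subseries, itemgroup) tuples.
    Returns (copies, collisions): `copies` is a list of (source, destination)
    pairs, `collisions` lists destinations claimed by more than one source.
    """
    root = Path(root)
    planned = {}     # destination path -> first source file that claimed it
    collisions = []  # (destination, first source, later source)
    for source_folder, series, subseries, itemgroup in manifest:
        target_dir = root / str(series) / str(subseries) / str(itemgroup)
        for src in sorted(Path(source_folder).iterdir()):
            if not src.is_file():
                continue
            dst = target_dir / src.name
            if dst in planned:
                # Two folders both contain e.g. 1.tif: copying both would
                # silently overwrite the first one.
                collisions.append((dst, planned[dst], src))
            else:
                planned[dst] = src
    copies = [(src, dst) for dst, src in planned.items()]
    return copies, collisions
```

Reporting collisions instead of copying blindly keeps the original scans safe until the normalized names are available to tell the clashing files apart.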

Requirements for the Python script

Rename the files as per the normalized name format: <series>_<subseries>_<itemGroupId>_<itemSubGroupId>_<itemId>.<ext>

The tokens in the format scheme above are described below

Token	Description
series	series number
subseries	subseries number - defaults to 0 in case subseries number is not applicable
itemGroupId	item group number (like box no., register no. etc.) - defaults to 0 in case item group information is not applicable
itemSubGroupId	item subgroup number - defaults to 0 in case item subgroup information is not applicable
itemId	alphanumeric id for any individual scanned item - should reset to 1 at the itemSubGroupId level if itemSubGroupId ≠ 0 (i.e. item subgroup information is applicable) - should reset to 1 at the itemGroupId level if itemGroupId ≠ 0 (i.e. item group information is applicable) - should reset to 1 at the subseries level if itemGroupId = 0 (i.e. item group information is not applicable)
ext	file extension for the individual scanned item
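
One possible way to implement the token scheme above is to keep a separate itemId counter for every (series, subseries, itemGroupId, itemSubGroupId) combination. Because an inapplicable level defaults to 0, a single counter keyed on all four values covers the three reset rules for itemId. The class name NameNormalizer and its method are made up for this sketch, and it assumes a purely numeric itemId with no zero padding, since the format does not specify padding.

```python
from collections import defaultdict


class NameNormalizer:
    """Hand out normalized names, restarting itemId at 1 for each group."""

    def __init__(self):
        # One running counter per (series, subseries, itemGroupId, itemSubGroupId).
        self._counters = defaultdict(int)

    def next_name(self, series, ext, subseries=0, item_group=0, item_subgroup=0):
        key = (series, subseries, item_group, item_subgroup)
        self._counters[key] += 1
        item_id = self._counters[key]
        ext = ext.lstrip(".")  # accept both "tif" and ".tif"
        return f"{series}_{subseries}_{item_group}_{item_subgroup}_{item_id}.{ext}"


normalizer = NameNormalizer()
print(normalizer.next_name(1, "tif", subseries=1, item_group=2))  # 1_1_2_0_1.tif
print(normalizer.next_name(1, "tif", subseries=1, item_group=2))  # 1_1_2_0_2.tif
print(normalizer.next_name(1, "tif", subseries=1, item_group=3))  # 1_1_3_0_1.tif
```

The third call shows the reset: moving to a new item group starts itemId from 1 again.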

Create a mapping of original and normalized file names. If instructed to via the command-line options, store the mapping in the database and copy the renamed files to a new path.
Each record in the map should include:
- original file path
- normalized file name
- path where file with normalized name is/will be copied
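
A mapping record with the three fields listed above could look like the sketch below. The notes do not name the database or the map file format, so SQLite (via the standard sqlite3 module), a tab-separated map file, the table name file_name_map and all function names here are assumptions chosen only for illustration.

```python
import csv
import sqlite3
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class MappingRecord:
    original_path: str     # original file path
    normalized_name: str   # normalized file name
    destination_path: str  # path where the renamed file is/will be copied


def append_to_map_file(records, map_file):
    """Append records to the map file as tab-separated lines."""
    with open(map_file, "a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerows(astuple(record) for record in records)


def commit_to_database(records, db_path):
    """Store records in a SQLite table (assumed schema), one row per file."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS file_name_map ("
                "original_path TEXT PRIMARY KEY, "
                "normalized_name TEXT NOT NULL, "
                "destination_path TEXT NOT NULL)"
            )
            conn.executemany(
                "INSERT OR REPLACE INTO file_name_map VALUES (?, ?, ?)",
                [astuple(record) for record in records],
            )
    finally:
        conn.close()
```

Using the original path as the primary key means that re-running the script for the same series updates existing rows instead of duplicating them.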

Should provide bash-like command-line switches to pass arguments

Option	Description	Arguments	Mandatory/Optional
s	series no	series no.s: 1 to 12	M
t	begin transfer of files to path specified for 'p'	path	O
m	map file to store mapping (opened in append mode)	file	O
d	commit mapping to database		O
e	file extension	tiff, pdf	M
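
The switches in the table above could be declared with Python's standard argparse module, roughly as sketched here. The dest names and help texts are made up for the sketch. The table mentions a path "specified for 'p'" but lists no p option, so this sketch simply treats the argument of -t as the destination path; that reading is an assumption.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        description="Normalize scanned file names for one series."
    )
    # Mandatory options
    parser.add_argument("-s", dest="series", type=int, choices=range(1, 13),
                        required=True, metavar="SERIES",
                        help="series number (1 to 12)")
    parser.add_argument("-e", dest="extension", choices=["tiff", "pdf"],
                        required=True, help="file extension")
    # Optional options
    parser.add_argument("-t", dest="transfer_path", metavar="PATH",
                        help="copy renamed files to this path")
    parser.add_argument("-m", dest="map_file", metavar="FILE",
                        help="map file to store the mapping (append mode)")
    parser.add_argument("-d", dest="commit_db", action="store_true",
                        help="commit the mapping to the database")
    return parser


args = build_parser().parse_args(["-s", "3", "-e", "tiff", "-d"])
print(args.series, args.extension, args.commit_db)  # 3 tiff True
```

Leaving out -t, -m and -d gives a dry run that only builds the mapping in memory, which matches the "if commanded to" wording of the requirements.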
