Identifying and removing personally identifiable information (PII)
Introduction
Personally identifiable information (PII) refers to information that can be used to determine the identity of a person who would otherwise be anonymous. Sensitive PII refers to information that could cause harm (such as identity theft or stolen financial information) to the subject of the information if it is made public. When working with born-digital collections, archivists have to be vigilant, as collection donors may not know whether the files they turn over to UTL contain PII. If a donor transfers an entire disk or disk image to UTL, the risk of sensitive PII being transferred increases dramatically, as the disk or image can include files wholly unrelated to the project the donor is interested in archiving, such as personal financial records.
Before access to a born-digital collection is provided to researchers (either online or in a reading room), archivists should be sure that access will not expose sensitive PII that can be used in harmful ways. The size of many digital collection transfers makes manual review of every file in a collection impossible, while the vagaries of how file formats are structured mean a file can contain embedded PII that is not visible at a glance. Identifying and removing sensitive PII is best accomplished using forensics software that scans both the visible and invisible file layers of an entire disk image. Working with this kind of software to find PII involves specifying the patterns to search for, reviewing the search results, and determining which hits are "actual" PII needing redaction and which are false positives (data that matches the searched pattern but doesn't actually refer to any individual's private information).
Redacting sensitive PII stands in conflict with digital forensics principles that encourage archivists to preserve a disk at the bit level, exactly as the donor provided it and without modifying any files. Redaction involves altering files to ensure that sensitive PII cannot be extracted from the collection in any way. Although redacting from an access copy of a collection and preserving the original unredacted disk image on tape is an option, preserving a disk image containing sensitive PII carries the risk that the information will be exposed later, if the unredacted files are restored and provided to a researcher. Archivists working with a digital collection must choose the right balance of forensic treatment and privacy measures for that collection.
Types of PII
Sensitive PII
Sensitive PII should be redacted unless there are specific reasons not to. The search method for each kind of PII will depend on the structure allowed by whatever body or agency governs assignment for that piece of information. Searches should be thorough enough to find every potential instance of PII in a disk image, not only those that match the precise standard structure. See below for examples of what this looks like in practice.
Social Security Numbers
Nine digits, usually in the following format:
123-45-6789
In order to be thorough, it is typically a good idea to also search for variations on this pattern, such as the following:
Nine digits with spaces rather than hyphens:
123 45 6789
Nine digits without hyphens or spaces, but preceded immediately by an uppercase or lowercase "SSN", with or without a space before the numbers:
SSN123456789
ssn123456789
SSN 123456789
ssn 123456789
The same patterns, but with a colon added:
SSN:123456789
ssn:123456789
SSN: 123456789
ssn: 123456789
Nine digits without hyphens or spaces, but preceded immediately by an uppercase or lowercase "SS#", with or without a space before the numbers:
SS#123456789
ss#123456789
SS# 123456789
ss# 123456789
The same pattern, but with a colon added:
SS#: 123456789
ss#: 123456789
SS#:123456789
ss#:123456789
The same pattern, but with a space after "SS" and the pound sign immediately before the nine digits:
SS #123456789
ss #123456789
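The variations listed above can be folded into a single regular expression. The following is a minimal Python sketch, not a pattern taken from Autopsy or any other tool; the names SSN_PATTERN and find_ssn_candidates are hypothetical and chosen only for illustration. It is deliberately permissive (for example, it also accepts a mix of hyphen and space separators), so every hit should be treated as a candidate that still needs manual review.

```python
import re

# Hypothetical pattern covering the variations listed above.
# re.VERBOSE allows comments inside the pattern; literal spaces and
# the pound sign are written inside character classes so that
# verbose mode does not ignore them.
SSN_PATTERN = re.compile(
    r"""
    \b\d{3}[- ]\d{2}[- ]\d{4}\b              # 123-45-6789 or 123 45 6789
    |
    \b(?:ssn|ss[ ]?[#])                      # SSN, SS# or SS # label
    :?[ ]?                                   # optional colon, optional space
    \d{9}\b                                  # nine digits with no separators
    """,
    re.IGNORECASE | re.VERBOSE,
)


def find_ssn_candidates(text):
    """Return every substring of text that looks like an SSN."""
    return [match.group(0) for match in SSN_PATTERN.finditer(text)]


if __name__ == "__main__":
    sample = "Applicant SSN: 123456789, spouse 123 45 6789, phone 5125551234"
    for hit in find_ssn_candidates(sample):
        print(hit)
```

A bare run of nine digits with no label is intentionally not matched, since it would produce a large number of false positives; whether to search for it anyway is a judgment call for the collection at hand.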
Bank Account Numbers
Bank account numbers are typically 8 to 12 digits long, depending on the bank in question. The variety of account number structures makes it difficult to search for a single matching pattern. The best approach may be to search for any combination of the following, where each bracketed group lists alternatives separated by "|":
[Account|account|Acct|acct] [Number|number|Num|num|#] [8 or more digits]
[Account|account|Acct|acct] [Number|number|Num|num|#]: [8 or more digits]
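The two label patterns above can be expressed as one regular expression. This is a rough Python sketch under stated assumptions: the names ACCOUNT_PATTERN and find_account_candidates are hypothetical, the pattern expects exactly one space between each part as written above, and case-insensitive matching is used, which also admits spellings such as "ACCOUNT" that the list does not name.

```python
import re

# Hypothetical pattern: a label word, a "number" word or pound sign,
# an optional colon, then a run of eight or more digits.
ACCOUNT_PATTERN = re.compile(
    r"\b(?:account|acct)[ ](?:number|num|#):?[ ](\d{8,})\b",
    re.IGNORECASE,
)


def find_account_candidates(text):
    """Return (full match, digits) pairs for likely account numbers."""
    return [(m.group(0), m.group(1)) for m in ACCOUNT_PATTERN.finditer(text)]


if __name__ == "__main__":
    sample = "Wire to Account Number: 12345678 or acct # 000123456789."
    for full_match, digits in find_account_candidates(sample):
        print(full_match, "->", digits)
```

Replacing each `[ ]` with `\s*` would also catch labels with no space or with extra whitespace, at the cost of more false positives to review.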
Credit Card information
Passwords
Potentially sensitive PII - depends on the context
Full names
Addresses
Phone numbers
Email addresses
Distinguishing between false positives and actual PII
Hex values
Visible vs. invisible data layers
Tools
Autopsy
Python
Python libraries
Adobe Photoshop
Adobe Acrobat
HxD or other hex editor
Identify PII with Autopsy
Add new regular expressions to find specific types of PII
Extract files from disk image
Redact PII with Python
Use regex to find
Output files with "_redacted" suffix
Add original file extension to any converted formats (e.g. file_doc.docx)
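The naming convention described above could be sketched in Python as follows. This is an illustration only: the function name redacted_path is hypothetical, and the order in which the original extension and the "_redacted" suffix are appended is an assumption, since the outline does not specify it.

```python
from pathlib import Path


def redacted_path(original, converted_suffix=None):
    """Build an output path for a redacted copy of a file.

    original: path of the source file.
    converted_suffix: extension of the output (for example ".docx")
    when redaction required converting the file to another format.
    """
    original = Path(original)
    stem = original.stem
    suffix = original.suffix

    converted = (
        converted_suffix is not None
        and converted_suffix.lower() != suffix.lower()
    )
    if converted:
        # Record the original extension in the name, e.g. file_doc.docx.
        if suffix:
            stem = f"{stem}_{suffix.lstrip('.')}"
        suffix = converted_suffix

    # Assumed ordering: the "_redacted" marker goes last, before the extension.
    return original.with_name(f"{stem}_redacted{suffix}")


if __name__ == "__main__":
    print(redacted_path("letters/file.doc", ".docx"))
    print(redacted_path("notes.txt"))
```

Writing redacted output to a new path rather than overwriting the source keeps the extracted original available for comparison during review.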