Identifying and removing personally identifiable information (PII)

Introduction

Personally identifiable information (PII) refers to information that can be used to determine the identity of a person who would otherwise be anonymous. Sensitive PII refers to information that could cause harm to its subject (such as identity theft or financial loss) if it is made public. When working with born-digital collections, archivists have to be vigilant, as collection donors may not know whether the files they turn over to UTL contain PII. If a donor transfers an entire disk or disk image to UTL, the risk of sensitive PII being transferred increases dramatically, as the disk or image can include files wholly unrelated to the project the donor is interested in archiving, such as personal financial records.

Before access to a born-digital collection is provided to researchers (either online or in a reading room), archivists should be sure that access will not expose sensitive PII that could be used in harmful ways. The size of many digital collection transfers makes manual review of every file in a collection impossible, and the way file formats are structured means a file can contain embedded PII that is not visible at a glance. Identifying and removing sensitive PII is best accomplished using forensics software that scans both the visible and invisible file layers of an entire disk image. Working with this kind of software to find PII involves specifying the patterns to search for, reviewing the search results, and determining which hits are actual PII needing redaction and which are false positives (data that matches the searched pattern but doesn't actually refer to any individual's private information).

Redacting sensitive PII stands in conflict with digital forensics principles that encourage archivists to preserve a disk at the bit level, exactly as the donor provided it and without modifying any files. Redaction involves altering files to ensure that sensitive PII cannot be extracted from the collection in any way. Although redacting from an access copy of a collection and preserving the original unredacted disk image on tape is an option, preserving a disk image containing sensitive PII risks that information being exposed later, if the unredacted files are restored and provided to a researcher. Archivists working with a digital collection must choose the right balance of forensic treatment and privacy measures for that collection.

Types of PII

Sensitive PII

Sensitive PII should be redacted unless there are specific reasons not to. The search method for each kind of PII will depend on the structure allowed by whatever body or agency governs assignment for that piece of information. Searches should be thorough enough to find every potential instance of PII in a disk image, not only those that match the precise standard structure. See below for examples of what this looks like in practice.

Social Security Numbers

Nine digits, usually in the following format:

123-45-6789

To be thorough, it is typically a good idea to also search for variations on this pattern, such as the following:

Nine digits with spaces rather than hyphens:

123 45 6789

Nine digits without hyphens or spaces, but preceded immediately by an uppercase or lowercase "SSN", with or without a space before the numbers:

SSN123456789
ssn123456789

SSN 123456789
ssn 123456789

The same patterns, but with a colon added:

SSN:123456789
ssn:123456789

SSN: 123456789
ssn: 123456789

Nine digits without hyphens or spaces, but preceded immediately by an uppercase or lowercase "SS#", with or without a space before the numbers:

SS#123456789
ss#123456789

SS# 123456789
ss# 123456789

The same patterns, but with a colon added:

SS#: 123456789
ss#: 123456789

SS#:123456789
ss#:123456789

The same pattern, but with a space after "SS" and the pound sign immediately before the nine digits:

SS #123456789
ss #123456789
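
The variations listed above can be folded into a single regular expression. The following is a minimal Python sketch, not a definitive implementation. The names SSN_PATTERN and find_ssn_candidates are hypothetical, and the labelled branch is slightly broader than the list above (it also accepts "SS #:" with a colon). Every hit is still only a candidate that needs manual review.

```python
import re

# Hypothetical pattern covering the SSN variations listed above.
# In VERBOSE mode, literal spaces and "#" must sit inside a character class.
SSN_PATTERN = re.compile(
    r"""
      (?<!\d)\d{3}-\d{2}-\d{4}(?!\d)        # 123-45-6789
    | (?<!\d)\d{3}[ ]\d{2}[ ]\d{4}(?!\d)    # 123 45 6789
    | (?:SSN|ssn|SS[ ]?[#]|ss[ ]?[#])       # label: SSN, SS# or SS #
      :?[ ]?\d{9}(?!\d)                     # optional colon and space, then nine digits
    """,
    re.VERBOSE,
)


def find_ssn_candidates(text):
    """Return every substring of text that looks like one of the SSN variations."""
    return [match.group(0) for match in SSN_PATTERN.finditer(text)]
```

The lookarounds on either side of the digits keep the pattern from matching part of a longer run of digits, which reduces (but does not eliminate) false positives.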

Bank account numbers

Bank account numbers are typically 8 to 12 digits long, depending on the bank in question. The variety of account number structures makes it difficult to search for a single matching pattern. The best approach may be to search for any combination of the following:

[Account|account|Acct|acct] [Number|number|Num|num|#] [8 or more digits]
[Account|account|Acct|acct] [Number|number|Num|num|#]: [8 or more digits]
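
The two search patterns above could be expressed as one regular expression along these lines. This is a sketch under the assumption that the label words and the digits are separated by single spaces, exactly as written above; ACCOUNT_PATTERN and find_account_candidates are hypothetical names.

```python
import re

# Hypothetical pattern: a label word, a space, a "number" word or "#",
# an optional colon, a space, then eight or more digits.
ACCOUNT_PATTERN = re.compile(
    r"(?:Account|account|Acct|acct) (?:Number|number|Num|num|#):? \d{8,}"
)


def find_account_candidates(text):
    """Return every substring of text that looks like a labelled account number."""
    return [match.group(0) for match in ACCOUNT_PATTERN.finditer(text)]
```

Because the pattern relies on the label, it will miss account numbers that appear without one, for example in a spreadsheet column.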

Credit card information

Passwords

Potentially sensitive PII - depends on the context

Full names

Addresses

Phone numbers

Email addresses

Distinguishing between false positives and actual PII

Hex values

Visible vs. invisible data layers

Tools

Autopsy

Python

Python libraries

Adobe Photoshop

Adobe Acrobat

HxD or other hex editor

Identify PII with Autopsy

Add new regular expressions to find specific types of PII

Extract files from disk image

Redact PII with Python

Use regex to find

Output files with "_redacted" suffix

Add original file extension to any converted formats (e.g. file_doc.docx)
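
The two naming conventions above might be implemented with helpers like the following. This is only a sketch: the function names redacted_name and converted_name are hypothetical, and it assumes the original extension is appended to the file stem with an underscore, as in the file_doc.docx example.

```python
from pathlib import Path


def redacted_name(path):
    """Return the path with "_redacted" added before the file extension."""
    p = Path(path)
    return p.with_name(f"{p.stem}_redacted{p.suffix}")


def converted_name(path, new_ext):
    """Return the path for a converted copy, keeping the original extension in the name.

    new_ext includes the leading dot, e.g. ".docx".
    """
    p = Path(path)
    original_ext = p.suffix.lstrip(".")
    return p.with_name(f"{p.stem}_{original_ext}{new_ext}")
```

For example, converted_name("file.doc", ".docx") gives file_doc.docx, and passing that result to redacted_name gives file_doc_redacted.docx.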

Redacting .msg files

Redacting .txt files

Redacting .docx files

Redacting .doc files

Convert to .docx using Visual Basic

Redact .docx files

Redacting .xls and .xlsx files

Redacting .html files

Manual redaction

Redacting .tif and other image files

Redacting .pdf files

Create a list of files on the disk with original paths and timestamps
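
One possible way to build such a list with Python is sketched below. It assumes the disk image has already been mounted read-only (or its files extracted) to a directory, and that the file system timestamps seen there still reflect the originals, which depends on how the image was mounted or extracted. The function name list_files and the column names are hypothetical.

```python
import os
from datetime import datetime, timezone


def list_files(root):
    """Return one record per file under root: relative path, size, and modified time (UTC)."""
    rows = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            info = os.lstat(full_path)  # lstat avoids following symbolic links
            modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
            rows.append(
                {
                    "path": os.path.relpath(full_path, root),
                    "size_bytes": info.st_size,
                    "modified_utc": modified.isoformat(),
                }
            )
    return sorted(rows, key=lambda row: row["path"])
```

The resulting records can be written out with the standard csv module (csv.DictWriter) so the list is kept alongside the collection.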