Converting EAD XML to simple spreadsheets in OpenRefine

Guide written 2018 by David Bliss

Purpose

Metadata for certain archival collections is stored only as EAD XML files, for example when archivists encode directly in EAD during processing. While EAD is likely to remain the descriptive encoding standard for some time, it may occasionally be necessary or useful to have access to the same metadata in the form of a spreadsheet, for example when reformatting or manipulating the data. This workflow is one example of how to convert EAD XML metadata to a standard spreadsheet file, while requiring minimal manual cleanup or transcription. This process will need to be adapted significantly to meet the particular structure and syntax of any given EAD XML to be converted.

Requirements

  1. One or more EAD XML files from a collection to be mined

  2. OpenRefine, available as a free download from openrefine.org. This guide uses version 2.7.

  3. Notepad++, available as a free download from notepad-plus-plus.org. This guide uses version 7.5.8.

  4. A spreadsheet program, such as Microsoft Excel, OpenOffice Calc, or Google Sheets. This guide uses Microsoft Excel.

Use case & example files

This guide is based on the specific workflow required to convert EAD XML to spreadsheets for the "Revolution and Counterrevolution in Guatemala, 1944-1963" collection at the Benson Latin American Collection. The finding aid for this collection is broken into three parts, available online:

Part one

Part two

Part three

The finding aid is an item-level inventory of each object in the collection. Each object has an item number, a title, measurements, and a date. Many objects have one or more authors, and many have descriptive notes. This collection is organized by year, and the items from each year are organized according to material type (Broadsides; Miscellanea; Campaign Handbills, which are themselves further divided according to election, candidate, and election locale; and Posters). Each material type has different descriptive needs, the XML underlying each may be very different, and each may therefore require its own workflow. This guide extracts the Broadsides metadata from the XML, since the broadsides are the most extensive subseries (and first alphabetically); the same approach can be adapted to extract metadata from the other subseries.

Step one: XML extraction and preliminary cleanup

All finding aids hosted on TARO allow users to access a finding aid's raw XML directly from the HTML version of the page. Click the "Raw XML File" link at the top of a finding aid to view its XML, then save the page as an XML file on your local machine, or simply right-click the link and choose "Save link as...". (If these finding aids are taken offline, a copy of the XML files can be downloaded directly at the bottom of this page.)

For this workflow, we will save all three finding aids locally as a first step. We will then combine all the Broadsides sections together, to produce a single broadsides.xml document which we will manipulate in OpenRefine. Most Benson finding aids are not split into multiple pages, so this process of combining them is unique to this collection.
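For those comfortable with scripting, the manual extraction described in the rest of this step can also be approximated in a few lines of Python. The sketch below relies on the same structural guarantee the manual process does, namely that Broadsides is always the first <c02> subseries within each <c01> year series; the file names are placeholders for the three locally saved finding aids:

    import re

    OPEN_C02 = '<c02 level="subseries">'
    entries = []

    for filename in ["part_one.xml", "part_two.xml", "part_three.xml"]:
        with open(filename, encoding="utf-8") as f:
            xml = f.read()
        # Each "<c01" opens one year series; Broadsides is its first subseries.
        for year_series in xml.split("<c01")[1:]:
            parts = year_series.split(OPEN_C02)
            if len(parts) < 2:
                continue  # a year series with no subseries at all
            # parts[1] runs from the Broadsides heading up to the next subseries.
            entries += re.findall(r'<c03 level="item">.*?</c03>', parts[1],
                                  flags=re.DOTALL)

    with open("broadsides.xml", "w", encoding="utf-8") as out:
        out.write("\n".join(entries))

The manual Notepad++ route below achieves the same result, and makes it easier to eyeball each section as it is copied.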

Open each of the XML files in Notepad++. Note that the first item-level description in each file is found well below the top of the document (line 128 in the version used here). The goal for preparing the XML is to isolate each relevant entry, producing a document containing only entries at the same level of hierarchy in the EAD schema. In this case, that means isolating each <c03> entry falling under the Broadsides subseries. Each <c03> entry falls directly underneath a <c02> subseries heading, which is itself under a <c01> series heading.
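In simplified form, reconstructed from the tags discussed later in this guide (wrapper elements and attribute details are trimmed here, and the exact element order may differ in the real files), a broadside entry looks roughly like this:

    <c03 level="item">
      <container type="folder" label="Item No.">1.</container>
      <unittitle>AUTHOR<title xmlns:xlink="XLink namespace " ...>TITLE</title>ORIGIN STATEMENT</unittitle>
      <unitdate>DATE</unitdate>
      <physdesc label="Extent">EXTENT</physdesc>
      <scopecontent><p>DESCRIPTION</p></scopecontent>
    </c03>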

In order to produce a document containing all <c03> entries for broadsides, all the <c03> entries within each <c02> broadsides subseries in each <c01> year series need to be copied to a new document. Thankfully, EAD makes these headers reliably predictable, and the encoder of these finding aids always put the Broadsides subseries first in a given year's entries. Notepad++ has functionality that will make this process easier as well.

With the first EAD document open in Notepad++, hit Ctrl+F to open the Find tool, and copy the following into the search field: 

<c02 level="subseries">

The cursor will jump to the first Broadsides subseries header, immediately above the first broadside entry that we want to copy. Move the cursor to the left of the first <c03> tag below, right-click, and select "Begin/End Select". Hit Ctrl+F and search for the same tag again to jump to the next subseries header (in this case "MISCELLANEA"). Move the cursor to the right of the last </c03> tag immediately above, right-click, and select "Begin/End Select" again. The entire broadsides section for the first year (1944, in this case) will be selected, without any unwanted header information. Copy and paste the highlighted text into a new document in Notepad++, and save it as broadsides.xml. If the pasted text doesn't preserve the coloration and style of the base XML, open the "Language" drop-down menu at the top of the screen and select "XML" (or just hit X with the menu open).

Go through the three EAD XML files in this way, copying all of the <c03> entries for each Broadsides subseries into the broadsides.xml document. Remember that there are multiple broadsides sections in each document! If this has been done correctly, the result will be a broadsides.xml document containing only item-level metadata for broadsides, each entry between <c03> and </c03> tags. Because each item has individual date metadata recorded already, the <c01> year data isn't needed, and because all items are of the same format, the <c02> broadsides header data isn't needed either.

There are two easy ways to check that the copying and pasting were done correctly, with no formatting issues (both checks can also be scripted; see the sketch after this list):

  • Search for "<c01" and "<c02" (without the closing bracket, so that tags with attributes are also caught) in the broadsides.xml document. Neither should appear anywhere in the document.

  • Count the number of times "<c03 level="item">" and "</c03>" appear in broadsides.xml, using the Count button in the Find window. These numbers should be the same.
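Both checks can also be run as a quick Python script; a minimal sketch, assuming the file is named broadsides.xml:

    text = open("broadsides.xml", encoding="utf-8").read()

    # Check 1: no series- or subseries-level tags should have been copied over.
    for tag in ("<c01", "<c02"):
        print(tag, "occurrences:", text.count(tag))  # both should print 0

    # Check 2: every item entry should open and close exactly once.
    print("openers:", text.count('<c03 level="item">'))
    print("closers:", text.count("</c03>"))  # should match the openers count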

Finally, the broadsides document needs to be edited so that each entry occupies just a single line. Save the document as a new file before this next step; call it "broadsides_working.xml" or something similar, to mark it as modified from the original XML. It may be helpful to keep the unmanipulated (easier to read) broadsides XML data on hand, in case something goes wrong down the line and the data needs to be changed.

Select all the text in the new working document with Ctrl+A, and select Edit → Line Operations → Join Lines. This will merge all lines (twenty thousand or so) into one. This may take some time or even cause the program to stall. A computer with a powerful processor and 8 GB of RAM or more may be required.

Once the full document is combined into a single line, save it again. Open the Replace tool window with Ctrl+H (or click over from the Find tool window). In the "Search Mode" box at the bottom, select "Extended (\n, \r, \t, \0, \x...)". Type "</c03>" (without the quotes) next to "Find what", and replace each instance with "\n" (again without the quotes). This will replace each instance of </c03>, which ends each broadside entry, with a new line. This may take some time once again.

If this worked correctly, each line should now contain a single broadside entry (its closing </c03> tag replaced by the line break). We will use OpenRefine to extract the values and separate these lines into columns. Delete any blank lines at the end of the document, and save the file.
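Both of these Notepad++ steps (joining the lines and replacing the closing tags) can also be done in one short script, which avoids stalling the editor on large files; file names as above:

    with open("broadsides.xml", encoding="utf-8") as f:
        xml = f.read()

    # Collapse the document onto one line, squeezing runs of whitespace, then
    # put each entry on its own line by swapping </c03> for a line break,
    # exactly as the Notepad++ replace does.
    one_line = " ".join(xml.split())
    entries = one_line.replace("</c03>", "\n")

    with open("broadsides_working.xml", "w", encoding="utf-8") as f:
        f.write(entries.strip() + "\n")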

Step two: Ingest into OpenRefine

Start OpenRefine, and select "Create Project". Upload the broadsides_working.xml file, with one line per entry, and move to the next screen.

Under "Parse data as", select "Line-based text files".

Click the box next to "Character encoding" and in the window that appears, select UTF-8. This is necessary for OpenRefine to correctly parse special characters like letters with accents and tildes.

Make sure the column header in the preview says "Column 1" and doesn't contain any data: adjust the "Ignore first" fields at the bottom if this is not the case.

Finally, give the project a name at the top right and click "Create Project".

Step three: Data manipulation in OpenRefine

Next comes the core of the workflow: pulling each row apart, removing tags, and formatting the values to make them useful and readable as spreadsheets. OpenRefine is a very powerful piece of software that will manipulate large chunks of the imported data at once. Some of these manipulations and transformations are relatively straightforward, while others are more complex. Any transformations applied to the data must be precise, to avoid unwanted results. The size of the broadsides XML file that has been imported for this project means that it can be difficult to even recognize when a mistake has been made. It is important to review the data for inconsistency or errors after each step performed below.

This workflow will use the following tools and commands in OpenRefine:

  • Splitting/partitioning

  • Text transformation

  • String replacement

  • Faceting

  • Clustering

  • Manual cleanup

Consult the official Recipes page in the OpenRefine wiki to understand some of the commands before continuing.

The workflow that follows is not the only way to convert this data to a spreadsheet. There are a number of ways to go about most projects in OpenRefine, some more efficient than others. The goal of any OpenRefine workflow should be to minimize manual cleanup and errors resulting from inconsistency.

 

When the project is first created, each row should contain a single broadside entry in a single column. It may be helpful to show more than the default 5 rows at once; use the row-display options at the top of the table.

Field 1: Item no.

The first field in each row is the item number. Every entry has an item number, so this is a relatively straightforward field to extract first. Notice that the value is stored between "<container type="folder" label="Item No.">" and "</container>" in each row. The value simply needs to be extracted from between those tags.

Click the drop-down arrow on Column 1 and select Edit column → Add column based on this column...

Name the new column "Item no." and enter the following value in the Expression field:

value.split('<container type="folder" label="Item No.">')[1].split('</container>')[0]

This splits the string in each cell at the "<container type..." opening tag, keeps everything that follows that tag, then splits the remaining string at the "</container>" closing tag, discarding everything after the item number itself. This type of chained splitting does most of the transformation in this workflow.
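The same chained splitting, demonstrated on a made-up cell value in Python (whose split behaves the same way for this purpose):

    # A hypothetical cell value, trimmed to the part that matters here.
    row = ('<c03 level="item"><container type="folder" label="Item No.">'
           '12.</container><unittitle>...</unittitle>')

    # Keep what follows the opening tag, then keep what precedes the closing tag.
    after_open = row.split('<container type="folder" label="Item No.">')[1]
    item_no = after_open.split('</container>')[0]
    print(item_no)  # -> 12.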

Scroll down in the preview window to review the first few dozen rows, and make sure there are no errors or unexpected values, then hit OK.

Click the drop-down at the top of the new "Item no." column and select Facet → Text facet. Review the values in the box that appears to the left and double-check that there are no obviously incorrect values, which would probably indicate that the base data was badly formatted somewhere.

Field 2: Extent

Apply the same splitting transformation to the original column to separate the extent field, which is surrounded by <physdesc label="Extent"> and </physdesc> tags:

value.split('<physdesc label="Extent">')[1].split('</physdesc>')[0]

Name the new column "Extent".

The extent and item number fields are the easiest to isolate, because they are the most consistently formatted in the base data. Facet the Extent column to check for obvious errors before moving on. There may be a small number of blank rows, where no extent was entered in the base data.

Field 3: Description

The description field follows the extent in each row, though not every row actually has a description in this case. Notice that descriptions are wrapped in two consecutive tags: <scopecontent> and <p>. Rather than handling both tags in a single expression, these will be handled separately here, so that each step can be reviewed on its own (see the note below). First, add a column separating the description values and <p> tags from the <scopecontent> tags:

value.split('<scopecontent>')[1].split('</scopecontent>')[0]

Name this column "Description" and be sure "set to blank" is selected under "On error" below the expression: because not every row has a description section, this column's value for those rows should be blank.

Facet the Description column and notice that, with the exception of the blank cells, each cell begins with <p> and ends with </p>. Open the drop-down menu at the top of the Description column and select Edit cells → Transform... Use the same type of split expression as before to remove the tags: value.split('<p>')[1].split('</p>')[0]. Select "keep original" for "On error" so blank cells stay blank when this transformation is applied.

(Note that these two expressions could have been combined: value.split('<scopecontent>')[1].split('</scopecontent>')[0].split('<p>')[1].split('</p>')[0] would have achieved the same result. But it is safer to split expressions up and review the results of each step by faceting, to avoid unexpected results.)
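In Python terms, the same two-step logic, with blanks for rows that lack a description, might look like this (the sample values are made up):

    def description(row):
        # Rows with no <scopecontent> section get a blank, mirroring
        # OpenRefine's "set to blank" on-error option.
        parts = row.split('<scopecontent>')
        if len(parts) < 2:
            return ''
        inner = parts[1].split('</scopecontent>')[0]
        # The second pass strips the inner <p> wrapper.
        return inner.split('<p>')[1].split('</p>')[0]

    print(description('<c03>...<scopecontent><p>Mimeographed.</p></scopecontent></c03>'))
    print(repr(description('<c03>...no description section...</c03>')))  # -> ''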

Field 4: Date

The final easy field to isolate is the date field, which is stored between <unitdate> and </unitdate>. Create a new column based on the raw data, isolating the date using the standard approach: value.split('<unitdate>')[1].split('</unitdate>')[0]. Name the column "Date" and facet the result to make sure there are no errors or unexpected values in the new column.

There may be some blank fields, as a result of inconsistently formatted base data that places dates alongside Origin/Publication info. This will need to be cleaned up manually later.

Unit title fields

The remaining fields (Origin statement, Creator, and Title) are all much more difficult to separate and isolate than the previous four, because they are nested between tags together and highly inconsistent. The fullest possible entry looks something like this:

 <unittitle>AUTHOR<title xmlns:xlink="XLink namespace " xlink:type="simple" render="bolditalic" xlink:href="">TITLE</title>ORIGIN STATEMENT</unittitle>

However, many rows lack author information, in which case the <unittitle> opening tag is immediately followed by the <title xmlns...> tag. Whichever values are missing, every row should have an opening and closing <unittitle> tag, with a title in there somewhere. In cases like this it can be helpful to create a separate "working column" from which values can be gradually "peeled off". Create a new column based on the full XML data:

value.split('<unittitle>')[1].split('</unittitle>')[0]

Name this column "Unittitle". Facet and scan for erroneous blanks or unexpected results before proceeding.

Field 5: Origin statement

Notice that the fields are easier to identify in the Unittitle column, and that the origin statements (location and publisher) always follow the closing </title> tag. Create an Origin statement column based on this: value.split('</title>')[1].

Facet this column and notice that there are more errors than in the previously extracted columns: at least one row where the title was erroneously captured (the result of an extra </title> tag between the creator and title), and approximately 50 erroneously blank rows, where the XML encoder failed to separate the three fields with any tags at all (evidently because no title could be discerned; in many rows, brackets are used to mark the title instead). These can be cleaned up manually, or with more targeted transformations later.

Field 6: Creator

Like the Origin statement field, the Creator field sits alongside the title inside <unittitle>; however, more rows lack a value for this field. Those rows that do have a creator place the value before the following style tag: <title xmlns:xlink="XLink namespace " xlink:type="simple" render="bolditalic" xlink:href="">. Attempting to split along this tag, however, will not work for rows where no author is given: when the value in Unittitle begins with the "<title xmlns..." tag, value.split('<title xmlns:xlink="XLink namespace " xlink:type="simple" render="bolditalic" xlink:href="">')[0] will simply output the value that follows the tag, because there is nothing before the tag to output in those cases. That is, this expression works well for rows where an author is present, but not for those where one is absent.

Rather than value.split, use value.partition in this case. This command is similar to value.split, but it correctly preserves the values before the tag (as well as the tag itself), meaning it preserves blank values when a cell begins with the string being partitioned. This expression would isolate author values that come before the "<title xmlns..." tag: value.partition('<title')[0].
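Python's str.partition happens to behave the same way, so the difference is easy to see on two hypothetical Unittitle values:

    tag = '<title'  # abbreviated, as in the GREL expression above

    with_author = 'PEREZ, Juan<title ...>TITLE</title>ORIGIN STATEMENT'
    no_author = '<title ...>TITLE</title>ORIGIN STATEMENT'

    # partition keeps everything before the tag: blank when nothing precedes it.
    print(with_author.partition(tag)[0])      # -> PEREZ, Juan
    print(repr(no_author.partition(tag)[0]))  # -> ''

    # GREL's split, by contrast, drops the empty leading piece, so [0] would
    # return the title itself for the authorless row.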

However, this partition expression will not isolate author values in rows that lack the "<title" tag altogether, for example those where the title is demarcated only by brackets. Further partitioning the output of the first partition expression along opening brackets ("[") will successfully isolate an additional 30 or so rows. Create a new column, based on Unittitle, with the following expression, and call the column "Creator": value.partition('<title')[0].partition('[')[0]. Facet the new column and check for unexpected results before proceeding.

Field 7: Title

Finally, the title field still needs to be isolated. There are two main groups of titles: those demarcated by "<title xmlns..." and "</title>" tags, and those that are demarcated only by brackets. Each will have to be handled separately.

First, separate the titles with tags by creating a new column named "Title", based on Unittitle:

value.partition('<title xmlns:xlink="XLink namespace " xlink:type="simple" render="bolditalic" xlink:href="">')[2].split('</title>')[0]

Next, facet the Title column and select the blank cells (all the way at the bottom of the facet box). Most of these rows should be entries whose titles are demarcated only by brackets. Others will truly have no title. With the facet for blank cells still applied, open the drop-down menu at the top of the Title column, and select Edit cells → Transform... In the box that opens, enter the following expression:

cells['Unittitle'].value.partition('[')[1] + cells['Unittitle'].value.partition('[')[2].partition(']')[0] + cells['Unittitle'].value.partition(']')[1]

This ungainly expression performs the following transformations on only those cells that have been faceted for being blank:

  1. Pulls the value of the full "Unittitle" column into the corresponding "Title" column cell

  2. Partitions the Unittitle along the opening bracket, because it is important to keep these brackets in place to preserve the cataloger's role and decision. The first part of the expression, "cells['Unittitle'].value.partition('[')[1]", will copy an opening bracket from the Unittitle field only where one already exists. Where there is no opening bracket, for example where there truly is no title to be found, the cell will remain blank.

  3. Appends the value of the Unittitle field that follows the opening bracket, and cuts this value off at the closing bracket. Again, any cells lacking an opening bracket will remain blank.

  4. Appends any closing bracket in the Unittitle field.

It may be a good idea to paste each element of this expression into the Expression box individually, to see what each one does. The full expression effectively copies any values contained in brackets from the Unittitle column into Title, including the brackets themselves. For the version of the finding aid used in creating this workflow, this affected 27 of the 56 rows that were previously left blank.
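To see the full expression at work, here is the same logic in Python (whose partition again mirrors GREL's), applied to a hypothetical bracketed entry and to one with no title at all:

    def bracketed_title(u):
        # Opening bracket + the text inside the brackets + closing bracket.
        # Entries with no opening bracket fall through as blank.
        return (u.partition('[')[1]
                + u.partition('[')[2].partition(']')[0]
                + u.partition(']')[1])

    print(bracketed_title('PEREZ, Juan [Volante contra el gobierno] Guatemala'))
    # -> [Volante contra el gobierno]
    print(repr(bracketed_title('PEREZ, Juan Guatemala')))  # -> ''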

But there is still some additional manual cleanup needed, as a result of inconsistently formatted base data. Using text filtering and faceting, carefully review the full XML, Unittitle, and Title fields for any cells meeting either of the following criteria:

  1. Cells where the Title field is still blank. Some of these may not have had their titles bracketed off, in which case even the second expression would not have helped.

  2. Cells where the Title field contains opening or closing brackets. Some of these may have had additional, inconsistent title information contained outside the brackets, too.

In both of the above situations, there may be entire fields misplaced (for example, Origin statements placed in the Creator column) as a result of earlier transformations and misstructured data. In these cases, it is easiest to manually transform the data row-by-row now. Larger scale cleanup using transformations will come next.

Step four: Data cleanup

Now that each row has been mostly pulled apart into discrete columns, it is time to clean up the data as much as possible to ensure what remains is usable and free of major errors resulting from inconsistent XML data. Most OpenRefine projects require some degree of item-level cleanup, because most underlying data has inconsistencies that make large-scale transformations problematic. 

As a rule, the general approach for data cleanup involves the following:

  1. Faceting and bulk editing obvious errors, such as "______" entries where a blank entry should be.

  2. Faceting and reviewing blank entries, to make sure no data has wound up in the wrong column or has failed to copy altogether.

  3. Text filtering for characters that should not appear, such as tag characters (< and >), reviewing what might have caused these to remain.

The solutions to problems identified through these steps vary widely, from item-by-item editing to additional transformations that will replace unwanted characters at scale. The following cleanup steps address each field in the order they appear in each XML entry.

Item no.

Facet the values in the Item no. column, and look for any entries that are not simply numbers, or any values that appear twice. Some entries may be a range of numbers, but each entry should be unique and there should be no blank rows. Notice that most if not all entries end with a period. This period can be removed with the following transformation expression: value.partition('.')[0].

Creator

Facet the values in the Creator column. Notice the moderate number of entries beginning with "________.". Many, but not all, of these are followed by actual names; those that are not can be edited in bulk, from the Facet box itself. The blank entries will likely be too many to review in full, but it is a good idea to glance through them (comparing against the raw XML) to make sure author information hasn't been left out in any obvious, patterned way.

Click the drop-down at the top of the Creator column, and select Edit cells → Cluster and edit... In the window that appears, OpenRefine will display what it thinks are "clustered" values, which differ only slightly in punctuation but represent the same actual creator. Review these recommendations and check the "Merge?" box if the creators do indeed match. Specify what the merged values will be changed to in the "New Cell Value" text box to the right (this is a good occasion to remove unnecessary white space and punctuation). Select "Merge Selected & Re-Cluster" to apply the change and review any additional clusters.

Title

Facet the values in the Title column and review any blank entries. There are a small number of entries without titles, some of which may have been cleaned up already. Filter the column for tag characters, which likely indicate text tags such as bold or italic. These can be changed according to local style preferences, replacing styles with quotes for example. The value.replace command can be useful here, for example: value.replace('<title xmlns:xlink="XLink namespace " xlink:type="simple" render="italic" xlink:href="">','"') to replace the long italic tag with double quotes. In this expression, the double quotes are surrounded by single quotes, to match the expression syntax. Review the GREL recipes page for instructions on using value.replace. 

Origin statement

The values in the Origin statement column may be somewhat messy, with various punctuation marks before or after each value. Perform the same Cluster and edit on the Origin statement column values that was performed on the Creator column.

Facet the values in the Origin statement column and look for rows with XML tags. Edit these rows as needed (XML tags likely point to the wrong value being pulled into this column, which may require editing several columns, though hopefully not for many rows).

Facet the values and review blank entries. Edit these as needed. There may be rows with no publication origin information.

Date

Cluster and edit the values in the Date column. Values with question marks, for example "1953?", likely represent uncertainty on the part of the cataloger and are probably best left unchanged. OpenRefine is often smart enough to know that "29 agosto 1944" and "agosto 29 1944" are the same date, and to suggest these as clusters.

Facet the column and review blank entries. Many rows may have dates in the Origin statement column; copy these over to the Date column. Some rows may have two places of publication listed, for example when items were reprinted; these should be closely reviewed. The following expression will pull the values that appear between ":" and "(" in the Origin statement column into the Date column, which should be dates in most cases:

cells['Origin statement'].value.split(':')[1].split('(')[0]

This transformation may prompt some additional row-level cleanup for a few rows, for example where unwanted publication information is pulled into the Date column. This cleanup should still take less time than transcribing the dates manually. Facet the Date column once again to review these entries and clean them up individually.
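For instance, on a made-up Origin statement value of the shape this expression assumes ("place: date (other details)"):

    origin = 'Guatemala: 12 de octubre de 1944 (mimeografiado)'

    # Everything between the first ":" and the first "(".
    date = origin.split(':')[1].split('(')[0]
    print(date.strip())  # -> 12 de octubre de 1944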

Notice that many entries in the Date column may end in an extraneous period. As long as the period appears only at the end of each row (depending on prior manipulation and cleanup steps), it should be possible to apply a text filter of a single period to this column, then apply the following transformation to the matching rows: value.partition('.')[0]. Review the rows after the transformation is applied to make sure they were not blanked out or cut in half.

Extent

Facet the Extent column and scroll through to look for any entries with XML tags or obviously incorrect values. This column should be relatively clean. Review blank entries and make sure the row does in fact lack extent information.

Final cleanup steps in OpenRefine

With each column more or less cleaned up, the Unittitle column can be removed, and the remaining columns rearranged as desired.

Open the drop-down menu for each column and select Edit cells → Common transforms → Trim leading and trailing whitespace to remove blank spaces before and after every value. Because of the varied ways each column was created and manipulated, this will only apply to a few rows in certain columns and several hundred in others.

Step five: Export

When all cleanup is finished, select "Export" at the top right of the window and export the data to the desired file format. Additional cleanup steps may be easier to perform outside OpenRefine.

 

Appendix: Attachments

Click a file to download it