Metadata Creation Environmental Scan

Overview

This project focuses primarily on the creation of descriptive metadata and its movement through various hands on the way to ingest, display, and storage. It addresses the issues of large metadata backlogs and increased digital asset creation.

To view all graphs and a fuller report, please see the Documentation buttons below.



Recommendations

  • Improve or develop reuse workflows between sites that share similar, derivative, or identical items.
  • Reuse or improve delivery of minimum descriptive metadata to Digital Stewardship. Minimum metadata standards could make reuse between sites easier and more frequent, since partners could reasonably anticipate what content they are receiving.
  • Provide more formal guidance on partner-provided metadata and use cases.
  • Reduce duplication of assets held in multiple sites by designating a particular source of truth for assets.
  • Share scripts more broadly across sites.
  • Create and share tutorials and guides for OpenRefine use, scripts, etc.
  • Investigate whether reliance on local databases and drives can be reduced.
  • Investigate whether sites can share taxonomies and local authorities with each other.
  • Develop crosswalks and an appropriate location to host them for ease of use.
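
As an illustration of the crosswalk recommendation, a crosswalk can be kept as a plain mapping table that is easy to host and share. The sketch below is hypothetical: the source field names are generic Dublin Core-style labels, the target names follow the OpenGeoMetadata Aardvark naming pattern but should be checked against the published schema, and `apply_crosswalk` is an invented helper rather than an existing UTL script.

```python
# Hypothetical crosswalk: source field -> (target field, is_multivalued).
# Field names on both sides are illustrative assumptions, not a UTL standard.
CROSSWALK = {
    "title": ("dct_title_s", False),
    "creator": ("dct_creator_sm", True),
    "description": ("dct_description_sm", True),
    "subject": ("dct_subject_sm", True),
}


def apply_crosswalk(record, crosswalk=CROSSWALK):
    """Map a source record to target fields; return (mapped, unmapped)."""
    mapped = {}
    unmapped = {}
    for field, value in record.items():
        if field not in crosswalk:
            # Keep unmapped fields visible so staff can review them.
            unmapped[field] = value
            continue
        target, multivalued = crosswalk[field]
        values = value if isinstance(value, list) else [value]
        # Trim whitespace and drop empty or non-string values.
        values = [v.strip() for v in values if isinstance(v, str) and v.strip()]
        if not values:
            continue
        mapped[target] = values if multivalued else values[0]
    return mapped, unmapped
```

Returning the unmapped fields alongside the mapped ones is a deliberate choice here: it keeps the quality-control step visible, since reused metadata still needs review before ingest.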



Methods

The Metadata Steering Committee chose to conduct an environmental scan to understand the full scope of UTL metadata processes. We reviewed 11 UT Libraries repositories, noting for each which staff worked with it, its main purposes (in relation to the content it holds), the types of metadata maintained, the schemas used, the sources from which metadata was created or built, the tools used in metadata transformation workflows, and its current status. Several sites were undergoing migration during 2023, which required notes about planned or now-completed changes. These updates are added to our summary where relevant.

Information was gathered over five months (May-September) through semi-structured interviews with staff who were either curators of a particular repository or involved in its metadata work. All staff were asked to review the environmental scan spreadsheet we had prepared, to clarify or correct the content listed, and to identify which staff worked on which parts of the metadata workflows. Notes were taken, and the relevant details were then added to the spreadsheet.

Sites reviewed:

Alma 

Digital Archive of the Guatemalan National Police Historical Archive (AHPN)

Archive of the Indigenous Languages of Latin America (AILLA)

DAMS (including HRDI and Primeros Libros metadata)

GeoData Portal

Latin American Digital Initiatives (LADI)

Texas Archival Resources Online (TARO)

Tape archive (including Box and network shares)

Texas Scholarworks (TSW)

Texas Data Repository (TDR)

Visual Resources Center (VRC)





Findings

Clicking the images within the findings below will take you to the Tableau Dashboard, where you can see a larger version and interact with the visualization.


Metadata reuse

Of the 11 sites reviewed, 5 significantly reused metadata from other UTL-managed sites: Alma, DAMS/CP, GeoData Portal, TARO, and TDR. The DAMS/CP and the GeoData Portal showed the greatest variety of metadata reuse and engaged in it most frequently, though emerging workflows may further increase sharing from TARO to TDR and from TDR to the GeoData Portal. Strengthening and supporting reuse could help reduce duplicate work, simplify and harmonize metadata about the same or similar instances of an asset, and possibly allow materials to be ingested into various sites more quickly.


External vs. Internal metadata

Two major trends emerged from the interview data in the kinds of metadata sources used by UTL staff and the kinds of metadata present in our sites: “internal” and “external” metadata. The majority of metadata created by UTL staff is derived from original metadata work/cataloging or from UTL-created metadata used as a source. We define this as “internal” metadata. However, a large proportion of the metadata used by UTL sites comes from external sources (depositors, vendors, and partners in various kinds of projects and post-custodial arrangements). This metadata is largely unmediated by UTL staff, and when remediation does occur, it is limited. As some of our sites hold both internal and external metadata, finding ways to balance the use of both is key (e.g., maintaining consistent use of authorities and local vocabularies contributed from external metadata sources).



Levels of metadata remediation/transformation

The level of work conducted by UTL staff on metadata creation varied in the interview data, depending on where the original sources used to create that metadata came from (i.e., the external vs. internal categories described above). These levels range from little staff remediation to full transformation or creation of metadata. Generally, external metadata from partners or depositors was modified very little (with the exception of some vendor-derived metadata in the DAMS/CP). As much of this metadata came from post-custodial partnerships or outside depositors, this limited remediation reflects best practices in preserving and respecting researcher/partner input. The amount of metadata work increased when staff conducted original metadata creation/cataloging or reused outside or UTL metadata. This last point is significant: even though reusing metadata has a variety of benefits, it does not eliminate the need for quality control, crosswalking, or enrichment.


Staffing pain points

A common pain point noted by interview participants was staffing. Participants noted that the description and ingest of assets into UTL sites was slowed by not having enough staff in general or not enough staff assigned to metadata/cataloging work. Examples include cataloging of bibliographic materials in Content Management; creation of non-MARC metadata in Content Management; cataloging of maps in Content Management for downstream reuse by the DAMS/Collections Portal and GeoData Portal; richer description of assets for TSW; and more timely description and ingest of material for LADI. Some workflows were created to address these issues, for example scraping maps metadata directly from OCLC instead of relying on catalog records in Alma. Overall, staffing was noted as a pain point for 5 out of 11 sites. As metadata in Alma, the DAMS/CP, and the GeoData Portal are frequently dependent on each other, finding ways to address these backlogs is key, whether through adoption of new workflows or advocating for reallocation of staffing resources.

Emerging workflows

A number of new or emerging metadata workflows were identified in the interview data. Many of these were created after a recent migration or a change to a site’s metadata schema, most of them stemming from the GeoData Portal’s move to the updated OpenGeoMetadata Aardvark schema and a site upgrade that removed LITS-mediated access to ingest. These emerging workflows include: removal of the ArcGIS Pro and ISO 19139 XML schema transformation steps from ingesting content into the GeoData Portal, adoption of TDR as a space for Architecture to create geospatial metadata for the GeoData Portal, and directly querying the Collections Portal instead of Alma for maps metadata for the GeoData Portal. Other workflow changes include simplification of ingest for VRC assets, working from MS Excel spreadsheets to JSTOR Forum instead of using several bespoke cataloging tools. As the AILLA and DAMS 2.0 migrations are still pending, these changed workflows are not yet fully realized. Experimentation with OCR (optical character recognition)/HTR (handwritten text recognition), natural language processing/named entity recognition, and AI tools was also noted for Benson materials, in order to generate subjects, contributors, and summaries of materials. At Architecture, evaluations of ArchivesSpace and FromThePage as sites for content description and for the identification/cleanup of people and place names are underway. These are not yet part of defined workflows, but further testing and training support could be useful in this area.

Site specificity

While there is some metadata reuse and sharing of tools/workflows, a number of UTL’s sites are singular or siloed off from the rest of the metadata ecosystem. This is partially due to the kind of metadata and assets they store. For example, LADI and AHPN are post-custodial projects that contain partner-contributed metadata; for that reason, their content is not shared elsewhere, nor are their workflows similar to other metadata processes where UTL staff create or remediate records. Others, like the VRC and AILLA, handle unique assets that do not share metadata workflows, schemas, or characteristics with the rest of UTL’s sites. TSW and the Digital Stewardship spaces (tapes) are the last two sites found to have little to no crossover with other sites. This stems from TSW’s differing ingest process and from the tapes’ focus on administrative and technical metadata, as they are not a site of descriptive metadata creation. However, some assets are shared across TSW and the DAMS/CP, meaning there could be room for metadata reuse in the future. Digital Stewardship’s need for a minimum metadata standard for assets being sent for preservation would also require collaboration with, or at least an understanding of, other sites’ schemas.
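
One way to picture a minimum metadata standard is as a short list of required fields that is checked before assets are sent for preservation. The following is a sketch only: the required field names and the `missing_fields` helper are hypothetical placeholders, since no minimum standard has been defined yet.

```python
# Hypothetical minimum field set; the actual standard has not been defined.
REQUIRED_FIELDS = ("identifier", "title", "source_site", "rights")


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty in a record."""
    missing = []
    for field in required:
        value = record.get(field)
        # Treat None, blank strings, and empty lists as missing.
        is_blank_string = isinstance(value, str) and not value.strip()
        if value is None or is_blank_string or value == []:
            missing.append(field)
    return missing
```

A check like this could run on records from any site regardless of schema, provided each site's fields are first mapped to the shared minimum set.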