Sanity Check Pipeline

The code for our analysis was written iteratively. Whenever we found errors or wanted to derive new variables, we adjusted the code. This happened many times over the course of the project, so the code has undergone many changes. One danger of changing code this often is that previously fixed issues can return unexpectedly, even when a new change would not be expected to cause any problems.

One way to ensure that changes do not introduce new issues is to always include a set of “sanity check” files in any analysis. These files (2 in Spanish, 2 in Catalan) are transcripts whose linguistic analysis output we have already manually checked for accuracy. Because that verified output is known to be accurate, whenever a batch of new files is processed, the output for the sanity check files should be examined to confirm that it does not differ from the manually checked output.



Table of sanity check files:

Language    File name
Spanish     BILP013
Spanish     BILP026
Catalan     BIOBS005
Catalan     BILP014

Automated sanity check documentation

To ensure that code changes do not introduce new errors, the pipeline checks that its output for the sanity check files has not changed for the following features, which have historically been problematic:
['# of words', '# of sentences', 'Average sentence length', '# of clitics',
'Average clitics per sentence', 'Suboordination Main Verb Implied numerator',
'Suboordination Main Verb Implied', 'Suboordination With Two Finite Verbs numerator',
'Suboordination With Two Finite Verbs', '# gender agreement errors',
'# gender agreement errors: only DET', '# gender agreement errors per 100 nouns',
'# gender agreement errors (only DET) per 100 nouns', '# number agreement errors',
'# number agreement errors: only DET', '# number agreement errors per 100 nouns',
'# number agreement errors (only DET) per 100 nouns'].

If the pipeline fails the sanity check, it will output the problematic features and will NOT run on the input files.
To see the current output for the sanity check files, open sanityCheckSpaCurrentRun_LinguisticFeatures.csv and compare it against its ground truth in SanityChecks/spanishVerifiedSanityCheckOutput.csv. Then send your findings to Kesha.
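
The gating behavior described above can be sketched roughly as follows. This is a minimal illustration and not the pipeline's actual code: the function names `changed_features` and `guard_run` are hypothetical, and the sketch assumes the current and verified outputs have already been loaded into dictionaries keyed by sanity check file name.

```python
def changed_features(current, verified, features):
    """Return the checked feature names whose value differs for any sanity check file.

    `current` and `verified` map a file name (e.g. "BILP013") to a
    {feature name: value} dictionary. This layout is an assumption.
    """
    changed = []
    for feature in features:
        for file_id, expected in verified.items():
            # A sanity check file missing from the current run counts as a mismatch.
            if current.get(file_id, {}).get(feature) != expected.get(feature):
                changed.append(feature)
                break
    return changed


def guard_run(current, verified, features):
    """Raise before the input files are processed if any checked feature changed."""
    changed = changed_features(current, verified, features)
    if changed:
        raise RuntimeError("Sanity check failed for: " + ", ".join(changed))
```

Calling `guard_run` before processing the input files mirrors the documented behavior: the problematic features are reported and the run stops.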

Instructions for manual sanity check

Step

Explanation

Notes

  1. Locate the sanity check files

The sanity check files are project-specific and are located in the Z. Sanity Check folder within the 101--Connected Speech_Data folder*. Depending on the structure of the project’s Box folder, the sanity check files are likely to be in a manuscript folder or a data quality folder.

*The location may be different if separate sanity check files are used for another project.

The sanity check folder will contain a subfolder for each language. In each folder there will be two .cha transcript files and an Excel sheet with the output for the linguistic features of the two .cha files in that language.

Figure: Location of the sanity check files for current connected speech projects.
Figure: Subfolder for Spanish sanity check materials.

  2. Copy the sanity check files

The next step is to copy the sanity check .cha files of the target language into the folder with the batch of data that you will be running through the linguistic analysis pipeline. In other words, copy the .cha files and paste them into the folder with the rest of the data as if they were simply two more .cha files of your data.

Check that you are including the correct language files. If you are running the Spanish linguistic analysis pipeline, only include the two Spanish .cha files. If you are running the Catalan linguistic analysis pipeline, only include the two Catalan .cha files.

Figure: Example of downloading the .cha transcript files for later uploading to the target linguistic analysis folder.
Figure: Example of copying the sanity check files through Box into the target linguistic analysis data folder.

Be sure to select Copy and not Move; otherwise the files will no longer be in this centralized Sanity Check subfolder!
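
If the Box folders are synced to a local drive, the copy step could also be scripted. The sketch below is only an illustration under that assumption: the function name `copy_sanity_transcripts` and both folder arguments are hypothetical. It copies the transcripts rather than moving them, so the originals stay in the Sanity Check subfolder.

```python
import shutil
from pathlib import Path


def copy_sanity_transcripts(sanity_language_dir, batch_dir):
    """Copy (never move) every .cha file from one language's sanity check
    subfolder into the batch folder, and return the copied file names."""
    batch = Path(batch_dir)
    batch.mkdir(parents=True, exist_ok=True)
    copied = []
    for transcript in sorted(Path(sanity_language_dir).glob("*.cha")):
        # copy2 leaves the source file in place and preserves its timestamps.
        shutil.copy2(transcript, batch / transcript.name)
        copied.append(transcript.name)
    return copied
```

Pointing `sanity_language_dir` at a single language's subfolder keeps the Spanish and Catalan files from being mixed.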

  3. Run linguistic analysis

The next step is to run the linguistic analysis pipeline as usual.

This step is language-specific and uses TACC. Once the process has completed, download the output files for the transcripts included in your dataset.

Most of the output files contain dependency structures and morphosyntactic dependency tags, with varying levels of detail depending on the file, that may or may not be used in your analysis.

One specific output file is an Excel file that contains numerical values for all the linguistic variables included in the analysis.

Save these files in the correct location for your current analysis.

Figure: Example of the multiple output files for each transcript file. These will include dependency structures and word-by-word tags.
Figure: Example of the Linguistic Features output file that contains quantified numerical values for each linguistic variable derived from the linguistic analysis pipeline. This file is critical for the sanity check and the last step (Step 4) of this process.

  4. Compare output

This final step confirms that the linguistic analysis pipeline ran as it was supposed to.

Open the Linguistic Features output file from Step 3. In the Participant ID column (Column A), you should see two participant IDs that match the sanity check participants for the target language.

Then open the Linguistic Features datasheet in the Sanity Check subfolder for the target language.

Compare the values in the sanity check datasheet with those in the matching cells of your output file and confirm that the numbers match. This means checking multiple cells, ideally all of them, to make sure that the values are the same. You can do this by copying the sanity check rows and the matching rows from your output file into another Excel file, or by checking manually with both files open.
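
As an alternative to checking by eye, the cell-by-cell comparison could be scripted. The sketch below is illustrative only and rests on several assumptions: both Linguistic Features sheets have been exported from Excel to CSV, the ID column header is literally "Participant ID", and cells are compared as raw text, so "12" and "12.0" would be reported as different. The function names are hypothetical.

```python
import csv


def read_features(csv_path, id_column="Participant ID"):
    """Load a features CSV as {participant ID: {column name: cell text}}."""
    # utf-8-sig also handles the byte-order mark that Excel can add to CSV exports.
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        return {row[id_column]: row for row in csv.DictReader(handle)}


def diff_sanity_rows(output_csv, verified_csv, id_column="Participant ID"):
    """Return (participant, column, verified value, output value) for every
    cell of the verified sheet that differs in the new output."""
    output_rows = read_features(output_csv, id_column)
    differences = []
    for participant, verified in read_features(verified_csv, id_column).items():
        if participant not in output_rows:
            # The sanity check participant is missing from the new output.
            differences.append((participant, id_column, participant, None))
            continue
        for column, expected in verified.items():
            found = output_rows[participant].get(column)
            if found != expected:
                differences.append((participant, column, expected, found))
    return differences
```

An empty result means every verified cell matched. Only participants listed in the verified sheet are compared, so the other rows of the batch are ignored.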

 

 

@Andrew Collins to add the EXACT functions to

 

 

If any values differ, the code has changed or failed to run as expected, because NO values should differ between your analysis and the sanity check files.

If this happens, contact Stephanie to determine the next step for your analysis, as the code may need to be debugged or fixed.

Figure: Example of values from the output linguistic features file that need to be compared to those in the sanity check linguistic features file.

Historically, certain columns have been more problematic and are good places to start checking.

These include:
- subordination indices
- sentence counts
- clitic counts
- number agreement errors
- gender agreement errors
- word counts