CLAN-derived linguistic features
Overview
This section covers six steps for using CLAN to format the transcriptions for the different pipelines and analyses. Each step is described in more detail below:
NOTE: BEFORE RUNNING THESE PROCESSES, WE NEED TO ADD A STRIPPING CODE TO REMOVE FILLERS FOR THE MLUm COUNT. THIS IS IN PROGRESS AS OF 10/03/2025.
Run mor:
Run eval: evaluation, getting the standard CLAN linguistic features
Eval outputs a spreadsheet of several measures, including MLUm and part-of-speech tags. The command settings you use determine whether some values are reported as counts or as percentages.
Fragment count
Repetition count
Retracing count
Paraphasia count
Directory Information in CLAN
First, open CLANc and make sure the directory information is correct. For example, it may look something like the following:
working:
/Users/yourinitials/Library/CloudStorage/Box-Box/SLHS_Grasso/Manuscripts_Projects/Connected Speech Dom.Nondom Lv Nfv/1. Clan Transcriptions/Output
output:
/Users/yourinitials/Library/CloudStorage/Box-Box/SLHS_Grasso/Manuscripts_Projects/Connected Speech Dom.Nondom Lv Nfv/1. Clan Transcriptions/Output
mor lib:
/Users/yourinitials/Library/CloudStorage/Box-Box/SLHS_Grasso/MADR Connected Speech Project/Bilingual_Transcriptions/CLAN_Libraries/CLAN
Running Mor and Eval
Once the directory information is correct, we can begin entering commands into CLAN as outlined in each step below:
Extracting Variables from CLAN
For these next steps, CLAN will output a file called “stats_freq.xls” for each command. Rename each output file as soon as its command finishes, so that the next command's output does not get confused with it. For example, after running the next step for Fragment Count, rename the stats_freq.xls file to “Fragments.xls” before continuing to the next step.
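The renaming step can also be scripted. The sketch below is one possible helper, not part of CLAN: the function name `rename_clan_output` is hypothetical, and it assumes the output file is named stats_freq.xls and sits in the folder you pass in.

```python
from pathlib import Path


def rename_clan_output(output_dir, new_name, default_name="stats_freq.xls"):
    """Rename CLAN's most recent output file to a measure-specific name.

    Hypothetical helper: assumes the file is called `default_name` and
    lives in `output_dir`.
    """
    src = Path(output_dir) / default_name
    dst = Path(output_dir) / new_name
    if not src.exists():
        raise FileNotFoundError(f"{src} not found; did the CLAN command finish?")
    if dst.exists():
        # Refuse to overwrite an output that was already renamed.
        raise FileExistsError(f"{dst} already exists; refusing to overwrite it")
    src.rename(dst)
    return dst
```

For example, calling `rename_clan_output("path/to/Output", "Fragments.xls")` right after the Fragment Count command would rename that run's output before the next command is entered.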
Once you have completed these steps, move the output files above to the Output folder. Make sure the folders are divided by language, and then by language dominance within each language folder.
Warning: You may need to copy these values into the spaCy output sheet as new columns with appropriate column titles. Each time you run one of the commands above, copy the resulting column over, checking that its rows line up with the corresponding rows in the spaCy sheet, and repeat until all measures are copied over.
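Copying columns by hand makes it easy to misalign rows. One way to reduce that risk is to match rows on a shared identifier instead of on row position. The sketch below assumes both sheets have already been loaded as lists of dictionaries (for example, after exporting them to CSV) and that they share a key column. The function name and the column names used in the example ("File", "Fragments") are hypothetical, not taken from CLAN or the spaCy pipeline.

```python
def add_clan_column(spacy_rows, clan_rows, key, clan_column, new_column):
    """Return copies of spacy_rows with one CLAN measure added, matched on `key`.

    Also returns the list of keys that had no match in the CLAN output,
    so they can be checked by hand.
    """
    clan_by_key = {}
    for row in clan_rows:
        if row[key] in clan_by_key:
            raise ValueError(f"duplicate {key!r} in CLAN output: {row[key]}")
        clan_by_key[row[key]] = row[clan_column]

    merged, missing = [], []
    for row in spacy_rows:
        new_row = dict(row)  # copy so the original rows are left untouched
        if row[key] in clan_by_key:
            new_row[new_column] = clan_by_key[row[key]]
        else:
            new_row[new_column] = None
            missing.append(row[key])
        merged.append(new_row)
    return merged, missing
```

Because rows are matched by key, the CLAN output can be in a different order from the spaCy sheet, and any transcript missing from the CLAN output is reported rather than silently shifted.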
Remember to check the CLAN sentence count as compared to spaCy’s sentence count as a sanity check.
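This sanity check can be sketched in a few lines, assuming the per-transcript sentence counts from each tool have already been collected into dictionaries keyed by transcript name. The function name and the `tolerance` parameter are hypothetical and only illustrate the comparison.

```python
def sentence_count_mismatches(clan_counts, spacy_counts, tolerance=0):
    """Return {name: (clan, spacy)} for transcripts whose counts disagree.

    A transcript is flagged if it is missing from either source or if the
    two counts differ by more than `tolerance`.
    """
    mismatches = {}
    for name in sorted(set(clan_counts) | set(spacy_counts)):
        clan = clan_counts.get(name)
        spacy = spacy_counts.get(name)
        if clan is None or spacy is None or abs(clan - spacy) > tolerance:
            mismatches[name] = (clan, spacy)
    return mismatches
```

An empty result means the two tools agree on every transcript. Anything returned is worth inspecting in the original transcription.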
Relevant Details for spaCy/Kesha’s Pipeline
Here is an option to also strip coded code-switches, if your research question requires it.
Some analyses require raw counts of certain linguistic variables, while others require scaled data. For that reason, in addition to raw data, several measures from spaCy/Kesha’s pipeline are either scaled by the number of words or accompanied by a word count.
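As an illustration of the difference between raw and scaled values, the sketch below divides a raw count by the total number of words. This is an assumption about what "scaled" means here. The pipeline's actual scaling is not specified in this document, and the function name and `per` parameter are hypothetical.

```python
def scale_by_words(raw_count, n_words, per=1):
    """Scale a raw count by total words, e.g. per=100 gives a rate per 100 words."""
    if n_words <= 0:
        raise ValueError("n_words must be positive")
    return per * raw_count / n_words
```

For example, 5 occurrences in a 200-word sample gives `scale_by_words(5, 200, per=100)`, a rate of 2.5 per 100 words, which can be compared across samples of different lengths.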
Kesha’s linguistic feature scripts remove interjections and fillers from the connected speech samples so as not to inflate the total number of words.
Always double-check the word count!