CLAN-derived linguistic features

1 Overview
2 Directory Information in CLAN
3 Running Mor and Eval
4 Extracting Variables from CLAN
5 Relevant Details for SpaCy/Kesha’s Pipeline

Overview

We use CLAN in six steps to format the transcriptions for the different pipelines and analyses. The steps are listed in this section and described in greater detail in the sections that follow:

BEFORE RUNNING THESE PROCESSES, WE NEED TO ADD A STRIPPING CODE TO REMOVE FILLERS FOR THE MLUm COUNT. THIS IS IN PROCESS AS OF 10/03/2025

This has been added to the MADRLab’s GitHub:

https://github.com/orgs/MADRLab/repositories

The script is called: Preparing-Transcriptions-for-Analysis-

1. Run mor
2. Run eval: evaluation, which produces the standard CLAN linguistic features
   Eval writes several measures to a spreadsheet, including MLUm, part-of-speech tags, etc. The specific command settings you use determine whether some values are reported as counts or as percentages.
3. Fragment count
4. Repetition count
5. Retracing count
6. Paraphasia count

Directory Information in CLAN

First, open CLANc and make sure that the directory information is correct. For example, it may look like the following:

working:
/Users/yourinitials/Library/CloudStorage/Box-Box/SLHS_Grasso/Manuscripts_Projects/Connected Speech Dom.Nondom Lv Nfv/1. Clan Transcriptions/Output

output:
/Users/yourinitials/Library/CloudStorage/Box-Box/SLHS_Grasso/Manuscripts_Projects/Connected Speech Dom.Nondom Lv Nfv/1. Clan Transcriptions/Output

mor lib:
/Users/yourinitials/Library/CloudStorage/Box-Box/SLHS_Grasso/MADR Connected Speech Project/Bilingual_Transcriptions/CLAN_Libraries/CLAN

Running Mor and Eval

Once that is correct, we can begin entering commands into CLAN as outlined in each step below:

We will first run mor:

mor @

Before running the command, click “File In” and make sure that all the .cha transcriptions you need are loaded. The @ in the command tells CLAN to use the files selected there.

Next, we run eval:

eval @ +t*PAR: +u

Troubleshooting: If you get an error message at this step, make sure that the only files in “File In” are .cha files and no .cex files are included.
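As a pre-check for the troubleshooting tip above, a short script can list anything in a transcription folder that is not a .cha file. This is only a minimal sketch and is not part of CLAN. It assumes that all the files you load through “File In” sit in one folder; the function name is hypothetical.

```python
from pathlib import Path


def find_non_cha_files(folder):
    """Return the names of visible files in `folder` that are not .cha files.

    Hypothetical helper: assumes every transcription to be loaded through
    "File In" lives directly in `folder`. Hidden files such as .DS_Store
    are skipped.
    """
    return sorted(
        p.name
        for p in Path(folder).iterdir()
        if p.is_file()
        and not p.name.startswith(".")
        and p.suffix.lower() != ".cha"
    )
```

An empty list means the folder holds only .cha files; any name that is returned (for example a .cex file) should be left out of “File In”.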

Extracting Variables from CLAN

For these next steps, CLAN will output a file called “stats_freq.xls” for each command. Rename each output file as soon as its command finishes. For example, after running the next step for Fragment Count, rename the stats_freq.xls file to “Fragments.xls” before continuing to the next step.
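The renaming can also be scripted so that an earlier result is never silently replaced. The sketch below is an illustration, not part of CLAN. It assumes that stats_freq.xls is written to a folder you pass in; the function name is hypothetical.

```python
from pathlib import Path


def rename_freq_output(output_dir, new_name, default_name="stats_freq.xls"):
    """Rename CLAN's default FREQ spreadsheet to a measure-specific name.

    Hypothetical helper: refuses to run if the default file is missing
    or if a file with the new name already exists.
    """
    src = Path(output_dir) / default_name
    dst = Path(output_dir) / new_name
    if not src.exists():
        raise FileNotFoundError(f"{src} not found; run the FREQ command first")
    if dst.exists():
        raise FileExistsError(f"{dst} already exists; move or rename it first")
    src.rename(dst)
    return dst
```

For example, rename_freq_output(folder, "Fragments.xls") would be called right after the Fragment Count command, before the next FREQ command is run.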

Next, we move on to the Fragment Count:

Copy and paste the following code into CLAN and run it:

FREQ +s"&+*" +d2 +f @

This counts the number of times the code &+ appears. Recall that &+ is the code used specifically for word fragments. The output file will show a populated column called “Token”. This is the Word Fragment Count.

Next, we must run the code for Repetition Count:

FREQ +s"[/]" +t*PAR +d2 +f @

Then the Retracing Count:

FREQ +s"[//]" +t*PAR +d2 +f @

Lastly, we will get the Paraphasia Count:

FREQ +s"[: *]" +d2 +f @

Once you have completed these steps, move the output files above to the Output folder. Be sure that the folders are properly divided by language and then language dominance within each language folder.

Warning: You may need to copy these values to the spaCy output sheet as new columns with appropriate column titles. Each time you run one of the commands above, copy its column over, making sure that the rows line up with the matching rows in the spaCy sheet, and repeat until all measures are copied over.

Remember to compare the CLAN sentence count with spaCy’s sentence count as a sanity check.
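One way to keep the rows aligned is to match on a file identifier instead of on row position. The sketch below assumes both sheets have already been read into lists of dictionaries, one per row. The key column "File" and the new column title "FragmentCount" are hypothetical names; "Token" is the FREQ column described earlier.

```python
def add_measure_column(spacy_rows, clan_rows, key, source_col, new_col):
    """Copy one CLAN column into the spaCy rows, matching rows on `key`.

    Hypothetical helper: raises an error instead of guessing when a file
    is duplicated in the CLAN rows or missing from them.
    """
    lookup = {}
    for row in clan_rows:
        if row[key] in lookup:
            raise ValueError(f"duplicate {key} in CLAN rows: {row[key]}")
        lookup[row[key]] = row[source_col]

    merged = []
    for row in spacy_rows:
        if row[key] not in lookup:
            raise KeyError(f"no CLAN row for {key} = {row[key]}")
        new_row = dict(row)  # copy, so the original rows stay unchanged
        new_row[new_col] = lookup[row[key]]
        merged.append(new_row)
    return merged


# Example with made-up values; note the two sheets are in different orders.
spacy_rows = [{"File": "p02.cha", "Words": 80}, {"File": "p01.cha", "Words": 120}]
clan_rows = [{"File": "p01.cha", "Token": 3}, {"File": "p02.cha", "Token": 5}]
merged = add_measure_column(spacy_rows, clan_rows, "File", "Token", "FragmentCount")
```

Matching on a key means a sheet that was sorted differently still receives the right values, and a missing file stops the copy instead of shifting every row below it.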
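The sentence-count comparison can be sketched as a small function that reports every file where the two counts disagree. It assumes the counts have already been collected into two dictionaries keyed by a file identifier. The function name and the identifiers are hypothetical.

```python
def sentence_count_mismatches(clan_counts, spacy_counts):
    """Return (file_id, clan_count, spacy_count) for every disagreement.

    A file that appears in only one of the two dictionaries is reported
    with None for the missing count.
    """
    mismatches = []
    for file_id in sorted(set(clan_counts) | set(spacy_counts)):
        clan_n = clan_counts.get(file_id)
        spacy_n = spacy_counts.get(file_id)
        if clan_n != spacy_n:
            mismatches.append((file_id, clan_n, spacy_n))
    return mismatches
```

An empty result means the counts agree for every file. Anything that is returned is worth opening by hand before the measures are analyzed.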

Relevant Details for SpaCy/Kesha’s Pipeline

Strip repetitions, revisions, part-word fragments, and CLAN formatting from the transcriptions (chat2text)

There is also an option to strip coded code-switches if your research question requires it
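To illustrate what this kind of stripping does to a single utterance, here is a simplified sketch using regular expressions. It is not chat2text and not the lab's script. It only handles &+ fragments, &- fillers, and material marked with [/] or [//] (either one word or a group in angle brackets), and it ignores the rest of the CHAT format.

```python
import re

# A word, or a <group of words>, followed by the [/] or [//] marker.
_RETRACED = re.compile(r"(?:<[^<>]*>|\S+)\s*\[/{1,2}\]")
_FRAGMENT = re.compile(r"&\+\S+")  # part-word fragments, e.g. &+ho
_FILLER = re.compile(r"&-\S+")     # fillers, e.g. &-um


def strip_disfluencies(utterance):
    """Remove repeated or retraced material, fragments, and fillers."""
    text = _RETRACED.sub(" ", utterance)
    text = _FRAGMENT.sub(" ", text)
    text = _FILLER.sub(" ", text)
    return " ".join(text.split())  # collapse the leftover whitespace


example = "&-um the the [/] dog <went to> [//] ran to &+ho home ."
cleaned = strip_disfluencies(example)
# cleaned is "the dog ran to home ."
```

Any real stripping should be checked against a few hand-cleaned transcriptions, since nested or unusual codes are outside what this sketch covers.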

Some analyses require raw data/counts of some linguistic variables, and other analyses require scaled data. For that reason, in addition to raw data, several measures from spaCy/Kesha’s pipeline are scaled by the number of words or include a count of the number of words.
Kesha’s linguistic features scripts remove interjections and fillers from the connected speech samples so as not to inflate the total number of words.
- Always double-check the word count!
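As an illustration of scaling a raw count by the number of words, the sketch below expresses a count per 100 words. The per-100 factor and the function name are assumptions made for the example. They are not taken from Kesha’s pipeline, which may scale differently.

```python
def per_100_words(count, total_words):
    """Scale a raw count to a rate per 100 words.

    `total_words` should be the word count after interjections and
    fillers have been removed, so that the rate is not deflated.
    """
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return 100.0 * count / total_words


rate = per_100_words(3, 150)  # 3 fragments in 150 words gives 2.0 per 100 words
```

Because the word count is the denominator, an inflated count lowers every scaled measure, which is why the word count is worth double-checking.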

Notes 1-24-25:

This step provides output with a number of CLAN-derived features. Some are automatically derived from the eval function, while others are manually added to the final columns of the spreadsheet that is created from this step. This process creates the output within the “Output” folder in each “CLAN_transcriptions_DATE” folder. This occurs entirely independently from the CLAN folder stripping step that comes next.

Question: Can we combine the following…

As of 1.31.25, it seems that combining the following commands is not possible. CLAN simply brings up an output window that lists all possible actions but does not actually run the code.

It does seem possible to run the Repetitions and Retracings commands together and then follow up with the last line.

FREQ +s"[&+*]" +d2 +f @

FREQ +s"[/]" +tPAR +d2 +f @

FREQ +s"[//]" +tPAR +d2 +f @

FREQ +s"[: *]" +d2 +f @