8. Language analysis process/overview
Before running Kesha’s pipeline in spaCy, we typically need to extract features via CLAN: CLAN-derived linguistic features
Next, strip the samples of any CLAN coding that spaCy and Kesha’s pipeline cannot tolerate or that can negatively affect their output. To do so, follow the steps in CLAN Folder Stripping. The details of what gets removed are provided on this page.
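As a rough illustration of what stripping involves, the sketch below removes some common CHAT annotations from a speaker-tier line using only the Python standard library. It is not the lab's stripping script: the pattern list and the function name `strip_chat_line` are assumptions made for this example, and the authoritative list of what gets removed is the one documented in CLAN Folder Stripping. Note that this sketch removes only the codes themselves, so retraced words are kept.

```python
import re

# Hypothetical patterns, for illustration only. The lab's actual removal
# list is documented on the CLAN Folder Stripping page.
_PATTERNS = [
    r"\x15\d+_\d+\x15",       # time-alignment bullets
    r"\[[^\]]*\]",            # bracketed codes such as [/], [//], [* p:w]
    r"&\S+",                  # ampersand-prefixed tokens such as &-um, &=laughs
    r"\(\.+\)",               # pause markers such as (.) and (..)
    r"\b(?:xxx|yyy|www)\b",   # unintelligible or untranscribed placeholders
    r"[<>]",                  # scoping angle brackets
    r"\+\S+",                 # special utterance terminators such as +...
]


def strip_chat_line(line):
    """Return plain text for a speaker-tier line, or None for other tiers."""
    # Speaker tiers start with "*"; headers ("@") and dependent tiers ("%")
    # are skipped entirely in this sketch.
    if not line.startswith("*"):
        return None
    _, _, text = line.partition(":")
    for pattern in _PATTERNS:
        text = re.sub(pattern, " ", text)
    # Re-attach punctuation that was left floating after codes were removed.
    text = re.sub(r"\s+([.?!,])", r"\1", text)
    return re.sub(r"\s+", " ", text).strip()


print(strip_chat_line("*PAR:\tthe &-um boy [/] boy is falling (.) xxx ."))
```

For the sample line, this prints `the boy boy is falling.`, which is plain text that a spaCy pipeline can tokenize without the CHAT symbols.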
Current dictionaries of features from Kesha’s pipeline: MultilingualFeatureDictionary-Editable, Connected_Speech_Feature_Dictionary_Final
How to extract acoustic features: 8. Acoustic Derivations Guide
How to extract linguistic features and word-level parameters (both are extracted by the linguistic pipeline and checked in the Sanity Check pipeline):
English Linguistic Pipeline Tutorial Guide
Several measures from this pipeline are scaled by the number of words or include a count of the number of words.
Kesha’s linguistic feature scripts remove interjections and fillers from the connected speech samples so as not to inflate the total number of words.
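To show why removing fillers matters for word-scaled measures, here is a minimal sketch in plain Python. It does not reproduce Kesha’s scripts or use spaCy: the filler list `FILLERS`, the helper names, and the per-100-words scaling are all assumptions chosen for illustration.

```python
# Hypothetical filler/interjection list, for illustration only.
FILLERS = {"um", "uh", "er", "ah", "hmm", "oh"}


def content_tokens(text):
    """Split on whitespace, trim punctuation, and drop fillers."""
    tokens = [t.strip(".,?!").lower() for t in text.split()]
    return [t for t in tokens if t and t not in FILLERS]


def per_100_words(count, total_words):
    """Scale a raw count by the total word count (assumed per-100-words form)."""
    return 100.0 * count / total_words if total_words else 0.0


sample = "Um, the boy is, uh, falling."
words = content_tokens(sample)
print(len(words))                     # word count after filler removal
print(per_100_words(1, len(words)))   # e.g. one verb scaled by that count
```

With the fillers left in, the sample would count six words instead of four, so any measure divided by the word count would come out lower than it should.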
How to extract CLIP: Multilingual image-text similarity scores Tutorial Guide
Current versions stored on MADRlab Repo/Git: