Grammatical Complexity Index - WIP

The grammatical complexity index (GCI) that we compute with spaCy has already been found to correlate highly with other grammatically relevant linguistic variables, such as subordination measures and clitic usage. However, we also need to know exactly what the GCI is measuring in the Spanish transcripts.

The measure is the ratio of complex dependencies to the total number of words, since each word is assigned a dependency. There are two relevant sets of dependency tags to be aware of, universal and language-specific, and both can be found at this website: Annotation Specifications · spaCy API Documentation (legacy)
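As a rough illustration of the ratio described above, the following sketch computes a GCI from a list of dependency labels, one per word. The function name `grammatical_complexity_index` and the `COMPLEX_DEPS` set are assumptions made for this example. The set contains only the labels that carry evaluation notes in the table below, which may not match the labels actually treated as complex in our pipeline.

```python
from typing import Iterable

# Assumption: the labels counted as "complex". These are the labels that carry
# evaluation notes in this document; the real GCI set may differ.
COMPLEX_DEPS = frozenset({"acl", "advcl", "ccomp", "csubj", "mark", "nmod", "xcomp"})


def grammatical_complexity_index(dep_labels: Iterable[str],
                                 complex_deps: frozenset = COMPLEX_DEPS) -> float:
    """Return (# complex dependencies) / (# words); 0.0 for empty input."""
    # Lowercase so that labels such as "ROOT" and "root" are treated alike.
    labels = [label.lower() for label in dep_labels]
    if not labels:
        return 0.0
    n_complex = sum(1 for label in labels if label in complex_deps)
    return n_complex / len(labels)


# Hand-written labels for illustration only (not real parser output).
example_labels = ["det", "nsubj", "mark", "obj", "aux", "acl"]
print(grammatical_complexity_index(example_labels))
```

With a parsed spaCy document, the labels could be collected as `[token.dep_ for token in doc]`. Whether punctuation tokens are filtered out first (for example with `token.is_punct`) is left to the caller, since the source does not say how they are handled.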

Notably, only a few languages, such as English and German, have additional language-specific tags. All other languages use the universal tags below (highlighted cells are those considered complex for the GCI):

| Label | Description | Notes |
|---|---|---|
| acl | clausal modifier of noun (adjectival clause) | Summary: This seems accurate. It picks up clausal modifiers, both finite and non-finite. For example, “el perrito que le está mirando”. |
| advcl | adverbial clause modifier | Summary: Most advcl labels in the Spanish and Catalan samples are over-identifications. It correctly captures true adverbial clause modifiers, for example “hay una pareja haciendo…”, but it also makes two kinds of errors: (1) it marks null-subject clauses as advcl, e.g. “Una casa. Se ve grande. Tiene un garaje.”; (2) it marks discourse markers/interjections as advcl, e.g. “Sí.” |
| advmod | adverbial modifier | |
| amod | adjectival modifier | |
| appos | appositional modifier | |
| aux | auxiliary | |
| case | case marking | |
| cc | coordinating conjunction | |
| ccomp | clausal complement | Summary: This measure looks accurate. It correctly captures clausal complements. For example, “se supone que proyecta”. |
| clf | classifier | |
| compound | compound | |
| conj | conjunct | |
| cop | copula | |
| csubj | clausal subject | Summary: Very inaccurate. It over-marks verbs as clausal subjects: (1) it marks infinitives as subjects even when they are not, e.g. “No estar en contacto con la hierba”; (2) it marks null-subject finite verbs as subjects even when they are not, e.g. “Tienen”. |
| dep | unspecified dependency | |
| det | determiner | |
| discourse | discourse element | |
| dislocated | dislocated elements | |
| expl | expletive | |
| fixed | fixed multiword expression | |
| flat | flat multiword expression | |
| goeswith | goes with | |
| iobj | indirect object | |
| list | list | |
| mark | marker | Summary: This seems to be less accurate. It frequently tags interjections as mark even when they are not being used as markers of a clause. For example, “pues… estamos ante de…”. |
| nmod | nominal modifier | Summary: This seems to be largely accurate. It is assigned when a noun is modified by a prepositional phrase. For example, “paisaje de naturaleza”. |
| nsubj | nominal subject | |
| nummod | numeric modifier | |
| obj | object | |
| obl | oblique nominal | |
| orphan | orphan | |
| parataxis | parataxis | |
| punct | punctuation | |
| reparandum | overridden disfluency | |
| root | root | |
| vocative | vocative | |
| xcomp | open clausal complement | Summary: This seems accurate, though most cases are restricted to one specific use within the xcomp definition (“look”, “seem” / “parecer”). For example, “parecen las sierras de montserrat”. |

Dependencies in the list Kesha provided that do not appear among the universal tags above: acomp, csubjpass, pobj, complm, infmod, partmod
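The comparison behind this list can be sketched as a simple membership check. The `UNIVERSAL_DEPS` set below is transcribed from the label column of the table above. The `requested` list holds only the six labels named here, since the full list Kesha provided is not reproduced in this document, and the helper name `labels_not_universal` is hypothetical.

```python
# Universal dependency labels, transcribed from the table in this document.
UNIVERSAL_DEPS = {
    "acl", "advcl", "advmod", "amod", "appos", "aux", "case", "cc", "ccomp",
    "clf", "compound", "conj", "cop", "csubj", "dep", "det", "discourse",
    "dislocated", "expl", "fixed", "flat", "goeswith", "iobj", "list", "mark",
    "nmod", "nsubj", "nummod", "obj", "obl", "orphan", "parataxis", "punct",
    "reparandum", "root", "vocative", "xcomp",
}


def labels_not_universal(labels):
    """Return the labels absent from the universal set, preserving input order."""
    return [label for label in labels if label.lower() not in UNIVERSAL_DEPS]


# Only the six labels named in this document; the full source list is not shown here.
requested = ["acomp", "csubjpass", "pobj", "complm", "infmod", "partmod"]
print(labels_not_universal(requested))
```

Running the same check on the complete list would show at a glance which requested labels the parser can never emit for Spanish under the universal scheme.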

These dependencies are also available in table format on this website: Universal Dependency Relations

image-20250703-193823.png

To help interpret the tree-like dependency structures that are available as output, the same site also provides a guide to the dependency relationships: Enhanced Dependencies

For example, here are relative clauses:

image-20250703-193812.png