Grammatical Complexity Index

The grammatical complexity index (GCI) that we compute with spaCy has already been found to correlate highly with other grammatically relevant linguistic variables, such as subordination measures and clitic usage. However, we also need to know exactly what the GCI is measuring in the Spanish transcripts.

The measure is the ratio of complex dependencies to the number of words, since each word is assigned a dependency. There are two relevant sets of dependency tags to be aware of: universal tags and language-specific tags. Both are listed at this website: https://v2.spacy.io/api/annotation#dependency-parsing
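The ratio described above can be sketched in a few lines. This is a minimal illustration, not the project's actual implementation: the function name `gci` and the contents of `COMPLEX_LABELS` are assumptions (the set simply lists the labels that carry review notes in the table below), and the sketch counts every token, including punctuation, because the notes do not say otherwise. With spaCy, the input labels would come from `token.dep_` for each token in a parsed document.

```python
# Assumption: placeholder set of "complex" labels, taken from the labels that
# carry review notes in the table; replace with the project's real set.
COMPLEX_LABELS = frozenset(
    {"acl", "advcl", "ccomp", "csubj", "mark", "nmod", "xcomp"}
)


def gci(dep_labels, complex_labels=COMPLEX_LABELS):
    """Return the share of tokens whose dependency label counts as complex."""
    labels = [label.lower() for label in dep_labels]
    if not labels:
        return 0.0  # avoid dividing by zero on an empty transcript
    n_complex = sum(1 for label in labels if label in complex_labels)
    return n_complex / len(labels)


# Hypothetical labels for a seven-token utterance; with spaCy they would come
# from [token.dep_ for token in doc].
example = ["det", "nsubj", "mark", "obj", "aux", "acl", "punct"]
score = gci(example)  # 2 complex labels out of 7 tokens
```

Because the score is a plain ratio, any label the parser over-assigns (see the notes in the table) inflates the GCI directly.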

Notably, only some languages, such as English and German, have additional language-specific tags. For all other languages, the universal tags below apply (highlighted cells are those considered complex for the GCI):

Label	Description	Notes
acl	clausal modifier of noun (adjectival clause)	Summary: This seems accurate. It is locking onto clausal modifiers that are either finite or non-finite. For example, “el perrito que le está mirando”.
advcl	adverbial clause modifier	Summary: advcl is over-identified in most cases in the Spanish and Catalan samples. It marks some dependencies correctly and others not. It correctly captures adverbial clause modifiers, for example “hay una pareja haciendo…”. However, it also makes other errors: it marks null-subject clauses as advcl (“Una casa. Se ve grande. Tiene un garaje.”), and it marks discourse markers and interjections as advcl (“Sí.”).
advmod	adverbial modifier
amod	adjectival modifier
appos	appositional modifier
aux	auxiliary
case	case marking
cc	coordinating conjunction
ccomp	clausal complement	Summary: This measure looks accurate. It is correctly capturing clausal complements. For example, “se supone que proyecta”.
clf	classifier
compound	compound
conj	conjunct
cop	copula
csubj	clausal subject	Summary: Very inaccurate. It is overmarking verbs as clausal subjects. It marks infinitives as subjects even when they are not, for example “No estar en contacto con la hierba”. It also marks null-subject finite verbs as subjects even when they are not, for example “Tienen”.
dep	unspecified dependency
det	determiner
discourse	discourse element
dislocated	dislocated elements
expl	expletive
fixed	fixed multiword expression
flat	flat multiword expression
goeswith	goes with
iobj	indirect object
list	list
mark	marker	Summary: This seems to be less accurate. It is frequently tagging interjections as mark even if they are not being used as markers of a clause. For example, “pues… estamos ante de…”.
nmod	nominal modifier	Summary: This seems to be largely accurate. It is assigned in cases of a noun modified by a prepositional phrase. For example, “paisaje de naturaleza”.
nsubj	nominal subject
nummod	numeric modifier
obj	object
obl	oblique nominal
orphan	orphan
parataxis	parataxis
punct	punctuation
reparandum	overridden disfluency
root	root
vocative	vocative
xcomp	open clausal complement	Summary: This seems to be accurate, though the majority of the cases are restricted to one specific use within the xcomp definition (“look”, “seem” / “parecer”). For example, “parecen las sierras de montserrat”.
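The notes in the table come from reading through what each label actually attaches to. One way to support that kind of review is to group tokens by label, as in the sketch below. The helper names `examples_by_label` and `_tok` are hypothetical, and the stand-in tokens (with made-up labels) exist only so the example runs without a spaCy model; real spaCy tokens expose the same `text`, `dep_`, and `head` attributes.

```python
from collections import defaultdict
from types import SimpleNamespace


def examples_by_label(tokens, labels):
    """Group (token text, head text) pairs under each label of interest."""
    wanted = set(labels)
    grouped = defaultdict(list)
    for token in tokens:
        if token.dep_ in wanted:
            grouped[token.dep_].append((token.text, token.head.text))
    return dict(grouped)


def _tok(text, dep, head_text):
    # Stand-in for a spaCy token; labels here are made up for illustration.
    return SimpleNamespace(
        text=text, dep_=dep, head=SimpleNamespace(text=head_text)
    )


sample = [
    _tok("Sí", "advcl", "Tiene"),
    _tok("Tiene", "ROOT", "Tiene"),
    _tok("garaje", "obj", "Tiene"),
]
review = examples_by_label(sample, ["advcl", "csubj"])
```

Labels that never occur in the input are simply absent from the result, so an empty entry does not need special handling.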

The following dependencies from the list Kesha provided do not appear in the table above: acomp, csubjpass, pobj, complm, infmod, partmod
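A check like this can be reproduced with a simple membership test against the universal labels from the table. The sketch below is illustrative only: the function name `missing_labels` is an assumption, and since Kesha's full list is not reproduced here, the example input contains just the six labels named above plus one universal label for contrast.

```python
# The universal labels exactly as they appear in the table.
UNIVERSAL_LABELS = frozenset({
    "acl", "advcl", "advmod", "amod", "appos", "aux", "case", "cc", "ccomp",
    "clf", "compound", "conj", "cop", "csubj", "dep", "det", "discourse",
    "dislocated", "expl", "fixed", "flat", "goeswith", "iobj", "list", "mark",
    "nmod", "nsubj", "nummod", "obj", "obl", "orphan", "parataxis", "punct",
    "reparandum", "root", "vocative", "xcomp",
})


def missing_labels(candidates, known=UNIVERSAL_LABELS):
    """Return the candidate labels that are not in the known label set."""
    return [label for label in candidates if label not in known]


# Illustrative input: the six labels named above plus "nsubj" for contrast.
provided = ["acomp", "csubjpass", "pobj", "complm", "infmod", "partmod", "nsubj"]
absent = missing_labels(provided)
```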

Additionally, these dependencies are listed in table format at this website: https://universaldependencies.org/u/dep/

To help interpret the tree-like dependency structures that are available as output, Universal Dependencies also provides a guide to the dependency relationships at this website: https://universaldependencies.org/u/overview/enhanced-syntax.html

For example, here are relative clauses: