Annotation Guidelines
How causal relations are annotated in the Causal Semantics framework
Overview
This page documents the annotation schema and guidelines used to produce the training data for C-BERT and the underlying (C, E, I) tuple representation. If you are using C-BERT or working with annotated data from this project, this page serves as the reference for how annotations are structured, what decisions were made, and how edge cases are handled.
The annotation scheme was developed over three iterations (2024–2025) on the INCEpTION platform and applied to 4,753 sentences from German environmental discourse (1990–2022). The resulting corpus contains 2,391 manually annotated causal relations.
Annotation Schema
The schema consists of two layers: span annotations that mark causal components in text, and relation annotations that link them into structured causal relations.
Spans
Each token or token sequence in a sentence receives at most one span label. There are two span categories:
Role marks the primary causal components:
- Indicator: the lexical marker that projects the causal relation (e.g. verursachen, ist Ursache, weil, stoppen)
- Entity: a cause or effect entity involved in the relation (e.g. Klimawandel, Pestizide, Artensterben)
Coefficient captures semantic modifiers:
| Coefficient | Function | Examples |
|---|---|---|
| Negation | Negates or inverts the causal relation | nicht, kein, Verlust, Schwund |
| Temporality | Temporal framing | seit, bereits, künftig, Jahre |
| Spatiality | Spatial framing | global, lokal, weltweit, Deutschland |
| Uncertainty | Modality and hedging | möglicherweise, könnte, vermutlich |
| Quality | Qualitative modification | industrielle, parasitischer, schwefelhaltiger |
| Quantity | Quantitative modification | großes, fünftausend, zwölf |
| Representation | Evidentiality marker | laut, sagt, sei |
| Representation Entity | Source/provenance | Studie, Bericht, Forscher |
Relations
Annotated spans are linked by directed relations:
- Cause: Indicator → Entity (the entity fills the Cause role)
- Effect: Indicator → Entity (the entity fills the Effect role)
- Constraint: Entity or Indicator → Coefficient (the coefficient modifies its governor)
Worked Example
Consider the sentence:
Klimawandel und intensive Landwirtschaft verstärken möglicherweise das weltweite Artensterben.
(Climate change and intensive agriculture possibly intensify global species extinction.)
| Span | Type | Relation |
|---|---|---|
| Klimawandel | Entity | verstärken → Cause |
| intensive | Quality | Landwirtschaft → Constraint |
| Landwirtschaft | Entity | verstärken → Cause |
| verstärken | Indicator | (none) |
| möglicherweise | Uncertainty | verstärken → Constraint |
| weltweite | Spatiality | Artensterben → Constraint |
| Artensterben | Entity | verstärken → Effect |
Key points illustrated here: coordinated entities (Klimawandel und Landwirtschaft) are annotated separately with parallel Cause relations; the attributive adjective intensive is extracted as a Quality coefficient (token minimization principle); the sentence adverb möglicherweise is linked to the indicator since it modifies the causal proposition.
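The worked example can also be written down as plain data. The following Python sketch is illustrative only: the `Span` and `Relation` classes and their field names are assumptions made for this page, not the INCEpTION export format or the project's own code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    text: str
    label: str  # "Indicator", "Entity", or a coefficient type such as "Uncertainty"


@dataclass(frozen=True)
class Relation:
    governor: Span   # span the arrow starts from
    dependent: Span  # span the arrow points to
    kind: str        # "Cause", "Effect", or "Constraint"


# Spans of the example sentence (hypothetical variable names).
verstaerken = Span("verstärken", "Indicator")
klimawandel = Span("Klimawandel", "Entity")
landwirtschaft = Span("Landwirtschaft", "Entity")
artensterben = Span("Artensterben", "Entity")
moeglicherweise = Span("möglicherweise", "Uncertainty")
weltweite = Span("weltweite", "Spatiality")
intensive = Span("intensive", "Quality")

relations = [
    Relation(verstaerken, klimawandel, "Cause"),
    Relation(verstaerken, landwirtschaft, "Cause"),
    Relation(verstaerken, artensterben, "Effect"),
    Relation(verstaerken, moeglicherweise, "Constraint"),
    Relation(artensterben, weltweite, "Constraint"),
    Relation(landwirtschaft, intensive, "Constraint"),
]

causes = [r.dependent.text for r in relations if r.kind == "Cause"]
effects = [r.dependent.text for r in relations if r.kind == "Effect"]
print(causes, effects)
```

Every relation starts at its governor, so the two coordinated causes appear as two separate `Cause` relations that share the same indicator.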
Annotation Principles
Four principles guide annotation decisions in ambiguous cases.
Minimal Principle
Only explicitly marked causal relations are annotated; nothing is inferred. When a lexically specific marker (e.g. verursacht) and a functional marker (e.g. durch) co-occur, only the lexically richer element is annotated as the indicator. Functional prepositions and connectors are annotated as indicators only when no richer causal lexeme is present.
For light verb constructions, only the semantically loaded element is annotated (e.g. hat etwas mit X zu tun → Indicator: tun).
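The preference for lexically rich markers can be sketched as a small selection rule. This is a minimal illustration, assuming each candidate marker has already been tagged as `"lexical"` or `"functional"`; the `Marker` type, the tag names, and `select_indicators` are hypothetical and not part of the annotation tooling.

```python
from typing import NamedTuple


class Marker(NamedTuple):
    text: str
    kind: str  # "lexical" (e.g. verursacht) or "functional" (e.g. durch)


def select_indicators(candidates: list[Marker]) -> list[Marker]:
    """Keep lexical markers; fall back to functional ones only if none is lexical."""
    lexical = [m for m in candidates if m.kind == "lexical"]
    return lexical if lexical else list(candidates)


both = [Marker("verursacht", "lexical"), Marker("durch", "functional")]
only_functional = [Marker("durch", "functional")]
print(select_indicators(both))
print(select_indicators(only_functional))
```

With both markers present only verursacht survives; durch is kept as the indicator only when it stands alone.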
Token Minimization
Entities are reduced to their head token. Attributive modifiers are extracted as separate coefficients:
- industrielle Landwirtschaft → Entity: Landwirtschaft, Coefficient: industrielle (Quality)
- Einsatz von Pestiziden → Entity: Pestiziden
- Exception: named entities (Europäische Union) and fixed multi-word expressions (saurer Regen) are annotated as single units.
This prevents proliferation of marginal entity variants and facilitates downstream aggregation.
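A rough sketch of the reduction step, including the exception for fixed units, might look as follows. It assumes the head token and its modifiers are already known (from the annotator or a parser); `FIXED_UNITS` and `minimize` are hypothetical names, and the set contains only the two examples given above.

```python
# Units that are kept whole. Only the two documented examples are listed here;
# a real list would be longer.
FIXED_UNITS = {"Europäische Union", "saurer Regen"}


def minimize(phrase: str, head: str, modifiers: list[tuple[str, str]]):
    """Return (entity, coefficients) for one annotated phrase.

    `modifiers` holds (text, coefficient_type) pairs supplied by the caller.
    """
    if phrase in FIXED_UNITS:
        return phrase, []          # annotated as a single unit
    return head, list(modifiers)   # head token plus extracted coefficients


print(minimize("industrielle Landwirtschaft", "Landwirtschaft",
               [("industrielle", "Quality")]))
print(minimize("saurer Regen", "Regen", [("saurer", "Quality")]))
```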
Syntactic Proximity
When multiple potential entities compete for a role, the entity syntactically closest to the indicator is preferred, unless semantic considerations override this.
Coefficient Conservatism
Coefficients (other than Representation) are only annotated when they stand in a direct syntactic dependency relation with an indicator or entity.
Indicators
Indicators are the lexical or syntactic markers that project causal relations and establish (C, E, I) tuples. They fulfill two functions: projecting causal roles onto syntactic positions in their co-text, and encoding inherent information about polarity and salience.
The annotation corpus contains 642 distinct indicator forms, grouped into 192 indicator families by morphological and semantic criteria. A family subsumes all realizations of a shared lexical core (e.g. the Ursache family includes verursachen, ist Ursache, Ursache sein, Teilursache sein).
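The form-to-family grouping can be pictured as a simple lookup. The sketch below covers only the four Ursache forms named above; the mapping `FORM_TO_FAMILY` and the helper `forms_by_family` are illustrative assumptions, not the project's actual family inventory.

```python
from collections import defaultdict

# Hypothetical excerpt: only the Ursache forms mentioned in the text.
FORM_TO_FAMILY = {
    "verursachen": "Ursache",
    "ist Ursache": "Ursache",
    "Ursache sein": "Ursache",
    "Teilursache sein": "Ursache",
}


def forms_by_family(mapping: dict[str, str]) -> dict[str, list[str]]:
    """Invert a form -> family mapping into family -> list of forms."""
    families = defaultdict(list)
    for form, family in mapping.items():
        families[family].append(form)
    return dict(families)


print(forms_by_family(FORM_TO_FAMILY))
```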
Top 10 Indicator Families
| Family | Forms | Instances |
|---|---|---|
| Ursache | 21 | 167 |
| Verantwortung | 11 | 122 |
| Stoppen | 3 | 111 |
| Gegen | 3 | 95 |
| Durch | 2 | 70 |
| Beitrag | 10 | 70 |
| Folge | 6 | 69 |
| Kampf | 8 | 69 |
| Führen | 9 | 59 |
| Grund | 7 | 57 |
Polarity and Salience
Each indicator family carries an inherent polarity (promoting + or inhibiting −) and a default salience class:
| Family | Polarity | Default Salience | Discourse Function |
|---|---|---|---|
| Ursache | + | variable | Prototype; full salience spectrum |
| Verantwortung | + | variable | Causal-moral attribution |
| Folge | + | monocausal | Effect-centered perspective |
| Beitrag | + | polycausal | Distributional attribution |
| Durch | + | monocausal | Grammaticalized preposition |
| Stoppen | − | monocausal | Intervention framing |
| Gegen | − | monocausal | Variable condensation (Kampf gegen, Protest gegen) |
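For downstream use, the table can be held as a lookup. The sketch below transcribes the seven rows; the dictionary name, the tuple layout, and the `polarity_sign` helper are assumptions for illustration.

```python
# (polarity, default salience class) per indicator family, as listed in the table.
FAMILY_PROPERTIES = {
    "Ursache": ("+", "variable"),
    "Verantwortung": ("+", "variable"),
    "Folge": ("+", "monocausal"),
    "Beitrag": ("+", "polycausal"),
    "Durch": ("+", "monocausal"),
    "Stoppen": ("-", "monocausal"),
    "Gegen": ("-", "monocausal"),
}


def polarity_sign(family: str) -> int:
    """Return +1 for promoting families and -1 for inhibiting ones."""
    polarity, _salience = FAMILY_PROPERTIES[family]
    return 1 if polarity == "+" else -1


inhibiting = sorted(f for f in FAMILY_PROPERTIES if polarity_sign(f) < 0)
print(inhibiting)
```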
Syntactic Projection Patterns
The syntactic realization of each indicator family determines how Cause and Effect roles are projected:
| Family | Example | Cause Projection | Effect Projection |
|---|---|---|---|
| Ursache | X ist Ursache für Y | Subject | PP (für) |
| Folge | Y ist Folge von X | PP (von) | Subject |
| Verursachen | X verursacht Y | Subject | Accusative object |
| Beitragen | X trägt zu Y bei | Subject | PP (zu) |
| Stoppen | X stoppt Y | Subject | Accusative object |
| Durch | Y durch X | PP complement | Head noun |
| Gegen | X gegen Y | Subject/Agent | PP complement |
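Read as data, each row says which syntactic slot supplies the Cause and which supplies the Effect. The following sketch assumes the slot fillers have already been extracted by some parser; the slot labels (`"subject"`, `"pp_von"`, and so on) and the function `assign_roles` are hypothetical names, not a real tag set.

```python
# Cause and Effect slots per row of the table above.
# Slot labels are illustrative placeholders.
PROJECTION = {
    "Ursache": {"Cause": "subject", "Effect": "pp_für"},
    "Folge": {"Cause": "pp_von", "Effect": "subject"},
    "Verursachen": {"Cause": "subject", "Effect": "acc_object"},
    "Beitragen": {"Cause": "subject", "Effect": "pp_zu"},
    "Stoppen": {"Cause": "subject", "Effect": "acc_object"},
    "Durch": {"Cause": "pp_complement", "Effect": "head_noun"},
    # "Subject/Agent" is simplified to "subject" in this sketch.
    "Gegen": {"Cause": "subject", "Effect": "pp_complement"},
}


def assign_roles(family: str, slots: dict[str, str]) -> dict[str, str]:
    """Map syntactic slot fillers to Cause/Effect roles for one family."""
    pattern = PROJECTION[family]
    return {role: slots[slot] for role, slot in pattern.items() if slot in slots}


# "Artensterben ist Folge von Klimawandel": the subject is the Effect.
print(assign_roles("Folge", {"subject": "Artensterben", "pp_von": "Klimawandel"}))
```

The same subject position yields a Cause under Ursache but an Effect under Folge, which is why role assignment has to go through the indicator's pattern rather than through word order.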
Context Markers
Context markers modify and contextualize the causal relations projected by indicators. Unlike indicators, they are not lexically specialized for causality but operate at the predication or proposition level. Three structural marker types directly affect the INFLUENCE computation:
Division
Division markers signal polycausal structures with implicit co-causes. Typical realizations include unter anderem (“among other things”), auch (“also”), ebenfalls (“likewise”), and the composite nicht nur (“not only”).
Division markers reduce the salience to |I| = 0.5 regardless of the indicator’s default.
Priority
Priority markers establish asymmetric weighting within polycausal sets: vor allem (“above all”), hauptsächlich (“mainly”), maßgeblich (“significantly”). They set |I| = 0.75 for the prioritized cause.
Both marker types affect only the salience (|I|), leaving polarity (±) unchanged.
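The two salience rules amount to a simple override. The sketch below is a minimal illustration under two assumptions: at most one of the two markers applies to a given cause, and the indicator's default salience is passed in by the caller (the value 1.0 in the usage lines is a placeholder, not a documented default). The function and constant names are hypothetical.

```python
from typing import Optional

DIVISION_SALIENCE = 0.5    # |I| under a Division marker
PRIORITY_SALIENCE = 0.75   # |I| for the cause singled out by a Priority marker


def apply_salience_marker(default_salience: float, marker: Optional[str]) -> float:
    """Return |I| after applying a Division or Priority marker, if any.

    Polarity is not touched here; only the magnitude changes.
    """
    if marker == "division":
        return DIVISION_SALIENCE
    if marker == "priority":
        return PRIORITY_SALIENCE
    return default_salience


print(apply_salience_marker(1.0, "division"))
print(apply_salience_marker(1.0, "priority"))
print(apply_salience_marker(1.0, None))
```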
Negation
Negation is the structurally most influential marker. Two types are distinguished:
Object-based negation operates at the entity level through markers like Verlust (“loss”), Schwund (“decline”), Rückgang (“decrease”). These invert the polarity when the count of negations is odd:
Verlust von Lebensräumen verursacht Bienensterben.
Indicator verursachen: default I > 0; object negation on Cause (Verlust) → polarity inverted: I < 0.
Propositional negation operates at the relation level (nicht, kein) and neutralizes the entire causal relationship (I = 0):
Pestizide verursachen nicht Insektensterben.
Indicator verursachen: default I > 0; propositional negation → I = 0.
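Both negation types can be expressed as one small function. This is a sketch under stated assumptions: `object_negations` is the total count of object-based negation markers attached to the relation's entities, and the input value 1.0 in the usage lines is a placeholder for whatever INFLUENCE the indicator yields before negation. The function name and signature are hypothetical.

```python
def apply_negation(influence: float, object_negations: int = 0,
                   propositional: bool = False) -> float:
    """Apply object-based and propositional negation to an INFLUENCE value."""
    if propositional:
        return 0.0                 # the whole relation is neutralized
    if object_negations % 2 == 1:
        return -influence          # an odd count inverts the polarity
    return influence               # an even count cancels out


# Verlust von Lebensräumen verursacht Bienensterben.
print(apply_negation(1.0, object_negations=1))
# Pestizide verursachen nicht Insektensterben.
print(apply_negation(1.0, propositional=True))
```

The order of the two checks does not affect the result, since inverting a value that is then set to zero still gives zero.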
From Annotations to INFLUENCE
The annotations documented above (indicators, entities, and context markers) are the inputs to a deterministic computation that produces the final INFLUENCE value I ∈ [−1, +1]. In brief: entity identification follows the indicator’s syntactic projection pattern, polarity is determined by indicator class and negation markers, and salience is computed through a cascading hierarchy of morphological, determiner, and syntactic markers.
For the full formal specification — including the cascade rules, coordination normalization, and worked examples — see Tuple Construction.
Data Format
Annotated data is exported as JSON. Each entry represents a sentence with its metadata and extracted relations.
Schema
{
  "subfolder": "Artensterben_oa",
  "global_sentence_id": 1148482,
  "text_id": "FAZ_200204_384209",
  "text_date": "2002-04",
  "sentence_id": "12",
  "sentence": "...",
  "relations": [
    {
      "indicator": "Folge",
      "entities": [
        {
          "entity": "Kleinplanet",
          "relation": "Cause",
          "dependent_coefficients": [
            {"coefficient_text": "Jahren", "coefficient": "Temporality"},
            {"coefficient_text": "Mexikos", "coefficient": "Spatiality"}
          ]
        },
        {
          "entity": "Artensterben",
          "relation": "Effect"
        }
      ],
      "coefficient": "Division",
      "dependent_coefficients": [
        {"coefficient_text": "könnte", "coefficient": "Uncertainty"}
      ],
      "representation": "berieten",
      "representation_entities": [
        {
          "entity": "Teilnehmer",
          "relation": "Constraint",
          "dependent_coefficients": [
            {"coefficient_text": "fünftausend", "coefficient": "Quantity"}
          ]
        }
      ]
    }
  ]
}
Field Reference
Sentence-level fields:
| Field | Description |
|---|---|
| subfolder | WABI subcorpus and syntactic position (e.g. Artensterben_oa = accusative object) |
| global_sentence_id | Unique sentence identifier across the full corpus |
| text_id | Source document identifier (format: SOURCE_YYYYMM_ID) |
| text_date | Publication date (YYYY-MM) |
| sentence | Full sentence text |
| relations | Array of causal relations found in this sentence (empty if none) |
Relation-level fields:
| Field | Description |
|---|---|
| indicator | The causal indicator lexeme |
| entities | Array of entities with their causal role (Cause or Effect) |
| coefficient | Structural marker on the relation level (Negation, Division) |
| dependent_coefficients | Contextual coefficients attached to the indicator |
| representation | Evidentiality/speech-act verb (if present) |
| representation_entities | Source entities for reported speech |
Entity-level fields:
| Field | Description |
|---|---|
| entity | Head token of the entity (token-minimized) |
| relation | Causal role: Cause, Effect, or Constraint |
| dependent_coefficients | Array of coefficients modifying this entity |
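As a reading aid, the sketch below walks one entry and lists its (cause, indicator, effect) triples using only the field names documented above. The helper `causal_pairs` is hypothetical, and pairing every Cause with every Effect of a relation is an assumption of this sketch rather than a documented rule. The inline entry is a shortened version of the schema example.

```python
import json


def causal_pairs(entry: dict) -> list[tuple[str, str, str]]:
    """Return (cause, indicator, effect) triples for one sentence entry."""
    triples = []
    for rel in entry.get("relations", []):
        entities = rel.get("entities", [])
        causes = [e["entity"] for e in entities if e.get("relation") == "Cause"]
        effects = [e["entity"] for e in entities if e.get("relation") == "Effect"]
        # Assumption: every cause is paired with every effect of the relation.
        triples.extend((c, rel["indicator"], e) for c in causes for e in effects)
    return triples


entry = json.loads("""
{
  "sentence": "...",
  "relations": [
    {
      "indicator": "Folge",
      "entities": [
        {"entity": "Kleinplanet", "relation": "Cause"},
        {"entity": "Artensterben", "relation": "Effect"}
      ]
    }
  ]
}
""")

print(causal_pairs(entry))
```

Optional fields such as `dependent_coefficients` are read with `.get`, since the schema example shows that entities may omit them.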
Sentences Without Relations
Sentences where no explicit causal relation was identified have an empty relations array. These are not noise — they were reviewed during annotation and determined to contain no explicit causal markers per the minimal principle.
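When loading the data, the two groups can be separated by checking the `relations` array. A minimal sketch, with `split_by_relations` as a hypothetical helper name and two made-up placeholder entries:

```python
def split_by_relations(entries: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split entries into (with relations, reviewed but without relations)."""
    with_relations = [e for e in entries if e.get("relations")]
    without_relations = [e for e in entries if not e.get("relations")]
    return with_relations, without_relations


# Placeholder entries for illustration only.
sample = [
    {"sentence": "...", "relations": [{"indicator": "Folge", "entities": []}]},
    {"sentence": "...", "relations": []},
]
annotated, empty = split_by_relations(sample)
print(len(annotated), len(empty))
```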
Corpus Statistics
| Statistic | Value |
|---|---|
| Total sentences | 4,753 |
| Sentences with ≥1 WABI relation | 1,797 (37.8%) |
| Total WABI-relevant relations | 1,867 |
| Distinct indicator forms | 642 |
| Indicator families | 192 |
| Mean relations per sentence | 0.39 |
Per WABI term:
| Term | Sentences | Relations | Rel./Sent. |
|---|---|---|---|
| Waldsterben | 1,818 | 633 | 0.35 |
| Artensterben | 1,854 | 744 | 0.40 |
| Bienensterben | 536 | 257 | 0.48 |
| Insektensterben | 545 | 233 | 0.43 |
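The ratio column and the corpus totals follow directly from the per-term counts. The short check below recomputes them from the figures in the table; the variable names are arbitrary.

```python
# (sentences, relations) per WABI term, copied from the table above.
per_term = {
    "Waldsterben": (1818, 633),
    "Artensterben": (1854, 744),
    "Bienensterben": (536, 257),
    "Insektensterben": (545, 233),
}

# Relations per sentence for each term, rounded as in the table.
ratios = {term: round(rel / sent, 2) for term, (sent, rel) in per_term.items()}

total_sentences = sum(sent for sent, _ in per_term.values())
total_relations = sum(rel for _, rel in per_term.values())
overall = round(total_relations / total_sentences, 2)

print(ratios)
print(total_sentences, total_relations, overall)
```

The per-term counts add up to the 4,753 sentences and 1,867 relations reported for the whole corpus, and the overall ratio matches the stated mean of 0.39.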
Further Reading
- For the theoretical motivation behind polarity and salience, see Framework
- For how annotations are transformed into (C, E, I) tuples, see Tuple Construction
- For the C-BERT model trained on this data, see C-BERT
- Full annotation data (Bundestag subset): HuggingFace Dataset