graph LR
A["*verstärken*"] -->|Cause| B["*Klimawandel*"]
A -->|Cause| C["*Landwirtschaft*"] -->|Priority| D["*insbesondere*"]
A -->|Uncertainty| E["*möglicherweise*"]
A -->|Effect| F["*Artensterben*"]
F -->|Spatiality| G["*weltweit*"]
Annotation Guidelines
How causal relations are annotated in the Causal Semantics framework
Overview
This page presents the annotation schema and guidelines used to produce the training data for C-BERT and the underlying (C, E, I) tuple representation.
If you are using C-BERT or working with annotated data from this project, this page serves as the reference for how annotations are structured, what decisions were made, and how edge cases are handled.
The annotation was developed over three iterations (2024–2025) and primarily conducted via INCEpTION. A total of 4,753 sentences from German environmental discourse (1990–2022) yielded 2,391 manually annotated causal relations.
Annotation Schema
The schema consists of two layers: span annotations mark causal components in text, and relation annotations link them into structured causal relations.
Spans
Each token or token sequence in a sentence receives at most one span label. There are two span categories.
Role marks the primary causal components:
- Indicator: the lexical marker that projects the causal relation, e.g. verursachen ‘cause’, ist Ursache ‘is cause’, weil ‘because’, stoppen ‘stop’
- Entity: a cause or effect entity involved in the relation, e.g. Klimawandel ‘climate change’, Pestizide ‘pesticides’
Coefficient captures semantic modifiers:
- Negation: neutralizes or inverts the causal relation, e.g. nicht ‘not’, kein ‘no’, Verlust ‘loss’, Schwund ‘shrinkage’
- Division: lowers salience, e.g. auch ‘also’, unter anderem ‘among others’
- Priority: increases salience, e.g. vor allem ‘especially’, wichtig ‘important’
The three types above directly affect the influence value. The remaining coefficients are orthogonal to it: they either form constraints (e.g. Temporality) or specify epistemic dimensions (e.g. Uncertainty).
| Coefficient | Function | Examples |
|---|---|---|
| Temporality | Temporal framing | seit, bereits, künftig, Jahre |
| Spatiality | Spatial framing | global, lokal, weltweit, Deutschland |
| Quality | Qualitative modification | industrielle, parasitischer, schwefelhaltiger |
| Quantity | Quantitative modification | großes, fünftausend, zwölf |
| Coefficient | Function | Examples |
|---|---|---|
| Uncertainty | Modality and hedging | möglicherweise, könnte, vermutlich |
| Representation | Evidentiality marker | laut, sagt, sei |
| Representation Entity | Source/provenance | Studie, Bericht, Forscher |
Relations
Annotated spans are linked by directed relations:
- Cause: Indicator → Entity (the entity is designated as a Cause)
- Effect: Indicator → Entity (the entity fills the Effect role)
- Constraint: Entity or Indicator → Coefficient (the coefficient modifies its governor)
Example
Consider the spans in this sentence:
Klimawandel und insbesondere Landwirtschaft verstärken möglicherweise das weltweite Artensterben.
Climate change and especially agriculture possibly intensify global species extinction.
Two coordinated entities (Klimawandel und Landwirtschaft) are annotated separately with parallel Cause relations.
The adverb insbesondere ‘especially’ is scoped to Landwirtschaft, raising its salience as Priority. The sentence adverb möglicherweise (Uncertainty) is parented to the indicator since it modifies the causal proposition.
Lastly, the adjective weltweit ‘worldwide’ is encoded separately as a Spatiality coefficient, in accordance with the token minimization principle.
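The annotation of this sentence can also be written down as plain data. The following Python sketch is illustrative only: the `Span` and `Relation` classes and their field names are hypothetical and do not reflect the INCEpTION export format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    text: str
    label: str  # "Indicator", "Entity", or a coefficient type such as "Priority"


@dataclass(frozen=True)
class Relation:
    kind: str        # "Cause", "Effect", or "Constraint"
    governor: Span   # source of the directed relation
    dependent: Span  # target of the directed relation


# Spans of the example sentence (hypothetical variable names).
verstaerken = Span("verstärken", "Indicator")
klimawandel = Span("Klimawandel", "Entity")
landwirtschaft = Span("Landwirtschaft", "Entity")
artensterben = Span("Artensterben", "Entity")
insbesondere = Span("insbesondere", "Priority")
moeglicherweise = Span("möglicherweise", "Uncertainty")
weltweit = Span("weltweit", "Spatiality")

relations = [
    Relation("Cause", verstaerken, klimawandel),
    Relation("Cause", verstaerken, landwirtschaft),
    Relation("Effect", verstaerken, artensterben),
    Relation("Constraint", landwirtschaft, insbesondere),    # scoped to the entity
    Relation("Constraint", verstaerken, moeglicherweise),    # scoped to the indicator
    Relation("Constraint", artensterben, weltweit),          # extracted modifier
]

causes = [r.dependent.text for r in relations if r.kind == "Cause"]
print(causes)  # ['Klimawandel', 'Landwirtschaft']
```

Keeping the coefficient type on the span and using a single Constraint relation kind mirrors the two-layer schema described above.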
Annotation Principles
Four principles guide annotation decisions in ambiguous cases.
Minimal Principle
Only explicitly marked causal relations are annotated — no inference. When a lexically specific marker (e.g. verursacht) and a functional marker (e.g. durch) co-occur, only the lexically richer element is annotated as indicator. Functional prepositions and connectors are annotated as indicators only when no richer causal lexeme is present.
For light verb constructions, only the semantically loaded element is annotated (e.g. hat etwas mit X zu tun ‘has something to do with X’ → Indicator: tun).
Token Minimization
Entities are reduced to their head token. Attributive modifiers are extracted as separate coefficients:
- industrielle Landwirtschaft → Entity: Landwirtschaft, Coefficient: industrielle (Quality)
- Einsatz von Pestiziden → Entity: Pestiziden
- Exception: named entities (Europäische Union) and fixed multi-word expressions (saurer Regen) are annotated as single units.
This prevents proliferation of marginal entity variants and facilitates downstream aggregation.
Syntactic Proximity
When multiple potential entities compete for a role, the syntactically closest entity to the indicator is preferred, unless semantic considerations override this.
Coefficient Conservatism
Coefficients (other than Representation) are only annotated when they stand in a direct syntactic dependency relation with an indicator or entity.
Indicators
Indicators are the lexical or syntactic markers that project causal relations and establish (C, E, I) tuples. They fulfill two functions: projecting causal roles onto syntactic positions in their co-text, and encoding inherent information about polarity and salience.
The annotation corpus contains 642 distinct indicator forms, grouped into 192 indicator families by morphological and semantic criteria. A family subsumes all realizations of a shared lexical core. For example, the Ursache family includes, among others:
- verursachen ‘cause’
- ist Ursache ‘is the cause’
- Ursache sein ‘to be the cause’
- Teilursache sein ‘to be a partial cause’
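As a minimal sketch of what grouping forms into families means in practice, the lookup below maps the four forms listed above to their family. The dictionary and the function name `group_by_family` are hypothetical; the actual grouping in the project relies on morphological and semantic criteria, not on a hand-written table.

```python
from collections import defaultdict

# Hypothetical form-to-family lookup, limited to the forms listed above.
FORM_TO_FAMILY = {
    "verursachen": "Ursache",
    "ist Ursache": "Ursache",
    "Ursache sein": "Ursache",
    "Teilursache sein": "Ursache",
}


def group_by_family(forms):
    """Group indicator forms by family; unknown forms go under 'UNKNOWN'."""
    families = defaultdict(list)
    for form in forms:
        families[FORM_TO_FAMILY.get(form, "UNKNOWN")].append(form)
    return dict(families)


print(group_by_family(["verursachen", "Teilursache sein", "weil"]))
# {'Ursache': ['verursachen', 'Teilursache sein'], 'UNKNOWN': ['weil']}
```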
Top 10 Indicator Families
| Family | Forms | Instances |
|---|---|---|
| Ursache ‘Cause’ | 21 | 167 |
| Verantwortung ‘Responsibility’ | 11 | 122 |
| Stoppen ‘Stop’ | 3 | 111 |
| Gegen ‘Against’ | 3 | 95 |
| Durch ‘Through’ | 2 | 70 |
| Beitrag ‘Contribution’ | 10 | 70 |
| Folge ‘Consequence’ | 6 | 69 |
| Kampf ‘Fight’ | 8 | 69 |
| Führen ‘Lead’ | 9 | 59 |
| Grund ‘Reason’ | 7 | 57 |
Some families are Cause-oriented, others are Effect-oriented. Compare:
Emission_C is a cause of pollution_E.
Pollution_E is a consequence of emission_C.
Orientation affects projection – Cause-oriented indicators generally project Cause on the subject, Effect-oriented indicators do the same with Effect. Modified indicator forms affect the projection according to their orientation.
Note the difference between Teilursache ‘partial cause’ and Teilwirkung ‘partial effect’: the former implies the existence of several causes, the latter the existence of several effects.
Polarity and Salience
Each indicator family carries a default polarity (promoting + or inhibiting -) and a default salience class:
| Family | Polarity | Default Salience | Discourse Function |
|---|---|---|---|
| Ursache ‘Cause’ | + | [0.5-1] | Cause-oriented prototype |
| Folge ‘Consequence’ | + | 1 | Effect-oriented |
| Beitrag ‘Contribution’ | + | [0.5-0.75] | Distributional attribution |
| Durch ‘Through’ | + | 1 | Grammaticalized preposition |
| Stoppen ‘Stop’ | − | 1 | Intervention framing |
| Gegen ‘Against’ | − | 1 | Variable condensation (Kampf/Protest gegen ‘fight/protest against’) |
| Reduzieren ‘Reduce’ | − | [0.5-0.75] | Prototypical distributional negative |
| Wirkung ‘Effect’ | ± | [0-1] | Widest range of influence (Gegenwirkung ‘counter-effect’, wirkungslos ‘ineffective’) |
Syntactic Projection Patterns
The syntactic realization of each indicator determines how Cause and Effect roles are projected:
| Family | Example | Cause Projection | Effect Projection |
|---|---|---|---|
| Ursache ‘Cause’ | X ist Ursache für Y ‘x is the cause of y’ | Subject | PP (für) ‘for’ |
| Folge ‘Consequence’ | Y ist Folge von X ‘y is a consequence of x’ | PP (von) ‘of’ | Subject |
| Verursachen ‘Cause’ | X verursacht Y ‘x causes y’ | Subject | Accusative object |
| Beitragen ‘Contribute’ | X trägt zu Y bei ‘x contributes to y’ | Subject | PP (zu) ‘to’ |
| Stoppen ‘Stop’ | X stoppt Y ‘x stops y’ | Subject | Accusative object |
| Durch ‘Through’ | Y durch X ‘y through x’ | PP complement | Head noun |
| Gegen ‘Against’ | X gegen Y ‘x against y’ | Subject | PP complement |
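The projection patterns in the table above lend themselves to a simple lookup. The sketch below is an illustration under stated assumptions: the slot names (`subject`, `pp_fuer`, and so on) and the function `assign_roles` are hypothetical, and the syntactic slots are expected to come from a dependency parser that is not shown here.

```python
# Projection patterns transcribed from the table above.
# Slot names are hypothetical labels for the syntactic positions.
PROJECTION = {
    "Ursache": {"Cause": "subject", "Effect": "pp_fuer"},
    "Folge": {"Cause": "pp_von", "Effect": "subject"},
    "Verursachen": {"Cause": "subject", "Effect": "accusative_object"},
    "Beitragen": {"Cause": "subject", "Effect": "pp_zu"},
    "Stoppen": {"Cause": "subject", "Effect": "accusative_object"},
    "Durch": {"Cause": "pp_complement", "Effect": "head_noun"},
    "Gegen": {"Cause": "subject", "Effect": "pp_complement"},
}


def assign_roles(indicator, slots):
    """Map filled syntactic slots to causal roles for one indicator.

    `slots` maps slot names to entity head tokens; slots that are not
    filled in the sentence are simply skipped.
    """
    pattern = PROJECTION[indicator]
    return {role: slots[slot] for role, slot in pattern.items() if slot in slots}


# "Verschmutzung ist Folge von Emission": the subject is the Effect.
print(assign_roles("Folge", {"subject": "Verschmutzung", "pp_von": "Emission"}))
# {'Cause': 'Emission', 'Effect': 'Verschmutzung'}
```

The same subject position yields a Cause for Ursache and an Effect for Folge, which is the orientation contrast described earlier.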
Context Markers
Context markers modify and contextualize the causal relations projected by indicators. Unlike indicators, they are not lexically specialized for causality but operate at the predication or proposition level. Three structural marker types directly affect the INFLUENCE computation:
Division
Division markers signal polycausal structures with implicit co-causes. Typical realizations include unter anderem ‘among other things’, auch ‘also’, ebenfalls ‘likewise’, and the composite nicht nur ‘not only’.
Division markers reduce the salience to |I| = 0.5 regardless of the indicator’s default.
Priority
Priority markers establish asymmetric weighting within polycausal sets: vor allem ‘above all’, hauptsächlich ‘mainly’, maßgeblich ‘significantly’. They set |I| = 0.75 for the prioritized cause.
Both marker types affect only the salience (|I|), leaving polarity (±) unchanged.
Negation
Negation is the structurally most influential marker. Two types are distinguished:
Object-based negation operates at the entity level through markers like Verlust ‘loss’, Schwund ‘decline’, Rückgang ‘decrease’. These invert the polarity when the count of negations is odd:
Verlust von Lebensräumen verursacht Bienensterben.
Loss of habitat causes bee decline.
Indicator verursachen ‘cause’: default I = 1; object negation on the Cause (Verlust ‘loss’) inverts polarity: I = -1.
Propositional negation operates at the relation level (nicht, kein) and neutralizes the entire causal relationship (I = 0):
Pestizide verursachen nicht Insektensterben.
Pesticides do not cause insect decline.
Indicator verursachen ‘cause’: default I = 1; propositional negation nicht ‘not’ neutralizes: I = 0.
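The effect of the three structural markers can be sketched as a small function. This is a simplified illustration, not the full cascade: the function name `influence` and its parameters are hypothetical, the default salience is passed in as a single number rather than a range, and the case where Division and Priority co-occur is deliberately left out because the rules above do not specify it.

```python
def influence(default_salience=1.0, polarity=1, division=False,
              priority=False, object_negations=0, propositional_negation=False):
    """Simplified INFLUENCE sketch covering only the three structural markers."""
    if polarity not in (1, -1):
        raise ValueError("polarity must be +1 (promoting) or -1 (inhibiting)")
    if not 0.0 <= default_salience <= 1.0:
        raise ValueError("salience must lie in [0, 1]")
    if division and priority:
        raise ValueError("combined Division and Priority is not covered by this sketch")

    # Propositional negation neutralizes the whole relation.
    if propositional_negation:
        return 0.0

    # Division and Priority override the salience, not the polarity.
    salience = default_salience
    if division:
        salience = 0.5
    elif priority:
        salience = 0.75

    # An odd number of object-based negations inverts the polarity.
    sign = -polarity if object_negations % 2 == 1 else polarity
    return sign * salience


print(influence(object_negations=1))            # -1.0  (Verlust von Lebensräumen verursacht ...)
print(influence(propositional_negation=True))   # 0.0   (... verursachen nicht ...)
print(influence(division=True))                 # 0.5
```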
From Annotations to INFLUENCE
The annotations documented above (indicators, entities, and context markers) are the inputs to a deterministic computation that produces the final INFLUENCE value I ∈ [-1, +1].
In brief: entity identification follows the indicator’s syntactic projection pattern, polarity is determined by indicator class and negation markers, and salience is computed through a cascading hierarchy of morphological, determiner, and syntactic markers.
For the full formal specification — including the cascade rules, coordination normalization, and worked examples — see Tuple Construction.
Data Format
Annotated data is exported as JSON. Each entry represents a sentence with its metadata and extracted relations.
Schema
{
  "subfolder": "Artensterben_oa",
  "global_sentence_id": 1148482,
  "text_id": "FAZ_200204_384209",
  "text_date": "2002-04",
  "sentence_id": "12",
  "sentence": "...",
  "relations": [
    {
      "indicator": "Folge",
      "entities": [
        {
          "entity": "Kleinplanet",
          "relation": "Cause",
          "dependent_coefficients": [
            {"coefficient_text": "Jahren", "coefficient": "Temporality"},
            {"coefficient_text": "Mexikos", "coefficient": "Spatiality"}
          ]
        },
        {
          "entity": "Artensterben",
          "relation": "Effect"
        }
      ],
      "coefficient": "Division",
      "dependent_coefficients": [
        {"coefficient_text": "könnte", "coefficient": "Uncertainty"}
      ],
      "representation": "berieten",
      "representation_entities": [
        {
          "entity": "Teilnehmer",
          "relation": "Constraint",
          "dependent_coefficients": [
            {"coefficient_text": "fünftausend", "coefficient": "Quantity"}
          ]
        }
      ]
    }
  ]
}
Field Reference
Sentence-level fields:
| Field | Description |
|---|---|
| subfolder | WABI subcorpus and syntactic position (e.g. Artensterben_oa = accusative object) |
| global_sentence_id | Unique sentence identifier across the full corpus |
| text_id | Source document identifier (format: SOURCE_YYYYMM_ID) |
| text_date | Publication date (YYYY-MM) |
| sentence | Full sentence text |
| relations | Array of causal relations found in this sentence (empty if none) |
Relation-level fields:
| Field | Description |
|---|---|
| indicator | The causal indicator lexeme |
| entities | Array of entities with their causal role (Cause or Effect) |
| coefficient | Structural marker on the relation level (Negation, Division) |
| dependent_coefficients | Contextual coefficients attached to the indicator |
| representation | Evidentiality/speech-act verb (if present) |
| representation_entities | Source entities for reported speech |
Entity-level fields:
| Field | Description |
|---|---|
| entity | Head token of the entity (token-minimized) |
| relation | Causal role: Cause, Effect, or Constraint |
| dependent_coefficients | Array of coefficients modifying this entity |
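To show how these fields fit together, here is a minimal Python sketch that reads one entry and pairs every Cause with every Effect of a relation. The function name `cause_effect_pairs` is hypothetical, the sample is an abbreviated version of the entry shown in the schema, and how entries are stored in a file (one array, one object per line) is not specified here, so the sketch works on a single parsed entry.

```python
import json


def cause_effect_pairs(entry):
    """Yield (indicator, cause, effect) for each Cause x Effect combination."""
    for rel in entry.get("relations", []):
        entities = rel.get("entities", [])
        causes = [e["entity"] for e in entities if e.get("relation") == "Cause"]
        effects = [e["entity"] for e in entities if e.get("relation") == "Effect"]
        for cause in causes:
            for effect in effects:
                yield rel["indicator"], cause, effect


# Abbreviated sample entry, following the schema above.
sample = json.loads("""
{
  "global_sentence_id": 1148482,
  "sentence": "...",
  "relations": [
    {
      "indicator": "Folge",
      "entities": [
        {"entity": "Kleinplanet", "relation": "Cause"},
        {"entity": "Artensterben", "relation": "Effect"}
      ],
      "coefficient": "Division"
    }
  ]
}
""")

pairs = list(cause_effect_pairs(sample))
print(pairs)  # [('Folge', 'Kleinplanet', 'Artensterben')]
```

Entries with an empty relations array simply yield nothing, so they need no special handling.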
Sentences Without Relations
Sentences where no explicit causal relation was identified have an empty relations array. These are not noise — they were reviewed during annotation and determined to contain no explicit causal markers per the minimal principle.
Corpus Statistics
| Metric | Value |
|---|---|
| Total sentences | 4,753 |
| Sentences with ≥1 WABI relation | 1,797 (37.8%) |
| Total WABI-relevant relations | 1,867 |
| Distinct indicator forms | 642 |
| Indicator families | 192 |
| Mean relations per sentence | 0.39 |
Per WABI term:
| Term | Sentences | Relations | Rel./Sent. |
|---|---|---|---|
| Waldsterben | 1,818 | 633 | 0.35 |
| Artensterben | 1,854 | 744 | 0.40 |
| Bienensterben | 536 | 257 | 0.48 |
| Insektensterben | 545 | 233 | 0.43 |
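The per-term figures add up to the corpus totals, and the ratios follow directly from them. The short Python check below recomputes the derived values from the numbers in the tables; the variable names are illustrative.

```python
# (sentences, relations) per WABI term, taken from the table above.
per_term = {
    "Waldsterben": (1818, 633),
    "Artensterben": (1854, 744),
    "Bienensterben": (536, 257),
    "Insektensterben": (545, 233),
}

total_sentences = sum(s for s, _ in per_term.values())
total_relations = sum(r for _, r in per_term.values())

# Relations per sentence, rounded to two decimals as in the table.
ratios = {term: round(r / s, 2) for term, (s, r) in per_term.items()}
mean_relations = round(total_relations / total_sentences, 2)

# Share of sentences with at least one WABI relation (1,797 of 4,753).
share_with_relation = round(100 * 1797 / total_sentences, 1)

print(total_sentences, total_relations)  # 4753 1867
print(ratios)
print(mean_relations, share_with_relation)  # 0.39 37.8
```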
Further Reading
- For how annotations are transformed into (C, E, I) tuples, see Tuple Construction
- For the C-BERT model trained on this data, see C-BERT
- Full annotation data (Bundestag subset): HuggingFace Dataset