Annotation Guidelines

How causal relations are annotated in the Causal Semantics framework

Overview

This page presents the annotation schema and guidelines used to produce the training data for C-BERT and the underlying (C, E, I) tuple representation.

If you are using C-BERT or working with annotated data from this project, this page serves as the reference for how annotations are structured, what decisions were made, and how edge cases are handled.

The annotation was developed over three iterations (2024–2025) and primarily conducted via INCEpTION. A total of 4,753 sentences from German environmental discourse (1990–2022) yielded 2,391 manually annotated causal relations.

Annotation Schema

The schema consists of two layers: span annotations mark causal components in text, and relation annotations link them into structured causal relations.

Spans

Each token or token sequence in a sentence receives at most one span label. There are two span categories.

Role marks the primary causal components:

  • Indicator — the lexical marker that projects the causal relation
    • e.g. verursachen ‘cause’, ist Ursache ‘is the cause’, weil ‘because’, stoppen ‘stop’
  • Entity — a cause or effect entity involved in the relation
    • e.g. Klimawandel ‘climate change’, Pestizide ‘pesticides’

Coefficient captures semantic modifiers:

  • Negation
    • Neutralizes or inverts the causal relation
    • nicht ‘not’, kein ‘no’, Verlust ‘loss’, Schwund ‘shrinkage’
  • Division
    • Lowers salience
    • auch ‘also’, unter anderem ‘among others’
  • Priority
    • Increases salience
    • vor allem ‘especially’, wichtig ‘important’

The three types above directly affect influence. The remaining coefficients are orthogonal to it: they either form constraints (e.g. Temporality) or specify epistemic dimensions (e.g. Uncertainty).

Object-based constraints

| Coefficient | Function | Examples |
|---|---|---|
| Temporality | Temporal framing | seit, bereits, künftig, Jahre |
| Spatiality | Spatial framing | global, lokal, weltweit, Deutschland |
| Quality | Qualitative modification | industrielle, parasitischer, schwefelhaltiger |
| Quantity | Quantitative modification | großes, fünftausend, zwölf |

Propositional-based constraints

| Coefficient | Function | Examples |
|---|---|---|
| Uncertainty | Modality and hedging | möglicherweise, könnte, vermutlich |
| Representation | Evidentiality marker | laut, sagt, sei |
| Representation Entity | Source/provenance | Studie, Bericht, Forscher |

Relations

Annotated spans are linked by directed relations:

  • Cause: Indicator → Entity (the entity is designated as a Cause)
  • Effect: Indicator → Entity (the entity fills the Effect role)
  • Constraint: Entity or Indicator → Coefficient (the coefficient modifies its governor)

Example

Consider the spans in this sentence:

Klimawandel und insbesondere Landwirtschaft verstärken möglicherweise das weltweite Artensterben.

Climate change and especially agriculture possibly intensify global species extinction.

Two coordinated entities (Klimawandel und Landwirtschaft) are annotated separately with parallel Cause relations.

The adverb insbesondere ‘especially’ is scoped to Landwirtschaft, raising its salience as Priority. The sentence adverb möglicherweise (Uncertainty) is parented to the indicator since it modifies the causal proposition.

Lastly, the adjective weltweit is encoded separately as a Spatiality coefficient, in accordance with the token minimization principle.

graph LR
    A["*verstärken*"] -->|Cause| B["*Klimawandel*"]
    A -->|Cause| C["*Landwirtschaft*"] -->|Priority| D["*insbesondere*"]
    A -->|Uncertainty| E["*möglicherweise*"]
    A -->|Effect| F["*Artensterben*"]
    F -->|Spatiality| G["*weltweit*"]
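
The graph above can also be held as a plain edge list. The following is a minimal sketch and not part of the project's tooling: the triple layout (governor, label, dependent) and the helper name `dependents` are assumptions made for illustration.

```python
# Edges of the example sentence as (governor, label, dependent) triples.
# The triple layout is an illustrative assumption, not the export format.
EDGES = [
    ("verstärken", "Cause", "Klimawandel"),
    ("verstärken", "Cause", "Landwirtschaft"),
    ("Landwirtschaft", "Priority", "insbesondere"),
    ("verstärken", "Uncertainty", "möglicherweise"),
    ("verstärken", "Effect", "Artensterben"),
    ("Artensterben", "Spatiality", "weltweit"),
]


def dependents(governor, label, edges=EDGES):
    """Return all spans attached to `governor` by an edge with the given label."""
    return [dep for gov, lab, dep in edges if gov == governor and lab == label]


causes = dependents("verstärken", "Cause")
effects = dependents("verstärken", "Effect")

# Coordinated causes are annotated separately, so each one pairs with the effect.
pairs = [(cause, effect) for cause in causes for effect in effects]
print(pairs)
```

Reading the coefficients off the same list (for example `dependents("Landwirtschaft", "Priority")`) shows which governor each modifier is scoped to.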

Annotation Principles

Four principles guide annotation decisions in ambiguous cases.

Minimal Principle

Only explicitly marked causal relations are annotated — no inference. When a lexically specific marker (e.g. verursacht) and a functional marker (e.g. durch) co-occur, only the lexically richer element is annotated as indicator. Functional prepositions and connectors are annotated as indicators only when no richer causal lexeme is present.

For light verb constructions, only the semantically loaded element is annotated (e.g. hat etwas mit X zu tun → Indicator: tun).

Token Minimization

Entities are reduced to their head token. Attributive modifiers are extracted as separate coefficients:

  • industrielle Landwirtschaft → Entity: Landwirtschaft, Coefficient: industrielle (Quality)
  • Einsatz von Pestiziden → Entity: Pestiziden
  • Exception: named entities (Europäische Union) and fixed multi-word expressions (saurer Regen) are annotated as single units.

This prevents proliferation of marginal entity variants and facilitates downstream aggregation.
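
As a rough illustration of the rule, the sketch below reduces a phrase to its head token and moves modifiers into coefficients. Annotators made these decisions by hand; the function `minimize`, its parameters, and the `FIXED_EXPRESSIONS` set are hypothetical, and the head token and modifier labels are assumed to be supplied rather than detected automatically.

```python
# Multi-word units kept whole; the two entries are the examples named above.
FIXED_EXPRESSIONS = {"Europäische Union", "saurer Regen"}


def minimize(phrase, head, modifiers=None):
    """Reduce an entity phrase to its head token plus extracted coefficients.

    phrase    -- surface phrase as it appears in the sentence
    head      -- head token chosen by the annotator
    modifiers -- mapping from modifier token to coefficient label
    """
    if phrase in FIXED_EXPRESSIONS:
        # Named entities and fixed expressions stay a single unit.
        return {"entity": phrase, "dependent_coefficients": []}
    coefficients = [
        {"coefficient_text": token, "coefficient": label}
        for token, label in (modifiers or {}).items()
    ]
    return {"entity": head, "dependent_coefficients": coefficients}


print(minimize("industrielle Landwirtschaft", "Landwirtschaft",
               {"industrielle": "Quality"}))
print(minimize("saurer Regen", "Regen"))
```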

Syntactic Proximity

When multiple potential entities compete for a role, the syntactically closest entity to the indicator is preferred, unless semantic considerations override this.

Coefficient Conservatism

Coefficients (other than Representation) are only annotated when they stand in a direct syntactic dependency relation with an indicator or entity.

Indicators

Indicators are the lexical or syntactic markers that project causal relations and establish (C, E, I) tuples. They fulfill two functions: projecting causal roles onto syntactic positions in their co-text, and encoding inherent information about polarity and salience.

The annotation corpus contains 642 distinct indicator forms, grouped into 192 indicator families by morphological and semantic criteria. A family subsumes all realizations of a shared lexical core. For example, the Ursache family includes, among others:

  • verursachen ‘cause’
  • ist Ursache ‘is the cause’
  • Ursache sein ‘to be the cause’
  • Teilursache sein ‘to be a partial cause’
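
For downstream aggregation, a form-to-family mapping can be kept as a simple lookup. This is a sketch only: the dictionary contains just the four forms listed above, and the names `FAMILY_OF` and `forms_by_family` are hypothetical.

```python
from collections import defaultdict

# Only the four forms listed above; the real Ursache family has more members.
FAMILY_OF = {
    "verursachen": "Ursache",
    "ist Ursache": "Ursache",
    "Ursache sein": "Ursache",
    "Teilursache sein": "Ursache",
}


def forms_by_family(mapping):
    """Invert a form -> family mapping into family -> sorted list of forms."""
    grouped = defaultdict(list)
    for form, family in mapping.items():
        grouped[family].append(form)
    return {family: sorted(forms) for family, forms in grouped.items()}


print(FAMILY_OF.get("Teilursache sein"))  # family of a known form
print(FAMILY_OF.get("weil"))              # None: form not in this toy mapping
print(forms_by_family(FAMILY_OF))
```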

Top 10 Indicator Families

| Family | Forms | Instances |
|---|---|---|
| Ursache (Cause) | 21 | 167 |
| Verantwortung (Responsibility) | 11 | 122 |
| Stoppen (Stop) | 3 | 111 |
| Gegen (Against) | 3 | 95 |
| Durch (Through) | 2 | 70 |
| Beitrag (Contribution) | 10 | 70 |
| Folge (Consequence) | 6 | 69 |
| Kampf (Fight) | 8 | 69 |
| Führen (Lead) | 9 | 59 |
| Grund (Reason) | 7 | 57 |

Some families are Cause-oriented, others are Effect-oriented. Compare:

Emission_C is a cause of pollution_E.
Pollution_E is a consequence of emission_C.

Orientation affects projection: Cause-oriented indicators generally project Cause onto the subject, while Effect-oriented indicators project Effect onto the subject. Modified indicator forms affect the projection according to their orientation.

Note the difference between Teilursache ‘partial cause’ and Teilwirkung ‘partial effect’: the former implies the existence of several causes, the latter the existence of several effects.

Polarity and Salience

Each indicator family carries a default polarity (promoting + or inhibiting -) and a default salience class:

| Family | Polarity | Default Salience | Discourse Function |
|---|---|---|---|
| Ursache (Cause) | + | [0.5-1] | Cause-oriented prototype |
| Folge (Consequence) | + | 1 | Effect-oriented |
| Beitrag (Contribution) | + | [0.5-0.75] | Distributional attribution |
| Durch (Through) | + | 1 | Grammaticalized preposition |
| Stoppen (Stop) | - | 1 | Intervention framing |
| Gegen (Against) | - | 1 | Variable condensation (Kampf/Protest gegen ‘fight/protest against’) |
| Reduzieren (Reduce) | - | [0.5-0.75] | Prototypical distributional negative |
| Wirkung (Effect) | ± | [0-1] | Widest range of influence (Gegenwirkung ‘counter-effect’, wirkungslos ‘ineffective’) |

Syntactic Projection Patterns

The syntactic realization of each indicator determines how Cause and Effect roles are projected:

| Family | Example | Cause Projection | Effect Projection |
|---|---|---|---|
| Ursache (Cause) | X ist Ursache für Y ‘x is the cause of y’ | Subject | PP (für ‘for’) |
| Folge (Consequence) | Y ist Folge von X ‘y is a consequence of x’ | PP (von ‘of’) | Subject |
| Verursachen (Cause) | X verursacht Y ‘x causes y’ | Subject | Accusative object |
| Beitragen (Contribution) | X trägt zu Y bei ‘x contributes to y’ | Subject | PP (zu ‘to’) |
| Stoppen (Stop) | X stoppt Y ‘x stops y’ | Subject | Accusative object |
| Durch (Through) | Y durch X ‘y through x’ | PP complement | Head noun |
| Gegen (Against) | X gegen Y ‘x against y’ | Subject | PP complement |
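
The projection patterns in the table above amount to a per-family lookup from syntactic slot to causal role. The sketch below illustrates that reading; the slot labels are the table's own wording, while the dictionary layout and the function `assign_roles` are assumptions, and the slots are taken as already identified (for instance by a parser or an annotator).

```python
# Projection patterns copied from the table above. Slot labels follow the
# table's wording; the dictionary layout itself is an illustrative assumption.
PROJECTION = {
    "Ursache":     {"Cause": "Subject",       "Effect": "PP (für)"},
    "Folge":       {"Cause": "PP (von)",      "Effect": "Subject"},
    "Verursachen": {"Cause": "Subject",       "Effect": "Accusative object"},
    "Beitragen":   {"Cause": "Subject",       "Effect": "PP (zu)"},
    "Stoppen":     {"Cause": "Subject",       "Effect": "Accusative object"},
    "Durch":       {"Cause": "PP complement", "Effect": "Head noun"},
    "Gegen":       {"Cause": "Subject",       "Effect": "PP complement"},
}


def assign_roles(family, slots):
    """Map filled syntactic slots to Cause/Effect for one indicator family.

    `slots` maps a slot label (e.g. "Subject") to the token found there.
    Roles whose slot is not filled are omitted from the result.
    """
    pattern = PROJECTION[family]
    return {role: slots[slot] for role, slot in pattern.items() if slot in slots}


# Effect-oriented: the subject is the Effect, the Cause sits in the von-PP.
print(assign_roles("Folge", {"Subject": "pollution", "PP (von)": "emission"}))
# Cause-oriented: the subject is the Cause.
print(assign_roles("Verursachen", {"Subject": "Pestizide",
                                   "Accusative object": "Insektensterben"}))
```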

Context Markers

Context markers modify and contextualize the causal relations projected by indicators. Unlike indicators, they are not lexically specialized for causality but operate at the predication or proposition level. Three structural marker types directly affect the INFLUENCE computation:

Division

Division markers signal polycausal structures with implicit co-causes. Typical realizations include unter anderem ‘among other things’, auch ‘also’, ebenfalls ‘likewise’, and the composite nicht nur ‘not only’.

Division markers reduce the salience to |I| = 0.5 regardless of the indicator’s default.

Priority

Priority markers establish asymmetric weighting within polycausal sets: vor allem ‘above all’, hauptsächlich ‘mainly’, maßgeblich ‘significantly’. They set |I| = 0.75 for the prioritized cause.

Both marker types affect only the salience (|I|), leaving polarity (±) unchanged.
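
The two salience rules can be sketched as follows. This is a simplification under stated assumptions: salience is held as a single number rather than a range, at most one marker applies per cause, and the function name `adjusted_salience` is hypothetical. How the markers interact with the rest of the salience cascade is not covered here.

```python
def adjusted_salience(default_salience, marker=None):
    """Return |I| after applying at most one salience marker.

    default_salience -- the indicator family's default |I| (a single number here)
    marker           -- "Division", "Priority", or None
    """
    if marker == "Division":
        return 0.5   # fixed, regardless of the indicator's default
    if marker == "Priority":
        return 0.75  # applies to the prioritized cause
    return default_salience


print(adjusted_salience(1.0, "Division"))
print(adjusted_salience(1.0, "Priority"))
print(adjusted_salience(1.0))
```

Because only |I| changes, the sign of the indicator's polarity has to be applied separately.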

Negation

Negation is the structurally most influential marker. Two types are distinguished:

Object-based negation operates at the entity level through markers like Verlust ‘loss’, Schwund ‘decline’, Rückgang ‘decrease’. These invert the polarity when the count of negations is odd:

Verlust von Lebensräumen verursacht Bienensterben.
Loss of habitat causes bee decline.

Indicator verursachen ‘cause’: default I = 1;
object negation on Cause (Verlust ‘loss’) inverts polarity: I = -1.

Propositional negation operates at the relation level (nicht, kein) and neutralizes the entire causal relationship (I = 0):

Pestizide verursachen nicht Insektensterben.
Pesticides don’t cause insect decline.

Indicator verursachen ‘cause’: default I = 1
Propositional negation nicht ‘not’ neutralizes: I = 0.
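
A minimal sketch of the two negation rules above, assuming polarity is held as +1 or -1 and salience as a single number in [0, 1]. The function `influence` and its parameters are hypothetical, and letting propositional negation override object-based negation is an assumption; the full cascade is specified in Tuple Construction.

```python
def influence(polarity, salience, object_negations=0, propositional_negation=False):
    """Combine polarity, salience and negation markers into a signed value.

    polarity               -- +1 (promoting) or -1 (inhibiting)
    salience               -- |I| before negation, between 0 and 1
    object_negations       -- number of object-based negation markers
    propositional_negation -- True if the relation itself is negated
    """
    if propositional_negation:
        # nicht / kein neutralize the whole relation.
        return 0.0
    # Object-based negation flips the sign only for an odd count.
    sign = -polarity if object_negations % 2 == 1 else polarity
    return sign * salience


# Verlust von Lebensräumen verursacht Bienensterben.
print(influence(+1, 1.0, object_negations=1))
# Pestizide verursachen nicht Insektensterben.
print(influence(+1, 1.0, propositional_negation=True))
```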

From Annotations to INFLUENCE

The annotations documented above (indicators, entities, and context markers) are the inputs to a deterministic computation that produces the final INFLUENCE value I ∈ [-1, +1].

In brief: entity identification follows the indicator’s syntactic projection pattern, polarity is determined by indicator class and negation markers, and salience is computed through a cascading hierarchy of morphological, determiner, and syntactic markers.

For the full formal specification — including the cascade rules, coordination normalization, and worked examples — see Tuple Construction.

Data Format

Annotated data is exported as JSON. Each entry represents a sentence with its metadata and extracted relations.

Schema

{
  "subfolder": "Artensterben_oa",
  "global_sentence_id": 1148482,
  "text_id": "FAZ_200204_384209",
  "text_date": "2002-04",
  "sentence_id": "12",
  "sentence": "...",
  "relations": [
    {
      "indicator": "Folge",
      "entities": [
        {
          "entity": "Kleinplanet",
          "relation": "Cause",
          "dependent_coefficients": [
            {"coefficient_text": "Jahren", "coefficient": "Temporality"},
            {"coefficient_text": "Mexikos", "coefficient": "Spatiality"}
          ]
        },
        {
          "entity": "Artensterben",
          "relation": "Effect"
        }
      ],
      "coefficient": "Division",
      "dependent_coefficients": [
        {"coefficient_text": "könnte", "coefficient": "Uncertainty"}
      ],
      "representation": "berieten",
      "representation_entities": [
        {
          "entity": "Teilnehmer",
          "relation": "Constraint",
          "dependent_coefficients": [
            {"coefficient_text": "fünftausend", "coefficient": "Quantity"}
          ]
        }
      ]
    }
  ]
}

Field Reference

Sentence-level fields:

| Field | Description |
|---|---|
| subfolder | WABI subcorpus and syntactic position (e.g. Artensterben_oa = accusative object) |
| global_sentence_id | Unique sentence identifier across the full corpus |
| text_id | Source document identifier (format: SOURCE_YYYYMM_ID) |
| text_date | Publication date (YYYY-MM) |
| sentence | Full sentence text |
| relations | Array of causal relations found in this sentence (empty if none) |

Relation-level fields:

| Field | Description |
|---|---|
| indicator | The causal indicator lexeme |
| entities | Array of entities with their causal role (Cause or Effect) |
| coefficient | Structural marker on the relation level (Negation, Division) |
| dependent_coefficients | Contextual coefficients attached to the indicator |
| representation | Evidentiality/speech-act verb (if present) |
| representation_entities | Source entities for reported speech |

Entity-level fields:

| Field | Description |
|---|---|
| entity | Head token of the entity (token-minimized) |
| relation | Causal role: Cause, Effect, or Constraint |
| dependent_coefficients | Array of coefficients modifying this entity |
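
The fields above can be read with the standard `json` module. The sketch below pulls (cause, effect, indicator) triples out of each entry. It assumes the export is a top-level JSON list of sentence entries, which this page does not state, and the function name `cause_effect_triples` is hypothetical. The sample is trimmed from the schema example and is embedded inline so the snippet runs on its own.

```python
import json

# Two trimmed entries in the documented shape. Treating the export as a
# top-level JSON list of sentence entries is an assumption.
SAMPLE = """
[
  {"sentence": "...", "relations": [
    {"indicator": "Folge", "entities": [
      {"entity": "Kleinplanet", "relation": "Cause"},
      {"entity": "Artensterben", "relation": "Effect"}
    ]}
  ]},
  {"sentence": "...", "relations": []}
]
"""


def cause_effect_triples(entry):
    """Return (cause, effect, indicator) triples for one sentence entry."""
    triples = []
    for relation in entry.get("relations", []):
        entities = relation.get("entities", [])
        causes = [e["entity"] for e in entities if e.get("relation") == "Cause"]
        effects = [e["entity"] for e in entities if e.get("relation") == "Effect"]
        triples.extend((c, e, relation["indicator"]) for c in causes for e in effects)
    return triples


entries = json.loads(SAMPLE)
for entry in entries:
    # Entries with an empty relations array simply yield no triples.
    print(cause_effect_triples(entry))
```

The third element here is the indicator lexeme, not the numeric INFLUENCE value; computing that value requires the rules described in Tuple Construction.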

Sentences Without Relations

Sentences where no explicit causal relation was identified have an empty relations array. These are not noise — they were reviewed during annotation and determined to contain no explicit causal markers per the minimal principle.

Corpus Statistics

| Statistic | Value |
|---|---|
| Total sentences | 4,753 |
| Sentences with ≥1 WABI relation | 1,797 (37.8%) |
| Total WABI-relevant relations | 1,867 |
| Distinct indicator forms | 642 |
| Indicator families | 192 |
| Mean relations per sentence | 0.39 |

Per WABI term:

| Term | Sentences | Relations | Rel./Sent. |
|---|---|---|---|
| Waldsterben | 1,818 | 633 | 0.35 |
| Artensterben | 1,854 | 744 | 0.40 |
| Bienensterben | 536 | 257 | 0.48 |
| Insektensterben | 545 | 233 | 0.43 |
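
The per-term figures add up to the corpus totals, which can be checked with a few lines of arithmetic. The counts below are copied from the tables above; the variable names are illustrative only.

```python
# (sentences, relations) per WABI term, copied from the table above.
PER_TERM = {
    "Waldsterben": (1818, 633),
    "Artensterben": (1854, 744),
    "Bienensterben": (536, 257),
    "Insektensterben": (545, 233),
}

total_sentences = sum(sentences for sentences, _ in PER_TERM.values())
total_relations = sum(relations for _, relations in PER_TERM.values())

# Relations per sentence, rounded to two decimals as in the table.
ratios = {term: round(r / s, 2) for term, (s, r) in PER_TERM.items()}
mean_relations = round(total_relations / total_sentences, 2)

# Share of sentences with at least one relation (1,797 of all sentences).
share_with_relation = round(100 * 1797 / total_sentences, 1)

print(total_sentences, total_relations, mean_relations, share_with_relation)
print(ratios)
```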

Further Reading