Primers • Inter-Annotator Agreement
- Overview
- Classical Metrics for Inter-Annotator Agreement
- Bridging: Why Classical Metrics Fall Short for Distributional Annotations
- Distributional Agreement Metrics
- Practical Considerations for Inter-Annotator Agreement
- Mathematical Relationships Between TV, KL, and JS
- Putting It All Together: A Workflow for Measuring IAA
- Appendix: Summary of Inter-Annotator Agreement Metrics
- Further Reading
- References
- Citation
Overview
- Inter-annotator agreement (IAA) is a fundamental concept in annotation-driven research areas such as natural language processing (NLP), computer vision, medical coding, and social sciences. The goal of IAA is to quantify the degree to which multiple annotators, given the same task and guidelines, produce consistent outputs.
- High agreement suggests that the task is well-defined and that the annotations can be trusted as ground truth; low agreement often indicates ambiguity in the task or poor guideline design.
Why Measure Inter-Annotator Agreement?
- Reliability of Data: Annotation tasks often serve as the foundation for training and evaluating supervised machine learning models. If annotations are unreliable, models trained on them will inherit that noise.
- Task Clarity: Low agreement can highlight that annotation guidelines are ambiguous, incomplete, or too subjective.
- Annotator Quality: Agreement measures can help detect inconsistent annotators or biases.
- Scientific Rigor: In empirical research, IAA serves as evidence that reported findings are reproducible and not merely artifacts of annotator idiosyncrasies.
Types of Data for Agreement
- Different data types call for different agreement metrics:
- Categorical Labels (Nominal Data):
- Examples: sentiment classification (positive/neutral/negative), medical diagnosis codes.
- Agreement here involves checking whether annotators choose the same category.
- Ordinal Labels:
- Examples: rating scales (1–5 stars, severity levels).
- Agreement must respect the fact that categories have an inherent order.
- Continuous Annotations:
- Examples: bounding box coordinates in images, reaction times, or scores between 0 and 1.
- Agreement is often measured via correlation or distance metrics.
- Structured Outputs:
- Examples: parse trees, dialogue act sequences, or entity spans.
- Agreement requires specialized metrics that account for structured predictions.
- Distributions:
- In some tasks, annotators are asked not to provide a single label, but a probability distribution over possible labels. This reflects uncertainty or subjectivity.
- Example: In emotion annotation, one annotator may assign 0.6 probability to “joy,” 0.3 to “surprise,” and 0.1 to “neutral,” while another annotator may spread probabilities differently.
- Agreement is then measured using distributional divergences such as Total Variation distance (TV distance), Kullback–Leibler divergence (KL), or Jensen–Shannon divergence (JS).
- For two discrete distributions \(P\) and \(Q\) over a label space \(\mathcal{X}\):
- TV distance:
\[d_{TV}(P, Q) = \frac{1}{2} \sum_{x \in \mathcal{X}} | P(x) - Q(x) |\]
- KL divergence:
\[D_{KL}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}\]
- Jensen–Shannon divergence:
\[D_{JS}(P \| Q) = \frac{1}{2} D_{KL}\!\left(P \,\middle\|\, M\right) + \frac{1}{2} D_{KL}\!\left(Q \,\middle\|\, M\right), \quad M = \frac{1}{2}(P + Q)\]
- These measures provide a graded notion of disagreement rather than a binary match/mismatch.
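To make these definitions concrete, here is a minimal NumPy sketch (not from the original primer) that computes all three divergences for a hypothetical pair of annotator distributions over (joy, surprise, neutral), mirroring the emotion example above:

```python
# A minimal sketch of TV, KL, and JS; the two annotator distributions are hypothetical.
import numpy as np

def tv_distance(p, q):
    """Total Variation distance: half the L1 difference between the distributions."""
    return 0.5 * np.abs(p - q).sum()

def kl_divergence(p, q):
    """KL divergence D_KL(P || Q); assumes q > 0 wherever p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js_divergence(p, q):
    """Jensen-Shannon divergence via the mixture M = (P + Q) / 2."""
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical annotator distributions over (joy, surprise, neutral).
p = np.array([0.6, 0.3, 0.1])
q = np.array([0.4, 0.2, 0.4])
print(tv_distance(p, q), kl_divergence(p, q), js_divergence(p, q))
```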
Classical Metrics for Inter-Annotator Agreement
- Classical agreement metrics focus on categorical and ordinal labels, where the annotators assign one label per instance. They adjust for chance agreement and provide interpretable scales of reliability.
Cohen’s Kappa (\(\kappa\))
- Definition: For two annotators, Cohen’s kappa measures agreement while correcting for chance.
- Formula:
- Let:
- \(p_o\): observed proportion of agreement
- \(p_e\): expected agreement under independence
- Then:
\[\kappa = \frac{p_o - p_e}{1 - p_e}\]
- Suitable for: Categorical (nominal) labels.
- Use-case: Two medical experts diagnosing patients into disease categories (yes/no cancer).
- Pros:
- Corrects for chance agreement.
- Easy to interpret (\(\kappa = 1\): perfect agreement, \(\kappa = 0\): chance-level).
- Cons:
- Only supports two annotators.
- Sensitive to class imbalance (rare categories can distort \(\kappa\)).
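As a quick illustration, the following sketch computes Cohen’s \(\kappa\) with scikit-learn’s `cohen_kappa_score`; the two label lists are hypothetical diagnoses for five patients:

```python
# A minimal sketch, assuming scikit-learn is available; the label lists are hypothetical.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["cancer", "no_cancer", "no_cancer", "cancer", "no_cancer"]
annotator_b = ["cancer", "no_cancer", "cancer", "cancer", "no_cancer"]

# weights="linear" or "quadratic" gives weighted kappa for ordinal labels.
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.3f}")
```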
Fleiss’ Kappa (\(\kappa\))
- Definition: Generalization of Cohen’s kappa for more than two annotators. Agreement is measured by comparing observed vs. expected proportions across all annotators.
- Formula:
- For category \(j\):
\[P_j = \frac{1}{Nn} \sum_{i=1}^N n_{ij}\]
- where \(n_{ij}\) is the number of annotators assigning category \(j\) to item \(i\), \(N\) is the number of items, and \(n\) is the number of annotators per item.
- Then compute per-item agreement and average across items.
- Suitable for: Categorical (nominal) labels with multiple annotators.
- Use-case: A crowd-sourced sentiment task with 10 annotators per review.
- Pros:
- Extends Cohen’s \(\kappa\) to many annotators.
- Still adjusts for chance agreement.
- Cons:
- Assumes annotators are exchangeable.
- Same sensitivity to imbalance issues as Cohen’s \(\kappa\).
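For reference, here is a minimal NumPy sketch of Fleiss’ \(\kappa\) computed directly from an item \(\times\) category count matrix; the counts below are hypothetical, and libraries such as statsmodels also provide an implementation:

```python
# A minimal sketch of Fleiss' kappa from an item x category count matrix
# (counts are hypothetical; assumes every item was rated by the same n annotators).
import numpy as np

def fleiss_kappa(counts):
    counts = np.asarray(counts, dtype=float)    # shape: (N items, K categories)
    N, _ = counts.shape
    n = counts[0].sum()                         # annotators per item
    p_j = counts.sum(axis=0) / (N * n)          # category proportions P_j
    P_i = (np.sum(counts ** 2, axis=1) - n) / (n * (n - 1))  # per-item agreement
    P_bar, P_e = P_i.mean(), np.sum(p_j ** 2)
    return (P_bar - P_e) / (1 - P_e)

# 4 reviews, 3 sentiment categories (neg/neu/pos), 10 annotators per review.
table = [[8, 1, 1],
         [2, 6, 2],
         [0, 1, 9],
         [3, 3, 4]]
print(f"Fleiss' kappa: {fleiss_kappa(table):.3f}")
```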
Krippendorff’s Alpha (\(\alpha\))
- Definition: A versatile reliability coefficient that generalizes across data types.
- Formula:
\[\alpha = 1 - \frac{D_o}{D_e}\]
- where
- \(D_o\) = observed disagreement
- \(D_e\) = expected disagreement
- The definition of disagreement depends on the data type.
- Suitable for: Nominal, ordinal, interval, and ratio data.
- Use-case: Annotators rating severity of patient symptoms on a 1–5 scale.
- Pros:
- Works with any number of annotators.
- Can handle missing data.
- Supports various data types beyond categorical.
- Cons:
- Computationally heavier than \(\kappa\).
- Requires defining distance functions for non-categorical data.
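A minimal sketch, assuming the third-party `krippendorff` package (`pip install krippendorff`); the ratings below are hypothetical 1–5 severity scores, with `np.nan` marking missing annotations:

```python
# A minimal sketch, assuming the third-party `krippendorff` package;
# rows = annotators, columns = items, np.nan = item not rated by that annotator.
import numpy as np
import krippendorff

reliability_data = np.array([
    [1,      2, 3, 3, 2, np.nan],
    [1,      2, 3, 4, 2, 5],
    [np.nan, 3, 3, 3, 2, 4],
])

alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="ordinal")
print(f"Krippendorff's alpha: {alpha:.3f}")
```

Changing `level_of_measurement` (e.g., to "nominal", "interval", or "ratio") swaps the distance function used for the observed and expected disagreement terms.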
Scott’s Pi (\(\pi\))
- Definition: Similar to Cohen’s \(\kappa\), but expected agreement is computed from the pooled distribution of both annotators’ labels rather than from each annotator’s own marginals.
- Formula:
\[\pi = \frac{p_o - p_e}{1 - p_e}, \quad p_e = \sum_j p_j^2\]
- where \(p_j\) is the proportion of all assignments (pooled across both annotators) given to category \(j\).
- Suitable for: Categorical data, two annotators.
- Use-case: Two coders labeling political statements as left/right/neutral.
- Pros:
- Simple and interpretable.
- Historically important precursor to \(\kappa\).
- Cons:
- Assumes annotators share the same distribution.
- Less robust in practice than \(\kappa\).
Correlation-Based Measures
- Definition: Correlation-based measures capture the strength and direction of relationships between continuous or ordinal annotations. Common examples include Pearson’s \(r\) and Spearman’s \(\rho\).
- Formula:
- For Pearson’s correlation:
\[r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}\]
- Spearman’s \(\rho\) is the same correlation computed on rank-transformed values.
- Suitable for: Interval/ratio data, ordinal data.
- Use-case: Annotators assigning continuous emotion intensity scores (0–1).
- Pros:
- Handles continuous scales.
- Simple and well understood.
- Cons:
- Sensitive to outliers.
- Only captures linear (Pearson) or monotonic (Spearman) relations.
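A minimal sketch using SciPy’s `pearsonr` and `spearmanr` on hypothetical emotion-intensity scores from two annotators:

```python
# A minimal sketch, assuming scipy; the emotion-intensity scores are hypothetical.
from scipy.stats import pearsonr, spearmanr

annotator_a = [0.10, 0.40, 0.35, 0.80, 0.95]
annotator_b = [0.15, 0.30, 0.45, 0.70, 0.90]

r, _ = pearsonr(annotator_a, annotator_b)      # linear association
rho, _ = spearmanr(annotator_a, annotator_b)   # monotonic (rank) association
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```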
Comparative Analysis
Metric | Data Type | Use-case Example | Pros | Cons |
---|---|---|---|---|
Cohen’s \(\kappa\) | Categorical (2) | 2 doctors diagnosing disease | Adjusts for chance, easy to interpret | Only 2 annotators, imbalance issues |
Fleiss’ \(\kappa\) | Categorical (many) | Crowd sentiment annotation | Multi-annotator extension of \(\kappa\) | Assumes annotator interchangeability |
Krippendorff’s \(\alpha\) | Nominal \(\rightarrow\) ratio | Symptom severity ratings | Versatile, missing data tolerant | More complex computation |
Scott’s \(\pi\) | Categorical (2) | Political statement coding | Simple, historic | Unrealistic assumptions |
Correlation (\(r, \rho\)) | Continuous / ordinal | Emotion intensity scores | Works for continuous data | Sensitive to outliers |
Bridging: Why Classical Metrics Fall Short for Distributional Annotations
- Most classical inter-annotator agreement (IAA) metrics—such as Cohen’s \(\kappa\), Fleiss’ \(\kappa\), and Krippendorff’s \(\alpha\)—are designed under the assumption that annotators provide a single discrete label per item. However, in many modern annotation settings, this assumption no longer holds.
Emergence of Distributional Annotations
- With tasks involving uncertainty, subjectivity, or ambiguity, annotators are increasingly asked to provide a distribution over labels instead of a hard decision. For example:
- Emotion annotation: Annotators distribute probabilities across emotions (joy, sadness, fear).
- Topic labeling: Annotators may indicate that a document is 70% politics and 30% economics.
- Crowdsourcing: Aggregate behavior of many annotators often yields empirical distributions rather than a single consensus label.
- This reflects a richer representation of annotator uncertainty and disagreement.
Limitations of Classical Metrics
- Binary vs. graded disagreement:
- Cohen’s \(\kappa\) and similar metrics count labels as either “agree” or “disagree.” They cannot capture the degree of overlap between distributions.
- Information loss:
- Reducing probability distributions to single labels (e.g., by taking the \(\arg\max\)) discards annotator uncertainty and masks subtler disagreements.
- Incompatibility with probabilistic annotations:
- Metrics like \(\kappa\) assume categorical variables, not vectors in the probability simplex \(\Delta^K = \{\, p \in \mathbb{R}^K \mid p_i \geq 0, \; \sum_{i=1}^K p_i = 1 \,\}\).
- Thus, new tools are required to measure distributional agreement, which operate directly on the probability distributions annotators provide.
Transition to Divergence-Based Measures
- Instead of measuring categorical agreement, distributional approaches compute distances or divergences between probability distributions. These metrics allow us to say not only whether two annotators disagreed, but how far apart their probability distributions are.
- This motivates distributional agreement metrics such as:
- Total Variation (TV) Distance: measures the maximum difference in probabilities across categories.
- Kullback–Leibler (KL) Divergence: measures information loss when one distribution approximates another.
- Jensen–Shannon (JS) Divergence: symmetrized and smoothed version of KL, often more stable.
Distributional Agreement Metrics
- When annotators provide probability distributions instead of single labels, we need measures that compare two distributions directly. These are often referred to as divergences or distances on the probability simplex.
Total Variation (TV) Distance
- Definition: For two discrete distributions \(P\) and \(Q\) over a label set \(\mathcal{X}\), the Total Variation distance quantifies the largest possible difference between the probabilities assigned by \(P\) and \(Q\) to the same event. It measures how much probability mass must be shifted to make one distribution match the other. It is a symmetric, bounded, and interpretable measure of dissimilarity between probability distributions.
- Formula:
\[d_{TV}(P, Q) = \frac{1}{2} \sum_{x \in \mathcal{X}} | P(x) - Q(x) |\]
- Intuition: Represents the largest possible difference in probability mass that the two annotators assign to the same event.
- Range: \([0, 1]\)
- 0 means identical distributions.
- 1 means completely disjoint support.
- Suitable for: Any distributional annotations.
- Use-case: Comparing how two annotators distribute probability across emotions for a sentence.
- Pros:
- Symmetric and interpretable.
- Metric (satisfies the triangle inequality).
- Bounded in \([0, 1]\).
- Cons:
- Ignores information-theoretic aspects (only measures absolute differences).
- May be too coarse in high-dimensional label spaces.
Kullback–Leibler (KL) Divergence
- Definition: The Kullback–Leibler divergence quantifies how one probability distribution \(Q\) diverges from another distribution \(P\). It measures the expected number of extra bits required to encode samples from \(P\) using a code optimized for \(Q\). It is an asymmetric and unbounded measure rooted in information theory.
- Formula:
\[D_{KL}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}\]
- Intuition: Measures how inefficient it is to encode samples from \(P\) using a code optimized for \(Q\).
- Range: \([0, \infty)\); zero if and only if \(P = Q\).
- Suitable for: Distributional annotations where one distribution can be treated as “true” and the other as an approximation.
- Use-case: Evaluating how much information is lost if one annotator’s probability distribution is used to approximate another’s.
- Pros:
- Information-theoretic interpretation.
- Sensitive to differences in rare events.
- Cons:
- Asymmetric: \(D_{KL}(P \| Q) \neq D_{KL}(Q \| P)\).
- Undefined if \(Q(x) = 0\) while \(P(x) > 0\).
- Harder to interpret numerically compared to bounded metrics.
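The zero-probability failure mode above is easy to reproduce; the sketch below (using SciPy’s `entropy`, which computes \(D_{KL}\) when given two distributions) shows it along with a simple \(\epsilon\)-smoothing workaround. The distributions and the choice of \(\epsilon\) are illustrative assumptions:

```python
# A minimal sketch of the zero-probability issue and an epsilon-smoothing fix;
# scipy.stats.entropy(p, q) computes D_KL(P || Q); the distributions are hypothetical.
import numpy as np
from scipy.stats import entropy

p = np.array([0.7, 0.3, 0.0])
q = np.array([0.5, 0.0, 0.5])   # q[1] = 0 while p[1] > 0 -> D_KL(P || Q) = inf

print(entropy(p, q))            # inf

eps = 1e-6                      # smooth and renormalize before comparing
p_s = (p + eps) / (p + eps).sum()
q_s = (q + eps) / (q + eps).sum()
print(entropy(p_s, q_s))        # finite, but sensitive to the choice of eps
```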
Jensen–Shannon (JS) Divergence
- Definition: The Jensen–Shannon divergence is a symmetrized and smoothed version of the KL divergence. It measures the average divergence of each distribution from their mean distribution \(M = \frac{1}{2}(P + Q)\). It is always finite, symmetric, and more stable than KL divergence.
- Formula:
\[D_{JS}(P \| Q) = \frac{1}{2} D_{KL}\!\left(P \,\middle\|\, M\right) + \frac{1}{2} D_{KL}\!\left(Q \,\middle\|\, M\right), \quad M = \frac{1}{2}(P + Q)\]
- Intuition: Measures the average divergence of each distribution from their mean.
- Range: \([0, \log 2]\) (often normalized to \([0,1]\)).
- Suitable for: Any pair of probability distributions, especially when symmetry and stability are needed.
- Use-case: Comparing how two annotators spread probability mass across multiple labels, while avoiding undefined cases.
- Pros:
- Symmetric and always finite.
- The square root of JS divergence is a metric.
- More stable in practice than KL.
- Cons:
- Less interpretable than TV distance.
- Can still be sensitive to smoothing choices.
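A minimal sketch, assuming SciPy: `scipy.spatial.distance.jensenshannon` returns the JS distance (the square root of the divergence), so it is squared below to recover \(D_{JS}\); the two distributions are hypothetical:

```python
# A minimal sketch: jensenshannon returns the JS *distance* (sqrt of the divergence),
# so we square it; the two annotator distributions are hypothetical.
import numpy as np
from scipy.spatial.distance import jensenshannon

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.1, 0.3, 0.6])

js_div = jensenshannon(p, q) ** 2                # natural-log base: range [0, log 2]
js_div_norm = jensenshannon(p, q, base=2) ** 2   # base-2 log: range [0, 1]
print(f"JS divergence: {js_div:.4f} (nats), {js_div_norm:.4f} (normalized)")
```

Passing `base=2` is a convenient way to get the normalized \([0, 1]\) range mentioned above.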
Comparison of TV, KL, and JS
Metric | Symmetric? | Range | Interpretability | Sensitivity | Use-case example |
---|---|---|---|---|---|
TV Distance | Yes | \([0,1]\) | Very interpretable (max diff) | Treats all categories equally | Emotion distributions |
KL Divergence | No | \([0,\infty)\) | Info-theoretic, less intuitive | Sensitive to rare events | Approximation error |
JS Divergence | Yes | \([0,\log 2]\) | Balanced, bounded, metric \((\sqrt{\text{JS}})\) | Smoothed, avoids infinities | General distributional IAA |
Practical Considerations for Inter-Annotator Agreement
- IAA analysis requires more than just selecting a formula — it involves understanding the data type, annotation context, and interpretability needs. This section provides guidance on how to choose metrics, what to watch out for, and how to interpret agreement scores.
Choosing a Metric by Data Type
Data Type | Recommended Metrics | Notes |
---|---|---|
Categorical (nominal) | Cohen’s \(\kappa\) (2 annotators), Fleiss’ \(\kappa\) (many), Krippendorff’s \(\alpha\) | Must check for class imbalance effects |
Ordinal | Weighted Cohen’s \(\kappa\), Krippendorff’s \(\alpha\), Spearman’s \(\rho\) | Use distance-based weighting to respect ordering |
Continuous | Pearson’s \(r\), Intraclass Correlation (ICC), Krippendorff’s \(\alpha\) | Handle outliers carefully, scale-sensitive |
Structured outputs | Task-specific metrics (e.g., overlap F1, span-based agreement) | Define what counts as “match” structurally |
Distributions | TV distance, JS divergence, KL divergence | Do not collapse to argmax labels; keep full distributions |
- Rule of thumb:
- If annotators give single discrete labels \(\rightarrow\) use chance-corrected categorical metrics.
- If annotators give scores or ranks \(\rightarrow\) use correlation- or distance-based measures.
- If annotators give full probability distributions \(\rightarrow\) use divergence measures.
Interpreting Agreement Levels
- There is no absolute scale, but a commonly used heuristic (adapted from Landis and Koch’s (1977) interpretation for Cohen’s \(\kappa\)) is:
Agreement value | Interpretation |
---|---|
\(< 0.0\) | Worse than chance |
\(0.0–0.20\) | Slight agreement |
\(0.21–0.40\) | Fair agreement |
\(0.41–0.60\) | Moderate agreement |
\(0.61–0.80\) | Substantial agreement |
\(0.81–1.00\) | Almost perfect |
- For divergence metrics (TV, KL, JS), lower values mean closer distributions. Typical observed ranges:
- TV distance: < 0.1 \(\rightarrow\) very similar; > 0.3 \(\rightarrow\) strong disagreement
- JS divergence: < 0.05 \(\rightarrow\) close; > 0.2 \(\rightarrow\) widely different
- KL divergence: highly variable; compare relative changes, not absolute cutoffs.
Handling Annotator Bias and Class Imbalance
- Class imbalance can inflate or deflate κ-like metrics. Consider reporting class distributions alongside agreement.
- Annotator bias (systematic skew) can lower κ even if raw agreement is high.
- Consider using confusion matrices to inspect which categories cause disagreement.
Missing Data and Sparse Annotations
- Krippendorff’s \(\alpha\) is robust to missing annotations and is the safest choice for incomplete data.
- For divergence-based measures, ensure smoothing (e.g., add a small \(\epsilon\) to every category and renormalize) to avoid zeros that break KL.
Computational Considerations
- \(\kappa\)-type metrics are computationally cheap (matrix counts).
- Krippendorff’s \(\alpha\) is \(O(N \times A^2)\) for \(N\) items and \(A\) annotators — still feasible but heavier.
- Divergence-based metrics are \(O(K)\) per pair of distributions, where \(K\) is the number of categories.
- If annotator sets are large, prefer efficient pairwise sampling strategies or aggregate distributions.
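As an illustration of the pairwise-sampling idea, here is a minimal sketch (the function name, array shapes, and the `max_pairs` cap are all hypothetical) that averages pairwise JS divergence over a random subsample of annotator pairs for one item:

```python
# A minimal sketch of averaging pairwise JS divergence over a random sample of
# annotator pairs (names and shapes are hypothetical).
import itertools
import random
import numpy as np
from scipy.spatial.distance import jensenshannon

def mean_pairwise_js(dists, max_pairs=1000, seed=0):
    """dists: array of shape (num_annotators, num_categories) for one item."""
    pairs = list(itertools.combinations(range(len(dists)), 2))
    random.Random(seed).shuffle(pairs)
    pairs = pairs[:max_pairs]                 # subsample pairs when annotator sets are large
    return float(np.mean([jensenshannon(dists[i], dists[j]) ** 2 for i, j in pairs]))

dists = np.random.default_rng(0).dirichlet(np.ones(4), size=20)  # 20 annotators, 4 labels
print(mean_pairwise_js(dists))
```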
Mathematical Relationships Between TV, KL, and JS
- While Total Variation Distance, Kullback–Leibler divergence, and Jensen–Shannon divergence measure different aspects of distributional difference, they are connected through known inequalities. Understanding these links helps interpret and compare their values meaningfully.
Pinsker’s Inequality (KL vs. TV)
- Pinsker’s inequality provides an upper bound on TV distance in terms of KL divergence:
\[d_{TV}(P, Q) \le \sqrt{\tfrac{1}{2} D_{KL}(P \| Q)}\]
- This means:
- If KL divergence is small, then TV must also be small, i.e., the distributions are close in absolute terms.
- However, small TV does not guarantee small KL (KL can blow up when \(Q(x) \approx 0\)).
Implication:
- KL is more sensitive to low-probability mismatches than TV.
Lower Bound on KL via TV
- Rearranging Pinsker’s inequality gives a lower bound on KL in terms of TV:
\[D_{KL}(P \| Q) \ge 2\, d_{TV}(P, Q)^2\]
- This shows that large TV implies large KL, pinning down how the two measures grow together.
JS Divergence Related to KL and TV
- JS divergence is defined as:
\[D_{JS}(P \| Q) = \frac{1}{2}D_{KL}(P \| M) + \frac{1}{2}D_{KL}(Q \| M), \quad M = \frac{1}{2}(P+Q)\]
- and satisfies \(0 \le D_{JS}(P \| Q) \le \log 2\).
- It inherits KL’s information-theoretic basis while being symmetric and bounded.
- Also, it relates to TV (via Pinsker’s inequality applied to each KL term) as:
\[\tfrac{1}{2}\, d_{TV}(P, Q)^2 \le D_{JS}(P \| Q)\]
- So:
- JS grows at least as fast as \(d_{TV}^2\).
- JS is upper-bounded, while KL is unbounded.
- JS is often preferred for interpretability and numerical stability.
Comparative Analysis of Theoretical Relationships
Pair | Inequality | Interpretation |
---|---|---|
TV vs. KL | \(d_{TV} \le \sqrt{\tfrac{1}{2} D_{KL}}\) | Small KL implies small TV |
TV vs. KL | \(2 d_{TV}^2 \le D_{KL}\) | Large TV implies large KL |
TV vs. JS | \(\tfrac{1}{2} d_{TV}^2 \le D_{JS}\) | Large TV implies large JS |
JS vs. KL | \(D_{JS} \le D_{KL}\) (when \(P\) and \(Q\) share support) | JS is a smoothed and bounded version of KL |
- Key takeaways:
- TV gives an absolute probability difference.
- KL gives a relative (log-based) penalty, very sensitive to rare events.
- JS sits between them: symmetric, smoothed, and bounded, making it ideal for practical agreement comparisons.
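The inequalities above can be spot-checked numerically. The following sketch (an illustrative check, not part of the original primer) samples random distribution pairs and asserts Pinsker’s inequality, the TV–JS bound, and the boundedness of JS, using natural-log KL/JS to match the \([0, \log 2]\) range used here:

```python
# A minimal sketch that numerically spot-checks the inequalities above on random
# distribution pairs (assumes numpy; natural-log KL/JS).
import numpy as np

rng = np.random.default_rng(0)

def tv(p, q):
    return 0.5 * np.abs(p - q).sum()

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

for _ in range(10_000):
    p, q = rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(5))
    assert tv(p, q) <= np.sqrt(0.5 * kl(p, q)) + 1e-12   # Pinsker's inequality
    assert 0.5 * tv(p, q) ** 2 <= js(p, q) + 1e-12       # TV vs. JS bound
    assert js(p, q) <= np.log(2) + 1e-12                 # JS is bounded
print("All inequality checks passed.")
```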
Putting It All Together: A Workflow for Measuring IAA
- This section provides a step-by-step pipeline for measuring inter-annotator agreement, choosing the correct metric, and interpreting the results in context.
Step 1 — Identify Annotation Data Type
- Before picking any metric, classify your annotation outputs into one of these types:
- Categorical (nominal): single class per item, no order
- Ordinal: discrete ranks with meaningful order
- Continuous: numeric values on a scale
- Structured: spans, trees, sequences
- Distributions: full probability vectors over categories
- Tip: If annotators are uncertain and spread probability mass, treat their outputs as distributions rather than forcing hard labels.
Step 2 — Choose Suitable Metrics
- Use this quick mapping:
Data Type | Recommended Metrics |
---|---|
Categorical | Cohen’s \(\kappa\) (2), Fleiss’ \(\kappa\) (many), Krippendorff’s \(\alpha\) |
Ordinal | Weighted Cohen’s \(\kappa\), Krippendorff’s \(\alpha\), Spearman’s \(\rho\) |
Continuous | Pearson’s \(r\), Intraclass Correlation (ICC), Krippendorff’s \(\alpha\) |
Structured | Task-specific matching (span F1, overlap measures) |
Distributions | Kullback–Leibler divergence (\(D_{KL}\)), Jensen–Shannon divergence (\(D_{JS}\)), Earth Mover’s Distance (\(EMD\)) |
- Guidelines:
- If you only care about agreement beyond chance, use \(\kappa\)-type metrics.
- If you care about numerical closeness, use correlation or divergence metrics.
Step 3 — Compute Agreement
- Clean data: handle missing annotations, standardize label sets.
- For categorical metrics, build an item × annotator label matrix.
- For distributional metrics, build an item × annotator probability matrix.
- Compute:
- Pairwise agreement (between annotator pairs)
- Average agreement (overall reliability)
- Tip: For large numbers of annotators, use random subsampling of pairs to reduce computation.
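Putting Step 3 into code, here is a minimal sketch for distributional annotations: it builds a hypothetical item \(\times\) annotator \(\times\) label probability array and reports per-item and overall mean pairwise JS divergence (lower means higher agreement):

```python
# A minimal end-to-end sketch for Step 3; all data structures and numbers are hypothetical.
import itertools
import numpy as np
from scipy.spatial.distance import jensenshannon

# annotations[item][annotator] = probability vector over K labels
annotations = np.array([
    [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.5, 0.3, 0.2]],
    [[0.1, 0.8, 0.1], [0.2, 0.6, 0.2], [0.1, 0.7, 0.2]],
])  # shape: (num_items, num_annotators, num_labels)

def item_agreement(dists):
    """Mean pairwise JS divergence for one item (lower = higher agreement)."""
    pairs = itertools.combinations(range(len(dists)), 2)
    return float(np.mean([jensenshannon(dists[i], dists[j]) ** 2 for i, j in pairs]))

per_item = [item_agreement(item) for item in annotations]
print(f"Per-item JS: {np.round(per_item, 4)}, overall: {np.mean(per_item):.4f}")
```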
Step 4 — Interpret Scores in Context
- Compare against known benchmarks (e.g., κ > 0.6 is substantial agreement).
- For divergence metrics:
- \(d_{TV} < 0.1\) or \(D_{JS} < 0.05\) \(\rightarrow\) very high agreement
- \(d_{TV} > 0.3\) or \(D_{JS} > 0.2\) \(\rightarrow\) strong disagreement
- Visualize distributions and confusion matrices to identify where disagreements occur.
- Important: Absolute cutoffs are less meaningful than relative comparisons across tasks or iterations.
Step 5 — Act on the Results
- If agreement is low:
- Refine annotation guidelines
- Provide more training/examples to annotators
- Identify and retrain or remove inconsistent annotators
- If agreement is high:
- Proceed with data aggregation and model training
- Optionally, use annotator reliability as weights in aggregation
Step 6 — Report Transparently
- When publishing or sharing results:
- Specify which metric you used and why.
- Report number of annotators, number of samples, and how missing data was handled.
- Include both agreement values and class distributions for context.
Appendix: Summary of Inter-Annotator Agreement Metrics
Metric | Data Type | Formula | Interpretation | Pros | Cons | Typical Range / Use-Case |
---|---|---|---|---|---|---|
Cohen’s \(\kappa\) | Categorical (2 annotators) | \(\kappa = \frac{p_o - p_e}{1 - p_e}\) | Agreement beyond chance between two annotators | Adjusts for chance; simple | Only two annotators; sensitive to class imbalance | \([0, 1]\); medical diagnoses, binary coding |
Fleiss’ \(\kappa\) | Categorical (many annotators) | Mean chance-corrected agreement across annotators | Multi-annotator extension of \(\kappa\) | Handles multiple annotators | Assumes annotators are interchangeable; imbalance sensitive | \([0, 1]\); crowdsourced labeling |
Krippendorff’s \(\alpha\) | Nominal → ratio | \(\alpha = 1 - \frac{D_o}{D_e}\) | General reliability across data types | Works with missing data; flexible | More complex computation | \([0, 1]\); mixed data, psychological scales |
Scott’s \(\pi\) | Categorical (2) | \(\pi = \frac{p_o - p_e}{1 - p_e}\) with \(p_e\) from pooled marginals | Chance-corrected agreement assuming a shared label distribution | Simple, historic | Assumes both annotators share one distribution | \([0, 1]\); political or sentiment coding |
Weighted \(\kappa\) | Ordinal | Weighted form of \(\kappa\) with penalty matrix \(w_{ij}\) | Agreement respecting order of categories | Considers ordinal distances | Needs chosen weights; subjective | \([0, 1]\); rating scales, quality scores |
Pearson’s \(r\) | Continuous | \(r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\sqrt{\sum_i (y_i - \bar{y})^2}}\) | Linear correlation of scores | Interpretable; handles continuous values | Sensitive to outliers; only linear | \([-1, 1]\); numeric scoring, regression tasks |
Spearman’s \(\rho\) | Ordinal / continuous | Correlation of rank orders | Monotonic relationship between annotators | Order-based, robust | Ignores exact scale differences | \([-1, 1]\); ranking tasks |
Intraclass Corr. (ICC) | Continuous | Variance ratio model | Consistency among several raters | Captures group consistency | Depends on model assumptions | \([0, 1]\); behavioral, clinical studies |
TV distance | Distributions | \(d_{TV}(P,Q)=\tfrac{1}{2}\sum_x \lvert P(x)-Q(x) \rvert\) | Max difference in probability mass | Bounded, symmetric, metric | Ignores info-theoretic nuance | \([0, 1]\); probabilistic emotion or topic labels |
KL divergence | Distributions | \(D_{KL}(P \Vert Q)=\sum_x P(x)\log \tfrac{P(x)}{Q(x)}\) | Information loss using \(Q\) for \(P\) | Info-theoretic; sensitive to rare events | Asymmetric; undefined for zeros | \([0, \infty)\); model approximation error |
JS divergence | Distributions | \(D_{JS}(P \Vert Q)=\tfrac{1}{2}D_{KL}(P \Vert M)+\tfrac{1}{2}D_{KL}(Q \Vert M), \quad M=\tfrac{1}{2}(P+Q)\) | Smoothed, symmetric version of KL | Symmetric; bounded; interpretable | Still needs smoothing | \([0, \log 2]\); general probabilistic agreement |
Task-specific overlap (\(F_1\), span \(F_1\)) | Structured outputs | \(F_1=\frac{2PR}{P+R}\) | Overlap or matching agreement | Intuitive for structured data | Needs domain-specific definition | \([0, 1]\); entity extraction, segmentation |
Takeaways
- Symmetry: TV and JS are symmetric; KL is not.
- Boundedness: \(d_{TV} \in [0, 1], \quad D_{JS} \in [0, \log 2], \quad D_{KL} \in [0, \infty)\)
- Data completeness: Krippendorff’s \(\alpha\) handles missing data best.
- When in doubt:
- For categorical labels \(\rightarrow\) Cohen/Fleiss \(\kappa\).
- For continuous or ordinal \(\rightarrow\) correlation or \(\alpha\).
- For distributions \(\rightarrow\) \(d_{TV}\) or \(D_{JS}\) divergence.
Further Reading
- Inter-coder Agreement for Computational Linguistics
- A Coefficient of Agreement for Nominal Scales
- Measuring Nominal Scale Agreement Among Many Raters
- Reliability of Content Analysis: The Case of Nominal Scale Coding
- Content Analysis: An Introduction to Its Methodology
- Computing Krippendorff’s Alpha-Reliability
- On Krippendorff’s Alpha Coefficient (Revised 2015)
- DKPro Agreement: A Java Library for Measuring Inter-Rater Agreement
- An Elementary Introduction to Information Geometry
- Elements of Information Theory
- Measuring Agreement on Set-valued Items (MASI) for Semantic and Pragmatic Annotation
- Semeval-2016 Task 5: Aspect Based Sentiment Analysis
- Reliability in Software Engineering Qualitative Research Using Krippendorff’s Alpha and Atlas.ti
- krippendorffsalpha: An R Package for Measuring Agreement Using Krippendorff’s Alpha Coefficient
- The Measurement of Observer Agreement for Categorical Data
References
- Survey Article: Inter-Coder Agreement for Computational Linguistics (Artstein & Poesio, 2008)
- A Coefficient of Agreement for Nominal Scales (Cohen, 1960)
- Inter-Coder Agreement — Direct MIT/COLI PDF
- A Brief Tutorial on Inter-Rater Agreement (DKPro / Meyer)
- Chapter on Agreement Coefficients for Nominal Ratings (AgreeStat PDF)
- Krippendorff’s alpha — Wikipedia page
- Reliability in Software Engineering Qualitative Research using Krippendorff’s α (arXiv preprint)
Citation
@article{Chadha2020DistilledInterAnnotatorAgreement,
title = {Inter-Annotator Agreement},
author = {Chadha, Aman and Jain, Vinija},
journal = {Distilled AI},
year = {2020},
note = {\url{https://aman.ai}}
}