Primers • Inter-Annotator Agreement
- Overview
- Classical Metrics for Inter-Annotator Agreement
- Bridging: Why Classical Metrics Fall Short for Distributional Annotations
- Distributional Agreement Metrics
- Practical Considerations for Inter-Annotator Agreement
- Mathematical Relationships Between TV, KL, and JS
- Putting It All Together: A Workflow for Measuring IAA
- Appendix: Summary of Inter-Annotator Agreement Metrics
- Further Reading
- References
- Citation
Overview
- Inter-annotator agreement (IAA) is a fundamental concept in annotation-driven research areas such as natural language processing (NLP), computer vision, medical coding, and social sciences. The goal of IAA is to quantify the degree to which multiple annotators, given the same task and guidelines, produce consistent outputs.
- High agreement suggests that the task is well-defined and that the annotations can be trusted as ground truth; low agreement often indicates ambiguity in the task or poor guideline design.
Why Measure Inter-Annotator Agreement?
- Reliability of Data: Annotation tasks often serve as the foundation for training and evaluating supervised machine learning models. If annotations are unreliable, models trained on them will inherit that noise.
- Task Clarity: Low agreement can highlight that annotation guidelines are ambiguous, incomplete, or too subjective.
- Annotator Quality: Agreement measures can help detect inconsistent annotators or biases.
- Scientific Rigor: In empirical research, IAA serves as evidence that reported findings are reproducible and not merely artifacts of annotator idiosyncrasies.
Types of Data for Agreement
- Different data types call for different agreement metrics:
- Categorical Labels (Nominal Data):
- Examples: sentiment classification (positive/neutral/negative), medical diagnosis codes.
- Agreement here involves checking whether annotators choose the same category.
- Ordinal Labels:
- Examples: rating scales (1–5 stars, severity levels).
- Agreement must respect the fact that categories have an inherent order.
- Continuous Annotations:
- Examples: bounding box coordinates in images, reaction times, or scores between 0 and 1.
- Agreement is often measured via correlation or distance metrics.
- Structured Outputs:
- Examples: parse trees, dialogue act sequences, or entity spans.
- Agreement requires specialized metrics that account for structured predictions.
- Distributions:
- In some tasks, annotators are asked not to provide a single label, but a probability distribution over possible labels. This reflects uncertainty or subjectivity.
- Example: In emotion annotation, one annotator may assign 0.6 probability to “joy,” 0.3 to “surprise,” and 0.1 to “neutral,” while another annotator may spread probabilities differently.
- Agreement is then measured using distributional divergences such as Total Variation distance (TV distance), Kullback–Leibler divergence (KL), or Jensen–Shannon divergence (JS).
- For two discrete distributions \(P\) and \(Q\) over a label space \(\mathcal{X}\):
- TV distance:
\[d_{TV}(P, Q) = \frac{1}{2} \sum_{x \in \mathcal{X}} | P(x) - Q(x) |\]
- KL divergence:
\[D_{KL}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}\]
- Jensen–Shannon divergence:
\[D_{JS}(P \| Q) = \frac{1}{2} D_{KL}\!\left(P \,\middle\|\, M\right) + \frac{1}{2} D_{KL}\!\left(Q \,\middle\|\, M\right), \quad M = \frac{1}{2}(P + Q)\]
- These measures provide a graded notion of disagreement rather than a binary match/mismatch.
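To make these definitions concrete, here is a minimal NumPy sketch (not from the original primer) that computes all three divergences for a hypothetical pair of annotator distributions over (joy, surprise, neutral), mirroring the emotion example above:

```python
# A minimal sketch of TV, KL, and JS; the two annotator distributions are hypothetical.
import numpy as np

def tv_distance(p, q):
    """Total Variation distance: half the L1 difference between the distributions."""
    return 0.5 * np.abs(p - q).sum()

def kl_divergence(p, q):
    """KL divergence D_KL(P || Q); assumes q > 0 wherever p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js_divergence(p, q):
    """Jensen-Shannon divergence via the mixture M = (P + Q) / 2."""
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical annotator distributions over (joy, surprise, neutral).
p = np.array([0.6, 0.3, 0.1])
q = np.array([0.4, 0.2, 0.4])
print(tv_distance(p, q), kl_divergence(p, q), js_divergence(p, q))
```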
Classical Metrics for Inter-Annotator Agreement
- Classical agreement metrics focus on categorical and ordinal labels, where the annotators assign one label per instance. They adjust for chance agreement and provide interpretable scales of reliability.
Cohen’s Kappa (\(\kappa\))
- Definition: For two annotators, Cohen’s kappa measures agreement while correcting for chance.
- Formula:
- Let:
- \(p_o\): observed proportion of agreement
- \(p_e\): expected agreement under independence
- Then:
\[\kappa = \frac{p_o - p_e}{1 - p_e}\]
- Suitable for: Categorical (nominal) labels.
- Use-case: Two medical experts diagnosing patients into disease categories (yes/no cancer).
- Pros:
- Corrects for chance agreement.
- Easy to interpret (\(\kappa = 1\): perfect agreement, \(\kappa = 0\): chance-level).
- Cons:
- Only supports two annotators.
- Sensitive to class imbalance (rare categories can distort \(\kappa\)).
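As a quick illustration, the following sketch computes Cohen’s \(\kappa\) with scikit-learn’s `cohen_kappa_score`; the two label lists are hypothetical diagnoses for five patients:

```python
# A minimal sketch, assuming scikit-learn is available; the label lists are hypothetical.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["cancer", "no_cancer", "no_cancer", "cancer", "no_cancer"]
annotator_b = ["cancer", "no_cancer", "cancer", "cancer", "no_cancer"]

# weights="linear" or "quadratic" gives weighted kappa for ordinal labels.
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.3f}")
```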
Fleiss’ Kappa (\(\kappa\))
- Definition: Generalization of Cohen’s kappa for more than two annotators. Agreement is measured by comparing observed vs. expected proportions across all annotators.
- Formula:
- For category \(j\):
\[P_j = \frac{1}{Nn} \sum_{i=1}^N n_{ij}\]
- where \(n_{ij}\) is the number of annotators assigning category \(j\) to item \(i\), \(N\) is the number of items, and \(n\) is the number of annotators per item.
- Then compute per-item agreement and average across items.
- Suitable for: Categorical (nominal) labels with multiple annotators.
- Use-case: A crowd-sourced sentiment task with 10 annotators per review.
- Pros:
- Extends Cohen’s \(\kappa\) to many annotators.
- Still adjusts for chance agreement.
- Cons:
- Assumes annotators are exchangeable.
- Same sensitivity to imbalance issues as Cohen’s \(\kappa\).
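For reference, here is a minimal NumPy sketch of Fleiss’ \(\kappa\) computed directly from an item \(\times\) category count matrix; the counts below are hypothetical, and libraries such as statsmodels also provide an implementation:

```python
# A minimal sketch of Fleiss' kappa from an item x category count matrix
# (counts are hypothetical; assumes every item was rated by the same n annotators).
import numpy as np

def fleiss_kappa(counts):
    counts = np.asarray(counts, dtype=float)    # shape: (N items, K categories)
    N, _ = counts.shape
    n = counts[0].sum()                         # annotators per item
    p_j = counts.sum(axis=0) / (N * n)          # category proportions P_j
    P_i = (np.sum(counts ** 2, axis=1) - n) / (n * (n - 1))  # per-item agreement
    P_bar, P_e = P_i.mean(), np.sum(p_j ** 2)
    return (P_bar - P_e) / (1 - P_e)

# 4 reviews, 3 sentiment categories (neg/neu/pos), 10 annotators per review.
table = [[8, 1, 1],
         [2, 6, 2],
         [0, 1, 9],
         [3, 3, 4]]
print(f"Fleiss' kappa: {fleiss_kappa(table):.3f}")
```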
Krippendorff’s Alpha (\(\alpha\))
- Definition: A versatile reliability coefficient that generalizes across data types.
- Formula:
\[\alpha = 1 - \frac{D_o}{D_e}\]
- where
- \(D_o\) = observed disagreement
- \(D_e\) = expected disagreement
- The definition of disagreement depends on the data type.
- Suitable for: Nominal, ordinal, interval, and ratio data.
- Use-case: Annotators rating severity of patient symptoms on a 1–5 scale.
- Pros:
- Works with any number of annotators.
- Can handle missing data.
- Supports various data types beyond categorical.
- Cons:
- Computationally heavier than \(\kappa\).
- Requires defining distance functions for non-categorical data.
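A minimal sketch, assuming the third-party `krippendorff` package (`pip install krippendorff`); the ratings below are hypothetical 1–5 severity scores, with `np.nan` marking missing annotations:

```python
# A minimal sketch, assuming the third-party `krippendorff` package;
# rows = annotators, columns = items, np.nan = item not rated by that annotator.
import numpy as np
import krippendorff

reliability_data = np.array([
    [1,      2, 3, 3, 2, np.nan],
    [1,      2, 3, 4, 2, 5],
    [np.nan, 3, 3, 3, 2, 4],
])

alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="ordinal")
print(f"Krippendorff's alpha: {alpha:.3f}")
```

Changing `level_of_measurement` (e.g., to "nominal", "interval", or "ratio") swaps the distance function used for the observed and expected disagreement terms.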
Scott’s Pi (\(\pi\))
- Definition: Similar to Cohen’s \(\kappa\), but expected agreement is computed from the pooled distribution of both annotators’ labels rather than from each annotator’s own marginals.
- Formula:
\[\pi = \frac{p_o - p_e}{1 - p_e}, \quad p_e = \sum_j p_j^2\]
- where \(p_j\) is the proportion of all assignments (pooled across both annotators) given to category \(j\).
- Suitable for: Categorical data, two annotators.
- Use-case: Two coders labeling political statements as left/right/neutral.
- Pros:
- Simple and interpretable.
- Historically important precursor to \(\kappa\).
- Cons:
- Assumes annotators share the same distribution.
- Less robust in practice than \(\kappa\).
Correlation-Based Measures
- Definition: Correlation-based measures capture the strength and direction of relationships between continuous or ordinal annotations. Common examples include Pearson’s \(r\) and Spearman’s \(\rho\).
- Formula:
- For Pearson’s correlation:
\[r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}\]
- Spearman’s \(\rho\) is the same correlation computed on rank-transformed values.
- Suitable for: Interval/ratio data, ordinal data.
- Use-case: Annotators assigning continuous emotion intensity scores (0–1).
- Pros:
- Handles continuous scales.
- Simple and well understood.
- Cons:
- Sensitive to outliers.
- Only captures linear (Pearson) or monotonic (Spearman) relations.
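A minimal sketch using SciPy’s `pearsonr` and `spearmanr` on hypothetical emotion-intensity scores from two annotators:

```python
# A minimal sketch, assuming scipy; the emotion-intensity scores are hypothetical.
from scipy.stats import pearsonr, spearmanr

annotator_a = [0.10, 0.40, 0.35, 0.80, 0.95]
annotator_b = [0.15, 0.30, 0.45, 0.70, 0.90]

r, _ = pearsonr(annotator_a, annotator_b)      # linear association
rho, _ = spearmanr(annotator_a, annotator_b)   # monotonic (rank) association
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```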
Comparative Analysis
Metric | Data Type | Use-case Example | Pros | Cons |
---|---|---|---|---|
Cohen’s \(\kappa\) | Categorical (2) | 2 doctors diagnosing disease | Adjusts for chance, easy to interpret | Only 2 annotators, imbalance issues |
Fleiss’ \(\kappa\) | Categorical (many) | Crowd sentiment annotation | Multi-annotator extension of \(\kappa\) | Assumes annotator interchangeability |
Krippendorff’s \(\alpha\) | Nominal \(\rightarrow\) ratio | Symptom severity ratings | Versatile, missing data tolerant | More complex computation |
Scott’s \(\pi\) | Categorical (2) | Political statement coding | Simple, historic | Unrealistic assumptions |
Correlation (\(r, \rho\)) | Continuous / ordinal | Emotion intensity scores | Works for continuous data | Sensitive to outliers |
Bridging: Why Classical Metrics Fall Short for Distributional Annotations
- Most classical inter-annotator agreement (IAA) metrics—such as Cohen’s \(\kappa\), Fleiss’ \(\kappa\), and Krippendorff’s \(\alpha\)—are designed under the assumption that annotators provide a single discrete label per item. However, in many modern annotation settings, this assumption no longer holds.
Emergence of Distributional Annotations
- With tasks involving uncertainty, subjectivity, or ambiguity, annotators are increasingly asked to provide a distribution over labels instead of a hard decision. For example:
- Emotion annotation: Annotators distribute probabilities across emotions (joy, sadness, fear).
- Topic labeling: Annotators may indicate that a document is 70% politics and 30% economics.
- Crowdsourcing: Aggregate behavior of many annotators often yields empirical distributions rather than a single consensus label.
- This reflects a richer representation of annotator uncertainty and disagreement.
Limitations of Classical Metrics
- Binary vs. graded disagreement:
- Cohen’s \(\kappa\) and similar metrics count labels as either “agree” or “disagree.” They cannot capture the degree of overlap between distributions.
- Information loss:
- Reducing probability distributions to single labels (e.g., by taking the \(\arg\max\)) discards annotator uncertainty and masks subtler disagreements.
- Incompatibility with probabilistic annotations:
- Metrics like \(\kappa\) assume categorical variables, not vectors in the probability simplex \(\Delta^K = \{\, p \in \mathbb{R}^K \mid p_i \geq 0, \; \sum_{i=1}^K p_i = 1 \,\}\).
- Thus, new tools are required to measure distributional agreement, which operate directly on the probability distributions annotators provide.
Transition to Divergence-Based Measures
- Instead of measuring categorical agreement, distributional approaches compute distances or divergences between probability distributions. These metrics allow us to say not only whether two annotators disagreed, but how far apart their probability distributions are.
- This motivates distributional agreement metrics such as:
- Total Variation (TV) Distance: measures the maximum difference in probabilities across categories.
- Kullback–Leibler (KL) Divergence: measures information loss when one distribution approximates another.
- Jensen–Shannon (JS) Divergence: symmetrized and smoothed version of KL, often more stable.
Distributional Agreement Metrics
- When annotators provide probability distributions instead of single labels, we need measures that compare two distributions directly. These are often referred to as divergences or distances on the probability simplex.
Total Variation (TV) Distance
- Definition: For two discrete distributions \(P\) and \(Q\) over a label set \(\mathcal{X}\), the Total Variation distance quantifies the largest possible difference between the probabilities assigned by \(P\) and \(Q\) to the same event. It measures how much probability mass must be shifted to make one distribution match the other. It is a symmetric, bounded, and interpretable measure of dissimilarity between probability distributions.
- Formula:
\[d_{TV}(P, Q) = \frac{1}{2} \sum_{x \in \mathcal{X}} | P(x) - Q(x) |\]
- Intuition: Represents the largest possible difference in probability mass that the two annotators assign to the same event.
- Range: \([0, 1]\)
- 0 means identical distributions.
- 1 means completely disjoint support.
- Suitable for: Any distributional annotations.
- Use-case: Comparing how two annotators distribute probability across emotions for a sentence.
- Pros:
- Symmetric and interpretable.
- Metric (satisfies the triangle inequality).
- Bounded in \([0, 1]\).
- Cons:
- Ignores information-theoretic aspects (only measures absolute differences).
- May be too coarse in high-dimensional label spaces.
Kullback–Leibler (KL) Divergence
- Definition: The Kullback–Leibler divergence quantifies how one probability distribution \(Q\) diverges from another distribution \(P\). It measures the expected number of extra bits required to encode samples from \(P\) using a code optimized for \(Q\). It is an asymmetric and unbounded measure rooted in information theory.
- Formula:
\[D_{KL}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}\]
- Intuition: Measures how inefficient it is to encode samples from \(P\) using a code optimized for \(Q\).
- Range: \([0, \infty)\); zero if and only if \(P = Q\).
- Suitable for: Distributional annotations where one distribution can be treated as “true” and the other as an approximation.
- Use-case: Evaluating how much information is lost if one annotator’s probability distribution is used to approximate another’s.
- Pros:
- Information-theoretic interpretation.
- Sensitive to differences in rare events.
- Cons:
- Asymmetric: \(D_{KL}(P \| Q) \neq D_{KL}(Q \| P)\).
- Undefined if \(Q(x) = 0\) while \(P(x) > 0\).
- Harder to interpret numerically compared to bounded metrics.
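The zero-probability failure mode above is easy to reproduce; the sketch below (using SciPy’s `entropy`, which computes \(D_{KL}\) when given two distributions) shows it along with a simple \(\epsilon\)-smoothing workaround. The distributions and the choice of \(\epsilon\) are illustrative assumptions:

```python
# A minimal sketch of the zero-probability issue and an epsilon-smoothing fix;
# scipy.stats.entropy(p, q) computes D_KL(P || Q); the distributions are hypothetical.
import numpy as np
from scipy.stats import entropy

p = np.array([0.7, 0.3, 0.0])
q = np.array([0.5, 0.0, 0.5])   # q[1] = 0 while p[1] > 0 -> D_KL(P || Q) = inf

print(entropy(p, q))            # inf

eps = 1e-6                      # smooth and renormalize before comparing
p_s = (p + eps) / (p + eps).sum()
q_s = (q + eps) / (q + eps).sum()
print(entropy(p_s, q_s))        # finite, but sensitive to the choice of eps
```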
Jensen–Shannon (JS) Divergence
- Definition: The Jensen–Shannon divergence is a symmetrized and smoothed version of the KL divergence. It measures the average divergence of each distribution from their mean distribution \(M = \frac{1}{2}(P + Q)\). It is always finite, symmetric, and more stable than KL divergence.
- Formula:
\[D_{JS}(P \| Q) = \frac{1}{2} D_{KL}\!\left(P \,\middle\|\, M\right) + \frac{1}{2} D_{KL}\!\left(Q \,\middle\|\, M\right), \quad M = \frac{1}{2}(P + Q)\]
- Intuition: Measures the average divergence of each distribution from their mean.
- Range: \([0, \log 2]\) (often normalized to \([0,1]\)).
- Suitable for: Any pair of probability distributions, especially when symmetry and stability are needed.
- Use-case: Comparing how two annotators spread probability mass across multiple labels, while avoiding undefined cases.
- Pros:
- Symmetric and always finite.
- The square root of JS divergence is a metric.
- More stable in practice than KL.
- Cons:
- Less interpretable than TV distance.
- Can still be sensitive to smoothing choices.
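A minimal sketch, assuming SciPy: `scipy.spatial.distance.jensenshannon` returns the JS distance (the square root of the divergence), so it is squared below to recover \(D_{JS}\); the two distributions are hypothetical:

```python
# A minimal sketch: jensenshannon returns the JS *distance* (sqrt of the divergence),
# so we square it; the two annotator distributions are hypothetical.
import numpy as np
from scipy.spatial.distance import jensenshannon

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.1, 0.3, 0.6])

js_div = jensenshannon(p, q) ** 2                # natural-log base: range [0, log 2]
js_div_norm = jensenshannon(p, q, base=2) ** 2   # base-2 log: range [0, 1]
print(f"JS divergence: {js_div:.4f} (nats), {js_div_norm:.4f} (normalized)")
```

Passing `base=2` is a convenient way to get the normalized \([0, 1]\) range mentioned above.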
Comparison of TV, KL, and JS
Metric | Symmetric? | Range | Interpretability | Sensitivity | Use-case example |
---|---|---|---|---|---|
TV Distance | Yes | \([0,1]\) | Very interpretable (max diff) | Treats all categories equally | Emotion distributions |
KL Divergence | No | \([0,\infty)\) | Info-theoretic, less intuitive | Sensitive to rare events | Approximation error |
JS Divergence | Yes | \([0,\log 2]\) | Balanced, bounded, metric \((\sqrt{\text{JS}})\) | Smoothed, avoids infinities | General distributional IAA |
Practical Considerations for Inter-Annotator Agreement
- IAA analysis requires more than just selecting a formula — it involves understanding the data type, annotation context, and interpretability needs. This section provides guidance on how to choose metrics, what to watch out for, and how to interpret agreement scores.
Choosing a Metric by Data Type
Data Type | Recommended Metrics | Notes |
---|---|---|
Categorical (nominal) | Cohen’s \(\kappa\) (2 annotators), Fleiss’ \(\kappa\) (many), Krippendorff’s \(\alpha\) | Must check for class imbalance effects |
Ordinal | Weighted Cohen’s \(\kappa\), Krippendorff’s \(\alpha\), Spearman’s \(\rho\) | Use distance-based weighting to respect ordering |
Continuous | Pearson’s \(r\), Intraclass Correlation (ICC), Krippendorff’s \(\alpha\) | Handle outliers carefully, scale-sensitive |
Structured outputs | Task-specific metrics (e.g., overlap F1, span-based agreement) | Define what counts as “match” structurally |
Distributions | TV distance, JS divergence, KL divergence | Do not collapse to argmax labels; keep full distributions |
- Rule of thumb:
- If annotators give single discrete labels \(\rightarrow\) use chance-corrected categorical metrics.
- If annotators give scores or ranks \(\rightarrow\) use correlation- or distance-based measures.
- If annotators give full probability distributions \(\rightarrow\) use divergence measures.
Interpreting Agreement Levels
- There is no absolute scale, but a commonly used heuristic (adapted from Landis and Koch’s (1977) interpretation for Cohen’s \(\kappa\)) is:
Agreement value | Interpretation |
---|---|
\(< 0.0\) | Worse than chance |
\(0.0–0.20\) | Slight agreement |
\(0.21–0.40\) | Fair agreement |
\(0.41–0.60\) | Moderate agreement |
\(0.61–0.80\) | Substantial agreement |
\(0.81–1.00\) | Almost perfect |
- For divergence metrics (TV, KL, JS), lower values mean closer distributions. Typical observed ranges:
- TV distance: < 0.1 \(\rightarrow\) very similar; > 0.3 \(\rightarrow\) strong disagreement
- JS divergence: < 0.05 \(\rightarrow\) close; > 0.2 \(\rightarrow\) widely different
- KL divergence: highly variable; compare relative changes, not absolute cutoffs.
Handling Annotator Bias and Class Imbalance
- Class imbalance can inflate or deflate κ-like metrics. Consider reporting class distributions alongside agreement.
- Annotator bias (systematic skew) can lower κ even if raw agreement is high.
- Consider using confusion matrices to inspect which categories cause disagreement.
Missing Data and Sparse Annotations
- Krippendorff’s \(\alpha\) is robust to missing annotations and is the safest choice for incomplete data.
- For divergence-based measures, ensure smoothing (e.g., add a small \(\epsilon\) to every category and renormalize) to avoid zeros that break KL.
Computational Considerations
- \(\kappa\)-type metrics are computationally cheap (matrix counts).
- Krippendorff’s \(\alpha\) is \(O(N \times A^2)\) for \(N\) items and \(A\) annotators — still feasible but heavier.
- Divergence-based metrics are \(O(K)\) per pair of distributions, where \(K\) is the number of categories.
- If annotator sets are large, prefer efficient pairwise sampling strategies or aggregate distributions.
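As an illustration of the pairwise-sampling idea, here is a minimal sketch (the function name, array shapes, and the `max_pairs` cap are all hypothetical) that averages pairwise JS divergence over a random subsample of annotator pairs for one item:

```python
# A minimal sketch of averaging pairwise JS divergence over a random sample of
# annotator pairs (names and shapes are hypothetical).
import itertools
import random
import numpy as np
from scipy.spatial.distance import jensenshannon

def mean_pairwise_js(dists, max_pairs=1000, seed=0):
    """dists: array of shape (num_annotators, num_categories) for one item."""
    pairs = list(itertools.combinations(range(len(dists)), 2))
    random.Random(seed).shuffle(pairs)
    pairs = pairs[:max_pairs]                 # subsample pairs when annotator sets are large
    return float(np.mean([jensenshannon(dists[i], dists[j]) ** 2 for i, j in pairs]))

dists = np.random.default_rng(0).dirichlet(np.ones(4), size=20)  # 20 annotators, 4 labels
print(mean_pairwise_js(dists))
```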
Mathematical Relationships Between TV, KL, and JS
- While Total Variation Distance, Kullback–Leibler divergence, and Jensen–Shannon divergence measure different aspects of distributional difference, they are connected through known inequalities. Understanding these links helps interpret and compare their values meaningfully.
Pinsker’s Inequality (KL vs. TV)
- Pinsker’s inequality provides an upper bound on TV distance in terms of KL divergence:
\[d_{TV}(P, Q) \le \sqrt{\tfrac{1}{2} D_{KL}(P \| Q)}\]
- This means:
- If KL divergence is small, then TV must also be small, i.e., the distributions are close in absolute terms.
- However, small TV does not guarantee small KL (KL can blow up when \(Q(x) \approx 0\)).
Implication:
- KL is more sensitive to low-probability mismatches than TV.
Lower Bound on KL via TV
- Rearranging Pinsker’s inequality gives a lower bound on KL in terms of TV:
\[D_{KL}(P \| Q) \ge 2\, d_{TV}(P, Q)^2\]
- This shows that large TV implies large KL, pinning down how the two measures grow together.
JS Divergence Related to KL and TV
- JS divergence is defined as:
\[D_{JS}(P \| Q) = \frac{1}{2}D_{KL}(P \| M) + \frac{1}{2}D_{KL}(Q \| M), \quad M = \frac{1}{2}(P+Q)\]
- and satisfies \(0 \le D_{JS}(P \| Q) \le \log 2\).
- It inherits KL’s information-theoretic basis while being symmetric and bounded.
- Also, it relates to TV (via Pinsker’s inequality applied to each KL term) as:
\[\tfrac{1}{2}\, d_{TV}(P, Q)^2 \le D_{JS}(P \| Q)\]
- So:
- JS grows at least as fast as \(d_{TV}^2\).
- JS is upper-bounded, while KL is unbounded.
- JS is often preferred for interpretability and numerical stability.
Comparative Analysis of Theoretical Relationships
Pair | Inequality | Interpretation |
---|---|---|
TV vs. KL | \(d_{TV} \le \sqrt{\tfrac{1}{2} D_{KL}}\) | Small KL implies small TV |
TV vs. KL | \(2 d_{TV}^2 \le D_{KL}\) | Large TV implies large KL |
TV vs. JS | \(\tfrac{1}{2} d_{TV}^2 \le D_{JS}\) | Large TV implies large JS |
JS vs. KL | \(D_{JS} \le D_{KL}\) (when \(P\) and \(Q\) share support) | JS is a smoothed and bounded version of KL |
- Key takeaways:
- TV gives an absolute probability difference.
- KL gives a relative (log-based) penalty, very sensitive to rare events.
- JS sits between them: symmetric, smoothed, and bounded, making it ideal for practical agreement comparisons.
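The inequalities above can be spot-checked numerically. The following sketch (an illustrative check, not part of the original primer) samples random distribution pairs and asserts Pinsker’s inequality, the TV–JS bound, and the boundedness of JS, using natural-log KL/JS to match the \([0, \log 2]\) range used here:

```python
# A minimal sketch that numerically spot-checks the inequalities above on random
# distribution pairs (assumes numpy; natural-log KL/JS).
import numpy as np

rng = np.random.default_rng(0)

def tv(p, q):
    return 0.5 * np.abs(p - q).sum()

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

for _ in range(10_000):
    p, q = rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(5))
    assert tv(p, q) <= np.sqrt(0.5 * kl(p, q)) + 1e-12   # Pinsker's inequality
    assert 0.5 * tv(p, q) ** 2 <= js(p, q) + 1e-12       # TV vs. JS bound
    assert js(p, q) <= np.log(2) + 1e-12                 # JS is bounded
print("All inequality checks passed.")
```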
Putting It All Together: A Workflow for Measuring IAA
- This section provides a step-by-step pipeline for measuring inter-annotator agreement, choosing the correct metric, and interpreting the results in context.
Step 1 — Identify Annotation Data Type
- Before picking any metric, classify your annotation outputs into one of these types:
- Categorical (nominal): single class per item, no order
- Ordinal: discrete ranks with meaningful order
- Continuous: numeric values on a scale
- Structured: spans, trees, sequences
- Distributions: full probability vectors over categories
- Tip: If annotators are uncertain and spread probability mass, treat their outputs as distributions rather than forcing hard labels.
Step 2 — Choose Suitable Metrics
- Use this quick mapping:
Data Type | Recommended Metrics |
---|---|
Categorical | Cohen’s \(\kappa\) (2), Fleiss’ \(\kappa\) (many), Krippendorff’s \(\alpha\) |
Ordinal | Weighted Cohen’s \(\kappa\), Krippendorff’s \(\alpha\), Spearman’s \(\rho\) |
Continuous | Pearson’s \(r\), Intraclass Correlation (ICC), Krippendorff’s \(\alpha\) |
Structured | Task-specific matching (span F1, overlap measures) |
Distributions | Kullback–Leibler divergence (\(D_{KL}\)), Jensen–Shannon divergence (\(D_{JS}\)), Earth Mover’s Distance (\(EMD\)) |
- Guidelines:
- If you only care about agreement beyond chance, use \(\kappa\)-type metrics.
- If you care about numerical closeness, use correlation or divergence metrics.
Step 3 — Compute Agreement
- Clean data: handle missing annotations, standardize label sets.
- For categorical metrics, build an item × annotator label matrix.
- For distributional metrics, build an item × annotator probability matrix.
- Compute:
- Pairwise agreement (between annotator pairs)
- Average agreement (overall reliability)
- Tip: For large numbers of annotators, use random subsampling of pairs to reduce computation.
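Putting Step 3 into code, here is a minimal sketch for distributional annotations: it builds a hypothetical item \(\times\) annotator \(\times\) label probability array and reports per-item and overall mean pairwise JS divergence (lower means higher agreement):

```python
# A minimal end-to-end sketch for Step 3; all data structures and numbers are hypothetical.
import itertools
import numpy as np
from scipy.spatial.distance import jensenshannon

# annotations[item][annotator] = probability vector over K labels
annotations = np.array([
    [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.5, 0.3, 0.2]],
    [[0.1, 0.8, 0.1], [0.2, 0.6, 0.2], [0.1, 0.7, 0.2]],
])  # shape: (num_items, num_annotators, num_labels)

def item_agreement(dists):
    """Mean pairwise JS divergence for one item (lower = higher agreement)."""
    pairs = itertools.combinations(range(len(dists)), 2)
    return float(np.mean([jensenshannon(dists[i], dists[j]) ** 2 for i, j in pairs]))

per_item = [item_agreement(item) for item in annotations]
print(f"Per-item JS: {np.round(per_item, 4)}, overall: {np.mean(per_item):.4f}")
```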
Step 4 — Interpret Scores in Context
- Compare against known benchmarks (e.g., κ > 0.6 is substantial agreement).
- For divergence metrics:
- \(d_{TV} < 0.1\) or \(D_{JS} < 0.05\) \(\rightarrow\) very high agreement
- \(d_{TV} > 0.3\) or \(D_{JS} > 0.2\) \(\rightarrow\) strong disagreement
- Visualize distributions and confusion matrices to identify where disagreements occur.
- Important: Absolute cutoffs are less meaningful than relative comparisons across tasks or iterations.
Step 5 — Act on the Results
- If agreement is low:
- Refine annotation guidelines
- Provide more training/examples to annotators
- Identify and retrain or remove inconsistent annotators
- If agreement is high:
- Proceed with data aggregation and model training
- Optionally, use annotator reliability as weights in aggregation
Step 6 — Report Transparently
- When publishing or sharing results:
- Specify which metric you used and why.
- Report number of annotators, number of samples, and how missing data was handled.
- Include both agreement values and class distributions for context.
Appendix: Summary of Inter-Annotator Agreement Metrics
Metric | Data Type | Formula | Interpretation | Pros | Cons | Typical Range / Use-Case |
---|---|---|---|---|---|---|
Cohen’s \(\kappa\) | Categorical (2 annotators) | \(\kappa = \frac{p_o - p_e}{1 - p_e}\) | Agreement beyond chance between two annotators | Adjusts for chance; simple | Only two annotators; sensitive to class imbalance | \([0, 1]\); medical diagnoses, binary coding |
Fleiss’ \(\kappa\) | Categorical (many annotators) | Mean chance-corrected agreement across annotators | Multi-annotator extension of \(\kappa\) | Handles multiple annotators | Assumes annotators are interchangeable; imbalance sensitive | \([0, 1]\); crowdsourced labeling |
Krippendorff’s \(\alpha\) | Nominal → ratio | \(\alpha = 1 - \frac{D_o}{D_e}\) | General reliability across data types | Works with missing data; flexible | More complex computation | \([0, 1]\); mixed data, psychological scales |
Scott’s \(\pi\) | Categorical (2) | \(\pi = \frac{p_o - p_e}{1 - p_e}\) with \(p_e\) from pooled marginals | Chance-corrected agreement assuming a shared label distribution | Simple, historic | Assumes both annotators share one distribution | \([0, 1]\); political or sentiment coding |
Weighted \(\kappa\) | Ordinal | Weighted form of \(\kappa\) with penalty matrix \(w_{ij}\) | Agreement respecting order of categories | Considers ordinal distances | Needs chosen weights; subjective | \([0, 1]\); rating scales, quality scores |
Pearson’s \(r\) | Continuous | \(r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\sqrt{\sum_i (y_i - \bar{y})^2}}\) | Linear correlation of scores | Interpretable; handles continuous values | Sensitive to outliers; only linear | \([-1, 1]\); numeric scoring, regression tasks |
Spearman’s \(\rho\) | Ordinal / continuous | Correlation of rank orders | Monotonic relationship between annotators | Order-based, robust | Ignores exact scale differences | \([-1, 1]\); ranking tasks |
Intraclass Corr. (ICC) | Continuous | Variance ratio model | Consistency among several raters | Captures group consistency | Depends on model assumptions | \([0, 1]\); behavioral, clinical studies |
TV distance | Distributions | \(d_{TV}(P,Q)=\tfrac{1}{2}\sum_x \lvert P(x)-Q(x) \rvert\) | Max difference in probability mass | Bounded, symmetric, metric | Ignores info-theoretic nuance | \([0, 1]\); probabilistic emotion or topic labels |
KL divergence | Distributions | \(D_{KL}(P \Vert Q)=\sum_x P(x)\log \tfrac{P(x)}{Q(x)}\) | Information loss using \(Q\) for \(P\) | Info-theoretic; sensitive to rare events | Asymmetric; undefined for zeros | \([0, \infty)\); model approximation error |
JS divergence | Distributions | \(D_{JS}(P \Vert Q)=\tfrac{1}{2}D_{KL}(P \Vert M)+\tfrac{1}{2}D_{KL}(Q \Vert M), \quad M=\tfrac{1}{2}(P+Q)\) | Smoothed, symmetric version of KL | Symmetric; bounded; interpretable | Still needs smoothing | \([0, \log 2]\); general probabilistic agreement |
Task-specific overlap (\(F_1\), span \(F_1\)) | Structured outputs | \(F_1=\frac{2PR}{P+R}\) | Overlap or matching agreement | Intuitive for structured data | Needs domain-specific definition | \([0, 1]\); entity extraction, segmentation |
Takeaways
- Symmetry: TV and JS are symmetric; KL is not.
- Boundedness: \(d_{TV} \in [0, 1], \quad D_{JS} \in [0, \log 2], \quad D_{KL} \in [0, \infty)\)
- Data completeness: Krippendorff’s \(\alpha\) handles missing data best.
- When in doubt:
- For categorical labels \(\rightarrow\) Cohen/Fleiss \(\kappa\).
- For continuous or ordinal \(\rightarrow\) correlation or \(\alpha\).
- For distributions \(\rightarrow\) \(d_{TV}\) or \(D_{JS}\) divergence.
Further Reading
- Inter-coder Agreement for Computational Linguistics
- A Coefficient of Agreement for Nominal Scales
- Measuring Nominal Scale Agreement Among Many Raters
- Reliability of Content Analysis: The Case of Nominal Scale Coding
- Content Analysis: An Introduction to Its Methodology
- Computing Krippendorff’s Alpha-Reliability
- On Krippendorff’s Alpha Coefficient (Revised 2015)
- DKPro Agreement: A Java Library for Measuring Inter-Rater Agreement
- An Elementary Introduction to Information Geometry
- Elements of Information Theory
- Measuring Agreement on Set-valued Items (MASI) for Semantic and Pragmatic Annotation
- Semeval-2016 Task 5: Aspect Based Sentiment Analysis
- Reliability in Software Engineering Qualitative Research Using Krippendorff’s Alpha and Atlas.ti
- krippendorffsalpha: An R Package for Measuring Agreement Using Krippendorff’s Alpha Coefficient
- The Measurement of Observer Agreement for Categorical Data
References
- Survey Article: Inter-Coder Agreement for Computational Linguistics (Artstein & Poesio, 2008)
- A Coefficient of Agreement for Nominal Scales (Cohen, 1960)
- Inter-Coder Agreement — Direct MIT/COLI PDF
- A Brief Tutorial on Inter-Rater Agreement (DKPro / Meyer)
- Chapter on Agreement Coefficients for Nominal Ratings (AgreeStat PDF)
- Krippendorff’s alpha — Wikipedia page
- Reliability in Software Engineering Qualitative Research using Krippendorff’s α (arXiv preprint)
Citation
@article{Chadha2020DistilledInterAnnotatorAgreement,
title = {Inter-Annotator Agreement},
author = {Chadha, Aman and Jain, Vinija},
journal = {Distilled AI},
year = {2020},
note = {\url{https://aman.ai}}
}