Metadata-Version: 2.4
Name: pearmut
Version: 0.3.1
Summary: A tool for evaluation of model outputs, primarily MT.
Author-email: Vilém Zouhar <vilem.zouhar@gmail.com>
License: MIT
Project-URL: Repository, https://github.com/zouharvi/pearmut
Project-URL: Issues, https://github.com/zouharvi/pearmut/issues
Keywords: evaluation,machine translation,human evaluation,annotation
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: fastapi>=0.110.0
Requires-Dist: uvicorn>=0.29.0
Requires-Dist: wonderwords>=3.0.0
Requires-Dist: psutil>=7.1.0
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Dynamic: license-file

# Pearmut 🍐

**Platform for Evaluation and Reviewing of Multilingual Tasks**: Evaluate model outputs for translation and NLP tasks with support for multimodal data (text, video, audio, images) and multiple annotation protocols ([DA](https://aclanthology.org/N15-1124/), [ESA](https://aclanthology.org/2024.wmt-1.131/), [ESA<sup>AI</sup>](https://aclanthology.org/2025.naacl-long.255/), [MQM](https://doi.org/10.1162/tacl_a_00437), and more!).

[![PyPi version](https://badgen.net/pypi/v/pearmut/)](https://pypi.org/project/pearmut)
&nbsp;
[![PyPI download/month](https://img.shields.io/pypi/dm/pearmut.svg)](https://pypi.python.org/pypi/pearmut/)
&nbsp;
[![PyPi license](https://badgen.net/pypi/license/pearmut/)](https://pypi.org/project/pearmut/)
&nbsp;
[![build status](https://github.com/zouharvi/pearmut/actions/workflows/test.yml/badge.svg)](https://github.com/zouharvi/pearmut/actions/workflows/test.yml)

<img width="1000" alt="Screenshot of ESA/MQM interface" src="https://github.com/user-attachments/assets/4fb9a1cb-78ac-47e0-99cd-0870a368a0ad" />

## Table of Contents

- [Quick Start](#quick-start)
- [Campaign Configuration](#campaign-configuration)
  - [Basic Structure](#basic-structure)
  - [Assignment Types](#assignment-types)
- [Advanced Features](#advanced-features)
  - [Pre-filled Error Spans (ESA<sup>AI</sup>)](#pre-filled-error-spans-esaai)
  - [Tutorial and Attention Checks](#tutorial-and-attention-checks)
  - [Pre-defined User IDs and Tokens](#pre-defined-user-ids-and-tokens)
  - [Multimodal Annotations](#multimodal-annotations)
  - [Hosting Assets](#hosting-assets)
- [Campaign Management](#campaign-management)
- [CLI Commands](#cli-commands)
- [Terminology](#terminology)
- [Development](#development)
- [Citation](#citation)

## Quick Start

Install and run locally without cloning:
```bash
pip install pearmut
# Download example campaigns
wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/esa.json
wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/da.json
# Load and start
pearmut add esa.json da.json
pearmut run
```

## Campaign Configuration

### Basic Structure

Campaigns are defined in JSON files (see [examples/](examples/)). The simplest configuration uses `task-based` assignment where each user has pre-defined tasks:
```python
{
  "info": {
    "assignment": "task-based",
    # DA: scores
    # ESA: error spans and scores
    # MQM: error spans, categories, and scores
    "protocol": "ESA", 
  },
  "campaign_id": "wmt25_#_en-cs_CZ",
  "data": [
    # data for first task/user
    [
      [
        # each evaluation item is a document
        {
          "instructions": "Evaluate translation from en to cs_CZ",  # message to show to users above the first item
          "src": "This will be the year that Guinness loses its cool. Cheers to that!",
          "tgt": {"modelA": "Nevím přesně, kdy jsem to poprvé zaznamenal. Možná to bylo ve chvíli, ..."}
        },
        {
          "src": "I'm not sure I can remember exactly when I sensed it. Maybe it was when some...",
          "tgt": {"modelA": "Tohle bude rok, kdy Guinness přijde o svůj „cool“ faktor. Na zdraví!"}
        }
        ...
      ],
      # more document
      ...
    ],
    # data for second task/user
    [
        ...
    ],
    # arbitrary number of users (each corresponds to a single URL to be shared)
  ]
}
```
Task items are protocol-specific. For ESA/DA/MQM protocols, each item is a dictionary representing a document unit:
```python
[
  {
    "src": "A najednou se všechna tato voda naplnila dalšími lidmi a dalšími věcmi.",  # required
    "tgt": {"modelA": "And suddenly all the water became full of other people and other people."}  # required (dict)
  },
  {
    "src": "toto je pokračování stejného dokumentu",
    "tgt": {"modelA": "this is a continuation of the same document"}
    # Additional keys stored for analysis
  }
]
```

Load campaigns and start the server:
```bash
pearmut add my_campaign.json  # Use -o/--overwrite to replace existing
pearmut run
```

### Assignment Types

- **`task-based`**: Each user has predefined items
- **`single-stream`**: All users draw from a shared pool (random assignment)
- **`dynamic`**: work in progress ⚠️

## Advanced Features

### Shuffling Model Translations

By default, Pearmut randomly shuffles the order in which models are shown per each item in order to avoid positional bias.
The `shuffle` parameter in campaign `info` controls this behavior:
```python
{
  "info": {
    "assignment": "task-based",
    "protocol": "ESA",
    "shuffle": true  # Default: true. Set to false to disable shuffling.
  },
  "campaign_id": "my_campaign",
  "data": [...]
}
```

### Pre-filled Error Spans (ESA<sup>AI</sup>)

Include `error_spans` to pre-fill annotations that users can review, modify, or delete:

```python
{
  "src": "The quick brown fox jumps over the lazy dog.",
  "tgt": {"modelA": "Rychlá hnědá liška skáče přes líného psa."},
  "error_spans": {
    "modelA": [
      {
        "start_i": 0,         # character index start (inclusive)
        "end_i": 5,           # character index end (inclusive)
        "severity": "minor",  # "minor", "major", "neutral", or null
        "category": null      # MQM category string or null
      },
      {
        "start_i": 27,
        "end_i": 32,
        "severity": "major",
        "category": null
      }
    ]
  }
}
```

The `error_spans` field is a 2D array (one per candidate). See [examples/esaai_prefilled.json](examples/esaai_prefilled.json).

### Tutorial and Attention Checks

Add `validation` rules for tutorials or attention checks:

```python
{
  "src": "The quick brown fox jumps.",
  "tgt": {"modelA": "Rychlá hnědá liška skáče."},
  "validation": {
    "modelA": [
      {
        "warning": "Please set score between 70-80.",  # shown on failure (omit for silent logging)
        "score": [70, 80],                             # required score range [min, max]
        "error_spans": [{"start_i": [0, 2], "end_i": [4, 8], "severity": "minor"}],  # expected spans
        "allow_skip": true                             # show "skip tutorial" button
      }
    ]
  }
}
```

**Types:**
- **Tutorial**: Include `allow_skip: true` and `warning` to let users skip after feedback
- **Loud attention checks**: Include `warning` without `allow_skip` to force retry
- **Silent attention checks**: Omit `warning` to log failures without notification (quality control)

The `validation` field is an array (one per candidate). Dashboard shows ✅/❌ based on `validation_threshold` in `info` (integer for max failed count, float \[0,1\) for max proportion, default 0).

**Score comparison:** Use `score_greaterthan` to ensure one candidate scores higher than another:
```python
{
  "src": "AI transforms industries.",
  "tgt": {"A": "UI transformuje průmysly.", "B": "Umělá inteligence mění obory."},
  "validation": {
    "A": [
      {"warning": "A has error, score 20-40.", "score": [20, 40]}
    ],
    "B": [
      {"warning": "B is correct and must score higher than A.", "score": [70, 90], "score_greaterthan": "A"}
    ]
  }
}
```
The `score_greaterthan` field specifies the index of the candidate that must have a lower score than the current candidate.
See [examples/tutorial_kway.json](examples/tutorial_kway.json).

### Single-stream Assignment

All annotators draw from a shared pool with random assignment:
```python
{
    "campaign_id": "my campaign 6",
    "info": {
        "assignment": "single-stream",
        # DA: scores
        # MQM: error spans and categories
        # ESA: error spans and scores
        "protocol": "ESA",
        "users": 50,                           # number of annotators (can also be a list, see below)
    },
    "data": [...], # list of all items (shared among all annotators)
}
```


### Pre-defined User IDs and Tokens

The `users` field accepts:
- **Number** (e.g., `50`): Generate random user IDs
- **List of strings** (e.g., `["alice", "bob"]`): Use specific user IDs
- **List of dictionaries**: Specify custom tokens:
```python
{
    "info": {
        ...
        "users": [
            {"user_id": "alice", "token_pass": "alice_done", "token_fail": "alice_fail"},
            {"user_id": "bob", "token_pass": "bob_done"}  # missing tokens are auto-generated
        ],
    },
    ...
}
```

### Multimodal Annotations

Support for HTML-compatible elements (YouTube embeds, `<video>` tags, images). Ensure elements are pre-styled. See [examples/multimodal.json](examples/multimodal.json).

<img width="1000" alt="Preview of multimodal elements in Pearmut" src="https://github.com/user-attachments/assets/77c4fa96-ee62-4e46-8e78-fd16e9007956" />

### Hosting Assets

Host local assets (audio, images, videos) using the `assets` key:

```python
{
    "campaign_id": "my_campaign",
    "info": { 
      "assets": {
        "source": "videos",                    # Source directory
        "destination": "assets/my_videos"      # Mount path (must start with "assets/")
      }
    },
    "data": [ ... ]
}
```

Files from `videos/` become accessible at `localhost:8001/assets/my_videos/`. Creates a symlink, so source directory must exist throughout annotation. Destination paths must be unique across campaigns.

## CLI Commands

- **`pearmut add <file(s)>`**: Add campaign JSON files (supports wildcards)
  - `-o/--overwrite`: Replace existing campaigns with same ID
  - `--server <url>`: Server URL prefix (default: `http://localhost:8001`)
- **`pearmut run`**: Start server
  - `--port <port>`: Server port (default: 8001)
  - `--server <url>`: Server URL prefix
- **`pearmut purge [campaign]`**: Remove campaign data
  - Without args: Purge all campaigns
  - With campaign name: Purge specific campaign only

## Campaign Management

Management link (shown when adding campaigns or running server) provides:
- Annotator progress overview
- Access to annotation links
- Task progress reset (data preserved)
- Download progress and annotations

<img width="1000" alt="Management dashboard" src="https://github.com/user-attachments/assets/8953252c-d7b1-428c-a974-5bc7501457c7" />

Completion tokens are shown at annotation end for verification (download correct tokens from dashboard). Incorrect tokens can be shown if quality control fails.

<img width="500" alt="Token on completion" src="https://github.com/user-attachments/assets/40eb904c-f47a-4011-aa63-9a4f1c501549" />

When tokens are supplied, the dashboard will try to show model rankings based on the names in the dictionaries.

## Terminology

- **Campaign**: An annotation project that contains configuration, data, and user assignments. Each campaign has a unique identifier and is defined in a JSON file.
  - **Campaign File**: A JSON file that defines the campaign configuration, including the campaign ID, assignment type, protocol settings, and annotation data.
  - **Campaign ID**: A unique identifier for a campaign (e.g., `"wmt25_#_en-cs_CZ"`). Used to reference and manage specific campaigns. Typically a campaign is created for a specific language and domain.
- **Task**: A unit of work assigned to a user. In task-based assignment, each task consists of a predefined set of items for a specific user.
- **Item**: A single annotation unit within a task. For translation evaluation, an item typically represents a document (source text and target translation). Items can contain text, images, audio, or video.
- **Document**: A collection of one or more segments (sentence pairs or text units) that are evaluated together as a single item.
- **User** / **Annotator**: A person who performs annotations in a campaign. Each user is identified by a unique user ID and accesses the campaign through a unique URL.
- **Attention Check**: A validation item with known correct answers used to ensure annotator quality. Can be:
  - **Loud**: Shows warning message and forces retry on failure
  - **Silent**: Logs failures without notifying the user (for quality control analysis)
  - **Token**: A completion code shown to users when they finish their annotations. Tokens verify the completion and whether the user passed quality control checks:
    - **Pass Token** (`token_pass`): Shown when user meets validation thresholds
    - **Fail Token** (`token_fail`): Shown when user fails to meet validation requirements
- **Tutorial**: An instructional validation item that teaches users how to annotate. Includes `allow_skip: true` to let users skip if they have seen it before.
- **Validation**: Quality control rules attached to items that check if annotations match expected criteria (score ranges, error span locations, etc.). Used for tutorials and attention checks.
- **Model**: The system or model that generated the output being evaluated (e.g., `"GPT-4"`, `"Claude"`). Used for tracking and ranking model performance.
- **Dashboard**: The management interface that shows campaign progress, annotator statistics, access links, and allows downloading annotations. Accessed via a special management URL with token authentication.
- **Protocol**: The annotation scheme defining what data is collected:
  - **Score**: Numeric quality rating (0-100)
  - **Error Spans**: Text highlights marking errors with severity (`minor`, `major`)
  - **Error Categories**: MQM taxonomy labels for errors
- **Template**: The annotation interface type. The `basic` template supports comparing multiple outputs simultaneously.
- **Assignment**: The method for distributing items to users:
  - **Task-based**: Each user has predefined items
  - **Single-stream**: Users draw from a shared pool with random assignment
  - **Dynamic**: Work in progress

## Development

Server responds to data-only requests from frontend (no template coupling). Frontend served from pre-built `static/` on install.

### Local development:
```bash
cd pearmut
# Frontend (separate terminal, recompiles on change)
npm install web/ --prefix web/
npm run build --prefix web/
# optionally keep running indefinitely to auto-rebuild
npm run watch --prefix web/

# Install as editable
pip3 install -e .
# Load examples
pearmut add examples/wmt25_#_en-cs_CZ.json examples/wmt25_#_cs-de_DE.json
pearmut run
```

### Creating new protocols:
1. Add HTML and TS files to `web/src`
2. Add build rule to `webpack.config.js`
3. Reference as `info->template` in campaign JSON

See [web/src/basic.ts](web/src/basic.ts) for example.

### Deployment

Run on public server or tunnel local port to public IP/domain and run locally.

## Misc.

If you use this work in your paper, please cite as following.
```bibtex
@misc{zouhar2025pearmut,
    author={Vilém Zouhar},
    title={Pearmut: Platform for Evaluating and Reviewing of Multilingual Tasks},
    url={https://github.com/zouharvi/pearmut/},
    year={2026},
}
```

Contributions are welcome! Please reach out to [Vilém Zouhar](mailto:vilem.zouhar@gmail.com).
