Introduction

Status: Current Last modified: 2026-06-21 21:33 EDT

TalkBank is the world’s largest open repository of spoken language data. This repository (TalkBank/chatter) is the standalone home of the CHAT format authority and the chatter tool family: the chatter CLI, the Rust crates for parsing/validation/transformation, the tree-sitter-talkbank grammar, the talkbank-lsp language server, and the desktop validation app.

chatter is publicly released. To get it right away:

Command-line tool (macOS / Linux): curl --proto '=https' --tlsv1.2 -LsSf https://github.com/TalkBank/chatter/releases/latest/download/chatter-installer.sh | sh (Windows and other options: Install).
Desktop app: download for your platform from the latest release.
Full installation guide (all platforms, package details): Install.

The Rust crates are source-available from this repository (not yet published to crates.io). Because chatter is still a 0.x release, APIs and flags may change before 1.0.

Choose the right surface

Task	Recommended Surface
CHAT validation, normalization, or conversion	`chatter` CLI
LSP integration in editors	`talkbank-lsp` standalone
Build CHAT tooling in Rust	Rust crates (`talkbank-model`, `talkbank-parser`, etc.)
Reuse grammar in other tools	`tree-sitter-talkbank`
Standalone desktop GUI for CHAT validation	Chatter Desktop (`apps/chatter-desktop/`)

What’s In This Repo

chatter CLI: validate, convert, normalize, and analyze CHAT files from the command line, with an interactive TUI for corpus-scale workflows
Language Server (LSP): works with any LSP-compatible editor (Neovim, Emacs, Helix, Zed, etc.) to provide live validation and cross-tier alignment
JSON data model: every CHAT structure as typed JSON with lossless roundtrip fidelity, backed by a published JSON Schema
Rust API: parse, validate, inspect, and transform CHAT files programmatically via library crates

Who This Book Is For

Audience	Start Here	Then Go To
CLI users validating, normalizing, or converting CHAT	Install	chatter Quick Start, CLI Reference
Rust library consumers parsing or transforming CHAT	Library Usage	crate-root rustdoc for `talkbank-model`, `talkbank-parser`, and `talkbank-transform`
Grammar / format consumers embedding CHAT parsing in other tools	CHAT Format Overview	`tree-sitter-talkbank` docs and the grammar/reference chapters
Contributors / maintainers working in this repo	Contributing setup	CI and release

Repository Layout

grammar/        Tree-sitter grammar for CHAT
spec/           Source of truth: CHAT specification + error specs
crates/         Rust crates for model, parser, transform, cache, CLI, LSP, tests, and FFI support
apps/           Tauri v2 desktop app (`chatter-desktop`)
corpus/         Reference corpus (must stay 100% valid under the regression gate)
schema/         JSON Schema for the CHAT AST
tests/          Integration tests and fixtures
fuzz/           Fuzz testing targets (separate Cargo workspace)
docs/           Strategy docs, proposals, and investigations for this repo
book/           This documentation (mdBook)

Data flows: spec (source of truth) → grammar (tree-sitter) → Rust crates (parsers, model, validation, CLI, LSP) → applications (chatter, desktop app).

Install

Status: Current Last modified: 2026-07-07 21:20 EDT

Everything here comes from the latest release.

Not a programmer and just want to check CHAT files? Get the desktop app below; you never need a terminal.

Chatter desktop app (recommended for most people)

The Chatter app checks CHAT transcripts in an ordinary window: open a file, see the problems highlighted, fix them, and re-check. No terminal and no setup, and it updates itself when a new version comes out.

macOS

The Mac app is signed and notarized by Apple, so it opens normally (no security warnings).

Download Chatter for your Mac:
- Apple Silicon (M1/M2/M3/M4, essentially every Mac sold since late 2020): Download Chatter for Apple Silicon.
- Intel (older models): Download Chatter for Intel Mac.
Not sure which you have? Open the Apple menu, then About This Mac: if it says “Apple M…”, it is Apple Silicon.
Open the downloaded .dmg file.
Drag Chatter onto the Applications folder in the window that appears.
Open Chatter from your Applications folder (or Launchpad).

Windows (Intel/AMD 64-bit, “x64”)

Download Chatter for Windows and run the installer. Windows binaries are not code-signed yet, so SmartScreen may warn on first run: choose More info, then Run anyway.

Linux (Intel/AMD 64-bit, “x86_64”)

Download the Chatter AppImage (make it executable, then run it) or the .deb package (install it with your package manager).

(The desktop app is x86_64-only on Windows and Linux today; macOS has both Apple Silicon and Intel builds. The chatter command-line tool below also ships a Linux ARM build.)

`chatter`, the command-line tool (for programmers and automation)

If you are comfortable in a terminal, the chatter CLI validates, normalizes, converts (JSON / XML), lints, watches, and batch-processes CHAT files, and is the right tool for scripting and CI.

macOS / Linux:

curl --proto '=https' --tlsv1.2 -LsSf https://github.com/TalkBank/chatter/releases/latest/download/chatter-installer.sh | sh

Windows (PowerShell):

irm https://github.com/TalkBank/chatter/releases/latest/download/chatter-installer.ps1 | iex

Then run chatter --help. Full reference: CLI installation and CLI Reference. chatter self-updates with chatter update.

`talkbank-lsp` language server (editor integration)

For live CHAT validation, hover, go-to-definition, and cross-tier alignment inside an LSP-aware editor (Neovim, Emacs, Helix, Zed, VS Code, and others), install the standalone, code-signed language server:

macOS / Linux:

curl --proto '=https' --tlsv1.2 -LsSf https://github.com/TalkBank/chatter/releases/latest/download/talkbank-lsp-installer.sh | sh

Windows (PowerShell):

irm https://github.com/TalkBank/chatter/releases/latest/download/talkbank-lsp-installer.ps1 | iex

Or download the per-platform archive (talkbank-lsp-<target>.tar.xz, or .zip on Windows) from the release and point your editor’s LSP client at the talkbank-lsp binary (it speaks LSP over stdio on .cha files, language id chat).

Rust crates and the grammar (embed in your own program)

To embed CHAT parsing / validation / transformation in your own program, depend on the talkbank-* crates and the tree-sitter-talkbank grammar. They are source-available from this repository (not yet published to crates.io). See Library usage and the CHAT format overview.

Because chatter is still a 0.x release, APIs and flags may change before 1.0; see the Release Notes. For audio + ML pipelines (transcribe, force-align, morphotag), see the upstream batchalign3 project, which has its own installation flow.

Quickstart

Status: Current Last modified: 2026-06-21 21:33 EDT

Task-driven entry points. Pick the row that matches what you want to do today; each path starts at the narrowest useful documentation surface instead of dropping you into the whole book.

Today’s goal	Best first page	Surface
Validate / normalize / convert existing CHAT	chatter Quick Start	CLI
Add CHAT parsing/validation to a Rust program	Library Usage	Rust crates

To download and install chatter, see Install.

For audio + ML workflows (transcribe / align media → CHAT), see the upstream batchalign3 project, outside the chatter repo.

Changelog

All notable changes to this project are documented in this file.

The format is based on Keep a Changelog, and the project follows Semantic Versioning. Before 1.0, breaking changes to the CLI or library APIs bump the minor version and are listed under “Changed” / “Removed”.

Unreleased

[0.3.6] - 2026-07-17

Fixed

The Phon %x-tier content checks (introduced with the %x fold-in) no longer mass-flag valid Phon exports. Two wild-corpus conventions the original specification never confronted are now accepted:
- Pause fillers ((.), (..), (...)) mirrored at the same word position on %mod/%pho/%xmodsyl/%xphosyl (and as pause pairs on %xphoaln) to keep word-aligned tiers in index lockstep. E735 previously rejected these as malformed phone:CODE units (roughly 13,000 spurious errors across the PhonBank corpora).
- ^ and IPA . syllable-boundary notation in %mod/%pho words, which the segment-level %xphoaln reconstruction comparison now ignores exactly as it already ignored stress markers (roughly 770 spurious E740/E741).
Genuine misalignments (index-shift chains, pause fillers standing in for real words) are still reported. Users who adopted --suppress xphon to silence the storm can remove it and regain the genuine %x-tier checks.
Generated error-documentation pages (docs/errors/) no longer fuse words across wrapped spec lines or drop backticked text: the spec text extractor now renders soft line breaks as spaces and includes inline code spans.
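
To illustrate the index-lockstep convention from the %x-tier fix above, the sketch below reports word positions where a pause filler appears on one word-aligned tier but not on the other. It is not chatter's implementation: `is_pause_filler` and `unmirrored_pauses` are hypothetical names, and each tier is simplified to a slice of already-split tokens.

```rust
/// True for the three pause-filler spellings named in the changelog entry.
fn is_pause_filler(token: &str) -> bool {
    matches!(token, "(.)" | "(..)" | "(...)")
}

/// Word positions where exactly one of the two tiers has a pause filler.
/// Simplification (assumption): tiers are compared pairwise up to the
/// shorter length; a length mismatch is not reported here.
fn unmirrored_pauses(tier_a: &[&str], tier_b: &[&str]) -> Vec<usize> {
    let mut mismatches = Vec::new();
    for (index, (a, b)) in tier_a.iter().zip(tier_b.iter()).enumerate() {
        if is_pause_filler(a) != is_pause_filler(b) {
            mismatches.push(index);
        }
    }
    mismatches
}
```

With the filler at the same index on both tiers the result is empty; if one tier carries it at index 1 and the other at index 2, both positions are returned.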

Added

New validation rule E752: timing bullets without an @Media header. A transcript carrying timing evidence (utterance bullets or %wor word timing) must declare the media those timestamps index; completes the media-consistency family (E544: declared linkage without timing; E552: declared unlinked contradicted by timing). Mirrors CLAN CHECK error 112.
New validation rule E753: a word consisting only of a repetition segment (fully ↫...↫-wrapped, no stem outside the delimiters) is rejected; word-category prefixes (&- filler, &~ nonword, 0 omission) count as a stem. Adopted from GUI CLAN CHECK error 151 as a chatter-authority rule (the unix CHECK build never enforced it).
New validation rule E754: the @l letter form must carry exactly one letter of stem (b@l); multi-letter content belongs under @k / @ls. Repeated-segment material (↫b^↫b@l) does not count toward the stem, matching real CLAN CHECK behavior. Mirrors CLAN CHECK error 76.
New validation rule E755: a [- CODE] utterance-level language must be declared in @Languages (utterance-level presence is substantial). Mirrors CLAN CHECK error 152.
Word-level explicit language codes (word@s:CODE) are now validated against the ISO 639-3 registry (E519), the same rule that guards @Languages and @ID; declaration in @Languages remains not required.
@L1 of values are now typed ISO 639-3 language codes and validated against the registry (E519), completing registry validation at every position language codes appear. Wild usage was already uniformly codes; generation via build_chat now takes a LanguageCode for the participant first language.
E756 (empty user-defined %x tier) replaces W601: the rejection is unchanged; the old code fired as a hard error despite its warning prefix, so the number was the bug. The diagnostic message also no longer double-prefixes the tier name (%xfoo, not %xxfoo).
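
The media-consistency family completed by E752 can be read as a small decision table. The sketch below is a simplified illustration, not the validator's code: `MediaDecl` and `media_consistency_code` are hypothetical names, and the @Media header is reduced to three states even though real headers may carry other status values.

```rust
/// Simplified view of what the @Media header declares (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MediaDecl {
    /// No @Media header at all.
    Absent,
    /// @Media present and not marked `unlinked`, so linkage is claimed.
    Linked,
    /// @Media present and marked `unlinked`.
    Unlinked,
}

/// Returns the media-consistency code that applies, if any.
/// `has_timing` is true when the transcript carries utterance bullets
/// or %wor word timing.
fn media_consistency_code(decl: MediaDecl, has_timing: bool) -> Option<&'static str> {
    match (decl, has_timing) {
        // Timing evidence with no declared media to index.
        (MediaDecl::Absent, true) => Some("E752"),
        // Declared linkage without any timing.
        (MediaDecl::Linked, false) => Some("E544"),
        // Declared unlinked, contradicted by timing.
        (MediaDecl::Unlinked, true) => Some("E552"),
        // Every other combination is consistent.
        _ => None,
    }
}
```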

Removed

The E254 warning (word-level @s:CODE not listed in @Languages) is retired: an explicit word-level language code is self-contained and deliberately carries no declaration requirement. @Languages declares the transcript’s substantial languages; a one-word insertion is not substantial presence. (This matches CLAN CHECK, which dropped its own @s declaration requirement in 2019.)

[0.3.5] - 2026-07-15

Emergency release restoring corpus-correct word parsing. Versions 0.3.3 and 0.3.4 have been YANKED (releases and tags removed).

Fixed

Reverted the whitespace-boundary overlap-custody grammar introduced in 0.3.3. Its GLR-arbitrated word readings fragmented words carrying four or more glued markers (for example multi-syllable-pause chains like or^ga^ni^zi^ra), causing spurious E252/E331/E600/E705 validation errors across real corpora and, worse, a serialization mutation (a space inserted into such words on rewrite). Word parsing is restored to the 0.3.2 grammar, verified by an error-code differential and a roundtrip comparison against the 0.3.2 binary over a corpus sample: identical profiles.
A regression test pins that multi-marker words parse as one word and validate cleanly.

Retained from the yanked releases

Typed @u phonetic word forms (UNIBET).
The build_chat header emitters and @ID demographics fix.
The shared English capitalization transform.
The long-tier stack-overflow fix and its regression test.
The SQLite cache concurrency-safety fix; CI runs under nextest.

[0.3.4] - 2026-07-15 [YANKED]

Added

@u phonetic forms are now typed phonetic content. A @u word (a UNIBET/IPA phonetic transcription standing in a word slot, e.g. the spoken side of an aphasia [: target] replacement) now models its content as a dedicated WordContent::Phonetic(WordPhonetic) node instead of orthographic text, in both parsers. Orthographic word-hygiene rules structurally cannot apply to phonetic content; the phonetic string itself stays deliberately lenient (IPA, ASCII UNIBET, X-SAMPA), matching the %pho tier’s stance. to-json emits {"type": "phonetic", ...} for these nodes (schema updated); cleaned_text remains the phonetic string verbatim; the sanitizer redacts phonetic forms like spoken text. Scope is @u only; sibling special forms remain orthographic words.
build_chat now emits the full standard header set. The general CHAT-generation schema (TranscriptDescription / ParticipantDesc) gained typed optional fields for @Date, @Situation, @Options, @Transcriber, @Comment, per-speaker @L1 of, and @PID (preserved from a source, never minted), each emitted in canonical header order. @ID demographics (age, sex, group, SES, education, custom) are now carried through ParticipantDesc instead of being silently dropped, fixing empty demographic slots in generated @ID headers.
Shared English capitalization transform (talkbank_transform::capitalize): capitalizes the pronoun “I” family and the first real word of each utterance on the typed model, for generators whose sources are all-lowercase (improves downstream %mor accuracy). Token-level helpers are public for generators that capitalize their own word representation.
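
The capitalization rule described above can be sketched on plain tokens. This is only an approximation under stated assumptions: the real transform works on the typed model, whereas the hypothetical helpers below (`capitalize_first`, `is_pronoun_i`, `is_real_word`, `capitalize_utterance`) work on string slices, treat "i" plus its apostrophe contractions as the "I" family, and treat any token starting with a letter as a real word.

```rust
/// Uppercase the first character of `word`, leaving the rest untouched.
fn capitalize_first(word: &str) -> String {
    let mut chars = word.chars();
    match chars.next() {
        Some(first) => first.to_uppercase().chain(chars).collect(),
        None => String::new(),
    }
}

/// Assumed definition of the pronoun "I" family: "i", "i'm", "i'll", ...
fn is_pronoun_i(token: &str) -> bool {
    token == "i" || token.starts_with("i'")
}

/// Assumed stand-in for "real word": the token starts with a letter,
/// so fillers like "&-um" and terminators like "." are skipped.
fn is_real_word(token: &str) -> bool {
    token.chars().next().map_or(false, char::is_alphabetic)
}

/// Capitalize the first real word and every "I"-family token.
fn capitalize_utterance(tokens: &[&str]) -> Vec<String> {
    let mut seen_first_word = false;
    tokens
        .iter()
        .map(|&token| {
            let is_first = !seen_first_word && is_real_word(token);
            if is_first {
                seen_first_word = true;
            }
            if is_first || is_pronoun_i(token) {
                capitalize_first(token)
            } else {
                token.to_string()
            }
        })
        .collect()
}
```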

Fixed

chatter validate no longer headlines a warnings-only file as an error. A file whose findings are all warnings (which is valid CHAT, and was already counted valid in the summary) now prints ⚠ Warnings in <file> instead of the contradictory ✗ Errors found in <file>, and the “fix structural errors first” hint fires only on hard errors. Presentation only; validation logic unchanged.
The validation cache no longer fails to initialize when opened concurrently. Two chatter runs sharing a cache directory (or a multi-threaded consumer) could race the one-time SQLite setup and hit UNIQUE constraint failed: _sqlx_migrations.version or a WAL init collision, silently disabling caching for that run. Concurrent opens on a fresh cache directory now retry the transient init race and all succeed.
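
The retry behavior described for concurrent cache opens follows a common pattern, sketched generically below. Nothing here is taken from the cache crate: `retry_init`, its parameters, and the linear backoff are assumptions chosen for illustration, and deciding which errors count as transient is left to the caller.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Call `init` until it succeeds, retrying only errors that `is_transient`
/// accepts, for at most `max_attempts` calls (always at least one).
fn retry_init<T, E>(
    max_attempts: u32,
    backoff: Duration,
    is_transient: impl Fn(&E) -> bool,
    mut init: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 1;
    loop {
        match init() {
            Ok(value) => return Ok(value),
            // Transient failure with attempts left: wait a little longer
            // each time (linear backoff), then try again.
            Err(err) if attempt < max_attempts && is_transient(&err) => {
                sleep(backoff * attempt);
                attempt += 1;
            }
            // Permanent failure, or attempts exhausted.
            Err(err) => return Err(err),
        }
    }
}
```

A caller would pass a predicate that recognizes the setup race (for example, an error message containing the UNIQUE-constraint text quoted above) and let every other error surface immediately.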

[0.3.3] - 2026-07-13 [YANKED]

Added

Desktop app: a “Check for Updates…” menu item and a periodic background update check. The app previously checked for a new release only at launch, so an app that was rarely relaunched could sit far behind. It now also checks every six hours in the background, and the app menu has a manual “Check for Updates…” item that reports when you are already up to date.
Desktop app: a real “About Chatter” panel with the version, a short description, and clickable links to the TalkBank site and the source repository, replacing the bare version-only default.
talkbank_transform::build_chat: assemble a validated CHAT file from a typed transcript description. Given participants, optional media, and utterances as pre-formatted CHAT main-tier text (TranscriptDescription), it synthesizes the header block, parses each utterance through the tree-sitter parser, and returns a ChatFile. The description carries a media_status, so a transcript that names its media but has no timing bullets yet (pre-forced-alignment) can emit @Media: <id>, audio, unlinked and stay valid instead of falsely claiming linkage (E544).
talkbank_transform::num_words::expand_number: spell digit tokens as language-appropriate number words (13 lookup-table languages, CJK, and English ordinals/decades), so generated CHAT satisfies E220 (numeric digits are not allowed in words for languages that do not permit them).

Changed

Overlap custody now follows whitespace boundaries, with canonical overlap serialization. Overlap markers bind to the token on the correct side of a whitespace boundary, and serialization emits a single canonical form.
tree-sitter updated to 0.26.11 across the workspace (CLI, grammar bindings, and the generated parser).

Fixed

Long dependent-tier reconstruction is now linear-time. A quadratic blowup on very long utterance tiers is eliminated; pathological inputs that previously stalled the parser now reconstruct in linear time.
Desktop app: the validation settings popover no longer opens hidden behind the results panel. It was rendered below the panels in the stacking order; it now sits above them.
Desktop app: the “up to date” dialog now dismisses on the first OK. A listener leak (an async menu subscription whose cleanup could run before it resolved) let duplicate listeners accumulate, so one menu click stacked several identical dialogs.

[0.3.2] - 2026-07-10

Added

chatter rediarize: repair speaker attribution from external diarization turns. Takes a transcript whose utterance timing is trusted but whose speaker labels are not, plus a speaker-turns JSON file ({"source": ..., "turns": [{"track", "start_ms", "end_ms"}]}) from an external diarizer, and re-attributes each timed utterance to the dominant overlapping turn. Utterances with no turn coverage are flagged, never guessed. Reconciled @ID rows are inserted in the header block. --summary-json emits a machine-readable outcome summary (per-utterance reattributions and flag reasons) for downstream tooling.
Four validation rules for constructs that do not make sense, each adjudicated against real CLAN CHECK behavior and the wild corpus: E748 leading-zero media-bullet times; E749 comma glued to the following word; E750 whitespace inside angle-group delimiters; E751 pause marker glued to a word.
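
The "dominant overlapping turn" rule used by chatter rediarize can be sketched as follows. This is an illustration under assumptions, not the shipped algorithm: the `Turn` struct mirrors the `track` / `start_ms` / `end_ms` keys of the turns JSON but its field types are guessed, `dominant_track` is a hypothetical name, and the tie-breaking shown is arbitrary.

```rust
use std::collections::BTreeMap;

/// One diarizer turn (field types are assumptions).
#[derive(Debug, Clone)]
struct Turn {
    track: String,
    start_ms: u64,
    end_ms: u64,
}

/// Milliseconds shared by the half-open intervals [a_start, a_end)
/// and [b_start, b_end); zero when they do not intersect.
fn overlap_ms(a_start: u64, a_end: u64, b_start: u64, b_end: u64) -> u64 {
    a_end.min(b_end).saturating_sub(a_start.max(b_start))
}

/// Track with the largest total overlap with the utterance, or `None`
/// when no turn covers it (the utterance is then flagged, never guessed).
fn dominant_track(turns: &[Turn], utt_start: u64, utt_end: u64) -> Option<String> {
    let mut totals: BTreeMap<&str, u64> = BTreeMap::new();
    for turn in turns {
        let shared = overlap_ms(turn.start_ms, turn.end_ms, utt_start, utt_end);
        if shared > 0 {
            *totals.entry(turn.track.as_str()).or_insert(0) += shared;
        }
    }
    // On a tie, `max_by_key` keeps the last maximum in track-name order;
    // that choice is arbitrary here.
    totals
        .into_iter()
        .max_by_key(|&(_, total)| total)
        .map(|(track, _)| track.to_string())
}
```

For turns A = 0..1000 ms and B = 1000..3000 ms, an utterance at 800..2000 ms overlaps A by 200 ms and B by 1000 ms, so it is attributed to B; an utterance at 5000..6000 ms has no coverage and yields `None`.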

Fixed

The re2c oracle lexer now tokenizes short-form parenthesized material the same way the canonical parser does (its catch-all previously swallowed a trailing delimiter), keeping the two independent parsers in cross-check agreement on the new spacing rules.

Changed

Rust toolchain pin bumped to 1.97.0 (CI workflow pins synced); workspace and spec lockfiles refreshed; desktop dependency bumps (jsonschema 0.47, TypeScript 7).
Documentation: an architecture page on overlap-marker binding (why edge-adjacent overlap markers bind into words, the ideal top-level model, and the conversion-layer path); the grammar’s empty-extras (all-whitespace-explicit) design rationale is now recorded at the declaration site.

[0.3.1] - 2026-07-08

Fixed

Every public fallible constructor’s error type is now publicly nameable. LanguageCodeError (from LanguageCode::new), XphointParseError, and PhoalnParseError were not re-exported, so downstream crates could not store them in typed #[source] fields and had to stringify at the boundary; found by the first real downstream consumption of the 0.3.0 API. A new API-surface guard test pins the contract so a constructor error type can never silently become unnameable again.

[0.3.0] - 2026-07-07

Added

--llm-cache <file> (env CHATTER_LLM_CACHE) for holistic speaker-id judgment. A persistent, write-through JSON response cache for speaker-id / pipeline / batch --judgment holistic: an identical request (same endpoint, model, and rendered prompt) is served from the cache instead of making another LLM call, so re-running a batch after a crash or an unrelated code change does not re-pay completed sessions. Absent flag and env variable means uncached, unchanged from before.
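
The cache-hit rule above (same endpoint, model, and rendered prompt) amounts to a map keyed on those three values. The sketch below shows only that idea, in memory: `RequestKey`, `ResponseCache`, and `get_or_fetch` are hypothetical names, and the persistent JSON file behind --llm-cache is not modeled.

```rust
use std::collections::HashMap;

/// Two requests are identical when all three parts match.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct RequestKey {
    endpoint: String,
    model: String,
    prompt: String,
}

/// In-memory stand-in for the persistent response cache.
#[derive(Default)]
struct ResponseCache {
    entries: HashMap<RequestKey, String>,
}

impl ResponseCache {
    /// Return the cached response, or call `fetch` once and store its result.
    fn get_or_fetch(
        &mut self,
        key: RequestKey,
        fetch: impl FnOnce(&RequestKey) -> String,
    ) -> String {
        if let Some(hit) = self.entries.get(&key) {
            return hit.clone();
        }
        let response = fetch(&key);
        // A write-through cache would also persist the entry to disk here,
        // so a crash right after the call does not lose the response.
        self.entries.insert(key, response.clone());
        response
    }
}
```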

Fixed

chatter batch no longer reports holistic suggestions as merges. In holistic-judgment mode the per-session pipeline exits 0 after writing a suggestion to the pending file without merging (the operator adjudicates first); the batch summary counted those as “merged” and reported zero pending work. Outcomes are now classified by whether the merged output actually exists, and the summary separately counts merges, suggestions awaiting adjudication, and low-confidence refusals awaiting adjudication.
E552 (@Media says unlinked but timing exists) now says where the timing was found and how to fix it. When the only timing evidence is word-level bullets inside a %wor tier (invisible in normal display), the message names the %wor tier and offers both remedies (the media is in fact aligned: remove unlinked; or the %wor tier is stale: remove it) instead of asserting the media is linked and pointing at bullets the user cannot see. The main-tier-bullet case keeps its direct advice.
Chatter Desktop’s single-file validation now shares the CLI’s validation engine. Previously, validating a single .cha file in the desktop app (as opposed to its parent folder) bypassed the on-disk cache entirely, skipped the @Media-filename check (E531), and could not honor --roundtrip / --parser / --strict-linkers. All of these now work identically to chatter validate and to the desktop’s own folder validation, and a new Settings panel exposes the equivalent options.
Chatter Desktop no longer shows “N files, all valid” before a run has actually finished. The file tree previously derived this message from the partial, still-streaming result set, so it could flash “all valid” mid-run whenever no error had streamed in yet.

[0.2.1] - 2026-06-24

Added

The talkbank-lsp language server now ships as a standalone release artifact. Prebuilt, code-signed talkbank-lsp binaries for macOS (Apple Silicon and Intel), Linux (x86_64 and aarch64, static musl), and Windows are attached to the GitHub Release, each with its own talkbank-lsp-installer.sh / talkbank-lsp-installer.ps1. Any LSP-aware editor can now install the server without building it from source; it is a first-class artifact in its own right, not only the binary the VS Code extension bundles per platform.

[0.2.0] - 2026-06-23

Added

More of CLAN CHECK’s invalidity is now enforced. A batch of CHECK-parity rules was implemented so chatter validate rejects more invalid CHAT:
- E514: an @ID line’s corpus field is required (CHECK 63).
- E547: a constant participant header must follow the @ID block.
- E548: closes the case CHECK 126 covers.
- E549: a speaker may not be declared twice (CHECK 13).
- Duplicate @ID lines and out-of-order @Options fields (CHECK 13, 125).
- A dependent tier used without being declared (CHECK 17).
- An out-of-range @Time Duration (CHECK 35).
- An @Media header marked unlinked while the transcript still carries timing bullets (CHECK 124), and an @Media filename that does not match the data file (CHECK 157).
- A replacement [: ...] now requires a preceding space (CHECK 161).
- Tree-sitter recovery nodes are surfaced as invalidity rather than silently repaired: a surviving ERROR node maps to E316 and a MISSING node to E342 (with the re2c oracle mirroring it), covering a group with no annotation and swallowed recovery nodes inside comma-list headers (CHECK 5/6/106/108).
Phon: U (unknown) is accepted as a legal syllable-constituent code on the %xmodsyl and %xphosyl tiers.
A formal behavioral CHECK-validity parity test suite that runs real CLAN CHECK and chatter on the same fixtures and fails if either side drifts.

Changed

chatter update now self-updates in process. It embeds the axoupdater self-updater as a library, reads the cargo-dist install receipt (keyed by the package name), and replaces the running binary from GitHub Releases. This removes the package-name coupling that previously made chatter update report “not installed” on a correctly installed binary.
The CLI package is renamed from talkbank-cli to chatter (the crate now lives at crates/chatter/). The generated install scripts are therefore chatter-installer.sh and chatter-installer.ps1 (previously talkbank-cli-installer.*); update any pinned install URL accordingly. The binary is still chatter, and the library/API crates keep their talkbank-* names.
Validation is stricter. Because of the new CHECK-parity rules above, some files that passed chatter validate under 0.1.1 may now report errors. This is intended: chatter is the CHAT-validity authority and is at least as strict as CLAN CHECK.

Removed

The standalone self-updater binary (cargo-dist install-updater = false). The chatter update subcommand is unchanged for users; it now updates in process instead of shelling out to a separate program.

Fixed

The recovery-node invalidity backstop is scoped to localized errors so it does not over-flag, and several malformed @ID test fixtures were corrected.
Hardened the CHECK-parity audit and corrected a CHECK 126 verdict it had falsely certified; the curated CHECK error-code map is restored in place of a brittle keyword heuristic.

[0.1.1] - 2026-06-22

Fixed

Validation cache could serve a stale verdict across rule-set changes. chatter validate keyed its result cache on the cache crate’s package version, which does not change when validation rules change, so a “Valid” result cached before a new rule (such as a retrace-marker check) existed kept being served, while a fresh conversion of the same bytes correctly rejected them. The cache key now folds in a fingerprint over every error-code rule, so adding, removing, or renaming any rule invalidates stale entries; the cache is kept and still functions, only keyed correctly.
CLI usage lines pin the binary name to chatter regardless of the invoked path (clap bin_name).
The book renders Mermaid diagrams again (restored mdbook-mermaid assets).
Desktop app version is now locked to the release version. The desktop bundle (.dmg / .exe / .deb) and the Tauri auto-updater manifest now report the same version as the CLI. A version-sync gate (scripts/sync-app-version.py, enforced in CI and at release time) keeps tauri.conf.json, package.json, the workspace version, and this changelog from drifting, so the updater can never again advertise a version the installed bundle does not match.
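
The cache-key fix in this release (folding a rule-set fingerprint into the key) can be sketched as below. This is an assumption-laden illustration rather than the cache crate's code: `rule_fingerprint` and `cache_key` are hypothetical names, the fingerprint covers only the error-code strings, and `DefaultHasher` is used for brevity even though its output is not guaranteed stable across Rust releases, which a persistent cache would need.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Fingerprint of the rule set: changes when any code is added,
/// removed, or renamed, but not when the codes are merely reordered.
fn rule_fingerprint(rule_codes: &[&str]) -> u64 {
    let mut sorted: Vec<&str> = rule_codes.to_vec();
    sorted.sort_unstable();
    let mut hasher = DefaultHasher::new();
    sorted.hash(&mut hasher);
    hasher.finish()
}

/// Cache key made of a content hash plus the rule fingerprint, so a
/// verdict cached under an older rule set can never match.
fn cache_key(file_bytes: &[u8], rule_codes: &[&str]) -> (u64, u64) {
    let mut hasher = DefaultHasher::new();
    file_bytes.hash(&mut hasher);
    (hasher.finish(), rule_fingerprint(rule_codes))
}
```

Keying on the rule set instead of a package version means the cache survives ordinary releases and is invalidated only when validation behavior can actually differ.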

Changed

CI book toolchain bumped to mdBook 0.5.3 and mdbook-mermaid 0.17.0.
Build: force serialize-javascript >= 7.0.5 to clear advisories, and bump rand in the spec crate.
Docs: the book intro is de-staged for the public release (download-first).

[0.1.0] - 2026-06-15

First public release.

Added

CHAT-format core. A strict, incremental tree-sitter parser (talkbank-parser) with an independent re2c oracle parser (talkbank-parser-re2c) that cross-checks it on every file; a typed CHAT data model with structured validation, error codes, and tier alignment (talkbank-model); and CHAT-to-JSON / JSON-to-CHAT / XML conversion, normalization, transcript-merge, and redaction pipelines (talkbank-transform).
Phon extension tiers. The four Phon %x dependent tiers (%xmodsyl, %xphosyl, %xphoaln, %xphoint) are parsed and validated as first-class CHAT tiers, on by default (pass --suppress xphon to opt out): syllabification constituent codes and phone-vs-source reconstruction, model-to-actual phone alignment, and per-phone time intervals, with dedicated error codes.
chatter CLI. validate, normalize, to-json / from-json / to-xml, merge, speaker-id, batch, pipeline, adjudicate, sanity-scan, lint, clean, watch, new-file, show-alignment, validate-utseg, schema, update, and a content cache.
Language server (talkbank-lsp): real-time validation, hover, go-to-definition, and cross-tier alignment for any LSP-aware editor.
Desktop app (Chatter): a Tauri-based CHAT validation app, shipping in the coordinated release alongside the CLI.
Auto-update. The chatter CLI self-updates with chatter update (the bundled cargo-dist / axoupdater self-updater), and the desktop app checks for and installs new releases on launch (Tauri updater). Both pull from GitHub Releases. The CLI self-updater is experimental.
Prebuilt binaries for macOS (Apple Silicon and Intel), Linux, and Windows, plus desktop installers, attached to the GitHub Release. The macOS desktop .dmg is signed and notarized.

Known limitations

The merge and adjudication surface is experimental. merge, adjudicate, speaker-id, and sanity-scan work, but their interfaces and heuristics may change before 1.0.
Windows binaries are not code-signed yet, so Windows SmartScreen warns on first run (choose “More info” then “Run anyway”). macOS CLI binaries are codesigned but not notarized; install via the release installer script to avoid the Gatekeeper quarantine prompt.
Not on crates.io yet. crates.io publication is deferred.

Installation

Status: Current Last modified: 2026-07-07 21:20 EDT

chatter targets Windows, macOS, and Linux. There are two ways to install it: the prebuilt binaries (recommended for most people, including clinicians and researchers) and a from-source build (for contributors or unsupported platforms).

Prebuilt binaries (recommended)

Every GitHub Release attaches prebuilt binaries for macOS (Apple Silicon and Intel), Linux (x86_64 and ARM64), and Windows (x64), plus desktop-app installers.

chatter CLI

One-line installers (they download the binary for your platform and place it on your PATH; since 0.2.0 no separate updater program is installed, because chatter update runs in process):

macOS and Linux:

curl --proto '=https' --tlsv1.2 -LsSf https://github.com/TalkBank/chatter/releases/latest/download/chatter-installer.sh | sh

Windows (PowerShell):

powershell -ExecutionPolicy Bypass -c "irm https://github.com/TalkBank/chatter/releases/latest/download/chatter-installer.ps1 | iex"

On Windows the binary is not yet code-signed, so SmartScreen may warn on first run: choose More info, then Run anyway. The macOS binaries are codesigned, and the installer above does not set the quarantine attribute, so Gatekeeper does not prompt.

Prefer a manual download? Grab the archive for your platform from the latest release and extract chatter onto your PATH. (On macOS, a browser-downloaded archive is quarantined; right-click the binary and choose Open once, or run xattr -d com.apple.quarantine ./chatter.)

Verify:

chatter --version
chatter --help

chatter desktop app

The desktop app (“Chatter”) is for people who prefer a window to a terminal. Download the installer for your platform from the latest release:

macOS: the .dmg is signed and notarized; open it and drag the app to Applications. No Gatekeeper override is required.
Windows: the installer is not yet signed (same SmartScreen note as above: More info then Run anyway).
Linux: an AppImage and a .deb are provided.

Updating chatter

chatter keeps itself current so you do not have to track releases by hand.

CLI: run
```
chatter update
```
This checks GitHub Releases and replaces the running binary with the newest release, in process (since 0.2.0 there is no separate chatter-update program). The self-update facility is experimental and relies on the install receipt written by the one-line installers above; if you installed another way, update the same way you installed.
Desktop app: the app checks for updates on launch and offers to install a new version when one is available.

From source

Building from source needs only a stable Rust toolchain (install via rustup, which supports Windows, macOS, and Linux). Node.js and the tree-sitter CLI (cargo install tree-sitter-cli) are needed only when working on the grammar or generated artifacts.

Clone and install the CLI:

git clone https://github.com/TalkBank/chatter.git
cd chatter
cargo install --path crates/chatter --locked

This installs the chatter binary to ~/.cargo/bin/ (macOS/Linux) or %USERPROFILE%\.cargo\bin\ (Windows). To update a source install, pull and re-run the cargo install command above (chatter update is only for installer-based installs).

Building the libraries

If you are developing with the Rust crates directly, from your chatter checkout root:

cargo build --workspace --all-targets --locked
cargo test --workspace --locked
cargo clippy --all-targets -- -D warnings

See the contributor setup for additional commands.

Directory layout

Everything lives in a single repository:

<your-chatter-checkout>/
├── grammar/            # Tree-sitter grammar
├── crates/             # All Rust crates (talkbank-* + the chatter binary)
├── spec/               # CHAT specification
├── apps/               # Tauri desktop app (chatter-desktop)
└── book/               # Chatter mdBook (this book)

The CLI, grammar, crates, and the LSP/desktop integrations all live in this single repository.

Quick Start

Status: Current Last modified: 2026-07-13 17:59 EDT

This page gets you from zero to productive with chatter in five minutes. Install chatter first if you haven’t already.

Validate a CHAT file

Check a single transcript for errors:

chatter validate transcript.cha

If the file is valid you get a summary (a cache-statistics block follows it; use --quiet to suppress all output and rely on the exit code):

=== Summary ===
Total files: 1
Valid: 1
Invalid: 0

If there are problems, you’ll see rich diagnostics with the exact location and a stable error code. For example, a *CHI: line missing its terminator:

✗ Errors found in transcript.cha

E305 (https://talkbank.org/errors/E305)

  × error[E305]: Expected terminator not found (line 6, column 1)
   ╭─[input:6:1]
 6 │ *CHI:   hello world
   · ─────────┬─────────
   ·          ╰── here
   ╰────
  help: Add a terminator at the end: Standard (. ? !), Interruption
        (+... +/. ...), or CA intonation (⇗ ↗ → ↘ ⇘ ...)

Every error code (E305, E705, etc.) is documented with fix guidance in the validation error reference.

Not every diagnostic is an error. Some codes are warnings: the file is valid CHAT, but something is worth flagging (for example E254, a word-level @s: language override that is not listed in @Languages). A file whose only diagnostics are warnings is reported as valid, and its heading reflects that:

⚠ Warnings in transcript.cha

E254 (https://talkbank.org/errors/E254)

  ⚠ warning[E254]: Explicit word language 'spa' is not listed in @Languages
   ╭─[input:6:15]
 6 │ *CHI:   hello hola@s:spa .
   ·               ─────┬────
   ·                    ╰── here
   ╰────
  help: Add 'spa' to @Languages or confirm the word-level override is intentional

The summary still counts this file under Valid, and the exit code stays 0.

Validate an entire corpus

Point chatter at a directory. It walks the tree recursively, validates files in parallel, and caches the results:

chatter validate corpus/

The interactive TUI shows progress and lets you browse errors per file. Use --format json for machine-readable output, or --quiet for CI (exit code 1 on errors).

Convert to JSON

Get a structured representation of any CHAT file:

chatter to-json transcript.cha

The output conforms to the TalkBank CHAT JSON Schema. Convert back with chatter from-json.

Watch for changes

Edit a file and get live validation feedback:

chatter watch transcript.cha

Every time you save, chatter re-validates and shows updated diagnostics.

What next?

CLI Reference: all commands, flags, and output formats
Validation Errors: every error code, with examples and fix guidance
Batch Workflows: corpus-scale validation and analysis

CLI Reference

Status: Current Last modified: 2026-07-08 18:41 EDT

The chatter CLI is the primary command-line surface for the TalkBank CHAT toolchain.

The following diagram shows the command dispatch structure. Each top-level command dispatches to a handler in the corresponding crate.

flowchart TD
    chatter(["chatter"])

    chatter --> validate["validate\n(chatter)"]
    chatter --> normalize["normalize\n(chatter)"]
    chatter --> tojson["to-json\n(talkbank-transform)"]
    chatter --> fromjson["from-json\n(talkbank-transform)"]
    chatter --> showalign["show-alignment\n(chatter)"]
    chatter --> watch["watch\n(chatter)"]
    chatter --> lint["lint\n(chatter)"]
    chatter --> clean["clean\n(chatter)"]
    chatter --> newfile["new-file\n(chatter)"]
    chatter --> cache["cache\n(stats, clear)"]
    chatter --> schema["schema\n(JSON Schema output)"]
    chatter --> debug["debug\n(overlap-audit, linker-audit,\nfind, sanitize, fix-s)"]

    chatter --> merge["merge\n(experimental)"]
    chatter --> speakerid["speaker-id\n(experimental)"]
    chatter --> rediarize["rediarize\n(experimental)"]
    chatter --> adjudicate["adjudicate\n(experimental)"]
    chatter --> pipeline["pipeline\n(experimental)"]
    chatter --> batch["batch\n(experimental)"]
    chatter --> sanityscan["sanity-scan\n(experimental)"]

Top-Level Commands

chatter validate PATH...
chatter normalize INPUT
chatter to-json INPUT
chatter from-json INPUT
chatter to-xml INPUT
chatter show-alignment INPUT
chatter watch PATH
chatter lint PATH
chatter clean PATH
chatter new-file
chatter cache stats
chatter cache clear --prefix PATH
chatter schema
chatter debug ...
chatter merge FILE1 FILE2          # experimental: combine two transcripts
chatter speaker-id INPUT           # experimental
chatter rediarize INPUT --turns T  # experimental
chatter adjudicate ...             # experimental
chatter pipeline ...               # experimental
chatter batch ...                  # experimental
chatter sanity-scan ...            # experimental

Use chatter --help or chatter <command> --help for the exact live surface.

`validate`

Validate CHAT file(s) or directory tree(s). Accepts multiple paths.

Usage: chatter validate [OPTIONS] <PATH>...

chatter validate file.cha                         # single file
chatter validate file1.cha file2.cha file3.cha    # multiple files
chatter validate corpus/                          # directory (recursive, parallel)
chatter validate file.cha corpus/ other.cha       # mix of files and directories
chatter validate corpus/ -f json                  # structured JSON output
chatter validate corpus/ --force                  # ignore cache, revalidate everything
chatter validate corpus/ --force --audit out.jsonl # bulk audit to JSONL file
chatter validate corpus/ --suppress xphon         # suppress named error group
chatter validate corpus/ --suppress E726,E727     # suppress specific error codes
chatter validate corpus/ -j 8                     # use 8 parallel workers
chatter validate corpus/ --max-errors 50          # stop after 50 errors

Options:

Flag	Description
`-f, --format text\|json`	Output format (default: text)
`--list-checks`	Print every validation check with Active/Planned status, then exit (no `<PATH>` required)
`--skip-alignment`	Skip dependent-tier alignment checks
`--force`	Ignore cache, revalidate all files
`-j, --jobs N`	Parallel workers for directory mode (default: CPU count)
`--quiet`	Only emit errors, suppress success messages
`--max-errors N`	Stop after N errors across all files
`--roundtrip`	Test serialization idempotency (developer tool)
`--parser tree-sitter\|re2c`	Parser backend (default: tree-sitter; re2c is opt-in for faster batch validation)
`--strict-linkers`	Enable strict cross-utterance linker pairing checks (E351-E355); off by default
`--suppress xphon`	Silence the Phon `%x` dependent-tier checks (E725-E728, E735-E746), which run by default
`--audit FILE`	Stream errors to JSONL file (bulk audit mode)
`--suppress CODES`	Suppress error codes or groups (comma-separated)

Suppress groups: xphon expands to the whole Phon %x dependent-tier validation surface (%xmodsyl/%xphosyl/%xphoaln/%xphoint, codes E725-E728 and E735-E746). These checks run by default; pass --suppress xphon to silence the group. (The old --check-xphon flag is a deprecated no-op kept only so existing scripts do not break.) The --suppress flag can mix groups and codes: --suppress xphon,E316.
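
As a rough illustration of how a mixed value such as xphon,E316 could expand into individual codes, here is a minimal sketch. The names code_range, GROUPS, and expand_suppress are hypothetical, and only the xphon ranges are taken from the text above; chatter's own implementation may differ.

```python
def code_range(first: int, last: int) -> set[str]:
    """Return the codes E<first> through E<last>, inclusive."""
    return {f"E{n}" for n in range(first, last + 1)}


# Only the xphon group is described in this reference; the table is illustrative.
GROUPS = {"xphon": code_range(725, 728) | code_range(735, 746)}


def expand_suppress(value: str) -> set[str]:
    """Expand a comma-separated mix of group names and error codes."""
    codes: set[str] = set()
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        # A known group expands to its members; anything else is kept as a code.
        codes |= GROUPS.get(item, {item})
    return codes
```

Under these assumptions, expand_suppress("xphon,E316") yields the sixteen xphon codes plus E316.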

`normalize`

Serialize a CHAT file into canonical formatting.

chatter normalize input.cha
chatter normalize input.cha -o normalized.cha
chatter normalize input.cha --validate
chatter normalize input.cha --validate --skip-alignment

Flags:

-o, --output <PATH>: write to a file instead of stdout.
--validate: validate (including alignment by default) before writing the normalized output.
--skip-alignment: when paired with --validate, skip the dependent-tier alignment checks (still validates the rest).

normalize writes to stdout unless you pass -o/--output. There is no --in-place flag.

JSON Conversion

# Single file
chatter to-json input.cha                          # pretty-printed JSON to stdout
chatter to-json input.cha --compact                # minified JSON to stdout
chatter to-json input.cha -o output.json           # JSON to file

# Directory (recursive, preserves structure)
chatter to-json corpus/ --output-dir json/          # incremental by default (mtime check)
chatter to-json corpus/ --output-dir json/ --compact # minified output (saves disk)
chatter to-json corpus/ --output-dir json/ --force   # full rebuild
chatter to-json corpus/ --output-dir json/ --prune   # remove orphaned .json files
chatter to-json corpus/ --output-dir json/ --jobs 4  # parallel workers

# Reverse and schema
chatter from-json input.json -o output.cha
chatter schema
chatter schema --url

Single-file mode: to-json validates by default. Use --skip-validation, --skip-alignment, or --skip-schema-validation to bypass checks.

Directory mode: Walks recursively, converting each .cha to .json under --output-dir with the same relative path. Incremental by default: skips files whose JSON is already newer than the source. Use --force to rebuild all. Use --prune to remove .json files with no matching .cha (handles renames/deletions). Use --jobs N for parallel conversion (defaults to number of CPUs).
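
The incremental behavior described above can be pictured with a small sketch. This is not chatter's code: the function names are hypothetical, and treating a JSON file with an identical mtime as stale is an assumption, since the text only says files whose JSON is already newer are skipped.

```python
from pathlib import Path


def json_target(source: Path, corpus_root: Path, output_dir: Path) -> Path:
    """Mirror the .cha file's relative path under output_dir with a .json suffix."""
    return output_dir / source.relative_to(corpus_root).with_suffix(".json")


def needs_rebuild(source: Path, target: Path, force: bool = False) -> bool:
    """True when the JSON is missing, is not newer than the source, or force is set."""
    if force or not target.exists():
        return True
    return target.stat().st_mtime <= source.stat().st_mtime


def orphaned_json(corpus_root: Path, output_dir: Path) -> list[Path]:
    """JSON files with no matching .cha file: the candidates a prune step would remove."""
    return sorted(
        p
        for p in output_dir.rglob("*.json")
        if not (corpus_root / p.relative_to(output_dir)).with_suffix(".cha").exists()
    )
```

Comparing modification times keeps the skip decision cheap, which matters when a corpus holds many thousands of files.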

`to-xml`

Export one CHAT transcript to TalkBank XML. The transcript is validated before any XML is emitted, so an invalid input fails (exit 1) and writes nothing to stdout; a failed export never leaves a partial document. This command is export-only: XML ingest is not implemented, so there is no from-xml.

chatter to-xml input.cha                  # XML to stdout
chatter to-xml input.cha -o output.xml    # XML to a file
chatter to-xml input.cha --skip-alignment # skip dependent-tier alignment checks

The output is TalkBank XML in the http://www.talkbank.org/ns/talkbank namespace (referencing talkbank.xsd). Writing to --output prints a one-line ✓ Converted ... to ... confirmation on stderr; writing to stdout prints only the XML.

Flags: -o, --output <PATH> (stdout if omitted); --skip-alignment (disable dependent-tier alignment validation during export).

Editing and Inspection Commands

`show-alignment`

Print the dependent-tier alignment for a CHAT file (debugging aid).

chatter show-alignment file.cha
chatter show-alignment file.cha -t mor          # one tier type
chatter show-alignment file.cha -t gra -c       # compact one-line-per-alignment output

Flags: -t/--tier <mor|gra|pho|sin> (omit to show all available tiers); -c/--compact (one line per alignment).

`watch`

Watch a CHAT file or directory and re-validate on every save.

chatter watch file.cha
chatter watch corpus/
chatter watch corpus/ --skip-alignment --clear

Flags: --skip-alignment (faster reruns); -c/--clear (clear the terminal between runs).

`lint`

Run lint checks and optionally auto-fix.

chatter lint corpus/
chatter lint corpus/ --fix
chatter lint corpus/ --fix --dry-run         # preview without modifying files
chatter lint corpus/ --skip-alignment

Flags: --fix (apply fixes); --dry-run (show what would change without writing); --skip-alignment.

`clean`

Show the cleaned text for each word (a debugging aid for the text-normalization pipeline).

chatter clean file.cha
chatter clean file.cha --diff-only       # only words where raw differs from cleaned
chatter clean file.cha --format json

Flags: --diff-only; --format text|json.

`new-file`

Create a new minimal valid CHAT file from defaults.

chatter new-file
chatter new-file -o starter.cha --speaker CHI --language eng
chatter new-file -o adult.cha -s MOT -l eng -r Mother
chatter new-file -c brown -u "hello world ."

Flags:

-o, --output <PATH>: stdout if omitted
-s, --speaker <CODE>: default CHI
-l, --language <ISO 639-3>: default eng
-r, --role <ROLE>: default Target_Child
-c, --corpus <CORPUS>: corpus identifier in the @ID header (default corpus)
-u, --utterance <TEXT>: optional initial main-tier utterance content

Cache Commands

chatter cache stats
chatter cache stats --json
chatter cache clear --prefix /path/to/corpus
chatter cache clear --all --dry-run

The validation cache lives under the platform cache directory and stores per-file validation results. validate --force refreshes cache state for the specified path.

`debug`

Developer / debugging subcommands for CHAT analysis. Not intended for routine end-user workflows; surface and behavior may change between releases. Run chatter debug --help for the live list. Current subcommands include:

overlap-audit: analyze CA overlap markers (⌈⌉⌊⌋): pairing, temporal consistency, orphans.
linker-audit: audit linker / special-terminator usage across a corpus (cross-utterance pairing for +<, ++, +^, +", +,, +≋, +≈, plus +..., +/., +//., +"/. etc.).
find: filter CHAT files by @Languages and body content (token / substring counts) across a corpus tree; emits paths, JSONL, or CSV.
sanitize: strip contributor lexical content while preserving structure, for protected-corpus debugging. See the Sanitize user-guide page for the full workflow.
fix-s: normalize whole-utterance same-language @s runs into a [- lang] precode, clear the per-word @s markers (including those on fillers and nonwords), and append any missing explicit @s:LANG codes to @Languages. Trigger conditions and safety rules:
- Every word-bearing item in the utterance, including fillers (&~, &-, &+), nonwords, and retraced material, must carry an explicit language marker AND every marker must resolve to the same target language. If a single filler such as &~dang3 lacks a marker, the utterance is left untouched (the predicate cannot prove it is monolingual).
- Bare @s shortcuts on fillers must be cleared when the rewrite fires. A bare @s resolves relative to the surrounding tier language, so adding a [- LANG] precode without clearing the shortcut would flip the filler’s language to the precode target. fix-s clears the shortcut to keep the original meaning intact.
- The pre-validation rule that catches the unrewritten pattern is E255 (whole-utterance same-language @s run); fix-s is the canonical repair. The companion warn-only E254 reports @s:LANG codes missing from @Languages; fix-s appends them.
- True no-op on already-correct files: a file is rewritten only when a [- lang] conversion or @Languages repair can be proved necessary.
join-retrace: auto-repair dangling-retrace (E370) utterances. An utterance whose last main-tier content is a retrace marker with nothing after it is joined with the next same-speaker utterance. The --scope flag (value-enum, default repetition) selects which retrace kinds qualify:
- --scope repetition (default, Wave 1): only [/] partial-repetition retraces qualify, and only when the successor’s leading words repeat the retraced material. This is the conservative, OBVIOUS-only repair suitable for most automated use.
- --scope corrections (Wave 3a, opt-in): also joins correction retraces: [//] (Full), [///] (Multiple), and [/-] (Reformulation). Corrections replace rather than repeat the retraced material, so the leading-words prefix check is skipped; same-speaker presence alone is the gate. Use --dry-run first to review every proposed correction-join before writing.
- --scope all (Wave 3b, broadest, opt-in): joins ANY dangling retrace kind, including [/] Partial where the successor does NOT repeat the retraced material. This covers genuine child-language disfluencies: false starts, partial words, disfluent repetitions, expansions, and fillers where the transcriber correctly coded a [/] but the successor cannot repeat the abandoned material. Same-speaker presence alone is the gate. Always use --dry-run first when running this scope on new data.
Shared behavior for all joined pairs:
- The join produces one utterance: the first utterance’s content (keeping the trailing retrace marker) followed by the successor’s content, terminated by the successor’s terminator. Main-tier time bullets are unioned (start from the first, end from the successor).
- Dependent tiers are dropped. If either side carried %mor, %gra, or any other dependent tier, the joined utterance drops all of them (a naive %gra merge would yield two ROOT relations, which chatter validate rejects as E723). Such joins are reported as “needs re-morphotag” so the file can be re-run through morphotagging afterwards; the main tier alone remains valid CHAT.
- --dry-run reports what would be joined without modifying files.
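
The three join-retrace scopes can be summarized as a decision function. The sketch below is a simplified reading of the rules above, not the tool's implementation: join_qualifies and union_bullets are hypothetical names, words are plain strings, and the caller is assumed to have already confirmed that the successor has the same speaker.

```python
CORRECTION_KINDS = {"[//]", "[///]", "[/-]"}


def join_qualifies(kind, retraced, successor, scope="repetition"):
    """Decide whether a dangling retrace may be joined with the next utterance.

    kind is the retrace marker, retraced the words it covers, and successor
    the words of the next same-speaker utterance.
    """
    if scope == "all":
        # Broadest scope: same-speaker presence alone is the gate.
        return True
    if kind == "[/]":
        # Repetition: the successor must begin with the retraced words.
        return bool(retraced) and successor[: len(retraced)] == retraced
    # Corrections replace the material, so no prefix check applies.
    return scope == "corrections" and kind in CORRECTION_KINDS


def union_bullets(first, successor):
    """Joined time span: start from the first utterance, end from the successor."""
    return (first[0], successor[1])
```

For example, join_qualifies("[//]", ["the"], ["a", "cat"]) is False under the default scope and True with scope="corrections".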

Merge and Reconciliation Commands (experimental)

These commands combine, reconcile, and relabel CHAT transcripts of the same recording, in the tradition of CLAN’s reliability and comparison tools (rely, trnfix). They are experimental and in active development: flags and behavior may change, and several modes are not yet complete. Work on copies and validate the output.

Command	What it does
`merge`	Merge two CHAT transcripts of the same media into one, interleaving by time with explicit per-speaker provenance. Structural only: no ASR, no forced alignment, no content rewriting.
`speaker-id`	Assign CHAT-conformant speaker codes to an anonymously-labeled file, from an explicit mapping or by text similarity against a reference transcript.
`rediarize`	Re-attribute utterance speakers from an external diarizer’s timestamped turns (JSON), keeping the words: repairs transcripts whose ASR under-counted or mixed speakers.
`adjudicate`	Resolve pending decisions (currently speaker-id) interactively or from a scripted decision file, writing results to an override file.
`pipeline`	Per-session shortcut: run `speaker-id` in reference mode, then `merge`.
`batch`	Loop `pipeline` over matched donor / reference file pairs across two directories.
`sanity-scan`	Post-merge QA: flag sessions whose automatic decisions look suspicious by an out-of-band heuristic, for operator review via `adjudicate`.

Full guides: Merge, Speaker ID, Rediarize, and the Merge Workflow walkthrough. The holistic-judgment mode of speaker-id / pipeline / batch can call an LLM provider (talkbank-llm) when configured via --llm-endpoint / --llm-model (plus --llm-timeout-secs, --llm-max-retries, and a persistent response cache via --llm-cache or CHATTER_LLM_CACHE); the deterministic modes need no network access. Flag-level detail: Merge, LLM holistic judgment.

Exit Codes

Code	Meaning
`0`	Success – all files valid, or command completed without errors
`1`	Failure – validation errors found, parse errors, or command failed
`2`	Usage error – invalid arguments or missing required options (from clap)

chatter validate exits with code 1 if any file has validation errors or parse errors. This makes it safe to use in scripts and CI pipelines:

chatter validate corpus/ --quiet --tui-mode disable || echo "Validation failed"

Use --quiet to suppress per-file success output while still relying on exit codes. Use --format json for machine-readable structured output (JSON objects go to stdout; exit code still reflects pass/fail).
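
For scripts that prefer to branch on the exit code rather than chain with ||, a wrapper along these lines may help. It is a sketch only: describe_exit and validate_quietly are hypothetical helpers, the chatter binary is assumed to be on PATH, and the flags are copied from the shell example above.

```python
import subprocess

# Meanings taken from the exit-code table above.
EXIT_MEANINGS = {0: "success", 1: "failure", 2: "usage error"}


def describe_exit(code: int) -> str:
    """Map a chatter exit code to the label used in the exit-code table."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def validate_quietly(path: str) -> int:
    """Run chatter validate on path and return its exit code."""
    command = ["chatter", "validate", path, "--quiet", "--tui-mode", "disable"]
    return subprocess.run(command).returncode
```

A CI step could then call describe_exit(validate_quietly("corpus/")) and fail the build on anything other than "success".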

Output Contracts

Text output is intended for humans.
JSON output is intended for automation and downstream tools.
Error codes and the JSON Schema are documented public contracts; see the Integrating section of this book.

Validation Errors

Status: Current Last modified: 2026-07-16 11:14 EDT

The CHAT validator produces diagnostics at two severity levels: errors (must fix) and warnings (should fix). Each diagnostic has an error code that maps back to a documented spec and validator rule.

chatter validate is the binding judgment on whether a byte sequence is valid CHAT. When it reports an error, the file is invalid CHAT: clean the data rather than working around the check. A warning flags a questionable but parseable construct you should review. Where chatter and an older tool such as CLAN’s check disagree on whether a file is valid, chatter validate is authoritative (see CHECK Parity Audit for how the two are reconciled).

Reading Error Output

The validator emits rich diagnostics that include the error code, a source-pointed snippet, and a suggested fix:

  × error[E304]: Missing speaker in main tier (line 15, column 3)

15 │ *	hello world .
   ·  ╰── here
   ╰────
  help: Add a speaker code between * and : (e.g., *CHI:)

Each diagnostic contains:

File path and location (line:column)
Severity: error or warning
Error code: E prefix for errors; warnings carry a W prefix or, for some codes such as E254, an E prefix. Each code comes with a URL pointing at its per-code documentation page
Message: human-readable description
Suggestion: actionable fix guidance where available

Error Code Ranges

Range	Category	Examples
E1xx	UTF-8 and encoding	E101: Invalid line format
E2xx	Word-level content	E202: Missing form type after `@`, E203: Invalid form type marker, E207: Unknown annotation
E3xx	Main tier (speakers, terminators, content)	E301: Empty/missing main tier, E304: Missing speaker, E305: Missing terminator, E306: Empty utterance, E307: Invalid speaker, E308: Undeclared speaker
E4xx	Dependent tier structure	E401: Duplicate dependent tier
E5xx	Headers	E501: Duplicate header, E504: Missing @Participants, E505: Invalid @ID format
E6xx	Dependent tier validation	E601: Invalid dependent tier, E604: %gra without %mor
E7xx	Alignment, Phon tiers, structure	E705: Main/%mor count mismatch, E721: %gra index error, E747: Blank line, E748: Leading zero in bullet time, E749: Comma glued to next word, E750: Space inside angle group, E751: Pause glued to word, E752: Timing bullets without @Media, E753: Word only repetition segments, E754: Multi-letter @l form, E755: Undeclared utterance language, E756: Empty user-defined tier, E757: Code glued to following word, E758: Leading space on tier (non-CA)
W1xx-W6xx	Warnings	W108: Speaker not found in @Participants (non-fatal contexts)

Common Errors and Fixes

E256: Curly single quote used as a word character

A curly single quotation mark (U+2018 or U+2019), commonly introduced by autocorrect or speech-to-text, is not a legal CHAT word character. CHAT words use the ASCII apostrophe (U+0027, the plain '). For example, a contraction typed as don + U+2019 + t is rejected; write don't with the ASCII apostrophe instead. chatter flags the curly form wherever it appears in word content and points the diagnostic at the exact character. This mirrors CLAN CHECK errors 138 and 139.

E243: Private-use or non-standard Unicode in a word

A word may contain only standard Unicode. Characters from the Unicode Private Use Area and the other non-standard code points in the U+E000-U+FFFF block are rejected, including the replacement character U+FFFD that marks a botched text encoding. The most common cause is a file saved in the wrong encoding: re-save it as UTF-8 and replace any private-use or compatibility-area character with its standard Unicode equivalent. chatter points the diagnostic at the exact character. This mirrors CLAN CHECK error 86.
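
To pre-screen text before running the validator, the two character rules above (E256 and E243) can be approximated as follows. This is a sketch with a hypothetical function name; it checks only the code points named in this section and does not reproduce chatter's full word-character rules.

```python
CURLY_SINGLE_QUOTES = {"\u2018", "\u2019"}


def illegal_word_chars(word: str) -> list[tuple[int, str, str]]:
    """Return (index, code point, error code) for each offending character."""
    findings = []
    for index, ch in enumerate(word):
        label = f"U+{ord(ch):04X}"
        if ch in CURLY_SINGLE_QUOTES:
            findings.append((index, label, "E256"))
        elif 0xE000 <= ord(ch) <= 0xFFFF:
            # Private Use Area and the rest of the range, including U+FFFD.
            findings.append((index, label, "E243"))
    return findings
```

An accented letter such as é (U+00E9) lies outside both sets, so ordinary non-ASCII words pass this screen.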

E304: Missing speaker code

A main tier line must have a speaker code after the *:

*CHI:	hello world .

An empty speaker code (*: hello .) triggers E304.

E308: Undeclared speaker

Every *SPEAKER: code must be listed in @Participants. Add the missing speaker to the header:

@Participants:	CHI Target_Child, MOT Mother
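
The E308 rule amounts to a set-membership check. The sketch below illustrates it with hypothetical helper names; it assumes each @Participants entry starts with the speaker code and ignores header continuation lines, which the real validator presumably handles.

```python
def declared_speakers(participants_line: str) -> set[str]:
    """Speaker codes from an @Participants header: the first token of each entry."""
    _, _, body = participants_line.partition(":")
    return {entry.split()[0] for entry in body.split(",") if entry.strip()}


def undeclared_speakers(lines: list[str]) -> list[tuple[int, str]]:
    """Return (line number, speaker code) for main tiers whose speaker is undeclared."""
    declared: set[str] = set()
    findings = []
    for number, line in enumerate(lines, start=1):
        if line.startswith("@Participants:"):
            declared |= declared_speakers(line)
        elif line.startswith("*"):
            code, colon, _ = line[1:].partition(":")
            # An empty code is the E304 case, so it is skipped here.
            if colon and code and code not in declared:
                findings.append((number, code))
    return findings
```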

E370: Retrace marker with nothing to retrace

A retrace or repetition marker ([/], [//], [///]) must be followed by the repeated or corrected material; per the CHAT manual the marker always refers to the text that follows it. A marker followed only by a terminator has nothing to retrace:

*CHI:	<the> [/] .          ← invalid: [/] is not followed by repeated material
*CHI:	<the> [/] the cat .  ← valid: the repeated material follows the marker

This mirrors CLAN CHECK error 119 (and the related retrace checks 151 and 159).

E505: Invalid @ID format

Check that pipe-separated fields are correct and the speaker code matches @Participants:

@ID:	eng|corpus|CHI|2;6.||||Target_Child|||
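
Splitting the header on its pipes makes the field positions easier to inspect. In the sketch below the field names follow the order given in the CHAT manual as best recalled here, so treat ID_FIELDS as an assumption; parse_id_header is a hypothetical helper and performs only a field-count check.

```python
# Assumed field order (from the CHAT manual); verify before relying on it.
ID_FIELDS = [
    "language", "corpus", "code", "age", "sex",
    "group", "ses", "role", "education", "custom",
]


def parse_id_header(line: str) -> dict[str, str]:
    """Split an @ID header into named fields, raising ValueError on a bad shape."""
    _, _, body = line.partition(":")
    parts = body.strip().split("|")
    # The header ends with a pipe, so a well-formed split has one empty trailing piece.
    if len(parts) != len(ID_FIELDS) + 1 or parts[-1] != "":
        raise ValueError(f"expected {len(ID_FIELDS)} pipe-terminated fields")
    return dict(zip(ID_FIELDS, parts))
```

Applied to the example header above, the code field is CHI and the role field is Target_Child, which is the pair to compare against @Participants.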

E705: Main/%mor alignment mismatch

The number of %mor items must match the number of alignable words on the main tier. Retraces, pauses, and events are not counted. The validator shows a columnar diff:

  Main tier       %mor tier
  ──────────────  ──────────────
  I               pro|I
  want            v|want
  to              inf|to
  go              v|go
  home, ⊖

E714 / E715: `%pho`, `%mod`, or `%wor` count mismatch

The same two codes are reused for “too few” / “too many” count mismatches on %pho, %mod, and %wor.

For %wor, the main-tier side is a spoken-token inventory:

regular words and fillers count
fragments, nonwords, and xxx/yyy/www count
retrace does not change %wor membership
replacements keep the original spoken surface word for %wor

These context-sensitive rules decide which items belong to the %wor set; they do not relax the alignment. Once an item is in the %wor set, alignment is still strict 1:1. So if a filler like &-mm counts on the main tier and %wor omits it, E714 is the correct result.

So this is valid:

*CHI:	<one &+ss> [/] one play ground .
%wor:	one •321008_321148• ss •321148_321368• one •321809_321969• play •322049_322310• ground •322390_322890• .

This is also valid:

*EXP:	&+ih <the what> [/] what's letter &+th is this ?
%wor:	ih •49063_49103• the •49103_49163• what •49183_50205• what's •50205_50405• letter •50405_50685• th •50886_50946• is •50946_51046• this •51086_51586• ?

And this is valid too:

*EXP:	what's is dis [: this] ?
%wor:	what's •37050_37471• is •37491_37631• dis •37631_38131• ?
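
To check a count by hand, it helps to pull the timed words out of a %wor line. The sketch below uses hypothetical names and makes two assumptions: bullets are delimited by U+0015 in real files (rendered as • in the examples above), and E715 is the "too many" counterpart of E714, which is inferred from the wording above rather than stated outright.

```python
import re

# A word followed by a bullet; accepts U+0015 or the "•" used in rendered examples.
WOR_ITEM = re.compile(r"(\S+)\s+[\u0015•](\d+)_(\d+)[\u0015•]")


def parse_wor(line: str) -> list[tuple[str, int, int]]:
    """Return (word, start_ms, end_ms) for each timed word on a %wor tier."""
    _, _, body = line.partition(":")
    return [(w, int(s), int(e)) for w, s, e in WOR_ITEM.findall(body)]


def count_mismatch_code(main_tokens: list[str], wor_items: list) -> "str | None":
    """E714 when %wor has too few items, E715 when too many, None when counts match."""
    if len(wor_items) < len(main_tokens):
        return "E714"
    if len(wor_items) > len(main_tokens):
        return "E715"
    return None
```

The list of main-tier tokens has to be built by hand or by a real parser, because membership depends on the rules listed above.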

E721: %gra sequential index error

%gra entries must have sequential 1-based indices: 1|...|... 2|...|... 3|...|...
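
A minimal sketch of the sequential-index rule, assuming entries are whitespace-separated and the index is the first pipe-delimited field. The function name is hypothetical, and the relation labels in the usage note are illustrative.

```python
def gra_index_errors(gra_body: str) -> list[tuple[int, str]]:
    """Return (expected index, entry) for each entry whose index is out of sequence."""
    errors = []
    for expected, entry in enumerate(gra_body.split(), start=1):
        index = entry.split("|", 1)[0]
        if index != str(expected):
            errors.append((expected, entry))
    return errors
```

With this helper, "1|2|SUBJ 2|0|ROOT 3|2|OBJ" produces no findings, while "1|2|SUBJ 3|0|ROOT" reports the second entry because index 2 was expected.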

E748: Leading zero in bullet timestamp

A media bullet time component is written with a leading zero before another digit, for example \u{15}012_200\u{15}. Bullet times are plain millisecond integers; write 12, not 012. A bare 0 (as in 0_200) is legal. This mirrors CLAN CHECK error 90 (“Illegal time representation inside a bullet.”). The bullet’s numeric value still parses, so downstream tooling sees the intended times; the diagnostic alone makes the file invalid.
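
The leading-zero rule is easy to approximate with a regular expression. This sketch assumes bullets have the simple start_end form delimited by U+0015, as in the example above; the function name is hypothetical.

```python
import re

# A media bullet: U+0015, start time, underscore, end time, U+0015.
BULLET = re.compile(r"\u0015(\d+)_(\d+)\u0015")


def leading_zero_times(line: str) -> list[str]:
    """Return every bullet time component written with a leading zero."""
    found = []
    for match in BULLET.finditer(line):
        for component in match.groups():
            # A bare "0" is legal; "012" is not.
            if len(component) > 1 and component.startswith("0"):
                found.append(component)
    return found
```

Converting a flagged component with int() gives the intended value, which matches the note above that the time still parses.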

E749: Comma glued to the following word

A comma on a speaker tier must be followed by a space or end-of-line: write hey , you, not hey ,you. Mirrors CLAN CHECK error 92. The check looks at the word immediately after the comma in document order (including inside <...> groups); constructs that place their own character after the comma (group and overlap marks, CA symbols) are not flagged.

E750: Space inside angle-bracket group delimiters

Group delimiters hug their content: write <dog> [/], never < dog> or <dog >. Mirrors CLAN CHECK error 160. Each offending space gets its own diagnostic; the group still parses, so downstream tooling sees the intended structure.
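
A rough text-level approximation of this check is sketched below. It is not how chatter finds the spaces, since the validator works on a parsed structure; the function name is hypothetical, and the only special case handled is the +< linker, which is legitimately followed by a space.

```python
import re

# "<" followed by a space, ignoring the "+<" linker.
AFTER_OPEN = re.compile(r"(?<!\+)< ")
# A space directly before ">".
BEFORE_CLOSE = re.compile(r" >")


def angle_group_spaces(line: str) -> list[int]:
    """Return the 0-based index of each space that hugs an angle bracket."""
    columns = [m.start() + 1 for m in AFTER_OPEN.finditer(line)]
    columns += [m.start() for m in BEFORE_CLOSE.finditer(line)]
    return sorted(columns)
```

Each returned index corresponds to one offending space, mirroring the one-diagnostic-per-space behavior described above.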

E751: Pause glued to the preceding word

A pause marker must be space-delimited from the word before it: write hello (.) there, not hello(.) there. Mirrors CLAN CHECK error 57.

E752: Timing bullets without an @Media header

The transcript carries timing evidence (utterance bullets or %wor word timing) but no @Media header declares the recording those timestamps index. Add an @Media header naming the media file (or remove the timing bullets if the transcript is genuinely unlinked). This code completes the media-consistency family: E544 covers declared linkage without timing, and E552 covers a transcript declared unlinked but contradicted by timing. Mirrors CLAN CHECK error 112.

E753: Word consisting only of a repetition segment

A word whose entire material sits inside segment-repetition delimiters (↫hi↫ with nothing outside the arrows) marks the repetition of a word that is not there; attach the repeated segment to its host word (↫p↫parents) or transcribe a stand-alone fragment as a filler or nonword form. Filler and other word-category prefixes (&-, &~, 0) count as material outside the arrows. Adopted from GUI CLAN CHECK error 151 as a chatter rule.

E754: Letter form @l with more than one letter

The @l form marks a single spoken letter (b@l); use @k (letter sequence) or @ls (letter plural) for multi-letter content. Stuttered letters with repetition segments (↫b^↫b@l) are fine: repeated-segment material does not count toward the stem. Mirrors CLAN CHECK error 76.

E519 at word level: language codes must be real everywhere

The ISO 639-3 registry check that guards @Languages and @ID also applies to explicit word-level switch codes (word@s:CODE, including +/& multi-code forms) and to @L1 of values: the code needs no declaration, but it must name a real language. Utterance-level [- CODE] precodes are covered by E755 plus the header check.

E755: Utterance language not declared in @Languages

A [- CODE] precode marks a whole utterance as being in another language, which is substantial presence: declare that language in @Languages. Deliberate contrast: a word-level @s:CODE insertion needs NO declaration (ok@s:eng in a Cantonese transcript is valid as-is), because @Languages lists the transcript’s substantial languages, not every language that appears. Mirrors CLAN CHECK error 152.

E756: Empty user-defined tier

A user-defined %x tier with empty or whitespace-only content declares an annotation that is not there; add the content or remove the line. (Formerly W601; renumbered because it was always a hard error.)

E757: Bracketed code glued to the following word

A bracketed code’s closing ] must be space-delimited from what follows: write hello [/] x, not hello [/]x. The parse is unambiguous either way, which is exactly why this is a style rule: the corpus stays canonically spaced. Mirrors CLAN CHECK error 19.

E758: Leading space before tier content in a non-CA file

A space between the tier’s tab delimiter and the first content item (*CHI:<tab><space>dog .) is invalid unless the file declares @Options: CA; CA transcripts legitimately column-align content with spaces after the tab. Mirrors CLAN CHECK error 123.

E243 addition: the pipe character

| is the %mor tier’s delimiter and has no meaning in main-tier word text; a bare or embedded pipe in a word now reports E243 (IllegalCharactersInWord). Covers the grounded shape of CLAN CHECK error 48.

Generated Error Documentation

The source of truth for error-code details is spec/errors/. Maintainers can also regenerate a local error-reference set from those specs when working on diagnostics:

cargo run --manifest-path spec/tools/Cargo.toml --bin gen_error_docs

That generated reference includes the error description, example inputs, suggested fixes, and the layer that catches the diagnostic.

Chatter Desktop

Status: Current Last modified: 2026-07-06 17:27 EDT

Chatter Desktop is a native graphical validation app for CHAT files, released alongside the chatter CLI. Prefer the chatter CLI for scripted or batch validation; use the desktop app when you want a standalone graphical validation experience without a terminal.

When to use Chatter Desktop

Chatter Desktop (apps/chatter-desktop/) is the right tool when you want to:

Validate CHAT files through a graphical interface, no terminal required
Drag and drop a file or folder and read errors with source snippets
Work on the desktop without setting up a terminal workflow

Related surfaces:

Validate CHAT from the command line: use chatter validate

This page documents the desktop surface:

Chatter Desktop (apps/chatter-desktop/), the CHAT validation GUI

Current status

Release contract: released alongside the CLI in the public chatter release
Distribution: ships in the coordinated chatter release alongside the CLI; also buildable from source (below)
Platforms: macOS, Windows, and Linux

Staying up to date

Chatter Desktop keeps itself current. When you launch it, it quietly checks for a newer release; if one is available it asks whether to update, and on your confirmation it downloads, installs, and restarts into the new version. If the check cannot reach the network it simply does nothing and the app keeps working on the version you have. You never have to track releases or re-download by hand.

Getting Started

Build from source

cd apps/chatter-desktop
npm ci
cargo tauri dev       # launches the app with hot reload
cargo tauri build     # produces a distributable app bundle

Requires: Rust (stable, edition 2024), Node.js, npm, and the Tauri CLI (the cargo tauri subcommand, provided by the tauri-cli crate).

Using the App

Opening files

Chatter validates one target at a time: a single .cha file or one folder.

Three ways to start validating:

Choose File: opens a file picker filtered to .cha files
Choose Folder: opens a folder picker; validates all .cha files recursively
Drag and drop: drag one .cha file or one folder onto the app window

When idle, if you’ve previously validated a target, the drop zone shows “Last: corpus/reference/, Re-validate?” as a clickable shortcut.

Reading results

The main window has three areas:

┌──────────────────────────────────────────────────────────────┐
│  [Choose File] [Choose Folder] or drag here  [System|Light|Dark] │
├──────────────────┬───────────────────────────────────────────┤
│ 3 FILES WITH     │  Filter by code… [All|Errors|Warnings]    │
│ ERRORS / 120     │                                           │
│                  │  ▾ [E302] Missing @End header              │
│  📁 corpus/      │  ┌───────────────────────┐                │
│    ✗ file1 (3)   │  │ 41 │ *CHI: hello .    │                │
│    ✗ file3 (1)   │  │ 42 │                   │                │
│                  │  │    │ ^                 │                │
│                  │  └───────────────────────┘                │
│                  │  💡 Add @End on the last line             │
│                  │  [Copy] [Open in CLAN]                    │
├──────────────────┴───────────────────────────────────────────┤
│  Progress: 45/120 │ 4 errors │ ~2m 30s remaining │ [Cancel]  │
└──────────────────────────────────────────────────────────────┘

File tree (left): a collapsible directory tree showing only files with errors (valid files are hidden to reduce clutter). A header shows “N files with errors / M total”. Files are sorted alphabetically.
Error panel (right): for the selected file, shows each error with its code in [E001] format, severity color, message, source snippet with caret underlines, and multi-span labels for complex errors (e.g., alignment mismatches across tiers). CHAT-specific formatting is handled: tabs are expanded to 8-column boundaries, \x15 bullets are rendered as •, and underline markers are shown as styled underlined text. Suggestions are prefixed with 💡.
Status bar (bottom): streaming progress during validation, an ETA after 5 or more files, the total error count, and action buttons.
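The CHAT-specific rendering rules for the error panel can be illustrated with a short sketch. It is written in Python for brevity and is not the app's own frontend code; the function name render_snippet_line is hypothetical, and underline-marker styling is left out.

```python
BULLET_DELIMITER = "\x15"  # NAK control character that delimits CHAT time bullets


def render_snippet_line(line: str, tab_width: int = 8) -> str:
    """Return a display-friendly version of one CHAT source line."""
    # expandtabs pads each tab out to the next multiple of tab_width columns.
    expanded = line.expandtabs(tab_width)
    # Show the otherwise invisible bullet delimiter as a visible dot.
    return expanded.replace(BULLET_DELIMITER, "•")
```

Expanding tabs before substituting the bullet keeps the caret underline aligned with the columns the user sees.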

Filtering errors

A compact filter bar appears above the error cards when a file has diagnostics:

Code filter: type “E7” to show only alignment errors, “W” for warnings, etc.
Severity toggle: switch between All / Errors / Warnings

The file header updates to show filtered vs. total count (e.g., “3 errors (7 total)”).
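As an illustration of the two filters working together, here is a minimal Python sketch. The Diagnostic shape, the case-insensitive prefix match, and the W001 code in the test data are assumptions made for the example, not the app's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Diagnostic:
    """Hypothetical diagnostic record, for illustration only."""
    code: str      # e.g. "E726"
    severity: str  # "error" or "warning"


def filter_diagnostics(diagnostics, code_prefix="", severity="all"):
    """Keep diagnostics whose code starts with code_prefix and whose severity matches."""
    prefix = code_prefix.strip().upper()
    return [
        d for d in diagnostics
        if d.code.upper().startswith(prefix)
        and (severity == "all" or d.severity == severity)
    ]


def header_label(shown: int, total: int) -> str:
    """Format the filtered-vs-total count shown in the file header."""
    noun = "error" if shown == 1 else "errors"
    return f"{shown} {noun} ({total} total)"
```

Typing “E7” corresponds to filter_diagnostics(items, code_prefix="E7"), and the severity toggle maps to the severity argument.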

Collapsible error cards

Each error card has a clickable header that toggles between expanded and collapsed view. Collapsed cards show only the error code and first line of the message. When a file has 5 or more errors, an Expand All / Collapse All button appears.

Validation settings

A ⚙ Settings popover next to the file picker exposes the same knobs the CLI’s flags do, since both surfaces build the same underlying validation config:

Setting	Equivalent CLI flag	Default
Roundtrip check	`--roundtrip`	Off
Parser	`--parser tree-sitter|re2c`	Tree-sitter
Strict cross-utterance linkers	(enables E351-E355)	Off
Parallel jobs	`--jobs N`	All CPUs

Settings are disabled while a validation run is in progress and apply to the next run (including Re-validate).

Dark mode

Chatter follows your system appearance by default. A System / Light / Dark toggle in the drop zone area lets you override. Your preference is remembered across sessions.

The dark palette uses muted Apple-style colors and keeps miette error highlighting readable on dark backgrounds.

Clickable file paths

Click the file name in the error panel heading to reveal the file in Finder (macOS), Explorer (Windows), or the default file manager (Linux).

Copy errors

Each error card has a Copy button that copies the full miette-rendered error text (plain text, not HTML) to your clipboard for pasting into issue reports or messages.

Actions

Action	Where	What it does
Re-validate	Status bar / last-target hint	Re-run validation on the same target (picks up edits)
Cancel	Status bar (during validation)	Stop the current run
Export	Status bar	Save results as JSON or plain text via a save dialog
Open in CLAN	Per-error button	Opens the file at the error location in the CLAN editor
Copy	Per-error button	Copies the plain-text error to clipboard
Reveal in file manager	File name heading	Opens the file’s parent directory

“Open in CLAN” only appears when the CLAN application is detected on your system (macOS and Windows only). It adjusts line numbers to account for headers that CLAN hides (@UTF8, @PID, @Font, @ColorWords, @Window).

Keyboard shortcuts

Shortcut	Action
Ctrl+R / Cmd+R	Re-validate
Escape	Cancel running validation

All other navigation is mouse-driven (click files, scroll errors).

Window title

The window title updates to reflect the current state:

Idle: “Chatter”
Discovering: “Chatter, Discovering files…”
Running: “Chatter, Validating (45/120)”
Finished: “Chatter, 14 errors in 3 files” or “Chatter, All 74 files valid”

ETA

After 5 or more files have been processed, the status bar shows an estimated time remaining (e.g., “~2m 30s remaining”). The estimate updates every second.
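This page does not say how the estimate is computed. The sketch below assumes a simple linear extrapolation from the average time per file so far; the function name and the exact formatting rules are illustrative, not taken from the app.

```python
def format_eta(done: int, total: int, elapsed_seconds: float, min_files: int = 5):
    """Return a status-bar ETA string, or None while too few files are done."""
    if done < min_files or done >= total:
        return None
    # Assumption: remaining files take as long, on average, as the finished ones.
    seconds_per_file = elapsed_seconds / done
    remaining = round(seconds_per_file * (total - done))
    minutes, seconds = divmod(remaining, 60)
    if minutes:
        return f"~{minutes}m {seconds}s remaining"
    return f"~{seconds}s remaining"
```

Holding the estimate back until five files are done avoids showing a wildly unstable number at the start of a run.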

Notifications

When validation finishes while the app is not focused, a system notification shows the summary (“Validation complete, 14 errors in 3 files”).

First launch

On first launch, an onboarding overlay explains the four main interactions: drag files, error panel, keyboard shortcuts, and export. Dismiss it with “Got it”; it won’t appear again.

CLI Bundling

The desktop app can bundle the chatter CLI binary so power users who download the GUI can also run the CLI from their terminal (much as VS Code ships the code command).

An Install CLI Command menu item (when available) symlinks the bundled binary to /usr/local/bin/chatter (macOS/Linux) or copies it to a PATH directory (Windows).

To build with the bundled CLI:

cargo build --release -p chatter
mkdir -p apps/chatter-desktop/src-tauri/resources
cp target/release/chatter apps/chatter-desktop/src-tauri/resources/
cargo tauri build

Architecture

The desktop app lives in apps/chatter-desktop/:

apps/chatter-desktop/
  src-tauri/          Rust backend (Tauri v2)
    src/
      main.rs         Bin entry, calls chatter_desktop_lib::run()
      lib.rs          Tauri app setup (Builder + module wiring)
      protocol.rs     Shared command/event names + request types
      commands.rs     validate, cancel, open_in_clan, export, reveal, install_cli
      events.rs       ValidationEvent → frontend event bridge
      validation.rs   Desktop validation orchestration for one target
  src/                React + TypeScript frontend
    components/       DropZone, FileTree, ErrorPanel, ProgressBar, OnboardingOverlay
    hooks/            useValidation, validationState, useTheme
    protocol/         Command/event names + TypeScript transport mirrors
    runtime/          Tauri transport + capability-focused runtime seam

The Rust backend calls validate_directory_streaming() and validate_files_streaming() from talkbank-transform directly (folder vs. single-file targets respectively), the same streaming validation pipeline and on-disk cache used by the CLI and TUI. Events flow over crossbeam channels to the Rust side, then are serialized to JSON and emitted to the frontend via Tauri’s event bridge.

Cancellation uses ArcSwapOption for a lock-free atomic swap of the cancel sender, so no mutex is involved.

The frontend keeps Tauri-specific code confined to src/runtime/tauriTransport.ts. React components and hooks consume narrower capabilities (validationRunner, validationTarget, clan, exports) instead of reaching for one broad desktop service object.

Comparison with TUI

Feature	TUI (`chatter validate`)	Desktop app
File selection	CLI arguments	Drag-and-drop, file picker
Navigation	Keyboard (Tab, arrows)	Mouse click
Error display	Two-pane terminal UI	Scrollable panels with source snippets
Error filtering	Not available	Code filter + severity toggle
Copy error	Not available	Copy button per error
Open in CLAN	`c` key	Button per error
Export	`--format json --audit`	Save dialog (JSON or text)
Streaming progress	Progress bar	Progress bar + ETA
Dark mode	Terminal theme	System/Light/Dark toggle
Caching	Same engine	Same engine
Who it’s for	Power users, CI	Researchers, linguists

Both use the identical validation engine and produce the same error codes.

When to Use Which Tool

The TalkBank toolchain offers validation through three interfaces. Each serves a different workflow:

Tool	Audience	Use when
Chatter Desktop	Researchers, linguists	You want a graphical, drag-and-drop CHAT validation app without using a terminal.
`chatter validate` (TUI)	Power users	You’re comfortable in a terminal and want keyboard-driven navigation.
`chatter validate` (CLI)	CI, scripts	You need machine-readable output (`--format json`) or batch audits (`--audit`).

Chatter Desktop focuses on validation only.

CLAN Line Numbering

Status: Current Last modified: 2026-05-29 17:31 EDT

When you click “Open in CLAN” in the desktop app or press Enter in the TUI, chatter sends the error location to the CLAN editor. CLAN opens the file and places the cursor at the error. This usually works seamlessly, but there is one caveat: CLAN and chatter count lines differently.

Hidden Headers

CLAN hides five header types from its editor display:

Header	Purpose
`@UTF8`	Character encoding declaration
`@PID`	Persistent identifier
`@Font`	Display font settings
`@ColorWords`	Color coding rules
`@Window`	Window position/size

These headers are present in the .cha file but invisible in CLAN’s editor. CLAN’s line numbers skip them entirely. A file that starts with @UTF8 on line 1 will show @Begin as “line 1” in CLAN’s display, even though it’s actually line 2 in the file.

What Chatter Does

Chatter automatically adjusts line numbers before sending to CLAN:

Compute the error’s line number in the source file
Count how many hidden headers appear before that line
Subtract the hidden count to get CLAN’s line number
Send the adjusted line number to CLAN

This happens transparently; you don’t need to do anything.

Edge Case: Errors on Hidden Lines

If an error is on a hidden header itself (e.g., a malformed @UTF8 line), CLAN cannot navigate to it because CLAN doesn’t display that line. In this case, “Open in CLAN” will show an error message explaining why.
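The adjustment and its edge case can be sketched in a few lines. This is a Python illustration of the steps above, not the Rust implementation in talkbank_model::resolve_clan_location(); the function names are hypothetical and header detection is simplified to a name comparison.

```python
HIDDEN_HEADERS = ("@UTF8", "@PID", "@Font", "@ColorWords", "@Window")


def is_hidden_header(line: str) -> bool:
    """True if CLAN's editor would hide this line."""
    # The header name is everything before the first colon (or the whole line).
    name = line.split(":", 1)[0].strip()
    return name in HIDDEN_HEADERS


def clan_line_number(source_lines, error_line):
    """Map a 1-based file line number to the line number CLAN displays.

    Returns None when the error sits on a hidden header, which CLAN cannot show.
    """
    if is_hidden_header(source_lines[error_line - 1]):
        return None
    hidden_before = sum(
        is_hidden_header(line) for line in source_lines[: error_line - 1]
    )
    return error_line - hidden_before
```

A caller would treat None as the signal to show the explanatory error message instead of launching CLAN.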

For Developers

The shared resolution logic lives in talkbank_model::resolve_clan_location(). Both the TUI and the desktop app call this function; it resolves line/column from byte offsets when needed and adjusts for hidden headers.

See clan_location.rs for the implementation and tests.

Batch Workflows

Status: Current Last modified: 2026-06-12 21:05 EDT

The chatter CLI is designed for processing large CHAT corpora efficiently. This page covers common batch workflows.

Validating a Corpus

Validate all .cha files in a directory tree:

chatter validate /path/to/corpus/

The validator recursively discovers .cha files and processes them in parallel. Results are cached, so subsequent runs skip unchanged files.

Forcing Revalidation

To bypass the cache and revalidate everything:

chatter validate /path/to/corpus/ --force

Filtering Output

Show only errors (hide warnings):

chatter validate /path/to/corpus/ --quiet

Stop after the first reported error:

chatter validate /path/to/corpus/ --max-errors 1

Write a JSONL audit file while validating:

chatter validate /path/to/corpus/ --audit validation.jsonl

CHAT-JSON Roundtrip

Convert an entire corpus to JSON and back:

# In bash, enable recursive ** matching first (zsh expands ** by default):
#   shopt -s globstar

# CHAT → JSON
for f in corpus/**/*.cha; do
  chatter to-json "$f" > "${f%.cha}.json"
done

# JSON → CHAT
for f in corpus/**/*.json; do
  chatter from-json "$f" > "${f%.json}.roundtrip.cha"
done

The roundtrip is designed to preserve the ChatFile model. In regression tests, compare normalized output rather than assuming byte-for-byte identity after parser or serializer changes.

Cache Management

The validation cache stores results for previously validated files (keyed by content hash). The cache database file is named talkbank-cache.db and lives in the OS cache directory:

macOS: ~/Library/Caches/talkbank-chat/talkbank-cache.db
Linux: ~/.cache/talkbank-chat/talkbank-cache.db
Windows: %LocalAppData%\talkbank-chat\talkbank-cache.db

It can hold results for large file collections.

To relocate the cache (a different disk, a per-project cache, or an isolated cache for scripted runs), set the TALKBANK_CHAT_CACHE_DIR environment variable to a directory; the database is created directly inside it. This is the supported override on every platform, and the only effective one on Windows, where the default location comes from the system Known Folder API rather than environment variables.
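The lookup order described above can be summarized in a small sketch. It is Python written for illustration only: the function name and parameters are hypothetical, the Windows base directory is passed in explicitly as a stand-in for the Known Folder API result, and any other environment variables that may influence the macOS or Linux default are not modeled.

```python
from pathlib import PurePosixPath, PureWindowsPath

CACHE_DB_NAME = "talkbank-cache.db"


def cache_db_path(platform: str, env: dict, home: str, local_app_data: str = "") -> str:
    """Resolve the cache database path for 'macos', 'linux', or 'windows'."""
    path_type = PureWindowsPath if platform == "windows" else PurePosixPath
    override = env.get("TALKBANK_CHAT_CACHE_DIR")
    if override:
        # The database is created directly inside the override directory.
        return str(path_type(override) / CACHE_DB_NAME)
    if platform == "macos":
        return str(path_type(home) / "Library/Caches/talkbank-chat" / CACHE_DB_NAME)
    if platform == "linux":
        return str(path_type(home) / ".cache/talkbank-chat" / CACHE_DB_NAME)
    if platform == "windows":
        # Stand-in: the real location comes from the system Known Folder API.
        return str(path_type(local_app_data) / "talkbank-chat" / CACHE_DB_NAME)
    raise ValueError(f"unknown platform: {platform!r}")
```

For scripted runs, pointing TALKBANK_CHAT_CACHE_DIR at a fresh temporary directory gives each job an isolated cache.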

chatter cache stats    # Show hit rates and entry count
chatter cache clear --all

Do not delete the cache file manually while chatter is running.

Reference Corpus Validation

This repository includes a reference corpus at corpus/reference/ (currently ~100 .cha files; verify with find corpus/reference -name '*.cha' | wc -l). Every file in this corpus must parse successfully (a 100% pass rate):

cargo nextest run -p talkbank-parser-tests -E 'test(parser_equivalence)'

This runs the parser equivalence test. Each .cha file is its own test, so nextest runs them in parallel and reports failures individually.

Integration with batchalign

The Batchalign pipeline uses the same Rust core (via PyO3) for CHAT parsing and serialization. Since the 2026-04-28 monorepo merge, Batchalign source lives inside this repository under crates/batchalign-* (the standalone batchalign3 GitHub repo was archived). Files processed by Batchalign produce valid CHAT that passes chatter validate.

CI Integration

Status: Current Last updated: 2026-04-13 19:23 EDT

How to use chatter in continuous integration pipelines.

Exit Codes

Code	Meaning
`0`	All files valid / command succeeded
`1`	Validation errors found or command failed
`2`	Invalid arguments or missing required options

All examples below rely on exit code 1 to signal validation failure.

Basic Usage

chatter validate corpus/ --quiet --tui-mode disable

--quiet suppresses per-file success output
--tui-mode disable prevents interactive TUI (required in non-TTY environments)
Exit code 0 means all files valid; 1 means errors found
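For CI systems driven from a script rather than a shell step, the same contract can be sketched in Python. The helper names are hypothetical; only the flags and exit codes documented on this page are used, and any code other than 0, 1, or 2 is reported as unexpected.

```python
import subprocess

EXIT_MEANINGS = {
    0: "all files valid / command succeeded",
    1: "validation errors found or command failed",
    2: "invalid arguments or missing required options",
}


def build_validate_command(target: str) -> list[str]:
    """Build the non-interactive validate invocation shown above."""
    return ["chatter", "validate", target, "--quiet", "--tui-mode", "disable"]


def describe_exit(code: int) -> str:
    """Translate a chatter exit code into the meaning from the table."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_validation(target: str) -> tuple[int, str]:
    """Run chatter (it must be on PATH) and return (exit code, meaning)."""
    completed = subprocess.run(build_validate_command(target), check=False)
    return completed.returncode, describe_exit(completed.returncode)
```

Passing check=False keeps a non-zero exit from raising, so the caller can decide whether code 1 should fail the job or only be reported.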

GitHub Actions Example

- name: Validate CHAT corpus
  run: |
    chatter validate corpus/ --quiet --tui-mode disable --format json --audit results.jsonl

- name: Upload validation report
  if: failure()
  uses: actions/upload-artifact@v4
  with:
    name: validation-report
    path: results.jsonl

The --audit results.jsonl flag streams per-error JSON lines to a file, which is useful for archiving or downstream analysis even when the step fails.

JSON Output for Automation

chatter validate corpus/ --format json --tui-mode disable 2>/dev/null

Each file produces a JSON object on stdout with status, error_count, and an errors array. The exit code still reflects overall pass/fail.

Pre-commit Hook

#!/bin/sh
# .git/hooks/pre-commit
chatter validate . --quiet --tui-mode disable

This blocks commits that introduce invalid CHAT files. The hook runs quickly on cached files; only modified files are re-validated.

Suppressing Specific Errors

Some corpora have known issues that should not block CI. Use --suppress to ignore specific error codes or named groups:

chatter validate corpus/ --suppress E726,E727,E728 --tui-mode disable

Or use the named group shorthand:

chatter validate corpus/ --suppress xphon --tui-mode disable

Suppressed errors do not appear in output and do not affect the exit code.

Audit Mode for Large Corpora

For bulk corpus validation where you want a full error database without caching overhead:

chatter validate corpus/ --audit errors.jsonl --tui-mode disable

The --audit flag streams one JSON object per error to the specified file. A summary is printed to stderr at the end.
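Because the audit file holds one JSON object per line, it is easy to summarize with a short script. The sketch below tallies errors by code in Python; the field name "code" is an assumption about the record layout (this page does not specify it), so it is exposed as a parameter.

```python
import json
from collections import Counter


def count_errors_by_code(jsonl_lines, code_field="code"):
    """Tally audit records by error code.

    code_field is an assumed field name; adjust it to the actual audit schema.
    """
    counts = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        counts[record.get(code_field, "<missing>")] += 1
    return counts
```

To use it on a real run, open errors.jsonl and pass the file object directly, then iterate over the result's most_common() to see which error classes dominate.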

CHAT Processing Playbook for Editors and Analysts

Status: Current Last updated: 2026-03-24 00:01 EDT

Objective

Provide practical guidance for non-compiler users who create, edit, and validate CHAT files, with emphasis on error interpretation and correction workflow.

Who This Is For

Transcript editors,
corpus curators,
QA reviewers,
linguists using tooling outputs but not parser internals.

Core Editing Workflow

Open file in editor with CHAT diagnostics enabled.
Run validation (single file first, then batch).
Fix highest-severity structural issues first (headers, tier markers, unmatched delimiters).
Re-run validation and inspect warnings.
Only then address style and normalization suggestions.

Error Triage Heuristic

Errors at file start: likely header formatting or encoding issues.
Errors at tier prefix: likely malformed */% tier syntax.
Errors inside words: likely symbol, marker, or annotation boundary issues.
Repeated same error class: likely one systemic rule violation pattern.

Fast Interpretation Guide

Error: parser/validator could not accept structure; must fix.
Warning: valid but suspicious or non-canonical; review strongly recommended.
Info: advisory normalization or convention hints.

Common Fix Recipes

Header spacing problems:
- Ensure the expected separators are present and avoid accidental tab/space drift.
Unclear language/form markers:
- Confirm @s usage and suffix ordering with house style guide.
Duration/annotation confusion:
- Verify bracketed annotation form and avoid malformed punctuation.
Dependent tier attachment issues:
- Ensure % tiers follow intended main tier and keep indentation consistent.

Batch Validation Workflow

Validate a small sample first.
Group failures by error code.
Fix by pattern, not file by file in random order.
Re-run and confirm error count decreases monotonically.
Save run report for audit trail.

Collaboration Workflow with Developers

When reporting parsing issues, include:

exact file path,
minimal excerpt around failing span,
observed diagnostic code/message,
expected behavior (if known).

This reduces back-and-forth and speeds defect triage.

Quality Checklist Before Publishing Corpus Updates

No unresolved error-level diagnostics.
Warning classes reviewed and accepted or fixed.
Participant headers and IDs internally consistent.
Roundtrip serialization check passes for representative samples.
Changelog note recorded for major normalization edits.

Training Recommendations

Maintain short examples for each common error class.
Provide editor cheat sheet for tier prefixes and marker syntax.
Run periodic QA calibration sessions across editors.

Sanitize (`chatter debug sanitize`)

Status: Current Last updated: 2026-04-28 22:18 EDT

chatter debug sanitize strips contributor lexical content from a CHAT file while preserving structure (timing bullets, %wor per-word offsets, speaker codes, dependent-tier scaffolding, structural counts, POS tags, language markers). Output is structurally identical to the input but contains no participant words, names, or free-text annotations.

The command exists so engineering tooling, including LLM-assisted debugging, can operate on protected-corpus files (aphasia/, dementia/, rhd/, fluency/Password/, clinical-children corpora, etc.) without exposing contributor speech to commercial LLM services.

When to use it

Run chatter debug sanitize on the source file before loading it into any tool (LLM-backed debugger, scratch directory, screen-shareable session) where you don’t want participant content visible.

When you need to ask a contributor for help debugging a specific file, frame the request as “run the sanitizer locally and send me the output” rather than asking for the raw file.

Usage

# Write sanitized output to stdout
chatter debug sanitize input.cha

# Write sanitized output to a file
chatter debug sanitize input.cha --output sanitized.cha

Working location for sanitized files: prefer a stable, non-/tmp scratch directory (e.g. set TB_SCRATCH_DIR to a per-project dir under your workstation’s persistent storage) for any state that should outlive a single command. macOS clears /tmp on reboot.

What is preserved (byte-exact)

Timing bullets •start_end• on the main tier.
%wor per-word offsets (word START_END triples).
Speaker codes (*PAR, *INV, *CHI, …).
Utterance count, word count per utterance, dependent-tier count.
Structural markers: compound +, clitic ~, CA elements, overlap points, lengthening, stress markers, syllable pause, underline begin/end, proper-noun @n markers.
Language markers (@s:LANG), form types (@a, @b), POS tags ($adj, $n).
Headers: @Languages, @Birth, @Date, @Media, @PID, @L1Of, @Begin/@End/@UTF8.
%mor POS categories and morphological features (e.g., n|, -Past).
%gra (numeric grammatical relations) and %tim (timing).
Untranscribed tokens xxx / yyy / www: replacing them would change their semantic meaning, so they pass through unchanged.

What is replaced or redacted

Source	Replacement
`WordContent::Text`	`wN` placeholder, indexed by document position
`Shortening` text	`(x)`
`%mor` lemmas (`MorWord.lemma`)	`lemmaN`; POS + features preserved
`%pho` / `%mod` / `%modsyl` / `%phosyl` / `%phoaln` / `%sin`	tier dropped
Free-text dependent tiers (`%com` `%add` `%exp` `%sit` `%spa` `%int` `%gpx` `%eng` `%gls` `%ort` `%flo` `%def` `%coh` `%fac` `%par` `%alt` `%err`)	`[redacted]`
`@Comment`, `@Transcriber`, `@Birthplace`, `@Activities`, `@Situation`, `@RoomLayout`, `@Location`, `@TapeLocation`, `@Warning`, `@Bck`	`[redacted]` (when content was free text)
`@Participants` participant-name field	dropped (`Participant_<SPEAKER_CODE>` is implied by speaker code + role)
`@ID` `custom_field` and `education`	cleared
`Event` event_type (`&=imitates:Mary` → `&=[redacted]`)	`[redacted]`
`Freecode` text (`[^ aside]`)	`[redacted]`
`OtherSpokenEvent` text	`[redacted]`

Determinism + Idempotence

Placeholder generation is keyed off (utterance_index, word_index) tree position, not a global counter. Two consequences:

Deterministic: sanitizing the same input twice produces byte-identical output.
Idempotent: sanitizing a sanitized file produces the same file again, with no double replacement and no shifting placeholder numbers.
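A toy model shows why position-derived placeholders give both properties. This Python sketch works on plain lists of words rather than the real ChatFile model, and the exact numbering scheme (N computed from the words in earlier utterances plus the word index) is an assumption made for illustration.

```python
UNTRANSCRIBED = {"xxx", "yyy", "www"}


def sanitize_words(utterances):
    """Replace each word with a placeholder derived only from its position."""
    sanitized = []
    words_before = 0  # number of word slots in earlier utterances
    for utterance in utterances:
        row = []
        for word_index, word in enumerate(utterance):
            # N depends only on tree position, never on what was replaced so far.
            n = words_before + word_index + 1
            # Untranscribed tokens pass through but still occupy a slot.
            row.append(word if word in UNTRANSCRIBED else f"w{n}")
        sanitized.append(row)
        words_before += len(utterance)
    return sanitized
```

Since a placeholder at a given position is replaced by the same placeholder on a second pass, running the function on its own output changes nothing.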

Pipeline

flowchart LR
    Input["Source .cha\n(protected corpus)"] --> Parser["TreeSitterParser\n(talkbank-parser)"]
    Parser --> Model["ChatFile model\n(talkbank-model)"]
    Model --> Sanitize["sanitize()\n(talkbank-transform::redact)"]
    Sanitize --> Walker["walk_words_mut\n+ header walker\n+ dep-tier walker\n+ scoped-annot walker"]
    Walker --> Mutated["Mutated ChatFile\n(placeholders + redactions)"]
    Mutated --> Writer["WriteChat\n(byte-exact bullets)"]
    Writer --> Output["Sanitized .cha\n(scratch path)"]

The walker step replaces WordContent::Text segments inside Word.content, mutates MorWord.lemma fields, redacts free-text header / dep-tier / scoped-annotation strings, and drops phonological tiers. WriteChat then re-serializes, and because it serializes from Word.content (not from Word.raw_text), every CA element, compound marker, clitic boundary, and timing bullet round-trips byte-exact.

Out of v1 scope

Documented for transparency; v2 work:

Speaker-code anonymization (graph rewrite across @Participants, @ID, *SPK:, @Birth, @L1Of).
@Birth / @Date fuzzing (exact birth dates can be identifying).
@Media filename redaction.
Audio-side sanitization. (Audio bytes are never touched by the sanitizer; the audio stays at its original path.)
“Unsanitize” or round-trip mapping. This is explicitly not built: the sanitizer is one-way, and the mapping table that would reverse it is the exact artifact we don’t want to exist.

Implementation

Library module: talkbank_transform::redact. CLI surface: chatter debug sanitize. The strict policy is the only public preset in v1; future variants can grow on SanitizationPolicy.

Speaker-ID (`chatter speaker-id`)

Status: Draft Last modified: 2026-07-01 21:55 EDT

chatter speaker-id assigns CHAT-conformant speaker codes and role tags to a CHAT file whose speakers carry anonymous or placeholder labels (typically the output of an ASR system that labels speakers as PAR0, PAR1, …). It is the bridge between an ASR pipeline that does not understand speaker roles and a CHAT pipeline that does.

The command is structural: it does not modify utterance content, does not run audio analysis, and does not infer speaker identity from voice features. Its inputs are the CHAT file to relabel plus an identification signal (reference transcript, explicit mapping, or saved override record); its output is the same CHAT file with speaker codes rewritten and @Participants / @ID headers reconciled.

When to use it

Use it whenever you have a CHAT file with placeholder speaker codes that must become CHAT-conformant codes before downstream tooling can process the file meaningfully. The canonical case is an ASR system that emits CHAT but does not know which speaker is the child, parent, clinician, etc.

A complete pipeline that consumes ASR output and produces a publishable CHAT file goes:

flowchart LR
    Media --> Transcribe
    Transcribe["batchalign3 transcribe<br/>ASR"] --> AsrAnon["asr.cha<br/>PAR0, PAR1, ..."]
    Ref["reference.cha<br/>target speakers only"] -.->|reference signal| SpkId
    AsrAnon --> SpkId
    SpkId["chatter speaker-id<br/>(this page)"] --> AsrLabeled["asr-labeled.cha<br/>CHI, INV, MOT, ..."]
    AsrLabeled --> Merge["chatter merge"]
    Ref --> Merge
    Merge --> Aligned["batchalign3 align"]

The speaker-id stage is the single point in the pipeline where “which anonymous speaker corresponds to which CHAT role” is decided. Downstream stages (chatter merge, batchalign3 align, batchalign3 morphotag) all trust that the labels they receive are correct.

Identification modes

There are three mutually exclusive modes; exactly one must be selected:

1. Reference mode

The most common case: a separate CHAT file already exists that covers the same media and contains an authoritative speaker (typically the hand-transcribed target speaker). The reference file’s anchor speaker tells us what that speaker’s content looks like; speaker-id finds the matching speaker in the input by text similarity.

The matching algorithm is multiset Jaccard over bags of content tokens; see “Algorithm” below for the full specification. The ASR speaker whose bag-of-words best matches the reference anchor’s bag-of-words is taken as the same speaker and is marked for drop in the output: because the reference file authoritatively covers that speaker, the downstream chatter merge stage pulls their utterances from the reference, not from this file. The remaining speakers are renamed to the role specified by --inserted-role.

If the Jaccard margin between the winning speaker and the runner-up is below --confidence-threshold, the command refuses to auto-decide. The operator must either lower the threshold (not recommended without spot-checking), supply an explicit mapping (--mapping), or load a previously-adjudicated override (--override-file).
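The refusal rule can be sketched as a small function over per-speaker scores. This is a Python illustration, not the command's implementation: the margin is taken as winner score divided by runner-up score, as the CLI contract describes, while the handling of zero scores is an assumption made here to avoid division by zero.

```python
import math


def decide(scores: dict, confidence_threshold: float = 2.0):
    """Return (winner, margin); winner is None when confidence is too low."""
    if len(scores) < 2:
        raise ValueError("need at least two speakers to compute a margin")
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (winner, best), (_, runner_up) = ranked[0], ranked[1]
    if best == 0:
        return None, 0.0  # assumption: no overlap at all is never confident
    # Assumption: a zero runner-up score counts as an unbounded margin.
    margin = best / runner_up if runner_up > 0 else math.inf
    if margin < confidence_threshold:
        return None, margin  # the CLI would exit with code 4 here
    return winner, margin
```

With the default threshold of 2.0, the winner must score at least twice as high as the runner-up before the mapping is applied automatically.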

2. Explicit-mapping mode

The operator already knows the mapping (typically because they listened to the audio, or because the contributor’s data sheet documents it). They supply it directly.

chatter speaker-id input.cha \
  --mapping "PAR0=INV:Investigator,PAR1=drop" \
  -o relabeled.cha

The grammar for --mapping:

One or more comma-separated assignments.
OLD=CODE:ROLE renames OLD to CODE with role tag ROLE.
OLD=drop removes OLD’s utterances entirely.
Every speaker present in the input must be named in the mapping (no defaulting). This is intentional: we want operator decisions to be explicit.
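The grammar is small enough to sketch a parser for it. The Python below is an illustration of the rules listed above, not the command's own parser; the function names, the return shape, and the rejection of duplicate assignments are assumptions.

```python
def parse_mapping(spec: str) -> dict:
    """Parse 'OLD=CODE:ROLE,OLD=drop' into {OLD: (CODE, ROLE) or None for drop}."""
    mapping = {}
    for assignment in spec.split(","):
        old, sep, target = assignment.strip().partition("=")
        if not sep or not old or not target:
            raise ValueError(f"malformed assignment: {assignment!r}")
        if old in mapping:
            raise ValueError(f"speaker assigned twice: {old}")
        if target == "drop":
            mapping[old] = None
            continue
        code, sep, role = target.partition(":")
        if not sep or not code or not role:
            raise ValueError(f"expected CODE:ROLE or drop, got {target!r}")
        mapping[old] = (code, role)
    return mapping


def check_coverage(mapping: dict, input_speakers) -> None:
    """Enforce that the mapping names exactly the speakers present in the input."""
    missing = sorted(set(input_speakers) - set(mapping))
    extra = sorted(set(mapping) - set(input_speakers))
    if missing:
        raise ValueError(f"speakers not named in mapping: {missing}")
    if extra:
        raise ValueError(f"mapping names speakers not in input: {extra}")
```

Both coverage failures correspond to precondition violations, which the command reports instead of guessing a default.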

3. Override-file mode

The operator has previously adjudicated this session (perhaps through an interactive review tool) and saved the decision to a shared override file. speaker-id reads the file, finds the entry for this session, and applies it. See “Override file format” below.

chatter speaker-id input.cha \
  --override-file batch-2026-05-27.overrides.toml \
  --session-id NF203-2 \
  -o relabeled.cha

This mode is the production substrate for batch workflows. The orchestrator first runs chatter speaker-id in reference mode for every session. For any session that exits with the low-confidence code (4), the operator works through an adjudication tool that writes to the override file. The orchestrator then re-runs chatter speaker-id in override-file mode for those sessions.

CLI contract

chatter speaker-id <INPUT> [OPTIONS]

ARGUMENTS:
  <INPUT>  Path to the CHAT file to relabel.

OPERATION MODES (exactly one required):

  REFERENCE MODE:
    --reference <FILE>
    --anchor <SPEAKER>
    --inserted-role <CODE>:<TAG>[,<CODE>:<TAG>...]

  EXPLICIT-MAPPING MODE:
    --mapping <SPEC>

  OVERRIDE-FILE MODE:
    --override-file <FILE>
    --session-id <ID>

REFERENCE-MODE OPTIONS:
  --confidence-threshold <FLOAT>
      Minimum Jaccard margin (winner_score / loser_score) for the
      command to auto-decide. Below threshold: exit code 4. The
      command prints per-speaker scores to stderr so the operator
      can inspect. Default: 2.0.

  --write-override <FILE>
      When auto-decide succeeds, append the decision to FILE in
      override-file format (creates if missing). Captures the
      audit trail of a batch run.

COMMON OPTIONS:
  -o, --output <PATH>
      Write relabeled CHAT to PATH. Default: stdout.

The operator identity and any free-text note for a session are set
when an operator confirms it through `chatter adjudicate` (see the
merge workflow), not on this command.

Exit codes:

Code	Meaning
0	Success, relabeled file written
1	Invalid input (parse error, missing file, unreadable)
2	Semantic precondition violated (reference has no utterances for anchor; mapping covers a speaker not in input; etc.)
3	Internal error
4	Reference mode: confidence threshold not met. Per-speaker scores printed to stderr; no output written

What the output guarantees

These are testable invariants. Every release verifies them against the reference corpus.

Speaker codes match the supplied mapping

For every speaker in the input file:

If the mapping marks the speaker for drop, none of their utterances appear in the output, AND their @ID row (if any) is removed from the headers, AND their entry is removed from the @Participants header.
If the mapping marks the speaker for rename, every main-tier line *OLD:\t... becomes *NEW:\t... byte-stable except for the speaker code prefix. The @ID row’s third pipe-separated field (speaker code) and eighth field (role tag) are rewritten; other @ID fields are preserved. The @Participants entry’s code and role-tag tokens are rewritten; any intervening tokens (corpus ID, participant name) are preserved.
Speakers not in the mapping are passed through unchanged. (In modes 1 and 3, all speakers are assigned automatically; in mode 2, “all speakers must be in the mapping” is a precondition.)
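The rename rule for main-tier lines and @ID rows can be sketched with plain string handling. This Python is illustrative only (the real command works on the typed CHAT model), the function names are hypothetical, and the sample @ID values in the usage note are made up.

```python
def rename_main_tier(line: str, old: str, new: str) -> str:
    """Rewrite only the *OLD:<tab> prefix; every other byte is kept."""
    prefix = f"*{old}:\t"
    if line.startswith(prefix):
        return f"*{new}:\t" + line[len(prefix):]
    return line


def rewrite_id_header(line: str, old: str, new_code: str, new_role: str) -> str:
    """Rewrite field 3 (code) and field 8 (role) of an @ID row for speaker old."""
    head, sep, body = line.partition("\t")
    if head != "@ID:" or not sep:
        return line
    fields = body.split("|")
    if len(fields) < 8 or fields[2] != old:
        return line  # a different speaker's row, or not a well-formed @ID
    fields[2] = new_code   # third pipe-separated field: speaker code
    fields[7] = new_role   # eighth pipe-separated field: role tag
    return head + sep + "|".join(fields)
```

For example, renaming PAR0 to INV turns "@ID:<tab>eng|corpus|PAR0|||||Participant|||" into "@ID:<tab>eng|corpus|INV|||||Investigator|||" and leaves the language and corpus fields alone.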

Utterance content is byte-stable except for the speaker prefix

For every retained utterance, every byte except the leading *CODE:\t prefix is preserved verbatim. Dependent tiers attached to the utterance are preserved exactly. NAK-delimited time bullets, CHAT markup, special-form annotations, paralinguistic codes, and retracing scopes are all left untouched.

Headers reconcile per a fixed table

Header	Behavior
`@UTF8`, `@Begin`, `@End`, `@Window`, `@Languages`, `@Media`	Pass-through unchanged
`@Participants`	Drop entries for dropped speakers; rewrite code + role-tag for renamed speakers; entries for unaffected speakers preserved
`@ID`	Drop rows for dropped speakers; rewrite field 3 (code) and field 8 (role) for renamed speakers; other fields preserved
`@Comment`	Pass-through unchanged (provenance-carrying comments survive)

Provenance is captured if `--write-override` is set

When --write-override <FILE> is supplied AND the command succeeds in reference mode, an entry is appended to FILE recording the session ID (derived from the input filename stem unless overridden), the per-speaker Jaccard scores, the chosen mapping, the operator, and an ISO 8601 timestamp. The format is specified in “Override file format” below. The operator identity and any free-text note are set later, when a session is confirmed via chatter adjudicate.

This is the audit-trail mechanism: a year from now, a researcher who asks “why was PAR0 labeled INV in this session?” can read the override entry and see the scores, the operator, and any notes the operator added.

Algorithm (reference mode)

Token cleaning

Both the reference anchor’s bag of words and each input speaker’s bag of words are built by walking the typed CHAT AST and emitting content tokens. The cleaner strips:

NAK-delimited time bullets
bracket-annotated markup [*], [//], [/], [=! ...], etc.
angle-bracket retracing scope (<...>, unwrap, keep inner text)
terminator variants +//., +..., +/., +!?, etc.
filled-pause and phonological-fragment markers &-..., &+...
unintelligible placeholders xxx, yyy, www
zero-realization markers 0
special-form suffixes (word@l → word)
CHAT compound underscores (Valentine's_Day → Valentine s Day)
punctuation, then lowercase, then filter to alpha-only tokens of length ≥ 2

Both sides are cleaned identically so the comparison is apples-to-apples. This is the same cleaner specified in the reference corpus under spec/constructs/speaker-id/token-cleaner/.
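
As a rough illustration of the last few cleaning steps only (special-form suffix, compound underscore, punctuation, lowercase, alpha-only filter), here is a sketch that works on one already-extracted word string. It is not the real cleaner: the real one walks the typed AST, and the bullet, markup, retracing, and placeholder steps are assumed to have been handled before a word reaches this point. The function name is hypothetical.

```python
import re


def normalize_word(word: str) -> list[str]:
    """Final normalization of one word already extracted from the AST."""
    word = word.split("@", 1)[0]           # special-form suffix: word@l -> word
    word = word.replace("_", " ")          # compound underscore -> space
    word = re.sub(r"[^\w\s]", " ", word)   # remaining punctuation -> space
    tokens = word.lower().split()
    # Keep alphabetic tokens of length >= 2 only.
    return [t for t in tokens if t.isalpha() and len(t) >= 2]


print(normalize_word("Valentine's_Day"))  # ['valentine', 'day']
print(normalize_word("word@l"))           # ['word']
```

Note that the intermediate form `Valentine s Day` loses its single-letter `s` at the length filter.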

Multiset Jaccard

For two bags-of-words A and B (counted multisets):

J(A, B) = sum_w min(A[w], B[w])  /  sum_w max(A[w], B[w])

Range [0, 1]. The multiset (rather than set) form rewards speakers who say similar things to the anchor in similar volume, not just speakers whose vocabulary happens to intersect.
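
A minimal sketch of the formula using Python's `collections.Counter` as the counted multiset. Returning 0.0 for two empty bags is an assumption of this sketch; the page does not define that case.

```python
from collections import Counter


def multiset_jaccard(a: Counter, b: Counter) -> float:
    """J(A, B) = sum_w min(A[w], B[w]) / sum_w max(A[w], B[w])."""
    words = a.keys() | b.keys()
    denom = sum(max(a[w], b[w]) for w in words)
    if denom == 0:
        return 0.0  # assumption: two empty bags score 0
    numer = sum(min(a[w], b[w]) for w in words)
    return numer / denom


anchor = Counter("the dog ran the dog sat".split())
speaker = Counter("the dog ran home".split())
print(round(multiset_jaccard(anchor, speaker), 4))  # 0.4286
```

Here the shared vocabulary is three of five distinct words (set Jaccard 0.6), but the multiset form gives 3/7 because the anchor says "the" and "dog" twice and the speaker only once.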

Decision

scores  = { speaker: J(anchor_bag, speaker_bag) for speaker in input }
winner  = argmax(scores)
loser   = argmax(scores - {winner})
margin  = scores[winner] / scores[loser]    # ∞ when loser score = 0

winner is the input speaker whose content matches the reference anchor’s content best → marked for drop (the reference authoritatively covers them).
loser (and any other lower-scoring speakers, in the multi-speaker case) → renamed to the role given by --inserted-role.

If margin < --confidence-threshold (default 2.0), the command exits with code 4 and prints per-speaker scores to stderr. The operator must inspect, adjudicate, and re-run with --mapping or --override-file.
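
The decision rule can be sketched as follows, taking precomputed per-speaker scores as input. The function name and return shape are hypothetical. The sketch follows the rule as written on this page, including margin = ∞ when the runner-up scores 0; how ties are broken is not specified here and simply falls out of sort order.

```python
import math


def decide(scores: dict[str, float], threshold: float = 2.0):
    """Return (mapping, margin); mapping is None when confidence is too low."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two speakers to discriminate")
    winner, loser = ranked[0], ranked[1]
    margin = math.inf if scores[loser] == 0 else scores[winner] / scores[loser]
    if margin < threshold:
        return None, margin  # the command would exit with code 4 here
    # Winner matches the anchor and is dropped; everyone else is renamed.
    mapping = {spk: ("drop" if spk == winner else "rename") for spk in scores}
    return mapping, margin


print(decide({"PAR0": 0.1931, "PAR1": 0.7347}))
print(decide({"PAR0": 0.6286, "PAR1": 0.3457}))
```

The first call clears the default threshold (margin about 3.80) and drops PAR1; the second has a margin of about 1.82 and returns no mapping, which corresponds to the refusal path.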

Why this algorithm

The choice was empirical, not theoretical, and was made against a calibration set of CHAT files paired with their corresponding ASR output. Two earlier candidates were tested first and rejected:

Raw temporal-overlap (sum of ms of an input speaker’s activity inside the anchor’s bullet windows): too weak on real data. Hand transcripts often place per-utterance time bullets as end-to-end segmentation boundaries covering 95-99% of the session timeline, rather than as tight “speaker active here” windows. Both input speakers fall almost entirely “inside” the anchor’s bullet windows and the signal disappears.
Speaker purity (fraction of each input speaker’s activity falling inside anchor windows): same root cause, same failure.

Multiset Jaccard over content tokens succeeded on every session of the calibration set. The borderline cases (margin below 2.0x) clustered around tasks where the non-anchor speaker shares vocabulary with the anchor by the structure of the task, e.g. a clinician describing the same scene the child is also describing in a picture-narrative task. These borderline cases are the reason for the conservative threshold and the --mapping/--override-file escape hatches; the algorithm correctly refuses to auto-decide them rather than silently picking wrong.

Override file format

The override file is a UTF-8 TOML document with one [<session_id>] table per decision. A minimal entry:

schema_version = 2

[session-101-t1]
mode = "auto"
adult_roles = { PAR0 = { code = "INV", tag = "Investigator" } }
mapping = { PAR0 = "rename", PAR1 = "drop" }
scores = { PAR0 = 0.1931, PAR1 = 0.7347 }
margin = 3.80
operator = "alice"
decided_at = 2026-05-27T08:41:00-04:00

The complete schema specification, every field, every type, every mode-semantics rule, the strict refuse-with-clear-error versioning policy, and worked examples for auto/explicit/replay/diarization-mixed cases, is on the dedicated reference page: Merge Override File Format.

Highlights from the reference:

mode = "auto" | "explicit" | "override" records how the decision was made (informational for audit trail; behavior at apply time is the same).
adult_roles maps each renamed speaker’s donor code to its own role assignment: adult_roles[<donor_code>].code is the CHAT speaker code (INV, MOT, FAT, PAR, …); .tag is the CHAT role-tag (Investigator, Mother, …). Renamed speakers in one entry may share a role or each carry a distinct one.
mapping must cover every speaker in the input, no defaulting.
scores and margin are optional but the writer always records them when an auto attempt produced them (even when the final decision was operator-supplied).
flags carries operator-supplied markers like "diarization-mixed" for unusual cases. Unknown strings are preserved verbatim.

Preconditions

chatter speaker-id refuses (exit code 2) if any of these hold:

Reference mode

The reference file has no utterances for --anchor
The reference file fails to parse
The input file has fewer than 2 distinct speakers (no discrimination problem)

Explicit-mapping mode

A speaker in the mapping is not present in the input
A speaker in the input is not covered by the mapping (no defaulting)

Override-file mode

The override file does not contain a <session-id> entry
The entry’s mapping references a speaker not in the input
The entry’s mapping does not cover every speaker in the input

What chatter speaker-id is NOT

Not voice diarization. Use Batchalign’s ASR pipeline upstream; the labels this command consumes are the labels Batchalign emits.
Not content correction. If the speaker the command identifies has been mis-transcribed by ASR, this command does not fix that; re-run ASR with a better engine.
Not a merge. This command operates on a single CHAT file. To combine the relabeled file with the reference, use chatter merge.
Not interactive. chatter speaker-id is batch-only: it succeeds, refuses, or fails. The interactive review that resolves a low-confidence refusal into an override-file entry is a separate command, chatter adjudicate, run as part of the merge workflow.

Worked example

A typical fully-automated reference-mode call from an orchestrator script:

chatter speaker-id asr-anonymous.cha \
  --reference hand-transcript.cha \
  --anchor CHI \
  --inserted-role INV:Investigator \
  --confidence-threshold 2.0 \
  --write-override batch.overrides.toml \
  -o asr-labeled.cha

When the command refuses a session (e.g., a shared-vocabulary narrative task with margin 1.82x), the orchestrator captures the exit code 4 failure and the operator later resolves it:

# Inspect the scores the command emitted to stderr:
#   PAR0=0.6286  PAR1=0.3457  margin=1.82x  threshold=2.0
# Operator listens to a few seconds of audio and confirms PAR0 is
# the child:

chatter speaker-id asr-anonymous.cha \
  --mapping "PAR0=drop,PAR1=INV:Investigator" \
  --write-override batch.overrides.toml \
  -o asr-labeled.cha

Later, if anyone re-runs the batch, they use override-file mode:

chatter speaker-id asr-anonymous.cha \
  --override-file batch.overrides.toml \
  --session-id NF204-2 \
  -o asr-labeled.cha

The same asr-labeled.cha content is produced; the audit trail remains intact.

Implementation notes (for contributors)

Source: crates/talkbank-transform/src/speaker_id/ (proposed layout).
CLI surface: crates/chatter/src/commands/speaker_id/.
Domain types (SpeakerCode, RoleTag, SpeakerMapping, MergeOverride, JaccardScore, ConfidenceThreshold, Margin) live in talkbank-model and are shared with chatter merge plus any future adjudication UI.
The Jaccard cleaner walks talkbank-model::ChatFile directly via the existing content walker (talkbank-model::walk_words); it does NOT re-implement CHAT parsing or use regex on raw bytes for tokenization.
Spec entries for the cleaner and the algorithm live in spec/constructs/speaker-id/. Every invariant on this page has a spec; regenerate them with the current spec/tools commands from Spec Workflow.
The override-file reader/writer is a typed serde round-trip on a TOML representation; the schema lives in talkbank-model so the format is one shared type across the codebase, not duplicated parsing logic in each consumer.

Rediarize (`chatter rediarize`)

Status: Draft Last modified: 2026-07-08 21:50 EDT

chatter rediarize re-attributes utterance speakers in a CHAT file from an external diarization. Given a transcript whose utterances carry media time bullets and a JSON file of timestamped speaker turns produced by a dedicated diarizer (for example pyannote), it reassigns each utterance’s main-tier speaker to the diarization track that covers the utterance’s time span the most, keeping the utterance content (the words) byte-stable.

The command exists for a specific, common failure shape: ASR systems with bundled diarization (Rev.AI and others) auto-detect the speaker count and can under-count on hard material such as child-adult overlap, collapsing three or four real voices into two tracks. The ASR words are usually fine; the attribution is what is wrong. A dedicated diarizer recounts the voices correctly, and rediarize reconciles its turns with the existing transcript so you keep the good words and replace only the bad attribution.

The command is structural and audio-free: it never touches the recording. The diarizer runs elsewhere (any tool, any model) and hands its result across a documented JSON boundary.

Pipeline position

flowchart LR
    Media["recording\n(audio)"] --> Diarizer["external diarizer\n(e.g. pyannote)"]
    Diarizer --> Turns["turns.json\n(documented format below)"]
    Media --> Asr["ASR with bundled\ndiarization"]
    Asr --> AsrCha["asr.cha\ngood words,\nsuspect speaker tracks"]
    AsrCha --> Rediarize["chatter rediarize\n(this page)"]
    Turns --> Rediarize
    Rediarize --> Fixed["rediarized.cha\nPAR0..PARn correctly\nseparated tracks"]
    Fixed --> SpkId["chatter speaker-id\n(assign real roles)"]

rediarize fixes WHICH anonymous track owns each utterance; it does not decide who each track is. Role assignment (child, mother, investigator, …) is chatter speaker-id’s job, downstream.

Usage

chatter rediarize INPUT.cha --turns TURNS.json -o OUTPUT.cha

Omitting -o prints the rewritten CHAT to stdout.

A summary is reported on stderr after the rewrite (stderr so that a stdout CHAT stream stays clean when -o is omitted):

rediarize: 214 reassigned, 671 unchanged, 7 flagged

Flagged utterances (see below) are listed individually with their utterance index, kept speaker, and reason.

Machine-readable summary (`--summary-json`)

Batch drivers looping rediarize over a corpus should not scrape the stderr text. --summary-json PATH additionally writes the outcome as JSON:

chatter rediarize INPUT.cha --turns TURNS.json \
    -o OUTPUT.cha --summary-json SUMMARY.json

{
  "source": "pyannote/speaker-diarization-community-1",
  "reassigned": 747,
  "unchanged": 145,
  "flagged": [
    {"utterance_index": 12, "kept_speaker": "PAR1",
     "reason": "no_overlapping_turn"}
  ]
}

source: the turns file’s provenance, passed through (null if the turns file carried none).
reassigned / unchanged: utterance counts. unchanged includes flagged utterances (they kept their speaker), so reassigned + unchanged equals the file's total main-tier utterance count, bulleted or not.
flagged: every declined reattribution, never truncated (the stderr listing caps at 20 detail lines; this list is complete). utterance_index is the 0-based position among main-tier lines; reason is "no_bullet" or "no_overlapping_turn".

Field names and the reason strings are a stable output contract. The summary is written only on exit 0, after the CHAT output.
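
A batch driver might consume the summary along these lines. This is a sketch: the function name and the shape of the returned dictionary are illustrative, and only the field names and reason strings documented above are relied on.

```python
import json
from collections import Counter

EXAMPLE = """
{
  "source": "pyannote/speaker-diarization-community-1",
  "reassigned": 747,
  "unchanged": 145,
  "flagged": [
    {"utterance_index": 12, "kept_speaker": "PAR1",
     "reason": "no_overlapping_turn"}
  ]
}
"""


def summarize(summary_text: str) -> dict:
    s = json.loads(summary_text)
    by_reason = Counter(item["reason"] for item in s["flagged"])
    return {
        # unchanged already includes the flagged utterances
        "total": s["reassigned"] + s["unchanged"],
        "flagged": len(s["flagged"]),
        "by_reason": dict(by_reason),
        "source": s["source"],
    }


print(summarize(EXAMPLE))
```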

The turns JSON format

The --turns file is the corpus-agnostic seam between the diarizer and chatter. Producing it from any given diarizer’s native output is the caller’s concern; the format is:

{
  "source": "pyannote/speaker-diarization-community-1",
  "turns": [
    {"track": "PAR0", "start_ms": 12063, "end_ms": 17024},
    {"track": "PAR1", "start_ms": 13379, "end_ms": 14375}
  ]
}

source (optional): free-form provenance, typically the diarizer model name. Not interpreted, but useful in audit trails.
turns (required): the timestamped segments. Each has:
- track: the anonymous CHAT speaker code this segment belongs to (PAR0, PAR1, …). The producer chooses the codes; a deterministic mapping from diarizer-native labels (for example pyannote’s SPEAKER_00) is recommended.
- start_ms / end_ms: the segment’s media time span in integer milliseconds, half-open [start_ms, end_ms), with end_ms >= start_ms.

Turns MAY overlap each other (diarizers that permit overlapping speech produce such turns); max-overlap attribution handles that naturally. Unknown fields anywhere in the file are rejected, so a misspelled field fails loudly instead of being silently ignored.

Behavior contract

Every utterance with a time bullet is assigned to the turn track with the greatest millisecond overlap against the bullet’s span. An utterance already on its max-overlap track counts as unchanged.
An utterance with no bullet, or whose bullet overlaps no turn at all, keeps its existing speaker and is flagged in the summary. Ambiguity is surfaced, never silently guessed.
@Participants and @ID headers are reconciled to declare exactly the set of tracks the output actually uses: new tracks get entries cloned from an existing participant (same role), declarations for tracks no longer used are dropped.
Utterance content, dependent tiers, and all other headers are preserved as-is.
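
The attribution rule can be sketched as below. Two points are assumptions of this sketch rather than documented behavior: overlap is summed per track when a track has several turns inside the bullet, and ties go to whichever track was seen first. Function names are hypothetical.

```python
def overlap_ms(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Length of the intersection of two half-open millisecond spans."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def attribute(bullet, turns, current):
    """Return (speaker, flag_reason); flag_reason is None when not flagged."""
    if bullet is None:
        return current, "no_bullet"
    start, end = bullet
    totals: dict[str, int] = {}
    for t in turns:
        ov = overlap_ms(start, end, t["start_ms"], t["end_ms"])
        if ov > 0:
            totals[t["track"]] = totals.get(t["track"], 0) + ov
    if not totals:
        return current, "no_overlapping_turn"
    return max(totals, key=totals.get), None


turns = [
    {"track": "PAR0", "start_ms": 12063, "end_ms": 17024},
    {"track": "PAR1", "start_ms": 13379, "end_ms": 14375},
]
print(attribute((13000, 15000), turns, "PAR1"))  # ('PAR0', None)
print(attribute((20000, 21000), turns, "PAR1"))  # ('PAR1', 'no_overlapping_turn')
print(attribute(None, turns, "PAR1"))            # ('PAR1', 'no_bullet')
```

In the first call PAR0 covers 2000 ms of the bullet and PAR1 only 996 ms, so the utterance moves to PAR0 even though both turns overlap it.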

Exit codes

Code	Meaning
0	Rewrite completed and output written. Flagged utterances do not fail the command; check the summary.
1	Invalid input: unreadable file, CHAT parse failure, malformed turns JSON.
2	Precondition violation: the turns JSON parsed but is semantically defective (for example a turn with `end_ms < start_ms`).

On any non-zero exit, no output file is written.

Worked example

A recording of one child and two parents, transcribed by an ASR whose bundled diarization auto-detected two speakers (the two adults were merged into one track). A dedicated diarizer found three voices and produced turns.json with PAR0/PAR1/PAR2. Then:

chatter rediarize session.cha --turns turns.json -o session-3spk.cha
chatter validate session-3spk.cha

splits the merged adult track by time, declares PAR2 in the headers, and leaves every word as the ASR wrote it. The output then flows into chatter speaker-id (or the merge workflow) to name the three tracks.

Merge (`chatter merge`)

Status: Draft Last modified: 2026-07-07 13:40 EDT

chatter merge combines two CHAT transcripts that cover the same media recording into one. The caller designates which speakers’ utterances are authoritative in which file; the merged output interleaves them by time while byte-preserving every utterance from its designated source.

The command is structural: it does not invent or rewrite utterance content, does not run ASR, does not run forced alignment, does not infer speaker identity. It is the moment in a multi-input CHAT workflow where two parsed transcripts become one.

When to use it

Whenever you have two valid CHAT files of the same recording and you want a single combined CHAT file out, with explicit per-speaker provenance.

Two recurring shapes from real TalkBank workflows:

Hand-coded target speaker + ASR everyone else. A contributor has hand-transcribed only the target speaker (often the child in child-language research) with rich disfluency and error coding, and separately someone runs ASR on the same media to produce a rough-but-complete transcript with all speakers. chatter merge combines them with the hand-coded target speaker’s utterances byte-preserved and the other speakers spliced in from the ASR file.
Older hand transcript + later supplementary transcription. A legacy CHAT file covers most of the recording; a newer pass transcribes additional content (an investigator’s turns, a parent’s turns, a second target child). Merge with --retain listing the speakers whose content lives in the legacy file.

In both shapes the speakers are the unit of authority, not the files. chatter merge’s job is to express that mapping cleanly.

Conceptual model

A CHAT file describes utterances on a shared media timeline. Two CHAT files of the same media share the same timeline; their utterance sets may overlap (same speech transcribed twice) or be disjoint (each file covers different speakers). The merged output is a single CHAT file on the same timeline whose utterance set is the disjoint union of:

the utterances of every speaker listed in --retain from the first input file, and
the utterances of every speaker NOT listed in --retain from the second input file.

Retained-speaker utterances from the first file are kept byte-for-byte identical, including every dependent tier they own (%wor, %mor, %gra, %com, %pho, …). Inserted-speaker utterances from the second file have their downstream-generated dependent tiers (%wor/%mor/%gra/%pho, anything a later pipeline stage will regenerate) stripped before insertion, so the merged file is in a clean state for batchalign3 align and batchalign3 morphotag to own those tiers authoritatively post-merge.

flowchart LR
    File1["File 1<br/>any CHAT file"] --> Merge
    File2["File 2<br/>any CHAT file<br/>(same media)"] --> Merge
    Retain["--retain CHI[,SPK,…]"] -.-> Merge
    Merge["chatter merge<br/>(structural)"] --> Out["Merged CHAT file<br/>retained speakers: byte-stable from File 1<br/>inserted speakers: from File 2,<br/>derived tiers stripped"]

CLI contract

chatter merge <FILE1> <FILE2> --retain <SPEAKER_LIST> [OPTIONS]

ARGUMENTS:
  <FILE1>  Path to the first CHAT file. Speakers listed in --retain are
           taken from here, byte-preserved.
  <FILE2>  Path to the second CHAT file. All other speakers are taken
           from here.

REQUIRED OPTIONS:
  --retain <SPEAKER>[,<SPEAKER>...]
           Comma-separated list of speaker codes (e.g. CHI, or
           CHI,SI2). These speakers' utterances come from <FILE1>;
           everything else comes from <FILE2>.

OPTIONS:
  -o, --output <PATH>
           Write merged output to PATH. Default: stdout.

  --strip-tiers <TIER>[,<TIER>...]
           Dependent tier names to strip from inserted-speaker
           utterances before merging. Default: wor,mor,gra,pho.
           Use empty list (--strip-tiers '') to preserve all
           dependent tiers as-is.

  --allow-bullet-drift
           Permit small backward-time bullets in either input (where
           one utterance's end_ms is slightly greater than the next
           utterance's start_ms). Default behavior: warn but proceed.
           Set this flag to silence the warning.

Exit codes:

Code	Meaning
0	Merge succeeded
1	Invalid input (parse error, missing file, unreadable)
2	Semantic precondition violated (e.g. retained speaker missing from File 1, `@Languages` mismatch between the files, no time bullets in File 1)
3	Internal error

What the merged output guarantees

These are testable invariants. Every release verifies them against the reference corpus.

Retained speakers are byte-stable

For every speaker code in --retain, every main-tier line and every dependent-tier line attached to that speaker in <FILE1> appears byte-for-byte identical in the merged output, in the same relative order they appeared in <FILE1>. CHAT markup, NAK-delimited time bullets, paralinguistic annotations, retracing scope, terminator variants, and special-form @l/@n/@c suffixes are all preserved.

This is the core semantic guarantee of merge: if you hand-coded disfluency on the target speaker, the disfluency coding survives the merge without any structural change.

Inserted speakers’ downstream-generated tiers are stripped

For every speaker code in <FILE2> that is NOT in --retain, the utterance is included in the merged output with its main tier preserved verbatim BUT with %wor, %mor, %gra, and %pho removed (configurable via --strip-tiers). Other dependent tiers (%com, %spa, %act, %sit, %add, contributor-specific tiers) are preserved.

The rationale: batchalign3 align and batchalign3 morphotag are the authoritative source stages for these tiers in the post-merge pipeline. Carrying inserted-speaker %wor across the merge would leave the merged file in a half-state, some utterances would have %wor, others would not, and downstream behavior on mixed inputs is undefined. The contract is: enter the post-merge stages in a clean state, exit with the tier present and consistent across every utterance.

Utterance order is timeline order

Utterances in the merged output appear in ascending order by their start time bullet (\x15START_END\x15, milliseconds). Where two utterances have identical start times, the first-file utterance comes first.
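
Taken together, the selection, tier-stripping, and ordering rules can be sketched on a toy utterance type. The `Utt` class and `merge` function are hypothetical stand-ins for the real model types. The sketch assumes every utterance already carries a start time, and it sorts purely by start time, so it ignores the backward-bullet and missing-bullet cases.

```python
from dataclasses import dataclass, field


@dataclass
class Utt:
    speaker: str
    start_ms: int
    text: str
    tiers: dict = field(default_factory=dict)


def merge(file1, file2, retain, strip=("wor", "mor", "gra", "pho")):
    # Retained speakers come from File 1, untouched (source rank 0).
    keyed = [(u.start_ms, 0, u) for u in file1 if u.speaker in retain]
    # Everyone else comes from File 2 with derived tiers stripped (rank 1).
    for u in file2:
        if u.speaker in retain:
            continue
        tiers = {k: v for k, v in u.tiers.items() if k not in strip}
        keyed.append((u.start_ms, 1, Utt(u.speaker, u.start_ms, u.text, tiers)))
    # Stable sort: start time first, then File 1 before File 2 on ties.
    keyed.sort(key=lambda item: (item[0], item[1]))
    return [u for _, _, u in keyed]


file1 = [Utt("CHI", 1000, "hello ."), Utt("CHI", 3000, "bye .")]
file2 = [
    Utt("INV", 1000, "hi .", {"wor": "hi", "com": "note"}),
    Utt("CHI", 1000, "hullo ."),
    Utt("INV", 2000, "ok ."),
]
for u in merge(file1, file2, {"CHI"}):
    print(u.speaker, u.start_ms, u.text, u.tiers)
```

The ASR file's own CHI utterance is ignored, the tie at 1000 ms puts the File 1 utterance first, and the inserted INV utterance keeps `%com` but loses `%wor`.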

Time bullets are pass-through

chatter merge does NOT recompute, smooth, or refresh time bullets. The bullets in the merged output are exactly those that appeared in the source files. If <FILE2> had %wor rows whose first/last word times implied a slightly different utterance span than the main-tier bullet, the main-tier bullet wins (it was the contract before merge).

If the merge stage detects an inserted-speaker utterance with no main-tier bullet at all (Batchalign occasionally omits these), it lifts a bullet from the corresponding %wor row’s first-word start and last-word end, appending it to the main tier so the merged file has uniform bullet placement. The original %wor is then stripped (per the per-tier rule above).

Header reconciliation

The merged file’s headers are constructed deterministically from the two inputs:

Header	Source	Notes
`@UTF8`	File 1	always required to be `@UTF8`
`@Begin` / `@End`	File 1	always present in merge output
`@Window`	File 1 if present	not generated if absent
`@Languages`	File 1	File 2’s languages must be a subset of File 1’s; any language declared in File 2 but not in File 1 is an error
`@Media`	File 1	File 2’s `@Media` is discarded; warning if mismatched media filename (NOT the modality field, see below)
`@Participants`	concatenation	File 1’s entries first, then File 2’s entries for non-retained speakers in their original order
`@ID`	concatenation	File 1’s `@ID` rows first; File 2’s `@ID` rows for non-retained speakers appended in their original order
`@Comment`	concatenation	File 1’s `@Comment` rows first; File 2’s `@Comment` rows appended in original order (preserves any provenance comments like ASR engine/run timestamp)

The @Media modality field (audio vs video) is a known divergence point: when ASR runs against an mp4, the original transcript may declare video while the ASR output declares audio. File 1's modality wins, as with all @Media content; no warning is emitted for modality mismatch.

Overlap markup is NOT injected

When an inserted-speaker utterance temporally overlaps a retained-speaker utterance, chatter merge does NOT inject CHAT [>] / [<] / angle-bracket-scoped overlap markers. The time bullets carry overlap information; markers are a CLAN-era surface convention that the output of chatter merge deliberately omits.

The retained speakers’ existing overlap markers (if File 1 already contains some) are preserved byte-stably under the byte-preservation rule above.

Preconditions

chatter merge refuses (exit code 2) if any of these hold:

File 1 declares no utterances for any speaker in --retain.
File 1 has no time-bulleted utterances at all (no shared timeline to merge against).
File 2's @Languages header declares a language that File 1's does not (File 2's languages must be a subset of File 1's).
A speaker code appears in both files but not in --retain (use --retain to disambiguate).
File 2 is missing or unparseable (reported as invalid input, exit code 1, rather than exit code 2).

chatter merge does NOT refuse on these:

Small backward-time bullets in either input (one utterance ends slightly after the next starts). These are common in hand transcripts and not corrupting; downstream batchalign3 align cleans them. The command warns and proceeds; --allow-bullet-drift silences the warning.
File 2’s @Media modality disagrees with File 1’s (audio vs video). File 1’s modality wins and no warning is emitted.
File 1 has fewer utterances than File 2, or vice versa.

Speaker identity in File 2 must already be coherent

chatter merge does NOT identify or rename speakers. If File 2 came from ASR and carries anonymous codes like PAR0, PAR1, run chatter speaker-id first to assign CHAT-conformant codes. The merge step trusts whatever speaker codes appear in its inputs.

What chatter merge is NOT

Not ASR. Use batchalign3 transcribe.
Not forced alignment. Use batchalign3 align.
Not morphological tagging. Use batchalign3 morphotag.
Not speaker identification. Use chatter speaker-id.
Not content reconciliation. If two files disagree about what a speaker said at the same time, chatter merge does not adjudicate; it trusts --retain to designate one file as authoritative per speaker.
Not three-way or n-way merge in this release. The 2-input case composes into the n-input case by chaining (chatter merge a b --retain X -o tmp.cha && chatter merge tmp.cha c --retain Y -o out.cha). A future release may add native n-ary merging if a workflow appears for which chained 2-way merges are awkward.

Worked example

A speech-pathology lab hand-transcribed a child’s spontaneous-speech session, marking disfluency carefully, but did not transcribe the clinician’s turns. They send the media and the child-only transcript. The project runs ASR on the media to produce a full-coverage transcript with anonymous speaker codes, chatter speaker-id labels the ASR file’s adult speaker as INV, and chatter merge then combines the two files.

# After ASR labeling: asr.cha has speakers CHI and INV.
chatter merge child-only.cha asr.cha \
  --retain CHI \
  -o merged.cha

# Then alignment regenerates %wor cleanly across all speakers:
batchalign3 align merged.cha

# Then morphotag regenerates %mor and %gra:
batchalign3 morphotag merged.cha

The merged file contains:

Every *CHI utterance byte-stable from child-only.cha, including every disfluency marker, every retracing scope, every paralinguistic annotation, every %com session-structural comment.
Every *INV utterance from asr.cha, in their original time order, interleaved with the *CHI utterances by start time.
One @Participants row listing CHI and INV; @ID rows for both; the union of @Comment rows including any ASR provenance comments from asr.cha.

Relationship to other commands

flowchart TB
    Media[Media file mp4 / wav] --> Transcribe
    Transcribe["batchalign3 transcribe<br/>ASR"] --> AsrAnon["asr-anonymous.cha<br/>PAR0, PAR1, ..."]
    HandTranscript["hand-transcript.cha<br/>target speakers only"] --> SpeakerId
    AsrAnon --> SpeakerId
    SpeakerId["chatter speaker-id<br/>label anon speakers"] --> AsrLabeled["asr-labeled.cha<br/>CHI, INV, MOT, ..."]
    HandTranscript --> Merge
    AsrLabeled --> Merge
    Merge["chatter merge<br/>(this page)"] --> Merged[merged.cha]
    Merged --> Align
    Align["batchalign3 align"] --> Aligned[aligned.cha]
    Aligned --> Morph
    Morph["batchalign3 morphotag"] --> Final[final.cha]

chatter merge sits between speaker-identity resolution and forced alignment. It assumes its inputs have coherent CHAT-conformant speaker codes (no anonymous PAR0/PAR1) and emits a file ready for batchalign3 align to refresh timing and produce %wor.

Inputs must be valid CHAT (`pipeline` / `batch`)

The per-session chatter pipeline shortcut and the directory-level chatter batch driver validate every input as CHAT before doing any speaker-id or merge work. Each donor and the reference it is merged against must pass the same validation chatter validate runs; an input that fails is never merged. Clean invalid transcripts to valid CHAT first (run chatter validate <file> to see the errors), then re-run.

chatter pipeline refuses (exit 2, no output written) if its donor or reference is invalid CHAT.
chatter batch is fail-closed and whole-batch: if any input under the donor/reference directories is invalid CHAT, it reports every offending file and aborts the entire run without merging a single session. “All inputs are chatter-valid” is a hard precondition of the batch, not something discovered session-by-session mid-run.

This gate catches validation-only invalidity (files that parse but fail chatter validate, e.g. a malformed @ID), which the lower-level chatter merge parse is otherwise lenient about.

LLM holistic judgment (pending-only)

--judgment holistic is now reachable from pipeline and batch (not just speaker-id). In holistic mode the command is pending-only: it writes an engine = "llm" review-gated entry via --write-pending and produces no merged file. The operator supplies the LLM connection with --llm-endpoint / --llm-model (or the environment variables CHATTER_LLM_ENDPOINT / CHATTER_LLM_MODEL); an optional --session-context <file.json> provides per-session context that the LLM prompt includes to sharpen its judgment.

Response caching (`--llm-cache`)

--llm-cache <file> (env fallback CHATTER_LLM_CACHE) points holistic judgment at a persistent, write-through JSON response cache. When set, a request identical to one already answered (same endpoint, model, and rendered prompt) is served from the cache file instead of making another LLM call, so re-running a batch after a crash, or after fixing an unrelated bug, does not pay again for sessions that already completed. The cache key folds in the exact wire request, so any prompt or PromptVersion change invalidates stale entries automatically; no separate version bump is needed. A cache file that exists but is not valid JSON is a hard error (the run refuses rather than silently ignoring or overwriting it); a missing file is treated as an empty cache and created on first write. When neither the flag nor the environment variable is set, holistic judgment runs uncached, which is the default behavior. chatter batch threads --llm-cache to every per-session chatter pipeline subprocess it spawns, so one cache file accumulates entries across the whole batch.
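
The caching behavior described above can be pictured with a small write-through cache. Everything concrete in this sketch is an assumption made for illustration: the SHA-256 key derivation, the flat JSON object on disk, and the function names. chatter's actual key and file layout are not specified on this page.

```python
import hashlib
import json
import os


def cache_key(endpoint: str, model: str, prompt: str) -> str:
    # Hash the full request so any prompt change yields a new key.
    wire = json.dumps({"endpoint": endpoint, "model": model, "prompt": prompt},
                      sort_keys=True)
    return hashlib.sha256(wire.encode("utf-8")).hexdigest()


def load_cache(path: str) -> dict:
    if not os.path.exists(path):
        return {}  # a missing file is an empty cache
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)  # invalid JSON raises: hard error, no overwrite
    if not isinstance(data, dict):
        raise ValueError("cache file is not a JSON object")
    return data


def cached_call(path, endpoint, model, prompt, call):
    cache = load_cache(path)
    key = cache_key(endpoint, model, prompt)
    if key not in cache:
        cache[key] = call(endpoint, model, prompt)
        with open(path, "w", encoding="utf-8") as fh:
            json.dump(cache, fh)  # write-through on every new answer
    return cache[key]
```

Because the file is rewritten after each new answer, a crash mid-batch loses at most the request in flight.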

Session-context JSON (`--session-context`)

The session-context file is a corpus-agnostic JSON object mapping session IDs (the donor file’s basename stem) to context records. Every record field is optional, and the label fields are free vocabulary: chatter imposes no closed set, the labels are surfaced verbatim into the LLM prompt.

{
  "SESSION-ID": {
    "sample_type": "clinician interview",
    "declared_roles": ["Investigator"],
    "consent_tier": "video+audio",
    "age_months": 52
  }
}

sample_type: what kind of speech sample the session is (e.g. "narrative retell").
declared_roles: adult roles declared present in the session.
consent_tier: media-consent tier governing what may be shared.
age_months: child age in months at the session.

When --session-context is absent, the CHATTER_SESSION_CONTEXT environment variable supplies the path (empty counts as unset). Per session, each context field resolves in order: the explicit record from the file; for the age only, the donor’s CHAT @ID age header (pure CHAT, no external metadata needed); otherwise unknown. Absent sessions or fields are passed to the judgment as unknown, never guessed. A configured-but-malformed file is a hard error, and labels must contain at least one non-whitespace character. Configuring session context on a non-holistic run prints a warning (the deterministic judgment never consults it).
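
The per-field resolution order can be sketched as follows. The function names and the `"unknown"` sentinel are illustrative. The age conversion assumes the CHAT `years;months.days` age form and ignores the days component, which may not match how chatter rounds.

```python
UNKNOWN = "unknown"
FIELDS = ("sample_type", "declared_roles", "consent_tier", "age_months")


def age_to_months(age: str):
    """Convert a CHAT @ID age such as '4;04.00' to whole months, or None."""
    years, _, rest = age.partition(";")
    months = rest.split(".", 1)[0]
    if not years.isdigit():
        return None
    return int(years) * 12 + (int(months) if months.isdigit() else 0)


def resolve_context(records: dict, session_id: str, id_age: str = "") -> dict:
    record = records.get(session_id, {})
    resolved = {}
    for name in FIELDS:
        value = record.get(name)            # 1. explicit record from the file
        if value is None and name == "age_months":
            value = age_to_months(id_age)   # 2. age only: the donor's @ID age
        resolved[name] = UNKNOWN if value is None else value  # 3. unknown
    return resolved


records = {"S1": {"sample_type": "clinician interview"}}
print(resolve_context(records, "S1", "4;04.00"))
print(resolve_context(records, "S2"))
```

A session missing from the file still resolves, with every field reported as unknown unless the `@ID` age is available.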

Conversion from a contributor’s own records format (a spreadsheet, a database export) to this JSON happens outside chatter.

The two-pass operator flow is:

batch --judgment holistic --session-context context.json --write-pending P accumulates one engine = "llm" pending entry per session in P.
Operator reviews P, accepts or corrects each entry.
chatter adjudicate promotes reviewed entries to the override file.
batch (deterministic, reading the override file) replays every confirmed mapping and writes the merged files.

Note: the MLU sanity-scan is unreliable for the FluencyBank clinical-interview corpus (children out-narrate the adult, so MLU ratios invert relative to typical child-language recordings). Holistic-pending review via --judgment holistic is the trustworthy alternative there.

Implementation notes (for contributors)

Source: crates/talkbank-transform/src/transcript_merge/ (proposed layout, see the design plan).
CLI surface: crates/chatter/src/commands/transcript_merge/.
Domain types (SpeakerCode, RetainSet, MergeOverride, SpeakerMapping) live in talkbank-model so the override-file format is sharable across the speaker-id stage, the orchestrator, and any future adjudication UI.
The merge operates on talkbank-model::ChatFile; both inputs are parsed via talkbank-parser. The byte-preservation guarantee on retained-speaker utterances relies on the parser’s existing round-trip serialization.
Spec entries exercising the merge live in spec/constructs/. Every behavioral invariant on this page has a spec; tests are regenerated via the current spec/tools workflow documented in Spec Workflow.
This page is the user contract; book/src/chatter/reference/ carries the override-file reference for the speaker-id stage that this merge consumes.

The Merge Workflow (`pipeline`, `batch`, `adjudicate`, `sanity-scan`)

Status: Draft (experimental) Last modified: 2026-07-07 21:20 EDT

The merge workflow combines, at scale, the two structural primitives documented elsewhere, chatter speaker-id (assign CHAT-conformant speaker codes to an anonymous donor) and chatter merge (combine two transcripts of the same recording), and adds the operator loop needed when the automatic speaker decision is not confident enough to trust.

Four commands make up the workflow. They are experimental and in active development; flags and behavior may change.

Command	Scope	Role
`chatter pipeline`	one session	speaker-id (reference mode) then merge, in a single invocation
`chatter batch`	a directory pair	loop `pipeline` over matched donor / reference files
`chatter adjudicate`	the operator	resolve the low-confidence sessions a pass left pending
`chatter sanity-scan`	merged output	flag confident auto-decisions that still look suspicious

If you only have one pair of files and one clean answer, reach for pipeline. Everything else here is about doing that safely across a directory of sessions where some answers are not clean.

The big picture: a two-pass loop

The hard part of merging at scale is not the merge; it is deciding, per session, which anonymous ASR speaker is the child the reference already covers. speaker-id’s multiset-Jaccard match (see its page) answers that automatically when the winner clearly beats the runner-up, and refuses (exit code 4) when it does not. The workflow turns that refusal into a reviewable queue.

flowchart TD
    subgraph Pass1["Pass 1: automatic"]
        B1["chatter batch DONOR_DIR REF_DIR\n--write-override audit.toml\n--write-pending pending.toml"]
        B1 --> Clean["confident sessions:\nmerged file written,\ndecision logged to audit.toml"]
        B1 --> Refused["low-confidence sessions:\nNO merge, appended to pending.toml\n(exit code 4)"]
    end
    Refused --> Adj["chatter adjudicate pending.toml\n--override-file audit.toml\n(operator decides)"]
    Adj --> Pass2["Pass 2: chatter batch ... --override-file audit.toml\n(replays the operator's decisions,\nmerges the previously-refused sessions)"]
    Clean --> Done["all sessions merged"]
    Pass2 --> Done

Pass 1 merges everything it is confident about and parks the rest. The operator works the parked queue once. Pass 2 replays their decisions. The same chatter batch (or chatter pipeline) command runs both passes; what changes is whether an override file with entries exists yet.
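The split between confident and refused sessions can be illustrated with a small sketch. It assumes the common definition of multiset Jaccard (sum of per-token minimum counts over sum of maximum counts) and reads the confidence margin as "winner score is at least threshold times the runner-up score"; `bag`, `multiset_jaccard`, and `decide` are hypothetical names, not chatter's API.

```rust
use std::collections::HashMap;

/// Count how often each token occurs (a multiset of words).
fn bag<'a>(tokens: &[&'a str]) -> HashMap<&'a str, usize> {
    let mut counts = HashMap::new();
    for &t in tokens {
        *counts.entry(t).or_insert(0) += 1;
    }
    counts
}

/// Multiset Jaccard: sum of minimum counts over sum of maximum counts.
fn multiset_jaccard<'a>(a: &[&'a str], b: &[&'a str]) -> f64 {
    let (ca, cb) = (bag(a), bag(b));
    let mut inter = 0usize;
    let mut union = 0usize;
    for (tok, &na) in &ca {
        let nb = cb.get(tok).copied().unwrap_or(0);
        inter += na.min(nb);
        union += na.max(nb);
    }
    // Tokens that occur only in `b` contribute to the union alone.
    for (tok, &nb) in &cb {
        if !ca.contains_key(tok) {
            union += nb;
        }
    }
    if union == 0 { 0.0 } else { inter as f64 / union as f64 }
}

/// Index of the winning candidate, or None (a refusal) when the winner does
/// not beat the runner-up by at least `threshold` times.
fn decide(scores: &[f64], threshold: f64) -> Option<usize> {
    let mut order: Vec<usize> = (0..scores.len()).collect();
    order.sort_by(|&i, &j| scores[j].total_cmp(&scores[i]));
    let winner = *order.first()?;
    let runner_up = order.get(1).map_or(0.0, |&i| scores[i]);
    (scores[winner] > 0.0 && scores[winner] >= runner_up * threshold).then_some(winner)
}
```

In this sketch a `None` from `decide` corresponds to the exit-code-4 refusal: the session is not merged and would be queued for the operator.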

`chatter pipeline` (one session)

The per-session shortcut: run speaker-id in reference mode to relabel an anonymous donor, then merge the relabeled donor with the reference, in one command instead of two.

chatter pipeline <DONOR> <REFERENCE> \
  --anchor <SPEAKER> --inserted-role <CODE>:<ROLE> --output <PATH> [OPTIONS]

ARGUMENTS:
  <DONOR>      Donor CHAT file with anonymous speaker codes (the ASR output).
  <REFERENCE>  Reference CHAT file carrying the authoritative anchor speaker
               (typically the hand-coded child transcript).

REQUIRED:
  --anchor <SPEAKER>            Anchor code in the reference (typically CHI).
  --inserted-role <CODE>:<ROLE> Role for the donor's non-anchor speakers
                                (e.g. INV:Investigator).
  -o, --output <PATH>           Output path for the merged CHAT file.

KEY OPTIONS:
  --retain <SPEAKER>            Speaker(s) taken from the reference in the
                               final merge (typically the same as --anchor).
  --confidence-threshold <F>    Minimum winner/runner-up Jaccard margin to
                               auto-decide (default 2.0x).
  --write-override <FILE>       On a confident auto-decision, append a
                               mode = "auto" audit entry for this session.
  --write-pending <FILE>        On a low-confidence refusal, append a pending
                               entry (exit code 4 still fires).
  --override-file <FILE>        If the file has an entry for this session
                               (the donor's basename stem), replay that
                               decision instead of running reference mode.

The same command serves pass 1 (no override entry yet, run reference mode) and pass 2 (entry present, replay it). Validation is a hard precondition: a donor or reference that fails chatter validate is never merged (exit 2, nothing written).

`chatter batch` (a directory pair)

Loops pipeline over matched files: the reference for DONOR_DIR/X.cha is REFERENCE_DIR/X.cha. Donors without a matching reference are skipped with a warning. Validity is checked for the whole batch and fails closed: if any input under either directory is invalid CHAT, the batch reports every offending file and aborts without merging a single session.

chatter batch <DONOR_DIR> <REFERENCE_DIR> \
  --anchor <SPEAKER> --inserted-role <CODE>:<ROLE> --output <DIR> [OPTIONS]

PASS-1 AUDIT + QUEUE:
  --write-override <FILE>  Append every confident auto-decision (mode =
                          "auto"). Required if you want --sanity-scan.
  --write-pending <FILE>   Aggregate every low-confidence refusal into one
                          pending file. One `chatter adjudicate` run resolves
                          them all. Refusals do NOT abort the batch.

PASS-2 REPLAY:
  --override-file <FILE>   Threaded to every per-session pipeline call.
                          Sessions with an entry replay it; the rest fall
                          through to reference mode.

POST-MERGE QA:
  --sanity-scan            Run `sanity-scan` after the loop. Requires
                          --write-override (it reads the auto-decisions) and
                          --write-pending (flagged sessions are appended).
                          Exit code 4 fires if it flags any session.
  --sanity-scan-threshold <F>  Heuristic ratio (default 1.5).

OPERATIONAL:
  --skip-existing          Skip donors whose merged output already exists, to
                          resume an interrupted batch.

batch also accepts the same --judgment deterministic|holistic and LLM / --session-context options as pipeline; see Merge, LLM holistic judgment for that mode and the session-context JSON format.

Reading the batch summary

Every run ends with one summary line on stderr that accounts for every matched donor exactly once:

batch summary: 345 matched donor(s); 0 merged, 345 suggestions awaiting
adjudication, 0 low-confidence refusals awaiting adjudication, 0 errored,
0 unmatched (no reference), 0 skipped (output existed)

merged: the pipeline produced a merged output file (deterministic mode, or a session already covered by an override decision).
suggestions awaiting adjudication: holistic mode judged the session confidently and wrote a suggestion to the pending file; no merge happens until an operator accepts it via chatter adjudicate.
low-confidence refusals awaiting adjudication: the engine declined to suggest (below the confidence threshold); the entry is in the pending file for a human call.
errored / unmatched / skipped: per-session failures, donors with no same-named reference file, and outputs that already existed under --skip-existing, respectively.

The distinction between the first three matters operationally: a holistic run that ends 0 merged, N suggestions awaiting adjudication has done its job; the merge itself happens after adjudication.

`chatter adjudicate` (the operator step)

Reads the pending file a pass produced, walks the operator through the unresolved sessions, and appends the resolved decisions to the override file. On success the pending file is rewritten to drop the entries that were resolved, so re-running adjudicate only ever shows what is left.

chatter adjudicate <PENDING> --override-file <FILE> [--interactive | --scripted <TOML>]

ARGUMENTS:
  <PENDING>  The pending-adjudications TOML a pass wrote.

REQUIRED:
  --override-file <FILE>  Override file to append resolved decisions to
                         (created if absent). This is the same file pass 2
                         reads back.

DECISION SOURCE (one of):
  --interactive           Prompt per pending entry on stdin. See "The
                         interactive decision language" below for the
                         three decision verbs and their syntax.
  --scripted <TOML>       Pre-canned operator decisions, for replayable /
                         tested runs. Mutually exclusive with --interactive.

  --operator <NAME>       Recorded in each override entry (defaults to $USER).

The interactive decision language

Each pending entry is printed with its full context (the sessions, the suggested mapping, the engine’s confidence scores and reasoning), then one line is read from stdin. Three decision verbs are accepted:

Verb	Form	Meaning
`accept` (or `a`)	`accept [note...]`	Take the suggested mapping exactly as proposed
`choose`	`choose SPK:CODE:TAG [SPK:CODE:TAG ...] [note...]`	Supply the speaker mapping yourself: each group maps a donor speaker to a CHAT code and role tag
`override`	`override SPK:CODE:TAG [SPK:CODE:TAG ...] SPK=action [SPK=action ...] [note...]`	Supply the mapping AND per-speaker actions (for example `SPK=drop` to exclude a donor speaker entirely)

SPK:CODE:TAG groups are repeatable, so multi-adult sessions are expressed naturally, one group per speaker:

choose A:CHI:Target_Child B:INV:Investigator C:MOT:Mother reviewed against the recording

Anything after the structured arguments is recorded verbatim as the operator’s note. Every decision (verb, mapping, note, operator, and the engine’s original scores) is appended to the override file, so the audit trail survives the session.
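To make the line syntax concrete, here is a minimal parsing sketch for the `accept` and `choose` verbs. It is an illustration under stated assumptions, not the tool's parser: the types and function names are hypothetical, `override` is left out, and a token counts as a group only if it has exactly three non-empty colon-separated fields.

```rust
/// One `SPK:CODE:TAG` group: donor speaker, CHAT code, role tag.
#[derive(Debug, PartialEq)]
struct Mapping {
    donor: String,
    code: String,
    role: String,
}

#[derive(Debug, PartialEq)]
enum Decision {
    Accept { note: String },
    Choose { mappings: Vec<Mapping>, note: String },
}

/// Parse a token as `SPK:CODE:TAG`; anything else is not a group.
fn parse_group(token: &str) -> Option<Mapping> {
    let parts: Vec<&str> = token.split(':').collect();
    match parts.as_slice() {
        [donor, code, role] if parts.iter().all(|p| !p.is_empty()) => Some(Mapping {
            donor: donor.to_string(),
            code: code.to_string(),
            role: role.to_string(),
        }),
        _ => None,
    }
}

fn parse_decision(line: &str) -> Result<Decision, String> {
    let mut tokens = line.split_whitespace().peekable();
    let verb = tokens.next().ok_or("empty decision line")?;
    match verb {
        "accept" | "a" => Ok(Decision::Accept {
            note: tokens.collect::<Vec<_>>().join(" "),
        }),
        "choose" => {
            let mut mappings = Vec::new();
            // Leading tokens that look like groups form the mapping...
            while let Some(m) = tokens.peek().and_then(|t| parse_group(t)) {
                mappings.push(m);
                tokens.next();
            }
            if mappings.is_empty() {
                return Err("choose needs at least one SPK:CODE:TAG group".to_string());
            }
            // ...and everything after them is the free-text note.
            Ok(Decision::Choose {
                mappings,
                note: tokens.collect::<Vec<_>>().join(" "),
            })
        }
        other => Err(format!("verb not handled in this sketch: {other}")),
    }
}
```

One consequence of this rule is that a note whose first word happens to have the `X:Y:Z` shape would be read as another group. Note also that joining tokens with single spaces normalizes whitespace, so this sketch does not preserve the note byte for byte.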

This is the interactive review tool the speaker-id and merge pages refer to: the audit trail (who decided, the scores, any note) lands in the override file so a later reader can see why a session was labeled the way it was. The decision schema is the same override-file format used everywhere in the workflow; see Merge Override File Format, and the Adjudication Workflow architecture page for the design.

`chatter sanity-scan` (post-merge QA)

A confident auto-decision can still be wrong: the runner-up may simply have been even further off. sanity-scan re-reads the merged output and the pass-1 audit file and flags sessions that fail an out-of-band check, which compares the mean utterance word count of the anchor speaker with that of the inserted speaker. In a typical child-language recording the adult out-talks the child, so an anchor (child) mean that is much higher than the inserted (adult) mean is suspicious: the two may have been swapped.

chatter sanity-scan <MERGED_DIR> \
  --override-file <FILE> --anchor <SPEAKER> --write-pending <FILE> [OPTIONS]

REQUIRED:
  --override-file <FILE>  The pass-1 audit file. Only auto-decided sessions
                         are scanned; explicit-mode entries are skipped (the
                         operator already signed off).
  --anchor <SPEAKER>      Anchor code in the merged files (typically CHI).
  --write-pending <FILE>  Flagged sessions are appended here as
                         sanity-scan-misclassification pending entries for
                         `chatter adjudicate`. Required.

  --threshold <F>         Flag when anchor_mean >= inserted_mean * threshold
                         (default 1.5).

A flag is a question, not a verdict: the session goes back into the adjudication queue for an operator to confirm or correct. Whether to run the scan at all is a judgment about the corpus. It assumes the typical “adult out-talks child” shape, and is unreliable where that inverts (e.g. a clinical-interview corpus where children out-narrate the adult); there, prefer the LLM holistic-pending review described on the merge page.
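The threshold rule itself is small enough to sketch directly. The function names are hypothetical, and counting words by splitting on whitespace is a simplification; the inputs are assumed to be utterance text with terminators already removed.

```rust
/// Mean words per utterance; None when the speaker has no utterances.
fn mean_words(utterances: &[&str]) -> Option<f64> {
    if utterances.is_empty() {
        return None;
    }
    let total: usize = utterances
        .iter()
        .map(|u| u.split_whitespace().count())
        .sum();
    Some(total as f64 / utterances.len() as f64)
}

/// Flag when anchor_mean >= inserted_mean * threshold.
fn is_suspicious(anchor: &[&str], inserted: &[&str], threshold: f64) -> bool {
    match (mean_words(anchor), mean_words(inserted)) {
        (Some(a), Some(i)) => a >= i * threshold,
        // Assumption: with nothing to compare, there is nothing to flag.
        _ => false,
    }
}
```

With the default threshold of 1.5, a child averaging 1.5 words against an adult averaging 5.5 is not flagged, while the swapped assignment is.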

End-to-end worked example

A directory of ASR donors (asr/) and the matching hand-coded child references (ref/), child anchor CHI, adults labeled INV:

# Pass 1: merge what we are sure of; queue the rest; keep an audit trail.
chatter batch asr/ ref/ \
  --anchor CHI --inserted-role INV:Investigator \
  --output merged/ \
  --write-override audit.toml \
  --write-pending pending.toml \
  --sanity-scan

# Exit 0: every session merged confidently and the scan was clean.
# Exit 4: some sessions are pending (low-confidence and/or scan-flagged).

# Operator resolves the queue once (audit trail recorded):
chatter adjudicate pending.toml --override-file audit.toml --interactive --operator alice

# Pass 2: replay the operator's decisions; the previously-pending
# sessions now merge.
chatter batch asr/ ref/ \
  --anchor CHI --inserted-role INV:Investigator \
  --output merged/ \
  --override-file audit.toml \
  --skip-existing

Exit codes

The workflow commands share the convention used across the merge surface:

Code	Meaning
0	Success
1	Invalid input (parse error, missing file, unreadable)
2	Semantic precondition violated (e.g. invalid CHAT input, missing anchor)
3	Internal error
4	A pass parked work for the operator: a low-confidence speaker-id refusal, or a `sanity-scan` flag. Nothing was lost; the sessions are in the pending file

Exit code 4 is the normal “there is operator work to do” signal, not an error: a batch that parks ten sessions still merged the rest.
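A wrapper that launches these commands might model the table as an enum so that code 4 is never mistaken for a failure. This is a sketch for such a caller; the `MergeExit` type and its methods are hypothetical and not part of chatter.

```rust
/// The exit codes from the table above, as seen by a calling program.
#[derive(Debug, Clone, Copy, PartialEq)]
enum MergeExit {
    Success,
    InvalidInput,
    PreconditionViolated,
    InternalError,
    OperatorWorkPending,
}

impl MergeExit {
    /// Map a raw process exit code to a known meaning, if it has one.
    fn from_code(code: i32) -> Option<Self> {
        match code {
            0 => Some(Self::Success),
            1 => Some(Self::InvalidInput),
            2 => Some(Self::PreconditionViolated),
            3 => Some(Self::InternalError),
            4 => Some(Self::OperatorWorkPending),
            _ => None,
        }
    }

    /// Code 4 means "there is operator work to do", so it is not a failure.
    fn is_failure(self) -> bool {
        !matches!(self, Self::Success | Self::OperatorWorkPending)
    }
}
```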

CHAT Format Overview

Status: Reference Last updated: 2026-05-11 21:51 EDT

CHAT (Codes for the Human Analysis of Transcripts) is a standardized transcription format for spoken language data, developed by MacWhinney as part of the CHILDES and TalkBank projects. It is the most widely used format in child language research and conversational analysis.

File Anatomy

Every CHAT file follows this structure:

@UTF8
@Begin
@Languages:	eng
@Participants:	CHI Target_Child, MOT Mother
@ID:	eng|corpus|CHI|2;6.||||Target_Child|||
@ID:	eng|corpus|MOT|||||Mother|||
*MOT:	what do you want ?
%mor:	ADV|what AUX|do PRON|you VERB|want ?
%gra:	1|4|LINK 2|4|AUX 3|4|SUBJ 4|0|ROOT 5|4|PUNCT
*CHI:	I want cookie .
%mor:	PRON|I VERB|want NOUN|cookie .
%gra:	1|2|SUBJ 2|0|ROOT 3|2|OBJ 4|2|PUNCT
@End

A CHAT file consists of:

@UTF8: required first line, declares UTF-8 encoding
@Begin: marks the start of the transcript
Headers: lines starting with @ that provide metadata (participants, languages, IDs, etc.)
Utterances: blocks consisting of:
- A main tier (line starting with *SPEAKER:) containing the transcribed speech
- Zero or more dependent tiers (lines starting with %tier:) containing annotations
@End: marks the end of the transcript

Key Conventions

Tab separation: a tab character separates the tier prefix from its content (e.g., *CHI:⟶content)
Terminators: every utterance ends with a terminator (., ?, !, or special forms like +...)
Line continuation: long lines wrap with a tab at the start of continuation lines
Speaker codes: short identifiers; the validator accepts up to seven characters from A-Z, 0-9, _, -, '; three uppercase letters is the convention (e.g., CHI, MOT, FAT, INV)
Media linking: timestamps link transcripts to audio/video via bullet markers

CHAT vs Other Formats

Feature	CHAT	Praat TextGrid	ELAN EAF
Morphological tiers	Built-in (%mor, %gra)	No	No
Dependency syntax	Built-in (%gra)	No	No
Standardized POS	UD-style via %mor	No	No
Word-level alignment	%wor tier	Interval-based	Interval-based
Error recovery	Tree-sitter GLR	N/A	N/A

References

CHAT Manual: the canonical reference
TalkBank: the data repository

Headers

Status: Reference Last updated: 2026-05-11 20:30 EDT

Headers are lines beginning with @ that provide metadata about the transcript. They appear between @Begin and the first utterance (though some headers like @Comment can appear anywhere).

Required Headers

@UTF8

Must be the very first line of every CHAT file. Declares UTF-8 encoding.

@UTF8

@Begin / @End

Mark the start and end of the transcript body. Every CHAT file must have exactly one @Begin and one @End.

@Participants

Declares all speakers in the transcript. Format: CODE [Name] Role, comma-separated. The role is required; the name is optional, so each entry is either CODE Role or CODE Name Role.

@Participants:	CHI Target_Child, MOT Mother, FAT Father
@Participants:	CHI Alex Target_Child, MOT Mary Mother

In the first line, Target_Child, Mother, and Father are roles, not names. In the second line, Alex and Mary are optional names sitting between the speaker code and the role.

Speaker codes are short identifiers; the validator accepts up to seven characters from A-Z, 0-9, _, -, and '. The convention is three uppercase letters; the most common codes are:

CHI: target child
MOT: mother
FAT: father
INV: investigator
OBS: observer
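The character and length rule for speaker codes can be written as a short check. This is a sketch of the rule as stated on this page, not the validator's code; the function name is hypothetical, and rejecting the empty string is an assumption.

```rust
/// True when `code` has 1 to 7 characters, all from A-Z, 0-9, `_`, `-`, `'`.
fn is_valid_speaker_code(code: &str) -> bool {
    let len = code.chars().count();
    (1..=7).contains(&len)
        && code
            .chars()
            .all(|c| matches!(c, 'A'..='Z' | '0'..='9' | '_' | '-' | '\''))
}
```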

@ID

Provides detailed metadata for each participant. One @ID line per participant.

@ID:	eng|corpus|CHI|2;6.||||Target_Child|||

Fields (pipe-separated): language, corpus, speaker code, age, sex, group, SES, participant role, education, custom field.

Age format: years;months.days (e.g., 2;6. = 2 years, 6 months).

SES field: ethnicity (White, Black, Asian, Latino, Pacific, Native, Multiple, Unknown), socioeconomic code (UC, MC, WC, LI), or combined with comma separator (e.g., White,MC).
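The field layout and the age format can be illustrated together. The sketch below assumes the input is the header content after the tab, that each of the ten fields is followed by a pipe (so splitting leaves one trailing empty piece), and that the days part of an age may be empty; `IdHeader`, `parse_id`, and `age_in_months` are hypothetical names, not the talkbank-model types.

```rust
/// The ten @ID fields, borrowed from the header content.
#[derive(Debug, PartialEq)]
struct IdHeader<'a> {
    language: &'a str,
    corpus: &'a str,
    speaker: &'a str,
    age: &'a str,
    sex: &'a str,
    group: &'a str,
    ses: &'a str,
    role: &'a str,
    education: &'a str,
    custom: &'a str,
}

/// Split the content of an @ID line into its ten pipe-terminated fields.
fn parse_id(content: &str) -> Option<IdHeader<'_>> {
    let parts: Vec<&str> = content.split('|').collect();
    match parts.as_slice() {
        // Ten fields plus the empty piece after the final pipe.
        [language, corpus, speaker, age, sex, group, ses, role, education, custom, ""] => {
            Some(IdHeader {
                language: *language,
                corpus: *corpus,
                speaker: *speaker,
                age: *age,
                sex: *sex,
                group: *group,
                ses: *ses,
                role: *role,
                education: *education,
                custom: *custom,
            })
        }
        _ => None,
    }
}

/// Convert a CHAT age (`years;months.days`) to whole months, ignoring days.
fn age_in_months(age: &str) -> Option<u32> {
    let (years, rest) = age.split_once(';')?;
    let months = rest.split('.').next()?; // "6." -> "6", "04.12" -> "04"
    let years: u32 = years.parse().ok()?;
    let months: u32 = if months.is_empty() { 0 } else { months.parse().ok()? };
    Some(years * 12 + months)
}
```

For the example line above, the age field `2;6.` works out to 30 months.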

Optional Headers

@Languages

Declares the language(s) used in the transcript.

@Languages:	eng, fra

@Date

Recording date in DD-MON-YYYY format.

@Date:	15-JAN-2024

@Location

Where the recording took place.

@Location:	Boston, MA, USA

@Situation

Description of the recording context.

@Situation:	free play with toys in lab

@Activities

Activities during the recording.

@Activities:	toyplay, reading

@Comment

Free-form comments. Can appear anywhere in the file (before, between, or after utterances).

@Comment:	child was tired during this session

@Media

Links the transcript to an audio or video file.

@Media:	session01, audio

@Transcriber / @Coder

Identifies who created or coded the transcript.

@Transcriber:	JDS
@Coder:	ABC

Header Ordering

Headers should follow this conventional order:

@UTF8 (required, first line)
@Begin (required)
@Languages
@Participants (required)
@ID lines (one per participant)
Other metadata headers (@Date, @Location, etc.)
@Comment lines (can also appear later)

Validation

The parser validates header structure including:

@UTF8 must be the first non-empty line
@Begin and @End are required and must appear exactly once
@Participants is required and must declare all speakers used in utterances
@ID participant codes must match @Participants declarations
Age format validation in @ID lines

Utterances

Status: Reference Last updated: 2026-05-11 23:22 EDT

An utterance is the fundamental unit of a CHAT transcript. It consists of a main tier (the transcribed speech) followed by zero or more dependent tiers (annotations).

Main Tier

The main tier begins with *SPEAKER: followed by a tab and the utterance content, ending with a terminator.

*CHI:	I want a cookie .

Speaker Codes

Speaker codes are short identifiers (up to seven characters from A-Z, 0-9, _, -, '; three uppercase letters is the convention) matching a code declared in @Participants:

@Participants:	CHI Target_Child, MOT Mother
*MOT:	what do you want ?
*CHI:	cookie .

Terminators

Every utterance must end with a terminator:

Terminator	Meaning
`.`	Declarative (period)
`?`	Question
`!`	Exclamation
`+...`	Trailing off
`+..?`	Trailing-off question
`+/.`	Interruption
`+//.`	Self-interruption
`+/?`	Interrupted question
`+!?`	Broken question
`+"/.`	Quotation follows on next line

Line Continuation

Long utterances wrap to the next line with a leading tab:

*MOT:	well I think that we should probably go to
	the store and get some more cookies .

Content Items

The content between *SPEAKER: and the terminator consists of content items separated by whitespace:

Words: regular words, potentially with annotations
Groups: bracketed content like <word word> for overlap, retrace, etc.
Special forms: pauses (.), events &=laughs, fillers &-uh
Separators: commas , and other punctuation

Words

Words are the primary content unit. See Word Syntax for full details.

Groups

Angle brackets < > group words for annotations:

*CHI:	<I want> [/] I want cookie .

Common group annotations:

[/]: partial retrace (speaker repeats the same words)
[//]: full retrace (speaker restarts with different words)
[///]: multiple retracing (multiple false starts)
[/-]: reformulation (speaker rephrases with different structure)
[?]: uncertain transcription

Special Forms

*CHI:	um (.) I want &-uh cookie .

(.): short pause
(..): medium pause
(...): long pause
(1.5): timed pause in seconds
&=laughs: paralinguistic event
&-uh: filler

Media Linking

Utterances can include media timestamps (bullets) that link to audio/video:

*CHI:	I want cookies . •1234_5678•

The numbers represent start and end times in milliseconds. The bullets delimiting the pair render as • in most editors; on disk they are the NAK control character (U+0015). See grammar/grammar.js rule bullet.
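Because the delimiter on disk is U+0015 rather than the visible bullet glyph, code that reads timestamps has to search for the control character. A minimal sketch, with a hypothetical function name and no claim to match the grammar's `bullet` rule exactly:

```rust
/// The on-disk bullet delimiter: NAK (U+0015), shown as a bullet by most editors.
const BULLET: char = '\u{15}';

/// Extract the first `start_end` millisecond pair from a line, if any.
fn parse_bullet(line: &str) -> Option<(u64, u64)> {
    let start = line.find(BULLET)? + BULLET.len_utf8();
    let len = line[start..].find(BULLET)?;
    let (a, b) = line[start..start + len].split_once('_')?;
    Some((a.parse().ok()?, b.parse().ok()?))
}
```

A line that contains the visible glyph U+2022 instead of NAK yields `None` here, which is one way a file copied out of a rendered view can lose its media links.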

Dependent Tiers

See Dependent Tiers for documentation on %mor, %gra, %pho, %wor, and other annotation tiers that follow the main tier.

Retraces and Repetitions

Status: Current Last updated: 2026-05-11 23:16 EDT

Retraces mark content that the speaker said but then corrected, repeated, or abandoned. They are one of the most consequential constructs in CHAT because they affect how every dependent tier aligns to the main tier.

CHAT Syntax

A retrace has two parts: the retraced content (what the speaker said first) and the correction (what follows). The retraced content is marked with a trailing bracket code:

Marker	Name	Meaning
`[/]`	Partial repetition	Speaker repeats the same words
`[//]`	Full correction	Speaker restarts with different words
`[///]`	Multiple correction	Multiple false starts
`[/-]`	Reformulation	Speaker rephrases with different structure

Single-Word Retraces

When only one word is retraced, no angle brackets are needed:

*CHI:	I [/] I want that .
*CHI:	ana [//] an .
*MOT:	the book [/-] the magazine is here .

Group Retraces

When multiple words are retraced, angle brackets delimit the scope:

*MOT:	<the dog> [//] the cat ran .
*CHI:	<I want> [/] I want cookie .
*CHI:	<I want the> [///] give me that .

Retraces with Replacements

A retraced word often has a replacement [: target] and/or error code [* code]. This is common in aphasia and child language corpora where the speaker produces an incorrect form:

*PAR:	tika@u [: kitty] [* p:n] [//] kitty is nice .
%mor:	noun|kitty aux|be-Fin-Ind-Pres-S3 adj|nice-S1 .

*PAR:	lɛɾɪ@u [: later] [* p:n] [//] later in the day .
%mor:	adv|late adp|in det|the-Def-Art noun|day .

*CHI:	male [: female] [* s:r] [/] male [: female] [* s:r] .
%mor:	adj|female-S1 .

In each case, the retraced word (before the [//] or [/]) is excluded from %mor alignment. Only the correction (after the marker) is counted.

Data Model

Retraces are a first-class variant of UtteranceContent:

flowchart TD
    UC["UtteranceContent"]
    UC --> Word
    UC --> RW["ReplacedWord"]
    UC --> Retrace
    UC --> AG["AnnotatedGroup"]
    UC --> Other["...20 other variants"]

    Retrace --> BC["BracketedContent"]
    Retrace --> RK["RetraceKind"]
    BC --> BIW["BracketedItem::Word"]
    BC --> BIRW["BracketedItem::ReplacedWord"]

    style Retrace fill:#f96,stroke:#333

The Retrace struct wraps the retraced content in a BracketedContent container, which can hold any combination of words, replaced words, and other content items:

// crates/talkbank-model/src/model/content/retrace.rs
pub struct Retrace {
    pub content: BracketedContent,          // the retraced words
    pub kind: RetraceKind,                  // Partial, Full, Multiple, Reformulation
    pub is_group: bool,                     // <word> [/] vs word [/]
    pub annotations: Vec<ContentAnnotation>,// non-retrace annotations after marker
    pub span: Span,
}

Why First-Class?

Before the retrace refactor, retraces were represented as annotations on words or groups. This meant every match on content had to inspect annotation lists to determine whether a word was retraced. This led to a class of bugs where retraced content was accidentally included in alignment counting, word extraction, or retokenization.

Making Retrace a top-level UtteranceContent variant means:

The compiler enforces handling. Every match on UtteranceContent must have a Retrace arm. Forgetting to handle retraces is a compile error, not a silent runtime bug.
Domain-aware gating is centralized. The content walker checks the Retrace variant once, not at every annotation-inspection site.
Alignment counting is simple. The count function returns 0 for Retrace in the Mor domain; no annotation inspection is needed.
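A toy model shows why the first-class variant pays off. The `Content` enum and `count_alignable` below are deliberately simplified stand-ins, not the real `UtteranceContent` (which has many more variants) or the real counting function; `TierDomain` here lists only the four domains discussed on this page.

```rust
#[derive(Clone, Copy, PartialEq)]
enum TierDomain {
    Mor,
    Pho,
    Sin,
    Wor,
}

/// Simplified stand-in for utterance content: a word, or a retrace wrapping content.
enum Content {
    Word(&'static str),
    Retrace(Vec<Content>),
}

/// Count the items a dependent tier in `domain` is expected to align to.
fn count_alignable(item: &Content, domain: TierDomain) -> usize {
    // The match is exhaustive: removing the Retrace arm is a compile error.
    match item {
        Content::Word(_) => 1,
        Content::Retrace(inner) => {
            if domain == TierDomain::Mor {
                0 // retraced content is not morphologically analyzed
            } else {
                inner.iter().map(|c| count_alignable(c, domain)).sum()
            }
        }
    }
}
```

For `<I want> [/] I want cookie .` this toy model gives 3 alignable items in the Mor domain and 5 in the others, with no annotation list consulted anywhere.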

Parser Conversion

The tree-sitter grammar parses retrace markers ([/], [//], etc.) as annotations on word_with_optional_annotations. The Rust parser converts them to structural Retrace nodes in parse_word_content():

flowchart LR
    subgraph "Tree-sitter CST"
        WOA["word_with_optional_annotations"]
        SW["standalone_word"]
        BA["base_annotations"]
        RP["retrace_partial / retrace_complete / ..."]
        WOA --> SW
        WOA --> BA
        BA --> RP
    end

    subgraph "Rust Model"
        RET["UtteranceContent::Retrace"]
        BC2["BracketedContent"]
        W2["Word or ReplacedWord"]
        RET --> BC2
        BC2 --> W2
    end

    WOA -->|"parse_word_content()\n(word.rs)"| RET

Three cases in parse_word_content():

Word + retrace (I [/]): wrap Word in BracketedItem::Word inside Retrace
Word + replacement + retrace (tika@u [: kitty] [* p:n] [//]): build ReplacedWord, then wrap in BracketedItem::ReplacedWord inside Retrace
Word + replacement, no retrace (tika@u [: kitty]): emit bare ReplacedWord

Group retraces (<content> [/]) are handled in group/parser.rs via the same structural wrapping.

Alignment Behavior

Retraces interact differently with each dependent tier domain:

flowchart TD
    RT["Retrace node\n(e.g. 'tika@u [: kitty] [* p:n] [//]')"]

    RT -->|"Mor domain"| SKIP["SKIP\n(return 0)\nNot morphologically analyzed"]
    RT -->|"Pho domain"| COUNT["COUNT\nPhonologically produced"]
    RT -->|"Sin domain"| COUNT2["COUNT\nGesturally produced"]
    RT -->|"Wor domain"| COUNT3["RECURSE\napply retrace-aware %wor leaf rule"]

    style SKIP fill:#faa,stroke:#333
    style COUNT fill:#afa,stroke:#333
    style COUNT2 fill:#afa,stroke:#333
    style COUNT3 fill:#afa,stroke:#333

Why %mor skips retraces: The %mor tier represents the morphological analysis of what the speaker meant to say. Retraced content is a false start or error; it was produced phonologically but is not part of the intended linguistic structure. The correction after the retrace marker carries the morphological analysis.

Why %pho/%sin/%wor include retraces: These tiers document what was actually produced: the sounds, gestures, and timing of the speech as it happened, including false starts. The retrace was physically spoken, so it appears in these tiers.

For %wor, retrace ancestry does not change leaf-level membership:

spoken word tokens count both inside and outside retrace
that includes fillers, fragments, nonwords, and untranscribed placeholders
overlap annotations do not affect %wor membership

Exact corpus-shaped contrast:

*CHI:	<one &+ss> [/] one play ground .
%wor:	one •321008_321148• ss •321148_321368• one •321809_321969• play •322049_322310• ground •322390_322890• .

*CHI:	&+ih <the what> [/] what's letter &+th is this ?
%wor:	ih •49063_49103• the •49103_49163• what •49183_50205• what's •50205_50405• letter •50405_50685• th •50886_50946• is •50946_51046• this •51086_51586• ?

Implementation

Counting: count_alignable_item() in alignment/helpers/count.rs:

UtteranceContent::Retrace(retrace) => {
    if domain == TierDomain::Mor {
        0  // excluded from morphological alignment
    } else {
        count_bracketed_alignable_content(&retrace.content, domain, true)
    }
}

Walking: walk_words() in alignment/helpers/walk/mod.rs:

UtteranceContent::Retrace(retrace) => {
    if !matches!(domain, Some(TierDomain::Mor)) {
        walk_bracketed_content(&retrace.content.content, domain, f);
    }
}

%wor generation and overlap counting still use dedicated recursive helpers, but now for %wor-specific sequencing details like replacement handling rather than for retrace-sensitive membership.

Validation

Cross-Utterance Retrace Validation

The retrace validators in validation/retrace/ are organized into two parts:

Collection: collection/utterance.rs and collection/bracketed.rs walk the content tree to find all Retrace nodes
Detection: detection.rs provides utterance_item_has_retrace() for quick retrace presence checks

Alignment Validation (E705)

E705 fires when the main tier has more alignable items than %mor. If retraces are correctly parsed as Retrace nodes, they are excluded from the count and E705 does not fire. If a retrace is accidentally parsed as a bare ReplacedWord (the bug fixed in c90b9bf), it is counted and triggers a false E705.

Regression Tests

tests/retrace_replaced_word_regression.rs contains 6 targeted tests:

Test	Pattern	Verifies
`single_word_retrace_with_replacement_full`	`word [: repl] [* err] [//]`	Retrace wraps ReplacedWord
`single_word_retrace_with_replacement_partial`	`word [: repl] [* err] [/]`	Partial retrace with replacement
`single_word_retrace_with_replacement_multiple`	`word [: repl] [* err] [///]`	Multiple retrace with replacement
`single_word_retrace_with_replacement_no_error_marker`	`word [: repl] [///]`	No `[*]` still produces Retrace
`single_word_retrace_without_replacement`	`word [//]`	Baseline (no replacement)
`retrace_with_replacement_does_not_cause_e705`	Full pipeline with %mor	No false E705

Reference corpus entries: corpus/reference/annotation/retrace.cha

Replacements

Status: Current Last modified: 2026-05-29 17:47 EDT

A replacement is a CHAT annotation [: ...] that pairs a single spoken word on the main tier with one or more “intended” words. It records both what the speaker actually said and what the analysis should treat the utterance as containing.

*CHI:	wanna [: want to] go .
*CHI:	dis [: this] is fun .
*CHI:	rocking+house [: rocking+horse] [*] ?

This page is the canonical reference for what replacements mean in TalkBank, both as a CHAT-manual construct and as a typed AST in this repo. The most important load-bearing fact, which the rest of the page expands on:

Replacements are word-level, not group-level. Each tier domain chooses one side of the pair: %mor analyzes the replacement (right side); %wor, %pho, %sin align to the original (left side). %gra follows %mor.

CHAT Syntax

Word-Level Scope

A replacement attaches to a single standalone_word on the main tier and contains one or more replacement words inside the brackets:

*CHI:	gonna [: going to] eat lunch .
*CHI:	dis [: this] toy .
*CHI:	rocking+house [: rocking+horse] [*] ?

The grammar rules are word_with_optional_annotations and replacement in grammar/grammar.js; grep for the rule names rather than relying on line numbers, so that this stays accurate as the grammar evolves. Replacement words can be separated by whitespace, so [: going to] is a single replacement of gonna with two words.

There Is No Group-Level Replacement

<dat is> [: that is] is not valid CHAT. A replacement does not attach to a group; it attaches to a single word. The grammar enforces this by typing: ReplacedWord.word: Word, never Group. To replace words inside a group, attach the replacement to the inner word:

*CHI:	<dat [: that] is> [/] is broken .

This shape, replacement inside a group inside a retrace, is legal because each annotation operates at its own scope.

There Is No `[::]` Form

Some literature on CHILDES tooling references a [::] annotation; it does not exist in this repo’s grammar, parser, or model, and is not defined by the current CHAT manual. Only [:] exists. If you encounter [::] in legacy data, treat it as a parse error to investigate, not a construct to support.

The Per-Domain Alignment Rule

This is the rule contributors most often get wrong. Different tier domains align to different sides of a replacement pair:

Tier	Side aligned to	Rationale
`%mor`	replacement (right)	Morphosyntactic analysis annotates the target form, not the error
`%gra`	replacement (right)	Grammatical relations align to `%mor`’s structure
`%wor`	original (left)	Word-level timing is for what was actually spoken
`%pho`	original (left)	Phonological transcription describes what was actually spoken
`%sin`	original (left)	Spelling-in-actual describes the original surface form

The mnemonic: the replacement encodes the intended form (what the speaker meant or what a corrected transcript would read). Tiers analyzing intent (%mor/%gra) use the replacement; tiers documenting realization (%wor/%pho/%sin) use the original.

flowchart LR
    spoken["Original word\n(left of [:)\n'dis'"]
    target["Replacement words\n(inside [: ])\n'this'"]

    spoken -->|"%wor (timing)"| wor["%wor: dis"]
    spoken -->|"%pho (phonology)"| pho["%pho: dɪs"]
    spoken -->|"%sin (spelling)"| sin["%sin: dis"]
    target -->|"%mor (UD parse)"| mor["%mor: pron|this"]
    target -->|"%gra (paired with %mor)"| gra["%gra: 1|0|ROOT"]

For multi-word replacements like gonna [: going to], the rule generalizes consistently:

%wor / %pho / %sin produce one entry, for gonna.
%mor produces two entries, for going and to.
%gra produces two entries, paired to the two %mor items.

The alignment-counting code that enforces this is in alignment/units.rs; look for the UtteranceContent::ReplacedWord arm. The full table of per-domain rules is in spec/docs/ALIGNMENT_RULES.md.
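As a rough illustration of the per-domain rule, the sketch below selects the aligned side for each domain. The `Domain` and `Item` types and the `aligned_words` function are assumptions made for this example; the real logic lives in alignment/units.rs and operates on the typed AST.

```rust
/// Tier domains that align against the main tier (illustrative names).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Domain {
    Mor, // %gra pairs with %mor, so it follows the same side
    Wor,
    Pho,
    Sin,
}

/// Simplified stand-in for main-tier content; not the real `UtteranceContent`.
enum Item {
    Word(String),
    ReplacedWord { original: String, replacement: Vec<String> },
}

/// Words a given domain aligns against, in main-tier order.
fn aligned_words(items: &[Item], domain: Domain) -> Vec<&str> {
    let mut out = Vec::new();
    for item in items {
        match item {
            Item::Word(w) => out.push(w.as_str()),
            Item::ReplacedWord { original, replacement } => match domain {
                // Intent-analyzing tiers see the replacement (right side).
                Domain::Mor => out.extend(replacement.iter().map(String::as_str)),
                // Realization tiers see the original (left side).
                Domain::Wor | Domain::Pho | Domain::Sin => out.push(original.as_str()),
            },
        }
    }
    out
}

/// gonna [: going to] eat lunch
fn gonna_example() -> Vec<Item> {
    vec![
        Item::ReplacedWord {
            original: "gonna".to_string(),
            replacement: vec!["going".to_string(), "to".to_string()],
        },
        Item::Word("eat".to_string()),
        Item::Word("lunch".to_string()),
    ]
}

fn main() {
    let items = gonna_example();
    for domain in [Domain::Mor, Domain::Wor, Domain::Pho, Domain::Sin] {
        println!("{:?}: {:?}", domain, aligned_words(&items, domain));
    }
}
```

For this utterance the %mor side yields four words and the %wor side three, which is why the two tiers are never compared count-for-count.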

Rust AST

A replacement is modeled as a first-class UtteranceContent variant, not as a flag on Word:

// crates/talkbank-model/src/model/annotation/replacement.rs
pub struct ReplacedWord {
    pub word: Word,                       // left side: original spoken word
    pub replacement: Replacement,         // right side: 1+ intended words
    pub scoped_annotations: ReplacedWordAnnotations,
}

Two consequences of this shape:

A replacement is a wrapper around a Word, not a kind of Word. ReplacedWord lives as its own variant of UtteranceContent (and BracketedItem), holding an inner word: Word plus the replacement payload. Contrast with retraces: Retrace is also a variant of UtteranceContent/BracketedItem, but it wraps a group of content (a single word or a <...> group), not a single Word. Different mechanism, different scope, same top-level slot in the AST.
The walk_words() content walker yields WordItem::ReplacedWord as a distinct leaf (defined in crates/talkbank-model/src/alignment/helpers/walk/mod.rs). Domain-aware extraction code branches on this leaf type and chooses original or replacement per the table above.

Validation

Each Replacement Word Is Validated Like a Main-Tier Word

The replacement is a Vec<Word>. Each Word inside it goes through the same validator that runs on main-tier words:

*CHI:	dog [: C-3PO] .

This produces [E220] "C-3PO" is not a legal word in language(s) "eng": numeric digits not allowed, exactly as if C-3PO had appeared on the main tier directly. The replacement does not provide an escape from word-level validation. The implementation is in replacement.rs.

This is critical for any code generating replacements programmatically: do not assume [: ...] lets you smuggle arbitrary text past the word validator. If your producer emits a replacement, both sides must be CHAT-legal under the utterance’s declared language.

Replacement-Specific Error Codes

Three error codes are specific to replacements and do not apply to main-tier words:

Code	Meaning
`E208`	Empty replacement `[:]` (no words provided between `:` and `]`)
`E390`	Replacement contains an omission (`0prefix` form), disallowed inside replacements
`E391`	Replacement contains untranscribed material (`xxx`, `yyy`, `www`), disallowed inside replacements

The principle: a replacement must be a concrete intended form. Empty, omitted, or unintelligible content defeats that purpose.
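A minimal sketch of the three replacement-specific checks follows. It works on plain strings for brevity, and both the `ReplacementError` enum and the string tests (a leading `0` for omissions, literal `xxx`/`yyy`/`www` for untranscribed material) are simplifying assumptions; the real validator runs on typed `Word` values and also applies the ordinary word-level checks described earlier.

```rust
/// Replacement-specific problems, named after the codes in the table above.
/// This enum is hypothetical, not the crate's error type.
#[derive(Debug, PartialEq)]
enum ReplacementError {
    E208Empty,
    E390Omission(String),
    E391Untranscribed(String),
}

/// Check the words found between `[:` and `]`.
fn check_replacement(words: &[&str]) -> Vec<ReplacementError> {
    if words.is_empty() {
        return vec![ReplacementError::E208Empty];
    }
    let mut errors = Vec::new();
    for w in words {
        if matches!(*w, "xxx" | "yyy" | "www") {
            errors.push(ReplacementError::E391Untranscribed((*w).to_string()));
        } else if w.starts_with('0') && w.len() > 1 {
            // Assumption: an omission is approximated as a `0`-prefixed word.
            errors.push(ReplacementError::E390Omission((*w).to_string()));
        }
    }
    errors
}

fn main() {
    println!("{:?}", check_replacement(&["going", "to"]));
    println!("{:?}", check_replacement(&[]));
    println!("{:?}", check_replacement(&["xxx"]));
    println!("{:?}", check_replacement(&["0is"]));
}
```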

Interactions with Other Annotations

Replacements and Retraces Are Orthogonal

A retrace ([/], [//], [///], [/-]) and a replacement ([:]) are distinct annotations operating at different structural levels:

Retraces wrap content (a single word or a group). They are first-class UtteranceContent variants and represent post-hoc speaker correction.
Replacements attach inside a Word slot via ReplacedWord. They are editorial metadata about an individual spoken word.

Both can coexist:

*CHI:	<dat [: that] is> [/] is broken .   (replacement inside retrace)

A retrace cannot live inside a replacement (the grammar wraps replacements around standalone_word, not arbitrary content).

Replacements and Error Coding

Error codes follow the replacement and operate on the replaced word as a unit:

*CHI:	rocking+house [: rocking+horse] [*] ?

Here [*] marks rocking+house as containing a phonological/lexical error; the [: rocking+horse] records the intended form. The two annotations cooperate: the replacement encodes what was meant, the error code classifies how it deviates. Implementation: scoped_annotations field on ReplacedWord.

Common Misconceptions

These are bugs we have repeatedly written down and then forgotten. They are recorded here so future contributors don’t reinvent them.

“[: ...] lets me put any text I want.” No. Each replacement word is validated. [: C-3PO] fails E220 in English just as C-3PO would.
“[:] is the right mechanism for ASR sanitization.” Usually no. ASR-introduced normalization typically wants [% ...] (free-form comment) or [= ...] (free-form explanation), neither of which validates word grammar. Use [:] only when you have a concrete CHAT-legal intended form.
“%mor analyzes the original.” No. %mor analyzes the replacement. This is the correction’s morphology, not the error’s.
“%wor count must equal %mor count.” No. For gonna [: going to], %wor has 1 entry and %mor has 2. They align to different sides. The validator’s per-domain rule respects this.
“<a b> [: c d] is a group-level replacement.” No. Group-level replacements don’t exist. Either replace inside (<a [: c] b [: d]>) or rephrase the transcription.

Source Citations

Concern	File:line
Grammar rule (`replacement`)	`grammar/grammar.js:1341-1352`
Word-with-replacement rule	`grammar/grammar.js:1063-1071`
`ReplacedWord` struct	`crates/talkbank-model/src/model/annotation/replacement.rs` (search `pub struct ReplacedWord`)
Per-domain alignment	`crates/talkbank-model/src/model/file/utterance/metadata/alignment/units.rs` (search `UtteranceContent::ReplacedWord`)
Replacement validation	`crates/talkbank-model/src/model/annotation/replacement.rs` (search `impl ... Validate for ReplacementWords`)
Reference corpus example	`corpus/reference/annotation/errors-and-replacements.cha`
CHAT manual	https://talkbank.org/0info/manuals/CHAT.html#Replacement_Scope

Untranscribed Markers: `xxx`, `yyy`, `www`

Status: Reference Last updated: 2026-06-14 19:57 EDT

CHAT reserves three short word-level markers for material the human transcriber cannot or chose not to render as words on the main tier. Each one has a specific meaning. Tools that emit CHAT, including ASR pipelines, format converters, and editor heuristics, must respect those meanings, because every downstream consumer (researchers, validators, and aggregate-statistics tools like CLAN’s freq, kideval, mlu) reads them at face value.

Marker	Meaning	Emitter
`xxx`	Transcriber listened to the audio and could not make out what was said. The speech is unintelligible to the human ear at this point.	Human transcriber only.
`yyy`	Transcriber heard a discrete utterance but could not write it as ordinary CHAT words. Used when the surface form resists orthography (mumbled, slurred, foreign with no equivalent). The phonetic content typically appears on the `%pho` tier.	Human transcriber only.
`www`	Transcriber chose not to transcribe this stretch, usually for privacy, off-topic content, or because the segment is irrelevant to the corpus’s purpose.	Human transcriber only.

The shared property: each marker is the human transcriber telling later readers something specific about their experience listening to the audio. None of them mean “tooling could not process this token”.

Why this matters

When a researcher loads a CHAT corpus and counts xxx occurrences, the result is a measure of human listening difficulty: it tells them how much of the audio resisted human transcription. That number feeds into methodology decisions (“can we get reliable MLU from this corpus?”, “what’s the noise floor on this child’s speech?”, “should we re-record in a quieter environment next time?”). It is a load-bearing signal in language-development research.

If an ASR pipeline emits xxx whenever it cannot sanitize a token (for example, substituting xxx for any word that fails CHAT validation under a strict language profile), every xxx count in the corpus becomes a meaningless mixture of “human couldn’t tell” and “pipeline gave up”. Researchers reading those counts are silently misled. The signal is destroyed for the entire history of that corpus, because once committed the corruption is indistinguishable from real unintelligibility.

The same reasoning applies to yyy and www. A converter or post-processor that emits any of these three markers because the tooling couldn’t handle a token is committing semantic vandalism against the whole field.

Rules for tooling

Never emit xxx, yyy, or www from a tool to mean “could not process”. These markers are reserved for human transcriber judgment.
When a token cannot be validated as legal CHAT under the declared language, prefer one of:
- Pass the token through verbatim and let the CHAT validator (or CLAN’s check) flag it for human review. The transcriber listens, decides, and corrects.
- Fail loud: abort the file rather than emit corrupted output.
- Apply only purely orthographic, semantically null repairs (e.g., stripping a stray boundary quote mark from "My). These are safe because no information is lost.
Never sanitize a token by replacing it with one of the three markers. That is exactly the corrupting behavior this document prohibits.
Never delete a token to “fix” a validation failure. Deletion loses data without any flag.

What tools synthesizing CHAT should do instead

Any tool that builds CHAT from an external source (ASR output, an importer, a format converter) should follow the same division of labor:

Silently fix only orthographically inarguable problems (for example, stripping a stray boundary quote mark from "My).
For tokens that fail language-level validation but are structurally legal CHAT (e.g., C-3PO under English: tree-sitter accepts the digit-hyphen compound but Word::validate fires E220 “numeric digits not allowed”), ship the token verbatim. The full-file validator and check fire E220 on the same word, the file ends up in the human review queue, and the transcriber listens to the audio and decides what was actually said.
For tokens that fail structural parsing (tree-sitter rejects), fail loud: emitting malformed CHAT would corrupt the file beyond the validator’s ability to flag it.

The division of labor is: the tool fixes only what is mechanically unambiguous; CHECK and the human transcriber handle everything that requires judgment about what the speaker said.
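The division of labor might be sketched as a three-way decision. Everything in this block is illustrative: `toy_check` is a crude stand-in for tree-sitter parsing plus `Word::validate`, `strip_stray_quote` models only the single repair named above, and the `Action` and `Check` enums are hypothetical. The point of the sketch is what is absent: no branch ever produces `xxx`, `yyy`, or `www`, and no branch drops the token.

```rust
/// What the tool does with one incoming token (hypothetical type).
#[derive(Debug, PartialEq)]
enum Action {
    Repaired(String), // orthographic, semantically null fix applied
    Verbatim(String), // shipped as-is; the validator flags it for human review
    Abort(String),    // fail loud with a reason
}

/// Hypothetical result of checking a token.
enum Check {
    Legal,
    FailsLanguageValidation,
    FailsStructuralParse,
}

/// Toy stand-in for the real checks; the conditions are assumptions.
fn toy_check(token: &str) -> Check {
    if token.is_empty() || token.chars().any(char::is_whitespace) {
        Check::FailsStructuralParse
    } else if token.chars().any(|c| c.is_ascii_digit()) {
        Check::FailsLanguageValidation
    } else {
        Check::Legal
    }
}

/// Strip a single stray leading quote mark, as in `"My`.
fn strip_stray_quote(token: &str) -> Option<&str> {
    let rest = token.strip_prefix('"')?;
    if rest.contains('"') { None } else { Some(rest) }
}

fn decide(token: &str) -> Action {
    let repaired = strip_stray_quote(token);
    let candidate = repaired.unwrap_or(token);
    match toy_check(candidate) {
        Check::FailsStructuralParse => {
            Action::Abort(format!("cannot parse token {:?}", token))
        }
        // Language-level failures are not the tool's call: ship the token.
        Check::Legal | Check::FailsLanguageValidation => match repaired {
            Some(fixed) => Action::Repaired(fixed.to_string()),
            None => Action::Verbatim(token.to_string()),
        },
    }
}

fn main() {
    for token in ["\"My", "C-3PO", "cookie", ""] {
        println!("{:?} -> {:?}", token, decide(token));
    }
}
```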

xxx / yyy / www pass through all NLP passes (morphotag, utseg, translate, coref) unchanged and without re-interpretation. Tools that walk the AST treat them as opaque tokens; they have no POS tag, no lemma, no dependency parent, no translation.
%wor excludes all three (no phoneme sequence to align). %pho may reference yyy directly because the phonetic content is the whole point of the marker.
See word-syntax.md for grammar; this document is the policy reference for who is allowed to emit them and why.

Postcodes (`[+ ...]`)

Status: Reference Last updated: 2026-06-25 07:30 EDT

A postcode is a tagged annotation token that attaches to an utterance as a whole and appears after the terminator. The canonical CHAT syntax is [+ <text>]. Postcodes carry researcher / analysis tags about the utterance (whether it should be excluded from analysis, how it should be coded, what kind of speech act it represents) without modifying the utterance’s word content.

Syntax and Scope

*CHI:   I want cookie .  [+ exc]
*MOT:   what did you say ?  [+ imp]
*CHI:   no I don't want it !  [+ neg] [+ trn]

Three structural facts to internalize:

Postcodes attach to the utterance, not to a word. They sit after the terminator, on the main tier, alongside (but distinct from) any utterance-level bullet. Unlike word-scoped annotations ([: ...] replacement, [% ...] comment, [= ...] explanation, [* ...] error code), a postcode does not modify the interpretation of any single word; it tags the whole utterance.
Multiple postcodes may follow a single terminator. They are ordered, but the order is not semantically privileged.
The body is free-form text. The CHAT word grammar is not applied to postcode contents. Researchers can write arbitrary tags, codes, descriptions, comments, or analytic notes. The model stores the raw text and leaves interpretation to downstream tooling and conventions.

Common Postcodes, Empirical Survey

The postcode vocabulary is open-ended: the CHAT format imposes no closed set, and an audit of every [+ ...] token across a JSON-mirrored snapshot of the TalkBank corpora (~99k files, 23+ data-repo families) found 488 distinct values in active use.

The findings split into three tiers ranked by repo spread (the number of distinct corpus families in which the code appears). Repo spread is a more useful ranking than raw count, because high-count codes can be concentrated in a single corpus.

Tier 1, Cross-corpus codes (in 7+ repos)

These are the conventions every CHAT consumer should expect to encounter across collections:

Postcode	Repo spread	Total occurrences	Meaning
`[+ gram]`	13	~3,100	Grammatical: utterance is grammatically well-formed for purposes of the analysis.
`[+ exc]`	9	~26,900	Exclude utterance from analysis. The utterance is preserved in the transcript but tagged so analytic tools (CLAN’s `freq`, `mlu`, etc.) skip it.
`[+ bch]`	9	~10,000	Backchannel: listener-side acknowledgement (`mhm`, `yeah`) that should not be counted as a substantive turn.
`[+ trn]`	7	~3,800	Translation utterance.

Tier 2, Multi-corpus protocol codes (in 4-6 repos)

Codes deployed across several CHILDES sub-collections, typically encoding picture-narration / story-reading / imitation experimental conditions. Raw counts are substantial (often tens of thousands), but the meaning of each code is set by the originating protocol, so consult per-corpus documentation rather than assuming a global definition:

Postcode	Repo spread	Total occurrences
`[+ SR]`	5	~31,000
`[+ IN]`	5	~24,500
`[+ PI]`	5	~22,700
`[+ R]`	4	~16,200
`[+ I]`	4	~10,500
`[+ nv]`	4	~3,300
`[+ imit]`	4	~3,200

Tier 3, Single-corpus and long-tail codes

About 80% of the 488 distinct values appear in one repo only. The single-corpus codes include high-volume protocol vocabularies (e.g. [+ uncued] ~19,500 in one repo, [+ NAC] ~3,500 in one repo, [+ diary] ~2,800 in a Romance/Germanic diary-study collection, [+ noatt] ~2,300 in one repo, [+ inter-utter-switch] ~720 flagging code-switching turns).

The long tail also includes researcher-private notes, typos that survived check, and per-study coding schemes. Tooling MUST treat any unknown postcode value as opaque text: the corpus author may know what it means, but the format does not.

Caveats

Numbers are from a snapshot audit and will drift as corpora are added or revised. Treat the broad shape (open vocabulary, ~4 truly cross-corpus codes, ~10 multi-corpus protocol codes, ~hundreds of single-corpus or long-tail codes) as the load-bearing finding, not the exact counts.
“Repo spread” counts data-repo families, not individual files. Two corpora curated by the same group inside one data-repo count as one for spread; researchers using the same code in two different family-of-corpora packages count as two.
The CHAT manual remains the source of truth for standard conventions. The empirical survey above shows what is actually deployed; when ingesting a new corpus, consult its own documentation for the postcodes in use.

What Postcodes Are NOT

Postcodes are easy to confuse with several other CHAT annotation forms because they all use square brackets. The differences are substantive and load-bearing.

Form	Scope	Body validation	Purpose
`[+ ...]`	Utterance-level (this doc)	None, free text	Researcher / analysis tag attached to the whole utterance
`[: ...]`	Word-level	Replacement words ARE validated as CHAT words	Sanctioned-form correction of the preceding word (see `replacements.md`)
`[% ...]`	Word-level	None, free text	Free-form comment about the preceding word or local span
`[= ...]`	Word-level	None, free text	Explanation of unclear / non-standard speech (often paired with `xxx` / `yyy` placeholders)
`[* ...]`	Word-level	None, error code text	Error coding for the preceding word, optionally with a structured code

Two consequences worth pinning down explicitly:

A postcode cannot carry per-word semantics. If you want to attach a comment, replacement, or error code to a single word, use the appropriate word-scoped form. Stretching a postcode to mean “this word is X” loses the per-word position downstream tools depend on.
A word-scoped annotation cannot tag an utterance. If you want to mark an entire utterance for exclusion or translation, use a postcode. A [% exclude this] after a word does not mean “exclude the utterance” to any consumer.

Not Postcodes: Quotation Markers

Quotation marking in CHAT is not a postcode form. The constructs +"/. (quotation end), +"/, and +" (quotation linkers / continuations) are tier-level terminators and linkers, not [+ ...] postcodes. The grammar rule postcode in grammar/grammar.js is strictly [+ <text>], and the quotation forms live under separate grammar rules (quoted_new_line, linker_quotation_follows).

See Utterances → Terminators for the syntactic forms, and the talkbank-model::validation::cross_utterance validator family (gated by ValidationContext::enable_quotation_validation) for the cross-utterance balance checks.

A walker in talkbank-model::validation::utterance::quotation (check_quotation_balance) does scan the postcode list for text "/ and "/., but a sweep over the data-json corpus mirror (101,414 files, 2026-05-11) returned zero such postcodes. That code path is effectively dead, presumably retained as a defence against hand-edited oddities. The real quotation-balance work happens in the cross-utterance family above.

Position in the AST

An utterance’s main tier is MainTier, whose content: TierContent field carries the actual tier payload, including postcodes, as a typed list:

pub struct MainTier {
    pub speaker: SpeakerCode,
    pub content: TierContent,
    // spans omitted for brevity
}

pub struct TierContent {
    pub linkers: TierLinkers,                  // utterance-leading +<, ++, etc.
    pub language_code: Option<LanguageCode>,   // [- code]
    pub content: TierContentItems,             // word-level items (newtype over Vec<UtteranceContent>)
    pub terminator: Option<Terminator>,        // ., ?, !, +..., etc.
    pub postcodes: TierPostcodes,              // [+ ...] tokens after the terminator
    pub bullet: Option<Bullet>,                // optional terminal media bullet
    // content_span omitted for brevity
}

(See talkbank-model/src/model/content/main_tier.rs and tier_content.rs for the exact shape. The postcodes slot lives on TierContent, which is the main-tier payload. Dependent tiers do not use TierContent: each has its own type (for example %com is a text tier, %wor is a list of timed items), and none carries a postcode slot. So a [+ ...]-shaped token on a dependent tier is parsed as ordinary tier content, never a Postcode. This is why chatter does not, and structurally cannot without banned raw-text scanning, reproduce CLAN CHECK 109 (“postcodes are not allowed on dependent tiers”); that deliberate divergence is recorded in the CHECK Parity Audit.)

Because postcodes live at the utterance level, the per-word traversal helpers (walk_words, walk_words_mut) do not visit them. Code that needs to read or rewrite postcodes accesses the list directly.

The model stores postcode text as SmolStr and preserves it verbatim through CHAT roundtrips. Downstream tooling, including CLAN command implementations such as freq, mlu, kideval, is responsible for interpreting individual postcode values per its own conventions.

Tooling Rules

Tools that emit or consume CHAT must respect the scope distinction.

Emitters: when adding a researcher tag to an utterance, attach a Postcode to the postcodes list on the utterance’s TierContent, not a ContentAnnotation to a word. Both serialize, but only the former reaches downstream consumers as utterance-level metadata.
Consumers: when reading utterance-level tags (e.g., implementing an “exclude” filter), iterate main.content.postcodes on each utterance, not the word-level annotations in UtteranceContent. The two lists are populated by different parser branches and have different semantics.
Round-trip preservers (extract→modify→inject pipelines such as the NLP injection passes in crates/batchalign-*): preserve the postcode list unchanged. None of the standard NLP passes have a reason to add, remove, or reorder postcodes.
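The consumer rule can be sketched as an “exclude” filter. The structs below mirror only the `main.content.postcodes` path named above and are simplified assumptions: the real model stores postcode text as `SmolStr` inside a `TierPostcodes` wrapper, and the `text` field name and helper functions are hypothetical.

```rust
// Simplified mirrors of the model types; only the postcode path is kept.
struct Postcode {
    text: String,
}

struct TierContent {
    postcodes: Vec<Postcode>,
}

struct MainTier {
    speaker: String,
    content: TierContent,
}

struct Utterance {
    main: MainTier,
}

/// True if the utterance carries the given postcode, e.g. `exc` for `[+ exc]`.
fn has_postcode(utterance: &Utterance, tag: &str) -> bool {
    utterance.main.content.postcodes.iter().any(|p| p.text == tag)
}

/// Utterances that an analysis honoring `[+ exc]` would keep.
fn included(utterances: &[Utterance]) -> Vec<&Utterance> {
    utterances.iter().filter(|u| !has_postcode(u, "exc")).collect()
}

/// Test helper: build an utterance that has only a speaker and postcodes.
fn utterance(speaker: &str, postcodes: &[&str]) -> Utterance {
    Utterance {
        main: MainTier {
            speaker: speaker.to_string(),
            content: TierContent {
                postcodes: postcodes
                    .iter()
                    .map(|t| Postcode { text: (*t).to_string() })
                    .collect(),
            },
        },
    }
}

fn main() {
    let transcript = vec![
        utterance("CHI", &["exc"]),
        utterance("MOT", &["imp"]),
        utterance("CHI", &["neg", "trn"]),
    ];
    for u in included(&transcript) {
        println!("kept utterance from {}", u.main.speaker);
    }
}
```

The filter never looks at word-level annotations, and it leaves the postcode list itself untouched, which is also what a round-trip preserver needs.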

References

CHAT manual: Postcodes
CHAT manual: Excluded Utterance Postcode
CHAT manual: Included Utterance Postcode
Model: talkbank-model/src/model/content/postcode.rs
Quotation validator: talkbank-model/src/validation/utterance/quotation.rs

Dependent Tiers

Status: Reference Last updated: 2026-06-22 23:33 EDT

Dependent tiers appear on lines beginning with % immediately after an utterance. They provide annotations linked to the main tier content.

CHAT defines four structural categories of dependent tiers:

Structured linguistic tiers: parsed into typed AST nodes with word-level alignment
Phon phonological tiers: syllabification and segmental alignment from the Phon project
Bullet-content tiers: free-form text with optional inline timing markers
Text tiers: plain text with no structural alignment

Structured Linguistic Tiers

These tiers have rich, parsed representations in the data model. Each token aligns 1-to-1 with an alignable word on the main tier (excluding retraces, pauses, and events). Terminators (., ?, !) must match the main tier terminator.

%mor, Morphological Analysis

The %mor tier carries part-of-speech tags, lemmas, and morphological features for each word on the main tier. See The %mor Tier for full documentation covering the UD-style format, data model, divergences from Universal Dependencies, and migration from traditional CHAT MOR.

Format: POS|lemma[-Feature]*, with ~ separating post-clitics.

*CHI:	she's eating cookies .
%mor:	PRON|she~AUX|be-Pres-S3 VERB|eat-Prog NOUN|cookie-Plur .
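To make the item format concrete, here is a small sketch that splits one %mor item into chunks. `MorChunk` and `parse_mor_item` are hypothetical names, and the parser assumes that every `-` after the lemma starts a feature, which may not hold for lemmas that themselves contain hyphens; the real parser is grammar-driven.

```rust
/// One chunk of a %mor item (hypothetical simplified type).
#[derive(Debug, PartialEq)]
struct MorChunk {
    pos: String,
    lemma: String,
    features: Vec<String>,
}

/// Parse one whitespace-delimited %mor item such as `PRON|she~AUX|be-Pres-S3`.
/// Returns None if any chunk lacks a `POS|lemma` core.
fn parse_mor_item(item: &str) -> Option<Vec<MorChunk>> {
    item.split('~') // `~` separates a word from its post-clitics
        .map(|chunk| {
            let (pos, rest) = chunk.split_once('|')?;
            let mut parts = rest.split('-');
            let lemma = parts.next()?;
            if pos.is_empty() || lemma.is_empty() {
                return None;
            }
            Some(MorChunk {
                pos: pos.to_string(),
                lemma: lemma.to_string(),
                features: parts.map(str::to_string).collect(),
            })
        })
        .collect()
}

fn main() {
    for item in ["PRON|she~AUX|be-Pres-S3", "VERB|eat-Prog", "NOUN|cookie-Plur"] {
        println!("{} -> {:?}", item, parse_mor_item(item));
    }
}
```

The first item yields two chunks for one main-tier word, which is the reason chunk counts rather than word counts matter when another tier pairs with %mor.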

%gra, Grammatical Relations

The %gra tier encodes dependency syntax using Universal Dependencies relation labels. Each entry has the format index|head|relation, where indices are 1-based and head 0 indicates ROOT.

*CHI:	I want cookies .
%mor:	PRON|I VERB|want NOUN|cookie-Plur .
%gra:	1|2|SUBJ 2|0|ROOT 3|2|OBJ 4|2|PUNCT

The %gra tier aligns with %mor chunks (clitics expand into multiple chunks). Validation checks sequential indices (E721), ROOT structure (E722 missing root, E723 multiple roots), and circular dependencies (E724).
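The structural checks listed above can be sketched as follows. The `GraEntry` type and both functions are assumptions for illustration; the error codes are the ones named in the text, but the exact conditions the real validator uses for each code are not confirmed here.

```rust
/// One `index|head|relation` entry (hypothetical simplified type).
#[derive(Debug, PartialEq)]
struct GraEntry {
    index: usize,
    head: usize,
    relation: String,
}

fn parse_gra(tier: &str) -> Option<Vec<GraEntry>> {
    tier.split_whitespace()
        .map(|entry| {
            let mut parts = entry.split('|');
            let index: usize = parts.next()?.parse().ok()?;
            let head: usize = parts.next()?.parse().ok()?;
            let relation = parts.next()?.to_string();
            if parts.next().is_some() {
                return None; // more than three fields
            }
            Some(GraEntry { index, head, relation })
        })
        .collect()
}

/// Return the codes of the structural checks that fail.
fn check_gra(entries: &[GraEntry]) -> Vec<&'static str> {
    let mut codes = Vec::new();
    // E721: indices must run 1, 2, 3, ... in order.
    if entries.iter().enumerate().any(|(i, e)| e.index != i + 1) {
        codes.push("E721");
    }
    // E722 / E723: exactly one entry may have head 0 (ROOT).
    let roots = entries.iter().filter(|e| e.head == 0).count();
    if roots == 0 {
        codes.push("E722");
    }
    if roots > 1 {
        codes.push("E723");
    }
    // E724: follow head pointers from every entry. A walk that takes more
    // steps than there are entries can only be going around a cycle.
    let has_cycle = entries.iter().any(|start| {
        let mut head = start.head;
        let mut steps = 0;
        while head != 0 {
            steps += 1;
            if steps > entries.len() {
                return true;
            }
            match entries.iter().find(|e| e.index == head) {
                Some(e) => head = e.head,
                None => break, // dangling head: not a cycle
            }
        }
        false
    });
    if has_cycle {
        codes.push("E724");
    }
    codes
}

fn main() {
    let good = parse_gra("1|2|SUBJ 2|0|ROOT 3|2|OBJ 4|2|PUNCT").unwrap();
    println!("{:?}", check_gra(&good));
    let cyclic = parse_gra("1|2|SUBJ 2|1|OBJ").unwrap();
    println!("{:?}", check_gra(&cyclic));
}
```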

%pho / %mod, Phonological Transcription

The %pho tier records actual pronunciation; %mod records target/model pronunciation. Both use the same format: space-separated phonetic tokens aligned 1-to-1 with main tier words.

*CHI:	I want three cookies .
%pho:	aɪ wɑnt fwi kʊkiz .
%mod:	aɪ wɑnt θri kʊkiz .

Phonological tiers support IPA, UNIBET, X-SAMPA, or custom notation systems. They are used for child language, speech disorders, L2 learning, and dialectal variation studies.

Parsing strategy: We deliberately parse only the minimal word/group-level structure in %pho and %mod needed for coarse alignment with the main tier. The full IPA phoneme content is stored as opaque strings; deep phonological analysis is handled by Phon, and we avoid duplicating that work. The Phon extension tiers (%modsyl, %phosyl, %phoaln) follow the same strategy.

%sin, Gesture and Sign Annotation

The %sin tier codes gestures and signs aligned with speech. Each token is either 0 (no gesture) or g:referent:type (e.g., g:ball:dpoint for a deictic point at a ball).

*CHI:	that ball .
%sin:	g:ball:dpoint 0 .

Multiple simultaneous gestures use bracket grouping: 〔g:toy:hold g:toy:shake〕.

%wor, Word Timing

The %wor tier carries word-level timing annotations for media synchronization. Words may include inline bullets with millisecond timestamps. Word text is display-only (“eye candy”); timing data comes from the bullet fields.

⚠ IMPORTANT: %wor word text is the cleaned form, by design. When chatter serializes a %wor word it writes the word’s cleaned text, the spoken form with surface markers removed, NOT the raw main-tier surface form. This is a deliberate convention (see WorTier::write_chat in crates/talkbank-model/src/model/dependent_tier/wor.rs), chosen for human readability and because %wor exists to anchor timing, not to re-state the main tier’s orthography. The generated %wor text and the TextGrid export both use this cleaned form.

Consequence you must know: surface markers carried on a word (prosodic lengthening such as wabe:, and similar in-word notation) are not preserved in %wor output. A main-tier word wabe: becomes wabe on %wor. This means a %wor line containing such words does not byte-roundtrip (parse, serialize, reparse changes the surface text), and that is expected, not a bug. %wor is a cleaned, timing-only view; the main tier remains the faithful record of surface forms. Do not “fix” the %wor serializer to emit raw text without an explicit decision to change this convention.

%wor is not a flat “all tokens except punctuation” tier. It follows a word-level alignment rule:

Regular words count.
Fillers (&-um, &-uh, &-you_know) count; they are real spoken words with known phoneme sequences.
Fragments (&+...) do NOT count: incomplete phoneme sequences; the FA engine cannot reliably anchor partial phonological material.
Nonwords (&~...) do NOT count: interactional/gestural sounds without stable lexical phoneme content for alignment.
Untranscribed placeholders (xxx, yyy, www) do NOT count: they have no known phoneme sequence; CTC forced alignment cannot produce timings for unknown material.
Replacements keep the original spoken word slot for %wor; the replacement text matters for %mor, not %wor. If the original slot is untranscribed or a fragment/nonword, it is still excluded.
Retrace scope does not change %wor membership.
Overlap markers do not change %wor membership.

%wor is a timing-annotation tier. Its word count equals the number of Wor-domain words and may differ from a naive main-tier word count. There is no downstream positional indexing into %wor; the %wor count is not validated against the main-tier word count.

*CHI:	I want cookies .
%wor:	I want cookies .

Exact corpus-shaped contrast:

*CHI:	<one &+ss> [/] one play ground .
%wor:	one •321809_321969• play •322049_322310• ground •322390_322890• .
# &+ss is a fragment, excluded from %wor regardless of retrace context.

*EXP:	&+ih <the what> [/] what's letter &+th is this ?
%wor:	the •49103_49163• what •49183_50205• what's •50205_50405• letter •50405_50685• is •50946_51046• this •51086_51586• ?
# Fragments &+ih and &+th excluded; regular words remain.

*EXP:	what's is dis [: this] ?
%wor:	what's •37050_37471• is •37491_37631• dis •37631_38131• ?
# dis [: this]: the original spoken word dis keeps its %wor slot; the replacement text is not used.

*CHI:	xxx snack .
%wor:	snack •884668_885168• .
# xxx has no phoneme sequence, excluded from %wor; only snack appears.

*CHI:	&~um a boat .
%wor:	a •1073779_1073799• boat •1076861_1077361• .
# &~um is a nonword, excluded from %wor.

*CHI:	&-mm [<] bananas are good .
%wor:	mm •1949506_1949566• bananas •1949566_1949766• are •1949846_1949987• good •1950067_1950567• .
# &-mm is a filler, included in %wor (real spoken word with alignable phoneme sequence).

flowchart TD
    A["Main-tier word candidate"] --> B{"Timestamp token /\nomission / empty?"}
    B -->|Yes| OUT["Excluded from %wor"]
    B -->|No| C{"Untranscribed?\n(xxx/yyy/www)"}
    C -->|Yes| OUT
    C -->|No| D{"Fragment or nonword?\n(&+ or &~)"}
    D -->|Yes| OUT
    D -->|No| IN["Counts for %wor\n(word or filler &-)"]

    style IN fill:#afa,stroke:#333
    style OUT fill:#faa,stroke:#333
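The membership decision can be sketched in a few lines. The real code works on the typed AST; here `WordKind` is a hypothetical enum, and `classify` derives it from surface prefixes purely so the example is self-contained. Stripping the `&-` prefix for fillers is likewise a simplified stand-in for the cleaned-text serialization.

```rust
/// Word categories relevant to %wor membership (hypothetical enum).
#[derive(Clone, Copy, Debug, PartialEq)]
enum WordKind {
    Regular,
    Filler,        // &-um
    Fragment,      // &+ss
    Nonword,       // &~um
    Untranscribed, // xxx, yyy, www
    Omission,      // 0-prefixed
}

/// Only regular words and fillers count for %wor.
fn counts_for_wor(kind: WordKind) -> bool {
    matches!(kind, WordKind::Regular | WordKind::Filler)
}

/// Toy classifier over surface text, for illustration only.
fn classify(token: &str) -> WordKind {
    if matches!(token, "xxx" | "yyy" | "www") {
        WordKind::Untranscribed
    } else if token.starts_with("&+") {
        WordKind::Fragment
    } else if token.starts_with("&~") {
        WordKind::Nonword
    } else if token.starts_with("&-") {
        WordKind::Filler
    } else if token.starts_with('0') {
        WordKind::Omission
    } else {
        WordKind::Regular
    }
}

/// Words that receive a %wor slot, with the filler prefix removed.
fn wor_words(tokens: &[&str]) -> Vec<String> {
    tokens
        .iter()
        .copied()
        .filter(|t| counts_for_wor(classify(t)))
        .map(|t| t.strip_prefix("&-").unwrap_or(t).to_string())
        .collect()
}

fn main() {
    println!("{:?}", wor_words(&["xxx", "snack"]));
    println!("{:?}", wor_words(&["&~um", "a", "boat"]));
    println!("{:?}", wor_words(&["&-mm", "bananas", "are", "good"]));
}
```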

Phon Phonological Tiers

These tiers originate from the Phon project and provide syllable-annotated phonological transcription and segmental alignment. They were originally serialized as %x-prefixed user-defined tiers (%xmodsyl, %xphosyl, %xphoaln) and are being promoted to official CHAT tiers. Phon stores phonological data in its own XML format; the CHAT representation is generated by PhonTalk.

%modsyl / %phosyl, Syllabified Phonology

%modsyl is a syllabified version of %mod (target pronunciation); %phosyl is a syllabified version of %pho (actual pronunciation). Each phoneme is annotated with a syllable position code (N=nucleus, O=onset, C=coda, etc.). Words are space-separated and align 1-to-1 with the corresponding %mod or %pho tier.

*CHI:	the best .
%mod:	ðə bɛst .
%modsyl:	ð:Oə:N b:Oɛ:Ns:Ct:C .
%pho:	ðə bɛs .
%phosyl:	ð:Oə:N b:Oɛ:Ns:C .

Alignment: content-based. Stripping position codes (:N, :O, :C, etc.) and stress markers (ˈ, ˌ) from %modsyl should yield the same phonemes as %mod. The same holds for %phosyl → %pho.
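The content-based comparison can be sketched as a strip-and-compare step. Both function names are hypothetical, and the sketch assumes that every position code is an ASCII colon followed by one uppercase ASCII letter; the text lists only N, O, and C explicitly, so other code shapes would need handling.

```rust
/// Remove syllable position codes and stress markers from one word.
/// Assumes each position code is `:` plus a single uppercase ASCII letter.
fn strip_syllable_annotations(word: &str) -> String {
    let mut out = String::new();
    let mut chars = word.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            'ˈ' | 'ˌ' => {} // primary / secondary stress markers
            ':' if chars.peek().map_or(false, |n| n.is_ascii_uppercase()) => {
                chars.next(); // drop the position letter as well
            }
            _ => out.push(c),
        }
    }
    out
}

/// True if the syllabified tier reduces, word by word, to its parent IPA tier.
fn syllabified_matches(syllabified: &str, parent: &str) -> bool {
    syllabified
        .split_whitespace()
        .map(strip_syllable_annotations)
        .eq(parent.split_whitespace().map(str::to_string))
}

fn main() {
    println!("{}", strip_syllable_annotations("b:Oɛ:Ns:Ct:C"));
    println!("{}", syllabified_matches("ð:Oə:N b:Oɛ:Ns:Ct:C .", "ðə bɛst ."));
}
```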

%phoaln, Phone Alignment

%phoaln provides segmental alignment between target and actual IPA, showing phoneme-by-phoneme correspondence. Each pair uses source↔target notation; ∅ marks insertions or deletions.

*CHI:	the best .
%phoaln:	ð↔ð,ə↔ə b↔b,ɛ↔ɛ,s↔s,t↔∅

Alignment: positional, word by word. Word N in %phoaln aligns with word N in both %mod and %pho.
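A minimal sketch of reading a %phoaln tier into per-word pair lists follows. `PhonePair` and the parse functions are hypothetical, and the two sides are named `left` and `right` rather than target and actual so that the sketch makes no claim about which tier each side belongs to.

```rust
/// One aligned pair; `None` stands for the ∅ placeholder.
#[derive(Debug, PartialEq)]
struct PhonePair {
    left: Option<String>,
    right: Option<String>,
}

fn parse_side(side: &str) -> Option<String> {
    if side == "∅" { None } else { Some(side.to_string()) }
}

/// Parse one word, e.g. `b↔b,ɛ↔ɛ,s↔s,t↔∅`. Returns None if a pair has no `↔`.
fn parse_phoaln_word(word: &str) -> Option<Vec<PhonePair>> {
    word.split(',')
        .map(|pair| {
            let (left, right) = pair.split_once('↔')?;
            Some(PhonePair { left: parse_side(left), right: parse_side(right) })
        })
        .collect()
}

/// Parse a whole tier body into one pair list per word.
fn parse_phoaln(tier: &str) -> Option<Vec<Vec<PhonePair>>> {
    tier.split_whitespace().map(parse_phoaln_word).collect()
}

fn main() {
    let words = parse_phoaln("ð↔ð,ə↔ə b↔b,ɛ↔ɛ,s↔s,t↔∅").unwrap();
    for (n, pairs) in words.iter().enumerate() {
        println!("word {}: {} pairs", n + 1, pairs.len());
    }
}
```

Because alignment is positional, the length of the outer list is the only thing a word-count check needs; the pair contents can stay opaque.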

Parsing strategy: as with %pho/%mod, we parse just enough structure for alignment (word boundaries for %modsyl/%phosyl, alignment pairs for %phoaln). IPA phoneme content is treated as opaque strings.

Validation (E725-E728)

Because these are derived views, word counts must match between each syllabification tier and its parent IPA tier:

Check	Error code
`%modsyl` word count ≠ `%mod` word count	E725
`%phosyl` word count ≠ `%pho` word count	E726
`%phoaln` word count ≠ `%mod` word count	E727
`%phoaln` word count ≠ `%pho` word count	E728

These checks are gated on ParseHealth: if either tier in a pair has parse errors, the alignment check is suppressed to avoid false positives.
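The gating can be illustrated with a short sketch. The `Tier` struct reduces ParseHealth to a single boolean, and counting words by whitespace is an assumption of this example, not the real tier model.

```rust
/// A tier body plus a simplified parse-health flag (hypothetical type).
struct Tier<'a> {
    text: &'a str,
    parse_clean: bool,
}

fn word_count(tier: &Tier) -> usize {
    tier.text.split_whitespace().count()
}

/// Return `code` if the derived tier and its parent disagree on word count.
/// The check is skipped when either tier failed to parse cleanly.
fn check_word_counts(code: &'static str, derived: &Tier, parent: &Tier) -> Option<&'static str> {
    if !(derived.parse_clean && parent.parse_clean) {
        return None; // gated: avoid reporting a mismatch caused by a parse error
    }
    if word_count(derived) != word_count(parent) {
        Some(code)
    } else {
        None
    }
}

fn main() {
    let modsyl = Tier { text: "ð:Oə:N b:Oɛ:Ns:Ct:C .", parse_clean: true };
    let mod_tier = Tier { text: "ðə bɛst .", parse_clean: true };
    println!("{:?}", check_word_counts("E725", &modsyl, &mod_tier));
}
```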

Known PhonTalk Export Issue

The PhonTalk XML→CHAT converter writes %mod/%pho through a OneToOne alignment path that maps IPA words to orthography words and silently drops extras. The syllabification tiers (%modsyl, %phosyl, %phoaln) bypass this path and include all IPA words. In child phonology data where children produce more IPA words than orthographic targets (~4% of Phon corpus files), this creates tier-to-tier word count mismatches. The mismatches originate in the Phon XML source data (orthography↔IPA word count discrepancies) and are inconsistently handled during CHAT export. This is being investigated in collaboration with the Phon team.

Bullet-Content Tiers

These tiers contain free-form text with optional embedded timing markers (•START_END•) and picture references (•%pic:"file.jpg"•). They do not align word-by-word with the main tier.

Tier	Purpose
`%act`	Physical actions, gestures, non-verbal behaviors
`%cod`	Research-specific coding (semantic roles, thematic coding, error classification)
`%com`	Comments, annotations, and contextual notes
`%exp`	Explanations or expansions of ambiguous/incomplete speech
`%add`	Addressee identification in multi-party conversations
`%spa`	Speech act coding (request, assertion, question, directive)
`%sit`	Situational context or setting description
`%gpx`	Extended gesture position coding
`%int`	Intonational contours and prosodic patterns

%cod is bullet-content in the shared TalkBank AST. In the %cod coding convention, a word selector such as <w4> scopes the code that follows it (it names which main-tier word the code applies to) rather than being a code in its own right.

Example:

*CHI:	gimme that .
%act:	reaches toward shelf
%com:	child is pointing to picture

Text Tiers

These tiers contain plain text with no bullets, timing, or structural alignment:

Tier	Purpose
`%alt`	Alternative transcriptions
`%coh`	Cohesion annotation
`%def`	Definitions
`%eng`	English translations (for non-English transcripts)
`%err`	Error annotations
`%fac`	Facial expressions
`%flo`	Flow annotation
`%gls`	Glosses
`%ort`	Orthographic representations
`%par`	Paralinguistic information
`%tim`	Timing information

User-Defined Tiers

Tiers prefixed with %x (e.g., %xcod, %xact) are user-defined dependent tiers. They are preserved during parsing and roundtrip but receive no structural validation beyond basic format checks. Any %x-prefixed tier is always accepted; this is the open extension point for project-specific annotation.

The Supported Set Is Closed

A dependent tier is valid in chatter only if it is one of the standard tiers documented above (the structured, Phon, bullet-content, and text tiers) or a %x-prefixed user-defined tier. Any other %-tier is invalid CHAT, and chatter rejects the file with error E605 (UnsupportedDependentTier). This is a closed set by design: chatter validate is the binding judgment on CHAT validity, so an unrecognized dependent tier is an error, not a warning.
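As a rough sketch of the closed-set rule, assuming a hypothetical classify_tier helper and an illustrative subset of the standard tier names listed in this chapter (the authoritative set lives in chatter itself):

```rust
#[derive(Debug, PartialEq)]
enum TierStatus {
    Standard,
    UserDefined,
    /// Reported by chatter as E605 (UnsupportedDependentTier).
    Unsupported,
}

/// Illustrative subset of standard tiers named in this chapter.
/// Assumption: the real closed set is longer than this list.
const STANDARD_TIERS: &[&str] = &[
    "mor", "gra", "pho", "mod", "modsyl", "phosyl", "phoaln",
    "act", "cod", "com", "exp", "add", "spa", "sit", "gpx", "int",
    "alt", "coh", "def", "eng", "err", "fac", "flo", "gls", "ort", "par", "tim",
];

/// Classify a dependent-tier label such as "%act" or "%xcod".
fn classify_tier(label: &str) -> TierStatus {
    let name = label.trim_start_matches('%');
    if STANDARD_TIERS.contains(&name) {
        TierStatus::Standard
    } else if name.starts_with('x') {
        // Any %x-prefixed tier is the open extension point.
        TierStatus::UserDefined
    } else {
        TierStatus::Unsupported
    }
}
```

Under these assumptions %xtra classifies as user-defined, while the bare %tra falls through to Unsupported.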

Deliberate Divergence from CLAN: Retired Legacy Tiers

When TalkBank standardized morphology on a single Universal Dependencies %mor tier (plus %gra for relations), several legacy dependent tiers were retired. CLAN’s check still accepts three of them, so chatter is intentionally stricter on these. This is a deliberate, documented divergence:

Retired tier	CLAN `check`	chatter
`%trn`	accepts	rejects (E605)
`%tra`	accepts	rejects (E605)
`%grt`	accepts	rejects (E605)
`%umor`	rejects	rejects (E605)

The modern UD-%mor workflow has one morphology tier (%mor) plus %gra; the older training/translation/variant tiers are no longer part of the format chatter validates. %umor is rejected by both validators and is listed only for completeness. Note that %xtra (with the %x prefix) is a perfectly valid user-defined tier; only the bare %tra is retired.

This is one instance of a general principle: where chatter intentionally departs from CLAN/CHECK behavior, the divergence is documented rather than left implicit. See CHECK Parity Audit.

The %mor Tier: Morphological Analysis

Status: Reference Last updated: 2026-05-11 20:35 EDT

The %mor (morphological) dependent tier provides word-by-word morphosyntactic annotation aligned with the main tier. Each main-tier word receives a morphological code specifying part of speech, lemma, and grammatical features.

Format Overview

*CHI:	I want cookies .
%mor:	pron|I-Prs-Nom-S1 verb|want-Fin-Ind-Pres-S1 noun|cookie-Plur .

Each %mor item has the structure POS|lemma[-Feature]*, where:

POS: part-of-speech category (noun, verb, pron, det, aux, etc.)
|: pipe separator (always present)
Lemma: base form of the word (cookie, be, I). May contain language-specific compound or derivational boundary markers (see Compound Lemma Boundaries below)
Features: zero or more morphological features, each preceded by - (-Plur, -Fin-Ind-Pres-S3)

Items are space-separated and terminate with a punctuation marker (., ?, !, etc.).

The UD MOR Format

TalkBank’s %mor tier uses a format inspired by Universal Dependencies (UD) but adapted to CHAT conventions. We call this the UD MOR format to distinguish it from the older CLAN-era MOR format.

The UD MOR format was introduced via batchalign’s Stanza-based morphosyntax pipeline. Stanza produces standard UD analysis (UPOS, lemma, morphological features, dependency relations), and the Rust mapping layer converts this to CHAT %mor and %gra tiers. The new format has been adopted for all new corpus annotation.

Structure: Flat POS|lemma[-Feature]*

Every morphological word is flat: a single POS tag, a single lemma, and a linear chain of features:

POS|lemma[-Feature1][-Feature2][-Feature3]...

There are no compounds, prefixes, subcategories, or nested structures in the UD MOR format. The entire morphological analysis of a word is captured by the POS+lemma+features triple.

Examples:

Word	%mor code	POS	Lemma	Features
dog	`noun|dog`	noun	dog	(none)
dogs	`noun|dog-Plur`	noun	dog	Plur
running	`verb|run-Part-Pres-S`	verb	run	Part, Pres, S
is	`aux|be-Fin-Ind-Pres-S3`	aux	be	Fin, Ind, Pres, S3
I	`pron|I-Prs-Nom-S1`	pron	I	Prs, Nom, S1
the	`det|the-Def-Art`	det	the	Def, Art
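A minimal sketch of splitting one item into the POS+lemma+features triple. FlatMorWord and parse_flat_mor_word are hypothetical stand-ins, not the types in talkbank-model, and the sketch assumes the lemma itself contains no hyphen, so the first - after the pipe starts the feature chain.

```rust
/// Standalone mirror of the flat POS|lemma[-Feature]* shape.
/// Hypothetical type for illustration only.
#[derive(Debug, PartialEq)]
struct FlatMorWord {
    pos: String,
    lemma: String,
    features: Vec<String>,
}

/// Parse one UD MOR word, returning `None` if the POS, the lemma,
/// or any feature is empty, or if the pipe is missing.
fn parse_flat_mor_word(item: &str) -> Option<FlatMorWord> {
    let (pos, rest) = item.split_once('|')?;
    let mut parts = rest.split('-');
    let lemma = parts.next()?;
    let features: Vec<String> = parts.map(str::to_string).collect();
    if pos.is_empty() || lemma.is_empty() || features.iter().any(|f| f.is_empty()) {
        return None;
    }
    Some(FlatMorWord {
        pos: pos.to_string(),
        lemma: lemma.to_string(),
        features,
    })
}
```

For example, aux|be-Fin-Ind-Pres-S3 yields the POS aux, the lemma be, and the four features in the table row above.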

Multi-Word Tokens (Clitics)

English contractions and similar multi-word tokens (MWTs) are represented using the tilde (~) separator for post-clitics:

*CHI:	it's red .
%mor:	pron|it~aux|be-Fin-Ind-Pres-S3 adj|red .

Here it's is a single main-tier word that expands to two morphological words: pron|it (main) and aux|be-Fin-Ind-Pres-S3 (post-clitic). The ~ indicates the two MOR words are fused into one orthographic token.

Each clitic counts as its own chunk for %gra alignment; pron|it~aux|be-Fin-Ind-Pres-S3 produces 2 chunks, each needing its own grammatical relation.

Terminator

The %mor tier ends with a terminator that matches the main tier’s utterance terminator:

*CHI:	what is that ?
%mor:	pron|what aux|be-Fin-Ind-Pres-S3 det|that ?

The terminator (., ?, !, +..., etc.) counts as one chunk for %gra alignment.
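The chunk arithmetic can be sketched as follows. count_mor_chunks is a hypothetical string-level helper, not the model's own method; it assumes items are whitespace-separated and that the last item on the tier is the terminator.

```rust
/// Count %gra chunks for a %mor tier body: one per morphological word,
/// one per `~` post-clitic, plus one for the terminator.
/// Assumes the final whitespace-separated item is the terminator.
fn count_mor_chunks(mor_tier: &str) -> usize {
    let items: Vec<&str> = mor_tier.split_whitespace().collect();
    let Some((_terminator, words)) = items.split_last() else {
        return 0; // empty tier body
    };
    // Each item contributes its main word plus any clitics fused with `~`.
    let word_chunks: usize = words.iter().map(|item| item.split('~').count()).sum();
    word_chunks + 1
}
```

For the it's red . example, the clitic item contributes 2 chunks, adj|red contributes 1, and the terminator contributes 1, so the matching %gra tier needs 4 relations.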

How It Diverges from UD

The UD MOR format is UD-inspired but not UD-compliant. Several deliberate adaptations make it fit CHAT conventions while preserving most UD information. This section catalogs every divergence.

1. POS Tags Are Lowercased UPOS

UD uses uppercase UPOS tags (NOUN, VERB, PRON). CHAT uses lowercase (noun, verb, pron). This is a lossless, trivially reversible surface change.

UD UPOS	CHAT POS
NOUN	`noun`
VERB	`verb`
AUX	`aux`
PRON	`pron`
DET	`det`
ADJ	`adj`
ADV	`adv`
ADP	`adp`
PROPN	`propn`
INTJ	`intj`
CCONJ	`cconj`
SCONJ	`sconj`
NUM	`num`
PART	`part`
X	`x`

2. Feature Values Are Flat, Not Key=Value (Currently)

UD represents morphological features as key=value pairs: Number=Plur, Tense=Past, Person=3. The current CHAT convention drops the keys and uses only the values: -Plur, -Past, -S3.

This is the most significant divergence from UD, because:

Information loss: Plur could in principle be Number=Plur or Degree=Plur (though in practice the UD feature value set has no real ambiguities).
Collapsed person/number: UD Person=3|Number=Sing becomes -S3, a combined code that maps back to its UD components only through a lookup table, not by mechanically splitting the string.
Feature ordering: Features appear in a conventional order determined by the generation pipeline, not in UD’s alphabetical order.

The data model now supports key=value features. The MorFeature type has an optional key field: when the key is present, the feature serializes as Key=Value (e.g., -Number=Plur); when it is absent, the feature serializes as just the value (e.g., -Plur). This is forward-compatible: existing flat features parse and serialize identically, and if batchalign’s mapper begins emitting Key=Value features, they flow through the parser and model without any format changes.

3. Multi-Value Features: Commas Preserved

UD encodes multi-value features with commas: PronType=Int,Rel (the word is both interrogative and relative). In CHAT %mor, the comma is preserved within the feature value:

-Int,Rel

This is treated as a single feature value "Int,Rel". The grammar accepts commas within feature values, and the model stores them as-is. No decomposition occurs; the model faithfully records the string that appears in the %mor tier.

Historical note: Earlier documentation described a “comma-stripping” convention where PronType=Int,Rel became -IntRel (concatenated without separator). The current grammar and parser preserve the comma. Existing corpus data using the concatenated form (-IntRel) also parses correctly; it’s simply treated as the flat value "IntRel".

4. Dependency Relations Are Uppercase with Dash Subtypes

The %gra tier (not %mor, but closely related) uses uppercase relation names with dashes for subtypes, where UD uses lowercase with colons:

UD	CHAT %gra
`nsubj`	`NSUBJ`
`acl:relcl`	`ACL-RELCL`
`obl:tmod`	`OBL-TMOD`

This is lossless; case and separator are trivially reversible.
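The reversibility claim can be illustrated with two small functions. Both names are hypothetical, and the sketch assumes UD relation names are all lowercase and contain no dash of their own, which is what makes the mapping invertible.

```rust
/// Map a UD relation name to its %gra spelling: colon becomes dash,
/// then the whole name is uppercased.
fn ud_to_gra_relation(ud: &str) -> String {
    ud.replace(':', "-").to_uppercase()
}

/// Invert the mapping. Only valid under the assumption that the original
/// UD name was lowercase and contained no dash.
fn gra_to_ud_relation(gra: &str) -> String {
    gra.replace('-', ":").to_lowercase()
}
```

Round-tripping acl:relcl through both functions returns the original string.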

5. ROOT Head Convention

In UD, the root word has head=0. In %gra, two conventions coexist:

UD convention: head=0 (e.g., 3|0|ROOT), the standard we now emit
Legacy TalkBank convention: head=self (e.g., 3|3|ROOT), found in older corpus data

The parser and validator accept both forms. New output uses head=0.
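A sketch of accepting both conventions, using a hypothetical root_head_is_accepted helper that works directly on the index|head|RELATION text rather than on the parsed model:

```rust
/// Inspect one %gra relation of the form "index|head|RELATION".
/// Returns `None` if the text is malformed or the relation is not ROOT;
/// otherwise reports whether the head follows one of the two conventions.
fn root_head_is_accepted(relation: &str) -> Option<bool> {
    let mut fields = relation.split('|');
    let index: usize = fields.next()?.parse().ok()?;
    let head: usize = fields.next()?.parse().ok()?;
    let name = fields.next()?;
    if name != "ROOT" {
        return None;
    }
    // UD convention (head = 0) or legacy TalkBank convention (head = self).
    Some(head == 0 || head == index)
}
```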

6. No XPOS, No DEPREL Subtypes in %mor

UD provides both UPOS (universal POS) and XPOS (language-specific POS). CHAT %mor uses only UPOS-equivalent tags; there is no XPOS field. Language-specific POS distinctions are not represented.

Similarly, UD’s fine-grained dependency relation subtypes (e.g., nsubj:pass) appear in %gra as NSUBJ-PASS, but the %mor tier itself contains no dependency information.

7. No Morpheme Segmentation

Traditional CHAT MOR formats (CLAN-era) supported morpheme-level segmentation with compound markers (+), prefix markers (#), and suffix chains (-SUFFIX&type). The UD MOR format uses none of these: each word is analyzed as a flat POS+lemma+features triple.

The grammar still accepts some of these legacy markers for backward compatibility with older corpus data, but the canonical UD MOR format does not produce them.

Compound Lemma Boundaries

Several UD treebanks use special characters inside lemmas to mark morphological boundaries. These are meaningful linguistic annotations preserved in the CHAT %mor lemma field when possible.

Known Markers Across Languages

Language	Marker	Meaning	Example Lemma	In %mor
Estonian	`=`	Compound boundary	`maja=uks` (house-door)	`noun|maja=uks` (preserved)
Basque	`!`	Derivational boundary	`partxi!se` (share + derivation)	`noun|partxi!se-Ine` (preserved)
Finnish	`#`	Compound boundary	`jää#kaappi` (ice-cabinet)	`noun|jää_kaappi` (mangled: `#` → `_`)

= and ! pass through the cleaning pipeline because they are not reserved CHAT %mor syntax characters. # is reserved in traditional CHAT MOR for prefix markers (e.g., v|#un#do), so the sanitizer replaces it with _.

Gotcha: = ambiguity with legacy CLAN translation glosses. Legacy CLAN %mor tiers use = for translation glosses (e.g., n|perro=dog), a convention predating UD adoption. The parser treats = identically in both cases; it is preserved as part of the lemma string. This means legacy n|perro=dog parses successfully but the translation semantics are lost: the model stores perro=dog as a single lemma, indistinguishable from an Estonian compound like maja=uks. Since we cannot reliably disambiguate the two uses without language-specific context, legacy translation glosses are silently absorbed into the lemma. Files with legacy =translation syntax still parse and round-trip correctly, but the translation information is not semantically accessible. This affects corpora that predate our UD MOR adoption and lack Stanza coverage for their language.

Multi-Word Expression Lemmas (Stanza `_` Convention)

Stanza uses underscores in lemmas to represent multi-word expressions across many languages: New_York, parce_que (French), pick_up (English), a_causa_di (Italian). The current cleaning pipeline strips underscores entirely (New_York → NewYork). This is a known, still-open data-quality limitation of the current mapper.

Multi-Value Features (Commas in Feature Values)

UD encodes multi-value features with commas: PronType=Int,Rel means a word is both interrogative and relative. These commas appear in the CHAT %mor feature suffix and are preserved as-is:

pron|wat-Int,Rel

This is sometimes mistaken for a compound lemma marker, but commas in UD always appear in the feature column (CoNLL-U column 6), never in the lemma column (CoNLL-U column 3). In CHAT %mor, they appear after the - feature separator, not inside the lemma. The grammar, both parsers, and the data model all accept commas in feature values. See Section 3: Multi-Value Features above.

Future Direction

The current handling of compound lemma boundaries is inconsistent across languages. A possible future improvement is a unified Unicode separator character that would normalize all compound/derivational boundary markers (=, !, #, and potentially _) into a single convention. This has not been implemented as of 2026-03-02 and requires a design decision on which character to use and whether to preserve the original markers in a structured field.

Data Model

The Rust data model in talkbank-model represents %mor tiers with these types:

MorTier

The top-level tier container:

pub struct MorTier {
    pub tier_type: MorTierType,    // MorTierType::Mor
    pub(crate) items: MorItems,    // Vec<Mor> wrapper; accessed via accessor methods
    pub terminator: Terminator,    // typed terminator (`.`, `?`, `!`, `+...`, etc.)
    pub span: Span,                // source location
}

Mor (Item)

One item aligned with one main-tier word:

pub struct Mor {
    pub main: MorWord,                        // required main word
    pub post_clitics: SmallVec<[MorWord; 2]>, // optional ~clitics
}

MorWord

A single morphological word (POS + lemma + features):

pub struct MorWord {
    pub pos: PosCategory,                    // e.g., "noun"
    pub lemma: MorStem,                      // e.g., "dog"
    pub features: SmallVec<[MorFeature; 4]>, // e.g., [Plur]
}

MorFeature

A morphological feature with optional key:

pub struct MorFeature {
    key: Option<Arc<str>>,  // e.g., Some("Number") or None
    value: Arc<str>,        // e.g., "Plur"
}

Construction examples:

// Flat feature (current convention)
MorFeature::new("Plur")         // key=None, value="Plur"
MorFeature::new("S3")           // key=None, value="S3"
MorFeature::new("Int,Rel")      // key=None, value="Int,Rel"

// Keyed feature (UD-standard, forward-compatible)
MorFeature::new("Number=Plur")  // key=Some("Number"), value="Plur"
MorFeature::new("Tense=Past")   // key=Some("Tense"), value="Past"

// Explicit constructors
MorFeature::flat("Plur")
MorFeature::with_key_value("Number", "Plur")

Lossless roundtrip guarantee: MorFeature::new auto-detects the = delimiter. Features without = are flat; features with = split into key+value. Serialization reproduces the original format exactly: flat features stay flat, and keyed features keep their key.
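The roundtrip behavior can be demonstrated with a standalone mirror of the described logic. The Feature type below is a hypothetical simplification that uses String instead of interned Arc<str>; it is not the crate's MorFeature.

```rust
use std::fmt;

/// Simplified stand-in for a morphological feature with an optional key.
#[derive(Debug, PartialEq)]
struct Feature {
    key: Option<String>,
    value: String,
}

impl Feature {
    /// Split on the first `=` if one is present; otherwise keep a flat value.
    fn parse(text: &str) -> Self {
        match text.split_once('=') {
            Some((key, value)) => Feature {
                key: Some(key.to_string()),
                value: value.to_string(),
            },
            None => Feature {
                key: None,
                value: text.to_string(),
            },
        }
    }
}

impl fmt::Display for Feature {
    /// Keyed features print as Key=Value; flat features print the value only.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match &self.key {
            Some(key) => write!(f, "{}={}", key, self.value),
            None => write!(f, "{}", self.value),
        }
    }
}
```

Parsing and then printing Plur, Int,Rel, or Number=Plur reproduces each input string unchanged, which is the property the guarantee describes.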

PosCategory and MorStem

Both are interned Arc<str> newtypes for memory efficiency:

pub struct PosCategory(pub Arc<str>);  // interned via pos_interner()
pub struct MorStem(pub Arc<str>);      // interned via stem_interner()

Common values (noun, verb, the, a, be, etc.) are pre-populated in the interner. Cloning is O(1): an atomic reference-count increment.
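To show why interning makes repeated values cheap, here is a minimal single-threaded interner built on the standard library. It is an illustrative sketch only: the Interner type is hypothetical and is not the implementation behind pos_interner() or stem_interner().

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Minimal string interner: equal strings share one allocation.
#[derive(Default)]
struct Interner {
    pool: HashSet<Arc<str>>,
}

impl Interner {
    /// Return the shared handle for `text`, allocating only on first sight.
    fn intern(&mut self, text: &str) -> Arc<str> {
        if let Some(existing) = self.pool.get(text) {
            // Cloning an Arc bumps a reference count; no string copy occurs.
            return Arc::clone(existing);
        }
        let fresh: Arc<str> = Arc::from(text);
        self.pool.insert(Arc::clone(&fresh));
        fresh
    }
}
```

Interning noun twice returns two handles to the same allocation, so a file with thousands of noun tags stores the string once.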

Memory Layout

The model uses SmallVec for inline storage of common cases:

Mor.post_clitics: SmallVec<[MorWord; 2]>: most words have 0-1 clitics
MorWord.features: SmallVec<[MorFeature; 4]>: most words have 0-4 features
MorFeature key and value are Arc<str>, interned for deduplication

For a typical 30-word utterance with %mor, the model allocates approximately 30 Mor items, each with 1 MorWord and 0-4 MorFeature values. The interning system ensures that repeated POS tags, stems, and feature values share a single allocation across the entire file.

Grammar

The tree-sitter grammar for %mor is defined in grammar.js. The relevant rules:

mor_content → mor_word (mor_post_clitic)*
mor_post_clitic → tilde mor_word
mor_word → mor_pos pipe mor_lemma (mor_feature)*
mor_feature → hyphen mor_feature_value
mor_feature_value → /[^\.\?\|\+~\-\s\r\n]+/

Key design decisions:

mor_feature_value accepts = and !: The regex [^\.\?\|\+~\-\s\r\n]+ matches any characters except the MOR structural delimiters. This means Number=Plur parses as a single mor_feature_value node. The split on = happens in the model layer, not the grammar, following the “parse, don’t validate” principle.
mor_feature_value accepts ,: Multi-value features like Int,Rel parse as a single node.
No compound/prefix rules: The grammar has no rules for + (compounds) or # (prefixes) in the UD MOR format. These are legacy CHAT MOR features not used in UD-style output.

Parser

The tree-sitter parser produces MorTier from CHAT text. It is GLR-based and error-recovering: it produces a CST, which the Rust talkbank-parser crate walks to construct MorTier. The CLI, the LSP, and batchalign all use this parser. High-frequency values (PosCategory, MorStem) are interned as Arc<str> during construction.

The corpus/reference/ set is the correctness gate for %mor parsing: every file must parse and round-trip cleanly. The file count grows as new constructs are added; run find corpus/reference -name '*.cha' | wc -l to get the live total.

Validation

The %mor tier undergoes several validation checks:

Content Validation (E711)

Every MorWord is checked for:

Empty POS: |lemma with no POS before the pipe
Empty lemma: pos| with no lemma after the pipe
Empty feature: bare - separator with no feature text

Main-tier Alignment (E705 / E706)

The %mor tier must align 1-to-1 with the main tier’s alignable words (excluding pauses, events, and other non-word content). The number of Mor items must equal the number of alignable main-tier words. The validator emits E705 MorCountMismatchTooFew when %mor has fewer items than the main tier and E706 MorCountMismatchTooMany when it has more. Terminator-mismatch errors are emitted separately as E707 (presence) and E716 (value).

GRA Alignment (E720)

When both %mor and %gra tiers are present, the number of %gra relations must equal the number of %mor chunks (including clitics and the terminator). A mismatch emits E720 MorGraCountMismatch. This is computed via MorTier::count_chunks().

(%gra’s own internal validators are documented in Dependent Tiers § %gra: E708 malformed relation, E709 invalid index, E712 word-index out of range, E713 head-index out of range, E721 non-sequential index, E722 no ROOT, E723 multiple ROOTs, and E724 circular dependency.)
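The two count comparisons above reduce to simple integer checks. The function names below are hypothetical, and the counts are assumed to have been computed elsewhere (alignable main-tier words, Mor items, %mor chunks, %gra relations).

```rust
/// Compare the number of Mor items with the number of alignable main-tier words.
fn mor_main_alignment(main_words: usize, mor_items: usize) -> Option<&'static str> {
    use std::cmp::Ordering;
    match mor_items.cmp(&main_words) {
        Ordering::Less => Some("E705"),    // MorCountMismatchTooFew
        Ordering::Greater => Some("E706"), // MorCountMismatchTooMany
        Ordering::Equal => None,
    }
}

/// Compare the number of %mor chunks with the number of %gra relations.
fn mor_gra_alignment(mor_chunks: usize, gra_relations: usize) -> Option<&'static str> {
    if mor_chunks == gra_relations {
        None
    } else {
        Some("E720") // MorGraCountMismatch
    }
}
```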

JSON Serialization

The MorTier serializes to JSON using serde. MorFeature serializes as a plain string ("Plur" or "Number=Plur"), so the JSON schema is simply "type": "string". Example:

{
  "tier_type": "Mor",
  "items": [
    {
      "main": {
        "pos": "pron",
        "lemma": "I",
        "features": ["Prs", "Nom", "S1"]
      }
    },
    {
      "main": {
        "pos": "verb",
        "lemma": "want",
        "features": ["Fin", "Ind", "Pres", "S1"]
      }
    },
    {
      "main": {
        "pos": "noun",
        "lemma": "cookie",
        "features": ["Plur"]
      }
    }
  ],
  "terminator": "."
}

When key=value features are present, they serialize with the key included:

"features": ["Number=Plur", "Tense=Past"]

The JSON schema for MorFeature is "type": "string" regardless of whether keys are present.

Migration from Traditional CHAT MOR

What Changed

The traditional CHAT MOR format (CLAN-era) used a complex, hierarchically structured notation:

%mor:	pro:sub|I v|want n|cookie-PL .

Key differences from the UD MOR format:

Aspect	Traditional CHAT MOR	UD MOR
POS tags	CLAN categories (`pro:sub`, `v`, `n`, `adj`, `adv`)	Lowercased UPOS (`pron`, `verb`, `noun`, `adj`, `adv`)
POS subtypes	Colon-separated (`pro:sub`, `det:art`, `v:aux`)	Flat (subtypes dropped or encoded differently)
Features	CLAN suffix system (`-PL`, `-PAST`, `-3S`, `-PRES`)	UD feature values (`-Plur`, `-Past`, `-S3`, `-Pres`)
Compounds	`+` separator (`n|+n|black+n|bird`)	Not used
Prefixes	`#` separator (`v|#un#do`)	Not used
Morpheme segmentation	Full segmentation (`v|eat&PAST`)	Not used (features are abstract, not morphemic)
Translations	`=` separator (`n|perro=dog`)	Not present in base format (separate mechanism)

What the Model Removed

The UD MOR redesign (2026) removed the following types from the data model:

MorSuffix: suffix with type discriminant (fusional, derivational, etc.)
MorCompound: compound word with + separator
MorPrefix: prefix with # separator
MorSubcategory: POS subcategory after colon
AnnotatedChunk: chunk with optional translation
Chunk: enum of word/compound/terminator

These were replaced by the flat MorWord { pos, lemma, features } structure. The model went from ~12 types to 4 (MorTier, Mor, MorWord, MorFeature).

Backward Compatibility

The grammar still accepts many traditional CHAT MOR constructs (colons in POS tags, etc.) because the reference corpus contains files in both formats. The parser produces the same flat MorWord regardless; legacy constructs are mapped to the simplified structure during parsing.

What Stays the Same

Despite the format changes, fundamental CHAT conventions remain:

Pipe (|) separates POS from lemma
Hyphen (-) introduces features
Tilde (~) marks post-clitics
Space separates items
Terminator ends the tier
1-to-1 alignment with main tier words

Toward Full UD Compatibility

The current format is UD-inspired but not UD-compliant. Here is a roadmap of what would be needed for full lossless UD round-tripping:

Already Supported

POS tags (UPOS equivalents)
Lemmas
Feature values (flat and key=value)
MWT expansions (clitics)
Dependency relations (via %gra)

Gaps Remaining

Feature keys: The model supports Key=Value features, but batchalign’s mapper currently emits flat values only. When the mapper switches to emitting Number=Plur instead of just Plur, the parser, model, and serializer handle it automatically with no code changes.
Person+Number composites: UD has separate Person=3 and Number=Sing features. CHAT combines them into -S3 (3rd person singular). Decomposing S3 back to Person=3|Number=Sing would require a lookup table or a convention change.
Multi-value feature delimiter: UD uses commas (PronType=Int,Rel). CHAT preserves these commas in the feature value, but the semantic structure (two separate values) is not explicitly modeled. The model treats Int,Rel as an opaque string.
XPOS: UD provides language-specific POS tags (XPOS) alongside universal tags (UPOS). CHAT %mor has no XPOS field. This information is simply not represented.
Morpheme-level analysis: UD’s MISC field can encode morpheme boundaries and glosses. CHAT’s UD MOR format does not attempt morpheme segmentation: features are abstract grammatical categories, not morphemic decompositions.
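The lookup-table option for the Person+Number gap could be as small as the sketch below. decompose_person_number is a hypothetical helper. Only -S1 and -S3 are attested in this chapter; the S2 row is an assumption by analogy, and plural composites are deliberately left unhandled because their spelling is not shown here.

```rust
/// Map a combined person/number code back to UD key=value features.
/// Returns `None` for anything that is not a known composite.
fn decompose_person_number(code: &str) -> Option<(&'static str, &'static str)> {
    match code {
        "S1" => Some(("Person=1", "Number=Sing")),
        "S2" => Some(("Person=2", "Number=Sing")), // assumed by analogy
        "S3" => Some(("Person=3", "Number=Sing")),
        _ => None,
    }
}
```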

The Path Forward

The model is designed so that moving toward UD compliance requires no breaking changes:

MorFeature already supports Key=Value; only the mapper needs to start emitting keys.
PosCategory is an opaque string; XPOS could be held in a separate field if needed.
The JSON schema uses "type": "string" for features, so adding keys does not break consumers.
The grammar already accepts = in feature values, so no grammar changes are needed.

The migration can happen incrementally: the mapper starts emitting key=value features, existing flat data continues to parse identically, and corpus files can be upgraded at their own pace.

Phon Tiers (%xmodsyl, %xphosyl, %xphoaln, %xphoint)

Status: Reference Last updated: 2026-06-23 07:28 EDT

The Phon extension tiers provide syllable-level phonological annotation, segmental alignment between target and actual IPA, and per-phone time intervals. They are produced by the Phon application and exported to CHAT via PhonTalk.

chatter parses and validates all four tiers as first-class CHAT tiers.

The x prefix. Phon emits these tiers with a leading x (%xmodsyl, %xphosyl, %xphoaln, %xphoint) to mark them as extension tiers. The grammar accepts both the x-prefixed names and the historical non-x names (%modsyl, %phosyl, %phoaln, %phoint); the parser and validator key off the tier kind, not the literal prefix. The canonical serialized form is the x-prefixed name.

The four tiers

Tier	Source	Carries	Word separator
`%xmodsyl`	`%mod`	Syllabification of the model/target transcription	space
`%xphosyl`	`%pho`	Syllabification of the actual transcription	space
`%xphoaln`	`%mod`+`%pho`	Phone-by-phone alignment of model ↔ actual	space
`%xphoint`	`%pho`	Per-phone time intervals (`0x15` time bullets)	` / `

%xmodsyl, %xphosyl, and %xphoaln are word-aligned to their source tier(s) with single ASCII spaces. %xphoint uses a slash with a space on each side (space-slash-space) as its word separator because single spaces already separate the phone and bullet tokens inside each word.

Tier formats

%xmodsyl / %xphosyl, syllabification

A word is one or more phone:CODE units concatenated with no internal whitespace; words are separated by single spaces. The phone is one IPA phone (IPA length is written with the modifier letter ː, U+02D0, never an ASCII colon, so the : separator is unambiguous). A leading stress marker (ˈ primary, ˌ secondary) is part of the phone it precedes.

Pause fillers. Phon keeps every word-aligned phonology tier in index lockstep with the main tier: when the main tier carries a pause, the pause token ((.), (..), (...)) is mirrored at the same word position on %mod, %pho, %xmodsyl, and %xphosyl (and as a (..)↔(..) pair on %xphoaln). A pause filler is a valid word on the syllabification tiers; it carries no phone:CODE structure and must mirror the same pause token as the source-tier word at its position. Timed pauses ((1.5)) are not accepted as fillers (unattested in the wild corpora).

The constituent code is one character. The legal codes are O N C L R E A D U:

Code	Constituent	Notes
`O`	Onset
`N`	Nucleus	monophthong nucleus
`C`	Coda
`L`	Left appendix	e.g. /s/ in an /s/-stop cluster
`R`	Right appendix	e.g. final /z/ in a complex coda
`E`	OEHS (onset of empty-headed syllable)	e.g. the stop element of an affricate
`A`	Ambisyllabic
`D`	Diphthong	a nucleus member of a diphthong/triphthong; treated as a nucleus
`U`	Unknown	Phon could not assign a concrete constituent; common on `%xphosyl` when the model `%xmodsyl` is fully syllabified

The remaining Phon SyllableConstituentType mnemonics (B for boundary, S for stress, W for word boundary, T for tone) are not emitted on these tiers: boundary, stress, and tone need no per-phone marker.

*CHI:	I want three .
%mod:	aɪ wɑnt θɹi
%xmodsyl:	a:Dɪ:D w:Oɑ:Nn:Ct:C θ:Oɹ:Oi:N
%pho:	aɪ wɑn fwi
%xphosyl:	a:Dɪ:D w:Oɑ:Nn:C f:Ow:Oi:N
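A sketch of tokenizing one syllabification word into phone:CODE units, with the two content failures that E735 and E736 describe. The names tokenize_syl_word and SylError are hypothetical; the real validator tokenizes into typed units in talkbank-model and also compares pause fillers against the source tier, which this sketch does not do.

```rust
const LEGAL_CODES: &str = "ONCLREADU";
const PAUSE_FILLERS: [&str; 3] = ["(.)", "(..)", "(...)"];

#[derive(Debug, PartialEq)]
enum SylError {
    Malformed,   // corresponds to E735
    IllegalCode, // corresponds to E736
}

/// Split one word into (phone, code) units. A pause filler is a valid word
/// with no units. A leading stress marker stays attached to its phone.
fn tokenize_syl_word(word: &str) -> Result<Vec<(String, char)>, SylError> {
    if PAUSE_FILLERS.contains(&word) {
        return Ok(Vec::new());
    }
    let mut units = Vec::new();
    let mut phone = String::new();
    let mut chars = word.chars();
    while let Some(c) = chars.next() {
        if c != ':' {
            phone.push(c);
            continue;
        }
        // A colon must be followed by a code and preceded by a phone.
        let code = chars.next().ok_or(SylError::Malformed)?;
        if phone.is_empty() {
            return Err(SylError::Malformed);
        }
        if !LEGAL_CODES.contains(code) {
            return Err(SylError::IllegalCode);
        }
        units.push((std::mem::take(&mut phone), code));
    }
    // Leftover phone text without a code, or no units at all, is malformed.
    if phone.is_empty() && !units.is_empty() {
        Ok(units)
    } else {
        Err(SylError::Malformed)
    }
}
```

Concatenating the phones of the returned units gives the string that the E737 and E738 reconstruction checks compare against the source-tier word.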

%xphoaln, phone alignment

A word is one or more comma-separated pairs; a pair is model↔actual (↔ is U+2194). Either side may be ∅ (U+2205, empty set): ∅ on the left is an epenthesis (a phone produced but not targeted); ∅ on the right is a deletion. Both sides are never ∅ at once.

*CHI:	the best .
%mod:	ðə bɛst
%pho:	ðə bɛs
%xphoaln:	ð↔ð,ə↔ə b↔b,ɛ↔ɛ,s↔s,t↔∅

The alignment lists segments (phones). Suprasegmental stress (ˈ/ˌ) that may appear on the %mod/%pho word is therefore not part of the alignment pairs; the reconstruction checks below compare modulo those stress markers.
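The pair syntax and the two reconstructions can be sketched as below. Pair, parse_alignment_word, and reconstruct are hypothetical names; the sketch covers the malformed-pair cases listed for E739 but does not strip stress or syllable-boundary characters from the source words before comparison.

```rust
const EMPTY: &str = "∅";

/// One model↔actual pair; `None` stands for ∅.
type Pair = (Option<String>, Option<String>);

/// Parse one %xphoaln word. Returns `None` if any pair is malformed:
/// not exactly one `↔`, an empty side, or `∅↔∅`.
fn parse_alignment_word(word: &str) -> Option<Vec<Pair>> {
    word.split(',')
        .map(|pair| {
            let (model, actual) = pair.split_once('↔')?;
            if model.is_empty() || actual.is_empty() || actual.contains('↔') {
                return None;
            }
            if model == EMPTY && actual == EMPTY {
                return None;
            }
            let side = |s: &str| (s != EMPTY).then(|| s.to_string());
            Some((side(model), side(actual)))
        })
        .collect()
}

/// Concatenate the model side (or the actual side) of the pairs, skipping ∅.
fn reconstruct(pairs: &[Pair], model_side: bool) -> String {
    pairs
        .iter()
        .filter_map(|(model, actual)| {
            if model_side {
                model.as_deref()
            } else {
                actual.as_deref()
            }
        })
        .collect()
}
```

For the second word of the example, the model side reconstructs to bɛst and the actual side to bɛs, matching %mod and %pho.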

%xphoint, per-phone intervals

%xphoint gives the time segmentation of each individual phone on %pho, effectively phone-level bullets analogous to the word-level timing on %wor. Groups (one per %pho word) are separated by a slash with a space on each side. Within a group, each phone is followed by a CLAN time-alignment bullet: the byte 0x15 (NAK), the interval start_end, then 0x15.

*CHI:	I want . •0_500•
%pho:	aɪ wɑnt
%xphoint:	aɪ •0_250• / w •250_320• ɑ •320_400• n •400_460• t •460_500•

(Bullets are shown as • above; in the file they are the 0x15 byte.)
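A sketch of reading the tier body into groups of phone intervals, plus the two ordering rules (start before end, and non-decreasing starts). PhoneSpan, parse_xphoint, and intervals_are_ordered are hypothetical names, not the XphointTier types, and the sketch assumes each phone token is followed by exactly one bullet token.

```rust
/// The CLAN bullet delimiter, byte 0x15 (NAK).
const BULLET: char = '\u{15}';

#[derive(Debug, PartialEq)]
struct PhoneSpan {
    phone: String,
    start: u64,
    end: u64,
}

/// Parse a %xphoint body into one group per %pho word.
/// Returns `None` on any structural problem.
fn parse_xphoint(body: &str) -> Option<Vec<Vec<PhoneSpan>>> {
    body.split(" / ")
        .map(|group| {
            let tokens: Vec<&str> = group.split_whitespace().collect();
            // Tokens must come in (phone, bullet) pairs.
            if tokens.is_empty() || tokens.len() % 2 != 0 {
                return None;
            }
            tokens
                .chunks(2)
                .map(|pair| {
                    let times = pair[1].strip_prefix(BULLET)?.strip_suffix(BULLET)?;
                    let (start, end) = times.split_once('_')?;
                    Some(PhoneSpan {
                        phone: pair[0].to_string(),
                        start: start.parse().ok()?,
                        end: end.parse().ok()?,
                    })
                })
                .collect::<Option<Vec<PhoneSpan>>>()
        })
        .collect()
}

/// Every bullet has start < end, and starts never decrease across the tier.
fn intervals_are_ordered(groups: &[Vec<PhoneSpan>]) -> bool {
    let spans: Vec<&PhoneSpan> = groups.iter().flatten().collect();
    spans.iter().all(|s| s.start < s.end)
        && spans.windows(2).all(|w| w[0].start <= w[1].start)
}
```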

Validation

These checks run by default. Pass --suppress xphon to silence the entire Phon %x validation surface, or suppress an individual code. (The historical --check-xphon opt-in flag is now a deprecated no-op: the checks it used to gate are on by default.)

Word-count cross-checks (each %x tier has the same number of words as the tier(s) it depends on):

%xmodsyl ↔ %mod: E725
%xphosyl ↔ %pho: E726
%xphoaln ↔ %mod: E727, ↔ %pho: E728

Content checks:

Code	Tier	Rule
E735	xmodsyl/xphosyl	a non-pause-filler unit is not a well-formed `phone:CODE` (no `:`, empty phone, or empty code)
E736	xmodsyl/xphosyl	a constituent code is not one of `O N C L R E A D U`
E737	xmodsyl	stripping codes and concatenating phones does not reproduce the `%mod` word (a pause filler must mirror the same pause token)
E738	xphosyl	stripping codes and concatenating phones does not reproduce the `%pho` word (a pause filler must mirror the same pause token)
E739	xphoaln	a pair is malformed (not exactly one `↔`, an empty side, or `∅↔∅`)
E740	xphoaln	concatenating the model sides (skipping `∅`, modulo stress and `^`/`.` syllable boundaries) does not reproduce the `%mod` word
E741	xphoaln	concatenating the actual sides (skipping `∅`, modulo stress and `^`/`.` syllable boundaries) does not reproduce the `%pho` word
E742	xphoint	a bullet has `start >= end`
E743	xphoint	interval start times are not non-decreasing across the tier
E744	xphoint	the first start / last end falls outside the record’s media bullet (1 ms tolerance)
E745	xphoint	a group’s phones do not reproduce the `%pho` word
E746	xphoint	the number of groups does not equal the `%pho` word count

See Alignment Architecture for the word-count implementation.

Parsing strategy

%xmodsyl / %xphosyl: stored as flat word strings (talkbank_model::dependent_tier::phon::SylTier), consistent with how %pho and %mod store flat phone words. The validator tokenizes each word into typed phone:CODE units (PositionCode) to apply the content rules above; the IPA characters themselves stay verbatim for exact round-trip.
%xphoaln: each word is parsed into a Vec<AlignmentPair>, where AlignmentPair { source, target } carries one model↔actual mapping (None is ∅).
%xphoint: parsed into typed groups of (phone, bullet) pairs (XphointTier / XphointGroup / PhoneInterval), reusing the same 0x15 bullet machinery as %wor.

Deep phonological analysis is Phon’s domain; chatter parses the structure that validation needs and keeps the IPA content verbatim.

Phon XML source format

In Phon’s native XML format, phonological data is stored as structured elements:

<ipaTarget>
  <pho>
    <pw>
      <ph scType="onset"><base>θ</base></ph>
      <ph scType="onset"><base>ɹ</base></ph>
      <ph scType="nucleus"><base>i</base></ph>
    </pw>
  </pho>
</ipaTarget>

Each <pw> (phonological word) element contains <ph> elements with syllable constituent types (scType). The <alignment> element provides phone-level mappings between target and actual using index-based <pm> (phone map) entries.

Data quality notes

A small percentage of Phon corpus XML records have an orthography↔IPA word-count mismatch: the number of <pw> elements in <ipaTarget> / <ipaActual> differs from the number of <w> elements in <orthography>. This is expected in child phonology data: children may produce extra syllables, partial words, or over-productions relative to the target.

For current counts on a local CHILDES/TalkBank data tree, run:

python3 scripts/analysis/scan_phon_mismatches.py /path/to/data

The PhonTalk CHAT export handles this discrepancy inconsistently:

%mod/%pho are written through a OneToOne alignment path that maps IPA words to orthography words; extras are silently dropped.
%xmodsyl/%xphosyl/%xphoaln are written directly from the raw IPATranscript; all IPA words are included.

This produces CHAT files where %xmodsyl may have more words than %mod, triggering the E725-E728 word-count errors. This is being investigated in collaboration with the Phon team.
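A minimal sketch of the word-count comparison behind this class of error, assuming tier content is a flat, whitespace-separated sequence of words. The `check_word_counts` function and `CountMismatch` type are hypothetical; the real implementation is described in Alignment Architecture.

```rust
/// Counts whitespace-delimited words in a tier's content.
pub fn word_count(tier_content: &str) -> usize {
    tier_content.split_whitespace().count()
}

/// Word counts of two tiers that were expected to match (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct CountMismatch {
    pub reference: usize,
    pub derived: usize,
}

/// Compares a reference tier (e.g. %mod) with a derived tier (e.g. %xmodsyl).
pub fn check_word_counts(reference: &str, derived: &str) -> Result<(), CountMismatch> {
    let (r, d) = (word_count(reference), word_count(derived));
    if r == d {
        Ok(())
    } else {
        Err(CountMismatch { reference: r, derived: d })
    }
}
```

With placeholder content, `check_word_counts("w1 w2", "w1 w2 w3")` reports a mismatch of 2 against 3, which is the shape of the discrepancy described above when extra IPA words survive in `%xmodsyl` but are dropped from `%mod`.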

Word Syntax

Status: Reference Last updated: 2026-05-11 23:33 EDT

Words are the primary content unit on the main tier. CHAT defines several word types and annotation mechanisms.

Standalone Words

Most words are simple tokens separated by whitespace:

*CHI:	I want a cookie .

Words can contain Unicode characters for any language:

*CHI:	ich möchte Kekse .

Compounds

Compound words join multiple elements with +:

*CHI:	I want ice+cream .

Special Word Forms

Shortened Forms

Parentheses mark omitted portions of a word:

*CHI:	(be)cause I want it .

The full form is because; the child produced cause.

Replacements

Square brackets with colon mark what the speaker actually meant:

*CHI:	I goed [: went] to the store .

The speaker said “goed” but the intended word was “went”.

Language Markers

The @s: suffix marks a word’s language in multilingual transcripts:

*CHI:	I want a Keks@s:deu .

Other @ markers:

@l: letter
@c: child-invented form
@f: family-specific word
@n: neologism
@o: onomatopoeia
@b: babbling
@wp: word play
@si: singing

Annotations

Words and groups can carry post-positioned annotations in square brackets:

Error Marking

*CHI:	he goed [*] to school .

[*] marks an error. More specific error codes can follow: [* m:+ed].

Explanations

*CHI:	that one [= the red ball] .

[= text] provides an explanation or gloss.

Replacements

*CHI:	I wanna [: want to] go .

[: text] marks the target/intended form.

Best Guess

*CHI:	I want the birfer [?] .

[?] marks uncertain transcription.

Events and Actions

Paralinguistic Events

Events marked with &= describe non-speech sounds:

*CHI:	&=laughs I want cookie .
*CHI:	&=coughs .

Fillers

Fillers are marked with &-:

*CHI:	&-um I want &-uh cookie .

Interposed Speech (Other Speaker)

Brief background speech from a different speaker is marked with the &*SPK:text prefix, which captures the interjection without creating a full turn line:

*CHI:	I want &*MOT:careful a cookie .

This says CHI was speaking and MOT briefly said “careful” mid-turn. If the intervention is substantial enough to constitute its own turn, transcribe it as a separate *MOT: utterance instead. Model: crates/talkbank-model/src/model/content/other_spoken.rs.

(Note: [^ text] is a freecode: a standalone, free-form researcher annotation that sits as its own content item on the main tier. It is the UtteranceContent::Freecode variant, a sibling of Word and Group, and is NOT attached to any word. See the freecode rule in grammar/grammar.js and crates/talkbank-model/src/model/content/utterance_content/. Use it for transcriber notes that are independent of any single word; for notes about a single word, use [% text] or [= text] instead.)

Pauses

*CHI:	I (.) want (..) a (...) cookie .
*CHI:	I (1.5) want a cookie .

(.): short pause
(..): medium pause
(...): long pause
(N.N): timed pause in seconds
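The four pause forms listed above can be told apart with a small classifier. This is a sketch only: the `Pause` enum and `classify_pause` function are hypothetical names, and the sketch accepts just the `(N.N)` timed form shown here.

```rust
/// The pause forms listed above (hypothetical simplified type).
#[derive(Debug, PartialEq)]
pub enum Pause {
    Short,
    Medium,
    Long,
    /// Timed pause, in seconds.
    Timed(f64),
}

/// Classifies a pause token such as "(.)" or "(1.5)".
/// Returns None for anything that is not one of the four listed forms.
pub fn classify_pause(token: &str) -> Option<Pause> {
    let inner = token.strip_prefix('(')?.strip_suffix(')')?;
    match inner {
        "." => Some(Pause::Short),
        ".." => Some(Pause::Medium),
        "..." => Some(Pause::Long),
        _ => {
            // Timed form: digits, a dot, then digits.
            let (whole, frac) = inner.split_once('.')?;
            let all_digits = |s: &str| !s.is_empty() && s.bytes().all(|b| b.is_ascii_digit());
            if all_digits(whole) && all_digits(frac) {
                inner.parse().ok().map(Pause::Timed)
            } else {
                None
            }
        }
    }
}
```

A shortening such as `(be)` is not a pause, so `classify_pause("(be)")` returns `None`.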

Overlap

Overlapping speech between speakers uses angle brackets and overlap markers:

*MOT:	do you want <a cookie> [>] ?
*CHI:	<cookie> [<] !

[>]: follows the overlap (this speaker started first)
[<]: overlaps the previous speaker

Retrace and Repetition

Groups followed by retrace markers indicate speech disfluencies:

*CHI:	<I want> [/] I want a cookie .
*CHI:	<I want> [//] I need a cookie .
*CHI:	<I want a> [///] give me a cookie .

[/]: partial retrace (speaker repeats the same words)
[//]: full retrace (speaker restarts with different words)
[///]: multiple retracing (multiple false starts)
[/-]: reformulation (speaker rephrases with different structure)

The CHAT Word

Status: Current Last modified: 2026-07-15 15:53 EDT

“Word” is the most complex and most misunderstood concept in CHAT. This chapter documents what a word actually is, how the grammar parses it, and how the Rust model represents it. If you maintain this codebase, you will encounter word-level bugs. This chapter exists so you can understand them.

The Fundamental Rule

Whitespace delimits words. Contiguous non-whitespace characters form one word token. This applies everywhere on the main tier.

*CHI:   hello world .
        ^^^^^              word: "hello"
              ^^^^^        word: "world"

The grammar uses extras: $ => [] – no implicit whitespace. Whitespace nodes (whitespaces, space) are explicit in the CST. Tree-sitter does not skip whitespace between tokens. This is the foundation of every tokenization decision in the grammar.

There are no exceptions to this rule. Every ambiguity described in this chapter is resolved by applying this rule consistently.
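As an illustration of the rule (not of the tree-sitter grammar itself), the following sketch splits main-tier content into whitespace-delimited tokens and records each token's byte offset. The function name `word_spans` is hypothetical, and the input is assumed to be the content after the speaker prefix.

```rust
/// Splits content into contiguous non-whitespace tokens,
/// returning each token with its starting byte offset.
pub fn word_spans(content: &str) -> Vec<(usize, &str)> {
    let mut spans = Vec::new();
    let mut start: Option<usize> = None;
    for (i, ch) in content.char_indices() {
        if ch.is_whitespace() {
            // Whitespace closes the token in progress, if any.
            if let Some(s) = start.take() {
                spans.push((s, &content[s..i]));
            }
        } else if start.is_none() {
            start = Some(i);
        }
    }
    // Flush a token that runs to the end of the input.
    if let Some(s) = start {
        spans.push((s, &content[s..]));
    }
    spans
}
```

For `hello world .` this yields three tokens. Deciding that the last one is a terminator rather than a word is the grammar's job; the whitespace rule only fixes the token boundaries.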

Word Structure

A word in the grammar is standalone_word – a sequence of an optional prefix, a required body, optional suffixes, and an optional POS tag.

The following diagram shows the full decomposition. All named nodes are separate CST children.

flowchart TD
    sw["standalone_word\n(grammar.js, prec.right 6)"]

    zero["zero\n'0' -- omission prefix"]
    wp["word_prefix\n'&amp;-' filler | '&amp;~' nonword | '&amp;+' fragment"]

    wb["word_body\n(required)"]
    fm["form_marker\n@b, @c, @d, @z:label, ..."]
    wls["word_lang_suffix\n@s, @s:eng, @s:eng+fra"]
    pos["pos_tag\n$n, $v, $adj, ..."]

    sw -->|"optional prefix"| zero & wp
    sw -->|"required"| wb
    sw -->|"optional suffix"| fm & wls
    sw -->|"optional"| pos

    ws["word_segment\npure spoken text"]
    short["shortening\n'(text)' omitted sound"]
    sm["stress_marker\nprimary or secondary"]
    len["lengthening\n':' one or more colons"]
    op["overlap_point\none of four brackets"]
    cae["ca_element\nsingle CA marker"]
    cad["ca_delimiter\npaired CA marker"]
    ub["underline_begin\ncontrol char pair"]
    ue["underline_end\ncontrol char pair"]
    cm["'+'\ncompound marker"]

    wb -->|"children (any order)"| ws & short & sm & len & op & cae & cad & ub & ue & cm

In the grammar (search grammar/grammar.js for the standalone_word and word_body rules), the structure is:

standalone_word: $ => prec.right(6, seq(
  optional(choice($.word_prefix, $.zero)),
  $.word_body,
  optional($.form_marker),
  optional($.word_lang_suffix),
  optional($.pos_tag),
)),

word_body: $ => prec.right(choice(
  seq(
    choice($.word_segment, $.shortening, $.stress_marker),
    repeat(choice($.word_segment, $.shortening, $.stress_marker, $._word_marker)),
  ),
  seq(
    choice($.overlap_point, $.ca_element, $.ca_delimiter, $.underline_begin),
    choice($.word_segment, $.shortening, $.stress_marker),
    repeat(choice($.word_segment, $.shortening, $.stress_marker, $._word_marker)),
  ),
)),

word_body has two branches:

Standard start: the word begins with word_segment, shortening, or stress_marker, followed by any number of body children.
Marker-initial: the word begins with a structural marker (overlap, CA, underline), but that marker must be immediately followed by text content. This prevents degenerate words like a standalone overlap marker from forming a valid standalone_word.

Lengthening and + (compound marker) are excluded from starting a word body. This is how standalone : falls through to separator(colon); see ambiguity 4 (Colon) in The Six Tokenization Ambiguities below.
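The two branches can be restated as a predicate over the sequence of child kinds. This is a sketch under assumptions: the `Child` enum is a hypothetical flattening of the CST node kinds, and `_word_marker` is assumed to cover every kind that is not a segment, shortening, or stress marker.

```rust
/// Kinds of word_body children (hypothetical flattening of the CST nodes).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Child {
    Segment,
    Shortening,
    Stress,
    Lengthening,
    Compound,
    OverlapPoint,
    CaElement,
    CaDelimiter,
    UnderlineBegin,
    UnderlineEnd,
}

/// Kinds that may start the standard branch.
fn is_text_like(c: Child) -> bool {
    matches!(c, Child::Segment | Child::Shortening | Child::Stress)
}

/// Structural markers that may start the marker-initial branch.
fn may_lead(c: Child) -> bool {
    matches!(
        c,
        Child::OverlapPoint | Child::CaElement | Child::CaDelimiter | Child::UnderlineBegin
    )
}

/// True when the children match one of the two word_body branches.
pub fn is_valid_word_body(children: &[Child]) -> bool {
    match children {
        // Standard start: text-like first child, then anything.
        [first, ..] if is_text_like(*first) => true,
        // Marker-initial: the marker must be followed immediately by text.
        [marker, second, ..] if may_lead(*marker) && is_text_like(*second) => true,
        _ => false,
    }
}
```

A lone `Lengthening` or a lone `OverlapPoint` fails both branches, which matches the behavior described above for standalone `:` and standalone overlap markers.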

The word_segment Purity Invariant

word_segment contains ONLY pure spoken text. All structural markers are separate typed children in word_body, never consumed by word_segment.

This is a hard invariant with three consequences:

cleaned_text() never scans for markers. It concatenates Text, Phonetic, and Shortening elements. No stripping is needed.
Validation finds ALL markers by type. Overlap markers, CA elements, and underline pairs are always WordContent variants, regardless of position within the word.
Editors get typed CST nodes. Syntax highlighting, bracket matching, and hover info work on individual markers, not opaque substrings.

How it works

word_segment is a DFA token at prec(5) with a regex that excludes all structural characters. The exclusion sets are generated from the symbol registry into grammar/src/generated_symbol_sets.js and are never hand-written.

word_segment: $ => token(prec(5, seq(
  WORD_SEGMENT_FIRST_RE,    // generated: excludes structural chars + '0' at start
  WORD_SEGMENT_REST_RE,     // generated: excludes structural chars
))),

Full exclusion table

Every character in this table is excluded from word_segment and becomes a separate typed node in the CST.

Category	Characters	CST node type
Overlap markers	`⌈` `⌉` `⌊` `⌋`	`overlap_point`
CA elements	`↑` `↓` `≠` `∾` `⁑` `⤇` `∙` `Ἡ` `↻` `⤆`	`ca_element`
CA delimiters	`∆` `∇` `°` `▁` `▔` `☺` `♋` `⁇` `∬` `Ϋ` `∮` `↫` `⁎` `◉` `§`	`ca_delimiter`
Stress markers	`ˈ` `ˌ`	`stress_marker`
Colons	`:`	`lengthening`
Underline markers	`\x02\x01`, `\x02\x02`	`underline_begin` / `underline_end`
Brackets	`[` `]` `<` `>` `(` `)` `{` `}`	structural (annotations, groups)
Punctuation	`.` `!` `?` `,` `;` `+`	terminators, separators, compound
CHAT prefixes	`@` `$` `&` `*` `%`	headers, events, speakers
Intonation contours	`⇗` `↗` `→` `↘` `⇘` `≈` `≋` `∞` `≡`	content-level markers
Group delimiters	`‹` `›` `“` `”` `〔` `〕`	pho/sin groups, quotes
Control chars	`\x01`-`\x08`, `\x15`	bullets, underline

First-character-only exclusion: 0 is excluded from the first character of word_segment (it is the omission prefix). 0 in non-initial positions is valid: 200, h0me, abc0 all parse correctly.
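A rough sketch of the first-character rule on a single whitespace-delimited token. The function `split_omission_prefix` is hypothetical; in the real grammar this decision is made by the `zero` and `word_segment` tokens, not by string inspection.

```rust
/// Splits a leading omission prefix `0` from a token.
/// Returns (has_prefix, body). A bare "0" has no body, so it is
/// returned unchanged and left for the caller to treat as a standalone token.
pub fn split_omission_prefix(token: &str) -> (bool, &str) {
    match token.strip_prefix('0') {
        Some(rest) if !rest.is_empty() => (true, rest),
        _ => (false, token),
    }
}
```

Only the first character is inspected, so `200`, `h0me`, and `abc0` pass through with no prefix detected.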

The Rust Data Model

Word struct

The Word struct (crates/talkbank-model/src/model/content/word/word_type.rs) is the canonical typed representation:

pub struct Word {
    pub span: Span,
    pub word_id: Option<SmolStr>,
    pub(crate) raw_text: SmolStr,
    pub content: WordContents,
    pub category: Option<WordCategory>,
    pub form_type: Option<FormType>,
    pub lang: Option<WordLanguageMarker>,
    pub part_of_speech: Option<SmolStr>,
    pub inline_bullet: Option<Bullet>,
}

Key fields:

raw_text: the exact text from the input, including all markers. Used for roundtrip serialization.
content: a WordContents (SmallVec-backed sequence of WordContent elements). This is the structured decomposition. Most words have 1-2 elements; SmallVec avoids heap allocation for the common case.
category: optional prefix (Omission, CAOmission, Filler, Nonword, PhonologicalFragment).
form_type: optional @ suffix (@c child-invented, @d dialect, @z:label user-defined, etc.). The user-defined form requires the colon and a label (@z:label); a colon-less marker such as @zzz is not a valid form and is rejected with E203 (matching CLAN CHECK 147).
lang: optional @s language marker (Shortcut, Explicit, Multiple, Ambiguous).
part_of_speech: optional $ tag.
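The colon requirement for user-defined forms can be sketched as follows. `FormSketch` and `parse_form_marker` are hypothetical names, and the list of known markers is copied from the form-marker table in the Symbols chapter rather than from the `FormType` source. A `None` result stands for a rejected marker, which the validator reports as E203 in the `@zzz` case.

```rust
/// Simplified stand-in for FormType (hypothetical).
#[derive(Debug, PartialEq)]
pub enum FormSketch {
    /// One of the fixed markers, e.g. "c" for @c.
    Known(String),
    /// @z:label, carrying the user's label.
    UserDefined(String),
}

/// Fixed markers, as listed in the Symbols chapter.
const KNOWN: &[&str] = &[
    "a", "b", "c", "d", "f", "fp", "g", "i", "k", "l", "ls", "n", "o", "p", "q", "sas", "si",
    "sl", "t", "u", "wp", "x",
];

/// Parses a form-marker suffix such as "@c" or "@z:label".
pub fn parse_form_marker(suffix: &str) -> Option<FormSketch> {
    let body = suffix.strip_prefix('@')?;
    if let Some(label) = body.strip_prefix("z:") {
        // The user-defined form requires a non-empty label after the colon.
        return if label.is_empty() {
            None
        } else {
            Some(FormSketch::UserDefined(label.to_string()))
        };
    }
    if KNOWN.contains(&body) {
        Some(FormSketch::Known(body.to_string()))
    } else {
        None
    }
}
```

`@zzz` has no colon and is not a fixed marker, so it is rejected. The `@s` language marker is a separate construct and is deliberately not handled here.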

WordContent enum

WordContent (crates/talkbank-model/src/model/content/word/content.rs) is the enum of everything that can appear inside a word body. Each variant maps directly to a grammar node.

Grammar node	`WordContent` variant	Rust type	Example
`word_segment`	`Text`	`WordText(NonEmptyString)`	`hello`, `want`
`word_segment` in a `@u` word	`Phonetic`	`WordPhonetic(NonEmptyString)`	`rɛmbə˞` in `rɛmbə˞@u`
`shortening`	`Shortening`	`WordShortening(NonEmptyString)`	`(be)` in `(be)cause`
`overlap_point`	`OverlapPoint`	`OverlapPoint`	`⌈`, `⌉2`
`ca_element`	`CAElement`	`CAElement`	`↑`, `↓`
`ca_delimiter`	`CADelimiter`	`CADelimiter`	`∆`, `°`
`stress_marker`	`StressMarker`	`WordStressMarker`	`ˈ` primary, `ˌ` secondary
`lengthening`	`Lengthening`	`WordLengthening { count: u8 }`	`:` = 1, `::` = 2, `:::` = 3
(caret in word)	`SyllablePause`	`WordSyllablePause`	`^` in `o^ver`
`underline_begin`	`UnderlineBegin`	`UnderlineMarker`	`\x02\x01`
`underline_end`	`UnderlineEnd`	`UnderlineMarker`	`\x02\x02`
`+` (compound)	`CompoundMarker`	`WordCompoundMarker`	`+` in `ice+cream`
`~` (clitic boundary)	`CliticBoundary`	`WordCliticBoundary`	`~` in `le~ha`

cleaned_text()

Word::cleaned_text() derives NLP-ready text from content by concatenating only Text, Phonetic, and Shortening variants:

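pub fn compute_cleaned_text(&self) -> String {
    let mut result = String::new();
    for item in &self.content {
        match item {
            // Spoken content: orthographic text, @u phonetic bodies, shortenings.
            WordContent::Text(t) => result.push_str(t.as_ref()),
            WordContent::Phonetic(p) => result.push_str(p.as_ref()),
            WordContent::Shortening(s) => result.push_str(s.as_ref()),
            // Every other variant is a structural marker and contributes nothing.
            _ => {}
        }
    }
    result
}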

This works because the purity invariant guarantees that Text elements never contain structural markers. There is nothing to strip.

`@u` phonetic forms are typed phonetic content

A @u word is a phonetic transcription (historically UNIBET, now usually IPA) standing in a word slot. It is used when the orthographic word is unknown, unintelligible, or a paraphasia, and is frequently the spoken side of a [: target] replacement in aphasia data (rɛmbə˞@u [: remember]). Because its content obeys phonetic conventions rather than orthographic word conventions, the parsers fold a @u word’s body into a single WordContent::Phonetic(WordPhonetic) node instead of Text. This makes “orthographic rules apply to orthographic words only” a property of the model: word-hygiene rules structurally cannot reach phonetic content. The phonetic string itself is deliberately lenient (IPA, ASCII UNIBET, and X-SAMPA all pass; only non-emptiness is enforced), matching the %pho tier’s PhoWord stance. In to-json output the node appears as {"type": "phonetic", "content": "..."}, and cleaned_text remains the phonetic string verbatim. The scope is @u only: sibling special forms (@b, @o, …) remain orthographic words.

Examples:

Input	`content`	`cleaned_text()`
`hello`	`[Text("hello")]`	`hello`
`(be)cause`	`[Shortening("be"), Text("cause")]`	`because`
`no::`	`[Text("no"), Lengthening(2)]`	`no`
`ice+cream`	`[Text("ice"), CompoundMarker, Text("cream")]`	`icecream`
`le~ha`	`[Text("le"), CliticBoundary, Text("ha")]`	`leha`
`ja^ja`	`[Text("ja"), SyllablePause, Text("ja")]`	`jaja`
`he↑llo`	`[Text("he"), CAElement(PitchUp), Text("llo")]`	`hello`
`°soft°`	`[CADelimiter(Softer), Text("soft"), CADelimiter(Softer)]`	`soft`
`ˈhello`	`[StressMarker(Primary), Text("hello")]`	`hello`
`⌈hello⌉`	`[OverlapPoint(TopBegin), Text("hello"), OverlapPoint(TopEnd)]`	`hello`

The result is cached via OnceLock on first access.
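The concatenate-and-cache behavior can be reproduced in a self-contained sketch. `Piece` and `WordSketch` are simplified, hypothetical stand-ins for `WordContent` and `Word`; only a few marker variants are modeled, and the field layout is an assumption. `std::sync::OnceLock` is the standard-library type named above.

```rust
use std::sync::OnceLock;

/// Simplified stand-in for WordContent (hypothetical subset of variants).
#[derive(Debug, Clone, PartialEq)]
pub enum Piece {
    Text(String),
    Phonetic(String),
    Shortening(String),
    Lengthening(u8),
    CompoundMarker,
    StressMarker,
}

/// Simplified stand-in for Word with a lazily computed cleaned text.
pub struct WordSketch {
    content: Vec<Piece>,
    cleaned: OnceLock<String>,
}

impl WordSketch {
    pub fn new(content: Vec<Piece>) -> Self {
        Self { content, cleaned: OnceLock::new() }
    }

    /// Concatenates spoken pieces on first access and caches the result.
    pub fn cleaned_text(&self) -> &str {
        self.cleaned
            .get_or_init(|| {
                let mut out = String::new();
                for piece in &self.content {
                    match piece {
                        Piece::Text(s) | Piece::Phonetic(s) | Piece::Shortening(s) => {
                            out.push_str(s)
                        }
                        // Markers carry no spoken text.
                        _ => {}
                    }
                }
                out
            })
            .as_str()
    }
}
```

Because markers are separate variants, the loop never inspects characters; a second call to `cleaned_text()` returns the cached string without recomputing it.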

What is included in cleaned_text vs what is stripped

The following table is the complete inventory of how every word-internal element contributes to (or is excluded from) cleaned_text(). This must match what NLP pipelines (Stanza, etc.) expect as input.

`WordContent` variant	Character(s)	In `cleaned_text`?	Rationale
`Text`	spoken text	YES	The actual word
`Phonetic`	body of a `@u` word	YES	Phonetic transcription standing in a word slot, kept verbatim
`Shortening`	`(be)`	YES	Shortened form is still spoken
`CompoundMarker`	`+`	No	Structural boundary, not spoken
`CliticBoundary`	`~`	No	Morphological boundary, not spoken
`SyllablePause`	`^`	No	Pause between syllables, not spoken
`Lengthening`	`:` `::` `:::`	No	Prosodic marker, not spoken
`StressMarker`	`ˈ` `ˌ`	No	Prosodic marker, not spoken
`OverlapPoint`	`⌈` `⌉` `⌊` `⌋`	No	Timing marker, not spoken
`CAElement`	`↑` `↓` `≠` `∾` `⁑` `⤇` `∙` `Ἡ` `↻` `⤆`	No	Prosodic annotation
`CADelimiter`	`∆` `∇` `°` `▁` `▔` `☺` `♋` `⁇` `∬` `Ϋ` `∮` `↫` `⁎` `◉` `§`	No	Voice quality annotation
`UnderlineBegin`	`\x02\x01`	No	Formatting marker
`UnderlineEnd`	`\x02\x02`	No	Formatting marker

Characters that stay in word_segment (ARE spoken text):

Letters (all Unicode)
Digits (in non-initial position; 0 in initial = omission prefix)
Hyphen (-): part of word text, e.g., ice-cream, self-conscious
Apostrophe ('): contractions, e.g., don't, it's
Hash (#): appears in some transcription conventions
Underscore (_): compound boundary in some conventions

Characters NOT in word_segment (excluded by symbol registry): See the full exclusion table in Precedence Decisions in the grammar docs.

Comparison with batchalign2

batchalign2’s annotation_clean() (60 lines of .replace() calls) strips all the same characters that our grammar excludes from word_segment. Key differences:

Parentheses: ba2 commented out the strip. We handle them as Shortening, so the content inside the parentheses is included in cleaned_text.
IPA characters (ạ ā ʔ ʕ ʰ): ba2 incorrectly strips them. We correctly keep them; they are real phonetic content.
Hyphen (-): ba2 strips it. We keep it in word_segment because hyphen is a valid word character (contractions, compounds, morphological suffixes in %mor tier).

Our design eliminates the need for character-by-character stripping entirely. cleaned_text() is a simple concatenation of Text, Phonetic, and Shortening elements, with zero scanning.

The Six Tokenization Ambiguities

CHAT was designed for human readability, not machine parsing. Six characters have context-dependent meanings that the grammar must disambiguate. Full details with proof grammars are in grammar/docs/tokenization-rules.md and grammar/docs/precedence-decisions.md. What follows is a summary for orientation.

1. Overlap markers (⌈⌉⌊⌋)

Adjacent to text = part of the word. Space-separated = standalone overlap_point. This adjacency rule is a deliberate approximation of an ideal (edge markers top-level, interior markers in-word) whose full history, feasibility analysis, and open implementation decision are documented in Overlap Marker Binding.

Yeah⌋⌈2 hey      ONE word: "Yeah⌋⌈2"
Yeah ⌋ ⌈2 hey    three tokens: "Yeah", ⌋, ⌈2

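word_segment excludes the overlap characters, so a marker adjacent to text becomes an overlap_point child of the same word (the purity gate tests assert that butt⌈er⌉ yields word_segment, overlap_point, word_segment, overlap_point). A marker is a standalone overlap_point content item only when it is space-separated on both sides.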

2. Zero/omission prefix (0)

Adjacent to word body = omission prefix. Space-separated = action marker.

0die              ONE word: standalone_word(zero, word_body("die"))
0 die             TWO tokens: nonword(zero), word("die")

standalone_word at prec.right(6) beats nonword at prec(1). The extras: [] setting prevents whitespace from being skipped between zero and word_body. The zero token is inlined directly into standalone_word (not through word_prefix) because tree-sitter’s precedence does not propagate through intermediate rules. This was proven empirically with a minimal test grammar – see grammar/docs/precedence-decisions.md.

3. CA parenthetical vs shortening

In CA mode (@Options: CA), a fully parenthesized word (word) is an uncertain/omitted word (CAOmission), semantically equivalent to 0word. Partially parenthesized hel(lo) is always a shortening.

@Options: CA
*CHI:   (ja) .            CAOmission: uncertain "ja"
*CHI:   hel(lo) .         Shortening: "(lo)" is the shortened part

Distinguishing these requires file-level context (@Options header). The parser sets WordCategory::CAOmission when the word is fully parenthesized in CA mode. Isolated parser.parse_word_fragment() calls cannot determine CA mode – they need a FragmentSemanticContext.
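The dependence on file-level context can be made concrete with a sketch that takes CA mode as an explicit parameter. `ParenKind` and `classify_parens` are hypothetical names, and treating a fully parenthesized word outside CA mode as a shortening is an assumption of this sketch, not something stated above.

```rust
/// How a word's parentheses are interpreted (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum ParenKind {
    /// Fully parenthesized word in CA mode.
    CaOmission,
    /// Parentheses mark an omitted portion of the word.
    Shortening,
    /// No parentheses at all.
    Plain,
}

/// Classifies a single word token. `ca_mode` reflects `@Options: CA`.
pub fn classify_parens(word: &str, ca_mode: bool) -> ParenKind {
    // "(ja)" is fully wrapped: one pair of parentheses around the whole word.
    let inner = word.strip_prefix('(').and_then(|w| w.strip_suffix(')'));
    let fully_wrapped = matches!(
        inner,
        Some(i) if !i.is_empty() && !i.contains(|c: char| c == '(' || c == ')')
    );

    if fully_wrapped && ca_mode {
        ParenKind::CaOmission
    } else if word.contains('(') {
        ParenKind::Shortening
    } else {
        ParenKind::Plain
    }
}
```

The same token `(ja)` gives different answers depending on `ca_mode`, which is why an isolated fragment parse cannot decide without being told the context.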

4. Colon – lengthening vs separator

Inside a word (after text): prosodic lengthening. Standalone: separator.

no::              ONE word: Text("no") + Lengthening(2)
hello : world     separator(colon)

The DFA always produces lengthening for : (higher precedence). But word_body rejects lengthening as a first element, so standalone : cannot form a valid word and falls through to separator(colon). This is the “constrain the parser, not the DFA” pattern.
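The effect of the pattern on a single token can be mimicked with a small splitter: colons after text become lengthening counts, while a token that starts with a colon is refused so the caller can treat it as a separator. `Part` and `split_lengthening` are hypothetical names; the real decision is made by the grammar, not by string scanning.

```rust
/// Pieces of a word that contains only text and colons (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Part {
    Text(String),
    Lengthening(u8),
}

/// Splits a token into text runs and colon runs.
/// Returns None when the token is empty or starts with ':',
/// mirroring the rule that lengthening cannot start a word body.
pub fn split_lengthening(token: &str) -> Option<Vec<Part>> {
    if token.is_empty() || token.starts_with(':') {
        return None;
    }
    let mut parts = Vec::new();
    let mut chars = token.chars().peekable();
    while let Some(&c) = chars.peek() {
        if c == ':' {
            // Count a run of consecutive colons.
            let mut count: u8 = 0;
            while chars.peek() == Some(&':') {
                chars.next();
                count = count.saturating_add(1);
            }
            parts.push(Part::Lengthening(count));
        } else {
            // Collect text up to the next colon.
            let mut text = String::new();
            while let Some(&t) = chars.peek() {
                if t == ':' {
                    break;
                }
                text.push(t);
                chars.next();
            }
            parts.push(Part::Text(text));
        }
    }
    Some(parts)
}
```

`no::` splits into text plus a lengthening of 2, while a bare `:` is refused and falls to whatever handles separators.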

5. Plus (+) – compound vs terminator vs linker

Inside a word: compound marker. At line end: terminator prefix. At line start: linker prefix.

ice+cream         ONE word with compound marker
and then +...     terminator: trailing_off (prec 10 beats prec 5)
+< but I +/.      linker: lazy_overlap, terminator: interruption

Terminators and linkers use prec(10), which beats word_segment at prec(5). No valid CHAT word ends with + – the grammar enforces this by structure.

6. Bracket annotations vs plain brackets

Bracket annotations ([= text], [=! text], [% text]) use prec(8) prefix tokens to beat generic bracket handling.

Historical Context: The Coarsening and Its Reversal

The original structured grammar (pre-coarsening)

The original grammar (preserved in grammar/docs/pre-coarsening-grammar.js.reference) had all word-internal markers as children of word_content:

// Pre-coarsening: every marker was a child of word_content
word_content: $ => choice(
  $.word_segment,
  $.shortening,
  $.stress,
  $.colon,
  $.caret,
  $.tilde,
  $.plus,
  $.overlap_point,
  $.ca_element,
  $.ca_delimiter,
  $.underline_begin,
  $.underline_end,
),

The coarsening decision

At one point, standalone_word was coarsened into an opaque token – a single DFA regex that consumed the entire word as one undifferentiated string. The rationale was:

Simpler grammar with fewer tree-sitter conflicts.
A Chumsky-based direct parser in Rust would re-parse the opaque token into structured WordContent elements.

This worked but had costs:

Two parsers (tree-sitter + Chumsky) with independent bugs.
Validation could not find structural markers without re-parsing.
Editors got one opaque node instead of typed children.
cleaned_text() had to scan for and strip marker characters.

The reversal (Chumsky elimination)

When the Chumsky direct parser was eliminated (making tree-sitter the sole parser), the structured word grammar was restored. The key decisions:

All marker characters were re-excluded from word_segment using the symbol registry as the single source of truth.
Each marker type became a separate CST child in word_body.
The WordContent enum in the Rust model was aligned 1:1 with the grammar nodes.
The word_segment purity invariant was established as a TDD gate.

The result is one parser, one source of truth for exclusions, and typed markers from grammar through model.

Testing: The word_segment Purity Gate

The purity invariant (each structural marker produces a separate CST child rather than being consumed by word_segment) is enforced by a group of tree-sitter corpus tests under grammar/test/corpus/word/. Each *_in_word_lint.txt file embeds a structural marker inside a word and asserts that the CST splits the word appropriately:

Test file	Input	Asserts
`overlap_in_word_lint.txt`	`butt⌈er⌉`	`word_segment`, `overlap_point`, `word_segment`, `overlap_point`
`ca_element_in_word_lint.txt`	CA element inside a word	`word_segment`, `ca_element`, `word_segment`
`ca_delimiter_in_word_lint.txt`	CA delimiter pair around a word	`ca_delimiter`, `word_segment`, `ca_delimiter`
`lengthening.txt`, `lengthening_between_segments.txt`	`no::`, etc.	`word_segment`, `lengthening`
`stacked_ca_markers.txt`	Multiple adjacent CA markers in one word	Each marker is its own CST child

Underline and stress invariants are covered by corpus tests elsewhere in grammar/test/corpus/ and by the parser-equivalence tests in crates/talkbank-parser-tests/. The historical word_segment_purity.txt consolidated 8 named tests in one file. It was retired in commit fdceeac2, when the corresponding constructs were given their own per-construct test files; this per-construct layout is what the current spec generators produce from the spec sources.

How to add a new purity-style test

If you add a new structural marker to the grammar:

Add its characters to the symbol registry (spec/symbols/symbol_registry.json).
Run just symbols-gen to regenerate the exclusion sets.
Add a spec in spec/constructs/ that embeds the marker inside a word; regenerate the affected grammar/parser fixtures with the current spec/tools commands from Spec Workflow so a per-construct test fixture is created in grammar/test/corpus/word/. Verify the CST output names each marker as its own child.

Run the full verification sequence:

cd grammar && tree-sitter generate && tree-sitter test
cargo nextest run -p talkbank-parser
cargo nextest run -p talkbank-parser-tests -E 'test(parser_equivalence)'
cargo nextest run -p talkbank-parser-tests --test roundtrip_reference_corpus

Key Source Files

File	What it defines
`grammar/grammar.js`	search for `standalone_word`, `word_body`, `word_segment`, `_word_marker`
`grammar/src/generated_symbol_sets.js`	Character exclusion sets (generated, do not edit)
`grammar/test/corpus/word/*_in_word_lint.txt`, `lengthening.txt`, `stacked_ca_markers.txt`	Per-construct purity-invariant gate tests (replaced the consolidated `word_segment_purity.txt` retired in `fdceeac2`)
`grammar/docs/tokenization-rules.md`	The 6 tokenization ambiguities with full examples
`grammar/docs/precedence-decisions.md`	Precedence proofs (zero, colon, purity invariant)
`grammar/docs/pre-coarsening-grammar.js.reference`	Historical: the grammar before coarsening
`crates/talkbank-model/src/model/content/word/word_type.rs`	`Word` struct
`crates/talkbank-model/src/model/content/word/content.rs`	`WordContent` enum (13 variants)
`crates/talkbank-model/src/model/content/word/word_contents.rs`	`WordContents` (SmallVec-backed sequence)
`crates/talkbank-model/src/model/content/word/category.rs`	`WordCategory` enum (5 variants)
`crates/talkbank-model/src/model/content/word/form.rs`	`FormType` enum (23 variants)
`crates/talkbank-model/src/model/content/word/language.rs`	`WordLanguageMarker` enum (4 variants)

Symbols

Status: Reference Last modified: 2026-05-29 18:43 EDT

CHAT uses a rich set of symbols for transcription conventions. This page documents the symbol categories and the symbol registry that drives both the grammar and the Rust crates. The symbol registry (spec/symbols/symbol_registry.json) is the source of truth: when this page and the registry disagree, the registry wins.

Symbol Registry

The authoritative symbol definitions live in spec/symbols/symbol_registry.json. This JSON file is the single source of truth and generates:

Character sets for the tree-sitter grammar (grammar.js)
Rust constants for the model and validation crates
Validation rules for the spec tool

After any change to the symbol registry, run:

just symbols-gen

Symbol Categories

Terminators

Punctuation that ends an utterance:

Symbol	Name	Usage
`.`	Period	Declarative
`?`	Question	Interrogative
`!`	Exclamation	Exclamatory
`+...`	Trailing off	Incomplete utterance
`+..?`	Trailing-off question	Question trails off
`+/.`	Interruption	Speaker interrupted by another
`+//.`	Self-interruption	Speaker interrupts self
`+/?`	Interrupted question	Question interrupted
`+!?`	Broken question	Exclamation-question
`+"/.`	Quoted new line	Quotation continues on next line
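For tooling that needs to label terminators, the table above translates directly into a lookup. The function name `terminator_name` is hypothetical; the symbols and names are taken from the table.

```rust
/// Maps a terminator token to its name from the table above.
/// Returns None for anything that is not a listed terminator.
pub fn terminator_name(token: &str) -> Option<&'static str> {
    match token {
        "." => Some("Period"),
        "?" => Some("Question"),
        "!" => Some("Exclamation"),
        "+..." => Some("Trailing off"),
        "+..?" => Some("Trailing-off question"),
        "+/." => Some("Interruption"),
        "+//." => Some("Self-interruption"),
        "+/?" => Some("Interrupted question"),
        "+!?" => Some("Broken question"),
        "+\"/." => Some("Quoted new line"),
        _ => None,
    }
}
```

Matching on the whole token keeps multi-character terminators such as `+//.` from being mistaken for a shorter one such as `.`.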

CA (Conversation Analysis) Symbols

CA notation symbols fall into three parser-distinct categories in spec/symbols/symbol_registry.json. They are not interchangeable: the grammar treats them as different node kinds.

CA element symbols (ca_element_symbols) attach to a word, so book↑ is a single token whose content carries the symbol:

Symbol	Meaning
`↑`	Rising pitch (attaches to a word)
`↓`	Falling pitch (attaches to a word)
`∙`	Micropause
`≠`	Inhalation marker
`⁑` `↻` `∾` `⤆` `⤇` `Ἡ`	Other CA element symbols

CA arrow separators (in word_segment_forbidden_start_symbols) are own-node separators between words, not word-attachments. The parser splits them as their own nodes:

Symbol	Meaning
`→`	Level pitch contour
`↗`	Rising-to-mid contour
`↘`	Falling-to-mid contour
`⇗`	Rising-to-high contour
`⇘`	Falling-to-low contour
`↖` `↙` `←`	Other CA arrow separators

CA delimiter symbols (ca_delimiter_symbols) bracket annotated prosodic regions:

Symbol	Meaning
`°`	Quiet speech
`∆` `∇`	Higher / lower pitch register
`∬` `∮`	Other prosodic-region delimiters
`▁` `▔`	Low / high prosodic-region delimiters
`⁇` `§` `⁎` `↫` `◉` `☺` `♋` `Ϋ`	Additional registered CA delimiters

Confirm the current contents of each category by reading spec/symbols/symbol_registry.json directly; that is the file from which just symbols-gen derives the grammar and Rust constants.

Word Segment Characters

The registry defines sets of characters that are forbidden at the start of a word, forbidden in the rest of a word, or forbidden throughout. These sets define the lexical boundaries of what constitutes a “word” in CHAT.

The grammar uses these sets to construct the word-matching regex patterns. Characters like [, ], <, >, (, ) are structural delimiters and cannot appear inside words.

Event Segment Characters

Characters forbidden in event descriptions (&=event content). Events have slightly different lexical rules than words.

Language Codes

CHAT uses ISO 639-3 three-letter language codes in @Languages headers and @s: word markers:

@Languages:	eng, fra
*CHI:	I want a croissant@s:fra .

Common codes: eng (English), fra (French), deu (German), spa (Spanish), zho (Mandarin), jpn (Japanese).
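A minimal sketch of splitting a word from its `@s` language suffix. `LangMarker` and `split_lang_suffix` are hypothetical names. The sketch handles the bare `@s`, `@s:eng`, and `@s:eng+fra` shapes, assumes codes are three lowercase ASCII letters, and leaves out the ambiguous form entirely.

```rust
/// Simplified stand-in for the word language marker (hypothetical).
#[derive(Debug, PartialEq)]
pub enum LangMarker {
    /// Bare `@s`.
    Shortcut,
    /// `@s:eng`
    Explicit(String),
    /// `@s:eng+fra`
    Multiple(Vec<String>),
}

/// True for a three-letter lowercase code such as "fra".
fn is_code(code: &str) -> bool {
    code.len() == 3 && code.bytes().all(|b| b.is_ascii_lowercase())
}

/// Splits "croissant@s:fra" into ("croissant", Explicit("fra")).
/// Returns None when the token has no well-formed `@s` suffix.
pub fn split_lang_suffix(token: &str) -> Option<(&str, LangMarker)> {
    let (word, suffix) = token.rsplit_once("@s")?;
    if word.is_empty() {
        return None;
    }
    if suffix.is_empty() {
        return Some((word, LangMarker::Shortcut));
    }
    // Anything after `@s` must be a colon followed by '+'-joined codes.
    let codes: Vec<&str> = suffix.strip_prefix(':')?.split('+').collect();
    if !codes.iter().all(|c| is_code(c)) {
        return None;
    }
    let marker = match codes.as_slice() {
        [single] => LangMarker::Explicit(single.to_string()),
        many => LangMarker::Multiple(many.iter().map(|c| c.to_string()).collect()),
    };
    Some((word, marker))
}
```

Form markers that merely begin with `s`, such as `@si`, are not language suffixes and yield `None`.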

Special Markers

@ Markers (Word-Level)

The authoritative form-marker set is FormType in crates/talkbank-model/src/model/content/word/form.rs. Current variants:

Marker	Meaning
`@a`	Addition
`@b`	Babbling
`@c`	Child-invented form
`@d`	Dialect form
`@f`	Family-specific form
`@fp`	Filled pause (deprecated, use `&-um` etc.)
`@g`	General special form
`@i`	Interjection
`@k`	Multiple letters
`@l`	Single letter
`@ls`	Letter plural
`@n`	Neologism
`@o`	Onomatopoeia
`@p`	Phonologically consistent form
`@q`	Metalinguistic reference
`@sas`	Sign and speech
`@si`	Singing
`@sl`	Signed language
`@t`	Test word
`@u`	Unibet transcription
`@wp`	Word play
`@x`	Complex / excluded
`@z:<label>`	User-defined special form (carries an arbitrary label)

The second-language qualifier @s:LANG is a separate construct (see the L2 morphotag section of the Batchalign book); it is not part of FormType.

& Markers (Events and Fillers)

Prefix	Meaning
`&=`	Paralinguistic event (e.g., `&=laughs`)
`&-`	Filler (e.g., `&-um`)
`&+`	Phonological fragment (e.g., `&+sh`)
`&~`	Nonword (e.g., `&~mama`)
`&*`	Other speaker’s speech event (e.g., `&*MOT:word`, speech attributed to another speaker)
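The five prefixes in the table can be distinguished by the character after `&`. This is a sketch: `AmpToken` and `classify_amp` are hypothetical names, not the model's types, and the input is assumed to be one whitespace-delimited token.

```rust
/// Token kinds introduced by `&` (hypothetical simplified type).
#[derive(Debug, PartialEq)]
pub enum AmpToken<'a> {
    Event(&'a str),
    Filler(&'a str),
    Fragment(&'a str),
    Nonword(&'a str),
    OtherSpoken { speaker: &'a str, text: &'a str },
}

/// Classifies a token such as "&=laughs" or "&*MOT:careful".
/// Returns None when the token does not use one of the five prefixes.
pub fn classify_amp(token: &str) -> Option<AmpToken<'_>> {
    let rest = token.strip_prefix('&')?;
    let mut chars = rest.chars();
    let sigil = chars.next()?;
    let body = chars.as_str();
    if body.is_empty() {
        return None;
    }
    match sigil {
        '=' => Some(AmpToken::Event(body)),
        '-' => Some(AmpToken::Filler(body)),
        '+' => Some(AmpToken::Fragment(body)),
        '~' => Some(AmpToken::Nonword(body)),
        '*' => {
            // Interposed speech carries a speaker code before the colon.
            let (speaker, text) = body.split_once(':')?;
            Some(AmpToken::OtherSpoken { speaker, text })
        }
        _ => None,
    }
}
```

Ordinary words return `None`, so the classifier can be run over every token of an utterance without a pre-filter.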

Scope Markers

Marker	Meaning
`[/]`	Partial retrace, speaker repeats the same words
`[//]`	Full retrace, speaker restarts with different words
`[///]`	Multiple retracing, multiple false starts
`[/-]`	Reformulation, speaker rephrases with different structure
`[*]`	Error
`[?]`	Best guess
`[>]`	Overlap follows
`[<]`	Overlap precedes
`[= text]`	Explanation
`[: text]`	Replacement

Architecture Overview

Status: Current Last modified: 2026-06-15 15:00 EDT

TalkBank/chatter is the standalone home of the TalkBank CHAT specification, tree-sitter grammar, Rust crates, chatter CLI, LSP server, and desktop app. It is self-contained: the CHAT-format core builds and runs without any external TalkBank repository, so downstream consumers can depend on its crates directly.

Data Flow

Specification is the source of truth. Code is generated downstream from it.

spec/           Source of truth (CHAT specification)
    ↓
grammar.js      Tree-sitter grammar (in grammar/)
    ↓
parser.c        Generated C parser (never hand-edited)
    ↓
Rust crates     Parser → Model → Validation → Transform
    ↓
Applications    chatter CLI, LSP server, desktop app

Two layers

Within this repository, the architecture splits into two layers:

Source-of-truth artifacts. spec/, spec/symbols/, and grammar/ define the CHAT language and generate downstream parser tests, error docs, and shared symbol sets.

Consumer crates and applications. The Rust crates under crates/, the chatter CLI, talkbank-lsp, and the desktop app all consume those source-of-truth artifacts rather than defining CHAT semantics independently.

Crate Dependency Graph

flowchart TD
    derive["talkbank-derive\nProc macros"]
    model["talkbank-model\nData model, validation, alignment, errors"]
    cache["talkbank-cache\nValidation + roundtrip cache"]
    parser["talkbank-parser\nCanonical parser (tree-sitter)"]
    re2c["talkbank-parser-re2c\nAlternate parser (equivalence oracle)"]
    transform["talkbank-transform\nPipelines, CHAT↔JSON, caching"]
    cli["chatter\nCLI: validate, normalize, convert"]
    lsp["talkbank-lsp\nLanguage Server Protocol"]
    s2c["send2clan\nCLAN app bindings"]
    desktop["chatter-desktop\nDesktop validation app (Tauri)"]
    tests["talkbank-parser-tests\nEquivalence tests"]

    derive --> model
    model --> parser & re2c
    parser --> transform
    re2c --> transform
    cache --> transform
    transform --> cli & lsp & desktop
    s2c --> cli & desktop
    parser --> tests
    re2c --> tests

Repository Layout

chatter/
├── grammar/                Tree-sitter grammar
├── spec/                   CHAT specification (source of truth)
│   ├── constructs/         Valid CHAT examples + expected parse trees
│   ├── errors/             Invalid CHAT examples + expected error codes
│   ├── symbols/            Shared symbol registry (JSON)
│   ├── tools/              Core spec generators
│   └── runtime-tools/      Runtime-aware spec bootstrap/validation tools
├── crates/                 Rust crates (model, parser, transform, CLI support, LSP)
├── corpus/                 Reference corpus
├── tests/                  Integration tests and fixtures
├── schema/                 JSON Schema (auto-generated)
├── apps/chatter-desktop/   Desktop validation app (Tauri v2, React)
├── fuzz/                   cargo-fuzz targets (separate workspace)
├── book/                   This documentation
└── docs/                   Strategy, proposals, investigations

Cargo Workspaces

Three separate Cargo workspaces live here:

Root workspace (Cargo.toml): all Rust crates for parsing, model, transform, CLI, and LSP, plus apps/chatter-desktop/src-tauri.
Spec workspace (spec/Cargo.toml): spec/tools for core generation and spec/runtime-tools for runtime-aware spec tooling.
Fuzz workspace (fuzz/Cargo.toml): cargo-fuzz targets for parser and validation robustness checks.

Use the relevant manifest path for the workspace you mean to operate in:

spec/tools/Cargo.toml for generators
spec/runtime-tools/Cargo.toml for bootstrap/mining/runtime validation
fuzz/Cargo.toml for cargo-fuzz targets

Where to read next

For per-topic detail (sections being consolidated; see SUMMARY for the authoritative current list):

Spec System, Grammar, Parser Backends: how CHAT becomes a typed AST.
CHAT model: the AST itself, content traversal, wide-struct rule.
Alignment: tier alignment, DP, sequence alignment.
Errors and validation: diagnostics, validation gates, and parser/model invariants.
Editor/runtime integration: talkbank-lsp and application boundaries layered on top of the CHAT core.
Memory and Ownership, Type-Driven Design (lands during M11 errors-and-validation work).
XML Emitter: projection.

For per-crate summaries see Crate Reference.

Spec System

Status: Current Last modified: 2026-05-29 17:50 EDT

Specifications in spec/ are the authoritative source of truth for the CHAT format. They drive grammar artifact generation, validation/error docs, and targeted test generation.

Historical note: This system was originally shaped during a dual-parser era. The chumsky-based direct parser was removed in March 2026. Today the canonical parser is tree-sitter (talkbank-parser); a second implementation, talkbank-parser-re2c, exists as a specification oracle and high-throughput batch parser. Fragment specs remain valuable, but synthetic tree-sitter wrapper behavior is audit-only legacy unless a page or test explicitly says otherwise.

Spec Types

Construct Specs (`spec/constructs/`)

Each construct spec defines a valid CHAT pattern with its expected parse tree:

# example_name

Description of what this example tests.

## Input

\```mor_dependent_tier
%mor:	VERB|eat .
\```

## Expected CST

\```cst
(mor_dependent_tier
  (mor_tier_prefix)
  ...)
\```

## Metadata

- **Level**: tier
- **Category**: tiers

The Input code fence label (e.g., mor_dependent_tier, utterance) selects which template wraps the fragment into a full CHAT file for parsing.

That is an explicit grammar/test templating mechanism. It is useful, but it does not by itself define honest isolated-fragment semantics for the direct parser.
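The label-to-template step can be sketched as a single lookup. This is a minimal illustration only: the function name `wrap_fragment`, the header lines, and the host utterance are assumptions made for the example, not the templates the real generator uses.

```rust
/// Wrap a spec fragment in a full CHAT file chosen by its fence label.
/// The headers and the host main tier are illustrative assumptions.
fn wrap_fragment(label: &str, fragment: &str) -> Option<String> {
    const HEAD: &str = "@UTF8\n@Begin\n@Languages:\teng\n@Participants:\tCHI Target_Child\n";
    const TAIL: &str = "@End\n";
    let body = match label {
        // The fragment is already a complete main tier line.
        "utterance" => format!("{}\n", fragment),
        // A dependent tier needs a host main tier above it.
        "mor_dependent_tier" => format!("*CHI:\teat .\n{}\n", fragment),
        // Labels without a template cannot be wrapped.
        _ => return None,
    };
    Some(format!("{}{}{}", HEAD, body, TAIL))
}
```

The point of the sketch is that the fragment is never parsed on its own: what the parser sees is always a whole file, so the result says nothing about how the fragment would behave in isolation.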

Error Specs (`spec/errors/`)

Each error spec defines an invalid CHAT pattern with expected error codes:

# Error E301

## Metadata

- Code: E301
- Name: missing_participants
- Severity: Error
- Layer: parser

## Examples

### missing_participants_1

\```chat
@UTF8
@Begin
*CHI: hello .
@End
\```

Key metadata fields:

Layer: parser: error caught during parsing (returns Err)
Layer: validation: error caught after successful parse
Status: not_implemented: generates #[ignore] tests
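How these fields could steer test generation is sketched here. The enum `SpecTestKind` and the function `classify_spec` are hypothetical names for illustration; the real generator's types are not shown in this book.

```rust
/// How a generated test should treat one error spec (hypothetical type).
#[derive(Debug, PartialEq)]
enum SpecTestKind {
    /// Parsing itself is expected to return `Err`.
    ParserRejects,
    /// Parsing succeeds; validation is expected to report the code.
    ValidationReports,
    /// The check is not implemented yet, so the test is `#[ignore]`d.
    Ignored,
}

/// Read `- Key: value` lines from a Metadata section and pick a test kind.
fn classify_spec(metadata: &str) -> Option<SpecTestKind> {
    let mut layer = None;
    let mut status = None;
    for line in metadata.lines() {
        let Some(rest) = line.trim().strip_prefix("- ") else { continue };
        let Some((key, value)) = rest.split_once(':') else { continue };
        match key.trim() {
            "Layer" => layer = Some(value.trim().to_string()),
            "Status" => status = Some(value.trim().to_string()),
            _ => {} // Code, Name, Severity do not affect the test kind here.
        }
    }
    if status.as_deref() == Some("not_implemented") {
        return Some(SpecTestKind::Ignored);
    }
    match layer.as_deref() {
        Some("parser") => Some(SpecTestKind::ParserRejects),
        Some("validation") => Some(SpecTestKind::ValidationReports),
        _ => None,
    }
}
```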

Symbol Registry (`spec/symbols/`)

symbol_registry.json defines character sets used by both the grammar and Rust crates. In this repo, just symbols-gen validates the registry and regenerates the checked-in grammar and Rust symbol-set outputs. The generation step produces:

JavaScript constants for grammar.js
Rust constants for model validation

Test Generation

The predecessor monorepo used make test-gen as shorthand for three generator classes. That root wrapper is not yet ported into this repo, but the underlying generation responsibilities are still:

1. Tree-sitter Corpus Tests

gen_tree_sitter_tests reads construct specs and error specs, then:

Wraps each Input in a template to create a full CHAT file
Parses with tree-sitter and checks for error nodes
Writes Expected CST to grammar/test/corpus/

For error specs, it captures the actual parse (with ERROR nodes) as the expected tree.

2. Rust Tests

gen_rust_tests generates Rust test functions:

Construct specs become parse-and-compare tests
Parser-layer error specs become parser.parse_chat_file() tests expecting Err
Validation-layer error specs become parse-then-validate tests

Output: crates/talkbank-parser-tests/tests/generated/

The generated suites are useful as grammar/audit support and regression coverage, but they are not the sole authority for parser semantics.

3. Error Documentation

gen_error_docs generates optional local markdown pages for each error code under docs/errors/ when maintainers want a browsable reference set while working on diagnostics. The source of truth remains spec/errors/.

Workflow After Spec Changes

Regenerate only the affected spec-driven artifacts using the current commands documented in spec/CLAUDE.md.
Run the concrete verification commands from Contributing > Setup.

Never hand-edit generated artifacts; always regenerate from specs.

Post-Bootstrap Doctrine

spec/tools remains the generator/validator for grammar corpus tests, error docs, and shared symbol artifacts.
talkbank-parser-tests owns parser equivalence and roundtrip contracts.
Isolated grammar additions should usually need two things: one grammar corpus example and one full-file fixture. They should not require the old bootstrap ritual unless generated artifacts really changed.

Grammar

Status: Current Last updated: 2026-03-24 00:01 EDT

The CHAT grammar is defined in grammar/grammar.js using the tree-sitter parser generator. It produces a GLR parser that handles the full CHAT format with error recovery.

Design Principles

Explicit Whitespace

Unlike most tree-sitter grammars, CHAT does not use extras for whitespace. All whitespace is grammar-visible because CHAT’s structure is whitespace-sensitive:

Tab separates tier prefix from content
Newline ends tiers
Line continuation uses tab-at-start-of-line
Space separates words and annotations
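A small line-level sketch shows why tab and newline cannot be treated as ignorable. It is not how the tree-sitter grammar works internally; the helper names are hypothetical, and joining a continuation with a single space is an assumption made for the example.

```rust
/// Split one tier line into (prefix, content) at the first tab.
/// Without a tab there is no prefix/content boundary to find.
fn split_tier(line: &str) -> Option<(&str, &str)> {
    line.split_once('\t')
}

/// Fold continuation lines (tab at start of line) into the tier they continue.
fn join_continuations(text: &str) -> Vec<String> {
    let mut tiers: Vec<String> = Vec::new();
    for line in text.lines() {
        if let Some(rest) = line.strip_prefix('\t') {
            if let Some(previous) = tiers.last_mut() {
                // Assumption: a continuation joins with one space.
                previous.push(' ');
                previous.push_str(rest);
                continue;
            }
        }
        // Any other line starts a new tier.
        tiers.push(line.to_string());
    }
    tiers
}
```

Swapping the tab in `*CHI:\thello .` for a space makes `split_tier` return `None`, which is the kind of distinction a grammar with whitespace in extras could not express.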

Two-Level Structure

The grammar has two structural levels:

Document level: headers, utterances, @Begin/@End
Tier level: main tier content, dependent tier content (each with distinct rules)

Opaque Lemmas

In the %mor tier rules, lemmas are parsed as opaque Unicode strings. The grammar does not attempt to decompose lemma content; that happens in the model layer. This follows the “parse, don’t validate” principle.

Key Grammar Rules

Document Structure

document → utf8_header, begin_header, lines..., end_header
line → header | utterance
utterance → main_tier, dependent_tiers...

Main Tier

main_tier → star, speaker, colon, tab, tier_body
tier_body → contents, utterance_end
contents → content_item, (whitespace, content_item)...

MOR Tier (UD-style)

mor_contents → mor_content, (whitespace, mor_content)..., terminator
mor_content → mor_word, mor_post_clitic*
mor_word → mor_pos, pipe, mor_lemma, mor_feature*
mor_post_clitic → tilde, mor_word
mor_feature → hyphen, mor_feature_value

POS tags are simple identifiers (no subcategories). Lemmas are opaque strings. Features are hyphen-separated values that may contain = for Key=Value pairs and , for multi-value features.
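The rule shapes above can be mirrored by a hand-written splitter. This is only a sketch: the type `MorUnit` and both function names are hypothetical, and it simplifies by treating the first hyphen after the pipe as the end of the lemma.

```rust
/// One `pos|lemma-feature-feature` unit (hypothetical sketch type).
#[derive(Debug, PartialEq)]
struct MorUnit {
    pos: String,
    lemma: String,
    features: Vec<String>,
}

/// Parse a single mor_word: POS, pipe, opaque lemma, hyphen-led features.
fn parse_mor_unit(text: &str) -> Option<MorUnit> {
    let (pos, rest) = text.split_once('|')?;
    let mut parts = rest.split('-');
    let lemma = parts.next()?;
    if pos.is_empty() || lemma.is_empty() {
        return None;
    }
    Some(MorUnit {
        pos: pos.to_string(),
        lemma: lemma.to_string(),
        // Feature values are kept whole, so `Key=Value` and `a,b` survive.
        features: parts.map(str::to_string).collect(),
    })
}

/// Parse mor_content: a main word followed by `~`-joined post-clitics.
fn parse_mor_content(text: &str) -> Option<(MorUnit, Vec<MorUnit>)> {
    let mut units = text.split('~').map(parse_mor_unit);
    let main = units.next()??;
    let clitics = units.collect::<Option<Vec<_>>>()?;
    Some((main, clitics))
}
```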

Grammar Change Workflow

parser.c is generated from grammar.js; never edit it directly.

After any change to grammar.js:

cd grammar && tree-sitter generate
tree-sitter test (160 tests)
cargo test -p talkbank-parser
cargo nextest run -p talkbank-parser-tests (reference corpus equivalence, per-file)
Verify the 78-file reference corpus passes at 100%

Conflict Resolution

The grammar uses tree-sitter’s precedence and conflict mechanisms to handle ambiguities:

Word tokens use prec(5) to win over separators
Inline bullets use prec(10) for their delimiters
CA (conversation analysis) symbols use prec(3) for colon disambiguation

Generated Artifacts

Running tree-sitter generate produces:

src/parser.c: the C parser
src/node-types.json: node type metadata

The Rust crate talkbank-parser references node-types.json to generate node_types.rs (a generated constants file).

Overlap Marker Binding

Status: Current Last modified: 2026-07-10 12:06 EDT

Overlap markers (⌈ ⌉ ⌊ ⌋) mark simultaneous speech. They are the hardest tokenization problem in CHAT, because they legitimately live at two levels: between words (marking a span boundary in the utterance) and inside words (marking that the overlap boundary falls mid-word, as in o⌈ne t⌉wo). This page documents the binding rule the grammar ships, the ideal rule it approximates, why the gap exists, and the measured options for closing it. It is the permanent record of a design debate that has run since the project’s earliest prototypes; read it before touching word_body, contents, or anything overlap-adjacent.

The shipping rule: adjacency binds into the word

A marker adjacent to text is part of the word; only a space-separated marker is a standalone overlap_point.

Yeah ⌈2 hey     ⌈2 is standalone (spaces both sides)
⌈one two⌉       TWO words: "⌈one" and "two⌉" (markers bound in)
o⌈ne t⌉wo       TWO words with interior markers (same rule)

Mechanically: overlap_point and word_segment carry equal token precedence, and maximal munch plus word_body’s continuation rules give the word custody of every adjacent marker. See Word Internals (tokenization ambiguity #1) and the grammar’s tokenization-rules.md (Exception 1).
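The observable effect of the adjacency rule can be reproduced with a whitespace split. The sketch assumes that a standalone marker is one marker character optionally followed by a single digit index (as in ⌈2); the names `BoundToken`, `classify_token`, and `tokenize_adjacent` are hypothetical and unrelated to the grammar's real token machinery.

```rust
#[derive(Debug, PartialEq)]
enum BoundToken<'a> {
    /// A space-separated marker, optionally carrying a one-digit index.
    OverlapPoint(&'a str),
    /// Anything else, including text with markers bound into it.
    Word(&'a str),
}

fn is_marker(c: char) -> bool {
    matches!(c, '⌈' | '⌉' | '⌊' | '⌋')
}

/// A token is standalone only if it is nothing but a marker (plus index).
fn classify_token(token: &str) -> BoundToken<'_> {
    let mut chars = token.chars();
    let standalone = match (chars.next(), chars.next(), chars.next()) {
        (Some(m), None, _) => is_marker(m),
        (Some(m), Some(d), None) => is_marker(m) && d.is_ascii_digit(),
        _ => false,
    };
    if standalone {
        BoundToken::OverlapPoint(token)
    } else {
        BoundToken::Word(token)
    }
}

/// Split on spaces; adjacency alone decides custody of each marker.
fn tokenize_adjacent(content: &str) -> Vec<BoundToken<'_>> {
    content
        .split(' ')
        .filter(|t| !t.is_empty())
        .map(classify_token)
        .collect()
}
```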

The ideal rule this approximates

A marker with spoken text on BOTH sides is word-internal; a marker at a word’s edge is top-level content. Under the ideal rule, ⌈one two⌉ parses as (⌈) (one) (two) (⌉), the visually obvious reading, while o⌈ne keeps its interior marker. The shipping rule diverges exactly at word edges, where it gives the word custody of markers the ideal calls top-level.

The ideal was the project’s ORIGINAL specification. Early prototype grammars (January 2026) attempted it and produced a substantial decision record; the analysis concluded that the rule “requires bidirectional context that LR parsers cannot naturally handle” (the parser must see both sides of the marker to classify it, and LR(1) has one token of lookahead). The adjacency rule was adopted as the tractable alternative, and every grammar generation since (through the February coarsening campaign and the March re-structuring) has carried it forward.

What 2026-07-10 established

A feasibility experiment revisited the impossibility conclusion with GLR machinery the January analysis had not combined: an interior-only word_body (a word may not begin or end with an overlap marker), a declared conflict, dynamic precedence on interior continuations, and removal of the static prec.right bias so the conflict genuinely splits. Results:

The ideal rule IS expressible: probes and grammar fixtures parse to the ideal shapes with no ERROR nodes, and the grammar’s conflict inventory net-shrinks.
Corpus reality is the hard part. Conversation-analysis (CA) transcription layers (overlap points, paired CA delimiters, underline spans, lengthening, compounds) cross-nest freely at word edges (☺you ⌈there⌉☺, ∇⌈ho:ney⌉∇, full⌉+grown, ⌈drug⌉ [!]). The shipping rule sidesteps every such case by giving the word custody of everything adjacent; the ideal rule must answer a custody question PER MARKER PAIR, each answer costing a grammar rule, a conflict, and an AST-shape decision. Measured against the full kept corpus (763 overlap-bearing files, all of which parse cleanly under the shipping grammar), five iterations of custody rules reduced ideal-rule regressions from 195 files to 105, a converging but long tail.

Two implementation routes therefore exist:

Grammar route: finish the custody enumeration. Honest estimate: a multi-week grammar project, followed by AST migration across the model, the generated visitor, the second (oracle) parser, and the XML emitter.
Conversion route (recommended by the experiment): keep the shipping grammar, and re-associate edge-bound overlap points to top level during CST-to-model conversion. At that point the CA layers are already resolved into typed word children, so every custody question becomes a deterministic tree transformation rather than a GLR fight. Precedent: CA terminator promotion, which already uses this parse-one-way/normalize-at-conversion pattern. The grammar’s empty-extras design (all whitespace grammar-visible) preserves exactly the facts the transformation needs.
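For the simplest case, the conversion route amounts to a deterministic rewrite of each adjacency-bound word. The sketch below handles plain overlap markers only (with an optional one-digit index) and works on raw text; the real transformation would operate on typed word children and must also answer the CA custody questions listed above. `Piece` and `promote_edge_markers` are hypothetical names.

```rust
#[derive(Debug, PartialEq)]
enum Piece {
    Overlap(String),
    Word(String),
}

fn is_overlap_marker(c: char) -> bool {
    matches!(c, '⌈' | '⌉' | '⌊' | '⌋')
}

/// Move edge-bound markers out of a word; interior markers stay put.
fn promote_edge_markers(word: &str) -> Vec<Piece> {
    let mut rest = word;
    let mut out = Vec::new();
    // Peel markers (plus an optional index digit) off the front.
    while let Some(marker) = rest.chars().next().filter(|c| is_overlap_marker(*c)) {
        let mut len = marker.len_utf8();
        if rest[len..].starts_with(|c: char| c.is_ascii_digit()) {
            len += 1;
        }
        out.push(Piece::Overlap(rest[..len].to_string()));
        rest = &rest[len..];
    }
    // Peel markers off the back, outermost first.
    let mut trailing = Vec::new();
    loop {
        let stem = rest.strip_suffix(|c: char| c.is_ascii_digit()).unwrap_or(rest);
        let Some(marker) = stem.chars().next_back().filter(|c| is_overlap_marker(*c)) else {
            break;
        };
        let start = stem.len() - marker.len_utf8();
        trailing.push(Piece::Overlap(rest[start..].to_string()));
        rest = &rest[..start];
    }
    if !rest.is_empty() {
        out.push(Piece::Word(rest.to_string()));
    }
    // Restore source order for the trailing markers.
    out.extend(trailing.into_iter().rev());
    out
}
```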

The choice between routes (or deferral) is an open maintainer decision at the time of writing; this page must be updated when it is made.

Why this interacts with whitespace separation

The grammar’s deepest design commitment is extras: [] (all whitespace is grammar-visible), because the worst historical CHAT parser bug was ACCEPTING glued content items as if properly separated (the legacy Java implementation tokenized hello(.) correctly as a word and a pause, and that silent acceptance was precisely the problem: malformed sources never got cleaned). Overlap markers and a short list of negotiated exceptions (notably comma-left: one, two is accepted; one ,two is not) are the only constructs that legitimately juxtapose with words at all. Whitespace-separation violations that the grammar tolerates for recovery’s sake are rejected by validation with precise diagnostics (E749, E750, E751), per the layer rule: the grammar’s job is SHAPE (parse everything, truest tree); rejection of recoverable style belongs to validation, where messages are helpful and recovery graceful.

Parsing

Status: Current Last updated: 2026-07-07 21:17 EDT

The parsing pipeline converts CHAT text into a typed ChatFile AST. The default and canonical parser is the tree-sitter parser (talkbank-parser). A second implementation, talkbank-parser-re2c, exists alongside it as a specification oracle and high-throughput batch parser; it produces the same ChatFile model and is opt-in via chatter validate --parser re2c. The LSP and all production paths use the tree-sitter parser.

Tree-Sitter Parser

The talkbank-parser crate wraps the tree-sitter C parser and converts its concrete syntax tree (CST) into the ChatFile model.

Full-file parsing is the canonical entry point. TreeSitterParser also provides fragment methods (parse_word_fragment(), parse_main_tier_fragment(), parse_chat_file_fragment(), etc.) for parsing isolated CHAT fragments directly.

CST → AST Pipeline

flowchart LR
    chat["CHAT text\n(.cha file)"]
    grammar["tree-sitter grammar\n(grammar.js → parser.c)"]
    cst["Concrete Syntax Tree\n(all whitespace preserved)"]
    walker["TreeSitterParser\n(CST traversal)"]
    ast["ChatFile AST\n(semantic model)"]

    chat --> grammar --> cst --> walker --> ast

Source text
    ↓ tree-sitter parse
Concrete Syntax Tree (CST), green tree with all tokens
    ↓ tree_parsing (Rust)
ChatFile AST, typed model with validation-ready data

The CST preserves every character of the source (whitespace, punctuation, comments). The Rust tree-parsing modules extract semantic information from the CST into the typed model through a generated typed traversal layer, described next.

The generated typed traversal (`generated_traversal`)

The bridge between the tree-sitter CST and the typed model is a single generated module, crates/talkbank-parser/src/generated_traversal.rs, produced by the tree-sitter-grammar-utils generator from the grammar’s own machine-readable description (grammar/src/grammar.json plus node-types.json). It contains one extract_* function per grammar rule, each returning a typed view of that rule’s children, so consumer code dispatches on generated types rather than on node.kind() strings.

Every child position a grammar rule models is exposed as a NodeSlot with five states:

`NodeSlot` state	Meaning
`Present`	The expected node is there; a typed accessor is available
`Missing`	Tree-sitter inserted a zero-width MISSING node during recovery
`Error`	An ERROR subtree occupies the position
`Unexpected`	A node of an unmodeled kind landed here
`Absent`	An optional position is simply empty

This design makes silent recovery-node loss structurally impossible at modeled positions: Missing and Error are explicit variants every call site must handle (they map to the E342 and E316 diagnostics), not conditions a hand-written walk can forget to check. Hand-walking the CST with node.kind() comparisons, and classifying the text of ERROR nodes to guess what was malformed, are both banned in production parser code for exactly this reason.
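The effect of making recovery states explicit variants can be seen in a small stand-in. This enum mirrors only the five state names from the table; the generated `NodeSlot` may differ in shape. The mapping of Missing to E342 and Error to E316 follows the order in which the text lists them and is an assumption, as are `SlotResult`, `resolve_slot`, and the placeholder label used for Unexpected.

```rust
/// Hypothetical mirror of the five `NodeSlot` states.
#[derive(Debug)]
enum NodeSlot<T> {
    Present(T),
    Missing,
    Error,
    Unexpected,
    Absent,
}

#[derive(Debug, PartialEq)]
enum SlotResult<T> {
    Value(T),
    Empty,
    Diagnostic(&'static str),
}

/// No wildcard arm: adding a sixth state would fail to compile here,
/// so a call site cannot silently drop a recovery node.
fn resolve_slot<T>(slot: NodeSlot<T>) -> SlotResult<T> {
    match slot {
        NodeSlot::Present(value) => SlotResult::Value(value),
        NodeSlot::Absent => SlotResult::Empty,
        NodeSlot::Missing => SlotResult::Diagnostic("E342"), // assumed mapping
        NodeSlot::Error => SlotResult::Diagnostic("E316"),   // assumed mapping
        NodeSlot::Unexpected => SlotResult::Diagnostic("unexpected-node"), // placeholder
    }
}
```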

Recovery handling is two-layered by design: the per-position NodeSlot states cover every position the grammar models, and a whole-tree recovery backstop (see the recovery discussion below) surfaces recovery nodes that land where no grammar rule models a slot, such as top-level junk. The layers are complementary, and both are load-bearing: removing the backstop demonstrably regresses the CHECK-parity and recovery-is-not-validity test suites.

The module is regenerated whenever the grammar changes (the regeneration workflow, including the staleness guard that fails the test suite if regeneration is forgotten, is documented in the repository root CLAUDE.md under “Grammar Change Workflow”). It is never edited by hand: generator defects are fixed in tree-sitter-grammar-utils and regenerated.

Error Recovery

Tree-sitter’s GLR algorithm provides automatic error recovery. When the parser encounters unexpected input, it:

Inserts ERROR nodes in the CST
Continues parsing the rest of the file
Reports parse errors via the ErrorSink trait

This means the parser always produces a result: even for malformed files, it extracts as much structure as possible.

ParseOutcome

Individual parse functions return ParseOutcome<T>:

ParseOutcome::parsed(value): successfully parsed
ParseOutcome::rejected(): could not parse this node (error already reported)

This allows the parser to skip individual malformed elements while continuing to parse the rest of the file.
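The skip-and-continue pattern looks roughly like this. The enum is a hypothetical reconstruction from the two constructor names above, and `parse_item` is a stand-in element parser that has nothing to do with CHAT; only the control flow is the point.

```rust
/// Hypothetical shape; the source only names the two constructors.
#[derive(Debug, PartialEq)]
enum ParseOutcome<T> {
    Parsed(T),
    Rejected,
}

impl<T> ParseOutcome<T> {
    fn parsed(value: T) -> Self {
        ParseOutcome::Parsed(value)
    }
    fn rejected() -> Self {
        ParseOutcome::Rejected
    }
}

/// Stand-in element parser: reports the problem, then rejects.
fn parse_item(text: &str, errors: &mut Vec<String>) -> ParseOutcome<u32> {
    match text.parse::<u32>() {
        Ok(n) => ParseOutcome::parsed(n),
        Err(_) => {
            errors.push(format!("cannot parse {text:?}"));
            ParseOutcome::rejected()
        }
    }
}

/// A rejected element is skipped; its siblings are still parsed.
fn parse_all(items: &[&str], errors: &mut Vec<String>) -> Vec<u32> {
    items
        .iter()
        .filter_map(|item| match parse_item(item, errors) {
            ParseOutcome::Parsed(n) => Some(n),
            ParseOutcome::Rejected => None,
        })
        .collect()
}
```

Because the error is pushed before `rejected()` is returned, the caller never needs to invent a diagnostic for a missing value.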

Parser Equivalence

The 78-file reference corpus is the primary correctness guarantee:

cargo nextest run -p talkbank-parser-tests -E 'test(parser_equivalence)'

Each .cha file is its own test; nextest runs them in parallel and reports individual failures.

TreeSitterParser API

TreeSitterParser is the sole API handle for parsing. Callers create one instance and pass &TreeSitterParser to all parsing call sites. There is no trait abstraction: TreeSitterParser is a concrete type in the talkbank-parser crate.

use talkbank_parser::TreeSitterParser;

let parser = TreeSitterParser::new()?;

// Full-file parsing (methods on TreeSitterParser).
// parse_chat_file returns ParseResult<ChatFile> with the diagnostic list
// embedded in the result envelope.
let chat_file = parser.parse_chat_file(&source)?;
// parse_chat_file_streaming pushes diagnostics into an ErrorSink as it
// goes, useful for very large files or LSP-style incremental flows.
let chat_file = parser.parse_chat_file_streaming(&source, &errors);

// Fragment parsing (methods on TreeSitterParser), used when synthesizing
// CHAT from non-CHAT sources (ASR output, UD annotations).
let word = parser.parse_word_fragment(word_text, &errors);
let main_tier = parser.parse_main_tier_fragment(tier_text, &errors);

AST Structure

The resulting ChatFile AST has a recursive content structure:

flowchart TD
    cf["ChatFile"]
    hdr["Headers\n@Languages, @Participants,\n@ID, @Options"]
    utts["Utterances[]"]
    mt["MainTier\nspeaker + content"]
    dt["DependentTiers[]\n%mor, %gra, %pho, %sin, %wor"]
    uc["UtteranceContent\n24 variants"]
    leaf["Leaves\nWord | ReplacedWord | Separator"]
    group["Groups\nGroup | AnnotatedGroup |\nRetrace | PhoGroup | SinGroup | Quotation"]

    cf --> hdr & utts
    utts --> mt & dt
    mt --> uc
    uc --> leaf & group
    group -->|recurse| uc

Parser String Handling

The tree-sitter parser constructs owned model types (e.g., MorWord, GrammaticalRelation) directly from CST text. String-heavy types like PosCategory and MorStem use Arc<str> interning to avoid redundant allocations for repeated values. Short strings in model newtypes use SmolStr for inline storage up to 23 bytes.

CHAT Data Model

Status: Current Last updated: 2026-06-14 19:57 EDT

The talkbank-model crate defines the typed AST for CHAT files. Every other crate (parser, transform, CLAN, CLI, LSP) and the entire batchalign runtime depend on it. This page describes the model itself, the three-level content hierarchy, the content-walker primitives, and the extract → infer → inject pattern that all NLP tasks follow.

ChatFile

The root type is ChatFile, representing a complete CHAT transcript:

pub struct ChatFile {
    pub lines: Vec<Line>,
    pub participants: IndexMap<SpeakerCode, Participant>,
    pub languages: LanguageCodes,
    pub options: ChatOptionFlags,
}

Each Line is either a Header or an Utterance. The full ownership tree:

flowchart TD
    chatfile["ChatFile\n(talkbank-model/src/model/file/chat_file/core.rs)"]
    chatfile --> lines["lines: Vec&lt;Line&gt;"]
    chatfile --> participants["participants:\nIndexMap&lt;SpeakerCode, Participant&gt;"]
    chatfile --> languages["languages: LanguageCodes"]
    chatfile --> options["options: ChatOptionFlags"]

    lines --> header_line["Line::Header (Header)"]
    lines --> utt_line["Line::Utterance (Utterance)"]

    utt_line --> preceding["preceding_headers:\nSmallVec&lt;Header&gt;"]
    utt_line --> main["main: MainTier"]
    utt_line --> deptiers["dependent_tiers:\nVec&lt;DependentTier&gt;"]
    utt_line --> health["parse_health: ParseHealthState"]

    main --> speaker["speaker: SpeakerCode"]
    main --> tiercontent["content: TierContent"]
    tiercontent --> linkers["linkers: Vec&lt;Linker&gt;"]
    tiercontent --> uttcontent["utterance_content:\nVec&lt;UtteranceContent&gt;\n(24 variants)"]
    tiercontent --> terminator["terminator: Option&lt;Terminator&gt;"]
    tiercontent --> bullet["bullet: Option&lt;Bullet&gt;"]

The DependentTier enum has 31 variants: structured linguistic (Mor/Gra/Pho/Mod/Sin/Act/Cod/Wor), with-inline-bullets (Add/Com/Exp/Gpx/Int/Sit/Spa), text-only (Alt/Coh/Def/Eng/Err/Fac/Flo/Gls/Ort/Par/Tim), Phon-project (Modsyl/Phosyl/Phoaln), and UserDefined / Unsupported.

Three-Level Content Hierarchy

CHAT main-tier content is a tree with three nesting levels. Every content traversal must understand all three.

ChatFile
└── Line::Utterance
    └── MainTier
        └── TierContent
            ├── content: Vec<UtteranceContent>     ← Level 1
            │   ├── Word(Box<Word>)
            │   │   └── content: Vec<WordContent>  ← Level 3
            │   ├── OverlapPoint(OverlapPoint)
            │   ├── Group(Group)
            │   │   └── BracketedContent
            │   │       └── Vec<BracketedItem>     ← Level 2
            │   ├── PhoGroup, SinGroup, Quotation
            │   │   └── (same BracketedContent)
            │   ├── Retrace(Box<Retrace>)
            │   ├── Pause, Event, Separator, ...
            │   └── AnnotatedWord, AnnotatedGroup, ...
            ├── bullet: Option<Bullet>
            ├── linkers: Linkers
            └── terminator: Terminator

Level 1, `UtteranceContent` (24 variants)

What you iterate when walking utterance.main.content.content.0:

Category	Variants
Words	`Word`, `AnnotatedWord`, `ReplacedWord`
Groups	`Group`, `AnnotatedGroup`, `PhoGroup`, `SinGroup`, `Quotation`
CA markers	`OverlapPoint`, `Separator`
Events	`Event`, `AnnotatedEvent`, `OtherSpokenEvent`
Actions	`AnnotatedAction`
Timing	`InternalBullet`
Scope markers	`LongFeatureBegin/End`, `NonvocalBegin/End/Simple`, `UnderlineBegin/End`
Other	`Freecode`, `Pause`

Critical rule: every match on UtteranceContent must explicitly list all 24 variants. No _ => catch-alls. Project policy: silent data loss when new variants are added is unacceptable.

Level 2, `BracketedItem` (22 variants)

Content inside groups (<...>, ‹...›, 〔...〕, "..."). Accessed via group.content.content.0. The double .content.content.0 is not a typo: Group.content is BracketedContent, which has .content: BracketedItems, which has .0: Vec<BracketedItem>.

BracketedItem mirrors UtteranceContent closely. Retrace content (<word word> [/], word [//]) is a dedicated Retrace variant at both levels, not hidden inside AnnotatedGroup. Groups can nest arbitrarily deep.

Level 3, `WordContent` (11 variants)

Content inside a single word token, accessed via word.content:

Variant	Example
`Text`	plain text segment
`Shortening`	`(lo)` omitted sound
`OverlapPoint`	`butt⌈er⌉`, overlap inside a word
`CAElement`	`↑ ↓` prosody markers
`CADelimiter`	`° ∆` paired delimiters
`StressMarker`	`ˈ ˌ`
`Lengthening`	`:`
`SyllablePause`	`^`
`CompoundMarker`	`+` in `ice+cream`
`UnderlineBegin/End`	scope delimiters

Key insight: overlap markers can appear at all three levels. They occur as standalone UtteranceContent::OverlapPoint (space-separated: ⌈ word ⌉), as BracketedItem::OverlapPoint (inside groups), or as WordContent::OverlapPoint (intra-word: butt⌈er⌉). Any traversal looking for overlap markers must check all three levels.
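A traversal that honors all three levels has the following shape. The enums are drastically reduced stand-ins (a handful of variants instead of 24, 22, and 11), and every name is hypothetical; what carries over is the structure: one exhaustive match per level, with groups recursing.

```rust
/// Level 3 stand-in: pieces inside one word.
enum WordPiece {
    Text,
    Overlap,
}

/// Level 2 stand-in: items inside a group.
enum GroupItem {
    Word(Vec<WordPiece>),
    Overlap,
    Group(Vec<GroupItem>),
}

/// Level 1 stand-in: items directly on the main tier.
enum TierItem {
    Word(Vec<WordPiece>),
    Overlap,
    Group(Vec<GroupItem>),
    Pause,
}

fn overlaps_in_word(pieces: &[WordPiece]) -> usize {
    pieces
        .iter()
        .map(|piece| match piece {
            WordPiece::Text => 0,
            WordPiece::Overlap => 1,
        })
        .sum()
}

fn overlaps_in_group(items: &[GroupItem]) -> usize {
    items
        .iter()
        .map(|item| match item {
            GroupItem::Word(pieces) => overlaps_in_word(pieces),
            GroupItem::Overlap => 1,
            GroupItem::Group(inner) => overlaps_in_group(inner),
        })
        .sum()
}

/// Every arm is listed; there is no `_ =>` to hide a new variant.
fn count_overlaps(content: &[TierItem]) -> usize {
    content
        .iter()
        .map(|item| match item {
            TierItem::Word(pieces) => overlaps_in_word(pieces),
            TierItem::Overlap => 1,
            TierItem::Group(items) => overlaps_in_group(items),
            TierItem::Pause => 0,
        })
        .sum()
}
```

A version that matched only `TierItem::Overlap` would report 1 for a tier where this one finds every marker, which is exactly the bug the key insight warns about.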

Annotated Wrappers and Replaced Words

`Annotated<T>`

Adds scoped annotations ([/], [* m], [= explanation], etc.) to any annotatable inner type:

pub struct Annotated<T> {
    pub inner: T,
    pub annotations: Vec<ContentAnnotation>,
    pub span: Span,
}

At Level 1: AnnotatedWord(Box<Annotated<Word>>), AnnotatedGroup(Annotated<Group>), AnnotatedEvent(Annotated<Event>), AnnotatedAction(Annotated<Action>). Same variants exist at Level 2.

`ReplacedWord`

Represents word [: replacement], a surface form with a replacement:

pub struct ReplacedWord {
    pub word: Word,
    pub replacement: Replacement,
}
pub struct Replacement {
    pub words: Vec<Word>,
}

Convention when extracting words for NLP: use replacement words if non-empty, else the surface form (Wor and Mor domains both follow this, each with its own counts_for_tier filter).
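The fallback convention is a single branch. The sketch uses plain strings in a hypothetical `ReplacedForm` struct rather than the model's `Word` and `Replacement` types, and it leaves out the per-domain counts_for_tier filter.

```rust
/// String-only stand-in for a replaced word such as `gonna [: going to]`.
struct ReplacedForm {
    surface: String,
    replacement: Vec<String>,
}

/// Prefer the replacement words; fall back to the surface form.
fn words_for_nlp(form: &ReplacedForm) -> Vec<&str> {
    if form.replacement.is_empty() {
        vec![form.surface.as_str()]
    } else {
        form.replacement.iter().map(String::as_str).collect()
    }
}
```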

Tier Domains

Different NLP tasks need different views of the same content. The TierDomain enum controls which words count for each tier and how groups are traversed:

Domain	Used by	Skips	Counts separators?
`Mor`	`%mor` / `%gra` generation	Retrace groups	Yes, `,` `„` `‡` carry mor items (`cm\|cm`, `end\|end`, `beg\|beg`)
`Wor`	`%wor` generation, FA	Nothing	No
`Pho`	`%pho` alignment	`PhoGroup`	No
`Sin`	`%sin` alignment	`SinGroup`	No

The content walker takes Option<TierDomain>: Some(domain) for domain-aware gating, None to recurse everything unconditionally.
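The gating decision implied by the Skips column can be written as one function. This is a sketch of the table only: `Domain`, `GroupKind`, and `should_recurse` are hypothetical names, and the real walker decides on model types rather than on a kind tag.

```rust
/// Stand-in for the four tier domains in the table.
#[derive(Clone, Copy)]
enum Domain {
    Mor,
    Wor,
    Pho,
    Sin,
}

/// Stand-in for the kinds of group a walker can meet.
#[derive(Clone, Copy)]
enum GroupKind {
    Plain,
    Retrace,
    PhoGroup,
    SinGroup,
}

/// `None` recurses unconditionally; `Some(domain)` applies the Skips column.
fn should_recurse(domain: Option<Domain>, group: GroupKind) -> bool {
    match (domain, group) {
        (None, _) => true,
        (Some(Domain::Mor), GroupKind::Retrace) => false,
        (Some(Domain::Pho), GroupKind::PhoGroup) => false,
        (Some(Domain::Sin), GroupKind::SinGroup) => false,
        // Wor skips nothing, and every other pairing recurses.
        (Some(_), _) => true,
    }
}
```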

Content Walkers

talkbank-model exports closure-based walkers. Two layers:

walk_content: generic, visits all content items (custom traversals).
walk_words / walk_words_mut: filtered to words / replaced words / separators, with domain-aware gating. The primary primitive.

use talkbank_model::alignment::helpers::{
    walk_words, walk_words_mut,
    WordItem, WordItemMut,
    TierDomain,
};

walk_words(content, Some(TierDomain::Wor), &mut |leaf| {
    match leaf {
        WordItem::Word(word) => { /* ... */ }
        WordItem::ReplacedWord(replaced) => { /* ... */ }
        WordItem::Separator(sep) => { /* ... */ }
    }
});

flowchart TD
    input["&[UtteranceContent]\n+ domain: Option&lt;TierDomain&gt;"]
    dispatch["Match variant\n(24 UtteranceContent variants)"]
    word["Word → emit WordItem::Word"]
    rw["ReplacedWord → emit WordItem::ReplacedWord"]
    sep["Separator → emit WordItem::Separator"]
    group["Group / AnnotatedGroup /\nPhoGroup / SinGroup / Quotation"]
    gate{"Domain\ngating"}
    skip["Skip\n(atomic unit)"]
    recurse["Recurse into\ngroup.content"]

    input --> dispatch
    dispatch --> word & rw & sep & group
    group --> gate
    gate -->|"Mor: skip retraces"| skip
    gate -->|"Pho/Sin: skip groups"| skip
    gate -->|"None: recurse all"| recurse
    recurse -->|back| dispatch

What `walk_words` does NOT visit

Only words and separators. Not OverlapPoint (any level), not CAElement within words, not events / pauses / actions, not internal bullets. For these, write a custom traversal; see talkbank-model/validation/utterance/overlap.rs for the reference pattern.

`walk_overlap_points`, overlap marker iterator

walk_overlap_points(content, &mut |visit| {
    // visit.point.kind, visit.point.index, visit.word_position
});

Visits every OverlapPoint at all three content levels with its word-position context. Used by the alignment pipeline (onset estimation) and the validator (pairing checks). For region-level analysis (pairing ⌈ with ⌉ by index), use extract_overlap_info() which builds OverlapRegion structs. For whole-file analysis, analyze_file_overlaps() matches top regions (⌈) with bottom regions (⌊) across utterances with 1:N support (used by E347 and chatter debug overlap-audit).
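Pairing by index within one utterance can be sketched as follows, for top markers only. The types `TopPoint` and `TopRegion` and the function `pair_top_regions` are hypothetical and are not the shapes extract_overlap_info() returns; the sketch assumes a closing marker pairs with the most recent unclosed opening marker carrying the same index.

```rust
#[derive(Clone, Copy, PartialEq)]
enum TopKind {
    Begin, // ⌈
    End,   // ⌉
}

struct TopPoint {
    kind: TopKind,
    index: Option<u8>,
    word_position: usize,
}

#[derive(Debug, PartialEq)]
struct TopRegion {
    index: Option<u8>,
    begin: usize,
    end: usize,
}

/// Returns the paired regions plus the number of unpaired markers.
fn pair_top_regions(points: &[TopPoint]) -> (Vec<TopRegion>, usize) {
    let mut open: Vec<&TopPoint> = Vec::new();
    let mut regions = Vec::new();
    let mut unpaired = 0;
    for point in points {
        match point.kind {
            TopKind::Begin => open.push(point),
            TopKind::End => match open.iter().rposition(|o| o.index == point.index) {
                Some(i) => {
                    let begin = open.remove(i);
                    regions.push(TopRegion {
                        index: point.index,
                        begin: begin.word_position,
                        end: point.word_position,
                    });
                }
                // A closing marker with nothing to close.
                None => unpaired += 1,
            },
        }
    }
    // Opening markers that never closed are unpaired too.
    (regions, unpaired + open.len())
}
```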

Validation

Beyond what the grammar enforces, validate_with_alignment() checks semantic constraints:

%mor alignment: number of MOR items matches alignable main-tier words.
%gra structure: sequential indices, ROOT checks, circular dependency.
Header consistency: @ID codes match @Participants.
Speaker references: all *SPEAKER: codes declared.
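To make the %gra checks concrete, here is a minimal sketch over the index|head|relation item format. It is not the logic inside validate_with_alignment(): the names are hypothetical, the ROOT check is assumed to mean exactly one item with head 0, and the cycle check looks items up by position, so it is only meaningful once indices are sequential.

```rust
/// One `index|head|relation` item; the relation label is not kept.
struct GraItem {
    index: usize,
    head: usize,
}

/// Parse the content of a %gra tier (without the `%gra:` prefix).
fn parse_gra(content: &str) -> Option<Vec<GraItem>> {
    content
        .split_whitespace()
        .map(|item| {
            let mut fields = item.split('|');
            let index = fields.next()?.parse::<usize>().ok()?;
            let head = fields.next()?.parse::<usize>().ok()?;
            fields.next()?; // a relation label must be present
            Some(GraItem { index, head })
        })
        .collect()
}

/// Report structural problems; an empty list means all checks passed.
fn check_gra(items: &[GraItem]) -> Vec<&'static str> {
    let n = items.len();
    let mut problems = Vec::new();
    if items.iter().enumerate().any(|(i, item)| item.index != i + 1) {
        problems.push("indices are not sequential");
    }
    if items.iter().filter(|item| item.head == 0).count() != 1 {
        problems.push("expected exactly one root");
    }
    // Follow head links for n steps; still inside the tier means a cycle.
    let cyclic = (1..=n).any(|start| {
        let mut node = start;
        for _ in 0..n {
            if node == 0 || node > n {
                return false;
            }
            node = items[node - 1].head;
        }
        node != 0 && node <= n
    });
    if cyclic {
        problems.push("circular dependency");
    }
    problems
}
```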

Five parallel alignment flows are computed against the main tier:

flowchart TD
    main["MainTier content"]
    walker["walk_words()\ncount alignable words"]

    subgraph "5 Parallel Alignment Flows"
        mor["%mor\ncustom logic\n(clitic handling)"]
        pho["%pho\npositional_align()\n(skip PhoGroup)"]
        sin["%sin\npositional_align()\n(skip SinGroup)"]
        wor["%wor\npositional_align()\n(LCS diff format)"]
        gra["%gra\nalign to %mor chunks\n(not main tier)"]
    end

    main --> walker
    walker --> mor & pho & sin & wor
    mor --> gra

For the alignment algorithms themselves, see Alignment.

Common Pitfalls

“Consecutive” means in-order traversal, not adjacent array indices. When CHAT tools speak of “consecutive” or “sequential” items on the main tier, this always means document order via recursive traversal, accounting for groups (<...>), retrace groups (<...> [/]), quotations ("..."), and all other bracketed structures. Never check adjacency in the flat Vec<UtteranceContent>; use walk_words or an equivalent in-order traversal.
Missing intra-word content. Overlap markers, CA elements, and other markers can appear inside Word.content. Checking only UtteranceContent::OverlapPoint misses WordContent::OverlapPoint (e.g., butt⌈er⌉, a⌈nd).
Missing annotated variants. UtteranceContent::AnnotatedWord and AnnotatedGroup wrap inner types in Annotated<T> and are easy to forget.
BracketedContent access. Group.content → BracketedContent, with .content: BracketedItems, with .0: Vec<BracketedItem>.
Separator counter sync (Mor domain). Tag-marker separators (, „ ‡) count as NLP words because they have %mor items. Any code counting words in the Mor domain must count these separators too.

Serialization

CHAT: WriteChat trait writes any model type back to CHAT format.
JSON: all model types implement Serialize/Deserialize. Format per the JSON Schema.
JSON Schema: derived via JsonSchema. Run cargo test --test generate_schema to regenerate schema/chat-file.schema.json.

Memory and Interning

String-heavy types (PosCategory, MorStem, MorFeature) use Arc<str> with a global interner, which saves significant memory on large corpora where the same POS tags and lemmas appear thousands of times.
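As an illustration of the interning idea, here is a minimal sketch of a global Arc<str> interner built only on the standard library. The `intern` function and its pool are hypothetical; the crate's actual interner may be structured quite differently.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex, OnceLock};

/// Process-wide pool of interned strings (hypothetical sketch).
fn pool() -> &'static Mutex<HashSet<Arc<str>>> {
    static POOL: OnceLock<Mutex<HashSet<Arc<str>>>> = OnceLock::new();
    POOL.get_or_init(|| Mutex::new(HashSet::new()))
}

/// Returns a shared `Arc<str>` for `text`. Repeated calls with equal text
/// return clones of the same allocation, so a tag that occurs thousands
/// of times is stored once.
pub fn intern(text: &str) -> Arc<str> {
    let mut set = pool().lock().expect("interner mutex poisoned");
    if let Some(existing) = set.get(text) {
        return Arc::clone(existing);
    }
    let fresh: Arc<str> = Arc::from(text);
    set.insert(Arc::clone(&fresh));
    fresh
}
```

Cloning an interned value is a reference-count bump rather than a string copy, which is where the savings on repeated POS tags and lemmas come from.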

Collections that are typically small use SmallVec for inline storage:

SmallVec<[MorFeature; 4]>: features per word (usually 0-4).
SmallVec<[MorWord; 2]>: post-clitics (usually 0-1).

Transform Pipeline

Status: Current Last updated: 2026-07-07 21:17 EDT

The talkbank-transform crate provides high-level pipelines that compose parsing, validation, and serialization into reusable workflows.

Core Pipelines

Parse + Validate

The most common pipeline: parse a CHAT file and validate it.

use talkbank_transform::parse_and_validate;

let result = parse_and_validate(source, &parser, &error_collector);

This:

Parses the source text into a ChatFile AST
Runs validation (alignment checks, header consistency, etc.)
Collects all errors and warnings into the ErrorSink

CHAT → JSON

Convert a CHAT file to its JSON representation:

use talkbank_transform::chat_to_json;

let json = chat_to_json(source, &parser)?;

The JSON follows the schema at schema/chat-file.schema.json.

JSON → CHAT

The JSON produced by chat_to_json is schema-conformant and round-trips. Deserialize it back into a ChatFile with serde_json (the model derives Deserialize), then serialize through WriteChat to reproduce CHAT text:

let chat_file: talkbank_model::ChatFile = serde_json::from_str(json_str)?;
let chat_text = chat_file.to_chat_string();

The chatter from-json command wraps this path (crates/chatter/src/commands/json.rs, json_to_chat).

CHAT → CHAT (Normalize)

Parse and reserialize to normalize formatting:

use talkbank_transform::normalize_chat;

let normalized = normalize_chat(source, &parser)?;

normalize_chat lives in crates/talkbank-transform/src/pipeline/convert.rs.

Validation + Roundtrip Cache Lifecycle

The following diagram shows the full validation and roundtrip pipeline, including the cache layer:

flowchart TD
    file["CHAT file"]
    cache{"Cache\nhit?"}
    parse["Parse\n(tree-sitter → AST)"]
    validate["Validate\n(per-file → per-utterance →\nmain tier → dependent tiers)"]
    rt{"Roundtrip\nflag?"}
    ser1["Serialize → CHAT text"]
    reparse["Reparse CHAT text"]
    ser2["Serialize again"]
    cmp{"Two\nserializations\nmatch?"}
    store["Store in cache\n(SQLite)"]
    pass["Pass"]
    fail["Fail"]
    cached["Return cached result"]

    file --> cache
    cache -->|miss| parse --> validate --> rt
    cache -->|hit| cached
    rt -->|yes| ser1 --> reparse --> ser2 --> cmp
    rt -->|no| store --> pass
    cmp -->|yes| store
    cmp -->|no| fail
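The roundtrip branch of the diagram (serialize, reparse, serialize again, compare) can be sketched generically. `roundtrip_is_stable` is a hypothetical helper that takes the parser and serializer as closures so the example stays self-contained; it is not the transform crate's API.

```rust
/// Returns `Ok(true)` when serializing, reparsing, and serializing again
/// yields the same text as the first serialization.
///
/// `parse` and `serialize` are supplied by the caller; any parse failure
/// is propagated unchanged.
pub fn roundtrip_is_stable<T, E>(
    source: &str,
    parse: impl Fn(&str) -> Result<T, E>,
    serialize: impl Fn(&T) -> String,
) -> Result<bool, E> {
    // First pass: source text -> AST -> canonical text.
    let first = serialize(&parse(source)?);
    // Second pass: canonical text -> AST -> canonical text again.
    let second = serialize(&parse(&first)?);
    // The comparison is between the two serializations, not against the
    // original source, so harmless normalization does not count as failure.
    Ok(first == second)
}
```

Comparing the two serializations rather than the original source is what lets a file with non-canonical whitespace still pass the roundtrip check.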

Streaming Parse

For large files or interactive use, the transform crate supports streaming parse where utterances are processed incrementally rather than loading the entire AST into memory.

The shared validation runner (every frontend, one engine)

All bulk validation, whatever the frontend, flows through the validation_runner module’s two streaming entry points in crates/talkbank-transform/src/validation_runner/:

validate_directory_streaming walks a directory and feeds every CHAT transcript to a worker pool;
validate_files_streaming runs an explicit file list through the same worker pool.

Both share one worker loop, so every consumer gets identical rule coverage (including the file-stem-dependent checks such as the @Media filename match), identical stats accounting, and the same on-disk cache. The chatter CLI, the TUI, and the desktop app all call these entry points; the desktop app’s single-file path was unified onto validate_files_streaming in 0.3.0 after field reports showed the previous bespoke path skipped the cache and the stem-based checks. The invariant to preserve: no frontend grows its own validation orchestration; a file must validate identically whether selected alone or reached by a directory walk.

Caching

The transform layer integrates with a file-system cache. Validation results are keyed by content hash, so unchanged files skip re-validation. Cache location is platform-specific: ~/Library/Caches/talkbank-chat/ (macOS), ~/.cache/talkbank-chat/ (Linux), %LocalAppData%\talkbank-chat\ (Windows).

Use --force to bypass the cache for specific paths.
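The content-hash keying and the effect of --force can be sketched with an in-memory map. Everything here is hypothetical (`ValidationCache`, `CacheOutcome`, the error-count payload); the real cache is persistent, and `DefaultHasher` would not be suitable for an on-disk key because its output is not guaranteed stable across Rust releases.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Whether a result came from the cache or from a fresh validation run.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CacheOutcome {
    Hit,
    Miss,
}

/// Hypothetical in-memory cache: content hash -> number of errors found.
#[derive(Default)]
pub struct ValidationCache {
    results: HashMap<u64, usize>,
}

fn content_key(content: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    content.hash(&mut hasher);
    hasher.finish()
}

impl ValidationCache {
    /// Validates `content`, reusing a cached result unless `force` is set.
    /// `run` is the (expensive) validation function supplied by the caller.
    pub fn validate(
        &mut self,
        content: &str,
        force: bool,
        run: impl Fn(&str) -> usize,
    ) -> (usize, CacheOutcome) {
        let key = content_key(content);
        if !force {
            if let Some(&cached) = self.results.get(&key) {
                return (cached, CacheOutcome::Hit);
            }
        }
        let errors = run(content);
        self.results.insert(key, errors);
        (errors, CacheOutcome::Miss)
    }
}
```

Because the key depends only on file content, renaming or touching a file does not invalidate its entry, while any edit does.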

Error Collection

Pipelines use the ErrorSink trait for error reporting. Callers can provide:

A collecting sink (gathers all diagnostics for batch output)
A printing sink (writes diagnostics to stderr in real-time)
A custom sink (for LSP diagnostics, JSON output, etc.)
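The sink pattern can be illustrated with a small stand-in. The trait below is named `DiagnosticSink` precisely because it is not the real ErrorSink: its method name, receiver, and `Diagnostic` payload are assumptions made for the sketch, and the check is a toy.

```rust
/// Hypothetical diagnostic payload.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Diagnostic {
    pub line: usize,
    pub message: String,
}

/// Hypothetical stand-in for the sink abstraction: the pipeline reports,
/// the caller decides what happens to each report.
pub trait DiagnosticSink {
    fn report(&mut self, diagnostic: Diagnostic);
}

/// Collecting sink: gathers everything for batch output.
#[derive(Default)]
pub struct CollectingSink {
    pub diagnostics: Vec<Diagnostic>,
}

impl DiagnosticSink for CollectingSink {
    fn report(&mut self, diagnostic: Diagnostic) {
        self.diagnostics.push(diagnostic);
    }
}

/// Printing sink: writes each diagnostic to stderr as it arrives.
pub struct PrintingSink;

impl DiagnosticSink for PrintingSink {
    fn report(&mut self, diagnostic: Diagnostic) {
        eprintln!("line {}: {}", diagnostic.line, diagnostic.message);
    }
}

/// Toy check: flags main-tier lines (starting with `*`) with no `:`.
pub fn check_speaker_prefixes(source: &str, sink: &mut dyn DiagnosticSink) {
    for (i, line) in source.lines().enumerate() {
        if line.starts_with('*') && !line.contains(':') {
            sink.report(Diagnostic {
                line: i + 1,
                message: "main-tier line has no ':' after the speaker code".to_owned(),
            });
        }
    }
}
```

The check never knows which sink it is talking to, which is what lets one validation engine serve batch output, live printing, and editor diagnostics.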

Merge Pipeline, Domain Types

Status: Draft Last modified: 2026-07-07 21:17 EDT

This page specifies the typed Rust vocabulary shared by chatter merge, chatter speaker-id, the override-file reader/writer, and the adjudication tooling (CLI today; a VS Code or web UI would share the same types). It was originally written before the implementing code, as a deliberate design-first specification against the user contract in chatter merge and chatter speaker-id. The implementation has since shipped, and this page now records the shipped form: where the implementation departed from the original design (the owning crate, several type names, and the schema-v2 per-speaker role map), the affected section says so explicitly instead of silently rewriting history.

The design follows the cross-cutting rules in this repo’s root CLAUDE.md: newtypes over primitives at every stable boundary; no boolean blindness; no tuple-packed seams; typed errors via thiserror; deterministic BTreeMap/BTreeSet over hash maps for serialized state.

Where the types live

The merge-pipeline types live in crates/talkbank-transform/src/speaker_id/, co-located with the algorithms (identify_mapping, apply_mapping) that produce and consume them, and are re-exported at talkbank_transform::speaker_id::* (see that module’s mod.rs). The structural-merge error type (MergeError) lives beside the merge algorithm in crates/talkbank-transform/src/transcript_merge.rs.

Design history. The original design placed the types in a new talkbank-model::merge module, on co-location-with-CHAT-types and lightweight-dependency grounds. That module was never created: the implementation kept the types next to the algorithms whose invariants they encode, in talkbank-transform. talkbank-model still owns the CHAT-domain vocabulary the merge types reference (SpeakerCode, ParticipantRole, ParticipantEntry, IDHeader, ChatFile); a consumer that wants the merge types depends on talkbank-transform, which the CLI, LSP, and desktop app already do.

Designed vs shipped (quick map)

The sections below preserve the original type specification, updated in place for the types most central to the override-file contract. This table maps each designed name to what actually shipped, so a reader grepping the codebase finds the right symbol. All shipped paths are relative to crates/talkbank-transform/src/.

Designed (this page, 2026-05)	Shipped
`InsertedRole`	`InsertedRoleSpec` (`speaker_id/override_file.rs`): on-disk `code` / `tag` strings plus optional `specific_role`
`MappingAction`	`SpeakerAction` (`speaker_id/override_file.rs`): `Rename` / `Drop`
`DecisionMode`	`OverrideMode` (`speaker_id/override_file.rs`): `Auto` / `Explicit` / `Override`
`SpeakerMapping` (single shared `inserted_role`)	On disk: `MergeOverride.mapping` plus the per-speaker `MergeOverride.adult_roles` map (schema v2). In memory: `MappingSpec = HashMap<SpeakerCode, SpeakerAssignment>` (`speaker_id/mapping.rs`), each `Rename` carrying its own code / role / specific-role
`Margin` enum (`Finite` / `Unbounded`)	`ConfidenceMargin(f64)` (`speaker_id/types.rs`); the unbounded case is `f64::INFINITY`, and the on-disk `margin` is a plain number
`JaccardScore` (fallible serde newtype)	`JaccardScore(pub f64)` (`speaker_id/types.rs`), a plain newtype; on-disk scores are bare `f64` values
`ConfidenceThreshold` (associated `DEFAULT`)	`ConfidenceThreshold(pub f64)` (`speaker_id/types.rs`) plus `DEFAULT_CONFIDENCE_THRESHOLD` (`speaker_id/identify.rs`)
`RetainSet` newtype	Not shipped; `merge_chats` takes `retain: &[SpeakerCode]` (`transcript_merge.rs`)
`MergeFlag` enum	Not shipped; `MergeOverride.flags` is `Vec<String>`
`OperatorId` / `SessionId` newtypes	`MergeOverride.operator` is `String`; override entries are keyed by `String` session IDs (a `SessionId` newtype exists in the `speaker_id/judgment/` submodule for the LLM-judgment surface)
`OverrideFile::CURRENT_SCHEMA_VERSION = 1`	Module-level `CURRENT_SCHEMA_VERSION: u32 = 2` (`speaker_id/override_file.rs`)
`SpeakerIdError` / `MergeError` / `OverrideFileError` variant sets	Shipped with revised variants; see the updated Error types section below

Existing types reused (not redefined)

Type	Defined in	Used as
`SpeakerCode`	`talkbank-model::model::header::codes::speaker`	Identifier for `*<CODE>:` speakers, dictionary keys in mappings, `--retain` set elements
`ParticipantRole`	`talkbank-model::model::header::codes::participant`	Role-tag in `@Participants` and `@ID` (`Target_Child`, `Investigator`, `Mother`, etc.)
`ParticipantName`	`talkbank-model::model::header::codes::participant`	Optional participant name in `@Participants`
`ParticipantEntry`	`talkbank-model::model::header::codes::participant`	Single `@Participants` row
`IDHeader`	`talkbank-model::model::header::id`	Single `@ID` row
`ChatFile<S>`	`talkbank-model::model::file::chat_file::core`	The merge stages’ inputs and outputs (parameter `S: ValidationState`)

None of these are redefined; the speaker_id and transcript_merge modules import and reference them.

New types (specification)

The subsections below are the type specification. The ones central to the override-file contract (InsertedRoleSpec, SpeakerAction, the speaker-mapping pair, OverrideMode, MergeOverride, OverrideFile, and the three error enums) have been updated in place to the shipped form. The remaining subsections (JaccardScore, ConfidenceThreshold, Margin, RetainSet, MergeFlag, OperatorId, SessionId) are preserved as the original design; where the shipped form differs (it does for each of those), the designed-vs-shipped table above is authoritative for the current symbol and shape.

`JaccardScore`

A multiset-Jaccard similarity value, by construction in the closed range [0.0, 1.0].

/// Multiset Jaccard similarity between two bags of tokens.
///
/// By construction in [0.0, 1.0]. `JaccardScore::zero()` is the
/// no-overlap point; `JaccardScore::one()` is identical-bag.
///
/// Used by the speaker-id stage to score how well each donor
/// speaker matches a reference anchor's content.
#[derive(Clone, Copy, Debug, PartialEq, PartialOrd, Serialize, Deserialize, JsonSchema)]
#[serde(try_from = "f64", into = "f64")]
pub struct JaccardScore(f64);

impl JaccardScore {
    pub fn new(v: f64) -> Result<Self, JaccardScoreError>;
    pub fn zero() -> Self;
    pub fn one() -> Self;
    pub fn value(self) -> f64;
}

impl Display for JaccardScore { /* "0.735" three-digit */ }
impl TryFrom<f64> for JaccardScore { /* validates range */ }
impl From<JaccardScore> for f64 { /* infallible widen */ }

Construction is fallible: JaccardScore::new(1.5) returns Err(JaccardScoreError::OutOfRange(1.5)). NaN is also rejected. Internal computation that’s guaranteed in-range by construction (the multiset formula) uses an internal from_unchecked private constructor; public API is fallible.
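A minimal sketch of the designed behavior, without the serde and schema derives. Two points are assumptions rather than part of the specification: the `NotANumber` variant name, and the multiset formula (sum of per-token minimum counts over sum of per-token maximum counts, with two empty bags scoring 0.0).

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
pub struct JaccardScore(f64);

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum JaccardScoreError {
    OutOfRange(f64),
    /// Hypothetical variant name; the design only says NaN is rejected.
    NotANumber,
}

impl JaccardScore {
    /// Fallible public constructor: accepts only values in [0.0, 1.0].
    pub fn new(v: f64) -> Result<Self, JaccardScoreError> {
        if v.is_nan() {
            return Err(JaccardScoreError::NotANumber);
        }
        if !(0.0..=1.0).contains(&v) {
            return Err(JaccardScoreError::OutOfRange(v));
        }
        Ok(Self(v))
    }

    pub fn value(self) -> f64 {
        self.0
    }
}

/// Counts token occurrences (the "bag" view of a token list).
fn bag<'a>(tokens: &[&'a str]) -> HashMap<&'a str, usize> {
    let mut counts = HashMap::new();
    for token in tokens {
        *counts.entry(*token).or_insert(0) += 1;
    }
    counts
}

/// Multiset Jaccard under the assumed min/max-count definition.
/// In range by construction, so it builds the newtype directly.
pub fn multiset_jaccard(a: &[&str], b: &[&str]) -> JaccardScore {
    let (bag_a, bag_b) = (bag(a), bag(b));
    let mut intersection = 0usize;
    let mut union = 0usize;
    for (token, &in_a) in &bag_a {
        let in_b = bag_b.get(token).copied().unwrap_or(0);
        intersection += in_a.min(in_b);
        union += in_a.max(in_b);
    }
    for (token, &in_b) in &bag_b {
        if !bag_a.contains_key(token) {
            union += in_b;
        }
    }
    if union == 0 {
        return JaccardScore(0.0); // assumption: two empty bags score zero
    }
    JaccardScore(intersection as f64 / union as f64)
}
```

For the bags {a, a, b} and {a, b, b}, the minimum counts sum to 2 and the maximum counts sum to 4, giving a score of 0.5.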

`ConfidenceThreshold`

The minimum Jaccard margin (winner / loser) the speaker-id stage will auto-accept. By construction in [1.0, ∞): a threshold below 1.0 makes no sense, because it would mean the loser scores higher than the winner, which can’t happen. Default 2.0 per the empirical calibration recorded in chatter speaker-id.

#[derive(Clone, Copy, Debug, PartialEq, PartialOrd, Serialize, Deserialize, JsonSchema)]
#[serde(try_from = "f64", into = "f64")]
pub struct ConfidenceThreshold(f64);

impl ConfidenceThreshold {
    pub const DEFAULT: Self = Self(2.0);
    pub fn new(v: f64) -> Result<Self, ConfidenceThresholdError>;
    pub fn value(self) -> f64;
}

impl Default for ConfidenceThreshold {
    fn default() -> Self { Self::DEFAULT }
}

`Margin`

The decisive ratio between the highest-scoring speaker and the runner-up. Distinguished from ConfidenceThreshold by intent (this is observed; the threshold is configured) and from JaccardScore by range (margin is ≥ 1.0; score is ≤ 1.0).

Uses an enum rather than a bare float to model the divide-by-zero case (runner-up has zero Jaccard) cleanly. Avoids the f64::INFINITY sentinel that doesn’t round-trip through all serializers.

/// Ratio of winning speaker's score to runner-up's score.
///
/// `Finite(r)` for `r >= 1.0`. `Unbounded` when the runner-up
/// has zero score (winner scored anything, runner-up scored
/// nothing). Compares meaningfully against `ConfidenceThreshold`
/// regardless of variant.
#[derive(Clone, Copy, Debug, PartialEq, Serialize, Deserialize, JsonSchema)]
#[serde(untagged)]
pub enum Margin {
    Finite(f64),
    /// Serialized as the JSON/TOML string "unbounded"; never as
    /// f64::INFINITY (which round-trips inconsistently).
    Unbounded,
}

impl Margin {
    pub fn from_scores(winner: JaccardScore, loser: JaccardScore) -> Self;
    pub fn meets(self, threshold: ConfidenceThreshold) -> bool;
}

impl Display for Margin { /* "3.81x" or "∞" */ }
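The two designed methods might look like the sketch below, which uses plain f64 inputs instead of the JaccardScore and ConfidenceThreshold newtypes to stay self-contained. Two choices are assumptions: a margin equal to the threshold counts as meeting it, and a zero runner-up yields `Unbounded` even when the winner is also zero.

```rust
/// Ratio of the winning speaker's score to the runner-up's score.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Margin {
    Finite(f64),
    /// The runner-up scored zero, so the ratio has no finite value.
    Unbounded,
}

impl Margin {
    /// Builds a margin from two scores, assuming `winner >= loser >= 0.0`.
    pub fn from_scores(winner: f64, loser: f64) -> Self {
        if loser == 0.0 {
            // Avoids dividing by zero and avoids an infinity sentinel.
            Margin::Unbounded
        } else {
            Margin::Finite(winner / loser)
        }
    }

    /// True when the observed margin is at least the configured threshold.
    /// An unbounded margin meets every threshold.
    pub fn meets(self, threshold: f64) -> bool {
        match self {
            Margin::Finite(ratio) => ratio >= threshold,
            Margin::Unbounded => true,
        }
    }
}
```

Matching on the variant keeps the divide-by-zero case explicit at every call site, which is the motivation the design gives for preferring an enum over a bare float.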

`RetainSet`

The set of speaker codes specified by --retain on chatter merge. A BTreeSet<SpeakerCode> wrapped in a newtype so the type signatures of merge functions communicate intent. Empty is allowed (means “no speakers come from File 1; File 1 contributes only headers”, a degenerate but legal case).

/// Speakers whose utterances come from the first input to
/// `chatter merge`. All other speakers come from the second
/// input.
#[derive(Clone, Debug, Default, PartialEq, Eq, Serialize, Deserialize, JsonSchema)]
pub struct RetainSet(BTreeSet<SpeakerCode>);

impl RetainSet {
    pub fn new() -> Self;
    pub fn from_iter<I: IntoIterator<Item = SpeakerCode>>(it: I) -> Self;
    pub fn contains(&self, code: &SpeakerCode) -> bool;
    pub fn iter(&self) -> impl Iterator<Item = &SpeakerCode>;
    pub fn is_empty(&self) -> bool;
}

impl FromStr for RetainSet {
    type Err = RetainSetParseError;
    /// Parses `"CHI,SI2"` → `{CHI, SI2}`. Empty entries rejected.
    fn from_str(s: &str) -> Result<Self, Self::Err>;
}
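A sketch of the designed FromStr behavior, using String in place of SpeakerCode so it compiles on its own. The error variant shape, the trimming of surrounding whitespace, and the treatment of a completely empty string as the empty set are assumptions; the design only states that empty entries are rejected and that an empty set is legal.

```rust
use std::collections::BTreeSet;
use std::str::FromStr;

#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct RetainSet(BTreeSet<String>);

/// Hypothetical error shape for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum RetainSetParseError {
    /// The comma-separated entry at this 0-based position was empty.
    EmptyEntry { position: usize },
}

impl RetainSet {
    pub fn contains(&self, code: &str) -> bool {
        self.0.contains(code)
    }

    pub fn is_empty(&self) -> bool {
        self.0.is_empty()
    }
}

impl FromStr for RetainSet {
    type Err = RetainSetParseError;

    /// Parses `"CHI,SI2"` into `{CHI, SI2}`.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Assumption: an empty argument means the (legal) empty set.
        if s.is_empty() {
            return Ok(Self::default());
        }
        let mut codes = BTreeSet::new();
        for (position, raw) in s.split(',').enumerate() {
            let code = raw.trim(); // assumption: surrounding spaces ignored
            if code.is_empty() {
                return Err(RetainSetParseError::EmptyEntry { position });
            }
            codes.insert(code.to_owned());
        }
        Ok(Self(codes))
    }
}
```

Because the backing store is a BTreeSet, duplicate codes collapse and iteration order is deterministic regardless of the order given on the command line.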

`InsertedRoleSpec` (designed as `InsertedRole`)

The CHAT identity recorded for one renamed speaker: a speaker code, a standard role tag, and (only when needed) a specific-role label. A struct rather than separate function arguments because the triple is meaningful as a unit (in TOML override files it serializes as an inline table; on the CLI a CODE:TAG pair parses into one). Shipped in speaker_id/override_file.rs under the name InsertedRoleSpec, with on-disk String fields (this is the serialized form written into override files) rather than the designed SpeakerCode / ParticipantRole newtypes; MergeOverride::to_mapping_spec lifts the strings back into the typed CHAT primitives at the read boundary.

/// Inline-table form of the inserted-role spec recorded in each
/// override entry.
#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize)]
pub struct InsertedRoleSpec {
    /// CHAT speaker code (e.g. `INV`, or `INV1` when disambiguated from
    /// a same-role collision).
    pub code: String,
    /// CHAT standard role tag (e.g. `Investigator`).
    pub tag: String,
    /// Specific-role label for `@Participants`' name/specific-role slot
    /// (e.g. `First_Investigator`), set only when two adults in the same
    /// judgment share `tag` and need the CHAT manual's `CHI1`/`CHI2`-style
    /// disambiguation. `None` for the ordinary single-adult-per-role case.
    #[serde(default, skip_serializing_if = "Option::is_none")]
    pub specific_role: Option<String>,
}

The specific_role field is never operator-typed: it is filled by the same-role auto-disambiguation described under the speaker-mapping section below. On the CLI, --inserted-role INV:Investigator and each OLD=CODE:ROLE assignment in --mapping supply the code / tag pair; both halves are required.
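Parsing a CODE:TAG argument with both halves required might be sketched as below. `RoleSpec`, `RoleSpecParseError`, and `parse_code_tag` are hypothetical names for illustration; the shipped parser and its error type may differ.

```rust
/// Simplified mirror of the on-disk spec: code, tag, optional specific role.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RoleSpec {
    pub code: String,
    pub tag: String,
    pub specific_role: Option<String>,
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum RoleSpecParseError {
    MissingColon,
    EmptyCode,
    EmptyTag,
}

/// Parses `INV:Investigator` into a spec. Both halves are required.
pub fn parse_code_tag(arg: &str) -> Result<RoleSpec, RoleSpecParseError> {
    let (code, tag) = arg
        .split_once(':')
        .ok_or(RoleSpecParseError::MissingColon)?;
    if code.is_empty() {
        return Err(RoleSpecParseError::EmptyCode);
    }
    if tag.is_empty() {
        return Err(RoleSpecParseError::EmptyTag);
    }
    Ok(RoleSpec {
        code: code.to_owned(),
        tag: tag.to_owned(),
        // Never operator-typed: filled later by auto-disambiguation.
        specific_role: None,
    })
}
```

Leaving `specific_role` as `None` at parse time matches the rule that the field is only ever set by the same-role disambiguation step.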

`SpeakerAction` (designed as `MappingAction`)

What happens to a particular speaker in the input. Enum (not boolean) to avoid blindness. Shipped in speaker_id/override_file.rs under the name SpeakerAction.

/// Action applied to one speaker in the input file.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
#[serde(rename_all = "lowercase")]
pub enum SpeakerAction {
    /// Rename the speaker per its own entry in `adult_roles`.
    /// Rewrites speaker codes on every utterance and the
    /// corresponding @Participants and @ID entries.
    Rename,
    /// Remove this speaker's utterances and its @Participants /
    /// @ID rows entirely.
    Drop,
}

The TOML serialization uses "drop" / "rename" lowercase strings, matching the override-file format documented in merge-overrides.md. The design left room for a future RenameTo { code, tag } variant; that never became necessary, because schema v2 instead resolves every Rename through the per-speaker adult_roles map, which carries each speaker’s own target identity (next section).

Speaker mapping: on-disk `mapping` + `adult_roles`, in-memory `MappingSpec` (designed as `SpeakerMapping`)

The decision record produced by the speaker-id stage and consumed by the apply step. Carries enough information to apply deterministically to a ChatFile. The original design was a SpeakerMapping struct with a single shared inserted_role: InsertedRole field and the constraint “all renamed speakers go to the same role in v1 of this schema”. Schema v2 replaced that constraint with a per-speaker role map, and the shipped code splits the concept into an on-disk shape and an in-memory shape.

On disk, two sibling fields of MergeOverride (speaker_id/override_file.rs):

/// Per-donor-speaker-code role assignment, for every speaker whose
/// `mapping` action is `Rename`. Invariant: every `Rename` key in
/// `mapping` has a matching entry here.
pub adult_roles: BTreeMap<String, InsertedRoleSpec>,

/// Map from input speaker codes to actions. Every speaker that
/// exists in the input must appear here.
pub mapping: BTreeMap<String, SpeakerAction>,

Every Rename resolves via that speaker’s own adult_roles entry, so one entry can rename two speakers to two different roles (PAR0 -> INV:Investigator, PAR1 -> FAT:Father). When two adults in the same session are assigned the same role, the writer auto-disambiguates per the CHAT manual’s CHI1/CHI2 convention: numbered speaker codes (INV1, INV2), the shared standard role tag unchanged, and ordinal specific-role labels (First_Investigator, Second_Investigator, falling back to bare numerals past Fourth) recorded in each spec’s specific_role field (speaker_id/judgment/consume.rs, disambiguate_adult_roles). A hand-edited file that records a Rename with no matching adult_roles entry fails closed at replay time with SpeakerIdError::OverrideRenameMissingRole; the sanctioned constructors (MergeOverride::auto_decision, MergeOverride::operator_decision) maintain the covering invariant.
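The covering invariant (every Rename key in mapping has an adult_roles entry) is simple to check. The sketch below uses simplified local types and a hypothetical `renames_missing_role` helper; the shipped code reports the first such gap as SpeakerIdError::OverrideRenameMissingRole rather than returning a list.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SpeakerAction {
    Rename,
    Drop,
}

/// Simplified role spec for this sketch (no specific-role label).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RoleSpec {
    pub code: String,
    pub tag: String,
}

/// Returns, in sorted order, every speaker whose action is `Rename`
/// but who has no entry in `adult_roles`. An empty result means the
/// covering invariant holds.
pub fn renames_missing_role<'a>(
    mapping: &'a BTreeMap<String, SpeakerAction>,
    adult_roles: &BTreeMap<String, RoleSpec>,
) -> Vec<&'a str> {
    mapping
        .iter()
        .filter(|(speaker, action)| {
            **action == SpeakerAction::Rename && !adult_roles.contains_key(*speaker)
        })
        .map(|(speaker, _)| speaker.as_str())
        .collect()
}
```

Speakers marked `Drop` need no role entry, so only `Rename` keys are checked.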

In memory (speaker_id/mapping.rs), the apply step consumes a typed per-speaker assignment map:

/// What to do with a speaker named in the input file.
pub enum SpeakerAssignment {
    /// Drop the speaker entirely.
    Drop,
    /// Rename the speaker to `code` with role tag `role` (and an
    /// optional specific-role label for `@Participants`).
    Rename {
        code: SpeakerCode,
        role: ParticipantRole,
        specific_role: Option<ParticipantName>,
    },
}

/// Operator-supplied mapping from input speaker codes to
/// post-relabeling assignments.
pub type MappingSpec = HashMap<SpeakerCode, SpeakerAssignment>;

MergeOverride::to_mapping_spec converts the on-disk pair into a MappingSpec for apply_mapping; parse_mapping_spec builds one directly from the CLI --mapping string. The on-disk contract requires every speaker that exists in the input to appear in mapping (we want every decision to be explicit). Note a shipped gap: apply_mapping currently passes through unchanged any speaker absent from the in-memory MappingSpec; enforcing the every-input-speaker precondition at apply time is a documented follow-up (speaker_id/apply.rs).
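To make the pass-through gap concrete, here is a toy model of applying a mapping to a sequence of utterance speaker codes. `Assignment` and `apply_to_speakers` are hypothetical and operate on strings only; the real apply_mapping also rewrites @Participants and @ID rows on a ChatFile.

```rust
use std::collections::HashMap;

/// Simplified per-speaker assignment (role fields omitted).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Assignment {
    Drop,
    Rename { code: String },
}

/// Applies `spec` to the speaker code of each utterance, in order.
/// Dropped speakers vanish, renamed speakers get their new code, and
/// speakers absent from `spec` pass through unchanged (the gap noted above).
pub fn apply_to_speakers(
    utterance_speakers: &[&str],
    spec: &HashMap<String, Assignment>,
) -> Vec<String> {
    utterance_speakers
        .iter()
        .filter_map(|speaker| match spec.get(*speaker) {
            Some(Assignment::Drop) => None,
            Some(Assignment::Rename { code }) => Some(code.clone()),
            None => Some((*speaker).to_owned()),
        })
        .collect()
}
```

Closing the gap would amount to turning the final `None` arm into an error, so that an unmapped input speaker is rejected instead of silently kept.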

`OverrideMode` (designed as `DecisionMode`)

How a MergeOverride entry came to exist. Three variants matching the three speaker-id operation modes. Shipped in speaker_id/override_file.rs under the name OverrideMode.

/// How a speaker-id decision was made.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
#[serde(rename_all = "lowercase")]
pub enum OverrideMode {
    /// Reference-mode auto-decide above confidence threshold.
    Auto,
    /// Operator-supplied `--mapping` (typically after a low-confidence
    /// reference-mode attempt).
    Explicit,
    /// Replay of a prior decision read from another override file.
    Override,
}

`MergeFlag`

Extensible operator-supplied flags on an override entry. Closed variants for known cases plus a Custom(String) escape hatch.

#[derive(Clone, Debug, PartialEq, Eq, Serialize, Deserialize, JsonSchema)]
#[serde(rename_all = "kebab-case")]
pub enum MergeFlag {
    /// ASR diarization mixed multiple real-world roles into one
    /// speaker label. The rename may still be the best available
    /// approximation but the output is imperfect.
    DiarizationMixed,
    /// The operator could not confidently determine which speaker
    /// is which; mapping is best-guess.
    BestGuess,
    /// Open variant for contributor-specific flag vocabulary.
    /// Serializes as the inner string verbatim.
    #[serde(untagged)]
    Custom(String),
}

`OperatorId`

Who made the decision. String newtype.

string_newtype!(
    /// Identifier of the operator who created an override entry.
    /// Free-form; typically a username or initials. Recorded as
    /// audit trail.
    pub struct OperatorId;
);

`SessionId`

Identifies an entry within an override file. Typically the basename stem of the input CHAT file, but the override-file schema doesn’t constrain its shape; contributors may use any stable identifier they like (<participant>-<timepoint>, <recording-id>, etc.).

string_newtype!(
    /// Identifies a session within an override file. Free-form
    /// stable string; typically the CHAT-file basename stem.
    pub struct SessionId;
);

`MergeOverride`

A single per-session decision record. The unit of operator adjudication. As shipped (speaker_id/override_file.rs):

/// A single override-file entry: the operator decision for one
/// session. See `merge-overrides.md` for field semantics.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct MergeOverride {
    /// How the decision was made.
    pub mode: OverrideMode,

    /// Per-donor-speaker-code role assignment, for every speaker
    /// whose `mapping` action is `Rename` (schema v2; see the
    /// speaker-mapping section above).
    pub adult_roles: BTreeMap<String, InsertedRoleSpec>,

    /// Map from input speaker codes to actions. Every speaker that
    /// exists in the input must appear here.
    pub mapping: BTreeMap<String, SpeakerAction>,

    /// Per-speaker Jaccard scores recorded at decision time.
    /// Present for `Auto` (and `Explicit` decisions that followed a
    /// low-confidence reference-mode attempt).
    #[serde(skip_serializing_if = "BTreeMap::is_empty", default)]
    pub scores: BTreeMap<String, f64>,

    /// Winner-score / runner-up-score margin. Serialized as a
    /// number; the divide-by-zero case is `f64::INFINITY`.
    #[serde(skip_serializing_if = "Option::is_none", default)]
    pub margin: Option<f64>,

    /// Free-form identifier of the operator who made the decision.
    pub operator: String,

    /// When the decision was made (RFC 3339).
    pub decided_at: DateTime<Utc>,

    /// Free-text operator note. Strongly recommended for `Explicit`
    /// and `Override` modes.
    #[serde(skip_serializing_if = "Option::is_none", default)]
    pub note: Option<String>,

    /// Operator-supplied audit flags (e.g. `"diarization-mixed"`,
    /// `"best-guess"`).
    #[serde(skip_serializing_if = "Vec::is_empty", default)]
    pub flags: Vec<String>,

    /// Which engine produced this decision. Absent in pre-provenance
    /// files, which deserialize as `Deterministic`.
    #[serde(default)]
    pub engine: DecisionEngine,

    /// LLM audit trail; present only for `engine = Llm` decisions.
    #[serde(default, skip_serializing_if = "Option::is_none")]
    pub judgment: Option<JudgmentProvenance>,
}

The struct embeds the timestamp via chrono::DateTime<Utc>; serde serializes to RFC 3339 (2026-05-27T08:41:00Z) by default. TOML preserves this format faithfully. The engine / judgment provenance fields postdate the original design (they record whether a decision was deterministic or LLM-made; see speaker_id/provenance.rs); they were added without a schema bump because they are backward compatible in both directions, as documented in merge-overrides.md.

`OverrideFile`

The top-level container. Holds schema version + per-session entries. Read from / written to disk as TOML.

/// Current schema version supported by this binary (module-level
/// const in `speaker_id/override_file.rs`). Readers refuse files
/// with any other value; there is no implicit version, no fallback,
/// no auto-migration. Bumped from 1 to 2 for the per-speaker
/// `adult_roles` map (was `inserted_role`, a single shared field).
pub const CURRENT_SCHEMA_VERSION: u32 = 2;

/// The full override-file document.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct OverrideFile {
    /// Schema version. Currently 2. Always `CURRENT_SCHEMA_VERSION`
    /// when this binary writes; readers reject other values with a
    /// typed error rather than guessing.
    pub schema_version: u32,

    /// Per-session entries, alphabetically ordered by session ID
    /// via the `BTreeMap` default.
    #[serde(flatten)]
    pub entries: BTreeMap<String, MergeOverride>,
}

impl OverrideFile {
    /// Read an override file from disk, or return an empty default
    /// (with the current schema version) if the path does not
    /// exist. Refuses any `schema_version != CURRENT_SCHEMA_VERSION`.
    /// Used by the `--write-override` append flow.
    pub fn read_or_default(path: &Path) -> Result<Self, OverrideFileError>;

    /// Serialize to TOML and write via `.tmp` + rename, so a crash
    /// mid-write leaves the prior file intact rather than truncated.
    pub fn write(&self, path: &Path) -> Result<(), OverrideFileError>;

    /// Insert (or replace) the entry for `session_id`.
    pub fn upsert(&mut self, session_id: String, entry: MergeOverride);

    pub fn get(&self, session_id: &str) -> Option<&MergeOverride>;
}

(The designed standalone read never shipped; read_or_default is the single read path, and the designed insert shipped as upsert. Iteration helpers session_ids, auto_entries, and llm_entries were added for diagnostics, the post-merge sanity scan, and LLM audits respectively.)
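The write-via-temporary-file and read-or-default behaviors can be sketched with the standard library alone. These free functions work on raw text and are hypothetical; in particular, appending `.tmp` to the full file name is an assumption about how the temporary path is chosen.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Writes `contents` to `<path>.tmp`, then renames it over `path`.
/// A crash before the rename leaves the previous file untouched.
pub fn write_atomically(path: &Path, contents: &str) -> io::Result<()> {
    let mut tmp_name = path.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp_path = PathBuf::from(tmp_name);
    fs::write(&tmp_path, contents)?;
    fs::rename(&tmp_path, path)
}

/// Reads `path`, or returns `default` when the file does not exist.
/// Any other I/O failure is reported to the caller.
pub fn read_or_default(path: &Path, default: &str) -> io::Result<String> {
    match fs::read_to_string(path) {
        Ok(text) => Ok(text),
        Err(err) if err.kind() == io::ErrorKind::NotFound => Ok(default.to_owned()),
        Err(err) => Err(err),
    }
}
```

The rename is only atomic when the temporary file and the target live on the same filesystem, which is why the sketch places the temporary file next to the target.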

The #[serde(flatten)] on entries means the on-disk TOML is flat tables keyed by session ID (as shown in the speaker-id.md schema):

schema_version = 2

[NF203-2]
mode = "auto"
adult_roles = { PAR0 = { code = "INV", tag = "Investigator" } }
# ...

rather than nested under an [entries] table.

Error types

Three thiserror-based enums cover the merge pipeline’s failure modes. Each variant carries enough information for the CLI to produce a useful diagnostic and for callers to pattern-match behavior.

`SpeakerIdError`

As shipped (speaker_id/error.rs; several designed variant names changed, and the low-confidence payload became the full DonorMatchReport rather than loose fields):

#[derive(Debug, thiserror::Error)]
pub enum SpeakerIdError {
    /// The `--mapping` spec couldn't be parsed.
    #[error("invalid --mapping spec: {0}")]
    InvalidMappingSpec(String),

    /// Reference mode: no utterances for the requested anchor
    /// speaker in the reference transcript.
    #[error("reference transcript has no utterances for anchor speaker {anchor}")]
    ReferenceMissingAnchor { anchor: SpeakerCode },

    /// Reference mode: fewer than two distinct donor speakers, so
    /// there is nothing for multiset-Jaccard to choose between.
    DonorTooFewSpeakers { speakers: Vec<SpeakerCode> },

    /// Reference mode: winner-to-runner-up margin below the
    /// confidence threshold; the auto-decision is refused.
    LowConfidence {
        /// Full match report: would-be winner, per-speaker scores,
        /// margin. `--write-pending` records it for adjudication.
        report: DonorMatchReport,
        threshold: ConfidenceThreshold,
    },

    /// Override-file replay: the requested session ID is not in the
    /// override file; the available IDs are surfaced.
    SessionIdNotFound { session_id: String, available: Vec<String> },

    /// Override-file replay: a `Rename` action with no matching
    /// `adult_roles` entry (hand-corrupted file); fails closed.
    OverrideRenameMissingRole { speaker: SpeakerCode },

    /// Underlying parse error from the input file.
    #[error("parse error: {0}")]
    Parse(#[from] PipelineError),
}

The LowConfidence variant is the only “soft” failure: the caller (CLI) maps it to exit code 4 and prints the scores. Parse maps to exit 1 (invalid input); every other variant maps to exit 2 (precondition violation) per the user-guide contract. The mapping is the CLI layer’s job; SpeakerIdError itself just classifies the failure mode. (The designed SpeakerNotInMapping / MappingSpeakerNotInInput variants did not ship: apply_mapping currently passes through speakers absent from the mapping unchanged, and enforcing the every-input-speaker precondition is a documented follow-up in speaker_id/apply.rs. The designed OverrideIo wrapping also did not ship; override-file I/O failures surface as OverrideFileError directly.)
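The exit-code mapping described above is a single match. The sketch uses a payload-free `SpeakerIdFailure` enum as a stand-in for SpeakerIdError, and `exit_code` is a hypothetical name for what the CLI layer does.

```rust
/// Payload-free stand-in for `SpeakerIdError`, one variant per failure mode.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SpeakerIdFailure {
    InvalidMappingSpec,
    ReferenceMissingAnchor,
    DonorTooFewSpeakers,
    LowConfidence,
    SessionIdNotFound,
    OverrideRenameMissingRole,
    Parse,
}

/// Maps a failure mode to the process exit code described in the text:
/// 4 for the soft low-confidence refusal, 1 for invalid input, and
/// 2 for every other precondition violation.
pub fn exit_code(failure: SpeakerIdFailure) -> u8 {
    match failure {
        SpeakerIdFailure::LowConfidence => 4,
        SpeakerIdFailure::Parse => 1,
        // Listed explicitly so a newly added variant forces a decision here.
        SpeakerIdFailure::InvalidMappingSpec
        | SpeakerIdFailure::ReferenceMissingAnchor
        | SpeakerIdFailure::DonorTooFewSpeakers
        | SpeakerIdFailure::SessionIdNotFound
        | SpeakerIdFailure::OverrideRenameMissingRole => 2,
    }
}
```

Avoiding a wildcard arm means the compiler flags this function whenever the error enum grows, keeping the exit-code contract deliberate.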

`MergeError`

As shipped (transcript_merge.rs; the designed RetainSet payload is a Vec<SpeakerCode>, and a fifth precondition variant, ParticipantAlreadyDeclared, was added for the dedupe-on-insert rule on @Participants):

#[derive(Debug, thiserror::Error)]
pub enum MergeError {
    /// File 1 declares no utterances for any speaker in the retain
    /// set; the merge would produce a degenerate output.
    RetainSpeakersMissing { retain: Vec<SpeakerCode> },

    /// File 1 has retained-speaker utterances but none carry a time
    /// bullet; no shared timeline to merge against.
    NoTimelineInFile1,

    /// File 2 (the donor) declares an `@Languages` code not present
    /// in File 1's set. Donor under-claiming is fine; donor
    /// over-claiming is refused (see below).
    LanguageMismatch {
        file1: LanguageCodes,
        file2: LanguageCodes,
    },

    /// A speaker code outside the retain set appears in both files'
    /// utterances; no rule to choose between the two versions.
    AmbiguousSpeaker { speaker: SpeakerCode },

    /// A donor participant code (outside --retain) is already
    /// declared in File 1 with real utterances or conflicting
    /// metadata; silent dedupe would discard content or paper over
    /// an identity mismatch.
    ParticipantAlreadyDeclared {
        speaker: SpeakerCode,
        file1_role: ParticipantRole,
        donor_role: ParticipantRole,
    },

    /// Underlying parse error from either input file.
    #[error("parse error: {0}")]
    Parse(#[from] PipelineError),
}

Two shipped rules worth calling out because they refine the designed “exact @Languages match” and “concatenate @Participants” contracts:

@Languages is donor-subset matching, not exact equality. File 2 (the donor, typically ASR output) may declare a subset of File 1’s languages (an ASR run in a fixed language mode under-claims; that is expected). Only donor over-claiming, a donor language absent from File 1, raises LanguageMismatch, since it may signal a wrong-file pairing or a language the annotator missed.
@Participants insertion dedupes. A donor entry whose speaker code File 1 already declares is silently skipped (not inserted twice) when File 1’s declaration is vestigial: zero utterances under that code, and role/name metadata matching the donor’s. If File 1 has real utterances under the code, or the two declarations disagree, the merge refuses with ParticipantAlreadyDeclared instead. The same dedupe set filters the inserted @ID rows.
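The donor-subset rule for @Languages can be sketched with plain sets. This is a minimal illustration, not the shipped code: `check_donor_languages`, the local `LanguageMismatch` struct, and the `langs` helper are hypothetical names, and language codes are modeled as `String` rather than the real `LanguageCodes` type.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for the `LanguageMismatch` precondition failure:
/// records only the donor codes that File 1 does not declare.
#[derive(Debug, PartialEq)]
pub struct LanguageMismatch {
    pub extra_in_donor: BTreeSet<String>,
}

/// Ok when every donor language is also declared by File 1.
/// Under-claiming (donor is a strict subset) is accepted; over-claiming is not.
pub fn check_donor_languages(
    file1: &BTreeSet<String>,
    donor: &BTreeSet<String>,
) -> Result<(), LanguageMismatch> {
    let extra: BTreeSet<String> = donor.difference(file1).cloned().collect();
    if extra.is_empty() {
        Ok(())
    } else {
        Err(LanguageMismatch { extra_in_donor: extra })
    }
}

/// Illustration-only helper: build a language set from string literals.
pub fn langs(codes: &[&str]) -> BTreeSet<String> {
    codes.iter().map(|c| c.to_string()).collect()
}
```

With File 1 declaring `eng, spa`, a donor declaring only `eng` passes; with File 1 declaring `eng`, a donor declaring `eng, yue` is refused and `yue` is reported.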

`OverrideFileError`

Independent enum because override-file I/O is also called by non-speaker-id code paths (the adjudication tool, future UIs). As shipped (speaker_id/override_file.rs), it is leaner than the designed five-variant version: read/write/parse failures collapse into Io and Toml, and found is an Option<u32> so a missing schema_version field is reported distinctly from a wrong one:

#[derive(Debug, thiserror::Error)]
pub enum OverrideFileError {
    /// The file's `schema_version` is missing or not equal to
    /// `CURRENT_SCHEMA_VERSION` (currently 2). The binary refuses to
    /// interpret unknown versions rather than risk silent misreads.
    #[error("unsupported override-file schema_version {found:?}; this binary supports {supported}")]
    UnsupportedSchemaVersion {
        /// The schema version as read from the file (None if the
        /// field was absent entirely).
        found: Option<u32>,
        /// The schema version this binary supports.
        supported: u32,
    },

    /// I/O error reading or writing the file.
    #[error("override-file I/O error: {0}")]
    Io(#[from] std::io::Error),

    /// TOML parse / serialize error.
    #[error("override-file TOML error: {0}")]
    Toml(String),
}

(The designed NotFound variant is unnecessary: read_or_default treats a missing file as the empty-file default, and any other I/O failure surfaces through Io.)

Module layout

As shipped. The design’s talkbank-model/src/merge/ layout was never created (see “Where the types live”); the real layout is:

crates/talkbank-transform/src/speaker_id/
    mod.rs             pub re-exports (the crate-facing surface)
    types.rs           JaccardScore, ConfidenceMargin, ConfidenceThreshold
    mapping.rs         MappingSpec, SpeakerAssignment, parse_mapping_spec
    identify.rs        identify_mapping, DonorMatchReport,
                       DEFAULT_CONFIDENCE_THRESHOLD
    apply.rs           apply_mapping, apply_mapping_chat
    override_file.rs   CURRENT_SCHEMA_VERSION, OverrideMode,
                       SpeakerAction, InsertedRoleSpec, MergeOverride,
                       OverrideFile, OverrideFileError
    provenance.rs      DecisionEngine, JudgmentProvenance, ModelId, ...
    error.rs           SpeakerIdError
    judgment/          LLM holistic-judgment surface (sampling, prompt
                       rendering, provider, consume; home of the
                       adult_roles same-role auto-disambiguation)

crates/talkbank-transform/src/transcript_merge.rs
    merge_chats, MergeError, DEFAULT_STRIP_TIERS

Each file aims for the ≤400-line target; concerns that outgrew a single file (the LLM judgment surface) became the judgment/ subdirectory, exactly the split-further move this section anticipated.

Type design rules followed

A spot-check against the cross-cutting design rules in this repo’s root CLAUDE.md, restated against the shipped code:

Newtypes over primitives. Every numeric domain value (JaccardScore, ConfidenceMargin, ConfidenceThreshold) is wrapped; CHAT-domain strings reuse the existing SpeakerCode / ParticipantRole / ParticipantName wrappers. (The designed SessionId / OperatorId newtypes shipped as plain String at the on-disk serialization boundary; see the designed-vs-shipped table.) ✓
No tuple-packed seams. InsertedRoleSpec is a struct, not (code, tag); SpeakerAssignment::Rename carries named fields; MergeOverride likewise. ✓
No boolean blindness. SpeakerAction and OverrideMode are enums, not bools. (The designed Margin::Finite/Unbounded enum shipped as ConfidenceMargin(f64) with f64::INFINITY for the unbounded case, a deliberate simplification recorded in the table.) ✓
Typed errors. Three thiserror enums (SpeakerIdError, MergeError, OverrideFileError) with named-field variants carrying full context. ✓
Deterministic seams. BTreeMap for every serialized collection (adult_roles, mapping, scores, entries). The in-memory MappingSpec is a HashMap; it is never serialized directly. ✓
Module browseability. One file per concern in speaker_id/, with the LLM judgment surface split into its own judgment/ subdirectory. ✓
Default impls present where meaningful. DEFAULT_CONFIDENCE_THRESHOLD (2.0); OverrideFile::default() for the empty-file case. ✓
Display impls present where user-visible. JaccardScore, ConfidenceMargin, ConfidenceThreshold. ✓
Parse functions at the CLI boundary, not regex hacks in command code. parse_mapping_spec for --mapping; the CODE:ROLE pair parse for --inserted-role. ✓
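As an illustration of the "newtypes over primitives" rule, here is a minimal sketch of a range-checked score wrapper. It is not the shipped `JaccardScore`: the error type `ScoreOutOfRange`, the `value` accessor, and the three-decimal `Display` format are assumptions made for the example.

```rust
use std::fmt;

/// Illustrative newtype: a Jaccard score constrained to [0.0, 1.0].
#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
pub struct JaccardScore(f64);

/// Hypothetical error for a NaN or out-of-range input.
#[derive(Debug, PartialEq)]
pub struct ScoreOutOfRange(pub f64);

impl JaccardScore {
    /// The only constructor, so every value in circulation is in range.
    pub fn new(value: f64) -> Result<Self, ScoreOutOfRange> {
        // NaN fails every ordered comparison, so the range check rejects it too.
        if (0.0..=1.0).contains(&value) {
            Ok(Self(value))
        } else {
            Err(ScoreOutOfRange(value))
        }
    }

    /// Read-only access to the wrapped float.
    pub fn value(self) -> f64 {
        self.0
    }
}

impl fmt::Display for JaccardScore {
    /// Assumed format for user-visible output (three decimals).
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{:.3}", self.0)
    }
}
```

Because validation happens once at construction, downstream code can take a `JaccardScore` and never re-check the range.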

Decisions on the seven open questions

Resolved 2026-05-27, captured here so implementers don’t re-litigate.

1. `JaccardScore` representation: `f64`

Multiset Jaccard J(A, B) = sum_w min(A[w], B[w]) / sum_w max(A[w], B[w]) is computed from u64 token counts, which fit in f64’s 53-bit mantissa for any plausible CHAT bag-of-words. The division is inexact in general but IEEE 754 makes it bit-deterministic given the same inputs across every platform that implements 754 (all of ours: Windows, macOS, Linux, x86_64, arm64).

The bit-deterministic reproducibility property is load-bearing because the override-file audit trail records scores; a researcher re-running speaker-id years later on the same inputs must compute the same score to verify the decision. f64 arithmetic provides this for free given workspace platform constraints. Document the property in the type’s rustdoc.

A rational u64/u64 representation was considered for “true” reproducibility, but it adds boilerplate and a comparison-against-threshold operation that loses the same precision in the end (the threshold is a ratio too). Rejected.

2. `DateTime<Utc>` crate: `chrono`

The workspace already pins chrono = "0.4" at the root Cargo.toml. The merge code (in talkbank-transform) uses the workspace version verbatim via chrono = { workspace = true }. No new datetime dep.

The “succession-aware” rule from the workspace-root CLAUDE.md contributor guide (outside the book) and the analogous feedback_no_terraform_only_opentofu discipline from operator memory both say the same thing: do not fragment the ecosystem by introducing a second tool when a workspace tool already does the job. jiff is a fine library, but adopting it for one new module would mean two datetime crates in tree.

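Override-file timestamps serialize as RFC 3339 UTC. With chrono’s serde feature enabled, the default Serialize/Deserialize impls for DateTime<Utc> already read and write RFC 3339 strings, so no #[serde(with = ...)] attribute is needed (the chrono::serde::ts_* helper modules are for integer Unix timestamps, not RFC 3339).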

3. TOML library: `toml` (the workspace-pinned crate)

The workspace already pins toml = "^1.1.2". That crate both reads and writes, so there is no need to combine toml and toml_edit for the v1 override-file format.

toml_edit was considered for its formatting/comment preservation across in-place edits. The case for it is hypothetical right now: override files are primarily machine-written by chatter speaker-id --write-override; human edits exist but are not the dominant workflow. The cost of toml_edit is the second TOML dep (workspace churn, plus the friction every contributor pays parsing TOML through one API and writing through another).

If a workflow emerges where operators heavily hand-edit override files and lose formatting on each batch re-run, swap to toml_edit then. Defer.

4. `MergeOverride::flags`: `Vec<MergeFlag>`

Operator-supplied flags are semantically set-like (each flag present or absent), but Vec is the right representation because:

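MergeFlag includes a Custom(String) #[serde(untagged)] variant. A set representation (for example BTreeSet<MergeFlag>) needs an Ord on this enum that compares the discriminant and then the inner string, and that ordering would then decide on-disk order and have to stay stable. Doable but adds maintenance load.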
The order of flags in the on-disk file isn’t load-bearing for correctness; deterministic single-source-write produces a deterministic Vec.
Duplicates are noise but not corrupting. Document in the field’s rustdoc that consumers should treat the list as a set (deduplicate before comparing).

The writer (speaker-id --write-override path) inserts flags in a deterministic order, so the on-disk Vec is fully reproducible. If a hand-edited file has an out-of-order or duplicated flag list, that shows up as non-corrupting noise in subsequent diffs, which is acceptable.

5. `SpeakerMapping::assignments`: `BTreeMap<SpeakerCode, MappingAction>`

Confirmed. BTreeMap gives:

One-action-per-speaker by construction (no duplicate keys).
Deterministic serialization order (alphabetical by SpeakerCode).
Cheap membership tests during apply.

The CLAUDE.md “no tuple-packed seams” rule targets raw tuples as struct fields or function arguments. A BTreeMap’s internal key-value pairing is not a domain seam exposed to the API; it’s the representation. Approved.

(As shipped, this decision holds for the serialized shape: MergeOverride.mapping is BTreeMap<String, SpeakerAction> and MergeOverride.adult_roles is BTreeMap<String, InsertedRoleSpec>. The in-memory MappingSpec is a HashMap because it is never serialized directly.)

6. Schema versioning policy: strict refuse-with-clear-error

The reader (OverrideFile::read_or_default, as shipped) refuses any schema_version != CURRENT_SCHEMA_VERSION with a typed OverrideFileError::UnsupportedSchemaVersion { found, supported }. No automatic migration.

This is the conservative default. Reasons:

We have no upgrade history yet; building a migration framework for a problem that doesn’t exist is premature abstraction (CLAUDE.md “Always Fix Root Causes” + the general “no premature abstraction” instinct).
The override file is fundamentally a record of operator decisions. If the schema breaks, operators re-adjudicate; the prior file becomes a historical artifact that can be read by scripts with old binaries.
When a real schema change lands and there is real upgrade friction, that’s the moment to write a one-shot migration (chatter merge migrate-overrides --from <path> --to <path>). Until that happens, premature migration code is dead weight.

Document this in the reader’s rustdoc so the policy is explicit to callers. The policy has since been exercised for real: the 2026-07 v1 -> v2 bump (the per-speaker adult_roles map) was a breaking, non-migrating change exactly as designed here; v1 files are refused and their sessions re-adjudicated. The version-to-version diff and migration instructions live in merge-overrides.md.
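The strict-refuse policy amounts to a single match on the version as read. The sketch below is a minimal illustration: `check_schema_version` and the free-standing `UnsupportedSchemaVersion` struct are hypothetical, and the constant's value of 2 is taken from the "currently 2" note on the shipped error variant.

```rust
/// Assumed value, per the "currently 2" note on the shipped error variant.
pub const CURRENT_SCHEMA_VERSION: u32 = 2;

/// Hypothetical stand-in for `OverrideFileError::UnsupportedSchemaVersion`.
#[derive(Debug, PartialEq)]
pub struct UnsupportedSchemaVersion {
    /// `None` when the field was absent from the file.
    pub found: Option<u32>,
    pub supported: u32,
}

/// Accept exactly the current version; refuse everything else, including a
/// missing field. No migration is attempted.
pub fn check_schema_version(found: Option<u32>) -> Result<(), UnsupportedSchemaVersion> {
    match found {
        Some(v) if v == CURRENT_SCHEMA_VERSION => Ok(()),
        other => Err(UnsupportedSchemaVersion {
            found: other,
            supported: CURRENT_SCHEMA_VERSION,
        }),
    }
}
```

Keeping `found` as an `Option<u32>` lets the error distinguish "no version field" from "wrong version" without a sentinel value.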

7. Where the `--mapping` parser lives: beside the mapping type

parse_mapping_spec("PAR0=drop,PAR1=INV:Investigator") -> Result<MappingSpec, SpeakerIdError> lives alongside the MappingSpec type it returns, in talkbank_transform::speaker_id::mapping as shipped (the design said talkbank-model::merge::mapping; the parser moved with the types when they landed in talkbank-transform, see “Where the types live”).

Why:

The spec format is part of the type’s contract. A reader looking for “how do I construct a MappingSpec from a string?” should find the answer where the type is defined, not in the consumer CLI crate.
A future non-CLI consumer (HTTP API, library wrapper, scripting binding) wants the same parser without re-implementing or depending on chatter.
talkbank-transform has no CLI-framework dependency (no clap), but a free function returning Result<MappingSpec, _> doesn’t need one. The clap value-parser in chatter becomes a thin shim over parse_mapping_spec.

If at some point a SECOND mapping syntax becomes useful (e.g., JSON-inline, or a TOML fragment), add a parse_mapping_json sibling rather than reshaping parse_mapping_spec. The existing parser stays the lingua franca.
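To make the spec format concrete, here is a minimal sketch of a parser for the `SPEAKER=drop` / `SPEAKER=CODE:Role` syntax. It is not the shipped `parse_mapping_spec`: the `Assignment` and `MappingParseError` types, the function name `parse_mapping`, and the choice to reject duplicate speakers and empty fields are all assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical per-speaker action (the shipped type is `SpeakerAssignment`).
#[derive(Debug, Clone, PartialEq)]
pub enum Assignment {
    Drop,
    Rename { code: String, role: String },
}

/// Hypothetical error type; each variant carries the offending text.
#[derive(Debug, PartialEq)]
pub enum MappingParseError {
    MissingEquals(String),
    MissingRole(String),
    EmptyField(String),
    DuplicateSpeaker(String),
}

/// Parse a spec such as "PAR0=drop,PAR1=INV:Investigator".
pub fn parse_mapping(spec: &str) -> Result<HashMap<String, Assignment>, MappingParseError> {
    let mut out = HashMap::new();
    for item in spec.split(',') {
        // Each item is SPEAKER=ACTION.
        let (speaker, action) = item
            .split_once('=')
            .ok_or_else(|| MappingParseError::MissingEquals(item.to_string()))?;
        if speaker.is_empty() || action.is_empty() {
            return Err(MappingParseError::EmptyField(item.to_string()));
        }
        let assignment = if action == "drop" {
            Assignment::Drop
        } else {
            // Anything other than "drop" must be CODE:Role.
            let (code, role) = action
                .split_once(':')
                .ok_or_else(|| MappingParseError::MissingRole(item.to_string()))?;
            if code.is_empty() || role.is_empty() {
                return Err(MappingParseError::EmptyField(item.to_string()));
            }
            Assignment::Rename { code: code.to_string(), role: role.to_string() }
        };
        // Assumed rule: one action per speaker.
        if out.insert(speaker.to_string(), assignment).is_some() {
            return Err(MappingParseError::DuplicateSpeaker(speaker.to_string()));
        }
    }
    Ok(out)
}
```

Because the function takes a `&str` and returns a plain `Result`, a CLI value-parser, an HTTP handler, or a scripting binding can all call it without pulling in a CLI framework.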

These decisions are the design baseline going into spec authoring and implementation. Future revisions to any of them require an explicit doc update plus a deprecation/migration plan, not a silent change in the implementation.

Relationship to specs and tests

The design intended a spec entry in spec/constructs/merge-types/ per type/invariant pair, regenerated into Rust tests via the spec/tools generators. That directory was never created. As shipped, the behavioral invariants are pinned directly by the Rust test suites, following the layered scheme in the Test Plan: transform-level tests (crates/talkbank-transform/tests/speaker_id_tests.rs, transcript_merge_tests.rs, adjudication_tests.rs), CLI subprocess tests (crates/chatter/tests/merge_tests.rs, speaker_id_tests.rs, adjudication_tests.rs), and per-module #[cfg(test)] unit tests beside the types themselves (e.g. the round-trip and per-speaker-role tests in speaker_id/override_file.rs). Folding the fragment-level cases (token cleaning, Jaccard goldens) into spec/constructs/ remains an open option, not a shipped mechanism.

Merge Pipeline, Test Plan

Status: Draft Last modified: 2026-07-07 21:17 EDT

This page is the test-coverage roadmap for the new merge pipeline (chatter speaker-id + chatter merge + chatter adjudicate + the override-file format + the underlying talkbank-transform::speaker_id types). It exists because, per this repo’s root CLAUDE.md red/green TDD rule, every new feature starts with failing tests at the highest level the feature lives at, and we want to enumerate those tests before writing the implementation, so coverage is designed, not discovered.

This is a plan, not yet code. When the implementation work begins, every test case below becomes a real test; the doc then flips to a coverage matrix that gets kept honest by CI.

TDD discipline, what “strict red/green” means here

Every cycle of impl-phase work is:

RED. Write ONE failing test at the highest layer the feature lives at. The test exercises a real user-observable behavior, not an internal helper. Commit the failing test alone (or stage it before any code change) and verify that it fails for the right reason (the missing behavior), not because of a compile error or a typo.
GREEN. Write the smallest code change that makes the test pass. No anticipating future tests, no scaffolding for tests that don’t yet exist. The codebase should compile and pass tests at this point.
REFACTOR. With the green test as the safety net, tighten the implementation: extract helpers, rename for clarity, replace primitives with newtypes, document tricky parts. Tests stay green throughout.
DRILL DOWN if needed. If the L3 (or L2) test passes but pinned the behavior less precisely than the contract requires (e.g., the L3 test asserts “exit 2 with some error” but the contract says “the specific MergeError variant must match”), add an L2 (or L1) test next that drills into the precise path. The drilled test FAILS at first against the green-but-imprecise impl, motivating the tighter impl.

Cycles must be atomic: one RED → one GREEN → optional REFACTOR → optional drill-down. Do not stack multiple tests on top of a single impl change; do not write impl ahead of tests. The discipline matters because the bug bar of this pipeline is high (CHAT-data byte-stable preservation, audit-trail reproducibility) and TDD is the cheapest way to catch regressions before they ship.

Three test layers + the adjudication layer

The merge pipeline’s behavior spans four substrates with different testing mechanisms.

Layer	Substrate	Why tests live here
L1, Spec / fragment	`spec/constructs/speaker-id/` → current `spec/tools` generators	Token-cleaner behavior on CHAT fragments (markup strip for Jaccard scoring). Same mechanism that pins parser/grammar tests; regenerated regression.
L2, Transform / AST	`crates/talkbank-transform/tests/`	Pure-Rust tests over parsed `ChatFile` values. `identify_mapping`, `apply_mapping`, `merge`, `run_adjudication` semantics on hand-built or parsed CHAT inputs. No process boundary.
L3, CLI / subprocess	`crates/chatter/tests/merge_tests.rs` (new)	End-to-end behavior of `chatter speaker-id`, `chatter merge`, and `chatter adjudicate` invoked as subprocesses (`assert_cmd` + `predicates`). Exit codes, flag parsing, file I/O, stderr formats.
L4, Scripted adjudication	`crates/talkbank-transform/tests/adjudication_tests.rs` + scripted prompter	Operator-decision paths in `chatter adjudicate`. Uses `ScriptedPrompter` injecting synthetic operator choices. See Adjudication Workflow for the prompter abstraction.

L1 ⊂ L2 ⊂ L3 in terms of failure-mode coverage: a failing L1 test implies a failing L2 test which implies a failing L3 test. So when the same invariant could be tested at multiple layers, the starter test is the highest layer and lower-layer tests are supplements that pin the precise internal path. L4 sits beside L2/L3, same crate/file conventions but a dedicated layer because the prompter-injection pattern is specific to adjudication.

L1, Spec / fragment tests

Lives in spec/constructs/speaker-id/. Three subdirectories:

token-cleaner/: what the Jaccard tokenizer strips and keeps
jaccard-scoring/: fixed-input → fixed-score golden tests
mapping-application/: header rewrite rules on real fragments

L1.1, Token cleaner

Each spec is a CHAT main-tier fragment + the expected token list after cleaning. Behavior pinned: bracket markup stripped, angle-bracket retracing unwrapped, terminator variants discarded, &-... / &+... discarded, xxx/yyy/www discarded, 0 discarded, @l / @n / @c suffix dropped, _-compound split to spaces, punctuation stripped, lowercased, ≥2-char alpha filter, NAK bullets stripped.

Spec	Input fragment	Expected tokens
`clean-plain-utterance`	`*CHI:\thello world .`	`["hello", "world"]`
`clean-strip-bracket-codes`	`*CHI:\thello [*] [/] world [//] .`	`["hello", "world"]`
`clean-unwrap-angle-retrace`	`*CHI:\t<two of the> [//] three of the presents .`	`["two", "of", "the", "three", "of", "the", "presents"]`
`clean-strip-fillers`	`*CHI:\t&-um &+pre something &-uh .`	`["something"]`
`clean-strip-zero-and-paralinguistic`	`*CHI:\t0 [=! nodding] .`	`[]`
`clean-strip-unintelligible`	`*CHI:\txxx and yyy and www .`	`["and", "and"]`
`clean-strip-bullets`	`*CHI:\thello world . \x150_1234\x15`	`["hello", "world"]`
`clean-special-form-suffix`	`*CHI:\tnaming l@l u@l l@l u@l .`	`["naming"]`
`clean-compound-underscore`	`*CHI:\tValentine's_Day and Fruit_Loops .`	`["valentine", "day", "and", "fruit", "loops"]`
`clean-terminator-variants`	`*CHI:\thello +//. world +... again +/. last !`	`["hello", "world", "again", "last"]`
`clean-overlap-markers`	`*CHI:\t↫here↫ and there .`	`["here", "and", "there"]`
`clean-lowercase-filter`	`*CHI:\tHello World A I am .`	`["hello", "world", "am"]`

Each spec file in spec/constructs/speaker-id/token-cleaner/ has the standard # name, ## Input, ## Expected tokens, and ## Metadata sections per the spec authoring template at spec/CLAUDE.md in the workspace root (outside the book).
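The last three cleaning rules (compound split, lowercasing, and the ≥2-char alphabetic filter) can be sketched on their own. This is a partial illustration under stated assumptions, not the tokenizer itself: `normalize_words` is a hypothetical name, the input is assumed to be utterance text from which bracket codes, fillers, unintelligible markers, terminators, and bullets were already removed by earlier stages (not shown), and splitting on every non-alphabetic character is one way to reproduce the table rows, not necessarily how the real cleaner does it.

```rust
/// Final normalization stage only. Assumes earlier stages (not shown) have
/// already removed bracket codes, fillers, xxx/yyy/www, terminators, and bullets.
pub fn normalize_words(text: &str) -> Vec<String> {
    text
        // Split on anything that is not a letter: whitespace, `_`, `'`, digits.
        .split(|c: char| !c.is_alphabetic())
        // Keep only pieces with at least two letters (drops "s", "A", "I", "").
        .filter(|piece| piece.chars().count() >= 2)
        .map(|piece| piece.to_lowercase())
        .collect()
}
```

On the body of `clean-compound-underscore` this yields `valentine, day, and, fruit, loops`, and on the body of `clean-lowercase-filter` it yields `hello, world, am`.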

L1.2, Jaccard scoring

Fixed bag-of-tokens pairs with known multiset Jaccard. These guard against off-by-one errors in the sum_w min / sum_w max implementation and against any future “optimizations” that silently change scoring.

Spec	Bag A	Bag B	Expected `J(A,B)`
`jaccard-identical`	`{hello:2, world:1}`	`{hello:2, world:1}`	`1.0`
`jaccard-disjoint`	`{hello:1}`	`{world:1}`	`0.0`
`jaccard-empty-empty`	`{}`	`{}`	`0.0`
`jaccard-empty-nonempty`	`{}`	`{x:1}`	`0.0`
`jaccard-multiset-counts`	`{a:3, b:1}`	`{a:1, b:1}`	`2/4 = 0.5`
`jaccard-partial-overlap`	`{a:1, b:1, c:1}`	`{b:1, c:1, d:1}`	`2/4 = 0.5`
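The goldens above follow directly from the sum-of-min over sum-of-max definition. A minimal sketch, assuming bags are modeled as `BTreeMap<String, u64>` (the names `Bag`, `multiset_jaccard`, and `bag` are illustrative, not the shipped API):

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Illustrative bag-of-words: token -> count.
pub type Bag = BTreeMap<String, u64>;

/// Multiset Jaccard: sum of per-token minimum counts over sum of per-token
/// maximum counts, taken over the union of tokens.
pub fn multiset_jaccard(a: &Bag, b: &Bag) -> f64 {
    let keys: BTreeSet<&String> = a.keys().chain(b.keys()).collect();
    let mut min_sum: u64 = 0;
    let mut max_sum: u64 = 0;
    for key in keys {
        let ca = a.get(key).copied().unwrap_or(0);
        let cb = b.get(key).copied().unwrap_or(0);
        min_sum += ca.min(cb);
        max_sum += ca.max(cb);
    }
    if max_sum == 0 {
        // Both bags empty: the golden table pins this case to 0.0.
        0.0
    } else {
        // Integer sums are exact; the single division is the only rounding step.
        min_sum as f64 / max_sum as f64
    }
}

/// Illustration-only helper: build a bag from (token, count) pairs.
pub fn bag(pairs: &[(&str, u64)]) -> Bag {
    pairs.iter().map(|(w, n)| (w.to_string(), *n)).collect()
}
```

For `jaccard-multiset-counts`, the minimum counts sum to 1 + 1 = 2 and the maximum counts sum to 3 + 1 = 4, giving 0.5. Summing in `u64` and dividing once keeps the result a function of two integers, which is what makes repeated runs agree bit for bit.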

L1.3, Mapping application on fragments

Header-rewrite micro-tests. Each spec gives an input @Participants: or @ID: row and a small mapping; the expected output row is the rewritten form.

Spec	Input row	Mapping	Expected output row
`participants-rewrite-rename`	`@Participants:\tPAR0 Participant, PAR1 Participant`	`PAR0→INV:Investigator, PAR1→drop`	`@Participants:\tINV Investigator`
`participants-preserve-name-token`	`@Participants:\tCHI Alex Target_Child, PAR0 Participant`	`PAR0→INV:Investigator`	`@Participants:\tCHI Alex Target_Child, INV Investigator`
`id-rewrite-rename`	`@ID:\teng\|corpus_name\|PAR0\|\|\|\|\|Participant\|\|\|`	`PAR0→INV:Investigator`	`@ID:\teng\|corpus_name\|INV\|\|\|\|\|Investigator\|\|\|`
`id-drop-removes-row`	`@ID:\teng\|...\|PAR1\|\|\|\|\|Participant\|\|\|`	`PAR1→drop`	(row removed)
`id-preserves-other-fields`	`@ID:\teng\|2\|CHI\|6;01.\|female\|NF\|\|Target_Child\|\|\|`	`(no-op for CHI)`	identical to input
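The two @Participants rows can be reproduced with a small string-level rewrite. This is a sketch only: the real pipeline operates on parsed headers rather than raw strings, and `Action`, `rewrite_participants`, and the rule that unmapped speakers pass through unchanged are assumptions for the example. The tab after `@Participants:` is written as the Rust escape `\t`.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-speaker action for the fragment-level rewrite.
pub enum Action {
    Drop,
    Rename { code: String, role: String },
}

/// Rewrite one `@Participants:` row. Returns None for any other header.
/// Entries are "CODE [Name] Role"; a rename replaces the code and the role
/// (first and last tokens) and keeps the optional name token.
pub fn rewrite_participants(row: &str, mapping: &BTreeMap<String, Action>) -> Option<String> {
    let body = row.strip_prefix("@Participants:\t")?;
    let mut kept: Vec<String> = Vec::new();
    for entry in body.split(", ") {
        let mut tokens: Vec<&str> = entry.split_whitespace().collect();
        let Some(code) = tokens.first().copied() else { continue };
        match mapping.get(code) {
            // Dropped speakers contribute no entry.
            Some(Action::Drop) => {}
            Some(Action::Rename { code: new_code, role }) => {
                let last = tokens.len() - 1;
                tokens[0] = new_code.as_str();
                if last > 0 {
                    tokens[last] = role.as_str();
                } else {
                    tokens.push(role.as_str());
                }
                kept.push(tokens.join(" "));
            }
            // Assumed: speakers outside the mapping pass through verbatim.
            None => kept.push(entry.to_string()),
        }
    }
    Some(format!("@Participants:\t{}", kept.join(", ")))
}
```

With `PAR0→INV:Investigator, PAR1→drop`, the row `PAR0 Participant, PAR1 Participant` becomes `INV Investigator`, and `CHI Alex Target_Child, PAR0 Participant` keeps the `Alex` name token on the untouched CHI entry.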

L2, Transform / AST tests

Lives in crates/talkbank-transform/tests/. Three test files:

speaker_id_tests.rs
transcript_merge_tests.rs
override_file_tests.rs

Each tests behavior over parsed talkbank-model::ChatFile values, using inline synthetic CHAT strings parsed via talkbank_parser::parse_chat_file (no subprocess overhead).

L2.1, `identify_mapping` (reference mode)

Test	Scenario	Assertion
`identify_mapping_clean_winner`	Reference has CHI saying content X; donor has PAR0 saying X verbatim and PAR1 saying unrelated content	Returns `SpeakerMapping { drop: {PAR0}, rename: {PAR1: INV} }`, margin >> 2.0
`identify_mapping_borderline_refuses`	Reference and both donor speakers share substantial vocabulary (margin < 2.0)	Returns `Err(SpeakerIdError::LowConfidence { scores, threshold, margin })`
`identify_mapping_anchor_missing`	Reference has no utterances tagged with anchor speaker	Returns `Err(SpeakerIdError::AnchorMissingInReference { anchor: CHI })`
`identify_mapping_single_speaker_donor`	Donor has only one speaker	Returns `Err(SpeakerIdError::InsufficientSpeakers { n: 1 })`
`identify_mapping_threshold_at_exact_value`	Constructed donor where margin = 2.0 exactly with threshold 2.0	Returns `Ok(_)` (≥ comparison, not strict >)
`identify_mapping_threshold_below_exact_value`	Margin = 1.9999 with threshold 2.0	Returns `Err(SpeakerIdError::LowConfidence)`
`identify_mapping_unbounded_margin`	Donor PAR1 has Jaccard 0 against reference; PAR0 > 0	Returns `Ok(_)` with `margin = Margin::Unbounded`
`identify_mapping_deterministic`	Same inputs, repeated call	Identical `SpeakerMapping` byte-for-byte (BTreeMap ordering)
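The threshold rows pin an inclusive comparison and an unbounded margin when the runner-up scores zero. A minimal sketch, assuming the margin is the ratio of the best score to the runner-up (the text implies this but does not state it) and representing the unbounded case as `f64::INFINITY`, as the shipped `ConfidenceMargin(f64)` is described as doing. The function names and the 0.0 result for the zero/zero case are illustrative choices.

```rust
/// Assumed definition: best score divided by runner-up score.
/// A zero runner-up gives an unbounded margin, modeled as f64::INFINITY.
pub fn confidence_margin(winner: f64, runner_up: f64) -> f64 {
    if runner_up == 0.0 {
        if winner > 0.0 {
            f64::INFINITY
        } else {
            // Zero against zero: no evidence either way. Using 0.0 is one of
            // the options the plan leaves open, chosen here for illustration.
            0.0
        }
    } else {
        winner / runner_up
    }
}

/// Inclusive comparison: a margin exactly at the threshold passes.
pub fn meets_threshold(margin: f64, threshold: f64) -> bool {
    margin >= threshold
}
```

An infinite margin passes any finite threshold, so the unbounded case needs no special branch at the comparison site.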

L2.2, `apply_mapping`

Test	Scenario	Assertion
`apply_mapping_renames_main_tier`	Donor has `*PAR0:\t...` and `*PAR1:\t...`; mapping renames PAR0→INV, drops PAR1	Output has `*INV:\t...` for original PAR0 utts; PAR1 utts absent
`apply_mapping_byte_stable_except_prefix`	Donor has rich CHAT markup, %wor, %com on every utt	Every retained utt is byte-identical except the `*CODE:\t` prefix; dependent tiers preserved exactly
`apply_mapping_rewrites_participants`	Donor `@Participants:` has PAR0+PAR1 entries	Output has only INV entry (after PAR1 drop)
`apply_mapping_rewrites_id`	Donor `@ID:` rows for PAR0+PAR1	PAR0 row rewritten to INV with role tag; PAR1 row removed
`apply_mapping_speaker_not_in_input`	Mapping references PAR9 which isn’t in donor	Returns `Err(SpeakerIdError::MappingSpeakerNotInInput { speaker: PAR9 })`
`apply_mapping_speaker_not_in_mapping`	Donor has PAR0+PAR1+PAR2 but mapping only covers PAR0+PAR1	Returns `Err(SpeakerIdError::SpeakerNotInMapping { speaker: PAR2 })`
`apply_mapping_preserves_other_headers`	Donor has `@Languages`, `@Media`, `@Comment`	All non-Participants/non-ID headers pass through verbatim
`apply_mapping_idempotent_on_rerun`	Apply mapping, parse output, apply identity mapping	Output unchanged (byte-stable)

L2.3, `merge` (core invariants)

These mirror the user-guide’s “What the merged output guarantees” section directly. Each invariant from that section maps to one or more L2 tests; the L3 tests then re-exercise the same invariant through the CLI.

Test	Invariant from user-guide	Assertion
`merge_retained_speakers_byte_stable`	“Retained speakers are byte-stable”	Every `*CHI:` block from File 1 (main tier + all dependent tiers, including `%com`) appears in the output byte-identical, in original order
`merge_strips_default_derived_tiers`	“Inserted speakers’ downstream-generated tiers are stripped”	Output has no `%wor`, `%mor`, `%gra`, `%pho` on inserted-speaker utts; other dependent tiers preserved
`merge_strip_tiers_configurable`	“configurable via `--strip-tiers`”	Custom `strip_tiers=[com]` removes `%com` instead of the defaults
`merge_strip_tiers_empty_preserves_all`	empty strip set	Inserted utts retain `%wor`, `%mor`, `%gra`, `%pho` from File 2 verbatim
`merge_utterance_order_by_start_time`	“Utterance order is timeline order”	Output utterances sorted by start_ms ascending
`merge_stable_tiebreak_file1_first`	“first-file utterance comes first”	When File 1 and File 2 each have an utterance starting at exactly t, the File 1 one appears first in the output
`merge_bullets_pass_through`	“Time bullets are pass-through”	Every bullet in the output is exactly the bullet from its source utterance; merge does not recompute, smooth, or refresh it
`merge_bullet_lift_from_wor`	“If main tier lacks bullet, lift from %wor”	Donor utt with no end-of-line bullet but a `%wor` row gets a derived `\x15<first>_<last>\x15` appended; original `%wor` then stripped per the tier policy
`merge_no_overlap_markers_injected`	“Overlap markup is NOT injected”	Even when inserted utt’s bullet overlaps a retained utt’s bullet by 500ms, no `[>]`/`[<]` tokens appear anywhere in the output that weren’t in the original retained file
`merge_preserves_existing_overlap_markers`	retained file already has `[>]` somewhere	The original `[>]` is preserved byte-stable on the retained utt
`merge_header_languages_passthrough`	Header reconciliation rule	Output `@Languages` matches File 1’s
`merge_header_media_file1_wins`	Header reconciliation rule	File 1 says `video`, File 2 says `audio` → output says `video` (no warning emitted for modality only)
`merge_header_participants_concatenates`	Header reconciliation rule	Output `@Participants:` is File 1’s entries + File 2’s non-retained entries, in that order, with dedupe-on-insert: a File 2 entry whose speaker code File 1 already declares is skipped rather than inserted twice (legal only when File 1’s declaration is vestigial: zero utterances, matching role/name metadata; otherwise the merge refuses with `MergeError::ParticipantAlreadyDeclared`)
`merge_header_id_concatenates`	Header reconciliation rule	Output `@ID:` rows are File 1’s + File 2’s non-retained, original order within each file; the same dedupe-on-insert set also filters File 2’s `@ID` rows, so a deduped participant contributes no duplicate `@ID` row
`merge_header_comments_concatenate`	Header reconciliation rule	Output `@Comment` rows are File 1’s + File 2’s, in original order (ASR provenance preserved)
`merge_preconditions_retain_missing`	exit code 2 precondition	File 1 declares no CHI; merge with `retain={CHI}` returns `Err(MergeError::RetainSpeakersMissing)`
`merge_preconditions_no_timeline`	exit code 2 precondition	File 1 has no utterances with bullets → `Err(MergeError::NoTimelineInFile1)`
`merge_preconditions_language_mismatch`	exit code 2 precondition	File 1 `@Languages: eng`, File 2 `@Languages: yue` → `Err(MergeError::LanguageMismatch)`
`merge_preconditions_ambiguous_speaker`	exit code 2 precondition	Both files have INV utterances and retain={CHI} (INV not in retain) → `Err(MergeError::AmbiguousSpeaker { speaker: INV })`
`merge_warns_on_backward_bullet_drift`	“small backward-time bullets … proceeds”	File with `utt1: 100_200`, `utt2: 190_300`, succeeds, emits a warning
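The ordering and tiebreak invariants can be sketched as a two-way merge. This is an illustration under assumptions, not the shipped `merge_chats`: `Utt`, `utt`, and `merge_timelines` are hypothetical, utterances are reduced to a start time plus text, and each input is consumed in its original order (so a small backward drift inside one file is passed through rather than re-sorted).

```rust
/// Illustrative utterance: start time in milliseconds plus opaque text.
#[derive(Debug, Clone, PartialEq)]
pub struct Utt {
    pub start_ms: u64,
    pub text: String,
}

/// Illustration-only constructor.
pub fn utt(start_ms: u64, text: &str) -> Utt {
    Utt { start_ms, text: text.to_string() }
}

/// Interleave two utterance lists by start time. File 2 goes first only when
/// it starts strictly earlier, so File 1 wins exact ties.
pub fn merge_timelines(file1: Vec<Utt>, file2: Vec<Utt>) -> Vec<Utt> {
    let mut out = Vec::with_capacity(file1.len() + file2.len());
    let mut a = file1.into_iter().peekable();
    let mut b = file2.into_iter().peekable();
    loop {
        let take_b = match (a.peek(), b.peek()) {
            (Some(x), Some(y)) => y.start_ms < x.start_ms,
            (None, Some(_)) => true,
            (Some(_), None) => false,
            (None, None) => break,
        };
        let next = if take_b { b.next() } else { a.next() };
        // `next` is always Some here; extend accepts the Option directly.
        out.extend(next);
    }
    out
}
```

The utterances themselves are moved, never rebuilt, which mirrors the pass-through requirement: the merge decides order and nothing else.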

L2.4, Override file I/O

Test	Scenario	Assertion
`override_file_round_trip`	Construct `OverrideFile` with one entry, write, read back	Re-read value `==` original
`override_file_refuses_missing_schema_version`	TOML with no `schema_version`	`Err(OverrideFileError::UnsupportedSchemaVersion { found: None, supported: 2 })` (the shipped `found` is an `Option<u32>`, so absence reports as `None`, not a sentinel value)
`override_file_refuses_wrong_schema_version`	`schema_version = 99` (any value other than the current `2`; a pre-bump `schema_version = 1` file is refused the same way, per the v1-to-v2 migration note in merge-overrides.md)	`Err(UnsupportedSchemaVersion { found: Some(99), supported: 2 })`
`override_file_rejects_unknown_field`	Entry has an extraneous field `extra = "x"`	`Err(OverrideFileError::Toml(_))` (the shipped enum has no `Parse` variant; parse failures surface through `Toml`)
`override_file_rejects_malformed_mode`	`mode = "guess"`	`Err(OverrideFileError::Toml(_))` (only `auto`/`explicit`/`override` accepted)
`override_file_atomic_write`	Write to a path that already exists	Original file is replaced atomically; no `<path>.tmp` left behind
`override_file_deterministic_serialization`	Same struct, write twice	Bytes on disk are byte-identical between writes
`override_file_omits_empty_optionals`	Entry has empty `scores`, no `margin`, empty `flags`	TOML output does not contain those keys
`override_file_preserves_margin_unbounded`	Entry has `margin = Margin::Unbounded`	TOML on disk has `margin = "unbounded"`; reads back as `Unbounded`
`override_file_preserves_margin_finite`	Entry has `margin = Margin::Finite(3.81)`	TOML on disk has `margin = 3.81`; reads back equal
`override_file_read_or_default_missing`	Path does not exist	Returns empty `OverrideFile` with current schema version
`override_file_get_returns_entry`	File has one entry under SessionId X	`get(X)` returns Some; `get(Y)` returns None
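The `override_file_atomic_write` row describes the usual write-to-temp-then-rename pattern. A minimal std-only sketch follows; `tmp_path_for` and `write_atomically` are hypothetical names, the `<path>.tmp` sibling naming is taken from the table row, and the sketch does not fsync, which a production writer may want.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Sibling temp path: the destination path with ".tmp" appended.
pub fn tmp_path_for(path: &Path) -> PathBuf {
    let mut name = path.as_os_str().to_owned();
    name.push(".tmp");
    PathBuf::from(name)
}

/// Write the full contents to the temp file, then rename it over the
/// destination. Readers see either the old file or the new one, never a
/// partially written file. Both paths must be on the same filesystem.
pub fn write_atomically(path: &Path, contents: &str) -> io::Result<()> {
    let tmp = tmp_path_for(path);
    fs::write(&tmp, contents)?;
    fs::rename(&tmp, path)
}
```

After a successful call the temp file no longer exists, because the rename consumed it; that is the "no `<path>.tmp` left behind" half of the assertion.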

L2.5, Domain-type unit tests

Smaller per-type tests. Each in its module’s #[cfg(test)] mod tests section.

Note on type names: several of the designed types referenced below shipped under different names or shapes (InsertedRole is InsertedRoleSpec; Margin is ConfidenceMargin(f64) with f64::INFINITY for the unbounded case; RetainSet and MergeFlag never shipped as newtypes). See the designed-vs-shipped table in Domain Types before writing any still-pending test from this table against the current code.

Test	Type	Assertion
`jaccard_score_new_in_range`	`JaccardScore`	`new(0.5)` → `Ok`; `new(-0.1)` and `new(1.1)` → `Err`; `new(NaN)` → `Err`
`jaccard_score_serde_round_trip`	`JaccardScore`	Serializes to `0.5` (bare float in JSON/TOML); deserializes back identically; out-of-range deserialize → error
`confidence_threshold_default_is_2_0`	`ConfidenceThreshold`	`Default::default().value() == 2.0`
`confidence_threshold_rejects_below_1`	`ConfidenceThreshold`	`new(0.5)` → `Err`
`margin_from_scores_zero_loser`	`Margin`	`from_scores(JaccardScore::new(0.7), JaccardScore::zero()) == Margin::Unbounded`
`margin_from_scores_zero_zero`	`Margin`	`from_scores(zero, zero) == Margin::Finite(0.0)` or explicit “degenerate” representation (decide and document)
`margin_meets_threshold`	`Margin`	`Finite(3.81).meets(threshold=2.0) == true`; `Finite(1.5).meets(2.0) == false`; `Unbounded.meets(threshold) == true` for any threshold
`retain_set_parse`	`RetainSet`	`"CHI".parse() == Ok({CHI})`; `"CHI,SI2".parse() == Ok({CHI, SI2})`; `"".parse() == Err`; `"CHI,,SI2".parse() == Err`
`inserted_role_parse`	`InsertedRole`	`"INV:Investigator".parse() == Ok(_)`; `"INV".parse() == Err`; `":Investigator".parse() == Err`
`mapping_spec_parse_simple`	`parse_mapping_spec`	`"PAR0=drop,PAR1=INV:Investigator"` parses to a complete `MappingSpec` with `Drop` for PAR0 and a `Rename` carrying PAR1’s own code + role
`mapping_spec_parse_drop_only`	`parse_mapping_spec`	`"PAR0=drop"` parses; a drop-only mapping is legal in isolation, since roles are per-speaker and a mapping with no `Rename` needs no role at all
`mapping_spec_parse_multiple_roles`	`parse_mapping_spec`	`"PAR0=INV:Investigator,PAR1=MOT:Mother"` parses, with each speaker’s `Rename` carrying its own role. (The original plan named this `mapping_spec_parse_conflicting_roles` and expected an error because the designed v1 schema allowed only one shared inserted role; the shipped schema-v2 per-speaker `adult_roles` map makes multiple distinct roles a supported case, and two adults assigned the same role auto-disambiguate to numbered codes `INV1`/`INV2` with `First_`/`Second_` specific-role labels.)
`merge_flag_serde_known_variants`	`MergeFlag`	`DiarizationMixed` serializes as `"diarization-mixed"` (kebab-case); deserializes the same
`merge_flag_serde_custom`	`MergeFlag`	Unknown string deserializes as `Custom("unknown-flag")`; serializes verbatim
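The `retain_set_parse` row can be sketched as a small parse function. As the note above says, `RetainSet` did not ship as a newtype, so this is purely illustrative: `parse_retain`, `RetainParseError`, and the use of `BTreeSet<String>` in place of speaker-code newtypes are assumptions.

```rust
use std::collections::BTreeSet;

/// Hypothetical error type for the retain-list parser.
#[derive(Debug, PartialEq)]
pub enum RetainParseError {
    /// The whole value was empty.
    Empty,
    /// One comma-separated item was empty, as in "CHI,,SI2".
    EmptyCode,
}

/// Parse a comma-separated list of speaker codes into a set.
pub fn parse_retain(spec: &str) -> Result<BTreeSet<String>, RetainParseError> {
    if spec.is_empty() {
        return Err(RetainParseError::Empty);
    }
    let mut codes = BTreeSet::new();
    for code in spec.split(',') {
        if code.is_empty() {
            return Err(RetainParseError::EmptyCode);
        }
        codes.insert(code.to_string());
    }
    Ok(codes)
}
```

Rejecting the empty value at parse time is what lets the CLI report a typed error before any file is read.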

L3, CLI / subprocess tests

Lives in crates/chatter/tests/merge_tests.rs (new file). Uses the same assert_cmd + predicates + tempfile pattern as the existing integration_tests.rs. Each test invokes chatter speaker-id or chatter merge as a subprocess against files written to a tempdir().

L3.1, `chatter merge`, success paths

Test	Invariants exercised
`merge_basic_clinician_pattern`	E2E happy path: small hand-coded child-only file + small ASR-labeled file → exit 0, output exists, retained CHI byte-stable, inserted INV present with derived tiers stripped. Single-invocation smoke test.
`merge_writes_to_stdout_by_default`	No `-o` flag → output goes to stdout, exit 0
`merge_writes_to_output_path`	`-o merged.cha` → file created with correct content; nothing on stdout
`merge_retain_multi_speaker`	`--retain CHI,SI2` keeps both CHI and SI2 byte-stable; everything else from File 2
`merge_strip_tiers_custom`	`--strip-tiers com,act` removes `%com` and `%act` instead of default set
`merge_strip_tiers_empty`	`--strip-tiers ''` preserves `%wor` from File 2 in output
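
The `--strip-tiers` behavior in the last two rows can be sketched as below. This is a line-based simplification for illustration only: the real merge operates on the parsed AST, and the function names here are hypothetical. The default set (`%wor`, `%mor`, `%gra`, `%pho`) is the one named in the TDD authoring sequence.

```rust
/// Default derived tiers dropped from inserted-speaker utterances.
pub const DEFAULT_STRIP_TIERS: [&str; 4] = ["wor", "mor", "gra", "pho"];

/// Parse a `--strip-tiers` value: comma-separated tier names without `%`.
/// An empty string yields an empty list, meaning "strip nothing".
pub fn parse_strip_tiers(raw: &str) -> Vec<String> {
    raw.split(',')
        .map(str::trim)
        .filter(|name| !name.is_empty())
        .map(str::to_owned)
        .collect()
}

/// Keep only the dependent-tier lines whose `%name:` prefix is not in `strip`.
/// Lines that do not start with `%` are always kept.
pub fn strip_tiers<'a>(dependent_lines: &[&'a str], strip: &[String]) -> Vec<&'a str> {
    dependent_lines
        .iter()
        .copied()
        .filter(|line| {
            let name = line
                .strip_prefix('%')
                .and_then(|rest| rest.split(':').next())
                .unwrap_or("");
            !strip.iter().any(|s| s == name)
        })
        .collect()
}
```

With this shape, `--strip-tiers ''` and the default set go through the same filter; only the list differs.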

L3.2, `chatter merge`, error paths

Test	Asserted exit code	Asserted stderr
`merge_missing_file1`	1	“No such file” or equivalent typed message
`merge_unparseable_file1`	1	parser diagnostic
`merge_missing_retain_flag`	2 (clap)	clap usage message
`merge_retain_empty_value`	2	typed error from `RetainSet::from_str`
`merge_no_retain_speakers_in_file1`	2	`RetainSpeakersMissing` rendered
`merge_no_timeline_in_file1`	2	`NoTimelineInFile1` rendered
`merge_language_mismatch`	2	`LanguageMismatch { file1: eng, file2: yue }` rendered
`merge_ambiguous_speaker`	2	`AmbiguousSpeaker { speaker: ... }` rendered with hint to use `--retain`
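
The exit-code split in this table (1 for I/O and parse failures, 2 for precondition failures) can be sketched as a single mapping function. The `MergeError` variant names come from the table; their exact fields, the `CliFailure` wrapper, and the message wording are hypothetical, introduced only for illustration.

```rust
use std::fmt;

/// Precondition failures named in the table (field shapes are assumed).
#[derive(Debug, PartialEq)]
pub enum MergeError {
    RetainSpeakersMissing,
    NoTimelineInFile1,
    LanguageMismatch { file1: String, file2: String },
    AmbiguousSpeaker { speaker: String },
}

impl fmt::Display for MergeError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            MergeError::RetainSpeakersMissing => {
                write!(f, "no retained speakers found in File 1")
            }
            MergeError::NoTimelineInFile1 => write!(f, "File 1 has no time bullets"),
            MergeError::LanguageMismatch { file1, file2 } => {
                write!(f, "language mismatch: File 1 is {file1}, File 2 is {file2}")
            }
            MergeError::AmbiguousSpeaker { speaker } => write!(
                f,
                "speaker {speaker} appears in both files; add it to --retain to disambiguate"
            ),
        }
    }
}

/// Hypothetical CLI-level wrapper distinguishing the failure classes.
#[derive(Debug)]
pub enum CliFailure {
    Io(String),
    Parse(String),
    Precondition(MergeError),
}

impl CliFailure {
    /// 1 = could not read or parse an input; 2 = a merge precondition failed.
    pub fn exit_code(&self) -> u8 {
        match self {
            CliFailure::Io(_) | CliFailure::Parse(_) => 1,
            CliFailure::Precondition(_) => 2,
        }
    }
}
```

Centralizing the mapping in one `match` means a new `MergeError` variant cannot land without the compiler forcing a decision about its rendering.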

L3.3, `chatter speaker-id`, reference mode

Test	Scenario	Assertion
`speaker_id_reference_auto_clean_winner`	Reference + donor where margin >> 2.0	Exit 0; output has expected renamed/dropped speakers
`speaker_id_reference_writes_override`	With `--write-override path.toml`	File created; entry has `mode = "auto"`, scores, margin, decided_at, operator
`speaker_id_reference_appends_to_existing_override`	`--write-override path.toml` where file already has another session	New session added; existing session preserved
`speaker_id_reference_low_confidence_exits_4`	Margin < threshold	Exit 4; stderr contains per-speaker scores
`speaker_id_reference_anchor_missing_exits_2`	Reference has no anchor speaker utterances	Exit 2; typed error in stderr
`speaker_id_reference_threshold_override`	`--confidence-threshold 1.5` on a margin-1.7 case	Exit 0 (would have refused at default 2.0)
`speaker_id_reference_anchor_required`	`--reference` without `--anchor`	Exit 2 (clap or our own); usage error

L3.4, `chatter speaker-id`, explicit-mapping mode

Test	Scenario	Assertion
`speaker_id_explicit_basic`	`--mapping "PAR0=drop,PAR1=INV:Investigator"`	Exit 0; output renames PAR1→INV, drops PAR0
`speaker_id_explicit_mapping_speaker_not_in_input`	`--mapping` references PAR9 not in input	Exit 2; typed error
`speaker_id_explicit_speaker_missing_from_mapping`	Input has PAR0+PAR1+PAR2; mapping only covers PAR0+PAR1	Exit 2; typed error naming PAR2
`speaker_id_explicit_with_note_records_in_override`	`--mapping` + `--write-override` + `--note "verified by listening"`	TOML entry has `note = "verified by listening"` and `mode = "explicit"`
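
The `--mapping` grammar and the two coverage errors in this table can be sketched as below. This is an illustration under stated assumptions, not the shipped `parse_mapping_spec`: the `Assignment` enum, the `BTreeMap` return type, the `String` error type, and the `check_coverage` helper are all hypothetical.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq)]
pub enum Assignment {
    Drop,
    Rename { code: String, role: String },
}

/// Parse `SPEAKER=drop` or `SPEAKER=CODE:Role` entries separated by commas.
pub fn parse_mapping_spec(raw: &str) -> Result<BTreeMap<String, Assignment>, String> {
    let mut mapping = BTreeMap::new();
    for entry in raw.split(',') {
        let (speaker, target) = entry
            .split_once('=')
            .ok_or_else(|| format!("expected SPEAKER=TARGET, got {entry:?}"))?;
        if speaker.is_empty() {
            return Err(format!("missing speaker code in {entry:?}"));
        }
        let assignment = if target == "drop" {
            Assignment::Drop
        } else {
            match target.split_once(':') {
                Some((code, role)) if !code.is_empty() && !role.is_empty() => {
                    Assignment::Rename { code: code.to_owned(), role: role.to_owned() }
                }
                _ => return Err(format!("expected CODE:Role or drop, got {target:?}")),
            }
        };
        if mapping.insert(speaker.to_owned(), assignment).is_some() {
            return Err(format!("speaker {speaker} mapped more than once"));
        }
    }
    Ok(mapping)
}

/// Every input speaker must be mapped, and every mapped speaker must exist.
pub fn check_coverage(
    mapping: &BTreeMap<String, Assignment>,
    input_speakers: &[&str],
) -> Result<(), String> {
    for speaker in input_speakers {
        if !mapping.contains_key(*speaker) {
            return Err(format!("speaker {speaker} is missing from the mapping"));
        }
    }
    for speaker in mapping.keys() {
        if !input_speakers.contains(&speaker.as_str()) {
            return Err(format!("mapping references {speaker}, which is not in the input"));
        }
    }
    Ok(())
}
```

Separating parsing from coverage checking keeps `parse_mapping_spec` testable at L2 without any input file.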

L3.5, `chatter speaker-id`, override-file mode

Test	Scenario	Assertion
`speaker_id_override_file_replay`	Override file has entry for session-X	Reading override + applying produces same output as the original auto/explicit run
`speaker_id_override_file_missing_entry`	Override file has no entry for the requested session	Exit 2; `OverrideEntryMissing` in stderr
`speaker_id_override_file_missing_file`	`--override-file path.toml` where file doesn’t exist	Exit 1; `NotFound` in stderr
`speaker_id_override_file_wrong_schema_version`	File has `schema_version = 99`	Exit 1; `UnsupportedSchemaVersion` in stderr
`speaker_id_override_file_mutually_exclusive_modes`	`--reference` AND `--mapping` both set	Exit 2 (clap or our own); only one operation mode allowed

L3.6, Pipeline composition

These exercise chatter speaker-id → chatter merge composed end-to-end through the file system, simulating the orchestrator workflow.

Test	Scenario	Assertion
`pipeline_speaker_id_then_merge`	Run speaker-id on anonymous ASR file; run merge on the result + hand-coded file	Final merged file passes all merge invariants (retained byte-stable, etc.)
`pipeline_replay_via_override_file`	Run once with auto; capture override file; delete intermediates; replay via `--override-file`; merge again	Final merged file is byte-identical to the original run (audit-trail-reproducibility property)
`pipeline_low_confidence_then_explicit`	Run speaker-id; gets exit 4; capture scores from stderr; run again with `--mapping` matching what the operator would decide; record via `--write-override`; merge	All steps succeed; override file has `mode = "explicit"` with prior scores recorded

L4, Scripted adjudication tests

Lives in crates/talkbank-transform/tests/adjudication_tests.rs. Uses the Prompter trait and ScriptedPrompter documented in Adjudication Workflow §The prompter abstraction. Each test constructs a pending-adjudications input, scripts the operator’s decisions, runs run_adjudication, and asserts on the resulting override file plus the residual pending file.

L4.1, Speaker-id adjudication paths

Test	Scripted decision	Assertion
`adjudicate_speaker_id_accepts_suggested`	`AcceptSuggested { note: None }` for one pending entry	Override file entry has `mode = "explicit"`, mapping matches suggested, pending file emptied
`adjudicate_speaker_id_override_mapping`	`OverrideMapping { mapping: { PAR0=rename, PAR1=drop }, note: Some("verified by listening") }` (opposite of suggested)	Override file mapping matches operator’s choice; note recorded
`adjudicate_speaker_id_defer`	`Defer { reason: "need to listen to audio" }`	Pending entry untouched; override file unchanged; tool exits 4 (deferred)
`adjudicate_speaker_id_block`	`Block { reason: "reference file missing bullets" }`	Pending entry tagged as blocked; override file unchanged
`adjudicate_speaker_id_kind_mismatch_rejected`	`OverrideInsertedRole { ... }` against a `speaker-id-low-confidence` entry	Returns `Err(AdjudicationError::DecisionKindMismatch)`; nothing written

L4.2, Parent-role-lookup adjudication paths

Test	Scripted decision	Assertion
`adjudicate_parent_role_accepts_default_inv`	`AcceptSuggested`	Override entry uses `INV:Investigator` (the safe default)
`adjudicate_parent_role_overrides_to_mother`	`OverrideInsertedRole { code: "MOT", tag: "Mother" }`	Override entry uses MOT; note recorded
`adjudicate_parent_role_overrides_to_father`	`OverrideInsertedRole { code: "FAT", tag: "Father" }`	Override entry uses FAT
`adjudicate_parent_role_invalid_code_rejected`	`OverrideInsertedRole { code: "", tag: "Mother" }`	Returns `Err`; with `--skip-on-error`, logs and proceeds

L4.3, Diarization-mix and sanity-scan paths

Test	Scripted decision	Assertion
`adjudicate_diarization_mix_flag_only`	`Flag { flags: [DiarizationMixed], note: "PAR0 mixes clinician+parent" }`	Existing override entry gets flag added; mapping unchanged
`adjudicate_sanity_scan_swap_mapping`	`OverrideMapping { ... }` reversing original speaker-id	Override entry updated; `mode = "explicit"`; original mapping preserved in `history`
`adjudicate_sanity_scan_confirms_real_overlap`	`Flag { flags: [Custom("real-overlap-confirmed")] }`	Override entry gets custom flag; mapping unchanged

L4.4, Workflow plumbing

Test	Scenario	Assertion
`adjudicate_empty_pending_file_noop`	Pending file has empty `entries` array	Exit 0; nothing changes
`adjudicate_resumption_skips_decided_entries`	Pending file has 3 entries; first 2 already decided in override; only 3rd has no override entry	Prompter is called exactly once, for the 3rd entry
`adjudicate_re_adjudicate_preserves_history`	Existing override entry; `--re-adjudicate` with new decision	New decision saved; prior decision preserved in `history` array
`adjudicate_kind_filter_processes_only_matching`	Pending file has mixed kinds; `--kind parent-role-lookup` flag set	Prompter only called for parent-role-lookup entries; other kinds untouched
`adjudicate_dry_run_writes_nothing`	Any pending input + any decision; `--dry-run` set	Override file unchanged; pending file unchanged
`adjudicate_scripted_mode_unknown_session_aborts`	Scripted decisions reference session-X but pending has only session-Y	Returns `Err(AdjudicationError::ScriptedDecisionWithoutPendingEntry)`; tool exits 2
`adjudicate_scripted_mode_extra_pending_aborts`	Pending has session-X and session-Y; scripted decisions cover only session-X	Returns `Err(AdjudicationError::PendingEntryWithoutScriptedDecision)`; tool exits 2
`adjudicate_mutually_exclusive_modes`	`--interactive` + `--scripted` both set	Returns `Err`; tool exits 2 (clap or our own validator)
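
The 1:1 check behind the two `adjudicate_scripted_mode_*_aborts` rows can be sketched as a set comparison. The variant names come from the table; the `String` payloads, the use of plain string session IDs, and the function name `check_one_to_one` are hypothetical.

```rust
use std::collections::BTreeSet;

#[derive(Debug, PartialEq)]
pub enum AdjudicationError {
    /// A scripted decision names a session with no pending entry.
    ScriptedDecisionWithoutPendingEntry(String),
    /// A pending entry has no scripted decision covering it.
    PendingEntryWithoutScriptedDecision(String),
}

/// Require that the pending sessions and scripted sessions are the same set.
/// Reports the first offending session in sorted order.
pub fn check_one_to_one(pending: &[&str], scripted: &[&str]) -> Result<(), AdjudicationError> {
    let pending_set: BTreeSet<&str> = pending.iter().copied().collect();
    let scripted_set: BTreeSet<&str> = scripted.iter().copied().collect();

    if let Some(session) = scripted_set.difference(&pending_set).next() {
        return Err(AdjudicationError::ScriptedDecisionWithoutPendingEntry(
            (*session).to_owned(),
        ));
    }
    if let Some(session) = pending_set.difference(&scripted_set).next() {
        return Err(AdjudicationError::PendingEntryWithoutScriptedDecision(
            (*session).to_owned(),
        ));
    }
    Ok(())
}
```

Running this check before the first decision is applied keeps the abort paths side-effect free: nothing has been written when the error is returned.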

L4.5, Prompter contract conformance

These tests pin the contract that any Prompter impl must satisfy, so future UI backends (VS Code, web) can be developed against the same invariants.

Test	Scenario	Assertion
`prompter_terminal_round_trip_decision`	`TerminalPrompter` reading a scripted stdin	Returns the expected `OperatorDecision` parsed from the operator’s typed input
`prompter_scripted_returns_decisions_in_order`	`ScriptedPrompter::from_decisions([d1, d2, d3])`	Three consecutive `ask()` calls return d1, d2, d3 in order
`prompter_scripted_panics_on_unscripted_session`	`ScriptedPrompter` has decisions for session A; tool asks for session B	`ask()` returns `Err(PrompterError::NoDecisionFor(SessionId))` (a typed error, not a panic, despite the test name)
`prompter_scripted_toml_round_trips`	Write a scripted-decisions TOML, read with `ScriptedTomlPrompter`, run	Same `OperatorDecision` sequence as a `ScriptedPrompter::from_decisions` with equivalent contents
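
The in-order and unscripted-session rows of the contract can be sketched with a queue-backed prompter. The trait name, `ScriptedPrompter::from_decisions`, and `PrompterError::NoDecisionFor` come from the table; the `ask` signature, the string session IDs, pairing each decision with a session ID, and the reduced `OperatorDecision` enum are assumptions made for the example.

```rust
use std::collections::VecDeque;

/// Reduced stand-in for the real decision enum (two variants only).
#[derive(Debug, Clone, PartialEq)]
pub enum OperatorDecision {
    AcceptSuggested { note: Option<String> },
    Defer { reason: String },
}

#[derive(Debug, PartialEq)]
pub enum PrompterError {
    NoDecisionFor(String),
}

/// Hypothetical shape of the prompter abstraction.
pub trait Prompter {
    fn ask(&mut self, session_id: &str) -> Result<OperatorDecision, PrompterError>;
}

/// Replays pre-recorded decisions in the order they were supplied.
pub struct ScriptedPrompter {
    queue: VecDeque<(String, OperatorDecision)>,
}

impl ScriptedPrompter {
    pub fn from_decisions<I>(decisions: I) -> Self
    where
        I: IntoIterator<Item = (String, OperatorDecision)>,
    {
        Self { queue: decisions.into_iter().collect() }
    }
}

impl Prompter for ScriptedPrompter {
    fn ask(&mut self, session_id: &str) -> Result<OperatorDecision, PrompterError> {
        // Only consume the next decision if it was scripted for this session.
        let next_matches = self
            .queue
            .front()
            .map_or(false, |(id, _)| id.as_str() == session_id);
        if next_matches {
            let (_, decision) = self.queue.pop_front().expect("front() was Some");
            Ok(decision)
        } else {
            Err(PrompterError::NoDecisionFor(session_id.to_owned()))
        }
    }
}
```

A mismatched request leaves the queue untouched, so a failed `ask()` does not silently skip a scripted decision.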

Fixture catalog

These are the synthetic CHAT pairs that the tests above consume. Each is small (≤20 utterances), exercises a precise invariant, and is fully fictional (no real corpus content).

The fixtures live as inline const FIX_*: &str blocks in the respective test modules, following the precedent in chatter/tests/integration_tests.rs (which has const VALID_CHAT: &str = r#"..."# etc.).

`FIX_REF_TWO_UTT_NO_MARKUP`

The smallest possible valid CHAT pair input. Two *CHI: utterances, no markup beyond a simple terminator, time bullets on both. Used by cycle 1’s smoke test where the impl must work without yet handling any markup edge cases.

`FIX_ASR_LABELED_TWO_UTT`

The matching donor for FIX_REF_TWO_UTT_NO_MARKUP: two *INV: utterances at different time positions. Used by cycle 1.

`FIX_REF_CHILD_ONLY_SIMPLE`

A 6-utterance child-only hand transcript with rich CHAT markup (error code, retracing, filled pause, special-form letter, zero realization with paralinguistic). Used by every L2/L3 merge test from cycle 2 onward as the canonical “File 1”, the reference / authoritative file. Has time bullets on every utterance.

`FIX_ASR_ANON_2SPEAKER_SIMPLE`

The matching ASR-output file with anonymous PAR0 (clinician, asks questions) and PAR1 (child, says what FIX_REF_* shows plus some extra). Has %wor on every utterance. Used by every speaker-id test where auto-mode is expected to succeed cleanly (margin >> 2.0).

`FIX_ASR_LABELED_INV_SIMPLE`

FIX_ASR_ANON_2SPEAKER_SIMPLE after speaker-id has run with PAR1→drop, PAR0→INV:Investigator. Used by merge tests where we want to skip the speaker-id step and test merge alone.

`FIX_ASR_BORDERLINE_VOCABULARY`

ASR file where both speakers describe the same picture-book content (margin 1.6-1.9 against reference). Used by low-confidence tests.

`FIX_REF_NO_BULLETS`

A reference file with no time bullets at all. Used to test NoTimelineInFile1 precondition.

`FIX_REF_LANG_ENG` / `FIX_ASR_LANG_YUE`

Two files with conflicting @Languages. Used to test LanguageMismatch.

`FIX_AMBIGUOUS_INV`

Two files both containing *INV: utterances, with --retain CHI (INV not in retain set). Used to test AmbiguousSpeaker.

`FIX_REF_MULTI_RETAIN`

Reference file containing *CHI: and *SI2: utterances (sibling target). Used to test --retain CHI,SI2.

`FIX_ASR_NO_MAIN_BULLET`

Donor file where some utterances have no main-tier bullet, only %wor. Used to test bullet-lift behavior in normalization.

`FIX_OVERRIDE_VALID` / `FIX_OVERRIDE_WRONG_SCHEMA` / `FIX_OVERRIDE_MALFORMED`

Override files in valid, schema-rejected, and parse-rejected shapes. Used by override-file I/O tests.

`FIX_PENDING_SPEAKER_ID` / `FIX_PENDING_PARENT_ROLE` / `FIX_PENDING_MIXED_KINDS`

Pending-adjudications files exercising one kind, another kind, and a mix. Used by L4 adjudication tests.

`FIX_SCRIPTED_ACCEPT_ALL` / `FIX_SCRIPTED_OVERRIDE_FIRST_DEFER_SECOND`

Scripted-decisions TOML files for ScriptedTomlPrompter. Cover the canonical accept-suggested case and a mixed override+defer case.

The exact bytes of each fixture are pinned in their respective test modules when the implementation lands; this plan doesn’t freeze them yet, only their purpose. Drafting the actual bytes is the first step of impl-phase work.

Coverage matrix

Cross-checking that every behavioral invariant from the four design docs has at least one test:

Invariant source	Invariant	First-failing layer	Test name
merge user-guide	Retained byte-stable	L3 → L2	`merge_basic_clinician_pattern` + `merge_retained_speakers_byte_stable`
merge user-guide	Derived tiers stripped	L3 → L2	`merge_strip_tiers_custom` + `merge_strips_default_derived_tiers`
merge user-guide	Order by start_ms	L2	`merge_utterance_order_by_start_time`
merge user-guide	Tiebreak File1 first	L2	`merge_stable_tiebreak_file1_first`
merge user-guide	Bullets pass-through	L2	`merge_bullets_pass_through`
merge user-guide	Bullet lift from %wor	L2	`merge_bullet_lift_from_wor`
merge user-guide	Header reconciliation (all rows, including @Participants / @ID dedupe-on-insert of donor codes File 1 already vestigially declares)	L2	`merge_header_*` series
merge user-guide + memory	No overlap markers injected	L2	`merge_no_overlap_markers_injected` + `merge_preserves_existing_overlap_markers`
merge user-guide	Each precondition → exit 2	L3	`merge_*_exits_2` series in L3.2
merge user-guide	Warns on bullet drift	L2	`merge_warns_on_backward_bullet_drift`
speaker-id user-guide	Reference mode auto	L3	`speaker_id_reference_auto_clean_winner`
speaker-id user-guide	Explicit mode	L3	`speaker_id_explicit_basic`
speaker-id user-guide	Override-file mode	L3	`speaker_id_override_file_replay`
speaker-id user-guide	Confidence threshold (exit 4)	L3 → L2	`speaker_id_reference_low_confidence_exits_4` + `identify_mapping_borderline_refuses`
speaker-id user-guide	Byte-stable except prefix	L2	`apply_mapping_byte_stable_except_prefix`
speaker-id user-guide	Header rewrites	L2 + L1	`apply_mapping_rewrites_` + `participants-rewrite-` specs
speaker-id user-guide	Provenance captured	L3	`speaker_id_reference_writes_override`
speaker-id user-guide	Each precondition → typed error	L3 → L2	various `_exits_2` and `apply_mapping_` tests
speaker-id user-guide	Token cleaner spec	L1	`clean-*` specs
speaker-id user-guide	Multiset Jaccard formula	L1	`jaccard-*` specs
override-file ref	Schema-version refusal	L2	`override_file_refuses_*` tests
override-file ref	Round-trip fidelity	L2	`override_file_round_trip`
override-file ref	Deterministic serialization	L2	`override_file_deterministic_serialization`
override-file ref	Atomic write	L2	`override_file_atomic_write`
override-file ref	margin `"unbounded"` form	L2	`override_file_preserves_margin_unbounded`
domain types	`JaccardScore` range	L2	`jaccard_score_new_in_range`
domain types	`ConfidenceThreshold ≥ 1`	L2	`confidence_threshold_*`
domain types	`Margin` semantics	L2	`margin_*`
domain types	`RetainSet::from_str`	L2	`retain_set_parse`
domain types	`InsertedRole::from_str`	L2	`inserted_role_parse`
domain types	`parse_mapping_spec`	L2	`mapping_spec_parse_*`
domain types	`MergeFlag` serde	L2	`merge_flag_serde_*`
domain types	Pipeline reproducibility	L3	`pipeline_replay_via_override_file`

Every invariant has at least one named test; many have multiple across layers. When the impl phase begins, the first commit should produce the fixtures, the second commit the highest-layer failing test for the simplest invariant, then drill down per the standard TDD progression.
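
The "Atomic write" row in the matrix above can be sketched with the common write-to-temp-then-rename pattern. This is an assumption about the technique, not a description of how the override-file writer is implemented; the function name and the `.tmp` sibling path are hypothetical.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::Path;

/// Write `contents` to a sibling temp file, flush it to disk, then rename it
/// over `path`. Readers see either the old file or the new one, never a
/// partially written file.
pub fn write_atomically(path: &Path, contents: &[u8]) -> io::Result<()> {
    // ASSUMPTION: a `.tmp` sibling in the same directory, so the rename
    // stays on one filesystem.
    let tmp = path.with_extension("tmp");
    {
        let mut file = fs::File::create(&tmp)?;
        file.write_all(contents)?;
        file.sync_all()?;
    } // file handle closed here, before the rename
    fs::rename(&tmp, path)
}
```

A test such as `override_file_atomic_write` could then assert that no temp file remains after a successful write and that the target holds exactly the last payload.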

What this plan does NOT cover

Performance / scaling tests. No targeted perf assertions are planned until the pipeline shows up on a measured workload. The reference corpus’s existing round-trip benchmarks remain the baseline.
Fuzz testing. This repository now has a local fuzz/ workspace for parser/validation fuzzing. If the merge crate stabilizes enough to justify dedicated fuzzing, adding a merge-specific target for random parseable CHAT-pair inputs is a follow-up, not a v1 blocker.
Cross-platform CI checks. Windows / Linux / macOS each build the workspace; the merge module rides the existing CI. No platform-specific tests needed (the merge operates on parsed AST and writes UTF-8; no path-or-line-ending quirks).
Real-corpus regression sweeps. Once impl lands, running chatter merge over a curated subset of the reference corpus and snapshotting outputs is a worthwhile follow-up. If added, it lives in a separate tests/golden/-style mechanism; it is not designed here.

TDD authoring sequence

Each numbered item is one full RED → GREEN → REFACTOR cycle. Cycles must run in order; do not start cycle N+1 until cycle N is green and committed. Numbers are designed so the first working pipeline (cycle 8) emerges from the absolute minimum set of types + algorithms, then each later cycle extends.

The starter test for cycle 1 is intentionally tiny: a 2-utterance fixture pair with no markup, one retain speaker. The smoke test exercises every layer (parser, transform, CLI) but with the simplest possible CHAT bytes, so the first impl is small enough to land in one cycle.

Phase A, minimal end-to-end pipeline (cycles 1-8)

These cycles produce the simplest possible chatter merge working end-to-end with synthetic fixtures.

#	RED (failing test)	GREEN (smallest impl that passes)
1	`merge_basic_smoke`, L3 subprocess test against the tiniest fixture pair (`FIX_REF_TWO_UTT_NO_MARKUP` + `FIX_ASR_LABELED_TWO_UTT`), retain={CHI}, asserts exit 0 and “merged file exists”	Stub `chatter merge` subcommand wiring; introduce minimal `talkbank-transform::transcript_merge::merge` that interleaves utterances by start_ms and emits parser→serializer round-trip. No tier-stripping, no header-reconcile, no validation. Just: parse, sort, serialize.
2	`merge_retained_speakers_byte_stable`, L2 over the smoke fixture, asserts every CHI block byte-identical	Implement byte-stable handling for retained utterances (preserve `main_raw_lines` + dependent tiers exactly).
3	`merge_strips_default_derived_tiers`, L2 against a fixture where the donor has `%wor` rows	Implement `tier_strip` per the per-tier policy; drop `%wor`/`%mor`/`%gra`/`%pho` from inserted-speaker utts.
4	`merge_utterance_order_by_start_time`, L2 with a fixture where File 1 and File 2 utterances interleave	Implement `timeline` sort key (start_ms primary; source-order tiebreak).
5	`merge_header_participants_concatenates`, L2	Implement `header_reconcile::participants_merge`.
6	`merge_header_id_concatenates`, L2	Extend `header_reconcile` for @ID rows.
7	`merge_header_languages_passthrough` + `merge_header_media_file1_wins` + `merge_header_comments_concatenate`, L2	Extend `header_reconcile` for remaining headers per the contract table.
8	`merge_preconditions_retain_missing` + `merge_preconditions_no_timeline` + `merge_preconditions_language_mismatch` + `merge_preconditions_ambiguous_speaker`, L3, each asserting exit code 2 with a specific stderr message	Implement `preconditions` module + map `MergeError` to exit codes in the CLI.
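
The cycle 4 sort key (start_ms primary, File 1 before File 2 on ties, source order within a file) can be sketched with a stable sort. The `Source` and `Utterance` types here are hypothetical stand-ins; the real merge works on parsed utterances.

```rust
/// Which input an utterance came from. Derived `Ord` puts File1 before File2.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Source {
    File1,
    File2,
}

/// Hypothetical minimal utterance record.
#[derive(Debug, Clone, PartialEq)]
pub struct Utterance {
    pub source: Source,
    pub start_ms: u64,
    pub text: String,
}

/// Interleave two utterance lists by start time.
/// `sort_by_key` is stable, so utterances with equal (start_ms, source)
/// keep their original order within each file.
pub fn interleave(file1: Vec<Utterance>, file2: Vec<Utterance>) -> Vec<Utterance> {
    let mut merged: Vec<Utterance> = file1.into_iter().chain(file2).collect();
    merged.sort_by_key(|utt| (utt.start_ms, utt.source));
    merged
}
```

Relying on sort stability rather than an explicit index in the key is a design choice for brevity; an explicit per-file index would make the tiebreak visible in the key itself.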

Phase A, actual cycle log

The four-precondition cycle 8 was deliberately split into four single-variant cycles (9a / 9b / 9c / 9d) so each MergeError variant lands with its own RED→GREEN cycle and L2 + L3 sibling tests. The numbering here is therefore finer-grained than the plan table above; the table records the shape of Phase A, the log records what was actually committed.

#	Test(s)	Layer	Status
1	`merge_basic_smoke`	L3	done
2	`merge_retained_speakers_byte_stable`	L2	done
3	`merge_strips_default_derived_tiers`	L2	done
4	`merge_strip_tiers_configurable`	L2	done
5	`merge_strip_tiers_empty_preserves_all`	L2	done
6	`merge_header_participants_concatenates`	L2	done
7	`merge_header_id_concatenates`	L2	done
8a	`merge_header_comments_concatenate`	L2	done
8b	`merge_header_languages_passthrough` + `merge_header_media_file1_wins`	L2	done
9a	`merge_no_retain_speakers_in_file1` + `_returns_err`	L3 + L2	done (L2 sibling backfilled in 9c)
9b	`merge_no_timeline_in_file1` + `_returns_err`	L3 + L2	done
9c	`merge_language_mismatch` + `_returns_err`	L3 + L2	done
9d	`merge_ambiguous_speaker` + `_returns_err`	L3 + L2	done

End of Phase A: chatter merge works on simple fixtures with all four preconditions (retain / timeline / language / ambiguous speaker) enforced. The pipeline is publishable as v0.

Phase B, actual cycle log

Phase B picks up at cycle 10 in the cycle log (Phase A used 9a-9d for the precondition split).

#	Test(s)	Layer	Status
10	`speaker_id_explicit_basic`	L3	done
11	`apply_mapping_byte_stable_except_prefix` + `apply_mapping_rewrites_participants` + `apply_mapping_rewrites_id`	L2	done (regression-guards)
12	`identify_mapping_clean_winner`	L2	done
13	`identify_mapping_borderline_refuses`	L2	done
14	`speaker_id_reference_low_confidence_exits_4`	L3	done
15	`speaker_id_reference_writes_override` (+ `OverrideFile` data model)	L3	done
16	`speaker_id_override_file_replay` (+ `OverrideFile::get`)	L3	done
17	`adjudicate_speaker_id_accepts_suggested` (+ adjudication core)	L4	done
18	`adjudicate_scripted_accepts_suggested` (+ `chatter adjudicate` CLI + scripted-TOML I/O)	L3	done
19	`speaker_id_reference_writes_pending_on_low_confidence` (+ `--write-pending` flag + `LowConfidence` carries `DonorMatchReport`)	L3	done
20	`adjudicate_speaker_id_override_mapping` (+ `OperatorDecision::OverrideMapping` variant + scripted-TOML `override-mapping` shape)	L4	done
21	`adjudicate_interactive_accepts_suggested` (+ `TerminalPrompter` + `--interactive` flag)	L3	done
22	`adjudicate_parent_role_lookup_chooses_role` (+ `PendingKindData` promotion + `ParentRoleLookup` kind + `ChooseRole` decision)	L4	done
23	`adjudicate_interactive_chooses_role` (+ `parse_operator_response` + kind-aware prompt hint)	L3	done
24	`adjudicate_interactive_override_mapping` (+ `parse_override_mapping` + `parse_speaker_assignment`)	L3	done
25	`pipeline_clean_winner_end_to_end` (+ `chatter pipeline` subcommand)	L3	done
26	`batch_pass1_single_session` (+ `chatter batch` subcommand, subprocess driver)	L3	done
27	`batch_mixed_outcomes` (regression-guard: clean+borderline aggregation)	L3	done
28	`batch_pass2_replay` (+ `--override-file` on `pipeline` + `batch`; per-session auto-detection)	L3	done
29	`batch_skip_existing` (+ `--skip-existing` flag on `batch` for idempotent re-runs)	L3	done
30	refactor, `PipelineArgs` + `BatchArgs` structs retire three `#[allow(clippy::too_many_arguments)]` markers	n/a	done (true-no-op refactor; covered by cycles 25-29 regression suite)
31	refactor, split `commands/speaker_id.rs` (472 lines) into `speaker_id/{mod,modes,writes,support}.rs` (158 + 196 + 103 + 86 lines); retire 4 stale `#[allow(dead_code)]` markers on `ReferenceModeOutcome` (fields are read by `write_override_entry`)	n/a	done (true-no-op refactor; covered by cycles 10-29 regression suite)
32	`adjudicate_sanity_scan_accept_suggested` (+ `AdjudicationKind::SanityScanMisclassification` variant, `PendingKindData::SanityScanMisclassification { suggested, reason }` variant, two apply-decision arms mirroring `SpeakerIdLowConfidence`, terminal prompter render + prompt-hint arm)	L4	done, adjudication kind end-to-end; the post-merge scan detector itself (heuristic + auto-pending-write) is a separate cycle 33
33	`sanity_scan_flags_inverted_mlu` (+ `talkbank_transform::sanity_scan::scan_session` + `chatter sanity-scan` subcommand; mean-utterance-word-count asymmetry heuristic, default 1.5×, binary-mapping only)	L3	done, detector + CLI end-to-end; multi-rename support, batch integration, and alternative heuristics deferred
34	`batch_writes_override_for_auto_decisions` (+ `--write-override` on both `chatter pipeline` and `chatter batch`; threaded through `PipelineArgs.write_override_path` + `BatchArgs.write_override_path`; reference-mode auto-decisions audit-trailed for sanity-scan + future re-runs)	L3	done
35	`batch_with_sanity_scan_flag_flags_inverted_mlu` (+ `--sanity-scan` + `--sanity-scan-threshold` on `chatter batch`; post-loop subprocess driver for `chatter sanity-scan`; precondition validation requiring `--write-override` + `--write-pending`)	L3	done
36	refactor, split `cli/args/core.rs` (984 → 747 lines): extract `DebugCommands` → `debug_commands.rs`, `CacheCommands` → `cache_commands.rs`, config enums (`LogFormat`, `TuiMode`, `OutputFormat`, `ParserBackend`, `AlignmentTier`) → `cli_types.rs`, unit-test module → `core_tests.rs` (via `#[path]`); satisfies the 800-line hard limit	n/a	done (true-no-op refactor; covered by full regression suite + 110 bin/integration tests)
37+	sanity-scan multi-rename support; diarization-mix-review kind (operator workflow design needed); newtype threading at struct seams (deferred simplify finding); `apply_decision` arm dedup + per-kind `OperatorDecision` sub-enums	L3 + L4	pending

Phase B, speaker-id pipeline (cycles 9-16)

These cycles add chatter speaker-id and its three modes.

#	RED	GREEN
9	`speaker_id_explicit_basic`, L3 against an anonymous-2-speaker donor with `--mapping "PAR0=drop,PAR1=INV:Investigator"`, asserts output has only INV utts	Stub `chatter speaker-id` subcommand. Implement `parse_mapping_spec` + `apply_mapping`. Reference mode and override-file mode return `unimplemented!()` for now.
10	`apply_mapping_byte_stable_except_prefix` + `apply_mapping_rewrites_participants` + `apply_mapping_rewrites_id`, L2	Tighten `apply_mapping` per header rewrite rules.
11	`identify_mapping_clean_winner`, L2 with a fixture where one donor speaker overwhelmingly matches the reference	Implement `text_cleaner` + `jaccard` modules. Implement `identify_mapping` using them. Reference mode in CLI now works.
12	`identify_mapping_borderline_refuses`, L2 with a borderline fixture	Add `ConfidenceThreshold` check + `LowConfidence` error path.
13	`speaker_id_reference_low_confidence_exits_4`, L3 against borderline fixture	Map `LowConfidence` to exit code 4 in the CLI; print scores to stderr.
14	`speaker_id_reference_writes_override`, L3 with `--write-override`	Implement `OverrideFile::read_or_default` + `OverrideFile::write`.
15	`speaker_id_override_file_replay`, L3 with `--override-file` + `--session-id`	Implement override-file mode in CLI (`OverrideFile::get` + apply).
16	Token-cleaner L1 specs (a handful of representative `clean-*` specs from L1.1) + current `spec/tools` generators	Move the regex-and-string cleaner into a spec-test-covered implementation. Specs become the regression net.
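
The `jaccard` module from cycle 11 can be sketched as below, assuming the textbook multiset Jaccard definition (sum of per-token minimum counts over sum of per-token maximum counts). The plan names the formula but does not spell it out here, so the definition, the function names, and the choice of 0.0 for two empty inputs are assumptions.

```rust
use std::collections::HashMap;

/// Count how many times each token occurs.
fn counts<'a>(tokens: &[&'a str]) -> HashMap<&'a str, usize> {
    let mut map = HashMap::new();
    for token in tokens {
        *map.entry(*token).or_insert(0) += 1;
    }
    map
}

/// Multiset Jaccard: sum(min(count_a, count_b)) / sum(max(count_a, count_b)).
/// ASSUMPTION: two empty inputs score 0.0 rather than being an error.
pub fn multiset_jaccard<'a>(a: &[&'a str], b: &[&'a str]) -> f64 {
    let (ca, cb) = (counts(a), counts(b));
    let mut intersection = 0usize;
    let mut union_size = 0usize;

    // Tokens present in `a` (and possibly in `b`).
    for (token, &na) in &ca {
        let nb = cb.get(token).copied().unwrap_or(0);
        intersection += na.min(nb);
        union_size += na.max(nb);
    }
    // Tokens present only in `b` contribute to the union alone.
    for (token, &nb) in &cb {
        if !ca.contains_key(token) {
            union_size += nb;
        }
    }

    if union_size == 0 {
        0.0
    } else {
        intersection as f64 / union_size as f64
    }
}
```

For example, `["the", "dog", "the"]` against `["the", "cat"]` shares one `the` out of a multiset union of four tokens, giving 0.25.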

End of Phase B: the full chatter speaker-id + chatter merge pipeline works in auto, explicit, and override-file modes.

Phase C, adjudication (cycles 17-22)

These cycles add the chatter adjudicate tool and its prompter-injection testability.

#	RED	GREEN
17	`adjudicate_empty_pending_file_noop`, L4 against an empty pending file, asserts exit 0 + no changes	Stub `chatter adjudicate` subcommand. Implement `PendingAdjudications::read` + `run_adjudication` core skeleton with a no-op `Prompter` trait.
18	`prompter_scripted_returns_decisions_in_order`, L4	Implement `ScriptedPrompter::from_decisions` (in-memory) per the `Prompter` trait.
19	`adjudicate_speaker_id_accepts_suggested`, L4 against `FIX_PENDING_SPEAKER_ID` with one `AcceptSuggested` decision	Implement `apply_decision` for the speaker-id-low-confidence kind. Override file now gets the decision; pending entry removed.
20	`adjudicate_speaker_id_override_mapping`, L4 with `OverrideMapping` decision	Extend `apply_decision` for the override-mapping variant.
21	`adjudicate_speaker_id_kind_mismatch_rejected`, L4 with an `OverrideInsertedRole` against a speaker-id pending entry	Implement kind→variants validation in `apply_decision`.
22	`adjudicate_scripted_mode_unknown_session_aborts` + `adjudicate_scripted_mode_extra_pending_aborts`, L4	Tighten scripted-mode validation; assert 1:1 mapping between pending entries and scripted decisions.

End of Phase C: scripted adjudication tested end-to-end with synthetic operator inputs. Interactive terminal UX still unimplemented (next phase).

Phase D, interactive UX (cycles 23-25)

#	RED	GREEN
23	`prompter_terminal_round_trip_decision`, L4 with mocked stdin/stdout	Implement `TerminalPrompter` parsing `[a]/[o]/[f]/...` keys + optional follow-up prompts.
24	`adjudicate_resumption_skips_decided_entries`, L4 with a partially-decided override file + full pending list	Implement skip-already-decided logic in `run_adjudication`.
25	Manual smoke test (NOT automated), run `chatter adjudicate --interactive` against the test fixtures; visually confirm the operator UX matches the doc’s mock-up	Polish terminal output: ANSI formatting, fixed-width alignment, the `[m] Show more context` action, the `[p] Play media` action.

End of Phase D: full v1 pipeline complete.

Phase E, non-speaker-id adjudication kinds (cycles 26-29)

Each adjudication kind gets its own RED→GREEN cycle.

#	RED	GREEN
26	`adjudicate_parent_role_overrides_to_mother` + `adjudicate_parent_role_overrides_to_father`, L4	Implement `parent-role-lookup` kind end-to-end (pending schema, prompter context, decision application).
27	`adjudicate_diarization_mix_flag_only`, L4	Implement `diarization-mix-review` kind end-to-end.
28	`adjudicate_sanity_scan_swap_mapping`, L4	Implement `sanity-scan-misclassification` kind end-to-end.
29	`adjudicate_re_adjudicate_preserves_history`, L4	Implement `--re-adjudicate` flag; add `history` field to `MergeOverride`.
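
The history-preserving behavior of cycle 29 can be sketched as a swap-and-push. `MergeOverride` and its `history` field are named in the table; the `Decision` struct, its fields, and the `current` field are hypothetical simplifications of the real override entry.

```rust
/// Hypothetical reduced decision record.
#[derive(Debug, Clone, PartialEq)]
pub struct Decision {
    pub mode: String,
    pub mapping: String,
}

/// Override entry holding the active decision plus every superseded one.
#[derive(Debug, Clone, PartialEq)]
pub struct MergeOverride {
    pub current: Decision,
    pub history: Vec<Decision>,
}

impl MergeOverride {
    /// Install `new_decision` and append the previous one to `history`,
    /// oldest first, so no prior decision is ever lost.
    pub fn re_adjudicate(&mut self, new_decision: Decision) {
        let prior = std::mem::replace(&mut self.current, new_decision);
        self.history.push(prior);
    }
}
```

Because the prior decision is moved rather than cloned, the entry can never end up with the old decision both active and archived.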

Phase F, breadth pass (cycles 30+)

Fill in every remaining test from L1-L4 that hasn’t been written yet. These are coverage-deepening tests, not behavior adders. The impl from Phases A-E should pass them with at most minor refactoring; if a test fails meaningfully, that’s a gap in the impl that this cycle closes.

The breadth pass is the only phase where multiple cycles can proceed in parallel (different contributors take different test groups). Phases A-E are strictly serial.

Hard rules during impl phase

No test stubs. Every test in this plan, when written, must FAIL before its impl exists and PASS after. Tests may not be skipped or marked #[ignore] as a stand-in for “not implemented yet”; reserve #[ignore] for genuinely slow or environment-dependent tests.
No test deletion to make CI green. If a test that was passing starts failing after a refactor, the refactor is wrong. Investigate; do not delete the test.
Three cycle archetypes, distinguish them. A cycle is one of:
- bug-fix: RED motivates new impl code (cycle N-1’s impl truly cannot satisfy the new test).
- regression-guard: RED pins an invariant the impl inherits from upstream infrastructure (e.g. parse→serialize byte-stability inherited from talkbank-parser). The test passes against cycle N-1’s impl, but the cycle is valuable because it locks in the invariant against future “optimizations” that might break it. Verbose-output the actual behavior on first run to confirm the invariant holds for the right reasons, not by accident.
- true no-op: RED tests something already pinned elsewhere. These ARE unnecessary; drop the cycle or sharpen the test. The difference between regression-guard and true no-op is whether the invariant is named explicitly anywhere else. If yes (e.g., the parser crate already has a roundtrip test that covers it), the cycle is true-no-op. If no, the cycle is a regression-guard and worth keeping.

Merge Pipeline, Crate Architecture

Status: Draft Last modified: 2026-07-07 21:17 EDT

This page explains where the new merge-pipeline code lives in the chatter workspace, which crates gain modules, what depends on what, and which boundary each piece sits inside. The goal is succession-readability: a contributor coming to this work for the first time should be able to map a behavior they read about in chatter merge or chatter speaker-id to the precise crate + module that implements it.

Companion documents:

Domain Types: the typed vocabulary in talkbank-transform::speaker_id (and MergeError beside the merge algorithm in talkbank-transform::transcript_merge).
Test Plan: what tests live where.
Override File Format: the on-disk format.

Boundary decisions

Two boundary decisions govern where every new piece of code lives. Both reference rules already documented in this repo’s root CLAUDE.md (workspace-root contributor guide, outside the book).

Decision 1: talkbank-* crates, not batchalign-* crates

The merge pipeline is pure CHAT-AST structural manipulation: no ML, no audio I/O, no network, no model loading, no fleet runtime. Per the crate-boundary decision test in the workspace CLAUDE.md:

If code fundamentally needs ML models, audio processing, network services, or fleet runtime → batchalign-* crate. Otherwise → talkbank-* crate.

chatter merge and chatter speaker-id answer “no” to each ML/audio/network/runtime question. They consume parsed ChatFile values, manipulate them, and emit parsed-and-serialized output. Even the speaker-id text-similarity scoring is a deterministic function over CHAT content tokens: no ML model, no embedding, no inference. All new merge code lives in talkbank-* crates.

The batchalign-* crates remain the home for batchalign3 transcribe (ASR), batchalign3 align (forced alignment), and batchalign3 morphotag (Stanza-based morphological tagging), the ML-bearing stages that surround the merge in the pipeline.

Decision 2: types and algorithms in `talkbank-transform`, CHAT vocabulary in `talkbank-model`, CLI in `chatter`

The merge pipeline’s code splits across the same talkbank-* crates that already host the parse/validate/normalize/JSON pipelines:

talkbank-model owns the CHAT-domain vocabulary the merge code references (SpeakerCode, ParticipantRole, ParticipantEntry, IDHeader, ChatFile). It gained no new merge module.
talkbank-transform owns both the merge-specific domain types (MappingSpec, InsertedRoleSpec, SpeakerAction, OverrideMode, MergeOverride, OverrideFile, the error enums) and the algorithms (token cleaning, Jaccard scoring, mapping application, structural merge, adjudication core). No CLI parsing, no clap.
chatter owns the subcommands (chatter speaker-id, chatter merge, chatter adjudicate, plus the composing pipeline / batch / sanity-scan drivers). Thin shim layer that parses arguments and drives the transform layer.

Design history. The original design gave the domain types their own talkbank-model::merge module (“types in the model crate, algorithms in the transform crate”). As shipped, the types live with the algorithms in talkbank-transform::speaker_id instead; the talkbank-model::merge module was never created. See Domain Types §Where the types live.

This mirrors how chatter validate, chatter normalize, and chatter to-json are wired today and keeps the crate boundaries honest: a caller wanting the algorithms and types without CLI machinery (e.g., a library binding, an HTTP service, an external tool reading override files) depends on talkbank-transform without pulling in clap.

Crate dependency graph

The new code does not introduce any new crate-level dependencies; every edge below already exists in the workspace today. The merge work adds modules to existing crates.

flowchart TD
    derive["talkbank-derive\n(proc macros, unchanged)"]
    model["talkbank-model\n(CHAT vocabulary, unchanged)"]
    parser["talkbank-parser\n(unchanged)"]
    transform["talkbank-transform\n(+ speaker_id, transcript_merge,\nadjudication, sanity_scan modules)"]
    cli["chatter\n(+ speaker-id, merge, adjudicate,\npipeline, batch, sanity-scan subcommands)"]
    cli_tests["chatter/tests/\n(+ merge_tests, speaker_id_tests,\nadjudication_tests, pipeline_tests, batch_tests)"]
    transform_tests["talkbank-transform/tests/\n(+ transcript_merge_tests, speaker_id_tests,\nadjudication_tests)"]

    derive --> model
    model --> parser
    model --> transform
    parser --> transform
    transform --> cli
    model --> cli
    transform --> transform_tests
    transform --> cli_tests
    cli --> cli_tests

Module layout per affected crate

`talkbank-model`: unchanged

talkbank-model gained no merge module. (The original design added a crates/talkbank-model/src/merge/ module with scoring / role / mapping / retain / override_file / errors files and pub use role::{InsertedRole, MappingAction}-style re-exports; none of that was created. The domain types shipped inside talkbank-transform::speaker_id instead, with revised names; see Domain Types.) The merge code consumes talkbank-model’s existing CHAT vocabulary (SpeakerCode, ParticipantRole, ParticipantEntry, IDHeader, ChatFile) unmodified.

`talkbank-transform`, `speaker_id/` module + `transcript_merge.rs`

Sibling top-level modules, mirroring the user-facing distinction between the two subcommands. speaker_id/ holds both the domain types and the algorithms; transcript_merge fits in a single file:

crates/talkbank-transform/src/speaker_id/
    mod.rs             pub re-exports (the crate-facing surface)
    types.rs           JaccardScore, ConfidenceMargin, ConfidenceThreshold
    mapping.rs         MappingSpec, SpeakerAssignment, parse_mapping_spec
    identify.rs        identify_mapping (token cleaning + multiset
                       Jaccard), DonorMatchReport,
                       DEFAULT_CONFIDENCE_THRESHOLD
    apply.rs           apply_mapping, apply_mapping_chat
                       (@Participants / @ID rewriting per mapping)
    override_file.rs   CURRENT_SCHEMA_VERSION, OverrideMode,
                       SpeakerAction, InsertedRoleSpec, MergeOverride,
                       OverrideFile, OverrideFileError
    provenance.rs      DecisionEngine, JudgmentProvenance, ...
    error.rs           SpeakerIdError
    judgment/          LLM holistic-judgment surface (sampling,
                       prompt rendering, provider, consume)

crates/talkbank-transform/src/transcript_merge.rs
    merge_chats (preconditions, header reconciliation, timeline
    interleave, tier strip), MergeError, DEFAULT_STRIP_TIERS

crates/talkbank-transform/src/adjudication.rs
    run_adjudication core, Prompter trait, ScriptedPrompter,
    PendingAdjudications

crates/talkbank-transform/src/sanity_scan.rs
    post-merge misclassification heuristic (scan_session)

All of these land alongside the existing CHAT-core transform modules (parse, serialize, validate, normalize) in talkbank-transform.
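The scoring step named under identify.rs (multiset Jaccard over cleaned tokens) can be illustrated with a minimal sketch. The function names below are hypothetical, the token-cleaning step is omitted, and this is not the signature of identify_mapping. It shows one common definition of multiset Jaccard: the sum of per-token minimum counts divided by the sum of per-token maximum counts.

```rust
use std::collections::HashMap;

/// Count how often each token occurs (hypothetical helper, not the crate's API).
fn counts<'a>(tokens: &[&'a str]) -> HashMap<&'a str, usize> {
    let mut map = HashMap::new();
    for token in tokens {
        *map.entry(*token).or_insert(0) += 1;
    }
    map
}

/// Multiset Jaccard: sum(min(count_a, count_b)) / sum(max(count_a, count_b)).
/// Two empty inputs score 0.0 here; that convention is an assumption.
fn multiset_jaccard(a: &[&str], b: &[&str]) -> f64 {
    let ca = counts(a);
    let cb = counts(b);
    let mut intersection = 0usize;
    let mut union = 0usize;
    for (token, &na) in &ca {
        let nb = cb.get(token).copied().unwrap_or(0);
        intersection += na.min(nb);
        union += na.max(nb);
    }
    // Tokens that occur only in `b` contribute to the union alone.
    for (token, &nb) in &cb {
        if !ca.contains_key(token) {
            union += nb;
        }
    }
    if union == 0 {
        0.0
    } else {
        intersection as f64 / union as f64
    }
}
```

Because the function depends only on its two token slices, the same inputs always produce the same score, which is the property that keeps this code on the talkbank-* side of the boundary.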

Exposed via crates/talkbank-transform/src/lib.rs:

pub mod adjudication;
pub mod sanity_scan;
pub mod speaker_id;
pub mod transcript_merge;

`chatter`, new command modules

The CLI dispatch pattern in this crate uses one directory per multi-file command (e.g. commands/validate/) or one file for single-file commands (commands/normalize.rs, commands/lint.rs). Speaker-id warranted a directory (it has reference / explicit / override-file operation modes plus override/pending write paths); merge and the other pipeline commands fit in single files:

crates/chatter/src/commands/speaker_id/
    mod.rs        SpeakerIdArgs + run_speaker_id entry point
    modes.rs      reference / explicit / override-file / holistic-LLM
                  mode drivers
    writes.rs     --write-override / --write-pending output paths
    support.rs    shared helpers (CODE:ROLE parsing, session-ID
                  derivation, typed error-to-exit-code mapping)

crates/chatter/src/commands/transcript_merge.rs
    run_merge: drives talkbank_transform::transcript_merge::merge_chats,
    maps MergeError to exit codes

crates/chatter/src/commands/adjudicate.rs   chatter adjudicate
crates/chatter/src/commands/pipeline.rs     chatter pipeline (speaker-id
                                            then merge, one session)
crates/chatter/src/commands/batch.rs        chatter batch (many sessions)
crates/chatter/src/commands/sanity_scan.rs  chatter sanity-scan
crates/chatter/src/commands/merge_preflight.rs  merge preflight checks

The CLI argument surface extends the top-level Commands enum in crates/chatter/src/cli/args/core.rs, which carries Merge, SpeakerId, Adjudicate, Pipeline, Batch, and SanityScan variants with inline field definitions (not separate *Args structs in the command modules). Subcommand dispatch in crates/chatter/src/commands/dispatch.rs matches on the enum and wires each arm to the respective commands::*::run_* entry point.

Test crates

Per the Test Plan:

crates/talkbank-transform/tests/
    speaker_id_tests.rs        L2 tests for identify_mapping /
                               apply_mapping / override-file I/O
    transcript_merge_tests.rs  L2 tests for merge invariants
    adjudication_tests.rs      L4 scripted-prompter tests

crates/chatter/tests/
    merge_tests.rs             L3 subprocess tests for chatter merge
    speaker_id_tests.rs        L3 subprocess tests for chatter speaker-id
    adjudication_tests.rs      L3 subprocess tests for chatter adjudicate
    pipeline_tests.rs          L3 composition tests (speaker-id + merge)
    batch_tests.rs             L3 batch-driver tests
    sanity_scan_tests.rs       L3 sanity-scan tests

(The test plan’s L1 layer, spec/constructs/speaker-id/ fragment specs regenerated via spec/tools, was not created; the token-cleaner and Jaccard behaviors are pinned by the L2 tests instead.)

Data flow for `chatter merge`

The full call graph when an operator runs chatter merge file1.cha file2.cha --retain CHI -o out.cha:

sequenceDiagram
    actor Operator
    participant CLI as chatter<br/>(cli/args/core.rs, Commands::Merge)
    participant Runner as commands::transcript_merge<br/>(run_merge)
    participant Merge as talkbank-transform::transcript_merge<br/>(merge_chats)

    Operator->>CLI: chatter merge file1 file2 --retain CHI
    CLI->>Runner: run_merge(file1, file2, retain, output)
    Runner->>Merge: merge_chats(content1, content2, retain, strip_tiers, options)
    Merge->>Merge: parse_and_validate both inputs
    Merge->>Merge: preconditions (retain / timeline /<br/>languages / ambiguous / already-declared)
    Merge->>Merge: header reconcile (@Participants concat<br/>with dedupe-on-insert; @ID / @Comment injection)
    Merge->>Merge: tier strip on inserted utts · timeline sort
    Merge-->>Runner: merged CHAT String or MergeError
    alt Ok(merged)
        Runner->>Runner: write to -o path (or stdout)
        Runner-->>Operator: exit 0
    else Err(MergeError)
        Runner-->>Operator: formatted stderr + exit code 2 (Parse: exit 1)
    end

The CLI layer is thin: clap parses arguments into the Commands::Merge variant, run_merge calls the transform layer’s merge_chats function, and translates the Result<String, MergeError> into stdout/stderr/exit-code output. All algorithm logic lives in talkbank-transform.
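A minimal sketch of that translation step follows. The error enum is a hypothetical stand-in for MergeError, modelling only the split the diagram shows (parse failures exit 1, other merge errors exit 2); the real variants and the real run_merge signature are not reproduced here.

```rust
/// Hypothetical stand-in for the real `MergeError`.
#[derive(Debug)]
enum MergeErrorSketch {
    Parse(String),
    Precondition(String),
}

/// What the shim hands to the process: stdout text, stderr text, exit code.
#[derive(Debug, PartialEq)]
struct CliOutcome {
    stdout: Option<String>,
    stderr: Option<String>,
    exit_code: i32,
}

/// Translate the transform layer's result into CLI output; no algorithm logic here.
fn translate(result: Result<String, MergeErrorSketch>) -> CliOutcome {
    match result {
        Ok(merged) => CliOutcome {
            stdout: Some(merged),
            stderr: None,
            exit_code: 0,
        },
        Err(MergeErrorSketch::Parse(msg)) => CliOutcome {
            stdout: None,
            stderr: Some(format!("parse error: {msg}")),
            exit_code: 1,
        },
        Err(MergeErrorSketch::Precondition(msg)) => CliOutcome {
            stdout: None,
            stderr: Some(format!("merge refused: {msg}")),
            exit_code: 2,
        },
    }
}
```

Keeping the mapping in one pure function means the exit-code contract can be tested without spawning a subprocess.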

Data flow for `chatter speaker-id`

The reference-mode call path:

sequenceDiagram
    actor Operator
    participant CLI as chatter<br/>(cli/args/core.rs, Commands::SpeakerId)
    participant Runner as commands::speaker_id::modes<br/>(run_reference_mode)
    participant SpkId as talkbank-transform::speaker_id<br/>(identify.rs / apply.rs)
    participant Override as talkbank-transform::speaker_id<br/>(override_file.rs)

    Operator->>CLI: chatter speaker-id input --reference ref --anchor CHI<br/>--inserted-role INV:Investigator
    CLI->>Runner: run_speaker_id(args) → run_reference_mode
    Runner->>SpkId: parse donor + reference (parse_and_validate)
    Runner->>SpkId: identify_mapping(reference, anchor, donor, threshold)
    SpkId-->>Runner: DonorMatchReport or Err(LowConfidence { report, threshold })
    alt Ok(report)
        Runner->>Runner: build MappingSpec (winner → drop,<br/>others → inserted role)
        Runner->>SpkId: apply_mapping_chat(donor, mapping)
        SpkId-->>Runner: relabeled CHAT String
        opt --write-override
            Runner->>Override: OverrideFile::read_or_default(path)
            Override-->>Runner: OverrideFile
            Runner->>Override: upsert(session_id, MergeOverride::auto_decision), write
        end
        Runner-->>Operator: relabeled output, exit 0
    else Err(LowConfidence)
        opt --write-pending
            Runner->>Runner: record pending-adjudication entry
        end
        Runner-->>Operator: scores to stderr, exit 4
    end

The explicit-mapping and override-file modes use the same apply_mapping and --write-override paths but skip identify_mapping: the mapping comes from parse_mapping_spec or from OverrideFile::get + MergeOverride::to_mapping_spec respectively. A fourth mode (holistic LLM judgment, via the judgment/ submodule) produces pending-adjudication entries for chatter adjudicate rather than deciding directly; see Adjudication Workflow.
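The reference-mode decision (winner is dropped, every other donor speaker is renamed, low margins are refused) can be sketched as below. All type and function names are hypothetical. The margin is assumed to be the ratio of the top score to the runner-up, which is how the example figures elsewhere in this book read (0.6286 / 0.3457 ≈ 1.82); the actual computation lives in identify_mapping.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-speaker action; the real vocabulary lives in talkbank-transform.
#[derive(Debug, PartialEq)]
enum Action {
    Drop,
    Rename,
}

#[derive(Debug, PartialEq)]
enum Decision {
    Mapping(BTreeMap<String, Action>),
    LowConfidence { margin: f64 },
}

/// Returns None when there are no donor speakers to score.
fn decide(scores: &BTreeMap<String, f64>, threshold: f64) -> Option<Decision> {
    let mut ranked: Vec<(&String, f64)> = scores.iter().map(|(k, v)| (k, *v)).collect();
    // Highest score first; ties keep the map's key order (stable sort).
    ranked.sort_by(|a, b| b.1.total_cmp(&a.1));
    let (winner, top) = *ranked.first()?;
    let runner_up = ranked.get(1).map(|r| r.1).unwrap_or(0.0);
    // Assumed margin: ratio of winner to runner-up.
    let margin = if runner_up > 0.0 {
        top / runner_up
    } else if top > 0.0 {
        f64::INFINITY
    } else {
        0.0
    };
    if margin < threshold {
        return Some(Decision::LowConfidence { margin });
    }
    let mapping = scores
        .keys()
        .map(|code| {
            let action = if code == winner { Action::Drop } else { Action::Rename };
            (code.clone(), action)
        })
        .collect();
    Some(Decision::Mapping(mapping))
}
```

With scores of 0.6286 and 0.3457 and a threshold of 2.0, this sketch refuses, which corresponds to the exit-4 branch in the diagram.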

How this composes with the post-merge ML stages

The end-to-end pipeline batchalign3 transcribe → chatter speaker-id → chatter merge → batchalign3 align → batchalign3 morphotag crosses the talkbank-* / batchalign-* boundary twice:

flowchart LR
    subgraph BA[Batchalign, ML / audio / network]
        Trans["batchalign3 transcribe"]
        Align["batchalign3 align"]
        Morph["batchalign3 morphotag"]
    end
    subgraph TB[talkbank, pure CHAT-AST]
        SpkId["chatter speaker-id"]
        Merge["chatter merge"]
    end
    Media["mp4 / wav media"] --> Trans
    Trans -->|ASR.cha| SpkId
    Hand["hand transcript.cha"] -->|reference| SpkId
    Hand --> Merge
    SpkId -->|labeled.cha| Merge
    Merge -->|merged.cha| Align
    Align -->|+ bullets + %wor| Morph
    Morph -->|+ %mor + %gra| Final["final.cha"]

Each crossing is CHAT-file-to-CHAT-file at a stable serialization boundary: Batchalign emits a CHAT file, talkbank consumes it; talkbank emits a CHAT file, Batchalign consumes it. Neither side has a runtime dependency on the other; they exchange data through the file system (or piped stdin/stdout) exactly as the user-facing CLI commands do. This keeps the boundary honest: a contributor working on the merge pipeline never needs to load a Stanza model, and a contributor working on batchalign3 align never needs to parse a speaker-id override file.

Public surface impact

Cumulative public API additions (the surface a downstream library consumer would see):

Crate	New `pub` items	Stability
`talkbank-model`	None; the merge work reuses the existing CHAT vocabulary (`SpeakerCode`, `ParticipantRole`, `ParticipantEntry`, `IDHeader`, `ChatFile`) unmodified	Unchanged
`talkbank-transform`	`speaker_id::{identify_mapping, apply_mapping, apply_mapping_chat, parse_mapping_spec, MappingSpec, SpeakerAssignment, DonorMatchReport, SpeakerIdError, CURRENT_SCHEMA_VERSION, OverrideFile, MergeOverride, OverrideMode, SpeakerAction, InsertedRoleSpec, OverrideFileError, ...}` (plus the `judgment` and provenance surfaces); `transcript_merge::{merge_chats, MergeError, DEFAULT_STRIP_TIERS}`; `adjudication::`; `sanity_scan::`	Stable; the algorithms behind these are pinned by the test plan’s L2 tests
`chatter`	New `Commands` enum variants (`Merge`, `SpeakerId`, `Adjudicate`, `Pipeline`, `Batch`, `SanityScan`)	Internal to the binary, not a library surface

No existing public surface is modified or removed; this is a purely-additive change. Existing consumers (the VS Code extension, talkbank-lsp, chatter-desktop, batchalign) continue to depend on the existing surface and can ignore the additions until a workflow uses them.

Where to look for things (newcomer guide)

Question	File
“What does `chatter merge` do?”	`book/src/chatter/user-guide/merge.md`
“What does `chatter speaker-id` do?”	`book/src/chatter/user-guide/speaker-id.md`
“What’s in an override file?”	`book/src/chatter/integrating/merge-overrides.md`
“What types are in `talkbank-transform::speaker_id`?”	`book/src/architecture/merge-domain-types.md`
“Where are the tests?”	`book/src/architecture/merge-test-plan.md`
“Which crate is this code in and why?”	This page
“Where does the merge code live in source?”	`crates/talkbank-transform/src/speaker_id/` + `crates/talkbank-transform/src/transcript_merge.rs` + `crates/chatter/src/commands/speaker_id/` + `crates/chatter/src/commands/transcript_merge.rs`
“What’s in an utterance / `ChatFile` / `%mor` tier?”	`talkbank-model` crate rustdoc; `book/src/architecture/chat-model/chat-model.md`
“What does the parser do?”	`book/src/architecture/parsing.md`; `book/src/architecture/parser-model-contracts.md`

Adjudication Workflow

Status: Draft Last modified: 2026-07-07 21:17 EDT

This page specifies how human-in-the-loop adjudication fits into the merge pipeline. Several pipeline stages have decision points where the algorithm cannot or should not auto-decide; this document specifies how those refusals reach an operator, how the operator’s decision is recorded, and how the pipeline resumes with the decision applied.

The design satisfies two constraints set explicitly upstream:

Test the interaction. Every operator-decision path must be exercisable in automated tests by providing synthetic operator choices. No hardcoded stdin reads in the decision core; a pluggable prompter abstraction is mandatory.
Batch-then-review is the default workflow. No mid-batch interactive pauses in the main pipeline. The optional --interactive flag exists on the adjudication tool only, for small-batch debugging, and rides on the same data contract.

Companion documents:

Merge Override File Format, the on-disk record of decisions.
Domain Types: SpeakerMapping, MergeOverride, etc.
Test Plan: where the adjudication tests live.
Crate Architecture: where the adjudication code lives.

Why batch-then-review, and not real-time

Every adjudication point in the pipeline is per-session local: the operator’s decision affects this session’s output and no other session in the same batch. There is no case where an operator decision propagates forward to influence how other sessions get processed.

The cases that might appear to want real-time interaction are better served by sampling:

Case	Real-time approach	Better approach
Systematic pipeline failure (everything refuses)	Watch each refusal, abort batch	Run a 5-10-session canary first; examine; abort or proceed
Confidence-threshold calibration on a new corpus	Adjust threshold mid-batch	Run canary; pick threshold; full batch
Cross-session pattern (one contributor always has PAR0 = clinician)	Notice during interactive review	Run canary; observe pattern; add per-contributor explicit mapping to orchestrator config
Operator wants per-session progress visibility	Watch each step	`chatter adjudicate --interactive` after a batch run, walking the same pending queue

TalkBank’s operational reality makes batch-then-review strictly better:

Batches are research-scale (hundreds of sessions per donor). Forcing operator presence during the batch run means forcing hours of babysitting.
Overnight and fleet runs are routine; interactive doesn’t work for those.
Focused operator review of all refusals together is more efficient than scattered per-batch decisions (less context-switching; easier to spot patterns across sessions).
Aligns with the project’s “academic research, accuracy is the standard, take however long it takes” rule: operator efficiency dominates wall-clock latency.

The --interactive flag is preserved for the small-batch debugging case but is explicitly NOT the dominant workflow.

The known adjudication points

The pipeline has at least five points where adjudication may be needed. Each is recorded as one or more entries in the override file via the same schema.

#	Adjudication point	Trigger	Operator’s decision	Affects
1	Speaker-id low confidence	`chatter speaker-id` Jaccard margin < threshold	Per-speaker mapping (drop/rename) and per-donor-code `adult_roles`	Speaker labeling, drop set, downstream merge
2	Parent role lookup	Parent-sample session needs `MOT` vs `FAT` decision	`adult_roles[donor_speaker].code` and `.tag` for this session	The merged file’s headers + main-tier prefixes
3	Diarization-mix flag	Operator observes Batchalign collapsed multiple real-world speakers into one label	`flags = ["diarization-mixed"]` plus a note	Downstream consumers know output is imperfect; might gate publication
4	Post-merge sanity scan	Auto-scan flags retained-speaker utterances with high-text-similarity inserted-speaker utterances nearby (suggesting speaker-id misclassification)	Confirm or override the original speaker-id mapping	Triggers re-run of speaker-id + merge for the session
5	Unbulleted reference file	Reference CHAT file has no time bullets; merge can’t proceed	Either bullet the reference upstream, or request fresh authoritative data	Pipeline blocked for this session pending external fix

Points 1-4 are handled by the unified chatter adjudicate tool specified below. Point 5 is an out-of-scope failure mode: the adjudication tool records that the session is blocked, but the fix lives outside this pipeline (operator contacts the contributor or runs forced-alignment first).

Data flow

flowchart TD
    Inputs["Input CHAT files +<br/>reference files"]
    Orch["Orchestrator<br/>(future: tb subcommand;<br/>now: shell/script)"]
    SpkId["chatter speaker-id<br/>(per session)"]
    Merge["chatter merge<br/>(per session)"]
    Pending["pending-adjudications.toml<br/>(workflow queue)"]
    Override["overrides.toml<br/>(durable decisions)"]
    Adj["chatter adjudicate"]
    Operator((Operator))
    Final["merged/*.cha"]

    Inputs --> Orch
    Orch -->|pass 1: speaker-id| SpkId
    SpkId -->|exit 0 → auto entry| Override
    SpkId -->|exit 4 → pending entry| Pending
    Orch -->|pass 1: merge for ok sessions| Merge
    Merge --> Final
    Pending --> Adj
    Override --> Adj
    Adj <-->|prompter| Operator
    Adj -->|writes decision| Override
    Adj -->|removes resolved| Pending
    Override -->|pass 2| Orch
    Orch -.->|loop until pending empty| SpkId

The orchestrator runs two passes:

Pass 1: for every input session, run chatter speaker-id in reference mode. A successful auto-decision is written to the override file with mode = "auto", and the session proceeds immediately to chatter merge. A refusal (exit code 4) or any other adjudication-requiring state writes a pending entry to pending-adjudications.toml, and the session is skipped for the rest of pass 1.

Pass 2 (after operator runs chatter adjudicate): the orchestrator re-runs chatter speaker-id for the previously skipped sessions, now finding decisions in the override file (mode = "override"). Sessions complete; pending entries are removed.

The pipeline is idempotent: re-running pass 1 on a partially adjudicated batch produces no spurious work, because sessions with already-recorded decisions skip to merge directly.
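The routing logic an orchestrator script needs can be sketched in a few lines. Every name here is hypothetical (the orchestrator is currently a shell script, not Rust). The only facts taken from this page are that exit 0 yields an auto entry, exit 4 yields a pending entry, and sessions with a recorded decision skip to merge. Treating every other exit code as a hard failure is an assumption.

```rust
use std::collections::BTreeSet;

#[derive(Debug, PartialEq)]
enum Pass1Step {
    SkipToMerge,
    RunSpeakerId,
}

/// Idempotence: a session that already has a recorded decision is not re-scored.
fn plan_step(session_id: &str, decided: &BTreeSet<String>) -> Pass1Step {
    if decided.contains(session_id) {
        Pass1Step::SkipToMerge
    } else {
        Pass1Step::RunSpeakerId
    }
}

#[derive(Debug, PartialEq)]
enum SpeakerIdOutcome {
    AutoDecided,
    NeedsAdjudication,
    Failed(i32),
}

/// Route on the speaker-id exit code: 0 → override entry, 4 → pending entry.
fn classify_exit(code: i32) -> SpeakerIdOutcome {
    match code {
        0 => SpeakerIdOutcome::AutoDecided,
        4 => SpeakerIdOutcome::NeedsAdjudication,
        other => SpeakerIdOutcome::Failed(other),
    }
}
```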

The pending-adjudications artifact

Separate from the override file, a pending-adjudications.toml file holds in-flight workflow state. Its purpose is to carry the evidence the operator needs (per-speaker scores, opening utterance previews) from the orchestrator’s pass 1 to the adjudication tool, without polluting the override file with “to-do” entries.

Schema

schema_version = 2

[[entries]]
session_id = "session-102-t1"
kind = "speaker-id-low-confidence"
created_at = 2026-05-27T11:00:00-04:00

# Inputs the adjudication tool needs:
input_path = "asr/session-102-t1.cha"
reference_path = "chi-only/session-102-t1.cha"
anchor_speaker = "CHI"

# Evidence for the operator:
scores = { PAR0 = 0.6286, PAR1 = 0.3457 }
margin = 1.82
threshold_used = 2.0

# Opening turns (first N utterances per speaker) for context:
preview = """
*CHI:    they start to bite . [0_1708]
*PAR0:   They start to bite . [75_1165]
*PAR1:   They do what . [1515_2245]
... (further preview)
"""

# Suggested defaults the operator can accept-as-is:
suggested = { mapping = { PAR0 = "drop", PAR1 = "rename" }, adult_roles = { PAR1 = { code = "INV", tag = "Investigator" } } }

[[entries]]
session_id = "session-103-t1-parent"
kind = "parent-role-lookup"
# ... different evidence for the MOT-vs-FAT case ...

Schema characteristics

kind discriminates the adjudication type (one of speaker-id-low-confidence, parent-role-lookup, diarization-mix-review, sanity-scan-misclassification). Each kind has its own required field set; the adjudication tool dispatches on kind to choose the right prompt template and the right validator for the operator’s response.
suggested carries what the algorithm WOULD have chosen had the threshold been lower (for speaker-id) or a parsed default (for parent-role). The operator can accept-as-is or override.
Entries are a [[entries]] array of tables (not a session-keyed [<session_id>] map) because the same session could conceivably have multiple pending decisions (e.g., a speaker-id refusal AND a parent-role lookup), each a separate array entry.
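As an illustration of the kind discriminator and the array-of-tables shape, here is a minimal sketch with hypothetical type names. It omits TOML parsing and the per-kind evidence fields; the actual types (PendingAdjudications and friends) live in talkbank-transform and may differ.

```rust
/// The four `kind` values listed above, as a closed enum (hypothetical name).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PendingKind {
    SpeakerIdLowConfidence,
    ParentRoleLookup,
    DiarizationMixReview,
    SanityScanMisclassification,
}

impl PendingKind {
    /// Parse the on-disk string; unknown kinds are rejected rather than guessed.
    fn parse(raw: &str) -> Option<Self> {
        match raw {
            "speaker-id-low-confidence" => Some(Self::SpeakerIdLowConfidence),
            "parent-role-lookup" => Some(Self::ParentRoleLookup),
            "diarization-mix-review" => Some(Self::DiarizationMixReview),
            "sanity-scan-misclassification" => Some(Self::SanityScanMisclassification),
            _ => None,
        }
    }
}

/// One element of the `[[entries]]` array (evidence fields omitted).
#[derive(Debug, PartialEq)]
struct PendingEntry {
    session_id: String,
    kind: PendingKind,
}

/// A session may appear more than once, which is why entries form a list, not a map.
fn kinds_for(entries: &[PendingEntry], session_id: &str) -> Vec<PendingKind> {
    entries
        .iter()
        .filter(|entry| entry.session_id == session_id)
        .map(|entry| entry.kind)
        .collect()
}
```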

Lifecycle

Written by: the orchestrator’s pass 1, when chatter speaker-id exits with code 4 or when other adjudication triggers fire.
Consumed by: chatter adjudicate, which reads it, prompts the operator entry-by-entry, writes decisions to the override file, and removes resolved entries.
Cleaned up: an empty entries array is the “all clear” state; pass 2 of the orchestrator can proceed.

`chatter adjudicate`, CLI surface

A new subcommand of the chatter CLI. Its job is to walk a pending-adjudications file and write decisions to an override file.

chatter adjudicate <PENDING_FILE> --override-file <OVERRIDE_FILE> [OPTIONS]

ARGUMENTS:
  <PENDING_FILE>   Path to pending-adjudications.toml.

REQUIRED OPTIONS:
  --override-file <PATH>
      Path to the override file (created if missing, appended if
      existing). Decisions go here.

OPTIONS:
  --interactive
      (default) Prompt the operator for each pending entry via
      a terminal UI. This is the only UI backend for v1; later
      backends may add e.g. --backend=web for web-served prompts.

  --scripted <PATH>
      Read pre-canned decisions from a TOML file. Used in tests
      and in automated bulk-decision workflows (e.g., the
      operator has prepared a decision sheet in advance).
      Mutually exclusive with --interactive.

  --kind <KIND>
      Process only pending entries whose `kind` matches. Useful
      when the operator wants to batch through one class of
      decision at a time (e.g., do all parent-role lookups
      first, then all speaker-id refusals).

  --skip-on-error
      If the operator's response cannot be applied (e.g., they
      typed an invalid speaker code), log and skip rather than
      abort. Default: abort on first invalid response.

  --operator <NAME>
      Operator identifier recorded in override entries.
      Default: $USER.

  --dry-run
      Read pending and prompt the operator, but do NOT write to
      the override file. Useful for previewing what decisions
      look like before committing.

Exit codes:

Code	Meaning
0	All pending entries decided; pending file updated
1	I/O error (missing file, unparseable, write failure)
2	Operator-supplied decision rejected as invalid (when `--skip-on-error` not set)
3	Internal error
4	Operator deferred at least one entry (used `:skip` in the prompt); pending file still has entries

The --scripted mode is the testability seam. A scripted decision file looks like:

schema_version = 2

[[decisions]]
session_id = "session-102-t1"
kind = "speaker-id-low-confidence"
choice = { kind = "accept-suggested", note = "verified by listening" }

[[decisions]]
session_id = "session-103-t1-parent"
kind = "parent-role-lookup"
choice = { kind = "override", adult_roles = { PAR0 = { code = "FAT", tag = "Father" } }, note = "per contributor data sheet" }

The adjudication tool reads the scripted file, matches decisions to pending entries by session_id + kind, and applies each as though the operator had typed it. If a scripted decision has no matching pending entry, or a pending entry has no scripted decision, the run aborts with a clear error.
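The both-directions matching rule can be sketched as a set comparison over (session_id, kind) keys. The names below are hypothetical, and the keys are plain strings for brevity; the shipped code presumably uses typed identifiers and its own error type.

```rust
use std::collections::BTreeSet;

/// (session_id, kind) pair used as the match key.
type Key = (String, String);

#[derive(Debug, PartialEq)]
enum MatchError {
    DecisionWithoutPending(Key),
    PendingWithoutDecision(Key),
}

/// Succeeds only when scripted decisions and pending entries cover exactly the same keys.
fn check_scripted(pending: &[Key], decisions: &[Key]) -> Result<(), MatchError> {
    let pending_set: BTreeSet<&Key> = pending.iter().collect();
    let decision_set: BTreeSet<&Key> = decisions.iter().collect();
    if let Some(key) = decision_set.difference(&pending_set).next() {
        return Err(MatchError::DecisionWithoutPending((**key).clone()));
    }
    if let Some(key) = pending_set.difference(&decision_set).next() {
        return Err(MatchError::PendingWithoutDecision((**key).clone()));
    }
    Ok(())
}
```

Checking both directions up front means a stale decision sheet fails before any entry is written to the override file.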

The prompter abstraction (testability)

The adjudication tool’s core flow is:

// pseudocode, actual signatures live in talkbank-transform
pub fn run_adjudication(
    pending: PendingAdjudications,
    override_file: &mut OverrideFile,
    prompter: &mut dyn Prompter,
    operator: OperatorId,
) -> Result<AdjudicationOutcome, AdjudicationError> {
    for entry in pending.entries() {
        let context = build_context(entry);
        let decision = prompter.ask(&context)?;
        apply_decision(override_file, entry, decision, &operator);
    }
    Ok(...)
}

pub trait Prompter {
    fn ask(&mut self, context: &AdjudicationContext)
        -> Result<OperatorDecision, PrompterError>;
}

Production implementations:

TerminalPrompter: prints context to stdout, reads operator response from stdin. Used by --interactive.

Test implementations:

ScriptedPrompter::from_decisions(Vec<(SessionId, OperatorDecision)>): returns each decision in turn, errors if asked for an unprovided session. Used by L2 transform tests.
ScriptedTomlPrompter::read(path): reads the same TOML format as --scripted. Used by L3 CLI tests so subprocess tests and library-level tests share fixture format.

This means:

Every adjudication test path is automated. No subprocess PTY hackery, no expect-script DSL. Tests construct ScriptedPrompter, run the adjudication core, assert on the resulting OverrideFile.
The terminal UI is dumb. All it does is Display-format the context and parse the operator’s response into an OperatorDecision. No business logic in the UI layer.
Future UI backends (VS Code, web) implement Prompter and drop in. The adjudication core is unchanged.
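A minimal, self-contained sketch of the seam follows. The types are deliberately simplified stand-ins (a two-variant decision enum, string session IDs, a bare context struct), and run is a hypothetical reduction of the adjudication core; none of it reproduces the real signatures in talkbank-transform.

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, PartialEq)]
enum Decision {
    AcceptSuggested,
    Defer,
}

#[derive(Debug, PartialEq)]
enum PrompterError {
    Unprovided(String),
}

struct Context {
    session_id: String,
}

trait Prompter {
    fn ask(&mut self, context: &Context) -> Result<Decision, PrompterError>;
}

/// Test double: hands back pre-canned decisions in order.
struct ScriptedPrompter {
    queue: VecDeque<(String, Decision)>,
}

impl ScriptedPrompter {
    fn from_decisions(decisions: Vec<(String, Decision)>) -> Self {
        Self { queue: decisions.into() }
    }
}

impl Prompter for ScriptedPrompter {
    fn ask(&mut self, context: &Context) -> Result<Decision, PrompterError> {
        // Only answer when the next scripted decision is for this session.
        let next_matches =
            matches!(self.queue.front(), Some((id, _)) if *id == context.session_id);
        match self.queue.pop_front() {
            Some((_, decision)) if next_matches => Ok(decision),
            Some(unused) => {
                // Put the unrelated decision back; this session was not scripted.
                self.queue.push_front(unused);
                Err(PrompterError::Unprovided(context.session_id.clone()))
            }
            None => Err(PrompterError::Unprovided(context.session_id.clone())),
        }
    }
}

/// Stand-in for the adjudication core: it only ever talks to the trait.
fn run(sessions: &[&str], prompter: &mut dyn Prompter) -> Result<Vec<(String, Decision)>, PrompterError> {
    let mut recorded = Vec::new();
    for id in sessions {
        let context = Context { session_id: id.to_string() };
        recorded.push((id.to_string(), prompter.ask(&context)?));
    }
    Ok(recorded)
}
```

A test builds a ScriptedPrompter, calls run, and asserts on the recorded decisions; no terminal is involved at any point.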

The `OperatorDecision` type

pub enum OperatorDecision {
    /// Accept the algorithm's suggested mapping verbatim.
    AcceptSuggested { note: Option<String> },

    /// Override with an operator-supplied mapping (speaker-id).
    OverrideMapping {
        mapping: SpeakerMapping,
        note: Option<String>,
    },

    /// Override the inserted role(s) only (parent-role lookup).
    OverrideInsertedRole {
        adult_roles: BTreeMap<String, InsertedRoleSpec>,
        note: Option<String>,
    },

    /// Add or update flags on an existing entry.
    Flag { flags: Vec<MergeFlag>, note: Option<String> },

    /// Defer this entry; leave it in pending for later review.
    Defer { reason: String },

    /// Mark the session as blocked (e.g., unbulleted reference);
    /// requires upstream action before pipeline can resume.
    Block { reason: String },
}

Each variant maps cleanly to one or more adjudication kinds:

Kind	Allowed `OperatorDecision` variants
`speaker-id-low-confidence`	`AcceptSuggested`, `OverrideMapping`, `Defer`
`parent-role-lookup`	`AcceptSuggested`, `OverrideInsertedRole`, `Defer`
`diarization-mix-review`	`Flag`, `Defer`
`sanity-scan-misclassification`	`OverrideMapping`, `Flag`, `Defer`
(any)	`Block` is always available

The kind → allowed-variants mapping is enforced by the adjudication tool: a kind = "parent-role-lookup" entry that gets an OverrideMapping decision is rejected with a clear error (AdjudicationError::DecisionKindMismatch).
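The table above translates directly into a single match. This sketch uses hypothetical, payload-free enums (Kind, Variant) and a local mismatch struct in place of AdjudicationError::DecisionKindMismatch, so it shows the shape of the check rather than the shipped code.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind {
    SpeakerIdLowConfidence,
    ParentRoleLookup,
    DiarizationMixReview,
    SanityScanMisclassification,
}

/// Payload-free mirror of the `OperatorDecision` variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Variant {
    AcceptSuggested,
    OverrideMapping,
    OverrideInsertedRole,
    Flag,
    Defer,
    Block,
}

#[derive(Debug, PartialEq)]
struct DecisionKindMismatch {
    kind: Kind,
    variant: Variant,
}

/// Reject any decision variant the table does not allow for this kind.
fn check(kind: Kind, variant: Variant) -> Result<(), DecisionKindMismatch> {
    use Kind::*;
    use Variant::*;
    let allowed = matches!(
        (kind, variant),
        (_, Block)
            | (SpeakerIdLowConfidence, AcceptSuggested | OverrideMapping | Defer)
            | (ParentRoleLookup, AcceptSuggested | OverrideInsertedRole | Defer)
            | (DiarizationMixReview, Flag | Defer)
            | (SanityScanMisclassification, OverrideMapping | Flag | Defer)
    );
    if allowed {
        Ok(())
    } else {
        Err(DecisionKindMismatch { kind, variant })
    }
}
```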

Operator terminal UX (interactive mode)

What the operator sees when running chatter adjudicate pending.toml --override-file overrides.toml --interactive:

═══════════════════════════════════════════════════════════════
ADJUDICATION  [1 / 14]  session-102-t1   kind = speaker-id-low-confidence
═══════════════════════════════════════════════════════════════

Reference file:  chi-only/session-102-t1.cha
Donor file:      asr/session-102-t1.cha
Anchor speaker:  CHI

Per-speaker Jaccard scores against reference's CHI:
  PAR0 = 0.6286   ◄── higher
  PAR1 = 0.3457
  margin = 1.82×   (threshold was 2.00×)

Opening turns side-by-side:

  *CHI    [0_1708]    they start to bite .
  *PAR0   [75_1165]   They start to bite .
  *PAR1   [1515_2245] They do what .

  *CHI    [1708_5966] they put up their shields at some point .
  *PAR0   [2755_4405] They put up those heels .
  *PAR1   [4865_6045] At some point oh .

  (3 more turns shown; press 'm' for more)

Algorithm-suggested mapping:
  PAR0 → drop   (winner, matches CHI content)
  PAR1 → rename to INV:Investigator

Your decision?
  [a] Accept suggested
  [o] Override mapping
  [f] Flag and defer
  [d] Defer (review later)
  [b] Block (needs upstream fix)
  [m] Show more context
  [p] Play media (uses $TB_MEDIA_PLAYER)
  [q] Quit (save progress and exit)
>

When the operator types a and then is prompted for an optional note, the tool writes the decision to the override file and advances to the next pending entry.

The [p] Play media action is a thin wrapper that spawns the program named by $TB_MEDIA_PLAYER with the media path as its only argument (in Rust terms, Command::new(player).arg(media_path).spawn()); the adjudication tool doesn’t bundle an audio player. The operator configures their preferred player via the environment.
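A hedged sketch of that wrapper using only the standard library follows. The function names `media_player_command` and `play_media`, and the choice to return an error when the variable is unset, are assumptions for illustration; the tool's real behaviour in that case is not specified here.

```rust
use std::ffi::OsString;
use std::io;
use std::path::Path;
use std::process::{Child, Command};

/// Build (but do not run) the player command for `media_path`.
pub fn media_player_command(player: OsString, media_path: &Path) -> Command {
    let mut cmd = Command::new(player);
    cmd.arg(media_path);
    cmd
}

/// Spawn the player named by `$TB_MEDIA_PLAYER` and return immediately.
/// Assumption: an unset variable is reported as an error, not a fallback.
pub fn play_media(media_path: &Path) -> io::Result<Child> {
    let player = std::env::var_os("TB_MEDIA_PLAYER").ok_or_else(|| {
        io::Error::new(io::ErrorKind::NotFound, "TB_MEDIA_PLAYER is not set")
    })?;
    media_player_command(player, media_path).spawn()
}
```

Splitting command construction from spawning keeps the construction step testable without launching a real player.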

Adjudication contexts beyond speaker-id

The same chatter adjudicate tool handles all five adjudication points by dispatching on kind. For each, the displayed context and the allowed decisions differ:

parent-role-lookup

Shown context: the session is a parent sample (by convention its basename carries a parent suffix, or the contributor data sheet says so). The merged output needs an inserted-role code of MOT, FAT, or PAR. The operator picks.

Session: session-103-t1-parent
Kind: parent-role-lookup

This is a parent-sample session. The merged file's inserted
speaker (currently labeled PAR0 → ???) needs a CHAT role.

Contributor data sheet (if attached): not available
Audio preview duration: 8m 14s

Algorithm-suggested:  INV : Investigator   (default for ambiguity)

Your decision?
  [a] Accept suggested (INV : Investigator)
  [m] MOT : Mother
  [f] FAT : Father
  [p] PAR : Adult (gender unknown)
  [c] Custom role
  [d] Defer
  [b] Block (needs upstream metadata)
>

diarization-mix-review

Triggered by the operator (or a post-merge auto-scan) observing that an ASR speaker’s content mixes real-world speakers. The adjudication is to add the "diarization-mixed" flag plus a note explaining the mix.

sanity-scan-misclassification

Triggered by the post-merge sanity scan when a retained-speaker utterance has high text similarity with a temporally-adjacent inserted-speaker utterance. The operator either confirms (“the original speaker-id was wrong, swap the mapping”) or overrides (“the duplication is real, both speakers said the same thing at the same time”).

Resumption and re-adjudication

The pending-adjudications file is the source of truth for “what still needs deciding.” If the operator quits mid-review (via [q] or process-kill), the next chatter adjudicate invocation picks up where they left off, because already-decided entries have been removed from pending and written to the override file.
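The resumption property falls out of moving each decided entry from pending to overrides as soon as it is decided. The sketch below illustrates this with in-memory collections; `PendingEntry`, `Outcome`, and `review_pass` are hypothetical names, decisions are reduced to strings, and the real tool persists both files instead of mutating a `Vec` and a map.

```rust
use std::collections::BTreeMap;

/// Hypothetical, stripped-down pending entry.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PendingEntry {
    pub session_id: String,
}

/// What the operator did with one entry.
pub enum Outcome {
    Decided(String),
    Deferred,
    Quit,
}

/// One review pass: decided entries move to `overrides`; deferred entries
/// and everything not yet reached stay in `pending` for the next run.
pub fn review_pass(
    pending: &mut Vec<PendingEntry>,
    overrides: &mut BTreeMap<String, String>,
    mut decide: impl FnMut(&PendingEntry) -> Outcome,
) {
    let mut queue = std::mem::take(pending).into_iter();
    let mut still_pending = Vec::new();
    for entry in queue.by_ref() {
        match decide(&entry) {
            Outcome::Decided(decision) => {
                overrides.insert(entry.session_id, decision);
            }
            Outcome::Deferred => still_pending.push(entry),
            Outcome::Quit => {
                // The entry being shown when the operator quit is not lost.
                still_pending.push(entry);
                break;
            }
        }
    }
    still_pending.extend(queue); // entries never shown in this pass
    *pending = still_pending;
}
```

Running `review_pass` again on the same collections resumes with exactly the entries that were deferred or never reached.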

Re-adjudication of an already-decided entry is a planned extension, not yet implemented. The proposed interface would load the existing override entry, present it as the “current decision,” and ask the operator whether to keep or replace it; the operator’s decision would overwrite the entry, and the prior decision would be preserved in a history array on the entry (recording the prior mode, mapping, operator, decided_at, and note). The proposed invocation shape (not a working command today) is:

# Proposed, not yet implemented:
chatter adjudicate --re-adjudicate <SESSION_ID> --override-file overrides.toml

It needs a small override-file schema extension: a per-entry optional history: Vec<MergeOverride> field. This is a minor, additive schema change, comparable to the 2026-06 engine/judgment addition, so it needs no version bump. schema_version is already 2 as of the adult_roles map (see Merge Override File Format §Future schema changes); only a future breaking change to this schema would move it, and that change would need schema_version = 3.

Composition with the orchestrator

The orchestrator (proposed tb merge or similar) drives the pipeline. Its high-level flow:

// pseudocode for the orchestrator's main loop
let inputs = discover_input_sessions(input_dir);
let override_file = OverrideFile::read_or_default(override_path);
let mut pending = PendingAdjudications::default();
let mut decided_count = 0usize; // sessions merged in this run (reported below)

for session in inputs {
    if let Some(decision) = override_file.get(&session.id) {
        // Already adjudicated; apply directly.
        let labeled = apply_mapping(&session.donor, &decision.mapping)?;
        let merged = merge(&session.reference, &labeled, &session.retain)?;
        write_merged(merged, &session.output_path)?;
        decided_count += 1;
    } else {
        // Try auto-decide.
        match identify_mapping(&session.donor, &session.reference, ...) {
            Ok(mapping) => {
                let labeled = apply_mapping(&session.donor, &mapping)?;
                let merged = merge(...)?;
                write_merged(merged, &session.output_path)?;
                override_file.insert(session.id.clone(), record_auto_decision(&mapping));
                decided_count += 1;
            }
            Err(SpeakerIdError::LowConfidence { scores, margin, threshold }) => {
                pending.push(PendingEntry::speaker_id_low_confidence(
                    session.id.clone(),
                    scores, margin, threshold,
                    /* preview */ build_preview(&session),
                ));
            }
            Err(other) => return Err(other),
        }
    }
}

pending.write(pending_path)?;
override_file.write(override_path)?;

if !pending.is_empty() {
    eprintln!(
        "Pipeline complete for {} sessions; {} sessions need adjudication.\n\
         Run: chatter adjudicate {} --override-file {}",
        decided_count, pending.len(), pending_path, override_path
    );
    return Ok(ExitCode::NeedsAdjudication);
}

The orchestrator is the layer that hasn’t been designed yet at the type level. It’s likely a tb subcommand (since tb is the workflow tool for multi-repo / multi-step ops), with a fallback shell-script form for the v0 pipeline.

What this design does NOT cover

The orchestrator binary itself. That’s a separate design pass; this doc only specifies the contract between the pipeline stages and the adjudication tool.
GUI/web adjudication backends. v1 is terminal-only. The Prompter trait is the extension point; future backends implement it. The data contract (pending.toml, overrides.toml) does not change.
Audio playback / waveform display. v1 launches the operator’s $TB_MEDIA_PLAYER and gets out of the way. A future TUI with inline audio scrubbing is conceivable but is a major UI project, not v1.
ML-suggested decisions. A future version could feed pending entries to a classifier that pre-fills “suggested” with model output. Out of scope; the suggested field exists today as a hook.

Test coverage

Every behavior of chatter adjudicate is tested via the scripted-prompter abstraction. See the Test Plan (TBD section L4) for the test inventory. Coverage spans:

Each adjudication kind’s happy path (operator accepts suggested, decision written to override file)
Each adjudication kind’s override path (operator types an alternative, decision validated and recorded)
Each adjudication kind’s defer path (entry stays in pending)
Each adjudication kind’s block path (entry marked blocked; pipeline reports blocker)
Re-adjudication path, for when that planned extension lands (operator changes their mind; prior decision preserved in history)
Mutually-exclusive flag enforcement (--interactive + --scripted rejected)
Invalid operator response handling (with and without --skip-on-error)
Schema-version refusal on the pending file
Empty pending file (no-op, exit 0)
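To show why a scripted prompter makes these paths testable, here is a small sketch. The `Prompter` trait's real signature is not given in this document, so `ask`, `ScriptedPrompter`, `Choice`, and `prompt_decision` are all hypothetical; only the `a` / `d` / `q` keys come from the interactive menu shown earlier.

```rust
use std::collections::VecDeque;

/// Hypothetical shape of the prompting abstraction.
pub trait Prompter {
    /// Show `prompt`; return the reply, or `None` when input is exhausted.
    fn ask(&mut self, prompt: &str) -> Option<String>;
}

/// Test double that replays canned replies and records every prompt.
pub struct ScriptedPrompter {
    replies: VecDeque<String>,
    pub prompts_seen: Vec<String>,
}

impl ScriptedPrompter {
    pub fn new<I, S>(replies: I) -> Self
    where
        I: IntoIterator<Item = S>,
        S: Into<String>,
    {
        Self {
            replies: replies.into_iter().map(Into::into).collect(),
            prompts_seen: Vec::new(),
        }
    }
}

impl Prompter for ScriptedPrompter {
    fn ask(&mut self, prompt: &str) -> Option<String> {
        self.prompts_seen.push(prompt.to_string());
        self.replies.pop_front()
    }
}

/// Reduced decision set, enough to exercise three menu keys.
#[derive(Debug, PartialEq, Eq)]
pub enum Choice {
    Accept { note: Option<String> },
    Defer,
    Quit,
}

/// Ask for a decision; `None` means the reply was invalid or missing.
pub fn prompt_decision(prompter: &mut dyn Prompter) -> Option<Choice> {
    match prompter.ask("Your decision?")?.trim() {
        "a" => {
            let note = prompter.ask("Optional note:")?;
            let note = note.trim();
            Some(Choice::Accept {
                note: if note.is_empty() { None } else { Some(note.to_string()) },
            })
        }
        "d" => Some(Choice::Defer),
        "q" => Some(Choice::Quit),
        _ => None,
    }
}
```

Because the decision logic only sees `&mut dyn Prompter`, a terminal backend and a scripted one can share every code path after the prompt.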

XML Emitter

Status: Current Last updated: 2026-06-14 12:56 EDT

Purpose

crates/talkbank-transform/src/xml/ serialises a ChatFile<S> into TalkBank XML, an obsolete, frozen interchange format. The emitter is chatter’s implementation of that format’s CHAT to XML projection.

Scope:

Legacy / rare-use facility. The TalkBank project no longer publishes XML for download; CHAT is the canonical distribution format. The XML emitter exists to support rare legacy consumers that still need the XML projection; it is not a primary interchange path. New integrations should consume CHAT directly.
Emission only. XML ingest (XML → CHAT) is explicitly out of scope. The only historical consumer that ever needed XML → CHAT was Phon (via its PhonTalk plug-in, which used an XML round-trip); Phon has since pivoted to reading CHAT directly. The other XML readers are all either dormant or migrated:
- NLTK’s CHILDESCorpusReader is unmaintained and was always read-only.
- langcog/childes-db has had no commits since September 2022.
- TalkBankDB and the current TalkBank analysis stack read CHAT directly, not XML.
Phonetic tiers are permanently unsupported. %pho, %mod, %phosyl, %modsyl, %phoaln report XmlWriteError::PhoneticTierUnsupported. Phon has pivoted to CHAT-only interchange; no downstream consumer reads the rich <pg>/<pw>/<ph>/<cmph>/<ss> XML. Files carrying these tiers still parse, validate, and round-trip through CHAT unchanged; only the XML projection is declined.
Parity oracle. The goldens in corpus/reference-xml/ (the reference TalkBank XML generated against the reference CHAT corpus) are the parity target. All paired goldens pass structurally, giving full parity across every reference .cha file the TalkBank XML format can represent. A small number of reference fixtures have no golden because the frozen format cannot express them: some use UD POS tags (propn) that postdate it, and others declare @Media with a linkage type that the E544 validator catches before emission has a chance to run. These are intentional divergences, not Rust gaps.

Module layout

The emitter is split across the seven files under xml/ listed below. Each emission file contributes an impl XmlEmitter { … } block plus any free helpers it owns, and error.rs holds the XmlWriteError enum; state lives on the single XmlEmitter struct defined in writer.rs.

flowchart TD
    entry["write_chat_xml<br/>(writer.rs)"]
    emitter["XmlEmitter struct<br/>owns quick-xml Writer<br/>+ next_utterance_id"]
    root["root.rs<br/>document / participants /<br/>body / utterance orchestration<br/>+ metadata helpers"]
    word["word.rs<br/>&lt;w&gt; / &lt;t&gt; / &lt;tagMarker&gt; /<br/>&lt;pause&gt; / &lt;g&gt; wrappers /<br/>word-internal markers /<br/>scoped annotations"]
    mor["mor.rs<br/>&lt;mor&gt; / &lt;mw&gt; / &lt;gra&gt; /<br/>UtteranceTiers collector /<br/>%mor feature serialization"]
    wor["wor.rs<br/>&lt;media&gt; / &lt;wor&gt; /<br/>&lt;internal-media&gt; /<br/>ms → seconds formatting"]
    deptier["deptier.rs<br/>&lt;a type=…&gt; side tiers<br/>(%act / %com / %exp /<br/>%gpx / %sit / %xLABEL)"]
    error["error.rs<br/>XmlWriteError variants"]

    entry --> emitter
    emitter --> root
    emitter --> word
    emitter --> mor
    emitter --> wor
    emitter --> deptier
    root -->|"terminator,<br/>separator"| word
    root -->|"collect_utterance_tiers,<br/>UtteranceTiers"| mor
    root -->|"&lt;media&gt;,<br/>&lt;wor&gt;"| wor
    root -->|"side tiers"| deptier
    word -->|"&lt;mor&gt; subtree<br/>inside &lt;w&gt;"| mor
    word -->|"&lt;mor&gt; subtree<br/>inside &lt;tagMarker&gt;"| mor
    wor -->|"%wor terminator<br/>label"| word
    error -.->|"errors"| entry
    error -.->|"errors"| root
    error -.->|"errors"| word
    error -.->|"errors"| mor
    error -.->|"errors"| wor
    error -.->|"errors"| deptier

File	Role
`writer.rs`	`XmlEmitter` struct, namespace/version constants, `write_chat_xml` entry point, minimal-document unit test, `escape_text` helper
`root.rs`	Document / participants / body / utterance orchestration; root-element metadata helpers (corpus lookup, date/age/sex formatting, `@Options` flags, `@Types` projection, per-speaker extras)
`word.rs`	All word-level element shapes; word-internal marker walking; scoped-annotation dispatch; event / action emission
`mor.rs`	`%mor` / `%gra` emission including post-clitic `<mor-post>`; `UtteranceTiers` aggregator
`wor.rs`	`%wor` tier emission plus utterance-level `<media>`; `format_seconds` ms → seconds
`deptier.rs`	Text-content “side tiers” that render as `<a type=…>text</a>` (`%act`, `%com`, `%exp`, `%gpx`, `%sit`, `%xLABEL`)
`error.rs`	`XmlWriteError` `thiserror` enum

Top-level data flow

sequenceDiagram
    participant Caller
    participant write_chat_xml as write_chat_xml<br/>(writer.rs)
    participant XmlEmitter as XmlEmitter
    participant emit_document as emit_document<br/>(root.rs)
    participant emit_body as emit_body<br/>(root.rs)
    participant emit_utterance as emit_utterance<br/>(root.rs)

    Caller->>write_chat_xml: ChatFile&lt;S&gt;
    write_chat_xml->>XmlEmitter: new()
    write_chat_xml->>emit_document: emit_document(file)
    emit_document->>emit_document: emit &lt;?xml?&gt; + &lt;CHAT&gt; attrs
    emit_document->>emit_document: emit_participants(file)
    emit_document->>emit_body: emit_body(file)
    loop each Line
        alt Line::Header
            emit_body->>emit_body: emit_header_if_body(header)
        else Line::Utterance
            emit_body->>emit_utterance: emit_utterance(utterance)
        end
    end
    write_chat_xml->>XmlEmitter: finish() → String
    XmlEmitter-->>Caller: Ok(String)

Utterance emission in detail

emit_utterance is the most complex orchestrator: it walks the main tier in parallel with three cursors into the dependent tiers (%mor, %gra, and %sin).

flowchart TD
    start([emit_utterance])
    preHdr[emit pre-begin<br/>headers]
    collect["collect_utterance_tiers<br/>→ UtteranceTiers {<br/>mor, gra, wor, sin, side_tiers }"]
    openU["&lt;u who=… uID=…&gt;"]
    linkers["emit_linker × N<br/>(utterance.main.content.linkers)"]
    walk{"walk<br/>utterance.main.content.content"}
    term{"terminator<br/>present?"}
    emitTerm["emit_terminator<br/>(word.rs)"]
    missing["&lt;t type='missing<br/>CA terminator'/&gt;"]
    media{"main bullet<br/>present?"}
    emitMedia["emit_utterance_media<br/>(wor.rs)"]
    wor{"%wor tier<br/>present?"}
    emitWor["emit_wor<br/>(wor.rs)"]
    side{"side tiers<br/>non-empty?"}
    emitSide["emit_side_tiers<br/>(deptier.rs)"]
    closeU["&lt;/u&gt;"]
    done([return])

    start --> preHdr --> collect --> openU --> linkers --> walk
    walk -->|"Word / AnnotatedWord /<br/>ReplacedWord / AnnotatedGroup /<br/>Separator / Pause / Retrace /<br/>Event / AnnotatedAction /<br/>OverlapPoint"| walk
    walk --> term
    term -->|yes| emitTerm
    term -->|no| missing
    emitTerm --> media
    missing --> media
    media -->|yes| emitMedia
    media -->|no| wor
    emitMedia --> wor
    wor -->|yes| emitWor
    wor -->|no| side
    emitWor --> side
    side -->|yes| emitSide
    side -->|no| closeU
    emitSide --> closeU
    closeU --> done

The TierCursors invariant

Walking the main tier requires tracking three independent cursors into the %mor / %gra / %sin tiers. This separation is the single most important correctness invariant in the emitter; a merged cursor silently drifts on any utterance containing a clitic chain, an untranscribed placeholder, or a sign-language item.

A TierCursors helper in mor.rs owns the three cursors and provides mor_index() / gra_chunk() / sin_index() accessors plus consume_mor(post_clitics_len) / consume_sin() / advance_bulk(mor, gra) advance methods. Every content-arm in emit_utterance runs a fixed template: look up partners at the current cursor positions, emit, call consume_*. The advance math has exactly one home.

Cursor	Indexes into	Advances by
`mor`	`mor_tier.items` (one `Mor` per main-tier word)	1 per alignable word
`gra`	`gra.relations` (1-based `<gra index=…/>`)	`1 + post_clitics.len()` per `Mor`
`sin`	`sin_tier.items` (one `SinItem` per sin-countable word)	1 per sin-countable word

A Mor item like pron|what-Int-S1~aux|be-Fin-Ind-Pres-S3 is one entry in mor_tier.items but contributes two %gra edges, one for the main <mw>, one for each <mor-post><mw/></mor-post>. So mor and gra cursors advance at different rates.
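The advance rules in the table can be sketched as a small struct. This is an illustration under assumptions, not the code in mor.rs: the method names follow the text, but their exact signatures are not given here, and `gra_chunk` returning a 0-based `Range` over `gra.relations` is a guess at its shape.

```rust
use std::ops::Range;

/// Three independent positions into the %mor, %gra, and %sin tiers.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct TierCursors {
    mor: usize,
    gra: usize,
    sin: usize,
}

impl TierCursors {
    pub fn mor_index(&self) -> usize {
        self.mor
    }

    pub fn sin_index(&self) -> usize {
        self.sin
    }

    /// Assumed shape: the relations belonging to the `Mor` at the cursor,
    /// one for the main word plus one per post-clitic.
    pub fn gra_chunk(&self, post_clitics_len: usize) -> Range<usize> {
        self.gra..self.gra + 1 + post_clitics_len
    }

    /// One alignable word consumed: `mor` moves by 1, `gra` by 1 + clitics.
    pub fn consume_mor(&mut self, post_clitics_len: usize) {
        self.mor += 1;
        self.gra += 1 + post_clitics_len;
    }

    /// One sin-countable word consumed.
    pub fn consume_sin(&mut self) {
        self.sin += 1;
    }

    /// Apply counts reported by an emitter that walked nested content.
    pub fn advance_bulk(&mut self, mor: usize, gra: usize) {
        self.mor += mor;
        self.gra += gra;
    }
}
```

For the clitic example above, `consume_mor(1)` moves `mor` by one and `gra` by two, which is exactly the drift a merged cursor would miss.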

%sin uses a different counting predicate from %mor. The model’s counts_for_tier(word, TierDomain) function encodes the differences:

TierDomain::Mor excludes nonwords (&~), fillers (&-), phonological fragments (&+), and untranscribed placeholders (xxx, yyy, www).
TierDomain::Sin includes everything that was phonologically or gesturally produced; fragments and untranscribed placeholders do participate, because a gesture can accompany an unintelligible vocalisation.

Because the predicates diverge, the sin cursor advances on its own schedule. For the main tier *CHI: mommy xxx . with the dependent tier %sin: g:point 0 ., the xxx word consumes a %sin item but not a %mor item.
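A toy version of the two predicates makes the divergence concrete. Everything here is a simplification: the real counts_for_tier takes a model word, whereas this sketch classifies raw tokens by prefix; `WordKind`, `classify`, and `expected_items` are hypothetical; and treating every kind as sin-countable is one reading of "everything that was produced", not a confirmed rule.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WordKind {
    Regular,
    Nonword,       // &~
    Filler,        // &-
    Fragment,      // &+
    Untranscribed, // xxx, yyy, www
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TierDomain {
    Mor,
    Sin,
}

/// Prefix-based stand-in for the parser's word classification.
pub fn classify(token: &str) -> WordKind {
    if token.starts_with("&~") {
        WordKind::Nonword
    } else if token.starts_with("&-") {
        WordKind::Filler
    } else if token.starts_with("&+") {
        WordKind::Fragment
    } else if matches!(token, "xxx" | "yyy" | "www") {
        WordKind::Untranscribed
    } else {
        WordKind::Regular
    }
}

/// %mor skips everything except regular words; %sin (assumed) skips nothing.
pub fn counts_for_tier(kind: WordKind, domain: TierDomain) -> bool {
    match domain {
        TierDomain::Mor => kind == WordKind::Regular,
        TierDomain::Sin => true,
    }
}

/// Number of dependent-tier items the main-tier tokens should consume.
pub fn expected_items(tokens: &[&str], domain: TierDomain) -> usize {
    tokens
        .iter()
        .copied()
        .filter(|&t| counts_for_tier(classify(t), domain))
        .count()
}
```

With the tokens `mommy` and `xxx`, the sketch expects one %mor item and two %sin items, matching the example utterance.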

Four main-tier content variants delegate cursor arithmetic through their emitters: emit_replaced_word and emit_annotated_group return (mor_used, gra_used) tuples consumed via cursors.advance_bulk(mor_used, gra_used); emit_word and emit_annotated_word call cursors.consume_mor(post_count) inline.

Why cursor-based, not `AlignmentSet`-based?

talkbank-model’s AlignmentSet (Utterance.alignments) holds pre-computed MorAlignmentPair / SinAlignmentPair / etc., the same main-word-index ↔ target-tier-index mapping the emitter computes on-the-fly. Why not use it directly?

The XML emitter accepts ChatFile<S: ValidationState> for any S. When called on a ChatFile<NotValidated>, compute_alignments has never run and Utterance.alignments is None. Rather than force callers to validate first, or risk panics on unvalidated input, the emitter recomputes what it needs via the cursor walk.

The cursor walk is equivalent to the model’s alignment output for every reference-corpus input; it only diverges on malformed files that the model’s alignment would also flag. The cursors stay as local emitter state, and the alignment module stays a separate, optional layer.

`%sin` → `<sg><w><sw/></w></sg>` emission

When a %sin tier is present and the current word counts for TierDomain::Sin, the emitter wraps the <w> element (and its nested <mor> subtree if any) in a <sg> (sign group) with a <sw> (sign word) sibling:

<sg><w>what<mor>...</mor></w><sw>0</sw></sg>

SinItem::Token(text) renders as <sw>text</sw>; SinItem::SinGroup(…) joins its gesture tokens with spaces. That <sw> element is all XmlEmitter::emit_sin_word emits; the <sg>…</sg> wrap itself lives in emit_utterance’s Word arm.

`@Media` linkage and timing evidence (E544)

Validation fires E544 before XML emission when an unqualified @Media header (status-less) claims linkage but the transcript carries no timing evidence (no main-tier bullets, no positional %wor sidecar). This is a validator-level rule (lives in crates/talkbank-model/src/model/file/chat_file/validate.rs check_media_linkage_has_timing), not an emitter rule; it runs during ChatFile::validate and blocks downstream emission on validation-gated entry points. See spec/errors/E544_media_linkage_without_timing.md.

The emitter itself doesn’t care about bullet presence; this check was historically imposed as a parser-level semantic failure, and Rust implements it in the validator instead.

Post-clitic emission

flowchart LR
    mor["&lt;mor type='mor'&gt;"]
    mw["&lt;mw&gt;…&lt;/mw&gt;<br/>(main MorWord)"]
    gra["&lt;gra type='gra'<br/>index=N head=… relation=…/&gt;"]
    post["&lt;mor-post&gt;"]
    pmw["&lt;mw&gt;…&lt;/mw&gt;<br/>(post-clitic MorWord)"]
    pgra["&lt;gra type='gra'<br/>index=N+1 head=… relation=…/&gt;"]
    postEnd["&lt;/mor-post&gt;"]
    endMor["&lt;/mor&gt;"]

    mor --> mw --> gra --> post --> pmw --> pgra --> postEnd --> endMor

Each post-clitic gets its own <mor-post> wrapper containing one <mw> plus the next <gra> index. Multiple post-clitics emit sequentially.

Emitter / parser / model boundary

The emitter generally defers to the Rust model’s canonical predicates rather than inventing output-side rules. Four cases are exceptions where the emitter bridges a disagreement between the parser and the TalkBank XML format at the output boundary. All four are legitimate divergences, not regressions: the Rust model is correct, the TalkBank XML format is obsolete and frozen at a pre-evolution CHAT snapshot, and the emitter’s bridges are the right place to reconcile the output shape.

CA intonation contour terminators

Rust parses ⇗, ↗, →, ↘, ⇘ at the end of an utterance as Terminator::CaRisingToHigh etc. The TalkBank XML format classifies them as separators followed by an implicit “missing CA terminator”. The emitter splits a pitch-contour terminator into two sibling elements:

<s type="rising to high"/>
<t type="missing CA terminator"/>

See ca_terminator_separator_label in word.rs. If the Rust parser ever migrates to classify these as separators, the emitter’s bridge becomes dead and should be removed.

CAOmission as whole-word shortening

(parens) (a fully-parenthesised word) parses to WordCategory::CAOmission. TalkBank XML emits <w><shortening>parens</shortening></w>, a <shortening> wrapper around the word body with no type="omission" attribute. The 0word syntax (true omission) gets <w type="omission">word</w> with no shortening wrapper.

The emitter branches on CAOmission and opens a <shortening> wrapper around emit_word_contents. word_category_attr returns None for CAOmission so no type="omission" attribute is emitted.

Leading overlap-point hoisting

Rust parses ⌈°overlapping+soft⌉° as a single word whose WordContent vector starts with a TopOverlapBegin marker. TalkBank XML keeps the leading ⌈ as a top-level sibling of <w> but keeps the trailing ⌉ inside. The emitter hoists the prefix of leading WordContent::OverlapPoint items out before opening <w>, and emit_word_contents skips them during the content walk.

`xxx` / `yyy` / `www` case-sensitivity

The model’s word.untranscribed() helper is case-insensitive; it treats XXX and xxx identically as “unintelligible” to protect downstream Stanza/MOR pipelines from spurious uppercase entries. The XML schema’s untranscribed attribute, however, attaches only to the strictly lowercase placeholders. The emitter uses a local untranscribed_attribute_for_xml helper that does the case-sensitive check at the output boundary.

Both behaviours are deliberate and stay: the model’s case-insensitive helper is a Stanza/MOR correctness fix, and the emitter’s case-sensitive gate matches the XML schema contract.
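The two checks differ only in case handling, which a pair of small functions makes explicit. Both function names and signatures below are stand-ins (the real helpers operate on model words, and the XML helper presumably returns an attribute value, not a bool).

```rust
/// Model-side view: any casing of a placeholder counts as untranscribed.
pub fn is_untranscribed_model(word: &str) -> bool {
    ["xxx", "yyy", "www"]
        .iter()
        .any(|placeholder| word.eq_ignore_ascii_case(placeholder))
}

/// XML-side gate: only the strictly lowercase placeholders qualify
/// for the `untranscribed` attribute.
pub fn is_untranscribed_for_xml(word: &str) -> bool {
    matches!(word, "xxx" | "yyy" | "www")
}
```

An input such as `XXX` is therefore kept away from the morphology pipeline by the first check while still being emitted as an ordinary word by the second.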

Reserving element boundaries: single state holder

XmlEmitter owns a quick_xml::Writer<Vec<u8>> and a running next_utterance_id: u32 counter. Every emission helper writes through that single writer so indentation, escaping, and the document-order contract are centrally enforced.

Every BytesText emission routes through escape_text (in writer.rs) which uses quick_xml::escape::partial_escape to escape only <, >, &. Apostrophes and double quotes pass through literally, matching the TalkBank XML format and avoiding entity-decode issues that would otherwise split text at ' boundaries during structural comparison.
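To make the escaping contract concrete without pulling in quick-xml, the sketch below hand-rolls the same three-character rule. It is a stand-in for the behaviour described above, not the body of the real escape_text, which delegates to quick_xml::escape::partial_escape.

```rust
use std::borrow::Cow;

/// Escape only `<`, `>`, and `&`; quotes and apostrophes pass through.
/// Borrows the input when nothing needs escaping.
pub fn escape_text(raw: &str) -> Cow<'_, str> {
    if !raw.contains(&['<', '>', '&'][..]) {
        return Cow::Borrowed(raw);
    }
    let mut out = String::with_capacity(raw.len() + 8);
    for ch in raw.chars() {
        match ch {
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '&' => out.push_str("&amp;"),
            other => out.push(other),
        }
    }
    Cow::Owned(out)
}
```

Returning `Cow` keeps the common case (plain words with no markup characters) allocation-free.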

Testing

Two complementary test surfaces:

Unit tests in xml/writer.rs (minimal document smoke) and xml/wor.rs (format_seconds fractional padding) exercise internal helpers directly.
Golden-XML parity harness at crates/talkbank-parser-tests/tests/xml_golden.rs. Runs one parametrised test per file in corpus/reference-xml/**/*.xml, parses both emitted and golden XML via quick-xml, and diffs event streams with whitespace and attribute-order normalisation. Comparator lives in crates/talkbank-parser-tests/tests/xml_support/mod.rs.

The harness diagnostic surfaces the first divergence as actual: … vs expected: …. To debug further, temporary dump helpers (write the emitted XML to /tmp/emitted.xml and side-by-side diff against the golden) are the quickest path; add them as #[ignore]d tests in crates/talkbank-parser-tests/tests/xml_dump.rs when needed and delete after the divergence is resolved.

spec/errors/E544_media_linkage_without_timing.md: the @Media bullet-existence validator that runs before emission.

Reference-XML coverage gaps (which files the TalkBank XML format can’t represent) are called out inline in the “Parity oracle” bullet of §Purpose above. The permanent exclusions are UD-POS files that postdate the frozen format and @Media-without-timing files that E544 blocks at validation; both are intentional divergences, not Rust gaps.

Staged features

The emitter reports XmlWriteError::FeatureNotImplemented for CHAT constructs that have a known XML shape but haven’t been wired in yet. With all paired reference-XML goldens passing, any new staged feature that lands will be triggered by a file added to the reference corpus that exercises it. When that happens:

Run cargo nextest run -p talkbank-parser-tests --test xml_golden and read the failure message.
Find the TalkBank XML output for the construct in the paired golden.
Add a match arm in the appropriate submodule (word.rs::emit_scoped_annotation, deptier.rs::emit_side_tier, word.rs::ca_delimiter_label, etc.) with a short comment explaining the mapping.
If the construct changes %mor / %gra cursor accounting, update emit_utterance in root.rs, not individual callers.

Permanently-unsupported tiers (%pho, %mod, %phosyl, %modsyl, %phoaln) use XmlWriteError::PhoneticTierUnsupported and are not staged for future work: Phon’s pivot to CHAT-only interchange removed the downstream need.

Errors, CHAT core

Status: Current Last modified: 2026-06-17 11:29 EDT

The error infrastructure used across all CHAT-core crates (talkbank-model, talkbank-parser, talkbank-transform, chatter, talkbank-lsp). Defined in the errors module of talkbank-model.

External runtime/application errors that live outside this repo’s CHAT core are documented separately in their owning projects. For the diagnostic UX standard that applies within this workspace, see error-diagnostics-ux.

Core Types

`ParseError`

Every diagnostic is a ParseError:

pub struct ParseError {
    pub code: ErrorCode,
    pub severity: Severity,
    pub location: SourceLocation,
    pub context: ErrorContext,
    pub message: String,
}

`ErrorCode`

Error codes follow a structured numbering scheme:

Range	Category
E1xx	Encoding
E2xx	Words and content
E3xx	Main tier (speakers, terminators, content, retraces)
E4xx	Dependent tier structure
E5xx	Headers
E6xx	Dependent tier validation
E7xx	Alignment (`%mor`, `%gra`, `%pho`, `%wor`)
W1xx-Wxxx	Warnings (same categories)
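As a navigational aid, the numbering table can be turned into a lookup. This sketch assumes every code is a letter (`E` or `W`) followed by exactly three digits, which holds for the codes cited in this book; `Category`, `CodeSeverity`, and `classify_code` are hypothetical names, not part of talkbank-model.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Category {
    Encoding,
    WordsAndContent,
    MainTier,
    DependentTierStructure,
    Headers,
    DependentTierValidation,
    Alignment,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CodeSeverity {
    Error,
    Warning,
}

/// Map a code such as "E544" to its severity prefix and range category.
/// Returns `None` for anything outside the documented ranges.
pub fn classify_code(code: &str) -> Option<(CodeSeverity, Category)> {
    let mut chars = code.chars();
    let severity = match chars.next()? {
        'E' => CodeSeverity::Error,
        'W' => CodeSeverity::Warning,
        _ => return None,
    };
    let digits = chars.as_str();
    if digits.len() != 3 || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    let category = match digits.as_bytes()[0] {
        b'1' => Category::Encoding,
        b'2' => Category::WordsAndContent,
        b'3' => Category::MainTier,
        b'4' => Category::DependentTierStructure,
        b'5' => Category::Headers,
        b'6' => Category::DependentTierValidation,
        b'7' => Category::Alignment,
        _ => return None,
    };
    Some((severity, category))
}
```

The lookup reports the range a code belongs to, not the layer that emits it.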

Codes are grouped by range as above. The numbering is a navigational aid, not the authority on where a code is caught: most codes are emitted at the layer suggested below, but a few main-tier checks (for example undeclared-speaker and retrace structure) are validation-layer despite their E3xx number. The per-code Layer in spec/errors/ is authoritative.

flowchart LR
    subgraph "Parser layer\n(parser.parse_chat_file())"
        E1["E1xx\nEncoding\n(BOM, charset)"]
        E2["E2xx\nWords and content\n(word syntax, events,\noverlap markers)"]
        E3["E3xx\nMain tier\n(speaker, content,\nterminator, retraces)"]
        E4["E4xx\nDependent tier structure\n(tier presence, format)"]
        E5["E5xx\nHeaders\n(format, required fields,\nparticipant resolution)"]
    end

    subgraph "Validation layer\n(validate_with_alignment)"
        E6["E6xx\nDependent tier validation\n(tier name/format)"]
        E7["E7xx\nAlignment\n(%mor/%gra/%pho/%wor counts,\nGRA indices, orphaned tiers)"]
    end

    W["Wxxx\nWarnings\n(same categories,\nnon-fatal)"]

    E1 ~~~ E2 ~~~ E3 ~~~ E4 ~~~ E5
    E6 ~~~ E7

The source of truth for error-code details is spec/errors/. Maintainers can generate a local markdown reference set under docs/errors/ with gen_error_docs when they need a browsable error catalog while working on diagnostics.

`Severity`

Error: must be fixed; indicates invalid CHAT.
Warning: should be fixed; indicates questionable but parseable CHAT.

`SourceLocation` and `Span`

Byte offsets into the source text:

pub struct SourceLocation { pub start: usize, pub end: usize }
pub struct Span { pub start: usize, pub end: usize }

`ErrorContext`

Carries the source fragment around the error location:

pub struct ErrorContext {
    pub source_fragment: String,
    pub byte_range: Range<usize>,
    pub node_kind: String,
}

`ErrorSink` Trait

The central abstraction for error reporting:

flowchart LR
    val["Validator / Parser"]
    pe["ParseError\ncode + severity +\nlocation + message"]
    sink["ErrorSink trait\n.report()"]
    vec["ErrorCollector\ncollect to Vec"]
    chan["ChannelErrorSink\ncrossbeam channel\n(feature = channels)"]
    asyncchan["AsyncChannelErrorSink\ntokio mpsc"]
    cfg["ConfigurableErrorSink\nseverity gating"]
    null["NullErrorSink\nno-op"]

    val --> pe --> sink
    sink --> vec & chan & asyncchan & cfg & null

pub trait ErrorSink {
    fn report(&self, error: ParseError);
}

All parsing and validation functions accept &impl ErrorSink rather than returning errors directly. This allows:

Collecting all errors (for batch processing).
Printing errors in real-time (for interactive use).
Filtering by severity or code.
Counting errors without storing them.

The trait uses &self (not &mut self) so it can be shared across threads. Implementations typically use interior mutability (Mutex<Vec<ParseError>>).
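The `&self` plus interior-mutability pattern can be sketched in a few lines. The types below are simplified stand-ins, not the talkbank-model definitions: `Diagnostic` replaces the full `ParseError`, `SketchCollector` and `CountingSink` are illustrative sinks, and both `check_has_begin` and the `E000` code are placeholders invented for the example.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

/// Reduced stand-in for `ParseError`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Diagnostic {
    pub code: String,
    pub message: String,
}

/// Same shape as the trait above, over the reduced diagnostic type.
pub trait ErrorSink {
    fn report(&self, error: Diagnostic);
}

/// Stores every diagnostic; the `Mutex` is what lets `report` take `&self`.
#[derive(Default)]
pub struct SketchCollector {
    errors: Mutex<Vec<Diagnostic>>,
}

impl SketchCollector {
    pub fn into_vec(self) -> Vec<Diagnostic> {
        self.errors.into_inner().expect("collector mutex poisoned")
    }
}

impl ErrorSink for SketchCollector {
    fn report(&self, error: Diagnostic) {
        self.errors.lock().expect("collector mutex poisoned").push(error);
    }
}

/// Counts diagnostics without storing them.
#[derive(Default)]
pub struct CountingSink {
    count: AtomicUsize,
}

impl CountingSink {
    pub fn count(&self) -> usize {
        self.count.load(Ordering::Relaxed)
    }
}

impl ErrorSink for CountingSink {
    fn report(&self, _error: Diagnostic) {
        self.count.fetch_add(1, Ordering::Relaxed);
    }
}

/// Toy check that reports through whichever sink the caller chose.
pub fn check_has_begin(source: &str, sink: &impl ErrorSink) {
    if !source.lines().any(|line| line.trim_end() == "@Begin") {
        sink.report(Diagnostic {
            code: "E000".to_string(), // placeholder, not a real code
            message: "missing @Begin".to_string(),
        });
    }
}
```

The check function never learns whether diagnostics are stored, counted, or streamed; that choice stays with the caller.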

ErrorCollector is the in-memory collector in errors/collectors.rs. The stored-diagnostics role is explicit in both code and docs.

Module layout in talkbank-model:

errors/error_sink.rs: trait and lightweight forwarding sinks.
errors/collectors.rs: in-memory collectors and counters.
errors/async_channel_sink.rs: Tokio-channel streaming.
errors/configurable_sink.rs, errors/offset_adjusting_sink.rs, errors/tee_sink.rs: adapters.

ChannelErrorSink is opt-in behind the channels feature so the default talkbank-model dependency does not pull in crossbeam just to own the core error trait and in-memory collectors.

Two Error Layers

Errors are detected at two layers. This distinction matters for spec testing.

Parser layer: structural errors caught during parser.parse_chat_file(). These prevent the file from being fully parsed (missing @Begin, invalid syntax). Parser-layer specs test that parser.parse_chat_file() returns Err.
Validation layer: semantic errors caught by validate_with_alignment() after a successful parse. The file parsed correctly but violates constraints (%mor alignment mismatch, undeclared speakers). Validation-layer specs test that validation reports specific error codes.

Adding a New Error Code

Add the variant to ErrorCode in crates/talkbank-model/src/errors/codes/error_code.rs with a #[code("Exxx")] attribute.
Create a spec file in spec/errors/Exxx_description.md (matching the naming of existing specs such as E544_media_linkage_without_timing.md) following the existing template.
Construct ParseError::new(ErrorCode::YourVariant, ...) at the detection site in the parser or validator.
Regenerate the affected spec artifacts with the current spec/tools binaries (gen_rust_tests, gen_validation_corpus, and optionally gen_error_docs).
Run the concrete verification commands from book/src/contributing/dev-checks.md.

Validation

Status: Current Last modified: 2026-06-13 22:40 EDT

CHAT validation runs at multiple points in the processing pipeline. All validation logic is in Rust: talkbank-model::validation owns CHAT-core validation, and talkbank_transform::validate (crates/talkbank-transform/src/validate.rs) owns the transform-side pre/post validation gate functions (validate_to_level, validate_output). This page covers validity levels, pre/post validation gates, severity posture, the verification-gate set (G0-G14), and how validation failures interact with caches and bug reports.

For error-code infrastructure (codes, sinks, severities, layers), see chat-core-errors. For the diagnostic UX standard, see error-diagnostics-ux.

Validity Levels

The ValidityLevel enum defines three cumulative validation levels. Each level includes all checks from lower levels.

Level	Name	Checks
L0	`Parseable`	No parse errors (clean tree-sitter CST)
L1	`StructurallyComplete`	`@Participants` and `@Languages` present, all speaker codes declared, every utterance has a terminator
L2	`MainTierValid`	Well-formed words, valid timing bullets if present

Pre-validation gates

Each command requires input to meet a minimum level before processing:

Command	Required level
`morphotag`	`MainTierValid`
`utseg`	`StructurallyComplete`
`translate`	`StructurallyComplete`
`coref`	`StructurallyComplete`
`align`	`Parseable` (lenient, must handle messy real-world files)

validate_to_level() checks the file against the required level and returns all failures found. Invalid files are rejected early with diagnostics, before any compute is spent on inference.

flowchart TD
    cmd["a transform command\n(morphotag, utseg, translate,\ncoref, align)"]
    gate["validate_to_level(file, required_level)\n(talkbank-transform validate.rs)"]
    check{"meets the command's minimum\nValidityLevel?\n(L0 Parseable / L1 StructurallyComplete\n/ L2 MainTierValid)"}
    reject["reject early with diagnostics;\nno compute spent"]
    proceed["run the command's inference"]

    cmd --> gate --> check
    check -->|"no"| reject
    check -->|"yes"| proceed
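
The gate can be pictured as an ordering comparison over cumulative levels. The following is a minimal sketch, not the code in validate.rs: LevelFailure, required_level, and gate are hypothetical names introduced for illustration, and only the level names and the command table come from this page.

```rust
/// Cumulative validity levels; the derive order makes L0 < L1 < L2.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum ValidityLevel {
    Parseable,
    StructurallyComplete,
    MainTierValid,
}

/// Hypothetical failure record: the level whose check failed, plus a message.
#[derive(Debug, Clone, PartialEq, Eq)]
struct LevelFailure {
    level: ValidityLevel,
    message: String,
}

/// Minimum level per command, copied from the table above.
fn required_level(command: &str) -> Option<ValidityLevel> {
    match command {
        "morphotag" => Some(ValidityLevel::MainTierValid),
        "utseg" | "translate" | "coref" => Some(ValidityLevel::StructurallyComplete),
        "align" => Some(ValidityLevel::Parseable),
        _ => None,
    }
}

/// Keep every failure at or below the required level.
/// An empty result means the gate passes and the command may run.
fn gate(all_failures: &[LevelFailure], required: ValidityLevel) -> Vec<LevelFailure> {
    all_failures
        .iter()
        .filter(|failure| failure.level <= required)
        .cloned()
        .collect()
}
```

Because the levels are cumulative, a file with only an L2 failure still passes the gate for align (which needs L0) but is rejected for morphotag (which needs L2).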

Post-Serialization Validation

After an orchestrator injects results and serializes CHAT output, the server runs validate_output():

Alignment validation: checks that %mor and %gra tier word counts match the main tier (%wor counts are not checked; see Known limitations). The check is ParseHealth-aware: utterances flagged as unparseable during lenient parsing are excluded.
Semantic validation: full CHAT validation:
- E362: non-monotonic timestamps (utterance bullets must increase).
- E701 / E704: temporal constraints (overlap rules, same-speaker timing).
- Header correctness, required headers present and well-formed.
- Cross-utterance patterns, speaker code consistency.

validate_output() blocks only on findings with severity="error"; warnings are reported but do not block.
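
The ParseHealth-aware exclusion in the alignment check can be sketched as a filter applied before the count comparison. This is an illustrative sketch only: UtteranceSummary and mor_count_mismatches are hypothetical names, and the real model carries typed tiers rather than bare counts.

```rust
/// Hypothetical per-utterance summary used only for this illustration.
#[derive(Debug, Clone)]
struct UtteranceSummary {
    main_word_count: usize,
    /// None when the utterance has no %mor tier.
    mor_item_count: Option<usize>,
    /// False when lenient parsing flagged the utterance as unparseable.
    parse_clean: bool,
}

/// Returns the indices of utterances whose %mor count disagrees with the main tier.
/// Utterances flagged as unparseable are skipped instead of being reported.
fn mor_count_mismatches(utterances: &[UtteranceSummary]) -> Vec<usize> {
    utterances
        .iter()
        .enumerate()
        .filter(|(_, u)| u.parse_clean)
        .filter(|(_, u)| matches!(u.mor_item_count, Some(n) if n != u.main_word_count))
        .map(|(index, _)| index)
        .collect()
}
```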

Severity Posture

Validation intentionally distinguishes errors from warnings:

Errors block output. The server will not write CHAT with error-level validation failures.
Warnings are reported but do not block. Legacy corpora contain widespread minor violations that must remain processable.

This distinction matters especially for %gra:

Existing broken %gra in old corpora may be accepted with warnings so files remain processable.
Newly generated %gra from batchalign3 is validated more strictly before writeback.
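
The error/warning split amounts to partitioning findings and blocking only when the error side is non-empty. A minimal sketch under stated assumptions: Finding and output_decision are hypothetical, and the warning code used in the usage note is invented for illustration (E362 is the real code listed above).

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Severity {
    Error,
    Warning,
}

/// Hypothetical finding record: an error code plus its severity.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Finding {
    code: &'static str,
    severity: Severity,
}

/// Ok(warnings) means the output may be written (warnings are still reported).
/// Err(errors) means writeback is blocked.
fn output_decision(findings: &[Finding]) -> Result<Vec<Finding>, Vec<Finding>> {
    let (errors, warnings): (Vec<Finding>, Vec<Finding>) = findings
        .iter()
        .cloned()
        .partition(|finding| finding.severity == Severity::Error);
    if errors.is_empty() {
        Ok(warnings)
    } else {
        Err(errors)
    }
}
```

With a lone warning (say, a hypothetical "W-legacy-gra" finding) the result is Ok and the file is written; adding an E362 error flips the result to Err and blocks the write.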

Bug Reports and Cache Purges

When post-serialization validation fails:

A structured bug report is written to ~/.batchalign3/bug-reports/.
Cache entries that produced the invalid output are purged (self-correcting cache).

This prevents broken results from being served on future runs.

Verification Contract

This repo does not currently expose the predecessor workspace’s make verify wrapper. The current local contract is the concrete command set documented in Developer Verification Checks and Testing and Quality Gates.

Core local sweep:

cargo fmt --all -- --check
cargo build --workspace --all-targets --locked
cargo nextest run --workspace
cargo test --doc

Add the surface-specific checks that match the validation-affecting code you changed:

grammar: cd grammar && tree-sitter generate && tree-sitter test
spec tools: cargo build --manifest-path spec/tools/Cargo.toml and cargo build --manifest-path spec/runtime-tools/Cargo.toml
parser / model / alignment / serialization: cargo nextest run -p talkbank-parser-tests -E 'test(parser_equivalence)' and cargo nextest run -p talkbank-parser-tests --test roundtrip_reference_corpus

The reference corpus at corpus/reference/ remains the sacred semantic target. Historical labels like G0-G14 are useful for older design notes, but they are not the current command surface of this checkout.

Validation at the PyO3 Boundary

There is no public Python validation API. The ParsedChat handle that previously exposed validate() / validate_structured() / validate_chat_structured() was retired in the 2026-03-21 PyO3 slimdown to worker-runtime-only. Validation now runs entirely on the Rust side; when a worker invocation detects a failure it constructs BatchalignBoundaryError::ChatValidation { entries, … } which the PyO3 boundary lowers into a CHATValidationException carrying a populated errors: list[ValidationErrorEntry] on the Python side.

Python callers that need structured validation results invoke batchalign3 via subprocess and catch the exception:

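import batchalign_core
from batchalign_core import CHATValidationException

try:
    # `request` must already be constructed by the caller.
    batchalign_core.execute_v2(request)
except CHATValidationException as exc:
    # exc.errors is the populated list[ValidationErrorEntry] described above.
    for entry in exc.errors:
        print(entry.code, entry.line, entry.message)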

Upstream batchalign runtime errors and the Python ↔ Rust boundary contract are documented separately in the batchalign3 project.

Known limitations

Validation rules are intentionally permissive on legacy data. Some checks emit warnings rather than errors so legacy corpora remain processable while still surfacing the issue. Examples: pre-existing malformed %gra (warned, not blocked, so files that already shipped with bad %gra round-trip cleanly); some bullet-format minor variants. Newly generated tiers from batchalign are validated more strictly before writeback.
%wor word counts are not validated against the main tier. %wor is a timing-annotation tier with no downstream positional indexing; legacy files may have xxx, fragments, or nonwords in %wor without producing alignment errors.
Cross-utterance quotation validation is gated off by default (enable_quotation_validation flag); the cross-utterance walker exists but is not yet wired into the standard validation gate.
Some error-spec / validator pairs are not yet implemented. Tracked in spec/errors/ files marked Status: not_implemented; these generate #[ignore] tests via the current spec/tools generators rather than failing CI. Run grep -rl "Status.*not_implemented" spec/errors/ to enumerate.

CHECK Parity Audit

Status: Current Last updated: 2026-06-25 07:30 EDT

CLAN’s check (CHECK) was the long-standing validator for CHAT files. chatter validate is the forward-looking replacement, and it is the binding judgment on whether a byte sequence is valid CHAT: when chatter rejects a file, the file is invalid and the right response is to clean the data, not to weaken the parser. CHECK is no longer the authority on validity.

CHECK is still useful for one thing: as a reference oracle that helps find validation rules chatter does not yet have. The CHECK Parity Audit is the tool that compares the two systematically, so that every rule CHECK enforces is either matched by chatter, or is a deliberate, documented divergence.

What the audit answers

For every error code CLAN’s check actually emits, the audit answers: does chatter have an equivalent rule, and if not, why not?

Semantic parity: does chatter enforce the same intended rule?
Behavioral parity: does chatter match CHECK’s literal runtime behavior, including CHECK’s documented anomalies (some CHECK rules are buggy or were disabled in place; reproducing those bugs is not a goal)?
Strictness policy: chatter should be at least as strict semantically. A file CHECK rejects should not silently pass chatter unless the divergence is deliberate.

How it works

flowchart LR
    cpp["CLAN check.cpp\n(OSX-CLAN/src/clan)"]
    extract["scripts/extract_check_codes.py\n(every code CHECK actually emits)"]
    ref["clan-check-reference/\ncheck-error-codes.json"]
    map["map_by_id()\n(audit_check_parity.rs)"]
    audit["audit_check_parity\nbinary"]
    out["docs/audits/\ncheck-parity-audit.md"]

    cpp --> extract --> ref
    ref --> audit
    map --> audit
    audit --> out

The CHECK reference is generated from CLAN’s check.cpp by scripts/extract_check_codes.py into crates/talkbank-parser-tests/clan-check-reference/check-error-codes.json. It records every code CHECK emits (the call sites in the C source), not the stale subset documented in CLAN’s own CHECK-rules.md.
The mapping lives in map_by_id() in crates/talkbank-parser-tests/src/bin/audit_check_parity.rs: an explicit CHECK-number to TalkBank-code table (for example 138 | 139 => &["E256"]), with a keyword fallback for the unmapped remainder.
The audit binary joins the two and writes the report. Regenerate it with:
```
cargo run -p talkbank-parser-tests --bin audit_check_parity
```
which rewrites docs/audits/check-parity-audit.md (the full per-rule table and the executive summary). That generated file is the authoritative, citation-stable record; this page explains how to read it.
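
The join between the CHECK reference and the mapping table can be sketched as follows. This is a simplified illustration, not the code in audit_check_parity.rs: only the 138 | 139 arm is taken from this page, the keyword fallback is omitted, and the Parity enum and audit function are hypothetical.

```rust
/// TalkBank codes for the one mapping quoted on this page.
const E256: &[&str] = &["E256"];

/// Explicit CHECK-number to TalkBank-code table (illustrative subset).
fn map_by_id(check_id: u32) -> Option<&'static [&'static str]> {
    match check_id {
        138 | 139 => Some(E256),
        _ => None,
    }
}

/// Hypothetical audit row classification.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Parity {
    Mapped(&'static [&'static str]),
    /// No direct mapping; the gap still needs triage against check.cpp.
    NoDirectMapping,
}

/// Join every emitted CHECK code against the mapping table.
fn audit(emitted_check_ids: &[u32]) -> Vec<(u32, Parity)> {
    emitted_check_ids
        .iter()
        .map(|&id| match map_by_id(id) {
            Some(codes) => (id, Parity::Mapped(codes)),
            None => (id, Parity::NoDirectMapping),
        })
        .collect()
}
```

A row with no direct mapping is an input to triage, not a verdict: it may turn out to be a genuine gap or an intentional divergence.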

The current headline numbers (regenerate to refresh): of the CHECK codes that are actually emitted, roughly two-thirds map directly to a TalkBank code. The audit also reports semantic parity, behavioral parity, and an “enhancements beyond CHECK” set (TalkBank codes with no CHECK equivalent, which are the majority of TalkBank codes).

Triaging a gap

A CHECK rule with no TalkBank mapping is not automatically a chatter bug. Each gap is triaged against the CLAN source (OSX-CLAN/src/clan/check.cpp) into one of three buckets:

(a) Genuine gap. CHECK enforces a real CHAT rule chatter is missing. Action: implement it in chatter with strict top-down TDD (a failing chatter validate test on a real .cha fixture first), then add the map_by_id entry. Example: curly single quotes (see below).
(b) Intentional divergence. CHECK’s rule is wrong, disabled, or a text-hack chatter deliberately does not reproduce. Action: document the divergence, do not implement. Examples: CHECK error 49 (uppercase-in-word) has been commented out in check.cpp since 2019, so flagging it would diverge from current CHECK; CHECK error 109 (postcodes on dependent tiers) is a raw character-match text-hack chatter deliberately does not reproduce (worked example below).
(c) Enhancement beyond CHECK. A TalkBank code with no CHECK counterpart. These are validation rules chatter adds; they need no CHECK mapping.

The remaining unmapped CHECK codes are an open, low-priority tail: most resolve to bucket (b) on source examination. Closing them is not a release gate.

Worked example: E256 (CHECK 138/139), implemented across both parsers

Curly single quotes (U+2018, U+2019) used as word characters were a genuine gap (bucket a): CHECK errors 138/139 flag them, chatter previously absorbed them silently. They are illegal CHAT word characters; CHAT uses the ASCII apostrophe.

Because chatter has two parsers that must agree (the tree-sitter parser and the re2c oracle, see Parser Backends), the fix lands in both, reaching the same recovery:

The character is excluded from the word token via the shared Symbol Registry (so it can never be part of a word).
The tree-sitter grammar recognizes it as a dedicated illegal_curly_quote node (not a generic parse error), and the parser emits E256 with a span pointing at the exact character.
The re2c lexer emits a recognized IllegalCurlyQuote token; the file-level parser emits E256 and drops the token before parsing.
In both, the offending quote is dropped and the surrounding words survive, so validation continues and reports a precise, actionable diagnostic.

This is the canonical shape of a CHECK-parity rule implemented to chatter’s standards: a recognized construct (parse, don’t merely fail), the same behavior in both parsers, and a spec in spec/errors/ that drives the tests.
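
The recovery shape described above (recognize the character, emit E256 with an exact span, drop the token, keep the surrounding words) can be illustrated with a toy lexer. This is a sketch under loud assumptions: it is neither the tree-sitter grammar nor the re2c lexer, whitespace-only word splitting is a simplification, and lex, recover, and this Diagnostic struct are hypothetical.

```rust
use std::ops::Range;

#[derive(Debug, Clone, PartialEq, Eq)]
enum Token {
    Word(String),
    IllegalCurlyQuote,
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct Diagnostic {
    code: &'static str,
    /// Byte offsets into the input.
    span: Range<usize>,
}

fn is_curly_single_quote(c: char) -> bool {
    c == '\u{2018}' || c == '\u{2019}'
}

/// Toy lexer: whitespace separates words, and a curly single quote is always
/// its own recognized token, so it can never be part of a word.
fn lex(input: &str) -> Vec<(Token, Range<usize>)> {
    let mut tokens = Vec::new();
    let mut word_start: Option<usize> = None;
    for (idx, c) in input.char_indices() {
        if c.is_whitespace() || is_curly_single_quote(c) {
            if let Some(start) = word_start.take() {
                tokens.push((Token::Word(input[start..idx].to_string()), start..idx));
            }
            if is_curly_single_quote(c) {
                tokens.push((Token::IllegalCurlyQuote, idx..idx + c.len_utf8()));
            }
        } else if word_start.is_none() {
            word_start = Some(idx);
        }
    }
    if let Some(start) = word_start {
        tokens.push((Token::Word(input[start..].to_string()), start..input.len()));
    }
    tokens
}

/// Drop each illegal token, report it with its span, and keep the words.
fn recover(tokens: Vec<(Token, Range<usize>)>) -> (Vec<String>, Vec<Diagnostic>) {
    let mut words = Vec::new();
    let mut diagnostics = Vec::new();
    for (token, span) in tokens {
        match token {
            Token::Word(text) => words.push(text),
            Token::IllegalCurlyQuote => diagnostics.push(Diagnostic { code: "E256", span }),
        }
    }
    (words, diagnostics)
}
```

In this sketch the quote splits its host word, so the neighbouring fragments survive as separate words and the diagnostic points at exactly the three bytes of the offending character.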

Worked example: CHECK 109 (intentional divergence, do not implement)

CHECK error 109 (“Postcodes are not allowed on dependent tiers”) is the canonical bucket-(b) divergence. CLAN fires it from check_CheckWords (OSX-CLAN/src/clan/check.cpp:3471-3690) whenever a %-tier word matches the raw character pattern [+ or [- (the isPostCodeMark macro), on any non-%x dependent tier. chatter deliberately does not reproduce it, for two reasons.

There is nothing typed to flag. chatter models postcodes as structured Postcode nodes inside TierContent.postcodes, a slot carried by the main tier. Ordinary dependent tiers have their own tier types (%com is a text tier) with no postcode slot, so a [+ ...]-shaped token on one is just part of the tier text. Detecting it would require a raw character scan of the tier string, the banned CHAT text-hacking; there is no structured node to validate.
It is not a CHAT-validity rule. An empirical check (2026-06-25) ran the real CLAN analysis tools on dependent-tier postcodes: FREQ and MLU exclude the [+ ...] token from their counts exactly as they do on the main tier, and KWAL prints the line without error. No analysis tool chokes; only CHECK flags it, so CHECK 109 guards against a failure mode its own toolchain does not have.

The divergence is grounded permanently as a divergence entry (CHECK 109) in the behavioral parity manifest (crates/talkbank-parser-tests/tests/check_parity/manifest.json): the chatter_matches_check gate asserts chatter keeps validating the fixture clean (a permanent intentional state, not a gap to close), and clan_check_grounding re-confirms the real CLAN binary still emits 109.

Bullet Validation documents the temporal media-bullet checks (CLAN errors 83/133/84 and chatter’s E701/E704/E729), a specific instance of the same “match CHECK where it is right, diverge where it is wrong” reconciliation this audit tracks across the whole error set.
Errors, CHAT core describes the ErrorCode model and the parser-layer / validation-layer split.
The spec-driven test pipeline that backs every rule is in Testing: rules live in spec/errors/ and generate both parser tests and the validation corpus.

Crate Reference

Status: Current Last modified: 2026-06-15 15:00 EDT

Summary of the main crates and packages in TalkBank/chatter.

Foundational crates

tree-sitter-talkbank

Rust binding crate for the generated TalkBank CHAT tree-sitter grammar. Exposes LANGUAGE, NODE_TYPES, and the generated query constants used by editor and parser integrations.

talkbank-model

The typed data model for CHAT files. Defines ChatFile, Utterance, DependentTier, MorTier, GraTier, and all other AST types. Includes validation logic, the WriteChat trait for CHAT serialization, serde support for JSON, and JsonSchema derivations. Also owns error types (ParseError, ErrorSink trait, Span, SourceLocation), diagnostic infrastructure, and ParseValidateOptions. Provides a closure-based content walker (walk_words / walk_words_mut) that centralizes recursive traversal of UtteranceContent and BracketedItem with domain-aware group gating.

talkbank-derive

Procedural macros for the model crate (SemanticEq, SemanticDiff, SpanShift, ValidationTagged, and the error_code_enum macro).

talkbank-cache

SQLite-backed validation and roundtrip cache used by higher-level validation and corpus workflows.

talkbank-parser

The canonical parser. Wraps the tree-sitter C parser and converts the concrete syntax tree (CST) into ChatFile model types. Provides error recovery via tree-sitter’s GLR algorithm and is the parser used by the CLI, LSP, transform pipelines, and editor tooling.

talkbank-parser-re2c

Independent alternate parser used as an equivalence oracle against the tree-sitter parser. Primarily a testing and spec-hardening tool rather than a first-wave end-user surface.

talkbank-transform

High-level pipelines: parse+validate, CHAT-to-JSON, JSON-to-CHAT, normalization. Integrates the validation cache, JSON schema validation, and parallel directory validation.

Application and integration surfaces

chatter

The chatter CLI binary: validate, normalize, to-json, and corpus management.

talkbank-lsp

Language Server Protocol server with tree-sitter incremental parsing, real-time diagnostics, and semantic highlighting.

send2clan

Rust bindings for sending files to the CLAN application (macOS Apple Events, Windows WM_APP). The crate exposes the safe send2clan API directly while keeping the raw FFI in private modules.

chatter-desktop

Desktop validation app (Tauri v2, React). Mandates TUI parity with the CLI.

Test and spec-support crates

talkbank-parser-tests

Parser tests. Runs the parser over the reference corpus and validates the results. Also owns spec-generated tests, roundtrip tests, equivalence tests, and property tests.

spec/tools

Generator binaries for tree-sitter corpus tests, generated Rust tests, shared spec artifacts, and error documentation.

spec/runtime-tools

Runtime-aware spec tooling for validation, bootstrap, and corpus-mining tasks that should not live in the root Rust workspace.

CLI Startup and the Program Stack

Status: Current Last modified: 2026-06-12 19:01 EDT

Why main() in crates/chatter/src/main.rs does not run the program directly, and what every contributor adding CLI surface should know about stack budgets.

The incident this page exists for

From 2026-06-05 to 2026-06-12, every chatter invocation crashed on Windows in debug builds with STATUS_STACK_OVERFLOW (exit code 0xC00000FD) before argument parsing even began. The crash surfaced as four failing adjudication_tests subprocess tests in the windows-latest CI job, but the faulting code was the clap-derived command-tree construction (Cli::augment_args via CommandFactory::command()), shared by every subcommand. The trigger was ordinary growth: the FREQ parity work added several hundred flags across 2026-06-03/04, and the construction path’s stack needs crossed 1 MiB.

Why stack usage is not portable

Two multipliers vary independently, and the crash happens where they collide:

Platform main-thread allowance. There is no single default:

Context	Main/default stack
Windows main thread	1 MiB (set in the PE header at link time)
macOS main thread	8 MiB
Linux main thread	typically 8 MiB (`ulimit -s`)
Rust spawned threads	2 MiB unless `stack_size` is given

Shipping cross-platform means your real budget is the smallest of these: Windows’ 1 MiB.

Build profile. At opt-level 0, rustc gives every temporary in a function body its own stack slot and does not coalesce them, so a function’s frame is roughly the SUM of all its temporaries, not the maximum simultaneously alive. clap’s derive expands to one enormous builder function per args struct (one multi-call chain per flag, each Arg/Command temporary a few hundred bytes by value), which is exactly the shape this penalizes. Release builds coalesce slots and inline, shrinking the same frames by one to two orders of magnitude.

Consequence: identical code can be fine in release on macOS (8 MiB budget, small frames) and fatal in debug on Windows (1 MiB budget, fat frames). Debug test binaries cross the line first, which is why CI subprocess tests caught it and shipped release binaries never crashed.

The design: an explicitly sized program thread

main() spawns the entire program onto a thread with an explicit, documented stack size (PROGRAM_STACK_BYTES, 16 MiB) and only joins and re-raises panics, so exit semantics are unchanged. This removes the dependency on platform main-stack defaults altogether. The alternative, trimming stack usage back under an invisible, platform-dependent line, would only postpone the problem: the CLAN parity roadmap (roughly sixty commands’ worth of flags still to come) guarantees we would cross that line again. rustc itself uses the same pattern for the same reasons.

flowchart TD
    main["main()\n(crates/chatter/src/main.rs)"]
    spawn["thread::Builder::stack_size(PROGRAM_STACK_BYTES)\n.spawn(program_main)"]
    prog["program_main()\nclap tree build + parse + cli::run"]
    join{"join() result?"}
    ok["process exits normally"]
    panic["resume_unwind(payload)\n(same exit behavior as a panic in main)"]
    fail["spawn failed (OS resource):\neprintln + exit(1)"]

    main --> spawn
    spawn -->|"Ok(handle)"| prog
    prog --> join
    join -->|"Ok(())"| ok
    join -->|"Err(payload)"| panic
    spawn -.->|"Err(e)"| fail

The reservation is virtual address space; physical pages are committed only as they are touched, so the 16 MiB costs nothing measurable. The extra thread spawn at startup is microseconds.
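
The shim pattern in the flowchart can be sketched with the standard library alone. This is a minimal illustration, not the contents of crates/chatter/src/main.rs: run_on_program_thread and stack_hungry are hypothetical names, and the generic return value is a convenience for the example.

```rust
use std::panic;
use std::process;
use std::thread;

/// 16 MiB, matching the PROGRAM_STACK_BYTES value named above.
const PROGRAM_STACK_BYTES: usize = 16 * 1024 * 1024;

/// Run `program_main` on a thread with an explicit stack size, then join it.
/// A panic on the program thread is re-raised on the calling thread.
fn run_on_program_thread<F, T>(program_main: F) -> T
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    let spawned = thread::Builder::new()
        .stack_size(PROGRAM_STACK_BYTES)
        .spawn(program_main);

    let handle = match spawned {
        Ok(handle) => handle,
        Err(err) => {
            // Spawn failure is an OS resource problem, not a program bug.
            eprintln!("failed to spawn program thread: {err}");
            process::exit(1);
        }
    };

    match handle.join() {
        Ok(value) => value,
        Err(payload) => panic::resume_unwind(payload),
    }
}

/// Deliberately stack-hungry recursion: each frame keeps a 4 KiB buffer alive
/// across the recursive call, so depth 500 needs on the order of 2 MiB.
fn stack_hungry(depth: u32) -> u32 {
    let scratch = std::hint::black_box([0u8; 4096]);
    if depth == 0 {
        u32::from(scratch[0])
    } else {
        1 + stack_hungry(depth - 1) + u32::from(scratch[4095])
    }
}
```

A real binary would reduce main() to a single call such as run_on_program_thread(program_main), keeping everything else off the OS main thread.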

Regression gates

crates/chatter/tests/stack_limit_tests.rs runs the real binary under a Windows-sized 1 MiB stack (sh -c 'ulimit -s 1024') on Unix, so macOS and Linux CI enforce the Windows constraint on every run. Without this, the constraint is tested only by the windows-latest job, where this incident sat unnoticed for a week.
The windows-latest cross-platform job remains the native test of the real 1 MiB main stack (which no longer matters to the program thread, but guards the main() shim itself).

Guidance for contributors

Do not move program logic back onto the bare OS main thread; anything before the spawn runs under the platform’s smallest default.
Adding flags and subcommands is normal and expected; the budget is now the explicit PROGRAM_STACK_BYTES constant. If deep recursion or generated code ever approaches it, raise the constant deliberately in a reviewed change rather than discovering the limit in CI.
The same two multipliers apply to any worker threads you spawn: Rust’s 2 MiB spawned-thread default is also finite, and recursive parser or validation code running on worker threads should size them explicitly if depth is data-dependent.

Repository Architecture and Boundaries

Status: Current Last modified: 2026-06-15 15:00 EDT

Top-level layout

spec/                     canonical syntax and error spec source
spec/tools/               deterministic generators + validators (separate Cargo workspace)
grammar/                  tree-sitter grammar source + generated parser artifacts
crates/                   all Rust crates (root Cargo workspace)
  talkbank-model/         data model, validation, alignment, errors, parser API trait
  talkbank-derive/        proc macros (SemanticEq, SpanShift, ValidationTagged, error_code_enum)
  talkbank-parser/        canonical parser (tree-sitter)
  talkbank-parser-re2c/   alternate parser (specification oracle, opt-in batch parser)
  talkbank-parser-tests/  parser equivalence and roundtrip tests
  talkbank-transform/     pipelines, CHAT↔JSON, caching, parallel validation
  chatter/                the `chatter` CLI binary
  talkbank-lsp/           LSP server
  send2clan/              Rust bindings to the legacy CLAN app bridge
  talkbank-cache/         validation + roundtrip cache
apps/                     desktop app (Tauri v2 + React): chatter-desktop
corpus/                   reference corpus (must pass 100%)
schema/                   JSON Schema for ChatFile AST
tests/                    workspace-level integration tests and fixtures
fuzz/                     fuzz targets (separate Cargo workspace)
book/                     mdBook documentation source
docs/                     strategy docs, proposals, and investigations

Architectural principles

Clear boundaries between specification, generation, runtime logic, and documentation.
Generated artifacts and hand-authored code are kept separate with hard guardrails: parser.c, node-types.json, generated tests, and error-doc artifacts are never edited by hand.
Each crate has a single clear responsibility.
Entry-point docs guide new contributors to authoritative references quickly.

Canonical ownership rules

spec/ owns the language intent and accepted examples, what CHAT means.
grammar/ owns tokenization and CST shape only, not semantic validation policy.
talkbank-model owns semantic validity, serialization invariants, error types, and parser API contracts.
talkbank-transform owns pipelines and JSON schema validation.
talkbank-cache owns the shared SQLite-backed validation and roundtrip cache.

Dependency direction rules

spec does not depend on runtime crates.
grammar is consumed by parser crates, not vice versa.
talkbank-model is dependency-minimal and stable; all other talkbank-* crates depend on it.
CLI / LSP / desktop apps depend on stable internal APIs, never directly on unstable internals of other crates.
Generator tools may read specs and grammar metadata but do not become runtime dependencies.

Acceptance criteria

Every top-level directory has a clear purpose statement.
No crate depends on internal modules outside declared boundaries.
No generated artifact is edited manually.
New contributors can identify authoritative docs in less than five minutes.

Grammar System and Token Governance

Status: Current Last modified: 2026-05-29 18:43 EDT

Current Reality

grammar/grammar.js encodes substantial implicit language knowledge directly in regex exclusions, reserved symbol lists, and leniency decisions. Example areas:

word segment forbidden start/rest classes,
CA delimiter/element symbol groups,
event segment exclusions,
hand-maintained coupling between comments and token rules.

This is currently powerful but fragile.

Primary Failure Modes

New symbolic token added in one place but not in exclusion sets.
Parser behavior changes silently due to regex class edits.
Generated node types drift from assumptions in spec tooling.
Lenient parsing choices become undocumented policy.

Current Design

The generated symbol registry is the single source of token constraints. The pipeline has shipped; running just symbols-gen rebuilds it.

Registry Artifacts

spec/symbols/symbol_registry.json (human-authored intent):
- symbol string
- category (delimiter, continuation, overlap, punctuation, etc.)
- contexts where reserved/allowed
- parse role and precedence notes
Generated outputs:
- grammar/src/generated_symbol_sets.js
- crates/talkbank-model/src/generated/symbol_sets.rs
- spec/tools/src/generated/symbol_sets.rs
- docs: Symbol Registry

Grammar Refactor Requirements

Replace large manual regex strings with generated character classes.
Keep final grammar readable by preserving semantic names in generated constants.
Distinguish clearly between:

syntax permissiveness,
semantic validation restrictions.

Add comments only for design rationale, not for duplicating manual references.

Node Type Drift Controls

Enforce regeneration and consistency checks:
- grammar source change must regenerate parser and node types,
- node type constants consumed by spec/tools and parser code must compile,
- CI fails if generated files differ from committed state.

Leniency Policy

Explicitly classify every lenient parse behavior:

Parse-lenient + validate-strict.
Parse-lenient + validate-warning.
Parse-strict (hard fail).

Document this matrix in the Leniency Policy.

Grammar Test Strategy

Keep corpus tests generated from spec/constructs.
Add targeted hand-authored edge tests for symbol boundary interactions.
Add mutation-style tests for forbidden-character regressions.
Add parser equivalence tests for tokenizer-sensitive cases.

Acceptance Criteria

No manual reserved-symbol duplication in grammar.js.
Symbol registry is generated to all required consumers.
Grammar modifications cannot land with stale generated artifacts.
Every special token category has explicit policy documentation.

Parser, Model, and API Contracts

Status: Current Last updated: 2026-06-21 21:33 EDT

Single-handle parser API

talkbank-parser provides TreeSitterParser as the canonical API handle for all parsing; full-file and fragment methods live directly on the struct. Callers create one instance and pass &TreeSitterParser everywhere. The alternate talkbank-parser-re2c is opt-in (specification oracle and high-throughput batch parsing) and produces the same ChatFile model.

Contract for Batchalign

The Batchalign runtime (the batchalign crate) consumes these guarantees from the talkbank-* core crates:

parsing produces a typed ChatFile or an explicit parse-status signal
parse-health taint is visible to alignment consumers
alignment helpers operate on semantic model types, not raw text hacks
recovery never fabricates valid-looking placeholder semantics for malformed input

The parser/model boundary stays honest enough for downstream workflows (align, compare, benchmark, morphotagging) to make their own validity decisions.

Canonical Contract Model

Public Contract Layers

Parse API Contract:

stable function signatures,
deterministic parse result envelope,
clear partial-success semantics.

Semantic Model Contract:

stable core model fields,
explicit unstable/internal fields policy.

Diagnostic Contract:

stable error code IDs and severity semantics,
best-effort message text compatibility.

Serialization Contract:

deterministic output constraints,
normalized formatting policy.

Required Types

ParseOutcome<T>
- value: T | omitted-by-status
- diagnostics: Vec<Diagnostic>
- status: Success | Partial | Failed
Diagnostic
- code, severity, category, message, location, context, suggestion
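
One way to read the envelope above in Rust is sketched below. The field names follow the listing; every field type, the Option encoding of "omitted-by-status", and the constructor names are assumptions made for illustration rather than the published contract.

```rust
use std::ops::Range;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Success,
    Partial,
    Failed,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Severity {
    Error,
    Warning,
}

/// Field names from the listing above; the types are assumed.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Diagnostic {
    code: String,
    severity: Severity,
    category: String,
    message: String,
    location: Range<usize>,
    context: Option<String>,
    suggestion: Option<String>,
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct ParseOutcome<T> {
    /// "omitted-by-status": None when the status is Failed.
    value: Option<T>,
    diagnostics: Vec<Diagnostic>,
    status: Status,
}

impl<T> ParseOutcome<T> {
    fn success(value: T) -> Self {
        Self { value: Some(value), diagnostics: Vec::new(), status: Status::Success }
    }

    /// A usable value exists, but diagnostics describe what was recovered from.
    fn partial(value: T, diagnostics: Vec<Diagnostic>) -> Self {
        Self { value: Some(value), diagnostics, status: Status::Partial }
    }

    /// No value is produced; callers must rely on the diagnostics.
    fn failed(diagnostics: Vec<Diagnostic>) -> Self {
        Self { value: None, diagnostics, status: Status::Failed }
    }
}
```

Routing construction through the three constructors keeps the status and the presence of a value in agreement, which is the point of a deterministic result envelope.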

Parser Role

talkbank-parser: the canonical parser, used by CLI/LSP/API/batchalign3. TreeSitterParser is its only API handle; callers create one and pass &TreeSitterParser everywhere.
Tree-sitter GLR provides error recovery; the Rust traversal code converts CST to typed model.
Full-file methods: parser.parse_chat_file(), parser.parse_chat_file_streaming().
Fragment methods: parser.parse_word_fragment(), parser.parse_main_tier_fragment(), etc.

Invariants

Parsing with offset must shift all spans consistently.
Parse-level and validation-level diagnostics must remain distinguishable.
Serialization should preserve semantic equivalence and documented formatting rules.
Roundtrip behavior must be testable per parser implementation.
Parser functions that accept ErrorSink should not return Option<T> for fallible parse state.
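
The first invariant (offset parsing shifts every span consistently) can be sketched as a trait that each span-carrying node forwards to its children. ShiftSpans, Word, and Fragment are hypothetical stand-ins here; the repository's SpanShift derive may have a different shape.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Span {
    start: usize,
    end: usize,
}

/// Hypothetical trait: move every span in a value by the same byte offset.
trait ShiftSpans {
    fn shift_spans(&mut self, offset: usize);
}

impl ShiftSpans for Span {
    fn shift_spans(&mut self, offset: usize) {
        self.start += offset;
        self.end += offset;
    }
}

impl<T: ShiftSpans> ShiftSpans for Vec<T> {
    fn shift_spans(&mut self, offset: usize) {
        for item in self.iter_mut() {
            item.shift_spans(offset);
        }
    }
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct Word {
    text: String,
    span: Span,
}

impl ShiftSpans for Word {
    fn shift_spans(&mut self, offset: usize) {
        self.span.shift_spans(offset);
    }
}

/// A parsed fragment: its own span plus the spans of its children.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Fragment {
    span: Span,
    words: Vec<Word>,
}

impl ShiftSpans for Fragment {
    fn shift_spans(&mut self, offset: usize) {
        // Forgetting either line here is exactly the inconsistency the invariant forbids.
        self.span.shift_spans(offset);
        self.words.shift_spans(offset);
    }
}
```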

API Versioning Policy (Pre-1.0, Strict)

Three intended contract levels:
- Stable-for-integrators
- Stable-internal
- Experimental
Mark every public function/type by contract level.

This classification is not yet codified in a separate manifest file; the levels above are the working policy. Integrators should treat any unmarked surface as Experimental until contract levels are formally published.

Acceptance Criteria

Single canonical parse outcome envelope exposed for integrators.
Parser implementations conform to shared contract tests.
Contract-level annotations exist for all public API surfaces.
Documentation for parse/validate/serialize lifecycle is centralized and current.

Recovery Contract: No Fabricated Semantic Values

The parser contract must forbid sentinel semantic values during error recovery.

Disallowed recovery behavior:

returning arbitrary enum variants as fallback for unknown/missing nodes,
returning empty strings as stand-ins for required fields,
constructing fake words/chunks like "missing", "error", or other placeholders.

Required recovery behavior:

Emit structured diagnostic with precise span and expected node kind.
Return an explicit parse-status signal (Partial/Failed) through ParseOutcome.
Omit invalid semantic node OR store it in explicit recovery metadata, never as a valid semantic value.
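
The contrast between fabricating a fallback and omitting the node can be shown on a small example. This is a sketch under assumptions: the three-variant Terminator enum, the Recovered envelope, and parse_terminator are simplified stand-ins, not the parser's real types.

```rust
use std::ops::Range;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Terminator {
    Period,
    Question,
    Exclamation,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Success,
    Partial,
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct Diagnostic {
    expected: &'static str,
    span: Range<usize>,
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct Recovered<T> {
    value: Option<T>,
    diagnostics: Vec<Diagnostic>,
    status: Status,
}

/// Parse a terminator found at byte `offset`.
/// Unknown text is never mapped to a default variant such as Period.
fn parse_terminator(text: &str, offset: usize) -> Recovered<Terminator> {
    let value = match text {
        "." => Some(Terminator::Period),
        "?" => Some(Terminator::Question),
        "!" => Some(Terminator::Exclamation),
        _ => None,
    };
    match value {
        Some(terminator) => Recovered {
            value: Some(terminator),
            diagnostics: Vec::new(),
            status: Status::Success,
        },
        None => Recovered {
            // The node is omitted; the diagnostic names what was expected and where.
            value: None,
            diagnostics: vec![Diagnostic {
                expected: "terminator",
                span: offset..offset + text.len(),
            }],
            status: Status::Partial,
        },
    }
}
```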

Current enforcement:

CI guardrail script tracks and blocks introduction of new ErrorSink + Option signatures.
See scripts/check-errorsink-option-signatures.sh and scripts/errorsink_option_allowlist.txt.

Rationale:

fabricated semantic values create secondary, misleading diagnostics against synthetic data,
downstream tools cannot distinguish real user content from parser-generated placeholders,
equivalence and regression tests become noisy and non-actionable.

For batchalign3, this is especially important because alignment workflows must be able to tell the difference between:

a malformed input that should taint or block alignment
a recoverable input where raw text can be preserved
a clean input that should proceed through the align/compare pipeline

String Storage Policy

The model uses three string storage strategies:

Arc<str> interning (interned_newtype!): For high-frequency repeated values (POS tags, stems, speaker codes). Global interner avoids redundant allocations.
SmolStr (string_newtype!): For short strings (median 10-15 chars) that benefit from inline storage. O(1) clone, no heap allocation for strings ≤23 bytes.
String: Only for utility types outside the core model (e.g., semantic_diff/).
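
The Arc<str> interning strategy can be sketched with the standard library alone. This is an illustration of the idea, not the interned_newtype! macro: the interner and intern functions are hypothetical, and a Mutex-guarded HashSet is an assumed (and deliberately simple) choice of global table.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex, OnceLock};

/// Process-wide table of interned strings, created on first use.
fn interner() -> &'static Mutex<HashSet<Arc<str>>> {
    static INTERNER: OnceLock<Mutex<HashSet<Arc<str>>>> = OnceLock::new();
    INTERNER.get_or_init(|| Mutex::new(HashSet::new()))
}

/// Return the shared Arc<str> for `text`, allocating only on first sight.
fn intern(text: &str) -> Arc<str> {
    let mut table = interner().lock().expect("interner mutex poisoned");
    if let Some(existing) = table.get(text) {
        // Cloning an Arc bumps a reference count; no string data is copied.
        return Arc::clone(existing);
    }
    let fresh: Arc<str> = Arc::from(text);
    table.insert(Arc::clone(&fresh));
    fresh
}
```

Two calls with the same text return handles to the same allocation, which is why repeated values such as POS tags or speaker codes cost one allocation in total.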

Parser Backends

Status: Current Last modified: 2026-06-13 22:40 EDT

TalkBank has two CHAT parser implementations. Both implement the ChatParser trait and produce identical ChatFile model types.

The --parser flag selects the backend at the CLI boundary; everything downstream consumes the identical ChatFile output, so the choice is invisible past the dispatch point:

flowchart TD
    cli["chatter validate --parser &lt;backend&gt;\n(ParserBackend enum,\nchatter cli_types.rs)"]
    sel{"which backend?\n(ParserKind,\ntalkbank-transform\nvalidation_runner/config.rs)"}
    ts["TreeSitterParser\n(talkbank-parser:\nGLR, incremental)"]
    re2c["Re2cParser\n(talkbank-parser-re2c:\nre2c DFA + chumsky)"]
    trait["ChatParser trait\n(talkbank-model\nparser_api/chat_parser.rs)"]
    model["ChatFile\n(talkbank-model:\nSemanticEq-identical\nfor both backends)"]

    cli --> sel
    sel -->|"tree-sitter (default)"| ts
    sel -->|"re2c"| re2c
    ts -->|"ParserDispatch::TreeSitter\n(worker.rs) implements"| trait
    re2c -->|"ParserDispatch::Re2c\n(worker.rs) implements"| trait
    trait --> model

ParserDispatch::new(kind) (in validation_runner/worker.rs) is the single place that constructs the chosen backend from a ParserKind; both variants wrap a ChatParser implementor, so the validation runner never branches on backend again.
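
The dispatch pattern described above can be sketched as follows. The type names mirror the ones in the diagram, but every body here is a simplified stand-in: the real `ChatParser` trait, `ChatFile` model, and backends carry far more, and the `parse` signature and `count_lines` helper are assumptions for illustration.

```rust
/// Simplified stand-in for the shared model type.
#[derive(Debug, PartialEq)]
pub struct ChatFile {
    pub lines: Vec<String>,
}

/// Simplified stand-in for the shared parser trait.
pub trait ChatParser {
    fn parse(&self, source: &str) -> ChatFile;
}

pub struct TreeSitterParser;
pub struct Re2cParser;

impl ChatParser for TreeSitterParser {
    fn parse(&self, source: &str) -> ChatFile {
        ChatFile { lines: source.lines().map(String::from).collect() }
    }
}

impl ChatParser for Re2cParser {
    fn parse(&self, source: &str) -> ChatFile {
        ChatFile { lines: source.lines().map(String::from).collect() }
    }
}

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum ParserKind {
    TreeSitter,
    Re2c,
}

pub enum ParserDispatch {
    TreeSitter(TreeSitterParser),
    Re2c(Re2cParser),
}

impl ParserDispatch {
    /// The single place that turns a `ParserKind` into a concrete backend.
    pub fn new(kind: ParserKind) -> Self {
        match kind {
            ParserKind::TreeSitter => ParserDispatch::TreeSitter(TreeSitterParser),
            ParserKind::Re2c => ParserDispatch::Re2c(Re2cParser),
        }
    }
}

impl ChatParser for ParserDispatch {
    fn parse(&self, source: &str) -> ChatFile {
        match self {
            ParserDispatch::TreeSitter(parser) => parser.parse(source),
            ParserDispatch::Re2c(parser) => parser.parse(source),
        }
    }
}

/// Downstream code depends only on the trait, never on the backend.
pub fn count_lines(parser: &impl ChatParser, source: &str) -> usize {
    parser.parse(source).lines.len()
}
```

Enum dispatch keeps the backend choice static and avoids a `Box<dyn ChatParser>`. That may matter when one backend is `!Send` and the other is not, although the sketch does not model thread-safety.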

TreeSitterParser (default)

Crate: talkbank-parser
Technology: tree-sitter GLR parser
Grammar: grammar/grammar.js → generated C parser
Strengths: Incremental reparsing (LSP), robust error recovery (GLR), CST-level diagnostics
Weaknesses: Slower on batch workloads, !Send + !Sync (one parser per thread)

Used by the LSP, the default CLI, and all production validation.

Re2cParser

Crate: talkbank-parser-re2c
Technology: re2c DFA lexer + chumsky parser combinators
Grammar: Translated from grammar.js rules → re2c conditions + chumsky combinators
Strengths: 4-8x faster, Send + Sync, zero constructor cost, specification oracle
Weaknesses: No incremental reparsing, Box::leak memory strategy

Used for batch validation, parser parity testing, and performance benchmarking.

CLI Usage

# Default: tree-sitter
chatter validate corpus/

# Use re2c for faster batch validation
chatter validate --parser re2c corpus/

# Roundtrip with re2c
chatter validate --parser re2c --roundtrip corpus/

The --parser flag accepts tree-sitter (default) or re2c. Cache entries are parser-specific, so switching parsers does not invalidate the other’s cache.

Parity Status

Both parsers produce SemanticEq-identical output on the 87-file reference corpus (100% match). On the ~100k-file wild corpus, parity is ~98.7%.

Error Detection

Metric	Value
Specs tested	140
Both detect error	140/140 (100%)
Same error code	79/140 (56.4%)
Different code, both detect	61/140 (43.6%)
Re2c silent (misses error)	0

The 61 code mismatches come from architectural differences, not bugs. Both parsers report actionable diagnostics for all 140 testable error specs.

Performance

Benchmark	TreeSitter	Re2c	Speedup
Small file (13 lines)	44 µs	9.6 µs	4.6x
Medium file (dependent tiers)	69 µs	9.4 µs	7.3x
Large file (complex)	7,734 µs	970 µs	8.0x
Batch (35 files)	21.7 ms	3.0 ms	7.2x

Run benchmarks: cargo bench -p talkbank-parser-re2c --bench parse_comparison

When to Use Which

Use Case	Recommended Parser	Why
LSP / editor integration	tree-sitter	Incremental reparsing
Batch validation (>100 files)	re2c	4-8x faster
CI validation	Either	Both correct; re2c saves CI time
Error diagnostics (user-facing)	tree-sitter	More specific E3xx codes
Parser parity testing	Both	Re2c is the specification oracle
Profiling / benchmarking	re2c	DFA lexer gives a performance floor

Shared Model Infrastructure

Both parsers convert to the same talkbank_model::ChatFile type and share post-hoc promotion logic:

TierContent::extract_terminal_bullet(): trailing InternalBullet → utterance bullet
parse_bullet_node_timestamps(): structured bullet CST → (start_ms, end_ms)

CA intonation arrows are no longer promoted to terminators at the parser/model boundary; both parsers leave them as Separator items. See CA Terminator Resolution.

Detailed Parity Report

See crates/talkbank-parser-re2c/docs/parity-report.md for the full gap analysis, divergence categories, and remaining work items.

Parser Leniency Policy

Status: Current Last updated: 2026-07-16 11:58 EDT

This document is the single source of truth for how the tree-sitter grammar, Rust validation layer, and CLI tooling divide responsibility for enforcing the CHAT specification. It consolidates decisions scattered across grammar.js comments, analysis documents, and code.

Scope: Documentation only. This document does not implement new validation rules; it records what exists and what is intentionally absent, and it proposes a roadmap for closing gaps.

Philosophy: Parse, Don’t Validate

The tree-sitter grammar intentionally accepts a superset of valid CHAT. The rationale:

Maximise parse coverage: Real-world .cha files contain legacy patterns, whitespace variations, and edge cases. A grammar that rejects them produces no AST and therefore no diagnostics. Accepting them gives the validation layer something to work with.
Separate syntax from semantics: The grammar captures structure (headers, utterances, tiers, annotations). The Rust validation layer enforces semantic rules (required headers, participant declarations, alignment counts).
Enable configurable strictness: Different consumers need different policies. A roundtrip pipeline can be strict; an editor providing live diagnostics should be lenient. Validation profiles (see Validation Profile Infrastructure) make this possible.

Three-Tier Classification

Every intentional leniency decision falls into one of three tiers:

Tier	Label	Meaning
A	Parse-lenient + validate-strict	Grammar accepts it; validation rejects it as an error
B	Parse-lenient + validate-warning	Grammar accepts it; validation emits a warning
C	Parse-lenient only	Grammar accepts it; no validation needed: the construct is genuinely optional or the broad acceptance is by design

This classification was proposed in an earlier grammar governance analysis and is formalised here.

Leniency Matrix

Master table of every documented leniency decision in the grammar. The Status column indicates whether downstream validation compensates for the grammar’s permissiveness.

#	Grammar Construct	Spec Requirement	Grammar Behavior	Tier	Validation	Error Code	Status
1	`@UTF8` header	Required, must be first line	Optional (not enforced)	A	Validated	E503	OK
2	`@Begin` header	Required	Optional (`grammar.js` ~L104)	A	Validated	E504	OK
3	`@End` header	Required	Optional (`grammar.js` ~L106)	A	Validated	E502	OK
4	Pre-first-utterance header order	No enforced order (matches CLAN CHECK)	`choice()`, any order (`grammar.js` ~L122-135)	C	N/A (by design)	none	OK
5	Headers after utterances	Allowed (e.g. `@Bg`, `@Eg`, `@G`, `@Comment`)	Interleaved freely	C	N/A (by design)	none	OK
6	Content type context restrictions	Unified across contexts	Unified `base_content_item` (`grammar.js` ~L731-738)	C	N/A (by design); specific semantic rules (E371, E372) exist separately	none	OK
7	Terminator presence	Required (except CA mode)	Optional (`grammar.js` ~L691-692)	A	Validated	E305	OK
8	Bare shortening as word	CA mode only	Accepted anywhere	A	Validated	E2xx	OK
9	Trailing whitespace in annotations	Not specified	Optional trailing space (`grammar.js` ~L957, 966, 975, 1004, 1013)	C	N/A	none	OK
10	MOR segment Unicode	Very permissive (broad language support)	Exclusion-based regex (`grammar.js` ~L1909-1915)	C	N/A (by design)	none	OK
11	MOR fusional suffixes with hyphens	ALNUM + IPA only	Allows hyphens (`grammar.js` ~L1942-1945)	C	N/A (by design)	none	OK
12	MOR nested translations	No nested structures	Allows `()` and `[]` nesting (`grammar.js` ~L1954-1966)	C	N/A (by design)	none	OK
13	Linkers / language codes	Truly optional	Optional	C	N/A	none	OK
14	Word annotations	Truly optional	Optional	C	N/A	none	OK
15	Media bullet	Truly optional	Optional	C	N/A	none	OK
16	Group whitespace (leading/trailing)	No whitespace inside `<` `>`	Optional (`grammar.js` ~L1097, 1099)	C	N/A	none	OK
17	Long feature label characters	Limited character set	`/[A-Za-z0-9@%_-]+/` (`grammar.js` ~L1327)	C	N/A	none	OK
18	Catch-all headers (`$.anything`)	Structured content for some headers	`/[^\r\n]+/` for ~19 header types	C	N/A (content is opaque)	none	OK
19	Header gap whitespace	Single space/tab	`repeat1(choice(space, tab))` (`grammar.js` ~L467, 477, 489)	C	N/A	none	OK
20	`@Types` header whitespace	No spaces around commas	Optional whitespace around commas (`grammar.js` ~L584-592)	C	N/A	none	OK

Permissiveness Regression Decisions

During development, several validation rules were tightened and then relaxed after they produced false positives against the reference corpus. These decisions are documented in the permissiveness regression log (archived). Each is summarised here with its rationale.

Decision 1: `[*]` bare annotation, E214 disabled

Previous behaviour: E214 emitted when [*] appeared without an explicit error code (empty ContentAnnotation::Error).
Current behaviour: Bare [*] is accepted without error.
Implementation: Removed validation branch in talkbank-model/src/model/annotation/annotated.rs.
Rationale: Reference files (errormarkers.cha, compound.cha) use bare [*] as valid CHAT.
Revisit: If coded error annotations become required, do it behind an explicit strict profile.

Decision 2: `@t` without `@s:<lang>`, E248 disabled

Previous behaviour: E248 emitted for @t markers without an explicit language marker.
Current behaviour: @t accepted without requiring @s:<lang>.
Implementation: Removed checks in talkbank-model/src/validation/word/structure.rs.
Rationale: Reference file formmarkers.cha contains a@t and is expected to be valid.
Revisit: Scope to explicit strict validation mode if desired.

Decision 3: Undeclared inline language codes, E254 re-introduced as warning

Original behaviour: Inline @s:... markers with language codes not declared in @Languages emitted E254 as an error.
Intermediate behaviour: E254 was disabled and the code removed from the codebase to keep reference file lang-marker.cha valid.
Current behaviour: E254 (UndeclaredExplicitWordLanguage) is back in the registry at crates/talkbank-model/src/errors/codes/error_code.rs:321 and emitted at crates/talkbank-model/src/validation/word/language/resolve.rs:195, but as a warning rather than an error. This was paired with the introduction of E255 (WholeUtteranceLanguageSwitchShouldUsePrecode) for whole-utterance @s runs that should use [- lang] precodes.
Why it returned: Heterogeneous corpora (Cantonese, Polish, Czech, Spanish, HK bilingual) made the warn-only signal load-bearing for catching @s:LANG markers that disagreed with @Languages. The warning surfaces the inconsistency without blocking the file.
Revisit: If the warn-only signal turns out to be ignored in practice, decide between escalating back to error severity or removing.

Decision 4: Mixed-language digit legality, permissive-any rule

Previous behaviour: Digits had to be legal in all applicable languages for mixed/ambiguous markers.
Current behaviour: Digits accepted if legal in at least one applicable language.
Implementation: Changed from is_valid_in_all() to any() in talkbank-model/src/validation/word/language/digits.rs.
Rationale: Prevents false positives in mixed-language reference examples.
Revisit: Confirm spec intent for mixed/ambiguous validation semantics.
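
The difference between the two rules comes down to `all` versus `any` over the applicable languages. The sketch below is illustrative only: `digits_legal_in`, the language codes `L1` and `L2`, and their digit policies are invented for the example and do not reflect the real tables in digits.rs.

```rust
/// Hypothetical per-language rule. For illustration, the made-up language
/// "L1" permits digits inside words and every other language does not.
fn digits_legal_in(lang: &str, word: &str) -> bool {
    let language_allows_digits = lang == "L1";
    let has_digit = word.chars().any(|c| c.is_ascii_digit());
    language_allows_digits || !has_digit
}

/// Previous behaviour: digits must be legal in every applicable language.
pub fn digits_ok_strict(langs: &[&str], word: &str) -> bool {
    langs.iter().all(|lang| digits_legal_in(lang, word))
}

/// Current behaviour: digits must be legal in at least one applicable language.
pub fn digits_ok_permissive(langs: &[&str], word: &str) -> bool {
    langs.iter().any(|lang| digits_legal_in(lang, word))
}
```

One edge case to keep in mind when reading the real code: on an empty language list, `all` returns true and `any` returns false, so the two rules differ even when no language applies.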

Decision 5: `@Bg` nesting, same-label only

Previous behaviour: Any nested @Bg while another gem scope was open emitted E529.
Current behaviour: E529 only fires when nesting the same label (or same unlabeled scope key). Different labels may nest hierarchically.
Implementation: Changed from any_scope_open to same_scope_open in talkbank-model/src/validation/header/structure.rs.
Rationale: Avoids false positives on hierarchical markup patterns (e.g., HSLLD corpus).
Revisit: Decide whether nesting policy should be global or per-label.
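
A minimal sketch of the same-label rule, assuming gem scopes are tracked as a list of open labels. `GemEvent` and `same_label_nesting` are hypothetical names; the real check lives in header/structure.rs and may track scopes differently.

```rust
/// Hypothetical event stream: `None` stands for an unlabeled `@Bg` / `@Eg`.
#[derive(Debug, Clone, PartialEq)]
pub enum GemEvent {
    Begin(Option<String>),
    End(Option<String>),
}

/// Returns the indices of `@Bg` events that open a scope whose label
/// (or unlabeled key) is already open. Different labels may nest freely.
pub fn same_label_nesting(events: &[GemEvent]) -> Vec<usize> {
    let mut open: Vec<Option<String>> = Vec::new();
    let mut flagged = Vec::new();
    for (index, event) in events.iter().enumerate() {
        match event {
            GemEvent::Begin(label) => {
                if open.contains(label) {
                    flagged.push(index); // would be reported as E529
                }
                open.push(label.clone());
            }
            GemEvent::End(label) => {
                // Close the innermost scope with this label, if any.
                // Unmatched ends are ignored in this sketch.
                if let Some(pos) = open.iter().rposition(|l| l == label) {
                    open.remove(pos);
                }
            }
        }
    }
    flagged
}
```

Under the previous `any_scope_open` rule, the second `@Bg` in a `meal` / `story` nesting would have been flagged; here only a repeated `meal` inside an open `meal` is.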

Decision 6: Temporal bullets in CA mode, skipped

Previous behaviour: E701/E704 temporal checks ran even for CA-mode files.
Current behaviour: Temporal constraints are skipped when file is in CA mode.
Implementation: validate_temporal_constraints() early-returns when ca_mode is true (talkbank-model/src/validation/temporal.rs).
Rationale: CA reference files include patterns that triggered false monotonicity/self-overlap diagnostics.
Revisit: Implement CA-specific temporal policy rather than global skip.

Decision 7: Pipeline severity threshold, errors only

Previous behaviour: Any validation diagnostic (including warnings) caused PipelineError::Validation.
Current behaviour: Pipeline returns failure only if at least one diagnostic has Severity::Error.
Implementation: talkbank-transform/src/pipeline/parse.rs.
Rationale: Warnings should not block parse/transform/export pipelines.
Revisit: Keep as default; add explicit --strict flag/profile if needed.
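
The threshold amounts to a single predicate over the collected diagnostics. The sketch below uses simplified stand-in types; the `gate` function and the shape of `PipelineError::Validation` are assumptions, not the code in pipeline/parse.rs.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Severity {
    Info,
    Warning,
    Error,
}

/// Simplified diagnostic: just a code and a severity.
#[derive(Debug, Clone, PartialEq)]
pub struct Diagnostic {
    pub code: &'static str,
    pub severity: Severity,
}

#[derive(Debug, PartialEq)]
pub enum PipelineError {
    Validation(Vec<Diagnostic>),
}

/// Fail only when at least one diagnostic is an error. Warnings and info
/// diagnostics are passed through to the caller alongside a success.
pub fn gate(diagnostics: Vec<Diagnostic>) -> Result<Vec<Diagnostic>, PipelineError> {
    if diagnostics.iter().any(|d| d.severity == Severity::Error) {
        Err(PipelineError::Validation(diagnostics))
    } else {
        Ok(diagnostics)
    }
}
```

Returning the warnings inside `Ok` keeps them visible to reporting code while letting parse, transform, and export steps proceed.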

Decision 8: Spacing warnings W210/W211, disabled (RETIRED 2026-07-16)

Previous behaviour: Style-level spacing warnings around terminators and overlap markers.
Current behaviour: Checks removed from core main-tier validation path.
Implementation: check_spacing_warnings() invocation removed from talkbank-model/src/model/content/main_tier.rs.
Rationale: Generated unexpected diagnostics on files treated as valid in reference workflow.
Revisit: CLOSED. The codes were RETIRED outright on 2026-07-16 (maintainer ruling): real CLAN CHECK accepts the W210 construct (glued terminator), overlap markers hug their content by design so W211’s shape is valid CA notation, and no production code ever emitted either. The numbers are retired and not reused; no lint profile will reintroduce them. The living spacing rules are E243, E749, E750, E751, E757, and E758.

Validation Gap Roadmap

Concrete items where the grammar is lenient but no validation compensates. Each proposes a new error code and priority.

Priority 1: `@UTF8` Presence (E503), DONE

Grammar: @UTF8 is optional.
Spec: Required, must be the first line.
Implemented: E503 (MissingUTF8Header) added to check_headers() in talkbank-model/src/validation/header/structure.rs.
Severity: Error.
Note: All 340 reference corpus files contain @UTF8, so the new check has no roundtrip impact.

Priority 2: Pre-First-Utterance Header Order (proposed E534), Not a Gap

Grammar: choice() accepts headers in any order between @Begin and the first utterance.
Assessment: CLAN CHECK does not enforce any ordering for post-@Begin headers; it validates presence and format only. Our grammar’s flexible ordering matches CHECK’s behavior.
Status: Reclassified from Tier B (GAP) to Tier C (by design).

Priority 3: Content Type Context Validation, Not a Gap

Grammar: Unified base_content_item accepts any content type in any context.
Assessment: The unified rule is correct by design. Nested groups are legal CHAT (e.g., <the <dag> [: dog]> [= something]). The two specific semantic restrictions that do exist (no pauses in pho groups, E371; no nested quotations, E372) are already validated.
Status: Reclassified from Tier A (PARTIAL) to Tier C (by design).

Validation Profile Infrastructure

What Exists

`ValidationConfig` (`talkbank-model/src/errors/config.rs`)

Builder-pattern configuration for per-error-code severity overrides.

let config = ValidationConfig::new()
    .downgrade(ErrorCode::IllegalUntranscribed, Severity::Warning)
    .disable(ErrorCode::InvalidOverlapIndex)
    .upgrade(ErrorCode::UnknownAnnotation, Severity::Error);

API:

new(): empty config, all codes use original severity
downgrade(code, severity): lower severity (chainable)
disable(code): suppress entirely (chainable)
upgrade(code, severity): raise severity (chainable)
set_severity(code, Option<Severity>): set or disable (chainable)
effective_severity(code, original) -> Option<Severity>: query
is_disabled(code) -> bool: check

Pre-built profiles:

lenient(): Downgrades IllegalUntranscribed and InvalidOverlapIndex to Severity::Warning. Designed for gradual migration of legacy corpora.
strict(): escalates unmapped warnings to errors (sets upgrade_unmapped_warnings, honored by effective_severity). Explicit per-code overrides still take precedence, so a caller can opt a specific code back to Severity::Warning.
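
The precedence just described (explicit per-code override first, then the strict escalation of unmapped warnings, then the original severity) can be modelled in a few lines. This is a toy model, not the talkbank-model code: `ToyConfig` is hypothetical and uses plain string codes instead of `ErrorCode`.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Severity {
    Warning,
    Error,
}

/// Toy model of per-code overrides plus a strict escalation flag.
#[derive(Default)]
pub struct ToyConfig {
    /// `Some(severity)` overrides a code; `None` disables it entirely.
    overrides: HashMap<&'static str, Option<Severity>>,
    upgrade_unmapped_warnings: bool,
}

impl ToyConfig {
    pub fn strict() -> Self {
        ToyConfig { upgrade_unmapped_warnings: true, ..Default::default() }
    }

    /// Chainable setter: `None` disables the code.
    pub fn set_severity(mut self, code: &'static str, severity: Option<Severity>) -> Self {
        self.overrides.insert(code, severity);
        self
    }

    pub fn effective_severity(&self, code: &str, original: Severity) -> Option<Severity> {
        match self.overrides.get(code) {
            // An explicit override (including "disabled") always wins.
            Some(explicit) => *explicit,
            // Strict mode escalates warnings that have no explicit override.
            None if self.upgrade_unmapped_warnings && original == Severity::Warning => {
                Some(Severity::Error)
            }
            None => Some(original),
        }
    }
}
```

In this model, a strict config that explicitly sets one code to `Severity::Warning` keeps that code as a warning while every other warning is escalated, which matches the opt-back behaviour described for `strict()`.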

`ConfigurableErrorSink` (`talkbank-model/src/errors/configurable_sink.rs`)

Wrapper that intercepts errors and applies ValidationConfig before forwarding to an inner ErrorSink.

let inner = ErrorCollector::new();
let sink = ConfigurableErrorSink::new(&inner, config);
// Pass `sink` to the parser/validator: disabled errors are filtered
// and severity overrides are applied.

Runner-Level Flags (`talkbank-transform`, `chatter`)

Flag	Effect
`--skip-alignment`	Skip tier alignment validation
`--roundtrip`	Test serialization idempotency after validation
`--force`	Clear cache for path and revalidate
`--max-errors N`	Stop after N errors

What Is Missing

Gap	Description	Effort
No `--profile` CLI flag	Users cannot select `strict` / `lenient` / `lint` from the command line	Medium
`ConfigurableErrorSink` not wired into validation pipeline	Infrastructure exists but is not used by `chatter validate`	Medium
No profile serialization	Cannot load profiles from TOML/JSON config files	Medium
No corpus-specific profiles	E.g., HSLLD-specific rules	Future

Proposed Profiles

From the permissiveness regression log:

Profile	Purpose	Behaviour
`reference-compatible`	Current permissive baseline	Default, matches current validation behaviour
`strict-chat`	Full spec enforcement	Re-enable selected tightenings (E214, E248, etc.; E254 was retired 2026-07-15 with the @s ruling)

The roundtrip gate should be pinned to an agreed profile to prevent future ambiguity about what “pass” means.

Silent Recovery Points (NLP Pipelines)

An earlier Python-Rust boundary audit identified several places where batchalign-core silently massages data without diagnostics. These are related to leniency because they represent permissive acceptance without transparency.

Pipeline	Recovery Mechanism	Diagnostics?
Stanza morphosyntax	`retokenize.rs` DP alignment; `Word::new_unchecked` fallback	No
Whisper/Wave2Vec FA	`forced_alignment.rs` DP “best fit”	No
Google Translate	Imported verbatim into `%xtra`	No filtering
Stanza segmentation	Silent abort on assignment mismatch	No

Key infrastructure gap: ParseHealth exists in talkbank-model (per-utterance tier cleanliness flags with taint(), is_clean(), can_align_main_to_mor() methods). It is used by the tree-sitter and direct parsers during parsing. However, batchalign-core does not read, write, or propagate ParseHealth during any mutation (morphosyntax injection, FA injection, retokenisation). The infrastructure exists in the model layer but is not connected to the pipeline layer.

Cross-References

Source	What It Contains
Grammar governance analysis (archived)	Proposed this document; leniency matrix concept; three-tier classification
Permissiveness regression log (archived)	8 permissiveness regression decisions with rationale
Python-Rust boundary audit (archived)	Silent recovery points; ParseHealth gap; NLP pipeline audit
`grammar/grammar.js`	Inline comments on each leniency decision (line references in matrix above)
`talkbank-model/src/errors/config.rs`	`ValidationConfig` API
`talkbank-model/src/errors/configurable_sink.rs`	`ConfigurableErrorSink` adapter
`talkbank-model/src/validation/header/structure.rs`	Header validation: E501, E502, E503, E504-E533
`talkbank-model/src/validation/temporal.rs`	Temporal constraint checks (E701, E704); CA-mode skip
`talkbank-model/src/model/content/main_tier.rs`	Where W210/W211 were removed

Last updated: 2026-02-18

Error Diagnostics UX Standard

Status: Current Last modified: 2026-05-30 07:08 EDT

Workspace-wide standard for diagnostic shape, severity, recovery behavior, span correctness, and integrator output formats. Applies to the CHAT-core error system. Upstream batchalign-runtime errors follow the same shape and are documented separately in the batchalign3 project.

Objective

Make diagnostics precise, explainable, and actionable for both developers and non-technical editors, while keeping machine readability for downstream tools.

Open concerns

Message quality across the error catalog is not yet governed by one central style standard. Different error codes were authored at different times and converge unevenly on the message-quality guidance below.

Canonical Diagnostic Schema

Diagnostic {
  code: String,
  severity: Error | Warning | Info,
  category: Parse | Validation | Alignment | Header | Tier | Internal,
  location: SourceLocation,
  context: ErrorContext,
  message: String,
  suggestion: Option<String>,
  related: Vec<RelatedLocation>
}

Message Quality Standard

Each diagnostic must answer:

What failed.
Where it failed.
Why it likely failed.
What to do next.

Avoid internal jargon unless accompanied by user-facing explanation.

Severity Policy

Error: blocks parse/validation outcome.
Warning: content is usable but has quality/compliance concerns.
Info: optional guidance and migration hints.

Severity must not be overloaded for tooling convenience.

Recovery Policy: Diagnostic-First, Not Sentinel-First

When parser recovery is required:

do not invent semantic fallback values to keep type construction convenient,
do not use empty strings or arbitrary enum defaults as recovered content.

Instead:

Report a diagnostic with expected/actual node context.
Preserve span information for tooling and UI.
Propagate partial/failure status explicitly.

Any synthetic placeholders that are unavoidable for internal plumbing must be:

non-semantic (not exposed as real model content),
marked internal-only,
excluded from user-facing diagnostics and serialization.

Sentinel vs error-variant rule

If an unexpected condition changes semantic trust in parsed content:

Represent that explicitly as an error-bearing state (enum variant, parse-taint flag, or explicit outcome type).
Never represent it as None or a default payload that can be mistaken for valid content.

This applies both to parser outputs and to runtime metadata consumed during validation.

Diagnostic construction

Use shared constructors/helpers for common diagnostics to reduce drift:

span-only diagnostics (code + severity + span + message),
source-backed diagnostics (code + severity + span + source + offending + message).

Benefits: consistent location/context population, fewer ad-hoc ParseError::new(...) call shapes, simpler migration to richer miette rendering.

Error Code Governance

Central registry under talkbank-model (errors module).
One authoritative description and example per code.
Deprecated codes remain mapped with explicit migration notes.
CI check forbids duplicate code definitions or orphaned docs.

Span and Location Correctness

All diagnostics use consistent line/column and byte-offset definitions.
Golden tests cover:
- single-byte and multi-byte UTF-8 content,
- embedded content offsets,
- continuation lines and tabs.
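
The multi-byte cases matter because byte offsets, character columns, and UTF-16 columns diverge as soon as a line contains a bullet character or an emoji. The sketch below shows one possible conversion; the `Position` type, the 1-based conventions, and counting a tab as one column are assumptions, not the workspace's actual definitions.

```rust
/// 1-based line plus two column conventions for a byte offset.
#[derive(Debug, PartialEq)]
pub struct Position {
    pub line: usize,
    /// Column counted in Unicode scalar values.
    pub column_chars: usize,
    /// Column counted in UTF-16 code units (the unit LSP clients commonly use).
    pub column_utf16: usize,
}

/// Convert a byte offset into a position. Returns `None` when the offset is
/// out of range or falls inside a multi-byte character.
pub fn locate(source: &str, byte_offset: usize) -> Option<Position> {
    if !source.is_char_boundary(byte_offset) {
        return None;
    }
    let before = &source[..byte_offset];
    let line_start = before.rfind('\n').map_or(0, |i| i + 1);
    let prefix = &before[line_start..];
    Some(Position {
        line: before.matches('\n').count() + 1,
        column_chars: prefix.chars().count() + 1,
        column_utf16: prefix.encode_utf16().count() + 1,
    })
}
```

For a line such as `*CHI:	hello . •1000_2000•`, the digit after the first bullet sits at byte offset 17 but at column 16 in both conventions, because the bullet occupies three bytes and one column. A character outside the Basic Multilingual Plane makes the two column counts differ.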

Integrator Output Formats

Human-readable CLI diagnostics.
Machine-readable JSON diagnostics.
LSP diagnostic mapping.

All formats share the same underlying diagnostic schema.

Acceptance Criteria

Every emitted diagnostic includes code, severity, location, and suggestion policy.
Error code documentation and runtime definitions are synchronized automatically.
Span correctness is covered by dedicated tests.
CLI and JSON outputs are contract-tested for schema compliance.

Wide Struct Audit

Status: Current Last modified: 2026-05-29 22:34 EDT

A repository-wide audit rule for struct shape. Applies to the crates in TalkBank/chatter (model, parser, transform, CLI, CLAN, LSP, cache, and related tooling). The rule originated in the predecessor monorepo, but this page is scoped to the current repository rather than the old mixed CHAT+batchalign workspace.

A struct with many fields is not automatically wrong. The smell is:

many unrelated concerns packed into one value
several related booleans that act like implicit policy enums
repeated field-name prefixes that point to missing sub-structs
parallel vectors or stringly runtime fields
runtime code reaching into many unrelated fields of the same value

The repo therefore treats 10 or more named fields as an audit threshold, not as an automatic ban.

Design Rules

Treat 10 or more named fields as an audit trigger.
Treat 3 or more related boolean fields as a smell even below that threshold.
Boundary and transport records may stay wide when they mirror a real external shape.
Runtime coordination structs prefer named sub-structs over flat bags.
Replace parallel vectors with per-item records where possible.
If a wide struct stays wide, record the reason in the surrounding design docs, audit notes, or code review rather than letting it remain unexplained.

Refactor Examples

`ValidateDirectoryOptions` (chatter), was a flat bag

Used to be a flat bag of format, cache, traversal, roundtrip, parser, audit, and TUI flags. Now grouped by concern:

ValidationRules
ValidationExecution
ValidationTraversalMode
ValidationPresentation

This is the shape the audit wants for policy-rich CLI boundaries: one small top-level struct with explicit sub-objects and enums rather than a dozen flat fields.

`ParseHealth` (talkbank-model), was a ten-boolean state vector

Now stores taint as a compact tier bitset keyed by ParseHealthTier, the shape this audit expects for fixed domain sets.

flowchart LR
    tier["ParseHealthTier"] --> set["Tier health set"]
    set --> checks["Alignment safety checks"]
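
A minimal sketch of the bitset shape, assuming a small fixed set of tiers. The `Tier` variants and the `TierHealth` type are hypothetical; the real `ParseHealthTier` variants are not listed in this chapter. Only the method names `taint`, `is_clean`, and `can_align_main_to_mor` come from the text.

```rust
/// Hypothetical tier set used only for this illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum Tier {
    Main = 0,
    Mor = 1,
    Gra = 2,
    Pho = 3,
}

/// One bit per tier: a set bit means the tier is tainted.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct TierHealth {
    tainted: u16,
}

impl TierHealth {
    fn bit(tier: Tier) -> u16 {
        1 << (tier as u8)
    }

    pub fn taint(&mut self, tier: Tier) {
        self.tainted |= Self::bit(tier);
    }

    pub fn is_tainted(&self, tier: Tier) -> bool {
        self.tainted & Self::bit(tier) != 0
    }

    pub fn is_clean(&self) -> bool {
        self.tainted == 0
    }

    /// Alignment is safe only when both participating tiers are untainted.
    pub fn can_align_main_to_mor(&self) -> bool {
        !self.is_tainted(Tier::Main) && !self.is_tainted(Tier::Mor)
    }
}
```

Compared with ten separate booleans, adding a tier means adding an enum variant rather than a new field, and pairwise alignment checks stay one-line combinations of `is_tainted`.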

Open Hotspots

TUI state bags

Real state owners that still want grouping by concern (selection vs. progress vs. render flags vs. status):

crates/chatter/src/ui/validation_tui/state.rs TuiState

`Backend` (talkbank-lsp)

crates/talkbank-lsp/src/backend/state.rs is a service-root aggregate. Defensible, but still wants grouping such as document caches, parse caches, validation state, language services.

Metric structs

SpeakerEval/SpeakerKideval are acceptable as report records. If output renderers keep needing subsets (lexical metrics, morphosyntax metrics, error counts, derived scores), those records should eventually nest along those lines.

Audit Guardrail

There is currently no repo-local automated wide-struct lint in TalkBank/chatter. Treat this page as a manual review checklist and refactor trigger: when a type grows past the threshold, decide explicitly whether it is an acceptable boundary/schema aggregate or a real split target.

Spec Tooling and Generation Pipeline

Status: Current Last updated: 2026-05-19 17:38 EDT

Objective

Make spec/ the reliable language-contract source while keeping generation deterministic, maintainable, and appropriately scoped.

The goal is to separate:

grammar artifact generation
validation/error-doc generation
parser semantic testing (fragment and full-file)

Anything that still looks like bootstrap-era synthetic fragment orchestration is now audit-only unless a doc says it remains operational.

Open structural concerns

spec/tools still carries bootstrap-era Rust parser/model dependencies that create circular or awkward workflow coupling.
Contributor workflows still over-assume that make test-gen is the right reaction to every parser-related change.

Current Generation Pipeline

spec constructs/errors
  -> spec validators
  -> generated grammar corpus tests
  -> generated rust parser/validation tests
  -> generated error docs
  -> coverage dashboards and quality reports

That pipeline is still useful, but it is too broad to remain the single mental model for parser testing.

Desired Post-Bootstrap Split

grammar specs/templates
  -> generated tree-sitter corpus tests

error specs
  -> generated validation/parser error tests
  -> generated error docs

fragment semantic fixtures and invariants
  -> fragment-level parser tests

reference corpus / curated full files
  -> parser parity tests

Structural Reorganization for `spec/tools` (proposed, not yet implemented)

The intent here is to narrow spec/tools’s mission back to spec-driven artifact generation and validation rather than leaving it as a bootstrap-era staging ground for parser semantics. A proposed module split:

input (markdown/spec parsing)
ir (normalized internal representation)
emit (grammar tests, rust tests, docs)
validate (schema and semantic checks)
sync (grammar node-types and symbol-registry checks)

Current layout (crates: bin/, generated/, lib.rs, output/, spec/, templates/) has not been migrated to this shape. Treat this section as a design target for future work rather than a description of the current source tree.

Legacy vs Active

Keep these active:

grammar corpus generation
error doc generation
symbol registry sync/validation
affected regeneration when a spec or grammar input truly changed

Treat these as legacy audit paths:

synthetic tree-sitter fragment wrappers
bootstrap-era parser equivalence rituals

Determinism Requirements

Stable ordering of generated outputs.
Stable formatting of generated code/docs.
Re-runs without source changes produce no diffs.

Drift Prevention Controls

Node type compatibility check:
- spec/tools must compile and run against current generated node constants.
Registry compatibility check:
- all symbol categories used in specs and grammar must be known in registry.
Generation integration check:
- full generation pass with clean tree must produce zero diff.
Boundary check:
- generated grammar/docs flows should not silently become the sole authority for fragment parsing semantics.

Authoring Experience (proposed, not yet implemented)

Spec authoring would benefit from:

Strict but simple spec templates for constructs and errors.
A spec lint command for immediate feedback (missing fields, invalid tags, malformed examples, unknown error codes).
Clearer documentation of when make test-gen is actually needed and when a small direct test is the right answer instead.

The spec lint binary does not yet exist; the strict-validation work that exists today happens implicitly through make test-gen failures plus the spec validators in spec/tools/src/bin/.

Versioning and Metadata

Each spec file should include:

ownership,
status (draft, accepted, deprecated),
parser/validation scope,
linked tests and generated outputs.

Acceptance Criteria

spec/tools is green and deterministic.
Every generation target has explicit provenance from source specs.
Drift between node types, specs, and generators is blocked in CI.
Spec contributors have a documented and automated happy path.
Small grammar changes no longer force a giant regeneration ritual by default.
Fragment parsing semantics are tested outside the generation pipeline.

Symbol Registry Architecture

Status: Current Last modified: 2026-05-29 18:43 EDT

Purpose

spec/symbols/symbol_registry.json is the canonical source of token/symbol classes used by CHAT grammar tokenization policy.

Scope

The registry currently governs:

CA delimiter symbols,
CA element symbols,
word segment forbidden symbol classes,
event segment forbidden symbol classes.

Governance Rules

Symbol changes must be made only in spec/symbols/symbol_registry.json.
Registry must pass validation:
- node spec/symbols/validate_symbol_registry.js
Grammar symbol sets must be regenerated after any registry change:
- just symbols-gen
Generated files are read-only and must not be edited manually.

Determinism Requirements

Every category list in the registry must be lexicographically sorted.
Duplicate symbols are forbidden.
ca_delimiter_symbols and ca_element_symbols must be disjoint.

These constraints keep generated outputs stable and review diffs minimal.
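
The three constraints are easy to state as code. The sketch below is a Rust illustration only: the real validator is the Node script named above, and this version assumes the registry has been loaded into a map of category name to symbol list, that "sorted" means byte-wise string order, and that the duplicate rule applies within a category.

```rust
use std::collections::{BTreeMap, HashSet};

/// Returns a list of human-readable problems; empty means the registry passes.
pub fn check_registry(registry: &BTreeMap<String, Vec<String>>) -> Vec<String> {
    let mut problems = Vec::new();
    for (category, symbols) in registry {
        // Lexicographic order (byte-wise comparison is an assumption here).
        if symbols.windows(2).any(|pair| pair[0] > pair[1]) {
            problems.push(format!("{}: not sorted", category));
        }
        // No symbol may appear twice in one category.
        let unique: HashSet<&String> = symbols.iter().collect();
        if unique.len() != symbols.len() {
            problems.push(format!("{}: duplicate symbol", category));
        }
    }
    // The two CA categories must not share any symbol.
    if let (Some(delims), Some(elems)) = (
        registry.get("ca_delimiter_symbols"),
        registry.get("ca_element_symbols"),
    ) {
        let delimiters: HashSet<&String> = delims.iter().collect();
        if elems.iter().any(|symbol| delimiters.contains(symbol)) {
            problems.push("ca_delimiter_symbols and ca_element_symbols overlap".to_string());
        }
    }
    problems
}
```

Using a `BTreeMap` for the categories also makes the order of reported problems stable, in the same spirit as the determinism rules themselves.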

Consuming Outputs

Generated symbol constants are emitted to:

grammar/src/generated_symbol_sets.js
crates/talkbank-model/src/generated/symbol_sets.rs
spec/tools/src/generated/symbol_sets.rs

grammar/grammar.js imports from this generated module to avoid manual duplication of critical symbol policy.

Change Workflow

Edit registry JSON.
Run registry validation.
Run just symbols-gen.
Run grammar generation/tests.
Run parser equivalence tests.
Commit source + generated outputs together.

Auditability

Registry drift is caught by the checked-in generated artifacts plus the normal local verification sweep and CI checks, so symbol changes should land together with regenerated grammar and Rust outputs.

Bullet Validation

Status: Current Last updated: 2026-05-01 05:19 EDT

Media bullets are timestamps embedded in CHAT utterances that link transcript text to audio/video. They appear as •start_end• at the end of a main tier line (e.g., *CHI: hello . •1000_2000•). Validating that these timestamps are internally consistent is one of the more subtle parts of CHAT validation, because the “obvious” rules turn out to be wrong for multi-party conversation.
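To make the surface syntax concrete, the sketch below pulls the two millisecond values out of a trailing bullet. It assumes the delimiter is the • character exactly as displayed in this chapter, and parse_trailing_bullet is a hypothetical helper; the real toolchain reads bullets from the parsed tree, not with string operations.

```rust
/// Hypothetical helper: returns (start_ms, end_ms) for a line ending in •start_end•.
fn parse_trailing_bullet(line: &str) -> Option<(u64, u64)> {
    // Drop the closing delimiter, then find the opening one.
    let body = line.trim_end().strip_suffix('•')?;
    let open = body.rfind('•')?;
    let inner = &body[open + '•'.len_utf8()..];
    // The two timestamps are separated by a single underscore.
    let (start, end) = inner.split_once('_')?;
    Some((start.parse().ok()?, end.parse().ok()?))
}
```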

This chapter documents what CLAN CHECK does, where its implementation falls short of its own intent, and how chatter validate interprets and improves on that intent.

The three temporal checks

There are three distinct temporal constraints that can be checked on bullet timestamps. They differ in scope, severity, and whether they should run by default.

E701: Same-speaker start-time monotonicity (CLAN Error 83)

Rule: For each speaker, their utterances’ start times must be non-decreasing. If speaker CHI has utterance A starting at 10,000ms and utterance B (later in document order) starting at 8,000ms, that is an error: CHI’s timeline has gone backward.

Scope: Per-speaker. Cross-speaker non-monotonicity is allowed (see Why cross-speaker non-monotonicity is not an error).

Severity: Error.

E704: Same-speaker self-overlap (CLAN Error 133)

Rule: For each speaker, the current utterance’s start time must not be more than 500ms before the same speaker’s previous utterance’s end time. In other words, a speaker cannot overlap with themselves by more than 500ms.

Scope: Per-speaker. The 500ms tolerance accounts for annotation rounding and minor timing imprecision at boundaries.

Severity: Error.

E729: Cross-speaker overlap (CLAN Error 84)

Rule: The current utterance’s start time must not be before the previous utterance’s (any speaker) end time. This checks for any temporal overlap between adjacent utterances, regardless of speaker.

Scope: Global (cross-speaker). Only fires with CLAN’s +c0 flag.

Severity: Warning. Not part of default validation.

This check is part of CLAN’s “strict timeline contiguity” mode, which requires that every utterance’s start time equal the previous utterance’s end time: no gaps (Error 85) and no overlaps (Error 84). It is designed for a very specific use case: verifying that audio has been exhaustively and non-redundantly segmented. In normal conversational transcripts, cross-speaker overlap is ubiquitous, so this check would be absurd as a default.

What CLAN CHECK does

CLAN CHECK implements bullet validation in the function check_checkBulletsConsist() in check.cpp. Understanding its implementation is essential because it has several accidental behaviors that affect the error counts users see.

The snapshot-and-compare pattern

The function uses a global pair (check_SNDBeg, check_SNDEnd) to hold the “current” bullet timing, and saves the previous values into local variables (tBegTime, tEndTime) at the start of each call. The comparison flow is:

1. Save previous: tBegTime = check_SNDBeg, tEndTime = check_SNDEnd
2. Parse new bullet into check_SNDBeg, check_SNDEnd
3. Check error 83: check_SNDBeg < tBegTime?           (cross-speaker comparison)
4. Check error 133: speaker's last END - check_SNDBeg > 500?  (same-speaker)
5. If +c0 mode: check error 84 (overlap) and error 85 (gap)
6. Update speaker's last END time via check_setLastTime()

The early-return shadowing bug

The critical implementation detail is that error 83 fires via return(83) at step 3. This causes the function to exit immediately, skipping steps 4 through 6. Two consequences follow:

Error 83 shadows error 133. An utterance that triggers error 83 (global non-monotonicity) can never also trigger error 133 (same-speaker overlap) in the same call, even if both conditions are true. This is not intentional; it is an artifact of C-style early-return control flow.
Speaker state goes stale. Step 6 (check_setLastTime) updates the speaker’s per-speaker tracking in the SPLIST linked list. When error 83 fires, this update is skipped. All subsequent error-133 checks for that speaker compare against a stale endTime value, causing cascading state corruption that suppresses legitimate error 133 reports.
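The two consequences can be reproduced with a small model of the control flow. The sketch below is Rust written for this chapter, not CLAN source: ClanLikeState and clan_like_check are hypothetical names, step 5 (+c0 mode) is omitted, and the per-speaker list is modeled as a HashMap.

```rust
use std::collections::HashMap;

/// Simplified model of the state described above (not CLAN source).
#[derive(Default)]
struct ClanLikeState {
    prev_beg: i64,                  // stands in for the global check_SNDBeg
    last_end: HashMap<String, i64>, // stands in for the per-speaker END times
}

fn clan_like_check(state: &mut ClanLikeState, speaker: &str, beg: i64, end: i64) -> Option<u32> {
    let t_beg = state.prev_beg; // step 1: save previous
    state.prev_beg = beg;       // step 2: store the new bullet
    if beg < t_beg {
        return Some(83);        // step 3: early return skips steps 4 to 6
    }
    if let Some(&last) = state.last_end.get(speaker) {
        if last - beg > 500 {
            return Some(133);   // step 4: same-speaker overlap
        }
    }
    state.last_end.insert(speaker.to_string(), end); // step 6: only when no error fired
    None
}
```

Feeding it CHI at 10000_20000 followed by CHI at 9000_9500 yields error 83 for the second utterance. The self-overlap of 11,000ms is never reported, and CHI's stored end time stays at 20000.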

Error 83 is global, not per-speaker

CLAN fires error 83 by comparing the current utterance’s start time against the previous utterance’s start time, regardless of speaker. In a multi-party conversation:

*PIL: something . •100000_102000•
*UEL: response .  •99500_101000•      ← Error 83: 99500 < 100000

This fires error 83 because UEL’s start time (99,500ms) is before PIL’s start time (100,000ms). But this is just two people talking at the same time, which is normal conversational overlap. The [>] and [<] markers in CHAT explicitly annotate this as intentional simultaneous speech.

In files with many speakers (the Koine/bre corpus has 7-9 speakers per file, including children talking over each other), this fires on a huge fraction of utterances. CLAN’s accidental shadowing partially masks the problem by suppressing downstream error-133 reports when error 83 fires.

Why cross-speaker non-monotonicity is not an error

Consider a classroom recording with a teacher (PIL) and seven children. The teacher asks a question, and three children answer simultaneously:

*PIL: qué es esto ?        •50000_52000•
*UEL: un coche .           •51200_52500•   ← started during PIL's question
*MAR: coches .             •51000_51800•   ← started even earlier
*REN: es un coche grande . •51500_53000•   ← started while UEL and MAR were still speaking

In document order, the start times are: 50000, 51200, 51000, 51500. This is non-monotonic (51000 < 51200), but there is nothing wrong with this data. The children are simply talking at the same time. No amount of reordering the utterances in the file would make all start times monotonically increasing while preserving the speaker-turn structure.

Cross-speaker non-monotonicity is an inherent property of multi-party conversation, not a data error. Flagging it as an error produces thousands of false positives on any corpus with overlapping speech.
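The classroom example can be run through both interpretations of the rule. This is a minimal sketch: global_backsteps, per_speaker_backsteps, and the CLASSROOM constant are hypothetical names, and only start times are modeled.

```rust
use std::collections::HashMap;

/// Counts utterances whose start precedes the previous utterance's start (any speaker).
fn global_backsteps(utts: &[(&str, u64)]) -> usize {
    utts.windows(2).filter(|w| w[1].1 < w[0].1).count()
}

/// Counts utterances whose start precedes the same speaker's previous start.
fn per_speaker_backsteps(utts: &[(&str, u64)]) -> usize {
    let mut last_start: HashMap<&str, u64> = HashMap::new();
    let mut count = 0;
    for &(speaker, start) in utts {
        // insert() returns the speaker's previous start time, if any.
        if let Some(prev) = last_start.insert(speaker, start) {
            if start < prev {
                count += 1;
            }
        }
    }
    count
}

/// The classroom example: (speaker, start_ms) in document order.
const CLASSROOM: [(&str, u64); 4] =
    [("PIL", 50_000), ("UEL", 51_200), ("MAR", 51_000), ("REN", 51_500)];
```

The global count flags MAR's utterance, while the per-speaker count flags nothing, because no speaker's own timeline moves backward.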

When IS non-monotonic start time an error?

Same-speaker non-monotonicity IS an error. If CHI speaks at 10,000ms, then later in the file CHI speaks again at 8,000ms, CHI’s timeline has gone backward. This almost certainly indicates a transcription or alignment mistake.

The test is simple: within the same speaker’s utterance sequence, start times must be non-decreasing. This is what chatter validate checks for E701.

How chatter validate implements bullet validation

E701: Per-speaker monotonicity (not global)

chatter validate tracks each speaker’s last start time in a HashMap. E701 only fires when the same speaker’s start time goes backward. Cross-speaker non-monotonicity is silently accepted.

This is an intentional semantic divergence from CLAN CHECK, which fires error 83 globally. We believe CLAN’s global check reflects the implementation (comparing against a single global tBegTime) rather than the intent (detecting disordered timestamps). The per-speaker version matches the intent without drowning users in false positives from normal conversational overlap.

E704: Per-speaker overlap with 500ms tolerance

chatter validate tracks each speaker’s last end time in a HashMap. E704 fires when the overlap exceeds 500ms (same threshold as CLAN Error 133).

Unlike CLAN, E704 runs independently of E701. An utterance can trigger both errors if it is both non-monotonic (E701) and self-overlapping (E704). CLAN’s early-return pattern prevents error 133 from firing when error 83 fires, which is a bug, not a feature.

Speaker state is always updated regardless of whether errors fire. This avoids the cascading state corruption that CLAN’s implementation suffers from.

E729: Not in default validation

E729 (CLAN Error 84, cross-speaker overlap) is implemented but not called during default validation. It exists for future use in a strict-bullet mode equivalent to CLAN’s +c0 flag.

Untranscribed utterances are skipped

Utterances containing only untranscribed markers (www, xxx, yyy) are skipped for E704 checks. These utterances often carry broad segment bullets (covering a long span of background speech) that would create false self-overlap reports. This matches CLAN CHECK’s behavior, where untranscribed tiers do not contribute to timing comparisons.

CA mode disables all temporal checks

When the file header includes @Options: CA, all temporal validation is skipped. Conversation Analysis mode intentionally relaxes timing constraints because CA transcription conventions use overlapping and non-sequential timing as part of the analytic notation.

Comparison: CLAN CHECK vs chatter validate

The following table summarizes the behavioral differences:

┌────────────────────────────┬──────────────┬─────────────────┐
│ Behavior                   │ CLAN CHECK   │ chatter validate│
├────────────────────────────┼──────────────┼─────────────────┤
│ Error 83 / E701 scope      │ Global       │ Per-speaker     │
│ Error 133 / E704 scope     │ Per-speaker  │ Per-speaker     │
│ Error 84 / E729 default    │ Off (+c0)    │ Off             │
│ 83 shadows 133             │ Yes (bug)    │ No              │
│ 83 corrupts speaker state  │ Yes (bug)    │ No              │
│ E701 + E704 independent    │ No           │ Yes             │
│ Speaker state always fresh │ No           │ Yes             │
│ Untranscribed skipped      │ Implicit     │ Explicit        │
│ CA mode bypass             │ Yes          │ Yes             │
│ 500ms tolerance (E704)     │ Yes          │ Yes             │
└────────────────────────────┴──────────────┴─────────────────┘

Expected count differences

On multi-party files with overlapping speech:

E701 count will be lower than CLAN’s error 83 count. CLAN fires error 83 on cross-speaker non-monotonicity; we don’t. The difference represents legitimate conversational overlap that we intentionally do not flag.
E704 count will be higher than CLAN’s error 133 count. CLAN’s early-return shadowing prevents error 133 from firing when error 83 fires, and the stale speaker state causes further suppression. Our correctly maintained per-speaker tracking reports all genuine self-overlaps.

On single-speaker files or files with minimal overlap, the counts should be very close or identical.

Implementation details

The implementation lives in crates/talkbank-model/src/validation/temporal.rs.

Data flow

flowchart TD
    A["collect_bullets(file)\n(temporal.rs:101)"] -->|"Vec&lt;BulletInfo&gt;"| B
    B["validate_global_timeline()\n(temporal.rs:169)"] -->|"Per-speaker HashMap"| C["E701 errors"]
    A -->|"Vec&lt;BulletInfo&gt;"| D
    D["validate_speaker_timelines()\n(temporal.rs:212)"] -->|"Per-speaker HashMap"| E["E704 errors"]

BulletInfo

Each utterance with a bullet produces a BulletInfo containing:

utterance_idx: 0-based index in the file
speaker: the speaker code (e.g., "CHI", "PIL")
bullet: the Bullet struct with start_ms and end_ms
has_timeable_content: whether the utterance contains transcribed words (used to skip untranscribed-only turns for E704)

Only main speaker tiers are collected. Dependent tiers (%mor, %gra, etc.) are excluded.

Per-speaker tracking

Both E701 and E704 use HashMap<&str, ...> keyed by speaker code:

E701: stores (utterance_idx, start_ms), the speaker’s most recent start time
E704: stores (utterance_idx, end_ms), the speaker’s most recent end time

State is always updated after processing each bullet, regardless of whether an error was reported. This ensures clean tracking for subsequent comparisons.
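The tracking scheme can be sketched in one pass. This is an illustration under stated assumptions, not the code in temporal.rs: BulletInfo is flattened (start_ms and end_ms sit directly on it), the maps store only the time value, both checks share one loop, and untranscribed-only utterances are assumed not to update the end-time map.

```rust
use std::collections::HashMap;

/// Simplified stand-in for BulletInfo (field names follow the list above).
struct BulletInfo<'a> {
    utterance_idx: usize,
    speaker: &'a str,
    start_ms: u64,
    end_ms: u64,
    has_timeable_content: bool,
}

const SELF_OVERLAP_TOLERANCE_MS: u64 = 500;

/// Returns (code, utterance_idx) findings; both checks run on every bullet.
fn check_speaker_timelines(bullets: &[BulletInfo<'_>]) -> Vec<(&'static str, usize)> {
    let mut last_start: HashMap<&str, u64> = HashMap::new();
    let mut last_end: HashMap<&str, u64> = HashMap::new();
    let mut findings = Vec::new();
    for b in bullets {
        // E701: the same speaker's start time went backward.
        if let Some(&prev_start) = last_start.get(b.speaker) {
            if b.start_ms < prev_start {
                findings.push(("E701", b.utterance_idx));
            }
        }
        last_start.insert(b.speaker, b.start_ms); // updated even after an error

        // E704 is skipped for untranscribed-only utterances (assumed: no state update).
        if !b.has_timeable_content {
            continue;
        }
        if let Some(&prev_end) = last_end.get(b.speaker) {
            if prev_end > b.start_ms + SELF_OVERLAP_TOLERANCE_MS {
                findings.push(("E704", b.utterance_idx));
            }
        }
        last_end.insert(b.speaker, b.end_ms); // updated even after an error
    }
    findings
}
```

Unlike the early-return flow in CLAN, one utterance can produce both findings here, and an intervening speaker with an earlier start time produces none.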

CLAN source reference

For readers who want to trace the CLAN implementation:

Function: check_checkBulletsConsist() in OSX-CLAN/src/clan/check.cpp, lines 3849-3967
Error 83: lines 3883-3890 (early return(83))
Error 133: lines 3892-3895 (only reached if error 83 did not fire)
Speaker state update: line 3909 (check_setLastTime), only reached if no error fired
Per-speaker tracking: SPLIST linked list, lookup via check_getLatTime() / check_setLastTime()
+c0 mode: checkBullets flag, set via +c0 command-line option (line 5920), guards errors 84/85 at lines 3897 and 3953
Call site: check_ParseWords() line 4801, guarded by utterance->speaker[0] == '*' (main tiers only)

CA Terminator Resolution

Status: Current Last updated: 2026-05-05 12:23 EDT

How CA markers are split between separators and linkers in the parser/model.

Current rule

The parser/model no longer promotes CA markers into utterance terminators.

The supported split is:

Standard utterance terminators remain the CHAT terminators such as . ? ! +... +/. and related final punctuation tokens.
CA intonation arrows (⇗ ↗ → ↘ ⇘) stay Separator content items.
CA TCU markers (≈ ≋) stay Separator content items.
CA TCU linker forms (+≈ +≋) stay Linker items.

This means a trailing →, ≈, or ≋ remains in main-tier content rather than being retyped as Terminator.

Parser/model consequences

Tree-sitter grammar keeps arrows and ≈/≋ on the separator path.
The tree parser converts those nodes directly into Separator variants.
The re2c parser classifies ≈/≋ as separators and +≈/+≋ as linkers.
The old post-hoc resolve_ca_terminator() promotion pass was removed.
Terminator::try_from_chat_str() intentionally rejects CA arrows, ≈, ≋, +≈, and +≋.

Data Model

The active surface split is:

Kind	CHAT tokens
`Terminator`	`.` `?` `!` `+...` `+/.` `+//.` `+/?` `+!?` `+"/.` `+".` `+//?` `+..?` `+.`
`Separator`	`⇗` `↗` `→` `↘` `⇘` `≈` `≋` plus the other CA/content separators
`Linker`	`+≈` `+≋` plus the other utterance linkers

Legacy CA-only Terminator variants still exist in the type for backward compatibility with older serialized data, but new parser/classifier code does not construct them from CHAT text.

Regression coverage

The regression surface for this split is:

ca_symbols_are_not_chat_terminators in talkbank-model
trailing_ca_arrow_stays_separator in talkbank-parser
trailing_ca_no_break_stays_separator in talkbank-parser
trailing_ca_technical_break_stays_separator in talkbank-parser

Validation Cache

Status: Current Last modified: 2026-06-22 06:48 EDT

The CHAT-core validation cache is used by chatter validate and the LSP server. It is distinct from the audio-task cache used by upstream batchalign3 for FA / UTR ASR / media conversion (documented separately in that project): this cache stores parse + validate results keyed by file path + options.

Location: crates/talkbank-cache/.

Architecture

flowchart TD
    req["Validation request\n(path + options)"]
    key["Cache key\n(path_hash + RulesVersion + check_alignment + parser_kind)"]
    db["SQLite WAL\n~/.cache/talkbank-chat/\ntalkbank-cache.db"]
    hit["Cache hit\n→ return stored result"]
    miss["Cache miss\n→ parse + validate + store"]

    req --> key --> db
    db -->|"found + RulesVersion match + content_hash match"| hit
    db -->|"not found, rules changed, or content edited"| miss
    miss --> db

Configuration

Config	Value	Why
Backend	SQLite via `sqlx`	Concurrent reads (WAL), atomic writes, zero-config
Pool size	16 connections	Matches validation worker count
`mmap`	256 MB	Fast random access for 95k+ entries
Invalidation	Rules-version field + content hash + 30-day TTL	Rule-set or schema changes auto-invalidate; content edits invalidate per-file; stale entries pruned
Bridge	Embedded single-threaded tokio runtime	Sync workers call `rt.block_on()` for async SQLite

Schema

file_cache table (see crates/talkbank-cache/migrations/20260101000000_initial.sql):

Column	Role
`path_hash`	BLAKE3 hash of the resolved path (part of the lookup key)
`file_path`	Resolved file path, indexed for path-based maintenance ops
`content_hash`	Hash of the file content; mismatch invalidates the entry
`version`	Cache-compatibility version (`RulesVersion`): the cache crate version folded together with a fingerprint of the active validation rule set. A mismatch invalidates the entry
`cached_at`	Insertion timestamp
`check_alignment`	Whether alignment validation was requested
`is_valid`	Cached validation outcome (0/1)
`roundtrip_tested`	Whether roundtrip equivalence was checked
`roundtrip_passed`	Roundtrip result when tested
`parser_kind`	Parser backend (tree-sitter or re2c)

The lookup key is the compound unique index (path_hash, version, check_alignment, parser_kind); file_path is a secondary index used by maintenance operations (orphan pruning, etc.).
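The lookup can be pictured with an in-memory map in place of SQLite. This is a sketch only: CacheKey, CachedRow, and lookup are hypothetical types, the 32-byte hash width and String columns are assumptions, and the real crate issues SQL through sqlx.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory mirror of the compound lookup key.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct CacheKey {
    path_hash: [u8; 32],
    version: String, // RulesVersion
    check_alignment: bool,
    parser_kind: String,
}

/// Hypothetical subset of a row needed to decide hit or miss.
struct CachedRow {
    content_hash: [u8; 32],
    is_valid: bool,
}

/// A row found under the key is served only if the file content is unchanged.
fn lookup(
    table: &HashMap<CacheKey, CachedRow>,
    key: &CacheKey,
    current_content_hash: &[u8; 32],
) -> Option<bool> {
    table
        .get(key)
        .filter(|row| &row.content_hash == current_content_hash)
        .map(|row| row.is_valid)
}
```

Because version is part of the key, a query carrying a new RulesVersion simply finds no row; the old row is not overwritten.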

Database location

Platform	Path
macOS	`~/Library/Caches/talkbank-chat/talkbank-cache.db`
Linux	`~/.cache/talkbank-chat/talkbank-cache.db`
Windows	`%LocalAppData%\talkbank-chat\talkbank-cache.db`

Invalidation

Validation-rule changes: the version column holds a RulesVersion, which folds the talkbank-cache crate version together with a fingerprint of the active validation rule set (an FNV-1a hash over every ErrorCode the validator can emit, via talkbank_model::validation_rules_fingerprint). Adding, removing, or renaming a rule (for example introducing error code E370, “retrace marker must be followed by material”) changes the fingerprint, hence the RulesVersion, hence the lookup key, so verdicts cached under the old rule set become a cache MISS and are re-validated instead of served stale. This is the mechanism that keeps chatter validate (the authority on CHAT validity) from returning a stale “Valid” after the rules tighten. The stale rows stay on disk under their old version for selective re-testing; they are simply never served to a query carrying the new version.
Content changes: each entry stores the file’s content_hash; a mismatch is a per-file miss.
Time-based: entries older than 30 days are pruned.
Manual: pass --force to bypass cache lookups for a particular validation run.

Per repository policy, do not delete the cache directory without explicit request. Use --force when you want fresh validation for specific paths without destroying the whole cache.
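The rule-set fingerprint used for validation-rule invalidation can be illustrated with a standard FNV-1a hash. The sketch below assumes the 64-bit variant and a NUL separator between codes; how talkbank_model::validation_rules_fingerprint actually serializes and orders the ErrorCode values is not shown in this chapter, so rules_fingerprint is a hypothetical stand-in.

```rust
const FNV_OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// Standard 64-bit FNV-1a over a byte slice, continuing from `state`.
fn fnv1a_update(mut state: u64, bytes: &[u8]) -> u64 {
    for &byte in bytes {
        state ^= u64::from(byte);
        state = state.wrapping_mul(FNV_PRIME);
    }
    state
}

/// Hypothetical fingerprint: hashes each code followed by a NUL separator.
fn rules_fingerprint(codes: &[&str]) -> u64 {
    codes.iter().fold(FNV_OFFSET_BASIS, |state, code| {
        fnv1a_update(fnv1a_update(state, code.as_bytes()), &[0])
    })
}
```

Adding a code such as E370 to the list changes the resulting value, which is what turns previously cached verdicts into misses.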

Alignment

Status: Current Last modified: 2026-06-15 15:00 EDT

Alignment in the toolchain operates at two structural layers, plus a separate overlap-marker pass. Tier alignment is structural (counting and pairing AST nodes); word extraction is positional (domain-ordered token indices).

Layer	Where	Purpose
Tier alignment	`talkbank-model::alignment`	1:1 mapping between main tier and dependent tiers (`%mor`, `%pho`, `%wor`, `%sin`, `%gra`)
Word extraction	`talkbank-transform::extract`	Pull NLP-ready words from the AST in domain order

Tier Alignment

Validates that dependent tiers have the correct number and arrangement of items relative to the main tier. Lives in crates/talkbank-model/src/alignment/.

TierDomain

enum TierDomain { Mor, Pho, Sin, Wor }

The same utterance produces different counts per domain:

Rule	Mor	Pho	Sin	Wor
Skip retrace groups	Yes	No	No	No
Count pauses	No	Yes	No	No
PhoGroup	Recurse	Atomic (1)	Skip (0)	Recurse
SinGroup	Recurse	Skip (0)	Atomic (1)	Recurse
Include fragments (`&+`)	No	Yes	Yes	No
Include nonwords (`&~`)	No	Yes	Yes	No
Include fillers (`&-`)	No	Yes	Yes	Yes
Include untranscribed	No	Yes	Yes	No
Include tag-marker separators	Yes	No	No	No
`ReplacedWord` aligns to	Replacement	Original	Original	Original

For the underlying word filter (counts_for_tier, should_skip_group), the content walker, and the ChatFile model itself, see CHAT Data Model. The walker plus the domain table together govern every tier-alignment count.

Retrace handling, alignment-critical

Retraces are the most alignment-critical content type. A Retrace node wraps content the speaker said then corrected.

Mor: skip entirely (count 0). The retrace was a false start; only the correction carries morphological analysis.
Pho, Sin: recurse; the words were physically produced and have phonological / gestural data.
Wor: recurse; retrace ancestry does not change %wor membership.

Critical invariant: the parser must emit UtteranceContent::Retrace for all retrace patterns, including single-word retraces with replacements (word [: repl] [* err] [//]). If a retrace is accidentally emitted as a bare ReplacedWord, it counts for %mor alignment, causing false E705 errors. Enforced by tests/retrace_replaced_word_regression.rs. Full data model + parsing pipeline + CHAT examples in Retraces and Repetitions.
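A toy walker shows how a few rows of the domain table, including the retrace rule, turn into counts. This is a deliberately reduced sketch: Item is a hypothetical five-variant stand-in for UtteranceContent, and count_positions is not the real counts_for_tier logic.

```rust
#[derive(Clone, Copy, PartialEq)]
enum TierDomain { Mor, Pho, Sin, Wor }

/// Drastically simplified content model (the real enum has many more variants).
enum Item {
    Word,
    Pause,
    Filler,   // &-
    Fragment, // &+
    Retrace(Vec<Item>),
}

/// Counts alignable positions for one domain, following the table above.
fn count_positions(items: &[Item], domain: TierDomain) -> usize {
    use TierDomain::*;
    items
        .iter()
        .map(|item| match item {
            Item::Word => 1,
            Item::Pause => usize::from(domain == Pho),
            Item::Filler => usize::from(domain != Mor),
            Item::Fragment => usize::from(matches!(domain, Pho | Sin)),
            // Mor skips retraced material; the other domains recurse into it.
            Item::Retrace(inner) => {
                if domain == Mor { 0 } else { count_positions(inner, domain) }
            }
        })
        .sum()
}
```

For a retraced word, a filler, a word, a pause, and a final word, this gives 2 positions for Mor, 5 for Pho, and 4 each for Sin and Wor.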

AlignmentPair

struct AlignmentPair {
    source_index: Option<usize>,
    target_index: Option<usize>,
}

Universal index-pair primitive. Some/Some means matched; a pair with one None is an insertion or deletion placeholder used for mismatch diagnostics. is_complete() returns true when both indices are Some; is_placeholder() returns true for an unmatched pair.
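A positional pairing over two counts shows how placeholders arise. This is a sketch, not the crate's positional_align(): pair_by_position is a hypothetical helper, and is_placeholder() is assumed to be the negation of is_complete().

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct AlignmentPair {
    source_index: Option<usize>,
    target_index: Option<usize>,
}

impl AlignmentPair {
    fn is_complete(&self) -> bool {
        self.source_index.is_some() && self.target_index.is_some()
    }
    fn is_placeholder(&self) -> bool {
        !self.is_complete()
    }
}

/// Hypothetical helper: pairs positions one-to-one and pads the shorter side with None.
fn pair_by_position(source_len: usize, target_len: usize) -> Vec<AlignmentPair> {
    (0..source_len.max(target_len))
        .map(|i| AlignmentPair {
            source_index: (i < source_len).then_some(i),
            target_index: (i < target_len).then_some(i),
        })
        .collect()
}
```

With three main-tier positions and two dependent-tier tokens, the third pair has a source index but no target, which is the shape a mismatch diagnostic can point at.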

Per-domain results

Type	Function	Source → Target
`MorAlignment`	`align_main_to_mor()`	Main → `%mor` items
`PhoAlignment`	`align_main_to_pho()`	Main → `%pho` tokens
`SinAlignment`	`align_main_to_sin()`	Main → `%sin` tokens
`WorAlignment`	`align_main_to_wor()`	Main → `%wor` tokens
`GraAlignment`	`align_mor_to_gra()`	`%mor` chunks → `%gra` relations

%gra aligns to %mor chunks, not items. Clitics create additional chunks (pro|it~v|be&PRES = 2 chunks: main + post-clitic).
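The chunk arithmetic can be checked on the serialized form. The sketch assumes that ~ and $ are the only clitic joiners in a %mor item; mor_chunk_count is a hypothetical helper, and the real alignment counts chunks on the typed MorTier, never on text.

```rust
/// Hypothetical helper: number of %gra-alignable chunks in one %mor item,
/// assuming `~` and `$` are the only clitic joiners.
fn mor_chunk_count(mor_item: &str) -> usize {
    mor_item
        .split(|c: char| c == '~' || c == '$')
        .filter(|part| !part.is_empty())
        .count()
}
```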

Trait abstractions

Trait	Purpose	Implementors
`IndexPair`	`source()`/`target()` on any pair type	`AlignmentPair`, `GraAlignmentPair`
`TierAlignmentResult`	`pairs()`/`errors()`/`push_*()` accumulator	All 5 alignment result types
`AlignableTier`	What a tier provides for generic alignment	`PhoTier`, `SinTier`, `WorTier`
`TierCountable`	`count_tier_positions()` / `collect_tier_items()` methods	`[UtteranceContent]`

The generic positional_align() function uses AlignableTier to eliminate duplication: align_main_to_{pho,sin,wor}() are thin wrappers around it. %mor doesn’t use it (additional terminator validation logic). %gra doesn’t use it (source is MorTier, not MainTier). WorTier overrides mismatch_format() to Diff (LCS) since both sides are word sequences; the others use Positional.

`%wor` is not validated

%wor is a timing-annotation tier. There is no downstream positional indexing into %wor, and validate_alignments() does not check %wor word count against the main tier. Old corpus files may have xxx, fragments, or nonwords in %wor (pre-2026-04 behavior) without producing false errors.

Phon tier-to-tier alignment

A second class of alignment that operates between dependent tiers:

Source	Target	Code
`%modsyl`	`%mod`	E725
`%phosyl`	`%pho`	E726
`%phoaln`	`%mod`	E727
`%phoaln`	`%pho`	E728

Derived-view alignments: %modsyl is a syllabified reannotation of %mod, %phosyl of %pho, %phoaln aligns both. Word counts must match between source and target. Computed in compute_alignments() after the main-tier alignments. build_tier_to_tier_alignment() constructs index pairs and emits build_count_mismatch_error() when counts disagree. %phoaln checks against both %mod and %pho, potentially emitting E727 and E728 simultaneously.

Known data issue: Phon XML source data has orthography↔IPA word count discrepancies in ~4% of files (518 / 12,340). Expected in child phonology data. The PhonTalk converter handles this inconsistently: %mod/%pho are truncated to match orthography via OneToOne, but %xmodsyl/%xphosyl/%xphoaln are written from raw IPATranscript, exposing the full IPA word count. Result: E725-E728 mismatches.

Parse-health gating

Alignment diagnostics honor ParseHealth metadata. If a dependent tier’s domain is parse-tainted, mismatch errors for that domain pair are suppressed. Main-tier taint blocks all main→dependent alignments. Dependent-tier taint blocks only that tier. Phon tier-to-tier checks have their own gates (can_align_modsyl_to_mod, can_align_phosyl_to_pho, can_align_phoaln).

Word Extraction

extract_words() (in crates/talkbank-transform/src/extract.rs) uses the content walker to pull words from the AST in domain-specific order. Returns Vec<ExtractedWord> with text, word_index, is_separator, special_form. Tag-marker separators (, „ ‡) are included as words in Mor domain because they have %mor items (cm|cm, end|end, beg|beg).

Overlap Marker Iteration

CA overlap markers (⌈⌉⌊⌋) appear at three content levels: UtteranceContent (top-level), BracketedItem (inside groups), and WordContent (intra-word, as in butt⌈er⌉). Two APIs live in talkbank-model/src/alignment/helpers/overlap.rs:

`walk_overlap_points`, low-level

Visits every OverlapPoint in document order with word-position context. Analogous to walk_words but for overlap markers:

walk_overlap_points(&utterance.main.content.content.0, &mut |visit| {
    // visit.point: &OverlapPoint (kind + optional index)
    // visit.word_position: usize (alignable words seen so far)
});

`extract_overlap_info`, region-based

Pairs markers by (kind, index) into OverlapRegion structs. Each region represents a matched ⌈…⌉ or ⌊…⌋ pair. Index-aware: ⌈2...⌉2 forms a separate region from ⌈...⌉. Mismatched indices leave markers unpaired. Onset-only ⌈ (without ⌉) is a legitimate CA convention: the region has end_at_word = None and is_well_paired() = false, but top_onset_fraction() still works.
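The pairing rule can be sketched over a flat marker list. Everything here is a simplified stand-in: Side, Marker, Region, and pair_regions are hypothetical, and pairing each opener with the next unused closer of the same (side, index) is an assumption about strategy, not a description of extract_overlap_info.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Side { Top, Bottom }

/// Simplified marker: ⌈ and ⌊ open a region, ⌉ and ⌋ close one.
#[derive(Debug, Clone, Copy)]
struct Marker {
    side: Side,
    opens: bool,
    index: Option<u8>, // e.g. Some(2) for ⌈2
    word_position: usize,
}

#[derive(Debug, PartialEq, Eq)]
struct Region {
    side: Side,
    index: Option<u8>,
    begin_at_word: usize,
    end_at_word: Option<usize>, // None for an onset-only region
}

/// Pairs each opening marker with the next unused closer of the same (side, index).
fn pair_regions(markers: &[Marker]) -> Vec<Region> {
    let mut used = vec![false; markers.len()];
    let mut regions = Vec::new();
    for (i, open) in markers.iter().enumerate().filter(|(_, m)| m.opens) {
        let close = (i + 1..markers.len()).find(|&j| {
            let m = &markers[j];
            !used[j] && !m.opens && m.side == open.side && m.index == open.index
        });
        if let Some(j) = close {
            used[j] = true;
        }
        regions.push(Region {
            side: open.side,
            index: open.index,
            begin_at_word: open.word_position,
            end_at_word: close.map(|j| markers[j].word_position),
        });
    }
    regions
}
```

An unindexed ⌈…⌉ pair followed by a lone ⌈2 yields two regions: one closed, and one onset-only region whose end_at_word is None.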

Cross-utterance, `analyze_file_overlaps`

Whole-file analysis lives in overlap_groups.rs. Matching is 1:N: one top region from speaker A can match multiple bottom regions from speakers B, C, etc. Used by E347 and chatter debug overlap-audit.

Overlap validation

Code	Level	Check
E347	Cross-utterance	Orphaned tops/bottoms with 1:N matching (warning)
E348	Utterance	Unpaired markers within a single utterance (warning)
E373	Utterance	Invalid overlap index values (must be 2-9)
E704	Cross-utterance	Same speaker encoding both top and bottom (error)

chatter debug overlap-audit <path> reports per-file statistics (groups, bottoms, orphans, temporal consistency) in TSV format. Use --database <path.jsonl> for a persistent JSON-lines database.

Design Principles

No string hacking. All alignment operates on typed AST structures (Word, MorTier, AlignmentPair), never on serialized CHAT text.
Domain-aware from the start. TierDomain gates traversal at the walker level. Downstream code never re-implements retrace / group skipping logic.
Deterministic over approximate. Tier alignment and word extraction use deterministic, positional algorithms over the typed AST.
Dense indexed structures. AlignmentPair uses Option<usize> rather than cloned data; index pairs are stored positionally, not in hash maps.
Exhaustive matching. Every match on UtteranceContent (24 variants) or BracketedItem (22 variants) lists all variants explicitly. New variants are a compile error, not a silent bug.
Walker as shared primitive. walk_words() removed ~330 lines of duplicated traversal boilerplate across 7 call sites.

Downstream Consumers

Consumer	Crate	Usage
Validation	`talkbank-model`	Cross-tier checks (E714/E715, E725-E728), overlap (E347/E348/E373/E704)
LSP hover	`talkbank-lsp`	Show aligned tier items for word under cursor
Word extraction	`talkbank-transform`	NLP-ready words from utterances
Overlap audit	`chatter`	`chatter debug overlap-audit`
`%wor` generation	`talkbank-model`	Build `%wor` tier from main tier

Memory and Ownership

Status: Current Last updated: 2026-03-24 01:32 EDT

This chapter documents the memory management and ownership patterns used across the TalkBank Rust crates. Understanding these decisions helps contributors make consistent choices when adding new code.

String Representation Strategy

CHAT corpora contain massive repetition: the same speaker codes, language codes, POS tags, and high-frequency words appear millions of times across files. The codebase uses three string types, chosen by expected cardinality and duplication:

flowchart LR
    raw["Raw input (&amp;str)"]
    smol["SmolStr\n(inline ≤23 bytes)"]
    arc["Arc&lt;str&gt;\n(interned, deduplicated)"]
    string["String\n(owned, unique)"]

    raw -->|short, low repetition| smol
    raw -->|high repetition domain value| arc
    raw -->|ephemeral/unique| string

Type	When to use	Examples
`SmolStr`	Short tokens, low duplication	Postcode text, tier content, event labels
`Arc<str>` (interned)	High-repetition domain symbols	Speaker codes, language codes, POS tags, stems
`String`	Ephemeral or unique values	Error messages, temporary formatting

String Interning

Location: talkbank-model/src/model/intern.rs

Five global process-local interners, each a DashMap<Arc<str>, Arc<str>> behind OnceLock<StringInterner>:

Interner	Pre-seeded values	Typical savings
`speaker_interner()`	30+ codes (CHI, MOT, FAT, …)	High, 3-letter codes repeat per utterance
`language_interner()`	45+ ISO 639-3 codes	Moderate, per-file
`pos_interner()`	60+ POS tags + UD relations	Very high, every %mor word
`stem_interner()`	200+ frequent English stems	High, function words dominate
`participant_interner()`	14 roles (Target_Child, …)	Low, per-file

How it works:

Fast path: get() on DashMap, O(1) Arc::clone if found
Slow path: insert() new Arc if miss, deduplicates on future access
Thread-safe: DashMap uses shard-level locks, no global contention
After initialization, reaching an interner through its OnceLock takes no lock; lookups then use only DashMap’s shard-level locks

Memory impact: 50-200 MB savings on large corpora (5-20% reduction). Arc::clone is O(1) atomic increment vs String::clone O(n) copy.
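The fast-path / slow-path behavior can be sketched with the standard library alone. This is not the code in intern.rs: to stay dependency-free the sketch swaps DashMap for a Mutex<HashSet<Arc<str>>>, so it serializes all callers, and the three pre-seeded codes are just a sample.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex, OnceLock};

/// Dependency-free stand-in: a Mutex<HashSet> where the real code uses DashMap.
struct StringInterner {
    pool: Mutex<HashSet<Arc<str>>>,
}

impl StringInterner {
    fn intern(&self, text: &str) -> Arc<str> {
        let mut pool = self.pool.lock().unwrap();
        if let Some(existing) = pool.get(text) {
            return Arc::clone(existing); // fast path: reference-count bump only
        }
        let fresh: Arc<str> = Arc::from(text);
        pool.insert(Arc::clone(&fresh)); // slow path: first sighting
        fresh
    }
}

/// Lazily initialized global, pre-seeded with a few speaker codes.
fn speaker_interner() -> &'static StringInterner {
    static INTERNER: OnceLock<StringInterner> = OnceLock::new();
    INTERNER.get_or_init(|| {
        let seeded: HashSet<Arc<str>> = ["CHI", "MOT", "FAT"]
            .iter()
            .map(|code| Arc::<str>::from(*code))
            .collect();
        StringInterner { pool: Mutex::new(seeded) }
    })
}
```

Two calls with the same text return handles to the same allocation, which Arc::ptr_eq can confirm.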

Newtype Macros

Two macros generate domain-typed string wrappers:

string_newtype!: wraps SmolStr. Used for generic CHAT text.
interned_newtype!: wraps Arc<str> with automatic interning. Used for domain symbols.

// SmolStr-backed: no interning, inline small strings
string_newtype!(PostcodeText);

// Arc<str>-backed: interned via global interner
interned_newtype!(SpeakerCode, speaker_interner);

Ownership Model

ChatFile Lifecycle

flowchart TD
    src["Source text (&amp;str)"]
    cst["tree-sitter CST\n(Tree, borrowed nodes)"]
    model["ChatFile\n(owned AST)"]
    cache["SQLite cache\n(validation result)"]
    lsp["LSP server\n(per-document state)"]
    json["JSON output\n(serde serialization)"]
    cli["CLI output\n(CHAT text)"]

    src -->|tree-sitter parse| cst
    cst -->|CST-to-model conversion| model
    model -->|validate + hash| cache
    model -->|held in backend| lsp
    model -->|to_json()| json
    model -->|to_chat_string()| cli

Parsing: tree-sitter Tree owns the CST. Node<'a> values borrow from Tree, giving zero-copy traversal. The CST-to-model conversion copies data into owned ChatFile fields (SmolStr, Arc<str>). The Tree is dropped after conversion.
Validation: ChatFile is borrowed (&self) during validation. Errors are streamed to an ErrorSink, so no accumulation is required.
LSP: Each open document holds an owned ChatFile in the backend. Re-parsed on every edit via tree-sitter incremental parsing.
CLI batch: Each file is independently parsed → validated → reported → dropped. No cross-file state except the shared cache.

Arc Usage

Arc appears in three distinct roles:

Role	Type	Why
String interning	`Arc<str>` in model types	O(1) clone for high-repetition domain values
Worker pool	`Arc<WorkerGroup>` in batchalign	RAII `CheckedOutWorker::drop()` needs group reference to return worker
Cache backend	`Arc<dyn CacheBackend>` in batchalign	Shared across async request handlers

No Rc (single-threaded sharing not needed). No Cow<str> (SmolStr covers the inline-small-string use case more naturally).

Interior Mutability

Pattern	Where	What it protects
`RefCell<Parser>` inside `TreeSitterParser`	`talkbank-parser`	Tree-sitter `Parser` needs `&mut self` but isn’t `Sync`. Callers create a `TreeSitterParser` and pass `&TreeSitterParser` everywhere.
`DashMap<Arc<str>, Arc<str>>`	String interners	Concurrent interning during parallel parsing. Shard-level locks.
`OnceLock<StringInterner>`	5 global interners	Lazy init, lock-free after first access
`LazyLock<Regex>`	All regex patterns workspace-wide	Compile-once, no per-call overhead
`std::sync::Mutex<VecDeque>`	batchalign worker idle queue	Held < 10 μs for push/pop only
`tokio::sync::Mutex<HashMap>`	batchalign job store	Short reads/writes, never held across `.await`
`Semaphore`	Worker availability (batchalign)	Async signaling without holding locks during dispatch

Rule: std::sync::Mutex for data accessed from sync code or held briefly. tokio::sync::Mutex only when the lock must be held across .await points (which we avoid when possible). DashMap when many threads read concurrently.
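The RefCell row in the table follows a common shape: a type whose core method needs &mut self is wrapped so that callers only ever hold a shared reference. The sketch below uses hypothetical stand-ins (InnerParser, SharedParser) and does not touch the tree-sitter API.

```rust
use std::cell::RefCell;

/// Stand-in for a parser whose parse method needs exclusive access.
struct InnerParser {
    parses: usize,
}

impl InnerParser {
    fn parse(&mut self, source: &str) -> usize {
        self.parses += 1;
        source.lines().count()
    }
}

/// Hypothetical wrapper mirroring the RefCell<Parser> pattern from the table.
struct SharedParser {
    inner: RefCell<InnerParser>,
}

impl SharedParser {
    fn new() -> Self {
        SharedParser { inner: RefCell::new(InnerParser { parses: 0 }) }
    }

    /// Takes &self, so callers can pass &SharedParser around freely.
    fn parse(&self, source: &str) -> usize {
        // The mutable borrow lasts only for this call; a reentrant call would panic.
        self.inner.borrow_mut().parse(source)
    }

    fn parse_count(&self) -> usize {
        self.inner.borrow().parses
    }
}
```

The wrapper is not Sync, so each thread needs its own instance; that trade-off is what keeps the borrow check at run time instead of requiring a lock.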

Collection Choices

Collection	Where	Why not HashMap/Vec
`BTreeMap`	All test/snapshot JSON output	Deterministic key ordering for reviewable diffs
`IndexMap`	Participants, per-speaker results	Preserves encounter order (CHAT spec requires @Participants order)
`SmallVec<[T; N]>`	Headers (N=2), tiers (N=3), features (N=4), token mappings (N=4)	Inline storage for common sizes; avoids heap for typical cases
`VecDeque`	Worker idle queue (batchalign)	FIFO fair scheduling
Dense `Vec` indexed by position	Retokenize word-to-token mapping	O(1) lookup, no hashing overhead, cache-friendly

No LinkedList, BinaryHeap, or custom allocators.

Tree-Sitter Memory Model

Tree-sitter parsing is zero-copy for CST traversal:

// Node<'a> borrows from Tree, no allocation per node
fn process_node<'a>(node: Node<'a>, source: &str) -> ParseResult<...> {
    for i in 0..node.child_count() {
        let child: Node<'a> = node.child(i).unwrap(); // Stack-only, no heap
        let text: &str = child.utf8_text(source.as_bytes())?; // Borrows source
        // ... convert to owned model types ...
    }
}

The tree-sitter parser consumes &str, produces a CST, and the Rust traversal code constructs owned model types from CST nodes.

SQLite Memory-Mapped I/O

The validation cache uses SQLite with memory-mapped I/O for fast random access:

SqliteConnectOptions::new()
    .journal_mode(SqliteJournalMode::Wal)       // Concurrent reads during writes
    .pragma("cache_size", "-8000")               // 8 MB page cache
    .pragma("mmap_size", "268435456")            // 256 MB memory-mapped region
    .synchronous(SqliteSynchronous::Normal)      // Balanced durability

This configuration handles 95,000+ cached entries efficiently. The cache is never deleted (use --force to refresh specific paths).

Manual Drop Implementations

Three types have custom Drop for resource cleanup:

Type	Cleanup action	Why
`AuditReporter`	Joins audit writer thread and flushes output	Audit mode owns file IO in a dedicated writer thread
`CheckedOutWorker`	Returns worker to idle queue + releases semaphore permit	RAII pool resource management
`WorkerHandle`	Sends SIGTERM/SIGKILL to child process	Process must be terminated when handle drops

All drops are acyclic: there are no ordering dependencies between them.
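
A minimal sketch of the RAII pattern behind `CheckedOutWorker`, assuming a pool that is nothing more than a FIFO queue of worker ids. The field layout, `check_out`, and `idle_count` are hypothetical, and the semaphore permit mentioned in the table is left out to keep the example std-only.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

/// Hypothetical pool: ids of workers waiting in a FIFO idle queue.
struct WorkerGroup {
    idle: Mutex<VecDeque<u32>>,
}

impl WorkerGroup {
    fn new(ids: impl IntoIterator<Item = u32>) -> Arc<Self> {
        Arc::new(Self {
            idle: Mutex::new(ids.into_iter().collect()),
        })
    }

    fn idle_count(&self) -> usize {
        self.idle.lock().expect("idle queue poisoned").len()
    }
}

/// Guard that owns a worker for as long as it is checked out.
struct CheckedOutWorker {
    id: u32,
    group: Arc<WorkerGroup>, // needed so `drop` can find the queue
}

/// Takes the longest-idle worker, if any.
fn check_out(group: &Arc<WorkerGroup>) -> Option<CheckedOutWorker> {
    let id = group.idle.lock().expect("idle queue poisoned").pop_front()?;
    Some(CheckedOutWorker {
        id,
        group: Arc::clone(group),
    })
}

impl Drop for CheckedOutWorker {
    fn drop(&mut self) {
        // Return the worker to the back of the queue; never panic in drop.
        if let Ok(mut idle) = self.group.idle.lock() {
            idle.push_back(self.id);
        }
    }
}

fn main() {
    let group = WorkerGroup::new([1, 2]);
    {
        let worker = check_out(&group).expect("a worker is idle");
        println!("using worker {}", worker.id);
        assert_eq!(group.idle_count(), 1);
    } // guard dropped here, worker goes back to the queue
    assert_eq!(group.idle_count(), 2);
}
```

Because the guard carries its own `Arc` to the group, the worker is returned on every exit path, including early returns and unwinding.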

Allocation Optimization Patterns

Rather than using an arena allocator (bumpalo was evaluated and removed because the data lifetimes don’t fit the “allocate many, free all at once” pattern), the codebase uses targeted optimizations:

Pattern	Where	Savings
Scratch buffer reuse (clear + swap)	DP alignment row costs	~50% fewer allocations in inner loop
Flat table (`vec![...; rows * cols]`)	DP small-problem fallback	1 allocation vs rows+1
Dense Vec instead of HashMap	Retokenize word mapping	O(1) lookup, no hash overhead
SmallVec inline storage	Throughout	Avoids heap for 1-4 element collections
`SmolStr` inline strings	All short CHAT tokens	No heap allocation for ≤23 byte strings
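
The scratch-buffer row can be sketched as follows. This is an illustration under assumptions: it uses a plain Levenshtein distance as a stand-in for the DP alignment, and `DpScratch` is a hypothetical name. The point is only the clear-and-swap reuse of two row buffers.

```rust
/// Two reusable DP rows (hypothetical helper, not a crate type).
#[derive(Default)]
struct DpScratch {
    prev: Vec<usize>,
    curr: Vec<usize>,
}

impl DpScratch {
    /// Levenshtein distance between two token sequences.
    fn edit_distance<T: PartialEq>(&mut self, a: &[T], b: &[T]) -> usize {
        // `clear` keeps the capacity, so later calls rarely reallocate.
        self.prev.clear();
        self.prev.extend(0..=b.len());
        for (i, x) in a.iter().enumerate() {
            self.curr.clear();
            self.curr.push(i + 1);
            for (j, y) in b.iter().enumerate() {
                let substitute = self.prev[j] + usize::from(x != y);
                let delete = self.prev[j + 1] + 1;
                let insert = self.curr[j] + 1;
                self.curr.push(substitute.min(delete).min(insert));
            }
            // Swap the rows instead of allocating a new one per iteration.
            std::mem::swap(&mut self.prev, &mut self.curr);
        }
        self.prev[b.len()]
    }
}

fn main() {
    let mut scratch = DpScratch::default();
    let reference = ["the", "dog", "barks"];
    let hypothesis = ["the", "cat", "barks", "loudly"];
    let distance = scratch.edit_distance(&reference, &hypothesis);
    println!("distance = {distance}");
    // A second call reuses the same two buffers.
    assert_eq!(scratch.edit_distance(&["a"], &["a"]), 0);
}
```

Keeping the buffers in a struct lets a caller that aligns many utterances pay for the row allocations once.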

See also: the batchalign3 book’s Arena Allocators page for the full evaluation of where arenas do and don’t help.

Algorithms and Data Structures

Status: Current Last modified: 2026-06-15 15:00 EDT

This chapter documents the key algorithms and data structure decisions across the TalkBank Rust crates.

CHAT AST Representation

The CHAT model is a tree of owned enums. The two central types are:

UtteranceContent: 24 variants covering all main-tier content
BracketedItem: 22 variants for content inside groups/brackets

flowchart TD
    file["ChatFile"]
    header["Headers\n(@Languages, @Participants, ...)"]
    utt["Utterance"]
    mc["MainContent\nVec&lt;UtteranceContent&gt;"]
    dt["DependentTiers\n(%mor, %pho, %gra, ...)"]

    file --> header
    file --> utt
    utt --> mc
    utt --> dt

    mc --> word["Word / AnnotatedWord / ReplacedWord"]
    mc --> group["Group / PhoGroup / SinGroup / Quotation"]
    mc --> marker["Pause / Separator / OverlapPoint / ..."]
    group --> bi["BracketedContent\nVec&lt;BracketedItem&gt;"]
    bi --> word2["Word / ReplacedWord / Separator"]
    bi --> nested["Nested groups"]

Memory layout: Large variants (e.g., AnnotatedWord with scoped annotations) are Boxed to keep the enum’s stack size bounded.
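
The effect of boxing a large variant can be checked with `size_of`. The types below are hypothetical stand-ins with made-up field sizes, not the real `UtteranceContent` variants; only the relative sizes matter.

```rust
use std::mem::size_of;

/// Hypothetical large payload (a word plus scoped annotations).
#[allow(dead_code)] // layout-only demo: fields are never read
struct Annotated {
    word: [u8; 24],
    annotations: [u64; 8],
}

/// Every value of this enum is at least as large as `Annotated`.
#[allow(dead_code)]
enum Unboxed {
    Word([u8; 24]),
    Annotated(Annotated),
    Pause,
}

/// The large payload lives on the heap; the enum stores only a pointer.
#[allow(dead_code)]
enum Boxed {
    Word([u8; 24]),
    Annotated(Box<Annotated>),
    Pause,
}

fn main() {
    println!("payload:      {} bytes", size_of::<Annotated>());
    println!("unboxed enum: {} bytes", size_of::<Unboxed>());
    println!("boxed enum:   {} bytes", size_of::<Boxed>());
    assert!(size_of::<Boxed>() < size_of::<Unboxed>());
}
```

Since a `Vec` of the enum stores every element at the size of the largest variant, boxing the rare large case shrinks the common small ones too.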

Content Walker

Location: talkbank-model/src/alignment/helpers/walk/

Closure-based recursive traversal centralizing the walk over all 24+22 variants:

pub fn for_each_leaf<'a>(
    content: &'a [UtteranceContent],
    domain: Option<AlignmentDomain>,
    f: &mut impl FnMut(ContentLeaf<'a>),
)

Domain-aware gating:

Some(Mor): skips retrace groups (retrace words aren’t morphologically analyzed)
Some(Pho | Sin): skips PhoGroup/SinGroup (treated as atomic by those tiers)
None: recurses everything unconditionally

Both immutable (for_each_leaf) and mutable (for_each_leaf_mut) versions exist. Used by talkbank-model, talkbank-transform word extraction, and other typed CHAT traversals across the workspace.
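
The gating rules can be sketched on a much smaller content type. Everything here is a simplified assumption: `Content` has four variants instead of 24+22, the callback receives `&str` rather than `ContentLeaf`, and a skipped group is simply not descended into (what the real walker emits for it is not shown in this chapter).

```rust
#[derive(Clone, Copy, PartialEq, Eq)]
enum AlignmentDomain {
    Mor,
    Pho,
}

/// Simplified stand-in for the main-tier content enum.
enum Content {
    Word(String),
    Group(Vec<Content>),
    Retrace(Vec<Content>),
    PhoGroup(Vec<Content>),
}

/// Visits every word, applying domain-aware gating on the way down.
fn for_each_leaf<'a>(
    content: &'a [Content],
    domain: Option<AlignmentDomain>,
    f: &mut impl FnMut(&'a str),
) {
    for item in content {
        match item {
            Content::Word(w) => f(w.as_str()),
            Content::Group(inner) => for_each_leaf(inner, domain, f),
            // Retraced material is not visited for the Mor domain.
            Content::Retrace(inner) if domain != Some(AlignmentDomain::Mor) => {
                for_each_leaf(inner, domain, f)
            }
            // A PhoGroup is atomic for the Pho domain, so do not descend.
            Content::PhoGroup(inner) if domain != Some(AlignmentDomain::Pho) => {
                for_each_leaf(inner, domain, f)
            }
            Content::Retrace(_) | Content::PhoGroup(_) => {}
        }
    }
}

fn collect_words(content: &[Content], domain: Option<AlignmentDomain>) -> Vec<&str> {
    let mut out = Vec::new();
    for_each_leaf(content, domain, &mut |w| out.push(w));
    out
}

fn sample() -> Vec<Content> {
    vec![
        Content::Word("the".into()),
        Content::Retrace(vec![Content::Word("do".into())]),
        Content::Group(vec![Content::Word("dog".into())]),
        Content::PhoGroup(vec![Content::Word("wan".into()), Content::Word("na".into())]),
    ]
}

fn main() {
    let content = sample();
    println!("None: {:?}", collect_words(&content, None));
    println!("Mor:  {:?}", collect_words(&content, Some(AlignmentDomain::Mor)));
    println!("Pho:  {:?}", collect_words(&content, Some(AlignmentDomain::Pho)));
}
```

Centralizing the match in one function means a new variant needs a gating decision in exactly one place.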

Parsing Strategies

Tree-Sitter (Canonical Parser)

flowchart LR
    src["Source .cha text"]
    ts["tree-sitter C parser\n(generated from grammar.js)"]
    cst["CST (Tree)"]
    conv["Recursive descent\nover CST nodes"]
    model["ChatFile (owned AST)"]
    errors["ErrorSink\n(diagnostics)"]

    src --> ts --> cst --> conv --> model
    conv --> errors

Grammar defined in grammar/grammar.js (source of truth)
parser.c is generated, never edit directly
CST-to-model conversion: recursive dispatch on node kind, skip WHITESPACES, report unrecognized nodes via ErrorSink
Strict + catch-all pattern: Known header values get named grammar rules (syntax highlighting); unknown values hit a catch-all (flagged by validator)

Fragment Parsing

TreeSitterParser provides fragment methods for parsing individual CHAT fragments (a word, a tier line) directly. Methods such as parser.parse_word_fragment() and parser.parse_main_tier_fragment() are used when synthesizing CHAT from non-CHAT sources (ASR output, UD annotations).

Historical note: A Chumsky-based direct parser previously provided combinator-based fragment parsing. It was removed in March 2026; tree-sitter is now the sole parser.

Tier Alignment (1:1 Positional)

Location: talkbank-model/src/alignment/traits.rs

Generic positional_align() pairs main-tier words with dependent-tier items by position (O(n)). Traits: AlignableTier, TierAlignmentResult, AlignableContent.

%pho, %sin, %wor: use generic positional alignment
%mor, %gra: domain-specific custom implementations
Mismatch diagnostics: via the similar crate (Patience diff algorithm, O(n log n))
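
A minimal sketch of 1:1 positional pairing, assuming the simplest possible contract: equal lengths pair up by index, anything else is a mismatch. The signature and the `CountMismatch` type are hypothetical and do not reproduce the trait-based API in traits.rs.

```rust
/// Hypothetical mismatch report: the two tiers differ in length.
#[derive(Debug, PartialEq)]
struct CountMismatch {
    main_words: usize,
    tier_items: usize,
}

/// Pairs main-tier words with dependent-tier items by position, in O(n).
fn positional_align<'a, M, D>(
    main_words: &'a [M],
    tier_items: &'a [D],
) -> Result<Vec<(&'a M, &'a D)>, CountMismatch> {
    if main_words.len() != tier_items.len() {
        return Err(CountMismatch {
            main_words: main_words.len(),
            tier_items: tier_items.len(),
        });
    }
    Ok(main_words.iter().zip(tier_items).collect())
}

fn main() {
    let words = ["the", "dog", "barks"];
    let pho = ["ðə", "dɔg", "bɑrks"];
    let pairs = positional_align(&words, &pho).expect("tiers have equal length");
    for (word, item) in pairs {
        println!("{word} -> {item}");
    }
    // A length mismatch is reported instead of silently truncating.
    assert!(positional_align(&words, &pho[..2]).is_err());
}
```

Note that a bare `zip` would stop at the shorter side, so the explicit length check is what turns a silent truncation into a diagnostic.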

Caching

The CHAT-core validation cache is documented separately in Validation Cache. The upstream batchalign3 project documents its own audio-task cache (FA / UTR ASR / media conversion) separately.

Text Processing

Regex Compilation

All regex patterns use LazyLock<Regex> (LazyLock comes from std::sync): each pattern is compiled once at first use and is lock-free thereafter. Never call Regex::new() inside functions or loops.

Deterministic Output

BTreeMap for all test/snapshot JSON (lexicographic key ordering)
IndexMap for participant/speaker ordering (preserves encounter order per spec)
Frequency results collected into BTreeMap<NormalizedWord, Count>
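
The frequency case can be sketched with std alone. The example assumes that normalization means lowercasing and uses `String` and `usize` in place of `NormalizedWord` and `Count`; both substitutions are assumptions made for illustration.

```rust
use std::collections::BTreeMap;

/// Counts words after a (hypothetical) lowercase normalization.
fn word_frequencies(words: &[&str]) -> BTreeMap<String, usize> {
    let mut counts = BTreeMap::new();
    for word in words {
        *counts.entry(word.to_lowercase()).or_insert(0) += 1;
    }
    counts
}

/// Renders the map; iteration order is the sorted key order on every run.
fn render(counts: &BTreeMap<String, usize>) -> String {
    counts
        .iter()
        .map(|(word, n)| format!("{word}={n}"))
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    let counts = word_frequencies(&["the", "Dog", "a", "dog"]);
    println!("{}", render(&counts));
}
```

A `HashMap` would hold the same counts, but its iteration order can change between runs, which makes snapshot diffs noisy.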

Setup

Status: Current Last modified: 2026-06-21 21:33 EDT

Development is supported on Windows, macOS, and Linux. The instructions below use Unix shell syntax; on Windows, use PowerShell or Git Bash equivalently.

Prerequisites

Rust (stable) via rustup (all platforms)
Node.js for tree-sitter grammar generation and symbol validation
tree-sitter CLI: cargo install tree-sitter-cli
just (optional but recommended) for the repo’s top-level helper recipes

Clone Repository

mkdir -p ~/talkbank && cd ~/talkbank
git clone https://github.com/TalkBank/chatter.git
cd chatter

Build

From your chatter checkout root:

cargo build --workspace --locked
cargo build --workspace --all-targets --locked

# Optional helpers from the root justfile
just build
just test
just book-install-tools
just book

Two Cargo Workspaces

The repository has two independent Cargo workspaces:

1. Root workspace (`Cargo.toml`)

Contains all Rust crates for parsing, model, validation, and transform:

cargo build
cargo test

2. Spec workspace (`spec/Cargo.toml`)

Contains two sibling crates for spec-driven artifacts. Invoke with --manifest-path relative to the chatter repo root:

cargo build --manifest-path spec/tools/Cargo.toml
cargo build --manifest-path spec/runtime-tools/Cargo.toml
cargo run --manifest-path spec/tools/Cargo.toml --bin gen_tree_sitter_tests -- --help
cargo run --manifest-path spec/runtime-tools/Cargo.toml --bin validate_error_specs -- --help

Root justfile recipes

just build        # Build the Rust workspace
just build-release
just test         # cargo test --workspace
just clippy
just fmt
just fmt-check

Verification

This repo does not currently have the old monorepo-wide make verify wrapper ported into the root checkout. Until that lands, use the concrete verification commands from the repo guidance:

cargo fmt
cargo check --workspace --all-targets
cargo build --workspace --all-targets --locked
cargo nextest run --workspace
cargo test --doc

Add grammar/spec commands when your change touches those surfaces:

cd grammar && tree-sitter generate && tree-sitter test
cargo build --manifest-path spec/tools/Cargo.toml
cargo build --manifest-path spec/runtime-tools/Cargo.toml

CI green on the pushed commit remains the authoritative pre-push gate for this repo.

Editor Setup

rust-analyzer

The workspace should work out of the box with rust-analyzer. The root Cargo.toml workspace configuration is standard.

Grammar Workflow

Status: Current Last modified: 2026-05-29 18:36 EDT

The tree-sitter grammar at grammar/grammar.js is the formal definition of the CHAT format. Changes require careful validation.

The following diagram shows the complete regeneration pipeline. Every step must pass before committing a grammar change.

flowchart TD
    edit(["Edit grammar/grammar.js"])
    generate["tree-sitter generate\n→ src/parser.c\n→ src/node-types.json"]
    grammar_test["tree-sitter test\n(corpus tests)"]
    rust_test["cargo test -p talkbank-parser\n(CST-to-model conversion)"]
    equiv["parser equivalence\n(corpus/reference/ files)"]
    spec_check{"Grammar change\naffects spec examples?"}
    test_gen["spec/tools generators\n→ grammar/test/corpus/\n→ parser-tests/tests/generated/\n→ docs/errors/"]
    commit(["Commit"])

    edit --> generate --> grammar_test --> rust_test --> equiv --> spec_check
    spec_check -->|Yes| test_gen --> commit
    spec_check -->|No| commit

Step-by-Step Procedure

1. Edit the Grammar

Modify grammar.js in the grammar/ directory. Key design principles:

Explicit whitespace (no extras)
Precedence annotations to resolve ambiguities
Named rules for all semantically meaningful nodes

2. Generate the Parser

cd grammar
tree-sitter generate

This produces src/parser.c and src/node-types.json. Never edit these files by hand.

3. Run Grammar Tests

tree-sitter test

Every test under grammar/test/corpus/ must pass. These tests are partially auto-generated from specs (primarily via gen_tree_sitter_tests).

4. Run Parser Tests

cargo test -p talkbank-parser

This verifies the Rust parser wrapper handles all CST nodes correctly.

5. Run Parser Equivalence

cargo nextest run -p talkbank-parser-tests -E 'test(parser_equivalence)'

Every file in the reference corpus must parse correctly. Each .cha file is its own test; nextest runs them in parallel and reports individual failures.

6. Regenerate Spec Tests

If the grammar change affects any spec examples:

cargo run --manifest-path spec/tools/Cargo.toml --bin gen_tree_sitter_tests -- \
  --output-dir grammar/test/corpus \
  --template-dir spec/tools/templates

cargo run --manifest-path spec/tools/Cargo.toml --bin gen_rust_tests -- \
  --output-dir crates/talkbank-parser-tests/tests/generated

cargo run --manifest-path spec/tools/Cargo.toml --bin gen_validation_corpus -- \
  --corpus-dir crates/talkbank-parser-tests/tests/error_corpus/validation_errors

This regenerates tree-sitter corpus tests and other generated outputs that still depend on the spec pipeline.

Do this when the grammar change actually affects generated artifacts.

7. Update node_types.rs

If new node types were added to the grammar, the generated node_types.rs in talkbank-parser needs updating. The spec tools handle this via node-types.json.

Critical Policy

The reference corpus at corpus/reference/ must pass parser equivalence at 100%. If a grammar change breaks even one file, revert immediately. The reference corpus is the ultimate arbiter of correctness.

Common Patterns

Adding a New Token

Define the token in grammar.js
Add handling in the Rust tier parser (match on the new node kind)
Add a spec construct example
Run the relevant generation and verification steps

For small, isolated syntax additions, the grammar workflow should stay local:

one grammar change
one grammar corpus example
one full-file fixture if needed

Changing a Rule

Modify the rule in grammar.js
tree-sitter generate && tree-sitter test
Update Rust parser if CST node structure changed
Update spec examples if the expected CST changed
Run the current local verification sweep from contributing/dev-checks.md

Spec Workflow

Status: Current Last modified: 2026-05-29 17:50 EDT

Specifications in spec/ are the source of truth for CHAT format intent, grammar examples, and validation/error contracts.

Adding a Construct Spec

Construct specs define valid CHAT patterns with expected parse trees.

1. Create the Spec File

Create a new markdown file in the appropriate spec/constructs/ subdirectory:

spec/constructs/
├── header/         # Header-related constructs
├── main_tier/      # Main tier patterns
├── tiers/          # Dependent tier patterns
├── utterance/      # Utterance-level patterns
└── word/           # Word syntax patterns

2. Write the Spec

# my_example

Description of what this example demonstrates.

## Input

\```utterance
*CHI:	hello world .
\```

## Expected CST

\```cst
(utterance
  (main_tier
    ...))
\```

## Metadata

- **Level**: utterance
- **Category**: main_tier

The code fence label (e.g., utterance, mor_dependent_tier) selects which template wraps the input into a full CHAT file.

3. Generate the CST

Parse your input with tree-sitter to get the actual CST, then copy it as the Expected CST (stripping positions and field names).

4. Regenerate The Affected Generated Artifacts

The predecessor monorepo wrapped this step as make test-gen. That root wrapper is not yet ported into this repo, so follow spec/CLAUDE.md and run only the generator command(s) relevant to the artifacts you intentionally changed.

For isolated grammar additions, keep the change small:

Add or adjust one grammar example.
Add one full-file fixture if the change matters in context.
Regenerate only the artifacts that truly changed.

Adding an Error Spec

Error specs define invalid CHAT patterns with expected error codes.

1. Create the Spec File

Error specs live in spec/errors/, named by error code. The convention is E###_auto.md (or E###_<short-slug>.md); for example spec/errors/E301_auto.md covers “Empty speaker code”.

2. Write the Spec

The actual on-disk format (per spec/errors/E301_auto.md) uses bolded metadata keys; there is no Name field and severity is implicit in the error-code numbering:

# E301: Empty speaker code

## Description

Empty speaker code

## Metadata

- **Error Code**: E301
- **Category**: Main tier validation
- **Level**: utterance
- **Layer**: parser

## Example 1

**Source**: `E3xx_main_tier_errors/E301_empty_speaker.cha`
**Trigger**: Main tier with * but no speaker code
**Expected Error Codes**: E301

\```chat
@UTF8
@Begin
@Languages:	eng
@Participants:	CHI Child
...
\```

Key Metadata Fields

Layer: parser: the error is caught during parser.parse_chat_file() (file fails to parse)
Layer: validation: the error is caught by validate_with_alignment() after successful parse
Status: not_implemented: generates #[ignore] tests (validation logic not yet coded)

3. Regenerate The Affected Artifacts

Regenerate the affected artifacts with the current spec-tool commands from spec/CLAUDE.md, then run the concrete verification commands from Setup / Developer Verification Checks.

Updating the Symbol Registry

The symbol registry at spec/symbols/symbol_registry.json defines character sets used by the grammar and Rust crates.

flowchart TD
    registry["Edit spec/symbols/\nsymbol_registry.json"]
    validate["validate_symbol_registry.js\n(structure check)"]
    gen_grammar["Generate grammar symbols\n(for tree-sitter)"]
    gen_rust["generate_rust_symbol_sets.js\n→ talkbank-model/src/generated/symbol_sets.rs\n→ spec/tools/src/generated/symbol_sets.rs"]
    fmt["rustfmt\n(format generated code)"]
    verify["Run current symbol generators\nthen local verification sweep"]

    registry --> validate --> gen_grammar & gen_rust
    gen_rust --> fmt --> verify
    gen_grammar --> verify

After editing, run the current symbol-generation commands from spec/CLAUDE.md, then regenerate any dependent grammar/tests/docs outputs if the symbol change affects them.

Common Mistakes

Editing generated files: never edit grammar/test/corpus/ or crates/talkbank-parser-tests/tests/generated/ by hand
Regenerating reflexively: use regeneration when generated artifacts changed, not as a substitute for thinking about what kind of test authority the change really needs
Wrong layer: parser-layer specs expect parse failure; validation-layer specs expect parse success + error report

Testing

Status: Current Last modified: 2026-06-15 15:00 EDT

Test Generation Pipeline

Specs are the source of truth. All grammar corpus tests, Rust parser tests, and error docs are generated from specs. This repo does not currently have the old monorepo-wide make test-gen wrapper; run the relevant spec/tools binaries directly instead, and never hand-edit generated files.

flowchart LR
    subgraph sources["Source of Truth"]
        constructs["spec/constructs/\n(construct specs, see directory listing)"]
        errors["spec/errors/\n(error specs, see directory listing)"]
        templates["spec/tools/templates/\n(Tera wrappers)"]
    end

    subgraph generators["spec/tools generators\n(run only what changed)"]
        gen_ts["gen_tree_sitter_tests"]
        gen_rust["gen_rust_tests"]
        gen_validation["gen_validation_corpus"]
        gen_docs["gen_error_docs"]
    end

    subgraph outputs["Generated Outputs (DO NOT EDIT)"]
        ts_tests["grammar/test/corpus/\n(tree-sitter tests)"]
        rust_tests["crates/talkbank-parser-tests/tests/generated/\n(Rust tests)"]
        val_corpus["crates/talkbank-parser-tests/tests/error_corpus/validation_errors/\n(.cha fixtures + manifest.json)"]
        error_docs["docs/errors/\n(local generated error pages)"]
    end

    constructs & errors --> gen_ts
    templates --> gen_ts
    constructs & errors --> gen_rust
    errors --> gen_validation
    errors --> gen_docs

    gen_ts --> ts_tests
    gen_rust --> rust_tests
    gen_validation --> val_corpus
    gen_docs --> error_docs

To add a grammar test or error test, add a spec file in spec/constructs/ or spec/errors/, then run the current generator command(s) from Spec Workflow. Use only the binaries that match the artifacts you intentionally changed.

Test Strategy

Testing is organized in layers, from fastest to most comprehensive.

flowchart TD
    unit["Unit + Integration Tests\n(cargo nextest run)"]
    specgen["Spec-Generated Tests\n(spec/tools generators)\nParser + validation layer"]
    grammar["Grammar Corpus\n(tree-sitter test)"]
    ref["Reference Corpus\n(corpus/reference/, 100% required)"]
    gates["Local verification sweep + CI\n(dev-checks.md / quality-gates.md)"]

    unit --> specgen --> grammar --> ref --> gates

Never-Regress Gates

Four gates form the regression contract for the CHAT core. They guard the behavior a successor cannot easily re-derive: parser correctness, lossless serialization, full-corpus coverage, and error detection. Any commit touching the relevant surface (grammar, parser, model, validation, serialization, or alignment) MUST run the matching gate(s) and keep them green. A red gate is a bug until proven otherwise (see the repo CLAUDE.md, “Test Failures Are Bugs Until Proven Otherwise”), never a test expectation to quietly update.

Gate	Command	What it protects
Parser equivalence	`cargo nextest run -p talkbank-parser-tests -E 'test(parser_equivalence)'`	The re2c oracle parser and the tree-sitter parser agree on every reference file. A divergence means one parser is wrong, or a construct spec is missing.
Roundtrip idempotency	`cargo nextest run -p talkbank-parser-tests --test roundtrip_reference_corpus`	parse, serialize, re-parse yields a semantically identical AST (`SemanticEq`) for every reference file. Catches any model or `WriteChat` change that silently loses information.
Reference corpus 100%	(the same `roundtrip_reference_corpus` test)	Every file under `corpus/reference/` parses and roundtrips with zero failures. The reference corpus is the ultimate arbiter of full-file correctness; it must be 100%, never “mostly”.
Error-code spec tests	`cargo nextest run -p talkbank-parser-tests --test generated_tests --test validation_error_corpus --test error_coverage`	Every error spec under `spec/errors/` still fires its expected code: parser-layer errors reject as designed, validation-layer errors are detected, and every `ErrorCode` has a backing spec. These tests are generated from specs, never hand-written.

Two of the four share one test: roundtrip_reference_corpus enforces both roundtrip idempotency and the reference-corpus 100% guarantee, because it iterates every reference file (the coverage guarantee) and checks roundtrip semantic equality on each (the idempotency guarantee).
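
The roundtrip property itself can be sketched on a toy one-line format. This is not the real parser or `WriteChat`: `parse`, `serialize`, and the `Utterance` struct are hypothetical, and derived `PartialEq` stands in for `SemanticEq`.

```rust
/// Toy AST for a single main-tier line (hypothetical).
#[derive(Debug, PartialEq, Eq)]
struct Utterance {
    speaker: String,
    words: Vec<String>,
}

/// Parses `*SPK:<whitespace>words...`; returns None on malformed input.
fn parse(line: &str) -> Option<Utterance> {
    let rest = line.strip_prefix('*')?;
    let (speaker, text) = rest.split_once(':')?;
    if speaker.is_empty() {
        return None;
    }
    Some(Utterance {
        speaker: speaker.to_string(),
        words: text.split_whitespace().map(String::from).collect(),
    })
}

/// Writes the canonical form: one tab after the colon, single spaces.
fn serialize(utterance: &Utterance) -> String {
    format!("*{}:\t{}", utterance.speaker, utterance.words.join(" "))
}

/// parse -> serialize -> re-parse must yield an equal AST.
fn roundtrips(line: &str) -> bool {
    match parse(line) {
        Some(first) => parse(&serialize(&first)).as_ref() == Some(&first),
        None => false,
    }
}

fn main() {
    let input = "*CHI:\thello   world .";
    // The text may be normalized, but the AST must survive unchanged.
    println!("{:?}", serialize(&parse(input).expect("valid line")));
    assert!(roundtrips(input));
}
```

The comparison is on the AST rather than on the text, so normalization during serialization is allowed while any loss of information is not.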

All four also run as part of the full workspace sweep (cargo nextest run --workspace), so a complete local run before committing covers them. The per-gate commands above are the fast, targeted way to re-check one surface during the inner development loop. The sections below describe each layer in more detail.

Unit Tests (nextest)

cargo nextest run

Runs all unit and integration tests across all crates (~2300+ tests). These test individual functions, serialization roundtrips, and model invariants.

cargo nextest does not run doctests. Keep cargo test --doc as a separate verification step when you change public API examples or doc comments.

Parser Equivalence

cargo nextest run -p talkbank-parser-tests -E 'test(parser_equivalence)'

Runs the parser on each file in the corpus/reference/ tree and validates results. Each .cha file is its own test, enabling per-file parallelism and failure isolation via nextest. The exact file count is whatever find corpus/reference -name '*.cha' -type f | wc -l reports; do not hard-code it here.

Spec-Generated Tests

Part of talkbank-parser-tests. These are generated from specs via the current spec/tools binaries and currently test:

Construct specs: input parses correctly
Parser-layer error specs: input fails to parse with expected error code
Validation-layer error specs: input parses but validation reports expected error code

Common entrypoints from the repository root:

cargo run --manifest-path spec/tools/Cargo.toml --bin gen_tree_sitter_tests -- \
  --output-dir grammar/test/corpus \
  --template-dir spec/tools/templates

cargo run --manifest-path spec/tools/Cargo.toml --bin gen_rust_tests -- \
  --output-dir crates/talkbank-parser-tests/tests/generated

cargo run --manifest-path spec/tools/Cargo.toml --bin gen_validation_corpus -- \
  --corpus-dir crates/talkbank-parser-tests/tests/error_corpus/validation_errors

Tree-Sitter Grammar Tests

cd grammar && tree-sitter test

Runs the tree-sitter grammar corpus tests. This is the right gate for grammar structure changes.

Error Corpus Tests

Error fixtures live in spec/errors/. Parser-layer error examples become Rust tests via gen_rust_tests; validation-layer examples become a .cha fixture corpus + manifest.json via gen_validation_corpus, under crates/talkbank-parser-tests/tests/error_corpus/validation_errors/, which the data-driven runner validation_error_corpus.rs consumes. Add a new error spec under spec/errors/E###_*.md and regenerate.

Tree-Sitter Tests

cd grammar
tree-sitter test

Verifies the grammar produces correct CSTs for known inputs. The actual test count comes from ls grammar/test/corpus/*.txt | wc -l; do not hard-code it.

Reference Corpus

The reference corpus at corpus/reference/ is organized into subdirs (annotation/, audio/, ca/, content/, core/, edge-cases/, languages/, tiers/, word-features/). The parser must handle every file at 100%; the exact file count is whatever find corpus/reference -name '*.cha' -type f | wc -l reports.

This corpus is the ultimate arbiter of correctness for full-file parsing.

Local Verification Contract

There is no repo-local make verify wrapper in this checkout today. Use the explicit command set from Developer Verification Checks and Testing and Quality Gates instead.

Core local sweep:

cargo fmt --all -- --check
cargo build --workspace --all-targets --locked
cargo nextest run --workspace
cargo test --doc

Then add the surface-specific checks that match your change:

grammar changes: cd grammar && tree-sitter generate && tree-sitter test
spec-tool changes: cargo build --manifest-path spec/tools/Cargo.toml and cargo build --manifest-path spec/runtime-tools/Cargo.toml
parser / model / alignment / serialization changes: cargo nextest run -p talkbank-parser-tests -E 'test(parser_equivalence)' and cargo nextest run -p talkbank-parser-tests --test roundtrip_reference_corpus

Running Specific Tests

# Single test by name
cargo nextest run test_name

# Tests in a specific crate
cargo nextest run -p talkbank-model

# Tests matching a pattern
cargo nextest run -- mor

# With output
cargo nextest run --no-capture

What to Run When

What you changed	Run
Grammar (`grammar.js`)	`cd grammar && tree-sitter generate && tree-sitter test`, then the relevant parser/spec-generator commands
Parser (CST-to-model)	`cargo nextest run -p talkbank-parser`
Model (types, validation, alignment)	`cargo nextest run -p talkbank-model`
CLI (chatter args, dispatch)	`cargo nextest run -p chatter`
LSP	`cargo nextest run -p talkbank-lsp`
Spec files	Run the relevant `gen_*` commands from `spec/tools`, then the local verification sweep from `dev-checks.md`
Pre-merge (any change)	The local verification sweep from `dev-checks.md` plus surface-specific additions
Pre-push (quick)	Re-run the narrowest commands that cover the surfaces you touched; there is no repo-local `make ci-local` wrapper

Mutation Testing

Use cargo-mutants to find code that can be changed without any test failing: these are the true coverage gaps.

# Install (once)
cargo install cargo-mutants

# Run against a specific crate (--jobs 1 to avoid OOM on 64 GB machines)
cargo mutants -p talkbank-parser --timeout 120 --jobs 1

# Review results
cat mutants.out/missed.txt    # Mutations no test caught
cat mutants.out/caught.txt    # Mutations properly detected

Mutation testing is not part of CI but should be run periodically (after major changes) to find untested logic paths. Results guide where to add new tests.

Configuration: mutants.toml at the repo root excludes trivial functions.

Adding Tests

Model tests: add to the relevant crate’s tests/ directory or #[cfg(test)] module
Parser tests: if the change is about grammar shape or validation contracts, add or update specs and regenerate with the relevant spec/tools generator binaries
Error tests: add a new spec under spec/errors/E###_*.md and run gen_rust_tests (parser-layer) or gen_validation_corpus (validation-layer); the generated Rust tests / fixture corpus + manifest are produced automatically

Coding Standards

Status: Current Last updated: 2026-03-24 00:01 EDT

Rust Conventions

Edition: 2024
Formatting: cargo fmt before every commit
Linting: cargo clippy --all-targets -- -D warnings must pass with zero warnings
No clippy silencing without explicit approval

Error Handling

No panics for recoverable conditions; use thiserror/miette for error types
Library code uses the ErrorSink trait for error reporting, not Result
Use ParseOutcome<T> in parser code (parsed or rejected)
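
One possible shape for these two conventions is sketched below. The trait method, the `Diagnostic` struct, the two `ParseOutcome` variant names, and `parse_speaker_code` are all assumptions made for illustration; the real definitions live in the crates and may differ. The E301 code and its message are taken from the error-spec example in Spec Workflow.

```rust
/// Hypothetical diagnostic record.
#[derive(Debug, PartialEq)]
struct Diagnostic {
    code: &'static str,
    message: String,
}

/// Hypothetical sink: diagnostics are streamed out as they are found.
trait ErrorSink {
    fn report(&mut self, diagnostic: Diagnostic);
}

/// Collecting sink, convenient for tests.
impl ErrorSink for Vec<Diagnostic> {
    fn report(&mut self, diagnostic: Diagnostic) {
        self.push(diagnostic);
    }
}

/// Explicit outcome instead of `Option<T>` next to a sink.
#[derive(Debug, PartialEq)]
enum ParseOutcome<T> {
    Parsed(T),
    Rejected,
}

/// Parses the speaker code from the text that follows the leading `*`.
fn parse_speaker_code(after_star: &str, sink: &mut dyn ErrorSink) -> ParseOutcome<String> {
    // `split` always yields at least one piece, even without a colon.
    let code = after_star.split(':').next().unwrap_or("").trim();
    if code.is_empty() {
        sink.report(Diagnostic {
            code: "E301",
            message: "Empty speaker code".to_string(),
        });
        return ParseOutcome::Rejected;
    }
    ParseOutcome::Parsed(code.to_string())
}

fn main() {
    let mut sink: Vec<Diagnostic> = Vec::new();
    println!("{:?}", parse_speaker_code("CHI:\thello .", &mut sink));
    println!("{:?}", parse_speaker_code(":\thello .", &mut sink));
    for diagnostic in &sink {
        println!("{}: {}", diagnostic.code, diagnostic.message);
    }
}
```

An outcome enum makes the rejection explicit at the call site, whereas a bare `None` would not say whether a diagnostic had been reported.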

Logging

Library crates use tracing (never println! or eprintln!)
CLI binaries write to stdout (results) and stderr (diagnostics)
Use appropriate log levels: error!, warn!, info!, debug!, trace!

Naming

Follow standard Rust conventions (snake_case for functions, CamelCase for types)
Conventional Commits for commit messages: <type>[scope]: <description>
- Types: feat, fix, refactor, test, docs, chore

Dependencies

Preferred crates:

clap: CLI argument parsing
serde: serialization
miette: user-facing diagnostics
insta: snapshot testing
tracing: structured logging
rayon / crossbeam: concurrency
smallvec: small-buffer optimization

Code Organization

Keep crate boundaries clean: lower crates should not depend on higher ones
The model crate should not depend on any parser
Parsing code should not depend on serialization/transform code
All CHAT parsing and serialization goes through the AST, never ad-hoc string manipulation
Treat 10 or more named struct fields as an audit trigger. Wide boundary or report records can be acceptable, but wide runtime state bags need explicit review. See architecture/chat-model/wide-structs.md.

Testing

Prefer spec-driven tests over hand-written tests for parser behavior
Use cargo nextest run for unit tests (except doctests)
Snapshot tests with insta for complex output comparisons

Generated Files

Never hand-edit generated artifacts:

parser.c: generated from grammar.js
grammar/test/corpus/: generated from specs
crates/talkbank-parser-tests/tests/generated/: generated from specs
crates/talkbank-model/src/generated/symbol_sets.rs: generated from symbol registry

Always regenerate from source inputs.

Coding Standards and Engineering Practices

Status: Current Last updated: 2026-05-21 08:38 EDT

Objective

Set enforceable, language-specific standards that reduce ambiguity and improve long-term maintainability.

Global Standards

Prefer explicit domain types over ad-hoc strings.
Keep parsing, validation, and rendering logic separated.
Eliminate magic numbers/strings/paths via named constants and config.
Treat generated code as immutable artifacts.
Require tests for every bugfix and behavior change.

Rust Standards

Enforce formatter and clippy in CI.
Minimize #[allow(clippy::...)]; each allowance needs rationale.
Prefer small focused modules with clear ownership.
Public APIs require doc comments with examples and error behavior.
In parser code, disallow ErrorSink + Option<T> signatures for fallible parse operations.
- Use explicit outcome enums or Result with structured diagnostics.
- Guardrail script: scripts/check-errorsink-option-signatures.sh.
For model enums that encode validation state, require ValidationTagged derive.
- Explicit annotation: #[validation_tag(error|warning|clean)].
- Naming-convention fallback (per crates/talkbank-derive/src/validation_tagged.rs:118-123): variants ending in Error → Error; variants ending in Warning OR Unsupported, plus a variant named exactly Unsupported, → Warning; otherwise → Clean.
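
The naming-convention fallback reads as a small classification over the variant name. The function below is a runtime sketch of that rule only; the real logic runs inside the derive macro at compile time, and `ValidationTag` and `fallback_tag` are hypothetical names.

```rust
/// Hypothetical result of the fallback classification.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ValidationTag {
    Error,
    Warning,
    Clean,
}

/// Applies the suffix rules to a variant that has no explicit annotation.
fn fallback_tag(variant_name: &str) -> ValidationTag {
    if variant_name.ends_with("Error") {
        ValidationTag::Error
    } else if variant_name.ends_with("Warning") || variant_name.ends_with("Unsupported") {
        // `ends_with` also matches a variant named exactly `Unsupported`.
        ValidationTag::Warning
    } else {
        ValidationTag::Clean
    }
}

fn main() {
    for name in ["ParseError", "SpellingWarning", "Unsupported", "Valid"] {
        println!("{name} -> {:?}", fallback_tag(name));
    }
}
```

An explicit `#[validation_tag(...)]` annotation remains the clearer choice whenever a variant name does not follow these suffixes.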

Grammar Standards

Grammar rules must map to documented token/category semantics.
No duplicated symbol sets in free-form literals.
Every non-obvious precedence/conflict decision must include rationale.

Spec and Generator Standards

Spec files must follow strict metadata template.
Generators must be deterministic and pure with respect to inputs.
No hardcoded user-specific paths in docs or generated outputs.

Magic Value Policy

Disallowed

Inline path literals tied to local machines.
Unnamed numeric constants encoding protocol behavior.
Repeated header/tier string literals across modules.

Required

Central constants/modules:
- path defaults,
- tier/header prefixes,
- token categories,
- formatting policies.
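A sketch of the "central constants" requirement, assuming a hypothetical module layout. The module names, constant names, and `is_end_header` helper are illustrative; only the header strings and the `corpus/reference` path are taken from CHAT and this book.

```rust
/// Hypothetical central module for header literals, defined once.
pub mod headers {
    pub const UTF8: &str = "@UTF8";
    pub const BEGIN: &str = "@Begin";
    pub const END: &str = "@End";
    pub const PARTICIPANTS: &str = "@Participants";
}

/// Hypothetical central module for repo-relative path defaults.
pub mod paths {
    pub const REFERENCE_CORPUS_DIR: &str = "corpus/reference";
}

/// Callers reference the named constant instead of repeating the literal.
pub fn is_end_header(line: &str) -> bool {
    line.trim_end() == headers::END
}
```

With the literals defined in one place, a renamed header or moved directory is a one-line change rather than a search across modules.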

Review and PR Standards

PR template must include:
- subsystem touched,
- contract impact,
- generated artifact impact,
- tests added/updated,
- docs updated.
Require at least one reviewer with subsystem ownership for core modules.

Internal Decision Records

Adopt a short ADR format in the book’s architecture section:

context,
decision,
alternatives considered,
consequences,
rollback path.

Acceptance Criteria

Coding standards are documented once and enforced automatically.
Magic values are systematically reduced and tracked.
Every behavior change includes tests and doc impact assessment.
Architecture decisions are recorded and discoverable.

CI and Release

Status: Current Last updated: 2026-07-07 21:20 EDT

Pre-Merge Verification

Use the concrete local verification commands from Setup and Developer Verification Checks:

cargo fmt --all -- --check
cargo build --workspace --all-targets --locked
cargo nextest run --workspace
cargo test --doc

Then rely on GitHub Actions CI as the authoritative shared signal before you announce a change as ready.

Generated artifact drift

Generated artifacts are still important, but the old root wrappers from the predecessor workspace are not yet ported into this repo. In practice:

regenerate only the affected spec/symbol outputs,
do not hand-edit generated artifacts,
and run the surface-specific verification commands that match the change.

See Spec Workflow and spec/CLAUDE.md for the current source-of-truth guidance.

Release Process

TalkBank/chatter is the public release source of truth: release.yml (cargo-dist) publishes the signed GitHub Releases for the CLI and the desktop app.

Workflows that actually exist in this repo

Workflow	Purpose	Notes
`.github/workflows/ci.yml`	Main build/test/book CI	Primary shared signal on pushes and PRs
`.github/workflows/cross-platform.yml`	Cross-platform build coverage	Supplements the main CI workflow
`.github/workflows/crates-io-foundation.yml`	First-wave crates.io readiness	Checks foundation-crate metadata, package surfaces, hold-backs, and publish order
`.github/workflows/release.yml`	cargo-dist release automation	Builds dist-enabled workspace artifacts from version tags; owns the GitHub Release
`.github/workflows/release-desktop.yml`	Desktop installer release automation	Builds chatter-desktop installers on the same version tags and uploads them into the release that `release.yml` creates; `workflow_dispatch` runs build-only
`.github/workflows/clippy-rolling.yml`	New-stable clippy drift detection	Weekly maintenance workflow

Current release stance

release.yml is about workspace artifact packaging via cargo-dist, not about crates.io publication.
The first-wave crates.io path is documented separately in Crates.io Publication and is checked by just crates-io-foundation-check plus .github/workflows/crates-io-foundation.yml.

Desktop release workflow: how the two tag workflows compose

On a version tag, release.yml (cargo-dist) and release-desktop.yml run in parallel. cargo-dist owns creating the GitHub Release and attaching the CLI archives, checksums, and installer scripts; release-desktop.yml builds the Tauri installers, then polls until the release exists and uploads its installers into it. Two platform notes baked into the workflow:

macOS: Tauri signs, notarizes, and staples the .app, but NOT the .dmg it wraps around it. The workflow therefore submits the .dmg itself to the notary service and staples it, then verifies codesign, spctl, and stapler validate on both artifacts. The signing identity is supplied via environment, never hardcoded in tauri.conf.json.
Windows / Linux: artifacts are currently unsigned by decision; see docs/strategy/distribution-and-signing.md (“Decisions, 2026-06-12”) and the SmartScreen guidance in the install docs.

Release secrets (Actions secrets on this repository)

Required by the macOS jobs of release-desktop.yml (and by cargo-dist macOS codesigning if macos-sign is enabled, which uses the separate CODESIGN_* names documented in the strategy doc):

Secret	Content
`APPLE_CERTIFICATE`	base64-encoded Developer ID Application `.p12`
`APPLE_CERTIFICATE_PASSWORD`	password for the `.p12`
`APPLE_SIGNING_IDENTITY`	full identity string, `Developer ID Application: <Name> (<TEAMID>)`
`APPLE_API_KEY`	App Store Connect API key ID (notarization)
`APPLE_API_ISSUER`	App Store Connect issuer ID
`APPLE_API_KEY_CONTENT`	contents of the `AuthKey_*.p8` file

Rotation: replacing the certificate or notary key means updating these secrets and nothing else; no workflow edits are needed. A maintainer must re-create all of them on any new repository (secrets do not transfer).

Crates.io Publication

Status: Current Last updated: 2026-06-21 21:33 EDT

Scope

The crates.io automation in this repo currently targets the Wave 1A foundation crates only. crates.io publication is a deliberate maintainer action, not a tag-triggered release path.

Wave 1A is:

tree-sitter-talkbank
talkbank-derive
talkbank-model
talkbank-cache
talkbank-parser
talkbank-parser-re2c
talkbank-transform

talkbank-parser-re2c is part of the first wave because talkbank-transform has a runtime dependency on it. Holding it back would make talkbank-transform unpublishable.

The current Wave 1B hold-backs are explicitly marked publish = false:

send2clan
chatter
talkbank-lsp

They stay blocked until their support contract, install story, and user-facing docs are ready.

What the repo now automates

Two repo-native entry points cover the first-wave foundations:

Surface	Purpose
`just crates-io-foundation-check`	Local preflight for first-wave crates.io readiness
`.github/workflows/crates-io-foundation.yml`	CI enforcement for first-wave metadata, package surfaces, hold-backs, and publish order

The readiness check enforces:

required crates.io metadata (repository, homepage, keywords, categories, readme)
readme-file existence
package assembly for every first-wave crate via cargo package --list
the first-wave runtime dependency graph
publish = false guards on Wave 1B crates
a real cargo publish --dry-run for the standalone tree-sitter-talkbank crate

Important limitation: Cargo cannot fully dry-run the bootstrap wave

For the first publication of an interdependent workspace, cargo publish --dry-run is not a complete CI gate for every crate. Cargo rewrites path dependencies to registry dependencies while preparing the package. That means a crate such as talkbank-model cannot complete a registry-style dry-run until its prerequisite talkbank-derive already exists on crates.io.

So the current automation is intentionally honest:

tree-sitter-talkbank gets a real crates.io dry-run because it stands alone.
The remaining Wave 1A crates are validated by metadata, readme, and dependency checks before publication. (No MSRV is declared yet; set a deliberate rust-version and re-add an MSRV check when publication is actually pursued.)
As each prerequisite crate lands on crates.io, rerun targeted cargo publish --dry-run -p <crate> checks for the later crates before publishing them.

This is a real limitation of the initial bootstrap wave, not a missing script. If we later want full registry-resolution rehearsal before publication, that requires a staging registry/local index strategy, not just another shell loop.

Publication procedure

Before publishing anything:

Verify crates.io name availability for every Wave 1A package.
Run just crates-io-foundation-check.
Ensure .github/workflows/crates-io-foundation.yml and the main CI workflow are green on the commit you intend to publish.
Publish in this exact order, waiting for the crates.io index to observe each crate before moving to the next:
- tree-sitter-talkbank
- talkbank-derive
- talkbank-model
- talkbank-cache
- talkbank-parser
- talkbank-parser-re2c
- talkbank-transform
After each prerequisite becomes visible on crates.io, rerun any newly-unblocked cargo publish --dry-run -p <crate> checks before the next publish step.

Example command shape:

cargo publish -p tree-sitter-talkbank --locked
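The ordering constraint behind this procedure can be checked mechanically. The sketch below assumes a hypothetical helper, `first_order_violation`, and encodes only the two dependency edges stated on this page (`talkbank-model` on `talkbank-derive`, `talkbank-transform` on `talkbank-parser-re2c`); the real first-wave graph contains more edges and is what the readiness check actually enforces.

```rust
use std::collections::{HashMap, HashSet};

/// Only the edges stated in the prose; the real graph is larger.
pub fn stated_dependencies() -> HashMap<&'static str, Vec<&'static str>> {
    HashMap::from([
        ("talkbank-model", vec!["talkbank-derive"]),
        ("talkbank-transform", vec!["talkbank-parser-re2c"]),
    ])
}

/// Returns the first `(crate, missing_dependency)` pair where a crate is
/// listed before one of its prerequisites, or `None` if the order is valid.
pub fn first_order_violation<'a>(
    order: &[&'a str],
    deps: &HashMap<&'a str, Vec<&'a str>>,
) -> Option<(&'a str, &'a str)> {
    let mut published: HashSet<&str> = HashSet::new();
    for &krate in order {
        if let Some(needed) = deps.get(krate) {
            for &dep in needed {
                if !published.contains(dep) {
                    return Some((krate, dep));
                }
            }
        }
        // Mark the crate as available to everything later in the order.
        published.insert(krate);
    }
    None
}
```

A check like this only validates the sequence; it does not replace waiting for the crates.io index to observe each crate before the next publish.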

Tagging policy

Do not use version tags to drive crates.io publication from this repo. .github/workflows/release.yml is reserved for cargo-dist GitHub Releases of dist-enabled artifacts. Crates.io publication remains a deliberate manual maintainer flow.

Testing and Quality Gates

Status: Current Last modified: 2026-06-21 21:33 EDT

This page summarizes the current relationship between local verification and the repository CI workflows.

Local pre-merge contract

There is no repo-local make verify wrapper in this checkout today. The local contract is the command set documented in Setup and Developer Verification Checks:

cargo fmt --all -- --check
cargo build --workspace --all-targets --locked
cargo nextest run --workspace
cargo test --doc

plus grammar/spec/parser-specific checks when you touch those surfaces.

Never-regress gates

Beyond the formatting/build/test sweep above, the CHAT core has four never-regress gates that must stay green for any change touching the grammar, parser, model, validation, serialization, or alignment: parser equivalence, roundtrip idempotency, reference-corpus 100%, and the error-code spec tests. Each has a fast, targeted command. They are defined, with the exact command and what each protects, under Testing, Never-Regress Gates. A red gate is a bug until proven otherwise, never a test expectation to quietly update.

Root CI contract

The main CI workflow (.github/workflows/ci.yml) is the authoritative shared signal for this repo. Today it covers:

Rust build, test, and clippy
mdBook build

Additional workflows cover cross-platform build coverage and rolling-clippy drift checks.

Because the old local wrapper pipeline has not been ported into this repo, historical references to numbered gates such as G0-G14 should be treated as legacy labels from the predecessor workspace, not as the current command surface here.

Additional CI-only checks

These are required CI signals or workflow checks that are not identical to the local command set:

cross-platform release/build coverage
weekly rolling-clippy drift checks
workflow-specific smoke tests attached to release automation

Documentation Architecture

Status: Current Last modified: 2026-06-15 15:00 EDT

Principle: Centralized Book + Subsystem Satellites

User-facing and contributor-facing prose lives in mdBook (book/). The repo-level docs/ directory holds operator-facing material (release contract, versioning, code-signing, platform support, validation feature flags). Maintainers can also generate a local error-reference tree under docs/errors/ while working on diagnostics, but that output is not the canonical checked-in docs surface. Subsystem-specific working docs stay in place only when tightly coupled to files in that directory.

flowchart TD
    main["book/ (the unified Chatter mdBook)\nSurfaces: chatter, chat-format, architecture, contributing\nAudiences: users, integrators, contributors"]
    spec["spec/docs/\nSpec authoring guides"]
    errors["docs/errors/\nOptional local generated error reference"]
    api["cargo doc\nRust API docs (auto-generated)"]

    main -->|"links to"| spec
    main -->|"links to"| errors
    main -.->|"complements"| api

Where Documentation Goes

Content type	Location	Examples
User guides, CHAT format reference	`book/src/chatter/user-guide/`, `book/src/chat-format/`	CLI usage, validation errors
Architecture and design	`book/src/architecture/`	Parsing, data model, concurrency, memory
Contributor workflows	`book/src/contributing/`	Grammar workflow, testing, coding standards
Integrator contracts	`book/src/chatter/integrating/`	JSON schema, diagnostic contract
Technical reference and audits	`book/src/` (Technical Reference section)	Parity audits, UTF-8 audit, risk register
Spec authoring guides	`spec/docs/`	Error spec format, curation workflow
Generated error docs	`docs/errors/`	Optional local output from `gen_error_docs`; source of truth stays in `spec/errors/`
Historical/archived docs	project archive	Old audits, superseded proposals
AI assistant context	`CLAUDE.md` files (per repo/subdir)	Not documentation for humans

Rules

One canonical page per topic. No duplicate coverage across locations.
No crate-level docs/ directories. Architectural explanations go in the book. Crate API docs come from /// doc comments via cargo doc.
Satellites stay only when the audience is editing files in that directory. Spec authors need WRITING_ERROR_SPECS.md next to their specs. Everyone else reads the book.
Generated docs are build artifacts. Never hand-edit docs/errors/. If you need that local reference set, regenerate it with gen_error_docs.
Historical docs go to the project archive. Don’t keep old audit logs, investigation notes, or superseded proposals in the public repo.

One unified book

There is one mdBook for this repo at book/, titled “Chatter, TalkBank CHAT Toolchain”, organized by audience-first sections under book/src/:

Section	Audience	Content
`book/src/chatter/`	chatter CLI users + integrators	CLI reference, library usage, JSON contracts
`book/src/chat-format/`	All users + integrators	CHAT format reference (headers, tiers, symbols)
`book/src/architecture/`	All devs	Cross-surface architecture, parser/grammar/data-model design
`book/src/contributing/`	Contributors	Setup, testing, coding standards, dev checks

One book.toml and one SUMMARY.md for the whole tree. Cross-section links resolve as ordinary in-book paths.

CHAT Processing Playbook for Developers

Status: Current Last updated: 2026-03-23 23:49 EDT

Objective

Provide an implementation playbook for developers building or extending CHAT parsing, validation, transformation, and serialization logic.

Mental Model

Treat CHAT processing as a layered pipeline:

Ingest bytes and normalize line boundaries.
Parse syntax into structured model with exact spans.
Validate semantic rules with structured diagnostics.
Transform or enrich model without breaking invariants.
Serialize in canonical form.
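The layering can be pictured with a deliberately tiny sketch. Everything here is a toy: `ToyModel`, the function names, and the single validation rule are assumptions for illustration and do not mirror the real crate APIs. The transform stage is omitted, and spans are left out for brevity.

```rust
/// Toy model: one entry per line. The real model is far richer.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ToyModel {
    pub lines: Vec<String>,
}

/// Stage 1: ingest bytes and normalize CRLF / CR line boundaries to LF.
pub fn ingest(bytes: &[u8]) -> Result<String, std::str::Utf8Error> {
    let text = std::str::from_utf8(bytes)?;
    Ok(text.replace("\r\n", "\n").replace('\r', "\n"))
}

/// Stage 2: parse syntax into a structured model (here: just split lines).
pub fn parse(text: &str) -> ToyModel {
    ToyModel {
        lines: text.lines().map(str::to_owned).collect(),
    }
}

/// Stage 3: validate semantic rules, kept separate from parsing.
/// Toy rule: the last line must be the `@End` header.
pub fn validate(model: &ToyModel) -> Vec<String> {
    let mut diagnostics = Vec::new();
    if model.lines.last().map(String::as_str) != Some("@End") {
        diagnostics.push("file does not end with @End".to_owned());
    }
    diagnostics
}

/// Final stage: serialize in canonical form (LF-terminated lines).
pub fn serialize(model: &ToyModel) -> String {
    let mut out = String::new();
    for line in &model.lines {
        out.push_str(line);
        out.push('\n');
    }
    out
}
```

Keeping each stage a separate function makes it possible to test parser acceptance and semantic validation independently, and to assert that serializing a parsed file reproduces the normalized input.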

Developer Workflow

Start from a concrete fixture or corpus case.
Add/adjust parser behavior with contract tests first.
Add semantic validator rules separately from parser acceptance.
Confirm roundtrip and equivalence gates.
Update docs for any visible behavior or policy change.

Tier Dispatch Strategy

Use cheap byte-prefix dispatch before heavy parsing:

@ => header candidate,
* => main tier,
% => dependent tier,
continuation rules and whitespace handled deterministically.

This preserves performance and isolates error contexts earlier.
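A minimal sketch of byte-prefix dispatch, assuming hypothetical names (`LineKind`, `classify_line`) and assuming that a continuation line is one that begins with a tab. The real dispatcher's continuation and whitespace rules may differ.

```rust
/// Coarse line classification decided from the first byte alone.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LineKind {
    Header,
    MainTier,
    DependentTier,
    Continuation,
    Other,
}

// Named prefixes, so the literals are not repeated across modules.
const HEADER_PREFIX: u8 = b'@';
const MAIN_TIER_PREFIX: u8 = b'*';
const DEPENDENT_TIER_PREFIX: u8 = b'%';
// Assumption for this sketch: continuation lines start with a tab.
const CONTINUATION_PREFIX: u8 = b'\t';

/// Classify a line without decoding or parsing its body.
pub fn classify_line(line: &[u8]) -> LineKind {
    match line.first() {
        Some(&HEADER_PREFIX) => LineKind::Header,
        Some(&MAIN_TIER_PREFIX) => LineKind::MainTier,
        Some(&DEPENDENT_TIER_PREFIX) => LineKind::DependentTier,
        Some(&CONTINUATION_PREFIX) => LineKind::Continuation,
        _ => LineKind::Other,
    }
}
```

Only after this one-byte check does a line get routed to the heavier header, main-tier, or dependent-tier parser, so an error is attributed to the right context from the start.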

For downstream batchalign3 consumers, tier dispatch is only the front door. The important contract is what happens after dispatch: parse-health taint, recovery vs rejection, and whether a tier is safe to pass into alignment.

Word Parsing Rules of Thumb

Parse suffix markers in strict order (@..., @s..., $...) with explicit precedence.
Keep raw_text exact, cleaned_text policy-driven and test-locked.
Treat CA delimiters and special symbols via centralized symbol sets.
Never embed ad hoc symbol literals in multiple files.

Error Handling Contract

Every parser failure should produce structured diagnostics with:
- code,
- severity,
- span,
- context,
- message.
Avoid silent fallback behavior unless policy explicitly allows it.
If fallback occurs, emit warning-grade diagnostics where relevant.
Never fabricate semantic placeholders (empty required text, arbitrary enum default, fake word/chunk) to satisfy type construction.
Prefer None/partial outcome + diagnostics over synthetic model values.
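The contract above might be sketched as follows. `Diagnostic`, `ParseOutcome`, `parse_speaker`, and the `SKETCH-0001` code are hypothetical names invented for this example and are not the types in the talkbank crates. The point is that a failed parse returns diagnostics rather than a fabricated empty speaker.

```rust
use std::ops::Range;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Severity {
    Error,
    Warning,
}

/// The five fields the contract asks every parser failure to carry.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Diagnostic {
    pub code: &'static str,
    pub severity: Severity,
    pub span: Range<usize>,
    pub context: String,
    pub message: String,
}

/// Explicit outcome: a real value or diagnostics, never a placeholder value.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ParseOutcome<T> {
    Parsed(T),
    Rejected(Vec<Diagnostic>),
}

/// Placeholder code for this sketch only; real codes live in the error specs.
const SKETCH_CODE: &str = "SKETCH-0001";

fn rejected<T>(span: Range<usize>, line: &str, message: &str) -> ParseOutcome<T> {
    ParseOutcome::Rejected(vec![Diagnostic {
        code: SKETCH_CODE,
        severity: Severity::Error,
        span,
        context: line.to_owned(),
        message: message.to_owned(),
    }])
}

/// Toy parser: extract the speaker code from a line like "*CHI:\thello .".
pub fn parse_speaker(line: &str) -> ParseOutcome<String> {
    let Some(rest) = line.strip_prefix('*') else {
        return rejected(0..0, line, "main tier must start with '*'");
    };
    match rest.find(':') {
        None => rejected(1..line.len(), line, "missing ':' after speaker code"),
        Some(0) => rejected(1..1, line, "empty speaker code"),
        Some(end) => ParseOutcome::Parsed(rest[..end].to_owned()),
    }
}
```

A caller has to match on `ParseOutcome`, so the rejection path cannot be skipped the way an ignored `Option` plus side-channel error sink can.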

Span Discipline

Offsets are absolute across full file content.
Nested parser helpers must accept base offset and return shifted spans.
Add tests for boundary and continuation-line spans.
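A small sketch of the base-offset rule, using a hypothetical helper `find_word_spans`. The helper scans only a fragment of the file, but every span it returns is shifted by `base_offset` so that it indexes into the full file content.

```rust
use std::ops::Range;

/// Hypothetical nested helper: find whitespace-separated words in `fragment`
/// and report their byte spans relative to the whole file.
pub fn find_word_spans(fragment: &str, base_offset: usize) -> Vec<Range<usize>> {
    let mut spans = Vec::new();
    let mut start: Option<usize> = None;

    for (i, ch) in fragment.char_indices() {
        match (ch.is_whitespace(), start) {
            // First non-whitespace character opens a word.
            (false, None) => start = Some(i),
            // Whitespace after an open word closes it; shift by the base offset.
            (true, Some(s)) => {
                spans.push(base_offset + s..base_offset + i);
                start = None;
            }
            _ => {}
        }
    }
    // Close a word that runs to the end of the fragment.
    if let Some(s) = start {
        spans.push(base_offset + s..base_offset + fragment.len());
    }
    spans
}
```

For the line `*CHI:\thello world .`, the tier content starts at byte 6, so calling the helper on that content with `base_offset = 6` yields spans that slice the original line directly.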

Performance Policy

Prefer byte-oriented prechecks for top-level dispatch and simple delimiters.
Use parser combinators for structural parsing, not for obvious constant-prefix routing.
Measure parser performance on representative corpus slices before/after major changes.

Common Failure Patterns and Fixes

Symptom: semantic mismatch only in snapshots.
- Fix: compare parser outputs directly and isolate first structural delta.
Symptom: generated tests pass, corpus fails.
- Fix: add missing fixture, decide parse-vs-validate placement, lock behavior.
Symptom: output drift after grammar edit.
- Fix: run full regeneration and the parser-equivalence contract suite before merge.

Batchalign3 Surface Checks

When a change affects the surface used by batchalign3, confirm:

full-file parse equivalence still holds for corpus coverage
alignment-sensitive downstream tiers still gate on parse-health appropriately

Review Checklist for Parser PRs

New or changed behavior has targeted tests.
Equivalence suite status is attached.
Snapshot updates are intentional and explained.
No hidden magic symbols or magic string literals introduced.
Docs updated where user-visible behavior changes.

Required Artifacts for Significant Changes

Design note (architecture decision record in the book).
Before/after examples.
Impacted fixtures list.
Migration implications for integrators.

GitHub Readiness and Open Source Governance

Status: Current Last modified: 2026-06-21 21:33 EDT

Objective

Prepare TalkBank/chatter to operate as a healthy public project with clear legal, security, contribution, and release processes.

Root Artifacts

Artifact	Status	Notes
`LICENSE-MIT` + `LICENSE-APACHE`	Done	Dual-licensed `MIT OR Apache-2.0` (standard Rust convention; both files present at root, no combined `LICENSE`). Every crate inherits `license = "MIT OR Apache-2.0"` from `[workspace.package]`.
`CONTRIBUTING.md`	Done	Setup, standards, PR flow, pre-PR checklist
`CODE_OF_CONDUCT.md`	TODO (deferred)	Intentionally absent for now: it is held until a durable enforcement contact (an institutional address or successor handle, not an individual) is settled. The plan is to adopt the Contributor Covenant once that contact exists.
`SECURITY.md`	Done	Root file added; issue-template contact link now resolves to a real policy
`CODEOWNERS`	TODO	Not added yet: repo contents do not currently publish an authoritative GitHub owner/team map for path-level review ownership
`.github/workflows/*.yml`	Done	`ci.yml` (Rust build+test, mdBook, Rust-version-sync) + `cross-platform.yml` (OS matrix) + `clippy-rolling.yml` + `crates-io-foundation.yml` + `release.yml` + `release-desktop.yml`
`.github/ISSUE_TEMPLATE/*`	Done	Bug report + feature request (YAML forms)
Pull request template	Done	`.github/PULL_REQUEST_TEMPLATE.md` mirrors current CONTRIBUTING + PR review requirements

CI Governance Policy

Required status checks: the ci.yml jobs that run on every pull request (Rust build + test, mdBook build, and Rust version pins in sync). See Branch Protection for the exact GitHub check names and for the other workflow (cross-platform.yml) that is deliberately not in the required set.
Branch protection rules: documented in Branch Protection; configure on GitHub once the repo is public.

Release Governance

Releases: the CLI and desktop app are published as signed GitHub Releases (cargo-dist); the Rust crates are source-available (not yet on crates.io).
Cargo publication governance: first-wave crates.io foundations are documented in Crates.io Publication and checked by .github/workflows/crates-io-foundation.yml.
Binary release governance: release.yml is reserved for cargo-dist GitHub Release packaging of dist-enabled artifacts. It is not the crates.io publication workflow.
Tagging rule: do not treat version tags as authorization to publish new surfaces. A surface becomes stable only when its release notes explicitly say so and its public distribution channel is live.
Release-note rule: every public release note must state the surface’s distribution channel, support boundary, and any closely related surfaces that remain held back.

Community Operations

Label taxonomy: bug and enhancement auto-applied by issue templates. Richer taxonomy (drift, spec, grammar, parser, docs, good first issue): TODO (GitHub settings).
Contributor pathway: CONTRIBUTING.md covers setup and PR flow. First-time/advanced contributor pathways: TODO.
Public project roadmap: TODO.

Supply Chain and Security

Dependency scanning: CI runs rustsec/audit-check and cargo-deny (with deny.toml). Automated update PRs (Dependabot/Renovate): TODO.
Signed release artifacts: TODO.
Security advisories process: documented in SECURITY.md.

Acceptance Criteria

Repo has complete governance artifacts at root.
CI and branch protections enforce stated policy.
Contributors can onboard and submit PRs without tribal knowledge.
Release/support tiers are documented per surface.
Release process is repeatable and documented.

Rust Compilation Times: Findings and Optimizations

Status: Reference (historical analysis; current Cargo.toml profile knobs are the source of truth) Last updated: 2026-05-20 20:32 EDT

This document captures the compilation performance analysis that drove the current dev/test profile knobs in the workspace root Cargo.toml. The absolute measurements below were taken before the 2026-04-28 batchalign3 fold roughly tripled the third-party dependency surface; subsequent updates are reflected in Cargo.toml comments, which are the source of truth.

Background: How Rust Compilation Works

Rust compilation has two key mechanisms for speed:

Incremental compilation: When you change one file and rebuild, the compiler remembers which “codegen units” within each crate were affected and only recompiles those. This is the primary speedup mechanism for local iterative development (edit-compile-test cycles).
Crate-level caching: Cargo tracks which crates have changed inputs (source files, dependencies, feature flags). Unchanged crates are skipped entirely. This helps when you edit a leaf crate and don’t need to rebuild unrelated crates.

Additionally, there are external tools:

sccache: A shared compilation cache that stores compiled artifacts by content hash. Designed for CI environments where builds start from a clean state. It works by wrapping rustc and checking a cache before invoking the real compiler.
Linker choice: The linker runs after all crates are compiled to produce the final binary. Faster linkers (like lld) can shave seconds off link time for large binaries.

What We Found

Problem 1: sccache Was Disabling Incremental Compilation (Critical)

The global ~/.cargo/config.toml had:

[build]
rustc-wrapper = "/opt/homebrew/bin/sccache"

This caused two compounding problems:

sccache disables Rust incremental compilation entirely. When a rustc-wrapper is set, Cargo cannot use incremental mode because the wrapper interposes between Cargo and rustc, breaking the incremental artifact protocol.
sccache had near-zero cache benefit for this workspace. The sccache stats showed a 2.7% Rust cache hit rate. Out of 37 compilations, 36 were marked “non-cacheable” because rlib crates (library crates, which is what most workspace crates produce) cannot be cached by sccache.

The result: every cargo build after a one-line change was effectively a clean rebuild of the entire dependency chain. A change to talkbank-model (near the root of the crate graph) triggered a full recompile of 11+ downstream crates, taking 60-90 seconds even for a trivial edit.

Problem 2: Full Debug Info Was Inflating Link Times

The dev profile was generating full DWARF debug info (level 2), which includes:

Type definitions for every struct/enum
Variable location info for debugger inspection
Full scope and lifetime metadata

This produces large .dSYM bundles and .o files, increasing linker input size and slowing down the link phase.

Problem 3: Third-Party Dependencies at -O0

All third-party crates (serde, regex, tree-sitter, etc.) were compiled at opt-level = 0 in dev builds. Since these crates rarely change, this was a pure penalty: slow runtime (tests using serde deserialization, tree-sitter parsing, or regex matching ran ~10x slower than necessary) with no compile-time benefit after the first build.

Non-Problem: lld Linker

The linker = "lld" setting in the global cargo config was fine. On macOS this uses ld64.lld from Homebrew’s LLVM toolchain (LLD 21.1.8), which is slightly faster than Apple’s default linker for workspaces of this size. No change needed.

Changes Made

Change 1: Project-Local sccache Override

Created .cargo/config.toml in the project root:

[build]
rustc-wrapper = ""

This overrides the global sccache setting for this project only, re-enabling incremental compilation. Other Rust projects on the system are unaffected.

Why not modify the global config? Keeping the project-local override is safer, because sccache may still be useful for other projects or CI workflows.

Note: .cargo/config.toml is gitignored (not committed) because the empty-string rustc-wrapper = "" value trips a cargo-llvm-cov bug that treats "" as a real wrapper path instead of “no wrapper.” Each contributor opts in locally; CI does not carry the override.

Change 2: Reduced Debug Info

In the workspace Cargo.toml:

[profile.dev]
debug = "line-tables-only"

[profile.test]
debug = "line-tables-only"

This generates only file/line number information for backtraces, skipping the bulky type and variable metadata. You still get useful panic/backtrace output with source locations; you just can’t inspect local variables in a debugger (lldb/gdb). For most development workflows this is the right tradeoff.

Change 3: Optimized Third-Party Dependencies (retired post-fold)

The original change set [profile.dev.package."*"] opt-level = 1 to optimize every third-party crate. After the 2026-04-28 batchalign3 fold roughly tripled the third-party dependency surface (axum, async-trait, tokio’s full feature set, etc.), the build-time cost of this setting became prohibitive, and the workspace Cargo.toml comment block now explains why it was removed.

[profile.test.package."*"] opt-level = 1 was also removed for the same reason; for specific tests where runtime is the bottleneck, opt in locally rather than reintroducing the workspace-wide setting.

Results (pre-fold, 2026-03 measurement)

The numbers below were captured pre-fold against the original ten-crate workspace. The fold roughly tripled the third-party dep set and forced retiring [profile.dev.package."*"] opt-level = 1; today’s wall-clock will be slower and depends on which crate you touched. Re-run cargo build --timings on the current workspace if you need fresh numbers.

Scenario	Before	After (pre-fold)
Clean build	~3-5 min (est.)	~39s
Incremental rebuild (touch `talkbank-model`)	~60-90s	~4s
Test runtime (serde/regex/tree-sitter hot paths)	Slow (-O0)	Faster (-O1, when opt-in)

Optional: Cranelift Backend for Maximum Iteration Speed

For the fastest possible “does it compile?” checks during rapid iteration, Rust nightly supports the Cranelift codegen backend:

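rustup component add rustc-codegen-cranelift-preview --toolchain nightly
CARGO_PROFILE_DEV_CODEGEN_BACKEND=cranelift cargo +nightly build -Zcodegen-backend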

Cranelift generates code ~2x faster than LLVM but produces unoptimized output and is nightly-only. It is useful for compile-check cycles but not for correctness testing or benchmarking.

General Principles for Rust Compile Time

Incremental compilation is king for local dev. Anything that disables it (sccache, certain rustc-wrapper tools) is a net negative for iterative development.
sccache is for CI, not local dev. It shines when doing clean builds from scratch (CI runners, cross-compilation). For edit-rebuild cycles, incremental compilation is far more valuable.
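Optimize dependencies, not your own crates, while the dependency set is small. [profile.dev.package."*"] with opt-level = 1 gives you faster test execution with minimal compile cost when dependencies rarely change; this workspace retired the setting once the post-fold dependency surface made its build-time cost prohibitive (see Change 3).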
Debug info has a real cost. Full DWARF debug info inflates binary sizes and link times. Use line-tables-only unless you actively need a debugger.
Measure before optimizing. Use cargo build --timings to generate an HTML report showing per-crate compile times and parallelism. Use sccache --show-stats to verify cache effectiveness.
Watch for crate graph bottlenecks. Crates that sit at the root of the dependency graph (like talkbank-model) are the critical path: changes to them trigger the longest rebuild chains. Keep these crates lean and consider splitting them if they grow too large.

Developer Verification Checks

Status: Current Last modified: 2026-05-30 20:13 EDT

This page defines the current local verification expectations for TalkBank/chatter.

There is not yet a repo-local make verify wrapper in this checkout. Use the concrete commands below instead.

Core local sweep

Run this from the repository root before opening or merging substantial changes:

cargo fmt --all -- --check
cargo check --workspace --all-targets
cargo build --workspace --all-targets --locked
cargo nextest run --workspace
cargo test --doc

Surface-specific additions

Add the checks that match the surface you changed:

Grammar changes

cd grammar && tree-sitter generate && tree-sitter test

Spec tooling changes

cargo build --manifest-path spec/tools/Cargo.toml
cargo build --manifest-path spec/runtime-tools/Cargo.toml

Parser / model / alignment / serialization changes

cargo nextest run -p talkbank-parser-tests -E 'test(parser_equivalence)'
cargo nextest run -p talkbank-parser-tests --test roundtrip_reference_corpus

See Setup and Spec Workflow for the surface-specific regeneration guidance.

When to Run

Always before creating a PR.
Always before merging parser, spec-tool, grammar, or generated-artifact changes.
Again after rebasing if upstream changed the same surface.

Additional Engineering Checks

Run these in addition to the core sweep when touching parser/model code:

cargo test -p talkbank-parser --test test_parse_health_recovery
cargo nextest run -p talkbank-parser-tests --test parser_equivalence_files

These protect against regressions in:

parser recovery without sentinel fabrication
parse-health taint propagation
parser semantic equivalence

Failure Policy

If any required check fails, do not merge.
Fix the failing check or scope down the change.
If the failure is unrelated and pre-existing, document it in the PR and open a blocker issue.

Recommended Fast Loop During Development

Use narrower loops while iterating, then run the full sweep before final review. For a broad Rust verification pass:

cargo test --workspace

For grammar-only edits, prefer the smallest relevant loop first:

cd grammar && tree-sitter test
cargo nextest run -p talkbank-parser

Only reach for spec/symbol regeneration when the change truly affects generated artifacts; do not treat regeneration as a substitute for choosing the right regression test.

Branch Protection and Required CI Checks

Status: Current Last updated: 2026-06-15 15:00 EDT

This page defines the required status checks and protection policy for main.

Branch Protection Policy

Enable branch protection for main with:

Require pull request before merge.
Require approvals (minimum 1; maintainers may set higher).
Require conversation resolution before merge.
Require status checks to pass before merge.
Restrict force pushes and branch deletions.

Required Status Checks

Configure these CI checks as required. The names are the GitHub check names, which come from each job’s name: key in .github/workflows/ci.yml. That workflow runs on every pull request to main:

Rust build + test
mdBook build
Rust version pins in sync

One other workflow is deliberately NOT in the required set:

cross-platform.yml (the Ubuntu + macOS + Windows matrix) runs on push to main, a daily schedule, and manual dispatch, NOT on pull requests, so it cannot report a status on a PR and must not be required (requiring it would block every merge). It is a post-merge and daily drift gate. Add a pull_request trigger first if you want it required.

Optional Hardening

Require branches to be up to date before merging.
Enable merge queue if PR volume increases.
Restrict who can dismiss stale reviews.

Operational Rule

If required checks fail:

Do not bypass protection.
Fix the issue or revert the breaking change.
Re-run checks until green.

Reference Corpus Overhaul

Status: Historical (Phase 0-6 narrative is preserved for context; the live corpus layout is described in Testing § Reference Corpus. Read that first for current counts and structure.) Last modified: 2026-05-29 18:43 EDT

Subsequent reorganization moved the corpus from the 345-flat-plus-language-subdirs layout described below into nine topical subdirectories under corpus/reference/. Absolute counts in this page (file totals, language-dir counts, the constructs/ directory) reflect the pre-reorganization state and are kept here only as the historical record of how the corpus got to where it is.

Motivation

The reference corpus (corpus/reference/) is the 100%-pass quality gate for all parser/grammar changes. The parser must handle every file at 100%. Before this overhaul, the corpus had three problems:

Language monoculture: 345 files, all English. We have 100K+ real files across 42 languages in the corpus data directory but the gate only tested English.
Construct gaps: 18 concrete grammar node types were never exercised (e.g., interrupted_question, scoped_best_guess, trailing_off_question). A grammar regression affecting these constructs would pass CI undetected.
Error coverage gaps: 27 error specs were stubs (no CHAT example), 4 error codes had no spec file at all.

Strategy

Fresh build, not incremental patching. We kept the existing 345 English files as-is (they encode years of parser fixes) and added multilingual files + construct gap-fillers on top.

Phase 0: Coverage Tooling

Built corpus_node_coverage (spec/tools/src/bin/corpus_node_coverage.rs) to measure which of the 334 concrete grammar node types the corpus exercises. Running against the old 345-file corpus confirmed exactly 18 gaps.
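The coverage measurement amounts to a set difference between the grammar's concrete node types and the node types seen in the corpus. The following is a minimal sketch of that arithmetic, not the code of corpus_node_coverage itself; the function names and the three-element sample sets are assumptions made for illustration.

```rust
use std::collections::BTreeSet;

/// Node types declared by the grammar that never appear in the corpus.
fn uncovered<'a>(grammar: &BTreeSet<&'a str>, observed: &BTreeSet<&'a str>) -> Vec<&'a str> {
    grammar.difference(observed).copied().collect()
}

/// Coverage as a percentage, rounded to one decimal place.
fn coverage_percent(covered: usize, total: usize) -> f64 {
    if total == 0 {
        return 100.0;
    }
    (covered as f64 / total as f64 * 1000.0).round() / 10.0
}

fn main() {
    // Illustrative subsets only; the real grammar has 334 concrete node types.
    let grammar = BTreeSet::from(["utf8_header", "uptake_symbol", "scoped_best_guess"]);
    let observed = BTreeSet::from(["utf8_header"]);

    println!("gaps: {:?}", uncovered(&grammar, &observed));
    // The pre-overhaul corpus exercised 316 of 334 types.
    println!("coverage: {}%", coverage_percent(316, 334));
}
```

With 18 gaps out of 334 types, this arithmetic gives the 94.6% figure reported in the Final State table.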

Phase 1: Language Selection & File Extraction

Built extract_corpus_candidates (spec/runtime-tools/src/bin/extract_corpus_candidates.rs) to automatically select representative files from the corpus data directory for 20 target languages:

eng, zho, fra, deu, spa, jpn, nld, heb, por, ell,
tur, hrv, pol, ita, hun, rus, est, dan, ara, isl

Selection criteria:

Clean tree-sitter parsing with no ERROR nodes (mandatory)
Short files (under 200 lines, preferring 15-100)
Varied tiers (%mor/%gra/%pho/%com)
Multiple speakers preferred
Privacy: explicitly skip Password directories in the corpus data directory

For each language, the tool scored and ranked candidates. We selected 1-2 files per language (25 files total across 20 language subdirectories).
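The selection criteria can be read as two hard filters plus a handful of preferences. The sketch below shows one way to express that split. The Candidate struct, the weights, and the assumption that protected data sits under a path component literally named Password are all hypothetical; the actual scoring in extract_corpus_candidates is not described on this page.

```rust
use std::path::Path;

/// Hypothetical summary of one candidate file.
struct Candidate<'a> {
    path: &'a Path,
    line_count: usize,
    has_error_nodes: bool,
    distinct_tiers: usize,
    speakers: usize,
}

/// Returns `None` when a mandatory criterion fails, otherwise an illustrative score.
fn score(c: &Candidate<'_>) -> Option<u32> {
    // Hard filters: privacy exclusion, clean parse, and the 200-line ceiling.
    let in_password_dir = c.path.components().any(|part| part.as_os_str() == "Password");
    if in_password_dir || c.has_error_nodes || c.line_count >= 200 {
        return None;
    }

    // Preferences: the weights are made up for this sketch.
    let mut score = 0;
    if (15..=100).contains(&c.line_count) {
        score += 2;
    }
    score += c.distinct_tiers.min(4) as u32;
    if c.speakers > 1 {
        score += 1;
    }
    Some(score)
}

fn main() {
    let candidate = Candidate {
        path: Path::new("data/fra/sample.cha"), // hypothetical path
        line_count: 42,
        has_error_nodes: false,
        distinct_tiers: 3,
        speakers: 2,
    };
    println!("{:?}", score(&candidate));
}
```

Keeping the mandatory checks separate from the score means a file with parse errors can never be ranked in, however well it does on the preferences.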

Phase 2: Construct Gap-Filling

Created 4 handcrafted files in corpus/reference/constructs/ to exercise the 18 missing node types that don’t appear in real-world data:

File	Node types exercised
`rare-terminators.cha`	`interrupted_question`, `self_interrupted_question`, `self_interruption`, `trailing_off_question`
`uptake.cha`	`uptake_symbol`
`best-guess.cha`	`scoped_best_guess`
`unsupported.cha`	`thumbnail_header`, `unsupported_header`, `unsupported_dependent_tier`, `unsupported_line`, `unsupported_header_prefix`, `unsupported_tier_prefix`

Other gaps (l1_of_header, utf8_header, etc.) were already covered by the language files or were confirmed as supertypes (not concrete).

Result: 334/334 concrete types exercised (100%).

Phase 3: Tier Regeneration

Ran batchalign3 morphotag on all 25 language files to generate fresh %mor/%gra tiers:

cd /path/to/batchalign3
uv run batchalign3 morphotag /path/to/chatter/corpus/reference/{lang}/ --in-place

All 20 languages are covered by Stanza’s UD models. Validation confirmed all 374 files pass parser equivalence and roundtrip.

Phase 4: Error Corpus Expansion

4.1: Created 3 missing error specs (E707, E711, E717) with CHAT examples and metadata. Fixed E376 (had wrong error code E208 in metadata).

4.2: Filled 17 triggerable stub specs with CHAT examples:

Cross-utterance validation (E341, E351-E355)
Parser recovery warnings (E319-E322, E325, E326)
Underline tier errors (E356-E357)
Overlap index errors (E373)
Direct parser tier errors (E381, E384)

4.3: Documented 12 untriggerable stubs (internal, deprecated, or not-yet-wired error codes) with explanations of why no example is possible: E001, E002, E211, E317, E318, E340, E374, E377, E378, E380, E385, E386.

4.4: Corrected 5 misclassified specs where examples triggered different error codes than intended (E319-E322, E376). Added Status: not_implemented and explanatory notes.

4.5: Built perturbation tool (spec/tools/src/bin/perturb_corpus.rs) with 11 mutation strategies that take a valid .cha file and produce controlled mutations targeting specific error codes:

Perturbation	Target Error
`delete-participants`	E501
`delete-languages`	E503
`delete-id`	E504
`undeclared-speaker`	E308
`delete-terminator`	E305
`extra-mor-word`	E706
`fewer-mor-words`	E705
`delete-begin`	E502
`delete-end`	E510
`duplicate-participants`	E511
`mor-terminator-mismatch`	E716
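Several of these strategies reduce to removing a single header line from an otherwise valid file. As a rough illustration of that idea (not the implementation in perturb_corpus.rs), a header-deletion mutation might look like the following; delete_header and is_header are hypothetical names.

```rust
/// True when `line` is the header `name`, either bare (`@End`) or with a value (`@ID:...`).
fn is_header(line: &str, name: &str) -> bool {
    match line.strip_prefix(name) {
        Some(rest) => rest.is_empty() || rest.starts_with(':'),
        None => false,
    }
}

/// Returns a copy of `chat` with every occurrence of the named header removed.
fn delete_header(chat: &str, name: &str) -> String {
    chat.lines()
        .filter(|line| !is_header(line, name))
        .map(|line| format!("{}\n", line))
        .collect()
}

fn main() {
    let valid = "@UTF8\n@Begin\n@Languages:\teng\n@Participants:\tCHI Target_Child\n@ID:\teng|test|CHI|||||Target_Child|||\n*CHI:\thello .\n@End\n";

    // Corresponds to the `delete-end` row, which the table pairs with E510.
    print!("{}", delete_header(valid, "@End"));
}
```

Because each mutation changes exactly one thing, the mutated file can be paired with the single error code it is expected to trigger.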

Also includes a mining mode (--mine DIR) that scans real data for tree-sitter ERROR nodes, with automatic Password directory exclusion.

4.6: Regenerated golden artifacts: all 8 golden generators + audit + bootstrap:

Artifact	Lines
`golden_words.txt`	769 (1949 unique words)
`golden_mor_tiers.txt`	405
`golden_gra_tiers.txt`	7
`golden_main_tiers.txt`	607
`golden_pho_tiers.txt`	25
`golden_wor_tiers.txt`	7
`golden_sin_tiers.txt`	5
`golden_com_tiers.txt`	24
`golden_words_featured.txt`	96
`golden_words_minimal.txt`	62

Bootstrap regenerated reference_corpus.rs with 374 test cases.

Phase 5: CI Integration & Validation

At that milestone, the then-current verification sweep passed:

Parser equivalence: 377/377 (374 files + 3 extra)
Node coverage: 334/334 (100%)
Error coverage: 181/181 (100%), 169 with CHAT examples, 12 documented stubs
The parser-equivalence and reference-corpus regression gates passed

Phase 6: Cleanup & Documentation

Updated file count references (339→374) across CLAUDE.md files
Rewrote corpus/README.md with new structure
Updated memory files

Final State

corpus/reference/           374 files total
  *.cha                     345 files (original English corpus)
  constructs/                 4 files (rare grammar constructs)
  {20 language dirs}/        25 files (multilingual, from corpus data)

Metric	Before	After
Total files	345	374
Languages	1 (English)	20
Concrete node coverage	316/334 (94.6%)	334/334 (100%)
Error specs	177/181 (97.8%)	181/181 (100%)
Error specs with examples	~150	169
Documented stubs	0	12
Golden artifacts	Stale	Freshly regenerated

Tools Built

Tool	Path	Purpose
`corpus_node_coverage`	`spec/tools/src/bin/`	Grammar node type coverage
`extract_corpus_candidates`	`spec/runtime-tools/src/bin/`	Automated file selection from corpus data
`perturb_corpus`	`spec/tools/src/bin/`	Error file generation by mutation

What Worked

extract_corpus_candidates: Automated scoring eliminated guesswork in file selection. Files were high-quality, short, and diverse.
construct gap-filling: 4 handcrafted files closed 18 gaps efficiently.
Keeping existing 345 files: No breakage, no regressions. The new files are purely additive.
batchalign3 morphotag: Generated correct %mor/%gra for all 20 languages without manual intervention.

What Didn’t Work / Lessons Learned

Mining real errors from corpus data: The MacWhinney subcorpus (407 files) had zero tree-sitter parse errors; the data is too clean. Mining is slow on large directories (>4 minutes for all of Eng-NA). The perturbation approach is more effective for systematic error coverage.
Parser recovery error specs (E319-E322): Writing examples that trigger specific tree-sitter error recovery codes is very difficult. Tree-sitter’s error recovery is robust and routes most malformed input through generic paths (E316) rather than the specific recovery codes. These remain as documented stubs.
Direct parser vs unsupported.cha (historical, direct parser has been removed): The former Chumsky direct parser could not handle unsupported_line nodes (failed on constructs/unsupported.cha). This is no longer relevant since tree-sitter is now the sole parser.

Known Remaining Gaps

12 untriggerable error stubs: Internal (E001, E002), deprecated (E211, E317, E318, E340, E374, E377, E378, E380, E385, E386). These are legitimate: the codes either have no emission path or are reserved.
No audio files: Phase 3.3 (audio subset with %wor tiers) was deferred. Adding ~10 short audio clips would test the alignment pipeline end-to-end.
Direct parser roundtrip (historical, direct parser has been removed): 373/374 passed under the former Chumsky direct parser (unsupported.cha failed). No longer relevant since tree-sitter is now the sole parser.
5 parser recovery specs not_implemented: E319-E322, E376. Examples don’t trigger the intended codes due to tree-sitter’s error recovery routing.

Desktop App Testing

Status: Current Last updated: 2026-07-07 21:20 EDT

This document covers the testing strategy for the Chatter desktop app (apps/chatter-desktop/). Testing is split into three tiers by speed and scope.

Testing Tiers

┌─────────────────────────────────────────────────────────┐
│  Tier 3: E2E (WebdriverIO + tauri-driver)               │
│  Real app, real DOM, real IPC. Slow (~5-10s/test).       │
│  Catches: rendering bugs, IPC wiring, platform quirks.   │
│  Run: manually before releases, optionally in CI.        │
├─────────────────────────────────────────────────────────┤
│  Tier 2: Rust integration tests                          │
│  Real validation pipeline, real event bridge, no GUI.    │
│  Catches: serialization mismatches, event ordering,      │
│  stats consistency, single-file handling.                 │
│  Run: every commit, CI required.                         │
├─────────────────────────────────────────────────────────┤
│  Tier 1: Unit tests (Rust + TypeScript)                  │
│  Pure functions and thin runtime seams in isolation.     │
│  Catches: protocol drift, reducer bugs, CLAN math.       │
│  Run: every commit, CI required.                         │
└─────────────────────────────────────────────────────────┘

Most bugs will be caught by Tier 2. The Rust integration tests exercise the exact same code path as the Tauri commands; they call validate_target_streaming() and the frontend event bridge directly, then verify the JSON shape, field names, event ordering, and stats consistency.

Tier 1 & 2: Unit and integration tests

Running

# TypeScript capability/seam tests
cd apps/chatter-desktop && npm run test:unit

# Rust contract/integration tests
cargo nextest run -p chatter-desktop --test validation_bridge

What they cover

Test	What it verifies
`apps/chatter-desktop/tests/unit/validationRunner.test.cjs`	Validation capability uses centralized command names, subscribes before invoke, and disposes listeners exactly once
`apps/chatter-desktop/tests/unit/validationState.test.cjs`	Validation reducer computes relative file names and merges diagnostics/status immutably
`reference_corpus_no_hard_errors`	every file under `corpus/reference/` produces zero `Severity::Error` (warnings allowed)
`event_lifecycle_has_correct_sequence`	Discovering → Started → FileComplete×N → Finished ordering
`frontend_events_serialize_to_expected_json_shape`	Every event has `type` field; camelCase field names match TypeScript types; diagnostics include `renderedText`
`protocol_contracts_serialize_to_expected_json_shape`	Rust command/event constants and request payloads stay aligned with the TypeScript protocol module
`single_file_validation`	Single-file path validates exactly the selected file
`finished_stats_match_file_events`	`valid + invalid + parseErrors == totalFiles`; FileComplete count matches
`rendered_html_present_for_errors`	Every diagnostic carries non-empty miette HTML with box-drawing characters and `style=` attributes (ANSI colors converted to HTML)
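The two invariants checked by event_lifecycle_has_correct_sequence and finished_stats_match_file_events can be stated compactly in code. The sketch below uses a simplified stand-in Event enum rather than the real FrontendEvent type (whose variants carry more data), so treat the type and the check_run function as assumptions for illustration only.

```rust
/// Simplified stand-in for the event stream; not the real `FrontendEvent`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Event {
    Discovering,
    Started,
    FileComplete,
    Finished { total_files: usize, valid: usize, invalid: usize, parse_errors: usize },
}

/// Checks ordering (Discovering, Started, FileComplete x N, Finished) and stats consistency.
fn check_run(events: &[Event]) -> Result<(), String> {
    match events {
        [Event::Discovering, Event::Started, files @ .., Event::Finished { total_files, valid, invalid, parse_errors }] => {
            if files.iter().any(|e| *e != Event::FileComplete) {
                return Err("unexpected event between Started and Finished".to_string());
            }
            if valid + invalid + parse_errors != *total_files {
                return Err("valid + invalid + parseErrors != totalFiles".to_string());
            }
            if files.len() != *total_files {
                return Err("FileComplete count differs from totalFiles".to_string());
            }
            Ok(())
        }
        _ => Err("expected Discovering, Started, ..., Finished".to_string()),
    }
}

fn main() {
    let run = [
        Event::Discovering,
        Event::Started,
        Event::FileComplete,
        Event::FileComplete,
        Event::Finished { total_files: 2, valid: 1, invalid: 1, parse_errors: 0 },
    ];
    println!("{:?}", check_run(&run));
}
```

A slice pattern keeps the ordering rule readable: the first two and the last event are fixed, and everything in between must be FileComplete.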

Adding new tests

Test file: apps/chatter-desktop/src-tauri/tests/validation_bridge.rs

The tests use collect_events() which runs the real validation pipeline and collects all FrontendEvent values. To test a specific scenario:

#[test]
fn my_scenario() {
    let target = workspace_root().join("path/to/corpus");
    let events = collect_events(&target);
    let summary = summarize(&events);
    // assert on summary fields or individual events
}

Miette rendering pipeline

Error rendering is server-side. Each FrontendDiagnostic carries two renderings:

rendered_html: render_error_with_miette_with_source_colored() produces ANSI-colored text, which ansi-to-html converts to HTML <span style="..."> elements. The frontend displays it in a <pre> block via dangerouslySetInnerHTML. This guarantees identical output to the CLI.
rendered_text: render_error_with_miette_with_source() produces plain text (no ANSI codes) for clean clipboard copy-paste.

The rendered_html_present_for_errors integration test verifies that every error diagnostic includes non-empty HTML containing miette box-drawing characters and style= attributes from ANSI color conversion.

TypeScript seam tests

The TypeScript unit tests compile a focused subset of apps/chatter-desktop/src/ to a temporary CommonJS directory, then run Node’s built-in test runner against the compiled output. This keeps the test toolchain small while still exercising the runtime seam as real JavaScript.

Runner script: apps/chatter-desktop/scripts/run-unit-tests.mjs
Compile config: apps/chatter-desktop/tsconfig.unit.json
Test files: apps/chatter-desktop/tests/unit/*.test.cjs

TypeScript ↔ Rust contract

The Rust integration tests verify that serialized JSON matches what the TypeScript frontend expects. If you change a field name or event structure in events.rs, the frontend_events_serialize_to_expected_json_shape test will catch the mismatch before you discover it at runtime.

The key serde attributes:

#[serde(tag = "type", rename_all = "camelCase")] on enums: variant names become camelCase tag values (fileComplete, not FileComplete).
#[serde(rename_all = "camelCase")] on individual variants: field names become camelCase (totalFiles, not total_files).
Both must be present, because the enum-level rename_all only affects tag names, not field names within variants.

Tier 3: E2E Tests (WebdriverIO)

Prerequisites

cargo install tauri-driver    # WebDriver backend for Tauri (Linux/Windows only)
cargo tauri build --debug     # Build the app binary

Note: tauri-driver only works on Linux and Windows. On macOS, WKWebView does not support WebDriver. Run E2E tests in CI (Linux) or on a Windows machine.

Running

# Terminal 1: start tauri-driver (WebDriver server on :4444)
tauri-driver

# Terminal 2: run the tests
cd apps/chatter-desktop
npm run test:e2e

What they cover

The smoke tests in tests/e2e/smoke.spec.ts verify that the app launches and renders the expected UI elements:

Drop zone with Choose File / Choose Folder buttons
Empty file tree (“No files loaded”)
Empty error panel (“Select a file to view errors”)
Status bar showing “Ready”

Limitations

File dialogs cannot be driven via WebDriver. The native file picker (@tauri-apps/plugin-dialog) opens an OS-level dialog that WebDriver can’t interact with. Options for testing the validation flow:

Test-only Tauri command: add validate_for_test(path) behind #[cfg(debug_assertions)] that bypasses the file dialog
Programmatic invoke: use driver.executeScript() to call window.__TAURI__.core.invoke("validate", { path }) directly
Drag-and-drop simulation: possible but platform-dependent and fragile

For now, the Rust integration tests cover the full validation pipeline. E2E tests focus on UI rendering and user-visible layout.

Adding E2E tests

Test file: apps/chatter-desktop/tests/e2e/*.spec.ts

WebdriverIO provides $() and $$() for CSS selectors, plus Tauri-aware capabilities:

it("should show validation results", async () => {
  // Programmatically trigger validation (bypasses file dialog)
  await browser.executeAsync(async (path, done) => {
    await (window as any).__TAURI__.core.invoke("validate", {
      path,
    });
    // Wait for finished event
    setTimeout(done, 5000);
  }, "/path/to/corpus");

  const tree = await $(".file-tree-panel");
  const text = await tree.getText();
  expect(text).not.toContain("No files loaded");
});

When to run E2E tests

Before releases: manual run to verify the built app works end-to-end
Optionally in CI: requires tauri-driver and a display server (Xvfb on Linux). Slow, so consider running only on release branches.
Not on every commit: the Rust integration tests are fast and cover more ground

Platform-Specific Considerations

Platform	WebView engine	E2E support
macOS	WKWebView	Not supported: `tauri-driver` does not work on macOS (WKWebView has no WebDriver API)
Windows	WebView2 (Chromium)	Full support via `tauri-driver`
Linux	WebKitGTK	Full support via `tauri-driver`; requires Xvfb for headless

macOS limitation: Apple’s WKWebView does not expose a WebDriver endpoint, so tauri-driver cannot drive the app on macOS. E2E tests must run on Linux (CI) or Windows. For local macOS development, rely on the Rust integration tests (Tier 2) and manual smoke testing.

CSS rendering differs slightly between WebKit (Linux) and Chromium (Windows), so visual regressions are possible. Consider screenshot comparison tests if this becomes a problem.

Test Data

All tests use the reference corpus at corpus/reference/. This corpus is checked into the repo and must always pass validation with zero hard errors (warnings are allowed). The exact set of files is whatever find corpus/reference -name '*.cha' -type f returns, and the current warning-emitting files are whatever the validator reports. Do not hard-code those lists here.

Do not create ad-hoc .cha test files. Use existing reference corpus files or ask the user to provide test data.

CI Integration

Add to the existing CI workflow:

# Rust integration tests (fast, always run)
- name: Desktop integration tests
  run: cargo nextest run -p chatter-desktop --test validation_bridge

# E2E tests (slow, release branches only)
- name: Build desktop app
  if: startsWith(github.ref, 'refs/heads/release')
  run: cargo tauri build --debug
- name: E2E smoke tests
  if: startsWith(github.ref, 'refs/heads/release')
  run: |
    tauri-driver &
    sleep 2
    cd apps/chatter-desktop && npm run test:e2e

Library Usage

Status: Current Last modified: 2026-06-21 21:33 EDT

The TalkBank Rust crates can be used as dependencies in your own Rust projects for parsing, validating, and manipulating CHAT files. This page shows the most common entry points; the API reference on docs.rs (once published) is the authoritative source. Until then, treat the rustdoc comments inside each crate’s src/lib.rs as the source of truth.

Examples on this page are mirrored as a real Cargo test at crates/talkbank-transform/tests/book_library_usage_examples.rs. The book renders them as rust,ignore so mdbook doesn’t try to link against the workspace’s many compiled crate variants; the parallel test runs the same code under cargo test and is what catches API drift between this page and the libraries. If you edit either, update both.

Important: some legacy tree-sitter fragment helpers are synthetic rather than semantically honest. They can inject fragment input into boilerplate CHAT text and parse the resulting synthetic file. Prefer full-file parsing for real tree-sitter use, and do not treat legacy fragment helpers as the long-term fragment API. For direct-parser fragment semantics, use direct-parser-native tests instead of treating synthetic wrappers as the oracle.

Adding Dependencies

The TalkBank library crates are source-available from this repository. They are not yet published on crates.io, so depend on them from the public repo via git (pinned to a release tag), or via local path dependencies from a TalkBank/chatter checkout for local development:

[dependencies]
talkbank-model = { path = "../chatter/crates/talkbank-model" }
talkbank-transform = { path = "../chatter/crates/talkbank-transform" }
talkbank-parser = { path = "../chatter/crates/talkbank-parser" }

The published-crate workflow is tracked separately; once it lands these paths can become version = "X.Y" deps.

Parsing and Validating a CHAT File

The simplest entry point is parse_and_validate from talkbank-transform. It takes the source text and a ParseValidateOptions, and returns either a fully constructed ChatFile or a PipelineError if parsing or validation failed.

use talkbank_model::ParseValidateOptions;
use talkbank_transform::parse_and_validate;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let source = std::fs::read_to_string("file.cha")?;
    let options = ParseValidateOptions::default().with_validation();
    let chat_file = parse_and_validate(&source, options)?;

    for utt in chat_file.utterances() {
        println!("Speaker: {}", utt.main.speaker);
    }
    Ok(())
}

ChatFile is generic over a ValidationState parameter; the parse_and_validate return type defaults to the validated state. chat_file.utterances() returns an iterator over &Utterance derived from the file’s lines (utterances are interleaved with headers and comments in source order).

For batch workflows where parser construction overhead matters, reuse a single TreeSitterParser and call parse_and_validate_with_parser:

use talkbank_model::ParseValidateOptions;
use talkbank_parser::TreeSitterParser;
use talkbank_transform::parse_and_validate_with_parser;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder: fill this with the paths you want to process.
    let chat_files: Vec<std::path::PathBuf> = Vec::new();
    // Build the parser once and reuse it for every file.
    let parser = TreeSitterParser::new()?;
    let options = ParseValidateOptions::default().with_validation();

    for path in &chat_files {
        let source = std::fs::read_to_string(path)?;
        let chat_file = parse_and_validate_with_parser(&parser, &source, options.clone())?;
        let _ = chat_file;
    }
    Ok(())
}

ParseValidateOptions also exposes with_alignment() (implies with_validation(), additionally validates cross-tier alignment for %mor, %gra, %pho, %wor) and with_strict_linkers() (enables E351-E355 self-completion/other-completion linker checks).

Working with the Model

ChatFile stores participants and language metadata as top-level fields populated from @Participants / @ID / @Languages headers during parsing. Utterances live in lines and are iterated via chat_file.utterances().

use talkbank_model::DependentTier;
use talkbank_model::ParseValidateOptions;
use talkbank_transform::parse_and_validate;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let source = "\
@UTF8
@Begin
@Languages:\teng
@Participants:\tCHI Target_Child
@ID:\teng|test|CHI|||||Target_Child|||
*CHI:\thello world .
%mor:\tco|hello n|world .
@End
";
    let chat_file = parse_and_validate(source, ParseValidateOptions::default().with_validation())?;

    // Participant metadata is top-level on the ChatFile.
    let _participants = &chat_file.participants;

    // Iterate utterances and their dependent tiers.
    for utt in chat_file.utterances() {
        for tier in &utt.dependent_tiers {
            if let DependentTier::Mor(mor_tier) = tier {
                for item in mor_tier.items() {
                    println!("POS: {}, Lemma: {}", item.main.pos, item.main.lemma);
                }
            }
        }
    }
    Ok(())
}

DependentTier is a closed-set enum (Mor, Gra, Pho, Mod, Sin, Act, Add, Com, Err, Exp, Gpx, Int, Lan, …); match on the variants you care about and ignore the rest. MorTier::items() returns &[Mor]; each Mor has a main MorWord plus optional post-clitics.

Serializing to CHAT

Bring the WriteChat trait into scope and call to_chat_string() for a fully-rendered CHAT string, or write_chat(&mut writer) to stream into any std::fmt::Write.

use std::fmt::Write as _;

use talkbank_model::ParseValidateOptions;
use talkbank_model::WriteChat;
use talkbank_transform::parse_and_validate;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let source = "@UTF8\n@Begin\n@Languages:\teng\n@Participants:\tCHI Target_Child\n@ID:\teng|test|CHI|||||Target_Child|||\n*CHI:\thello .\n@End\n";
    let chat_file = parse_and_validate(source, ParseValidateOptions::default().with_validation())?;

    // Convenience: render to a fresh String.
    let chat_text = chat_file.to_chat_string();
    assert!(chat_text.starts_with("@UTF8"));

    // Streaming: write into any std::fmt::Write sink.
    let mut output = String::new();
    chat_file.write_chat(&mut output)?;
    Ok(())
}

Serializing to JSON

Prefer the schema-validated helpers in talkbank_transform::json: to_json_pretty_validated checks the output against the JSON schema and catches drift between the data model and the schema. The unvalidated variants are a faster bypass when you’ve already validated upstream.

use talkbank_model::ParseValidateOptions;
use talkbank_transform::json::to_json_pretty_validated;
use talkbank_transform::parse_and_validate;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let source = "@UTF8\n@Begin\n@Languages:\teng\n@Participants:\tCHI Target_Child\n@ID:\teng|test|CHI|||||Target_Child|||\n*CHI:\thi .\n@End\n";
    let chat_file = parse_and_validate(source, ParseValidateOptions::default().with_validation())?;

    // Serialize and check the result against the JSON schema in one step.
    let json = to_json_pretty_validated(&chat_file)?;
    assert!(json.contains("\"speaker\""));
    Ok(())
}

The schema for ChatFile lives at schema/chat-file.schema.json and is regenerated from the Rust types via cargo test --test generate_schema. For arbitrary serde values (not just ChatFile), to_json_unvalidated / to_json_pretty_unvalidated work the same way without the schema step.

Custom Error Handling

Lower-level parser entry points stream diagnostics through the ErrorSink trait. Implement it to collect, count, filter, or forward errors as they arrive. This is useful when you need finer-grained control than the Result<ChatFile, PipelineError> that parse_and_validate returns.

extern crate talkbank_model;
use talkbank_model::ErrorSink;
use talkbank_model::ParseError;

struct MyErrorHandler;

impl ErrorSink for MyErrorHandler {
    fn report(&self, error: ParseError) {
        // Custom handling: log, filter, count, etc.
        eprintln!("[{}] {}", error.code, error.message);
    }
}

ErrorSink is Send + Sync, and a blanket &T: ErrorSink impl means borrowed references are sinks too, no Arc wrapper required. The built-in ErrorCollector (gathers into a Vec), ParseTracker (counts by severity), and NullErrorSink (discards) cover most common needs; implement ErrorSink directly for everything else.
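Because report takes &self, a sink that accumulates state needs interior mutability. The sketch below shows a counting sink built on an atomic counter. To keep it compilable on its own, it defines local stand-ins for ErrorSink and ParseError that mirror only the shape shown above; in real code you would import both from talkbank_model instead. The CountingSink type, the String field types, and the sample message are assumptions for illustration.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Stand-ins so this sketch compiles alone. In real code, use
// `talkbank_model::{ErrorSink, ParseError}` rather than these definitions.
struct ParseError {
    code: String,
    message: String,
}

trait ErrorSink: Send + Sync {
    fn report(&self, error: ParseError);
}

/// Hypothetical sink that counts reports while still logging them.
#[derive(Default)]
struct CountingSink {
    seen: AtomicUsize,
}

impl ErrorSink for CountingSink {
    fn report(&self, error: ParseError) {
        // `&self` only: the atomic provides the mutation, and keeps the sink `Sync`.
        self.seen.fetch_add(1, Ordering::Relaxed);
        eprintln!("[{}] {}", error.code, error.message);
    }
}

impl CountingSink {
    fn count(&self) -> usize {
        self.seen.load(Ordering::Relaxed)
    }
}

fn main() {
    let sink = CountingSink::default();
    sink.report(ParseError {
        code: "E305".to_string(),
        message: "example message".to_string(), // placeholder text
    });
    println!("{} error(s) reported", sink.count());
}
```

An atomic suits a plain counter; a sink that stores the errors themselves would typically wrap a Vec in a Mutex instead.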

Crate Selection Guide

Need	Crate
Data model types, error types, `WriteChat`, `ErrorSink`	`talkbank-model`
Tree-sitter CHAT parsing (low-level)	`talkbank-parser`
Full pipeline (parse + validate + JSON, schema validation)	`talkbank-transform`

talkbank-model is the foundation: every other crate depends on it. If all you need are the AST types and validation, talkbank-model alone is enough. talkbank-transform adds parsing, JSON, and caching.

Batchalign3-Facing Surface

If you are building Batchalign3 or another external consumer, the stable surface is usually:

Batchalign3 need	Prefer
Canonical full-file parsing	`talkbank-parser`
Parse/validate contracts and typed model access	`talkbank-model`
Alignment-aware downstream consumers (`align`, `compare`, `benchmark`)	`talkbank-model` alignment helpers plus the model AST
Whole-pipeline parse+validate+convert	`talkbank-transform`

For batch workflows, keep parser instances reusable and keep alignment logic separate from parse semantics.

JSON Output Reference

Status: Reference Last updated: 2026-05-11 23:45 EDT

This document describes the structure of JSON produced by chatter to-json. For the formal JSON Schema, see JSON Schema.

Quick Start

# Default: parse + validate + align, pretty-printed, schema-checked
chatter to-json file.cha

# Write to file
chatter to-json file.cha -o file.json

# Skip validation (parse only, faster)
chatter to-json file.cha --skip-validation

# Skip alignment only
chatter to-json file.cha --skip-alignment

Validation and alignment are on by default. Use --skip-validation or --skip-alignment to opt out.

Top-Level Structure

{
  "lines": [ ... ]
}

A ChatFile is a flat list of lines. Each line has a line_type discriminator:

`line_type`	Description
`"header"`	File header (`@Begin`, `@Languages`, `@Participants`, etc.)
`"utterance"`	Main tier + dependent tiers + alignment
`"comment"`	`@Comment:` lines

Word Fields

Words are the fundamental unit. Every word in the main tier content array carries these fields:

Field	Type	Always?	Description
`type`	`"word"`	yes	Discriminator
`raw_text`	string	yes	Exact text from the transcript, including all CHAT markers
`cleaned_text`	string	yes	NLP-ready text (shortenings restored, markers stripped)
`content`	array	yes	Structured breakdown of word parts (see below)
`category`	string	no	`"omission"`, `"filler"`, `"nonword"`, `"fragment"`, `"ca_omission"`
`form_type`	string	no	Special form code: `"c"`, `"d"`, `"f"`, `"x"`, etc.
`lang`	object	no	Language marker (see Language-Switched example)
`untranscribed`	string	no	`"unintelligible"` (xxx), `"phonetic"` (yyy), `"untranscribed"` (www)

Word content items use "content" for the text value:

{ "type": "text", "content": "dog" }

Computed Fields

cleaned_text and untranscribed are computed from content during serialization. They do not exist as stored fields in the data model.

cleaned_text: Concatenates Text and Shortening elements from content. Excludes lengthening markers (:), stress markers, CA elements, overlap points, compound markers, and underline markers. Example: sit(ting) → "sitting".
untranscribed: Present only when cleaned_text is "xxx", "yyy", or "www".
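The two derivation rules can be sketched as follows. This is an illustration of the rules as stated above, not the serializer's actual code: the Part enum is a hypothetical, much-reduced stand-in for the real word content types, and the string values for untranscribed are taken from the Word Fields table.

```rust
/// Hypothetical, reduced stand-in for the elements of a word's `content` array.
enum Part {
    Text(String),
    Shortening(String),
    CompoundMarker,
    Lengthening,
}

/// Concatenates Text and Shortening parts; every other part is skipped.
fn cleaned_text(parts: &[Part]) -> String {
    parts
        .iter()
        .filter_map(|part| match part {
            Part::Text(s) | Part::Shortening(s) => Some(s.as_str()),
            Part::CompoundMarker | Part::Lengthening => None,
        })
        .collect()
}

/// Present only for the three untranscribed placeholders.
fn untranscribed(cleaned: &str) -> Option<&'static str> {
    match cleaned {
        "xxx" => Some("unintelligible"),
        "yyy" => Some("phonetic"),
        "www" => Some("untranscribed"),
        _ => None,
    }
}

fn main() {
    // sit(ting): a text part followed by a shortening.
    let word = [Part::Text("sit".to_string()), Part::Shortening("ting".to_string())];
    let cleaned = cleaned_text(&word);
    println!("{} {:?}", cleaned, untranscribed(&cleaned));
}
```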

Word Examples

Simple Word

dog

{
  "type": "word",
  "raw_text": "dog",
  "cleaned_text": "dog",
  "content": [{ "type": "text", "content": "dog" }]
}

Filler

&-uh

{
  "type": "word",
  "raw_text": "&-uh",
  "cleaned_text": "uh",
  "content": [{ "type": "text", "content": "uh" }],
  "category": "filler"
}

Untranscribed

xxx

{
  "type": "word",
  "raw_text": "xxx",
  "cleaned_text": "xxx",
  "content": [{ "type": "text", "content": "xxx" }],
  "untranscribed": "unintelligible"
}

Compound

ice+cream

{
  "type": "word",
  "raw_text": "ice+cream",
  "cleaned_text": "icecream",
  "content": [
    { "type": "text", "content": "ice" },
    { "type": "compound_marker", "content": { "span": { "start": 0, "end": 1 } } },
    { "type": "text", "content": "cream" }
  ]
}

Omission

0she

{
  "type": "word",
  "raw_text": "0she",
  "cleaned_text": "she",
  "content": [{ "type": "text", "content": "she" }],
  "category": "omission"
}

Nonword

&~baba

{
  "type": "word",
  "raw_text": "&~baba",
  "cleaned_text": "baba",
  "content": [{ "type": "text", "content": "baba" }],
  "category": "nonword"
}

Special Form

doggy@c

{
  "type": "word",
  "raw_text": "doggy@c",
  "cleaned_text": "doggy",
  "content": [{ "type": "text", "content": "doggy" }],
  "form_type": "c"
}

Language-Switched

maison@s:fra

{
  "type": "word",
  "raw_text": "maison@s:fra",
  "cleaned_text": "maison",
  "content": [{ "type": "text", "content": "maison" }],
  "lang": { "type": "explicit", "code": "fra" }
}

The lang field has variants: {"type": "shortcut"} (bare @s), {"type": "explicit", "code": "fra"} (@s:fra), and {"type": "multiple", "code": ["eng", "zho"]} (@s:eng+zho).
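For consumers that model these variants in their own code, the mapping from marker text to variant can be sketched as a small enum. This is a hypothetical helper, not the chatter parser: it handles only the three forms listed above and returns None for anything else.

```rust
/// Mirrors the three `lang` variants described above.
#[derive(Debug, PartialEq)]
enum Lang {
    Shortcut,
    Explicit(String),
    Multiple(Vec<String>),
}

/// Splits a raw word at its `@s` marker into the base text and the marker variant.
fn split_lang(raw: &str) -> Option<(&str, Lang)> {
    let (base, suffix) = raw.rsplit_once("@s")?;
    let lang = match suffix.strip_prefix(':') {
        // Bare `@s` with nothing after it.
        None if suffix.is_empty() => Lang::Shortcut,
        // Something else starting with `@s`, not a language marker.
        None => return None,
        Some(codes) if codes.contains('+') => {
            Lang::Multiple(codes.split('+').map(str::to_string).collect())
        }
        Some(code) => Lang::Explicit(code.to_string()),
    };
    Some((base, lang))
}

fn main() {
    for raw in ["maison@s", "maison@s:fra", "hi@s:eng+zho", "doggy@c"] {
        println!("{} -> {:?}", raw, split_lang(raw));
    }
}
```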

Utterances

An utterance line contains:

{
  "line_type": "utterance",
  "main": {
    "speaker": "CHI",
    "content": {
      "content": [ ... ],
      "terminator": { "type": "period" },
      "bullet": { "start_ms": 0, "end_ms": 3042 }
    }
  },
  "dependent_tiers": [ ... ],
  "alignments": { ... },
  "utterance_language": { "status": "resolved_default", "code": "eng" },
  "language_metadata": { ... }
}

Key structural points:

The utterance body is under "main", not "utterance".
content, terminator, and bullet are nested inside main.content.
terminator is an object with a type field ("period", "question", "exclamation", etc.), not a bare string.
bullet (utterance-level timing) is inside main.content, omitted when absent (not present as null).
dependent_tiers, alignments, utterance_language, and language_metadata are top-level siblings of main. Empty dependent_tiers and alignments are omitted when there is nothing to report.

Content Items

main.content.content is a heterogeneous array. Each item has a type discriminator:

Type	Description
`"word"`	A word token (see Word Fields above)
`"event"`	Non-verbal action (`&=laughs`)
`"pause"`	Timed or untimed pause (`(.)`, `(0.5)`)
`"group"`	Bracketed group (`<word word>`)
`"separator"`	Tag markers, linkers, etc.
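The nesting described above (body under main, items under main.content.content) can be walked as in the following sketch. The helper name `summarize_utterance` and the sample utterance are hypothetical; the pause item is reduced to its discriminator because its other fields are not shown on this page.

```python
from collections import Counter


def summarize_utterance(line: dict) -> dict:
    """Pull the commonly needed fields out of one utterance line."""
    main = line["main"]      # the utterance body lives under "main"
    body = main["content"]   # content, terminator, and bullet are nested here
    items = body["content"]  # heterogeneous array; dispatch on "type"
    return {
        "speaker": main["speaker"],
        "terminator": body["terminator"]["type"],
        "bullet": body.get("bullet"),  # key is omitted when there is no timing
        "item_counts": dict(Counter(item["type"] for item in items)),
        "words": [i["cleaned_text"] for i in items if i["type"] == "word"],
    }


# Hypothetical utterance used only for illustration.
utterance = {
    "line_type": "utterance",
    "main": {
        "speaker": "CHI",
        "content": {
            "content": [
                {"type": "word", "raw_text": "ice+cream", "cleaned_text": "icecream"},
                {"type": "pause"},
                {"type": "word", "raw_text": "dog", "cleaned_text": "dog"},
            ],
            "terminator": {"type": "period"},
        },
    },
}
summary = summarize_utterance(utterance)
print(summary["words"])        # ['icecream', 'dog']
print(summary["item_counts"])  # {'word': 2, 'pause': 1}
```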

Dependent Tiers

When present, dependent_tiers is an array of tagged objects:

"dependent_tiers": [
  {
    "type": "Mor",
    "data": {
      "tier_type": "Mor",
      "items": [
        {
          "main": { "pos": "pron", "lemma": "I" }
        },
        {
          "main": { "pos": "verb", "lemma": "want", "features": ["Fin", "Ind", "Pres"] }
        }
      ],
      "terminator": "."
    }
  },
  {
    "type": "Gra",
    "data": {
      "tier_type": "Gra",
      "relations": [
        { "index": 1, "head": 2, "relation": "NSUBJ" },
        { "index": 2, "head": 0, "relation": "ROOT" }
      ]
    }
  }
]

`type`	Tier	Description
`"Mor"`	`%mor`	Morphological analysis (POS tags, lemmas, features, clitics)
`"Gra"`	`%gra`	Grammatical relations (dependency arcs)
`"Pho"`	`%pho`	Phonological transcription
`"Sin"`	`%sin`	Gesture and sign tier
`"Wor"`	`%wor`	Word-level timing (items with `inline_bullet`)
Other	`%xxx`	User-defined dependent tiers

%wor Tier

The Wor tier contains word items with timing:

{
  "type": "Wor",
  "data": {
    "items": [
      {
        "kind": "word",
        "raw_text": "hello",
        "cleaned_text": "hello",
        "content": [{ "type": "text", "content": "hello" }],
        "inline_bullet": { "start_ms": 100, "end_ms": 300 }
      }
    ],
    "terminator": { "type": "period" }
  }
}

Note that %wor items use "kind" instead of "type" for their discriminator, since "type" is used by the tier envelope.

Alignment Data

When validation runs (the default), the alignments object contains:

units: per-tier index arrays (for internal bookkeeping)
Named tier pairs (e.g., mor, gra) with alignment mappings

"alignments": {
  "units": {
    "main_mor": [{"index": 0}, {"index": 1}],
    "main_pho": [{"index": 0}, {"index": 1}],
    "main_sin": [{"index": 0}, {"index": 1}],
    "main_wor": [{"index": 0}, {"index": 1}],
    "mor": [{"index": 0}, {"index": 1}]
  },
  "mor": {
    "pairs": [
      { "source_index": 0, "target_index": 0 },
      { "source_index": 1, "target_index": 1 }
    ],
    "errors": []
  }
}

Alignment links each main-tier word (source_index) to its corresponding dependent-tier item (target_index) by position. errors contains any alignment-level diagnostics (count mismatches, etc.) and is [] when alignment validated cleanly.
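A consumer can join the two tiers with the pairs array, roughly as sketched below. This assumes that source_index indexes the list of alignable main-tier words supplied by the caller and that target_index indexes the Mor tier's items array; the helper name `pair_words_with_mor` is hypothetical.

```python
def pair_words_with_mor(words, mor_items, mor_alignment):
    """Join main-tier words to %mor lemmas using alignments["mor"]."""
    if mor_alignment["errors"]:
        # Pairs may be incomplete when alignment diagnostics were reported.
        raise ValueError(f"alignment reported {len(mor_alignment['errors'])} error(s)")
    return [
        (words[pair["source_index"]], mor_items[pair["target_index"]]["main"]["lemma"])
        for pair in mor_alignment["pairs"]
    ]


words = ["I", "want"]
mor_items = [
    {"main": {"pos": "pron", "lemma": "I"}},
    {"main": {"pos": "verb", "lemma": "want", "features": ["Fin", "Ind", "Pres"]}},
]
mor_alignment = {
    "pairs": [
        {"source_index": 0, "target_index": 0},
        {"source_index": 1, "target_index": 1},
    ],
    "errors": [],
}
print(pair_words_with_mor(words, mor_items, mor_alignment))
# [('I', 'I'), ('want', 'want')]
```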

Headers

Headers use the header object with a type discriminator:

Type	Header	Key Fields
`"utf8"`	`@UTF8`	(none)
`"begin"`	`@Begin`	(none)
`"end"`	`@End`	(none)
`"languages"`	`@Languages`	`codes`
`"participants"`	`@Participants`	`entries` (speaker_code, name, role)
`"id"`	`@ID`	`language`, `corpus`, `speaker`, `role`, `age`, `sex`, …
`"media"`	`@Media`	`filename`, `media_type`, `status`
`"comment"`	`@Comment`	`text`
`"date"`	`@Date`	`date`
`"options"`	`@Options`	`options` (array of strings)

See the JSON Schema for the complete list of header types and fields.

Timing

Utterance-level timing appears in main.content.bullet:

"bullet": {
  "start_ms": 1234,
  "end_ms": 5678
}

Word-level timing (from %wor tier) appears in inline_bullet on individual words within the Wor dependent tier.
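Both timing shapes use the same start_ms/end_ms pair, so durations follow by subtraction. The sketch below uses hypothetical helper names and assumes the Wor tier layout shown in the %wor Tier section.

```python
def duration_ms(bullet: dict) -> int:
    """Length of a bullet in milliseconds."""
    return bullet["end_ms"] - bullet["start_ms"]


def word_timings(wor_tier: dict) -> list:
    """Return (cleaned_text, start_ms, duration_ms) for each timed %wor word."""
    timings = []
    for item in wor_tier["data"]["items"]:
        # %wor items use "kind" as their discriminator, not "type".
        if item.get("kind") == "word" and "inline_bullet" in item:
            bullet = item["inline_bullet"]
            timings.append((item["cleaned_text"], bullet["start_ms"], duration_ms(bullet)))
    return timings


wor_tier = {
    "type": "Wor",
    "data": {
        "items": [
            {
                "kind": "word",
                "raw_text": "hello",
                "cleaned_text": "hello",
                "content": [{"type": "text", "content": "hello"}],
                "inline_bullet": {"start_ms": 100, "end_ms": 300},
            }
        ],
        "terminator": {"type": "period"},
    },
}
print(duration_ms({"start_ms": 1234, "end_ms": 5678}))  # 4444
print(word_timings(wor_tier))                           # [('hello', 100, 200)]
```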

JSON Schema

Status: Current Last modified: 2026-06-15 15:00 EDT

This repository generates JSON Schema from Rust-owned types with schemars for the ChatFile transcript model used by chatter to-json.

Keeping that schema generated from the Rust source of truth lets cross-language integrations consume a stable contract without re-deriving the shapes by hand.

Available schemas

Schema	Canonical URL	Repository	Generator
`ChatFile` transcript model	`https://talkbank.org/schemas/v0.1/chat-file.json`	`schema/chat-file.schema.json`	`cargo test --test generate_schema`

The generated schema declares both $schema (JSON Schema 2020-12) and $id (the canonical URL above). External consumers that want to track the current transcript-model version should follow the v0.1 URL; there is no /latest/ alias in the generated artifacts.

Transcript schema: `ChatFile`

chatter to-json converts CHAT transcripts into a structured JSON form backed by the same ChatFile model used by the parser, validator, and serializer.

How `chatter to-json` uses it

By default, chatter to-json:

validates the CHAT input unless --skip-validation is passed,
checks dependent-tier alignment unless --skip-alignment is passed, and
validates the emitted JSON against the schema unless --skip-schema-validation is passed.

Useful flags:

chatter to-json input.cha --skip-validation
chatter to-json input.cha --skip-alignment
chatter to-json input.cha --skip-schema-validation

chatter from-json deserializes JSON back into the internal ChatFile model and re-serializes it to CHAT format. The input should conform to this schema.

Roundtrip expectations

The CHAT-to-JSON-to-CHAT pipeline is intended to preserve the ChatFile model:

chatter to-json input.cha -o intermediate.json
chatter from-json intermediate.json -o output.cha
diff input.cha output.cha

Both directions go through the same typed model. When changing the parser, serializer, or schema generation, confirm roundtrip behavior with the existing roundtrip test suites rather than assuming byte-for-byte identity.

Using the schema externally

Validate JSON in Python

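import json
import urllib.request

import jsonschema  # third-party package: pip install jsonschema

schema_url = "https://talkbank.org/schemas/v0.1/chat-file.json"

# Fetch the published schema; the context manager closes the connection.
with urllib.request.urlopen(schema_url) as response:
    schema = json.load(response)

# Load JSON produced by `chatter to-json`.
with open("transcript.json", encoding="utf-8") as f:
    data = json.load(f)

# Raises jsonschema.ValidationError if the document does not conform.
jsonschema.validate(data, schema)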

IDE autocompletion

{
  "$schema": "https://talkbank.org/schemas/v0.1/chat-file.json",
  "lines": [],
  "participants": {},
  "languages": [],
  "options": []
}

Generate types from the schema

Tools like quicktype, json-schema-to-typescript, and datamodel-code-generator can generate typed structs or classes from the schema for TypeScript, Python, Go, and other languages.

Regenerating the schema

After changing transcript-model types in talkbank-model:

cd chatter
cargo test --test generate_schema

This writes the checked-in schema artifact in schema/. CI already checks that generated artifacts stay in sync.

Code references

schema/chat-file.schema.json: generated schema
crates/talkbank-transform/src/json.rs: schema loading and validation
crates/talkbank-model/src/model/: Rust data model
tests/generate_schema/: shared schema generation helpers

Diagnostic and JSON Output Contract

Status: Current Last updated: 2026-06-15 15:00 EDT

This page documents the machine-readable JSON surfaces currently exposed by the top-level chatter CLI.

Stability policy

Treat field names documented here as the public contract.
Treat additional fields as additive unless this page says otherwise.
Treat message wording as human-facing text, not a stable machine contract.

`chatter validate ... --format json`

Both chatter validate FILE --format json and chatter validate DIR --format json emit newline-delimited JSON (NDJSON) on stdout, with the same record shapes in both modes:

zero or more per-file records (one per validated file), then
one final summary record.

A single-file invocation still emits a file record followed by a summary record; it is not a single-object surface.

Per-file records

Valid files:

{"type":"file","file":"/path/to/file.cha","status":"valid","cache_hit":false}

Invalid files (the errors array is opaque per-error JSON; the note field is appended when the validator stopped further checks because of structural errors):

{
  "type": "file",
  "file": "/path/to/file.cha",
  "status": "invalid",
  "error_count": 1,
  "errors": [
    {
      "code": "E502",
      "message": "Missing @End header at end of file",
      "severity": "Error"
    }
  ],
  "note": "Some additional checks may not have run because of structural errors. Fix the structural errors first, then re-validate."
}

Parser-failure files use "status":"parse_error" with an error string. Read-failure files use "status":"read_error" with an error string.

Summary record

{
  "type": "summary",
  "directory": "/path/to/dir",
  "total_files": 2,
  "valid": 1,
  "invalid": 1,
  "parse_errors": 0,
  "cache_hits": 0,
  "cache_misses": 2,
  "cache_hit_rate": 0.0,
  "cancelled": false
}

When --roundtrip is set, the summary also includes roundtrip_passed and roundtrip_failed counters.

Contract notes

The type field is stable: "file" or "summary".
For file records: file and status are stable; cache_hit is stable for valid records. error_count and errors are stable for invalid records.
For summary records: directory, total_files, valid, invalid, parse_errors, cache_hits, cache_misses, cache_hit_rate, and cancelled are stable.
status values currently observed: valid, invalid, parse_error, read_error. New status values may appear.
Errors do not include a byte-offset location field in the NDJSON surface; for byte-offset diagnostics use the LSP or the non-JSON renderer.
The note field on invalid file records is human-facing guidance and may be added or omitted between releases.
Exit code 0 means all files validated successfully; exit code 1 means at least one file failed or an I/O error occurred.
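A consumer of this NDJSON stream might look like the sketch below. It reads only the stable fields listed above; the function names are hypothetical, and treating every status other than valid as a failure is a design assumption made here because new status values may appear.

```python
import json


def read_validate_output(ndjson_text: str):
    """Split `chatter validate --format json` output into file records and the summary."""
    files, summary = [], None
    for raw in ndjson_text.splitlines():
        if not raw.strip():
            continue  # tolerate blank lines between records
        record = json.loads(raw)
        if record.get("type") == "file":
            files.append(record)
        elif record.get("type") == "summary":
            summary = record
    return files, summary


def failed_files(files: list) -> list:
    """Paths whose status is anything other than "valid"."""
    return [record["file"] for record in files if record["status"] != "valid"]


sample = "\n".join([
    '{"type":"file","file":"/path/to/a.cha","status":"valid","cache_hit":false}',
    '{"type":"file","file":"/path/to/b.cha","status":"invalid","error_count":1,'
    '"errors":[{"code":"E502","message":"Missing @End header at end of file","severity":"Error"}]}',
    '{"type":"summary","directory":"/path/to","total_files":2,"valid":1,"invalid":1,'
    '"parse_errors":0,"cache_hits":0,"cache_misses":2,"cache_hit_rate":0.0,"cancelled":false}',
])
files, summary = read_validate_output(sample)
print(failed_files(files))     # ['/path/to/b.cha']
print(summary["total_files"])  # 2
```

Keying on the error code rather than the message text keeps the consumer aligned with the stability policy, since message wording is not part of the contract.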

`chatter to-json`

chatter to-json emits the full ChatFile JSON model rather than a diagnostic summary. The authoritative contract for that output is the JSON Schema documented in JSON Schema.

Practical notes:

The JSON itself is the contract, not any validation status lines printed by the CLI.
Use -o/--output if you want only the JSON in a file.
Use --skip-validation, --skip-alignment, or --skip-schema-validation only when you explicitly want to bypass those checks.

`chatter cache stats --json`

Cache statistics emit one JSON object on stdout:

{
  "total_entries": 743,
  "cache_dir": "/Users/example/Library/Caches/talkbank-chat",
  "cache_size_bytes": 274432,
  "last_modified": "2026-03-09T13:05:31+00:00"
}

Contract notes:

total_entries, cache_dir, cache_size_bytes, and last_modified are stable.
last_modified is RFC 3339 / ISO 8601 text.
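As a small sketch, the object can be loaded with the standard library and last_modified converted to an aware datetime. The helper name `parse_cache_stats` is hypothetical; the Z handling is a precaution for Python versions whose fromisoformat does not accept a trailing Z.

```python
import json
from datetime import datetime


def parse_cache_stats(text: str) -> dict:
    """Parse `chatter cache stats --json` output and decode last_modified."""
    stats = json.loads(text)
    stamp = stats["last_modified"]
    if stamp.endswith("Z"):
        # Older Python versions reject the "Z" suffix; normalize it to an offset.
        stamp = stamp[:-1] + "+00:00"
    stats["last_modified"] = datetime.fromisoformat(stamp)
    return stats


sample = """{
  "total_entries": 743,
  "cache_dir": "/Users/example/Library/Caches/talkbank-chat",
  "cache_size_bytes": 274432,
  "last_modified": "2026-03-09T13:05:31+00:00"
}"""
stats = parse_cache_stats(sample)
print(stats["total_entries"])        # 743
print(stats["last_modified"].year)   # 2026
```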

Merge Override File Format

Status: Draft Last updated: 2026-07-01 21:55 EDT

The merge override file is the typed, human-readable record of operator decisions in the chatter speaker-id → chatter merge pipeline. It serves three purposes:

Persistence: operator adjudications made for one batch can be replayed on later runs without re-prompting (chatter speaker-id --override-file <FILE> --session-id <ID>).
Audit trail: each entry records who decided what, when, and on the basis of which Jaccard scores. Years later, a researcher can answer “why was PAR0 labeled INV in this session?” by reading the file.
Interchange: an adjudication UI (CLI, future web app) and the batch pipeline share the same file format; UI tools can be added or replaced without changing the on-disk contract.

This page is the authoritative reference for the file’s schema. For the usage contract (which commands read/write it, when, why), see chatter speaker-id.

File location and naming

The file’s location is caller-chosen. The convention is one file per donor batch, named for the batch:

batch-2026-05-27-childes-eng.overrides.toml
batch-2026-06-15-fluency-pilot.overrides.toml
batch-2026-08-22-aphasiabank-bilingual.overrides.toml

Pipeline operators pass the path explicitly via --override-file; there is no implicit search of a default location.

File format

UTF-8 TOML. The file has exactly one top-level key, schema_version, followed by zero or more session entries, each keyed by a session ID.

schema_version = 2

[<session_id_1>]
mode = "auto"
# ... fields per entry ...

[<session_id_2>]
mode = "explicit"
# ... fields per entry ...

The session ID is the table name. It is a free-form stable string, typically the basename stem of the CHAT file the entry applies to (s12-t1, Corpus2024-session-07, etc.). The TOML parser treats it as a key; CHAT-conformant identifiers fit the unquoted-key grammar and need no escaping, but any string is permitted if it conforms to TOML key syntax (use a quoted key such as "unusual session.id" if the ID contains characters outside the bare-key set A-Z, a-z, 0-9, _ and -).

Top-level fields

Field	Type	Required	Meaning
`schema_version`	unsigned integer	yes	The schema version this file conforms to. Currently `2`. Readers refuse files with any other value.

The reader refuses files with schema_version absent or unknown, returning a typed error (OverrideFileError::UnsupportedSchemaVersion). There is no implicit version, no fallback, no auto-migration. Operators of a file written by a newer version of chatter must upgrade their binary; operators of a file written by an older version that the current binary no longer supports must re-adjudicate. This policy is documented in architecture/merge-domain-types.md §6; its rationale is to keep the schema honest and avoid premature migration code that might silently misinterpret old data.

Per-session entry fields

Each [<session_id>] table contains the fields below. Required fields must be present and well-typed; optional fields may be omitted; unknown fields cause a parse error.

Required fields

Field	Type	Meaning
`mode`	string enum	One of `"auto"`, `"explicit"`, `"override"`. How the decision was made; see “Mode semantics” below.
`adult_roles`	table of donor code → inline table	The CHAT identity assigned to each speaker whose `mapping` action is `"rename"`, keyed by that speaker’s donor code. Every `"rename"` key in `mapping` must have a matching key here. Each inline table has fields: `code` (string, CHAT speaker code), `tag` (string, CHAT role-tag), `specific_role` (string, optional, CHAT specific-role label such as `First_Investigator`, set only when two adults in the entry share `tag`).
`mapping`	inline table	Map from input speaker codes to actions. Keys are speaker codes; values are `"rename"` or `"drop"`. Every speaker that exists in the input CHAT file must appear in `mapping`.
`operator`	string	Free-form identifier of the person who created the entry (username, initials, email prefix). Recorded as audit trail.
`decided_at`	RFC 3339 datetime	When the decision was made. Must include a time zone (UTC recommended).

Optional fields

Field	Type	Default	Meaning
`scores`	inline table	`{}`	Per-speaker Jaccard scores recorded at decision time. Keys are speaker codes; values are floats in `[0.0, 1.0]`. Populated when the decision was based on a reference-mode auto attempt (even if the final mode is `"explicit"` because the operator overrode a low-confidence result).
`margin`	float or string	absent	The decisive margin (winner-score / loser-score). Finite values serialize as numbers; the divide-by-zero case (loser score = 0) serializes as the string `"unbounded"`.
`note`	string	`""`	Free-text operator note. Strongly recommended for `"explicit"` and `"override"` modes; it captures why the operator made the call.
`flags`	array of strings	`[]`	Operator-supplied flags marking unusual situations. Known values listed in “Flag vocabulary” below; unknown strings are preserved verbatim (treated as `Custom`).
`engine`	string enum	`"deterministic"`	Which engine produced the decision. Always written on new entries; absent only in pre-provenance files, which read as `"deterministic"`. One of `"deterministic"` (Jaccard reference-mode, spreadsheet, or operator adjudication) or `"llm"` (language-model judgment).
`judgment`	inline table	absent	LLM audit trail. Present only when `engine = "llm"`; omitted for deterministic decisions. Sub-fields documented below.
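The margin definition and its unbounded case can be sketched as follows. Two details are assumptions for illustration, not documented behavior: the loser is taken to be the second-highest score, and the result is rounded to two decimals to match the examples on this page. The threshold comparison follows the "at or above" wording used for auto mode.

```python
def decisive_margin(scores: dict):
    """Return winner/loser as a float, or "unbounded" when the loser score is 0."""
    if len(scores) < 2:
        raise ValueError("need at least two speaker scores")
    ranked = sorted(scores.values(), reverse=True)
    winner, loser = ranked[0], ranked[1]  # assumption: loser is the runner-up
    if loser == 0:
        return "unbounded"  # divide-by-zero case serializes as a string
    return round(winner / loser, 2)  # assumption: two-decimal rounding


def meets_threshold(margin, threshold: float) -> bool:
    """True when the margin is at or above the confidence threshold."""
    return margin == "unbounded" or margin >= threshold


margin = decisive_margin({"PAR0": 0.6286, "PAR1": 0.3457})
print(margin)                        # 1.82
print(meets_threshold(margin, 2.0))  # False
```

With a 2.0 threshold this reproduces the refusal described in the operator-adjudicated example: 1.82 falls short, so the decision goes to the operator.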

`judgment` sub-table fields

The judgment inline table records the audit trail for LLM-produced decisions. It is present if and only if engine = "llm".

Field	Type	Required	Meaning
`model`	string	yes	Model identifier used for the judgment (e.g. `"deepseek-v4-flash"`).
`endpoint`	string	yes	OpenAI-compatible base URL the judgment was made against.
`prompt_version`	string	yes	Prompt-template version tag (e.g. `"v1"`). Bumping this marks older entries as produced by a prior template.
`confidence`	inline table	no (omitted when empty)	Per-field model confidence in `[0.0, 1.0]`. Keys are decision field names (e.g. `"mapping"`, `"roles"`, `"merge_applicable"`). Omitted entirely when no confidence values were reported.
`reasoning`	string	yes	One or two sentence model rationale for the decision.

Mode semantics

The mode field records how the decision was made and is informational only at read time: every mode applies the same mapping deterministically. Distinguishing modes matters for audit purposes.

Mode	Set when	Operator confidence
`"auto"`	`chatter speaker-id` ran in reference mode, Jaccard margin was at or above `--confidence-threshold`, and the operator did not intervene.	High; the algorithm picked.
`"explicit"`	The operator supplied `--mapping` directly, typically after a prior reference-mode attempt failed at the confidence threshold.	Operator made the call; confidence depends on what evidence they used (listening to audio, contributor data sheet, prior knowledge).
`"override"`	The entry was created by reading a prior override file (replay).	Inherited from whichever prior decision the entry was first stamped with. The `mode` is updated to `"override"` whenever a replay re-writes the entry.

The reader does not enforce mode → field correlations (e.g., it does not require scores to be present when mode = "auto"). The writer follows these conventions:

"auto" entries always include scores and margin.
"explicit" entries include scores and margin IFF a prior reference-mode attempt produced them; otherwise they are absent.
"override" entries preserve whatever scores, margin, and note were in the source file.

Mapping semantics

Each entry in mapping is one of:

"rename": the speaker is renamed per its own entry in adult_roles (looked up by the speaker’s donor code), to adult_roles[<donor_code>].code with role tag adult_roles[<donor_code>].tag, and specific-role label adult_roles[<donor_code>].specific_role if present, in the output CHAT file. Every utterance for this speaker has its *CODE: prefix rewritten; the @Participants entry for this speaker has its code + role-tag (+ specific-role label, if set) rewritten; the @ID row’s code (field 3) and role (field 8) are rewritten.
"drop": the speaker’s utterances are removed from the output entirely. The speaker’s @Participants entry and @ID row are also removed.

Precondition. Every speaker that appears in the input CHAT file must appear in mapping. There is no defaulting; omission is rejected with SpeakerIdError::SpeakerNotInMapping { speaker }. This is by design: every decision must be explicit, so a future reader knows that no speaker was silently passed through.

The reader rejects:

Mapping entries whose key is not a speaker present in the input (SpeakerIdError::MappingSpeakerNotInInput).
Mapping values other than "rename" or "drop" (TOML parse error from the typed deserializer).
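The preconditions above can be pre-checked before handing a file to chatter, as in this sketch. It operates on an already-parsed entry (a plain dict) and returns human-readable problems instead of the typed Rust errors; the function name and message wording are hypothetical.

```python
VALID_ACTIONS = ("rename", "drop")


def check_mapping(entry: dict, input_speakers: set) -> list:
    """List violations of the mapping rules for one override entry."""
    mapping = entry["mapping"]
    adult_roles = entry.get("adult_roles", {})
    problems = []
    # Every speaker in the input must appear in mapping; there is no defaulting.
    for speaker in sorted(input_speakers - set(mapping)):
        problems.append(f"{speaker}: present in input but missing from mapping")
    # Mapping keys must name speakers that exist in the input.
    for speaker in sorted(set(mapping) - input_speakers):
        problems.append(f"{speaker}: in mapping but not a speaker in the input")
    for speaker, action in sorted(mapping.items()):
        if action not in VALID_ACTIONS:
            problems.append(f"{speaker}: unknown action {action!r}")
        elif action == "rename" and speaker not in adult_roles:
            # Every "rename" key needs a matching adult_roles key.
            problems.append(f"{speaker}: rename has no adult_roles entry")
    return problems


entry = {
    "mode": "auto",
    "adult_roles": {"PAR0": {"code": "INV", "tag": "Investigator"}},
    "mapping": {"PAR0": "rename", "PAR1": "drop"},
}
print(check_mapping(entry, {"PAR0", "PAR1"}))          # []
print(check_mapping(entry, {"PAR0", "PAR1", "PAR2"}))
# ['PAR2: present in input but missing from mapping']
```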

Flag vocabulary

The flags array contains zero or more string values. The following are recognized vocabulary; consumers MAY treat them specially:

Flag	Meaning
`"diarization-mixed"`	The ASR diarization label being renamed actually contains multiple real-world speakers (e.g., clinician + parent collapsed). The rename is the best available approximation; downstream consumers should know the output is imperfect.
`"best-guess"`	The operator could not confidently determine which speaker is which (e.g., from audio alone). The mapping is recorded as best-guess and merits review by a domain expert before publication.

Any other string is preserved verbatim as a contributor-specific flag (Custom(String) in the Rust type). Consumers SHOULD NOT crash on unknown flags but MAY surface them in audit-trail displays.

The order of flags within an entry is not semantically meaningful; duplicates are tolerated but considered noise. Tooling that modifies the list SHOULD deduplicate.
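Tooling that edits the list could deduplicate and classify flags along these lines. The helper name and the "known"/"custom" labels are hypothetical; sorting is safe here only because flag order carries no meaning.

```python
KNOWN_FLAGS = {"diarization-mixed", "best-guess"}


def normalize_flags(flags: list) -> list:
    """Deduplicate flags and tag each as recognized vocabulary or custom."""
    return [
        (flag, "known" if flag in KNOWN_FLAGS else "custom")
        for flag in sorted(set(flags))
    ]


print(normalize_flags(["best-guess", "site-42", "best-guess"]))
# [('best-guess', 'known'), ('site-42', 'custom')]
```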

Reader semantics

OverrideFile::read(path) is the canonical reader. Its behavior:

Open path and read it as UTF-8.
Parse via toml.
Refuse if schema_version is absent or not equal to the binary’s CURRENT_SCHEMA_VERSION (currently 2). Error: OverrideFileError::UnsupportedSchemaVersion { found, supported }.
Parse all [<session_id>] tables into MergeOverride values; reject unknown fields.
Return OverrideFile { schema_version, entries }.

OverrideFile::read_or_default(path) is the variant used by chatter speaker-id --write-override: if the file does not exist, returns OverrideFile::default() (empty, current schema version); otherwise behaves as read.

OverrideFile::get(&session_id) retrieves a single entry; returns None if absent.

Writer semantics

OverrideFile::write(path) serializes the file deterministically:

Top-level field order: schema_version first.
Entries ordered by session ID alphabetically (BTreeMap default).
Per-entry field order: mode, adult_roles, mapping, scores, margin, operator, decided_at, note, flags, engine, judgment.
Optional fields omitted when empty / absent.
Atomic replace: writes to <path>.tmp then renames over <path> to avoid leaving a partial file on crash.

chatter speaker-id --write-override <path> appends a single entry: it reads the file (or starts empty), inserts/updates the entry for the current session, and writes back. The session ID defaults to the input CHAT file’s basename stem unless overridden via --session-id.
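The atomic-replace step can be illustrated in Python for tools that write the file themselves. This is a sketch of the write-then-rename pattern, not the chatter writer: the function name is hypothetical, and the fsync call is an extra precaution added here rather than documented behavior.

```python
import os
import tempfile


def atomic_write_text(path: str, text: str) -> None:
    """Write text to <path>.tmp, then rename it over <path>."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        handle.write(text)
        handle.flush()
        os.fsync(handle.fileno())  # assumption: flush to disk before the rename
    # os.replace overwrites an existing target in a single rename operation.
    os.replace(tmp_path, path)


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "batch.overrides.toml")
    atomic_write_text(target, "schema_version = 2\n")
    with open(target, encoding="utf-8") as handle:
        print(handle.read().strip())             # schema_version = 2
    print(os.path.exists(target + ".tmp"))       # False
```

A reader that opens the path during a write sees either the old file or the new one, never a partially written file.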

Example: minimal auto-mode entry

schema_version = 2

[session-101-t1]
mode = "auto"
adult_roles = { PAR0 = { code = "INV", tag = "Investigator" } }
mapping = { PAR0 = "rename", PAR1 = "drop" }
scores = { PAR0 = 0.1931, PAR1 = 0.7347 }
margin = 3.81
operator = "alice"
decided_at = 2026-05-27T08:41:00-04:00

The reader reconstructs: child speaker was PAR1 (high Jaccard match with reference’s CHI); auto-decide succeeded with margin 3.81×; PAR0 becomes INV:Investigator in the output.

Example: operator-adjudicated entry

After a low-confidence refusal, the operator listened to the audio, confirmed the call, and re-ran with --mapping:

[session-102-t1]
mode = "explicit"
adult_roles = { PAR1 = { code = "INV", tag = "Investigator" } }
mapping = { PAR0 = "drop", PAR1 = "rename" }
scores = { PAR0 = 0.6286, PAR1 = 0.3457 }
margin = 1.82
operator = "alice"
decided_at = 2026-05-27T11:15:00-04:00
note = "Auto refused at 2.0× threshold. Listened to first 60 seconds; PAR0 produces child-content matching the hand transcript. PAR1 introduces herself as the clinician."

The scores from the prior auto attempt are preserved; the note captures why the operator was confident in the call despite the close margin. Years later, a researcher can verify by listening to the same 60 seconds and confirming the operator’s observation; the audit trail is reproducible.

Example: diarization-mixed parent sample

[session-103-t1-parent]
mode = "explicit"
adult_roles = { PAR0 = { code = "MOT", tag = "Mother" } }
mapping = { PAR0 = "rename", PAR1 = "drop" }
scores = { PAR0 = 0.3727, PAR1 = 0.6940 }
margin = 1.86
operator = "alice"
decided_at = 2026-05-27T11:22:00-04:00
note = "Parent sample. Per contributor data sheet: mother. PAR0 contains clinician intro + parent mixed (Batchalign diarization limitation)."
flags = ["diarization-mixed"]

The flags = ["diarization-mixed"] entry warns downstream consumers that the renamed MOT speaker is not a clean parent-only stream: the first ~15 seconds were the clinician giving setup instructions before leaving the room. The note captures the specifics for future review.

Example: replayed entry

The same session, replayed on a later day from the override file:

[session-102-t1]
mode = "override"
adult_roles = { PAR1 = { code = "INV", tag = "Investigator" } }
mapping = { PAR0 = "drop", PAR1 = "rename" }
scores = { PAR0 = 0.6286, PAR1 = 0.3457 }
margin = 1.82
operator = "alice"
decided_at = 2026-05-27T11:15:00-04:00
note = "Auto refused at 2.0× threshold. Listened to first 60 seconds; PAR0 produces child-content matching the hand transcript. PAR1 introduces herself as the clinician."

mode becomes "override" whenever the entry is re-applied by reading the file. The other fields (including the original operator and decided_at) are preserved; the override file is the audit trail of the original decision, not of the replay.

TOML grammar reference

For consumers writing the file by hand or generating it from other tools, the grammar is standard TOML 1.0 (toml.io) with the following domain-specific conventions:

Datetimes use RFC 3339 with explicit time zone. UTC offset Z and offsets like -04:00 are both accepted.
Floats: standard TOML float syntax. The margin field accepts either a float or the string "unbounded".
Tables vs inline tables: top-level [<session_id>] tables may use either standard or inline syntax; the writer emits standard tables for readability.
Comments: TOML # line comments are permitted anywhere; the reader ignores them. The writer does not preserve comments across read-modify-write cycles (toml, not toml_edit); hand-edited comments may be lost on subsequent --write-override runs. If preserving comments becomes important, the writer can be swapped for toml_edit in a future release.

Future schema changes

Schema version increments appear here under “Migration” with the version-to-version diff and migration instructions. The policy is strict refuse-with-clear-error on any schema_version value this binary does not recognize; there is no auto-migration.

Migration: schema_version 1 -> 2 (`adult_roles` map, 2026-07)

schema_version bumped from 1 to 2 when the single per-entry inserted_role field was replaced by adult_roles, a map from donor speaker code to InsertedRoleSpec. The old field could only name one CHAT identity per entry, so a session with two distinct adult speakers (two different roles, or two speakers sharing one role) had no way to record more than one of them. adult_roles keys each InsertedRoleSpec by the donor code it applies to, so every "rename" speaker in mapping gets its own role assignment; InsertedRoleSpec also gained an optional specific_role field for the CHAT manual’s First_Investigator/Second_Investigator-style disambiguation when two adults in one entry share a role.

This is a breaking, non-migrating version bump: a schema_version = 1 file is refused with OverrideFileError::UnsupportedSchemaVersion, not auto-converted. Operators holding a pre-bump override file must re-adjudicate those sessions. The pending-adjudications.toml format bumped its own schema version in lockstep for the same reason; see Adjudication Workflow.

2026-06 additive fields: `engine` and `judgment` (no version bump)

The engine and judgment fields were added in 2026-06 to record decision provenance (deterministic vs LLM). This addition did NOT increment schema_version because both fields are backward compatible in both directions:

Old reader, new file: serde's deny_unknown_fields is not set globally; older binaries that parse a file containing engine and judgment will silently ignore the unknown keys. The decision itself (mode, mapping, adult_roles) is unaffected.
New reader, old file: engine has #[serde(default)] and defaults to "deterministic"; judgment has skip_serializing_if = "Option::is_none" and is absent, which deserializes as None. Pre-provenance files are therefore readable without error and are treated as deterministic decisions.

A future version bump would be warranted only if a change makes old files unreadable or misinterpretable, neither of which applies here.

Relationship to JSON Schema

The Rust OverrideFile type is implemented (in talkbank-transform, src/speaker_id/override_file.rs) and drives the override-file replay workflow today. What is not yet built is its JSON Schema export: OverrideFile does not yet derive schemars::JsonSchema, so no schema is generated, and the canonical URL https://talkbank.org/schemas/v0.1/merge-overrides.json is reserved but not yet published. Exposing it follows the same schemars-based generator pattern documented in JSON Schema.

The TOML form is the on-disk format; JSON Schema is the machine-readable spec for external tooling. Both describe the same OverrideFile Rust type.

Chatter: TalkBank CHAT Toolchain

0.2.1 - 2026-06-24

0.2.0 - 2026-06-23

0.1.1 - 2026-06-22

0.1.0 - 2026-06-15