Introduction

Status: Current Last modified: 2026-06-21 21:33 EDT

TalkBank is the world’s largest open repository of spoken language data. This repository (TalkBank/chatter) is the standalone home of the CHAT format authority and the chatter tool family: the chatter CLI, the Rust crates for parsing/validation/transformation, the tree-sitter-talkbank grammar, the talkbank-lsp language server, and the desktop validation app.

chatter is publicly released. To get it right away:

  • Command-line tool (macOS / Linux): curl --proto '=https' --tlsv1.2 -LsSf https://github.com/TalkBank/chatter/releases/latest/download/chatter-installer.sh | sh (Windows and other options: Install).
  • Desktop app: download for your platform from the latest release.
  • Full installation guide (all platforms, package details): Install.

The Rust crates are source-available from this repository (not yet published to crates.io). As a 0.x release, APIs and flags may change before 1.0.

Choose the right surface

| Task | Recommended Surface |
| --- | --- |
| CHAT validation, normalization, or conversion | chatter CLI |
| LSP integration in editors | talkbank-lsp standalone |
| Build CHAT tooling in Rust | Rust crates (talkbank-model, talkbank-parser, etc.) |
| Reuse grammar in other tools | tree-sitter-talkbank |
| Standalone desktop GUI for CHAT validation | Chatter Desktop (apps/chatter-desktop/) |

What’s In This Repo

  • chatter CLI: validate, convert, normalize, and analyze CHAT files from the command line, with an interactive TUI for corpus-scale workflows
  • Language Server (LSP): works with any LSP-compatible editor (Neovim, Emacs, Helix, Zed, etc.) to provide live validation and cross-tier alignment
  • JSON data model: every CHAT structure as typed JSON with lossless roundtrip fidelity, backed by a published JSON Schema
  • Rust API: parse, validate, inspect, and transform CHAT files programmatically via library crates

Who This Book Is For

| Audience | Start Here | Then Go To |
| --- | --- | --- |
| CLI users validating, normalizing, or converting CHAT | Install | chatter Quick Start, CLI Reference |
| Rust library consumers parsing or transforming CHAT | Library Usage | crate-root rustdoc for talkbank-model, talkbank-parser, and talkbank-transform |
| Grammar / format consumers embedding CHAT parsing in other tools | CHAT Format Overview | tree-sitter-talkbank docs and the grammar/reference chapters |
| Contributors / maintainers working in this repo | Contributing setup | CI and release |

Repository Layout

grammar/        Tree-sitter grammar for CHAT
spec/           Source of truth: CHAT specification + error specs
crates/         Rust crates for model, parser, transform, cache, CLI, LSP, tests, and FFI support
apps/           Tauri v2 desktop app (`chatter-desktop`)
corpus/         Reference corpus (must stay 100% valid under the regression gate)
schema/         JSON Schema for the CHAT AST
tests/          Integration tests and fixtures
fuzz/           Fuzz testing targets (separate Cargo workspace)
docs/           Strategy docs, proposals, and investigations for this repo
book/           This documentation (mdBook)

Data flows: spec (source of truth) → grammar (tree-sitter) → Rust crates (parsers, model, validation, CLI, LSP) → applications (chatter, desktop app).

Install

Status: Current Last modified: 2026-06-21 21:33 EDT

Installation paths for each surface of chatter. Pick the row that matches what you want to do and the audience you belong to.

| If you want to… | Use this surface | Start here |
| --- | --- | --- |
| Validate, normalize, convert, or batch-process CHAT files | chatter CLI | CLI installation |
| Embed the Rust crates in another program | Rust libraries | Library usage |
| Reuse the grammar in editor or parser tooling | tree-sitter-talkbank | crate docs plus the CHAT format overview |

chatter is publicly released: the CLI and desktop app are available from the latest GitHub release. The Rust crates and grammar are source-available from this repository (not yet published to crates.io). As a 0.x release, APIs and flags may change before 1.0.

For audio + ML pipelines (transcribe, force-align, morphotag, benchmark), see the upstream batchalign3 project, which lives outside the chatter repo and has its own installation flow.

Quickstart

Status: Current Last modified: 2026-06-21 21:33 EDT

Task-driven entry points. Pick the row that matches what you want to do today; each path starts at the narrowest useful documentation surface instead of dropping you into the whole book.

| Today’s goal | Best first page | Surface |
| --- | --- | --- |
| Validate / normalize / convert existing CHAT | chatter Quick Start | CLI |
| Add CHAT parsing/validation to a Rust program | Library Usage | Rust crates |

To download and install chatter, see Install.

For audio + ML workflows (transcribe / align media → CHAT), see the upstream batchalign3 project, outside the chatter repo.

Installation

Status: Current Last modified: 2026-06-16 07:55 EDT

chatter targets Windows, macOS, and Linux. There are two ways to install it: the prebuilt binaries (recommended for most people, including clinicians and researchers) and a from-source build (for contributors or unsupported platforms).

Every GitHub Release attaches prebuilt binaries for macOS (Apple Silicon and Intel), Linux (x86_64 and ARM64), and Windows (x64), plus desktop-app installers.

chatter CLI

One-line installers (they download the binary for your platform, place it on your PATH, and also install the chatter-update self-updater):

  • macOS and Linux:

    curl --proto '=https' --tlsv1.2 -LsSf https://github.com/TalkBank/chatter/releases/latest/download/chatter-installer.sh | sh
    
  • Windows (PowerShell):

    powershell -ExecutionPolicy Bypass -c "irm https://github.com/TalkBank/chatter/releases/latest/download/chatter-installer.ps1 | iex"
    

On Windows the binary is not yet code-signed, so SmartScreen may warn on first run: choose More info, then Run anyway. The macOS binaries are code-signed, and the installer above does not set the quarantine attribute, so Gatekeeper does not prompt.

Prefer a manual download? Grab the archive for your platform from the latest release and extract chatter onto your PATH. (On macOS, a browser-downloaded archive is quarantined; right-click the binary and choose Open once, or run xattr -d com.apple.quarantine ./chatter.)

Verify:

chatter --version
chatter --help

chatter desktop app

The desktop app (“Chatter”) is for people who prefer a window to a terminal. Download the installer for your platform from the latest release:

  • macOS: the .dmg is signed and notarized; open it and drag the app to Applications. No Gatekeeper override is required.
  • Windows: the installer is not yet signed (same SmartScreen note as above: More info then Run anyway).
  • Linux: an AppImage and a .deb are provided.

Updating chatter

chatter keeps itself current so you do not have to track releases by hand.

  • CLI: run

    chatter update
    

    This runs the bundled chatter-update program, which checks GitHub Releases and installs the newest release in place. (The self-update facility is experimental. It is installed only by the one-line installers above; if you installed another way, update the same way you installed.)

  • Desktop app: the app checks for updates on launch and offers to install a new version when one is available.

From source

Building from source needs only a stable Rust toolchain (install via rustup, which supports Windows, macOS, and Linux). Node.js and the tree-sitter CLI (cargo install tree-sitter-cli) are needed only when working on the grammar or generated artifacts.

Clone and install the CLI:

git clone https://github.com/TalkBank/chatter.git
cd chatter
cargo install --path crates/chatter --locked

This installs the chatter binary to ~/.cargo/bin/ (macOS/Linux) or %USERPROFILE%\.cargo\bin\ (Windows). To update a source install, pull and re-run the cargo install command above (chatter update is only for installer-based installs).

Building the libraries

If you are developing with the Rust crates directly, from your chatter checkout root:

cargo build --workspace --all-targets --locked
cargo test --workspace --locked
cargo clippy --all-targets -- -D warnings

See the contributor setup for additional commands.

Directory layout

Everything lives in a single repository:

<your-chatter-checkout>/
├── grammar/            # Tree-sitter grammar
├── crates/             # All Rust crates (talkbank-* + the chatter binary)
├── spec/               # CHAT specification
├── apps/               # Tauri desktop app (chatter-desktop)
└── book/               # Chatter mdBook (this book)

The CLI, grammar, crates, and the LSP/desktop integrations all live in this single repository.

Quick Start

Status: Current Last updated: 2026-06-15 15:00 EDT

This page gets you from zero to productive with chatter in five minutes. Install chatter first if you haven’t already.

Validate a CHAT file

Check a single transcript for errors:

chatter validate transcript.cha

If the file is valid, you get a summary followed by a cache-statistics block (use --quiet to suppress success output and rely on the exit code):

=== Summary ===
Total files: 1
Valid: 1
Invalid: 0

If there are problems, you’ll see rich diagnostics with the exact location and a stable error code. For example, a *CHI: line missing its terminator:

✗ Errors found in transcript.cha

E305 (https://talkbank.org/errors/E305)

  × error[E305]: Expected terminator not found (line 6, column 1)
   ╭─[input:6:1]
 6 │ *CHI:   hello world
   · ─────────┬─────────
   ·          ╰── here
   ╰────
  help: Add a terminator at the end: Standard (. ? !), Interruption
        (+... +/. ...), or CA intonation (⇗ ↗ → ↘ ⇘ ...)

Every error code (E305, E705, etc.) is documented with fix guidance in the validation error reference.

Validate an entire corpus

Point chatter at a directory: it walks recursively, validates in parallel, and caches results.

chatter validate corpus/

The interactive TUI shows progress and lets you browse errors per file. Use --format json for machine-readable output, or --quiet for CI (exit code 1 on errors).

Convert to JSON

Get a structured representation of any CHAT file:

chatter to-json transcript.cha

The output conforms to the TalkBank CHAT JSON Schema. Convert back with chatter from-json.

Watch for changes

Edit a file and get live validation feedback:

chatter watch transcript.cha

Every time you save, chatter re-validates and shows updated diagnostics.

What next?

CLI Reference

Status: Current Last modified: 2026-06-15 15:00 EDT

The chatter CLI is the primary command-line surface for the TalkBank CHAT toolchain.

The following diagram shows the command dispatch structure. Each top-level command dispatches to a handler in the corresponding crate.

flowchart TD
    chatter(["chatter"])

    chatter --> validate["validate\n(chatter)"]
    chatter --> normalize["normalize\n(chatter)"]
    chatter --> tojson["to-json\n(talkbank-transform)"]
    chatter --> fromjson["from-json\n(talkbank-transform)"]
    chatter --> showalign["show-alignment\n(chatter)"]
    chatter --> watch["watch\n(chatter)"]
    chatter --> lint["lint\n(chatter)"]
    chatter --> clean["clean\n(chatter)"]
    chatter --> newfile["new-file\n(chatter)"]
    chatter --> cache["cache\n(stats, clear)"]
    chatter --> schema["schema\n(JSON Schema output)"]
    chatter --> debug["debug\n(overlap-audit, linker-audit,\nfind, sanitize, fix-s)"]

    chatter --> merge["merge\n(experimental)"]
    chatter --> speakerid["speaker-id\n(experimental)"]
    chatter --> adjudicate["adjudicate\n(experimental)"]
    chatter --> pipeline["pipeline\n(experimental)"]
    chatter --> batch["batch\n(experimental)"]
    chatter --> sanityscan["sanity-scan\n(experimental)"]

Top-Level Commands

chatter validate PATH...
chatter normalize INPUT
chatter to-json INPUT
chatter from-json INPUT
chatter to-xml INPUT
chatter show-alignment INPUT
chatter watch PATH
chatter lint PATH
chatter clean PATH
chatter new-file
chatter cache stats
chatter cache clear --prefix PATH
chatter schema
chatter debug ...
chatter merge FILE1 FILE2          # experimental: combine two transcripts
chatter speaker-id INPUT           # experimental
chatter adjudicate ...             # experimental
chatter pipeline ...               # experimental
chatter batch ...                  # experimental
chatter sanity-scan ...            # experimental

Use chatter --help or chatter <command> --help for the exact live surface.

validate

Validate CHAT file(s) or directory tree(s). Accepts multiple paths.

Usage: chatter validate [OPTIONS] <PATH>...
chatter validate file.cha                         # single file
chatter validate file1.cha file2.cha file3.cha    # multiple files
chatter validate corpus/                          # directory (recursive, parallel)
chatter validate file.cha corpus/ other.cha       # mix of files and directories
chatter validate corpus/ -f json                  # structured JSON output
chatter validate corpus/ --force                  # ignore cache, revalidate everything
chatter validate corpus/ --force --audit out.jsonl # bulk audit to JSONL file
chatter validate corpus/ --suppress xphon         # suppress named error group
chatter validate corpus/ --suppress E726,E727     # suppress specific error codes
chatter validate corpus/ -j 8                     # use 8 parallel workers
chatter validate corpus/ --max-errors 50          # stop after 50 errors

Options:

| Flag | Description |
| --- | --- |
| -f, --format text\|json | Output format (default: text) |
| --list-checks | Print every validation check with Active/Planned status, then exit (no <PATH> required) |
| --skip-alignment | Skip dependent-tier alignment checks |
| --force | Ignore cache, revalidate all files |
| -j, --jobs N | Parallel workers for directory mode (default: CPU count) |
| --quiet | Only emit errors, suppress success messages |
| --max-errors N | Stop after N errors across all files |
| --roundtrip | Test serialization idempotency (developer tool) |
| --parser tree-sitter\|re2c | Parser backend (default: tree-sitter; re2c is opt-in for faster batch validation) |
| --strict-linkers | Enable strict cross-utterance linker pairing checks (E351-E355); off by default |
| --check-xphon | Re-enable %xphon* cross-tier alignment checks (E725-E728); skipped by default |
| --audit FILE | Stream errors to JSONL file (bulk audit mode) |
| --suppress CODES | Suppress error codes or groups (comma-separated) |

Suppress groups: xphon expands to E725/E726/E727/E728 (%xphosyl/%xphoaln/%xmodsyl cross-tier alignment). These codes have been suppressed by default since 2026-04-21; pass --check-xphon to include them. The --suppress flag can mix groups and codes: --suppress xphon,E316.
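The group-expansion rule can be modeled in a few lines. This is an illustrative Python sketch, not chatter's Rust implementation: SUPPRESS_GROUPS and expand_suppress are hypothetical names, and only the xphon group documented here is included.

```python
# Hypothetical model of how a --suppress value could expand; not chatter's code.
SUPPRESS_GROUPS = {
    "xphon": {"E725", "E726", "E727", "E728"},
}


def expand_suppress(value: str) -> set[str]:
    """Expand a comma-separated mix of group names and error codes."""
    codes: set[str] = set()
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        # A known group name expands to its codes; anything else is kept
        # as a literal error code.
        codes |= SUPPRESS_GROUPS.get(item, {item})
    return codes


print(sorted(expand_suppress("xphon,E316")))
```

Treating unknown names as literal codes is a simplification made for this sketch; the real CLI may reject unrecognized values.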

normalize

Serialize a CHAT file into canonical formatting.

chatter normalize input.cha
chatter normalize input.cha -o normalized.cha
chatter normalize input.cha --validate
chatter normalize input.cha --validate --skip-alignment

Flags:

  • -o, --output <PATH>: write to a file instead of stdout.
  • --validate: validate (including alignment by default) before writing the normalized output.
  • --skip-alignment: when paired with --validate, skip the dependent-tier alignment checks (still validates the rest).

normalize writes to stdout unless you pass -o/--output. There is no --in-place flag.

JSON Conversion

# Single file
chatter to-json input.cha                          # pretty-printed JSON to stdout
chatter to-json input.cha --compact                # minified JSON to stdout
chatter to-json input.cha -o output.json           # JSON to file

# Directory (recursive, preserves structure)
chatter to-json corpus/ --output-dir json/          # incremental by default (mtime check)
chatter to-json corpus/ --output-dir json/ --compact # minified output (saves disk)
chatter to-json corpus/ --output-dir json/ --force   # full rebuild
chatter to-json corpus/ --output-dir json/ --prune   # remove orphaned .json files
chatter to-json corpus/ --output-dir json/ --jobs 4  # parallel workers

# Reverse and schema
chatter from-json input.json -o output.cha
chatter schema
chatter schema --url

Single-file mode: to-json validates by default. Use --skip-validation, --skip-alignment, or --skip-schema-validation to bypass checks.

Directory mode: Walks recursively, converting each .cha to .json under --output-dir with the same relative path. Incremental by default: skips files whose JSON is already newer than the source. Use --force to rebuild all. Use --prune to remove .json files with no matching .cha (handles renames/deletions). Use --jobs N for parallel conversion (defaults to number of CPUs).
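The directory-mode rules (same relative path, mtime-based skipping, --force, --prune) can be sketched as follows. This is a Python model of the documented behavior under stated assumptions, not the actual implementation; json_target, needs_rebuild, and orphans are hypothetical helper names.

```python
from pathlib import Path


def json_target(src_root: Path, out_root: Path, cha: Path) -> Path:
    """Map corpus/a/b.cha to json/a/b.json (same relative path, new suffix)."""
    return out_root / cha.relative_to(src_root).with_suffix(".json")


def needs_rebuild(cha: Path, target: Path, force: bool = False) -> bool:
    """Rebuild when forced, when the JSON is missing, or when it is not newer."""
    if force or not target.exists():
        return True
    # Incremental mode skips files whose JSON is already newer than the source.
    return target.stat().st_mtime <= cha.stat().st_mtime


def orphans(src_root: Path, out_root: Path) -> list[Path]:
    """List .json files with no matching .cha (candidates for --prune)."""
    return sorted(
        j
        for j in out_root.rglob("*.json")
        if not (src_root / j.relative_to(out_root)).with_suffix(".cha").exists()
    )
```

How chatter treats equal timestamps is not documented here; this sketch rebuilds in that case to stay on the safe side.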

to-xml

Export one CHAT transcript to TalkBank XML. The transcript is validated before any XML is emitted, so an invalid input fails (exit 1) and writes nothing to stdout; a failed export never leaves a partial document. This command is export-only: XML ingest is not implemented, so there is no from-xml.

chatter to-xml input.cha                  # XML to stdout
chatter to-xml input.cha -o output.xml    # XML to a file
chatter to-xml input.cha --skip-alignment # skip dependent-tier alignment checks

The output is TalkBank XML in the http://www.talkbank.org/ns/talkbank namespace (referencing talkbank.xsd). Writing to --output prints a one-line ✓ Converted ... to ... confirmation on stderr; writing to stdout prints only the XML.

Flags: -o, --output <PATH> (stdout if omitted); --skip-alignment (disable dependent-tier alignment validation during export).

Editing and Inspection Commands

show-alignment

Print the dependent-tier alignment for a CHAT file (debugging aid).

chatter show-alignment file.cha
chatter show-alignment file.cha -t mor          # one tier type
chatter show-alignment file.cha -t gra -c       # compact one-line-per-alignment output

Flags: -t/--tier <mor|gra|pho|sin> (omit to show all available tiers); -c/--compact (one line per alignment).

watch

Watch a CHAT file or directory and re-validate on every save.

chatter watch file.cha
chatter watch corpus/
chatter watch corpus/ --skip-alignment --clear

Flags: --skip-alignment (faster reruns); -c/--clear (clear the terminal between runs).

lint

Run lint checks and optionally auto-fix.

chatter lint corpus/
chatter lint corpus/ --fix
chatter lint corpus/ --fix --dry-run         # preview without modifying files
chatter lint corpus/ --skip-alignment

Flags: --fix (apply fixes); --dry-run (show what would change without writing); --skip-alignment.

clean

Show the cleaned text for each word (a debugging aid for the text-normalization pipeline).

chatter clean file.cha
chatter clean file.cha --diff-only       # only words where raw differs from cleaned
chatter clean file.cha --format json

Flags: --diff-only; --format text|json.

new-file

Create a new minimal valid CHAT file from defaults.

chatter new-file
chatter new-file -o starter.cha --speaker CHI --language eng
chatter new-file -o adult.cha -s MOT -l eng -r Mother
chatter new-file -c brown -u "hello world ."

Flags:

  • -o, --output <PATH>: stdout if omitted
  • -s, --speaker <CODE>: default CHI
  • -l, --language <ISO 639-3>: default eng
  • -r, --role <ROLE>: default Target_Child
  • -c, --corpus <CORPUS>: corpus identifier in the @ID header (default corpus)
  • -u, --utterance <TEXT>: optional initial main-tier utterance content

Cache Commands

chatter cache stats
chatter cache stats --json
chatter cache clear --prefix /path/to/corpus
chatter cache clear --all --dry-run

The validation cache lives under the platform cache directory and stores per-file validation results. validate --force refreshes cache state for the specified path.

debug

Developer / debugging subcommands for CHAT analysis. Not intended for routine end-user workflows; surface and behavior may change between releases. Run chatter debug --help for the live list. Current subcommands include:

  • overlap-audit: analyze CA overlap markers (⌈⌉⌊⌋): pairing, temporal consistency, orphans.

  • linker-audit: audit linker / special-terminator usage across a corpus (cross-utterance pairing for +<, ++, +^, +", +,, +≋, +≈, plus +..., +/., +//., +"/. etc.).

  • find: filter CHAT files by @Languages and body content (token / substring counts) across a corpus tree; emits paths, JSONL, or CSV.

  • sanitize: strip contributor lexical content while preserving structure, for protected-corpus debugging. See the Sanitize user-guide page for the full workflow.

  • fix-s: normalize whole-utterance same-language @s runs into a [- lang] precode, clear the per-word @s markers (including those on fillers and nonwords), and append any missing explicit @s:LANG codes to @Languages. Trigger conditions and safety rules:

    • Every word-bearing item in the utterance, including fillers (&~, &-, &+), nonwords, and retraced material, must carry an explicit language marker AND every marker must resolve to the same target language. If a single filler such as &~dang3 lacks a marker, the utterance is left untouched (the predicate cannot prove it is monolingual).
    • Bare @s shortcuts on fillers must be cleared when the rewrite fires. A bare @s resolves relative to the surrounding tier language, so adding a [- LANG] precode without clearing the shortcut would flip the filler’s language to the precode target. fix-s clears the shortcut to keep the original meaning intact.
    • The pre-validation rule that catches the unrewritten pattern is E255 (whole-utterance same-language @s run); fix-s is the canonical repair. The companion warn-only E254 reports @s:LANG codes missing from @Languages; fix-s appends them.
    • True no-op on already-correct files: a file is rewritten only when a [- lang] conversion or @Languages repair can be proved necessary.
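The trigger condition above amounts to a "single proven language" predicate. The Python sketch below models it under simplifying assumptions: each word-bearing item is given as a (text, language) pair in which the language is already resolved, and None stands for an item with no explicit marker. fix_s_target is a hypothetical name and the sample words are invented.

```python
from typing import Optional


def fix_s_target(items: list[tuple[str, Optional[str]]]) -> Optional[str]:
    """Return the single target language if the rewrite may fire, else None."""
    if not items:
        return None
    langs = {lang for _, lang in items}
    # Any unmarked item (None) or more than one language blocks the rewrite:
    # the utterance cannot be proved monolingual.
    if None in langs or len(langs) != 1:
        return None
    return next(iter(langs))


# Every item marked with the same language: the rewrite may fire.
print(fix_s_target([("ni3", "zho"), ("hao3", "zho")]))
# One filler lacks a marker: the utterance is left untouched.
print(fix_s_target([("ni3", "zho"), ("&~dang3", None)]))
```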

Merge and Reconciliation Commands (experimental)

These commands combine, reconcile, and relabel CHAT transcripts of the same recording, in the tradition of CLAN’s reliability and comparison tools (rely, trnfix). They are experimental and in active development: flags and behavior may change, and several modes are not yet complete. Work on copies and validate the output.

| Command | What it does |
| --- | --- |
| merge | Merge two CHAT transcripts of the same media into one, interleaving by time with explicit per-speaker provenance. Structural only: no ASR, no forced alignment, no content rewriting. |
| speaker-id | Assign CHAT-conformant speaker codes to an anonymously-labeled file, from an explicit mapping or by text similarity against a reference transcript. |
| adjudicate | Resolve pending low-confidence decisions (currently speaker-id) interactively, writing results to an override file. |
| pipeline | Per-session shortcut: run speaker-id in reference mode, then merge. |
| batch | Loop pipeline over matched donor / reference file pairs across two directories. |
| sanity-scan | Post-merge QA: flag sessions whose automatic decisions look suspicious by an out-of-band heuristic, for operator review via adjudicate. |

Full guides: Merge and Speaker ID. The speaker-id holistic-judgment mode can call an LLM provider (talkbank-llm) when configured; the deterministic modes need no network access.

Exit Codes

| Code | Meaning |
| --- | --- |
| 0 | Success – all files valid, or command completed without errors |
| 1 | Failure – validation errors found, parse errors, or command failed |
| 2 | Usage error – invalid arguments or missing required options (from clap) |

chatter validate exits with code 1 if any file has validation errors or parse errors. This makes it safe to use in scripts and CI pipelines:

chatter validate corpus/ --quiet --tui-mode disable || echo "Validation failed"

Use --quiet to suppress per-file success output while still relying on exit codes. Use --format json for machine-readable structured output (JSON objects go to stdout; exit code still reflects pass/fail).
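For automation written in Python rather than shell, the same exit-code contract can be consumed as sketched below. The helper names (describe_exit, validate_quiet) are hypothetical; the only CLI surface used is chatter validate PATH --quiet, as documented above, and the sketch assumes chatter is on PATH.

```python
import subprocess

# Meanings taken from the exit-code table above.
EXIT_MEANINGS = {
    0: "success",
    1: "failure (validation or parse errors, or the command failed)",
    2: "usage error (invalid arguments or missing required options)",
}


def describe_exit(code: int) -> str:
    """Translate a chatter exit code into a short human-readable label."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def validate_quiet(path: str) -> int:
    """Run `chatter validate PATH --quiet` and return its exit code."""
    return subprocess.run(["chatter", "validate", path, "--quiet"]).returncode
```

A CI step could then end with `raise SystemExit(validate_quiet("corpus/"))` so the job status mirrors the validator's result.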

Output Contracts

  • Text output is intended for humans.
  • JSON output is intended for automation and downstream tools.
  • Error codes and the JSON Schema are documented public contracts; see the Integrating section of this book.

Validation Errors

Status: Current Last modified: 2026-06-17 11:29 EDT

The CHAT validator produces diagnostics at two severity levels: errors (must fix) and warnings (should fix). Each diagnostic has an error code that maps back to a documented spec and validator rule.

chatter validate is the binding judgment on whether a byte sequence is valid CHAT. When it reports an error, the file is invalid CHAT: clean the data rather than working around the check. A warning flags a questionable but parseable construct you should review. Where chatter and an older tool such as CLAN’s check disagree on whether a file is valid, chatter validate is authoritative (see CHECK Parity Audit for how the two are reconciled).

Reading Error Output

The validator emits rich diagnostics that include the error code, a source-pointed snippet, and a suggested fix:

  × error[E304]: Missing speaker in main tier (line 15, column 3)

15 │ *	hello world .
   ·  ╰── here
   ╰────
  help: Add a speaker code between * and : (e.g., *CHI:)

Each diagnostic contains:

  • File path and location (line:column)
  • Severity: error or warning
  • Error code: E prefix for errors, W prefix for warnings, with a URL pointing at the per-code documentation page
  • Message: human-readable description
  • Suggestion: actionable fix guidance where available

Error Code Ranges

| Range | Category | Examples |
| --- | --- | --- |
| E1xx | UTF-8 and encoding | E101: Invalid line format |
| E2xx | Word-level content | E202: Missing form type after @, E203: Invalid form type marker, E207: Unknown annotation |
| E3xx | Main tier (speakers, terminators, content) | E301: Empty/missing main tier, E304: Missing speaker, E305: Missing terminator, E306: Empty utterance, E307: Invalid speaker, E308: Undeclared speaker |
| E4xx | Dependent tier structure | E401: Duplicate dependent tier |
| E5xx | Headers | E501: Duplicate header, E504: Missing @Participants, E505: Invalid @ID format |
| E6xx | Dependent tier validation | E601: Invalid dependent tier, E604: %gra without %mor |
| E7xx | Alignment (%mor, %gra, %pho, %wor) | E705: Main/%mor count mismatch, E721: %gra index error |
| W1xx-W6xx | Warnings | W108: BOM detected, W601: Empty user-defined tier |
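When post-processing diagnostics (for example, grouping JSON output by category), the ranges above can be turned into a lookup. This is a small Python sketch; categorize and CATEGORIES are hypothetical names, and the mapping simply restates the table.

```python
import re

# Category per error range, restating the table above.
CATEGORIES = {
    "E1": "UTF-8 and encoding",
    "E2": "Word-level content",
    "E3": "Main tier",
    "E4": "Dependent tier structure",
    "E5": "Headers",
    "E6": "Dependent tier validation",
    "E7": "Alignment",
}


def categorize(code: str) -> str:
    """Map a code such as 'E305' or 'W108' to its documented category."""
    m = re.fullmatch(r"([EW])(\d)\d{2}", code)
    if m is None:
        raise ValueError(f"not a three-digit E/W code: {code!r}")
    if m.group(1) == "W":
        return "Warnings"
    return CATEGORIES.get("E" + m.group(2), "Unknown")


print(categorize("E305"))
```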

Common Errors and Fixes

E256: Curly single quote used as a word character

A curly single quotation mark (U+2018 or U+2019), commonly introduced by autocorrect or speech-to-text, is not a legal CHAT word character. CHAT words use the ASCII apostrophe (U+0027, the plain '). For example, a contraction typed as don + U+2019 + t is rejected; write don't with the ASCII apostrophe instead. chatter flags the curly form wherever it appears in word content and points the diagnostic at the exact character. This mirrors CLAN CHECK errors 138 and 139.
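A pre-check for this problem is easy to script. The Python sketch below locates curly single quotes in a word and applies a naive repair; the function names are hypothetical, and the blanket replacement is an assumption that suits contractions but should be reviewed before bulk use.

```python
# U+2018 and U+2019, the two characters E256 rejects in word content.
CURLY_SINGLE_QUOTES = {"\u2018", "\u2019"}


def curly_quote_columns(word: str) -> list[int]:
    """Return 1-based positions of curly single quotes in a word."""
    return [i for i, ch in enumerate(word, start=1) if ch in CURLY_SINGLE_QUOTES]


def to_ascii_apostrophe(word: str) -> str:
    """Naive repair: replace curly single quotes with the ASCII apostrophe."""
    return word.replace("\u2018", "'").replace("\u2019", "'")


print(curly_quote_columns("don\u2019t"), to_ascii_apostrophe("don\u2019t"))
```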

E304: Missing speaker code

A main tier line must have a speaker code after the *:

*CHI:	hello world .

An empty speaker code (*: hello .) triggers E304.

E308: Undeclared speaker

Every *SPEAKER: code must be listed in @Participants. Add the missing speaker to the header:

@Participants:	CHI Target_Child, MOT Mother
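The rule behind E308 can be approximated outside the validator. This Python sketch collects the declared codes (the first token of each comma-separated @Participants entry) and reports main-tier speakers that are missing; it is a simplified model with hypothetical function names, not chatter's parser.

```python
import re


def declared_speakers(participants_line: str) -> set[str]:
    """Speaker codes from an @Participants header line."""
    body = participants_line.split(":", 1)[1]
    return {entry.split()[0] for entry in body.split(",") if entry.strip()}


def undeclared_speakers(lines: list[str]) -> set[str]:
    """Speaker codes used on main tiers but absent from @Participants."""
    declared: set[str] = set()
    used: set[str] = set()
    for line in lines:
        if line.startswith("@Participants:"):
            declared |= declared_speakers(line)
        else:
            m = re.match(r"\*([^:\s]+):", line)
            if m:
                used.add(m.group(1))
    return used - declared


sample = [
    "@Participants:\tCHI Target_Child, MOT Mother",
    "*CHI:\thello .",
    "*FAT:\thi .",
]
print(undeclared_speakers(sample))
```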

E370: Retrace marker with nothing to retrace

A retrace or repetition marker ([/], [//], [///]) must be followed by the repeated or corrected material; per the CHAT manual the marker always refers to the text that follows it. A marker followed only by a terminator has nothing to retrace:

*CHI:	<the> [/] .          ← invalid: [/] is not followed by repeated material
*CHI:	<the> [/] the cat .  ← valid: the repeated material follows the marker

This mirrors CLAN CHECK error 119 (and the related retrace checks 151 and 159).
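As a rough illustration of the rule, the Python sketch below flags a retrace marker whose next token is a terminator (or that ends the line). It works on whitespace-separated tokens only and is a hypothetical simplification, not the validator's actual logic.

```python
RETRACE_MARKERS = {"[/]", "[//]", "[///]"}
TERMINATORS = {".", "?", "!"}


def dangling_retrace(main_tier: str) -> bool:
    """True if a retrace marker is followed only by a terminator or nothing."""
    tokens = main_tier.split(":", 1)[1].split()
    for i, tok in enumerate(tokens):
        if tok in RETRACE_MARKERS:
            following = tokens[i + 1:]
            if not following or following[0] in TERMINATORS:
                return True
    return False


print(dangling_retrace("*CHI:\t<the> [/] ."))
print(dangling_retrace("*CHI:\t<the> [/] the cat ."))
```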

E505: Invalid @ID format

Check that pipe-separated fields are correct and the speaker code matches @Participants:

@ID:	eng|corpus|CHI|2;6.||||Target_Child|||
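To see the pipe-separated structure of the line above, it helps to split it into named fields. The field names in this Python sketch follow the usual CHAT manual ordering (language, corpus, speaker code, age, sex, group, SES, role, education, custom); treat them as an assumption of the sketch rather than chatter's JSON field names. parse_id is a hypothetical helper.

```python
# Assumed field order, based on the CHAT manual's description of @ID.
ID_FIELDS = [
    "language", "corpus", "speaker", "age", "sex",
    "group", "ses", "role", "education", "custom",
]


def parse_id(line: str) -> dict[str, str]:
    """Split an @ID header into its ten pipe-separated fields."""
    body = line.split(":", 1)[1].strip()
    parts = body.split("|")
    # A well-formed header ends with "|", so the split yields one trailing
    # empty string after the ten fields.
    if len(parts) != len(ID_FIELDS) + 1 or parts[-1] != "":
        raise ValueError(f"expected {len(ID_FIELDS)} pipe-terminated fields")
    return dict(zip(ID_FIELDS, parts))


fields = parse_id("@ID:\teng|corpus|CHI|2;6.||||Target_Child|||")
print(fields["speaker"], fields["role"])
```

The speaker value can then be compared against the codes declared in @Participants.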

E705: Main/%mor alignment mismatch

The number of %mor items must match the number of alignable words on the main tier. Retraces, pauses, and events are not counted. The validator shows a columnar diff:

  Main tier       %mor tier
  ──────────────  ──────────────
  I               pro|I
  want            v|want
  to              inf|to
  go              v|go
  home, ⊖

E714 / E715: %pho, %mod, or %wor count mismatch

The same two codes are reused for “too few” / “too many” count mismatches on %pho, %mod, and %wor.

For %wor, the main-tier side is a spoken-token inventory:

  • regular words and fillers count
  • fragments, nonwords, and xxx/yyy/www count
  • retrace does not change %wor membership
  • replacements keep the original spoken surface word for %wor

Context determines which items belong to the %wor set; it does not relax the alignment. Once an item is in the %wor set, alignment is still strict 1:1. So if a filler like &-mm counts on the main tier and %wor omits it, E714 is the correct result.

So this is valid:

*CHI:	<one &+ss> [/] one play ground .
%wor:	one •321008_321148• ss •321148_321368• one •321809_321969• play •322049_322310• ground •322390_322890• .

This is also valid:

*EXP:	&+ih <the what> [/] what's letter &+th is this ?
%wor:	ih •49063_49103• the •49103_49163• what •49183_50205• what's •50205_50405• letter •50405_50685• th •50886_50946• is •50946_51046• this •51086_51586• ?

And this is valid too:

*EXP:	what's is dis [: this] ?
%wor:	what's •37050_37471• is •37491_37631• dis •37631_38131• ?
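The membership rules can be checked against the three valid examples above with a deliberately small model. The Python sketch below handles only the constructs shown on this page (angle-bracket groups, bracketed annotations, and the &-, &+, &~ prefixes); the function names are hypothetical, and mapping "too many" to E715 is inferred from the heading rather than stated outright.

```python
import re
from typing import Optional

TERMINATORS = {".", "?", "!"}


def main_tier_inventory(main_tier: str) -> list[str]:
    """Spoken tokens on a main tier, under the simplified rules above."""
    content = main_tier.split(":", 1)[1]
    content = re.sub(r"\[[^\]]*\]", " ", content)  # drop [/], [: this], ...
    content = content.replace("<", " ").replace(">", " ")
    words = [t for t in content.split() if t not in TERMINATORS]
    # Fillers and fragments count; only their prefix is stripped.
    return [re.sub(r"^&[-+~]", "", w) for w in words]


def wor_inventory(wor_tier: str) -> list[str]:
    """Words on a %wor tier, ignoring timing bullets and the terminator."""
    content = wor_tier.split(":", 1)[1]
    content = re.sub(r"[\u2022\x15]\d+_\d+[\u2022\x15]", " ", content)
    return [t for t in content.split() if t not in TERMINATORS]


def wor_count_check(main_tier: str, wor_tier: str) -> Optional[str]:
    """Return 'E714' (too few), 'E715' (too many), or None when counts match."""
    expected = len(main_tier_inventory(main_tier))
    actual = len(wor_inventory(wor_tier))
    if actual < expected:
        return "E714"
    if actual > expected:
        return "E715"
    return None


print(main_tier_inventory("*CHI:\t<one &+ss> [/] one play ground ."))
```

The sketch compares counts only; strict 1:1 alignment additionally requires the tokens to line up item by item.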

E721: %gra sequential index error

%gra entries must have sequential 1-based indices: 1|...|... 2|...|... 3|...|...
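The sequential-index requirement is simple to check by hand or in a script. A minimal Python sketch, with a hypothetical helper name and invented sample relations:

```python
def gra_index_errors(gra_tier: str) -> list[str]:
    """Report %gra entries whose first field is not the expected 1-based index."""
    entries = gra_tier.split(":", 1)[1].split()
    errors = []
    for expected, entry in enumerate(entries, start=1):
        index = entry.split("|", 1)[0]
        if index != str(expected):
            errors.append(
                f"entry {expected}: expected index {expected}, found {index!r}"
            )
    return errors


print(gra_index_errors("%gra:\t1|2|SUBJ 3|0|ROOT"))
```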

Generated Error Documentation

The source of truth for error-code details is spec/errors/. Maintainers can also regenerate a local error-reference set from those specs when working on diagnostics:

cargo run --manifest-path spec/tools/Cargo.toml --bin gen_error_docs

That generated reference includes the error description, example inputs, suggested fixes, and the layer that catches the diagnostic.

Chatter Desktop

Status: Current Last modified: 2026-06-21 21:33 EDT

Chatter Desktop is a native graphical validation app for CHAT files, released alongside the chatter CLI. Prefer the chatter CLI for scripted or batch validation; use the desktop app when you want a standalone graphical validation experience without a terminal.

When to use Chatter Desktop

Chatter Desktop (apps/chatter-desktop/) is the right tool when you want to:

  • Validate CHAT files through a graphical interface, no terminal required
  • Drag and drop a file or folder and read errors with source snippets
  • Work on the desktop without setting up a terminal workflow

Related surfaces:

  • Validate CHAT from the command line: use chatter validate

This page documents the desktop surface:

  • Chatter Desktop (apps/chatter-desktop/), the CHAT validation GUI

Current status

  • Release contract: released alongside the CLI in the public chatter release
  • Distribution: ships in the coordinated chatter release alongside the CLI; also buildable from source (below)
  • Platforms: macOS, Windows, and Linux

Staying up to date

Chatter Desktop keeps itself current. When you launch it, it quietly checks for a newer release; if one is available it asks whether to update, and on your confirmation it downloads, installs, and restarts into the new version. If the check cannot reach the network it simply does nothing and the app keeps working on the version you have. You never have to track releases or re-download by hand.

Getting Started

Build from source

cd apps/chatter-desktop
npm ci
cargo tauri dev       # launches the app with hot reload
cargo tauri build     # produces a distributable app bundle

Requires: Rust (stable, edition 2024), Node.js, and npm.

Using the App

Opening files

Chatter validates one target at a time: a single .cha file or one folder.

Three ways to start validating:

  1. Choose File: opens a file picker filtered to .cha files
  2. Choose Folder: opens a folder picker; validates all .cha files recursively
  3. Drag and drop: drag one .cha file or one folder onto the app window

When idle, if you’ve previously validated a target, the drop zone shows “Last: corpus/reference/, Re-validate?” as a clickable shortcut.

Reading results

The main window has three areas:

┌──────────────────────────────────────────────────────────────┐
│  [Choose File] [Choose Folder] or drag here  [System|Light|Dark] │
├──────────────────┬───────────────────────────────────────────┤
│ 3 FILES WITH     │  Filter by code… [All|Errors|Warnings]    │
│ ERRORS / 120     │                                           │
│                  │  ▾ [E302] Missing @End header              │
│  📁 corpus/      │  ┌───────────────────────┐                │
│    ✗ file1 (3)   │  │ 41 │ *CHI: hello .    │                │
│    ✗ file3 (1)   │  │ 42 │                   │                │
│                  │  │    │ ^                 │                │
│                  │  └───────────────────────┘                │
│                  │  💡 Add @End on the last line             │
│                  │  [Copy] [Open in CLAN]                    │
├──────────────────┴───────────────────────────────────────────┤
│  Progress: 45/120 │ 4 errors │ ~2m 30s remaining │ [Cancel]  │
└──────────────────────────────────────────────────────────────┘
  • File tree (left): a collapsible directory tree showing only files with errors (valid files are hidden to reduce clutter). A header shows “N files with errors / M total”. Files are sorted alphabetically.

  • Error panel (right): for the selected file, shows each error with its code in [E001] format, severity color, message, source snippet with caret underlines, and multi-span labels for complex errors (e.g., alignment mismatches across tiers). CHAT-specific formatting is handled: tabs are expanded to 8-column boundaries, \x15 bullets are rendered as a visible bullet glyph, and underline markers are shown as styled underlined text. Suggestions are prefixed with 💡.

  • Status bar (bottom): streaming progress during validation, ETA after 5+ files, total error count, and action buttons.

Filtering errors

A compact filter bar appears above the error cards when a file has diagnostics:

  • Code filter: type “E7” to show only alignment errors, “W” for warnings, etc.
  • Severity toggle: switch between All / Errors / Warnings

The file header updates to show filtered vs. total count (e.g., “3 errors (7 total)”).
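The two filters combine as a simple conjunction. The following is a minimal sketch of that behavior, not the app's actual frontend code: the Diagnostic type, the severity strings, and the case-insensitive prefix match are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Diagnostic:
    code: str       # e.g. "E302"
    severity: str   # "error" or "warning" (hypothetical values)
    message: str


def filter_diagnostics(diagnostics, code_prefix="", severity="all"):
    """Keep diagnostics matching both the code prefix and the severity toggle."""
    prefix = code_prefix.strip().upper()
    return [
        d for d in diagnostics
        if d.code.upper().startswith(prefix)
        and (severity == "all" or d.severity == severity)
    ]


def header_label(shown, total):
    """Format the file header, adding the total only when a filter hides items."""
    if shown == total:
        return f"{total} errors"
    return f"{shown} errors ({total} total)"
```

An empty code prefix matches every diagnostic, so the severity toggle works on its own when the text box is blank.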

Collapsible error cards

Each error card has a clickable header that toggles between expanded and collapsed view. Collapsed cards show only the error code and first line of the message. When a file has 5 or more errors, an Expand All / Collapse All button appears.

Dark mode

Chatter follows your system appearance by default. A System / Light / Dark toggle in the drop zone area lets you override. Your preference is remembered across sessions.

The dark palette uses muted Apple-style colors and keeps miette error highlighting readable on dark backgrounds.

Clickable file paths

Click the file name in the error panel heading to reveal the file in Finder (macOS), Explorer (Windows), or the default file manager (Linux).

Copy errors

Each error card has a Copy button that copies the full miette-rendered error text (plain text, not HTML) to your clipboard for pasting into issue reports or messages.

Actions

| Action | Where | What it does |
| --- | --- | --- |
| Re-validate | Status bar / last-target hint | Re-run validation on the same target (picks up edits) |
| Cancel | Status bar (during validation) | Stop the current run |
| Export | Status bar | Save results as JSON or plain text via a save dialog |
| Open in CLAN | Per-error button | Opens the file at the error location in the CLAN editor |
| Copy | Per-error button | Copies the plain-text error to clipboard |
| Reveal in file manager | File name heading | Opens the file’s parent directory |

“Open in CLAN” only appears when the CLAN application is detected on your system (macOS and Windows only). It adjusts line numbers to account for headers that CLAN hides (@UTF8, @PID, @Font, @ColorWords, @Window).

Keyboard shortcuts

| Shortcut | Action |
| --- | --- |
| Ctrl+R / Cmd+R | Re-validate |
| Escape | Cancel running validation |

All other navigation is mouse-driven (click files, scroll errors).

Window title

The window title updates to reflect the current state:

  • Idle: “Chatter”
  • Discovering: “Chatter, Discovering files…”
  • Running: “Chatter, Validating (45/120)”
  • Finished: “Chatter, 14 errors in 3 files” or “Chatter, All 74 files valid”

ETA

After 5 or more files have been processed, the status bar shows an estimated time remaining (e.g., “~2m 30s remaining”). The estimate updates every second.
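One plausible way to produce such an estimate is a running average of time per file. The sketch below assumes that formula and the function name format_eta; the app's actual estimator is not documented here and may weight recent files differently.

```python
def format_eta(processed, total, elapsed_seconds, min_files=5):
    """Return a label like "~2m 30s remaining", or None when no ETA is shown."""
    if processed < min_files or processed >= total:
        return None  # too few samples, or nothing left to estimate
    per_file = elapsed_seconds / processed          # average seconds per file so far
    remaining = round(per_file * (total - processed))
    minutes, seconds = divmod(remaining, 60)
    return f"~{minutes}m {seconds}s remaining"
```

With 45 of 120 files done after 90 seconds, the average is 2 seconds per file, so the remaining 75 files are estimated at 150 seconds.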

Notifications

When validation finishes while the app is not focused, a system notification shows the summary (“Validation complete, 14 errors in 3 files”).

First launch

On first launch, an onboarding overlay explains the four main interactions: drag files, error panel, keyboard shortcuts, and export. Dismiss it with “Got it”; it won’t appear again.

CLI Bundling

The desktop app can bundle the chatter CLI binary so power users who download the GUI can also run the CLI from their terminal (like VS Code ships the code command).

An Install CLI Command menu item (when available) symlinks the bundled binary to /usr/local/bin/chatter (macOS/Linux) or copies it to a PATH directory (Windows).

To build with the bundled CLI:

cargo build --release -p chatter
mkdir -p apps/chatter-desktop/src-tauri/resources
cp target/release/chatter apps/chatter-desktop/src-tauri/resources/
cargo tauri build

Architecture

The desktop app lives in apps/chatter-desktop/:

apps/chatter-desktop/
  src-tauri/          Rust backend (Tauri v2)
    src/
      main.rs         Bin entry, calls chatter_desktop_lib::run()
      lib.rs          Tauri app setup (Builder + module wiring)
      protocol.rs     Shared command/event names + request types
      commands.rs     validate, cancel, open_in_clan, export, reveal, install_cli
      events.rs       ValidationEvent → frontend event bridge
      validation.rs   Desktop validation orchestration for one target
  src/                React + TypeScript frontend
    components/       DropZone, FileTree, ErrorPanel, ProgressBar, OnboardingOverlay
    hooks/            useValidation, validationState, useTheme
    protocol/         Command/event names + TypeScript transport mirrors
    runtime/          Tauri transport + capability-focused runtime seam

The Rust backend calls validate_directory_streaming() from talkbank-transform directly. This is the same streaming validation pipeline used by the TUI. Events flow over crossbeam channels to the Rust side, then are serialized to JSON and emitted to the frontend via Tauri’s event bridge.

Cancellation uses ArcSwapOption for a lock-free atomic swap of the cancel sender, so no mutex is needed.

The frontend keeps Tauri-specific code confined to src/runtime/tauriTransport.ts. React components and hooks consume narrower capabilities (validationRunner, validationTarget, clan, exports) instead of reaching for one broad desktop service object.

Comparison with TUI

| Feature | TUI (chatter validate) | Desktop app |
| --- | --- | --- |
| File selection | CLI arguments | Drag-and-drop, file picker |
| Navigation | Keyboard (Tab, arrows) | Mouse click |
| Error display | Two-pane terminal UI | Scrollable panels with source snippets |
| Error filtering | Not available | Code filter + severity toggle |
| Copy error | Not available | Copy button per error |
| Open in CLAN | c key | Button per error |
| Export | --format json --audit | Save dialog (JSON or text) |
| Streaming progress | Progress bar | Progress bar + ETA |
| Dark mode | Terminal theme | System/Light/Dark toggle |
| Caching | Same engine | Same engine |
| Who it’s for | Power users, CI | Researchers, linguists |

Both use the identical validation engine and produce the same error codes.

When to Use Which Tool

The TalkBank toolchain offers validation through three interfaces. Each serves a different workflow:

| Tool | Audience | Use when |
| --- | --- | --- |
| Chatter Desktop | Researchers, linguists | You want a graphical, drag-and-drop CHAT validation app without using a terminal. |
| chatter validate (TUI) | Power users | You’re comfortable in a terminal and want keyboard-driven navigation. |
| chatter validate (CLI) | CI, scripts | You need machine-readable output (--format json) or batch audits (--audit). |

Chatter Desktop focuses on validation only.

CLAN Line Numbering

Status: Current Last modified: 2026-05-29 17:31 EDT

When you click “Open in CLAN” in the desktop app or press Enter in the TUI, chatter sends the error location to the CLAN editor. CLAN opens the file and places the cursor at the error. This usually works seamlessly, but there is one caveat: CLAN and chatter count lines differently.

Hidden Headers

CLAN hides five header types from its editor display:

| Header | Purpose |
| --- | --- |
| @UTF8 | Character encoding declaration |
| @PID | Persistent identifier |
| @Font | Display font settings |
| @ColorWords | Color coding rules |
| @Window | Window position/size |

These headers are present in the .cha file but invisible in CLAN’s editor. CLAN’s line numbers skip them entirely. A file that starts with @UTF8 on line 1 will show @Begin as “line 1” in CLAN’s display, even though it’s actually line 2 in the file.

What Chatter Does

Chatter automatically adjusts line numbers before sending to CLAN:

  1. Compute the error’s line number in the source file
  2. Count how many hidden headers appear before that line
  3. Subtract the hidden count to get CLAN’s line number
  4. Send the adjusted line number to CLAN

This happens transparently; you don’t need to do anything.

Edge Case: Errors on Hidden Lines

If an error is on a hidden header itself (e.g., a malformed @UTF8 line), CLAN cannot navigate to it because CLAN doesn’t display that line. In this case, “Open in CLAN” will show an error message explaining why.
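The adjustment and its edge case can be sketched in a few lines. This is an illustration of the rule described above, not the code in clan_location.rs; the function names are hypothetical, and line numbers are assumed to be 1-based.

```python
HIDDEN_HEADERS = {"@UTF8", "@PID", "@Font", "@ColorWords", "@Window"}


def is_hidden_header(line):
    """True when the line is one of the five headers CLAN hides."""
    name = line.split(":", 1)[0].split("\t", 1)[0].strip()
    return name in HIDDEN_HEADERS


def clan_line_number(lines, file_line):
    """Map a 1-based file line to CLAN's 1-based line, or None if CLAN hides it."""
    if not 1 <= file_line <= len(lines):
        raise ValueError(f"line {file_line} is outside the file")
    if is_hidden_header(lines[file_line - 1]):
        return None  # the error sits on a line CLAN never displays
    hidden_before = sum(
        1 for line in lines[: file_line - 1] if is_hidden_header(line)
    )
    return file_line - hidden_before
```

For a file that begins with @UTF8 and @PID, the @Begin header on file line 3 maps to CLAN line 1, while a request for file line 1 returns None.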

For Developers

The shared resolution logic lives in talkbank_model::resolve_clan_location(). Both the TUI and the desktop app call this function, which resolves line/column from byte offsets when needed and adjusts for hidden headers.

See clan_location.rs for the implementation and tests.

Batch Workflows

Status: Current Last modified: 2026-06-12 21:05 EDT

The chatter CLI is designed for processing large CHAT corpora efficiently. This page covers common batch workflows.

Validating a Corpus

Validate all .cha files in a directory tree:

chatter validate /path/to/corpus/

The validator recursively discovers .cha files and processes them in parallel. Results are cached, so subsequent runs skip unchanged files.

Forcing Revalidation

To bypass the cache and revalidate everything:

chatter validate /path/to/corpus/ --force

Filtering Output

Show only errors (hide warnings):

chatter validate /path/to/corpus/ --quiet

Stop after the first reported error:

chatter validate /path/to/corpus/ --max-errors 1

Write a JSONL audit file while validating:

chatter validate /path/to/corpus/ --audit validation.jsonl

CHAT-JSON Roundtrip

Convert an entire corpus to JSON and back:

# bash only: enable recursive ** matching (zsh recurses by default)
shopt -s globstar

# CHAT → JSON
for f in corpus/**/*.cha; do
  chatter to-json "$f" > "${f%.cha}.json"
done

# JSON → CHAT
for f in corpus/**/*.json; do
  chatter from-json "$f" > "${f%.json}.roundtrip.cha"
done

The roundtrip is designed to preserve the ChatFile model. In regression tests, compare normalized output rather than assuming byte-for-byte identity after parser or serializer changes.

Cache Management

The validation cache stores results for previously validated files (keyed by content hash). The cache database file is named talkbank-cache.db and lives in the OS cache directory:

  • macOS: ~/Library/Caches/talkbank-chat/talkbank-cache.db
  • Linux: ~/.cache/talkbank-chat/talkbank-cache.db
  • Windows: %LocalAppData%\talkbank-chat\talkbank-cache.db

It can hold results for large file collections.

To relocate the cache (a different disk, a per-project cache, or an isolated cache for scripted runs), set the TALKBANK_CHAT_CACHE_DIR environment variable to a directory; the database is created directly inside it. This is the supported override on every platform, and the only effective one on Windows, where the default location comes from the system Known Folder API rather than environment variables.

chatter cache stats    # Show hit rates and entry count
chatter cache clear --all

Do not delete the cache file manually while chatter is running.
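For scripted runs that should not touch the shared cache, the override can be set per process. The sketch below assumes chatter is on PATH and uses hypothetical helper names; only the TALKBANK_CHAT_CACHE_DIR variable and the validate flags come from this documentation.

```python
import os
import subprocess
import tempfile


def isolated_cache_env(cache_dir, base_env=None):
    """Copy the environment and point chatter's cache at cache_dir."""
    env = dict(os.environ if base_env is None else base_env)
    env["TALKBANK_CHAT_CACHE_DIR"] = str(cache_dir)
    return env


def validate_with_fresh_cache(corpus):
    """Run one validation against a throwaway cache directory."""
    with tempfile.TemporaryDirectory() as cache_dir:
        result = subprocess.run(
            ["chatter", "validate", corpus, "--tui-mode", "disable"],
            env=isolated_cache_env(cache_dir),
        )
        return result.returncode
```

Because the temporary directory is removed when the run ends, every call starts from an empty cache and leaves the default cache untouched.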

Reference Corpus Validation

This repository includes a reference corpus at corpus/reference/ (currently ~100 .cha files; verify with find corpus/reference -name '*.cha' | wc -l). The parser must handle every file in this corpus, with no failures:

cargo nextest run -p talkbank-parser-tests -E 'test(parser_equivalence)'

This runs the parser equivalence test. Each .cha file is its own test, so nextest runs them in parallel and reports individual failures.

Integration with batchalign

The Batchalign pipeline uses the same Rust core (via PyO3) for CHAT parsing and serialization. Since the 2026-04-28 monorepo merge, Batchalign source lives inside this repository under crates/batchalign-* (the standalone batchalign3 GitHub repo was archived). Files processed by Batchalign produce valid CHAT that passes chatter validate.

CI Integration

Status: Current Last updated: 2026-04-13 19:23 EDT

How to use chatter in continuous integration pipelines.

Exit Codes

| Code | Meaning |
| --- | --- |
| 0 | All files valid / command succeeded |
| 1 | Validation errors found or command failed |
| 2 | Invalid arguments or missing required options |

All examples below rely on exit code 1 to signal validation failure.

Basic Usage

chatter validate corpus/ --quiet --tui-mode disable
  • --quiet suppresses per-file success output
  • --tui-mode disable prevents interactive TUI (required in non-TTY environments)
  • Exit code 0 means all files valid; 1 means errors found
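A wrapper script can turn these exit codes into readable messages for CI logs. The following is a minimal sketch with hypothetical helper names; it assumes chatter is on PATH and uses only the flags and exit codes listed above.

```python
import subprocess

EXIT_MEANINGS = {
    0: "all files valid",
    1: "validation errors found or command failed",
    2: "invalid arguments or missing required options",
}


def validate_command(corpus):
    """Build the non-interactive validate invocation shown above."""
    return ["chatter", "validate", corpus, "--quiet", "--tui-mode", "disable"]


def describe_exit(code):
    """Translate a chatter exit code into a log-friendly message."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_validation(corpus):
    """Run chatter and return (exit_code, description)."""
    code = subprocess.run(validate_command(corpus)).returncode
    return code, describe_exit(code)
```

A CI job would typically call run_validation("corpus/"), print the description, and exit with the returned code so the pipeline status matches chatter's result.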

GitHub Actions Example

- name: Validate CHAT corpus
  run: |
    chatter validate corpus/ --quiet --tui-mode disable --format json --audit results.jsonl

- name: Upload validation report
  if: failure()
  uses: actions/upload-artifact@v4
  with:
    name: validation-report
    path: results.jsonl

The --audit results.jsonl flag streams per-error JSON lines to a file, which is useful for archiving or downstream analysis even when the step fails.

JSON Output for Automation

chatter validate corpus/ --format json --tui-mode disable 2>/dev/null

Each file produces a JSON object on stdout with status, error_count, and errors array. The exit code still reflects overall pass/fail.

Pre-commit Hook

#!/bin/sh
# .git/hooks/pre-commit
chatter validate . --quiet --tui-mode disable

This blocks commits that introduce invalid CHAT files. The hook runs quickly on cached files; only modified files are re-validated.

Suppressing Specific Errors

Some corpora have known issues that should not block CI. Use --suppress to ignore specific error codes or named groups:

chatter validate corpus/ --suppress E726,E727,E728 --tui-mode disable

Or use the named group shorthand:

chatter validate corpus/ --suppress xphon --tui-mode disable

Suppressed errors do not appear in output and do not affect the exit code.

Audit Mode for Large Corpora

For bulk corpus validation where you want a full error database without caching overhead:

chatter validate corpus/ --audit errors.jsonl --tui-mode disable

The --audit flag streams one JSON object per error to the specified file. A summary is printed to stderr at the end.
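An audit file lends itself to quick summaries, such as counting errors per code before fixing by pattern. The sketch below assumes each line is a JSON object with a "code" field; that field name is a guess for illustration, so check the actual audit records and pass the right key.

```python
import json
from collections import Counter


def count_by_key(jsonl_text, key="code"):
    """Count audit records per value of `key` (the field name is an assumption)."""
    counts = Counter()
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        counts[str(record.get(key, "<missing>"))] += 1
    return counts


def most_common_codes(jsonl_text, limit=10, key="code"):
    """Return the `limit` most frequent values with their counts."""
    return count_by_key(jsonl_text, key).most_common(limit)
```

Reading the file with open("errors.jsonl", encoding="utf-8").read() and passing the text to most_common_codes gives a ranked list to work through.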

CHAT Processing Playbook for Editors and Analysts

Status: Current Last updated: 2026-03-24 00:01 EDT

Objective

Provide practical guidance for non-compiler users who create, edit, and validate CHAT files, with emphasis on error interpretation and correction workflow.

Who This Is For

  • Transcript editors,
  • corpus curators,
  • QA reviewers,
  • linguists using tooling outputs but not parser internals.

Core Editing Workflow

  1. Open file in editor with CHAT diagnostics enabled.
  2. Run validation (single file first, then batch).
  3. Fix highest-severity structural issues first (headers, tier markers, unmatched delimiters).
  4. Re-run validation and inspect warnings.
  5. Only then address style and normalization suggestions.

Error Triage Heuristic

  • Errors at file start: likely header formatting or encoding issues.
  • Errors at tier prefix: likely malformed */% tier syntax.
  • Errors inside words: likely symbol, marker, or annotation boundary issues.
  • Repeated same error class: likely one systemic rule violation pattern.

Fast Interpretation Guide

  • Error: parser/validator could not accept structure; must fix.
  • Warning: valid but suspicious or non-canonical; review strongly recommended.
  • Info: advisory normalization or convention hints.

Common Fix Recipes

  • Header spacing problems:
    • Ensure expected separators and avoid accidental tabs/spaces drift.
  • Unclear language/form markers:
    • Confirm @s usage and suffix ordering with house style guide.
  • Duration/annotation confusion:
    • Verify bracketed annotation form and avoid malformed punctuation.
  • Dependent tier attachment issues:
    • Ensure % tiers follow intended main tier and keep indentation consistent.

Batch Validation Workflow

  1. Validate a small sample first.
  2. Group failures by error code.
  3. Fix by pattern, not file-by-file random order.
  4. Re-run and confirm error count decreases monotonically.
  5. Save run report for audit trail.

Collaboration Workflow with Developers

When reporting parsing issues, include:

  • exact file path,
  • minimal excerpt around failing span,
  • observed diagnostic code/message,
  • expected behavior (if known).

This reduces back-and-forth and speeds defect triage.

Quality Checklist Before Publishing Corpus Updates

  • No unresolved error-level diagnostics.
  • Warning classes reviewed and accepted or fixed.
  • Participant headers and IDs internally consistent.
  • Roundtrip serialization check passes for representative samples.
  • Changelog note recorded for major normalization edits.

Training Recommendations

  • Maintain short examples for each common error class.
  • Provide editor cheat sheet for tier prefixes and marker syntax.
  • Run periodic QA calibration sessions across editors.

Sanitize (chatter debug sanitize)

Status: Current Last updated: 2026-04-28 22:18 EDT

chatter debug sanitize strips contributor lexical content from a CHAT file while preserving structure (timing bullets, %wor per-word offsets, speaker codes, dependent-tier scaffolding, structural counts, POS tags, language markers). Output is structurally identical to the input but contains no participant words, names, or free-text annotations.

The command exists so engineering tooling, including LLM-assisted debugging, can operate on protected-corpus files (aphasia/, dementia/, rhd/, fluency/Password/, clinical-children corpora, etc.) without exposing contributor speech to commercial LLM services.

When to use it

Run chatter debug sanitize on the source file before loading it into any tool (LLM-backed debugger, scratch directory, screen-shareable session) where you don’t want participant content visible.

When you need to ask a contributor for help debugging a specific file, frame the request as “run the sanitizer locally and send me the output” rather than asking for the raw file.

Usage

# Write sanitized output to stdout
chatter debug sanitize input.cha

# Write sanitized output to a file
chatter debug sanitize input.cha --output sanitized.cha

Working location for sanitized files: prefer a stable, non-/tmp scratch directory (e.g. set TB_SCRATCH_DIR to a per-project dir under your workstation’s persistent storage) for any state that should outlive a single command. macOS clears /tmp on reboot.

What is preserved (byte-exact)

  • Timing bullets •start_end• on the main tier.
  • %wor per-word offsets (word START_END triples).
  • Speaker codes (*PAR, *INV, *CHI, …).
  • Utterance count, word count per utterance, dependent-tier count.
  • Structural markers: compound +, clitic ~, CA elements, overlap points, lengthening, stress markers, syllable pause, underline begin/end, proper-noun @n markers.
  • Language markers (@s:LANG), form types (@a, @b), POS tags ($adj, $n).
  • Headers: @Languages, @Birth, @Date, @Media, @PID, @L1Of, @Begin/@End/@UTF8.
  • %mor POS categories and morphological features (e.g., n|, -Past).
  • %gra (numeric grammatical relations) and %tim (timing).
  • Untranscribed tokens xxx / yyy / www: replacing them would change semantic meaning, so they pass through unchanged.

What is replaced or redacted

| Source | Replacement |
| --- | --- |
| WordContent::Text | wN placeholder, indexed by document position |
| Shortening text | (x) |
| %mor lemmas (MorWord.lemma) | lemmaN; POS + features preserved |
| %pho / %mod / %modsyl / %phosyl / %phoaln / %sin | tier dropped |
| Free-text dependent tiers (%com %add %exp %sit %spa %int %gpx %eng %gls %ort %flo %def %coh %fac %par %alt %err) | [redacted] |
| @Comment, @Transcriber, @Birthplace, @Activities, @Situation, @RoomLayout, @Location, @TapeLocation, @Warning, @Bck | [redacted] (when content was free text) |
| @Participants participant-name field | dropped (Participant_<SPEAKER_CODE> is implied by speaker code + role) |
| @ID custom_field and education | cleared |
| Event event_type (&=imitates:Mary → &=[redacted]) | [redacted] |
| Freecode text ([^ aside]) | [redacted] |
| OtherSpokenEvent text | [redacted] |

Determinism + Idempotence

Placeholder generation is keyed off (utterance_index, word_index) tree position, not a global counter. Two consequences:

  • Deterministic: sanitizing the same input twice produces byte-identical output.
  • Idempotent: sanitizing a sanitized file produces the same file again: no double-replacement, no shifting placeholder numbers.
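Both properties follow from deriving each placeholder purely from the word's position. The toy model below shows why, using lists of word strings in place of the real ChatFile tree; the numbering scheme (a running document position) is an assumption about how N is computed, not a description of talkbank_transform::redact.

```python
UNTRANSCRIBED = {"xxx", "yyy", "www"}


def sanitize_words(utterances):
    """Replace each word with a placeholder derived only from its position."""
    # Number of word slots before each utterance, so N depends on
    # (utterance_index, word_index) and on nothing else.
    offsets = []
    seen = 0
    for utterance in utterances:
        offsets.append(seen)
        seen += len(utterance)

    sanitized = []
    for u, utterance in enumerate(utterances):
        row = []
        for w, word in enumerate(utterance):
            if word in UNTRANSCRIBED:
                row.append(word)  # passes through, but still occupies a slot
            else:
                row.append(f"w{offsets[u] + w + 1}")
        sanitized.append(row)
    return sanitized
```

Because the placeholder never depends on the word's text or on how many replacements happened earlier, running the function on its own output changes nothing.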

Pipeline

flowchart LR
    Input["Source .cha\n(protected corpus)"] --> Parser["TreeSitterParser\n(talkbank-parser)"]
    Parser --> Model["ChatFile model\n(talkbank-model)"]
    Model --> Sanitize["sanitize()\n(talkbank-transform::redact)"]
    Sanitize --> Walker["walk_words_mut\n+ header walker\n+ dep-tier walker\n+ scoped-annot walker"]
    Walker --> Mutated["Mutated ChatFile\n(placeholders + redactions)"]
    Mutated --> Writer["WriteChat\n(byte-exact bullets)"]
    Writer --> Output["Sanitized .cha\n(scratch path)"]

The walker step replaces WordContent::Text segments inside Word.content, mutates MorWord.lemma fields, redacts free-text header / dep-tier / scoped-annotation strings, and drops phonological tiers. WriteChat then re-serializes, and because it serializes from Word.content (not from Word.raw_text), every CA element, compound marker, clitic boundary, and timing bullet round-trips byte-exact.

Out of v1 scope

Documented for transparency; v2 work:

  • Speaker-code anonymization (graph rewrite across @Participants, @ID, *SPK:, @Birth, @L1Of).
  • @Birth / @Date fuzzing (exact birth dates can be identifying).
  • @Media filename redaction.
  • Audio-side sanitization. (Audio bytes are never touched by the sanitizer; the audio stays at its original path.)
  • “Unsanitize” or round-trip mapping. Explicitly not built: the sanitizer is one-way, and the mapping table that would reverse it is the exact artifact we don’t want to exist.

Implementation

Library module: talkbank_transform::redact. CLI surface: chatter debug sanitize. The strict policy is the only public preset in v1; future variants can grow on SanitizationPolicy.

Speaker-ID (chatter speaker-id)

Status: Draft Last modified: 2026-06-15 12:18 EDT

chatter speaker-id assigns CHAT-conformant speaker codes and role tags to a CHAT file whose speakers carry anonymous or placeholder labels (typically the output of an ASR system that labels speakers as PAR0, PAR1, …). It is the bridge between an ASR pipeline that does not understand speaker roles and a CHAT pipeline that does.

The command is structural: it does not modify utterance content, does not run audio analysis, does not infer speaker identity from voice features. Its inputs are the CHAT file to relabel plus an identification signal (reference transcript, explicit mapping, or saved override record); its output is the same CHAT file with speaker codes rewritten and @Participants / @ID headers reconciled.

When to use it

Use it whenever you have a CHAT file with placeholder speaker codes that need to become CHAT-conformant codes before downstream tooling can process the file meaningfully. The canonical case is an ASR system that emits CHAT but does not know which speaker is the child, parent, clinician, etc.

A complete pipeline that consumes ASR output and produces a publishable CHAT file goes:

flowchart LR
    Media --> Transcribe
    Transcribe["batchalign3 transcribe<br/>ASR"] --> AsrAnon["asr.cha<br/>PAR0, PAR1, ..."]
    Ref["reference.cha<br/>target speakers only"] -.->|reference signal| SpkId
    AsrAnon --> SpkId
    SpkId["chatter speaker-id<br/>(this page)"] --> AsrLabeled["asr-labeled.cha<br/>CHI, INV, MOT, ..."]
    AsrLabeled --> Merge["chatter merge"]
    Ref --> Merge
    Merge --> Aligned["batchalign3 align"]

The speaker-id stage is the single point in the pipeline where “which anonymous speaker corresponds to which CHAT role” is decided. Downstream stages (chatter merge, batchalign3 align, batchalign3 morphotag) all trust that the labels they receive are correct.

Identification modes

Three mutually-exclusive modes, exactly one of which must be selected:

1. Reference mode

The most common case: a separate CHAT file already exists that covers the same media and contains an authoritative speaker (typically the hand-transcribed target speaker). The reference file’s anchor speaker tells us what that speaker’s content looks like; speaker-id finds the matching speaker in the input by text similarity.

The matching algorithm is multiset Jaccard over bags of content tokens (see “Algorithm” below for the full specification). The ASR speaker whose bag-of-words best matches the reference anchor’s bag-of-words is taken as the same speaker and is marked for drop in the output: because the reference file authoritatively covers them, the downstream chatter merge stage will pull their utterances from the reference, not from this file. The remaining speakers are renamed to the role specified by --inserted-role.

If the Jaccard margin between the winning speaker and the runner-up is below --confidence-threshold, the command refuses to auto-decide. The operator must either lower the threshold (not recommended without spot-checking), supply an explicit mapping (--mapping), or load a previously-adjudicated override (--override-file).

2. Explicit-mapping mode

The operator already knows the mapping (typically because they listened to the audio, or because the contributor’s data sheet documents it). They supply it directly.

chatter speaker-id input.cha \
  --mapping "PAR0=INV:Investigator,PAR1=drop" \
  -o relabeled.cha

The grammar for --mapping:

  • One or more comma-separated assignments.
  • OLD=CODE:ROLE renames OLD to CODE with role tag ROLE.
  • OLD=drop removes OLD’s utterances entirely.
  • Every speaker present in the input must be named in the mapping (no defaulting). This is intentional: operator decisions should be explicit.
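The grammar is small enough to sketch as a parser. This is an illustrative reading of the rules above, with hypothetical function names; it is not chatter's own argument parser, and details such as whitespace handling are assumptions.

```python
from typing import Dict, Iterable, Optional, Tuple


def parse_mapping(spec: str) -> Dict[str, Optional[Tuple[str, str]]]:
    """Parse "OLD=CODE:ROLE,OLD=drop" into {old: (code, role)} or {old: None}."""
    mapping: Dict[str, Optional[Tuple[str, str]]] = {}
    for item in spec.split(","):
        old, eq, target = item.strip().partition("=")
        if not eq or not old or not target:
            raise ValueError(f"malformed assignment: {item!r}")
        if target == "drop":
            mapping[old] = None  # remove this speaker's utterances
            continue
        code, colon, role = target.partition(":")
        if not colon or not code or not role:
            raise ValueError(f"expected CODE:ROLE in {item!r}")
        mapping[old] = (code, role)
    return mapping


def check_coverage(mapping, speakers: Iterable[str]) -> None:
    """Reject mappings that leave an input speaker unnamed (no defaulting)."""
    missing = sorted(set(speakers) - set(mapping))
    if missing:
        raise ValueError(f"speakers missing from mapping: {', '.join(missing)}")
```

Parsing the example above yields PAR0 mapped to ("INV", "Investigator") and PAR1 mapped to None, which stands for drop.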

3. Override-file mode

The operator has previously adjudicated this session (perhaps through an interactive review tool) and saved the decision to a shared override file. speaker-id reads the file, finds the entry for this session, and applies it. See “Override file format” below.

chatter speaker-id input.cha \
  --override-file batch-2026-05-27.overrides.toml \
  --session-id NF203-2 \
  -o relabeled.cha

This mode is the production substrate for batch workflows. The orchestrator first runs chatter speaker-id in reference mode for every session. For any session that exits with the low-confidence code (exit code 4), the operator works through an adjudication tool that writes to the override file. The orchestrator then re-runs chatter speaker-id in override-file mode for those sessions.

CLI contract

chatter speaker-id <INPUT> [OPTIONS]

ARGUMENTS:
  <INPUT>  Path to the CHAT file to relabel.

OPERATION MODES (exactly one required):

  REFERENCE MODE:
    --reference <FILE>
    --anchor <SPEAKER>
    --inserted-role <CODE>:<TAG>[,<CODE>:<TAG>...]

  EXPLICIT-MAPPING MODE:
    --mapping <SPEC>

  OVERRIDE-FILE MODE:
    --override-file <FILE>
    --session-id <ID>

REFERENCE-MODE OPTIONS:
  --confidence-threshold <FLOAT>
      Minimum Jaccard margin (winner_score / loser_score) for the
      command to auto-decide. Below threshold: exit code 4. The
      command prints per-speaker scores to stderr so the operator
      can inspect. Default: 2.0.

  --write-override <FILE>
      When auto-decide succeeds, append the decision to FILE in
      override-file format (creates if missing). Captures the
      audit trail of a batch run.

COMMON OPTIONS:
  -o, --output <PATH>
      Write relabeled CHAT to PATH. Default: stdout.

The operator identity and any free-text note for a session are set
when an operator confirms it through `chatter adjudicate` (see the
merge workflow), not on this command.

Exit codes:

| Code | Meaning |
| --- | --- |
| 0 | Success, relabeled file written |
| 1 | Invalid input (parse error, missing file, unreadable) |
| 2 | Semantic precondition violated (reference has no utterances for anchor; mapping covers a speaker not in input; etc.) |
| 3 | Internal error |
| 4 | Reference mode: confidence threshold not met. Per-speaker scores printed to stderr; no output written |

What the output guarantees

These are testable invariants. Every release verifies them against the reference corpus.

Speaker codes match the supplied mapping

For every speaker in the input file:

  • If the mapping marks the speaker for drop, none of their utterances appear in the output, AND their @ID row (if any) is removed from the headers, AND their entry is removed from the @Participants header.
  • If the mapping marks the speaker for rename, every main-tier line *OLD:\t... becomes *NEW:\t... byte-stable except for the speaker code prefix. The @ID row’s third pipe-separated field (speaker code) and eighth field (role tag) are rewritten; other @ID fields are preserved. The @Participants entry’s code and role-tag tokens are rewritten; any intervening tokens (corpus ID, participant name) are preserved.
  • Speakers not in the mapping are passed through unchanged. (In modes 1 and 3, all speakers are assigned automatically; in mode 2, “all speakers must be in the mapping” is a precondition.)
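The @ID rewrite touches exactly two of the pipe-separated fields. A minimal sketch of that rule follows; the function name is hypothetical, the real implementation works on the typed model rather than on raw text, and the sample row is invented for illustration.

```python
def rewrite_id_row(line, new_code, new_role):
    """Rewrite field 3 (speaker code) and field 8 (role) of an @ID header line."""
    prefix, tab, body = line.partition("\t")
    fields = body.split("|")
    if not tab or len(fields) < 8:
        raise ValueError(f"not a well-formed @ID row: {line!r}")
    fields[2] = new_code   # third field: speaker code
    fields[7] = new_role   # eighth field: role tag
    return prefix + tab + "|".join(fields)
```

Applied to "@ID:\teng|corpus|PAR0|||||Participant|||" with INV and Investigator, only the third and eighth fields change; every other field, including empty ones, is kept as-is.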

Utterance content is byte-stable except for the speaker prefix

For every retained utterance, every byte EXCEPT the leading *CODE:\t prefix is preserved verbatim. Dependent tiers attached to the utterance are preserved exactly. NAK-delimited time bullets, CHAT markup, special-form annotations, paralinguistic codes, and retracing scopes are all untouched.

Headers reconcile per a fixed table

| Header | Behavior |
| --- | --- |
| @UTF8, @Begin, @End, @Window, @Languages, @Media | Pass-through unchanged |
| @Participants | Drop entries for dropped speakers; rewrite code + role-tag for renamed speakers; entries for unaffected speakers preserved |
| @ID | Drop rows for dropped speakers; rewrite field 3 (code) and field 8 (role) for renamed speakers; other fields preserved |
| @Comment | Pass-through unchanged (provenance-carrying comments survive) |

Provenance is captured if --write-override is set

When --write-override <FILE> is supplied AND the command succeeds in reference mode, an entry is appended to FILE recording the session ID (derived from the input filename stem unless overridden), the per-speaker Jaccard scores, the chosen mapping, and an ISO 8601 timestamp. The format is specified in “Override file format” below. The operator identity and any free-text note are not recorded at this point; they are set later, when a session is confirmed via chatter adjudicate.

This is the audit-trail mechanism: a year from now, a researcher who asks “why was PAR0 labeled INV in this session?” can read the override entry and see the scores, the operator, and any notes the operator added.

Algorithm (reference mode)

Token cleaning

Both the reference anchor’s bag of words and each input speaker’s bag of words are built by walking the typed CHAT AST and emitting content tokens. The cleaner strips:

  • NAK-delimited time bullets
  • bracket-annotated markup [*], [//], [/], [=! ...], etc.
  • angle-bracket retracing scope (<...>: unwrap, keep inner text)
  • terminator variants +//., +..., +/., +!?, etc.
  • filled-pause and phonological-fragment markers &-..., &+...
  • unintelligible placeholders xxx, yyy, www
  • zero-realization markers 0
  • special-form suffixes (word@l → word)
  • CHAT compound underscores (Valentine's_Day → Valentine s Day)
  • punctuation, then lowercase, then filter to alpha-only tokens of length ≥ 2

Both sides are cleaned identically so the comparison is apples-to-apples. This is the same cleaner specified in the reference corpus under spec/constructs/speaker-id/token-cleaner/.
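The last few cleaning steps can be illustrated on plain word strings. This is only a string-level sketch: the real cleaner walks the typed CHAT AST, and the function name `normalize_tokens` and the use of Python's Unicode-aware `str.isalpha` are assumptions made for illustration.

```python
import re

# Unintelligible placeholders named in the list above.
PLACEHOLDERS = {"xxx", "yyy", "www"}

def normalize_tokens(words):
    """Approximate the final cleaning steps on already-extracted word strings."""
    tokens = []
    for word in words:
        if word in PLACEHOLDERS:
            continue
        # Special-form suffix: word@l -> word
        word = word.split("@", 1)[0]
        # Punctuation and compound underscores become separators:
        # Valentine's_Day -> Valentine s Day
        spaced = re.sub(r"[\W_]+", " ", word)
        for piece in spaced.lower().split():
            # Keep alpha-only tokens of length >= 2 (drops "s", "0", digits).
            if piece.isalpha() and len(piece) >= 2:
                tokens.append(piece)
    return tokens
```

For example, `normalize_tokens(["Valentine's_Day", "xxx", "Dog,"])` yields `["valentine", "day", "dog"]`: the stray `s` left by the apostrophe is removed by the length filter.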

Multiset Jaccard

For two bags-of-words A and B (counted multisets):

J(A, B) = sum_w min(A[w], B[w])  /  sum_w max(A[w], B[w])

Range [0, 1]. The multiset (rather than set) form rewards speakers who say similar things to the anchor in similar volume, not just speakers whose vocabulary happens to intersect.
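The formula maps directly onto Python's `collections.Counter`, whose `&` and `|` operators take per-key minima and maxima. This is a sketch of the formula only, not the crate's implementation; returning 0.0 when both bags are empty is an assumption, since the text does not define that case.

```python
from collections import Counter

def multiset_jaccard(a_tokens, b_tokens):
    """J(A, B) = sum_w min(A[w], B[w]) / sum_w max(A[w], B[w])."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    numerator = sum((a & b).values())    # per-word minimum counts
    denominator = sum((a | b).values())  # per-word maximum counts
    # Assumption: two empty bags score 0.0 rather than raising.
    return numerator / denominator if denominator else 0.0
```

With A = `the the dog` and B = `the dog dog cat`, the minima sum to 2 and the maxima to 5, so J = 0.4. A set-based Jaccard on the same input would give 2/3, which is the volume-insensitivity the multiset form avoids.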

Decision

scores  = { speaker: J(anchor_bag, speaker_bag) for speaker in input }
winner  = argmax(scores)
loser   = argmax(scores - {winner})
margin  = scores[winner] / scores[loser]    # ∞ when loser score = 0
  • winner is the input speaker whose content matches the reference anchor’s content best → marked for drop (the reference authoritatively covers them).
  • loser (and any other lower-scoring speakers, in the multi-speaker case) → renamed to the role given by --inserted-role.

If margin < --confidence-threshold (default 2.0), the command exits with code 4 and prints per-speaker scores to stderr. The operator must inspect, adjudicate, and re-run with --mapping or --override-file.
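The decision rule, including the refusal path, can be sketched as follows. The function name `decide` and the `(mapping, margin)` return shape are hypothetical; the treatment of a zero runner-up score follows the pseudocode above literally.

```python
import math

def decide(scores, threshold=2.0):
    """Return (mapping, margin); mapping is None when the margin is too low."""
    if len(scores) < 2:
        raise ValueError("need at least two input speakers")
    ranked = sorted(scores, key=scores.get, reverse=True)
    winner, runner_up = ranked[0], ranked[1]
    if scores[runner_up] == 0:
        margin = math.inf  # as in the pseudocode: infinite when the loser scores 0
    else:
        margin = scores[winner] / scores[runner_up]
    if margin < threshold:
        return None, margin  # the CLI would exit with code 4 here
    # The winner is covered by the reference and dropped; everyone else is renamed.
    mapping = {s: ("drop" if s == winner else "rename") for s in scores}
    return mapping, margin
```

Feeding it the scores from the worked example further down (`PAR0=0.6286`, `PAR1=0.3457`) gives a margin of about 1.82, below the default threshold, so no mapping is returned.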

Why this algorithm

The choice was empirical, not theoretical, and was made against a calibration set of CHAT files paired with their corresponding ASR output. Two earlier candidates were tested first and rejected:

  • Raw temporal-overlap (sum of ms of an input speaker’s activity inside the anchor’s bullet windows): too weak on real data. Hand transcripts often place per-utterance time bullets as end-to-end segmentation boundaries covering 95-99% of the session timeline, rather than as tight “speaker active here” windows. Both input speakers fall almost entirely “inside” the anchor’s bullet windows and the signal disappears.
  • Speaker purity (fraction of each input speaker’s activity falling inside anchor windows): same root cause, same failure.

Multiset Jaccard over content tokens succeeded on every session of the calibration set. The borderline cases (margin below 2.0x) clustered around tasks where the non-anchor speaker shares vocabulary with the anchor by the structure of the task, e.g. a clinician describing the same scene the child is also describing in a picture-narrative task. These borderline cases are the reason for the conservative threshold and the --mapping/--override-file escape hatches; the algorithm correctly refuses to auto-decide them rather than silently picking wrong.

Override file format

The override file is a UTF-8 TOML document with one [<session_id>] table per decision. A minimal entry:

schema_version = 1

[session-101-t1]
mode = "auto"
inserted_role = { code = "INV", tag = "Investigator" }
mapping = { PAR0 = "rename", PAR1 = "drop" }
scores = { PAR0 = 0.1931, PAR1 = 0.7347 }
margin = 3.81
operator = "alice"
decided_at = 2026-05-27T08:41:00-04:00

The complete schema specification (every field, every type, every mode-semantics rule, the strict refuse-with-clear-error versioning policy, and worked examples for auto/explicit/replay/diarization-mixed cases) is on the dedicated reference page: Merge Override File Format.

Highlights from the reference:

  • mode = "auto" | "explicit" | "override" records how the decision was made (informational for audit trail; behavior at apply time is the same).
  • inserted_role.code is the CHAT speaker code (INV, MOT, FAT, PAR, …); inserted_role.tag is the CHAT role-tag (Investigator, Mother, …). All renamed speakers in one entry share the same role.
  • mapping must cover every speaker in the input, no defaulting.
  • scores and margin are optional but the writer always records them when an auto attempt produced them (even when the final decision was operator-supplied).
  • flags carries operator-supplied markers like "diarization-mixed" for unusual cases. Unknown strings are preserved verbatim.

Preconditions

chatter speaker-id refuses (exit code 2) if any hold:

Reference mode

  • The reference file has no utterances for --anchor
  • The reference file fails to parse
  • The input file has fewer than 2 distinct speakers (no discrimination problem)

Explicit-mapping mode

  • A speaker in the mapping is not present in the input
  • A speaker in the input is not covered by the mapping (no defaulting)

Override-file mode

  • The override file does not contain a <session-id> entry
  • The entry’s mapping references a speaker not in the input
  • The entry’s mapping does not cover every speaker in the input
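Both the explicit-mapping and override-file preconditions reduce to the same two-way coverage check between the mapping and the input's speakers. A minimal sketch, assuming the mapping is available as a dict keyed by speaker code; `mapping_problems` is a hypothetical helper, not part of the chatter API.

```python
def mapping_problems(mapping, input_speakers):
    """List coverage violations between a mapping and the input's speakers."""
    mapped = set(mapping)
    present = set(input_speakers)
    problems = []
    for speaker in sorted(mapped - present):
        problems.append(f"mapping names {speaker}, which is not in the input")
    for speaker in sorted(present - mapped):
        problems.append(f"input speaker {speaker} is not covered by the mapping")
    return problems
```

An empty list means the mapping is usable; any entry corresponds to a refusal with exit code 2, since there is no defaulting for uncovered speakers.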

What chatter speaker-id is NOT

  • Not voice diarization. Use Batchalign’s ASR pipeline upstream; the labels this command consumes are the labels Batchalign emits.
  • Not content correction. If the speaker the command identifies has been mis-transcribed by ASR, this command does not fix that; re-run ASR with a better engine.
  • Not a merge. This command operates on a single CHAT file. To combine the relabeled file with the reference, use chatter merge.
  • Not interactive. chatter speaker-id is batch-only: it succeeds, refuses, or fails. The interactive review that resolves a low-confidence refusal into an override-file entry is a separate command, chatter adjudicate, run as part of the merge workflow.

Worked example

A typical fully-automated reference-mode call from an orchestrator script:

chatter speaker-id asr-anonymous.cha \
  --reference hand-transcript.cha \
  --anchor CHI \
  --inserted-role INV:Investigator \
  --confidence-threshold 2.0 \
  --write-override batch.overrides.toml \
  -o asr-labeled.cha

For a session this refused (e.g., shared-vocabulary narrative task with margin 1.82x), the orchestrator captures the failure and the operator later resolves it:

# Inspect the scores the command emitted to stderr:
#   PAR0=0.6286  PAR1=0.3457  margin=1.82x  threshold=2.0
# Operator listens to a few seconds of audio and confirms PAR0 is
# the child:

chatter speaker-id asr-anonymous.cha \
  --mapping "PAR0=drop,PAR1=INV:Investigator" \
  --write-override batch.overrides.toml \
  -o asr-labeled.cha

Later, if anyone re-runs the batch, they use override-file mode:

chatter speaker-id asr-anonymous.cha \
  --override-file batch.overrides.toml \
  --session-id NF204-2 \
  -o asr-labeled.cha

The same asr-labeled.cha content is produced; the audit trail remains intact.

Implementation notes (for contributors)

  • Source: crates/talkbank-transform/src/speaker_id/ (proposed layout).
  • CLI surface: crates/chatter/src/commands/speaker_id/.
  • Domain types (SpeakerCode, RoleTag, SpeakerMapping, MergeOverride, JaccardScore, ConfidenceThreshold, Margin) live in talkbank-model and are shared with chatter merge plus any future adjudication UI.
  • The Jaccard cleaner walks talkbank-model::ChatFile directly via the existing content walker (talkbank-model::walk_words); it does NOT re-implement CHAT parsing or use regex on raw bytes for tokenization.
  • Spec entries for the cleaner and the algorithm live in spec/constructs/speaker-id/. Every invariant on this page has a spec; regenerate them with the current spec/tools commands from Spec Workflow.
  • The override-file reader/writer is a typed serde round-trip on a TOML representation; the schema lives in talkbank-model so the format is one shared type across the codebase, not duplicated parsing logic in each consumer.

Merge (chatter merge)

Status: Draft Last modified: 2026-06-11 15:32 EDT

chatter merge combines two CHAT transcripts that cover the same media recording into one. The caller designates which speakers’ utterances are authoritative in which file; the merged output interleaves them by time while byte-preserving every utterance from its designated source.

The command is structural: it does not invent or rewrite utterance content, does not run ASR, does not run forced alignment, does not infer speaker identity. It is the moment in a multi-input CHAT workflow where two parsed transcripts become one.

When to use it

Whenever you have two valid CHAT files of the same recording and you want a single combined CHAT file out, with explicit per-speaker provenance.

Two recurring shapes from real TalkBank workflows:

  • Hand-coded target speaker + ASR everyone else. A contributor has hand-transcribed only the target speaker (often the child in child-language research) with rich disfluency and error coding, and separately someone runs ASR on the same media to produce a rough-but-complete transcript with all speakers. chatter merge combines them with the hand-coded target speaker’s utterances byte-preserved and the other speakers spliced in from the ASR file.
  • Older hand transcript + later supplementary transcription. A legacy CHAT file covers most of the recording; a newer pass transcribes additional content (an investigator’s turns, a parent’s turns, a second target child). Merge with --retain listing the speakers whose content lives in the legacy file.

In both shapes the speakers are the unit of authority, not the files. chatter merge’s job is to express that mapping cleanly.

Conceptual model

A CHAT file describes utterances on a shared media timeline. Two CHAT files of the same media share the same timeline; their utterance sets may overlap (same speech transcribed twice) or be disjoint (each file covers different speakers). The merged output is a single CHAT file on the same timeline whose utterance set is the disjoint union of:

  • the utterances of every speaker listed in --retain from the first input file, and
  • the utterances of every speaker NOT listed in --retain from the second input file.

Retained-speaker utterances from the first file are kept byte-for-byte identical, including every dependent tier they own (%wor, %mor, %gra, %com, %pho, …). Inserted-speaker utterances from the second file have their downstream-generated dependent tiers (%wor/%mor/%gra/%pho, anything a later pipeline stage will regenerate) stripped before insertion, so the merged file is in a clean state for batchalign3 align and batchalign3 morphotag to own those tiers authoritatively post-merge.

flowchart LR
    File1["File 1<br/>any CHAT file"] --> Merge
    File2["File 2<br/>any CHAT file<br/>(same media)"] --> Merge
    Retain["--retain CHI[,SPK,…]"] -.-> Merge
    Merge["chatter merge<br/>(structural)"] --> Out["Merged CHAT file<br/>retained speakers: byte-stable from File 1<br/>inserted speakers: from File 2,<br/>derived tiers stripped"]

CLI contract

chatter merge <FILE1> <FILE2> --retain <SPEAKER_LIST> [OPTIONS]

ARGUMENTS:
  <FILE1>  Path to the first CHAT file. Speakers listed in --retain are
           taken from here, byte-preserved.
  <FILE2>  Path to the second CHAT file. All other speakers are taken
           from here.

REQUIRED OPTIONS:
  --retain <SPEAKER>[,<SPEAKER>...]
           Comma-separated list of speaker codes (e.g. CHI, or
           CHI,SI2). These speakers' utterances come from <FILE1>;
           everything else comes from <FILE2>.

OPTIONS:
  -o, --output <PATH>
           Write merged output to PATH. Default: stdout.

  --strip-tiers <TIER>[,<TIER>...]
           Dependent tier names to strip from inserted-speaker
           utterances before merging. Default: wor,mor,gra,pho.
           Use empty list (--strip-tiers '') to preserve all
           dependent tiers as-is.

  --allow-bullet-drift
           Permit small backward-time bullets in either input (where
           one utterance's end_ms is slightly greater than the next
           utterance's start_ms). Default behavior: warn but proceed.
           Set this flag to silence the warning.

Exit codes:

| Code | Meaning |
|---|---|
| 0 | Merge succeeded |
| 1 | Invalid input (parse error, missing file, unreadable) |
| 2 | Semantic precondition violated (e.g. retained speaker missing from File 1, conflicting @Languages, no time bullets in File 1) |
| 3 | Internal error |

What the merged output guarantees

These are testable invariants. Every release verifies them against the reference corpus.

Retained speakers are byte-stable

For every speaker code in --retain, every main-tier line and every dependent-tier line attached to that speaker in <FILE1> appears byte-for-byte identical in the merged output, in the same relative order they appeared in <FILE1>. CHAT markup, NAK-delimited time bullets, paralinguistic annotations, retracing scope, terminator variants, and special-form @l/@n/@c suffixes are all preserved.

This is the core semantic guarantee of merge: if you hand-coded disfluency on the target speaker, the disfluency coding survives the merge without any structural change.

Inserted speakers’ downstream-generated tiers are stripped

For every speaker code in <FILE2> that is NOT in --retain, the utterance is included in the merged output with its main tier preserved verbatim BUT with %wor, %mor, %gra, and %pho removed (configurable via --strip-tiers). Other dependent tiers (%com, %spa, %act, %sit, %add, contributor-specific tiers) are preserved.

The rationale: batchalign3 align and batchalign3 morphotag are the authoritative source stages for these tiers in the post-merge pipeline. Carrying inserted-speaker %wor across the merge would leave the merged file in a half-state: some utterances would have %wor, others would not, and downstream behavior on mixed inputs is undefined. The contract is: enter the post-merge stages in a clean state, exit with the tier present and consistent across every utterance.
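The stripping rule can be sketched on a simplified utterance record. The dict shape (`speaker`, `main`, `tiers` as a list of `(name, text)` pairs) is an assumption for illustration; the real command operates on the typed `ChatFile` model.

```python
# Default from the CLI contract: --strip-tiers wor,mor,gra,pho
DEFAULT_STRIP = ("wor", "mor", "gra", "pho")

def strip_tiers(utterance, retain, strip=DEFAULT_STRIP):
    """Drop regenerable tiers from inserted-speaker utterances only."""
    if utterance["speaker"] in retain:
        return utterance  # retained speakers are never touched
    kept = [(name, text) for name, text in utterance["tiers"] if name not in strip]
    return {**utterance, "tiers": kept}
```

Passing `strip=()` mirrors `--strip-tiers ''`: every dependent tier is carried across as-is.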

Utterance order is timeline order

Utterances in the merged output appear in ascending order by their start time bullet (\x15START_END\x15, milliseconds). Where two utterances have identical start times, the first-file utterance comes first.
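Selection and ordering together can be sketched with a stable sort: placing File 1's utterances first in the list means ties on start time resolve in File 1's favor without an explicit tie-break key. The record shape and the `start_ms` field are assumptions, and the sketch presumes every utterance already carries a bullet.

```python
def merge_order(file1_utts, file2_utts, retain):
    """Pick utterances per --retain, then order them by start time."""
    chosen = [u for u in file1_utts if u["speaker"] in retain]
    chosen += [u for u in file2_utts if u["speaker"] not in retain]
    # sorted() is stable, so on equal start_ms the File 1 utterance stays first.
    return sorted(chosen, key=lambda u: u["start_ms"])
```

Retained speakers that also appear in File 2 are simply ignored there, which is the "disjoint union" described in the conceptual model.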

Time bullets are pass-through

chatter merge does NOT recompute, smooth, or refresh time bullets. The bullets in the merged output are exactly those that appeared in the source files. If <FILE2> had %wor rows whose first/last word times implied a slightly different utterance span than the main-tier bullet, the main-tier bullet wins (it was the contract before merge).

If the merge stage detects an inserted-speaker utterance with no main-tier bullet at all (Batchalign occasionally omits these), it lifts a bullet from the corresponding %wor row’s first-word start and last-word end, appending it to the main tier so the merged file has uniform bullet placement. The original %wor is then stripped (per the per-tier rule above).
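The bullet-lifting rule amounts to taking the first word's start and the last word's end. A sketch under the assumption that `%wor` timings are available as `(word, start_ms, end_ms)` triples; `lift_bullet` is a hypothetical name.

```python
def lift_bullet(main_bullet, wor_words):
    """Return the main-tier bullet, lifting one from %wor timings if absent."""
    if main_bullet is not None:
        return main_bullet  # an existing main-tier bullet always wins
    if not wor_words:
        return None  # nothing to lift from
    first_start = wor_words[0][1]
    last_end = wor_words[-1][2]
    return (first_start, last_end)
```

When the main tier already has a bullet it is returned unchanged, matching the pass-through rule in the preceding paragraph.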

Header reconciliation

The merged file’s headers are constructed deterministically from the two inputs:

| Header | Source | Notes |
|---|---|---|
| @UTF8 | File 1 | always required to be @UTF8 |
| @Begin / @End | File 1 | always present in merge output |
| @Window | File 1 if present | not generated if absent |
| @Languages | File 1 | must match File 2; mismatch is an error |
| @Media | File 1 | File 2’s @Media is discarded; warning if mismatched media filename (NOT the modality field, see below) |
| @Participants | concatenation | File 1’s entries first, then File 2’s entries for non-retained speakers in their original order |
| @ID | concatenation | File 1’s @ID rows first; File 2’s @ID rows for non-retained speakers appended in their original order |
| @Comment | concatenation | File 1’s @Comment rows first; File 2’s @Comment rows appended in original order (preserves any provenance comments like ASR engine/run timestamp) |

The @Media modality field (audio vs video) is a known divergence point: when ASR runs against an mp4, it may write video on its input but emit audio on its output. File 1’s modality wins, as with all @Media content; no warning is emitted for modality mismatch.
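A few rows of the table can be sketched as a single function. The header dict keys (`languages`, `media`, `participants`, `comments`) and the `(code, role)` participant tuples are assumptions for illustration only.

```python
def reconcile_headers(h1, h2, retain):
    """Build merged headers from File 1 (h1) and File 2 (h2)."""
    if h1["languages"] != h2["languages"]:
        # @Languages mismatch is an error, not a warning.
        raise ValueError("@Languages headers disagree")
    extra = [p for p in h2["participants"] if p[0] not in retain]
    return {
        "languages": h1["languages"],
        "media": h1["media"],                       # File 2's @Media is discarded
        "participants": h1["participants"] + extra,  # File 1 first, then File 2
        "comments": h1["comments"] + h2["comments"],
    }
```

The same concatenation pattern applies to @ID rows, filtered by the same non-retained test.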

Overlap markup is NOT injected

When an inserted-speaker utterance temporally overlaps a retained-speaker utterance, chatter merge does NOT inject CHAT [>] / [<] / angle-bracket-scoped overlap markers. The time bullets carry overlap information; markers are a CLAN-era surface convention that the output of chatter merge deliberately omits.

The retained speakers’ existing overlap markers (if File 1 already contains some) are preserved byte-stably under the byte-preservation rule above.

Preconditions

chatter merge refuses (exit code 2) if any of these hold:

  • File 1 declares no utterances for any speaker in --retain.
  • File 1 has no time-bulleted utterances at all (no shared timeline to merge against).
  • The two files’ @Languages headers disagree.
  • A speaker code appears in both files but not in --retain (use --retain to disambiguate).
  • File 2 is missing or unparseable.

chatter merge does NOT refuse on these (it proceeds, warning only where noted):

  • Small backward-time bullets in either input (one utterance ends slightly after the next starts). These are common in hand transcripts and not corrupting; downstream batchalign3 align cleans them. A warning is printed unless --allow-bullet-drift is set.
  • File 2’s @Media modality disagrees with File 1’s (audio vs video). File 1’s modality wins and no warning is emitted.
  • File 1 has fewer utterances than File 2, or vice versa.
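The "small backward-time bullet" condition is a comparison between each utterance's end and the next utterance's start. A sketch assuming bullets are given as `(start_ms, end_ms)` pairs in file order; how small a drift counts as "small" is not specified in the text, so no tolerance is modelled.

```python
def backward_bullets(bullets):
    """Indices i where utterance i ends after utterance i+1 starts."""
    return [
        i
        for i in range(len(bullets) - 1)
        if bullets[i][1] > bullets[i + 1][0]
    ]
```

An empty result means the timeline is monotonic; a non-empty one is the case the merge proceeds through rather than refusing.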

Speaker identity in File 2 must already be coherent

chatter merge does NOT identify or rename speakers. If File 2 came from ASR and carries anonymous codes like PAR0, PAR1, run chatter speaker-id first to assign CHAT-conformant codes. The merge step trusts whatever speaker codes appear in its inputs.

What chatter merge is NOT

  • Not ASR. Use batchalign3 transcribe.
  • Not forced alignment. Use batchalign3 align.
  • Not morphological tagging. Use batchalign3 morphotag.
  • Not speaker identification. Use chatter speaker-id.
  • Not content reconciliation. If two files disagree about what a speaker said at the same time, chatter merge does not adjudicate; it trusts --retain to designate one file as authoritative per speaker.
  • Not three-way or n-way merge in this release. The 2-input case composes into the n-input case by chaining (chatter merge a b --retain X -o tmp.cha && chatter merge tmp.cha c --retain Y -o out.cha). A future release may add native n-ary merging if a workflow appears for which chained 2-way merges are awkward.

Worked example

A speech-pathology lab hand-transcribed a child’s spontaneous-speech session, marking disfluency carefully, but did not transcribe the clinician’s turns. They send the media and the child-only transcript; the project runs ASR on the media to produce a full-coverage transcript with anonymous speaker codes; then chatter speaker-id labels the ASR file’s adult speaker as INV; then chatter merge combines.

# After ASR labeling: asr.cha has speakers CHI and INV.
chatter merge child-only.cha asr.cha \
  --retain CHI \
  -o merged.cha

# Then alignment regenerates %wor cleanly across all speakers:
batchalign3 align merged.cha

# Then morphotag regenerates %mor and %gra:
batchalign3 morphotag merged.cha

The merged file contains:

  • Every *CHI utterance byte-stable from child-only.cha, including every disfluency marker, every retracing scope, every paralinguistic annotation, every %com session-structural comment.
  • Every *INV utterance from asr.cha, in their original time order, interleaved with the *CHI utterances by start time.
  • One @Participants row listing CHI and INV; @ID rows for both; the union of @Comment rows including any ASR provenance comments from asr.cha.

Relationship to other commands

flowchart TB
    Media[Media file mp4 / wav] --> Transcribe
    Transcribe["batchalign3 transcribe<br/>ASR"] --> AsrAnon["asr-anonymous.cha<br/>PAR0, PAR1, ..."]
    HandTranscript["hand-transcript.cha<br/>target speakers only"] --> SpeakerId
    AsrAnon --> SpeakerId
    SpeakerId["chatter speaker-id<br/>label anon speakers"] --> AsrLabeled["asr-labeled.cha<br/>CHI, INV, MOT, ..."]
    HandTranscript --> Merge
    AsrLabeled --> Merge
    Merge["chatter merge<br/>(this page)"] --> Merged[merged.cha]
    Merged --> Align
    Align["batchalign3 align"] --> Aligned[aligned.cha]
    Aligned --> Morph
    Morph["batchalign3 morphotag"] --> Final[final.cha]

chatter merge sits between speaker-identity resolution and forced alignment. It assumes its inputs have coherent CHAT-conformant speaker codes (no anonymous PAR0/PAR1) and emits a file ready for batchalign3 align to refresh timing and produce %wor.

Inputs must be valid CHAT (pipeline / batch)

The per-session chatter pipeline shortcut and the directory-level chatter batch driver validate every input as CHAT before doing any speaker-id or merge work. Each donor and the reference it is merged against must pass the same validation chatter validate runs; an input that fails is never merged. Clean invalid transcripts to valid CHAT first (run chatter validate <file> to see the errors), then re-run.

  • chatter pipeline refuses (exit 2, no output written) if its donor or reference is invalid CHAT.
  • chatter batch is fail-closed and whole-batch: if any input under the donor/reference directories is invalid CHAT, it reports every offending file and aborts the entire run without merging a single session. “All inputs are chatter-valid” is a hard precondition of the batch, not something discovered session-by-session mid-run.

This gate catches validation-only invalidity (files that parse but fail chatter validate, e.g. a malformed @ID), which the lower-level chatter merge parse is otherwise lenient about.

LLM holistic judgment (pending-only)

--judgment holistic is now reachable from pipeline and batch (not just speaker-id). In holistic mode the command is pending-only: it writes an engine = "llm" review-gated entry via --write-pending and produces no merged file. The operator supplies the LLM connection with --llm-endpoint / --llm-model (or the environment variables CHATTER_LLM_ENDPOINT / CHATTER_LLM_MODEL); an optional --session-context <file.json> provides per-session context that the LLM prompt includes to sharpen its judgment.

Session-context JSON (--session-context)

The session-context file is a corpus-agnostic JSON object mapping session IDs (the donor file’s basename stem) to context records. Every record field is optional, and the label fields are free vocabulary: chatter imposes no closed set, the labels are surfaced verbatim into the LLM prompt.

{
  "SESSION-ID": {
    "sample_type": "clinician interview",
    "declared_roles": ["Investigator"],
    "consent_tier": "video+audio",
    "age_months": 52
  }
}
  • sample_type: what kind of speech sample the session is (e.g. "narrative retell").
  • declared_roles: adult roles declared present in the session.
  • consent_tier: media-consent tier governing what may be shared.
  • age_months: child age in months at the session.

When --session-context is absent, the CHATTER_SESSION_CONTEXT environment variable supplies the path (empty counts as unset). Per session, each context field resolves in order: the explicit record from the file; for the age only, the donor’s CHAT @ID age header (pure CHAT, no external metadata needed); otherwise unknown. Absent sessions or fields are passed to the judgment as unknown, never guessed. A configured-but-malformed file is a hard error, and labels must contain at least one non-whitespace character. Configuring session context on a non-holistic run prints a warning (the deterministic judgment never consults it).
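The per-field resolution order can be sketched as below. The `id_age_months` parameter is hypothetical: it stands for an age the caller has already read from the donor's @ID header and converted to months, which this sketch does not attempt.

```python
UNKNOWN = "unknown"
FIELDS = ("sample_type", "declared_roles", "consent_tier", "age_months")

def resolve_context(session_id, context, id_age_months=None):
    """Resolve each context field: explicit record, then @ID age, else unknown."""
    record = context.get(session_id, {})
    resolved = {}
    for field in FIELDS:
        if field in record:
            resolved[field] = record[field]      # explicit record wins
        elif field == "age_months" and id_age_months is not None:
            resolved[field] = id_age_months      # fallback for age only
        else:
            resolved[field] = UNKNOWN            # never guessed
    return resolved
```

A session missing from the file resolves every field to `unknown`, except the age when the @ID header supplies one.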

Conversion from a contributor’s own records format (a spreadsheet, a database export) to this JSON happens outside chatter.

The two-pass operator flow is:

  1. batch --judgment holistic --session-context context.json --write-pending P accumulates one engine = "llm" pending entry per session in P.
  2. Operator reviews P, accepts or corrects each entry.
  3. chatter adjudicate promotes reviewed entries to the override file.
  4. batch (deterministic, reading the override file) replays every confirmed mapping and writes the merged files.

Note: the MLU sanity-scan is unreliable for the FluencyBank clinical-interview corpus (children out-narrate the adult, so MLU ratios invert relative to typical child-language recordings). Holistic-pending review via --judgment holistic is the trustworthy alternative there.

Implementation notes (for contributors)

  • Source: crates/talkbank-transform/src/transcript_merge/ (proposed layout, see the design plan).
  • CLI surface: crates/chatter/src/commands/transcript_merge/.
  • Domain types (SpeakerCode, RetainSet, MergeOverride, SpeakerMapping) live in talkbank-model so the override-file format is sharable across the speaker-id stage, the orchestrator, and any future adjudication UI.
  • The merge operates on talkbank-model::ChatFile; both inputs are parsed via talkbank-parser. The byte-preservation guarantee on retained-speaker utterances relies on the parser’s existing round-trip serialization.
  • Spec entries exercising the merge live in spec/constructs/. Every behavioral invariant on this page has a spec; tests are regenerated via the current spec/tools workflow documented in Spec Workflow.
  • This page is the user contract; book/src/chatter/reference/ carries the override-file reference for the speaker-id stage that this merge consumes.

The Merge Workflow (pipeline, batch, adjudicate, sanity-scan)

Status: Draft (experimental) Last modified: 2026-06-15 10:39 EDT

The merge workflow combines, at scale, the two structural primitives documented elsewhere, chatter speaker-id (assign CHAT-conformant speaker codes to an anonymous donor) and chatter merge (combine two transcripts of the same recording), and adds the operator loop needed when the automatic speaker decision is not confident enough to trust.

Four commands make up the workflow. They are experimental and in active development; flags and behavior may change.

| Command | Scope | Role |
|---|---|---|
| chatter pipeline | one session | speaker-id (reference mode) then merge, in a single invocation |
| chatter batch | a directory pair | loop pipeline over matched donor / reference files |
| chatter adjudicate | the operator | resolve the low-confidence sessions a pass left pending |
| chatter sanity-scan | merged output | flag confident auto-decisions that still look suspicious |

If you only have one pair of files and one clean answer, reach for pipeline. Everything else here is about doing that safely across a directory of sessions where some answers are not clean.

The big picture: a two-pass loop

The hard part of merging at scale is not the merge; it is deciding, per session, which anonymous ASR speaker is the child the reference already covers. speaker-id’s multiset-Jaccard match (see its page) answers that automatically when the winner clearly beats the runner-up, and refuses (exit code 4) when it does not. The workflow turns that refusal into a reviewable queue.

flowchart TD
    subgraph Pass1["Pass 1: automatic"]
        B1["chatter batch DONOR_DIR REF_DIR\n--write-override audit.toml\n--write-pending pending.toml"]
        B1 --> Clean["confident sessions:\nmerged file written,\ndecision logged to audit.toml"]
        B1 --> Refused["low-confidence sessions:\nNO merge, appended to pending.toml\n(exit code 4)"]
    end
    Refused --> Adj["chatter adjudicate pending.toml\n--override-file audit.toml\n(operator decides)"]
    Adj --> Pass2["Pass 2: chatter batch ... --override-file audit.toml\n(replays the operator's decisions,\nmerges the previously-refused sessions)"]
    Clean --> Done["all sessions merged"]
    Pass2 --> Done

Pass 1 merges everything it is confident about and parks the rest. The operator works the parked queue once. Pass 2 replays their decisions. The same chatter batch (or chatter pipeline) command runs both passes; what changes is whether an override file with entries exists yet.

chatter pipeline (one session)

The per-session shortcut: run speaker-id in reference mode to relabel an anonymous donor, then merge the relabeled donor with the reference, in one command instead of two.

chatter pipeline <DONOR> <REFERENCE> \
  --anchor <SPEAKER> --inserted-role <CODE>:<ROLE> --output <PATH> [OPTIONS]

ARGUMENTS:
  <DONOR>      Donor CHAT file with anonymous speaker codes (the ASR output).
  <REFERENCE>  Reference CHAT file carrying the authoritative anchor speaker
               (typically the hand-coded child transcript).

REQUIRED:
  --anchor <SPEAKER>            Anchor code in the reference (typically CHI).
  --inserted-role <CODE>:<ROLE> Role for the donor's non-anchor speakers
                                (e.g. INV:Investigator).
  -o, --output <PATH>           Output path for the merged CHAT file.

KEY OPTIONS:
  --retain <SPEAKER>            Speaker(s) taken from the reference in the
                               final merge (typically the same as --anchor).
  --confidence-threshold <F>    Minimum winner/runner-up Jaccard margin to
                               auto-decide (default 2.0x).
  --write-override <FILE>       On a confident auto-decision, append a
                               mode = "auto" audit entry for this session.
  --write-pending <FILE>        On a low-confidence refusal, append a pending
                               entry (exit code 4 still fires).
  --override-file <FILE>        If the file has an entry for this session
                               (the donor's basename stem), replay that
                               decision instead of running reference mode.

The same command serves pass 1 (no override entry yet, run reference mode) and pass 2 (entry present, replay it). Validation is a hard precondition: a donor or reference that fails chatter validate is never merged (exit 2, nothing written).

chatter batch (a directory pair)

Loops pipeline over matched files: the reference for DONOR_DIR/X.cha is REFERENCE_DIR/X.cha. Donors without a matching reference are skipped with a warning. It is fail-closed and whole-batch on validity: if any input under either directory is invalid CHAT, the batch reports every offending file and aborts without merging a single session.

chatter batch <DONOR_DIR> <REFERENCE_DIR> \
  --anchor <SPEAKER> --inserted-role <CODE>:<ROLE> --output <DIR> [OPTIONS]

PASS-1 AUDIT + QUEUE:
  --write-override <FILE>  Append every confident auto-decision (mode =
                          "auto"). Required if you want --sanity-scan.
  --write-pending <FILE>   Aggregate every low-confidence refusal into one
                          pending file. One `chatter adjudicate` run resolves
                          them all. Refusals do NOT abort the batch.

PASS-2 REPLAY:
  --override-file <FILE>   Threaded to every per-session pipeline call.
                          Sessions with an entry replay it; the rest fall
                          through to reference mode.

POST-MERGE QA:
  --sanity-scan            Run `sanity-scan` after the loop. Requires
                          --write-override (it reads the auto-decisions) and
                          --write-pending (flagged sessions are appended).
                          Exit code 4 fires if it flags any session.
  --sanity-scan-threshold <F>  Heuristic ratio (default 1.5).

OPERATIONAL:
  --skip-existing          Skip donors whose merged output already exists, to
                          resume an interrupted batch.

batch also accepts the same --judgment deterministic|holistic and LLM / --session-context options as pipeline; see Merge, LLM holistic judgment for that mode and the session-context JSON format.
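The donor/reference matching rule at the top of this section (same filename in both directories) can be sketched with `pathlib`. Whether the real command descends into subdirectories is not stated here; this sketch looks only at the top level, and `match_sessions` is a hypothetical name.

```python
from pathlib import Path

def match_sessions(donor_dir, reference_dir):
    """Pair DONOR_DIR/X.cha with REFERENCE_DIR/X.cha; report donors left over."""
    pairs, unmatched = [], []
    for donor in sorted(Path(donor_dir).glob("*.cha")):
        reference = Path(reference_dir) / donor.name
        if reference.is_file():
            pairs.append((donor, reference))
        else:
            unmatched.append(donor)  # batch skips these with a warning
    return pairs, unmatched
```

Each pair's session ID is then the donor's basename stem, which is the key the override and pending files use.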

chatter adjudicate (the operator step)

Reads the pending file a pass produced, walks the operator through the unresolved sessions, and appends the resolved decisions to the override file. On success the pending file is rewritten to drop the entries that were resolved, so re-running adjudicate only ever shows what is left.

chatter adjudicate <PENDING> --override-file <FILE> [--interactive | --scripted <TOML>]

ARGUMENTS:
  <PENDING>  The pending-adjudications TOML a pass wrote.

REQUIRED:
  --override-file <FILE>  Override file to append resolved decisions to
                         (created if absent). This is the same file pass 2
                         reads back.

DECISION SOURCE (one of):
  --interactive           Prompt per pending entry on stdin. Currently
                         supports `accept` / `a` (accept the suggested
                         mapping).
  --scripted <TOML>       Pre-canned operator decisions, for replayable /
                         tested runs. Mutually exclusive with --interactive.

  --operator <NAME>       Recorded in each override entry (defaults to $USER).

This is the interactive review tool the speaker-id and merge pages refer to: the audit trail (who decided, the scores, any note) lands in the override file so a later reader can see why a session was labeled the way it was. The decision schema is the same override-file format used everywhere in the workflow; see Merge Override File Format, and the Adjudication Workflow architecture page for the design.

chatter sanity-scan (post-merge QA)

A confident auto-decision can still be wrong: the runner-up may simply have been even further off. sanity-scan re-reads the merged output and the pass-1 audit file and flags sessions that fail an out-of-band check: the mean utterance word count of the anchor speaker versus the inserted speaker. In a typical child-language recording the adult out-talks the child, so an anchor (child) mean that is much higher than the inserted (adult) mean is suspicious: the two speakers may have been swapped.

chatter sanity-scan <MERGED_DIR> \
  --override-file <FILE> --anchor <SPEAKER> --write-pending <FILE> [OPTIONS]

REQUIRED:
  --override-file <FILE>  The pass-1 audit file. Only auto-decided sessions
                         are scanned; explicit-mode entries are skipped (the
                         operator already signed off).
  --anchor <SPEAKER>      Anchor code in the merged files (typically CHI).
  --write-pending <FILE>  Flagged sessions are appended here as
                         sanity-scan-misclassification pending entries for
                         `chatter adjudicate`. Required.

  --threshold <F>         Flag when anchor_mean >= inserted_mean * threshold
                         (default 1.5).

A flag is a question, not a verdict: the session goes back into the adjudication queue for an operator to confirm or correct. Whether to run the scan at all is a judgment about the corpus. It assumes the typical “adult out-talks child” shape, and is unreliable where that inverts (e.g. a clinical-interview corpus where children out-narrate the adult); there, prefer the LLM holistic-pending review described on the merge page.
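The threshold rule can be sketched in a few lines of Rust. This is an illustration only, not the sanity-scan implementation: the function names are hypothetical, words are counted as whitespace-separated tokens, and a speaker with no utterances is never flagged (both are assumptions).

```rust
/// Mean words per utterance; `None` when the speaker has no utterances.
/// Assumption: a "word" is any whitespace-separated token.
fn mean_word_count(utterances: &[&str]) -> Option<f64> {
    if utterances.is_empty() {
        return None;
    }
    let total: usize = utterances
        .iter()
        .map(|u| u.split_whitespace().count())
        .sum();
    Some(total as f64 / utterances.len() as f64)
}

/// Flag when anchor_mean >= inserted_mean * threshold.
fn is_flagged(anchor: &[&str], inserted: &[&str], threshold: f64) -> bool {
    match (mean_word_count(anchor), mean_word_count(inserted)) {
        (Some(a), Some(i)) => a >= i * threshold,
        // Assumption: with nothing to compare, do not flag.
        _ => false,
    }
}
```

With the default threshold of 1.5, an anchor mean of 7 words against an inserted mean of 1.5 words is flagged; the reverse is not.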

End-to-end worked example

A directory of ASR donors (asr/) and the matching hand-coded child references (ref/), child anchor CHI, adults labeled INV:

# Pass 1: merge what we are sure of; queue the rest; keep an audit trail.
chatter batch asr/ ref/ \
  --anchor CHI --inserted-role INV:Investigator \
  --output merged/ \
  --write-override audit.toml \
  --write-pending pending.toml \
  --sanity-scan

# Exit 0: every session merged confidently and the scan was clean.
# Exit 4: some sessions are pending (low-confidence and/or scan-flagged).

# Operator resolves the queue once (audit trail recorded):
chatter adjudicate pending.toml --override-file audit.toml --interactive --operator alice

# Pass 2: replay the operator's decisions; the previously-pending
# sessions now merge.
chatter batch asr/ ref/ \
  --anchor CHI --inserted-role INV:Investigator \
  --output merged/ \
  --override-file audit.toml \
  --skip-existing

Exit codes

The workflow commands share the convention used across the merge surface:

| Code | Meaning |
| --- | --- |
| 0 | Success |
| 1 | Invalid input (parse error, missing file, unreadable) |
| 2 | Semantic precondition violated (e.g. invalid CHAT input, missing anchor) |
| 3 | Internal error |
| 4 | A pass parked work for the operator: a low-confidence speaker-id refusal, or a sanity-scan flag. Nothing was lost; the sessions are in the pending file |

Exit code 4 is the normal “there is operator work to do” signal, not an error: a batch that parks ten sessions still merged the rest.
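A wrapper script that drives these commands might classify the exit code as follows. This is a sketch for callers, not part of chatter; the enum and function names are hypothetical.

```rust
/// Hypothetical caller-side view of the exit-code table.
#[derive(Debug, PartialEq, Eq)]
enum BatchOutcome {
    Success,              // 0
    InvalidInput,         // 1
    PreconditionViolated, // 2
    InternalError,        // 3
    OperatorWorkPending,  // 4
    Unknown(i32),         // anything not in the table
}

fn classify_exit(code: i32) -> BatchOutcome {
    match code {
        0 => BatchOutcome::Success,
        1 => BatchOutcome::InvalidInput,
        2 => BatchOutcome::PreconditionViolated,
        3 => BatchOutcome::InternalError,
        4 => BatchOutcome::OperatorWorkPending,
        other => BatchOutcome::Unknown(other),
    }
}

/// Exit code 4 means "run adjudicate next", so it is not treated as a failure.
fn is_failure(outcome: &BatchOutcome) -> bool {
    !matches!(
        outcome,
        BatchOutcome::Success | BatchOutcome::OperatorWorkPending
    )
}
```

When the command is launched through std::process::Command, the code would come from ExitStatus::code(), which returns an Option<i32>.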

See also

CHAT Format Overview

Status: Reference Last updated: 2026-05-11 21:51 EDT

CHAT (Codes for the Human Analysis of Transcripts) is a standardized transcription format for spoken language data, developed by MacWhinney as part of the CHILDES and TalkBank projects. It is the most widely used format in child language research and conversational analysis.

File Anatomy

Every CHAT file follows this structure:

@UTF8
@Begin
@Languages:	eng
@Participants:	CHI Target_Child, MOT Mother
@ID:	eng|corpus|CHI|2;6.||||Target_Child|||
@ID:	eng|corpus|MOT|||||Mother|||
*MOT:	what do you want ?
%mor:	ADV|what AUX|do PRON|you VERB|want ?
%gra:	1|4|LINK 2|4|AUX 3|4|SUBJ 4|0|ROOT 5|4|PUNCT
*CHI:	I want cookie .
%mor:	PRON|I VERB|want NOUN|cookie .
%gra:	1|2|SUBJ 2|0|ROOT 3|2|OBJ 4|2|PUNCT
@End

A CHAT file consists of:

  1. @UTF8: required first line, declares UTF-8 encoding
  2. @Begin: marks the start of the transcript
  3. Headers: lines starting with @ that provide metadata (participants, languages, IDs, etc.)
  4. Utterances: blocks consisting of:
    • A main tier (line starting with *SPEAKER:) containing the transcribed speech
    • Zero or more dependent tiers (lines starting with %tier:) containing annotations
  5. @End: marks the end of the transcript

Key Conventions

  • Tab separation: a tab character separates the tier prefix from its content (e.g., *CHI:⟶content)
  • Terminators: every utterance ends with a terminator (., ?, !, or special forms like +...)
  • Line continuation: long lines wrap with a tab at the start of continuation lines
  • Speaker codes: short identifiers; the validator accepts up to seven characters from A-Z, 0-9, _, -, '; three uppercase letters is the convention (e.g., CHI, MOT, FAT, INV)
  • Media linking: timestamps link transcripts to audio/video via bullet markers
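The line-level conventions above are enough to classify a line by its first character and to split a tier prefix from its content. The sketch below is a simplified illustration with hypothetical names; the real parser is grammar-driven and does far more.

```rust
#[derive(Debug, PartialEq, Eq)]
enum LineKind {
    Header,        // starts with '@'
    MainTier,      // starts with '*'
    DependentTier, // starts with '%'
    Continuation,  // starts with a tab
    Other,
}

fn classify_line(line: &str) -> LineKind {
    match line.chars().next() {
        Some('@') => LineKind::Header,
        Some('*') => LineKind::MainTier,
        Some('%') => LineKind::DependentTier,
        Some('\t') => LineKind::Continuation,
        _ => LineKind::Other,
    }
}

/// Split "*CHI:<TAB>content" into ("*CHI:", "content") at the first tab.
fn split_tier(line: &str) -> Option<(&str, &str)> {
    line.split_once('\t')
}
```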

CHAT vs Other Formats

| Feature | CHAT | Praat TextGrid | ELAN EAF |
| --- | --- | --- | --- |
| Morphological tiers | Built-in (%mor, %gra) | No | No |
| Dependency syntax | Built-in (%gra) | No | No |
| Standardized POS | UD-style via %mor | No | No |
| Word-level alignment | %wor tier | Interval-based | Interval-based |
| Error recovery | Tree-sitter GLR | N/A | N/A |

References

Headers

Status: Reference Last updated: 2026-05-11 20:30 EDT

Headers are lines beginning with @ that provide metadata about the transcript. Apart from @UTF8, which precedes @Begin, they appear between @Begin and the first utterance (though some headers, like @Comment, can appear anywhere).

Required Headers

@UTF8

Must be the very first line of every CHAT file. Declares UTF-8 encoding.

@UTF8

@Begin / @End

Mark the start and end of the transcript body. Every CHAT file must have exactly one @Begin and one @End.

@Participants

Declares all speakers in the transcript. Format: CODE [Name] Role, comma-separated. The role is required; the name is optional, so each entry is either CODE Role or CODE Name Role.

@Participants:	CHI Target_Child, MOT Mother, FAT Father
@Participants:	CHI Alex Target_Child, MOT Mary Mother

In the first line, Target_Child, Mother, and Father are roles, not names. In the second line, Alex and Mary are optional names sitting between the speaker code and the role.
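The CODE [Name] Role shape can be read by counting whitespace-separated parts in each comma-separated entry. This is a minimal sketch with hypothetical type and function names; it takes the header content after the tab and does not validate codes or roles.

```rust
#[derive(Debug, PartialEq, Eq)]
struct Participant {
    code: String,
    name: Option<String>,
    role: String,
}

/// Parse the content of an @Participants header (the text after the tab).
fn parse_participants(content: &str) -> Result<Vec<Participant>, String> {
    content
        .split(',')
        .map(|entry| {
            let parts: Vec<&str> = entry.split_whitespace().collect();
            match parts.as_slice() {
                // Two parts: CODE Role
                [code, role] => Ok(Participant {
                    code: code.to_string(),
                    name: None,
                    role: role.to_string(),
                }),
                // Three parts: CODE Name Role
                [code, name, role] => Ok(Participant {
                    code: code.to_string(),
                    name: Some(name.to_string()),
                    role: role.to_string(),
                }),
                _ => Err(format!("malformed participant entry: {:?}", entry.trim())),
            }
        })
        .collect()
}
```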

Speaker codes are short identifiers; the validator accepts up to seven characters from A-Z, 0-9, _, -, and '. The convention is three uppercase letters; the most common codes are:

  • CHI: target child
  • MOT: mother
  • FAT: father
  • INV: investigator
  • OBS: observer
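The speaker-code rule stated above translates into a short predicate. This is a sketch under two assumptions, not the validator's code: the function name is hypothetical, and an empty code is rejected.

```rust
/// Up to seven characters drawn from A-Z, 0-9, '_', '-', and '\''.
/// Assumption: the empty string is not a valid code.
fn is_valid_speaker_code(code: &str) -> bool {
    let len = code.chars().count();
    (1..=7).contains(&len)
        && code
            .chars()
            .all(|c| c.is_ascii_uppercase() || c.is_ascii_digit() || matches!(c, '_' | '-' | '\''))
}
```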

@ID

Provides detailed metadata for each participant. One @ID line per participant.

@ID:	eng|corpus|CHI|2;6.||||Target_Child|||

Fields (pipe-separated): language, corpus, speaker code, age, sex, group, SES, participant role, education, custom field.

Age format: years;months.days (e.g., 2;6. = 2 years, 6 months).

SES field: ethnicity (White, Black, Asian, Latino, Pacific, Native, Multiple, Unknown), socioeconomic code (UC, MC, WC, LI), or combined with comma separator (e.g., White,MC).
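As an illustration, the ten pipe-separated fields and the age format can be pulled apart as below. The names are hypothetical. The sketch assumes the line ends with a trailing pipe, as in the example above, and it accepts an age with the days (and the months) left empty.

```rust
/// Field names in the order listed above.
const ID_FIELDS: [&str; 10] = [
    "language", "corpus", "speaker code", "age", "sex",
    "group", "SES", "participant role", "education", "custom field",
];

#[derive(Debug, PartialEq, Eq)]
struct Age {
    years: u32,
    months: Option<u32>,
    days: Option<u32>,
}

/// Split @ID content (the text after the tab) into (field name, value) pairs.
fn parse_id_fields(content: &str) -> Option<Vec<(&'static str, String)>> {
    let body = content.strip_suffix('|').unwrap_or(content);
    let values: Vec<&str> = body.split('|').collect();
    if values.len() != ID_FIELDS.len() {
        return None;
    }
    Some(ID_FIELDS.iter().copied().zip(values.into_iter().map(String::from)).collect())
}

/// Empty text means "not given"; anything else must be a number.
fn parse_optional(text: &str) -> Option<Option<u32>> {
    if text.is_empty() { Some(None) } else { text.parse().ok().map(Some) }
}

/// Parse years;months.days, e.g. "2;6." is 2 years, 6 months.
fn parse_age(s: &str) -> Option<Age> {
    let (years, rest) = s.split_once(';')?;
    let (months, days) = rest.split_once('.').unwrap_or((rest, ""));
    Some(Age {
        years: years.parse().ok()?,
        months: parse_optional(months)?,
        days: parse_optional(days)?,
    })
}
```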

Optional Headers

@Languages

Declares the language(s) used in the transcript.

@Languages:	eng, fra

@Date

Recording date in DD-MON-YYYY format.

@Date:	15-JAN-2024
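A DD-MON-YYYY value can be read with the standard library alone. The sketch below is illustrative: the function name is hypothetical, and it assumes uppercase three-letter English month abbreviations as in the example. It does not check that the day exists in the given month.

```rust
const MONTHS: [&str; 12] = [
    "JAN", "FEB", "MAR", "APR", "MAY", "JUN",
    "JUL", "AUG", "SEP", "OCT", "NOV", "DEC",
];

/// Returns (year, month, day) for input such as "15-JAN-2024".
fn parse_chat_date(s: &str) -> Option<(u32, u32, u32)> {
    let mut parts = s.split('-');
    let day: u32 = parts.next()?.parse().ok()?;
    let mon = parts.next()?;
    let year: u32 = parts.next()?.parse().ok()?;
    if parts.next().is_some() || !(1..=31).contains(&day) {
        return None;
    }
    let month = MONTHS.iter().position(|m| *m == mon)? as u32 + 1;
    Some((year, month, day))
}
```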

@Location

Where the recording took place.

@Location:	Boston, MA, USA

@Situation

Description of the recording context.

@Situation:	free play with toys in lab

@Activities

Activities during the recording.

@Activities:	toyplay, reading

@Comment

Free-form comments. Can appear anywhere in the file (before, between, or after utterances).

@Comment:	child was tired during this session

@Media

Links the transcript to an audio or video file.

@Media:	session01, audio

@Transcriber / @Coder

Identifies who created or coded the transcript.

@Transcriber:	JDS
@Coder:	ABC

Header Ordering

Headers should follow this conventional order:

  1. @UTF8 (required, first line)
  2. @Begin (required)
  3. @Languages
  4. @Participants (required)
  5. @ID lines (one per participant)
  6. Other metadata headers (@Date, @Location, etc.)
  7. @Comment lines (can also appear later)

Validation

The parser validates header structure including:

  • @UTF8 must be the first non-empty line
  • @Begin and @End are required and must appear exactly once
  • @Participants is required and must declare all speakers used in utterances
  • @ID participant codes must match @Participants declarations
  • Age format validation in @ID lines
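The first three checks can be approximated with plain line scanning. This sketch is not the parser's implementation: the function name and the messages are hypothetical, and it checks only for the presence of @Participants, not that every speaker is declared.

```rust
/// Returns one message per violated structural rule (empty when all pass).
fn check_header_structure(text: &str) -> Vec<String> {
    let mut problems = Vec::new();
    let lines: Vec<&str> = text.lines().collect();

    // @UTF8 must be the first non-empty line.
    match lines.iter().find(|l| !l.trim().is_empty()) {
        Some(first) if first.trim_end() == "@UTF8" => {}
        _ => problems.push("@UTF8 must be the first non-empty line".to_string()),
    }

    // @Begin and @End must each appear exactly once.
    for marker in ["@Begin", "@End"] {
        let n = lines.iter().filter(|l| l.trim_end() == marker).count();
        if n != 1 {
            problems.push(format!("{marker} must appear exactly once (found {n})"));
        }
    }

    // @Participants is required.
    if !lines.iter().any(|l| l.starts_with("@Participants:")) {
        problems.push("@Participants is required".to_string());
    }
    problems
}
```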

Utterances

Status: Reference Last updated: 2026-05-11 23:22 EDT

An utterance is the fundamental unit of a CHAT transcript. It consists of a main tier (the transcribed speech) followed by zero or more dependent tiers (annotations).

Main Tier

The main tier begins with *SPEAKER: followed by a tab and the utterance content, ending with a terminator.

*CHI:	I want a cookie .

Speaker Codes

Speaker codes are short identifiers (up to seven characters from A-Z, 0-9, _, -, '; three uppercase letters is the convention) matching a code declared in @Participants:

@Participants:	CHI Target_Child, MOT Mother
*MOT:	what do you want ?
*CHI:	cookie .

Terminators

Every utterance must end with a terminator:

| Terminator | Meaning |
| --- | --- |
| . | Declarative (period) |
| ? | Question |
| ! | Exclamation |
| +... | Trailing off |
| +..? | Trailing-off question |
| +/. | Interruption |
| +//. | Self-interruption |
| +/? | Interrupted question |
| +!? | Broken question |
| +"/. | Quotation follows on next line |
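Because content items are whitespace-separated, a terminator can be found by looking at the last token of the utterance content. The sketch below covers only the terminators in the table and uses hypothetical names. It skips a trailing media bullet, which is an assumption about where bullets sit relative to the terminator.

```rust
/// (terminator, meaning) pairs from the table above.
const TERMINATORS: [(&str, &str); 10] = [
    (".", "declarative"),
    ("?", "question"),
    ("!", "exclamation"),
    ("+...", "trailing off"),
    ("+..?", "trailing-off question"),
    ("+/.", "interruption"),
    ("+//.", "self-interruption"),
    ("+/?", "interrupted question"),
    ("+!?", "broken question"),
    ("+\"/.", "quotation follows on next line"),
];

/// Find the terminator of main-tier content (the text after "*SPEAKER:<TAB>").
fn find_terminator(content: &str) -> Option<(&'static str, &'static str)> {
    let last = content
        .split_whitespace()
        .rev()
        // Skip media bullets, whether rendered (U+2022) or raw NAK (U+0015).
        .find(|t| !t.starts_with('\u{2022}') && !t.starts_with('\u{15}'))?;
    TERMINATORS.iter().copied().find(|(t, _)| *t == last)
}
```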

Line Continuation

Long utterances wrap to the next line with a leading tab:

*MOT:	well I think that we should probably go to
	the store and get some more cookies .
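A reader that wants one logical line per tier can fold continuation lines back into the line they continue. The sketch below uses a hypothetical function name and joins with a single space, which is an assumption about how the wrapped text should be rejoined.

```rust
/// Merge tab-led continuation lines into the preceding line.
fn join_continuations(text: &str) -> Vec<String> {
    let mut logical: Vec<String> = Vec::new();
    for line in text.lines() {
        if let Some(rest) = line.strip_prefix('\t') {
            if let Some(prev) = logical.last_mut() {
                // Assumption: one space replaces the line break and leading tab.
                prev.push(' ');
                prev.push_str(rest.trim_start());
                continue;
            }
        }
        logical.push(line.to_string());
    }
    logical
}
```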

Content Items

The content between *SPEAKER: and the terminator consists of content items separated by whitespace:

  • Words: regular words, potentially with annotations
  • Groups: bracketed content like <word word> for overlap, retrace, etc.
  • Special forms: pauses (.), events &=laughs, fillers &-uh
  • Separators: commas , and other punctuation

Words

Words are the primary content unit. See Word Syntax for full details.

Groups

Angle brackets < > group words for annotations:

*CHI:	<I want> [/] I want cookie .

Common group annotations:

  • [/]: partial retrace (speaker repeats the same words)
  • [//]: full retrace (speaker restarts with different words)
  • [///]: multiple retracing (multiple false starts)
  • [/-]: reformulation (speaker rephrases with different structure)
  • [?]: uncertain transcription

Special Forms

*CHI:	um (.) I want &-uh cookie .

  • (.): short pause
  • (..): medium pause
  • (...): long pause
  • (1.5): timed pause in seconds
  • &=laughs: paralinguistic event
  • &-uh: filler
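The special forms listed above can be told apart by their surface shape. This is a token-level sketch with hypothetical types; the real grammar handles many more forms and does not work on whitespace-split tokens.

```rust
#[derive(Debug, PartialEq)]
enum PauseKind {
    Short,      // (.)
    Medium,     // (..)
    Long,       // (...)
    Timed(f64), // (1.5), in seconds
}

#[derive(Debug, PartialEq)]
enum Token<'a> {
    Pause(PauseKind),
    Event(&'a str),  // &=laughs
    Filler(&'a str), // &-uh
    Other(&'a str),
}

fn classify_token(tok: &str) -> Token<'_> {
    if let Some(name) = tok.strip_prefix("&=") {
        return Token::Event(name);
    }
    if let Some(name) = tok.strip_prefix("&-") {
        return Token::Filler(name);
    }
    match tok {
        "(.)" => Token::Pause(PauseKind::Short),
        "(..)" => Token::Pause(PauseKind::Medium),
        "(...)" => Token::Pause(PauseKind::Long),
        _ => {
            // Timed pause: digits and dots only, wrapped in parentheses.
            let timed = tok
                .strip_prefix('(')
                .and_then(|t| t.strip_suffix(')'))
                .filter(|t| t.chars().all(|c| c.is_ascii_digit() || c == '.'))
                .and_then(|t| t.parse::<f64>().ok());
            match timed {
                Some(seconds) => Token::Pause(PauseKind::Timed(seconds)),
                None => Token::Other(tok),
            }
        }
    }
}
```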

Media Linking

Utterances can include media timestamps (bullets) that link to audio/video:

*CHI:	I want cookies . •1234_5678•

The numbers represent start and end times in milliseconds. The bullets delimiting the pair render as • in most editors; on disk they are the NAK control character (U+0015). See the bullet rule in grammar/grammar.js.
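A bullet token can be decoded into its millisecond pair as sketched below. The function name is hypothetical. The sketch accepts either the raw NAK delimiter (U+0015) or the rendered bullet (U+2022) so that it works on both on-disk and copy-pasted text.

```rust
/// Parse a media bullet such as "\u{15}1234_5678\u{15}" into (start_ms, end_ms).
fn parse_bullet(token: &str) -> Option<(u64, u64)> {
    let is_delim = |c: char| c == '\u{15}' || c == '\u{2022}';
    let inner = token.strip_prefix(is_delim)?.strip_suffix(is_delim)?;
    let (start, end) = inner.split_once('_')?;
    let start: u64 = start.parse().ok()?;
    let end: u64 = end.parse().ok()?;
    Some((start, end))
}
```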

Dependent Tiers

See Dependent Tiers for documentation on %mor, %gra, %pho, %wor, and other annotation tiers that follow the main tier.

Retraces and Repetitions

Status: Current Last updated: 2026-05-11 23:16 EDT

Retraces mark content that the speaker said but then corrected, repeated, or abandoned. They are one of the most consequential constructs in CHAT because they affect how every dependent tier aligns to the main tier.

CHAT Syntax

A retrace has two parts: the retraced content (what the speaker said first) and the correction (what follows). The retraced content is marked with a trailing bracket code:

| Marker | Name | Meaning |
| --- | --- | --- |
| [/] | Partial repetition | Speaker repeats the same words |
| [//] | Full correction | Speaker restarts with different words |
| [///] | Multiple correction | Multiple false starts |
| [/-] | Reformulation | Speaker rephrases with different structure |

Single-Word Retraces

When only one word is retraced, no angle brackets are needed:

*CHI: I [/] I want that .
*CHI: ana [//] an .
*MOT: the book [/-] the magazine is here .

Group Retraces

When multiple words are retraced, angle brackets delimit the scope:

*MOT: <the dog> [//] the cat ran .
*CHI: <I want> [/] I want cookie .
*CHI: <I want the> [///] give me that .

Retraces with Replacements

A retraced word often has a replacement [: target] and/or error code [* code]. This is common in aphasia and child language corpora where the speaker produces an incorrect form:

*PAR: tika@u [: kitty] [* p:n] [//] kitty is nice .
%mor: noun|kitty aux|be-Fin-Ind-Pres-S3 adj|nice-S1 .

*PAR: lɛɾɪ@u [: later] [* p:n] [//] later in the day .
%mor: adv|late adp|in det|the-Def-Art noun|day .

*CHI: male [: female] [* s:r] [/] male [: female] [* s:r] .
%mor: adj|female-S1 .

In each case, the retraced word (before the [//] or [/]) is excluded from %mor alignment. Only the correction (after the marker) is counted.

Data Model

Retraces are a first-class variant of UtteranceContent:

flowchart TD
    UC["UtteranceContent"]
    UC --> Word
    UC --> RW["ReplacedWord"]
    UC --> Retrace
    UC --> AG["AnnotatedGroup"]
    UC --> Other["...20 other variants"]

    Retrace --> BC["BracketedContent"]
    Retrace --> RK["RetraceKind"]
    BC --> BIW["BracketedItem::Word"]
    BC --> BIRW["BracketedItem::ReplacedWord"]

    style Retrace fill:#f96,stroke:#333

The Retrace struct wraps the retraced content in a BracketedContent container, which can hold any combination of words, replaced words, and other content items:

// crates/talkbank-model/src/model/content/retrace.rs
pub struct Retrace {
    pub content: BracketedContent,          // the retraced words
    pub kind: RetraceKind,                  // Partial, Full, Multiple, Reformulation
    pub is_group: bool,                     // <word> [/] vs word [/]
    pub annotations: Vec<ContentAnnotation>,// non-retrace annotations after marker
    pub span: Span,
}

Why First-Class?

Before the retrace refactor, retraces were represented as annotations on words or groups. This meant every match on content had to inspect annotation lists to determine whether a word was retraced. This led to a class of bugs where retraced content was accidentally included in alignment counting, word extraction, or retokenization.

Making Retrace a top-level UtteranceContent variant means:

  1. The compiler enforces handling. Every match on UtteranceContent must have a Retrace arm. Forgetting to handle retraces is a compile error, not a silent runtime bug.
  2. Domain-aware gating is centralized. The content walker checks the Retrace variant once, not at every annotation-inspection site.
  3. Alignment counting is simple. The count function returns 0 for Retrace in the Mor domain; no annotation inspection is needed.
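The effect of a dedicated variant can be seen in a stripped-down model. The types below are hypothetical stand-ins, not the talkbank-model definitions: the point is only that the match must name the Retrace arm, and that the Mor domain returns 0 for it. Terminators and replacements are left out for brevity.

```rust
#[derive(Clone, Copy, PartialEq, Eq)]
enum Domain {
    Mor,
    Pho,
}

/// Simplified stand-in for UtteranceContent (hypothetical).
enum Content {
    Word(String),
    Retrace(Vec<Content>),
}

/// Removing either arm below is a compile error, not a silent runtime bug.
fn count_alignable(item: &Content, domain: Domain) -> usize {
    match item {
        Content::Word(_) => 1,
        Content::Retrace(inner) => {
            if domain == Domain::Mor {
                0 // retraced material is excluded from %mor alignment
            } else {
                inner.iter().map(|c| count_alignable(c, domain)).sum()
            }
        }
    }
}

fn count_all(items: &[Content], domain: Domain) -> usize {
    items.iter().map(|c| count_alignable(c, domain)).sum()
}
```

For the content of `<I want> [/] I want cookie`, this model counts three items in the Mor domain and five in the Pho domain.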

Parser Conversion

The tree-sitter grammar parses retrace markers ([/], [//], etc.) as annotations on word_with_optional_annotations. The Rust parser converts them to structural Retrace nodes in parse_word_content():

flowchart LR
    subgraph "Tree-sitter CST"
        WOA["word_with_optional_annotations"]
        SW["standalone_word"]
        BA["base_annotations"]
        RP["retrace_partial / retrace_complete / ..."]
        WOA --> SW
        WOA --> BA
        BA --> RP
    end

    subgraph "Rust Model"
        RET["UtteranceContent::Retrace"]
        BC2["BracketedContent"]
        W2["Word or ReplacedWord"]
        RET --> BC2
        BC2 --> W2
    end

    WOA -->|"parse_word_content()\n(word.rs)"| RET

Three cases in parse_word_content():

  1. Word + retrace (I [/]): wrap Word in BracketedItem::Word inside Retrace
  2. Word + replacement + retrace (tika@u [: kitty] [* p:n] [//]): build ReplacedWord, then wrap in BracketedItem::ReplacedWord inside Retrace
  3. Word + replacement, no retrace (tika@u [: kitty]): emit bare ReplacedWord

Group retraces (<content> [/]) are handled in group/parser.rs via the same structural wrapping.

Alignment Behavior

Retraces interact differently with each dependent tier domain:

flowchart TD
    RT["Retrace node\n(e.g. 'tika@u [: kitty] [* p:n] [//]')"]

    RT -->|"Mor domain"| SKIP["SKIP\n(return 0)\nNot morphologically analyzed"]
    RT -->|"Pho domain"| COUNT["COUNT\nPhonologically produced"]
    RT -->|"Sin domain"| COUNT2["COUNT\nGesturally produced"]
    RT -->|"Wor domain"| COUNT3["RECURSE\napply retrace-aware %wor leaf rule"]

    style SKIP fill:#faa,stroke:#333
    style COUNT fill:#afa,stroke:#333
    style COUNT2 fill:#afa,stroke:#333
    style COUNT3 fill:#afa,stroke:#333

Why %mor skips retraces: The %mor tier represents the morphological analysis of what the speaker meant to say. Retraced content is a false start or error; it was produced phonologically but is not part of the intended linguistic structure. The correction after the retrace marker carries the morphological analysis.

Why %pho/%sin/%wor include retraces: These tiers document what was actually produced (the sounds, gestures, and timing of the speech as it happened), including false starts. The retrace was physically spoken, so it appears in these tiers.

For %wor, retrace ancestry does not change leaf-level membership:

  • spoken word tokens count both inside and outside retrace
  • that includes fillers, fragments, nonwords, and untranscribed placeholders
  • overlap annotations do not affect %wor membership

Exact corpus-shaped contrast:

*CHI:	<one &+ss> [/] one play ground .
%wor:	one •321008_321148• ss •321148_321368• one •321809_321969• play •322049_322310• ground •322390_322890• .

*CHI:	&+ih <the what> [/] what's letter &+th is this ?
%wor:	ih •49063_49103• the •49103_49163• what •49183_50205• what's •50205_50405• letter •50405_50685• th •50886_50946• is •50946_51046• this •51086_51586• ?

Implementation

Counting: count_alignable_item() in alignment/helpers/count.rs:

UtteranceContent::Retrace(retrace) => {
    if domain == TierDomain::Mor {
        0  // excluded from morphological alignment
    } else {
        count_bracketed_alignable_content(&retrace.content, domain, true)
    }
}

Walking: walk_words() in alignment/helpers/walk/mod.rs:

UtteranceContent::Retrace(retrace) => {
    if !matches!(domain, Some(TierDomain::Mor)) {
        walk_bracketed_content(&retrace.content.content, domain, f);
    }
}

%wor generation and overlap counting still use dedicated recursive helpers, but now for %wor-specific sequencing details like replacement handling rather than for retrace-sensitive membership.

Validation

Cross-Utterance Retrace Validation

The retrace validators in validation/retrace/ check:

  • Collection: collection/utterance.rs and collection/bracketed.rs walk the content tree to find all Retrace nodes
  • Detection: detection.rs provides utterance_item_has_retrace() for quick retrace presence checks

Alignment Validation (E705)

E705 fires when the main tier has more alignable items than %mor. If retraces are correctly parsed as Retrace nodes, they are excluded from the count and E705 does not fire. If a retrace is accidentally parsed as a bare ReplacedWord (the bug fixed in c90b9bf), it is counted and triggers a false E705.

Regression Tests

tests/retrace_replaced_word_regression.rs contains 6 targeted tests:

| Test | Pattern | Verifies |
| --- | --- | --- |
| single_word_retrace_with_replacement_full | word [: repl] [* err] [//] | Retrace wraps ReplacedWord |
| single_word_retrace_with_replacement_partial | word [: repl] [* err] [/] | Partial retrace with replacement |
| single_word_retrace_with_replacement_multiple | word [: repl] [* err] [///] | Multiple retrace with replacement |
| single_word_retrace_with_replacement_no_error_marker | word [: repl] [///] | No [*] still produces Retrace |
| single_word_retrace_without_replacement | word [//] | Baseline (no replacement) |
| retrace_with_replacement_does_not_cause_e705 | Full pipeline with %mor | No false E705 |

Reference corpus entries: corpus/reference/annotation/retrace.cha

See Also

Replacements

Status: Current Last modified: 2026-05-29 17:47 EDT

A replacement is a CHAT annotation [: ...] that pairs a single spoken word on the main tier with one or more “intended” words. It records both what the speaker actually said and what the analysis should treat the utterance as containing.

*CHI:	wanna [: want to] go .
*CHI:	dis [: this] is fun .
*CHI:	rocking+house [: rocking+horse] [*] ?

This page is the canonical reference for what replacements mean in TalkBank, both as a CHAT-manual construct and as a typed AST in this repo. The most important load-bearing fact, which the rest of the page expands on:

Replacements are word-level, not group-level. Each tier domain chooses one side of the pair: %mor analyzes the replacement (right side); %wor, %pho, %sin align to the original (left side). %gra follows %mor.

CHAT Syntax

Word-Level Scope

A replacement attaches to a single standalone_word on the main tier and contains one or more replacement words inside the brackets:

*CHI:	gonna [: going to] eat lunch .
*CHI:	dis [: this] toy .
*CHI:	rocking+house [: rocking+horse] [*] ?

The grammar rules are word_with_optional_annotations and replacement in grammar/grammar.js; grep for the rule names rather than relying on line numbers, so the reference stays accurate as the grammar evolves. Replacement words can be separated by whitespace, so [: going to] is a single replacement of gonna with two words.

There Is No Group-Level Replacement

<dat is> [: that is] is not valid CHAT. A replacement does not attach to a group; it attaches to a single word. The grammar enforces this by typing: ReplacedWord.word: Word, never Group. To replace words inside a group, attach the replacement to the inner word:

*CHI:	<dat [: that] is> [/] is broken .

This shape, replacement inside a group inside a retrace, is legal because each annotation operates at its own scope.

There Is No [::] Form

Some literature on CHILDES tooling references a [::] annotation; it does not exist in this repo’s grammar, parser, or model, and is not defined by the current CHAT manual. Only [:] exists. If you encounter [::] in legacy data, treat it as a parse error to investigate, not a construct to support.

The Per-Domain Alignment Rule

This is the rule contributors most often get wrong. Different tier domains align to different sides of a replacement pair:

| Tier | Side aligned to | Rationale |
| --- | --- | --- |
| %mor | replacement (right) | Morphosyntactic analysis annotates the target form, not the error |
| %gra | replacement (right) | Grammatical relations align to %mor’s structure |
| %wor | original (left) | Word-level timing is for what was actually spoken |
| %pho | original (left) | Phonological transcription describes what was actually spoken |
| %sin | original (left) | Spelling-in-actual describes the original surface form |

The mnemonic: the replacement encodes the intended form (what the speaker meant or what a corrected transcript would read). Tiers analyzing intent (%mor/%gra) use the replacement; tiers documenting realization (%wor/%pho/%sin) use the original.

flowchart LR
    spoken["Original word\n(left of [:)\n'dis'"]
    target["Replacement words\n(inside [: ])\n'this'"]

    spoken -->|"%wor (timing)"| wor["%wor: dis"]
    spoken -->|"%pho (phonology)"| pho["%pho: dɪs"]
    spoken -->|"%sin (spelling)"| sin["%sin: dis"]
    target -->|"%mor (UD parse)"| mor["%mor: pron|this"]
    target -->|"%gra (paired with %mor)"| gra["%gra: 1|0|ROOT"]

For multi-word replacements like gonna [: going to], the rule generalizes consistently:

  • %wor / %pho / %sin produce one entry, for gonna.
  • %mor produces two entries, for going and to.
  • %gra produces two entries, paired to the two %mor items.

The alignment-counting code that enforces this is in alignment/units.rs; look for the UtteranceContent::ReplacedWord arm. The full table of per-domain rules is in spec/docs/ALIGNMENT_RULES.md.
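The per-domain rule reduces to choosing one side of the pair. The sketch below uses hypothetical, simplified types (plain strings instead of Word values) and is not the code in alignment/units.rs.

```rust
#[derive(Clone, Copy)]
enum Domain {
    Mor,
    Gra,
    Wor,
    Pho,
    Sin,
}

/// Simplified stand-in for ReplacedWord (hypothetical).
struct ReplacedWordSketch {
    original: String,         // left side: what was spoken
    replacement: Vec<String>, // right side: one or more intended words
}

/// The forms a given tier domain aligns to.
fn aligned_forms(rw: &ReplacedWordSketch, domain: Domain) -> Vec<&str> {
    match domain {
        // Intent-oriented tiers use the replacement.
        Domain::Mor | Domain::Gra => rw.replacement.iter().map(String::as_str).collect(),
        // Realization-oriented tiers use the original.
        Domain::Wor | Domain::Pho | Domain::Sin => vec![rw.original.as_str()],
    }
}
```

For gonna [: going to], this yields one form for %wor, %pho, and %sin, and two forms for %mor and %gra.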

Rust AST

A replacement is modeled as a first-class UtteranceContent variant, not as a flag on Word:

// crates/talkbank-model/src/model/annotation/replacement.rs
pub struct ReplacedWord {
    pub word: Word,                       // left side: original spoken word
    pub replacement: Replacement,         // right side: 1+ intended words
    pub scoped_annotations: ReplacedWordAnnotations,
}

Two consequences of this shape:

  1. A replacement is a wrapper around a Word, not a kind of Word. ReplacedWord lives as its own variant of UtteranceContent (and BracketedItem), holding an inner word: Word plus the replacement payload. Contrast with retraces: Retrace is also a variant of UtteranceContent/BracketedItem, but it wraps a group of content (a single word or a <...> group), not a single Word. Different mechanism, different scope, same top-level slot in the AST.
  2. The walk_words() content walker yields WordItem::ReplacedWord as a distinct leaf (defined in crates/talkbank-model/src/alignment/helpers/walk/mod.rs). Domain-aware extraction code branches on this leaf type and chooses original or replacement per the table above.

Validation

Each Replacement Word Is Validated Like a Main-Tier Word

The replacement is a Vec<Word>. Each Word inside it goes through the same validator that runs on main-tier words:

*CHI:	dog [: C-3PO] .

This produces [E220] "C-3PO" is not a legal word in language(s) "eng": numeric digits not allowed, exactly as if C-3PO had appeared on the main tier directly. The replacement does not provide an escape from word-level validation. The implementation is in replacement.rs.

This is critical for any code generating replacements programmatically: do not assume [: ...] lets you smuggle arbitrary text past the word validator. If your producer emits a replacement, both sides must be CHAT-legal under the utterance’s declared language.

Replacement-Specific Error Codes

Three error codes are specific to replacements and do not apply to main-tier words:

| Code | Meaning |
| --- | --- |
| E208 | Empty replacement [:] (no words provided between : and ]) |
| E390 | Replacement contains an omission (0-prefix form), disallowed inside replacements |
| E391 | Replacement contains untranscribed material (xxx, yyy, www), disallowed inside replacements |

The principle: a replacement must be a concrete intended form. Empty, omitted, or unintelligible content defeats that purpose.
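The three checks can be sketched over a list of replacement words. This is an illustration, not the validator in replacement.rs: the function name is hypothetical, and detecting an omission by a leading 0 is an assumption about the 0-prefix form.

```rust
/// Returns the replacement-specific error codes that apply to `words`.
fn check_replacement(words: &[&str]) -> Vec<&'static str> {
    if words.is_empty() {
        return vec!["E208"]; // empty replacement [:]
    }
    let mut codes = Vec::new();
    // Assumption: an omission is any word beginning with '0'.
    if words.iter().any(|w| w.starts_with('0')) {
        codes.push("E390");
    }
    if words.iter().any(|w| matches!(*w, "xxx" | "yyy" | "www")) {
        codes.push("E391");
    }
    codes
}
```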

Interactions with Other Annotations

Replacements and Retraces Are Orthogonal

A retrace ([/], [//], [///], [/-]) and a replacement ([:]) are distinct annotations operating at different structural levels:

  • Retraces wrap content (a single word or a group). They are first-class UtteranceContent variants and represent post-hoc speaker correction.
  • Replacements attach inside a Word slot via ReplacedWord. They are editorial metadata about an individual spoken word.

Both can coexist:

*CHI:	<dat [: that] is> [/] is broken .   (replacement inside retrace)

A retrace cannot live inside a replacement (the grammar wraps replacements around standalone_word, not arbitrary content).

Replacements and Error Coding

Error codes follow the replacement and operate on the replaced word as a unit:

*CHI:	rocking+house [: rocking+horse] [*] ?

Here [*] marks rocking+house as containing a phonological/lexical error; the [: rocking+horse] records the intended form. The two annotations cooperate: the replacement encodes what was meant, the error code classifies how it deviates. Implementation: scoped_annotations field on ReplacedWord.

Common Misconceptions

These are misconceptions we have repeatedly written down and then forgotten; they are recorded here so future contributors don’t reinvent them.

  1. “[: ...] lets me put any text I want.” No. Each replacement word is validated. [: C-3PO] fails E220 in English just as C-3PO would.
  2. “[:] is the right mechanism for ASR sanitization.” Usually no. ASR-introduced normalization typically wants [% ...] (free-form comment) or [= ...] (free-form explanation), neither of which validates word grammar. Use [:] only when you have a concrete CHAT-legal intended form.
  3. “%mor analyzes the original.” No. %mor analyzes the replacement: the correction’s morphology, not the error’s.
  4. “%wor count must equal %mor count.” No. For gonna [: going to], %wor has 1 entry and %mor has 2. They align to different sides. The validator’s per-domain rule respects this.
  5. “<a b> [: c d] is a group-level replacement.” No. Group-level replacements don’t exist. Either replace inside (<a [: c] b [: d]>) or rephrase the transcription.

Source Citations

| Concern | File:line |
| --- | --- |
| Grammar rule (replacement) | grammar/grammar.js:1341-1352 |
| Word-with-replacement rule | grammar/grammar.js:1063-1071 |
| ReplacedWord struct | crates/talkbank-model/src/model/annotation/replacement.rs (search pub struct ReplacedWord) |
| Per-domain alignment | crates/talkbank-model/src/model/file/utterance/metadata/alignment/units.rs (search UtteranceContent::ReplacedWord) |
| Replacement validation | crates/talkbank-model/src/model/annotation/replacement.rs (search impl ... Validate for ReplacementWords) |
| Reference corpus example | corpus/reference/annotation/errors-and-replacements.cha |
| CHAT manual | https://talkbank.org/0info/manuals/CHAT.html#Replacement_Scope |

See Also

Untranscribed Markers: xxx, yyy, www

Status: Reference Last updated: 2026-06-14 19:57 EDT

CHAT reserves three short word-level markers for material the human transcriber cannot or chose not to render as words on the main tier. Each one has a specific meaning. Tools that emit CHAT, including ASR pipelines, format converters, and editor heuristics, must respect those meanings, because every downstream consumer (researchers, validators, and aggregate-statistics tools like CLAN’s freq, kideval, mlu) reads them at face value.

| Marker | Meaning | Emitter |
| --- | --- | --- |
| xxx | Transcriber listened to the audio and could not make out what was said. The speech is unintelligible to the human ear at this point. | Human transcriber only. |
| yyy | Transcriber heard a discrete utterance but could not write it as ordinary CHAT words. Used when the surface form resists orthography (mumbled, slurred, foreign with no equivalent). The phonetic content typically appears on the %pho tier. | Human transcriber only. |
| www | Transcriber chose not to transcribe this stretch, usually for privacy, off-topic content, or because the segment is irrelevant to the corpus’s purpose. | Human transcriber only. |

The shared property: each marker is the human transcriber telling later readers something specific about their experience listening to the audio. None of them mean “tooling could not process this token”.

Why this matters

When a researcher loads a CHAT corpus and counts xxx occurrences, the result is a measure of human listening difficulty: it tells them how much of the audio resisted human transcription. That number feeds into methodology decisions (“can we get reliable MLU from this corpus?”, “what’s the noise floor on this child’s speech?”, “should we re-record in a quieter environment next time?”). It is a load-bearing signal in language-development research.

If an ASR pipeline emits xxx whenever it can’t sanitize a token (for example, by substituting xxx for any word that fails CHAT validation under a strict language profile), every xxx count in the corpus becomes a meaningless mixture of “human couldn’t tell” and “pipeline gave up”. Researchers reading those counts are then silently misled. The signal is destroyed for the entire history of that corpus, because the corruption is indistinguishable from real unintelligibility once committed.

The same reasoning applies to yyy and www. A converter or post-processor that emits any of these three markers because the tooling couldn’t handle a token is committing semantic vandalism against the whole field.

Rules for tooling

  1. Never emit xxx, yyy, or www from a tool to mean “could not process”. These markers are reserved for human transcriber judgment.
  2. When a token cannot be validated as legal CHAT under the declared language, prefer one of:
    • Pass the token through verbatim and let the CHAT validator (or CLAN’s check) flag it for human review. The transcriber listens, decides, and corrects.
    • Fail loud: abort the file rather than emit corrupted output.
    • Apply only purely orthographic, semantically null repairs (e.g., stripping a stray boundary quote mark from "My). These are safe because no information is lost.
  3. Never sanitize a token by replacing it with one of the three markers. That is exactly the corrupting behavior this document prohibits.
  4. Never delete a token to “fix” a validation failure. Deletion loses data without any flag.

What tools synthesizing CHAT should do instead

Any tool that builds CHAT from an external source (ASR output, an importer, a format converter) should follow the same division of labor:

  1. Silently fix only orthographically inarguable problems (for example, stripping a stray boundary quote mark from "My).
  2. For tokens that fail language-level validation but are structurally legal CHAT (e.g., C-3PO under English: tree-sitter accepts the digit-hyphen compound but Word::validate fires E220 “numeric digits not allowed”), ship the token verbatim. The full-file validator and check fire E220 on the same word, the file ends up in the human review queue, and the transcriber listens to the audio and decides what was actually said.
  3. For tokens that fail structural parsing (tree-sitter rejects), fail loud: emitting malformed CHAT would corrupt the file beyond the validator’s ability to flag it.

The division of labor is: the tool fixes only what is mechanically unambiguous; CHECK and the human transcriber handle everything that requires judgment about what the speaker said.
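The three-way split above can be sketched as a small decision function. This is an illustration only: `Disposition`, `dispose`, and `is_synthesized_marker` are hypothetical names, not chatter APIs, and the `parses` closure stands in for whatever structural parser check a real tool would call.

```rust
/// What a CHAT-synthesizing tool does with one incoming token.
/// (Hypothetical type for illustration; not part of chatter.)
#[derive(Debug, PartialEq)]
enum Disposition {
    /// A semantically null orthographic repair was applied.
    Repaired(String),
    /// Shipped verbatim; the validator and a human decide later.
    PassThrough(String),
    /// Structurally unparseable: abort instead of emitting output.
    FailLoud(String),
}

/// Markers reserved for human transcriber judgment.
const RESERVED: [&str; 3] = ["xxx", "yyy", "www"];

/// `parses` is an assumed stand-in for a structural parser check.
fn dispose(token: &str, parses: impl Fn(&str) -> bool) -> Disposition {
    // Step 1: strip a single stray leading quote mark ("My becomes My).
    let repaired = match token.strip_prefix('"') {
        Some(rest) if !rest.contains('"') => rest,
        _ => token,
    };
    // Step 3: structural failure aborts; nothing is substituted or deleted.
    if !parses(repaired) {
        return Disposition::FailLoud(token.to_string());
    }
    // Step 2: everything else ships verbatim, including tokens that a
    // language-level validator will flag later (such as C-3PO).
    if repaired != token {
        Disposition::Repaired(repaired.to_string())
    } else {
        Disposition::PassThrough(token.to_string())
    }
}

/// True when the output is a reserved marker although the input was not:
/// exactly the substitution the rules prohibit.
fn is_synthesized_marker(input: &str, output: &str) -> bool {
    RESERVED.contains(&output) && input != output
}
```

Note that no branch produces xxx, yyy, or www, and no branch drops the token; a guard like `is_synthesized_marker` could serve as a regression check in a converter's test suite.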

  • xxx / yyy / www survive the transcript through all NLP passes (morphotag, utseg, translate, coref) without re-interpretation. Tools that walk the AST treat them as opaque tokens; they have no POS tag, no lemma, no dependency parent, no translation.
  • %wor excludes all three (no phoneme sequence to align). %pho may reference yyy directly because the phonetic content is the whole point of the marker.
  • See word-syntax.md for grammar; this document is the policy reference for who is allowed to emit them and why.

Postcodes ([+ ...])

Status: Reference Last updated: 2026-05-11 23:01 EDT

A postcode is a tagged annotation token that attaches to an utterance as a whole and appears after the terminator. The canonical CHAT syntax is [+ <text>]. Postcodes carry researcher / analysis tags about the utterance (whether it should be excluded from analysis, how it should be coded, what kind of speech act it represents) without modifying the utterance’s word content.

Syntax and Scope

*CHI:   I want cookie .  [+ exc]
*MOT:   what did you say ?  [+ imp]
*CHI:   no I don't want it !  [+ neg] [+ trn]

Three structural facts to internalize:

  1. Postcodes attach to the utterance, not to a word. They sit after the terminator, on the main tier, alongside (but distinct from) any utterance-level bullet. Unlike word-scoped annotations ([: ...] replacement, [% ...] comment, [= ...] explanation, [* ...] error code), a postcode does not modify the interpretation of any single word; it tags the whole utterance.
  2. Multiple postcodes may follow a single terminator. They are ordered, but the order is not semantically privileged.
  3. The body is free-form text. The CHAT word grammar is not applied to postcode contents. Researchers can write arbitrary tags, codes, descriptions, comments, or analytic notes. The model stores the raw text and leaves interpretation to downstream tooling and conventions.
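As a rough illustration of these three facts, the following sketch pulls postcode bodies out of a raw main-tier line by plain text scanning. It is not how chatter parses CHAT (that goes through the tree-sitter grammar); `postcodes` is a hypothetical helper that assumes well-formed brackets.

```rust
/// Collect the bodies of all `[+ ...]` tokens on a main-tier line, in order.
/// Bodies are returned as raw text; no CHAT word grammar is applied.
fn postcodes(line: &str) -> Vec<&str> {
    let mut found = Vec::new();
    let mut rest = line;
    while let Some(start) = rest.find("[+ ") {
        // Skip past the opening "[+ " (three ASCII bytes).
        let after = &rest[start + 3..];
        match after.find(']') {
            Some(end) => {
                found.push(after[..end].trim());
                rest = &after[end + 1..];
            }
            // Unclosed bracket: stop rather than guess.
            None => break,
        }
    }
    found
}
```

Called on the third example line above, this returns the two bodies neg and trn in their original order.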

Common Postcodes: Empirical Survey

The postcode vocabulary is open-ended: the CHAT format imposes no closed set, and an audit of every [+ ...] token across a JSON-mirrored snapshot of the TalkBank corpora (~99k files, 23+ data-repo families) found 488 distinct values in active use.

The findings split into three tiers, ranked by repo spread (the number of distinct corpus families in which a code appears). Repo spread is a more useful ranking than raw count, because high-count codes can be concentrated in a single corpus.
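To make the difference between the two measures concrete, here is a minimal sketch that derives both from a list of (repo, postcode) observations. The function name and the sample data in the usage note are made up for illustration; they are not the audit tooling or the survey figures.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// For each postcode, return (repo spread, total occurrences).
/// Each observation is one `[+ ...]` token seen in one data-repo family.
fn spread_and_total<'a>(
    observations: &[(&'a str, &'a str)],
) -> BTreeMap<&'a str, (usize, usize)> {
    let mut acc: BTreeMap<&'a str, (BTreeSet<&'a str>, usize)> = BTreeMap::new();
    for &(repo, code) in observations {
        let entry = acc.entry(code).or_default();
        entry.0.insert(repo); // distinct repos give the spread
        entry.1 += 1; // every token counts toward the total
    }
    acc.into_iter()
        .map(|(code, (repos, total))| (code, (repos.len(), total)))
        .collect()
}
```

With toy input where exc appears three times across two repos and uncued appears three times in one repo, both codes have the same total but different spread, which is why the tiers are ranked by spread.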

Tier 1: Cross-corpus codes (in 7+ repos)

These are the conventions every CHAT consumer should expect to encounter across collections:

| Postcode | Repo spread | Total occurrences | Meaning |
|---|---|---|---|
| [+ gram] | 13 | ~3,100 | Grammatical: the utterance is grammatically well-formed for purposes of the analysis. |
| [+ exc] | 9 | ~26,900 | Exclude utterance from analysis. The utterance is preserved in the transcript but tagged so analytic tools (CLAN’s freq, mlu, etc.) skip it. |
| [+ bch] | 9 | ~10,000 | Backchannel: listener-side acknowledgement (mhm, yeah) that should not be counted as a substantive turn. |
| [+ trn] | 7 | ~3,800 | Translation utterance. |

Tier 2: Multi-corpus protocol codes (in 4-6 repos)

Codes deployed across several CHILDES sub-collections, typically encoding picture-narration / story-reading / imitation experimental conditions. Raw counts are substantial (often tens of thousands), but the meaning of each code is set by the originating protocol. Consult per-corpus documentation rather than assuming a global definition:

| Postcode | Repo spread | Total occurrences |
|---|---|---|
| [+ SR] | 5 | ~31,000 |
| [+ IN] | 5 | ~24,500 |
| [+ PI] | 5 | ~22,700 |
| [+ R] | 4 | ~16,200 |
| [+ I] | 4 | ~10,500 |
| [+ nv] | 4 | ~3,300 |
| [+ imit] | 4 | ~3,200 |

Tier 3: Single-corpus and long-tail codes

About 80% of the 488 distinct values appear in one repo only. The single-corpus codes include high-volume protocol vocabularies (e.g. [+ uncued] ~19,500 in one repo, [+ NAC] ~3,500 in one repo, [+ diary] ~2,800 in a Romance/Germanic diary-study collection, [+ noatt] ~2,300 in one repo, [+ inter-utter-switch] ~720 flagging code-switching turns).

The long tail also includes researcher-private notes, typos that survived check, and per-study coding schemes. Tooling MUST treat any unknown postcode value as opaque text: the corpus author may know what it means; the format does not.

Caveats

  • Numbers are from a snapshot audit and will drift as corpora are added or revised. Treat the broad shape (open vocabulary, ~4 truly cross-corpus codes, ~10 multi-corpus protocol codes, ~hundreds of single-corpus or long-tail codes) as the load-bearing finding, not the exact counts.
  • “Repo spread” counts data-repo families, not individual files. Two corpora curated by the same group inside one data-repo count as one for spread; researchers using the same code in two different family-of-corpora packages count as two.
  • The CHAT manual remains the source of truth for standard conventions. The empirical survey above shows what is actually deployed; when ingesting a new corpus, consult its own documentation for the postcodes in use.

What Postcodes Are NOT

Postcodes are easy to confuse with several other CHAT annotation forms because they all use square brackets. The differences are substantive and load-bearing.

| Form | Scope | Body validation | Purpose |
|---|---|---|---|
| [+ ...] | Utterance-level (this doc) | None, free text | Researcher / analysis tag attached to the whole utterance |
| [: ...] | Word-level | Replacement words ARE validated as CHAT words | Sanctioned-form correction of the preceding word (see replacements.md) |
| [% ...] | Word-level | None, free text | Free-form comment about the preceding word or local span |
| [= ...] | Word-level | None, free text | Explanation of unclear / non-standard speech (often paired with xxx / yyy placeholders) |
| [* ...] | Word-level | None, error code text | Error coding for the preceding word, optionally with a structured code |

Two consequences worth pinning down explicitly:

  • A postcode cannot carry per-word semantics. If you want to attach a comment, replacement, or error code to a single word, use the appropriate word-scoped form. Stretching a postcode to mean “this word is X” loses the per-word position downstream tools depend on.
  • A word-scoped annotation cannot tag an utterance. If you want to mark an entire utterance for exclusion or translation, use a postcode. A [% exclude this] after a word does not mean “exclude the utterance” to any consumer.
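The scope distinction can be sketched as a classifier over the five bracket forms discussed in this section. `Scope` and `annotation_scope` are hypothetical names, and dispatching on the first character after the bracket is a simplification that covers only these forms; other bracketed CHAT constructs (such as the retrace marker [/]) fall through to `None`.

```rust
/// Whether a bracketed annotation tags the whole utterance or one word.
#[derive(Debug, PartialEq)]
enum Scope {
    Utterance,
    Word,
}

/// Classify one bracketed token by the character that follows `[`.
/// Returns None for anything outside the five forms covered here.
fn annotation_scope(annotation: &str) -> Option<Scope> {
    let body = annotation.strip_prefix('[')?.strip_suffix(']')?;
    match body.chars().next()? {
        '+' => Some(Scope::Utterance),
        ':' | '%' | '=' | '*' => Some(Scope::Word),
        _ => None,
    }
}
```

A consumer implementing an exclusion filter would act only on `Scope::Utterance` tokens; a [% exclude this] comment classifies as `Scope::Word` and is ignored for that purpose.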

Not Postcodes: Quotation Markers

Quotation marking in CHAT is not a postcode form. The constructs +"/. (quotation end), +"/, and +" (quotation linkers / continuations) are tier-level terminators and linkers, not [+ ...] postcodes. The grammar rule postcode in grammar/grammar.js is strictly [+ <text>], and the quotation forms live under separate grammar rules (quoted_new_line, linker_quotation_follows).

See Utterances → Terminators for the syntactic forms, and the talkbank-model::validation::cross_utterance validator family (gated by ValidationContext::enable_quotation_validation) for the cross-utterance balance checks.

A walker in talkbank-model::validation::utterance::quotation (check_quotation_balance) does scan the postcode list for text "/ and "/., but a sweep over the data-json corpus mirror (101,414 files, 2026-05-11) returned zero such postcodes. That code path is effectively dead, presumably retained as a defence against hand-edited oddities. The real quotation-balance work happens in the cross-utterance family above.

Position in the AST

An utterance’s main tier is MainTier, whose content: TierContent field carries the actual tier payload, including postcodes, as a typed list:

pub struct MainTier {
    pub speaker: SpeakerCode,
    pub content: TierContent,
    // spans omitted for brevity
}

pub struct TierContent {
    pub linkers: TierLinkers,                  // utterance-leading +<, ++, etc.
    pub language_code: Option<LanguageCode>,   // [- code]
    pub content: TierContentItems,             // word-level items (newtype over Vec<UtteranceContent>)
    pub terminator: Option<Terminator>,        // ., ?, !, +..., etc.
    pub postcodes: TierPostcodes,              // [+ ...] tokens after the terminator
    pub bullet: Option<Bullet>,                // optional terminal media bullet
    // content_span omitted for brevity
}

(See talkbank-model/src/model/content/main_tier.rs and tier_content.rs for the exact shape; the same TierContent type is shared by dependent tiers, so the postcode slot exists on every tier even though only main-tier postcodes are conventional.)

Because postcodes live at the utterance level, the per-word traversal helpers (walk_words, walk_words_mut) do not visit them. Code that needs to read or rewrite postcodes accesses the list directly.

The model stores postcode text as SmolStr and preserves it verbatim through CHAT roundtrips. Downstream tooling, including CLAN command implementations such as freq, mlu, kideval, is responsible for interpreting individual postcode values per its own conventions.

Tooling Rules

Tools that emit or consume CHAT must respect the scope distinction.

  • Emitters: when adding a researcher tag to an utterance, attach a Postcode to the postcodes list of the utterance’s main-tier TierContent (main.content.postcodes), not a ContentAnnotation to a word. Both serialize, but only the former reaches downstream consumers as utterance-level metadata.
  • Consumers: when reading utterance-level tags (e.g., implementing an “exclude” filter), iterate main.content.postcodes on each utterance, not the word-level annotations in UtteranceContent. The two lists are populated by different parser branches and have different semantics.
  • Round-trip preservers (extract→modify→inject pipelines such as the NLP injection passes in crates/batchalign-*): preserve the postcode list unchanged. None of the standard NLP passes have a reason to add, remove, or reorder postcodes.
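The consumer rule can be sketched with simplified stand-in types. The structs below mirror only the field names shown under Position in the AST; the real talkbank-model types differ (TierPostcodes, SmolStr text, spans), and `has_postcode` and `included_utterances` are hypothetical helpers.

```rust
// Simplified stand-ins for the model types; not the talkbank-model definitions.
struct Postcode {
    text: String,
}

struct TierContent {
    postcodes: Vec<Postcode>,
}

struct MainTier {
    speaker: String,
    content: TierContent,
}

/// Utterance-level check: reads the postcode list, never word annotations.
fn has_postcode(main: &MainTier, code: &str) -> bool {
    main.content.postcodes.iter().any(|p| p.text.as_str() == code)
}

/// An "exclude" filter: keep utterances that do not carry `[+ exc]`.
/// The postcode lists themselves are left untouched.
fn included_utterances(tiers: &[MainTier]) -> Vec<&MainTier> {
    tiers.iter().filter(|t| !has_postcode(t, "exc")).collect()
}
```

Matching on the exact string exc is a convention of the consumer, not of the format; any other postcode value passes through as opaque text.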

Dependent Tiers

Status: Reference Last updated: 2026-06-22 23:33 EDT

Dependent tiers appear on lines beginning with % immediately after an utterance. They provide annotations linked to the main tier content.

CHAT defines four structural categories of dependent tiers:

  1. Structured linguistic tiers: parsed into typed AST nodes with word-level alignment
  2. Phon phonological tiers: syllabification and segmental alignment from the Phon project
  3. Bullet-content tiers: free-form text with optional inline timing markers
  4. Text tiers: plain text with no structural alignment

Structured Linguistic Tiers

These tiers have rich, parsed representations in the data model. Each token aligns 1-to-1 with an alignable word on the main tier (excluding retraces, pauses, and events). Terminators (., ?, !) must match the main tier terminator.

%mor: Morphological Analysis

The %mor tier carries part-of-speech tags, lemmas, and morphological features for each word on the main tier. See The %mor Tier for full documentation covering the UD-style format, data model, divergences from Universal Dependencies, and migration from traditional CHAT MOR.

Format: POS|lemma[-Feature]*, with ~ separating post-clitics.

*CHI:	she's eating cookies .
%mor:	PRON|she~AUX|be-Pres-S3 VERB|eat-Prog NOUN|cookie-Plur .

%gra: Grammatical Relations

The %gra tier encodes dependency syntax using Universal Dependencies relation labels. Each entry has the format index|head|relation, where indices are 1-based and head 0 indicates ROOT.

*CHI:	I want cookies .
%mor:	PRON|I VERB|want NOUN|cookie-Plur .
%gra:	1|2|SUBJ 2|0|ROOT 3|2|OBJ 4|2|PUNCT

The %gra tier aligns with %mor chunks (clitics expand into multiple chunks). Validation checks sequential indices (E721), ROOT structure (E722 missing root, E723 multiple roots), and circular dependencies (E724).
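The structural checks named above can be sketched as follows. This is not chatter's validator: the types and function names are hypothetical, the mapping of issues to E721-E724 follows the sentence above, and the sketch treats only head 0 as ROOT.

```rust
#[derive(Debug, PartialEq)]
enum GraIssue {
    NonSequentialIndex, // corresponds to E721
    MissingRoot,        // corresponds to E722
    MultipleRoots,      // corresponds to E723
    Cycle,              // corresponds to E724
}

#[derive(Debug)]
struct GraEntry {
    index: usize,
    head: usize,
    relation: String,
}

/// Parse `index|head|relation` items separated by whitespace.
fn parse_gra(tier: &str) -> Option<Vec<GraEntry>> {
    tier.split_whitespace()
        .map(|item| {
            let mut parts = item.splitn(3, '|');
            let index = parts.next()?.parse::<usize>().ok()?;
            let head = parts.next()?.parse::<usize>().ok()?;
            let relation = parts.next()?.to_string();
            Some(GraEntry { index, head, relation })
        })
        .collect()
}

fn check_gra(entries: &[GraEntry]) -> Vec<GraIssue> {
    let n = entries.len();
    // Indices must run 1, 2, 3, ...; later checks rely on that.
    if entries.iter().enumerate().any(|(i, e)| e.index != i + 1) {
        return vec![GraIssue::NonSequentialIndex];
    }
    let mut issues = Vec::new();
    match entries.iter().filter(|e| e.head == 0).count() {
        0 => issues.push(GraIssue::MissingRoot),
        1 => {}
        _ => issues.push(GraIssue::MultipleRoots),
    }
    // Follow head pointers; n hops without reaching 0 implies a cycle.
    let cyclic = entries.iter().any(|start| {
        let mut cur = start.head;
        for _ in 0..n {
            if cur == 0 || cur > n {
                return false; // reached ROOT, or an out-of-range head
            }
            cur = entries[cur - 1].head;
        }
        true
    });
    if cyclic {
        issues.push(GraIssue::Cycle);
    }
    issues
}
```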

%pho / %mod: Phonological Transcription

The %pho tier records actual pronunciation; %mod records target/model pronunciation. Both use the same format: space-separated phonetic tokens aligned 1-to-1 with main tier words.

*CHI:	I want three cookies .
%pho:	aɪ wɑnt fwi kʊkiz .
%mod:	aɪ wɑnt θri kʊkiz .

Phonological tiers support IPA, UNIBET, X-SAMPA, or custom notation systems. They are used for child language, speech disorders, L2 learning, and dialectal variation studies.

Parsing strategy: We deliberately parse only the minimal word/group-level structure in %pho and %mod needed for coarse alignment with the main tier. The full IPA phoneme content is stored as opaque strings; deep phonological analysis is handled by Phon, and we avoid duplicating that work. The Phon extension tiers (%modsyl, %phosyl, %phoaln) follow the same strategy.

%sin: Gesture and Sign Annotation

The %sin tier codes gestures and signs aligned with speech. Each token is either 0 (no gesture) or g:referent:type (e.g., g:ball:dpoint for a deictic point at a ball).

*CHI:	that ball .
%sin:	g:ball:dpoint 0 .

Multiple simultaneous gestures use bracket grouping: 〔g:toy:hold g:toy:shake〕.

%wor: Word Timing

The %wor tier carries word-level timing annotations for media synchronization. Words may include inline bullets with millisecond timestamps. Word text is display-only (“eye candy”); timing data comes from the bullet fields.

⚠ IMPORTANT: %wor word text is the cleaned form, by design. When chatter serializes a %wor word it writes the word’s cleaned text (the spoken form with surface markers removed), NOT the raw main-tier surface form. This is a deliberate convention (see WorTier::write_chat in crates/talkbank-model/src/model/dependent_tier/wor.rs), chosen for human readability and because %wor exists to anchor timing, not to restate the main tier’s orthography. The generated %wor text and the TextGrid export both use this cleaned form.

Consequence you must know: surface markers carried on a word, such as prosodic lengthening (wabe:) and similar in-word notation, are not preserved in %wor output. A main-tier word wabe: becomes wabe on %wor. A %wor line containing such words therefore does not byte-roundtrip (parsing, serializing, and reparsing changes the surface text), and that is expected, not a bug. %wor is a cleaned, timing-only view; the main tier remains the faithful record of surface forms. Do not “fix” the %wor serializer to emit raw text without an explicit decision to change this convention.

%wor is not a flat “all tokens except punctuation” tier. It follows a word-level alignment rule:

  • Regular words count.
  • Fillers (&-um, &-uh, &-you_know) count; they are real spoken words with known phoneme sequences.
  • Fragments (&+...) do NOT count: incomplete phoneme sequences; the FA engine cannot reliably anchor partial phonological material.
  • Nonwords (&~...) do NOT count: interactional/gestural sounds without stable lexical phoneme content for alignment.
  • Untranscribed placeholders (xxx, yyy, www) do NOT count: they have no known phoneme sequence; CTC forced alignment cannot produce timings for unknown material.
  • Replacements keep the original spoken word slot for %wor; the replacement text matters for %mor, not %wor. If the original slot is untranscribed or a fragment/nonword, it is still excluded.
  • Retrace scope does not change %wor membership.
  • Overlap markers do not change %wor membership.

%wor is a timing-annotation tier. Its word count equals the number of Wor-domain words and may differ from a naive main-tier word count. There is no downstream positional indexing into %wor; the %wor count is not validated against the main-tier word count.

*CHI:	I want cookies .
%wor:	I want cookies .

Exact corpus-shaped contrast:

*CHI:	<one &+ss> [/] one play ground .
%wor:	one •321809_321969• play •322049_322310• ground •322390_322890• .
# &+ss is a fragment, excluded from %wor regardless of retrace context.

*EXP:	&+ih <the what> [/] what's letter &+th is this ?
%wor:	the •49103_49163• what •49183_50205• what's •50205_50405• letter •50405_50685• is •50946_51046• this •51086_51586• ?
# Fragments &+ih and &+th excluded; regular words remain.

*EXP:	what's is dis [: this] ?
%wor:	what's •37050_37471• is •37491_37631• dis •37631_38131• ?

*CHI:	xxx snack .
%wor:	snack •884668_885168• .
# xxx has no phoneme sequence, excluded from %wor; only snack appears.

*CHI:	&~um a boat .
%wor:	a •1073779_1073799• boat •1076861_1077361• .
# &~um is a nonword, excluded from %wor.

*CHI:	&-mm [<] bananas are good .
%wor:	mm •1949506_1949566• bananas •1949566_1949766• are •1949846_1949987• good •1950067_1950567• .
# &-mm is a filler, included in %wor (real spoken word with alignable phoneme sequence).
flowchart TD
    A["Main-tier word candidate"] --> B{"Timestamp token /\nomission / empty?"}
    B -->|Yes| OUT["Excluded from %wor"]
    B -->|No| C{"Untranscribed?\n(xxx/yyy/www)"}
    C -->|Yes| OUT
    C -->|No| D{"Fragment or nonword?\n(&+ or &~)"}
    D -->|Yes| OUT
    D -->|No| IN["Counts for %wor\n(word or filler &-)"]

    style IN fill:#afa,stroke:#333
    style OUT fill:#faa,stroke:#333
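The membership rule and the cleaned-text convention can be sketched together for tokens that have already been isolated from the main tier. `wor_text` is a hypothetical helper; it handles only the prefixes and the lengthening colon named in this section, and leaves groups, retraces, replacements, and omissions to a real parser.

```rust
/// Return the cleaned %wor text for one main-tier word token,
/// or None when the token does not count for %wor.
fn wor_text(token: &str) -> Option<String> {
    // Untranscribed placeholders have no phoneme sequence to align.
    if token.is_empty() || ["xxx", "yyy", "www"].contains(&token) {
        return None;
    }
    // Fragments (&+) and nonwords (&~) are excluded.
    if token.starts_with("&+") || token.starts_with("&~") {
        return None;
    }
    // Fillers (&-) count; the marker itself is not part of the spoken form.
    let spoken = token.strip_prefix("&-").unwrap_or(token);
    // Drop the in-word lengthening colon (wabe: becomes wabe).
    Some(spoken.replace(':', ""))
}
```

Mapping this over the tokens &-mm, bananas, are, good yields mm, bananas, are, good, which matches the last corpus example above.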

Phon Phonological Tiers

These tiers originate from the Phon project and provide syllable-annotated phonological transcription and segmental alignment. They were originally serialized as %x-prefixed user-defined tiers (%xmodsyl, %xphosyl, %xphoaln) and are being promoted to official CHAT tiers. Phon stores phonological data in its own XML format; the CHAT representation is generated by PhonTalk.

%modsyl / %phosyl: Syllabified Phonology

%modsyl is a syllabified version of %mod (target pronunciation); %phosyl is a syllabified version of %pho (actual pronunciation). Each phoneme is annotated with a syllable position code (N=nucleus, O=onset, C=coda, etc.). Words are space-separated and align 1-to-1 with the corresponding %mod or %pho tier.

*CHI:	the best .
%mod:	ðə bɛst .
%modsyl:	ð:Oə:N b:Oɛ:Ns:Ct:C .
%pho:	ðə bɛs .
%phosyl:	ð:Oə:N b:Oɛ:Ns:C .

Alignment: Content-based. Stripping position codes (:N, :O, :C, etc.) and stress markers (ˈ, ˌ) from %modsyl should yield the same phonemes as %mod; the same holds for %phosyl and %pho.
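A minimal sketch of that content-based comparison follows. It assumes a position code is a colon followed by ASCII uppercase letters, which covers the N, O, and C codes shown here but is a guess for the rest of the code inventory; both function names are hypothetical.

```rust
/// Remove syllable position codes (":N", ":O", ":C", ...) and stress
/// markers from one syllabified word, leaving the bare phoneme string.
fn strip_syllable_codes(word: &str) -> String {
    let mut out = String::new();
    let mut chars = word.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            // Primary and secondary stress markers are dropped.
            'ˈ' | 'ˌ' => {}
            // A colon introduces a position code: skip its letters too.
            ':' => {
                while chars.peek().map_or(false, |n| n.is_ascii_uppercase()) {
                    chars.next();
                }
            }
            _ => out.push(c),
        }
    }
    out
}

/// True when the syllabified tier reduces word-for-word to the plain tier.
fn syllabified_matches(syllabified: &str, plain: &str) -> bool {
    let stripped: Vec<String> = syllabified
        .split_whitespace()
        .map(strip_syllable_codes)
        .collect();
    let expected: Vec<&str> = plain.split_whitespace().collect();
    stripped == expected
}
```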

%phoaln: Phone Alignment

%phoaln provides segmental alignment between target and actual IPA, showing phoneme-by-phoneme correspondence. Each pair uses source↔target notation; ∅ marks insertions or deletions.

*CHI:	the best .
%phoaln:	ð↔ð,ə↔ə b↔b,ɛ↔ɛ,s↔s,t↔∅

Alignment: Positional, word-by-word: word N in %phoaln aligns with word N in both %mod and %pho.

Parsing strategy: Same as %pho/%mod. We parse just enough structure for alignment (word boundaries for %modsyl/%phosyl, alignment pairs for %phoaln). IPA phoneme content is treated as opaque strings.
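The pair structure can be sketched as below. The names are hypothetical, and the sketch treats each side of a pair as an opaque string, in line with the parsing strategy. In the example above, concatenating the left sides reproduces the %mod word and concatenating the right sides reproduces the %pho word.

```rust
/// One aligned pair; None stands for the ∅ placeholder.
type Pair = (Option<String>, Option<String>);

/// Parse one %phoaln word such as "b↔b,ɛ↔ɛ,s↔s,t↔∅" into its pairs.
fn parse_phoaln_word(word: &str) -> Option<Vec<Pair>> {
    word.split(',')
        .map(|pair| {
            let (left, right) = pair.split_once('↔')?;
            let side = |s: &str| if s == "∅" { None } else { Some(s.to_string()) };
            Some((side(left), side(right)))
        })
        .collect()
}

/// Rebuild the two phoneme strings by concatenating each side.
fn sides(pairs: &[Pair]) -> (String, String) {
    let left: String = pairs.iter().filter_map(|p| p.0.as_deref()).collect();
    let right: String = pairs.iter().filter_map(|p| p.1.as_deref()).collect();
    (left, right)
}
```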

Validation (E725-E728)

Because these are derived views, word counts must match between each syllabification tier and its parent IPA tier:

| Check | Error code |
|---|---|
| %modsyl word count ≠ %mod word count | E725 |
| %phosyl word count ≠ %pho word count | E726 |
| %phoaln word count ≠ %mod word count | E727 |
| %phoaln word count ≠ %pho word count | E728 |

These checks are gated on ParseHealth: if either tier in a pair has parse errors, the alignment check is suppressed to avoid false positives.
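The gating can be sketched as follows. `Health` is a two-state stand-in for ParseHealth (whose real shape is not shown here), `Tier` is a hypothetical holder for already-tokenized words, and the error code is passed in by the caller.

```rust
/// Stand-in for ParseHealth: did the tier parse without errors?
#[derive(Clone, Copy, PartialEq)]
enum Health {
    Clean,
    HasErrors,
}

/// A tier reduced to its word tokens plus its parse health.
struct Tier<'a> {
    words: &'a [&'a str],
    health: Health,
}

/// Report `code` when the word counts differ, but only if both tiers
/// parsed cleanly; otherwise stay silent to avoid false positives.
fn count_mismatch(derived: &Tier, parent: &Tier, code: &'static str) -> Option<&'static str> {
    if derived.health == Health::HasErrors || parent.health == Health::HasErrors {
        return None;
    }
    (derived.words.len() != parent.words.len()).then_some(code)
}
```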

Known PhonTalk Export Issue

The PhonTalk XML→CHAT converter writes %mod/%pho through a OneToOne alignment path that maps IPA words to orthography words and silently drops extras. The syllabification tiers (%modsyl, %phosyl, %phoaln) bypass this path and include all IPA words. In child phonology data where children produce more IPA words than orthographic targets (~4% of Phon corpus files), this creates tier-to-tier word count mismatches. The mismatches originate in the Phon XML source data (orthography↔IPA word count discrepancies) and are inconsistently handled during CHAT export. This is being investigated in collaboration with the Phon team.

Bullet-Content Tiers

These tiers contain free-form text with optional embedded timing markers (•START_END•) and picture references (•%pic:"file.jpg"•). They do not align word-by-word with the main tier.

| Tier | Purpose |
|---|---|
| %act | Physical actions, gestures, non-verbal behaviors |
| %cod | Research-specific coding (semantic roles, thematic coding, error classification) |
| %com | Comments, annotations, and contextual notes |
| %exp | Explanations or expansions of ambiguous/incomplete speech |
| %add | Addressee identification in multi-party conversations |
| %spa | Speech act coding (request, assertion, question, directive) |
| %sit | Situational context or setting description |
| %gpx | Extended gesture position coding |
| %int | Intonational contours and prosodic patterns |

%cod is bullet-content in the shared TalkBank AST. In the %cod coding convention, a word selector such as <w4> scopes the code that follows it (it names which main-tier word the code applies to) rather than being a code in its own right.

Example (without timing markers):

*CHI:	gimme that .
%act:	reaches toward shelf
%com:	child is pointing to picture
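When timing markers are present, they can be pulled out of free-form tier text with a simple scan. The sketch below assumes bullets are well-formed •START_END• pairs; `timing_bullets` is a hypothetical helper, and picture references or other non-numeric bullet bodies are skipped.

```rust
/// Extract (start_ms, end_ms) from every •START_END• marker in the text.
fn timing_bullets(text: &str) -> Vec<(u64, u64)> {
    text.split('•')
        // With balanced bullets, odd-numbered segments are bullet bodies.
        .skip(1)
        .step_by(2)
        .filter_map(|body| {
            let (start, end) = body.split_once('_')?;
            Some((start.parse::<u64>().ok()?, end.parse::<u64>().ok()?))
        })
        .collect()
}
```

The same bullet syntax appears inline on %wor lines, so the scan works there as well.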

Text Tiers

These tiers contain plain text with no bullets, timing, or structural alignment:

| Tier | Purpose |
|---|---|
| %alt | Alternative transcriptions |
| %coh | Cohesion annotation |
| %def | Definitions |
| %eng | English translations (for non-English transcripts) |
| %err | Error annotations |
| %fac | Facial expressions |
| %flo | Flow annotation |
| %gls | Glosses |
| %ort | Orthographic representations |
| %par | Paralinguistic information |
| %tim | Timing information |

User-Defined Tiers

Tiers prefixed with %x (e.g., %xcod, %xact) are user-defined dependent tiers. They are preserved during parsing and roundtrip but receive no structural validation beyond basic format checks. Any %x-prefixed tier is always accepted; this is the open extension point for project-specific annotation.

The Supported Set Is Closed

A dependent tier is valid in chatter only if it is one of the standard tiers documented above (the structured, Phon, bullet-content, and text tiers) or a %x-prefixed user-defined tier. Any other %-tier is invalid CHAT, and chatter rejects the file with error E605 (UnsupportedDependentTier). This is a closed set by design: chatter validate is the binding judgment on CHAT validity, so an unrecognized dependent tier is an error, not a warning.

Deliberate Divergence from CLAN: Retired Legacy Tiers

When TalkBank standardized morphology on a single Universal Dependencies %mor tier (plus %gra for relations), several legacy dependent tiers were retired. CLAN’s check still accepts three of them, so chatter is intentionally stricter on these. This is a deliberate, documented divergence:

| Retired tier | CLAN check | chatter |
|---|---|---|
| %trn | accepts | rejects (E605) |
| %tra | accepts | rejects (E605) |
| %grt | accepts | rejects (E605) |
| %umor | rejects | rejects (E605) |

The modern UD-%mor workflow has one morphology tier (%mor) plus %gra; the older training/translation/variant tiers are no longer part of the format chatter validates. %umor is rejected by both validators and is listed only for completeness. Note that %xtra (with the %x prefix) is a perfectly valid user-defined tier; only the bare %tra is retired.
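The closed-set rule can be sketched as a lookup. The tier list below is transcribed from the tables in this chapter; `check_tier` is a hypothetical function, and requiring at least one character after the x prefix is an assumption of the sketch, not a documented rule.

```rust
/// Standard dependent tiers as listed in this chapter (without the % sign).
const STANDARD_TIERS: &[&str] = &[
    // Structured linguistic tiers
    "mor", "gra", "pho", "mod", "sin", "wor",
    // Phon phonological tiers
    "modsyl", "phosyl", "phoaln",
    // Bullet-content tiers
    "act", "cod", "com", "exp", "add", "spa", "sit", "gpx", "int",
    // Text tiers
    "alt", "coh", "def", "eng", "err", "fac", "flo", "gls", "ort", "par", "tim",
];

/// Accept standard tiers and %x-prefixed user tiers; reject the rest.
fn check_tier(label: &str) -> Result<(), &'static str> {
    let name = label.strip_prefix('%').unwrap_or(label);
    // Assumption: a user-defined tier is "x" plus at least one character.
    let user_defined = name.len() > 1 && name.starts_with('x');
    if user_defined || STANDARD_TIERS.contains(&name) {
        Ok(())
    } else {
        Err("E605")
    }
}
```

Under this rule %xtra is accepted as a user-defined tier while the bare %tra is rejected, matching the note above.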

This is one instance of a general principle: where chatter intentionally departs from CLAN/CHECK behavior, the divergence is documented rather than left implicit. See CHECK Parity Audit.

The %mor Tier: Morphological Analysis

Status: Reference Last updated: 2026-05-11 20:35 EDT

The %mor (morphological) dependent tier provides word-by-word morphosyntactic annotation aligned with the main tier. Each main-tier word receives a morphological code specifying part of speech, lemma, and grammatical features.

Format Overview

*CHI:	I want cookies .
%mor:	pron|I-Prs-Nom-S1 verb|want-Fin-Ind-Pres-S1 noun|cookie-Plur .

Each %mor item has the structure POS|lemma[-Feature]*, where:

  • POS: part-of-speech category (noun, verb, pron, det, aux, etc.)
  • |: pipe separator (always present)
  • Lemma: base form of the word (cookie, be, I). May contain language-specific compound or derivational boundary markers (see Compound Lemma Boundaries below)
  • Features: zero or more morphological features, each preceded by - (-Plur, -Fin-Ind-Pres-S3)

Items are space-separated and terminate with a punctuation marker (., ?, !, etc.).

The UD MOR Format

TalkBank’s %mor tier uses a format inspired by Universal Dependencies (UD) but adapted to CHAT conventions. We call this the UD MOR format to distinguish it from the older CLAN-era MOR format.

The UD MOR format was introduced via batchalign’s Stanza-based morphosyntax pipeline. Stanza produces standard UD analysis (UPOS, lemma, morphological features, dependency relations), and the Rust mapping layer converts this to CHAT %mor and %gra tiers. The new format has been adopted for all new corpus annotation.

Structure: Flat POS|lemma[-Feature]*

Every morphological word is flat: a single POS tag, a single lemma, and a linear chain of features:

POS|lemma[-Feature1][-Feature2][-Feature3]...

There are no compounds, prefixes, subcategories, or nested structures in the UD MOR format. The entire morphological analysis of a word is captured by the POS+lemma+features triple.

Examples:

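| Word | %mor code | POS | Lemma | Features |
|---|---|---|---|---|
| dog | noun\|dog | noun | dog | (none) |
| dogs | noun\|dog-Plur | noun | dog | Plur |
| running | verb\|run-Part-Pres-S | verb | run | Part, Pres, S |
| is | aux\|be-Fin-Ind-Pres-S3 | aux | be | Fin, Ind, Pres, S3 |
| I | pron\|I-Prs-Nom-S1 | pron | I | Prs, Nom, S1 |
| the | det\|the-Def-Art | det | the | Def, Art |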

Multi-Word Tokens (Clitics)

English contractions and similar multi-word tokens (MWTs) are represented using the tilde (~) separator for post-clitics:

*CHI:	it's red .
%mor:	pron|it~aux|be-Fin-Ind-Pres-S3 adj|red .

Here it's is a single main-tier word that expands to two morphological words: pron|it (main) and aux|be-Fin-Ind-Pres-S3 (post-clitic). The ~ indicates the two MOR words are fused into one orthographic token.

Each clitic counts as its own chunk for %gra alignment; pron|it~aux|be-Fin-Ind-Pres-S3 produces 2 chunks, each needing its own grammatical relation.
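The flat structure and the clitic separator can be sketched as a small parser for one %mor item. `MorWord` and `parse_mor_item` are hypothetical and much simpler than the real model: the sketch assumes the lemma itself contains no hyphen and ignores the legacy markers the grammar still accepts.

```rust
/// One flat morphological word: POS, lemma, and a linear feature chain.
#[derive(Debug, PartialEq)]
struct MorWord {
    pos: String,
    lemma: String,
    features: Vec<String>,
}

/// Parse one space-delimited %mor item; `~` separates post-clitics.
fn parse_mor_item(item: &str) -> Option<Vec<MorWord>> {
    item.split('~')
        .map(|word| {
            // The pipe separator is always present.
            let (pos, rest) = word.split_once('|')?;
            let mut parts = rest.split('-');
            let lemma = parts.next()?;
            if pos.is_empty() || lemma.is_empty() {
                return None;
            }
            Some(MorWord {
                pos: pos.to_string(),
                lemma: lemma.to_string(),
                features: parts.map(str::to_string).collect(),
            })
        })
        .collect()
}
```

For the it's example, the result has two entries, which is the chunk count that %gra alignment expects for that item.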

Terminator

The %mor tier ends with a terminator that matches the main tier’s utterance terminator:

*CHI:	what is that ?
%mor:	pron|what aux|be-Fin-Ind-Pres-S3 det|that ?

The terminator (., ?, !, +..., etc.) counts as one chunk for %gra alignment.
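Putting the two chunk rules together (each clitic is its own chunk, and the terminator is one chunk), the expected number of %gra entries for a %mor tier can be estimated with a one-line count. `mor_chunk_count` is a hypothetical helper that works on the tier text after the %mor: label.

```rust
/// Count %gra-alignable chunks in a %mor tier body.
/// Each item contributes one chunk per `~`-separated word; the terminator
/// is a whitespace-separated item without `~`, so it contributes one.
fn mor_chunk_count(tier: &str) -> usize {
    tier.split_whitespace()
        .map(|item| item.split('~').count())
        .sum()
}
```

For pron|it~aux|be-Fin-Ind-Pres-S3 adj|red . the count is four: two chunks for the contraction, one for the adjective, and one for the terminator.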

How It Diverges from UD

The UD MOR format is UD-inspired but not UD-compliant. Several deliberate adaptations make it fit CHAT conventions while preserving most UD information. This section catalogs every divergence.

1. POS Tags Are Lowercased UPOS

UD uses uppercase UPOS tags (NOUN, VERB, PRON). CHAT uses lowercase (noun, verb, pron). This is a lossless, trivially reversible surface change.

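| UD UPOS | CHAT POS |
|---|---|
| NOUN | noun |
| VERB | verb |
| AUX | aux |
| PRON | pron |
| DET | det |
| ADJ | adj |
| ADV | adv |
| ADP | adp |
| PROPN | propn |
| INTJ | intj |
| CCONJ | cconj |
| SCONJ | sconj |
| NUM | num |
| PART | part |
| X | x |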

2. Feature Values Are Flat, Not Key=Value (Currently)

UD represents morphological features as key=value pairs: Number=Plur, Tense=Past, Person=3. The current CHAT convention drops the keys and uses only the values: -Plur, -Past, -S3.

This is the most significant divergence from UD, because:

  • Information loss: Plur could in principle be Number=Plur or Degree=Plur (though in practice the UD feature value set has no real ambiguities).
  • Collapsed person/number: UD Person=3|Number=Sing becomes -S3, a combined code that cannot be mechanically decomposed back to its UD components.
  • Feature ordering: Features appear in a conventional order determined by the generation pipeline, not in UD’s alphabetical order.

The data model now supports key=value features. The MorFeature type has an optional key field: when present, the feature serializes as Key=Value (e.g., -Number=Plur); when absent, it serializes as just the value (e.g., -Plur). This is forward-compatible: existing flat features parse and serialize identically, and if batchalign’s mapper begins emitting Key=Value features, they flow through the parser and model without any format changes.
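The optional-key behavior can be sketched with a stand-in type. The struct below borrows the MorFeature name for readability, but its fields, `parse`, and `to_chat` are assumptions made for this sketch; the real type in talkbank-model may differ.

```rust
/// Stand-in for the model's feature type: an optional key plus a value.
#[derive(Debug, PartialEq)]
struct MorFeature {
    key: Option<String>,
    value: String,
}

impl MorFeature {
    /// Parse the text that follows a `-` feature separator.
    fn parse(text: &str) -> Self {
        match text.split_once('=') {
            Some((key, value)) => MorFeature {
                key: Some(key.to_string()),
                value: value.to_string(),
            },
            None => MorFeature { key: None, value: text.to_string() },
        }
    }

    /// Serialize back to CHAT, including the leading `-`.
    fn to_chat(&self) -> String {
        match &self.key {
            Some(key) => format!("-{}={}", key, self.value),
            None => format!("-{}", self.value),
        }
    }
}
```

Both forms round-trip unchanged, and a value such as Int,Rel is kept as a single string because nothing in the sketch splits on commas.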

3. Multi-Value Features: Commas Preserved

UD encodes multi-value features with commas: PronType=Int,Rel (the word is both interrogative and relative). In CHAT %mor, the comma is preserved within the feature value:

-Int,Rel

This is treated as a single feature value "Int,Rel". The grammar accepts commas within feature values, and the model stores them as-is. No decomposition occurs; the model faithfully records the string that appears in the %mor tier.

Historical note: Earlier documentation described a “comma-stripping” convention where PronType=Int,Rel became -IntRel (concatenated without separator). The current grammar and parser preserve the comma. Existing corpus data using the concatenated form (-IntRel) also parses correctly; it’s simply treated as the flat value "IntRel".

4. Dependency Relations Are Uppercase with Dash Subtypes

The %gra tier (not %mor, but closely related) uses uppercase relation names with dashes for subtypes, where UD uses lowercase with colons:

| UD | CHAT %gra |
|---|---|
| nsubj | NSUBJ |
| acl:relcl | ACL-RELCL |
| obl:tmod | OBL-TMOD |

This is lossless; case and separator are trivially reversible.
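The reversibility claim can be illustrated with a pair of hypothetical conversion helpers. They assume UD relation labels contain no dashes of their own, which holds for the labels in the table above.

```rust
/// UD relation label to %gra form: uppercase, colon becomes dash.
fn ud_to_gra(relation: &str) -> String {
    relation.to_uppercase().replace(':', "-")
}

/// %gra relation label back to UD form: lowercase, dash becomes colon.
fn gra_to_ud(relation: &str) -> String {
    relation.to_lowercase().replace('-', ":")
}
```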

5. ROOT Head Convention

In UD, the root word has head=0. In %gra, two conventions coexist:

  • UD convention: head=0 (e.g., 3|0|ROOT), the standard we now emit
  • Legacy TalkBank convention: head=self (e.g., 3|3|ROOT), found in older corpus data

The parser and validator accept both forms. New output uses head=0.

6. No XPOS, No DEPREL Subtypes in %mor

UD provides both UPOS (universal POS) and XPOS (language-specific POS). CHAT %mor uses only UPOS-equivalent tags; there is no XPOS field. Language-specific POS distinctions are not represented.

Similarly, UD’s fine-grained dependency relation subtypes (e.g., nsubj:pass) appear in %gra as NSUBJ-PASS, but the %mor tier itself contains no dependency information.

7. No Morpheme Segmentation

Traditional CHAT MOR formats (CLAN-era) supported morpheme-level segmentation with compound markers (+), prefix markers (#), and suffix chains (-SUFFIX&type). The UD MOR format does not use any of these; each word is analyzed as a flat POS+lemma+features triple.

The grammar still accepts some of these legacy markers for backward compatibility with older corpus data, but the canonical UD MOR format does not produce them.

Compound Lemma Boundaries

Several UD treebanks use special characters inside lemmas to mark morphological boundaries. These are meaningful linguistic annotations preserved in the CHAT %mor lemma field when possible.

Known Markers Across Languages

| Language | Marker | Meaning | Example Lemma | In %mor |
|---|---|---|---|---|
| Estonian | = | Compound boundary | maja=uks (house-door) | noun\|maja=uks, preserved |
| Basque | ! | Derivational boundary | partxi!se (share + derivation) | noun\|partxi!se-Ine, preserved |
| Finnish | # | Compound boundary | jää#kaappi (ice-cabinet) | noun\|jää_kaappi, mangled (# → _) |

= and ! pass through the cleaning pipeline because they are not reserved CHAT %mor syntax characters. # is reserved in traditional CHAT MOR for prefix markers (e.g., v|#un#do), so the sanitizer replaces it with _.
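The marker handling described above amounts to a single substitution. The sketch below covers only this rule; `sanitize_lemma_markers` is a hypothetical name, and the real cleaning pipeline does more than this one step.

```rust
/// Replace the reserved `#` with `_`; leave `=` and `!` untouched.
fn sanitize_lemma_markers(lemma: &str) -> String {
    lemma.replace('#', "_")
}
```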

Gotcha: = ambiguity with legacy CLAN translation glosses. Legacy CLAN %mor tiers use = for translation glosses (e.g., n|perro=dog), a convention predating UD adoption. The parser treats = identically in both cases; it is preserved as part of the lemma string. This means legacy n|perro=dog parses successfully but the translation semantics are lost: the model stores perro=dog as a single lemma, indistinguishable from an Estonian compound like maja=uks. Since we cannot reliably disambiguate the two uses without language-specific context, legacy translation glosses are silently absorbed into the lemma. Files with legacy =translation syntax still parse and round-trip correctly, but the translation information is not semantically accessible. This affects corpora that predate our UD MOR adoption and lack Stanza coverage for their language.

Multi-Word Expression Lemmas (Stanza _ Convention)

Stanza uses underscores in lemmas to represent multi-word expressions across many languages: New_York, parce_que (French), pick_up (English), a_causa_di (Italian). The current cleaning pipeline strips underscores entirely (New_York → NewYork). This is a known, still-open data-quality limitation of the current mapper.

Multi-Value Features (Commas in Feature Values)

UD encodes multi-value features with commas: PronType=Int,Rel means a word is both interrogative and relative. These commas appear in the CHAT %mor feature suffix and are preserved as-is:

pron|wat-Int,Rel

This is sometimes mistaken for a compound lemma marker, but commas in UD always appear in the feature column (CoNLL-U column 6), never in the lemma column (CoNLL-U column 3). In CHAT %mor, they appear after the - feature separator, not inside the lemma. The grammar, both parsers, and the data model all accept commas in feature values. See Section 3: Multi-Value Features above.

Future Direction

The current handling of compound lemma boundaries is inconsistent across languages. A possible future improvement is a unified Unicode separator character that would normalize all compound/derivational boundary markers (=, !, #, and potentially _) into a single convention. This has not been implemented as of 2026-03-02 and requires a design decision on which character to use and whether to preserve the original markers in a structured field.

Data Model

The Rust data model in talkbank-model represents %mor tiers with these types:

MorTier

The top-level tier container:

pub struct MorTier {
    pub tier_type: MorTierType,    // MorTierType::Mor
    pub(crate) items: MorItems,    // Vec<Mor> wrapper; accessed via accessor methods
    pub terminator: Terminator,    // typed terminator (`.`, `?`, `!`, `+...`, etc.)
    pub span: Span,                // source location
}

Mor (Item)

One item aligned with one main-tier word:

pub struct Mor {
    pub main: MorWord,                        // required main word
    pub post_clitics: SmallVec<[MorWord; 2]>, // optional ~clitics
}

MorWord

A single morphological word (POS + lemma + features):

pub struct MorWord {
    pub pos: PosCategory,                    // e.g., "noun"
    pub lemma: MorStem,                      // e.g., "dog"
    pub features: SmallVec<[MorFeature; 4]>, // e.g., [Plur]
}

MorFeature

A morphological feature with optional key:

pub struct MorFeature {
    key: Option<Arc<str>>,  // e.g., Some("Number") or None
    value: Arc<str>,        // e.g., "Plur"
}

Construction examples:

// Flat feature (current convention)
MorFeature::new("Plur")         // key=None, value="Plur"
MorFeature::new("S3")           // key=None, value="S3"
MorFeature::new("Int,Rel")      // key=None, value="Int,Rel"

// Keyed feature (UD-standard, forward-compatible)
MorFeature::new("Number=Plur")  // key=Some("Number"), value="Plur"
MorFeature::new("Tense=Past")   // key=Some("Tense"), value="Past"

// Explicit constructors
MorFeature::flat("Plur")
MorFeature::with_key_value("Number", "Plur")

Lossless roundtrip guarantee: MorFeature::new auto-detects the = delimiter. Features without = are flat; features with = split into key+value. Serialization reproduces the original format exactly: flat features stay flat, and keyed features keep their key.
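The auto-detection and roundtrip behavior described above can be sketched with standard-library types only. This is an illustration under stated assumptions, not the talkbank-model source: `Feature` is a hypothetical stand-in for `MorFeature`, and the accessor names `key()` and `value()` are assumptions.

```rust
use std::fmt;
use std::sync::Arc;

/// Hypothetical stand-in for MorFeature: an optional key plus a value.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Feature {
    key: Option<Arc<str>>,
    value: Arc<str>,
}

impl Feature {
    /// Split on the first '='; text without '=' becomes a flat feature.
    pub fn new(text: &str) -> Self {
        match text.split_once('=') {
            Some((key, value)) => Feature {
                key: Some(Arc::from(key)),
                value: Arc::from(value),
            },
            None => Feature {
                key: None,
                value: Arc::from(text),
            },
        }
    }

    pub fn key(&self) -> Option<&str> {
        self.key.as_deref()
    }

    pub fn value(&self) -> &str {
        &self.value
    }
}

impl fmt::Display for Feature {
    /// Serialization mirrors the input: keyed stays keyed, flat stays flat.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match &self.key {
            Some(key) => write!(f, "{}={}", key, self.value),
            None => write!(f, "{}", self.value),
        }
    }
}
```

Because the comma is not a delimiter at this layer, a multi-value feature such as Int,Rel stays one opaque value and serializes unchanged.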

PosCategory and MorStem

Both are interned Arc<str> newtypes for memory efficiency:

pub struct PosCategory(pub Arc<str>);  // interned via pos_interner()
pub struct MorStem(pub Arc<str>);      // interned via stem_interner()

Common values (noun, verb, the, a, be, etc.) are pre-populated in the interner. Cloning is O(1): a single atomic reference-count increment.
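To make the sharing concrete, here is a minimal interner sketch. It is an assumption-laden illustration: the real `pos_interner()` and `stem_interner()` are not shown in this chapter, so the `Interner` type and its `intern` method below are hypothetical, and this version is a plain local value rather than a shared global.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Hypothetical string interner: one Arc<str> allocation per distinct string.
#[derive(Default)]
pub struct Interner {
    pool: HashSet<Arc<str>>,
}

impl Interner {
    /// Return the shared Arc for `text`, allocating only on first sight.
    pub fn intern(&mut self, text: &str) -> Arc<str> {
        if let Some(existing) = self.pool.get(text) {
            // Cloning an Arc bumps a reference count; no string copy happens.
            return Arc::clone(existing);
        }
        let fresh: Arc<str> = Arc::from(text);
        self.pool.insert(Arc::clone(&fresh));
        fresh
    }
}
```

Two calls with the same text return handles to the same allocation, which is what lets repeated POS tags and stems across a file share memory.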

Memory Layout

The model uses SmallVec for inline storage of common cases:

  • Mor.post_clitics: SmallVec<[MorWord; 2]>: most words have 0-1 clitics
  • MorWord.features: SmallVec<[MorFeature; 4]>: most words have 0-4 features
  • MorFeature key and value are Arc<str>, interned for deduplication

For a typical 30-word utterance with %mor, the model allocates approximately 30 Mor items, each with 1 MorWord and 0-4 MorFeature values. The interning system ensures that repeated POS tags, stems, and feature values share a single allocation across the entire file.

Grammar

The tree-sitter grammar for %mor is defined in grammar.js. The relevant rules:

mor_content → mor_word (mor_post_clitic)*
mor_post_clitic → tilde mor_word
mor_word → mor_pos pipe mor_lemma (mor_feature)*
mor_feature → hyphen mor_feature_value
mor_feature_value → /[^\.\?\|\+~\-\s\r\n]+/

Key design decisions:

  • mor_feature_value accepts = and !: The regex [^\.\?\|\+~\-\s\r\n]+ matches any characters except the MOR structural delimiters. This means Number=Plur parses as a single mor_feature_value node. The split on = happens in the model layer, not the grammar, following the “parse, don’t validate” principle.
  • mor_feature_value accepts ,: Multi-value features like Int,Rel parse as a single node.
  • No compound/prefix rules: The grammar has no rules for + (compounds) or # (prefixes) in the UD MOR format. These are legacy CHAT MOR features not used in UD-style output.
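The rules above can be mirrored in a small hand-written parser for a single mor_word. This is a sketch for illustration only, not the tree-sitter parser: `ParsedMorWord`, `is_feature_char`, and `parse_mor_word` are hypothetical names, and the sketch assumes the lemma contains no hyphen.

```rust
/// Hypothetical borrowed view of one `pos|lemma-feature-feature` word.
#[derive(Debug, PartialEq, Eq)]
pub struct ParsedMorWord<'a> {
    pub pos: &'a str,
    pub lemma: &'a str,
    pub features: Vec<&'a str>,
}

/// Mirrors the mor_feature_value character class: anything except the
/// structural delimiters `. ? | + ~ -` and whitespace.
fn is_feature_char(c: char) -> bool {
    !matches!(c, '.' | '?' | '|' | '+' | '~' | '-') && !c.is_whitespace()
}

/// Parse `pos|lemma(-feature)*`. Assumes the lemma itself has no hyphen.
pub fn parse_mor_word(text: &str) -> Option<ParsedMorWord<'_>> {
    let (pos, rest) = text.split_once('|')?;
    let mut pieces = rest.split('-');
    let lemma = pieces.next()?;
    if pos.is_empty() || lemma.is_empty() {
        return None;
    }
    let mut features = Vec::new();
    for piece in pieces {
        // `=` and `,` are ordinary feature characters at this layer.
        if piece.is_empty() || !piece.chars().all(is_feature_char) {
            return None;
        }
        features.push(piece);
    }
    Some(ParsedMorWord { pos, lemma, features })
}
```

Note that `Number=Plur` and `Int,Rel` each come back as a single feature string; any split on `=` is left to the model layer, as described above.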

Parser

The tree-sitter parser produces MorTier from CHAT text. It is GLR-based and error-recovering, producing a CST that the Rust talkbank-parser crate walks to construct MorTier. The same parser is used by the CLI, the LSP, and batchalign. High-frequency values (PosCategory, MorStem) are interned via Arc<str> during construction.

The corpus/reference/ set is the correctness gate for %mor parsing: every file must parse and round-trip cleanly. The file count grows as new constructs are added; run find corpus/reference -name '*.cha' | wc -l to get the live total.

Validation

The %mor tier undergoes several validation checks:

Content Validation (E711)

Every MorWord is checked for:

  • Empty POS: |lemma with no POS before the pipe
  • Empty lemma: pos| with no lemma after the pipe
  • Empty feature: bare - separator with no feature text
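The three content checks are simple emptiness tests on the parts of a word. A minimal sketch, assuming the word has already been split into its parts; `ContentProblem` and `check_mor_word` are hypothetical names, not the validator's API:

```rust
/// Hypothetical problem kinds, one per E711 content check listed above.
#[derive(Debug, PartialEq, Eq)]
pub enum ContentProblem {
    EmptyPos,
    EmptyLemma,
    EmptyFeature,
}

/// Report every empty part of one already-split %mor word.
pub fn check_mor_word(pos: &str, lemma: &str, features: &[&str]) -> Vec<ContentProblem> {
    let mut problems = Vec::new();
    if pos.is_empty() {
        problems.push(ContentProblem::EmptyPos);
    }
    if lemma.is_empty() {
        problems.push(ContentProblem::EmptyLemma);
    }
    // A bare `-` separator leaves an empty feature string behind.
    if features.iter().any(|feature| feature.is_empty()) {
        problems.push(ContentProblem::EmptyFeature);
    }
    problems
}
```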

Main-tier Alignment (E705 / E706)

The %mor tier must align 1-to-1 with the main tier’s alignable words (excluding pauses, events, and other non-word content). The number of Mor items must equal the number of alignable main-tier words. The validator emits E705 MorCountMismatchTooFew when %mor has fewer items than the main tier and E706 MorCountMismatchTooMany when it has more. Terminator-mismatch errors are emitted separately as E707 (presence) and E716 (value).

GRA Alignment (E720)

When both %mor and %gra tiers are present, the number of %gra relations must equal the number of %mor chunks (including clitics and the terminator). A mismatch emits E720 MorGraCountMismatch. This is computed via MorTier::count_chunks().
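The chunk count can be illustrated on raw tier text. The sketch below is not `MorTier::count_chunks()` itself, which works on the typed model; `count_mor_chunks` and `check_gra_count` are hypothetical helpers that assume whitespace-separated items and `~`-joined post-clitics.

```rust
/// Count %mor chunks in tier content: every whitespace-separated item is one
/// chunk, each `~` post-clitic adds one more, and the terminator counts too.
pub fn count_mor_chunks(mor_content: &str) -> usize {
    mor_content
        .split_whitespace()
        .map(|item| 1 + item.matches('~').count())
        .sum()
}

/// Compare the chunk count against the number of %gra relations.
pub fn check_gra_count(mor_content: &str, gra_content: &str) -> Result<(), String> {
    let chunks = count_mor_chunks(mor_content);
    let relations = gra_content.split_whitespace().count();
    if chunks == relations {
        Ok(())
    } else {
        Err(format!(
            "E720: {chunks} %mor chunks but {relations} %gra relations"
        ))
    }
}
```

For example, a tier with three items where one item carries a post-clitic has four chunks, so it needs four %gra relations.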

(%gra’s own internal validators, E708 malformed relation, E709 invalid index, E712 word-index out of range, E713 head-index out of range, E721 non-sequential index, E722 no ROOT, E723 multiple ROOTs, E724 circular dependency, are documented in Dependent Tiers § %gra.)

JSON Serialization

The MorTier serializes to JSON using serde. MorFeature serializes as a plain string ("Plur" or "Number=Plur"), so the JSON schema is simply "type": "string". Example:

{
  "tier_type": "Mor",
  "items": [
    {
      "main": {
        "pos": "pron",
        "lemma": "I",
        "features": ["Prs", "Nom", "S1"]
      }
    },
    {
      "main": {
        "pos": "verb",
        "lemma": "want",
        "features": ["Fin", "Ind", "Pres", "S1"]
      }
    },
    {
      "main": {
        "pos": "noun",
        "lemma": "cookie",
        "features": ["Plur"]
      }
    }
  ],
  "terminator": "."
}

When key=value features are present, they serialize with the key included:

"features": ["Number=Plur", "Tense=Past"]

The JSON schema for MorFeature is "type": "string" regardless of whether keys are present.

Migration from Traditional CHAT MOR

What Changed

The traditional CHAT MOR format (CLAN-era) used a complex, hierarchically structured notation:

%mor:	pro:sub|I v|want n|cookie-PL .

Key differences from the UD MOR format:

Aspect                  Traditional CHAT MOR                          UD MOR
POS tags                CLAN categories (pro:sub, v, n, adj, adv)     Lowercased UPOS (pron, verb, noun, adj, adv)
POS subtypes            Colon-separated (pro:sub, det:art, v:aux)     Flat (subtypes dropped or encoded differently)
Features                CLAN suffix system (-PL, -PAST, -3S, -PRES)   UD feature values (-Plur, -Past, -S3, -Pres)
Compounds               + separator (n+n|black+n|bird)                Not used
Prefixes                # separator (v#un#do)                         Not used
Morpheme segmentation   Full segmentation (v|eat&PAST)                Not used (features are abstract, not morphemic)
Translations            = separator (n|perro=dog)                     Not present in base format (separate mechanism)

What the Model Removed

The UD MOR redesign (2026) removed the following types from the data model:

  • MorSuffix: suffix with type discriminant (fusional, derivational, etc.)
  • MorCompound: compound word with + separator
  • MorPrefix: prefix with # separator
  • MorSubcategory: POS subcategory after colon
  • AnnotatedChunk: chunk with optional translation
  • Chunk: enum of word/compound/terminator

These were replaced by the flat MorWord { pos, lemma, features } structure. The model went from ~12 types to 4 (MorTier, Mor, MorWord, MorFeature).

Backward Compatibility

The grammar still accepts many traditional CHAT MOR constructs (colons in POS tags, etc.) because the reference corpus contains files in both formats. The parser produces the same flat MorWord regardless; legacy constructs are mapped to the simplified structure during parsing.

What Stays the Same

Despite the format changes, fundamental CHAT conventions remain:

  • Pipe (|) separates POS from lemma
  • Hyphen (-) introduces features
  • Tilde (~) marks post-clitics
  • Space separates items
  • Terminator ends the tier
  • 1-to-1 alignment with main tier words

Toward Full UD Compatibility

The current format is UD-inspired but not UD-compliant. Here is a roadmap of what would be needed for full lossless UD round-tripping:

Already Supported

  • POS tags (UPOS equivalents)
  • Lemmas
  • Feature values (flat and key=value)
  • MWT expansions (clitics)
  • Dependency relations (via %gra)

Gaps Remaining

  1. Feature keys: The model supports Key=Value features, but batchalign’s mapper currently emits flat values only. When the mapper switches to emitting Number=Plur instead of just Plur, the parser, model, and serializer handle it automatically with no code changes.

  2. Person+Number composites: UD has separate Person=3 and Number=Sing features. CHAT combines them into -S3 (3rd person singular). Decomposing S3 back to Person=3|Number=Sing would require a lookup table or a convention change.

  3. Multi-value feature delimiter: UD uses commas (PronType=Int,Rel). CHAT preserves these commas in the feature value, but the semantic structure (two separate values) is not explicitly modeled. The model treats Int,Rel as an opaque string.

  4. XPOS: UD provides language-specific POS tags (XPOS) alongside universal tags (UPOS). CHAT %mor has no XPOS field. This information is simply not represented.

  5. Morpheme-level analysis: UD’s MISC field can encode morpheme boundaries and glosses. CHAT’s UD MOR format does not attempt morpheme segmentation; features are abstract grammatical categories, not morphemic decompositions.
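For gap 2, the lookup could be as small as the sketch below. It is purely hypothetical: no such function exists in the codebase as described here, and it assumes the composite is always one number letter (S or P) followed by one person digit (1 to 3). Only S1 and S3 appear in this chapter, so the P forms are an assumption.

```rust
/// Hypothetical decomposition of a CHAT person+number composite such as `S3`
/// into UD (Person, Number) values.
pub fn decompose_person_number(composite: &str) -> Option<(&'static str, &'static str)> {
    let mut chars = composite.chars();
    let number = match chars.next()? {
        'S' => "Sing",
        'P' => "Plur", // assumed by analogy with S; not shown in this chapter
        _ => return None,
    };
    let person = match chars.next()? {
        '1' => "1",
        '2' => "2",
        '3' => "3",
        _ => return None,
    };
    // Reject trailing characters so values like `Sing` never match.
    if chars.next().is_some() {
        return None;
    }
    Some((person, number))
}

/// Render the decomposed pair in the `Key=Value|Key=Value` form used above.
pub fn to_ud_features(composite: &str) -> Option<String> {
    let (person, number) = decompose_person_number(composite)?;
    Some(format!("Person={person}|Number={number}"))
}
```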

The Path Forward

The model is designed so that moving toward UD compliance requires no breaking changes:

  • MorFeature already supports Key=Value; the mapper only needs to start emitting keys
  • PosCategory is an opaque string; XPOS could be held in a separate field if needed
  • The JSON schema uses "type": "string" for features, so adding keys doesn’t break consumers
  • The grammar already accepts = in feature values, so no grammar changes are needed

The migration can happen incrementally: the mapper starts emitting key=value features, existing flat data continues to parse identically, and corpus files can be upgraded at their own pace.

Phon Tiers (%xmodsyl, %xphosyl, %xphoaln, %xphoint)

Status: Reference Last updated: 2026-06-23 07:28 EDT

The Phon extension tiers provide syllable-level phonological annotation, segmental alignment between target and actual IPA, and per-phone time intervals. They are produced by the Phon application and exported to CHAT via PhonTalk.

chatter parses and validates all four tiers as first-class CHAT tiers.

The x prefix. Phon emits these tiers with a leading x (%xmodsyl, %xphosyl, %xphoaln, %xphoint) to mark them as extension tiers. The grammar accepts both the x-prefixed names and the historical non-x names (%modsyl, %phosyl, %phoaln, %phoint); the parser and validator key off the tier kind, not the literal prefix. The canonical serialized form is the x-prefixed name.

The four tiers

Tier       Source      Carries                                             Word separator
%xmodsyl   %mod        Syllabification of the model/target transcription   space
%xphosyl   %pho        Syllabification of the actual transcription         space
%xphoaln   %mod+%pho   Phone-by-phone alignment of model ↔ actual          space
%xphoint   %pho        Per-phone time intervals (0x15 time bullets)        / (space-slash-space)

%xmodsyl, %xphosyl, and %xphoaln are word-aligned to their source tier(s) with single ASCII spaces. %xphoint uses / (space-slash-space) as its word separator because single spaces already separate the phone and bullet tokens inside each word.

Tier formats

%xmodsyl / %xphosyl: syllabification

A word is one or more phone:CODE units concatenated with no internal whitespace; words are separated by single spaces. The phone is one IPA phone (IPA length is written with the modifier letter ː, U+02D0, never an ASCII colon, so the : separator is unambiguous). A leading stress marker (ˈ primary, ˌ secondary) is part of the phone it precedes.

The constituent code is one character. The legal codes are O N C L R E A D U:

Code   Constituent                             Notes
O      Onset
N      Nucleus                                 monophthong nucleus
C      Coda
L      Left appendix                           e.g. /s/ in an /s/-stop cluster
R      Right appendix                          e.g. final /z/ in a complex coda
E      OEHS (onset of empty-headed syllable)   e.g. the stop element of an affricate
A      Ambisyllabic
D      Diphthong                               a nucleus member of a diphthong/triphthong; treated as a nucleus
U      Unknown                                 Phon could not assign a concrete constituent; common on %xphosyl when the model %xmodsyl is fully syllabified

The remaining Phon SyllableConstituentType mnemonics, B (boundary), S (stress), W (word boundary), T (tone), are not emitted on these tiers: boundary, stress, and tone need no per-phone marker.

*CHI:	I want three .
%mod:	aɪ wɑnt θɹi
%xmodsyl:	a:Dɪ:D w:Oɑ:Nn:Ct:C θ:Oɹ:Oi:N
%pho:	aɪ wɑn fwi
%xphosyl:	a:Dɪ:D w:Oɑ:Nn:C f:Ow:Oi:N
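The phone:CODE format can be tokenized in a single pass, which also yields the input for the strip-and-compare reconstruction check. The sketch below is illustrative only: `parse_syl_word` and `strip_codes` are hypothetical names, errors are plain strings tagged with the relevant code, and the real validator's types (such as PositionCode) are not reproduced.

```rust
/// The nine legal constituent codes.
const LEGAL_CODES: &str = "ONCLREADU";

/// Tokenize one syllabified word such as `w:Oɑ:Nn:Ct:C` into (phone, code)
/// units. Error strings carry the code a validator would report.
pub fn parse_syl_word(word: &str) -> Result<Vec<(String, char)>, String> {
    let mut units = Vec::new();
    let mut phone = String::new();
    let mut chars = word.chars();
    while let Some(c) = chars.next() {
        if c != ':' {
            // Stress marks and IPA diacritics accumulate into the phone.
            phone.push(c);
            continue;
        }
        let code = chars
            .next()
            .ok_or_else(|| format!("E735: missing code after ':' in {word:?}"))?;
        if phone.is_empty() {
            return Err(format!("E735: empty phone in {word:?}"));
        }
        if !LEGAL_CODES.contains(code) {
            return Err(format!("E736: illegal code {code:?} in {word:?}"));
        }
        units.push((std::mem::take(&mut phone), code));
    }
    if !phone.is_empty() || units.is_empty() {
        return Err(format!("E735: phone without code in {word:?}"));
    }
    Ok(units)
}

/// Drop the codes and concatenate the phones, for comparison with the
/// corresponding %mod or %pho word.
pub fn strip_codes(units: &[(String, char)]) -> String {
    units.iter().map(|(phone, _)| phone.as_str()).collect()
}
```

Applied to the example above, `w:Oɑ:Nn:Ct:C` yields four units and strips back to `wɑnt`, matching the %mod word.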

%xphoaln: phone alignment

A word is one or more comma-separated pairs; a pair is model↔actual (↔ is U+2194). Either side may be ∅ (U+2205, empty set): ∅ on the left is an epenthesis (a phone produced but not targeted); ∅ on the right is a deletion. Both sides are never ∅ at once.

*CHI:	the best .
%mod:	ðə bɛst
%pho:	ðə bɛs
%xphoaln:	ð↔ð,ə↔ə b↔b,ɛ↔ɛ,s↔s,t↔∅

The alignment lists segments (phones). Suprasegmental stress (ˈ/ˌ) that may appear on the %mod/%pho word is therefore not part of the alignment pairs; the reconstruction checks below compare modulo those stress markers.
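A sketch of the pair parsing and the two reconstruction helpers follows. It borrows the `AlignmentPair { source, target }` field names mentioned in this chapter, but the struct and functions below are hypothetical re-creations for illustration, not the talkbank-model definitions.

```rust
/// Hypothetical mirror of one model↔actual pair; None stands for ∅.
#[derive(Debug, PartialEq, Eq)]
pub struct AlignmentPair {
    pub source: Option<String>,
    pub target: Option<String>,
}

fn parse_side(side: &str) -> Result<Option<String>, String> {
    match side {
        "" => Err("E739: empty side".to_string()),
        "∅" => Ok(None),
        phone => Ok(Some(phone.to_string())),
    }
}

/// Parse one %xphoaln word such as `b↔b,ɛ↔ɛ,s↔s,t↔∅`.
pub fn parse_alignment_word(word: &str) -> Result<Vec<AlignmentPair>, String> {
    let mut pairs = Vec::new();
    for pair in word.split(',') {
        let mut sides = pair.split('↔');
        // Exactly two sides means exactly one ↔ in the pair.
        let (Some(model), Some(actual), None) = (sides.next(), sides.next(), sides.next()) else {
            return Err(format!("E739: expected exactly one ↔ in {pair:?}"));
        };
        let source = parse_side(model)?;
        let target = parse_side(actual)?;
        if source.is_none() && target.is_none() {
            return Err(format!("E739: both sides empty in {pair:?}"));
        }
        pairs.push(AlignmentPair { source, target });
    }
    Ok(pairs)
}

/// Concatenate the model sides, skipping ∅.
pub fn model_word(pairs: &[AlignmentPair]) -> String {
    pairs.iter().filter_map(|p| p.source.as_deref()).collect()
}

/// Concatenate the actual sides, skipping ∅.
pub fn actual_word(pairs: &[AlignmentPair]) -> String {
    pairs.iter().filter_map(|p| p.target.as_deref()).collect()
}

/// Remove stress marks so a %mod/%pho word can be compared with the pairs.
pub fn strip_stress(word: &str) -> String {
    word.chars().filter(|c| !matches!(*c, 'ˈ' | 'ˌ')).collect()
}
```

With the example above, the second word reconstructs to `bɛst` on the model side and `bɛs` on the actual side, which is what the reconstruction checks compare against %mod and %pho.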

%xphoint: per-phone intervals

%xphoint gives the time segmentation of each individual phone on %pho, effectively phone-level bullets analogous to the word-level timing on %wor. Groups (one per %pho word) are separated by /. Within a group, each phone is followed by a CLAN time-alignment bullet: the byte 0x15 (NAK), the interval start_end, then 0x15.

*CHI:	I want . •0_500•
%pho:	aɪ wɑnt
%xphoint:	aɪ •0_250• / w •250_320• ɑ •320_400• n •400_460• t •460_500•

(Bullets are shown as • above; in the file they are the 0x15 byte.)
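The group and bullet structure can be parsed with plain string splitting. The following is a sketch under stated assumptions, not the XphointTier implementation: `PhoneInterval` here is a hypothetical struct, `parse_xphoint` and `check_intervals` are illustrative names, and only the two simplest timing rules (E742 and E743, described in the Validation section) are checked.

```rust
/// The CLAN bullet delimiter byte (NAK).
const BULLET: char = '\u{15}';

/// Hypothetical per-phone interval; times are the raw bullet integers.
#[derive(Debug, PartialEq, Eq)]
pub struct PhoneInterval {
    pub phone: String,
    pub start: u64,
    pub end: u64,
}

/// Parse one `0x15 start_end 0x15` bullet token.
fn parse_bullet(token: &str) -> Option<(u64, u64)> {
    let inner = token.strip_prefix(BULLET)?.strip_suffix(BULLET)?;
    let (start, end) = inner.split_once('_')?;
    let start: u64 = start.parse().ok()?;
    let end: u64 = end.parse().ok()?;
    Some((start, end))
}

/// Split the tier into ` / `-separated groups of (phone, bullet) pairs.
pub fn parse_xphoint(tier: &str) -> Result<Vec<Vec<PhoneInterval>>, String> {
    let mut groups = Vec::new();
    for group in tier.split(" / ") {
        let tokens: Vec<&str> = group.split_whitespace().collect();
        if tokens.is_empty() || tokens.len() % 2 != 0 {
            return Err(format!("malformed group {group:?}"));
        }
        let mut intervals = Vec::new();
        for pair in tokens.chunks(2) {
            let (start, end) = parse_bullet(pair[1])
                .ok_or_else(|| format!("malformed bullet after {:?}", pair[0]))?;
            intervals.push(PhoneInterval { phone: pair[0].to_string(), start, end });
        }
        groups.push(intervals);
    }
    Ok(groups)
}

/// Report E742 (start >= end) and E743 (start times decrease) violations.
pub fn check_intervals(groups: &[Vec<PhoneInterval>]) -> Vec<&'static str> {
    let mut codes = Vec::new();
    let mut previous_start = 0;
    for interval in groups.iter().flatten() {
        if interval.start >= interval.end {
            codes.push("E742");
        }
        if interval.start < previous_start {
            codes.push("E743");
        }
        previous_start = interval.start;
    }
    codes
}
```

The number of parsed groups is also what a word-count comparison against %pho would use.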

Validation

These checks run by default. Pass --suppress xphon to silence the entire Phon %x validation surface, or suppress an individual code. (The historical --check-xphon opt-in flag is now a deprecated no-op: the checks it used to gate are on by default.)

Word-count cross-checks (each %x tier has the same number of words as the tier(s) it depends on):

  • %xmodsyl ↔ %mod: E725
  • %xphosyl ↔ %pho: E726
  • %xphoaln ↔ %mod: E727; ↔ %pho: E728

Content checks:

Code   Tier              Rule
E735   xmodsyl/xphosyl   a unit is not a well-formed phone:CODE (no :, empty phone, or empty code)
E736   xmodsyl/xphosyl   a constituent code is not one of O N C L R E A D U
E737   xmodsyl           stripping codes and concatenating phones does not reproduce the %mod word
E738   xphosyl           stripping codes and concatenating phones does not reproduce the %pho word
E739   xphoaln           a pair is malformed (not exactly one ↔, an empty side, or ∅↔∅)
E740   xphoaln           concatenating the model sides (skipping ∅, modulo stress) does not reproduce the %mod word
E741   xphoaln           concatenating the actual sides (skipping ∅, modulo stress) does not reproduce the %pho word
E742   xphoint           a bullet has start >= end
E743   xphoint           interval start times are not non-decreasing across the tier
E744   xphoint           the first start / last end falls outside the record’s media bullet (1 ms tolerance)
E745   xphoint           a group’s phones do not reproduce the %pho word
E746   xphoint           the number of groups does not equal the %pho word count

See Alignment Architecture for the word-count implementation.

Parsing strategy

  • %xmodsyl / %xphosyl: stored as flat word strings (talkbank-model::dependent_tier::phon::SylTier), consistent with how %pho and %mod store flat phone words. The validator tokenizes each word into typed phone:CODE units (PositionCode) to apply the content rules above; the IPA characters themselves stay verbatim for exact round-trip.
  • %xphoaln: each word is parsed into a Vec<AlignmentPair>, where AlignmentPair { source, target } carries one model↔actual mapping (None is ∅).
  • %xphoint: parsed into typed groups of (phone, bullet) pairs (XphointTier / XphointGroup / PhoneInterval), reusing the same 0x15 bullet machinery as %wor.

Deep phonological analysis is Phon’s domain; chatter parses the structure that validation needs and keeps the IPA content verbatim.

Phon XML source format

In Phon’s native XML format, phonological data is stored as structured elements:

<ipaTarget>
  <pho>
    <pw>
      <ph scType="onset"><base>θ</base></ph>
      <ph scType="onset"><base>ɹ</base></ph>
      <ph scType="nucleus"><base>i</base></ph>
    </pw>
  </pho>
</ipaTarget>

Each <pw> (phonological word) element contains <ph> elements with syllable constituent types (scType). The <alignment> element provides phone-level mappings between target and actual using index-based <pm> (phone map) entries.

Data quality notes

A small percentage of Phon corpus XML records have an orthography↔IPA word-count mismatch: the number of <pw> elements in <ipaTarget> / <ipaActual> differs from the number of <w> elements in <orthography>. This is expected in child phonology data: children may produce extra syllables, partial words, or over-productions relative to the target.

For current counts on a local CHILDES/TalkBank data tree, run:

python3 scripts/analysis/scan_phon_mismatches.py /path/to/data

The PhonTalk CHAT export handles this discrepancy inconsistently:

  1. %mod/%pho are written through a OneToOne alignment path that maps IPA words to orthography words; extras are silently dropped.
  2. %xmodsyl/%xphosyl/%xphoaln are written directly from the raw IPATranscript; all IPA words are included.

This produces CHAT files where %xmodsyl may have more words than %mod, triggering the E725-E728 word-count errors. This is being investigated in collaboration with the Phon team.

Word Syntax

Status: Reference Last updated: 2026-05-11 23:33 EDT

Words are the primary content unit on the main tier. CHAT defines several word types and annotation mechanisms.

Standalone Words

Most words are simple tokens separated by whitespace:

*CHI:	I want a cookie .

Words can contain Unicode characters for any language:

*CHI:	ich möchte Kekse .

Compounds

Compound words join multiple elements with +:

*CHI:	I want ice+cream .

Special Word Forms

Shortened Forms

Parentheses mark omitted portions of a word:

*CHI:	(be)cause I want it .

The full form is because; the child produced cause.

Replacements

Square brackets with colon mark what the speaker actually meant:

*CHI:	I goed [: went] to the store .

The speaker said “goed” but the intended word was “went”.

Language Markers

The @s: suffix marks a word’s language in multilingual transcripts:

*CHI:	I want a Keks@s:deu .

Other @ markers:

  • @l: letter
  • @c: child-invented form
  • @f: family-specific word
  • @n: neologism
  • @o: onomatopoeia
  • @b: babbling
  • @wp: word play
  • @si: signed word

Annotations

Words and groups can carry post-positioned annotations in square brackets:

Error Marking

*CHI:	he goed [*] to school .

[*] marks an error. More specific error codes can follow: [* m:+ed].

Explanations

*CHI:	that one [= the red ball] .

[= text] provides an explanation or gloss.

Replacements

*CHI:	I wanna [: want to] go .

[: text] marks the target/intended form.

Best Guess

*CHI:	I want the birfer [?] .

[?] marks uncertain transcription.

Events and Actions

Paralinguistic Events

Events marked with &= describe non-speech sounds:

*CHI:	&=laughs I want cookie .
*CHI:	&=coughs .

Fillers

Fillers are marked with &-:

*CHI:	&-um I want &-uh cookie .

Interposed Speech (Other Speaker)

Brief background speech from a different speaker is marked with the &*SPK:text prefix. It captures the interjection without creating a full turn line:

*CHI:	I want &*MOT:careful a cookie .

This says CHI was speaking and MOT briefly said “careful” mid-turn. If the intervention is substantial enough to constitute its own turn, transcribe it as a separate *MOT: utterance instead. Model: crates/talkbank-model/src/model/content/other_spoken.rs.

(Note: [^ text] is a freecode, a standalone free-form researcher annotation that sits as its own content item on the main tier (variant of UtteranceContent::Freecode, sibling of Word and Group; it is NOT attached to any word). See grammar/grammar.js rule freecode and crates/talkbank-model/src/model/content/utterance_content/. Used for transcriber notes that are independent of any single word; for notes about a single word use [% text] or [= text] instead.)

Pauses

*CHI:	I (.) want (..) a (...) cookie .
*CHI:	I (1.5) want a cookie .
  • (.): short pause
  • (..): medium pause
  • (...): long pause
  • (N.N): timed pause in seconds

Overlap

Overlapping speech between speakers uses angle brackets and overlap markers:

*MOT:	do you want <a cookie> [>] ?
*CHI:	<cookie> [<] !
  • [>]: overlap follows (this speaker started first)
  • [<]: overlap precedes (this speaker overlaps the previous speaker)

Retrace and Repetition

Groups followed by retrace markers indicate speech disfluencies:

*CHI:	<I want> [/] I want a cookie .
*CHI:	<I want> [//] I need a cookie .
*CHI:	<I want a> [///] give me a cookie .
  • [/]: partial retrace (speaker repeats the same words)
  • [//]: full retrace (speaker restarts with different words)
  • [///]: multiple retracing (multiple false starts)
  • [/-]: reformulation (speaker rephrases with different structure)

The CHAT Word

Status: Current Last modified: 2026-05-29 18:43 EDT

“Word” is the most complex and most misunderstood concept in CHAT. This chapter documents what a word actually is, how the grammar parses it, and how the Rust model represents it. If you maintain this codebase, you will encounter word-level bugs. This chapter exists so you can understand them.

The Fundamental Rule

Whitespace delimits words. Contiguous non-whitespace characters form one word token. This applies everywhere on the main tier.

*CHI:   hello world .
        ^^^^^              word: "hello"
              ^^^^^        word: "world"

The grammar uses extras: $ => [] – no implicit whitespace. Whitespace nodes (whitespaces, space) are explicit in the CST. Tree-sitter does not skip whitespace between tokens. This is the foundation of every tokenization decision in the grammar.

There are no exceptions to this rule. Every ambiguity described in this chapter is resolved by applying this rule consistently.

Word Structure

A word in the grammar is standalone_word – a sequence of an optional prefix, a required body, optional suffixes, and an optional POS tag.

The following diagram shows the full decomposition. All named nodes are separate CST children.

flowchart TD
    sw["standalone_word\n(grammar.js, prec.right 6)"]

    zero["zero\n'0' -- omission prefix"]
    wp["word_prefix\n'&amp;-' filler | '&amp;~' nonword | '&amp;+' fragment"]

    wb["word_body\n(required)"]
    fm["form_marker\n@b, @c, @d, @z:label, ..."]
    wls["word_lang_suffix\n@s, @s:eng, @s:eng+fra"]
    pos["pos_tag\n$n, $v, $adj, ..."]

    sw -->|"optional prefix"| zero & wp
    sw -->|"required"| wb
    sw -->|"optional suffix"| fm & wls
    sw -->|"optional"| pos

    ws["word_segment\npure spoken text"]
    short["shortening\n'(text)' omitted sound"]
    sm["stress_marker\nprimary or secondary"]
    len["lengthening\n':' one or more colons"]
    op["overlap_point\none of four brackets"]
    cae["ca_element\nsingle CA marker"]
    cad["ca_delimiter\npaired CA marker"]
    ub["underline_begin\ncontrol char pair"]
    ue["underline_end\ncontrol char pair"]
    cm["'+'\ncompound marker"]

    wb -->|"children (any order)"| ws & short & sm & len & op & cae & cad & ub & ue & cm

In the grammar (search grammar/grammar.js for the standalone_word and word_body rules), the structure is:

standalone_word: $ => prec.right(6, seq(
  optional(choice($.word_prefix, $.zero)),
  $.word_body,
  optional($.form_marker),
  optional($.word_lang_suffix),
  optional($.pos_tag),
)),

word_body: $ => prec.right(choice(
  seq(
    choice($.word_segment, $.shortening, $.stress_marker),
    repeat(choice($.word_segment, $.shortening, $.stress_marker, $._word_marker)),
  ),
  seq(
    choice($.overlap_point, $.ca_element, $.ca_delimiter, $.underline_begin),
    choice($.word_segment, $.shortening, $.stress_marker),
    repeat(choice($.word_segment, $.shortening, $.stress_marker, $._word_marker)),
  ),
)),

word_body has two branches:

  1. Standard start: the word begins with word_segment, shortening, or stress_marker, followed by any number of body children.
  2. Marker-initial: the word begins with a structural marker (overlap, CA, underline), but that marker must be immediately followed by text content. This prevents degenerate words like a standalone overlap marker from forming a valid standalone_word.

Lengthening and + (compound marker) are excluded from starting a word body. This is how standalone : falls through to separator(colon) – see Section 5 (Tokenization Ambiguities) below.

The word_segment Purity Invariant

word_segment contains ONLY pure spoken text. All structural markers are separate typed children in word_body, never consumed by word_segment.

This is a hard invariant with three consequences:

  1. cleaned_text() never scans for markers. It concatenates Text and Shortening elements. No stripping needed.
  2. Validation finds ALL markers by type. Overlap markers, CA elements, and underline pairs are always WordContent variants, regardless of position within the word.
  3. Editors get typed CST nodes. Syntax highlighting, bracket matching, and hover info work on individual markers, not opaque substrings.

How it works

word_segment is a DFA token at prec(5) with a regex that excludes all structural characters. The exclusions are generated from the symbol registry (grammar/src/generated_symbol_sets.js) – never hand-written.

word_segment: $ => token(prec(5, seq(
  WORD_SEGMENT_FIRST_RE,    // generated: excludes structural chars + '0' at start
  WORD_SEGMENT_REST_RE,     // generated: excludes structural chars
))),

Full exclusion table

Every character in this table is excluded from word_segment and becomes a separate typed node in the CST.

Category              Characters           CST node type
Overlap markers                            overlap_point
CA elements                                ca_element
CA delimiters         ° Ϋ §                ca_delimiter
Stress markers        ˈ ˌ                  stress_marker
Colons                :                    lengthening
Underline markers     \x02\x01, \x02\x02   underline_begin / underline_end
Brackets              [ ] < > ( ) { }      structural (annotations, groups)
Punctuation           . ! ? , ; +          terminators, separators, compound
CHAT prefixes         @ $ & * %            headers, events, speakers
Intonation contours                        content-level markers
Group delimiters      " "                  pho/sin groups, quotes
Control chars         \x01-\x08, \x15      bullets, underline

First-character-only exclusion: 0 is excluded from the first character of word_segment (it is the omission prefix). 0 in non-initial positions is valid: 200, h0me, abc0 all parse correctly.

The Rust Data Model

Word struct

The Word struct (crates/talkbank-model/src/model/content/word/word_type.rs) is the canonical typed representation:

pub struct Word {
    pub span: Span,
    pub word_id: Option<SmolStr>,
    pub(crate) raw_text: SmolStr,
    pub content: WordContents,
    pub category: Option<WordCategory>,
    pub form_type: Option<FormType>,
    pub lang: Option<WordLanguageMarker>,
    pub part_of_speech: Option<SmolStr>,
    pub inline_bullet: Option<Bullet>,
}

Key fields:

  • raw_text: the exact text from the input, including all markers. Used for roundtrip serialization.
  • content: a WordContents (SmallVec-backed sequence of WordContent elements). This is the structured decomposition. Most words have 1-2 elements; SmallVec avoids heap allocation for the common case.
  • category: optional prefix (Omission, CAOmission, Filler, Nonword, PhonologicalFragment).
  • form_type: optional @ suffix (@c child-invented, @d dialect, @z:label user-defined, etc.).
  • lang: optional @s language marker (Shortcut, Explicit, Multiple, Ambiguous).
  • part_of_speech: optional $ tag.

WordContent enum

WordContent (crates/talkbank-model/src/model/content/word/content.rs) is the enum of everything that can appear inside a word body. Each variant maps directly to a grammar node.

Grammar node          WordContent variant   Rust type                        Example
word_segment          Text                  WordText(NonEmptyString)         hello, want
shortening            Shortening            WordShortening(NonEmptyString)   (be) in (be)cause
overlap_point         OverlapPoint          OverlapPoint                     ⌉2
ca_element            CAElement             CAElement
ca_delimiter          CADelimiter           CADelimiter                      °
stress_marker         StressMarker          WordStressMarker                 ˈ primary, ˌ secondary
lengthening           Lengthening           WordLengthening { count: u8 }    : = 1, :: = 2, ::: = 3
(caret in word)       SyllablePause         WordSyllablePause                ^ in o^ver
underline_begin       UnderlineBegin        UnderlineMarker                  \x02\x01
underline_end         UnderlineEnd          UnderlineMarker                  \x02\x02
+ (compound)          CompoundMarker        WordCompoundMarker               + in ice+cream
~ (clitic boundary)   CliticBoundary        WordCliticBoundary               ~ in le~ha

cleaned_text()

Word::cleaned_text() derives NLP-ready text from content by concatenating only Text and Shortening variants:

pub fn compute_cleaned_text(&self) -> String {
    let mut result = String::new();
    for item in &self.content {
        match item {
            WordContent::Text(t) => result.push_str(t.as_ref()),
            WordContent::Shortening(s) => result.push_str(s.as_ref()),
            _ => {}
        }
    }
    result
}

This works because the purity invariant guarantees that Text elements never contain structural markers. There is nothing to strip.

Examples:

Input       content                                                         cleaned_text()
hello       [Text("hello")]                                                 hello
(be)cause   [Shortening("be"), Text("cause")]                               because
no::        [Text("no"), Lengthening(2)]                                    no
ice+cream   [Text("ice"), CompoundMarker, Text("cream")]                    icecream
le~ha       [Text("le"), CliticBoundary, Text("ha")]                        leha
ja^ja       [Text("ja"), SyllablePause, Text("ja")]                         jaja
he↑llo      [Text("he"), CAElement(PitchUp), Text("llo")]                   hello
°soft°      [CADelimiter(Softer), Text("soft"), CADelimiter(Softer)]        soft
ˈhello      [StressMarker(Primary), Text("hello")]                          hello
⌈hello⌉     [OverlapPoint(TopBegin), Text("hello"), OverlapPoint(TopEnd)]   hello

The result is cached via OnceLock on first access.
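As a self-contained illustration of the concatenation and the first-access caching described above, here is a minimal sketch. The types are simplified stand-ins, not the definitions in talkbank-model: payloads are plain `String` rather than `NonEmptyString`, only a few variants are shown, and `SketchWord` is a hypothetical name introduced for this example.

```rust
use std::sync::OnceLock;

/// Simplified stand-in for the real `WordContent` enum (assumption:
/// plain `String` payloads and only a subset of the variants).
#[derive(Debug)]
pub enum WordContent {
    Text(String),
    Shortening(String),
    Lengthening { count: u8 },
    CompoundMarker,
}

/// Hypothetical word type that caches its cleaned text on first access.
#[derive(Debug)]
pub struct SketchWord {
    content: Vec<WordContent>,
    cleaned: OnceLock<String>,
}

impl SketchWord {
    pub fn new(content: Vec<WordContent>) -> Self {
        Self { content, cleaned: OnceLock::new() }
    }

    /// Concatenates only `Text` and `Shortening`; every marker is skipped.
    /// The string is built once and then served from the `OnceLock`.
    pub fn cleaned_text(&self) -> &str {
        self.cleaned.get_or_init(|| {
            let mut result = String::new();
            for item in &self.content {
                match item {
                    WordContent::Text(t) | WordContent::Shortening(t) => result.push_str(t),
                    _ => {}
                }
            }
            result
        })
    }
}
```

Because the markers are separate variants, the loop never inspects individual characters; it only decides per element whether to append.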

What is included in cleaned_text vs what is stripped

The following table is the complete inventory of how every word-internal element contributes to (or is excluded from) cleaned_text(). This must match what NLP pipelines (Stanza, etc.) expect as input.

| WordContent variant | Character(s) | In cleaned_text? | Rationale |
|---|---|---|---|
| Text | spoken text | Yes | The actual word |
| Shortening | (be) | Yes | Shortened form is still spoken |
| CompoundMarker | + | No | Structural boundary, not spoken |
| CliticBoundary | ~ | No | Morphological boundary, not spoken |
| SyllablePause | ^ | No | Pause between syllables, not spoken |
| Lengthening | : :: ::: | No | Prosodic marker, not spoken |
| StressMarker | ˈ ˌ | No | Prosodic marker, not spoken |
| OverlapPoint | ⌈ ⌉ ⌊ ⌋ | No | Timing marker, not spoken |
| CAElement | | No | Prosodic annotation |
| CADelimiter | ° Ϋ § | No | Voice quality annotation |
| UnderlineBegin | \x02\x01 | No | Formatting marker |
| UnderlineEnd | \x02\x02 | No | Formatting marker |

Characters that stay in word_segment (ARE spoken text):

  • Letters (all Unicode)
  • Digits (in non-initial position; 0 in initial = omission prefix)
  • Hyphen (-), part of word text, e.g., ice-cream, self-conscious
  • Apostrophe ('), contractions, e.g., don't, it's
  • Hash (#), appears in some transcription conventions
  • Underscore (_), compound boundary in some conventions

Characters NOT in word_segment (excluded by symbol registry): See the full exclusion table in Precedence Decisions in the grammar docs.

Comparison with batchalign2

batchalign2’s annotation_clean() (60 lines of .replace() calls) strips all the same characters that our grammar excludes from word_segment. Key differences:

  1. Parentheses: ba2 has the strip commented out. We model parentheses as Shortening, so the content inside them is included in cleaned_text.
  2. IPA characters (ạ ā ʔ ʕ ʰ): ba2 incorrectly strips them. We keep them because they are real phonetic content.
  3. Hyphen (-): ba2 strips it. We keep it in word_segment because the hyphen is a valid word character (contractions, compounds, morphological suffixes in the %mor tier).

Our design eliminates the need for character-by-character stripping entirely. cleaned_text() is a simple concatenation of Text + Shortening elements, with zero scanning.

The Six Tokenization Ambiguities

CHAT was designed for human readability, not machine parsing. Six characters have context-dependent meanings that the grammar must disambiguate. Full details with proof grammars are in grammar/docs/tokenization-rules.md and grammar/docs/precedence-decisions.md. What follows is a summary for orientation.

1. Overlap markers (⌈⌉⌊⌋)

Adjacent to text = part of the word. Space-separated = standalone overlap_point.

Yeah⌋⌈2 hey      ONE word: "Yeah⌋⌈2"
Yeah ⌋ ⌈2 hey    three tokens: "Yeah", ⌋, ⌈2

Maximal munch at prec(5) keeps overlap characters that are adjacent to text inside the same word. An overlap marker is recognized as a standalone overlap_point only when it is space-separated on both sides.
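The adjacency rule can be illustrated with a small whitespace-driven classifier. This is only a sketch of the observable behavior, not a model of how tree-sitter resolves precedence: the `Token` type and both functions are hypothetical, and the sketch assumes a standalone overlap point is one marker optionally followed by a single index digit.

```rust
/// Token kinds for this sketch only (hypothetical names).
#[derive(Debug, PartialEq)]
pub enum Token<'a> {
    Word(&'a str),
    OverlapPoint(&'a str),
}

fn is_overlap_marker(c: char) -> bool {
    matches!(c, '⌈' | '⌉' | '⌊' | '⌋')
}

/// A chunk is a standalone overlap point only if it is one marker,
/// optionally followed by a single index digit; otherwise it is a word.
fn classify(chunk: &str) -> Token<'_> {
    let mut chars = chunk.chars();
    let standalone = match (chars.next(), chars.next(), chars.next()) {
        (Some(m), None, _) => is_overlap_marker(m),
        (Some(m), Some(d), None) => is_overlap_marker(m) && d.is_ascii_digit(),
        _ => false,
    };
    if standalone {
        Token::OverlapPoint(chunk)
    } else {
        Token::Word(chunk)
    }
}

/// Splits on spaces, so markers glued to text stay inside their word.
pub fn tokenize(line: &str) -> Vec<Token<'_>> {
    line.split(' ')
        .filter(|chunk| !chunk.is_empty())
        .map(classify)
        .collect()
}
```

With these assumptions, `Yeah⌋⌈2 hey` yields two words, while `Yeah ⌋ ⌈2 hey` yields a word, two overlap points, and a word.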

2. Zero/omission prefix (0)

Adjacent to word body = omission prefix. Space-separated = action marker.

0die              ONE word: standalone_word(zero, word_body("die"))
0 die             TWO tokens: nonword(zero), word("die")

standalone_word at prec.right(6) beats nonword at prec(1). The extras: [] setting prevents whitespace from being skipped between zero and word_body. The zero token is inlined directly into standalone_word (not through word_prefix) because tree-sitter’s precedence does not propagate through intermediate rules. This was proven empirically with a minimal test grammar – see grammar/docs/precedence-decisions.md.

3. CA parenthetical vs shortening

In CA mode (@Options: CA), a fully parenthesized word (word) is an uncertain/omitted word (CAOmission), semantically equivalent to 0word. Partially parenthesized hel(lo) is always a shortening.

@Options: CA
*CHI:   (ja) .            CAOmission: uncertain "ja"
*CHI:   hel(lo) .         Shortening: "(lo)" is the shortened part

Distinguishing these requires file-level context (@Options header). The parser sets WordCategory::CAOmission when the word is fully parenthesized in CA mode. Isolated parser.parse_word_fragment() calls cannot determine CA mode – they need a FragmentSemanticContext.
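To make the dependence on file-level context concrete, the sketch below classifies a word given an explicit `ca_mode` flag. Everything here is hypothetical (`ParenReading`, `classify_parenthesized`); the real parser works from the CST and a FragmentSemanticContext. The sketch also assumes that a fully parenthesized word outside CA mode is read as a shortening.

```rust
/// Possible readings of parentheses in a word (names are hypothetical).
#[derive(Debug, PartialEq)]
pub enum ParenReading {
    CaOmission,
    Shortening,
    Plain,
}

/// Classifies a word. `ca_mode` stands for "the file declares @Options: CA",
/// which cannot be derived from the word text alone.
pub fn classify_parenthesized(word: &str, ca_mode: bool) -> ParenReading {
    // Fully wrapped: "(" + non-empty body without parentheses + ")".
    let fully_wrapped = word.len() > 2
        && word.starts_with('(')
        && word.ends_with(')')
        && !word[1..word.len() - 1].contains(|c| c == '(' || c == ')');
    if fully_wrapped && ca_mode {
        ParenReading::CaOmission
    } else if word.contains('(') {
        ParenReading::Shortening
    } else {
        ParenReading::Plain
    }
}
```

The same input `(ja)` produces different readings depending on the flag, which is why an isolated fragment parse needs the context passed in.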

4. Colon – lengthening vs separator

Inside a word (after text): prosodic lengthening. Standalone: separator.

no::              ONE word: Text("no") + Lengthening(2)
hello : world     separator(colon)

The DFA always produces lengthening for : (higher precedence). But word_body rejects lengthening as a first element, so standalone : cannot form a valid word and falls through to separator(colon). This is the “constrain the parser, not the DFA” pattern.
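The "constrain the parser, not the DFA" idea can be sketched in plain Rust: every colon is read as lengthening, but a word body may not start with one, so a lone colon falls through to the separator. The types `Piece` and `Parsed` and the function `parse_token` are hypothetical and only mirror the rule described above.

```rust
/// Word pieces for this sketch (hypothetical, simplified).
#[derive(Debug, PartialEq)]
pub enum Piece {
    Text(String),
    Lengthening(u8),
}

#[derive(Debug, PartialEq)]
pub enum Parsed {
    Word(Vec<Piece>),
    SeparatorColon,
    Invalid,
}

pub fn parse_token(token: &str) -> Parsed {
    let mut pieces: Vec<Piece> = Vec::new();
    for c in token.chars() {
        if c == ':' {
            match pieces.last_mut() {
                // Extend a run of colons: "::" becomes Lengthening(2).
                Some(Piece::Lengthening(n)) => {
                    *n = n.saturating_add(1);
                    continue;
                }
                Some(Piece::Text(_)) => {}
                // Lengthening may not open a word body: a lone ":" is the
                // separator, anything else starting with ":" is rejected.
                None => {
                    return if token == ":" {
                        Parsed::SeparatorColon
                    } else {
                        Parsed::Invalid
                    };
                }
            }
            pieces.push(Piece::Lengthening(1));
        } else {
            if let Some(Piece::Text(t)) = pieces.last_mut() {
                t.push(c);
                continue;
            }
            pieces.push(Piece::Text(c.to_string()));
        }
    }
    if pieces.is_empty() {
        Parsed::Invalid
    } else {
        Parsed::Word(pieces)
    }
}
```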

5. Plus (+) – compound vs terminator vs linker

Inside a word: compound marker. At line end: terminator prefix. At line start: linker prefix.

ice+cream         ONE word with compound marker
and then +...     terminator: trailing_off (prec 10 beats prec 5)
+< but I +/.      linker: lazy_overlap, terminator: interruption

Terminators and linkers use prec(10), which beats word_segment at prec(5). No valid CHAT word ends with + – the grammar enforces this by structure.

6. Bracket annotations vs plain brackets

Bracket annotations ([= text], [=! text], [% text]) use prec(8) prefix tokens to beat generic bracket handling.

Historical Context: The Coarsening and Its Reversal

The original structured grammar (pre-coarsening)

The original grammar (preserved in grammar/docs/pre-coarsening-grammar.js.reference) had all word-internal markers as children of word_content:

// Pre-coarsening: every marker was a child of word_content
word_content: $ => choice(
  $.word_segment,
  $.shortening,
  $.stress,
  $.colon,
  $.caret,
  $.tilde,
  $.plus,
  $.overlap_point,
  $.ca_element,
  $.ca_delimiter,
  $.underline_begin,
  $.underline_end,
),

The coarsening decision

At one point, standalone_word was coarsened into an opaque token – a single DFA regex that consumed the entire word as one undifferentiated string. The rationale was:

  • Simpler grammar with fewer tree-sitter conflicts.
  • A Chumsky-based direct parser in Rust would re-parse the opaque token into structured WordContent elements.

This worked but had costs:

  • Two parsers (tree-sitter + Chumsky) with independent bugs.
  • Validation could not find structural markers without re-parsing.
  • Editors got one opaque node instead of typed children.
  • cleaned_text() had to scan for and strip marker characters.

The reversal (Chumsky elimination)

When the Chumsky direct parser was eliminated (making tree-sitter the sole parser), the structured word grammar was restored. The key decisions:

  1. All marker characters were re-excluded from word_segment using the symbol registry as the single source of truth.
  2. Each marker type became a separate CST child in word_body.
  3. The WordContent enum in the Rust model was aligned 1:1 with the grammar nodes.
  4. The word_segment purity invariant was established as a TDD gate.

The result is one parser, one source of truth for exclusions, and typed markers from grammar through model.

Testing: The word_segment Purity Gate

The purity invariant (each structural marker produces a separate CST child rather than being consumed by word_segment) is enforced by a group of tree-sitter corpus tests under grammar/test/corpus/word/. Each *_in_word_lint.txt file embeds a structural marker inside a word and asserts that the CST splits the word appropriately:

| Test file | Input | Asserts |
|---|---|---|
| overlap_in_word_lint.txt | butt⌈er⌉ | word_segment, overlap_point, word_segment, overlap_point |
| ca_element_in_word_lint.txt | CA element inside a word | word_segment, ca_element, word_segment |
| ca_delimiter_in_word_lint.txt | CA delimiter pair around a word | ca_delimiter, word_segment, ca_delimiter |
| lengthening.txt, lengthening_between_segments.txt | no::, etc. | word_segment, lengthening |
| stacked_ca_markers.txt | Multiple adjacent CA markers in one word | Each marker is its own CST child |

Underline and stress invariants are covered by corpus tests elsewhere in grammar/test/corpus/ and by the parser-equivalence tests in crates/talkbank-parser-tests/. The historical word_segment_purity.txt consolidated 8 named tests in one file; it was retired in commit fdceeac2 when the corresponding constructs were given their own per-construct test files. The per-construct layout is what the current spec generators produce from the spec sources.

How to add a new purity-style test

If you add a new structural marker to the grammar:

  1. Add its characters to the symbol registry (spec/symbols/symbol_registry.json).
  2. Run just symbols-gen to regenerate the exclusion sets.
  3. Add a spec in spec/constructs/ that embeds the marker inside a word; regenerate the affected grammar/parser fixtures with the current spec/tools commands from Spec Workflow so a per-construct test fixture is created in grammar/test/corpus/word/. Verify the CST output names each marker as its own child.
  4. Run the full verification sequence:
    cd grammar && tree-sitter generate && tree-sitter test
    cargo nextest run -p talkbank-parser
    cargo nextest run -p talkbank-parser-tests -E 'test(parser_equivalence)'
    cargo nextest run -p talkbank-parser-tests --test roundtrip_reference_corpus
    

Key Source Files

| File | What it defines |
|---|---|
| grammar/grammar.js | search for standalone_word, word_body, word_segment, _word_marker |
| grammar/src/generated_symbol_sets.js | Character exclusion sets (generated, do not edit) |
| grammar/test/corpus/word/*_in_word_lint.txt, lengthening*.txt, stacked_ca_markers.txt | Per-construct purity-invariant gate tests (replaced the consolidated word_segment_purity.txt retired in fdceeac2) |
| grammar/docs/tokenization-rules.md | The 6 tokenization ambiguities with full examples |
| grammar/docs/precedence-decisions.md | Precedence proofs (zero, colon, purity invariant) |
| grammar/docs/pre-coarsening-grammar.js.reference | Historical: the grammar before coarsening |
| crates/talkbank-model/src/model/content/word/word_type.rs | Word struct |
| crates/talkbank-model/src/model/content/word/content.rs | WordContent enum (12 variants) |
| crates/talkbank-model/src/model/content/word/word_contents.rs | WordContents (SmallVec-backed sequence) |
| crates/talkbank-model/src/model/content/word/category.rs | WordCategory enum (5 variants) |
| crates/talkbank-model/src/model/content/word/form.rs | FormType enum (22 variants) |
| crates/talkbank-model/src/model/content/word/language.rs | WordLanguageMarker enum (4 variants) |

Symbols

Status: Reference Last modified: 2026-05-29 18:43 EDT

CHAT uses a rich set of symbols for transcription conventions. This page documents the symbol categories and the symbol registry that drives both the grammar and the Rust crates. The symbol registry (spec/symbols/symbol_registry.json) is the source of truth: when this page and the registry disagree, the registry wins.

Symbol Registry

The authoritative symbol definitions live in spec/symbols/symbol_registry.json. This JSON file is the single source of truth. It generates:

  • Character sets for the tree-sitter grammar (grammar.js)
  • Rust constants for the model and validation crates
  • Validation rules for the spec tool

After any change to the symbol registry, run:

just symbols-gen

Symbol Categories

Terminators

Punctuation that ends an utterance:

| Symbol | Name | Usage |
|---|---|---|
| . | Period | Declarative |
| ? | Question | Interrogative |
| ! | Exclamation | Exclamatory |
| +... | Trailing off | Incomplete utterance |
| +..? | Trailing-off question | Question trails off |
| +/. | Interruption | Speaker interrupted by another |
| +//. | Self-interruption | Speaker interrupts self |
| +/? | Interrupted question | Question interrupted |
| +!? | Broken question | Exclamation-question |
| +"/. | Quoted new line | Quotation continues on next line |

CA (Conversation Analysis) Symbols

CA notation symbols fall into three parser-distinct categories in spec/symbols/symbol_registry.json. They are not interchangeable: the grammar treats them as different node kinds.

CA element symbols (ca_element_symbols) attach to a word, so book↑ is a single token whose content carries the symbol:

SymbolMeaning
Rising pitch (attaches to a word)
Falling pitch (attaches to a word)
Micropause
Inhalation marker
Other CA element symbols

CA arrow separators (in word_segment_forbidden_start_symbols) are own-node separators between words, not word-attachments. The parser splits them as their own nodes:

SymbolMeaning
Level pitch contour
Rising-to-mid contour
Falling-to-mid contour
Rising-to-high contour
Falling-to-low contour
Other CA arrow separators

CA delimiter symbols (ca_delimiter_symbols) bracket annotated prosodic regions:

SymbolMeaning
°Quiet speech
Higher / lower pitch register
Other prosodic-region delimiters
Low / high prosodic-region delimiters
§ ΫAdditional registered CA delimiters

Confirm the current contents of each category by reading spec/symbols/symbol_registry.json directly; that is the file from which just symbols-gen derives the grammar and Rust constants.

Word Segment Characters

The registry lists characters that are forbidden at the start of a word, forbidden in the rest of a word, or forbidden throughout. These sets define the lexical boundaries of what constitutes a “word” in CHAT.

The grammar uses these sets to construct the word-matching regex patterns. Characters like [, ], <, >, (, ) are structural delimiters and cannot appear inside words.

Event Segment Characters

Characters forbidden in event descriptions (&=event content). Events have slightly different lexical rules than words.

Language Codes

CHAT uses ISO 639-3 three-letter language codes in @Languages headers and @s: word markers:

@Languages:	eng, fra
*CHI:	I want a croissant@s:fra .

Common codes: eng (English), fra (French), deu (German), spa (Spanish), zho (Mandarin), jpn (Japanese).
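A minimal sketch of how an explicit `@s:` marker could be split from a word and checked against the declared languages. Both functions are hypothetical and cover only the single-code explicit form shown above; the real model also has shortcut, multiple, and ambiguous markers, which this sketch ignores.

```rust
/// Splits `word@s:code` into the word and its three-letter code.
/// Returns `(token, None)` when no well-formed explicit marker is present.
pub fn split_language_marker(token: &str) -> (&str, Option<&str>) {
    match token.rsplit_once("@s:") {
        Some((word, code))
            if !word.is_empty()
                && code.len() == 3
                && code.bytes().all(|b| b.is_ascii_lowercase()) =>
        {
            (word, Some(code))
        }
        _ => (token, None),
    }
}

/// Checks a code against the comma-separated value of an @Languages header.
pub fn is_declared(code: &str, languages_header: &str) -> bool {
    languages_header
        .split(',')
        .map(str::trim)
        .any(|declared| declared == code)
}
```

For the example above, `croissant@s:fra` splits into `croissant` and `fra`, and `fra` is found in the header value `eng, fra`.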

Special Markers

@ Markers (Word-Level)

The authoritative form-marker set is FormType in crates/talkbank-model/src/model/content/word/form.rs. Current variants:

| Marker | Meaning |
|---|---|
| @a | Approximate / phonologically consistent form |
| @b | Babbling |
| @c | Child-invented form |
| @d | Dialect form |
| @f | Family-specific form |
| @fp | Filled pause (deprecated, use &-um etc.) |
| @g | Gemination / general special form |
| @i | Interjection |
| @k | Letter sequence (kinship) |
| @l | Single letter |
| @ls | Letter plural |
| @n | Neologism |
| @o | Onomatopoeia |
| @p | Proper name |
| @q | Metalinguistic reference |
| @sas | Second-attempt success |
| @si | Singing |
| @sl | Slang |
| @t | Test word |
| @u | Unibet transcription |
| @wp | Word play |
| @x | Complex / excluded |
| @z:<label> | User-defined special form (carries an arbitrary label) |

The second-language qualifier @s:LANG is a separate construct (see the L2 morphotag section of the Batchalign book); it is not part of FormType.

& Markers (Events and Fillers)

| Prefix | Meaning |
|---|---|
| &= | Paralinguistic event (e.g., &=laughs) |
| &- | Filler (e.g., &-um) |
| &+ | Phonological fragment (e.g., &+sh) |
| &~ | Nonword (e.g., &~mama) |
| &* | Other speaker’s speech event (e.g., &*MOT:word, speech attributed to another speaker) |
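The prefix table maps naturally onto a small dispatch function. The sketch below is illustrative only: `AmpKind` and `classify_amp` are hypothetical names, and the real grammar recognizes these prefixes as tokens rather than by string inspection.

```rust
/// Kinds of `&`-prefixed tokens listed above (hypothetical enum).
#[derive(Debug, PartialEq)]
pub enum AmpKind {
    Event,
    Filler,
    PhonologicalFragment,
    Nonword,
    OtherSpokenEvent,
}

/// Returns the kind and the text after the two-character prefix,
/// or `None` if the token does not start with a known `&` prefix.
pub fn classify_amp(token: &str) -> Option<(AmpKind, &str)> {
    let rest = token.strip_prefix('&')?;
    let mut chars = rest.chars();
    let kind = match chars.next()? {
        '=' => AmpKind::Event,
        '-' => AmpKind::Filler,
        '+' => AmpKind::PhonologicalFragment,
        '~' => AmpKind::Nonword,
        '*' => AmpKind::OtherSpokenEvent,
        _ => return None,
    };
    Some((kind, chars.as_str()))
}
```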

Scope Markers

| Marker | Meaning |
|---|---|
| [/] | Partial retrace, speaker repeats the same words |
| [//] | Full retrace, speaker restarts with different words |
| [///] | Multiple retracing, multiple false starts |
| [/-] | Reformulation, speaker rephrases with different structure |
| [*] | Error |
| [?] | Best guess |
| [>] | Overlap follows |
| [<] | Overlap precedes |
| [= text] | Explanation |
| [: text] | Replacement |

Architecture Overview

Status: Current Last modified: 2026-06-15 15:00 EDT

TalkBank/chatter is the standalone home of the TalkBank CHAT specification, tree-sitter grammar, Rust crates, chatter CLI, LSP server, and desktop app. It is self-contained: the CHAT-format core builds and runs without any external TalkBank repository, so downstream consumers can depend on its crates directly.

Data Flow

Specification is the source of truth. Code is generated downstream from it.

spec/           Source of truth (CHAT specification)
    ↓
grammar.js      Tree-sitter grammar (in grammar/)
    ↓
parser.c        Generated C parser (never hand-edited)
    ↓
Rust crates     Parser → Model → Validation → Transform
    ↓
Applications    chatter CLI, LSP server, desktop app

Two layers

Within this repository, the architecture splits into two layers:

Source-of-truth artifacts. spec/, spec/symbols/, and grammar/ define the CHAT language and generate downstream parser tests, error docs, and shared symbol sets.

Consumer crates and applications. The Rust crates under crates/, the chatter CLI, talkbank-lsp, and the desktop app all consume those source-of-truth artifacts rather than defining CHAT semantics independently.

Crate Dependency Graph

flowchart TD
    derive["talkbank-derive\nProc macros"]
    model["talkbank-model\nData model, validation, alignment, errors"]
    cache["talkbank-cache\nValidation + roundtrip cache"]
    parser["talkbank-parser\nCanonical parser (tree-sitter)"]
    re2c["talkbank-parser-re2c\nAlternate parser (equivalence oracle)"]
    transform["talkbank-transform\nPipelines, CHAT↔JSON, caching"]
    cli["chatter\nCLI: validate, normalize, convert"]
    lsp["talkbank-lsp\nLanguage Server Protocol"]
    s2c["send2clan\nCLAN app bindings"]
    desktop["chatter-desktop\nDesktop validation app (Tauri)"]
    tests["talkbank-parser-tests\nEquivalence tests"]

    derive --> model
    model --> parser & re2c
    parser --> transform
    re2c --> transform
    cache --> transform
    transform --> cli & lsp & desktop
    s2c --> cli & desktop
    parser --> tests
    re2c --> tests

Repository Layout

chatter/
├── grammar/                Tree-sitter grammar
├── spec/                   CHAT specification (source of truth)
│   ├── constructs/         Valid CHAT examples + expected parse trees
│   ├── errors/             Invalid CHAT examples + expected error codes
│   ├── symbols/            Shared symbol registry (JSON)
│   ├── tools/              Core spec generators
│   └── runtime-tools/      Runtime-aware spec bootstrap/validation tools
├── crates/                 Rust crates (model, parser, transform, CLI support, LSP)
├── corpus/                 Reference corpus
├── tests/                  Integration tests and fixtures
├── schema/                 JSON Schema (auto-generated)
├── apps/chatter-desktop/   Desktop validation app (Tauri v2, React)
├── fuzz/                   cargo-fuzz targets (separate workspace)
├── book/                   This documentation
└── docs/                   Strategy, proposals, investigations

Cargo Workspaces

Three separate Cargo workspaces live here:

  1. Root workspace (Cargo.toml): all Rust crates for parsing, model, transform, CLI, LSP, and apps/chatter-desktop/src-tauri.
  2. Spec workspace (spec/Cargo.toml): spec/tools for core generation and spec/runtime-tools for runtime-aware spec tooling.
  3. Fuzz workspace (fuzz/Cargo.toml): cargo-fuzz targets for parser and validation robustness checks.

Use the relevant manifest path for the workspace you mean to operate in:

  • spec/tools/Cargo.toml for generators
  • spec/runtime-tools/Cargo.toml for bootstrap/mining/runtime validation
  • fuzz/Cargo.toml for cargo-fuzz targets

For per-topic detail (sections being consolidated; see SUMMARY for the authoritative current list):

  • Spec System, Grammar, Parser Backends: how CHAT becomes a typed AST.
  • CHAT model: the AST itself, content traversal, wide-struct rule.
  • Alignment: tier alignment, DP, sequence alignment.
  • Errors and validation: diagnostics, validation gates, and parser/model invariants.
  • Editor/runtime integration: talkbank-lsp and application boundaries layered on top of the CHAT core.
  • Memory and Ownership, Type-Driven Design (lands during M11 errors-and-validation work).
  • XML Emitter: projection.

For per-crate summaries see Crate Reference.

Spec System

Status: Current Last modified: 2026-05-29 17:50 EDT

Specifications in spec/ are the authoritative source of truth for the CHAT format. They drive grammar artifact generation, validation/error docs, and targeted test generation.

Historical note: This system was originally shaped during a dual-parser era. The Chumsky-based direct parser was removed in March 2026. Today the canonical parser is tree-sitter (talkbank-parser); a second implementation, talkbank-parser-re2c, exists as a specification oracle and high-throughput batch parser. Fragment specs remain valuable, but synthetic tree-sitter wrapper behavior is legacy, kept for audit purposes only, unless a page or test explicitly says otherwise.

Spec Types

Construct Specs (spec/constructs/)

Each construct spec defines a valid CHAT pattern with its expected parse tree:

# example_name

Description of what this example tests.

## Input

\```mor_dependent_tier
%mor:	VERB|eat .
\```

## Expected CST

\```cst
(mor_dependent_tier
  (mor_tier_prefix)
  ...)
\```

## Metadata

- **Level**: tier
- **Category**: tiers

The Input code fence label (e.g., mor_dependent_tier, utterance) selects which template wraps the fragment into a full CHAT file for parsing.

That is an explicit grammar/test templating mechanism. It is useful, but it does not by itself define honest isolated-fragment semantics for the direct parser.

Error Specs (spec/errors/)

Each error spec defines an invalid CHAT pattern with expected error codes:

# Error E301

## Metadata

- Code: E301
- Name: missing_participants
- Severity: Error
- Layer: parser

## Examples

### missing_participants_1

\```chat
@UTF8
@Begin
*CHI: hello .
@End
\```

Key metadata fields:

  • Layer: parser: error caught during parsing (returns Err)
  • Layer: validation: error caught after successful parse
  • Status: not_implemented: generates #[ignore] tests
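To show how these metadata fields could be consumed, here is a minimal sketch that reads the `- Key: value` lines of an error spec's Metadata section. It is not the generator's actual code; `ErrorSpecMetadata`, `parse_metadata`, and `generates_ignored_test` are hypothetical names, and the sketch assumes the unadorned `- Key: value` form shown in the E301 example.

```rust
/// Fields from an error spec's Metadata section (hypothetical struct).
#[derive(Debug, Default, PartialEq)]
pub struct ErrorSpecMetadata {
    pub code: Option<String>,
    pub name: Option<String>,
    pub severity: Option<String>,
    pub layer: Option<String>,
    pub status: Option<String>,
}

/// Reads `- Key: value` lines; unknown keys and other lines are ignored.
pub fn parse_metadata(section: &str) -> ErrorSpecMetadata {
    let mut meta = ErrorSpecMetadata::default();
    for line in section.lines() {
        let Some(item) = line.trim().strip_prefix("- ") else { continue };
        let Some((key, value)) = item.split_once(':') else { continue };
        let value = Some(value.trim().to_string());
        match key.trim() {
            "Code" => meta.code = value,
            "Name" => meta.name = value,
            "Severity" => meta.severity = value,
            "Layer" => meta.layer = value,
            "Status" => meta.status = value,
            _ => {}
        }
    }
    meta
}

/// Mirrors the rule above: `Status: not_implemented` yields an ignored test.
pub fn generates_ignored_test(meta: &ErrorSpecMetadata) -> bool {
    meta.status.as_deref() == Some("not_implemented")
}
```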

Symbol Registry (spec/symbols/)

symbol_registry.json defines character sets used by both the grammar and Rust crates. In this repo, just symbols-gen validates the registry and regenerates the checked-in grammar and Rust symbol-set outputs. The generation step produces:

  • JavaScript constants for grammar.js
  • Rust constants for model validation

Test Generation

The predecessor monorepo used make test-gen as shorthand for three generator classes. That root wrapper is not yet ported into this repo, but the underlying generation responsibilities are still:

1. Tree-sitter Corpus Tests

gen_tree_sitter_tests reads construct specs and error specs, then:

  • Wraps each Input in a template to create a full CHAT file
  • Parses with tree-sitter and checks for error nodes
  • Writes Expected CST to grammar/test/corpus/

For error specs, it captures the actual parse (with ERROR nodes) as the expected tree.

2. Rust Tests

gen_rust_tests generates Rust test functions:

  • Construct specs become parse-and-compare tests
  • Parser-layer error specs become parser.parse_chat_file() tests expecting Err
  • Validation-layer error specs become parse-then-validate tests

Output: crates/talkbank-parser-tests/tests/generated/

The generated suites are useful as grammar/audit support and regression coverage, but they are not the sole authority for parser semantics.

3. Error Documentation

gen_error_docs generates optional local markdown pages for each error code under docs/errors/ when maintainers want a browsable reference set while working on diagnostics. The source of truth remains spec/errors/.

Workflow After Spec Changes

  1. Regenerate only the affected spec-driven artifacts using the current commands documented in spec/CLAUDE.md.
  2. Run the concrete verification commands from Contributing > Setup.

Never hand-edit generated artifacts; always regenerate them from the specs.

Post-Bootstrap Doctrine

  • spec/tools remains the generator/validator for grammar corpus tests, error docs, and shared symbol artifacts.
  • talkbank-parser-tests owns parser equivalence and roundtrip contracts.
  • Isolated grammar additions should usually need two things: one grammar corpus example and one full-file fixture. They should not require the old bootstrap ritual unless generated artifacts really changed.

Grammar

Status: Current Last updated: 2026-03-24 00:01 EDT

The CHAT grammar is defined in grammar/grammar.js using the tree-sitter parser generator. It produces a GLR parser that handles the full CHAT format with error recovery.

Design Principles

Explicit Whitespace

Unlike most tree-sitter grammars, CHAT does not use extras for whitespace. All whitespace is grammar-visible because CHAT’s structure is whitespace-sensitive:

  • Tab separates tier prefix from content
  • Newline ends tiers
  • Line continuation uses tab-at-start-of-line
  • Space separates words and annotations

Two-Level Structure

The grammar has two structural levels:

  1. Document level: headers, utterances, @Begin/@End
  2. Tier level: main tier content, dependent tier content (each with distinct rules)

Opaque Lemmas

In the %mor tier rules, lemmas are parsed as opaque Unicode strings. The grammar does not attempt to decompose lemma content; that happens in the model layer. This follows the “parse, don’t validate” principle.

Key Grammar Rules

Document Structure

document → utf8_header, begin_header, lines..., end_header
line → header | utterance
utterance → main_tier, dependent_tiers...

Main Tier

main_tier → star, speaker, colon, tab, tier_body
tier_body → contents, utterance_end
contents → content_item, (whitespace, content_item)...
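Because whitespace is grammar-visible, the tab after the colon is part of the main tier's structure. The sketch below mirrors the `star, speaker, colon, tab, tier_body` shape with plain string operations; `split_main_tier` is a hypothetical helper, not part of talkbank-parser.

```rust
/// Splits a main-tier line into (speaker, tier body).
/// Returns `None` unless the line has the shape `*SPEAKER:<TAB>body`.
pub fn split_main_tier(line: &str) -> Option<(&str, &str)> {
    let rest = line.strip_prefix('*')?;
    // The separator is a colon followed by a tab; a space does not qualify.
    let (speaker, body) = rest.split_once(":\t")?;
    if speaker.is_empty() || speaker.contains(char::is_whitespace) {
        return None;
    }
    Some((speaker, body))
}
```

A line that uses a space instead of the tab is rejected by this sketch, which is the point of keeping whitespace explicit.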

MOR Tier (UD-style)

mor_contents → mor_content, (whitespace, mor_content)..., terminator
mor_content → mor_word, mor_post_clitic*
mor_word → mor_pos, pipe, mor_lemma, mor_feature*
mor_post_clitic → tilde, mor_word
mor_feature → hyphen, mor_feature_value

POS tags are simple identifiers (no subcategories). Lemmas are opaque strings. Features are hyphen-separated values that may contain = for Key=Value pairs and , for multi-value features.
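The %mor rules above can be approximated with string splitting, as in the following sketch. It is a simplification under stated assumptions: the lemma is taken to contain no `|`, `-`, or `~`, whereas the real grammar treats the lemma as an opaque token and decides the boundaries itself. `MorWord`, `parse_mor_word`, and `parse_mor_content` here are hypothetical and unrelated to the model's own types; inputs other than `VERB|eat` are made up for illustration.

```rust
/// Simplified view of one mor_word (hypothetical struct for this sketch).
#[derive(Debug, PartialEq)]
pub struct MorWord<'a> {
    pub pos: &'a str,
    pub lemma: &'a str,
    pub features: Vec<&'a str>,
}

/// mor_word → mor_pos, pipe, mor_lemma, mor_feature*
pub fn parse_mor_word(text: &str) -> Option<MorWord<'_>> {
    let (pos, rest) = text.split_once('|')?;
    let mut parts = rest.split('-');
    let lemma = parts.next()?;
    if pos.is_empty() || lemma.is_empty() {
        return None;
    }
    // Each remaining hyphen-separated part is one feature value.
    Some(MorWord { pos, lemma, features: parts.collect() })
}

/// mor_content → mor_word, mor_post_clitic* (clitics are joined by `~`).
pub fn parse_mor_content(text: &str) -> Option<Vec<MorWord<'_>>> {
    text.split('~').map(parse_mor_word).collect()
}
```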

Grammar Change Workflow

parser.c is generated from grammar.js; never edit it directly.

After any change to grammar.js:

  1. cd grammar && tree-sitter generate
  2. tree-sitter test (160 tests)
  3. cargo test -p talkbank-parser
  4. cargo nextest run -p talkbank-parser-tests (reference corpus equivalence, per-file)
  5. Verify the 78-file reference corpus passes at 100%

Conflict Resolution

The grammar uses tree-sitter’s precedence and conflict mechanisms to handle ambiguities:

  • Word tokens use prec(5) to win over separators
  • Inline bullets use prec(10) for their delimiters
  • CA (conversation analysis) symbols use prec(3) for colon disambiguation

Generated Artifacts

Running tree-sitter generate produces:

  • src/parser.c: the C parser
  • src/node-types.json: node type metadata

The Rust crate talkbank-parser references node-types.json to generate node_types.rs (a generated constants file).

Parsing

Status: Current Last updated: 2026-05-19 16:54 EDT

The parsing pipeline converts CHAT text into a typed ChatFile AST. The default and canonical parser is the tree-sitter parser (talkbank-parser). A second implementation, talkbank-parser-re2c, exists alongside it as a specification oracle and high-throughput batch parser; it produces the same ChatFile model and is opt-in via chatter validate --parser re2c. The LSP and all production paths use the tree-sitter parser.

Tree-Sitter Parser

The talkbank-parser crate wraps the tree-sitter C parser and converts its concrete syntax tree (CST) into the ChatFile model.

Full-file parsing is the canonical entry point. TreeSitterParser also provides fragment methods (parse_word_fragment(), parse_main_tier_fragment(), parse_chat_file_fragment(), etc.) for parsing isolated CHAT fragments directly.

CST → AST Pipeline

flowchart LR
    chat["CHAT text\n(.cha file)"]
    grammar["tree-sitter grammar\n(grammar.js → parser.c)"]
    cst["Concrete Syntax Tree\n(all whitespace preserved)"]
    walker["TreeSitterParser\n(CST traversal)"]
    ast["ChatFile AST\n(semantic model)"]

    chat --> grammar --> cst --> walker --> ast

Source text
    ↓ tree-sitter parse
Concrete Syntax Tree (CST), green tree with all tokens
    ↓ tree_parsing (Rust)
ChatFile AST, typed model with validation-ready data

The CST preserves every character of the source (whitespace, punctuation, comments). The Rust tree-parsing modules walk the CST and extract semantic information into the typed model.

Error Recovery

Tree-sitter’s GLR algorithm provides automatic error recovery. When the parser encounters unexpected input, it:

  1. Inserts ERROR nodes in the CST
  2. Continues parsing the rest of the file
  3. Reports parse errors via the ErrorSink trait

This means the parser always produces a result. Even for malformed files, it extracts as much structure as possible.

ParseOutcome

Individual parse functions return ParseOutcome<T>:

  • ParseOutcome::parsed(value): successfully parsed
  • ParseOutcome::rejected(): could not parse this node (error already reported)

This allows the parser to skip individual malformed elements while continuing to parse the rest of the file.
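The skip-and-continue pattern can be sketched with stand-in types. The `ParseOutcome` and `ErrorSink` below are hypothetical re-creations for illustration only: the real definitions live in the TalkBank crates and may differ in shape (for example, the real sink is passed by shared reference in the API excerpt on this page, while this sketch uses `&mut`).

```rust
/// Stand-in for the real `ParseOutcome<T>` (assumed two-state shape).
#[derive(Debug, PartialEq)]
pub enum ParseOutcome<T> {
    Parsed(T),
    Rejected,
}

impl<T> ParseOutcome<T> {
    pub fn parsed(value: T) -> Self {
        ParseOutcome::Parsed(value)
    }
    pub fn rejected() -> Self {
        ParseOutcome::Rejected
    }
}

/// Stand-in error sink; the method name `report` is an assumption.
pub trait ErrorSink {
    fn report(&mut self, message: String);
}

impl ErrorSink for Vec<String> {
    fn report(&mut self, message: String) {
        self.push(message);
    }
}

/// Parses one element, reporting the error before rejecting.
fn parse_number(text: &str, errors: &mut dyn ErrorSink) -> ParseOutcome<u32> {
    match text.parse::<u32>() {
        Ok(n) => ParseOutcome::parsed(n),
        Err(_) => {
            errors.report(format!("not a number: {text}"));
            ParseOutcome::rejected()
        }
    }
}

/// Keeps every element that parsed and skips the rejected ones.
pub fn parse_all(items: &[&str], errors: &mut dyn ErrorSink) -> Vec<u32> {
    let mut out = Vec::new();
    for item in items {
        if let ParseOutcome::Parsed(n) = parse_number(item, errors) {
            out.push(n);
        }
    }
    out
}
```

Since the rejection carries no payload, the caller does not need to handle the error again; it was already delivered to the sink.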

Parser Equivalence

The 78-file reference corpus is the primary correctness guarantee:

cargo nextest run -p talkbank-parser-tests -E 'test(parser_equivalence)'

Each .cha file is its own test; nextest runs them in parallel and reports individual failures.

TreeSitterParser API

TreeSitterParser is the sole API handle for parsing. Callers create one instance and pass &TreeSitterParser to all parsing call sites. There is no trait abstraction: TreeSitterParser is a concrete type in the talkbank-parser crate.

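use talkbank_parser::TreeSitterParser;

// In this excerpt, `source` is the CHAT text to parse and `errors` is an
// ErrorSink implementation; neither is constructed here.
let parser = TreeSitterParser::new()?;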

// Full-file parsing (methods on TreeSitterParser).
// parse_chat_file returns ParseResult<ChatFile> with the diagnostic list
// embedded in the result envelope.
let chat_file = parser.parse_chat_file(&source)?;
// parse_chat_file_streaming pushes diagnostics into an ErrorSink as it
// goes, useful for very large files or LSP-style incremental flows.
let chat_file = parser.parse_chat_file_streaming(&source, &errors);

// Fragment parsing (methods on TreeSitterParser), used when synthesizing
// CHAT from non-CHAT sources (ASR output, UD annotations).
let word = parser.parse_word_fragment(word_text, &errors);
let main_tier = parser.parse_main_tier_fragment(tier_text, &errors);

AST Structure

The resulting ChatFile AST has a recursive content structure:

flowchart TD
    cf["ChatFile"]
    hdr["Headers\n@Languages, @Participants,\n@ID, @Options"]
    utts["Utterances[]"]
    mt["MainTier\nspeaker + content"]
    dt["DependentTiers[]\n%mor, %gra, %pho, %sin, %wor"]
    uc["UtteranceContent\n24 variants"]
    leaf["Leaves\nWord | ReplacedWord | Separator"]
    group["Groups\nGroup | AnnotatedGroup |\nRetrace | PhoGroup | SinGroup | Quotation"]

    cf --> hdr & utts
    utts --> mt & dt
    mt --> uc
    uc --> leaf & group
    group -->|recurse| uc
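The recursion in the diagram means any traversal has to descend into groups. The sketch below uses a drastically reduced stand-in enum (three variants instead of 24, and groups that hold the same enum directly rather than bracketed items) to show the shape of such a walker; `collect_words` is a hypothetical helper, not one of the model's walker primitives.

```rust
/// Reduced stand-in for the content tree (assumption: groups nest the
/// same enum directly; the real model has more variants and levels).
#[derive(Debug)]
pub enum UtteranceContent {
    Word(String),
    Separator,
    Group(Vec<UtteranceContent>),
}

/// Collects every word in document order, descending into groups.
pub fn collect_words<'a>(items: &'a [UtteranceContent], out: &mut Vec<&'a str>) {
    for item in items {
        match item {
            UtteranceContent::Word(w) => out.push(w),
            UtteranceContent::Separator => {}
            UtteranceContent::Group(inner) => collect_words(inner, out),
        }
    }
}
```

A walker that only looks at the top level would miss `want` and `that` in a structure like `I <want <that>>`.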

Parser String Handling

The tree-sitter parser constructs owned model types (e.g., MorWord, GrammaticalRelation) directly from CST text. String-heavy types like PosCategory and MorStem use Arc<str> interning to avoid redundant allocations for repeated values. Short strings in model newtypes use SmolStr for inline storage up to 23 bytes.
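For readers unfamiliar with `Arc<str>` interning, the following is a minimal sketch of the general technique using only the standard library. It is not the implementation used for PosCategory or MorStem; `Interner` is a hypothetical type that shows why repeated values end up sharing one allocation.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Minimal string interner: one shared allocation per distinct string.
#[derive(Default)]
pub struct Interner {
    seen: HashSet<Arc<str>>,
}

impl Interner {
    /// Returns a handle to the stored copy, allocating only on first sight.
    pub fn intern(&mut self, text: &str) -> Arc<str> {
        if let Some(existing) = self.seen.get(text) {
            // Cloning an Arc bumps a reference count; no new string is made.
            return Arc::clone(existing);
        }
        let fresh: Arc<str> = Arc::from(text);
        self.seen.insert(Arc::clone(&fresh));
        fresh
    }
}
```

In a corpus where the same POS tag appears many thousands of times, each occurrence then costs a pointer and a reference-count increment rather than a fresh heap string.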

CHAT Data Model

Status: Current Last updated: 2026-06-14 19:57 EDT

The talkbank-model crate defines the typed AST for CHAT files. Every other crate (parser, transform, CLAN, CLI, LSP, and the entire batchalign runtime) depends on it. This page describes the model itself, the three-level content hierarchy, the content-walker primitives, and the extract → infer → inject pattern that all NLP tasks follow.

ChatFile

The root type is ChatFile, representing a complete CHAT transcript:

pub struct ChatFile {
    pub lines: Vec<Line>,
    pub participants: IndexMap<SpeakerCode, Participant>,
    pub languages: LanguageCodes,
    pub options: ChatOptionFlags,
}

Each Line is either a Header or an Utterance. The full ownership tree:

flowchart TD
    chatfile["ChatFile\n(talkbank-model/src/model/file/chat_file/core.rs)"]
    chatfile --> lines["lines: Vec&lt;Line&gt;"]
    chatfile --> participants["participants:\nIndexMap&lt;SpeakerCode, Participant&gt;"]
    chatfile --> languages["languages: LanguageCodes"]
    chatfile --> options["options: ChatOptionFlags"]

    lines --> header_line["Line::Header (Header)"]
    lines --> utt_line["Line::Utterance (Utterance)"]

    utt_line --> preceding["preceding_headers:\nSmallVec&lt;Header&gt;"]
    utt_line --> main["main: MainTier"]
    utt_line --> deptiers["dependent_tiers:\nVec&lt;DependentTier&gt;"]
    utt_line --> health["parse_health: ParseHealthState"]

    main --> speaker["speaker: SpeakerCode"]
    main --> tiercontent["content: TierContent"]
    tiercontent --> linkers["linkers: Vec&lt;Linker&gt;"]
    tiercontent --> uttcontent["utterance_content:\nVec&lt;UtteranceContent&gt;\n(24 variants)"]
    tiercontent --> terminator["terminator: Option&lt;Terminator&gt;"]
    tiercontent --> bullet["bullet: Option&lt;Bullet&gt;"]

The DependentTier enum has 25 variants: structured linguistic (Mor/Gra/Pho/Mod/Sin/Act/Cod/Wor), with-inline-bullets (Add/Com/Exp/Gpx/Int/Sit/Spa), text-only (Alt/Coh/Def/Eng/Err/Fac/Flo/Gls/Ort/Par/Tim), Phon-project (Modsyl/Phosyl/Phoaln), and UserDefined / Unsupported.

Three-Level Content Hierarchy

CHAT main-tier content is a tree with three nesting levels. Every content traversal must understand all three.

ChatFile
└── Line::Utterance
    └── MainTier
        └── TierContent
            ├── content: Vec<UtteranceContent>     ← Level 1
            │   ├── Word(Box<Word>)
            │   │   └── content: Vec<WordContent>  ← Level 3
            │   ├── OverlapPoint(OverlapPoint)
            │   ├── Group(Group)
            │   │   └── BracketedContent
            │   │       └── Vec<BracketedItem>     ← Level 2
            │   ├── PhoGroup, SinGroup, Quotation
            │   │   └── (same BracketedContent)
            │   ├── Retrace(Box<Retrace>)
            │   ├── Pause, Event, Separator, ...
            │   └── AnnotatedWord, AnnotatedGroup, ...
            ├── bullet: Option<Bullet>
            ├── linkers: Linkers
            └── terminator: Terminator

Level 1: UtteranceContent (24 variants)

What you iterate when walking utterance.main.content.content.0:

| Category | Variants |
|---|---|
| Words | Word, AnnotatedWord, ReplacedWord |
| Groups | Group, AnnotatedGroup, PhoGroup, SinGroup, Quotation |
| CA markers | OverlapPoint, Separator |
| Events | Event, AnnotatedEvent, OtherSpokenEvent |
| Actions | AnnotatedAction |
| Timing | InternalBullet |
| Scope markers | LongFeatureBegin/End, NonvocalBegin/End/Simple, UnderlineBegin/End |
| Other | Freecode, Pause |

Critical rule: every match on UtteranceContent must explicitly list all 24 variants. No _ => catch-alls. Project policy: silent data loss when new variants are added is unacceptable.
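As a minimal sketch of why the rule matters, the example below uses a hypothetical three-variant enum (`MiniContent`, not part of talkbank-model) in place of the real 24-variant UtteranceContent. Because the match names every variant, adding a variant later becomes a compile error at each match site rather than a silently skipped item.

```rust
/// Hypothetical stand-in for UtteranceContent, reduced to three variants.
#[derive(Debug)]
enum MiniContent {
    Word(String),
    Pause,
    Group(Vec<MiniContent>),
}

/// Counts words, listing every variant explicitly (no `_ =>` arm).
/// A new `MiniContent` variant would fail to compile here until handled.
fn count_words(items: &[MiniContent]) -> usize {
    items
        .iter()
        .map(|item| match item {
            MiniContent::Word(_) => 1,
            MiniContent::Pause => 0,
            MiniContent::Group(inner) => count_words(inner),
        })
        .sum()
}
```

The same shape scales to the full enum: one arm per variant, even when most arms do nothing.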

Level 2: BracketedItem (22 variants)

Content inside groups (<...>, ‹...›, 〔...〕, "..."). Accessed via group.content.content.0. The double .content.content.0 is not a typo: Group.content is BracketedContent, which has .content: BracketedItems, which has .0: Vec<BracketedItem>.

BracketedItem mirrors UtteranceContent closely. Retrace content (<word word> [/], word [//]) is a dedicated Retrace variant at both levels, not hidden inside AnnotatedGroup. Groups can nest arbitrarily deep.

Level 3: WordContent (11 variants)

Content inside a single word token, accessed via word.content:

| Variant | Example |
|---|---|
| Text | plain text segment |
| Shortening | (lo) omitted sound |
| OverlapPoint | butt⌈er⌉, overlap inside a word |
| CAElement | ↑ ↓ prosody markers |
| CADelimiter | ° ∆ paired delimiters |
| StressMarker | ˈ ˌ |
| Lengthening | : |
| SyllablePause | ^ |
| CompoundMarker | + in ice+cream |
| UnderlineBegin/End | scope delimiters |

Key insight: overlap markers can appear at any of the three levels. They occur as a standalone UtteranceContent::OverlapPoint (space-separated: ⌈ word ⌉), as BracketedItem::OverlapPoint (inside groups), or as WordContent::OverlapPoint (intra-word: butt⌈er⌉). Any traversal looking for overlap markers must check all three levels.
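The three-level check can be sketched with simplified, hypothetical stand-ins for the model types (each enum below keeps only a few variants; the real ones are much larger). The sketch counts overlap points at the utterance level, inside groups via the `.content.content.0` path, and inside words.

```rust
// Hypothetical, heavily reduced versions of the real model types.
enum WordContent { Text(String), OverlapPoint }
struct Word { content: Vec<WordContent> }
enum BracketedItem { Word(Word), OverlapPoint, Group(Group) }
struct BracketedItems(Vec<BracketedItem>);
struct BracketedContent { content: BracketedItems }
struct Group { content: BracketedContent }
enum UtteranceContent { Word(Word), OverlapPoint, Group(Group), Pause }

/// Level 3: overlap points inside a single word.
fn overlaps_in_word(word: &Word) -> usize {
    word.content
        .iter()
        .map(|part| match part {
            WordContent::Text(_) => 0,
            WordContent::OverlapPoint => 1,
        })
        .sum()
}

/// Level 2: overlap points inside a group, recursing into nested groups.
fn overlaps_in_group(group: &Group) -> usize {
    group.content.content.0
        .iter()
        .map(|item| match item {
            BracketedItem::Word(w) => overlaps_in_word(w),
            BracketedItem::OverlapPoint => 1,
            BracketedItem::Group(g) => overlaps_in_group(g),
        })
        .sum()
}

/// Level 1: overlap points anywhere under the utterance content.
fn count_overlap_points(items: &[UtteranceContent]) -> usize {
    items
        .iter()
        .map(|item| match item {
            UtteranceContent::Word(w) => overlaps_in_word(w),
            UtteranceContent::OverlapPoint => 1,
            UtteranceContent::Group(g) => overlaps_in_group(g),
            UtteranceContent::Pause => 0,
        })
        .sum()
}
```

A traversal that only matched `UtteranceContent::OverlapPoint` would report one marker for an utterance where this sketch finds several.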

Annotated Wrappers and Replaced Words

Annotated<T>

Adds scoped annotations ([/], [* m], [= explanation], etc.) to any annotatable inner type:

pub struct Annotated<T> {
    pub inner: T,
    pub annotations: Vec<ContentAnnotation>,
    pub span: Span,
}

At Level 1: AnnotatedWord(Box<Annotated<Word>>), AnnotatedGroup(Annotated<Group>), AnnotatedEvent(Annotated<Event>), AnnotatedAction(Annotated<Action>). Same variants exist at Level 2.

ReplacedWord

Represents word [: replacement], a surface form with a replacement:

pub struct ReplacedWord {
    pub word: Word,
    pub replacement: Replacement,
}
pub struct Replacement {
    pub words: Vec<Word>,
}

Convention when extracting words for NLP: use replacement words if non-empty, else the surface form (Wor and Mor domains both follow this, each with its own counts_for_tier filter).
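A minimal sketch of that convention, using hypothetical simplified types (a `Word` here is just a string, and the per-domain counts_for_tier filtering is left out):

```rust
// Hypothetical simplified types; the real Word carries structured content.
struct Word { text: String }
struct Replacement { words: Vec<Word> }
struct ReplacedWord { word: Word, replacement: Replacement }

/// Returns the words an NLP task should see: the replacement words when
/// there are any, otherwise the single surface form.
fn nlp_words(replaced: &ReplacedWord) -> Vec<&str> {
    if replaced.replacement.words.is_empty() {
        vec![replaced.word.text.as_str()]
    } else {
        replaced
            .replacement
            .words
            .iter()
            .map(|w| w.text.as_str())
            .collect()
    }
}
```

Note that one surface word can expand to several NLP words, so downstream alignment code should not assume a one-to-one mapping.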

Tier Domains

Different NLP tasks need different views of the same content. The TierDomain enum controls which words count for each tier and how groups are traversed:

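| Domain | Used by | Skips | Counts separators? |
|---|---|---|---|
| Mor | %mor / %gra generation | Retrace groups | Yes: tag-marker separators carry mor items (cm\|cm, end\|end, beg\|beg) |
| Wor | %wor generation, FA | Nothing | No |
| Pho | %pho alignment | PhoGroup | No |
| Sin | %sin alignment | SinGroup | No |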

The content walker takes Option<TierDomain>: Some(domain) for domain-aware gating, None to recurse everything unconditionally.

Content Walkers

talkbank-model exports closure-based walkers. Two layers:

  • walk_content: generic, visits all content items (custom traversals).
  • walk_words / walk_words_mut: filtered to words / replaced words / separators, with domain-aware gating. The primary primitive.

use talkbank_model::alignment::helpers::{
    walk_words, walk_words_mut,
    WordItem, WordItemMut,
    TierDomain,
};

walk_words(content, Some(TierDomain::Wor), &mut |leaf| {
    match leaf {
        WordItem::Word(word) => { /* ... */ }
        WordItem::ReplacedWord(replaced) => { /* ... */ }
        WordItem::Separator(sep) => { /* ... */ }
    }
});

flowchart TD
    input["&[UtteranceContent]\n+ domain: Option&lt;TierDomain&gt;"]
    dispatch["Match variant\n(24 UtteranceContent variants)"]
    word["Word → emit WordItem::Word"]
    rw["ReplacedWord → emit WordItem::ReplacedWord"]
    sep["Separator → emit WordItem::Separator"]
    group["Group / AnnotatedGroup /\nPhoGroup / SinGroup / Quotation"]
    gate{"Domain\ngating"}
    skip["Skip\n(atomic unit)"]
    recurse["Recurse into\ngroup.content"]

    input --> dispatch
    dispatch --> word & rw & sep & group
    group --> gate
    gate -->|"Mor: skip retraces"| skip
    gate -->|"Pho/Sin: skip groups"| skip
    gate -->|"None: recurse all"| recurse
    recurse -->|back| dispatch
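
The gating logic in the diagram can be sketched as follows. This is an illustration over a hypothetical five-variant `Item` enum, not the real walker: it emits words only, leaves separators out, and simply skips a gated group rather than treating it as an atomic unit.

```rust
/// Domains as listed in the Tier Domains table.
#[derive(Clone, Copy, PartialEq)]
enum TierDomain { Mor, Wor, Pho, Sin }

/// Hypothetical simplified content tree.
enum Item {
    Word(&'static str),
    Group(Vec<Item>),
    Retrace(Vec<Item>),
    PhoGroup(Vec<Item>),
    SinGroup(Vec<Item>),
}

/// Visits words in document order, skipping the groups that the given
/// domain gates out. `None` recurses into everything.
fn walk_words_sketch(items: &[Item], domain: Option<TierDomain>, visit: &mut dyn FnMut(&str)) {
    for item in items {
        match item {
            Item::Word(w) => visit(*w),
            Item::Group(inner) => walk_words_sketch(inner, domain, visit),
            Item::Retrace(inner) => {
                if domain != Some(TierDomain::Mor) {
                    walk_words_sketch(inner, domain, visit);
                }
            }
            Item::PhoGroup(inner) => {
                if domain != Some(TierDomain::Pho) {
                    walk_words_sketch(inner, domain, visit);
                }
            }
            Item::SinGroup(inner) => {
                if domain != Some(TierDomain::Sin) {
                    walk_words_sketch(inner, domain, visit);
                }
            }
        }
    }
}

/// Convenience wrapper that collects the visited words.
fn words_for(items: &[Item], domain: Option<TierDomain>) -> Vec<String> {
    let mut out = Vec::new();
    walk_words_sketch(items, domain, &mut |w: &str| out.push(w.to_string()));
    out
}
```

Running the same content through different domains yields different word counts, which is why alignment code must pass the domain that matches the tier it is aligning.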

What walk_words does NOT visit

Only words and separators. Not OverlapPoint (any level), not CAElement within words, not events / pauses / actions, not internal bullets. For these, write a custom traversal; see talkbank-model/validation/utterance/overlap.rs for the reference pattern.

walk_overlap_points: overlap marker iterator

walk_overlap_points(content, &mut |visit| {
    // visit.point.kind, visit.point.index, visit.word_position
});

Visits every OverlapPoint at all three content levels with its word-position context. Used by the alignment pipeline (onset estimation) and the validator (pairing checks). For region-level analysis (pairing ⌈ with ⌉ by index), use extract_overlap_info() which builds OverlapRegion structs. For whole-file analysis, analyze_file_overlaps() matches top regions (⌈) with bottom regions (⌊) across utterances with 1:N support (used by E347 and chatter debug overlap-audit).

Validation

Beyond what the grammar enforces, validate_with_alignment() checks semantic constraints:

  • %mor alignment: number of MOR items matches alignable main-tier words.
  • %gra structure: sequential indices, ROOT checks, circular dependency.
  • Header consistency: @ID codes match @Participants.
  • Speaker references: all *SPEAKER: codes declared.

Five parallel alignment flows are computed against the main tier:

flowchart TD
    main["MainTier content"]
    walker["walk_words()\ncount alignable words"]

    subgraph "5 Parallel Alignment Flows"
        mor["%mor\ncustom logic\n(clitic handling)"]
        pho["%pho\npositional_align()\n(skip PhoGroup)"]
        sin["%sin\npositional_align()\n(skip SinGroup)"]
        wor["%wor\npositional_align()\n(LCS diff format)"]
        gra["%gra\nalign to %mor chunks\n(not main tier)"]
    end

    main --> walker
    walker --> mor & pho & sin & wor
    mor --> gra

For the alignment algorithms themselves, see Alignment.

Common Pitfalls

  1. “Consecutive” means in-order traversal, not adjacent array indices. When CHAT tools speak of “consecutive” or “sequential” items on the main tier, this always means document order via recursive traversal, accounting for groups (<...>), retrace groups (<...> [/]), quotations ("..."), and all other bracketed structures. Never check adjacency in the flat Vec<UtteranceContent>; use walk_words or an equivalent in-order traversal.
  2. Missing intra-word content. Overlap markers, CA elements, and other markers can appear inside Word.content. Checking only UtteranceContent::OverlapPoint misses WordContent::OverlapPoint (e.g., butt⌈er⌉, a⌈nd).
  3. Missing annotated variants. UtteranceContent::AnnotatedWord and AnnotatedGroup wrap inner types in Annotated<T> and are easy to forget.
  4. BracketedContent access. Group.content is a BracketedContent, which has .content: BracketedItems, which has .0: Vec<BracketedItem>.
  5. Separator counter sync (Mor domain). Tag-marker separators count as NLP words because they have %mor items (cm|cm, end|end, beg|beg). Any code counting words in the Mor domain must count these separators too.
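
The first pitfall can be illustrated with a small sketch over a hypothetical two-variant `Item` tree (not the real model): words are first collected in document order by recursion, and only then paired up.

```rust
/// Hypothetical simplified content tree.
enum Item {
    Word(&'static str),
    Group(Vec<Item>),
}

/// Collects words in document order, descending into groups.
fn words_in_order<'a>(items: &'a [Item], out: &mut Vec<&'a str>) {
    for item in items {
        match item {
            Item::Word(w) => out.push(*w),
            Item::Group(inner) => words_in_order(inner, out),
        }
    }
}

/// Pairs of consecutive words in the document-order sense.
fn consecutive_pairs(items: &[Item]) -> Vec<(&str, &str)> {
    let mut words = Vec::new();
    words_in_order(items, &mut words);
    words.windows(2).map(|pair| (pair[0], pair[1])).collect()
}
```

For `the <big dog> ran`, the flat top-level vector has three entries and never places "big" next to "dog", whereas the in-order traversal produces all three word pairs.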

Serialization

  • CHAT: WriteChat trait writes any model type back to CHAT format.
  • JSON: all model types implement Serialize/Deserialize. Format per the JSON Schema.
  • JSON Schema: derived via JsonSchema. Run cargo test --test generate_schema to regenerate schema/chat-file.schema.json.

Memory and Interning

String-heavy types (PosCategory, MorStem, MorFeature) use Arc<str> with a global interner, which saves significant memory on large corpora where the same POS tags and lemmas appear thousands of times.
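
The idea behind Arc<str> interning can be sketched as below. The `Interner` type is hypothetical and local (the crate's interner is described as global, and its actual API is not shown here); the point is only that repeated values share one allocation.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Hypothetical local interner; illustrates the sharing behaviour only.
#[derive(Default)]
struct Interner {
    seen: HashSet<Arc<str>>,
}

impl Interner {
    /// Returns a shared handle for `s`, allocating only on first sight.
    fn intern(&mut self, s: &str) -> Arc<str> {
        if let Some(existing) = self.seen.get(s) {
            return Arc::clone(existing);
        }
        let fresh: Arc<str> = Arc::from(s);
        self.seen.insert(Arc::clone(&fresh));
        fresh
    }
}
```

Interning the same POS tag twice returns two handles to the same heap allocation, so each additional occurrence costs a reference-count increment rather than a new string.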

Collections that are typically small use SmallVec for inline storage:

  • SmallVec<[MorFeature; 4]>: features per word (usually 0-4).
  • SmallVec<[MorWord; 2]>: post-clitics (usually 0-1).

Transform Pipeline

Status: Current Last updated: 2026-06-14 19:57 EDT

The talkbank-transform crate provides high-level pipelines that compose parsing, validation, and serialization into reusable workflows.

Core Pipelines

Parse + Validate

The most common pipeline: parse a CHAT file and validate it.

use talkbank_transform::parse_and_validate;

let result = parse_and_validate(source, &parser, &error_collector);

This:

  1. Parses the source text into a ChatFile AST
  2. Runs validation (alignment checks, header consistency, etc.)
  3. Collects all errors and warnings into the ErrorSink

CHAT → JSON

Convert a CHAT file to its JSON representation:

use talkbank_transform::chat_to_json;

let json = chat_to_json(source, &parser)?;

The JSON follows the schema at schema/chat-file.schema.json.

JSON → CHAT

The JSON produced by chat_to_json is schema-conformant and round-trips. Deserialize it back into a ChatFile with serde_json (the model derives Deserialize), then serialize through WriteChat to reproduce CHAT text:

let chat_file: talkbank_model::ChatFile = serde_json::from_str(json_str)?;
let chat_text = chat_file.to_chat_string();

The chatter from-json command wraps this path (crates/chatter/src/commands/json.rs, json_to_chat).

CHAT → CHAT (Normalize)

Parse and reserialize to normalize formatting:

use talkbank_transform::normalize_chat;

let normalized = normalize_chat(source, &parser)?;

normalize_chat lives in crates/talkbank-transform/src/pipeline/convert.rs.

Validation + Roundtrip Cache Lifecycle

The following diagram shows the full validation and roundtrip pipeline, including the cache layer:

flowchart TD
    file["CHAT file"]
    cache{"Cache\nhit?"}
    parse["Parse\n(tree-sitter → AST)"]
    validate["Validate\n(per-file → per-utterance →\nmain tier → dependent tiers)"]
    rt{"Roundtrip\nflag?"}
    ser1["Serialize → CHAT text"]
    reparse["Reparse CHAT text"]
    ser2["Serialize again"]
    cmp{"Two\nserializations\nmatch?"}
    store["Store in cache\n(SQLite)"]
    pass["Pass"]
    fail["Fail"]
    cached["Return cached result"]

    file --> cache
    cache -->|miss| parse --> validate --> rt
    cache -->|hit| cached
    rt -->|yes| ser1 --> reparse --> ser2 --> cmp
    rt -->|no| store --> pass
    cmp -->|yes| store
    cmp -->|no| fail
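
The roundtrip branch of the diagram (serialize, reparse, serialize again, compare) can be sketched generically. The helper name `roundtrip_is_stable` and the toy parser are hypothetical; the real pipeline works on ChatFile and also runs validation and caching, which are omitted here.

```rust
/// Returns Some(true) when serialize(parse(x)) is a fixed point,
/// Some(false) when the two serializations differ, None on parse failure.
fn roundtrip_is_stable<T>(
    source: &str,
    parse: impl Fn(&str) -> Option<T>,
    serialize: impl Fn(&T) -> String,
) -> Option<bool> {
    let first = serialize(&parse(source)?);
    let second = serialize(&parse(&first)?);
    Some(first == second)
}

/// Toy parser: whitespace-separated tokens; empty input is an error.
fn toy_parse(text: &str) -> Option<Vec<String>> {
    let tokens: Vec<String> = text.split_whitespace().map(str::to_string).collect();
    if tokens.is_empty() { None } else { Some(tokens) }
}

/// Toy serializer: tokens joined by single spaces.
fn toy_serialize(tokens: &Vec<String>) -> String {
    tokens.join(" ")
}
```

Comparing the two serializations, rather than the source against the first serialization, lets normalization changes pass while still catching a parser and serializer that disagree.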

Streaming Parse

For large files or interactive use, the transform crate supports streaming parse where utterances are processed incrementally rather than loading the entire AST into memory.

Caching

The transform layer integrates with a file-system cache. Validation results are keyed by content hash, so unchanged files skip re-validation. Cache location is platform-specific: ~/Library/Caches/talkbank-chat/ (macOS), ~/.cache/talkbank-chat/ (Linux), %LocalAppData%\talkbank-chat\ (Windows).

Use --force to bypass the cache for specific paths.

Error Collection

Pipelines use the ErrorSink trait for error reporting. Callers can provide:

  • A collecting sink (gathers all diagnostics for batch output)
  • A printing sink (writes diagnostics to stderr in real-time)
  • A custom sink (for LSP diagnostics, JSON output, etc.)
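
A minimal sketch of the sink pattern follows. The trait shape here is an assumption: the method name `report`, the `Diagnostic` struct, and the `&self` receiver with interior mutability are all hypothetical, chosen only to match the way the examples above pass sinks by shared reference.

```rust
use std::cell::RefCell;

/// Hypothetical diagnostic payload.
#[derive(Debug, Clone, PartialEq)]
struct Diagnostic {
    message: String,
}

/// Hypothetical sink trait; the real ErrorSink may differ.
trait ErrorSink {
    fn report(&self, diagnostic: Diagnostic);
}

/// Collecting sink: gathers diagnostics for batch output.
#[derive(Default)]
struct CollectingSink {
    diagnostics: RefCell<Vec<Diagnostic>>,
}

impl ErrorSink for CollectingSink {
    fn report(&self, diagnostic: Diagnostic) {
        self.diagnostics.borrow_mut().push(diagnostic);
    }
}

/// Printing sink: writes each diagnostic to stderr as it arrives.
struct PrintingSink;

impl ErrorSink for PrintingSink {
    fn report(&self, diagnostic: Diagnostic) {
        eprintln!("{}", diagnostic.message);
    }
}

/// A check written against the trait works with any sink.
fn check_speaker_declared(speaker: &str, declared: &[&str], sink: &dyn ErrorSink) {
    if !declared.contains(&speaker) {
        sink.report(Diagnostic {
            message: format!("speaker {speaker} is not declared"),
        });
    }
}
```

Because checks depend only on the trait, the same validation code can feed batch output, a terminal, or an editor without modification.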

Merge Pipeline: Domain Types

Status: Draft Last modified: 2026-05-29 18:43 EDT

This page specifies the typed Rust vocabulary shared by chatter merge, chatter speaker-id, the override-file reader/writer, and any future adjudication tooling (CLI, VS Code, web). Documenting these types before writing the implementing code is deliberate: the types are the spec, and they need to be designed against the user contract in chatter merge and chatter speaker-id without being inferred from prototype code.

The design follows the cross-cutting rules in this repo’s root CLAUDE.md: newtypes over primitives at every stable boundary; no boolean blindness; no tuple-packed seams; typed errors via thiserror; deterministic BTreeMap/BTreeSet over hash maps for serialized state.

Where the types live

All new types live in talkbank-model::merge. Rationale:

  • Existing CHAT-domain types (SpeakerCode, ParticipantRole, ParticipantEntry, IDHeader, ChatFile) already live in talkbank-model; the new merge-pipeline types reference them pervasively and benefit from being co-located.
  • Consumers outside talkbank-transform (a future override-file reader in a small CLI, an adjudication UI, an orchestrator script’s Rust port) want the types without pulling in the tree-sitter parser, the DP-aligner, etc. talkbank-model is the lightweight type-and-validation crate that fits.
  • If talkbank-model::merge grows past the file-size budget (≤400 lines per file, ≤800 hard) we split into submodules (merge::override_file, merge::scoring, etc.), same crate. Hoisting to a separate talkbank-merge-types crate is a future option but not pre-emptively warranted.

Existing types reused (not redefined)

| Type | Defined in | Used as |
|---|---|---|
| SpeakerCode | talkbank-model::model::header::codes::speaker | Identifier for *<CODE>: speakers, dictionary keys in mappings, --retain set elements |
| ParticipantRole | talkbank-model::model::header::codes::participant | Role-tag in @Participants and @ID (Target_Child, Investigator, Mother, etc.) |
| ParticipantName | talkbank-model::model::header::codes::participant | Optional participant name in @Participants |
| ParticipantEntry | talkbank-model::model::header::codes::participant | Single @Participants row |
| IDHeader | talkbank-model::model::header::id | Single @ID row |
| ChatFile<S> | talkbank-model::model::file::chat_file::core | The merge stages’ inputs and outputs (parameter S: ValidationState) |

None of these are redefined; the merge module imports and references them.

New types (specification)

JaccardScore

A multiset-Jaccard similarity value, by construction in the closed range [0.0, 1.0].

/// Multiset Jaccard similarity between two bags of tokens.
///
/// By construction in [0.0, 1.0]. `JaccardScore::zero()` is the
/// no-overlap point; `JaccardScore::one()` is identical-bag.
///
/// Used by the speaker-id stage to score how well each donor
/// speaker matches a reference anchor's content.
#[derive(Clone, Copy, Debug, PartialEq, PartialOrd, Serialize, Deserialize, JsonSchema)]
#[serde(try_from = "f64", into = "f64")]
pub struct JaccardScore(f64);

impl JaccardScore {
    pub fn new(v: f64) -> Result<Self, JaccardScoreError>;
    pub fn zero() -> Self;
    pub fn one() -> Self;
    pub fn value(self) -> f64;
}

impl Display for JaccardScore { /* "0.735" three-digit */ }
impl TryFrom<f64> for JaccardScore { /* validates range */ }
impl From<JaccardScore> for f64 { /* infallible widen */ }

Construction is fallible: JaccardScore::new(1.5) returns Err(JaccardScoreError::OutOfRange(1.5)). NaN is also rejected. Internal computation that is guaranteed in-range by construction (the multiset formula) uses a private from_unchecked constructor; the public API is fallible.
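
One possible body for the fallible constructor, as a sketch without the serde and thiserror derives. The `OutOfRange` variant comes from the text above; the `NotANumber` variant name is an assumption, since the spec only says NaN is rejected.

```rust
#[derive(Clone, Copy, Debug, PartialEq, PartialOrd)]
struct JaccardScore(f64);

#[derive(Debug, PartialEq)]
enum JaccardScoreError {
    OutOfRange(f64),
    /// Hypothetical variant name for the NaN case.
    NotANumber,
}

impl JaccardScore {
    /// Accepts only finite values in the closed range [0.0, 1.0].
    fn new(v: f64) -> Result<Self, JaccardScoreError> {
        if v.is_nan() {
            Err(JaccardScoreError::NotANumber)
        } else if !(0.0..=1.0).contains(&v) {
            Err(JaccardScoreError::OutOfRange(v))
        } else {
            Ok(Self(v))
        }
    }

    fn value(self) -> f64 {
        self.0
    }
}
```

NaN is checked first because a NaN payload inside an error value would make the error compare unequal to itself.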

ConfidenceThreshold

The minimum Jaccard margin (winner / loser) the speaker-id stage will auto-accept. By construction in [1.0, ∞): a threshold below 1.0 makes no sense (it would mean the loser scores higher than the winner, which can’t happen). Default 2.0 per the empirical calibration recorded in chatter speaker-id.

#[derive(Clone, Copy, Debug, PartialEq, PartialOrd, Serialize, Deserialize, JsonSchema)]
#[serde(try_from = "f64", into = "f64")]
pub struct ConfidenceThreshold(f64);

impl ConfidenceThreshold {
    pub const DEFAULT: Self = Self(2.0);
    pub fn new(v: f64) -> Result<Self, ConfidenceThresholdError>;
    pub fn value(self) -> f64;
}

impl Default for ConfidenceThreshold {
    fn default() -> Self { Self::DEFAULT }
}

Margin

The decisive ratio between the highest-scoring speaker and the runner-up. Distinguished from ConfidenceThreshold by intent (this is observed; the threshold is configured) and from JaccardScore by range (margin is ≥ 1.0; score is ≤ 1.0).

Uses an enum rather than a bare float to model the divide-by-zero case (runner-up has zero Jaccard) cleanly. Avoids the f64::INFINITY sentinel that doesn’t round-trip through all serializers.

/// Ratio of winning speaker's score to runner-up's score.
///
/// `Finite(r)` for `r >= 1.0`. `Unbounded` when the runner-up
/// has zero score (winner scored anything, runner-up scored
/// nothing). Compares meaningfully against `ConfidenceThreshold`
/// regardless of variant.
#[derive(Clone, Copy, Debug, PartialEq, Serialize, Deserialize, JsonSchema)]
#[serde(untagged)]
pub enum Margin {
    Finite(f64),
    /// Serialized as the JSON/TOML string "unbounded"; never as
    /// f64::INFINITY (which round-trips inconsistently).
    Unbounded,
}

impl Margin {
    pub fn from_scores(winner: JaccardScore, loser: JaccardScore) -> Self;
    pub fn meets(self, threshold: ConfidenceThreshold) -> bool;
}

impl Display for Margin { /* "3.81x" or "∞" */ }
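
A sketch of how `from_scores`, `meets`, and `Display` might behave, using bare f64 values in place of JaccardScore and ConfidenceThreshold to stay self-contained. Two details are assumptions: the threshold comparison is inclusive (`>=`), and the finite display uses two decimal places to match the "3.81x" example.

```rust
use std::fmt;

#[derive(Clone, Copy, Debug, PartialEq)]
enum Margin {
    Finite(f64),
    Unbounded,
}

impl Margin {
    /// Assumes 0.0 <= loser <= winner <= 1.0 (guaranteed by JaccardScore
    /// and by ranking in the real code).
    fn from_scores(winner: f64, loser: f64) -> Self {
        if loser == 0.0 {
            Margin::Unbounded
        } else {
            Margin::Finite(winner / loser)
        }
    }

    /// An unbounded margin meets every threshold.
    fn meets(self, threshold: f64) -> bool {
        match self {
            Margin::Finite(ratio) => ratio >= threshold,
            Margin::Unbounded => true,
        }
    }
}

impl fmt::Display for Margin {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Margin::Finite(ratio) => write!(f, "{ratio:.2}x"),
            Margin::Unbounded => write!(f, "∞"),
        }
    }
}
```

Keeping the zero-denominator case as its own variant means no call site ever divides by zero or compares against an infinity sentinel.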

RetainSet

The set of speaker codes specified by --retain on chatter merge. A BTreeSet<SpeakerCode> wrapped in a newtype so the type signatures of merge functions communicate intent. Empty is allowed (means “no speakers come from File 1; File 1 contributes only headers”, a degenerate but legal case).

/// Speakers whose utterances come from the first input to
/// `chatter merge`. All other speakers come from the second
/// input.
#[derive(Clone, Debug, Default, PartialEq, Eq, Serialize, Deserialize, JsonSchema)]
pub struct RetainSet(BTreeSet<SpeakerCode>);

impl RetainSet {
    pub fn new() -> Self;
    pub fn from_iter<I: IntoIterator<Item = SpeakerCode>>(it: I) -> Self;
    pub fn contains(&self, code: &SpeakerCode) -> bool;
    pub fn iter(&self) -> impl Iterator<Item = &SpeakerCode>;
    pub fn is_empty(&self) -> bool;
}

impl FromStr for RetainSet {
    type Err = RetainSetParseError;
    /// Parses `"CHI,SI2"` → `{CHI, SI2}`. Empty entries rejected.
    fn from_str(s: &str) -> Result<Self, Self::Err>;
}
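
A possible `FromStr` body, sketched with `String` standing in for SpeakerCode. The error variant name `EmptyEntry` is hypothetical, and this sketch also rejects a completely empty input string, which the spec does not settle.

```rust
use std::collections::BTreeSet;
use std::str::FromStr;

/// `String` stands in for SpeakerCode in this sketch.
#[derive(Debug, Default, PartialEq, Eq)]
struct RetainSet(BTreeSet<String>);

#[derive(Debug, PartialEq, Eq)]
enum RetainSetParseError {
    /// Hypothetical variant: a comma-separated entry was empty.
    EmptyEntry,
}

impl FromStr for RetainSet {
    type Err = RetainSetParseError;

    /// Parses "CHI,SI2" into {CHI, SI2}; duplicates collapse.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let mut codes = BTreeSet::new();
        for part in s.split(',') {
            if part.is_empty() {
                return Err(RetainSetParseError::EmptyEntry);
            }
            codes.insert(part.to_string());
        }
        Ok(RetainSet(codes))
    }
}
```

With `FromStr` implemented, a CLI layer can call `"CHI,SI2".parse::<RetainSet>()` directly instead of splitting strings in command code.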

InsertedRole

The CHAT code + role-tag pair to assign to renamed speakers in the speaker-id stage. A struct rather than two function arguments because the pair is meaningful as a unit (in TOML override files it serializes as a nested table; in CLI it parses as CODE:TAG).

/// The CHAT identity to assign to non-anchor speakers in the
/// speaker-id stage. Example: `INV:Investigator`, `MOT:Mother`.
#[derive(Clone, Debug, PartialEq, Eq, Serialize, Deserialize, JsonSchema)]
pub struct InsertedRole {
    pub code: SpeakerCode,
    pub tag: ParticipantRole,
}

impl InsertedRole {
    pub fn investigator() -> Self;    // INV:Investigator
    pub fn mother() -> Self;          // MOT:Mother
    pub fn father() -> Self;          // FAT:Father
    pub fn adult() -> Self;           // PAR:Adult
}

impl FromStr for InsertedRole {
    type Err = InsertedRoleParseError;
    /// Parses `"INV:Investigator"`. Both halves required.
    fn from_str(s: &str) -> Result<Self, Self::Err>;
}

impl Display for InsertedRole { /* "INV:Investigator" */ }

The convenience constructors (investigator(), mother(), etc.) are the closed-set anchor points; arbitrary InsertedRole { code, tag } is also allowed for contributor-specific roles.
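
The CODE:TAG parsing and display can be sketched as a round-trip pair. `String` stands in for SpeakerCode and ParticipantRole, and the three error variant names are hypothetical; the spec only requires that both halves be present.

```rust
use std::fmt;
use std::str::FromStr;

/// `String` fields stand in for SpeakerCode and ParticipantRole.
#[derive(Clone, Debug, PartialEq, Eq)]
struct InsertedRole {
    code: String,
    tag: String,
}

/// Hypothetical variant names.
#[derive(Debug, PartialEq, Eq)]
enum InsertedRoleParseError {
    MissingColon,
    EmptyCode,
    EmptyTag,
}

impl FromStr for InsertedRole {
    type Err = InsertedRoleParseError;

    /// Parses "INV:Investigator"; both halves are required.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let (code, tag) = s
            .split_once(':')
            .ok_or(InsertedRoleParseError::MissingColon)?;
        if code.is_empty() {
            return Err(InsertedRoleParseError::EmptyCode);
        }
        if tag.is_empty() {
            return Err(InsertedRoleParseError::EmptyTag);
        }
        Ok(InsertedRole { code: code.to_string(), tag: tag.to_string() })
    }
}

impl fmt::Display for InsertedRole {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}:{}", self.code, self.tag)
    }
}
```

Keeping `Display` as the exact inverse of `FromStr` lets diagnostics echo a value in the same form the operator typed it.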

MappingAction

What happens to a particular speaker in the input under a SpeakerMapping. Enum (not boolean) to avoid blindness and to leave room for future variants (e.g. RenameTo { code, tag } when multi-role renaming becomes a need).

/// Action to apply to one speaker in a SpeakerMapping.
#[derive(Clone, Debug, PartialEq, Eq, Serialize, Deserialize, JsonSchema)]
#[serde(rename_all = "lowercase")]
pub enum MappingAction {
    /// Remove this speaker's utterances and its @Participants /
    /// @ID rows entirely.
    Drop,
    /// Rename this speaker to the mapping's `inserted_role.code`.
    /// Rewrites speaker codes on every utterance and the
    /// corresponding @Participants and @ID entries.
    Rename,
}

The TOML serialization uses "drop" / "rename" lowercase strings, matching the override-file format documented in speaker-id.md.

SpeakerMapping

The decision record produced by the speaker-id stage and consumed by the speaker-id apply step. Carries enough information to apply deterministically to a ChatFile.

/// A decision about how to relabel a ChatFile's speakers.
///
/// Produced by `identify_mapping` (reference mode, auto), by the
/// `--mapping` flag parser (explicit mode), or by reading an
/// override-file entry (override mode). Consumed by `apply_mapping`,
/// which rewrites a ChatFile per the assignments.
///
/// All speakers in the input must appear as keys in `assignments`;
/// there is no defaulting. This is a precondition checked at apply
/// time and is intentional (we want every decision to be explicit).
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize, JsonSchema)]
pub struct SpeakerMapping {
    /// The CHAT identity assigned to every speaker whose action
    /// is `MappingAction::Rename`. All renamed speakers go to
    /// the same role in v1 of this schema.
    pub inserted_role: InsertedRole,

    /// Per-speaker action. Use BTreeMap for deterministic
    /// serialization order.
    pub assignments: BTreeMap<SpeakerCode, MappingAction>,
}

The “single inserted_role across all renamed speakers” constraint matches the doc and keeps the most-common case clean. Future multi-role-rename use cases (a 3-speaker file where two get different roles) extend MappingAction with a RenameTo variant rather than changing this struct’s shape.
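
The apply-time precondition can be sketched over a toy transcript of (speaker, text) pairs. The function name `apply_mapping_sketch`, the `MappingError` enum, and the use of `&str` for speaker codes are hypothetical; the two error cases mirror the SpeakerNotInMapping and MappingSpeakerNotInInput variants specified under Error types.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum MappingAction {
    Drop,
    Rename,
}

#[derive(Debug, PartialEq, Eq)]
enum MappingError {
    SpeakerNotInMapping(String),
    MappingSpeakerNotInInput(String),
}

/// Applies the assignments to (speaker, text) pairs. Every input speaker
/// must have an explicit action, and every key must occur in the input.
fn apply_mapping_sketch<'a>(
    utterances: &[(&'a str, &'a str)],
    inserted_code: &'a str,
    assignments: &BTreeMap<&str, MappingAction>,
) -> Result<Vec<(&'a str, &'a str)>, MappingError> {
    for key in assignments.keys() {
        if !utterances.iter().any(|(speaker, _)| speaker == key) {
            return Err(MappingError::MappingSpeakerNotInInput(key.to_string()));
        }
    }
    let mut out = Vec::new();
    for &(speaker, text) in utterances {
        match assignments.get(speaker) {
            None => return Err(MappingError::SpeakerNotInMapping(speaker.to_string())),
            Some(MappingAction::Drop) => {}
            Some(MappingAction::Rename) => out.push((inserted_code, text)),
        }
    }
    Ok(out)
}
```

Failing on an unmapped speaker, instead of passing it through unchanged, is what makes every relabeling decision visible in the mapping record.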

DecisionMode

How a MergeOverride entry came to exist. Three variants matching the three speaker-id operation modes.

#[derive(Clone, Copy, Debug, PartialEq, Eq, Serialize, Deserialize, JsonSchema)]
#[serde(rename_all = "lowercase")]
pub enum DecisionMode {
    /// Reference-mode auto-decide with Jaccard above threshold.
    Auto,
    /// Operator supplied --mapping directly on a one-off run.
    Explicit,
    /// Read from a prior override-file entry; this is a replay.
    Override,
}

MergeFlag

Extensible operator-supplied flags on an override entry. Closed variants for known cases plus a Custom(String) escape hatch.

#[derive(Clone, Debug, PartialEq, Eq, Serialize, Deserialize, JsonSchema)]
#[serde(rename_all = "kebab-case")]
pub enum MergeFlag {
    /// ASR diarization mixed multiple real-world roles into one
    /// speaker label. The rename may still be the best available
    /// approximation but the output is imperfect.
    DiarizationMixed,
    /// The operator could not confidently determine which speaker
    /// is which; mapping is best-guess.
    BestGuess,
    /// Open variant for contributor-specific flag vocabulary.
    /// Serializes as the inner string verbatim.
    #[serde(untagged)]
    Custom(String),
}

OperatorId

Who made the decision. String newtype.

string_newtype!(
    /// Identifier of the operator who created an override entry.
    /// Free-form; typically a username or initials. Recorded as
    /// audit trail.
    pub struct OperatorId;
);

SessionId

Identifies an entry within an override file. Typically the basename stem of the input CHAT file, but the override-file schema doesn’t constrain its shape; contributors may use any stable identifier they like (<participant>-<timepoint>, <recording-id>, etc.).

string_newtype!(
    /// Identifies a session within an override file. Free-form
    /// stable string; typically the CHAT-file basename stem.
    pub struct SessionId;
);

MergeOverride

A single per-session decision record. The unit of operator adjudication.

/// One per-session decision in an override file.
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize, JsonSchema)]
pub struct MergeOverride {
    pub mode: DecisionMode,
    pub mapping: SpeakerMapping,

    /// Per-speaker Jaccard scores recorded for audit. Present
    /// when the entry was produced by reference mode or by an
    /// explicit mode that followed a reference attempt.
    #[serde(skip_serializing_if = "BTreeMap::is_empty", default)]
    pub scores: BTreeMap<SpeakerCode, JaccardScore>,

    /// The decisive margin, if available.
    #[serde(skip_serializing_if = "Option::is_none", default)]
    pub margin: Option<Margin>,

    pub operator: OperatorId,
    pub decided_at: DateTime<Utc>,

    /// Operator note. Highly recommended for non-auto decisions.
    #[serde(skip_serializing_if = "Option::is_none", default)]
    pub note: Option<String>,

    /// Flags marking unusual situations.
    #[serde(skip_serializing_if = "Vec::is_empty", default)]
    pub flags: Vec<MergeFlag>,
}

The struct embeds the timestamp via chrono::DateTime<Utc>; serde serializes to RFC 3339 (2026-05-27T08:41:00Z) by default. TOML preserves this format faithfully.

OverrideFile

The top-level container. Holds schema version + per-session entries. Read from / written to disk as TOML.

/// Top-level override-file container.
#[derive(Clone, Debug, Default, PartialEq, Serialize, Deserialize, JsonSchema)]
pub struct OverrideFile {
    /// Schema version. Currently 1. Reader refuses unknown
    /// versions with a typed error rather than guessing.
    pub schema_version: u32,

    /// Per-session entries. BTreeMap for deterministic
    /// on-disk ordering.
    #[serde(flatten)]
    pub entries: BTreeMap<SessionId, MergeOverride>,
}

impl OverrideFile {
    pub const CURRENT_SCHEMA_VERSION: u32 = 1;

    /// Read an override file from a path. Refuses unknown
    /// schema versions.
    pub fn read(path: &Path) -> Result<Self, OverrideFileError>;

    /// Write the override file to a path, replacing the file
    /// atomically.
    pub fn write(&self, path: &Path) -> Result<(), OverrideFileError>;

    /// Read an override file if it exists, else return an
    /// empty file at the current schema version. Used by the
    /// `--write-override` append flow.
    pub fn read_or_default(path: &Path) -> Result<Self, OverrideFileError>;

    pub fn get(&self, id: &SessionId) -> Option<&MergeOverride>;
    pub fn insert(&mut self, id: SessionId, entry: MergeOverride);
}

The #[serde(flatten)] on entries means the on-disk TOML is flat tables keyed by session ID (as shown in the speaker-id.md schema):

schema_version = 1

[NF203-2]
mode = "auto"
# ...

rather than nested under an [entries] table.

Error types

Three thiserror-based enums cover the merge pipeline’s failure modes. Each variant carries enough information for the CLI to produce a useful diagnostic and for callers to pattern-match behavior.

SpeakerIdError

#[derive(Debug, thiserror::Error)]
pub enum SpeakerIdError {
    #[error("reference file has no utterances for anchor speaker {anchor}")]
    AnchorMissingInReference { anchor: SpeakerCode },

    #[error("input has only {n} distinct speakers; speaker-id requires at least 2")]
    InsufficientSpeakers { n: usize },

    #[error("Jaccard margin {margin} is below confidence threshold {threshold}; scores={scores:?}")]
    LowConfidence {
        scores: BTreeMap<SpeakerCode, JaccardScore>,
        threshold: ConfidenceThreshold,
        margin: Margin,
    },

    #[error("speaker {speaker} present in input but not covered by --mapping")]
    SpeakerNotInMapping { speaker: SpeakerCode },

    #[error("--mapping references speaker {speaker} not present in input")]
    MappingSpeakerNotInInput { speaker: SpeakerCode },

    #[error("override file has no entry for session {session}")]
    OverrideEntryMissing { session: SessionId },

    #[error("parse error reading input: {0}")]
    Parse(#[from] talkbank_parser::ParseError),

    #[error("override file I/O: {0}")]
    OverrideIo(#[from] OverrideFileError),
}

The LowConfidence variant is the only “soft” failure: the caller (CLI) maps it to exit code 4 and prints the scores. Every other variant maps to exit code 1 or 2 per the user-guide contract.

MergeError

#[derive(Debug, thiserror::Error)]
pub enum MergeError {
    #[error("File 1 declares no utterances for retain set {retain:?}")]
    RetainSpeakersMissing { retain: RetainSet },

    #[error("File 1 has no time-bulleted utterances; cannot merge against a shared timeline")]
    NoTimelineInFile1,

    #[error("File 1 @Languages = {file1}, File 2 @Languages = {file2}; merge requires matching language")]
    LanguageMismatch {
        file1: LanguageCode,
        file2: LanguageCode,
    },

    #[error("speaker {speaker} appears in both files but is not in --retain; specify --retain to disambiguate")]
    AmbiguousSpeaker { speaker: SpeakerCode },

    #[error("parse error: {0}")]
    Parse(#[from] talkbank_parser::ParseError),
}

OverrideFileError

Independent enum because override-file I/O is also called by non-speaker-id code paths (the orchestrator, future adjudication UIs).

#[derive(Debug, thiserror::Error)]
pub enum OverrideFileError {
    #[error("override file not found at {path}")]
    NotFound { path: PathBuf },

    #[error("override file at {path} has schema_version={found}, this binary supports {supported}")]
    UnsupportedSchemaVersion {
        path: PathBuf,
        found: u32,
        supported: u32,
    },

    #[error("override file at {path} failed to parse: {source}")]
    Parse {
        path: PathBuf,
        #[source]
        source: toml::de::Error,
    },

    #[error("override file at {path} failed to write: {source}")]
    Write {
        path: PathBuf,
        #[source]
        source: std::io::Error,
    },

    #[error("I/O reading override file at {path}: {source}")]
    Io {
        path: PathBuf,
        #[source]
        source: std::io::Error,
    },
}

Module layout

talkbank-model/src/merge/
    mod.rs            # pub re-exports
    scoring.rs        # JaccardScore, ConfidenceThreshold, Margin
    role.rs           # InsertedRole, MappingAction
    mapping.rs        # SpeakerMapping
    retain.rs         # RetainSet
    override_file.rs  # DecisionMode, MergeFlag, OperatorId,
                      # SessionId, MergeOverride, OverrideFile
    errors.rs         # SpeakerIdError, MergeError, OverrideFileError

Each file aims for the ≤400-line target; if any grows we split further (override_file/ becomes a directory with separate files for the schema, the I/O, and the version-migration logic).

Type design rules followed

A spot-check against the cross-cutting design rules in this repo’s root CLAUDE.md:

  • Newtypes over primitives. Every numeric domain value (JaccardScore, ConfidenceThreshold, Margin) is wrapped; every string domain value (SessionId, OperatorId, SpeakerCode, ParticipantRole) is wrapped or reused from existing wrappers. ✓
  • No tuple-packed seams. InsertedRole is a struct, not (SpeakerCode, ParticipantRole). MergeOverride likewise. ✓
  • No boolean blindness. MappingAction, DecisionMode, MergeFlag are enums, not bools. Margin::Finite/Unbounded is an enum, not Option<f64> or f64::INFINITY. ✓
  • Typed errors. Three thiserror enums with named-field variants carrying full context. ✓
  • Deterministic seams. BTreeMap/BTreeSet for every serialized collection. ✓
  • Module browseability. Seven files in merge/: mod.rs for re-exports plus six modules, each scoped to one concern. ✓
  • Default impls present where meaningful. ConfidenceThreshold::DEFAULT = 2.0; OverrideFile::default() for the empty-file case. ✓
  • Display impls present where user-visible. JaccardScore, Margin, InsertedRole. ✓
  • FromStr parsers at CLI boundary, not regex hacks in command code. RetainSet::from_str, InsertedRole::from_str, and a parse_mapping_spec helper for --mapping. ✓
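The "newtypes over primitives" and "no boolean blindness" rules can be illustrated with a minimal sketch of Margin and ConfidenceThreshold. The names come from this doc, but the bodies, the String error type, and the lower bound of 1.0 (inferred from the confidence_threshold_rejects_below_1 test name) are assumptions, not the real implementation.

```rust
/// Confidence margin between the winning and the runner-up donor speaker.
/// An enum rather than `Option<f64>` or `f64::INFINITY`, so the
/// "runner-up scored zero" case is named at every match site.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Margin {
    Finite(f64),
    Unbounded,
}

/// Minimum margin required before a mapping is accepted automatically.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ConfidenceThreshold(f64);

impl ConfidenceThreshold {
    pub const DEFAULT: ConfidenceThreshold = ConfidenceThreshold(2.0);

    /// Assumption: thresholds must be finite and at least 1.0.
    /// The String error is a placeholder for a typed error.
    pub fn new(value: f64) -> Result<Self, String> {
        if !value.is_finite() || value < 1.0 {
            Err(format!("threshold must be finite and >= 1.0, got {}", value))
        } else {
            Ok(ConfidenceThreshold(value))
        }
    }

    pub fn value(self) -> f64 {
        self.0
    }
}

impl Margin {
    /// `>=` comparison: a margin exactly at the threshold passes.
    pub fn meets(self, threshold: ConfidenceThreshold) -> bool {
        match self {
            Margin::Finite(m) => m >= threshold.value(),
            Margin::Unbounded => true,
        }
    }
}
```

Because Unbounded is its own variant, a caller cannot forget the zero-runner-up case: the match in meets has to handle it explicitly.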

Decisions on the seven open questions

Resolved 2026-05-27, captured here so implementers don’t re-litigate.

1. JaccardScore representation: f64

Multiset Jaccard J(A, B) = sum_w min(A[w], B[w]) / sum_w max(A[w], B[w]) is computed from u64 token counts, which fit in f64’s 53-bit mantissa for any plausible CHAT bag-of-words. The division is inexact in general but IEEE 754 makes it bit-deterministic given the same inputs across every platform that implements 754 (all of ours: Windows, macOS, Linux, x86_64, arm64).

The bit-deterministic reproducibility property is load-bearing because the override-file audit trail records scores; a researcher re-running speaker-id years later on the same inputs must compute the same score to verify the decision. f64 arithmetic provides this for free given workspace platform constraints. Document the property in the type’s rustdoc.

A rational u64/u64 representation was considered for “true” reproducibility, but it adds boilerplate and a comparison-against-threshold operation that loses the same precision in the end (the threshold is a ratio too). Rejected.
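A minimal sketch of the f64-backed newtype, assuming a String error type and a hypothetical from_counts constructor (neither is specified in this doc). It shows range validation plus the bit-level equality that the audit trail relies on.

```rust
/// Multiset Jaccard score, always within [0.0, 1.0] and never NaN.
#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
pub struct JaccardScore(f64);

impl JaccardScore {
    /// Rejects NaN and out-of-range values at the construction boundary.
    pub fn new(value: f64) -> Result<Self, String> {
        if value.is_nan() || !(0.0..=1.0).contains(&value) {
            return Err(format!("Jaccard score must be in [0, 1], got {}", value));
        }
        Ok(JaccardScore(value))
    }

    /// Hypothetical helper: builds the score from the two integer sums.
    /// An empty union is scored 0.0, matching the jaccard-empty-empty spec.
    pub fn from_counts(min_sum: u64, max_sum: u64) -> Result<Self, String> {
        if max_sum == 0 {
            return Self::new(0.0);
        }
        Self::new(min_sum as f64 / max_sum as f64)
    }

    pub fn value(self) -> f64 {
        self.0
    }

    /// Raw IEEE 754 bit pattern, useful for byte-exact audit comparisons.
    pub fn to_bits(self) -> u64 {
        self.0.to_bits()
    }
}
```

Comparing to_bits() rather than the float value makes a reproducibility check explicit about what "same score" means.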

2. DateTime<Utc> crate: chrono

The workspace already pins chrono = "0.4" at the root Cargo.toml. talkbank-model::merge uses the workspace version verbatim via chrono = { workspace = true }. No new datetime dep.

Two rules point the same way: the “succession-aware” rule from the workspace-root CLAUDE.md contributor guide (outside the book) and the analogous feedback_no_terraform_only_opentofu discipline from operator memory both say not to fragment the ecosystem by introducing a second tool when a workspace tool already does the job. jiff is a fine library, but adopting it for one new module would mean two datetime crates in tree.

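Override-file timestamps serialize as RFC 3339 UTC. With chrono’s serde feature enabled, the default Serialize/Deserialize impls for DateTime<Utc> already read and write RFC 3339 strings, so no #[serde(with = ...)] attribute is needed (the chrono::serde::ts_* helper modules cover integer Unix timestamps instead).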

3. TOML library: toml (the workspace-pinned crate)

The workspace already pins toml = "^1.1.2". That crate both reads and writes, so there is no need to combine toml and toml_edit for the v1 override-file format.

toml_edit was considered for its formatting/comment preservation across in-place edits. The case for it is hypothetical right now: override files are primarily machine-written by chatter speaker-id --write-override; human edits exist but are not the dominant workflow. The cost of toml_edit is the second TOML dep (workspace churn, plus the friction every contributor pays parsing TOML through one API and writing through another).

If a workflow emerges where operators heavily hand-edit override files and lose formatting on each batch re-run, swap to toml_edit then. Defer.

4. MergeOverride::flags: Vec<MergeFlag>

Operator-supplied flags are semantically set-like (each flag present or absent), but Vec is the right representation because:

  • MergeFlag includes a Custom(String) #[serde(untagged)] variant. Keeping flags in a BTreeSet would require an Ord impl on this enum that compares the discriminant plus the inner string. Doable but adds maintenance load.
  • The order of flags in the on-disk file isn’t load-bearing for correctness; deterministic single-source-write produces a deterministic Vec.
  • Duplicates are noise but not corrupting. Document in the field’s rustdoc that consumers should treat as set semantics (deduplicate before comparing).

The writer (speaker-id --write-override path) inserts flags in a deterministic order, so the on-disk Vec is fully reproducible. If a hand-edited file has an out-of-order or duplicated flag list, that shows up as non-corrupting noise in subsequent diffs, which is acceptable.
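The "treat as set semantics" guidance for consumers could look like the sketch below. The MergeFlag shown has only the two variants named in this doc, and same_flag_set is a hypothetical helper, not an existing API.

```rust
use std::collections::HashSet;

/// Simplified stand-in for the real enum: one known variant plus the
/// free-form `Custom` escape hatch.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum MergeFlag {
    DiarizationMixed,
    Custom(String),
}

/// Compares two flag lists as sets: order and duplicates are ignored.
pub fn same_flag_set(a: &[MergeFlag], b: &[MergeFlag]) -> bool {
    let a: HashSet<&MergeFlag> = a.iter().collect();
    let b: HashSet<&MergeFlag> = b.iter().collect();
    a == b
}
```

The on-disk representation stays a Vec; only the comparison collapses duplicates and ordering.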

5. SpeakerMapping::assignments: BTreeMap<SpeakerCode, MappingAction>

Confirmed. BTreeMap gives:

  • One-action-per-speaker by construction (no duplicate keys).
  • Deterministic serialization order (alphabetical by SpeakerCode).
  • Cheap membership tests during apply.

The CLAUDE.md “no tuple-packed seams” rule targets raw tuples as struct fields or function arguments. A BTreeMap’s internal key-value pairing is not a domain seam exposed to the API; it’s the representation. Approved.

6. Schema versioning policy: strict refuse-with-clear-error

OverrideFile::read refuses any schema_version != CURRENT_SCHEMA_VERSION with a typed OverrideFileError::UnsupportedSchemaVersion { found, supported }. No automatic migration in v1.

This is the conservative default. Reasons:

  • We have no upgrade history yet; building a migration framework for a problem that doesn’t exist is premature abstraction (CLAUDE.md “Always Fix Root Causes” + the general “no premature abstraction” instinct).
  • The override file is fundamentally a record of operator decisions. If the schema breaks, operators re-adjudicate; the prior file becomes a historical artifact that can be read by scripts with old binaries.
  • When a real schema change lands and there is real upgrade friction, that’s the moment to write a one-shot migration (chatter merge migrate-overrides --from <path> --to <path>). Until that happens, premature migration code is dead weight.

Document this in OverrideFile::read’s rustdoc so the policy is explicit to callers.
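The refuse-with-clear-error policy reduces to a single equality check. The sketch below hand-writes the Display impl so it compiles with std only (the real enum derives it via thiserror); check_schema_version is a hypothetical helper, and CURRENT_SCHEMA_VERSION = 1 is taken from the test plan's supported: 1 expectation.

```rust
use std::fmt;
use std::path::{Path, PathBuf};

pub const CURRENT_SCHEMA_VERSION: u32 = 1;

/// Only the variant relevant to version checking is reproduced here.
#[derive(Debug, PartialEq)]
pub enum OverrideFileError {
    UnsupportedSchemaVersion { path: PathBuf, found: u32, supported: u32 },
}

impl fmt::Display for OverrideFileError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            OverrideFileError::UnsupportedSchemaVersion { path, found, supported } => write!(
                f,
                "override file at {} has schema_version={}, this binary supports {}",
                path.display(),
                found,
                supported
            ),
        }
    }
}

impl std::error::Error for OverrideFileError {}

/// Strict policy: any version other than the current one is refused.
/// No migration is attempted.
pub fn check_schema_version(path: &Path, found: u32) -> Result<(), OverrideFileError> {
    if found == CURRENT_SCHEMA_VERSION {
        Ok(())
    } else {
        Err(OverrideFileError::UnsupportedSchemaVersion {
            path: path.to_path_buf(),
            found,
            supported: CURRENT_SCHEMA_VERSION,
        })
    }
}
```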

7. Where the --mapping parser lives: talkbank-model::merge::mapping

parse_mapping_spec("PAR0=drop,PAR1=INV:Investigator") -> Result<SpeakerMapping, MappingSpecParseError> lives in the model crate alongside the SpeakerMapping type it returns.

Why:

  • The spec format is part of the type’s contract. A reader looking for “how do I construct a SpeakerMapping from a string?” should find the answer where the type is defined, not in the consumer CLI crate.
  • A future non-CLI consumer (HTTP API, library wrapper, scripting binding) wants the same parser without re-implementing or depending on chatter.
  • The model crate has no CLI-framework dependency (no clap), but a free function returning Result<SpeakerMapping, _> doesn’t need one. The clap value-parser in chatter becomes a thin shim: fn clap_mapping_value(s: &str) -> Result<SpeakerMapping, String> { parse_mapping_spec(s).map_err(|e| e.to_string()) }.

If at some point a SECOND mapping syntax becomes useful (e.g., JSON-inline, or a TOML fragment), add a parse_mapping_json sibling rather than reshaping parse_mapping_spec. The existing parser stays the lingua franca.
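A minimal sketch of what such a parser might look like, under several assumptions: speaker codes are plain Strings rather than SpeakerCode, MappingAction::Rename carries no payload, the error variants are invented for illustration, and a drop-only spec is accepted with no inserted role (the test plan leaves that choice open).

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct InsertedRole {
    pub code: String,
    pub tag: String,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum MappingAction {
    Drop,
    Rename,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SpeakerMapping {
    pub assignments: BTreeMap<String, MappingAction>,
    pub inserted_role: Option<InsertedRole>,
}

/// Hypothetical variants; the real error type is not specified in this doc.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum MappingSpecParseError {
    Empty,
    MalformedItem(String),
    MalformedRole(String),
    DuplicateSpeaker(String),
    ConflictingRoles,
}

pub fn parse_mapping_spec(spec: &str) -> Result<SpeakerMapping, MappingSpecParseError> {
    if spec.trim().is_empty() {
        return Err(MappingSpecParseError::Empty);
    }
    let mut assignments = BTreeMap::new();
    let mut inserted_role: Option<InsertedRole> = None;
    for item in spec.split(',') {
        // Each item is SPEAKER=drop or SPEAKER=CODE:Tag.
        let (speaker, action) = item
            .split_once('=')
            .ok_or_else(|| MappingSpecParseError::MalformedItem(item.to_string()))?;
        if speaker.is_empty() {
            return Err(MappingSpecParseError::MalformedItem(item.to_string()));
        }
        let parsed = if action == "drop" {
            MappingAction::Drop
        } else {
            let (code, tag) = action
                .split_once(':')
                .ok_or_else(|| MappingSpecParseError::MalformedRole(action.to_string()))?;
            if code.is_empty() || tag.is_empty() {
                return Err(MappingSpecParseError::MalformedRole(action.to_string()));
            }
            let role = InsertedRole { code: code.to_string(), tag: tag.to_string() };
            // v1 allows a single inserted role per mapping.
            if let Some(existing) = &inserted_role {
                if *existing != role {
                    return Err(MappingSpecParseError::ConflictingRoles);
                }
            }
            inserted_role = Some(role);
            MappingAction::Rename
        };
        if assignments.insert(speaker.to_string(), parsed).is_some() {
            return Err(MappingSpecParseError::DuplicateSpeaker(speaker.to_string()));
        }
    }
    Ok(SpeakerMapping { assignments, inserted_role })
}
```

Because the function depends only on std, a CLI shim, an HTTP handler, or a scripting binding could all call it unchanged.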


These decisions are the design baseline going into spec authoring and implementation. Future revisions to any of them require an explicit doc update plus a deprecation/migration plan, not a silent change in the implementation.

Relationship to specs and tests

Every type in this doc gets a spec entry in spec/constructs/merge-types/ once we move to implementation: one spec per type/invariant pair, regenerated into Rust tests via the current spec/tools generators. Spec authoring sits between this doc and the Rust implementation; types are designed here, behavior is pinned by specs, code follows. The spec entries are also where behavioral invariants (e.g. “JaccardScore::new(NaN) → Err”) become regression gates rather than rustdoc-only contracts.

Merge Pipeline, Test Plan

Status: Draft Last modified: 2026-05-30 06:55 EDT

This page is the test-coverage roadmap for the new merge pipeline (chatter speaker-id + chatter merge + chatter adjudicate + the override-file format + the underlying talkbank-model::merge types). It exists because, per this repo’s root CLAUDE.md red/green TDD rule, every new feature starts with failing tests at the highest level the feature lives at, and we want to enumerate those tests before writing the implementation, so coverage is designed, not discovered.

This is a plan, not yet code. When the implementation work begins, every test case below becomes a real test; the doc then flips to a coverage matrix that gets kept honest by CI.

TDD discipline, what “strict red/green” means here

Every cycle of impl-phase work is:

  1. RED. Write ONE failing test at the highest layer the feature lives at. The test exercises a real user-observable behavior, not an internal helper. Commit the failing test alone (or stage it before any code change), then verify that it fails for the right reason (the missing behavior), not because of a compile error or a typo.
  2. GREEN. Write the smallest code change that makes the test pass. No anticipating future tests, no scaffolding for tests that don’t yet exist. The codebase should compile and pass tests at this point.
  3. REFACTOR. With the green test as the safety net, tighten the implementation: extract helpers, rename for clarity, replace primitives with newtypes, document tricky parts. Tests stay green throughout.
  4. DRILL DOWN if needed. If the L3 (or L2) test passes but pinned the behavior less precisely than the contract requires (e.g., the L3 test asserts “exit 2 with some error” but the contract says “the specific MergeError variant must match”), add an L2 (or L1) test next that drills into the precise path. The drilled test FAILS at first against the green-but-imprecise impl, motivating the tighter impl.

Cycles must be atomic: one RED → one GREEN → optional REFACTOR → optional drill-down. Do not stack multiple tests on top of a single impl change; do not write impl ahead of tests. The discipline matters because the bug bar of this pipeline is high (CHAT-data byte-stable preservation, audit-trail reproducibility) and TDD is the cheapest way to catch regressions before they ship.

Three test layers + the adjudication layer

The merge pipeline’s behavior spans four substrates with different testing mechanisms.

| Layer | Substrate | Why tests live here |
|---|---|---|
| L1, Spec / fragment | spec/constructs/speaker-id/ → current spec/tools generators | Token-cleaner behavior on CHAT fragments (markup strip for Jaccard scoring). Same mechanism that pins parser/grammar tests; regenerated regression. |
| L2, Transform / AST | crates/talkbank-transform/tests/ | Pure-Rust tests over parsed ChatFile values. identify_mapping, apply_mapping, merge, run_adjudication semantics on hand-built or parsed CHAT inputs. No process boundary. |
| L3, CLI / subprocess | crates/chatter/tests/merge_tests.rs (new) | End-to-end behavior of chatter speaker-id, chatter merge, and chatter adjudicate invoked as subprocesses (assert_cmd + predicates). Exit codes, flag parsing, file I/O, stderr formats. |
| L4, Scripted adjudication | crates/talkbank-transform/tests/adjudication_tests.rs + scripted prompter | Operator-decision paths in chatter adjudicate. Uses ScriptedPrompter injecting synthetic operator choices. See Adjudication Workflow for the prompter abstraction. |

L1 ⊂ L2 ⊂ L3 in terms of failure-mode coverage: a failing L1 test implies a failing L2 test which implies a failing L3 test. So when the same invariant could be tested at multiple layers, the starter test is the highest layer and lower-layer tests are supplements that pin the precise internal path. L4 sits beside L2/L3, same crate/file conventions but a dedicated layer because the prompter-injection pattern is specific to adjudication.

L1, Spec / fragment tests

Lives in spec/constructs/speaker-id/. Three subdirectories:

  • token-cleaner/: what the Jaccard tokenizer strips and keeps
  • jaccard-scoring/: fixed-input → fixed-score golden tests
  • mapping-application/: header rewrite rules on real fragments

L1.1, Token cleaner

Each spec is a CHAT main-tier fragment + the expected token list after cleaning. Behavior pinned: bracket markup stripped, angle-bracket retracing unwrapped, terminator variants discarded, &-... / &+... discarded, xxx/yyy/www discarded, 0 discarded, @l / @n / @c suffix dropped, _-compound split to spaces, punctuation stripped, lowercased, ≥2-char alpha filter, NAK bullets stripped.

| Spec | Input fragment | Expected tokens |
|---|---|---|
| clean-plain-utterance | *CHI:\thello world . | ["hello", "world"] |
| clean-strip-bracket-codes | *CHI:\thello [*] [/] world [//] . | ["hello", "world"] |
| clean-unwrap-angle-retrace | *CHI:\t<two of the> [//] three of the presents . | ["two", "of", "the", "three", "of", "the", "presents"] |
| clean-strip-fillers | *CHI:\t&-um &+pre something &-uh . | ["something"] |
| clean-strip-zero-and-paralinguistic | *CHI:\t0 [=! nodding] . | [] |
| clean-strip-unintelligible | *CHI:\txxx and yyy and www . | ["and", "and"] |
| clean-strip-bullets | *CHI:\thello world . \x150_1234\x15 | ["hello", "world"] |
| clean-special-form-suffix | *CHI:\tnaming l@l u@l l@l u@l . | ["naming"] |
| clean-compound-underscore | *CHI:\tValentine's_Day and Fruit_Loops . | ["valentine", "day", "and", "fruit", "loops"] |
| clean-terminator-variants | *CHI:\thello +//. world +... again +/. last ! | ["hello", "world", "again", "last"] |
| clean-overlap-markers | *CHI:\t↫here↫ and there . | ["here", "and", "there"] |
| clean-lowercase-filter | *CHI:\tHello World A I am . | ["hello", "world", "am"] |

Each spec file in spec/constructs/speaker-id/token-cleaner/ has the standard # name, ## Input, ## Expected tokens, and ## Metadata sections per the spec authoring template at spec/CLAUDE.md in the workspace root (outside the book).
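To make the cleaning rules concrete, here is a simplified sketch of a cleaner that is only meant to reproduce the rows in the table above. The function name clean_tokens and the two-pass structure are assumptions; real CHAT markup has many more cases than this handles.

```rust
/// Cleans one CHAT main-tier line into lowercase tokens for Jaccard scoring.
/// Simplified: handles only the markup classes listed in the L1.1 table.
pub fn clean_tokens(main_tier: &str) -> Vec<String> {
    // Drop the "*CODE:\t" speaker prefix when present.
    let body = match main_tier.find('\t') {
        Some(idx) => &main_tier[idx + 1..],
        None => main_tier,
    };

    // Pass 1: delete [bracket codes] and NAK-delimited (\x15) time bullets,
    // including any spaces inside them.
    let mut visible = String::new();
    let mut in_bracket = false;
    let mut in_bullet = false;
    for ch in body.chars() {
        match ch {
            '\u{15}' => in_bullet = !in_bullet,
            _ if in_bullet => {}
            '[' => in_bracket = true,
            ']' => in_bracket = false,
            _ if in_bracket => {}
            _ => visible.push(ch),
        }
    }

    // Pass 2: per-word filters.
    let mut tokens = Vec::new();
    for word in visible.split_whitespace() {
        // Fillers and fragments (&-um, &+pre) are discarded whole.
        if word.starts_with("&-") || word.starts_with("&+") {
            continue;
        }
        // Unintelligible / untranscribed placeholders.
        if word == "xxx" || word == "yyy" || word == "www" {
            continue;
        }
        // Drop special-form suffixes such as @l, @n, @c.
        let stem = word.split('@').next().unwrap_or("");
        // Every non-alphabetic character acts as a separator, which also
        // splits _-compounds and removes terminators, digits, and angle brackets.
        for piece in stem.split(|c: char| !c.is_alphabetic()) {
            if piece.chars().count() >= 2 {
                tokens.push(piece.to_lowercase());
            }
        }
    }
    tokens
}
```

Treating every non-alphabetic character as a separator is one way to get ["valentine", "day"] from Valentine's_Day: the stray "s" falls to the two-character filter.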

L1.2, Jaccard scoring

Fixed bag-of-tokens pairs with known multiset Jaccard. These guard against off-by-one errors in the sum_w min / sum_w max implementation and against any future “optimizations” that silently change scoring.

| Spec | Bag A | Bag B | Expected J(A,B) |
|---|---|---|---|
| jaccard-identical | {hello:2, world:1} | {hello:2, world:1} | 1.0 |
| jaccard-disjoint | {hello:1} | {world:1} | 0.0 |
| jaccard-empty-empty | {} | {} | 0.0 |
| jaccard-empty-nonempty | {} | {x:1} | 0.0 |
| jaccard-multiset-counts | {a:3, b:1} | {a:1, b:1} | 2/4 = 0.5 |
| jaccard-partial-overlap | {a:1, b:1, c:1} | {b:1, c:1, d:1} | 2/4 = 0.5 |
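The golden values above follow directly from the sum_w min / sum_w max definition. A minimal sketch, assuming bags are BTreeMap<String, u64> (the Bag alias and the bag helper are illustrative, not names from the codebase):

```rust
use std::collections::{BTreeMap, BTreeSet};

pub type Bag = BTreeMap<String, u64>;

/// Multiset Jaccard: sum of per-word minimum counts over sum of per-word
/// maximum counts. Two empty bags score 0.0 by convention.
pub fn multiset_jaccard(a: &Bag, b: &Bag) -> f64 {
    // Union of the vocabularies, visited in a deterministic order.
    let words: BTreeSet<&String> = a.keys().chain(b.keys()).collect();
    let mut min_sum: u64 = 0;
    let mut max_sum: u64 = 0;
    for w in words {
        let ca = a.get(w).copied().unwrap_or(0);
        let cb = b.get(w).copied().unwrap_or(0);
        min_sum += ca.min(cb);
        max_sum += ca.max(cb);
    }
    if max_sum == 0 {
        0.0
    } else {
        min_sum as f64 / max_sum as f64
    }
}

/// Convenience constructor for examples and tests.
pub fn bag(pairs: &[(&str, u64)]) -> Bag {
    pairs.iter().map(|(w, n)| (w.to_string(), *n)).collect()
}
```

For jaccard-multiset-counts the per-word minima are a:1 and b:1 (sum 2) and the maxima are a:3 and b:1 (sum 4), which gives 0.5.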

L1.3, Mapping application on fragments

Header-rewrite micro-tests. Each spec gives an input @Participants: or @ID: row and a small mapping; the expected output row is the rewritten form.

SpecInput rowMappingExpected output row
participants-rewrite-rename@Participants:\tPAR0 Participant, PAR1 ParticipantPAR0→INV:Investigator, PAR1→drop@Participants:\tINV Investigator
participants-preserve-name-token@Participants:\tCHI Alex Target_Child, PAR0 ParticipantPAR0→INV:Investigator@Participants:\tCHI Alex Target_Child, INV Investigator
id-rewrite-rename@ID:\teng|corpus_name|PAR0|||||Participant|||PAR0→INV:Investigator@ID:\teng|corpus_name|INV|||||Investigator|||
id-drop-removes-row@ID:\teng|...|PAR1|||||Participant|||PAR1→drop(row removed)
id-preserves-other-fields@ID:\teng|2|CHI|6;01.|female|NF||Target_Child|||(no-op for CHI)identical to input

L2, Transform / AST tests

Lives in crates/talkbank-transform/tests/. Three test files:

  • speaker_id_tests.rs
  • transcript_merge_tests.rs
  • override_file_tests.rs

Each tests behavior over parsed talkbank-model::ChatFile values, using inline synthetic CHAT strings parsed via talkbank_parser::parse_chat_file (no subprocess overhead).

L2.1, identify_mapping (reference mode)

TestScenarioAssertion
identify_mapping_clean_winnerReference has CHI saying content X; donor has PAR0 saying X verbatim and PAR1 saying unrelated contentReturns SpeakerMapping { drop: {PAR0}, rename: {PAR1: INV} }, margin >> 2.0
identify_mapping_borderline_refusesReference and both donor speakers share substantial vocabulary (margin < 2.0)Returns Err(SpeakerIdError::LowConfidence { scores, threshold, margin })
identify_mapping_anchor_missingReference has no utterances tagged with anchor speakerReturns Err(SpeakerIdError::AnchorMissingInReference { anchor: CHI })
identify_mapping_single_speaker_donorDonor has only one speakerReturns Err(SpeakerIdError::InsufficientSpeakers { n: 1 })
identify_mapping_threshold_at_exact_valueConstructed donor where margin = 2.0 exactly with threshold 2.0Returns Ok(_) (≥ comparison, not strict >)
identify_mapping_threshold_below_exact_valueMargin = 1.9999 with threshold 2.0Returns Err(SpeakerIdError::LowConfidence)
identify_mapping_unbounded_marginDonor PAR1 has Jaccard 0 against reference; PAR0 > 0Returns Ok(_) with margin = Margin::Unbounded
identify_mapping_deterministicSame inputs, repeated callIdentical SpeakerMapping byte-for-byte (BTreeMap ordering)

L2.2, apply_mapping

TestScenarioAssertion
apply_mapping_renames_main_tierDonor has *PAR0:\t... and *PAR1:\t...; mapping renames PAR0→INV, drops PAR1Output has *INV:\t... for original PAR0 utts; PAR1 utts absent
apply_mapping_byte_stable_except_prefixDonor has rich CHAT markup, %wor, %com on every uttEvery retained utt is byte-identical except the *CODE:\t prefix; dependent tiers preserved exactly
apply_mapping_rewrites_participantsDonor @Participants: has PAR0+PAR1 entriesOutput has only INV entry (after PAR1 drop)
apply_mapping_rewrites_idDonor @ID: rows for PAR0+PAR1PAR0 row rewritten to INV with role tag; PAR1 row removed
apply_mapping_speaker_not_in_inputMapping references PAR9 which isn’t in donorReturns Err(SpeakerIdError::MappingSpeakerNotInInput { speaker: PAR9 })
apply_mapping_speaker_not_in_mappingDonor has PAR0+PAR1+PAR2 but mapping only covers PAR0+PAR1Returns Err(SpeakerIdError::SpeakerNotInMapping { speaker: PAR2 })
apply_mapping_preserves_other_headersDonor has @Languages, @Media, @CommentAll non-Participants/non-ID headers pass through verbatim
apply_mapping_idempotent_on_rerunApply mapping, parse output, apply identity mappingOutput unchanged (byte-stable)

L2.3, merge (core invariants)

These mirror the user-guide’s “What the merged output guarantees” section directly. Each invariant from that section maps to one or more L2 tests; the L3 tests then re-exercise the same invariant through the CLI.

TestInvariant from user-guideAssertion
merge_retained_speakers_byte_stable“Retained speakers are byte-stable”Every *CHI: block from File 1 (main tier + all dependent tiers, including %com) appears in the output byte-identical, in original order
merge_strips_default_derived_tiers“Inserted speakers’ downstream-generated tiers are stripped”Output has no %wor, %mor, %gra, %pho on inserted-speaker utts; other dependent tiers preserved
merge_strip_tiers_configurable“configurable via --strip-tiers”Custom strip_tiers=[com] removes %com instead of the defaults
merge_strip_tiers_empty_preserves_allempty strip setInserted utts retain %wor, %mor, %gra, %pho from File 2 verbatim
merge_utterance_order_by_start_time“Utterance order is timeline order”Output utterances sorted by start_ms ascending
merge_stable_tiebreak_file1_first“first-file utterance comes first”When File 1 and File 2 each have an utterance starting at exactly t, the File 1 one appears first in the output
merge_bullets_pass_through“Time bullets are pass-through”Every bullet in the output is exactly the bullet from its source utterance, merge does not recompute, smooth, or refresh
merge_bullet_lift_from_wor“If main tier lacks bullet, lift from %wor”Donor utt with no end-of-line bullet but a %wor row gets a derived \x15<first>_<last>\x15 appended; original %wor then stripped per the tier policy
merge_no_overlap_markers_injected“Overlap markup is NOT injected”Even when inserted utt’s bullet overlaps a retained utt’s bullet by 500ms, no [>]/[<] tokens appear anywhere in the output that weren’t in the original retained file
merge_preserves_existing_overlap_markersretained file already has [>] somewhereThe original [>] is preserved byte-stable on the retained utt
merge_header_languages_passthroughHeader reconciliation ruleOutput @Languages matches File 1’s
merge_header_media_file1_winsHeader reconciliation ruleFile 1 says video, File 2 says audio → output says video (no warning emitted for modality only)
merge_header_participants_concatenatesHeader reconciliation ruleOutput @Participants: is File 1’s entries + File 2’s non-retained entries, in that order
merge_header_id_concatenatesHeader reconciliation ruleOutput @ID: rows are File 1’s + File 2’s non-retained, original order within each file
merge_header_comments_concatenateHeader reconciliation ruleOutput @Comment rows are File 1’s + File 2’s, in original order (ASR provenance preserved)
merge_preconditions_retain_missingexit code 2 preconditionFile 1 declares no CHI; merge with retain={CHI} returns Err(MergeError::RetainSpeakersMissing)
merge_preconditions_no_timelineexit code 2 preconditionFile 1 has no utterances with bullets → Err(MergeError::NoTimelineInFile1)
merge_preconditions_language_mismatchexit code 2 preconditionFile 1 @Languages: eng, File 2 @Languages: yue → Err(MergeError::LanguageMismatch)
merge_preconditions_ambiguous_speakerexit code 2 preconditionBoth files have INV utterances and retain={CHI} (INV not in retain) → Err(MergeError::AmbiguousSpeaker { speaker: INV })
merge_warns_on_backward_bullet_drift“small backward-time bullets … proceeds”File with utt1: 100_200, utt2: 190_300, succeeds, emits a warning
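The timeline-order and File-1-first tiebreak invariants can both fall out of a single stable sort. The sketch below uses hypothetical stand-in types (Source, TimedUtterance) rather than the real AST, and ignores the backward-drift warning path.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Source {
    File1,
    File2,
}

/// Stand-in for a parsed utterance: `text` is carried through untouched.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TimedUtterance {
    pub source: Source,
    pub start_ms: u64,
    pub text: String,
}

impl TimedUtterance {
    pub fn new(source: Source, start_ms: u64, text: &str) -> Self {
        TimedUtterance { source, start_ms, text: text.to_string() }
    }
}

/// Interleaves two files by start time. File 1 utterances are placed first
/// in the buffer and `sort_by_key` is stable, so when both files have an
/// utterance at the same start_ms the File 1 one stays ahead.
pub fn interleave_by_start(
    file1: Vec<TimedUtterance>,
    file2: Vec<TimedUtterance>,
) -> Vec<TimedUtterance> {
    let mut merged = file1;
    merged.extend(file2);
    merged.sort_by_key(|u| u.start_ms);
    merged
}
```

Relying on sort stability keeps the tiebreak rule out of the comparison key, so no explicit (start_ms, source) tuple is needed.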

L2.4, Override file I/O

TestScenarioAssertion
override_file_round_tripConstruct OverrideFile with one entry, write, read backRe-read value == original
override_file_refuses_missing_schema_versionTOML with no schema_versionErr(OverrideFileError::UnsupportedSchemaVersion { found: 0, supported: 1 })
override_file_refuses_wrong_schema_versionschema_version = 2 (future)Err(UnsupportedSchemaVersion { found: 2, supported: 1 })
override_file_rejects_unknown_fieldEntry has an extraneous field extra = "x"Err(OverrideFileError::Parse)
override_file_rejects_malformed_modemode = "guess"Err(Parse) (only auto/explicit/override accepted)
override_file_atomic_writeWrite to a path that already existsOriginal file is replaced atomically; no <path>.tmp left behind
override_file_deterministic_serializationSame struct, write twiceBytes on disk are byte-identical between writes
override_file_omits_empty_optionalsEntry has empty scores, no margin, empty flagsTOML output does not contain those keys
override_file_preserves_margin_unboundedEntry has margin = Margin::UnboundedTOML on disk has margin = "unbounded"; reads back as Unbounded
override_file_preserves_margin_finiteEntry has margin = Margin::Finite(3.81)TOML on disk has margin = 3.81; reads back equal
override_file_read_or_default_missingPath does not existReturns empty OverrideFile with current schema version
override_file_get_returns_entryFile has one entry under SessionId Xget(X) returns Some; get(Y) returns None
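The override_file_atomic_write case describes a write-then-rename pattern. A minimal std-only sketch, assuming the temporary file is named by appending .tmp to the target path (the function name write_atomically is hypothetical, and rename is only atomic when both paths are on the same filesystem):

```rust
use std::fs;
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Writes `contents` to a sibling "<path>.tmp" file, then renames it over
/// `path`, so readers see either the old file or the new one, never a
/// partial write.
pub fn write_atomically(path: &Path, contents: &[u8]) -> io::Result<()> {
    let mut tmp_name = path.as_os_str().to_os_string();
    tmp_name.push(".tmp");
    let tmp = PathBuf::from(tmp_name);

    {
        let mut file = fs::File::create(&tmp)?;
        file.write_all(contents)?;
        // Flush data to disk before the rename makes it visible.
        file.sync_all()?;
    }

    // Replace the destination; clean up the temp file if the rename fails.
    if let Err(e) = fs::rename(&tmp, path) {
        let _ = fs::remove_file(&tmp);
        return Err(e);
    }
    Ok(())
}
```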

L2.5, Domain-type unit tests

Smaller per-type tests. Each in its module’s #[cfg(test)] mod tests section.

| Test | Type | Assertion |
|---|---|---|
| jaccard_score_new_in_range | JaccardScore | new(0.5) → Ok; new(-0.1) and new(1.1) → Err; new(NaN) → Err |
| jaccard_score_serde_round_trip | JaccardScore | Serializes to 0.5 (bare float in JSON/TOML); deserializes back identically; out-of-range deserialize → error |
| confidence_threshold_default_is_2_0 | ConfidenceThreshold | Default::default().value() == 2.0 |
| confidence_threshold_rejects_below_1 | ConfidenceThreshold | new(0.5) → Err |
| margin_from_scores_zero_loser | Margin | from_scores(JaccardScore::new(0.7), JaccardScore::zero()) == Margin::Unbounded |
| margin_from_scores_zero_zero | Margin | from_scores(zero, zero) == Margin::Finite(0.0) or explicit “degenerate” representation (decide and document) |
| margin_meets_threshold | Margin | Finite(3.81).meets(threshold=2.0) == true; Finite(1.5).meets(2.0) == false; Unbounded.meets(threshold) == true for any threshold |
| retain_set_parse | RetainSet | "CHI".parse() == Ok({CHI}); "CHI,SI2".parse() == Ok({CHI, SI2}); "".parse() == Err; "CHI,,SI2".parse() == Err |
| inserted_role_parse | InsertedRole | "INV:Investigator".parse() == Ok(_); "INV".parse() == Err; ":Investigator".parse() == Err |
| mapping_spec_parse_simple | parse_mapping_spec | "PAR0=drop,PAR1=INV:Investigator" parses to a complete SpeakerMapping with correct actions and inserted_role |
| mapping_spec_parse_drop_only | parse_mapping_spec | "PAR0=drop" parses iff no inserted_role context required (decide whether legal in isolation; if not, must error) |
| mapping_spec_parse_conflicting_roles | parse_mapping_spec | "PAR0=INV:Investigator,PAR1=MOT:Mother", two different inserted roles → error (v1 only allows one) |
| merge_flag_serde_known_variants | MergeFlag | DiarizationMixed serializes as "diarization-mixed" (kebab-case); deserializes the same |
| merge_flag_serde_custom | MergeFlag | Unknown string deserializes as Custom("unknown-flag"); serializes verbatim |
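As one example of these per-type contracts, the retain_set_parse row can be satisfied by a small FromStr impl. This is a sketch: speaker codes are plain Strings here, and the error variants are invented for illustration.

```rust
use std::collections::BTreeSet;
use std::str::FromStr;

/// Set of speaker codes to keep byte-stable from File 1.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RetainSet(BTreeSet<String>);

/// Hypothetical error shape; the real typed error may differ.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum RetainSetParseError {
    Empty,
    EmptyCode { position: usize },
}

impl FromStr for RetainSet {
    type Err = RetainSetParseError;

    /// Parses a comma-separated list such as "CHI,SI2".
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        if s.is_empty() {
            return Err(RetainSetParseError::Empty);
        }
        let mut codes = BTreeSet::new();
        for (position, code) in s.split(',').enumerate() {
            let code = code.trim();
            // "CHI,,SI2" yields an empty item at position 1.
            if code.is_empty() {
                return Err(RetainSetParseError::EmptyCode { position });
            }
            codes.insert(code.to_string());
        }
        Ok(RetainSet(codes))
    }
}

impl RetainSet {
    pub fn contains(&self, code: &str) -> bool {
        self.0.contains(code)
    }

    pub fn len(&self) -> usize {
        self.0.len()
    }
}
```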

L3, CLI / subprocess tests

Lives in crates/chatter/tests/merge_tests.rs (new file). Uses the same assert_cmd + predicates + tempfile pattern as the existing integration_tests.rs. Each test invokes chatter speaker-id or chatter merge as a subprocess against files written to a tempdir().

L3.1, chatter merge, success paths

TestInvariants exercised
merge_basic_clinician_patternE2E happy path: small hand-coded child-only file + small ASR-labeled file → exit 0, output exists, retained CHI byte-stable, inserted INV present with derived tiers stripped. Single-invocation smoke test.
merge_writes_to_stdout_by_defaultNo -o flag → output goes to stdout, exit 0
merge_writes_to_output_path-o merged.cha → file created with correct content; nothing on stdout
merge_retain_multi_speaker--retain CHI,SI2 keeps both CHI and SI2 byte-stable; everything else from File 2
merge_strip_tiers_custom--strip-tiers com,act removes %com and %act instead of default set
merge_strip_tiers_empty--strip-tiers '' preserves %wor from File 2 in output

L3.2, chatter merge, error paths

| Test | Asserted exit code | Asserted stderr |
|---|---|---|
| merge_missing_file1 | 1 | “No such file” or equivalent typed message |
| merge_unparseable_file1 | 1 | parser diagnostic |
| merge_missing_retain_flag | 2 (clap) | clap usage message |
| merge_retain_empty_value | 2 | typed error from RetainSet::from_str |
| merge_no_retain_speakers_in_file1 | 2 | RetainSpeakersMissing rendered |
| merge_no_timeline_in_file1 | 2 | NoTimelineInFile1 rendered |
| merge_language_mismatch | 2 | LanguageMismatch { file1: eng, file2: yue } rendered |
| merge_ambiguous_speaker | 2 | AmbiguousSpeaker { speaker: ... } rendered with hint to use --retain |
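The exit-code column above amounts to a small classification function. The sketch below uses a payload-free stand-in enum; FileUnreadable and RetainSetInvalid are hypothetical names for the missing-file and bad --retain cases, while the other variants mirror MergeError.

```rust
/// Simplified failure classes for a `chatter merge` run.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MergeFailure {
    FileUnreadable,
    Parse,
    RetainSetInvalid,
    RetainSpeakersMissing,
    NoTimelineInFile1,
    LanguageMismatch,
    AmbiguousSpeaker,
}

/// Maps a failure to the process exit code asserted in the L3.2 table:
/// 1 for unreadable or unparseable input, 2 for usage and precondition errors.
pub fn exit_code(failure: MergeFailure) -> u8 {
    match failure {
        MergeFailure::FileUnreadable | MergeFailure::Parse => 1,
        MergeFailure::RetainSetInvalid
        | MergeFailure::RetainSpeakersMissing
        | MergeFailure::NoTimelineInFile1
        | MergeFailure::LanguageMismatch
        | MergeFailure::AmbiguousSpeaker => 2,
    }
}
```

Listing every variant instead of using a wildcard arm means a newly added failure class will not compile until it is assigned an exit code.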

L3.3, chatter speaker-id, reference mode

TestScenarioAssertion
speaker_id_reference_auto_clean_winnerReference + donor where margin >> 2.0Exit 0; output has expected renamed/dropped speakers
speaker_id_reference_writes_overrideWith --write-override path.tomlFile created; entry has mode = "auto", scores, margin, decided_at, operator
speaker_id_reference_appends_to_existing_override--write-override path.toml where file already has another sessionNew session added; existing session preserved
speaker_id_reference_low_confidence_exits_4Margin < thresholdExit 4; stderr contains per-speaker scores
speaker_id_reference_anchor_missing_exits_2Reference has no anchor speaker utterancesExit 2; typed error in stderr
speaker_id_reference_threshold_override--confidence-threshold 1.5 on a margin-1.7 caseExit 0 (would have refused at default 2.0)
speaker_id_reference_anchor_required--reference without --anchorExit 2 (clap or our own); usage error

L3.4, chatter speaker-id, explicit-mapping mode

TestScenarioAssertion
speaker_id_explicit_basic--mapping "PAR0=drop,PAR1=INV:Investigator"Exit 0; output renames PAR1→INV, drops PAR0
speaker_id_explicit_mapping_speaker_not_in_input--mapping references PAR9 not in inputExit 2; typed error
speaker_id_explicit_speaker_missing_from_mappingInput has PAR0+PAR1+PAR2; mapping only covers PAR0+PAR1Exit 2; typed error naming PAR2
speaker_id_explicit_with_note_records_in_override--mapping + --write-override + --note "verified by listening"TOML entry has note = "verified by listening" and mode = "explicit"

L3.5, chatter speaker-id, override-file mode

TestScenarioAssertion
speaker_id_override_file_replayOverride file has entry for session-XReading override + applying produces same output as the original auto/explicit run
speaker_id_override_file_missing_entryOverride file has no entry for the requested sessionExit 2; OverrideEntryMissing in stderr
speaker_id_override_file_missing_file--override-file path.toml where file doesn’t existExit 1; NotFound in stderr
speaker_id_override_file_wrong_schema_versionFile has schema_version = 99Exit 1; UnsupportedSchemaVersion in stderr
speaker_id_override_file_mutually_exclusive_modes--reference AND --mapping both setExit 2 (clap or our own); only one operation mode allowed

L3.6, Pipeline composition

These exercise chatter speaker-id → chatter merge composed end-to-end through the file system, simulating the orchestrator workflow.

TestScenarioAssertion
pipeline_speaker_id_then_mergeRun speaker-id on anonymous ASR file; run merge on the result + hand-coded fileFinal merged file passes all merge invariants (retained byte-stable, etc.)
pipeline_replay_via_override_fileRun once with auto; capture override file; delete intermediates; replay via --override-file; merge againFinal merged file is byte-identical to the original run (audit-trail-reproducibility property)
pipeline_low_confidence_then_explicitRun speaker-id; gets exit 4; capture scores from stderr; run again with --mapping matching what the operator would decide; record via --write-override; mergeAll steps succeed; override file has mode = "explicit" with prior scores recorded

L4, Scripted adjudication tests

Lives in crates/talkbank-transform/tests/adjudication_tests.rs. Uses the Prompter trait and ScriptedPrompter documented in Adjudication Workflow §The prompter abstraction. Each test constructs a pending-adjudications input, scripts the operator’s decisions, runs run_adjudication, and asserts on the resulting override file plus the residual pending file.

L4.1, Speaker-id adjudication paths

TestScripted decisionAssertion
adjudicate_speaker_id_accepts_suggestedAcceptSuggested { note: None } for one pending entryOverride file entry has mode = "explicit", mapping matches suggested, pending file emptied
adjudicate_speaker_id_override_mappingOverrideMapping { mapping: { PAR0=rename, PAR1=drop }, note: Some("verified by listening") } (opposite of suggested)Override file mapping matches operator’s choice; note recorded
adjudicate_speaker_id_deferDefer { reason: "need to listen to audio" }Pending entry untouched; override file unchanged; tool exits 4 (deferred)
adjudicate_speaker_id_blockBlock { reason: "reference file missing bullets" }Pending entry tagged as blocked; override file unchanged
adjudicate_speaker_id_kind_mismatch_rejectedOverrideInsertedRole { ... } against a speaker-id-low-confidence entryReturns Err(AdjudicationError::DecisionKindMismatch); nothing written

L4.2, Parent-role-lookup adjudication paths

TestScripted decisionAssertion
adjudicate_parent_role_accepts_default_invAcceptSuggestedOverride entry uses INV:Investigator (the safe default)
adjudicate_parent_role_overrides_to_motherOverrideInsertedRole { code: "MOT", tag: "Mother" }Override entry uses MOT; note recorded
adjudicate_parent_role_overrides_to_fatherOverrideInsertedRole { code: "FAT", tag: "Father" }Override entry uses FAT
adjudicate_parent_role_invalid_code_rejectedOverrideInsertedRole { code: "", tag: "Mother" }Returns Err; with --skip-on-error, logs and proceeds

L4.3, Diarization-mix and sanity-scan paths

| Test | Scripted decision | Assertion |
| --- | --- | --- |
| adjudicate_diarization_mix_flag_only | Flag { flags: [DiarizationMixed], note: "PAR0 mixes clinician+parent" } | Existing override entry gets flag added; mapping unchanged |
| adjudicate_sanity_scan_swap_mapping | OverrideMapping { ... } reversing original speaker-id | Override entry updated; mode = "explicit"; original mapping preserved in history |
| adjudicate_sanity_scan_confirms_real_overlap | Flag { flags: [Custom("real-overlap-confirmed")] } | Override entry gets custom flag; mapping unchanged |

L4.4, Workflow plumbing

| Test | Scenario | Assertion |
| --- | --- | --- |
| adjudicate_empty_pending_file_noop | Pending file has empty entries array | Exit 0; nothing changes |
| adjudicate_resumption_skips_decided_entries | Pending file has 3 entries; first 2 already decided in override; only 3rd has no override entry | Prompter is called exactly once, for the 3rd entry |
| adjudicate_re_adjudicate_preserves_history | Existing override entry; --re-adjudicate with new decision | New decision saved; prior decision preserved in history array |
| adjudicate_kind_filter_processes_only_matching | Pending file has mixed kinds; --kind parent-role-lookup flag set | Prompter only called for parent-role-lookup entries; other kinds untouched |
| adjudicate_dry_run_writes_nothing | Any pending input + any decision; --dry-run set | Override file unchanged; pending file unchanged |
| adjudicate_scripted_mode_unknown_session_aborts | Scripted decisions reference session-X but pending has only session-Y | Returns Err(AdjudicationError::ScriptedDecisionWithoutPendingEntry); tool exits 2 |
| adjudicate_scripted_mode_extra_pending_aborts | Pending has session-X and session-Y; scripted decisions cover only session-X | Returns Err(AdjudicationError::PendingEntryWithoutScriptedDecision); tool exits 2 |
| adjudicate_mutually_exclusive_modes | --interactive + --scripted both set | Returns Err; tool exits 2 (clap or our own validator) |
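
The two scripted-mode abort scenarios above amount to a set comparison between pending session ids and scripted session ids. A minimal sketch, assuming sessions are identified by plain strings and that each error variant carries the offending session id (both are assumptions; only the variant names come from the table, and `check_one_to_one` is a hypothetical helper name):

```rust
use std::collections::BTreeSet;

/// Variant names follow the test table; the String payloads are assumptions.
#[derive(Debug, PartialEq, Eq)]
pub enum AdjudicationError {
    ScriptedDecisionWithoutPendingEntry(String),
    PendingEntryWithoutScriptedDecision(String),
}

/// Hypothetical helper: require that scripted decisions and pending entries
/// cover exactly the same set of sessions.
pub fn check_one_to_one(pending: &[&str], scripted: &[&str]) -> Result<(), AdjudicationError> {
    let pending_set: BTreeSet<&str> = pending.iter().copied().collect();
    let scripted_set: BTreeSet<&str> = scripted.iter().copied().collect();

    // A scripted decision that names a session with no pending entry.
    if let Some(extra) = scripted_set.difference(&pending_set).next() {
        return Err(AdjudicationError::ScriptedDecisionWithoutPendingEntry(
            (*extra).to_string(),
        ));
    }
    // A pending entry that no scripted decision covers.
    if let Some(missing) = pending_set.difference(&scripted_set).next() {
        return Err(AdjudicationError::PendingEntryWithoutScriptedDecision(
            (*missing).to_string(),
        ));
    }
    Ok(())
}
```

Checking the scripted side first means a script that names only an unknown session reports the unknown session rather than the uncovered pending entry, which matches the first abort row.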

L4.5, Prompter contract conformance

These tests pin the contract that any Prompter impl must satisfy, so future UI backends (VS Code, web) can be developed against the same invariants.

| Test | Scenario | Assertion |
| --- | --- | --- |
| prompter_terminal_round_trip_decision | TerminalPrompter reading a scripted stdin | Returns the expected OperatorDecision parsed from the operator’s typed input |
| prompter_scripted_returns_decisions_in_order | ScriptedPrompter::from_decisions([d1, d2, d3]) | Three consecutive ask() calls return d1, d2, d3 in order |
| prompter_scripted_panics_on_unscripted_session | ScriptedPrompter has decisions for session A; tool asks for session B | ask() returns Err(PrompterError::NoDecisionFor(SessionId)) |
| prompter_scripted_toml_round_trips | Write a scripted-decisions TOML, read with ScriptedTomlPrompter, run | Same OperatorDecision sequence as a ScriptedPrompter::from_decisions with equivalent contents |
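
The in-order and unscripted-session rows can be pictured with a small in-memory prompter. This is a sketch under stated assumptions, not the crate's actual definitions: the `ask` signature, the reduced `OperatorDecision` enum, and the choice to pair each scripted decision with a `SessionId` are all illustrative.

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SessionId(pub String);

/// Reduced stand-in for the real decision enum (two variants only).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum OperatorDecision {
    AcceptSuggested { note: Option<String> },
    Defer { reason: String },
}

#[derive(Debug, PartialEq, Eq)]
pub enum PrompterError {
    NoDecisionFor(SessionId),
}

/// Assumed shape of the trait: one decision per asked session.
pub trait Prompter {
    fn ask(&mut self, session: &SessionId) -> Result<OperatorDecision, PrompterError>;
}

pub struct ScriptedPrompter {
    queue: VecDeque<(SessionId, OperatorDecision)>,
}

impl ScriptedPrompter {
    /// Assumption: the script pairs every decision with its session id.
    pub fn from_decisions(
        decisions: impl IntoIterator<Item = (SessionId, OperatorDecision)>,
    ) -> Self {
        Self { queue: decisions.into_iter().collect() }
    }
}

impl Prompter for ScriptedPrompter {
    fn ask(&mut self, session: &SessionId) -> Result<OperatorDecision, PrompterError> {
        // Only hand out the next decision if it was scripted for this session.
        let next_matches = self
            .queue
            .front()
            .map_or(false, |(scripted, _)| scripted == session);
        if !next_matches {
            return Err(PrompterError::NoDecisionFor(session.clone()));
        }
        let (_, decision) = self.queue.pop_front().expect("front() was Some");
        Ok(decision)
    }
}
```

Returning an error instead of panicking leaves the queue untouched, so a caller can report the mismatch and still inspect the remaining script.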

Fixture catalog

These are the synthetic CHAT pairs that the tests above consume. Each is small (≤20 utterances), exercises a precise invariant, and is fully fictional (no real corpus content).

The fixtures live as inline const FIX_*: &str blocks in the respective test modules, following the precedent in chatter/tests/integration_tests.rs (which has const VALID_CHAT: &str = r#"..."# etc.).

FIX_REF_TWO_UTT_NO_MARKUP

The smallest possible valid CHAT pair input. Two *CHI: utterances, no markup beyond a simple terminator, time bullets on both. Used by cycle 1’s smoke test where the impl must work without yet handling any markup edge cases.

FIX_ASR_LABELED_TWO_UTT

The matching donor for FIX_REF_TWO_UTT_NO_MARKUP: two *INV: utterances at different time positions. Used by cycle 1.

FIX_REF_CHILD_ONLY_SIMPLE

A 6-utterance child-only hand transcript with rich CHAT markup (error code, retracing, filled pause, special-form letter, zero realization with paralinguistic). Used by every L2/L3 merge test from cycle 2 onward as the canonical “File 1”, the reference / authoritative file. Has time bullets on every utterance.

FIX_ASR_ANON_2SPEAKER_SIMPLE

The matching ASR-output file with anonymous PAR0 (clinician, asks questions) and PAR1 (child, says what FIX_REF_* shows plus some extra). Has %wor on every utterance. Used by every speaker-id test where auto-mode is expected to succeed cleanly (margin >> 2.0).

FIX_ASR_LABELED_INV_SIMPLE

FIX_ASR_ANON_2SPEAKER_SIMPLE after speaker-id has run with PAR1→drop, PAR0→INV:Investigator. Used by merge tests where we want to skip the speaker-id step and test merge alone.

FIX_ASR_BORDERLINE_VOCABULARY

ASR file where both speakers describe the same picture-book content (margin 1.6-1.9 against reference). Used by low-confidence tests.

FIX_REF_NO_BULLETS

A reference file with no time bullets at all. Used to test NoTimelineInFile1 precondition.

FIX_REF_LANG_ENG / FIX_ASR_LANG_YUE

Two files with conflicting @Languages. Used to test LanguageMismatch.

FIX_AMBIGUOUS_INV

Two files both containing *INV: utterances, with --retain CHI (INV not in retain set). Used to test AmbiguousSpeaker.

FIX_REF_MULTI_RETAIN

Reference file containing *CHI: and *SI2: utterances (sibling target). Used to test --retain CHI,SI2.

FIX_ASR_NO_MAIN_BULLET

Donor file where some utterances have no main-tier bullet, only %wor. Used to test bullet-lift behavior in normalization.

FIX_OVERRIDE_VALID / FIX_OVERRIDE_WRONG_SCHEMA / FIX_OVERRIDE_MALFORMED

Override files in valid, schema-rejected, and parse-rejected shapes. Used by override-file I/O tests.

FIX_PENDING_SPEAKER_ID / FIX_PENDING_PARENT_ROLE / FIX_PENDING_MIXED_KINDS

Pending-adjudications files exercising one kind, another kind, and a mix. Used by L4 adjudication tests.

FIX_SCRIPTED_ACCEPT_ALL / FIX_SCRIPTED_OVERRIDE_FIRST_DEFER_SECOND

Scripted-decisions TOML files for ScriptedTomlPrompter. Cover the canonical accept-suggested case and a mixed override+defer case.

The exact bytes of each fixture are pinned in their respective test modules when the implementation lands; this plan doesn’t freeze them yet, only their purpose. Drafting the actual bytes is the first step of impl-phase work.

Coverage matrix

Cross-checking that every behavioral invariant from the four design docs has at least one test:

| Invariant source | Invariant | First-failing layer | Test name |
| --- | --- | --- | --- |
| merge user-guide | Retained byte-stable | L3 → L2 | merge_basic_clinician_pattern + merge_retained_speakers_byte_stable |
| merge user-guide | Derived tiers stripped | L3 → L2 | merge_strip_tiers_custom + merge_strips_default_derived_tiers |
| merge user-guide | Order by start_ms | L2 | merge_utterance_order_by_start_time |
| merge user-guide | Tiebreak File1 first | L2 | merge_stable_tiebreak_file1_first |
| merge user-guide | Bullets pass-through | L2 | merge_bullets_pass_through |
| merge user-guide | Bullet lift from %wor | L2 | merge_bullet_lift_from_wor |
| merge user-guide | Header reconciliation (all rows) | L2 | merge_header_* series |
| merge user-guide + memory | No overlap markers injected | L2 | merge_no_overlap_markers_injected + merge_preserves_existing_overlap_markers |
| merge user-guide | Each precondition → exit 2 | L3 | merge_*_exits_2 series in L3.2 |
| merge user-guide | Warns on bullet drift | L2 | merge_warns_on_backward_bullet_drift |
| speaker-id user-guide | Reference mode auto | L3 | speaker_id_reference_auto_clean_winner |
| speaker-id user-guide | Explicit mode | L3 | speaker_id_explicit_basic |
| speaker-id user-guide | Override-file mode | L3 | speaker_id_override_file_replay |
| speaker-id user-guide | Confidence threshold (exit 4) | L3 → L2 | speaker_id_reference_low_confidence_exits_4 + identify_mapping_borderline_refuses |
| speaker-id user-guide | Byte-stable except prefix | L2 | apply_mapping_byte_stable_except_prefix |
| speaker-id user-guide | Header rewrites | L2 + L1 | apply_mapping_rewrites_* + participants-rewrite-* specs |
| speaker-id user-guide | Provenance captured | L3 | speaker_id_reference_writes_override |
| speaker-id user-guide | Each precondition → typed error | L3 → L2 | various *_exits_2 and apply_mapping_* tests |
| speaker-id user-guide | Token cleaner spec | L1 | clean-* specs |
| speaker-id user-guide | Multiset Jaccard formula | L1 | jaccard-* specs |
| override-file ref | Schema-version refusal | L2 | override_file_refuses_* tests |
| override-file ref | Round-trip fidelity | L2 | override_file_round_trip |
| override-file ref | Deterministic serialization | L2 | override_file_deterministic_serialization |
| override-file ref | Atomic write | L2 | override_file_atomic_write |
| override-file ref | margin "unbounded" form | L2 | override_file_preserves_margin_unbounded |
| domain types | JaccardScore range | L2 | jaccard_score_new_in_range |
| domain types | ConfidenceThreshold ≥ 1 | L2 | confidence_threshold_* |
| domain types | Margin semantics | L2 | margin_* |
| domain types | RetainSet::from_str | L2 | retain_set_parse |
| domain types | InsertedRole::from_str | L2 | inserted_role_parse |
| domain types | parse_mapping_spec | L2 | mapping_spec_parse_* |
| domain types | MergeFlag serde | L2 | merge_flag_serde_* |
| domain types | Pipeline reproducibility | L3 | pipeline_replay_via_override_file |

Every invariant has at least one named test; many have multiple across layers. When the impl phase begins, the first commit should produce the fixtures, the second commit the highest-layer failing test for the simplest invariant, then drill down per the standard TDD progression.
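
As one concrete reading of the "Atomic write" invariant in the matrix, the usual technique is to write a sibling temp file and rename it over the target. The sketch below uses only the standard library; the function name `write_atomically` and the `.tmp` suffix are assumptions, not the override-file module's actual API.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical helper: replace `target` with `contents` so that readers see
/// either the old file or the new one, never a half-written file.
pub fn write_atomically(target: &Path, contents: &str) -> io::Result<()> {
    // Keep the temp file in the same directory so the rename stays on one filesystem.
    let mut tmp_name = target
        .file_name()
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "target has no file name"))?
        .to_os_string();
    tmp_name.push(".tmp"); // illustrative suffix
    let tmp_path = target.with_file_name(tmp_name);

    let mut tmp = File::create(&tmp_path)?;
    tmp.write_all(contents.as_bytes())?;
    tmp.sync_all()?; // flush file contents to disk before the rename
    drop(tmp);

    // rename replaces an existing target in a single step.
    fs::rename(&tmp_path, target)
}
```

A test such as override_file_atomic_write could then assert that no temp file is left behind and that the target always holds one complete version.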

What this plan does NOT cover

  • Performance / scaling tests. Until the pipeline shows up on a measured workload, no targeted perf assertions. The reference corpus’s existing round-trip benchmarks remain the baseline.
  • Fuzz testing. This repository now has a local fuzz/ workspace for parser/validation fuzzing. If the merge crate stabilizes enough to justify dedicated fuzzing, adding a merge-specific target for random parseable CHAT-pair inputs is a follow-up, not a v1 blocker.
  • Cross-platform CI checks. Windows / Linux / macOS each build the workspace; the merge module rides the existing CI. No platform-specific tests needed (the merge operates on parsed AST and writes UTF-8; no path-or-line-ending quirks).
  • Real-corpus regression sweeps. Once impl lands, running chatter merge over a curated subset of the reference corpus and snapshotting outputs is a smart follow-up. Lives in a separate tests/golden/ style mechanism if added; not designed here.

TDD authoring sequence

Each numbered item is one full RED → GREEN → REFACTOR cycle. Cycles must run in order; do not start cycle N+1 until cycle N is green and committed. Numbers are designed so the first working pipeline (cycle 8) emerges from the absolute minimum set of types + algorithms, then each later cycle extends.

The starter test for cycle 1 is intentionally tiny: a 2-utterance fixture pair with no markup, one retain speaker. The smoke test exercises every layer (parser, transform, CLI) but with the simplest possible CHAT bytes, so the first impl is small enough to land in one cycle.

Phase A, minimal end-to-end pipeline (cycles 1-8)

These cycles produce the simplest possible chatter merge working end-to-end with synthetic fixtures.

| # | RED (failing test) | GREEN (smallest impl that passes) |
| --- | --- | --- |
| 1 | merge_basic_smoke, L3 subprocess test against the tiniest fixture pair (FIX_REF_TWO_UTT_NO_MARKUP + FIX_ASR_LABELED_TWO_UTT), retain={CHI}, asserts exit 0 and “merged file exists” | Stub chatter merge subcommand wiring; introduce minimal talkbank-transform::transcript_merge::merge that interleaves utterances by start_ms and emits parser→serializer round-trip. No tier-stripping, no header-reconcile, no validation. Just: parse, sort, serialize. |
| 2 | merge_retained_speakers_byte_stable, L2 over the smoke fixture, asserts every CHI block byte-identical | Implement byte-stable handling for retained utterances (preserve main_raw_lines + dependent tiers exactly). |
| 3 | merge_strips_default_derived_tiers, L2 against a fixture where the donor has %wor rows | Implement tier_strip per the per-tier policy; drop %wor/%mor/%gra/%pho from inserted-speaker utts. |
| 4 | merge_utterance_order_by_start_time, L2 with a fixture where File 1 and File 2 utterances interleave | Implement timeline sort key (start_ms primary; source-order tiebreak). |
| 5 | merge_header_participants_concatenates, L2 | Implement header_reconcile::participants_merge. |
| 6 | merge_header_id_concatenates, L2 | Extend header_reconcile for @ID rows. |
| 7 | merge_header_languages_passthrough + merge_header_media_file1_wins + merge_header_comments_concatenate, L2 | Extend header_reconcile for remaining headers per the contract table. |
| 8 | merge_preconditions_retain_missing + merge_preconditions_no_timeline + merge_preconditions_language_mismatch + merge_preconditions_ambiguous_speaker, L3, each asserting exit code 2 with a specific stderr message | Implement preconditions module + map MergeError to exit codes in the CLI. |
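
The timeline sort key from cycle 4 (start_ms primary, File 1 before File 2 on ties, source order otherwise) can be sketched with a stable sort. The `TimedUtterance` and `Source` types here are simplified stand-ins invented for illustration; the real merge operates on parsed ChatFile utterances.

```rust
/// Which input an utterance came from. Derived ordering puts File1 before File2.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Source {
    File1,
    File2,
}

/// Simplified stand-in for a parsed utterance with a main-tier bullet.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TimedUtterance {
    pub start_ms: u64,
    pub source: Source,
    pub text: String,
}

/// Interleave two already-ordered utterance lists on a shared timeline.
pub fn interleave(file1: Vec<TimedUtterance>, file2: Vec<TimedUtterance>) -> Vec<TimedUtterance> {
    let mut merged = file1;
    merged.extend(file2);
    // sort_by_key is stable: utterances with the same (start_ms, source) key
    // keep their original relative order, which preserves source order.
    merged.sort_by_key(|u| (u.start_ms, u.source));
    merged
}
```

Putting the tiebreak into the key rather than relying on concatenation order keeps the File 1 first rule explicit, which is what merge_stable_tiebreak_file1_first is meant to pin.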

Phase A, actual cycle log

The four-precondition cycle 8 was deliberately split into four single-variant cycles (9a / 9b / 9c / 9d) so each MergeError variant lands with its own RED→GREEN cycle and L2 + L3 sibling tests. The numbering here is therefore finer-grained than the plan table above; the table records the shape of Phase A, the log records what was actually committed.

| # | Test(s) | Layer | Status |
| --- | --- | --- | --- |
| 1 | merge_basic_smoke | L3 | done |
| 2 | merge_retained_speakers_byte_stable | L2 | done |
| 3 | merge_strips_default_derived_tiers | L2 | done |
| 4 | merge_strip_tiers_configurable | L2 | done |
| 5 | merge_strip_tiers_empty_preserves_all | L2 | done |
| 6 | merge_header_participants_concatenates | L2 | done |
| 7 | merge_header_id_concatenates | L2 | done |
| 8a | merge_header_comments_concatenate | L2 | done |
| 8b | merge_header_languages_passthrough + merge_header_media_file1_wins | L2 | done |
| 9a | merge_no_retain_speakers_in_file1 + _returns_err | L3 + L2 | done (L2 sibling backfilled in 9c) |
| 9b | merge_no_timeline_in_file1 + _returns_err | L3 + L2 | done |
| 9c | merge_language_mismatch + _returns_err | L3 + L2 | done |
| 9d | merge_ambiguous_speaker + _returns_err | L3 + L2 | done |

End of Phase A: chatter merge works on simple fixtures with all four preconditions (retain / timeline / language / ambiguous speaker) enforced. The pipeline is publishable as v0.
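
The "map MergeError to exit codes" step can be sketched as a small pure function the CLI calls before exiting. The variant names below are taken from the precondition names used elsewhere in this plan, but the exact enum shape, the `exit_code_for` helper, and the message wording are placeholders, not the tool's real output.

```rust
use std::fmt;

/// Assumed shape: one variant per merge precondition.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum MergeError {
    RetainSpeakersMissing,
    NoTimelineInFile1,
    LanguageMismatch,
    AmbiguousSpeaker,
}

impl fmt::Display for MergeError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Placeholder wording for illustration only.
        let message = match self {
            MergeError::RetainSpeakersMissing => "no retained speaker found in file 1",
            MergeError::NoTimelineInFile1 => "file 1 has no time bullets",
            MergeError::LanguageMismatch => "the two files declare different languages",
            MergeError::AmbiguousSpeaker => "a non-retained speaker appears in both files",
        };
        f.write_str(message)
    }
}

/// Hypothetical helper: 0 on success, 2 for any precondition failure.
pub fn exit_code_for(result: &Result<(), MergeError>) -> u8 {
    match result {
        Ok(()) => 0,
        // Listing every variant means a new variant forces an explicit choice here.
        Err(
            MergeError::RetainSpeakersMissing
            | MergeError::NoTimelineInFile1
            | MergeError::LanguageMismatch
            | MergeError::AmbiguousSpeaker,
        ) => 2,
    }
}
```

Keeping the mapping a pure function lets the L2 tests assert on the typed error while the L3 subprocess tests assert on the exit code and stderr text.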

Phase B, actual cycle log

Phase B picks up at cycle 10 in the cycle log (Phase A used 9a-9d for the precondition split).

| # | Test(s) | Layer | Status |
| --- | --- | --- | --- |
| 10 | speaker_id_explicit_basic | L3 | done |
| 11 | apply_mapping_byte_stable_except_prefix + apply_mapping_rewrites_participants + apply_mapping_rewrites_id | L2 | done (regression-guards) |
| 12 | identify_mapping_clean_winner | L2 | done |
| 13 | identify_mapping_borderline_refuses | L2 | done |
| 14 | speaker_id_reference_low_confidence_exits_4 | L3 | done |
| 15 | speaker_id_reference_writes_override (+ OverrideFile data model) | L3 | done |
| 16 | speaker_id_override_file_replay (+ OverrideFile::get) | L3 | done |
| 17 | adjudicate_speaker_id_accepts_suggested (+ adjudication core) | L4 | done |
| 18 | adjudicate_scripted_accepts_suggested (+ chatter adjudicate CLI + scripted-TOML I/O) | L3 | done |
| 19 | speaker_id_reference_writes_pending_on_low_confidence (+ --write-pending flag + LowConfidence carries DonorMatchReport) | L3 | done |
| 20 | adjudicate_speaker_id_override_mapping (+ OperatorDecision::OverrideMapping variant + scripted-TOML override-mapping shape) | L4 | done |
| 21 | adjudicate_interactive_accepts_suggested (+ TerminalPrompter + --interactive flag) | L3 | done |
| 22 | adjudicate_parent_role_lookup_chooses_role (+ PendingKindData promotion + ParentRoleLookup kind + ChooseRole decision) | L4 | done |
| 23 | adjudicate_interactive_chooses_role (+ parse_operator_response + kind-aware prompt hint) | L3 | done |
| 24 | adjudicate_interactive_override_mapping (+ parse_override_mapping + parse_speaker_assignment) | L3 | done |
| 25 | pipeline_clean_winner_end_to_end (+ chatter pipeline subcommand) | L3 | done |
| 26 | batch_pass1_single_session (+ chatter batch subcommand, subprocess driver) | L3 | done |
| 27 | batch_mixed_outcomes (regression-guard: clean+borderline aggregation) | L3 | done |
| 28 | batch_pass2_replay (+ --override-file on pipeline + batch; per-session auto-detection) | L3 | done |
| 29 | batch_skip_existing (+ --skip-existing flag on batch for idempotent re-runs) | L3 | done |
| 30 | refactor: PipelineArgs + BatchArgs structs retire three #[allow(clippy::too_many_arguments)] markers | n/a | done (true-no-op refactor; covered by cycles 25-29 regression suite) |
| 31 | refactor: split commands/speaker_id.rs (472 lines) into speaker_id/{mod,modes,writes,support}.rs (158 + 196 + 103 + 86 lines); retire 4 stale #[allow(dead_code)] markers on ReferenceModeOutcome (fields are read by write_override_entry) | n/a | done (true-no-op refactor; covered by cycles 10-29 regression suite) |
| 32 | adjudicate_sanity_scan_accept_suggested (+ AdjudicationKind::SanityScanMisclassification variant, PendingKindData::SanityScanMisclassification { suggested, reason } variant, two apply-decision arms mirroring SpeakerIdLowConfidence, terminal prompter render + prompt-hint arm) | L4 | done: adjudication kind end-to-end; the post-merge scan detector itself (heuristic + auto-pending-write) is a separate cycle 33 |
| 33 | sanity_scan_flags_inverted_mlu (+ talkbank_transform::sanity_scan::scan_session + chatter sanity-scan subcommand; mean-utterance-word-count asymmetry heuristic, default 1.5×, binary-mapping only) | L3 | done: detector + CLI end-to-end; multi-rename support, batch integration, and alternative heuristics deferred |
| 34 | batch_writes_override_for_auto_decisions (+ --write-override on both chatter pipeline and chatter batch; threaded through PipelineArgs.write_override_path + BatchArgs.write_override_path; reference-mode auto-decisions audit-trailed for sanity-scan + future re-runs) | L3 | done |
| 35 | batch_with_sanity_scan_flag_flags_inverted_mlu (+ --sanity-scan + --sanity-scan-threshold on chatter batch; post-loop subprocess driver for chatter sanity-scan; precondition validation requiring --write-override + --write-pending) | L3 | done |
| 36 | refactor: split cli/args/core.rs (984 → 747 lines): extract DebugCommands → debug_commands.rs, CacheCommands → cache_commands.rs, config enums (LogFormat, TuiMode, OutputFormat, ParserBackend, AlignmentTier) → cli_types.rs, unit-test module → core_tests.rs (via #[path]); satisfies the 800-line hard limit | n/a | done (true-no-op refactor; covered by full regression suite + 110 bin/integration tests) |
| 37+ | sanity-scan multi-rename support; diarization-mix-review kind (operator workflow design needed); newtype threading at struct seams (deferred simplify finding); apply_decision arm dedup + per-kind OperatorDecision sub-enums | L3 + L4 | pending |

Phase B, speaker-id pipeline (cycles 9-16)

These cycles add chatter speaker-id and its three modes.

| # | RED | GREEN |
| --- | --- | --- |
| 9 | speaker_id_explicit_basic, L3 against an anonymous-2-speaker donor with --mapping "PAR0=drop,PAR1=INV:Investigator", asserts output has only INV utts | Stub chatter speaker-id subcommand. Implement parse_mapping_spec + apply_mapping. Reference mode and override-file mode return unimplemented!() for now. |
| 10 | apply_mapping_byte_stable_except_prefix + apply_mapping_rewrites_participants + apply_mapping_rewrites_id, L2 | Tighten apply_mapping per header rewrite rules. |
| 11 | identify_mapping_clean_winner, L2 with a fixture where one donor speaker overwhelmingly matches the reference | Implement text_cleaner + jaccard modules. Implement identify_mapping using them. Reference mode in CLI now works. |
| 12 | identify_mapping_borderline_refuses, L2 with a borderline fixture | Add ConfidenceThreshold check + LowConfidence error path. |
| 13 | speaker_id_reference_low_confidence_exits_4, L3 against borderline fixture | Map LowConfidence to exit code 4 in the CLI; print scores to stderr. |
| 14 | speaker_id_reference_writes_override, L3 with --write-override | Implement OverrideFile::read_or_default + OverrideFile::write. |
| 15 | speaker_id_override_file_replay, L3 with --override-file + --session-id | Implement override-file mode in CLI (OverrideFile::get + apply). |
| 16 | Token-cleaner L1 specs (a handful of representative clean-* specs from L1.1) + current spec/tools generators | Move the regex-and-string cleaner into a spec-test-covered implementation. Specs become the regression net. |

End of Phase B: the full chatter speaker-id + chatter merge pipeline works in auto (reference), explicit, and override-file modes.
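
The multiset Jaccard score that identify_mapping relies on is the sum of per-token minimum counts divided by the sum of per-token maximum counts. A minimal sketch over plain `HashMap` counters follows; the function names are illustrative, and the `margin` helper is an assumption (it guesses that the margin is the ratio of best to runner-up score, with `None` standing in for the "unbounded" form when the runner-up scores zero), since this plan does not define the margin formula.

```rust
use std::collections::HashMap;

/// Count occurrences of each cleaned content token.
pub fn token_counts<'a>(tokens: &[&'a str]) -> HashMap<&'a str, usize> {
    let mut counts = HashMap::new();
    for &token in tokens {
        *counts.entry(token).or_insert(0) += 1;
    }
    counts
}

/// Multiset Jaccard: sum(min(a, b)) / sum(max(a, b)) over all tokens.
/// Two empty multisets score 0.0 here (a choice made for this sketch).
pub fn multiset_jaccard(a: &HashMap<&str, usize>, b: &HashMap<&str, usize>) -> f64 {
    let mut intersection = 0usize;
    let mut union = 0usize;
    for (token, &count_a) in a {
        let count_b = b.get(*token).copied().unwrap_or(0);
        intersection += count_a.min(count_b);
        union += count_a.max(count_b);
    }
    // Tokens that appear only in `b` contribute to the union alone.
    for (token, &count_b) in b {
        if !a.contains_key(*token) {
            union += count_b;
        }
    }
    if union == 0 {
        0.0
    } else {
        intersection as f64 / union as f64
    }
}

/// ASSUMPTION: margin = best / runner-up; None models the "unbounded" case.
pub fn margin(best: f64, runner_up: f64) -> Option<f64> {
    if runner_up == 0.0 {
        None
    } else {
        Some(best / runner_up)
    }
}
```

For the token lists ["the", "dog", "the"] and ["the", "cat"], the intersection is 1 and the union is 4, giving a score of 0.25.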

Phase C, adjudication (cycles 17-22)

These cycles add the chatter adjudicate tool and its prompter-injection testability.

| # | RED | GREEN |
| --- | --- | --- |
| 17 | adjudicate_empty_pending_file_noop, L4 against an empty pending file, asserts exit 0 + no changes | Stub chatter adjudicate subcommand. Implement PendingAdjudications::read + run_adjudication core skeleton with a no-op Prompter trait. |
| 18 | prompter_scripted_returns_decisions_in_order, L4 | Implement ScriptedPrompter::from_decisions (in-memory) per the Prompter trait. |
| 19 | adjudicate_speaker_id_accepts_suggested, L4 against FIX_PENDING_SPEAKER_ID with one AcceptSuggested decision | Implement apply_decision for the speaker-id-low-confidence kind. Override file now gets the decision; pending entry removed. |
| 20 | adjudicate_speaker_id_override_mapping, L4 with OverrideMapping decision | Extend apply_decision for the override-mapping variant. |
| 21 | adjudicate_speaker_id_kind_mismatch_rejected, L4 with an OverrideInsertedRole against a speaker-id pending entry | Implement kind→variants validation in apply_decision. |
| 22 | adjudicate_scripted_mode_unknown_session_aborts + adjudicate_scripted_mode_extra_pending_aborts, L4 | Tighten scripted-mode validation; assert 1:1 mapping between pending entries and scripted decisions. |

End of Phase C: scripted adjudication tested end-to-end with synthetic operator inputs. Interactive terminal UX still unimplemented (next phase).

Phase D, interactive UX (cycles 23-25)

| # | RED | GREEN |
| --- | --- | --- |
| 23 | prompter_terminal_round_trip_decision, L4 with mocked stdin/stdout | Implement TerminalPrompter parsing [a]/[o]/[f]/... keys + optional follow-up prompts. |
| 24 | adjudicate_resumption_skips_decided_entries, L4 with a partially-decided override file + full pending list | Implement skip-already-decided logic in run_adjudication. |
| 25 | Manual smoke test (NOT automated), run chatter adjudicate --interactive against the test fixtures; visually confirm the operator UX matches the doc’s mock-up | Polish terminal output: ANSI formatting, fixed-width alignment, the [m] Show more context action, the [p] Play media action. |

End of Phase D: full v1 pipeline complete.

Phase E, non-speaker-id adjudication kinds (cycles 26-29)

Each adjudication kind gets its own RED→GREEN cycle.

| # | RED | GREEN |
| --- | --- | --- |
| 26 | adjudicate_parent_role_overrides_to_mother + adjudicate_parent_role_overrides_to_father, L4 | Implement parent-role-lookup kind end-to-end (pending schema, prompter context, decision application). |
| 27 | adjudicate_diarization_mix_flag_only, L4 | Implement diarization-mix-review kind end-to-end. |
| 28 | adjudicate_sanity_scan_swap_mapping, L4 | Implement sanity-scan-misclassification kind end-to-end. |
| 29 | adjudicate_re_adjudicate_preserves_history, L4 | Implement --re-adjudicate flag; add history field to MergeOverride. |
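
Cycle 29's history requirement can be pictured as "swap in the new decision, push the old one". The field names and the `Decision` struct below are invented for illustration; only the existence of a history field on MergeOverride comes from this plan.

```rust
/// Simplified stand-in for one recorded decision.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Decision {
    pub mode: String,
    pub mapping: String,
}

/// Assumed shape: the current decision plus every decision it replaced.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct MergeOverride {
    pub current: Decision,
    pub history: Vec<Decision>,
}

impl MergeOverride {
    /// Hypothetical method: record a new decision without losing the prior one.
    pub fn re_adjudicate(&mut self, new_decision: Decision) {
        // mem::replace hands back the old value so it can be archived.
        let previous = std::mem::replace(&mut self.current, new_decision);
        self.history.push(previous);
    }
}
```

Appending to the history (oldest first) keeps serialization deterministic, which matters for the override-file round-trip tests.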

Phase F, breadth pass (cycles 30+)

Fill in every remaining test from L1-L4 that hasn’t been written yet. These are coverage-deepening tests, not behavior adders. The impl from Phases A-E should pass them with at most minor refactoring; if a test fails meaningfully, that’s a gap in the impl that this cycle closes.

The breadth pass is the only phase where multiple cycles can proceed in parallel (different contributors take different test groups). Phases A-E are strictly serial.

Hard rules during impl phase

  • No test stubs. Every test in this plan, when written, must FAIL before its impl exists and PASS after. Skipped or #[ignore]-marked tests are not allowed in the regression net (use #[ignore] only for genuinely slow or environment-dependent tests, not for “not implemented yet”).
  • No test deletion to make CI green. If a test that was passing starts failing after a refactor, the refactor is wrong. Investigate; do not delete the test.
  • Three cycle archetypes, distinguish them. A cycle is one of:
    • bug-fix: RED motivates new impl code (cycle N-1’s impl truly cannot satisfy the new test).
    • regression-guard: RED pins an invariant the impl inherits from upstream infrastructure (e.g. parse→serialize byte-stability inherited from talkbank-parser). The test passes against cycle N-1’s impl, but the cycle is valuable because it locks in the invariant against future “optimizations” that might break it. On the first run, print the actual behavior verbosely to confirm the invariant holds for the right reasons, not by accident.
    • true no-op: RED tests something already pinned elsewhere. These ARE unnecessary; drop the cycle or sharpen the test. The difference between regression-guard and true no-op is whether the invariant is named explicitly anywhere else. If yes (e.g., the parser crate already has a roundtrip test that covers it), the cycle is true-no-op. If no, the cycle is a regression-guard and worth keeping.

Merge Pipeline, Crate Architecture

Status: Draft Last modified: 2026-06-14 19:57 EDT

This page explains where the new merge-pipeline code lives in the chatter workspace, which crates gain modules, what depends on what, and which boundary each piece sits inside. The goal is succession-readability: a contributor coming to this work for the first time should be able to map a behavior they read about in chatter merge or chatter speaker-id to the precise crate + module that implements it.

Companion documents:

Boundary decisions

Two boundary decisions govern where every new piece of code lives. Both reference rules already documented in this repo’s root CLAUDE.md (workspace-root contributor guide, outside the book).

Decision 1: talkbank-* crates, not batchalign-* crates

The merge pipeline is pure CHAT-AST structural manipulation: no ML, no audio I/O, no network, no model loading, no fleet runtime. Per the crate-boundary decision test in the workspace CLAUDE.md:

If code fundamentally needs ML models, audio processing, network services, or fleet runtime → batchalign-* crate. Otherwise → talkbank-* crate.

chatter merge and chatter speaker-id answer “no” to each ML/audio/network/runtime question. They consume parsed ChatFile values, manipulate them, and emit parsed-and-serialized output. Even the speaker-id text-similarity scoring is a deterministic function over CHAT content tokens: no ML model, no embedding, no inference. All new merge code lives in talkbank-* crates.

The batchalign-* crates remain the home for batchalign3 transcribe (ASR), batchalign3 align (forced alignment), and batchalign3 morphotag (Stanza-based morphological tagging): the ML-bearing stages that surround the merge in the pipeline.

Decision 2: types in talkbank-model, algorithms in talkbank-transform, CLI in chatter

The merge pipeline’s code splits across exactly the same three talkbank-* crates that already host the parse/validate/normalize/JSON pipelines:

  • talkbank-model owns the typed vocabulary (domain types, errors). No algorithms.
  • talkbank-transform owns the algorithms (token cleaning, Jaccard scoring, mapping application, structural merge). No CLI parsing, no clap.
  • chatter owns the subcommands (chatter speaker-id, chatter merge). Thin shim layer that parses arguments and drives the transform layer.

This mirrors how chatter validate, chatter normalize, chatter to-json are wired today and keeps the crate boundaries honest: a future caller wanting only the algorithms (e.g., a library binding, an HTTP service) can depend on talkbank-transform without pulling in clap. A future caller wanting only the types (e.g., an external tool reading override files) can depend on talkbank-model without pulling in the tree-sitter parser.

Crate dependency graph

The new code does not introduce any new crate-level dependencies: every edge below already exists in the workspace today. The merge work adds modules to existing crates.

flowchart TD
    derive["talkbank-derive\n(proc macros, unchanged)"]
    model["talkbank-model\n(+ merge module)"]
    parser["talkbank-parser\n(unchanged)"]
    transform["talkbank-transform\n(+ transcript_merge, speaker_id modules)"]
    cli["chatter\n(+ merge, speaker-id subcommands)"]
    cli_tests["chatter/tests/\n(+ merge_tests.rs)"]
    transform_tests["talkbank-transform/tests/\n(+ transcript_merge_tests, speaker_id_tests, override_file_tests)"]
    spec["spec/constructs/speaker-id/\n(token cleaner + Jaccard specs)"]
    parser_tests["talkbank-parser-tests\n(consumes regenerated specs)"]

    derive --> model
    model --> parser
    model --> transform
    parser --> transform
    transform --> cli
    model --> cli
    transform --> transform_tests
    transform --> cli_tests
    cli --> cli_tests
    spec -.->|spec/tools generators| parser_tests
    model -.-> parser_tests

Dashed edges (-.->) are build-time relationships rather than Cargo dependency edges: the current spec/tools generators regenerate Rust tests under talkbank-parser-tests from spec markdown, but the spec directory is not a Cargo crate.

Module layout per affected crate

talkbank-model, new merge/ module

Adds a top-level merge module under crates/talkbank-model/src/. Layout:

crates/talkbank-model/src/merge/
    mod.rs, pub re-exports
    scoring.rs, JaccardScore, ConfidenceThreshold, Margin
    role.rs, InsertedRole, MappingAction
    mapping.rs, SpeakerMapping + parse_mapping_spec
    retain.rs, RetainSet
    override_file.rs, DecisionMode, MergeFlag, OperatorId,
                             SessionId, MergeOverride, OverrideFile
    errors.rs, SpeakerIdError, MergeError, OverrideFileError

Per the file-size rule (≤400 lines target, ≤800 hard) each file stays modest. Re-exports go through mod.rs:

// crates/talkbank-model/src/merge/mod.rs
pub mod scoring;
pub mod role;
pub mod mapping;
pub mod retain;
pub mod override_file;
pub mod errors;

pub use errors::{MergeError, OverrideFileError, SpeakerIdError};
pub use mapping::{parse_mapping_spec, SpeakerMapping};
pub use override_file::{MergeOverride, OverrideFile /* , ... remaining override-file types */};
pub use retain::RetainSet;
pub use role::{InsertedRole, MappingAction};
pub use scoring::{ConfidenceThreshold, JaccardScore, Margin};

Exposed at the crate root via the existing crates/talkbank-model/src/lib.rs pattern:

pub mod merge;

No new external crate dependency: chrono and toml are already pinned at workspace level. The merge module pulls them in via { workspace = true } annotations in crates/talkbank-model/Cargo.toml.
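
To make the role.rs and mapping.rs vocabulary concrete, here is a sketch of parsing a mapping spec such as "PAR0=drop,PAR1=INV:Investigator". The type shapes, the error enum, and the `BTreeMap` return type are assumptions made for illustration; the real parse_mapping_spec may return a dedicated SpeakerMapping type and different errors.

```rust
use std::collections::BTreeMap;

/// Assumed shape: a CHAT speaker code plus its role tag, e.g. INV / Investigator.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct InsertedRole {
    pub code: String,
    pub tag: String,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum MappingAction {
    Drop,
    Rename(InsertedRole),
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum MappingSpecError {
    MissingEquals(String),
    EmptySpeaker(String),
    BadRole(String),
    DuplicateSpeaker(String),
}

/// Parse "SPEAKER=drop" and "SPEAKER=CODE:Tag" entries separated by commas.
pub fn parse_mapping_spec(spec: &str) -> Result<BTreeMap<String, MappingAction>, MappingSpecError> {
    let mut mapping = BTreeMap::new();
    for entry in spec.split(',') {
        let entry = entry.trim();
        let (speaker, action) = entry
            .split_once('=')
            .ok_or_else(|| MappingSpecError::MissingEquals(entry.to_string()))?;
        if speaker.is_empty() {
            return Err(MappingSpecError::EmptySpeaker(entry.to_string()));
        }
        let action = if action == "drop" {
            MappingAction::Drop
        } else {
            match action.split_once(':') {
                Some((code, tag)) if !code.is_empty() && !tag.is_empty() => {
                    MappingAction::Rename(InsertedRole {
                        code: code.to_string(),
                        tag: tag.to_string(),
                    })
                }
                _ => return Err(MappingSpecError::BadRole(action.to_string())),
            }
        };
        // Reject a speaker that is mapped twice rather than silently overwriting.
        if mapping.insert(speaker.to_string(), action).is_some() {
            return Err(MappingSpecError::DuplicateSpeaker(speaker.to_string()));
        }
    }
    Ok(mapping)
}
```

Because the parser depends only on the standard library, it fits the rule that talkbank-model holds typed vocabulary without pulling in clap or the tree-sitter parser.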

talkbank-transform, new speaker_id/ and transcript_merge/ modules

Two sibling top-level modules, mirroring the user-facing distinction between the two subcommands:

crates/talkbank-transform/src/speaker_id/
    mod.rs, identify_mapping, apply_mapping
    text_cleaner.rs, content-token extraction from ChatFile
    jaccard.rs, multiset Jaccard over Counter<&str>
    header_rewrite.rs, @Participants / @ID rewriting per mapping

crates/talkbank-transform/src/transcript_merge/
    mod.rs, pub fn merge(...) entry point
    timeline.rs, utterance ordering by start_ms
    bullet_lift.rs, derive main-tier bullet from %wor
    tier_strip.rs, strip downstream-owned dependent tiers
    header_reconcile.rs, @Languages match, @Participants concat, etc.
    preconditions.rs, RetainSpeakersMissing, NoTimelineInFile1, etc.

Both modules land alongside the existing CHAT-core transform modules (parse, serialize, validate, normalize) in talkbank-transform.
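
As an illustration of what tier_strip.rs is responsible for, the sketch below drops the default derived tiers (%wor, %mor, %gra, %pho) from inserted-speaker utterances while leaving retained speakers untouched. The `Utterance` and `DependentTier` structs and the `strip_tiers` signature are simplified stand-ins, not the talkbank-model types.

```rust
/// Simplified stand-in for a dependent tier such as %wor or %com.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DependentTier {
    pub kind: String, // tier name without the leading '%'
    pub content: String,
}

/// Simplified stand-in for one utterance and its dependent tiers.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Utterance {
    pub speaker: String,
    pub tiers: Vec<DependentTier>,
}

/// Default strip set, taken from the Phase A plan.
pub const DEFAULT_STRIP_TIERS: [&str; 4] = ["wor", "mor", "gra", "pho"];

/// Remove the listed tiers from an utterance unless its speaker is retained.
pub fn strip_tiers(utterance: &mut Utterance, retain: &[&str], strip: &[&str]) {
    // Retained speakers must stay byte-stable, so they are never touched.
    if retain.contains(&utterance.speaker.as_str()) {
        return;
    }
    utterance
        .tiers
        .retain(|tier| !strip.contains(&tier.kind.as_str()));
}
```

Passing an empty strip list preserves every tier, which is the behavior merge_strip_tiers_empty_preserves_all pins.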

Exposed via crates/talkbank-transform/src/lib.rs:

pub mod speaker_id;
pub mod transcript_merge;

chatter, new speaker_id/ and transcript_merge/ command directories

The CLI dispatch pattern in this crate uses one directory per multi-file command (e.g. commands/validate/, commands/find/, commands/alignment/) or one file for single-file commands (commands/normalize.rs, commands/lint.rs). Both new subcommands warrant directories because each has multiple operation modes (speaker-id has reference / explicit / override-file modes; merge has the main merge path plus probably a future merge --check mode).

crates/chatter/src/commands/speaker_id/
    mod.rs, clap subcommand dispatch
    args.rs, flag parsing, mode disambiguation
    reference_mode.rs, drives identify_mapping + apply_mapping
    explicit_mode.rs, drives parse_mapping_spec + apply_mapping
    override_mode.rs, drives OverrideFile::read + apply_mapping
    output.rs, formats per-speaker scores to stderr,
                             writes override file via --write-override

crates/chatter/src/commands/transcript_merge/
    mod.rs, clap subcommand dispatch
    args.rs, --retain, --strip-tiers parsing
    runner.rs, drives the merge pipeline
    output.rs, exit-code mapping, error formatting

The CLI argument enums extend crates/chatter/src/cli/args.rs’s top-level Commands enum:

// in crates/chatter/src/cli/args.rs
pub enum Commands {
    Validate(/* ... */),
    Normalize(/* ... */),
    // ... existing variants ...
    SpeakerId(commands::speaker_id::args::SpeakerIdArgs),  // NEW
    Merge(commands::transcript_merge::args::MergeArgs),     // NEW
}

Subcommand dispatch in crates/chatter/src/main.rs already matches on the Commands enum; the new arms wire to the respective commands::*::run entry points.

Test crates

Per the Test Plan:

crates/talkbank-transform/tests/
    speaker_id_tests.rs, L2 tests for identify_mapping / apply_mapping
    transcript_merge_tests.rs, L2 tests for merge invariants
    override_file_tests.rs, L2 tests for round-trip / refusal

crates/chatter/tests/
    merge_tests.rs, L3 subprocess tests for both new commands

spec/constructs/speaker-id/
    token-cleaner/, L1 fragment specs
    jaccard-scoring/, L1 golden Jaccard specs
    mapping-application/, L1 header rewrite specs

The spec entries flow into Rust tests under crates/talkbank-parser-tests/tests/generated/ via the standard spec/tools workflow.

Data flow for chatter merge

The full call graph when an operator runs chatter merge file1.cha file2.cha --retain CHI -o out.cha:

sequenceDiagram
    actor Operator
    participant CLI as chatter<br/>(merge)
    participant Args as commands::transcript_merge::args
    participant Runner as commands::transcript_merge::runner
    participant Parser as talkbank-parser<br/>(TreeSitterParser)
    participant Merge as talkbank-transform::transcript_merge
    participant Model as talkbank-model::ChatFile

    Operator->>CLI: chatter merge file1 file2 --retain CHI
    CLI->>Args: parse argv → MergeArgs
    Args-->>CLI: MergeArgs { file1, file2, retain: RetainSet, ... }
    CLI->>Runner: run(args)
    Runner->>Parser: parse_chat_file(file1)
    Parser-->>Runner: ChatFile (file1)
    Runner->>Parser: parse_chat_file(file2)
    Parser-->>Runner: ChatFile (file2)
    Runner->>Merge: merge(file1, file2, retain)
    Merge->>Merge: header_reconcile · timeline · tier_strip · bullet_lift
    Merge-->>Runner: ChatFile (merged) or MergeError
    alt Ok(merged)
        Runner->>Model: merged.write_chat() to output path
        Runner-->>Operator: exit 0
    else Err(MergeError)
        Runner-->>Operator: formatted stderr + exit code 2
    end

The CLI layer is thin: it parses arguments, calls the transform layer’s merge function, and translates the Result<ChatFile, MergeError> into stdout/stderr/exit-code output. All algorithm logic lives in talkbank-transform.
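The translation step can be sketched as follows. This is a minimal illustration, not the actual output.rs: the two MergeError variant names come from preconditions.rs but their payloads are assumptions, CliOutcome and translate are hypothetical names, and a String stands in for the serialized ChatFile. The exit codes 0 and 2 are the ones shown in the sequence diagram.

```rust
use std::fmt;

/// Stand-in for the transform layer's error type. The variant names appear in
/// preconditions.rs; the payloads here are assumptions for illustration.
#[derive(Debug)]
pub enum MergeError {
    RetainSpeakersMissing(Vec<String>),
    NoTimelineInFile1,
}

impl fmt::Display for MergeError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            MergeError::RetainSpeakersMissing(codes) => {
                write!(f, "retained speakers missing: {}", codes.join(", "))
            }
            MergeError::NoTimelineInFile1 => write!(f, "file1 has no timeline"),
        }
    }
}

/// Hypothetical bundle of what the CLI layer hands back to the process.
pub struct CliOutcome {
    pub exit_code: u8,
    pub stdout: String,
    pub stderr: String,
}

/// Translate the transform result into exit code and stream contents.
/// `String` stands in for a serialized ChatFile.
pub fn translate(result: Result<String, MergeError>) -> CliOutcome {
    match result {
        Ok(chat) => CliOutcome {
            exit_code: 0,
            stdout: chat,
            stderr: String::new(),
        },
        Err(err) => CliOutcome {
            exit_code: 2,
            stdout: String::new(),
            stderr: format!("error: {err}"),
        },
    }
}
```

Keeping the translation in one function means the exit-code contract can be asserted without spawning a subprocess.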

Data flow for chatter speaker-id

The reference-mode call path:

sequenceDiagram
    actor Operator
    participant CLI as chatter<br/>(speaker-id)
    participant Runner as commands::speaker_id::reference_mode
    participant Parser as talkbank-parser
    participant SpkId as talkbank-transform::speaker_id
    participant Override as talkbank-model::merge::override_file

    Operator->>CLI: chatter speaker-id input --reference ref --anchor CHI<br/>--inserted-role INV:Investigator
    CLI->>Runner: run(args)
    Runner->>Parser: parse_chat_file(input)
    Parser-->>Runner: ChatFile (donor)
    Runner->>Parser: parse_chat_file(reference)
    Parser-->>Runner: ChatFile (reference)
    Runner->>SpkId: identify_mapping(donor, reference, anchor, role, threshold)
    SpkId-->>Runner: SpeakerMapping or LowConfidence{scores, margin}
    alt Ok(mapping)
        Runner->>SpkId: apply_mapping(donor, mapping)
        SpkId-->>Runner: ChatFile (relabeled)
        opt --write-override
            Runner->>Override: OverrideFile::read_or_default(path)
            Override-->>Runner: OverrideFile
            Runner->>Override: insert entry, write back
        end
        Runner-->>Operator: relabeled output, exit 0
    else Err(LowConfidence)
        Runner-->>Operator: scores to stderr, exit 4
    end

The explicit-mapping and override-file modes use the same apply_mapping and --write-override paths but skip identify_mapping; the mapping comes from parse_mapping_spec or OverrideFile::get respectively.
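To make the scoring step concrete, here is a rough sketch of Jaccard scoring followed by a margin check. It is not the real identify_mapping: the whitespace-and-lowercase tokenization is a simplifying assumption (the actual token cleaner is specified separately), the margin is assumed to be the ratio of the best score to the runner-up, and jaccard, decide, and Confidence are hypothetical names.

```rust
use std::collections::HashSet;

/// Jaccard similarity: shared tokens divided by total distinct tokens.
/// Tokenization (lowercase + whitespace split) is an assumption.
pub fn jaccard(a: &str, b: &str) -> f64 {
    let lower_a = a.to_lowercase();
    let lower_b = b.to_lowercase();
    let set_a: HashSet<&str> = lower_a.split_whitespace().collect();
    let set_b: HashSet<&str> = lower_b.split_whitespace().collect();
    let union = set_a.union(&set_b).count();
    if union == 0 {
        return 0.0;
    }
    let shared = set_a.intersection(&set_b).count();
    shared as f64 / union as f64
}

/// Hypothetical outcome of comparing the best candidate with the runner-up.
#[derive(Debug, PartialEq)]
pub enum Confidence {
    Confident { winner: String },
    Low { margin: f64 },
}

/// `scores` holds (speaker code, Jaccard score) pairs.
/// Returns None when fewer than two candidates are supplied.
pub fn decide(scores: &[(&str, f64)], threshold: f64) -> Option<Confidence> {
    let mut sorted: Vec<(&str, f64)> = scores.to_vec();
    // Highest score first.
    sorted.sort_by(|x, y| y.1.total_cmp(&x.1));
    let best = *sorted.first()?;
    let second = *sorted.get(1)?;
    // Assumed definition: margin = best score / runner-up score.
    let margin = if second.1 > 0.0 {
        best.1 / second.1
    } else if best.1 > 0.0 {
        f64::INFINITY
    } else {
        0.0
    };
    if margin >= threshold {
        Some(Confidence::Confident { winner: best.0.to_string() })
    } else {
        Some(Confidence::Low { margin })
    }
}
```

Under these assumptions, scores of 0.6286 and 0.3457 give a margin of roughly 1.82, which falls below a threshold of 2.0 and would be reported as low confidence.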

How this composes with the post-merge ML stages

The end-to-end pipeline batchalign3 transcribe → chatter speaker-id → chatter merge → batchalign3 align → batchalign3 morphotag crosses the talkbank-* / batchalign-* boundary twice:

flowchart LR
    subgraph BA[Batchalign, ML / audio / network]
        Trans["batchalign3 transcribe"]
        Align["batchalign3 align"]
        Morph["batchalign3 morphotag"]
    end
    subgraph TB[talkbank, pure CHAT-AST]
        SpkId["chatter speaker-id"]
        Merge["chatter merge"]
    end
    Media["mp4 / wav media"] --> Trans
    Trans -->|ASR.cha| SpkId
    Hand["hand transcript.cha"] -->|reference| SpkId
    Hand --> Merge
    SpkId -->|labeled.cha| Merge
    Merge -->|merged.cha| Align
    Align -->|+ bullets + %wor| Morph
    Morph -->|+ %mor + %gra| Final["final.cha"]

Each crossing is CHAT-file-to-CHAT-file at a stable serialization boundary: Batchalign emits a CHAT file, talkbank consumes it; talkbank emits a CHAT file, Batchalign consumes it. Neither side has a runtime dependency on the other; they exchange data through the file system (or piped stdin/stdout) exactly as the user-facing CLI commands do. This keeps the boundary honest: a contributor working on the merge pipeline never needs to load a Stanza model, and a contributor working on batchalign3 align never needs to parse a speaker-id override file.

Public surface impact

Cumulative public API additions (the surface a downstream library consumer would see):

| Crate | New pub items | Stability |
|---|---|---|
| talkbank-model | merge::{SpeakerCode, ParticipantRole, ...} (re-exports for ergonomics; the underlying types already exist), plus the new types in merge::scoring/role/mapping/retain/override_file/errors | Stable; versioned via the workspace’s existing release process |
| talkbank-transform | speaker_id::{identify_mapping, apply_mapping, LowConfidenceError}; transcript_merge::{merge} | Stable; algorithms behind these are pinned by the test plan’s L2 tests |
| chatter | Two new Commands enum variants and their argument structs | Internal to the binary, not a library surface |

No existing public surface is modified or removed; this is a purely additive change. Existing consumers (the VS Code extension, talkbank-lsp, chatter-desktop, batchalign) continue to depend on the existing surface and can ignore the additions until a workflow uses them.

Where to look for things (newcomer guide)

| Question | File |
|---|---|
| “What does chatter merge do?” | book/src/chatter/user-guide/merge.md |
| “What does chatter speaker-id do?” | book/src/chatter/user-guide/speaker-id.md |
| “What’s in an override file?” | book/src/chatter/integrating/merge-overrides.md |
| “What types are in talkbank-model::merge?” | book/src/architecture/merge-domain-types.md |
| “Where are the tests?” | book/src/architecture/merge-test-plan.md |
| “Which crate is this code in and why?” | This page |
| “Where does the merge code live in source?” | crates/talkbank-transform/src/speaker_id/ + crates/talkbank-transform/src/transcript_merge/ + crates/chatter/src/commands/speaker_id/ + crates/chatter/src/commands/transcript_merge/ |
| “What’s in an utterance / ChatFile / %mor tier?” | talkbank-model crate rustdoc; book/src/architecture/chat-model/chat-model.md |
| “What’s the parser do?” | book/src/architecture/parsing.md; book/src/architecture/parser-model-contracts.md |

Adjudication Workflow

Status: Draft Last updated: 2026-06-13 20:54 EDT

This page specifies how human-in-the-loop adjudication fits into the merge pipeline. Several pipeline stages have decision points where the algorithm cannot or should not auto-decide; this document specifies how those refusals reach an operator, how the operator’s decision is recorded, and how the pipeline resumes with the decision applied.

The design satisfies two constraints set explicitly upstream:

  • Test the interaction. Every operator-decision path must be exercisable in automated tests by providing synthetic operator choices. No hardcoded stdin reads in the decision core; a pluggable prompter abstraction is mandatory.
  • Batch-then-review is the default workflow. No mid-batch interactive pauses in the main pipeline. The optional --interactive flag exists on the adjudication tool only, for small-batch debugging, and rides on the same data contract.

Companion documents:

Why batch-then-review, and not real-time

Every adjudication point in the pipeline is per-session local: the operator’s decision affects this session’s output and no other session in the same batch. There is no case where an operator decision propagates forward to influence how other sessions get processed.

The cases that might appear to want real-time interaction are better served by sampling:

| Case | Real-time approach | Better approach |
|---|---|---|
| Systematic pipeline failure (everything refuses) | Watch each refusal, abort batch | Run a 5-10-session canary first; examine; abort or proceed |
| Confidence-threshold calibration on a new corpus | Adjust threshold mid-batch | Run canary; pick threshold; full batch |
| Cross-session pattern (one contributor always has PAR0 = clinician) | Notice during interactive review | Run canary; observe pattern; add per-contributor explicit mapping to orchestrator config |
| Operator wants per-session progress visibility | Watch each step | chatter adjudicate --interactive after a batch run, walking the same pending queue |

TalkBank’s operational reality makes batch-then-review strictly better:

  • Batches are research-scale (hundreds of sessions per donor). Requiring operator presence during the batch run means hours of babysitting.
  • Overnight and fleet runs are routine; interactive doesn’t work for those.
  • Focused operator review of all refusals together is more efficient than scattered per-batch decisions (less context-switching; easier to spot patterns across sessions).
  • Aligns with the project’s “academic research, accuracy is the standard, take however long it takes” rule: operator efficiency dominates wall-clock latency.

The --interactive flag is preserved for the small-batch debugging case but is explicitly NOT the dominant workflow.

The known adjudication points

The pipeline has at least five points where adjudication may be needed. Each is recorded as one or more entries in the override file via the same schema.

| # | Adjudication point | Trigger | Operator’s decision | Affects |
|---|---|---|---|---|
| 1 | Speaker-id low confidence | chatter speaker-id Jaccard margin < threshold | Per-speaker mapping (drop/rename) and inserted_role | Speaker labeling, drop set, downstream merge |
| 2 | Parent role lookup | Parent-sample session needs MOT vs FAT decision | inserted_role.code and .tag for this session | The merged file’s headers + main-tier prefixes |
| 3 | Diarization-mix flag | Operator observes Batchalign collapsed multiple real-world speakers into one label | flags = ["diarization-mixed"] plus a note | Downstream consumers know output is imperfect; might gate publication |
| 4 | Post-merge sanity scan | Auto-scan flags retained-speaker utterances with high-text-similarity inserted-speaker utterances nearby (suggesting speaker-id misclassification) | Confirm or override the original speaker-id mapping | Triggers re-run of speaker-id + merge for the session |
| 5 | Unbulleted reference file | Reference CHAT file has no time bullets; merge can’t proceed | Either bullet the reference upstream, or request fresh authoritative data | Pipeline blocked for this session pending external fix |

Points 1-4 are handled by the unified chatter adjudicate tool specified below. Point 5 is an out-of-scope failure mode: the adjudication tool records that the session is blocked, but the fix lives outside this pipeline (operator contacts the contributor or runs forced-alignment first).

Data flow

flowchart TD
    Inputs["Input CHAT files +<br/>reference files"]
    Orch["Orchestrator<br/>(future: tb subcommand;<br/>now: shell/script)"]
    SpkId["chatter speaker-id<br/>(per session)"]
    Merge["chatter merge<br/>(per session)"]
    Pending["pending-adjudications.toml<br/>(workflow queue)"]
    Override["overrides.toml<br/>(durable decisions)"]
    Adj["chatter adjudicate"]
    Operator((Operator))
    Final["merged/*.cha"]

    Inputs --> Orch
    Orch -->|pass 1: speaker-id| SpkId
    SpkId -->|exit 0 → auto entry| Override
    SpkId -->|exit 4 → pending entry| Pending
    Orch -->|pass 1: merge for ok sessions| Merge
    Merge --> Final
    Pending --> Adj
    Override --> Adj
    Adj <-->|prompter| Operator
    Adj -->|writes decision| Override
    Adj -->|removes resolved| Pending
    Override -->|pass 2| Orch
    Orch -.->|loop until pending empty| SpkId

The orchestrator runs two passes:

Pass 1: for every input session, run chatter speaker-id in reference mode. Successful auto-decides write to the override file with mode = "auto" and immediately proceed to chatter merge. Refusals (exit code 4) and other adjudication-requiring states write a pending entry to pending-adjudications.toml and the session is skipped for the rest of pass 1.

Pass 2 (after operator runs chatter adjudicate): the orchestrator re-runs chatter speaker-id for the previously skipped sessions, now finding decisions in the override file (mode = "override"). Sessions complete; pending entries are removed.

The pipeline is idempotent: re-running pass 1 on a partially adjudicated batch produces no spurious work, because sessions with already-recorded decisions skip directly to merge.

The pending-adjudications artifact

Separate from the override file, a pending-adjudications.toml file holds in-flight workflow state. Its purpose is to carry the evidence the operator needs (per-speaker scores, opening utterance previews) from the orchestrator’s pass 1 to the adjudication tool, without polluting the override file with “to-do” entries.

Schema

schema_version = 1

[[entries]]
session_id = "session-102-t1"
kind = "speaker-id-low-confidence"
created_at = 2026-05-27T11:00:00-04:00

# Inputs the adjudication tool needs:
input_path = "asr/session-102-t1.cha"
reference_path = "chi-only/session-102-t1.cha"
anchor_speaker = "CHI"

# Evidence for the operator:
scores = { PAR0 = 0.6286, PAR1 = 0.3457 }
margin = 1.82
threshold_used = 2.0

# Opening turns (first N utterances per speaker) for context:
preview = """
*CHI:    they start to bite . [0_1708]
*PAR0:   They start to bite . [75_1165]
*PAR1:   They do what . [1515_2245]
... (further preview)
"""

# Suggested defaults the operator can accept-as-is:
suggested = { mapping = { PAR0 = "drop", PAR1 = "rename" }, inserted_role = { code = "INV", tag = "Investigator" } }

[[entries]]
session_id = "session-103-t1-parent"
kind = "parent-role-lookup"
# ... different evidence for the MOT-vs-FAT case ...

Schema characteristics

  • kind discriminates the adjudication type (one of speaker-id-low-confidence, parent-role-lookup, diarization-mix-review, sanity-scan-misclassification). Each kind has its own required field set; the adjudication tool dispatches on kind to choose the right prompt template and the right validator for the operator’s response.
  • suggested carries what the algorithm WOULD have chosen had the threshold been lower (for speaker-id) or a parsed default (for parent-role). The operator can accept-as-is or override.
  • Entries are a [[entries]] array of tables (not a session-keyed [<session_id>] map) because the same session could conceivably have multiple pending decisions (e.g., a speaker-id refusal AND a parent-role lookup), each a separate array entry.
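Dispatching on kind could look roughly like the sketch below. The four kind strings are the ones listed above; the type names AdjudicationKind and UnknownKind are hypothetical, and the real tool presumably deserializes the field through its TOML layer rather than through FromStr.

```rust
use std::str::FromStr;

/// Hypothetical typed form of the `kind` discriminator.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AdjudicationKind {
    SpeakerIdLowConfidence,
    ParentRoleLookup,
    DiarizationMixReview,
    SanityScanMisclassification,
}

/// Returned when a pending entry carries a kind this tool does not know.
#[derive(Debug, PartialEq, Eq)]
pub struct UnknownKind(pub String);

impl AdjudicationKind {
    /// The string form written to pending-adjudications.toml.
    pub fn as_str(self) -> &'static str {
        match self {
            Self::SpeakerIdLowConfidence => "speaker-id-low-confidence",
            Self::ParentRoleLookup => "parent-role-lookup",
            Self::DiarizationMixReview => "diarization-mix-review",
            Self::SanityScanMisclassification => "sanity-scan-misclassification",
        }
    }
}

impl FromStr for AdjudicationKind {
    type Err = UnknownKind;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "speaker-id-low-confidence" => Ok(Self::SpeakerIdLowConfidence),
            "parent-role-lookup" => Ok(Self::ParentRoleLookup),
            "diarization-mix-review" => Ok(Self::DiarizationMixReview),
            "sanity-scan-misclassification" => Ok(Self::SanityScanMisclassification),
            other => Err(UnknownKind(other.to_string())),
        }
    }
}
```

An exhaustive enum lets the compiler flag every prompt template or validator that has not been updated when a new kind is added.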

Lifecycle

  • Written by: the orchestrator’s pass 1, when chatter speaker-id exits with code 4 or when other adjudication triggers fire.
  • Consumed by: chatter adjudicate, which reads it, prompts the operator entry-by-entry, writes decisions to the override file, and removes resolved entries.
  • Cleaned up: an empty entries array is the “all clear” state; pass 2 of the orchestrator can proceed.

chatter adjudicate, CLI surface

A new subcommand in chatter. Its job is to walk a pending-adjudications file and write decisions to an override file.

chatter adjudicate <PENDING_FILE> --override-file <OVERRIDE_FILE> [OPTIONS]

ARGUMENTS:
  <PENDING_FILE>   Path to pending-adjudications.toml.

REQUIRED OPTIONS:
  --override-file <PATH>
      Path to the override file (created if missing, appended to if
      it exists). Decisions go here.

OPTIONS:
  --interactive
      (default) Prompt the operator for each pending entry via
      a terminal UI. This is the only interactive backend for v1;
      later UI backends may add e.g. --backend=web for web-served
      prompts.

  --scripted <PATH>
      Read pre-canned decisions from a TOML file. Used in tests
      and in automated bulk-decision workflows (e.g., the
      operator has prepared a decision sheet in advance).
      Mutually exclusive with --interactive.

  --kind <KIND>
      Process only pending entries whose `kind` matches. Useful
      when the operator wants to batch through one class of
      decision at a time (e.g., do all parent-role lookups
      first, then all speaker-id refusals).

  --skip-on-error
      If the operator's response cannot be applied (e.g., they
      typed an invalid speaker code), log and skip rather than
      abort. Default: abort on first invalid response.

  --operator <NAME>
      Operator identifier recorded in override entries.
      Default: $USER.

  --dry-run
      Read pending and prompt the operator, but do NOT write to
      the override file. Useful for previewing what decisions
      look like before committing.

Exit codes:

| Code | Meaning |
|---|---|
| 0 | All pending entries decided; pending file updated |
| 1 | I/O error (missing file, unparseable, write failure) |
| 2 | Operator-supplied decision rejected as invalid (when --skip-on-error not set) |
| 3 | Internal error |
| 4 | Operator deferred at least one entry (used :skip in the prompt); pending file still has entries |
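One way to keep these codes from drifting is to name them in a single enum. The sketch below is only an illustration: AdjudicateExit and exit_for_clean_run are hypothetical names, and the numeric values are the ones from the table above.

```rust
/// Hypothetical named form of the chatter adjudicate exit codes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AdjudicateExit {
    AllDecided,
    IoError,
    InvalidDecision,
    InternalError,
    Deferred,
}

impl AdjudicateExit {
    /// Numeric process exit code for each outcome.
    pub fn code(self) -> u8 {
        match self {
            AdjudicateExit::AllDecided => 0,
            AdjudicateExit::IoError => 1,
            AdjudicateExit::InvalidDecision => 2,
            AdjudicateExit::InternalError => 3,
            AdjudicateExit::Deferred => 4,
        }
    }
}

/// Outcome of a run that hit no errors: success only when nothing was deferred.
pub fn exit_for_clean_run(deferred_entries: usize) -> AdjudicateExit {
    if deferred_entries == 0 {
        AdjudicateExit::AllDecided
    } else {
        AdjudicateExit::Deferred
    }
}
```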

The --scripted mode is the testability seam. A scripted decision file looks like:

schema_version = 1

[[decisions]]
session_id = "session-102-t1"
kind = "speaker-id-low-confidence"
choice = { kind = "accept-suggested", note = "verified by listening" }

[[decisions]]
session_id = "session-103-t1-parent"
kind = "parent-role-lookup"
choice = { kind = "override", inserted_role = { code = "FAT", tag = "Father" }, note = "per contributor data sheet" }

The adjudication tool reads the scripted file, matches decisions to pending entries by session_id + kind, and applies each as though the operator had typed it. If a scripted decision has no matching pending entry, or a pending entry has no scripted decision, the run aborts with a clear error.
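The matching rule can be sketched as a set comparison over (session_id, kind) keys. This is a simplified illustration: Key, MatchError, and match_scripted are hypothetical names, and the decision payloads are left out so that only the both-directions check is shown.

```rust
use std::collections::BTreeSet;

/// (session_id, kind) identifies both a pending entry and a scripted decision.
pub type Key = (String, String);

/// Hypothetical error for the two abort conditions.
#[derive(Debug, PartialEq, Eq)]
pub enum MatchError {
    /// Scripted decisions that name no pending entry.
    UnmatchedDecisions(Vec<Key>),
    /// Pending entries that the script does not cover.
    UndecidedEntries(Vec<Key>),
}

/// Succeeds only when both key sets are identical; returns the keys in
/// pending-file order so decisions are applied in the order entries appear.
pub fn match_scripted(pending: &[Key], scripted: &[Key]) -> Result<Vec<Key>, MatchError> {
    let pending_set: BTreeSet<&Key> = pending.iter().collect();
    let scripted_set: BTreeSet<&Key> = scripted.iter().collect();

    let extra: Vec<Key> = scripted_set
        .difference(&pending_set)
        .map(|key| (**key).clone())
        .collect();
    if !extra.is_empty() {
        return Err(MatchError::UnmatchedDecisions(extra));
    }

    let missing: Vec<Key> = pending_set
        .difference(&scripted_set)
        .map(|key| (**key).clone())
        .collect();
    if !missing.is_empty() {
        return Err(MatchError::UndecidedEntries(missing));
    }

    Ok(pending.to_vec())
}
```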

The prompter abstraction (testability)

The adjudication tool’s core flow is:

// pseudocode, actual signatures live in talkbank-transform
pub fn run_adjudication(
    pending: PendingAdjudications,
    override_file: &mut OverrideFile,
    prompter: &mut dyn Prompter,
    operator: OperatorId,
) -> Result<AdjudicationOutcome, AdjudicationError> {
    for entry in pending.entries() {
        let context = build_context(entry);
        let decision = prompter.ask(&context)?;
        apply_decision(override_file, entry, decision, &operator);
    }
    Ok(...)
}

pub trait Prompter {
    fn ask(&mut self, context: &AdjudicationContext)
        -> Result<OperatorDecision, PrompterError>;
}

Production implementations:

  • TerminalPrompter: prints context to stdout, reads operator response from stdin. Used by --interactive.

Test implementations:

  • ScriptedPrompter::from_decisions(Vec<(SessionId, OperatorDecision)>): returns each decision in turn and errors if asked for an unprovided session. Used by L2 transform tests.
  • ScriptedTomlPrompter::read(path): reads the same TOML format as --scripted. Used by L3 CLI tests so subprocess tests and library-level tests share fixture format.

This means:

  • Every adjudication test path is automated. No subprocess PTY hackery, no expect-script DSL. Tests construct ScriptedPrompter, run the adjudication core, assert on the resulting OverrideFile.
  • The terminal UI is dumb. All it does is Display-format the context and parse the operator’s response into an OperatorDecision. No business logic in the UI layer.
  • Future UI backends (VS Code, web) implement Prompter and drop in. The adjudication core is unchanged.
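A compact, self-contained sketch of the seam described above follows. The types are deliberately reduced (two decision variants, a String session id, a context holding only the session id), and run is a hypothetical miniature of the adjudication core; the real signatures live in talkbank-transform.

```rust
use std::collections::VecDeque;

/// Reduced stand-in for the real decision enum.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum OperatorDecision {
    AcceptSuggested { note: Option<String> },
    Defer { reason: String },
}

/// Reduced stand-in: the real context also carries scores and previews.
pub struct AdjudicationContext {
    pub session_id: String,
}

#[derive(Debug, PartialEq, Eq)]
pub enum PrompterError {
    NoDecisionFor(String),
}

pub trait Prompter {
    fn ask(&mut self, context: &AdjudicationContext) -> Result<OperatorDecision, PrompterError>;
}

/// Test double: hands back pre-canned decisions in order.
pub struct ScriptedPrompter {
    queue: VecDeque<(String, OperatorDecision)>,
}

impl ScriptedPrompter {
    pub fn from_decisions(decisions: Vec<(String, OperatorDecision)>) -> Self {
        Self { queue: decisions.into() }
    }
}

impl Prompter for ScriptedPrompter {
    fn ask(&mut self, context: &AdjudicationContext) -> Result<OperatorDecision, PrompterError> {
        // Only answer when the next scripted decision is for this session.
        let front_matches = self
            .queue
            .front()
            .map_or(false, |entry| entry.0 == context.session_id);
        if front_matches {
            if let Some((_, decision)) = self.queue.pop_front() {
                return Ok(decision);
            }
        }
        Err(PrompterError::NoDecisionFor(context.session_id.clone()))
    }
}

/// Miniature core loop: returns the session ids that were decided (not deferred).
pub fn run(sessions: &[&str], prompter: &mut dyn Prompter) -> Result<Vec<String>, PrompterError> {
    let mut decided = Vec::new();
    for id in sessions {
        let context = AdjudicationContext { session_id: id.to_string() };
        match prompter.ask(&context)? {
            OperatorDecision::AcceptSuggested { .. } => decided.push(context.session_id),
            OperatorDecision::Defer { .. } => {}
        }
    }
    Ok(decided)
}
```

Because run only sees a &mut dyn Prompter, a terminal-backed implementation can replace the scripted one without touching the loop.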

The OperatorDecision type

pub enum OperatorDecision {
    /// Accept the algorithm's suggested mapping verbatim.
    AcceptSuggested { note: Option<String> },

    /// Override with an operator-supplied mapping (speaker-id).
    OverrideMapping {
        mapping: SpeakerMapping,
        note: Option<String>,
    },

    /// Override the inserted role only (parent-role lookup).
    OverrideInsertedRole {
        inserted_role: InsertedRole,
        note: Option<String>,
    },

    /// Add or update flags on an existing entry.
    Flag { flags: Vec<MergeFlag>, note: Option<String> },

    /// Defer this entry; leave it in pending for later review.
    Defer { reason: String },

    /// Mark the session as blocked (e.g., unbulleted reference);
    /// requires upstream action before pipeline can resume.
    Block { reason: String },
}

Each variant maps cleanly to one or more adjudication kinds:

| Kind | Allowed OperatorDecision variants |
|---|---|
| speaker-id-low-confidence | AcceptSuggested, OverrideMapping, Defer |
| parent-role-lookup | AcceptSuggested, OverrideInsertedRole, Defer |
| diarization-mix-review | Flag, Defer |
| sanity-scan-misclassification | OverrideMapping, Flag, Defer |
| (any) | Block is always available |

The kind → allowed-variants mapping is enforced by the adjudication tool: a kind = "parent-role-lookup" entry that gets an OverrideMapping decision is rejected with a clear error (AdjudicationError::DecisionKindMismatch).
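The enforcement can be expressed as a single match over (kind, variant) pairs. In this sketch Kind and Variant are hypothetical payload-free tags, and DecisionKindMismatch is modeled as a standalone struct rather than as a variant of AdjudicationError; the allowed combinations are copied from the table above.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    SpeakerIdLowConfidence,
    ParentRoleLookup,
    DiarizationMixReview,
    SanityScanMisclassification,
}

/// Payload-free tags for the OperatorDecision variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Variant {
    AcceptSuggested,
    OverrideMapping,
    OverrideInsertedRole,
    Flag,
    Defer,
    Block,
}

/// Simplified stand-in for AdjudicationError::DecisionKindMismatch.
#[derive(Debug, PartialEq, Eq)]
pub struct DecisionKindMismatch {
    pub kind: Kind,
    pub variant: Variant,
}

/// Accept a decision only if the table allows it for this kind.
pub fn check(kind: Kind, variant: Variant) -> Result<(), DecisionKindMismatch> {
    use Variant::*;
    let allowed = match (kind, variant) {
        // Block is always available.
        (_, Block) => true,
        (Kind::SpeakerIdLowConfidence, AcceptSuggested | OverrideMapping | Defer) => true,
        (Kind::ParentRoleLookup, AcceptSuggested | OverrideInsertedRole | Defer) => true,
        (Kind::DiarizationMixReview, Flag | Defer) => true,
        (Kind::SanityScanMisclassification, OverrideMapping | Flag | Defer) => true,
        _ => false,
    };
    if allowed {
        Ok(())
    } else {
        Err(DecisionKindMismatch { kind, variant })
    }
}
```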

Operator terminal UX (interactive mode)

What the operator sees when running chatter adjudicate pending.toml --override-file overrides.toml --interactive:

═══════════════════════════════════════════════════════════════
ADJUDICATION  [1 / 14]  session-102-t1   kind = speaker-id-low-confidence
═══════════════════════════════════════════════════════════════

Reference file:  chi-only/session-102-t1.cha
Donor file:      asr/session-102-t1.cha
Anchor speaker:  CHI

Per-speaker Jaccard scores against reference's CHI:
  PAR0 = 0.6286   ◄── higher
  PAR1 = 0.3457
  margin = 1.82×   (threshold was 2.00×)

Opening turns side-by-side:

  *CHI    [0_1708]    they start to bite .
  *PAR0   [75_1165]   They start to bite .
  *PAR1   [1515_2245] They do what .

  *CHI    [1708_5966] they put up their shields at some point .
  *PAR0   [2755_4405] They put up those heels .
  *PAR1   [4865_6045] At some point oh .

  (3 more turns shown; press 'm' for more)

Algorithm-suggested mapping:
  PAR0 → drop   (winner, matches CHI content)
  PAR1 → rename to INV:Investigator

Your decision?
  [a] Accept suggested
  [o] Override mapping
  [f] Flag and defer
  [d] Defer (review later)
  [b] Block (needs upstream fix)
  [m] Show more context
  [p] Play media (uses $TB_MEDIA_PLAYER)
  [q] Quit (save progress and exit)
> 

When the operator types a, the tool prompts for an optional note, writes the decision to the override file, and advances to the next pending entry.

The [p] Play media action is just a wrapper around Command::new($TB_MEDIA_PLAYER).arg(media_path).spawn(); the adjudication tool doesn’t bundle an audio player. The operator configures their preferred player via the environment.
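A hedged sketch of that wrapper, using only std::process::Command and std::env::var: the function names, the PlayError type, and the choice to treat an unset or empty variable as "no player configured" are assumptions for illustration, not the tool's actual behavior.

```rust
use std::path::Path;
use std::process::Command;

/// Hypothetical error for a missing or empty $TB_MEDIA_PLAYER.
#[derive(Debug, PartialEq, Eq)]
pub enum PlayError {
    PlayerNotConfigured,
}

/// Build (but do not spawn) the player command. The value of
/// $TB_MEDIA_PLAYER is passed in so this can be tested without
/// touching the process environment.
pub fn build_play_command(player: Option<&str>, media_path: &Path) -> Result<Command, PlayError> {
    let program = player
        .filter(|p| !p.trim().is_empty())
        .ok_or(PlayError::PlayerNotConfigured)?;
    let mut command = Command::new(program);
    command.arg(media_path);
    Ok(command)
}

/// Read the environment and spawn the player without waiting for it to exit.
/// Returns Ok(false) when no player is configured.
pub fn play_media(media_path: &Path) -> std::io::Result<bool> {
    let player = std::env::var("TB_MEDIA_PLAYER").ok();
    match build_play_command(player.as_deref(), media_path) {
        Ok(mut command) => command.spawn().map(|_child| true),
        Err(PlayError::PlayerNotConfigured) => Ok(false),
    }
}
```

Splitting construction from spawning keeps the environment lookup and the process launch out of unit tests.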

Adjudication contexts beyond speaker-id

The same chatter adjudicate tool handles adjudication points 1-4 by dispatching on kind; point 5 has no kind of its own and is recorded through the Block decision. For each kind, the displayed context and the allowed decisions differ:

parent-role-lookup

Shown context: the session is a parent sample (the basename conventionally carries a parent suffix, or the contributor data sheet says so). The merged output needs an inserted-role code of MOT, FAT, or PAR. The operator picks.

Session: session-103-t1-parent
Kind: parent-role-lookup

This is a parent-sample session. The merged file's inserted
speaker (currently labeled PAR0 → ???) needs a CHAT role.

Contributor data sheet (if attached): not available
Audio preview duration: 8m 14s

Algorithm-suggested:  INV : Investigator   (default for ambiguity)

Your decision?
  [a] Accept suggested (INV : Investigator)
  [m] MOT : Mother
  [f] FAT : Father
  [p] PAR : Adult (gender unknown)
  [c] Custom role
  [d] Defer
  [b] Block (needs upstream metadata)
> 

diarization-mix-review

Triggered by the operator (or a post-merge auto-scan) observing that an ASR speaker’s content mixes real-world speakers. The adjudication is to add the "diarization-mixed" flag plus a note explaining the mix.

sanity-scan-misclassification

Triggered by the post-merge sanity scan when a retained-speaker utterance has high text similarity with a temporally-adjacent inserted-speaker utterance. The operator either confirms (“the original speaker-id was wrong, swap the mapping”) or overrides (“the duplication is real, both speakers said the same thing at the same time”).

Resumption and re-adjudication

The pending-adjudications file is the source of truth for “what still needs deciding.” If the operator quits mid-review (via [q] or process-kill), the next chatter adjudicate invocation picks up where they left off: already-decided entries have already been removed from pending and written to the override file.

Re-adjudication of an already-decided entry is a planned extension, not yet implemented. The proposed interface would load the existing override entry, present it as the “current decision,” and ask the operator whether to keep or replace it; the operator’s decision would overwrite the entry, and the prior decision would be preserved in a history array on the entry (recording the prior mode, mapping, operator, decided_at, and note). The proposed invocation shape (not a working command today) is:

# Proposed, not yet implemented:
chatter adjudicate --re-adjudicate <SESSION_ID> --override-file overrides.toml

It needs a small override-file schema extension: a per-entry optional history: Vec<MergeOverride> field. This is a minor schema change; if it ships in v1, no schema bump is needed; if it ships later, that is a schema_version = 2 migration.
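The proposed history behavior amounts to swapping in the new decision and appending the old one. The sketch below assumes a reduced Decision type (mode, operator, and note only; mapping and decided_at are omitted) and hypothetical names OverrideEntry and re_adjudicate; it illustrates the proposal and does not describe shipped code.

```rust
/// Reduced stand-in for one recorded decision.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Decision {
    pub mode: String,
    pub operator: String,
    pub note: Option<String>,
}

/// Hypothetical override entry carrying the proposed history field.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct OverrideEntry {
    pub current: Decision,
    pub history: Vec<Decision>,
}

impl OverrideEntry {
    pub fn new(current: Decision) -> Self {
        Self { current, history: Vec::new() }
    }

    /// Replace the current decision, keeping the prior one at the end of
    /// `history` so earlier decisions stay auditable.
    pub fn re_adjudicate(&mut self, replacement: Decision) {
        let prior = std::mem::replace(&mut self.current, replacement);
        self.history.push(prior);
    }
}
```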

Composition with the orchestrator

The orchestrator (proposed tb merge or similar) drives the pipeline. Its high-level flow:

// pseudocode for the orchestrator's main loop
let inputs = discover_input_sessions(input_dir);
// Mutable: auto-decisions are inserted below.
let mut override_file = OverrideFile::read_or_default(override_path);
let mut pending = PendingAdjudications::default();
// Sessions that reached merged output in this run.
let mut decided_count = 0usize;

for session in inputs {
    if let Some(decision) = override_file.get(&session.id) {
        // Already adjudicated; apply directly.
        let labeled = apply_mapping(&session.donor, &decision.mapping)?;
        let merged = merge(&session.reference, &labeled, &session.retain)?;
        write_merged(merged, &session.output_path)?;
        decided_count += 1;
    } else {
        // Try auto-decide.
        match identify_mapping(&session.donor, &session.reference, ...) {
            Ok(mapping) => {
                let labeled = apply_mapping(&session.donor, &mapping)?;
                let merged = merge(...)?;
                write_merged(merged, &session.output_path)?;
                override_file.insert(session.id.clone(), record_auto_decision(&mapping));
                decided_count += 1;
            }
            Err(SpeakerIdError::LowConfidence { scores, margin, threshold }) => {
                pending.push(PendingEntry::speaker_id_low_confidence(
                    session.id.clone(),
                    scores, margin, threshold,
                    /* preview */ build_preview(&session),
                ));
            }
            Err(other) => return Err(other),
        }
    }
}

pending.write(pending_path)?;
override_file.write(override_path)?;

if !pending.is_empty() {
    eprintln!(
        "Pipeline complete for {} sessions; {} sessions need adjudication.\n\
         Run: chatter adjudicate {} --override-file {}",
        decided_count, pending.len(), pending_path, override_path
    );
    return Ok(ExitCode::NeedsAdjudication);
}

The orchestrator is the layer that hasn’t been designed yet at the type level. It’s likely a tb subcommand (since tb is the workflow tool for multi-repo / multi-step ops), with a fallback shell-script form for the v0 pipeline.

What this design does NOT cover

  • The orchestrator binary itself. That’s a separate design pass; this doc only specifies the contract between the pipeline stages and the adjudication tool.
  • GUI/web adjudication backends. v1 is terminal-only. The Prompter trait is the extension point; future backends implement it. The data contract (pending.toml, overrides.toml) does not change.
  • Audio playback / waveform display. v1 launches the operator’s $TB_MEDIA_PLAYER and gets out of the way. A future TUI with inline audio scrubbing is conceivable but is a major UI project, not v1.
  • ML-suggested decisions. A future version could feed pending entries to a classifier that pre-fills “suggested” with model output. Out of scope; the suggested field exists today as a hook.

Test coverage

Every behavior of chatter adjudicate is tested via the scripted-prompter abstraction. See the Test Plan (TBD section L4) for the test inventory. Coverage spans:

  • Each adjudication kind’s happy path (operator accepts suggested, decision written to override file)
  • Each adjudication kind’s override path (operator types an alternative, decision validated and recorded)
  • Each adjudication kind’s defer path (entry stays in pending)
  • Each adjudication kind’s block path (entry marked blocked; pipeline reports blocker)
  • Re-adjudication path, once that planned extension lands (operator changes their mind; prior decision preserved in history)
  • Mutually-exclusive flag enforcement (--interactive + --scripted rejected)
  • Invalid operator response handling (with and without --skip-on-error)
  • Schema-version refusal on the pending file
  • Empty pending file (no-op, exit 0)

XML Emitter

Status: Current Last updated: 2026-06-14 12:56 EDT

Purpose

crates/talkbank-transform/src/xml/ serialises a ChatFile<S> into TalkBank XML, an obsolete, frozen interchange format. The emitter is chatter’s implementation of that format’s CHAT-to-XML projection.

Scope:

  • Legacy / rare-use facility. The TalkBank project no longer publishes XML for download; CHAT is the canonical distribution format. The XML emitter exists to support rare legacy consumers that still need the XML projection; it is not a primary interchange path. New integrations should consume CHAT directly.
  • Emission only. XML ingest (XML → CHAT) is explicitly out of scope. The only historical consumer that ever needed XML → CHAT was Phon (via its PhonTalk plug-in, which used an XML round-trip); Phon has since pivoted to reading CHAT directly. The other XML readers are all either dormant or migrated:
    • NLTK’s CHILDESCorpusReader is unmaintained and was always read-only.
    • langcog/childes-db has had no commits since September 2022.
    • TalkBankDB and the current TalkBank analysis stack read CHAT directly, not XML.
  • Phonetic tiers are permanently unsupported. %pho, %mod, %phosyl, %modsyl, %phoaln report XmlWriteError::PhoneticTierUnsupported. Phon has pivoted to CHAT-only interchange; no downstream consumer reads the rich <pg>/<pw>/<ph>/<cmph>/<ss> XML. Files carrying these tiers still parse, validate, and round-trip through CHAT unchanged; only the XML projection is declined.
  • Parity oracle. The goldens in corpus/reference-xml/ (the reference TalkBank XML generated against the reference CHAT corpus) are the parity target. All paired goldens pass structurally, which gives full parity across every reference .cha file the TalkBank XML format can represent. A small number of reference fixtures have no golden because the frozen format cannot express them: some use UD POS tags (propn) that postdate it, and others declare @Media with a linkage type that the E544 validator catches before emission has a chance to run. These are intentional divergences, not Rust gaps.
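The phonetic-tier refusal can be pictured as a guard that runs before any XML is written. This is a sketch under assumptions: the local XmlWriteError is a stand-in whose payload shape is invented for illustration, and check_tiers is a hypothetical helper, not the emitter's real entry point.

```rust
/// Local stand-in for the emitter's error type; the `tier` payload is an
/// assumption, only the variant name comes from the text above.
#[derive(Debug, PartialEq, Eq)]
pub enum XmlWriteError {
    PhoneticTierUnsupported { tier: String },
}

/// Dependent tiers the XML projection declines.
const PHONETIC_TIERS: [&str; 5] = ["%pho", "%mod", "%phosyl", "%modsyl", "%phoaln"];

/// Reject an utterance's dependent-tier labels if any is phonetic;
/// the first offending label is reported.
pub fn check_tiers(labels: &[&str]) -> Result<(), XmlWriteError> {
    for label in labels {
        if PHONETIC_TIERS.contains(label) {
            return Err(XmlWriteError::PhoneticTierUnsupported {
                tier: label.to_string(),
            });
        }
    }
    Ok(())
}
```

Failing up front, instead of silently dropping the tier, keeps the XML output from looking complete when it is not.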

Module layout

The emitter is split across six submodules under xml/. Each file contributes an impl XmlEmitter { … } block plus any free helpers it owns; state lives on the single XmlEmitter struct defined in writer.rs.

flowchart TD
    entry["write_chat_xml<br/>(writer.rs)"]
    emitter["XmlEmitter struct<br/>owns quick-xml Writer<br/>+ next_utterance_id"]
    root["root.rs<br/>document / participants /<br/>body / utterance orchestration<br/>+ metadata helpers"]
    word["word.rs<br/>&lt;w&gt; / &lt;t&gt; / &lt;tagMarker&gt; /<br/>&lt;pause&gt; / &lt;g&gt; wrappers /<br/>word-internal markers /<br/>scoped annotations"]
    mor["mor.rs<br/>&lt;mor&gt; / &lt;mw&gt; / &lt;gra&gt; /<br/>UtteranceTiers collector /<br/>%mor feature serialization"]
    wor["wor.rs<br/>&lt;media&gt; / &lt;wor&gt; /<br/>&lt;internal-media&gt; /<br/>ms → seconds formatting"]
    deptier["deptier.rs<br/>&lt;a type=…&gt; side tiers<br/>(%act / %com / %exp /<br/>%gpx / %sit / %xLABEL)"]
    error["error.rs<br/>XmlWriteError variants"]

    entry --> emitter
    emitter --> root
    emitter --> word
    emitter --> mor
    emitter --> wor
    emitter --> deptier
    root -->|"terminator,<br/>separator"| word
    root -->|"collect_utterance_tiers,<br/>UtteranceTiers"| mor
    root -->|"&lt;media&gt;,<br/>&lt;wor&gt;"| wor
    root -->|"side tiers"| deptier
    word -->|"&lt;mor&gt; subtree<br/>inside &lt;w&gt;"| mor
    word -->|"&lt;mor&gt; subtree<br/>inside &lt;tagMarker&gt;"| mor
    wor -->|"%wor terminator<br/>label"| word
    error -.->|"errors"| entry
    error -.->|"errors"| root
    error -.->|"errors"| word
    error -.->|"errors"| mor
    error -.->|"errors"| wor
    error -.->|"errors"| deptier

| File | Role |
|---|---|
| writer.rs | XmlEmitter struct, namespace/version constants, write_chat_xml entry point, minimal-document unit test, escape_text helper |
| root.rs | Document / participants / body / utterance orchestration; root-element metadata helpers (corpus lookup, date/age/sex formatting, @Options flags, @Types projection, per-speaker extras) |
| word.rs | All word-level element shapes; word-internal marker walking; scoped-annotation dispatch; event / action emission |
| mor.rs | %mor / %gra emission including post-clitic <mor-post>; UtteranceTiers aggregator |
| wor.rs | %wor tier emission plus utterance-level <media>; format_seconds ms → seconds |
| deptier.rs | Text-content “side tiers” that render as <a type=…>text</a> (%act, %com, %exp, %gpx, %sit, %xLABEL) |
| error.rs | XmlWriteError thiserror enum |

Top-level data flow

sequenceDiagram
    participant Caller
    participant write_chat_xml as write_chat_xml<br/>(writer.rs)
    participant XmlEmitter as XmlEmitter
    participant emit_document as emit_document<br/>(root.rs)
    participant emit_body as emit_body<br/>(root.rs)
    participant emit_utterance as emit_utterance<br/>(root.rs)

    Caller->>write_chat_xml: ChatFile&lt;S&gt;
    write_chat_xml->>XmlEmitter: new()
    write_chat_xml->>emit_document: emit_document(file)
    emit_document->>emit_document: emit &lt;?xml?&gt; + &lt;CHAT&gt; attrs
    emit_document->>emit_document: emit_participants(file)
    emit_document->>emit_body: emit_body(file)
    loop each Line
        alt Line::Header
            emit_body->>emit_body: emit_header_if_body(header)
        else Line::Utterance
            emit_body->>emit_utterance: emit_utterance(utterance)
        end
    end
    write_chat_xml->>XmlEmitter: finish() → String
    XmlEmitter-->>Caller: Ok(String)

Utterance emission in detail

emit_utterance is the most complex orchestrator: it walks the main tier in parallel with three cursors into the dependent tiers (%mor, %gra, and %sin).

flowchart TD
    start([emit_utterance])
    preHdr[emit pre-begin<br/>headers]
    collect["collect_utterance_tiers<br/>→ UtteranceTiers {<br/>mor, gra, wor, sin, side_tiers }"]
    openU["&lt;u who=… uID=…&gt;"]
    linkers["emit_linker × N<br/>(utterance.main.content.linkers)"]
    walk{"walk<br/>utterance.main.content.content"}
    term{"terminator<br/>present?"}
    emitTerm["emit_terminator<br/>(word.rs)"]
    missing["&lt;t type='missing<br/>CA terminator'/&gt;"]
    media{"main bullet<br/>present?"}
    emitMedia["emit_utterance_media<br/>(wor.rs)"]
    wor{"%wor tier<br/>present?"}
    emitWor["emit_wor<br/>(wor.rs)"]
    side{"side tiers<br/>non-empty?"}
    emitSide["emit_side_tiers<br/>(deptier.rs)"]
    closeU["&lt;/u&gt;"]
    done([return])

    start --> preHdr --> collect --> openU --> linkers --> walk
    walk -->|"Word / AnnotatedWord /<br/>ReplacedWord / AnnotatedGroup /<br/>Separator / Pause / Retrace /<br/>Event / AnnotatedAction /<br/>OverlapPoint"| walk
    walk --> term
    term -->|yes| emitTerm
    term -->|no| missing
    emitTerm --> media
    missing --> media
    media -->|yes| emitMedia
    media -->|no| wor
    emitMedia --> wor
    wor -->|yes| emitWor
    wor -->|no| side
    emitWor --> side
    side -->|yes| emitSide
    side -->|no| closeU
    emitSide --> closeU
    closeU --> done

The TierCursors invariant

Walking the main tier requires tracking three independent cursors into the %mor / %gra / %sin tiers. This separation is the single most important correctness invariant in the emitter; a merged cursor silently drifts on any utterance containing a clitic chain, an untranscribed placeholder, or a sign-language item.

A TierCursors helper in mor.rs owns the three cursors and provides mor_index() / gra_chunk() / sin_index() accessors plus consume_mor(post_clitics_len) / consume_sin() / advance_bulk(mor, gra) advance methods. Every content-arm in emit_utterance runs a fixed template: look up partners at the current cursor positions, emit, call consume_*. The advance math has exactly one home.

| Cursor | Indexes into | Advances by |
|---|---|---|
| mor | mor_tier.items (one Mor per main-tier word) | 1 per alignable word |
| gra | gra.relations (1-based <gra index=…/>) | 1 + post_clitics.len() per Mor |
| sin | sin_tier.items (one SinItem per sin-countable word) | 1 per sin-countable word |

A Mor item like pron|what-Int-S1~aux|be-Fin-Ind-Pres-S3 is one entry in mor_tier.items but contributes two %gra edges, one for the main <mw>, one for each <mor-post><mw/></mor-post>. So mor and gra cursors advance at different rates.
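The differing advance rates can be sketched with a reduced stand-in for TierCursors. The method names come from the description above; the field layout, the plain usize return types, and the exact meaning of gra_chunk() (taken here as the 0-based start of the current %gra chunk) are assumptions made for illustration, not the emitter's actual definitions.

```rust
/// Hypothetical stand-in for the emitter's `TierCursors`.
/// Field layout and return types are assumed for this sketch.
#[derive(Debug, Default)]
pub struct TierCursors {
    mor: usize,
    gra: usize,
    sin: usize,
}

impl TierCursors {
    /// Index of the next unconsumed item in `mor_tier.items`.
    pub fn mor_index(&self) -> usize {
        self.mor
    }

    /// Assumed: 0-based start of the %gra chunk for the current Mor item.
    pub fn gra_chunk(&self) -> usize {
        self.gra
    }

    /// Index of the next unconsumed item in `sin_tier.items`.
    pub fn sin_index(&self) -> usize {
        self.sin
    }

    /// One Mor item consumed: mor advances by 1,
    /// gra by 1 + the number of post-clitics on that item.
    pub fn consume_mor(&mut self, post_clitics_len: usize) {
        self.mor += 1;
        self.gra += 1 + post_clitics_len;
    }

    /// One sin-countable word consumed.
    pub fn consume_sin(&mut self) {
        self.sin += 1;
    }

    /// Bulk advance reported by a nested emitter as (mor_used, gra_used).
    pub fn advance_bulk(&mut self, mor: usize, gra: usize) {
        self.mor += mor;
        self.gra += gra;
    }
}
```

With this sketch, consuming the clitic-chain item above via consume_mor(1) moves the mor cursor by one and the gra cursor by two, which is the drift a merged cursor would miss.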

%sin uses a different counting predicate from %mor. The model’s counts_for_tier(word, TierDomain) function encodes the differences:

  • TierDomain::Mor excludes nonwords (&~), fillers (&-), phonological fragments (&+), and untranscribed placeholders (xxx, yyy, www).
  • TierDomain::Sin includes everything that was phonologically or gesturally produced: fragments and untranscribed placeholders do participate, because a gesture can accompany an unintelligible vocalisation.

Because the predicates diverge, the sin cursor advances on its own schedule. For *CHI: mommy xxx . %sin: g:point 0 . the xxx word consumes a %sin item but not a %mor item.
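A minimal sketch of the diverging predicates follows. The WordKind enum is a hypothetical simplification of the model's word types, and this counts_for_tier takes a kind rather than a full word. Treating nonwords and fillers as countable for %sin is an assumption read from "everything that was phonologically or gesturally produced".

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TierDomain {
    Mor,
    Sin,
}

/// Hypothetical, simplified word classification for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WordKind {
    Regular,
    Nonword,       // &~
    Filler,        // &-
    Fragment,      // &+
    Untranscribed, // xxx, yyy, www
}

/// Reduced version of the counting predicate: does this word consume
/// an item on the given dependent tier?
pub fn counts_for_tier(kind: WordKind, domain: TierDomain) -> bool {
    match domain {
        // %mor skips nonwords, fillers, fragments, and untranscribed words.
        TierDomain::Mor => kind == WordKind::Regular,
        // %sin counts everything produced (assumption for Nonword/Filler).
        TierDomain::Sin => true,
    }
}

/// Number of tier items a main tier would consume in one domain.
pub fn items_consumed(words: &[WordKind], domain: TierDomain) -> usize {
    words
        .iter()
        .filter(|kind| counts_for_tier(**kind, domain))
        .count()
}
```

For the "mommy xxx" utterance, modelled as [Regular, Untranscribed], this yields one %mor item and two %sin items, matching the example above.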

Four main-tier content variants delegate cursor arithmetic through their emitters: emit_replaced_word and emit_annotated_group return (mor_used, gra_used) tuples consumed via cursors.advance_bulk(mor_used, gra_used); emit_word and emit_annotated_word call cursors.consume_mor(post_count) inline.

Why cursor-based, not AlignmentSet-based?

talkbank-model’s AlignmentSet (Utterance.alignments) holds pre-computed MorAlignmentPair / SinAlignmentPair / etc., the same main-word-index ↔ target-tier-index mapping the emitter computes on-the-fly. Why not use it directly?

The XML emitter accepts ChatFile<S: ValidationState> for any S. When called on a ChatFile<NotValidated>, compute_alignments has never run and Utterance.alignments is None. Rather than force callers to validate first, or risk panics on unvalidated input, the emitter recomputes what it needs via the cursor walk.

The cursor walk is equivalent to the model’s alignment output for every reference-corpus input; it only diverges on malformed files that the model’s alignment would also flag. The cursors stay as local emitter state, and the alignment module stays a separate, optional layer.

%sin: <sg><w/><sw/></sg> emission

When a %sin tier is present and the current word counts for TierDomain::Sin, the emitter wraps the <w> element (and its nested <mor> subtree if any) in a <sg> (sign group) with a <sw> (sign word) sibling:

<sg><w>what<mor>...</mor></w><sw>0</sw></sg>

SinItem::Token(text) renders as <sw>text</sw>; SinItem::SinGroup(…) joins its gesture tokens with spaces. The emission is the entirety of XmlEmitter::emit_sin_word; everything else is just the <sg>…</sg> wrap in emit_utterance’s Word arm.
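The wrapping rule can be sketched with plain strings. The SinGroup payload type (a Vec<String> of gesture tokens) and the helper names are assumptions; the real emitter writes through quick-xml and escapes text, which this sketch omits.

```rust
/// Reduced `SinItem`; the `SinGroup` payload type is assumed.
pub enum SinItem {
    Token(String),
    SinGroup(Vec<String>),
}

/// Text content of the `<sw>` element for one %sin item.
fn sin_text(item: &SinItem) -> String {
    match item {
        SinItem::Token(text) => text.clone(),
        // A group joins its gesture tokens with single spaces.
        SinItem::SinGroup(tokens) => tokens.join(" "),
    }
}

/// Wraps an already-rendered `<w>…</w>` element in `<sg>` and appends
/// the `<sw>` sibling. Hypothetical helper; no XML escaping is applied.
pub fn wrap_in_sign_group(w_element: &str, item: &SinItem) -> String {
    format!("<sg>{w_element}<sw>{}</sw></sg>", sin_text(item))
}
```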

@Media linkage and timing evidence (E544)

Validation fires E544 before XML emission when an unqualified @Media header (status-less) claims linkage but the transcript carries no timing evidence (no main-tier bullets, no positional %wor sidecar). This is a validator-level rule (lives in crates/talkbank-model/src/model/file/chat_file/validate.rs check_media_linkage_has_timing), not an emitter rule; it runs during ChatFile::validate and blocks downstream emission on validation-gated entry points. See spec/errors/E544_media_linkage_without_timing.md.

The emitter itself doesn’t care about bullet presence; this check was historically imposed as a parser-level semantic failure, and Rust implements it in the validator instead.

Post-clitic emission

flowchart LR
    mor["&lt;mor type='mor'&gt;"]
    mw["&lt;mw&gt;…&lt;/mw&gt;<br/>(main MorWord)"]
    gra["&lt;gra type='gra'<br/>index=N head=… relation=…/&gt;"]
    post["&lt;mor-post&gt;"]
    pmw["&lt;mw&gt;…&lt;/mw&gt;<br/>(post-clitic MorWord)"]
    pgra["&lt;gra type='gra'<br/>index=N+1 head=… relation=…/&gt;"]
    postEnd["&lt;/mor-post&gt;"]
    endMor["&lt;/mor&gt;"]

    mor --> mw --> gra --> post --> pmw --> pgra --> postEnd --> endMor

Each post-clitic gets its own <mor-post> wrapper containing one <mw> plus the next <gra> index. Multiple post-clitics emit sequentially.
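As a rough illustration of that ordering, the sketch below renders one Mor item and its post-clitics as a string. GraEdge, emit_mor, and the opaque <mw> text are hypothetical simplifications. The sketch assumes the caller passes exactly one edge for the main word plus one per post-clitic, and it indexes the slice directly, so a shorter slice would panic.

```rust
/// Hypothetical reduced %gra edge.
pub struct GraEdge {
    pub head: usize,
    pub relation: &'static str,
}

fn gra_element(index: usize, edge: &GraEdge) -> String {
    format!(
        r#"<gra type="gra" index="{}" head="{}" relation="{}"/>"#,
        index, edge.head, edge.relation
    )
}

/// Renders `<mor>` for one Mor item. `edges[0]` belongs to the main word,
/// `edges[1..]` to the post-clitics in order (assumed invariant).
pub fn emit_mor(
    main_mw: &str,
    post_clitic_mws: &[&str],
    edges: &[GraEdge],
    first_index: usize,
) -> String {
    let mut out = String::from(r#"<mor type="mor">"#);
    out.push_str(&format!("<mw>{main_mw}</mw>"));
    out.push_str(&gra_element(first_index, &edges[0]));
    for (offset, mw) in post_clitic_mws.iter().enumerate() {
        // Each post-clitic gets its own wrapper and the next gra index.
        out.push_str("<mor-post>");
        out.push_str(&format!("<mw>{mw}</mw>"));
        out.push_str(&gra_element(first_index + 1 + offset, &edges[1 + offset]));
        out.push_str("</mor-post>");
    }
    out.push_str("</mor>");
    out
}
```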

Emitter / parser / model boundary

The emitter generally defers to the Rust model’s canonical predicates rather than inventing output-side rules. Four cases are exceptions where the emitter bridges a disagreement between the parser and the TalkBank XML format at the output boundary. All four are legitimate divergences, not regressions: the Rust model is correct, the TalkBank XML format is obsolete and frozen at a pre-evolution CHAT snapshot, and the emitter’s bridges are the right place to reconcile the output shape.

CA intonation contour terminators

Rust parses the CA pitch-contour arrows (⇗, ↗, →, ↘, ⇘) at the end of an utterance as Terminator::CaRisingToHigh etc. The TalkBank XML format classifies them as separators followed by an implicit “missing CA terminator”. The emitter splits a pitch-contour terminator into two sibling elements:

<s type="rising to high"/>
<t type="missing CA terminator"/>

See ca_terminator_separator_label in word.rs. If the Rust parser ever migrates to classify these as separators, the emitter’s bridge becomes dead and should be removed.
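The split can be sketched as follows. Only CaRisingToHigh and its "rising to high" label are taken from the text; the Other variant, the Option return type of ca_terminator_separator_label, and the string-based output are assumptions made for illustration.

```rust
/// Reduced terminator type. `Other` is a hypothetical catch-all that
/// carries a ready-made `<t type=…>` label.
pub enum Terminator {
    CaRisingToHigh,
    Other(&'static str),
}

/// Separator label for a CA pitch-contour terminator, if it is one.
pub fn ca_terminator_separator_label(t: &Terminator) -> Option<&'static str> {
    match t {
        Terminator::CaRisingToHigh => Some("rising to high"),
        Terminator::Other(_) => None,
    }
}

/// Renders the terminator as one or two sibling elements.
pub fn emit_terminator(t: &Terminator) -> Vec<String> {
    if let Some(label) = ca_terminator_separator_label(t) {
        // Bridge: the XML format wants a separator plus an implicit
        // "missing CA terminator".
        return vec![
            format!(r#"<s type="{label}"/>"#),
            r#"<t type="missing CA terminator"/>"#.to_string(),
        ];
    }
    match t {
        Terminator::Other(label) => vec![format!(r#"<t type="{label}"/>"#)],
        Terminator::CaRisingToHigh => unreachable!("handled by the bridge above"),
    }
}
```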

CAOmission as whole-word shortening

(parens) (a fully-parenthesised word) parses to WordCategory::CAOmission. TalkBank XML emits <w><shortening>parens</shortening></w>, a <shortening> wrapper around the word body with no type="omission" attribute. The 0word syntax (true omission) gets <w type="omission">word</w> with no shortening wrapper.

The emitter branches on CAOmission and opens a <shortening> wrapper around emit_word_contents. word_category_attr returns None for CAOmission so no type="omission" attribute is emitted.

Leading overlap-point hoisting

Rust parses ⌈°overlapping+soft⌉° as a single word whose WordContent vector starts with a TopOverlapBegin marker. TalkBank XML keeps the leading ⌈ as a top-level sibling of <w> but keeps the trailing ⌉ inside. The emitter hoists the prefix of leading WordContent::OverlapPoint items out before opening <w>, and emit_word_contents skips them during the content walk.

xxx / yyy / www case-sensitivity

The model’s word.untranscribed() helper is case-insensitive; it treats XXX and xxx identically as “unintelligible” to protect downstream Stanza/MOR pipelines from spurious uppercase entries. The XML schema’s untranscribed attribute, however, attaches only to the strictly lowercase placeholders. The emitter uses a local untranscribed_attribute_for_xml helper that does the case-sensitive check at the output boundary.

Both behaviours are deliberate and stay: the model’s case-insensitive helper is a Stanza/MOR correctness fix, and the emitter’s case-sensitive gate matches the XML schema contract.
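The two checks can be contrasted in a few lines. Both function names below are hypothetical stand-ins (for word.untranscribed() and the emitter's local helper), and they return plain booleans rather than whatever the real helpers return.

```rust
/// Model-side check (stand-in): case-insensitive, so `XXX` and `xxx`
/// are both treated as untranscribed.
pub fn model_untranscribed(word: &str) -> bool {
    matches!(
        word.to_ascii_lowercase().as_str(),
        "xxx" | "yyy" | "www"
    )
}

/// Output-boundary check (stand-in): only the strictly lowercase
/// placeholders receive the XML `untranscribed` attribute.
pub fn xml_untranscribed_gate(word: &str) -> bool {
    matches!(word, "xxx" | "yyy" | "www")
}
```

Under this sketch, XXX is untranscribed for the model but receives no attribute in the XML projection.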

Reserving element boundaries: single state holder

XmlEmitter owns a quick_xml::Writer<Vec<u8>> and a running next_utterance_id: u32 counter. Every emission helper writes through that single writer so indentation, escaping, and the document-order contract are centrally enforced.

Every BytesText emission routes through escape_text (in writer.rs) which uses quick_xml::escape::partial_escape to escape only <, >, &. Apostrophes and double quotes pass through literally, matching the TalkBank XML format and avoiding entity-decode issues that would otherwise split text at &apos; boundaries during structural comparison.
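To make the escaping contract concrete, here is a standard-library-only sketch with the same observable behaviour as described: only <, >, and & are replaced. It is a hand-written stand-in, not the emitter's escape_text, which delegates to quick_xml::escape::partial_escape.

```rust
/// Stand-in for the emitter's `escape_text`: escapes only `<`, `>`, `&`.
/// Apostrophes and double quotes are passed through literally.
pub fn escape_text(raw: &str) -> String {
    let mut out = String::with_capacity(raw.len());
    for ch in raw.chars() {
        match ch {
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '&' => out.push_str("&amp;"),
            other => out.push(other),
        }
    }
    out
}
```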

Testing

Two complementary test surfaces:

  1. Unit tests in xml/writer.rs (minimal document smoke) and xml/wor.rs (format_seconds fractional padding) exercise internal helpers directly.

  2. Golden-XML parity harness at crates/talkbank-parser-tests/tests/xml_golden.rs. Runs one parametrised test per file in corpus/reference-xml/**/*.xml, parses both emitted and golden XML via quick-xml, and diffs event streams with whitespace and attribute-order normalisation. Comparator lives in crates/talkbank-parser-tests/tests/xml_support/mod.rs.

The harness diagnostic surfaces the first divergence as actual: … vs expected: …. To debug further, temporary dump helpers (write the emitted XML to /tmp/emitted.xml and side-by-side diff against the golden) are the quickest path; add them as #[ignore]d tests in crates/talkbank-parser-tests/tests/xml_dump.rs when needed and delete after the divergence is resolved.

  • spec/errors/E544_media_linkage_without_timing.md: the @Media bullet-existence validator that runs before emission.

Reference-XML coverage gaps (which files the TalkBank XML format can’t represent) are called out inline in the “Parity oracle” bullet of §Purpose above. The permanent exclusions are UD-POS files that postdate the frozen format and @Media-without-timing files that E544 blocks at validation; both are intentional divergences, not Rust gaps.

Staged features

The emitter reports XmlWriteError::FeatureNotImplemented for CHAT constructs that have a known XML shape but haven’t been wired in yet. With all paired reference-XML goldens passing, any new staged feature that lands will be triggered by a file added to the reference corpus that exercises it. When that happens:

  1. Run cargo nextest run -p talkbank-parser-tests --test xml_golden and read the failure message.
  2. Find the TalkBank XML output for the construct in the paired golden.
  3. Add a match arm in the appropriate submodule (word.rs::emit_scoped_annotation, deptier.rs::emit_side_tier, word.rs::ca_delimiter_label, etc.) with a short comment explaining the mapping.
  4. If the construct changes %mor / %gra cursor accounting, update emit_utterance in root.rs, not individual callers.

Permanently-unsupported tiers (%pho, %mod, %phosyl, %modsyl, %phoaln) use XmlWriteError::PhoneticTierUnsupported and are not staged for future work: Phon’s pivot to CHAT-only interchange removed the downstream need.

Errors, CHAT core

Status: Current Last modified: 2026-06-17 11:29 EDT

The error infrastructure used across all CHAT-core crates (talkbank-model, talkbank-parser, talkbank-transform, chatter, talkbank-lsp). Defined in the errors module of talkbank-model.

External runtime/application errors that live outside this repo’s CHAT core are documented separately in their owning projects. For the diagnostic UX standard that applies within this workspace, see error-diagnostics-ux.

Core Types

ParseError

Every diagnostic is a ParseError:

pub struct ParseError {
    pub code: ErrorCode,
    pub severity: Severity,
    pub location: SourceLocation,
    pub context: ErrorContext,
    pub message: String,
}

ErrorCode

Error codes follow a structured numbering scheme:

| Range | Category |
|---|---|
| E1xx | Encoding |
| E2xx | Words and content |
| E3xx | Main tier (speakers, terminators, content, retraces) |
| E4xx | Dependent tier structure |
| E5xx | Headers |
| E6xx | Dependent tier validation |
| E7xx | Alignment (%mor, %gra, %pho, %wor) |
| W1xx-Wxxx | Warnings (same categories) |

Codes are grouped by range as above. The numbering is a navigational aid, not the authority on where a code is caught: most codes are emitted at the layer suggested below, but a few main-tier checks (for example undeclared-speaker and retrace structure) are validation-layer despite their E3xx number. The per-code Layer in spec/errors/ is authoritative.
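As a navigational aid only, the range table can be expressed as a lookup on the code's leading characters. The function below is a hypothetical helper, not part of talkbank-model. It reads just the prefix letter and the first digit, and it says nothing about which layer catches a code.

```rust
/// Maps a code such as "E544" or "W1xx" to its range category.
/// Hypothetical helper: only the prefix and first digit are inspected.
pub fn range_category(code: &str) -> Option<&'static str> {
    let mut chars = code.chars();
    let prefix = chars.next()?;
    if prefix != 'E' && prefix != 'W' {
        return None;
    }
    // Warnings share the error categories, so the digit decides.
    match chars.next()? {
        '1' => Some("Encoding"),
        '2' => Some("Words and content"),
        '3' => Some("Main tier"),
        '4' => Some("Dependent tier structure"),
        '5' => Some("Headers"),
        '6' => Some("Dependent tier validation"),
        '7' => Some("Alignment"),
        _ => None,
    }
}
```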

flowchart LR
    subgraph "Parser layer\n(parser.parse_chat_file())"
        E1["E1xx\nEncoding\n(BOM, charset)"]
        E2["E2xx\nWords and content\n(word syntax, events,\noverlap markers)"]
        E3["E3xx\nMain tier\n(speaker, content,\nterminator, retraces)"]
        E4["E4xx\nDependent tier structure\n(tier presence, format)"]
        E5["E5xx\nHeaders\n(format, required fields,\nparticipant resolution)"]
    end

    subgraph "Validation layer\n(validate_with_alignment)"
        E6["E6xx\nDependent tier validation\n(tier name/format)"]
        E7["E7xx\nAlignment\n(%mor/%gra/%pho/%wor counts,\nGRA indices, orphaned tiers)"]
    end

    W["Wxxx\nWarnings\n(same categories,\nnon-fatal)"]

    E1 ~~~ E2 ~~~ E3 ~~~ E4 ~~~ E5
    E6 ~~~ E7

The source of truth for error-code details is spec/errors/. Maintainers can generate a local markdown reference set under docs/errors/ with gen_error_docs when they need a browsable error catalog while working on diagnostics.

Severity

  • Error: must be fixed; indicates invalid CHAT.
  • Warning: should be fixed; indicates questionable but parseable CHAT.

SourceLocation and Span

Byte offsets into the source text:

pub struct SourceLocation { pub start: usize, pub end: usize }
pub struct Span { pub start: usize, pub end: usize }

ErrorContext

Carries the source fragment around the error location:

pub struct ErrorContext {
    pub source_fragment: String,
    pub byte_range: Range<usize>,
    pub node_kind: String,
}

ErrorSink Trait

The central abstraction for error reporting:

flowchart LR
    val["Validator / Parser"]
    pe["ParseError\ncode + severity +\nlocation + message"]
    sink["ErrorSink trait\n.report()"]
    vec["ErrorCollector\ncollect to Vec"]
    chan["ChannelErrorSink\ncrossbeam channel\n(feature = channels)"]
    asyncchan["AsyncChannelErrorSink\ntokio mpsc"]
    cfg["ConfigurableErrorSink\nseverity gating"]
    null["NullErrorSink\nno-op"]

    val --> pe --> sink
    sink --> vec & chan & asyncchan & cfg & null

pub trait ErrorSink {
    fn report(&self, error: ParseError);
}

All parsing and validation functions accept &impl ErrorSink rather than returning errors directly. This allows:

  • Collecting all errors (for batch processing).
  • Printing errors in real-time (for interactive use).
  • Filtering by severity or code.
  • Counting errors without storing them.

The trait uses &self (not &mut self) so it can be shared across threads. Implementations typically use interior mutability (Mutex<Vec<ParseError>>).
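A minimal sketch of the interior-mutability pattern follows, using a reduced ParseError and two hypothetical sinks (VecSink and CountingSink are illustration names, not the crate's ErrorCollector or its counters). The report_empty_words function and its "E2xx" placeholder code are likewise invented to give the sinks something to receive.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

/// Reduced diagnostic for this sketch; the real struct has more fields.
#[derive(Debug, Clone, PartialEq)]
pub struct ParseError {
    pub code: String,
    pub message: String,
}

pub trait ErrorSink {
    fn report(&self, error: ParseError);
}

/// Collects every diagnostic behind a Mutex so `report` can take `&self`.
#[derive(Default)]
pub struct VecSink {
    errors: Mutex<Vec<ParseError>>,
}

impl ErrorSink for VecSink {
    fn report(&self, error: ParseError) {
        self.errors.lock().unwrap().push(error);
    }
}

impl VecSink {
    pub fn into_vec(self) -> Vec<ParseError> {
        self.errors.into_inner().unwrap()
    }
}

/// Counts diagnostics without storing them.
#[derive(Default)]
pub struct CountingSink {
    count: AtomicUsize,
}

impl ErrorSink for CountingSink {
    fn report(&self, _error: ParseError) {
        self.count.fetch_add(1, Ordering::Relaxed);
    }
}

impl CountingSink {
    pub fn count(&self) -> usize {
        self.count.load(Ordering::Relaxed)
    }
}

/// Toy check that reports instead of returning errors.
pub fn report_empty_words(words: &[&str], sink: &impl ErrorSink) {
    for (index, word) in words.iter().enumerate() {
        if word.is_empty() {
            sink.report(ParseError {
                code: "E2xx".to_string(), // placeholder code
                message: format!("empty word at position {index}"),
            });
        }
    }
}
```

The same checking function serves both batch collection and counting; only the sink passed by the caller changes.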

ErrorCollector is the in-memory collector in errors/collectors.rs. The stored-diagnostics role is explicit in both code and docs.

Module layout in talkbank-model:

  • errors/error_sink.rs: trait and lightweight forwarding sinks.
  • errors/collectors.rs: in-memory collectors and counters.
  • errors/async_channel_sink.rs: Tokio-channel streaming.
  • errors/configurable_sink.rs, errors/offset_adjusting_sink.rs, errors/tee_sink.rs: adapters.

ChannelErrorSink is opt-in behind the channels feature so the default talkbank-model dependency does not pull in crossbeam just to own the core error trait and in-memory collectors.

Two Error Layers

Errors are detected at two layers. This distinction matters for spec testing.

  1. Parser layer: structural errors caught during parser.parse_chat_file(). These prevent the file from being fully parsed (missing @Begin, invalid syntax). Parser-layer specs test that parser.parse_chat_file() returns Err.

  2. Validation layer: semantic errors caught by validate_with_alignment() after a successful parse. The file parsed correctly but violates constraints (%mor alignment mismatch, undeclared speakers). Validation-layer specs test that validation reports specific error codes.

Adding a New Error Code

  1. Add the variant to ErrorCode in crates/talkbank-model/src/errors/codes/error_code.rs with a #[code("Exxx")] attribute.
  2. Create a spec file in spec/errors/Exxx-description.md following the existing template.
  3. Construct ParseError::new(ErrorCode::YourVariant, ...) at the detection site in the parser or validator.
  4. Regenerate the affected spec artifacts with the current spec/tools binaries (gen_rust_tests, gen_validation_corpus, and optionally gen_error_docs).
  5. Run the concrete verification commands from book/src/contributing/dev-checks.md.

Validation

Status: Current Last modified: 2026-06-13 22:40 EDT

CHAT validation runs at multiple points in the processing pipeline. All validation logic is in Rust: talkbank-model::validation owns CHAT-core validation, and talkbank_transform::validate (crates/talkbank-transform/src/validate.rs) owns the transform-side pre/post validation gate functions (validate_to_level, validate_output). This page covers validity levels, pre/post validation gates, severity posture, the verification-gate set (G0-G14), and how validation failures interact with caches and bug reports.

For error-code infrastructure (codes, sinks, severities, layers), see chat-core-errors. For the diagnostic UX standard, see error-diagnostics-ux.

Validity Levels

The ValidityLevel enum defines three cumulative validation levels. Each level includes all checks from lower levels.

| Level | Name | Checks |
|---|---|---|
| L0 | Parseable | No parse errors (clean tree-sitter CST) |
| L1 | StructurallyComplete | @Participants and @Languages present, all speaker codes declared, every utterance has a terminator |
| L2 | MainTierValid | Well-formed words, valid timing bullets if present |

Pre-validation gates

Each command requires input to meet a minimum level before processing:

| Command | Required level |
|---|---|
| morphotag | MainTierValid |
| utseg | StructurallyComplete |
| translate | StructurallyComplete |
| coref | StructurallyComplete |
| align | Parseable (lenient, must handle messy real-world files) |

validate_to_level() checks the file against the required level and returns all failures found. Invalid files are rejected early with diagnostics, before any compute is spent on inference.
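The cumulative levels and the per-command gate can be sketched with an ordered enum. The variant names and command table come from this page; required_level and meets_level are hypothetical helpers, and the sketch reduces validate_to_level's diagnostics to a yes/no answer.

```rust
/// Declaration order gives L0 < L1 < L2 through the derived ordering.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum ValidityLevel {
    Parseable,
    StructurallyComplete,
    MainTierValid,
}

/// Minimum level each command requires (hypothetical lookup helper).
pub fn required_level(command: &str) -> Option<ValidityLevel> {
    match command {
        "morphotag" => Some(ValidityLevel::MainTierValid),
        "utseg" | "translate" | "coref" => Some(ValidityLevel::StructurallyComplete),
        "align" => Some(ValidityLevel::Parseable),
        _ => None,
    }
}

/// `achieved` is the highest level the file reached; `None` means the
/// file did not even parse. Levels are cumulative, so `>=` suffices.
pub fn meets_level(achieved: Option<ValidityLevel>, required: ValidityLevel) -> bool {
    achieved.map_or(false, |level| level >= required)
}
```

Deriving Ord keeps the "each level includes all lower levels" rule in one place instead of scattering level comparisons across commands.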

flowchart TD
    cmd["a transform command\n(morphotag, utseg, translate,\ncoref, align)"]
    gate["validate_to_level(file, required_level)\n(talkbank-transform validate.rs)"]
    check{"meets the command's minimum\nValidityLevel?\n(L0 Parseable / L1 StructurallyComplete\n/ L2 MainTierValid)"}
    reject["reject early with diagnostics;\nno compute spent"]
    proceed["run the command's inference"]

    cmd --> gate --> check
    check -->|"no"| reject
    check -->|"yes"| proceed

Post-Serialization Validation

After an orchestrator injects results and serializes CHAT output, the server runs validate_output():

  1. Alignment validation: checks that %mor/%gra/%wor tier word counts match the main tier. ParseHealth-aware: utterances flagged as unparseable during lenient parsing are excluded.
  2. Semantic validation: full CHAT validation:
    • E362: non-monotonic timestamps (utterance bullets must increase).
    • E701 / E704: temporal constraints (overlap rules, same-speaker timing).
    • Header correctness, required headers present and well-formed.
    • Cross-utterance patterns, speaker code consistency.

validate_output() blocks only on severity="error", not on warnings.
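The blocking rule reduces to a single predicate over the collected diagnostics. The Diagnostic struct and blocks_output below are hypothetical simplifications for illustration; they are not the types validate_output() works with.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Severity {
    Error,
    Warning,
}

/// Reduced diagnostic: just a code and its severity.
pub struct Diagnostic {
    pub code: &'static str,
    pub severity: Severity,
}

/// Output is blocked if at least one diagnostic is error-level;
/// warnings alone never block.
pub fn blocks_output(diagnostics: &[Diagnostic]) -> bool {
    diagnostics
        .iter()
        .any(|diagnostic| diagnostic.severity == Severity::Error)
}
```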

Severity Posture

Validation intentionally distinguishes errors from warnings:

  • Errors block output. The server will not write CHAT with error-level validation failures.
  • Warnings are reported but do not block. Legacy corpora contain widespread minor violations that must remain processable.

This distinction matters especially for %gra:

  • Existing broken %gra in old corpora may be accepted with warnings so files remain processable.
  • Newly generated %gra from batchalign3 is validated more strictly before writeback.

Bug Reports and Cache Purges

When post-serialization validation fails:

  1. A structured bug report is written to ~/.batchalign3/bug-reports/.
  2. Cache entries that produced the invalid output are purged (self-correcting cache).

This prevents broken results from being served on future runs.

Verification Contract

This repo does not currently expose the predecessor workspace’s make verify wrapper. The current local contract is the concrete command set documented in Developer Verification Checks and Testing and Quality Gates.

Core local sweep:

cargo fmt --all -- --check
cargo build --workspace --all-targets --locked
cargo nextest run --workspace
cargo test --doc

Add the surface-specific checks that match the validation-affecting code you changed:

  • grammar: cd grammar && tree-sitter generate && tree-sitter test
  • spec tools: cargo build --manifest-path spec/tools/Cargo.toml and cargo build --manifest-path spec/runtime-tools/Cargo.toml
  • parser / model / alignment / serialization: cargo nextest run -p talkbank-parser-tests -E 'test(parser_equivalence)' and cargo nextest run -p talkbank-parser-tests --test roundtrip_reference_corpus

The reference corpus at corpus/reference/ remains the sacred semantic target. Historical labels like G0-G14 are useful for older design notes, but they are not the current command surface of this checkout.

Validation at the PyO3 Boundary

There is no public Python validation API. The ParsedChat handle that previously exposed validate() / validate_structured() / validate_chat_structured() was retired in the 2026-03-21 PyO3 slimdown to worker-runtime-only. Validation now runs entirely on the Rust side; when a worker invocation detects a failure it constructs BatchalignBoundaryError::ChatValidation { entries, … } which the PyO3 boundary lowers into a CHATValidationException carrying a populated errors: list[ValidationErrorEntry] on the Python side.

Python callers that need structured validation results invoke batchalign3 via subprocess and catch the exception:

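import batchalign_core
from batchalign_core import CHATValidationException

# `request` is a previously constructed worker request (construction not shown).
try:
    batchalign_core.execute_v2(request)
except CHATValidationException as exc:
    # Each entry is a ValidationErrorEntry carried on the exception.
    for entry in exc.errors:
        print(entry.code, entry.line, entry.message)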

Upstream batchalign runtime errors and the Python ↔ Rust boundary contract are documented separately in the batchalign3 project.

Known limitations

  • Validation rules are intentionally permissive on legacy data. Some checks emit warnings rather than errors so legacy corpora remain processable while still surfacing the issue. Examples: pre-existing malformed %gra (warned, not blocked, so files that already shipped with bad %gra round-trip cleanly); some bullet-format minor variants. Newly generated tiers from batchalign are validated more strictly before writeback.
  • %wor word counts are not validated against the main tier. %wor is a timing-annotation tier with no downstream positional indexing; legacy files may have xxx, fragments, or nonwords in %wor without producing alignment errors.
  • Cross-utterance quotation validation is gated off by default (enable_quotation_validation flag); the cross-utterance walker exists but is not yet wired into the standard validation gate.
  • Some error-spec / validator pairs are not yet implemented. Tracked in spec/errors/ files marked Status: not_implemented; these generate #[ignore] tests via the current spec/tools generators rather than failing CI. Run grep -rl "Status.*not_implemented" spec/errors/ to enumerate.

CHECK Parity Audit

Status: Current Last updated: 2026-06-17 11:29 EDT

CLAN’s check (CHECK) was the long-standing validator for CHAT files. chatter validate is the forward-looking replacement, and it is the binding judgment on whether a byte sequence is valid CHAT: when chatter rejects a file, the file is invalid and the right response is to clean the data, not to weaken the parser. CHECK is no longer the authority on validity.

CHECK is still useful for one thing: as a reference oracle that helps find validation rules chatter does not yet have. The CHECK Parity Audit is the tool that compares the two systematically, so that every rule CHECK enforces is either matched by chatter, or is a deliberate, documented divergence.

What the audit answers

For every error code CLAN’s check actually emits, the audit answers: does chatter have an equivalent rule, and if not, why not?

  • Semantic parity: does chatter enforce the same intended rule?
  • Behavioral parity: does chatter match CHECK’s literal runtime behavior, including CHECK’s documented anomalies (some CHECK rules are buggy or were disabled in place; reproducing those bugs is not a goal)?
  • Strictness policy: chatter should be at least as strict semantically. A file CHECK rejects should not silently pass chatter unless the divergence is deliberate.

How it works

flowchart LR
    cpp["CLAN check.cpp\n(OSX-CLAN/src/clan)"]
    extract["scripts/extract_check_codes.py\n(every code CHECK actually emits)"]
    ref["clan-check-reference/\ncheck-error-codes.json"]
    map["map_by_id()\n(audit_check_parity.rs)"]
    audit["audit_check_parity\nbinary"]
    out["docs/audits/\ncheck-parity-audit.md"]

    cpp --> extract --> ref
    ref --> audit
    map --> audit
    audit --> out

  1. The CHECK reference is generated from CLAN’s check.cpp by scripts/extract_check_codes.py into crates/talkbank-parser-tests/clan-check-reference/check-error-codes.json. It records every code CHECK emits (the call sites in the C source), not the stale subset documented in CLAN’s own CHECK-rules.md.

  2. The mapping lives in map_by_id() in crates/talkbank-parser-tests/src/bin/audit_check_parity.rs: an explicit CHECK-number to TalkBank-code table (for example 138 | 139 => &["E256"]), with a keyword fallback for the unmapped remainder.

  3. The audit binary joins the two and writes the report. Regenerate it with:

    cargo run -p talkbank-parser-tests --bin audit_check_parity
    

    which rewrites docs/audits/check-parity-audit.md (the full per-rule table and the executive summary). That generated file is the authoritative, citation-stable record; this page explains how to read it.
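The shape of the explicit mapping table can be sketched as a match on the CHECK number. Only the 138 | 139 => E256 entry is taken from the text; the signature, the empty-slice default, and the unmapped helper are assumptions, and the keyword fallback is not modelled.

```rust
/// Explicit CHECK-number to TalkBank-code table (sketch).
/// Only the 138/139 entry is sourced; everything else is unmapped here.
pub fn map_by_id(check_id: u32) -> &'static [&'static str] {
    match check_id {
        138 | 139 => &["E256"],
        _ => &[],
    }
}

/// CHECK codes with no explicit mapping; in the real audit these would
/// go on to the keyword fallback and then to triage.
pub fn unmapped(check_ids: &[u32]) -> Vec<u32> {
    check_ids
        .iter()
        .copied()
        .filter(|id| map_by_id(*id).is_empty())
        .collect()
}
```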

The current headline numbers (regenerate to refresh): of the CHECK codes that are actually emitted, roughly two-thirds map directly to a TalkBank code, and the audit reports semantic parity, behavioral parity, and an “enhancements beyond CHECK” set (TalkBank codes with no CHECK equivalent, the majority).

Triaging a gap

A CHECK rule with no TalkBank mapping is not automatically a chatter bug. Each gap is triaged against the CLAN source (OSX-CLAN/src/clan/check.cpp) into one of three buckets:

  • (a) Genuine gap. CHECK enforces a real CHAT rule chatter is missing. Action: implement it in chatter with strict top-down TDD (a failing chatter validate test on a real .cha fixture first), then add the map_by_id entry. Example: curly single quotes (see below).
  • (b) Intentional divergence. CHECK’s rule is wrong, disabled, or a text-hack chatter deliberately does not reproduce. Action: document the divergence, do not implement. Examples: CHECK error 49 (uppercase-in-word) has been commented out in check.cpp since 2019, so flagging it would diverge from current CHECK; CHECK error 109 (postcodes on dependent tiers) is a raw character-match text-hack on %-tier tokens that chatter models as structured content.
  • (c) Enhancement beyond CHECK. A TalkBank code with no CHECK counterpart. These are validation rules chatter adds; they need no CHECK mapping.

The remaining unmapped CHECK codes are an open, low-priority tail: most resolve to bucket (b) on source examination. Closing them is not a release gate.

Worked example: E256 (CHECK 138/139), implemented across both parsers

Curly single quotes (U+2018, U+2019) used as word characters were a genuine gap (bucket a): CHECK errors 138/139 flag them, whereas chatter previously absorbed them silently. They are illegal CHAT word characters; CHAT uses the ASCII apostrophe.

Because chatter has two parsers that must agree (the tree-sitter parser and the re2c oracle, see Parser Backends), the fix lands in both, reaching the same recovery:

  • The character is excluded from the word token via the shared Symbol Registry (so it can never be part of a word).
  • The tree-sitter grammar recognizes it as a dedicated illegal_curly_quote node (not a generic parse error), and the parser emits E256 with a span pointing at the exact character.
  • The re2c lexer emits a recognized IllegalCurlyQuote token; the file-level parser emits E256 and drops the token before parsing.
  • In both, the offending quote is dropped and the surrounding words survive, so validation continues and reports a precise, actionable diagnostic.

This is the canonical shape of a CHECK-parity rule implemented to chatter’s standards: a recognized construct (parse, don’t merely fail), the same behavior in both parsers, and a spec in spec/errors/ that drives the tests.
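The shared recovery can be illustrated at the character level. The following is a minimal sketch, not the grammar or lexer code: `Diagnostic` and `drop_curly_quotes` are hypothetical names, and only the code E256, the two code points, and the "drop the quote, keep the words, report an exact span" behavior come from the description above.

```rust
use std::ops::Range;

/// Hypothetical, simplified diagnostic: an error code plus a byte span.
#[derive(Debug, PartialEq)]
struct Diagnostic {
    code: &'static str,
    span: Range<usize>,
}

/// Drops every curly single quote (U+2018, U+2019) from `line` and records
/// one E256 diagnostic per dropped character. The span covers exactly the
/// offending character in the ORIGINAL text, measured in bytes.
fn drop_curly_quotes(line: &str) -> (String, Vec<Diagnostic>) {
    let mut kept = String::with_capacity(line.len());
    let mut diagnostics = Vec::new();
    for (start, ch) in line.char_indices() {
        if matches!(ch, '\u{2018}' | '\u{2019}') {
            diagnostics.push(Diagnostic {
                code: "E256",
                span: start..start + ch.len_utf8(),
            });
        } else {
            // Surrounding characters survive, so later validation still runs.
            kept.push(ch);
        }
    }
    (kept, diagnostics)
}
```

Because the span is computed from `char_indices`, it stays correct for multi-byte input: each curly quote occupies three bytes in UTF-8.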

  • Bullet Validation documents the temporal media-bullet checks (CLAN errors 83/133/84 and chatter’s E701/E704/E729), a specific instance of the same “match CHECK where it is right, diverge where it is wrong” reconciliation this audit tracks across the whole error set.
  • Errors, CHAT core describes the ErrorCode model and the parser-layer / validation-layer split.
  • The spec-driven test pipeline that backs every rule is in Testing: rules live in spec/errors/ and generate both parser tests and the validation corpus.

Crate Reference

Status: Current Last modified: 2026-06-15 15:00 EDT

Summary of the main crates and packages in TalkBank/chatter.

Foundational crates

tree-sitter-talkbank

Rust binding crate for the generated TalkBank CHAT tree-sitter grammar. Exposes LANGUAGE, NODE_TYPES, and the generated query constants used by editor and parser integrations.

talkbank-model

The typed data model for CHAT files. Defines ChatFile, Utterance, DependentTier, MorTier, GraTier, and all other AST types. Includes validation logic, the WriteChat trait for CHAT serialization, serde support for JSON, and JsonSchema derivations. Also owns error types (ParseError, ErrorSink trait, Span, SourceLocation), diagnostic infrastructure, and ParseValidateOptions. Provides a closure-based content walker (walk_words / walk_words_mut) that centralizes recursive traversal of UtteranceContent and BracketedItem with domain-aware group gating.

talkbank-derive

Procedural macros for the model crate (SemanticEq, SemanticDiff, SpanShift, ValidationTagged, and the error_code_enum macro).

talkbank-cache

SQLite-backed validation and roundtrip cache used by higher-level validation and corpus workflows.

talkbank-parser

The canonical parser. Wraps the tree-sitter C parser and converts the concrete syntax tree (CST) into ChatFile model types. Provides error recovery via tree-sitter’s GLR algorithm and is the parser used by the CLI, LSP, transform pipelines, and editor tooling.

talkbank-parser-re2c

Independent alternate parser used as an equivalence oracle against the tree-sitter parser. Primarily a testing and spec-hardening tool rather than a first-wave end-user surface.

talkbank-transform

High-level pipelines: parse+validate, CHAT-to-JSON, JSON-to-CHAT, normalization. Integrates the validation cache, JSON schema validation, and parallel directory validation.

Application and integration surfaces

chatter

The chatter CLI binary: validate, normalize, to-json, and corpus management.

talkbank-lsp

Language Server Protocol server with tree-sitter incremental parsing, real-time diagnostics, and semantic highlighting.

send2clan

Rust bindings for sending files to the CLAN application (macOS Apple Events, Windows WM_APP). The crate exposes the safe send2clan API directly while keeping the raw FFI in private modules.

chatter-desktop

Desktop validation app (Tauri v2, React). Mandates TUI parity with the CLI.

Test and spec-support crates

talkbank-parser-tests

Parser tests. Runs the parser over the reference corpus and validates the results. Also owns spec-generated tests, roundtrip tests, equivalence tests, and property tests.

spec/tools

Generator binaries for tree-sitter corpus tests, generated Rust tests, shared spec artifacts, and error documentation.

spec/runtime-tools

Runtime-aware spec tooling for validation, bootstrap, and corpus-mining tasks that should not live in the root Rust workspace.

CLI Startup and the Program Stack

Status: Current Last modified: 2026-06-12 19:01 EDT

Why main() in crates/chatter/src/main.rs does not run the program directly, and what every contributor adding CLI surface should know about stack budgets.

The incident this page exists for

From 2026-06-05 to 2026-06-12, every chatter invocation crashed on Windows in debug builds with STATUS_STACK_OVERFLOW (exit code 0xC00000FD) before argument parsing even began. The crash surfaced as four failing adjudication_tests subprocess tests in the windows-latest CI job, but the faulting code was the clap-derived command-tree construction (Cli::augment_args via CommandFactory::command()), shared by every subcommand. The trigger was ordinary growth: the FREQ parity work added several hundred flags across 2026-06-03/04, and the construction path’s stack needs crossed 1 MiB.

Why stack usage is not portable

Two multipliers vary independently, and the crash happens where they collide:

  1. Platform main-thread allowance. There is no single default:

    | Context | Main/default stack |
    | --- | --- |
    | Windows main thread | 1 MiB (set in the PE header at link time) |
    | macOS main thread | 8 MiB |
    | Linux main thread | typically 8 MiB (ulimit -s) |
    | Rust spawned threads | 2 MiB unless stack_size is given |

    Shipping cross-platform means your real budget is the smallest of these: Windows’ 1 MiB.

  2. Build profile. At opt-level 0, rustc gives every temporary in a function body its own stack slot and does not coalesce them, so a function’s frame is roughly the SUM of all its temporaries, not the maximum simultaneously alive. clap’s derive expands to one enormous builder function per args struct (one multi-call chain per flag, each Arg/Command temporary a few hundred bytes by value), which is exactly the shape this penalizes. Release builds coalesce slots and inline, shrinking the same frames by one to two orders of magnitude.

Consequence: identical code can be fine in release on macOS (8 MiB budget, small frames) and fatal in debug on Windows (1 MiB budget, fat frames). Debug test binaries cross the line first, which is why CI subprocess tests caught it and shipped release binaries never crashed.

The design: an explicitly sized program thread

main() spawns the entire program onto a thread with an explicit, documented stack size (PROGRAM_STACK_BYTES, 16 MiB) and only joins and re-raises panics, so exit semantics are unchanged. This removes the dependency on platform main-stack defaults altogether. The alternative, trimming stack usage back under an invisible, platform-dependent line, would not hold: the CLAN parity roadmap (roughly sixty commands’ worth of flags still to come) guarantees we would cross that line again. rustc itself uses the same pattern for the same reasons.

flowchart TD
    main["main()\n(crates/chatter/src/main.rs)"]
    spawn["thread::Builder::stack_size(PROGRAM_STACK_BYTES)\n.spawn(program_main)"]
    prog["program_main()\nclap tree build + parse + cli::run"]
    join{"join() result?"}
    ok["process exits normally"]
    panic["resume_unwind(payload)\n(same exit behavior as a panic in main)"]
    fail["spawn failed (OS resource):\neprintln + exit(1)"]

    main --> spawn
    spawn -->|"Ok(handle)"| prog
    prog --> join
    join -->|"Ok(())"| ok
    join -->|"Err(payload)"| panic
    spawn -.->|"Err(e)"| fail

The reservation is virtual address space; physical pages are committed only as they are touched, so the 16 MiB costs nothing measurable. The extra thread spawn at startup is microseconds.
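The shape of the shim can be sketched with the standard library alone. This is an illustration under stated assumptions, not the contents of crates/chatter/src/main.rs: `run_on_program_thread`, `main_shim`, and the body of `program_main` are hypothetical, and the constant's value is taken from the 16 MiB figure above.

```rust
use std::panic;
use std::process;
use std::thread;

/// Assumed value: the 16 MiB figure quoted in the text.
const PROGRAM_STACK_BYTES: usize = 16 * 1024 * 1024;

/// Hypothetical stand-in for the real program body
/// (clap tree build + parse + cli::run).
fn program_main() -> i32 {
    0
}

/// Runs `body` on a thread with an explicit stack size, then joins it.
/// A panic on the program thread is re-raised on the caller, so exit
/// behavior matches a panic in `main`.
fn run_on_program_thread<F, T>(body: F) -> T
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    let spawned = thread::Builder::new()
        .name("program".to_string())
        .stack_size(PROGRAM_STACK_BYTES)
        .spawn(body);
    let handle = match spawned {
        Ok(handle) => handle,
        Err(e) => {
            // Spawn failure is an OS resource problem, not a program bug.
            eprintln!("failed to spawn program thread: {}", e);
            process::exit(1);
        }
    };
    match handle.join() {
        Ok(value) => value,
        Err(payload) => panic::resume_unwind(payload),
    }
}

/// What a `main` would delegate to: nothing but spawn, join, and return.
fn main_shim() -> i32 {
    run_on_program_thread(program_main)
}
```

A real `main` would do nothing except call the shim and exit with its result, which keeps everything stack-hungry off the OS main thread.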

Regression gates

  • crates/chatter/tests/stack_limit_tests.rs runs the real binary under a Windows-sized 1 MiB stack (sh -c 'ulimit -s 1024') on Unix, so macOS and Linux CI enforce the Windows constraint on every run. Without this, the constraint is tested only by the windows-latest job, where this incident sat unnoticed for a week.
  • The windows-latest cross-platform job remains the native test of the real 1 MiB main stack (which no longer matters to the program thread, but guards the main() shim itself).
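The Unix-side gate amounts to launching a program through a shell that has lowered its own stack limit. A minimal sketch of such a helper follows, assuming a POSIX `sh` on the PATH; `run_with_stack_limit` is a hypothetical name and is not taken from stack_limit_tests.rs.

```rust
use std::io;
use std::process::{Command, ExitStatus};

/// Runs `program` with `args` under a soft stack limit of `stack_kib` KiB.
/// The shell lowers its own limit with `ulimit -s` and then `exec`s the
/// target, so the target inherits the lowered limit. Unix only.
fn run_with_stack_limit(program: &str, args: &[&str], stack_kib: u32) -> io::Result<ExitStatus> {
    // "$0" is the program name and "$@" its arguments, both passed after the script.
    let script = format!("ulimit -s {} && exec \"$0\" \"$@\"", stack_kib);
    Command::new("sh")
        .arg("-c")
        .arg(script)
        .arg(program)
        .args(args)
        .status()
}
```

A test would call this with the path to the built binary and a limit of 1024, then assert that the exit status is a success rather than a stack-overflow signal.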

Guidance for contributors

  • Do not move program logic back onto the bare OS main thread; anything before the spawn runs under the platform’s smallest default.
  • Adding flags and subcommands is normal and expected; the budget is now the explicit PROGRAM_STACK_BYTES constant. If deep recursion or generated code ever approaches it, raise the constant deliberately in a reviewed change rather than discovering the limit in CI.
  • The same two multipliers apply to any worker threads you spawn: Rust’s 2 MiB spawned-thread default is also finite, and recursive parser or validation code running on worker threads should size them explicitly if depth is data-dependent.

Repository Architecture and Boundaries

Status: Current Last modified: 2026-06-15 15:00 EDT

Top-level layout

spec/                     canonical syntax and error spec source
spec/tools/               deterministic generators + validators (separate Cargo workspace)
grammar/                  tree-sitter grammar source + generated parser artifacts
crates/                   all Rust crates (root Cargo workspace)
  talkbank-model/         data model, validation, alignment, errors, parser API trait
  talkbank-derive/        proc macros (SemanticEq, SpanShift, ValidationTagged, error_code_enum)
  talkbank-parser/        canonical parser (tree-sitter)
  talkbank-parser-re2c/   alternate parser (specification oracle, opt-in batch parser)
  talkbank-parser-tests/  parser equivalence and roundtrip tests
  talkbank-transform/     pipelines, CHAT↔JSON, caching, parallel validation
  chatter/                the `chatter` CLI binary
  talkbank-lsp/           LSP server
  send2clan/              Rust bindings to the legacy CLAN app bridge
  talkbank-cache/         validation + roundtrip cache
apps/                     desktop app (Tauri v2 + React): chatter-desktop
corpus/                   reference corpus (must pass 100%)
schema/                   JSON Schema for ChatFile AST
tests/                    workspace-level integration tests and fixtures
fuzz/                     fuzz targets (separate Cargo workspace)
book/                     mdBook documentation source
docs/                     strategy docs, proposals, and investigations

Architectural principles

  1. Clear boundaries between specification, generation, runtime logic, and documentation.
  2. Generated artifacts and hand-authored code are kept separate with hard guardrails: parser.c, node-types.json, generated tests, and error-doc artifacts are never edited by hand.
  3. Each crate has a single clear responsibility.
  4. Entry-point docs guide new contributors to authoritative references quickly.

Canonical ownership rules

  • spec/ owns the language intent and accepted examples (what CHAT means).
  • grammar/ owns tokenization and CST shape only, not semantic validation policy.
  • talkbank-model owns semantic validity, serialization invariants, error types, and parser API contracts.
  • talkbank-transform owns pipelines and JSON schema validation.
  • talkbank-cache owns the shared SQLite-backed validation and roundtrip cache.

Dependency direction rules

  1. spec does not depend on runtime crates.
  2. grammar is consumed by parser crates, not vice versa.
  3. talkbank-model is dependency-minimal and stable; all other talkbank-* crates depend on it.
  4. CLI / LSP / desktop apps depend on stable internal APIs, never directly on unstable internals of other crates.
  5. Generator tools may read specs and grammar metadata but do not become runtime dependencies.

Acceptance criteria

  • Every top-level directory has a clear purpose statement.
  • No crate depends on internal modules outside declared boundaries.
  • No generated artifact is edited manually.
  • New contributors can identify authoritative docs in less than five minutes.

Grammar System and Token Governance

Status: Current Last modified: 2026-05-29 18:43 EDT

Current Reality

grammar/grammar.js encodes substantial implicit language knowledge directly in regex exclusions, reserved symbol lists, and leniency decisions. Example areas:

  • word segment forbidden start/rest classes,
  • CA delimiter/element symbol groups,
  • event segment exclusions,
  • hand-maintained coupling between comments and token rules.

This is currently powerful but fragile.

Primary Failure Modes

  1. New symbolic token added in one place but not in exclusion sets.
  2. Parser behavior changes silently due to regex class edits.
  3. Generated node types drift from assumptions in spec tooling.
  4. Lenient parsing choices become undocumented policy.

Current Design

The generated symbol registry is the single source of token constraints. The pipeline has shipped; running just symbols-gen rebuilds it.

Registry Artifacts

  • spec/symbols/symbol_registry.json (human-authored intent):
    • symbol string
    • category (delimiter, continuation, overlap, punctuation, etc.)
    • contexts where reserved/allowed
    • parse role and precedence notes
  • Generated outputs:
    • grammar/src/generated_symbol_sets.js
    • crates/talkbank-model/src/generated/symbol_sets.rs
    • spec/tools/src/generated/symbol_sets.rs
    • docs: Symbol Registry
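To make the registry-to-consumer flow concrete, here is a minimal sketch of deriving a word-character exclusion class from registry entries. Everything in it is hypothetical: the `SymbolEntry` shape, the `reserved_in_words` flag (a stand-in for "contexts where reserved/allowed"), and the function name. The real generators and the JSON schema of symbol_registry.json are not shown here.

```rust
/// Hypothetical subset of the categories named in the registry.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Category {
    Delimiter,
    Punctuation,
    Overlap,
}

/// Hypothetical in-memory form of one registry entry.
struct SymbolEntry {
    symbol: &'static str,
    category: Category,
    /// Stand-in for the registry's per-context reservation data.
    reserved_in_words: bool,
}

/// Builds a negated regex character class excluding every character of
/// every symbol that is reserved inside words. Output is sorted and
/// deduplicated so regeneration is deterministic.
fn word_exclusion_class(entries: &[SymbolEntry]) -> String {
    let mut chars: Vec<char> = entries
        .iter()
        .filter(|entry| entry.reserved_in_words)
        .flat_map(|entry| entry.symbol.chars())
        .collect();
    chars.sort_unstable();
    chars.dedup();

    let mut class = String::from("[^");
    for c in chars {
        // Escape the characters that are special inside a character class.
        if matches!(c, '\\' | ']' | '^' | '-') {
            class.push('\\');
        }
        class.push(c);
    }
    class.push(']');
    class
}
```

Deterministic output matters here because the generated files are committed and compared against a fresh regeneration.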

Grammar Refactor Requirements

  1. Replace large manual regex strings with generated character classes.
  2. Keep final grammar readable by preserving semantic names in generated constants.
  3. Distinguish clearly between:
    • syntax permissiveness,
    • semantic validation restrictions.
  4. Add comments only for design rationale, not for duplicating manual references.

Node Type Drift Controls

  • Enforce regeneration and consistency checks:
    • grammar source change must regenerate parser and node types,
    • node type constants consumed by spec/tools and parser code must compile,
    • CI fails if generated files differ from committed state.

Leniency Policy

Explicitly classify every lenient parse behavior:

  • Parse-lenient + validate-strict.
  • Parse-lenient + validate-warning.
  • Parse-strict (hard fail).

Document this matrix in the Leniency Policy.

Grammar Test Strategy

  1. Keep corpus tests generated from spec/constructs.
  2. Add targeted hand-authored edge tests for symbol boundary interactions.
  3. Add mutation-style tests for forbidden-character regressions.
  4. Add parser equivalence tests for tokenizer-sensitive cases.

Acceptance Criteria

  • No manual reserved-symbol duplication in grammar.js.
  • Symbol registry is generated to all required consumers.
  • Grammar modifications cannot land with stale generated artifacts.
  • Every special token category has explicit policy documentation.

Parser, Model, and API Contracts

Status: Current Last updated: 2026-06-21 21:33 EDT

Single-handle parser API

talkbank-parser provides TreeSitterParser as the canonical API handle for all parsing; full-file and fragment methods live directly on the struct. Callers create one instance and pass &TreeSitterParser everywhere. The alternate talkbank-parser-re2c is opt-in (specification oracle and high-throughput batch parsing) and produces the same ChatFile model.

Contract for Batchalign

The Batchalign runtime (the batchalign crate) consumes these guarantees from the talkbank-* core crates:

  • parsing produces a typed ChatFile or an explicit parse-status signal
  • parse-health taint is visible to alignment consumers
  • alignment helpers operate on semantic model types, not raw text hacks
  • recovery never fabricates valid-looking placeholder semantics for malformed input

The parser/model boundary stays honest enough for downstream workflows (align, compare, benchmark, morphotagging) to make their own validity decisions.

Canonical Contract Model

Public Contract Layers

  1. Parse API Contract:
    • stable function signatures,
    • deterministic parse result envelope,
    • clear partial-success semantics.
  2. Semantic Model Contract:
    • stable core model fields,
    • explicit unstable/internal fields policy.
  3. Diagnostic Contract:
    • stable error code IDs and severity semantics,
    • best-effort message text compatibility.
  4. Serialization Contract:
    • deterministic output constraints,
    • normalized formatting policy.

Required Types

  • ParseOutcome<T>
    • value: T | omitted-by-status
    • diagnostics: Vec<Diagnostic>
    • status: Success | Partial | Failed
  • Diagnostic
    • code, severity, category, message, location, context, suggestion
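As a rough illustration of the envelope, the field list above maps onto Rust types along these lines. This is a sketch under assumptions, not the talkbank-model definitions: the concrete field types (strings, a byte range for `location`) and the constructor names are hypothetical.

```rust
use std::ops::Range;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ParseStatus {
    Success,
    Partial,
    Failed,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Severity {
    Error,
    Warning,
}

/// Field names follow the list above; the types are assumptions.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Diagnostic {
    pub code: String,
    pub severity: Severity,
    pub category: String,
    pub message: String,
    pub location: Range<usize>,
    pub context: Option<String>,
    pub suggestion: Option<String>,
}

/// `value` is `None` exactly when the status says it was omitted.
#[derive(Debug)]
pub struct ParseOutcome<T> {
    pub value: Option<T>,
    pub diagnostics: Vec<Diagnostic>,
    pub status: ParseStatus,
}

impl<T> ParseOutcome<T> {
    pub fn success(value: T) -> Self {
        ParseOutcome { value: Some(value), diagnostics: Vec::new(), status: ParseStatus::Success }
    }

    /// A usable value accompanied by diagnostics describing what was recovered.
    pub fn partial(value: T, diagnostics: Vec<Diagnostic>) -> Self {
        ParseOutcome { value: Some(value), diagnostics, status: ParseStatus::Partial }
    }

    /// No value at all: callers must consult `status`, never a placeholder.
    pub fn failed(diagnostics: Vec<Diagnostic>) -> Self {
        ParseOutcome { value: None, diagnostics, status: ParseStatus::Failed }
    }
}
```

Routing construction through three constructors keeps `value` and `status` from disagreeing, which is the property integrators rely on when deciding whether to proceed.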

Parser Role

  • talkbank-parser: the canonical parser, used by CLI/LSP/API/batchalign3. TreeSitterParser is its only API handle; callers create one and pass &TreeSitterParser everywhere.
  • Tree-sitter GLR provides error recovery; the Rust traversal code converts CST to typed model.
  • Full-file methods: parser.parse_chat_file(), parser.parse_chat_file_streaming().
  • Fragment methods: parser.parse_word_fragment(), parser.parse_main_tier_fragment(), etc.

Invariants

  1. Parsing with offset must shift all spans consistently.
  2. Parse-level and validation-level diagnostics must remain distinguishable.
  3. Serialization should preserve semantic equivalence and documented formatting rules.
  4. Roundtrip behavior must be testable per parser implementation.
  5. Parser functions that accept ErrorSink should not return Option<T> for fallible parse state.
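Invariant 1 is easiest to see with a fragment parsed in isolation and then relocated into its host document. The sketch below assumes byte-offset spans and uses hypothetical `Span`, `Word`, and `Fragment` types; it is not the SpanShift derive from talkbank-derive.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Span {
    start: usize,
    end: usize,
}

impl Span {
    fn shifted(self, offset: usize) -> Span {
        Span { start: self.start + offset, end: self.end + offset }
    }
}

struct Word {
    span: Span,
}

/// A parsed fragment whose spans are relative to the fragment text.
struct Fragment {
    span: Span,
    words: Vec<Word>,
}

/// Shifts the fragment's own span AND every nested span by the same offset.
/// Forgetting any nested span is exactly the inconsistency the invariant forbids.
fn shift_fragment(fragment: &mut Fragment, offset: usize) {
    fragment.span = fragment.span.shifted(offset);
    for word in &mut fragment.words {
        word.span = word.span.shifted(offset);
    }
}
```

After the shift, slicing the host document with any span must yield the same text that the unshifted span yielded from the fragment.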

API Versioning Policy (Pre-1.0, Strict)

  • Three intended contract levels:
    • Stable-for-integrators
    • Stable-internal
    • Experimental
  • Mark every public function/type by contract level.

This classification is not yet codified in a separate manifest file; the levels above are the working policy. Integrators should treat any unmarked surface as Experimental until contract levels are formally published.

Acceptance Criteria

  • Single canonical parse outcome envelope exposed for integrators.
  • Parser implementations conform to shared contract tests.
  • Contract-level annotations exist for all public API surfaces.
  • Documentation for parse/validate/serialize lifecycle is centralized and current.

Recovery Contract: No Fabricated Semantic Values

The parser contract must forbid sentinel semantic values during error recovery.

Disallowed recovery behavior:

  • returning arbitrary enum variants as fallback for unknown/missing nodes,
  • returning empty strings as stand-ins for required fields,
  • constructing fake words/chunks like "missing", "error", or other placeholders.

Required recovery behavior:

  1. Emit structured diagnostic with precise span and expected node kind.
  2. Return an explicit parse-status signal (Partial/Failed) through ParseOutcome.
  3. Omit invalid semantic node OR store it in explicit recovery metadata, never as a valid semantic value.
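A small sketch of the required behavior, using a terminator lookup as the example. The types and the function are hypothetical simplifications, not parser code: the point is only that an unknown token produces a diagnostic and a Failed status with no value, instead of a default enum variant.

```rust
use std::ops::Range;

#[derive(Debug, PartialEq)]
enum Terminator {
    Period,
    Question,
    Exclamation,
}

#[derive(Debug, PartialEq)]
enum Status {
    Success,
    Failed,
}

/// Hypothetical diagnostic carrying the span and the expected node kind.
#[derive(Debug, PartialEq)]
struct Diagnostic {
    span: Range<usize>,
    expected: &'static str,
    found: String,
}

struct Outcome<T> {
    value: Option<T>,
    status: Status,
    diagnostics: Vec<Diagnostic>,
}

/// Maps token text to a terminator. `offset` is where the token starts
/// in the source, so the diagnostic span is absolute.
fn parse_terminator(text: &str, offset: usize) -> Outcome<Terminator> {
    let known = match text {
        "." => Some(Terminator::Period),
        "?" => Some(Terminator::Question),
        "!" => Some(Terminator::Exclamation),
        _ => None,
    };
    match known {
        Some(t) => Outcome { value: Some(t), status: Status::Success, diagnostics: Vec::new() },
        // No fallback variant is fabricated: the value is omitted entirely.
        None => Outcome {
            value: None,
            status: Status::Failed,
            diagnostics: vec![Diagnostic {
                span: offset..offset + text.len(),
                expected: "terminator",
                found: text.to_string(),
            }],
        },
    }
}
```

A downstream consumer can then tell "this utterance ended with a period" apart from "the parser could not tell how this utterance ended".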

Current enforcement:

  • CI guardrail script tracks and blocks introduction of new ErrorSink + Option signatures.
  • See scripts/check-errorsink-option-signatures.sh and scripts/errorsink_option_allowlist.txt.

Rationale:

  • fabricated semantic values create secondary, misleading diagnostics against synthetic data,
  • downstream tools cannot distinguish real user content from parser-generated placeholders,
  • equivalence and regression tests become noisy and non-actionable.

For batchalign3, this is especially important because alignment workflows must be able to tell the difference between:

  • a malformed input that should taint or block alignment
  • a recoverable input where raw text can be preserved
  • a clean input that should proceed through the align/compare pipeline

String Storage Policy

The model uses three string storage strategies:

  • Arc<str> interning (interned_newtype!): For high-frequency repeated values (POS tags, stems, speaker codes). Global interner avoids redundant allocations.
  • SmolStr (string_newtype!): For short strings (median 10-15 chars) that benefit from inline storage. O(1) clone, no heap allocation for strings ≤23 bytes.
  • String: Only for utility types outside the core model (e.g., semantic_diff/).
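The interning strategy can be sketched with the standard library alone. This is an assumption-laden illustration of the general technique, not the expansion of interned_newtype!: the `intern` function and its pool layout are hypothetical.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex, OnceLock};

/// Returns a shared `Arc<str>` for `s`. Equal strings share one allocation,
/// so cloning an interned value only bumps a reference count.
fn intern(s: &str) -> Arc<str> {
    // Process-wide pool, created on first use.
    static POOL: OnceLock<Mutex<HashSet<Arc<str>>>> = OnceLock::new();
    let mut pool = POOL
        .get_or_init(|| Mutex::new(HashSet::new()))
        .lock()
        .expect("interner mutex poisoned");

    // `Arc<str>` borrows as `str`, so lookup needs no allocation.
    if let Some(existing) = pool.get(s) {
        return Arc::clone(existing);
    }
    let fresh: Arc<str> = Arc::from(s);
    pool.insert(Arc::clone(&fresh));
    fresh
}
```

The payoff is largest for values such as speaker codes, where a corpus repeats a handful of strings many thousands of times.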

Parser Backends

Status: Current Last modified: 2026-06-13 22:40 EDT

TalkBank has two CHAT parser implementations. Both implement the ChatParser trait and produce identical ChatFile model types.

The --parser flag selects the backend at the CLI boundary; everything downstream consumes the identical ChatFile output, so the choice is invisible past the dispatch point:

flowchart TD
    cli["chatter validate --parser &lt;backend&gt;\n(ParserBackend enum,\nchatter cli_types.rs)"]
    sel{"which backend?\n(ParserKind,\ntalkbank-transform\nvalidation_runner/config.rs)"}
    ts["TreeSitterParser\n(talkbank-parser:\nGLR, incremental)"]
    re2c["Re2cParser\n(talkbank-parser-re2c:\nre2c DFA + chumsky)"]
    trait["ChatParser trait\n(talkbank-model\nparser_api/chat_parser.rs)"]
    model["ChatFile\n(talkbank-model:\nSemanticEq-identical\nfor both backends)"]

    cli --> sel
    sel -->|"tree-sitter (default)"| ts
    sel -->|"re2c"| re2c
    ts -->|"ParserDispatch::TreeSitter\n(worker.rs) implements"| trait
    re2c -->|"ParserDispatch::Re2c\n(worker.rs) implements"| trait
    trait --> model

ParserDispatch::new(kind) (in validation_runner/worker.rs) is the single place that constructs the chosen backend from a ParserKind; both variants wrap a ChatParser implementor, so the validation runner never branches on backend again.
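The single-dispatch-point idea can be sketched as follows. All signatures here are simplified assumptions for illustration: the real ChatParser trait, ChatFile type, and ParserDispatch in worker.rs carry far more, and the two stub parsers below merely count lines.

```rust
/// Stand-in for the real model type.
#[derive(Debug, PartialEq)]
struct ChatFile {
    line_count: usize,
}

/// Simplified stand-in for the shared parser trait.
trait ChatParser {
    fn parse_chat_file(&self, source: &str) -> ChatFile;
}

struct TreeSitterParser;
struct Re2cParser;

impl ChatParser for TreeSitterParser {
    fn parse_chat_file(&self, source: &str) -> ChatFile {
        ChatFile { line_count: source.lines().count() }
    }
}

impl ChatParser for Re2cParser {
    fn parse_chat_file(&self, source: &str) -> ChatFile {
        ChatFile { line_count: source.lines().count() }
    }
}

#[derive(Debug, Clone, Copy)]
enum ParserKind {
    TreeSitter,
    Re2c,
}

enum ParserDispatch {
    TreeSitter(TreeSitterParser),
    Re2c(Re2cParser),
}

impl ParserDispatch {
    /// The only place that turns a `ParserKind` into a concrete backend.
    fn new(kind: ParserKind) -> Self {
        match kind {
            ParserKind::TreeSitter => ParserDispatch::TreeSitter(TreeSitterParser),
            ParserKind::Re2c => ParserDispatch::Re2c(Re2cParser),
        }
    }

    /// Everything downstream sees only the trait.
    fn as_parser(&self) -> &dyn ChatParser {
        match self {
            ParserDispatch::TreeSitter(p) => p,
            ParserDispatch::Re2c(p) => p,
        }
    }
}
```

Code that receives `&dyn ChatParser` has no way to ask which backend it got, which is what keeps backend-specific branches from spreading.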

TreeSitterParser (default)

  • Crate: talkbank-parser
  • Technology: tree-sitter GLR parser
  • Grammar: grammar/grammar.js → generated C parser
  • Strengths: Incremental reparsing (LSP), robust error recovery (GLR), CST-level diagnostics
  • Weaknesses: Slower on batch workloads, !Send + !Sync (one parser per thread)

Used by the LSP, the default CLI, and all production validation.

Re2cParser

  • Crate: talkbank-parser-re2c
  • Technology: re2c DFA lexer + chumsky parser combinators
  • Grammar: Translated from grammar.js rules → re2c conditions + chumsky combinators
  • Strengths: 4-8x faster, Send + Sync, zero constructor cost, specification oracle
  • Weaknesses: No incremental reparsing, Box::leak memory strategy

Used for batch validation, parser parity testing, and performance benchmarking.

CLI Usage

# Default: tree-sitter
chatter validate corpus/

# Use re2c for faster batch validation
chatter validate --parser re2c corpus/

# Roundtrip with re2c
chatter validate --parser re2c --roundtrip corpus/

The --parser flag accepts tree-sitter (default) or re2c. Cache entries are parser-specific, so switching parsers does not invalidate the other’s cache.

Parity Status

Both parsers produce SemanticEq-identical output on the 87-file reference corpus (100% match). On the ~100k-file wild corpus, parity is ~98.7%.

Error Detection

| Metric | Value |
| --- | --- |
| Specs tested | 140 |
| Both detect error | 140/140 (100%) |
| Same error code | 79/140 (56.4%) |
| Different code, both detect | 61/140 (43.6%) |
| Re2c silent (misses error) | 0 |

The 61 code mismatches come from architectural differences, not bugs. Both parsers report actionable diagnostics for all 140 testable error specs.

Performance

| Benchmark | TreeSitter | Re2c | Speedup |
| --- | --- | --- | --- |
| Small file (13 lines) | 44 µs | 9.6 µs | 4.6x |
| Medium file (dependent tiers) | 69 µs | 9.4 µs | 7.3x |
| Large file (complex) | 7,734 µs | 970 µs | 8.0x |
| Batch (35 files) | 21.7 ms | 3.0 ms | 7.2x |

Run benchmarks: cargo bench -p talkbank-parser-re2c --bench parse_comparison

When to Use Which

| Use Case | Recommended Parser | Why |
| --- | --- | --- |
| LSP / editor integration | tree-sitter | Incremental reparsing |
| Batch validation (>100 files) | re2c | 4-8x faster |
| CI validation | Either | Both correct; re2c saves CI time |
| Error diagnostics (user-facing) | tree-sitter | More specific E3xx codes |
| Parser parity testing | Both | Re2c is the specification oracle |
| Profiling / benchmarking | re2c | DFA lexer gives a performance floor |

Shared Model Infrastructure

Both parsers convert to the same talkbank_model::ChatFile type and share post-hoc promotion logic:

  • TierContent::extract_terminal_bullet(): trailing InternalBullet → utterance bullet
  • parse_bullet_node_timestamps(): structured bullet CST → (start_ms, end_ms)
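The first promotion step can be sketched as a pop-if-last operation. The `Item` and `Bullet` types below are hypothetical simplifications of the tier content model, and this free function only mirrors the described behavior of TierContent::extract_terminal_bullet(); it is not its implementation.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Bullet {
    start_ms: u64,
    end_ms: u64,
}

#[derive(Debug, PartialEq)]
enum Item {
    Word(String),
    InternalBullet(Bullet),
}

/// If the LAST content item is an internal bullet, removes it and returns it
/// so the caller can attach it to the utterance. Bullets elsewhere in the
/// content are left where they are.
fn extract_terminal_bullet(items: &mut Vec<Item>) -> Option<Bullet> {
    let bullet = match items.last() {
        Some(Item::InternalBullet(bullet)) => *bullet,
        _ => return None,
    };
    items.pop();
    Some(bullet)
}
```

Keeping this step in shared model code, rather than in each parser, is what lets both backends agree on where a trailing bullet ends up.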

CA intonation arrows are no longer promoted to terminators at the parser/model boundary; both parsers leave them as Separator items. See CA Terminator Resolution.

Detailed Parity Report

See crates/talkbank-parser-re2c/docs/parity-report.md for the full gap analysis, divergence categories, and remaining work items.

Parser Leniency Policy

Status: Current Last updated: 2026-06-15 13:08 EDT

This document is the single source of truth for how the tree-sitter grammar, Rust validation layer, and CLI tooling divide responsibility for enforcing the CHAT specification. It consolidates decisions scattered across grammar.js comments, analysis documents, and code.

Scope: Documentation only. This document does not implement new validation rules; it records what exists, what is intentionally absent, and proposes a roadmap for closing gaps.


Philosophy: Parse, Don’t Validate

The tree-sitter grammar intentionally accepts a superset of valid CHAT. The rationale:

  1. Maximise parse coverage: Real-world .cha files contain legacy patterns, whitespace variations, and edge cases. A grammar that rejects them produces no AST and therefore no diagnostics. Accepting them gives the validation layer something to work with.

  2. Separate syntax from semantics: The grammar captures structure (headers, utterances, tiers, annotations). The Rust validation layer enforces semantic rules (required headers, participant declarations, alignment counts).

  3. Enable configurable strictness: Different consumers need different policies. A roundtrip pipeline can be strict; an editor providing live diagnostics should be lenient. Validation profiles (see Validation Profile Infrastructure) make this possible.

Three-Tier Classification

Every intentional leniency decision falls into one of three tiers:

| Tier | Label | Meaning |
| --- | --- | --- |
| A | Parse-lenient + validate-strict | Grammar accepts it; validation rejects it as an error |
| B | Parse-lenient + validate-warning | Grammar accepts it; validation emits a warning |
| C | Parse-lenient only | Grammar accepts it; no validation needed: the construct is genuinely optional or the broad acceptance is by design |

This classification was proposed in an earlier grammar governance analysis and is formalised here.
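Expressed as types, the three tiers differ only in what the validation layer is expected to emit for a construct the grammar has already accepted. The enum and function names below are hypothetical; they restate the classification and are not part of the codebase.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LeniencyTier {
    /// Parse-lenient + validate-strict.
    A,
    /// Parse-lenient + validate-warning.
    B,
    /// Parse-lenient only.
    C,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Severity {
    Error,
    Warning,
}

/// What validation emits for a grammar-accepted construct of this tier.
/// `None` means no diagnostic is expected at all.
fn expected_validation_severity(tier: LeniencyTier) -> Option<Severity> {
    match tier {
        LeniencyTier::A => Some(Severity::Error),
        LeniencyTier::B => Some(Severity::Warning),
        LeniencyTier::C => None,
    }
}
```

The match is exhaustive on purpose: adding a fourth tier would force every consumer of the classification to decide how to treat it.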


Leniency Matrix

Master table of every documented leniency decision in the grammar. The Status column indicates whether downstream validation compensates for the grammar’s permissiveness.

| # | Grammar Construct | Spec Requirement | Grammar Behavior | Tier | Validation | Error Code | Status |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | @UTF8 header | Required, must be first line | Optional (not enforced) | A | Validated | E503 | OK |
| 2 | @Begin header | Required | Optional (grammar.js ~L104) | A | Validated | E504 | OK |
| 3 | @End header | Required | Optional (grammar.js ~L106) | A | Validated | E502 | OK |
| 4 | Pre-first-utterance header order | No enforced order (matches CLAN CHECK) | choice(), any order (grammar.js ~L122-135) | C | N/A (by design) | - | OK |
| 5 | Headers after utterances | Allowed (e.g. @Bg, @Eg, @G, @Comment) | Interleaved freely | C | N/A (by design) | - | OK |
| 6 | Content type context restrictions | Unified across contexts | Unified base_content_item (grammar.js ~L731-738) | C | N/A (by design); specific semantic rules (E371, E372) exist separately | - | OK |
| 7 | Terminator presence | Required (except CA mode) | Optional (grammar.js ~L691-692) | A | Validated | E305 | OK |
| 8 | Bare shortening as word | CA mode only | Accepted anywhere | A | Validated | E2xx | OK |
| 9 | Trailing whitespace in annotations | Not specified | Optional trailing space (grammar.js ~L957, 966, 975, 1004, 1013) | C | N/A | - | OK |
| 10 | MOR segment Unicode | Very permissive (broad language support) | Exclusion-based regex (grammar.js ~L1909-1915) | C | N/A (by design) | - | OK |
| 11 | MOR fusional suffixes with hyphens | ALNUM + IPA only | Allows hyphens (grammar.js ~L1942-1945) | C | N/A (by design) | - | OK |
| 12 | MOR nested translations | No nested structures | Allows () and [] nesting (grammar.js ~L1954-1966) | C | N/A (by design) | - | OK |
| 13 | Linkers / language codes | Truly optional | Optional | C | N/A | - | OK |
| 14 | Word annotations | Truly optional | Optional | C | N/A | - | OK |
| 15 | Media bullet | Truly optional | Optional | C | N/A | - | OK |
| 16 | Group whitespace (leading/trailing) | No whitespace inside < > | Optional (grammar.js ~L1097, 1099) | C | N/A | - | OK |
| 17 | Long feature label characters | Limited character set | /[A-Za-z0-9@%_-]+/ (grammar.js ~L1327) | C | N/A | - | OK |
| 18 | Catch-all headers ($.anything) | Structured content for some headers | /[^\r\n]+/ for ~19 header types | C | N/A (content is opaque) | - | OK |
| 19 | Header gap whitespace | Single space/tab | repeat1(choice(space, tab)) (grammar.js ~L467, 477, 489) | C | N/A | - | OK |
| 20 | @Types header whitespace | No spaces around commas | Optional whitespace around commas (grammar.js ~L584-592) | C | N/A | - | OK |

Permissiveness Regression Decisions

During development, several validation rules were tightened and then relaxed after they produced false positives against the reference corpus. These decisions are documented in the permissiveness regression log (archived). Each is summarised here with its rationale.

Decision 1: [*] bare annotation, E214 disabled

  • Previous behaviour: E214 emitted when [*] appeared without an explicit error code (empty ContentAnnotation::Error).
  • Current behaviour: Bare [*] is accepted without error.
  • Implementation: Removed validation branch in talkbank-model/src/model/annotation/annotated.rs.
  • Rationale: Reference files (errormarkers.cha, compound.cha) use bare [*] as valid CHAT.
  • Revisit: If coded error annotations become required, do it behind an explicit strict profile.

Decision 2: @t without @s:<lang>, E248 disabled

  • Previous behaviour: E248 emitted for @t markers without an explicit language marker.
  • Current behaviour: @t accepted without requiring @s:<lang>.
  • Implementation: Removed checks in talkbank-model/src/validation/word/structure.rs.
  • Rationale: Reference file formmarkers.cha contains a@t and is expected to be valid.
  • Revisit: Scope to explicit strict validation mode if desired.

Decision 3: Undeclared inline language codes, E254 re-introduced as warning

  • Original behaviour: Inline @s:... markers with language codes not declared in @Languages emitted E254 as an error.
  • Intermediate behaviour: E254 was disabled and the code removed from the codebase to keep reference file lang-marker.cha valid.
  • Current behaviour: E254 (UndeclaredExplicitWordLanguage) is back in the registry at crates/talkbank-model/src/errors/codes/error_code.rs:321 and emitted at crates/talkbank-model/src/validation/word/language/resolve.rs:195, but as a warning rather than an error. This was paired with the introduction of E255 (WholeUtteranceLanguageSwitchShouldUsePrecode) for whole-utterance @s runs that should use [- lang] precodes.
  • Why it returned: Heterogeneous corpora (Cantonese, Polish, Czech, Spanish, HK bilingual) made the warn-only signal load-bearing for catching @s:LANG markers that disagreed with @Languages. The warning surfaces the inconsistency without blocking the file.
  • Revisit: If the warn-only signal turns out to be ignored in practice, decide between escalating back to error severity or removing.

Decision 4: Mixed-language digit legality, permissive-any rule

  • Previous behaviour: Digits had to be legal in all applicable languages for mixed/ambiguous markers.
  • Current behaviour: Digits accepted if legal in at least one applicable language.
  • Implementation: Changed from is_valid_in_all() to any() in talkbank-model/src/validation/word/language/digits.rs.
  • Rationale: Prevents false positives in mixed-language reference examples.
  • Revisit: Confirm spec intent for mixed/ambiguous validation semantics.
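The difference between the two rules is a one-word change from `all` to `any` over the applicable languages. The sketch below shows both side by side; the function names and the `allows_digits` predicate are hypothetical, and no claim is made about which real languages permit digits.

```rust
/// Previous, strict rule: digits must be legal in EVERY applicable language.
fn digits_legal_in_all<F>(applicable_langs: &[&str], allows_digits: F) -> bool
where
    F: Fn(&str) -> bool,
{
    applicable_langs.iter().copied().all(|lang| allows_digits(lang))
}

/// Current, permissive rule: digits are accepted if AT LEAST ONE
/// applicable language allows them.
fn digits_legal_in_any<F>(applicable_langs: &[&str], allows_digits: F) -> bool
where
    F: Fn(&str) -> bool,
{
    applicable_langs.iter().copied().any(|lang| allows_digits(lang))
}
```

One edge case is worth knowing when revisiting the rule: with an empty language list, `all` is vacuously true while `any` is false, so the two rules also disagree when no language applies.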

Decision 5: @Bg nesting, same-label only

  • Previous behaviour: Any nested @Bg while another gem scope was open emitted E529.
  • Current behaviour: E529 only fires when nesting the same label (or same unlabeled scope key). Different labels may nest hierarchically.
  • Implementation: Changed from any_scope_open to same_scope_open in talkbank-model/src/validation/header/structure.rs.
  • Rationale: Avoids false positives on hierarchical markup patterns (e.g., HSLLD corpus).
  • Revisit: Decide whether nesting policy should be global or per-label.
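
A minimal sketch of the same-label rule, assuming gem scopes can be modelled as a stream of begin/end events keyed by an optional label. GemEvent and same_label_nesting_violations are hypothetical names for this example, not items from structure.rs.

```rust
use std::collections::HashSet;

/// Hypothetical event model: `None` stands for an unlabeled gem scope.
#[derive(Debug)]
enum GemEvent {
    Begin(Option<&'static str>),
    End(Option<&'static str>),
}

/// Returns the indices of `Begin` events that open a scope whose key is already open.
fn same_label_nesting_violations(events: &[GemEvent]) -> Vec<usize> {
    let mut open: HashSet<Option<&'static str>> = HashSet::new();
    let mut flagged = Vec::new();
    for (i, event) in events.iter().enumerate() {
        match event {
            GemEvent::Begin(label) => {
                // `insert` returns false when the same scope key is already open.
                if !open.insert(*label) {
                    flagged.push(i);
                }
            }
            GemEvent::End(label) => {
                open.remove(label);
            }
        }
    }
    flagged
}
```

Under this model, a "page" scope opened inside a "story" scope passes, while opening "story" twice (or opening two unlabeled scopes) is flagged at the second Begin.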

Decision 6: Temporal bullets in CA mode, skipped

  • Previous behaviour: E701/E704 temporal checks ran even for CA-mode files.
  • Current behaviour: Temporal constraints are skipped when the file is in CA mode.
  • Implementation: validate_temporal_constraints() early-returns when ca_mode is true (talkbank-model/src/validation/temporal.rs).
  • Rationale: CA reference files include patterns that triggered false monotonicity/self-overlap diagnostics.
  • Revisit: Implement CA-specific temporal policy rather than global skip.

Decision 7: Pipeline severity threshold, errors only

  • Previous behaviour: Any validation diagnostic (including warnings) caused PipelineError::Validation.
  • Current behaviour: Pipeline returns failure only if at least one diagnostic has Severity::Error.
  • Implementation: talkbank-transform/src/pipeline/parse.rs.
  • Rationale: Warnings should not block parse/transform/export pipelines.
  • Revisit: Keep as default; add explicit --strict flag/profile if needed.
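
The threshold itself is small enough to sketch. The types below are stand-ins defined for this example (the real Diagnostic and PipelineError types live in the crates and are not reproduced here); only the errors-only rule is taken from the text above.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Severity {
    Error,
    Warning,
    Info,
}

/// Stand-in diagnostic record for this example.
struct Diagnostic {
    code: &'static str,
    severity: Severity,
}

/// Current rule: the pipeline fails only if at least one diagnostic is an error.
/// (The previous rule was equivalent to `!diags.is_empty()`.)
fn pipeline_fails(diags: &[Diagnostic]) -> bool {
    diags.iter().any(|d| d.severity == Severity::Error)
}
```

A run that produces only warnings or informational diagnostics therefore proceeds to transform/export, while a single error-severity diagnostic still stops it.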

Decision 8: Spacing warnings W210/W211, disabled

  • Previous behaviour: Style-level spacing warnings around terminators and overlap markers.
  • Current behaviour: Checks removed from core main-tier validation path.
  • Implementation: check_spacing_warnings() invocation removed from talkbank-model/src/model/content/main_tier.rs.
  • Rationale: Generated unexpected diagnostics on files treated as valid in the reference workflow.
  • Revisit: Reintroduce as optional lint profile, not core validator hard path.

Validation Gap Roadmap

Concrete items where the grammar is lenient but no validation compensates. Each proposes a new error code and priority.

Priority 1: @UTF8 Presence (E503), DONE

  • Grammar: @UTF8 is optional.
  • Spec: Required, must be the first line.
  • Implemented: E503 (MissingUTF8Header) added to check_headers() in talkbank-model/src/validation/header/structure.rs.
  • Severity: Error.
  • Note: All 340 reference corpus files contain @UTF8, so there is zero roundtrip impact.

Priority 2: Pre-First-Utterance Header Order (proposed E534), Not a Gap

  • Grammar: choice() accepts headers in any order between @Begin and the first utterance.
  • Assessment: CLAN CHECK does not enforce any ordering for post-@Begin headers; it validates presence and format only. Our grammar’s flexible ordering matches CHECK’s behavior.
  • Status: Reclassified from Tier B (GAP) to Tier C (by design).

Priority 3: Content Type Context Validation, Not a Gap

  • Grammar: Unified base_content_item accepts any content type in any context.
  • Assessment: The unified rule is correct by design. Nested groups are legal CHAT (e.g., <the <dag> [: dog]> [= something]). The two specific semantic restrictions that do exist (no pauses in pho groups, E371; no nested quotations, E372) are already validated.
  • Status: Reclassified from Tier A (PARTIAL) to Tier C (by design).

Validation Profile Infrastructure

What Exists

ValidationConfig (talkbank-model/src/errors/config.rs)

Builder-pattern configuration for per-error-code severity overrides.

let config = ValidationConfig::new()
    .downgrade(ErrorCode::IllegalUntranscribed, Severity::Warning)
    .disable(ErrorCode::InvalidOverlapIndex)
    .upgrade(ErrorCode::UnknownAnnotation, Severity::Error);

API:

  • new(): empty config, all codes use original severity
  • downgrade(code, severity): lower severity (chainable)
  • disable(code): suppress entirely (chainable)
  • upgrade(code, severity): raise severity (chainable)
  • set_severity(code, Option<Severity>): set or disable (chainable)
  • effective_severity(code, original) -> Option<Severity>: query
  • is_disabled(code) -> bool: check

Pre-built profiles:

  • lenient(): downgrades IllegalUntranscribed and InvalidOverlapIndex to Severity::Warning. Designed for gradual migration of legacy corpora.
  • strict(): escalates unmapped warnings to errors (sets upgrade_unmapped_warnings, honored by effective_severity). Explicit per-code overrides still take precedence, so a caller can opt a specific code back to Severity::Warning.
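
The precedence described for strict() can be sketched with a toy model. ToyConfig below is not ValidationConfig: it keys overrides by string code instead of ErrorCode and exists only to show one plausible resolution order (explicit override first, then the unmapped-warning escalation, then the original severity).

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Severity {
    Error,
    Warning,
}

/// Toy model only; field and method names mirror the prose above, not the crate source.
#[derive(Default)]
struct ToyConfig {
    /// `Some(sev)` overrides the severity; `None` disables the code entirely.
    overrides: HashMap<&'static str, Option<Severity>>,
    upgrade_unmapped_warnings: bool,
}

impl ToyConfig {
    fn strict() -> Self {
        ToyConfig { upgrade_unmapped_warnings: true, ..Default::default() }
    }

    fn set_severity(mut self, code: &'static str, severity: Option<Severity>) -> Self {
        self.overrides.insert(code, severity);
        self
    }

    fn effective_severity(&self, code: &str, original: Severity) -> Option<Severity> {
        match self.overrides.get(code) {
            // An explicit per-code entry always wins, including "disabled".
            Some(explicit) => *explicit,
            // Unmapped warnings are escalated only in strict mode.
            None if self.upgrade_unmapped_warnings && original == Severity::Warning => {
                Some(Severity::Error)
            }
            None => Some(original),
        }
    }
}
```

With this model, ToyConfig::strict().set_severity("E254", Some(Severity::Warning)) keeps E254 at warning severity while every other warning resolves to an error.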

ConfigurableErrorSink (talkbank-model/src/errors/configurable_sink.rs)

Wrapper that intercepts errors and applies ValidationConfig before forwarding to an inner ErrorSink.

let inner = ErrorCollector::new();
let sink = ConfigurableErrorSink::new(&inner, config);
// Pass `sink` to the parser/validator: disabled errors are filtered and
// severity overrides are applied before forwarding to `inner`.

Runner-Level Flags (talkbank-transform, chatter)

Flag | Effect
--skip-alignment | Skip tier alignment validation
--roundtrip | Test serialization idempotency after validation
--force | Clear cache for path and revalidate
--max-errors N | Stop after N errors

What Is Missing

Gap | Description | Effort
No --profile CLI flag | Users cannot select strict / lenient / lint from the command line | Medium
ConfigurableErrorSink not wired into validation pipeline | Infrastructure exists but is not used by chatter validate | Medium
No lint-style profile | Spacing/style warnings (W210, W211) have no home | Small (once profiles are wired)
No profile serialization | Cannot load profiles from TOML/JSON config files | Medium
No corpus-specific profiles | E.g., HSLLD-specific rules | Future

Proposed Profiles

From the permissiveness regression log:

Profile | Purpose | Behaviour
reference-compatible | Current permissive baseline | Default, matches current validation behaviour
strict-chat | Full spec enforcement | Re-enable selected tightenings (E214, E248, E254, etc.)
lint-style | Spacing/style warnings only | Enable W210, W211; do not fail pipeline

The roundtrip gate should be pinned to an agreed profile to prevent future ambiguity about what “pass” means.


Silent Recovery Points (NLP Pipelines)

An earlier Python-Rust boundary audit identified several places where batchalign-core silently massages data without diagnostics. These are related to leniency because they represent permissive acceptance without transparency.

Pipeline | Recovery Mechanism | Diagnostics?
Stanza morphosyntax | retokenize.rs DP alignment; Word::new_unchecked fallback | No
Whisper/Wave2Vec FA | forced_alignment.rs DP “best fit” | No
Google Translate | Imported verbatim into %xtra | No filtering
Stanza segmentation | Silent abort on assignment mismatch | No

Key infrastructure gap: ParseHealth exists in talkbank-model (per-utterance tier cleanliness flags with taint(), is_clean(), can_align_main_to_mor() methods). It is used by the tree-sitter and direct parsers during parsing. However, batchalign-core does not read, write, or propagate ParseHealth during any mutation (morphosyntax injection, FA injection, retokenisation). The infrastructure exists in the model layer but is not connected to the pipeline layer.


Cross-References

Source | What It Contains
Grammar governance analysis (archived) | Proposed this document; leniency matrix concept; three-tier classification
Permissiveness regression log (archived) | 8 permissiveness regression decisions with rationale
Python-Rust boundary audit (archived) | Silent recovery points; ParseHealth gap; NLP pipeline audit
grammar/grammar.js | Inline comments on each leniency decision (line references in matrix above)
talkbank-model/src/errors/config.rs | ValidationConfig API
talkbank-model/src/errors/configurable_sink.rs | ConfigurableErrorSink adapter
talkbank-model/src/validation/header/structure.rs | Header validation: E501, E502, E503, E504-E533
talkbank-model/src/validation/temporal.rs | Temporal constraint checks (E701, E704); CA-mode skip
talkbank-model/src/model/content/main_tier.rs | Where W210/W211 were removed

Last updated: 2026-02-18

Error Diagnostics UX Standard

Status: Current Last modified: 2026-05-30 07:08 EDT

Workspace-wide standard for diagnostic shape, severity, recovery behavior, span correctness, and integrator output formats. Applies to the CHAT-core error system. Upstream batchalign-runtime errors follow the same shape and are documented separately in the batchalign3 project.

Objective

Make diagnostics precise, explainable, and actionable for both developers and non-technical editors, while keeping machine readability for downstream tools.

Open concerns

  • Message quality across the error catalog is not yet governed by one central style standard. Different error codes were authored at different times and converge unevenly on the message-quality guidance below.

Canonical Diagnostic Schema

Diagnostic {
  code: String,
  severity: Error | Warning | Info,
  category: Parse | Validation | Alignment | Header | Tier | Internal,
  location: SourceLocation,
  context: ErrorContext,
  message: String,
  suggestion: Option<String>,
  related: Vec<RelatedLocation>
}

Message Quality Standard

Each diagnostic must answer:

  1. What failed.
  2. Where it failed.
  3. Why it likely failed.
  4. What to do next.

Avoid internal jargon unless accompanied by user-facing explanation.

Severity Policy

  • Error: blocks parse/validation outcome.
  • Warning: content is usable but has quality/compliance concerns.
  • Info: optional guidance and migration hints.

Severity must not be overloaded for tooling convenience.

Recovery Policy: Diagnostic-First, Not Sentinel-First

When parser recovery is required:

  • do not invent semantic fallback values to keep type construction convenient,
  • do not use empty strings or arbitrary enum defaults as recovered content.

Instead:

  1. Report a diagnostic with expected/actual node context.
  2. Preserve span information for tooling and UI.
  3. Propagate partial/failure status explicitly.

Any synthetic placeholders that are unavoidable for internal plumbing must be:

  • non-semantic (not exposed as real model content),
  • marked internal-only,
  • excluded from user-facing diagnostics and serialization.

Sentinel vs error-variant rule

If an unexpected condition changes semantic trust in parsed content:

  • Represent that explicitly as an error-bearing state (enum variant, parse-taint flag, or explicit outcome type).
  • Never represent it as None or a default payload that can be mistaken for valid content.

This applies both to parser outputs and to runtime metadata consumed during validation.
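
As an illustration of the rule, the sketch below parses a millisecond value from a source slice and returns an explicit tainted state (with its span) on failure instead of a default. Recovered and parse_ms are hypothetical names invented for this example; they are not types from talkbank-model.

```rust
/// Hypothetical outcome type: a failed parse is its own state, never a default value.
#[derive(Debug, PartialEq)]
enum Recovered<T> {
    Value(T),
    /// Parsing failed; the byte span is preserved for tooling and UI.
    Tainted { start: usize, end: usize },
}

/// Parses `src[start..end]` as milliseconds.
fn parse_ms(src: &str, start: usize, end: usize) -> Recovered<u64> {
    match src.get(start..end).and_then(|text| text.parse::<u64>().ok()) {
        Some(ms) => Recovered::Value(ms),
        // A sentinel-first version would return 0 here, which a consumer could not
        // tell apart from a genuine 0 ms timestamp.
        None => Recovered::Tainted { start, end },
    }
}
```

The point of the design is visible in the two results that a sentinel would conflate: parse_ms("0_2000", 0, 1) yields Value(0), while parse_ms("10x0_2000", 0, 4) yields Tainted { start: 0, end: 4 }.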

Diagnostic construction

Use shared constructors/helpers for common diagnostics to reduce drift:

  • span-only diagnostics (code + severity + span + message),
  • source-backed diagnostics (code + severity + span + source + offending + message).

Benefits: consistent location/context population, fewer ad-hoc ParseError::new(...) call shapes, simpler migration to richer miette rendering.

Error Code Governance

  • Central registry under talkbank-model (errors module).
  • One authoritative description and example per code.
  • Deprecated codes remain mapped with explicit migration notes.
  • CI check forbids duplicate code definitions or orphaned docs.

Span and Location Correctness

  • All diagnostics use consistent line/column and byte-offset definitions.
  • Golden tests cover:
    • single-byte and multi-byte UTF-8 content,
    • embedded content offsets,
    • continuation lines and tabs.
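
A small sketch of why multi-byte content needs its own golden tests. It assumes one possible convention (1-based line and column, with the column counted in Unicode scalar values); this document does not state which convention the workspace uses, and line_col is a hypothetical helper.

```rust
/// Converts a byte offset into a 1-based (line, column) pair, counting the column
/// in Unicode scalar values. Returns `None` for offsets that are out of range or
/// fall inside a multi-byte character.
fn line_col(src: &str, byte_offset: usize) -> Option<(usize, usize)> {
    if !src.is_char_boundary(byte_offset) {
        return None;
    }
    let before = &src[..byte_offset];
    let line = before.matches('\n').count() + 1;
    let line_start = before.rfind('\n').map_or(0, |i| i + 1);
    let column = before[line_start..].chars().count() + 1;
    Some((line, column))
}
```

In "*CHI:\tqué ." the character é occupies two bytes, so the space that follows it sits at byte offset 10 but at character column 10 rather than 11. A tab counts as one character here; a renderer that expands tabs would need a further mapping.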

Integrator Output Formats

  • Human-readable CLI diagnostics.
  • Machine-readable JSON diagnostics.
  • LSP diagnostic mapping.

All formats share the same underlying diagnostic schema.

Acceptance Criteria

  • Every emitted diagnostic includes code, severity, location, and suggestion policy.
  • Error code documentation and runtime definitions are synchronized automatically.
  • Span correctness is covered by dedicated tests.
  • CLI and JSON outputs are contract-tested for schema compliance.

Wide Struct Audit

Status: Current Last modified: 2026-05-29 22:34 EDT

A repository-wide audit rule for struct shape. Applies to the crates in TalkBank/chatter (model, parser, transform, CLI, CLAN, LSP, cache, and related tooling). The rule originated in the predecessor monorepo, but this page is scoped to the current repository rather than the old mixed CHAT+batchalign workspace.

A struct with many fields is not automatically wrong. The smell is:

  • many unrelated concerns packed into one value
  • several related booleans that act like implicit policy enums
  • repeated field-name prefixes that point to missing sub-structs
  • parallel vectors or stringly runtime fields
  • runtime code reaching into many unrelated fields of the same value

The repo therefore treats 10 or more named fields as an audit threshold, not as an automatic ban.

Categories

Wide structs fall into four categories.

1. Boundary shim, may stay wide

CLI, JSON, or clap boundary types. Acceptable if they are converted into typed policies or sub-structs before entering core runtime code.

Examples: ValidateDirectoryOptions, clap-facing CLI arg structs, JSON boundary records.

2. Transport or schema record, may stay wide

DB rows, HTTP response shapes, JSON schema mirrors. Acceptable as long as they don’t become the internal runtime shape.

Examples: WordJsonSchema, DbMetadata, CoverageReport, CorpusManifest.

3. Real aggregate, may stay wide

Domain values whose fields all answer one coherent question and whose callers consume the whole rather than spelunking through unrelated subsets.

Examples: metric/report records like SpeakerEval, SpeakerKideval, SpeakerComplexity, and SpeakerFluency (report records, not runtime coordination).

4. Refactor target, must be split

Mix of policy and state, multiple responsibilities, or callers needing to know the whole subsystem to use a subset of fields.

Design Rules

  1. Treat 10 or more named fields as an audit trigger.
  2. Treat 3 or more related boolean fields as a smell even below that threshold.
  3. Boundary and transport records may stay wide when they mirror a real external shape.
  4. Runtime coordination structs prefer named sub-structs over flat bags.
  5. Replace parallel vectors with per-item records where possible.
  6. If a wide struct stays wide, record the reason in the surrounding design docs, audit notes, or code review rather than letting it remain unexplained.

Refactor Examples

ValidateDirectoryOptions (chatter), was a flat bag

Used to be a flat bag of format, cache, traversal, roundtrip, parser, audit, and TUI flags. Now grouped by concern:

  • ValidationRules
  • ValidationExecution
  • ValidationTraversalMode
  • ValidationPresentation

This is the shape the audit wants for policy-rich CLI boundaries: one small top-level struct with explicit sub-objects and enums rather than a dozen flat fields.
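
A generic sketch of that conversion at a boundary. All names and fields here (FlatArgs, Presentation, Execution, GroupedOptions, and the quiet/json/tui flags) are hypothetical and are not the fields of ValidateDirectoryOptions; the example only shows related booleans collapsing into an enum and the remainder grouping by concern.

```rust
/// Hypothetical flat boundary shape: three of these booleans encode one policy.
struct FlatArgs {
    quiet: bool,
    json: bool,
    tui: bool,
    force: bool,
    roundtrip: bool,
}

#[derive(Debug, PartialEq)]
enum Presentation {
    Plain,
    Quiet,
    Json,
    Tui,
}

#[derive(Debug, PartialEq)]
struct Execution {
    force: bool,
    roundtrip: bool,
}

#[derive(Debug, PartialEq)]
struct GroupedOptions {
    presentation: Presentation,
    execution: Execution,
}

/// Converts the flat boundary shape into grouped policies, rejecting flag
/// combinations that the flat shape could express but that have no meaning.
fn group(args: FlatArgs) -> Result<GroupedOptions, String> {
    let presentation = match (args.quiet, args.json, args.tui) {
        (false, false, false) => Presentation::Plain,
        (true, false, false) => Presentation::Quiet,
        (false, true, false) => Presentation::Json,
        (false, false, true) => Presentation::Tui,
        _ => return Err("conflicting presentation flags".to_string()),
    };
    Ok(GroupedOptions {
        presentation,
        execution: Execution { force: args.force, roundtrip: args.roundtrip },
    })
}
```

Three booleans admit eight combinations, of which only four are meaningful in this toy model; after conversion, core code can no longer observe the other four.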

ParseHealth (talkbank-model), was a ten-boolean state vector

Now stores taint as a compact tier bitset keyed by ParseHealthTier, the shape this audit expects for fixed domain sets.

flowchart LR
    tier["ParseHealthTier"] --> set["Tier health set"]
    set --> checks["Alignment safety checks"]
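
For readers unfamiliar with the pattern, a bitset keyed by a fixed enum can be sketched as below. Tier and TierHealth are illustrative stand-ins: the variant list, the u16 backing integer, and the method bodies are assumptions, and only the method names taint(), is_clean(), and can_align_main_to_mor() are taken from this book.

```rust
/// Hypothetical tier set; the real `ParseHealthTier` variants may differ.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u8)]
enum Tier {
    Main = 0,
    Mor = 1,
    Gra = 2,
}

/// One bit per tier replaces one boolean field per tier.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
struct TierHealth {
    tainted: u16,
}

impl TierHealth {
    fn mask(tier: Tier) -> u16 {
        1u16 << (tier as u8)
    }

    fn taint(&mut self, tier: Tier) {
        self.tainted |= Self::mask(tier);
    }

    fn is_tainted(&self, tier: Tier) -> bool {
        (self.tainted & Self::mask(tier)) != 0
    }

    fn is_clean(&self) -> bool {
        self.tainted == 0
    }

    /// Alignment is only trusted when both participating tiers parsed cleanly.
    fn can_align_main_to_mor(&self) -> bool {
        !self.is_tainted(Tier::Main) && !self.is_tainted(Tier::Mor)
    }
}
```

Adding a tier then means adding an enum variant rather than another boolean field plus another argument at every construction site.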

Open Hotspots

TUI state bags

Real state owners that still want grouping by concern (selection vs. progress vs. render flags vs. status):

  • crates/chatter/src/ui/validation_tui/state.rs TuiState

Backend (talkbank-lsp)

crates/talkbank-lsp/src/backend/state.rs is a service-root aggregate. Defensible, but still wants grouping such as document caches, parse caches, validation state, language services.

Metric structs

SpeakerEval/SpeakerKideval are acceptable as report records. If output renderers keep needing subsets (lexical metrics, morphosyntax metrics, error counts, derived scores), those records should eventually nest along those lines.

Audit Guardrail

There is currently no repo-local automated wide-struct lint in TalkBank/chatter. Treat this page as a manual review checklist and refactor trigger: when a type grows past the threshold, decide explicitly whether it is an acceptable boundary/schema aggregate or a real split target.

Spec Tooling and Generation Pipeline

Status: Current Last updated: 2026-05-19 17:38 EDT

Objective

Make spec/ the reliable language-contract source while keeping generation deterministic, maintainable, and appropriately scoped.

The goal is to separate:

  • grammar artifact generation
  • validation/error-doc generation
  • parser semantic testing (fragment and full-file)

Anything that still looks like bootstrap-era synthetic fragment orchestration is now audit-only unless a doc says it remains operational.

Open structural concerns

  • spec/tools still carries bootstrap-era Rust parser/model dependencies that create circular or awkward workflow coupling.
  • Contributor workflows still over-assume that make test-gen is the right reaction to every parser-related change.

Current Generation Pipeline

spec constructs/errors
  -> spec validators
  -> generated grammar corpus tests
  -> generated rust parser/validation tests
  -> generated error docs
  -> coverage dashboards and quality reports

That pipeline is still useful, but it is too broad to remain the single mental model for parser testing.

Desired Post-Bootstrap Split

grammar specs/templates
  -> generated tree-sitter corpus tests

error specs
  -> generated validation/parser error tests
  -> generated error docs

fragment semantic fixtures and invariants
  -> fragment-level parser tests

reference corpus / curated full files
  -> parser parity tests

Structural Reorganization for spec/tools (proposed, not yet implemented)

The intent here is to narrow spec/tools’s mission back to spec-driven artifact generation and validation rather than leaving it as a bootstrap-era staging ground for parser semantics. A proposed module split:

  • input (markdown/spec parsing)
  • ir (normalized internal representation)
  • emit (grammar tests, rust tests, docs)
  • validate (schema and semantic checks)
  • sync (grammar node-types and symbol-registry checks)

Current layout (crates: bin/, generated/, lib.rs, output/, spec/, templates/) has not been migrated to this shape. Treat this section as a design target for future work rather than a description of the current source tree.

Legacy vs Active

Keep these active:

  • grammar corpus generation
  • error doc generation
  • symbol registry sync/validation
  • affected regeneration when a spec or grammar input truly changed

Treat these as legacy audit paths:

  • synthetic tree-sitter fragment wrappers
  • bootstrap-era parser equivalence rituals

Determinism Requirements

  1. Stable ordering of generated outputs.
  2. Stable formatting of generated code/docs.
  3. Re-runs without source changes produce no diffs.

Drift Prevention Controls

  • Node type compatibility check:
    • spec/tools must compile and run against current generated node constants.
  • Registry compatibility check:
    • all symbol categories used in specs and grammar must be known in registry.
  • Generation integration check:
    • full generation pass with clean tree must produce zero diff.
  • Boundary check:
    • generated grammar/docs flows should not silently become the sole authority for fragment parsing semantics.

Authoring Experience (proposed, not yet implemented)

Spec authoring would benefit from:

  • Strict but simple spec templates for constructs and errors.
  • A spec lint command for immediate feedback (missing fields, invalid tags, malformed examples, unknown error codes).
  • Clearer documentation of when make test-gen is actually needed and when a small direct test is the right answer instead.

The spec lint binary does not yet exist; the strict-validation work that exists today happens implicitly through make test-gen failures plus the spec validators in spec/tools/src/bin/.

Versioning and Metadata

Each spec file should include:

  • ownership,
  • status (draft, accepted, deprecated),
  • parser/validation scope,
  • linked tests and generated outputs.

Acceptance Criteria

  • spec/tools is green and deterministic.
  • Every generation target has explicit provenance from source specs.
  • Drift between node types, specs, and generators is blocked in CI.
  • Spec contributors have a documented and automated happy path.
  • Small grammar changes no longer force a giant regeneration ritual by default.
  • Fragment parsing semantics are tested outside the generation pipeline.

Symbol Registry Architecture

Status: Current Last modified: 2026-05-29 18:43 EDT

Purpose

spec/symbols/symbol_registry.json is the canonical source of token/symbol classes used by CHAT grammar tokenization policy.

Scope

The registry currently governs:

  • CA delimiter symbols,
  • CA element symbols,
  • word segment forbidden symbol classes,
  • event segment forbidden symbol classes.

Governance Rules

  1. Symbol changes must be made only in spec/symbols/symbol_registry.json.
  2. Registry must pass validation:
    • node spec/symbols/validate_symbol_registry.js
  3. Grammar symbol sets must be regenerated after any registry change:
    • just symbols-gen
  4. Generated files are read-only and must not be edited manually.

Determinism Requirements

  • Every category list in the registry must be lexicographically sorted.
  • Duplicate symbols are forbidden.
  • ca_delimiter_symbols and ca_element_symbols must be disjoint.

These constraints keep generated outputs stable and review diffs minimal.
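
The three constraints are cheap to check. The sketch below is written in Rust for consistency with the rest of this book and is not the logic of validate_symbol_registry.js; it assumes byte-wise string ordering and treats "duplicate" as "duplicate within one category", both of which are assumptions.

```rust
use std::collections::HashSet;

/// Checks that a category list is strictly increasing, which implies both
/// "sorted" and "no duplicates" in a single pass.
fn check_category(name: &str, symbols: &[&str]) -> Result<(), String> {
    for pair in symbols.windows(2) {
        if pair[0] >= pair[1] {
            return Err(format!(
                "{}: {:?} must sort strictly before {:?}",
                name, pair[0], pair[1]
            ));
        }
    }
    Ok(())
}

/// Checks that two category lists share no symbol.
fn check_disjoint(a: &[&str], b: &[&str]) -> Result<(), String> {
    let seen: HashSet<&str> = a.iter().copied().collect();
    match b.iter().find(|s| seen.contains(*s)) {
        Some(shared) => Err(format!("symbol {:?} appears in both lists", shared)),
        None => Ok(()),
    }
}
```

Requiring strict ordering is what makes generated output independent of the order in which a contributor happened to add symbols, so a registry edit produces a one-line diff in each generated file.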

Consuming Outputs

Generated symbol constants are emitted to:

  • grammar/src/generated_symbol_sets.js
  • crates/talkbank-model/src/generated/symbol_sets.rs
  • spec/tools/src/generated/symbol_sets.rs

grammar/grammar.js imports from this generated module to avoid manual duplication of critical symbol policy.

Change Workflow

  1. Edit registry JSON.
  2. Run registry validation.
  3. Run just symbols-gen.
  4. Run grammar generation/tests.
  5. Run parser equivalence tests.
  6. Commit source + generated outputs together.

Auditability

Registry drift is caught by the checked-in generated artifacts plus the normal local verification sweep and CI checks, so symbol changes should land together with regenerated grammar and Rust outputs.

Bullet Validation

Status: Current Last updated: 2026-05-01 05:19 EDT

Media bullets are timestamps embedded in CHAT utterances that link transcript text to audio/video. They appear as •start_end• at the end of a main tier line (e.g., *CHI: hello . •1000_2000•). Validating that these timestamps are internally consistent is one of the more subtle parts of CHAT validation, because the “obvious” rules turn out to be wrong for multi-party conversation.

This chapter documents what CLAN CHECK does, where its implementation falls short of its own intent, and how chatter validate interprets and improves on that intent.

The three temporal checks

There are three distinct temporal constraints that can be checked on bullet timestamps. They differ in scope, severity, and whether they should run by default.

E701: Same-speaker start-time monotonicity (CLAN Error 83)

Rule: For each speaker, their utterances’ start times must be non-decreasing. If speaker CHI has utterance A starting at 10,000ms and utterance B (later in document order) starting at 8,000ms, that is an error: CHI’s timeline has gone backward.

Scope: Per-speaker. Cross-speaker non-monotonicity is allowed (see Why cross-speaker non-monotonicity is not an error).

Severity: Error.

E704: Same-speaker self-overlap (CLAN Error 133)

Rule: For each speaker, the current utterance’s start time must not be more than 500ms before the same speaker’s previous utterance’s end time. In other words, a speaker cannot overlap with themselves by more than 500ms.

Scope: Per-speaker. The 500ms tolerance accounts for annotation rounding and minor timing imprecision at boundaries.

Severity: Error.

E729: Cross-speaker overlap (CLAN Error 84)

Rule: The current utterance’s start time must not be before the previous utterance’s (any speaker) end time. This checks for any temporal overlap between adjacent utterances, regardless of speaker.

Scope: Global (cross-speaker). Only fires with CLAN’s +c0 flag.

Severity: Warning. Not part of default validation.

This check is part of CLAN’s “strict timeline contiguity” mode, which requires that every utterance’s start time equals the previous utterance’s end time: no gaps (Error 85) and no overlaps (Error 84). It is designed for a very specific use case: verifying that audio has been exhaustively and non-redundantly segmented. In normal conversational transcripts, cross-speaker overlap is ubiquitous, so this check would be absurd as a default.

What CLAN CHECK does

CLAN CHECK implements bullet validation in the function check_checkBulletsConsist() in check.cpp. Understanding its implementation is essential because it has several accidental behaviors that affect the error counts users see.

The snapshot-and-compare pattern

The function uses a global pair (check_SNDBeg, check_SNDEnd) to hold the “current” bullet timing, and saves the previous values into local variables (tBegTime, tEndTime) at the start of each call. The comparison flow is:

1. Save previous: tBegTime = check_SNDBeg, tEndTime = check_SNDEnd
2. Parse new bullet into check_SNDBeg, check_SNDEnd
3. Check error 83: check_SNDBeg < tBegTime?           (cross-speaker comparison)
4. Check error 133: speaker's last END - check_SNDBeg > 500?  (same-speaker)
5. If +c0 mode: check error 84 (overlap) and error 85 (gap)
6. Update speaker's last END time via check_setLastTime()

The early-return shadowing bug

The critical implementation detail is that error 83 fires via return(83) at step 3. This causes the function to exit immediately, skipping steps 4 through 6. Two consequences follow:

  1. Error 83 shadows error 133. An utterance that triggers error 83 (global non-monotonicity) can never also trigger error 133 (same-speaker overlap) in the same call, even if both conditions are true. This is not intentional; it is an artifact of C-style early-return control flow.

  2. Speaker state goes stale. Step 6 (check_setLastTime) updates the speaker’s per-speaker tracking in the SPLIST linked list. When error 83 fires, this update is skipped. All subsequent error-133 checks for that speaker compare against a stale endTime value, causing cascading state corruption that suppresses legitimate error 133 reports.
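
Both consequences can be reproduced with a toy model of the six-step flow. This is a Rust sketch of the control flow as described on this page, not a translation of check.cpp: ClanLikeState is a hypothetical name, step 5 (+c0) is omitted, and the per-speaker state is updated only when no error is returned.

```rust
use std::collections::HashMap;

/// Toy model: `prev_beg` plays the role of the global previous start time and
/// `last_end` the role of the per-speaker end times.
#[derive(Default)]
struct ClanLikeState<'a> {
    prev_beg: u64,
    last_end: HashMap<&'a str, u64>,
}

impl<'a> ClanLikeState<'a> {
    /// Returns the error number reported for this bullet, if any.
    fn check(&mut self, speaker: &'a str, beg: u64, end: u64) -> Option<u32> {
        let t_beg = self.prev_beg; // step 1: snapshot the previous start
        self.prev_beg = beg; // step 2: the new bullet becomes "current"
        if beg < t_beg {
            return Some(83); // step 3: early return skips steps 4 and 6
        }
        if let Some(&prev_end) = self.last_end.get(speaker) {
            if prev_end.saturating_sub(beg) > 500 {
                return Some(133); // step 4
            }
        }
        self.last_end.insert(speaker, end); // step 6
        None
    }
}
```

Feeding it CHI 10000-16000, MOT 20000-21000, CHI 15000-30000, CHI 29000-31000 gives None, None, Some(83), None. The third utterance also overlaps CHI's first by 1,000ms, but 83 shadows 133; the fourth overlaps the third by 1,000ms, but the model compares against the stale end time 16000 and reports nothing.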

Error 83 is global, not per-speaker

CLAN fires error 83 by comparing the current utterance’s start time against the previous utterance’s start time, regardless of speaker. In a multi-party conversation:

*PIL: something . •100000_102000•
*UEL: response .  •99500_101000•      ← Error 83: 99500 < 100000

This fires error 83 because UEL’s start time (99,500ms) is before PIL’s start time (100,000ms). But this is just two people talking at the same time: normal conversational overlap. The [>] and [<] markers in CHAT explicitly annotate this as intentional simultaneous speech.

In files with many speakers (the Koine/bre corpus has 7-9 speakers per file, including children talking over each other), this fires on a huge fraction of utterances. CLAN’s accidental shadowing partially masks the problem by suppressing downstream error-133 reports when error 83 fires.

Why cross-speaker non-monotonicity is not an error

Consider a classroom recording with a teacher (PIL) and seven children. The teacher asks a question, and three children answer simultaneously:

*PIL: qué es esto ?        •50000_52000•
*UEL: un coche .           •51200_52500•   ← started during PIL's question
*MAR: coches .             •51000_51800•   ← started even earlier
*REN: es un coche grande . •51500_53000•   ← started after both UEL and MAR

In document order, the start times are: 50000, 51200, 51000, 51500. This is non-monotonic (51000 < 51200), but there is nothing wrong with this data. The children are simply talking at the same time. No amount of reordering the utterances in the file would make all start times monotonically increasing while preserving the speaker-turn structure.

Cross-speaker non-monotonicity is an inherent property of multi-party conversation, not a data error. Flagging it as an error produces thousands of false positives on any corpus with overlapping speech.

When IS non-monotonic start time an error?

Same-speaker non-monotonicity IS an error. If CHI speaks at 10,000ms, then later in the file CHI speaks again at 8,000ms, CHI’s timeline has gone backward. This almost certainly indicates a transcription or alignment mistake.

The test is simple: within the same speaker’s utterance sequence, start times must be non-decreasing. This is what chatter validate checks for E701.
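
A minimal sketch of that test, with a global variant alongside it for contrast. The function names and the (speaker, start_ms) tuple shape are assumptions for this example, not the signatures in temporal.rs.

```rust
use std::collections::HashMap;

/// Per-speaker rule: flags utterances whose start precedes the same speaker's
/// previous start. Input is (speaker, start_ms) in document order.
fn per_speaker_backward(bullets: &[(&str, u64)]) -> Vec<usize> {
    let mut last_start: HashMap<&str, u64> = HashMap::new();
    let mut flagged = Vec::new();
    for (i, &(speaker, start)) in bullets.iter().enumerate() {
        // `insert` records the new start and hands back the speaker's previous one.
        if let Some(prev) = last_start.insert(speaker, start) {
            if start < prev {
                flagged.push(i);
            }
        }
    }
    flagged
}

/// Global rule, for contrast: compares against the previous utterance of any speaker.
fn global_backward(bullets: &[(&str, u64)]) -> Vec<usize> {
    bullets
        .windows(2)
        .enumerate()
        .filter(|(_, pair)| pair[1].1 < pair[0].1)
        .map(|(i, _)| i + 1)
        .collect()
}
```

On the classroom sequence (PIL 50000, UEL 51200, MAR 51000, REN 51500) the global rule flags MAR at index 2, while the per-speaker rule flags nothing because no speaker appears twice. On (CHI 10000, MOT 10500, CHI 8000) both rules flag index 2.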

How chatter validate implements bullet validation

E701: Per-speaker monotonicity (not global)

chatter validate tracks each speaker’s last start time in a HashMap. E701 only fires when the same speaker’s start time goes backward. Cross-speaker non-monotonicity is silently accepted.

This is an intentional semantic divergence from CLAN CHECK, which fires error 83 globally. We believe CLAN’s global check reflects the implementation (comparing against a single global tBegTime) rather than the intent (detecting disordered timestamps). The per-speaker version matches the intent without drowning users in false positives from normal conversational overlap.

E704: Per-speaker overlap with 500ms tolerance

chatter validate tracks each speaker’s last end time in a HashMap. E704 fires when the overlap exceeds 500ms (same threshold as CLAN Error 133).

Unlike CLAN, E704 runs independently of E701. An utterance can trigger both errors if it is both non-monotonic (E701) and self-overlapping (E704). CLAN’s early-return pattern prevents error 133 from firing when error 83 fires, which is a bug, not a feature.

Speaker state is always updated regardless of whether errors fire. This avoids the cascading state corruption that CLAN’s implementation suffers from.
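
The E704 behaviour described above can be sketched as follows. Bullet here is a simplified stand-in with hypothetical fields, and the sketch leaves out the untranscribed-utterance and CA-mode skips; it shows only the tolerance comparison and the unconditional state update.

```rust
use std::collections::HashMap;

const SELF_OVERLAP_TOLERANCE_MS: u64 = 500;

/// Simplified stand-in for a main-tier bullet.
struct Bullet<'a> {
    speaker: &'a str,
    start_ms: u64,
    end_ms: u64,
}

/// Returns the indices of utterances that overlap the same speaker's previous
/// utterance by more than the tolerance.
fn self_overlaps(bullets: &[Bullet<'_>]) -> Vec<usize> {
    let mut last_end: HashMap<&str, u64> = HashMap::new();
    let mut flagged = Vec::new();
    for (i, bullet) in bullets.iter().enumerate() {
        if let Some(&prev_end) = last_end.get(bullet.speaker) {
            // Overlap is how far the previous end extends past the current start.
            if prev_end.saturating_sub(bullet.start_ms) > SELF_OVERLAP_TOLERANCE_MS {
                flagged.push(i);
            }
        }
        // State is refreshed whether or not the check fired.
        last_end.insert(bullet.speaker, bullet.end_ms);
    }
    flagged
}
```

For CHI 10000-16000, MOT 20000-21000, CHI 15000-30000, CHI 29000-31000, this flags indices 2 and 3 (each overlaps CHI's preceding utterance by 1,000ms). Index 2 also goes backward relative to nothing of CHI's own (15000 is after 10000), so it would not be an E701 in the per-speaker sense even though a global start-time check would object to it.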

E729: Not in default validation

E729 (CLAN Error 84, cross-speaker overlap) is implemented but not called during default validation. It exists for future use in a strict-bullet mode equivalent to CLAN’s +c0 flag.

Untranscribed utterances are skipped

Utterances containing only untranscribed markers (www, xxx, yyy) are skipped for E704 checks. These utterances often carry broad segment bullets (covering a long span of background speech) that would create false self-overlap reports. This matches CLAN CHECK’s behavior, where untranscribed tiers do not contribute to timing comparisons.

CA mode disables all temporal checks

When the file header includes @Options: CA, all temporal validation is skipped. Conversation Analysis mode intentionally relaxes timing constraints because CA transcription conventions use overlapping and non-sequential timing as part of the analytic notation.

Comparison: CLAN CHECK vs chatter validate

The following table summarizes the behavioral differences:

┌────────────────────────────┬──────────────┬─────────────────┐
│ Behavior                   │ CLAN CHECK   │ chatter validate│
├────────────────────────────┼──────────────┼─────────────────┤
│ Error 83 / E701 scope      │ Global       │ Per-speaker     │
│ Error 133 / E704 scope     │ Per-speaker  │ Per-speaker     │
│ Error 84 / E729 default    │ Off (+c0)    │ Off             │
│ 83 shadows 133             │ Yes (bug)    │ No              │
│ 83 corrupts speaker state  │ Yes (bug)    │ No              │
│ E701 + E704 independent    │ No           │ Yes             │
│ Speaker state always fresh │ No           │ Yes             │
│ Untranscribed skipped      │ Implicit     │ Explicit        │
│ CA mode bypass             │ Yes          │ Yes             │
│ 500ms tolerance (E704)     │ Yes          │ Yes             │
└────────────────────────────┴──────────────┴─────────────────┘

Expected count differences

On multi-party files with overlapping speech:

  • E701 count will be lower than CLAN’s error 83 count. CLAN fires error 83 on cross-speaker non-monotonicity; we don’t. The difference represents legitimate conversational overlap that we intentionally do not flag.

  • E704 count will be higher than CLAN’s error 133 count. CLAN’s early-return shadowing prevents error 133 from firing when error 83 fires, and the stale speaker state causes further suppression. Our correctly maintained per-speaker tracking reports all genuine self-overlaps.

On single-speaker files or files with minimal overlap, the counts should be very close or identical.

Implementation details

The implementation lives in crates/talkbank-model/src/validation/temporal.rs.

Data flow

flowchart TD
    A["collect_bullets(file)\n(temporal.rs:101)"] -->|"Vec&lt;BulletInfo&gt;"| B
    B["validate_global_timeline()\n(temporal.rs:169)"] -->|"Per-speaker HashMap"| C["E701 errors"]
    A -->|"Vec&lt;BulletInfo&gt;"| D
    D["validate_speaker_timelines()\n(temporal.rs:212)"] -->|"Per-speaker HashMap"| E["E704 errors"]

BulletInfo

Each utterance with a bullet produces a BulletInfo containing:

  • utterance_idx: 0-based index in the file
  • speaker: the speaker code (e.g., "CHI", "PIL")
  • bullet: the Bullet struct with start_ms and end_ms
  • has_timeable_content: whether the utterance contains transcribed words (used to skip untranscribed-only turns for E704)

Only main speaker tiers are collected. Dependent tiers (%mor, %gra, etc.) are excluded.

Per-speaker tracking

Both E701 and E704 use HashMap<&str, ...> keyed by speaker code:

  • E701: stores (utterance_idx, start_ms), the speaker’s most recent start time
  • E704: stores (utterance_idx, end_ms), the speaker’s most recent end time

State is always updated after processing each bullet, regardless of whether an error was reported. This ensures clean tracking for subsequent comparisons.
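The per-speaker tracking described above can be sketched as follows. This is a minimal illustration, not the code in temporal.rs: the flattened BulletInfo fields, the function names e701_candidates and e704_candidates, the exact comparison operators, and the choice to leave E704 state untouched for skipped untranscribed utterances are assumptions. Only the per-speaker HashMap, the always-update rule, and the 500 ms tolerance come from the text.

```rust
use std::collections::HashMap;

/// Simplified stand-in for the BulletInfo described above (hypothetical layout).
#[derive(Debug, Clone)]
pub struct BulletInfo {
    pub utterance_idx: usize,
    pub speaker: String,
    pub start_ms: u64,
    pub end_ms: u64,
    pub has_timeable_content: bool,
}

/// Tolerance for E704, taken from the comparison table.
const SELF_OVERLAP_TOLERANCE_MS: u64 = 500;

/// E701 sketch: within one speaker, start times must not move backwards.
/// Returns the utterance indices that would be flagged.
pub fn e701_candidates(bullets: &[BulletInfo]) -> Vec<usize> {
    let mut last_start: HashMap<&str, (usize, u64)> = HashMap::new();
    let mut flagged = Vec::new();
    for b in bullets {
        if let Some(&(_, prev_start)) = last_start.get(b.speaker.as_str()) {
            if b.start_ms < prev_start {
                flagged.push(b.utterance_idx);
            }
        }
        // State is updated whether or not an error was reported.
        last_start.insert(b.speaker.as_str(), (b.utterance_idx, b.start_ms));
    }
    flagged
}

/// E704 sketch: a speaker must not start more than the tolerance before
/// their own previous end. Untranscribed-only utterances are skipped.
pub fn e704_candidates(bullets: &[BulletInfo]) -> Vec<usize> {
    let mut last_end: HashMap<&str, (usize, u64)> = HashMap::new();
    let mut flagged = Vec::new();
    for b in bullets.iter().filter(|b| b.has_timeable_content) {
        if let Some(&(_, prev_end)) = last_end.get(b.speaker.as_str()) {
            if b.start_ms + SELF_OVERLAP_TOLERANCE_MS < prev_end {
                flagged.push(b.utterance_idx);
            }
        }
        last_end.insert(b.speaker.as_str(), (b.utterance_idx, b.end_ms));
    }
    flagged
}
```

Because each map is keyed by speaker code, an overlap between two different speakers never reaches either comparison, which is the behavior the comparison table attributes to chatter validate.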

CLAN source reference

For readers who want to trace the CLAN implementation:

  • Function: check_checkBulletsConsist() in OSX-CLAN/src/clan/check.cpp, lines 3849-3967
  • Error 83: lines 3883-3890 (early return(83))
  • Error 133: lines 3892-3895 (only reached if error 83 did not fire)
  • Speaker state update: line 3909 (check_setLastTime), only reached if no error fired
  • Per-speaker tracking: SPLIST linked list, lookup via check_getLatTime() / check_setLastTime()
  • +c0 mode: checkBullets flag, set via +c0 command-line option (line 5920), guards errors 84/85 at lines 3897 and 3953
  • Call site: check_ParseWords() line 4801, guarded by utterance->speaker[0] == '*' (main tiers only)

CA Terminator Resolution

Status: Current Last updated: 2026-05-05 12:23 EDT

How CA markers are split between separators and linkers in the parser/model.

Current rule

The parser/model no longer promotes CA markers into utterance terminators.

The supported split is:

  1. Standard utterance terminators remain the CHAT terminators such as . ? ! +... +/. and related final punctuation tokens.
  2. CA intonation arrows (⇗ ↗ → ↘ ⇘) stay Separator content items.
  3. CA TCU markers (≈ ≋) stay Separator content items.
  4. CA TCU linker forms (+≈ +≋) stay Linker items.

This means a trailing CA arrow, ≈, or ≋ remains in main-tier content rather than being retyped as Terminator.

Parser/model consequences

  1. Tree-sitter grammar keeps arrows and ≈/≋ on the separator path.
  2. The tree parser converts those nodes directly into Separator variants.
  3. The re2c parser classifies ≈/≋ as separators and +≈/+≋ as linkers.
  4. The old post-hoc resolve_ca_terminator() promotion pass was removed.
  5. Terminator::try_from_chat_str() intentionally rejects CA arrows, ≈, ≋, +≈, and +≋.

Data Model

The active surface split is:

| Kind | CHAT tokens |
| --- | --- |
| Terminator | . ? ! +... +/. +//. +/? +!? +"/. +". +//? +..? +. |
| Separator | ⇗ ↗ → ↘ ⇘ ≈ ≋ plus the other CA/content separators |
| Linker | +≈ +≋ plus the other utterance linkers |

Legacy CA-only Terminator variants still exist in the type for backward compatibility with older serialized data, but new parser/classifier code does not construct them from CHAT text.
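The split can be pictured as a single classification function. The sketch below is illustrative only: SurfaceKind and classify_surface_token are hypothetical names, and the match covers just the tokens named in this chapter, not the full set of separators and linkers the real parser knows.

```rust
/// Hypothetical three-way surface classification used for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SurfaceKind {
    Terminator,
    Separator,
    Linker,
}

/// Classifies only the tokens listed in this chapter; anything else is None.
pub fn classify_surface_token(token: &str) -> Option<SurfaceKind> {
    match token {
        // Standard CHAT utterance terminators.
        "." | "?" | "!" | "+..." | "+/." | "+//." | "+/?" | "+!?" | "+\"/." | "+\"."
        | "+//?" | "+..?" | "+." => Some(SurfaceKind::Terminator),
        // CA intonation arrows and TCU markers stay separators.
        "⇗" | "↗" | "→" | "↘" | "⇘" | "≈" | "≋" => Some(SurfaceKind::Separator),
        // CA TCU linker forms.
        "+≈" | "+≋" => Some(SurfaceKind::Linker),
        _ => None,
    }
}
```

Under this sketch a trailing → is classified as a Separator, so nothing downstream needs a promotion pass to decide whether it ends the utterance.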

Regression coverage

The regression surface for this split is:

  • ca_symbols_are_not_chat_terminators in talkbank-model
  • trailing_ca_arrow_stays_separator in talkbank-parser
  • trailing_ca_no_break_stays_separator in talkbank-parser
  • trailing_ca_technical_break_stays_separator in talkbank-parser

Validation Cache

Status: Current Last modified: 2026-06-22 06:48 EDT

The CHAT-core validation cache, used by chatter validate and the LSP server. Distinct from the audio-task cache used by upstream batchalign3 for FA / UTR ASR / media conversion (documented separately in that project): this cache stores parse + validate results keyed by file path + options.

The implementation lives in crates/talkbank-cache/.

Architecture

flowchart TD
    req["Validation request\n(path + options)"]
    key["Cache key\n(path_hash + RulesVersion + check_alignment + parser_kind)"]
    db["SQLite WAL\n~/.cache/talkbank-chat/\ntalkbank-cache.db"]
    hit["Cache hit\n→ return stored result"]
    miss["Cache miss\n→ parse + validate + store"]

    req --> key --> db
    db -->|"found + RulesVersion match + content_hash match"| hit
    db -->|"not found, rules changed, or content edited"| miss
    miss --> db

Configuration

| Config | Value | Why |
| --- | --- | --- |
| Backend | SQLite via sqlx | Concurrent reads (WAL), atomic writes, zero-config |
| Pool size | 16 connections | Matches validation worker count |
| mmap | 256 MB | Fast random access for 95k+ entries |
| Invalidation | Rules-version field + content hash + 30-day TTL | Rule-set or schema changes auto-invalidate; content edits invalidate per-file; stale entries pruned |
| Bridge | Embedded single-threaded tokio runtime | Sync workers call rt.block_on() for async SQLite |

Schema

file_cache table (see crates/talkbank-cache/migrations/20260101000000_initial.sql):

| Column | Role |
| --- | --- |
| path_hash | BLAKE3 hash of the resolved path (part of the lookup key) |
| file_path | Resolved file path, indexed for path-based maintenance ops |
| content_hash | Hash of the file content; mismatch invalidates the entry |
| version | Cache-compatibility version (RulesVersion): the cache crate version folded together with a fingerprint of the active validation rule set. A mismatch invalidates the entry |
| cached_at | Insertion timestamp |
| check_alignment | Whether alignment validation was requested |
| is_valid | Cached validation outcome (0/1) |
| roundtrip_tested | Whether roundtrip equivalence was checked |
| roundtrip_passed | Roundtrip result when tested |
| parser_kind | Parser backend (tree-sitter or re2c) |

The lookup key is the compound unique index (path_hash, version, check_alignment, parser_kind); file_path is a secondary index used by maintenance operations (orphan pruning, etc.).
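A minimal in-memory model of this lookup, assuming a HashMap in place of SQLite and a plain u64 in place of the BLAKE3 path hash. CacheKey, CacheEntry, and InMemoryCache are hypothetical names for illustration; the crate's real types may look quite different.

```rust
use std::collections::HashMap;

/// Mirrors the compound unique index (hypothetical struct).
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct CacheKey {
    pub path_hash: u64,       // stand-in for the BLAKE3 hash of the resolved path
    pub version: String,      // RulesVersion
    pub check_alignment: bool,
    pub parser_kind: String,  // "tree-sitter" or "re2c"
}

/// The subset of columns needed to decide hit vs. miss.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CacheEntry {
    pub content_hash: u64,
    pub is_valid: bool,
}

#[derive(Default)]
pub struct InMemoryCache {
    rows: HashMap<CacheKey, CacheEntry>,
}

impl InMemoryCache {
    pub fn store(&mut self, key: CacheKey, entry: CacheEntry) {
        self.rows.insert(key, entry);
    }

    /// A hit requires both a key match and an unchanged content hash.
    /// Returns the cached verdict on a hit, None on a miss.
    pub fn lookup(&self, key: &CacheKey, current_content_hash: u64) -> Option<bool> {
        self.rows
            .get(key)
            .filter(|entry| entry.content_hash == current_content_hash)
            .map(|entry| entry.is_valid)
    }
}
```

Because version is part of the key, a query carrying a new RulesVersion simply never finds rows written under the old one; no explicit delete is needed to invalidate them.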

Database location

| Platform | Path |
| --- | --- |
| macOS | ~/Library/Caches/talkbank-chat/talkbank-cache.db |
| Linux | ~/.cache/talkbank-chat/talkbank-cache.db |
| Windows | %LocalAppData%\talkbank-chat\talkbank-cache.db |

Invalidation

  • Validation-rule changes: the version column holds a RulesVersion, which folds the talkbank-cache crate version together with a fingerprint of the active validation rule set (an FNV-1a hash over every ErrorCode the validator can emit, via talkbank_model::validation_rules_fingerprint). Adding, removing, or renaming a rule (for example introducing error code E370, “retrace marker must be followed by material”) changes the fingerprint, hence the RulesVersion, hence the lookup key, so verdicts cached under the old rule set become a cache MISS and are re-validated instead of served stale. This is the mechanism that keeps chatter validate (the authority on CHAT validity) from returning a stale “Valid” after the rules tighten. The stale rows stay on disk under their old version for selective re-testing; they are simply never served to a query carrying the new version.
  • Content changes: each entry stores the file’s content_hash; a mismatch is a per-file miss.
  • Time-based: entries older than 30 days are pruned.
  • Manual: pass --force to bypass cache lookups for a particular validation run.
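To make the rule-fingerprint idea concrete, here is a sketch of a 64-bit FNV-1a hash over a list of error codes. The 64-bit width, the sort-and-join encoding, the "+" separator in the version string, and the names rules_fingerprint and rules_version are all assumptions; the text only states that an FNV-1a hash over every ErrorCode is folded together with the crate version.

```rust
const FNV_OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// 64-bit FNV-1a: XOR each byte into the hash, then multiply by the prime.
pub fn fnv1a_64(bytes: &[u8]) -> u64 {
    bytes.iter().fold(FNV_OFFSET_BASIS, |hash, &byte| {
        (hash ^ u64::from(byte)).wrapping_mul(FNV_PRIME)
    })
}

/// Hypothetical fingerprint: hash the sorted codes joined by newlines,
/// so the result does not depend on enumeration order.
pub fn rules_fingerprint(error_codes: &[&str]) -> u64 {
    let mut sorted: Vec<&str> = error_codes.to_vec();
    sorted.sort_unstable();
    fnv1a_64(sorted.join("\n").as_bytes())
}

/// Hypothetical RulesVersion string: crate version plus rule fingerprint.
pub fn rules_version(crate_version: &str, error_codes: &[&str]) -> String {
    format!("{}+{:016x}", crate_version, rules_fingerprint(error_codes))
}
```

Adding a code such as E370 to the list changes the fingerprint and therefore the version string, which is what turns previously cached verdicts into misses.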

Per repository policy, do not delete the cache directory without explicit request. Use --force when you want fresh validation for specific paths without destroying the whole cache.

See also

  • Upstream batchalign3 documents its own audio-task cache for FA / UTR ASR / media conversion.

Alignment

Status: Current Last modified: 2026-06-15 15:00 EDT

Alignment in the toolchain operates at two structural layers, plus a separate overlap-marker pass. Tier alignment is structural (counting and pairing AST nodes); word extraction is positional (domain-ordered token indices).

| Layer | Where | Purpose |
| --- | --- | --- |
| Tier alignment | talkbank-model::alignment | 1:1 mapping between main tier and dependent tiers (%mor, %pho, %wor, %sin, %gra) |
| Word extraction | talkbank-transform::extract | Pull NLP-ready words from the AST in domain order |

Tier Alignment

Validates that dependent tiers have the correct number and arrangement of items relative to the main tier. Lives in crates/talkbank-model/src/alignment/.

TierDomain

enum TierDomain { Mor, Pho, Sin, Wor }

The same utterance produces different counts per domain:

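| Rule | Mor | Pho | Sin | Wor |
| --- | --- | --- | --- | --- |
| Skip retrace groups | Yes | No | No | No |
| Count pauses | No | Yes | No | No |
| PhoGroup | Recurse | Atomic (1) | Skip (0) | Recurse |
| SinGroup | Recurse | Skip (0) | Atomic (1) | Recurse |
| Include fragments (&+) | No | Yes | Yes | No |
| Include nonwords (&~) | No | Yes | Yes | No |
| Include fillers (&-) | No | Yes | Yes | Yes |
| Include untranscribed | No | Yes | Yes | No |
| Include tag-marker separators | Yes | No | No | No |
| ReplacedWord aligns to | Replacement | Original | Original | Original |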

For the underlying word filter (counts_for_tier, should_skip_group), the content walker, and the ChatFile model itself, see CHAT Data Model. The walker plus the domain table together govern every tier-alignment count.

Retrace handling, alignment-critical

Retraces are the most alignment-critical content type. A Retrace node wraps content the speaker said then corrected.

  • Mor: skip entirely (count 0). The retrace was a false start; only the correction carries morphological analysis.
  • Pho, Sin: recurse; the words were physically produced and have phonological / gestural data.
  • Wor: recurse; retrace ancestry does not change %wor membership.

Critical invariant: the parser must emit UtteranceContent::Retrace for all retrace patterns, including single-word retraces with replacements (word [: repl] [* err] [//]). If a retrace is accidentally emitted as a bare ReplacedWord, it counts for %mor alignment, causing false E705 errors. Enforced by tests/retrace_replaced_word_regression.rs. Full data model + parsing pipeline + CHAT examples in Retraces and Repetitions.
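The leaf-level rows of the domain table can be read as one predicate per item kind. The sketch below encodes only those rows; LeafKind and counts_in are hypothetical names, group handling (PhoGroup, SinGroup, retraces) is left to the walker, and the real filter in the model crate may be structured differently.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TierDomain {
    Mor,
    Pho,
    Sin,
    Wor,
}

/// Hypothetical flattening of the leaf item kinds named in the table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LeafKind {
    Word,
    Pause,
    Fragment,      // &+
    Nonword,       // &~
    Filler,        // &-
    Untranscribed, // xxx, yyy, www
    TagMarkerSeparator,
}

impl LeafKind {
    /// Whether a leaf of this kind occupies an alignment slot in `domain`.
    pub fn counts_in(self, domain: TierDomain) -> bool {
        match self {
            LeafKind::Word => true,
            LeafKind::Pause => domain == TierDomain::Pho,
            LeafKind::Fragment | LeafKind::Nonword | LeafKind::Untranscribed => {
                matches!(domain, TierDomain::Pho | TierDomain::Sin)
            }
            LeafKind::Filler => domain != TierDomain::Mor,
            LeafKind::TagMarkerSeparator => domain == TierDomain::Mor,
        }
    }
}
```

For an utterance whose leaves are word, filler, pause, word, tag-marker separator, untranscribed, this predicate yields 3 slots for %mor, 5 for %pho, 4 for %sin, and 3 for %wor, which shows how the same utterance produces different counts per domain.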

AlignmentPair

struct AlignmentPair {
    source_index: Option<usize>,
    target_index: Option<usize>,
}

Universal index-pair primitive. Some/Some means matched; a None on one side marks an insertion or deletion placeholder used for mismatch diagnostics. is_complete() reports that both indices are Some; is_placeholder() reports an unmatched pair.
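A small sketch of how such pairs behave under purely positional pairing. The method bodies and the pair_by_position helper are assumptions written for illustration; they are not the crate's positional_align().

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct AlignmentPair {
    pub source_index: Option<usize>,
    pub target_index: Option<usize>,
}

impl AlignmentPair {
    /// Both sides are present: a real match.
    pub fn is_complete(&self) -> bool {
        self.source_index.is_some() && self.target_index.is_some()
    }

    /// At least one side is missing: an insertion/deletion placeholder.
    pub fn is_placeholder(&self) -> bool {
        !self.is_complete()
    }
}

/// Hypothetical helper: pair items by position and pad the shorter side
/// with placeholders so a count mismatch stays visible in the result.
pub fn pair_by_position(source_len: usize, target_len: usize) -> Vec<AlignmentPair> {
    (0..source_len.max(target_len))
        .map(|i| AlignmentPair {
            source_index: (i < source_len).then_some(i),
            target_index: (i < target_len).then_some(i),
        })
        .collect()
}
```

With three main-tier items and two dependent-tier items, the first two pairs are complete and the third is a placeholder carrying only a source index.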

Per-domain results

| Type | Function | Source → Target |
| --- | --- | --- |
| MorAlignment | align_main_to_mor() | Main → %mor items |
| PhoAlignment | align_main_to_pho() | Main → %pho tokens |
| SinAlignment | align_main_to_sin() | Main → %sin tokens |
| WorAlignment | align_main_to_wor() | Main → %wor tokens |
| GraAlignment | align_mor_to_gra() | %mor chunks → %gra relations |

%gra aligns to %mor chunks, not items. Clitics create additional chunks (pro|it~v|be&PRES = 2 chunks: pre-clitic + main).

Trait abstractions

| Trait | Purpose | Implementors |
| --- | --- | --- |
| IndexPair | source()/target() on any pair type | AlignmentPair, GraAlignmentPair |
| TierAlignmentResult | pairs()/errors()/push_*() accumulator | All 5 alignment result types |
| AlignableTier | What a tier provides for generic alignment | PhoTier, SinTier, WorTier |
| TierCountable | count_tier_positions() / collect_tier_items() methods | [UtteranceContent] |

The generic positional_align() function uses AlignableTier to eliminate duplication: align_main_to_{pho,sin,wor}() are thin wrappers around it. %mor doesn’t use it (additional terminator validation logic). %gra doesn’t use it (source is MorTier, not MainTier). WorTier overrides mismatch_format() to Diff (LCS) since both sides are word sequences; the others use Positional.

%wor is not validated

%wor is a timing-annotation tier. There is no downstream positional indexing into %wor, and validate_alignments() does not check %wor word count against the main tier. Old corpus files may have xxx, fragments, or nonwords in %wor (pre-2026-04 behavior) without producing false errors.

Phon tier-to-tier alignment

A second class of alignment that operates between dependent tiers:

| Source | Target | Code |
| --- | --- | --- |
| %modsyl | %mod | E725 |
| %phosyl | %pho | E726 |
| %phoaln | %mod | E727 |
| %phoaln | %pho | E728 |

Derived-view alignments: %modsyl is a syllabified reannotation of %mod, %phosyl of %pho, %phoaln aligns both. Word counts must match between source and target. Computed in compute_alignments() after the main-tier alignments. build_tier_to_tier_alignment() constructs index pairs and emits build_count_mismatch_error() when counts disagree. %phoaln checks against both %mod and %pho, potentially emitting E727 and E728 simultaneously.

Known data issue: Phon XML source data has orthography↔IPA word count discrepancies in ~4% of files (518 / 12,340). This is expected in child phonology data. The PhonTalk converter handles it inconsistently: %mod/%pho are truncated to match orthography via OneToOne, but %xmodsyl/%xphosyl/%xphoaln are written from the raw IPATranscript, exposing the full IPA word count. The result is E725-E728 mismatches.

Parse-health gating

Alignment diagnostics honor ParseHealth metadata. If a dependent tier’s domain is parse-tainted, mismatch errors for that domain pair are suppressed. Main-tier taint blocks all main→dependent alignments. Dependent-tier taint blocks only that tier. Phon tier-to-tier checks have their own gates (can_align_modsyl_to_mod, can_align_phosyl_to_pho, can_align_phoaln).

Word Extraction

extract_words() (in crates/talkbank-transform/src/extract.rs) uses the content walker to pull words from the AST in domain-specific order. Returns Vec<ExtractedWord> with text, word_index, is_separator, special_form. Tag-marker separators (, „ ‡) are included as words in the Mor domain because they have %mor items (cm|cm, end|end, beg|beg).

Overlap Marker Iteration

CA overlap markers (⌈⌉⌊⌋) appear at three content levels: UtteranceContent (top-level), BracketedItem (inside groups), and WordContent (intra-word, as in butt⌈er⌉). Two APIs live in talkbank-model/src/alignment/helpers/overlap.rs:

walk_overlap_points, low-level

Visits every OverlapPoint in document order with word-position context. Analogous to walk_words but for overlap markers:

walk_overlap_points(&utterance.main.content.content.0, &mut |visit| {
    // visit.point: &OverlapPoint (kind + optional index)
    // visit.word_position: usize (alignable words seen so far)
});

extract_overlap_info, region-based

Pairs markers by (kind, index) into OverlapRegion structs. Each region represents a matched ⌈…⌉ or ⌊…⌋ pair. Pairing is index-aware: ⌈2...⌉2 forms a separate region from ⌈...⌉, and mismatched indices leave markers unpaired. An onset-only ⌈ (without ⌉) is a legitimate CA convention; such a region has end_at_word = None and is_well_paired() = false, but top_onset_fraction() still works.
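The pairing rule can be sketched as a single pass over the marker visits. Everything here is a simplified assumption for illustration: the RegionKind/Marker split, the begin_at_word field, the pair_regions name, and the decision to drop unmatched closing markers. Only end_at_word, is_well_paired(), and the (kind, index) pairing key come from the text.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RegionKind {
    Top,    // ⌈ … ⌉
    Bottom, // ⌊ … ⌋
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Marker {
    Begin,
    End,
}

/// One overlap marker plus the number of alignable words seen before it.
#[derive(Debug, Clone, Copy)]
pub struct OverlapPointVisit {
    pub kind: RegionKind,
    pub marker: Marker,
    pub index: Option<u8>,
    pub word_position: usize,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct OverlapRegion {
    pub kind: RegionKind,
    pub index: Option<u8>,
    pub begin_at_word: usize, // hypothetical field name
    pub end_at_word: Option<usize>,
}

impl OverlapRegion {
    pub fn is_well_paired(&self) -> bool {
        self.end_at_word.is_some()
    }
}

/// Opens a region on every Begin; an End closes the most recent still-open
/// region with the same (kind, index). Unmatched Ends are ignored here.
pub fn pair_regions(points: &[OverlapPointVisit]) -> Vec<OverlapRegion> {
    let mut regions: Vec<OverlapRegion> = Vec::new();
    for p in points {
        match p.marker {
            Marker::Begin => regions.push(OverlapRegion {
                kind: p.kind,
                index: p.index,
                begin_at_word: p.word_position,
                end_at_word: None,
            }),
            Marker::End => {
                let open = regions.iter_mut().rev().find(|r| {
                    r.kind == p.kind && r.index == p.index && r.end_at_word.is_none()
                });
                if let Some(region) = open {
                    region.end_at_word = Some(p.word_position);
                }
            }
        }
    }
    regions
}
```

An onset-only ⌈ therefore survives as a region whose end_at_word is None, rather than being discarded or forced into a pair with a marker of a different index.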

Cross-utterance, analyze_file_overlaps

For whole-file analysis, in overlap_groups.rs. 1:N matching: one top region from speaker A can match multiple bottom regions from speakers B, C, etc. Used by E347 and chatter debug overlap-audit.

Overlap validation

| Code | Level | Check |
| --- | --- | --- |
| E347 | Cross-utterance | Orphaned tops/bottoms with 1:N matching (warning) |
| E348 | Utterance | Unpaired markers within a single utterance (warning) |
| E373 | Utterance | Invalid overlap index values (must be 2-9) |
| E704 | Cross-utterance | Same speaker encoding both top and bottom (error) |

chatter debug overlap-audit <path> reports per-file statistics (groups, bottoms, orphans, temporal consistency) in TSV format. Use --database <path.jsonl> for a persistent JSON-lines database.

Design Principles

  1. No string hacking. All alignment operates on typed AST structures (Word, MorTier, AlignmentPair), never on serialized CHAT text.
  2. Domain-aware from the start. TierDomain gates traversal at the walker level. Downstream code never re-implements retrace / group skipping logic.
  3. Deterministic over approximate. Tier alignment and word extraction use deterministic, positional algorithms over the typed AST.
  4. Dense indexed structures. AlignmentPair uses Option<usize> rather than cloned data; index pairs are stored positionally, not in hash maps.
  5. Exhaustive matching. Every match on UtteranceContent (24 variants) or BracketedItem (22 variants) lists all variants explicitly. New variants are a compile error, not a silent bug.
  6. Walker as shared primitive. walk_words() removed ~330 lines of duplicated traversal boilerplate across 7 call sites.

Downstream Consumers

| Consumer | Crate | Usage |
| --- | --- | --- |
| Validation | talkbank-model | Cross-tier checks (E714/E715, E725-E728), overlap (E347/E348/E373/E704) |
| LSP hover | talkbank-lsp | Show aligned tier items for word under cursor |
| Word extraction | talkbank-transform | NLP-ready words from utterances |
| Overlap audit | chatter | chatter debug overlap-audit |
| %wor generation | talkbank-model | Build %wor tier from main tier |

Memory and Ownership

Status: Current Last updated: 2026-03-24 01:32 EDT

This chapter documents the memory management and ownership patterns used across the TalkBank Rust crates. Understanding these decisions helps contributors make consistent choices when adding new code.

String Representation Strategy

CHAT corpora contain massive repetition: the same speaker codes, language codes, POS tags, and high-frequency words appear millions of times across files. The codebase uses three string types, chosen by expected cardinality and duplication:

flowchart LR
    raw["Raw input (&amp;str)"]
    smol["SmolStr\n(inline ≤23 bytes)"]
    arc["Arc&lt;str&gt;\n(interned, deduplicated)"]
    string["String\n(owned, unique)"]

    raw -->|short, low repetition| smol
    raw -->|high repetition domain value| arc
    raw -->|ephemeral/unique| string

| Type | When to use | Examples |
| --- | --- | --- |
| SmolStr | Short tokens, low duplication | Postcode text, tier content, event labels |
| Arc<str> (interned) | High-repetition domain symbols | Speaker codes, language codes, POS tags, stems |
| String | Ephemeral or unique values | Error messages, temporary formatting |

String Interning

Location: talkbank-model/src/model/intern.rs

Five global process-local interners, each a DashMap<Arc<str>, Arc<str>> behind OnceLock<StringInterner>:

| Interner | Pre-seeded values | Typical savings |
| --- | --- | --- |
| speaker_interner() | 30+ codes (CHI, MOT, FAT, …) | High, 3-letter codes repeat per utterance |
| language_interner() | 45+ ISO 639-3 codes | Moderate, per-file |
| pos_interner() | 60+ POS tags + UD relations | Very high, every %mor word |
| stem_interner() | 200+ frequent English stems | High, function words dominate |
| participant_interner() | 14 roles (Target_Child, …) | Low, per-file |

How it works:

  • Fast path: get() on DashMap, O(1) Arc::clone if found
  • Slow path: insert() new Arc if miss, deduplicates on future access
  • Thread-safe: DashMap uses shard-level locks, no global contention
  • After initialization, reaching an interner through its OnceLock is lock-free; DashMap lookups themselves take only a shard-level read lock

Memory impact: 50-200 MB savings on large corpora (5-20% reduction). Arc::clone is O(1) atomic increment vs String::clone O(n) copy.
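The fast-path/slow-path behavior can be sketched with the standard library alone. This version deliberately swaps DashMap for a single Mutex<HashSet<Arc<str>>> so it has no external dependencies, which gives up the shard-level concurrency described above; StringInterner's methods and sketch_speaker_interner are hypothetical names, and the three seed codes are just examples.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex, OnceLock};

/// Std-only stand-in for the DashMap-backed interner (assumption).
pub struct StringInterner {
    entries: Mutex<HashSet<Arc<str>>>,
}

impl StringInterner {
    pub fn with_seed(seed: &[&str]) -> Self {
        let entries: HashSet<Arc<str>> = seed.iter().map(|s| Arc::<str>::from(*s)).collect();
        Self { entries: Mutex::new(entries) }
    }

    pub fn intern(&self, value: &str) -> Arc<str> {
        let mut entries = self.entries.lock().expect("interner mutex poisoned");
        // Fast path: already interned, so hand out another reference.
        if let Some(existing) = entries.get(value) {
            return Arc::clone(existing);
        }
        // Slow path: allocate once; later calls reuse this allocation.
        let fresh: Arc<str> = Arc::from(value);
        entries.insert(Arc::clone(&fresh));
        fresh
    }
}

/// Process-wide interner, lazily initialized on first use.
pub fn sketch_speaker_interner() -> &'static StringInterner {
    static INTERNER: OnceLock<StringInterner> = OnceLock::new();
    INTERNER.get_or_init(|| StringInterner::with_seed(&["CHI", "MOT", "FAT"]))
}
```

Two calls with the same text return Arcs that point at the same allocation, so cloning a speaker code per utterance costs a reference-count increment rather than a string copy.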

Newtype Macros

Two macros generate domain-typed string wrappers:

  • string_newtype!: wraps SmolStr. Used for generic CHAT text.
  • interned_newtype!: wraps Arc<str> with automatic interning. Used for domain symbols.

// SmolStr-backed: no interning, inline small strings
string_newtype!(PostcodeText);

// Arc<str>-backed: interned via global interner
interned_newtype!(SpeakerCode, speaker_interner);

Ownership Model

ChatFile Lifecycle

flowchart TD
    src["Source text (&amp;str)"]
    cst["tree-sitter CST\n(Tree, borrowed nodes)"]
    model["ChatFile\n(owned AST)"]
    cache["SQLite cache\n(validation result)"]
    lsp["LSP server\n(per-document state)"]
    json["JSON output\n(serde serialization)"]
    cli["CLI output\n(CHAT text)"]

    src -->|tree-sitter parse| cst
    cst -->|CST-to-model conversion| model
    model -->|validate + hash| cache
    model -->|held in backend| lsp
    model -->|to_json()| json
    model -->|to_chat_string()| cli

  • Parsing: tree-sitter Tree owns the CST. Node<'a> values borrow from Tree, zero-copy traversal. The CST-to-model conversion copies data into owned ChatFile fields (SmolStr, Arc<str>). The Tree is dropped after conversion.
  • Validation: ChatFile is borrowed (&self) during validation. Errors are streamed to an ErrorSink, no accumulation required.
  • LSP: Each open document holds an owned ChatFile in the backend. Re-parsed on every edit via tree-sitter incremental parsing.
  • CLI batch: Each file is independently parsed → validated → reported → dropped. No cross-file state except the shared cache.

Arc Usage

Arc appears in three distinct roles:

| Role | Type | Why |
| --- | --- | --- |
| String interning | Arc<str> in model types | O(1) clone for high-repetition domain values |
| Worker pool | Arc<WorkerGroup> in batchalign | RAII CheckedOutWorker::drop() needs group reference to return worker |
| Cache backend | Arc<dyn CacheBackend> in batchalign | Shared across async request handlers |

No Rc (single-threaded sharing not needed). No Cow<str> (SmolStr covers the inline-small-string use case more naturally).

Interior Mutability

| Pattern | Where | What it protects |
| --- | --- | --- |
| RefCell<Parser> inside TreeSitterParser | talkbank-parser | Tree-sitter Parser needs &mut self but isn’t Sync. Callers create a TreeSitterParser and pass &TreeSitterParser everywhere. |
| DashMap<Arc<str>, Arc<str>> | String interners | Concurrent interning during parallel parsing. Shard-level locks. |
| OnceLock<StringInterner> | 5 global interners | Lazy init, lock-free after first access |
| LazyLock<Regex> | All regex patterns workspace-wide | Compile-once, no per-call overhead |
| std::sync::Mutex<VecDeque> | batchalign worker idle queue | Held < 10 μs for push/pop only |
| tokio::sync::Mutex<HashMap> | batchalign job store | Short reads/writes, never held across .await |
| Semaphore | Worker availability (batchalign) | Async signaling without holding locks during dispatch |

Rule: std::sync::Mutex for data accessed from sync code or held briefly. tokio::sync::Mutex only when the lock must be held across .await points (which we avoid when possible). DashMap when many threads read concurrently.

Collection Choices

| Collection | Where | Why not HashMap/Vec |
| --- | --- | --- |
| BTreeMap | All test/snapshot JSON output | Deterministic key ordering for reviewable diffs |
| IndexMap | Participants, per-speaker results | Preserves encounter order (CHAT spec requires @Participants order) |
| SmallVec<[T; N]> | Headers (N=2), tiers (N=3), features (N=4), token mappings (N=4) | Inline storage for common sizes; avoids heap for typical cases |
| VecDeque | Worker idle queue (batchalign) | FIFO fair scheduling |
| Dense Vec indexed by position | Retokenize word-to-token mapping | O(1) lookup, no hashing overhead, cache-friendly |

No LinkedList, BinaryHeap, or custom allocators.

Tree-Sitter Memory Model

Tree-sitter parsing is zero-copy for CST traversal:

// Node<'a> borrows from Tree, no allocation per node
fn process_node<'a>(node: Node<'a>, source: &str) -> ParseResult<...> {
    for i in 0..node.child_count() {
        let child: Node<'a> = node.child(i).unwrap(); // Stack-only, no heap
        let text: &str = child.utf8_text(source.as_bytes())?; // Borrows source
        // ... convert to owned model types ...
    }
}

The tree-sitter parser consumes &str, produces a CST, and the Rust traversal code constructs owned model types from CST nodes.

SQLite Memory-Mapped I/O

The validation cache uses SQLite with memory-mapped I/O for fast random access:

SqliteConnectOptions::new()
    .journal_mode(SqliteJournalMode::Wal)       // Concurrent reads during writes
    .pragma("cache_size", "-8000")               // 8 MB page cache
    .pragma("mmap_size", "268435456")            // 256 MB memory-mapped region
    .synchronous(SqliteSynchronous::Normal)      // Balanced durability

This configuration handles 95,000+ cached entries efficiently. The cache database is never deleted wholesale; use --force to refresh specific paths.

Manual Drop Implementations

Three types have custom Drop for resource cleanup:

| Type | Cleanup action | Why |
| --- | --- | --- |
| AuditReporter | Joins audit writer thread and flushes output | Audit mode owns file IO in a dedicated writer thread |
| CheckedOutWorker | Returns worker to idle queue + releases semaphore permit | RAII pool resource management |
| WorkerHandle | Sends SIGTERM/SIGKILL to child process | Process must be terminated when handle drops |

All drops are acyclic: there are no ordering dependencies between them.
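The CheckedOutWorker row describes a standard RAII pattern, sketched below with the standard library only. Workers are reduced to integer IDs and the semaphore permit is omitted; the method names and field layout are assumptions, not the batchalign implementation.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

/// Pool of idle workers, represented here by bare IDs (assumption).
pub struct WorkerGroup {
    idle: Mutex<VecDeque<u32>>,
}

impl WorkerGroup {
    pub fn new(ids: impl IntoIterator<Item = u32>) -> Arc<Self> {
        Arc::new(Self { idle: Mutex::new(ids.into_iter().collect()) })
    }

    /// Takes the longest-idle worker, if any (FIFO).
    pub fn check_out(self: &Arc<Self>) -> Option<CheckedOutWorker> {
        let id = self.idle.lock().expect("idle queue poisoned").pop_front()?;
        Some(CheckedOutWorker { id, group: Arc::clone(self) })
    }

    pub fn idle_count(&self) -> usize {
        self.idle.lock().expect("idle queue poisoned").len()
    }
}

/// Guard that owns a worker and a reference back to its group.
pub struct CheckedOutWorker {
    id: u32,
    group: Arc<WorkerGroup>,
}

impl CheckedOutWorker {
    pub fn id(&self) -> u32 {
        self.id
    }
}

impl Drop for CheckedOutWorker {
    fn drop(&mut self) {
        // Return the worker even on early return or panic unwinding;
        // a poisoned lock is skipped rather than panicking inside drop.
        if let Ok(mut idle) = self.group.idle.lock() {
            idle.push_back(self.id);
        }
    }
}
```

The guard holds an Arc to the group precisely so that drop() has somewhere to return the worker, which is the reason given for Arc<WorkerGroup> in the Arc usage table.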

Allocation Optimization Patterns

Rather than using an arena allocator (bumpalo was evaluated and removed because the data lifetimes don’t fit the “allocate many, free all at once” pattern), the codebase uses targeted optimizations:

| Pattern | Where | Savings |
| --- | --- | --- |
| Scratch buffer reuse (clear + swap) | DP alignment row costs | ~50% fewer allocations in inner loop |
| Flat table (vec![...; rows * cols]) | DP small-problem fallback | 1 allocation vs rows+1 |
| Dense Vec instead of HashMap | Retokenize word mapping | O(1) lookup, no hash overhead |
| SmallVec inline storage | Throughout | Avoids heap for 1-4 element collections |
| SmolStr inline strings | All short CHAT tokens | No heap allocation for ≤23 byte strings |
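The first two rows can be illustrated on a generic dynamic program. The sketch below uses longest-common-subsequence length purely as a stand-in, since the actual DP cost function is not described here; both function names are hypothetical.

```rust
/// Flat table: one allocation of (rows * cols) cells instead of a Vec per row.
pub fn lcs_len_flat<T: PartialEq>(a: &[T], b: &[T]) -> usize {
    let cols = b.len() + 1;
    let mut table = vec![0usize; (a.len() + 1) * cols];
    for i in 1..=a.len() {
        for j in 1..=b.len() {
            table[i * cols + j] = if a[i - 1] == b[j - 1] {
                table[(i - 1) * cols + (j - 1)] + 1
            } else {
                table[(i - 1) * cols + j].max(table[i * cols + (j - 1)])
            };
        }
    }
    table[a.len() * cols + b.len()]
}

/// Scratch-buffer reuse: two row buffers allocated once and swapped each
/// iteration, instead of allocating a fresh row inside the loop.
pub fn lcs_len_two_rows<T: PartialEq>(a: &[T], b: &[T]) -> usize {
    let mut prev = vec![0usize; b.len() + 1];
    let mut curr = vec![0usize; b.len() + 1];
    for x in a {
        curr[0] = 0;
        for (j, y) in b.iter().enumerate() {
            curr[j + 1] = if x == y { prev[j] + 1 } else { prev[j + 1].max(curr[j]) };
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}
```

The flat table keeps the whole matrix available for backtracking at the cost of rows × cols cells, while the two-row variant only yields the final score; which one fits depends on whether the alignment path itself is needed.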

See also: the batchalign3 book’s Arena Allocators page for the full evaluation of where arenas do and don’t help.

Algorithms and Data Structures

Status: Current Last modified: 2026-06-15 15:00 EDT

This chapter documents the key algorithms and data structure decisions across the TalkBank Rust crates.

CHAT AST Representation

The CHAT model is a tree of owned enums. The two central types are:

  • UtteranceContent: 24 variants covering all main-tier content
  • BracketedItem: 22 variants for content inside groups/brackets

flowchart TD
    file["ChatFile"]
    header["Headers\n(@Languages, @Participants, ...)"]
    utt["Utterance"]
    mc["MainContent\nVec&lt;UtteranceContent&gt;"]
    dt["DependentTiers\n(%mor, %pho, %gra, ...)"]

    file --> header
    file --> utt
    utt --> mc
    utt --> dt

    mc --> word["Word / AnnotatedWord / ReplacedWord"]
    mc --> group["Group / PhoGroup / SinGroup / Quotation"]
    mc --> marker["Pause / Separator / OverlapPoint / ..."]
    group --> bi["BracketedContent\nVec&lt;BracketedItem&gt;"]
    bi --> word2["Word / ReplacedWord / Separator"]
    bi --> nested["Nested groups"]

Memory layout: Large variants (e.g., AnnotatedWord with scoped annotations) are Boxed to keep the enum’s stack size bounded.
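The effect of boxing a large variant can be checked directly with size_of. The types below are invented for the demonstration (a 128-byte payload standing in for an annotated word); exact sizes depend on the target platform, so the sketch only compares them.

```rust
use std::mem::size_of;

/// Hypothetical large payload: 16 × 8 bytes = 128 bytes.
pub struct Annotations {
    pub scoped: [u64; 16],
}

/// Every value of this enum is as large as its biggest variant.
pub enum UnboxedContent {
    Word(u32),
    Annotated(Annotations),
}

/// Boxing moves the payload to the heap; the enum holds only a pointer.
pub enum BoxedContent {
    Word(u32),
    Annotated(Box<Annotations>),
}

/// Returns (unboxed size, boxed size) in bytes for the current target.
pub fn content_sizes() -> (usize, usize) {
    (size_of::<UnboxedContent>(), size_of::<BoxedContent>())
}
```

Since a Vec of content items stores the enum inline, shrinking the enum shrinks every element, including the common small variants that never use the large payload.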

Content Walker

Location: talkbank-model/src/alignment/helpers/walk/

Closure-based recursive traversal centralizing the walk over all 24+22 variants:

pub fn for_each_leaf<'a>(
    content: &'a [UtteranceContent],
    domain: Option<AlignmentDomain>,
    f: &mut impl FnMut(ContentLeaf<'a>),
)

Domain-aware gating:

  • Some(Mor): skips retrace groups (retrace words aren’t morphologically analyzed)
  • Some(Pho | Sin): skips PhoGroup/SinGroup (treated as atomic by those tiers)
  • None: recurses everything unconditionally

Both immutable (for_each_leaf) and mutable (for_each_leaf_mut) versions exist. Used by talkbank-model, talkbank-transform word extraction, and other typed CHAT traversals across the workspace.
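A reduced model of the closure-based walk, assuming a three-variant content enum in place of the real 24 + 22 variants. Content, Domain, and for_each_word are hypothetical names, and only the Mor retrace gate is modeled; the PhoGroup/SinGroup gating is omitted.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Domain {
    Mor,
    Pho,
    Sin,
    Wor,
}

/// Drastically simplified content tree for illustration.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Content {
    Word(String),
    Retrace(Vec<Content>),
    Group(Vec<Content>),
}

/// Visits every word in document order, applying the domain gate once,
/// here, so callers never re-implement the skipping logic.
pub fn for_each_word<'a>(
    content: &'a [Content],
    domain: Option<Domain>,
    f: &mut impl FnMut(&'a str),
) {
    for item in content {
        // Exhaustive match: adding a variant forces this walker to be updated.
        match item {
            Content::Word(text) => f(text.as_str()),
            Content::Retrace(inner) => {
                if domain != Some(Domain::Mor) {
                    for_each_word(inner, domain, f);
                }
            }
            Content::Group(inner) => for_each_word(inner, domain, f),
        }
    }
}
```

Passing None visits retraced words as well, while Some(Domain::Mor) omits them, so a caller selects the traversal policy with one argument instead of writing its own recursion.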

Parsing Strategies

Tree-Sitter (Canonical Parser)

flowchart LR
    src["Source .cha text"]
    ts["tree-sitter C parser\n(generated from grammar.js)"]
    cst["CST (Tree)"]
    conv["Recursive descent\nover CST nodes"]
    model["ChatFile (owned AST)"]
    errors["ErrorSink\n(diagnostics)"]

    src --> ts --> cst --> conv --> model
    conv --> errors

  • Grammar defined in grammar/grammar.js (source of truth)
  • parser.c is generated; never edit it directly
  • CST-to-model conversion: recursive dispatch on node kind, skip WHITESPACES, report unrecognized nodes via ErrorSink
  • Strict + catch-all pattern: Known header values get named grammar rules (syntax highlighting); unknown values hit a catch-all (flagged by validator)

Fragment Parsing

TreeSitterParser provides fragment methods for parsing individual CHAT fragments (a word, a tier line) directly. Methods like parser.parse_word_fragment(), parser.parse_main_tier_fragment(), etc. are used when synthesizing CHAT from non-CHAT sources (ASR output, UD annotations).

Historical note: A Chumsky-based direct parser previously provided combinator-based fragment parsing. It was removed in March 2026; tree-sitter is now the sole parser.

Tier Alignment (1:1 Positional)

Location: talkbank-model/src/alignment/traits.rs

Generic positional_align() pairs main-tier words with dependent-tier items by position (O(n)). Traits: AlignableTier, TierAlignmentResult, AlignableContent.

  • %pho, %sin, %wor: use generic positional alignment
  • %mor, %gra: domain-specific custom implementations
  • Mismatch diagnostics via the similar crate (Patience diff algorithm, O(n log n))

Caching

The CHAT-core validation cache is documented separately in Validation Cache. The upstream batchalign3 project documents its own audio-task cache (FA / UTR ASR / media conversion) separately.

Text Processing

Regex Compilation

All regex patterns use LazyLock<Regex> from std::sync, compiled once at first use, lock-free thereafter. Never call Regex::new() inside functions or loops.

Deterministic Output

  • BTreeMap for all test/snapshot JSON (lexicographic key ordering)
  • IndexMap for participant/speaker ordering (preserves encounter order per spec)
  • Frequency results collected into BTreeMap<NormalizedWord, Count>
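As a small illustration of the last point, collecting counts into a BTreeMap makes iteration order depend only on the keys, not on hashing or insertion order. The word_frequencies name and the lowercase normalization are assumptions standing in for NormalizedWord.

```rust
use std::collections::BTreeMap;

/// Counts words after a simple lowercase normalization (assumption).
pub fn word_frequencies<'a>(words: impl IntoIterator<Item = &'a str>) -> BTreeMap<String, usize> {
    let mut counts = BTreeMap::new();
    for word in words {
        *counts.entry(word.to_lowercase()).or_insert(0) += 1;
    }
    counts
}
```

Serializing such a map yields keys in lexicographic order on every run, which keeps snapshot diffs limited to real changes.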

Setup

Status: Current Last modified: 2026-06-21 21:33 EDT

Development is supported on Windows, macOS, and Linux. The instructions below use Unix shell syntax; on Windows, use PowerShell or Git Bash equivalently.

Prerequisites

  • Rust (stable) via rustup (all platforms)
  • Node.js for tree-sitter grammar generation and symbol validation
  • tree-sitter CLI: cargo install tree-sitter-cli
  • just (optional but recommended) for the repo’s top-level helper recipes

Clone Repository

mkdir -p ~/talkbank && cd ~/talkbank
git clone https://github.com/TalkBank/chatter.git
cd chatter

Build

From your chatter checkout root:

cargo build --workspace --locked
cargo build --workspace --all-targets --locked

# Optional helpers from the root justfile
just build
just test
just book-install-tools
just book

Two Cargo Workspaces

The repository has two independent Cargo workspaces:

1. Root workspace (Cargo.toml)

Contains all Rust crates for parsing, model, validation, and transform:

cargo build
cargo test

2. Spec workspace (spec/Cargo.toml)

Contains two sibling crates for spec-driven artifacts. Invoke with --manifest-path relative to the chatter repo root:

cargo build --manifest-path spec/tools/Cargo.toml
cargo build --manifest-path spec/runtime-tools/Cargo.toml
cargo run --manifest-path spec/tools/Cargo.toml --bin gen_tree_sitter_tests -- --help
cargo run --manifest-path spec/runtime-tools/Cargo.toml --bin validate_error_specs -- --help

Root justfile recipes

just build        # Build the Rust workspace
just build-release
just test         # cargo test --workspace
just clippy
just fmt
just fmt-check

Verification

This repo does not currently have the old monorepo-wide make verify wrapper ported into the root checkout. Until that lands, use the concrete verification commands from the repo guidance:

cargo fmt
cargo check --workspace --all-targets
cargo build --workspace --all-targets --locked
cargo nextest run --workspace
cargo test --doc

Add grammar/spec commands when your change touches those surfaces:

cd grammar && tree-sitter generate && tree-sitter test
cargo build --manifest-path spec/tools/Cargo.toml
cargo build --manifest-path spec/runtime-tools/Cargo.toml

CI green on the pushed commit remains the authoritative pre-push gate for this repo.

Editor Setup

rust-analyzer

The workspace should work out of the box with rust-analyzer. The root Cargo.toml workspace configuration is standard.

Grammar Workflow

Status: Current Last modified: 2026-05-29 18:36 EDT

The tree-sitter grammar at grammar/grammar.js is the formal definition of the CHAT format. Changes require careful validation.

The following diagram shows the complete regeneration pipeline. Every step must pass before committing a grammar change.

flowchart TD
    edit(["Edit grammar/grammar.js"])
    generate["tree-sitter generate\n→ src/parser.c\n→ src/node-types.json"]
    grammar_test["tree-sitter test\n(corpus tests)"]
    rust_test["cargo test -p talkbank-parser\n(CST-to-model conversion)"]
    equiv["parser equivalence\n(corpus/reference/ files)"]
    spec_check{"Grammar change\naffects spec examples?"}
    test_gen["spec/tools generators\n→ grammar/test/corpus/\n→ parser-tests/tests/generated/\n→ docs/errors/"]
    commit(["Commit"])

    edit --> generate --> grammar_test --> rust_test --> equiv --> spec_check
    spec_check -->|Yes| test_gen --> commit
    spec_check -->|No| commit

Step-by-Step Procedure

1. Edit the Grammar

Modify grammar.js in the grammar/ directory. Key design principles:

  • Explicit whitespace (no extras)
  • Precedence annotations to resolve ambiguities
  • Named rules for all semantically meaningful nodes

2. Generate the Parser

cd grammar
tree-sitter generate

This produces src/parser.c and src/node-types.json. Never edit these files by hand.

3. Run Grammar Tests

tree-sitter test

Every test under grammar/test/corpus/ must pass. Tests live there and are partially auto-generated from specs (primarily via gen_tree_sitter_tests).

4. Run Parser Tests

cargo test -p talkbank-parser

This verifies the Rust parser wrapper handles all CST nodes correctly.

5. Run Parser Equivalence

cargo nextest run -p talkbank-parser-tests -E 'test(parser_equivalence)'

Every file in the reference corpus must parse correctly. Each .cha file is its own test; nextest runs them in parallel and reports individual failures.

6. Regenerate Spec Tests

If the grammar change affects any spec examples:

cargo run --manifest-path spec/tools/Cargo.toml --bin gen_tree_sitter_tests -- \
  --output-dir grammar/test/corpus \
  --template-dir spec/tools/templates

cargo run --manifest-path spec/tools/Cargo.toml --bin gen_rust_tests -- \
  --output-dir crates/talkbank-parser-tests/tests/generated

cargo run --manifest-path spec/tools/Cargo.toml --bin gen_validation_corpus -- \
  --corpus-dir crates/talkbank-parser-tests/tests/error_corpus/validation_errors

This regenerates tree-sitter corpus tests and other generated outputs that still depend on the spec pipeline.

Do this only when the grammar change actually affects generated artifacts.

7. Update node_types.rs

If new node types were added to the grammar, the generated node_types.rs in talkbank-parser needs updating. The spec tools handle this via node-types.json.

Critical Policy

The reference corpus at corpus/reference/ must pass parser equivalence at 100%. If a grammar change breaks even one file, revert immediately. The reference corpus is the ultimate arbiter of correctness.

Common Patterns

Adding a New Token

  1. Define the token in grammar.js
  2. Add handling in the Rust tier parser (match on the new node kind)
  3. Add a spec construct example
  4. Run the relevant generation and verification steps

For small, isolated syntax additions, the grammar workflow should stay local:

  • one grammar change
  • one grammar corpus example
  • one full-file fixture if needed

Changing a Rule

  1. Modify the rule in grammar.js
  2. tree-sitter generate && tree-sitter test
  3. Update Rust parser if CST node structure changed
  4. Update spec examples if the expected CST changed
  5. Run the current local verification sweep from contributing/dev-checks.md

Spec Workflow

Status: Current Last modified: 2026-05-29 17:50 EDT

Specifications in spec/ are the source of truth for CHAT format intent, grammar examples, and validation/error contracts.

Adding a Construct Spec

Construct specs define valid CHAT patterns with expected parse trees.

1. Create the Spec File

Create a new markdown file in the appropriate spec/constructs/ subdirectory:

spec/constructs/
├── header/         # Header-related constructs
├── main_tier/      # Main tier patterns
├── tiers/          # Dependent tier patterns
├── utterance/      # Utterance-level patterns
└── word/           # Word syntax patterns

2. Write the Spec

# my_example

Description of what this example demonstrates.

## Input

\```utterance
*CHI:	hello world .
\```

## Expected CST

\```cst
(utterance
  (main_tier
    ...))
\```

## Metadata

- **Level**: utterance
- **Category**: main_tier

The code fence label (e.g., utterance, mor_dependent_tier) selects which template wraps the input into a full CHAT file.

3. Generate the CST

Parse your input with tree-sitter to get the actual CST, then copy it as the Expected CST (stripping positions and field names).
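
The stripping step can be sketched in a few lines of Rust. This is a minimal illustration, not part of the spec tools: the helper names strip_cst_line and strip_cst are hypothetical, and the sketch assumes the tree-sitter CLI prints one node per line, with positions in the form [row, col] - [row, col] and field names as a name: prefix before the opening parenthesis.

```rust
/// Remove a leading `field_name: ` prefix and the ` [r, c] - [r, c]` span
/// from one line of parse output. Hypothetical helper, for illustration only.
fn strip_cst_line(line: &str) -> String {
    let indent_len = line.len() - line.trim_start().len();
    let (indent, mut rest) = line.split_at(indent_len);

    // Field names look like `speaker: (node ...`; keep only the node part.
    if let Some(idx) = rest.find(": (") {
        let name = &rest[..idx];
        let is_ident = !name.is_empty()
            && name.chars().all(|c| c.is_ascii_alphanumeric() || c == '_');
        if is_ident {
            rest = &rest[idx + 2..];
        }
    }

    // Positions look like ` [0, 0] - [0, 20]`; cut from " [" through the second ']'.
    let mut out = String::from(indent);
    if let Some(start) = rest.find(" [") {
        let tail = &rest[start..];
        if let Some((end, _)) = tail.match_indices(']').nth(1) {
            out.push_str(&rest[..start]);
            out.push_str(&tail[end + 1..]);
            return out;
        }
    }
    out.push_str(rest);
    out
}

/// Apply the per-line cleanup to a whole CST dump, preserving indentation.
fn strip_cst(raw: &str) -> String {
    raw.lines().map(strip_cst_line).collect::<Vec<_>>().join("\n")
}
```

Anonymous nodes whose text contains brackets or colons are not handled here, so review the result by eye before pasting it into the spec.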

4. Regenerate The Affected Generated Artifacts

The predecessor monorepo wrapped this step as make test-gen. That root wrapper is not yet ported into this repo, so follow spec/CLAUDE.md and run only the generator command(s) relevant to the artifacts you intentionally changed.

For isolated grammar additions, keep the change small:

  1. Add or adjust one grammar example.
  2. Add one full-file fixture if the change matters in context.
  3. Regenerate only the artifacts that truly changed.

Adding an Error Spec

Error specs define invalid CHAT patterns with expected error codes.

1. Create the Spec File

Error specs live in spec/errors/, named by error code. The convention is E###_auto.md (or E###_<short-slug>.md); for example spec/errors/E301_auto.md covers “Empty speaker code”.

2. Write the Spec

The actual on-disk format (per spec/errors/E301_auto.md) uses bolded metadata keys; there is no Name field and severity is implicit in the error-code numbering:

# E301: Empty speaker code

## Description

Empty speaker code

## Metadata

- **Error Code**: E301
- **Category**: Main tier validation
- **Level**: utterance
- **Layer**: parser

## Example 1

**Source**: `E3xx_main_tier_errors/E301_empty_speaker.cha`
**Trigger**: Main tier with * but no speaker code
**Expected Error Codes**: E301

\```chat
@UTF8
@Begin
@Languages:	eng
@Participants:	CHI Child
...
\```
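
As an illustration of how the bolded metadata keys above can be read mechanically, here is a minimal Rust sketch. The function name parse_metadata is hypothetical, and the real spec tools may parse these files differently; the sketch only assumes the bullet shape shown in the example (a dash, a bolded key, a colon, then the value).

```rust
use std::collections::BTreeMap;

/// Collect `- **Key**: value` bullets into a map.
/// Hypothetical helper; lines that do not match the shape are skipped.
fn parse_metadata(markdown: &str) -> BTreeMap<String, String> {
    let mut fields = BTreeMap::new();
    for line in markdown.lines() {
        let Some(rest) = line.trim().strip_prefix("- **") else {
            continue;
        };
        let Some((key, value)) = rest.split_once("**:") else {
            continue;
        };
        fields.insert(key.trim().to_string(), value.trim().to_string());
    }
    fields
}
```

Applied to the E301 example, the map would hold Error Code, Category, Level, and Layer.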

Key Metadata Fields

  • Layer: parser: the error is caught during parser.parse_chat_file() (file fails to parse)
  • Layer: validation: the error is caught by validate_with_alignment() after successful parse
  • Status: not_implemented: generates #[ignore] tests (validation logic not yet coded)
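
The three fields above amount to a small decision table. The following Rust sketch models it; the types Layer and TestPlan and the function plan_test are hypothetical names chosen for illustration and are not the generators' actual data model.

```rust
/// Which stage is expected to catch the error (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Layer {
    Parser,
    Validation,
}

/// What a generated test should assert (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
struct TestPlan {
    /// Parser-layer specs expect the parse to fail; validation-layer specs expect it to succeed.
    expect_parse_success: bool,
    /// The error code that must be reported.
    expected_code: String,
    /// `Status: not_implemented` turns the test into an ignored one.
    ignored: bool,
}

fn plan_test(layer: Layer, code: &str, status: Option<&str>) -> TestPlan {
    TestPlan {
        expect_parse_success: layer == Layer::Validation,
        expected_code: code.to_string(),
        ignored: status == Some("not_implemented"),
    }
}
```

Writing the rule down this way makes the common "wrong layer" mistake visible: a spec marked Layer: parser whose input actually parses can never satisfy its own plan.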

3. Regenerate The Affected Artifacts

Regenerate the affected artifacts with the current spec-tool commands from spec/CLAUDE.md, then run the concrete verification commands from Setup / Developer Verification Checks.

Updating the Symbol Registry

The symbol registry at spec/symbols/symbol_registry.json defines character sets used by the grammar and Rust crates.

flowchart TD
    registry["Edit spec/symbols/\nsymbol_registry.json"]
    validate["validate_symbol_registry.js\n(structure check)"]
    gen_grammar["Generate grammar symbols\n(for tree-sitter)"]
    gen_rust["generate_rust_symbol_sets.js\n→ talkbank-model/src/generated/symbol_sets.rs\n→ spec/tools/src/generated/symbol_sets.rs"]
    fmt["rustfmt\n(format generated code)"]
    verify["Run current symbol generators\nthen local verification sweep"]

    registry --> validate --> gen_grammar & gen_rust
    gen_rust --> fmt --> verify
    gen_grammar --> verify

After editing, run the current symbol-generation commands from spec/CLAUDE.md, then regenerate any dependent grammar/tests/docs outputs if the symbol change affects them.

Common Mistakes

  • Editing generated files: never edit grammar/test/corpus/ or crates/talkbank-parser-tests/tests/generated/ by hand
  • Regenerating reflexively: use regeneration when generated artifacts changed, not as a substitute for thinking about what kind of test authority the change really needs
  • Wrong layer: parser-layer specs expect parse failure; validation-layer specs expect parse success + error report

Testing

Status: Current Last modified: 2026-06-15 15:00 EDT

Test Generation Pipeline

Specs are the source of truth. All grammar corpus tests, Rust parser tests, and error docs are generated from specs. This repo does not currently have the old monorepo-wide make test-gen wrapper; run the relevant spec/tools binaries directly instead, and never hand-edit generated files.

flowchart LR
    subgraph sources["Source of Truth"]
        constructs["spec/constructs/\n(construct specs, see directory listing)"]
        errors["spec/errors/\n(error specs, see directory listing)"]
        templates["spec/tools/templates/\n(Tera wrappers)"]
    end

    subgraph generators["spec/tools generators\n(run only what changed)"]
        gen_ts["gen_tree_sitter_tests"]
        gen_rust["gen_rust_tests"]
        gen_validation["gen_validation_corpus"]
        gen_docs["gen_error_docs"]
    end

    subgraph outputs["Generated Outputs (DO NOT EDIT)"]
        ts_tests["grammar/test/corpus/\n(tree-sitter tests)"]
        rust_tests["crates/talkbank-parser-tests/tests/generated/\n(Rust tests)"]
        val_corpus["crates/talkbank-parser-tests/tests/error_corpus/validation_errors/\n(.cha fixtures + manifest.json)"]
        error_docs["docs/errors/\n(local generated error pages)"]
    end

    constructs & errors --> gen_ts
    templates --> gen_ts
    constructs & errors --> gen_rust
    errors --> gen_validation
    errors --> gen_docs

    gen_ts --> ts_tests
    gen_rust --> rust_tests
    gen_validation --> val_corpus
    gen_docs --> error_docs

To add a grammar test or error test, add a spec file in spec/constructs/ or spec/errors/, then run the current generator command(s) from Spec Workflow. Use only the binaries that match the artifacts you intentionally changed.

Test Strategy

Testing is organized in layers, from fastest to most comprehensive.

flowchart TD
    unit["Unit + Integration Tests\n(cargo nextest run)"]
    specgen["Spec-Generated Tests\n(spec/tools generators)\nParser + validation layer"]
    grammar["Grammar Corpus\n(tree-sitter test)"]
    ref["Reference Corpus\n(corpus/reference/, 100% required)"]
    gates["Local verification sweep + CI\n(dev-checks.md / quality-gates.md)"]

    unit --> specgen --> grammar --> ref --> gates

Never-Regress Gates

Four gates form the regression contract for the CHAT core. They guard the behavior a successor cannot easily re-derive: parser correctness, lossless serialization, full-corpus coverage, and error detection. Any commit touching the relevant surface (grammar, parser, model, validation, serialization, or alignment) MUST run the matching gate(s) and keep them green. A red gate is a bug until proven otherwise (see the repo CLAUDE.md, “Test Failures Are Bugs Until Proven Otherwise”), never a test expectation to quietly update.

| Gate | Command | What it protects |
| --- | --- | --- |
| Parser equivalence | cargo nextest run -p talkbank-parser-tests -E 'test(parser_equivalence)' | The re2c oracle parser and the tree-sitter parser agree on every reference file. A divergence means one parser is wrong, or a construct spec is missing. |
| Roundtrip idempotency | cargo nextest run -p talkbank-parser-tests --test roundtrip_reference_corpus | parse, serialize, re-parse yields a semantically identical AST (SemanticEq) for every reference file. Catches any model or WriteChat change that silently loses information. |
| Reference corpus 100% | (the same roundtrip_reference_corpus test) | Every file under corpus/reference/ parses and roundtrips with zero failures. The reference corpus is the ultimate arbiter of full-file correctness; it must be 100%, never “mostly”. |
| Error-code spec tests | cargo nextest run -p talkbank-parser-tests --test generated_tests --test validation_error_corpus --test error_coverage | Every error spec under spec/errors/ still fires its expected code: parser-layer errors reject as designed, validation-layer errors are detected, and every ErrorCode has a backing spec. These tests are generated from specs, never hand-written. |

Two of the four share one test: roundtrip_reference_corpus enforces both roundtrip idempotency and the reference-corpus 100% guarantee, because it iterates every reference file (the coverage guarantee) and checks roundtrip semantic equality on each (the idempotency guarantee).

All four also run as part of the full workspace sweep (cargo nextest run --workspace), so a complete local run before committing covers them. The per-gate commands above are the fast, targeted way to re-check one surface during the inner development loop. The sections below describe each layer in more detail.
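
To make the roundtrip property concrete, here is a toy Rust sketch. It is not the talkbank model or its SemanticEq implementation: the Utterance type and the parse, serialize, and roundtrips functions are hypothetical and cover only a single speaker line. What it shows is the shape of the check (parse, serialize, re-parse, compare the parsed values rather than the bytes).

```rust
/// Toy stand-in for a parsed main tier line (hypothetical, not the real model).
#[derive(Debug, Clone, PartialEq, Eq)]
struct Utterance {
    speaker: String,
    words: Vec<String>,
}

/// Parse `*SPEAKER:<whitespace>words...`; returns None for anything else.
fn parse(line: &str) -> Option<Utterance> {
    let rest = line.strip_prefix('*')?;
    let (speaker, body) = rest.split_once(':')?;
    if speaker.is_empty() {
        return None;
    }
    Some(Utterance {
        speaker: speaker.to_string(),
        words: body.split_whitespace().map(str::to_string).collect(),
    })
}

/// Serialize in one canonical layout: a tab after the colon, single spaces between words.
fn serialize(u: &Utterance) -> String {
    format!("*{}:\t{}", u.speaker, u.words.join(" "))
}

/// The roundtrip check: parse, serialize, re-parse, compare parsed values.
fn roundtrips(line: &str) -> bool {
    match parse(line) {
        Some(first) => parse(&serialize(&first)).as_ref() == Some(&first),
        None => false,
    }
}
```

Note that the comparison is between parsed values, so input with irregular spacing still roundtrips even though the serialized bytes differ from the original line.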

Unit Tests (nextest)

cargo nextest run

Runs all unit and integration tests across all crates (more than 2300 tests). These test individual functions, serialization roundtrips, and model invariants.

cargo nextest does not run doctests. Keep cargo test --doc as a separate verification step when you change public API examples or doc comments.

Parser Equivalence

cargo nextest run -p talkbank-parser-tests -E 'test(parser_equivalence)'

Runs the parser on each file in the corpus/reference/ tree and validates results. Each .cha file is its own test, enabling per-file parallelism and failure isolation via nextest. The exact file count is whatever find corpus/reference -name '*.cha' -type f | wc -l reports; do not hard-code it here.
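
The same discovery step can be sketched in Rust with only the standard library. This is an illustration of walking a directory tree for .cha files, not the test harness's actual discovery code; the function name collect_cha_files is hypothetical.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Recursively collect every `.cha` file under `root`, sorted for stable output.
/// Hypothetical helper; `is_dir` follows symlinks, so symlink loops are not handled.
fn collect_cha_files(root: &Path) -> io::Result<Vec<PathBuf>> {
    let mut found = Vec::new();
    let mut pending = vec![root.to_path_buf()];
    while let Some(dir) = pending.pop() {
        for entry in fs::read_dir(&dir)? {
            let path = entry?.path();
            if path.is_dir() {
                pending.push(path);
            } else if path.extension().is_some_and(|ext| ext == "cha") {
                found.push(path);
            }
        }
    }
    found.sort();
    Ok(found)
}
```

Calling collect_cha_files(Path::new("corpus/reference")) from the repository root and taking the length of the result should agree with the find command above.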

Spec-Generated Tests

Part of talkbank-parser-tests. These are generated from specs via the current spec/tools binaries and currently test:

  • Construct specs: input parses correctly
  • Parser-layer error specs: input fails to parse with expected error code
  • Validation-layer error specs: input parses but validation reports expected error code

Common entrypoints from the repository root:

cargo run --manifest-path spec/tools/Cargo.toml --bin gen_tree_sitter_tests -- \
  --output-dir grammar/test/corpus \
  --template-dir spec/tools/templates

cargo run --manifest-path spec/tools/Cargo.toml --bin gen_rust_tests -- \
  --output-dir crates/talkbank-parser-tests/tests/generated

cargo run --manifest-path spec/tools/Cargo.toml --bin gen_validation_corpus -- \
  --corpus-dir crates/talkbank-parser-tests/tests/error_corpus/validation_errors

Tree-Sitter Grammar Tests

cd grammar && tree-sitter test

Runs the tree-sitter grammar corpus tests. This is the right gate for grammar structure changes.

Error Corpus Tests

Error fixtures live in spec/errors/. Parser-layer error examples become Rust tests via gen_rust_tests; validation-layer examples become a .cha fixture corpus + manifest.json via gen_validation_corpus, under crates/talkbank-parser-tests/tests/error_corpus/validation_errors/, which the data-driven runner validation_error_corpus.rs consumes. Add a new error spec under spec/errors/E###_*.md and regenerate.

Tree-Sitter Tests

cd grammar
tree-sitter test

Verifies the grammar produces correct CSTs for known inputs. The actual test count comes from ls grammar/test/corpus/*.txt | wc -l; do not hard-code it.

Reference Corpus

The reference corpus at corpus/reference/ is organized into subdirs (annotation/, audio/, ca/, content/, core/, edge-cases/, languages/, tiers/, word-features/). The parser must handle every file at 100%; the exact file count is whatever find corpus/reference -name '*.cha' -type f | wc -l reports.

This corpus is the ultimate arbiter of correctness for full-file parsing.

Local Verification Contract

There is no repo-local make verify wrapper in this checkout today. Use the explicit command set from Developer Verification Checks and Testing and Quality Gates instead.

Core local sweep:

cargo fmt --all -- --check
cargo build --workspace --all-targets --locked
cargo nextest run --workspace
cargo test --doc

Then add the surface-specific checks that match your change:

  • grammar changes: cd grammar && tree-sitter generate && tree-sitter test
  • spec-tool changes: cargo build --manifest-path spec/tools/Cargo.toml and cargo build --manifest-path spec/runtime-tools/Cargo.toml
  • parser / model / alignment / serialization changes: cargo nextest run -p talkbank-parser-tests -E 'test(parser_equivalence)' and cargo nextest run -p talkbank-parser-tests --test roundtrip_reference_corpus

Running Specific Tests

# Single test by name
cargo nextest run test_name

# Tests in a specific crate
cargo nextest run -p talkbank-model

# Tests matching a pattern
cargo nextest run -- mor

# With output
cargo nextest run --no-capture

What to Run When

| What you changed | Run |
| --- | --- |
| Grammar (grammar.js) | cd grammar && tree-sitter generate && tree-sitter test, then the relevant parser/spec-generator commands |
| Parser (CST-to-model) | cargo nextest run -p talkbank-parser |
| Model (types, validation, alignment) | cargo nextest run -p talkbank-model |
| CLI (chatter args, dispatch) | cargo nextest run -p chatter |
| LSP | cargo nextest run -p talkbank-lsp |
| Spec files | Run the relevant gen_* commands from spec/tools, then the local verification sweep from dev-checks.md |
| Pre-merge (any change) | The local verification sweep from dev-checks.md plus surface-specific additions |
| Pre-push (quick) | Re-run the narrowest commands that cover the surfaces you touched; there is no repo-local make ci-local wrapper |

Mutation Testing

Use cargo-mutants to find code that can be changed without any test failing. These surviving mutations are the true coverage gaps.

# Install (once)
cargo install cargo-mutants

# Run against a specific crate (--jobs 1 to avoid OOM on 64 GB machines)
cargo mutants -p talkbank-parser --timeout 120 --jobs 1

# Review results
cat mutants.out/missed.txt    # Mutations no test caught
cat mutants.out/caught.txt    # Mutations properly detected

Mutation testing is not part of CI but should be run periodically (after major changes) to find untested logic paths. Results guide where to add new tests.

Configuration: mutants.toml at the repo root excludes trivial functions.

Adding Tests

  • Model tests: add to the relevant crate’s tests/ directory or #[cfg(test)] module
  • Parser tests: if the change is about grammar shape or validation contracts, add or update specs and regenerate with the relevant spec/tools generator binaries
  • Error tests: add a new spec under spec/errors/E###_*.md and run gen_rust_tests (parser-layer) or gen_validation_corpus (validation-layer); the generated Rust tests / fixture corpus + manifest are produced automatically

Coding Standards

Status: Current Last updated: 2026-03-24 00:01 EDT

Rust Conventions

  • Edition: 2024
  • Formatting: cargo fmt before every commit
  • Linting: cargo clippy --all-targets -- -D warnings must pass with zero warnings
  • No clippy silencing without explicit approval

Error Handling

  • No panics for recoverable conditions; use thiserror/miette for error types
  • Library code uses the ErrorSink trait for error reporting, not Result
  • Use ParseOutcome<T> in parser code (parsed or rejected)
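
The sink-plus-outcome pattern named above can be sketched as follows. The real ErrorSink trait and ParseOutcome type live in the talkbank crates and are not reproduced in this chapter, so every definition below (the trait method report, the VecSink collector, the Parsed and Rejected variants, and parse_speaker) is a hypothetical stand-in that only illustrates the idea: diagnostics go to a sink, and the return value says whether a value was produced.

```rust
/// Hypothetical stand-in for the real `ErrorSink` trait.
trait ErrorSink {
    fn report(&mut self, message: String);
}

/// Simple in-memory sink that collects diagnostics (hypothetical).
#[derive(Default)]
struct VecSink {
    messages: Vec<String>,
}

impl ErrorSink for VecSink {
    fn report(&mut self, message: String) {
        self.messages.push(message);
    }
}

/// Hypothetical stand-in for `ParseOutcome<T>`: parsed or rejected.
#[derive(Debug, PartialEq)]
enum ParseOutcome<T> {
    Parsed(T),
    Rejected,
}

/// Extract the speaker code from a main tier line, reporting problems to the sink
/// instead of panicking or returning a bare Option.
fn parse_speaker(line: &str, sink: &mut dyn ErrorSink) -> ParseOutcome<String> {
    let code = line
        .strip_prefix('*')
        .and_then(|rest| rest.split_once(':'))
        .map(|(code, _)| code);
    match code {
        Some(code) if !code.is_empty() => ParseOutcome::Parsed(code.to_string()),
        Some(_) => {
            sink.report("E301: Empty speaker code".to_string());
            ParseOutcome::Rejected
        }
        None => {
            sink.report("not a main tier line".to_string());
            ParseOutcome::Rejected
        }
    }
}
```

The caller decides what to do with the collected messages, which keeps library code free of printing and lets one parse report several problems.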

Logging

  • Library crates use tracing (never println! or eprintln!)
  • CLI binaries write to stdout (results) and stderr (diagnostics)
  • Use appropriate log levels: error!, warn!, info!, debug!, trace!

Naming

  • Follow standard Rust conventions (snake_case for functions, CamelCase for types)
  • Conventional Commits for commit messages: <type>[scope]: <description>
    • Types: feat, fix, refactor, test, docs, chore
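
A commit subject can be checked against this convention with a short function. The sketch below is illustrative only: is_conventional is a hypothetical name, the repo may enforce the rule differently or not at all, and it assumes the optional scope is written in parentheses, as in the Conventional Commits specification. Breaking-change markers are not handled.

```rust
/// Commit types listed in the standard above.
const TYPES: [&str; 6] = ["feat", "fix", "refactor", "test", "docs", "chore"];

/// Return true when `subject` has the shape `type: description` or
/// `type(scope): description`. Hypothetical helper for illustration.
fn is_conventional(subject: &str) -> bool {
    let Some((head, description)) = subject.split_once(": ") else {
        return false;
    };
    if description.trim().is_empty() {
        return false;
    }
    // Split an optional `(scope)` off the type.
    let (kind, scope) = match head.split_once('(') {
        Some((kind, rest)) => match rest.strip_suffix(')') {
            Some(scope) => (kind, Some(scope)),
            None => return false,
        },
        None => (head, None),
    };
    TYPES.contains(&kind) && scope.map_or(true, |s| !s.is_empty())
}
```

For example, "feat(parser): handle empty speaker code" passes, while "update stuff" does not.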

Dependencies

Preferred crates:

  • clap: CLI argument parsing
  • serde: serialization
  • miette: user-facing diagnostics
  • insta: snapshot testing
  • tracing: structured logging
  • rayon / crossbeam: concurrency
  • smallvec: small-buffer optimization

Code Organization

  • Keep crate boundaries clean; lower crates should not depend on higher ones
  • The model crate should not depend on any parser
  • Parsing code should not depend on serialization/transform code
  • All CHAT parsing and serialization goes through the AST, never ad-hoc string manipulation
  • Treat 10 or more named struct fields as an audit trigger. Wide boundary or report records can be acceptable, but wide runtime state bags need explicit review. See architecture/chat-model/wide-structs.md.

Testing

  • Prefer spec-driven tests over hand-written tests for parser behavior
  • Use cargo nextest run for unit tests (except doctests)
  • Snapshot tests with insta for complex output comparisons

Generated Files

Never hand-edit generated artifacts:

  • parser.c: generated from grammar.js
  • grammar/test/corpus/: generated from specs
  • crates/talkbank-parser-tests/tests/generated/: generated from specs
  • crates/talkbank-model/src/generated/symbol_sets.rs: generated from symbol registry

Always regenerate from source inputs.

Coding Standards and Engineering Practices

Status: Current Last updated: 2026-05-21 08:38 EDT

Objective

Set enforceable, language-specific standards that reduce ambiguity and improve long-term maintainability.

Global Standards

  1. Prefer explicit domain types over ad-hoc strings.
  2. Keep parsing, validation, and rendering logic separated.
  3. Eliminate magic numbers/strings/paths via named constants and config.
  4. Treat generated code as immutable artifacts.
  5. Require tests for every bugfix and behavior change.

Rust Standards

  • Enforce formatter and clippy in CI.
  • Minimize #[allow(clippy::...)]; each allowance needs rationale.
  • Prefer small focused modules with clear ownership.
  • Public APIs require doc comments with examples and error behavior.
  • In parser code, disallow ErrorSink + Option<T> signatures for fallible parse operations.
    • Use explicit outcome enums or Result with structured diagnostics.
    • Guardrail script: scripts/check-errorsink-option-signatures.sh.
  • For model enums that encode validation state, require ValidationTagged derive.
    • Explicit annotation: #[validation_tag(error|warning|clean)].
    • Naming-convention fallback (per crates/talkbank-derive/src/validation_tagged.rs:118-123): variants ending in Error → Error; variants ending in Warning or Unsupported, plus a variant named exactly Unsupported, → Warning; otherwise → Clean.
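
The naming-convention fallback can be written out as a small function. This sketch reads the rule as: a name ending in Error maps to Error, a name ending in Warning or Unsupported maps to Warning, and anything else maps to Clean. It is not the derive macro's code; ValidationTag and fallback_tag are hypothetical names, and the variant names in the usage note are invented examples.

```rust
/// Hypothetical result type for the fallback classification.
#[derive(Debug, PartialEq, Eq)]
enum ValidationTag {
    Error,
    Warning,
    Clean,
}

/// Classify a variant by its name when no explicit annotation is present.
/// A variant named exactly `Unsupported` also ends in `Unsupported`,
/// so the suffix test covers it.
fn fallback_tag(variant: &str) -> ValidationTag {
    if variant.ends_with("Error") {
        ValidationTag::Error
    } else if variant.ends_with("Warning") || variant.ends_with("Unsupported") {
        ValidationTag::Warning
    } else {
        ValidationTag::Clean
    }
}
```

Under this reading, a variant called ParseError is tagged Error, DeprecatedWarning and Unsupported are tagged Warning, and Valid is tagged Clean. An explicit annotation is still the clearer choice whenever the name does not make the intent obvious.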

Grammar Standards

  • Grammar rules must map to documented token/category semantics.
  • No duplicated symbol sets in free-form literals.
  • Every non-obvious precedence/conflict decision must include rationale.

Spec and Generator Standards

  • Spec files must follow strict metadata template.
  • Generators must be deterministic and pure with respect to inputs.
  • No hardcoded user-specific paths in docs or generated outputs.

Magic Value Policy

Disallowed

  • Inline path literals tied to local machines.
  • Unnamed numeric constants encoding protocol behavior.
  • Repeated header/tier string literals across modules.

Required

  • Central constants/modules:
    • path defaults,
    • tier/header prefixes,
    • token categories,
    • formatting policies.

Review and PR Standards

  • PR template must include:
    • subsystem touched,
    • contract impact,
    • generated artifact impact,
    • tests added/updated,
    • docs updated.
  • Require at least one reviewer with subsystem ownership for core modules.

Internal Decision Records

Adopt short ADR format in the book’s architecture section:

  • context,
  • decision,
  • alternatives considered,
  • consequences,
  • rollback path.

Acceptance Criteria

  • Coding standards are documented once and enforced automatically.
  • Magic values are systematically reduced and tracked.
  • Every behavior change includes tests and doc impact assessment.
  • Architecture decisions are recorded and discoverable.

CI and Release

Status: Current Last updated: 2026-06-21 21:33 EDT

Pre-Merge Verification

Use the concrete local verification commands from Setup and Developer Verification Checks:

cargo fmt --all -- --check
cargo build --workspace --all-targets --locked
cargo nextest run --workspace
cargo test --doc

Then rely on GitHub Actions CI as the authoritative shared signal before you announce a change as ready.

Generated artifact drift

Generated artifacts are still important, but the old root wrappers from the predecessor workspace are not yet ported into this repo. In practice:

  • regenerate only the affected spec/symbol outputs,
  • do not hand-edit generated artifacts,
  • and run the surface-specific verification commands that match the change.

See Spec Workflow and spec/CLAUDE.md for the current source-of-truth guidance.

Release Process

TalkBank/chatter is the public release source of truth: release.yml (cargo-dist) publishes the signed GitHub Releases for the CLI and the desktop app.

Workflows that actually exist in this repo

| Workflow | Purpose | Notes |
| --- | --- | --- |
| .github/workflows/ci.yml | Main build/test/book CI | Primary shared signal on pushes and PRs |
| .github/workflows/cross-platform.yml | Cross-platform build coverage | Supplements the main CI workflow |
| .github/workflows/crates-io-foundation.yml | First-wave crates.io readiness | Checks foundation-crate metadata, package surfaces, hold-backs, and publish order |
| .github/workflows/release.yml | cargo-dist release automation | Builds dist-enabled workspace artifacts from version tags; owns the GitHub Release |
| .github/workflows/release-desktop.yml | Desktop installer release automation | Builds chatter-desktop installers on the same version tags and uploads them into the release that release.yml creates; workflow_dispatch runs build-only |
| .github/workflows/clippy-rolling.yml | New-stable clippy drift detection | Weekly maintenance workflow |

Current release stance

  • release.yml is about workspace artifact packaging via cargo-dist, not about crates.io publication.
  • The first-wave crates.io path is documented separately in Crates.io Publication and is checked by just crates-io-foundation-check plus .github/workflows/crates-io-foundation.yml.

Desktop release workflow: how the two tag workflows compose

On a version tag, release.yml (cargo-dist) and release-desktop.yml run in parallel. cargo-dist owns creating the GitHub Release and attaching the CLI archives, checksums, and installer scripts; release-desktop.yml builds the Tauri installers, then polls until the release exists and uploads its installers into it. Two platform notes baked into the workflow:

  • macOS: Tauri signs, notarizes, and staples the .app, but NOT the .dmg it wraps around it. The workflow therefore submits the .dmg itself to the notary service and staples it, then verifies codesign, spctl, and stapler validate on both artifacts. The signing identity is supplied via environment, never hardcoded in tauri.conf.json.
  • Windows / Linux: artifacts are currently unsigned by decision; see docs/strategy/distribution-and-signing.md (“Decisions, 2026-06-12”) and the SmartScreen guidance in the install docs.
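
The "poll until the release exists" step is a bounded retry loop. The actual logic lives in the workflow file and is not reproduced here; the Rust sketch below only illustrates the pattern, with poll_until, max_attempts, and delay as hypothetical names and the release check abstracted into a closure.

```rust
use std::thread;
use std::time::Duration;

/// Call `check` up to `max_attempts` times, sleeping `delay` between attempts.
/// Returns the 1-based attempt on which `check` first succeeded, or None on timeout.
fn poll_until(
    max_attempts: u32,
    delay: Duration,
    mut check: impl FnMut() -> bool,
) -> Option<u32> {
    for attempt in 1..=max_attempts {
        if check() {
            return Some(attempt);
        }
        // No point sleeping after the final failed attempt.
        if attempt < max_attempts {
            thread::sleep(delay);
        }
    }
    None
}
```

Bounding the attempts matters because the two tag workflows run in parallel: if the cargo-dist job fails and never creates the release, the desktop job should time out with a clear failure rather than wait forever.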

Release secrets (Actions secrets on this repository)

Required by the macOS jobs of release-desktop.yml (and by cargo-dist macOS codesigning if macos-sign is enabled, which uses the separate CODESIGN_* names documented in the strategy doc):

| Secret | Content |
| --- | --- |
| APPLE_CERTIFICATE | base64-encoded Developer ID Application .p12 |
| APPLE_CERTIFICATE_PASSWORD | password for the .p12 |
| APPLE_SIGNING_IDENTITY | full identity string, Developer ID Application: <Name> (<TEAMID>) |
| APPLE_API_KEY | App Store Connect API key ID (notarization) |
| APPLE_API_ISSUER | App Store Connect issuer ID |
| APPLE_API_KEY_CONTENT | contents of the AuthKey_*.p8 file |

Rotation: replacing the certificate or notary key means updating these secrets and nothing else; no workflow edits are needed. A maintainer must re-create all of them on any new repository (secrets do not transfer).

Crates.io Publication

Status: Current Last updated: 2026-06-21 21:33 EDT

Scope

The crates.io automation in this repo currently targets the Wave 1A foundation crates only. crates.io publication is a deliberate maintainer action, not a tag-triggered release path.

Wave 1A is:

  1. tree-sitter-talkbank
  2. talkbank-derive
  3. talkbank-model
  4. talkbank-cache
  5. talkbank-parser
  6. talkbank-parser-re2c
  7. talkbank-transform

talkbank-parser-re2c is part of the first wave because talkbank-transform has a runtime dependency on it. Holding it back would make talkbank-transform unpublishable.

The current Wave 1B hold-backs are explicitly marked publish = false:

  • send2clan
  • chatter
  • talkbank-lsp

They stay blocked until their support contract, install story, and user-facing docs are ready.

What the repo now automates

Two repo-native entry points cover the first-wave foundations:

| Surface | Purpose |
| --- | --- |
| just crates-io-foundation-check | Local preflight for first-wave crates.io readiness |
| .github/workflows/crates-io-foundation.yml | CI enforcement for first-wave metadata, package surfaces, hold-backs, and publish order |

The readiness check enforces:

  • required crates.io metadata (repository, homepage, keywords, categories, readme)
  • readme-file existence
  • package assembly for every first-wave crate via cargo package --list
  • the first-wave runtime dependency graph
  • publish = false guards on Wave 1B crates
  • a real cargo publish --dry-run for the standalone tree-sitter-talkbank crate

Important limitation: Cargo cannot fully dry-run the bootstrap wave

For the first publication of an interdependent workspace, cargo publish --dry-run is not a complete CI gate for every crate. Cargo rewrites path dependencies to registry dependencies while preparing the package. That means a crate such as talkbank-model cannot complete a registry-style dry-run until its prerequisite talkbank-derive already exists on crates.io.

So the current automation is intentionally honest:

  • tree-sitter-talkbank gets a real crates.io dry-run because it stands alone.
  • The remaining Wave 1A crates are validated by metadata, readme, and dependency checks before publication. (No MSRV is declared yet; set a deliberate rust-version and re-add an MSRV check when publication is actually pursued.)
  • As each prerequisite crate lands on crates.io, rerun targeted cargo publish --dry-run -p <crate> checks for the later crates before publishing them.

This is a real limitation of the initial bootstrap wave, not a missing script. If we later want full registry-resolution rehearsal before publication, that requires a staging registry/local index strategy, not just another shell loop.

Publication procedure

Before publishing anything:

  1. Verify crates.io name availability for every Wave 1A package.
  2. Run just crates-io-foundation-check.
  3. Ensure .github/workflows/crates-io-foundation.yml and the main CI workflow are green on the commit you intend to publish.
  4. Publish in this exact order, waiting for the crates.io index to observe each crate before moving to the next:
    • tree-sitter-talkbank
    • talkbank-derive
    • talkbank-model
    • talkbank-cache
    • talkbank-parser
    • talkbank-parser-re2c
    • talkbank-transform
  5. After each prerequisite becomes visible on crates.io, rerun any newly-unblocked cargo publish --dry-run -p <crate> checks before the next publish step.

Example command shape:

cargo publish -p tree-sitter-talkbank --locked
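
The ordering constraint can be checked mechanically before publishing. The sketch below is not the repo's readiness script: order_respects_deps is a hypothetical helper, and KNOWN_DEPS lists only the two dependency edges this page states explicitly (talkbank-model on talkbank-derive, talkbank-transform on talkbank-parser-re2c). The full first-wave graph has more edges than shown here.

```rust
/// Publish order from the procedure above.
const WAVE_1A: [&str; 7] = [
    "tree-sitter-talkbank",
    "talkbank-derive",
    "talkbank-model",
    "talkbank-cache",
    "talkbank-parser",
    "talkbank-parser-re2c",
    "talkbank-transform",
];

/// (crate, prerequisite) pairs. Only the edges named on this page; incomplete by design.
const KNOWN_DEPS: [(&str, &str); 2] = [
    ("talkbank-model", "talkbank-derive"),
    ("talkbank-transform", "talkbank-parser-re2c"),
];

/// Verify that every prerequisite appears earlier in `order` than the crate needing it.
fn order_respects_deps(order: &[&str], deps: &[(&str, &str)]) -> Result<(), String> {
    let position = |name: &str| order.iter().position(|c| *c == name);
    for &(krate, dep) in deps {
        match (position(dep), position(krate)) {
            (Some(d), Some(k)) if d < k => {}
            (None, _) | (_, None) => {
                return Err(format!("{krate} or {dep} missing from order"));
            }
            _ => return Err(format!("{dep} must be published before {krate}")),
        }
    }
    Ok(())
}
```

Calling order_respects_deps(&WAVE_1A, &KNOWN_DEPS) returns Ok(()), and swapping talkbank-derive and talkbank-model in the order produces an error naming the violated edge.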

Tagging policy

Do not use version tags to drive crates.io publication from this repo. .github/workflows/release.yml is reserved for cargo-dist GitHub Releases of dist-enabled artifacts. Crates.io publication remains a deliberate manual maintainer flow.

Testing and Quality Gates

Status: Current Last modified: 2026-06-21 21:33 EDT

This page summarizes the current relationship between local verification and the repository CI workflows.

Local pre-merge contract

There is no repo-local make verify wrapper in this checkout today. The local contract is the command set documented in Setup and Developer Verification Checks:

cargo fmt --all -- --check
cargo build --workspace --all-targets --locked
cargo nextest run --workspace
cargo test --doc

plus grammar/spec/parser-specific checks when you touch those surfaces.

Never-regress gates

Beyond the formatting/build/test sweep above, the CHAT core has four never-regress gates that must stay green for any change touching the grammar, parser, model, validation, serialization, or alignment: parser equivalence, roundtrip idempotency, reference-corpus 100%, and the error-code spec tests. Each has a fast, targeted command. They are defined, with the exact command and what each protects, under Testing, Never-Regress Gates. A red gate is a bug until proven otherwise, never a test expectation to quietly update.

Root CI contract

The main CI workflow (.github/workflows/ci.yml) is the authoritative shared signal for this repo. Today it covers:

  • Rust build, test, and clippy
  • mdBook build

Additional workflows cover cross-platform build coverage and rolling-clippy drift checks.

Because the old local wrapper pipeline has not been ported into this repo, historical references to numbered gates such as G0-G14 should be treated as legacy labels from the predecessor workspace, not as the current command surface here.

Additional CI-only checks

These are required CI signals or workflow checks that are not identical to the local command set:

  • cross-platform release/build coverage
  • weekly rolling-clippy drift checks
  • workflow-specific smoke tests attached to release automation

Documentation Architecture

Status: Current Last modified: 2026-06-15 15:00 EDT

Principle: Centralized Book + Subsystem Satellites

User-facing and contributor-facing prose lives in mdBook (book/). The repo-level docs/ directory holds operator-facing material (release contract, versioning, code-signing, platform support, validation feature flags). Maintainers can also generate a local error-reference tree under docs/errors/ while working on diagnostics, but that output is not the canonical checked-in docs surface. Subsystem-specific working docs stay in place only when tightly coupled to files in that directory.

flowchart TD
    main["book/ (the unified Chatter mdBook)\nSurfaces: chatter, chat-format, architecture, contributing\nAudiences: users, integrators, contributors"]
    spec["spec/docs/\nSpec authoring guides"]
    errors["docs/errors/\nOptional local generated error reference"]
    api["cargo doc\nRust API docs (auto-generated)"]

    main -->|"links to"| spec
    main -->|"links to"| errors
    main -.->|"complements"| api

Where Documentation Goes

| Content type | Location | Examples |
| --- | --- | --- |
| User guides, CHAT format reference | book/src/chatter/user-guide/, book/src/chat-format/ | CLI usage, validation errors |
| Architecture and design | book/src/architecture/ | Parsing, data model, concurrency, memory |
| Contributor workflows | book/src/contributing/ | Grammar workflow, testing, coding standards |
| Integrator contracts | book/src/chatter/integrating/ | JSON schema, diagnostic contract |
| Technical reference and audits | book/src/ (Technical Reference section) | Parity audits, UTF-8 audit, risk register |
| Spec authoring guides | spec/docs/ | Error spec format, curation workflow |
| Generated error docs | docs/errors/ | Optional local output from gen_error_docs; source of truth stays in spec/errors/ |
| Historical/archived docs | project archive | Old audits, superseded proposals |
| AI assistant context | CLAUDE.md files (per repo/subdir) | Not documentation for humans |

Rules

  1. One canonical page per topic. No duplicate coverage across locations.
  2. No crate-level docs/ directories. Architectural explanations go in the book. Crate API docs come from /// doc comments via cargo doc.
  3. Satellites stay only when the audience is editing files in that directory. Spec authors need WRITING_ERROR_SPECS.md next to their specs. Everyone else reads the book.
  4. Generated docs are build artifacts. Never hand-edit docs/errors/. If you need that local reference set, regenerate it with gen_error_docs.
  5. Historical docs go to project archive. Don’t keep old audit logs, investigation notes, or superseded proposals in the public repo.

One unified book

There is one mdBook for this repo at book/, titled “Chatter, TalkBank CHAT Toolchain”, organized by audience-first sections under book/src/:

| Section | Audience | Content |
| --- | --- | --- |
| book/src/chatter/ | chatter CLI users + integrators | CLI reference, library usage, JSON contracts |
| book/src/chat-format/ | All users + integrators | CHAT format reference (headers, tiers, symbols) |
| book/src/architecture/ | All devs | Cross-surface architecture, parser/grammar/data-model design |
| book/src/contributing/ | Contributors | Setup, testing, coding standards, dev checks |

One book.toml and one SUMMARY.md for the whole tree. Cross-section links resolve as ordinary in-book paths.

CHAT Processing Playbook for Developers

Status: Current Last updated: 2026-03-23 23:49 EDT

Objective

Provide an implementation playbook for developers building or extending CHAT parsing, validation, transformation, and serialization logic.

Mental Model

Treat CHAT processing as a layered pipeline:

  1. Ingest bytes and normalize line boundaries.
  2. Parse syntax into structured model with exact spans.
  3. Validate semantic rules with structured diagnostics.
  4. Transform or enrich model without breaking invariants.
  5. Serialize in canonical form.

Developer Workflow

  1. Start from a concrete fixture or corpus case.
  2. Add/adjust parser behavior with contract tests first.
  3. Add semantic validator rules separately from parser acceptance.
  4. Confirm roundtrip and equivalence gates.
  5. Update docs for any visible behavior or policy change.

Tier Dispatch Strategy

Use cheap byte-prefix dispatch before heavy parsing:

  • @ => header candidate,
  • * => main tier,
  • % => dependent tier,
  • continuation rules and whitespace handled deterministically.

This preserves performance and isolates error contexts earlier.
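
A minimal sketch of first-byte dispatch follows. The `LineKind` enum and `classify_line` function are hypothetical names for illustration, not the crate's real types, and the leading-tab rule for continuation lines is an assumption about how continuations are marked.

```rust
/// Coarse classification of one physical line, decided from its first byte.
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LineKind {
    /// `@` prefix: header candidate.
    Header,
    /// `*` prefix: main tier.
    MainTier,
    /// `%` prefix: dependent tier.
    DependentTier,
    /// Leading tab: continues the previous line (assumed convention).
    Continuation,
    /// Empty line.
    Blank,
    /// Anything else; left for the full parser to diagnose.
    Unknown,
}

/// Classify a line by looking only at its first byte, before any heavy parsing.
pub fn classify_line(line: &[u8]) -> LineKind {
    match line.first().copied() {
        None => LineKind::Blank,
        Some(b'@') => LineKind::Header,
        Some(b'*') => LineKind::MainTier,
        Some(b'%') => LineKind::DependentTier,
        Some(b'\t') => LineKind::Continuation,
        Some(_) => LineKind::Unknown,
    }
}
```

Because the decision needs one byte, the caller can pick the right sub-parser (and the right error context) without tokenizing the line first.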

For downstream batchalign3 consumers, tier dispatch is only the front door. The important contract is what happens after dispatch: parse-health taint, recovery vs rejection, and whether a tier is safe to pass into alignment.

Word Parsing Rules of Thumb

  • Parse suffix markers in strict order (@..., @s..., $...) with explicit precedence.
  • Keep raw_text exact, cleaned_text policy-driven and test-locked.
  • Treat CA delimiters and special symbols via centralized symbol sets.
  • Never embed ad hoc symbol literals in multiple files.

Error Handling Contract

  • Every parser failure should produce structured diagnostics with:
    • code,
    • severity,
    • span,
    • context,
    • message.
  • Avoid silent fallback behavior unless policy explicitly allows it.
  • If fallback occurs, emit warning-grade diagnostics where relevant.
  • Never fabricate semantic placeholders (empty required text, arbitrary enum default, fake word/chunk) to satisfy type construction.
  • Prefer None/partial outcome + diagnostics over synthetic model values.
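
The contract above can be sketched as a result type that carries an optional value plus diagnostics. Everything here is hypothetical for illustration: the `Diagnostic`, `Severity`, and `Outcome` types, the `parse_speaker` helper, and the `E-DEMO` code are not taken from the real crates or the error spec.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Severity {
    Warning,
    Error,
}

/// One structured diagnostic with the five fields listed above.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Diagnostic {
    pub code: &'static str,
    pub severity: Severity,
    /// Absolute byte range (start, end) in the full file.
    pub span: (usize, usize),
    pub context: String,
    pub message: String,
}

/// A partial outcome: the value may be absent, but diagnostics explain why.
#[derive(Debug, PartialEq, Eq)]
pub struct Outcome<T> {
    pub value: Option<T>,
    pub diagnostics: Vec<Diagnostic>,
}

/// Extract the speaker code from a main-tier line such as "*CHI:\thello .".
/// `base` is the absolute offset of `line` in the file.
pub fn parse_speaker(line: &str, base: usize) -> Outcome<String> {
    let body = line.strip_prefix('*').unwrap_or(line);
    match body.find(':') {
        Some(end) if end > 0 => Outcome {
            value: Some(body[..end].to_string()),
            diagnostics: Vec::new(),
        },
        _ => Outcome {
            // No placeholder speaker is invented; the value stays absent.
            value: None,
            diagnostics: vec![Diagnostic {
                code: "E-DEMO", // hypothetical code, not a real spec entry
                severity: Severity::Error,
                span: (base, base + line.len()),
                context: line.to_string(),
                message: "main tier has no speaker code before ':'".to_string(),
            }],
        },
    }
}
```

Returning `None` with a diagnostic keeps downstream code from mistaking a fabricated default for real data.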

Span Discipline

  • Offsets are absolute across full file content.
  • Nested parser helpers must accept base offset and return shifted spans.
  • Add tests for boundary and continuation-line spans.
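
A minimal sketch of a nested helper that takes a base offset and returns absolute spans. The `Span` type and `word_spans` function are hypothetical names for illustration, and splitting on ASCII spaces is a simplification of real word parsing.

```rust
/// Absolute byte range in the full file (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Span {
    pub start: usize,
    pub end: usize,
}

/// Split `text` on ASCII spaces and return each word with an absolute span.
/// `base` is the absolute byte offset of `text` within the full file.
pub fn word_spans(text: &str, base: usize) -> Vec<(&str, Span)> {
    let mut out = Vec::new();
    let mut start: Option<usize> = None;
    for (i, b) in text.bytes().enumerate() {
        match (b == b' ', start) {
            // First non-space byte after a gap: a word begins here.
            (false, None) => start = Some(i),
            // Space after a word: close the word and shift by `base`.
            (true, Some(s)) => {
                out.push((&text[s..i], Span { start: base + s, end: base + i }));
                start = None;
            }
            _ => {}
        }
    }
    // A word that runs to the end of the text.
    if let Some(s) = start {
        out.push((&text[s..], Span { start: base + s, end: base + text.len() }));
    }
    out
}
```

Slicing the original file with a returned span should always reproduce the word, which makes a convenient assertion for boundary and continuation-line tests.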

Performance Policy

  • Prefer byte-oriented prechecks for top-level dispatch and simple delimiters.
  • Use parser combinators for structural parsing, not for obvious constant-prefix routing.
  • Measure parser performance on representative corpus slices before/after major changes.

Common Failure Patterns and Fixes

  • Symptom: semantic mismatch only in snapshots.
    • Fix: compare parser outputs directly and isolate first structural delta.
  • Symptom: generated tests pass, corpus fails.
    • Fix: add missing fixture, decide parse-vs-validate placement, lock behavior.
  • Symptom: output drift after grammar edit.
    • Fix: run full regeneration and the parser equivalence contract suite before merge.

Batchalign3 Surface Checks

When a change affects the surface used by batchalign3, confirm:

  • full-file parse equivalence still holds for corpus coverage
  • alignment-sensitive downstream tiers still gate on parse-health appropriately

Review Checklist for Parser PRs

  • New or changed behavior has targeted tests.
  • Equivalence suite status is attached.
  • Snapshot updates are intentional and explained.
  • No hidden magic symbols or magic string literals introduced.
  • Docs updated where user-visible behavior changes.

Required Artifacts for Significant Changes

  • Design note (architecture decision record in the book).
  • Before/after examples.
  • Impacted fixtures list.
  • Migration implications for integrators.

GitHub Readiness and Open Source Governance

Status: Current Last modified: 2026-06-21 21:33 EDT

Objective

Prepare TalkBank/chatter to operate as a healthy public project with clear legal, security, contribution, and release processes.

Root Artifacts

| Artifact | Status | Notes |
| --- | --- | --- |
| LICENSE-MIT + LICENSE-APACHE | Done | Dual-licensed MIT OR Apache-2.0 (standard Rust convention; both files present at root, no combined LICENSE). Every crate inherits license = "MIT OR Apache-2.0" from [workspace.package]. |
| CONTRIBUTING.md | Done | Setup, standards, PR flow, pre-PR checklist |
| CODE_OF_CONDUCT.md | TODO (deferred) | Intentionally absent for now: it is held until a durable enforcement contact (an institutional address or successor handle, not an individual) is settled. The plan is to adopt the Contributor Covenant once that contact exists. |
| SECURITY.md | Done | Root file added; issue-template contact link now resolves to a real policy |
| CODEOWNERS | TODO | Not added yet: repo contents do not currently publish an authoritative GitHub owner/team map for path-level review ownership |
| .github/workflows/*.yml | Done | ci.yml (Rust build+test, mdBook, Rust-version-sync) + cross-platform.yml (OS matrix) + clippy-rolling.yml + crates-io-foundation.yml + release.yml + release-desktop.yml |
| .github/ISSUE_TEMPLATE/* | Done | Bug report + feature request (YAML forms) |
| Pull request template | Done | .github/PULL_REQUEST_TEMPLATE.md mirrors current CONTRIBUTING + PR review requirements |

CI Governance Policy

  • Required status checks: the ci.yml jobs that run on every pull request (Rust build + test, mdBook build, and Rust version pins in sync). See Branch Protection for the exact GitHub check names and for why the other workflow (cross-platform.yml) is deliberately not in the required set.
  • Branch protection rules: documented in Branch Protection; configure on GitHub once the repo is public.

Release Governance

  • Releases: the CLI and desktop app are published as signed GitHub Releases (cargo-dist); the Rust crates are source-available (not yet on crates.io).
  • Cargo publication governance: first-wave crates.io foundations are documented in Crates.io Publication and checked by .github/workflows/crates-io-foundation.yml.
  • Binary release governance: release.yml is reserved for cargo-dist GitHub Release packaging of dist-enabled artifacts. It is not the crates.io publication workflow.
  • Tagging rule: do not treat version tags as authorization to publish new surfaces. A surface becomes stable only when its release notes explicitly say so and its public distribution channel is live.
  • Release-note rule: every public release note must state the surface’s distribution channel, support boundary, and any closely related surfaces that remain held back.

Community Operations

  • Label taxonomy: bug and enhancement auto-applied by issue templates. Richer taxonomy (drift, spec, grammar, parser, docs, good first issue): TODO (GitHub settings).
  • Contributor pathway: CONTRIBUTING.md covers setup and PR flow. First-time/advanced contributor pathways: TODO.
  • Public project roadmap: TODO.

Supply Chain and Security

  • Dependency scanning: CI runs rustsec/audit-check and cargo-deny (with deny.toml). Automated update PRs (Dependabot/Renovate): TODO.
  • Signed release artifacts: TODO.
  • Security advisories process: documented in SECURITY.md.

Acceptance Criteria

  • Repo has complete governance artifacts at root.
  • CI and branch protections enforce stated policy.
  • Contributors can onboard and submit PRs without tribal knowledge.
  • Release/support tiers are documented per surface.
  • Release process is repeatable and documented.

Rust Compilation Times: Findings and Optimizations

Status: Reference (historical analysis; current Cargo.toml profile knobs are the source of truth) Last updated: 2026-05-20 20:32 EDT

This document captures the compilation performance analysis that drove the current dev/test profile knobs in the workspace root Cargo.toml. The absolute measurements below were taken before the 2026-04-28 batchalign3 fold roughly tripled the third-party dependency surface; subsequent updates are reflected in Cargo.toml comments, which are the source of truth.

Background: How Rust Compilation Works

Rust compilation has two key mechanisms for speed:

  1. Incremental compilation: When you change one file and rebuild, the compiler remembers which “codegen units” within each crate were affected and only recompiles those. This is the primary speedup mechanism for local iterative development (edit-compile-test cycles).

  2. Crate-level caching: Cargo tracks which crates have changed inputs (source files, dependencies, feature flags). Unchanged crates are skipped entirely. This helps when you edit a leaf crate and don’t need to rebuild unrelated crates.

Additionally, there are external tools:

  1. sccache: A shared compilation cache that stores compiled artifacts by content hash. Designed for CI environments where builds start from a clean state. It works by wrapping rustc and checking a cache before invoking the real compiler.

  2. Linker choice: The linker runs after all crates are compiled to produce the final binary. Faster linkers (like lld) can shave seconds off link time for large binaries.

What We Found

Problem 1: sccache Was Disabling Incremental Compilation (Critical)

The global ~/.cargo/config.toml had:

[build]
rustc-wrapper = "/opt/homebrew/bin/sccache"

This caused two compounding problems:

  • sccache disables Rust incremental compilation entirely. When a rustc-wrapper is set, Cargo cannot use incremental mode because the wrapper interposes between Cargo and rustc, breaking the incremental artifact protocol.

  • sccache had near-zero cache benefit for this workspace. The sccache stats showed a 2.7% Rust cache hit rate. Out of 37 compilations, 36 were marked “non-cacheable” because rlib crates (library crates, which is what most workspace crates produce) cannot be cached by sccache.

The result: every cargo build after a one-line change was effectively a clean rebuild of the entire dependency chain. A change to talkbank-model (near the root of the crate graph) triggered a full recompile of 11+ downstream crates, taking 60-90 seconds even for a trivial edit.

Problem 2: Full Debug Info

The dev profile was generating full DWARF debug info (level 2), which includes:

  • Type definitions for every struct/enum
  • Variable location info for debugger inspection
  • Full scope and lifetime metadata

This produces large .dSYM bundles and .o files, increasing linker input size and slowing down the link phase.

Problem 3: Third-Party Dependencies at -O0

All third-party crates (serde, regex, tree-sitter, etc.) were compiled at opt-level = 0 in dev builds. Since these crates rarely change, this was a pure penalty: slow runtime (tests using serde deserialization, tree-sitter parsing, or regex matching ran ~10x slower than necessary) with no compile-time benefit after the first build.

Non-Problem: lld Linker

The linker = "lld" setting in the global cargo config was fine. On macOS this uses ld64.lld from Homebrew’s LLVM toolchain (LLD 21.1.8), which is slightly faster than Apple’s default linker for workspaces of this size. No change needed.

Changes Made

Change 1: Project-Local sccache Override

Created .cargo/config.toml in the project root:

[build]
rustc-wrapper = ""

This overrides the global sccache setting for this project only, re-enabling incremental compilation. Other Rust projects on the system are unaffected.

Why not modify the global config? Keeping the project-local override is safer: sccache may still be useful for other projects or CI workflows.

Note: .cargo/config.toml is gitignored (not committed) because the empty-string rustc-wrapper = "" value trips a cargo-llvm-cov bug that treats "" as a real wrapper path instead of “no wrapper.” Each contributor opts in locally; CI does not carry the override.

Change 2: Reduced Debug Info

In the workspace Cargo.toml:

[profile.dev]
debug = "line-tables-only"

[profile.test]
debug = "line-tables-only"

This generates only file/line number information for backtraces, skipping the bulky type and variable metadata. You still get useful panic/backtrace output with source locations; you just can’t inspect local variables in a debugger (lldb/gdb). For most development workflows this is the right tradeoff.

Change 3: Optimized Third-Party Dependencies (RETIRED post-fold)

The original change set [profile.dev.package."*"] opt-level = 1 to optimize every third-party crate. After the 2026-04-28 batchalign3 fold roughly tripled the third-party dependency surface (axum, async-trait, tokio’s full feature set, etc.), the build-time cost of this setting became prohibitive, and the workspace Cargo.toml comment block now explains why it was removed.

[profile.test.package."*"] opt-level = 1 was also removed for the same reason; for specific tests where runtime is the bottleneck, opt in locally rather than reintroducing the workspace-wide setting.

Results (pre-fold, 2026-03 measurement)

The numbers below were captured pre-fold against the original ten-crate workspace. The fold roughly tripled the third-party dep set and forced retiring [profile.dev.package."*"] opt-level = 1; today’s wall-clock will be slower and depends on which crate you touched. Re-run cargo build --timings on the current workspace if you need fresh numbers.

| Scenario | Before | After (pre-fold) |
| --- | --- | --- |
| Clean build | ~3-5 min (est.) | ~39s |
| Incremental rebuild (touch talkbank-model) | ~60-90s | ~4s |
| Test runtime (serde/regex/tree-sitter hot paths) | Slow (-O0) | Faster (-O1, when opt-in) |

Optional: Cranelift Backend for Maximum Iteration Speed

For the fastest possible “does it compile?” checks during rapid iteration, Rust nightly supports the Cranelift codegen backend:

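rustup component add rustc-codegen-cranelift-preview --toolchain nightly
CARGO_PROFILE_DEV_CODEGEN_BACKEND=cranelift cargo +nightly build -Zcodegen-backend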

Cranelift generates code ~2x faster than LLVM but produces unoptimized output and is nightly-only. It is useful for compile-check cycles but not for correctness testing or benchmarking.

General Principles for Rust Compile Time

  1. Incremental compilation is king for local dev. Anything that disables it (sccache, certain rustc-wrapper tools) is a net negative for iterative development.

  2. sccache is for CI, not local dev. It shines when doing clean builds from scratch (CI runners, cross-compilation). For edit-rebuild cycles, incremental compilation is far more valuable.

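  3. Optimize dependencies, not your own crates. [profile.dev.package."*"] with opt-level = 1 gives you faster test execution with minimal compile cost while the dependency set is small (dependencies rarely change). This workspace retired the setting once the batchalign3 fold roughly tripled the third-party dependency surface and the build-time cost became prohibitive.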

  4. Debug info has a real cost. Full DWARF debug info inflates binary sizes and link times. Use line-tables-only unless you actively need a debugger.

  5. Measure before optimizing. Use cargo build --timings to generate an HTML report showing per-crate compile times and parallelism. Use sccache --show-stats to verify cache effectiveness.

  6. Watch for crate graph bottlenecks. Crates that sit at the root of the dependency graph (like talkbank-model) are the critical path: changes to them trigger the longest rebuild chains. Keep these crates lean and consider splitting them if they grow too large.

Developer Verification Checks

Status: Current Last modified: 2026-05-30 20:13 EDT

This page defines the current local verification expectations for TalkBank/chatter.

There is not yet a repo-local make verify wrapper in this checkout. Use the concrete commands below instead.

Core local sweep

Run this from the repository root before opening or merging substantial changes:

cargo fmt --all -- --check
cargo check --workspace --all-targets
cargo build --workspace --all-targets --locked
cargo nextest run --workspace
cargo test --doc

Surface-specific additions

Add the checks that match the surface you changed:

  • Grammar changes

    cd grammar && tree-sitter generate && tree-sitter test
    
  • Spec tooling changes

    cargo build --manifest-path spec/tools/Cargo.toml
    cargo build --manifest-path spec/runtime-tools/Cargo.toml
    
  • Parser / model / alignment / serialization changes

    cargo nextest run -p talkbank-parser-tests -E 'test(parser_equivalence)'
    cargo nextest run -p talkbank-parser-tests --test roundtrip_reference_corpus
    

See Setup and Spec Workflow for the surface-specific regeneration guidance.

When to Run

  • Always before creating a PR.
  • Always before merging parser, spec-tool, grammar, or generated-artifact changes.
  • Again after rebasing if upstream changed the same surface.

Additional Engineering Checks

Run these in addition to the core sweep when touching parser/model code:

  1. cargo test -p talkbank-parser --test test_parse_health_recovery
  2. cargo nextest run -p talkbank-parser-tests --test parser_equivalence_files

These protect against regressions in:

  • parser recovery without sentinel fabrication
  • parse-health taint propagation
  • parser semantic equivalence

Failure Policy

  • If any required check fails, do not merge.
  • Fix the failing check or scope down the change.
  • If the failure is unrelated and pre-existing, document it in the PR and open a blocker issue.

Use narrower loops while iterating, then run the full sweep before final review. For a broad Rust verification pass:

cargo test --workspace

For grammar-only edits, prefer the smallest relevant loop first:

cd grammar && tree-sitter test
cargo nextest run -p talkbank-parser

Only reach for spec/symbol regeneration when the change truly affects generated artifacts; do not treat regeneration as a substitute for choosing the right regression test.

Branch Protection and Required CI Checks

Status: Current Last updated: 2026-06-15 15:00 EDT

This page defines the required status checks and protection policy for main.

Branch Protection Policy

Enable branch protection for main with:

  • Require pull request before merge.
  • Require approvals (minimum 1; maintainers may set higher).
  • Require conversation resolution before merge.
  • Require status checks to pass before merge.
  • Restrict force pushes and branch deletions.

Required Status Checks

Configure these CI checks as required. The names are the GitHub check names, which come from each job’s name: in .github/workflows/ci.yml; that workflow runs on every pull request to main:

  • Rust build + test
  • mdBook build
  • Rust version pins in sync

One other workflow is deliberately NOT in the required set:

  • cross-platform.yml (the Ubuntu + macOS + Windows matrix) runs on push to main, a daily schedule, and manual dispatch, NOT on pull requests, so it cannot report a status on a PR and must not be required (requiring it would block every merge). It is a post-merge and daily drift gate. Add a pull_request trigger first if you want it required.

Optional Hardening

  • Require branches to be up to date before merging.
  • Enable merge queue if PR volume increases.
  • Restrict who can dismiss stale reviews.

Operational Rule

If required checks fail:

  • Do not bypass protection.
  • Fix the issue or revert the breaking change.
  • Re-run checks until green.

Reference Corpus Overhaul

Status: Historical (Phase 0-6 narrative is preserved for context; the live corpus layout is described in Testing § Reference Corpus; read that first for current counts and structure) Last modified: 2026-05-29 18:43 EDT

Subsequent reorganization moved the corpus from the 345-flat-plus-language-subdirs layout described below into nine topical subdirectories under corpus/reference/. Absolute counts in this page (file totals, language-dir counts, the constructs/ directory) reflect the pre-reorganization state and are kept here only as the historical record of how the corpus got to where it is.

Motivation

The reference corpus (corpus/reference/) is the 100%-pass quality gate for all parser/grammar changes. The parser must handle every file at 100%. Before this overhaul, the corpus had three problems:

  1. Language monoculture: 345 files, all English. We have 100K+ real files across 42 languages in the corpus data directory but the gate only tested English.
  2. Construct gaps: 18 concrete grammar node types were never exercised (e.g., interrupted_question, scoped_best_guess, trailing_off_question). A grammar regression affecting these constructs would pass CI undetected.
  3. Error coverage gaps: 27 error specs were stubs (no CHAT example), 4 error codes had no spec file at all.

Strategy

Fresh build, not incremental patching. We kept the existing 345 English files as-is (they encode years of parser fixes) and added multilingual files + construct gap-fillers on top.

Phase 0: Coverage Tooling

Built corpus_node_coverage (spec/tools/src/bin/corpus_node_coverage.rs) to measure which of the 334 concrete grammar node types the corpus exercises. Running against the old 345-file corpus confirmed exactly 18 gaps.

Phase 1: Language Selection & File Extraction

Built extract_corpus_candidates (spec/runtime-tools/src/bin/extract_corpus_candidates.rs) to automatically select representative files from the corpus data directory for 20 target languages:

eng, zho, fra, deu, spa, jpn, nld, heb, por, ell,
tur, hrv, pol, ita, hun, rus, est, dan, ara, isl

Selection criteria:

  • Clean tree-sitter parsing (no ERROR nodes), mandatory
  • Short files (under 200 lines, preferring 15-100)
  • Varied tiers (%mor/%gra/%pho/%com)
  • Multiple speakers preferred
  • Privacy: explicitly skip Password directories in the corpus data directory

For each language, the tool scored and ranked candidates. We selected 1-2 files per language (25 files total across 20 language subdirectories).
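
The selection criteria above can be sketched as a reject-then-score function. This is not the logic of extract_corpus_candidates: the `Candidate` fields, the numeric weights, and the exact-name match on a `Password` path component are all assumptions made for illustration.

```rust
use std::ffi::OsStr;
use std::path::Path;

/// Facts gathered about one candidate file (hypothetical shape).
pub struct Candidate<'a> {
    pub path: &'a Path,
    pub line_count: usize,
    /// True when the tree-sitter parse contained any ERROR node.
    pub has_error_nodes: bool,
    /// Number of distinct dependent-tier kinds present (%mor, %gra, ...).
    pub tier_kinds: usize,
    pub speakers: usize,
}

/// Returns None for a rejected file, otherwise a score (higher is better).
/// The weights are invented for this sketch.
pub fn score(c: &Candidate) -> Option<u32> {
    // Hard filters: privacy, clean parse, and the 200-line ceiling.
    let in_password_dir = c
        .path
        .components()
        .any(|part| part.as_os_str() == OsStr::new("Password"));
    if in_password_dir || c.has_error_nodes || c.line_count >= 200 {
        return None;
    }

    let mut total: u32 = 0;
    // Preferred length band.
    if (15..=100).contains(&c.line_count) {
        total += 10;
    }
    // Varied tiers, capped so one unusual file cannot dominate.
    total += 2 * c.tier_kinds.min(4) as u32;
    // Multiple speakers preferred.
    if c.speakers > 1 {
        total += 3;
    }
    Some(total)
}
```

Separating hard filters from soft preferences keeps the mandatory criteria (clean parse, privacy) from being outvoted by a high score elsewhere.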

Phase 2: Construct Gap-Filling

Created 4 handcrafted files in corpus/reference/constructs/ to exercise the 18 missing node types that don’t appear in real-world data:

| File | Node types exercised |
| --- | --- |
| rare-terminators.cha | interrupted_question, self_interrupted_question, self_interruption, trailing_off_question |
| uptake.cha | uptake_symbol |
| best-guess.cha | scoped_best_guess |
| unsupported.cha | thumbnail_header, unsupported_header, unsupported_dependent_tier, unsupported_line, unsupported_header_prefix, unsupported_tier_prefix |

Other gaps (l1_of_header, utf8_header, etc.) were already covered by the language files or were confirmed as supertypes (not concrete).

Result: 334/334 concrete types exercised (100%).

Phase 3: Tier Regeneration

Ran batchalign3 morphotag on all 25 language files to generate fresh %mor/%gra tiers:

cd /path/to/batchalign3
uv run batchalign3 morphotag /path/to/chatter/corpus/reference/{lang}/ --in-place

All 20 languages are covered by Stanza’s UD models. Validation confirmed all 374 files pass parser equivalence and roundtrip.

Phase 4: Error Corpus Expansion

4.1: Created 3 missing error specs (E707, E711, E717) with CHAT examples and metadata. Fixed E376 (had wrong error code E208 in metadata).

4.2: Filled 17 triggerable stub specs with CHAT examples:

  • Cross-utterance validation (E341, E351-E355)
  • Parser recovery warnings (E319-E322, E325, E326)
  • Underline tier errors (E356-E357)
  • Overlap index errors (E373)
  • Direct parser tier errors (E381, E384)

4.3: Documented 12 untriggerable stubs (internal, deprecated, or not-yet-wired error codes) with explanations of why no example is possible: E001, E002, E211, E317, E318, E340, E374, E377, E378, E380, E385, E386.

4.4: Corrected 5 misclassified specs where examples triggered different error codes than intended (E319-E322, E376). Added Status: not_implemented and explanatory notes.

4.5: Built perturbation tool (spec/tools/src/bin/perturb_corpus.rs) with 11 mutation strategies that take a valid .cha file and produce controlled mutations targeting specific error codes:

| Perturbation | Target Error |
| --- | --- |
| delete-participants | E501 |
| delete-languages | E503 |
| delete-id | E504 |
| undeclared-speaker | E308 |
| delete-terminator | E305 |
| extra-mor-word | E706 |
| fewer-mor-words | E705 |
| delete-begin | E502 |
| delete-end | E510 |
| duplicate-participants | E511 |
| mor-terminator-mismatch | E716 |

Also includes a mining mode (--mine DIR) that scans real data for tree-sitter ERROR nodes, with automatic Password directory exclusion.
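
To make the idea of a controlled mutation concrete, here is a minimal sketch of one strategy from the list, delete-begin (which the source pairs with E502). It is not the code of perturb_corpus; the `delete_begin` function and its "return None when nothing changed" convention are assumptions for illustration.

```rust
/// Remove the first line that is exactly `@Begin`, keeping every other line
/// byte-for-byte. Returns None when the input has no such line, so a caller
/// can skip files where the mutation would be a no-op.
pub fn delete_begin(chat: &str) -> Option<String> {
    let mut removed = false;
    let mut out = String::with_capacity(chat.len());
    // split_inclusive keeps each line's terminator attached to the line.
    for line in chat.split_inclusive('\n') {
        let content = line.trim_end_matches(|c: char| c == '\r' || c == '\n');
        if !removed && content == "@Begin" {
            removed = true;
            continue;
        }
        out.push_str(line);
    }
    removed.then_some(out)
}
```

Changing exactly one thing per mutation keeps the expected error code unambiguous when the mutated file is validated.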

4.6: Regenerated golden artifacts: all 8 golden generators + audit + bootstrap:

| Artifact | Lines |
| --- | --- |
| golden_words.txt | 769 (1949 unique words) |
| golden_mor_tiers.txt | 405 |
| golden_gra_tiers.txt | 7 |
| golden_main_tiers.txt | 607 |
| golden_pho_tiers.txt | 25 |
| golden_wor_tiers.txt | 7 |
| golden_sin_tiers.txt | 5 |
| golden_com_tiers.txt | 24 |
| golden_words_featured.txt | 96 |
| golden_words_minimal.txt | 62 |

Bootstrap regenerated reference_corpus.rs with 374 test cases.

Phase 5: CI Integration & Validation

At that milestone, the then-current verification sweep passed:

  • Parser equivalence: 377/377 (374 files + 3 extra)
  • Node coverage: 334/334 (100%)
  • Error coverage: 181/181 (100%), 169 with CHAT examples, 12 documented stubs
  • The parser-equivalence and reference-corpus regression gates passed

Phase 6: Cleanup & Documentation

  • Updated file count references (339→374) across CLAUDE.md files
  • Rewrote corpus/README.md with new structure
  • Updated memory files

Final State

corpus/reference/           374 files total
  *.cha                     345 files (original English corpus)
  constructs/                 4 files (rare grammar constructs)
  {20 language dirs}/        25 files (multilingual, from corpus data)

| Metric | Before | After |
| --- | --- | --- |
| Total files | 345 | 374 |
| Languages | 1 (English) | 20 |
| Concrete node coverage | 316/334 (94.6%) | 334/334 (100%) |
| Error specs | 177/181 (97.8%) | 181/181 (100%) |
| Error specs with examples | ~150 | 169 |
| Documented stubs | 0 | 12 |
| Golden artifacts | Stale | Freshly regenerated |

Tools Built

| Tool | Path | Purpose |
| --- | --- | --- |
| corpus_node_coverage | spec/tools/src/bin/ | Grammar node type coverage |
| extract_corpus_candidates | spec/runtime-tools/src/bin/ | Automated file selection from corpus data |
| perturb_corpus | spec/tools/src/bin/ | Error file generation by mutation |

What Worked

  • extract_corpus_candidates: Automated scoring eliminated guesswork in file selection. Files were high-quality, short, and diverse.
  • construct gap-filling: 4 handcrafted files closed 18 gaps efficiently.
  • Keeping existing 345 files: No breakage, no regressions. The new files are purely additive.
  • batchalign3 morphotag: Generated correct %mor/%gra for all 20 languages without manual intervention.

What Didn’t Work / Lessons Learned

  • Mining real errors from corpus data: The MacWhinney subcorpus (407 files) had zero tree-sitter parse errors; the data is too clean. Mining is slow on large directories (>4 minutes for all of Eng-NA). The perturbation approach is more effective for systematic error coverage.
  • Parser recovery error specs (E319-E322): Writing examples that trigger specific tree-sitter error recovery codes is very difficult. Tree-sitter’s error recovery is robust and routes most malformed input through generic paths (E316) rather than the specific recovery codes. These remain as documented stubs.
  • Direct parser vs unsupported.cha (historical, direct parser has been removed): The former Chumsky direct parser could not handle unsupported_line nodes (failed on constructs/unsupported.cha). This is no longer relevant since tree-sitter is now the sole parser.

Known Remaining Gaps

  1. 12 untriggerable error stubs: Internal (E001, E002), deprecated (E211, E317, E318, E340, E374, E377, E378, E380, E385, E386). These are legitimate: the codes either have no emission path or are reserved.
  2. No audio files: Phase 3.3 (audio subset with %wor tiers) was deferred. Adding ~10 short audio clips would test the alignment pipeline end-to-end.
  3. Direct parser roundtrip (historical, direct parser has been removed): 373/374 passed under the former Chumsky direct parser (unsupported.cha failed). No longer relevant since tree-sitter is now the sole parser.
  4. 5 parser recovery specs not_implemented: E319-E322, E376. Examples don’t trigger the intended codes due to tree-sitter’s error recovery routing.

Desktop App Testing

Status: Current Last updated: 2026-05-20 20:28 EDT

This document covers the testing strategy for the Chatter desktop app (apps/chatter-desktop/). Testing is split into three tiers by speed and scope.

Testing Tiers

┌─────────────────────────────────────────────────────────┐
│  Tier 3: E2E (WebdriverIO + tauri-driver)               │
│  Real app, real DOM, real IPC. Slow (~5-10s/test).       │
│  Catches: rendering bugs, IPC wiring, platform quirks.   │
│  Run: manually before releases, optionally in CI.        │
├─────────────────────────────────────────────────────────┤
│  Tier 2: Rust integration tests                          │
│  Real validation pipeline, real event bridge, no GUI.    │
│  Catches: serialization mismatches, event ordering,      │
│  stats consistency, single-file handling.                 │
│  Run: every commit, CI required.                         │
├─────────────────────────────────────────────────────────┤
│  Tier 1: Unit tests (Rust + TypeScript)                  │
│  Pure functions and thin runtime seams in isolation.     │
│  Catches: protocol drift, reducer bugs, CLAN math.       │
│  Run: every commit, CI required.                         │
└─────────────────────────────────────────────────────────┘

Most bugs will be caught by Tier 2. The Rust integration tests exercise the exact same code path as the Tauri commands; they call validate_target_streaming() and the frontend event bridge directly, then verify the JSON shape, field names, event ordering, and stats consistency.

Tier 1 & 2: Unit and integration tests

Running

# TypeScript capability/seam tests
cd apps/chatter-desktop && npm run test:unit

# Rust contract/integration tests
cargo nextest run -p chatter-desktop --test validation_bridge

What they cover

| Test | What it verifies |
|---|---|
| apps/chatter-desktop/tests/unit/validationRunner.test.cjs | Validation capability uses centralized command names, subscribes before invoke, and disposes listeners exactly once |
| apps/chatter-desktop/tests/unit/validationState.test.cjs | Validation reducer computes relative file names and merges diagnostics/status immutably |
| reference_corpus_no_hard_errors | Every file under corpus/reference/ produces zero Severity::Error (warnings allowed) |
| event_lifecycle_has_correct_sequence | Discovering → Started → FileComplete×N → Finished ordering |
| frontend_events_serialize_to_expected_json_shape | Every event has type field; camelCase field names match TypeScript types; diagnostics include renderedText |
| protocol_contracts_serialize_to_expected_json_shape | Rust command/event constants and request payloads stay aligned with the TypeScript protocol module |
| single_file_validation | Single-file path validates exactly the selected file |
| finished_stats_match_file_events | valid + invalid + parseErrors == totalFiles; FileComplete count matches |
| rendered_html_present_for_errors | Every diagnostic carries non-empty miette HTML with box-drawing characters and style= attributes (ANSI colors converted to HTML) |
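
The ordering and stats invariants listed above can also be restated as a small consumer-side check. The following is a minimal Python sketch, not part of the test suite. It assumes events are dicts tagged with a camelCase type (discovering, started, fileComplete, finished) and that the finished event carries valid, invalid, parseErrors, and totalFiles at its top level; the exact payload layout is an assumption here, so check it against events.rs before relying on it.

```python
def check_event_stream(events):
    """Return a list of invariant violations (empty when the stream is consistent)."""
    problems = []
    # Collapse runs of fileComplete so only the phase order remains.
    phases = []
    for event in events:
        kind = event["type"]
        if kind == "fileComplete" and phases and phases[-1] == kind:
            continue
        phases.append(kind)
    allowed = (
        ["discovering", "started", "fileComplete", "finished"],
        ["discovering", "started", "finished"],  # N = 0 files
    )
    if phases not in allowed:
        problems.append(f"unexpected phase order: {phases}")
        return problems

    finished = events[-1]
    completed = sum(1 for e in events if e["type"] == "fileComplete")
    counted = finished["valid"] + finished["invalid"] + finished["parseErrors"]
    if counted != finished["totalFiles"]:
        problems.append("valid + invalid + parseErrors != totalFiles")
    if completed != finished["totalFiles"]:
        problems.append("fileComplete count != totalFiles")
    return problems


# Hypothetical event stream for a two-file run.
sample = [
    {"type": "discovering"},
    {"type": "started"},
    {"type": "fileComplete"},
    {"type": "fileComplete"},
    {"type": "finished", "totalFiles": 2, "valid": 1, "invalid": 1, "parseErrors": 0},
]
print(check_event_stream(sample))
```

An empty list means the stream satisfied both the lifecycle order and the counter arithmetic.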

Adding new tests

Test file: apps/chatter-desktop/src-tauri/tests/validation_bridge.rs

The tests use collect_events() which runs the real validation pipeline and collects all FrontendEvent values. To test a specific scenario:

#[test]
fn my_scenario() {
    let target = workspace_root().join("path/to/corpus");
    let events = collect_events(&target);
    let summary = summarize(&events);
    // assert on summary fields or individual events
}

Miette rendering pipeline

Error rendering is server-side. Each FrontendDiagnostic carries two renderings:

  • rendered_html: render_error_with_miette_with_source_colored() produces ANSI-colored text, which ansi-to-html converts to HTML <span style="..."> elements. The frontend displays the result in a <pre> block via dangerouslySetInnerHTML. This keeps the output identical to the CLI.
  • rendered_text: render_error_with_miette_with_source() produces plain text (no ANSI codes) for clean clipboard copy-paste.

The rendered_html_present_for_errors integration test verifies that every error diagnostic includes non-empty HTML containing miette box-drawing characters and style= attributes from ANSI color conversion.

TypeScript seam tests

The TypeScript unit tests compile a focused subset of apps/chatter-desktop/src/ to a temporary CommonJS directory, then run Node’s built-in test runner against the compiled output. This keeps the test toolchain small while still exercising the runtime seam as real JavaScript.

  • Runner script: apps/chatter-desktop/scripts/run-unit-tests.mjs
  • Compile config: apps/chatter-desktop/tsconfig.unit.json
  • Test files: apps/chatter-desktop/tests/unit/*.test.cjs

TypeScript ↔ Rust contract

The Rust integration tests verify that serialized JSON matches what the TypeScript frontend expects. If you change a field name or event structure in events.rs, the frontend_events_serialize_to_expected_json_shape test will catch the mismatch before you discover it at runtime.

The key serde attributes:

  • #[serde(tag = "type", rename_all = "camelCase")] on the enum: variant names become camelCase tag values (fileComplete, not FileComplete)
  • #[serde(rename_all = "camelCase")] on each individual variant: field names become camelCase (totalFiles, not total_files)
  • Both must be present: the enum-level rename_all only affects tag names, not field names within variants
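
To make the two renaming rules concrete, here is a minimal Python sketch of the transformations as they apply to typical Rust identifiers (PascalCase variants, lowercase snake_case fields). The helper names are hypothetical and the code only mimics the naming convention; it is not how serde is implemented.

```python
def variant_to_camel(name):
    """PascalCase variant name -> camelCase tag value (enum-level rename_all)."""
    return name[:1].lower() + name[1:]


def field_to_camel(name):
    """snake_case field name -> camelCase key (variant-level rename_all)."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


print(variant_to_camel("FileComplete"))  # tag seen by the frontend
print(field_to_camel("total_files"))     # field key seen by the frontend
```

A frontend type that expects total_files or FileComplete would therefore never match the serialized output, which is the drift the contract test guards against.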

Tier 3: E2E Tests (WebdriverIO)

Prerequisites

cargo install tauri-driver    # WebDriver backend for Tauri (Linux/Windows only)
cargo tauri build --debug     # Build the app binary

Note: tauri-driver only works on Linux and Windows. On macOS, WKWebView does not support WebDriver. Run E2E tests in CI (Linux) or on a Windows machine.

Running

# Terminal 1: start tauri-driver (WebDriver server on :4444)
tauri-driver

# Terminal 2: run the tests
cd apps/chatter-desktop
npm run test:e2e

What they cover

The smoke tests in tests/e2e/smoke.spec.ts verify that the app launches and renders the expected UI elements:

  • Drop zone with Choose File / Choose Folder buttons
  • Empty file tree (“No files loaded”)
  • Empty error panel (“Select a file to view errors”)
  • Status bar showing “Ready”

Limitations

File dialogs cannot be driven via WebDriver. The native file picker (@tauri-apps/plugin-dialog) opens an OS-level dialog that WebDriver can’t interact with. Options for testing the validation flow:

  1. Test-only Tauri command: add validate_for_test(path) behind #[cfg(debug_assertions)] that bypasses the file dialog
  2. Programmatic invoke: use driver.executeScript() to call window.__TAURI__.core.invoke("validate", { path }) directly
  3. Drag-and-drop simulation: possible but platform-dependent and fragile

For now, the Rust integration tests cover the full validation pipeline. E2E tests focus on UI rendering and user-visible layout.

Adding E2E tests

Test file: apps/chatter-desktop/tests/e2e/*.spec.ts

WebdriverIO provides $() and $$() for CSS selectors, plus Tauri-aware capabilities:

it("should show validation results", async () => {
  // Programmatically trigger validation (bypasses file dialog)
  await browser.executeAsync(async (path, done) => {
    await (window as any).__TAURI__.core.invoke("validate", {
      path,
    });
    // Fixed 5 s delay standing in for a wait on the finished event
    setTimeout(done, 5000);
  }, "/path/to/corpus");

  const tree = await $(".file-tree-panel");
  const text = await tree.getText();
  expect(text).not.toContain("No files loaded");
});

When to run E2E tests

  • Before releases: manual run to verify the built app works end-to-end
  • Optionally in CI: requires tauri-driver and a display server (Xvfb on Linux). Slow, so consider running only on release branches.
  • Not on every commit: the Rust integration tests are fast and cover more ground

Platform-Specific Considerations

| Platform | WebView engine | E2E support |
|---|---|---|
| macOS | WKWebView | Not supported: tauri-driver does not work on macOS (WKWebView has no WebDriver API) |
| Windows | WebView2 (Chromium) | Full support via tauri-driver |
| Linux | WebKitGTK | Full support via tauri-driver; requires Xvfb for headless |

macOS limitation: Apple’s WKWebView does not expose a WebDriver endpoint, so tauri-driver cannot drive the app on macOS. E2E tests must run on Linux (CI) or Windows. For local macOS development, rely on the Rust integration tests (Tier 2) and manual smoke testing.

CSS rendering differs slightly between WebKit (Linux) and Chromium (Windows). Visual regressions are possible; consider screenshot comparison tests if this becomes a problem.

Test Data

All tests use the reference corpus at corpus/reference/. This corpus is checked into the repo and must always pass validation with zero hard errors (warnings are allowed). The exact set of files, and which of them currently emit warnings, is whatever find corpus/reference -name '*.cha' -type f and the validator report; do not hard-code those lists here.

Do not create ad-hoc .cha test files. Use existing reference corpus files or ask the user to provide test data.

CI Integration

Add to the existing CI workflow:

# Rust integration tests (fast, always run)
- name: Desktop integration tests
  run: cargo nextest run -p chatter-desktop --test validation_bridge

# E2E tests (slow, release branches only)
- name: Build desktop app
  if: startsWith(github.ref, 'refs/heads/release')
  run: cargo tauri build --debug
- name: E2E smoke tests
  if: startsWith(github.ref, 'refs/heads/release')
  run: |
    tauri-driver &
    sleep 2
    cd apps/chatter-desktop && npm run test:e2e

Library Usage

Status: Current Last modified: 2026-06-21 21:33 EDT

The TalkBank Rust crates can be used as dependencies in your own Rust projects for parsing, validating, and manipulating CHAT files. This page shows the most common entry points; the API reference on docs.rs (once published) is the authoritative source. Until then, treat the rustdoc comments inside each crate’s src/lib.rs as the source of truth.

Examples on this page are mirrored as a real Cargo test at crates/talkbank-transform/tests/book_library_usage_examples.rs. The book renders them as rust,ignore so mdbook doesn’t try to link against the workspace’s many compiled crate variants; the parallel test runs the same code under cargo test and is what catches API drift between this page and the libraries. If you edit either, update both.

Important: some legacy tree-sitter fragment helpers are synthetic rather than semantically honest. They can inject fragment input into boilerplate CHAT text and parse the resulting synthetic file. Prefer full-file parsing for real tree-sitter use, and do not treat legacy fragment helpers as the long-term fragment API. For direct-parser fragment semantics, use direct-parser-native tests instead of treating synthetic wrappers as the oracle.

Adding Dependencies

The TalkBank library crates are source-available from this repository. They are not yet published on crates.io, so depend on them from the public repo via git (pinned to a release tag), or via local path dependencies from a TalkBank/chatter checkout for local development:

[dependencies]
talkbank-model = { path = "../chatter/crates/talkbank-model" }
talkbank-transform = { path = "../chatter/crates/talkbank-transform" }
talkbank-parser = { path = "../chatter/crates/talkbank-parser" }

The published-crate workflow is tracked separately; once it lands these paths can become version = "X.Y" deps.

Parsing and Validating a CHAT File

The simplest entry point is parse_and_validate from talkbank-transform. It takes the source text and a ParseValidateOptions, and returns a fully constructed ChatFile, or a PipelineError if parsing or validation fails.

extern crate talkbank_model;
extern crate talkbank_transform;
use talkbank_model::ParseValidateOptions;
use talkbank_transform::parse_and_validate;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let source = std::fs::read_to_string("file.cha")?;
    let options = ParseValidateOptions::default().with_validation();
    let chat_file = parse_and_validate(&source, options)?;

    for utt in chat_file.utterances() {
        println!("Speaker: {}", utt.main.speaker);
    }
    Ok(())
}

ChatFile is generic over a ValidationState parameter; the parse_and_validate return defaults to the validated state. chat_file.utterances() returns an iterator over &Utterance derived from the file’s lines (utterances are interleaved with headers and comments in source order).

For batch workflows where parser construction overhead matters, reuse a single TreeSitterParser and call parse_and_validate_with_parser:

extern crate talkbank_model;
extern crate talkbank_parser;
extern crate talkbank_transform;
use talkbank_model::ParseValidateOptions;
use talkbank_parser::TreeSitterParser;
use talkbank_transform::parse_and_validate_with_parser;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let chat_files: Vec<std::path::PathBuf> = Vec::new();
    let parser = TreeSitterParser::new()?;
    let options = ParseValidateOptions::default().with_validation();

    for path in &chat_files {
        let source = std::fs::read_to_string(path)?;
        let chat_file = parse_and_validate_with_parser(&parser, &source, options.clone())?;
        let _ = chat_file;
    }
    Ok(())
}

ParseValidateOptions also exposes with_alignment() (implies with_validation(), additionally validates cross-tier alignment for %mor, %gra, %pho, %wor) and with_strict_linkers() (enables E351-E355 self-completion/other-completion linker checks).

Working with the Model

ChatFile stores participants and language metadata as top-level fields populated from @Participants / @ID / @Languages headers during parsing. Utterances live in lines and are iterated via chat_file.utterances().

extern crate talkbank_model;
extern crate talkbank_transform;
use talkbank_model::DependentTier;
use talkbank_model::ParseValidateOptions;
use talkbank_transform::parse_and_validate;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // The string body stays flush left: leading whitespace would become part of the CHAT text.
    let source = "\
@UTF8
@Begin
@Languages:\teng
@Participants:\tCHI Target_Child
@ID:\teng|test|CHI|||||Target_Child|||
*CHI:\thello world .
%mor:\tco|hello n|world .
@End
";
    let chat_file = parse_and_validate(source, ParseValidateOptions::default().with_validation())?;

    // Participant metadata is top-level on the ChatFile.
    let _participants = &chat_file.participants;

    // Iterate utterances and their dependent tiers.
    for utt in chat_file.utterances() {
        for tier in &utt.dependent_tiers {
            if let DependentTier::Mor(mor_tier) = tier {
                for item in mor_tier.items() {
                    println!("POS: {}, Lemma: {}", item.main.pos, item.main.lemma);
                }
            }
        }
    }
    Ok(())
}

DependentTier is a closed-set enum (Mor, Gra, Pho, Mod, Sin, Act, Add, Com, Err, Exp, Gpx, Int, Lan, …); match on the variants you care about and ignore the rest. MorTier::items() returns &[Mor]; each Mor has a main MorWord plus optional post-clitics.

Serializing to CHAT

Bring the WriteChat trait into scope and call to_chat_string() for a fully-rendered CHAT string, or write_chat(&mut writer) to stream into any std::fmt::Write.

extern crate talkbank_model;
extern crate talkbank_transform;
use std::fmt::Write as _;

use talkbank_model::ParseValidateOptions;
use talkbank_model::WriteChat;
use talkbank_transform::parse_and_validate;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let source = "@UTF8\n@Begin\n@Languages:\teng\n@Participants:\tCHI Target_Child\n@ID:\teng|test|CHI|||||Target_Child|||\n*CHI:\thello .\n@End\n";
    let chat_file = parse_and_validate(source, ParseValidateOptions::default().with_validation())?;

    // Convenience: render to a fresh String.
    let chat_text = chat_file.to_chat_string();
    assert!(chat_text.starts_with("@UTF8"));

    // Streaming: write into any std::fmt::Write sink.
    let mut output = String::new();
    chat_file.write_chat(&mut output)?;
    Ok(())
}

Serializing to JSON

Prefer the schema-validated helpers in talkbank_transform::json: to_json_pretty_validated checks the output against the JSON schema and catches drift between the data model and the schema. The unvalidated variants are a faster bypass when you’ve already validated upstream.

extern crate talkbank_model;
extern crate talkbank_transform;
use talkbank_model::ParseValidateOptions;
use talkbank_transform::json::to_json_pretty_validated;
use talkbank_transform::parse_and_validate;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let source = "@UTF8\n@Begin\n@Languages:\teng\n@Participants:\tCHI Target_Child\n@ID:\teng|test|CHI|||||Target_Child|||\n*CHI:\thi .\n@End\n";
    let chat_file = parse_and_validate(source, ParseValidateOptions::default().with_validation())?;

    let json = to_json_pretty_validated(&chat_file)?;
    assert!(json.contains("\"speaker\""));
    Ok(())
}

The schema for ChatFile lives at schema/chat-file.schema.json and is regenerated from the Rust types via cargo test --test generate_schema. For arbitrary serde values (not just ChatFile), to_json_unvalidated / to_json_pretty_unvalidated work the same way without the schema step.

Custom Error Handling

Lower-level parser entry points stream diagnostics through the ErrorSink trait. Implement it to collect, count, filter, or forward errors as they arrive. This is useful when you need finer-grained control than the Result<ChatFile, PipelineError> shape that parse_and_validate returns.

extern crate talkbank_model;
use talkbank_model::ErrorSink;
use talkbank_model::ParseError;

struct MyErrorHandler;

impl ErrorSink for MyErrorHandler {
    fn report(&self, error: ParseError) {
        // Custom handling: log, filter, count, etc.
        eprintln!("[{}] {}", error.code, error.message);
    }
}

ErrorSink is Send + Sync, and a blanket &T: ErrorSink impl means borrowed references are sinks too, so no Arc wrapper is required. The built-in ErrorCollector (gathers into a Vec), ParseTracker (counts by severity), and NullErrorSink (discards) cover most common needs; implement ErrorSink directly for everything else.

Crate Selection Guide

| Need | Crate |
|---|---|
| Data model types, error types, WriteChat, ErrorSink | talkbank-model |
| Tree-sitter CHAT parsing (low-level) | talkbank-parser |
| Full pipeline (parse + validate + JSON, schema validation) | talkbank-transform |

talkbank-model is the foundation; every other crate depends on it. If all you need are the AST types and validation, talkbank-model alone is enough. talkbank-transform adds parsing, JSON, and caching.

Batchalign3-Facing Surface

If you are building Batchalign3 or another external consumer, the stable surface is usually:

| Batchalign3 need | Prefer |
|---|---|
| Canonical full-file parsing | talkbank-parser |
| Parse/validate contracts and typed model access | talkbank-model |
| Alignment-aware downstream consumers (align, compare, benchmark) | talkbank-model alignment helpers plus the model AST |
| Whole-pipeline parse+validate+convert | talkbank-transform |

For batch workflows, keep parser instances reusable and keep alignment logic separate from parse semantics.

JSON Output Reference

Status: Reference Last updated: 2026-05-11 23:45 EDT

This document describes the structure of JSON produced by chatter to-json. For the formal JSON Schema, see JSON Schema.

Quick Start

# Default: parse + validate + align, pretty-printed, schema-checked
chatter to-json file.cha

# Write to file
chatter to-json file.cha -o file.json

# Skip validation (parse only, faster)
chatter to-json file.cha --skip-validation

# Skip alignment only
chatter to-json file.cha --skip-alignment

Validation and alignment are on by default. Use --skip-validation or --skip-alignment to opt out.

Top-Level Structure

{
  "lines": [ ... ]
}

A ChatFile is a flat list of lines. Each line has a line_type discriminator:

| line_type | Description |
|---|---|
| "header" | File header (@Begin, @Languages, @Participants, etc.) |
| "utterance" | Main tier + dependent tiers + alignment |
| "comment" | @Comment: lines |

Word Fields

Words are the fundamental unit. Every word in the main tier content array carries these fields:

| Field | Type | Always? | Description |
|---|---|---|---|
| type | "word" | yes | Discriminator |
| raw_text | string | yes | Exact text from the transcript, including all CHAT markers |
| cleaned_text | string | yes | NLP-ready text (shortenings restored, markers stripped) |
| content | array | yes | Structured breakdown of word parts (see below) |
| category | string | no | "omission", "filler", "nonword", "fragment", "ca_omission" |
| form_type | string | no | Special form code: "c", "d", "f", "x", etc. |
| lang | object | no | Language marker (see Language-Switched example) |
| untranscribed | string | no | "unintelligible" (xxx), "phonetic" (yyy), "untranscribed" (www) |

Word content items use "content" for the text value:

{ "type": "text", "content": "dog" }

Computed Fields

cleaned_text and untranscribed are computed from content during serialization. They do not exist as stored fields in the data model.

  • cleaned_text: Concatenates Text and Shortening elements from content. Excludes lengthening markers (:), stress markers, CA elements, overlap points, compound markers, and underline markers. Example: sit(ting) → "sitting".

  • untranscribed: Present only when cleaned_text is "xxx", "yyy", or "www".
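
The two derivation rules can be sketched in a few lines of Python. This is an illustration of the rules as stated above, not the serializer's actual code. The "text" item shape comes from the examples on this page; the "shortening" tag name and its string payload are assumptions, so confirm them against the JSON Schema.

```python
UNTRANSCRIBED = {
    "xxx": "unintelligible",
    "yyy": "phonetic",
    "www": "untranscribed",
}


def cleaned_text(content):
    """Concatenate text and shortening parts; every other part type is skipped."""
    return "".join(
        part["content"]
        for part in content
        if part["type"] in ("text", "shortening")  # "shortening" is an assumed tag
    )


def untranscribed(content):
    """Return the untranscribed label, or None when the word is ordinary."""
    return UNTRANSCRIBED.get(cleaned_text(content))


compound = [
    {"type": "text", "content": "ice"},
    {"type": "compound_marker", "content": {"span": {"start": 0, "end": 1}}},
    {"type": "text", "content": "cream"},
]
print(cleaned_text(compound))
print(untranscribed([{"type": "text", "content": "xxx"}]))
```

Consumers normally do not need to recompute these fields, since the serializer already emits them; the sketch is mainly useful for understanding why a marker does or does not appear in cleaned_text.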

Word Examples

Simple Word

dog
{
  "type": "word",
  "raw_text": "dog",
  "cleaned_text": "dog",
  "content": [{ "type": "text", "content": "dog" }]
}

Filler

&-uh
{
  "type": "word",
  "raw_text": "&-uh",
  "cleaned_text": "uh",
  "content": [{ "type": "text", "content": "uh" }],
  "category": "filler"
}

Untranscribed

xxx
{
  "type": "word",
  "raw_text": "xxx",
  "cleaned_text": "xxx",
  "content": [{ "type": "text", "content": "xxx" }],
  "untranscribed": "unintelligible"
}

Compound

ice+cream
{
  "type": "word",
  "raw_text": "ice+cream",
  "cleaned_text": "icecream",
  "content": [
    { "type": "text", "content": "ice" },
    { "type": "compound_marker", "content": { "span": { "start": 0, "end": 1 } } },
    { "type": "text", "content": "cream" }
  ]
}

Omission

0she
{
  "type": "word",
  "raw_text": "0she",
  "cleaned_text": "she",
  "content": [{ "type": "text", "content": "she" }],
  "category": "omission"
}

Nonword

&~baba
{
  "type": "word",
  "raw_text": "&~baba",
  "cleaned_text": "baba",
  "content": [{ "type": "text", "content": "baba" }],
  "category": "nonword"
}

Special Form

doggy@c
{
  "type": "word",
  "raw_text": "doggy@c",
  "cleaned_text": "doggy",
  "content": [{ "type": "text", "content": "doggy" }],
  "form_type": "c"
}

Language-Switched

maison@s:fra
{
  "type": "word",
  "raw_text": "maison@s:fra",
  "cleaned_text": "maison",
  "content": [{ "type": "text", "content": "maison" }],
  "lang": { "type": "explicit", "code": "fra" }
}

The lang field has variants: {"type": "shortcut"} (bare @s), {"type": "explicit", "code": "fra"} (@s:fra), and {"type": "multiple", "code": ["eng", "zho"]} (@s:eng+zho).
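
A consumer that only needs the language codes can normalize the three variants into a list. The Python sketch below assumes exactly the three shapes listed above; the function name is hypothetical, and resolving a bare @s shortcut to a concrete language is left to the caller because the shortcut variant carries no code.

```python
def language_codes(lang):
    """Normalize a word's lang object to a list of language codes."""
    kind = lang["type"]
    if kind == "explicit":
        return [lang["code"]]
    if kind == "multiple":
        return list(lang["code"])
    if kind == "shortcut":
        return []  # bare @s: no code is recorded on the word itself
    raise ValueError(f"unknown lang variant: {kind!r}")


print(language_codes({"type": "explicit", "code": "fra"}))
print(language_codes({"type": "multiple", "code": ["eng", "zho"]}))
```

Raising on an unknown variant is a deliberate choice in this sketch; a tolerant consumer could return an empty list instead.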

Utterances

An utterance line contains:

{
  "line_type": "utterance",
  "main": {
    "speaker": "CHI",
    "content": {
      "content": [ ... ],
      "terminator": { "type": "period" },
      "bullet": { "start_ms": 0, "end_ms": 3042 }
    }
  },
  "dependent_tiers": [ ... ],
  "alignments": { ... },
  "utterance_language": { "status": "resolved_default", "code": "eng" },
  "language_metadata": { ... }
}

Key structural points:

  • The utterance body is under "main", not "utterance".
  • content, terminator, and bullet are nested inside main.content.
  • terminator is an object with a type field ("period", "question", "exclamation", etc.), not a bare string.
  • bullet (utterance-level timing) is inside main.content, omitted when absent (not present as null).
  • dependent_tiers, alignments, utterance_language, and language_metadata are top-level siblings of main. Empty dependent_tiers and alignments are omitted when there is nothing to report.
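
As an illustration of this nesting, here is a minimal Python sketch that walks a to-json document and summarizes each utterance. It uses only the field names shown above; the function name is hypothetical, and treating terminator as optional is a defensive assumption rather than something this page states.

```python
def summarize_utterances(document):
    """Yield one summary dict per utterance line in a to-json document."""
    for line in document["lines"]:
        if line.get("line_type") != "utterance":
            continue  # headers and comments have no "main"
        main = line["main"]
        body = main["content"]
        bullet = body.get("bullet")  # omitted, not null, when absent
        terminator = body.get("terminator")
        yield {
            "speaker": main["speaker"],
            "terminator": terminator["type"] if terminator else None,
            "duration_ms": bullet["end_ms"] - bullet["start_ms"] if bullet else None,
            # Top-level words only; words nested inside groups are not descended into.
            "words": [
                item["cleaned_text"]
                for item in body["content"]
                if item.get("type") == "word"
            ],
        }


document = {
    "lines": [
        {"line_type": "header"},
        {
            "line_type": "utterance",
            "main": {
                "speaker": "CHI",
                "content": {
                    "content": [{"type": "word", "raw_text": "dog", "cleaned_text": "dog"}],
                    "terminator": {"type": "period"},
                    "bullet": {"start_ms": 0, "end_ms": 3042},
                },
            },
        },
    ]
}
for summary in summarize_utterances(document):
    print(summary)
```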

Content Items

main.content.content is a heterogeneous array. Each item has a type discriminator:

| Type | Description |
|---|---|
| "word" | A word token (see Word Fields above) |
| "event" | Non-verbal action (&=laughs) |
| "pause" | Timed or untimed pause ((.), (0.5)) |
| "group" | Bracketed group (<word word>) |
| "separator" | Tag markers, linkers, etc. |

Dependent Tiers

When present, dependent_tiers is an array of tagged objects:

"dependent_tiers": [
  {
    "type": "Mor",
    "data": {
      "tier_type": "Mor",
      "items": [
        {
          "main": { "pos": "pron", "lemma": "I" }
        },
        {
          "main": { "pos": "verb", "lemma": "want", "features": ["Fin", "Ind", "Pres"] }
        }
      ],
      "terminator": "."
    }
  },
  {
    "type": "Gra",
    "data": {
      "tier_type": "Gra",
      "relations": [
        { "index": 1, "head": 2, "relation": "NSUBJ" },
        { "index": 2, "head": 0, "relation": "ROOT" }
      ]
    }
  }
]

| type | Tier | Description |
|---|---|---|
| "Mor" | %mor | Morphological analysis (POS tags, lemmas, features, clitics) |
| "Gra" | %gra | Grammatical relations (dependency arcs) |
| "Pho" | %pho | Phonological transcription |
| "Sin" | %sin | Syntax tier |
| "Wor" | %wor | Word-level timing (items with inline_bullet) |
| Other | %xxx | User-defined dependent tiers |
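
Reading tagged tiers from an utterance can be sketched as follows. This Python example relies only on the Mor and Gra shapes shown in the sample above; the helper names are hypothetical, and the assumption that a head of 0 marks the root is taken from that sample rather than from a formal rule.

```python
def find_tier(utterance, tier_type):
    """Return the data payload of the first tier with the given type, or None."""
    for tier in utterance.get("dependent_tiers", []):  # key is omitted when empty
        if tier["type"] == tier_type:
            return tier["data"]
    return None


def mor_pairs(utterance):
    """List (pos, lemma) for each %mor item, or [] when there is no %mor tier."""
    data = find_tier(utterance, "Mor")
    if data is None:
        return []
    return [(item["main"]["pos"], item["main"]["lemma"]) for item in data["items"]]


def gra_root(utterance):
    """Return the index of the relation whose head is 0, or None."""
    data = find_tier(utterance, "Gra")
    if data is None:
        return None
    for relation in data["relations"]:
        if relation["head"] == 0:
            return relation["index"]
    return None


utterance = {
    "dependent_tiers": [
        {"type": "Mor", "data": {"tier_type": "Mor", "items": [
            {"main": {"pos": "pron", "lemma": "I"}},
            {"main": {"pos": "verb", "lemma": "want", "features": ["Fin", "Ind", "Pres"]}},
        ], "terminator": "."}},
        {"type": "Gra", "data": {"tier_type": "Gra", "relations": [
            {"index": 1, "head": 2, "relation": "NSUBJ"},
            {"index": 2, "head": 0, "relation": "ROOT"},
        ]}},
    ]
}
print(mor_pairs(utterance), gra_root(utterance))
```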

%wor Tier

The Wor tier contains word items with timing:

{
  "type": "Wor",
  "data": {
    "items": [
      {
        "kind": "word",
        "raw_text": "hello",
        "cleaned_text": "hello",
        "content": [{ "type": "text", "content": "hello" }],
        "inline_bullet": { "start_ms": 100, "end_ms": 300 }
      }
    ],
    "terminator": { "type": "period" }
  }
}

Note that %wor items use "kind" instead of "type" for their discriminator, since "type" is used by the tier envelope.
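
A short Python sketch of reading word timings from this tier, keyed on "kind" as described. The function name is hypothetical, and skipping items without an inline_bullet is an assumption about how untimed words appear.

```python
def word_timings(wor_tier):
    """Return (cleaned_text, start_ms, end_ms) for each timed word in a Wor tier."""
    timings = []
    for item in wor_tier["data"]["items"]:
        if item.get("kind") != "word":
            continue  # the discriminator is "kind" here, not "type"
        bullet = item.get("inline_bullet")
        if bullet is None:
            continue  # assumed: untimed words simply omit the bullet
        timings.append((item["cleaned_text"], bullet["start_ms"], bullet["end_ms"]))
    return timings


wor_tier = {
    "type": "Wor",
    "data": {
        "items": [
            {
                "kind": "word",
                "raw_text": "hello",
                "cleaned_text": "hello",
                "content": [{"type": "text", "content": "hello"}],
                "inline_bullet": {"start_ms": 100, "end_ms": 300},
            }
        ],
        "terminator": {"type": "period"},
    },
}
print(word_timings(wor_tier))
```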

Alignment Data

When validation runs (the default), the alignments object contains:

  • units: per-tier index arrays (for internal bookkeeping)
  • Named tier pairs (e.g., mor, gra) with alignment mappings
"alignments": {
  "units": {
    "main_mor": [{"index": 0}, {"index": 1}],
    "main_pho": [{"index": 0}, {"index": 1}],
    "main_sin": [{"index": 0}, {"index": 1}],
    "main_wor": [{"index": 0}, {"index": 1}],
    "mor": [{"index": 0}, {"index": 1}]
  },
  "mor": {
    "pairs": [
      { "source_index": 0, "target_index": 0 },
      { "source_index": 1, "target_index": 1 }
    ],
    "errors": []
  }
}

Alignment links each main-tier word (source_index) to its corresponding dependent-tier item (target_index) by position. errors contains any alignment-level diagnostics (count mismatches, etc.) and is [] when alignment validated cleanly.
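
The positional pairing can be turned into joined records with a few lines of Python. This sketch assumes both indices in every pair are integers, as in the sample above; pairs with a missing index are skipped defensively, since this page does not say how unaligned items are represented. The function name is hypothetical.

```python
def join_alignment(main_words, tier_items, alignment):
    """Pair main-tier words with dependent-tier items using an alignment mapping."""
    joined = []
    for pair in alignment["pairs"]:
        source, target = pair.get("source_index"), pair.get("target_index")
        if source is None or target is None:
            continue  # defensive: skip anything that is not a full pair
        joined.append((main_words[source], tier_items[target]))
    return joined


alignment = {
    "pairs": [
        {"source_index": 0, "target_index": 0},
        {"source_index": 1, "target_index": 1},
    ],
    "errors": [],
}
print(join_alignment(["I", "want"], ["pron|I", "verb|want"], alignment))
```

Check that errors is empty before trusting the pairs; a non-empty errors list means the tiers did not line up cleanly.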

Headers

Headers use the header object with a type discriminator:

| Type | Header | Key Fields |
|---|---|---|
| "utf8" | @UTF8 | (none) |
| "begin" | @Begin | (none) |
| "end" | @End | (none) |
| "languages" | @Languages | codes |
| "participants" | @Participants | entries (speaker_code, name, role) |
| "id" | @ID | language, corpus, speaker, role, age, sex, … |
| "media" | @Media | filename, media_type, status |
| "comment" | @Comment | text |
| "date" | @Date | date |
| "options" | @Options | options (array of strings) |

See the JSON Schema for the complete list of header types and fields.

Timing

Utterance-level timing appears in main.content.bullet:

"bullet": {
  "start_ms": 1234,
  "end_ms": 5678
}

Word-level timing (from %wor tier) appears in inline_bullet on individual words within the Wor dependent tier.

JSON Schema

Status: Current Last modified: 2026-06-15 15:00 EDT

This repository generates JSON Schema from Rust-owned types with schemars for the ChatFile transcript model used by chatter to-json.

Keeping that schema generated from the Rust source of truth lets cross-language integrations consume a stable contract without re-deriving the shapes by hand.

Available schemas

| Schema | Canonical URL | Repository | Generator |
|---|---|---|---|
| ChatFile transcript model | https://talkbank.org/schemas/v0.1/chat-file.json | schema/chat-file.schema.json | cargo test --test generate_schema |

The generated schema declares both $schema (JSON Schema 2020-12) and $id (the canonical URL above). External consumers that want to track the current transcript-model version should follow the v0.1 URL; there is no /latest/ alias in the generated artifacts.

Transcript schema: ChatFile

chatter to-json converts CHAT transcripts into a structured JSON form backed by the same ChatFile model used by the parser, validator, and serializer.

How chatter to-json uses it

By default, chatter to-json:

  • validates the CHAT input,
  • checks dependent-tier alignment unless --skip-alignment is passed, and
  • validates the emitted JSON against the schema unless --skip-schema-validation is passed.

Useful flags:

chatter to-json input.cha --skip-validation
chatter to-json input.cha --skip-alignment
chatter to-json input.cha --skip-schema-validation

chatter from-json deserializes JSON back into the internal ChatFile model and re-serializes it to CHAT format. The input should conform to this schema.

Roundtrip expectations

The CHAT-to-JSON-to-CHAT pipeline is intended to preserve the ChatFile model:

chatter to-json input.cha -o intermediate.json
chatter from-json intermediate.json -o output.cha
diff input.cha output.cha

Both directions go through the same typed model. When changing the parser, serializer, or schema generation, confirm roundtrip behavior with the existing roundtrip test suites rather than assuming byte-for-byte identity.

Using the schema externally

Validate JSON in Python

import json
import jsonschema
import urllib.request

schema_url = "https://talkbank.org/schemas/v0.1/chat-file.json"
schema = json.loads(urllib.request.urlopen(schema_url).read())

with open("transcript.json") as f:
    data = json.load(f)

jsonschema.validate(data, schema)

IDE autocompletion

{
  "$schema": "https://talkbank.org/schemas/v0.1/chat-file.json",
  "lines": [],
  "participants": {},
  "languages": [],
  "options": []
}

Generate types from the schema

Tools like quicktype, json-schema-to-typescript, and datamodel-code-generator can generate typed structs or classes from the schema for TypeScript, Python, Go, and other languages.

Regenerating the schema

After changing transcript-model types in talkbank-model:

cd chatter
cargo test --test generate_schema

This writes the checked-in schema artifact in schema/. CI already checks that generated artifacts stay in sync.

Code references

  • schema/chat-file.schema.json: generated schema
  • crates/talkbank-transform/src/json.rs: schema loading and validation
  • crates/talkbank-model/src/model/: Rust data model
  • tests/generate_schema/: shared schema generation helpers

Diagnostic and JSON Output Contract

Status: Current Last updated: 2026-06-15 15:00 EDT

This page documents the machine-readable JSON surfaces currently exposed by the top-level chatter CLI.

Stability policy

  • Treat field names documented here as the public contract.
  • Treat additional fields as additive unless this page says otherwise.
  • Treat message wording as human-facing text, not a stable machine contract.

chatter validate ... --format json

Both chatter validate FILE --format json and chatter validate DIR --format json emit newline-delimited JSON (NDJSON) on stdout, with the same record shapes in both modes:

  1. zero or more per-file records (one per validated file), then
  2. one final summary record.

A single-file invocation still emits a file record followed by a summary record; it is not a single-object surface.

Per-file records

Valid files:

{"type":"file","file":"/path/to/file.cha","status":"valid","cache_hit":false}

Invalid files (the errors array is opaque per-error JSON; the note field is appended when the validator stopped further checks because of structural errors):

{
  "type": "file",
  "file": "/path/to/file.cha",
  "status": "invalid",
  "error_count": 1,
  "errors": [
    {
      "code": "E502",
      "message": "Missing @End header at end of file",
      "severity": "Error"
    }
  ],
  "note": "Some additional checks may not have run because of structural errors. Fix the structural errors first, then re-validate."
}

Parser-failure files use "status":"parse_error" with an error string. Read-failure files use "status":"read_error" with an error string.

Summary record

{
  "type": "summary",
  "directory": "/path/to/dir",
  "total_files": 2,
  "valid": 1,
  "invalid": 1,
  "parse_errors": 0,
  "cache_hits": 0,
  "cache_misses": 2,
  "cache_hit_rate": 0.0,
  "cancelled": false
}

When --roundtrip is set, the summary also includes roundtrip_passed and roundtrip_failed counters.

Contract notes

  • The type field is stable: "file" or "summary".
  • For file records: file and status are stable; cache_hit is stable for valid records. error_count and errors are stable for invalid records.
  • For summary records: directory, total_files, valid, invalid, parse_errors, cache_hits, cache_misses, cache_hit_rate, and cancelled are stable.
  • status values currently observed: valid, invalid, parse_error, read_error. New status values may appear.
  • Errors do not include a byte-offset location field in the NDJSON surface; for byte-offset diagnostics use the LSP or the non-JSON renderer.
  • The note field on invalid file records is human-facing guidance and may be added or omitted between releases.
  • Exit code 0 means all files validated successfully; exit code 1 means at least one file failed or an I/O error occurred.
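
A consumer of this NDJSON surface might look like the Python sketch below. It relies only on the stable fields named above and treats any status other than valid as a failure, so that new status values do not silently pass. The function names and the sample input are illustrative.

```python
import json


def read_validation_report(ndjson_text):
    """Split NDJSON output into (file_records, summary_record_or_None)."""
    files, summary = [], None
    for raw in ndjson_text.splitlines():
        if not raw.strip():
            continue
        record = json.loads(raw)
        if record.get("type") == "file":
            files.append(record)
        elif record.get("type") == "summary":
            summary = record
        # Records with any other type are ignored.
    return files, summary


def failing_files(files):
    """Paths whose status is anything other than "valid"."""
    return [record["file"] for record in files if record["status"] != "valid"]


report = "\n".join([
    '{"type":"file","file":"/corpus/a.cha","status":"valid","cache_hit":false}',
    '{"type":"file","file":"/corpus/b.cha","status":"invalid","error_count":1,'
    '"errors":[{"code":"E502","message":"Missing @End header at end of file","severity":"Error"}]}',
    '{"type":"summary","directory":"/corpus","total_files":2,"valid":1,"invalid":1,'
    '"parse_errors":0,"cache_hits":0,"cache_misses":2,"cache_hit_rate":0.0,"cancelled":false}',
])
files, summary = read_validation_report(report)
print(failing_files(files), summary["total_files"])
```

Keying on the code field rather than message follows the stability policy: message wording is human-facing and may change.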

chatter to-json

chatter to-json emits the full ChatFile JSON model rather than a diagnostic summary. The authoritative contract for that output is the JSON Schema documented in JSON Schema.

Practical notes:

  • The JSON itself is the contract, not any validation status lines printed by the CLI.
  • Use -o/--output if you want only the JSON in a file.
  • Use --skip-validation, --skip-alignment, or --skip-schema-validation only when you explicitly want to bypass those checks.

chatter cache stats --json

Cache statistics emit one JSON object on stdout:

{
  "total_entries": 743,
  "cache_dir": "/Users/example/Library/Caches/talkbank-chat",
  "cache_size_bytes": 274432,
  "last_modified": "2026-03-09T13:05:31+00:00"
}

Contract notes:

  • total_entries, cache_dir, cache_size_bytes, and last_modified are stable.
  • last_modified is RFC 3339 / ISO 8601 text.
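As an illustration of consuming this object, the sketch below parses the sample shown above and converts last_modified to a timezone-aware datetime. It assumes the timestamp carries a numeric offset as in the sample; Python versions before 3.11 do not accept a trailing Z in datetime.fromisoformat.

```python
import json
from datetime import datetime

raw = """
{
  "total_entries": 743,
  "cache_dir": "/Users/example/Library/Caches/talkbank-chat",
  "cache_size_bytes": 274432,
  "last_modified": "2026-03-09T13:05:31+00:00"
}
"""

stats = json.loads(raw)
# RFC 3339 text with a numeric offset parses into an aware datetime.
last_modified = datetime.fromisoformat(stats["last_modified"])
size_kib = stats["cache_size_bytes"] / 1024
```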

Merge Override File Format

Status: Draft Last updated: 2026-06-15 15:00 EDT

The merge override file is the typed, human-readable record of operator decisions in the chatter speaker-id and chatter merge pipeline. It serves three purposes:

  1. Persistence: operator adjudications made for one batch can be replayed on later runs without re-prompting (chatter speaker-id --override-file <FILE> --session-id <ID>).
  2. Audit trail: each entry records who decided what, when, and on the basis of which Jaccard scores. Years later, a researcher can answer “why was PAR0 labeled INV in this session?” by reading the file.
  3. Interchange: an adjudication UI (CLI, future web app) and the batch pipeline share the same file format; UI tools can be added or replaced without changing the on-disk contract.

This page is the authoritative reference for the file’s schema. For the usage contract (which commands read/write it, when, why), see chatter speaker-id.

File location and naming

The file’s location is caller-chosen. The convention is one file per donor batch, named for the batch:

batch-2026-05-27-childes-eng.overrides.toml
batch-2026-06-15-fluency-pilot.overrides.toml
batch-2026-08-22-aphasiabank-bilingual.overrides.toml

Pipeline operators pass the path explicitly via --override-file; there is no implicit search of a default location.

File format

UTF-8 TOML. The file has exactly one top-level key, schema_version, followed by zero or more session entries, each keyed by a session ID.

schema_version = 1

[<session_id_1>]
mode = "auto"
# ... fields per entry ...

[<session_id_2>]
mode = "explicit"
# ... fields per entry ...

The session ID is the table name. It is a free-form stable string, typically the basename stem of the CHAT file the entry applies to (s12-t1, Corpus2024-session-07, etc.). The TOML parser treats it as a key; CHAT-conformant identifiers fit the unquoted-key grammar and need no escaping, but any string is permitted if it conforms to TOML key syntax (use a quoted key such as "session 07.v2" if the ID contains characters outside the bare-key set A-Z, a-z, 0-9, _ and -, for example a space or a dot).

Top-level fields

| Field | Type | Required | Meaning |
| --- | --- | --- | --- |
| schema_version | unsigned integer | yes | The schema version this file conforms to. Currently 1. Readers refuse files with any other value. |

The reader refuses files with schema_version absent or unknown, returning a typed error (OverrideFileError::UnsupportedSchemaVersion). There is no implicit version, no fallback, no auto-migration. Operators of a file written by a newer version of chatter must upgrade their binary; operators of a file written by an older version that the current binary no longer supports must re-adjudicate. This policy is documented in architecture/merge-domain-types.md §6; its rationale is to keep the schema honest and avoid premature migration code that might silently misinterpret old data.

Per-session entry fields

Each [<session_id>] table contains the fields below. Required fields must be present and well-typed; optional fields may be omitted; unknown fields cause a parse error.

Required fields

| Field | Type | Meaning |
| --- | --- | --- |
| mode | string enum | One of "auto", "explicit", "override". How the decision was made; see “Mode semantics” below. |
| inserted_role | inline table | The CHAT identity assigned to every speaker whose mapping action is "rename". Fields: code (string, CHAT speaker code), tag (string, CHAT role-tag). |
| mapping | inline table | Map from input speaker codes to actions. Keys are speaker codes; values are "rename" or "drop". Every speaker that exists in the input CHAT file must appear in mapping. |
| operator | string | Free-form identifier of the person who created the entry (username, initials, email prefix). Recorded as audit trail. |
| decided_at | RFC 3339 datetime | When the decision was made. Must include a time zone (UTC recommended). |

Optional fields

| Field | Type | Default | Meaning |
| --- | --- | --- | --- |
| scores | inline table | {} | Per-speaker Jaccard scores recorded at decision time. Keys are speaker codes; values are floats in [0.0, 1.0]. Populated when the decision was based on a reference-mode auto attempt (even if the final mode is "explicit" because the operator overrode a low-confidence result). |
| margin | float or string | absent | The decisive margin (winner-score / loser-score). Finite values serialize as numbers; the divide-by-zero case (loser score = 0) serializes as the string "unbounded". |
| note | string | "" | Free-text operator note. Strongly recommended for "explicit" and "override" modes; it captures why the operator made the call. |
| flags | array of strings | [] | Operator-supplied flags marking unusual situations. Known values listed in “Flag vocabulary” below; unknown strings are preserved verbatim (treated as Custom). |
| engine | string enum | "deterministic" | Which engine produced the decision. Always written on new entries; absent only in pre-provenance files, which read as "deterministic". One of "deterministic" (Jaccard reference-mode, spreadsheet, or operator adjudication) or "llm" (language-model judgment). |
| judgment | inline table | absent | LLM audit trail. Present only when engine = "llm"; omitted for deterministic decisions. Sub-fields documented below. |
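The margin rule (winner-score / loser-score, with "unbounded" standing in for division by zero) can be sketched as follows. This is an illustration, not chatter's implementation: the function name is hypothetical, and it assumes exactly two scored speakers, which is the case in every example on this page.

```python
def decisive_margin(scores):
    """Return winner/loser as a float, or "unbounded" when the loser scored 0."""
    if len(scores) != 2:
        # Assumption of this sketch: the margin compares exactly two speakers.
        raise ValueError("expected exactly two scored speakers")
    loser, winner = sorted(scores.values())
    if loser == 0:
        # A zero loser score (including the 0/0 case) has no finite ratio.
        return "unbounded"
    return winner / loser


close_call = decisive_margin({"PAR0": 0.6286, "PAR1": 0.3457})
no_contest = decisive_margin({"PAR0": 0.0, "PAR1": 0.5})
```

With the scores from the operator-adjudicated example below, close_call rounds to the recorded margin of 1.82.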

judgment sub-table fields

The judgment inline table records the audit trail for LLM-produced decisions. It is present if and only if engine = "llm".

| Field | Type | Required | Meaning |
| --- | --- | --- | --- |
| model | string | yes | Model identifier used for the judgment (e.g. "deepseek-v4-flash"). |
| endpoint | string | yes | OpenAI-compatible base URL the judgment was made against. |
| prompt_version | string | yes | Prompt-template version tag (e.g. "v1"). Bumping this marks older entries as produced by a prior template. |
| confidence | inline table | no (omitted when empty) | Per-field model confidence in [0.0, 1.0]. Keys are decision field names (e.g. "mapping", "roles", "merge_applicable"). Omitted entirely when no confidence values were reported. |
| reasoning | string | yes | One or two sentence model rationale for the decision. |
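External tooling that inspects entries may want to check the engine and judgment rules together. The sketch below works on an already-parsed entry (a plain dict); check_provenance is a hypothetical helper, and the endpoint URL is a placeholder rather than a real service.

```python
REQUIRED_JUDGMENT_KEYS = {"model", "endpoint", "prompt_version", "reasoning"}


def check_provenance(entry):
    """Return a list of problems with the engine/judgment fields of one entry."""
    problems = []
    # An absent engine reads as "deterministic" (pre-provenance files).
    engine = entry.get("engine", "deterministic")
    judgment = entry.get("judgment")
    if engine not in ("deterministic", "llm"):
        problems.append(f"unknown engine: {engine!r}")
    if (engine == "llm") != (judgment is not None):
        problems.append("judgment must be present if and only if engine is llm")
    if judgment is not None:
        for key in sorted(REQUIRED_JUDGMENT_KEYS - set(judgment)):
            problems.append(f"judgment is missing {key}")
        for field, value in judgment.get("confidence", {}).items():
            if not 0.0 <= value <= 1.0:
                problems.append(f"confidence for {field} is outside [0.0, 1.0]")
    return problems


legacy_entry = {"mode": "auto"}  # no engine key: treated as deterministic
llm_entry = {
    "engine": "llm",
    "judgment": {
        "model": "deepseek-v4-flash",
        "endpoint": "https://llm.example.invalid/v1",  # placeholder URL
        "prompt_version": "v1",
        "confidence": {"mapping": 0.9},
        "reasoning": "PAR1 introduces herself as the clinician.",
    },
}
```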

Mode semantics

The mode field records how the decision was made and is informational only at read time: every mode applies the same mapping deterministically. Distinguishing modes matters for audit purposes.

| Mode | Set when | Operator confidence |
| --- | --- | --- |
| "auto" | chatter speaker-id ran in reference mode, Jaccard margin was at or above --confidence-threshold, and the operator did not intervene. | High; the algorithm picked. |
| "explicit" | The operator supplied --mapping directly, typically after a prior reference-mode attempt failed at the confidence threshold. | Operator made the call; confidence depends on what evidence they used (listening to audio, contributor data sheet, prior knowledge). |
| "override" | The entry was created by reading a prior override file (replay). | Inherited from whichever prior decision the entry was first stamped with. The mode is updated to "override" whenever a replay re-writes the entry. |

The reader does not enforce mode → field correlations (e.g., it does not require scores to be present when mode = "auto"). The writer follows these conventions:

  • "auto" entries always include scores and margin.
  • "explicit" entries include scores and margin if and only if a prior reference-mode attempt produced them; otherwise they are absent.
  • "override" entries preserve whatever scores, margin, and note were in the source file.

Mapping semantics

Each entry in mapping is one of:

  • "rename": the speaker is renamed to inserted_role.code with role tag inserted_role.tag in the output CHAT file. Every utterance for this speaker has its *CODE: prefix rewritten; the @Participants entry for this speaker has its code + role-tag rewritten (preserving any intervening name); the @ID row’s code (field 3) and role (field 8) are rewritten.
  • "drop": the speaker’s utterances are removed from the output entirely. The speaker’s @Participants entry and @ID row are also removed.

Precondition. Every speaker that appears in the input CHAT file must appear in mapping. There is no defaulting; omission is rejected with SpeakerIdError::SpeakerNotInMapping { speaker }. This is by design: every decision must be explicit, so a future reader knows that no speaker was silently passed through.

The reader rejects:

  • Mapping entries whose key is not a speaker present in the input (SpeakerIdError::MappingSpeakerNotInInput).
  • Mapping values other than "rename" or "drop" (TOML parse error from the typed deserializer).
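The precondition and the two rejection rules amount to a set comparison between the speakers in the input and the keys of mapping. A minimal sketch, assuming the input speaker codes have already been extracted from the CHAT file; check_mapping is a hypothetical helper and its messages only echo the error names mentioned above.

```python
def check_mapping(input_speakers, mapping):
    """Return problems found when comparing a mapping to the input speakers."""
    speakers = set(input_speakers)
    keys = set(mapping)
    problems = []
    # Every input speaker needs an explicit decision; there is no default.
    for speaker in sorted(speakers - keys):
        problems.append(f"SpeakerNotInMapping: {speaker}")
    # Keys that name no input speaker are rejected as well.
    for speaker in sorted(keys - speakers):
        problems.append(f"MappingSpeakerNotInInput: {speaker}")
    for speaker, action in sorted(mapping.items()):
        if action not in ("rename", "drop"):
            problems.append(f"invalid action for {speaker}: {action!r}")
    return problems


ok = check_mapping(["PAR0", "PAR1"], {"PAR0": "rename", "PAR1": "drop"})
missing = check_mapping(["PAR0", "PAR1"], {"PAR0": "rename"})
extra = check_mapping(["PAR0"], {"PAR0": "drop", "PAR9": "drop"})
```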

Flag vocabulary

The flags array contains zero or more string values. The following are recognized vocabulary; consumers MAY treat them specially:

| Flag | Meaning |
| --- | --- |
| "diarization-mixed" | The ASR diarization label being renamed actually contains multiple real-world speakers (e.g., clinician + parent collapsed). The rename is the best available approximation; downstream consumers should know the output is imperfect. |
| "best-guess" | The operator could not confidently determine which speaker is which (e.g., from audio alone). The mapping is recorded as best-guess and merits review by a domain expert before publication. |

Any other string is preserved verbatim as a contributor-specific flag (Custom(String) in the Rust type). Consumers SHOULD NOT crash on unknown flags but MAY surface them in audit-trail displays.

The order of flags within an entry is not semantically meaningful; duplicates are tolerated but considered noise. Tooling that modifies the list SHOULD deduplicate.
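Tooling that edits the flags list could deduplicate while keeping first-seen order and separate recognized flags from custom ones. The helper names below are hypothetical, and "site-x-review" is a made-up custom flag used only for illustration.

```python
KNOWN_FLAGS = {"diarization-mixed", "best-guess"}


def dedupe_flags(flags):
    """Drop repeated flags, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for flag in flags:
        if flag not in seen:
            seen.add(flag)
            unique.append(flag)
    return unique


def split_flags(flags):
    """Separate recognized flags from custom strings, preserved verbatim."""
    known = [flag for flag in flags if flag in KNOWN_FLAGS]
    custom = [flag for flag in flags if flag not in KNOWN_FLAGS]
    return known, custom


flags = dedupe_flags(["best-guess", "site-x-review", "best-guess"])
known, custom = split_flags(flags)
```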

Reader semantics

OverrideFile::read(path) is the canonical reader. Its behavior:

  1. Open path and read it as UTF-8.
  2. Parse via toml.
  3. Refuse if schema_version is absent or not equal to the binary’s CURRENT_SCHEMA_VERSION (currently 1). Error: OverrideFileError::UnsupportedSchemaVersion { found, supported }.
  4. Parse all [<session_id>] tables into MergeOverride values; reject unknown fields.
  5. Return OverrideFile { schema_version, entries }.

OverrideFile::read_or_default(path) is the variant used by chatter speaker-id --write-override: if the file does not exist, returns OverrideFile::default() (empty, current schema version); otherwise behaves as read.

OverrideFile::get(&session_id) retrieves a single entry; returns None if absent.

Writer semantics

OverrideFile::write(path) serializes the file deterministically:

  • Top-level field order: schema_version first.
  • Entries ordered by session ID alphabetically (BTreeMap default).
  • Per-entry field order: mode, inserted_role, mapping, scores, margin, operator, decided_at, note, flags, engine, judgment.
  • Optional fields omitted when empty / absent.
  • Atomic replace: writes to <path>.tmp then renames over <path> to avoid leaving a partial file on crash.
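The atomic-replace step in the last bullet is the common write-to-temporary-then-rename pattern. A sketch of that pattern in Python follows; the function name is hypothetical, and the fsync call is an addition of this sketch that the page does not claim for chatter's writer.

```python
import os
import tempfile


def write_atomically(path, text):
    """Write text to <path>.tmp, then rename it over <path>."""
    tmp_path = f"{path}.tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        handle.write(text)
        handle.flush()
        os.fsync(handle.fileno())  # sketch-only: push bytes to disk first
    # os.replace overwrites the destination in one step, so readers see
    # either the old file or the new one, never a partial write.
    os.replace(tmp_path, path)


with tempfile.TemporaryDirectory() as directory:
    target = os.path.join(directory, "demo.overrides.toml")
    write_atomically(target, "schema_version = 1\n")
    with open(target, encoding="utf-8") as handle:
        written = handle.read()
    leftover_tmp = os.path.exists(target + ".tmp")
```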

chatter speaker-id --write-override <path> adds or updates a single entry: it reads the file (or starts empty), inserts or updates the entry for the current session, and writes back. The session ID defaults to the input CHAT file’s basename stem unless overridden via --session-id.
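The insert-or-update step and the session ID default can be pictured as follows, using a plain dict in place of the parsed file. Both helper names and the sample paths are hypothetical; sorting the keys mirrors the session-ID ordering the writer applies.

```python
from pathlib import Path


def resolve_session_id(chat_path, session_id=None):
    """Use the explicit session ID if given, else the file's basename stem."""
    return session_id if session_id is not None else Path(chat_path).stem


def upsert_entry(entries, chat_path, entry, session_id=None):
    """Return a copy of entries with one entry inserted or replaced."""
    updated = dict(entries)
    updated[resolve_session_id(chat_path, session_id)] = entry
    # Keep entries ordered by session ID, as the writer does on output.
    return dict(sorted(updated.items()))


existing = {"s12-t2": {"mode": "auto"}}
entries = upsert_entry(existing, "/data/batch/s12-t1.cha", {"mode": "explicit"})
renamed = upsert_entry(existing, "/data/batch/s12-t1.cha", {"mode": "explicit"},
                       session_id="custom-id")
```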

Example: minimal auto-mode entry

schema_version = 1

[session-101-t1]
mode = "auto"
inserted_role = { code = "INV", tag = "Investigator" }
mapping = { PAR0 = "rename", PAR1 = "drop" }
scores = { PAR0 = 0.1931, PAR1 = 0.7347 }
margin = 3.81
operator = "alice"
decided_at = 2026-05-27T08:41:00-04:00

The reader reconstructs: child speaker was PAR1 (high Jaccard match with reference’s CHI); auto-decide succeeded with margin 3.81×; PAR0 becomes INV:Investigator in the output.

Example: operator-adjudicated entry

After a low-confidence refusal, the operator listened to the audio, confirmed the call, and re-ran with --mapping:

[session-102-t1]
mode = "explicit"
inserted_role = { code = "INV", tag = "Investigator" }
mapping = { PAR0 = "drop", PAR1 = "rename" }
scores = { PAR0 = 0.6286, PAR1 = 0.3457 }
margin = 1.82
operator = "alice"
decided_at = 2026-05-27T11:15:00-04:00
note = "Auto refused at 2.0× threshold. Listened to first 60 seconds; PAR0 produces child-content matching the hand transcript. PAR1 introduces herself as the clinician."

The scores from the prior auto attempt are preserved; the note captures why the operator was confident in the call despite the close margin. Years later, a researcher can verify by listening to the same 60 seconds and confirming the operator’s observation; the audit trail is reproducible.

Example: diarization-mixed parent sample

[session-103-t1-parent]
mode = "explicit"
inserted_role = { code = "MOT", tag = "Mother" }
mapping = { PAR0 = "rename", PAR1 = "drop" }
scores = { PAR0 = 0.3727, PAR1 = 0.6940 }
margin = 1.86
operator = "alice"
decided_at = 2026-05-27T11:22:00-04:00
note = "Parent sample. Per contributor data sheet: mother. PAR0 contains clinician intro + parent mixed (Batchalign diarization limitation)."
flags = ["diarization-mixed"]

The flags = ["diarization-mixed"] entry warns downstream consumers that the renamed MOT speaker is not a clean parent-only stream: the first ~15 seconds were the clinician giving setup instructions before leaving the room. The note captures the specifics for future review.

Example: replayed entry

The same session, replayed on a different day from the override file:

[session-102-t1]
mode = "override"
inserted_role = { code = "INV", tag = "Investigator" }
mapping = { PAR0 = "drop", PAR1 = "rename" }
scores = { PAR0 = 0.6286, PAR1 = 0.3457 }
margin = 1.82
operator = "alice"
decided_at = 2026-05-27T11:15:00-04:00
note = "Auto refused at 2.0× threshold. Listened to first 60 seconds; PAR0 produces child-content matching the hand transcript. PAR1 introduces herself as the clinician."

mode becomes "override" whenever the entry is re-applied by reading the file. The other fields (including the original operator and decided_at) are preserved; the override file is the audit trail of the original decision, not of the replay.

TOML grammar reference

For consumers writing the file by hand or generating it from other tools, the grammar is standard TOML 1.0 (toml.io) with the following domain-specific conventions:

  • Datetimes use RFC 3339 with an explicit time zone. The UTC designator Z and numeric offsets like -04:00 are both accepted.
  • Floats: standard TOML float syntax. The margin field accepts either a float or the string "unbounded".
  • Tables vs inline tables: top-level [<session_id>] tables may use either standard or inline syntax; the writer emits standard tables for readability.
  • Comments: TOML # line comments are permitted anywhere; the reader ignores them. The writer does not preserve comments across read-modify-write cycles (toml, not toml_edit); hand-edited comments may be lost on subsequent --write-override runs. If preserving comments becomes important, the writer can be swapped for toml_edit in a future release.

Future schema changes

Schema version increments will appear here under “Migration” with the version-to-version diff and migration instructions. Until then, this is the only schema; the policy is strict refuse-with-clear-error on any other schema_version value.

2026-06 additive fields: engine and judgment (no version bump)

The engine and judgment fields were added in 2026-06 to record decision provenance (deterministic vs LLM). This addition did NOT increment schema_version because both fields are backward compatible in both directions:

  • Old reader, new file: serde’s deny_unknown_fields attribute is not set globally; older binaries that parse a file containing engine and judgment will silently ignore the unknown keys. The decision itself (mode, mapping, inserted_role) is unaffected.
  • New reader, old file: engine has #[serde(default)] and defaults to "deterministic"; judgment is an optional field (written with skip_serializing_if = "Option::is_none"), so when it is absent it deserializes as None. Pre-provenance files are therefore readable without error and are treated as deterministic decisions.

A future version bump would be warranted only if a change makes old files unreadable or misinterpretable, neither of which applies here.

Relationship to JSON Schema

The Rust OverrideFile type is implemented (in talkbank-transform, src/speaker_id/override_file.rs) and drives the override-file replay workflow today. What is not yet built is its JSON Schema export: OverrideFile does not yet derive schemars::JsonSchema, so no schema is generated, and the canonical URL https://talkbank.org/schemas/v0.1/merge-overrides.json is reserved but not yet published. Exposing it follows the same schemars-based generator pattern documented in JSON Schema.

The TOML form is the on-disk format; JSON Schema is the machine-readable spec for external tooling. Both describe the same OverrideFile Rust type.