Introduction
Status: Current Last modified: 2026-06-21 21:33 EDT
TalkBank is the world’s largest open repository of spoken language data. This repository (TalkBank/chatter) is the standalone home of the CHAT format authority and the chatter tool family: the chatter CLI, the Rust crates for parsing/validation/transformation, the tree-sitter-talkbank grammar, the talkbank-lsp language server, and the desktop validation app.
chatter is publicly released. To get it right away:
- Command-line tool (macOS / Linux):
  curl --proto '=https' --tlsv1.2 -LsSf https://github.com/TalkBank/chatter/releases/latest/download/chatter-installer.sh | sh
  (Windows and other options: Install.)
- Desktop app: download for your platform from the latest release.
- Full installation guide (all platforms, package details): Install.
The Rust crates are source-available from this repository (not yet published to crates.io). As a 0.x release, APIs and flags may change before 1.0.
Choose the right surface
| Task | Recommended Surface |
|---|---|
| CHAT validation, normalization, or conversion | chatter CLI |
| LSP integration in editors | talkbank-lsp standalone |
| Build CHAT tooling in Rust | Rust crates (talkbank-model, talkbank-parser, etc.) |
| Reuse grammar in other tools | tree-sitter-talkbank |
| Standalone desktop GUI for CHAT validation | Chatter Desktop (apps/chatter-desktop/) |
What’s In This Repo
- chatter CLI: validate, convert, normalize, and analyze CHAT files from the command line, with an interactive TUI for corpus-scale workflows
- Language Server (LSP): works with any LSP-compatible editor (Neovim, Emacs, Helix, Zed, etc.) to provide live validation and cross-tier alignment
- JSON data model: every CHAT structure as typed JSON with lossless roundtrip fidelity, backed by a published JSON Schema
- Rust API: parse, validate, inspect, and transform CHAT files programmatically via library crates
Who This Book Is For
| Audience | Start Here | Then Go To |
|---|---|---|
| CLI users validating, normalizing, or converting CHAT | Install | chatter Quick Start, CLI Reference |
| Rust library consumers parsing or transforming CHAT | Library Usage | crate-root rustdoc for talkbank-model, talkbank-parser, and talkbank-transform |
| Grammar / format consumers embedding CHAT parsing in other tools | CHAT Format Overview | tree-sitter-talkbank docs and the grammar/reference chapters |
| Contributors / maintainers working in this repo | Contributing setup | CI and release |
Repository Layout
grammar/ Tree-sitter grammar for CHAT
spec/ Source of truth: CHAT specification + error specs
crates/ Rust crates for model, parser, transform, cache, CLI, LSP, tests, and FFI support
apps/ Tauri v2 desktop app (`chatter-desktop`)
corpus/ Reference corpus (must stay 100% valid under the regression gate)
schema/ JSON Schema for the CHAT AST
tests/ Integration tests and fixtures
fuzz/ Fuzz testing targets (separate Cargo workspace)
docs/ Strategy docs, proposals, and investigations for this repo
book/ This documentation (mdBook)
Data flows: spec (source of truth) → grammar (tree-sitter) → Rust crates (parsers, model, validation, CLI, LSP) → applications (chatter, desktop app).
Install
Status: Current Last modified: 2026-06-21 21:33 EDT
Installation paths for each surface of chatter. Pick the row that matches what you want to do and the audience you belong to.
| If you want to… | Use this surface | Start here |
|---|---|---|
| Validate, normalize, convert, or batch-process CHAT files | chatter CLI | CLI installation |
| Embed the Rust crates in another program | Rust libraries | Library usage |
| Reuse the grammar in editor or parser tooling | tree-sitter-talkbank | crate docs plus the CHAT format overview |
chatter is publicly released: the CLI and desktop app are available from the
latest GitHub release.
The Rust crates and grammar are source-available from this repository (not yet
published to crates.io). As a 0.x release, APIs and flags may change before 1.0.
For audio + ML pipelines (transcribe, force-align, morphotag,
benchmark), see the upstream batchalign3 project, which lives
outside the chatter repo and has its own installation flow.
Quickstart
Status: Current Last modified: 2026-06-21 21:33 EDT
Task-driven entry points. Pick the row that matches what you want to do today; each path starts at the narrowest useful documentation surface instead of dropping you into the whole book.
| Today’s goal | Best first page | Surface |
|---|---|---|
| Validate / normalize / convert existing CHAT | chatter Quick Start | CLI |
| Add CHAT parsing/validation to a Rust program | Library Usage | Rust crates |
To download and install chatter, see Install.
For audio + ML workflows (transcribe / align media → CHAT), see the
upstream batchalign3 project, outside the chatter repo.
Installation
Status: Current Last modified: 2026-06-16 07:55 EDT
chatter targets Windows, macOS, and Linux. There are two ways to
install it: the prebuilt binaries (recommended for most people,
including clinicians and researchers) and a from-source build (for
contributors or unsupported platforms).
Prebuilt binaries (recommended)
Every GitHub Release attaches prebuilt binaries for macOS (Apple Silicon and Intel), Linux (x86_64 and ARM64), and Windows (x64), plus desktop-app installers.
chatter CLI
One-line installers (they download the binary for your platform, place
it on your PATH, and also install the chatter-update self-updater):
- macOS and Linux:
  curl --proto '=https' --tlsv1.2 -LsSf https://github.com/TalkBank/chatter/releases/latest/download/chatter-installer.sh | sh
- Windows (PowerShell):
  powershell -ExecutionPolicy Bypass -c "irm https://github.com/TalkBank/chatter/releases/latest/download/chatter-installer.ps1 | iex"
On Windows the binary is not yet code-signed, so SmartScreen may warn on first run: choose More info, then Run anyway. The macOS binaries are code-signed, and the installer above does not set the quarantine attribute, so Gatekeeper does not prompt.
Prefer a manual download? Grab the archive for your platform from the
latest release
and extract chatter onto your PATH. (On macOS, a browser-downloaded
archive is quarantined; right-click the binary and choose Open once,
or run xattr -d com.apple.quarantine ./chatter.)
Verify:
chatter --version
chatter --help
chatter desktop app
The desktop app (“Chatter”) is for people who prefer a window to a terminal. Download the installer for your platform from the latest release:
- macOS: the .dmg is signed and notarized; open it and drag the app to Applications. No Gatekeeper override is required.
- Windows: the installer is not yet signed (same SmartScreen note as above: More info, then Run anyway).
- Linux: an AppImage and a .deb are provided.
Updating chatter
chatter keeps itself current so you do not have to track releases by
hand.
- CLI: run
  chatter update
  This runs the bundled chatter-update program, which checks GitHub Releases and installs the newest release in place. (The self-update facility is experimental. It is installed only by the one-line installers above; if you installed another way, update the same way you installed.)
- Desktop app: the app checks for updates on launch and offers to install a new version when one is available.
From source
Building from source needs only a stable Rust toolchain (install via
rustup, which supports Windows, macOS, and Linux).
Node.js and the tree-sitter CLI (cargo install tree-sitter-cli) are
needed only when working on the grammar or generated artifacts.
Clone and install the CLI:
git clone https://github.com/TalkBank/chatter.git
cd chatter
cargo install --path crates/chatter --locked
This installs the chatter binary to ~/.cargo/bin/ (macOS/Linux) or
%USERPROFILE%\.cargo\bin\ (Windows). To update a source install, pull
and re-run the cargo install command above (chatter update is only
for installer-based installs).
Building the libraries
If you are developing with the Rust crates directly, from your chatter checkout root:
cargo build --workspace --all-targets --locked
cargo test --workspace --locked
cargo clippy --all-targets -- -D warnings
See the contributor setup for additional commands.
Directory layout
Everything lives in a single repository:
<your-chatter-checkout>/
├── grammar/ # Tree-sitter grammar
├── crates/ # All Rust crates (talkbank-* + the chatter binary)
├── spec/ # CHAT specification
├── apps/ # Tauri desktop app (chatter-desktop)
└── book/ # Chatter mdBook (this book)
The CLI, grammar, crates, and the LSP/desktop integrations all live in this single repository.
Quick Start
Status: Current Last updated: 2026-06-15 15:00 EDT
This page gets you from zero to productive with chatter in five minutes.
Install chatter first if you haven’t already.
Validate a CHAT file
Check a single transcript for errors:
chatter validate transcript.cha
If the file is valid you get a summary (a cache-statistics block follows it;
use --quiet to suppress all output and rely on the exit code):
=== Summary ===
Total files: 1
Valid: 1
Invalid: 0
If there are problems, you’ll see rich diagnostics with the exact location
and a stable error code. For example, a *CHI: line missing its terminator:
✗ Errors found in transcript.cha
E305 (https://talkbank.org/errors/E305)
× error[E305]: Expected terminator not found (line 6, column 1)
╭─[input:6:1]
6 │ *CHI: hello world
· ─────────┬─────────
· ╰── here
╰────
help: Add a terminator at the end: Standard (. ? !), Interruption
(+... +/. ...), or CA intonation (⇗ ↗ → ↘ ⇘ ...)
Every error code (E305, E705, etc.) is documented with fix guidance in the
validation error reference.
Validate an entire corpus
Point chatter at a directory; it walks the tree recursively, validates in
parallel, and caches results:
chatter validate corpus/
The interactive TUI shows progress and lets you browse errors per file.
Use --format json for machine-readable output, or --quiet for CI
(exit code 1 on errors).
Convert to JSON
Get a structured representation of any CHAT file:
chatter to-json transcript.cha
The output conforms to the TalkBank CHAT JSON Schema.
Convert back with chatter from-json.
Watch for changes
Edit a file and get live validation feedback:
chatter watch transcript.cha
Every time you save, chatter re-validates and shows updated diagnostics.
What next?
- CLI Reference: all commands, flags, and output formats
- Validation Errors: every error code, with examples and fix guidance
- Batch Workflows: corpus-scale validation and analysis
CLI Reference
Status: Current Last modified: 2026-06-15 15:00 EDT
The chatter CLI is the primary command-line surface for the TalkBank CHAT toolchain.
The following diagram shows the command dispatch structure. Each top-level command dispatches to a handler in the corresponding crate.
flowchart TD
chatter(["chatter"])
chatter --> validate["validate\n(chatter)"]
chatter --> normalize["normalize\n(chatter)"]
chatter --> tojson["to-json\n(talkbank-transform)"]
chatter --> fromjson["from-json\n(talkbank-transform)"]
chatter --> showalign["show-alignment\n(chatter)"]
chatter --> watch["watch\n(chatter)"]
chatter --> lint["lint\n(chatter)"]
chatter --> clean["clean\n(chatter)"]
chatter --> newfile["new-file\n(chatter)"]
chatter --> cache["cache\n(stats, clear)"]
chatter --> schema["schema\n(JSON Schema output)"]
chatter --> debug["debug\n(overlap-audit, linker-audit,\nfind, sanitize, fix-s)"]
chatter --> merge["merge\n(experimental)"]
chatter --> speakerid["speaker-id\n(experimental)"]
chatter --> adjudicate["adjudicate\n(experimental)"]
chatter --> pipeline["pipeline\n(experimental)"]
chatter --> batch["batch\n(experimental)"]
chatter --> sanityscan["sanity-scan\n(experimental)"]
Top-Level Commands
chatter validate PATH...
chatter normalize INPUT
chatter to-json INPUT
chatter from-json INPUT
chatter to-xml INPUT
chatter show-alignment INPUT
chatter watch PATH
chatter lint PATH
chatter clean PATH
chatter new-file
chatter cache stats
chatter cache clear --prefix PATH
chatter schema
chatter debug ...
chatter merge FILE1 FILE2 # experimental: combine two transcripts
chatter speaker-id INPUT # experimental
chatter adjudicate ... # experimental
chatter pipeline ... # experimental
chatter batch ... # experimental
chatter sanity-scan ... # experimental
Use chatter --help or chatter <command> --help for the exact live surface.
validate
Validate CHAT file(s) or directory tree(s). Accepts multiple paths.
Usage: chatter validate [OPTIONS] <PATH>...
chatter validate file.cha # single file
chatter validate file1.cha file2.cha file3.cha # multiple files
chatter validate corpus/ # directory (recursive, parallel)
chatter validate file.cha corpus/ other.cha # mix of files and directories
chatter validate corpus/ -f json # structured JSON output
chatter validate corpus/ --force # ignore cache, revalidate everything
chatter validate corpus/ --force --audit out.jsonl # bulk audit to JSONL file
chatter validate corpus/ --suppress xphon # suppress named error group
chatter validate corpus/ --suppress E726,E727 # suppress specific error codes
chatter validate corpus/ -j 8 # use 8 parallel workers
chatter validate corpus/ --max-errors 50 # stop after 50 errors
Options:
| Flag | Description |
|---|---|
| -f, --format text\|json | Output format (default: text) |
| --list-checks | Print every validation check with Active/Planned status, then exit (no <PATH> required) |
| --skip-alignment | Skip dependent-tier alignment checks |
| --force | Ignore cache, revalidate all files |
| -j, --jobs N | Parallel workers for directory mode (default: CPU count) |
| --quiet | Only emit errors, suppress success messages |
| --max-errors N | Stop after N errors across all files |
| --roundtrip | Test serialization idempotency (developer tool) |
| --parser tree-sitter\|re2c | Parser backend (default: tree-sitter; re2c is opt-in for faster batch validation) |
| --strict-linkers | Enable strict cross-utterance linker pairing checks (E351-E355); off by default |
| --check-xphon | Re-enable %xphon* cross-tier alignment checks (E725-E728); skipped by default |
| --audit FILE | Stream errors to JSONL file (bulk audit mode) |
| --suppress CODES | Suppress error codes or groups (comma-separated) |
Suppress groups: xphon expands to E725/E726/E727/E728
(%xphosyl/%xphoaln/%xmodsyl cross-tier alignment). These are
suppressed by default since 2026-04-21; pass --check-xphon to
include them. The --suppress flag can mix groups and codes:
--suppress xphon,E316.
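The group expansion described above can be sketched in a few lines of shell. This is only an illustration of the documented behavior (xphon standing for E725 through E728); the function name expand_suppress is hypothetical, and chatter's own implementation may differ.

```sh
# Illustrative only: expand_suppress is a hypothetical helper, not a chatter command.
# It expands the documented "xphon" group and passes plain codes through unchanged.
expand_suppress() {
    result=""
    old_ifs=$IFS
    IFS=,
    for item in $1; do
        case $item in
            xphon) item="E725,E726,E727,E728" ;;
        esac
        # Join with commas, without a leading comma on the first item.
        result="${result:+$result,}$item"
    done
    IFS=$old_ifs
    echo "$result"
}

expand_suppress "xphon,E316"   # prints E725,E726,E727,E728,E316
```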
normalize
Serialize a CHAT file into canonical formatting.
chatter normalize input.cha
chatter normalize input.cha -o normalized.cha
chatter normalize input.cha --validate
chatter normalize input.cha --validate --skip-alignment
Flags:
- -o, --output <PATH>: write to a file instead of stdout.
- --validate: validate (including alignment by default) before writing the normalized output.
- --skip-alignment: when paired with --validate, skip the dependent-tier alignment checks (still validates the rest).
normalize writes to stdout unless you pass -o/--output. There is no --in-place flag.
JSON Conversion
# Single file
chatter to-json input.cha # pretty-printed JSON to stdout
chatter to-json input.cha --compact # minified JSON to stdout
chatter to-json input.cha -o output.json # JSON to file
# Directory (recursive, preserves structure)
chatter to-json corpus/ --output-dir json/ # incremental by default (mtime check)
chatter to-json corpus/ --output-dir json/ --compact # minified output (saves disk)
chatter to-json corpus/ --output-dir json/ --force # full rebuild
chatter to-json corpus/ --output-dir json/ --prune # remove orphaned .json files
chatter to-json corpus/ --output-dir json/ --jobs 4 # parallel workers
# Reverse and schema
chatter from-json input.json -o output.cha
chatter schema
chatter schema --url
Single-file mode: to-json validates by default. Use --skip-validation,
--skip-alignment, or --skip-schema-validation to bypass checks.
Directory mode: Walks recursively, converting each .cha to .json under --output-dir
with the same relative path. Incremental by default: skips files whose JSON is
already newer than the source. Use --force to rebuild all. Use --prune to remove
.json files with no matching .cha (handles renames/deletions). Use --jobs N for
parallel conversion (defaults to number of CPUs).
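The incremental rule above (skip a file when its JSON is already newer than the source) can be pictured with a small shell predicate. This is a sketch of the rule as stated on this page, not chatter's actual code; needs_rebuild is a hypothetical name.

```sh
# Illustrative only: needs_rebuild is a hypothetical helper.
# Succeeds (exit 0) when the source should be converted again,
# fails (exit 1) when the existing JSON is newer and can be skipped.
needs_rebuild() {
    src=$1
    out=$2
    if [ -e "$out" ] && [ "$out" -nt "$src" ]; then
        return 1   # JSON is newer than the .cha source: skip
    fi
    return 0       # JSON is missing or not newer: convert
}
```

Passing --force corresponds to ignoring this check and converting every file.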
to-xml
Export one CHAT transcript to TalkBank XML. The transcript is validated
before any XML is emitted, so an invalid input fails (exit 1) and writes
nothing to stdout; a failed export never leaves a partial document. This
command is export-only: XML ingest is not implemented, so there is no
from-xml.
chatter to-xml input.cha # XML to stdout
chatter to-xml input.cha -o output.xml # XML to a file
chatter to-xml input.cha --skip-alignment # skip dependent-tier alignment checks
The output is TalkBank XML in the http://www.talkbank.org/ns/talkbank
namespace (referencing talkbank.xsd). Writing to --output prints a
one-line ✓ Converted ... to ... confirmation on stderr; writing to
stdout prints only the XML.
Flags: -o, --output <PATH> (stdout if omitted); --skip-alignment
(disable dependent-tier alignment validation during export).
Editing and Inspection Commands
show-alignment
Print the dependent-tier alignment for a CHAT file (debugging aid).
chatter show-alignment file.cha
chatter show-alignment file.cha -t mor # one tier type
chatter show-alignment file.cha -t gra -c # compact one-line-per-alignment output
Flags: -t/--tier <mor|gra|pho|sin> (omit to show all available
tiers); -c/--compact (one line per alignment).
watch
Watch a CHAT file or directory and re-validate on every save.
chatter watch file.cha
chatter watch corpus/
chatter watch corpus/ --skip-alignment --clear
Flags: --skip-alignment (faster reruns); -c/--clear (clear the
terminal between runs).
lint
Run lint checks and optionally auto-fix.
chatter lint corpus/
chatter lint corpus/ --fix
chatter lint corpus/ --fix --dry-run # preview without modifying files
chatter lint corpus/ --skip-alignment
Flags: --fix (apply fixes); --dry-run (show what would change
without writing); --skip-alignment.
clean
Show the cleaned text for each word (a debugging aid for the text-normalization pipeline).
chatter clean file.cha
chatter clean file.cha --diff-only # only words where raw differs from cleaned
chatter clean file.cha --format json
Flags: --diff-only; --format text|json.
new-file
Create a new minimal valid CHAT file from defaults.
chatter new-file
chatter new-file -o starter.cha --speaker CHI --language eng
chatter new-file -o adult.cha -s MOT -l eng -r Mother
chatter new-file -c brown -u "hello world ."
Flags:
- -o, --output <PATH>: stdout if omitted
- -s, --speaker <CODE>: default CHI
- -l, --language <ISO 639-3>: default eng
- -r, --role <ROLE>: default Target_Child
- -c, --corpus <CORPUS>: corpus identifier in the @ID header (default corpus)
- -u, --utterance <TEXT>: optional initial main-tier utterance content
Cache Commands
chatter cache stats
chatter cache stats --json
chatter cache clear --prefix /path/to/corpus
chatter cache clear --all --dry-run
The validation cache lives under the platform cache directory and stores per-file validation results. validate --force refreshes cache state for the specified path.
debug
Developer / debugging subcommands for CHAT analysis. Not intended
for routine end-user workflows; surface and behavior may change
between releases. Run chatter debug --help for the live list. Current
subcommands include:
- overlap-audit: analyze CA overlap markers (⌈⌉⌊⌋): pairing, temporal consistency, orphans.
- linker-audit: audit linker / special-terminator usage across a corpus (cross-utterance pairing for +<, ++, +^, +", +,, +≋, +≈, plus +..., +/., +//., +"/. etc.).
- find: filter CHAT files by @Languages and body content (token / substring counts) across a corpus tree; emits paths, JSONL, or CSV.
- sanitize: strip contributor lexical content while preserving structure, for protected-corpus debugging. See the Sanitize user-guide page for the full workflow.
- fix-s: normalize whole-utterance same-language @s runs into a [- lang] precode, clear the per-word @s markers (including those on fillers and nonwords), and append any missing explicit @s:LANG codes to @Languages. Trigger conditions and safety rules:
  - Every word-bearing item in the utterance, including fillers (&~, &-, &+), nonwords, and retraced material, must carry an explicit language marker AND every marker must resolve to the same target language. If a single filler such as &~dang3 lacks a marker, the utterance is left untouched (the predicate cannot prove it is monolingual).
  - Bare @s shortcuts on fillers must be cleared when the rewrite fires. A bare @s resolves relative to the surrounding tier language, so adding a [- LANG] precode without clearing the shortcut would flip the filler’s language to the precode target. fix-s clears the shortcut to keep the original meaning intact.
  - The pre-validation rule that catches the unrewritten pattern is E255 (whole-utterance same-language @s run); fix-s is the canonical repair. The companion warn-only E254 reports @s:LANG codes missing from @Languages; fix-s appends them.
  - True no-op on already-correct files: a file is rewritten only when a [- lang] conversion or @Languages repair can be proved necessary.
Merge and Reconciliation Commands (experimental)
These commands combine, reconcile, and relabel CHAT transcripts of the
same recording, in the tradition of CLAN’s reliability and comparison
tools (rely, trnfix). They are experimental and in active
development: flags and behavior may change, and several modes are not
yet complete. Work on copies and validate the output.
| Command | What it does |
|---|---|
| merge | Merge two CHAT transcripts of the same media into one, interleaving by time with explicit per-speaker provenance. Structural only: no ASR, no forced alignment, no content rewriting. |
| speaker-id | Assign CHAT-conformant speaker codes to an anonymously-labeled file, from an explicit mapping or by text similarity against a reference transcript. |
| adjudicate | Resolve pending low-confidence decisions (currently speaker-id) interactively, writing results to an override file. |
| pipeline | Per-session shortcut: run speaker-id in reference mode, then merge. |
| batch | Loop pipeline over matched donor / reference file pairs across two directories. |
| sanity-scan | Post-merge QA: flag sessions whose automatic decisions look suspicious by an out-of-band heuristic, for operator review via adjudicate. |
Full guides: Merge and Speaker ID. The
speaker-id holistic-judgment mode can call an LLM provider
(talkbank-llm) when configured; the deterministic modes need no network
access.
Exit Codes
| Code | Meaning |
|---|---|
| 0 | Success – all files valid, or command completed without errors |
| 1 | Failure – validation errors found, parse errors, or command failed |
| 2 | Usage error – invalid arguments or missing required options (from clap) |
chatter validate exits with code 1 if any file has validation errors
or parse errors. This makes it safe to use in scripts and CI pipelines:
chatter validate corpus/ --quiet --tui-mode disable || echo "Validation failed"
Use --quiet to suppress per-file success output while still relying on
exit codes. Use --format json for machine-readable structured output
(JSON objects go to stdout; exit code still reflects pass/fail).
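As a rough sketch of how a CI script might branch on the exit codes above: the wrapper name validate_gate and its messages are hypothetical; the chatter command line itself is copied from the example in this section.

```sh
# Illustrative only: validate_gate is a hypothetical wrapper, not part of chatter.
validate_gate() {
    target=$1
    status=0
    # Capture the exit code without aborting scripts that run under `set -e`.
    chatter validate "$target" --quiet --tui-mode disable || status=$?
    case $status in
        0) echo "ok: $target is valid" ;;
        1) echo "fail: validation or parse errors in $target" ;;
        2) echo "usage error: check the arguments passed to chatter" ;;
        *) echo "unexpected exit code: $status" ;;
    esac
    return "$status"
}
```

Calling validate_gate corpus/ as a pipeline step then fails the job whenever chatter reports errors, because the function returns chatter's own exit code.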
Output Contracts
- Text output is intended for humans.
- JSON output is intended for automation and downstream tools.
- Error codes and the JSON Schema are documented public contracts; see the Integrating section of this book.
Validation Errors
Status: Current Last modified: 2026-06-17 11:29 EDT
The CHAT validator produces diagnostics at two severity levels: errors (must fix) and warnings (should fix). Each diagnostic has an error code that maps back to a documented spec and validator rule.
chatter validate is the binding judgment on whether a byte sequence is valid CHAT. When it reports an error, the file is invalid CHAT: clean the data rather than working around the check. A warning flags a questionable but parseable construct you should review. Where chatter and an older tool such as CLAN’s check disagree on whether a file is valid, chatter validate is authoritative (see CHECK Parity Audit for how the two are reconciled).
Reading Error Output
The validator emits rich diagnostics that include the error code, a source-pointed snippet, and a suggested fix:
× error[E304]: Missing speaker in main tier (line 15, column 3)
15 │ * hello world .
· ╰── here
╰────
help: Add a speaker code between * and : (e.g., *CHI:)
Each diagnostic contains:
- File path and location (line:column)
- Severity: error or warning
- Error code: E prefix for errors, W prefix for warnings, with a URL pointing at the per-code documentation page
- Message: human-readable description
- Suggestion: actionable fix guidance where available
Error Code Ranges
| Range | Category | Examples |
|---|---|---|
| E1xx | UTF-8 and encoding | E101: Invalid line format |
| E2xx | Word-level content | E202: Missing form type after @, E203: Invalid form type marker, E207: Unknown annotation |
| E3xx | Main tier (speakers, terminators, content) | E301: Empty/missing main tier, E304: Missing speaker, E305: Missing terminator, E306: Empty utterance, E307: Invalid speaker, E308: Undeclared speaker |
| E4xx | Dependent tier structure | E401: Duplicate dependent tier |
| E5xx | Headers | E501: Duplicate header, E504: Missing @Participants, E505: Invalid @ID format |
| E6xx | Dependent tier validation | E601: Invalid dependent tier, E604: %gra without %mor |
| E7xx | Alignment (%mor, %gra, %pho, %wor) | E705: Main/%mor count mismatch, E721: %gra index error |
| W1xx-W6xx | Warnings | W108: BOM detected, W601: Empty user-defined tier |
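When post-processing diagnostics in a script, the ranges in the table can be turned into a lookup. The helper below is a hypothetical sketch that mirrors the table rows only; it is not a chatter command and knows nothing about individual codes.

```sh
# Illustrative only: code_category is a hypothetical helper mirroring the table above.
code_category() {
    case $1 in
        E1[0-9][0-9]) echo "UTF-8 and encoding" ;;
        E2[0-9][0-9]) echo "Word-level content" ;;
        E3[0-9][0-9]) echo "Main tier" ;;
        E4[0-9][0-9]) echo "Dependent tier structure" ;;
        E5[0-9][0-9]) echo "Headers" ;;
        E6[0-9][0-9]) echo "Dependent tier validation" ;;
        E7[0-9][0-9]) echo "Alignment" ;;
        W[1-6][0-9][0-9]) echo "Warning" ;;
        *) echo "Unknown" ;;
    esac
}

code_category E705   # prints Alignment
code_category W108   # prints Warning
```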
Common Errors and Fixes
E256: Curly single quote used as a word character
A curly single quotation mark (U+2018 or U+2019), commonly introduced by
autocorrect or speech-to-text, is not a legal CHAT word character. CHAT words
use the ASCII apostrophe (U+0027, the plain '). For example, a contraction
typed as don + U+2019 + t is rejected; write don't with the ASCII
apostrophe instead. chatter flags the curly form wherever it appears in word
content and points the diagnostic at the exact character. This mirrors CLAN
CHECK errors 138 and 139.
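Before running the validator, a quick way to spot candidate lines is to search for the two curly characters directly. This is a rough pre-check under the assumption that any line containing U+2018 or U+2019 deserves a look; unlike chatter, it does not know whether the character sits in word content. The function name find_curly_quotes is hypothetical.

```sh
# Illustrative only: find_curly_quotes is a hypothetical helper, not a chatter command.
# Prints "line-number:line" for every line containing U+2018 or U+2019.
# Exits non-zero when no such line exists (standard grep behavior).
find_curly_quotes() {
    grep -n -e '‘' -e '’' "$1"
}
```

Treat the output as a list of lines to review by hand; chatter validate remains the check that decides whether a file is valid.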
E304: Missing speaker code
A main tier line must have a speaker code after the *:
*CHI: hello world .
An empty speaker code (*: hello .) triggers E304.
E308: Undeclared speaker
Every *SPEAKER: code must be listed in @Participants. Add the missing speaker to the header:
@Participants: CHI Target_Child, MOT Mother
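The rule behind E308 can be approximated with a short awk pass: collect the codes declared in @Participants, then report any main-tier speaker that is not among them. This is a simplified sketch, assuming a single-line @Participants header whose entries are comma-separated and begin with the speaker code, as in the example above; undeclared_speakers is a hypothetical name and this is not how chatter itself performs the check.

```sh
# Illustrative only: undeclared_speakers is a hypothetical helper.
# Prints each main-tier speaker code that is missing from @Participants, once.
undeclared_speakers() {
    awk '
        /^@Participants:/ {
            sub(/^@Participants:[ \t]*/, "")
            n = split($0, entries, /,[ \t]*/)
            for (i = 1; i <= n; i++) {
                # The first word of each entry is the speaker code.
                split(entries[i], words, /[ \t]+/)
                declared[words[1]] = 1
            }
            next
        }
        /^\*/ {
            # Speaker code is the text between "*" and the first ":".
            code = substr($0, 2, index($0, ":") - 2)
            if (!(code in declared) && !(code in reported)) {
                print code
                reported[code] = 1
            }
        }
    ' "$1"
}
```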
E370: Retrace marker with nothing to retrace
A retrace or repetition marker ([/], [//], [///]) must be followed by the
repeated or corrected material; per the CHAT manual the marker always refers to
the text that follows it. A marker followed only by a terminator has nothing to
retrace:
*CHI: <the> [/] . ← invalid: [/] is not followed by repeated material
*CHI: <the> [/] the cat . ← valid: the repeated material follows the marker
This mirrors CLAN CHECK error 119 (and the related retrace checks 151 and 159).
E505: Invalid @ID format
Check that pipe-separated fields are correct and the speaker code matches @Participants:
@ID: eng|corpus|CHI|2;6.||||Target_Child|||
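To inspect an @ID value by hand, splitting on the pipe makes the fields easy to read. The helpers below are hypothetical, and the field positions are read off the example above (field 3 is the speaker code, field 8 the role) rather than taken from a formal definition.

```sh
# Illustrative only: id_field and id_pipe_count are hypothetical helpers.
# $1 is the @ID value (the text after "@ID:"); $2 is a 1-based field number.
id_field() {
    printf '%s\n' "$1" | awk -F'|' -v n="$2" '{ print $n }'
}

# Number of pipe separators in the value.
id_pipe_count() {
    printf '%s\n' "$1" | awk -F'|' '{ print NF - 1 }'
}

id_value='eng|corpus|CHI|2;6.||||Target_Child|||'
id_field "$id_value" 3      # prints CHI
id_field "$id_value" 8      # prints Target_Child
id_pipe_count "$id_value"   # prints 10
```

The third field can then be compared against the codes declared in @Participants.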
E705: Main/%mor alignment mismatch
The number of %mor items must match the number of alignable words on the main tier. Retraces, pauses, and events are not counted. The validator shows a columnar diff:
Main tier %mor tier
────────────── ──────────────
I pro|I
want v|want
to inf|to
go v|go
home, ⊖
E714 / E715: %pho, %mod, or %wor count mismatch
The same two codes are reused for “too few” / “too many” count mismatches on
%pho, %mod, and %wor.
For %wor, the main-tier side is a spoken-token inventory:
- regular words and fillers count
- fragments, nonwords, and xxx/yyy/www count
- retrace does not change %wor membership
- replacements keep the original spoken surface word for %wor
That context-sensitivity decides membership, not leniency. Once an item is
in the %wor set, alignment is still strict 1:1. So if a filler like
&-mm counts on the main tier and %wor omits it, E714 is the correct result.
So this is valid:
*CHI: <one &+ss> [/] one play ground .
%wor: one •321008_321148• ss •321148_321368• one •321809_321969• play •322049_322310• ground •322390_322890• .
This is also valid:
*EXP: &+ih <the what> [/] what's letter &+th is this ?
%wor: ih •49063_49103• the •49103_49163• what •49183_50205• what's •50205_50405• letter •50405_50685• th •50886_50946• is •50946_51046• this •51086_51586• ?
And this is valid too:
*EXP: what's is dis [: this] ?
%wor: what's •37050_37471• is •37491_37631• dis •37631_38131• ?
E721: %gra sequential index error
%gra entries must have sequential 1-based indices: 1|...|... 2|...|... 3|...|...
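The sequential-index rule is simple enough to sketch in shell. The function below is hypothetical and checks only the leading index of each entry; the relation labels in the sample calls are made up for illustration, and chatter's validator covers more than this.

```sh
# Illustrative only: gra_indices_sequential is a hypothetical helper.
# $1 is the content of a %gra tier (without the "%gra:" prefix).
# Succeeds only when the leading indices run 1, 2, 3, ... in order.
gra_indices_sequential() {
    expected=1
    for entry in $1; do
        index=${entry%%"|"*}          # text before the first pipe
        [ "$index" = "$expected" ] || return 1
        expected=$((expected + 1))
    done
    return 0
}

gra_indices_sequential "1|2|SUBJ 2|0|ROOT 3|2|OBJ" && echo "sequential"
gra_indices_sequential "1|2|SUBJ 3|0|ROOT" || echo "index gap"
```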
Generated Error Documentation
The source of truth for error-code details is spec/errors/. Maintainers can
also regenerate a local error-reference set from those specs when working on
diagnostics:
cargo run --manifest-path spec/tools/Cargo.toml --bin gen_error_docs
That generated reference includes the error description, example inputs, suggested fixes, and the layer that catches the diagnostic.
Chatter Desktop
Status: Current Last modified: 2026-06-21 21:33 EDT
Chatter Desktop is a native graphical validation app for CHAT files, released
alongside the chatter CLI. Prefer the chatter CLI for scripted or batch
validation; use the desktop app when you want a standalone graphical validation
experience without a terminal.
When to use Chatter Desktop
Chatter Desktop (apps/chatter-desktop/) is the right tool when you want to:
- Validate CHAT files through a graphical interface, no terminal required
- Drag and drop a file or folder and read errors with source snippets
- Work on the desktop without setting up a terminal workflow
Related surfaces:
- Validate CHAT from the command line: use chatter validate
This page documents the desktop surface:
- Chatter Desktop (apps/chatter-desktop/), the CHAT validation GUI
Current status
- Release contract: released alongside the CLI in the public chatter release
- Distribution: ships in the coordinated chatter release alongside the CLI; also buildable from source (below)
- Platforms: macOS, Windows, and Linux
Staying up to date
Chatter Desktop keeps itself current. When you launch it, it quietly checks for a newer release; if one is available it asks whether to update, and on your confirmation it downloads, installs, and restarts into the new version. If the check cannot reach the network it simply does nothing and the app keeps working on the version you have. You never have to track releases or re-download by hand.
Getting Started
Build from source
cd apps/chatter-desktop
npm ci
cargo tauri dev # launches the app with hot reload
cargo tauri build # produces a distributable app bundle
Requires: Rust (stable, edition 2024), Node.js, and npm.
Using the App
Opening files
Chatter validates one target at a time: a single .cha file or one folder.
Three ways to start validating:
- Choose File: opens a file picker filtered to .cha files
- Choose Folder: opens a folder picker; validates all .cha files recursively
- Drag and drop: drag one .cha file or one folder onto the app window
When idle, if you’ve previously validated a target, the drop zone shows “Last: corpus/reference/, Re-validate?” as a clickable shortcut.
Reading results
The main window has three areas:
┌──────────────────────────────────────────────────────────────┐
│ [Choose File] [Choose Folder] or drag here [System|Light|Dark] │
├──────────────────┬───────────────────────────────────────────┤
│ 3 FILES WITH │ Filter by code… [All|Errors|Warnings] │
│ ERRORS / 120 │ │
│ │ ▾ [E302] Missing @End header │
│ 📁 corpus/ │ ┌───────────────────────┐ │
│ ✗ file1 (3) │ │ 41 │ *CHI: hello . │ │
│ ✗ file3 (1) │ │ 42 │ │ │
│ │ │ │ ^ │ │
│ │ └───────────────────────┘ │
│ │ 💡 Add @End on the last line │
│ │ [Copy] [Open in CLAN] │
├──────────────────┴───────────────────────────────────────────┤
│ Progress: 45/120 │ 4 errors │ ~2m 30s remaining │ [Cancel] │
└──────────────────────────────────────────────────────────────┘
- File tree (left): collapsible directory tree showing only files with errors (valid files are hidden to reduce clutter). A header shows “N files with errors / M total”. Files are sorted alphabetically.
- Error panel (right): for the selected file, shows each error with its code in [E001] format, severity color, message, source snippet with caret underlines, and multi-span labels for complex errors (e.g., alignment mismatches across tiers). CHAT-specific formatting is handled: tabs are expanded to 8-column boundaries, \x15 bullets are rendered as •, and underline markers are shown as styled underlined text. Suggestions are prefixed with 💡.
- Status bar (bottom): streaming progress during validation, ETA after 5+ files, total error count, and action buttons.
Filtering errors
A compact filter bar appears above the error cards when a file has diagnostics:
- Code filter: type “E7” to show only alignment errors, “W” for warnings, etc.
- Severity toggle: switch between All / Errors / Warnings
The file header updates to show filtered vs. total count (e.g., “3 errors (7 total)”).
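As a rough illustration of how the two filters combine, here is a minimal Python sketch. It assumes the code filter is a case-insensitive prefix match, which is what the “E7” and “W” examples suggest but is not confirmed here. The `Diagnostic` class, the `filter_diagnostics` function, and the codes other than E302 are hypothetical names chosen for the example, not part of the app.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Diagnostic:
    code: str      # e.g. "E302"; E705 and W210 below are illustrative only
    severity: str  # "error" or "warning"
    message: str


def filter_diagnostics(diags, code_prefix="", severity="all"):
    """Keep diagnostics matching both the code prefix and the severity toggle.

    Assumption: the code filter is a case-insensitive prefix match.
    """
    prefix = code_prefix.strip().upper()
    kept = []
    for d in diags:
        if prefix and not d.code.upper().startswith(prefix):
            continue
        if severity != "all" and d.severity != severity:
            continue
        kept.append(d)
    return kept


diags = [
    Diagnostic("E302", "error", "Missing @End header"),
    Diagnostic("E705", "error", "illustrative alignment error"),
    Diagnostic("W210", "warning", "illustrative warning"),
]
```

With this sample data, the prefix "E7" keeps only the E705 entry, and the Warnings toggle keeps only W210.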
Collapsible error cards
Each error card has a clickable header that toggles between expanded and collapsed view. Collapsed cards show only the error code and first line of the message. When a file has 5 or more errors, an Expand All / Collapse All button appears.
Dark mode
Chatter follows your system appearance by default. A System / Light / Dark toggle in the drop zone area lets you override. Your preference is remembered across sessions.
The dark palette uses muted Apple-style colors and keeps miette error highlighting readable on dark backgrounds.
Clickable file paths
Click the file name in the error panel heading to reveal the file in Finder (macOS), Explorer (Windows), or the default file manager (Linux).
Copy errors
Each error card has a Copy button that copies the full miette-rendered error text (plain text, not HTML) to your clipboard for pasting into issue reports or messages.
Actions
| Action | Where | What it does |
|---|---|---|
| Re-validate | Status bar / last-target hint | Re-run validation on the same target (picks up edits) |
| Cancel | Status bar (during validation) | Stop the current run |
| Export | Status bar | Save results as JSON or plain text via a save dialog |
| Open in CLAN | Per-error button | Opens the file at the error location in the CLAN editor |
| Copy | Per-error button | Copies the plain-text error to clipboard |
| Reveal in file manager | File name heading | Opens the file’s parent directory |
“Open in CLAN” only appears when the CLAN application is detected on your
system (macOS and Windows only). It adjusts line numbers to account for headers
that CLAN hides (@UTF8, @PID, @Font, @ColorWords, @Window).
Keyboard shortcuts
| Shortcut | Action |
|---|---|
| Ctrl+R / Cmd+R | Re-validate |
| Escape | Cancel running validation |
All other navigation is mouse-driven (click files, scroll errors).
Window title
The window title updates to reflect the current state:
- Idle: “Chatter”
- Discovering: “Chatter, Discovering files…”
- Running: “Chatter, Validating (45/120)”
- Finished: “Chatter, 14 errors in 3 files” or “Chatter, All 74 files valid”
ETA
After 5 or more files have been processed, the status bar shows an estimated time remaining (e.g., “~2m 30s remaining”). The estimate updates every second.
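The estimation method is not described here. One plausible approach is linear extrapolation from the average time per processed file, sketched below in Python. The function names and the extrapolation itself are assumptions made for illustration; only the 5-file minimum and the “~2m 30s remaining” display format come from the text above.

```python
def estimate_remaining_seconds(processed, total, elapsed_seconds, min_files=5):
    """Return estimated seconds remaining, or None when no estimate is shown.

    Assumption: linear extrapolation from the mean time per processed file.
    """
    if processed < max(min_files, 1) or processed >= total:
        return None
    per_file = elapsed_seconds / processed
    return per_file * (total - processed)


def format_eta(seconds):
    """Format seconds in the style of the status bar example."""
    minutes, secs = divmod(int(round(seconds)), 60)
    if minutes:
        return f"~{minutes}m {secs}s remaining"
    return f"~{secs}s remaining"
```

For example, 45 of 120 files processed in 90 seconds gives 2 seconds per file and 75 files left, so the sketch reports “~2m 30s remaining”.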
Notifications
When validation finishes while the app is not focused, a system notification shows the summary (“Validation complete, 14 errors in 3 files”).
First launch
On first launch, an onboarding overlay explains the four main interactions: drag files, error panel, keyboard shortcuts, and export. Dismiss it with “Got it”; it won’t appear again.
CLI Bundling
The desktop app can bundle the chatter CLI binary so power users who download
the GUI can also run the CLI from their terminal (like VS Code ships the code
command).
An Install CLI Command menu item (when available) symlinks the bundled
binary to /usr/local/bin/chatter (macOS/Linux) or copies it to a PATH
directory (Windows).
To build with the bundled CLI:
cargo build --release -p chatter
mkdir -p apps/chatter-desktop/src-tauri/resources
cp target/release/chatter apps/chatter-desktop/src-tauri/resources/
cargo tauri build
Architecture
The desktop app lives in apps/chatter-desktop/:
apps/chatter-desktop/
src-tauri/ Rust backend (Tauri v2)
src/
main.rs Bin entry, calls chatter_desktop_lib::run()
lib.rs Tauri app setup (Builder + module wiring)
protocol.rs Shared command/event names + request types
commands.rs validate, cancel, open_in_clan, export, reveal, install_cli
events.rs ValidationEvent → frontend event bridge
validation.rs Desktop validation orchestration for one target
src/ React + TypeScript frontend
components/ DropZone, FileTree, ErrorPanel, ProgressBar, OnboardingOverlay
hooks/ useValidation, validationState, useTheme
protocol/ Command/event names + TypeScript transport mirrors
runtime/ Tauri transport + capability-focused runtime seam
The Rust backend calls validate_directory_streaming() from
talkbank-transform directly, the same streaming validation pipeline used by
the TUI. Events flow over crossbeam channels to the Rust side, then are
serialized to JSON and emitted to the frontend via Tauri’s event bridge.
Cancellation uses ArcSwapOption for a lock-free atomic swap of the cancel
sender, with no mutex.
The frontend keeps Tauri-specific code confined to src/runtime/tauriTransport.ts.
React components and hooks consume narrower capabilities (validationRunner,
validationTarget, clan, exports) instead of reaching for one broad
desktop service object.
Comparison with TUI
| Feature | TUI (chatter validate) | Desktop app |
|---|---|---|
| File selection | CLI arguments | Drag-and-drop, file picker |
| Navigation | Keyboard (Tab, arrows) | Mouse click |
| Error display | Two-pane terminal UI | Scrollable panels with source snippets |
| Error filtering | Not available | Code filter + severity toggle |
| Copy error | Not available | Copy button per error |
| Open in CLAN | c key | Button per error |
| Export | --format json --audit | Save dialog (JSON or text) |
| Streaming progress | Progress bar | Progress bar + ETA |
| Dark mode | Terminal theme | System/Light/Dark toggle |
| Caching | Same engine | Same engine |
| Who it’s for | Power users, CI | Researchers, linguists |
Both use the identical validation engine and produce the same error codes.
When to Use Which Tool
The TalkBank toolchain offers validation through three interfaces. Each serves a different workflow:
| Tool | Audience | Use when |
|---|---|---|
| Chatter Desktop | Researchers, linguists | You want a graphical, drag-and-drop CHAT validation app without using a terminal. |
| chatter validate (TUI) | Power users | You’re comfortable in a terminal and want keyboard-driven navigation. |
| chatter validate (CLI) | CI, scripts | You need machine-readable output (--format json) or batch audits (--audit). |
Chatter Desktop focuses on validation only.
CLAN Line Numbering
Status: Current Last modified: 2026-05-29 17:31 EDT
When you click “Open in CLAN” in the desktop app or press Enter in the TUI, chatter sends the error location to the CLAN editor. CLAN opens the file and places the cursor at the error. This usually works seamlessly, but there is one caveat: CLAN and chatter count lines differently.
Hidden Headers
CLAN hides five header types from its editor display:
| Header | Purpose |
|---|---|
| @UTF8 | Character encoding declaration |
| @PID | Persistent identifier |
| @Font | Display font settings |
| @ColorWords | Color coding rules |
| @Window | Window position/size |
These headers are present in the .cha file but invisible in CLAN’s editor.
CLAN’s line numbers skip them entirely. A file that starts with @UTF8 on
line 1 will show @Begin as “line 1” in CLAN’s display, even though it’s
actually line 2 in the file.
What Chatter Does
Chatter automatically adjusts line numbers before sending to CLAN:
- Compute the error’s line number in the source file
- Count how many hidden headers appear before that line
- Subtract the hidden count to get CLAN’s line number
- Send the adjusted line number to CLAN
This happens transparently; you don’t need to do anything.
Edge Case: Errors on Hidden Lines
If an error is on a hidden header itself (e.g., a malformed @UTF8 line),
CLAN cannot navigate to it because CLAN doesn’t display that line. In this
case, “Open in CLAN” will show an error message explaining why.
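The adjustment and its edge case can be sketched in a few lines of Python. This is an illustrative reimplementation of the steps described above, not the code in talkbank_model; the function names are hypothetical, and the sketch takes a 1-based file line number directly rather than resolving it from byte offsets.

```python
HIDDEN_HEADERS = ("@UTF8", "@PID", "@Font", "@ColorWords", "@Window")


def is_hidden_header(line):
    """True when the line is one of the headers CLAN hides from its display."""
    # The header name ends at the first ":" or tab, or at end of line.
    name = line.split(":", 1)[0].split("\t", 1)[0].strip()
    return name in HIDDEN_HEADERS


def clan_line_number(source_lines, error_line):
    """Map a 1-based file line to CLAN's line number.

    Returns None when the error sits on a hidden header, because CLAN
    has no displayed line to navigate to in that case.
    """
    if not 1 <= error_line <= len(source_lines):
        raise ValueError(f"line {error_line} is outside the file")
    if is_hidden_header(source_lines[error_line - 1]):
        return None
    hidden_before = sum(
        1 for line in source_lines[: error_line - 1] if is_hidden_header(line)
    )
    return error_line - hidden_before


# Illustrative file content; the @PID value is a placeholder.
lines = ["@UTF8", "@PID:\texample-pid", "@Begin", "@Languages:\teng", "*CHI:\thello ."]
```

In this sample, `@Begin` is line 3 in the file but line 1 in CLAN, and an error on the `@UTF8` line has no CLAN location at all.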
For Developers
The shared resolution logic lives in talkbank_model::resolve_clan_location().
Both the TUI and the desktop app call this function. It resolves line/column
from byte offsets when needed and adjusts for hidden headers.
See clan_location.rs
for the implementation and tests.
Batch Workflows
Status: Current Last modified: 2026-06-12 21:05 EDT
The chatter CLI is designed for processing large CHAT corpora efficiently. This page covers common batch workflows.
Validating a Corpus
Validate all .cha files in a directory tree:
chatter validate /path/to/corpus/
The validator recursively discovers .cha files and processes them in parallel. Results are cached, so subsequent runs skip unchanged files.
Forcing Revalidation
To bypass the cache and revalidate everything:
chatter validate /path/to/corpus/ --force
Filtering Output
Show only errors (hide warnings):
chatter validate /path/to/corpus/ --quiet
Stop after the first reported error:
chatter validate /path/to/corpus/ --max-errors 1
Write a JSONL audit file while validating:
chatter validate /path/to/corpus/ --audit validation.jsonl
CHAT-JSON Roundtrip
Convert an entire corpus to JSON and back:
# Recursive ** globs need globstar in bash (shopt -s globstar); zsh supports them by default
# CHAT → JSON
for f in corpus/**/*.cha; do
  chatter to-json "$f" > "${f%.cha}.json"
done
# JSON → CHAT
for f in corpus/**/*.json; do
  chatter from-json "$f" > "${f%.json}.roundtrip.cha"
done
The roundtrip is designed to preserve the ChatFile model. In regression
tests, compare normalized output rather than assuming byte-for-byte identity
after parser or serializer changes.
Cache Management
The validation cache stores results for previously validated files
(keyed by content hash). The cache database file is named
talkbank-cache.db and lives in the OS cache directory:
- macOS: ~/Library/Caches/talkbank-chat/talkbank-cache.db
- Linux: ~/.cache/talkbank-chat/talkbank-cache.db
- Windows: %LocalAppData%\talkbank-chat\talkbank-cache.db
It can hold results for large file collections.
To relocate the cache (a different disk, a per-project cache, or an
isolated cache for scripted runs), set the TALKBANK_CHAT_CACHE_DIR
environment variable to a directory; the database is created directly
inside it. This is the supported override on every platform, and the
only effective one on Windows, where the default location comes from
the system Known Folder API rather than environment variables.
chatter cache stats # Show hit rates and entry count
chatter cache clear --all
Do not delete the cache file manually while chatter is running.
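For scripts that need to locate the database (for example, to report its size), the lookup order described above can be approximated in Python. This is a sketch, not chatter's own resolution code: the function name and parameters are hypothetical, and the Windows branch guesses the usual `AppData\Local` location because the real default comes from the Known Folder API.

```python
import os
import sys
from pathlib import Path

CACHE_FILE = "talkbank-cache.db"


def cache_db_path(env=None, platform=None, home=None):
    """Approximate where chatter keeps its validation cache database."""
    env = os.environ if env is None else env
    platform = sys.platform if platform is None else platform
    home = Path.home() if home is None else Path(home)

    # The documented override wins on every platform.
    override = env.get("TALKBANK_CHAT_CACHE_DIR")
    if override:
        return Path(override) / CACHE_FILE

    if platform == "darwin":
        return home / "Library" / "Caches" / "talkbank-chat" / CACHE_FILE
    if platform.startswith("win"):
        # Assumption: the conventional %LocalAppData% location.
        return home / "AppData" / "Local" / "talkbank-chat" / CACHE_FILE
    return home / ".cache" / "talkbank-chat" / CACHE_FILE
```

Passing `env`, `platform`, and `home` explicitly keeps the function easy to exercise in tests without touching the real environment.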
Reference Corpus Validation
This repository includes a reference corpus at
corpus/reference/ (currently ~100 .cha files; verify by
find corpus/reference -name '*.cha' | wc -l). The parser must
handle every file in this corpus at 100%:
cargo nextest run -p talkbank-parser-tests -E 'test(parser_equivalence)'
This runs the parser equivalence test. Each .cha file is its own test, so nextest runs them in parallel and reports individual failures.
Integration with batchalign
The Batchalign pipeline uses the same Rust core (via PyO3) for CHAT
parsing and serialization. Since the 2026-04-28 monorepo merge,
Batchalign source lives inside this repository under crates/batchalign-*
(the standalone batchalign3 GitHub repo was archived). Files processed
by Batchalign produce valid CHAT that passes chatter validate.
CI Integration
Status: Current Last updated: 2026-04-13 19:23 EDT
How to use chatter in continuous integration pipelines.
Exit Codes
| Code | Meaning |
|---|---|
| 0 | All files valid / command succeeded |
| 1 | Validation errors found or command failed |
| 2 | Invalid arguments or missing required options |
All examples below rely on exit code 1 to signal validation failure.
Basic Usage
chatter validate corpus/ --quiet --tui-mode disable
- --quiet suppresses per-file success output
- --tui-mode disable prevents interactive TUI (required in non-TTY environments)
- Exit code 0 means all files valid; 1 means errors found
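When a pipeline is driven from a script rather than a shell step, the same invocation and exit-code table can be wrapped as follows. This is a minimal Python sketch: the function names and the `chatter` parameter (the executable name or path) are hypothetical, and the flags are the ones shown above.

```python
import subprocess

# Meanings copied from the exit-code table above.
EXIT_MEANINGS = {
    0: "all files valid / command succeeded",
    1: "validation errors found or command failed",
    2: "invalid arguments or missing required options",
}


def describe_exit(code):
    """Translate a chatter exit code into a short human-readable message."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_validation(corpus_dir, chatter="chatter"):
    """Run non-interactive validation and return the process exit code."""
    cmd = [chatter, "validate", corpus_dir, "--quiet", "--tui-mode", "disable"]
    return subprocess.run(cmd).returncode
```

A caller would typically do `code = run_validation("corpus/")` and fail the job when `code != 0`, logging `describe_exit(code)` for context.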
GitHub Actions Example
- name: Validate CHAT corpus
  run: |
    chatter validate corpus/ --quiet --tui-mode disable --format json --audit results.jsonl
- name: Upload validation report
  if: failure()
  uses: actions/upload-artifact@v4
  with:
    name: validation-report
    path: results.jsonl
The --audit results.jsonl flag streams per-error JSON lines to a file,
which is useful for archiving or downstream analysis even when the step
fails.
JSON Output for Automation
chatter validate corpus/ --format json --tui-mode disable 2>/dev/null
Each file produces a JSON object on stdout with status, error_count,
and errors array. The exit code still reflects overall pass/fail.
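A downstream script can fold that output into a short summary. The Python sketch below assumes one JSON object per line on stdout, which is not stated explicitly above, and relies only on the `error_count` and `errors` fields. The `status` values in the sample are placeholders, since the actual values are not documented here.

```python
import json


def summarize(lines):
    """Summarize per-file JSON records read from chatter's stdout.

    Assumption: each non-empty line holds one complete JSON object.
    """
    files = failing = total_errors = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        files += 1
        # Fall back to the errors array if error_count is absent.
        count = record.get("error_count", len(record.get("errors", [])))
        total_errors += count
        if count > 0:
            failing += 1
    return {"files": files, "failing_files": failing, "errors": total_errors}


# Placeholder records for illustration; real output carries more detail.
sample = [
    '{"status": "ok", "error_count": 0, "errors": []}',
    '{"status": "failed", "error_count": 2, "errors": [{}, {}]}',
    "",
]
```

In a pipeline, the same function could be fed `sys.stdin` directly, for example by piping the command above into the script.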
Pre-commit Hook
#!/bin/sh
# .git/hooks/pre-commit
chatter validate . --quiet --tui-mode disable
This blocks commits that introduce invalid CHAT files. The hook runs quickly on cached files; only modified files are re-validated.
Suppressing Specific Errors
Some corpora have known issues that should not block CI. Use --suppress
to ignore specific error codes or named groups:
chatter validate corpus/ --suppress E726,E727,E728 --tui-mode disable
Or use the named group shorthand:
chatter validate corpus/ --suppress xphon --tui-mode disable
Suppressed errors do not appear in output and do not affect the exit code.
Audit Mode for Large Corpora
For bulk corpus validation where you want a full error database without caching overhead:
chatter validate corpus/ --audit errors.jsonl --tui-mode disable
The --audit flag streams one JSON object per error to the specified file.
A summary is printed to stderr at the end.
CHAT Processing Playbook for Editors and Analysts
Status: Current Last updated: 2026-03-24 00:01 EDT
Objective
Provide practical guidance for non-compiler users who create, edit, and validate CHAT files, with emphasis on error interpretation and correction workflow.
Who This Is For
- Transcript editors,
- corpus curators,
- QA reviewers,
- linguists using tooling outputs but not parser internals.
Core Editing Workflow
- Open file in editor with CHAT diagnostics enabled.
- Run validation (single file first, then batch).
- Fix highest-severity structural issues first (headers, tier markers, unmatched delimiters).
- Re-run validation and inspect warnings.
- Only then address style and normalization suggestions.
Error Triage Heuristic
- Errors at file start: likely header formatting or encoding issues.
- Errors at tier prefix: likely malformed */% tier syntax.
- Errors inside words: likely symbol, marker, or annotation boundary issues.
- Repeated same error class: likely one systemic rule violation pattern.
Fast Interpretation Guide
- Error: parser/validator could not accept structure; must fix.
- Warning: valid but suspicious or non-canonical; review strongly recommended.
- Info: advisory normalization or convention hints.
Common Fix Recipes
- Header spacing problems:
  - Ensure expected separators and avoid accidental tabs/spaces drift.
- Unclear language/form markers:
  - Confirm @s usage and suffix ordering with house style guide.
- Duration/annotation confusion:
  - Verify bracketed annotation form and avoid malformed punctuation.
- Dependent tier attachment issues:
  - Ensure % tiers follow intended main tier and keep indentation consistent.
Batch Validation Workflow
- Validate a small sample first.
- Group failures by error code.
- Fix by pattern, not file-by-file random order.
- Re-run and confirm error count decreases monotonically.
- Save run report for audit trail.
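Grouping failures by code is straightforward once a JSONL audit file exists (for example, one written with --audit). The Python sketch below counts records per error code. The `code` field name is an assumption about the audit record layout, so it is exposed as a parameter; check a real audit line before relying on it.

```python
import json
from collections import Counter


def count_by_code(jsonl_lines, code_field="code"):
    """Count audit records per error code, most frequent first.

    Assumption: each record stores its error code under `code_field`.
    """
    counts = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        counts[record.get(code_field, "<no code>")] += 1
    return counts.most_common()


# Placeholder records; the field names and file names are illustrative.
sample = [
    '{"code": "E302", "file": "a.cha"}',
    '{"code": "E726", "file": "a.cha"}',
    '{"code": "E726", "file": "b.cha"}',
    '{"code": "E726", "file": "c.cha"}',
]
```

Working down the resulting list from the top supports the fix-by-pattern step: the most frequent code is usually the systemic one.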
Collaboration Workflow with Developers
When reporting parsing issues, include:
- exact file path,
- minimal excerpt around failing span,
- observed diagnostic code/message,
- expected behavior (if known).
This reduces back-and-forth and speeds defect triage.
Quality Checklist Before Publishing Corpus Updates
- No unresolved error-level diagnostics.
- Warning classes reviewed and accepted or fixed.
- Participant headers and IDs internally consistent.
- Roundtrip serialization check passes for representative samples.
- Changelog note recorded for major normalization edits.
Training Recommendations
- Maintain short examples for each common error class.
- Provide editor cheat sheet for tier prefixes and marker syntax.
- Run periodic QA calibration sessions across editors.
Sanitize (chatter debug sanitize)
Status: Current Last updated: 2026-04-28 22:18 EDT
chatter debug sanitize strips contributor lexical content from a CHAT
file while preserving structure (timing bullets, %wor per-word offsets,
speaker codes, dependent-tier scaffolding, structural counts, POS tags,
language markers). Output is structurally identical to the input but
contains no participant words, names, or free-text annotations.
The command exists so engineering tooling, including LLM-assisted
debugging, can operate on protected-corpus files (aphasia/,
dementia/, rhd/, fluency/Password/, clinical-children corpora,
etc.) without exposing contributor speech to commercial LLM services.
When to use it
Run chatter debug sanitize on the source file before loading it into
any tool (LLM-backed debugger, scratch directory, screen-shareable
session) where you don’t want participant content visible.
When you need to ask a contributor for help debugging a specific file, frame the request as “run the sanitizer locally and send me the output” rather than asking for the raw file.
Usage
# Write sanitized output to stdout
chatter debug sanitize input.cha
# Write sanitized output to a file
chatter debug sanitize input.cha --output sanitized.cha
Working location for sanitized files: prefer a stable, non-/tmp
scratch directory (e.g. set TB_SCRATCH_DIR to a per-project dir
under your workstation’s persistent storage) for any state that
should outlive a single command. macOS clears /tmp on reboot.
What is preserved (byte-exact)
- Timing bullets •start_end• on the main tier.
- %wor per-word offsets (word START_END triples).
- Speaker codes (*PAR, *INV, *CHI, …).
- Utterance count, word count per utterance, dependent-tier count.
- Structural markers: compound +, clitic ~, CA elements, overlap points, lengthening, stress markers, syllable pause, underline begin/end, proper-noun @n markers.
- Language markers (@s:LANG), form types (@a, @b), POS tags ($adj, $n).
- Headers: @Languages, @Birth, @Date, @Media, @PID, @L1Of, @Begin/@End/@UTF8.
- %mor POS categories and morphological features (e.g., n|, -Past).
- %gra (numeric grammatical relations) and %tim (timing).
- Untranscribed tokens xxx/yyy/www: replacing them would change the semantic meaning, so they pass through unchanged.
What is replaced or redacted
| Source | Replacement |
|---|---|
| WordContent::Text | wN placeholder, indexed by document position |
| Shortening text | (x) |
| %mor lemmas (MorWord.lemma) | lemmaN; POS + features preserved |
| %pho / %mod / %modsyl / %phosyl / %phoaln / %sin | tier dropped |
| Free-text dependent tiers (%com %add %exp %sit %spa %int %gpx %eng %gls %ort %flo %def %coh %fac %par %alt %err) | [redacted] |
| @Comment, @Transcriber, @Birthplace, @Activities, @Situation, @RoomLayout, @Location, @TapeLocation, @Warning, @Bck | [redacted] (when content was free text) |
| @Participants participant-name field | dropped (Participant_<SPEAKER_CODE> is implied by speaker code + role) |
| @ID custom_field and education | cleared |
| Event event_type (&=imitates:Mary → &=[redacted]) | [redacted] |
| Freecode text ([^ aside]) | [redacted] |
| OtherSpokenEvent text | [redacted] |
Determinism + Idempotence
Placeholder generation is keyed off (utterance_index, word_index)
tree position, not a global counter. Two consequences:
- Deterministic: sanitizing the same input twice produces byte-identical output.
- Idempotent: sanitizing a sanitized file produces the same file again, with no double-replacement and no shifting placeholder numbers.
Pipeline
flowchart LR
Input["Source .cha\n(protected corpus)"] --> Parser["TreeSitterParser\n(talkbank-parser)"]
Parser --> Model["ChatFile model\n(talkbank-model)"]
Model --> Sanitize["sanitize()\n(talkbank-transform::redact)"]
Sanitize --> Walker["walk_words_mut\n+ header walker\n+ dep-tier walker\n+ scoped-annot walker"]
Walker --> Mutated["Mutated ChatFile\n(placeholders + redactions)"]
Mutated --> Writer["WriteChat\n(byte-exact bullets)"]
Writer --> Output["Sanitized .cha\n(scratch path)"]
The walker step replaces WordContent::Text segments inside
Word.content, mutates MorWord.lemma fields, redacts free-text
header / dep-tier / scoped-annotation strings, and drops phonological
tiers. WriteChat then re-serializes, and because it serializes from
Word.content (not from Word.raw_text), every CA element, compound
marker, clitic boundary, and timing bullet round-trips byte-exact.
Out of v1 scope
Documented for transparency; v2 work:
- Speaker-code anonymization (graph rewrite across @Participants, @ID, *SPK:, @Birth, @L1Of).
- @Birth/@Date fuzzing (exact birth dates can be identifying).
- @Media filename redaction.
- Audio-side sanitization. (Audio bytes are never touched by the sanitizer; the audio stays at its original path.)
- “Unsanitize” or round-trip mapping. Explicitly not built: the sanitizer is one-way, and the mapping table that would reverse it is the exact artifact we don’t want to exist.
Implementation
Library module: talkbank_transform::redact. CLI surface: chatter debug sanitize.
The strict policy is the only public preset in v1; future variants can
grow on SanitizationPolicy.
Speaker-ID (chatter speaker-id)
Status: Draft Last modified: 2026-06-15 12:18 EDT
chatter speaker-id assigns CHAT-conformant speaker codes and role
tags to a CHAT file whose speakers carry anonymous or placeholder
labels (typically the output of an ASR system that labels speakers
as PAR0, PAR1, …). It is the bridge between an ASR pipeline that
does not understand speaker roles and a CHAT pipeline that does.
The command is structural: it does not modify utterance content,
does not run audio analysis, does not infer speaker identity from
voice features. Its inputs are the CHAT file to relabel plus an
identification signal (reference transcript, explicit mapping, or
saved override record); its output is the same CHAT file with
speaker codes rewritten and @Participants / @ID headers
reconciled.
When to use it
Whenever you have a CHAT file with placeholder speaker codes that need to become CHAT-conformant codes before downstream tooling can process the file meaningfully. The canonical case is an ASR system that emits CHAT but does not know which speaker is the child, parent, clinician, etc.
A complete pipeline that consumes ASR output and produces a publishable CHAT file goes:
flowchart LR
Media --> Transcribe
Transcribe["batchalign3 transcribe<br/>ASR"] --> AsrAnon["asr.cha<br/>PAR0, PAR1, ..."]
Ref["reference.cha<br/>target speakers only"] -.->|reference signal| SpkId
AsrAnon --> SpkId
SpkId["chatter speaker-id<br/>(this page)"] --> AsrLabeled["asr-labeled.cha<br/>CHI, INV, MOT, ..."]
AsrLabeled --> Merge["chatter merge"]
Ref --> Merge
Merge --> Aligned["batchalign3 align"]
The speaker-id stage is the single point in the pipeline where
“which anonymous speaker corresponds to which CHAT role” is
decided. Downstream stages (chatter merge, batchalign3 align,
batchalign3 morphotag) all trust that the labels they receive
are correct.
Identification modes
Three mutually-exclusive modes, exactly one of which must be selected:
1. Reference mode
The most common case: a separate CHAT file already exists that
covers the same media and contains an authoritative speaker
(typically the hand-transcribed target speaker). The reference
file’s anchor speaker tells us what that speaker’s content looks
like; speaker-id finds the matching speaker in the input by
text similarity.
The matching algorithm is multiset Jaccard over bags of content
tokens; see “Algorithm” below for the full specification. The
ASR speaker whose bag-of-words best matches the reference anchor’s
bag-of-words is taken as the same speaker, and is marked for
drop in the output (because the reference file authoritatively
covers them, the downstream chatter merge stage will pull
their utterances from the reference, not from this file). The
remaining speakers are renamed to the role specified by
--inserted-role.
If the Jaccard margin between the winning speaker and the
runner-up is below --confidence-threshold, the command refuses
to auto-decide. The operator must either lower the threshold
(not recommended without spot-checking), supply an explicit
mapping (--mapping), or load a previously-adjudicated override
(--override-file).
2. Explicit-mapping mode
The operator already knows the mapping (typically because they listened to the audio, or because the contributor’s data sheet documents it). They supply it directly.
chatter speaker-id input.cha \
--mapping "PAR0=INV:Investigator,PAR1=drop" \
-o relabeled.cha
The grammar for --mapping:
- One or more comma-separated assignments.
- OLD=CODE:ROLE renames OLD to CODE with role tag ROLE.
- OLD=drop removes OLD’s utterances entirely.
- Every speaker present in the input must be named in the mapping (no defaulting). This is intentional: we want operator decisions to be explicit.
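To make the grammar concrete, here is a small Python parser for it. This is an illustrative sketch, not the parser inside chatter: the `Rename` class and the `parse_mapping` and `check_coverage` functions are hypothetical, and the error messages are invented.

```python
from dataclasses import dataclass

DROP = "drop"


@dataclass(frozen=True)
class Rename:
    code: str  # new speaker code, e.g. "INV"
    role: str  # role tag, e.g. "Investigator"


def parse_mapping(spec):
    """Parse "OLD=CODE:ROLE,OLD=drop,..." into {old: Rename | "drop"}."""
    mapping = {}
    for part in spec.split(","):
        part = part.strip()
        old, sep, target = part.partition("=")
        if not sep or not old or not target:
            raise ValueError(f"malformed assignment: {part!r}")
        if old in mapping:
            raise ValueError(f"speaker {old!r} assigned twice")
        if target == DROP:
            mapping[old] = DROP
            continue
        code, sep, role = target.partition(":")
        if not sep or not code or not role:
            raise ValueError(f"expected CODE:ROLE or drop, got {target!r}")
        mapping[old] = Rename(code, role)
    return mapping


def check_coverage(mapping, input_speakers):
    """Enforce the no-defaulting rule: every input speaker must be named."""
    missing = sorted(set(input_speakers) - set(mapping))
    if missing:
        raise ValueError(f"speakers missing from mapping: {', '.join(missing)}")
```

For the example above, `parse_mapping("PAR0=INV:Investigator,PAR1=drop")` yields a rename for PAR0 and a drop for PAR1, and `check_coverage` raises if the input also contains, say, PAR2.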
3. Override-file mode
The operator has previously adjudicated this session (perhaps
through an interactive review tool) and saved the decision to a
shared override file. speaker-id reads the file, finds the entry
for this session, and applies it. See “Override file format”
below.
chatter speaker-id input.cha \
--override-file batch-2026-05-27.overrides.toml \
--session-id NF203-2 \
-o relabeled.cha
This mode is the production substrate for batch workflows: the
orchestrator first runs chatter speaker-id in reference mode for
every session; for any session that exits with low-confidence, the
operator works through an adjudication tool that writes to the
override file; the orchestrator then re-runs chatter speaker-id
in override-file mode for those sessions.
CLI contract
chatter speaker-id <INPUT> [OPTIONS]
ARGUMENTS:
<INPUT> Path to the CHAT file to relabel.
OPERATION MODES (exactly one required):
REFERENCE MODE:
--reference <FILE>
--anchor <SPEAKER>
--inserted-role <CODE>:<TAG>[,<CODE>:<TAG>...]
EXPLICIT-MAPPING MODE:
--mapping <SPEC>
OVERRIDE-FILE MODE:
--override-file <FILE>
--session-id <ID>
REFERENCE-MODE OPTIONS:
--confidence-threshold <FLOAT>
Minimum Jaccard margin (winner_score / loser_score) for the
command to auto-decide. Below threshold: exit code 4. The
command prints per-speaker scores to stderr so the operator
can inspect. Default: 2.0.
--write-override <FILE>
When auto-decide succeeds, append the decision to FILE in
override-file format (creates if missing). Captures the
audit trail of a batch run.
COMMON OPTIONS:
-o, --output <PATH>
Write relabeled CHAT to PATH. Default: stdout.
The operator identity and any free-text note for a session are set
when an operator confirms it through `chatter adjudicate` (see the
merge workflow), not on this command.
Exit codes:
| Code | Meaning |
|---|---|
| 0 | Success, relabeled file written |
| 1 | Invalid input (parse error, missing file, unreadable) |
| 2 | Semantic precondition violated (reference has no utterances for anchor; mapping covers a speaker not in input; etc.) |
| 3 | Internal error |
| 4 | Reference mode: confidence threshold not met. Per-speaker scores printed to stderr; no output written |
What the output guarantees
These are testable invariants. Every release verifies them against the reference corpus.
Speaker codes match the supplied mapping
For every speaker in the input file:
- If the mapping marks the speaker for drop, none of their utterances appear in the output, AND their @ID row (if any) is removed from the headers, AND their entry is removed from the @Participants header.
- If the mapping marks the speaker for rename, every main-tier line *OLD:\t... becomes *NEW:\t... byte-stable except for the speaker code prefix. The @ID row’s third pipe-separated field (speaker code) and eighth field (role tag) are rewritten; other @ID fields are preserved. The @Participants entry’s code and role-tag tokens are rewritten; any intervening tokens (corpus ID, participant name) are preserved.
- Speakers not in the mapping are passed through unchanged. (In modes 1 and 3, all speakers are assigned automatically; in mode 2, “all speakers must be in the mapping” is a precondition.)
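The rename rules for main-tier lines and @ID rows can be illustrated at the text level. The Python sketch below operates on raw lines purely for demonstration; nothing here confirms that chatter rewrites text this way rather than through its typed model, and both function names are hypothetical.

```python
def rename_main_tier(line, old, new):
    """Rewrite only the leading *OLD:<tab> prefix; every other byte is kept."""
    prefix = f"*{old}:\t"
    if line.startswith(prefix):
        return f"*{new}:\t" + line[len(prefix):]
    return line


def rename_id_header(line, old, new_code, new_role):
    """Rewrite field 3 (speaker code) and field 8 (role) of a matching @ID row."""
    prefix = "@ID:\t"
    if not line.startswith(prefix):
        return line
    fields = line[len(prefix):].split("|")
    if len(fields) < 8 or fields[2] != old:
        return line  # not the speaker being renamed
    fields[2] = new_code
    fields[7] = new_role
    return prefix + "|".join(fields)


# Illustrative lines; \x15 is the NAK character that delimits time bullets.
utterance = "*PAR0:\thello . \x150_1200\x15"
id_row = "@ID:\teng|corpus|PAR0|||||Participant|||"
```

Applied to the sample, the utterance keeps its text and bullet and only gains the `*INV:` prefix, while the @ID row changes in exactly two fields.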
Utterance content is byte-stable except for the speaker prefix
For every retained utterance, every byte EXCEPT the leading
*CODE:\t prefix is preserved verbatim. Dependent tiers attached
to the utterance are preserved exactly. NAK-delimited time
bullets, CHAT markup, special-form annotations, paralinguistic
codes, and retracing scopes are all left untouched.
Headers reconcile per a fixed table
| Header | Behavior |
|---|---|
| @UTF8, @Begin, @End, @Window, @Languages, @Media | Pass-through unchanged |
| @Participants | Drop entries for dropped speakers; rewrite code + role-tag for renamed speakers; entries for unaffected speakers preserved |
| @ID | Drop rows for dropped speakers; rewrite field 3 (code) and field 8 (role) for renamed speakers; other fields preserved |
| @Comment | Pass-through unchanged (provenance-carrying comments survive) |
Provenance is captured if --write-override is set
When --write-override <FILE> is supplied AND the command succeeds
in reference mode, an entry is appended to FILE recording the
session ID (derived from the input filename stem unless overridden),
the per-speaker Jaccard scores, the chosen mapping, the operator, and
an ISO 8601 timestamp. The format is specified in “Override file
format” below. The operator identity and any free-text note are set
later, when a session is confirmed via chatter adjudicate.
This is the audit-trail mechanism: a year from now, a researcher who asks “why was PAR0 labeled INV in this session?” can read the override entry and see the scores, the operator, and any notes the operator added.
Algorithm (reference mode)
Token cleaning
Both the reference anchor’s bag of words and each input speaker’s bag of words are built by walking the typed CHAT AST and emitting content tokens. The cleaner strips:
- NAK-delimited time bullets
- bracket-annotated markup [*], [//], [/], [=! ...], etc.
- angle-bracket retracing scope (<...>, unwrap, keep inner text)
- terminator variants +//., +..., +/., +!?, etc.
- filled-pause and phonological-fragment markers &-..., &+...
- unintelligible placeholders xxx, yyy, www
- zero-realization markers 0
- special-form suffixes (word@l → word)
- CHAT compound underscores (Valentine's_Day → Valentine s Day)
- punctuation, then lowercase, then filter to alpha-only tokens of length ≥ 2
Both sides are cleaned identically so the comparison is
apples-to-apples. This is the same cleaner specified in the
reference corpus under spec/constructs/speaker-id/token-cleaner/.
Multiset Jaccard
For two bags-of-words A and B (counted multisets):
J(A, B) = sum_w min(A[w], B[w]) / sum_w max(A[w], B[w])
Range [0, 1]. The multiset (rather than set) form rewards
speakers who say similar things to the anchor in similar volume,
not just speakers whose vocabulary happens to intersect.
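The formula maps directly onto Python's `collections.Counter`, whose `&` and `|` operators take per-key minimum and maximum counts. The sketch below is illustrative rather than the shipped implementation; returning 0.0 when both bags are empty is an assumption, since the formula is undefined in that case.

```python
from collections import Counter


def multiset_jaccard(tokens_a, tokens_b):
    """Multiset Jaccard similarity between two token sequences."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    intersection = sum((a & b).values())  # sum of per-word minimum counts
    union = sum((a | b).values())         # sum of per-word maximum counts
    # Assumption: two empty bags score 0.0 rather than raising.
    return intersection / union if union else 0.0
```

For the bags `["the", "dog", "the"]` and `["the", "cat"]`, the minimum counts sum to 1 and the maximum counts sum to 4, so the score is 0.25. A plain set Jaccard would give 1/3 for the same input, because it ignores that one side says "the" twice.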
Decision
scores = { speaker: J(anchor_bag, speaker_bag) for speaker in input }
winner = argmax(scores)
loser = argmax(scores - {winner})
margin = scores[winner] / scores[loser] # ∞ when loser score = 0
- winner is the input speaker whose content matches the reference anchor’s content best → marked for drop (the reference authoritatively covers them).
- loser (and any other lower-scoring speakers, in the multi-speaker case) → renamed to the role given by --inserted-role.
If margin < --confidence-threshold (default 2.0), the command
exits with code 4 and prints per-speaker scores to stderr. The
operator must inspect, adjudicate, and re-run with
--mapping or --override-file.
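The decision rule can be sketched as a small function over per-speaker scores. The name `decide`, the `"drop"`/`"rename"` return shape, and returning `None` for a refusal are assumptions of this example. The text defines the margin as ∞ when the runner-up scores 0; this sketch applies the same rule when both scores are 0, a case the text does not cover.

```python
import math


def decide(scores, threshold=2.0):
    """Return (mapping, margin); mapping is None when the margin is too low."""
    if len(scores) < 2:
        raise ValueError("need at least two input speakers")
    ranked = sorted(scores, key=scores.get, reverse=True)
    winner, runner_up = ranked[0], ranked[1]
    if scores[runner_up] == 0:
        margin = math.inf
    else:
        margin = scores[winner] / scores[runner_up]
    if margin < threshold:
        return None, margin  # refusal: operator must adjudicate
    # Winner is covered by the reference, so it is dropped; the rest are renamed.
    mapping = {s: ("drop" if s == winner else "rename") for s in scores}
    return mapping, margin
```

In the real command a refusal corresponds to exit code 4 with the scores printed to stderr.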
Why this algorithm
The choice was empirical, not theoretical, and was made against a calibration set of CHAT files paired with their corresponding ASR output. Two earlier candidates were tested first and rejected:
- Raw temporal-overlap (sum of ms of an input speaker’s activity inside the anchor’s bullet windows): too weak on real data. Hand transcripts often place per-utterance time bullets as end-to-end segmentation boundaries covering 95-99% of the session timeline, rather than as tight “speaker active here” windows. Both input speakers fall almost entirely “inside” the anchor’s bullet windows and the signal disappears.
- Speaker purity (fraction of each input speaker’s activity falling inside anchor windows): same root cause, same failure.
Multiset Jaccard over content tokens succeeded on every
session of the calibration set. The borderline cases (margin
below 2.0x) clustered around tasks where the non-anchor speaker
shares vocabulary with the anchor by the structure of the task,
e.g. a clinician describing the same scene the child is also
describing in a picture-narrative task. These borderline cases
are the reason for the conservative threshold and the
--mapping/--override-file escape hatches; the algorithm
correctly refuses to auto-decide them rather than silently
picking wrong.
Override file format
The override file is a UTF-8 TOML document with one
[<session_id>] table per decision. A minimal entry:
schema_version = 1
[session-101-t1]
mode = "auto"
inserted_role = { code = "INV", tag = "Investigator" }
mapping = { PAR0 = "rename", PAR1 = "drop" }
scores = { PAR0 = 0.1931, PAR1 = 0.7347 }
margin = 3.81
operator = "alice"
decided_at = 2026-05-27T08:41:00-04:00
The complete schema specification (every field, every type, every mode-semantics rule, the strict refuse-with-clear-error versioning policy, and worked examples for auto/explicit/replay/diarization-mixed cases) is on the dedicated reference page: Merge Override File Format.
Highlights from the reference:
- mode = "auto" | "explicit" | "override" records how the decision was made (informational for audit trail; behavior at apply time is the same).
- inserted_role.code is the CHAT speaker code (INV, MOT, FAT, PAR, …); inserted_role.tag is the CHAT role-tag (Investigator, Mother, …). All renamed speakers in one entry share the same role.
- mapping must cover every speaker in the input, no defaulting.
- scores and margin are optional but the writer always records them when an auto attempt produced them (even when the final decision was operator-supplied).
- flags carries operator-supplied markers like "diarization-mixed" for unusual cases. Unknown strings are preserved verbatim.
Preconditions
chatter speaker-id refuses (exit code 2) if any of the following hold:
Reference mode
- The reference file has no utterances for --anchor
- The reference file fails to parse
- The input file has fewer than 2 distinct speakers (no discrimination problem)
Explicit-mapping mode
- A speaker in the mapping is not present in the input
- A speaker in the input is not covered by the mapping (no defaulting)
Override-file mode
- The override file does not contain a <session-id> entry
- The entry’s mapping references a speaker not in the input
- The entry’s mapping does not cover every speaker in the input
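The two mapping rules shared by explicit-mapping and override-file modes (no unknown speakers, no uncovered speakers) amount to a set comparison. A minimal sketch, with a hypothetical function name and message wording:

```python
def mapping_violations(mapping, input_speakers):
    """List the precondition violations for a speaker mapping, if any."""
    mapped, present = set(mapping), set(input_speakers)
    problems = []
    for speaker in sorted(mapped - present):
        problems.append(f"{speaker}: in mapping but not in input")
    for speaker in sorted(present - mapped):
        problems.append(f"{speaker}: in input but not covered by mapping")
    return problems
```

An empty list means the mapping is usable; anything else corresponds to a refusal with exit code 2.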
What chatter speaker-id is NOT
- Not voice diarization. Use Batchalign’s ASR pipeline upstream; the labels this command consumes are the labels Batchalign emits.
- Not content correction. If the speaker the command identifies has been mis-transcribed by ASR, this command does not fix that; re-run ASR with a better engine.
- Not a merge. This command operates on a single CHAT file. To combine the relabeled file with the reference, use chatter merge.
- Not interactive. chatter speaker-id is batch-only: it succeeds, refuses, or fails. The interactive review that resolves a low-confidence refusal into an override-file entry is a separate command, chatter adjudicate, run as part of the merge workflow.
Worked example
A typical fully-automated reference-mode call from an orchestrator script:
chatter speaker-id asr-anonymous.cha \
--reference hand-transcript.cha \
--anchor CHI \
--inserted-role INV:Investigator \
--confidence-threshold 2.0 \
--write-override batch.overrides.toml \
-o asr-labeled.cha
For a session that this call refuses (e.g., a shared-vocabulary narrative task with margin 1.82x), the orchestrator captures the failure and the operator later resolves it:
# Inspect the scores the command emitted to stderr:
# PAR0=0.6286 PAR1=0.3457 margin=1.82x threshold=2.0
# Operator listens to a few seconds of audio and confirms PAR0 is
# the child:
chatter speaker-id asr-anonymous.cha \
--mapping "PAR0=drop,PAR1=INV:Investigator" \
--write-override batch.overrides.toml \
-o asr-labeled.cha
Later, if anyone re-runs the batch, they use override-file mode:
chatter speaker-id asr-anonymous.cha \
--override-file batch.overrides.toml \
--session-id NF204-2 \
-o asr-labeled.cha
The same asr-labeled.cha content is produced; the audit trail
remains intact.
Implementation notes (for contributors)
- Source: crates/talkbank-transform/src/speaker_id/ (proposed layout).
- CLI surface: crates/chatter/src/commands/speaker_id/.
- Domain types (SpeakerCode, RoleTag, SpeakerMapping, MergeOverride, JaccardScore, ConfidenceThreshold, Margin) live in talkbank-model and are shared with chatter merge plus any future adjudication UI.
- The Jaccard cleaner walks talkbank-model::ChatFile directly via the existing content walker (talkbank-model::walk_words); it does NOT re-implement CHAT parsing or use regex on raw bytes for tokenization.
- Spec entries for the cleaner and the algorithm live in spec/constructs/speaker-id/. Every invariant on this page has a spec; regenerate them with the current spec/tools commands from Spec Workflow.
- The override-file reader/writer is a typed serde round-trip on a TOML representation; the schema lives in talkbank-model so the format is one shared type across the codebase, not duplicated parsing logic in each consumer.
Merge (chatter merge)
Status: Draft Last modified: 2026-06-11 15:32 EDT
chatter merge combines two CHAT transcripts that cover the same media
recording into one. The caller designates which speakers’ utterances are
authoritative in which file; the merged output interleaves them by time
while byte-preserving every utterance from its designated source.
The command is structural: it does not invent or rewrite utterance content, does not run ASR, does not run forced alignment, does not infer speaker identity. It is the moment in a multi-input CHAT workflow where two parsed transcripts become one.
When to use it
Whenever you have two valid CHAT files of the same recording and you want a single combined CHAT file out, with explicit per-speaker provenance.
Two recurring shapes from real TalkBank workflows:
- Hand-coded target speaker + ASR everyone else. A contributor has hand-transcribed only the target speaker (often the child in child-language research) with rich disfluency and error coding, and separately someone runs ASR on the same media to produce a rough-but-complete transcript with all speakers. chatter merge combines them with the hand-coded target speaker’s utterances byte-preserved and the other speakers spliced in from the ASR file.
- Older hand transcript + later supplementary transcription. A legacy CHAT file covers most of the recording; a newer pass transcribes additional content (an investigator’s turns, a parent’s turns, a second target child). Merge with --retain listing the speakers whose content lives in the legacy file.
In both shapes the speakers are the unit of authority, not the
files. chatter merge’s job is to express that mapping cleanly.
Conceptual model
A CHAT file describes utterances on a shared media timeline. Two CHAT files of the same media share the same timeline; their utterance sets may overlap (same speech transcribed twice) or be disjoint (each file covers different speakers). The merged output is a single CHAT file on the same timeline whose utterance set is the disjoint union of:
- the utterances of every speaker listed in --retain from the first input file, and
- the utterances of every speaker NOT listed in --retain from the second input file.
Retained-speaker utterances from the first file are kept byte-for-byte
identical, including every dependent tier they own (%wor, %mor,
%gra, %com, %pho, …). Inserted-speaker utterances from the
second file have their downstream-generated dependent tiers
(%wor/%mor/%gra/%pho, anything a later pipeline stage will
regenerate) stripped before insertion, so the merged file is in a
clean state for batchalign3 align and batchalign3 morphotag to
own those tiers authoritatively post-merge.
flowchart LR
File1["File 1<br/>any CHAT file"] --> Merge
File2["File 2<br/>any CHAT file<br/>(same media)"] --> Merge
Retain["--retain CHI[,SPK,…]"] -.-> Merge
Merge["chatter merge<br/>(structural)"] --> Out["Merged CHAT file<br/>retained speakers: byte-stable from File 1<br/>inserted speakers: from File 2,<br/>derived tiers stripped"]
CLI contract
chatter merge <FILE1> <FILE2> --retain <SPEAKER_LIST> [OPTIONS]
ARGUMENTS:
<FILE1> Path to the first CHAT file. Speakers listed in --retain are
taken from here, byte-preserved.
<FILE2> Path to the second CHAT file. All other speakers are taken
from here.
REQUIRED OPTIONS:
--retain <SPEAKER>[,<SPEAKER>...]
Comma-separated list of speaker codes (e.g. CHI, or
CHI,SI2). These speakers' utterances come from <FILE1>;
everything else comes from <FILE2>.
OPTIONS:
-o, --output <PATH>
Write merged output to PATH. Default: stdout.
--strip-tiers <TIER>[,<TIER>...]
Dependent tier names to strip from inserted-speaker
utterances before merging. Default: wor,mor,gra,pho.
Use empty list (--strip-tiers '') to preserve all
dependent tiers as-is.
--allow-bullet-drift
Permit small backward-time bullets in either input (where
one utterance's end_ms is slightly greater than the next
utterance's start_ms). Default behavior: warn but proceed.
Set this flag to silence the warning.
Exit codes:
| Code | Meaning |
|---|---|
| 0 | Merge succeeded |
| 1 | Invalid input (parse error, missing file, unreadable) |
| 2 | Semantic precondition violated (e.g. retained speaker missing from File 1, conflicting @Languages, no time bullets in File 1) |
| 3 | Internal error |
What the merged output guarantees
These are testable invariants. Every release verifies them against the reference corpus.
Retained speakers are byte-stable
For every speaker code in --retain, every main-tier line and every
dependent-tier line attached to that speaker in <FILE1> appears
byte-for-byte identical in the merged output, in the same relative
order they appeared in <FILE1>. CHAT markup, NAK-delimited time
bullets, paralinguistic annotations, retracing scope, terminator
variants, special-form @l/@n/@c suffixes, all preserved.
This is the core semantic guarantee of merge: if you hand-coded disfluency on the target speaker, the disfluency coding survives the merge without any structural change.
Inserted speakers’ downstream-generated tiers are stripped
For every speaker code in <FILE2> that is NOT in --retain, the
utterance is included in the merged output with its main tier
preserved verbatim BUT with %wor, %mor, %gra, and %pho
removed (configurable via --strip-tiers). Other dependent tiers
(%com, %spa, %act, %sit, %add, contributor-specific tiers)
are preserved.
The rationale: batchalign3 align and batchalign3 morphotag are
the authoritative source stages for these tiers in the post-merge
pipeline. Carrying inserted-speaker %wor across the merge would
leave the merged file in a half-state (some utterances would have
%wor, others would not), and downstream behavior on mixed inputs
is undefined. The contract is: enter the post-merge stages in a
clean state, exit with the tier present and consistent across every
utterance.
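The stripping rule is a filter over an utterance's dependent tiers. The sketch below assumes a hypothetical in-memory shape, a list of `(name, text)` pairs with the tier name written without its `%` prefix; the default tuple mirrors the documented `--strip-tiers` default.

```python
DEFAULT_STRIP = ("wor", "mor", "gra", "pho")


def strip_derived_tiers(tiers, strip=DEFAULT_STRIP):
    """Drop the named dependent tiers; keep all others in their original order."""
    drop = set(strip)
    return [(name, text) for name, text in tiers if name not in drop]


tiers = [("wor", "hello there"), ("com", "laughs"), ("mor", "INTJ|hello ADV|there")]
kept = strip_derived_tiers(tiers)
```

Passing an empty `strip` leaves every tier in place, the equivalent of `--strip-tiers ''`.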
Utterance order is timeline order
Utterances in the merged output appear in ascending order by their
start time bullet (\x15START_END\x15, milliseconds). Where two
utterances have identical start times, the first-file utterance
comes first.
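Together with the retain rule, the ordering rule can be sketched as a sort on `(start time, source file, original position)`. The dict shape with `speaker` and `start_ms` keys is an assumption of this example, and every utterance is assumed to already carry a start time.

```python
def interleave(file1_utts, file2_utts, retain):
    """Select utterances per --retain and order them by start time.

    Ties on start time are broken in favor of File 1, then by original position.
    """
    keyed = [
        (u["start_ms"], 0, i, u)
        for i, u in enumerate(file1_utts)
        if u["speaker"] in retain
    ] + [
        (u["start_ms"], 1, i, u)
        for i, u in enumerate(file2_utts)
        if u["speaker"] not in retain
    ]
    keyed.sort(key=lambda item: item[:3])
    return [u for *_, u in keyed]
```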
Time bullets are pass-through
chatter merge does NOT recompute, smooth, or refresh time bullets.
The bullets in the merged output are exactly those that appeared in
the source files. If <FILE2> had %wor rows whose first/last word
times implied a slightly different utterance span than the main-tier
bullet, the main-tier bullet wins (it was the contract before merge).
If the merge stage detects an inserted-speaker utterance with no
main-tier bullet at all (Batchalign occasionally omits these),
it lifts a bullet from the corresponding %wor row’s first-word
start and last-word end, appending it to the main tier so the
merged file has uniform bullet placement. The original %wor is
then stripped (per the per-tier rule above).
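The bullet-lifting step can be sketched as follows. The function name, the list of per-word `(start_ms, end_ms)` pairs, and the single space placed before the appended bullet are all assumptions of this example; only the NAK-delimited `START_END` shape comes from this page.

```python
NAK = "\x15"  # the control character that delimits CHAT time bullets


def lift_bullet(main_tier, wor_word_times):
    """Append a bullet built from %wor word times when the main tier has none.

    wor_word_times: (start_ms, end_ms) pairs in %wor word order.
    """
    if NAK in main_tier or not wor_word_times:
        return main_tier  # already bulleted, or nothing to lift from
    start_ms = wor_word_times[0][0]  # first word's start
    end_ms = wor_word_times[-1][1]   # last word's end
    return f"{main_tier} {NAK}{start_ms}_{end_ms}{NAK}"
```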
Header reconciliation
The merged file’s headers are constructed deterministically from the two inputs:
| Header | Source | Notes |
|---|---|---|
| @UTF8 | File 1 | always required to be @UTF8 |
| @Begin / @End | File 1 | always present in merge output |
| @Window | File 1 if present | not generated if absent |
| @Languages | File 1 | must match File 2; mismatch is an error |
| @Media | File 1 | File 2’s @Media is discarded; warning if mismatched media filename (NOT the modality field, see below) |
| @Participants | concatenation | File 1’s entries first, then File 2’s entries for non-retained speakers in their original order |
| @ID | concatenation | File 1’s @ID rows first; File 2’s @ID rows for non-retained speakers appended in their original order |
| @Comment | concatenation | File 1’s @Comment rows first; File 2’s @Comment rows appended in original order (preserves any provenance comments like ASR engine/run timestamp) |
The @Media modality field (audio vs video) is a known
divergence point: when ASR runs against an mp4, it may write
video on its input but emit audio on its output. File 1’s
modality wins, as with all @Media content; no warning is
emitted for modality mismatch.
Overlap markup is NOT injected
When an inserted-speaker utterance temporally overlaps a retained-speaker
utterance, chatter merge does NOT inject CHAT [>] / [<] /
angle-bracket-scoped overlap markers. The time bullets carry overlap
information; markers are a CLAN-era surface convention that the
output of chatter merge deliberately omits.
The retained speakers’ existing overlap markers (if File 1 already contains some) are preserved byte-stably under the byte-preservation rule above.
Preconditions
chatter merge refuses (exit code 2) if any of these hold:
- File 1 declares no utterances for any speaker in --retain.
- File 1 has no time-bulleted utterances at all (no shared timeline to merge against).
- The two files’ @Languages headers disagree.
- A speaker code appears in both files but not in --retain (use --retain to disambiguate).
- File 2 is missing or unparseable.
chatter merge does NOT refuse on these (proceeds with warning):
- Small backward-time bullets in either input (one utterance ends slightly after the next starts), common in hand transcripts, not corrupting; downstream batchalign3 align cleans these.
- File 2’s @Media modality disagrees with File 1’s (audio vs video).
- File 1 has fewer utterances than File 2, or vice versa.
Speaker identity in File 2 must already be coherent
chatter merge does NOT identify or rename speakers. If File 2 came
from ASR and carries anonymous codes like PAR0, PAR1, run
chatter speaker-id first to assign CHAT-conformant
codes. The merge step trusts whatever speaker codes appear in its
inputs.
What chatter merge is NOT
- Not ASR. Use batchalign3 transcribe.
- Not forced alignment. Use batchalign3 align.
- Not morphological tagging. Use batchalign3 morphotag.
- Not speaker identification. Use chatter speaker-id.
- Not content reconciliation. If two files disagree about what a speaker said at the same time, chatter merge does not adjudicate; it trusts --retain to designate one file as authoritative per speaker.
- Not three-way or n-way merge in this release. The 2-input case composes into the n-input case by chaining (chatter merge a b --retain X -o tmp.cha && chatter merge tmp.cha c --retain Y -o out.cha). A future release may add native n-ary merging if a workflow appears for which chained 2-way merges are awkward.
Worked example
A speech-pathology lab hand-transcribed a child’s spontaneous-speech
session, marking disfluency carefully, but did not transcribe the
clinician’s turns. They send the media and the child-only transcript;
the project runs ASR on the media to produce a full-coverage
transcript with anonymous speaker codes; then chatter speaker-id
labels the ASR file’s adult speaker as INV; then chatter merge
combines.
# After ASR labeling: asr.cha has speakers CHI and INV.
chatter merge child-only.cha asr.cha \
--retain CHI \
-o merged.cha
# Then alignment regenerates %wor cleanly across all speakers:
batchalign3 align merged.cha
# Then morphotag regenerates %mor and %gra:
batchalign3 morphotag merged.cha
The merged file contains:
- Every *CHI utterance byte-stable from child-only.cha, including every disfluency marker, every retracing scope, every paralinguistic annotation, every %com session-structural comment.
- Every *INV utterance from asr.cha, in their original time order, interleaved with the *CHI utterances by start time.
- One @Participants row listing CHI and INV; @ID rows for both; the union of @Comment rows including any ASR provenance comments from asr.cha.
Relationship to other commands
flowchart TB
Media[Media file mp4 / wav] --> Transcribe
Transcribe["batchalign3 transcribe<br/>ASR"] --> AsrAnon["asr-anonymous.cha<br/>PAR0, PAR1, ..."]
HandTranscript["hand-transcript.cha<br/>target speakers only"] --> SpeakerId
AsrAnon --> SpeakerId
SpeakerId["chatter speaker-id<br/>label anon speakers"] --> AsrLabeled["asr-labeled.cha<br/>CHI, INV, MOT, ..."]
HandTranscript --> Merge
AsrLabeled --> Merge
Merge["chatter merge<br/>(this page)"] --> Merged[merged.cha]
Merged --> Align
Align["batchalign3 align"] --> Aligned[aligned.cha]
Aligned --> Morph
Morph["batchalign3 morphotag"] --> Final[final.cha]
chatter merge sits between speaker-identity resolution and
forced alignment. It assumes its inputs have coherent CHAT-conformant
speaker codes (no anonymous PAR0/PAR1) and emits a file ready for
batchalign3 align to refresh timing and produce %wor.
Inputs must be valid CHAT (pipeline / batch)
The per-session chatter pipeline shortcut and the directory-level
chatter batch driver validate every input as CHAT before doing any
speaker-id or merge work. Each donor and the reference it is merged
against must pass the same validation chatter validate runs; an input
that fails is never merged. Clean invalid transcripts to valid CHAT
first (run chatter validate <file> to see the errors), then re-run.
- chatter pipeline refuses (exit 2, no output written) if its donor or reference is invalid CHAT.
- chatter batch is fail-closed and whole-batch: if any input under the donor/reference directories is invalid CHAT, it reports every offending file and aborts the entire run without merging a single session. “All inputs are chatter-valid” is a hard precondition of the batch, not something discovered session-by-session mid-run.
This gate catches validation-only invalidity (files that parse but fail
chatter validate, e.g. a malformed @ID), which the lower-level
chatter merge parse is otherwise lenient about.
LLM holistic judgment (pending-only)
--judgment holistic is now reachable from pipeline and batch (not
just speaker-id). In holistic mode the command is pending-only: it
writes an engine = "llm" review-gated entry via --write-pending and
produces no merged file. The operator supplies the LLM connection with
--llm-endpoint / --llm-model (or the environment variables
CHATTER_LLM_ENDPOINT / CHATTER_LLM_MODEL); an optional
--session-context <file.json> provides per-session context that the
LLM prompt includes to sharpen its judgment.
Session-context JSON (--session-context)
The session-context file is a corpus-agnostic JSON object mapping session IDs (the donor file’s basename stem) to context records. Every record field is optional, and the label fields are free vocabulary: chatter imposes no closed set, the labels are surfaced verbatim into the LLM prompt.
{
"SESSION-ID": {
"sample_type": "clinician interview",
"declared_roles": ["Investigator"],
"consent_tier": "video+audio",
"age_months": 52
}
}
- sample_type: what kind of speech sample the session is (e.g. "narrative retell").
- declared_roles: adult roles declared present in the session.
- consent_tier: media-consent tier governing what may be shared.
- age_months: child age in months at the session.
When --session-context is absent, the CHATTER_SESSION_CONTEXT
environment variable supplies the path (empty counts as unset). Per
session, each context field resolves in order: the explicit record
from the file; for the age only, the donor’s CHAT @ID age header
(pure CHAT, no external metadata needed); otherwise unknown. Absent
sessions or fields are passed to the judgment as unknown, never
guessed. A configured-but-malformed file is a hard error, and labels
must contain at least one non-whitespace character. Configuring
session context on a non-holistic run prints a warning (the
deterministic judgment never consults it).
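The per-field resolution order can be sketched as below. The function names, the `"unknown"` sentinel string, and the way a CHAT age such as `2;6.` (years;months.days) is converted to whole months are assumptions of this example; the page only states that the @ID age header is the fallback source for the age.

```python
import re

FIELDS = ("sample_type", "declared_roles", "consent_tier", "age_months")
UNKNOWN = "unknown"


def age_to_months(chat_age):
    """Convert a CHAT age like '2;6.' or '2;06.15' to whole months, else None."""
    match = re.fullmatch(r"(\d+);(\d*)\.?(\d*)", chat_age)
    if match is None:
        return None
    return int(match.group(1)) * 12 + int(match.group(2) or 0)


def resolve_context(context, session_id, id_age=None):
    """Resolve each field: explicit record, then @ID age (age only), then unknown."""
    record = context.get(session_id, {})
    resolved = {}
    for field in FIELDS:
        if field in record:
            resolved[field] = record[field]
            continue
        months = age_to_months(id_age) if (field == "age_months" and id_age) else None
        resolved[field] = months if months is not None else UNKNOWN
    return resolved
```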
Conversion from a contributor’s own records format (a spreadsheet, a database export) to this JSON happens outside chatter.
The two-pass operator flow is:
- batch --judgment holistic --session-context context.json --write-pending P accumulates one engine = "llm" pending entry per session in P.
- Operator reviews P, accepts or corrects each entry.
- chatter adjudicate promotes reviewed entries to the override file.
- batch (deterministic, reading the override file) replays every confirmed mapping and writes the merged files.
Note: the MLU sanity-scan is unreliable for the FluencyBank clinical-interview
corpus (children out-narrate the adult, so MLU ratios invert relative to typical
child-language recordings). Holistic-pending review via --judgment holistic
is the trustworthy alternative there.
Implementation notes (for contributors)
- Source: crates/talkbank-transform/src/transcript_merge/ (proposed layout, see the design plan).
- CLI surface: crates/chatter/src/commands/transcript_merge/.
- Domain types (SpeakerCode, RetainSet, MergeOverride, SpeakerMapping) live in talkbank-model so the override-file format is sharable across the speaker-id stage, the orchestrator, and any future adjudication UI.
- The merge operates on talkbank-model::ChatFile; both inputs are parsed via talkbank-parser. The byte-preservation guarantee on retained-speaker utterances relies on the parser’s existing round-trip serialization.
- Spec entries exercising the merge live in spec/constructs/. Every behavioral invariant on this page has a spec; tests are regenerated via the current spec/tools workflow documented in Spec Workflow.
- This page is the user contract; book/src/chatter/reference/ carries the override-file reference for the speaker-id stage that this merge consumes.
The Merge Workflow (pipeline, batch, adjudicate, sanity-scan)
Status: Draft (experimental) Last modified: 2026-06-15 10:39 EDT
The merge workflow combines, at scale, the two structural primitives
documented elsewhere, chatter speaker-id (assign
CHAT-conformant speaker codes to an anonymous donor) and
chatter merge (combine two transcripts of the same
recording), and adds the operator loop needed when the automatic
speaker decision is not confident enough to trust.
Four commands make up the workflow. They are experimental and in active development; flags and behavior may change.
| Command | Scope | Role |
|---|---|---|
| chatter pipeline | one session | speaker-id (reference mode) then merge, in a single invocation |
| chatter batch | a directory pair | loop pipeline over matched donor / reference files |
| chatter adjudicate | the operator | resolve the low-confidence sessions a pass left pending |
| chatter sanity-scan | merged output | flag confident auto-decisions that still look suspicious |
If you only have one pair of files and one clean answer, reach for
pipeline. Everything else here is about doing that safely across a
directory of sessions where some answers are not clean.
The big picture: a two-pass loop
The hard part of merging at scale is not the merge; it is deciding,
per session, which anonymous ASR speaker is the child the reference
already covers. speaker-id’s multiset-Jaccard match (see its page)
answers that automatically when the winner clearly beats the
runner-up, and refuses (exit code 4) when it does not. The
workflow turns that refusal into a reviewable queue.
flowchart TD
subgraph Pass1["Pass 1: automatic"]
B1["chatter batch DONOR_DIR REF_DIR\n--write-override audit.toml\n--write-pending pending.toml"]
B1 --> Clean["confident sessions:\nmerged file written,\ndecision logged to audit.toml"]
B1 --> Refused["low-confidence sessions:\nNO merge, appended to pending.toml\n(exit code 4)"]
end
Refused --> Adj["chatter adjudicate pending.toml\n--override-file audit.toml\n(operator decides)"]
Adj --> Pass2["Pass 2: chatter batch ... --override-file audit.toml\n(replays the operator's decisions,\nmerges the previously-refused sessions)"]
Clean --> Done["all sessions merged"]
Pass2 --> Done
Pass 1 merges everything it is confident about and parks the rest. The
operator works the parked queue once. Pass 2 replays their decisions.
The same chatter batch (or chatter pipeline) command runs both
passes; what changes is whether an override file with entries exists
yet.
chatter pipeline (one session)
The per-session shortcut: run speaker-id in reference mode to
relabel an anonymous donor, then merge the relabeled donor with the
reference, in one command instead of two.
chatter pipeline <DONOR> <REFERENCE> \
--anchor <SPEAKER> --inserted-role <CODE>:<ROLE> --output <PATH> [OPTIONS]
ARGUMENTS:
<DONOR> Donor CHAT file with anonymous speaker codes (the ASR output).
<REFERENCE> Reference CHAT file carrying the authoritative anchor speaker
(typically the hand-coded child transcript).
REQUIRED:
--anchor <SPEAKER> Anchor code in the reference (typically CHI).
--inserted-role <CODE>:<ROLE> Role for the donor's non-anchor speakers
(e.g. INV:Investigator).
-o, --output <PATH> Output path for the merged CHAT file.
KEY OPTIONS:
--retain <SPEAKER> Speaker(s) taken from the reference in the
final merge (typically the same as --anchor).
--confidence-threshold <F> Minimum winner/runner-up Jaccard margin to
auto-decide (default 2.0x).
--write-override <FILE> On a confident auto-decision, append a
mode = "auto" audit entry for this session.
--write-pending <FILE> On a low-confidence refusal, append a pending
entry (exit code 4 still fires).
--override-file <FILE> If the file has an entry for this session
(the donor's basename stem), replay that
decision instead of running reference mode.
The same command serves pass 1 (no override entry yet, run reference
mode) and pass 2 (entry present, replay it). Validation is a hard
precondition: a donor or reference that fails chatter validate is
never merged (exit 2, nothing written).
chatter batch (a directory pair)
Loops pipeline over matched files: the reference for DONOR_DIR/X.cha
is REFERENCE_DIR/X.cha. Donors without a matching reference are warned
and skipped. It is fail-closed and whole-batch on validity: if any
input under either directory is invalid CHAT, the batch reports every
offending file and aborts without merging a single session.
chatter batch <DONOR_DIR> <REFERENCE_DIR> \
--anchor <SPEAKER> --inserted-role <CODE>:<ROLE> --output <DIR> [OPTIONS]
PASS-1 AUDIT + QUEUE:
--write-override <FILE> Append every confident auto-decision (mode =
"auto"). Required if you want --sanity-scan.
--write-pending <FILE> Aggregate every low-confidence refusal into one
pending file. One `chatter adjudicate` run resolves
them all. Refusals do NOT abort the batch.
PASS-2 REPLAY:
--override-file <FILE> Threaded to every per-session pipeline call.
Sessions with an entry replay it; the rest fall
through to reference mode.
POST-MERGE QA:
--sanity-scan Run `sanity-scan` after the loop. Requires
--write-override (it reads the auto-decisions) and
--write-pending (flagged sessions are appended).
Exit code 4 fires if it flags any session.
--sanity-scan-threshold <F> Heuristic ratio (default 1.5).
OPERATIONAL:
--skip-existing Skip donors whose merged output already exists, to
resume an interrupted batch.
batch also accepts the same --judgment deterministic|holistic and
LLM / --session-context options as pipeline; see
Merge, LLM holistic judgment
for that mode and the session-context JSON format.
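The file-matching rule described for chatter batch (the reference for DONOR_DIR/X.cha is REFERENCE_DIR/X.cha, and unmatched donors are skipped) can be sketched with `pathlib`. The function name is hypothetical, and the non-recursive `*.cha` glob is an assumption; this page does not say whether subdirectories are traversed.

```python
from pathlib import Path


def pair_sessions(donor_dir, reference_dir):
    """Return (pairs, skipped): matched (session_id, donor, reference) triples
    and the donors that have no same-named reference file."""
    pairs, skipped = [], []
    for donor in sorted(Path(donor_dir).glob("*.cha")):
        reference = Path(reference_dir) / donor.name
        if reference.is_file():
            # The session ID is the donor's basename stem.
            pairs.append((donor.stem, donor, reference))
        else:
            skipped.append(donor)
    return pairs, skipped
```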
chatter adjudicate (the operator step)
Reads the pending file a pass produced, walks the operator through the unresolved sessions, and appends the resolved decisions to the override file. On success the pending file is rewritten to drop the entries that were resolved, so re-running adjudicate only ever shows what is left.
chatter adjudicate <PENDING> --override-file <FILE> [--interactive | --scripted <TOML>]
ARGUMENTS:
<PENDING> The pending-adjudications TOML a pass wrote.
REQUIRED:
--override-file <FILE> Override file to append resolved decisions to
(created if absent). This is the same file pass 2
reads back.
DECISION SOURCE (one of):
--interactive Prompt per pending entry on stdin. Currently
supports `accept` / `a` (accept the suggested
mapping).
--scripted <TOML> Pre-canned operator decisions, for replayable /
tested runs. Mutually exclusive with --interactive.
--operator <NAME> Recorded in each override entry (defaults to $USER).
This is the interactive review tool the speaker-id and merge pages
refer to: the audit trail (who decided, the scores, any note) lands in
the override file so a later reader can see why a session was labeled
the way it was. The decision schema is the same override-file format
used everywhere in the workflow; see
Merge Override File Format, and the
Adjudication Workflow
architecture page for the design.
chatter sanity-scan (post-merge QA)
A confident auto-decision can still be wrong: the runner-up was simply
even further off. sanity-scan re-reads the merged output and the
pass-1 audit file and flags sessions that fail an out-of-band check: the
mean utterance word count of the anchor speaker versus the inserted
speaker. In a typical child-language recording the adult out-talks the
child, so an anchor (child) mean that is much higher than the inserted
(adult) mean is suspicious, possibly the two were swapped.
chatter sanity-scan <MERGED_DIR> \
--override-file <FILE> --anchor <SPEAKER> --write-pending <FILE> [OPTIONS]
REQUIRED:
--override-file <FILE> The pass-1 audit file. Only auto-decided sessions
are scanned; explicit-mode entries are skipped (the
operator already signed off).
--anchor <SPEAKER> Anchor code in the merged files (typically CHI).
--write-pending <FILE> Flagged sessions are appended here as
sanity-scan-misclassification pending entries for
`chatter adjudicate`. Required.
OPTIONS:
--threshold <F> Flag when anchor_mean >= inserted_mean * threshold
(default 1.5).
A flag is a question, not a verdict: the session goes back into the adjudication queue for an operator to confirm or correct. Whether to run the scan at all is a judgment about the corpus. It assumes the typical “adult out-talks child” shape, and is unreliable where that inverts (e.g. a clinical-interview corpus where children out-narrate the adult); there, prefer the LLM holistic-pending review described on the merge page.
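The heuristic itself is one comparison of mean utterance lengths. In this sketch, counting words by whitespace splitting and the function name `is_suspicious` are assumptions; how chatter tokenizes utterances for the scan is not specified on this page.

```python
from statistics import mean


def is_suspicious(anchor_utts, inserted_utts, threshold=1.5):
    """Flag when anchor_mean >= inserted_mean * threshold.

    Each argument is a non-empty list of utterance strings.
    """
    anchor_mean = mean(len(u.split()) for u in anchor_utts)
    inserted_mean = mean(len(u.split()) for u in inserted_utts)
    return anchor_mean >= inserted_mean * threshold


child = ["I went to the park and saw a dog", "then we played"]
adult = ["okay", "and then"]
flagged = is_suspicious(child, adult)  # mean 6 words vs 1.5 words
```

Here the anchor speaks far longer utterances than the inserted speaker, so the session would be queued for the operator.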
End-to-end worked example
A directory of ASR donors (asr/) and the matching hand-coded child
references (ref/), child anchor CHI, adults labeled INV:
# Pass 1: merge what we are sure of; queue the rest; keep an audit trail.
chatter batch asr/ ref/ \
--anchor CHI --inserted-role INV:Investigator \
--output merged/ \
--write-override audit.toml \
--write-pending pending.toml \
--sanity-scan
# Exit 0: every session merged confidently and the scan was clean.
# Exit 4: some sessions are pending (low-confidence and/or scan-flagged).
# Operator resolves the queue once (audit trail recorded):
chatter adjudicate pending.toml --override-file audit.toml --interactive --operator alice
# Pass 2: replay the operator's decisions; the previously-pending
# sessions now merge.
chatter batch asr/ ref/ \
--anchor CHI --inserted-role INV:Investigator \
--output merged/ \
--override-file audit.toml \
--skip-existing
Exit codes
The workflow commands share the convention used across the merge surface:
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | Invalid input (parse error, missing file, unreadable) |
| 2 | Semantic precondition violated (e.g. invalid CHAT input, missing anchor) |
| 3 | Internal error |
| 4 | A pass parked work for the operator: a low-confidence speaker-id refusal, or a sanity-scan flag. Nothing was lost; the sessions are in the pending file |
Exit code 4 is the normal “there is operator work to do” signal, not an error: a batch that parks ten sessions still merged the rest.
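A wrapper script or CI job that drives these commands has to treat exit code 4 differently from a real failure. The sketch below shows one way to encode the table; the enum and function names are illustrative assumptions, not part of chatter.

```rust
/// Interpretation of a workflow command's exit code, following the table above.
#[derive(Debug, PartialEq)]
enum Outcome {
    Success,
    InvalidInput,
    PreconditionViolated,
    InternalError,
    PendingOperatorWork,
    /// Any code the table does not document.
    Unknown(i32),
}

fn classify_exit(code: i32) -> Outcome {
    match code {
        0 => Outcome::Success,
        1 => Outcome::InvalidInput,
        2 => Outcome::PreconditionViolated,
        3 => Outcome::InternalError,
        4 => Outcome::PendingOperatorWork,
        other => Outcome::Unknown(other),
    }
}

/// Exit code 4 means "operator work is queued", so it is not a failure.
fn is_failure(outcome: &Outcome) -> bool {
    !matches!(outcome, Outcome::Success | Outcome::PendingOperatorWork)
}
```

Treating undocumented codes as failures is a conservative choice made for this sketch.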
See also
- Speaker-ID: the speaker-relabeling primitive and the Jaccard matching algorithm pass 1 uses.
- Merge: the structural merge primitive and the LLM holistic-judgment mode.
- Merge Override File Format: the shared decision schema.
- Adjudication Workflow and Merge Pipeline, Crate Architecture: the developer-facing design.
CHAT Format Overview
Status: Reference Last updated: 2026-05-11 21:51 EDT
CHAT (Codes for the Human Analysis of Transcripts) is a standardized transcription format for spoken language data, developed by MacWhinney as part of the CHILDES and TalkBank projects. It is the most widely used format in child language research and conversational analysis.
File Anatomy
Every CHAT file follows this structure:
@UTF8
@Begin
@Languages: eng
@Participants: CHI Target_Child, MOT Mother
@ID: eng|corpus|CHI|2;6.||||Target_Child|||
@ID: eng|corpus|MOT|||||Mother|||
*MOT: what do you want ?
%mor: ADV|what AUX|do PRON|you VERB|want ?
%gra: 1|4|LINK 2|4|AUX 3|4|SUBJ 4|0|ROOT 5|4|PUNCT
*CHI: I want cookie .
%mor: PRON|I VERB|want NOUN|cookie .
%gra: 1|2|SUBJ 2|0|ROOT 3|2|OBJ 4|2|PUNCT
@End
A CHAT file consists of:
- @UTF8: required first line, declares UTF-8 encoding
- @Begin: marks the start of the transcript
- Headers: lines starting with @ that provide metadata (participants, languages, IDs, etc.)
- Utterances: blocks consisting of:
  - A main tier (line starting with *SPEAKER:) containing the transcribed speech
  - Zero or more dependent tiers (lines starting with %tier:) containing annotations
- @End: marks the end of the transcript
Key Conventions
- Tab separation: a tab character separates the tier prefix from its content (e.g., *CHI:⟶content)
- Terminators: every utterance ends with a terminator (., ?, !, or special forms like +...)
- Line continuation: long lines wrap with a tab at the start of continuation lines
- Speaker codes: short identifiers; the validator accepts up to seven characters from A-Z, 0-9, _, -, '; three uppercase letters is the convention (e.g., CHI, MOT, FAT, INV)
- Media linking: timestamps link transcripts to audio/video via bullet markers
CHAT vs Other Formats
| Feature | CHAT | Praat TextGrid | ELAN EAF |
|---|---|---|---|
| Morphological tiers | Built-in (%mor, %gra) | No | No |
| Dependency syntax | Built-in (%gra) | No | No |
| Standardized POS | UD-style via %mor | No | No |
| Word-level alignment | %wor tier | Interval-based | Interval-based |
| Error recovery | Tree-sitter GLR | N/A | N/A |
References
- CHAT Manual: the canonical reference
- TalkBank: the data repository
Headers
Status: Reference Last updated: 2026-05-11 20:30 EDT
Headers are lines beginning with @ that provide metadata about the transcript. They appear between @Begin and the first utterance (though some headers like @Comment can appear anywhere).
Required Headers
@UTF8
Must be the very first line of every CHAT file. Declares UTF-8 encoding.
@UTF8
@Begin / @End
Mark the start and end of the transcript body. Every CHAT file must have exactly one @Begin and one @End.
@Participants
Declares all speakers in the transcript. Format:
CODE [Name] Role, comma-separated. The role is required; the name
is optional, so each entry is either CODE Role or CODE Name Role.
@Participants: CHI Target_Child, MOT Mother, FAT Father
@Participants: CHI Alex Target_Child, MOT Mary Mother
In the first line, Target_Child, Mother, and Father are roles,
not names. In the second line, Alex and Mary are optional names
sitting between the speaker code and the role.
Speaker codes are short identifiers; the validator accepts up to
seven characters from A-Z, 0-9, _, -, and '. The convention
is three uppercase letters; the most common codes are:
- CHI: target child
- MOT: mother
- FAT: father
- INV: investigator
- OBS: observer
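The entry shapes and the speaker-code character set described above can be sketched as follows. This is an illustrative reading of the rules on this page, not the parser in the talkbank crates: the type and function names are hypothetical, and requiring at least one character in a code is an assumption.

```rust
/// Speaker-code check: up to seven characters from A-Z, 0-9, _, -, and '.
/// (The lower bound of one character is an assumption of this sketch.)
fn is_valid_speaker_code(code: &str) -> bool {
    let len = code.chars().count();
    (1..=7).contains(&len)
        && code
            .chars()
            .all(|c| c.is_ascii_uppercase() || c.is_ascii_digit() || matches!(c, '_' | '-' | '\''))
}

#[derive(Debug, PartialEq)]
struct Participant {
    code: String,
    name: Option<String>,
    role: String,
}

/// Parses the value of an @Participants header (the text after the tab).
/// Each comma-separated entry is either `CODE Role` or `CODE Name Role`.
fn parse_participants(value: &str) -> Option<Vec<Participant>> {
    value
        .split(',')
        .map(|entry| {
            let parts: Vec<&str> = entry.split_whitespace().collect();
            match parts.as_slice() {
                [code, role] if is_valid_speaker_code(code) => Some(Participant {
                    code: code.to_string(),
                    name: None,
                    role: role.to_string(),
                }),
                [code, name, role] if is_valid_speaker_code(code) => Some(Participant {
                    code: code.to_string(),
                    name: Some(name.to_string()),
                    role: role.to_string(),
                }),
                _ => None, // wrong number of tokens or an illegal code
            }
        })
        .collect()
}
```

Returning None for the whole header when any entry is malformed keeps the sketch short; a real validator would report which entry failed.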
@ID
Provides detailed metadata for each participant. One @ID line per participant.
@ID: eng|corpus|CHI|2;6.||||Target_Child|||
Fields (pipe-separated): language, corpus, speaker code, age, sex, group, SES, participant role, education, custom field.
Age format: years;months.days (e.g., 2;6. = 2 years, 6 months).
SES field: ethnicity (White, Black, Asian, Latino, Pacific, Native, Multiple, Unknown), socioeconomic code (UC, MC, WC, LI), or combined with comma separator (e.g., White,MC).
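To make the field order concrete, here is a small sketch that splits an @ID value into its ten fields and reads the age. It is not the repository's parser: the names are hypothetical, the trailing pipe is assumed to be mandatory because the example has one, and the age parser accepts only the years;months.days shape with optional months and days.

```rust
/// The ten pipe-separated @ID fields, in documented order.
#[derive(Debug, PartialEq)]
struct IdHeader<'a> {
    language: &'a str,
    corpus: &'a str,
    speaker: &'a str,
    age: &'a str,
    sex: &'a str,
    group: &'a str,
    ses: &'a str,
    role: &'a str,
    education: &'a str,
    custom: &'a str,
}

/// Parses the value of an @ID header (the text after the tab).
fn parse_id(value: &str) -> Option<IdHeader<'_>> {
    // Assumption: the value ends with a closing pipe, as in the example.
    let body = value.trim().strip_suffix('|')?;
    let f: Vec<&str> = body.split('|').collect();
    if f.len() != 10 {
        return None;
    }
    Some(IdHeader {
        language: f[0], corpus: f[1], speaker: f[2], age: f[3], sex: f[4],
        group: f[5], ses: f[6], role: f[7], education: f[8], custom: f[9],
    })
}

/// Parses `years;months.days` into (years, months, days).
/// Months and days may be empty, as in `2;6.`.
fn parse_age(age: &str) -> Option<(u32, Option<u32>, Option<u32>)> {
    let (years, rest) = age.split_once(';')?;
    let years = years.parse::<u32>().ok()?;
    let (months, days) = match rest.split_once('.') {
        Some((m, d)) => (m, d),
        None => (rest, ""),
    };
    let parse_opt = |s: &str| -> Option<Option<u32>> {
        if s.is_empty() { Some(None) } else { s.parse::<u32>().ok().map(Some) }
    };
    Some((years, parse_opt(months)?, parse_opt(days)?))
}
```

Empty fields stay as empty strings, which mirrors how the MOT line in the file-anatomy example leaves age, sex, group, and SES blank.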
Optional Headers
@Languages
Declares the language(s) used in the transcript.
@Languages: eng, fra
@Date
Recording date in DD-MON-YYYY format.
@Date: 15-JAN-2024
@Location
Where the recording took place.
@Location: Boston, MA, USA
@Situation
Description of the recording context.
@Situation: free play with toys in lab
@Activities
Activities during the recording.
@Activities: toyplay, reading
@Comment
Free-form comments. Can appear anywhere in the file (before, between, or after utterances).
@Comment: child was tired during this session
@Media
Links the transcript to an audio or video file.
@Media: session01, audio
@Transcriber / @Coder
Identifies who created or coded the transcript.
@Transcriber: JDS
@Coder: ABC
Header Ordering
Headers should follow this conventional order:
- @UTF8 (required, first line)
- @Begin (required)
- @Languages
- @Participants (required)
- @ID lines (one per participant)
- Other metadata headers (@Date, @Location, etc.)
- @Comment lines (can also appear later)
Validation
The parser validates header structure including:
- @UTF8 must be the first non-empty line
- @Begin and @End are required and must appear exactly once
- @Participants is required and must declare all speakers used in utterances
- @ID participant codes must match @Participants declarations
- Age format validation in @ID lines
Utterances
Status: Reference Last updated: 2026-05-11 23:22 EDT
An utterance is the fundamental unit of a CHAT transcript. It consists of a main tier (the transcribed speech) followed by zero or more dependent tiers (annotations).
Main Tier
The main tier begins with *SPEAKER: followed by a tab and the utterance content, ending with a terminator.
*CHI: I want a cookie .
Speaker Codes
Speaker codes are short identifiers (up to seven characters from A-Z, 0-9, _, -, '; three uppercase letters is the convention) matching a code declared in @Participants:
@Participants: CHI Target_Child, MOT Mother
*MOT: what do you want ?
*CHI: cookie .
Terminators
Every utterance must end with a terminator:
| Terminator | Meaning |
|---|---|
| . | Declarative (period) |
| ? | Question |
| ! | Exclamation |
| +... | Trailing off |
| +..? | Trailing-off question |
| +/. | Interruption |
| +//. | Self-interruption |
| +/? | Interrupted question |
| +!? | Broken question |
| +"/. | Quotation follows on next line |
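As a rough illustration of how a tool might look up the terminator, the sketch below checks the last whitespace-delimited token of the utterance content against the table above. It is not the grammar's approach (the tree-sitter grammar handles terminators structurally), the function name is hypothetical, and it assumes nothing such as a bullet or postcode follows the terminator.

```rust
/// The terminators listed in the table above.
const TERMINATORS: [&str; 10] = [
    ".", "?", "!", "+...", "+..?", "+/.", "+//.", "+/?", "+!?", "+\"/.",
];

/// Returns the terminator if the final whitespace-delimited token is one.
/// Assumes `content` is the main-tier text with no trailing bullet or postcode.
fn terminator_of(content: &str) -> Option<&'static str> {
    let last = content.split_whitespace().last()?;
    TERMINATORS.iter().copied().find(|t| *t == last)
}
```

Comparing whole tokens avoids mistaking the final `.` of `+...` for a plain declarative period.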
Line Continuation
Long utterances wrap to the next line with a leading tab:
*MOT: well I think that we should probably go to
the store and get some more cookies .
Content Items
The content between *SPEAKER: and the terminator consists of content items separated by whitespace:
- Words: regular words, potentially with annotations
- Groups: bracketed content like <word word> for overlap, retrace, etc.
- Special forms: pauses (.), events &=laughs, fillers &-uh
- Separators: commas , and other punctuation
Words
Words are the primary content unit. See Word Syntax for full details.
Groups
Angle brackets < > group words for annotations:
*CHI: <I want> [/] I want cookie .
Common group annotations:
- [/]: partial retrace (speaker repeats the same words)
- [//]: full retrace (speaker restarts with different words)
- [///]: multiple retracing (multiple false starts)
- [/-]: reformulation (speaker rephrases with different structure)
- [?]: uncertain transcription
Special Forms
*CHI: um (.) I want &-uh cookie .
- (.): short pause
- (..): medium pause
- (...): long pause
- (1.5): timed pause in seconds
- &=laughs: paralinguistic event
- &-uh: filler
Media Linking
Utterances can include media timestamps (bullets) that link to audio/video:
*CHI: I want cookies . •1234_5678•
The numbers represent start and end times in milliseconds. The bullets
delimiting the pair render as • in most editors; on disk they are
the NAK control character (U+0015). See grammar/grammar.js rule
bullet.
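A minimal sketch of reading a trailing bullet from a raw main-tier line follows. It relies only on what this page states (a start_end millisecond pair delimited by U+0015); the function name is hypothetical, and the grammar's bullet rule remains the authority on the exact syntax.

```rust
/// The on-disk bullet delimiter (NAK), which editors typically render as a dot.
const BULLET: char = '\u{0015}';

/// Extracts (start_ms, end_ms) from a bullet at the end of a line, if present.
fn parse_trailing_bullet(line: &str) -> Option<(u64, u64)> {
    // Drop the closing delimiter, then find the opening one.
    let rest = line.trim_end().strip_suffix(BULLET)?;
    let (_, pair) = rest.rsplit_once(BULLET)?;
    let (start, end) = pair.split_once('_')?;
    Some((start.parse::<u64>().ok()?, end.parse::<u64>().ok()?))
}
```

A line without a bullet, or with a malformed time pair, yields None rather than a guess.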
Dependent Tiers
See Dependent Tiers for documentation on %mor, %gra, %pho, %wor, and other annotation tiers that follow the main tier.
Retraces and Repetitions
Status: Current Last updated: 2026-05-11 23:16 EDT
Retraces mark content that the speaker said but then corrected, repeated, or abandoned. They are one of the most consequential constructs in CHAT because they affect how every dependent tier aligns to the main tier.
CHAT Syntax
A retrace has two parts: the retraced content (what the speaker said first) and the correction (what follows). The retraced content is marked with a trailing bracket code:
| Marker | Name | Meaning |
|---|---|---|
| [/] | Partial repetition | Speaker repeats the same words |
| [//] | Full correction | Speaker restarts with different words |
| [///] | Multiple correction | Multiple false starts |
| [/-] | Reformulation | Speaker rephrases with different structure |
Single-Word Retraces
When only one word is retraced, no angle brackets are needed:
*CHI: I [/] I want that .
*CHI: ana [//] an .
*MOT: the book [/-] the magazine is here .
Group Retraces
When multiple words are retraced, angle brackets delimit the scope:
*MOT: <the dog> [//] the cat ran .
*CHI: <I want> [/] I want cookie .
*CHI: <I want the> [///] give me that .
Retraces with Replacements
A retraced word often has a replacement [: target] and/or error code
[* code]. This is common in aphasia and child language corpora where
the speaker produces an incorrect form:
*PAR: tika@u [: kitty] [* p:n] [//] kitty is nice .
%mor: noun|kitty aux|be-Fin-Ind-Pres-S3 adj|nice-S1 .
*PAR: lɛɾɪ@u [: later] [* p:n] [//] later in the day .
%mor: adv|late adp|in det|the-Def-Art noun|day .
*CHI: male [: female] [* s:r] [/] male [: female] [* s:r] .
%mor: adj|female-S1 .
In each case, the retraced word (before the [//] or [/]) is excluded
from %mor alignment. Only the correction (after the marker) is counted.
Data Model
Retraces are a first-class variant of UtteranceContent:
flowchart TD
UC["UtteranceContent"]
UC --> Word
UC --> RW["ReplacedWord"]
UC --> Retrace
UC --> AG["AnnotatedGroup"]
UC --> Other["...20 other variants"]
Retrace --> BC["BracketedContent"]
Retrace --> RK["RetraceKind"]
BC --> BIW["BracketedItem::Word"]
BC --> BIRW["BracketedItem::ReplacedWord"]
style Retrace fill:#f96,stroke:#333
The Retrace struct wraps the retraced content in a BracketedContent
container, which can hold any combination of words, replaced words, and
other content items:
// crates/talkbank-model/src/model/content/retrace.rs
pub struct Retrace {
pub content: BracketedContent, // the retraced words
pub kind: RetraceKind, // Partial, Full, Multiple, Reformulation
pub is_group: bool, // <word> [/] vs word [/]
pub annotations: Vec<ContentAnnotation>, // non-retrace annotations after marker
pub span: Span,
}
Why First-Class?
Before the retrace refactor, retraces were represented as annotations
on words or groups. This meant every match on content had to inspect
annotation lists to determine whether a word was retraced. This led to
a class of bugs where retraced content was accidentally included in
alignment counting, word extraction, or retokenization.
Making Retrace a top-level UtteranceContent variant means:
- The compiler enforces handling. Every match on UtteranceContent must have a Retrace arm. Forgetting to handle retraces is a compile error, not a silent runtime bug.
- Domain-aware gating is centralized. The content walker checks the Retrace variant once, not at every annotation-inspection site.
- Alignment counting is simple. The count function returns 0 for Retrace in the Mor domain; no annotation inspection is needed.
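The compile-time guarantee is easiest to see on a toy model. The types below are deliberately simplified stand-ins, not the talkbank-model definitions: `Content` has only two variants and the counting function is a hypothetical reduction of the real one.

```rust
#[derive(Clone, Copy, PartialEq)]
enum TierDomain {
    Mor,
    Pho,
    Sin,
    Wor,
}

/// Toy stand-in for UtteranceContent with retraces as a first-class variant.
enum Content {
    Word(String),
    Retrace(Vec<Content>),
}

/// Counts alignable items for one tier domain.
/// The match has no wildcard arm, so adding a new `Content` variant
/// (or deleting the `Retrace` arm) stops this function from compiling.
fn count_alignable(item: &Content, domain: TierDomain) -> usize {
    match item {
        Content::Word(_) => 1,
        Content::Retrace(inner) => {
            if domain == TierDomain::Mor {
                0 // retraced content is excluded from %mor alignment
            } else {
                inner.iter().map(|c| count_alignable(c, domain)).sum()
            }
        }
    }
}
```

For `<I want> [/] I want cookie .`, this toy model counts 3 items in the Mor domain and 5 in the Pho, Sin, and Wor domains.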
Parser Conversion
The tree-sitter grammar parses retrace markers ([/], [//], etc.) as
annotations on word_with_optional_annotations. The Rust parser converts
them to structural Retrace nodes in parse_word_content():
flowchart LR
subgraph "Tree-sitter CST"
WOA["word_with_optional_annotations"]
SW["standalone_word"]
BA["base_annotations"]
RP["retrace_partial / retrace_complete / ..."]
WOA --> SW
WOA --> BA
BA --> RP
end
subgraph "Rust Model"
RET["UtteranceContent::Retrace"]
BC2["BracketedContent"]
W2["Word or ReplacedWord"]
RET --> BC2
BC2 --> W2
end
WOA -->|"parse_word_content()\n(word.rs)"| RET
Three cases in parse_word_content():
- Word + retrace (I [/]): wrap Word in BracketedItem::Word inside Retrace
- Word + replacement + retrace (tika@u [: kitty] [* p:n] [//]): build ReplacedWord, then wrap in BracketedItem::ReplacedWord inside Retrace
- Word + replacement, no retrace (tika@u [: kitty]): emit bare ReplacedWord
Group retraces (<content> [/]) are handled in group/parser.rs via
the same structural wrapping.
Alignment Behavior
Retraces interact differently with each dependent tier domain:
flowchart TD
RT["Retrace node\n(e.g. 'tika@u [: kitty] [* p:n] [//]')"]
RT -->|"Mor domain"| SKIP["SKIP\n(return 0)\nNot morphologically analyzed"]
RT -->|"Pho domain"| COUNT["COUNT\nPhonologically produced"]
RT -->|"Sin domain"| COUNT2["COUNT\nGesturally produced"]
RT -->|"Wor domain"| COUNT3["RECURSE\napply retrace-aware %wor leaf rule"]
style SKIP fill:#faa,stroke:#333
style COUNT fill:#afa,stroke:#333
style COUNT2 fill:#afa,stroke:#333
style COUNT3 fill:#afa,stroke:#333
Why %mor skips retraces: The %mor tier represents the morphological analysis of what the speaker meant to say. Retraced content is a false start or error; it was produced phonologically but is not part of the intended linguistic structure. The correction after the retrace marker carries the morphological analysis.
Why %pho/%sin/%wor include retraces: These tiers document what was actually produced: the sounds, gestures, and timing of the speech as it happened, including false starts. The retrace was physically spoken, so it appears in these tiers.
For %wor, retrace ancestry does not change leaf-level membership:
- spoken word tokens count both inside and outside retrace
- that includes fillers, fragments, nonwords, and untranscribed placeholders
- overlap annotations do not affect %wor membership
Exact corpus-shaped contrast:
*CHI: <one &+ss> [/] one play ground .
%wor: one •321008_321148• ss •321148_321368• one •321809_321969• play •322049_322310• ground •322390_322890• .
*CHI: &+ih <the what> [/] what's letter &+th is this ?
%wor: ih •49063_49103• the •49103_49163• what •49183_50205• what's •50205_50405• letter •50405_50685• th •50886_50946• is •50946_51046• this •51086_51586• ?
Implementation
Counting: count_alignable_item() in alignment/helpers/count.rs:
UtteranceContent::Retrace(retrace) => {
if domain == TierDomain::Mor {
0 // excluded from morphological alignment
} else {
count_bracketed_alignable_content(&retrace.content, domain, true)
}
}
Walking: walk_words() in alignment/helpers/walk/mod.rs:
UtteranceContent::Retrace(retrace) => {
if !matches!(domain, Some(TierDomain::Mor)) {
walk_bracketed_content(&retrace.content.content, domain, f);
}
}
%wor generation and overlap counting still use dedicated recursive helpers,
but now for %wor-specific sequencing details like replacement handling rather
than for retrace-sensitive membership.
Validation
Cross-Utterance Retrace Validation
The retrace validators in validation/retrace/ check:
- Collection: collection/utterance.rs and collection/bracketed.rs walk the content tree to find all Retrace nodes
- Detection: detection.rs provides utterance_item_has_retrace() for quick retrace presence checks
Alignment Validation (E705)
E705 fires when the main tier has more alignable items than %mor. If
retraces are correctly parsed as Retrace nodes, they are excluded
from the count and E705 does not fire. If a retrace is accidentally
parsed as a bare ReplacedWord (the bug fixed in c90b9bf), it is
counted and triggers a false E705.
Regression Tests
tests/retrace_replaced_word_regression.rs contains 6 targeted tests:
| Test | Pattern | Verifies |
|---|---|---|
| single_word_retrace_with_replacement_full | word [: repl] [* err] [//] | Retrace wraps ReplacedWord |
| single_word_retrace_with_replacement_partial | word [: repl] [* err] [/] | Partial retrace with replacement |
| single_word_retrace_with_replacement_multiple | word [: repl] [* err] [///] | Multiple retrace with replacement |
| single_word_retrace_with_replacement_no_error_marker | word [: repl] [///] | No [*] still produces Retrace |
| single_word_retrace_without_replacement | word [//] | Baseline (no replacement) |
| retrace_with_replacement_does_not_cause_e705 | Full pipeline with %mor | No false E705 |
Reference corpus entries: corpus/reference/annotation/retrace.cha
See Also
- Alignment Architecture: full alignment system docs
- The %mor Tier: morphological tier format and alignment rules
- CHAT Manual: Retracing
Replacements
Status: Current Last modified: 2026-05-29 17:47 EDT
A replacement is a CHAT annotation [: ...] that pairs a single
spoken word on the main tier with one or more “intended” words. It
records both what the speaker actually said and what the analysis should
treat the utterance as containing.
*CHI: wanna [: want to] go .
*CHI: dis [: this] is fun .
*CHI: rocking+house [: rocking+horse] [*] ?
This page is the canonical reference for what replacements mean in TalkBank, both as a CHAT-manual construct and as a typed AST in this repo. The most important load-bearing fact, which the rest of the page expands on:
Replacements are word-level, not group-level. Each tier domain chooses one side of the pair:
%mor analyzes the replacement (right side); %wor, %pho, %sin align to the original (left side). %gra follows %mor.
CHAT Syntax
Word-Level Scope
A replacement attaches to a single standalone_word on the main tier
and contains one or more replacement words inside the brackets:
*CHI: gonna [: going to] eat lunch .
*CHI: dis [: this] toy .
*CHI: rocking+house [: rocking+horse] [*] ?
The grammar rules are
word_with_optional_annotations and replacement in
grammar/grammar.js;
grep for the rule names rather than line numbers so this stays
accurate as the grammar evolves. Replacement words can be separated
by whitespace, so [: going to] is a single replacement of gonna
with two words.
There Is No Group-Level Replacement
<dat is> [: that is] is not valid CHAT. A replacement does not
attach to a group; it attaches to a single word. The grammar enforces
this by typing: ReplacedWord.word: Word, never Group. To replace
words inside a group, attach the replacement to the inner word:
*CHI: <dat [: that] is> [/] is broken .
This shape, replacement inside a group inside a retrace, is legal because each annotation operates at its own scope.
There Is No [::] Form
Some literature on CHILDES tooling references a [::] annotation; it
does not exist in this repo’s grammar, parser, or model, and is not
defined by the current CHAT manual. Only [:] exists. If you encounter
[::] in legacy data, treat it as a parse error to investigate, not a
construct to support.
The Per-Domain Alignment Rule
This is the rule contributors most often get wrong. Different tier domains align to different sides of a replacement pair:
| Tier | Side aligned to | Rationale |
|---|---|---|
| %mor | replacement (right) | Morphosyntactic analysis annotates the target form, not the error |
| %gra | replacement (right) | Grammatical relations align to %mor’s structure |
| %wor | original (left) | Word-level timing is for what was actually spoken |
| %pho | original (left) | Phonological transcription describes what was actually spoken |
| %sin | original (left) | Gesture/sign coding describes what was actually produced |
The mnemonic: the replacement encodes the intended form (what the
speaker meant or what a corrected transcript would read). Tiers
analyzing intent (%mor/%gra) use the replacement; tiers
documenting realization (%wor/%pho/%sin) use the original.
flowchart LR
spoken["Original word\n(left of [:)\n'dis'"]
target["Replacement words\n(inside [: ])\n'this'"]
spoken -->|"%wor (timing)"| wor["%wor: dis"]
spoken -->|"%pho (phonology)"| pho["%pho: dɪs"]
spoken -->|"%sin (spelling)"| sin["%sin: dis"]
target -->|"%mor (UD parse)"| mor["%mor: pron|this"]
target -->|"%gra (paired with %mor)"| gra["%gra: 1|0|ROOT"]
For multi-word replacements like gonna [: going to], the rule
generalizes consistently:
- %wor/%pho/%sin produce one entry, for gonna.
- %mor produces two entries, for going and to.
- %gra produces two entries, paired to the two %mor items.
The alignment-counting code that enforces this is in
alignment/units.rs;
look for the UtteranceContent::ReplacedWord arm. The full table
of per-domain rules is in
spec/docs/ALIGNMENT_RULES.md.
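The side-selection rule can be sketched with simplified types. Nothing here is the code in alignment/units.rs: `ReplacedWord` is reduced to plain strings and `aligned_forms` is a hypothetical helper that only mirrors the table above.

```rust
#[derive(Clone, Copy)]
enum TierDomain {
    Mor,
    Gra,
    Wor,
    Pho,
    Sin,
}

/// Simplified stand-in: the real struct holds typed Word values.
struct ReplacedWord {
    word: String,             // left side: what was spoken
    replacement: Vec<String>, // right side: one or more intended words
}

/// Returns the forms a tier in `domain` aligns to.
fn aligned_forms(rw: &ReplacedWord, domain: TierDomain) -> Vec<&str> {
    match domain {
        // Tiers analyzing intent use the replacement side.
        TierDomain::Mor | TierDomain::Gra => {
            rw.replacement.iter().map(String::as_str).collect()
        }
        // Tiers documenting realization use the original side.
        TierDomain::Wor | TierDomain::Pho | TierDomain::Sin => vec![rw.word.as_str()],
    }
}
```

For gonna [: going to], this yields two forms for %mor and %gra and one form for %wor, %pho, and %sin, which is why the entry counts of those tiers legitimately differ.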
Rust AST
A replacement is modeled as a first-class UtteranceContent variant,
not as a flag on Word:
// crates/talkbank-model/src/model/annotation/replacement.rs
pub struct ReplacedWord {
pub word: Word, // left side: original spoken word
pub replacement: Replacement, // right side: 1+ intended words
pub scoped_annotations: ReplacedWordAnnotations,
}
Two consequences of this shape:
- A replacement is a wrapper around a Word, not a kind of Word. ReplacedWord lives as its own variant of UtteranceContent (and BracketedItem), holding an inner word: Word plus the replacement payload. Contrast with retraces: Retrace is also a variant of UtteranceContent/BracketedItem, but it wraps a group of content (a single word or a <...> group), not a single Word. Different mechanism, different scope, same top-level slot in the AST.
- The walk_words() content walker yields WordItem::ReplacedWord as a distinct leaf (defined in crates/talkbank-model/src/alignment/helpers/walk/mod.rs). Domain-aware extraction code branches on this leaf type and chooses original or replacement per the table above.
Validation
Each Replacement Word Is Validated Like a Main-Tier Word
The replacement is a Vec<Word>. Each Word inside it goes through
the same validator that runs on main-tier words:
*CHI: dog [: C-3PO] .
This produces [E220] "C-3PO" is not a legal word in language(s) "eng": numeric digits not allowed, exactly as if C-3PO had appeared on the
main tier directly. The replacement does not provide an escape from
word-level validation. The implementation is in
replacement.rs.
This is critical for any code generating replacements programmatically:
do not assume [: ...] lets you smuggle arbitrary text past the word
validator. If your producer emits a replacement, both sides must be
CHAT-legal under the utterance’s declared language.
Replacement-Specific Error Codes
Three error codes are specific to replacements and do not apply to main-tier words:
| Code | Meaning |
|---|---|
| E208 | Empty replacement [:] (no words provided between : and ]) |
| E390 | Replacement contains an omission (0-prefix form), disallowed inside replacements |
| E391 | Replacement contains untranscribed material (xxx, yyy, www), disallowed inside replacements |
The principle: a replacement must be a concrete intended form. Empty, omitted, or unintelligible content defeats that purpose.
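The three checks can be illustrated on plain strings. This sketch is an assumption-laden simplification, not the Validate implementation in replacement.rs: the function name is hypothetical, replacement words are reduced to `&str`, and an omission is detected simply by a leading `0`.

```rust
/// Returns the replacement-specific codes triggered by the words inside `[: ...]`.
/// Ordinary word-level validation (such as E220) is out of scope for this sketch.
fn replacement_specific_errors(words: &[&str]) -> Vec<&'static str> {
    let mut codes = Vec::new();
    if words.is_empty() {
        codes.push("E208"); // empty replacement
    }
    for word in words {
        if word.starts_with('0') {
            codes.push("E390"); // omission form inside a replacement
        }
        if matches!(*word, "xxx" | "yyy" | "www") {
            codes.push("E391"); // untranscribed material inside a replacement
        }
    }
    codes
}
```

A concrete replacement such as [: going to] produces no codes here, which matches the principle that a replacement must name an actual intended form.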
Interactions with Other Annotations
Replacements and Retraces Are Orthogonal
A retrace ([/], [//], [///], [/-]) and a replacement
([:]) are distinct annotations operating at different structural
levels:
- Retraces wrap content (a single word or a group). They are first-class UtteranceContent variants and represent post-hoc speaker correction.
- Replacements attach inside a Word slot via ReplacedWord. They are editorial metadata about an individual spoken word.
Both can coexist:
*CHI: <dat [: that] is> [/] is broken . (replacement inside retrace)
A retrace cannot live inside a replacement (the grammar wraps
replacements around standalone_word, not arbitrary content).
Replacements and Error Coding
Error codes follow the replacement and operate on the replaced word as a unit:
*CHI: rocking+house [: rocking+horse] [*] ?
Here [*] marks rocking+house as containing a phonological/lexical
error; the [: rocking+horse] records the intended form. The two
annotations cooperate: the replacement encodes what was meant, the
error code classifies how it deviates. Implementation:
scoped_annotations field on ReplacedWord.
Common Misconceptions
These are mistakes we have repeatedly written down and then forgotten; we record them here so future contributors don’t reinvent them.
- “[: ...] lets me put any text I want.” No. Each replacement word is validated. [: C-3PO] fails E220 in English just as C-3PO would.
- “[:] is the right mechanism for ASR sanitization.” Usually no. ASR-introduced normalization typically wants [% ...] (free-form comment) or [= ...] (free-form explanation), neither of which validates word grammar. Use [:] only when you have a concrete CHAT-legal intended form.
- “%mor analyzes the original.” No. %mor analyzes the replacement. This is the correction’s morphology, not the error’s.
- “%wor count must equal %mor count.” No. For gonna [: going to], %wor has 1 entry and %mor has 2. They align to different sides. The validator’s per-domain rule respects this.
- “<a b> [: c d] is a group-level replacement.” No. Group-level replacements don’t exist. Either replace inside (<a [: c] b [: d]>) or rephrase the transcription.
Source Citations
| Concern | File:line |
|---|---|
| Grammar rule (replacement) | grammar/grammar.js:1341-1352 |
| Word-with-replacement rule | grammar/grammar.js:1063-1071 |
| ReplacedWord struct | crates/talkbank-model/src/model/annotation/replacement.rs (search pub struct ReplacedWord) |
| Per-domain alignment | crates/talkbank-model/src/model/file/utterance/metadata/alignment/units.rs (search UtteranceContent::ReplacedWord) |
| Replacement validation | crates/talkbank-model/src/model/annotation/replacement.rs (search impl ... Validate for ReplacementWords) |
| Reference corpus example | corpus/reference/annotation/errors-and-replacements.cha |
| CHAT manual | https://talkbank.org/0info/manuals/CHAT.html#Replacement_Scope |
See Also
- Retraces and Repetitions: the orthogonal post-hoc correction mechanism.
- The %mor Tier: UD-syntax morphosyntactic analysis that aligns to the replacement form.
- Word Syntax: the word grammar that replacement words must satisfy.
- Dependent Tiers: overview of %mor, %wor, etc., with their alignment relationships.
Untranscribed Markers: xxx, yyy, www
Status: Reference Last updated: 2026-06-14 19:57 EDT
CHAT reserves three short word-level markers for material the human
transcriber cannot or chose not to render as words on the main tier.
Each one has a specific meaning. Tools that emit CHAT, including ASR
pipelines, format converters, and editor heuristics, must respect
those meanings, because every downstream consumer (researchers,
validators, and aggregate-statistics tools like CLAN’s freq,
kideval, mlu) reads them at face value.
| Marker | Meaning | Emitter |
|---|---|---|
| xxx | Transcriber listened to the audio and could not make out what was said. The speech is unintelligible to the human ear at this point. | Human transcriber only. |
| yyy | Transcriber heard a discrete utterance but could not write it as ordinary CHAT words. Used when the surface form resists orthography (mumbled, slurred, foreign with no equivalent). The phonetic content typically appears on the %pho tier. | Human transcriber only. |
| www | Transcriber chose not to transcribe this stretch, usually for privacy, off-topic content, or because the segment is irrelevant to the corpus’s purpose. | Human transcriber only. |
The shared property: each marker is the human transcriber telling later readers something specific about their experience listening to the audio. None of them mean “tooling could not process this token”.
Why this matters
When a researcher loads a CHAT corpus and counts xxx occurrences, the
result is a measure of human listening difficulty: it tells them how
much of the audio resisted human transcription. That number feeds into
methodology decisions (“can we get reliable MLU from this corpus?”,
“what’s the noise floor on this child’s speech?”, “should we re-record
in a quieter environment next time?”). It is a load-bearing signal in
language-development research.
If an ASR pipeline emits xxx whenever it can’t sanitize a token
(for example, substituting xxx for any word that fails CHAT
validation under a strict language profile), every xxx count in the
corpus becomes a meaningless mixture of “human couldn’t tell” and
“pipeline gave up”. Researchers reading those counts are then silently
misled. The signal is destroyed for the entire history of that
corpus, because the corruption is indistinguishable from real
unintelligibility once committed.
The same reasoning applies to yyy and www. A converter or
post-processor that emits any of these three markers because the
tooling couldn’t handle a token is committing semantic vandalism
against the whole field.
Rules for tooling
- Never emit xxx, yyy, or www from a tool to mean “could not process”. These markers are reserved for human transcriber judgment.
- When a token cannot be validated as legal CHAT under the declared language, prefer one of:
  - Pass the token through verbatim and let the CHAT validator (or CLAN’s check) flag it for human review. The transcriber listens, decides, and corrects.
  - Fail loud: abort the file rather than emit corrupted output.
  - Apply only purely orthographic, semantically null repairs (e.g., stripping a stray boundary quote mark from "My). These are safe because no information is lost.
- Never sanitize a token by replacing it with one of the three markers. That is exactly the corrupting behavior this document prohibits.
- Never delete a token to “fix” a validation failure. Deletion loses data without any flag.
What tools synthesizing CHAT should do instead
Any tool that builds CHAT from an external source (ASR output, an importer, a format converter) should follow the same division of labor:
- Silently fix only orthographically inarguable problems (for example, stripping a stray boundary quote mark from "My).
- For tokens that fail language-level validation but are structurally legal CHAT (e.g., C-3PO under English: tree-sitter accepts the digit-hyphen compound but Word::validate fires E220 “numeric digits not allowed”), ship the token verbatim. The full-file validator and check fire E220 on the same word, the file ends up in the human review queue, and the transcriber listens to the audio and decides what was actually said.
- For tokens that fail structural parsing (tree-sitter rejects), fail loud: emitting malformed CHAT would corrupt the file beyond the validator’s ability to flag it.
The division of labor is: the tool fixes only what is mechanically unambiguous; CHECK and the human transcriber handle everything that requires judgment about what the speaker said.
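The three outcomes can be sketched as a small decision function. Everything named here is hypothetical: `Disposition` and `dispose` are illustrative, the structural check is passed in as a closure standing in for a real tree-sitter parse, and the only repair attempted is the stray leading quote mark used as the example above.

```rust
#[derive(Debug, PartialEq)]
enum Disposition {
    /// A purely orthographic repair was applied.
    Repaired(String),
    /// Shipped verbatim; language-level problems are left for the validator.
    PassThrough(String),
    /// Structurally unparseable: stop rather than emit malformed CHAT.
    Abort,
}

/// Decides what a CHAT-synthesizing tool does with one token.
/// Note what is absent: no branch substitutes xxx, yyy, or www,
/// and no branch deletes the token.
fn dispose<F>(token: &str, is_structurally_legal: F) -> Disposition
where
    F: Fn(&str) -> bool,
{
    // Repair only a single stray leading quote mark with no partner.
    let (candidate, repaired) = match token.strip_prefix('"') {
        Some(rest) if !rest.contains('"') => (rest, true),
        _ => (token, false),
    };
    if !is_structurally_legal(candidate) {
        return Disposition::Abort;
    }
    if repaired {
        Disposition::Repaired(candidate.to_string())
    } else {
        Disposition::PassThrough(candidate.to_string())
    }
}
```

Language-level validation does not appear in the function at all: a token like C-3PO passes through unchanged, and the E220 it triggers later is what routes the file to human review.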
Related rules
- xxx/yyy/www survive in the transcript through all NLP passes (morphotag, utseg, translate, coref) without re-interpretation. Tools that walk the AST treat them as opaque tokens; they have no POS tag, no lemma, no dependency parent, no translation.
- %wor excludes all three (no phoneme sequence to align).
- %pho may reference yyy directly because the phonetic content is the whole point of the marker.
- See word-syntax.md for grammar; this document is the policy reference for who is allowed to emit them and why.
Postcodes ([+ ...])
Status: Reference Last updated: 2026-05-11 23:01 EDT
A postcode is a tagged annotation token that attaches to an
utterance as a whole and appears after the terminator. The
canonical CHAT syntax is [+ <text>]. Postcodes carry researcher /
analysis tags about the utterance (whether it should be excluded
from analysis, how it should be coded, what kind of speech act it
represents) without modifying the utterance’s word content.
Syntax and Scope
*CHI: I want cookie . [+ exc]
*MOT: what did you say ? [+ imp]
*CHI: no I don't want it ! [+ neg] [+ trn]
Three structural facts to internalize:
- Postcodes attach to the utterance, not to a word. They sit after the terminator, on the main tier, alongside (but distinct from) any utterance-level bullet. Unlike word-scoped annotations ([: ...] replacement, [% ...] comment, [= ...] explanation, [* ...] error code), a postcode does not modify the interpretation of any single word; it tags the whole utterance.
- Multiple postcodes may follow a single terminator. They are ordered, but the order is not semantically privileged.
- The body is free-form text. The CHAT word grammar is not applied to postcode contents. Researchers can write arbitrary tags, codes, descriptions, comments, or analytic notes. The model stores the raw text and leaves interpretation to downstream tooling and conventions.
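The three facts above can be illustrated with a small string-level sketch. It is not how chatter parses postcodes (the real parser is grammar-driven); `extract_postcodes` is a hypothetical helper that only assumes the `[+ <text>]` shape described here and returns each body verbatim, in source order.

```rust
/// Hypothetical helper (not part of talkbank-model): collect the body of
/// every `[+ ...]` token on a main-tier line, in source order.
/// Bodies are returned as free-form text; nothing inside them is parsed.
fn extract_postcodes(line: &str) -> Vec<String> {
    let mut out = Vec::new();
    let mut rest = line;
    while let Some(start) = rest.find("[+ ") {
        // Skip past the 3-byte ASCII opener "[+ ".
        let after = &rest[start + 3..];
        match after.find(']') {
            Some(end) => {
                out.push(after[..end].trim().to_string());
                rest = &after[end + 1..];
            }
            // Unterminated bracket: stop rather than guess at a body.
            None => break,
        }
    }
    out
}
```

Because word-scoped forms start with a different sigil (`[: `, `[% `, `[= `, `[* `), this scan does not pick them up, which mirrors the scope distinction rather than enforcing it.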
Common Postcodes, Empirical Survey
The postcode vocabulary is open-ended: the CHAT format imposes no
closed set, and an audit of every [+ ...] token across a
JSON-mirrored snapshot of the TalkBank corpora (~99k files, 23+
data-repo families) found 488 distinct values in active use.
The findings split into three tiers ranked by repo spread, that is, the number of distinct corpus families in which a code appears. Repo spread is a more useful ranking than raw count because high-count codes can be concentrated in a single corpus.
Tier 1, Cross-corpus codes (in 7+ repos)
These are the conventions every CHAT consumer should expect to encounter across collections:
| Postcode | Repo spread | Total occurrences | Meaning |
|---|---|---|---|
| [+ gram] | 13 | ~3,100 | Grammatical: utterance is grammatically well-formed for purposes of the analysis. |
| [+ exc] | 9 | ~26,900 | Exclude utterance from analysis. The utterance is preserved in the transcript but tagged so analytic tools (CLAN’s freq, mlu, etc.) skip it. |
| [+ bch] | 9 | ~10,000 | Backchannel: listener-side acknowledgement (mhm, yeah) that should not be counted as a substantive turn. |
| [+ trn] | 7 | ~3,800 | Translation utterance. |
Tier 2, Multi-corpus protocol codes (in 4-6 repos)
Codes deployed across several CHILDES sub-collections, typically encoding picture-narration / story-reading / imitation experimental conditions. Raw counts are substantial (often tens of thousands), but their meaning is set by the originating protocol; consult per-corpus documentation rather than assuming a global definition:
| Postcode | Repo spread | Total occurrences |
|---|---|---|
| [+ SR] | 5 | ~31,000 |
| [+ IN] | 5 | ~24,500 |
| [+ PI] | 5 | ~22,700 |
| [+ R] | 4 | ~16,200 |
| [+ I] | 4 | ~10,500 |
| [+ nv] | 4 | ~3,300 |
| [+ imit] | 4 | ~3,200 |
Tier 3, Single-corpus and long-tail codes
About 80% of the 488 distinct values appear in one repo only. The
single-corpus codes include high-volume protocol vocabularies (e.g.
[+ uncued] ~19,500 in one repo, [+ NAC] ~3,500 in one repo,
[+ diary] ~2,800 in a Romance/Germanic diary-study collection,
[+ noatt] ~2,300 in one repo, [+ inter-utter-switch] ~720
flagging code-switching turns).
The long tail also includes researcher-private notes, typos that
survived check, and per-study coding schemes. Tooling MUST treat
any unknown postcode value as opaque text: the corpus author may
know what it means; the format does not.
Caveats
- Numbers are from a snapshot audit and will drift as corpora are added or revised. Treat the broad shape (open vocabulary, ~4 truly cross-corpus codes, ~10 multi-corpus protocol codes, ~hundreds of single-corpus or long-tail codes) as the load-bearing finding, not the exact counts.
- “Repo spread” counts data-repo families, not individual files. Two corpora curated by the same group inside one data-repo count as one for spread; researchers using the same code in two different family-of-corpora packages count as two.
- The CHAT manual remains the source of truth for standard conventions. The empirical survey above shows what is actually deployed; when ingesting a new corpus, consult its own documentation for the postcodes in use.
What Postcodes Are NOT
Postcodes are easy to confuse with several other CHAT annotation forms because they all use square brackets. The differences are substantive and load-bearing.
| Form | Scope | Body validation | Purpose |
|---|---|---|---|
| [+ ...] | Utterance-level (this doc) | None, free text | Researcher / analysis tag attached to the whole utterance |
| [: ...] | Word-level | Replacement words ARE validated as CHAT words | Sanctioned-form correction of the preceding word (see replacements.md) |
| [% ...] | Word-level | None, free text | Free-form comment about the preceding word or local span |
| [= ...] | Word-level | None, free text | Explanation of unclear / non-standard speech (often paired with xxx / yyy placeholders) |
| [* ...] | Word-level | None, error code text | Error coding for the preceding word, optionally with a structured code |
Two consequences worth pinning down explicitly:
- A postcode cannot carry per-word semantics. If you want to attach a comment, replacement, or error code to a single word, use the appropriate word-scoped form. Stretching a postcode to mean “this word is X” loses the per-word position downstream tools depend on.
- A word-scoped annotation cannot tag an utterance. If you want to mark an entire utterance for exclusion or translation, use a postcode. A [% exclude this] after a word does not mean “exclude the utterance” to any consumer.
Not Postcodes: Quotation Markers
Quotation marking in CHAT is not a postcode form. The constructs
+"/. (quotation end), +"/, and +" (quotation linkers /
continuations) are tier-level terminators and linkers, not
[+ ...] postcodes. The grammar rule postcode in
grammar/grammar.js is strictly [+ <text>], and the quotation
forms live under separate grammar rules (quoted_new_line,
linker_quotation_follows).
See Utterances → Terminators for the
syntactic forms, and the
talkbank-model::validation::cross_utterance validator family
(gated by ValidationContext::enable_quotation_validation) for the
cross-utterance balance checks.
A walker in talkbank-model::validation::utterance::quotation
(check_quotation_balance) does scan the postcode list for text
"/ and "/., but a sweep over the data-json corpus mirror
(101,414 files, 2026-05-11) returned zero such postcodes. That
code path is effectively dead, retained presumably as defence
against hand-edited oddities. The real quotation-balance work
happens in the cross-utterance family above.
Position in the AST
An utterance’s main tier is MainTier, whose content: TierContent
field carries the actual tier payload, including postcodes, as a
typed list:
pub struct MainTier {
pub speaker: SpeakerCode,
pub content: TierContent,
// spans omitted for brevity
}
pub struct TierContent {
pub linkers: TierLinkers, // utterance-leading +<, ++, etc.
pub language_code: Option<LanguageCode>, // [- code]
pub content: TierContentItems, // word-level items (newtype over Vec<UtteranceContent>)
pub terminator: Option<Terminator>, // ., ?, !, +..., etc.
pub postcodes: TierPostcodes, // [+ ...] tokens after the terminator
pub bullet: Option<Bullet>, // optional terminal media bullet
// content_span omitted for brevity
}
(See talkbank-model/src/model/content/main_tier.rs and
tier_content.rs for the exact shape; the same TierContent type is
shared by dependent tiers, so the postcode slot exists on every tier
even though only main-tier postcodes are conventional.)
Because postcodes live at the utterance level, the per-word
traversal helpers (walk_words, walk_words_mut) do not visit
them. Code that needs to read or rewrite postcodes accesses the
list directly.
The model stores postcode text as SmolStr and preserves it
verbatim through CHAT roundtrips. Downstream tooling, including
CLAN command implementations such as freq, mlu, kideval, is
responsible for interpreting individual postcode values per its own
conventions.
Tooling Rules
Tools that emit or consume CHAT must respect the scope distinction.
- Emitters: when adding a researcher tag to an utterance, attach a Postcode to the utterance’s MainTierContent, not a ContentAnnotation to a word. Both serialize, but only the former reaches downstream consumers as utterance-level metadata.
- Consumers: when reading utterance-level tags (e.g., implementing an “exclude” filter), iterate main.content.postcodes on each utterance, not the word-level annotations in UtteranceContent. The two lists are populated by different parser branches and have different semantics.
- Round-trip preservers (extract→modify→inject pipelines such as the NLP injection passes in crates/batchalign-*): preserve the postcode list unchanged. None of the standard NLP passes have a reason to add, remove, or reorder postcodes.
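A consumer-side “exclude” filter might look like the sketch below. The `Utterance` struct is a deliberately tiny stand-in invented for this example, not the talkbank-model type; it only assumes that postcode bodies are available as an ordered list of raw strings, as the consumer rule describes.

```rust
/// Stand-in for the model's utterance (hypothetical, for illustration):
/// only the fields this sketch needs.
struct Utterance {
    words: Vec<String>,
    /// Raw `[+ ...]` bodies in source order, kept verbatim.
    postcodes: Vec<String>,
}

/// True when the utterance carries the `exc` postcode.
/// Word-level annotations are never consulted.
fn is_excluded(utt: &Utterance) -> bool {
    utt.postcodes.iter().any(|p| p.as_str() == "exc")
}

/// Count words over utterances that are not tagged `[+ exc]`.
/// The input is only read, so the postcode lists stay unchanged.
fn included_word_count(utts: &[Utterance]) -> usize {
    utts.iter()
        .filter(|u| !is_excluded(u))
        .map(|u| u.words.len())
        .sum()
}
```

Unknown postcode values simply fail the comparison and are ignored, which keeps them opaque to the filter.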
References
- CHAT manual: Postcodes
- CHAT manual: Excluded Utterance Postcode
- CHAT manual: Included Utterance Postcode
- Model: talkbank-model/src/model/content/postcode.rs
- Quotation validator: talkbank-model/src/validation/utterance/quotation.rs
Dependent Tiers
Status: Reference Last updated: 2026-06-22 23:33 EDT
Dependent tiers appear on lines beginning with % immediately after an utterance. They provide annotations linked to the main tier content.
CHAT defines four structural categories of dependent tiers:
- Structured linguistic tiers: parsed into typed AST nodes with word-level alignment
- Phon phonological tiers: syllabification and segmental alignment from the Phon project
- Bullet-content tiers: free-form text with optional inline timing markers
- Text tiers: plain text with no structural alignment
Structured Linguistic Tiers
These tiers have rich, parsed representations in the data model. Each token aligns 1-to-1 with an alignable word on the main tier (excluding retraces, pauses, and events). Terminators (., ?, !) must match the main tier terminator.
%mor, Morphological Analysis
The %mor tier carries part-of-speech tags, lemmas, and morphological features for each word on the main tier. See The %mor Tier for full documentation covering the UD-style format, data model, divergences from Universal Dependencies, and migration from traditional CHAT MOR.
Format: POS|lemma[-Feature]*, with ~ separating post-clitics.
*CHI: she's eating cookies .
%mor: PRON|she~AUX|be-Pres-S3 VERB|eat-Prog NOUN|cookie-Plur .
%gra, Grammatical Relations
The %gra tier encodes dependency syntax using Universal Dependencies relation labels. Each entry has the format index|head|relation, where indices are 1-based and head 0 indicates ROOT.
*CHI: I want cookies .
%mor: PRON|I VERB|want NOUN|cookie-Plur .
%gra: 1|2|SUBJ 2|0|ROOT 3|2|OBJ 4|2|PUNCT
The %gra tier aligns with %mor chunks (clitics expand into multiple chunks). Validation checks sequential indices (E721), ROOT structure (E722 missing root, E723 multiple roots), and circular dependencies (E724).
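The four checks can be sketched as a standalone function over the raw tier text. This is an illustration under stated assumptions, not the validator in talkbank-model: `GraError` and `check_gra` are names invented here, only the error-code mapping in the comments comes from the text above, and a self-headed entry is treated as a root in addition to head 0.

```rust
#[derive(Debug, PartialEq)]
enum GraError {
    Malformed(String),  // entry is not index|head|relation
    NonSequentialIndex, // corresponds to E721
    MissingRoot,        // corresponds to E722
    MultipleRoots,      // corresponds to E723
    Cycle,              // corresponds to E724
}

/// Hypothetical checker for a `%gra` tier body such as
/// "1|2|SUBJ 2|0|ROOT 3|2|OBJ 4|2|PUNCT".
fn check_gra(tier: &str) -> Result<(), GraError> {
    let mut heads: Vec<usize> = Vec::new();
    for (pos, entry) in tier.split_whitespace().enumerate() {
        let bad = || GraError::Malformed(entry.to_string());
        let parts: Vec<&str> = entry.split('|').collect();
        if parts.len() != 3 || parts[2].is_empty() {
            return Err(bad());
        }
        let index: usize = parts[0].parse().map_err(|_| bad())?;
        let head: usize = parts[1].parse().map_err(|_| bad())?;
        // Indices are 1-based and must follow entry order.
        if index != pos + 1 {
            return Err(GraError::NonSequentialIndex);
        }
        heads.push(head);
    }
    let n = heads.len();
    if let Some(i) = heads.iter().position(|&h| h > n) {
        return Err(GraError::Malformed(format!("head out of range at entry {}", i + 1)));
    }
    // Assumption: head 0 or head == own index both mark a root.
    let is_root = |i: usize| heads[i] == 0 || heads[i] == i + 1;
    match (0..n).filter(|&i| is_root(i)).count() {
        0 => return Err(GraError::MissingRoot),
        1 => {}
        _ => return Err(GraError::MultipleRoots),
    }
    // Follow head links from every entry; a walk longer than n steps
    // cannot reach the root, so it must be going around a cycle.
    for start in 0..n {
        let (mut cur, mut steps) = (start, 0);
        while !is_root(cur) {
            cur = heads[cur] - 1;
            steps += 1;
            if steps > n {
                return Err(GraError::Cycle);
            }
        }
    }
    Ok(())
}
```

The step-count bound keeps the cycle check simple at the cost of quadratic work in the worst case, which is acceptable for utterance-sized inputs.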
%pho / %mod, Phonological Transcription
The %pho tier records actual pronunciation; %mod records target/model pronunciation. Both use the same format: space-separated phonetic tokens aligned 1-to-1 with main tier words.
*CHI: I want three cookies .
%pho: aɪ wɑnt fwi kʊkiz .
%mod: aɪ wɑnt θri kʊkiz .
Phonological tiers support IPA, UNIBET, X-SAMPA, or custom notation systems. They are used for child language, speech disorders, L2 learning, and dialectal variation studies.
Parsing strategy: We deliberately parse only the minimal word/group-level structure in %pho and %mod needed for coarse alignment with the main tier. The full IPA phoneme content is stored as opaque strings; deep phonological analysis is handled by Phon, and we avoid duplicating that work. The Phon extension tiers (%modsyl, %phosyl, %phoaln) follow the same strategy.
%sin, Gesture and Sign Annotation
The %sin tier codes gestures and signs aligned with speech. Each token is either 0 (no gesture) or g:referent:type (e.g., g:ball:dpoint for a deictic point at a ball).
*CHI: that ball .
%sin: g:ball:dpoint 0 .
Multiple simultaneous gestures use bracket grouping: 〔g:toy:hold g:toy:shake〕.
%wor, Word Timing
The %wor tier carries word-level timing annotations for media synchronization.
Words may include inline bullets with millisecond timestamps. Word text is
display-only (“eye candy”); timing data comes from the bullet fields.
⚠ IMPORTANT: %wor word text is the cleaned form, by design. When chatter serializes a %wor word it writes the word’s cleaned text (the spoken form with surface markers removed), NOT the raw main-tier surface form. This is a deliberate convention (see WorTier::write_chat in crates/talkbank-model/src/model/dependent_tier/wor.rs), chosen for human readability and because %wor exists to anchor timing, not to re-state the main tier’s orthography. The generated %wor text and the TextGrid export both use this cleaned form.
Consequence you must know: surface markers carried on a word (prosodic lengthening such as wabe:, and similar in-word notation) are not preserved in %wor output. A main-tier word wabe: becomes wabe on %wor. This means a %wor line containing such words does not byte-roundtrip (parse, serialize, reparse changes the surface text), and that is expected, not a bug. %wor is a cleaned, timing-only view; the main tier remains the faithful record of surface forms. Do not “fix” the %wor serializer to emit raw text without an explicit decision to change this convention.
%wor is not a flat “all tokens except punctuation” tier. It follows a
word-level alignment rule:
- Regular words count.
- Fillers (&-um, &-uh, &-you_know) count; they are real spoken words with known phoneme sequences.
- Fragments (&+...) do NOT count: incomplete phoneme sequences; the FA engine cannot reliably anchor partial phonological material.
- Nonwords (&~...) do NOT count: interactional/gestural sounds without stable lexical phoneme content for alignment.
- Untranscribed placeholders (xxx, yyy, www) do NOT count: they have no known phoneme sequence; CTC forced alignment cannot produce timings for unknown material.
- Replacements keep the original spoken word slot for %wor; the replacement text matters for %mor, not %wor. If the original slot is untranscribed or a fragment/nonword, it is still excluded.
- Retrace scope does not change %wor membership.
- Overlap markers do not change %wor membership.
%wor is a timing-annotation tier. Its word count equals the number of Wor-domain
words and may differ from a naive main-tier word count. There is no downstream
positional indexing into %wor; the %wor count is not validated against the
main-tier word count.
*CHI: I want cookies .
%wor: I want cookies .
Exact corpus-shaped contrast:
*CHI: <one &+ss> [/] one play ground .
%wor: one •321809_321969• play •322049_322310• ground •322390_322890• .
# &+ss is a fragment, excluded from %wor regardless of retrace context.
*EXP: &+ih <the what> [/] what's letter &+th is this ?
%wor: the •49103_49163• what •49183_50205• what's •50205_50405• letter •50405_50685• is •50946_51046• this •51086_51586• ?
# Fragments &+ih and &+th excluded; regular words remain.
*EXP: what's is dis [: this] ?
%wor: what's •37050_37471• is •37491_37631• dis •37631_38131• ?
*CHI: xxx snack .
%wor: snack •884668_885168• .
# xxx has no phoneme sequence, excluded from %wor; only snack appears.
*CHI: &~um a boat .
%wor: a •1073779_1073799• boat •1076861_1077361• .
# &~um is a nonword, excluded from %wor.
*CHI: &-mm [<] bananas are good .
%wor: mm •1949506_1949566• bananas •1949566_1949766• are •1949846_1949987• good •1950067_1950567• .
# &-mm is a filler, included in %wor (real spoken word with alignable phoneme sequence).
flowchart TD
A["Main-tier word candidate"] --> B{"Timestamp token /\nomission / empty?"}
B -->|Yes| OUT["Excluded from %wor"]
B -->|No| C{"Untranscribed?\n(xxx/yyy/www)"}
C -->|Yes| OUT
C -->|No| D{"Fragment or nonword?\n(&+ or &~)"}
D -->|Yes| OUT
D -->|No| IN["Counts for %wor\n(word or filler &-)"]
style IN fill:#afa,stroke:#333
style OUT fill:#faa,stroke:#333
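The decision procedure in the flowchart can be approximated at the token level as follows. This is a sketch, not the model's Wor-domain logic: it assumes the utterance has already been split into bare word tokens (bullets, annotations, retrace brackets, and the terminator removed), it assumes omitted words are written with a leading `0`, and both function names are invented for the example.

```rust
/// Hypothetical membership test for one bare main-tier word token.
fn counts_for_wor(token: &str) -> bool {
    // Empty tokens and omissions (assumed to start with '0') are out.
    if token.is_empty() || token.starts_with('0') {
        return false;
    }
    // Untranscribed placeholders have no phoneme sequence to align.
    if matches!(token, "xxx" | "yyy" | "www") {
        return false;
    }
    // Fragments (&+) and nonwords (&~) are excluded; fillers (&-) are not.
    if token.starts_with("&+") || token.starts_with("&~") {
        return false;
    }
    true
}

/// Keep the tokens that count, dropping the filler prefix so that
/// `&-mm` appears as `mm`, as in the corpus examples above.
fn wor_words(tokens: &[&str]) -> Vec<String> {
    tokens
        .iter()
        .copied()
        .filter(|t| counts_for_wor(t))
        .map(|t| t.strip_prefix("&-").unwrap_or(t).to_string())
        .collect()
}
```

Retraced words are passed through like any other word, since retrace scope does not change membership; only the token's own form decides.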
Phon Phonological Tiers
These tiers originate from the Phon
project and provide syllable-annotated phonological transcription and segmental
alignment. They were originally serialized as %x-prefixed user-defined tiers
(%xmodsyl, %xphosyl, %xphoaln) and are being promoted to official CHAT
tiers. Phon stores phonological data in its own XML format; the CHAT
representation is generated by PhonTalk.
%modsyl / %phosyl, Syllabified Phonology
%modsyl is a syllabified version of %mod (target pronunciation); %phosyl
is a syllabified version of %pho (actual pronunciation). Each phoneme is
annotated with a syllable position code (N=nucleus, O=onset, C=coda,
etc.). Words are space-separated and align 1-to-1 with the corresponding
%mod or %pho tier.
*CHI: the best .
%mod: ðə bɛst .
%modsyl: ð:Oə:N b:Oɛ:Ns:Ct:C .
%pho: ðə bɛs .
%phosyl: ð:Oə:N b:Oɛ:Ns:C .
Alignment: content-based. Stripping position codes (:N, :O, :C, etc.)
and stress markers (ˈ, ˌ) from %modsyl should yield the same phonemes
as %mod; the same holds for %phosyl → %pho.
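A minimal sketch of that content-based comparison follows. It assumes every position code is a colon followed by a single ASCII uppercase letter (the text lists N, O, C and "etc.", so the full code inventory is an assumption), and the function names are invented here. Stress markers are removed from both sides so that only phonemes are compared.

```rust
/// Remove syllable position codes (":N", ":O", ":C", ...) and stress
/// markers from one syllabified word. Assumes each code is ':' plus one
/// ASCII uppercase letter.
fn strip_syllable_codes(word: &str) -> String {
    let mut out = String::new();
    let mut chars = word.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            // Primary and secondary stress markers carry no phoneme.
            'ˈ' | 'ˌ' => {}
            // A colon directly before an uppercase letter is a position code.
            ':' if chars.peek().map_or(false, |n| n.is_ascii_uppercase()) => {
                chars.next();
            }
            _ => out.push(c),
        }
    }
    out
}

/// Compare a syllabified tier with its parent IPA tier word by word.
fn syl_matches_parent(syl_tier: &str, parent_tier: &str) -> bool {
    let syl: Vec<String> = syl_tier
        .split_whitespace()
        .map(strip_syllable_codes)
        .collect();
    let parent: Vec<String> = parent_tier
        .split_whitespace()
        .map(|w| w.chars().filter(|c| !matches!(c, 'ˈ' | 'ˌ')).collect())
        .collect();
    syl == parent
}
```

The IPA length mark (ː) is a different character from the ASCII colon, so it is left untouched by the position-code rule.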
%phoaln, Phone Alignment
%phoaln provides segmental alignment between target and actual IPA,
showing phoneme-by-phoneme correspondence. Each pair uses source↔target
notation; ∅ marks insertions or deletions.
*CHI: the best .
%phoaln: ð↔ð,ə↔ə b↔b,ɛ↔ɛ,s↔s,t↔∅
Alignment: positional, word by word. Word N in %phoaln aligns with
word N in both %mod and %pho.
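The pair notation can be unpacked with a short sketch. Judging from the example (where `t↔∅` pairs the final segment of the %mod word with nothing in %pho), the left side of each pair is read here as the %mod phone and the right side as the %pho phone; that reading, and both function names, are assumptions of this example rather than the crate's API.

```rust
const NULL_SEGMENT: char = '∅';

/// Split one `%phoaln` word ("b↔b,ɛ↔ɛ,s↔s,t↔∅") into its left-side and
/// right-side phone strings, dropping the null segment on either side.
/// Returns None when a pair lacks the '↔' separator.
fn split_alignment_word(word: &str) -> Option<(String, String)> {
    let mut left = String::new();
    let mut right = String::new();
    for pair in word.split(',') {
        let (l, r) = pair.split_once('↔')?;
        left.extend(l.chars().filter(|&c| c != NULL_SEGMENT));
        right.extend(r.chars().filter(|&c| c != NULL_SEGMENT));
    }
    Some((left, right))
}

/// Apply the split to every word of the tier, preserving word positions
/// so that word N can be compared with word N of %mod and %pho.
fn split_alignment_tier(tier: &str) -> Option<(Vec<String>, Vec<String>)> {
    let mut lefts = Vec::new();
    let mut rights = Vec::new();
    for word in tier.split_whitespace() {
        let (l, r) = split_alignment_word(word)?;
        lefts.push(l);
        rights.push(r);
    }
    Some((lefts, rights))
}
```

For the example above, the left column reconstructs the %mod words and the right column the %pho words, which is what makes the positional alignment checkable.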
Parsing strategy: same as %pho/%mod. We parse just enough structure
for alignment (word boundaries for %modsyl/%phosyl, alignment pairs for
%phoaln). IPA phoneme content is treated as opaque strings.
Validation (E725-E728)
Because these are derived views, word counts must match between each syllabification tier and its parent IPA tier:
| Check | Error code |
|---|---|
| %modsyl word count ≠ %mod word count | E725 |
| %phosyl word count ≠ %pho word count | E726 |
| %phoaln word count ≠ %mod word count | E727 |
| %phoaln word count ≠ %pho word count | E728 |
These checks are gated on ParseHealth: if either tier in a pair has parse
errors, the alignment check is suppressed to avoid false positives.
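The gating can be pictured with a small sketch. The two-state `ParseHealth` enum and the `PhonTier` struct below are simplified stand-ins invented for illustration; the real ParseHealth type in the model may carry more detail. Only the error codes come from the table above.

```rust
/// Simplified stand-in for the model's parse-health tracking.
#[derive(Clone, Copy, PartialEq)]
enum ParseHealth {
    Clean,
    Tainted,
}

/// Stand-in tier: its words plus whether it parsed without errors.
struct PhonTier<'a> {
    words: Vec<&'a str>,
    health: ParseHealth,
}

/// Return the error code when the word counts differ, unless either
/// tier had parse errors, in which case the check is suppressed.
fn check_word_counts(
    child: &PhonTier,
    parent: &PhonTier,
    code: &'static str,
) -> Option<&'static str> {
    if child.health == ParseHealth::Tainted || parent.health == ParseHealth::Tainted {
        return None; // a count mismatch here would likely be a false positive
    }
    (child.words.len() != parent.words.len()).then_some(code)
}
```

Passing the code as a parameter lets the same function serve all four tier pairs in the table.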
Known PhonTalk Export Issue
The PhonTalk XML→CHAT converter writes %mod/%pho through a OneToOne
alignment path that maps IPA words to orthography words and silently drops
extras. The syllabification tiers (%modsyl, %phosyl, %phoaln) bypass
this path and include all IPA words. In child phonology data where children
produce more IPA words than orthographic targets (~4% of Phon corpus files),
this creates tier-to-tier word count mismatches. The mismatches originate in
the Phon XML source data (orthography↔IPA word count discrepancies) and are
inconsistently handled during CHAT export. This is being investigated in
collaboration with the Phon team.
Bullet-Content Tiers
These tiers contain free-form text with optional embedded timing markers (•START_END•) and picture references (•%pic:"file.jpg"•). They do not align word-by-word with the main tier.
| Tier | Purpose |
|---|---|
| %act | Physical actions, gestures, non-verbal behaviors |
| %cod | Research-specific coding (semantic roles, thematic coding, error classification) |
| %com | Comments, annotations, and contextual notes |
| %exp | Explanations or expansions of ambiguous/incomplete speech |
| %add | Addressee identification in multi-party conversations |
| %spa | Speech act coding (request, assertion, question, directive) |
| %sit | Situational context or setting description |
| %gpx | Extended gesture position coding |
| %int | Intonational contours and prosodic patterns |
%cod is bullet-content in the shared TalkBank AST. In the %cod coding
convention, a word selector such as <w4> scopes the code that follows it
(it names which main-tier word the code applies to) rather than being a code
in its own right.
Example:
*CHI: gimme that .
%act: reaches toward shelf
%com: child is pointing to picture
Text Tiers
These tiers contain plain text with no bullets, timing, or structural alignment:
| Tier | Purpose |
|---|---|
| %alt | Alternative transcriptions |
| %coh | Cohesion annotation |
| %def | Definitions |
| %eng | English translations (for non-English transcripts) |
| %err | Error annotations |
| %fac | Facial expressions |
| %flo | Flow annotation |
| %gls | Glosses |
| %ort | Orthographic representations |
| %par | Paralinguistic information |
| %tim | Timing information |
User-Defined Tiers
Tiers prefixed with %x (e.g., %xcod, %xact) are user-defined dependent tiers. They are preserved during parsing and roundtrip but receive no structural validation beyond basic format checks. Any %x-prefixed tier is always accepted; this is the open extension point for project-specific annotation.
The Supported Set Is Closed
A dependent tier is valid in chatter only if it is one of the standard tiers documented above (the structured, Phon, bullet-content, and text tiers) or a %x-prefixed user-defined tier. Any other %-tier is invalid CHAT, and chatter rejects the file with error E605 (UnsupportedDependentTier). This is a closed set by design: chatter validate is the binding judgment on CHAT validity, so an unrecognized dependent tier is an error, not a warning.
Deliberate Divergence from CLAN: Retired Legacy Tiers
When TalkBank standardized morphology on a single Universal Dependencies %mor tier (plus %gra for relations), several legacy dependent tiers were retired. CLAN’s check still accepts three of them, so on these chatter is intentionally stricter, a deliberate, documented divergence:
| Retired tier | CLAN check | chatter |
|---|---|---|
| %trn | accepts | rejects (E605) |
| %tra | accepts | rejects (E605) |
| %grt | accepts | rejects (E605) |
| %umor | rejects | rejects (E605) |
The modern UD-%mor workflow has one morphology tier (%mor) plus %gra; the older training/translation/variant tiers are no longer part of the format chatter validates. %umor is rejected by both validators and is listed only for completeness. Note that %xtra (with the %x prefix) is a perfectly valid user-defined tier; only the bare %tra is retired.
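The closed-set rule reduces to a three-way classification, sketched below. The tier list is transcribed from the tables on this page and may not match chatter's internal list exactly; `TierStatus` and `classify_tier` are names made up for the example, and the sketch assumes a user-defined label needs at least one character after the `x`.

```rust
/// Standard tier names as documented on this page (assumed complete
/// only for the purposes of this sketch).
const STANDARD_TIERS: &[&str] = &[
    "mor", "gra", "pho", "mod", "sin", "wor", // structured linguistic
    "modsyl", "phosyl", "phoaln",             // Phon phonological
    "act", "cod", "com", "exp", "add", "spa", "sit", "gpx", "int", // bullet-content
    "alt", "coh", "def", "eng", "err", "fac", "flo", "gls", "ort", "par", "tim", // text
];

#[derive(Debug, PartialEq)]
enum TierStatus {
    Standard,
    UserDefined,
    Unsupported, // would be reported as E605
}

/// Classify a dependent-tier label such as "%mor" or "%xtra".
fn classify_tier(label: &str) -> TierStatus {
    let name = label.strip_prefix('%').unwrap_or(label);
    if STANDARD_TIERS.contains(&name) {
        TierStatus::Standard
    } else if name.len() > 1 && name.starts_with('x') {
        TierStatus::UserDefined
    } else {
        TierStatus::Unsupported
    }
}
```

The standard list is consulted before the prefix rule, so a bare retired name such as %tra falls through to `Unsupported` while %xtra is accepted.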
This is one instance of a general principle: where chatter intentionally departs from CLAN/CHECK behavior, the divergence is documented rather than left implicit. See CHECK Parity Audit.
The %mor Tier: Morphological Analysis
Status: Reference Last updated: 2026-05-11 20:35 EDT
The %mor (morphological) dependent tier provides word-by-word morphosyntactic annotation aligned with the main tier. Each main-tier word receives a morphological code specifying part of speech, lemma, and grammatical features.
Format Overview
*CHI: I want cookies .
%mor: pron|I-Prs-Nom-S1 verb|want-Fin-Ind-Pres-S1 noun|cookie-Plur .
Each %mor item has the structure POS|lemma[-Feature]*, where:
- POS: part-of-speech category (noun, verb, pron, det, aux, etc.)
- |: pipe separator (always present)
- Lemma: base form of the word (cookie, be, I). May contain language-specific compound or derivational boundary markers (see Compound Lemma Boundaries below)
- Features: zero or more morphological features, each preceded by - (-Plur, -Fin-Ind-Pres-S3)
Items are space-separated and terminate with a punctuation marker (., ?, !, etc.).
The UD MOR Format
TalkBank’s %mor tier uses a format inspired by Universal Dependencies (UD) but adapted to CHAT conventions. We call this the UD MOR format to distinguish it from the older CLAN-era MOR format.
The UD MOR format was introduced via batchalign’s Stanza-based morphosyntax pipeline. Stanza produces standard UD analysis (UPOS, lemma, morphological features, dependency relations), and the Rust mapping layer converts this to CHAT %mor and %gra tiers. The new format has been adopted for all new corpus annotation.
Structure: Flat POS|lemma[-Feature]*
Every morphological word is flat: a single POS tag, a single lemma, and a linear chain of features:
POS|lemma[-Feature1][-Feature2][-Feature3]...
There are no compounds, prefixes, subcategories, or nested structures in the UD MOR format. The entire morphological analysis of a word is captured by the POS+lemma+features triple.
Examples:
| Word | %mor code | POS | Lemma | Features |
|---|---|---|---|---|
| dog | noun\|dog | noun | dog | (none) |
| dogs | noun\|dog-Plur | noun | dog | Plur |
| running | verb\|run-Part-Pres-S | verb | run | Part, Pres, S |
| is | aux\|be-Fin-Ind-Pres-S3 | aux | be | Fin, Ind, Pres, S3 |
| I | pron\|I-Prs-Nom-S1 | pron | I | Prs, Nom, S1 |
| the | det\|the-Def-Art | det | the | Def, Art |
Multi-Word Tokens (Clitics)
English contractions and similar multi-word tokens (MWTs) are represented using the tilde (~) separator for post-clitics:
*CHI: it's red .
%mor: pron|it~aux|be-Fin-Ind-Pres-S3 adj|red .
Here it's is a single main-tier word that expands to two morphological words: pron|it (main) and aux|be-Fin-Ind-Pres-S3 (post-clitic). The ~ indicates the two MOR words are fused into one orthographic token.
Each clitic counts as its own chunk for %gra alignment; pron|it~aux|be-Fin-Ind-Pres-S3 produces 2 chunks, each needing its own grammatical relation.
Terminator
The %mor tier ends with a terminator that matches the main tier’s utterance terminator:
*CHI: what is that ?
%mor: pron|what aux|be-Fin-Ind-Pres-S3 det|that ?
The terminator (., ?, !, +..., etc.) counts as one chunk for %gra alignment.
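Putting the item structure, clitics, and terminator rule together, the chunk count that %gra must match can be sketched as below. This is a string-splitting illustration, not the grammar-based parser: it assumes the last whitespace-separated token is the terminator and that lemmas contain no hyphen, and `FlatMorWord` and the function names are invented for the example.

```rust
/// Illustrative flat word: POS, lemma, and a linear feature chain.
#[derive(Debug, PartialEq)]
struct FlatMorWord {
    pos: String,
    lemma: String,
    features: Vec<String>,
}

/// Parse "POS|lemma[-Feature]*". Assumes the lemma itself has no '-'.
fn parse_mor_word(s: &str) -> Option<FlatMorWord> {
    let (pos, rest) = s.split_once('|')?;
    let mut parts = rest.split('-');
    let lemma = parts.next()?.to_string();
    if pos.is_empty() || lemma.is_empty() {
        return None;
    }
    Some(FlatMorWord {
        pos: pos.to_string(),
        lemma,
        features: parts.map(str::to_string).collect(),
    })
}

/// Parse one item: a main word followed by `~`-joined post-clitics.
fn parse_mor_item(item: &str) -> Option<Vec<FlatMorWord>> {
    item.split('~').map(parse_mor_word).collect()
}

/// Chunks for %gra alignment: one per word or clitic, plus one for the
/// terminator (assumed to be the final token of the tier).
fn gra_chunk_count(mor_tier: &str) -> Option<usize> {
    let mut tokens: Vec<&str> = mor_tier.split_whitespace().collect();
    tokens.pop()?; // drop the terminator; it is counted below
    let mut chunks = 1;
    for item in tokens {
        chunks += parse_mor_item(item)?.len();
    }
    Some(chunks)
}
```

For `pron|it~aux|be-Fin-Ind-Pres-S3 adj|red .` the count is 4: two chunks for the contraction, one for `adj|red`, and one for the terminator.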
How It Diverges from UD
The UD MOR format is UD-inspired but not UD-compliant. Several deliberate adaptations make it fit CHAT conventions while preserving most UD information. This section catalogs every divergence.
1. POS Tags Are Lowercased UPOS
UD uses uppercase UPOS tags (NOUN, VERB, PRON). CHAT uses lowercase (noun, verb, pron). This is a lossless, trivially reversible surface change.
| UD UPOS | CHAT POS |
|---|---|
| NOUN | noun |
| VERB | verb |
| AUX | aux |
| PRON | pron |
| DET | det |
| ADJ | adj |
| ADV | adv |
| ADP | adp |
| PROPN | propn |
| INTJ | intj |
| CCONJ | cconj |
| SCONJ | sconj |
| NUM | num |
| PART | part |
| X | x |
2. Feature Values Are Flat, Not Key=Value (Currently)
UD represents morphological features as key=value pairs: Number=Plur, Tense=Past, Person=3. The current CHAT convention drops the keys and uses only the values: -Plur, -Past, -S3.
This is the most significant divergence from UD, because:
- Information loss: Plur could in principle be Number=Plur or Degree=Plur (though in practice the UD feature value set has no real ambiguities).
- Collapsed person/number: UD Person=3|Number=Sing becomes -S3, a combined code that cannot be mechanically decomposed back to its UD components.
- Feature ordering: Features appear in a conventional order determined by the generation pipeline, not in UD’s alphabetical order.
The data model now supports key=value features. The MorFeature type has an optional key field: when present, the feature serializes as Key=Value (e.g., -Number=Plur); when absent, it serializes as just the value (e.g., -Plur). This is forward-compatible: existing flat features parse and serialize identically, and if batchalign’s mapper begins emitting Key=Value features, they flow through the parser and model without any format changes.
3. Multi-Value Features: Commas Preserved
UD encodes multi-value features with commas: PronType=Int,Rel (the word is both interrogative and relative). In CHAT %mor, the comma is preserved within the feature value:
-Int,Rel
This is treated as a single feature value "Int,Rel". The grammar accepts commas within feature values, and the model stores them as-is. No decomposition occurs; the model faithfully records the string that appears in the %mor tier.
Historical note: Earlier documentation described a “comma-stripping” convention where PronType=Int,Rel became -IntRel (concatenated without separator). The current grammar and parser preserve the comma. Existing corpus data using the concatenated form (-IntRel) also parses correctly; it’s simply treated as the flat value "IntRel".
4. Dependency Relations Are Uppercase with Dash Subtypes
The %gra tier (not %mor, but closely related) uses uppercase relation names with dashes for subtypes, where UD uses lowercase with colons:
| UD | CHAT %gra |
|---|---|
| nsubj | NSUBJ |
| acl:relcl | ACL-RELCL |
| obl:tmod | OBL-TMOD |
This is lossless; case and separator are trivially reversible.
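The reversibility claim is easy to demonstrate. The two functions below are illustrative helpers, not part of any TalkBank crate, and they assume UD relation labels are ASCII and never contain a dash of their own.

```rust
/// UD relation label -> CHAT %gra label (uppercase, ':' becomes '-').
fn ud_to_gra(rel: &str) -> String {
    rel.to_ascii_uppercase().replace(':', "-")
}

/// CHAT %gra label -> UD relation label (lowercase, '-' becomes ':').
/// Assumes the original UD label contained no '-' itself.
fn gra_to_ud(rel: &str) -> String {
    rel.to_ascii_lowercase().replace('-', ":")
}
```

Applying one function after the other returns the starting label for every row in the table above.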
5. ROOT Head Convention
In UD, the root word has head=0. In %gra, two conventions coexist:
- UD convention: head=0 (e.g., 3|0|ROOT), the standard we now emit
- Legacy TalkBank convention: head=self (e.g., 3|3|ROOT), found in older corpus data
The parser and validator accept both forms. New output uses head=0.
6. No XPOS, No DEPREL Subtypes in %mor
UD provides both UPOS (universal POS) and XPOS (language-specific POS). CHAT %mor uses only UPOS-equivalent tags; there is no XPOS field. Language-specific POS distinctions are not represented.
Similarly, UD’s fine-grained dependency relation subtypes (e.g., nsubj:pass) appear in %gra as NSUBJ-PASS, but the %mor tier itself contains no dependency information.
7. No Morpheme Segmentation
Traditional CHAT MOR formats (CLAN-era) supported morpheme-level segmentation with compound markers (+), prefix markers (#), and suffix chains (-SUFFIX&type). The UD MOR format does not use any of these; each word is analyzed as a flat POS+lemma+features triple.
The grammar still accepts some of these legacy markers for backward compatibility with older corpus data, but the canonical UD MOR format does not produce them.
Compound Lemma Boundaries
Several UD treebanks use special characters inside lemmas to mark morphological boundaries. These are meaningful linguistic annotations preserved in the CHAT %mor lemma field when possible.
Known Markers Across Languages
| Language | Marker | Meaning | Example Lemma | In %mor |
|---|---|---|---|---|
| Estonian | = | Compound boundary | maja=uks (house-door) | noun\|maja=uks (preserved) |
| Basque | ! | Derivational boundary | partxi!se (share + derivation) | noun\|partxi!se-Ine (preserved) |
| Finnish | # | Compound boundary | jää#kaappi (ice-cabinet) | noun\|jää_kaappi (mangled: # → _) |
= and ! pass through the cleaning pipeline because they are not reserved CHAT %mor syntax characters. # is reserved in traditional CHAT MOR for prefix markers (e.g., v|#un#do), so the sanitizer replaces it with _.
Gotcha: = ambiguity with legacy CLAN translation glosses. Legacy CLAN %mor tiers use = for translation glosses (e.g., n|perro=dog), a convention predating UD adoption. The parser treats = identically in both cases; it is preserved as part of the lemma string. This means legacy n|perro=dog parses successfully but the translation semantics are lost: the model stores perro=dog as a single lemma, indistinguishable from an Estonian compound like maja=uks. Since we cannot reliably disambiguate the two uses without language-specific context, legacy translation glosses are silently absorbed into the lemma. Files with legacy =translation syntax still parse and round-trip correctly, but the translation information is not semantically accessible. This affects corpora that predate our UD MOR adoption and lack Stanza coverage for their language.
Multi-Word Expression Lemmas (Stanza _ Convention)
Stanza uses underscores in lemmas to represent multi-word expressions across many languages: New_York, parce_que (French), pick_up (English), a_causa_di (Italian). The current cleaning pipeline strips underscores entirely (New_York → NewYork). This is a known, open data-quality limitation of the current mapper.
Multi-Value Features (Commas in Feature Values)
UD encodes multi-value features with commas: PronType=Int,Rel means a word is both interrogative and relative. These commas appear in the CHAT %mor feature suffix and are preserved as-is:
pron|wat-Int,Rel
This is sometimes mistaken for a compound lemma marker, but commas in UD always appear in the feature column (CoNLL-U column 6), never in the lemma column (CoNLL-U column 3). In CHAT %mor, they appear after the - feature separator, not inside the lemma. The grammar, both parsers, and the data model all accept commas in feature values. See Section 3: Multi-Value Features above.
Future Direction
The current handling of compound lemma boundaries is inconsistent across languages. A possible future improvement is a unified Unicode separator character that would normalize all compound/derivational boundary markers (=, !, #, and potentially _) into a single convention. This has not been implemented as of 2026-03-02 and requires a design decision on which character to use and whether to preserve the original markers in a structured field.
Data Model
The Rust data model in talkbank-model represents %mor tiers with these types:
MorTier
The top-level tier container:
pub struct MorTier {
pub tier_type: MorTierType, // MorTierType::Mor
pub(crate) items: MorItems, // Vec<Mor> wrapper; accessed via accessor methods
pub terminator: Terminator, // typed terminator (., ?, !, +..., etc.)
pub span: Span, // source location
}
Mor (Item)
One item aligned with one main-tier word:
pub struct Mor {
pub main: MorWord, // required main word
pub post_clitics: SmallVec<[MorWord; 2]>, // optional ~clitics
}
MorWord
A single morphological word (POS + lemma + features):
pub struct MorWord {
pub pos: PosCategory, // e.g., "noun"
pub lemma: MorStem, // e.g., "dog"
pub features: SmallVec<[MorFeature; 4]>, // e.g., [Plur]
}
MorFeature
A morphological feature with optional key:
pub struct MorFeature {
key: Option<Arc<str>>, // e.g., Some("Number") or None
value: Arc<str>, // e.g., "Plur"
}
Construction examples:
// Flat feature (current convention)
MorFeature::new("Plur") // key=None, value="Plur"
MorFeature::new("S3") // key=None, value="S3"
MorFeature::new("Int,Rel") // key=None, value="Int,Rel"
// Keyed feature (UD-standard, forward-compatible)
MorFeature::new("Number=Plur") // key=Some("Number"), value="Plur"
MorFeature::new("Tense=Past") // key=Some("Tense"), value="Past"
// Explicit constructors
MorFeature::flat("Plur")
MorFeature::with_key_value("Number", "Plur")
Lossless roundtrip guarantee: MorFeature::new auto-detects the = delimiter. Features without = are flat; features with = split into key+value. Serialization reproduces the original format exactly: flat features stay flat, and keyed features keep their key.
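The auto-detection and round-trip behaviour can be illustrated with a minimal sketch. FeatureSketch is a hypothetical stand-in, not the crate's MorFeature: it uses String instead of interned Arc<str>, and splitting on the first = is an assumption about how the delimiter is detected.

```rust
use std::fmt;

/// Hypothetical stand-in for `MorFeature`; not the crate's actual definition.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FeatureSketch {
    key: Option<String>,
    value: String,
}

impl FeatureSketch {
    /// Assumption: split on the first `=`; with no `=`, the feature is flat.
    pub fn new(text: &str) -> Self {
        match text.split_once('=') {
            Some((key, value)) => Self {
                key: Some(key.to_string()),
                value: value.to_string(),
            },
            None => Self {
                key: None,
                value: text.to_string(),
            },
        }
    }

    pub fn key(&self) -> Option<&str> {
        self.key.as_deref()
    }

    pub fn value(&self) -> &str {
        &self.value
    }
}

impl fmt::Display for FeatureSketch {
    /// Re-emit the feature in the shape it was read: flat or key=value.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match &self.key {
            Some(key) => write!(f, "{key}={}", self.value),
            None => f.write_str(&self.value),
        }
    }
}
```

Because the comma in Int,Rel is not a delimiter at this layer, a multi-value feature stays a single flat value and serializes unchanged.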
PosCategory and MorStem
Both are interned Arc<str> newtypes for memory efficiency:
pub struct PosCategory(pub Arc<str>); // interned via pos_interner()
pub struct MorStem(pub Arc<str>); // interned via stem_interner()
Common values (noun, verb, the, a, be, etc.) are pre-populated in the interner. Cloning is O(1): a single atomic reference-count increment.
Memory Layout
The model uses SmallVec for inline storage of common cases:
- Mor.post_clitics: SmallVec<[MorWord; 2]>: most words have 0-1 clitics
- MorWord.features: SmallVec<[MorFeature; 4]>: most words have 0-4 features
- MorFeature key and value are Arc<str>, interned for deduplication
For a typical 30-word utterance with %mor, the model allocates approximately 30 Mor items, each with 1 MorWord and 0-4 MorFeature values. The interning system ensures that repeated POS tags, stems, and feature values share a single allocation across the entire file.
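To make the sharing concrete, here is a minimal sketch of string interning over Arc<str>. InternerSketch is hypothetical; the real pos_interner() and stem_interner() may use a different data structure, pre-population strategy, or locking scheme.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

/// Hypothetical interner; illustrates the idea only.
#[derive(Default)]
pub struct InternerSketch {
    pool: Mutex<HashSet<Arc<str>>>,
}

impl InternerSketch {
    /// Return the shared allocation for `text`, creating it on first use.
    pub fn intern(&self, text: &str) -> Arc<str> {
        let mut pool = self.pool.lock().expect("interner mutex poisoned");
        if let Some(existing) = pool.get(text) {
            // Cloning an Arc only bumps the reference count.
            return Arc::clone(existing);
        }
        let fresh: Arc<str> = Arc::from(text);
        pool.insert(Arc::clone(&fresh));
        fresh
    }
}
```

Two calls with the same text return pointers to the same allocation, which is what lets repeated POS tags and stems cost one allocation per file rather than one per word.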
Grammar
The tree-sitter grammar for %mor is defined in grammar.js. The relevant rules:
mor_content → mor_word (mor_post_clitic)*
mor_post_clitic → tilde mor_word
mor_word → mor_pos pipe mor_lemma (mor_feature)*
mor_feature → hyphen mor_feature_value
mor_feature_value → /[^\.\?\|\+~\-\s\r\n]+/
Key design decisions:
- mor_feature_value accepts = and !: The regex [^\.\?\|\+~\-\s\r\n]+ matches any characters except the MOR structural delimiters. This means Number=Plur parses as a single mor_feature_value node. The split on = happens in the model layer, not the grammar, following the “parse, don’t validate” principle.
- mor_feature_value accepts ,: Multi-value features like Int,Rel parse as a single node.
- No compound/prefix rules: The grammar has no rules for + (compounds) or # (prefixes) in the UD MOR format. These are legacy CHAT MOR features not used in UD-style output.
Parser
The tree-sitter parser produces MorTier from CHAT text. It is GLR-based and error-recovering, producing a CST that the Rust talkbank-parser crate walks to construct MorTier. Used by the CLI, LSP, and batchalign. High-frequency values (PosCategory, MorStem) are interned via Arc<str> during construction.
The corpus/reference/ set is the correctness gate for %mor parsing:
every file must parse and round-trip cleanly. The file count grows as
new constructs are added; run find corpus/reference -name '*.cha' | wc -l
to get the live total.
Validation
The %mor tier undergoes several validation checks:
Content Validation (E711)
Every MorWord is checked for:
- Empty POS: |lemma with no POS before the pipe
- Empty lemma: pos| with no lemma after the pipe
- Empty feature: bare - separator with no feature text
Main-tier Alignment (E705 / E706)
The %mor tier must align 1-to-1 with the main tier’s alignable
words (excluding pauses, events, and other non-word content). The
number of Mor items must equal the number of alignable main-tier
words. The validator emits E705 MorCountMismatchTooFew when
%mor has fewer items than the main tier and E706
MorCountMismatchTooMany when it has more. Terminator-mismatch
errors are emitted separately as E707 (presence) and E716 (value).
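The count comparison itself is simple. The following sketch shows the three-way outcome, assuming the number of alignable main-tier words has already been computed; MorCountError and check_mor_count are illustrative names, not the validator's API.

```rust
use std::cmp::Ordering;

/// Hypothetical error type mirroring the two codes described above.
#[derive(Debug, PartialEq, Eq)]
pub enum MorCountError {
    /// E705: %mor has fewer items than the main tier has alignable words.
    TooFew { expected: usize, found: usize },
    /// E706: %mor has more items than the main tier has alignable words.
    TooMany { expected: usize, found: usize },
}

/// Compare the %mor item count against the alignable main-tier word count.
pub fn check_mor_count(alignable_words: usize, mor_items: usize) -> Result<(), MorCountError> {
    match mor_items.cmp(&alignable_words) {
        Ordering::Less => Err(MorCountError::TooFew {
            expected: alignable_words,
            found: mor_items,
        }),
        Ordering::Equal => Ok(()),
        Ordering::Greater => Err(MorCountError::TooMany {
            expected: alignable_words,
            found: mor_items,
        }),
    }
}
```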
GRA Alignment (E720)
When both %mor and %gra tiers are present, the number of %gra
relations must equal the number of %mor chunks (including
clitics and the terminator). A mismatch emits E720
MorGraCountMismatch. This is computed via MorTier::count_chunks().
(%gra’s own internal validators, E708 malformed relation, E709
invalid index, E712 word-index out of range, E713 head-index out of
range, E721 non-sequential index, E722 no ROOT, E723 multiple
ROOTs, E724 circular dependency, are documented in
Dependent Tiers § %gra.)
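As a rough illustration of the chunk count behind E720, the sketch below counts one chunk per main word, one per post-clitic, and one for the terminator. MorItemSketch and both functions are hypothetical simplifications; the actual MorTier::count_chunks() operates on the full model types.

```rust
/// Hypothetical, reduced view of a `Mor` item: only the clitic count matters here.
pub struct MorItemSketch {
    pub post_clitic_count: usize,
}

/// One chunk per main word, one per post-clitic, plus one for the terminator.
pub fn count_chunks(items: &[MorItemSketch]) -> usize {
    let words: usize = items.iter().map(|item| 1 + item.post_clitic_count).sum();
    words + 1
}

/// Report an E720-style mismatch when the %gra relation count differs.
pub fn check_gra_count(items: &[MorItemSketch], gra_relations: usize) -> Result<(), String> {
    let chunks = count_chunks(items);
    if chunks == gra_relations {
        Ok(())
    } else {
        Err(format!("E720: {chunks} %mor chunks but {gra_relations} %gra relations"))
    }
}
```

For a tier such as pron|it~aux|be-Fin-Ind-Pres-S3 adj|good . the count is 2 + 1 + 1 = 4, so %gra needs four relations.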
JSON Serialization
The MorTier serializes to JSON using serde. MorFeature serializes as a plain string ("Plur" or "Number=Plur"), so the JSON schema is simply "type": "string". Example:
{
"tier_type": "Mor",
"items": [
{
"main": {
"pos": "pron",
"lemma": "I",
"features": ["Prs", "Nom", "S1"]
}
},
{
"main": {
"pos": "verb",
"lemma": "want",
"features": ["Fin", "Ind", "Pres", "S1"]
}
},
{
"main": {
"pos": "noun",
"lemma": "cookie",
"features": ["Plur"]
}
}
],
"terminator": "."
}
When key=value features are present, they serialize with the key included:
"features": ["Number=Plur", "Tense=Past"]
The JSON schema for MorFeature is "type": "string" regardless of whether keys are present.
Migration from Traditional CHAT MOR
What Changed
The traditional CHAT MOR format (CLAN-era) used a complex, hierarchically structured notation:
%mor: pro:sub|I v|want n|cookie-PL .
Key differences from the UD MOR format:
| Aspect | Traditional CHAT MOR | UD MOR |
|---|---|---|
| POS tags | CLAN categories (pro:sub, v, n, adj, adv) | Lowercased UPOS (pron, verb, noun, adj, adv) |
| POS subtypes | Colon-separated (pro:sub, det:art, v:aux) | Flat (subtypes dropped or encoded differently) |
| Features | CLAN suffix system (-PL, -PAST, -3S, -PRES) | UD feature values (-Plur, -Past, -S3, -Pres) |
| Compounds | + separator (n|+n|black+n|bird) | Not used in UD-style output |
| Prefixes | # separator (v|#un#do) | Not used in UD-style output |
| Morpheme segmentation | Full segmentation (v|eat&PAST) | Not used (features are abstract, not morphemic) |
| Translations | = separator (n|perro=dog) | Not present in base format (separate mechanism) |
What the Model Removed
The UD MOR redesign (2026) removed the following types from the data model:
- MorSuffix: suffix with type discriminant (fusional, derivational, etc.)
- MorCompound: compound word with + separator
- MorPrefix: prefix with # separator
- MorSubcategory: POS subcategory after colon
- AnnotatedChunk: chunk with optional translation
- Chunk: enum of word/compound/terminator
These were replaced by the flat MorWord { pos, lemma, features } structure. The model went from ~12 types to 4 (MorTier, Mor, MorWord, MorFeature).
Backward Compatibility
The grammar still accepts many traditional CHAT MOR constructs (colons in POS tags, etc.) because the reference corpus contains files in both formats. The parser produces the same flat MorWord regardless; legacy constructs are mapped to the simplified structure during parsing.
What Stays the Same
Despite the format changes, fundamental CHAT conventions remain:
- Pipe (|) separates POS from lemma
- Hyphen (-) introduces features
- Tilde (~) marks post-clitics
- Space separates items
- Terminator ends the tier
- 1-to-1 alignment with main tier words
Toward Full UD Compatibility
The current format is UD-inspired but not UD-compliant. Here is a roadmap of what would be needed for full lossless UD round-tripping:
Already Supported
- POS tags (UPOS equivalents)
- Lemmas
- Feature values (flat and key=value)
- MWT expansions (clitics)
- Dependency relations (via %gra)
Gaps Remaining
- Feature keys: The model supports Key=Value features, but batchalign’s mapper currently emits flat values only. When the mapper switches to emitting Number=Plur instead of just Plur, the parser, model, and serializer handle it automatically with no code changes.
- Person+Number composites: UD has separate Person=3 and Number=Sing features. CHAT combines them into -S3 (3rd person singular). Decomposing S3 back to Person=3|Number=Sing would require a lookup table or a convention change.
- Multi-value feature delimiter: UD uses commas (PronType=Int,Rel). CHAT preserves these commas in the feature value, but the semantic structure (two separate values) is not explicitly modeled. The model treats Int,Rel as an opaque string.
- XPOS: UD provides language-specific POS tags (XPOS) alongside universal tags (UPOS). CHAT %mor has no XPOS field, so this information is not represented.
- Morpheme-level analysis: UD’s MISC field can encode morpheme boundaries and glosses. CHAT’s UD MOR format does not attempt morpheme segmentation: features are abstract grammatical categories, not morphemic decompositions.
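For the Person+Number gap, a lookup of the kind mentioned above might look like the following sketch. It is not implemented in the codebase as far as this chapter states. The S prefix for singular comes from the S3 example; the P prefix for plural is an assumption made here for illustration, as is the function name.

```rust
/// Hypothetical decomposition of a composite such as "S3" into UD-style features.
/// Returns None for anything that is not a two-character Number+Person composite.
pub fn decompose_person_number(feature: &str) -> Option<String> {
    let mut chars = feature.chars();
    let number = match chars.next()? {
        'S' => "Sing",
        'P' => "Plur", // assumption: plural composites are written P1..P3
        _ => return None,
    };
    // The second character must be a person digit 1, 2, or 3.
    let person = chars.next().filter(|c| ('1'..='3').contains(c))?;
    if chars.next().is_some() {
        return None; // longer values such as "Sing" are not composites
    }
    Some(format!("Person={person}|Number={number}"))
}
```

Rejecting anything longer than two characters keeps ordinary flat values such as Plur or Sing from being misread as composites.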
The Path Forward
The model is designed so that moving toward UD compliance requires no breaking changes:
- MorFeature already supports Key=Value; it just needs the mapper to emit keys
- PosCategory is an opaque string; XPOS could be held in a separate field if needed
- JSON schema uses "type": "string" for features, so adding keys doesn’t break consumers
- The grammar already accepts = in feature values, so no grammar changes are needed
The migration can happen incrementally: the mapper starts emitting key=value features, existing flat data continues to parse identically, and corpus files can be upgraded at their own pace.
Phon Tiers (%xmodsyl, %xphosyl, %xphoaln, %xphoint)
Status: Reference Last updated: 2026-06-23 07:28 EDT
The Phon extension tiers provide syllable-level phonological annotation, segmental alignment between target and actual IPA, and per-phone time intervals. They are produced by the Phon application and exported to CHAT via PhonTalk.
chatter parses and validates all four tiers as first-class CHAT tiers.
The x prefix. Phon emits these tiers with a leading x (%xmodsyl, %xphosyl, %xphoaln, %xphoint) to mark them as extension tiers. The grammar accepts both the x-prefixed names and the historical non-x names (%modsyl, %phosyl, %phoaln, %phoint); the parser and validator key off the tier kind, not the literal prefix. The canonical serialized form is the x-prefixed name.
The four tiers
| Tier | Source | Carries | Word separator |
|---|---|---|---|
| %xmodsyl | %mod | Syllabification of the model/target transcription | space |
| %xphosyl | %pho | Syllabification of the actual transcription | space |
| %xphoaln | %mod+%pho | Phone-by-phone alignment of model ↔ actual | space |
| %xphoint | %pho | Per-phone time intervals (0x15 time bullets) | / |
%xmodsyl, %xphosyl, and %xphoaln are word-aligned to their source tier(s)
with single ASCII spaces. %xphoint uses / (space-slash-space) as its word
separator because single spaces already separate the phone and bullet tokens
inside each word.
Tier formats
%xmodsyl / %xphosyl, syllabification
A word is one or more phone:CODE units concatenated with no internal
whitespace; words are separated by single spaces. The phone is one IPA phone
(IPA length is written with the modifier letter ː, U+02D0, never an ASCII
colon, so the : separator is unambiguous). A leading stress marker (ˈ
primary, ˌ secondary) is part of the phone it precedes.
The constituent code is one character. The legal codes are O N C L R E A D U:
| Code | Constituent | Notes |
|---|---|---|
| O | Onset | |
| N | Nucleus | monophthong nucleus |
| C | Coda | |
| L | Left appendix | e.g. /s/ in an /s/-stop cluster |
| R | Right appendix | e.g. final /z/ in a complex coda |
| E | OEHS (onset of empty-headed syllable) | e.g. the stop element of an affricate |
| A | Ambisyllabic | |
| D | Diphthong | a nucleus member of a diphthong/triphthong; treated as a nucleus |
| U | Unknown | Phon could not assign a concrete constituent; common on %xphosyl when the model %xmodsyl is fully syllabified |
The remaining Phon SyllableConstituentType mnemonics, B (boundary),
S (stress), W (word boundary), T (tone), are not emitted on these
tiers: boundary, stress, and tone need no per-phone marker.
*CHI: I want three .
%mod: aɪ wɑnt θri
%xmodsyl: a:Dɪ:D w:Oɑ:Nn:Ct:C θ:Oɹ:Oi:N
%pho: aɪ wɑn fwi
%xphosyl: a:Dɪ:D w:Oɑ:Nn:C f:Ow:Oi:N
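The phone:CODE format can be tokenized with a single pass over the characters. The sketch below is illustrative only: parse_syl_word and reconstruct are hypothetical names, and the validator's own PositionCode tokenizer may differ. It relies on the rule above that a code is exactly one character, so the character after each : is the code and the next phone starts right after it.

```rust
/// The legal constituent codes listed in the table above.
const LEGAL_CODES: &str = "ONCLREADU";

/// Parse one %xmodsyl/%xphosyl word into (phone, code) units.
pub fn parse_syl_word(word: &str) -> Result<Vec<(String, char)>, String> {
    let mut units = Vec::new();
    let mut phone = String::new();
    let mut chars = word.chars();
    while let Some(c) = chars.next() {
        if c != ':' {
            phone.push(c); // still inside the phone (may include a stress marker)
            continue;
        }
        // The single character after ':' is the constituent code.
        let code = chars
            .next()
            .ok_or_else(|| format!("missing code after ':' in {word:?}"))?;
        if phone.is_empty() {
            return Err(format!("empty phone in {word:?}"));
        }
        if !LEGAL_CODES.contains(code) {
            return Err(format!("illegal constituent code {code:?} in {word:?}"));
        }
        units.push((std::mem::take(&mut phone), code));
    }
    if !phone.is_empty() {
        return Err(format!("trailing phone without a code in {word:?}"));
    }
    Ok(units)
}

/// Strip the codes and concatenate the phones, as the reconstruction checks do.
pub fn reconstruct(units: &[(String, char)]) -> String {
    units.iter().map(|(phone, _)| phone.as_str()).collect()
}
```

Applied to the second word of the %xmodsyl line above, w:Oɑ:Nn:Ct:C yields the phones w, ɑ, n, t, which concatenate back to the %mod word wɑnt.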
%xphoaln, phone alignment
A word is one or more comma-separated pairs; a pair is model↔actual (↔ is
U+2194). Either side may be ∅ (U+2205, empty set): ∅ on the left is an
epenthesis (a phone produced but not targeted); ∅ on the right is a deletion.
Both sides are never ∅ at once.
*CHI: the best .
%mod: ðə bɛst
%pho: ðə bɛs
%xphoaln: ð↔ð,ə↔ə b↔b,ɛ↔ɛ,s↔s,t↔∅
The alignment lists segments (phones). Suprasegmental stress (ˈ/ˌ) that
may appear on the %mod/%pho word is therefore not part of the alignment
pairs; the reconstruction checks below compare modulo those stress markers.
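A minimal sketch of parsing one %xphoaln word and rebuilding both sides follows. PairSketch and the functions are hypothetical (the model's AlignmentPair uses source/target field names), and stress-marker stripping is left out for brevity.

```rust
/// Hypothetical pair type; `None` stands for ∅ (U+2205).
#[derive(Debug, PartialEq, Eq)]
pub struct PairSketch {
    pub model: Option<String>,
    pub actual: Option<String>,
}

/// Parse one %xphoaln word: comma-separated `model↔actual` pairs.
pub fn parse_aln_word(word: &str) -> Result<Vec<PairSketch>, String> {
    word.split(',')
        .map(|pair| -> Result<PairSketch, String> {
            let mut sides = pair.split('↔');
            let (Some(left), Some(right), None) = (sides.next(), sides.next(), sides.next())
            else {
                return Err(format!("expected exactly one ↔ in {pair:?}"));
            };
            if left.is_empty() || right.is_empty() {
                return Err(format!("empty side in {pair:?}"));
            }
            let side = |s: &str| (s != "∅").then(|| s.to_string());
            let parsed = PairSketch {
                model: side(left),
                actual: side(right),
            };
            if parsed.model.is_none() && parsed.actual.is_none() {
                return Err("both sides are ∅".to_string());
            }
            Ok(parsed)
        })
        .collect()
}

/// Concatenate the model sides, skipping ∅ (compare against the %mod word).
pub fn model_side(pairs: &[PairSketch]) -> String {
    pairs.iter().filter_map(|p| p.model.as_deref()).collect()
}

/// Concatenate the actual sides, skipping ∅ (compare against the %pho word).
pub fn actual_side(pairs: &[PairSketch]) -> String {
    pairs.iter().filter_map(|p| p.actual.as_deref()).collect()
}
```

For the second word of the example, b↔b,ɛ↔ɛ,s↔s,t↔∅ rebuilds to bɛst on the model side and bɛs on the actual side, matching %mod and %pho.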
%xphoint, per-phone intervals
%xphoint gives the time segmentation of each individual phone on %pho,
effectively phone-level bullets analogous to the word-level timing on %wor.
Groups (one per %pho word) are separated by /. Within a group, each phone
is followed by a CLAN time-alignment bullet: the byte 0x15 (NAK), the interval
start_end, then 0x15.
*CHI: I want . •0_500•
%pho: aɪ wɑnt
%xphoint: aɪ •0_250• / w •250_320• ɑ •320_400• n •400_460• t •460_500•
(Bullets are shown as • above; in the file they are the 0x15 byte.)
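The group and bullet layout can be read with straightforward string splitting. This sketch assumes whitespace between phone and bullet tokens and millisecond integer times, as in the example; IntervalSketch and the function names are hypothetical and do not reflect the XphointTier types. The small checker mirrors two of the content rules listed under Validation (start >= end, and start times going backwards).

```rust
/// The CLAN bullet delimiter byte (NAK).
const NAK: char = '\u{15}';

#[derive(Debug, PartialEq, Eq)]
pub struct IntervalSketch {
    pub phone: String,
    pub start_ms: u64,
    pub end_ms: u64,
}

/// Parse %xphoint content into one Vec of intervals per `/`-separated group.
pub fn parse_xphoint(content: &str) -> Result<Vec<Vec<IntervalSketch>>, String> {
    content.split(" / ").map(parse_group).collect()
}

fn parse_group(group: &str) -> Result<Vec<IntervalSketch>, String> {
    // U+0015 is not Unicode whitespace, so bullets survive split_whitespace intact.
    let tokens: Vec<&str> = group.split_whitespace().collect();
    if tokens.is_empty() || tokens.len() % 2 != 0 {
        return Err(format!("group {group:?} is not a sequence of phone/bullet pairs"));
    }
    tokens
        .chunks(2)
        .map(|pair| -> Result<IntervalSketch, String> {
            let (start, end) = pair[1]
                .strip_prefix(NAK)
                .and_then(|s| s.strip_suffix(NAK))
                .and_then(|s| s.split_once('_'))
                .ok_or_else(|| format!("malformed bullet after {:?}", pair[0]))?;
            Ok(IntervalSketch {
                phone: pair[0].to_string(),
                start_ms: start.parse().map_err(|e| format!("bad start: {e}"))?,
                end_ms: end.parse().map_err(|e| format!("bad end: {e}"))?,
            })
        })
        .collect()
}

/// Return the codes for the two ordering rules: E742 (start >= end) and
/// E743 (start times not non-decreasing across the tier).
pub fn check_intervals(groups: &[Vec<IntervalSketch>]) -> Vec<&'static str> {
    let mut codes = Vec::new();
    let mut previous_start = 0;
    for interval in groups.iter().flatten() {
        if interval.start_ms >= interval.end_ms {
            codes.push("E742");
        }
        if interval.start_ms < previous_start {
            codes.push("E743");
        }
        previous_start = interval.start_ms;
    }
    codes
}
```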
Validation
These checks run by default. Pass --suppress xphon to silence the entire
Phon %x validation surface, or suppress an individual code. (The historical
--check-xphon opt-in flag is now a deprecated no-op: the checks it used to
gate are on by default.)
Word-count cross-checks (each %x tier has the same number of words as the
tier(s) it depends on):
- %xmodsyl ↔ %mod: E725
- %xphosyl ↔ %pho: E726
- %xphoaln ↔ %mod: E727, ↔ %pho: E728
Content checks:
| Code | Tier | Rule |
|---|---|---|
| E735 | xmodsyl/xphosyl | a unit is not a well-formed phone:CODE (no :, empty phone, or empty code) |
| E736 | xmodsyl/xphosyl | a constituent code is not one of O N C L R E A D U |
| E737 | xmodsyl | stripping codes and concatenating phones does not reproduce the %mod word |
| E738 | xphosyl | stripping codes and concatenating phones does not reproduce the %pho word |
| E739 | xphoaln | a pair is malformed (not exactly one ↔, an empty side, or ∅↔∅) |
| E740 | xphoaln | concatenating the model sides (skipping ∅, modulo stress) does not reproduce the %mod word |
| E741 | xphoaln | concatenating the actual sides (skipping ∅, modulo stress) does not reproduce the %pho word |
| E742 | xphoint | a bullet has start >= end |
| E743 | xphoint | interval start times are not non-decreasing across the tier |
| E744 | xphoint | the first start / last end falls outside the record’s media bullet (1 ms tolerance) |
| E745 | xphoint | a group’s phones do not reproduce the %pho word |
| E746 | xphoint | the number of groups does not equal the %pho word count |
See Alignment Architecture for the word-count implementation.
Parsing strategy
- %xmodsyl / %xphosyl: stored as flat word strings (talkbank-model::dependent_tier::phon::SylTier), consistent with how %pho and %mod store flat phone words. The validator tokenizes each word into typed phone:CODE units (PositionCode) to apply the content rules above; the IPA characters themselves stay verbatim for exact round-trip.
- %xphoaln: each word is parsed into a Vec<AlignmentPair>, where AlignmentPair { source, target } carries one model↔actual mapping (None is ∅).
- %xphoint: parsed into typed groups of (phone, bullet) pairs (XphointTier / XphointGroup / PhoneInterval), reusing the same 0x15 bullet machinery as %wor.
Deep phonological analysis is Phon’s domain; chatter parses the structure that validation needs and keeps the IPA content verbatim.
Phon XML source format
In Phon’s native XML format, phonological data is stored as structured elements:
<ipaTarget>
<pho>
<pw>
<ph scType="onset"><base>θ</base></ph>
<ph scType="nucleus"><base>ɹ</base></ph>
<ph scType="nucleus"><base>i</base></ph>
</pw>
</pho>
</ipaTarget>
Each <pw> (phonological word) element contains <ph> elements with syllable
constituent types (scType). The <alignment> element provides phone-level
mappings between target and actual using index-based <pm> (phone map) entries.
Data quality notes
A small percentage of Phon corpus XML records have an orthography↔IPA word-count
mismatch: the number of <pw> elements in <ipaTarget> / <ipaActual> differs
from the number of <w> elements in <orthography>. This is expected in child
phonology data: children may produce extra syllables, partial words, or
over-productions relative to the target.
For current counts on a local CHILDES/TalkBank data tree, run:
python3 scripts/analysis/scan_phon_mismatches.py /path/to/data
The PhonTalk CHAT export handles this discrepancy inconsistently:
- %mod / %pho are written through a OneToOne alignment path that maps IPA words to orthography words; extras are silently dropped.
- %xmodsyl / %xphosyl / %xphoaln are written directly from the raw IPATranscript; all IPA words are included.
This produces CHAT files where %xmodsyl may have more words than %mod,
triggering the E725-E728 word-count errors. This is being investigated in
collaboration with the Phon team.
Word Syntax
Status: Reference Last updated: 2026-05-11 23:33 EDT
Words are the primary content unit on the main tier. CHAT defines several word types and annotation mechanisms.
Standalone Words
Most words are simple tokens separated by whitespace:
*CHI: I want a cookie .
Words can contain Unicode characters for any language:
*CHI: ich möchte Kekse .
Compounds
Compound words join multiple elements with +:
*CHI: I want ice+cream .
Special Word Forms
Shortened Forms
Parentheses mark omitted portions of a word:
*CHI: (be)cause I want it .
The full form is because; the child produced cause.
Replacements
Square brackets with colon mark what the speaker actually meant:
*CHI: I goed [: went] to the store .
The speaker said “goed” but the intended word was “went”.
Language Markers
The @s: suffix marks a word’s language in multilingual transcripts:
*CHI: I want a Keks@s:deu .
Other @ markers:
- @l: letter
- @c: child-invented form
- @f: family-specific word
- @n: neologism
- @o: onomatopoeia
- @b: babbling
- @wp: word play
- @si: signed word
Annotations
Words and groups can carry post-positioned annotations in square brackets:
Error Marking
*CHI: he goed [*] to school .
[*] marks an error. More specific error codes can follow: [* m:+ed].
Explanations
*CHI: that one [= the red ball] .
[= text] provides an explanation or gloss.
Replacements
*CHI: I wanna [: want to] go .
[: text] marks the target/intended form.
Best Guess
*CHI: I want the birfer [?] .
[?] marks uncertain transcription.
Events and Actions
Paralinguistic Events
Events marked with &= describe non-speech sounds:
*CHI: &=laughs I want cookie .
*CHI: &=coughs .
Fillers
Fillers are marked with &-:
*CHI: &-um I want &-uh cookie .
Interposed Speech (Other Speaker)
Brief background speech from a different speaker is marked with the
&*SPK:text prefix. It captures the interjection without creating
a full turn line:
*CHI: I want &*MOT:careful a cookie .
This says CHI was speaking and MOT briefly said “careful” mid-turn.
If the intervention is substantial enough to constitute its own turn,
transcribe it as a separate *MOT: utterance instead. Model:
crates/talkbank-model/src/model/content/other_spoken.rs.
(Note: [^ text] is a freecode, a standalone free-form
researcher annotation that sits as its own content item on the main
tier (variant of UtteranceContent::Freecode, sibling of Word and
Group; it is NOT attached to any word). See grammar/grammar.js
rule freecode and
crates/talkbank-model/src/model/content/utterance_content/. Used
for transcriber notes that are independent of any single word; for
notes about a single word use [% text] or [= text] instead.)
Pauses
*CHI: I (.) want (..) a (...) cookie .
*CHI: I (1.5) want a cookie .
- (.): short pause
- (..): medium pause
- (...): long pause
- (N.N): timed pause in seconds
Overlap
Overlapping speech between speakers uses angle brackets and overlap markers:
*MOT: do you want <a cookie> [>] ?
*CHI: <cookie> [<] !
- [>]: follows the overlap (this speaker started first)
- [<]: overlaps the previous speaker
Retrace and Repetition
Groups followed by retrace markers indicate speech disfluencies:
*CHI: <I want> [/] I want a cookie .
*CHI: <I want> [//] I need a cookie .
*CHI: <I want a> [///] give me a cookie .
- [/]: partial retrace (speaker repeats the same words)
- [//]: full retrace (speaker restarts with different words)
- [///]: multiple retracing (multiple false starts)
- [/-]: reformulation (speaker rephrases with different structure)
The CHAT Word
Status: Current Last modified: 2026-05-29 18:43 EDT
“Word” is the most complex and most misunderstood concept in CHAT. This chapter documents what a word actually is, how the grammar parses it, and how the Rust model represents it. If you maintain this codebase, you will encounter word-level bugs. This chapter exists so you can understand them.
The Fundamental Rule
Whitespace delimits words. Contiguous non-whitespace characters form one word token. This applies everywhere on the main tier.
*CHI: hello world .
^^^^^ word: "hello"
^^^^^ word: "world"
The grammar uses extras: $ => [] – no implicit whitespace. Whitespace
nodes (whitespaces, space) are explicit in the CST. Tree-sitter does
not skip whitespace between tokens. This is the foundation of every
tokenization decision in the grammar.
There are no exceptions to this rule. Every ambiguity described in this chapter is resolved by applying this rule consistently.
Word Structure
A word in the grammar is standalone_word – a sequence of an optional
prefix, a required body, optional suffixes, and an optional POS tag.
The following diagram shows the full decomposition. All named nodes are separate CST children.
flowchart TD
sw["standalone_word\n(grammar.js, prec.right 6)"]
zero["zero\n'0' -- omission prefix"]
wp["word_prefix\n'&-' filler | '&~' nonword | '&+' fragment"]
wb["word_body\n(required)"]
fm["form_marker\n@b, @c, @d, @z:label, ..."]
wls["word_lang_suffix\n@s, @s:eng, @s:eng+fra"]
pos["pos_tag\n$n, $v, $adj, ..."]
sw -->|"optional prefix"| zero & wp
sw -->|"required"| wb
sw -->|"optional suffix"| fm & wls
sw -->|"optional"| pos
ws["word_segment\npure spoken text"]
short["shortening\n'(text)' omitted sound"]
sm["stress_marker\nprimary or secondary"]
len["lengthening\n':' one or more colons"]
op["overlap_point\none of four brackets"]
cae["ca_element\nsingle CA marker"]
cad["ca_delimiter\npaired CA marker"]
ub["underline_begin\ncontrol char pair"]
ue["underline_end\ncontrol char pair"]
cm["'+'\ncompound marker"]
wb -->|"children (any order)"| ws & short & sm & len & op & cae & cad & ub & ue & cm
In the grammar (search grammar/grammar.js for the standalone_word
and word_body rules), the structure is:
standalone_word: $ => prec.right(6, seq(
optional(choice($.word_prefix, $.zero)),
$.word_body,
optional($.form_marker),
optional($.word_lang_suffix),
optional($.pos_tag),
)),
word_body: $ => prec.right(choice(
seq(
choice($.word_segment, $.shortening, $.stress_marker),
repeat(choice($.word_segment, $.shortening, $.stress_marker, $._word_marker)),
),
seq(
choice($.overlap_point, $.ca_element, $.ca_delimiter, $.underline_begin),
choice($.word_segment, $.shortening, $.stress_marker),
repeat(choice($.word_segment, $.shortening, $.stress_marker, $._word_marker)),
),
)),
word_body has two branches:
- Standard start: the word begins with word_segment, shortening, or stress_marker, followed by any number of body children.
- Marker-initial: the word begins with a structural marker (overlap, CA, underline), but that marker must be immediately followed by text content. This prevents degenerate words like a standalone overlap marker from forming a valid standalone_word.
Lengthening and + (compound marker) are excluded from starting a word
body. This is how standalone : falls through to separator(colon) –
see Section 5 (Tokenization Ambiguities) below.
The word_segment Purity Invariant
word_segment contains ONLY pure spoken text. All structural markers
are separate typed children in word_body, never consumed by word_segment.
This is a hard invariant with three consequences:
- cleaned_text() never scans for markers. It concatenates Text and Shortening elements. No stripping needed.
- Validation finds ALL markers by type. Overlap markers, CA elements, and underline pairs are always WordContent variants, regardless of position within the word.
- Editors get typed CST nodes. Syntax highlighting, bracket matching, and hover info work on individual markers, not opaque substrings.
How it works
word_segment is a DFA token at prec(5) with a regex that excludes
all structural characters. The exclusions are generated from the symbol
registry (grammar/src/generated_symbol_sets.js) – never hand-written.
word_segment: $ => token(prec(5, seq(
WORD_SEGMENT_FIRST_RE, // generated: excludes structural chars + '0' at start
WORD_SEGMENT_REST_RE, // generated: excludes structural chars
))),
Full exclusion table
Every character in this table is excluded from word_segment and
becomes a separate typed node in the CST.
| Category | Characters | CST node type |
|---|---|---|
| Overlap markers | ⌈ ⌉ ⌊ ⌋ | overlap_point |
| CA elements | ↑ ↓ ≠ ∾ ⁑ ⤇ ∙ Ἡ ↻ ⤆ | ca_element |
| CA delimiters | ∆ ∇ ° ▁ ▔ ☺ ♋ ⁇ ∬ Ϋ ∮ ↫ ⁎ ◉ § | ca_delimiter |
| Stress markers | ˈ ˌ | stress_marker |
| Colons | : | lengthening |
| Underline markers | \x02\x01, \x02\x02 | underline_begin / underline_end |
| Brackets | [ ] < > ( ) { } | structural (annotations, groups) |
| Punctuation | . ! ? , ; + | terminators, separators, compound |
| CHAT prefixes | @ $ & * % | headers, events, speakers |
| Intonation contours | ⇗ ↗ → ↘ ⇘ ≈ ≋ ∞ ≡ | content-level markers |
| Group delimiters | ‹ › " " 〔 〕 | pho/sin groups, quotes |
| Control chars | \x01-\x08, \x15 | bullets, underline |
First-character-only exclusion: 0 is excluded from the first
character of word_segment (it is the omission prefix). 0 in
non-initial positions is valid: 200, h0me, abc0 all parse correctly.
The Rust Data Model
Word struct
The Word struct (crates/talkbank-model/src/model/content/word/word_type.rs)
is the canonical typed representation:
pub struct Word {
pub span: Span,
pub word_id: Option<SmolStr>,
pub(crate) raw_text: SmolStr,
pub content: WordContents,
pub category: Option<WordCategory>,
pub form_type: Option<FormType>,
pub lang: Option<WordLanguageMarker>,
pub part_of_speech: Option<SmolStr>,
pub inline_bullet: Option<Bullet>,
}
Key fields:
- raw_text: the exact text from the input, including all markers. Used for roundtrip serialization.
- content: a WordContents (SmallVec-backed sequence of WordContent elements). This is the structured decomposition. Most words have 1-2 elements; SmallVec avoids heap allocation for the common case.
- category: optional prefix (Omission, CAOmission, Filler, Nonword, PhonologicalFragment).
- form_type: optional @ suffix (@c child-invented, @d dialect, @z:label user-defined, etc.).
- lang: optional @s language marker (Shortcut, Explicit, Multiple, Ambiguous).
- part_of_speech: optional $ tag.
WordContent enum
WordContent (crates/talkbank-model/src/model/content/word/content.rs)
is the enum of everything that can appear inside a word body. Each variant
maps directly to a grammar node.
| Grammar node | WordContent variant | Rust type | Example |
|---|---|---|---|
| word_segment | Text | WordText(NonEmptyString) | hello, want |
| shortening | Shortening | WordShortening(NonEmptyString) | (be) in (be)cause |
| overlap_point | OverlapPoint | OverlapPoint | ⌈, ⌉2 |
| ca_element | CAElement | CAElement | ↑, ↓ |
| ca_delimiter | CADelimiter | CADelimiter | ∆, ° |
| stress_marker | StressMarker | WordStressMarker | ˈ primary, ˌ secondary |
| lengthening | Lengthening | WordLengthening { count: u8 } | : = 1, :: = 2, ::: = 3 |
| (caret in word) | SyllablePause | WordSyllablePause | ^ in o^ver |
| underline_begin | UnderlineBegin | UnderlineMarker | \x02\x01 |
| underline_end | UnderlineEnd | UnderlineMarker | \x02\x02 |
| + (compound) | CompoundMarker | WordCompoundMarker | + in ice+cream |
| ~ (clitic boundary) | CliticBoundary | WordCliticBoundary | ~ in le~ha |
cleaned_text()
Word::cleaned_text() derives NLP-ready text from content by
concatenating only Text and Shortening variants:
pub fn compute_cleaned_text(&self) -> String {
let mut result = String::new();
for item in &self.content {
match item {
WordContent::Text(t) => result.push_str(t.as_ref()),
WordContent::Shortening(s) => result.push_str(s.as_ref()),
_ => {}
}
}
result
}
This works because the purity invariant guarantees that Text elements
never contain structural markers. There is nothing to strip.
Examples:
| Input | content | cleaned_text() |
|---|---|---|
| hello | [Text("hello")] | hello |
| (be)cause | [Shortening("be"), Text("cause")] | because |
| no:: | [Text("no"), Lengthening(2)] | no |
| ice+cream | [Text("ice"), CompoundMarker, Text("cream")] | icecream |
| le~ha | [Text("le"), CliticBoundary, Text("ha")] | leha |
| ja^ja | [Text("ja"), SyllablePause, Text("ja")] | jaja |
| he↑llo | [Text("he"), CAElement(PitchUp), Text("llo")] | hello |
| °soft° | [CADelimiter(Softer), Text("soft"), CADelimiter(Softer)] | soft |
| ˈhello | [StressMarker(Primary), Text("hello")] | hello |
| ⌈hello⌉ | [OverlapPoint(TopBegin), Text("hello"), OverlapPoint(TopEnd)] | hello |
The result is cached via OnceLock on first access.
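The caching pattern can be sketched with std::sync::OnceLock as follows. WordSketch and ContentSketch are reduced, hypothetical versions of Word and WordContent with only a few variants; the real struct's field names and layout are not shown in this chapter and may differ.

```rust
use std::sync::OnceLock;

/// Reduced, hypothetical version of `WordContent`.
pub enum ContentSketch {
    Text(String),
    Shortening(String),
    Lengthening(u8),
    CompoundMarker,
}

/// Reduced, hypothetical version of `Word` with a lazily computed cache.
pub struct WordSketch {
    content: Vec<ContentSketch>,
    cleaned: OnceLock<String>,
}

impl WordSketch {
    pub fn new(content: Vec<ContentSketch>) -> Self {
        Self {
            content,
            cleaned: OnceLock::new(),
        }
    }

    /// Compute the cleaned text on first access; later calls reuse the cached String.
    pub fn cleaned_text(&self) -> &str {
        self.cleaned.get_or_init(|| {
            let mut result = String::new();
            for item in &self.content {
                match item {
                    // Only spoken material contributes; markers are skipped.
                    ContentSketch::Text(t) | ContentSketch::Shortening(t) => result.push_str(t),
                    _ => {}
                }
            }
            result
        })
    }
}
```

OnceLock allows initialization through a shared reference, so cleaned_text() can take &self and still populate the cache exactly once, even when called from multiple threads.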
What is included in cleaned_text vs what is stripped
The following table is the complete inventory of how every word-internal
element contributes to (or is excluded from) cleaned_text(). This must
match what NLP pipelines (Stanza, etc.) expect as input.
| WordContent variant | Character(s) | In cleaned_text? | Rationale |
|---|---|---|---|
| Text | spoken text | YES | The actual word |
| Shortening | (be) | YES | Shortened form is still spoken |
| CompoundMarker | + | No | Structural boundary, not spoken |
| CliticBoundary | ~ | No | Morphological boundary, not spoken |
| SyllablePause | ^ | No | Pause between syllables, not spoken |
| Lengthening | : :: ::: | No | Prosodic marker, not spoken |
| StressMarker | ˈ ˌ | No | Prosodic marker, not spoken |
| OverlapPoint | ⌈ ⌉ ⌊ ⌋ | No | Timing marker, not spoken |
| CAElement | ↑ ↓ ≠ ∾ ⁑ ⤇ ∙ Ἡ ↻ ⤆ | No | Prosodic annotation |
| CADelimiter | ∆ ∇ ° ▁ ▔ ☺ ♋ ⁇ ∬ Ϋ ∮ ↫ ⁎ ◉ § | No | Voice quality annotation |
| UnderlineBegin | \x02\x01 | No | Formatting marker |
| UnderlineEnd | \x02\x02 | No | Formatting marker |
Characters that stay in word_segment (ARE spoken text):
- Letters (all Unicode)
- Digits (in non-initial position; 0 in initial = omission prefix)
- Hyphen (-), part of word text, e.g., ice-cream, self-conscious
- Apostrophe ('), contractions, e.g., don't, it's
- Hash (#), appears in some transcription conventions
- Underscore (_), compound boundary in some conventions
Characters NOT in word_segment (excluded by symbol registry): See the full exclusion table in Precedence Decisions in the grammar docs.
Comparison with batchalign2
batchalign2’s annotation_clean() (60 lines of .replace() calls) strips
all the same characters that our grammar excludes from word_segment.
Key differences:
- Parentheses: ba2 COMMENTED OUT the strip. We handle them as Shortening; the content inside parens IS included in cleaned_text.
- IPA characters (ạ ā ʔ ʕ ʰ): ba2 incorrectly strips them. We correctly keep them; they are real phonetic content.
- Hyphen (-): ba2 strips it. We keep it in word_segment because hyphen is a valid word character (contractions, compounds, morphological suffixes in the %mor tier).
Our design eliminates the need for character-by-character stripping entirely.
cleaned_text() is a simple concatenation of Text + Shortening
elements, with zero scanning.
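The concatenation described above can be sketched as follows. This is a minimal illustration, not the crate's implementation: the enum is a cut-down stand-in for the model's WordContent, and the payload types (for example, Shortening holding only the text inside the parentheses) are assumptions.

```rust
/// Simplified stand-in for the model's `WordContent`; payloads are assumptions.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum WordContent {
    Text(String),
    /// Assumed to hold the text inside the parentheses, e.g. "lo" for `hel(lo)`.
    Shortening(String),
    CompoundMarker,
    Lengthening(u8),
    StressMarker,
    OverlapPoint,
    CADelimiter,
}

/// Concatenate the spoken parts (Text + Shortening) and skip every marker.
/// No character scanning is needed because markers are already typed.
fn cleaned_text(content: &[WordContent]) -> String {
    let mut out = String::new();
    for item in content {
        match item {
            WordContent::Text(s) | WordContent::Shortening(s) => out.push_str(s),
            WordContent::CompoundMarker
            | WordContent::Lengthening(_)
            | WordContent::StressMarker
            | WordContent::OverlapPoint
            | WordContent::CADelimiter => {}
        }
    }
    out
}
```

Because the match lists every variant, adding a new marker variant to the enum forces a decision about whether it is spoken text.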
The Six Tokenization Ambiguities
CHAT was designed for human readability, not machine parsing. Six
characters have context-dependent meanings that the grammar must
disambiguate. Full details with proof grammars are in
grammar/docs/tokenization-rules.md and grammar/docs/precedence-decisions.md.
What follows is a summary for orientation.
1. Overlap markers (⌈⌉⌊⌋)
Adjacent to text = part of the word. Space-separated = standalone
overlap_point.
Yeah⌋⌈2 hey ONE word: "Yeah⌋⌈2"
Yeah ⌋ ⌈2 hey three tokens: "Yeah", ⌋, ⌈2
Maximal munch at prec(5) makes word_segment consume adjacent overlap
characters. Overlap markers are only recognized as overlap_point when
space-separated on both sides.
2. Zero/omission prefix (0)
Adjacent to word body = omission prefix. Space-separated = action marker.
0die ONE word: standalone_word(zero, word_body("die"))
0 die TWO tokens: nonword(zero), word("die")
standalone_word at prec.right(6) beats nonword at prec(1).
The extras: [] setting prevents whitespace from being skipped between
zero and word_body. The zero token is inlined directly into
standalone_word (not through word_prefix) because tree-sitter’s
precedence does not propagate through intermediate rules. This was proven
empirically with a minimal test grammar – see grammar/docs/precedence-decisions.md.
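The adjacency rule can be mimicked outside tree-sitter with a toy whitespace tokenizer. This is only a sketch of the observable behavior; the Token type and tokenize function are hypothetical and say nothing about how the generated parser resolves precedence.

```rust
#[derive(Debug, PartialEq)]
enum Token<'a> {
    /// `0die`: one word whose body carries an omission prefix.
    OmittedWord(&'a str),
    /// A bare `0`: the action marker (a nonword).
    Zero,
    /// Anything else, treated as a plain word in this sketch.
    Word(&'a str),
}

/// Split on spaces, then decide per token whether a leading `0` is a prefix
/// (adjacent to a word body) or a standalone marker (space-separated).
fn tokenize(line: &str) -> Vec<Token<'_>> {
    line.split(' ')
        .filter(|t| !t.is_empty())
        .map(|t| match t.strip_prefix('0') {
            Some("") => Token::Zero,
            Some(body) => Token::OmittedWord(body),
            None => Token::Word(t),
        })
        .collect()
}
```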
3. CA parenthetical vs shortening
In CA mode (@Options: CA), a fully parenthesized word (word) is an
uncertain/omitted word (CAOmission), semantically equivalent to 0word.
Partially parenthesized hel(lo) is always a shortening.
@Options: CA
*CHI: (ja) . CAOmission: uncertain "ja"
*CHI: hel(lo) . Shortening: "(lo)" is the shortened part
Distinguishing these requires file-level context (@Options header).
The parser sets WordCategory::CAOmission when the word is fully
parenthesized in CA mode. Isolated parser.parse_word_fragment() calls
cannot determine CA mode – they need a FragmentSemanticContext.
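The dependence on file-level context can be made concrete with a small classifier that takes the CA flag as an explicit argument. This is a sketch under assumptions: ParenReading and classify are hypothetical names, and treating a fully parenthesized word outside CA mode as a shortening is an assumption of this example.

```rust
#[derive(Debug, PartialEq)]
enum ParenReading {
    CaOmission,
    Shortening,
    Plain,
}

/// `ca_mode` stands in for the `@Options: CA` header; without it the
/// function cannot tell `(ja)` the omission from `(ja)` the shortening.
fn classify(word: &str, ca_mode: bool) -> ParenReading {
    // Fully wrapped: one pair of parentheses around the whole word.
    let fully_wrapped = word.len() > 2
        && word.starts_with('(')
        && word.ends_with(')')
        && !word[1..word.len() - 1].contains(|c: char| c == '(' || c == ')');
    if fully_wrapped && ca_mode {
        ParenReading::CaOmission
    } else if word.contains('(') {
        ParenReading::Shortening
    } else {
        ParenReading::Plain
    }
}
```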
4. Colon – lengthening vs separator
Inside a word (after text): prosodic lengthening. Standalone: separator.
no:: ONE word: Text("no") + Lengthening(2)
hello : world separator(colon)
The DFA always produces lengthening for : (higher precedence). But
word_body rejects lengthening as a first element, so standalone :
cannot form a valid word and falls through to separator(colon). This is
the “constrain the parser, not the DFA” pattern.
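A two-stage toy shows the pattern in miniature: the lexing stage always emits Lengthening for a run of colons, and the word rule rejects a leading Lengthening so a lone colon falls through to the separator. Piece, Parsed, lex, and parse are hypothetical names for this sketch; the real grammar expresses the same idea in tree-sitter rules.

```rust
#[derive(Debug, PartialEq)]
enum Piece {
    Text(String),
    Lengthening(usize),
}

#[derive(Debug, PartialEq)]
enum Parsed {
    Word(Vec<Piece>),
    SeparatorColon,
    Invalid,
}

/// "DFA" stage: every run of ':' becomes a Lengthening piece, unconditionally.
fn lex(token: &str) -> Vec<Piece> {
    let mut pieces = Vec::new();
    let mut chars = token.chars().peekable();
    while let Some(&c) = chars.peek() {
        if c == ':' {
            let mut n = 0;
            while chars.peek() == Some(&':') {
                chars.next();
                n += 1;
            }
            pieces.push(Piece::Lengthening(n));
        } else {
            let mut text = String::new();
            while let Some(&c) = chars.peek() {
                if c == ':' {
                    break;
                }
                text.push(c);
                chars.next();
            }
            pieces.push(Piece::Text(text));
        }
    }
    pieces
}

/// "Parser" stage: a word must start with text, so a lone ':' cannot be a word.
fn parse(token: &str) -> Parsed {
    let pieces = lex(token);
    if matches!(pieces.first(), Some(Piece::Text(_))) {
        Parsed::Word(pieces)
    } else if pieces == [Piece::Lengthening(1)] {
        Parsed::SeparatorColon
    } else {
        Parsed::Invalid
    }
}
```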
5. Plus (+) – compound vs terminator vs linker
Inside a word: compound marker. At line end: terminator prefix. At line start: linker prefix.
ice+cream ONE word with compound marker
and then +... terminator: trailing_off (prec 10 beats prec 5)
+< but I +/. linker: lazy_overlap, terminator: interruption
Terminators and linkers use prec(10), which beats word_segment at
prec(5). No valid CHAT word ends with + – the grammar enforces this
by structure.
6. Bracket annotations vs plain brackets
Bracket annotations ([= text], [=! text], [% text]) use prec(8)
prefix tokens to beat generic bracket handling.
Historical Context: The Coarsening and Its Reversal
The original structured grammar (pre-coarsening)
The original grammar (preserved in
grammar/docs/pre-coarsening-grammar.js.reference) had all word-internal
markers as children of word_content:
// Pre-coarsening: every marker was a child of word_content
word_content: $ => choice(
$.word_segment,
$.shortening,
$.stress,
$.colon,
$.caret,
$.tilde,
$.plus,
$.overlap_point,
$.ca_element,
$.ca_delimiter,
$.underline_begin,
$.underline_end,
),
The coarsening decision
At one point, standalone_word was coarsened into an opaque token –
a single DFA regex that consumed the entire word as one undifferentiated
string. The rationale was:
- Simpler grammar with fewer tree-sitter conflicts.
- A Chumsky-based direct parser in Rust would re-parse the opaque token into structured WordContent elements.
This worked but had costs:
- Two parsers (tree-sitter + Chumsky) with independent bugs.
- Validation could not find structural markers without re-parsing.
- Editors got one opaque node instead of typed children.
- cleaned_text() had to scan for and strip marker characters.
The reversal (Chumsky elimination)
When the Chumsky direct parser was eliminated (making tree-sitter the sole parser), the structured word grammar was restored. The key decisions:
- All marker characters were re-excluded from word_segment using the symbol registry as the single source of truth.
- Each marker type became a separate CST child in word_body.
- The WordContent enum in the Rust model was aligned 1:1 with the grammar nodes.
- The word_segment purity invariant was established as a TDD gate.
The result is one parser, one source of truth for exclusions, and typed markers from grammar through model.
Testing: The word_segment Purity Gate
The purity invariant, each structural marker produces a separate CST
child rather than being consumed by word_segment, is enforced by a
group of tree-sitter corpus tests under
grammar/test/corpus/word/. Each *_in_word_lint.txt file embeds a
structural marker inside a word and asserts the CST splits the word
appropriately:
| Test file | Input | Asserts |
|---|---|---|
overlap_in_word_lint.txt | butt⌈er⌉ | word_segment, overlap_point, word_segment, overlap_point |
ca_element_in_word_lint.txt | CA element inside a word | word_segment, ca_element, word_segment |
ca_delimiter_in_word_lint.txt | CA delimiter pair around a word | ca_delimiter, word_segment, ca_delimiter |
lengthening.txt, lengthening_between_segments.txt | no::, etc. | word_segment, lengthening |
stacked_ca_markers.txt | Multiple adjacent CA markers in one word | Each marker is its own CST child |
Underline and stress invariants are covered by corpus tests elsewhere
in grammar/test/corpus/ and by the parser-equivalence tests in
crates/talkbank-parser-tests/. The historical
word_segment_purity.txt consolidated 8 named tests in one file; it
was retired in commit fdceeac2 when the corresponding constructs
were given their own per-construct test files (this is the new layout
that the current spec generators produce from the spec sources).
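The invariant itself is easy to state as a check over a word's CST children. The sketch below assumes the children are available as (node kind, source text) pairs and uses an illustrative subset of marker characters; the real exclusion set is generated from the symbol registry, and impure_segments is a hypothetical helper, not part of the test suite.

```rust
/// Illustrative subset only; the real set comes from the symbol registry.
const MARKERS: &[char] = &['⌈', '⌉', '⌊', '⌋', '↑', '↓', '°', 'ˈ', 'ˌ', ':', '^', '~', '+'];

/// Return the text of every `word_segment` child that swallowed a marker.
/// An empty result means the purity invariant holds for this word.
fn impure_segments<'a>(children: &[(&'a str, &'a str)]) -> Vec<&'a str> {
    children
        .iter()
        .filter(|(kind, text)| *kind == "word_segment" && text.contains(MARKERS))
        .map(|(_, text)| *text)
        .collect()
}
```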
How to add a new purity-style test
If you add a new structural marker to the grammar:
- Add its characters to the symbol registry (spec/symbols/symbol_registry.json).
- Run just symbols-gen to regenerate the exclusion sets.
- Add a spec in spec/constructs/ that embeds the marker inside a word; regenerate the affected grammar/parser fixtures with the current spec/tools commands from Spec Workflow so a per-construct test fixture is created in grammar/test/corpus/word/. Verify the CST output names each marker as its own child.
- Run the full verification sequence:
cd grammar && tree-sitter generate && tree-sitter test
cargo nextest run -p talkbank-parser
cargo nextest run -p talkbank-parser-tests -E 'test(parser_equivalence)'
cargo nextest run -p talkbank-parser-tests --test roundtrip_reference_corpus
Key Source Files
| File | What it defines |
|---|---|
grammar/grammar.js | search for standalone_word, word_body, word_segment, _word_marker |
grammar/src/generated_symbol_sets.js | Character exclusion sets (generated, do not edit) |
grammar/test/corpus/word/*_in_word_lint.txt, lengthening*.txt, stacked_ca_markers.txt | Per-construct purity-invariant gate tests (replaced the consolidated word_segment_purity.txt retired in fdceeac2) |
grammar/docs/tokenization-rules.md | The 6 tokenization ambiguities with full examples |
grammar/docs/precedence-decisions.md | Precedence proofs (zero, colon, purity invariant) |
grammar/docs/pre-coarsening-grammar.js.reference | Historical: the grammar before coarsening |
crates/talkbank-model/src/model/content/word/word_type.rs | Word struct |
crates/talkbank-model/src/model/content/word/content.rs | WordContent enum (12 variants) |
crates/talkbank-model/src/model/content/word/word_contents.rs | WordContents (SmallVec-backed sequence) |
crates/talkbank-model/src/model/content/word/category.rs | WordCategory enum (5 variants) |
crates/talkbank-model/src/model/content/word/form.rs | FormType enum (22 variants) |
crates/talkbank-model/src/model/content/word/language.rs | WordLanguageMarker enum (4 variants) |
Symbols
Status: Reference Last modified: 2026-05-29 18:43 EDT
CHAT uses a rich set of symbols for transcription conventions. This
page documents the symbol categories and the symbol registry that
drives both the grammar and the Rust crates. The
symbol registry
(spec/symbols/symbol_registry.json) is the source of truth; when
this page and the registry disagree, the registry wins.
Symbol Registry
The authoritative symbol definitions live in spec/symbols/symbol_registry.json. This JSON file is the single source of truth. It generates:
- Character sets for the tree-sitter grammar (grammar.js)
- Rust constants for the model and validation crates
- Validation rules for the spec tool
After any change to the symbol registry, run:
just symbols-gen
Symbol Categories
Terminators
Punctuation that ends an utterance:
| Symbol | Name | Usage |
|---|---|---|
. | Period | Declarative |
? | Question | Interrogative |
! | Exclamation | Exclamatory |
+... | Trailing off | Incomplete utterance |
+..? | Trailing-off question | Question trails off |
+/. | Interruption | Speaker interrupted by another |
+//. | Self-interruption | Speaker interrupts self |
+/? | Interrupted question | Question interrupted |
+!? | Broken question | Exclamation-question |
+"/. | Quoted new line | Quotation continues on next line |
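Because several terminators end in the same character (for example, +... and . both end in a period), any suffix-based recognizer has to try the longer forms first. The following sketch shows that ordering; the table is copied from the list above, while split_terminator and the returned names are hypothetical and not the parser's API.

```rust
/// Terminators ordered so that longer symbols are tried before shorter ones.
const TERMINATORS: &[(&str, &str)] = &[
    ("+\"/.", "quoted new line"),
    ("+//.", "self-interruption"),
    ("+...", "trailing off"),
    ("+..?", "trailing-off question"),
    ("+/.", "interruption"),
    ("+/?", "interrupted question"),
    ("+!?", "broken question"),
    (".", "period"),
    ("?", "question"),
    ("!", "exclamation"),
];

/// Split an utterance body from its terminator, returning the body and
/// the terminator's name, or None if no terminator is present.
fn split_terminator(utterance: &str) -> Option<(&str, &'static str)> {
    let trimmed = utterance.trim_end();
    TERMINATORS.iter().find_map(|&(symbol, name)| {
        trimmed
            .strip_suffix(symbol)
            .map(|body| (body.trim_end(), name))
    })
}
```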
CA (Conversation Analysis) Symbols
CA notation symbols fall into three parser-distinct categories in
spec/symbols/symbol_registry.json. They are not interchangeable:
the grammar treats them as different node kinds.
CA element symbols (ca_element_symbols) attach to a word, so
book↑ is a single token whose content carries the symbol:
| Symbol | Meaning |
|---|---|
↑ | Rising pitch (attaches to a word) |
↓ | Falling pitch (attaches to a word) |
∙ | Micropause |
≠ | Inhalation marker |
⁑ ↻ ∾ ⤆ ⤇ Ἡ | Other CA element symbols |
CA arrow separators (in word_segment_forbidden_start_symbols)
are own-node separators between words, not word-attachments. The
parser splits them as their own nodes:
| Symbol | Meaning |
|---|---|
→ | Level pitch contour |
↗ | Rising-to-mid contour |
↘ | Falling-to-mid contour |
⇗ | Rising-to-high contour |
⇘ | Falling-to-low contour |
↖ ↙ ← | Other CA arrow separators |
CA delimiter symbols (ca_delimiter_symbols) bracket annotated
prosodic regions:
| Symbol | Meaning |
|---|---|
° | Quiet speech |
∆ ∇ | Higher / lower pitch register |
∬ ∮ | Other prosodic-region delimiters |
▁ ▔ | Low / high prosodic-region delimiters |
⁇ § ⁎ ↫ ◉ ☺ ♋ Ϋ | Additional registered CA delimiters |
Confirm the current contents of each category by reading
spec/symbols/symbol_registry.json directly; that is the file
just symbols-gen derives the grammar and Rust constants from.
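As a reading aid, the three categories can be modeled as a classifier over single characters. The symbol lists below are a subset taken from the tables on this page, not from the registry itself, and CaKind and ca_kind are hypothetical names; generated constants remain the authority.

```rust
#[derive(Debug, PartialEq)]
enum CaKind {
    /// Attaches to a word (ca_element_symbols).
    Element,
    /// Stands as its own node between words.
    ArrowSeparator,
    /// Brackets a prosodic region (ca_delimiter_symbols).
    Delimiter,
}

/// Classify a character using a hand-copied subset of the tables above.
fn ca_kind(c: char) -> Option<CaKind> {
    match c {
        '↑' | '↓' | '∙' | '≠' => Some(CaKind::Element),
        '→' | '↗' | '↘' | '⇗' | '⇘' => Some(CaKind::ArrowSeparator),
        '°' | '∆' | '∇' => Some(CaKind::Delimiter),
        _ => None,
    }
}
```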
Word Segment Characters
Characters that are forbidden at the start of words, forbidden in the rest of words, or forbidden throughout. These define the lexical boundaries of what constitutes a “word” in CHAT.
The grammar uses these sets to construct the word-matching regex patterns. Characters like [, ], <, >, (, ) are structural delimiters and cannot appear inside words.
Event Segment Characters
Characters forbidden in event descriptions (&=event content). Events have slightly different lexical rules than words.
Language Codes
CHAT uses ISO 639-3 three-letter language codes in @Languages headers and @s: word markers:
@Languages: eng, fra
*CHI: I want a croissant@s:fra .
Common codes: eng (English), fra (French), deu (German), spa (Spanish), zho (Chinese), jpn (Japanese).
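A word carrying an explicit language marker can be split into its surface form and code with plain string operations. This sketch handles only the single-code form shown above (word@s:xxx); split_language_marker is a hypothetical helper, and other marker shapes the format may allow are deliberately out of scope.

```rust
/// Split `croissant@s:fra` into ("croissant", "fra").
/// Accepts only a non-empty base and a three-letter lowercase ASCII code.
fn split_language_marker(word: &str) -> Option<(&str, &str)> {
    let (base, code) = word.split_once("@s:")?;
    let is_code = code.len() == 3 && code.chars().all(|c| c.is_ascii_lowercase());
    (!base.is_empty() && is_code).then_some((base, code))
}
```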
Special Markers
@ Markers (Word-Level)
The authoritative form-marker set is FormType in
crates/talkbank-model/src/model/content/word/form.rs. Current
variants:
| Marker | Meaning |
|---|---|
@a | Approximate / phonologically consistent form |
@b | Babbling |
@c | Child-invented form |
@d | Dialect form |
@f | Family-specific form |
@fp | Filled pause (deprecated, use &-um etc.) |
@g | Gemination / general special form |
@i | Interjection |
@k | Letter sequence (kinship) |
@l | Single letter |
@ls | Letter plural |
@n | Neologism |
@o | Onomatopoeia |
@p | Proper name |
@q | Metalinguistic reference |
@sas | Second-attempt success |
@si | Singing |
@sl | Slang |
@t | Test word |
@u | Unibet transcription |
@wp | Word play |
@x | Complex / excluded |
@z:<label> | User-defined special form (carries an arbitrary label) |
The second-language qualifier @s:LANG is a separate construct (see
the L2 morphotag section of the Batchalign book); it is not part of
FormType.
& Markers (Events and Fillers)
| Prefix | Meaning |
|---|---|
&= | Paralinguistic event (e.g., &=laughs) |
&- | Filler (e.g., &-um) |
&+ | Phonological fragment (e.g., &+sh) |
&~ | Nonword (e.g., &~mama) |
&* | Other speaker’s speech event (e.g., &*MOT:word, speech attributed to another speaker) |
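The prefix table maps directly onto a match on the character after the ampersand. AmpKind and classify_amp below are hypothetical names for a sketch of that lookup; the model's own types for events and fillers may be shaped differently.

```rust
#[derive(Debug, PartialEq)]
enum AmpKind {
    Event,
    Filler,
    Fragment,
    Nonword,
    OtherSpeaker,
}

/// Return the marker kind and the remaining payload, e.g. `&=laughs`
/// becomes (Event, "laughs"). Tokens without a known prefix yield None.
fn classify_amp(token: &str) -> Option<(AmpKind, &str)> {
    let rest = token.strip_prefix('&')?;
    let mut chars = rest.chars();
    let kind = match chars.next()? {
        '=' => AmpKind::Event,
        '-' => AmpKind::Filler,
        '+' => AmpKind::Fragment,
        '~' => AmpKind::Nonword,
        '*' => AmpKind::OtherSpeaker,
        _ => return None,
    };
    Some((kind, chars.as_str()))
}
```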
Scope Markers
| Marker | Meaning |
|---|---|
[/] | Partial retrace, speaker repeats the same words |
[//] | Full retrace, speaker restarts with different words |
[///] | Multiple retracing, multiple false starts |
[/-] | Reformulation, speaker rephrases with different structure |
[*] | Error |
[?] | Best guess |
[>] | Overlap follows |
[<] | Overlap precedes |
[= text] | Explanation |
[: text] | Replacement |
Architecture Overview
Status: Current Last modified: 2026-06-15 15:00 EDT
TalkBank/chatter is the standalone home of the TalkBank CHAT specification,
tree-sitter grammar, Rust crates, chatter CLI, LSP server, and desktop app.
It is self-contained: the CHAT-format core builds and runs
without any external TalkBank repository, so downstream consumers can depend on
its crates directly.
Data Flow
Specification is the source of truth. Code is generated downstream from it.
spec/ Source of truth (CHAT specification)
↓
grammar.js Tree-sitter grammar (in grammar/)
↓
parser.c Generated C parser (never hand-edited)
↓
Rust crates Parser → Model → Validation → Transform
↓
Applications chatter CLI, LSP server, desktop app
Two layers
Within this repository, the architecture splits into two layers:
Source-of-truth artifacts. spec/, spec/symbols/, and grammar/ define
the CHAT language and generate downstream parser tests, error docs, and shared
symbol sets.
Consumer crates and applications. The Rust crates under crates/, the
chatter CLI, talkbank-lsp, and the desktop app all consume those
source-of-truth artifacts rather than defining CHAT semantics independently.
Crate Dependency Graph
flowchart TD
derive["talkbank-derive\nProc macros"]
model["talkbank-model\nData model, validation, alignment, errors"]
cache["talkbank-cache\nValidation + roundtrip cache"]
parser["talkbank-parser\nCanonical parser (tree-sitter)"]
re2c["talkbank-parser-re2c\nAlternate parser (equivalence oracle)"]
transform["talkbank-transform\nPipelines, CHAT↔JSON, caching"]
cli["chatter\nCLI: validate, normalize, convert"]
lsp["talkbank-lsp\nLanguage Server Protocol"]
s2c["send2clan\nCLAN app bindings"]
desktop["chatter-desktop\nDesktop validation app (Tauri)"]
tests["talkbank-parser-tests\nEquivalence tests"]
derive --> model
model --> parser & re2c
parser --> transform
re2c --> transform
cache --> transform
transform --> cli & lsp & desktop
s2c --> cli & desktop
parser --> tests
re2c --> tests
Repository Layout
chatter/
├── grammar/ Tree-sitter grammar
├── spec/ CHAT specification (source of truth)
│ ├── constructs/ Valid CHAT examples + expected parse trees
│ ├── errors/ Invalid CHAT examples + expected error codes
│ ├── symbols/ Shared symbol registry (JSON)
│ ├── tools/ Core spec generators
│ └── runtime-tools/ Runtime-aware spec bootstrap/validation tools
├── crates/ Rust crates (model, parser, transform, CLI support, LSP)
├── corpus/ Reference corpus
├── tests/ Integration tests and fixtures
├── schema/ JSON Schema (auto-generated)
├── apps/chatter-desktop/ Desktop validation app (Tauri v2, React)
├── fuzz/ cargo-fuzz targets (separate workspace)
├── book/ This documentation
└── docs/ Strategy, proposals, investigations
Cargo Workspaces
Three separate Cargo workspaces live here:
- Root workspace (Cargo.toml), all Rust crates for parsing, model, transform, CLI, LSP, and apps/chatter-desktop/src-tauri.
- Spec workspace (spec/Cargo.toml), spec/tools for core generation, spec/runtime-tools for runtime-aware spec tooling.
- Fuzz workspace (fuzz/Cargo.toml), cargo-fuzz targets for parser and validation robustness checks.
Use the relevant manifest path for the workspace you mean to operate in:
- spec/tools/Cargo.toml for generators
- spec/runtime-tools/Cargo.toml for bootstrap/mining/runtime validation
- fuzz/Cargo.toml for cargo-fuzz targets
Where to read next
For per-topic detail (sections being consolidated; see SUMMARY for the authoritative current list):
- Spec System, Grammar, Parser Backends, how CHAT becomes typed AST.
- CHAT model: the AST itself, content traversal, wide-struct rule.
- Alignment: tier alignment, DP, sequence alignment.
- Errors and validation: diagnostics, validation gates, and parser/model invariants.
- Editor/runtime integration: talkbank-lsp and application boundaries layered on top of the CHAT core.
- Memory and Ownership, Type-Driven Design (lands during M11 errors-and-validation work).
- XML Emitter: projection.
For per-crate summaries see Crate Reference.
Spec System
Status: Current Last modified: 2026-05-29 17:50 EDT
Specifications in spec/ are the authoritative source of truth for the CHAT format. They drive grammar artifact generation, validation/error docs, and targeted test generation.
Historical note: This system was originally shaped during a
dual-parser era. The chumsky-based direct parser was removed in
March 2026. Today the canonical parser is tree-sitter
(talkbank-parser); a second implementation,
talkbank-parser-re2c, exists as a specification oracle and
high-throughput batch parser. Fragment specs remain valuable, but
synthetic tree-sitter wrapper behavior is audit-only legacy unless a
page or test explicitly says otherwise.
Spec Types
Construct Specs (spec/constructs/)
Each construct spec defines a valid CHAT pattern with its expected parse tree:
# example_name
Description of what this example tests.
## Input
\```mor_dependent_tier
%mor: VERB|eat .
\```
## Expected CST
\```cst
(mor_dependent_tier
(mor_tier_prefix)
...)
\```
## Metadata
- **Level**: tier
- **Category**: tiers
The Input code fence label (e.g., mor_dependent_tier, utterance) selects
which template wraps the fragment into a full CHAT file for parsing.
That is an explicit grammar/test templating mechanism. It is useful, but it does not by itself define honest isolated-fragment semantics for the direct parser.
Error Specs (spec/errors/)
Each error spec defines an invalid CHAT pattern with expected error codes:
# Error E301
## Metadata
- Code: E301
- Name: missing_participants
- Severity: Error
- Layer: parser
## Examples
### missing_participants_1
\```chat
@UTF8
@Begin
*CHI: hello .
@End
\```
Key metadata fields:
- Layer: parser: error caught during parsing (returns Err)
- Layer: validation: error caught after successful parse
- Status: not_implemented: generates #[ignore] tests
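Both spec types carry their metadata as a bullet list of key/value pairs, with or without bold keys. The sketch below reads such a list into a map; parse_metadata is a hypothetical helper written for illustration and is not the parser used by the spec tools.

```rust
use std::collections::BTreeMap;

/// Collect `- Key: value` and `- **Key**: value` lines into a map.
/// Lines that are not bullets, or have no colon, are ignored.
fn parse_metadata(section: &str) -> BTreeMap<String, String> {
    section
        .lines()
        .filter_map(|line| line.trim().strip_prefix("- "))
        .filter_map(|item| item.split_once(':'))
        .map(|(key, value)| {
            (
                key.trim().trim_matches('*').to_string(),
                value.trim().to_string(),
            )
        })
        .collect()
}
```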
Symbol Registry (spec/symbols/)
symbol_registry.json defines character sets used by both the grammar and Rust
crates. In this repo, just symbols-gen validates the registry and regenerates
the checked-in grammar and Rust symbol-set outputs. The generation step produces:
- JavaScript constants for grammar.js
- Rust constants for model validation
Test Generation
The predecessor monorepo used make test-gen as shorthand for three generator
classes. That root wrapper is not yet ported into this repo, but the underlying
generation responsibilities are still:
1. Tree-sitter Corpus Tests
gen_tree_sitter_tests reads construct specs and error specs, then:
- Wraps each Input in a template to create a full CHAT file
- Parses with tree-sitter and checks for error nodes
- Writes Expected CST to grammar/test/corpus/
For error specs, it captures the actual parse (with ERROR nodes) as the expected tree.
2. Rust Tests
gen_rust_tests generates Rust test functions:
- Construct specs become parse-and-compare tests
- Parser-layer error specs become parser.parse_chat_file() tests expecting Err
- Validation-layer error specs become parse-then-validate tests
Output: crates/talkbank-parser-tests/tests/generated/
The generated suites are useful as grammar/audit support and regression coverage, but they are not the sole authority for parser semantics.
3. Error Documentation
gen_error_docs generates optional local markdown pages for each error code
under docs/errors/ when maintainers want a browsable reference set while
working on diagnostics. The source of truth remains spec/errors/.
Workflow After Spec Changes
- Regenerate only the affected spec-driven artifacts using the current commands documented in spec/CLAUDE.md.
- Run the concrete verification commands from Contributing > Setup.
Never hand-edit generated artifacts; always regenerate from specs.
Post-Bootstrap Doctrine
- spec/tools remains the generator/validator for grammar corpus tests, error docs, and shared symbol artifacts.
- talkbank-parser-tests owns parser equivalence and roundtrip contracts.
- Isolated grammar additions should usually need two things: one grammar corpus example and one full-file fixture. They should not require the old bootstrap ritual unless generated artifacts really changed.
Grammar
Status: Current Last updated: 2026-03-24 00:01 EDT
The CHAT grammar is defined in grammar/grammar.js using the tree-sitter parser generator. It produces a GLR parser that handles the full CHAT format with error recovery.
Design Principles
Explicit Whitespace
Unlike most tree-sitter grammars, CHAT does not use extras for whitespace. All whitespace is grammar-visible because CHAT’s structure is whitespace-sensitive:
- Tab separates tier prefix from content
- Newline ends tiers
- Line continuation uses tab-at-start-of-line
- Space separates words and annotations
Two-Level Structure
The grammar has two structural levels:
- Document level: headers, utterances, @Begin/@End
- Tier level: main tier content, dependent tier content (each with distinct rules)
Opaque Lemmas
In the %mor tier rules, lemmas are parsed as opaque Unicode strings. The grammar does not attempt to decompose lemma content; that happens in the model layer. This follows the “parse, don’t validate” principle.
Key Grammar Rules
Document Structure
document → utf8_header, begin_header, lines..., end_header
line → header | utterance
utterance → main_tier, dependent_tiers...
Main Tier
main_tier → star, speaker, colon, tab, tier_body
tier_body → contents, utterance_end
contents → content_item, (whitespace, content_item)...
MOR Tier (UD-style)
mor_contents → mor_content, (whitespace, mor_content)..., terminator
mor_content → mor_word, mor_post_clitic*
mor_word → mor_pos, pipe, mor_lemma, mor_feature*
mor_post_clitic → tilde, mor_word
mor_feature → hyphen, mor_feature_value
POS tags are simple identifiers (no subcategories). Lemmas are opaque strings. Features are hyphen-separated values that may contain = for Key=Value pairs and , for multi-value features.
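The rule shapes above can be mirrored by a small hand-written splitter, which may help when reading %mor output. This is a simplified sketch: MorWord here is a hypothetical borrowed struct rather than the model type, the feature value in the usage is illustrative, and lemmas that themselves contain a hyphen are not handled.

```rust
#[derive(Debug, PartialEq)]
struct MorWord<'a> {
    pos: &'a str,
    lemma: &'a str,
    features: Vec<&'a str>,
}

/// mor_word: mor_pos, pipe, mor_lemma, mor_feature*
fn parse_mor_word(s: &str) -> Option<MorWord<'_>> {
    let (pos, rest) = s.split_once('|')?;
    if pos.is_empty() {
        return None;
    }
    // The first hyphen-separated part is the lemma; the rest are features.
    let mut parts = rest.split('-');
    let lemma = parts.next().filter(|l| !l.is_empty())?;
    Some(MorWord { pos, lemma, features: parts.collect() })
}

/// mor_content: mor_word followed by zero or more `~`-joined post-clitics.
fn parse_mor_content(s: &str) -> Option<Vec<MorWord<'_>>> {
    s.split('~').map(parse_mor_word).collect()
}
```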
Grammar Change Workflow
parser.c is generated from grammar.js; never edit it directly.
After any change to grammar.js:
- cd grammar && tree-sitter generate
- tree-sitter test (160 tests)
- cargo test -p talkbank-parser
- cargo nextest run -p talkbank-parser-tests (reference corpus equivalence, per-file)
- Verify the 78-file reference corpus passes at 100%
Conflict Resolution
The grammar uses tree-sitter’s precedence and conflict mechanisms to handle ambiguities:
- Word tokens use prec(5) to win over separators
- Inline bullets use prec(10) for their delimiters
- CA (conversation analysis) symbols use prec(3) for colon disambiguation
Generated Artifacts
Running tree-sitter generate produces:
- src/parser.c: the C parser
- src/node-types.json: node type metadata
The Rust crate talkbank-parser references node-types.json to generate node_types.rs (a generated constants file).
Parsing
Status: Current Last updated: 2026-05-19 16:54 EDT
The parsing pipeline converts CHAT text into a typed ChatFile AST.
The default and canonical parser is the tree-sitter parser
(talkbank-parser). A second implementation, talkbank-parser-re2c,
exists alongside it as a specification oracle and high-throughput
batch parser; it produces the same ChatFile model and is opt-in via
chatter validate --parser re2c. The LSP and all production paths
use the tree-sitter parser.
Tree-Sitter Parser
The talkbank-parser crate wraps the tree-sitter C parser and converts its concrete syntax tree (CST) into the ChatFile model.
Full-file parsing is the canonical entry point. TreeSitterParser also
provides fragment methods (parse_word_fragment(), parse_main_tier_fragment(),
parse_chat_file_fragment(), etc.) for parsing isolated CHAT fragments
directly.
CST → AST Pipeline
flowchart LR
chat["CHAT text\n(.cha file)"]
grammar["tree-sitter grammar\n(grammar.js → parser.c)"]
cst["Concrete Syntax Tree\n(all whitespace preserved)"]
walker["TreeSitterParser\n(CST traversal)"]
ast["ChatFile AST\n(semantic model)"]
chat --> grammar --> cst --> walker --> ast
Source text
↓ tree-sitter parse
Concrete Syntax Tree (CST), green tree with all tokens
↓ tree_parsing (Rust)
ChatFile AST, typed model with validation-ready data
The CST preserves every character of the source (whitespace, punctuation, comments). The Rust tree-parsing modules walk the CST and extract semantic information into the typed model.
Error Recovery
Tree-sitter’s GLR algorithm provides automatic error recovery. When the parser encounters unexpected input, it:
- Inserts ERROR nodes in the CST
- Continues parsing the rest of the file
- Reports parse errors via the ErrorSink trait
This means the parser always produces a result: even for malformed files, it extracts as much structure as possible.
ParseOutcome
Individual parse functions return ParseOutcome<T>:
- ParseOutcome::parsed(value): successfully parsed
- ParseOutcome::rejected(): could not parse this node (error already reported)
This allows the parser to skip individual malformed elements while continuing to parse the rest of the file.
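The skip-and-continue behavior can be illustrated with a stand-in type. Everything here is a sketch under assumptions: this ParseOutcome is a local re-creation with the two constructors named above, into_option is an invented helper, and a Vec<String> stands in for the ErrorSink trait.

```rust
#[derive(Debug, PartialEq)]
enum ParseOutcome<T> {
    Parsed(T),
    Rejected,
}

impl<T> ParseOutcome<T> {
    fn parsed(value: T) -> Self {
        ParseOutcome::Parsed(value)
    }
    fn rejected() -> Self {
        ParseOutcome::Rejected
    }
    /// Hypothetical helper so callers can drop rejected nodes with filter_map.
    fn into_option(self) -> Option<T> {
        match self {
            ParseOutcome::Parsed(value) => Some(value),
            ParseOutcome::Rejected => None,
        }
    }
}

/// Parse one element; on failure, report the error and reject the node.
fn parse_count(token: &str, errors: &mut Vec<String>) -> ParseOutcome<u32> {
    match token.parse::<u32>() {
        Ok(n) => ParseOutcome::parsed(n),
        Err(_) => {
            errors.push(format!("not a number: {token}"));
            ParseOutcome::rejected()
        }
    }
}

/// Parse every element, keeping the good ones and skipping the rejected ones.
fn parse_counts(source: &str, errors: &mut Vec<String>) -> Vec<u32> {
    source
        .split_whitespace()
        .filter_map(|t| parse_count(t, errors).into_option())
        .collect()
}
```

The error is pushed at the point of rejection, so the caller never has to reconstruct why a node is missing.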
Parser Equivalence
The 78-file reference corpus is the primary correctness guarantee:
cargo nextest run -p talkbank-parser-tests -E 'test(parser_equivalence)'
Each .cha file is its own test; nextest runs them in parallel and reports individual failures.
TreeSitterParser API
TreeSitterParser is the sole API handle for parsing. Callers create one
instance and pass &TreeSitterParser to all parsing call sites. There is
no trait abstraction; TreeSitterParser is a concrete type in the
talkbank-parser crate.
use talkbank_parser::TreeSitterParser;
let parser = TreeSitterParser::new()?;
// Full-file parsing (methods on TreeSitterParser).
// parse_chat_file returns ParseResult<ChatFile> with the diagnostic list
// embedded in the result envelope.
let chat_file = parser.parse_chat_file(&source)?;
// parse_chat_file_streaming pushes diagnostics into an ErrorSink as it
// goes, useful for very large files or LSP-style incremental flows.
let chat_file = parser.parse_chat_file_streaming(&source, &errors);
// Fragment parsing (methods on TreeSitterParser), used when synthesizing
// CHAT from non-CHAT sources (ASR output, UD annotations).
let word = parser.parse_word_fragment(word_text, &errors);
let main_tier = parser.parse_main_tier_fragment(tier_text, &errors);
AST Structure
The resulting ChatFile AST has a recursive content structure:
flowchart TD
cf["ChatFile"]
hdr["Headers\n@Languages, @Participants,\n@ID, @Options"]
utts["Utterances[]"]
mt["MainTier\nspeaker + content"]
dt["DependentTiers[]\n%mor, %gra, %pho, %sin, %wor"]
uc["UtteranceContent\n24 variants"]
leaf["Leaves\nWord | ReplacedWord | Separator"]
group["Groups\nGroup | AnnotatedGroup |\nRetrace | PhoGroup | SinGroup | Quotation"]
cf --> hdr & utts
utts --> mt & dt
mt --> uc
uc --> leaf & group
group -->|recurse| uc
Parser String Handling
The tree-sitter parser constructs owned model types (e.g., MorWord, GrammaticalRelation) directly from CST text. String-heavy types like PosCategory and MorStem use Arc<str> interning to avoid redundant allocations for repeated values. Short strings in model newtypes use SmolStr for inline storage up to 23 bytes.
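The interning idea can be shown with the standard library alone. This is a minimal sketch of the technique, not the crate's interner: Interner is a hypothetical type, and the real implementation may use a different container or a global pool.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Hands out one shared `Arc<str>` per distinct string value.
#[derive(Default)]
struct Interner {
    seen: HashSet<Arc<str>>,
}

impl Interner {
    fn intern(&mut self, s: &str) -> Arc<str> {
        // `Arc<str>: Borrow<str>` lets the set be queried with a plain &str.
        if let Some(existing) = self.seen.get(s) {
            return Arc::clone(existing);
        }
        let fresh: Arc<str> = Arc::from(s);
        self.seen.insert(Arc::clone(&fresh));
        fresh
    }
}
```

Repeated values such as a part-of-speech tag then cost one allocation in total, plus a reference-count bump per use.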
CHAT Data Model
Status: Current Last updated: 2026-06-14 19:57 EDT
The talkbank-model crate defines the typed AST for CHAT files. Every
other crate (parser, transform, CLAN, CLI, LSP) and the entire batchalign
runtime depend on it. This page describes the model itself, the
three-level content hierarchy, the content-walker primitives, and the
extract → infer → inject pattern that all NLP tasks follow.
ChatFile
The root type is ChatFile, representing a complete CHAT transcript:
pub struct ChatFile {
pub lines: Vec<Line>,
pub participants: IndexMap<SpeakerCode, Participant>,
pub languages: LanguageCodes,
pub options: ChatOptionFlags,
}
Each Line is either a Header or an Utterance. The full ownership
tree:
flowchart TD
chatfile["ChatFile\n(talkbank-model/src/model/file/chat_file/core.rs)"]
chatfile --> lines["lines: Vec<Line>"]
chatfile --> participants["participants:\nIndexMap<SpeakerCode, Participant>"]
chatfile --> languages["languages: LanguageCodes"]
chatfile --> options["options: ChatOptionFlags"]
lines --> header_line["Line::Header (Header)"]
lines --> utt_line["Line::Utterance (Utterance)"]
utt_line --> preceding["preceding_headers:\nSmallVec<Header>"]
utt_line --> main["main: MainTier"]
utt_line --> deptiers["dependent_tiers:\nVec<DependentTier>"]
utt_line --> health["parse_health: ParseHealthState"]
main --> speaker["speaker: SpeakerCode"]
main --> tiercontent["content: TierContent"]
tiercontent --> linkers["linkers: Vec<Linker>"]
tiercontent --> uttcontent["utterance_content:\nVec<UtteranceContent>\n(24 variants)"]
tiercontent --> terminator["terminator: Option<Terminator>"]
tiercontent --> bullet["bullet: Option<Bullet>"]
The DependentTier enum has 25 variants: structured linguistic
(Mor/Gra/Pho/Mod/Sin/Act/Cod/Wor), with-inline-bullets
(Add/Com/Exp/Gpx/Int/Sit/Spa), text-only
(Alt/Coh/Def/Eng/Err/Fac/Flo/Gls/Ort/Par/Tim),
Phon-project (Modsyl/Phosyl/Phoaln), and UserDefined /
Unsupported.
Three-Level Content Hierarchy
CHAT main-tier content is a tree with three nesting levels. Every content traversal must understand all three.
ChatFile
└── Line::Utterance
└── MainTier
└── TierContent
├── content: Vec<UtteranceContent> ← Level 1
│ ├── Word(Box<Word>)
│ │ └── content: Vec<WordContent> ← Level 3
│ ├── OverlapPoint(OverlapPoint)
│ ├── Group(Group)
│ │ └── BracketedContent
│ │ └── Vec<BracketedItem> ← Level 2
│ ├── PhoGroup, SinGroup, Quotation
│ │ └── (same BracketedContent)
│ ├── Retrace(Box<Retrace>)
│ ├── Pause, Event, Separator, ...
│ └── AnnotatedWord, AnnotatedGroup, ...
├── bullet: Option<Bullet>
├── linkers: Linkers
└── terminator: Terminator
Level 1, UtteranceContent (24 variants)
What you iterate when walking utterance.main.content.content.0:
| Category | Variants |
|---|---|
| Words | Word, AnnotatedWord, ReplacedWord |
| Groups | Group, AnnotatedGroup, PhoGroup, SinGroup, Quotation |
| CA markers | OverlapPoint, Separator |
| Events | Event, AnnotatedEvent, OtherSpokenEvent |
| Actions | AnnotatedAction |
| Timing | InternalBullet |
| Scope markers | LongFeatureBegin/End, NonvocalBegin/End/Simple, UnderlineBegin/End |
| Other | Freecode, Pause |
Critical rule: every match on UtteranceContent must explicitly
list all 24 variants. No _ => catch-alls. Project policy: silent data
loss when new variants are added is unacceptable.
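A minimal sketch of the rule, using a hypothetical four-variant stand-in (`MiniContent`) rather than the real 24-variant `UtteranceContent`. Because the match names every variant and has no `_` arm, adding a variant to the enum makes the compiler reject the function until the new case is handled.

```rust
/// Hypothetical stand-in for `UtteranceContent`; the real enum has 24 variants.
#[derive(Debug)]
pub enum MiniContent {
    Word(String),
    Pause,
    Event(String),
    Separator(char),
}

/// Returns true when the item contributes a word for counting purposes.
/// Every variant is listed; there is deliberately no `_ =>` arm, so a new
/// variant becomes a compile error here instead of a silently ignored item.
pub fn counts_as_word(item: &MiniContent) -> bool {
    match item {
        MiniContent::Word(_) => true,
        MiniContent::Pause => false,
        MiniContent::Event(_) => false,
        MiniContent::Separator(_) => false,
    }
}
```

The cost is a few repeated `=> false` arms; the benefit is that the compiler, not a reviewer, finds every match that needs a decision when the enum grows.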
Level 2, BracketedItem (22 variants)
Content inside groups (<...>, ‹...›, 〔...〕, "..."). Accessed via
group.content.content.0 (the double .content.content.0 is not a
typo: Group.content is BracketedContent, which has .content: BracketedItems, which has .0: Vec<BracketedItem>).
BracketedItem mirrors UtteranceContent closely. Retrace content
(<word word> [/], word [//]) is a dedicated Retrace variant at
both levels, not hidden inside AnnotatedGroup. Groups can nest
arbitrarily deep.
Level 3, WordContent (11 variants)
Content inside a single word token, accessed via word.content:
| Variant | Example |
|---|---|
| Text | plain text segment |
| Shortening | (lo) omitted sound |
| OverlapPoint | butt⌈er⌉, overlap inside a word |
| CAElement | ↑ ↓ prosody markers |
| CADelimiter | ° ∆ paired delimiters |
| StressMarker | ˈ ˌ |
| Lengthening | : |
| SyllablePause | ^ |
| CompoundMarker | + in ice+cream |
| UnderlineBegin/End | scope delimiters |
Key insight: overlap markers can appear at all three levels. They occur as
standalone UtteranceContent::OverlapPoint (space-separated:
⌈ word ⌉), as BracketedItem::OverlapPoint (inside groups), or as
WordContent::OverlapPoint (intra-word: butt⌈er⌉). Any traversal
looking for overlap markers must check all three levels.
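The three-level check can be sketched with heavily reduced stand-in types. Everything below (`WordPart`, `Bracketed`, `Content`, and the counting functions) is hypothetical and only mirrors the shape of the real model; the real enums have many more variants and the crate already ships walk_overlap_points for this job.

```rust
/// Level 3 stand-in: pieces inside one word token.
pub enum WordPart {
    Text(String),
    OverlapPoint,
}

pub struct Word {
    pub content: Vec<WordPart>,
}

/// Level 2 stand-in: items inside a group; groups may nest.
pub enum Bracketed {
    Word(Word),
    OverlapPoint,
    Group(Vec<Bracketed>),
}

/// Level 1 stand-in: top-level utterance content.
pub enum Content {
    Word(Word),
    OverlapPoint,
    Group(Vec<Bracketed>),
    Pause,
}

/// Counts intra-word markers such as the one in butt⌈er⌉.
fn in_word(word: &Word) -> usize {
    word.content
        .iter()
        .filter(|part| matches!(part, WordPart::OverlapPoint))
        .count()
}

/// Counts markers inside a group, recursing into nested groups.
fn in_bracketed(items: &[Bracketed]) -> usize {
    items
        .iter()
        .map(|item| match item {
            Bracketed::Word(word) => in_word(word),
            Bracketed::OverlapPoint => 1,
            Bracketed::Group(inner) => in_bracketed(inner),
        })
        .sum()
}

/// Counts overlap markers at all three levels.
pub fn count_overlap_points(items: &[Content]) -> usize {
    items
        .iter()
        .map(|item| match item {
            Content::Word(word) => in_word(word),
            Content::OverlapPoint => 1,
            Content::Group(inner) => in_bracketed(inner),
            Content::Pause => 0,
        })
        .sum()
}
```

A traversal that handled only the `Content::OverlapPoint` arm would report one marker for an utterance that actually contains several.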
Annotated Wrappers and Replaced Words
Annotated<T>
Adds scoped annotations ([/], [* m], [= explanation], etc.) to any
annotatable inner type:
pub struct Annotated<T> {
    pub inner: T,
    pub annotations: Vec<ContentAnnotation>,
    pub span: Span,
}
At Level 1: AnnotatedWord(Box<Annotated<Word>>),
AnnotatedGroup(Annotated<Group>),
AnnotatedEvent(Annotated<Event>),
AnnotatedAction(Annotated<Action>). Same variants exist at Level 2.
ReplacedWord
Represents word [: replacement], a surface form with a replacement:
pub struct ReplacedWord {
    pub word: Word,
    pub replacement: Replacement,
}
pub struct Replacement {
    pub words: Vec<Word>,
}
Convention when extracting words for NLP: use replacement words if
non-empty, else the surface form (Wor and Mor domains both follow this,
each with its own counts_for_tier filter).
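The extraction convention can be sketched as follows. The types are simplified stand-ins (the real `Word` carries structured content, not a plain string), `words_for_nlp` is a hypothetical helper name, and the per-domain counts_for_tier filter is left out.

```rust
/// Simplified stand-in for the model's `Word`.
#[derive(Debug, Clone, PartialEq)]
pub struct Word(pub String);

pub struct Replacement {
    pub words: Vec<Word>,
}

pub struct ReplacedWord {
    pub word: Word,
    pub replacement: Replacement,
}

/// Words to hand to an NLP consumer: the replacement words when there are
/// any, otherwise the surface form.
pub fn words_for_nlp(replaced: &ReplacedWord) -> Vec<&Word> {
    if replaced.replacement.words.is_empty() {
        vec![&replaced.word]
    } else {
        replaced.replacement.words.iter().collect()
    }
}
```

For gonna [: going to] this yields the two replacement words, so downstream tiers align against "going to" rather than the surface token.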
Tier Domains
Different NLP tasks need different views of the same content. The
TierDomain enum controls which words count for each tier and how
groups are traversed:
| Domain | Used by | Skips | Counts separators? |
|---|---|---|---|
| Mor | %mor / %gra generation | Retrace groups | Yes: , „ ‡ carry mor items (cm\|cm, end\|end, beg\|beg) |
| Wor | %wor generation, FA | Nothing | No |
| Pho | %pho alignment | PhoGroup | No |
| Sin | %sin alignment | SinGroup | No |
The content walker takes Option<TierDomain>: Some(domain) for
domain-aware gating, None to recurse everything unconditionally.
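The "Skips" column of the table can be read as a small gating function. This is a sketch only: `Container` and `should_recurse` are hypothetical names, and the real walker makes this decision inline while matching on content variants.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum TierDomain {
    Mor,
    Wor,
    Pho,
    Sin,
}

/// Hypothetical tag for the container kinds the table mentions.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Container {
    Group,
    Retrace,
    PhoGroup,
    SinGroup,
    Quotation,
}

/// True when a walker should descend into `container` for `domain`.
/// `None` means "recurse everything unconditionally".
pub fn should_recurse(container: Container, domain: Option<TierDomain>) -> bool {
    let Some(domain) = domain else {
        return true;
    };
    match domain {
        TierDomain::Mor => container != Container::Retrace,
        TierDomain::Wor => true,
        TierDomain::Pho => container != Container::PhoGroup,
        TierDomain::Sin => container != Container::SinGroup,
    }
}
```

Keeping the match exhaustive over `TierDomain` means a future domain cannot be added without deciding what it skips.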
Content Walkers
talkbank-model exports closure-based walkers. Two layers:
- walk_content: generic, visits all content items (custom traversals).
- walk_words / walk_words_mut: filtered to words / replaced words / separators, with domain-aware gating. The primary primitive.
use talkbank_model::alignment::helpers::{
    walk_words, walk_words_mut,
    WordItem, WordItemMut,
    TierDomain,
};
// Visit every word-like leaf in document order, gated for the Wor domain.
walk_words(content, Some(TierDomain::Wor), &mut |leaf| {
    match leaf {
        WordItem::Word(word) => { /* ... */ }
        WordItem::ReplacedWord(replaced) => { /* ... */ }
        WordItem::Separator(sep) => { /* ... */ }
    }
});
flowchart TD
input["&[UtteranceContent]\n+ domain: Option<TierDomain>"]
dispatch["Match variant\n(24 UtteranceContent variants)"]
word["Word → emit WordItem::Word"]
rw["ReplacedWord → emit WordItem::ReplacedWord"]
sep["Separator → emit WordItem::Separator"]
group["Group / AnnotatedGroup /\nPhoGroup / SinGroup / Quotation"]
gate{"Domain\ngating"}
skip["Skip\n(atomic unit)"]
recurse["Recurse into\ngroup.content"]
input --> dispatch
dispatch --> word & rw & sep & group
group --> gate
gate -->|"Mor: skip retraces"| skip
gate -->|"Pho/Sin: skip groups"| skip
gate -->|"None: recurse all"| recurse
recurse -->|back| dispatch
What walk_words does NOT visit
walk_words visits only words and separators. It does not visit
OverlapPoint (any level), CAElement within words, events / pauses /
actions, or internal bullets. For these, write a custom traversal; see
talkbank-model/validation/utterance/overlap.rs for the reference
pattern.
walk_overlap_points, overlap marker iterator
walk_overlap_points(content, &mut |visit| {
// visit.point.kind, visit.point.index, visit.word_position
});
Visits every OverlapPoint at all three content levels with its
word-position context. Used by the alignment pipeline (onset estimation)
and the validator (pairing checks). For region-level analysis (pairing
⌈ with ⌉ by index), use extract_overlap_info() which builds
OverlapRegion structs. For whole-file analysis,
analyze_file_overlaps() matches top regions (⌈) with bottom regions
(⌊) across utterances with 1:N support (used by E347 and
chatter debug overlap-audit).
Validation
Beyond what the grammar enforces, validate_with_alignment() checks
semantic constraints:
- %mor alignment: number of MOR items matches alignable main-tier words.
- %gra structure: sequential indices, ROOT checks, circular dependency.
- Header consistency: @ID codes match @Participants.
- Speaker references: all *SPEAKER: codes declared.
Five parallel alignment flows are computed against the main tier:
flowchart TD
main["MainTier content"]
walker["walk_words()\ncount alignable words"]
subgraph "5 Parallel Alignment Flows"
mor["%mor\ncustom logic\n(clitic handling)"]
pho["%pho\npositional_align()\n(skip PhoGroup)"]
sin["%sin\npositional_align()\n(skip SinGroup)"]
wor["%wor\npositional_align()\n(LCS diff format)"]
gra["%gra\nalign to %mor chunks\n(not main tier)"]
end
main --> walker
walker --> mor & pho & sin & wor
mor --> gra
For the alignment algorithms themselves, see Alignment.
Common Pitfalls
- “Consecutive” means in-order traversal, not adjacent array indices. When CHAT tools speak of “consecutive” or “sequential” items on the main tier, this always means document order via recursive traversal, accounting for groups (<...>), retrace groups (<...> [/]), quotations ("..."), and all other bracketed structures. Never check adjacency in the flat Vec<UtteranceContent>; use walk_words or an equivalent in-order traversal.
- Missing intra-word content. Overlap markers, CA elements, and other markers can appear inside Word.content. Checking only UtteranceContent::OverlapPoint misses WordContent::OverlapPoint (e.g., butt⌈er⌉, a⌈nd).
- Missing annotated variants. UtteranceContent::AnnotatedWord and AnnotatedGroup wrap inner types in Annotated<T> and are easy to forget.
- BracketedContent access. Group.content → BracketedContent, with .content: BracketedItems, with .0: Vec<BracketedItem>.
- Separator counter sync (Mor domain). Tag-marker separators (, „ ‡) count as NLP words because they have %mor items. Any code counting words in the Mor domain must count these separators too.
Serialization
- CHAT: WriteChat trait writes any model type back to CHAT format.
- JSON: all model types implement Serialize / Deserialize. Format per the JSON Schema.
- JSON Schema: derived via JsonSchema. Run cargo test --test generate_schema to regenerate schema/chat-file.schema.json.
Memory and Interning
String-heavy types (PosCategory, MorStem, MorFeature) use
Arc<str> with a global interner, which yields significant memory savings on large
corpora where the same POS tags and lemmas appear thousands of times.
Collections that are typically small use SmallVec for inline storage:
- SmallVec<[MorFeature; 4]>: features per word (usually 0-4).
- SmallVec<[MorWord; 2]>: post-clitics (usually 0-1).
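To illustrate why interning saves memory, here is a minimal std-only sketch of a global Arc<str> pool. The names `POOL` and `intern` are hypothetical, and the crate's real interner may be organized quite differently (sharding, a dedicated crate, per-type pools); the point is only that repeated strings share one allocation.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex, OnceLock};

/// Process-wide pool of interned strings (hypothetical layout).
static POOL: OnceLock<Mutex<HashSet<Arc<str>>>> = OnceLock::new();

/// Returns a shared `Arc<str>` for `text`, allocating only the first time
/// a given string is seen.
pub fn intern(text: &str) -> Arc<str> {
    let pool = POOL.get_or_init(|| Mutex::new(HashSet::new()));
    let mut guard = pool.lock().expect("interner mutex poisoned");
    if let Some(existing) = guard.get(text) {
        // Cloning an Arc bumps a reference count; no string data is copied.
        return Arc::clone(existing);
    }
    let fresh: Arc<str> = Arc::from(text);
    guard.insert(Arc::clone(&fresh));
    fresh
}
```

Two calls with the same POS tag return pointers to the same allocation, so a corpus with thousands of identical tags pays for the text once plus one pointer per use.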
Transform Pipeline
Status: Current Last updated: 2026-06-14 19:57 EDT
The talkbank-transform crate provides high-level pipelines that compose parsing, validation, and serialization into reusable workflows.
Core Pipelines
Parse + Validate
The most common pipeline: parse a CHAT file and validate it.
use talkbank_transform::parse_and_validate;
let result = parse_and_validate(source, &parser, &error_collector);
This:
- Parses the source text into a ChatFile AST
- Runs validation (alignment checks, header consistency, etc.)
- Collects all errors and warnings into the ErrorSink
CHAT → JSON
Convert a CHAT file to its JSON representation:
use talkbank_transform::chat_to_json;
let json = chat_to_json(source, &parser)?;
The JSON follows the schema at schema/chat-file.schema.json.
JSON → CHAT
The JSON produced by chat_to_json is schema-conformant and
round-trips. Deserialize it back into a ChatFile with serde_json
(the model derives Deserialize), then serialize through WriteChat
to reproduce CHAT text:
let chat_file: talkbank_model::ChatFile = serde_json::from_str(json_str)?;
let chat_text = chat_file.to_chat_string();
The chatter from-json command wraps this path
(crates/chatter/src/commands/json.rs, json_to_chat).
CHAT → CHAT (Normalize)
Parse and reserialize to normalize formatting:
use talkbank_transform::normalize_chat;
let normalized = normalize_chat(source, &parser)?;
normalize_chat lives in
crates/talkbank-transform/src/pipeline/convert.rs.
Validation + Roundtrip Cache Lifecycle
The following diagram shows the full validation and roundtrip pipeline, including the cache layer:
flowchart TD
file["CHAT file"]
cache{"Cache\nhit?"}
parse["Parse\n(tree-sitter → AST)"]
validate["Validate\n(per-file → per-utterance →\nmain tier → dependent tiers)"]
rt{"Roundtrip\nflag?"}
ser1["Serialize → CHAT text"]
reparse["Reparse CHAT text"]
ser2["Serialize again"]
cmp{"Two\nserializations\nmatch?"}
store["Store in cache\n(SQLite)"]
pass["Pass"]
fail["Fail"]
cached["Return cached result"]
file --> cache
cache -->|miss| parse --> validate --> rt
cache -->|hit| cached
rt -->|yes| ser1 --> reparse --> ser2 --> cmp
rt -->|no| store --> pass
cmp -->|yes| store
cmp -->|no| fail
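The roundtrip branch of the diagram (serialize, reparse, serialize again, compare) can be sketched generically. `check_roundtrip` and `Roundtrip` are hypothetical names, and the `parse` / `serialize` closures stand in for the real parser and the WriteChat path; caching is omitted.

```rust
/// Outcome of the roundtrip check.
#[derive(Debug, PartialEq, Eq)]
pub enum Roundtrip {
    Stable,
    Unstable { first: String, second: String },
}

/// Serializes the parsed source, reparses that text, serializes again, and
/// compares the two serializations. The original source text is not part
/// of the comparison, so harmless normalization does not count as failure.
pub fn check_roundtrip<T, E>(
    source: &str,
    parse: impl Fn(&str) -> Result<T, E>,
    serialize: impl Fn(&T) -> String,
) -> Result<Roundtrip, E> {
    let first = serialize(&parse(source)?);
    let second = serialize(&parse(&first)?);
    if first == second {
        Ok(Roundtrip::Stable)
    } else {
        Ok(Roundtrip::Unstable { first, second })
    }
}
```

Comparing the two serializations rather than source against output is what lets a normalizing serializer pass: the first pass may change whitespace, but the second pass must then be a fixed point.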
Streaming Parse
For large files or interactive use, the transform crate supports streaming parse where utterances are processed incrementally rather than loading the entire AST into memory.
Caching
The transform layer integrates with a file-system cache. Validation results are keyed by content hash, so unchanged files skip re-validation. Cache location is platform-specific: ~/Library/Caches/talkbank-chat/ (macOS), ~/.cache/talkbank-chat/ (Linux), %LocalAppData%\talkbank-chat\ (Windows).
Use --force to bypass the cache for specific paths.
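A content-hash cache with a force bypass can be sketched in memory as below. `ValidationCache`, `content_key`, and `validate_with` are hypothetical names, the cached value is reduced to an error count, and `DefaultHasher` is used only to keep the example std-only; a persistent cache would need a hash with a fixed, documented algorithm.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// In-memory stand-in for the on-disk cache: content hash -> error count.
#[derive(Default)]
pub struct ValidationCache {
    entries: HashMap<u64, usize>,
    /// Number of times the validator actually ran.
    pub misses: usize,
}

/// Hashes the file content, not its path, so renames still hit the cache.
fn content_key(source: &str) -> u64 {
    // DefaultHasher's algorithm is not guaranteed stable across Rust
    // releases, which is fine for this in-process sketch only.
    let mut hasher = DefaultHasher::new();
    source.hash(&mut hasher);
    hasher.finish()
}

impl ValidationCache {
    /// Returns the cached result for `source`, running `validate` only on a
    /// miss or when `force` is set (mirroring `--force`).
    pub fn validate_with(
        &mut self,
        source: &str,
        force: bool,
        validate: impl FnOnce(&str) -> usize,
    ) -> usize {
        let key = content_key(source);
        if !force {
            if let Some(&cached) = self.entries.get(&key) {
                return cached;
            }
        }
        self.misses += 1;
        let result = validate(source);
        self.entries.insert(key, result);
        result
    }
}
```

Because the key is derived from content, an unchanged file skips validation, while a forced run both re-validates and refreshes the stored entry.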
Error Collection
Pipelines use the ErrorSink trait for error reporting. Callers can provide:
- A collecting sink (gathers all diagnostics for batch output)
- A printing sink (writes diagnostics to stderr in real-time)
- A custom sink (for LSP diagnostics, JSON output, etc.)
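The sink choices above can be sketched with a small trait. Everything here is an assumption for illustration: the real ErrorSink trait, its method name, and its diagnostic type are not shown in this chapter, and the code "X000" is made up.

```rust
use std::cell::RefCell;

/// Hypothetical diagnostic shape; the real type carries spans and codes.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Diagnostic {
    pub code: String,
    pub message: String,
}

/// Hypothetical sink trait. `report` takes `&self` so one sink can be
/// shared by several pipeline stages.
pub trait ErrorSink {
    fn report(&self, diagnostic: Diagnostic);
}

/// Collecting sink: buffers everything for batch output.
#[derive(Default)]
pub struct CollectingSink {
    items: RefCell<Vec<Diagnostic>>,
}

impl CollectingSink {
    pub fn into_vec(self) -> Vec<Diagnostic> {
        self.items.into_inner()
    }
}

impl ErrorSink for CollectingSink {
    fn report(&self, diagnostic: Diagnostic) {
        self.items.borrow_mut().push(diagnostic);
    }
}

/// Printing sink: writes each diagnostic to stderr as it arrives.
pub struct PrintingSink;

impl ErrorSink for PrintingSink {
    fn report(&self, diagnostic: Diagnostic) {
        eprintln!("{}: {}", diagnostic.code, diagnostic.message);
    }
}

/// A pipeline stage sees only the trait, never the concrete sink.
pub fn check_not_empty(source: &str, sink: &dyn ErrorSink) {
    if source.trim().is_empty() {
        sink.report(Diagnostic {
            code: "X000".to_string(), // made-up code for the sketch
            message: "empty input".to_string(),
        });
    }
}
```

An LSP or JSON sink would be a third implementation of the same trait, which is why pipeline code does not need to know where diagnostics end up.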
Merge Pipeline, Domain Types
Status: Draft Last modified: 2026-05-29 18:43 EDT
This page specifies the typed Rust vocabulary shared by chatter merge,
chatter speaker-id, the override-file reader/writer, and any future
adjudication tooling (CLI, VS Code, web). Documenting these types
before writing the implementing code is deliberate: the types are
the spec, and they need to be designed against the user contract in
chatter merge and
chatter speaker-id without
being inferred from prototype code.
The design follows the cross-cutting rules in this repo’s root
CLAUDE.md:
newtypes over primitives at every stable boundary; no boolean
blindness; no tuple-packed seams; typed errors via thiserror;
deterministic BTreeMap/BTreeSet over hash maps for
serialized state.
Where the types live
All new types live in talkbank-model::merge. Rationale:
- Existing CHAT-domain types (SpeakerCode, ParticipantRole, ParticipantEntry, IDHeader, ChatFile) already live in talkbank-model; the new merge-pipeline types reference them pervasively and benefit from being co-located.
- Consumers outside talkbank-transform (a future override-file reader in a small CLI, an adjudication UI, an orchestrator script’s Rust port) want the types without pulling in the tree-sitter parser, the DP-aligner, etc. talkbank-model is the lightweight type-and-validation crate that fits.
- If talkbank-model::merge grows past the file-size budget (≤400 lines per file, ≤800 hard) we split into submodules (merge::override_file, merge::scoring, etc.), same crate. Hoisting to a separate talkbank-merge-types crate is a future option but not pre-emptively warranted.
Existing types reused (not redefined)
| Type | Defined in | Used as |
|---|---|---|
| SpeakerCode | talkbank-model::model::header::codes::speaker | Identifier for *<CODE>: speakers, dictionary keys in mappings, --retain set elements |
| ParticipantRole | talkbank-model::model::header::codes::participant | Role-tag in @Participants and @ID (Target_Child, Investigator, Mother, etc.) |
| ParticipantName | talkbank-model::model::header::codes::participant | Optional participant name in @Participants |
| ParticipantEntry | talkbank-model::model::header::codes::participant | Single @Participants row |
| IDHeader | talkbank-model::model::header::id | Single @ID row |
| ChatFile<S> | talkbank-model::model::file::chat_file::core | The merge stages’ inputs and outputs (parameter S: ValidationState) |
None of these are redefined; the merge module imports and references them.
New types (specification)
JaccardScore
A multiset-Jaccard similarity value, by construction in the closed
range [0.0, 1.0].
/// Multiset Jaccard similarity between two bags of tokens.
///
/// By construction in [0.0, 1.0]. `JaccardScore::zero()` is the
/// no-overlap point; `JaccardScore::one()` is identical-bag.
///
/// Used by the speaker-id stage to score how well each donor
/// speaker matches a reference anchor's content.
#[derive(Clone, Copy, Debug, PartialEq, PartialOrd, Serialize, Deserialize, JsonSchema)]
#[serde(try_from = "f64", into = "f64")]
pub struct JaccardScore(f64);
impl JaccardScore {
pub fn new(v: f64) -> Result<Self, JaccardScoreError>;
pub fn zero() -> Self;
pub fn one() -> Self;
pub fn value(self) -> f64;
}
impl Display for JaccardScore { /* "0.735" three-digit */ }
impl TryFrom<f64> for JaccardScore { /* validates range */ }
impl From<JaccardScore> for f64 { /* infallible widen */ }
Construction is fallible: JaccardScore::new(1.5) returns
Err(JaccardScoreError::OutOfRange(1.5)). NaN is also rejected.
Internal computation that’s guaranteed in-range by construction
(the multiset formula) uses an internal from_unchecked private
constructor; public API is fallible.
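The fallible-construction contract can be sketched without the serde and schema derives. This is an illustration, not the crate's implementation: the `NotANumber` variant name is an assumption (the specification only says NaN is rejected), and the private from_unchecked constructor is omitted.

```rust
use std::convert::TryFrom;
use std::fmt;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum JaccardScoreError {
    OutOfRange(f64),
    /// Assumed variant name for the NaN case.
    NotANumber,
}

#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
pub struct JaccardScore(f64);

impl JaccardScore {
    /// Accepts only finite values in the closed range [0.0, 1.0].
    pub fn new(v: f64) -> Result<Self, JaccardScoreError> {
        if v.is_nan() {
            Err(JaccardScoreError::NotANumber)
        } else if (0.0..=1.0).contains(&v) {
            Ok(Self(v))
        } else {
            Err(JaccardScoreError::OutOfRange(v))
        }
    }

    pub fn zero() -> Self {
        Self(0.0)
    }

    pub fn one() -> Self {
        Self(1.0)
    }

    pub fn value(self) -> f64 {
        self.0
    }
}

impl TryFrom<f64> for JaccardScore {
    type Error = JaccardScoreError;

    fn try_from(v: f64) -> Result<Self, Self::Error> {
        Self::new(v)
    }
}

impl From<JaccardScore> for f64 {
    fn from(score: JaccardScore) -> f64 {
        score.0
    }
}

impl fmt::Display for JaccardScore {
    /// Three decimal places, as in the "0.735" example.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{:.3}", self.0)
    }
}
```

Routing `TryFrom<f64>` through `new` is what makes `#[serde(try_from = "f64")]` enforce the same range check on deserialized values as on values built in code.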
ConfidenceThreshold
The minimum Jaccard margin (winner / loser) the speaker-id stage
will auto-accept. By construction in [1.0, ∞): a threshold below
1.0 makes no sense, because it would mean the loser scores higher
than the winner, which can’t happen. Default 2.0 per the empirical
calibration recorded in
chatter speaker-id.
#[derive(Clone, Copy, Debug, PartialEq, PartialOrd, Serialize, Deserialize, JsonSchema)]
#[serde(try_from = "f64", into = "f64")]
pub struct ConfidenceThreshold(f64);
impl ConfidenceThreshold {
pub const DEFAULT: Self = Self(2.0);
pub fn new(v: f64) -> Result<Self, ConfidenceThresholdError>;
pub fn value(self) -> f64;
}
impl Default for ConfidenceThreshold {
fn default() -> Self { Self::DEFAULT }
}
Margin
The decisive ratio between the highest-scoring speaker and the
runner-up. Distinguished from ConfidenceThreshold by intent
(this is observed; the threshold is configured) and from
JaccardScore by range (margin is ≥ 1.0; score is ≤ 1.0).
Uses an enum rather than a bare float to model the
divide-by-zero case (runner-up has zero Jaccard) cleanly. Avoids
the f64::INFINITY sentinel that doesn’t round-trip through
all serializers.
/// Ratio of winning speaker's score to runner-up's score.
///
/// `Finite(r)` for `r >= 1.0`. `Unbounded` when the runner-up
/// has zero score (winner scored anything, runner-up scored
/// nothing). Compares meaningfully against `ConfidenceThreshold`
/// regardless of variant.
#[derive(Clone, Copy, Debug, PartialEq, Serialize, Deserialize, JsonSchema)]
#[serde(untagged)]
pub enum Margin {
    Finite(f64),
    /// Must serialize as the JSON/TOML string "unbounded"; never as
    /// f64::INFINITY (which round-trips inconsistently). Plain
    /// `#[serde(untagged)]` emits a unit variant as unit (JSON
    /// `null`), so the string form needs custom serde handling.
    Unbounded,
}
impl Margin {
    pub fn from_scores(winner: JaccardScore, loser: JaccardScore) -> Self;
    pub fn meets(self, threshold: ConfidenceThreshold) -> bool;
}
impl Display for Margin { /* "3.81x" or "∞" */ }
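A std-only sketch of the two Margin methods follows. The newtypes are reduced to unvalidated stand-ins, serde is left out, and two points are assumptions rather than specified behavior: `meets` uses `>=`, and a zero runner-up maps to `Unbounded` even when the winner is also zero.

```rust
use std::fmt;

/// Stand-ins for the validated newtypes specified above (validation omitted).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct JaccardScore(pub f64);

#[derive(Clone, Copy, Debug, PartialEq)]
pub struct ConfidenceThreshold(pub f64);

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Margin {
    Finite(f64),
    Unbounded,
}

impl Margin {
    /// Assumes `winner >= loser`. A zero runner-up becomes `Unbounded`
    /// instead of dividing by zero.
    pub fn from_scores(winner: JaccardScore, loser: JaccardScore) -> Self {
        if loser.0 == 0.0 {
            Margin::Unbounded
        } else {
            Margin::Finite(winner.0 / loser.0)
        }
    }

    /// An unbounded margin clears every finite threshold.
    pub fn meets(self, threshold: ConfidenceThreshold) -> bool {
        match self {
            Margin::Finite(ratio) => ratio >= threshold.0,
            Margin::Unbounded => true,
        }
    }
}

impl fmt::Display for Margin {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Margin::Finite(ratio) => write!(f, "{ratio:.2}x"),
            Margin::Unbounded => write!(f, "∞"),
        }
    }
}
```

Handling the zero denominator in one constructor keeps every caller free of `is_infinite()` checks, which is the practical payoff of the enum over an f64 sentinel.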
RetainSet
The set of speaker codes specified by --retain on chatter merge.
A BTreeSet<SpeakerCode> wrapped in a newtype so the type
signatures of merge functions communicate intent. Empty is
allowed (means “no speakers come from File 1; File 1 contributes
only headers”, a degenerate but legal case).
/// Speakers whose utterances come from the first input to
/// `chatter merge`. All other speakers come from the second
/// input.
#[derive(Clone, Debug, Default, PartialEq, Eq, Serialize, Deserialize, JsonSchema)]
pub struct RetainSet(BTreeSet<SpeakerCode>);
impl RetainSet {
pub fn new() -> Self;
pub fn from_iter<I: IntoIterator<Item = SpeakerCode>>(it: I) -> Self;
pub fn contains(&self, code: &SpeakerCode) -> bool;
pub fn iter(&self) -> impl Iterator<Item = &SpeakerCode>;
pub fn is_empty(&self) -> bool;
}
impl FromStr for RetainSet {
type Err = RetainSetParseError;
/// Parses `"CHI,SI2"` → `{CHI, SI2}`. Empty entries rejected.
fn from_str(s: &str) -> Result<Self, Self::Err>;
}
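The `"CHI,SI2"` parse can be sketched as below. `SpeakerCode` is a plain-string stand-in, the `EmptyEntry` variant and its `position` field are assumed names, and this version also rejects the empty string (the specification allows an empty set but does not say whether `""` should parse to one).

```rust
use std::collections::BTreeSet;
use std::str::FromStr;

/// Stand-in for the real `SpeakerCode` newtype (its validation is omitted).
#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct SpeakerCode(pub String);

#[derive(Debug, PartialEq, Eq)]
pub enum RetainSetParseError {
    /// An entry between commas was empty, e.g. "CHI,,SI2".
    EmptyEntry { position: usize },
}

#[derive(Clone, Debug, Default, PartialEq, Eq)]
pub struct RetainSet(BTreeSet<SpeakerCode>);

impl RetainSet {
    pub fn contains(&self, code: &SpeakerCode) -> bool {
        self.0.contains(code)
    }

    pub fn len(&self) -> usize {
        self.0.len()
    }
}

impl FromStr for RetainSet {
    type Err = RetainSetParseError;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let mut codes = BTreeSet::new();
        for (position, raw) in s.split(',').enumerate() {
            let entry = raw.trim();
            if entry.is_empty() {
                return Err(RetainSetParseError::EmptyEntry { position });
            }
            // Duplicates collapse silently because the backing store is a set.
            codes.insert(SpeakerCode(entry.to_string()));
        }
        Ok(RetainSet(codes))
    }
}
```

Implementing `FromStr` lets a CLI argument parser call `.parse::<RetainSet>()` directly, so the command code never handles raw comma-separated strings.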
InsertedRole
The CHAT code + role-tag pair to assign to renamed speakers in
the speaker-id stage. A struct rather than two function arguments
because the pair is meaningful as a unit (in TOML override files
it serializes as a nested table; in CLI it parses as CODE:TAG).
/// The CHAT identity to assign to non-anchor speakers in the
/// speaker-id stage. Example: `INV:Investigator`, `MOT:Mother`.
#[derive(Clone, Debug, PartialEq, Eq, Serialize, Deserialize, JsonSchema)]
pub struct InsertedRole {
pub code: SpeakerCode,
pub tag: ParticipantRole,
}
impl InsertedRole {
pub fn investigator() -> Self; // INV:Investigator
pub fn mother() -> Self; // MOT:Mother
pub fn father() -> Self; // FAT:Father
pub fn adult() -> Self; // PAR:Adult
}
impl FromStr for InsertedRole {
type Err = InsertedRoleParseError;
/// Parses `"INV:Investigator"`. Both halves required.
fn from_str(s: &str) -> Result<Self, Self::Err>;
}
impl Display for InsertedRole { /* "INV:Investigator" */ }
The convenience constructors (investigator(), mother(), etc.)
are the closed-set anchor points; arbitrary
InsertedRole { code, tag } is also allowed for contributor-specific
roles.
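A sketch of the CODE:TAG parse and its Display inverse. Plain strings stand in for SpeakerCode and ParticipantRole, and the three error variant names are assumptions; the specification only requires that both halves be present.

```rust
use std::fmt;
use std::str::FromStr;

/// Stand-in: plain strings replace `SpeakerCode` and `ParticipantRole`.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct InsertedRole {
    pub code: String,
    pub tag: String,
}

#[derive(Debug, PartialEq, Eq)]
pub enum InsertedRoleParseError {
    MissingColon,
    EmptyCode,
    EmptyTag,
}

impl FromStr for InsertedRole {
    type Err = InsertedRoleParseError;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Split on the first colon only; anything after it belongs to the tag.
        let (code, tag) = s
            .split_once(':')
            .ok_or(InsertedRoleParseError::MissingColon)?;
        if code.is_empty() {
            return Err(InsertedRoleParseError::EmptyCode);
        }
        if tag.is_empty() {
            return Err(InsertedRoleParseError::EmptyTag);
        }
        Ok(InsertedRole {
            code: code.to_string(),
            tag: tag.to_string(),
        })
    }
}

impl fmt::Display for InsertedRole {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}:{}", self.code, self.tag)
    }
}
```

Keeping `Display` the exact inverse of `FromStr` means a role printed in a diagnostic can be pasted back onto the command line unchanged.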
MappingAction
What happens to a particular speaker in the input under a
SpeakerMapping. Enum (not boolean) to avoid blindness and to
leave room for future variants (e.g. RenameTo { code, tag }
when multi-role renaming becomes a need).
/// Action to apply to one speaker in a SpeakerMapping.
#[derive(Clone, Debug, PartialEq, Eq, Serialize, Deserialize, JsonSchema)]
#[serde(rename_all = "lowercase")]
pub enum MappingAction {
/// Remove this speaker's utterances and its @Participants /
/// @ID rows entirely.
Drop,
/// Rename this speaker to the mapping's `inserted_role.code`.
/// Rewrites speaker codes on every utterance and the
/// corresponding @Participants and @ID entries.
Rename,
}
The TOML serialization uses "drop" / "rename" lowercase
strings, matching the override-file format documented in
speaker-id.md.
SpeakerMapping
The decision record produced by the speaker-id stage and
consumed by the speaker-id apply step. Carries enough information
to apply deterministically to a ChatFile.
/// A decision about how to relabel a ChatFile's speakers.
///
/// Produced by `identify_mapping` (reference mode, auto), by the
/// `--mapping` flag parser (explicit mode), or by reading an
/// override-file entry (override mode). Consumed by `apply_mapping`,
/// which rewrites a ChatFile per the assignments.
///
/// All speakers in the input must appear as keys in `assignments`;
/// there is no defaulting. This is a precondition checked at apply
/// time and is intentional (we want every decision to be explicit).
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize, JsonSchema)]
pub struct SpeakerMapping {
    /// The CHAT identity assigned to every speaker whose action
    /// is `MappingAction::Rename`. All renamed speakers go to
    /// the same role in v1 of this schema.
    pub inserted_role: InsertedRole,
    /// Per-speaker action. Use BTreeMap for deterministic
    /// serialization order.
    pub assignments: BTreeMap<SpeakerCode, MappingAction>,
}
The “single inserted_role across all renamed speakers” constraint
matches the doc and keeps the most-common case clean. Future
multi-role-rename use cases (a 3-speaker file where two get
different roles) extend MappingAction with a RenameTo variant
rather than changing this struct’s shape.
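The "every speaker must be a key" precondition can be sketched as a coverage check run before applying a mapping. `check_coverage` and `CoverageError` are hypothetical names, speaker codes are plain strings here, and the two error cases mirror the SpeakerNotInMapping and MappingSpeakerNotInInput variants of SpeakerIdError.

```rust
use std::collections::{BTreeMap, BTreeSet};

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum MappingAction {
    Drop,
    Rename,
}

#[derive(Debug, PartialEq, Eq)]
pub enum CoverageError {
    SpeakerNotInMapping { speaker: String },
    MappingSpeakerNotInInput { speaker: String },
}

/// Checks that `assignments` covers exactly the speakers in the input:
/// no input speaker without a decision, no decision for an absent speaker.
pub fn check_coverage(
    input_speakers: &BTreeSet<String>,
    assignments: &BTreeMap<String, MappingAction>,
) -> Result<(), CoverageError> {
    for speaker in input_speakers {
        if !assignments.contains_key(speaker) {
            return Err(CoverageError::SpeakerNotInMapping {
                speaker: speaker.clone(),
            });
        }
    }
    for speaker in assignments.keys() {
        if !input_speakers.contains(speaker) {
            return Err(CoverageError::MappingSpeakerNotInInput {
                speaker: speaker.clone(),
            });
        }
    }
    Ok(())
}
```

Because both collections are ordered, the first error reported is always the alphabetically first offending speaker, which keeps diagnostics reproducible across runs.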
DecisionMode
How a MergeOverride entry came to exist. Three variants matching
the three speaker-id operation modes.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Serialize, Deserialize, JsonSchema)]
#[serde(rename_all = "lowercase")]
pub enum DecisionMode {
/// Reference-mode auto-decide with Jaccard above threshold.
Auto,
/// Operator supplied --mapping directly on a one-off run.
Explicit,
/// Read from a prior override-file entry; this is a replay.
Override,
}
MergeFlag
Extensible operator-supplied flags on an override entry. Closed
variants for known cases plus a Custom(String) escape hatch.
#[derive(Clone, Debug, PartialEq, Eq, Serialize, Deserialize, JsonSchema)]
#[serde(rename_all = "kebab-case")]
pub enum MergeFlag {
/// ASR diarization mixed multiple real-world roles into one
/// speaker label. The rename may still be the best available
/// approximation but the output is imperfect.
DiarizationMixed,
/// The operator could not confidently determine which speaker
/// is which; mapping is best-guess.
BestGuess,
/// Open variant for contributor-specific flag vocabulary.
/// Serializes as the inner string verbatim.
#[serde(untagged)]
Custom(String),
}
OperatorId
Who made the decision. String newtype.
string_newtype!(
/// Identifier of the operator who created an override entry.
/// Free-form; typically a username or initials. Recorded as
/// audit trail.
pub struct OperatorId;
);
SessionId
Identifies an entry within an override file. Typically the
basename stem of the input CHAT file, but the override-file
schema doesn’t constrain its shape; contributors may use any
stable identifier they like (<participant>-<timepoint>,
<recording-id>, etc.).
string_newtype!(
/// Identifies a session within an override file. Free-form
/// stable string; typically the CHAT-file basename stem.
pub struct SessionId;
);
MergeOverride
A single per-session decision record. The unit of operator adjudication.
/// One per-session decision in an override file.
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize, JsonSchema)]
pub struct MergeOverride {
pub mode: DecisionMode,
pub mapping: SpeakerMapping,
/// Per-speaker Jaccard scores recorded for audit. Present
/// when the entry was produced by reference mode or by an
/// explicit mode that followed a reference attempt.
#[serde(skip_serializing_if = "BTreeMap::is_empty", default)]
pub scores: BTreeMap<SpeakerCode, JaccardScore>,
/// The decisive margin, if available.
#[serde(skip_serializing_if = "Option::is_none", default)]
pub margin: Option<Margin>,
pub operator: OperatorId,
pub decided_at: DateTime<Utc>,
/// Operator note. Highly recommended for non-auto decisions.
#[serde(skip_serializing_if = "Option::is_none", default)]
pub note: Option<String>,
/// Flags marking unusual situations.
#[serde(skip_serializing_if = "Vec::is_empty", default)]
pub flags: Vec<MergeFlag>,
}
The struct embeds the timestamp via chrono::DateTime<Utc>; serde
serializes to RFC 3339 (2026-05-27T08:41:00Z) by default. TOML
preserves this format faithfully.
OverrideFile
The top-level container. Holds schema version + per-session entries. Read from / written to disk as TOML.
/// Top-level override-file container.
#[derive(Clone, Debug, Default, PartialEq, Serialize, Deserialize, JsonSchema)]
pub struct OverrideFile {
/// Schema version. Currently 1. Reader refuses unknown
/// versions with a typed error rather than guessing.
pub schema_version: u32,
/// Per-session entries. BTreeMap for deterministic
/// on-disk ordering.
#[serde(flatten)]
pub entries: BTreeMap<SessionId, MergeOverride>,
}
impl OverrideFile {
pub const CURRENT_SCHEMA_VERSION: u32 = 1;
/// Read an override file from a path. Refuses unknown
/// schema versions.
pub fn read(path: &Path) -> Result<Self, OverrideFileError>;
/// Write the override file to a path, replacing the file
/// atomically.
pub fn write(&self, path: &Path) -> Result<(), OverrideFileError>;
/// Read an override file if it exists, else return an
/// empty file at the current schema version. Used by the
/// `--write-override` append flow.
pub fn read_or_default(path: &Path) -> Result<Self, OverrideFileError>;
pub fn get(&self, id: &SessionId) -> Option<&MergeOverride>;
pub fn insert(&mut self, id: SessionId, entry: MergeOverride);
}
The #[serde(flatten)] on entries means the on-disk TOML is
flat tables keyed by session ID (as shown in the
speaker-id.md schema):
schema_version = 1
[NF203-2]
mode = "auto"
# ...
rather than nested under an [entries] table.
Error types
Three thiserror-based enums covering the merge pipeline’s failure
modes. Each variant carries enough information for the CLI to
produce a useful diagnostic and for callers to pattern-match on
the failure.
SpeakerIdError
#[derive(Debug, thiserror::Error)]
pub enum SpeakerIdError {
#[error("reference file has no utterances for anchor speaker {anchor}")]
AnchorMissingInReference { anchor: SpeakerCode },
#[error("input has only {n} distinct speakers; speaker-id requires at least 2")]
InsufficientSpeakers { n: usize },
#[error("Jaccard margin {margin} is below confidence threshold {threshold}; scores={scores:?}")]
LowConfidence {
scores: BTreeMap<SpeakerCode, JaccardScore>,
threshold: ConfidenceThreshold,
margin: Margin,
},
#[error("speaker {speaker} present in input but not covered by --mapping")]
SpeakerNotInMapping { speaker: SpeakerCode },
#[error("--mapping references speaker {speaker} not present in input")]
MappingSpeakerNotInInput { speaker: SpeakerCode },
#[error("override file has no entry for session {session}")]
OverrideEntryMissing { session: SessionId },
#[error("parse error reading input: {0}")]
Parse(#[from] talkbank_parser::ParseError),
#[error("override file I/O: {0}")]
OverrideIo(#[from] OverrideFileError),
}
The LowConfidence variant is the only “soft” failure: the
caller (CLI) maps it to exit code 4 and prints the scores.
Every other variant maps to exit code 1 or 2 per the user-guide
contract.
MergeError
#[derive(Debug, thiserror::Error)]
pub enum MergeError {
#[error("File 1 declares no utterances for retain set {retain:?}")]
RetainSpeakersMissing { retain: RetainSet },
#[error("File 1 has no time-bulleted utterances; cannot merge against a shared timeline")]
NoTimelineInFile1,
#[error("File 1 @Languages = {file1}, File 2 @Languages = {file2}; merge requires matching language")]
LanguageMismatch {
file1: LanguageCode,
file2: LanguageCode,
},
#[error("speaker {speaker} appears in both files but is not in --retain; specify --retain to disambiguate")]
AmbiguousSpeaker { speaker: SpeakerCode },
#[error("parse error: {0}")]
Parse(#[from] talkbank_parser::ParseError),
}
OverrideFileError
Independent enum because override-file I/O is also called by non-speaker-id code paths (the orchestrator, future adjudication UIs).
#[derive(Debug, thiserror::Error)]
pub enum OverrideFileError {
#[error("override file not found at {path}")]
NotFound { path: PathBuf },
#[error("override file at {path} has schema_version={found}, this binary supports {supported}")]
UnsupportedSchemaVersion {
path: PathBuf,
found: u32,
supported: u32,
},
#[error("override file at {path} failed to parse: {source}")]
Parse {
path: PathBuf,
#[source]
source: toml::de::Error,
},
#[error("override file at {path} failed to write: {source}")]
Write {
path: PathBuf,
#[source]
source: std::io::Error,
},
#[error("I/O reading override file at {path}: {source}")]
Io {
path: PathBuf,
#[source]
source: std::io::Error,
},
}
Module layout
talkbank-model/src/merge/
    mod.rs: pub re-exports
    scoring.rs: JaccardScore, ConfidenceThreshold, Margin
    role.rs: InsertedRole, MappingAction
    mapping.rs: SpeakerMapping
    retain.rs: RetainSet
    override_file.rs: DecisionMode, MergeFlag, OperatorId,
        SessionId, MergeOverride, OverrideFile
    errors.rs: SpeakerIdError, MergeError, OverrideFileError
Each file aims for the ≤400-line target; if any grows we split
further (override_file/ becomes a directory with separate
files for the schema, the I/O, and the version-migration
logic).
Type design rules followed
A spot-check against the cross-cutting design rules in this repo’s
root CLAUDE.md:
- Newtypes over primitives. Every numeric domain value (JaccardScore, ConfidenceThreshold, Margin) is wrapped; every string domain value (SessionId, OperatorId, SpeakerCode, ParticipantRole) is wrapped or reused from existing wrappers. ✓
- No tuple-packed seams. InsertedRole is a struct, not (SpeakerCode, ParticipantRole). MergeOverride likewise. ✓
- No boolean blindness. MappingAction, DecisionMode, MergeFlag are enums, not bools. Margin::Finite/Unbounded is an enum, not Option<f64> or f64::INFINITY. ✓
- Typed errors. Three thiserror enums with named-field variants carrying full context. ✓
- Deterministic seams. BTreeMap / BTreeSet for every serialized collection. ✓
- Module browseability. Six files in merge/ besides mod.rs, each scoped to one concern. ✓
- Default impls present where meaningful. ConfidenceThreshold::DEFAULT = 2.0; OverrideFile::default() for the empty-file case. ✓
- Display impls present where user-visible. JaccardScore, Margin, InsertedRole. ✓
- FromStr parsers at CLI boundary, not regex hacks in command code. RetainSet::from_str, InsertedRole::from_str, and a parse_mapping_spec helper for --mapping. ✓
Decisions on the seven open questions
Resolved 2026-05-27, captured here so implementers don’t re-litigate.
1. JaccardScore representation: f64
Multiset Jaccard J(A, B) = sum_w min(A[w], B[w]) / sum_w max(A[w], B[w])
is computed from u64 token counts, which fit in f64’s 53-bit
mantissa for any plausible CHAT bag-of-words. The division is
inexact in general but IEEE 754 makes it bit-deterministic given
the same inputs across every platform that implements 754 (all of
ours: Windows, macOS, Linux, x86_64, arm64).
The bit-deterministic reproducibility property is load-bearing
because the override-file audit trail records scores; a researcher
re-running speaker-id years later on the same inputs must compute
the same score to verify the decision. f64 arithmetic provides
this for free given workspace platform constraints. Document the
property in the type’s rustdoc.
A rational u64/u64 representation was considered for “true”
reproducibility but adds boilerplate and a comparison-against-
threshold operation that loses the same precision in the end (the
threshold is a ratio too). Reject.
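The formula J(A, B) above can be written out directly over integer token counts, with a single division at the end. This is a sketch: `multiset_jaccard` and `bag` are hypothetical helper names, whitespace tokenization is a simplification, and returning 0.0 for two empty bags is an assumption the text does not settle.

```rust
use std::collections::BTreeMap;

/// Multiset Jaccard: sum of per-token minima over sum of per-token maxima.
/// The sums stay in u64; only the final division is floating point.
pub fn multiset_jaccard(a: &BTreeMap<String, u64>, b: &BTreeMap<String, u64>) -> f64 {
    let mut min_sum: u64 = 0;
    let mut max_sum: u64 = 0;
    for (token, &count_a) in a {
        let count_b = b.get(token).copied().unwrap_or(0);
        min_sum += count_a.min(count_b);
        max_sum += count_a.max(count_b);
    }
    // Tokens that occur only in `b` add to the maximum sum and contribute
    // nothing to the minimum sum.
    for (token, &count_b) in b {
        if !a.contains_key(token) {
            max_sum += count_b;
        }
    }
    if max_sum == 0 {
        0.0 // assumed convention for two empty bags
    } else {
        min_sum as f64 / max_sum as f64
    }
}

/// Builds a bag of words from whitespace-separated tokens.
pub fn bag(text: &str) -> BTreeMap<String, u64> {
    let mut counts = BTreeMap::new();
    for token in text.split_whitespace() {
        *counts.entry(token.to_string()).or_insert(0) += 1;
    }
    counts
}
```

For the bags of "the dog the cat" and "the dog ran" the minima sum to 2 and the maxima to 5, giving 2/5. Since both sums are exact integers, the only rounding step is that one division, which is what makes the result reproducible bit for bit.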
2. DateTime<Utc> crate: chrono
The workspace already pins chrono = "0.4" at the root
Cargo.toml. talkbank-model::merge uses the workspace version
verbatim via chrono = { workspace = true }. No new datetime dep.
The “succession-aware” rule from the workspace-root CLAUDE.md
contributor guide (outside the book) and the analogous
feedback_no_terraform_only_opentofu discipline from operator
memory say: do not fragment the ecosystem by introducing a
second tool when a workspace tool already does the job. jiff is
a fine library but adopting it for one new module would mean two
datetime crates in tree.
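Override-file timestamps serialize as RFC 3339 UTC. chrono’s default
Serialize/Deserialize impls for DateTime<Utc> (enabled by its serde
feature) already use RFC 3339, so no #[serde(with = ...)] adapter is
needed; the chrono::serde::ts_* modules cover numeric Unix timestamps instead.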
3. TOML library: toml (the workspace-pinned crate)
The workspace already pins toml = "^1.1.2". That crate both reads and
writes, so there is no need to combine toml and toml_edit for the v1
override-file format.
toml_edit was considered for its formatting/comment preservation
across in-place edits. The case for it is hypothetical right now:
override files are primarily machine-written by chatter speaker-id --write-override; human edits exist but are not the dominant
workflow. The cost of toml_edit is the second TOML dep (workspace
churn, plus the friction every contributor pays parsing TOML
through one API and writing through another).
If a workflow emerges where operators heavily hand-edit override
files and lose formatting on each batch re-run, swap to toml_edit
then. Defer.
4. MergeOverride::flags: Vec<MergeFlag>
Operator-supplied flags are semantically set-like (each flag
present or absent), but Vec is the right representation because:
- MergeFlag includes a Custom(String) #[serde(untagged)] variant. Deriving Ord on this enum requires a manual Ord impl that hashes the discriminator + the inner string. Doable but adds maintenance load.
- The order of flags in the on-disk file isn’t load-bearing for correctness; deterministic single-source-write produces a deterministic Vec.
- Duplicates are noise but not corrupting. Document in the field’s rustdoc that consumers should treat the list as having set semantics (deduplicate before comparing).
The writer (speaker-id --write-override path) inserts flags in a
deterministic order; on-disk Vec is fully reproducible. If a
hand-edited file has an out-of-order or duplicated flag list, that
shows up as non-corrupting noise in subsequent diffs, which is acceptable.
5. SpeakerMapping::assignments: BTreeMap<SpeakerCode, MappingAction>
Confirmed. BTreeMap gives:
- One-action-per-speaker by construction (no duplicate keys).
- Deterministic serialization order (alphabetical by SpeakerCode).
- Cheap membership tests during apply.
The CLAUDE.md “no tuple-packed seams” rule targets raw tuples as
struct fields or function arguments. A BTreeMap’s internal
key-value pairing is not a domain seam exposed to the API; it’s
the representation. Approved.
6. Schema versioning policy: strict refuse-with-clear-error
OverrideFile::read refuses any schema_version != CURRENT_SCHEMA_VERSION
with a typed OverrideFileError::UnsupportedSchemaVersion { found, supported }. No automatic migration in v1.
This is the conservative default. Reasons:
- We have no upgrade history yet; building a migration framework for a problem that doesn’t exist is premature abstraction (CLAUDE.md “Always Fix Root Causes” + the general “no premature abstraction” instinct).
- The override file is fundamentally a record of operator decisions. If the schema breaks, operators re-adjudicate; the prior file becomes a historical artifact that can be read by scripts with old binaries.
- When a real schema change lands and there is real upgrade friction, that’s the moment to write a one-shot migration (chatter merge migrate-overrides --from <path> --to <path>). Until that happens, premature migration code is dead weight.
Document this in OverrideFile::read’s rustdoc so the policy is
explicit to callers.
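A minimal sketch of the refuse-with-clear-error gate, using only the standard library. The `u32` field types, the `check_schema_version` helper, and the error-message wording are assumptions for illustration; only `CURRENT_SCHEMA_VERSION` and `OverrideFileError::UnsupportedSchemaVersion { found, supported }` are named in this doc. Treating a missing `schema_version` as `found: 0` mirrors the `override_file_refuses_missing_schema_version` row in the test plan.

```rust
use std::fmt;

/// Assumed value; the doc names the constant but this sketch picks 1.
pub const CURRENT_SCHEMA_VERSION: u32 = 1;

#[derive(Debug, PartialEq, Eq)]
pub enum OverrideFileError {
    UnsupportedSchemaVersion { found: u32, supported: u32 },
}

impl fmt::Display for OverrideFileError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::UnsupportedSchemaVersion { found, supported } => write!(
                f,
                "unsupported override-file schema version {found} (this build supports {supported})"
            ),
        }
    }
}

impl std::error::Error for OverrideFileError {}

/// Hypothetical helper: strict equality gate, no migration attempt.
/// `None` models a file with no `schema_version` key and is reported as 0.
pub fn check_schema_version(found: Option<u32>) -> Result<(), OverrideFileError> {
    let found = found.unwrap_or(0);
    if found == CURRENT_SCHEMA_VERSION {
        Ok(())
    } else {
        Err(OverrideFileError::UnsupportedSchemaVersion {
            found,
            supported: CURRENT_SCHEMA_VERSION,
        })
    }
}

fn main() {
    assert!(check_schema_version(Some(CURRENT_SCHEMA_VERSION)).is_ok());
    // A future-version file is refused with a typed, printable error.
    if let Err(e) = check_schema_version(Some(2)) {
        eprintln!("{e}");
    }
}
```

A real `OverrideFile::read` would run this check right after extracting the version field and before deserializing any entries, so an unsupported file never produces a partially-read value.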
7. Where the --mapping parser lives: talkbank-model::merge::mapping
parse_mapping_spec("PAR0=drop,PAR1=INV:Investigator") -> Result<SpeakerMapping, MappingSpecParseError>
lives in the model crate alongside the SpeakerMapping type it
returns.
Why:
- The spec format is part of the type’s contract. A reader looking for “how do I construct a SpeakerMapping from a string?” should find the answer where the type is defined, not in the consumer CLI crate.
- A future non-CLI consumer (HTTP API, library wrapper, scripting binding) wants the same parser without re-implementing or depending on chatter.
- The model crate has no CLI-framework dependency (no clap), but a free function returning Result<SpeakerMapping, _> doesn’t need one. The clap value-parser in chatter becomes a thin shim: fn clap_mapping_value(s: &str) -> Result<SpeakerMapping, String> { parse_mapping_spec(s).map_err(|e| e.to_string()) }.
If at some point a SECOND mapping syntax becomes useful (e.g.,
JSON-inline, or a TOML fragment), add a parse_mapping_json
sibling rather than reshaping parse_mapping_spec. The existing
parser stays the lingua franca.
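As an illustration of how small such a parser can be, here is a std-only sketch. The `MappingAction` variants, the error variants, and the tuple-struct shape of `SpeakerCode` are assumptions made up for this example; the doc fixes only the function name, the `SPEAKER=drop` / `SPEAKER=CODE:Tag` spec syntax, and the `BTreeMap<SpeakerCode, MappingAction>` field. The sketch does not enforce the "only one inserted role in v1" rule.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct SpeakerCode(pub String);

/// Hypothetical action shape: drop the speaker, or rename to `code` with role `tag`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum MappingAction {
    Drop,
    Rename { code: SpeakerCode, tag: String },
}

#[derive(Debug, Default, PartialEq, Eq)]
pub struct SpeakerMapping {
    pub assignments: BTreeMap<SpeakerCode, MappingAction>,
}

/// Hypothetical error variants, for illustration only.
#[derive(Debug, PartialEq, Eq)]
pub enum MappingSpecParseError {
    Empty,
    MalformedEntry(String),
    DuplicateSpeaker(String),
}

pub fn parse_mapping_spec(spec: &str) -> Result<SpeakerMapping, MappingSpecParseError> {
    if spec.trim().is_empty() {
        return Err(MappingSpecParseError::Empty);
    }
    let mut mapping = SpeakerMapping::default();
    for entry in spec.split(',') {
        let malformed = || MappingSpecParseError::MalformedEntry(entry.to_string());
        // Each entry is `SPEAKER=drop` or `SPEAKER=CODE:Tag`.
        let (speaker, action) = entry.split_once('=').ok_or_else(malformed)?;
        let speaker = speaker.trim();
        if speaker.is_empty() {
            return Err(malformed());
        }
        let action = match action.trim() {
            "drop" => MappingAction::Drop,
            role => match role.split_once(':') {
                Some((code, tag)) if !code.is_empty() && !tag.is_empty() => {
                    MappingAction::Rename {
                        code: SpeakerCode(code.to_string()),
                        tag: tag.to_string(),
                    }
                }
                _ => return Err(malformed()),
            },
        };
        // BTreeMap::insert returns the previous value when the key already existed.
        let key = SpeakerCode(speaker.to_string());
        if mapping.assignments.insert(key, action).is_some() {
            return Err(MappingSpecParseError::DuplicateSpeaker(speaker.to_string()));
        }
    }
    Ok(mapping)
}

fn main() {
    let mapping = parse_mapping_spec("PAR0=drop,PAR1=INV:Investigator");
    println!("{mapping:?}");
}
```

Because the function depends only on the model types, a CLI shim, an HTTP handler, or a scripting binding can all call it without pulling in an argument-parsing framework.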
These decisions are the design baseline going into spec authoring and implementation. Future revisions to any of them require an explicit doc update plus a deprecation/migration plan, not a silent change in the implementation.
Relationship to specs and tests
Every type in this doc gets a spec entry in
spec/constructs/merge-types/ once we move to implementation:
one spec per type/invariant pair, regenerated into Rust tests
via the current spec/tools generators. Spec authoring sits between this doc and
the Rust implementation; types are designed here, behavior is
pinned by specs, code follows. The spec entries are also where
behavioral invariants (e.g. “JaccardScore::new(NaN) → Err”)
become regression gates rather than rustdoc-only contracts.
Merge Pipeline, Test Plan
Status: Draft Last modified: 2026-05-30 06:55 EDT
This page is the test-coverage roadmap for the new merge pipeline
(chatter speaker-id + chatter merge + chatter adjudicate +
the override-file format + the underlying talkbank-model::merge
types). It exists because, per this repo’s root CLAUDE.md
red/green TDD rule, every new feature starts with failing tests
at the highest level the feature lives at, and we want to
enumerate those tests before writing the implementation, so
coverage is designed, not discovered.
This is a plan, not yet code. When the implementation work begins, every test case below becomes a real test; the doc then flips to a coverage matrix that gets kept honest by CI.
TDD discipline, what “strict red/green” means here
Every cycle of impl-phase work is:
- RED. Write ONE failing test at the highest layer the feature lives at. The test exercises a real user-observable behavior, not an internal helper. Commit the failing test alone (or stage it before any code change), and verify that it fails for the right reason (the missing behavior), not because of a compile error or a typo.
- GREEN. Write the smallest code change that makes the test pass. No anticipating future tests, no scaffolding for tests that don’t yet exist. The codebase should compile and pass tests at this point.
- REFACTOR. With the green test as the safety net, tighten the implementation: extract helpers, rename for clarity, replace primitives with newtypes, document tricky parts. Tests stay green throughout.
- DRILL DOWN if needed. If the L3 (or L2) test passes but pinned the behavior less precisely than the contract requires (e.g., the L3 test asserts “exit 2 with some error” but the contract says “the specific MergeError variant must match”), add an L2 (or L1) test next that drills into the precise path. The drilled test FAILS at first against the green-but-imprecise impl, motivating the tighter impl.
Cycles must be atomic: one RED → one GREEN → optional REFACTOR → optional drill-down. Do not stack multiple tests on top of a single impl change; do not write impl ahead of tests. The discipline matters because the bug bar of this pipeline is high (CHAT-data byte-stable preservation, audit-trail reproducibility) and TDD is the cheapest way to catch regressions before they ship.
Three test layers + the adjudication layer
The merge pipeline’s behavior spans four substrates with different testing mechanisms.
| Layer | Substrate | Why tests live here |
|---|---|---|
| L1, Spec / fragment | spec/constructs/speaker-id/ → current spec/tools generators | Token-cleaner behavior on CHAT fragments (markup strip for Jaccard scoring). Same mechanism that pins parser/grammar tests; regenerated regression. |
| L2, Transform / AST | crates/talkbank-transform/tests/ | Pure-Rust tests over parsed ChatFile values. identify_mapping, apply_mapping, merge, run_adjudication semantics on hand-built or parsed CHAT inputs. No process boundary. |
| L3, CLI / subprocess | crates/chatter/tests/merge_tests.rs (new) | End-to-end behavior of chatter speaker-id, chatter merge, and chatter adjudicate invoked as subprocesses (assert_cmd + predicates). Exit codes, flag parsing, file I/O, stderr formats. |
| L4, Scripted adjudication | crates/talkbank-transform/tests/adjudication_tests.rs + scripted prompter | Operator-decision paths in chatter adjudicate. Uses ScriptedPrompter injecting synthetic operator choices. See Adjudication Workflow for the prompter abstraction. |
L1 ⊂ L2 ⊂ L3 in terms of failure-mode coverage: a failing L1 test implies a failing L2 test which implies a failing L3 test. So when the same invariant could be tested at multiple layers, the starter test is the highest layer and lower-layer tests are supplements that pin the precise internal path. L4 sits beside L2/L3, same crate/file conventions but a dedicated layer because the prompter-injection pattern is specific to adjudication.
L1, Spec / fragment tests
Lives in spec/constructs/speaker-id/. Three subdirectories:
- token-cleaner/: what the Jaccard tokenizer strips and keeps
- jaccard-scoring/: fixed-input → fixed-score golden tests
- mapping-application/: header rewrite rules on real fragments
L1.1, Token cleaner
Each spec is a CHAT main-tier fragment + the expected token list
after cleaning. Behavior pinned: bracket markup stripped,
angle-bracket retracing unwrapped, terminator variants
discarded, &-... / &+... discarded, xxx/yyy/www
discarded, 0 discarded, @l / @n / @c suffix dropped,
_-compound split to spaces, punctuation stripped, lowercased,
≥2-char alpha filter, NAK bullets stripped.
| Spec | Input fragment | Expected tokens |
|---|---|---|
clean-plain-utterance | *CHI:\thello world . | ["hello", "world"] |
clean-strip-bracket-codes | *CHI:\thello [*] [/] world [//] . | ["hello", "world"] |
clean-unwrap-angle-retrace | *CHI:\t<two of the> [//] three of the presents . | ["two", "of", "the", "three", "of", "the", "presents"] |
clean-strip-fillers | *CHI:\t&-um &+pre something &-uh . | ["something"] |
clean-strip-zero-and-paralinguistic | *CHI:\t0 [=! nodding] . | [] |
clean-strip-unintelligible | *CHI:\txxx and yyy and www . | ["and", "and"] |
clean-strip-bullets | *CHI:\thello world . \x150_1234\x15 | ["hello", "world"] |
clean-special-form-suffix | *CHI:\tnaming l@l u@l l@l u@l . | ["naming"] |
clean-compound-underscore | *CHI:\tValentine's_Day and Fruit_Loops . | ["valentine", "day", "and", "fruit", "loops"] |
clean-terminator-variants | *CHI:\thello +//. world +... again +/. last ! | ["hello", "world", "again", "last"] |
clean-overlap-markers | *CHI:\t↫here↫ and there . | ["here", "and", "there"] |
clean-lowercase-filter | *CHI:\tHello World A I am . | ["hello", "world", "am"] |
Each spec file in spec/constructs/speaker-id/token-cleaner/ has
the standard # name, ## Input, ## Expected tokens, and
## Metadata sections per the spec authoring template at
spec/CLAUDE.md in the workspace root (outside the book).
L1.2, Jaccard scoring
Fixed bag-of-tokens pairs with known multiset Jaccard. These
guard against off-by-one errors in the sum_w min / sum_w max
implementation and against any future “optimizations” that
silently change scoring.
| Spec | Bag A | Bag B | Expected J(A,B) |
|---|---|---|---|
jaccard-identical | {hello:2, world:1} | {hello:2, world:1} | 1.0 |
jaccard-disjoint | {hello:1} | {world:1} | 0.0 |
jaccard-empty-empty | {} | {} | 0.0 |
jaccard-empty-nonempty | {} | {x:1} | 0.0 |
jaccard-multiset-counts | {a:3, b:1} | {a:1, b:1} | 2/4 = 0.5 |
jaccard-partial-overlap | {a:1, b:1, c:1} | {b:1, c:1, d:1} | 2/4 = 0.5 |
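
The golden values in the table follow from the multiset definition: sum of per-token minimum counts divided by sum of per-token maximum counts. A std-only sketch of that computation is below; the `Bag` alias and the `multiset_jaccard` and `bag` names are hypothetical, and returning 0.0 for two empty bags follows the `jaccard-empty-empty` row rather than any mathematical necessity.

```rust
use std::collections::BTreeMap;

/// Hypothetical bag-of-tokens type: token -> occurrence count.
type Bag = BTreeMap<String, u32>;

/// Multiset Jaccard: sum of min counts / sum of max counts over the key union.
pub fn multiset_jaccard(a: &Bag, b: &Bag) -> f64 {
    let mut min_sum: u64 = 0;
    let mut max_sum: u64 = 0;
    // Visit every key of `a`, then the keys that appear only in `b`.
    let union = a.keys().chain(b.keys().filter(|k| !a.contains_key(*k)));
    for token in union {
        // A token absent from one bag counts as 0 on that side.
        let ca = u64::from(*a.get(token).unwrap_or(&0));
        let cb = u64::from(*b.get(token).unwrap_or(&0));
        min_sum += ca.min(cb);
        max_sum += ca.max(cb);
    }
    if max_sum == 0 {
        0.0 // both bags empty: defined as 0.0 by convention
    } else {
        min_sum as f64 / max_sum as f64
    }
}

/// Small helper for building bags in examples and tests.
pub fn bag(pairs: &[(&str, u32)]) -> Bag {
    pairs.iter().map(|(t, n)| (t.to_string(), *n)).collect()
}

fn main() {
    let a = bag(&[("a", 3), ("b", 1)]);
    let b = bag(&[("a", 1), ("b", 1)]);
    println!("J(A, B) = {}", multiset_jaccard(&a, &b));
}
```

Integer accumulation followed by a single division keeps the result independent of iteration order, which matters for the reproducibility property the audit trail relies on.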
L1.3, Mapping application on fragments
Header-rewrite micro-tests. Each spec gives an input
@Participants: or @ID: row and a small mapping; the expected
output row is the rewritten form.
| Spec | Input row | Mapping | Expected output row |
|---|---|---|---|
participants-rewrite-rename | @Participants:\tPAR0 Participant, PAR1 Participant | PAR0→INV:Investigator, PAR1→drop | @Participants:\tINV Investigator |
participants-preserve-name-token | @Participants:\tCHI Alex Target_Child, PAR0 Participant | PAR0→INV:Investigator | @Participants:\tCHI Alex Target_Child, INV Investigator |
id-rewrite-rename | @ID:\teng\|corpus_name\|PAR0\|\|\|\|\|Participant\|\|\| | PAR0→INV:Investigator | @ID:\teng\|corpus_name\|INV\|\|\|\|\|Investigator\|\|\| |
id-drop-removes-row | @ID:\teng\|...\|PAR1\|\|\|\|\|Participant\|\|\| | PAR1→drop | (row removed) |
id-preserves-other-fields | @ID:\teng\|2\|CHI\|6;01.\|female\|NF\|\|Target_Child\|\|\| | (no-op for CHI) | identical to input |
L2, Transform / AST tests
Lives in crates/talkbank-transform/tests/. Three test files:
- speaker_id_tests.rs
- transcript_merge_tests.rs
- override_file_tests.rs
Each tests behavior over parsed talkbank-model::ChatFile values,
using inline synthetic CHAT strings parsed via
talkbank_parser::parse_chat_file (no subprocess overhead).
L2.1, identify_mapping (reference mode)
| Test | Scenario | Assertion |
|---|---|---|
identify_mapping_clean_winner | Reference has CHI saying content X; donor has PAR0 saying X verbatim and PAR1 saying unrelated content | Returns SpeakerMapping { drop: {PAR0}, rename: {PAR1: INV} }, margin >> 2.0 |
identify_mapping_borderline_refuses | Reference and both donor speakers share substantial vocabulary (margin < 2.0) | Returns Err(SpeakerIdError::LowConfidence { scores, threshold, margin }) |
identify_mapping_anchor_missing | Reference has no utterances tagged with anchor speaker | Returns Err(SpeakerIdError::AnchorMissingInReference { anchor: CHI }) |
identify_mapping_single_speaker_donor | Donor has only one speaker | Returns Err(SpeakerIdError::InsufficientSpeakers { n: 1 }) |
identify_mapping_threshold_at_exact_value | Constructed donor where margin = 2.0 exactly with threshold 2.0 | Returns Ok(_) (≥ comparison, not strict >) |
identify_mapping_threshold_below_exact_value | Margin = 1.9999 with threshold 2.0 | Returns Err(SpeakerIdError::LowConfidence) |
identify_mapping_unbounded_margin | Donor PAR1 has Jaccard 0 against reference; PAR0 > 0 | Returns Ok(_) with margin = Margin::Unbounded |
identify_mapping_deterministic | Same inputs, repeated call | Identical SpeakerMapping byte-for-byte (BTreeMap ordering) |
L2.2, apply_mapping
| Test | Scenario | Assertion |
|---|---|---|
apply_mapping_renames_main_tier | Donor has *PAR0:\t... and *PAR1:\t...; mapping renames PAR0→INV, drops PAR1 | Output has *INV:\t... for original PAR0 utts; PAR1 utts absent |
apply_mapping_byte_stable_except_prefix | Donor has rich CHAT markup, %wor, %com on every utt | Every retained utt is byte-identical except the *CODE:\t prefix; dependent tiers preserved exactly |
apply_mapping_rewrites_participants | Donor @Participants: has PAR0+PAR1 entries | Output has only INV entry (after PAR1 drop) |
apply_mapping_rewrites_id | Donor @ID: rows for PAR0+PAR1 | PAR0 row rewritten to INV with role tag; PAR1 row removed |
apply_mapping_speaker_not_in_input | Mapping references PAR9 which isn’t in donor | Returns Err(SpeakerIdError::MappingSpeakerNotInInput { speaker: PAR9 }) |
apply_mapping_speaker_not_in_mapping | Donor has PAR0+PAR1+PAR2 but mapping only covers PAR0+PAR1 | Returns Err(SpeakerIdError::SpeakerNotInMapping { speaker: PAR2 }) |
apply_mapping_preserves_other_headers | Donor has @Languages, @Media, @Comment | All non-Participants/non-ID headers pass through verbatim |
apply_mapping_idempotent_on_rerun | Apply mapping, parse output, apply identity mapping | Output unchanged (byte-stable) |
L2.3, merge (core invariants)
These mirror the user-guide’s “What the merged output guarantees” section directly. Each invariant from that section maps to one or more L2 tests; the L3 tests then re-exercise the same invariant through the CLI.
| Test | Invariant from user-guide | Assertion |
|---|---|---|
merge_retained_speakers_byte_stable | “Retained speakers are byte-stable” | Every *CHI: block from File 1 (main tier + all dependent tiers, including %com) appears in the output byte-identical, in original order |
merge_strips_default_derived_tiers | “Inserted speakers’ downstream-generated tiers are stripped” | Output has no %wor, %mor, %gra, %pho on inserted-speaker utts; other dependent tiers preserved |
merge_strip_tiers_configurable | “configurable via --strip-tiers” | Custom strip_tiers=[com] removes %com instead of the defaults |
merge_strip_tiers_empty_preserves_all | empty strip set | Inserted utts retain %wor, %mor, %gra, %pho from File 2 verbatim |
merge_utterance_order_by_start_time | “Utterance order is timeline order” | Output utterances sorted by start_ms ascending |
merge_stable_tiebreak_file1_first | “first-file utterance comes first” | When File 1 and File 2 each have an utterance starting at exactly t, the File 1 one appears first in the output |
merge_bullets_pass_through | “Time bullets are pass-through” | Every bullet in the output is exactly the bullet from its source utterance, merge does not recompute, smooth, or refresh |
merge_bullet_lift_from_wor | “If main tier lacks bullet, lift from %wor” | Donor utt with no end-of-line bullet but a %wor row gets a derived \x15<first>_<last>\x15 appended; original %wor then stripped per the tier policy |
merge_no_overlap_markers_injected | “Overlap markup is NOT injected” | Even when inserted utt’s bullet overlaps a retained utt’s bullet by 500ms, no [>]/[<] tokens appear anywhere in the output that weren’t in the original retained file |
merge_preserves_existing_overlap_markers | retained file already has [>] somewhere | The original [>] is preserved byte-stable on the retained utt |
merge_header_languages_passthrough | Header reconciliation rule | Output @Languages matches File 1’s |
merge_header_media_file1_wins | Header reconciliation rule | File 1 says video, File 2 says audio → output says video (no warning emitted for modality only) |
merge_header_participants_concatenates | Header reconciliation rule | Output @Participants: is File 1’s entries + File 2’s non-retained entries, in that order |
merge_header_id_concatenates | Header reconciliation rule | Output @ID: rows are File 1’s + File 2’s non-retained, original order within each file |
merge_header_comments_concatenate | Header reconciliation rule | Output @Comment rows are File 1’s + File 2’s, in original order (ASR provenance preserved) |
merge_preconditions_retain_missing | exit code 2 precondition | File 1 declares no CHI; merge with retain={CHI} returns Err(MergeError::RetainSpeakersMissing) |
merge_preconditions_no_timeline | exit code 2 precondition | File 1 has no utterances with bullets → Err(MergeError::NoTimelineInFile1) |
merge_preconditions_language_mismatch | exit code 2 precondition | File 1 @Languages: eng, File 2 @Languages: yue → Err(MergeError::LanguageMismatch) |
merge_preconditions_ambiguous_speaker | exit code 2 precondition | Both files have INV utterances and retain={CHI} (INV not in retain) → Err(MergeError::AmbiguousSpeaker { speaker: INV }) |
merge_warns_on_backward_bullet_drift | “small backward-time bullets … proceeds” | File with utt1: 100_200, utt2: 190_300, succeeds, emits a warning |
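
The two ordering invariants (timeline order, File 1 first on ties) can be satisfied with one stable sort. The sketch below shows only that ordering step, using made-up `Source` and `Utt` types. It does not model dependent tiers, bullets, or header reconciliation, and it assumes every utterance already has a start time.

```rust
/// Hypothetical provenance tag; derive order makes File1 sort before File2.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Source {
    File1,
    File2,
}

/// Hypothetical, heavily simplified utterance: just enough to show ordering.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Utt {
    pub source: Source,
    pub start_ms: u64,
    pub text: String,
}

/// Interleave two utterance lists by start time, File 1 winning exact ties.
pub fn merge_by_start_time(file1: Vec<Utt>, file2: Vec<Utt>) -> Vec<Utt> {
    let mut merged: Vec<Utt> = file1.into_iter().chain(file2).collect();
    // `sort_by_key` is a stable sort, so utterances with the same
    // (start_ms, source) key keep their original within-file order.
    merged.sort_by_key(|u| (u.start_ms, u.source));
    merged
}

fn main() {
    let file1 = vec![Utt { source: Source::File1, start_ms: 1000, text: "retained".into() }];
    let file2 = vec![Utt { source: Source::File2, start_ms: 1000, text: "inserted".into() }];
    for utt in merge_by_start_time(file1, file2) {
        println!("{:?} {} {}", utt.source, utt.start_ms, utt.text);
    }
}
```

Making the tiebreak part of the sort key, rather than relying on concatenation order alone, keeps the File 1 preference explicit if the input order ever changes.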
L2.4, Override file I/O
| Test | Scenario | Assertion |
|---|---|---|
override_file_round_trip | Construct OverrideFile with one entry, write, read back | Re-read value == original |
override_file_refuses_missing_schema_version | TOML with no schema_version | Err(OverrideFileError::UnsupportedSchemaVersion { found: 0, supported: 1 }) |
override_file_refuses_wrong_schema_version | schema_version = 2 (future) | Err(UnsupportedSchemaVersion { found: 2, supported: 1 }) |
override_file_rejects_unknown_field | Entry has an extraneous field extra = "x" | Err(OverrideFileError::Parse) |
override_file_rejects_malformed_mode | mode = "guess" | Err(Parse) (only auto/explicit/override accepted) |
override_file_atomic_write | Write to a path that already exists | Original file is replaced atomically; no <path>.tmp left behind |
override_file_deterministic_serialization | Same struct, write twice | Bytes on disk are byte-identical between writes |
override_file_omits_empty_optionals | Entry has empty scores, no margin, empty flags | TOML output does not contain those keys |
override_file_preserves_margin_unbounded | Entry has margin = Margin::Unbounded | TOML on disk has margin = "unbounded"; reads back as Unbounded |
override_file_preserves_margin_finite | Entry has margin = Margin::Finite(3.81) | TOML on disk has margin = 3.81; reads back equal |
override_file_read_or_default_missing | Path does not exist | Returns empty OverrideFile with current schema version |
override_file_get_returns_entry | File has one entry under SessionId X | get(X) returns Some; get(Y) returns None |
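
For the `override_file_atomic_write` row, one common way to get replace-atomically behavior is to write a sibling temporary file and rename it over the target. The sketch below uses only the standard library; the `write_atomically` name is hypothetical and the `.tmp` suffix is taken from the table row. It is one possible approach, not a statement of how the writer is implemented.

```rust
use std::fs;
use std::io::Write;
use std::path::{Path, PathBuf};

/// Hypothetical helper: write `contents` to `<path>.tmp`, then rename over `path`.
pub fn write_atomically(path: &Path, contents: &[u8]) -> std::io::Result<()> {
    // Build the sibling temp path by appending ".tmp" to the full file name.
    let mut tmp_name = path.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp = PathBuf::from(tmp_name);

    {
        let mut file = fs::File::create(&tmp)?;
        file.write_all(contents)?;
        // Flush file data to disk before the rename makes it visible.
        file.sync_all()?;
    }

    // `fs::rename` replaces an existing destination file.
    let result = fs::rename(&tmp, path);
    if result.is_err() {
        // Best-effort cleanup so a failed write leaves no stray temp file.
        let _ = fs::remove_file(&tmp);
    }
    result
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("override_atomic_example.toml");
    write_atomically(&path, b"schema_version = 1\n")?;
    println!("wrote {}", path.display());
    fs::remove_file(&path)
}
```

Readers that open the target path see either the old bytes or the new bytes, never a half-written file, as long as the temp file lives on the same filesystem as the target.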
L2.5, Domain-type unit tests
Smaller per-type tests. Each in its module’s #[cfg(test)] mod tests section.
| Test | Type | Assertion |
|---|---|---|
jaccard_score_new_in_range | JaccardScore | new(0.5) → Ok; new(-0.1) and new(1.1) → Err; new(NaN) → Err |
jaccard_score_serde_round_trip | JaccardScore | Serializes to 0.5 (bare float in JSON/TOML); deserializes back identically; out-of-range deserialize → error |
confidence_threshold_default_is_2_0 | ConfidenceThreshold | Default::default().value() == 2.0 |
confidence_threshold_rejects_below_1 | ConfidenceThreshold | new(0.5) → Err |
margin_from_scores_zero_loser | Margin | from_scores(JaccardScore::new(0.7), JaccardScore::zero()) == Margin::Unbounded |
margin_from_scores_zero_zero | Margin | from_scores(zero, zero) == Margin::Finite(0.0) or explicit “degenerate” representation (decide and document) |
margin_meets_threshold | Margin | Finite(3.81).meets(threshold=2.0) == true; Finite(1.5).meets(2.0) == false; Unbounded.meets(threshold) == true for any threshold |
retain_set_parse | RetainSet | "CHI".parse() == Ok({CHI}); "CHI,SI2".parse() == Ok({CHI, SI2}); "".parse() == Err; "CHI,,SI2".parse() == Err |
inserted_role_parse | InsertedRole | "INV:Investigator".parse() == Ok(_); "INV".parse() == Err; ":Investigator".parse() == Err |
mapping_spec_parse_simple | parse_mapping_spec | "PAR0=drop,PAR1=INV:Investigator" parses to a complete SpeakerMapping with correct actions and inserted_role |
mapping_spec_parse_drop_only | parse_mapping_spec | "PAR0=drop" parses iff no inserted_role context required (decide whether legal in isolation; if not, must error) |
mapping_spec_parse_conflicting_roles | parse_mapping_spec | "PAR0=INV:Investigator,PAR1=MOT:Mother", two different inserted roles → error (v1 only allows one) |
merge_flag_serde_known_variants | MergeFlag | DiarizationMixed serializes as "diarization-mixed" (kebab-case); deserializes the same |
merge_flag_serde_custom | MergeFlag | Unknown string deserializes as Custom("unknown-flag"); serializes verbatim |
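
The `JaccardScore` and `Margin` rows can be made concrete with a small sketch. Several things here are assumptions: that the margin is the ratio winner / loser (suggested by the 2.0 default threshold and the "unbounded when the loser scores zero" rule), that `meets` takes a bare `f64` rather than a `ConfidenceThreshold`, and that the zero/zero case resolves to `Finite(0.0)`, which is one of the two options the table leaves open. Errors are plain strings to keep the example short.

```rust
/// Validated score in [0, 1]; the inner f64 is private so `new` is the only entry.
#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
pub struct JaccardScore(f64);

impl JaccardScore {
    pub fn new(value: f64) -> Result<Self, String> {
        // NaN fails this check because every comparison against NaN is false.
        if (0.0..=1.0).contains(&value) {
            Ok(Self(value))
        } else {
            Err(format!("score {value} is outside [0, 1]"))
        }
    }

    pub fn zero() -> Self {
        Self(0.0)
    }

    pub fn value(self) -> f64 {
        self.0
    }
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Margin {
    Finite(f64),
    Unbounded,
}

impl Margin {
    /// Assumed definition: ratio of the winning score to the losing score.
    pub fn from_scores(winner: JaccardScore, loser: JaccardScore) -> Self {
        if loser.value() == 0.0 {
            if winner.value() == 0.0 {
                Margin::Finite(0.0) // degenerate case; the choice is an assumption
            } else {
                Margin::Unbounded
            }
        } else {
            Margin::Finite(winner.value() / loser.value())
        }
    }

    /// Inclusive comparison: a margin exactly at the threshold passes.
    pub fn meets(self, threshold: f64) -> bool {
        match self {
            Margin::Finite(m) => m >= threshold,
            Margin::Unbounded => true,
        }
    }
}

fn main() {
    let winner = JaccardScore::new(0.7).expect("in range");
    let margin = Margin::from_scores(winner, JaccardScore::zero());
    println!("{margin:?} meets 2.0: {}", margin.meets(2.0));
}
```

Keeping `Unbounded` as its own variant avoids storing an infinite float, which neither JSON nor a strict range check would round-trip cleanly.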
L3, CLI / subprocess tests
Lives in crates/chatter/tests/merge_tests.rs (new file).
Uses the same assert_cmd + predicates + tempfile pattern
as the existing integration_tests.rs. Each test invokes
chatter speaker-id or chatter merge as a subprocess against
files written to a tempdir().
L3.1, chatter merge, success paths
| Test | Invariants exercised |
|---|---|
merge_basic_clinician_pattern | E2E happy path: small hand-coded child-only file + small ASR-labeled file → exit 0, output exists, retained CHI byte-stable, inserted INV present with derived tiers stripped. Single-invocation smoke test. |
merge_writes_to_stdout_by_default | No -o flag → output goes to stdout, exit 0 |
merge_writes_to_output_path | -o merged.cha → file created with correct content; nothing on stdout |
merge_retain_multi_speaker | --retain CHI,SI2 keeps both CHI and SI2 byte-stable; everything else from File 2 |
merge_strip_tiers_custom | --strip-tiers com,act removes %com and %act instead of default set |
merge_strip_tiers_empty | --strip-tiers '' preserves %wor from File 2 in output |
L3.2, chatter merge, error paths
| Test | Asserted exit code | Asserted stderr |
|---|---|---|
merge_missing_file1 | 1 | “No such file” or equivalent typed message |
merge_unparseable_file1 | 1 | parser diagnostic |
merge_missing_retain_flag | 2 (clap) | clap usage message |
merge_retain_empty_value | 2 | typed error from RetainSet::from_str |
merge_no_retain_speakers_in_file1 | 2 | RetainSpeakersMissing rendered |
merge_no_timeline_in_file1 | 2 | NoTimelineInFile1 rendered |
merge_language_mismatch | 2 | LanguageMismatch { file1: eng, file2: yue } rendered |
merge_ambiguous_speaker | 2 | AmbiguousSpeaker { speaker: ... } rendered with hint to use --retain |
L3.3, chatter speaker-id, reference mode
| Test | Scenario | Assertion |
|---|---|---|
speaker_id_reference_auto_clean_winner | Reference + donor where margin >> 2.0 | Exit 0; output has expected renamed/dropped speakers |
speaker_id_reference_writes_override | With --write-override path.toml | File created; entry has mode = "auto", scores, margin, decided_at, operator |
speaker_id_reference_appends_to_existing_override | --write-override path.toml where file already has another session | New session added; existing session preserved |
speaker_id_reference_low_confidence_exits_4 | Margin < threshold | Exit 4; stderr contains per-speaker scores |
speaker_id_reference_anchor_missing_exits_2 | Reference has no anchor speaker utterances | Exit 2; typed error in stderr |
speaker_id_reference_threshold_override | --confidence-threshold 1.5 on a margin-1.7 case | Exit 0 (would have refused at default 2.0) |
speaker_id_reference_anchor_required | --reference without --anchor | Exit 2 (clap or our own); usage error |
L3.4, chatter speaker-id, explicit-mapping mode
| Test | Scenario | Assertion |
|---|---|---|
speaker_id_explicit_basic | --mapping "PAR0=drop,PAR1=INV:Investigator" | Exit 0; output renames PAR1→INV, drops PAR0 |
speaker_id_explicit_mapping_speaker_not_in_input | --mapping references PAR9 not in input | Exit 2; typed error |
speaker_id_explicit_speaker_missing_from_mapping | Input has PAR0+PAR1+PAR2; mapping only covers PAR0+PAR1 | Exit 2; typed error naming PAR2 |
speaker_id_explicit_with_note_records_in_override | --mapping + --write-override + --note "verified by listening" | TOML entry has note = "verified by listening" and mode = "explicit" |
L3.5, chatter speaker-id, override-file mode
| Test | Scenario | Assertion |
|---|---|---|
speaker_id_override_file_replay | Override file has entry for session-X | Reading override + applying produces same output as the original auto/explicit run |
speaker_id_override_file_missing_entry | Override file has no entry for the requested session | Exit 2; OverrideEntryMissing in stderr |
speaker_id_override_file_missing_file | --override-file path.toml where file doesn’t exist | Exit 1; NotFound in stderr |
speaker_id_override_file_wrong_schema_version | File has schema_version = 99 | Exit 1; UnsupportedSchemaVersion in stderr |
speaker_id_override_file_mutually_exclusive_modes | --reference AND --mapping both set | Exit 2 (clap or our own); only one operation mode allowed |
L3.6, Pipeline composition
These exercise chatter speaker-id → chatter merge composed
end-to-end through the file system, simulating the orchestrator
workflow.
| Test | Scenario | Assertion |
|---|---|---|
pipeline_speaker_id_then_merge | Run speaker-id on anonymous ASR file; run merge on the result + hand-coded file | Final merged file passes all merge invariants (retained byte-stable, etc.) |
pipeline_replay_via_override_file | Run once with auto; capture override file; delete intermediates; replay via --override-file; merge again | Final merged file is byte-identical to the original run (audit-trail-reproducibility property) |
pipeline_low_confidence_then_explicit | Run speaker-id; gets exit 4; capture scores from stderr; run again with --mapping matching what the operator would decide; record via --write-override; merge | All steps succeed; override file has mode = "explicit" with prior scores recorded |
L4, Scripted adjudication tests
Lives in crates/talkbank-transform/tests/adjudication_tests.rs.
Uses the Prompter trait and ScriptedPrompter documented in
Adjudication Workflow §The prompter abstraction.
Each test constructs a pending-adjudications input, scripts the
operator’s decisions, runs run_adjudication, and asserts on
the resulting override file plus the residual pending file.
L4.1, Speaker-id adjudication paths
| Test | Scripted decision | Assertion |
|---|---|---|
adjudicate_speaker_id_accepts_suggested | AcceptSuggested { note: None } for one pending entry | Override file entry has mode = "explicit", mapping matches suggested, pending file emptied |
adjudicate_speaker_id_override_mapping | OverrideMapping { mapping: { PAR0=rename, PAR1=drop }, note: Some("verified by listening") } (opposite of suggested) | Override file mapping matches operator’s choice; note recorded |
adjudicate_speaker_id_defer | Defer { reason: "need to listen to audio" } | Pending entry untouched; override file unchanged; tool exits 4 (deferred) |
adjudicate_speaker_id_block | Block { reason: "reference file missing bullets" } | Pending entry tagged as blocked; override file unchanged |
adjudicate_speaker_id_kind_mismatch_rejected | OverrideInsertedRole { ... } against a speaker-id-low-confidence entry | Returns Err(AdjudicationError::DecisionKindMismatch); nothing written |
L4.2, Parent-role-lookup adjudication paths
| Test | Scripted decision | Assertion |
|---|---|---|
adjudicate_parent_role_accepts_default_inv | AcceptSuggested | Override entry uses INV:Investigator (the safe default) |
adjudicate_parent_role_overrides_to_mother | OverrideInsertedRole { code: "MOT", tag: "Mother" } | Override entry uses MOT; note recorded |
adjudicate_parent_role_overrides_to_father | OverrideInsertedRole { code: "FAT", tag: "Father" } | Override entry uses FAT |
adjudicate_parent_role_invalid_code_rejected | OverrideInsertedRole { code: "", tag: "Mother" } | Returns Err; with --skip-on-error, logs and proceeds |
L4.3, Diarization-mix and sanity-scan paths
| Test | Scripted decision | Assertion |
|---|---|---|
adjudicate_diarization_mix_flag_only | Flag { flags: [DiarizationMixed], note: "PAR0 mixes clinician+parent" } | Existing override entry gets flag added; mapping unchanged |
adjudicate_sanity_scan_swap_mapping | OverrideMapping { ... } reversing original speaker-id | Override entry updated; mode = "explicit"; original mapping preserved in history |
adjudicate_sanity_scan_confirms_real_overlap | Flag { flags: [Custom("real-overlap-confirmed")] } | Override entry gets custom flag; mapping unchanged |
L4.4, Workflow plumbing
| Test | Scenario | Assertion |
|---|---|---|
adjudicate_empty_pending_file_noop | Pending file has empty entries array | Exit 0; nothing changes |
adjudicate_resumption_skips_decided_entries | Pending file has 3 entries; first 2 already decided in override; only 3rd has no override entry | Prompter is called exactly once, for the 3rd entry |
adjudicate_re_adjudicate_preserves_history | Existing override entry; --re-adjudicate with new decision | New decision saved; prior decision preserved in history array |
adjudicate_kind_filter_processes_only_matching | Pending file has mixed kinds; --kind parent-role-lookup flag set | Prompter only called for parent-role-lookup entries; other kinds untouched |
adjudicate_dry_run_writes_nothing | Any pending input + any decision; --dry-run set | Override file unchanged; pending file unchanged |
adjudicate_scripted_mode_unknown_session_aborts | Scripted decisions reference session-X but pending has only session-Y | Returns Err(AdjudicationError::ScriptedDecisionWithoutPendingEntry); tool exits 2 |
adjudicate_scripted_mode_extra_pending_aborts | Pending has session-X and session-Y; scripted decisions cover only session-X | Returns Err(AdjudicationError::PendingEntryWithoutScriptedDecision); tool exits 2 |
adjudicate_mutually_exclusive_modes | --interactive + --scripted both set | Returns Err; tool exits 2 (clap or our own validator) |
L4.5, Prompter contract conformance
These tests pin the contract that any Prompter impl must
satisfy, so future UI backends (VS Code, web) can be developed
against the same invariants.
| Test | Scenario | Assertion |
|---|---|---|
| prompter_terminal_round_trip_decision | TerminalPrompter reading a scripted stdin | Returns the expected OperatorDecision parsed from the operator’s typed input |
| prompter_scripted_returns_decisions_in_order | ScriptedPrompter::from_decisions([d1, d2, d3]) | Three consecutive ask() calls return d1, d2, d3 in order |
| prompter_scripted_panics_on_unscripted_session | ScriptedPrompter has decisions for session A; tool asks for session B | ask() returns Err(PrompterError::NoDecisionFor(SessionId)) |
| prompter_scripted_toml_round_trips | Write a scripted-decisions TOML, read with ScriptedTomlPrompter, run | Same OperatorDecision sequence as a ScriptedPrompter::from_decisions with equivalent contents |
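As a rough illustration of the contract these tests pin, here is a minimal sketch of a scripted prompter. Only the names `Prompter`, `ScriptedPrompter::from_decisions`, `OperatorDecision`, `SessionId`, and `PrompterError::NoDecisionFor` come from the table; the trait signature, the `(SessionId, OperatorDecision)` pairing, and the two decision variants shown are assumptions made for the example, not the real API.

```rust
use std::collections::VecDeque;

// Hypothetical stand-ins for the real domain types.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SessionId(pub String);

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum OperatorDecision {
    AcceptSuggested,
    Defer,
}

#[derive(Debug, PartialEq, Eq)]
pub enum PrompterError {
    NoDecisionFor(SessionId),
}

/// Assumed shape of the contract every backend (terminal, scripted,
/// future GUI) would have to satisfy.
pub trait Prompter {
    fn ask(&mut self, session: &SessionId) -> Result<OperatorDecision, PrompterError>;
}

/// In-memory prompter that replays pre-recorded decisions.
pub struct ScriptedPrompter {
    queue: VecDeque<(SessionId, OperatorDecision)>,
}

impl ScriptedPrompter {
    pub fn from_decisions<I>(decisions: I) -> Self
    where
        I: IntoIterator<Item = (SessionId, OperatorDecision)>,
    {
        Self { queue: decisions.into_iter().collect() }
    }
}

impl Prompter for ScriptedPrompter {
    fn ask(&mut self, session: &SessionId) -> Result<OperatorDecision, PrompterError> {
        // Take the earliest remaining decision scripted for this session;
        // an unscripted session is an error, not a panic.
        match self.queue.iter().position(|(id, _)| id == session) {
            Some(idx) => Ok(self.queue.remove(idx).expect("index came from position()").1),
            None => Err(PrompterError::NoDecisionFor(session.clone())),
        }
    }
}
```

Because the tool under test only sees the `Prompter` trait, the same adjudication core could be driven by this scripted backend in tests and by a terminal backend in production.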
Fixture catalog
These are the synthetic CHAT pairs that the tests above consume. Each is small (≤20 utterances), exercises a precise invariant, and is fully fictional (no real corpus content).
The fixtures live as inline const FIX_*: &str blocks in the
respective test modules, following the precedent in
chatter/tests/integration_tests.rs (which has
const VALID_CHAT: &str = r#"..."# etc.).
FIX_REF_TWO_UTT_NO_MARKUP
The smallest possible valid CHAT pair input. Two *CHI:
utterances, no markup beyond a simple terminator, time bullets
on both. Used by cycle 1’s smoke test where the impl must
work without yet handling any markup edge cases.
FIX_ASR_LABELED_TWO_UTT
The matching donor for FIX_REF_TWO_UTT_NO_MARKUP: two
*INV: utterances at different time positions. Used by
cycle 1.
FIX_REF_CHILD_ONLY_SIMPLE
A 6-utterance child-only hand transcript with rich CHAT markup (error code, retracing, filled pause, special-form letter, zero realization with paralinguistic). Used by every L2/L3 merge test from cycle 2 onward as the canonical “File 1”, the reference / authoritative file. Has time bullets on every utterance.
FIX_ASR_ANON_2SPEAKER_SIMPLE
The matching ASR-output file with anonymous PAR0 (clinician,
asks questions) and PAR1 (child, says what FIX_REF_* shows
plus some extra). Has %wor on every utterance. Used by every
speaker-id test where auto-mode is expected to succeed cleanly
(margin >> 2.0).
FIX_ASR_LABELED_INV_SIMPLE
FIX_ASR_ANON_2SPEAKER_SIMPLE after speaker-id has run with
PAR1→drop, PAR0→INV:Investigator. Used by merge tests where
we want to skip the speaker-id step and test merge alone.
FIX_ASR_BORDERLINE_VOCABULARY
ASR file where both speakers describe the same picture-book content (margin 1.6-1.9 against reference). Used by low-confidence tests.
FIX_REF_NO_BULLETS
A reference file with no time bullets at all. Used to test
NoTimelineInFile1 precondition.
FIX_REF_LANG_ENG / FIX_ASR_LANG_YUE
Two files with conflicting @Languages. Used to test
LanguageMismatch.
FIX_AMBIGUOUS_INV
Two files both containing *INV: utterances, with
--retain CHI (INV not in retain set). Used to test
AmbiguousSpeaker.
FIX_REF_MULTI_RETAIN
Reference file containing *CHI: and *SI2: utterances (sibling
target). Used to test --retain CHI,SI2.
FIX_ASR_NO_MAIN_BULLET
Donor file where some utterances have no main-tier bullet, only
%wor. Used to test bullet-lift behavior in normalization.
FIX_OVERRIDE_VALID / FIX_OVERRIDE_WRONG_SCHEMA / FIX_OVERRIDE_MALFORMED
Override files in valid, schema-rejected, and parse-rejected shapes. Used by override-file I/O tests.
FIX_PENDING_SPEAKER_ID / FIX_PENDING_PARENT_ROLE / FIX_PENDING_MIXED_KINDS
Pending-adjudications files exercising one kind, another kind, and a mix. Used by L4 adjudication tests.
FIX_SCRIPTED_ACCEPT_ALL / FIX_SCRIPTED_OVERRIDE_FIRST_DEFER_SECOND
Scripted-decisions TOML files for ScriptedTomlPrompter.
Cover the canonical accept-suggested case and a mixed
override+defer case.
The exact bytes of each fixture are pinned in their respective test modules when the implementation lands; this plan doesn’t freeze them yet, only their purpose. Drafting the actual bytes is the first step of impl-phase work.
Coverage matrix
Cross-checking that every behavioral invariant from the four design docs has at least one test:
| Invariant source | Invariant | First-failing layer | Test name |
|---|---|---|---|
| merge user-guide | Retained byte-stable | L3 → L2 | merge_basic_clinician_pattern + merge_retained_speakers_byte_stable |
| merge user-guide | Derived tiers stripped | L3 → L2 | merge_strip_tiers_custom + merge_strips_default_derived_tiers |
| merge user-guide | Order by start_ms | L2 | merge_utterance_order_by_start_time |
| merge user-guide | Tiebreak File1 first | L2 | merge_stable_tiebreak_file1_first |
| merge user-guide | Bullets pass-through | L2 | merge_bullets_pass_through |
| merge user-guide | Bullet lift from %wor | L2 | merge_bullet_lift_from_wor |
| merge user-guide | Header reconciliation (all rows) | L2 | merge_header_* series |
| merge user-guide + memory | No overlap markers injected | L2 | merge_no_overlap_markers_injected + merge_preserves_existing_overlap_markers |
| merge user-guide | Each precondition → exit 2 | L3 | merge_*_exits_2 series in L3.2 |
| merge user-guide | Warns on bullet drift | L2 | merge_warns_on_backward_bullet_drift |
| speaker-id user-guide | Reference mode auto | L3 | speaker_id_reference_auto_clean_winner |
| speaker-id user-guide | Explicit mode | L3 | speaker_id_explicit_basic |
| speaker-id user-guide | Override-file mode | L3 | speaker_id_override_file_replay |
| speaker-id user-guide | Confidence threshold (exit 4) | L3 → L2 | speaker_id_reference_low_confidence_exits_4 + identify_mapping_borderline_refuses |
| speaker-id user-guide | Byte-stable except prefix | L2 | apply_mapping_byte_stable_except_prefix |
| speaker-id user-guide | Header rewrites | L2 + L1 | apply_mapping_rewrites_* + participants-rewrite-* specs |
| speaker-id user-guide | Provenance captured | L3 | speaker_id_reference_writes_override |
| speaker-id user-guide | Each precondition → typed error | L3 → L2 | various *_exits_2 and apply_mapping_* tests |
| speaker-id user-guide | Token cleaner spec | L1 | clean-* specs |
| speaker-id user-guide | Multiset Jaccard formula | L1 | jaccard-* specs |
| override-file ref | Schema-version refusal | L2 | override_file_refuses_* tests |
| override-file ref | Round-trip fidelity | L2 | override_file_round_trip |
| override-file ref | Deterministic serialization | L2 | override_file_deterministic_serialization |
| override-file ref | Atomic write | L2 | override_file_atomic_write |
| override-file ref | margin "unbounded" form | L2 | override_file_preserves_margin_unbounded |
| domain types | JaccardScore range | L2 | jaccard_score_new_in_range |
| domain types | ConfidenceThreshold ≥ 1 | L2 | confidence_threshold_* |
| domain types | Margin semantics | L2 | margin_* |
| domain types | RetainSet::from_str | L2 | retain_set_parse |
| domain types | InsertedRole::from_str | L2 | inserted_role_parse |
| domain types | parse_mapping_spec | L2 | mapping_spec_parse_* |
| domain types | MergeFlag serde | L2 | merge_flag_serde_* |
| domain types | Pipeline reproducibility | L3 | pipeline_replay_via_override_file |
Every invariant has at least one named test; many have multiple across layers. When the impl phase begins, the first commit should produce the fixtures, the second commit the highest-layer failing test for the simplest invariant, then drill down per the standard TDD progression.
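To make one matrix row concrete, the "Atomic write" invariant for the override file can be sketched with the usual write-to-temp-then-rename technique. This is an assumption about how such a write might be done, not the actual `OverrideFile::write` implementation; the function name `write_atomically` and the `.tmp` suffix are hypothetical.

```rust
use std::fs;
use std::io::Write;
use std::path::Path;

/// Write `contents` to `path` so that readers see either the old file or
/// the complete new file, never a partially written one.
pub fn write_atomically(path: &Path, contents: &str) -> std::io::Result<()> {
    // The temp file sits next to the target so the rename stays on one
    // filesystem.
    let tmp = path.with_extension("tmp");
    {
        let mut file = fs::File::create(&tmp)?;
        file.write_all(contents.as_bytes())?;
        // Flush to disk before the rename makes the new bytes visible.
        file.sync_all()?;
    }
    fs::rename(&tmp, path)
}
```

A test in the spirit of `override_file_atomic_write` would then assert that the target holds the full new contents after the call and that no temp file is left behind.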
What this plan does NOT cover
- Performance / scaling tests. Until the pipeline shows up on a measured workload, no targeted perf assertions. The reference corpus’s existing round-trip benchmarks remain the baseline.
- Fuzz testing. This repository now has a local fuzz/ workspace for parser/validation fuzzing. If the merge crate stabilizes enough to justify dedicated fuzzing, adding a merge-specific target for random parseable CHAT-pair inputs is a follow-up, not a v1 blocker.
- Cross-platform CI checks. Windows / Linux / macOS each build the workspace; the merge module rides the existing CI. No platform-specific tests needed (the merge operates on parsed AST and writes UTF-8; no path-or-line-ending quirks).
- Real-corpus regression sweeps. Once impl lands, running chatter merge over a curated subset of the reference corpus and snapshotting outputs is a smart follow-up. Lives in a separate tests/golden/-style mechanism if added; not designed here.
TDD authoring sequence
Each numbered item is one full RED → GREEN → REFACTOR cycle. Cycles must run in order; do not start cycle N+1 until cycle N is green and committed. Numbers are designed so the first working pipeline (cycle 8) emerges from the absolute minimum set of types + algorithms, then each later cycle extends.
The starter test for cycle 1 is intentionally tiny: a 2-utterance fixture pair with no markup, one retain speaker. The smoke test exercises every layer (parser, transform, CLI) but with the simplest possible CHAT bytes, so the first impl is small enough to land in one cycle.
Phase A, minimal end-to-end pipeline (cycles 1-8)
These cycles produce the simplest possible chatter merge
working end-to-end with synthetic fixtures.
| # | RED (failing test) | GREEN (smallest impl that passes) |
|---|---|---|
| 1 | merge_basic_smoke, L3 subprocess test against the tiniest fixture pair (FIX_REF_TWO_UTT_NO_MARKUP + FIX_ASR_LABELED_TWO_UTT), retain={CHI}, asserts exit 0 and “merged file exists” | Stub chatter merge subcommand wiring; introduce minimal talkbank-transform::transcript_merge::merge that interleaves utterances by start_ms and emits parser→serializer round-trip. No tier-stripping, no header-reconcile, no validation. Just: parse, sort, serialize. |
| 2 | merge_retained_speakers_byte_stable, L2 over the smoke fixture, asserts every CHI block byte-identical | Implement byte-stable handling for retained utterances (preserve main_raw_lines + dependent tiers exactly). |
| 3 | merge_strips_default_derived_tiers, L2 against a fixture where the donor has %wor rows | Implement tier_strip per the per-tier policy; drop %wor/%mor/%gra/%pho from inserted-speaker utts. |
| 4 | merge_utterance_order_by_start_time, L2 with a fixture where File 1 and File 2 utterances interleave | Implement timeline sort key (start_ms primary; source-order tiebreak). |
| 5 | merge_header_participants_concatenates, L2 | Implement header_reconcile::participants_merge. |
| 6 | merge_header_id_concatenates, L2 | Extend header_reconcile for @ID rows. |
| 7 | merge_header_languages_passthrough + merge_header_media_file1_wins + merge_header_comments_concatenate, L2 | Extend header_reconcile for remaining headers per the contract table. |
| 8 | merge_preconditions_retain_missing + merge_preconditions_no_timeline + merge_preconditions_language_mismatch + merge_preconditions_ambiguous_speaker, L3, each asserting exit code 2 with a specific stderr message | Implement preconditions module + map MergeError to exit codes in the CLI. |
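The ordering rule from cycle 4 (start_ms primary, File 1 first on ties, source order otherwise) can be sketched as a single stable sort. The `TimedUtterance` and `Source` types are hypothetical simplifications for illustration; the real merge operates on parsed ChatFile values.

```rust
/// Which input file an utterance came from. Deriving `Ord` makes
/// `File1` sort before `File2`, which gives the File-1-first tiebreak.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Source {
    File1,
    File2,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TimedUtterance {
    pub source: Source,
    pub start_ms: u64,
    pub text: String,
}

/// Interleave two utterance lists by start time.
pub fn interleave(file1: Vec<TimedUtterance>, file2: Vec<TimedUtterance>) -> Vec<TimedUtterance> {
    let mut all: Vec<TimedUtterance> = file1.into_iter().chain(file2).collect();
    // `sort_by_key` is stable, so utterances with an identical
    // (start_ms, source) key keep their original within-file order.
    all.sort_by_key(|u| (u.start_ms, u.source));
    all
}
```

Relying on a stable sort avoids carrying an explicit per-file index in the key, at the cost of requiring each input list to already be in source order.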
Phase A, actual cycle log
The four-precondition cycle 8 was deliberately split into four
single-variant cycles (9a / 9b / 9c / 9d) so each MergeError
variant lands with its own RED→GREEN cycle and L2 + L3 sibling
tests. The numbering here is therefore finer-grained than the
plan table above; the table records the shape of Phase A, the
log records what was actually committed.
| # | Test(s) | Layer | Status |
|---|---|---|---|
| 1 | merge_basic_smoke | L3 | done |
| 2 | merge_retained_speakers_byte_stable | L2 | done |
| 3 | merge_strips_default_derived_tiers | L2 | done |
| 4 | merge_strip_tiers_configurable | L2 | done |
| 5 | merge_strip_tiers_empty_preserves_all | L2 | done |
| 6 | merge_header_participants_concatenates | L2 | done |
| 7 | merge_header_id_concatenates | L2 | done |
| 8a | merge_header_comments_concatenate | L2 | done |
| 8b | merge_header_languages_passthrough + merge_header_media_file1_wins | L2 | done |
| 9a | merge_no_retain_speakers_in_file1 + _returns_err | L3 + L2 | done (L2 sibling backfilled in 9c) |
| 9b | merge_no_timeline_in_file1 + _returns_err | L3 + L2 | done |
| 9c | merge_language_mismatch + _returns_err | L3 + L2 | done |
| 9d | merge_ambiguous_speaker + _returns_err | L3 + L2 | done |
End of Phase A: chatter merge works on simple fixtures with
all four preconditions (retain / timeline / language / ambiguous
speaker) enforced. The pipeline is publishable as v0.
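The tier-stripping behavior exercised in cycles 3-5 can be sketched as follows. The default strip list (%wor, %mor, %gra, %pho) is taken from the plan table above; the `Utterance` struct and the `strip_tiers` signature are hypothetical simplifications, not the real `tier_strip` module.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Utterance {
    pub speaker: String,
    pub main: String,
    /// (tier name without the leading '%', tier content)
    pub dependent_tiers: Vec<(String, String)>,
}

/// Derived tiers dropped from inserted-speaker utterances by default.
pub const DEFAULT_STRIP: [&str; 4] = ["wor", "mor", "gra", "pho"];

/// Remove the named dependent tiers from every utterance whose speaker
/// is NOT in the retain set. Retained speakers are left untouched.
pub fn strip_tiers(utterances: &mut [Utterance], retain: &[&str], strip: &[&str]) {
    for utt in utterances.iter_mut() {
        if retain.contains(&utt.speaker.as_str()) {
            continue; // retained utterances must stay byte-stable
        }
        utt.dependent_tiers
            .retain(|(name, _)| !strip.contains(&name.as_str()));
    }
}
```

Passing an empty `strip` slice preserves every tier, which is the behavior `merge_strip_tiers_empty_preserves_all` is named for.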
Phase B, actual cycle log
Phase B picks up at cycle 10 in the cycle log (Phase A used 9a-9d for the precondition split).
| # | Test(s) | Layer | Status |
|---|---|---|---|
| 10 | speaker_id_explicit_basic | L3 | done |
| 11 | apply_mapping_byte_stable_except_prefix + apply_mapping_rewrites_participants + apply_mapping_rewrites_id | L2 | done (regression-guards) |
| 12 | identify_mapping_clean_winner | L2 | done |
| 13 | identify_mapping_borderline_refuses | L2 | done |
| 14 | speaker_id_reference_low_confidence_exits_4 | L3 | done |
| 15 | speaker_id_reference_writes_override (+ OverrideFile data model) | L3 | done |
| 16 | speaker_id_override_file_replay (+ OverrideFile::get) | L3 | done |
| 17 | adjudicate_speaker_id_accepts_suggested (+ adjudication core) | L4 | done |
| 18 | adjudicate_scripted_accepts_suggested (+ chatter adjudicate CLI + scripted-TOML I/O) | L3 | done |
| 19 | speaker_id_reference_writes_pending_on_low_confidence (+ --write-pending flag + LowConfidence carries DonorMatchReport) | L3 | done |
| 20 | adjudicate_speaker_id_override_mapping (+ OperatorDecision::OverrideMapping variant + scripted-TOML override-mapping shape) | L4 | done |
| 21 | adjudicate_interactive_accepts_suggested (+ TerminalPrompter + --interactive flag) | L3 | done |
| 22 | adjudicate_parent_role_lookup_chooses_role (+ PendingKindData promotion + ParentRoleLookup kind + ChooseRole decision) | L4 | done |
| 23 | adjudicate_interactive_chooses_role (+ parse_operator_response + kind-aware prompt hint) | L3 | done |
| 24 | adjudicate_interactive_override_mapping (+ parse_override_mapping + parse_speaker_assignment) | L3 | done |
| 25 | pipeline_clean_winner_end_to_end (+ chatter pipeline subcommand) | L3 | done |
| 26 | batch_pass1_single_session (+ chatter batch subcommand, subprocess driver) | L3 | done |
| 27 | batch_mixed_outcomes (regression-guard: clean+borderline aggregation) | L3 | done |
| 28 | batch_pass2_replay (+ --override-file on pipeline + batch; per-session auto-detection) | L3 | done |
| 29 | batch_skip_existing (+ --skip-existing flag on batch for idempotent re-runs) | L3 | done |
| 30 | refactor: PipelineArgs + BatchArgs structs retire three #[allow(clippy::too_many_arguments)] markers | n/a | done (true-no-op refactor; covered by cycles 25-29 regression suite) |
| 31 | refactor: split commands/speaker_id.rs (472 lines) into speaker_id/{mod,modes,writes,support}.rs (158 + 196 + 103 + 86 lines); retire 4 stale #[allow(dead_code)] markers on ReferenceModeOutcome (fields are read by write_override_entry) | n/a | done (true-no-op refactor; covered by cycles 10-29 regression suite) |
| 32 | adjudicate_sanity_scan_accept_suggested (+ AdjudicationKind::SanityScanMisclassification variant, PendingKindData::SanityScanMisclassification { suggested, reason } variant, two apply-decision arms mirroring SpeakerIdLowConfidence, terminal prompter render + prompt-hint arm) | L4 | done, adjudication kind end-to-end; the post-merge scan detector itself (heuristic + auto-pending-write) is a separate cycle 33 |
| 33 | sanity_scan_flags_inverted_mlu (+ talkbank_transform::sanity_scan::scan_session + chatter sanity-scan subcommand; mean-utterance-word-count asymmetry heuristic, default 1.5×, binary-mapping only) | L3 | done, detector + CLI end-to-end; multi-rename support, batch integration, and alternative heuristics deferred |
| 34 | batch_writes_override_for_auto_decisions (+ --write-override on both chatter pipeline and chatter batch; threaded through PipelineArgs.write_override_path + BatchArgs.write_override_path; reference-mode auto-decisions audit-trailed for sanity-scan + future re-runs) | L3 | done |
| 35 | batch_with_sanity_scan_flag_flags_inverted_mlu (+ --sanity-scan + --sanity-scan-threshold on chatter batch; post-loop subprocess driver for chatter sanity-scan; precondition validation requiring --write-override + --write-pending) | L3 | done |
| 36 | refactor: split cli/args/core.rs (984 → 747 lines): extract DebugCommands → debug_commands.rs, CacheCommands → cache_commands.rs, config enums (LogFormat, TuiMode, OutputFormat, ParserBackend, AlignmentTier) → cli_types.rs, unit-test module → core_tests.rs (via #[path]); satisfies the 800-line hard limit | n/a | done (true-no-op refactor; covered by full regression suite + 110 bin/integration tests) |
| 37+ | sanity-scan multi-rename support; diarization-mix-review kind (operator workflow design needed); newtype threading at struct seams (deferred simplify finding); apply_decision arm dedup + per-kind OperatorDecision sub-enums | L3 + L4 | pending |
Phase B, speaker-id pipeline (cycles 9-16)
These cycles add chatter speaker-id and its three modes.
| # | RED | GREEN |
|---|---|---|
| 9 | speaker_id_explicit_basic, L3 against an anonymous-2-speaker donor with --mapping "PAR0=drop,PAR1=INV:Investigator", asserts output has only INV utts | Stub chatter speaker-id subcommand. Implement parse_mapping_spec + apply_mapping. Reference mode and override-file mode return unimplemented!() for now. |
| 10 | apply_mapping_byte_stable_except_prefix + apply_mapping_rewrites_participants + apply_mapping_rewrites_id, L2 | Tighten apply_mapping per header rewrite rules. |
| 11 | identify_mapping_clean_winner, L2 with a fixture where one donor speaker overwhelmingly matches the reference | Implement text_cleaner + jaccard modules. Implement identify_mapping using them. Reference mode in CLI now works. |
| 12 | identify_mapping_borderline_refuses, L2 with a borderline fixture | Add ConfidenceThreshold check + LowConfidence error path. |
| 13 | speaker_id_reference_low_confidence_exits_4, L3 against borderline fixture | Map LowConfidence to exit code 4 in the CLI; print scores to stderr. |
| 14 | speaker_id_reference_writes_override, L3 with --write-override | Implement OverrideFile::read_or_default + OverrideFile::write. |
| 15 | speaker_id_override_file_replay, L3 with --override-file + --session-id | Implement override-file mode in CLI (OverrideFile::get + apply). |
| 16 | Token-cleaner L1 specs (a handful of representative clean-* specs from L1.1) + current spec/tools generators | Move the regex-and-string cleaner into a spec-test-covered implementation. Specs become the regression net. |
End of Phase B: the full chatter speaker-id + chatter merge
pipeline works in reference (auto), explicit, and override-file modes.
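The confidence check from cycles 12-13 can be sketched under a loudly stated assumption: that the margin is the ratio of the best donor speaker's Jaccard score to the runner-up's, with an "unbounded" form when the runner-up scores zero. This page implies that reading (a threshold of at least 1, an "unbounded" margin form) but does not state the definition, so treat the `Margin` enum, `from_scores`, and `clears` below as hypothetical.

```rust
/// Assumed margin representation: a finite best/runner-up ratio, or
/// `Unbounded` when the runner-up has no overlap at all.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Margin {
    Finite(f64),
    Unbounded,
}

impl Margin {
    pub fn from_scores(best: f64, runner_up: f64) -> Margin {
        if runner_up == 0.0 && best > 0.0 {
            Margin::Unbounded
        } else if runner_up == 0.0 {
            // Both scores are zero: no evidence either way (assumption).
            Margin::Finite(1.0)
        } else {
            Margin::Finite(best / runner_up)
        }
    }

    /// True when the winner is clear enough to accept automatically.
    pub fn clears(self, threshold: f64) -> bool {
        match self {
            Margin::Unbounded => true,
            Margin::Finite(ratio) => ratio >= threshold,
        }
    }
}

/// Exit code 4 signals low confidence, matching the plan table above.
pub fn exit_code_for(margin: Margin, threshold: f64) -> i32 {
    if margin.clears(threshold) { 0 } else { 4 }
}
```

Keeping the unbounded case as its own variant, rather than `f64::INFINITY`, would let a serializer write it out in a distinct textual form.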
Phase C, adjudication (cycles 17-22)
These cycles add the chatter adjudicate tool and its
prompter-injection testability.
| # | RED | GREEN |
|---|---|---|
| 17 | adjudicate_empty_pending_file_noop, L4 against an empty pending file, asserts exit 0 + no changes | Stub chatter adjudicate subcommand. Implement PendingAdjudications::read + run_adjudication core skeleton with a no-op Prompter trait. |
| 18 | prompter_scripted_returns_decisions_in_order, L4 | Implement ScriptedPrompter::from_decisions (in-memory) per the Prompter trait. |
| 19 | adjudicate_speaker_id_accepts_suggested, L4 against FIX_PENDING_SPEAKER_ID with one AcceptSuggested decision | Implement apply_decision for the speaker-id-low-confidence kind. Override file now gets the decision; pending entry removed. |
| 20 | adjudicate_speaker_id_override_mapping, L4 with OverrideMapping decision | Extend apply_decision for the override-mapping variant. |
| 21 | adjudicate_speaker_id_kind_mismatch_rejected, L4 with an OverrideInsertedRole against a speaker-id pending entry | Implement kind→variants validation in apply_decision. |
| 22 | adjudicate_scripted_mode_unknown_session_aborts + adjudicate_scripted_mode_extra_pending_aborts, L4 | Tighten scripted-mode validation; assert 1:1 mapping between pending entries and scripted decisions. |
End of Phase C: scripted adjudication tested end-to-end with synthetic operator inputs. Interactive terminal UX still unimplemented (next phase).
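The 1:1 check from cycle 22 can be sketched as a set comparison. The two error variant names come from the L4 test table; the string payloads, the `&str` session ids, and the function name `check_one_to_one` are hypothetical.

```rust
use std::collections::BTreeSet;

#[derive(Debug, PartialEq, Eq)]
pub enum AdjudicationError {
    ScriptedDecisionWithoutPendingEntry(String),
    PendingEntryWithoutScriptedDecision(String),
}

/// Require that scripted decisions and pending entries cover exactly the
/// same set of sessions before any decision is applied.
pub fn check_one_to_one(pending: &[&str], scripted: &[&str]) -> Result<(), AdjudicationError> {
    let pending: BTreeSet<&str> = pending.iter().copied().collect();
    let scripted: BTreeSet<&str> = scripted.iter().copied().collect();

    // A scripted decision for a session nobody is waiting on.
    if let Some(extra) = scripted.difference(&pending).next() {
        return Err(AdjudicationError::ScriptedDecisionWithoutPendingEntry(
            (*extra).to_string(),
        ));
    }
    // A pending entry the script never addresses.
    if let Some(missing) = pending.difference(&scripted).next() {
        return Err(AdjudicationError::PendingEntryWithoutScriptedDecision(
            (*missing).to_string(),
        ));
    }
    Ok(())
}
```

Running the check up front means a mismatched script aborts before the override file is touched, which keeps the failure mode all-or-nothing.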
Phase D, interactive UX (cycles 23-25)
| # | RED | GREEN |
|---|---|---|
| 23 | prompter_terminal_round_trip_decision, L4 with mocked stdin/stdout | Implement TerminalPrompter parsing [a]/[o]/[f]/... keys + optional follow-up prompts. |
| 24 | adjudicate_resumption_skips_decided_entries, L4 with a partially-decided override file + full pending list | Implement skip-already-decided logic in run_adjudication. |
| 25 | Manual smoke test (NOT automated), run chatter adjudicate --interactive against the test fixtures; visually confirm the operator UX matches the doc’s mock-up | Polish terminal output: ANSI formatting, fixed-width alignment, the [m] Show more context action, the [p] Play media action. |
End of Phase D: full v1 pipeline complete.
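The resumption behavior from cycle 24 amounts to filtering the pending list against sessions that already have an override entry. A minimal sketch, with a closure standing in for the Prompter and plain string session ids (both assumptions made for the example):

```rust
use std::collections::HashSet;

/// Ask about every pending session that has no decision yet and return
/// how many prompts were issued.
pub fn run_pending<F>(pending: &[&str], decided: &HashSet<&str>, mut ask: F) -> usize
where
    F: FnMut(&str),
{
    let mut asked = 0;
    for &session in pending {
        if decided.contains(session) {
            continue; // already adjudicated in an earlier run
        }
        ask(session);
        asked += 1;
    }
    asked
}
```

With three pending entries of which two are already decided, the prompt callback fires exactly once, which is the assertion `adjudicate_resumption_skips_decided_entries` makes.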
Phase E, non-speaker-id adjudication kinds (cycles 26-29)
Each adjudication kind gets its own RED→GREEN cycle.
| # | RED | GREEN |
|---|---|---|
| 26 | adjudicate_parent_role_overrides_to_mother + adjudicate_parent_role_overrides_to_father, L4 | Implement parent-role-lookup kind end-to-end (pending schema, prompter context, decision application). |
| 27 | adjudicate_diarization_mix_flag_only, L4 | Implement diarization-mix-review kind end-to-end. |
| 28 | adjudicate_sanity_scan_swap_mapping, L4 | Implement sanity-scan-misclassification kind end-to-end. |
| 29 | adjudicate_re_adjudicate_preserves_history, L4 | Implement --re-adjudicate flag; add history field to MergeOverride. |
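The history-preserving behavior from cycle 29 can be sketched with a simplified override entry. The `history` field name follows the plan table; representing decisions as strings and the method name `re_adjudicate` are assumptions for illustration only.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct MergeOverride {
    /// The decision currently in force.
    pub decision: String,
    /// Earlier decisions, oldest first.
    pub history: Vec<String>,
}

impl MergeOverride {
    pub fn new(decision: &str) -> Self {
        Self { decision: decision.to_string(), history: Vec::new() }
    }

    /// Replace the current decision while keeping the prior one on record.
    pub fn re_adjudicate(&mut self, new_decision: &str) {
        let previous = std::mem::replace(&mut self.decision, new_decision.to_string());
        self.history.push(previous);
    }
}
```

Appending rather than overwriting keeps the audit trail intact across repeated re-adjudication of the same session.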
Phase F, breadth pass (cycles 30+)
Fill in every remaining test from L1-L4 that hasn’t been written yet. These are coverage-deepening tests, not behavior adders. The impl from Phases A-E should pass them with at most minor refactoring; if a test fails meaningfully, that’s a gap in the impl that this cycle closes.
The breadth pass is the only phase where multiple cycles can proceed in parallel (different contributors take different test groups). Phases A-E are strictly serial.
Hard rules during impl phase
- No test stubs. Every test in this plan, when written, must FAIL before its impl exists and PASS after. Skipped or #[ignore]-marked tests are not allowed in the regression net (use #[ignore] only for genuinely slow or environment-dependent tests, not for “not implemented yet”).
- No test deletion to make CI green. If a test that was passing starts failing after a refactor, the refactor is wrong. Investigate; do not delete the test.
- Three cycle archetypes, distinguish them. A cycle is one of:
  - bug-fix: RED motivates new impl code (cycle N-1’s impl truly cannot satisfy the new test).
  - regression-guard: RED pins an invariant the impl inherits from upstream infrastructure (e.g. parse→serialize byte-stability inherited from talkbank-parser). The test passes against cycle N-1’s impl, but the cycle is valuable because it locks in the invariant against future “optimizations” that might break it. Verbose-output the actual behavior on first run to confirm the invariant holds for the right reasons, not by accident.
  - true no-op: RED tests something already pinned elsewhere. These ARE unnecessary; drop the cycle or sharpen the test.

  The difference between regression-guard and true no-op is whether the invariant is named explicitly anywhere else. If yes (e.g., the parser crate already has a roundtrip test that covers it), the cycle is true-no-op. If no, the cycle is a regression-guard and worth keeping.
Merge Pipeline, Crate Architecture
Status: Draft Last modified: 2026-06-14 19:57 EDT
This page explains where the new merge-pipeline code lives in the
chatter workspace, which crates gain modules, what
depends on what, and which boundary each piece sits inside. The
goal is succession-readability: a contributor coming to this
work for the first time should be able to map a behavior they
read about in
chatter merge or
chatter speaker-id to
the precise crate + module that implements it.
Companion documents:
- Domain Types: what types live in talkbank-model::merge.
- Test Plan: what tests live where.
- Override File Format: the on-disk format.
Boundary decisions
Two boundary decisions govern where every new piece of code lives.
Both reference rules already documented in this repo’s root CLAUDE.md
(workspace-root contributor guide, outside the book).
Decision 1: talkbank-* crates, not batchalign-* crates
The merge pipeline is pure CHAT-AST structural manipulation: no ML, no audio I/O, no network, no model loading, no fleet runtime. Per the crate-boundary decision test in the workspace CLAUDE.md:
If code fundamentally needs ML models, audio processing, network services, or fleet runtime → batchalign-* crate. Otherwise → talkbank-* crate.
chatter merge and chatter speaker-id answer “no” to each
ML/audio/network/runtime question. They consume parsed
ChatFile values, manipulate them, and emit parsed-and-serialized
output. Even the speaker-id text-similarity scoring is a
deterministic function over CHAT content tokens, no ML model,
no embedding, no inference. All new merge code lives in
talkbank-* crates.
The batchalign-* crates remain the home for batchalign3 transcribe (ASR), batchalign3 align (forced alignment), and
batchalign3 morphotag (Stanza-based morphological tagging),
the ML-bearing stages that surround the merge in the pipeline.
Decision 2: types in talkbank-model, algorithms in talkbank-transform, CLI in chatter
The merge pipeline’s code splits across exactly the same three crates that already host the parse/validate/normalize/JSON pipelines:
- talkbank-model owns the typed vocabulary (domain types, errors). No algorithms.
- talkbank-transform owns the algorithms (token cleaning, Jaccard scoring, mapping application, structural merge). No CLI parsing, no clap.
- chatter owns the subcommands (chatter speaker-id, chatter merge). Thin shim layer that parses arguments and drives the transform layer.
This mirrors how chatter validate, chatter normalize,
chatter to-json are wired today and keeps the crate boundaries
honest: a future caller wanting only the algorithms (e.g., a
library binding, an HTTP service) can depend on
talkbank-transform without pulling in clap. A future caller
wanting only the types (e.g., an external tool reading override
files) can depend on talkbank-model without pulling in the
tree-sitter parser.
Crate dependency graph
The new code does not introduce any new crate-level dependencies; every edge below already exists in the workspace today. The merge work adds modules to existing crates.
flowchart TD
derive["talkbank-derive\n(proc macros, unchanged)"]
model["talkbank-model\n(+ merge module)"]
parser["talkbank-parser\n(unchanged)"]
transform["talkbank-transform\n(+ transcript_merge, speaker_id modules)"]
cli["chatter\n(+ merge, speaker-id subcommands)"]
cli_tests["chatter/tests/\n(+ merge_tests.rs)"]
transform_tests["talkbank-transform/tests/\n(+ transcript_merge_tests, speaker_id_tests, override_file_tests)"]
spec["spec/constructs/speaker-id/\n(token cleaner + Jaccard specs)"]
parser_tests["talkbank-parser-tests\n(consumes regenerated specs)"]
derive --> model
model --> parser
model --> transform
parser --> transform
transform --> cli
model --> cli
transform --> transform_tests
transform --> cli_tests
cli --> cli_tests
spec -.->|spec/tools generators| parser_tests
model -.-> parser_tests
Dashed edges (-.->) are build-time rather than dependency edges:
the current spec/tools generators regenerate Rust tests under
talkbank-parser-tests from spec markdown, but the spec
directory is not a Cargo crate.
Module layout per affected crate
talkbank-model, new merge/ module
Adds a top-level merge module under
crates/talkbank-model/src/. Layout:
crates/talkbank-model/src/merge/
mod.rs, pub re-exports
scoring.rs, JaccardScore, ConfidenceThreshold, Margin
role.rs, InsertedRole, MappingAction
mapping.rs, SpeakerMapping + parse_mapping_spec
retain.rs, RetainSet
override_file.rs, DecisionMode, MergeFlag, OperatorId,
SessionId, MergeOverride, OverrideFile
errors.rs, SpeakerIdError, MergeError, OverrideFileError
Per the file-size rule (≤400 lines target, ≤800 hard) each file
stays modest. Re-exports go through mod.rs:
// crates/talkbank-model/src/merge/mod.rs
pub mod scoring;
pub mod role;
pub mod mapping;
pub mod retain;
pub mod override_file;
pub mod errors;
pub use errors::{MergeError, OverrideFileError, SpeakerIdError};
pub use mapping::{parse_mapping_spec, SpeakerMapping};
pub use override_file::{MergeOverride, OverrideFile, ...};
pub use retain::RetainSet;
pub use role::{InsertedRole, MappingAction};
pub use scoring::{ConfidenceThreshold, JaccardScore, Margin};
Exposed at the crate root via the existing
crates/talkbank-model/src/lib.rs pattern:
pub mod merge;
No new external crate dependency: chrono and toml are already
pinned at workspace level. The merge module pulls them in via
{ workspace = true } annotations in
crates/talkbank-model/Cargo.toml.
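To give a feel for what `mapping.rs` is responsible for, here is a minimal sketch of parsing a `--mapping` spec such as `PAR0=drop,PAR1=INV:Investigator` (the form shown in the test plan). The return type, the `MappingAction` variants' fields, and the `MappingSpecError` enum are assumptions for the example; the real `parse_mapping_spec` returns the crate's own `SpeakerMapping` and error types.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum MappingAction {
    /// Remove every utterance by this donor speaker.
    Drop,
    /// Relabel the donor speaker with a speaker code and role.
    Rename { code: String, role: String },
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum MappingSpecError {
    MissingEquals(String),
    MissingRole(String),
    EmptyField(String),
    DuplicateSpeaker(String),
}

pub fn parse_mapping_spec(
    spec: &str,
) -> Result<BTreeMap<String, MappingAction>, MappingSpecError> {
    let mut mapping = BTreeMap::new();
    for entry in spec.split(',') {
        let entry = entry.trim();
        let (from, to) = entry
            .split_once('=')
            .ok_or_else(|| MappingSpecError::MissingEquals(entry.to_string()))?;
        if from.is_empty() || to.is_empty() {
            return Err(MappingSpecError::EmptyField(entry.to_string()));
        }
        let action = if to == "drop" {
            MappingAction::Drop
        } else {
            // Anything other than "drop" must be CODE:Role.
            let (code, role) = to
                .split_once(':')
                .ok_or_else(|| MappingSpecError::MissingRole(entry.to_string()))?;
            if code.is_empty() || role.is_empty() {
                return Err(MappingSpecError::EmptyField(entry.to_string()));
            }
            MappingAction::Rename { code: code.to_string(), role: role.to_string() }
        };
        if mapping.insert(from.to_string(), action).is_some() {
            return Err(MappingSpecError::DuplicateSpeaker(from.to_string()));
        }
    }
    Ok(mapping)
}
```

A `BTreeMap` is used here so that iteration order is deterministic, which fits the deterministic-serialization invariant the override file is tested for.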
talkbank-transform, new speaker_id/ and transcript_merge/ modules
Two sibling top-level modules, mirroring the user-facing distinction between the two subcommands:
crates/talkbank-transform/src/speaker_id/
mod.rs, identify_mapping, apply_mapping
text_cleaner.rs, content-token extraction from ChatFile
jaccard.rs, multiset Jaccard over Counter<&str>
header_rewrite.rs, @Participants / @ID rewriting per mapping
crates/talkbank-transform/src/transcript_merge/
mod.rs, pub fn merge(...) entry point
timeline.rs, utterance ordering by start_ms
bullet_lift.rs, derive main-tier bullet from %wor
tier_strip.rs, strip downstream-owned dependent tiers
header_reconcile.rs, @Languages match, @Participants concat, etc.
preconditions.rs, RetainSpeakersMissing, NoTimelineInFile1, etc.
Both modules land alongside the existing CHAT-core transform modules
(parse, serialize, validate, normalize) in talkbank-transform.
Exposed via crates/talkbank-transform/src/lib.rs:
pub mod speaker_id;
pub mod transcript_merge;
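The scoring step in `jaccard.rs` can be illustrated with the standard multiset Jaccard definition: the sum of per-token minimum counts divided by the sum of per-token maximum counts. The exact formula is specified in the speaker-id user guide and is not reproduced on this page, so treat the definition, the `multiset_jaccard` name, and the empty-input convention below as assumptions.

```rust
use std::collections::HashMap;

/// Count how many times each token occurs.
fn count_tokens<'a>(tokens: &[&'a str]) -> HashMap<&'a str, usize> {
    let mut counts = HashMap::new();
    for &tok in tokens {
        *counts.entry(tok).or_insert(0) += 1;
    }
    counts
}

/// Multiset Jaccard similarity in [0, 1].
/// Returns 0.0 when both inputs are empty (a convention of this sketch).
pub fn multiset_jaccard(a: &[&str], b: &[&str]) -> f64 {
    let counts_a = count_tokens(a);
    let counts_b = count_tokens(b);

    let mut intersection_size = 0usize;
    let mut union_size = 0usize;
    for (tok, &na) in &counts_a {
        let nb = counts_b.get(tok).copied().unwrap_or(0);
        intersection_size += na.min(nb);
        union_size += na.max(nb);
    }
    // Tokens that occur only in `b` contribute to the union alone.
    for (tok, &nb) in &counts_b {
        if !counts_a.contains_key(tok) {
            union_size += nb;
        }
    }

    if union_size == 0 {
        0.0
    } else {
        intersection_size as f64 / union_size as f64
    }
}
```

For the token lists `a a b` and `a b b`, the minimum counts sum to 2 and the maximum counts sum to 4, giving a score of 0.5.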
chatter, new speaker_id/ and transcript_merge/ command directories
The CLI dispatch pattern in this crate uses one directory per
multi-file command (e.g. commands/validate/, commands/find/,
commands/alignment/) or one file for single-file commands
(commands/normalize.rs, commands/lint.rs). Both new
subcommands warrant directories because each has multiple
operation modes (speaker-id has reference / explicit / override-file
modes; merge has the main merge path plus probably a future
merge --check mode).
crates/chatter/src/commands/speaker_id/
mod.rs, clap subcommand dispatch
args.rs, flag parsing, mode disambiguation
reference_mode.rs, drives identify_mapping + apply_mapping
explicit_mode.rs, drives parse_mapping_spec + apply_mapping
override_mode.rs, drives OverrideFile::read + apply_mapping
output.rs, formats per-speaker scores to stderr,
writes override file via --write-override
crates/chatter/src/commands/transcript_merge/
mod.rs, clap subcommand dispatch
args.rs, --retain, --strip-tiers parsing
runner.rs, drives the merge pipeline
output.rs, exit-code mapping, error formatting
The CLI argument enums extend
crates/chatter/src/cli/args.rs’s top-level Commands
enum:
// in crates/chatter/src/cli/args.rs
pub enum Commands {
Validate(/* ... */),
Normalize(/* ... */),
// ... existing variants ...
SpeakerId(commands::speaker_id::args::SpeakerIdArgs), // NEW
Merge(commands::transcript_merge::args::MergeArgs), // NEW
}
Subcommand dispatch in crates/chatter/src/main.rs already
matches on the Commands enum; the new arms wire to the
respective commands::*::run entry points.
Test crates
Per the Test Plan:
crates/talkbank-transform/tests/
speaker_id_tests.rs, L2 tests for identify_mapping / apply_mapping
transcript_merge_tests.rs, L2 tests for merge invariants
override_file_tests.rs, L2 tests for round-trip / refusal
crates/chatter/tests/
merge_tests.rs, L3 subprocess tests for both new commands
spec/constructs/speaker-id/
token-cleaner/, L1 fragment specs
jaccard-scoring/, L1 golden Jaccard specs
mapping-application/, L1 header rewrite specs
The spec entries flow into Rust tests under
crates/talkbank-parser-tests/tests/generated/ via the standard
current spec/tools workflow.
Data flow for chatter merge
The full call graph when an operator runs
chatter merge file1.cha file2.cha --retain CHI -o out.cha:
sequenceDiagram
actor Operator
participant CLI as chatter<br/>(merge)
participant Args as commands::transcript_merge::args
participant Runner as commands::transcript_merge::runner
participant Parser as talkbank-parser<br/>(TreeSitterParser)
participant Merge as talkbank-transform::transcript_merge
participant Model as talkbank-model::ChatFile
Operator->>CLI: chatter merge file1 file2 --retain CHI
CLI->>Args: parse argv → MergeArgs
Args-->>CLI: MergeArgs { file1, file2, retain: RetainSet, ... }
CLI->>Runner: run(args)
Runner->>Parser: parse_chat_file(file1)
Parser-->>Runner: ChatFile (file1)
Runner->>Parser: parse_chat_file(file2)
Parser-->>Runner: ChatFile (file2)
Runner->>Merge: merge(file1, file2, retain)
Merge->>Merge: header_reconcile · timeline · tier_strip · bullet_lift
Merge-->>Runner: ChatFile (merged) or MergeError
alt Ok(merged)
Runner->>Model: merged.write_chat() to output path
Runner-->>Operator: exit 0
else Err(MergeError)
Runner->>Operator: formatted stderr + exit code 2
end
The CLI layer is thin: it parses arguments, calls the
transform layer’s merge function, and translates the
Result<ChatFile, MergeError> into stdout/stderr/exit-code
output. All algorithm logic lives in talkbank-transform.
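A minimal sketch of that translation step. The exit codes 0 and 2 come from the sequence diagram above; everything else is an assumption for illustration: the two `MergeError` variants are hypothetical stand-ins, `exit_status` is not a real function in the crate, and the generic `T` takes the place of the merged `ChatFile`.

```rust
use std::fmt;

/// Hypothetical stand-in for the transform layer's error type.
/// The real `MergeError` variants are not listed on this page.
#[derive(Debug)]
pub enum MergeError {
    UnbulletedReference,
    HeaderConflict(String),
}

impl fmt::Display for MergeError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            MergeError::UnbulletedReference => {
                write!(f, "reference file has no time bullets")
            }
            MergeError::HeaderConflict(header) => {
                write!(f, "conflicting header: {header}")
            }
        }
    }
}

/// Translate the transform-layer result into an exit code plus an
/// optional stderr message. `T` stands in for the merged `ChatFile`.
pub fn exit_status<T>(result: &Result<T, MergeError>) -> (u8, Option<String>) {
    match result {
        Ok(_) => (0, None),
        Err(err) => (2, Some(format!("error: {err}"))),
    }
}
```

Keeping this function free of algorithm logic is the point: the CLI only decides how a result is reported, never what the result is.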
Data flow for chatter speaker-id
The reference-mode call path:
sequenceDiagram
actor Operator
participant CLI as chatter<br/>(speaker-id)
participant Runner as commands::speaker_id::reference_mode
participant Parser as talkbank-parser
participant SpkId as talkbank-transform::speaker_id
participant Override as talkbank-model::merge::override_file
Operator->>CLI: chatter speaker-id input --reference ref --anchor CHI<br/>--inserted-role INV:Investigator
CLI->>Runner: run(args)
Runner->>Parser: parse_chat_file(input)
Parser-->>Runner: ChatFile (donor)
Runner->>Parser: parse_chat_file(reference)
Parser-->>Runner: ChatFile (reference)
Runner->>SpkId: identify_mapping(donor, reference, anchor, role, threshold)
SpkId-->>Runner: SpeakerMapping or LowConfidence{scores, margin}
alt Ok(mapping)
Runner->>SpkId: apply_mapping(donor, mapping)
SpkId-->>Runner: ChatFile (relabeled)
opt --write-override
Runner->>Override: OverrideFile::read_or_default(path)
Override-->>Runner: OverrideFile
Runner->>Override: insert entry, write back
end
Runner-->>Operator: relabeled output, exit 0
else Err(LowConfidence)
Runner-->>Operator: scores to stderr, exit 4
end
The explicit-mapping and override-file modes use the same
apply_mapping and --write-override paths but skip
identify_mapping: the mapping comes from
parse_mapping_spec or OverrideFile::get respectively.
How this composes with the post-merge ML stages
The end-to-end pipeline batchalign3 transcribe → chatter speaker-id → chatter merge → batchalign3 align → batchalign3 morphotag crosses the talkbank-* / batchalign-* boundary
twice:
flowchart LR
subgraph BA[Batchalign, ML / audio / network]
Trans["batchalign3 transcribe"]
Align["batchalign3 align"]
Morph["batchalign3 morphotag"]
end
subgraph TB[talkbank, pure CHAT-AST]
SpkId["chatter speaker-id"]
Merge["chatter merge"]
end
Media["mp4 / wav media"] --> Trans
Trans -->|ASR.cha| SpkId
Hand["hand transcript.cha"] -->|reference| SpkId
Hand --> Merge
SpkId -->|labeled.cha| Merge
Merge -->|merged.cha| Align
Align -->|+ bullets + %wor| Morph
Morph -->|+ %mor + %gra| Final["final.cha"]
Each crossing is CHAT-file-to-CHAT-file at a stable
serialization boundary: Batchalign emits a CHAT file, talkbank
consumes it; talkbank emits a CHAT file, Batchalign consumes
it. Neither side has a runtime dependency on the other; they
exchange data through the file system (or piped stdin/stdout)
exactly as the user-facing CLI commands do. This keeps the
boundary honest: a contributor working on the merge pipeline
never needs to load a Stanza model, and a contributor working
on batchalign3 align never needs to parse a speaker-id
override file.
Public surface impact
Cumulative public API additions (the surface a downstream library consumer would see):
| Crate | New pub items | Stability |
|---|---|---|
| talkbank-model | merge::{SpeakerCode, ParticipantRole, ...} (re-exports for ergonomics; the underlying types already exist), plus the new types in merge::scoring/role/mapping/retain/override_file/errors | Stable, versioned via the workspace’s existing release process |
| talkbank-transform | speaker_id::{identify_mapping, apply_mapping, LowConfidenceError}; transcript_merge::{merge} | Stable; the algorithms behind these are pinned by the test plan’s L2 tests |
| chatter | Two new Commands enum variants and their argument structs | Internal to the binary, not a library surface |
No existing public surface is modified or removed; this is a
purely-additive change. Existing consumers (the VS Code
extension, talkbank-lsp, chatter-desktop, batchalign)
continue to depend on the existing surface and can ignore the
additions until a workflow uses them.
Where to look for things (newcomer guide)
| Question | File |
|---|---|
| “What does chatter merge do?” | book/src/chatter/user-guide/merge.md |
| “What does chatter speaker-id do?” | book/src/chatter/user-guide/speaker-id.md |
| “What’s in an override file?” | book/src/chatter/integrating/merge-overrides.md |
| “What types are in talkbank-model::merge?” | book/src/architecture/merge-domain-types.md |
| “Where are the tests?” | book/src/architecture/merge-test-plan.md |
| “Which crate is this code in and why?” | This page |
| “Where does the merge code live in source?” | crates/talkbank-transform/src/speaker_id/ + crates/talkbank-transform/src/transcript_merge/ + crates/chatter/src/commands/speaker_id/ + crates/chatter/src/commands/transcript_merge/ |
| “What’s in an utterance / ChatFile / %mor tier?” | talkbank-model crate rustdoc; book/src/architecture/chat-model/chat-model.md |
| “What does the parser do?” | book/src/architecture/parsing.md; book/src/architecture/parser-model-contracts.md |
Adjudication Workflow
Status: Draft Last updated: 2026-06-13 20:54 EDT
This page specifies how human-in-the-loop adjudication fits into the merge pipeline. Several pipeline stages have decision points where the algorithm cannot or should not auto-decide; this document specifies how those refusals reach an operator, how the operator’s decision is recorded, and how the pipeline resumes with the decision applied.
The design satisfies two constraints set explicitly upstream:
- Test the interaction. Every operator-decision path must be exercisable in automated tests by providing synthetic operator choices. No hardcoded stdin reads in the decision core; a pluggable prompter abstraction is mandatory.
- Batch-then-review is the default workflow. No mid-batch interactive pauses in the main pipeline. The optional --interactive flag exists on the adjudication tool only, for small-batch debugging, and rides on the same data contract.
Companion documents:
- Merge Override File Format: the on-disk record of decisions.
- Domain Types: SpeakerMapping, MergeOverride, etc.
- Test Plan: where the adjudication tests live.
- Crate Architecture: where the adjudication code lives.
Why batch-then-review, and not real-time
Every adjudication point in the pipeline is per-session local: the operator’s decision affects this session’s output and no other session in the same batch. There is no case where an operator decision propagates forward to influence how other sessions get processed.
The cases that might appear to want real-time interaction are better served by sampling:
| Case | Real-time approach | Better approach |
|---|---|---|
| Systematic pipeline failure (everything refuses) | Watch each refusal, abort batch | Run a 5-10-session canary first; examine; abort or proceed |
| Confidence-threshold calibration on a new corpus | Adjust threshold mid-batch | Run canary; pick threshold; full batch |
| Cross-session pattern (one contributor always has PAR0 = clinician) | Notice during interactive review | Run canary; observe pattern; add per-contributor explicit mapping to orchestrator config |
| Operator wants per-session progress visibility | Watch each step | chatter adjudicate --interactive after a batch run, walking the same pending queue |
TalkBank’s operational reality makes batch-then-review strictly better:
- Batches are research-scale (hundreds of sessions per donor). Requiring operator presence during the batch run means hours of babysitting.
- Overnight and fleet runs are routine; interactive doesn’t work for those.
- Focused operator review of all refusals together is more efficient than scattered per-batch decisions (less context-switching; easier to spot patterns across sessions).
- Aligns with the project’s “academic research, accuracy is the standard, take however long it takes” rule: operator efficiency dominates wall-clock latency.
The --interactive flag is preserved for the small-batch
debugging case but is explicitly NOT the dominant workflow.
The known adjudication points
The pipeline has at least five points where adjudication may be needed. Each is recorded as one or more entries in the override file via the same schema.
| # | Adjudication point | Trigger | Operator’s decision | Affects |
|---|---|---|---|---|
| 1 | Speaker-id low confidence | chatter speaker-id Jaccard margin < threshold | Per-speaker mapping (drop/rename) and inserted_role | Speaker labeling, drop set, downstream merge |
| 2 | Parent role lookup | Parent-sample session needs MOT vs FAT decision | inserted_role.code and .tag for this session | The merged file’s headers + main-tier prefixes |
| 3 | Diarization-mix flag | Operator observes Batchalign collapsed multiple real-world speakers into one label | flags = ["diarization-mixed"] plus a note | Downstream consumers know output is imperfect; might gate publication |
| 4 | Post-merge sanity scan | Auto-scan flags retained-speaker utterances with high-text-similarity inserted-speaker utterances nearby (suggesting speaker-id misclassification) | Confirm or override the original speaker-id mapping | Triggers re-run of speaker-id + merge for the session |
| 5 | Unbulleted reference file | Reference CHAT file has no time bullets; merge can’t proceed | Either bullet the reference upstream, or request fresh authoritative data | Pipeline blocked for this session pending external fix |
Points 1-4 are handled by the unified chatter adjudicate tool
specified below. Point 5 is an out-of-scope failure mode: the
adjudication tool records that the session is blocked, but the
fix lives outside this pipeline (operator contacts the
contributor or runs forced-alignment first).
Data flow
flowchart TD
Inputs["Input CHAT files +<br/>reference files"]
Orch["Orchestrator<br/>(future: tb subcommand;<br/>now: shell/script)"]
SpkId["chatter speaker-id<br/>(per session)"]
Merge["chatter merge<br/>(per session)"]
Pending["pending-adjudications.toml<br/>(workflow queue)"]
Override["overrides.toml<br/>(durable decisions)"]
Adj["chatter adjudicate"]
Operator((Operator))
Final["merged/*.cha"]
Inputs --> Orch
Orch -->|pass 1: speaker-id| SpkId
SpkId -->|exit 0 → auto entry| Override
SpkId -->|exit 4 → pending entry| Pending
Orch -->|pass 1: merge for ok sessions| Merge
Merge --> Final
Pending --> Adj
Override --> Adj
Adj <-->|prompter| Operator
Adj -->|writes decision| Override
Adj -->|removes resolved| Pending
Override -->|pass 2| Orch
Orch -.->|loop until pending empty| SpkId
The orchestrator runs two passes:
Pass 1: for every input session, run chatter speaker-id in
reference mode. Successful auto-decides write to the override
file with mode = "auto" and immediately proceed to chatter merge. Refusals (exit code 4) and other adjudication-requiring
states write a pending entry to pending-adjudications.toml
and the session is skipped for the rest of pass 1.
Pass 2 (after operator runs chatter adjudicate): the
orchestrator re-runs chatter speaker-id for the previously
skipped sessions, now finding decisions in the override file
(mode = "override"). Sessions complete; pending entries are
removed.
The pipeline is idempotent: re-running pass 1 on a partially adjudicated batch produces no spurious work, because sessions with already-recorded decisions skip directly to merge.
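The idempotency rule can be pictured as a per-session planning step. This is an illustration only: `Plan`, `plan_session`, and the use of a plain set of session IDs in place of the real OverrideFile lookup are assumptions, not names from the codebase.

```rust
use std::collections::BTreeSet;

/// What the orchestrator should do with one session on a given pass.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Plan {
    /// A decision is already recorded: apply it and merge.
    MergeDirectly,
    /// No decision yet: run speaker-id and see whether it auto-decides.
    RunSpeakerId,
}

/// Decide the plan for `session_id` given the set of session IDs that
/// already have an entry in the override file.
pub fn plan_session(session_id: &str, decided: &BTreeSet<String>) -> Plan {
    if decided.contains(session_id) {
        Plan::MergeDirectly
    } else {
        Plan::RunSpeakerId
    }
}
```

Because the plan depends only on the recorded decisions, running the same pass twice over the same override file yields the same plan for every session.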
The pending-adjudications artifact
Separate from the override file, a pending-adjudications.toml
file holds in-flight workflow state. Its purpose is to carry
the evidence the operator needs (per-speaker scores, opening
utterance previews) from the orchestrator’s pass 1 to the
adjudication tool, without polluting the override file with
“to-do” entries.
Schema
schema_version = 1
[[entries]]
session_id = "session-102-t1"
kind = "speaker-id-low-confidence"
created_at = 2026-05-27T11:00:00-04:00
# Inputs the adjudication tool needs:
input_path = "asr/session-102-t1.cha"
reference_path = "chi-only/session-102-t1.cha"
anchor_speaker = "CHI"
# Evidence for the operator:
scores = { PAR0 = 0.6286, PAR1 = 0.3457 }
margin = 1.82
threshold_used = 2.0
# Opening turns (first N utterances per speaker) for context:
preview = """
*CHI: they start to bite . [0_1708]
*PAR0: They start to bite . [75_1165]
*PAR1: They do what . [1515_2245]
... (further preview)
"""
# Suggested defaults the operator can accept-as-is:
suggested = { mapping = { PAR0 = "drop", PAR1 = "rename" }, inserted_role = { code = "INV", tag = "Investigator" } }
[[entries]]
session_id = "session-103-t1-parent"
kind = "parent-role-lookup"
# ... different evidence for the MOT-vs-FAT case ...
Schema characteristics
- kind discriminates the adjudication type (one of speaker-id-low-confidence, parent-role-lookup, diarization-mix-review, sanity-scan-misclassification). Each kind has its own required field set; the adjudication tool dispatches on kind to choose the right prompt template and the right validator for the operator’s response.
- suggested carries what the algorithm WOULD have chosen had the threshold been lower (for speaker-id) or a parsed default (for parent-role). The operator can accept-as-is or override.
- Entries are a [[entries]] array of tables (not a session-keyed [<session_id>] map) because the same session could conceivably have multiple pending decisions (e.g., a speaker-id refusal AND a parent-role lookup), each a separate array entry.
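The scores, margin, and threshold_used values in the example entry are consistent with the margin being the ratio of the best score to the runner-up (0.6286 / 0.3457 ≈ 1.82). The sketch below assumes that reading; the function names are hypothetical, and the real scoring code in talkbank-transform may handle ties, zero scores, or more than two speakers differently.

```rust
/// Ratio of the highest score to the second-highest.
/// Returns `None` when there are fewer than two speakers or the
/// runner-up score is zero (the ratio would be undefined).
pub fn margin(scores: &[(&str, f64)]) -> Option<f64> {
    let mut sorted: Vec<f64> = scores.iter().map(|(_, s)| *s).collect();
    // Sort descending; `total_cmp` gives a total order over f64.
    sorted.sort_by(|a, b| b.total_cmp(a));
    match sorted.as_slice() {
        [best, second, ..] if *second > 0.0 => Some(best / second),
        _ => None,
    }
}

/// A session is refused (sent to adjudication) when the margin falls
/// below the configured threshold.
pub fn is_low_confidence(margin: f64, threshold: f64) -> bool {
    margin < threshold
}
```

With the example entry's scores, the margin comes out just under 1.82, which is below the 2.0 threshold, so the session lands in the pending file.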
Lifecycle
- Written by: the orchestrator’s pass 1, when chatter speaker-id exits with code 4 or when other adjudication triggers fire.
- Consumed by: chatter adjudicate, which reads it, prompts the operator entry-by-entry, writes decisions to the override file, and removes resolved entries.
- Cleaned up: an empty entries array is the “all clear” state; pass 2 of the orchestrator can proceed.
chatter adjudicate, CLI surface
A new subcommand of the chatter CLI. Its job is to walk
a pending-adjudications file and write decisions to an override
file.
chatter adjudicate <PENDING_FILE> --override-file <OVERRIDE_FILE> [OPTIONS]
ARGUMENTS:
<PENDING_FILE> Path to pending-adjudications.toml.
REQUIRED OPTIONS:
--override-file <PATH>
Path to the override file (created if missing, appended if
existing). Decisions go here.
OPTIONS:
--interactive
(default) Prompt the operator for each pending entry via
a terminal UI. This is the only interactive backend for v1;
later UI backends may add e.g. --backend=web for web-served
prompts.
--scripted <PATH>
Read pre-canned decisions from a TOML file. Used in tests
and in automated bulk-decision workflows (e.g., the
operator has prepared a decision sheet in advance).
Mutually exclusive with --interactive.
--kind <KIND>
Process only pending entries whose `kind` matches. Useful
when the operator wants to batch through one class of
decision at a time (e.g., do all parent-role lookups
first, then all speaker-id refusals).
--skip-on-error
If the operator's response cannot be applied (e.g., they
typed an invalid speaker code), log and skip rather than
abort. Default: abort on first invalid response.
--operator <NAME>
Operator identifier recorded in override entries.
Default: $USER.
--dry-run
Read pending and prompt the operator, but do NOT write to
the override file. Useful for previewing what decisions
look like before committing.
Exit codes:
| Code | Meaning |
|---|---|
| 0 | All pending entries decided; pending file updated |
| 1 | I/O error (missing file, unparseable, write failure) |
| 2 | Operator-supplied decision rejected as invalid (when --skip-on-error not set) |
| 3 | Internal error |
| 4 | Operator deferred at least one entry (used :skip in the prompt); pending file still has entries |
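One way to keep these codes in a single place is a small enum that converts into `std::process::ExitCode`. This is a sketch under assumptions: the enum and variant names are hypothetical; only the numeric values and their meanings come from the table above.

```rust
use std::process::ExitCode;

/// Outcomes of a `chatter adjudicate` run, mirroring the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AdjudicateExit {
    AllDecided,
    IoError,
    InvalidDecision,
    InternalError,
    Deferred,
}

impl AdjudicateExit {
    /// Numeric process exit code for this outcome.
    pub fn code(self) -> u8 {
        match self {
            AdjudicateExit::AllDecided => 0,
            AdjudicateExit::IoError => 1,
            AdjudicateExit::InvalidDecision => 2,
            AdjudicateExit::InternalError => 3,
            AdjudicateExit::Deferred => 4,
        }
    }
}

impl From<AdjudicateExit> for ExitCode {
    fn from(outcome: AdjudicateExit) -> Self {
        ExitCode::from(outcome.code())
    }
}
```

A `main` that returns `ExitCode` can then end with `outcome.into()`, so no call site spells out a raw number.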
The --scripted mode is the testability seam. A scripted
decision file looks like:
schema_version = 1
[[decisions]]
session_id = "session-102-t1"
kind = "speaker-id-low-confidence"
choice = { kind = "accept-suggested", note = "verified by listening" }
[[decisions]]
session_id = "session-103-t1-parent"
kind = "parent-role-lookup"
choice = { kind = "override", inserted_role = { code = "FAT", tag = "Father" }, note = "per contributor data sheet" }
The adjudication tool reads the scripted file, matches decisions
to pending entries by session_id + kind, and applies each as
though the operator had typed it. If a scripted decision has no
matching pending entry, or a pending entry has no scripted
decision, the run aborts with a clear error.
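The matching rule amounts to requiring a one-to-one correspondence on the (session_id, kind) key. A minimal sketch, assuming string keys and hypothetical names (`Key`, `MatchError`, `check_one_to_one`); the real tool would carry full entries rather than bare keys.

```rust
use std::collections::BTreeSet;

/// (session_id, kind) pair used to match decisions to pending entries.
pub type Key = (String, String);

#[derive(Debug, PartialEq, Eq)]
pub enum MatchError {
    /// A scripted decision names an entry that is not pending.
    DecisionWithoutPending(Key),
    /// A pending entry has no scripted decision.
    PendingWithoutDecision(Key),
}

/// Check that every scripted decision has a pending entry and vice versa.
pub fn check_one_to_one(pending: &[Key], decisions: &[Key]) -> Result<(), MatchError> {
    let pending_set: BTreeSet<&Key> = pending.iter().collect();
    let decision_set: BTreeSet<&Key> = decisions.iter().collect();

    if let Some(key) = decision_set.difference(&pending_set).next() {
        return Err(MatchError::DecisionWithoutPending((**key).clone()));
    }
    if let Some(key) = pending_set.difference(&decision_set).next() {
        return Err(MatchError::PendingWithoutDecision((**key).clone()));
    }
    Ok(())
}
```

Running this check before applying anything means a mismatched script aborts without touching the override file.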
The prompter abstraction (testability)
The adjudication tool’s core flow is:
// pseudocode, actual signatures live in talkbank-transform
pub fn run_adjudication(
pending: PendingAdjudications,
override_file: &mut OverrideFile,
prompter: &mut dyn Prompter,
operator: OperatorId,
) -> Result<AdjudicationOutcome, AdjudicationError> {
for entry in pending.entries() {
let context = build_context(entry);
let decision = prompter.ask(&context)?;
apply_decision(override_file, entry, decision, &operator);
}
Ok(...)
}
pub trait Prompter {
fn ask(&mut self, context: &AdjudicationContext)
-> Result<OperatorDecision, PrompterError>;
}
Production implementations:
- TerminalPrompter: prints context to stdout, reads operator response from stdin. Used by --interactive.
Test implementations:
- ScriptedPrompter::from_decisions(Vec<(SessionId, OperatorDecision)>): returns each decision in turn, errors if asked for an unprovided session. Used by L2 transform tests.
- ScriptedTomlPrompter::read(path): reads the same TOML format as --scripted. Used by L3 CLI tests so subprocess tests and library-level tests share fixture format.
This means:
- Every adjudication test path is automated. No subprocess PTY hackery, no expect-script DSL. Tests construct ScriptedPrompter, run the adjudication core, assert on the resulting OverrideFile.
- The terminal UI is dumb. All it does is Display-format the context and parse the operator’s response into an OperatorDecision. No business logic in the UI layer.
- Future UI backends (VS Code, web) implement Prompter and drop in. The adjudication core is unchanged.
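A reduced sketch of the scripted prompter and a core loop that uses it. The types are deliberately simplified stand-ins: `Decision` has two variants instead of the full OperatorDecision, sessions are plain strings, and `run` is a hypothetical miniature of the adjudication core, not its real signature.

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Decision {
    AcceptSuggested,
    Defer,
}

#[derive(Debug, PartialEq, Eq)]
pub enum PrompterError {
    NoDecisionFor(String),
}

/// The seam between the adjudication core and any UI backend.
pub trait Prompter {
    fn ask(&mut self, session_id: &str) -> Result<Decision, PrompterError>;
}

/// Test prompter: hands out pre-canned decisions in order.
pub struct ScriptedPrompter {
    queue: VecDeque<(String, Decision)>,
}

impl ScriptedPrompter {
    pub fn from_decisions(decisions: Vec<(String, Decision)>) -> Self {
        Self { queue: decisions.into() }
    }
}

impl Prompter for ScriptedPrompter {
    fn ask(&mut self, session_id: &str) -> Result<Decision, PrompterError> {
        // Only answer when the next scripted decision is for this session.
        let next_matches = self
            .queue
            .front()
            .map_or(false, |(id, _)| id == session_id);
        if next_matches {
            if let Some((_, decision)) = self.queue.pop_front() {
                return Ok(decision);
            }
        }
        Err(PrompterError::NoDecisionFor(session_id.to_string()))
    }
}

/// Miniature core loop: ask the prompter once per session, in order.
pub fn run(
    sessions: &[&str],
    prompter: &mut dyn Prompter,
) -> Result<Vec<(String, Decision)>, PrompterError> {
    let mut out = Vec::new();
    for session in sessions {
        out.push((session.to_string(), prompter.ask(session)?));
    }
    Ok(out)
}
```

Because `run` only sees `&mut dyn Prompter`, a terminal-backed implementation could replace the scripted one without any change to the loop.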
The OperatorDecision type
pub enum OperatorDecision {
/// Accept the algorithm's suggested mapping verbatim.
AcceptSuggested { note: Option<String> },
/// Override with an operator-supplied mapping (speaker-id).
OverrideMapping {
mapping: SpeakerMapping,
note: Option<String>,
},
/// Override the inserted role only (parent-role lookup).
OverrideInsertedRole {
inserted_role: InsertedRole,
note: Option<String>,
},
/// Add or update flags on an existing entry.
Flag { flags: Vec<MergeFlag>, note: Option<String> },
/// Defer this entry; leave it in pending for later review.
Defer { reason: String },
/// Mark the session as blocked (e.g., unbulleted reference);
/// requires upstream action before pipeline can resume.
Block { reason: String },
}
Each variant maps cleanly to one or more adjudication kinds:
| Kind | Allowed OperatorDecision variants |
|---|---|
| speaker-id-low-confidence | AcceptSuggested, OverrideMapping, Defer |
| parent-role-lookup | AcceptSuggested, OverrideInsertedRole, Defer |
| diarization-mix-review | Flag, Defer |
| sanity-scan-misclassification | OverrideMapping, Flag, Defer |
| (any) | Block is always available |
The kind → allowed-variants mapping is enforced by the
adjudication tool: a kind = "parent-role-lookup" entry that
gets an OverrideMapping decision is rejected with a clear
error (AdjudicationError::DecisionKindMismatch).
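The table translates directly into a single match. In this sketch the decision variants are reduced to payload-free tags, and `Kind`, `Variant`, `DecisionKindMismatch` (as a struct), and `check` are illustrative names; the real error is a variant of AdjudicationError whose fields are not specified here.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    SpeakerIdLowConfidence,
    ParentRoleLookup,
    DiarizationMixReview,
    SanityScanMisclassification,
}

/// Payload-free tags for the `OperatorDecision` variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Variant {
    AcceptSuggested,
    OverrideMapping,
    OverrideInsertedRole,
    Flag,
    Defer,
    Block,
}

#[derive(Debug, PartialEq, Eq)]
pub struct DecisionKindMismatch {
    pub kind: Kind,
    pub variant: Variant,
}

/// Accept the decision only if the table allows it for this kind.
pub fn check(kind: Kind, variant: Variant) -> Result<(), DecisionKindMismatch> {
    let allowed = match (kind, variant) {
        // Block is always available.
        (_, Variant::Block) => true,
        (
            Kind::SpeakerIdLowConfidence,
            Variant::AcceptSuggested | Variant::OverrideMapping | Variant::Defer,
        ) => true,
        (
            Kind::ParentRoleLookup,
            Variant::AcceptSuggested | Variant::OverrideInsertedRole | Variant::Defer,
        ) => true,
        (Kind::DiarizationMixReview, Variant::Flag | Variant::Defer) => true,
        (
            Kind::SanityScanMisclassification,
            Variant::OverrideMapping | Variant::Flag | Variant::Defer,
        ) => true,
        _ => false,
    };
    if allowed {
        Ok(())
    } else {
        Err(DecisionKindMismatch { kind, variant })
    }
}
```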
Operator terminal UX (interactive mode)
What the operator sees when running chatter adjudicate pending.toml --override-file overrides.toml --interactive:
═══════════════════════════════════════════════════════════════
ADJUDICATION [1 / 14] session-102-t1 kind = speaker-id-low-confidence
═══════════════════════════════════════════════════════════════
Reference file: chi-only/session-102-t1.cha
Donor file: asr/session-102-t1.cha
Anchor speaker: CHI
Per-speaker Jaccard scores against reference's CHI:
PAR0 = 0.6286 ◄── higher
PAR1 = 0.3457
margin = 1.82× (threshold was 2.00×)
Opening turns side-by-side:
*CHI [0_1708] they start to bite .
*PAR0 [75_1165] They start to bite .
*PAR1 [1515_2245] They do what .
*CHI [1708_5966] they put up their shields at some point .
*PAR0 [2755_4405] They put up those heels .
*PAR1 [4865_6045] At some point oh .
(3 more turns shown; press 'm' for more)
Algorithm-suggested mapping:
PAR0 → drop (winner, matches CHI content)
PAR1 → rename to INV:Investigator
Your decision?
[a] Accept suggested
[o] Override mapping
[f] Flag and defer
[d] Defer (review later)
[b] Block (needs upstream fix)
[m] Show more context
[p] Play media (uses $TB_MEDIA_PLAYER)
[q] Quit (save progress and exit)
>
When the operator types a and then is prompted for an
optional note, the tool writes the decision to the override
file and advances to the next pending entry.
The [p] Play media action is just a wrapper around
Command::new($TB_MEDIA_PLAYER).arg(media_path).spawn(); the
adjudication tool doesn’t bundle an audio player. The operator
configures their preferred player via the environment.
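A sketch of how that wrapper might be structured so it stays testable: building the command is separated from spawning it, and the player is passed in rather than read from the environment inside the function. The name `media_command` and the choice to treat an unset or empty variable as "no player" are assumptions.

```rust
use std::ffi::OsString;
use std::path::Path;
use std::process::Command;

/// Build (but do not spawn) the media-player command.
/// Returns `None` when no player is configured.
pub fn media_command(player: Option<OsString>, media: &Path) -> Option<Command> {
    // Treat an empty value the same as an unset variable.
    let player = player.filter(|p| !p.is_empty())?;
    let mut cmd = Command::new(player);
    cmd.arg(media);
    Some(cmd)
}
```

At the call site this would look like `media_command(std::env::var_os("TB_MEDIA_PLAYER"), path)`, followed by `.spawn()` on the returned command.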
Adjudication contexts beyond speaker-id
The same chatter adjudicate tool handles adjudication points
1-4 by dispatching on kind (point 5 is recorded through the
Block decision). For each kind, the displayed context and the
allowed decisions differ:
parent-role-lookup
Shown context: the session is a parent sample (by convention
its basename contains a parent suffix, or the contributor data
sheet says so). The merged output needs an inserted-role code
of MOT, FAT, or PAR. The operator picks.
Session: session-103-t1-parent
Kind: parent-role-lookup
This is a parent-sample session. The merged file's inserted
speaker (currently labeled PAR0 → ???) needs a CHAT role.
Contributor data sheet (if attached): not available
Audio preview duration: 8m 14s
Algorithm-suggested: INV : Investigator (default for ambiguity)
Your decision?
[a] Accept suggested (INV : Investigator)
[m] MOT : Mother
[f] FAT : Father
[p] PAR : Adult (gender unknown)
[c] Custom role
[d] Defer
[b] Block (needs upstream metadata)
>
diarization-mix-review
Triggered by the operator (or a post-merge auto-scan) observing
that an ASR speaker’s content mixes real-world speakers. The
adjudication is to add the "diarization-mixed" flag plus a
note explaining the mix.
sanity-scan-misclassification
Triggered by the post-merge sanity scan when a retained-speaker utterance has high text similarity with a temporally-adjacent inserted-speaker utterance. The operator either confirms (“the original speaker-id was wrong, swap the mapping”) or overrides (“the duplication is real, both speakers said the same thing at the same time”).
Resumption and re-adjudication
The pending-adjudications file is the source of truth for
“what still needs deciding.” If the operator quits mid-review
(via [q] or process-kill), the next chatter adjudicate
invocation picks up where they left off, because already-decided
entries have already been removed from pending and written to
the override file.
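The resumption behaviour can be sketched as a function from a pending list and a sequence of operator actions to a pair of lists. Names here (`Action`, `review`) are hypothetical, entries are plain strings, and running out of actions is treated like a quit.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    /// Operator made a decision: entry leaves pending.
    Decide,
    /// Operator deferred: entry stays pending, review continues.
    Defer,
    /// Operator quit: this entry and everything after it stay pending.
    Quit,
}

/// Walk `pending` in order, applying one action per entry.
/// Returns `(decided, still_pending)`.
pub fn review(pending: Vec<String>, actions: &[Action]) -> (Vec<String>, Vec<String>) {
    let mut decided = Vec::new();
    let mut remaining = Vec::new();
    let mut actions = actions.iter();
    let mut quit = false;

    for entry in pending {
        let action = if quit {
            Action::Quit
        } else {
            actions.next().copied().unwrap_or(Action::Quit)
        };
        match action {
            Action::Decide => decided.push(entry),
            Action::Defer => remaining.push(entry),
            Action::Quit => {
                quit = true;
                remaining.push(entry);
            }
        }
    }
    (decided, remaining)
}
```

Feeding the second list back in as the next run's pending input is what "picks up where they left off" means in practice.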
Re-adjudication of an already-decided entry is a planned
extension, not yet implemented. The proposed interface would
load the existing override entry, present it as the “current
decision,” and ask the operator whether to keep or replace it;
the operator’s decision would overwrite the entry, and the prior
decision would be preserved in a history array on the entry
(recording the prior mode, mapping, operator, decided_at,
and note). The proposed invocation shape (not a working command
today) is:
# Proposed, not yet implemented:
chatter adjudicate --re-adjudicate <SESSION_ID> --override-file overrides.toml
It needs a small override-file schema extension: a per-entry
optional history: Vec<MergeOverride> field. This is a minor
schema change; if it ships in v1, no schema bump is needed; if it
ships later, that is a schema_version = 2 migration.
Composition with the orchestrator
The orchestrator (proposed tb merge or similar) drives the
pipeline. Its high-level flow:
// pseudocode for the orchestrator's main loop
let inputs = discover_input_sessions(input_dir);
let mut override_file = OverrideFile::read_or_default(override_path);
let mut pending = PendingAdjudications::default();
for session in inputs {
if let Some(decision) = override_file.get(&session.id) {
// Already adjudicated; apply directly.
let labeled = apply_mapping(&session.donor, &decision.mapping)?;
let merged = merge(&session.reference, &labeled, &session.retain)?;
write_merged(merged, &session.output_path)?;
} else {
// Try auto-decide.
match identify_mapping(&session.donor, &session.reference, ...) {
Ok(mapping) => {
let labeled = apply_mapping(&session.donor, &mapping)?;
let merged = merge(...)?;
write_merged(merged, &session.output_path)?;
override_file.insert(session.id.clone(), record_auto_decision(&mapping));
}
Err(SpeakerIdError::LowConfidence { scores, margin, threshold }) => {
pending.push(PendingEntry::speaker_id_low_confidence(
session.id.clone(),
scores, margin, threshold,
/* preview */ build_preview(&session),
));
}
Err(other) => return Err(other),
}
}
}
pending.write(pending_path)?;
override_file.write(override_path)?;
if !pending.is_empty() {
eprintln!(
"Pipeline complete for {} sessions; {} sessions need adjudication.\n\
Run: chatter adjudicate {} --override-file {}",
decided_count, pending.len(), pending_path, override_path
);
return Ok(ExitCode::NeedsAdjudication);
}
The orchestrator is the layer that hasn’t been designed yet at
the type level. It’s likely a tb subcommand (since tb is
the workflow tool for multi-repo / multi-step ops), with a
fallback shell-script form for the v0 pipeline.
What this design does NOT cover
- The orchestrator binary itself. That’s a separate design pass; this doc only specifies the contract between the pipeline stages and the adjudication tool.
- GUI/web adjudication backends. v1 is terminal-only. The Prompter trait is the extension point; future backends implement it. The data contract (pending.toml, overrides.toml) does not change.
- Audio playback / waveform display. v1 launches the operator’s $TB_MEDIA_PLAYER and gets out of the way. A future TUI with inline audio scrubbing is conceivable but is a major UI project, not v1.
- ML-suggested decisions. A future version could feed pending entries to a classifier that pre-fills “suggested” with model output. Out of scope; the suggested field exists today as a hook.
Test coverage
Every behavior of chatter adjudicate is tested via the
scripted-prompter abstraction. See the
Test Plan (TBD section L4) for the
test inventory. Coverage spans:
- Each adjudication kind’s happy path (operator accepts suggested, decision written to override file)
- Each adjudication kind’s override path (operator types an alternative, decision validated and recorded)
- Each adjudication kind’s defer path (entry stays in pending)
- Each adjudication kind’s block path (entry marked blocked; pipeline reports blocker)
- Re-adjudication path (operator changes their mind; prior decision preserved in history), once that planned extension is implemented
- Mutually-exclusive flag enforcement (--interactive + --scripted rejected)
- Invalid operator response handling (with and without --skip-on-error)
- Schema-version refusal on the pending file
- Empty pending file (no-op, exit 0)
XML Emitter
Status: Current Last updated: 2026-06-14 12:56 EDT
Purpose
crates/talkbank-transform/src/xml/ serialises a ChatFile<S> into
TalkBank XML, an obsolete, frozen interchange format. The emitter is
chatter’s implementation of that format’s CHAT to XML projection.
Scope:
- Legacy / rare-use facility. The TalkBank project no longer publishes XML for download; CHAT is the canonical distribution format. The XML emitter exists to support rare legacy consumers that still need the XML projection; it is not a primary interchange path. New integrations should consume CHAT directly.
- Emission only. XML ingest (XML → CHAT) is explicitly out of scope. The only historical consumer that ever needed XML → CHAT was Phon (via its PhonTalk plug-in, which used an XML round-trip); Phon has since pivoted to reading CHAT directly. The other XML readers are all either dormant or migrated:
  - NLTK’s CHILDESCorpusReader is unmaintained and was always read-only.
  - langcog/childes-db has had no commits since September 2022.
  - TalkBankDB and the current TalkBank analysis stack read CHAT directly, not XML.
- Phonetic tiers are permanently unsupported. %pho, %mod, %phosyl, %modsyl, %phoaln report XmlWriteError::PhoneticTierUnsupported. Phon has pivoted to CHAT-only interchange; no downstream consumer reads the rich <pg>/<pw>/<ph>/<cmph>/<ss> XML. Files carrying these tiers still parse, validate, and round-trip through CHAT unchanged; only the XML projection is declined.
- Parity oracle. The goldens in corpus/reference-xml/ (the reference TalkBank XML generated against the reference CHAT corpus) are the parity target. All paired goldens pass structurally: there is full parity across every reference .cha file the TalkBank XML format can represent. A small number of reference fixtures have no golden because the frozen format cannot express them: some use UD POS tags (propn) that postdate it, and others declare @Media with a linkage type that the E544 validator catches before emission has a chance to run. These are intentional divergences, not Rust gaps.
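The phonetic-tier refusal can be pictured as a pre-emission check over the tier labels present in a file. This is a sketch under assumptions: the tier list is the one given above, but the `tier` payload on the error variant, the `check_tiers` function, and representing tiers as string labels are all hypothetical.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum XmlWriteError {
    /// Assumed payload: the label of the first offending tier.
    PhoneticTierUnsupported { tier: String },
}

/// Tiers the XML projection declines to emit.
const PHONETIC_TIERS: [&str; 5] = ["%pho", "%mod", "%phosyl", "%modsyl", "%phoaln"];

/// Refuse emission as soon as any phonetic tier is encountered.
pub fn check_tiers<'a>(
    tiers: impl IntoIterator<Item = &'a str>,
) -> Result<(), XmlWriteError> {
    for tier in tiers {
        if PHONETIC_TIERS.contains(&tier) {
            return Err(XmlWriteError::PhoneticTierUnsupported {
                tier: tier.to_string(),
            });
        }
    }
    Ok(())
}
```

The check only gates the XML projection; nothing about it prevents the same file from parsing, validating, or round-tripping through CHAT.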
Module layout
The emitter is split across six submodules under xml/, plus
error.rs for the error type. Each of the six contributes an
impl XmlEmitter { … } block plus any free helpers it owns;
state lives on the single XmlEmitter struct defined in writer.rs.
flowchart TD
entry["write_chat_xml<br/>(writer.rs)"]
emitter["XmlEmitter struct<br/>owns quick-xml Writer<br/>+ next_utterance_id"]
root["root.rs<br/>document / participants /<br/>body / utterance orchestration<br/>+ metadata helpers"]
word["word.rs<br/><w> / <t> / <tagMarker> /<br/><pause> / <g> wrappers /<br/>word-internal markers /<br/>scoped annotations"]
mor["mor.rs<br/><mor> / <mw> / <gra> /<br/>UtteranceTiers collector /<br/>%mor feature serialization"]
wor["wor.rs<br/><media> / <wor> /<br/><internal-media> /<br/>ms → seconds formatting"]
deptier["deptier.rs<br/><a type=…> side tiers<br/>(%act / %com / %exp /<br/>%gpx / %sit / %xLABEL)"]
error["error.rs<br/>XmlWriteError variants"]
entry --> emitter
emitter --> root
emitter --> word
emitter --> mor
emitter --> wor
emitter --> deptier
root -->|"terminator,<br/>separator"| word
root -->|"collect_utterance_tiers,<br/>UtteranceTiers"| mor
root -->|"<media>,<br/><wor>"| wor
root -->|"side tiers"| deptier
word -->|"<mor> subtree<br/>inside <w>"| mor
word -->|"<mor> subtree<br/>inside <tagMarker>"| mor
wor -->|"%wor terminator<br/>label"| word
error -.->|"errors"| entry
error -.->|"errors"| root
error -.->|"errors"| word
error -.->|"errors"| mor
error -.->|"errors"| wor
error -.->|"errors"| deptier
| File | Role |
|---|---|
| writer.rs | XmlEmitter struct, namespace/version constants, write_chat_xml entry point, minimal-document unit test, escape_text helper |
| root.rs | Document / participants / body / utterance orchestration; root-element metadata helpers (corpus lookup, date/age/sex formatting, @Options flags, @Types projection, per-speaker extras) |
| word.rs | All word-level element shapes; word-internal marker walking; scoped-annotation dispatch; event / action emission |
| mor.rs | %mor / %gra emission including post-clitic <mor-post>; UtteranceTiers aggregator |
| wor.rs | %wor tier emission plus utterance-level <media>; format_seconds ms → seconds |
| deptier.rs | Text-content “side tiers” that render as <a type=…>text</a> (%act, %com, %exp, %gpx, %sit, %xLABEL) |
| error.rs | XmlWriteError thiserror enum |
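As an illustration of the ms → seconds conversion mentioned for wor.rs, the sketch below uses integer arithmetic so no floating-point rounding is involved. The signature and the three-decimal output format are assumptions; this page does not state what the real format_seconds emits, so the function is named differently on purpose.

```rust
/// Render a millisecond offset as seconds with exactly three decimals.
/// Assumed format; the real helper's output may differ.
pub fn format_seconds_sketch(ms: u64) -> String {
    format!("{}.{:03}", ms / 1000, ms % 1000)
}
```

Splitting into whole seconds and a zero-padded remainder keeps values such as 75 ms from losing their leading zeros.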
Top-level data flow
sequenceDiagram
participant Caller
participant write_chat_xml as write_chat_xml<br/>(writer.rs)
participant XmlEmitter as XmlEmitter
participant emit_document as emit_document<br/>(root.rs)
participant emit_body as emit_body<br/>(root.rs)
participant emit_utterance as emit_utterance<br/>(root.rs)
Caller->>write_chat_xml: ChatFile<S>
write_chat_xml->>XmlEmitter: new()
write_chat_xml->>emit_document: emit_document(file)
emit_document->>emit_document: emit <?xml?> + <CHAT> attrs
emit_document->>emit_document: emit_participants(file)
emit_document->>emit_body: emit_body(file)
loop each Line
alt Line::Header
emit_body->>emit_body: emit_header_if_body(header)
else Line::Utterance
emit_body->>emit_utterance: emit_utterance(utterance)
end
end
write_chat_xml->>XmlEmitter: finish() → String
XmlEmitter-->>Caller: Ok(String)
Utterance emission in detail
emit_utterance is the most complex orchestrator: it walks the
main tier in parallel with three cursors into the dependent tiers
(%mor, %gra, %sin).
flowchart TD
start([emit_utterance])
preHdr[emit pre-begin<br/>headers]
collect["collect_utterance_tiers<br/>→ UtteranceTiers {<br/>mor, gra, wor, sin, side_tiers }"]
openU["<u who=… uID=…>"]
linkers["emit_linker × N<br/>(utterance.main.content.linkers)"]
walk{"walk<br/>utterance.main.content.content"}
term{"terminator<br/>present?"}
emitTerm["emit_terminator<br/>(word.rs)"]
missing["<t type='missing<br/>CA terminator'/>"]
media{"main bullet<br/>present?"}
emitMedia["emit_utterance_media<br/>(wor.rs)"]
wor{"%wor tier<br/>present?"}
emitWor["emit_wor<br/>(wor.rs)"]
side{"side tiers<br/>non-empty?"}
emitSide["emit_side_tiers<br/>(deptier.rs)"]
closeU["</u>"]
done([return])
start --> preHdr --> collect --> openU --> linkers --> walk
walk -->|"Word / AnnotatedWord /<br/>ReplacedWord / AnnotatedGroup /<br/>Separator / Pause / Retrace /<br/>Event / AnnotatedAction /<br/>OverlapPoint"| walk
walk --> term
term -->|yes| emitTerm
term -->|no| missing
emitTerm --> media
missing --> media
media -->|yes| emitMedia
media -->|no| wor
emitMedia --> wor
wor -->|yes| emitWor
wor -->|no| side
emitWor --> side
side -->|yes| emitSide
side -->|no| closeU
emitSide --> closeU
closeU --> done
The TierCursors invariant
Walking the main tier requires tracking three independent cursors
into the %mor / %gra / %sin tiers. This separation is the
single most important correctness invariant in the emitter; a
merged cursor silently drifts on any utterance containing a clitic
chain, an untranscribed placeholder, or a sign-language item.
A TierCursors helper in mor.rs owns the three cursors and
provides mor_index() / gra_chunk() / sin_index() accessors plus
consume_mor(post_clitics_len) / consume_sin() / advance_bulk(mor, gra) advance methods. Every content-arm in emit_utterance runs
a fixed template: look up partners at the current cursor positions,
emit, call consume_*. The advance math has exactly one home.
| Cursor | Indexes into | Advances by |
|---|---|---|
| mor | mor_tier.items (one Mor per main-tier word) | 1 per alignable word |
| gra | gra.relations (1-based <gra index=…/>) | 1 + post_clitics.len() per Mor |
| sin | sin_tier.items (one SinItem per sin-countable word) | 1 per sin-countable word |
A Mor item like pron|what-Int-S1~aux|be-Fin-Ind-Pres-S3 is one
entry in mor_tier.items but contributes two %gra edges,
one for the main <mw>, one for each <mor-post><mw/></mor-post>.
So mor and gra cursors advance at different rates.
%sin uses a different counting predicate from %mor. The
model’s counts_for_tier(word, TierDomain) function encodes the
differences:
- TierDomain::Mor excludes nonwords (&~), fillers (&-), phonological fragments (&+), and untranscribed placeholders (xxx, yyy, www).
- TierDomain::Sin includes everything that was phonologically or gesturally produced; fragments and untranscribed placeholders do participate, because a gesture can accompany an unintelligible vocalisation.
Because the predicates diverge, the sin cursor advances on its
own schedule. For *CHI: mommy xxx . %sin: g:point 0 . the xxx
word consumes a %sin item but not a %mor item.
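The divergence between the two predicates can be sketched as follows. This is a minimal illustration, not the model's code: WordKind and classify are hypothetical stand-ins for the real word type, the signature differs from the real counts_for_tier(word, TierDomain), and treating nonwords and fillers as sin-countable is an assumption (the text above only confirms fragments and untranscribed placeholders).

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum TierDomain {
    Mor,
    Sin,
}

/// Hypothetical word classification used only for this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
enum WordKind {
    Regular,
    Nonword,
    Filler,
    Fragment,
    Untranscribed,
}

/// Classify a main-tier token by its CHAT prefix or placeholder text.
fn classify(token: &str) -> WordKind {
    if token.starts_with("&~") {
        WordKind::Nonword
    } else if token.starts_with("&-") {
        WordKind::Filler
    } else if token.starts_with("&+") {
        WordKind::Fragment
    } else if matches!(token, "xxx" | "yyy" | "www") {
        WordKind::Untranscribed
    } else {
        WordKind::Regular
    }
}

/// Sketch of the per-domain counting predicate.
fn counts_for_tier(kind: WordKind, domain: TierDomain) -> bool {
    match domain {
        // %mor aligns only with regular words.
        TierDomain::Mor => kind == WordKind::Regular,
        // Assumption: every produced item counts for %sin.
        TierDomain::Sin => true,
    }
}

/// Number of tier items a main-tier word list should consume.
fn expected_items(words: &[&str], domain: TierDomain) -> usize {
    words
        .iter()
        .filter(|w| counts_for_tier(classify(w), domain))
        .count()
}

fn main() {
    let words = ["mommy", "xxx"];
    println!("mor items: {}", expected_items(&words, TierDomain::Mor));
    println!("sin items: {}", expected_items(&words, TierDomain::Sin));
}
```

For the mommy xxx example, the sketch expects one %mor item and two %sin items, which is why a single shared cursor would drift.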
Four main-tier content variants delegate cursor arithmetic through
their emitters: emit_replaced_word and emit_annotated_group
return (mor_used, gra_used) tuples consumed via
cursors.advance_bulk(mor_used, gra_used); emit_word and
emit_annotated_word call cursors.consume_mor(post_count) inline.
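A minimal sketch of the cursor holder described in this section, assuming plain usize fields. The method names come from the text above; the field layout, the return types, and the argument to gra_chunk are assumptions made for illustration and may not match mor.rs.

```rust
use std::ops::Range;

/// Hypothetical stand-in for the TierCursors helper.
#[derive(Debug, Default)]
struct TierCursors {
    mor: usize,
    gra: usize,
    sin: usize,
}

impl TierCursors {
    fn mor_index(&self) -> usize {
        self.mor
    }

    /// %gra relations owned by the current Mor item: one for the
    /// main word plus one per post-clitic (assumed signature).
    fn gra_chunk(&self, post_clitics_len: usize) -> Range<usize> {
        self.gra..self.gra + 1 + post_clitics_len
    }

    fn sin_index(&self) -> usize {
        self.sin
    }

    /// One Mor item consumed; %gra advances by 1 + post-clitic count.
    fn consume_mor(&mut self, post_clitics_len: usize) {
        self.mor += 1;
        self.gra += 1 + post_clitics_len;
    }

    fn consume_sin(&mut self) {
        self.sin += 1;
    }

    /// Used by emitters that report (mor_used, gra_used) tuples.
    fn advance_bulk(&mut self, mor_used: usize, gra_used: usize) {
        self.mor += mor_used;
        self.gra += gra_used;
    }
}

fn main() {
    let mut cursors = TierCursors::default();
    // A word with one post-clitic: one Mor item, two %gra relations.
    cursors.consume_mor(1);
    // A plain word: one Mor item, one %gra relation.
    cursors.consume_mor(0);
    // An untranscribed word that only counts for %sin.
    cursors.consume_sin();
    println!("{:?}", cursors);
}
```

Keeping the arithmetic inside these few methods is what gives the advance math a single home: content arms never touch the counters directly.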
Why cursor-based, not AlignmentSet-based?
talkbank-model’s AlignmentSet (Utterance.alignments) holds
pre-computed MorAlignmentPair / SinAlignmentPair / etc., the
same main-word-index ↔ target-tier-index mapping the emitter
computes on-the-fly. Why not use it directly?
The XML emitter accepts ChatFile<S: ValidationState> for any
S. When called on a ChatFile<NotValidated>, compute_alignments
has never run and Utterance.alignments is None. Rather than
force callers to validate first, or risk panics on unvalidated
input, the emitter recomputes what it needs via the cursor walk.
The cursor walk is equivalent to the model’s alignment output for every reference-corpus input; it only diverges on malformed files that the model’s alignment would also flag. The cursors stay as local emitter state, and the alignment module stays a separate, optional layer.
%sin → <sg><w><sw/></w></sg> emission
When a %sin tier is present and the current word counts for
TierDomain::Sin, the emitter wraps the <w> element (and its
nested <mor> subtree if any) in a <sg> (sign group) with a
<sw> (sign word) sibling:
<sg><w>what<mor>...</mor></w><sw>0</sw></sg>
SinItem::Token(text) renders as <sw>text</sw>; SinItem::SinGroup(…)
joins its gesture tokens with spaces. The emission is the entirety
of XmlEmitter::emit_sin_word; everything else is just the
<sg>…</sg> wrap in emit_utterance’s Word arm.
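The wrap can be sketched with plain string building. This is an illustration only: the real emitter writes through quick_xml, the SinGroup payload is assumed here to be a Vec<String>, and wrap_sign_group and sign_word_text are hypothetical names. Escaping is omitted for brevity.

```rust
/// Simplified stand-in for the model's SinItem (payload types assumed).
enum SinItem {
    Token(String),
    SinGroup(Vec<String>),
}

/// Text content of the <sw> element for one %sin item.
fn sign_word_text(item: &SinItem) -> String {
    match item {
        SinItem::Token(text) => text.clone(),
        // A group joins its gesture tokens with single spaces.
        SinItem::SinGroup(tokens) => tokens.join(" "),
    }
}

/// Wrap an already-rendered <w> element and its <sw> sibling in <sg>.
fn wrap_sign_group(word_xml: &str, item: &SinItem) -> String {
    format!("<sg>{}<sw>{}</sw></sg>", word_xml, sign_word_text(item))
}

fn main() {
    let zero = SinItem::Token("0".to_string());
    println!("{}", wrap_sign_group("<w>what</w>", &zero));

    let group = SinItem::SinGroup(vec!["g:point".to_string(), "g:wave".to_string()]);
    println!("{}", wrap_sign_group("<w>there</w>", &group));
}
```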
@Media linkage and timing evidence (E544)
Validation fires E544 before XML emission when an unqualified
@Media header (status-less) claims linkage but the transcript
carries no timing evidence (no main-tier bullets, no positional
%wor sidecar). This is a validator-level rule (lives in
crates/talkbank-model/src/model/file/chat_file/validate.rs
check_media_linkage_has_timing), not an emitter rule; it runs
during ChatFile::validate and blocks downstream emission on
validation-gated entry points. See spec/errors/E544_media_linkage_without_timing.md.
The emitter itself doesn’t care about bullet presence; this check was historically imposed as a parser-level semantic failure, and Rust implements it in the validator instead.
Post-clitic emission
flowchart LR
mor["<mor type='mor'>"]
mw["<mw>…</mw><br/>(main MorWord)"]
gra["<gra type='gra'<br/>index=N head=… relation=…/>"]
post["<mor-post>"]
pmw["<mw>…</mw><br/>(post-clitic MorWord)"]
pgra["<gra type='gra'<br/>index=N+1 head=… relation=…/>"]
postEnd["</mor-post>"]
endMor["</mor>"]
mor --> mw --> gra --> post --> pmw --> pgra --> postEnd --> endMor
Each post-clitic gets its own <mor-post> wrapper containing one
<mw> plus the next <gra> index. Multiple post-clitics emit
sequentially.
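The element order in the diagram can be sketched as a small string builder. The function names, the tuple representation of a %gra relation, and the head/relation values in the example are all hypothetical; only the nesting and the index arithmetic follow the description above.

```rust
/// Render one <gra> element (1-based index).
fn gra_xml(index: usize, head: usize, relation: &str) -> String {
    format!(
        "<gra type=\"gra\" index=\"{}\" head=\"{}\" relation=\"{}\"/>",
        index, head, relation
    )
}

/// Render a <mor> subtree: the main <mw> with its <gra>, then one
/// <mor-post> per post-clitic, each taking the next <gra> index.
/// `gra` holds (head, relation) pairs: main word first, clitics after.
fn emit_mor(main_mw: &str, post_mws: &[&str], first_gra_index: usize, gra: &[(usize, &str)]) -> String {
    let mut out = String::from("<mor type=\"mor\">");
    out.push_str(&format!("<mw>{}</mw>", main_mw));
    out.push_str(&gra_xml(first_gra_index, gra[0].0, gra[0].1));
    for (i, mw) in post_mws.iter().enumerate() {
        let (head, relation) = gra[i + 1];
        out.push_str("<mor-post>");
        out.push_str(&format!("<mw>{}</mw>", mw));
        out.push_str(&gra_xml(first_gra_index + 1 + i, head, relation));
        out.push_str("</mor-post>");
    }
    out.push_str("</mor>");
    out
}

fn main() {
    // Illustrative values only; the relations are not taken from a real file.
    let xml = emit_mor("what", &["be"], 1, &[(2, "SUBJ"), (0, "ROOT")]);
    println!("{}", xml);
}
```

The caller would then advance the gra cursor by 1 + post_mws.len(), matching the table in the TierCursors section.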
Emitter / parser / model boundary
The emitter generally defers to the Rust model’s canonical predicates rather than inventing output-side rules. Four cases are exceptions where the emitter bridges a disagreement between the parser and the TalkBank XML format at the output boundary. All four are legitimate divergences, not regressions: the Rust model is correct, the TalkBank XML format is obsolete and frozen at a pre-evolution CHAT snapshot, and the emitter’s bridges are the right place to reconcile the output shape.
CA intonation contour terminators
Rust parses ⇗, ↗, →, ↘, ⇘ at the end of an utterance as
Terminator::CaRisingToHigh etc. The TalkBank XML format classifies
them as separators followed by an implicit “missing CA terminator”.
The emitter splits a pitch-contour terminator into two sibling
elements:
<s type="rising to high"/>
<t type="missing CA terminator"/>
See ca_terminator_separator_label in word.rs. If the Rust
parser ever migrates to classify these as separators, the
emitter’s bridge becomes dead and should be removed.
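A sketch of the split, assuming a two-variant Terminator enum. Only CaRisingToHigh and the "rising to high" label come from the text above; the signature of ca_terminator_separator_label and the XML shape used for an ordinary period are assumptions.

```rust
/// Reduced stand-in for the model's Terminator enum.
#[derive(Debug, Clone, Copy)]
enum Terminator {
    Period,
    CaRisingToHigh,
}

/// Separator label for pitch-contour terminators, None otherwise
/// (assumed signature for the helper named in word.rs).
fn ca_terminator_separator_label(terminator: Terminator) -> Option<&'static str> {
    match terminator {
        Terminator::CaRisingToHigh => Some("rising to high"),
        Terminator::Period => None,
    }
}

/// Emit either a plain terminator or the separator + missing-terminator pair.
fn emit_terminator(terminator: Terminator) -> String {
    match ca_terminator_separator_label(terminator) {
        Some(label) => format!(
            "<s type=\"{}\"/>\n<t type=\"missing CA terminator\"/>",
            label
        ),
        // Assumed shape for an ordinary period terminator.
        None => String::from("<t type=\"p\"/>"),
    }
}

fn main() {
    println!("{}", emit_terminator(Terminator::CaRisingToHigh));
    println!("{}", emit_terminator(Terminator::Period));
}
```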
CAOmission as whole-word shortening
(parens) (a fully-parenthesised word) parses to
WordCategory::CAOmission. TalkBank XML emits
<w><shortening>parens</shortening></w>, a <shortening>
wrapper around the word body with no type="omission" attribute.
The 0word syntax (true omission) gets <w type="omission">word</w>
with no shortening wrapper.
The emitter branches on CAOmission and opens a <shortening>
wrapper around emit_word_contents. word_category_attr returns
None for CAOmission so no type="omission" attribute is
emitted.
Leading overlap-point hoisting
Rust parses ⌈°overlapping+soft⌉° as a single word whose
WordContent vector starts with a TopOverlapBegin marker. TalkBank
XML places the leading ⌈ as a top-level sibling of <w>
but keeps the trailing ⌉ inside. The emitter hoists the
prefix of leading WordContent::OverlapPoint items out before
opening <w>, and emit_word_contents skips them during the
content walk.
xxx / yyy / www case-sensitivity
The model’s word.untranscribed() helper is case-insensitive; it
treats XXX and xxx identically as “unintelligible” to protect
downstream Stanza/MOR pipelines from spurious uppercase entries.
The XML schema’s untranscribed attribute, however, attaches only
to the strictly lowercase placeholders. The emitter uses a local
untranscribed_attribute_for_xml helper that does the
case-sensitive check at the output boundary.
Both behaviours are deliberate and stay: the model’s case-insensitive helper is a Stanza/MOR correctness fix, and the emitter’s case-sensitive gate matches the XML schema contract.
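The two checks can be contrasted in a few lines. Both function names below are hypothetical and return a plain bool; the real helpers presumably return richer types.

```rust
/// Model-side view: case-insensitive, so XXX and xxx are both untranscribed.
fn is_untranscribed_model(word: &str) -> bool {
    ["xxx", "yyy", "www"]
        .iter()
        .any(|placeholder| word.eq_ignore_ascii_case(placeholder))
}

/// Output-side gate: only the strictly lowercase placeholders
/// receive the XML untranscribed attribute.
fn is_untranscribed_for_xml(word: &str) -> bool {
    matches!(word, "xxx" | "yyy" | "www")
}

fn main() {
    for word in ["xxx", "XXX", "hello"] {
        println!(
            "{word}: model={} xml={}",
            is_untranscribed_model(word),
            is_untranscribed_for_xml(word)
        );
    }
}
```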
Reserving element boundaries: single state holder
XmlEmitter owns a quick_xml::Writer<Vec<u8>> and a running
next_utterance_id: u32 counter. Every emission helper writes
through that single writer so indentation, escaping, and the
document-order contract are centrally enforced.
Every BytesText emission routes through escape_text (in
writer.rs) which uses quick_xml::escape::partial_escape to
escape only <, >, &. Apostrophes and double quotes pass
through literally, matching the TalkBank XML format and avoiding
entity-decode issues that would otherwise split text at '
boundaries during structural comparison.
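The escaping behaviour can be illustrated without the quick_xml dependency. The function below is a hand-rolled approximation written for this page, not the escape_text helper itself: it escapes the same three characters and leaves quotes untouched.

```rust
/// Escape only <, >, and & in text content; apostrophes and
/// double quotes pass through unchanged.
fn escape_text_sketch(raw: &str) -> String {
    let mut out = String::with_capacity(raw.len());
    for ch in raw.chars() {
        match ch {
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '&' => out.push_str("&amp;"),
            other => out.push(other),
        }
    }
    out
}

fn main() {
    println!("{}", escape_text_sketch("it's <a> & \"b\""));
}
```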
Testing
Two complementary test surfaces:
- Unit tests in xml/writer.rs (minimal document smoke) and xml/wor.rs (format_seconds fractional padding) exercise internal helpers directly.
- Golden-XML parity harness at crates/talkbank-parser-tests/tests/xml_golden.rs. Runs one parametrised test per file in corpus/reference-xml/**/*.xml, parses both emitted and golden XML via quick-xml, and diffs event streams with whitespace and attribute-order normalisation. Comparator lives in crates/talkbank-parser-tests/tests/xml_support/mod.rs.
The harness diagnostic surfaces the first divergence as
actual: … vs expected: …. To debug further, temporary dump
helpers (write the emitted XML to /tmp/emitted.xml and
side-by-side diff against the golden) are the quickest path;
add them as #[ignore]d tests in
crates/talkbank-parser-tests/tests/xml_dump.rs when needed and
delete after the divergence is resolved.
Related documents
- spec/errors/E544_media_linkage_without_timing.md: the @Media bullet-existence validator that runs before emission.
Reference-XML coverage gaps (which files the TalkBank XML format can’t
represent) are called out inline in the “Parity oracle” bullet of
§Purpose above. Permanent exclusions are UD-POS files that postdate the
frozen format and @Media-without-timing files that E544 blocks at
validation; both are intentional divergences, not Rust gaps.
Staged features
The emitter reports XmlWriteError::FeatureNotImplemented for
CHAT constructs that have a known XML shape but haven’t been
wired in yet. With all paired reference-XML goldens passing,
any new staged feature that lands will be triggered by a file
added to the reference corpus that exercises it. When that
happens:
- Run cargo nextest run -p talkbank-parser-tests --test xml_golden and read the failure message.
- Find the TalkBank XML output for the construct in the paired golden.
- Add a match arm in the appropriate submodule (word.rs::emit_scoped_annotation, deptier.rs::emit_side_tier, word.rs::ca_delimiter_label, etc.) with a short comment explaining the mapping.
- If the construct changes %mor/%gra cursor accounting, update emit_utterance in root.rs, not individual callers.
Permanently-unsupported tiers (%pho, %mod, %phosyl,
%modsyl, %phoaln) use
XmlWriteError::PhoneticTierUnsupported and are not staged
for future work, because Phon’s pivot to CHAT-only interchange
removed the downstream need.
Errors, CHAT core
Status: Current Last modified: 2026-06-17 11:29 EDT
This page describes the error infrastructure used across all CHAT-core crates
(talkbank-model, talkbank-parser, talkbank-transform,
chatter, talkbank-lsp). It is defined in the
errors module of talkbank-model.
External runtime/application errors that live outside this repo’s CHAT core are documented separately in their owning projects. For the diagnostic UX standard that applies within this workspace, see error-diagnostics-ux.
Core Types
ParseError
Every diagnostic is a ParseError:
pub struct ParseError {
    pub code: ErrorCode,
    pub severity: Severity,
    pub location: SourceLocation,
    pub context: ErrorContext,
    pub message: String,
}
ErrorCode
Error codes follow a structured numbering scheme:
| Range | Category |
|---|---|
| E1xx | Encoding |
| E2xx | Words and content |
| E3xx | Main tier (speakers, terminators, content, retraces) |
| E4xx | Dependent tier structure |
| E5xx | Headers |
| E6xx | Dependent tier validation |
| E7xx | Alignment (%mor, %gra, %pho, %wor) |
| W1xx-Wxxx | Warnings (same categories) |
Codes are grouped by range as above. The numbering is a navigational aid, not
the authority on where a code is caught: most codes are emitted at the layer
suggested below, but a few main-tier checks (for example undeclared-speaker and
retrace structure) are validation-layer despite their E3xx number. The
per-code Layer in spec/errors/ is authoritative.
flowchart LR
subgraph "Parser layer\n(parser.parse_chat_file())"
E1["E1xx\nEncoding\n(BOM, charset)"]
E2["E2xx\nWords and content\n(word syntax, events,\noverlap markers)"]
E3["E3xx\nMain tier\n(speaker, content,\nterminator, retraces)"]
E4["E4xx\nDependent tier structure\n(tier presence, format)"]
E5["E5xx\nHeaders\n(format, required fields,\nparticipant resolution)"]
end
subgraph "Validation layer\n(validate_with_alignment)"
E6["E6xx\nDependent tier validation\n(tier name/format)"]
E7["E7xx\nAlignment\n(%mor/%gra/%pho/%wor counts,\nGRA indices, orphaned tiers)"]
end
W["Wxxx\nWarnings\n(same categories,\nnon-fatal)"]
E1 ~~~ E2 ~~~ E3 ~~~ E4 ~~~ E5
E6 ~~~ E7
The source of truth for error-code details is spec/errors/. Maintainers can
generate a local markdown reference set under docs/errors/ with
gen_error_docs when they need a browsable error catalog while working on
diagnostics.
Severity
- Error: must be fixed; indicates invalid CHAT.
- Warning: should be fixed; indicates questionable but parseable CHAT.
SourceLocation and Span
Byte offsets into the source text:
pub struct SourceLocation { pub start: usize, pub end: usize }
pub struct Span { pub start: usize, pub end: usize }
ErrorContext
Carries the source fragment around the error location:
use std::ops::Range;

pub struct ErrorContext {
    pub source_fragment: String,
    pub byte_range: Range<usize>,
    pub node_kind: String,
}
ErrorSink Trait
The central abstraction for error reporting:
flowchart LR
val["Validator / Parser"]
pe["ParseError\ncode + severity +\nlocation + message"]
sink["ErrorSink trait\n.report()"]
vec["ErrorCollector\ncollect to Vec"]
chan["ChannelErrorSink\ncrossbeam channel\n(feature = channels)"]
asyncchan["AsyncChannelErrorSink\ntokio mpsc"]
cfg["ConfigurableErrorSink\nseverity gating"]
null["NullErrorSink\nno-op"]
val --> pe --> sink
sink --> vec & chan & asyncchan & cfg & null
pub trait ErrorSink {
    fn report(&self, error: ParseError);
}
All parsing and validation functions accept &impl ErrorSink rather
than returning errors directly. This allows:
- Collecting all errors (for batch processing).
- Printing errors in real-time (for interactive use).
- Filtering by severity or code.
- Counting errors without storing them.
The trait uses &self (not &mut self) so it can be shared across
threads. Implementations typically use interior mutability
(Mutex<Vec<ParseError>>).
ErrorCollector is the in-memory collector in
errors/collectors.rs. The stored-diagnostics role is explicit in
both code and docs.
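A minimal sketch of the pattern, assuming a trimmed-down ParseError with no location or context. VecSink and ErrorsOnly are hypothetical names written for this page (the real types are ErrorCollector and ConfigurableErrorSink), and the codes E000 / W000 are placeholders, not real TalkBank codes.

```rust
use std::sync::Mutex;
use std::thread;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Severity {
    Error,
    Warning,
}

/// Trimmed-down stand-in for the real ParseError.
#[derive(Debug, Clone)]
struct ParseError {
    code: &'static str,
    severity: Severity,
    message: String,
}

trait ErrorSink {
    fn report(&self, error: ParseError);
}

/// Hypothetical collector: interior mutability lets report() take &self.
#[derive(Default)]
struct VecSink {
    errors: Mutex<Vec<ParseError>>,
}

impl ErrorSink for VecSink {
    fn report(&self, error: ParseError) {
        self.errors.lock().unwrap().push(error);
    }
}

/// Hypothetical adapter that forwards only error-severity diagnostics.
struct ErrorsOnly<'a, S: ErrorSink>(&'a S);

impl<S: ErrorSink> ErrorSink for ErrorsOnly<'_, S> {
    fn report(&self, error: ParseError) {
        if error.severity == Severity::Error {
            self.0.report(error);
        }
    }
}

fn main() {
    let sink = VecSink::default();
    // Two threads share the same sink through a plain shared reference.
    thread::scope(|scope| {
        for worker in 0..2 {
            let sink = &sink;
            scope.spawn(move || {
                sink.report(ParseError {
                    code: "E000", // placeholder code
                    severity: Severity::Error,
                    message: format!("diagnostic from worker {worker}"),
                });
            });
        }
    });
    // The adapter drops the warning before it reaches the collector.
    ErrorsOnly(&sink).report(ParseError {
        code: "W000", // placeholder code
        severity: Severity::Warning,
        message: "filtered out".to_string(),
    });
    println!("collected {} diagnostics", sink.errors.lock().unwrap().len());
}
```

Because report takes &self, adapters compose by holding a reference to another sink, and no caller needs exclusive access.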
Module layout in talkbank-model:
- errors/error_sink.rs: trait and lightweight forwarding sinks.
- errors/collectors.rs: in-memory collectors and counters.
- errors/async_channel_sink.rs: Tokio-channel streaming.
- errors/configurable_sink.rs, errors/offset_adjusting_sink.rs, errors/tee_sink.rs: adapters.
ChannelErrorSink is opt-in behind the channels feature so the
default talkbank-model dependency does not pull in crossbeam just
to own the core error trait and in-memory collectors.
Two Error Layers
Errors are detected at two layers. This distinction matters for spec testing.
- Parser layer: structural errors caught during parser.parse_chat_file(). These prevent the file from being fully parsed (missing @Begin, invalid syntax). Parser-layer specs test that parser.parse_chat_file() returns Err.
- Validation layer: semantic errors caught by validate_with_alignment() after a successful parse. The file parsed correctly but violates constraints (%mor alignment mismatch, undeclared speakers). Validation-layer specs test that validation reports specific error codes.
Adding a New Error Code
- Add the variant to ErrorCode in crates/talkbank-model/src/errors/codes/error_code.rs with a #[code("Exxx")] attribute.
- Create a spec file in spec/errors/Exxx-description.md following the existing template.
- Construct ParseError::new(ErrorCode::YourVariant, ...) at the detection site in the parser or validator.
- Regenerate the affected spec artifacts with the current spec/tools binaries (gen_rust_tests, gen_validation_corpus, and optionally gen_error_docs).
- Run the concrete verification commands from book/src/contributing/dev-checks.md.
Validation
Status: Current Last modified: 2026-06-13 22:40 EDT
CHAT validation runs at multiple points in the processing pipeline.
All validation logic is in Rust: talkbank-model::validation owns
CHAT-core validation, and talkbank_transform::validate
(crates/talkbank-transform/src/validate.rs) owns the transform-side
pre/post validation gate functions (validate_to_level,
validate_output). This page covers validity levels, pre/post
validation gates, severity posture, the verification-gate set
(G0-G14), and how validation failures interact with caches and bug
reports.
For error-code infrastructure (codes, sinks, severities, layers), see chat-core-errors. For the diagnostic UX standard, see error-diagnostics-ux.
Validity Levels
The ValidityLevel enum defines three cumulative validation levels.
Each level includes all checks from lower levels.
| Level | Name | Checks |
|---|---|---|
| L0 | Parseable | No parse errors (clean tree-sitter CST) |
| L1 | StructurallyComplete | @Participants and @Languages present, all speaker codes declared, every utterance has a terminator |
| L2 | MainTierValid | Well-formed words, valid timing bullets if present |
Pre-validation gates
Each command requires input to meet a minimum level before processing:
| Command | Required level |
|---|---|
| morphotag | MainTierValid |
| utseg | StructurallyComplete |
| translate | StructurallyComplete |
| coref | StructurallyComplete |
| align | Parseable (lenient, must handle messy real-world files) |
validate_to_level() checks the file against the required level and
returns all failures found. Invalid files are rejected early with
diagnostics, before any compute is spent on inference.
flowchart TD
cmd["a transform command\n(morphotag, utseg, translate,\ncoref, align)"]
gate["validate_to_level(file, required_level)\n(talkbank-transform validate.rs)"]
check{"meets the command's minimum\nValidityLevel?\n(L0 Parseable / L1 StructurallyComplete\n/ L2 MainTierValid)"}
reject["reject early with diagnostics;\nno compute spent"]
proceed["run the command's inference"]
cmd --> gate --> check
check -->|"no"| reject
check -->|"yes"| proceed
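The gate logic can be sketched as an ordered enum plus a lookup. This is an illustration under assumptions: the real validate_to_level takes a file and returns all failures found, whereas the hypothetical gate below takes an already-computed level, and required_level is an invented helper that mirrors the table above.

```rust
/// Cumulative levels; derive order follows declaration order,
/// so Parseable < StructurallyComplete < MainTierValid.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum ValidityLevel {
    Parseable,            // L0
    StructurallyComplete, // L1
    MainTierValid,        // L2
}

/// Minimum level per command, mirroring the table above.
fn required_level(command: &str) -> Option<ValidityLevel> {
    match command {
        "morphotag" => Some(ValidityLevel::MainTierValid),
        "utseg" | "translate" | "coref" => Some(ValidityLevel::StructurallyComplete),
        "align" => Some(ValidityLevel::Parseable),
        _ => None,
    }
}

/// Reject early when the input does not reach the command's minimum level.
fn gate(command: &str, achieved: ValidityLevel) -> Result<(), String> {
    let required = required_level(command)
        .ok_or_else(|| format!("unknown command: {command}"))?;
    if achieved >= required {
        Ok(())
    } else {
        Err(format!(
            "{command} requires {required:?}, input only reaches {achieved:?}"
        ))
    }
}

fn main() {
    println!("{:?}", gate("align", ValidityLevel::Parseable));
    println!("{:?}", gate("morphotag", ValidityLevel::StructurallyComplete));
}
```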
Post-Serialization Validation
After an orchestrator injects results and serializes CHAT output, the
server runs validate_output():
- Alignment validation: checks that %mor/%gra/%wor tier word counts match the main tier. ParseHealth-aware: utterances flagged as unparseable during lenient parsing are excluded.
- Semantic validation: full CHAT validation:
  - E362: non-monotonic timestamps (utterance bullets must increase).
  - E701 / E704: temporal constraints (overlap rules, same-speaker timing).
  - Header correctness: required headers present and well-formed.
  - Cross-utterance patterns: speaker code consistency.
Output is blocked only on severity="error" diagnostics, not on warnings.
Severity Posture
Validation intentionally distinguishes errors from warnings:
- Errors block output. The server will not write CHAT with error-level validation failures.
- Warnings are reported but do not block. Legacy corpora contain widespread minor violations that must remain processable.
This distinction matters especially for %gra:
- Existing broken %gra in old corpora may be accepted with warnings so files remain processable.
- Newly generated %gra from batchalign3 is validated more strictly before writeback.
Bug Reports and Cache Purges
When post-serialization validation fails:
- A structured bug report is written to ~/.batchalign3/bug-reports/.
- Cache entries that produced the invalid output are purged (self-correcting cache).
This prevents broken results from being served on future runs.
Verification Contract
This repo does not currently expose the predecessor workspace’s
make verify wrapper. The current local contract is the concrete command set
documented in Developer Verification Checks
and Testing and Quality Gates.
Core local sweep:
cargo fmt --all -- --check
cargo build --workspace --all-targets --locked
cargo nextest run --workspace
cargo test --doc
Add the surface-specific checks that match the validation-affecting code you changed:
- grammar: cd grammar && tree-sitter generate && tree-sitter test
- spec tools: cargo build --manifest-path spec/tools/Cargo.toml and cargo build --manifest-path spec/runtime-tools/Cargo.toml
- parser / model / alignment / serialization: cargo nextest run -p talkbank-parser-tests -E 'test(parser_equivalence)' and cargo nextest run -p talkbank-parser-tests --test roundtrip_reference_corpus
The reference corpus at corpus/reference/ remains the sacred semantic target.
Historical labels like G0-G14 are useful for older design notes, but they are
not the current command surface of this checkout.
Validation at the PyO3 Boundary
There is no public Python validation API. The ParsedChat handle
that previously exposed validate() / validate_structured() /
validate_chat_structured() was retired in the 2026-03-21 PyO3
slimdown to worker-runtime-only. Validation now runs entirely on the
Rust side; when a worker invocation detects a failure it constructs
BatchalignBoundaryError::ChatValidation { entries, … } which the
PyO3 boundary lowers into a CHATValidationException carrying a
populated errors: list[ValidationErrorEntry] on the Python side.
Python callers that need structured validation results invoke
batchalign3 via subprocess and catch the exception:
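import batchalign_core
from batchalign_core import CHATValidationException

# `request` is assumed to be built elsewhere by the caller.
try:
    batchalign_core.execute_v2(request)
except CHATValidationException as exc:
    # Each entry is a ValidationErrorEntry describing one failure.
    for entry in exc.errors:
        print(entry.code, entry.line, entry.message)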
Upstream batchalign runtime errors and the Python ↔ Rust boundary
contract are documented separately in the batchalign3 project.
Known limitations
- Validation rules are intentionally permissive on legacy data. Some checks emit warnings rather than errors so legacy corpora remain processable while still surfacing the issue. Examples: pre-existing malformed %gra (warned, not blocked, so files that already shipped with bad %gra round-trip cleanly); some bullet-format minor variants. Newly generated tiers from batchalign are validated more strictly before writeback.
- %wor word counts are not validated against the main tier. %wor is a timing-annotation tier with no downstream positional indexing; legacy files may have xxx, fragments, or nonwords in %wor without producing alignment errors.
- Cross-utterance quotation validation is gated off by default (enable_quotation_validation flag); the cross-utterance walker exists but is not yet wired into the standard validation gate.
- Some error-spec / validator pairs are not yet implemented. Tracked in spec/errors/ files marked Status: not_implemented; these generate #[ignore] tests via the current spec/tools generators rather than failing CI. Run grep -rl "Status.*not_implemented" spec/errors/ to enumerate.
CHECK Parity Audit
Status: Current Last updated: 2026-06-17 11:29 EDT
CLAN’s check (CHECK) was the long-standing validator for CHAT files. chatter validate is the forward-looking replacement, and it is the binding judgment
on whether a byte sequence is valid CHAT: when chatter rejects a file, the
file is invalid and the right response is to clean the data, not to weaken the
parser. CHECK is no longer the authority on validity.
CHECK is still useful for one thing: as a reference oracle that helps find
validation rules chatter does not yet have. The CHECK Parity Audit is the
tool that compares the two systematically, so that every rule CHECK enforces is
either matched by chatter, or is a deliberate, documented divergence.
What the audit answers
For every error code CLAN’s check actually emits, the audit answers: does
chatter have an equivalent rule, and if not, why not?
- Semantic parity: does chatter enforce the same intended rule?
- Behavioral parity: does chatter match CHECK’s literal runtime behavior, including CHECK’s documented anomalies (some CHECK rules are buggy or were disabled in place; reproducing those bugs is not a goal)?
- Strictness policy: chatter should be at least as strict semantically. A file CHECK rejects should not silently pass chatter unless the divergence is deliberate.
How it works
flowchart LR
cpp["CLAN check.cpp\n(OSX-CLAN/src/clan)"]
extract["scripts/extract_check_codes.py\n(every code CHECK actually emits)"]
ref["clan-check-reference/\ncheck-error-codes.json"]
map["map_by_id()\n(audit_check_parity.rs)"]
audit["audit_check_parity\nbinary"]
out["docs/audits/\ncheck-parity-audit.md"]
cpp --> extract --> ref
ref --> audit
map --> audit
audit --> out
- The CHECK reference is generated from CLAN’s check.cpp by scripts/extract_check_codes.py into crates/talkbank-parser-tests/clan-check-reference/check-error-codes.json. It records every code CHECK emits (the call sites in the C source), not the stale subset documented in CLAN’s own CHECK-rules.md.
- The mapping lives in map_by_id() in crates/talkbank-parser-tests/src/bin/audit_check_parity.rs: an explicit CHECK-number to TalkBank-code table (for example 138 | 139 => &["E256"]), with a keyword fallback for the unmapped remainder.
- The audit binary joins the two and writes the report. Regenerate it with:
  cargo run -p talkbank-parser-tests --bin audit_check_parity
  which rewrites docs/audits/check-parity-audit.md (the full per-rule table and the executive summary). That generated file is the authoritative, citation-stable record; this page explains how to read it.
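The shape of the explicit mapping table can be sketched as a match on the CHECK number. Only the 138 | 139 arm comes from the text above; the return type, the empty-slice convention for unmapped codes, and the split_by_mapping helper are assumptions, and the keyword fallback is not modelled.

```rust
/// Sketch of an explicit CHECK-number to TalkBank-code table.
fn map_by_id(check_code: u32) -> &'static [&'static str] {
    match check_code {
        138 | 139 => &["E256"],
        // Assumption: unmapped codes yield an empty slice here.
        _ => &[],
    }
}

/// Hypothetical join step: split emitted CHECK codes into mapped and unmapped.
fn split_by_mapping(emitted: &[u32]) -> (Vec<u32>, Vec<u32>) {
    emitted
        .iter()
        .copied()
        .partition(|code| !map_by_id(*code).is_empty())
}

fn main() {
    let (mapped, unmapped) = split_by_mapping(&[49, 109, 138, 139]);
    println!("mapped: {:?}", mapped);
    println!("unmapped (need triage): {:?}", unmapped);
}
```

An unmapped code in this sketch is only a candidate for triage, not a confirmed gap.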
The current headline numbers (regenerate to refresh): of the CHECK codes that are actually emitted, roughly two-thirds map directly to a TalkBank code, and the audit reports semantic parity, behavioral parity, and an “enhancements beyond CHECK” set (TalkBank codes with no CHECK equivalent, the majority).
Triaging a gap
A CHECK rule with no TalkBank mapping is not automatically a chatter bug.
Each gap is triaged against the CLAN source (OSX-CLAN/src/clan/check.cpp) into
one of three buckets:
- (a) Genuine gap. CHECK enforces a real CHAT rule chatter is missing. Action: implement it in chatter with strict top-down TDD (a failing chatter validate test on a real .cha fixture first), then add the map_by_id entry. Example: curly single quotes (see below).
- (b) Intentional divergence. CHECK’s rule is wrong, disabled, or a text-hack chatter deliberately does not reproduce. Action: document the divergence, do not implement. Examples: CHECK error 49 (uppercase-in-word) has been commented out in check.cpp since 2019, so flagging it would diverge from current CHECK; CHECK error 109 (postcodes on dependent tiers) is a raw character-match text-hack on %-tier tokens that chatter models as structured content.
- (c) Enhancement beyond CHECK. A TalkBank code with no CHECK counterpart. These are validation rules chatter adds; they need no CHECK mapping.
The remaining unmapped CHECK codes are an open, low-priority tail: most resolve to bucket (b) on source examination. Closing them is not a release gate.
Worked example: E256 (CHECK 138/139), implemented across both parsers
Curly single quotes (U+2018, U+2019) used as word characters were a genuine
gap (bucket a): CHECK errors 138/139 flag them, chatter previously absorbed
them silently. They are illegal CHAT word characters; CHAT uses the ASCII
apostrophe.
Because chatter has two parsers that must agree (the tree-sitter parser and
the re2c oracle, see Parser Backends), the fix lands in
both, reaching the same recovery:
- The character is excluded from the word token via the shared Symbol Registry (so it can never be part of a word).
- The tree-sitter grammar recognizes it as a dedicated illegal_curly_quote node (not a generic parse error), and the parser emits E256 with a span pointing at the exact character.
- The re2c lexer emits a recognized IllegalCurlyQuote token; the file-level parser emits E256 and drops the token before parsing.
- In both, the offending quote is dropped and the surrounding words survive, so validation continues and reports a precise, actionable diagnostic.
This is the canonical shape of a CHECK-parity rule implemented to chatter’s
standards: a recognized construct (parse, don’t merely fail), the same behavior
in both parsers, and a spec in spec/errors/ that drives the tests.
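The shared recovery (drop the quote, keep the surrounding text, report an exact span) can be illustrated at the string level. Neither parser is implemented this way; the Diagnostic struct and drop_curly_quotes are hypothetical and exist only to show the observable behaviour.

```rust
/// Hypothetical diagnostic carrying a byte span into the source line.
#[derive(Debug, PartialEq)]
struct Diagnostic {
    code: &'static str,
    start: usize,
    end: usize,
}

/// Drop U+2018 / U+2019 and report E256 at each dropped character.
fn drop_curly_quotes(line: &str) -> (String, Vec<Diagnostic>) {
    let mut cleaned = String::with_capacity(line.len());
    let mut diagnostics = Vec::new();
    for (offset, ch) in line.char_indices() {
        if ch == '\u{2018}' || ch == '\u{2019}' {
            diagnostics.push(Diagnostic {
                code: "E256",
                start: offset,
                // Byte span covers the full UTF-8 encoding of the quote.
                end: offset + ch.len_utf8(),
            });
        } else {
            cleaned.push(ch);
        }
    }
    (cleaned, diagnostics)
}

fn main() {
    let (cleaned, diagnostics) = drop_curly_quotes("don\u{2019}t go");
    println!("{cleaned}");
    println!("{diagnostics:?}");
}
```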
Related
- Bullet Validation documents the temporal media-bullet checks (CLAN errors 83/133/84 and chatter’s E701/E704/E729), a specific instance of the same “match CHECK where it is right, diverge where it is wrong” reconciliation this audit tracks across the whole error set.
- Errors, CHAT core describes the ErrorCode model and the parser-layer / validation-layer split.
- The spec-driven test pipeline that backs every rule is in Testing: rules live in spec/errors/ and generate both parser tests and the validation corpus.
Crate Reference
Status: Current Last modified: 2026-06-15 15:00 EDT
Summary of the main crates and packages in TalkBank/chatter.
Foundational crates
tree-sitter-talkbank
Rust binding crate for the generated TalkBank CHAT tree-sitter grammar. Exposes
LANGUAGE, NODE_TYPES, and the generated query constants used by editor and
parser integrations.
talkbank-model
The typed data model for CHAT files. Defines ChatFile, Utterance, DependentTier, MorTier, GraTier, and all other AST types. Includes validation logic, the WriteChat trait for CHAT serialization, serde support for JSON, and JsonSchema derivations. Also owns error types (ParseError, ErrorSink trait, Span, SourceLocation), diagnostic infrastructure, and ParseValidateOptions. Provides a closure-based content walker (walk_words / walk_words_mut) that centralizes recursive traversal of UtteranceContent and BracketedItem with domain-aware group gating.
talkbank-derive
Procedural macros for the model crate (SemanticEq, SemanticDiff, SpanShift, ValidationTagged, and the error_code_enum macro).
talkbank-cache
SQLite-backed validation and roundtrip cache used by higher-level validation and corpus workflows.
talkbank-parser
The canonical parser. Wraps the tree-sitter C parser and converts the concrete
syntax tree (CST) into ChatFile model types. Provides error recovery via
tree-sitter’s GLR algorithm and is the parser used by the CLI, LSP, transform
pipelines, and editor tooling.
talkbank-parser-re2c
Independent alternate parser used as an equivalence oracle against the tree-sitter parser. Primarily a testing and spec-hardening tool rather than a first-wave end-user surface.
talkbank-transform
High-level pipelines: parse+validate, CHAT-to-JSON, JSON-to-CHAT, normalization. Integrates the validation cache, JSON schema validation, and parallel directory validation.
Application and integration surfaces
chatter
The chatter CLI binary: validate, normalize, to-json, and corpus management.
talkbank-lsp
Language Server Protocol server with tree-sitter incremental parsing, real-time diagnostics, and semantic highlighting.
send2clan
Rust bindings for sending files to the CLAN application (macOS Apple Events,
Windows WM_APP). The crate exposes the safe send2clan API directly while
keeping the raw FFI in private modules.
chatter-desktop
Desktop validation app (Tauri v2, React). Mandates TUI parity with the CLI.
Test and spec-support crates
talkbank-parser-tests
Parser tests. Runs the parser over the reference corpus and validates the results. Also owns spec-generated tests, roundtrip tests, equivalence tests, and property tests.
spec/tools
Generator binaries for tree-sitter corpus tests, generated Rust tests, shared spec artifacts, and error documentation.
spec/runtime-tools
Runtime-aware spec tooling for validation, bootstrap, and corpus-mining tasks that should not live in the root Rust workspace.
CLI Startup and the Program Stack
Status: Current Last modified: 2026-06-12 19:01 EDT
Why main() in crates/chatter/src/main.rs does not run the program
directly, and what every contributor adding CLI surface should know about
stack budgets.
The incident this page exists for
From 2026-06-05 to 2026-06-12, every chatter invocation crashed on
Windows in debug builds with STATUS_STACK_OVERFLOW (exit code
0xC00000FD) before argument parsing even began. The crash surfaced as
four failing adjudication_tests subprocess tests in the windows-latest
CI job, but the faulting code was the clap-derived command-tree
construction (Cli::augment_args via CommandFactory::command()),
shared by every subcommand. The trigger was ordinary growth: the FREQ
parity work added several hundred flags across 2026-06-03/04, and the
construction path’s stack needs crossed 1 MiB.
Why stack usage is not portable
Two multipliers vary independently, and the crash happens where they collide:
- Platform main-thread allowance. There is no single default:
| Context | Main/default stack |
|---|---|
| Windows main thread | 1 MiB (set in the PE header at link time) |
| macOS main thread | 8 MiB |
| Linux main thread | typically 8 MiB (ulimit -s) |
| Rust spawned threads | 2 MiB unless stack_size is given |
Shipping cross-platform means your real budget is the smallest of these: Windows’ 1 MiB.
- Build profile. At opt-level 0, rustc gives every temporary in a function body its own stack slot and does not coalesce them, so a function’s frame is roughly the SUM of all its temporaries, not the maximum simultaneously alive. clap’s derive expands to one enormous builder function per args struct (one multi-call chain per flag, each Arg/Command temporary a few hundred bytes by value), which is exactly the shape this penalizes. Release builds coalesce slots and inline, shrinking the same frames by one to two orders of magnitude.
Consequence: identical code can be fine in release on macOS (8 MiB budget, small frames) and fatal in debug on Windows (1 MiB budget, fat frames). Debug test binaries cross the line first, which is why CI subprocess tests caught it and shipped release binaries never crashed.
The design: an explicitly sized program thread
main() spawns the entire program onto a thread with an explicit,
documented stack size (PROGRAM_STACK_BYTES, 16 MiB) and only joins and
re-raises panics, so exit semantics are unchanged. This removes the
dependency on platform main-stack defaults altogether instead of
chasing the budget back under an invisible, platform-dependent line
that the CLAN parity roadmap (roughly sixty commands’ worth of flags
still to come) guarantees we would cross again. rustc itself uses the
same pattern for the same reasons.
flowchart TD
main["main()\n(crates/chatter/src/main.rs)"]
spawn["thread::Builder::stack_size(PROGRAM_STACK_BYTES)\n.spawn(program_main)"]
prog["program_main()\nclap tree build + parse + cli::run"]
join{"join() result?"}
ok["process exits normally"]
panic["resume_unwind(payload)\n(same exit behavior as a panic in main)"]
fail["spawn failed (OS resource):\neprintln + exit(1)"]
main --> spawn
spawn -->|"Ok(handle)"| prog
prog --> join
join -->|"Ok(())"| ok
join -->|"Err(payload)"| panic
spawn -.->|"Err(e)"| fail
The reservation is virtual address space; physical pages are committed only as they are touched, so the 16 MiB costs nothing measurable. The extra thread spawn at startup is microseconds.
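The shim described above can be sketched with the standard library alone. This is a minimal illustration, not the code in crates/chatter/src/main.rs: the helper run_on_sized_thread, the name shim_main (standing in for the real main()), and the empty program_main body are assumptions made for the example; only PROGRAM_STACK_BYTES and its 16 MiB value come from the text.

```rust
use std::io;
use std::panic;
use std::process;
use std::thread;

/// Explicit stack budget for the program thread (16 MiB, as quoted above).
const PROGRAM_STACK_BYTES: usize = 16 * 1024 * 1024;

/// Hypothetical stand-in for the real program: clap tree build, parse, run.
fn program_main() {}

/// Runs `f` on a thread with an explicit stack size and returns its result.
/// A panic on that thread is re-raised on the caller with the same payload,
/// so a panic behaves as if it had happened on the calling thread.
fn run_on_sized_thread<T, F>(stack_bytes: usize, f: F) -> io::Result<T>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    let handle = thread::Builder::new()
        .name("program".to_string())
        .stack_size(stack_bytes)
        .spawn(f)?; // Err here means the OS refused to create the thread
    match handle.join() {
        Ok(value) => Ok(value),
        Err(payload) => panic::resume_unwind(payload),
    }
}

/// In a real binary this body would be `main()`; it is named `shim_main`
/// here only to keep the sketch free of an entry point.
fn shim_main() {
    match run_on_sized_thread(PROGRAM_STACK_BYTES, program_main) {
        Ok(()) => {}
        Err(e) => {
            eprintln!("failed to spawn program thread: {e}");
            process::exit(1);
        }
    }
}
```

Keeping the spawn, join, and re-raise logic in one generic helper means the same function can size worker threads as well, which is the situation the contributor guidance on worker threads describes.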
Regression gates
- crates/chatter/tests/stack_limit_tests.rs runs the real binary under a Windows-sized 1 MiB stack (sh -c 'ulimit -s 1024') on Unix, so macOS and Linux CI enforce the Windows constraint on every run. Without this, the constraint is tested only by the windows-latest job, where this incident sat unnoticed for a week.
- The windows-latest cross-platform job remains the native test of the real 1 MiB main stack (which no longer matters to the program thread, but guards the main() shim itself).
Guidance for contributors
- Do not move program logic back onto the bare OS main thread; anything before the spawn runs under the platform’s smallest default.
- Adding flags and subcommands is normal and expected; the budget is now the explicit PROGRAM_STACK_BYTES constant. If deep recursion or generated code ever approaches it, raise the constant deliberately in a reviewed change rather than discovering the limit in CI.
- The same two multipliers apply to any worker threads you spawn: Rust’s 2 MiB spawned-thread default is also finite, and recursive parser or validation code running on worker threads should size them explicitly if depth is data-dependent.
Repository Architecture and Boundaries
Status: Current Last modified: 2026-06-15 15:00 EDT
Top-level layout
spec/ canonical syntax and error spec source
spec/tools/ deterministic generators + validators (separate Cargo workspace)
grammar/ tree-sitter grammar source + generated parser artifacts
crates/ all Rust crates (root Cargo workspace)
  talkbank-model/ data model, validation, alignment, errors, parser API trait
  talkbank-derive/ proc macros (SemanticEq, SpanShift, ValidationTagged, error_code_enum)
  talkbank-parser/ canonical parser (tree-sitter)
  talkbank-parser-re2c/ alternate parser (specification oracle, opt-in batch parser)
  talkbank-parser-tests/ parser equivalence and roundtrip tests
  talkbank-transform/ pipelines, CHAT↔JSON, caching, parallel validation
  chatter/ the `chatter` CLI binary
  talkbank-lsp/ LSP server
  send2clan/ Rust bindings to the legacy CLAN app bridge
  talkbank-cache/ validation + roundtrip cache
apps/ desktop app (Tauri v2 + React): chatter-desktop
corpus/ reference corpus (must pass 100%)
schema/ JSON Schema for ChatFile AST
tests/ workspace-level integration tests and fixtures
fuzz/ fuzz targets (separate Cargo workspace)
book/ mdBook documentation source
docs/ strategy docs, proposals, and investigations
Architectural principles
- Clear boundaries between specification, generation, runtime logic, and documentation.
- Generated artifacts and hand-authored code are kept separate with hard guardrails: parser.c, node-types.json, generated tests, and error-doc artifacts are never edited by hand.
- Each crate has a single clear responsibility.
- Entry-point docs guide new contributors to authoritative references quickly.
Canonical ownership rules
- spec/ owns the language intent and accepted examples: what CHAT means.
- grammar/ owns tokenization and CST shape only, not semantic validation policy.
- talkbank-model owns semantic validity, serialization invariants, error types, and parser API contracts.
- talkbank-transform owns pipelines and JSON schema validation.
- talkbank-cache owns the shared SQLite-backed validation and roundtrip cache.
Dependency direction rules
- spec does not depend on runtime crates.
- grammar is consumed by parser crates, not vice versa.
- talkbank-model is dependency-minimal and stable; all other talkbank-* crates depend on it.
- CLI / LSP / desktop apps depend on stable internal APIs, never directly on unstable internals of other crates.
- Generator tools may read specs and grammar metadata but do not become runtime dependencies.
Acceptance criteria
- Every top-level directory has a clear purpose statement.
- No crate depends on internal modules outside declared boundaries.
- No generated artifact is edited manually.
- New contributors can identify authoritative docs in less than five minutes.
Grammar System and Token Governance
Status: Current Last modified: 2026-05-29 18:43 EDT
Current Reality
grammar/grammar.js encodes substantial implicit language knowledge directly in regex exclusions,
reserved symbol lists, and leniency decisions. Example areas:
- word segment forbidden start/rest classes,
- CA delimiter/element symbol groups,
- event segment exclusions,
- hand-maintained coupling between comments and token rules.
This is currently powerful but fragile.
Primary Failure Modes
- New symbolic token added in one place but not in exclusion sets.
- Parser behavior changes silently due to regex class edits.
- Generated node types drift from assumptions in spec tooling.
- Lenient parsing choices become undocumented policy.
Current Design
The generated symbol registry is the single source of token constraints.
The pipeline has shipped; running just symbols-gen rebuilds it.
Registry Artifacts
- spec/symbols/symbol_registry.json (human-authored intent):
  - symbol string
  - category (delimiter, continuation, overlap, punctuation, etc.)
  - contexts where reserved/allowed
  - parse role and precedence notes
- Generated outputs:
  - grammar/src/generated_symbol_sets.js
  - crates/talkbank-model/src/generated/symbol_sets.rs
  - spec/tools/src/generated/symbol_sets.rs
  - docs: Symbol Registry
Grammar Refactor Requirements
- Replace large manual regex strings with generated character classes.
- Keep final grammar readable by preserving semantic names in generated constants.
- Distinguish clearly between:
- syntax permissiveness,
- semantic validation restrictions.
- Add comments only for design rationale, not for duplicating manual references.
Node Type Drift Controls
- Enforce regeneration and consistency checks:
- grammar source change must regenerate parser and node types,
- node type constants consumed by spec/tools and parser code must compile,
- CI fails if generated files differ from committed state.
Leniency Policy
Explicitly classify every lenient parse behavior:
- Parse-lenient + validate-strict.
- Parse-lenient + validate-warning.
- Parse-strict (hard fail).
Document this matrix in the Leniency Policy.
Grammar Test Strategy
- Keep corpus tests generated from spec/constructs.
- Add targeted hand-authored edge tests for symbol boundary interactions.
- Add mutation-style tests for forbidden-character regressions.
- Add parser equivalence tests for tokenizer-sensitive cases.
Acceptance Criteria
- No manual reserved-symbol duplication in grammar.js.
- Symbol registry is generated to all required consumers.
- Grammar modifications cannot land with stale generated artifacts.
- Every special token category has explicit policy documentation.
Parser, Model, and API Contracts
Status: Current Last updated: 2026-06-21 21:33 EDT
Single-handle parser API
talkbank-parser provides TreeSitterParser as the canonical API
handle for all parsing; full-file and fragment methods live directly
on the struct. Callers create one instance and pass
&TreeSitterParser everywhere. The alternate talkbank-parser-re2c
is opt-in (specification oracle and high-throughput batch parsing)
and produces the same ChatFile model.
Contract for Batchalign
The Batchalign runtime (the batchalign crate) consumes these
guarantees from the talkbank-* core crates:
- parsing produces a typed ChatFile or an explicit parse-status signal
- parse-health taint is visible to alignment consumers
- alignment helpers operate on semantic model types, not raw text hacks
- recovery never fabricates valid-looking placeholder semantics for malformed input
The parser/model boundary stays honest enough for downstream
workflows (align, compare, benchmark, morphotagging) to make
their own validity decisions.
Canonical Contract Model
Public Contract Layers
- Parse API Contract:
- stable function signatures,
- deterministic parse result envelope,
- clear partial-success semantics.
- Semantic Model Contract:
- stable core model fields,
- explicit unstable/internal fields policy.
- Diagnostic Contract:
- stable error code IDs and severity semantics,
- best-effort message text compatibility.
- Serialization Contract:
- deterministic output constraints,
- normalized formatting policy.
Required Types
- ParseOutcome<T>
  - value: T | omitted-by-status
  - diagnostics: Vec<Diagnostic>
  - status: Success | Partial | Failed
- Diagnostic
  - code, severity, category, message, location, context, suggestion
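The Required Types list can be read as the following Rust shape. This is a minimal sketch under stated assumptions, not the crate's definition: the field representations (a String code, a byte-span tuple for location), the constructor names, and the clean_value helper are invented for illustration, and category, context, and suggestion are left out for brevity.

```rust
/// Status values named in the list above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ParseStatus {
    Success,
    Partial,
    Failed,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Severity {
    Warning,
    Error,
}

/// Reduced diagnostic; the field types are assumptions for this sketch.
#[derive(Debug, Clone, PartialEq)]
struct Diagnostic {
    code: String,
    severity: Severity,
    message: String,
    /// Byte span in the source (hypothetical representation of `location`).
    location: (usize, usize),
}

/// Envelope returned by a parse call: the value may be omitted by status.
#[derive(Debug)]
struct ParseOutcome<T> {
    value: Option<T>,
    diagnostics: Vec<Diagnostic>,
    status: ParseStatus,
}

impl<T> ParseOutcome<T> {
    fn success(value: T) -> Self {
        Self { value: Some(value), diagnostics: Vec::new(), status: ParseStatus::Success }
    }

    fn partial(value: T, diagnostics: Vec<Diagnostic>) -> Self {
        Self { value: Some(value), diagnostics, status: ParseStatus::Partial }
    }

    fn failed(diagnostics: Vec<Diagnostic>) -> Self {
        Self { value: None, diagnostics, status: ParseStatus::Failed }
    }

    /// Returns the value only when the parse was fully clean, so callers
    /// cannot mistake a recovered value for an untainted one.
    fn clean_value(&self) -> Option<&T> {
        match self.status {
            ParseStatus::Success => self.value.as_ref(),
            ParseStatus::Partial | ParseStatus::Failed => None,
        }
    }
}
```

Carrying status next to an optional value keeps the partial-success case explicit: a consumer that wants the recovered value has to read value directly and so opts in to tainted data knowingly.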
Parser Role
- talkbank-parser: the canonical parser, used by CLI/LSP/API/batchalign3.
- TreeSitterParser is the only API handle; callers create one and pass &TreeSitterParser everywhere.
- Tree-sitter GLR provides error recovery; the Rust traversal code converts CST to typed model.
- Full-file methods: parser.parse_chat_file(), parser.parse_chat_file_streaming().
- Fragment methods: parser.parse_word_fragment(), parser.parse_main_tier_fragment(), etc.
Invariants
- Parsing with offset must shift all spans consistently.
- Parse-level and validation-level diagnostics must remain distinguishable.
- Serialization should preserve semantic equivalence and documented formatting rules.
- Roundtrip behavior must be testable per parser implementation.
- Parser functions that accept ErrorSink should not return Option<T> for fallible parse state.
API Versioning Policy (Pre-1.0, Strict)
- Three intended contract levels:
- Stable-for-integrators
- Stable-internal
- Experimental
- Mark every public function/type by contract level.
This classification is not yet codified in a separate manifest file; the levels above are the working policy. Integrators should treat any unmarked surface as Experimental until contract levels are formally published.
Acceptance Criteria
- Single canonical parse outcome envelope exposed for integrators.
- Parser implementations conform to shared contract tests.
- Contract-level annotations exist for all public API surfaces.
- Documentation for parse/validate/serialize lifecycle is centralized and current.
Recovery Contract: No Fabricated Semantic Values
The parser contract must forbid sentinel semantic values during error recovery.
Disallowed recovery behavior:
- returning arbitrary enum variants as fallback for unknown/missing nodes,
- returning empty strings as stand-ins for required fields,
- constructing fake words/chunks like "missing", "error", or other placeholders.
Required recovery behavior:
- Emit structured diagnostic with precise span and expected node kind.
- Return an explicit parse-status signal (Partial/Failed) through ParseOutcome.
- Omit invalid semantic node OR store it in explicit recovery metadata, never as a valid semantic value.
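As an illustration of the required behavior, the sketch below extracts the speaker code from a main-tier line such as "*CHI:\thello ." and refuses to fabricate one when it is missing. Everything here is hypothetical: parse_speaker, SpeakerOutcome, and the reduced Diagnostic are invented for the example and are not the crate's API, and the span simply covers the whole line rather than the precise node.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Success,
    Failed,
}

#[derive(Debug, Clone, PartialEq)]
struct Diagnostic {
    message: String,
    /// Byte span of the offending input (whole line in this sketch).
    span: (usize, usize),
}

#[derive(Debug, PartialEq)]
struct SpeakerOutcome {
    /// `None` when the code is missing; never an empty or placeholder string.
    speaker: Option<String>,
    diagnostics: Vec<Diagnostic>,
    status: Status,
}

fn parse_speaker(line: &str) -> SpeakerOutcome {
    // Take the text between the leading '*' and the first ':'.
    let code = line
        .strip_prefix('*')
        .and_then(|rest| rest.split_once(':'))
        .map(|(code, _)| code)
        .filter(|code| !code.is_empty());

    match code {
        Some(code) => SpeakerOutcome {
            speaker: Some(code.to_string()),
            diagnostics: Vec::new(),
            status: Status::Success,
        },
        // Recovery path: report what was expected and omit the value.
        None => SpeakerOutcome {
            speaker: None,
            diagnostics: vec![Diagnostic {
                message: "expected speaker code between '*' and ':'".to_string(),
                span: (0, line.len()),
            }],
            status: Status::Failed,
        },
    }
}
```

The failure arm is the point of the example: a downstream consumer sees speaker == None plus a diagnostic, and can never confuse parser-generated filler with user content.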
Current enforcement:
- CI guardrail script tracks and blocks introduction of new ErrorSink + Option signatures.
- See scripts/check-errorsink-option-signatures.sh and scripts/errorsink_option_allowlist.txt.
Rationale:
- fabricated semantic values create secondary, misleading diagnostics against synthetic data,
- downstream tools cannot distinguish real user content from parser-generated placeholders,
- equivalence and regression tests become noisy and non-actionable.
For batchalign3, this is especially important because alignment workflows
must be able to tell the difference between:
- a malformed input that should taint or block alignment
- a recoverable input where raw text can be preserved
- a clean input that should proceed through the align/compare pipeline
String Storage Policy
The model uses three string storage strategies:
- Arc<str> interning (interned_newtype!): For high-frequency repeated values (POS tags, stems, speaker codes). Global interner avoids redundant allocations.
- SmolStr (string_newtype!): For short strings (median 10-15 chars) that benefit from inline storage. O(1) clone, no heap allocation for strings ≤23 bytes.
- String: Only for utility types outside the core model (e.g., semantic_diff/).
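The first strategy can be sketched with the standard library alone. This is a generic global interner written for illustration; it assumes nothing about how interned_newtype! is actually implemented, and the names pool and intern are hypothetical.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex, OnceLock};

/// Process-wide pool of interned strings (hypothetical stand-in for the
/// interner behind `interned_newtype!`).
fn pool() -> &'static Mutex<HashSet<Arc<str>>> {
    static POOL: OnceLock<Mutex<HashSet<Arc<str>>>> = OnceLock::new();
    POOL.get_or_init(|| Mutex::new(HashSet::new()))
}

/// Returns a shared handle for `s`, allocating only the first time a given
/// string is seen. Later calls clone the existing `Arc`, which is a
/// reference-count bump rather than a new allocation.
fn intern(s: &str) -> Arc<str> {
    let mut pool = pool().lock().expect("interner mutex poisoned");
    if let Some(existing) = pool.get(s) {
        return Arc::clone(existing);
    }
    let fresh: Arc<str> = Arc::from(s);
    pool.insert(Arc::clone(&fresh));
    fresh
}
```

For values such as POS tags that repeat many times per file, every occurrence after the first shares one allocation, which is the saving the policy is after.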
Parser Backends
Status: Current Last modified: 2026-06-13 22:40 EDT
TalkBank has two CHAT parser implementations. Both implement the ChatParser
trait and produce identical ChatFile model types.
The --parser flag selects the backend at the CLI boundary; everything
downstream consumes the identical ChatFile output, so the choice is
invisible past the dispatch point:
flowchart TD
cli["chatter validate --parser <backend>\n(ParserBackend enum,\nchatter cli_types.rs)"]
sel{"which backend?\n(ParserKind,\ntalkbank-transform\nvalidation_runner/config.rs)"}
ts["TreeSitterParser\n(talkbank-parser:\nGLR, incremental)"]
re2c["Re2cParser\n(talkbank-parser-re2c:\nre2c DFA + chumsky)"]
trait["ChatParser trait\n(talkbank-model\nparser_api/chat_parser.rs)"]
model["ChatFile\n(talkbank-model:\nSemanticEq-identical\nfor both backends)"]
cli --> sel
sel -->|"tree-sitter (default)"| ts
sel -->|"re2c"| re2c
ts -->|"ParserDispatch::TreeSitter\n(worker.rs) implements"| trait
re2c -->|"ParserDispatch::Re2c\n(worker.rs) implements"| trait
trait --> model
ParserDispatch::new(kind) (in validation_runner/worker.rs) is the single
place that constructs the chosen backend from a ParserKind; both variants
wrap a ChatParser implementor, so the validation runner never branches on
backend again.
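The dispatch shape described above can be sketched as an enum whose variants each wrap a trait implementor. The type names follow the text, but the bodies are placeholders: the real ChatParser trait, ChatFile model, and parser constructors are richer than this, and the parser() accessor and the line-counting parse are assumptions made to keep the example runnable.

```rust
/// Reduced model; the real ChatFile is a full typed AST.
#[derive(Debug, PartialEq)]
struct ChatFile {
    line_count: usize,
}

/// Simplified stand-in for the ChatParser trait.
trait ChatParser {
    fn parse_chat_file(&self, source: &str) -> ChatFile;
}

struct TreeSitterParser;
struct Re2cParser;

impl ChatParser for TreeSitterParser {
    fn parse_chat_file(&self, source: &str) -> ChatFile {
        ChatFile { line_count: source.lines().count() }
    }
}

impl ChatParser for Re2cParser {
    fn parse_chat_file(&self, source: &str) -> ChatFile {
        ChatFile { line_count: source.lines().count() }
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ParserKind {
    TreeSitter,
    Re2c,
}

enum ParserDispatch {
    TreeSitter(TreeSitterParser),
    Re2c(Re2cParser),
}

impl ParserDispatch {
    /// The only place that branches on the selected backend.
    fn new(kind: ParserKind) -> Self {
        match kind {
            ParserKind::TreeSitter => Self::TreeSitter(TreeSitterParser),
            ParserKind::Re2c => Self::Re2c(Re2cParser),
        }
    }

    /// Everything downstream sees only the trait object.
    fn parser(&self) -> &dyn ChatParser {
        match self {
            Self::TreeSitter(p) => p,
            Self::Re2c(p) => p,
        }
    }
}
```

Because callers only hold a &dyn ChatParser, adding a third backend would touch ParserKind and ParserDispatch but none of the code that consumes the parsed ChatFile.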
TreeSitterParser (default)
- Crate: talkbank-parser
- Technology: tree-sitter GLR parser
- Grammar: grammar/grammar.js → generated C parser
- Strengths: Incremental reparsing (LSP), robust error recovery (GLR), CST-level diagnostics
- Weaknesses: Slower on batch workloads, !Send + !Sync (one parser per thread)
Used by the LSP, the default CLI, and all production validation.
Re2cParser
- Crate: talkbank-parser-re2c
- Technology: re2c DFA lexer + chumsky parser combinators
- Grammar: Translated from grammar.js rules → re2c conditions + chumsky combinators
- Strengths: 4-8x faster, Send + Sync, zero constructor cost, specification oracle
- Weaknesses: No incremental reparsing, Box::leak memory strategy
Used for batch validation, parser parity testing, and performance benchmarking.
CLI Usage
# Default: tree-sitter
chatter validate corpus/
# Use re2c for faster batch validation
chatter validate --parser re2c corpus/
# Roundtrip with re2c
chatter validate --parser re2c --roundtrip corpus/
The --parser flag accepts tree-sitter (default) or re2c. Cache entries
are parser-specific; switching parsers does not invalidate the other’s cache.
Parity Status
Both parsers produce SemanticEq-identical output on the 87-file reference
corpus (100% match). On the ~100k-file wild corpus, parity is ~98.7%.
Error Detection
| Metric | Value |
|---|---|
| Specs tested | 140 |
| Both detect error | 140/140 (100%) |
| Same error code | 79/140 (56.4%) |
| Different code, both detect | 61/140 (43.6%) |
| Re2c silent (misses error) | 0 |
The 61 code mismatches come from architectural differences, not bugs. Both parsers report actionable diagnostics for all 140 testable error specs.
Performance
| Benchmark | TreeSitter | Re2c | Speedup |
|---|---|---|---|
| Small file (13 lines) | 44 µs | 9.6 µs | 4.6x |
| Medium file (dependent tiers) | 69 µs | 9.4 µs | 7.3x |
| Large file (complex) | 7,734 µs | 970 µs | 8.0x |
| Batch (35 files) | 21.7 ms | 3.0 ms | 7.2x |
Run benchmarks: cargo bench -p talkbank-parser-re2c --bench parse_comparison
When to Use Which
| Use Case | Recommended Parser | Why |
|---|---|---|
| LSP / editor integration | tree-sitter | Incremental reparsing |
| Batch validation (>100 files) | re2c | 4-8x faster |
| CI validation | Either | Both correct; re2c saves CI time |
| Error diagnostics (user-facing) | tree-sitter | More specific E3xx codes |
| Parser parity testing | Both | Re2c is the specification oracle |
| Profiling / benchmarking | re2c | DFA lexer gives a performance floor |
Shared Model Infrastructure
Both parsers convert to the same talkbank_model::ChatFile type and share
post-hoc promotion logic:
- TierContent::extract_terminal_bullet(): trailing InternalBullet → utterance bullet
- parse_bullet_node_timestamps(): structured bullet CST → (start_ms, end_ms)
CA intonation arrows are no longer promoted to terminators at the
parser/model boundary; both parsers leave them as Separator items.
See CA Terminator Resolution.
Detailed Parity Report
See crates/talkbank-parser-re2c/docs/parity-report.md
for the full gap analysis, divergence categories, and remaining work items.
Parser Leniency Policy
Status: Current Last updated: 2026-06-15 13:08 EDT
This document is the single source of truth for how the tree-sitter grammar,
Rust validation layer, and CLI tooling divide responsibility for enforcing the
CHAT specification. It consolidates decisions scattered across grammar.js
comments, analysis documents, and code.
Scope: Documentation only. This document does not implement new validation rules; it records what exists and what is intentionally absent, and it proposes a roadmap for closing gaps.
Philosophy: Parse, Don’t Validate
The tree-sitter grammar intentionally accepts a superset of valid CHAT. The rationale:
- Maximise parse coverage: Real-world .cha files contain legacy patterns, whitespace variations, and edge cases. A grammar that rejects them produces no AST and therefore no diagnostics. Accepting them gives the validation layer something to work with.
- Separate syntax from semantics: The grammar captures structure (headers, utterances, tiers, annotations). The Rust validation layer enforces semantic rules (required headers, participant declarations, alignment counts).
- Enable configurable strictness: Different consumers need different policies. A roundtrip pipeline can be strict; an editor providing live diagnostics should be lenient. Validation profiles (see Validation Profile Infrastructure) make this possible.
Three-Tier Classification
Every intentional leniency decision falls into one of three tiers:
| Tier | Label | Meaning |
|---|---|---|
| A | Parse-lenient + validate-strict | Grammar accepts it; validation rejects it as an error |
| B | Parse-lenient + validate-warning | Grammar accepts it; validation emits a warning |
| C | Parse-lenient only | Grammar accepts it; no validation needed: the construct is genuinely optional or the broad acceptance is by design |
This classification was proposed in an earlier grammar governance analysis and is formalised here.
Leniency Matrix
Master table of every documented leniency decision in the grammar. The Status column indicates whether downstream validation compensates for the grammar’s permissiveness.
| # | Grammar Construct | Spec Requirement | Grammar Behavior | Tier | Validation | Error Code | Status |
|---|---|---|---|---|---|---|---|
| 1 | @UTF8 header | Required, must be first line | Optional (not enforced) | A | Validated | E503 | OK |
| 2 | @Begin header | Required | Optional (grammar.js ~L104) | A | Validated | E504 | OK |
| 3 | @End header | Required | Optional (grammar.js ~L106) | A | Validated | E502 | OK |
| 4 | Pre-first-utterance header order | No enforced order (matches CLAN CHECK) | choice(), any order (grammar.js ~L122-135) | C | N/A (by design) | none | OK |
| 5 | Headers after utterances | Allowed (e.g. @Bg, @Eg, @G, @Comment) | Interleaved freely | C | N/A (by design) | none | OK |
| 6 | Content type context restrictions | Unified across contexts | Unified base_content_item (grammar.js ~L731-738) | C | N/A (by design); specific semantic rules (E371, E372) exist separately | none | OK |
| 7 | Terminator presence | Required (except CA mode) | Optional (grammar.js ~L691-692) | A | Validated | E305 | OK |
| 8 | Bare shortening as word | CA mode only | Accepted anywhere | A | Validated | E2xx | OK |
| 9 | Trailing whitespace in annotations | Not specified | Optional trailing space (grammar.js ~L957, 966, 975, 1004, 1013) | C | N/A | none | OK |
| 10 | MOR segment Unicode | Very permissive (broad language support) | Exclusion-based regex (grammar.js ~L1909-1915) | C | N/A (by design) | none | OK |
| 11 | MOR fusional suffixes with hyphens | ALNUM + IPA only | Allows hyphens (grammar.js ~L1942-1945) | C | N/A (by design) | none | OK |
| 12 | MOR nested translations | No nested structures | Allows () and [] nesting (grammar.js ~L1954-1966) | C | N/A (by design) | none | OK |
| 13 | Linkers / language codes | Truly optional | Optional | C | N/A | none | OK |
| 14 | Word annotations | Truly optional | Optional | C | N/A | none | OK |
| 15 | Media bullet | Truly optional | Optional | C | N/A | none | OK |
| 16 | Group whitespace (leading/trailing) | No whitespace inside < > | Optional (grammar.js ~L1097, 1099) | C | N/A | none | OK |
| 17 | Long feature label characters | Limited character set | /[A-Za-z0-9@%_-]+/ (grammar.js ~L1327) | C | N/A | none | OK |
| 18 | Catch-all headers ($.anything) | Structured content for some headers | /[^\r\n]+/ for ~19 header types | C | N/A (content is opaque) | none | OK |
| 19 | Header gap whitespace | Single space/tab | repeat1(choice(space, tab)) (grammar.js ~L467, 477, 489) | C | N/A | none | OK |
| 20 | @Types header whitespace | No spaces around commas | Optional whitespace around commas (grammar.js ~L584-592) | C | N/A | none | OK |
Permissiveness Regression Decisions
During development, several validation rules were tightened and then relaxed after they produced false positives against the reference corpus. These decisions are documented in the permissiveness regression log (archived). Each is summarised here with its rationale.
Decision 1: [*] bare annotation, E214 disabled
- Previous behaviour: E214 emitted when [*] appeared without an explicit error code (empty ContentAnnotation::Error).
- Current behaviour: Bare [*] is accepted without error.
- Implementation: Removed validation branch in talkbank-model/src/model/annotation/annotated.rs.
- Rationale: Reference files (errormarkers.cha, compound.cha) use bare [*] as valid CHAT.
- Revisit: If coded error annotations become required, do it behind an explicit strict profile.
Decision 2: @t without @s:<lang>, E248 disabled
- Previous behaviour: E248 emitted for @t markers without an explicit language marker.
- Current behaviour: @t accepted without requiring @s:<lang>.
- Implementation: Removed checks in talkbank-model/src/validation/word/structure.rs.
- Rationale: Reference file formmarkers.cha contains a@t and is expected to be valid.
- Revisit: Scope to explicit strict validation mode if desired.
Decision 3: Undeclared inline language codes, E254 re-introduced as warning
- Original behaviour: Inline @s:... markers with language codes not declared in @Languages emitted E254 as an error.
- Intermediate behaviour: E254 was disabled and the code removed from the codebase to keep reference file lang-marker.cha valid.
- Current behaviour: E254 (UndeclaredExplicitWordLanguage) is back in the registry at crates/talkbank-model/src/errors/codes/error_code.rs:321 and emitted at crates/talkbank-model/src/validation/word/language/resolve.rs:195, but as a warning rather than an error. This was paired with the introduction of E255 (WholeUtteranceLanguageSwitchShouldUsePrecode) for whole-utterance @s runs that should use [- lang] precodes.
- Why it returned: Heterogeneous corpora (Cantonese, Polish, Czech, Spanish, HK bilingual) made the warn-only signal load-bearing for catching @s:LANG markers that disagreed with @Languages. The warning surfaces the inconsistency without blocking the file.
- Revisit: If the warn-only signal turns out to be ignored in practice, decide between escalating back to error severity or removing.
Decision 4: Mixed-language digit legality, permissive-any rule
- Previous behaviour: Digits had to be legal in all applicable languages for mixed/ambiguous markers.
- Current behaviour: Digits accepted if legal in at least one applicable language.
- Implementation: Changed from is_valid_in_all() to any() in talkbank-model/src/validation/word/language/digits.rs.
- Rationale: Prevents false positives in mixed-language reference examples.
- Revisit: Confirm spec intent for mixed/ambiguous validation semantics.
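The difference between the two rules comes down to all() versus any() over the applicable languages. The sketch below is illustrative only: LanguageRule, its allows_digits flag, and the language codes are made up, and the real check in digits.rs may be structured differently.

```rust
/// Hypothetical per-language rule; the codes and flags are invented.
#[derive(Debug, Clone, Copy)]
struct LanguageRule {
    code: &'static str,
    allows_digits: bool,
}

/// Previous behaviour: digits must be legal in every applicable language.
fn digits_legal_strict(applicable: &[LanguageRule]) -> bool {
    applicable.iter().all(|lang| lang.allows_digits)
}

/// Current behaviour: digits are accepted if at least one applicable
/// language allows them.
fn digits_legal_permissive(applicable: &[LanguageRule]) -> bool {
    applicable.iter().any(|lang| lang.allows_digits)
}
```

One edge case worth keeping in mind when revisiting the rule: on an empty language list all() is true while any() is false, so the two rules also disagree when no language applies at all.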
Decision 5: @Bg nesting, same-label only
- Previous behaviour: Any nested @Bg while another gem scope was open emitted E529.
- Current behaviour: E529 only fires when nesting the same label (or same unlabeled scope key). Different labels may nest hierarchically.
- Implementation: Changed from any_scope_open to same_scope_open in talkbank-model/src/validation/header/structure.rs.
- Rationale: Avoids false positives on hierarchical markup patterns (e.g., HSLLD corpus).
- Revisit: Decide whether nesting policy should be global or per-label.
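A same-label check can be sketched as a scan that tracks the currently open scopes. This is an assumption-laden illustration rather than the validator's code: the Gem enum, the use of None for an unlabeled scope key, and the function same_label_nesting are all hypothetical.

```rust
/// Simplified gem events; `None` stands for an unlabeled scope.
#[derive(Debug, Clone, PartialEq)]
enum Gem {
    Begin(Option<String>),
    End(Option<String>),
}

/// Returns the indices of `Begin` events that open a scope whose label is
/// already open. Nesting under a different label is not reported.
fn same_label_nesting(events: &[Gem]) -> Vec<usize> {
    let mut open: Vec<Option<String>> = Vec::new();
    let mut offending = Vec::new();
    for (index, event) in events.iter().enumerate() {
        match event {
            Gem::Begin(label) => {
                if open.contains(label) {
                    offending.push(index);
                }
                open.push(label.clone());
            }
            Gem::End(label) => {
                // Close the most recently opened scope with this label.
                if let Some(pos) = open.iter().rposition(|l| l == label) {
                    open.remove(pos);
                }
            }
        }
    }
    offending
}
```

Under the previous rule the condition would have been "open is non-empty"; checking membership by label is what lets hierarchical markup pass.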
Decision 6: Temporal bullets in CA mode, skipped
- Previous behaviour: E701/E704 temporal checks ran even for CA-mode files.
- Current behaviour: Temporal constraints are skipped when file is in CA mode.
- Implementation: validate_temporal_constraints() early-returns when ca_mode is true (talkbank-model/src/validation/temporal.rs).
- Rationale: CA reference files include patterns that triggered false monotonicity/self-overlap diagnostics.
- Revisit: Implement CA-specific temporal policy rather than global skip.
Decision 7: Pipeline severity threshold, errors only
- Previous behaviour: Any validation diagnostic (including warnings) caused PipelineError::Validation.
- Current behaviour: Pipeline returns failure only if at least one diagnostic has Severity::Error.
- Implementation: talkbank-transform/src/pipeline/parse.rs.
- Rationale: Warnings should not block parse/transform/export pipelines.
- Revisit: Keep as default; add explicit --strict flag/profile if needed.
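The threshold amounts to a single predicate over the collected diagnostics. The sketch below is a simplified stand-in: the gate function and the reduced Diagnostic and PipelineError types are hypothetical, while the severities used in the usage note (E254 as a warning, E503 as an error) follow what this page states.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Severity {
    Warning,
    Error,
}

#[derive(Debug, Clone, PartialEq)]
struct Diagnostic {
    code: &'static str,
    severity: Severity,
}

#[derive(Debug, PartialEq)]
enum PipelineError {
    Validation(Vec<Diagnostic>),
}

/// Fails only when at least one diagnostic is an error; warnings are
/// passed through to the caller alongside a successful result.
fn gate(diagnostics: Vec<Diagnostic>) -> Result<Vec<Diagnostic>, PipelineError> {
    if diagnostics.iter().any(|d| d.severity == Severity::Error) {
        Err(PipelineError::Validation(diagnostics))
    } else {
        Ok(diagnostics)
    }
}
```

With this gate, a file that only triggers E254 still flows through export with its warning attached, while a file missing @UTF8 (E503) stops the pipeline.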
Decision 8: Spacing warnings W210/W211, disabled
- Previous behaviour: Style-level spacing warnings around terminators and overlap markers.
- Current behaviour: Checks removed from core main-tier validation path.
- Implementation: check_spacing_warnings() invocation removed from talkbank-model/src/model/content/main_tier.rs.
- Rationale: Generated unexpected diagnostics on files treated as valid in reference workflow.
- Revisit: Reintroduce as optional lint profile, not core validator hard path.
Validation Gap Roadmap
Concrete items where the grammar is lenient but no validation compensates. Each proposes a new error code and priority.
Priority 1: @UTF8 Presence (E503), DONE
- Grammar: @UTF8 is optional.
- Spec: Required, must be the first line.
- Implemented: E503 (MissingUTF8Header) added to check_headers() in talkbank-model/src/validation/header/structure.rs.
- Severity: Error.
- Note: All 340 reference corpus files contain @UTF8, so there is zero roundtrip impact.
Priority 2: Pre-First-Utterance Header Order (proposed E534), Not a Gap
- Grammar: choice() accepts headers in any order between @Begin and the first utterance.
- Assessment: CLAN CHECK does not enforce any ordering for post-@Begin headers; it validates presence and format only. Our grammar’s flexible ordering matches CHECK’s behavior.
- Status: Reclassified from Tier B (GAP) to Tier C (by design).
Priority 3: Content Type Context Validation, Not a Gap
- Grammar: Unified base_content_item accepts any content type in any context.
- Assessment: The unified rule is correct by design. Nested groups are legal CHAT (e.g., <the <dag> [: dog]> [= something]). The two specific semantic restrictions that do exist (no pauses in pho groups, E371; no nested quotations, E372) are already validated.
- Status: Reclassified from Tier A (PARTIAL) to Tier C (by design).
Validation Profile Infrastructure
What Exists
ValidationConfig (talkbank-model/src/errors/config.rs)
Builder-pattern configuration for per-error-code severity overrides.
let config = ValidationConfig::new()
.downgrade(ErrorCode::IllegalUntranscribed, Severity::Warning)
.disable(ErrorCode::InvalidOverlapIndex)
.upgrade(ErrorCode::UnknownAnnotation, Severity::Error);
API:
- new(): empty config, all codes use original severity
- downgrade(code, severity): lower severity (chainable)
- disable(code): suppress entirely (chainable)
- upgrade(code, severity): raise severity (chainable)
- set_severity(code, Option<Severity>): set or disable (chainable)
- effective_severity(code, original) -> Option<Severity>: query
- is_disabled(code) -> bool: check
Pre-built profiles:
- lenient(): Downgrades IllegalUntranscribed and InvalidOverlapIndex to Severity::Warning. Designed for gradual migration of legacy corpora.
- strict(): Escalates unmapped warnings to errors (sets upgrade_unmapped_warnings, honored by effective_severity). Explicit per-code overrides still take precedence, so a caller can opt a specific code back to Severity::Warning.
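The precedence rule described for strict() can be illustrated with a small self-contained model. This is a sketch under stated assumptions, not the talkbank-model source: error codes are plain strings, and the Config type and its overrides field are hypothetical stand-ins. Only the method names and upgrade_unmapped_warnings come from the description above.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Severity {
    Error,
    Warning,
}

/// Hypothetical stand-in for ValidationConfig; codes are strings here.
#[derive(Default)]
pub struct Config {
    /// Some(sev) overrides the severity; None disables the code.
    overrides: HashMap<&'static str, Option<Severity>>,
    /// Set by strict(): warnings without an explicit override become errors.
    upgrade_unmapped_warnings: bool,
}

impl Config {
    pub fn new() -> Self {
        Self::default()
    }

    pub fn strict() -> Self {
        Self { upgrade_unmapped_warnings: true, ..Self::default() }
    }

    pub fn set_severity(mut self, code: &'static str, severity: Option<Severity>) -> Self {
        self.overrides.insert(code, severity);
        self
    }

    pub fn disable(self, code: &'static str) -> Self {
        self.set_severity(code, None)
    }

    /// Explicit per-code overrides win; otherwise strict mode may escalate.
    pub fn effective_severity(&self, code: &str, original: Severity) -> Option<Severity> {
        if let Some(explicit) = self.overrides.get(code) {
            return *explicit;
        }
        if self.upgrade_unmapped_warnings && original == Severity::Warning {
            return Some(Severity::Error);
        }
        Some(original)
    }
}
```

In this model, a strict config with an explicit Warning override for one code keeps that code at Warning while every other warning is escalated, which is the opt-back behaviour the profile description relies on.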
ConfigurableErrorSink (talkbank-model/src/errors/configurable_sink.rs)
Wrapper that intercepts errors and applies ValidationConfig before forwarding
to an inner ErrorSink.
let inner = ErrorCollector::new();
let sink = ConfigurableErrorSink::new(&inner, config);
// Pass `sink` to the parser/validator: disabled errors are filtered
// and severity overrides are applied.
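The wrapper idea can be sketched independently of the real crate. Everything below is illustrative: the ErrorSink trait shape, the Collector type, and the ConfigurableSink name are assumptions, and the override table is a plain map rather than a ValidationConfig.

```rust
use std::cell::RefCell;
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Severity {
    Error,
    Warning,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Diagnostic {
    pub code: &'static str,
    pub severity: Severity,
}

/// Hypothetical minimal sink trait; the real ErrorSink may differ.
pub trait ErrorSink {
    fn report(&self, diagnostic: Diagnostic);
}

/// Inner sink that simply records what it receives.
#[derive(Default)]
pub struct Collector {
    pub seen: RefCell<Vec<Diagnostic>>,
}

impl ErrorSink for Collector {
    fn report(&self, diagnostic: Diagnostic) {
        self.seen.borrow_mut().push(diagnostic);
    }
}

/// Wrapper that rewrites or drops diagnostics before forwarding them.
pub struct ConfigurableSink<'a, S: ErrorSink> {
    inner: &'a S,
    /// Some(sev) replaces the severity; None suppresses the code.
    overrides: HashMap<&'static str, Option<Severity>>,
}

impl<'a, S: ErrorSink> ConfigurableSink<'a, S> {
    pub fn new(inner: &'a S, overrides: HashMap<&'static str, Option<Severity>>) -> Self {
        Self { inner, overrides }
    }
}

impl<S: ErrorSink> ErrorSink for ConfigurableSink<'_, S> {
    fn report(&self, mut diagnostic: Diagnostic) {
        match self.overrides.get(diagnostic.code) {
            Some(None) => return,                          // disabled: drop it
            Some(Some(sev)) => diagnostic.severity = *sev, // overridden severity
            None => {}                                     // no override: pass through
        }
        self.inner.report(diagnostic);
    }
}
```

Because the wrapper implements the same trait as the sink it wraps, a parser or validator written against the trait does not need to know whether overrides are in effect.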
Runner-Level Flags (talkbank-transform, chatter)
| Flag | Effect |
|---|---|
| --skip-alignment | Skip tier alignment validation |
| --roundtrip | Test serialization idempotency after validation |
| --force | Clear cache for path and revalidate |
| --max-errors N | Stop after N errors |
What Is Missing
| Gap | Description | Effort |
|---|---|---|
| No --profile CLI flag | Users cannot select strict / lenient / lint from the command line | Medium |
| ConfigurableErrorSink not wired into validation pipeline | Infrastructure exists but is not used by chatter validate | Medium |
| No lint-style profile | Spacing/style warnings (W210, W211) have no home | Small (once profiles are wired) |
| No profile serialization | Cannot load profiles from TOML/JSON config files | Medium |
| No corpus-specific profiles | E.g., HSLLD-specific rules | Future |
Proposed Profiles
From the permissiveness regression log:
| Profile | Purpose | Behaviour |
|---|---|---|
| reference-compatible | Current permissive baseline | Default, matches current validation behaviour |
| strict-chat | Full spec enforcement | Re-enable selected tightenings (E214, E248, E254, etc.) |
| lint-style | Spacing/style warnings only | Enable W210, W211; do not fail pipeline |
The roundtrip gate should be pinned to an agreed profile to prevent future ambiguity about what “pass” means.
Silent Recovery Points (NLP Pipelines)
An earlier Python-Rust boundary audit identified several
places where batchalign-core silently massages data without diagnostics. These
are related to leniency because they represent permissive acceptance without
transparency.
| Pipeline | Recovery Mechanism | Diagnostics? |
|---|---|---|
| Stanza morphosyntax | retokenize.rs DP alignment; Word::new_unchecked fallback | No |
| Whisper/Wave2Vec FA | forced_alignment.rs DP “best fit” | No |
| Google Translate | Imported verbatim into %xtra | No filtering |
| Stanza segmentation | Silent abort on assignment mismatch | No |
Key infrastructure gap: ParseHealth exists in talkbank-model (per-utterance
tier cleanliness flags with taint(), is_clean(), can_align_main_to_mor()
methods). It is used by the tree-sitter and direct parsers during parsing.
However, batchalign-core does not read, write, or propagate ParseHealth
during any mutation (morphosyntax injection, FA injection, retokenisation). The
infrastructure exists in the model layer but is not connected to the pipeline
layer.
Cross-References
| Source | What It Contains |
|---|---|
| Grammar governance analysis (archived) | Proposed this document; leniency matrix concept; three-tier classification |
| Permissiveness regression log (archived) | 8 permissiveness regression decisions with rationale |
| Python-Rust boundary audit (archived) | Silent recovery points; ParseHealth gap; NLP pipeline audit |
| grammar/grammar.js | Inline comments on each leniency decision (line references in matrix above) |
| talkbank-model/src/errors/config.rs | ValidationConfig API |
| talkbank-model/src/errors/configurable_sink.rs | ConfigurableErrorSink adapter |
| talkbank-model/src/validation/header/structure.rs | Header validation: E501, E502, E503, E504-E533 |
| talkbank-model/src/validation/temporal.rs | Temporal constraint checks (E701, E704); CA-mode skip |
| talkbank-model/src/model/content/main_tier.rs | Where W210/W211 were removed |
Last updated: 2026-02-18
Error Diagnostics UX Standard
Status: Current Last modified: 2026-05-30 07:08 EDT
Workspace-wide standard for diagnostic shape, severity, recovery
behavior, span correctness, and integrator output formats. Applies
to the CHAT-core error system. Upstream
batchalign-runtime errors follow the same shape and are documented
separately in the batchalign3 project.
Objective
Make diagnostics precise, explainable, and actionable for both developers and non-technical editors, while keeping machine readability for downstream tools.
Open concerns
- Message quality across the error catalog is not yet governed by one central style standard. Different error codes were authored at different times and converge unevenly on the message-quality guidance below.
Canonical Diagnostic Schema
Diagnostic {
code: String,
severity: Error | Warning | Info,
category: Parse | Validation | Alignment | Header | Tier | Internal,
location: SourceLocation,
context: ErrorContext,
message: String,
suggestion: Option<String>,
related: Vec<RelatedLocation>
}
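As one possible Rust rendering of the schema, the sketch below spells out the enums and gives the location and context types concrete fields. The field layouts of SourceLocation, ErrorContext, and RelatedLocation are assumptions made for illustration, as is the render() helper; the schema itself only names those types.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Severity {
    Error,
    Warning,
    Info,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Category {
    Parse,
    Validation,
    Alignment,
    Header,
    Tier,
    Internal,
}

/// Assumed shape: byte offsets plus 1-based line/column.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SourceLocation {
    pub byte_start: usize,
    pub byte_end: usize,
    pub line: usize,
    pub column: usize,
}

/// Assumed shape: the offending text, when the source was available.
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct ErrorContext {
    pub offending: Option<String>,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RelatedLocation {
    pub location: SourceLocation,
    pub note: String,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Diagnostic {
    pub code: String,
    pub severity: Severity,
    pub category: Category,
    pub location: SourceLocation,
    pub context: ErrorContext,
    pub message: String,
    pub suggestion: Option<String>,
    pub related: Vec<RelatedLocation>,
}

impl Diagnostic {
    /// One-line human-readable rendering: code, severity, position, message.
    pub fn render(&self) -> String {
        let mut out = format!(
            "{} [{:?}] {}:{}: {}",
            self.code, self.severity, self.location.line, self.location.column, self.message
        );
        if let Some(hint) = &self.suggestion {
            out.push_str(&format!(" (suggestion: {hint})"));
        }
        out
    }
}
```

Keeping suggestion as an Option rather than an empty string follows the same reasoning as the recovery policy later in this chapter: absence is represented explicitly.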
Message Quality Standard
Each diagnostic must answer:
- What failed.
- Where it failed.
- Why it likely failed.
- What to do next.
Avoid internal jargon unless accompanied by user-facing explanation.
Severity Policy
- Error: blocks parse/validation outcome.
- Warning: content is usable but has quality/compliance concerns.
- Info: optional guidance and migration hints.
Severity must not be overloaded for tooling convenience.
Recovery Policy: Diagnostic-First, Not Sentinel-First
When parser recovery is required:
- do not invent semantic fallback values to keep type construction convenient,
- do not use empty strings or arbitrary enum defaults as recovered content.
Instead:
- Report a diagnostic with expected/actual node context.
- Preserve span information for tooling and UI.
- Propagate partial/failure status explicitly.
Any synthetic placeholders that are unavoidable for internal plumbing must be:
- non-semantic (not exposed as real model content),
- marked internal-only,
- excluded from user-facing diagnostics and serialization.
Sentinel vs error-variant rule
If an unexpected condition changes semantic trust in parsed content:
- Represent that explicitly as an error-bearing state (enum variant, parse-taint flag, or explicit outcome type).
- Never represent it as None or a default payload that can be mistaken for valid content.
This applies both to parser outputs and to runtime metadata consumed during validation.
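A contrast between the two styles, as a minimal sketch. The function names, the Parsed type, and the simplistic speaker-code extraction are all hypothetical; they are not taken from the parser and exist only to show the difference in what a caller can observe.

```rust
/// Sentinel-first (discouraged): an empty string stands in for a failed parse,
/// so callers cannot tell "no content" from "content we could not trust".
pub fn speaker_sentinel(line: &str) -> String {
    line.strip_prefix('*')
        .and_then(|rest| rest.split_once(':'))
        .map(|(code, _)| code.to_string())
        .unwrap_or_default()
}

/// Diagnostic-first: the failure is an explicit state that carries its span.
#[derive(Debug, PartialEq, Eq)]
pub enum Parsed<T> {
    Ok(T),
    /// Byte span of the text that could not be interpreted.
    Tainted { span: (usize, usize), expected: &'static str },
}

pub fn speaker_explicit(line: &str) -> Parsed<String> {
    match line.strip_prefix('*').and_then(|rest| rest.split_once(':')) {
        Some((code, _)) if !code.is_empty() => Parsed::Ok(code.to_string()),
        _ => Parsed::Tainted { span: (0, line.len()), expected: "*SPEAKER:" },
    }
}
```

With the sentinel version, a malformed line silently yields an empty speaker code that downstream validation may treat as real content. The explicit version forces every caller to handle the tainted case and keeps the span available for diagnostics.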
Diagnostic construction
Use shared constructors/helpers for common diagnostics to reduce drift:
- span-only diagnostics (code + severity + span + message),
- source-backed diagnostics (code + severity + span + source + offending + message).
Benefits: consistent location/context population, fewer ad-hoc
ParseError::new(...) call shapes, simpler migration to richer
miette rendering.
Error Code Governance
- Central registry under talkbank-model (errors module).
- One authoritative description and example per code.
- Deprecated codes remain mapped with explicit migration notes.
- CI check forbids duplicate code definitions or orphaned docs.
Span and Location Correctness
- All diagnostics use consistent line/column and byte-offset definitions.
- Golden tests cover:
- single-byte and multi-byte UTF-8 content,
- embedded content offsets,
- continuation lines and tabs.
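The multi-byte case is where byte offsets and columns diverge, so it is worth a concrete illustration. The helper below is a sketch, not the project's implementation. It assumes columns are 1-based and count Unicode scalar values, which this standard does not state, and the name line_col is hypothetical.

```rust
/// Convert a byte offset into a 1-based (line, column) pair.
/// Assumption for this sketch: columns count Unicode scalar values, not bytes.
pub fn line_col(source: &str, byte_offset: usize) -> Option<(usize, usize)> {
    // Rejects offsets past the end or inside a multi-byte character.
    if !source.is_char_boundary(byte_offset) {
        return None;
    }
    let before = &source[..byte_offset];
    let line = before.matches('\n').count() + 1;
    let line_start = before.rfind('\n').map_or(0, |i| i + 1);
    let column = before[line_start..].chars().count() + 1;
    Some((line, column))
}
```

A golden test in this style would place a three-byte character such as • before the span of interest and assert that the column advances by one while the byte offset advances by three.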
Integrator Output Formats
- Human-readable CLI diagnostics.
- Machine-readable JSON diagnostics.
- LSP diagnostic mapping.
All formats share the same underlying diagnostic schema.
Acceptance Criteria
- Every emitted diagnostic includes code, severity, location, and suggestion policy.
- Error code documentation and runtime definitions are synchronized automatically.
- Span correctness is covered by dedicated tests.
- CLI and JSON outputs are contract-tested for schema compliance.
Wide Struct Audit
Status: Current Last modified: 2026-05-29 22:34 EDT
A repository-wide audit rule for struct shape. Applies to the crates in
TalkBank/chatter (model, parser, transform, CLI, CLAN, LSP, cache, and
related tooling). The rule originated in the predecessor monorepo, but this
page is scoped to the current repository rather than the old mixed
CHAT+batchalign workspace.
A struct with many fields is not automatically wrong. The smell is:
- many unrelated concerns packed into one value
- several related booleans that act like implicit policy enums
- repeated field-name prefixes that point to missing sub-structs
- parallel vectors or stringly runtime fields
- runtime code reaching into many unrelated fields of the same value
The repo therefore treats 10 or more named fields as an audit threshold, not as an automatic ban.
Categories
Wide structs fall into four categories.
1. Boundary shim, may stay wide
CLI, JSON, or clap boundary types. Acceptable if they are converted into typed policies or sub-structs before entering core runtime code.
Examples: ValidateDirectoryOptions, clap-facing CLI arg structs, JSON boundary
records.
2. Transport or schema record, may stay wide
DB rows, HTTP response shapes, JSON schema mirrors. Acceptable as long as they don’t become the internal runtime shape.
Examples: WordJsonSchema, DbMetadata, CoverageReport, CorpusManifest.
3. Real aggregate, may stay wide
Domain values whose fields all answer one coherent question and whose callers consume the whole rather than spelunking through unrelated subsets.
Examples: metric/report records like SpeakerEval, SpeakerKideval,
SpeakerComplexity, and SpeakerFluency (report records, not runtime
coordination).
4. Refactor target, must be split
Mix of policy and state, multiple responsibilities, or callers needing to know the whole subsystem to use a subset of fields.
Design Rules
- Treat 10 or more named fields as an audit trigger.
- Treat 3 or more related boolean fields as a smell even below that threshold.
- Boundary and transport records may stay wide when they mirror a real external shape.
- Runtime coordination structs prefer named sub-structs over flat bags.
- Replace parallel vectors with per-item records where possible.
- If a wide struct stays wide, record the reason in the surrounding design docs, audit notes, or code review rather than letting it remain unexplained.
Refactor Examples
ValidateDirectoryOptions (chatter), was a flat bag
Used to be a flat bag of format, cache, traversal, roundtrip, parser, audit, and TUI flags. Now grouped by concern:
- ValidationRules
- ValidationExecution
- ValidationTraversalMode
- ValidationPresentation

This is the shape this audit wants for policy-rich CLI boundaries: one small top-level struct with explicit sub-objects and enums rather than a dozen flat fields.
ParseHealth (talkbank-model), was a ten-boolean state vector
Now stores taint as a compact tier bitset keyed by ParseHealthTier, the
shape this audit expects for fixed domain sets.
flowchart LR
tier["ParseHealthTier"] --> set["Tier health set"]
set --> checks["Alignment safety checks"]
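A bitset keyed by a tier enum might look like the following. This is a sketch only: the four tier variants and the is_tier_clean helper are hypothetical, and the real ParseHealthTier may list different tiers. The taint(), is_clean(), and can_align_main_to_mor() names are the ones this book attributes to ParseHealth.

```rust
/// Hypothetical tier set; the real ParseHealthTier variants may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum ParseHealthTier {
    Main = 0,
    Mor = 1,
    Gra = 2,
    Pho = 3,
}

/// One bit per tier: a set bit means that tier was tainted during parsing.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct ParseHealth {
    tainted: u16,
}

impl ParseHealth {
    fn bit(tier: ParseHealthTier) -> u16 {
        1 << (tier as u8)
    }

    pub fn taint(&mut self, tier: ParseHealthTier) {
        self.tainted |= Self::bit(tier);
    }

    pub fn is_tier_clean(&self, tier: ParseHealthTier) -> bool {
        (self.tainted & Self::bit(tier)) == 0
    }

    pub fn is_clean(&self) -> bool {
        self.tainted == 0
    }

    /// Alignment is only safe when both participating tiers are clean.
    pub fn can_align_main_to_mor(&self) -> bool {
        self.is_tier_clean(ParseHealthTier::Main) && self.is_tier_clean(ParseHealthTier::Mor)
    }
}
```

Compared with ten separate booleans, adding a tier means adding one enum variant, and no call site can forget to reset or copy a field.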
Open Hotspots
TUI state bags
Real state owners that still want grouping by concern (selection vs. progress vs. render flags vs. status):
- crates/chatter/src/ui/validation_tui/state.rs: TuiState
Backend (talkbank-lsp)
crates/talkbank-lsp/src/backend/state.rs is a service-root aggregate.
Defensible, but still wants grouping such as document caches, parse caches,
validation state, language services.
Metric structs
SpeakerEval/SpeakerKideval are acceptable as report records. If output
renderers keep needing subsets (lexical metrics, morphosyntax metrics, error
counts, derived scores), those records should eventually nest along those
lines.
Audit Guardrail
There is currently no repo-local automated wide-struct lint in
TalkBank/chatter. Treat this page as a manual review checklist and refactor
trigger: when a type grows past the threshold, decide explicitly whether it is
an acceptable boundary/schema aggregate or a real split target.
Spec Tooling and Generation Pipeline
Status: Current Last updated: 2026-05-19 17:38 EDT
Objective
Make spec/ the reliable language-contract source while keeping generation
deterministic, maintainable, and appropriately scoped.
The goal is to separate:
- grammar artifact generation
- validation/error-doc generation
- parser semantic testing (fragment and full-file)
Anything that still looks like bootstrap-era synthetic fragment orchestration is now audit-only unless a doc says it remains operational.
Open structural concerns
- spec/tools still carries bootstrap-era Rust parser/model dependencies that create circular or awkward workflow coupling.
- Contributor workflows still over-assume that make test-gen is the right reaction to every parser-related change.
Current Generation Pipeline
spec constructs/errors
-> spec validators
-> generated grammar corpus tests
-> generated rust parser/validation tests
-> generated error docs
-> coverage dashboards and quality reports
That pipeline is still useful, but it is too broad to remain the single mental model for parser testing.
Desired Post-Bootstrap Split
grammar specs/templates
-> generated tree-sitter corpus tests
error specs
-> generated validation/parser error tests
-> generated error docs
fragment semantic fixtures and invariants
-> fragment-level parser tests
reference corpus / curated full files
-> parser parity tests
Structural Reorganization for spec/tools (proposed, not yet implemented)
The intent here is to narrow spec/tools’s mission back to spec-driven
artifact generation and validation rather than leaving it as a
bootstrap-era staging ground for parser semantics. A proposed module
split:
- input (markdown/spec parsing)
- ir (normalized internal representation)
- emit (grammar tests, rust tests, docs)
- validate (schema and semantic checks)
- sync (grammar node-types and symbol-registry checks)
Current layout (crates: bin/, generated/, lib.rs, output/, spec/, templates/) has not been migrated to this shape. Treat this section
as a design target for future work rather than a description of the
current source tree.
Legacy vs Active
Keep these active:
- grammar corpus generation
- error doc generation
- symbol registry sync/validation
- affected regeneration when a spec or grammar input truly changed
Treat these as legacy audit paths:
- synthetic tree-sitter fragment wrappers
- bootstrap-era parser equivalence rituals
Determinism Requirements
- Stable ordering of generated outputs.
- Stable formatting of generated code/docs.
- Re-runs without source changes produce no diffs.
Drift Prevention Controls
- Node type compatibility check:
- spec/tools must compile and run against current generated node constants.
- Registry compatibility check:
- all symbol categories used in specs and grammar must be known in registry.
- Generation integration check:
- full generation pass with clean tree must produce zero diff.
- Boundary check:
- generated grammar/docs flows should not silently become the sole authority for fragment parsing semantics.
Authoring Experience (proposed, not yet implemented)
Spec authoring would benefit from:
- Strict but simple spec templates for constructs and errors.
- A spec lint command for immediate feedback (missing fields, invalid tags, malformed examples, unknown error codes).
- Clearer documentation of when make test-gen is actually needed and when a small direct test is the right answer instead.
The spec lint binary does not yet exist; the strict-validation work
that exists today happens implicitly through make test-gen failures
plus the spec validators in spec/tools/src/bin/.
Versioning and Metadata
Each spec file should include:
- ownership,
- status (draft, accepted, deprecated),
- parser/validation scope,
- linked tests and generated outputs.
Acceptance Criteria
- spec/tools is green and deterministic.
- Every generation target has explicit provenance from source specs.
- Drift between node types, specs, and generators is blocked in CI.
- Spec contributors have a documented and automated happy path.
- Small grammar changes no longer force a giant regeneration ritual by default.
- Fragment parsing semantics are tested outside the generation pipeline.
Symbol Registry Architecture
Status: Current Last modified: 2026-05-29 18:43 EDT
Purpose
spec/symbols/symbol_registry.json is the canonical source of token/symbol classes used by
CHAT grammar tokenization policy.
Scope
The registry currently governs:
- CA delimiter symbols,
- CA element symbols,
- word segment forbidden symbol classes,
- event segment forbidden symbol classes.
Governance Rules
- Symbol changes must be made only in spec/symbols/symbol_registry.json.
- Registry must pass validation:
node spec/symbols/validate_symbol_registry.js
- Grammar symbol sets must be regenerated after any registry change:
just symbols-gen
- Generated files are read-only and must not be edited manually.
Determinism Requirements
- Every category list in the registry must be lexicographically sorted.
- Duplicate symbols are forbidden.
- ca_delimiter_symbols and ca_element_symbols must be disjoint.
These constraints keep generated outputs stable and review diffs minimal.
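The three constraints are simple to state as code. The sketch below is not the registry validator (that is the Node script named under Governance Rules); it only illustrates the checks. The function names are hypothetical, and it assumes that "lexicographically sorted" means Rust's byte-wise string ordering, which may differ from the ordering the real validator applies to non-ASCII symbols.

```rust
use std::collections::HashSet;

/// Check one category list: sorted and free of duplicates.
/// Strictly increasing order implies both properties at once.
pub fn check_category(name: &str, symbols: &[&str]) -> Result<(), String> {
    for pair in symbols.windows(2) {
        if pair[0] >= pair[1] {
            return Err(format!(
                "{name}: {:?} must sort strictly before {:?}",
                pair[0], pair[1]
            ));
        }
    }
    Ok(())
}

/// Check that two categories share no symbol.
pub fn check_disjoint(a: &[&str], b: &[&str]) -> Result<(), String> {
    let seen: HashSet<&str> = a.iter().copied().collect();
    match b.iter().find(|s| seen.contains(*s)) {
        Some(shared) => Err(format!("symbol {shared:?} appears in both categories")),
        None => Ok(()),
    }
}
```

Requiring strict ordering means a reordered or duplicated entry fails fast, so regenerated outputs only change when the set of symbols actually changes.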
Consuming Outputs
Generated symbol constants are emitted to:
- grammar/src/generated_symbol_sets.js
- crates/talkbank-model/src/generated/symbol_sets.rs
- spec/tools/src/generated/symbol_sets.rs
grammar/grammar.js imports from this generated module to avoid manual duplication of
critical symbol policy.
Change Workflow
- Edit registry JSON.
- Run registry validation.
- Run just symbols-gen.
- Run grammar generation/tests.
- Run parser equivalence tests.
- Commit source + generated outputs together.
Auditability
Registry drift is caught by the checked-in generated artifacts plus the normal local verification sweep and CI checks, so symbol changes should land together with regenerated grammar and Rust outputs.
Bullet Validation
Status: Current Last updated: 2026-05-01 05:19 EDT
Media bullets are timestamps embedded in CHAT utterances that link transcript
text to audio/video. They appear as •start_end• at the end of a main tier
line (e.g., *CHI: hello . •1000_2000•). Validating that these timestamps are
internally consistent is one of the more subtle parts of CHAT validation,
because the “obvious” rules turn out to be wrong for multi-party conversation.
This chapter documents what CLAN CHECK does, where its implementation falls
short of its own intent, and how chatter validate interprets and improves on
that intent.
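To make the •start_end• shape concrete, here is a minimal extraction sketch. It is not the chatter parser. The delimiter is taken to be the • character exactly as printed in this chapter, which is an assumption about the on-disk encoding, and the trailing_bullet function name is hypothetical. The Bullet field names follow the start_ms/end_ms naming used later in this chapter.

```rust
/// A media bullet: start and end offsets in milliseconds.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Bullet {
    pub start_ms: u64,
    pub end_ms: u64,
}

/// Extract a trailing bullet from a main tier line, if one is present.
pub fn trailing_bullet(line: &str) -> Option<Bullet> {
    // Assumed delimiter: the bullet character as displayed in this chapter.
    const MARK: char = '\u{2022}';
    let body = line.trim_end().strip_suffix(MARK)?;
    let (_, times) = body.rsplit_once(MARK)?;
    let (start, end) = times.split_once('_')?;
    Some(Bullet {
        start_ms: start.parse().ok()?,
        end_ms: end.parse().ok()?,
    })
}
```

Returning None for a malformed bullet is acceptable in a sketch; under the diagnostics standard a real parser would report the malformed span instead of silently dropping it.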
The three temporal checks
There are three distinct temporal constraints that can be checked on bullet timestamps. They differ in scope, severity, and whether they should run by default.
E701: Same-speaker start-time monotonicity (CLAN Error 83)
Rule: For each speaker, their utterances’ start times must be non-decreasing. If speaker CHI has utterance A starting at 10,000ms and utterance B (later in document order) starting at 8,000ms, that is an error: CHI’s timeline has gone backward.
Scope: Per-speaker. Cross-speaker non-monotonicity is allowed (see Why cross-speaker non-monotonicity is not an error).
Severity: Error.
E704: Same-speaker self-overlap (CLAN Error 133)
Rule: For each speaker, the current utterance’s start time must not be more than 500ms before the same speaker’s previous utterance’s end time. In other words, a speaker cannot overlap with themselves by more than 500ms.
Scope: Per-speaker. The 500ms tolerance accounts for annotation rounding and minor timing imprecision at boundaries.
Severity: Error.
E729: Cross-speaker overlap (CLAN Error 84)
Rule: The current utterance’s start time must not be before the previous utterance’s (any speaker) end time. This checks for any temporal overlap between adjacent utterances, regardless of speaker.
Scope: Global (cross-speaker). Only fires with CLAN’s +c0 flag.
Severity: Warning. Not part of default validation.
This check is part of CLAN’s “strict timeline contiguity” mode, which requires that every utterance’s start time equals the previous utterance’s end time, no gaps (Error 85) and no overlaps (Error 84). It is designed for a very specific use case: verifying that audio has been exhaustively and non-redundantly segmented. In normal conversational transcripts, cross-speaker overlap is ubiquitous, so this check would be absurd as a default.
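The strict contiguity rule reduces to a three-way comparison between the previous end time and the current start time. A sketch, with hypothetical type and function names:

```rust
use std::cmp::Ordering;

#[derive(Debug, PartialEq, Eq)]
pub enum Contiguity {
    /// Start equals the previous end: exhaustive, non-redundant segmentation.
    Contiguous,
    /// Start is before the previous end (CLAN Error 84 / E729).
    Overlap { by_ms: u64 },
    /// Start is after the previous end (CLAN Error 85).
    Gap { by_ms: u64 },
}

/// Compare an utterance's start against the previous utterance's end,
/// regardless of speaker.
pub fn check_contiguity(prev_end_ms: u64, start_ms: u64) -> Contiguity {
    match start_ms.cmp(&prev_end_ms) {
        Ordering::Equal => Contiguity::Contiguous,
        Ordering::Less => Contiguity::Overlap { by_ms: prev_end_ms - start_ms },
        Ordering::Greater => Contiguity::Gap { by_ms: start_ms - prev_end_ms },
    }
}
```

Almost every adjacent pair in a conversational transcript would land in Overlap or Gap, which is why this mode is opt-in.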
What CLAN CHECK does
CLAN CHECK implements bullet validation in the function
check_checkBulletsConsist() in check.cpp. Understanding its implementation
is essential because it has several accidental behaviors that affect the error
counts users see.
The snapshot-and-compare pattern
The function uses a global pair (check_SNDBeg, check_SNDEnd) to hold the
“current” bullet timing, and saves the previous values into local variables
(tBegTime, tEndTime) at the start of each call. The comparison flow is:
1. Save previous: tBegTime = check_SNDBeg, tEndTime = check_SNDEnd
2. Parse new bullet into check_SNDBeg, check_SNDEnd
3. Check error 83: check_SNDBeg < tBegTime? (cross-speaker comparison)
4. Check error 133: speaker's last END - check_SNDBeg > 500? (same-speaker)
5. If +c0 mode: check error 84 (overlap) and error 85 (gap)
6. Update speaker's last END time via check_setLastTime()
The early-return shadowing bug
The critical implementation detail is that error 83 fires via return(83) at
step 3. This causes the function to exit immediately, skipping steps 4
through 6. Two consequences follow:
- Error 83 shadows error 133. An utterance that triggers error 83 (global non-monotonicity) can never also trigger error 133 (same-speaker overlap) in the same call, even if both conditions are true. This is not intentional; it is an artifact of C-style early-return control flow.
- Speaker state goes stale. Step 6 (check_setLastTime) updates the speaker’s per-speaker tracking in the SPLIST linked list. When error 83 fires, this update is skipped. All subsequent error-133 checks for that speaker compare against a stale endTime value, causing cascading state corruption that suppresses legitimate error 133 reports.
Error 83 is global, not per-speaker
CLAN fires error 83 by comparing the current utterance’s start time against the previous utterance’s start time, regardless of speaker. In a multi-party conversation:
*PIL: something . •100000_102000•
*UEL: response . •99500_101000• ← Error 83: 99500 < 100000
This fires error 83 because UEL’s start time (99,500ms) is before PIL’s start
time (100,000ms). But this is just two people talking at the same time: normal
conversational overlap. The [>] and [<] markers in CHAT explicitly annotate
this as intentional simultaneous speech.
In files with many speakers (the Koine/bre corpus has 7-9 speakers per file, including children talking over each other), this fires on a huge fraction of utterances. CLAN’s accidental shadowing partially masks the problem by suppressing downstream error-133 reports when error 83 fires.
Why cross-speaker non-monotonicity is not an error
Consider a classroom recording with a teacher (PIL) and seven children. The teacher asks a question, and three children answer simultaneously:
*PIL: qué es esto ? •50000_52000•
*UEL: un coche . •51200_52500• ← started during PIL's question
*MAR: coches . •51000_51800• ← started even earlier
*REN: es un coche grande . •51500_53000• ← started after both UEL and MAR
In document order, the start times are: 50000, 51200, 51000, 51500. This is non-monotonic (51000 < 51200), but there is nothing wrong with this data. The children are simply talking at the same time. No amount of reordering the utterances in the file would make all start times monotonically increasing while preserving the speaker-turn structure.
Cross-speaker non-monotonicity is an inherent property of multi-party conversation, not a data error. Flagging it as an error produces thousands of false positives on any corpus with overlapping speech.
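The classroom example can be checked mechanically. The two functions below are a sketch with hypothetical names. They count backward steps under the global comparison (the behaviour this chapter attributes to CLAN error 83) and under the per-speaker comparison.

```rust
use std::collections::HashMap;

/// Count start times that go backward relative to the previous utterance
/// of ANY speaker (the global comparison).
pub fn global_backsteps(utterances: &[(&str, u64)]) -> usize {
    utterances.windows(2).filter(|w| w[1].1 < w[0].1).count()
}

/// Count start times that go backward relative to the SAME speaker's
/// previous utterance (the per-speaker comparison).
pub fn per_speaker_backsteps(utterances: &[(&str, u64)]) -> usize {
    let mut last_start: HashMap<&str, u64> = HashMap::new();
    let mut count = 0;
    for &(speaker, start_ms) in utterances {
        if let Some(&prev) = last_start.get(speaker) {
            if start_ms < prev {
                count += 1;
            }
        }
        last_start.insert(speaker, start_ms);
    }
    count
}
```

Fed the four classroom utterances as (speaker, start) pairs, the global count is 1 (MAR at 51000 after UEL at 51200) and the per-speaker count is 0, since no speaker appears twice.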
When IS non-monotonic start time an error?
Same-speaker non-monotonicity IS an error. If CHI speaks at 10,000ms, then later in the file CHI speaks again at 8,000ms, CHI’s timeline has gone backward. This almost certainly indicates a transcription or alignment mistake.
The test is simple: within the same speaker’s utterance sequence, start times
must be non-decreasing. This is what chatter validate checks for E701.
How chatter validate implements bullet validation
E701: Per-speaker monotonicity (not global)
chatter validate tracks each speaker’s last start time in a HashMap. E701
only fires when the same speaker’s start time goes backward. Cross-speaker
non-monotonicity is silently accepted.
This is an intentional semantic divergence from CLAN CHECK, which fires error 83
globally. We believe CLAN’s global check reflects the implementation
(comparing against a single global tBegTime) rather than the intent (detecting
disordered timestamps). The per-speaker version matches the intent without
drowning users in false positives from normal conversational overlap.
E704: Per-speaker overlap with 500ms tolerance
chatter validate tracks each speaker’s last end time in a HashMap. E704
fires when the overlap exceeds 500ms (same threshold as CLAN Error 133).
Unlike CLAN, E704 runs independently of E701. An utterance can trigger both errors if it is both non-monotonic (E701) and self-overlapping (E704). CLAN’s early-return pattern prevents error 133 from firing when error 83 fires, which is a bug, not a feature.
Speaker state is always updated regardless of whether errors fire. This avoids the cascading state corruption that CLAN’s implementation suffers from.
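The independence of the two checks and the always-update rule can be sketched as follows. This is an illustration, not the code in temporal.rs. The Utterance struct and check_timeline function are hypothetical, a single map holds both start and end per speaker for brevity, and untranscribed-utterance skipping and the CA-mode bypass are omitted.

```rust
use std::collections::HashMap;

pub struct Utterance<'a> {
    pub speaker: &'a str,
    pub start_ms: u64,
    pub end_ms: u64,
}

/// Self-overlap tolerance, matching the 500ms threshold described above.
const TOLERANCE_MS: u64 = 500;

/// Returns (utterance index, code) pairs. E701 and E704 are evaluated
/// independently, and speaker state is updated even when a check fires.
pub fn check_timeline(utterances: &[Utterance<'_>]) -> Vec<(usize, &'static str)> {
    // speaker -> (last start, last end)
    let mut last: HashMap<&str, (u64, u64)> = HashMap::new();
    let mut errors = Vec::new();
    for (idx, utt) in utterances.iter().enumerate() {
        if let Some(&(prev_start, prev_end)) = last.get(utt.speaker) {
            if utt.start_ms < prev_start {
                errors.push((idx, "E701"));
            }
            // saturating_sub yields 0 when there is no overlap at all.
            if prev_end.saturating_sub(utt.start_ms) > TOLERANCE_MS {
                errors.push((idx, "E704"));
            }
        }
        // No early return above, so this update always runs.
        last.insert(utt.speaker, (utt.start_ms, utt.end_ms));
    }
    errors
}
```

An utterance that both starts before the speaker's previous start and overlaps the previous end by more than the tolerance yields both codes, and the next utterance for that speaker is compared against the fresh state rather than a stale one.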
E729: Not in default validation
E729 (CLAN Error 84, cross-speaker overlap) is implemented but not called
during default validation. It exists for future use in a strict-bullet mode
equivalent to CLAN’s +c0 flag.
Untranscribed utterances are skipped
Utterances containing only untranscribed markers (www, xxx, yyy) are
skipped for E704 checks. These utterances often carry broad segment bullets
(covering a long span of background speech) that would create false self-overlap
reports. This matches CLAN CHECK’s behavior, where untranscribed tiers do not
contribute to timing comparisons.
CA mode disables all temporal checks
When the file header includes @Options: CA, all temporal validation is
skipped. Conversation Analysis mode intentionally relaxes timing constraints
because CA transcription conventions use overlapping and non-sequential timing
as part of the analytic notation.
Comparison: CLAN CHECK vs chatter validate
The following table summarizes the behavioral differences:
┌────────────────────────────┬──────────────┬─────────────────┐
│ Behavior │ CLAN CHECK │ chatter validate│
├────────────────────────────┼──────────────┼─────────────────┤
│ Error 83 / E701 scope │ Global │ Per-speaker │
│ Error 133 / E704 scope │ Per-speaker │ Per-speaker │
│ Error 84 / E729 default │ Off (+c0) │ Off │
│ 83 shadows 133 │ Yes (bug) │ No │
│ 83 corrupts speaker state │ Yes (bug) │ No │
│ E701 + E704 independent │ No │ Yes │
│ Speaker state always fresh │ No │ Yes │
│ Untranscribed skipped │ Implicit │ Explicit │
│ CA mode bypass │ Yes │ Yes │
│ 500ms tolerance (E704) │ Yes │ Yes │
└────────────────────────────┴──────────────┴─────────────────┘
Expected count differences
On multi-party files with overlapping speech:
- E701 count will be lower than CLAN’s error 83 count. CLAN fires error 83 on cross-speaker non-monotonicity; we don’t. The difference represents legitimate conversational overlap that we intentionally do not flag.
- E704 count will be higher than CLAN’s error 133 count. CLAN’s early-return shadowing prevents error 133 from firing when error 83 fires, and the stale speaker state causes further suppression. Our correctly maintained per-speaker tracking reports all genuine self-overlaps.
On single-speaker files or files with minimal overlap, the counts should be very close or identical.
Implementation details
The implementation lives in
crates/talkbank-model/src/validation/temporal.rs.
Data flow
flowchart TD
A["collect_bullets(file)\n(temporal.rs:101)"] -->|"Vec<BulletInfo>"| B
B["validate_global_timeline()\n(temporal.rs:169)"] -->|"Per-speaker HashMap"| C["E701 errors"]
A -->|"Vec<BulletInfo>"| D
D["validate_speaker_timelines()\n(temporal.rs:212)"] -->|"Per-speaker HashMap"| E["E704 errors"]
BulletInfo
Each utterance with a bullet produces a BulletInfo containing:
- utterance_idx: 0-based index in the file
- speaker: the speaker code (e.g., "CHI", "PIL")
- bullet: the Bullet struct with start_ms and end_ms
- has_timeable_content: whether the utterance contains transcribed words (used to skip untranscribed-only turns for E704)
Only main speaker tiers are collected. Dependent tiers (%mor, %gra, etc.)
are excluded.
Per-speaker tracking
Both E701 and E704 use HashMap<&str, ...> keyed by speaker code:
- E701: stores (utterance_idx, start_ms), the speaker’s most recent start time
- E704: stores (utterance_idx, end_ms), the speaker’s most recent end time
State is always updated after processing each bullet, regardless of whether an error was reported. This ensures clean tracking for subsequent comparisons.
CLAN source reference
For readers who want to trace the CLAN implementation:
- Function: check_checkBulletsConsist() in OSX-CLAN/src/clan/check.cpp, lines 3849-3967
- Error 83: lines 3883-3890 (early return(83))
- Error 133: lines 3892-3895 (only reached if error 83 did not fire)
- Speaker state update: line 3909 (check_setLastTime), only reached if no error fired
- Per-speaker tracking: SPLIST linked list, lookup via check_getLatTime() / check_setLastTime()
- +c0 mode: checkBullets flag, set via +c0 command-line option (line 5920), guards errors 84/85 at lines 3897 and 3953
- Call site: check_ParseWords() line 4801, guarded by utterance->speaker[0] == '*' (main tiers only)
CA Terminator Resolution
Status: Current Last updated: 2026-05-05 12:23 EDT
How CA markers are split between separators and linkers in the parser/model.
Current rule
The parser/model no longer promotes CA markers into utterance terminators.
The supported split is:
- Standard utterance terminators remain the CHAT terminators such as . ? ! +... +/. and related final punctuation tokens.
- CA intonation arrows (⇗ ↗ → ↘ ⇘) stay Separator content items.
- CA TCU markers (≈ ≋) stay Separator content items.
- CA TCU linker forms (+≈ +≋) stay Linker items.
This means a trailing →, ≈, or ≋ remains in main-tier content rather
than being retyped as Terminator.
Parser/model consequences
- Tree-sitter grammar keeps arrows and ≈/≋ on the separator path.
- The tree parser converts those nodes directly into Separator variants.
- The re2c parser classifies ≈/≋ as separators and +≈/+≋ as linkers.
- The old post-hoc resolve_ca_terminator() promotion pass was removed.
- Terminator::try_from_chat_str() intentionally rejects CA arrows, ≈, ≋, +≈, and +≋.
Data Model
The active surface split is:
| Kind | CHAT tokens |
|---|---|
| Terminator | . ? ! +... +/. +//. +/? +!? +"/. +". +//? +..? +. |
| Separator | ⇗ ↗ → ↘ ⇘ ≈ ≋ plus the other CA/content separators |
| Linker | +≈ +≋ plus the other utterance linkers |
Legacy CA-only Terminator variants still exist in the type for backward
compatibility with older serialized data, but new parser/classifier code does
not construct them from CHAT text.
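The split in the table can be expressed as a single match. This is a sketch covering only the tokens the table lists. The SurfaceKind enum and classify function are hypothetical names, not the parser's API, and the "other" separators and linkers the table alludes to are deliberately left unclassified.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum SurfaceKind {
    Terminator,
    Separator,
    Linker,
}

/// Classify a token using only the tokens listed in the table above.
/// Returns None for anything else (other separators and linkers exist).
pub fn classify(token: &str) -> Option<SurfaceKind> {
    match token {
        "." | "?" | "!" | "+..." | "+/." | "+//." | "+/?" | "+!?" | "+\"/." | "+\"."
        | "+//?" | "+..?" | "+." => Some(SurfaceKind::Terminator),
        "⇗" | "↗" | "→" | "↘" | "⇘" | "≈" | "≋" => Some(SurfaceKind::Separator),
        "+≈" | "+≋" => Some(SurfaceKind::Linker),
        _ => None,
    }
}
```

A trailing → therefore classifies as a Separator even in utterance-final position, which matches the rule that position alone never promotes a CA marker to a terminator.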
Regression coverage
The regression surface for this split is:
- ca_symbols_are_not_chat_terminators in talkbank-model
- trailing_ca_arrow_stays_separator in talkbank-parser
- trailing_ca_no_break_stays_separator in talkbank-parser
- trailing_ca_technical_break_stays_separator in talkbank-parser
Validation Cache
Status: Current Last modified: 2026-06-22 06:48 EDT
The CHAT-core validation cache, used by chatter validate and the
LSP server. Distinct from the audio-task cache used by upstream
batchalign3 for FA / UTR ASR / media conversion (documented
separately in that project): this cache stores parse + validate
results keyed by file path + options.
It lives in crates/talkbank-cache/.
Architecture
flowchart TD
req["Validation request\n(path + options)"]
key["Cache key\n(path_hash + RulesVersion + check_alignment + parser_kind)"]
db["SQLite WAL\n~/.cache/talkbank-chat/\ntalkbank-cache.db"]
hit["Cache hit\n→ return stored result"]
miss["Cache miss\n→ parse + validate + store"]
req --> key --> db
db -->|"found + RulesVersion match + content_hash match"| hit
db -->|"not found, rules changed, or content edited"| miss
miss --> db
Configuration
| Config | Value | Why |
|---|---|---|
| Backend | SQLite via sqlx | Concurrent reads (WAL), atomic writes, zero-config |
| Pool size | 16 connections | Matches validation worker count |
| mmap | 256 MB | Fast random access for 95k+ entries |
| Invalidation | Rules-version field + content hash + 30-day TTL | Rule-set or schema changes auto-invalidate; content edits invalidate per-file; stale entries pruned |
| Bridge | Embedded single-threaded tokio runtime | Sync workers call rt.block_on() for async SQLite |
Schema
file_cache table (see
crates/talkbank-cache/migrations/20260101000000_initial.sql):
| Column | Role |
|---|---|
| path_hash | BLAKE3 hash of the resolved path (part of the lookup key) |
| file_path | Resolved file path, indexed for path-based maintenance ops |
| content_hash | Hash of the file content; mismatch invalidates the entry |
| version | Cache-compatibility version (RulesVersion): the cache crate version folded together with a fingerprint of the active validation rule set. A mismatch invalidates the entry |
| cached_at | Insertion timestamp |
| check_alignment | Whether alignment validation was requested |
| is_valid | Cached validation outcome (0/1) |
| roundtrip_tested | Whether roundtrip equivalence was checked |
| roundtrip_passed | Roundtrip result when tested |
| parser_kind | Parser backend (tree-sitter or re2c) |
The lookup key is the compound unique index
(path_hash, version, check_alignment, parser_kind); file_path is a
secondary index used by maintenance operations (orphan pruning, etc.).
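The lookup rule can be sketched with an in-memory map standing in for the SQLite table. CacheKey, CacheEntry, and SketchCache are hypothetical names for illustration, and plain strings stand in for the real hash and version types.

```rust
use std::collections::HashMap;

/// Mirrors the compound unique index
/// (path_hash, version, check_alignment, parser_kind).
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct CacheKey {
    pub path_hash: String,
    pub version: String,
    pub check_alignment: bool,
    pub parser_kind: String,
}

/// The subset of row columns this sketch needs.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CacheEntry {
    pub content_hash: String,
    pub is_valid: bool,
}

/// In-memory stand-in for the file_cache table (the real store is SQLite).
#[derive(Default)]
pub struct SketchCache {
    rows: HashMap<CacheKey, CacheEntry>,
}

impl SketchCache {
    pub fn store(&mut self, key: CacheKey, entry: CacheEntry) {
        self.rows.insert(key, entry);
    }

    /// Returns the cached verdict only when the key matches AND the file
    /// content is unchanged; every other case is a miss (None).
    pub fn lookup(&self, key: &CacheKey, current_content_hash: &str) -> Option<bool> {
        self.rows
            .get(key)
            .filter(|entry| entry.content_hash == current_content_hash)
            .map(|entry| entry.is_valid)
    }
}
```

Because the version is part of the key, a query carrying a newer version simply never finds rows written under the old one; no explicit delete is needed for correctness.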
Database location
| Platform | Path |
|---|---|
| macOS | ~/Library/Caches/talkbank-chat/talkbank-cache.db |
| Linux | ~/.cache/talkbank-chat/talkbank-cache.db |
| Windows | %LocalAppData%\talkbank-chat\talkbank-cache.db |
Invalidation
- Validation-rule changes: the version column holds a RulesVersion, which folds the talkbank-cache crate version together with a fingerprint of the active validation rule set (an FNV-1a hash over every ErrorCode the validator can emit, via talkbank_model::validation_rules_fingerprint). Adding, removing, or renaming a rule (for example introducing error code E370, “retrace marker must be followed by material”) changes the fingerprint, hence the RulesVersion, hence the lookup key, so verdicts cached under the old rule set become a cache MISS and are re-validated instead of served stale. This is the mechanism that keeps chatter validate (the authority on CHAT validity) from returning a stale “Valid” after the rules tighten. The stale rows stay on disk under their old version for selective re-testing; they are simply never served to a query carrying the new version.
- Content changes: each entry stores the file’s content_hash; a mismatch is a per-file miss.
- Time-based: entries older than 30 days are pruned.
- Manual: pass --force to bypass cache lookups for a particular validation run.
Per repository policy, do not delete the cache directory without explicit
request. Use --force when you want fresh validation for specific paths
without destroying the whole cache.
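The rule-set fingerprint idea can be sketched as follows. This is an illustration, not the implementation behind talkbank_model::validation_rules_fingerprint: the function names, the sorting step, the delimiter byte, and the version string layout are all assumptions. Only the FNV-1a constants are standard.

```rust
/// 64-bit FNV-1a over a byte slice, continuing from `state`.
fn fnv1a_64(mut state: u64, bytes: &[u8]) -> u64 {
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    for &b in bytes {
        state ^= u64::from(b);
        state = state.wrapping_mul(PRIME);
    }
    state
}

/// Hypothetical fingerprint over every error code the validator can emit.
/// Codes are sorted first so declaration order does not affect the result.
pub fn rules_fingerprint(error_codes: &[&str]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    let mut sorted: Vec<&str> = error_codes.to_vec();
    sorted.sort_unstable();
    let mut state = OFFSET_BASIS;
    for code in sorted {
        state = fnv1a_64(state, code.as_bytes());
        // Delimiter byte keeps ["E3", "70"] distinct from ["E370"].
        state = fnv1a_64(state, &[0]);
    }
    state
}

/// Hypothetical RulesVersion rendering: crate version plus fingerprint.
pub fn rules_version(crate_version: &str, error_codes: &[&str]) -> String {
    format!("{}+{:016x}", crate_version, rules_fingerprint(error_codes))
}
```

Adding a code such as E370 to the input changes the fingerprint and therefore the version string, so lookups under the new version miss every row cached under the old one.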
See also
- Upstream
batchalign3documents its own audio-task cache for FA / UTR ASR / media conversion.
Alignment
Status: Current Last modified: 2026-06-15 15:00 EDT
Alignment in the toolchain operates at two structural layers, plus a separate overlap-marker pass. Tier alignment is structural (counting and pairing AST nodes); word extraction is positional (domain-ordered token indices).
| Layer | Where | Purpose |
|---|---|---|
| Tier alignment | talkbank-model::alignment | 1:1 mapping between main tier and dependent tiers (%mor, %pho, %wor, %sin, %gra) |
| Word extraction | talkbank-transform::extract | Pull NLP-ready words from the AST in domain order |
Tier Alignment
Validates that dependent tiers have the correct number and arrangement
of items relative to the main tier. Lives in
crates/talkbank-model/src/alignment/.
TierDomain
enum TierDomain { Mor, Pho, Sin, Wor }
The same utterance produces different counts per domain:
| Rule | Mor | Pho | Sin | Wor |
|---|---|---|---|---|
| Skip retrace groups | Yes | No | No | No |
| Count pauses | No | Yes | No | No |
| PhoGroup | Recurse | Atomic (1) | Skip (0) | Recurse |
| SinGroup | Recurse | Skip (0) | Atomic (1) | Recurse |
| Include fragments (&+) | No | Yes | Yes | No |
| Include nonwords (&~) | No | Yes | Yes | No |
| Include fillers (&-) | No | Yes | Yes | Yes |
| Include untranscribed | No | Yes | Yes | No |
| Include tag-marker separators | Yes | No | No | No |
| ReplacedWord aligns to | Replacement | Original | Original | Original |
For the underlying word filter (counts_for_tier,
should_skip_group), the content walker, and the ChatFile model itself,
see CHAT Data Model. The walker plus the
domain table together govern every tier-alignment count.
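To make the table concrete, here is a small sketch that counts alignable positions per domain. The Item enum is a deliberately tiny stand-in for the real content model, count_positions is a hypothetical name, and only the retrace, pause, fragment, nonword, and filler rows are modeled.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TierDomain {
    Mor,
    Pho,
    Sin,
    Wor,
}

/// Simplified stand-in for main-tier content; the real model has far more variants.
#[derive(Debug, Clone)]
pub enum Item {
    Word,
    Filler,   // &-
    Fragment, // &+
    Nonword,  // &~
    Pause,
    Retrace(Vec<Item>),
}

/// Counts alignable positions for one domain, following the rule table above.
pub fn count_positions(items: &[Item], domain: TierDomain) -> usize {
    items
        .iter()
        .map(|item| match item {
            Item::Word => 1,
            // Fillers count everywhere except %mor.
            Item::Filler => usize::from(domain != TierDomain::Mor),
            // Fragments and nonwords count only for %pho and %sin.
            Item::Fragment | Item::Nonword => {
                usize::from(matches!(domain, TierDomain::Pho | TierDomain::Sin))
            }
            // Pauses count only for %pho.
            Item::Pause => usize::from(domain == TierDomain::Pho),
            // %mor skips retraced material; the other domains recurse into it.
            Item::Retrace(inner) => match domain {
                TierDomain::Mor => 0,
                _ => count_positions(inner, domain),
            },
        })
        .sum()
}
```

The same six-item utterance yields a different count in each domain, which is why a single "word count" cannot serve every dependent tier.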
Retrace handling, alignment-critical
Retraces are the most alignment-critical content type. A Retrace node
wraps content the speaker said then corrected.
- Mor: skip entirely (count 0). The retrace was a false start; only the correction carries morphological analysis.
- Pho, Sin: recurse; the words were physically produced and have phonological / gestural data.
- Wor: recurse; retrace ancestry does not change %wor membership.
Critical invariant: the parser must emit UtteranceContent::Retrace
for all retrace patterns, including single-word retraces with
replacements (word [: repl] [* err] [//]). If a retrace is
accidentally emitted as a bare ReplacedWord, it counts for %mor
alignment, causing false E705 errors. Enforced by
tests/retrace_replaced_word_regression.rs. Full data model + parsing
pipeline + CHAT examples in
Retraces and Repetitions.
AlignmentPair
struct AlignmentPair {
    source_index: Option<usize>,
    target_index: Option<usize>,
}
Universal index-pair primitive. Some/Some means matched; a None on one side
marks an insertion / deletion placeholder for mismatch diagnostics.
is_complete() is true when both indices are Some; is_placeholder() is true for an unmatched pair.
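A sketch of how such pairs behave under purely positional pairing. The method bodies and the pair_positionally helper are assumptions for illustration; in particular, treating is_placeholder() as the negation of is_complete() is a guess at the real semantics.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct AlignmentPair {
    pub source_index: Option<usize>,
    pub target_index: Option<usize>,
}

impl AlignmentPair {
    /// Both sides present: a real match.
    pub fn is_complete(&self) -> bool {
        self.source_index.is_some() && self.target_index.is_some()
    }

    /// Assumed here to mean "not complete", i.e. at least one side is None.
    pub fn is_placeholder(&self) -> bool {
        !self.is_complete()
    }
}

/// Hypothetical positional pairing: index i pairs with index i, and the
/// surplus on the longer side becomes placeholder pairs.
pub fn pair_positionally(source_len: usize, target_len: usize) -> Vec<AlignmentPair> {
    (0..source_len.max(target_len))
        .map(|i| AlignmentPair {
            source_index: if i < source_len { Some(i) } else { None },
            target_index: if i < target_len { Some(i) } else { None },
        })
        .collect()
}
```

Storing only indices keeps the result cheap to build and lets diagnostics point back into both tiers without cloning any tier content.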
Per-domain results
| Type | Function | Source → Target |
|---|---|---|
| MorAlignment | align_main_to_mor() | Main → %mor items |
| PhoAlignment | align_main_to_pho() | Main → %pho tokens |
| SinAlignment | align_main_to_sin() | Main → %sin tokens |
| WorAlignment | align_main_to_wor() | Main → %wor tokens |
| GraAlignment | align_mor_to_gra() | %mor chunks → %gra relations |
%gra aligns to %mor chunks, not items. Clitics create additional
chunks (pro|it~v|be&PRES = 2 chunks: pre-clitic + main).
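The item-versus-chunk distinction can be illustrated on raw %mor text. This sketch assumes ~ is the only clitic separator and works on strings purely for demonstration; the real code operates on the typed MorTier, and both function names are hypothetical.

```rust
/// Counts %gra-alignable chunks in one %mor item,
/// assuming `~` is the only clitic separator (an illustrative simplification).
pub fn mor_chunk_count(mor_item: &str) -> usize {
    mor_item.split('~').filter(|chunk| !chunk.is_empty()).count()
}

/// Total chunk count across a tier's items; this is the number of
/// %gra relations the tier would need under the assumption above.
pub fn total_gra_chunks(mor_items: &[&str]) -> usize {
    mor_items.iter().map(|item| mor_chunk_count(item)).sum()
}
```

A two-item tier containing one clitic therefore needs three %gra relations, not two.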
Trait abstractions
| Trait | Purpose | Implementors |
|---|---|---|
| IndexPair | source()/target() on any pair type | AlignmentPair, GraAlignmentPair |
| TierAlignmentResult | pairs()/errors()/push_*() accumulator | All 5 alignment result types |
| AlignableTier | What a tier provides for generic alignment | PhoTier, SinTier, WorTier |
| TierCountable | count_tier_positions() / collect_tier_items() methods | [UtteranceContent] |
The generic positional_align() function uses AlignableTier to
eliminate duplication: align_main_to_{pho,sin,wor}() are thin
wrappers around it. %mor doesn’t use it (additional terminator
validation logic). %gra doesn’t use it (source is MorTier, not
MainTier). WorTier overrides mismatch_format() to Diff (LCS) since
both sides are word sequences; the others use Positional.
%wor is not validated
%wor is a timing-annotation tier. There is no downstream positional
indexing into %wor, and validate_alignments() does not check
%wor word count against the main tier. Old corpus files may have
xxx, fragments, or nonwords in %wor (pre-2026-04 behavior) without
producing false errors.
Phon tier-to-tier alignment
A second class of alignment that operates between dependent tiers:
| Source | Target | Code |
|---|---|---|
| %modsyl | %mod | E725 |
| %phosyl | %pho | E726 |
| %phoaln | %mod | E727 |
| %phoaln | %pho | E728 |
Derived-view alignments: %modsyl is a syllabified reannotation of
%mod, %phosyl of %pho, %phoaln aligns both. Word counts must
match between source and target. Computed in compute_alignments()
after the main-tier alignments. build_tier_to_tier_alignment()
constructs index pairs and emits build_count_mismatch_error() when
counts disagree. %phoaln checks against both %mod and %pho,
potentially emitting E727 and E728 simultaneously.
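The count check reduces to a comparison per source/target pair. The sketch below uses hypothetical names (PhonPair, check_counts, check_phoaln) and returns bare error-code strings instead of the real diagnostic type.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PhonPair {
    ModsylToMod,
    PhosylToPho,
    PhoalnToMod,
    PhoalnToPho,
}

impl PhonPair {
    /// Error code from the table above.
    pub fn error_code(self) -> &'static str {
        match self {
            PhonPair::ModsylToMod => "E725",
            PhonPair::PhosylToPho => "E726",
            PhonPair::PhoalnToMod => "E727",
            PhonPair::PhoalnToPho => "E728",
        }
    }
}

/// Returns the pair's error code when the word counts disagree.
pub fn check_counts(pair: PhonPair, source_words: usize, target_words: usize) -> Option<&'static str> {
    if source_words != target_words {
        Some(pair.error_code())
    } else {
        None
    }
}

/// %phoaln is checked against both %mod and %pho, so it can yield two codes.
pub fn check_phoaln(phoaln_words: usize, mod_words: usize, pho_words: usize) -> Vec<&'static str> {
    let mut codes = Vec::new();
    codes.extend(check_counts(PhonPair::PhoalnToMod, phoaln_words, mod_words));
    codes.extend(check_counts(PhonPair::PhoalnToPho, phoaln_words, pho_words));
    codes
}
```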
Known data issue: Phon XML source data has orthography↔IPA word
count discrepancies in ~4% of files (518 / 12,340), which is expected in child
phonology data. The PhonTalk converter handles this inconsistently:
%mod/%pho are truncated to match orthography via OneToOne, but
%xmodsyl/%xphosyl/%xphoaln are written from raw IPATranscript,
exposing the full IPA word count. The result is E725-E728 mismatches.
Parse-health gating
Alignment diagnostics honor ParseHealth metadata. If a dependent
tier’s domain is parse-tainted, mismatch errors for that domain pair
are suppressed. Main-tier taint blocks all main→dependent alignments.
Dependent-tier taint blocks only that tier. Phon tier-to-tier checks
have their own gates (can_align_modsyl_to_mod,
can_align_phosyl_to_pho, can_align_phoaln).
Word Extraction
extract_words() (in crates/talkbank-transform/src/extract.rs) uses
the content walker to pull words from the AST in domain-specific order.
Returns Vec<ExtractedWord> with text, word_index, is_separator,
special_form. Tag-marker separators (, „ ‡) are included as
words in Mor domain because they have %mor items (cm|cm,
end|end, beg|beg).
Overlap Marker Iteration
CA overlap markers (⌈⌉⌊⌋) appear at three content levels:
UtteranceContent (top-level), BracketedItem (inside groups), and
WordContent (intra-word, butt⌈er⌉). Two APIs live in
talkbank-model/src/alignment/helpers/overlap.rs:
walk_overlap_points, low-level
Visits every OverlapPoint in document order with word-position
context. Analogous to walk_words but for overlap markers:
walk_overlap_points(&utterance.main.content.content.0, &mut |visit| {
// visit.point: &OverlapPoint (kind + optional index)
// visit.word_position: usize (alignable words seen so far)
});
extract_overlap_info, region-based
Pairs markers by (kind, index) into OverlapRegion structs. Each
region represents a matched ⌈…⌉ or ⌊…⌋ pair. Pairing is index-aware:
⌈2...⌉2 forms a separate region from ⌈...⌉, and mismatched indices
leave markers unpaired. Onset-only ⌈ (without ⌉) is a legitimate CA
convention: the region has end_at_word = None and
is_well_paired() = false, but top_onset_fraction() still works.
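A sketch of index-aware pairing over a flat marker list. Marker, Region, and pair_markers are hypothetical stand-ins for OverlapPoint, OverlapRegion, and extract_overlap_info; splitting the marker kind into a side plus an is_begin flag is also an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OverlapSide {
    Top,    // ⌈ ... ⌉
    Bottom, // ⌊ ... ⌋
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Marker {
    pub side: OverlapSide,
    pub is_begin: bool,
    pub index: Option<u8>,
    pub word_position: usize,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Region {
    pub side: OverlapSide,
    pub index: Option<u8>,
    pub begin_at_word: usize,
    pub end_at_word: Option<usize>,
}

impl Region {
    pub fn is_well_paired(&self) -> bool {
        self.end_at_word.is_some()
    }
}

/// Pairs each closing marker with the most recent still-open region that has
/// the same side and index. Onset-only openers keep end_at_word = None.
pub fn pair_markers(markers: &[Marker]) -> Vec<Region> {
    let mut regions: Vec<Region> = Vec::new();
    for m in markers {
        if m.is_begin {
            regions.push(Region {
                side: m.side,
                index: m.index,
                begin_at_word: m.word_position,
                end_at_word: None,
            });
        } else if let Some(open) = regions
            .iter_mut()
            .rev()
            .find(|r| r.end_at_word.is_none() && r.side == m.side && r.index == m.index)
        {
            open.end_at_word = Some(m.word_position);
        }
        // A closer with no matching opener is left unpaired and dropped in this sketch.
    }
    regions
}
```

Matching on the index as well as the side is what keeps ⌈2…⌉2 from closing a plain ⌈ region that happens to be open at the same time.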
Cross-utterance, analyze_file_overlaps
For whole-file analysis, in overlap_groups.rs. 1:N matching: one
top region from speaker A can match multiple bottom regions from
speakers B, C, etc. Used by E347 and chatter debug overlap-audit.
Overlap validation
| Code | Level | Check |
|---|---|---|
| E347 | Cross-utterance | Orphaned tops/bottoms with 1:N matching (warning) |
| E348 | Utterance | Unpaired markers within a single utterance (warning) |
| E373 | Utterance | Invalid overlap index values (must be 2-9) |
| E704 | Cross-utterance | Same speaker encoding both top and bottom (error) |
chatter debug overlap-audit <path> reports per-file statistics
(groups, bottoms, orphans, temporal consistency) in TSV format. Use
--database <path.jsonl> for a persistent JSON-lines database.
Design Principles
- No string hacking. All alignment operates on typed AST structures (Word, MorTier, AlignmentPair), never on serialized CHAT text.
- Domain-aware from the start. TierDomain gates traversal at the walker level. Downstream code never re-implements retrace / group skipping logic.
- Deterministic over approximate. Tier alignment and word extraction use deterministic, positional algorithms over the typed AST.
- Dense indexed structures. AlignmentPair uses Option<usize> rather than cloned data; index pairs are stored positionally, not in hash maps.
- Exhaustive matching. Every match on UtteranceContent (24 variants) or BracketedItem (22 variants) lists all variants explicitly. New variants are a compile error, not a silent bug.
- Walker as shared primitive. walk_words() removed ~330 lines of duplicated traversal boilerplate across 7 call sites.
Downstream Consumers
| Consumer | Crate | Usage |
|---|---|---|
| Validation | talkbank-model | Cross-tier checks (E714/E715, E725-E728), overlap (E347/E348/E373/E704) |
| LSP hover | talkbank-lsp | Show aligned tier items for word under cursor |
| Word extraction | talkbank-transform | NLP-ready words from utterances |
| Overlap audit | chatter | chatter debug overlap-audit |
| %wor generation | talkbank-model | Build %wor tier from main tier |
Memory and Ownership
Status: Current Last updated: 2026-03-24 01:32 EDT
This chapter documents the memory management and ownership patterns used across the TalkBank Rust crates. Understanding these decisions helps contributors make consistent choices when adding new code.
String Representation Strategy
CHAT corpora contain massive repetition: the same speaker codes, language codes, POS tags, and high-frequency words appear millions of times across files. The codebase uses three string types, chosen by expected cardinality and duplication:
flowchart LR
raw["Raw input (&str)"]
smol["SmolStr\n(inline ≤23 bytes)"]
arc["Arc<str>\n(interned, deduplicated)"]
string["String\n(owned, unique)"]
raw -->|short, low repetition| smol
raw -->|high repetition domain value| arc
raw -->|ephemeral/unique| string
| Type | When to use | Examples |
|---|---|---|
| SmolStr | Short tokens, low duplication | Postcode text, tier content, event labels |
| Arc<str> (interned) | High-repetition domain symbols | Speaker codes, language codes, POS tags, stems |
| String | Ephemeral or unique values | Error messages, temporary formatting |
String Interning
Location: talkbank-model/src/model/intern.rs
Five global process-local interners, each a DashMap<Arc<str>, Arc<str>> behind
OnceLock<StringInterner>:
| Interner | Pre-seeded values | Typical savings |
|---|---|---|
| speaker_interner() | 30+ codes (CHI, MOT, FAT, …) | High, 3-letter codes repeat per utterance |
| language_interner() | 45+ ISO 639-3 codes | Moderate, per-file |
| pos_interner() | 60+ POS tags + UD relations | Very high, every %mor word |
| stem_interner() | 200+ frequent English stems | High, function words dominate |
| participant_interner() | 14 roles (Target_Child, …) | Low, per-file |
How it works:
- Fast path: get() on DashMap, O(1) Arc::clone if found
- Slow path: insert() new Arc if miss, deduplicates on future access
- Thread-safe: DashMap uses shard-level locks, no global contention
- After initialization, reads are lock-free
Memory impact: 50-200 MB savings on large corpora (5-20% reduction). Arc::clone
is O(1) atomic increment vs String::clone O(n) copy.
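The interning pattern can be sketched with the standard library alone. This is not the crate's implementation: a Mutex<HashSet<Arc<str>>> stands in for the DashMap (so it has one global lock, unlike the sharded original), and SketchInterner and the three seed values are illustrative.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex, OnceLock};

/// Std-only stand-in: a single Mutex<HashSet> replaces the sharded DashMap.
pub struct SketchInterner {
    pool: Mutex<HashSet<Arc<str>>>,
}

impl SketchInterner {
    fn with_seeds(seeds: &[&str]) -> Self {
        let pool = seeds.iter().map(|s| Arc::<str>::from(*s)).collect();
        SketchInterner { pool: Mutex::new(pool) }
    }

    /// Returns the shared Arc for `text`, inserting it on first sight.
    pub fn intern(&self, text: &str) -> Arc<str> {
        let mut pool = self.pool.lock().expect("interner mutex poisoned");
        if let Some(existing) = pool.get(text) {
            // Fast path: O(1) refcount bump, no string copy.
            return Arc::clone(existing);
        }
        // Slow path: allocate once; later calls hit the fast path.
        let fresh: Arc<str> = Arc::from(text);
        pool.insert(Arc::clone(&fresh));
        fresh
    }
}

/// Process-global interner, lazily initialized and pre-seeded.
pub fn speaker_interner() -> &'static SketchInterner {
    static INTERNER: OnceLock<SketchInterner> = OnceLock::new();
    INTERNER.get_or_init(|| SketchInterner::with_seeds(&["CHI", "MOT", "FAT"]))
}
```

Two calls with the same text return Arcs pointing at the same allocation, which is where the memory savings on repeated speaker codes come from.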
Newtype Macros
Two macros generate domain-typed string wrappers:
- string_newtype!: wraps SmolStr. Used for generic CHAT text.
- interned_newtype!: wraps Arc<str> with automatic interning. Used for domain symbols.
// SmolStr-backed: no interning, inline small strings
string_newtype!(PostcodeText);
// Arc<str>-backed: interned via global interner
interned_newtype!(SpeakerCode, speaker_interner);
Ownership Model
ChatFile Lifecycle
flowchart TD
src["Source text (&str)"]
cst["tree-sitter CST\n(Tree, borrowed nodes)"]
model["ChatFile\n(owned AST)"]
cache["SQLite cache\n(validation result)"]
lsp["LSP server\n(per-document state)"]
json["JSON output\n(serde serialization)"]
cli["CLI output\n(CHAT text)"]
src -->|tree-sitter parse| cst
cst -->|CST-to-model conversion| model
model -->|validate + hash| cache
model -->|held in backend| lsp
model -->|to_json()| json
model -->|to_chat_string()| cli
- Parsing: tree-sitter Tree owns the CST. Node<'a> values borrow from Tree, zero-copy traversal. The CST-to-model conversion copies data into owned ChatFile fields (SmolStr, Arc<str>). The Tree is dropped after conversion.
- Validation: ChatFile is borrowed (&self) during validation. Errors are streamed to an ErrorSink, no accumulation required.
- LSP: Each open document holds an owned ChatFile in the backend. Re-parsed on every edit via tree-sitter incremental parsing.
- CLI batch: Each file is independently parsed → validated → reported → dropped. No cross-file state except the shared cache.
Arc Usage
Arc appears in three distinct roles:
| Role | Type | Why |
|---|---|---|
| String interning | Arc<str> in model types | O(1) clone for high-repetition domain values |
| Worker pool | Arc<WorkerGroup> in batchalign | RAII CheckedOutWorker::drop() needs group reference to return worker |
| Cache backend | Arc<dyn CacheBackend> in batchalign | Shared across async request handlers |
No Rc (single-threaded sharing not needed). No Cow<str> (SmolStr covers the
inline-small-string use case more naturally).
Interior Mutability
| Pattern | Where | What it protects |
|---|---|---|
| RefCell<Parser> inside TreeSitterParser | talkbank-parser | Tree-sitter Parser needs &mut self but isn’t Sync. Callers create a TreeSitterParser and pass &TreeSitterParser everywhere. |
| DashMap<Arc<str>, Arc<str>> | String interners | Concurrent interning during parallel parsing. Shard-level locks. |
| OnceLock<StringInterner> | 5 global interners | Lazy init, lock-free after first access |
| LazyLock<Regex> | All regex patterns workspace-wide | Compile-once, no per-call overhead |
| std::sync::Mutex<VecDeque> | batchalign worker idle queue | Held < 10 μs for push/pop only |
| tokio::sync::Mutex<HashMap> | batchalign job store | Short reads/writes, never held across .await |
| Semaphore | Worker availability (batchalign) | Async signaling without holding locks during dispatch |
Rule: std::sync::Mutex for data accessed from sync code or held briefly.
tokio::sync::Mutex only when the lock must be held across .await points (which
we avoid when possible). DashMap when many threads read concurrently.
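The RefCell row in the table can be illustrated with a toy parser. RawParser and SharedParser are hypothetical stand-ins for tree-sitter's Parser and for TreeSitterParser; counting whitespace-separated tokens stands in for real parsing.

```rust
use std::cell::RefCell;

/// Stand-in for a parser whose parse method needs &mut self.
pub struct RawParser {
    calls: usize,
}

impl RawParser {
    pub fn parse(&mut self, source: &str) -> usize {
        self.calls += 1;
        source.split_whitespace().count()
    }
}

/// Wrapper that exposes parsing through &self via interior mutability.
pub struct SharedParser {
    inner: RefCell<RawParser>,
}

impl SharedParser {
    pub fn new() -> Self {
        SharedParser { inner: RefCell::new(RawParser { calls: 0 }) }
    }

    /// The mutable borrow lives only for the duration of this call,
    /// so callers can pass &SharedParser around freely on one thread.
    pub fn parse(&self, source: &str) -> usize {
        self.inner.borrow_mut().parse(source)
    }

    pub fn calls(&self) -> usize {
        self.inner.borrow().calls
    }
}
```

RefCell is not Sync, so a wrapper like this stays on one thread; that matches the table's note that the underlying parser is not shareable across threads either.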
Collection Choices
| Collection | Where | Why not HashMap/Vec |
|---|---|---|
| BTreeMap | All test/snapshot JSON output | Deterministic key ordering for reviewable diffs |
| IndexMap | Participants, per-speaker results | Preserves encounter order (CHAT spec requires @Participants order) |
| SmallVec<[T; N]> | Headers (N=2), tiers (N=3), features (N=4), token mappings (N=4) | Inline storage for common sizes; avoids heap for typical cases |
| VecDeque | Worker idle queue (batchalign) | FIFO fair scheduling |
| Dense Vec indexed by position | Retokenize word-to-token mapping | O(1) lookup, no hashing overhead, cache-friendly |
No LinkedList, BinaryHeap, or custom allocators.
Tree-Sitter Memory Model
Tree-sitter parsing is zero-copy for CST traversal:
// Node<'a> borrows from Tree, no allocation per node
fn process_node<'a>(node: Node<'a>, source: &str) -> ParseResult<...> {
    for i in 0..node.child_count() {
        let child: Node<'a> = node.child(i).unwrap(); // Stack-only, no heap
        let text: &str = child.utf8_text(source.as_bytes())?; // Borrows source
        // ... convert to owned model types ...
    }
}
The tree-sitter parser consumes &str, produces a CST, and the Rust traversal
code constructs owned model types from CST nodes.
SQLite Memory-Mapped I/O
The validation cache uses SQLite with memory-mapped I/O for fast random access:
SqliteConnectOptions::new()
.journal_mode(SqliteJournalMode::Wal) // Concurrent reads during writes
.pragma("cache_size", "-8000") // 8 MB page cache
.pragma("mmap_size", "268435456") // 256 MB memory-mapped region
.synchronous(SqliteSynchronous::Normal) // Balanced durability
This configuration handles 95,000+ cached entries efficiently. The cache is never
deleted (use --force to refresh specific paths).
Manual Drop Implementations
Three types have custom Drop for resource cleanup:
| Type | Cleanup action | Why |
|---|---|---|
| AuditReporter | Joins audit writer thread and flushes output | Audit mode owns file IO in a dedicated writer thread |
| CheckedOutWorker | Returns worker to idle queue + releases semaphore permit | RAII pool resource management |
| WorkerHandle | Sends SIGTERM/SIGKILL to child process | Process must be terminated when handle drops |
All drops are acyclic, no ordering dependencies between them.
Allocation Optimization Patterns
Rather than using an arena allocator (bumpalo was evaluated and removed because the data lifetimes don’t fit the “allocate many, free all at once” pattern), the codebase uses targeted optimizations:
| Pattern | Where | Savings |
|---|---|---|
| Scratch buffer reuse (clear + swap) | DP alignment row costs | ~50% fewer allocations in inner loop |
| Flat table (vec![...; rows * cols]) | DP small-problem fallback | 1 allocation vs rows+1 |
| Dense Vec instead of HashMap | Retokenize word mapping | O(1) lookup, no hash overhead |
| SmallVec inline storage | Throughout | Avoids heap for 1-4 element collections |
| SmolStr inline strings | All short CHAT tokens | No heap allocation for ≤23 byte strings |
See also: the batchalign3 book’s Arena Allocators page for the full evaluation of where arenas do and don’t help.
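The scratch-buffer row from the table above (clear + swap) can be shown with a generic two-row edit distance. This is a textbook Levenshtein sketch chosen to show the allocation pattern; it is not the workspace's DP aligner, and edit_distance is a hypothetical name.

```rust
/// Edit distance between two token sequences using two reusable row buffers.
pub fn edit_distance(a: &[&str], b: &[&str]) -> usize {
    // Both rows are allocated once, before the loops.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr: Vec<usize> = Vec::with_capacity(b.len() + 1);

    for (i, a_tok) in a.iter().enumerate() {
        curr.clear(); // drops the old contents but keeps the capacity
        curr.push(i + 1);
        for (j, b_tok) in b.iter().enumerate() {
            let substitute = prev[j] + usize::from(a_tok != b_tok);
            let delete = prev[j + 1] + 1;
            let insert = curr[j] + 1;
            curr.push(substitute.min(delete).min(insert));
        }
        // Swap the rows instead of allocating a fresh one per iteration.
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}
```

The inner loop performs no allocation at all; a naive version that builds a new Vec per row allocates once per source token.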
Algorithms and Data Structures
Status: Current Last modified: 2026-06-15 15:00 EDT
This chapter documents the key algorithms and data structure decisions across the TalkBank Rust crates.
CHAT AST Representation
The CHAT model is a tree of owned enums. The two central types are:
- UtteranceContent: 24 variants covering all main-tier content
- BracketedItem: 22 variants for content inside groups/brackets
flowchart TD
file["ChatFile"]
header["Headers\n(@Languages, @Participants, ...)"]
utt["Utterance"]
mc["MainContent\nVec<UtteranceContent>"]
dt["DependentTiers\n(%mor, %pho, %gra, ...)"]
file --> header
file --> utt
utt --> mc
utt --> dt
mc --> word["Word / AnnotatedWord / ReplacedWord"]
mc --> group["Group / PhoGroup / SinGroup / Quotation"]
mc --> marker["Pause / Separator / OverlapPoint / ..."]
group --> bi["BracketedContent\nVec<BracketedItem>"]
bi --> word2["Word / ReplacedWord / Separator"]
bi --> nested["Nested groups"]
Memory layout: Large variants (e.g., AnnotatedWord with scoped annotations)
are Boxed to keep the enum’s stack size bounded.
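The effect of boxing a large variant is easy to observe with size_of. The types below are illustrative only; BigPayload stands in for a variant carrying scoped annotations and does not reflect the real field layout.

```rust
use std::mem::size_of;

/// Stand-in for a large payload (128 bytes of data).
pub struct BigPayload {
    pub annotations: [u64; 16],
}

/// Every value of this enum is as large as its biggest variant.
pub enum InlineContent {
    Word(u32),
    Annotated(BigPayload),
}

/// Boxing moves the large payload to the heap; the enum holds only a pointer.
pub enum BoxedContent {
    Word(u32),
    Annotated(Box<BigPayload>),
}

/// Reports (inline size, boxed size) in bytes for the current target.
pub fn content_sizes() -> (usize, usize) {
    (size_of::<InlineContent>(), size_of::<BoxedContent>())
}
```

Since a Vec of enum values reserves the full enum size per element, shrinking the enum shrinks every common small variant as well, at the cost of one heap allocation for the rare large one.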
Content Walker
Location: talkbank-model/src/alignment/helpers/walk/
Closure-based recursive traversal centralizing the walk over all 24+22 variants:
pub fn for_each_leaf<'a>(
    content: &'a [UtteranceContent],
    domain: Option<AlignmentDomain>,
    f: &mut impl FnMut(ContentLeaf<'a>),
)
Domain-aware gating:
- Some(Mor): skips retrace groups (retrace words aren’t morphologically analyzed)
- Some(Pho | Sin): skips PhoGroup/SinGroup (treated as atomic by those tiers)
- None: recurses everything unconditionally
Both immutable (for_each_leaf) and mutable (for_each_leaf_mut) versions exist.
Used by talkbank-model, talkbank-transform word extraction, and other
typed CHAT traversals across the workspace.
Parsing Strategies
Tree-Sitter (Canonical Parser)
flowchart LR
src["Source .cha text"]
ts["tree-sitter C parser\n(generated from grammar.js)"]
cst["CST (Tree)"]
conv["Recursive descent\nover CST nodes"]
model["ChatFile (owned AST)"]
errors["ErrorSink\n(diagnostics)"]
src --> ts --> cst --> conv --> model
conv --> errors
- Grammar defined in grammar/grammar.js (source of truth)
- parser.c is generated, never edit directly
- CST-to-model conversion: recursive dispatch on node kind, skip WHITESPACES, report unrecognized nodes via ErrorSink
- Strict + catch-all pattern: Known header values get named grammar rules (syntax highlighting); unknown values hit a catch-all (flagged by validator)
Fragment Parsing
TreeSitterParser provides fragment methods for parsing individual CHAT
fragments (a word, a tier line) directly. Methods like
parser.parse_word_fragment(), parser.parse_main_tier_fragment(), etc.
are used when synthesizing CHAT from non-CHAT sources (ASR output, UD
annotations).
Historical note: A Chumsky-based direct parser previously provided combinator-based fragment parsing. It was removed in March 2026; tree-sitter is now the sole parser.
Tier Alignment (1:1 Positional)
Location: talkbank-model/src/alignment/traits.rs
Generic positional_align() pairs main-tier words with dependent-tier items by
position (O(n)). Traits: AlignableTier, TierAlignmentResult, AlignableContent.
- %pho, %sin, %wor: use generic positional alignment
- %mor, %gra: domain-specific custom implementations
- Mismatch diagnostics via similar crate (Patience diff algorithm, O(n log n))
Caching
The CHAT-core validation cache is documented separately in
Validation Cache. The
upstream batchalign3 project documents its own audio-task cache
(FA / UTR ASR / media conversion) separately.
Text Processing
Regex Compilation
All regex patterns use LazyLock<Regex> from std::sync, compiled once at
first use, lock-free thereafter. Never call Regex::new() inside functions or
loops.
Deterministic Output
- BTreeMap for all test/snapshot JSON (lexicographic key ordering)
- IndexMap for participant/speaker ordering (preserves encounter order per spec)
- Frequency results collected into BTreeMap<NormalizedWord, Count>
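A short sketch of the BTreeMap choice for frequency results. Plain String and usize stand in for NormalizedWord and Count, and lowercasing stands in for whatever normalization the real code applies; word_frequencies is a hypothetical name.

```rust
use std::collections::BTreeMap;

/// Counts words into a BTreeMap so iteration order is always sorted by key,
/// independent of input order or hasher state.
pub fn word_frequencies(words: &[&str]) -> BTreeMap<String, usize> {
    let mut counts = BTreeMap::new();
    for word in words {
        // Lowercasing is an illustrative stand-in for real normalization.
        *counts.entry(word.to_lowercase()).or_insert(0) += 1;
    }
    counts
}
```

Serializing this map yields the same key order on every run, which keeps snapshot diffs limited to real changes.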
Setup
Status: Current Last modified: 2026-06-21 21:33 EDT
Development is supported on Windows, macOS, and Linux. The instructions below use Unix shell syntax; on Windows, use PowerShell or Git Bash equivalently.
Prerequisites
- Rust (stable) via rustup (all platforms)
- Node.js for tree-sitter grammar generation and symbol validation
- tree-sitter CLI: cargo install tree-sitter-cli
- just (optional but recommended) for the repo’s top-level helper recipes
Clone Repository
mkdir -p ~/talkbank && cd ~/talkbank
git clone https://github.com/TalkBank/chatter.git
cd chatter
Build
From your chatter checkout root:
cargo build --workspace --locked
cargo build --workspace --all-targets --locked
# Optional helpers from the root justfile
just build
just test
just book-install-tools
just book
Two Cargo Workspaces
The repository has two independent Cargo workspaces:
1. Root workspace (Cargo.toml)
Contains all Rust crates for parsing, model, validation, and transform:
cargo build
cargo test
2. Spec workspace (spec/Cargo.toml)
Contains two sibling crates for spec-driven artifacts. Invoke with
--manifest-path relative to the chatter repo root:
cargo build --manifest-path spec/tools/Cargo.toml
cargo build --manifest-path spec/runtime-tools/Cargo.toml
cargo run --manifest-path spec/tools/Cargo.toml --bin gen_tree_sitter_tests -- --help
cargo run --manifest-path spec/runtime-tools/Cargo.toml --bin validate_error_specs -- --help
Root justfile recipes
just build # Build the Rust workspace
just build-release
just test # cargo test --workspace
just clippy
just fmt
just fmt-check
Verification
This repo does not currently have the old monorepo-wide make verify
wrapper ported into the root checkout. Until that lands, use the concrete
verification commands from the repo guidance:
cargo fmt
cargo check --workspace --all-targets
cargo build --workspace --all-targets --locked
cargo nextest run --workspace
cargo test --doc
Add grammar/spec commands when your change touches those surfaces:
cd grammar && tree-sitter generate && tree-sitter test
cargo build --manifest-path spec/tools/Cargo.toml
cargo build --manifest-path spec/runtime-tools/Cargo.toml
CI green on the pushed commit remains the authoritative pre-push gate for this repo.
Editor Setup
rust-analyzer
The workspace should work out of the box with rust-analyzer. The root Cargo.toml workspace configuration is standard.
Grammar Workflow
Status: Current Last modified: 2026-05-29 18:36 EDT
The tree-sitter grammar at grammar/grammar.js is the formal definition of the CHAT format. Changes require careful validation.
The following diagram shows the complete regeneration pipeline. Every step must pass before committing a grammar change.
flowchart TD
edit(["Edit grammar/grammar.js"])
generate["tree-sitter generate\n→ src/parser.c\n→ src/node-types.json"]
grammar_test["tree-sitter test\n(corpus tests)"]
rust_test["cargo test -p talkbank-parser\n(CST-to-model conversion)"]
equiv["parser equivalence\n(corpus/reference/ files)"]
spec_check{"Grammar change\naffects spec examples?"}
test_gen["spec/tools generators\n→ grammar/test/corpus/\n→ parser-tests/tests/generated/\n→ docs/errors/"]
commit(["Commit"])
edit --> generate --> grammar_test --> rust_test --> equiv --> spec_check
spec_check -->|Yes| test_gen --> commit
spec_check -->|No| commit
Step-by-Step Procedure
1. Edit the Grammar
Modify grammar.js in the grammar/ directory. Key design principles:
- Explicit whitespace (no extras)
- Precedence annotations to resolve ambiguities
- Named rules for all semantically meaningful nodes
2. Generate the Parser
cd grammar
tree-sitter generate
This produces src/parser.c and src/node-types.json. Never edit these files by hand.
3. Run Grammar Tests
tree-sitter test
Every test under grammar/test/corpus/ must pass. Tests live there
and are partially auto-generated from specs (primarily via
gen_tree_sitter_tests).
4. Run Parser Tests
cargo test -p talkbank-parser
This verifies the Rust parser wrapper handles all CST nodes correctly.
5. Run Parser Equivalence
cargo nextest run -p talkbank-parser-tests -E 'test(parser_equivalence)'
Every file in the reference corpus must parse correctly. Each .cha file is its own test; nextest runs them in parallel and reports individual failures.
6. Regenerate Spec Tests
If the grammar change affects any spec examples:
cargo run --manifest-path spec/tools/Cargo.toml --bin gen_tree_sitter_tests -- \
--output-dir grammar/test/corpus \
--template-dir spec/tools/templates
cargo run --manifest-path spec/tools/Cargo.toml --bin gen_rust_tests -- \
--output-dir crates/talkbank-parser-tests/tests/generated
cargo run --manifest-path spec/tools/Cargo.toml --bin gen_validation_corpus -- \
--corpus-dir crates/talkbank-parser-tests/tests/error_corpus/validation_errors
This regenerates tree-sitter corpus tests and other generated outputs that still depend on the spec pipeline.
Do this when the grammar change actually affects generated artifacts.
7. Update node_types.rs
If new node types were added to the grammar, the generated node_types.rs in talkbank-parser needs updating. The spec tools handle this via node-types.json.
Critical Policy
The reference corpus at corpus/reference/ must pass parser equivalence at 100%. If a grammar change breaks even one file, revert immediately. The reference corpus is the ultimate arbiter of correctness.
Common Patterns
Adding a New Token
- Define the token in grammar.js
- Add handling in the Rust tier parser (match on the new node kind)
- Add a spec construct example
- Run the relevant generation and verification steps
For small, isolated syntax additions, the grammar workflow should stay local:
- one grammar change
- one grammar corpus example
- one full-file fixture if needed
Changing a Rule
- Modify the rule in grammar.js
- tree-sitter generate && tree-sitter test
- Update Rust parser if CST node structure changed
- Update spec examples if the expected CST changed
- Run the current local verification sweep from contributing/dev-checks.md
Spec Workflow
Status: Current Last modified: 2026-05-29 17:50 EDT
Specifications in spec/ are the source of truth for CHAT format intent, grammar
examples, and validation/error contracts.
Adding a Construct Spec
Construct specs define valid CHAT patterns with expected parse trees.
1. Create the Spec File
Create a new markdown file in the appropriate spec/constructs/ subdirectory:
spec/constructs/
├── header/ # Header-related constructs
├── main_tier/ # Main tier patterns
├── tiers/ # Dependent tier patterns
├── utterance/ # Utterance-level patterns
└── word/ # Word syntax patterns
2. Write the Spec
# my_example
Description of what this example demonstrates.
## Input
\```utterance
*CHI: hello world .
\```
## Expected CST
\```cst
(utterance
(main_tier
...))
\```
## Metadata
- **Level**: utterance
- **Category**: main_tier
The code fence label (e.g., utterance, mor_dependent_tier) selects which
template wraps the input into a full CHAT file.
3. Generate the CST
Parse your input with tree-sitter to get the actual CST, then copy it as the Expected CST (stripping positions and field names).
4. Regenerate The Affected Generated Artifacts
The predecessor monorepo wrapped this step as make test-gen. That root
wrapper is not yet ported into this repo, so follow spec/CLAUDE.md and run
only the generator command(s) relevant to the artifacts you intentionally
changed.
For isolated grammar additions, keep the change small:
- Add or adjust one grammar example.
- Add one full-file fixture if the change matters in context.
- Regenerate only the artifacts that truly changed.
Adding an Error Spec
Error specs define invalid CHAT patterns with expected error codes.
1. Create the Spec File
Error specs live in spec/errors/, named by error code. The
convention is E###_auto.md (or E###_<short-slug>.md); for example
spec/errors/E301_auto.md covers “Empty speaker code”.
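The naming convention can be checked mechanically. The sketch below is a hypothetical helper (error_code_from_spec_name is not part of spec/tools) and assumes that ### always means exactly three ASCII digits, which matches the E301 example but is not stated as a rule in this guide.

```rust
/// Returns the error code (for example "E301") when `name` follows the
/// `E###_<slug>.md` convention. Assumption: `###` is exactly three digits.
/// Hypothetical helper for illustration only.
fn error_code_from_spec_name(name: &str) -> Option<&str> {
    let stem = name.strip_suffix(".md")?;
    let (code, slug) = stem.split_once('_')?;
    let digits = code.strip_prefix('E')?;
    let well_formed = digits.len() == 3
        && digits.bytes().all(|b| b.is_ascii_digit())
        && !slug.is_empty();
    well_formed.then_some(code)
}

fn main() {
    // Both the `_auto` form and a short-slug form are accepted.
    assert_eq!(error_code_from_spec_name("E301_auto.md"), Some("E301"));
    assert_eq!(error_code_from_spec_name("E301_empty_speaker.md"), Some("E301"));
    // Too few digits, or no slug at all, is rejected.
    assert_eq!(error_code_from_spec_name("E30_auto.md"), None);
    assert_eq!(error_code_from_spec_name("E301.md"), None);
}
```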
2. Write the Spec
The actual on-disk format (per spec/errors/E301_auto.md) uses
bolded metadata keys; there is no Name field and severity is
implicit in the error-code numbering:
# E301: Empty speaker code
## Description
Empty speaker code
## Metadata
- **Error Code**: E301
- **Category**: Main tier validation
- **Level**: utterance
- **Layer**: parser
## Example 1
**Source**: `E3xx_main_tier_errors/E301_empty_speaker.cha`
**Trigger**: Main tier with * but no speaker code
**Expected Error Codes**: E301
\```chat
@UTF8
@Begin
@Languages: eng
@Participants: CHI Child
...
\```
Key Metadata Fields
- Layer: parser: the error is caught during parser.parse_chat_file() (file fails to parse)
- Layer: validation: the error is caught by validate_with_alignment() after successful parse
- Status: not_implemented: generates #[ignore] tests (validation logic not yet coded)
3. Regenerate The Affected Artifacts
Regenerate the affected artifacts with the current spec-tool commands from
spec/CLAUDE.md, then run the concrete verification commands from
Setup / Developer Verification Checks.
Updating the Symbol Registry
The symbol registry at spec/symbols/symbol_registry.json defines character sets used by the grammar and Rust crates.
flowchart TD
registry["Edit spec/symbols/\nsymbol_registry.json"]
validate["validate_symbol_registry.js\n(structure check)"]
gen_grammar["Generate grammar symbols\n(for tree-sitter)"]
gen_rust["generate_rust_symbol_sets.js\n→ talkbank-model/src/generated/symbol_sets.rs\n→ spec/tools/src/generated/symbol_sets.rs"]
fmt["rustfmt\n(format generated code)"]
verify["Run current symbol generators\nthen local verification sweep"]
registry --> validate --> gen_grammar & gen_rust
gen_rust --> fmt --> verify
gen_grammar --> verify
After editing, run the current symbol-generation commands from spec/CLAUDE.md,
then regenerate any dependent grammar/tests/docs outputs if the symbol change
affects them.
Common Mistakes
- Editing generated files: never edit grammar/test/corpus/ or crates/talkbank-parser-tests/tests/generated/ by hand
- Regenerating reflexively: use regeneration when generated artifacts changed, not as a substitute for thinking about what kind of test authority the change really needs
- Wrong layer: parser-layer specs expect parse failure; validation-layer specs expect parse success + error report
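The "wrong layer" mistake is easier to spot once the two expectations are written as a predicate. This is a minimal sketch: Layer and spec_is_satisfied are hypothetical names, and the generated tests in this repo may check more than this.

```rust
/// Which layer an error spec declares (hypothetical mirror of the
/// `Layer` metadata field).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Layer {
    Parser,
    Validation,
}

/// A parser-layer spec expects the parse to fail with the code;
/// a validation-layer spec expects the parse to succeed and the
/// code to appear in the validation report.
fn spec_is_satisfied(
    layer: Layer,
    expected_code: &str,
    parsed: bool,
    reported_codes: &[&str],
) -> bool {
    let reported = reported_codes.iter().any(|code| *code == expected_code);
    match layer {
        Layer::Parser => !parsed && reported,
        Layer::Validation => parsed && reported,
    }
}

fn main() {
    // E301 is documented above as a parser-layer error.
    assert!(spec_is_satisfied(Layer::Parser, "E301", false, &["E301"]));
    // Declaring the same spec as validation-layer would make it fail,
    // because the file never parsed.
    assert!(!spec_is_satisfied(Layer::Validation, "E301", false, &["E301"]));
}
```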
Testing
Status: Current Last modified: 2026-06-15 15:00 EDT
Test Generation Pipeline
Specs are the source of truth. All grammar corpus tests, Rust parser tests,
and error docs are generated from specs. This repo does not currently
have the old monorepo-wide make test-gen wrapper; run the relevant
spec/tools binaries directly instead, and never hand-edit generated files.
flowchart LR
subgraph sources["Source of Truth"]
constructs["spec/constructs/\n(construct specs, see directory listing)"]
errors["spec/errors/\n(error specs, see directory listing)"]
templates["spec/tools/templates/\n(Tera wrappers)"]
end
subgraph generators["spec/tools generators\n(run only what changed)"]
gen_ts["gen_tree_sitter_tests"]
gen_rust["gen_rust_tests"]
gen_validation["gen_validation_corpus"]
gen_docs["gen_error_docs"]
end
subgraph outputs["Generated Outputs (DO NOT EDIT)"]
ts_tests["grammar/test/corpus/\n(tree-sitter tests)"]
rust_tests["crates/talkbank-parser-tests/tests/generated/\n(Rust tests)"]
val_corpus["crates/talkbank-parser-tests/tests/error_corpus/validation_errors/\n(.cha fixtures + manifest.json)"]
error_docs["docs/errors/\n(local generated error pages)"]
end
constructs & errors --> gen_ts
templates --> gen_ts
constructs & errors --> gen_rust
errors --> gen_validation
errors --> gen_docs
gen_ts --> ts_tests
gen_rust --> rust_tests
gen_validation --> val_corpus
gen_docs --> error_docs
To add a grammar test or error test, add a spec file in spec/constructs/
or spec/errors/, then run the current generator command(s) from
Spec Workflow. Use only the binaries that match the
artifacts you intentionally changed.
Test Strategy
Testing is organized in layers, from fastest to most comprehensive.
flowchart TD
unit["Unit + Integration Tests\n(cargo nextest run)"]
specgen["Spec-Generated Tests\n(spec/tools generators)\nParser + validation layer"]
grammar["Grammar Corpus\n(tree-sitter test)"]
ref["Reference Corpus\n(corpus/reference/, 100% required)"]
gates["Local verification sweep + CI\n(dev-checks.md / quality-gates.md)"]
unit --> specgen --> grammar --> ref --> gates
Never-Regress Gates
Four gates form the regression contract for the CHAT core. They guard the
behavior a successor cannot easily re-derive: parser correctness, lossless
serialization, full-corpus coverage, and error detection. Any commit
touching the relevant surface (grammar, parser, model, validation,
serialization, or alignment) MUST run the matching gate(s) and keep them
green. A red gate is a bug until proven otherwise (see the repo CLAUDE.md,
“Test Failures Are Bugs Until Proven Otherwise”), never a test expectation
to quietly update.
| Gate | Command | What it protects |
|---|---|---|
| Parser equivalence | cargo nextest run -p talkbank-parser-tests -E 'test(parser_equivalence)' | The re2c oracle parser and the tree-sitter parser agree on every reference file. A divergence means one parser is wrong, or a construct spec is missing. |
| Roundtrip idempotency | cargo nextest run -p talkbank-parser-tests --test roundtrip_reference_corpus | parse, serialize, re-parse yields a semantically identical AST (SemanticEq) for every reference file. Catches any model or WriteChat change that silently loses information. |
| Reference corpus 100% | (the same roundtrip_reference_corpus test) | Every file under corpus/reference/ parses and roundtrips with zero failures. The reference corpus is the ultimate arbiter of full-file correctness; it must be 100%, never “mostly”. |
| Error-code spec tests | cargo nextest run -p talkbank-parser-tests --test generated_tests --test validation_error_corpus --test error_coverage | Every error spec under spec/errors/ still fires its expected code: parser-layer errors reject as designed, validation-layer errors are detected, and every ErrorCode has a backing spec. These tests are generated from specs, never hand-written. |
Two of the four share one test: roundtrip_reference_corpus enforces both
roundtrip idempotency and the reference-corpus 100% guarantee, because it
iterates every reference file (the coverage guarantee) and checks roundtrip
semantic equality on each (the idempotency guarantee).
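To make the idempotency guarantee concrete, here is a toy roundtrip check. Everything in it is an assumption for illustration: ToyUtterance, parse, and serialize are stand-ins written for this example, derived PartialEq stands in for SemanticEq, and none of it is the talkbank model or its WriteChat implementation.

```rust
/// Toy stand-in for a parsed main tier; the real model is far richer.
#[derive(Debug, Clone, PartialEq, Eq)]
struct ToyUtterance {
    speaker: String,
    words: Vec<String>,
    terminator: char,
}

/// Parse a line shaped like "*CHI:\thello world ." (toy grammar only).
fn parse(line: &str) -> Option<ToyUtterance> {
    let (speaker, body) = line.strip_prefix('*')?.split_once(':')?;
    if speaker.is_empty() {
        return None;
    }
    let mut tokens: Vec<&str> = body.split_whitespace().collect();
    // The final token must be a single terminator character.
    let terminator = match tokens.pop()? {
        "." => '.',
        "?" => '?',
        "!" => '!',
        _ => return None,
    };
    Some(ToyUtterance {
        speaker: speaker.to_string(),
        words: tokens.into_iter().map(String::from).collect(),
        terminator,
    })
}

/// Serialize in one canonical form: tab after the colon, single spaces.
fn serialize(utterance: &ToyUtterance) -> String {
    let mut out = format!("*{}:\t", utterance.speaker);
    for word in &utterance.words {
        out.push_str(word);
        out.push(' ');
    }
    out.push(utterance.terminator);
    out
}

/// parse, serialize, re-parse must yield an equal model.
fn roundtrips(line: &str) -> bool {
    match parse(line) {
        Some(first) => parse(&serialize(&first)).as_ref() == Some(&first),
        None => false,
    }
}

fn main() {
    // Irregular spacing is normalized on output, yet the model is unchanged.
    let messy = "*CHI:   hello   world .";
    let first = parse(messy).expect("toy input should parse");
    assert_eq!(serialize(&first), "*CHI:\thello world .");
    assert!(roundtrips(messy));
}
```

The check compares models rather than text, so canonicalizing whitespace does not count as a failure, while dropping a word or changing the terminator would.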
All four also run as part of the full workspace sweep
(cargo nextest run --workspace), so a complete local run before committing
covers them. The per-gate commands above are the fast, targeted way to
re-check one surface during the inner development loop. The sections below
describe each layer in more detail.
Unit Tests (nextest)
cargo nextest run
Runs all unit and integration tests across all crates (~2300+ tests). These test individual functions, serialization roundtrips, and model invariants.
cargo nextest does not run doctests. Keep cargo test --doc as a separate
verification step when you change public API examples or doc comments.
Parser Equivalence
cargo nextest run -p talkbank-parser-tests -E 'test(parser_equivalence)'
Runs the parser on each file in the corpus/reference/ tree and validates
results. Each .cha file is its own test, enabling per-file parallelism and
failure isolation via nextest. The exact file count is whatever
find corpus/reference -name '*.cha' -type f | wc -l reports; do not
hard-code it here.
Spec-Generated Tests
Part of talkbank-parser-tests. These are generated from specs via the
current spec/tools binaries and currently test:
- Construct specs: input parses correctly
- Parser-layer error specs: input fails to parse with expected error code
- Validation-layer error specs: input parses but validation reports expected error code
Common entrypoints from the repository root:
cargo run --manifest-path spec/tools/Cargo.toml --bin gen_tree_sitter_tests -- \
--output-dir grammar/test/corpus \
--template-dir spec/tools/templates
cargo run --manifest-path spec/tools/Cargo.toml --bin gen_rust_tests -- \
--output-dir crates/talkbank-parser-tests/tests/generated
cargo run --manifest-path spec/tools/Cargo.toml --bin gen_validation_corpus -- \
--corpus-dir crates/talkbank-parser-tests/tests/error_corpus/validation_errors
Tree-Sitter Grammar Tests
cd grammar && tree-sitter test
Runs the tree-sitter grammar corpus tests. This is the right gate for grammar structure changes.
Error Corpus Tests
Error fixtures live in spec/errors/. Parser-layer error examples become Rust
tests via gen_rust_tests; validation-layer examples become a .cha fixture
corpus + manifest.json via gen_validation_corpus, under
crates/talkbank-parser-tests/tests/error_corpus/validation_errors/, which the
data-driven runner validation_error_corpus.rs consumes. Add a new error spec
under spec/errors/E###_*.md and regenerate.
Tree-Sitter Tests
cd grammar
tree-sitter test
Verifies the grammar produces correct CSTs for known inputs. The
actual test count comes from ls grammar/test/corpus/*.txt | wc -l;
do not hard-code it.
Reference Corpus
The reference corpus at corpus/reference/ is organized into subdirs
(annotation/, audio/, ca/, content/, core/, edge-cases/,
languages/, tiers/, word-features/). The parser must handle
every file at 100%; the exact file count is whatever
find corpus/reference -name '*.cha' -type f | wc -l reports.
This corpus is the ultimate arbiter of correctness for full-file parsing.
Local Verification Contract
There is no repo-local make verify wrapper in this checkout today.
Use the explicit command set from
Developer Verification Checks and
Testing and Quality Gates instead.
Core local sweep:
cargo fmt --all -- --check
cargo build --workspace --all-targets --locked
cargo nextest run --workspace
cargo test --doc
Then add the surface-specific checks that match your change:
- grammar changes: cd grammar && tree-sitter generate && tree-sitter test
- spec-tool changes: cargo build --manifest-path spec/tools/Cargo.toml and cargo build --manifest-path spec/runtime-tools/Cargo.toml
- parser / model / alignment / serialization changes: cargo nextest run -p talkbank-parser-tests -E 'test(parser_equivalence)' and cargo nextest run -p talkbank-parser-tests --test roundtrip_reference_corpus
Running Specific Tests
# Single test by name
cargo nextest run test_name
# Tests in a specific crate
cargo nextest run -p talkbank-model
# Tests matching a pattern
cargo nextest run -- mor
# With output
cargo nextest run --no-capture
What to Run When
| What you changed | Run |
|---|---|
| Grammar (grammar.js) | cd grammar && tree-sitter generate && tree-sitter test, then the relevant parser/spec-generator commands |
| Parser (CST-to-model) | cargo nextest run -p talkbank-parser |
| Model (types, validation, alignment) | cargo nextest run -p talkbank-model |
| CLI (chatter args, dispatch) | cargo nextest run -p chatter |
| LSP | cargo nextest run -p talkbank-lsp |
| Spec files | Run the relevant gen_* commands from spec/tools, then the local verification sweep from dev-checks.md |
| Pre-merge (any change) | The local verification sweep from dev-checks.md plus surface-specific additions |
| Pre-push (quick) | Re-run the narrowest commands that cover the surfaces you touched; there is no repo-local make ci-local wrapper |
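The per-surface rows of the table can be read as a lookup from "what changed" to "first command to run". The sketch below encodes only rows whose command is stated in full above; Surface, first_check, and checks_for are hypothetical names, and no such helper is claimed to exist in this repo.

```rust
/// Surfaces from the table that map to one concrete command.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Surface {
    Grammar,
    Parser,
    Model,
    Cli,
    Lsp,
}

/// First targeted command for a surface, copied from the table.
fn first_check(surface: Surface) -> &'static str {
    match surface {
        Surface::Grammar => "cd grammar && tree-sitter generate && tree-sitter test",
        Surface::Parser => "cargo nextest run -p talkbank-parser",
        Surface::Model => "cargo nextest run -p talkbank-model",
        Surface::Cli => "cargo nextest run -p chatter",
        Surface::Lsp => "cargo nextest run -p talkbank-lsp",
    }
}

/// Commands for a set of touched surfaces, in order, without repeats.
fn checks_for(changed: &[Surface]) -> Vec<&'static str> {
    let mut commands = Vec::new();
    for surface in changed {
        let command = first_check(*surface);
        if !commands.contains(&command) {
            commands.push(command);
        }
    }
    commands
}

fn main() {
    for command in checks_for(&[Surface::Parser, Surface::Model, Surface::Parser]) {
        println!("{command}");
    }
}
```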
Mutation Testing
Use cargo-mutants to find code that can be changed without any test failing; these are the true coverage gaps.
# Install (once)
cargo install cargo-mutants
# Run against a specific crate (--jobs 1 to avoid OOM on 64 GB machines)
cargo mutants -p talkbank-parser --timeout 120 --jobs 1
# Review results
cat mutants.out/missed.txt # Mutations no test caught
cat mutants.out/caught.txt # Mutations properly detected
Mutation testing is not part of CI but should be run periodically (after major changes) to find untested logic paths. Results guide where to add new tests.
Configuration: mutants.toml at the repo root excludes trivial functions.
Adding Tests
- Model tests: add to the relevant crate’s tests/ directory or #[cfg(test)] module
- Parser tests: if the change is about grammar shape or validation contracts, add or update specs and regenerate with the relevant spec/tools generator binaries
- Error tests: add a new spec under spec/errors/E###_*.md and run gen_rust_tests (parser-layer) or gen_validation_corpus (validation-layer); the generated Rust tests / fixture corpus + manifest are produced automatically
Coding Standards
Status: Current Last updated: 2026-03-24 00:01 EDT
Rust Conventions
- Edition: 2024
- Formatting: cargo fmt before every commit
- Linting: cargo clippy --all-targets -- -D warnings must pass with zero warnings
- No clippy silencing without explicit approval
Error Handling
- No panics for recoverable conditions; use thiserror/miette for error types
- Library code uses the ErrorSink trait for error reporting, not Result
- Use ParseOutcome<T> in parser code (parsed or rejected)
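The sink-plus-outcome style can be sketched as follows. The shapes here are assumptions made for illustration: the real ErrorSink trait and ParseOutcome<T> type in the talkbank crates may have different methods, variants, and diagnostic fields, and check_speaker_code is a hypothetical function. Only the E301 code and its "Empty speaker code" text come from this guide.

```rust
/// Hypothetical diagnostic record; the real type carries more fields.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Diagnostic {
    code: &'static str,
    message: String,
}

/// Hypothetical sink trait: the caller decides how diagnostics are collected.
trait ErrorSink {
    fn report(&mut self, diagnostic: Diagnostic);
}

/// Simplest possible sink: collect everything into a vector.
impl ErrorSink for Vec<Diagnostic> {
    fn report(&mut self, diagnostic: Diagnostic) {
        self.push(diagnostic);
    }
}

/// Explicit outcome: a value was parsed, or the input was rejected.
#[derive(Debug, PartialEq, Eq)]
enum ParseOutcome<T> {
    Parsed(T),
    Rejected,
}

/// Rejects an empty speaker code and reports why, without panicking
/// and without inventing a placeholder speaker.
fn check_speaker_code(code: &str, sink: &mut dyn ErrorSink) -> ParseOutcome<String> {
    if code.is_empty() {
        sink.report(Diagnostic {
            code: "E301",
            message: "Empty speaker code".to_string(),
        });
        return ParseOutcome::Rejected;
    }
    ParseOutcome::Parsed(code.to_string())
}

fn main() {
    let mut sink: Vec<Diagnostic> = Vec::new();
    assert_eq!(
        check_speaker_code("CHI", &mut sink),
        ParseOutcome::Parsed("CHI".to_string())
    );
    assert_eq!(check_speaker_code("", &mut sink), ParseOutcome::Rejected);
    println!("{}: {}", sink[0].code, sink[0].message);
}
```

Keeping the rejection in the return type and the explanation in the sink means a caller cannot mistake "rejected" for "parsed an empty value".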
Logging
- Library crates use tracing (never println! or eprintln!)
- CLI binaries write to stdout (results) and stderr (diagnostics)
- Use appropriate log levels: error!, warn!, info!, debug!, trace!
Naming
- Follow standard Rust conventions (snake_case for functions, CamelCase for types)
- Conventional Commits for commit messages: <type>[scope]: <description>
  - Types: feat, fix, refactor, test, docs, chore
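A commit subject can be checked against this shape with a few string operations. This is a sketch under stated assumptions: is_conventional is a hypothetical helper, the optional scope is assumed to be written in parentheses as in the public Conventional Commits convention (the guide only abbreviates it as [scope]), and the accepted types are exactly the six listed above.

```rust
/// Commit types listed in this guide.
const TYPES: [&str; 6] = ["feat", "fix", "refactor", "test", "docs", "chore"];

/// Returns true for subjects shaped like `type: description` or
/// `type(scope): description`. Assumption: scope uses parentheses.
fn is_conventional(subject: &str) -> bool {
    let Some((head, description)) = subject.split_once(": ") else {
        return false;
    };
    if description.trim().is_empty() {
        return false;
    }
    // Split an optional "(scope)" off the type.
    let commit_type = match head.split_once('(') {
        Some((commit_type, rest)) => match rest.strip_suffix(')') {
            Some(scope) if !scope.is_empty() => commit_type,
            _ => return false,
        },
        None => head,
    };
    TYPES.iter().any(|known| *known == commit_type)
}

fn main() {
    assert!(is_conventional("fix: handle empty speaker code"));
    assert!(is_conventional("feat(parser): add token"));
    assert!(!is_conventional("update stuff"));
}
```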
Dependencies
Preferred crates:
- clap: CLI argument parsing
- serde: serialization
- miette: user-facing diagnostics
- insta: snapshot testing
- tracing: structured logging
- rayon/crossbeam: concurrency
- smallvec: small-buffer optimization
Code Organization
- Keep crate boundaries clean; lower crates should not depend on higher ones
- The model crate should not depend on any parser
- Parsing code should not depend on serialization/transform code
- All CHAT parsing and serialization goes through the AST, never ad-hoc string manipulation
- Treat 10 or more named struct fields as an audit trigger. Wide boundary or
report records can be acceptable, but wide runtime state bags need explicit
review. See
architecture/chat-model/wide-structs.md.
Testing
- Prefer spec-driven tests over hand-written tests for parser behavior
- Use cargo nextest run for unit tests (except doctests)
- Snapshot tests with insta for complex output comparisons
Generated Files
Never hand-edit generated artifacts:
- parser.c: generated from grammar.js
- grammar/test/corpus/: generated from specs
- crates/talkbank-parser-tests/tests/generated/: generated from specs
- crates/talkbank-model/src/generated/symbol_sets.rs: generated from symbol registry
Always regenerate from source inputs.
Coding Standards and Engineering Practices
Status: Current Last updated: 2026-05-21 08:38 EDT
Objective
Set enforceable, language-specific standards that reduce ambiguity and improve long-term maintainability.
Global Standards
- Prefer explicit domain types over ad-hoc strings.
- Keep parsing, validation, and rendering logic separated.
- Eliminate magic numbers/strings/paths via named constants and config.
- Treat generated code as immutable artifacts.
- Require tests for every bugfix and behavior change.
Rust Standards
- Enforce formatter and clippy in CI.
- Minimize #[allow(clippy::...)]; each allowance needs rationale.
- Prefer small focused modules with clear ownership.
- Public APIs require doc comments with examples and error behavior.
- In parser code, disallow ErrorSink + Option<T> signatures for fallible parse operations.
  - Use explicit outcome enums or Result with structured diagnostics.
  - Guardrail script: scripts/check-errorsink-option-signatures.sh.
- For model enums that encode validation state, require ValidationTagged derive.
  - Explicit annotation: #[validation_tag(error|warning|clean)].
  - Naming-convention fallback (per crates/talkbank-derive/src/validation_tagged.rs:118-123): variants ending in Error → Error; variants ending in Warning or Unsupported, plus a variant named exactly Unsupported, → Warning; otherwise → Clean.
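The naming-convention fallback for ValidationTagged can be restated as a plain function over the variant name. This is only an illustration of the rule as worded above: the real logic lives in a derive macro, and Tag, fallback_tag, and the example variant names are hypothetical.

```rust
/// The three validation tags named in the rule.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tag {
    Error,
    Warning,
    Clean,
}

/// Fallback used when a variant has no explicit `#[validation_tag(...)]`.
/// A variant named exactly `Unsupported` also ends in `Unsupported`,
/// so the suffix test covers both cases from the rule.
fn fallback_tag(variant_name: &str) -> Tag {
    if variant_name.ends_with("Error") {
        Tag::Error
    } else if variant_name.ends_with("Warning") || variant_name.ends_with("Unsupported") {
        Tag::Warning
    } else {
        Tag::Clean
    }
}

fn main() {
    // Variant names below are made up for the example.
    assert_eq!(fallback_tag("ParseError"), Tag::Error);
    assert_eq!(fallback_tag("LegacyWarning"), Tag::Warning);
    assert_eq!(fallback_tag("Unsupported"), Tag::Warning);
    assert_eq!(fallback_tag("Valid"), Tag::Clean);
}
```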
Grammar Standards
- Grammar rules must map to documented token/category semantics.
- No duplicated symbol sets in free-form literals.
- Every non-obvious precedence/conflict decision must include rationale.
Spec and Generator Standards
- Spec files must follow strict metadata template.
- Generators must be deterministic and pure with respect to inputs.
- No hardcoded user-specific paths in docs or generated outputs.
Magic Value Policy
Disallowed
- Inline path literals tied to local machines.
- Unnamed numeric constants encoding protocol behavior.
- Repeated header/tier string literals across modules.
Required
- Central constants/modules:
- path defaults,
- tier/header prefixes,
- token categories,
- formatting policies.
Review and PR Standards
- PR template must include:
- subsystem touched,
- contract impact,
- generated artifact impact,
- tests added/updated,
- docs updated.
- Require at least one reviewer with subsystem ownership for core modules.
Internal Decision Records
Adopt short ADR format in the book’s architecture section:
- context,
- decision,
- alternatives considered,
- consequences,
- rollback path.
Acceptance Criteria
- Coding standards are documented once and enforced automatically.
- Magic values are systematically reduced and tracked.
- Every behavior change includes tests and doc impact assessment.
- Architecture decisions are recorded and discoverable.
CI and Release
Status: Current Last updated: 2026-06-21 21:33 EDT
Pre-Merge Verification
Use the concrete local verification commands from Setup and Developer Verification Checks:
cargo fmt --all -- --check
cargo build --workspace --all-targets --locked
cargo nextest run --workspace
cargo test --doc
Then rely on GitHub Actions CI as the authoritative shared signal before you announce a change as ready.
Generated artifact drift
Generated artifacts are still important, but the old root wrappers from the predecessor workspace are not yet ported into this repo. In practice:
- regenerate only the affected spec/symbol outputs,
- do not hand-edit generated artifacts,
- and run the surface-specific verification commands that match the change.
See Spec Workflow and spec/CLAUDE.md for the current
source-of-truth guidance.
Release Process
TalkBank/chatter is the public release source of truth: release.yml
(cargo-dist) publishes the signed GitHub Releases for the CLI and the desktop
app.
Workflows that actually exist in this repo
| Workflow | Purpose | Notes |
|---|---|---|
| .github/workflows/ci.yml | Main build/test/book CI | Primary shared signal on pushes and PRs |
| .github/workflows/cross-platform.yml | Cross-platform build coverage | Supplements the main CI workflow |
| .github/workflows/crates-io-foundation.yml | First-wave crates.io readiness | Checks foundation-crate metadata, package surfaces, hold-backs, and publish order |
| .github/workflows/release.yml | cargo-dist release automation | Builds dist-enabled workspace artifacts from version tags; owns the GitHub Release |
| .github/workflows/release-desktop.yml | Desktop installer release automation | Builds chatter-desktop installers on the same version tags and uploads them into the release that release.yml creates; workflow_dispatch runs build-only |
| .github/workflows/clippy-rolling.yml | New-stable clippy drift detection | Weekly maintenance workflow |
Current release stance
- release.yml is about workspace artifact packaging via cargo-dist, not about crates.io publication.
- The first-wave crates.io path is documented separately in Crates.io Publication and is checked by just crates-io-foundation-check plus .github/workflows/crates-io-foundation.yml.
Desktop release workflow: how the two tag workflows compose
On a version tag, release.yml (cargo-dist) and release-desktop.yml run
in parallel. cargo-dist owns creating the GitHub Release and attaching the
CLI archives, checksums, and installer scripts; release-desktop.yml builds
the Tauri installers, then polls until the release exists and uploads its
installers into it. Two platform notes baked into the workflow:
- macOS: Tauri signs, notarizes, and staples the .app, but NOT the .dmg it wraps around it. The workflow therefore submits the .dmg itself to the notary service and staples it, then verifies codesign, spctl, and stapler validate on both artifacts. The signing identity is supplied via environment, never hardcoded in tauri.conf.json.
- Windows / Linux: artifacts are currently unsigned by decision; see docs/strategy/distribution-and-signing.md (“Decisions, 2026-06-12”) and the SmartScreen guidance in the install docs.
Release secrets (Actions secrets on this repository)
Required by the macOS jobs of release-desktop.yml (and by cargo-dist
macOS codesigning if macos-sign is enabled, which uses the separate
CODESIGN_* names documented in the strategy doc):
| Secret | Content |
|---|---|
| APPLE_CERTIFICATE | base64-encoded Developer ID Application .p12 |
| APPLE_CERTIFICATE_PASSWORD | password for the .p12 |
| APPLE_SIGNING_IDENTITY | full identity string, Developer ID Application: <Name> (<TEAMID>) |
| APPLE_API_KEY | App Store Connect API key ID (notarization) |
| APPLE_API_ISSUER | App Store Connect issuer ID |
| APPLE_API_KEY_CONTENT | contents of the AuthKey_*.p8 file |
Rotation: replacing the certificate or notary key means updating these secrets and nothing else; no workflow edits are needed. A maintainer must re-create all of them on any new repository (secrets do not transfer).
Crates.io Publication
Status: Current Last updated: 2026-06-21 21:33 EDT
Scope
The crates.io automation in this repo currently targets the Wave 1A foundation crates only. crates.io publication is a deliberate maintainer action, not a tag-triggered release path.
Wave 1A is:
- tree-sitter-talkbank
- talkbank-derive
- talkbank-model
- talkbank-cache
- talkbank-parser
- talkbank-parser-re2c
- talkbank-transform
talkbank-parser-re2c is part of the first wave because
talkbank-transform has a runtime dependency on it. Holding it back would
make talkbank-transform unpublishable.
The current Wave 1B hold-backs are explicitly marked publish = false:
- send2clan
- chatter
- talkbank-lsp
They stay blocked until their support contract, install story, and user-facing docs are ready.
What the repo now automates
Two repo-native entry points cover the first-wave foundations:
| Surface | Purpose |
|---|---|
| just crates-io-foundation-check | Local preflight for first-wave crates.io readiness |
| .github/workflows/crates-io-foundation.yml | CI enforcement for first-wave metadata, package surfaces, hold-backs, and publish order |
The readiness check enforces:
- required crates.io metadata (repository, homepage, keywords, categories, readme)
- readme-file existence
- package assembly for every first-wave crate via cargo package --list
- the first-wave runtime dependency graph
- publish = false guards on Wave 1B crates
- a real cargo publish --dry-run for the standalone tree-sitter-talkbank crate
Important limitation: Cargo cannot fully dry-run the bootstrap wave
For the first publication of an interdependent workspace, cargo publish --dry-run is not a complete CI gate for every crate. Cargo rewrites path
dependencies to registry dependencies while preparing the package. That means a
crate such as talkbank-model cannot complete a registry-style dry-run until
its prerequisite talkbank-derive already exists on crates.io.
So the current automation is intentionally honest:
- tree-sitter-talkbank gets a real crates.io dry-run because it stands alone.
- The remaining Wave 1A crates are validated by metadata, readme, and dependency checks before publication. (No MSRV is declared yet; set a deliberate rust-version and re-add an MSRV check when publication is actually pursued.)
- As each prerequisite crate lands on crates.io, rerun targeted cargo publish --dry-run -p <crate> checks for the later crates before publishing them.
This is a real limitation of the initial bootstrap wave, not a missing script. If we later want full registry-resolution rehearsal before publication, that requires a staging registry/local index strategy, not just another shell loop.
Publication procedure
Before publishing anything:
- Verify crates.io name availability for every Wave 1A package.
- Run just crates-io-foundation-check.
- Ensure .github/workflows/crates-io-foundation.yml and the main CI workflow are green on the commit you intend to publish.
- Publish in this exact order, waiting for the crates.io index to observe each crate before moving to the next:
  1. tree-sitter-talkbank
  2. talkbank-derive
  3. talkbank-model
  4. talkbank-cache
  5. talkbank-parser
  6. talkbank-parser-re2c
  7. talkbank-transform
- After each prerequisite becomes visible on crates.io, rerun any newly-unblocked cargo publish --dry-run -p <crate> checks before the next publish step.
Example command shape:
cargo publish -p tree-sitter-talkbank --locked
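The ordering requirement is that every prerequisite is published before the crate that depends on it. The sketch below checks an order against only the two dependency relationships this page states explicitly (talkbank-model needs talkbank-derive; talkbank-transform needs talkbank-parser-re2c). It is not the full dependency graph, and first_violation is a hypothetical helper rather than part of just crates-io-foundation-check.

```rust
/// Publication order from the procedure above.
const PUBLISH_ORDER: [&str; 7] = [
    "tree-sitter-talkbank",
    "talkbank-derive",
    "talkbank-model",
    "talkbank-cache",
    "talkbank-parser",
    "talkbank-parser-re2c",
    "talkbank-transform",
];

/// (dependent, prerequisite) pairs stated in the text; the full graph
/// is deliberately not reproduced here.
const KNOWN_EDGES: [(&str, &str); 2] = [
    ("talkbank-model", "talkbank-derive"),
    ("talkbank-transform", "talkbank-parser-re2c"),
];

fn position(order: &[&str], name: &str) -> Option<usize> {
    order.iter().position(|candidate| *candidate == name)
}

/// Returns the first edge whose prerequisite is missing from the order
/// or is not published strictly before its dependent.
fn first_violation<'a>(
    order: &[&str],
    edges: &'a [(&'a str, &'a str)],
) -> Option<(&'a str, &'a str)> {
    edges.iter().copied().find(|&(dependent, prerequisite)| {
        match (position(order, dependent), position(order, prerequisite)) {
            (Some(d), Some(p)) => p >= d,
            _ => true,
        }
    })
}

fn main() {
    assert_eq!(first_violation(&PUBLISH_ORDER, &KNOWN_EDGES), None);
    println!("publish order respects the stated prerequisites");
}
```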
Tagging policy
Do not use version tags to drive crates.io publication from this repo.
.github/workflows/release.yml is reserved for cargo-dist GitHub Releases of
dist-enabled artifacts. Crates.io publication remains a deliberate manual
maintainer flow.
Testing and Quality Gates
Status: Current Last modified: 2026-06-21 21:33 EDT
This page summarizes the current relationship between local verification and the repository CI workflows.
Local pre-merge contract
There is no repo-local make verify wrapper in this checkout today. The local
contract is the command set documented in Setup and
Developer Verification Checks:
cargo fmt --all -- --check
cargo build --workspace --all-targets --locked
cargo nextest run --workspace
cargo test --doc
plus grammar/spec/parser-specific checks when you touch those surfaces.
Never-regress gates
Beyond the formatting/build/test sweep above, the CHAT core has four never-regress gates that must stay green for any change touching the grammar, parser, model, validation, serialization, or alignment: parser equivalence, roundtrip idempotency, reference-corpus 100%, and the error-code spec tests. Each has a fast, targeted command. They are defined, with the exact command and what each protects, under Testing, Never-Regress Gates. A red gate is a bug until proven otherwise, never a test expectation to quietly update.
Root CI contract
The main CI workflow (.github/workflows/ci.yml) is the authoritative shared
signal for this repo. Today it covers:
- Rust build, test, and clippy
- mdBook build
Additional workflows cover cross-platform build coverage and rolling-clippy drift checks.
Because the old local wrapper pipeline has not been ported into this repo,
historical references to numbered gates such as G0-G14 should be treated as
legacy labels from the predecessor workspace, not as the current command
surface here.
Additional CI-only checks
These are required CI signals or workflow checks that are not identical to the local command set:
- cross-platform release/build coverage
- weekly rolling-clippy drift checks
- workflow-specific smoke tests attached to release automation
Documentation Architecture
Status: Current Last modified: 2026-06-15 15:00 EDT
Principle: Centralized Book + Subsystem Satellites
User-facing and contributor-facing prose lives in mdBook
(book/). The repo-level docs/ directory holds operator-facing
material (release contract, versioning, code-signing, platform
support, validation feature flags). Maintainers can also generate a
local error-reference tree under docs/errors/ while working on
diagnostics, but that output is not the canonical checked-in docs
surface. Subsystem-specific working docs stay in place
only when tightly coupled to files in that directory.
flowchart TD
main["book/ (the unified Chatter mdBook)\nSurfaces: chatter, chat-format, architecture, contributing\nAudiences: users, integrators, contributors"]
spec["spec/docs/\nSpec authoring guides"]
errors["docs/errors/\nOptional local generated error reference"]
api["cargo doc\nRust API docs (auto-generated)"]
main -->|"links to"| spec
main -->|"links to"| errors
main -.->|"complements"| api
Where Documentation Goes
| Content type | Location | Examples |
|---|---|---|
| User guides, CHAT format reference | book/src/chatter/user-guide/, book/src/chat-format/ | CLI usage, validation errors |
| Architecture and design | book/src/architecture/ | Parsing, data model, concurrency, memory |
| Contributor workflows | book/src/contributing/ | Grammar workflow, testing, coding standards |
| Integrator contracts | book/src/chatter/integrating/ | JSON schema, diagnostic contract |
| Technical reference and audits | book/src/ (Technical Reference section) | Parity audits, UTF-8 audit, risk register |
| Spec authoring guides | spec/docs/ | Error spec format, curation workflow |
| Generated error docs | docs/errors/ | Optional local output from gen_error_docs; source of truth stays in spec/errors/ |
| Historical/archived docs | project archive | Old audits, superseded proposals |
| AI assistant context | CLAUDE.md files (per repo/subdir) | Not documentation for humans |
Rules
- One canonical page per topic. No duplicate coverage across locations.
- No crate-level docs/ directories. Architectural explanations go in the book. Crate API docs come from /// doc comments via cargo doc.
- Satellites stay only when the audience is editing files in that directory. Spec authors need WRITING_ERROR_SPECS.md next to their specs. Everyone else reads the book.
- Generated docs are build artifacts. Never hand-edit docs/errors/. If you need that local reference set, regenerate it with gen_error_docs.
- Historical docs go to project archive. Don’t keep old audit logs, investigation notes, or superseded proposals in the public repo.
One unified book
There is one mdBook for this repo at book/,
titled “Chatter, TalkBank CHAT Toolchain”, organized by audience-first sections
under book/src/:
| Section | Audience | Content |
|---|---|---|
| book/src/chatter/ | chatter CLI users + integrators | CLI reference, library usage, JSON contracts |
| book/src/chat-format/ | All users + integrators | CHAT format reference (headers, tiers, symbols) |
| book/src/architecture/ | All devs | Cross-surface architecture, parser/grammar/data-model design |
| book/src/contributing/ | Contributors | Setup, testing, coding standards, dev checks |
One book.toml and one SUMMARY.md for the whole tree. Cross-section
links resolve as ordinary in-book paths.
CHAT Processing Playbook for Developers
Status: Current Last updated: 2026-03-23 23:49 EDT
Objective
Provide an implementation playbook for developers building or extending CHAT parsing, validation, transformation, and serialization logic.
Mental Model
Treat CHAT processing as a layered pipeline:
- Ingest bytes and normalize line boundaries.
- Parse syntax into structured model with exact spans.
- Validate semantic rules with structured diagnostics.
- Transform or enrich model without breaking invariants.
- Serialize in canonical form.
Developer Workflow
- Start from a concrete fixture or corpus case.
- Add/adjust parser behavior with contract tests first.
- Add semantic validator rules separately from parser acceptance.
- Confirm roundtrip and equivalence gates.
- Update docs for any visible behavior or policy change.
Tier Dispatch Strategy
Use cheap byte-prefix dispatch before heavy parsing:
- @ => header candidate,
- * => main tier,
- % => dependent tier,
- continuation rules and whitespace handled deterministically.
This preserves performance and isolates error contexts earlier.
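A minimal sketch of byte-prefix dispatch follows. LineKind and classify are hypothetical names, and treating a leading tab as a continuation line is an assumption made for the example; the parser's actual continuation and whitespace rules are not spelled out on this page.

```rust
/// Coarse line classification decided from the first byte only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LineKind {
    Header,
    MainTier,
    DependentTier,
    Continuation,
    Other,
}

/// Cheap routing step: no allocation, no UTF-8 decoding, one byte read.
fn classify(line: &[u8]) -> LineKind {
    match line.first() {
        Some(b'@') => LineKind::Header,
        Some(b'*') => LineKind::MainTier,
        Some(b'%') => LineKind::DependentTier,
        // Assumption: a leading tab marks a continuation of the previous tier.
        Some(b'\t') => LineKind::Continuation,
        _ => LineKind::Other,
    }
}

fn main() {
    let file = "@Begin\n*CHI:\thello world .\n%mor:\t...\n";
    for line in file.lines() {
        println!("{:?}: {:?}", classify(line.as_bytes()), line);
    }
}
```

Only after this routing step does a line reach the heavier header or tier parser, so a malformed line is reported in the context of the tier kind it claimed to be.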
For downstream batchalign3 consumers, tier dispatch is only the front door.
The important contract is what happens after dispatch: parse-health taint,
recovery vs rejection, and whether a tier is safe to pass into alignment.
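The prefix dispatch described above can be sketched as a single match on the first byte of a line. This is a minimal illustration, not the crate's actual code: the LineKind enum and classify_line function are hypothetical names, and treating a leading tab as the continuation marker is an assumption about how continuation lines are recognized.

```rust
/// Coarse line classification decided from the first byte only.
/// `LineKind` is a hypothetical type used for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LineKind {
    Header,        // starts with '@'
    MainTier,      // starts with '*'
    DependentTier, // starts with '%'
    Continuation,  // starts with a tab (assumed continuation marker)
    Blank,         // empty line
    Unknown,       // anything else; left for the caller to diagnose
}

/// Route a raw line to a tier category without invoking a full parser.
pub fn classify_line(line: &[u8]) -> LineKind {
    match line.first() {
        None => LineKind::Blank,
        Some(b'@') => LineKind::Header,
        Some(b'*') => LineKind::MainTier,
        Some(b'%') => LineKind::DependentTier,
        Some(b'\t') => LineKind::Continuation,
        Some(_) => LineKind::Unknown,
    }
}
```

Because the decision needs only one byte, a malformed line can be attributed to a tier category before any structural parsing starts, which is what keeps error contexts narrow.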
Word Parsing Rules of Thumb
- Parse suffix markers in strict order (@..., @s..., $...) with explicit precedence.
- Keep raw_text exact, cleaned_text policy-driven and test-locked.
- Treat CA delimiters and special symbols via centralized symbol sets.
- Never embed ad hoc symbol literals in multiple files.
Error Handling Contract
- Every parser failure should produce structured diagnostics with:
- code,
- severity,
- span,
- context,
- message.
- Avoid silent fallback behavior unless policy explicitly allows it.
- If fallback occurs, emit warning-grade diagnostics where relevant.
- Never fabricate semantic placeholders (empty required text, arbitrary enum default, fake word/chunk) to satisfy type construction.
- Prefer None/partial outcome + diagnostics over synthetic model values.
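A minimal sketch of this contract, assuming a diagnostic type with the five fields listed above. The Diagnostic struct, the parse_speaker helper, and the "EXXX" code are all hypothetical and only illustrate the shape: on failure the helper returns None plus a diagnostic and never invents a placeholder speaker.

```rust
/// Severity levels for diagnostics (illustrative subset).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Severity {
    Warning,
    Error,
}

/// The five fields the contract asks every parser failure to carry.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Diagnostic {
    pub code: &'static str,
    pub severity: Severity,
    pub span: (usize, usize), // absolute byte offsets, end exclusive
    pub context: String,
    pub message: String,
}

/// Hypothetical helper: extract the speaker code from a main-tier line
/// such as "*CHI:\thello .". `base` is the line's absolute file offset.
/// A missing leading '*' is tolerated here for brevity.
pub fn parse_speaker(line: &str, base: usize) -> (Option<String>, Vec<Diagnostic>) {
    let body = line.strip_prefix('*').unwrap_or(line);
    match body.find(':') {
        Some(end) if end > 0 => (Some(body[..end].to_string()), Vec::new()),
        _ => {
            let diagnostic = Diagnostic {
                code: "EXXX", // placeholder, not a real chatter error code
                severity: Severity::Error,
                span: (base, base + line.len()),
                context: line.to_string(),
                message: "main tier has no speaker code".to_string(),
            };
            // No fabricated speaker: the caller sees None plus the reason.
            (None, vec![diagnostic])
        }
    }
}
```

Returning the diagnostics next to the optional value lets a caller keep whatever parsed cleanly while still reporting every failure.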
Span Discipline
- Offsets are absolute across full file content.
- Nested parser helpers must accept base offset and return shifted spans.
- Add tests for boundary and continuation-line spans.
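The base-offset rule can be illustrated with a small helper that splits tier content into words. This is a sketch under assumptions: Span and word_spans are hypothetical names, and splitting on ASCII spaces stands in for real word tokenization.

```rust
/// Absolute byte range in the full file, end exclusive.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Span {
    pub start: usize,
    pub end: usize,
}

/// Split `text` on ASCII spaces. `base` is the absolute offset of `text`
/// within the file, so every returned span is already file-absolute.
pub fn word_spans(text: &str, base: usize) -> Vec<Span> {
    let mut spans = Vec::new();
    let mut start: Option<usize> = None;
    for (i, byte) in text.bytes().enumerate() {
        match (byte == b' ', start) {
            // First non-space byte opens a word.
            (false, None) => start = Some(i),
            // A space closes the word that is currently open.
            (true, Some(s)) => {
                spans.push(Span { start: base + s, end: base + i });
                start = None;
            }
            _ => {}
        }
    }
    // Close a word that runs to the end of the text.
    if let Some(s) = start {
        spans.push(Span { start: base + s, end: base + text.len() });
    }
    spans
}
```

A caller that has already consumed the tier prefix passes the prefix length as base, so slicing the original file with a returned span yields exactly the word that was parsed.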
Performance Policy
- Prefer byte-oriented prechecks for top-level dispatch and simple delimiters.
- Use parser combinators for structural parsing, not for obvious constant-prefix routing.
- Measure parser performance on representative corpus slices before/after major changes.
Common Failure Patterns and Fixes
- Symptom: semantic mismatch only in snapshots.
- Fix: compare parser outputs directly and isolate first structural delta.
- Symptom: generated tests pass, corpus fails.
- Fix: add missing fixture, decide parse-vs-validate placement, lock behavior.
- Symptom: output drift after grammar edit.
- Fix: run full regeneration and equivalent parser contract suite before merge.
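For the first pattern, "isolate first structural delta" can be as simple as a line-by-line comparison of two textual dumps. The sketch below assumes each parser output has already been rendered to text (for example, pretty-printed JSON); first_delta is a hypothetical helper, not part of the test suite.

```rust
/// Return the 1-based line number and both sides of the first differing
/// line, or None when the two dumps are identical line for line.
pub fn first_delta<'a>(
    left: &'a str,
    right: &'a str,
) -> Option<(usize, Option<&'a str>, Option<&'a str>)> {
    let mut left_lines = left.lines();
    let mut right_lines = right.lines();
    let mut line_no = 1;
    loop {
        match (left_lines.next(), right_lines.next()) {
            // Both dumps ended together: no difference found.
            (None, None) => return None,
            // Same line on both sides: keep scanning.
            (a, b) if a == b => line_no += 1,
            // First mismatch, including one side ending early.
            (a, b) => return Some((line_no, a, b)),
        }
    }
}
```

Reporting only the first mismatch keeps attention on the structural cause instead of the cascade of differences that usually follows it in a snapshot diff.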
Batchalign3 Surface Checks
When a change affects the surface used by batchalign3, confirm:
- full-file parse equivalence still holds for corpus coverage
- alignment-sensitive downstream tiers still gate on parse-health appropriately
Review Checklist for Parser PRs
- New or changed behavior has targeted tests.
- Equivalence suite status is attached.
- Snapshot updates are intentional and explained.
- No hidden magic symbols or magic string literals introduced.
- Docs updated where user-visible behavior changes.
Required Artifacts for Significant Changes
- Design note (architecture decision record in the book).
- Before/after examples.
- Impacted fixtures list.
- Migration implications for integrators.
GitHub Readiness and Open Source Governance
Status: Current Last modified: 2026-06-21 21:33 EDT
Objective
Prepare TalkBank/chatter to operate as a healthy public project with clear legal, security,
contribution, and release processes.
Root Artifacts
| Artifact | Status | Notes |
|---|---|---|
| LICENSE-MIT + LICENSE-APACHE | Done | Dual-licensed MIT OR Apache-2.0 (standard Rust convention; both files present at root, no combined LICENSE). Every crate inherits license = "MIT OR Apache-2.0" from [workspace.package]. |
| CONTRIBUTING.md | Done | Setup, standards, PR flow, pre-PR checklist |
| CODE_OF_CONDUCT.md | TODO (deferred) | Intentionally absent for now: it is held until a durable enforcement contact (an institutional address or successor handle, not an individual) is settled. The plan is to adopt the Contributor Covenant once that contact exists. |
| SECURITY.md | Done | Root file added; issue-template contact link now resolves to a real policy |
| CODEOWNERS | TODO | Not added yet: repo contents do not currently publish an authoritative GitHub owner/team map for path-level review ownership |
| .github/workflows/*.yml | Done | ci.yml (Rust build+test, mdBook, Rust-version-sync) + cross-platform.yml (OS matrix) + clippy-rolling.yml + crates-io-foundation.yml + release.yml + release-desktop.yml |
| .github/ISSUE_TEMPLATE/* | Done | Bug report + feature request (YAML forms) |
| Pull request template | Done | .github/PULL_REQUEST_TEMPLATE.md mirrors current CONTRIBUTING + PR review requirements |
CI Governance Policy
- Required status checks: the ci.yml jobs that run on every pull request: Rust build + test, mdBook build, and Rust version pins in sync. See Branch Protection for the exact GitHub check names and which other workflow (cross-platform.yml) is deliberately not in the required set.
- Branch protection rules: documented in Branch Protection; configure on GitHub once the repo is public.
Release Governance
- Releases: the CLI and desktop app are published as signed GitHub Releases (cargo-dist); the Rust crates are source-available (not yet on crates.io).
- Cargo publication governance: first-wave crates.io foundations are documented in Crates.io Publication and checked by .github/workflows/crates-io-foundation.yml.
- Binary release governance: release.yml is reserved for cargo-dist GitHub Release packaging of dist-enabled artifacts. It is not the crates.io publication workflow.
- Tagging rule: do not treat version tags as authorization to publish new surfaces. A surface becomes stable only when its release notes explicitly say so and its public distribution channel is live.
- Release-note rule: every public release note must state the surface’s distribution channel, support boundary, and any closely related surfaces that remain held back.
Community Operations
- Label taxonomy: bug and enhancement auto-applied by issue templates. Richer taxonomy (drift, spec, grammar, parser, docs, good first issue): TODO (GitHub settings).
- Contributor pathway: CONTRIBUTING.md covers setup and PR flow. First-time/advanced contributor pathways: TODO.
- Public project roadmap: TODO.
Supply Chain and Security
- Dependency scanning: CI runs rustsec/audit-check and cargo-deny (with deny.toml). Automated update PRs (Dependabot/Renovate): TODO.
- Signed release artifacts: TODO.
- Security advisories process: documented in SECURITY.md.
Acceptance Criteria
- Repo has complete governance artifacts at root.
- CI and branch protections enforce stated policy.
- Contributors can onboard and submit PRs without tribal knowledge.
- Release/support tiers are documented per surface.
- Release process is repeatable and documented.
Rust Compilation Times: Findings and Optimizations
Status: Reference (historical analysis; current Cargo.toml profile knobs are the source of truth) Last updated: 2026-05-20 20:32 EDT
This document captures the compilation performance analysis that drove the
current dev/test profile knobs in the workspace root Cargo.toml. The
absolute measurements below were taken before the 2026-04-28 batchalign3
fold roughly tripled the third-party dependency surface; subsequent updates
are reflected in Cargo.toml comments, which are the source of truth.
Background: How Rust Compilation Works
Rust compilation has two key mechanisms for speed:
- Incremental compilation: When you change one file and rebuild, the compiler remembers which “codegen units” within each crate were affected and only recompiles those. This is the primary speedup mechanism for local iterative development (edit-compile-test cycles).
- Crate-level caching: Cargo tracks which crates have changed inputs (source files, dependencies, feature flags). Unchanged crates are skipped entirely. This helps when you edit a leaf crate and don’t need to rebuild unrelated crates.
Additionally, there are external tools:
- sccache: A shared compilation cache that stores compiled artifacts by content hash. Designed for CI environments where builds start from a clean state. It works by wrapping rustc and checking a cache before invoking the real compiler.
- Linker choice: The linker runs after all crates are compiled to produce the final binary. Faster linkers (like lld) can shave seconds off link time for large binaries.
What We Found
Problem 1: sccache Was Disabling Incremental Compilation (Critical)
The global ~/.cargo/config.toml had:
[build]
rustc-wrapper = "/opt/homebrew/bin/sccache"
This caused two compounding problems:
- sccache disables Rust incremental compilation entirely. When a rustc-wrapper is set, Cargo cannot use incremental mode because the wrapper interposes between Cargo and rustc, breaking the incremental artifact protocol.
- sccache had near-zero cache benefit for this workspace. The sccache stats showed a 2.7% Rust cache hit rate. Out of 37 compilations, 36 were marked “non-cacheable” because rlib crates (library crates, which is what most workspace crates produce) cannot be cached by sccache.
The result: every cargo build after a one-line change was effectively a clean
rebuild of the entire dependency chain. A change to talkbank-model (near the
root of the crate graph) triggered a full recompile of 11+ downstream crates,
taking 60-90 seconds even for a trivial edit.
Problem 2: Full Debug Info Was Inflating Link Times
The dev profile was generating full DWARF debug info (level 2), which includes:
- Type definitions for every struct/enum
- Variable location info for debugger inspection
- Full scope and lifetime metadata
This produces large .dSYM bundles and .o files, increasing linker input size
and slowing down the link phase.
Problem 3: Third-Party Dependencies at -O0
All third-party crates (serde, regex, tree-sitter, etc.) were compiled at
opt-level = 0 in dev builds. Since these crates rarely change, this was a
pure penalty: slow runtime (tests using serde deserialization, tree-sitter
parsing, or regex matching ran ~10x slower than necessary) with no compile-time
benefit after the first build.
Non-Problem: lld Linker
The linker = "lld" setting in the global cargo config was fine. On macOS this
uses ld64.lld from Homebrew’s LLVM toolchain (LLD 21.1.8), which is slightly
faster than Apple’s default linker for workspaces of this size. No change needed.
Changes Made
Change 1: Project-Local sccache Override
Created .cargo/config.toml in the project root:
[build]
rustc-wrapper = ""
This overrides the global sccache setting for this project only, re-enabling incremental compilation. Other Rust projects on the system are unaffected.
Why not modify the global config? Keeping the override project-local is safer: sccache may still be useful for other projects or CI workflows.
Note: .cargo/config.toml is gitignored (not committed) because the
empty-string rustc-wrapper = "" value trips a cargo-llvm-cov bug that
treats "" as a real wrapper path instead of “no wrapper.” Each
contributor opts in locally; CI does not carry the override.
Change 2: Reduced Debug Info
In the workspace Cargo.toml:
[profile.dev]
debug = "line-tables-only"
[profile.test]
debug = "line-tables-only"
This generates only file/line number information for backtraces, skipping the bulky type and variable metadata. You still get useful panic/backtrace output with source locations; you just can’t inspect local variables in a debugger (lldb/gdb). For most development workflows this is the right tradeoff.
Change 3: Optimized Third-Party Dependencies, RETIRED post-fold
The original change set [profile.dev.package."*"] opt-level = 1 to
optimize every third-party crate. After the 2026-04-28 batchalign3 fold
roughly tripled the third-party dependency surface (axum, async-trait,
tokio’s full feature set, etc.), the build-time cost of this setting
became prohibitive, and the workspace Cargo.toml comment block now
explains why it was removed.
[profile.test.package."*"] opt-level = 1 was also removed for the
same reason; for specific tests where runtime is the bottleneck, opt
in locally rather than reintroducing the workspace-wide setting.
Results (pre-fold, 2026-03 measurement)
The numbers below were captured pre-fold against the original ten-crate
workspace. The fold roughly tripled the third-party dep set and forced
retiring [profile.dev.package."*"] opt-level = 1; today’s wall-clock
will be slower and depends on which crate you touched. Re-run
cargo build --timings on the current workspace if you need fresh
numbers.
| Scenario | Before | After (pre-fold) |
|---|---|---|
| Clean build | ~3-5 min (est.) | ~39s |
| Incremental rebuild (touch talkbank-model) | ~60-90s | ~4s |
| Test runtime (serde/regex/tree-sitter hot paths) | Slow (-O0) | Faster (-O1, when opt-in) |
Optional: Cranelift Backend for Maximum Iteration Speed
For the fastest possible “does it compile?” checks during rapid iteration, Rust nightly supports the Cranelift codegen backend:
cargo +nightly -Z codegen-backend=cranelift build
Cranelift generates code ~2x faster than LLVM but produces unoptimized output and is nightly-only. It is useful for compile-check cycles but not for correctness testing or benchmarking.
General Principles for Rust Compile Time
- Incremental compilation is king for local dev. Anything that disables it (sccache, certain rustc-wrapper tools) is a net negative for iterative development.
- sccache is for CI, not local dev. It shines when doing clean builds from scratch (CI runners, cross-compilation). For edit-rebuild cycles, incremental compilation is far more valuable.
- Optimize dependencies, not your own crates. [profile.dev.package."*"] with opt-level = 1 gives you faster test execution with minimal compile cost (dependencies rarely change). This holds for small dependency sets; this workspace retired the setting once the post-fold dependency surface made it too expensive (see Change 3).
- Debug info has a real cost. Full DWARF debug info inflates binary sizes and link times. Use line-tables-only unless you actively need a debugger.
- Measure before optimizing. Use cargo build --timings to generate an HTML report showing per-crate compile times and parallelism. Use sccache --show-stats to verify cache effectiveness.
- Watch for crate graph bottlenecks. Crates that sit at the root of the dependency graph (like talkbank-model) are the critical path: changes to them trigger the longest rebuild chains. Keep these crates lean and consider splitting them if they grow too large.
Developer Verification Checks
Status: Current Last modified: 2026-05-30 20:13 EDT
This page defines the current local verification expectations for
TalkBank/chatter.
There is not yet a repo-local make verify wrapper in this checkout. Use
the concrete commands below instead.
Core local sweep
Run this from the repository root before opening or merging substantial changes:
cargo fmt --all -- --check
cargo check --workspace --all-targets
cargo build --workspace --all-targets --locked
cargo nextest run --workspace
cargo test --doc
Surface-specific additions
Add the checks that match the surface you changed:
- Grammar changes
cd grammar && tree-sitter generate && tree-sitter test
- Spec tooling changes
cargo build --manifest-path spec/tools/Cargo.toml
cargo build --manifest-path spec/runtime-tools/Cargo.toml
- Parser / model / alignment / serialization changes
cargo nextest run -p talkbank-parser-tests -E 'test(parser_equivalence)'
cargo nextest run -p talkbank-parser-tests --test roundtrip_reference_corpus
See Setup and Spec Workflow for the surface-specific regeneration guidance.
When to Run
- Always before creating a PR.
- Always before merging parser, spec-tool, grammar, or generated-artifact changes.
- Again after rebasing if upstream changed the same surface.
Additional Engineering Checks
Run these in addition to the core sweep when touching parser/model code:
cargo test -p talkbank-parser --test test_parse_health_recovery
cargo nextest run -p talkbank-parser-tests --test parser_equivalence_files
These protect against regressions in:
- parser recovery without sentinel fabrication
- parse-health taint propagation
- parser semantic equivalence
Failure Policy
- If any required check fails, do not merge.
- Fix the failing check or scope down the change.
- If the failure is unrelated and pre-existing, document it in the PR and open a blocker issue.
Recommended Fast Loop During Development
Use narrower loops while iterating, then run the full sweep before final review. For a broad Rust verification pass:
cargo test --workspace
For grammar-only edits, prefer the smallest relevant loop first:
cd grammar && tree-sitter test
cargo nextest run -p talkbank-parser
Only reach for spec/symbol regeneration when the change truly affects generated artifacts; do not treat regeneration as a substitute for choosing the right regression test.
Branch Protection and Required CI Checks
Status: Current Last updated: 2026-06-15 15:00 EDT
This page defines the required status checks and protection policy for main.
Branch Protection Policy
Enable branch protection for main with:
- Require pull request before merge.
- Require approvals (minimum 1; maintainers may set higher).
- Require conversation resolution before merge.
- Require status checks to pass before merge.
- Restrict force pushes and branch deletions.
Required Status Checks
Configure these CI checks as required. The names are the GitHub check names,
which come from each job’s name: in .github/workflows/ci.yml; that
workflow runs on every pull request to main:
- Rust build + test
- mdBook build
- Rust version pins in sync
One other workflow is deliberately NOT in the required set:
- cross-platform.yml (the Ubuntu + macOS + Windows matrix) runs on push to main, a daily schedule, and manual dispatch, NOT on pull requests, so it cannot report a status on a PR and must not be required (requiring it would block every merge). It is a post-merge and daily drift gate. Add a pull_request trigger first if you want it required.
Optional Hardening
- Require branches to be up to date before merging.
- Enable merge queue if PR volume increases.
- Restrict who can dismiss stale reviews.
Operational Rule
If required checks fail:
- Do not bypass protection.
- Fix the issue or revert the breaking change.
- Re-run checks until green.
Reference Corpus Overhaul
Status: Historical (Phase 0-6 narrative is preserved for context; the live corpus layout is described in Testing § Reference Corpus, read that first for current counts and structure) Last modified: 2026-05-29 18:43 EDT
Subsequent reorganization moved the corpus from the 345-flat-plus-language-subdirs layout described below into nine topical subdirectories under corpus/reference/. Absolute counts in this page (file totals, language-dir counts, the constructs/ directory) reflect the pre-reorganization state and are kept here only as the historical record of how the corpus got to where it is.
Motivation
The reference corpus (corpus/reference/) is the 100%-pass quality gate for all
parser/grammar changes. The parser must handle every file at 100%.
Before this overhaul, the corpus had three problems:
- Language monoculture: 345 files, all English. We have 100K+ real files across 42 languages in the corpus data directory but the gate only tested English.
- Construct gaps: 18 concrete grammar node types were never exercised (e.g., interrupted_question, scoped_best_guess, trailing_off_question). A grammar regression affecting these constructs would pass CI undetected.
- Error coverage gaps: 27 error specs were stubs (no CHAT example), 4 error codes had no spec file at all.
Strategy
Fresh build, not incremental patching. We kept the existing 345 English files as-is (they encode years of parser fixes) and added multilingual files + construct gap-fillers on top.
Phase 0: Coverage Tooling
Built corpus_node_coverage (spec/tools/src/bin/corpus_node_coverage.rs) to
measure which of the 334 concrete grammar node types the corpus exercises.
Running against the old 345-file corpus confirmed exactly 18 gaps.
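The coverage measurement reduces to a set difference between the grammar's concrete node types and the node types actually seen in the corpus. The sketch below is not the corpus_node_coverage implementation; coverage_gaps and coverage_percent are hypothetical helpers, and gathering the two sets (from the grammar and from parsed trees) is assumed to happen elsewhere.

```rust
use std::collections::BTreeSet;

/// Node types the grammar defines that no corpus file exercises,
/// returned in sorted order.
pub fn coverage_gaps<'a>(
    concrete: &BTreeSet<&'a str>,
    exercised: &BTreeSet<&'a str>,
) -> Vec<&'a str> {
    concrete.difference(exercised).copied().collect()
}

/// Coverage as a percentage; an empty grammar counts as fully covered.
pub fn coverage_percent(exercised: usize, total: usize) -> f64 {
    if total == 0 {
        return 100.0;
    }
    exercised as f64 * 100.0 / total as f64
}
```

With the numbers on this page, 18 gaps out of 334 concrete types means 316 exercised types, which is the 94.6% baseline reported in the final-state table.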
Phase 1: Language Selection & File Extraction
Built extract_corpus_candidates (spec/runtime-tools/src/bin/extract_corpus_candidates.rs)
to automatically select representative files from the corpus data directory for 20 target
languages:
eng, zho, fra, deu, spa, jpn, nld, heb, por, ell,
tur, hrv, pol, ita, hun, rus, est, dan, ara, isl
Selection criteria:
- Clean tree-sitter parsing (no ERROR nodes), mandatory
- Short files (under 200 lines, preferring 15-100)
- Varied tiers (%mor/%gra/%pho/%com)
- Multiple speakers preferred
- Privacy: explicitly skip Password directories in the corpus data directory
For each language, the tool scored and ranked candidates. We selected 1-2 files per language (25 files total across 20 language subdirectories).
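The mandatory criteria above can be expressed as a filter that runs before any scoring. This is a sketch only: the Candidate struct and both functions are hypothetical, matching a path component named exactly Password is an assumption about how the exclusion works, and the tool's real scoring weights are not documented here, so none are invented.

```rust
use std::ffi::OsStr;
use std::path::Path;

/// Facts about one candidate file, assumed to be gathered beforehand.
pub struct Candidate<'a> {
    pub path: &'a Path,
    pub line_count: usize,
    pub has_error_nodes: bool, // true if tree-sitter produced any ERROR node
}

/// Hard requirements: clean parse, under 200 lines, not under a
/// directory named "Password".
pub fn is_eligible(candidate: &Candidate<'_>) -> bool {
    let in_password_dir = candidate
        .path
        .components()
        .any(|part| part.as_os_str() == OsStr::new("Password"));
    !candidate.has_error_nodes && candidate.line_count < 200 && !in_password_dir
}

/// Soft preference: files of 15 to 100 lines are favored.
pub fn in_preferred_length(candidate: &Candidate<'_>) -> bool {
    (15..=100).contains(&candidate.line_count)
}
```

Keeping the privacy and parse-cleanliness checks as a hard filter means no ranking choice can ever let an excluded file through.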
Phase 2: Construct Gap-Filling
Created 4 handcrafted files in corpus/reference/constructs/ to exercise the 18
missing node types that don’t appear in real-world data:
| File | Node types exercised |
|---|---|
| rare-terminators.cha | interrupted_question, self_interrupted_question, self_interruption, trailing_off_question |
| uptake.cha | uptake_symbol |
| best-guess.cha | scoped_best_guess |
| unsupported.cha | thumbnail_header, unsupported_header, unsupported_dependent_tier, unsupported_line, unsupported_header_prefix, unsupported_tier_prefix |
Other gaps (l1_of_header, utf8_header, etc.) were already covered by the
language files or were confirmed as supertypes (not concrete).
Result: 334/334 concrete types exercised (100%).
Phase 3: Tier Regeneration
Ran batchalign3 morphotag on all 25 language files to generate fresh %mor/%gra tiers:
cd /path/to/batchalign3
uv run batchalign3 morphotag /path/to/chatter/corpus/reference/{lang}/ --in-place
All 20 languages are covered by Stanza’s UD models. Validation confirmed all 374 files pass parser equivalence and roundtrip.
Phase 4: Error Corpus Expansion
4.1: Created 3 missing error specs (E707, E711, E717) with CHAT examples and metadata. Fixed E376 (had wrong error code E208 in metadata).
4.2: Filled 17 triggerable stub specs with CHAT examples:
- Cross-utterance validation (E341, E351-E355)
- Parser recovery warnings (E319-E322, E325, E326)
- Underline tier errors (E356-E357)
- Overlap index errors (E373)
- Direct parser tier errors (E381, E384)
4.3: Documented 12 untriggerable stubs (internal, deprecated, or not-yet-wired error codes) with explanations of why no example is possible: E001, E002, E211, E317, E318, E340, E374, E377, E378, E380, E385, E386.
4.4: Corrected 5 misclassified specs where examples triggered different error
codes than intended (E319-E322, E376). Added Status: not_implemented and
explanatory notes.
4.5: Built perturbation tool (spec/tools/src/bin/perturb_corpus.rs) with 11
mutation strategies that take a valid .cha file and produce controlled mutations
targeting specific error codes:
| Perturbation | Target Error |
|---|---|
| delete-participants | E501 |
| delete-languages | E503 |
| delete-id | E504 |
| undeclared-speaker | E308 |
| delete-terminator | E305 |
| extra-mor-word | E706 |
| fewer-mor-words | E705 |
| delete-begin | E502 |
| delete-end | E510 |
| duplicate-participants | E511 |
| mor-terminator-mismatch | E716 |
Also includes a mining mode (--mine DIR) that scans real data for tree-sitter
ERROR nodes, with automatic Password directory exclusion.
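Several of the perturbations in the table are header deletions, which can be sketched as a line filter. This is an illustration of the idea rather than the perturb_corpus code: delete_header is a hypothetical function, and it only drops lines whose header name matches, so a header continued onto tab-indented lines would leave its continuation behind.

```rust
/// Remove every line whose header name equals `header`
/// (for example "@Participants" or "@End"). The header name is the text
/// before the first ':' or tab. Output always ends with a newline.
pub fn delete_header(chat: &str, header: &str) -> String {
    chat.lines()
        .filter(|line| {
            let name = line
                .split(|c: char| c == ':' || c == '\t')
                .next()
                .unwrap_or("");
            name != header
        })
        .map(|line| format!("{line}\n"))
        .collect()
}
```

Applied to a valid file, delete_header(chat, "@Participants") corresponds to the delete-participants row (target E501) and delete_header(chat, "@End") to delete-end (target E510); whether the mutated file actually yields the targeted code still has to be confirmed by running the validator.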
4.6: Regenerated golden artifacts: all 8 golden generators + audit + bootstrap:
| Artifact | Lines |
|---|---|
| golden_words.txt | 769 (1949 unique words) |
| golden_mor_tiers.txt | 405 |
| golden_gra_tiers.txt | 7 |
| golden_main_tiers.txt | 607 |
| golden_pho_tiers.txt | 25 |
| golden_wor_tiers.txt | 7 |
| golden_sin_tiers.txt | 5 |
| golden_com_tiers.txt | 24 |
| golden_words_featured.txt | 96 |
| golden_words_minimal.txt | 62 |
Bootstrap regenerated reference_corpus.rs with 374 test cases.
Phase 5: CI Integration & Validation
At that milestone, the then-current verification sweep passed:
- Parser equivalence: 377/377 (374 files + 3 extra)
- Node coverage: 334/334 (100%)
- Error coverage: 181/181 (100%), 169 with CHAT examples, 12 documented stubs
- The parser-equivalence and reference-corpus regression gates passed
Phase 6: Cleanup & Documentation
- Updated file count references (339→374) across CLAUDE.md files
- Rewrote corpus/README.md with new structure
- Updated memory files
Final State
corpus/reference/ 374 files total
*.cha 345 files (original English corpus)
constructs/ 4 files (rare grammar constructs)
{20 language dirs}/ 25 files (multilingual, from corpus data)
| Metric | Before | After |
|---|---|---|
| Total files | 345 | 374 |
| Languages | 1 (English) | 20 |
| Concrete node coverage | 316/334 (94.6%) | 334/334 (100%) |
| Error specs | 177/181 (97.8%) | 181/181 (100%) |
| Error specs with examples | ~150 | 169 |
| Documented stubs | 0 | 12 |
| Golden artifacts | Stale | Freshly regenerated |
Tools Built
| Tool | Path | Purpose |
|---|---|---|
| corpus_node_coverage | spec/tools/src/bin/ | Grammar node type coverage |
| extract_corpus_candidates | spec/runtime-tools/src/bin/ | Automated file selection from corpus data |
| perturb_corpus | spec/tools/src/bin/ | Error file generation by mutation |
What Worked
- extract_corpus_candidates: Automated scoring eliminated guesswork in file selection. Files were high-quality, short, and diverse.
- construct gap-filling: 4 handcrafted files closed 18 gaps efficiently.
- Keeping existing 345 files: No breakage, no regressions. The new files are purely additive.
- batchalign3 morphotag: Generated correct %mor/%gra for all 20 languages without manual intervention.
What Didn’t Work / Lessons Learned
- Mining real errors from corpus data: The MacWhinney subcorpus (407 files) had zero tree-sitter parse errors; the data is too clean. Mining is slow on large directories (>4 minutes for all of Eng-NA). The perturbation approach is more effective for systematic error coverage.
- Parser recovery error specs (E319-E322): Writing examples that trigger specific tree-sitter error recovery codes is very difficult. Tree-sitter’s error recovery is robust and routes most malformed input through generic paths (E316) rather than the specific recovery codes. These remain as documented stubs.
- Direct parser vs unsupported.cha (historical, direct parser has been removed): The former Chumsky direct parser could not handle unsupported_line nodes (failed on constructs/unsupported.cha). This is no longer relevant since tree-sitter is now the sole parser.
Known Remaining Gaps
- 12 untriggerable error stubs: Internal (E001, E002), deprecated (E211, E317, E318, E340, E374, E377, E378, E380, E385, E386). These are legitimate: the codes either have no emission path or are reserved.
- No audio files: Phase 3.3 (audio subset with %wor tiers) was deferred. Adding ~10 short audio clips would test the alignment pipeline end-to-end.
- Direct parser roundtrip (historical, direct parser has been removed): 373/374 passed under the former Chumsky direct parser (unsupported.cha failed). No longer relevant since tree-sitter is now the sole parser.
- 5 parser recovery specs not_implemented: E319-E322, E376. Examples don’t trigger the intended codes due to tree-sitter’s error recovery routing.
Desktop App Testing
Status: Current Last updated: 2026-05-20 20:28 EDT
This document covers the testing strategy for the Chatter desktop app
(apps/chatter-desktop/). Testing is split into three tiers by speed and scope.
Testing Tiers
┌─────────────────────────────────────────────────────────┐
│ Tier 3: E2E (WebdriverIO + tauri-driver) │
│ Real app, real DOM, real IPC. Slow (~5-10s/test). │
│ Catches: rendering bugs, IPC wiring, platform quirks. │
│ Run: manually before releases, optionally in CI. │
├─────────────────────────────────────────────────────────┤
│ Tier 2: Rust integration tests │
│ Real validation pipeline, real event bridge, no GUI. │
│ Catches: serialization mismatches, event ordering, │
│ stats consistency, single-file handling. │
│ Run: every commit, CI required. │
├─────────────────────────────────────────────────────────┤
│ Tier 1: Unit tests (Rust + TypeScript) │
│ Pure functions and thin runtime seams in isolation. │
│ Catches: protocol drift, reducer bugs, CLAN math. │
│ Run: every commit, CI required. │
└─────────────────────────────────────────────────────────┘
Most bugs will be caught by Tier 2. The Rust integration tests exercise the
exact same code path as the Tauri commands; they call
validate_target_streaming() and the frontend event bridge directly, then
verify the JSON shape, field names, event ordering, and stats consistency.
Tier 1 & 2: Unit and integration tests
Running
# TypeScript capability/seam tests
cd apps/chatter-desktop && npm run test:unit
# Rust contract/integration tests
cargo nextest run -p chatter-desktop --test validation_bridge
What they cover
| Test | What it verifies |
|---|---|
| apps/chatter-desktop/tests/unit/validationRunner.test.cjs | Validation capability uses centralized command names, subscribes before invoke, and disposes listeners exactly once |
| apps/chatter-desktop/tests/unit/validationState.test.cjs | Validation reducer computes relative file names and merges diagnostics/status immutably |
| reference_corpus_no_hard_errors | every file under corpus/reference/ produces zero Severity::Error (warnings allowed) |
| event_lifecycle_has_correct_sequence | Discovering → Started → FileComplete×N → Finished ordering |
| frontend_events_serialize_to_expected_json_shape | Every event has type field; camelCase field names match TypeScript types; diagnostics include renderedText |
| protocol_contracts_serialize_to_expected_json_shape | Rust command/event constants and request payloads stay aligned with the TypeScript protocol module |
| single_file_validation | Single-file path validates exactly the selected file |
| finished_stats_match_file_events | valid + invalid + parseErrors == totalFiles; FileComplete count matches |
| rendered_html_present_for_errors | Every diagnostic carries non-empty miette HTML with box-drawing characters and style= attributes (ANSI colors converted to HTML) |
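The ordering and stats invariants in the table can be checked together with a slice pattern. The sketch below uses a simplified stand-in for the real FrontendEvent type: the Event enum, its fields, and check_lifecycle are hypothetical, and reading "FileComplete count matches" as "equals totalFiles" is an assumption.

```rust
/// Simplified stand-in for the app's event type (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    Discovering,
    Started,
    FileComplete,
    Finished { valid: usize, invalid: usize, parse_errors: usize, total_files: usize },
}

/// Check Discovering -> Started -> FileComplete x N -> Finished, and that
/// the Finished stats agree with the number of FileComplete events.
pub fn check_lifecycle(events: &[Event]) -> Result<(), String> {
    let [Event::Discovering, Event::Started, middle @ .., Event::Finished { valid, invalid, parse_errors, total_files }] =
        events
    else {
        return Err("expected Discovering, Started, ..., Finished".to_string());
    };
    if !middle.iter().all(|event| matches!(event, Event::FileComplete)) {
        return Err("only FileComplete may appear between Started and Finished".to_string());
    }
    if valid + invalid + parse_errors != *total_files {
        return Err("valid + invalid + parse_errors != total_files".to_string());
    }
    if middle.len() != *total_files {
        return Err("FileComplete count does not match total_files".to_string());
    }
    Ok(())
}
```

Checking order and counts in one pass gives a single failure message that names the violated invariant, which is easier to act on than a mismatched snapshot.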
Adding new tests
Test file: apps/chatter-desktop/src-tauri/tests/validation_bridge.rs
The tests use collect_events() which runs the real validation pipeline and
collects all FrontendEvent values. To test a specific scenario:
#[test]
fn my_scenario() {
    let target = workspace_root().join("path/to/corpus");
    let events = collect_events(&target);
    let summary = summarize(&events);
    // assert on summary fields or individual events
}
Miette rendering pipeline
Error rendering is server-side. Each FrontendDiagnostic carries two
renderings:
- rendered_html: render_error_with_miette_with_source_colored() produces ANSI-colored text, ansi-to-html converts it to HTML <span style="...">. The frontend displays it in a <pre> block via dangerouslySetInnerHTML. This guarantees identical output to the CLI.
- rendered_text: render_error_with_miette_with_source() produces plain text (no ANSI codes) for clean clipboard copy-paste.
The rendered_html_present_for_errors integration test verifies that every
error diagnostic includes non-empty HTML containing miette box-drawing
characters and style= attributes from ANSI color conversion.
TypeScript seam tests
The TypeScript unit tests compile a focused subset of apps/chatter-desktop/src/ to a
temporary CommonJS directory, then run Node’s built-in test runner against the
compiled output. This keeps the test toolchain small while still exercising the
runtime seam as real JavaScript.
- Runner script: apps/chatter-desktop/scripts/run-unit-tests.mjs
- Compile config: apps/chatter-desktop/tsconfig.unit.json
- Test files: apps/chatter-desktop/tests/unit/*.test.cjs
TypeScript ↔ Rust contract
The Rust integration tests verify that serialized JSON matches what the
TypeScript frontend expects. If you change a field name or event structure in
events.rs, the frontend_events_serialize_to_expected_json_shape test will
catch the mismatch before you discover it at runtime.
The key serde attributes:
- #[serde(tag = "type", rename_all = "camelCase")] on enums, variant names become camelCase tag values (fileComplete, not FileComplete)
- #[serde(rename_all = "camelCase")] on individual variants, field names become camelCase (totalFiles, not total_files)
- Both must be present: the enum-level rename_all only affects tag names, not field names within variants
Tier 3: E2E Tests (WebdriverIO)
Prerequisites
cargo install tauri-driver # WebDriver backend for Tauri (Linux/Windows only)
cargo tauri build --debug # Build the app binary
Note: tauri-driver only works on Linux and Windows. On macOS, WKWebView
does not support WebDriver. Run E2E tests in CI (Linux) or on a Windows machine.
Running
# Terminal 1: start tauri-driver (WebDriver server on :4444)
tauri-driver
# Terminal 2: run the tests
cd apps/chatter-desktop
npm run test:e2e
What they cover
The smoke tests in tests/e2e/smoke.spec.ts verify that the app launches and
renders the expected UI elements:
- Drop zone with Choose File / Choose Folder buttons
- Empty file tree (“No files loaded”)
- Empty error panel (“Select a file to view errors”)
- Status bar showing “Ready”
Limitations
File dialogs cannot be driven via WebDriver. The native file picker
(@tauri-apps/plugin-dialog) opens an OS-level dialog that WebDriver can’t
interact with. Options for testing the validation flow:
- Test-only Tauri command: add validate_for_test(path) behind #[cfg(debug_assertions)] that bypasses the file dialog
- Programmatic invoke: use driver.executeScript() to call window.__TAURI__.core.invoke("validate", { path }) directly
- Drag-and-drop simulation: possible but platform-dependent and fragile
For now, the Rust integration tests cover the full validation pipeline. E2E tests focus on UI rendering and user-visible layout.
Adding E2E tests
Test file: apps/chatter-desktop/tests/e2e/*.spec.ts
WebdriverIO provides $() and $$() for CSS selectors, plus Tauri-aware
capabilities:
it("should show validation results", async () => {
// Programmatically trigger validation (bypasses file dialog)
await browser.executeAsync(async (path, done) => {
await (window as any).__TAURI__.core.invoke("validate", {
path,
});
// Wait for finished event
setTimeout(done, 5000);
}, "/path/to/corpus");
const tree = await $(".file-tree-panel");
const text = await tree.getText();
expect(text).not.toContain("No files loaded");
});
When to run E2E tests
- Before releases: manual run to verify the built app works end-to-end
- Optionally in CI: requires tauri-driver and a display server (Xvfb on Linux). Slow, so consider running only on release branches.
- Not on every commit: the Rust integration tests are fast and cover more ground
Platform-Specific Considerations
| Platform | WebView engine | E2E support |
|---|---|---|
| macOS | WKWebView | Not supported: tauri-driver does not work on macOS (WKWebView has no WebDriver API) |
| Windows | WebView2 (Chromium) | Full support via tauri-driver |
| Linux | WebKitGTK | Full support via tauri-driver; requires Xvfb for headless |
macOS limitation: Apple’s WKWebView does not expose a WebDriver endpoint,
so tauri-driver cannot drive the app on macOS. E2E tests must run on Linux
(CI) or Windows. For local macOS development, rely on the Rust integration
tests (Tier 2) and manual smoke testing.
CSS rendering differs slightly between WebKit (Linux) and Chromium (Windows). Visual regressions are possible; consider screenshot comparison tests if this becomes a problem.
Test Data
All tests use the reference corpus at corpus/reference/. This
corpus is checked into the repo and must always pass validation with
zero hard errors (warnings are allowed). The exact set of files is
whatever find corpus/reference -name '*.cha' -type f returns, and the
current warning-emitting files are whatever the validator reports;
do not hard-code those lists here.
Do not create ad-hoc .cha test files. Use existing reference corpus files
or ask the user to provide test data.
CI Integration
Add to the existing CI workflow:
# Rust integration tests (fast, always run)
- name: Desktop integration tests
  run: cargo nextest run -p chatter-desktop --test validation_bridge
# E2E tests (slow, release branches only)
- name: Build desktop app
  if: startsWith(github.ref, 'refs/heads/release')
  run: cargo tauri build --debug
- name: E2E smoke tests
  if: startsWith(github.ref, 'refs/heads/release')
  run: |
    tauri-driver &
    sleep 2
    cd apps/chatter-desktop && npm run test:e2e
Library Usage
Status: Current Last modified: 2026-06-21 21:33 EDT
The TalkBank Rust crates can be used as dependencies in your own Rust
projects for parsing, validating, and manipulating CHAT files. This page
shows the most common entry points; the API reference on docs.rs (once
published) is the authoritative source. Until then, treat the rustdoc
comments inside each crate’s src/lib.rs as the source of truth.
Examples on this page are mirrored as a real Cargo test at
crates/talkbank-transform/tests/book_library_usage_examples.rs. The book renders them as rust,ignore so mdbook doesn’t try to link against the workspace’s many compiled crate variants; the parallel test runs the same code under cargo test and is what catches API drift between this page and the libraries. If you edit either, update both.
Important: some legacy tree-sitter fragment helpers are synthetic rather than semantically honest. They can inject fragment input into boilerplate CHAT text and parse the resulting synthetic file. Prefer full-file parsing for real tree-sitter use, and do not treat legacy fragment helpers as the long-term fragment API. For direct-parser fragment semantics, use direct-parser-native tests instead of treating synthetic wrappers as the oracle.
Adding Dependencies
The TalkBank library crates are source-available from this repository. They are
not yet published on crates.io, so depend on them from the public repo via git
(pinned to a release tag), or via local path dependencies from a
TalkBank/chatter checkout for local development:
[dependencies]
talkbank-model = { path = "../chatter/crates/talkbank-model" }
talkbank-transform = { path = "../chatter/crates/talkbank-transform" }
talkbank-parser = { path = "../chatter/crates/talkbank-parser" }
The published-crate workflow is tracked separately; once it lands these
paths can become version = "X.Y" deps.
Parsing and Validating a CHAT File
The simplest entry point is parse_and_validate from
talkbank-transform. It takes the source text and a
ParseValidateOptions and returns a fully constructed ChatFile, or a
PipelineError if parsing or validation fails.
extern crate talkbank_model;
extern crate talkbank_transform;
use talkbank_model::ParseValidateOptions;
use talkbank_transform::parse_and_validate;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let source = std::fs::read_to_string("file.cha")?;
let options = ParseValidateOptions::default().with_validation();
let chat_file = parse_and_validate(&source, options)?;
for utt in chat_file.utterances() {
println!("Speaker: {}", utt.main.speaker);
}
Ok(())
}
ChatFile is generic over a ValidationState parameter; the
parse_and_validate return defaults to the validated state.
chat_file.utterances() returns an iterator over &Utterance derived
from the file’s lines (utterances are interleaved with headers and
comments in source order).
For batch workflows where parser construction overhead matters, reuse a
single TreeSitterParser and call parse_and_validate_with_parser:
extern crate talkbank_model;
extern crate talkbank_parser;
extern crate talkbank_transform;
use talkbank_model::ParseValidateOptions;
use talkbank_parser::TreeSitterParser;
use talkbank_transform::parse_and_validate_with_parser;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let chat_files: Vec<std::path::PathBuf> = Vec::new();
let parser = TreeSitterParser::new()?;
let options = ParseValidateOptions::default().with_validation();
for path in &chat_files {
let source = std::fs::read_to_string(path)?;
let chat_file = parse_and_validate_with_parser(&parser, &source, options.clone())?;
let _ = chat_file;
}
Ok(())
}
ParseValidateOptions also exposes with_alignment() (implies
with_validation(), additionally validates cross-tier alignment for
%mor, %gra, %pho, %wor) and with_strict_linkers() (enables
E351-E355 self-completion/other-completion linker checks).
Working with the Model
ChatFile stores participants and language metadata as top-level fields
populated from @Participants / @ID / @Languages headers during
parsing. Utterances live in lines and are iterated via
chat_file.utterances().
extern crate talkbank_model;
extern crate talkbank_transform;
use talkbank_model::DependentTier;
use talkbank_model::ParseValidateOptions;
use talkbank_transform::parse_and_validate;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let source = "\
@UTF8
@Begin
@Languages:\teng
@Participants:\tCHI Target_Child
@ID:\teng|test|CHI|||||Target_Child|||
*CHI:\thello world .
%mor:\tco|hello n|world .
@End
";
let chat_file = parse_and_validate(source, ParseValidateOptions::default().with_validation())?;
// Participant metadata is top-level on the ChatFile.
let _participants = &chat_file.participants;
// Iterate utterances and their dependent tiers.
for utt in chat_file.utterances() {
for tier in &utt.dependent_tiers {
if let DependentTier::Mor(mor_tier) = tier {
for item in mor_tier.items() {
println!("POS: {}, Lemma: {}", item.main.pos, item.main.lemma);
}
}
}
}
Ok(())
}
DependentTier is a closed-set enum (Mor, Gra, Pho, Mod, Sin,
Act, Add, Com, Err, Exp, Gpx, Int, Lan, …); match on the
variants you care about and ignore the rest. MorTier::items() returns
&[Mor]; each Mor has a main MorWord plus optional post-clitics.
Serializing to CHAT
Bring the WriteChat trait into scope and call to_chat_string() for a
fully-rendered CHAT string, or write_chat(&mut writer) to stream into
any std::fmt::Write.
extern crate talkbank_model;
extern crate talkbank_transform;
use std::fmt::Write as _;
use talkbank_model::ParseValidateOptions;
use talkbank_model::WriteChat;
use talkbank_transform::parse_and_validate;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let source = "@UTF8\n@Begin\n@Languages:\teng\n@Participants:\tCHI Target_Child\n@ID:\teng|test|CHI|||||Target_Child|||\n*CHI:\thello .\n@End\n";
let chat_file = parse_and_validate(source, ParseValidateOptions::default().with_validation())?;
// Convenience: render to a fresh String.
let chat_text = chat_file.to_chat_string();
assert!(chat_text.starts_with("@UTF8"));
// Streaming: write into any std::fmt::Write sink.
let mut output = String::new();
chat_file.write_chat(&mut output)?;
Ok(())
}
Serializing to JSON
Prefer the schema-validated helpers in talkbank_transform::json:
to_json_pretty_validated checks the output against the JSON schema and
catches drift between the data model and the schema. The unvalidated
variants are a faster bypass when you’ve already validated upstream.
extern crate talkbank_model;
extern crate talkbank_transform;
use talkbank_model::ParseValidateOptions;
use talkbank_transform::json::to_json_pretty_validated;
use talkbank_transform::parse_and_validate;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let source = "@UTF8\n@Begin\n@Languages:\teng\n@Participants:\tCHI Target_Child\n@ID:\teng|test|CHI|||||Target_Child|||\n*CHI:\thi .\n@End\n";
let chat_file = parse_and_validate(source, ParseValidateOptions::default().with_validation())?;
let json = to_json_pretty_validated(&chat_file)?;
assert!(json.contains("\"speaker\""));
Ok(())
}
The schema for ChatFile lives at schema/chat-file.schema.json and is
regenerated from the Rust types via cargo test --test generate_schema. For arbitrary
serde values (not just ChatFile), to_json_unvalidated /
to_json_pretty_unvalidated work the same way without the schema step.
Custom Error Handling
Lower-level parser entry points stream diagnostics through the
ErrorSink trait. Implement it to collect, count, filter, or forward
errors as they arrive. This is useful when you need finer-grained control
than the Result<ChatFile, PipelineError> shape that parse_and_validate returns.
extern crate talkbank_model;
use talkbank_model::ErrorSink;
use talkbank_model::ParseError;
struct MyErrorHandler;
impl ErrorSink for MyErrorHandler {
fn report(&self, error: ParseError) {
// Custom handling: log, filter, count, etc.
eprintln!("[{}] {}", error.code, error.message);
}
}
ErrorSink is Send + Sync, and a blanket &T: ErrorSink impl means
borrowed references are sinks too, so no Arc wrapper is required. The
built-in ErrorCollector (gathers into a Vec), ParseTracker (counts
by severity), and NullErrorSink (discards) cover most common needs;
implement ErrorSink directly for everything else.
Crate Selection Guide
| Need | Crate |
|---|---|
| Data model types, error types, WriteChat, ErrorSink | talkbank-model |
| Tree-sitter CHAT parsing (low-level) | talkbank-parser |
| Full pipeline (parse + validate + JSON, schema validation) | talkbank-transform |
talkbank-model is the foundation: every other crate depends on it. If
all you need are the AST types and validation, talkbank-model alone is enough.
talkbank-transform adds parsing, JSON, and caching.
Batchalign3-Facing Surface
If you are building Batchalign3 or another external consumer, the stable surface is usually:
| Batchalign3 need | Prefer |
|---|---|
| Canonical full-file parsing | talkbank-parser |
| Parse/validate contracts and typed model access | talkbank-model |
| Alignment-aware downstream consumers (align, compare, benchmark) | talkbank-model alignment helpers plus the model AST |
| Whole-pipeline parse+validate+convert | talkbank-transform |
For batch workflows, keep parser instances reusable and keep alignment logic separate from parse semantics.
JSON Output Reference
Status: Reference Last updated: 2026-05-11 23:45 EDT
This document describes the structure of JSON produced by chatter to-json.
For the formal JSON Schema, see JSON Schema.
Quick Start
# Default: parse + validate + align, pretty-printed, schema-checked
chatter to-json file.cha
# Write to file
chatter to-json file.cha -o file.json
# Skip validation (parse only, faster)
chatter to-json file.cha --skip-validation
# Skip alignment only
chatter to-json file.cha --skip-alignment
Validation and alignment are on by default. Use --skip-validation
or --skip-alignment to opt out.
Top-Level Structure
{
"lines": [ ... ]
}
A ChatFile is a flat list of lines. Each line has a line_type discriminator:
| line_type | Description |
|---|---|
| "header" | File header (@Begin, @Languages, @Participants, etc.) |
| "utterance" | Main tier + dependent tiers + alignment |
| "comment" | @Comment: lines |
Word Fields
Words are the fundamental unit. Every word in the main tier content array
carries these fields:
| Field | Type | Always? | Description |
|---|---|---|---|
| type | "word" | yes | Discriminator |
| raw_text | string | yes | Exact text from the transcript, including all CHAT markers |
| cleaned_text | string | yes | NLP-ready text (shortenings restored, markers stripped) |
| content | array | yes | Structured breakdown of word parts (see below) |
| category | string | no | "omission", "filler", "nonword", "fragment", "ca_omission" |
| form_type | string | no | Special form code: "c", "d", "f", "x", etc. |
| lang | object | no | Language marker (see Language-Switched example) |
| untranscribed | string | no | "unintelligible" (xxx), "phonetic" (yyy), "untranscribed" (www) |
Word content items use "content" for the text value:
{ "type": "text", "content": "dog" }
Computed Fields
cleaned_text and untranscribed are computed from content during
serialization. They do not exist as stored fields in the data model.
- cleaned_text: Concatenates Text and Shortening elements from content. Excludes lengthening markers (:), stress markers, CA elements, overlap points, compound markers, and underline markers. Example: sit(ting) → "sitting".
- untranscribed: Present only when cleaned_text is "xxx", "yyy", or "www".
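The derivation described above can be sketched in a few lines of Python. This is an illustration for consumers, not the serializer's implementation: the "text" item type comes from the examples on this page, while the "shortening" type name and both helper names are assumptions made for the sketch.

```python
# Sketch only: approximates how cleaned_text and untranscribed could be
# derived from a word's content array. The "shortening" type name is an
# assumption; check the JSON Schema for the real discriminator values.
TEXT_BEARING_TYPES = {"text", "shortening"}

UNTRANSCRIBED_KINDS = {
    "xxx": "unintelligible",
    "yyy": "phonetic",
    "www": "untranscribed",
}


def derive_cleaned_text(content):
    """Concatenate text-bearing items and skip every marker item."""
    return "".join(
        item["content"]
        for item in content
        if item.get("type") in TEXT_BEARING_TYPES
    )


def derive_untranscribed(cleaned_text):
    """Return the untranscribed kind, or None when the field would be omitted."""
    return UNTRANSCRIBED_KINDS.get(cleaned_text)


compound = [
    {"type": "text", "content": "ice"},
    {"type": "compound_marker", "content": {"span": {"start": 0, "end": 1}}},
    {"type": "text", "content": "cream"},
]
print(derive_cleaned_text(compound))
print(derive_untranscribed("xxx"))
```

Because both fields are computed, a consumer that rewrites content should recompute them rather than edit them directly.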
Word Examples
Simple Word
dog
{
"type": "word",
"raw_text": "dog",
"cleaned_text": "dog",
"content": [{ "type": "text", "content": "dog" }]
}
Filler
&-uh
{
"type": "word",
"raw_text": "&-uh",
"cleaned_text": "uh",
"content": [{ "type": "text", "content": "uh" }],
"category": "filler"
}
Untranscribed
xxx
{
"type": "word",
"raw_text": "xxx",
"cleaned_text": "xxx",
"content": [{ "type": "text", "content": "xxx" }],
"untranscribed": "unintelligible"
}
Compound
ice+cream
{
"type": "word",
"raw_text": "ice+cream",
"cleaned_text": "icecream",
"content": [
{ "type": "text", "content": "ice" },
{ "type": "compound_marker", "content": { "span": { "start": 0, "end": 1 } } },
{ "type": "text", "content": "cream" }
]
}
Omission
0she
{
"type": "word",
"raw_text": "0she",
"cleaned_text": "she",
"content": [{ "type": "text", "content": "she" }],
"category": "omission"
}
Nonword
&~baba
{
"type": "word",
"raw_text": "&~baba",
"cleaned_text": "baba",
"content": [{ "type": "text", "content": "baba" }],
"category": "nonword"
}
Special Form
doggy@c
{
"type": "word",
"raw_text": "doggy@c",
"cleaned_text": "doggy",
"content": [{ "type": "text", "content": "doggy" }],
"form_type": "c"
}
Language-Switched
maison@s:fra
{
"type": "word",
"raw_text": "maison@s:fra",
"cleaned_text": "maison",
"content": [{ "type": "text", "content": "maison" }],
"lang": { "type": "explicit", "code": "fra" }
}
The lang field has variants: {"type": "shortcut"} (bare @s),
{"type": "explicit", "code": "fra"} (@s:fra), and
{"type": "multiple", "code": ["eng", "zho"]} (@s:eng+zho).
Utterances
An utterance line contains:
{
"line_type": "utterance",
"main": {
"speaker": "CHI",
"content": {
"content": [ ... ],
"terminator": { "type": "period" },
"bullet": { "start_ms": 0, "end_ms": 3042 }
}
},
"dependent_tiers": [ ... ],
"alignments": { ... },
"utterance_language": { "status": "resolved_default", "code": "eng" },
"language_metadata": { ... }
}
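Reading that shape from a consumer might look like the following sketch. The sample document is made up for illustration (including the minimal header and pause items), and summarize_utterance is a hypothetical helper; only the field names come from the structure shown above.

```python
import json

# Hypothetical sample in the documented shape; values are illustrative.
sample = json.loads("""
{
  "lines": [
    {"line_type": "header", "header": {"type": "begin"}},
    {
      "line_type": "utterance",
      "main": {
        "speaker": "CHI",
        "content": {
          "content": [
            {"type": "word", "raw_text": "hello", "cleaned_text": "hello",
             "content": [{"type": "text", "content": "hello"}]},
            {"type": "pause"},
            {"type": "word", "raw_text": "world", "cleaned_text": "world",
             "content": [{"type": "text", "content": "world"}]}
          ],
          "terminator": {"type": "period"}
        }
      }
    }
  ]
}
""")


def summarize_utterance(line):
    """Pull speaker, words, terminator, and optional timing from one line."""
    body = line["main"]["content"]
    words = [
        item["cleaned_text"]
        for item in body["content"]
        if item.get("type") == "word"
    ]
    return {
        "speaker": line["main"]["speaker"],
        "words": words,
        "terminator": body["terminator"]["type"],
        # bullet is omitted (not null) when the utterance has no timing.
        "bullet": body.get("bullet"),
    }


utterances = [
    summarize_utterance(line)
    for line in sample["lines"]
    if line.get("line_type") == "utterance"
]
print(utterances)
```

The flat filter only sees top-level items; words nested inside group items would need a recursive walk, which this sketch leaves out.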
Key structural points:
- The utterance body is under "main", not "utterance".
- content, terminator, and bullet are nested inside main.content.
- terminator is an object with a type field ("period", "question", "exclamation", etc.), not a bare string.
- bullet (utterance-level timing) is inside main.content, omitted when absent (not present as null).
- dependent_tiers, alignments, utterance_language, and language_metadata are top-level siblings of main. Empty dependent_tiers and alignments are omitted when there is nothing to report.
Content Items
main.content.content is a heterogeneous array. Each item has a type discriminator:
| Type | Description |
|---|---|
| "word" | A word token (see Word Fields above) |
| "event" | Non-verbal action (&=laughs) |
| "pause" | Timed or untimed pause ((.), (0.5)) |
| "group" | Bracketed group (<word word>) |
| "separator" | Tag markers, linkers, etc. |
Dependent Tiers
When present, dependent_tiers is an array of tagged objects:
"dependent_tiers": [
{
"type": "Mor",
"data": {
"tier_type": "Mor",
"items": [
{
"main": { "pos": "pron", "lemma": "I" }
},
{
"main": { "pos": "verb", "lemma": "want", "features": ["Fin", "Ind", "Pres"] }
}
],
"terminator": "."
}
},
{
"type": "Gra",
"data": {
"tier_type": "Gra",
"relations": [
{ "index": 1, "head": 2, "relation": "NSUBJ" },
{ "index": 2, "head": 0, "relation": "ROOT" }
]
}
}
]
| type | Tier | Description |
|---|---|---|
| "Mor" | %mor | Morphological analysis (POS tags, lemmas, features, clitics) |
| "Gra" | %gra | Grammatical relations (dependency arcs) |
| "Pho" | %pho | Phonological transcription |
| "Sin" | %sin | Syntax tier |
| "Wor" | %wor | Word-level timing (items with inline_bullet) |
| Other | %xxx | User-defined dependent tiers |
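A consumer typically looks up a tier by its type tag and then reads the data payload. The sketch below reuses the %mor and %gra example from this section; find_tier is a hypothetical helper, not part of any TalkBank API.

```python
# Utterance fragment copied from the dependent_tiers example in this section.
utterance = {
    "dependent_tiers": [
        {
            "type": "Mor",
            "data": {
                "tier_type": "Mor",
                "items": [
                    {"main": {"pos": "pron", "lemma": "I"}},
                    {"main": {"pos": "verb", "lemma": "want",
                              "features": ["Fin", "Ind", "Pres"]}},
                ],
                "terminator": ".",
            },
        },
        {
            "type": "Gra",
            "data": {
                "tier_type": "Gra",
                "relations": [
                    {"index": 1, "head": 2, "relation": "NSUBJ"},
                    {"index": 2, "head": 0, "relation": "ROOT"},
                ],
            },
        },
    ]
}


def find_tier(utterance, tier_type):
    """Return the data payload of the first tier with this type, or None."""
    # dependent_tiers is omitted when empty, so default to an empty list.
    for tier in utterance.get("dependent_tiers", []):
        if tier.get("type") == tier_type:
            return tier["data"]
    return None


mor = find_tier(utterance, "Mor")
gra = find_tier(utterance, "Gra")
pos_lemma = [(item["main"]["pos"], item["main"]["lemma"]) for item in mor["items"]]
# In the example data the root relation is the one whose head is 0.
root_index = next(rel["index"] for rel in gra["relations"] if rel["head"] == 0)
print(pos_lemma, root_index)
```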
%wor Tier
The Wor tier contains word items with timing:
{
"type": "Wor",
"data": {
"items": [
{
"kind": "word",
"raw_text": "hello",
"cleaned_text": "hello",
"content": [{ "type": "text", "content": "hello" }],
"inline_bullet": { "start_ms": 100, "end_ms": 300 }
}
],
"terminator": { "type": "period" }
}
}
Note that %wor items use "kind" instead of "type" for their discriminator,
since "type" is used by the tier envelope.
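The "kind" discriminator matters when filtering %wor items. The sketch below extracts per-word timings from the example tier; word_timings is a hypothetical helper, and skipping items that lack inline_bullet is an assumption about how untimed words appear.

```python
# Wor tier copied from the example above.
wor_tier = {
    "type": "Wor",
    "data": {
        "items": [
            {
                "kind": "word",
                "raw_text": "hello",
                "cleaned_text": "hello",
                "content": [{"type": "text", "content": "hello"}],
                "inline_bullet": {"start_ms": 100, "end_ms": 300},
            }
        ],
        "terminator": {"type": "period"},
    },
}


def word_timings(tier):
    """Yield (text, start_ms, end_ms) for timed word items in a Wor tier."""
    for item in tier["data"]["items"]:
        # Inside %wor, items are discriminated by "kind", not "type".
        if item.get("kind") != "word":
            continue
        bullet = item.get("inline_bullet")
        if bullet is None:
            continue  # assumption: untimed words simply lack the field
        yield item["cleaned_text"], bullet["start_ms"], bullet["end_ms"]


timings = list(word_timings(wor_tier))
durations = [end - start for _, start, end in timings]
print(timings, durations)
```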
Alignment Data
When validation runs (the default), the alignments object contains:
- units: per-tier index arrays (for internal bookkeeping)
- Named tier pairs (e.g., mor, gra) with alignment mappings
"alignments": {
"units": {
"main_mor": [{"index": 0}, {"index": 1}],
"main_pho": [{"index": 0}, {"index": 1}],
"main_sin": [{"index": 0}, {"index": 1}],
"main_wor": [{"index": 0}, {"index": 1}],
"mor": [{"index": 0}, {"index": 1}]
},
"mor": {
"pairs": [
{ "source_index": 0, "target_index": 0 },
{ "source_index": 1, "target_index": 1 }
],
"errors": []
}
}
Alignment links each main-tier word (source_index) to its corresponding
dependent-tier item (target_index) by position. errors contains any
alignment-level diagnostics (count mismatches, etc.) and is [] when
alignment validated cleanly.
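The pairs can be used to join main-tier words with their dependent-tier items. The following is a sketch under stated assumptions: align_words_to_mor is a hypothetical helper, and it assumes source_index refers to a pre-filtered list of alignable main-tier words; how the index relates to pauses, events, and other non-word items is not specified on this page.

```python
def align_words_to_mor(main_words, mor_items, alignments):
    """Pair main-tier words with %mor items using the documented index pairs."""
    mor_alignment = alignments.get("mor")
    if mor_alignment is None:
        return []  # alignments (or this tier pair) may be omitted entirely
    if mor_alignment["errors"]:
        raise ValueError(f"alignment reported errors: {mor_alignment['errors']}")
    return [
        (main_words[pair["source_index"]], mor_items[pair["target_index"]])
        for pair in mor_alignment["pairs"]
    ]


# Illustrative inputs: two main-tier words and their %mor items.
main_words = ["I", "want"]
mor_items = [
    {"main": {"pos": "pron", "lemma": "I"}},
    {"main": {"pos": "verb", "lemma": "want"}},
]
alignments = {
    "mor": {
        "pairs": [
            {"source_index": 0, "target_index": 0},
            {"source_index": 1, "target_index": 1},
        ],
        "errors": [],
    }
}

aligned = align_words_to_mor(main_words, mor_items, alignments)
print([(word, item["main"]["pos"]) for word, item in aligned])
```

Raising on a non-empty errors array is a design choice for the sketch; a tolerant consumer could instead log the diagnostics and use whatever pairs are present.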
Headers
Headers use the header object with a type discriminator:
| Type | Header | Key Fields |
|---|---|---|
| "utf8" | @UTF8 | (none) |
| "begin" | @Begin | (none) |
| "end" | @End | (none) |
| "languages" | @Languages | codes |
| "participants" | @Participants | entries (speaker_code, name, role) |
| "id" | @ID | language, corpus, speaker, role, age, sex, … |
| "media" | @Media | filename, media_type, status |
| "comment" | @Comment | text |
| "date" | @Date | date |
| "options" | @Options | options (array of strings) |
See the JSON Schema for the complete list of header types and fields.
Timing
Utterance-level timing appears in main.content.bullet:
"bullet": {
"start_ms": 1234,
"end_ms": 5678
}
Word-level timing (from %wor tier) appears in inline_bullet on individual
words within the Wor dependent tier.
JSON Schema
Status: Current Last modified: 2026-06-15 15:00 EDT
This repository generates JSON Schema from Rust-owned types with
schemars for the ChatFile transcript model used
by chatter to-json.
Keeping that schema generated from the Rust source of truth lets cross-language integrations consume a stable contract without re-deriving the shapes by hand.
Available schemas
| Schema | Canonical URL | Repository | Generator |
|---|---|---|---|
| ChatFile transcript model | https://talkbank.org/schemas/v0.1/chat-file.json | schema/chat-file.schema.json | cargo test --test generate_schema |
The generated schema declares both $schema (JSON Schema 2020-12) and $id
(the canonical URL above). External consumers that want to track the
current transcript-model version should follow the v0.1 URL; there is
no /latest/ alias in the generated artifacts.
Transcript schema: ChatFile
chatter to-json converts CHAT transcripts into a structured JSON form backed
by the same ChatFile model used by the parser, validator, and serializer.
How chatter to-json uses it
By default, chatter to-json:
- validates the CHAT input,
- checks dependent-tier alignment unless --skip-alignment is passed, and
- validates the emitted JSON against the schema unless --skip-schema-validation is passed.
Useful flags:
chatter to-json input.cha --skip-validation
chatter to-json input.cha --skip-alignment
chatter to-json input.cha --skip-schema-validation
chatter from-json deserializes JSON back into the internal ChatFile model
and re-serializes it to CHAT format. The input should conform to this schema.
Roundtrip expectations
The CHAT-to-JSON-to-CHAT pipeline is intended to preserve the ChatFile model:
chatter to-json input.cha -o intermediate.json
chatter from-json intermediate.json -o output.cha
diff input.cha output.cha
Both directions go through the same typed model. When changing the parser, serializer, or schema generation, confirm roundtrip behavior with the existing roundtrip test suites rather than assuming byte-for-byte identity.
Using the schema externally
Validate JSON in Python
import json
import jsonschema
import urllib.request
schema_url = "https://talkbank.org/schemas/v0.1/chat-file.json"
schema = json.loads(urllib.request.urlopen(schema_url).read())
with open("transcript.json") as f:
    data = json.load(f)
jsonschema.validate(data, schema)
IDE autocompletion
{
"$schema": "https://talkbank.org/schemas/v0.1/chat-file.json",
"lines": [],
"participants": {},
"languages": [],
"options": []
}
Generate types from the schema
Tools like quicktype, json-schema-to-typescript, and datamodel-code-generator can generate typed structs or classes from the schema for TypeScript, Python, Go, and other languages.
Regenerating the schema
After changing transcript-model types in talkbank-model:
cd chatter
cargo test --test generate_schema
This writes the checked-in schema artifact in schema/. CI already checks that
generated artifacts stay in sync.
Code references
- schema/chat-file.schema.json: generated schema
- crates/talkbank-transform/src/json.rs: schema loading and validation
- crates/talkbank-model/src/model/: Rust data model
- tests/generate_schema/: shared schema generation helpers
Diagnostic and JSON Output Contract
Status: Current Last updated: 2026-06-15 15:00 EDT
This page documents the machine-readable JSON surfaces currently exposed by the
top-level chatter CLI.
Stability policy
- Treat field names documented here as the public contract.
- Treat additional fields as additive unless this page says otherwise.
- Treat message wording as human-facing text, not a stable machine contract.
chatter validate ... --format json
Both chatter validate FILE --format json and
chatter validate DIR --format json emit newline-delimited JSON
(NDJSON) on stdout, with the same record shapes in both modes:
- zero or more per-file records (one per validated file), then
- one final summary record.
A single-file invocation still emits a file record followed by a summary record; it is not a single-object surface.
Per-file records
Valid files:
{"type":"file","file":"/path/to/file.cha","status":"valid","cache_hit":false}
Invalid files (the errors array is opaque per-error JSON; the
note field is appended when the validator stopped further checks
because of structural errors):
{
"type": "file",
"file": "/path/to/file.cha",
"status": "invalid",
"error_count": 1,
"errors": [
{
"code": "E502",
"message": "Missing @End header at end of file",
"severity": "Error"
}
],
"note": "Some additional checks may not have run because of structural errors. Fix the structural errors first, then re-validate."
}
Parser-failure files use "status":"parse_error" with an error
string. Read-failure files use "status":"read_error" with an
error string.
Summary record
{
"type": "summary",
"directory": "/path/to/dir",
"total_files": 2,
"valid": 1,
"invalid": 1,
"parse_errors": 0,
"cache_hits": 0,
"cache_misses": 2,
"cache_hit_rate": 0.0,
"cancelled": false
}
When --roundtrip is set, the summary also includes
roundtrip_passed and roundtrip_failed counters.
Contract notes
- The type field is stable: "file" or "summary".
- For file records: file and status are stable; cache_hit is stable for valid records. error_count and errors are stable for invalid records.
- For summary records: directory, total_files, valid, invalid, parse_errors, cache_hits, cache_misses, cache_hit_rate, and cancelled are stable.
- status values currently observed: valid, invalid, parse_error, read_error. New status values may appear.
- Errors do not include a byte-offset location field in the NDJSON surface; for byte-offset diagnostics use the LSP or the non-JSON renderer.
- The note field on invalid file records is human-facing guidance and may be added or omitted between releases.
- Exit code 0 means all files validated successfully; exit code 1 means at least one file failed or an I/O error occurred.
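A consumer of this stream reads one JSON object per line and separates file records from the summary. The sketch below works on a captured string rather than a live process; the sample records are illustrative, read_validation_stream is a hypothetical helper, and the code field inside each error is taken from the example above rather than from the stability list.

```python
import json

# Illustrative stdout, as if captured from `chatter validate DIR --format json`.
ndjson_output = "\n".join([
    '{"type":"file","file":"/path/to/a.cha","status":"valid","cache_hit":false}',
    '{"type":"file","file":"/path/to/b.cha","status":"invalid","error_count":1,'
    '"errors":[{"code":"E502","message":"Missing @End header at end of file","severity":"Error"}]}',
    '{"type":"summary","directory":"/path/to","total_files":2,"valid":1,"invalid":1,'
    '"parse_errors":0,"cache_hits":0,"cache_misses":2,"cache_hit_rate":0.0,"cancelled":false}',
])


def read_validation_stream(text):
    """Split NDJSON into per-file records and the final summary record."""
    files, summary = [], None
    for raw in text.splitlines():
        if not raw.strip():
            continue
        record = json.loads(raw)
        if record.get("type") == "file":
            files.append(record)
        elif record.get("type") == "summary":
            summary = record
        # Any other record type is skipped rather than treated as an error.
    return files, summary


files, summary = read_validation_stream(ndjson_output)
# Treat any status other than "valid" as a failure, since new values may appear.
failed = [rec["file"] for rec in files if rec["status"] != "valid"]
error_codes = [err["code"] for rec in files for err in rec.get("errors", [])]
print(failed, error_codes, summary["total_files"])
```

Matching on status != "valid" instead of enumerating failure values keeps the consumer working if additional status values are introduced.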
chatter to-json
chatter to-json emits the full ChatFile JSON model rather than a diagnostic
summary. The authoritative contract for that output is the JSON Schema
documented in JSON Schema.
Practical notes:
- The JSON itself is the contract, not any validation status lines printed by the CLI.
- Use -o/--output if you want only the JSON in a file.
- Use --skip-validation, --skip-alignment, or --skip-schema-validation only when you explicitly want to bypass those checks.
chatter cache stats --json
Cache statistics emit one JSON object on stdout:
{
"total_entries": 743,
"cache_dir": "/Users/example/Library/Caches/talkbank-chat",
"cache_size_bytes": 274432,
"last_modified": "2026-03-09T13:05:31+00:00"
}
Contract notes:
- total_entries, cache_dir, cache_size_bytes, and last_modified are stable.
- last_modified is RFC 3339 / ISO 8601 text.
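Reading the object from a script could look like this sketch, which parses the example output shown above. The derived values (size in KiB, bytes per entry) are illustrative calculations, not fields emitted by chatter.

```python
import json
from datetime import datetime

# Example output of `chatter cache stats --json`, as shown on this page.
stats_json = """
{
  "total_entries": 743,
  "cache_dir": "/Users/example/Library/Caches/talkbank-chat",
  "cache_size_bytes": 274432,
  "last_modified": "2026-03-09T13:05:31+00:00"
}
"""

stats = json.loads(stats_json)
# A numeric offset such as +00:00 parses directly; a trailing "Z" would need
# Python 3.11+ or manual replacement before fromisoformat accepts it.
last_modified = datetime.fromisoformat(stats["last_modified"])
size_kib = stats["cache_size_bytes"] / 1024
entries = stats["total_entries"]
bytes_per_entry = stats["cache_size_bytes"] / entries if entries else 0.0
print(last_modified.isoformat(), size_kib, round(bytes_per_entry, 1))
```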
Merge Override File Format
Status: Draft Last updated: 2026-06-15 15:00 EDT
The merge override file is the typed, human-readable record of
operator decisions in the chatter speaker-id → chatter merge
pipeline. It serves three purposes:
- Persistence: operator adjudications made for one batch can be replayed on later runs without re-prompting (chatter speaker-id --override-file <FILE> --session-id <ID>).
- Audit trail: each entry records who decided what, when, and on the basis of which Jaccard scores. Years later, a researcher can answer “why was PAR0 labeled INV in this session?” by reading the file.
- Interchange: an adjudication UI (CLI, future web app) and the batch pipeline share the same file format; UI tools can be added or replaced without changing the on-disk contract.
This page is the authoritative reference for the file’s schema.
For the usage contract (which commands read/write it, when, why),
see chatter speaker-id.
File location and naming
The file’s location is caller-chosen. The convention is one file per donor batch, named for the batch:
batch-2026-05-27-childes-eng.overrides.toml
batch-2026-06-15-fluency-pilot.overrides.toml
batch-2026-08-22-aphasiabank-bilingual.overrides.toml
Pipeline operators pass the path explicitly via --override-file;
there is no implicit search of a default location.
File format
UTF-8 TOML. The file has exactly one top-level key,
schema_version, followed by zero or more session entries, each
keyed by a session ID.
schema_version = 1
[<session_id_1>]
mode = "auto"
# ... fields per entry ...
[<session_id_2>]
mode = "explicit"
# ... fields per entry ...
The session ID is the table name. It is a free-form stable string,
typically the basename stem of the CHAT file the entry applies to
(s12-t1, Corpus2024-session-07, etc.). The TOML parser treats it
as a key; CHAT-conformant identifiers fit the unquoted-key grammar
and need no escaping, but any string is permitted if it conforms
to TOML key syntax (use quoted keys like "unusual_session-id"
if the ID contains non-bare-key characters).
Top-level fields
| Field | Type | Required | Meaning |
|---|---|---|---|
| schema_version | unsigned integer | yes | The schema version this file conforms to. Currently 1. Readers refuse files with any other value. |
The reader refuses files with schema_version absent or
unknown, returning a typed error
(OverrideFileError::UnsupportedSchemaVersion). There is no
implicit version, no fallback, no auto-migration. Operators of a
file written by a newer version of chatter must upgrade their
binary; operators of a file written by an older version that the
current binary no longer supports must re-adjudicate. This policy
is documented in
architecture/merge-domain-types.md §6;
its rationale is to keep the schema honest and avoid premature
migration code that might silently misinterpret old data.
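An external tool that reads override files can mirror this strict policy. The sketch below operates on an already-parsed document (for example the dict returned by tomllib.load on Python 3.11+); the exception class is a stand-in named after the Rust variant, and both function names are hypothetical.

```python
class UnsupportedSchemaVersion(Exception):
    """Stand-in for the reader's typed error; the name mirrors the Rust variant."""


SUPPORTED_SCHEMA_VERSION = 1


def check_schema_version(document):
    """Reject a parsed override file whose schema_version is missing or unknown."""
    version = document.get("schema_version")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(version, int) or isinstance(version, bool):
        raise UnsupportedSchemaVersion(
            f"schema_version is missing or not an integer: {version!r}"
        )
    if version != SUPPORTED_SCHEMA_VERSION:
        raise UnsupportedSchemaVersion(f"unsupported schema_version: {version}")
    return version


def session_entries(document):
    """Return every top-level table except the schema_version key."""
    return {key: value for key, value in document.items() if key != "schema_version"}


parsed = {"schema_version": 1, "s12-t1": {"mode": "auto"}}
check_schema_version(parsed)
print(sorted(session_entries(parsed)))
```

Failing fast on an absent or unexpected version keeps a third-party reader from silently misinterpreting entries, which is the same rationale given above for the built-in reader.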
Per-session entry fields
Each [<session_id>] table contains the fields below. Required
fields must be present and well-typed; optional fields may be
omitted; unknown fields cause a parse error.
Required fields
| Field | Type | Meaning |
|---|---|---|
| mode | string enum | One of "auto", "explicit", "override". How the decision was made; see “Mode semantics” below. |
| inserted_role | inline table | The CHAT identity assigned to every speaker whose mapping action is "rename". Fields: code (string, CHAT speaker code), tag (string, CHAT role-tag). |
| mapping | inline table | Map from input speaker codes to actions. Keys are speaker codes; values are "rename" or "drop". Every speaker that exists in the input CHAT file must appear in mapping. |
| operator | string | Free-form identifier of the person who created the entry (username, initials, email prefix). Recorded as audit trail. |
| decided_at | RFC 3339 datetime | When the decision was made. Must include a time zone (UTC recommended). |
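The required-field rules translate into a small pre-flight check. This is a sketch, not the reader's implementation: check_entry is a hypothetical helper, the sample values are invented, and it does not cover the rule that unknown fields are rejected.

```python
from datetime import datetime, timezone

REQUIRED_FIELDS = ("mode", "inserted_role", "mapping", "operator", "decided_at")
MODES = {"auto", "explicit", "override"}
ACTIONS = {"rename", "drop"}


def check_entry(entry, input_speakers):
    """Return a list of problems found in one parsed session entry."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in entry]
    if problems:
        return problems
    if entry["mode"] not in MODES:
        problems.append(f"unknown mode: {entry['mode']}")
    for speaker, action in entry["mapping"].items():
        if action not in ACTIONS:
            problems.append(f"unknown action for {speaker}: {action}")
    # Every speaker in the input CHAT file must appear in mapping.
    for speaker in sorted(set(input_speakers) - set(entry["mapping"])):
        problems.append(f"speaker not mapped: {speaker}")
    decided_at = entry["decided_at"]
    # TOML parsers usually hand back a datetime; it must carry a time zone.
    if not isinstance(decided_at, datetime) or decided_at.tzinfo is None:
        problems.append("decided_at must be a datetime with a time zone")
    return problems


# Illustrative entry; codes, role tag, and operator are made up.
entry = {
    "mode": "auto",
    "inserted_role": {"code": "INV", "tag": "Investigator"},
    "mapping": {"PAR0": "rename", "PAR1": "drop"},
    "operator": "jdoe",
    "decided_at": datetime(2026, 5, 27, 14, 0, tzinfo=timezone.utc),
}
print(check_entry(entry, ["PAR0", "PAR1"]))
print(check_entry(entry, ["PAR0", "PAR1", "PAR2"]))
```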
Optional fields
| Field | Type | Default | Meaning |
|---|---|---|---|
| scores | inline table | {} | Per-speaker Jaccard scores recorded at decision time. Keys are speaker codes; values are floats in [0.0, 1.0]. Populated when the decision was based on a reference-mode auto attempt (even if the final mode is "explicit" because the operator overrode a low-confidence result). |
| margin | float or string | absent | The decisive margin (winner-score / loser-score). Finite values serialize as numbers; the divide-by-zero case (loser score = 0) serializes as the string "unbounded". |
| note | string | "" | Free-text operator note. Strongly recommended for "explicit" and "override" modes, where it captures why the operator made the call. |
| flags | array of strings | [] | Operator-supplied flags marking unusual situations. Known values listed in “Flag vocabulary” below; unknown strings are preserved verbatim (treated as Custom). |
| engine | string enum | "deterministic" | Which engine produced the decision. Always written on new entries; absent only in pre-provenance files, which read as "deterministic". One of "deterministic" (Jaccard reference-mode, spreadsheet, or operator adjudication) or "llm" (language-model judgment). |
| judgment | inline table | absent | LLM audit trail. Present only when engine = "llm"; omitted for deterministic decisions. Sub-fields documented below. |
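The margin rule in the table above can be sketched in a few lines of Rust. This is a minimal illustration, not the crate's code: the Margin enum, decisive_margin, and margin_to_toml are hypothetical names, and rounding to two decimals is an assumption based on the examples in this chapter.

```rust
/// Decisive margin between two Jaccard scores (hypothetical type for this sketch).
#[derive(Debug, Clone, PartialEq)]
enum Margin {
    Finite(f64),
    Unbounded,
}

/// Compute winner-score / loser-score. A loser score of zero (including the
/// degenerate case where both scores are zero) maps to `Unbounded`.
fn decisive_margin(a: f64, b: f64) -> Margin {
    let (winner, loser) = if a >= b { (a, b) } else { (b, a) };
    if loser == 0.0 {
        Margin::Unbounded
    } else {
        Margin::Finite(winner / loser)
    }
}

/// Render the value as it would appear after `margin = ` in TOML:
/// a number for finite margins, the quoted string "unbounded" otherwise.
/// Two-decimal rounding is an assumption, not a documented guarantee.
fn margin_to_toml(m: &Margin) -> String {
    match m {
        Margin::Finite(v) => format!("{:.2}", v),
        Margin::Unbounded => "\"unbounded\"".to_string(),
    }
}

fn main() {
    // Scores taken from the operator-adjudicated example in this chapter.
    println!("margin = {}", margin_to_toml(&decisive_margin(0.6286, 0.3457)));
    // Divide-by-zero case.
    println!("margin = {}", margin_to_toml(&decisive_margin(0.5, 0.0)));
}
```

Modelling the two cases as an enum keeps the "unbounded" string out of the numeric code path, so a consumer cannot accidentally treat it as a float.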
judgment sub-table fields
The judgment inline table records the audit trail for LLM-produced
decisions. It is present if and only if engine = "llm".
| Field | Type | Required | Meaning |
|---|---|---|---|
| model | string | yes | Model identifier used for the judgment (e.g. "deepseek-v4-flash"). |
| endpoint | string | yes | OpenAI-compatible base URL the judgment was made against. |
| prompt_version | string | yes | Prompt-template version tag (e.g. "v1"). Bumping this marks older entries as produced by a prior template. |
| confidence | inline table | no (omitted when empty) | Per-field model confidence in [0.0, 1.0]. Keys are decision field names (e.g. "mapping", "roles", "merge_applicable"). Omitted entirely when no confidence values were reported. |
| reasoning | string | yes | One or two sentence model rationale for the decision. |
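The relationship between engine and judgment can be expressed as a small consistency check. The sketch below uses only the standard library; Engine, engine_from_raw, and judgment_is_consistent are hypothetical names, and whether the real reader enforces this invariant (rather than the writer merely following it) is not stated here.

```rust
/// Which engine produced a decision. Variant names mirror the documented
/// strings "deterministic" and "llm"; the type itself is a stand-in.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
enum Engine {
    #[default]
    Deterministic,
    Llm,
}

/// Interpret the raw `engine` value. An absent key (pre-provenance file)
/// reads as "deterministic", as described in the optional-fields table.
fn engine_from_raw(raw: Option<&str>) -> Result<Engine, String> {
    match raw {
        None => Ok(Engine::default()),
        Some("deterministic") => Ok(Engine::Deterministic),
        Some("llm") => Ok(Engine::Llm),
        Some(other) => Err(format!("unknown engine: {other}")),
    }
}

/// The documented invariant: `judgment` is present if and only if engine = "llm".
fn judgment_is_consistent(engine: Engine, has_judgment: bool) -> bool {
    (engine == Engine::Llm) == has_judgment
}

fn main() {
    let engine = engine_from_raw(None).unwrap();
    println!("{:?} consistent: {}", engine, judgment_is_consistent(engine, false));
}
```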
Mode semantics
The mode field records how the decision was made and is
informational only at read time: every mode applies the same
mapping deterministically. Distinguishing modes matters for
audit purposes.
| Mode | Set when | Operator confidence |
|---|---|---|
| "auto" | chatter speaker-id ran in reference mode, Jaccard margin was at or above --confidence-threshold, and the operator did not intervene. | High; the algorithm picked. |
| "explicit" | The operator supplied --mapping directly, typically after a prior reference-mode attempt failed at the confidence threshold. | Operator made the call; confidence depends on what evidence they used (listening to audio, contributor data sheet, prior knowledge). |
| "override" | The entry was created by reading a prior override file (replay). | Inherited from whichever prior decision the entry was first stamped with. The mode is updated to "override" whenever a replay re-writes the entry. |
The reader does not enforce mode → field correlations (e.g., it
does not require scores to be present when mode = "auto"). The
writer follows these conventions:
- "auto" entries always include scores and margin.
- "explicit" entries include scores and margin if and only if a prior reference-mode attempt produced them; otherwise they are absent.
- "override" entries preserve whatever scores, margin, and note were in the source file.
Mapping semantics
Each entry in mapping is one of:
- "rename": the speaker is renamed to inserted_role.code with role tag inserted_role.tag in the output CHAT file. Every utterance for this speaker has its *CODE: prefix rewritten; the @Participants entry for this speaker has its code + role-tag rewritten (preserving any intervening name); the @ID row’s code (field 3) and role (field 8) are rewritten.
- "drop": the speaker’s utterances are removed from the output entirely. The speaker’s @Participants entry and @ID row are also removed.
Precondition. Every speaker that appears in the input CHAT
file must appear in mapping. There is no defaulting; omission
is rejected with
SpeakerIdError::SpeakerNotInMapping { speaker }. This is by
design: every decision must be explicit, so a future reader
knows that no speaker was silently passed through.
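A minimal sketch of this precondition, together with the converse check on mapping keys that the reader also performs, might look like the following. The local SpeakerIdError enum is a stand-in for the real type: the variant names come from this chapter, but the payload of MappingSpeakerNotInInput and the check_mapping function are assumptions made for illustration.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Mapping action for one input speaker.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Action {
    Rename,
    Drop,
}

/// Stand-in for the real error type; field layout is assumed.
#[derive(Debug, PartialEq)]
enum SpeakerIdError {
    SpeakerNotInMapping { speaker: String },
    MappingSpeakerNotInInput { speaker: String },
}

/// Require an exact match between input speakers and mapping keys:
/// no speaker may be omitted, and no key may name an unknown speaker.
fn check_mapping(
    input_speakers: &BTreeSet<String>,
    mapping: &BTreeMap<String, Action>,
) -> Result<(), SpeakerIdError> {
    for speaker in input_speakers {
        if !mapping.contains_key(speaker) {
            return Err(SpeakerIdError::SpeakerNotInMapping { speaker: speaker.clone() });
        }
    }
    for speaker in mapping.keys() {
        if !input_speakers.contains(speaker) {
            return Err(SpeakerIdError::MappingSpeakerNotInInput { speaker: speaker.clone() });
        }
    }
    Ok(())
}

fn main() {
    let input: BTreeSet<String> = ["PAR0", "PAR1"].iter().map(|s| s.to_string()).collect();
    let mut mapping = BTreeMap::new();
    mapping.insert("PAR0".to_string(), Action::Rename);
    // PAR1 is missing, so the check fails.
    println!("{:?}", check_mapping(&input, &mapping));
    mapping.insert("PAR1".to_string(), Action::Drop);
    println!("{:?}", check_mapping(&input, &mapping));
}
```

Checking both directions means a typo in a speaker code surfaces as an error instead of silently leaving a speaker unmapped.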
The reader rejects:
- Mapping entries whose key is not a speaker present in the input (SpeakerIdError::MappingSpeakerNotInInput).
- Mapping values other than "rename" or "drop" (TOML parse error from the typed deserializer).
Flag vocabulary
The flags array contains zero or more string values. The
following are recognized vocabulary; consumers MAY treat them
specially:
| Flag | Meaning |
|---|---|
| "diarization-mixed" | The ASR diarization label being renamed actually contains multiple real-world speakers (e.g., clinician + parent collapsed). The rename is the best available approximation; downstream consumers should know the output is imperfect. |
| "best-guess" | The operator could not confidently determine which speaker is which (e.g., from audio alone). The mapping is recorded as best-guess and merits review by a domain expert before publication. |
Any other string is preserved verbatim as a contributor-specific
flag (Custom(String) in the Rust type). Consumers SHOULD NOT
crash on unknown flags but MAY surface them in audit-trail
displays.
The order of flags within an entry is not semantically meaningful; duplicates are tolerated but considered noise. Tooling that modifies the list SHOULD deduplicate.
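One way to model this vocabulary is an enum with a catch-all variant, plus an order-preserving deduplication pass for tooling that edits the list. In this sketch only Custom(String) is confirmed by the text above; the other variant names, the parse and as_str helpers, dedup_flags, and the "site-review" flag are hypothetical.

```rust
/// Flag vocabulary: two recognized values plus a verbatim catch-all.
#[derive(Debug, Clone, PartialEq, Eq)]
enum Flag {
    DiarizationMixed,
    BestGuess,
    Custom(String),
}

impl Flag {
    /// Never fails: unknown strings are preserved verbatim as `Custom`.
    fn parse(s: &str) -> Flag {
        match s {
            "diarization-mixed" => Flag::DiarizationMixed,
            "best-guess" => Flag::BestGuess,
            other => Flag::Custom(other.to_string()),
        }
    }

    /// Round-trip back to the on-disk string.
    fn as_str(&self) -> &str {
        match self {
            Flag::DiarizationMixed => "diarization-mixed",
            Flag::BestGuess => "best-guess",
            Flag::Custom(s) => s.as_str(),
        }
    }
}

/// Remove duplicates while keeping the first occurrence of each flag.
fn dedup_flags(flags: &[Flag]) -> Vec<Flag> {
    let mut out: Vec<Flag> = Vec::new();
    for f in flags {
        if !out.contains(f) {
            out.push(f.clone());
        }
    }
    out
}

fn main() {
    // "site-review" is a made-up contributor-specific flag.
    let raw = ["best-guess", "site-review", "best-guess"];
    let flags: Vec<Flag> = raw.iter().map(|s| Flag::parse(s)).collect();
    for f in dedup_flags(&flags) {
        println!("{}", f.as_str());
    }
}
```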
Reader semantics
OverrideFile::read(path) is the canonical reader. Its behavior:
- Open path as UTF-8.
- Parse via toml.
- Refuse if schema_version is absent or not equal to the binary’s CURRENT_SCHEMA_VERSION (currently 1). Error: OverrideFileError::UnsupportedSchemaVersion { found, supported }.
- Parse all [<session_id>] tables into MergeOverride values; reject unknown fields.
- Return OverrideFile { schema_version, entries }.
OverrideFile::read_or_default(path) is the variant used by
chatter speaker-id --write-override: if the file does not
exist, returns OverrideFile::default() (empty, current schema
version); otherwise behaves as read.
OverrideFile::get(&session_id) retrieves a single entry;
returns None if absent.
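The schema-version gate in the reader can be sketched in isolation. The error variant name and its two field names come from the description above; the field types (Option<i64> for found, i64 for supported) and the check_schema_version function are assumptions for illustration only.

```rust
/// Schema version this sketch accepts (the chapter states it is currently 1).
const CURRENT_SCHEMA_VERSION: i64 = 1;

/// Stand-in for the real error type; field types are assumed.
#[derive(Debug, PartialEq)]
enum OverrideFileError {
    UnsupportedSchemaVersion { found: Option<i64>, supported: i64 },
}

/// Accept only an explicit, exactly matching version. An absent version is
/// refused just like a mismatched one; there is no defaulting or migration.
fn check_schema_version(found: Option<i64>) -> Result<i64, OverrideFileError> {
    match found {
        Some(v) if v == CURRENT_SCHEMA_VERSION => Ok(v),
        other => Err(OverrideFileError::UnsupportedSchemaVersion {
            found: other,
            supported: CURRENT_SCHEMA_VERSION,
        }),
    }
}

fn main() {
    println!("{:?}", check_schema_version(Some(1)));
    println!("{:?}", check_schema_version(Some(2)));
    println!("{:?}", check_schema_version(None));
}
```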
Writer semantics
OverrideFile::write(path) serializes the file deterministically:
- Top-level field order: schema_version first.
- Entries ordered by session ID alphabetically (BTreeMap default).
- Per-entry field order: mode, inserted_role, mapping, scores, margin, operator, decided_at, note, flags, engine, judgment.
- Optional fields omitted when empty / absent.
- Atomic replace: writes to <path>.tmp then renames over <path> to avoid leaving a partial file on crash.
chatter speaker-id --write-override <path> adds or updates a single
entry: it reads the file (or starts empty), inserts or replaces the
entry for the current session, and writes the file back. The session
ID defaults to the input CHAT file’s basename stem unless
overridden via --session-id.
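The atomic-replace step and the session-ID default can both be sketched with the standard library alone. The function names write_atomically and default_session_id are hypothetical, and details such as calling sync_all before the rename are assumptions; the text above only commits to writing <path>.tmp and renaming it over <path>.

```rust
use std::fs;
use std::io::Write;
use std::path::{Path, PathBuf};

/// Default session ID: the input file's basename without its extension.
fn default_session_id(chat_path: &Path) -> Option<String> {
    chat_path.file_stem().map(|s| s.to_string_lossy().into_owned())
}

/// Write `contents` to "<path>.tmp", then rename it over `path`, so a crash
/// mid-write leaves the previous file intact instead of a partial one.
fn write_atomically(path: &Path, contents: &str) -> std::io::Result<()> {
    // Append ".tmp" to the full file name (not a replacement of the extension).
    let mut tmp_name = path.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp = PathBuf::from(tmp_name);
    {
        let mut file = fs::File::create(&tmp)?;
        file.write_all(contents.as_bytes())?;
        // Flush to disk before the rename (an assumption of this sketch).
        file.sync_all()?;
    }
    fs::rename(&tmp, path)
}

fn main() -> std::io::Result<()> {
    let target = std::env::temp_dir().join("merge-overrides-sketch.toml");
    write_atomically(&target, "schema_version = 1\n")?;
    print!("{}", fs::read_to_string(&target)?);
    println!("{:?}", default_session_id(Path::new("corpus/session-101-t1.cha")));
    fs::remove_file(&target)
}
```

The temporary file sits next to the target so the rename stays within one filesystem, which is what makes the replacement a single step.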
Example: minimal auto-mode entry
schema_version = 1
[session-101-t1]
mode = "auto"
inserted_role = { code = "INV", tag = "Investigator" }
mapping = { PAR0 = "rename", PAR1 = "drop" }
scores = { PAR0 = 0.1931, PAR1 = 0.7347 }
margin = 3.81
operator = "alice"
decided_at = 2026-05-27T08:41:00-04:00
From this entry a reader can reconstruct that the child speaker
was PAR1 (high Jaccard match with the reference’s CHI) and is
dropped, that auto-decide succeeded with a margin of 3.81×, and
that PAR0 becomes INV:Investigator in the output.
Example: operator-adjudicated entry
After a low-confidence refusal, the operator listened to the
audio, confirmed the call, and re-ran with --mapping:
[session-102-t1]
mode = "explicit"
inserted_role = { code = "INV", tag = "Investigator" }
mapping = { PAR0 = "drop", PAR1 = "rename" }
scores = { PAR0 = 0.6286, PAR1 = 0.3457 }
margin = 1.82
operator = "alice"
decided_at = 2026-05-27T11:15:00-04:00
note = "Auto refused at 2.0× threshold. Listened to first 60 seconds; PAR0 produces child-content matching the hand transcript. PAR1 introduces herself as the clinician."
The scores from the prior auto attempt are preserved; the note captures why the operator was confident in the call despite the close margin. Years later, a researcher can verify the decision by listening to the same 60 seconds and confirming the operator’s observation, so the audit trail is reproducible.
Example: diarization-mixed parent sample
[session-103-t1-parent]
mode = "explicit"
inserted_role = { code = "MOT", tag = "Mother" }
mapping = { PAR0 = "rename", PAR1 = "drop" }
scores = { PAR0 = 0.3727, PAR1 = 0.6940 }
margin = 1.86
operator = "alice"
decided_at = 2026-05-27T11:22:00-04:00
note = "Parent sample. Per contributor data sheet: mother. PAR0 contains clinician intro + parent mixed (Batchalign diarization limitation)."
flags = ["diarization-mixed"]
The flags = ["diarization-mixed"] warns downstream consumers
that the renamed MOT speaker is not a clean parent-only stream:
the first ~15 seconds were the clinician giving setup
instructions before leaving the room. The note captures the
specifics for future review.
Example: replayed entry
The same session re-run on a different day, replaying the decision from the override file:
[session-102-t1]
mode = "override"
inserted_role = { code = "INV", tag = "Investigator" }
mapping = { PAR0 = "drop", PAR1 = "rename" }
scores = { PAR0 = 0.6286, PAR1 = 0.3457 }
margin = 1.82
operator = "alice"
decided_at = 2026-05-27T11:15:00-04:00
note = "Auto refused at 2.0× threshold. Listened to first 60 seconds; PAR0 produces child-content matching the hand transcript. PAR1 introduces herself as the clinician."
mode becomes "override" whenever the entry is re-applied by
reading the file. The other fields (including the original
operator and decided_at) are preserved: the override file
is the audit trail of the original decision, not of the
replay.
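The replay rule amounts to a one-field update. The sketch below uses a deliberately reduced Entry struct (only four of the documented fields, with decided_at kept as a plain string); the struct, the Mode enum, and the replay function are hypothetical and stand in for the real MergeOverride type.

```rust
/// How the decision was made, as recorded in the `mode` field.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Mode {
    Auto,
    Explicit,
    Override,
}

/// Reduced entry for illustration; the real entry has more fields.
#[derive(Debug, Clone, PartialEq)]
struct Entry {
    mode: Mode,
    operator: String,
    decided_at: String,
    note: String,
}

/// Replaying stamps the mode as `Override` and copies everything else,
/// so the original operator and timestamp remain the audit trail.
fn replay(entry: &Entry) -> Entry {
    Entry { mode: Mode::Override, ..entry.clone() }
}

fn main() {
    let original = Entry {
        mode: Mode::Explicit,
        operator: "alice".to_string(),
        decided_at: "2026-05-27T11:15:00-04:00".to_string(),
        note: "Auto refused at 2.0x threshold.".to_string(),
    };
    println!("{:?}", replay(&original));
}
```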
TOML grammar reference
For consumers writing the file by hand or generating it from other tools, the grammar is standard TOML 1.0 (toml.io) with the following domain-specific conventions:
- Datetimes use RFC 3339 with explicit time zone. UTC offset Z and offsets like -04:00 are both accepted.
- Floats: standard TOML float syntax. The margin field accepts either a float or the string "unbounded".
- Tables vs inline tables: top-level [<session_id>] tables may use either standard or inline syntax; the writer emits standard tables for readability.
- Comments: TOML # line comments are permitted anywhere; the reader ignores them. The writer does not preserve comments across read-modify-write cycles (toml, not toml_edit); hand-edited comments may be lost on subsequent --write-override runs. If preserving comments becomes important, the writer can be swapped for toml_edit in a future release.
Future schema changes
Schema version increments will appear here under “Migration” with
the version-to-version diff and migration instructions. Until
then, this is the only schema; the policy is strict
refuse-with-clear-error on any other schema_version value.
2026-06 additive fields: engine and judgment (no version bump)
The engine and judgment fields were added in 2026-06 to record
decision provenance (deterministic vs LLM). This addition did NOT
increment schema_version because both fields are backward
compatible in both directions:
- Old reader, new file: TOML deny_unknown_fields is not set globally; older binaries that parse a file containing engine and judgment will silently ignore the unknown keys. The decision itself (mode, mapping, inserted_role) is unaffected.
- New reader, old file: engine has #[serde(default)] and defaults to "deterministic"; judgment has skip_serializing_if = "Option::is_none" and is absent, which deserializes as None. Pre-provenance files are therefore readable without error and are treated as deterministic decisions.
A future version bump would be warranted only if a change makes old files unreadable or misinterpretable, neither of which applies here.
Relationship to JSON Schema
The Rust OverrideFile type is implemented (in
talkbank-transform, src/speaker_id/override_file.rs) and drives
the override-file replay workflow today. What is not yet built is its
JSON Schema export: OverrideFile does not yet derive
schemars::JsonSchema, so no schema is generated, and the canonical
URL https://talkbank.org/schemas/v0.1/merge-overrides.json is
reserved but not yet published. Exposing it follows the same
schemars-based generator pattern documented in
JSON Schema.
The TOML form is the on-disk format; JSON Schema is the
machine-readable spec for external tooling. Both describe the
same OverrideFile Rust type.