Parser, Model, and API Contracts
Status: Current Last updated: 2026-06-21 21:33 EDT
Single-handle parser API
talkbank-parser provides TreeSitterParser as the canonical API
handle for all parsing, full-file and fragment methods live directly
on the struct. Callers create one instance and pass
&TreeSitterParser everywhere. The alternate talkbank-parser-re2c
is opt-in (specification oracle and high-throughput batch parsing)
and produces the same ChatFile model.
Contract for Batchalign
The Batchalign runtime (the batchalign crate) consumes these
guarantees from the talkbank-* core crates:
- parsing produces a typed
ChatFileor an explicit parse-status signal - parse-health taint is visible to alignment consumers
- alignment helpers operate on semantic model types, not raw text hacks
- recovery never fabricates valid-looking placeholder semantics for malformed input
The parser/model boundary stays honest enough for downstream
workflows, align, compare, benchmark, morphotagging, to make
their own validity decisions.
Canonical Contract Model
Public Contract Layers
- Parse API Contract:
- stable function signatures,
- deterministic parse result envelope,
- clear partial-success semantics.
- Semantic Model Contract:
- stable core model fields,
- explicit unstable/internal fields policy.
- Diagnostic Contract:
- stable error code IDs and severity semantics,
- best-effort message text compatibility.
- Serialization Contract:
- deterministic output constraints,
- normalized formatting policy.
Required Types
ParseOutcome<T>value: T | omitted-by-statusdiagnostics: Vec<Diagnostic>status: Success | Partial | Failed
Diagnosticcode,severity,category,message,location,context,suggestion
Parser Role
talkbank-parser: the sole parser, used by CLI/LSP/API/batchalign3.TreeSitterParseris the only API handle, callers create one and pass&TreeSitterParsereverywhere.- Tree-sitter GLR provides error recovery; the Rust traversal code converts CST to typed model.
- Full-file methods:
parser.parse_chat_file(),parser.parse_chat_file_streaming(). - Fragment methods:
parser.parse_word_fragment(),parser.parse_main_tier_fragment(), etc.
Invariants
- Parsing with offset must shift all spans consistently.
- Parse-level and validation-level diagnostics must remain distinguishable.
- Serialization should preserve semantic equivalence and documented formatting rules.
- Roundtrip behavior must be testable per parser implementation.
- Parser functions that accept
ErrorSinkshould not returnOption<T>for fallible parse state.
API Versioning Policy (Pre-1.0, Strict)
- Three intended contract levels:
- Stable-for-integrators
- Stable-internal
- Experimental
- Mark every public function/type by contract level.
This classification is not yet codified in a separate manifest file; the levels above are the working policy. Integrators should treat any unmarked surface as Experimental until contract levels are formally published.
Acceptance Criteria
- Single canonical parse outcome envelope exposed for integrators.
- Parser implementations conform to shared contract tests.
- Contract-level annotations exist for all public API surfaces.
- Documentation for parse/validate/serialize lifecycle is centralized and current.
Recovery Contract: No Fabricated Semantic Values
The parser contract must forbid sentinel semantic values during error recovery.
Disallowed recovery behavior:
- returning arbitrary enum variants as fallback for unknown/missing nodes,
- returning empty strings as stand-ins for required fields,
- constructing fake words/chunks like
"missing","error", or other placeholders.
Required recovery behavior:
- Emit structured diagnostic with precise span and expected node kind.
- Return an explicit parse-status signal (
Partial/Failed) throughParseOutcome. - Omit invalid semantic node OR store it in explicit recovery metadata, never as a valid semantic value.
Current enforcement:
- CI guardrail script tracks and blocks introduction of new
ErrorSink + Optionsignatures. - See
scripts/check-errorsink-option-signatures.shandscripts/errorsink_option_allowlist.txt.
Rationale:
- fabricated semantic values create secondary, misleading diagnostics against synthetic data,
- downstream tools cannot distinguish real user content from parser-generated placeholders,
- equivalence and regression tests become noisy and non-actionable.
For batchalign3, this is especially important because alignment workflows
must be able to tell the difference between:
- a malformed input that should taint or block alignment
- a recoverable input where raw text can be preserved
- a clean input that should proceed through the align/compare pipeline
String Storage Policy
The model uses three string storage strategies:
Arc<str>interning (interned_newtype!): For high-frequency repeated values (POS tags, stems, speaker codes). Global interner avoids redundant allocations.SmolStr(string_newtype!): For short strings (median 10-15 chars) that benefit from inline storage. O(1) clone, no heap allocation for strings ≤23 bytes.String: Only for utility types outside the core model (e.g.,semantic_diff/).