Parser Leniency Policy

Status: Current
Last updated: 2026-06-15 13:08 EDT

This document is the single source of truth for how the tree-sitter grammar, Rust validation layer, and CLI tooling divide responsibility for enforcing the CHAT specification. It consolidates decisions scattered across grammar.js comments, analysis documents, and code.

Scope: Documentation only. This document does not implement new validation rules. It records what exists and what is intentionally absent, and it proposes a roadmap for closing gaps.


Philosophy: Parse, Don’t Validate

The tree-sitter grammar intentionally accepts a superset of valid CHAT. The rationale:

  1. Maximise parse coverage: Real-world .cha files contain legacy patterns, whitespace variations, and edge cases. A grammar that rejects them produces no AST and therefore no diagnostics. Accepting them gives the validation layer something to work with.

  2. Separate syntax from semantics: The grammar captures structure (headers, utterances, tiers, annotations). The Rust validation layer enforces semantic rules (required headers, participant declarations, alignment counts).

  3. Enable configurable strictness: Different consumers need different policies. A roundtrip pipeline can be strict; an editor providing live diagnostics should be lenient. Validation profiles (see Validation Profile Infrastructure) make this possible.
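The split between a lenient parse and a strict validation pass can be sketched as follows. This is a minimal illustration, not the project's code: `ParsedFile`, `parse_lenient`, and `validate_headers` are hypothetical names, and the check covers header presence only (it does not verify that @UTF8 is the first line). The error codes are the ones listed in the leniency matrix.

```rust
/// Hypothetical, heavily simplified stand-in for a parsed CHAT file.
#[derive(Debug, Default)]
struct ParsedFile {
    has_utf8: bool,
    has_begin: bool,
    has_end: bool,
}

/// Lenient "parse": records which headers are present and never rejects input.
fn parse_lenient(source: &str) -> ParsedFile {
    let mut file = ParsedFile::default();
    for line in source.lines() {
        match line.trim_end() {
            "@UTF8" => file.has_utf8 = true,
            "@Begin" => file.has_begin = true,
            "@End" => file.has_end = true,
            _ => {} // every other line is accepted as-is
        }
    }
    file
}

/// Strict validation over the lenient parse result.
/// Returns the error codes for each missing required header.
fn validate_headers(file: &ParsedFile) -> Vec<&'static str> {
    let mut codes = Vec::new();
    if !file.has_utf8 {
        codes.push("E503");
    }
    if !file.has_begin {
        codes.push("E504");
    }
    if !file.has_end {
        codes.push("E502");
    }
    codes
}
```

Because the parse step always yields a structure, a file missing both @UTF8 and @End still produces two precise diagnostics instead of a single parse failure.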

Three-Tier Classification

Every intentional leniency decision falls into one of three tiers:

| Tier | Label | Meaning |
| --- | --- | --- |
| A | Parse-lenient + validate-strict | Grammar accepts it; validation rejects it as an error |
| B | Parse-lenient + validate-warning | Grammar accepts it; validation emits a warning |
| C | Parse-lenient only | Grammar accepts it; no validation needed: the construct is genuinely optional or the broad acceptance is by design |

This classification was proposed in an earlier grammar governance analysis and is formalised here.
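Read as data, the three tiers are a mapping from tier to validation outcome. A minimal sketch, assuming two severity levels; the `Tier` and `Severity` types below are illustrative and not taken from the codebase:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Severity {
    Warning,
    Error,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tier {
    A,
    B,
    C,
}

impl Tier {
    /// What validation emits for a construct the grammar accepted leniently.
    /// `None` means no diagnostic is produced at all.
    fn validation_outcome(self) -> Option<Severity> {
        match self {
            Tier::A => Some(Severity::Error),
            Tier::B => Some(Severity::Warning),
            Tier::C => None,
        }
    }
}
```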


Leniency Matrix

Master table of every documented leniency decision in the grammar. The Status column indicates whether downstream validation compensates for the grammar’s permissiveness.

| # | Grammar Construct | Spec Requirement | Grammar Behavior | Tier | Validation | Error Code | Status |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | @UTF8 header | Required, must be first line | Optional (not enforced) | A | Validated | E503 | OK |
| 2 | @Begin header | Required | Optional (grammar.js ~L104) | A | Validated | E504 | OK |
| 3 | @End header | Required | Optional (grammar.js ~L106) | A | Validated | E502 | OK |
| 4 | Pre-first-utterance header order | No enforced order (matches CLAN CHECK) | choice(), any order (grammar.js ~L122-135) | C | N/A (by design) | - | OK |
| 5 | Headers after utterances | Allowed (e.g. @Bg, @Eg, @G, @Comment) | Interleaved freely | C | N/A (by design) | - | OK |
| 6 | Content type context restrictions | Unified across contexts | Unified base_content_item (grammar.js ~L731-738) | C | N/A (by design); specific semantic rules (E371, E372) exist separately | - | OK |
| 7 | Terminator presence | Required (except CA mode) | Optional (grammar.js ~L691-692) | A | Validated | E305 | OK |
| 8 | Bare shortening as word | CA mode only | Accepted anywhere | A | Validated | E2xx | OK |
| 9 | Trailing whitespace in annotations | Not specified | Optional trailing space (grammar.js ~L957, 966, 975, 1004, 1013) | C | N/A | - | OK |
| 10 | MOR segment Unicode | Very permissive (broad language support) | Exclusion-based regex (grammar.js ~L1909-1915) | C | N/A (by design) | - | OK |
| 11 | MOR fusional suffixes with hyphens | ALNUM + IPA only | Allows hyphens (grammar.js ~L1942-1945) | C | N/A (by design) | - | OK |
| 12 | MOR nested translations | No nested structures | Allows () and [] nesting (grammar.js ~L1954-1966) | C | N/A (by design) | - | OK |
| 13 | Linkers / language codes | Truly optional | Optional | C | N/A | - | OK |
| 14 | Word annotations | Truly optional | Optional | C | N/A | - | OK |
| 15 | Media bullet | Truly optional | Optional | C | N/A | - | OK |
| 16 | Group whitespace (leading/trailing) | No whitespace inside < > | Optional (grammar.js ~L1097, 1099) | C | N/A | - | OK |
| 17 | Long feature label characters | Limited character set | /[A-Za-z0-9@%_-]+/ (grammar.js ~L1327) | C | N/A | - | OK |
| 18 | Catch-all headers ($.anything) | Structured content for some headers | /[^\r\n]+/ for ~19 header types | C | N/A (content is opaque) | - | OK |
| 19 | Header gap whitespace | Single space/tab | repeat1(choice(space, tab)) (grammar.js ~L467, 477, 489) | C | N/A | - | OK |
| 20 | @Types header whitespace | No spaces around commas | Optional whitespace around commas (grammar.js ~L584-592) | C | N/A | - | OK |

Permissiveness Regression Decisions

During development, several validation rules were tightened and then relaxed after they produced false positives against the reference corpus. These decisions are documented in the permissiveness regression log (archived). Each is summarised here with its rationale.

Decision 1: [*] bare annotation, E214 disabled

  • Previous behaviour: E214 emitted when [*] appeared without an explicit error code (empty ContentAnnotation::Error).
  • Current behaviour: Bare [*] is accepted without error.
  • Implementation: Removed validation branch in talkbank-model/src/model/annotation/annotated.rs.
  • Rationale: Reference files (errormarkers.cha, compound.cha) use bare [*] as valid CHAT.
  • Revisit: If coded error annotations become required, do it behind an explicit strict profile.

Decision 2: @t without @s:<lang>, E248 disabled

  • Previous behaviour: E248 emitted for @t markers without an explicit language marker.
  • Current behaviour: @t accepted without requiring @s:<lang>.
  • Implementation: Removed checks in talkbank-model/src/validation/word/structure.rs.
  • Rationale: Reference file formmarkers.cha contains a@t and is expected to be valid.
  • Revisit: Scope to explicit strict validation mode if desired.

Decision 3: Undeclared inline language codes, E254 re-introduced as warning

  • Original behaviour: Inline @s:... markers with language codes not declared in @Languages emitted E254 as an error.
  • Intermediate behaviour: E254 was disabled and the code removed from the codebase to keep reference file lang-marker.cha valid.
  • Current behaviour: E254 (UndeclaredExplicitWordLanguage) is back in the registry at crates/talkbank-model/src/errors/codes/error_code.rs:321 and emitted at crates/talkbank-model/src/validation/word/language/resolve.rs:195, but as a warning rather than an error. This was paired with the introduction of E255 (WholeUtteranceLanguageSwitchShouldUsePrecode) for whole-utterance @s runs that should use [- lang] precodes.
  • Why it returned: Heterogeneous corpora (Cantonese, Polish, Czech, Spanish, HK bilingual) made the warn-only signal load-bearing for catching @s:LANG markers that disagreed with @Languages. The warning surfaces the inconsistency without blocking the file.
  • Revisit: If the warn-only signal turns out to be ignored in practice, decide between escalating back to error severity or removing.
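The warn-only behaviour of Decision 3 can be sketched as a membership check against the declared languages. This is an illustration under stated assumptions: `check_inline_languages` and `Diagnostic` are hypothetical, and the real check in `resolve.rs` is not reproduced here.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Severity {
    Warning,
    Error,
}

#[derive(Debug, PartialEq, Eq)]
struct Diagnostic {
    code: &'static str,
    severity: Severity,
    message: String,
}

/// Emits an E254 warning for each inline `@s:<lang>` code that is
/// not listed in `@Languages`. The file is never rejected for this.
fn check_inline_languages(declared: &[&str], inline_codes: &[&str]) -> Vec<Diagnostic> {
    let declared: HashSet<&str> = declared.iter().copied().collect();
    inline_codes
        .iter()
        .filter(|code| !declared.contains(**code))
        .map(|code| Diagnostic {
            code: "E254",
            severity: Severity::Warning,
            message: format!("language code `{code}` is not declared in @Languages"),
        })
        .collect()
}
```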

Decision 4: Mixed-language digit legality, permissive-any rule

  • Previous behaviour: Digits had to be legal in all applicable languages for mixed/ambiguous markers.
  • Current behaviour: Digits accepted if legal in at least one applicable language.
  • Implementation: Changed from is_valid_in_all() to any() in talkbank-model/src/validation/word/language/digits.rs.
  • Rationale: Prevents false positives in mixed-language reference examples.
  • Revisit: Confirm spec intent for mixed/ambiguous validation semantics.
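The difference between the previous and current digit rules comes down to `all` versus `any` over the applicable languages. A minimal sketch, assuming the per-language legality test is supplied by the caller; both function names and the `is_legal` predicate are hypothetical:

```rust
/// Previous rule: digits must be legal in every applicable language.
fn digits_legal_strict<F: Fn(&str) -> bool>(langs: &[&str], is_legal: F) -> bool {
    langs.iter().all(|lang| is_legal(*lang))
}

/// Current rule: digits are accepted if legal in at least one applicable language.
fn digits_legal_permissive<F: Fn(&str) -> bool>(langs: &[&str], is_legal: F) -> bool {
    langs.iter().any(|lang| is_legal(*lang))
}
```

One edge case worth noting when revisiting this decision: with an empty language list, `all` returns true and `any` returns false, so the two rules differ even when no language applies.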

Decision 5: @Bg nesting, same-label only

  • Previous behaviour: Any nested @Bg while another gem scope was open emitted E529.
  • Current behaviour: E529 only fires when nesting the same label (or same unlabeled scope key). Different labels may nest hierarchically.
  • Implementation: Changed from any_scope_open to same_scope_open in talkbank-model/src/validation/header/structure.rs.
  • Rationale: Avoids false positives on hierarchical markup patterns (e.g., HSLLD corpus).
  • Revisit: Decide whether nesting policy should be global or per-label.
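The same-label rule can be sketched with a list of currently open scopes. This is an illustration only: `GemEvent` and `nested_same_label` are hypothetical, and `None` stands in for the unlabeled scope key.

```rust
/// Hypothetical gem events: `@Bg` opens a scope and `@Eg` closes it.
/// `None` represents the unlabeled scope.
enum GemEvent<'a> {
    Begin(Option<&'a str>),
    End(Option<&'a str>),
}

/// Returns the indices of `@Bg` events that would trigger E529
/// under the same-label rule.
fn nested_same_label(events: &[GemEvent]) -> Vec<usize> {
    let mut open: Vec<Option<&str>> = Vec::new();
    let mut offenders = Vec::new();
    for (index, event) in events.iter().enumerate() {
        match event {
            GemEvent::Begin(label) => {
                // Only a scope with the same label counts as illegal nesting.
                if open.contains(label) {
                    offenders.push(index);
                }
                open.push(*label);
            }
            GemEvent::End(label) => {
                // Close the most recently opened scope with this label, if any.
                if let Some(pos) = open.iter().rposition(|l| l == label) {
                    open.remove(pos);
                }
            }
        }
    }
    offenders
}
```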

Decision 6: Temporal bullets in CA mode, skipped

  • Previous behaviour: E701/E704 temporal checks ran even for CA-mode files.
  • Current behaviour: Temporal constraints are skipped when the file is in CA mode.
  • Implementation: validate_temporal_constraints() early-returns when ca_mode is true (talkbank-model/src/validation/temporal.rs).
  • Rationale: CA reference files include patterns that triggered false monotonicity/self-overlap diagnostics.
  • Revisit: Implement CA-specific temporal policy rather than global skip.
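The CA-mode skip is an early return ahead of the temporal checks. A minimal sketch using a monotonicity check on bullet start times as a stand-in; `count_monotonicity_violations` is a hypothetical helper, not the signature of `validate_temporal_constraints()`:

```rust
/// Counts bullets whose start time precedes the previous bullet's start time.
/// In CA mode the check is skipped entirely, mirroring the global skip.
fn count_monotonicity_violations(starts_ms: &[u64], ca_mode: bool) -> usize {
    if ca_mode {
        return 0; // early return: no temporal diagnostics for CA-mode files
    }
    starts_ms
        .windows(2)
        .filter(|pair| pair[1] < pair[0])
        .count()
}
```

A CA-specific policy, as suggested under Revisit, would replace the early return with a branch that runs a different set of checks.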

Decision 7: Pipeline severity threshold, errors only

  • Previous behaviour: Any validation diagnostic (including warnings) caused PipelineError::Validation.
  • Current behaviour: Pipeline returns failure only if at least one diagnostic has Severity::Error.
  • Implementation: talkbank-transform/src/pipeline/parse.rs.
  • Rationale: Warnings should not block parse/transform/export pipelines.
  • Revisit: Keep as default; add explicit --strict flag/profile if needed.
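The errors-only threshold can be sketched as a gate over the collected diagnostics. The types below are simplified stand-ins (the real `PipelineError` and diagnostic types are not reproduced here), and `gate` is a hypothetical name:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Severity {
    Warning,
    Error,
}

#[derive(Debug)]
struct Diagnostic {
    code: &'static str,
    severity: Severity,
}

#[derive(Debug)]
enum PipelineError {
    Validation(Vec<Diagnostic>),
}

/// Fails only when at least one diagnostic has error severity.
/// Warnings are passed through so callers can still report them.
fn gate(diagnostics: Vec<Diagnostic>) -> Result<Vec<Diagnostic>, PipelineError> {
    if diagnostics.iter().any(|d| d.severity == Severity::Error) {
        Err(PipelineError::Validation(diagnostics))
    } else {
        Ok(diagnostics)
    }
}
```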

Decision 8: Spacing warnings W210/W211, disabled

  • Previous behaviour: Style-level spacing warnings around terminators and overlap markers.
  • Current behaviour: Checks removed from core main-tier validation path.
  • Implementation: check_spacing_warnings() invocation removed from talkbank-model/src/model/content/main_tier.rs.
  • Rationale: The checks generated unexpected diagnostics on files treated as valid in the reference workflow.
  • Revisit: Reintroduce as optional lint profile, not core validator hard path.

Validation Gap Roadmap

Concrete items where the grammar is lenient and validation was found, or suspected, not to compensate. Each entry records the relevant error code, its priority, and its current status.

Priority 1: @UTF8 Presence (E503), DONE

  • Grammar: @UTF8 is optional.
  • Spec: Required, must be the first line.
  • Implemented: E503 (MissingUTF8Header) added to check_headers() in talkbank-model/src/validation/header/structure.rs.
  • Severity: Error.
  • Note: All 340 reference corpus files contain @UTF8, so the new check has zero roundtrip impact.

Priority 2: Pre-First-Utterance Header Order (proposed E534), Not a Gap

  • Grammar: choice() accepts headers in any order between @Begin and the first utterance.
  • Assessment: CLAN CHECK does not enforce any ordering for post-@Begin headers; it validates presence and format only. Our grammar’s flexible ordering matches CHECK’s behavior.
  • Status: Reclassified from Tier B (GAP) to Tier C (by design).

Priority 3: Content Type Context Validation, Not a Gap

  • Grammar: Unified base_content_item accepts any content type in any context.
  • Assessment: The unified rule is correct by design. Nested groups are legal CHAT (e.g., <the <dag> [: dog]> [= something]). The two specific semantic restrictions that do exist (no pauses in pho groups, E371; no nested quotations, E372) are already validated.
  • Status: Reclassified from Tier A (PARTIAL) to Tier C (by design).

Validation Profile Infrastructure

What Exists

ValidationConfig (talkbank-model/src/errors/config.rs)

Builder-pattern configuration for per-error-code severity overrides.

```rust
let config = ValidationConfig::new()
    .downgrade(ErrorCode::IllegalUntranscribed, Severity::Warning)
    .disable(ErrorCode::InvalidOverlapIndex)
    .upgrade(ErrorCode::UnknownAnnotation, Severity::Error);
```

API:

  • new(): empty config, all codes use original severity
  • downgrade(code, severity): lower severity (chainable)
  • disable(code): suppress entirely (chainable)
  • upgrade(code, severity): raise severity (chainable)
  • set_severity(code, Option<Severity>): set or disable (chainable)
  • effective_severity(code, original) -> Option<Severity>: query
  • is_disabled(code) -> bool: check

Pre-built profiles:

  • lenient(): downgrades IllegalUntranscribed and InvalidOverlapIndex to Severity::Warning. Designed for gradual migration of legacy corpora.
  • strict(): escalates unmapped warnings to errors (sets upgrade_unmapped_warnings, honored by effective_severity). Explicit per-code overrides still take precedence, so a caller can opt a specific code back to Severity::Warning.
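The precedence described above (explicit override first, then the strict escalation, then the original severity) can be modelled in a few lines. This is a self-contained sketch, not the actual `ValidationConfig`: `ConfigSketch` is hypothetical, it assumes only two severity levels, and its internal representation is a guess.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Severity {
    Warning,
    Error,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum ErrorCode {
    IllegalUntranscribed,
    InvalidOverlapIndex,
    UnknownAnnotation,
}

#[derive(Debug, Default)]
struct ConfigSketch {
    /// `Some(sev)` overrides the severity; `None` disables the code.
    overrides: HashMap<ErrorCode, Option<Severity>>,
    upgrade_unmapped_warnings: bool,
}

impl ConfigSketch {
    /// Strict profile: codes without an explicit override are escalated.
    fn strict() -> Self {
        Self { upgrade_unmapped_warnings: true, ..Self::default() }
    }

    /// Chainable: set an override, or disable the code with `None`.
    fn set_severity(mut self, code: ErrorCode, severity: Option<Severity>) -> Self {
        self.overrides.insert(code, severity);
        self
    }

    /// Explicit overrides win; otherwise strict mode escalates to Error.
    fn effective_severity(&self, code: ErrorCode, original: Severity) -> Option<Severity> {
        match self.overrides.get(&code) {
            Some(explicit) => *explicit,
            None if self.upgrade_unmapped_warnings => Some(Severity::Error),
            None => Some(original),
        }
    }
}
```

In this model, a strict config with one code explicitly set back to `Severity::Warning` keeps that code as a warning while every other warning becomes an error.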

ConfigurableErrorSink (talkbank-model/src/errors/configurable_sink.rs)

Wrapper that intercepts errors and applies ValidationConfig before forwarding to an inner ErrorSink.

```rust
let inner = ErrorCollector::new();
let sink = ConfigurableErrorSink::new(&inner, config);
// Pass `sink` to the parser/validator. Disabled errors are filtered out
// and severity overrides are applied before forwarding to `inner`.
```
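The wrapper pattern itself can be sketched independently of the real types. Everything below is hypothetical (`Sink`, `Collector`, `FilteringSink`, and the string codes are placeholders); it only illustrates how an intercepting sink drops disabled diagnostics and rewrites severities before forwarding.

```rust
use std::cell::RefCell;
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Severity {
    Warning,
    Error,
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct Diagnostic {
    code: &'static str,
    severity: Severity,
}

trait Sink {
    fn report(&self, diagnostic: Diagnostic);
}

/// Inner sink that simply stores what it receives.
#[derive(Default)]
struct Collector {
    seen: RefCell<Vec<Diagnostic>>,
}

impl Sink for Collector {
    fn report(&self, diagnostic: Diagnostic) {
        self.seen.borrow_mut().push(diagnostic);
    }
}

/// Wrapper that applies per-code overrides before forwarding.
/// `Some(sev)` rewrites the severity; `None` suppresses the diagnostic.
struct FilteringSink<'a, S: Sink> {
    inner: &'a S,
    overrides: HashMap<&'static str, Option<Severity>>,
}

impl<S: Sink> Sink for FilteringSink<'_, S> {
    fn report(&self, mut diagnostic: Diagnostic) {
        match self.overrides.get(diagnostic.code) {
            Some(None) => {} // disabled: drop silently
            Some(Some(severity)) => {
                diagnostic.severity = *severity;
                self.inner.report(diagnostic);
            }
            None => self.inner.report(diagnostic), // no override: forward unchanged
        }
    }
}
```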

Runner-Level Flags (talkbank-transform, chatter)

| Flag | Effect |
| --- | --- |
| --skip-alignment | Skip tier alignment validation |
| --roundtrip | Test serialization idempotency after validation |
| --force | Clear cache for path and revalidate |
| --max-errors N | Stop after N errors |

What Is Missing

| Gap | Description | Effort |
| --- | --- | --- |
| No --profile CLI flag | Users cannot select strict / lenient / lint from the command line | Medium |
| ConfigurableErrorSink not wired into validation pipeline | Infrastructure exists but is not used by chatter validate | Medium |
| No lint-style profile | Spacing/style warnings (W210, W211) have no home | Small (once profiles are wired) |
| No profile serialization | Cannot load profiles from TOML/JSON config files | Medium |
| No corpus-specific profiles | E.g., HSLLD-specific rules | Future |

Proposed Profiles

From the permissiveness regression log:

| Profile | Purpose | Behaviour |
| --- | --- | --- |
| reference-compatible | Current permissive baseline | Default, matches current validation behaviour |
| strict-chat | Full spec enforcement | Re-enable selected tightenings (E214, E248, E254, etc.) |
| lint-style | Spacing/style warnings only | Enable W210, W211; do not fail pipeline |

The roundtrip gate should be pinned to an agreed profile to prevent future ambiguity about what “pass” means.
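If a `--profile` flag is added, the three proposed names could be parsed into a closed set so that the roundtrip gate can name its profile explicitly. A minimal sketch; the `Profile` enum and its parsing are hypothetical, since no such flag exists yet:

```rust
use std::str::FromStr;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Profile {
    ReferenceCompatible,
    StrictChat,
    LintStyle,
}

impl Default for Profile {
    /// The proposal treats `reference-compatible` as the default.
    fn default() -> Self {
        Profile::ReferenceCompatible
    }
}

impl FromStr for Profile {
    type Err = String;

    /// Parses the profile names used in the proposal table.
    fn from_str(name: &str) -> Result<Self, Self::Err> {
        match name {
            "reference-compatible" => Ok(Profile::ReferenceCompatible),
            "strict-chat" => Ok(Profile::StrictChat),
            "lint-style" => Ok(Profile::LintStyle),
            other => Err(format!("unknown profile `{other}`")),
        }
    }
}
```

Rejecting unknown names, instead of falling back to the default, keeps the meaning of "pass" unambiguous for whichever profile the gate is pinned to.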


Silent Recovery Points (NLP Pipelines)

An earlier Python-Rust boundary audit identified several places where batchalign-core silently massages data without diagnostics. These are related to leniency because they represent permissive acceptance without transparency.

| Pipeline | Recovery Mechanism | Diagnostics? |
| --- | --- | --- |
| Stanza morphosyntax | retokenize.rs DP alignment; Word::new_unchecked fallback | No |
| Whisper/Wave2Vec FA | forced_alignment.rs DP “best fit” | No |
| Google Translate | Imported verbatim into %xtra | No filtering |
| Stanza segmentation | Silent abort on assignment mismatch | No |

Key infrastructure gap: ParseHealth exists in talkbank-model (per-utterance tier cleanliness flags with taint(), is_clean(), can_align_main_to_mor() methods). It is used by the tree-sitter and direct parsers during parsing. However, batchalign-core does not read, write, or propagate ParseHealth during any mutation (morphosyntax injection, FA injection, retokenisation). The infrastructure exists in the model layer but is not connected to the pipeline layer.
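What propagating health through a mutation might look like is sketched below. This is not the real `ParseHealth` API: the method names `taint()`, `is_clean()`, and `can_align_main_to_mor()` come from the description above, but their signatures, the `TierKind` enum, and the `inject_mor` step are assumptions made for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TierKind {
    Main,
    Mor,
}

/// Hypothetical per-utterance health flags.
#[derive(Debug, Default, Clone)]
struct HealthSketch {
    main_tainted: bool,
    mor_tainted: bool,
}

impl HealthSketch {
    fn taint(&mut self, tier: TierKind) {
        match tier {
            TierKind::Main => self.main_tainted = true,
            TierKind::Mor => self.mor_tainted = true,
        }
    }

    fn is_clean(&self) -> bool {
        !self.main_tainted && !self.mor_tainted
    }

    fn can_align_main_to_mor(&self) -> bool {
        self.is_clean()
    }
}

struct UtteranceSketch {
    words: Vec<String>,
    mor: Vec<String>,
    health: HealthSketch,
}

/// Hypothetical injection step: on a count mismatch it taints the
/// %mor tier instead of recovering silently.
fn inject_mor(utterance: &mut UtteranceSketch, mor: Vec<String>) {
    if mor.len() != utterance.words.len() {
        utterance.health.taint(TierKind::Mor);
    }
    utterance.mor = mor;
}
```

With a step like this, downstream consumers could consult `can_align_main_to_mor()` before trusting an injected tier, which is the transparency the silent recovery points currently lack.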


Cross-References

| Source | What It Contains |
| --- | --- |
| Grammar governance analysis (archived) | Proposed this document; leniency matrix concept; three-tier classification |
| Permissiveness regression log (archived) | 8 permissiveness regression decisions with rationale |
| Python-Rust boundary audit (archived) | Silent recovery points; ParseHealth gap; NLP pipeline audit |
| grammar/grammar.js | Inline comments on each leniency decision (line references in matrix above) |
| talkbank-model/src/errors/config.rs | ValidationConfig API |
| talkbank-model/src/errors/configurable_sink.rs | ConfigurableErrorSink adapter |
| talkbank-model/src/validation/header/structure.rs | Header validation: E501, E502, E503, E504-E533 |
| talkbank-model/src/validation/temporal.rs | Temporal constraint checks (E701, E704); CA-mode skip |
| talkbank-model/src/model/content/main_tier.rs | Where W210/W211 were removed |

Last updated: 2026-02-18