Grammar System and Token Governance

Status: Current
Last modified: 2026-05-29 18:43 EDT

Current Reality

grammar/grammar.js encodes substantial implicit language knowledge directly in regex exclusions, reserved symbol lists, and leniency decisions. Example areas:

- word segment forbidden start/rest classes,
- CA delimiter/element symbol groups,
- event segment exclusions,
- hand-maintained coupling between comments and token rules.

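This approach is powerful but fragile: the constraints live in scattered regex literals and comments, and nothing checks that they stay consistent with each other.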

Primary Failure Modes

New symbolic token added in one place but not in exclusion sets.
Parser behavior changes silently due to regex class edits.
Generated node types drift from assumptions in spec tooling.
Lenient parsing choices become undocumented policy.
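
The first failure mode can be made concrete with a small check. This is a minimal sketch, not code from grammar.js: the symbol list, the exclusion class, and the function name findUncoveredSymbols are all hypothetical, chosen only to show how a symbol added in one place can be missing from an exclusion set.

```javascript
// Hypothetical inputs: neither the symbol list nor the class is taken from grammar.js.
const reservedSymbols = ["+", "&", "[", "]", "^"];

// A hand-maintained exclusion class in which "^" was forgotten.
const wordStartExclusion = /[+&\[\]]/;

// Return every reserved symbol whose first character the exclusion class fails to match.
function findUncoveredSymbols(symbols, exclusionClass) {
  return symbols.filter((symbol) => !exclusionClass.test(symbol[0]));
}

const uncovered = findUncoveredSymbols(reservedSymbols, wordStartExclusion);
console.log(uncovered); // the symbols missing from the exclusion class
```

With the inputs above, the only uncovered symbol is "^". A check of this kind turns a silent parser behavior change into a visible failure.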

Current Design

The symbol registry, together with the symbol sets generated from it, is the single source of token constraints. The pipeline has shipped; running just symbols-gen rebuilds the generated outputs.

Registry Artifacts

spec/symbols/symbol_registry.json (human-authored intent):
- symbol string
- category (delimiter, continuation, overlap, punctuation, etc.)
- contexts where reserved/allowed
- parse role and precedence notes
Generated outputs:
- grammar/src/generated_symbol_sets.js
- crates/talkbank-model/src/generated/symbol_sets.rs
- spec/tools/src/generated/symbol_sets.rs
- docs: Symbol Registry
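
The step from registry to generated output might look like the sketch below. The entry field names follow the list above, but the exact JSON schema, the context names (word_start, word_rest), the sample symbols, and the helper names are assumptions made for illustration. The real generator behind just symbols-gen may work differently.

```javascript
// Hypothetical registry entries; the real schema of symbol_registry.json is not shown here.
const registry = [
  { symbol: "+", category: "continuation", contexts: ["word_start"], role: "prefix" },
  { symbol: "[", category: "delimiter", contexts: ["word_start", "word_rest"], role: "open" },
  { symbol: "]", category: "delimiter", contexts: ["word_start", "word_rest"], role: "close" },
];

// Escape characters that are special inside a regex character class.
function escapeForClass(ch) {
  return /[\\\]\[^-]/.test(ch) ? "\\" + ch : ch;
}

// Collect the characters reserved in one context into a character-class body.
// Simplification: every character of a multi-character symbol is treated as reserved.
function classBodyFor(entries, context) {
  const chars = new Set();
  for (const entry of entries) {
    if (entry.contexts.includes(context)) {
      for (const ch of entry.symbol) chars.add(ch);
    }
  }
  return [...chars].sort().map(escapeForClass).join("");
}

// Render one line of a generated JavaScript module, keeping a semantic constant name.
function renderConstant(name, body) {
  return "const " + name + " = " + JSON.stringify(body) + ";";
}

console.log(renderConstant("WORD_START_FORBIDDEN", classBodyFor(registry, "word_start")));
console.log(renderConstant("WORD_REST_FORBIDDEN", classBodyFor(registry, "word_rest")));
```

Sorting the characters keeps the output deterministic, which matters if CI compares generated files against committed ones.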

Grammar Refactor Requirements

Replace large manual regex strings with generated character classes.
Keep final grammar readable by preserving semantic names in generated constants.
Distinguish clearly between:
- syntax permissiveness,
- semantic validation restrictions.

Add comments only for design rationale, not for duplicating manual references.
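
As a sketch of the first two requirements, a token pattern can be assembled from named generated constants instead of one long manual regex string. The constant names and their contents are hypothetical stand-ins for what generated_symbol_sets.js might export, and a plain RegExp is used here in place of the grammar DSL.

```javascript
// Hypothetical stand-ins for generated constants; the contents are illustrative only.
const WORD_START_FORBIDDEN = "+&\\[\\]\\s";
const WORD_REST_FORBIDDEN = "\\[\\]\\s";

// A word segment: one allowed start character, then any number of allowed rest characters.
const wordSegment = new RegExp(
  "^[^" + WORD_START_FORBIDDEN + "][^" + WORD_REST_FORBIDDEN + "]*$"
);

console.log(wordSegment.test("hello"));  // start and rest characters are all allowed
console.log(wordSegment.test("+hello")); // "+" is forbidden at the start
console.log(wordSegment.test("a+b"));    // "+" is only forbidden at the start, not in the rest
console.log(wordSegment.test("a[b"));    // "[" is forbidden everywhere
```

The semantic names carry the intent that would otherwise need a comment, so the comments that remain can be limited to design rationale.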

Node Type Drift Controls

Enforce regeneration and consistency checks:
- grammar source change must regenerate parser and node types,
- node type constants consumed by spec/tools and parser code must compile,
- CI fails if generated files differ from committed state.
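
The last check can be sketched as a pure comparison between freshly generated contents and committed contents. The function name findStaleArtifacts and the sample file contents are hypothetical, and loading the files from disk is left out.

```javascript
// Compare freshly generated file contents against the committed ones.
// Both arguments map a path to its full text.
function findStaleArtifacts(generated, committed) {
  const stale = [];
  for (const [path, content] of Object.entries(generated)) {
    // A missing committed file counts as stale, since undefined never equals a string.
    if (committed[path] !== content) stale.push(path);
  }
  return stale.sort();
}

// Illustrative contents only.
const generated = {
  "grammar/src/generated_symbol_sets.js": "const A = \"+\";\n",
  "crates/talkbank-model/src/generated/symbol_sets.rs": "pub const A: &str = \"+\";\n",
};
const committed = {
  "grammar/src/generated_symbol_sets.js": "const A = \"+\";\n",
  "crates/talkbank-model/src/generated/symbol_sets.rs": "pub const A: &str = \"&\";\n",
};

const staleArtifacts = findStaleArtifacts(generated, committed);
console.log(staleArtifacts); // a non-empty list should fail the CI job
```

A common alternative is to regenerate in CI and then run git diff --exit-code, which returns a non-zero status when the working tree differs from the committed state.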

Leniency Policy

Explicitly classify every lenient parse behavior:

- Parse-lenient + validate-strict.
- Parse-lenient + validate-warning.
- Parse-strict (hard fail).

Document this matrix in the Leniency Policy.
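
One way to keep the matrix checkable is to hold it as data. In the sketch below the three classes come from the list above, while the behavior names, the outcome fields, and the helper names are hypothetical.

```javascript
const LeniencyClass = Object.freeze({
  LENIENT_STRICT: "parse-lenient + validate-strict",
  LENIENT_WARNING: "parse-lenient + validate-warning",
  PARSE_STRICT: "parse-strict",
});

// Hypothetical behavior names, used only to illustrate the shape of the matrix.
const leniencyMatrix = {
  missing_terminator: LeniencyClass.LENIENT_STRICT,
  unknown_event_label: LeniencyClass.LENIENT_WARNING,
  malformed_speaker_code: LeniencyClass.PARSE_STRICT,
};

// What each class means for the parser and for the validator.
function outcomeFor(leniencyClass) {
  switch (leniencyClass) {
    case LeniencyClass.LENIENT_STRICT:
      return { parserAccepts: true, validatorSeverity: "error" };
    case LeniencyClass.LENIENT_WARNING:
      return { parserAccepts: true, validatorSeverity: "warning" };
    case LeniencyClass.PARSE_STRICT:
      return { parserAccepts: false, validatorSeverity: null };
    default:
      throw new Error("unknown leniency class: " + leniencyClass);
  }
}

// List the behaviors that have no entry in the matrix.
function unclassified(behaviors, matrix) {
  return behaviors.filter(
    (name) => !Object.prototype.hasOwnProperty.call(matrix, name)
  );
}

console.log(outcomeFor(leniencyMatrix.unknown_event_label));
console.log(unclassified(["missing_terminator", "stray_bracket"], leniencyMatrix));
```

A test that requires unclassified to return an empty list would stop a lenient parse choice from becoming undocumented policy.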

Grammar Test Strategy

Keep corpus tests generated from spec/constructs.
Add targeted hand-authored edge tests for symbol boundary interactions.
Add mutation-style tests for forbidden-character regressions.
Add parser equivalence tests for tokenizer-sensitive cases.
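
A mutation-style test for forbidden-character regressions could take roughly this shape: insert each forbidden character at every position of a known-good input and require that the recognizer rejects all of the results. The forbidden set, the patterns, and the helper names are hypothetical, and a regex stands in for the real parser.

```javascript
// Build every string obtained by inserting one forbidden character at one position.
function mutants(seed, forbidden) {
  const out = [];
  for (const ch of forbidden) {
    for (let i = 0; i <= seed.length; i++) {
      out.push(seed.slice(0, i) + ch + seed.slice(i));
    }
  }
  return out;
}

// Mutants that the pattern still accepts; each one is a regression.
function survivingMutants(seed, forbidden, pattern) {
  return mutants(seed, forbidden).filter((m) => pattern.test(m));
}

// Hypothetical forbidden set and word patterns.
const forbidden = ["[", "]"];
const correctPattern = /^[^\[\]\s]+$/;
const weakenedPattern = /^[^\[\s]+$/; // "]" was dropped from the class by mistake

console.log(survivingMutants("ab", forbidden, correctPattern));
console.log(survivingMutants("ab", forbidden, weakenedPattern));
```

With the correct pattern no mutant survives. With the weakened pattern the three "]" mutants survive, which is exactly the kind of regex class edit that otherwise changes parser behavior without notice.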

Acceptance Criteria

No manual reserved-symbol duplication in grammar.js.
Symbol sets are generated from the registry for all required consumers.
Grammar modifications cannot land with stale generated artifacts.
Every special token category has explicit policy documentation.
