Grammar System and Token Governance
Status: Current
Last modified: 2026-05-29 18:43 EDT
Current Reality
grammar/grammar.js encodes substantial implicit language knowledge directly in regex exclusions,
reserved symbol lists, and leniency decisions. Example areas:
- word segment forbidden start/rest classes,
- CA delimiter/element symbol groups,
- event segment exclusions,
- hand-maintained coupling between comments and token rules.
This approach is powerful but fragile, because each of these pieces is maintained by hand and must be kept in sync with the others.
Primary Failure Modes
- New symbolic token added in one place but not in exclusion sets.
- Parser behavior changes silently due to regex class edits.
- Generated node types drift from assumptions in spec tooling.
- Lenient parsing choices become undocumented policy.
Current Design
The generated symbol registry is the single source of token constraints.
The pipeline has shipped; running just symbols-gen rebuilds it.
Registry Artifacts
- spec/symbols/symbol_registry.json (human-authored intent):
  - symbol string
  - category (delimiter, continuation, overlap, punctuation, etc.)
  - contexts where reserved/allowed
  - parse role and precedence notes
- Generated outputs:
  - grammar/src/generated_symbol_sets.js
  - crates/talkbank-model/src/generated/symbol_sets.rs
  - spec/tools/src/generated/symbol_sets.rs
  - docs: Symbol Registry
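As a rough illustration of how a registry entry could turn into a generated character class, here is a minimal sketch. The field names (symbol, category, contexts, notes), the sample entries, and the helper names are assumptions made for the example; they are not taken from the real symbol_registry.json or generator.

```javascript
// Hypothetical registry entries. Field names mirror the fields listed above,
// but the exact JSON schema is an assumption.
const registry = [
  { symbol: "[", category: "delimiter", contexts: ["word"], notes: "example entry" },
  { symbol: "]", category: "delimiter", contexts: ["word"], notes: "example entry" },
  { symbol: "^", category: "punctuation", contexts: ["word"], notes: "example entry" },
  { symbol: "+/", category: "continuation", contexts: ["word"], notes: "multi-character example" },
];

// Escape the characters that are special inside a regex character class.
function escapeForCharClass(ch) {
  return ch.replace(/[\\\]\[\^\-]/g, "\\$&");
}

// Build the body of a character class from every single-character symbol
// reserved in the given context. Multi-character symbols are skipped here
// because a character class cannot express them.
function charClassBody(entries, context) {
  const chars = entries
    .filter((e) => e.contexts.includes(context) && [...e.symbol].length === 1)
    .map((e) => e.symbol);
  return [...new Set(chars)].sort().map(escapeForCharClass).join("");
}

const WORD_FORBIDDEN = charClassBody(registry, "word");
console.log(WORD_FORBIDDEN); // prints \[\]\^
```

A generator along these lines would write the resulting string into each generated output under a stable constant name, so that every consumer sees the same set.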
Grammar Refactor Requirements
- Replace large manual regex strings with generated character classes.
- Keep final grammar readable by preserving semantic names in generated constants.
- Distinguish clearly between:
  - syntax permissiveness,
  - semantic validation restrictions.
- Add comments only for design rationale, not for duplicating manual references.
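The first two requirements can be sketched as follows: the grammar assembles its regex from named constants instead of carrying one long literal. The constant names and their contents are hypothetical stand-ins for whatever the generated module actually exports.

```javascript
// Stand-ins for constants that would come from the generated symbol sets.
// Names and contents are hypothetical.
const WORD_START_FORBIDDEN = "\\[\\]+0-9\\s";
const WORD_REST_FORBIDDEN = "\\[\\]\\s";

// Compose a word-segment pattern from the two named sets: one character that
// is not in the start set, then any number of characters not in the rest set.
function wordSegmentPattern(startForbidden, restForbidden) {
  return new RegExp(`[^${startForbidden}][^${restForbidden}]*`);
}

// Helper for experiments: does the pattern match the entire text?
function matchesWhole(pattern, text) {
  return new RegExp("^(?:" + pattern.source + ")$").test(text);
}

const WORD_SEGMENT = wordSegmentPattern(WORD_START_FORBIDDEN, WORD_REST_FORBIDDEN);
console.log(matchesWhole(WORD_SEGMENT, "hello")); // true
console.log(matchesWhole(WORD_SEGMENT, "9abc")); // false: digit at start
```

Keeping the start and rest sets as separate named constants means a reviewer can see which rule a change affects without decoding a regex by eye.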
Node Type Drift Controls
- Enforce regeneration and consistency checks:
  - a grammar source change must regenerate the parser and node types,
  - node type constants consumed by spec/tools and parser code must compile,
  - CI fails if generated files differ from committed state.
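The staleness check in the last item can be sketched as a pure comparison between freshly generated content and committed content. The function name and the sample file contents are hypothetical; a real CI job would read both sides from disk (or rely on a version-control diff) and exit non-zero when the list is not empty.

```javascript
// Compare regenerated content with committed content, path by path.
// Both arguments are plain objects mapping file path to file text.
// Returns the sorted list of paths whose committed text is missing or stale.
function findStaleArtifacts(generated, committed) {
  const stale = [];
  for (const [path, text] of Object.entries(generated)) {
    if (committed[path] !== text) stale.push(path);
  }
  return stale.sort();
}

// Illustrative contents only.
const generated = {
  "grammar/src/generated_symbol_sets.js": "const A = 1;\n",
  "crates/talkbank-model/src/generated/symbol_sets.rs": "pub const A: u8 = 1;\n",
};
const committed = {
  "grammar/src/generated_symbol_sets.js": "const A = 1;\n",
  "crates/talkbank-model/src/generated/symbol_sets.rs": "pub const A: u8 = 0;\n",
};

const stale = findStaleArtifacts(generated, committed);
console.log(stale); // the .rs path only
```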
Leniency Policy
Explicitly classify every lenient parse behavior:
- Parse-lenient + validate-strict.
- Parse-lenient + validate-warning.
- Parse-strict (hard fail).
Document this matrix in the Leniency Policy.
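One way to keep the matrix from drifting into undocumented policy is to hold it as data and refuse to answer for anything unclassified. The sketch below assumes hypothetical construct names and a hypothetical outcome shape; only the three classes come from the list above.

```javascript
// The three classes from the matrix above.
const LeniencyClass = Object.freeze({
  PARSE_LENIENT_VALIDATE_STRICT: "parse-lenient+validate-strict",
  PARSE_LENIENT_VALIDATE_WARNING: "parse-lenient+validate-warning",
  PARSE_STRICT: "parse-strict",
});

// Hypothetical policy table: construct name to classification.
const policy = new Map([
  ["example_construct_a", LeniencyClass.PARSE_LENIENT_VALIDATE_STRICT],
  ["example_construct_b", LeniencyClass.PARSE_LENIENT_VALIDATE_WARNING],
  ["example_construct_c", LeniencyClass.PARSE_STRICT],
]);

// Describe what tooling should do when the construct deviates from the spec.
// Unclassified constructs throw, so a new lenient behavior cannot be added
// without also being classified.
function outcomeFor(construct) {
  if (!policy.has(construct)) {
    throw new Error(`unclassified lenient behavior: ${construct}`);
  }
  switch (policy.get(construct)) {
    case LeniencyClass.PARSE_LENIENT_VALIDATE_STRICT:
      return { parses: true, diagnostic: "error" };
    case LeniencyClass.PARSE_LENIENT_VALIDATE_WARNING:
      return { parses: true, diagnostic: "warning" };
    default:
      return { parses: false, diagnostic: "error" };
  }
}

console.log(outcomeFor("example_construct_b")); // { parses: true, diagnostic: 'warning' }
```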
Grammar Test Strategy
- Keep corpus tests generated from spec/constructs.
- Add targeted hand-authored edge tests for symbol boundary interactions.
- Add mutation-style tests for forbidden-character regressions.
- Add parser equivalence tests for tokenizer-sensitive cases.
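A mutation-style test for forbidden-character regressions could look roughly like this: inject each forbidden character into a known-good word at every position and report any mutant the pattern still accepts. The forbidden list and the word pattern are hypothetical stand-ins, not the project's real sets.

```javascript
// Hypothetical forbidden set and a stand-in word pattern that excludes it.
const FORBIDDEN_IN_WORD = ["[", "]", "<", ">"];
const wordPattern = /^[^\[\]<>\s]+$/;

// Insert each forbidden character at every position of the seed word.
// Any mutant that still matches the pattern is a regression.
function survivingMutants(pattern, forbidden, seed) {
  const survivors = [];
  for (const ch of forbidden) {
    for (let i = 0; i <= seed.length; i++) {
      const mutant = seed.slice(0, i) + ch + seed.slice(i);
      if (pattern.test(mutant)) survivors.push(mutant);
    }
  }
  return survivors;
}

console.log(survivingMutants(wordPattern, FORBIDDEN_IN_WORD, "word")); // []

// A pattern that accidentally dropped ">" from its exclusions lets mutants through.
const loosened = /^[^\[\]<\s]+$/;
console.log(survivingMutants(loosened, FORBIDDEN_IN_WORD, "word").length); // 5
```

Driving the forbidden list from the generated symbol sets rather than a literal array would let the test pick up new symbols automatically.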
Acceptance Criteria
- No manual reserved-symbol duplication in grammar.js.
- The symbol registry is generated into every required consumer.
- Grammar modifications cannot land with stale generated artifacts.
- Every special token category has explicit policy documentation.