Symbol Registry Architecture
Status: Current
Last modified: 2026-05-29 18:43 EDT
Purpose
spec/symbols/symbol_registry.json is the canonical source of token/symbol classes used by
CHAT grammar tokenization policy.
Scope
The registry currently governs:
- CA delimiter symbols,
- CA element symbols,
- word segment forbidden symbol classes,
- event segment forbidden symbol classes.
Governance Rules
- Symbol changes must be made only in spec/symbols/symbol_registry.json.
- Registry must pass validation:
node spec/symbols/validate_symbol_registry.js
- Grammar symbol sets must be regenerated after any registry change:
just symbols-gen
- Generated files are read-only and must not be edited manually.
Determinism Requirements
- Every category list in the registry must be lexicographically sorted.
- Duplicate symbols are forbidden.
- ca_delimiter_symbols and ca_element_symbols must be disjoint.
These constraints keep generated outputs stable and review diffs minimal.
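The three requirements above can be expressed as a small check. The following is only a minimal sketch, not the contents of validate_symbol_registry.js: the function name checkRegistry is hypothetical, the registry is assumed to be an object whose categories are arrays of strings, duplicates are assumed to be checked within each list, and "lexicographically sorted" is assumed to mean JavaScript's default string sort order. The symbols in the example are made-up placeholders.

```javascript
// Hypothetical sketch; the real checks live in
// spec/symbols/validate_symbol_registry.js and may differ.
// Assumption: each registry category is an array of strings.

// Returns a list of human-readable problems; an empty list means "passes".
function checkRegistry(registry) {
  const problems = [];

  for (const [category, symbols] of Object.entries(registry)) {
    if (!Array.isArray(symbols)) continue; // skip any non-list fields

    // Duplicate symbols are forbidden (assumed: within each list).
    if (new Set(symbols).size !== symbols.length) {
      problems.push(`${category}: contains duplicate symbols`);
    }

    // Lists must be sorted (assumed: default string order, i.e. by UTF-16 code unit).
    const sorted = [...symbols].sort();
    if (sorted.some((s, i) => s !== symbols[i])) {
      problems.push(`${category}: not lexicographically sorted`);
    }
  }

  // ca_delimiter_symbols and ca_element_symbols must be disjoint.
  const delimiters = new Set(registry.ca_delimiter_symbols ?? []);
  const shared = (registry.ca_element_symbols ?? []).filter((s) => delimiters.has(s));
  if (shared.length > 0) {
    problems.push(`CA delimiter and element symbols overlap: ${shared.join(" ")}`);
  }

  return problems;
}

// Example with placeholder symbols: the element list is unsorted and
// shares both of its symbols with the delimiter list.
const example = {
  ca_delimiter_symbols: ["a", "b"],
  ca_element_symbols: ["b", "a"],
};
console.log(checkRegistry(example));
```

Collecting every problem before reporting, rather than stopping at the first one, lets a single validation run show everything that needs fixing in the registry.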
Consuming Outputs
Generated symbol constants are emitted to:
- grammar/src/generated_symbol_sets.js
- crates/talkbank-model/src/generated/symbol_sets.rs
- spec/tools/src/generated/symbol_sets.rs
grammar/grammar.js imports from this generated module to avoid manual duplication of
critical symbol policy.
Change Workflow
- Edit registry JSON.
- Run registry validation.
- Run just symbols-gen.
- Run grammar generation/tests.
- Run parser equivalence tests.
- Commit source + generated outputs together.
Auditability
Registry drift is caught by the checked-in generated artifacts plus the normal local verification sweep and CI checks, so symbol changes should land together with regenerated grammar and Rust outputs.
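One way to picture a drift check is to regenerate output from the registry and compare it with the checked-in text. The sketch below is illustrative only: renderSymbolModule and hasDrift are hypothetical names, the emitted module format is invented for the example, and the real generator behind just symbols-gen may work differently.

```javascript
// Hypothetical sketch; the actual generator and its output format are not shown
// in this document, so everything emitted here is an assumed format.

// Renders registry categories into module source text deterministically:
// categories are emitted in sorted order so the same registry always
// produces byte-identical output.
function renderSymbolModule(registry) {
  const lines = ["// GENERATED FILE: do not edit manually."];
  for (const category of Object.keys(registry).sort()) {
    const name = category.toUpperCase();
    lines.push(`export const ${name} = ${JSON.stringify(registry[category])};`);
  }
  return lines.join("\n") + "\n";
}

// True when the checked-in text no longer matches what the registry would generate.
function hasDrift(registry, checkedInText) {
  return renderSymbolModule(registry) !== checkedInText;
}

// Placeholder symbols, not real registry content.
const registry = { ca_element_symbols: ["x", "y"], ca_delimiter_symbols: ["a"] };
const checkedIn = renderSymbolModule(registry);

// Editing the registry without regenerating is reported as drift.
const edited = { ...registry, ca_delimiter_symbols: ["a", "b"] };
console.log(hasDrift(registry, checkedIn)); // false
console.log(hasDrift(edited, checkedIn)); // true
```

Because the rendering is deterministic, a plain text comparison is enough: any difference means the registry and the generated artifacts were not committed together.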