Reference Corpus Overhaul
Status: Historical (Phase 0-6 narrative is preserved for context; the live corpus layout is described in Testing § Reference Corpus, read that first for current counts and structure) Last modified: 2026-05-29 18:43 EDT
Subsequent reorganization moved the corpus from the 345-flat-plus-language-subdirs layout described below into nine topical subdirectories under
corpus/reference/. Absolute counts in this page (file totals, language-dir counts, theconstructs/directory) reflect the pre-reorganization state and are kept here only as the historical record of how the corpus got to where it is.
Motivation
The reference corpus (corpus/reference/) is the 100%-pass quality gate for all
parser/grammar changes. The parser must handle every file at 100%.
Before this overhaul, the corpus had three problems:
- Language monoculture: 345 files, all English. We have 100K+ real files across 42 languages in the corpus data directory but the gate only tested English.
- Construct gaps: 18 concrete grammar node types were never exercised
(e.g.,
interrupted_question,scoped_best_guess,trailing_off_question). A grammar regression affecting these constructs would pass CI undetected. - Error coverage gaps: 27 error specs were stubs (no CHAT example), 4 error codes had no spec file at all.
Strategy
Fresh build, not incremental patching. We kept the existing 345 English files as-is (they encode years of parser fixes) and added multilingual files + construct gap-fillers on top.
Phase 0: Coverage Tooling
Built corpus_node_coverage (spec/tools/src/bin/corpus_node_coverage.rs) to
measure which of the 334 concrete grammar node types the corpus exercises.
Running against the old 345-file corpus confirmed exactly 18 gaps.
Phase 1: Language Selection & File Extraction
Built extract_corpus_candidates (spec/runtime-tools/src/bin/extract_corpus_candidates.rs)
to automatically select representative files from the corpus data directory for 20 target
languages:
eng, zho, fra, deu, spa, jpn, nld, heb, por, ell,
tur, hrv, pol, ita, hun, rus, est, dan, ara, isl
Selection criteria:
- Clean tree-sitter parsing (no ERROR nodes), mandatory
- Short files (under 200 lines, preferring 15-100)
- Varied tiers (%mor/%gra/%pho/%com)
- Multiple speakers preferred
- Privacy: explicitly skip
Passworddirectories in the corpus data directory
For each language, the tool scored and ranked candidates. We selected 1-2 files per language (25 files total across 20 language subdirectories).
Phase 2: Construct Gap-Filling
Created 4 handcrafted files in corpus/reference/constructs/ to exercise the 18
missing node types that don’t appear in real-world data:
| File | Node types exercised |
|---|---|
rare-terminators.cha | interrupted_question, self_interrupted_question, self_interruption, trailing_off_question |
uptake.cha | uptake_symbol |
best-guess.cha | scoped_best_guess |
unsupported.cha | thumbnail_header, unsupported_header, unsupported_dependent_tier, unsupported_line, unsupported_header_prefix, unsupported_tier_prefix |
Other gaps (l1_of_header, utf8_header, etc.) were already covered by the
language files or were confirmed as supertypes (not concrete).
Result: 334/334 concrete types exercised (100%).
Phase 3: Tier Regeneration
Ran batchalign3 morphotag on all 25 language files to generate fresh %mor/%gra tiers:
cd /path/to/batchalign3
uv run batchalign3 morphotag /path/to/chatter/corpus/reference/{lang}/ --in-place
All 20 languages are covered by Stanza’s UD models. Validation confirmed all 374 files pass parser equivalence and roundtrip.
Phase 4: Error Corpus Expansion
4.1: Created 3 missing error specs (E707, E711, E717) with CHAT examples and metadata. Fixed E376 (had wrong error code E208 in metadata).
4.2: Filled 17 triggerable stub specs with CHAT examples:
- Cross-utterance validation (E341, E351-E355)
- Parser recovery warnings (E319-E322, E325, E326)
- Underline tier errors (E356-E357)
- Overlap index errors (E373)
- Direct parser tier errors (E381, E384)
4.3: Documented 12 untriggerable stubs (internal, deprecated, or not-yet-wired error codes) with explanations of why no example is possible: E001, E002, E211, E317, E318, E340, E374, E377, E378, E380, E385, E386.
4.4: Corrected 5 misclassified specs where examples triggered different error
codes than intended (E319-E322, E376). Added Status: not_implemented and
explanatory notes.
4.5: Built perturbation tool (spec/tools/src/bin/perturb_corpus.rs) with 11
mutation strategies that take a valid .cha file and produce controlled mutations
targeting specific error codes:
| Perturbation | Target Error |
|---|---|
delete-participants | E501 |
delete-languages | E503 |
delete-id | E504 |
undeclared-speaker | E308 |
delete-terminator | E305 |
extra-mor-word | E706 |
fewer-mor-words | E705 |
delete-begin | E502 |
delete-end | E510 |
duplicate-participants | E511 |
mor-terminator-mismatch | E716 |
Also includes a mining mode (--mine DIR) that scans real data for tree-sitter
ERROR nodes, with automatic Password directory exclusion.
4.6: Regenerated golden artifacts: all 8 golden generators + audit + bootstrap:
| Artifact | Lines |
|---|---|
golden_words.txt | 769 (1949 unique words) |
golden_mor_tiers.txt | 405 |
golden_gra_tiers.txt | 7 |
golden_main_tiers.txt | 607 |
golden_pho_tiers.txt | 25 |
golden_wor_tiers.txt | 7 |
golden_sin_tiers.txt | 5 |
golden_com_tiers.txt | 24 |
golden_words_featured.txt | 96 |
golden_words_minimal.txt | 62 |
Bootstrap regenerated reference_corpus.rs with 374 test cases.
Phase 5: CI Integration & Validation
At that milestone, the then-current verification sweep passed:
- Parser equivalence: 377/377 (374 files + 3 extra)
- Node coverage: 334/334 (100%)
- Error coverage: 181/181 (100%), 169 with CHAT examples, 12 documented stubs
- The parser-equivalence and reference-corpus regression gates passed
Phase 6: Cleanup & Documentation
- Updated file count references (339→374) across CLAUDE.md files
- Rewrote
corpus/README.mdwith new structure - Updated memory files
Final State
corpus/reference/ 374 files total
*.cha 345 files (original English corpus)
constructs/ 4 files (rare grammar constructs)
{20 language dirs}/ 25 files (multilingual, from corpus data)
| Metric | Before | After |
|---|---|---|
| Total files | 345 | 374 |
| Languages | 1 (English) | 20 |
| Concrete node coverage | 316/334 (94.6%) | 334/334 (100%) |
| Error specs | 177/181 (97.8%) | 181/181 (100%) |
| Error specs with examples | ~150 | 169 |
| Documented stubs | 0 | 12 |
| Golden artifacts | Stale | Freshly regenerated |
Tools Built
| Tool | Path | Purpose |
|---|---|---|
corpus_node_coverage | spec/tools/src/bin/ | Grammar node type coverage |
extract_corpus_candidates | spec/runtime-tools/src/bin/ | Automated file selection from corpus data |
perturb_corpus | spec/tools/src/bin/ | Error file generation by mutation |
What Worked
- extract_corpus_candidates: Automated scoring eliminated guesswork in file selection. Files were high-quality, short, and diverse.
- construct gap-filling: 4 handcrafted files closed 18 gaps efficiently.
- Keeping existing 345 files: No breakage, no regressions. The new files are purely additive.
- batchalign3 morphotag: Generated correct %mor/%gra for all 20 languages without manual intervention.
What Didn’t Work / Lessons Learned
- Mining real errors from corpus data: The MacWhinney subcorpus (407 files) had zero tree-sitter parse errors; the data is too clean. Mining is slow on large directories (>4 minutes for all of Eng-NA). The perturbation approach is more effective for systematic error coverage.
- Parser recovery error specs (E319-E322): Writing examples that trigger specific tree-sitter error recovery codes is very difficult. Tree-sitter’s error recovery is robust and routes most malformed input through generic paths (E316) rather than the specific recovery codes. These remain as documented stubs.
- Direct parser vs unsupported.cha (historical, direct parser has been removed):
The former Chumsky direct parser could not handle
unsupported_linenodes (failed onconstructs/unsupported.cha). This is no longer relevant since tree-sitter is now the sole parser.
Known Remaining Gaps
- 12 untriggerable error stubs: Internal (E001, E002), deprecated (E211, E317, E318, E340, E374, E377, E378, E380, E385, E386). These are legitimate, the codes either have no emission path or are reserved.
- No audio files: Phase 3.3 (audio subset with %wor tiers) was deferred. Adding ~10 short audio clips would test the alignment pipeline end-to-end.
- Direct parser roundtrip (historical, direct parser has been removed): 373/374 passed under the former Chumsky direct parser (unsupported.cha failed). No longer relevant since tree-sitter is now the sole parser.
- 5 parser recovery specs not_implemented: E319-E322, E376. Examples don’t trigger the intended codes due to tree-sitter’s error recovery routing.