Parser Backends

Status: Current Last modified: 2026-06-13 22:40 EDT

TalkBank has two CHAT parser implementations. Both implement the ChatParser trait and produce identical ChatFile model types.

The --parser flag selects the backend at the CLI boundary; everything downstream consumes the identical ChatFile output, so the choice is invisible past the dispatch point:

flowchart TD
    cli["chatter validate --parser &lt;backend&gt;\n(ParserBackend enum,\nchatter cli_types.rs)"]
    sel{"which backend?\n(ParserKind,\ntalkbank-transform\nvalidation_runner/config.rs)"}
    ts["TreeSitterParser\n(talkbank-parser:\nGLR, incremental)"]
    re2c["Re2cParser\n(talkbank-parser-re2c:\nre2c DFA + chumsky)"]
    trait["ChatParser trait\n(talkbank-model\nparser_api/chat_parser.rs)"]
    model["ChatFile\n(talkbank-model:\nSemanticEq-identical\nfor both backends)"]

    cli --> sel
    sel -->|"tree-sitter (default)"| ts
    sel -->|"re2c"| re2c
    ts -->|"ParserDispatch::TreeSitter\n(worker.rs) implements"| trait
    re2c -->|"ParserDispatch::Re2c\n(worker.rs) implements"| trait
    trait --> model

ParserDispatch::new(kind) (in validation_runner/worker.rs) is the single place that constructs the chosen backend from a ParserKind; both variants wrap a ChatParser implementor, so the validation runner never branches on backend again.

TreeSitterParser (default)

Crate: talkbank-parser
Technology: tree-sitter GLR parser
Grammar: grammar/grammar.js → generated C parser
Strengths: Incremental reparsing (LSP), robust error recovery (GLR), CST-level diagnostics
Weaknesses: Slower on batch workloads, !Send + !Sync (one parser per thread)

Used by the LSP, the default CLI, and all production validation.

Re2cParser

Crate: talkbank-parser-re2c
Technology: re2c DFA lexer + chumsky parser combinators
Grammar: Translated from grammar.js rules → re2c conditions + chumsky combinators
Strengths: 4-8x faster, Send + Sync, zero constructor cost, specification oracle
Weaknesses: No incremental reparsing, Box::leak memory strategy

Used for batch validation, parser parity testing, and performance benchmarking.

CLI Usage

# Default: tree-sitter
chatter validate corpus/

# Use re2c for faster batch validation
chatter validate --parser re2c corpus/

# Roundtrip with re2c
chatter validate --parser re2c --roundtrip corpus/

The --parser flag accepts tree-sitter (default) or re2c. Cache entries are parser-specific, switching parsers does not invalidate the other’s cache.

Parity Status

Both parsers produce SemanticEq-identical output on the 87-file reference corpus (100% match). On the ~100k-file wild corpus, parity is ~98.7%.

Error Detection

Metric	Value
Specs tested	140
Both detect error	140/140 (100%)
Same error code	79/140 (56.4%)
Different code, both detect	61/140 (43.6%)
Re2c silent (misses error)	0

The 61 code mismatches come from architectural differences, not bugs. Both parsers report actionable diagnostics for all 140 testable error specs.

Performance

Benchmark	TreeSitter	Re2c	Speedup
Small file (13 lines)	44 µs	9.6 µs	4.6x
Medium file (dependent tiers)	69 µs	9.4 µs	7.3x
Large file (complex)	7,734 µs	970 µs	8.0x
Batch (35 files)	21.7 ms	3.0 ms	7.2x

Run benchmarks: cargo bench -p talkbank-parser-re2c --bench parse_comparison

When to Use Which

Use Case	Recommended Parser	Why
LSP / editor integration	tree-sitter	Incremental reparsing
Batch validation (>100 files)	re2c	4-8x faster
CI validation	Either	Both correct; re2c saves CI time
Error diagnostics (user-facing)	tree-sitter	More specific E3xx codes
Parser parity testing	Both	Re2c is the specification oracle
Profiling / benchmarking	re2c	DFA lexer gives a performance floor

Shared Model Infrastructure

Both parsers convert to the same talkbank_model::ChatFile type and share post-hoc promotion logic:

TierContent::extract_terminal_bullet(): trailing InternalBullet → utterance bullet
parse_bullet_node_timestamps(): structured bullet CST → (start_ms, end_ms)

CA intonation arrows are no longer promoted to terminators at the parser/model boundary; both parsers leave them as Separator items. See CA Terminator Resolution.

Detailed Parity Report

See crates/talkbank-parser-re2c/docs/parity-report.md for the full gap analysis, divergence categories, and remaining work items.

Keyboard shortcuts

Chatter: TalkBank CHAT Toolchain