Parser Backends

Status: Current. Last modified: 2026-06-13 22:40 EDT.

TalkBank has two CHAT parser implementations. Both implement the ChatParser trait and produce the same ChatFile model type.

The --parser flag selects the backend at the CLI boundary; everything downstream consumes the identical ChatFile output, so the choice is invisible past the dispatch point:

flowchart TD
    cli["chatter validate --parser <backend>\n(ParserBackend enum,\nchatter cli_types.rs)"]
    sel{"which backend?\n(ParserKind,\ntalkbank-transform\nvalidation_runner/config.rs)"}
    ts["TreeSitterParser\n(talkbank-parser:\nGLR, incremental)"]
    re2c["Re2cParser\n(talkbank-parser-re2c:\nre2c DFA + chumsky)"]
    trait["ChatParser trait\n(talkbank-model\nparser_api/chat_parser.rs)"]
    model["ChatFile\n(talkbank-model:\nSemanticEq-identical\nfor both backends)"]

    cli --> sel
    sel -->|"tree-sitter (default)"| ts
    sel -->|"re2c"| re2c
    ts -->|"ParserDispatch::TreeSitter\n(worker.rs) implements"| trait
    re2c -->|"ParserDispatch::Re2c\n(worker.rs) implements"| trait
    trait --> model

ParserDispatch::new(kind) (in validation_runner/worker.rs) is the single place that constructs the chosen backend from a ParserKind; both variants wrap a ChatParser implementor, so the validation runner never branches on backend again.
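The dispatch pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code in worker.rs: the ChatFile fields, the parse_chat_file method name, the placeholder line-splitting "parsers", and the choice to implement ChatParser on ParserDispatch itself are all assumptions made for the example.

```rust
// Hypothetical stand-ins: the real ChatFile, ChatParser, and parser types
// live in talkbank-model and the two parser crates and are far richer.
#[derive(Debug)]
struct ChatFile {
    lines: Vec<String>,
}

// Assumed trait shape; the real method names and error type may differ.
trait ChatParser {
    fn parse_chat_file(&self, source: &str) -> Result<ChatFile, String>;
}

struct TreeSitterParser;
struct Re2cParser;

// Placeholder "parsing": both stand-ins just split the input into lines.
fn split_lines(source: &str) -> ChatFile {
    ChatFile { lines: source.lines().map(str::to_owned).collect() }
}

impl ChatParser for TreeSitterParser {
    fn parse_chat_file(&self, source: &str) -> Result<ChatFile, String> {
        Ok(split_lines(source))
    }
}

impl ChatParser for Re2cParser {
    fn parse_chat_file(&self, source: &str) -> Result<ChatFile, String> {
        Ok(split_lines(source))
    }
}

#[derive(Clone, Copy, Debug)]
enum ParserKind {
    TreeSitter,
    Re2c,
}

// One variant per backend; each wraps a ChatParser implementor.
enum ParserDispatch {
    TreeSitter(TreeSitterParser),
    Re2c(Re2cParser),
}

impl ParserDispatch {
    // The only place where a ParserKind is turned into a concrete backend.
    fn new(kind: ParserKind) -> Self {
        match kind {
            ParserKind::TreeSitter => ParserDispatch::TreeSitter(TreeSitterParser),
            ParserKind::Re2c => ParserDispatch::Re2c(Re2cParser),
        }
    }
}

impl ChatParser for ParserDispatch {
    fn parse_chat_file(&self, source: &str) -> Result<ChatFile, String> {
        match self {
            ParserDispatch::TreeSitter(p) => p.parse_chat_file(source),
            ParserDispatch::Re2c(p) => p.parse_chat_file(source),
        }
    }
}

// Downstream code sees only the trait and never branches on the backend.
fn validate(parser: &impl ChatParser, source: &str) -> usize {
    parser.parse_chat_file(source).map(|f| f.lines.len()).unwrap_or(0)
}

fn main() {
    let source = "@Begin\n*CHI:\thello .\n@End\n";
    for kind in [ParserKind::TreeSitter, ParserKind::Re2c] {
        let parser = ParserDispatch::new(kind);
        println!("{:?}: {} lines", kind, validate(&parser, source));
    }
}
```

Keeping the match on ParserKind inside the constructor is what lets the rest of the runner stay backend-agnostic.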

TreeSitterParser (default)

  • Crate: talkbank-parser
  • Technology: tree-sitter GLR parser
  • Grammar: grammar/grammar.js → generated C parser
  • Strengths: Incremental reparsing (LSP), robust error recovery (GLR), CST-level diagnostics
  • Weaknesses: Slower on batch workloads, !Send + !Sync (one parser per thread)

Used by the LSP, the default CLI, and all production validation.
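One common way to live with a !Send + !Sync parser is to keep one instance per thread in thread-local storage. The sketch below illustrates that idea with a hypothetical ThreadBoundParser stand-in (made !Send and !Sync with a raw-pointer marker) and a toy count_utterances method; it does not show how talkbank-parser actually manages its instances.

```rust
use std::cell::RefCell;
use std::marker::PhantomData;
use std::thread;

// Hypothetical stand-in for a parser that is !Send + !Sync.
// The raw-pointer PhantomData removes both auto traits.
struct ThreadBoundParser {
    _not_send: PhantomData<*const ()>,
}

impl ThreadBoundParser {
    fn new() -> Self {
        ThreadBoundParser { _not_send: PhantomData }
    }

    // Toy "parse": counts main-tier lines, which start with '*' in CHAT.
    fn count_utterances(&mut self, source: &str) -> usize {
        source.lines().filter(|line| line.starts_with('*')).count()
    }
}

thread_local! {
    // Each thread lazily builds and then reuses its own parser.
    static PARSER: RefCell<ThreadBoundParser> = RefCell::new(ThreadBoundParser::new());
}

fn count_on_this_thread(source: &str) -> usize {
    PARSER.with(|parser| parser.borrow_mut().count_utterances(source))
}

// Spawns one scoped thread per input; no parser ever crosses a thread boundary.
fn count_in_parallel(sources: &[&str]) -> Vec<usize> {
    thread::scope(|scope| {
        let handles: Vec<_> = sources
            .iter()
            .map(|src| scope.spawn(move || count_on_this_thread(src)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let counts = count_in_parallel(&["*CHI:\thi .\n", "@Begin\n*MOT:\ta .\n*CHI:\tb .\n"]);
    println!("{:?}", counts);
}
```

With a real worker pool, each worker would pay the parser construction cost once and amortize it over every file it handles.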

Re2cParser

  • Crate: talkbank-parser-re2c
  • Technology: re2c DFA lexer + chumsky parser combinators
  • Grammar: Translated from grammar.js rules → re2c conditions + chumsky combinators
  • Strengths: 4-8x faster than TreeSitterParser (see Performance), Send + Sync, zero constructor cost, specification oracle
  • Weaknesses: No incremental reparsing, Box::leak memory strategy

Used for batch validation, parser parity testing, and performance benchmarking.
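To make the Send + Sync and Box::leak points concrete, here is a small sketch under stated assumptions. StatelessParser and its speakers method are hypothetical; the example only illustrates the general trade-off that a leaked allocation yields a 'static reference that can be shared freely across threads but is never freed. How Re2cParser actually uses Box::leak is not shown here.

```rust
use std::thread;

// Hypothetical stand-in for a stateless parser that is Send + Sync.
struct StatelessParser;

impl StatelessParser {
    // Returns speaker codes as 'static strings by leaking one small
    // allocation per code. The memory is never reclaimed, so usage
    // grows with the amount of input processed.
    fn speakers(&self, source: &str) -> Vec<&'static str> {
        source
            .lines()
            .filter_map(|line| line.strip_prefix('*'))
            .filter_map(|rest| rest.split(':').next())
            .map(|code| -> &'static str { Box::leak(code.to_owned().into_boxed_str()) })
            .collect()
    }
}

// Compile-time check that a type can be shared across threads.
fn assert_send_sync<T: Send + Sync>() {}

// One parser instance is borrowed by every scoped thread.
fn speakers_in_parallel(
    parser: &StatelessParser,
    sources: &[&str],
) -> Vec<Vec<&'static str>> {
    thread::scope(|scope| {
        let handles: Vec<_> = sources
            .iter()
            .map(|src| scope.spawn(move || parser.speakers(src)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    assert_send_sync::<StatelessParser>();
    let parser = StatelessParser;
    let result = speakers_in_parallel(&parser, &["*CHI:\thi .\n*MOT:\tyes .\n"]);
    println!("{:?}", result);
}
```

For a short-lived batch process the leak is usually harmless because the operating system reclaims everything at exit; in a long-running server it would accumulate.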

CLI Usage

# Default: tree-sitter
chatter validate corpus/

# Use re2c for faster batch validation
chatter validate --parser re2c corpus/

# Roundtrip with re2c
chatter validate --parser re2c --roundtrip corpus/

The --parser flag accepts tree-sitter (default) or re2c. Cache entries are parser-specific, so switching parsers does not invalidate the other’s cache.
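One way to get parser-specific cache entries is to make the backend part of the cache key. The following is a minimal sketch of that idea only: CacheKey, ValidationCache, the content hash, and the boolean "is valid" payload are hypothetical and do not describe the actual cache layout.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum ParserKind {
    TreeSitter,
    Re2c,
}

// Hypothetical key: the backend is part of the key, so the two
// parsers can never read or overwrite each other's entries.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct CacheKey {
    parser: ParserKind,
    content_hash: u64,
}

fn cache_key(parser: ParserKind, source: &str) -> CacheKey {
    let mut hasher = DefaultHasher::new();
    source.hash(&mut hasher);
    CacheKey { parser, content_hash: hasher.finish() }
}

// Hypothetical in-memory cache mapping a key to "file was valid".
#[derive(Default)]
struct ValidationCache {
    entries: HashMap<CacheKey, bool>,
}

impl ValidationCache {
    fn store(&mut self, parser: ParserKind, source: &str, is_valid: bool) {
        self.entries.insert(cache_key(parser, source), is_valid);
    }

    fn lookup(&self, parser: ParserKind, source: &str) -> Option<bool> {
        self.entries.get(&cache_key(parser, source)).copied()
    }
}

fn main() {
    let source = "@Begin\n*CHI:\thello .\n@End\n";
    let mut cache = ValidationCache::default();
    cache.store(ParserKind::TreeSitter, source, true);

    // The tree-sitter entry is a hit; re2c has its own, still empty, slot.
    println!("{:?}", cache.lookup(ParserKind::TreeSitter, source));
    println!("{:?}", cache.lookup(ParserKind::Re2c, source));
}
```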

Parity Status

Both parsers produce SemanticEq-identical output on the 87-file reference corpus (100% match). On the ~100k-file wild corpus, parity is ~98.7%.
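A parity figure like the ones above can be computed by comparing the two backends' outputs file by file. The sketch below assumes, purely for illustration, that SemanticEq compares content while ignoring source spans; the trait shape, the Word type, and parity_percent are hypothetical and may not match the real definitions in talkbank-model.

```rust
// Assumed trait shape: equality that ignores parser-specific details.
trait SemanticEq {
    fn semantic_eq(&self, other: &Self) -> bool;
}

// Hypothetical model node carrying content plus a source span.
#[derive(Debug, Clone)]
struct Word {
    text: String,
    span: (usize, usize),
}

impl SemanticEq for Word {
    // Only the text matters here; spans may differ between backends.
    fn semantic_eq(&self, other: &Self) -> bool {
        self.text == other.text
    }
}

impl<T: SemanticEq> SemanticEq for Vec<T> {
    fn semantic_eq(&self, other: &Self) -> bool {
        self.len() == other.len()
            && self.iter().zip(other).all(|(a, b)| a.semantic_eq(b))
    }
}

// Percentage of (backend A, backend B) output pairs that agree.
// Returns None for an empty corpus, where a rate is undefined.
fn parity_percent<T: SemanticEq>(pairs: &[(T, T)]) -> Option<f64> {
    if pairs.is_empty() {
        return None;
    }
    let matching = pairs.iter().filter(|(a, b)| a.semantic_eq(b)).count();
    Some(100.0 * matching as f64 / pairs.len() as f64)
}

fn word(text: &str, start: usize) -> Word {
    Word { text: text.to_owned(), span: (start, start + text.len()) }
}

fn main() {
    let a = vec![word("hello", 0)];
    let b = vec![word("hello", 7)]; // same content, different span
    let c = vec![word("bye", 0)];
    println!("spans: {:?} vs {:?}", a[0].span, b[0].span);
    println!("parity: {:?}", parity_percent(&[(a.clone(), b), (a, c)]));
}
```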

Error Detection

| Metric | Value |
| --- | --- |
| Specs tested | 140 |
| Both detect error | 140/140 (100%) |
| Same error code | 79/140 (56.4%) |
| Different code, both detect | 61/140 (43.6%) |
| Re2c silent (misses error) | 0 |

The 61 code mismatches come from architectural differences, not bugs. Both parsers report actionable diagnostics for all 140 testable error specs.
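The categories in the table can be expressed as a small classification over the error code each backend reports for a spec. This is an illustrative sketch only: the Outcome enum, the classify function, and the example codes are hypothetical, and the TreeSitterSilent variant is not a row in the table; it exists so the match is exhaustive.

```rust
#[derive(Debug, PartialEq, Eq)]
enum Outcome {
    SameCode,
    DifferentCode,
    Re2cSilent,
    // Not a row in the table; added so every case is covered.
    TreeSitterSilent,
}

// Each argument is the error code a backend reported for one spec,
// or None if that backend reported nothing.
fn classify(tree_sitter: Option<&str>, re2c: Option<&str>) -> Outcome {
    match (tree_sitter, re2c) {
        (Some(a), Some(b)) if a == b => Outcome::SameCode,
        (Some(_), Some(_)) => Outcome::DifferentCode,
        (Some(_), None) => Outcome::Re2cSilent,
        (None, _) => Outcome::TreeSitterSilent,
    }
}

// Formats a count as a percentage of the total with one decimal place.
fn percent(count: u32, total: u32) -> String {
    format!("{:.1}%", 100.0 * f64::from(count) / f64::from(total))
}

fn main() {
    // The codes below are made up for the example.
    println!("{:?}", classify(Some("E301"), Some("E301")));
    println!("{:?}", classify(Some("E301"), Some("E302")));

    // Reproduces the table's percentages from its raw counts.
    println!("same code: {}", percent(79, 140));
    println!("different code: {}", percent(61, 140));
}
```

The two counts sum to 140, which is consistent with every spec being detected by both parsers.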

Performance

| Benchmark | TreeSitter | Re2c | Speedup |
| --- | --- | --- | --- |
| Small file (13 lines) | 44 µs | 9.6 µs | 4.6x |
| Medium file (dependent tiers) | 69 µs | 9.4 µs | 7.3x |
| Large file (complex) | 7,734 µs | 970 µs | 8.0x |
| Batch (35 files) | 21.7 ms | 3.0 ms | 7.2x |

Run benchmarks: cargo bench -p talkbank-parser-re2c --bench parse_comparison

When to Use Which

| Use Case | Recommended Parser | Why |
| --- | --- | --- |
| LSP / editor integration | tree-sitter | Incremental reparsing |
| Batch validation (>100 files) | re2c | 4-8x faster |
| CI validation | Either | Both correct; re2c saves CI time |
| Error diagnostics (user-facing) | tree-sitter | More specific E3xx codes |
| Parser parity testing | Both | Re2c is the specification oracle |
| Profiling / benchmarking | re2c | DFA lexer gives a performance floor |

Shared Model Infrastructure

Both parsers convert to the same talkbank_model::ChatFile type and share post-hoc promotion logic:

  • TierContent::extract_terminal_bullet(): trailing InternalBullet → utterance bullet
  • parse_bullet_node_timestamps(): structured bullet CST → (start_ms, end_ms)
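The first promotion step can be pictured as follows. This is a simplified sketch: ContentItem is a hypothetical two-variant stand-in for the real tier content model, and the free function only mirrors the described behavior of TierContent::extract_terminal_bullet(), not its actual signature.

```rust
// Hypothetical, simplified content model for one utterance.
#[derive(Debug, Clone, PartialEq, Eq)]
enum ContentItem {
    Word(String),
    InternalBullet { start_ms: u64, end_ms: u64 },
}

// If the last item is an InternalBullet, remove it and return its
// timestamps so the caller can store them as the utterance bullet.
// Bullets that appear earlier in the content are left in place.
fn extract_terminal_bullet(items: &mut Vec<ContentItem>) -> Option<(u64, u64)> {
    let bullet = match items.last() {
        Some(ContentItem::InternalBullet { start_ms, end_ms }) => (*start_ms, *end_ms),
        _ => return None,
    };
    items.pop();
    Some(bullet)
}

fn main() {
    let mut items = vec![
        ContentItem::Word("hello".to_owned()),
        ContentItem::InternalBullet { start_ms: 1200, end_ms: 3400 },
    ];
    let utterance_bullet = extract_terminal_bullet(&mut items);
    println!("bullet: {:?}, remaining: {:?}", utterance_bullet, items);
}
```

Because both backends run the same promotion after parsing, neither needs its own rule for deciding which bullet belongs to the utterance.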

CA intonation arrows are no longer promoted to terminators at the parser/model boundary; both parsers leave them as Separator items. See CA Terminator Resolution.

Detailed Parity Report

See crates/talkbank-parser-re2c/docs/parity-report.md for the full gap analysis, divergence categories, and remaining work items.