Parsing
Status: Current
Last updated: 2026-05-19 16:54 EDT
The parsing pipeline converts CHAT text into a typed ChatFile AST.
The default and canonical parser is the tree-sitter parser
(talkbank-parser). A second implementation, talkbank-parser-re2c,
exists alongside it as a specification oracle and high-throughput
batch parser; it produces the same ChatFile model and is opt-in via
chatter validate --parser re2c. The LSP and all production paths
use the tree-sitter parser.
Tree-Sitter Parser
The talkbank-parser crate wraps the tree-sitter C parser and converts its concrete syntax tree (CST) into the ChatFile model.
Full-file parsing is the canonical entry point. TreeSitterParser also
provides fragment methods (parse_word_fragment(), parse_main_tier_fragment(),
parse_chat_file_fragment(), etc.) for parsing isolated CHAT fragments
directly.
CST → AST Pipeline
flowchart LR
chat["CHAT text\n(.cha file)"]
grammar["tree-sitter grammar\n(grammar.js → parser.c)"]
cst["Concrete Syntax Tree\n(all whitespace preserved)"]
walker["TreeSitterParser\n(CST traversal)"]
ast["ChatFile AST\n(semantic model)"]
chat --> grammar --> cst --> walker --> ast
Source text
↓ tree-sitter parse
Concrete Syntax Tree (CST), green tree with all tokens
↓ tree_parsing (Rust)
ChatFile AST, typed model with validation-ready data
The CST preserves every character of the source (whitespace, punctuation, comments). The Rust tree-parsing modules walk the CST and extract semantic information into the typed model.
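The lowering step described above can be sketched with a toy tree. This is a minimal illustration, not the crate's code: `CstNode`, `MainTier`, `leaf`, `lower_main_tier`, and the node kind names (`main_tier`, `speaker`, `word`) are all hypothetical stand-ins, and tree-sitter's real node type is considerably richer.

```rust
/// Hypothetical, minimal CST node. Every token of the source appears
/// somewhere in the tree, including whitespace and punctuation.
pub struct CstNode {
    pub kind: &'static str,
    pub text: String,
    pub children: Vec<CstNode>,
}

/// Hypothetical, heavily simplified AST node for a main tier.
pub struct MainTier {
    pub speaker: String,
    pub words: Vec<String>,
}

/// Convenience constructor for a childless node.
pub fn leaf(kind: &'static str, text: &str) -> CstNode {
    CstNode { kind, text: text.to_string(), children: Vec::new() }
}

/// Walk one CST node and keep only the semantic parts.
/// Returns `None` if the node is not a main tier or has no speaker.
pub fn lower_main_tier(node: &CstNode) -> Option<MainTier> {
    if node.kind != "main_tier" {
        return None;
    }
    let mut speaker = None;
    let mut words = Vec::new();
    for child in &node.children {
        match child.kind {
            "speaker" => speaker = Some(child.text.clone()),
            "word" => words.push(child.text.clone()),
            // Whitespace and punctuation exist in the CST but are
            // dropped from this toy AST.
            _ => {}
        }
    }
    Some(MainTier { speaker: speaker?, words })
}
```

The point of the sketch is the asymmetry: the CST side is lossless, while the AST side keeps only what later validation needs.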
Error Recovery
Tree-sitter’s GLR algorithm provides automatic error recovery. When the parser encounters unexpected input, it:
- Inserts ERROR nodes in the CST
- Continues parsing the rest of the file
- Reports parse errors via the ErrorSink trait
This means the parser always produces a result, even for malformed files: it extracts as much structure as possible.
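The report-and-continue pattern can be sketched as follows. The trait shape here is an assumption made for illustration: `ParseError`, `CollectingSink`, `report`, and `parse_lines` are hypothetical names, and the toy line filter stands in for tree-sitter's actual recovery, which works on ERROR nodes rather than on whole lines.

```rust
use std::cell::RefCell;

/// Hypothetical diagnostic record; the real crate's type may differ.
#[derive(Debug, Clone, PartialEq)]
pub struct ParseError {
    pub line: usize,
    pub message: String,
}

/// Hypothetical sink trait. It takes `&self` so that one sink can be
/// shared by reference across many parse calls.
pub trait ErrorSink {
    fn report(&self, error: ParseError);
}

/// Sink that collects every reported error in memory.
#[derive(Default)]
pub struct CollectingSink {
    errors: RefCell<Vec<ParseError>>,
}

impl CollectingSink {
    /// Consume the sink and return the collected errors.
    pub fn into_errors(self) -> Vec<ParseError> {
        self.errors.into_inner()
    }
}

impl ErrorSink for CollectingSink {
    fn report(&self, error: ParseError) {
        self.errors.borrow_mut().push(error);
    }
}

/// Toy parser: keeps lines starting with '@', '*', or '%', reports
/// anything else to the sink, and carries on with the next line.
pub fn parse_lines(source: &str, sink: &dyn ErrorSink) -> Vec<String> {
    let mut kept = Vec::new();
    for (index, line) in source.lines().enumerate() {
        if line.starts_with('@') || line.starts_with('*') || line.starts_with('%') {
            kept.push(line.to_string());
        } else {
            sink.report(ParseError {
                line: index + 1,
                message: format!("unexpected line: {}", line),
            });
        }
    }
    kept
}
```

The caller always gets a result back and inspects the sink afterwards to decide whether the diagnostics are fatal.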
ParseOutcome
Individual parse functions return ParseOutcome<T>:
- ParseOutcome::parsed(value): successfully parsed
- ParseOutcome::rejected(): could not parse this node (error already reported)
This allows the parser to skip individual malformed elements while continuing to parse the rest of the file.
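A minimal sketch of how such an outcome type lets a caller skip bad elements follows. The enum below is a hypothetical stand-in for the crate's `ParseOutcome<T>`, whose real definition may differ; `into_option`, `parse_token`, and `parse_tokens` are invented for the example.

```rust
/// Hypothetical stand-in for the crate's `ParseOutcome<T>`.
#[derive(Debug, PartialEq)]
pub enum ParseOutcome<T> {
    Parsed(T),
    Rejected,
}

impl<T> ParseOutcome<T> {
    pub fn parsed(value: T) -> Self {
        ParseOutcome::Parsed(value)
    }

    pub fn rejected() -> Self {
        ParseOutcome::Rejected
    }

    /// Assumed helper: drop the rejection case so callers can filter.
    pub fn into_option(self) -> Option<T> {
        match self {
            ParseOutcome::Parsed(value) => Some(value),
            ParseOutcome::Rejected => None,
        }
    }
}

/// Toy element parser: accepts purely alphabetic tokens. The
/// diagnostic is recorded before the rejection is returned, so the
/// caller does not need to report anything itself.
pub fn parse_token(text: &str, diagnostics: &mut Vec<String>) -> ParseOutcome<String> {
    if !text.is_empty() && text.chars().all(|c| c.is_alphabetic()) {
        ParseOutcome::parsed(text.to_string())
    } else {
        diagnostics.push(format!("malformed token: {}", text));
        ParseOutcome::rejected()
    }
}

/// Parse every token on a line, skipping the rejected ones.
pub fn parse_tokens(line: &str, diagnostics: &mut Vec<String>) -> Vec<String> {
    line.split_whitespace()
        .filter_map(|token| parse_token(token, diagnostics).into_option())
        .collect()
}
```

Because a rejection carries no payload, the type encodes the "error already reported" contract: a caller that sees a rejection has nothing left to do except move on.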
Parser Equivalence
The 78-file reference corpus is the primary correctness guarantee:
cargo nextest run -p talkbank-parser-tests -E 'test(parser_equivalence)'
Each .cha file is its own test, so nextest runs them in parallel and reports failures per file.
TreeSitterParser API
TreeSitterParser is the sole API handle for parsing. Callers create one
instance and pass &TreeSitterParser to all parsing call sites. There is
no trait abstraction: TreeSitterParser is a concrete type in the
talkbank-parser crate.
use talkbank_parser::TreeSitterParser;
let parser = TreeSitterParser::new()?;
// Full-file parsing (methods on TreeSitterParser).
// parse_chat_file returns ParseResult<ChatFile> with the diagnostic list
// embedded in the result envelope.
let chat_file = parser.parse_chat_file(&source)?;
// parse_chat_file_streaming pushes diagnostics into an ErrorSink as it
// goes, useful for very large files or LSP-style incremental flows.
let chat_file = parser.parse_chat_file_streaming(&source, &errors);
// Fragment parsing (methods on TreeSitterParser), used when synthesizing
// CHAT from non-CHAT sources (ASR output, UD annotations).
let word = parser.parse_word_fragment(word_text, &errors);
let main_tier = parser.parse_main_tier_fragment(tier_text, &errors);
AST Structure
The resulting ChatFile AST has a recursive content structure:
flowchart TD
cf["ChatFile"]
hdr["Headers\n@Languages, @Participants,\n@ID, @Options"]
utts["Utterances[]"]
mt["MainTier\nspeaker + content"]
dt["DependentTiers[]\n%mor, %gra, %pho, %sin, %wor"]
uc["UtteranceContent\n24 variants"]
leaf["Leaves\nWord | ReplacedWord | Separator"]
group["Groups\nGroup | AnnotatedGroup |\nRetrace | PhoGroup | SinGroup | Quotation"]
cf --> hdr & utts
utts --> mt & dt
mt --> uc
uc --> leaf & group
group -->|recurse| uc
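The recursion in the diagram (groups contain content, which may contain further groups) can be sketched with a cut-down enum. This is a hypothetical four-variant subset written for illustration; the real `UtteranceContent` has 24 variants, and the payload types here are simplified to plain strings and vectors.

```rust
/// Simplified, hypothetical subset of the content model.
#[derive(Debug, Clone, PartialEq)]
pub enum UtteranceContent {
    // Leaves
    Word(String),
    Separator(String),
    // Groups: each holds nested content, which is where the recursion
    // comes from.
    Group(Vec<UtteranceContent>),
    Retrace(Vec<UtteranceContent>),
}

/// Collect every word in document order, descending into groups.
pub fn collect_words<'a>(items: &'a [UtteranceContent], out: &mut Vec<&'a str>) {
    for item in items {
        match item {
            UtteranceContent::Word(text) => out.push(text.as_str()),
            UtteranceContent::Separator(_) => {}
            UtteranceContent::Group(children) | UtteranceContent::Retrace(children) => {
                collect_words(children, out)
            }
        }
    }
}
```

Any traversal over the model (validation, word counting, alignment) has the same shape: handle the leaves directly and recurse on the group variants.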
Parser String Handling
The tree-sitter parser constructs owned model types (e.g., MorWord, GrammaticalRelation) directly from CST text. String-heavy types like PosCategory and MorStem use Arc<str> interning to avoid redundant allocations for repeated values. Short strings in model newtypes use SmolStr for inline storage up to 23 bytes.
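The `Arc<str>` interning idea can be sketched with the standard library alone. The `Interner` type below is hypothetical and is not the crate's implementation; it only shows why repeated values such as part-of-speech tags end up sharing one allocation.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Hypothetical interner: hands out one shared `Arc<str>` per
/// distinct string value.
#[derive(Default)]
pub struct Interner {
    seen: HashSet<Arc<str>>,
}

impl Interner {
    /// Return the shared handle for `text`, allocating only the first
    /// time a given value is seen.
    pub fn intern(&mut self, text: &str) -> Arc<str> {
        // `Arc<str>` implements `Borrow<str>`, so the set can be
        // queried with a plain `&str` without allocating.
        if let Some(existing) = self.seen.get(text) {
            return Arc::clone(existing);
        }
        let fresh: Arc<str> = Arc::from(text);
        self.seen.insert(Arc::clone(&fresh));
        fresh
    }

    /// Number of distinct strings stored.
    pub fn len(&self) -> usize {
        self.seen.len()
    }
}
```

Cloning an `Arc<str>` only bumps a reference count, so a corpus with thousands of identical tags pays for the string data once.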