Transform Pipeline
Status: Current Last updated: 2026-06-14 19:57 EDT
The talkbank-transform crate provides high-level pipelines that compose parsing, validation, and serialization into reusable workflows.
Core Pipelines
Parse + Validate
The most common pipeline: parse a CHAT file and validate it.
use talkbank_transform::parse_and_validate;
let result = parse_and_validate(source, &parser, &error_collector);
This:
- Parses the source text into a
ChatFileAST - Runs validation (alignment checks, header consistency, etc.)
- Collects all errors and warnings into the
ErrorSink
CHAT → JSON
Convert a CHAT file to its JSON representation:
use talkbank_transform::chat_to_json;
let json = chat_to_json(source, &parser)?;
The JSON follows the schema at schema/chat-file.schema.json.
JSON → CHAT
The JSON produced by chat_to_json is schema-conformant and
round-trips. Deserialize it back into a ChatFile with serde_json
(the model derives Deserialize), then serialize through WriteChat
to reproduce CHAT text:
let chat_file: talkbank_model::ChatFile = serde_json::from_str(json_str)?;
let chat_text = chat_file.to_chat_string();
The chatter from-json command wraps this path
(crates/chatter/src/commands/json.rs, json_to_chat).
CHAT → CHAT (Normalize)
Parse and reserialize to normalize formatting:
use talkbank_transform::normalize_chat;
let normalized = normalize_chat(source, &parser)?;
normalize_chat lives in
crates/talkbank-transform/src/pipeline/convert.rs.
Validation + Roundtrip Cache Lifecycle
The following diagram shows the full validation and roundtrip pipeline, including the cache layer:
flowchart TD
file["CHAT file"]
cache{"Cache\nhit?"}
parse["Parse\n(tree-sitter → AST)"]
validate["Validate\n(per-file → per-utterance →\nmain tier → dependent tiers)"]
rt{"Roundtrip\nflag?"}
ser1["Serialize → CHAT text"]
reparse["Reparse CHAT text"]
ser2["Serialize again"]
cmp{"Two\nserializations\nmatch?"}
store["Store in cache\n(SQLite)"]
pass["Pass"]
fail["Fail"]
cached["Return cached result"]
file --> cache
cache -->|miss| parse --> validate --> rt
cache -->|hit| cached
rt -->|yes| ser1 --> reparse --> ser2 --> cmp
rt -->|no| store --> pass
cmp -->|yes| store
cmp -->|no| fail
Streaming Parse
For large files or interactive use, the transform crate supports streaming parse where utterances are processed incrementally rather than loading the entire AST into memory.
Caching
The transform layer integrates with a file-system cache. Validation results are keyed by content hash, so unchanged files skip re-validation. Cache location is platform-specific: ~/Library/Caches/talkbank-chat/ (macOS), ~/.cache/talkbank-chat/ (Linux), %LocalAppData%\talkbank-chat\ (Windows).
Use --force to bypass the cache for specific paths.
Error Collection
Pipelines use the ErrorSink trait for error reporting. Callers can provide:
- A collecting sink (gathers all diagnostics for batch output)
- A printing sink (writes diagnostics to stderr in real-time)
- A custom sink (for LSP diagnostics, JSON output, etc.)