Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Transform Pipeline

Status: Current Last updated: 2026-06-14 19:57 EDT

The talkbank-transform crate provides high-level pipelines that compose parsing, validation, and serialization into reusable workflows.

Core Pipelines

Parse + Validate

The most common pipeline: parse a CHAT file and validate it.

use talkbank_transform::parse_and_validate;

let result = parse_and_validate(source, &parser, &error_collector);

This:

  1. Parses the source text into a ChatFile AST
  2. Runs validation (alignment checks, header consistency, etc.)
  3. Collects all errors and warnings into the ErrorSink

CHAT → JSON

Convert a CHAT file to its JSON representation:

use talkbank_transform::chat_to_json;

let json = chat_to_json(source, &parser)?;

The JSON follows the schema at schema/chat-file.schema.json.

JSON → CHAT

The JSON produced by chat_to_json is schema-conformant and round-trips. Deserialize it back into a ChatFile with serde_json (the model derives Deserialize), then serialize through WriteChat to reproduce CHAT text:

let chat_file: talkbank_model::ChatFile = serde_json::from_str(json_str)?;
let chat_text = chat_file.to_chat_string();

The chatter from-json command wraps this path (crates/chatter/src/commands/json.rs, json_to_chat).

CHAT → CHAT (Normalize)

Parse and reserialize to normalize formatting:

use talkbank_transform::normalize_chat;

let normalized = normalize_chat(source, &parser)?;

normalize_chat lives in crates/talkbank-transform/src/pipeline/convert.rs.

Validation + Roundtrip Cache Lifecycle

The following diagram shows the full validation and roundtrip pipeline, including the cache layer:

flowchart TD
    file["CHAT file"]
    cache{"Cache\nhit?"}
    parse["Parse\n(tree-sitter → AST)"]
    validate["Validate\n(per-file → per-utterance →\nmain tier → dependent tiers)"]
    rt{"Roundtrip\nflag?"}
    ser1["Serialize → CHAT text"]
    reparse["Reparse CHAT text"]
    ser2["Serialize again"]
    cmp{"Two\nserializations\nmatch?"}
    store["Store in cache\n(SQLite)"]
    pass["Pass"]
    fail["Fail"]
    cached["Return cached result"]

    file --> cache
    cache -->|miss| parse --> validate --> rt
    cache -->|hit| cached
    rt -->|yes| ser1 --> reparse --> ser2 --> cmp
    rt -->|no| store --> pass
    cmp -->|yes| store
    cmp -->|no| fail

Streaming Parse

For large files or interactive use, the transform crate supports streaming parse where utterances are processed incrementally rather than loading the entire AST into memory.

Caching

The transform layer integrates with a file-system cache. Validation results are keyed by content hash, so unchanged files skip re-validation. Cache location is platform-specific: ~/Library/Caches/talkbank-chat/ (macOS), ~/.cache/talkbank-chat/ (Linux), %LocalAppData%\talkbank-chat\ (Windows).

Use --force to bypass the cache for specific paths.

Error Collection

Pipelines use the ErrorSink trait for error reporting. Callers can provide:

  • A collecting sink (gathers all diagnostics for batch output)
  • A printing sink (writes diagnostics to stderr in real-time)
  • A custom sink (for LSP diagnostics, JSON output, etc.)