Memory and Ownership

Status: Current. Last updated: 2026-03-24 01:32 EDT

This chapter documents the memory management and ownership patterns used across the TalkBank Rust crates. Understanding these decisions helps contributors make consistent choices when adding new code.

String Representation Strategy

CHAT corpora contain massive repetition: the same speaker codes, language codes, POS tags, and high-frequency words appear millions of times across files. The codebase uses three string types, chosen by expected cardinality and duplication:

```mermaid
flowchart LR
    raw["Raw input (&str)"]
    smol["SmolStr\n(inline ≤23 bytes)"]
    arc["Arc<str>\n(interned, deduplicated)"]
    string["String\n(owned, unique)"]

    raw -->|short, low repetition| smol
    raw -->|high repetition domain value| arc
    raw -->|ephemeral/unique| string
```

| Type | When to use | Examples |
|------|-------------|----------|
| SmolStr | Short tokens, low duplication | Postcode text, tier content, event labels |
| Arc<str> (interned) | High-repetition domain symbols | Speaker codes, language codes, POS tags, stems |
| String | Ephemeral or unique values | Error messages, temporary formatting |

String Interning

Location: talkbank-model/src/model/intern.rs

Five global process-local interners, each a DashMap<Arc<str>, Arc<str>> behind OnceLock<StringInterner>:

| Interner | Pre-seeded values | Typical savings |
|----------|-------------------|-----------------|
| speaker_interner() | 30+ codes (CHI, MOT, FAT, …) | High: 3-letter codes repeat per utterance |
| language_interner() | 45+ ISO 639-3 codes | Moderate: per-file |
| pos_interner() | 60+ POS tags + UD relations | Very high: every %mor word |
| stem_interner() | 200+ frequent English stems | High: function words dominate |
| participant_interner() | 14 roles (Target_Child, …) | Low: per-file |

How it works:

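  • Fast path: get() on the DashMap; on a hit, return an O(1) Arc::clone.
  • Slow path: on a miss, insert() a new Arc so that later lookups deduplicate against it.
  • Thread-safe: DashMap locks individual shards, so there is no single global lock to contend on.
  • After initialization, reaching an interner through its OnceLock is lock-free; a lookup touches only the shard that holds the key.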

Memory impact: 50-200 MB savings on large corpora (5-20% reduction). Arc::clone is an O(1) atomic reference-count increment, whereas String::clone is an O(n) copy.
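
The lookup logic described above can be sketched with the standard library alone. This is not the code in intern.rs: it swaps DashMap for a single RwLock<HashSet<Arc<str>>>, so it shows the fast and slow paths but not shard-level locking. The with_seed constructor, the intern method name, and the three-code seed list are assumptions made for illustration.

```rust
use std::collections::HashSet;
use std::sync::{Arc, OnceLock, RwLock};

/// Std-only stand-in for the DashMap-backed interner (assumption: the real
/// type shards its locks; this sketch uses one RwLock for the whole set).
pub struct StringInterner {
    set: RwLock<HashSet<Arc<str>>>,
}

impl StringInterner {
    /// Hypothetical constructor that pre-seeds common values.
    fn with_seed(seed: &[&str]) -> Self {
        let set = seed.iter().map(|s| Arc::<str>::from(*s)).collect();
        StringInterner { set: RwLock::new(set) }
    }

    /// Returns the canonical Arc<str> for `s`, allocating only on first sight.
    pub fn intern(&self, s: &str) -> Arc<str> {
        // Fast path: shared read lock, O(1) Arc::clone on a hit.
        {
            let set = self.set.read().unwrap();
            if let Some(hit) = set.get(s) {
                return Arc::clone(hit);
            }
        }
        // Slow path: take the write lock and re-check, because another
        // thread may have inserted the same string in the meantime.
        let mut set = self.set.write().unwrap();
        if let Some(hit) = set.get(s) {
            return Arc::clone(hit);
        }
        let fresh: Arc<str> = Arc::from(s);
        set.insert(Arc::clone(&fresh));
        fresh
    }
}

/// Global accessor, lazily initialized on first use. The seed list here is
/// a three-item illustration, not the real pre-seeded table.
pub fn speaker_interner() -> &'static StringInterner {
    static INTERNER: OnceLock<StringInterner> = OnceLock::new();
    INTERNER.get_or_init(|| StringInterner::with_seed(&["CHI", "MOT", "FAT"]))
}
```

Two calls to speaker_interner().intern("CHI") return Arcs that point at the same allocation, which Arc::ptr_eq can confirm. That pointer sharing is what turns millions of repeated codes into reference-count increments rather than copies.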

Newtype Macros

Two macros generate domain-typed string wrappers:

  • string_newtype!: wraps SmolStr. Used for generic CHAT text.
  • interned_newtype!: wraps Arc<str> with automatic interning. Used for domain symbols.

```rust
// SmolStr-backed: no interning, inline small strings
string_newtype!(PostcodeText);

// Arc<str>-backed: interned via global interner
interned_newtype!(SpeakerCode, speaker_interner);
```
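
To give a feel for what such a macro might generate, here is a minimal sketch of a newtype-producing macro. The macro name arc_newtype!, the generated methods, and the is_chi helper are all hypothetical; the real string_newtype! and interned_newtype! definitions may differ in derives, trait impls, and how they reach the interner.

```rust
use std::sync::Arc;

/// Hypothetical macro: wraps Arc<str> in a distinct, domain-specific type.
macro_rules! arc_newtype {
    ($name:ident) => {
        #[derive(Debug, Clone, PartialEq, Eq, Hash)]
        pub struct $name(Arc<str>);

        impl $name {
            pub fn new(s: &str) -> Self {
                // A real interned newtype would look `s` up in its global
                // interner here instead of allocating a fresh Arc.
                $name(Arc::from(s))
            }

            pub fn as_str(&self) -> &str {
                &self.0
            }
        }
    };
}

arc_newtype!(SpeakerCode);
arc_newtype!(LanguageCode);

/// Accepts only speaker codes; passing a LanguageCode does not compile.
pub fn is_chi(code: &SpeakerCode) -> bool {
    code.as_str() == "CHI"
}
```

The point of the wrapper is type safety rather than storage: SpeakerCode and LanguageCode share a representation, but the compiler keeps them from being mixed up.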

Ownership Model

ChatFile Lifecycle

```mermaid
flowchart TD
    src["Source text (&str)"]
    cst["tree-sitter CST\n(Tree, borrowed nodes)"]
    model["ChatFile\n(owned AST)"]
    cache["SQLite cache\n(validation result)"]
    lsp["LSP server\n(per-document state)"]
    json["JSON output\n(serde serialization)"]
    cli["CLI output\n(CHAT text)"]

    src -->|tree-sitter parse| cst
    cst -->|CST-to-model conversion| model
    model -->|validate + hash| cache
    model -->|held in backend| lsp
    model -->|to_json()| json
    model -->|to_chat_string()| cli
```

  • Parsing: The tree-sitter Tree owns the CST. Node<'a> values borrow from the Tree, so traversal is zero-copy. The CST-to-model conversion copies data into owned ChatFile fields (SmolStr, Arc<str>), and the Tree is dropped after conversion.
  • Validation: ChatFile is borrowed (&self) during validation. Errors are streamed to an ErrorSink, so validation never has to accumulate them.
  • LSP: Each open document holds an owned ChatFile in the backend, re-parsed on every edit via tree-sitter incremental parsing.
  • CLI batch: Each file is independently parsed → validated → reported → dropped. No cross-file state except the shared cache.
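
The borrow-then-drop flow of the validation and CLI steps can be sketched as follows. The ChatFile fields, the ErrorSink trait shape, the CountingSink type, and the terminator rule are simplified, hypothetical stand-ins; only the ownership pattern is the point.

```rust
/// Hypothetical, heavily simplified stand-in for the owned model.
pub struct ChatFile {
    pub utterances: Vec<String>,
}

/// Hypothetical sink trait: validation pushes each error as it is found.
pub trait ErrorSink {
    fn report(&mut self, message: String);
}

/// A sink that counts errors without storing them.
pub struct CountingSink {
    pub count: usize,
}

impl ErrorSink for CountingSink {
    fn report(&mut self, _message: String) {
        self.count += 1;
    }
}

/// Borrows the file immutably; each problem goes straight to the sink.
/// The terminator check is a toy rule, not a real CHAT validation.
pub fn validate(file: &ChatFile, sink: &mut dyn ErrorSink) {
    for (i, utt) in file.utterances.iter().enumerate() {
        if !utt.ends_with(['.', '?', '!']) {
            sink.report(format!("utterance {i}: missing terminator"));
        }
    }
}

/// Parse, validate, report, drop: one file at a time.
pub fn check_one(source: &str) -> usize {
    let file = ChatFile {
        utterances: source.lines().map(str::to_owned).collect(),
    };
    let mut sink = CountingSink { count: 0 };
    validate(&file, &mut sink);
    // `file` is dropped when this function returns, so nothing carries
    // over to the next file.
    sink.count
}
```

Because validate takes &ChatFile and the sink decides what to keep, the same function can serve a CLI that prints errors immediately and a test harness that collects them.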

Arc Usage

Arc appears in three distinct roles:

| Role | Type | Why |
|------|------|-----|
| String interning | Arc<str> in model types | O(1) clone for high-repetition domain values |
| Worker pool | Arc<WorkerGroup> in batchalign | RAII CheckedOutWorker::drop() needs group reference to return worker |
| Cache backend | Arc<dyn CacheBackend> in batchalign | Shared across async request handlers |

No Rc (single-threaded sharing not needed). No Cow<str> (SmolStr covers the inline-small-string use case more naturally).

Interior Mutability

| Pattern | Where | What it protects |
|---------|-------|------------------|
| RefCell<Parser> inside TreeSitterParser | talkbank-parser | Tree-sitter Parser needs &mut self but is never shared across threads (the RefCell wrapper is not Sync). Callers create a TreeSitterParser and pass &TreeSitterParser everywhere. |
| DashMap<Arc<str>, Arc<str>> | String interners | Concurrent interning during parallel parsing. Shard-level locks. |
| OnceLock<StringInterner> | 5 global interners | Lazy init, lock-free after first access |
| LazyLock<Regex> | All regex patterns workspace-wide | Compile-once, no per-call overhead |
| std::sync::Mutex<VecDeque> | batchalign worker idle queue | Held < 10 μs for push/pop only |
| tokio::sync::Mutex<HashMap> | batchalign job store | Short reads/writes, never held across .await |
| Semaphore | Worker availability (batchalign) | Async signaling without holding locks during dispatch |

Rule: use std::sync::Mutex for data accessed from sync code or held briefly. Use tokio::sync::Mutex only when the lock must be held across .await points (which we avoid when possible). Use DashMap when many threads read concurrently.
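
The RefCell pattern from the first table row is worth seeing in miniature. In this sketch the Parser type is a made-up stand-in for tree_sitter::Parser (it just counts whitespace-separated tokens), and parse_count and count_tokens are hypothetical helpers; the shape of the wrapper is what matters.

```rust
use std::cell::RefCell;

/// Stand-in for a parser whose `parse` method needs exclusive access.
pub struct Parser {
    parses: usize,
}

impl Parser {
    pub fn parse(&mut self, text: &str) -> usize {
        self.parses += 1;
        text.split_whitespace().count()
    }
}

/// Wrapper that exposes `&self` methods over a `&mut self` parser.
pub struct TreeSitterParser {
    inner: RefCell<Parser>,
}

impl TreeSitterParser {
    pub fn new() -> Self {
        TreeSitterParser { inner: RefCell::new(Parser { parses: 0 }) }
    }

    /// Takes &self; the mutable borrow lasts only for this one call.
    pub fn parse(&self, text: &str) -> usize {
        self.inner.borrow_mut().parse(text)
    }

    pub fn parse_count(&self) -> usize {
        self.inner.borrow().parses
    }
}

/// Callers thread a shared reference through, never `&mut`.
pub fn count_tokens(parser: &TreeSitterParser, lines: &[&str]) -> usize {
    lines.iter().map(|line| parser.parse(line)).sum()
}
```

Since RefCell<T> is never Sync, the compiler rejects any attempt to share one TreeSitterParser between threads; parallel code has to create a parser per thread.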

Collection Choices

| Collection | Where | Why not HashMap/Vec |
|------------|-------|---------------------|
| BTreeMap | All test/snapshot JSON output | Deterministic key ordering for reviewable diffs |
| IndexMap | Participants, per-speaker results | Preserves encounter order (CHAT spec requires @Participants order) |
| SmallVec<[T; N]> | Headers (N=2), tiers (N=3), features (N=4), token mappings (N=4) | Inline storage for common sizes; avoids heap for typical cases |
| VecDeque | Worker idle queue (batchalign) | FIFO fair scheduling |
| Dense Vec indexed by position | Retokenize word-to-token mapping | O(1) lookup, no hashing overhead, cache-friendly |

No LinkedList, BinaryHeap, or custom allocators.
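
Two of these choices are easy to illustrate with the standard library. The function names and data shapes below are assumptions made for the example, not the crates' actual APIs: render_counts shows why BTreeMap gives stable output, and word_to_tokens shows a dense Vec used as a position-indexed map.

```rust
use std::collections::BTreeMap;

/// BTreeMap iterates in key order, so the rendered text is identical on
/// every run regardless of insertion order.
pub fn render_counts(counts: &BTreeMap<String, usize>) -> String {
    counts
        .iter()
        .map(|(speaker, n)| format!("{speaker}={n}"))
        .collect::<Vec<_>>()
        .join(",")
}

/// Dense mapping: slot `w` holds the indices of the tokens that came from
/// word `w`. `token_word_ids[t]` is the word position of token `t`.
/// Panics if any word position is >= `word_count`.
pub fn word_to_tokens(token_word_ids: &[usize], word_count: usize) -> Vec<Vec<usize>> {
    let mut map = vec![Vec::new(); word_count];
    for (token_idx, &word_idx) in token_word_ids.iter().enumerate() {
        map[word_idx].push(token_idx);
    }
    map
}
```

Looking up the tokens for word 2 is then a plain index, map[2], with no hashing and good cache locality, which is the trade the table describes.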

Tree-Sitter Memory Model

Tree-sitter parsing is zero-copy for CST traversal:

```rust
// Node<'a> borrows from Tree, no allocation per node
fn process_node<'a>(node: Node<'a>, source: &str) -> ParseResult<...> {
    for i in 0..node.child_count() {
        let child: Node<'a> = node.child(i).unwrap(); // Stack-only, no heap
        let text: &str = child.utf8_text(source.as_bytes())?; // Borrows source
        // ... convert to owned model types ...
    }
}
```

The tree-sitter parser consumes &str, produces a CST, and the Rust traversal code constructs owned model types from CST nodes.

SQLite Memory-Mapped I/O

The validation cache uses SQLite with memory-mapped I/O for fast random access:

```rust
SqliteConnectOptions::new()
    .journal_mode(SqliteJournalMode::Wal)       // Concurrent reads during writes
    .pragma("cache_size", "-8000")               // 8 MB page cache
    .pragma("mmap_size", "268435456")            // 256 MB memory-mapped region
    .synchronous(SqliteSynchronous::Normal)      // Balanced durability
```

This configuration handles 95,000+ cached entries efficiently. The cache is never deleted (use --force to refresh specific paths).

Manual Drop Implementations

Three types have custom Drop for resource cleanup:

| Type | Cleanup action | Why |
|------|----------------|-----|
| AuditReporter | Joins audit writer thread and flushes output | Audit mode owns file IO in a dedicated writer thread |
| CheckedOutWorker | Returns worker to idle queue + releases semaphore permit | RAII pool resource management |
| WorkerHandle | Sends SIGTERM/SIGKILL to child process | Process must be terminated when handle drops |

The drops form no cycles and have no ordering dependencies on one another.
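
The CheckedOutWorker row can be sketched as a small RAII guard. This is a simplified, std-only illustration: the Worker fields, the check_out function, and idle_count are hypothetical, and the semaphore permit that the real type releases is left out entirely.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

pub struct Worker {
    pub id: u32,
}

pub struct WorkerGroup {
    idle: Mutex<VecDeque<Worker>>,
}

/// Guard that owns a worker and hands it back to the group on drop.
pub struct CheckedOutWorker {
    worker: Option<Worker>,
    group: Arc<WorkerGroup>,
}

impl WorkerGroup {
    pub fn new(ids: &[u32]) -> Arc<Self> {
        let idle = ids.iter().map(|&id| Worker { id }).collect();
        Arc::new(WorkerGroup { idle: Mutex::new(idle) })
    }

    pub fn idle_count(&self) -> usize {
        self.idle.lock().unwrap().len()
    }
}

/// Pops the longest-idle worker (FIFO), or returns None if all are busy.
pub fn check_out(group: &Arc<WorkerGroup>) -> Option<CheckedOutWorker> {
    // The lock guard is a temporary, released at the end of this statement.
    let worker = group.idle.lock().unwrap().pop_front()?;
    Some(CheckedOutWorker { worker: Some(worker), group: Arc::clone(group) })
}

impl CheckedOutWorker {
    pub fn id(&self) -> u32 {
        self.worker.as_ref().expect("worker is present until drop").id
    }
}

impl Drop for CheckedOutWorker {
    fn drop(&mut self) {
        // The guard holds an Arc to the group so it can return the worker
        // even if the original handle has gone out of scope.
        if let Some(worker) = self.worker.take() {
            self.group.idle.lock().unwrap().push_back(worker);
        }
    }
}
```

This is why the guard needs an Arc<WorkerGroup> rather than a borrowed reference: the drop may run on another task long after the checkout call returned, and the lock is held only for the single push or pop.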

Allocation Optimization Patterns

Rather than using an arena allocator (bumpalo was evaluated and removed because the data lifetimes don’t fit the “allocate many, free all at once” pattern), the codebase uses targeted optimizations:

| Pattern | Where | Savings |
|---------|-------|---------|
| Scratch buffer reuse (clear + swap) | DP alignment row costs | ~50% fewer allocations in inner loop |
| Flat table (vec![...; rows * cols]) | DP small-problem fallback | 1 allocation vs rows+1 |
| Dense Vec instead of HashMap | Retokenize word mapping | O(1) lookup, no hash overhead |
| SmallVec inline storage | Throughout | Avoids heap for 1-4 element collections |
| SmolStr inline strings | All short CHAT tokens | No heap allocation for ≤23 byte strings |
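
As an illustration of the first row, here is a minimal sketch of the clear + swap idiom on a word-level Levenshtein distance. It is a generic example, not the batchalign alignment code, and the edit_distance name and signature are assumptions.

```rust
/// Word-level edit distance using two reusable rows instead of allocating
/// a fresh Vec for every row of the DP table.
pub fn edit_distance(a: &[&str], b: &[&str]) -> usize {
    // `prev` starts as row 0: turning an empty prefix into b[..j] costs j.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr: Vec<usize> = Vec::with_capacity(b.len() + 1);

    for (i, wa) in a.iter().enumerate() {
        curr.clear(); // keeps the capacity, so pushes do not reallocate
        curr.push(i + 1);
        for (j, wb) in b.iter().enumerate() {
            let substitute = prev[j] + usize::from(wa != wb);
            let delete = prev[j + 1] + 1;
            let insert = curr[j] + 1;
            curr.push(substitute.min(delete).min(insert));
        }
        // Swap the buffers: the finished row becomes `prev`, and the old
        // `prev` is recycled as scratch space for the next row.
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}
```

The whole computation performs two allocations no matter how many rows the table has, which is the kind of saving the table attributes to scratch-buffer reuse.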

See also: the batchalign3 book’s Arena Allocators page for the full evaluation of where arenas do and don’t help.