Alignment
Status: Current Last modified: 2026-06-15 15:00 EDT
Alignment in the toolchain operates at two structural layers, plus a separate overlap-marker pass. Tier alignment is structural (counting and pairing AST nodes); word extraction is positional (domain-ordered token indices).
| Layer | Where | Purpose |
|---|---|---|
| Tier alignment | talkbank-model::alignment | 1:1 mapping between main tier and dependent tiers (%mor, %pho, %wor, %sin, %gra) |
| Word extraction | talkbank-transform::extract | Pull NLP-ready words from the AST in domain order |
Tier Alignment
Validates that dependent tiers have the correct number and arrangement
of items relative to the main tier. Lives in
crates/talkbank-model/src/alignment/.
TierDomain
#![allow(unused)]
fn main() {
enum TierDomain { Mor, Pho, Sin, Wor }
}
The same utterance produces different counts per domain:
| Rule | Mor | Pho | Sin | Wor |
|---|---|---|---|---|
| Skip retrace groups | Yes | No | No | No |
| Count pauses | No | Yes | No | No |
| PhoGroup | Recurse | Atomic (1) | Skip (0) | Recurse |
| SinGroup | Recurse | Skip (0) | Atomic (1) | Recurse |
Include fragments (&+) | No | Yes | Yes | No |
Include nonwords (&~) | No | Yes | Yes | No |
Include fillers (&-) | No | Yes | Yes | Yes |
| Include untranscribed | No | Yes | Yes | No |
| Include tag-marker separators | Yes | No | No | No |
ReplacedWord aligns to | Replacement | Original | Original | Original |
For the underlying word filter (counts_for_tier,
should_skip_group), the content walker, and the ChatFile model itself,
see CHAT Data Model. The walker plus the
domain table together govern every tier-alignment count.
Retrace handling, alignment-critical
Retraces are the most alignment-critical content type. A Retrace node
wraps content the speaker said then corrected.
- Mor: skip entirely (count
0). The retrace was a false start; only the correction carries morphological analysis. - Pho, Sin: recurse, words were physically produced and have phonological / gestural data.
- Wor: recurse, retrace ancestry does not change
%wormembership.
Critical invariant: the parser must emit UtteranceContent::Retrace
for all retrace patterns, including single-word retraces with
replacements (word [: repl] [* err] [//]). If a retrace is
accidentally emitted as a bare ReplacedWord, it counts for %mor
alignment, causing false E705 errors. Enforced by
tests/retrace_replaced_word_regression.rs. Full data model + parsing
pipeline + CHAT examples in
Retraces and Repetitions.
AlignmentPair
#![allow(unused)]
fn main() {
struct AlignmentPair {
source_index: Option<usize>,
target_index: Option<usize>,
}
}
Universal index-pair primitive. Some/Some = matched. One None =
insertion / deletion placeholder for mismatch diagnostics.
is_complete(), both indices Some. is_placeholder(), unmatched.
Per-domain results
| Type | Function | Source → Target |
|---|---|---|
MorAlignment | align_main_to_mor() | Main → %mor items |
PhoAlignment | align_main_to_pho() | Main → %pho tokens |
SinAlignment | align_main_to_sin() | Main → %sin tokens |
WorAlignment | align_main_to_wor() | Main → %wor tokens |
GraAlignment | align_mor_to_gra() | %mor chunks → %gra relations |
%gra aligns to %mor chunks, not items. Clitics create additional
chunks (pro|it~v|be&PRES = 2 chunks: pre-clitic + main).
Trait abstractions
| Trait | Purpose | Implementors |
|---|---|---|
IndexPair | source()/target() on any pair type | AlignmentPair, GraAlignmentPair |
TierAlignmentResult | pairs()/errors()/push_*() accumulator | All 5 alignment result types |
AlignableTier | What a tier provides for generic alignment | PhoTier, SinTier, WorTier |
TierCountable | count_tier_positions() / collect_tier_items() methods | [UtteranceContent] |
The generic positional_align() function uses AlignableTier to
eliminate duplication: align_main_to_{pho,sin,wor}() are thin
wrappers around it. %mor doesn’t use it (additional terminator
validation logic). %gra doesn’t use it (source is MorTier, not
MainTier). WorTier overrides mismatch_format() to Diff (LCS) since
both sides are word sequences; the others use Positional.
%wor is not validated
%wor is a timing-annotation tier. There is no downstream positional
indexing into %wor, and validate_alignments() does not check
%wor word count against the main tier. Old corpus files may have
xxx, fragments, or nonwords in %wor (pre-2026-04 behavior) without
producing false errors.
Phon tier-to-tier alignment
A second class of alignment that operates between dependent tiers:
| Source | Target | Code |
|---|---|---|
%modsyl | %mod | E725 |
%phosyl | %pho | E726 |
%phoaln | %mod | E727 |
%phoaln | %pho | E728 |
Derived-view alignments: %modsyl is a syllabified reannotation of
%mod, %phosyl of %pho, %phoaln aligns both. Word counts must
match between source and target. Computed in compute_alignments()
after the main-tier alignments. build_tier_to_tier_alignment()
constructs index pairs and emits build_count_mismatch_error() when
counts disagree. %phoaln checks against both %mod and %pho,
potentially emitting E727 and E728 simultaneously.
Known data issue: Phon XML source data has orthography↔IPA word
count discrepancies in ~4% of files (518 / 12,340). Expected in child
phonology data. The PhonTalk converter handles this inconsistently,
%mod/%pho are truncated to match orthography via OneToOne, but
%xmodsyl/%xphosyl/%xphoaln are written from raw IPATranscript,
exposing the full IPA word count. Result: E725-E728 mismatches.
Parse-health gating
Alignment diagnostics honor ParseHealth metadata. If a dependent
tier’s domain is parse-tainted, mismatch errors for that domain pair
are suppressed. Main-tier taint blocks all main→dependent alignments.
Dependent-tier taint blocks only that tier. Phon tier-to-tier checks
have their own gates (can_align_modsyl_to_mod,
can_align_phosyl_to_pho, can_align_phoaln).
Word Extraction
extract_words() (in crates/talkbank-transform/src/extract.rs) uses
the content walker to pull words from the AST in domain-specific order.
Returns Vec<ExtractedWord> with text, word_index, is_separator,
special_form. Tag-marker separators (, „ ‡) are included as
words in Mor domain because they have %mor items (cm|cm,
end|end, beg|beg).
Overlap Marker Iteration
CA overlap markers (⌈⌉⌊⌋) appear at three content levels,
UtteranceContent (top-level), BracketedItem (inside groups), and
WordContent (intra-word, butt⌈er⌉). Two APIs in
talkbank-model/src/alignment/helpers/overlap.rs:
walk_overlap_points, low-level
Visits every OverlapPoint in document order with word-position
context. Analogous to walk_words but for overlap markers:
walk_overlap_points(&utterance.main.content.content.0, &mut |visit| {
// visit.point: &OverlapPoint (kind + optional index)
// visit.word_position: usize (alignable words seen so far)
});
extract_overlap_info, region-based
Pairs markers by (kind, index) into OverlapRegion structs. Each
region represents a matched ⌈…⌉ or ⌊…⌋ pair. Index-aware:
⌈2...⌉2 forms a separate region from ⌈...⌉. Mismatched indices
leave markers unpaired. Onset-only ⌈ (without ⌉) is a legitimate CA
convention, region has end_at_word = None,
is_well_paired() = false, but top_onset_fraction() still works.
Cross-utterance, analyze_file_overlaps
For whole-file analysis, in overlap_groups.rs. 1:N matching: one
top region from speaker A can match multiple bottom regions from
speakers B, C, etc. Used by E347 and chatter debug overlap-audit.
Overlap validation
| Code | Level | Check |
|---|---|---|
| E347 | Cross-utterance | Orphaned tops/bottoms with 1:N matching (warning) |
| E348 | Utterance | Unpaired markers within a single utterance (warning) |
| E373 | Utterance | Invalid overlap index values (must be 2-9) |
| E704 | Cross-utterance | Same speaker encoding both top and bottom (error) |
chatter debug overlap-audit <path> reports per-file statistics
(groups, bottoms, orphans, temporal consistency) in TSV format. Use
--database <path.jsonl> for a persistent JSON-lines database.
Design Principles
- No string hacking. All alignment operates on typed AST
structures (
Word,MorTier,AlignmentPair), never on serialized CHAT text. - Domain-aware from the start.
TierDomaingates traversal at the walker level. Downstream code never re-implements retrace / group skipping logic. - Deterministic over approximate. Tier alignment and word extraction use deterministic, positional algorithms over the typed AST.
- Dense indexed structures.
AlignmentPairusesOption<usize>rather than cloned data; index pairs are stored positionally, not in hash maps. - Exhaustive matching. Every
matchonUtteranceContent(24 variants) orBracketedItem(22 variants) lists all variants explicitly. New variants are a compile error, not a silent bug. - Walker as shared primitive.
walk_words()removed ~330 lines of duplicated traversal boilerplate across 7 call sites.
Downstream Consumers
| Consumer | Crate | Usage |
|---|---|---|
| Validation | talkbank-model | Cross-tier checks (E714/E715, E725-E728), overlap (E347/E348/E373/E704) |
| LSP hover | talkbank-lsp | Show aligned tier items for word under cursor |
| Word extraction | talkbank-transform | NLP-ready words from utterances |
| Overlap audit | chatter | chatter debug overlap-audit |
%wor generation | talkbank-model | Build %wor tier from main tier |