Merge Pipeline, Test Plan

Status: Draft. Last modified: 2026-05-30 06:55 EDT

This page is the test-coverage roadmap for the new merge pipeline (chatter speaker-id + chatter merge + chatter adjudicate + the override-file format + the underlying talkbank-model::merge types). It exists because, per this repo’s root CLAUDE.md red/green TDD rule, every new feature starts with failing tests at the highest level the feature lives at, and we want to enumerate those tests before writing the implementation, so coverage is designed, not discovered.

This is a plan, not yet code. When the implementation work begins, every test case below becomes a real test; the doc then flips to a coverage matrix that gets kept honest by CI.

TDD discipline, what “strict red/green” means here

Every cycle of impl-phase work is:

  1. RED. Write ONE failing test at the highest layer the feature lives at. The test exercises a real user-observable behavior, not an internal helper. Commit the failing test alone (or stage it before any code change), then verify that it fails for the right reason (the missing behavior) and not because of a compile error or a typo.
  2. GREEN. Write the smallest code change that makes the test pass. No anticipating future tests, no scaffolding for tests that don’t yet exist. The codebase should compile and pass tests at this point.
  3. REFACTOR. With the green test as the safety net, tighten the implementation: extract helpers, rename for clarity, replace primitives with newtypes, document tricky parts. Tests stay green throughout.
  4. DRILL DOWN if needed. If the L3 (or L2) test passes but pinned the behavior less precisely than the contract requires (e.g., the L3 test asserts “exit 2 with some error” but the contract says “the specific MergeError variant must match”), add an L2 (or L1) test next that drills into the precise path. The drilled test FAILS at first against the green-but-imprecise impl, motivating the tighter impl.

Cycles must be atomic: one RED → one GREEN → optional REFACTOR → optional drill-down. Do not stack multiple tests on top of a single impl change; do not write impl ahead of tests. The discipline matters because the bug bar of this pipeline is high (CHAT-data byte-stable preservation, audit-trail reproducibility) and TDD is the cheapest way to catch regressions before they ship.

Three test layers + the adjudication layer

The merge pipeline’s behavior spans four substrates with different testing mechanisms.

| Layer | Substrate | Why tests live here |
| --- | --- | --- |
| L1, Spec / fragment | spec/constructs/speaker-id/ → current spec/tools generators | Token-cleaner behavior on CHAT fragments (markup strip for Jaccard scoring). Same mechanism that pins parser/grammar tests; regenerated regression. |
| L2, Transform / AST | crates/talkbank-transform/tests/ | Pure-Rust tests over parsed ChatFile values. identify_mapping, apply_mapping, merge, run_adjudication semantics on hand-built or parsed CHAT inputs. No process boundary. |
| L3, CLI / subprocess | crates/chatter/tests/merge_tests.rs (new) | End-to-end behavior of chatter speaker-id, chatter merge, and chatter adjudicate invoked as subprocesses (assert_cmd + predicates). Exit codes, flag parsing, file I/O, stderr formats. |
| L4, Scripted adjudication | crates/talkbank-transform/tests/adjudication_tests.rs + scripted prompter | Operator-decision paths in chatter adjudicate. Uses ScriptedPrompter injecting synthetic operator choices. See Adjudication Workflow for the prompter abstraction. |

L1 ⊂ L2 ⊂ L3 in terms of failure-mode coverage: a failing L1 test implies a failing L2 test which implies a failing L3 test. So when the same invariant could be tested at multiple layers, the starter test is the highest layer and lower-layer tests are supplements that pin the precise internal path. L4 sits beside L2/L3, same crate/file conventions but a dedicated layer because the prompter-injection pattern is specific to adjudication.

L1, Spec / fragment tests

Lives in spec/constructs/speaker-id/. Three subdirectories:

  • token-cleaner/: what the Jaccard tokenizer strips and keeps
  • jaccard-scoring/: fixed-input → fixed-score golden tests
  • mapping-application/: header rewrite rules on real fragments

L1.1, Token cleaner

Each spec is a CHAT main-tier fragment + the expected token list after cleaning. Behavior pinned: bracket markup stripped, angle-bracket retracing unwrapped, terminator variants discarded, &-... / &+... discarded, xxx/yyy/www discarded, 0 discarded, @l / @n / @c suffix dropped, _-compound split to spaces, punctuation stripped, lowercased, ≥2-char alpha filter, NAK bullets stripped.

| Spec | Input fragment | Expected tokens |
| --- | --- | --- |
| clean-plain-utterance | *CHI:\thello world . | ["hello", "world"] |
| clean-strip-bracket-codes | *CHI:\thello [*] [/] world [//] . | ["hello", "world"] |
| clean-unwrap-angle-retrace | *CHI:\t<two of the> [//] three of the presents . | ["two", "of", "the", "three", "of", "the", "presents"] |
| clean-strip-fillers | *CHI:\t&-um &+pre something &-uh . | ["something"] |
| clean-strip-zero-and-paralinguistic | *CHI:\t0 [=! nodding] . | [] |
| clean-strip-unintelligible | *CHI:\txxx and yyy and www . | ["and", "and"] |
| clean-strip-bullets | *CHI:\thello world . \x150_1234\x15 | ["hello", "world"] |
| clean-special-form-suffix | *CHI:\tnaming l@l u@l l@l u@l . | ["naming"] |
| clean-compound-underscore | *CHI:\tValentine's_Day and Fruit_Loops . | ["valentine", "day", "and", "fruit", "loops"] |
| clean-terminator-variants | *CHI:\thello +//. world +... again +/. last ! | ["hello", "world", "again", "last"] |
| clean-overlap-markers | *CHI:\t↫here↫ and there . | ["here", "and", "there"] |
| clean-lowercase-filter | *CHI:\tHello World A I am . | ["hello", "world", "am"] |

Each spec file in spec/constructs/speaker-id/token-cleaner/ has the standard # name, ## Input, ## Expected tokens, and ## Metadata sections per the spec authoring template at spec/CLAUDE.md in the workspace root (outside the book).
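
The rows above can be reproduced by a small standard-library cleaner. This is a sketch only: the names clean_tokens and strip_spans are hypothetical, and treating every non-alphabetic character as a separator is an assumption chosen because it reproduces the clean-compound-underscore row. The real tokenizer may order or implement these steps differently.

```rust
/// Remove every span that starts with `open` and ends with `close`,
/// leaving a space behind so neighbouring tokens stay separated.
fn strip_spans(text: &str, open: char, close: char) -> String {
    let mut out = String::with_capacity(text.len());
    let mut inside = false;
    for ch in text.chars() {
        if inside {
            if ch == close {
                inside = false;
                out.push(' ');
            }
        } else if ch == open {
            inside = true;
        } else {
            out.push(ch);
        }
    }
    out
}

/// Hypothetical cleaner: CHAT main-tier line in, Jaccard tokens out.
fn clean_tokens(line: &str) -> Vec<String> {
    // Drop the "*CODE:\t" speaker prefix when present.
    let body = line.split_once('\t').map_or(line, |(_, rest)| rest);
    // NAK-delimited time bullets first, then bracket codes such as [*] or [=! ...].
    let body = strip_spans(body, '\u{15}', '\u{15}');
    let body = strip_spans(&body, '[', ']');
    // Unwrap angle-bracket retracing groups but keep their words.
    let body = body.replace(|c: char| c == '<' || c == '>', " ");

    let mut tokens = Vec::new();
    for raw in body.split_whitespace() {
        // Fillers and fragments (&-um, &+pre) and "+" terminators (+//. +... +/.).
        if raw.starts_with('&') || raw.starts_with('+') {
            continue;
        }
        // Unintelligible or untranscribed placeholders.
        if matches!(raw, "xxx" | "yyy" | "www") {
            continue;
        }
        // Drop special-form suffixes such as @l, @n, @c.
        let stem = raw.split('@').next().unwrap_or("");
        // Assumption: underscores, apostrophes, overlap arrows, digits and
        // punctuation all act as separators.
        let spaced: String = stem
            .chars()
            .map(|c| if c.is_alphabetic() { c } else { ' ' })
            .collect();
        for word in spaced.split_whitespace() {
            // Keep only alphabetic tokens of at least two characters, lowercased.
            if word.chars().count() >= 2 {
                tokens.push(word.to_lowercase());
            }
        }
    }
    tokens
}
```

In the spec table, \t and \x15 are escape notation; a Rust caller would pass a real tab and a real NAK character, for example clean_tokens("*CHI:\thello world . \u{15}0_1234\u{15}").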

L1.2, Jaccard scoring

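Fixed bag-of-tokens pairs with known multiset Jaccard scores, where J(A,B) = Σ_w min(A(w), B(w)) / Σ_w max(A(w), B(w)) over every token w in either bag, and J is defined as 0.0 when both bags are empty. These guard against off-by-one errors in the min-sum / max-sum implementation and against any future “optimizations” that silently change scoring.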

| Spec | Bag A | Bag B | Expected J(A,B) |
| --- | --- | --- | --- |
| jaccard-identical | {hello:2, world:1} | {hello:2, world:1} | 1.0 |
| jaccard-disjoint | {hello:1} | {world:1} | 0.0 |
| jaccard-empty-empty | {} | {} | 0.0 |
| jaccard-empty-nonempty | {} | {x:1} | 0.0 |
| jaccard-multiset-counts | {a:3, b:1} | {a:1, b:1} | 2/4 = 0.5 |
| jaccard-partial-overlap | {a:1, b:1, c:1} | {b:1, c:1, d:1} | 2/4 = 0.5 |
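
As an illustration of what these golden values pin, here is a minimal multiset Jaccard over token-count maps: the sum of per-token minimum counts divided by the sum of per-token maximum counts, with the empty/empty case defined as 0.0 to match the table. The names Bag, bag, and multiset_jaccard are hypothetical and not taken from the crate.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Token -> occurrence count. BTreeMap keeps iteration order deterministic.
type Bag = BTreeMap<String, u32>;

/// Convenience constructor for examples and tests (hypothetical helper).
fn bag(pairs: &[(&str, u32)]) -> Bag {
    pairs.iter().map(|(word, n)| (word.to_string(), *n)).collect()
}

/// Multiset Jaccard: sum of min counts over sum of max counts.
fn multiset_jaccard(a: &Bag, b: &Bag) -> f64 {
    // Union of the two vocabularies, each token visited once.
    let vocabulary: BTreeSet<&String> = a.keys().chain(b.keys()).collect();
    let mut min_sum: u64 = 0;
    let mut max_sum: u64 = 0;
    for word in vocabulary {
        let count_a = u64::from(a.get(word).copied().unwrap_or(0));
        let count_b = u64::from(b.get(word).copied().unwrap_or(0));
        min_sum += count_a.min(count_b);
        max_sum += count_a.max(count_b);
    }
    // Two empty bags would divide by zero; the spec table pins that case to 0.0.
    if max_sum == 0 {
        0.0
    } else {
        min_sum as f64 / max_sum as f64
    }
}
```

For the jaccard-multiset-counts row, the min counts are a:1 and b:1 (sum 2) and the max counts are a:3 and b:1 (sum 4), giving 0.5.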

L1.3, Mapping application on fragments

Header-rewrite micro-tests. Each spec gives an input @Participants: or @ID: row and a small mapping; the expected output row is the rewritten form.

  • participants-rewrite-rename
    Input row: @Participants:\tPAR0 Participant, PAR1 Participant
    Mapping: PAR0→INV:Investigator, PAR1→drop
    Expected output row: @Participants:\tINV Investigator
  • participants-preserve-name-token
    Input row: @Participants:\tCHI Alex Target_Child, PAR0 Participant
    Mapping: PAR0→INV:Investigator
    Expected output row: @Participants:\tCHI Alex Target_Child, INV Investigator
  • id-rewrite-rename
    Input row: @ID:\teng|corpus_name|PAR0|||||Participant|||
    Mapping: PAR0→INV:Investigator
    Expected output row: @ID:\teng|corpus_name|INV|||||Investigator|||
  • id-drop-removes-row
    Input row: @ID:\teng|...|PAR1|||||Participant|||
    Mapping: PAR1→drop
    Expected output row: (row removed)
  • id-preserves-other-fields
    Input row: @ID:\teng|2|CHI|6;01.|female|NF||Target_Child|||
    Mapping: (no-op for CHI)
    Expected output row: identical to input
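
The two @Participants: specs can be read as the following rewrite rule, sketched here with the standard library only. The Action enum and rewrite_participants are hypothetical names; the sketch assumes entries are comma-separated, that the first whitespace-delimited word of each entry is the speaker code, and that speakers absent from the mapping pass through untouched.

```rust
use std::collections::BTreeMap;

/// What the mapping says to do with one donor speaker (hypothetical type).
enum Action {
    Drop,
    Rename { code: String, role: String },
}

/// Rewrite one "@Participants:\t..." header row according to `mapping`.
fn rewrite_participants(row: &str, mapping: &BTreeMap<String, Action>) -> String {
    // Split "@Participants:" from the comma-separated entry list.
    let (header, entries) = row.split_once('\t').unwrap_or((row, ""));
    let kept: Vec<String> = entries
        .split(',')
        .map(str::trim)
        .filter(|entry| !entry.is_empty())
        .filter_map(|entry| {
            // The speaker code is the first word of the entry.
            let code = entry.split_whitespace().next()?;
            match mapping.get(code) {
                // Dropped speakers disappear from the header.
                Some(Action::Drop) => None,
                // Renamed speakers get the new code and role tag.
                Some(Action::Rename { code, role }) => Some(format!("{code} {role}")),
                // Unmapped speakers (e.g. CHI with a name token) are kept verbatim.
                None => Some(entry.to_string()),
            }
        })
        .collect();
    format!("{header}\t{}", kept.join(", "))
}
```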

L2, Transform / AST tests

Lives in crates/talkbank-transform/tests/. Three test files:

  • speaker_id_tests.rs
  • transcript_merge_tests.rs
  • override_file_tests.rs

Each tests behavior over parsed talkbank-model::ChatFile values, using inline synthetic CHAT strings parsed via talkbank_parser::parse_chat_file (no subprocess overhead).

L2.1, identify_mapping (reference mode)

| Test | Scenario | Assertion |
| --- | --- | --- |
| identify_mapping_clean_winner | Reference has CHI saying content X; donor has PAR0 saying X verbatim and PAR1 saying unrelated content | Returns SpeakerMapping { drop: {PAR0}, rename: {PAR1: INV} }, margin >> 2.0 |
| identify_mapping_borderline_refuses | Reference and both donor speakers share substantial vocabulary (margin < 2.0) | Returns Err(SpeakerIdError::LowConfidence { scores, threshold, margin }) |
| identify_mapping_anchor_missing | Reference has no utterances tagged with anchor speaker | Returns Err(SpeakerIdError::AnchorMissingInReference { anchor: CHI }) |
| identify_mapping_single_speaker_donor | Donor has only one speaker | Returns Err(SpeakerIdError::InsufficientSpeakers { n: 1 }) |
| identify_mapping_threshold_at_exact_value | Constructed donor where margin = 2.0 exactly with threshold 2.0 | Returns Ok(_) (≥ comparison, not strict >) |
| identify_mapping_threshold_below_exact_value | Margin = 1.9999 with threshold 2.0 | Returns Err(SpeakerIdError::LowConfidence) |
| identify_mapping_unbounded_margin | Donor PAR1 has Jaccard 0 against reference; PAR0 > 0 | Returns Ok(_) with margin = Margin::Unbounded |
| identify_mapping_deterministic | Same inputs, repeated call | Identical SpeakerMapping byte-for-byte (BTreeMap ordering) |

L2.2, apply_mapping

| Test | Scenario | Assertion |
| --- | --- | --- |
| apply_mapping_renames_main_tier | Donor has *PAR0:\t... and *PAR1:\t...; mapping renames PAR0→INV, drops PAR1 | Output has *INV:\t... for original PAR0 utts; PAR1 utts absent |
| apply_mapping_byte_stable_except_prefix | Donor has rich CHAT markup, %wor, %com on every utt | Every retained utt is byte-identical except the *CODE:\t prefix; dependent tiers preserved exactly |
| apply_mapping_rewrites_participants | Donor @Participants: has PAR0+PAR1 entries | Output has only INV entry (after PAR1 drop) |
| apply_mapping_rewrites_id | Donor @ID: rows for PAR0+PAR1 | PAR0 row rewritten to INV with role tag; PAR1 row removed |
| apply_mapping_speaker_not_in_input | Mapping references PAR9 which isn’t in donor | Returns Err(SpeakerIdError::MappingSpeakerNotInInput { speaker: PAR9 }) |
| apply_mapping_speaker_not_in_mapping | Donor has PAR0+PAR1+PAR2 but mapping only covers PAR0+PAR1 | Returns Err(SpeakerIdError::SpeakerNotInMapping { speaker: PAR2 }) |
| apply_mapping_preserves_other_headers | Donor has @Languages, @Media, @Comment | All non-Participants/non-ID headers pass through verbatim |
| apply_mapping_idempotent_on_rerun | Apply mapping, parse output, apply identity mapping | Output unchanged (byte-stable) |

L2.3, merge (core invariants)

These mirror the user-guide’s “What the merged output guarantees” section directly. Each invariant from that section maps to one or more L2 tests; the L3 tests then re-exercise the same invariant through the CLI.

| Test | Invariant from user-guide | Assertion |
| --- | --- | --- |
| merge_retained_speakers_byte_stable | “Retained speakers are byte-stable” | Every *CHI: block from File 1 (main tier + all dependent tiers, including %com) appears in the output byte-identical, in original order |
| merge_strips_default_derived_tiers | “Inserted speakers’ downstream-generated tiers are stripped” | Output has no %wor, %mor, %gra, %pho on inserted-speaker utts; other dependent tiers preserved |
| merge_strip_tiers_configurable | “configurable via --strip-tiers” | Custom strip_tiers=[com] removes %com instead of the defaults |
| merge_strip_tiers_empty_preserves_all | empty strip set | Inserted utts retain %wor, %mor, %gra, %pho from File 2 verbatim |
| merge_utterance_order_by_start_time | “Utterance order is timeline order” | Output utterances sorted by start_ms ascending |
| merge_stable_tiebreak_file1_first | “first-file utterance comes first” | When File 1 and File 2 each have an utterance starting at exactly t, the File 1 one appears first in the output |
| merge_bullets_pass_through | “Time bullets are pass-through” | Every bullet in the output is exactly the bullet from its source utterance; merge does not recompute, smooth, or refresh bullets |
| merge_bullet_lift_from_wor | “If main tier lacks bullet, lift from %wor” | Donor utt with no end-of-line bullet but a %wor row gets a derived \x15<first>_<last>\x15 appended; original %wor then stripped per the tier policy |
| merge_no_overlap_markers_injected | “Overlap markup is NOT injected” | Even when inserted utt’s bullet overlaps a retained utt’s bullet by 500ms, no [>]/[<] tokens appear anywhere in the output that weren’t in the original retained file |
| merge_preserves_existing_overlap_markers | retained file already has [>] somewhere | The original [>] is preserved byte-stable on the retained utt |
| merge_header_languages_passthrough | Header reconciliation rule | Output @Languages matches File 1’s |
| merge_header_media_file1_wins | Header reconciliation rule | File 1 says video, File 2 says audio → output says video (no warning emitted for modality only) |
| merge_header_participants_concatenates | Header reconciliation rule | Output @Participants: is File 1’s entries + File 2’s non-retained entries, in that order |
| merge_header_id_concatenates | Header reconciliation rule | Output @ID: rows are File 1’s + File 2’s non-retained, original order within each file |
| merge_header_comments_concatenate | Header reconciliation rule | Output @Comment rows are File 1’s + File 2’s, in original order (ASR provenance preserved) |
| merge_preconditions_retain_missing | exit code 2 precondition | File 1 declares no CHI; merge with retain={CHI} returns Err(MergeError::RetainSpeakersMissing) |
| merge_preconditions_no_timeline | exit code 2 precondition | File 1 has no utterances with bullets → Err(MergeError::NoTimelineInFile1) |
| merge_preconditions_language_mismatch | exit code 2 precondition | File 1 @Languages: eng, File 2 @Languages: yue → Err(MergeError::LanguageMismatch) |
| merge_preconditions_ambiguous_speaker | exit code 2 precondition | Both files have INV utterances and retain={CHI} (INV not in retain) → Err(MergeError::AmbiguousSpeaker { speaker: INV }) |
| merge_warns_on_backward_bullet_drift | “small backward-time bullets … proceeds” | File with utt1: 100_200, utt2: 190_300 → succeeds, emits a warning |
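
The ordering and tiebreak rows (merge_utterance_order_by_start_time and merge_stable_tiebreak_file1_first) can be illustrated with a stable sort on a (start_ms, source) key. This is a sketch under assumptions: Source, Utterance, and merge_by_timeline are hypothetical names, every utterance is assumed to already carry a start time, and the real merge operates on parsed ChatFile values rather than this toy struct.

```rust
/// Which input an utterance came from. Derived Ord makes File1 < File2,
/// which is what gives File 1 priority when start times are equal.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Source {
    File1,
    File2,
}

/// Toy stand-in for a parsed utterance (hypothetical).
#[derive(Debug, Clone, PartialEq, Eq)]
struct Utterance {
    source: Source,
    start_ms: u64,
    text: String,
}

/// Interleave two utterance lists in timeline order.
fn merge_by_timeline(file1: Vec<Utterance>, file2: Vec<Utterance>) -> Vec<Utterance> {
    let mut merged: Vec<Utterance> = file1.into_iter().chain(file2).collect();
    // sort_by_key is stable, so utterances with identical keys keep the
    // relative order they had inside their own file.
    merged.sort_by_key(|utt| (utt.start_ms, utt.source));
    merged
}
```

Putting the source in the sort key, rather than relying only on concatenation order, keeps the tiebreak explicit and independent of how the two lists were chained.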

L2.4, Override file I/O

| Test | Scenario | Assertion |
| --- | --- | --- |
| override_file_round_trip | Construct OverrideFile with one entry, write, read back | Re-read value == original |
| override_file_refuses_missing_schema_version | TOML with no schema_version | Err(OverrideFileError::UnsupportedSchemaVersion { found: 0, supported: 1 }) |
| override_file_refuses_wrong_schema_version | schema_version = 2 (future) | Err(UnsupportedSchemaVersion { found: 2, supported: 1 }) |
| override_file_rejects_unknown_field | Entry has an extraneous field extra = "x" | Err(OverrideFileError::Parse) |
| override_file_rejects_malformed_mode | mode = "guess" | Err(Parse) (only auto/explicit/override accepted) |
| override_file_atomic_write | Write to a path that already exists | Original file is replaced atomically; no <path>.tmp left behind |
| override_file_deterministic_serialization | Same struct, write twice | Bytes on disk are byte-identical between writes |
| override_file_omits_empty_optionals | Entry has empty scores, no margin, empty flags | TOML output does not contain those keys |
| override_file_preserves_margin_unbounded | Entry has margin = Margin::Unbounded | TOML on disk has margin = "unbounded"; reads back as Unbounded |
| override_file_preserves_margin_finite | Entry has margin = Margin::Finite(3.81) | TOML on disk has margin = 3.81; reads back equal |
| override_file_read_or_default_missing | Path does not exist | Returns empty OverrideFile with current schema version |
| override_file_get_returns_entry | File has one entry under SessionId X | get(X) returns Some; get(Y) returns None |
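
One common way to satisfy override_file_atomic_write is to write a sibling temporary file and rename it over the target. The sketch below uses only the standard library; write_atomic is a hypothetical name, the <path>.tmp naming follows the assertion in the table, and the approach assumes the temporary file and the target live on the same filesystem (otherwise the rename is not atomic).

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Replace `path` with `bytes` so readers never observe a half-written file.
fn write_atomic(path: &Path, bytes: &[u8]) -> io::Result<()> {
    // Build "<path>.tmp" next to the target.
    let mut tmp_name = path.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp_path = PathBuf::from(tmp_name);

    // Write and flush the full contents to the temporary file first.
    let mut file = File::create(&tmp_path)?;
    file.write_all(bytes)?;
    file.sync_all()?;
    drop(file);

    // Swap the temporary file into place; clean up if the rename fails.
    if let Err(err) = fs::rename(&tmp_path, path) {
        let _ = fs::remove_file(&tmp_path);
        return Err(err);
    }
    Ok(())
}
```

A test in the style of the table would write an initial file, call write_atomic with new contents, then assert both the new contents and the absence of the .tmp sibling.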

L2.5, Domain-type unit tests

Smaller per-type tests. Each in its module’s #[cfg(test)] mod tests section.

| Test | Type | Assertion |
| --- | --- | --- |
| jaccard_score_new_in_range | JaccardScore | new(0.5) → Ok; new(-0.1) and new(1.1) → Err; new(NaN) → Err |
| jaccard_score_serde_round_trip | JaccardScore | Serializes to 0.5 (bare float in JSON/TOML); deserializes back identically; out-of-range deserialize → error |
| confidence_threshold_default_is_2_0 | ConfidenceThreshold | Default::default().value() == 2.0 |
| confidence_threshold_rejects_below_1 | ConfidenceThreshold | new(0.5) → Err |
| margin_from_scores_zero_loser | Margin | from_scores(JaccardScore::new(0.7), JaccardScore::zero()) == Margin::Unbounded |
| margin_from_scores_zero_zero | Margin | from_scores(zero, zero) == Margin::Finite(0.0) or explicit “degenerate” representation (decide and document) |
| margin_meets_threshold | Margin | Finite(3.81).meets(threshold=2.0) == true; Finite(1.5).meets(2.0) == false; Unbounded.meets(threshold) == true for any threshold |
| retain_set_parse | RetainSet | "CHI".parse() == Ok({CHI}); "CHI,SI2".parse() == Ok({CHI, SI2}); "".parse() == Err; "CHI,,SI2".parse() == Err |
| inserted_role_parse | InsertedRole | "INV:Investigator".parse() == Ok(_); "INV".parse() == Err; ":Investigator".parse() == Err |
| mapping_spec_parse_simple | parse_mapping_spec | "PAR0=drop,PAR1=INV:Investigator" parses to a complete SpeakerMapping with correct actions and inserted_role |
| mapping_spec_parse_drop_only | parse_mapping_spec | "PAR0=drop" parses iff no inserted_role context required (decide whether legal in isolation; if not, must error) |
| mapping_spec_parse_conflicting_roles | parse_mapping_spec | "PAR0=INV:Investigator,PAR1=MOT:Mother", two different inserted roles → error (v1 only allows one) |
| merge_flag_serde_known_variants | MergeFlag | DiarizationMixed serializes as "diarization-mixed" (kebab-case); deserializes the same |
| merge_flag_serde_custom | MergeFlag | Unknown string deserializes as Custom("unknown-flag"); serializes verbatim |
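
To make a few of these rows concrete, here is a standard-library sketch of JaccardScore, Margin, and RetainSet that satisfies the assertions listed for them. Several details are assumptions rather than documented behavior: the margin is taken to be the winner/loser score ratio, the zero/zero case is resolved as Finite(0.0) (one of the two options the table leaves open), meets takes a bare f64 instead of a ConfidenceThreshold, and errors are plain strings instead of typed errors.

```rust
use std::collections::BTreeSet;
use std::str::FromStr;

#[derive(Debug, Clone, Copy, PartialEq)]
struct JaccardScore(f64);

impl JaccardScore {
    fn new(value: f64) -> Result<Self, String> {
        // NaN fails the range check because every comparison with NaN is false.
        if (0.0..=1.0).contains(&value) {
            Ok(Self(value))
        } else {
            Err(format!("Jaccard score out of range: {value}"))
        }
    }

    fn zero() -> Self {
        Self(0.0)
    }
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Margin {
    Finite(f64),
    Unbounded,
}

impl Margin {
    /// Assumption: margin is the ratio of the winning score to the losing score.
    fn from_scores(winner: JaccardScore, loser: JaccardScore) -> Self {
        if loser.0 > 0.0 {
            Margin::Finite(winner.0 / loser.0)
        } else if winner.0 > 0.0 {
            Margin::Unbounded
        } else {
            // Assumption: zero/zero is Finite(0.0), so it never meets a threshold.
            Margin::Finite(0.0)
        }
    }

    /// Inclusive comparison: a margin exactly at the threshold passes.
    fn meets(self, threshold: f64) -> bool {
        match self {
            Margin::Finite(ratio) => ratio >= threshold,
            Margin::Unbounded => true,
        }
    }
}

#[derive(Debug, PartialEq, Eq)]
struct RetainSet(BTreeSet<String>);

impl FromStr for RetainSet {
    type Err = String;

    fn from_str(spec: &str) -> Result<Self, Self::Err> {
        let mut codes = BTreeSet::new();
        // "".split(',') yields one empty piece, so the empty spec is rejected too.
        for code in spec.split(',') {
            if code.is_empty() {
                return Err(format!("empty speaker code in retain spec {spec:?}"));
            }
            codes.insert(code.to_string());
        }
        Ok(Self(codes))
    }
}
```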

L3, CLI / subprocess tests

Lives in crates/chatter/tests/merge_tests.rs (new file). Uses the same assert_cmd + predicates + tempfile pattern as the existing integration_tests.rs. Each test invokes chatter speaker-id or chatter merge as a subprocess against files written to a tempdir().

L3.1, chatter merge, success paths

| Test | Invariants exercised |
| --- | --- |
| merge_basic_clinician_pattern | E2E happy path: small hand-coded child-only file + small ASR-labeled file → exit 0, output exists, retained CHI byte-stable, inserted INV present with derived tiers stripped. Single-invocation smoke test. |
| merge_writes_to_stdout_by_default | No -o flag → output goes to stdout, exit 0 |
| merge_writes_to_output_path | -o merged.cha → file created with correct content; nothing on stdout |
| merge_retain_multi_speaker | --retain CHI,SI2 keeps both CHI and SI2 byte-stable; everything else from File 2 |
| merge_strip_tiers_custom | --strip-tiers com,act removes %com and %act instead of default set |
| merge_strip_tiers_empty | --strip-tiers '' preserves %wor from File 2 in output |

L3.2, chatter merge, error paths

| Test | Asserted exit code | Asserted stderr |
| --- | --- | --- |
| merge_missing_file1 | 1 | “No such file” or equivalent typed message |
| merge_unparseable_file1 | 1 | parser diagnostic |
| merge_missing_retain_flag | 2 (clap) | clap usage message |
| merge_retain_empty_value | 2 | typed error from RetainSet::from_str |
| merge_no_retain_speakers_in_file1 | 2 | RetainSpeakersMissing rendered |
| merge_no_timeline_in_file1 | 2 | NoTimelineInFile1 rendered |
| merge_language_mismatch | 2 | LanguageMismatch { file1: eng, file2: yue } rendered |
| merge_ambiguous_speaker | 2 | AmbiguousSpeaker { speaker: ... } rendered with hint to use --retain |
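
The exit-code column implies a simple split: I/O and parse failures exit 1, usage and precondition failures exit 2. A sketch of that mapping follows. MergeFailure is a hypothetical wrapper enum; only the four precondition variant names come from this plan, while the Io and Parse variants and their string payloads are assumptions made for illustration.

```rust
/// Hypothetical top-level failure type for the `chatter merge` subcommand.
#[derive(Debug)]
enum MergeFailure {
    // Assumed variants for "could not read" and "could not parse".
    Io(String),
    Parse(String),
    // Precondition variants named in this plan.
    RetainSpeakersMissing,
    NoTimelineInFile1,
    LanguageMismatch { file1: String, file2: String },
    AmbiguousSpeaker { speaker: String },
}

impl MergeFailure {
    /// Exit code the CLI would report for this failure.
    fn exit_code(&self) -> i32 {
        match self {
            MergeFailure::Io(_) | MergeFailure::Parse(_) => 1,
            MergeFailure::RetainSpeakersMissing
            | MergeFailure::NoTimelineInFile1
            | MergeFailure::LanguageMismatch { .. }
            | MergeFailure::AmbiguousSpeaker { .. } => 2,
        }
    }
}
```

Listing every variant in the match, instead of using a wildcard arm, means a newly added variant forces a deliberate exit-code decision at compile time.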

L3.3, chatter speaker-id, reference mode

| Test | Scenario | Assertion |
| --- | --- | --- |
| speaker_id_reference_auto_clean_winner | Reference + donor where margin >> 2.0 | Exit 0; output has expected renamed/dropped speakers |
| speaker_id_reference_writes_override | With --write-override path.toml | File created; entry has mode = "auto", scores, margin, decided_at, operator |
| speaker_id_reference_appends_to_existing_override | --write-override path.toml where file already has another session | New session added; existing session preserved |
| speaker_id_reference_low_confidence_exits_4 | Margin < threshold | Exit 4; stderr contains per-speaker scores |
| speaker_id_reference_anchor_missing_exits_2 | Reference has no anchor speaker utterances | Exit 2; typed error in stderr |
| speaker_id_reference_threshold_override | --confidence-threshold 1.5 on a margin-1.7 case | Exit 0 (would have refused at default 2.0) |
| speaker_id_reference_anchor_required | --reference without --anchor | Exit 2 (clap or our own); usage error |

L3.4, chatter speaker-id, explicit-mapping mode

| Test | Scenario | Assertion |
| --- | --- | --- |
| speaker_id_explicit_basic | --mapping "PAR0=drop,PAR1=INV:Investigator" | Exit 0; output renames PAR1→INV, drops PAR0 |
| speaker_id_explicit_mapping_speaker_not_in_input | --mapping references PAR9 not in input | Exit 2; typed error |
| speaker_id_explicit_speaker_missing_from_mapping | Input has PAR0+PAR1+PAR2; mapping only covers PAR0+PAR1 | Exit 2; typed error naming PAR2 |
| speaker_id_explicit_with_note_records_in_override | --mapping + --write-override + --note "verified by listening" | TOML entry has note = "verified by listening" and mode = "explicit" |

L3.5, chatter speaker-id, override-file mode

| Test | Scenario | Assertion |
| --- | --- | --- |
| speaker_id_override_file_replay | Override file has entry for session-X | Reading override + applying produces same output as the original auto/explicit run |
| speaker_id_override_file_missing_entry | Override file has no entry for the requested session | Exit 2; OverrideEntryMissing in stderr |
| speaker_id_override_file_missing_file | --override-file path.toml where file doesn’t exist | Exit 1; NotFound in stderr |
| speaker_id_override_file_wrong_schema_version | File has schema_version = 99 | Exit 1; UnsupportedSchemaVersion in stderr |
| speaker_id_override_file_mutually_exclusive_modes | --reference AND --mapping both set | Exit 2 (clap or our own); only one operation mode allowed |

L3.6, Pipeline composition

These exercise chatter speaker-id → chatter merge composed end-to-end through the file system, simulating the orchestrator workflow.

| Test | Scenario | Assertion |
| --- | --- | --- |
| pipeline_speaker_id_then_merge | Run speaker-id on anonymous ASR file; run merge on the result + hand-coded file | Final merged file passes all merge invariants (retained byte-stable, etc.) |
| pipeline_replay_via_override_file | Run once with auto; capture override file; delete intermediates; replay via --override-file; merge again | Final merged file is byte-identical to the original run (audit-trail-reproducibility property) |
| pipeline_low_confidence_then_explicit | Run speaker-id; gets exit 4; capture scores from stderr; run again with --mapping matching what the operator would decide; record via --write-override; merge | All steps succeed; override file has mode = "explicit" with prior scores recorded |

L4, Scripted adjudication tests

Lives in crates/talkbank-transform/tests/adjudication_tests.rs. Uses the Prompter trait and ScriptedPrompter documented in Adjudication Workflow §The prompter abstraction. Each test constructs a pending-adjudications input, scripts the operator’s decisions, runs run_adjudication, and asserts on the resulting override file plus the residual pending file.

L4.1, Speaker-id adjudication paths

| Test | Scripted decision | Assertion |
| --- | --- | --- |
| adjudicate_speaker_id_accepts_suggested | AcceptSuggested { note: None } for one pending entry | Override file entry has mode = "explicit", mapping matches suggested, pending file emptied |
| adjudicate_speaker_id_override_mapping | OverrideMapping { mapping: { PAR0=rename, PAR1=drop }, note: Some("verified by listening") } (opposite of suggested) | Override file mapping matches operator’s choice; note recorded |
| adjudicate_speaker_id_defer | Defer { reason: "need to listen to audio" } | Pending entry untouched; override file unchanged; tool exits 4 (deferred) |
| adjudicate_speaker_id_block | Block { reason: "reference file missing bullets" } | Pending entry tagged as blocked; override file unchanged |
| adjudicate_speaker_id_kind_mismatch_rejected | OverrideInsertedRole { ... } against a speaker-id-low-confidence entry | Returns Err(AdjudicationError::DecisionKindMismatch); nothing written |

L4.2, Parent-role-lookup adjudication paths

| Test | Scripted decision | Assertion |
| --- | --- | --- |
| adjudicate_parent_role_accepts_default_inv | AcceptSuggested | Override entry uses INV:Investigator (the safe default) |
| adjudicate_parent_role_overrides_to_mother | OverrideInsertedRole { code: "MOT", tag: "Mother" } | Override entry uses MOT; note recorded |
| adjudicate_parent_role_overrides_to_father | OverrideInsertedRole { code: "FAT", tag: "Father" } | Override entry uses FAT |
| adjudicate_parent_role_invalid_code_rejected | OverrideInsertedRole { code: "", tag: "Mother" } | Returns Err; with --skip-on-error, logs and proceeds |

L4.3, Diarization-mix and sanity-scan paths

| Test | Scripted decision | Assertion |
| --- | --- | --- |
| adjudicate_diarization_mix_flag_only | Flag { flags: [DiarizationMixed], note: "PAR0 mixes clinician+parent" } | Existing override entry gets flag added; mapping unchanged |
| adjudicate_sanity_scan_swap_mapping | OverrideMapping { ... } reversing original speaker-id | Override entry updated; mode = "explicit"; original mapping preserved in history |
| adjudicate_sanity_scan_confirms_real_overlap | Flag { flags: [Custom("real-overlap-confirmed")] } | Override entry gets custom flag; mapping unchanged |

L4.4, Workflow plumbing

| Test | Scenario | Assertion |
| --- | --- | --- |
| adjudicate_empty_pending_file_noop | Pending file has empty entries array | Exit 0; nothing changes |
| adjudicate_resumption_skips_decided_entries | Pending file has 3 entries; first 2 already decided in override; only 3rd has no override entry | Prompter is called exactly once, for the 3rd entry |
| adjudicate_re_adjudicate_preserves_history | Existing override entry; --re-adjudicate with new decision | New decision saved; prior decision preserved in history array |
| adjudicate_kind_filter_processes_only_matching | Pending file has mixed kinds; --kind parent-role-lookup flag set | Prompter only called for parent-role-lookup entries; other kinds untouched |
| adjudicate_dry_run_writes_nothing | Any pending input + any decision; --dry-run set | Override file unchanged; pending file unchanged |
| adjudicate_scripted_mode_unknown_session_aborts | Scripted decisions reference session-X but pending has only session-Y | Returns Err(AdjudicationError::ScriptedDecisionWithoutPendingEntry); tool exits 2 |
| adjudicate_scripted_mode_extra_pending_aborts | Pending has session-X and session-Y; scripted decisions cover only session-X | Returns Err(AdjudicationError::PendingEntryWithoutScriptedDecision); tool exits 2 |
| adjudicate_mutually_exclusive_modes | --interactive + --scripted both set | Returns Err; tool exits 2 (clap or our own validator) |

L4.5, Prompter contract conformance

These tests pin the contract that any Prompter impl must satisfy, so future UI backends (VS Code, web) can be developed against the same invariants.

| Test | Scenario | Assertion |
| --- | --- | --- |
| prompter_terminal_round_trip_decision | TerminalPrompter reading a scripted stdin | Returns the expected OperatorDecision parsed from the operator’s typed input |
| prompter_scripted_returns_decisions_in_order | ScriptedPrompter::from_decisions([d1, d2, d3]) | Three consecutive ask() calls return d1, d2, d3 in order |
| prompter_scripted_panics_on_unscripted_session | ScriptedPrompter has decisions for session A; tool asks for session B | ask() returns Err(PrompterError::NoDecisionFor(SessionId)) |
| prompter_scripted_toml_round_trips | Write a scripted-decisions TOML, read with ScriptedTomlPrompter, run | Same OperatorDecision sequence as a ScriptedPrompter::from_decisions with equivalent contents |
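
To show the shape of contract these rows pin, here is a toy prompter that hands out scripted decisions in order and returns NoDecisionFor when asked about a session it has no decision for. It is a guess at the shape, not the trait documented in Adjudication Workflow: the ask signature, the String-based SessionId, the (session, decision) pairing passed to from_decisions, and the reduced OperatorDecision enum are all assumptions made for illustration.

```rust
use std::collections::VecDeque;

// Assumption: session ids are plain strings in this sketch.
type SessionId = String;

/// Reduced stand-in for the real decision enum (three of its variants).
#[derive(Debug, Clone, PartialEq)]
enum OperatorDecision {
    AcceptSuggested { note: Option<String> },
    Defer { reason: String },
    Block { reason: String },
}

#[derive(Debug, PartialEq)]
enum PrompterError {
    NoDecisionFor(SessionId),
}

/// Hypothetical trait shape: one decision per pending session.
trait Prompter {
    fn ask(&mut self, session: &SessionId) -> Result<OperatorDecision, PrompterError>;
}

/// Replays a fixed script instead of prompting a human.
struct ScriptedPrompter {
    queue: VecDeque<(SessionId, OperatorDecision)>,
}

impl ScriptedPrompter {
    fn from_decisions(
        decisions: impl IntoIterator<Item = (SessionId, OperatorDecision)>,
    ) -> Self {
        Self { queue: decisions.into_iter().collect() }
    }
}

impl Prompter for ScriptedPrompter {
    fn ask(&mut self, session: &SessionId) -> Result<OperatorDecision, PrompterError> {
        // Only hand out the next decision if it was scripted for this session.
        let next_matches = self
            .queue
            .front()
            .map_or(false, |(scripted, _)| scripted == session);
        match self.queue.pop_front().filter(|_| next_matches) {
            Some((_, decision)) => Ok(decision),
            None => Err(PrompterError::NoDecisionFor(session.clone())),
        }
    }
}
```

Note that this sketch consumes the front entry only when it matches; an unmatched request should leave the script intact, which the version above does not guarantee, so a production implementation would check before popping.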

Fixture catalog

These are the synthetic CHAT pairs that the tests above consume. Each is small (≤20 utterances), exercises a precise invariant, and is fully fictional (no real corpus content).

The fixtures live as inline const FIX_*: &str blocks in the respective test modules, following the precedent in crates/chatter/tests/integration_tests.rs (which has const VALID_CHAT: &str = r#"..."# etc.).

FIX_REF_TWO_UTT_NO_MARKUP

The smallest possible valid CHAT pair input. Two *CHI: utterances, no markup beyond a simple terminator, time bullets on both. Used by cycle 1’s smoke test where the impl must work without yet handling any markup edge cases.

FIX_ASR_LABELED_TWO_UTT

The matching donor for FIX_REF_TWO_UTT_NO_MARKUP: two *INV: utterances at different time positions. Used by cycle 1.

FIX_REF_CHILD_ONLY_SIMPLE

A 6-utterance child-only hand transcript with rich CHAT markup (error code, retracing, filled pause, special-form letter, zero realization with paralinguistic). Used by every L2/L3 merge test from cycle 2 onward as the canonical “File 1”, the reference / authoritative file. Has time bullets on every utterance.

FIX_ASR_ANON_2SPEAKER_SIMPLE

The matching ASR-output file with anonymous PAR0 (clinician, asks questions) and PAR1 (child, says what FIX_REF_* shows plus some extra). Has %wor on every utterance. Used by every speaker-id test where auto-mode is expected to succeed cleanly (margin >> 2.0).

FIX_ASR_LABELED_INV_SIMPLE

FIX_ASR_ANON_2SPEAKER_SIMPLE after speaker-id has run with PAR1→drop, PAR0→INV:Investigator. Used by merge tests where we want to skip the speaker-id step and test merge alone.

FIX_ASR_BORDERLINE_VOCABULARY

ASR file where both speakers describe the same picture-book content (margin 1.6-1.9 against reference). Used by low-confidence tests.

FIX_REF_NO_BULLETS

A reference file with no time bullets at all. Used to test NoTimelineInFile1 precondition.

FIX_REF_LANG_ENG / FIX_ASR_LANG_YUE

Two files with conflicting @Languages. Used to test LanguageMismatch.

FIX_AMBIGUOUS_INV

Two files both containing *INV: utterances, with --retain CHI (INV not in retain set). Used to test AmbiguousSpeaker.

FIX_REF_MULTI_RETAIN

Reference file containing *CHI: and *SI2: utterances (sibling target). Used to test --retain CHI,SI2.

FIX_ASR_NO_MAIN_BULLET

Donor file where some utterances have no main-tier bullet, only %wor. Used to test bullet-lift behavior in normalization.
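
A sketch of the bullet-lift step this fixture exercises, using only the standard library. It assumes that each %wor word is followed by a NAK-delimited start_end bullet, that the derived bullet spans from the first word's start to the last word's end, and that it is appended after a single space. The names bullets and lift_bullet are hypothetical.

```rust
/// Time bullets are delimited by the NAK control character (0x15).
const NAK: char = '\u{15}';

/// Parse every NAK-delimited "start_end" bullet on a tier line.
fn bullets(tier: &str) -> Vec<(u64, u64)> {
    tier.split(NAK)
        // Pieces at odd indices sit between a pair of NAK delimiters.
        .skip(1)
        .step_by(2)
        .filter_map(|span| {
            let (start, end) = span.split_once('_')?;
            Some((start.parse::<u64>().ok()?, end.parse::<u64>().ok()?))
        })
        .collect()
}

/// Append a bullet derived from %wor when the main tier has none.
fn lift_bullet(main_tier: &str, wor_tier: &str) -> String {
    // A main tier that already carries a bullet passes through unchanged.
    if main_tier.contains(NAK) {
        return main_tier.to_string();
    }
    let spans = bullets(wor_tier);
    match (spans.first(), spans.last()) {
        (Some(&(start, _)), Some(&(_, end))) => {
            format!("{main_tier} {NAK}{start}_{end}{NAK}")
        }
        // No usable %wor timing: leave the utterance without a bullet.
        _ => main_tier.to_string(),
    }
}
```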

FIX_OVERRIDE_VALID / FIX_OVERRIDE_WRONG_SCHEMA / FIX_OVERRIDE_MALFORMED

Override files in valid, schema-rejected, and parse-rejected shapes. Used by override-file I/O tests.

FIX_PENDING_SPEAKER_ID / FIX_PENDING_PARENT_ROLE / FIX_PENDING_MIXED_KINDS

Pending-adjudications files exercising one kind, another kind, and a mix. Used by L4 adjudication tests.

FIX_SCRIPTED_ACCEPT_ALL / FIX_SCRIPTED_OVERRIDE_FIRST_DEFER_SECOND

Scripted-decisions TOML files for ScriptedTomlPrompter. Cover the canonical accept-suggested case and a mixed override+defer case.

The exact bytes of each fixture are pinned in their respective test modules when the implementation lands; this plan doesn’t freeze them yet, only their purpose. Drafting the actual bytes is the first step of impl-phase work.

Coverage matrix

Cross-checking that every behavioral invariant from the four design docs has at least one test:

| Invariant source | Invariant | First-failing layer | Test name |
| --- | --- | --- | --- |
| merge user-guide | Retained byte-stable | L3 → L2 | merge_basic_clinician_pattern + merge_retained_speakers_byte_stable |
| merge user-guide | Derived tiers stripped | L3 → L2 | merge_strip_tiers_custom + merge_strips_default_derived_tiers |
| merge user-guide | Order by start_ms | L2 | merge_utterance_order_by_start_time |
| merge user-guide | Tiebreak File1 first | L2 | merge_stable_tiebreak_file1_first |
| merge user-guide | Bullets pass-through | L2 | merge_bullets_pass_through |
| merge user-guide | Bullet lift from %wor | L2 | merge_bullet_lift_from_wor |
| merge user-guide | Header reconciliation (all rows) | L2 | merge_header_* series |
| merge user-guide + memory | No overlap markers injected | L2 | merge_no_overlap_markers_injected + merge_preserves_existing_overlap_markers |
| merge user-guide | Each precondition → exit 2 | L3 | exit-2 precondition tests in L3.2 (merge_no_retain_speakers_in_file1, merge_no_timeline_in_file1, merge_language_mismatch, merge_ambiguous_speaker) |
| merge user-guide | Warns on bullet drift | L2 | merge_warns_on_backward_bullet_drift |
| speaker-id user-guide | Reference mode auto | L3 | speaker_id_reference_auto_clean_winner |
| speaker-id user-guide | Explicit mode | L3 | speaker_id_explicit_basic |
| speaker-id user-guide | Override-file mode | L3 | speaker_id_override_file_replay |
| speaker-id user-guide | Confidence threshold (exit 4) | L3 → L2 | speaker_id_reference_low_confidence_exits_4 + identify_mapping_borderline_refuses |
| speaker-id user-guide | Byte-stable except prefix | L2 | apply_mapping_byte_stable_except_prefix |
| speaker-id user-guide | Header rewrites | L2 + L1 | apply_mapping_rewrites_* + participants-rewrite-* specs |
| speaker-id user-guide | Provenance captured | L3 | speaker_id_reference_writes_override |
| speaker-id user-guide | Each precondition → typed error | L3 → L2 | various *_exits_2 and apply_mapping_* tests |
| speaker-id user-guide | Token cleaner spec | L1 | clean-* specs |
| speaker-id user-guide | Multiset Jaccard formula | L1 | jaccard-* specs |
| override-file ref | Schema-version refusal | L2 | override_file_refuses_* tests |
| override-file ref | Round-trip fidelity | L2 | override_file_round_trip |
| override-file ref | Deterministic serialization | L2 | override_file_deterministic_serialization |
| override-file ref | Atomic write | L2 | override_file_atomic_write |
| override-file ref | margin "unbounded" form | L2 | override_file_preserves_margin_unbounded |
| domain types | JaccardScore range | L2 | jaccard_score_new_in_range |
| domain types | ConfidenceThreshold ≥ 1 | L2 | confidence_threshold_* |
| domain types | Margin semantics | L2 | margin_* |
| domain types | RetainSet::from_str | L2 | retain_set_parse |
| domain types | InsertedRole::from_str | L2 | inserted_role_parse |
| domain types | parse_mapping_spec | L2 | mapping_spec_parse_* |
| domain types | MergeFlag serde | L2 | merge_flag_serde_* |
| domain types | Pipeline reproducibility | L3 | pipeline_replay_via_override_file |

Every invariant has at least one named test; many have multiple across layers. When the impl phase begins, the first commit should produce the fixtures, the second commit the highest-layer failing test for the simplest invariant, then drill down per the standard TDD progression.
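
Two of the override-file invariants listed above, deterministic serialization and atomic write, are commonly met with a sorted map plus a write-then-rename. The sketch below shows only that generic pattern. The line-based format and the `render` and `write_atomic` helpers are hypothetical stand-ins, not the override-file schema or the OverrideFile API.

```rust
use std::collections::BTreeMap;
use std::fs;
use std::io;
use std::path::Path;

/// Render entries in key order. `BTreeMap` iterates in sorted key order,
/// so the same entries always produce the same bytes.
/// (Placeholder format: one `session = mapping` line per entry.)
pub fn render(entries: &BTreeMap<String, String>) -> String {
    let mut out = String::new();
    for (session, mapping) in entries {
        out.push_str(session);
        out.push_str(" = ");
        out.push_str(mapping);
        out.push('\n');
    }
    out
}

/// Write to a sibling temp file, then rename it over the target, so a
/// reader never observes a half-written file at `path`.
pub fn write_atomic(path: &Path, contents: &str) -> io::Result<()> {
    let tmp = path.with_extension("tmp");
    fs::write(&tmp, contents)?;
    fs::rename(&tmp, path)
}
```

A test in the spirit of override_file_deterministic_serialization would insert entries in two different orders and assert that the rendered bytes are identical. A production version would likely also sync the temp file before the rename; that is omitted here.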

What this plan does NOT cover

  • Performance / scaling tests. Until the pipeline shows up on a measured workload, no targeted perf assertions. The reference corpus’s existing round-trip benchmarks remain the baseline.
  • Fuzz testing. This repository now has a local fuzz/ workspace for parser/validation fuzzing. If the merge crate stabilizes enough to justify dedicated fuzzing, adding a merge-specific target for random parseable CHAT-pair inputs is a follow-up, not a v1 blocker.
  • Cross-platform CI checks. Windows / Linux / macOS each build the workspace; the merge module rides the existing CI. No platform-specific tests needed (the merge operates on parsed AST and writes UTF-8; no path-or-line-ending quirks).
  • Real-corpus regression sweeps. Once impl lands, running chatter merge over a curated subset of the reference corpus and snapshotting the outputs is a worthwhile follow-up. If added, it lives in a separate tests/golden/-style mechanism; it is not designed here.

TDD authoring sequence

Each numbered item is one full RED → GREEN → REFACTOR cycle. Cycles must run in order; do not start cycle N+1 until cycle N is green and committed. The numbering is designed so that the first working pipeline (cycle 8) emerges from the absolute minimum set of types + algorithms, and each later cycle extends it.

The starter test for cycle 1 is intentionally tiny: a 2-utterance fixture pair with no markup, one retain speaker. The smoke test exercises every layer (parser, transform, CLI) but with the simplest possible CHAT bytes, so the first impl is small enough to land in one cycle.

Phase A, minimal end-to-end pipeline (cycles 1-8)

These cycles produce the simplest possible chatter merge working end-to-end with synthetic fixtures.

| # | RED (failing test) | GREEN (smallest impl that passes) |
| --- | --- | --- |
| 1 | merge_basic_smoke, L3 subprocess test against the tiniest fixture pair (FIX_REF_TWO_UTT_NO_MARKUP + FIX_ASR_LABELED_TWO_UTT), retain={CHI}, asserts exit 0 and “merged file exists” | Stub chatter merge subcommand wiring; introduce minimal talkbank-transform::transcript_merge::merge that interleaves utterances by start_ms and emits parser→serializer round-trip. No tier-stripping, no header-reconcile, no validation. Just: parse, sort, serialize. |
| 2 | merge_retained_speakers_byte_stable, L2 over the smoke fixture, asserts every CHI block byte-identical | Implement byte-stable handling for retained utterances (preserve main_raw_lines + dependent tiers exactly). |
| 3 | merge_strips_default_derived_tiers, L2 against a fixture where the donor has %wor rows | Implement tier_strip per the per-tier policy; drop %wor/%mor/%gra/%pho from inserted-speaker utts. |
| 4 | merge_utterance_order_by_start_time, L2 with a fixture where File 1 and File 2 utterances interleave | Implement timeline sort key (start_ms primary; source-order tiebreak). |
| 5 | merge_header_participants_concatenates, L2 | Implement header_reconcile::participants_merge. |
| 6 | merge_header_id_concatenates, L2 | Extend header_reconcile for @ID rows. |
| 7 | merge_header_languages_passthrough + merge_header_media_file1_wins + merge_header_comments_concatenate, L2 | Extend header_reconcile for remaining headers per the contract table. |
| 8 | merge_preconditions_retain_missing + merge_preconditions_no_timeline + merge_preconditions_language_mismatch + merge_preconditions_ambiguous_speaker, L3, each asserting exit code 2 with a specific stderr message | Implement preconditions module + map MergeError to exit codes in the CLI. |
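
The ordering rule from cycle 4 (start_ms primary, source-order tiebreak, File 1 first) can be sketched as a stable sort over a source-tagged utterance list. The `Source` and `Utt` types and the `interleave` function are hypothetical illustrations, not the types in talkbank-transform.

```rust
/// Which input file an utterance came from. Deriving `Ord` orders
/// `File1` before `File2`, so File 1 wins ties on `start_ms`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Source {
    File1,
    File2,
}

/// Hypothetical minimal utterance record.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Utt {
    pub start_ms: u64,
    pub source: Source,
    pub text: String,
}

/// Interleave two utterance lists by start time.
pub fn interleave(file1: Vec<Utt>, file2: Vec<Utt>) -> Vec<Utt> {
    let mut all: Vec<Utt> = file1.into_iter().chain(file2).collect();
    // `sort_by_key` is a stable sort: utterances with an identical
    // (start_ms, source) key keep their original within-file order.
    all.sort_by_key(|u| (u.start_ms, u.source));
    all
}
```

Putting the source tag into the sort key, rather than relying on concatenation order alone, keeps the File-1-first tiebreak explicit and easy to pin with a test like merge_stable_tiebreak_file1_first.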

Phase A, actual cycle log

The four-precondition cycle 8 was deliberately split into four single-variant cycles (9a / 9b / 9c / 9d) so each MergeError variant lands with its own RED→GREEN cycle and L2 + L3 sibling tests. The numbering here is therefore finer-grained than the plan table above; the table records the shape of Phase A, the log records what was actually committed.

| # | Test(s) | Layer | Status |
| --- | --- | --- | --- |
| 1 | merge_basic_smoke | L3 | done |
| 2 | merge_retained_speakers_byte_stable | L2 | done |
| 3 | merge_strips_default_derived_tiers | L2 | done |
| 4 | merge_strip_tiers_configurable | L2 | done |
| 5 | merge_strip_tiers_empty_preserves_all | L2 | done |
| 6 | merge_header_participants_concatenates | L2 | done |
| 7 | merge_header_id_concatenates | L2 | done |
| 8a | merge_header_comments_concatenate | L2 | done |
| 8b | merge_header_languages_passthrough + merge_header_media_file1_wins | L2 | done |
| 9a | merge_no_retain_speakers_in_file1 + _returns_err | L3 + L2 | done (L2 sibling backfilled in 9c) |
| 9b | merge_no_timeline_in_file1 + _returns_err | L3 + L2 | done |
| 9c | merge_language_mismatch + _returns_err | L3 + L2 | done |
| 9d | merge_ambiguous_speaker + _returns_err | L3 + L2 | done |
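
Cycles 3 to 5 pin three behaviors of tier stripping: a default strip set, a configurable set, and an empty set that preserves everything. A minimal sketch of that shape follows, assuming tiers are identified by their label string. `DependentTier`, `default_strip_set`, and `strip_tiers` are hypothetical names, not the tier_strip module's API.

```rust
use std::collections::BTreeSet;

/// Hypothetical dependent tier: a label such as "%wor" plus its raw content.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DependentTier {
    pub label: String,
    pub content: String,
}

/// The default strip set named in the plan: %wor, %mor, %gra, %pho.
pub fn default_strip_set() -> BTreeSet<String> {
    ["%wor", "%mor", "%gra", "%pho"]
        .iter()
        .map(|s| s.to_string())
        .collect()
}

/// Drop every tier whose label is in `strip`.
/// An empty `strip` set keeps every tier untouched.
pub fn strip_tiers(tiers: Vec<DependentTier>, strip: &BTreeSet<String>) -> Vec<DependentTier> {
    tiers
        .into_iter()
        .filter(|t| !strip.contains(&t.label))
        .collect()
}
```

In this sketch the empty-set case needs no special branch: the filter rejects nothing, which is the property merge_strip_tiers_empty_preserves_all is named after.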

End of Phase A: chatter merge works on simple fixtures with all four preconditions (retain / timeline / language / ambiguous speaker) enforced. The pipeline is publishable as v0.
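
The precondition-to-exit-code mapping that cycles 9a to 9d pin can be sketched as an exhaustive match. MergeError is named in this plan, but the variant names and message wording below are assumptions for illustration; the real stderr text is whatever the L3 tests assert.

```rust
use std::fmt;

/// Hypothetical mirror of the four precondition failures enforced in Phase A.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum MergeError {
    NoRetainSpeakersInFile1,
    NoTimelineInFile1,
    LanguageMismatch,
    AmbiguousSpeaker,
}

impl fmt::Display for MergeError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Placeholder wording only.
        let msg = match self {
            MergeError::NoRetainSpeakersInFile1 => "no retained speakers found in File 1",
            MergeError::NoTimelineInFile1 => "File 1 has no timeline",
            MergeError::LanguageMismatch => "File 1 and File 2 languages differ",
            MergeError::AmbiguousSpeaker => "speaker code is ambiguous across files",
        };
        f.write_str(msg)
    }
}

/// Map each precondition failure to the process exit code.
/// No wildcard arm: adding a variant is a compile error until it is mapped.
pub fn exit_code(err: &MergeError) -> i32 {
    match err {
        MergeError::NoRetainSpeakersInFile1
        | MergeError::NoTimelineInFile1
        | MergeError::LanguageMismatch
        | MergeError::AmbiguousSpeaker => 2,
    }
}
```

Omitting the wildcard arm fits the one-variant-per-cycle split: each new variant forces a deliberate exit-code decision at compile time, alongside its L3 test.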

Phase B, actual cycle log

Phase B picks up at cycle 10 in the cycle log (Phase A used 9a-9d for the precondition split).

| # | Test(s) | Layer | Status |
| --- | --- | --- | --- |
| 10 | speaker_id_explicit_basic | L3 | done |
| 11 | apply_mapping_byte_stable_except_prefix + apply_mapping_rewrites_participants + apply_mapping_rewrites_id | L2 | done (regression-guards) |
| 12 | identify_mapping_clean_winner | L2 | done |
| 13 | identify_mapping_borderline_refuses | L2 | done |
| 14 | speaker_id_reference_low_confidence_exits_4 | L3 | done |
| 15 | speaker_id_reference_writes_override (+ OverrideFile data model) | L3 | done |
| 16 | speaker_id_override_file_replay (+ OverrideFile::get) | L3 | done |
| 17 | adjudicate_speaker_id_accepts_suggested (+ adjudication core) | L4 | done |
| 18 | adjudicate_scripted_accepts_suggested (+ chatter adjudicate CLI + scripted-TOML I/O) | L3 | done |
| 19 | speaker_id_reference_writes_pending_on_low_confidence (+ --write-pending flag + LowConfidence carries DonorMatchReport) | L3 | done |
| 20 | adjudicate_speaker_id_override_mapping (+ OperatorDecision::OverrideMapping variant + scripted-TOML override-mapping shape) | L4 | done |
| 21 | adjudicate_interactive_accepts_suggested (+ TerminalPrompter + --interactive flag) | L3 | done |
| 22 | adjudicate_parent_role_lookup_chooses_role (+ PendingKindData promotion + ParentRoleLookup kind + ChooseRole decision) | L4 | done |
| 23 | adjudicate_interactive_chooses_role (+ parse_operator_response + kind-aware prompt hint) | L3 | done |
| 24 | adjudicate_interactive_override_mapping (+ parse_override_mapping + parse_speaker_assignment) | L3 | done |
| 25 | pipeline_clean_winner_end_to_end (+ chatter pipeline subcommand) | L3 | done |
| 26 | batch_pass1_single_session (+ chatter batch subcommand, subprocess driver) | L3 | done |
| 27 | batch_mixed_outcomes (regression-guard: clean+borderline aggregation) | L3 | done |
| 28 | batch_pass2_replay (+ --override-file on pipeline + batch; per-session auto-detection) | L3 | done |
| 29 | batch_skip_existing (+ --skip-existing flag on batch for idempotent re-runs) | L3 | done |
| 30 | refactor: PipelineArgs + BatchArgs structs retire three #[allow(clippy::too_many_arguments)] markers | n/a | done (true-no-op refactor; covered by cycles 25-29 regression suite) |
| 31 | refactor: split commands/speaker_id.rs (472 lines) into speaker_id/{mod,modes,writes,support}.rs (158 + 196 + 103 + 86 lines); retire 4 stale #[allow(dead_code)] markers on ReferenceModeOutcome (fields are read by write_override_entry) | n/a | done (true-no-op refactor; covered by cycles 10-29 regression suite) |
| 32 | adjudicate_sanity_scan_accept_suggested (+ AdjudicationKind::SanityScanMisclassification variant, PendingKindData::SanityScanMisclassification { suggested, reason } variant, two apply-decision arms mirroring SpeakerIdLowConfidence, terminal prompter render + prompt-hint arm) | L4 | done, adjudication kind end-to-end; the post-merge scan detector itself (heuristic + auto-pending-write) is a separate cycle 33 |
| 33 | sanity_scan_flags_inverted_mlu (+ talkbank_transform::sanity_scan::scan_session + chatter sanity-scan subcommand; mean-utterance-word-count asymmetry heuristic, default 1.5×, binary-mapping only) | L3 | done, detector + CLI end-to-end; multi-rename support, batch integration, and alternative heuristics deferred |
| 34 | batch_writes_override_for_auto_decisions (+ --write-override on both chatter pipeline and chatter batch; threaded through PipelineArgs.write_override_path + BatchArgs.write_override_path; reference-mode auto-decisions audit-trailed for sanity-scan + future re-runs) | L3 | done |
| 35 | batch_with_sanity_scan_flag_flags_inverted_mlu (+ --sanity-scan + --sanity-scan-threshold on chatter batch; post-loop subprocess driver for chatter sanity-scan; precondition validation requiring --write-override + --write-pending) | L3 | done |
| 36 | refactor: split cli/args/core.rs (984 → 747 lines): extract DebugCommands → debug_commands.rs, CacheCommands → cache_commands.rs, config enums (LogFormat, TuiMode, OutputFormat, ParserBackend, AlignmentTier) → cli_types.rs, unit-test module → core_tests.rs (via #[path]); satisfies the 800-line hard limit | n/a | done (true-no-op refactor; covered by full regression suite + 110 bin/integration tests) |
| 37+ | sanity-scan multi-rename support; diarization-mix-review kind (operator workflow design needed); newtype threading at struct seams (deferred simplify finding); apply_decision arm dedup + per-kind OperatorDecision sub-enums | L3 + L4 | pending |
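
The cycle-33 detector is described as a mean-utterance-word-count asymmetry heuristic with a default factor of 1.5× over a binary mapping. One plausible reading is sketched below: flag a session when the speaker expected to produce shorter utterances is at least `threshold` times longer on average. The direction of the comparison, the whitespace tokenization, and the names `mean_words` and `looks_inverted` are assumptions, not the behavior of talkbank_transform::sanity_scan::scan_session.

```rust
/// Mean whitespace-separated word count per utterance.
/// Returns `None` when there are no utterances to average.
pub fn mean_words(utterances: &[&str]) -> Option<f64> {
    if utterances.is_empty() {
        return None;
    }
    let total: usize = utterances
        .iter()
        .map(|u| u.split_whitespace().count())
        .sum();
    Some(total as f64 / utterances.len() as f64)
}

/// Flag a likely inverted mapping: the speaker expected to be shorter
/// averages at least `threshold` times the other speaker's word count.
/// Sessions with a missing or zero-length side are never flagged.
pub fn looks_inverted(expected_shorter: &[&str], expected_longer: &[&str], threshold: f64) -> bool {
    match (mean_words(expected_shorter), mean_words(expected_longer)) {
        (Some(short), Some(long)) if long > 0.0 => short / long >= threshold,
        _ => false,
    }
}
```

With the default factor, a call would look like `looks_inverted(&chi_utts, &inv_utts, 1.5)`, where `chi_utts` and `inv_utts` are hypothetical per-speaker slices.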

Phase B, speaker-id pipeline (cycles 9-16)

These cycles add chatter speaker-id and its three modes.

| # | RED | GREEN |
| --- | --- | --- |
| 9 | speaker_id_explicit_basic, L3 against an anonymous-2-speaker donor with --mapping "PAR0=drop,PAR1=INV:Investigator", asserts output has only INV utts | Stub chatter speaker-id subcommand. Implement parse_mapping_spec + apply_mapping. Reference mode and override-file mode return unimplemented!() for now. |
| 10 | apply_mapping_byte_stable_except_prefix + apply_mapping_rewrites_participants + apply_mapping_rewrites_id, L2 | Tighten apply_mapping per header rewrite rules. |
| 11 | identify_mapping_clean_winner, L2 with a fixture where one donor speaker overwhelmingly matches the reference | Implement text_cleaner + jaccard modules. Implement identify_mapping using them. Reference mode in CLI now works. |
| 12 | identify_mapping_borderline_refuses, L2 with a borderline fixture | Add ConfidenceThreshold check + LowConfidence error path. |
| 13 | speaker_id_reference_low_confidence_exits_4, L3 against borderline fixture | Map LowConfidence to exit code 4 in the CLI; print scores to stderr. |
| 14 | speaker_id_reference_writes_override, L3 with --write-override | Implement OverrideFile::read_or_default + OverrideFile::write. |
| 15 | speaker_id_override_file_replay, L3 with --override-file + --session-id | Implement override-file mode in CLI (OverrideFile::get + apply). |
| 16 | Token-cleaner L1 specs (a handful of representative clean-* specs from L1.1) + current spec/tools generators | Move the regex-and-string cleaner into a spec-test-covered implementation. Specs become the regression net. |
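
The explicit-mode mapping string in cycle 9, "PAR0=drop,PAR1=INV:Investigator", suggests a grammar of comma-separated `donor=drop` or `donor=CODE:Role` entries. The parser below is a sketch of that inferred grammar only. The `Assignment` type, the `parse_mapping_sketch` name, the `String` error type, and the duplicate-donor rule are assumptions, not the real parse_mapping_spec signature or contract.

```rust
/// Hypothetical result of one mapping entry.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Assignment {
    /// `donor=drop`: remove the donor speaker's utterances.
    Drop,
    /// `donor=CODE:Role`: rename the donor speaker.
    Rename { code: String, role: String },
}

/// Parse a spec such as "PAR0=drop,PAR1=INV:Investigator".
pub fn parse_mapping_sketch(spec: &str) -> Result<Vec<(String, Assignment)>, String> {
    let mut out: Vec<(String, Assignment)> = Vec::new();
    for entry in spec.split(',') {
        let (donor, target) = entry
            .split_once('=')
            .ok_or_else(|| format!("missing '=' in entry {entry:?}"))?;
        let (donor, target) = (donor.trim(), target.trim());
        if donor.is_empty() {
            return Err(format!("empty donor code in entry {entry:?}"));
        }
        let assignment = if target == "drop" {
            Assignment::Drop
        } else {
            let (code, role) = target
                .split_once(':')
                .ok_or_else(|| format!("expected CODE:Role or drop in entry {entry:?}"))?;
            if code.is_empty() || role.is_empty() {
                return Err(format!("empty code or role in entry {entry:?}"));
            }
            Assignment::Rename { code: code.to_string(), role: role.to_string() }
        };
        // Assumed rule: each donor speaker may be assigned at most once.
        if out.iter().any(|(d, _)| d.as_str() == donor) {
            return Err(format!("duplicate donor code {donor:?}"));
        }
        out.push((donor.to_string(), assignment));
    }
    Ok(out)
}
```

Returning entries in input order keeps error messages and any downstream output deterministic, which matters for the byte-stability tests in cycle 10.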

End of Phase B: the full chatter speaker-id + chatter merge pipeline works in auto, explicit, and override-file modes.
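
Reference mode scores each donor speaker against the reference with a multiset Jaccard, then refuses when the winner is not clear enough. The sketch below uses the standard min/max multiset definition. The margin shape (best score divided by runner-up, "unbounded" when the runner-up is zero) is only one reading of the "ConfidenceThreshold ≥ 1" and margin "unbounded" rows in the matrix. All names here are hypothetical, and the authoritative formula is whatever the jaccard-* specs pin.

```rust
use std::collections::HashMap;

/// Count occurrences of each token.
fn counts<'a>(tokens: &[&'a str]) -> HashMap<&'a str, usize> {
    let mut m = HashMap::new();
    for t in tokens {
        *m.entry(*t).or_insert(0) += 1;
    }
    m
}

/// Multiset Jaccard: sum of per-token min counts over sum of per-token
/// max counts. Two empty inputs score 0.0 in this sketch.
pub fn multiset_jaccard<'a>(a: &[&'a str], b: &[&'a str]) -> f64 {
    let (ca, cb) = (counts(a), counts(b));
    let (mut min_sum, mut max_sum) = (0usize, 0usize);
    for (tok, &na) in &ca {
        let nb = cb.get(tok).copied().unwrap_or(0);
        min_sum += na.min(nb);
        max_sum += na.max(nb);
    }
    // Tokens that occur only in `b` contribute to the max side alone.
    for (tok, &nb) in &cb {
        if !ca.contains_key(tok) {
            max_sum += nb;
        }
    }
    if max_sum == 0 { 0.0 } else { min_sum as f64 / max_sum as f64 }
}

/// Assumed margin shape: best / runner-up, or `Unbounded` for a zero runner-up.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Margin {
    Ratio(f64),
    Unbounded,
}

pub fn margin(best: f64, runner_up: f64) -> Margin {
    if runner_up == 0.0 { Margin::Unbounded } else { Margin::Ratio(best / runner_up) }
}

/// True when the margin meets the threshold; `Unbounded` always does.
pub fn clears(m: Margin, threshold: f64) -> bool {
    match m {
        Margin::Unbounded => true,
        Margin::Ratio(r) => r >= threshold,
    }
}
```

Under this reading, a borderline fixture is one where `clears` returns false, which is the condition identify_mapping_borderline_refuses and the exit-4 L3 test would exercise.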

Phase C, adjudication (cycles 17-22)

These cycles add the chatter adjudicate tool and its prompter-injection testability.

| # | RED | GREEN |
| --- | --- | --- |
| 17 | adjudicate_empty_pending_file_noop, L4 against an empty pending file, asserts exit 0 + no changes | Stub chatter adjudicate subcommand. Implement PendingAdjudications::read + run_adjudication core skeleton with a no-op Prompter trait. |
| 18 | prompter_scripted_returns_decisions_in_order, L4 | Implement ScriptedPrompter::from_decisions (in-memory) per the Prompter trait. |
| 19 | adjudicate_speaker_id_accepts_suggested, L4 against FIX_PENDING_SPEAKER_ID with one AcceptSuggested decision | Implement apply_decision for the speaker-id-low-confidence kind. Override file now gets the decision; pending entry removed. |
| 20 | adjudicate_speaker_id_override_mapping, L4 with OverrideMapping decision | Extend apply_decision for the override-mapping variant. |
| 21 | adjudicate_speaker_id_kind_mismatch_rejected, L4 with an OverrideInsertedRole against a speaker-id pending entry | Implement kind→variants validation in apply_decision. |
| 22 | adjudicate_scripted_mode_unknown_session_aborts + adjudicate_scripted_mode_extra_pending_aborts, L4 | Tighten scripted-mode validation; assert 1:1 mapping between pending entries and scripted decisions. |
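
The kind→variants validation from cycle 21 amounts to a small table of allowed (pending kind, decision) pairs. The sketch below encodes only the pairs this plan names explicitly: speaker-id entries accept AcceptSuggested and OverrideMapping, and parent-role-lookup entries accept ChooseRole. The real table may admit more pairs; the simplified `Kind` and `Decision` enums and the `validate` function are hypothetical.

```rust
/// Hypothetical subset of pending-entry kinds.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    SpeakerIdLowConfidence,
    ParentRoleLookup,
}

/// Hypothetical simplified decision variants; payloads are plain strings here.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Decision {
    AcceptSuggested,
    OverrideMapping(String),
    ChooseRole(String),
    OverrideInsertedRole(String),
}

/// Reject a decision whose variant is not valid for the pending entry's kind.
pub fn validate(kind: Kind, decision: &Decision) -> Result<(), String> {
    let allowed = matches!(
        (kind, decision),
        (Kind::SpeakerIdLowConfidence, Decision::AcceptSuggested)
            | (Kind::SpeakerIdLowConfidence, Decision::OverrideMapping(_))
            | (Kind::ParentRoleLookup, Decision::ChooseRole(_))
    );
    if allowed {
        Ok(())
    } else {
        Err(format!("decision {decision:?} is not valid for pending kind {kind:?}"))
    }
}
```

Running this check before any state is touched means a rejected decision leaves both the override file and the pending file unchanged.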

End of Phase C: scripted adjudication tested end-to-end with synthetic operator inputs. Interactive terminal UX still unimplemented (next phase).
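
Prompter injection is what makes Phase C testable without a terminal: the core loop talks to a trait, and tests hand it a scripted implementation. A minimal sketch of that shape follows. The trait method signature, the simplified `OperatorDecision`, and `run_adjudication_sketch` are assumptions for illustration, not the crate's real definitions.

```rust
use std::collections::VecDeque;

/// Hypothetical simplified decision type.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum OperatorDecision {
    AcceptSuggested,
    OverrideMapping(String),
}

/// The seam between the adjudication core and whoever answers the prompts.
pub trait Prompter {
    /// Returns the decision for one pending entry, or `None` when no input is left.
    fn prompt(&mut self, session_id: &str) -> Option<OperatorDecision>;
}

/// In-memory prompter that replays decisions in the order given.
pub struct ScriptedPrompter {
    queue: VecDeque<OperatorDecision>,
}

impl ScriptedPrompter {
    pub fn from_decisions(decisions: Vec<OperatorDecision>) -> Self {
        Self { queue: decisions.into() }
    }

    /// Number of scripted decisions not yet consumed.
    pub fn remaining(&self) -> usize {
        self.queue.len()
    }
}

impl Prompter for ScriptedPrompter {
    fn prompt(&mut self, _session_id: &str) -> Option<OperatorDecision> {
        self.queue.pop_front()
    }
}

/// Core loop sketch: one decision per pending entry, aborting if the
/// prompter runs dry before every pending entry has a decision.
pub fn run_adjudication_sketch(
    pending: &[&str],
    prompter: &mut dyn Prompter,
) -> Result<Vec<(String, OperatorDecision)>, String> {
    let mut decided = Vec::new();
    for id in pending {
        let decision = prompter
            .prompt(id)
            .ok_or_else(|| format!("no decision available for session {id}"))?;
        decided.push((id.to_string(), decision));
    }
    Ok(decided)
}
```

A strict 1:1 check in the style of cycle 22 would additionally assert that `remaining()` is zero once the loop finishes, so leftover scripted decisions also abort.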

Phase D, interactive UX (cycles 23-25)

| # | RED | GREEN |
| --- | --- | --- |
| 23 | prompter_terminal_round_trip_decision, L4 with mocked stdin/stdout | Implement TerminalPrompter parsing [a]/[o]/[f]/... keys + optional follow-up prompts. |
| 24 | adjudicate_resumption_skips_decided_entries, L4 with a partially-decided override file + full pending list | Implement skip-already-decided logic in run_adjudication. |
| 25 | Manual smoke test (NOT automated), run chatter adjudicate --interactive against the test fixtures; visually confirm the operator UX matches the doc’s mock-up | Polish terminal output: ANSI formatting, fixed-width alignment, the [m] Show more context action, the [p] Play media action. |

End of Phase D: full v1 pipeline complete.
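
The resumption behavior from cycle 24 (skip entries that already have a decision in the override file) reduces to a set-difference that preserves pending order. The sketch below assumes entries are keyed by a session-id string; `remaining_pending` is a hypothetical helper, not part of run_adjudication.

```rust
use std::collections::BTreeSet;

/// Return the pending session ids that have no recorded decision yet,
/// in their original pending-list order.
pub fn remaining_pending<'a>(pending: &[&'a str], decided: &BTreeSet<String>) -> Vec<&'a str> {
    pending
        .iter()
        .copied()
        .filter(|id| !decided.contains(*id))
        .collect()
}
```

Preserving the pending order means an interrupted and resumed run presents the operator with the same sequence a single uninterrupted run would have.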

Phase E, non-speaker-id adjudication kinds (cycles 26-29)

Each adjudication kind gets its own RED→GREEN cycle.

| # | RED | GREEN |
| --- | --- | --- |
| 26 | adjudicate_parent_role_overrides_to_mother + adjudicate_parent_role_overrides_to_father, L4 | Implement parent-role-lookup kind end-to-end (pending schema, prompter context, decision application). |
| 27 | adjudicate_diarization_mix_flag_only, L4 | Implement diarization-mix-review kind end-to-end. |
| 28 | adjudicate_sanity_scan_swap_mapping, L4 | Implement sanity-scan-misclassification kind end-to-end. |
| 29 | adjudicate_re_adjudicate_preserves_history, L4 | Implement --re-adjudicate flag; add history field to MergeOverride. |

Phase F, breadth pass (cycles 30+)

Fill in every remaining test from L1-L4 that hasn’t been written yet. These are coverage-deepening tests, not behavior adders. The impl from Phases A-E should pass them with at most minor refactoring; if a test fails meaningfully, that’s a gap in the impl that this cycle closes.

The breadth pass is the only phase where multiple cycles can proceed in parallel (different contributors take different test groups). Phases A-E are strictly serial.

Hard rules during impl phase

  • No test stubs. Every test in this plan, when written, must FAIL before its impl exists and PASS after. Skipped or #[ignore]-marked tests are not allowed in the regression net (use #[ignore] only for genuinely slow or environment-dependent tests, not for “not implemented yet”).
  • No test deletion to make CI green. If a test that was passing starts failing after a refactor, the refactor is wrong. Investigate; do not delete the test.
  • Three cycle archetypes, distinguish them. A cycle is one of:
    • bug-fix: RED motivates new impl code (cycle N-1’s impl truly cannot satisfy the new test).
    • regression-guard: RED pins an invariant the impl inherits from upstream infrastructure (e.g. parse→serialize byte-stability inherited from talkbank-parser). The test passes against cycle N-1’s impl, but the cycle is valuable because it locks in the invariant against future “optimizations” that might break it. On the first run, print the actual behavior verbosely to confirm that the invariant holds for the right reasons, not by accident.
    • true no-op: RED tests something already pinned elsewhere. These ARE unnecessary; drop the cycle or sharpen the test. The difference between regression-guard and true no-op is whether the invariant is named explicitly anywhere else. If yes (e.g., the parser crate already has a roundtrip test that covers it), the cycle is true-no-op. If no, the cycle is a regression-guard and worth keeping.