Merge Pipeline, Test Plan
Status: Draft Last modified: 2026-05-30 06:55 EDT
This page is the test-coverage roadmap for the new merge pipeline
(chatter speaker-id + chatter merge + chatter adjudicate +
the override-file format + the underlying talkbank-model::merge
types). It exists because, per this repo’s root CLAUDE.md
red/green TDD rule, every new feature starts with failing tests
at the highest level the feature lives at, and we want to
enumerate those tests before writing the implementation, so
coverage is designed, not discovered.
This is a plan, not yet code. When the implementation work begins, every test case below becomes a real test; the doc then flips to a coverage matrix that gets kept honest by CI.
TDD discipline, what “strict red/green” means here
Every cycle of impl-phase work is:
- RED. Write ONE failing test at the highest layer the feature lives at. The test exercises a real user-observable behavior, not an internal helper. Commit the failing test alone (or stage it before any code change), verify it fails for the right reason (the missing behavior), not for a compile error or a typo.
- GREEN. Write the smallest code change that makes the test pass. No anticipating future tests, no scaffolding for tests that don’t yet exist. The codebase should compile and pass tests at this point.
- REFACTOR. With the green test as the safety net, tighten the implementation: extract helpers, rename for clarity, replace primitives with newtypes, document tricky parts. Tests stay green throughout.
- DRILL DOWN if needed. If the L3 (or L2) test passes but pinned the behavior less precisely than the contract requires (e.g., the L3 test asserts “exit 2 with some error” but the contract says “the specific MergeError variant must match”), add an L2 (or L1) test next that drills into the precise path. The drilled test FAILS at first against the green-but-imprecise impl, motivating the tighter impl.
Cycles must be atomic: one RED → one GREEN → optional REFACTOR → optional drill-down. Do not stack multiple tests on top of a single impl change; do not write impl ahead of tests. The discipline matters because the bug bar of this pipeline is high (CHAT-data byte-stable preservation, audit-trail reproducibility) and TDD is the cheapest way to catch regressions before they ship.
Three test layers + the adjudication layer
The merge pipeline’s behavior spans four substrates with different testing mechanisms.
| Layer | Substrate | Why tests live here |
|---|---|---|
| L1, Spec / fragment | spec/constructs/speaker-id/ → current spec/tools generators | Token-cleaner behavior on CHAT fragments (markup strip for Jaccard scoring). Same mechanism that pins parser/grammar tests; regenerated regression. |
| L2, Transform / AST | crates/talkbank-transform/tests/ | Pure-Rust tests over parsed ChatFile values. identify_mapping, apply_mapping, merge, run_adjudication semantics on hand-built or parsed CHAT inputs. No process boundary. |
| L3, CLI / subprocess | crates/chatter/tests/merge_tests.rs (new) | End-to-end behavior of chatter speaker-id, chatter merge, and chatter adjudicate invoked as subprocesses (assert_cmd + predicates). Exit codes, flag parsing, file I/O, stderr formats. |
| L4, Scripted adjudication | crates/talkbank-transform/tests/adjudication_tests.rs + scripted prompter | Operator-decision paths in chatter adjudicate. Uses ScriptedPrompter injecting synthetic operator choices. See Adjudication Workflow for the prompter abstraction. |
L1 ⊂ L2 ⊂ L3 in terms of failure-mode coverage: a failing L1 test implies a failing L2 test which implies a failing L3 test. So when the same invariant could be tested at multiple layers, the starter test is the highest layer and lower-layer tests are supplements that pin the precise internal path. L4 sits beside L2/L3, same crate/file conventions but a dedicated layer because the prompter-injection pattern is specific to adjudication.
L1, Spec / fragment tests
Lives in spec/constructs/speaker-id/. Three subdirectories:
token-cleaner/: what the Jaccard tokenizer strips and keepsjaccard-scoring/: fixed-input → fixed-score golden testsmapping-application/: header rewrite rules on real fragments
L1.1, Token cleaner
Each spec is a CHAT main-tier fragment + the expected token list
after cleaning. Behavior pinned: bracket markup stripped,
angle-bracket retracing unwrapped, terminator variants
discarded, &-... / &+... discarded, xxx/yyy/www
discarded, 0 discarded, @l / @n / @c suffix dropped,
_-compound split to spaces, punctuation stripped, lowercased,
≥2-char alpha filter, NAK bullets stripped.
| Spec | Input fragment | Expected tokens |
|---|---|---|
clean-plain-utterance | *CHI:\thello world . | ["hello", "world"] |
clean-strip-bracket-codes | *CHI:\thello [*] [/] world [//] . | ["hello", "world"] |
clean-unwrap-angle-retrace | *CHI:\t<two of the> [//] three of the presents . | ["two", "of", "the", "three", "of", "the", "presents"] |
clean-strip-fillers | *CHI:\t&-um &+pre something &-uh . | ["something"] |
clean-strip-zero-and-paralinguistic | *CHI:\t0 [=! nodding] . | [] |
clean-strip-unintelligible | *CHI:\txxx and yyy and www . | ["and", "and"] |
clean-strip-bullets | *CHI:\thello world . \x150_1234\x15 | ["hello", "world"] |
clean-special-form-suffix | *CHI:\tnaming l@l u@l l@l u@l . | ["naming"] |
clean-compound-underscore | *CHI:\tValentine's_Day and Fruit_Loops . | ["valentine", "day", "and", "fruit", "loops"] |
clean-terminator-variants | *CHI:\thello +//. world +... again +/. last ! | ["hello", "world", "again", "last"] |
clean-overlap-markers | *CHI:\t↫here↫ and there . | ["here", "and", "there"] |
clean-lowercase-filter | *CHI:\tHello World A I am . | ["hello", "world", "am"] |
Each spec file in spec/constructs/speaker-id/token-cleaner/ has
the standard # name, ## Input, ## Expected tokens, and
## Metadata sections per the spec authoring template at
spec/CLAUDE.md in the workspace root (outside the book).
L1.2, Jaccard scoring
Fixed bag-of-tokens pairs with known multiset Jaccard. These
guard against off-by-one errors in the sum_w min / sum_w max
implementation and against any future “optimizations” that
silently change scoring.
| Spec | Bag A | Bag B | Expected J(A,B) |
|---|---|---|---|
jaccard-identical | {hello:2, world:1} | {hello:2, world:1} | 1.0 |
jaccard-disjoint | {hello:1} | {world:1} | 0.0 |
jaccard-empty-empty | {} | {} | 0.0 |
jaccard-empty-nonempty | {} | {x:1} | 0.0 |
jaccard-multiset-counts | {a:3, b:1} | {a:1, b:1} | 2/4 = 0.5 |
jaccard-partial-overlap | {a:1, b:1, c:1} | {b:1, c:1, d:1} | 2/4 = 0.5 |
L1.3, Mapping application on fragments
Header-rewrite micro-tests. Each spec gives an input
@Participants: or @ID: row and a small mapping; the expected
output row is the rewritten form.
| Spec | Input row | Mapping | Expected output row |
|---|---|---|---|
participants-rewrite-rename | @Participants:\tPAR0 Participant, PAR1 Participant | PAR0→INV:Investigator, PAR1→drop | @Participants:\tINV Investigator |
participants-preserve-name-token | @Participants:\tCHI Alex Target_Child, PAR0 Participant | PAR0→INV:Investigator | @Participants:\tCHI Alex Target_Child, INV Investigator |
id-rewrite-rename | @ID:\teng|corpus_name|PAR0|||||Participant||| | PAR0→INV:Investigator | @ID:\teng|corpus_name|INV|||||Investigator||| |
id-drop-removes-row | @ID:\teng|...|PAR1|||||Participant||| | PAR1→drop | (row removed) |
id-preserves-other-fields | @ID:\teng|2|CHI|6;01.|female|NF||Target_Child||| | (no-op for CHI) | identical to input |
L2, Transform / AST tests
Lives in crates/talkbank-transform/tests/. Three test files:
speaker_id_tests.rstranscript_merge_tests.rsoverride_file_tests.rs
Each tests behavior over parsed talkbank-model::ChatFile values,
using inline synthetic CHAT strings parsed via
talkbank_parser::parse_chat_file (no subprocess overhead).
L2.1, identify_mapping (reference mode)
| Test | Scenario | Assertion |
|---|---|---|
identify_mapping_clean_winner | Reference has CHI saying content X; donor has PAR0 saying X verbatim and PAR1 saying unrelated content | Returns SpeakerMapping { drop: {PAR0}, rename: {PAR1: INV} }, margin >> 2.0 |
identify_mapping_borderline_refuses | Reference and both donor speakers share substantial vocabulary (margin < 2.0) | Returns Err(SpeakerIdError::LowConfidence { scores, threshold, margin }) |
identify_mapping_anchor_missing | Reference has no utterances tagged with anchor speaker | Returns Err(SpeakerIdError::AnchorMissingInReference { anchor: CHI }) |
identify_mapping_single_speaker_donor | Donor has only one speaker | Returns Err(SpeakerIdError::InsufficientSpeakers { n: 1 }) |
identify_mapping_threshold_at_exact_value | Constructed donor where margin = 2.0 exactly with threshold 2.0 | Returns Ok(_) (≥ comparison, not strict >) |
identify_mapping_threshold_below_exact_value | Margin = 1.9999 with threshold 2.0 | Returns Err(SpeakerIdError::LowConfidence) |
identify_mapping_unbounded_margin | Donor PAR1 has Jaccard 0 against reference; PAR0 > 0 | Returns Ok(_) with margin = Margin::Unbounded |
identify_mapping_deterministic | Same inputs, repeated call | Identical SpeakerMapping byte-for-byte (BTreeMap ordering) |
L2.2, apply_mapping
| Test | Scenario | Assertion |
|---|---|---|
apply_mapping_renames_main_tier | Donor has *PAR0:\t... and *PAR1:\t...; mapping renames PAR0→INV, drops PAR1 | Output has *INV:\t... for original PAR0 utts; PAR1 utts absent |
apply_mapping_byte_stable_except_prefix | Donor has rich CHAT markup, %wor, %com on every utt | Every retained utt is byte-identical except the *CODE:\t prefix; dependent tiers preserved exactly |
apply_mapping_rewrites_participants | Donor @Participants: has PAR0+PAR1 entries | Output has only INV entry (after PAR1 drop) |
apply_mapping_rewrites_id | Donor @ID: rows for PAR0+PAR1 | PAR0 row rewritten to INV with role tag; PAR1 row removed |
apply_mapping_speaker_not_in_input | Mapping references PAR9 which isn’t in donor | Returns Err(SpeakerIdError::MappingSpeakerNotInInput { speaker: PAR9 }) |
apply_mapping_speaker_not_in_mapping | Donor has PAR0+PAR1+PAR2 but mapping only covers PAR0+PAR1 | Returns Err(SpeakerIdError::SpeakerNotInMapping { speaker: PAR2 }) |
apply_mapping_preserves_other_headers | Donor has @Languages, @Media, @Comment | All non-Participants/non-ID headers pass through verbatim |
apply_mapping_idempotent_on_rerun | Apply mapping, parse output, apply identity mapping | Output unchanged (byte-stable) |
L2.3, merge (core invariants)
These mirror the user-guide’s “What the merged output guarantees” section directly. Each invariant from that section maps to one or more L2 tests; the L3 tests then re-exercise the same invariant through the CLI.
| Test | Invariant from user-guide | Assertion |
|---|---|---|
merge_retained_speakers_byte_stable | “Retained speakers are byte-stable” | Every *CHI: block from File 1 (main tier + all dependent tiers, including %com) appears in the output byte-identical, in original order |
merge_strips_default_derived_tiers | “Inserted speakers’ downstream-generated tiers are stripped” | Output has no %wor, %mor, %gra, %pho on inserted-speaker utts; other dependent tiers preserved |
merge_strip_tiers_configurable | “configurable via --strip-tiers” | Custom strip_tiers=[com] removes %com instead of the defaults |
merge_strip_tiers_empty_preserves_all | empty strip set | Inserted utts retain %wor, %mor, %gra, %pho from File 2 verbatim |
merge_utterance_order_by_start_time | “Utterance order is timeline order” | Output utterances sorted by start_ms ascending |
merge_stable_tiebreak_file1_first | “first-file utterance comes first” | When File 1 and File 2 each have an utterance starting at exactly t, the File 1 one appears first in the output |
merge_bullets_pass_through | “Time bullets are pass-through” | Every bullet in the output is exactly the bullet from its source utterance, merge does not recompute, smooth, or refresh |
merge_bullet_lift_from_wor | “If main tier lacks bullet, lift from %wor” | Donor utt with no end-of-line bullet but a %wor row gets a derived \x15<first>_<last>\x15 appended; original %wor then stripped per the tier policy |
merge_no_overlap_markers_injected | “Overlap markup is NOT injected” | Even when inserted utt’s bullet overlaps a retained utt’s bullet by 500ms, no [>]/[<] tokens appear anywhere in the output that weren’t in the original retained file |
merge_preserves_existing_overlap_markers | retained file already has [>] somewhere | The original [>] is preserved byte-stable on the retained utt |
merge_header_languages_passthrough | Header reconciliation rule | Output @Languages matches File 1’s |
merge_header_media_file1_wins | Header reconciliation rule | File 1 says video, File 2 says audio → output says video (no warning emitted for modality only) |
merge_header_participants_concatenates | Header reconciliation rule | Output @Participants: is File 1’s entries + File 2’s non-retained entries, in that order |
merge_header_id_concatenates | Header reconciliation rule | Output @ID: rows are File 1’s + File 2’s non-retained, original order within each file |
merge_header_comments_concatenate | Header reconciliation rule | Output @Comment rows are File 1’s + File 2’s, in original order (ASR provenance preserved) |
merge_preconditions_retain_missing | exit code 2 precondition | File 1 declares no CHI; merge with retain={CHI} returns Err(MergeError::RetainSpeakersMissing) |
merge_preconditions_no_timeline | exit code 2 precondition | File 1 has no utterances with bullets → Err(MergeError::NoTimelineInFile1) |
merge_preconditions_language_mismatch | exit code 2 precondition | File 1 @Languages: eng, File 2 @Languages: yue → Err(MergeError::LanguageMismatch) |
merge_preconditions_ambiguous_speaker | exit code 2 precondition | Both files have INV utterances and retain={CHI} (INV not in retain) → Err(MergeError::AmbiguousSpeaker { speaker: INV }) |
merge_warns_on_backward_bullet_drift | “small backward-time bullets … proceeds” | File with utt1: 100_200, utt2: 190_300, succeeds, emits a warning |
L2.4, Override file I/O
| Test | Scenario | Assertion |
|---|---|---|
override_file_round_trip | Construct OverrideFile with one entry, write, read back | Re-read value == original |
override_file_refuses_missing_schema_version | TOML with no schema_version | Err(OverrideFileError::UnsupportedSchemaVersion { found: 0, supported: 1 }) |
override_file_refuses_wrong_schema_version | schema_version = 2 (future) | Err(UnsupportedSchemaVersion { found: 2, supported: 1 }) |
override_file_rejects_unknown_field | Entry has an extraneous field extra = "x" | Err(OverrideFileError::Parse) |
override_file_rejects_malformed_mode | mode = "guess" | Err(Parse) (only auto/explicit/override accepted) |
override_file_atomic_write | Write to a path that already exists | Original file is replaced atomically; no <path>.tmp left behind |
override_file_deterministic_serialization | Same struct, write twice | Bytes on disk are byte-identical between writes |
override_file_omits_empty_optionals | Entry has empty scores, no margin, empty flags | TOML output does not contain those keys |
override_file_preserves_margin_unbounded | Entry has margin = Margin::Unbounded | TOML on disk has margin = "unbounded"; reads back as Unbounded |
override_file_preserves_margin_finite | Entry has margin = Margin::Finite(3.81) | TOML on disk has margin = 3.81; reads back equal |
override_file_read_or_default_missing | Path does not exist | Returns empty OverrideFile with current schema version |
override_file_get_returns_entry | File has one entry under SessionId X | get(X) returns Some; get(Y) returns None |
L2.5, Domain-type unit tests
Smaller per-type tests. Each in its module’s #[cfg(test)] mod tests section.
| Test | Type | Assertion |
|---|---|---|
jaccard_score_new_in_range | JaccardScore | new(0.5) → Ok; new(-0.1) and new(1.1) → Err; new(NaN) → Err |
jaccard_score_serde_round_trip | JaccardScore | Serializes to 0.5 (bare float in JSON/TOML); deserializes back identically; out-of-range deserialize → error |
confidence_threshold_default_is_2_0 | ConfidenceThreshold | Default::default().value() == 2.0 |
confidence_threshold_rejects_below_1 | ConfidenceThreshold | new(0.5) → Err |
margin_from_scores_zero_loser | Margin | from_scores(JaccardScore::new(0.7), JaccardScore::zero()) == Margin::Unbounded |
margin_from_scores_zero_zero | Margin | from_scores(zero, zero) == Margin::Finite(0.0) or explicit “degenerate” representation (decide and document) |
margin_meets_threshold | Margin | Finite(3.81).meets(threshold=2.0) == true; Finite(1.5).meets(2.0) == false; Unbounded.meets(threshold) == true for any threshold |
retain_set_parse | RetainSet | "CHI".parse() == Ok({CHI}); "CHI,SI2".parse() == Ok({CHI, SI2}); "".parse() == Err; "CHI,,SI2".parse() == Err |
inserted_role_parse | InsertedRole | "INV:Investigator".parse() == Ok(_); "INV".parse() == Err; ":Investigator".parse() == Err |
mapping_spec_parse_simple | parse_mapping_spec | "PAR0=drop,PAR1=INV:Investigator" parses to a complete SpeakerMapping with correct actions and inserted_role |
mapping_spec_parse_drop_only | parse_mapping_spec | "PAR0=drop" parses iff no inserted_role context required (decide whether legal in isolation; if not, must error) |
mapping_spec_parse_conflicting_roles | parse_mapping_spec | "PAR0=INV:Investigator,PAR1=MOT:Mother", two different inserted roles → error (v1 only allows one) |
merge_flag_serde_known_variants | MergeFlag | DiarizationMixed serializes as "diarization-mixed" (kebab-case); deserializes the same |
merge_flag_serde_custom | MergeFlag | Unknown string deserializes as Custom("unknown-flag"); serializes verbatim |
L3, CLI / subprocess tests
Lives in crates/chatter/tests/merge_tests.rs (new file).
Uses the same assert_cmd + predicates + tempfile pattern
as the existing integration_tests.rs. Each test invokes
chatter speaker-id or chatter merge as a subprocess against
files written to a tempdir().
L3.1, chatter merge, success paths
| Test | Invariants exercised |
|---|---|
merge_basic_clinician_pattern | E2E happy path: small hand-coded child-only file + small ASR-labeled file → exit 0, output exists, retained CHI byte-stable, inserted INV present with derived tiers stripped. Single-invocation smoke test. |
merge_writes_to_stdout_by_default | No -o flag → output goes to stdout, exit 0 |
merge_writes_to_output_path | -o merged.cha → file created with correct content; nothing on stdout |
merge_retain_multi_speaker | --retain CHI,SI2 keeps both CHI and SI2 byte-stable; everything else from File 2 |
merge_strip_tiers_custom | --strip-tiers com,act removes %com and %act instead of default set |
merge_strip_tiers_empty | --strip-tiers '' preserves %wor from File 2 in output |
L3.2, chatter merge, error paths
| Test | Asserted exit code | Asserted stderr |
|---|---|---|
merge_missing_file1 | 1 | “No such file” or equivalent typed message |
merge_unparseable_file1 | 1 | parser diagnostic |
merge_missing_retain_flag | 2 (clap) | clap usage message |
merge_retain_empty_value | 2 | typed error from RetainSet::from_str |
merge_no_retain_speakers_in_file1 | 2 | RetainSpeakersMissing rendered |
merge_no_timeline_in_file1 | 2 | NoTimelineInFile1 rendered |
merge_language_mismatch | 2 | LanguageMismatch { file1: eng, file2: yue } rendered |
merge_ambiguous_speaker | 2 | AmbiguousSpeaker { speaker: ... } rendered with hint to use –retain |
L3.3, chatter speaker-id, reference mode
| Test | Scenario | Assertion |
|---|---|---|
speaker_id_reference_auto_clean_winner | Reference + donor where margin >> 2.0 | Exit 0; output has expected renamed/dropped speakers |
speaker_id_reference_writes_override | With --write-override path.toml | File created; entry has mode = "auto", scores, margin, decided_at, operator |
speaker_id_reference_appends_to_existing_override | --write-override path.toml where file already has another session | New session added; existing session preserved |
speaker_id_reference_low_confidence_exits_4 | Margin < threshold | Exit 4; stderr contains per-speaker scores |
speaker_id_reference_anchor_missing_exits_2 | Reference has no anchor speaker utterances | Exit 2; typed error in stderr |
speaker_id_reference_threshold_override | --confidence-threshold 1.5 on a margin-1.7 case | Exit 0 (would have refused at default 2.0) |
speaker_id_reference_anchor_required | --reference without --anchor | Exit 2 (clap or our own); usage error |
L3.4, chatter speaker-id, explicit-mapping mode
| Test | Scenario | Assertion |
|---|---|---|
speaker_id_explicit_basic | --mapping "PAR0=drop,PAR1=INV:Investigator" | Exit 0; output renames PAR1→INV, drops PAR0 |
speaker_id_explicit_mapping_speaker_not_in_input | --mapping references PAR9 not in input | Exit 2; typed error |
speaker_id_explicit_speaker_missing_from_mapping | Input has PAR0+PAR1+PAR2; mapping only covers PAR0+PAR1 | Exit 2; typed error naming PAR2 |
speaker_id_explicit_with_note_records_in_override | --mapping + --write-override + --note "verified by listening" | TOML entry has note = "verified by listening" and mode = "explicit" |
L3.5, chatter speaker-id, override-file mode
| Test | Scenario | Assertion |
|---|---|---|
speaker_id_override_file_replay | Override file has entry for session-X | Reading override + applying produces same output as the original auto/explicit run |
speaker_id_override_file_missing_entry | Override file has no entry for the requested session | Exit 2; OverrideEntryMissing in stderr |
speaker_id_override_file_missing_file | --override-file path.toml where file doesn’t exist | Exit 1; NotFound in stderr |
speaker_id_override_file_wrong_schema_version | File has schema_version = 99 | Exit 1; UnsupportedSchemaVersion in stderr |
speaker_id_override_file_mutually_exclusive_modes | --reference AND --mapping both set | Exit 2 (clap or our own); only one operation mode allowed |
L3.6, Pipeline composition
These exercise chatter speaker-id → chatter merge composed
end-to-end through the file system, simulating the orchestrator
workflow.
| Test | Scenario | Assertion |
|---|---|---|
pipeline_speaker_id_then_merge | Run speaker-id on anonymous ASR file; run merge on the result + hand-coded file | Final merged file passes all merge invariants (retained byte-stable, etc.) |
pipeline_replay_via_override_file | Run once with auto; capture override file; delete intermediates; replay via --override-file; merge again | Final merged file is byte-identical to the original run (audit-trail-reproducibility property) |
pipeline_low_confidence_then_explicit | Run speaker-id; gets exit 4; capture scores from stderr; run again with --mapping matching what the operator would decide; record via --write-override; merge | All steps succeed; override file has mode = "explicit" with prior scores recorded |
L4, Scripted adjudication tests
Lives in crates/talkbank-transform/tests/adjudication_tests.rs.
Uses the Prompter trait and ScriptedPrompter documented in
Adjudication Workflow §The prompter abstraction.
Each test constructs a pending-adjudications input, scripts the
operator’s decisions, runs run_adjudication, and asserts on
the resulting override file plus the residual pending file.
L4.1, Speaker-id adjudication paths
| Test | Scripted decision | Assertion |
|---|---|---|
adjudicate_speaker_id_accepts_suggested | AcceptSuggested { note: None } for one pending entry | Override file entry has mode = "explicit", mapping matches suggested, pending file emptied |
adjudicate_speaker_id_override_mapping | OverrideMapping { mapping: { PAR0=rename, PAR1=drop }, note: Some("verified by listening") } (opposite of suggested) | Override file mapping matches operator’s choice; note recorded |
adjudicate_speaker_id_defer | Defer { reason: "need to listen to audio" } | Pending entry untouched; override file unchanged; tool exits 4 (deferred) |
adjudicate_speaker_id_block | Block { reason: "reference file missing bullets" } | Pending entry tagged as blocked; override file unchanged |
adjudicate_speaker_id_kind_mismatch_rejected | OverrideInsertedRole { ... } against a speaker-id-low-confidence entry | Returns Err(AdjudicationError::DecisionKindMismatch); nothing written |
L4.2, Parent-role-lookup adjudication paths
| Test | Scripted decision | Assertion |
|---|---|---|
adjudicate_parent_role_accepts_default_inv | AcceptSuggested | Override entry uses INV:Investigator (the safe default) |
adjudicate_parent_role_overrides_to_mother | OverrideInsertedRole { code: "MOT", tag: "Mother" } | Override entry uses MOT; note recorded |
adjudicate_parent_role_overrides_to_father | OverrideInsertedRole { code: "FAT", tag: "Father" } | Override entry uses FAT |
adjudicate_parent_role_invalid_code_rejected | OverrideInsertedRole { code: "", tag: "Mother" } | Returns Err; with --skip-on-error, logs and proceeds |
L4.3, Diarization-mix and sanity-scan paths
| Test | Scripted decision | Assertion |
|---|---|---|
adjudicate_diarization_mix_flag_only | Flag { flags: [DiarizationMixed], note: "PAR0 mixes clinician+parent" } | Existing override entry gets flag added; mapping unchanged |
adjudicate_sanity_scan_swap_mapping | OverrideMapping { ... } reversing original speaker-id | Override entry updated; mode = "explicit"; original mapping preserved in history |
adjudicate_sanity_scan_confirms_real_overlap | Flag { flags: [Custom("real-overlap-confirmed")] } | Override entry gets custom flag; mapping unchanged |
L4.4, Workflow plumbing
| Test | Scenario | Assertion |
|---|---|---|
adjudicate_empty_pending_file_noop | Pending file has empty entries array | Exit 0; nothing changes |
adjudicate_resumption_skips_decided_entries | Pending file has 3 entries; first 2 already decided in override; only 3rd has no override entry | Prompter is called exactly once, for the 3rd entry |
adjudicate_re_adjudicate_preserves_history | Existing override entry; --re-adjudicate with new decision | New decision saved; prior decision preserved in history array |
adjudicate_kind_filter_processes_only_matching | Pending file has mixed kinds; --kind parent-role-lookup flag set | Prompter only called for parent-role-lookup entries; other kinds untouched |
adjudicate_dry_run_writes_nothing | Any pending input + any decision; --dry-run set | Override file unchanged; pending file unchanged |
adjudicate_scripted_mode_unknown_session_aborts | Scripted decisions reference session-X but pending has only session-Y | Returns Err(AdjudicationError::ScriptedDecisionWithoutPendingEntry); tool exits 2 |
adjudicate_scripted_mode_extra_pending_aborts | Pending has session-X and session-Y; scripted decisions cover only session-X | Returns Err(AdjudicationError::PendingEntryWithoutScriptedDecision); tool exits 2 |
adjudicate_mutually_exclusive_modes | --interactive + --scripted both set | Returns Err; tool exits 2 (clap or our own validator) |
L4.5, Prompter contract conformance
These tests pin the contract that any Prompter impl must
satisfy, so future UI backends (VS Code, web) can be developed
against the same invariants.
| Test | Scenario | Assertion |
|---|---|---|
prompter_terminal_round_trip_decision | TerminalPrompter reading a scripted stdin | Returns the expected OperatorDecision parsed from the operator’s typed input |
prompter_scripted_returns_decisions_in_order | ScriptedPrompter::from_decisions([d1, d2, d3]) | Three consecutive ask() calls return d1, d2, d3 in order |
prompter_scripted_panics_on_unscripted_session | ScriptedPrompter has decisions for session A; tool asks for session B | ask() returns Err(PrompterError::NoDecisionFor(SessionId)) |
prompter_scripted_toml_round_trips | Write a scripted-decisions TOML, read with ScriptedTomlPrompter, run | Same OperatorDecision sequence as a ScriptedPrompter::from_decisions with equivalent contents |
Fixture catalog
These are the synthetic CHAT pairs that the tests above consume. Each is small (≤20 utterances), exercises a precise invariant, and is fully fictional (no real corpus content).
The fixtures live as inline const FIX_*: &str blocks in the
respective test modules, following the precedent in
chatter/tests/integration_tests.rs (which has
const VALID_CHAT: &str = r#"..."# etc.).
FIX_REF_TWO_UTT_NO_MARKUP
The smallest possible valid CHAT pair input. Two *CHI:
utterances, no markup beyond a simple terminator, time bullets
on both. Used by cycle 1’s smoke test where the impl must
work without yet handling any markup edge cases.
FIX_ASR_LABELED_TWO_UTT
The matching donor for FIX_REF_TWO_UTT_NO_MARKUP: two
*INV: utterances at different time positions. Used by
cycle 1.
FIX_REF_CHILD_ONLY_SIMPLE
A 6-utterance child-only hand transcript with rich CHAT markup (error code, retracing, filled pause, special-form letter, zero realization with paralinguistic). Used by every L2/L3 merge test from cycle 2 onward as the canonical “File 1”, the reference / authoritative file. Has time bullets on every utterance.
FIX_ASR_ANON_2SPEAKER_SIMPLE
The matching ASR-output file with anonymous PAR0 (clinician,
asks questions) and PAR1 (child, says what FIX_REF_* shows
plus some extra). Has %wor on every utterance. Used by every
speaker-id test where auto-mode is expected to succeed cleanly
(margin >> 2.0).
FIX_ASR_LABELED_INV_SIMPLE
FIX_ASR_ANON_2SPEAKER_SIMPLE after speaker-id has run with
PAR1→drop, PAR0→INV:Investigator. Used by merge tests where
we want to skip the speaker-id step and test merge alone.
FIX_ASR_BORDERLINE_VOCABULARY
ASR file where both speakers describe the same picture-book content (margin 1.6-1.9 against reference). Used by low-confidence tests.
FIX_REF_NO_BULLETS
A reference file with no time bullets at all. Used to test
NoTimelineInFile1 precondition.
FIX_REF_LANG_ENG / FIX_ASR_LANG_YUE
Two files with conflicting @Languages. Used to test
LanguageMismatch.
FIX_AMBIGUOUS_INV
Two files both containing *INV: utterances, with
--retain CHI (INV not in retain set). Used to test
AmbiguousSpeaker.
FIX_REF_MULTI_RETAIN
Reference file containing *CHI: and *SI2: utterances (sibling
target). Used to test --retain CHI,SI2.
FIX_ASR_NO_MAIN_BULLET
Donor file where some utterances have no main-tier bullet, only
%wor. Used to test bullet-lift behavior in normalization.
FIX_OVERRIDE_VALID / FIX_OVERRIDE_WRONG_SCHEMA / FIX_OVERRIDE_MALFORMED
Override files in valid, schema-rejected, and parse-rejected shapes. Used by override-file I/O tests.
FIX_PENDING_SPEAKER_ID / FIX_PENDING_PARENT_ROLE / FIX_PENDING_MIXED_KINDS
Pending-adjudications files exercising one kind, another kind, and a mix. Used by L4 adjudication tests.
FIX_SCRIPTED_ACCEPT_ALL / FIX_SCRIPTED_OVERRIDE_FIRST_DEFER_SECOND
Scripted-decisions TOML files for ScriptedTomlPrompter.
Cover the canonical accept-suggested case and a mixed
override+defer case.
The exact bytes of each fixture are pinned in their respective test modules when the implementation lands; this plan doesn’t freeze them yet, only their purpose. Drafting the actual bytes is the first step of impl-phase work.
Coverage matrix
Cross-checking that every behavioral invariant from the four design docs has at least one test:
| Invariant source | Invariant | First-failing layer | Test name |
|---|---|---|---|
| merge user-guide | Retained byte-stable | L3 → L2 | merge_basic_clinician_pattern + merge_retained_speakers_byte_stable |
| merge user-guide | Derived tiers stripped | L3 → L2 | merge_strip_tiers_custom + merge_strips_default_derived_tiers |
| merge user-guide | Order by start_ms | L2 | merge_utterance_order_by_start_time |
| merge user-guide | Tiebreak File1 first | L2 | merge_stable_tiebreak_file1_first |
| merge user-guide | Bullets pass-through | L2 | merge_bullets_pass_through |
| merge user-guide | Bullet lift from %wor | L2 | merge_bullet_lift_from_wor |
| merge user-guide | Header reconciliation (all rows) | L2 | merge_header_* series |
| merge user-guide + memory | No overlap markers injected | L2 | merge_no_overlap_markers_injected + merge_preserves_existing_overlap_markers |
| merge user-guide | Each precondition → exit 2 | L3 | merge_*_exits_2 series in L3.2 |
| merge user-guide | Warns on bullet drift | L2 | merge_warns_on_backward_bullet_drift |
| speaker-id user-guide | Reference mode auto | L3 | speaker_id_reference_auto_clean_winner |
| speaker-id user-guide | Explicit mode | L3 | speaker_id_explicit_basic |
| speaker-id user-guide | Override-file mode | L3 | speaker_id_override_file_replay |
| speaker-id user-guide | Confidence threshold (exit 4) | L3 → L2 | speaker_id_reference_low_confidence_exits_4 + identify_mapping_borderline_refuses |
| speaker-id user-guide | Byte-stable except prefix | L2 | apply_mapping_byte_stable_except_prefix |
| speaker-id user-guide | Header rewrites | L2 + L1 | apply_mapping_rewrites_* + participants-rewrite-* specs |
| speaker-id user-guide | Provenance captured | L3 | speaker_id_reference_writes_override |
| speaker-id user-guide | Each precondition → typed error | L3 → L2 | various *_exits_2 and apply_mapping_* tests |
| speaker-id user-guide | Token cleaner spec | L1 | clean-* specs |
| speaker-id user-guide | Multiset Jaccard formula | L1 | jaccard-* specs |
| override-file ref | Schema-version refusal | L2 | override_file_refuses_* tests |
| override-file ref | Round-trip fidelity | L2 | override_file_round_trip |
| override-file ref | Deterministic serialization | L2 | override_file_deterministic_serialization |
| override-file ref | Atomic write | L2 | override_file_atomic_write |
| override-file ref | margin "unbounded" form | L2 | override_file_preserves_margin_unbounded |
| domain types | JaccardScore range | L2 | jaccard_score_new_in_range |
| domain types | ConfidenceThreshold ≥ 1 | L2 | confidence_threshold_* |
| domain types | Margin semantics | L2 | margin_* |
| domain types | RetainSet::from_str | L2 | retain_set_parse |
| domain types | InsertedRole::from_str | L2 | inserted_role_parse |
| domain types | parse_mapping_spec | L2 | mapping_spec_parse_* |
| domain types | MergeFlag serde | L2 | merge_flag_serde_* |
| domain types | Pipeline reproducibility | L3 | pipeline_replay_via_override_file |
Every invariant has at least one named test; many have multiple across layers. When the impl phase begins, the first commit should produce the fixtures, the second commit the highest-layer failing test for the simplest invariant, then drill down per the standard TDD progression.
What this plan does NOT cover
- Performance / scaling tests. Until the pipeline shows up on a measured workload, no targeted perf assertions. The reference corpus’s existing round-trip benchmarks remain the baseline.
- Fuzz testing. This repository now has a local
fuzz/workspace for parser/validation fuzzing. If the merge crate stabilizes enough to justify dedicated fuzzing, adding a merge-specific target for random parseable CHAT-pair inputs is a follow-up, not a v1 blocker. - Cross-platform CI checks. Windows / Linux / macOS each build the workspace; the merge module rides the existing CI. No platform-specific tests needed (the merge operates on parsed AST and writes UTF-8; no path-or-line-ending quirks).
- Real-corpus regression sweeps. Once impl lands, running
chatter mergeover a curated subset of the reference corpus and snapshotting outputs is a smart follow-up. Lives in a separatetests/golden/style mechanism if added; not designed here.
TDD authoring sequence
Each numbered item is one full RED → GREEN → REFACTOR cycle. Cycles must run in order; do not start cycle N+1 until cycle N is green and committed. Numbers are designed so the first working pipeline (cycle 8) emerges from the absolute minimum set of types + algorithms, then each later cycle extends.
The starter test for cycle 1 is intentionally tiny: a 2-utterance fixture pair with no markup, one retain speaker. The smoke test exercises every layer (parser, transform, CLI) but with the simplest possible CHAT bytes, so the first impl is small enough to land in one cycle.
Phase A, minimal end-to-end pipeline (cycles 1-8)
These cycles produce the simplest possible chatter merge
working end-to-end with synthetic fixtures.
| # | RED (failing test) | GREEN (smallest impl that passes) |
|---|---|---|
| 1 | merge_basic_smoke, L3 subprocess test against the tiniest fixture pair (FIX_REF_TWO_UTT_NO_MARKUP + FIX_ASR_LABELED_TWO_UTT), retain={CHI}, asserts exit 0 and “merged file exists” | Stub chatter merge subcommand wiring; introduce minimal talkbank-transform::transcript_merge::merge that interleaves utterances by start_ms and emits parser→serializer round-trip. No tier-stripping, no header-reconcile, no validation. Just: parse, sort, serialize. |
| 2 | merge_retained_speakers_byte_stable, L2 over the smoke fixture, asserts every CHI block byte-identical | Implement byte-stable handling for retained utterances (preserve main_raw_lines + dependent tiers exactly). |
| 3 | merge_strips_default_derived_tiers, L2 against a fixture where the donor has %wor rows | Implement tier_strip per the per-tier policy; drop %wor/%mor/%gra/%pho from inserted-speaker utts. |
| 4 | merge_utterance_order_by_start_time, L2 with a fixture where File 1 and File 2 utterances interleave | Implement timeline sort key (start_ms primary; source-order tiebreak). |
| 5 | merge_header_participants_concatenates, L2 | Implement header_reconcile::participants_merge. |
| 6 | merge_header_id_concatenates, L2 | Extend header_reconcile for @ID rows. |
| 7 | merge_header_languages_passthrough + merge_header_media_file1_wins + merge_header_comments_concatenate, L2 | Extend header_reconcile for remaining headers per the contract table. |
| 8 | merge_preconditions_retain_missing + merge_preconditions_no_timeline + merge_preconditions_language_mismatch + merge_preconditions_ambiguous_speaker, L3, each asserting exit code 2 with a specific stderr message | Implement preconditions module + map MergeError to exit codes in the CLI. |
Phase A, actual cycle log
The four-precondition cycle 8 was deliberately split into four
single-variant cycles (9a / 9b / 9c / 9d) so each MergeError
variant lands with its own RED→GREEN cycle and L2 + L3 sibling
tests. The numbering here is therefore finer-grained than the
plan table above; the table records the shape of Phase A, the
log records what was actually committed.
| # | Test(s) | Layer | Status |
|---|---|---|---|
| 1 | merge_basic_smoke | L3 | done |
| 2 | merge_retained_speakers_byte_stable | L2 | done |
| 3 | merge_strips_default_derived_tiers | L2 | done |
| 4 | merge_strip_tiers_configurable | L2 | done |
| 5 | merge_strip_tiers_empty_preserves_all | L2 | done |
| 6 | merge_header_participants_concatenates | L2 | done |
| 7 | merge_header_id_concatenates | L2 | done |
| 8a | merge_header_comments_concatenate | L2 | done |
| 8b | merge_header_languages_passthrough + merge_header_media_file1_wins | L2 | done |
| 9a | merge_no_retain_speakers_in_file1 + _returns_err | L3 + L2 | done (L2 sibling backfilled in 9c) |
| 9b | merge_no_timeline_in_file1 + _returns_err | L3 + L2 | done |
| 9c | merge_language_mismatch + _returns_err | L3 + L2 | done |
| 9d | merge_ambiguous_speaker + _returns_err | L3 + L2 | done |
End of Phase A: chatter merge works on simple fixtures with
all four preconditions (retain / timeline / language / ambiguous
speaker) enforced. The pipeline is publishable as v0.
Phase B, actual cycle log
Phase B picks up at cycle 10 in the cycle log (Phase A used 9a-9d for the precondition split).
| # | Test(s) | Layer | Status |
|---|---|---|---|
| 10 | speaker_id_explicit_basic | L3 | done |
| 11 | apply_mapping_byte_stable_except_prefix + apply_mapping_rewrites_participants + apply_mapping_rewrites_id | L2 | done (regression-guards) |
| 12 | identify_mapping_clean_winner | L2 | done |
| 13 | identify_mapping_borderline_refuses | L2 | done |
| 14 | speaker_id_reference_low_confidence_exits_4 | L3 | done |
| 15 | speaker_id_reference_writes_override (+ OverrideFile data model) | L3 | done |
| 16 | speaker_id_override_file_replay (+ OverrideFile::get) | L3 | done |
| 17 | adjudicate_speaker_id_accepts_suggested (+ adjudication core) | L4 | done |
| 18 | adjudicate_scripted_accepts_suggested (+ chatter adjudicate CLI + scripted-TOML I/O) | L3 | done |
| 19 | speaker_id_reference_writes_pending_on_low_confidence (+ --write-pending flag + LowConfidence carries DonorMatchReport) | L3 | done |
| 20 | adjudicate_speaker_id_override_mapping (+ OperatorDecision::OverrideMapping variant + scripted-TOML override-mapping shape) | L4 | done |
| 21 | adjudicate_interactive_accepts_suggested (+ TerminalPrompter + --interactive flag) | L3 | done |
| 22 | adjudicate_parent_role_lookup_chooses_role (+ PendingKindData promotion + ParentRoleLookup kind + ChooseRole decision) | L4 | done |
| 23 | adjudicate_interactive_chooses_role (+ parse_operator_response + kind-aware prompt hint) | L3 | done |
| 24 | adjudicate_interactive_override_mapping (+ parse_override_mapping + parse_speaker_assignment) | L3 | done |
| 25 | pipeline_clean_winner_end_to_end (+ chatter pipeline subcommand) | L3 | done |
| 26 | batch_pass1_single_session (+ chatter batch subcommand, subprocess driver) | L3 | done |
| 27 | batch_mixed_outcomes (regression-guard: clean+borderline aggregation) | L3 | done |
| 28 | batch_pass2_replay (+ --override-file on pipeline + batch; per-session auto-detection) | L3 | done |
| 29 | batch_skip_existing (+ --skip-existing flag on batch for idempotent re-runs) | L3 | done |
| 30 | refactor, PipelineArgs + BatchArgs structs retire three #[allow(clippy::too_many_arguments)] markers | , | done (true-no-op refactor; covered by cycles 25-29 regression suite) |
| 31 | refactor, split commands/speaker_id.rs (472 lines) into speaker_id/{mod,modes,writes,support}.rs (158 + 196 + 103 + 86 lines); retire 4 stale #[allow(dead_code)] markers on ReferenceModeOutcome (fields are read by write_override_entry) | , | done (true-no-op refactor; covered by cycles 10-29 regression suite) |
| 32 | adjudicate_sanity_scan_accept_suggested (+ AdjudicationKind::SanityScanMisclassification variant, PendingKindData::SanityScanMisclassification { suggested, reason } variant, two apply-decision arms mirroring SpeakerIdLowConfidence, terminal prompter render + prompt-hint arm) | L4 | done, adjudication kind end-to-end; the post-merge scan detector itself (heuristic + auto-pending-write) is a separate cycle 33 |
| 33 | sanity_scan_flags_inverted_mlu (+ talkbank_transform::sanity_scan::scan_session + chatter sanity-scan subcommand; mean-utterance-word-count asymmetry heuristic, default 1.5×, binary-mapping only) | L3 | done, detector + CLI end-to-end; multi-rename support, batch integration, and alternative heuristics deferred |
| 34 | batch_writes_override_for_auto_decisions (+ --write-override on both chatter pipeline and chatter batch; threaded through PipelineArgs.write_override_path + BatchArgs.write_override_path; reference-mode auto-decisions audit-trailed for sanity-scan + future re-runs) | L3 | done |
| 35 | batch_with_sanity_scan_flag_flags_inverted_mlu (+ --sanity-scan + --sanity-scan-threshold on chatter batch; post-loop subprocess driver for chatter sanity-scan; precondition validation requiring --write-override + --write-pending) | L3 | done |
| 36 | refactor, split cli/args/core.rs (984 → 747 lines): extract DebugCommands → debug_commands.rs, CacheCommands → cache_commands.rs, config enums (LogFormat, TuiMode, OutputFormat, ParserBackend, AlignmentTier) → cli_types.rs, unit-test module → core_tests.rs (via #[path]); satisfies the 800-line hard limit | , | done (true-no-op refactor; covered by full regression suite + 110 bin/integration tests) |
| 37+ | sanity-scan multi-rename support; diarization-mix-review kind (operator workflow design needed); newtype threading at struct seams (deferred simplify finding); apply_decision arm dedup + per-kind OperatorDecision sub-enums | L3 + L4 | pending |
Phase B, speaker-id pipeline (cycles 9-16)
These cycles add chatter speaker-id and its three modes.
| # | RED | GREEN |
|---|---|---|
| 9 | speaker_id_explicit_basic, L3 against an anonymous-2-speaker donor with --mapping "PAR0=drop,PAR1=INV:Investigator", asserts output has only INV utts | Stub chatter speaker-id subcommand. Implement parse_mapping_spec + apply_mapping. Reference mode and override-file mode return unimplemented!() for now. |
| 10 | apply_mapping_byte_stable_except_prefix + apply_mapping_rewrites_participants + apply_mapping_rewrites_id, L2 | Tighten apply_mapping per header rewrite rules. |
| 11 | identify_mapping_clean_winner, L2 with a fixture where one donor speaker overwhelmingly matches the reference | Implement text_cleaner + jaccard modules. Implement identify_mapping using them. Reference mode in CLI now works. |
| 12 | identify_mapping_borderline_refuses, L2 with a borderline fixture | Add ConfidenceThreshold check + LowConfidence error path. |
| 13 | speaker_id_reference_low_confidence_exits_4, L3 against borderline fixture | Map LowConfidence to exit code 4 in the CLI; print scores to stderr. |
| 14 | speaker_id_reference_writes_override, L3 with --write-override | Implement OverrideFile::read_or_default + OverrideFile::write. |
| 15 | speaker_id_override_file_replay, L3 with --override-file + --session-id | Implement override-file mode in CLI (OverrideFile::get + apply). |
| 16 | Token-cleaner L1 specs (a handful of representative clean-* specs from L1.1) + current spec/tools generators | Move the regex-and-string cleaner into a spec-test-covered implementation. Specs become the regression net. |
End of Phase B: full chatter speaker-id + chatter merge
pipeline works auto + explicit + override modes.
Phase C, adjudication (cycles 17-22)
These cycles add the chatter adjudicate tool and its
prompter-injection testability.
| # | RED | GREEN |
|---|---|---|
| 17 | adjudicate_empty_pending_file_noop, L4 against an empty pending file, asserts exit 0 + no changes | Stub chatter adjudicate subcommand. Implement PendingAdjudications::read + run_adjudication core skeleton with a no-op Prompter trait. |
| 18 | prompter_scripted_returns_decisions_in_order, L4 | Implement ScriptedPrompter::from_decisions (in-memory) per the Prompter trait. |
| 19 | adjudicate_speaker_id_accepts_suggested, L4 against FIX_PENDING_SPEAKER_ID with one AcceptSuggested decision | Implement apply_decision for the speaker-id-low-confidence kind. Override file now gets the decision; pending entry removed. |
| 20 | adjudicate_speaker_id_override_mapping, L4 with OverrideMapping decision | Extend apply_decision for the override-mapping variant. |
| 21 | adjudicate_speaker_id_kind_mismatch_rejected, L4 with a OverrideInsertedRole against a speaker-id pending entry | Implement kind→variants validation in apply_decision. |
| 22 | adjudicate_scripted_mode_unknown_session_aborts + adjudicate_scripted_mode_extra_pending_aborts, L4 | Tighten scripted-mode validation; assert 1:1 mapping between pending entries and scripted decisions. |
End of Phase C: scripted adjudication tested end-to-end with synthetic operator inputs. Interactive terminal UX still unimplemented (next phase).
Phase D, interactive UX (cycles 23-25)
| # | RED | GREEN |
|---|---|---|
| 23 | prompter_terminal_round_trip_decision, L4 with mocked stdin/stdout | Implement TerminalPrompter parsing [a]/[o]/[f]/... keys + optional follow-up prompts. |
| 24 | adjudicate_resumption_skips_decided_entries, L4 with a partially-decided override file + full pending list | Implement skip-already-decided logic in run_adjudication. |
| 25 | Manual smoke test (NOT automated), run chatter adjudicate --interactive against the test fixtures; visually confirm the operator UX matches the doc’s mock-up | Polish terminal output: ANSI formatting, fixed-width alignment, the [m] Show more context action, the [p] Play media action. |
End of Phase D: full v1 pipeline complete.
Phase E, non-speaker-id adjudication kinds (cycles 26-29)
Each adjudication kind gets its own RED→GREEN cycle.
| # | RED | GREEN |
|---|---|---|
| 26 | adjudicate_parent_role_overrides_to_mother + adjudicate_parent_role_overrides_to_father, L4 | Implement parent-role-lookup kind end-to-end (pending schema, prompter context, decision application). |
| 27 | adjudicate_diarization_mix_flag_only, L4 | Implement diarization-mix-review kind end-to-end. |
| 28 | adjudicate_sanity_scan_swap_mapping, L4 | Implement sanity-scan-misclassification kind end-to-end. |
| 29 | adjudicate_re_adjudicate_preserves_history, L4 | Implement --re-adjudicate flag; add history field to MergeOverride. |
Phase F, breadth pass (cycles 30+)
Fill in every remaining test from L1-L4 that hasn’t been written yet. These are coverage-deepening tests, not behavior adders. The impl from Phases A-E should pass them with at most minor refactoring; if a test fails meaningfully, that’s a gap in the impl that this cycle closes.
The breadth pass is the only phase where multiple cycles can proceed in parallel (different contributors take different test groups). Phases A-E are strictly serial.
Hard rules during impl phase
- No test stubs. Every test in this plan, when written,
must FAIL before its impl exists and PASS after. Skipped or
#[ignore]-marked tests are not allowed in the regression net (use#[ignore]only for genuinely slow or environment-dependent tests, not for “not implemented yet”). - No test deletion to make CI green. If a test that was passing starts failing after a refactor, the refactor is wrong. Investigate; do not delete the test.
- Three cycle archetypes, distinguish them. A cycle is one
of:
- bug-fix: RED motivates new impl code (cycle N-1’s impl truly cannot satisfy the new test).
- regression-guard: RED pins an invariant the impl
inherits from upstream infrastructure (e.g. parse→serialize
byte-stability inherited from
talkbank-parser). The test passes against cycle N-1’s impl, but the cycle is valuable because it locks in the invariant against future “optimizations” that might break it. Verbose-output the actual behavior on first run to confirm the invariant holds for the right reasons, not by accident. - true no-op: RED tests something already pinned elsewhere. These ARE unnecessary; drop the cycle or sharpen the test. The difference between regression-guard and true no-op is whether the invariant is named explicitly anywhere else. If yes (e.g., the parser crate already has a roundtrip test that covers it), the cycle is true-no-op. If no, the cycle is a regression-guard and worth keeping.