How Would You Know If It Worked?

Recall(From the intro post)

WordBridge makes four main empirical claims — H1 through H4 — about detection accuracy, latency, contextual anchoring, and false positive rates. This post is about what it would actually take to test those claims, what baselines make sense, and one measurement gap the field doesn't have a standard answer to yet.

The most common question a grant reviewer asks isn't "is this interesting?" It's "how would you know if it worked?" This post is the answer to that question, written down before any implementation starts — because the evaluation design shapes what the prototype needs to produce, not the other way around.

The four hypotheses and what each one needs

The proposal's hypotheses aren't independent: they form a dependency chain. H1 and H2 are about whether the detection mechanism works at all. H3 is about whether the contextual layer adds value on top of that. H4 is about whether the safety layer is well-calibrated. If detection fails H1 or H2, H3 and H4 have nothing reliable to be tested on.

Definition(H1 — Detection accuracy)

Passive ambient circumlocution detection is non-inferior (within 5 percentage points) to explicit-query baselines on Top-1 and Top-3 lexical retrieval accuracy.

What non-inferiority means here: the baseline can ask directly ("what word are you looking for?") and can wait for the full utterance. WordBridge can't do either. The claim isn't that it beats explicit query — it's that the passive approach doesn't lose too much ground relative to active help.
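One way to make the 5-point margin operational is to compare the lower confidence bound of the accuracy difference against the margin, rather than the point estimate alone. The sketch below is an assumption, not the proposal's stated analysis plan: the function name, the one-sided 95% bound (z = 1.645), and the normal approximation are all illustrative, and it treats the two accuracy estimates as independent samples, which is a simplification when both systems are scored on the same items.

```python
import math

def non_inferior(passive_correct, passive_n, baseline_correct, baseline_n,
                 margin=0.05, z=1.645):
    """Return (difference, lower_bound, passes) for passive minus baseline accuracy.

    Hypothetical helper: uses a normal-approximation (Wald) interval and
    assumes the two samples are independent.
    """
    p_passive = passive_correct / passive_n
    p_base = baseline_correct / baseline_n
    diff = p_passive - p_base
    # Standard error of the difference between two proportions
    se = math.sqrt(p_passive * (1 - p_passive) / passive_n
                   + p_base * (1 - p_base) / baseline_n)
    lower = diff - z * se
    # Non-inferior only if even the pessimistic bound stays above -margin
    return diff, lower, lower > -margin

# Made-up counts: the same 2-point gap passes or fails depending on sample size
print(non_inferior(790, 1000, 800, 1000)[2])
print(non_inferior(78, 100, 80, 100)[2])
```

The design point is that a small test set can show a gap inside the margin and still fail to support the claim, because the uncertainty around that gap is wider than the margin itself.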

Definition(H2 — Latency)

The system surfaces the correct word candidate within the patient's utterance — before or during the circumlocution, not after it ends — at windows of 500ms, 1s, and 2s from circumlocution onset.

Why the window structure matters: 500ms is the target for audio delivery during active speech. 1s is a softer target where the candidate arrives near the end of the circumlocution. 2s is the outer bound: the candidate arrives after the full utterance ends, but before the patient falls silent or switches strategies. Anything beyond 2s has no clinical utility.

Definition(H3 — Contextual anchors)

Contextual anchor delivery reduces time-to-topic-recovery in simulated thread-loss scenarios, as rated by blinded speech-language pathologist evaluators.

The measurement challenge: "recovery" is inherently a judgment call. The SLP raters don't know whether an anchor fired; they rate whether the conversation recovered, and how fast. This blinding is load-bearing: a rater who knows an anchor fired could unconsciously credit it for the recovery.

Definition(H4 — Safety tier calibration)

The physiological-state-aware tiered intervention produces a lower false-positive intervention rate than a state-agnostic baseline — specifically comparing context-reasoned tier transitions (Background Model) against raw HR/GSR threshold triggers.

The nested comparison: as the safety tiers post argued, this needs two sub-comparisons. The first is tiered vs. no gating at all; the second is context-reasoned gating vs. threshold-based gating. A positive result on the first but not the second means the Background Model's reasoning isn't adding value over a simpler sensor threshold.

Primary metrics

Lexical retrieval accuracy (H1)

The core metric is Top-N accuracy: whether the correct target word appears among the model's top 1 or top 3 candidates, scored against the ground-truth target word from the annotated transcript.

Example(Concrete scoring)

Transcript: "the thing for keeping food cold, the big white one in the kitchen" (target: refrigerator)

Model returns [refrigerator, freezer, cabinet] → Top-1 correct, Top-3 correct
Model returns [freezer, refrigerator, pantry] → Top-1 wrong, Top-3 correct
Model returns [dishwasher, cabinet, counter] → Top-1 wrong, Top-3 wrong
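The scoring rule above is small enough to write down directly. This is a minimal sketch, assuming each prediction is a ranked candidate list paired with one ground-truth target; the function names are illustrative and exact string matching stands in for whatever normalization the real annotation scheme uses.

```python
def top_n_correct(candidates, target, n):
    """True if the target appears among the first n ranked candidates."""
    return target in candidates[:n]

def top_n_accuracy(predictions, n):
    """predictions: list of (ranked_candidates, target) pairs."""
    hits = sum(top_n_correct(cands, target, n) for cands, target in predictions)
    return hits / len(predictions)

# The three example outputs from the scoring example, all with target "refrigerator"
predictions = [
    (["refrigerator", "freezer", "cabinet"], "refrigerator"),
    (["freezer", "refrigerator", "pantry"], "refrigerator"),
    (["dishwasher", "cabinet", "counter"], "refrigerator"),
]

print(top_n_accuracy(predictions, 1))  # 1 of 3 correct at Top-1
print(top_n_accuracy(predictions, 3))  # 2 of 3 correct at Top-3
```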

Secondary breakdown: semantic proximity for cases that miss Top-3. If the model returns freezer when the target is refrigerator, that's a qualitatively different failure from returning dishwasher. Reporting the embedding cosine similarity between the highest-ranked wrong candidate and the target gives a richer picture of where the model is failing.
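As a rough illustration of the proximity breakdown, the sketch below scores a near miss and a far miss against the same target. The three-dimensional vectors are made up for the example and are not real embeddings; in practice they would come from whichever embedding model the evaluation settles on.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, chosen only to make the contrast visible
toy_embeddings = {
    "refrigerator": [0.9, 0.8, 0.1],
    "freezer": [0.8, 0.9, 0.2],
    "dishwasher": [0.2, 0.3, 0.9],
}

target = toy_embeddings["refrigerator"]
near_miss = cosine_similarity(toy_embeddings["freezer"], target)
far_miss = cosine_similarity(toy_embeddings["dishwasher"], target)
print(near_miss > far_miss)  # the near miss scores closer to the target
```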

Latency to first correct candidate (H2)

Measured in milliseconds from circumlocution onset (defined as the first hesitation marker or paraphrase segment, per annotator labels) to the timestamp when the top-ranked candidate first matches the target word.

The 500ms / 1s / 2s windows are hard cutoffs — a candidate arriving at 2.1s counts as a miss for the 2s window. Report the cumulative distribution of correct-candidate latency across the test set, not just mean latency.

Intuition(Why the distribution matters more than the mean)

A system that reaches the correct candidate in 400ms for 60% of samples and 3,000ms for the remaining 40% has a mean latency of 1,440ms, which looks acceptable against the 2s cutoff. The distribution shows that it misses every window on the slower 40%, and those harder cases are likely the ones that matter most clinically.
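The 60/40 example can be checked numerically. This is a minimal sketch, assuming latencies are recorded per sample in milliseconds, with `None` for samples where the correct candidate never surfaced; the function name is illustrative.

```python
from statistics import mean

WINDOWS_MS = (500, 1000, 2000)

def window_hit_rates(latencies_ms, windows_ms=WINDOWS_MS):
    """Fraction of samples whose first correct candidate arrived within each window.

    Windows are hard cutoffs: a latency equal to the window counts, anything
    later (or None, meaning never correct) is a miss.
    """
    total = len(latencies_ms)
    return {
        w: sum(1 for t in latencies_ms if t is not None and t <= w) / total
        for w in windows_ms
    }

# 60% of samples at 400ms, 40% at 3,000ms
latencies = [400] * 60 + [3000] * 40

print(mean(latencies))              # 1440
print(window_hit_rates(latencies))  # {500: 0.6, 1000: 0.6, 2000: 0.6}
```

The hit rate is flat across all three windows, which is the signal the mean hides: widening the window does not recover any of the slow samples.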

Contextual anchor quality (H3)

Two measures, both requiring human evaluation:

Relevance rating — SLP raters score each anchor on a 1–5 scale: does this anchor contain information that would actually help reorient someone who has lost the thread of this conversation?

Recovery delta — time-to-recovery with vs. without anchor delivery, measured in simulated replay scenarios (same conversation, anchor delivered vs. suppressed). This requires a paired experimental design.
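The paired design means each scenario contributes one difference, not two unrelated measurements. A minimal sketch of that bookkeeping follows; the timings are invented for illustration and the function name is hypothetical.

```python
from statistics import mean

def recovery_deltas(paired_times):
    """paired_times: list of (seconds_without_anchor, seconds_with_anchor),
    one pair per replayed scenario. A positive delta means the anchor helped."""
    return [without - with_anchor for without, with_anchor in paired_times]

# Hypothetical rated recovery times for three replayed conversations
paired = [(12.0, 8.0), (9.5, 9.0), (15.0, 10.5)]

deltas = recovery_deltas(paired)
print(deltas)        # [4.0, 0.5, 4.5]
print(mean(deltas))  # 3.0
```

Keeping the per-scenario deltas, rather than only the two group means, is what lets a paired test be run on them later.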

False positive intervention rate (H4)

A false positive is defined as: the system escalates to Tier 2 or Tier 3, or delivers a word candidate, during a segment where a blinded annotator has marked no circumlocution or distress event.

Report separately:

Raw threshold trigger FPR (what you get with HR/GSR alone)
Context-reasoned FPR (Background Model decision)
False negative rate (missed real events) for both — the tradeoff between FPR and FNR is the core calibration question
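The rates listed above can be computed from the same annotated segments for each gating variant. The sketch below assumes each segment reduces to two booleans (annotator marked an event, system intervened); the segment data and function name are hypothetical.

```python
def error_rates(segments):
    """segments: list of (annotated_event, system_intervened) booleans.

    Returns (false_positive_rate, false_negative_rate).
    """
    fp = sum(1 for event, fired in segments if fired and not event)
    fn = sum(1 for event, fired in segments if event and not fired)
    negatives = sum(1 for event, _ in segments if not event)
    positives = sum(1 for event, _ in segments if event)
    fpr = fp / negatives if negatives else 0.0
    fnr = fn / positives if positives else 0.0
    return fpr, fnr

# Same six annotated segments, scored under two hypothetical gating variants
labels = [True, True, False, False, False, False]
threshold_fired = [True, True, True, True, False, False]
context_fired = [True, False, False, True, False, False]

print(error_rates(list(zip(labels, threshold_fired))))  # (0.5, 0.0)
print(error_rates(list(zip(labels, context_fired))))    # (0.25, 0.5)
```

In this made-up case the context-reasoned variant halves the false positive rate but misses one of the two real events, which is exactly the tradeoff that reporting FPR alone would hide.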

Baselines

Important(Baselines need to be fair comparisons, not strawmen)

Every baseline should have access to the same audio. The only thing that varies is the detection mechanism and timing. A baseline that gets less audio than WordBridge, or has to do something unrealistic to query it, proves nothing.

Baseline	What it tests	What it gets
Explicit-query GPT-4o	H1 upper bound — what's possible when the model can ask and wait	Full utterance, can prompt for clarification
Whisper + LLM turn-based pipeline	H2 latency floor — the status quo	Transcription of full utterance, then LLM query
Semantic similarity retrieval	H1 with no LLM reasoning — embedding-based nearest neighbor over vocabulary	Full utterance transcript
State-agnostic WordBridge	H4 calibration — full system without physiological gating	Same audio, same model, no tier transitions
Threshold-based tier gating	H4 nested comparison — raw HR/GSR thresholds with no Background Model reasoning	Same physiological signals, hard threshold rules

The explicit-query GPT-4o baseline sets the ceiling for H1: if WordBridge's passive accuracy is more than 5 percentage points below this, the non-inferiority claim fails. The Whisper + LLM pipeline sets the latency floor: if WordBridge doesn't beat it at the 500ms window, the temporal advantage argument (H2) collapses.

Why there's no standard benchmark for this

AphasiaBank provides transcripts and audio for anomic aphasia discourse. What it doesn't provide — what doesn't exist anywhere — is timestamped circumlocution onset labels at the resolution WordBridge's H2 requires.

Warning(The benchmark gap is a contribution, not a problem)

No existing benchmark evaluates real-time passive ambient circumlocution detection at sub-second latency. The closest is the EMNLP 2024 target-word identification work (arXiv:2506.14203), which identifies target words from completed offline transcripts — a fundamentally different task from detecting circumlocution onset within an utterance in progress.

The annotation work required to evaluate H2, labeling circumlocution onset timestamps on AphasiaBank audio, is itself a contribution. It doesn't exist, and WordBridge needs to create it. That's worth naming explicitly in the proposal rather than treating it as just infrastructure.

This means the evaluation is entangled with the dataset work. You can't run H2 without first building the onset-labeled corpus. The next post covers what that looks like.

The circumlocution tolerance curve

A secondary diagnostic worth including: the tolerance curve — how detection accuracy degrades as a function of how long the circumlocution has run when the prediction is made.

Plot Top-1 accuracy at t=500ms, t=1s, t=2s, t=3s, and at the full utterance, all measured from circumlocution onset. The shape of this curve tells you:

Whether the model gets better with more context (good — suggests it's actually reasoning about the circumlocution) or plateaus early (concerning — suggests it's pattern-matching on the first few tokens)
Where the "useful information" in a circumlocution is concentrated — early descriptors vs. later refinements
How much the Background Model's session context shifts predictions compared to the Interaction Model alone at the same timestamp
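A minimal sketch of how the curve could be computed, assuming each test sample stores a timeline of (elapsed ms from onset, top-1 candidate) updates sorted by time. That data layout, the sample values, and the function names are all assumptions made for illustration.

```python
import math

# math.inf stands in for "full utterance"
CHECKPOINTS_MS = (500, 1000, 2000, 3000, math.inf)

def prediction_at(timeline, t_ms):
    """Latest top-1 candidate emitted at or before t_ms, or None if none yet."""
    current = None
    for elapsed_ms, candidate in timeline:  # assumed sorted by elapsed_ms
        if elapsed_ms > t_ms:
            break
        current = candidate
    return current

def tolerance_curve(samples, checkpoints_ms=CHECKPOINTS_MS):
    """samples: list of (target, timeline). Returns {checkpoint: Top-1 accuracy}."""
    curve = {}
    for t in checkpoints_ms:
        hits = sum(1 for target, timeline in samples
                   if prediction_at(timeline, t) == target)
        curve[t] = hits / len(samples)
    return curve

# Hypothetical samples
samples = [
    ("refrigerator", [(300, "freezer"), (900, "refrigerator")]),
    ("umbrella", [(450, "umbrella")]),
    ("stapler", [(700, "scissors"), (2600, "stapler")]),
    ("ladder", [(400, "stairs"), (1500, "stairs")]),
]

print(tolerance_curve(samples))
# {500: 0.25, 1000: 0.5, 2000: 0.5, 3000: 0.75, inf: 0.75}
```

Running the same function once on Interaction Model predictions and once on predictions that include Background Model session context would give the two curves needed for the last comparison in the list.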

Summary(Summary)

Evaluating WordBridge cleanly requires distinguishing four separate questions: does passive detection match explicit-query accuracy (H1), does it arrive early enough to be useful (H2), do contextual anchors actually help (H3), and is the safety gating well-calibrated (H4). H1 and H2 need a baseline corpus with onset-labeled circumlocution timestamps that doesn't exist yet — building it is part of the contribution. H4 needs a nested comparison between threshold-based and context-reasoned tier transitions, not just tiered vs. no gating. The circumlocution tolerance curve (accuracy vs. elapsed time from onset) is a secondary diagnostic that reveals whether the model is reasoning about the circumlocution or just pattern-matching on early tokens.

DATE	Jun 18, 2026
BY	gitcoder89431
READ	8 min
TAGS	#research #evaluation #aphasia #benchmarks #methodology
STATUS	published