Recall(From the intro post)
WordBridge's safety framework uses wearable HR/GSR signals to gate system behavior across three tiers — full operation, suggestions-suppressed, and silent caregiver alert. H4 claims this tiered approach produces lower false-positive intervention rates than a state-agnostic baseline. That claim quietly depends on two things working: the sensors have to be right, and the alerts they trigger have to mean something.
Both of those have their own research literatures, and both literatures are full of warnings.
Assumption 1: the sensors are right
Lab studies of HR/GSR-based stress detection report respectable numbers — Random Forest and AdaBoost classifiers around 85% accuracy, and some time-domain HRV approaches using 5-minute windows reaching over 96% in controlled settings.
Warning(Lab accuracy is not field accuracy)
Outside the lab, the same reviews report:
- Wrist-worn sensors are measurably less accurate than chest-worn for HR
- Motion artifacts during ordinary physical activity degrade signal quality and require separate correction algorithms
- In a driving-scenario validation study, wearable EDA (electrodermal activity — the GSR-family signal) showed no correlation with gold-standard lab equipment
This matters specifically for WordBridge's population. The proposal's tier framework assumes elevated HR/GSR maps onto "mild distress." But a person with early-stage Alzheimer's or anomic aphasia going about a normal day will also show elevated HR/GSR from climbing stairs, a hot room, caffeine, a startling noise, or ordinary excitement at a grandchild's visit. None of these is the "conversational collapse" Tier 3 is designed for, but on raw sensor values alone they can look identical to it.
Example(A concrete failure mode)
A patient is laughing with their grandchild, and their heart rate spikes from genuine joy and physical animation. A threshold-based Tier 2 trigger reads this as "elevated HR/GSR, mild distress," suppresses word suggestions, and switches to contextual-anchor-only mode right in the middle of a conversation that was going well. The intervention designed to prevent harm during distress causes a smaller but real harm during a good moment.
This is precisely the scenario H4 needs to catch — but it also means H4's "state-agnostic baseline" comparison might be testing the wrong thing. The interesting comparison isn't "physiological gating vs. no gating," it's "threshold-based physiological gating vs. context-aware physiological gating." Raw HR/GSR thresholds, on their own, may not be distinguishable from noise often enough to be useful.
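To make the grandchild scenario concrete, here is a minimal sketch of a threshold-only gate. The threshold values, the `Reading` type, and the `tier_from_thresholds` function are hypothetical illustrations; the proposal does not specify any of them.

```python
from dataclasses import dataclass

# Hypothetical thresholds for illustration only; not taken from the proposal.
HR_TIER2_BPM = 100.0
HR_TIER3_BPM = 125.0
GSR_TIER2_US = 6.0    # skin conductance in microsiemens
GSR_TIER3_US = 10.0


@dataclass
class Reading:
    hr_bpm: float
    gsr_us: float


def tier_from_thresholds(r: Reading) -> int:
    """Map one raw reading to a tier using fixed thresholds only.

    1 = full operation, 2 = suggestions suppressed, 3 = silent caregiver alert.
    """
    if r.hr_bpm >= HR_TIER3_BPM and r.gsr_us >= GSR_TIER3_US:
        return 3
    if r.hr_bpm >= HR_TIER2_BPM or r.gsr_us >= GSR_TIER2_US:
        return 2
    return 1


# Two made-up readings with identical sensor values but different causes.
laughing_with_grandchild = Reading(hr_bpm=112.0, gsr_us=7.5)
word_finding_distress = Reading(hr_bpm=112.0, gsr_us=7.5)

print(tier_from_thresholds(laughing_with_grandchild))  # 2
print(tier_from_thresholds(word_finding_distress))     # 2
```

Both readings land in Tier 2 because the function only sees the numbers. Whatever distinguishes the two situations has to come from somewhere other than the raw values.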
Assumption 2: the alert means something
Tier 3's "silent caregiver alert" only works if caregivers act on it. There's a large, well-documented literature on what happens when alert systems get this wrong, and it has a name: alarm fatigue.
Definition(Alarm fatigue)
A state in which caregivers, exposed to a high volume of mostly-false or non-actionable alerts, become desensitized and begin ignoring or delaying responses to all alerts — including the genuinely critical ones. Documented extensively in hospital monitoring systems.
The numbers from hospital settings are stark: studies report that 80–99% of clinical monitor alarms are false or clinically insignificant. The consequence isn't just wasted attention — it's that caregivers start ignoring the real alarms too, because they've learned the alarm itself carries almost no information (IBM, PMC6904899).
Important(The risk compounds, not just adds)
If Tier 2→3 transitions are even moderately over-sensitive (per Assumption 1, above), Tier 3's caregiver alerts inherit that false-positive rate directly. A caregiver who gets a "silent alert" every time the patient laughs hard or climbs the stairs will, within days, start treating WordBridge's Tier 3 alerts the way ICU nurses learn to treat 90% of monitor alarms: as background noise. The one alert that matters arrives into a channel the caregiver has already learned to discount.
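The compounding can be sketched as a base-rate calculation. The inputs below are made-up numbers chosen only to show the shape of the effect; they are not estimates for WordBridge or for any study cited here, and `alert_precision` is a hypothetical helper name.

```python
def alert_precision(prevalence: float, sensitivity: float,
                    false_positive_rate: float) -> float:
    """Fraction of fired alerts that correspond to a real event.

    This is the positive predictive value from Bayes' rule.
    """
    true_alerts = prevalence * sensitivity
    false_alerts = (1.0 - prevalence) * false_positive_rate
    return true_alerts / (true_alerts + false_alerts)


# Hypothetical inputs: 1% of monitored windows contain a genuine collapse,
# the detector catches 90% of those, and 10% of ordinary windows trip it.
ppv = alert_precision(prevalence=0.01, sensitivity=0.90,
                      false_positive_rate=0.10)
print(f"{ppv:.3f}")  # 0.083
```

Under these assumed inputs only about one alert in twelve is real, even though the detector itself looks reasonable on paper. Rare events plus a modest false-positive rate are enough to put a caregiver in alarm-fatigue territory.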
What the fix looks like elsewhere
A 2025 paper on context-aware alerting in long-term care facilities tackled almost exactly this problem — too many false/non-actionable alerts overwhelming nursing staff — with a hybrid architecture: deterministic rule-based suppression for clearly non-actionable alerts, context-aware delay based on urgency and staff workload, and LLM-driven semantic reasoning (GPT-4) over structured spatial/temporal/clinical context to decide final priority and routing (PMC12608296).
The reported improvement: alert volume reduced 72.5%, false positive rate from 0.20 → 0.005, false negative rate from 0.79 → 0.023, F1 from 0.18 → 0.97.
Intuition(What this means for WordBridge's Background Model)
The headline numbers aren't the point — they're from a different system in a different setting. What's transferable is the shape of the fix: raw sensor/event thresholds alone weren't good enough, and the fix was a reasoning layer sitting between the raw signal and the alert, using context the threshold doesn't have access to (where is this person, what were they just doing, how urgent is this given everything else known about the situation).
This is exactly the job WordBridge already assigns to the Background Model — it's the one component with the conversational and session history needed to ask "is this elevated HR/GSR reading consistent with what's been happening for the last two minutes, or does it look like a discontinuity?" A pure threshold on the Interaction Model's side can't ask that question; it doesn't have the context window.
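As a rough illustration of the "consistent with the last two minutes, or a discontinuity?" question, here is a minimal statistical stand-in. It is not the Background Model's actual reasoning, which the proposal does not specify at this level; the function name `is_discontinuity`, the z-score cutoff, and the window contents are all assumptions.

```python
from statistics import mean, stdev
from typing import Sequence


def is_discontinuity(recent_hr: Sequence[float], current_hr: float,
                     z_cutoff: float = 3.0, min_spread: float = 2.0) -> bool:
    """Return True when current_hr departs sharply from the recent window.

    recent_hr holds recent heart-rate samples (a hypothetical ~2 minute
    window). min_spread floors the standard deviation so that a very flat
    window does not turn small changes into huge z-scores.
    """
    if len(recent_hr) < 2:
        return False  # not enough history to judge
    baseline = mean(recent_hr)
    spread = max(stdev(recent_hr), min_spread)
    return abs(current_hr - baseline) / spread >= z_cutoff


# Gradual climb (e.g. an increasingly animated conversation).
gradual = [88, 92, 96, 100, 104, 108]
print(is_discontinuity(gradual, 110))  # False

# Flat history, then a sudden jump to the same absolute value.
flat = [72, 73, 71, 72, 74, 72]
print(is_discontinuity(flat, 110))     # True
```

The same reading of 110 bpm is unremarkable after a steady climb and abrupt after a flat stretch. A fixed threshold cannot tell those apart, which is the gap the history-aware component is meant to fill.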
Sharpening H4
Taken together, this suggests H4 as currently framed — "tiered intervention produces lower false positive rates than a state-agnostic baseline" — is necessary but not sufficient. The more important comparison is nested inside it:
| Comparison | What it tests |
|---|---|
| State-aware tiers vs. state-agnostic baseline | Does using physiological signals at all help? (Original H4) |
| Threshold-based tier transitions vs. context-reasoned tier transitions | Does the Background Model's context actually add value over raw sensor thresholds, the way it did in the long-term-care alerting work? |
If the first comparison shows a win but the second doesn't, WordBridge has built a more complex system that's no better than a simple HR/GSR threshold — which the field has already shown is not reliable enough on its own for this population.
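The second comparison in the table can be sketched as a small evaluation harness. Everything here is hypothetical: the `Episode` fields, the two policy functions, and the five labeled episodes are made-up illustrations, not study data or the proposal's design.

```python
from typing import Callable, List, NamedTuple


class Episode(NamedTuple):
    sensor_elevated: bool    # raw HR/GSR crossed the threshold
    context_explains: bool   # e.g. stairs, laughter, hot room
    truly_distressed: bool   # ground-truth label from annotation


Policy = Callable[[Episode], bool]  # True means the policy intervenes


def threshold_policy(ep: Episode) -> bool:
    """Intervene whenever the raw signal is elevated."""
    return ep.sensor_elevated


def context_policy(ep: Episode) -> bool:
    """Intervene only when elevation is not explained by context."""
    return ep.sensor_elevated and not ep.context_explains


def false_positive_rate(policy: Policy, episodes: List[Episode]) -> float:
    """Share of non-distress episodes in which the policy intervened."""
    negatives = [ep for ep in episodes if not ep.truly_distressed]
    if not negatives:
        raise ValueError("need at least one non-distress episode")
    return sum(policy(ep) for ep in negatives) / len(negatives)


# Made-up labeled episodes for illustration only.
episodes = [
    Episode(True, True, False),    # laughing with grandchild
    Episode(True, True, False),    # climbing stairs
    Episode(False, False, False),  # calm conversation
    Episode(True, False, False),   # unexplained spike, no distress
    Episode(True, False, True),    # genuine distress
]

print(false_positive_rate(threshold_policy, episodes))  # 0.75
print(false_positive_rate(context_policy, episodes))    # 0.25
```

A state-agnostic baseline would plug into the same harness as a third `Policy`. A real evaluation would also need to track missed distress episodes, since any policy that suppresses interventions trades false positives against false negatives.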
Summary(Summary)
WordBridge's safety framework depends on two things that each have well-documented failure modes in the existing literature. First, wearable physiological signals are noisier in real-world ambulatory use than in lab studies, especially for the population WordBridge targets, where normal daily activity produces the same signals as "distress." Second, caregiver alert systems degrade into ignored noise once false-positive rates climb, a well-studied phenomenon called alarm fatigue, with 80–99% of hospital monitor alarms reported as false or clinically insignificant. The fix used elsewhere (LLM-based contextual reasoning layered over rule-based filtering, which cut the false positive rate from 0.20 to 0.005 in long-term-care alerting) maps directly onto the role WordBridge already assigns its Background Model. The practical upshot for H4: the experiment should explicitly compare threshold-based vs. context-reasoned tier transitions, not just tiered vs. state-agnostic. Otherwise a positive H4 result might just mean "any gating beats no gating," without showing the Background Model's reasoning is doing anything a simple threshold couldn't.