Where Does the Data Come From?

Recall(From the evaluation plan post)

H2 — whether the correct candidate arrives before the utterance ends — requires circumlocution onset timestamps labeled at sub-second resolution. That corpus doesn't exist. This post is about how to build it, what AphasiaBank actually provides, and where the data plan is strong versus where it requires honest caveats.

"We'll use AphasiaBank" is a common wave of the hand in aphasia research proposals. AphasiaBank is genuinely valuable — open-access, IRB-cleared, clinically validated — but it provides transcripts and audio, not the labeled onset timestamps WordBridge's evaluation requires. The gap between what AphasiaBank provides and what this project needs is where most of the annotation work lives.

What AphasiaBank actually provides

AphasiaBank is a shared corpus of aphasia discourse maintained at Carnegie Mellon as part of the TalkBank project, collected under a standard IRB protocol that permits secondary research use. The core resource is transcribed audio of standardized discourse tasks administered to people with aphasia and neurotypical controls.

Definition(Standardized discourse tasks in AphasiaBank)

The bank includes several task types, each designed to elicit different aspects of language production:

Picture description — the participant describes a standardized scene (most commonly the Cookie Theft picture from the Boston Diagnostic Aphasia Examination)
Story retelling — retelling from Cinderella or other standardized narratives
Procedural discourse — describing a routine (making a sandwich, getting dressed)
Conversational interview — semi-structured conversation with the examiner

Anomic aphasia speakers appear throughout. The picture description and procedural discourse tasks are the most useful for WordBridge because they reliably elicit circumlocution — participants grope for object names repeatedly and in structured ways.

The transcripts use CHAT format (Codes for the Human Analysis of Transcripts), which includes disfluency markers — pauses, revisions, repetitions, filled hesitations ("um", "uh") — at the word level. Audio files are aligned to transcripts, though alignment is approximate rather than forced-alignment quality.
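To make "word-level markers, utterance-level timing" concrete, here is a minimal sketch of reading one transcript line. It assumes CHAT media links are utterance-level bullets of the form start_end in milliseconds, delimited by the control character 0x15, and that filled pauses appear as `&-um` / `&-uh`. The sample line and the helper names `utterance_span` and `count_disfluencies` are invented for illustration, not taken from AphasiaBank.

```python
import re

# Assumption: a media bullet is \x15<start_ms>_<end_ms>\x15 at the end of a line.
BULLET = re.compile(r"\x15(\d+)_(\d+)\x15")
FILLED_PAUSE = re.compile(r"&-(?:um|uh)\b")
SILENT_PAUSE = re.compile(r"\(\.{1,3}\)")  # (.), (..), (...)


def utterance_span(line):
    """Return (start_s, end_s) for the whole utterance, or None if unlinked."""
    match = BULLET.search(line)
    if match is None:
        return None
    return int(match.group(1)) / 1000, int(match.group(2)) / 1000


def count_disfluencies(line):
    """Count word-level disfluency markers; none of them carries its own time."""
    return {
        "filled_pauses": len(FILLED_PAUSE.findall(line)),
        "silent_pauses": len(SILENT_PAUSE.findall(line)),
        "retracings": line.count("[/]") + line.count("[//]"),
    }


# Invented example line, not real corpus data.
sample = "*PAR:\tI need the (..) &-um the thing [//] the round thing . \x1512000_20100\x15"
span = utterance_span(sample)
counts = count_disfluencies(sample)
print(span, counts)
```

The only timestamps available are the two ends of the utterance. The markers inside it are ordered but untimed, which is the limitation the rest of this post works around.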

What it does not provide

The CHAT disfluency markers tell you that a hesitation or revision occurred, not when within the audio it started. For transcript-level analysis — identifying which word the speaker was ultimately trying to retrieve — that's sufficient. For WordBridge's H2 evaluation, it isn't.

Warning(The resolution gap)

AphasiaBank marks that a circumlocution occurred and (often) what the target word was. It does not mark the onset timestamp — the moment in the audio when the circumlocution began. That's the missing label. Without it, you can't compute "time from circumlocution onset to first correct candidate" — you can only compute "time from utterance start to first correct candidate," which conflates latency with how long the speaker took to get to the circumlocution.
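The conflation is easy to see with two invented events in which the system behaves identically but the speakers take different amounts of time to reach the circumlocution. The timings and the `latency` helper below are hypothetical, a sketch of the arithmetic rather than the project's evaluation code.

```python
def latency(candidate_t_s, reference_t_s):
    """Seconds from a reference point to the first correct candidate."""
    return candidate_t_s - reference_t_s


# Invented timings: same system behaviour, different speaker lead-in.
events = [
    {"utt_start": 10.0, "onset": 10.5, "first_correct": 12.5},
    {"utt_start": 30.0, "onset": 34.0, "first_correct": 36.0},
]

from_onset = [latency(e["first_correct"], e["onset"]) for e in events]
from_utt_start = [latency(e["first_correct"], e["utt_start"]) for e in events]

print(from_onset)      # the quantity H2 needs
print(from_utt_start)  # what utterance-level timing alone can give
```

Measured from onset, both events show the same latency. Measured from utterance start, the second looks much slower even though the only thing that changed is the speaker's lead-in.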

The annotation plan

The annotation work breaks into two phases: identifying circumlocution instances, then labeling onset timestamps on the audio.

Phase 1: circumlocution instance identification

From AphasiaBank transcripts, extract all segments that fit the operational definition of circumlocution: the speaker produces a multi-word description of a target referent they demonstrably fail to name in that same turn.

Definition(Operational definition for annotation)

A circumlocution event is a contiguous spoken segment where:

The speaker produces a descriptive paraphrase (function, appearance, location, or category of the referent)
The target word is absent from the segment
Either the target word appears later in the same or adjacent turn (resolved circumlocution), or the speaker abandons the retrieval attempt (unresolved)

The target word is defined as what the picture or task prompt elicits, or — for conversational data — what the examiner's response confirms was intended.

The CHAT transcripts make Phase 1 tractable. Many instances are already flagged by annotators with word-level error codes ([* n:k] for a neologism with a known target, [* p:w] and [* p:n] for phonological errors, etc.), and circumlocution is often marked with the utterance-level postcode [+ cir], sometimes alongside [+ jar] (jargon) or adjacent disfluency codes. This gives a first-pass candidate list without starting from blank audio.
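A first-pass candidate filter over transcript lines might look like the sketch below. AphasiaBank's utterance-level postcodes include [+ cir] for circumlocution and [+ jar] for jargon; which of them should count as a candidate flag is left as a parameter here, since that choice belongs to the annotation protocol. The function name `candidate_utterances` and the sample lines are invented for illustration.

```python
import re

ERROR_CODE = re.compile(r"\[\* [^\]]+\]")  # word-level codes such as [* n:k]
POSTCODE = re.compile(r"\[\+ (\w+)\]")     # utterance-level codes such as [+ cir]


def candidate_utterances(lines, postcodes=("cir",), speaker="*PAR:"):
    """Return (index, line) pairs for participant lines worth a human look.

    A line qualifies if it carries one of the chosen postcodes or any
    word-level error code. This only builds a candidate list; every hit
    still needs manual confirmation against the operational definition.
    """
    hits = []
    for index, line in enumerate(lines):
        if not line.startswith(speaker):
            continue
        flagged = any(code in postcodes for code in POSTCODE.findall(line))
        has_error = ERROR_CODE.search(line) is not None
        if flagged or has_error:
            hits.append((index, line))
    return hits


# Invented transcript lines, not real corpus data.
sample_lines = [
    "*INV:\twhat is happening here ?",
    "*PAR:\tthe boy is on the stool .",
    "*PAR:\tthe thing you put food on . [+ cir]",
    "*PAR:\tthe plake [: plate] [* n:k] is there .",
]
hits = candidate_utterances(sample_lines)
print([index for index, _ in hits])
```

Erring toward recall is deliberate in this sketch: a false candidate costs an annotator a few seconds, while a missed one never reaches Phase 2.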

Phase 2: onset timestamp labeling

For each confirmed circumlocution instance, a trained annotator listens to the audio and marks:

Onset: the start of the first descriptive word — not the hesitation before it, not the "um," but the first content word of the circumlocutory description
End of segment: where the circumlocution ends — either at the target word production or at the start of a topic change / abandonment
Resolution label: resolved (target word produced), assisted (examiner or interlocutor supplies the word), abandoned (speaker gives up or changes topic)

Example(Annotation in practice)

Audio: "I need the — [0.8s pause] — you know, the thing you — the round thing, flat, you put food on it — the... [1.2s pause]... plate."

Hesitation onset: 0.0s (start of the first pause, taken as t = 0)
Circumlocution onset: ~0.9s (first content word: "thing")
Target word production: ~8.1s ("plate")
Resolution: resolved
H2 measurement window starts at 0.9s
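One way to store a Phase 2 label is a small validated record. This is a sketch under assumptions: the class and field names are invented, and treating the end of the segment as the H2 deadline is one possible reading of "before the utterance ends," not a settled definition.

```python
from dataclasses import dataclass
from typing import Optional

RESOLUTIONS = {"resolved", "assisted", "abandoned"}


@dataclass(frozen=True)
class CircumlocutionEvent:
    hesitation_onset_s: float   # pause or filler before the description
    onset_s: float              # first content word of the description
    end_s: float                # target word production or abandonment
    resolution: str             # one of RESOLUTIONS
    target_word: Optional[str] = None

    def __post_init__(self):
        if self.resolution not in RESOLUTIONS:
            raise ValueError(f"unknown resolution label: {self.resolution!r}")
        if not (self.hesitation_onset_s <= self.onset_s < self.end_s):
            raise ValueError("expected hesitation <= onset < end")

    def h2_window(self):
        """Interval in which a correct candidate counts as arriving in time."""
        return (self.onset_s, self.end_s)

    def arrives_in_time(self, candidate_t_s):
        return self.onset_s <= candidate_t_s < self.end_s


# The "plate" example from above, encoded as a record.
plate = CircumlocutionEvent(0.0, 0.9, 8.1, "resolved", "plate")
print(plate.h2_window(), plate.arrives_in_time(3.4), plate.arrives_in_time(9.0))
```

Validating the ordering at construction time catches the most common labeling slip, an onset marked on the hesitation rather than on the first content word, only when it produces an impossible ordering. The finer distinction still depends on annotator training.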

Inter-annotator agreement should be reported on a random 15% subsample of the corpus labeled independently by two annotators, using Cohen's κ for the resolution label and the mean absolute difference in seconds for onset timestamps. Based on similar forced-alignment annotation tasks, onset agreement within ±200 ms is a reasonable quality target.
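Both agreement statistics are small enough to compute by hand. The sketch below implements Cohen's κ for two annotators and the mean absolute onset difference, plus the share of onsets inside the 200 ms tolerance. The function names and the toy labels are invented for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


def onset_agreement(onsets_a, onsets_b, tolerance_s=0.2):
    """Return (mean absolute difference in s, fraction within tolerance)."""
    diffs = [abs(a - b) for a, b in zip(onsets_a, onsets_b)]
    mean_abs_diff = sum(diffs) / len(diffs)
    within = sum(d <= tolerance_s for d in diffs) / len(diffs)
    return mean_abs_diff, within


# Toy labels from two hypothetical annotators.
res_a = ["resolved", "resolved", "assisted", "abandoned"]
res_b = ["resolved", "resolved", "abandoned", "abandoned"]
kappa = cohens_kappa(res_a, res_b)
mean_abs_diff, within = onset_agreement([0.9, 2.0, 5.5], [1.0, 2.5, 5.5])
print(round(kappa, 3), round(mean_abs_diff, 3), round(within, 3))
```

Reporting the within-tolerance fraction next to the mean matters because a single badly placed onset can dominate the mean while most labels agree closely.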

Synthetic augmentation

AphasiaBank should yield on the order of 300–500 usable circumlocution events with onset labels after annotation. That is enough for evaluation, but tight for fine-tuning if that becomes part of the training plan.

Synthetic augmentation addresses the scale problem but introduces problems of its own.

Intuition(What synthetic data is good for)

Vocabulary coverage — real AphasiaBank data clusters around the standardized task stimuli (kitchen objects, Cinderella characters, routine actions). Synthetic circumlocutions can cover a much wider vocabulary, reducing the risk that the model has only learned to retrieve words that happen to appear in standard picture description protocols.

Edge cases — unusual circumlocution strategies (categorical over-generalization, function descriptions without appearance cues, etc.) are rare in a 400-sample corpus. Synthetic generation can deliberately oversample them.

Tier 3 conditions — distress + circumlocution co-occurring is genuinely rare in AphasiaBank (which is a clinical elicitation setting, not a real conversation under stress). Synthetic data is the only practical way to get reasonable coverage for the H4 safety tier evaluation.

Warning(What synthetic data distorts)

LLM-generated circumlocutions are syntactically natural and semantically coherent in ways that real aphasia speech is not. Real speakers produce fragmented circumlocutions, restart mid-phrase, self-correct, and abandon attempts in patterns that reflect underlying lexical-phonological processing failures — not just "describe this thing in different words."

A model trained primarily on synthetic circumlocutions may learn to detect description-like language rather than retrieval failure. These overlap substantially but are not the same, and the difference matters when the real-world user has genuine aphasia rather than just finding a word temporarily unfamiliar.

This is the strongest argument for reporting natural and synthetic samples separately in all experiments, and for treating synthetic training as augmentation rather than primary data.
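Separate reporting is cheap to enforce if every evaluated event carries its source. A minimal sketch, with an invented helper name and invented outcomes:

```python
from collections import defaultdict


def accuracy_by_source(results):
    """results: iterable of (source, is_correct) pairs. Returns per-source accuracy."""
    totals = defaultdict(lambda: [0, 0])  # source -> [correct, total]
    for source, is_correct in results:
        totals[source][0] += int(is_correct)
        totals[source][1] += 1
    return {source: correct / total for source, (correct, total) in totals.items()}


# Invented outcomes: the system does better on the cleaner synthetic items.
results = [
    ("natural", True), ("natural", True), ("natural", True), ("natural", False),
    ("synthetic", True), ("synthetic", True), ("synthetic", True), ("synthetic", True),
]
per_source = accuracy_by_source(results)
pooled = sum(ok for _, ok in results) / len(results)
print(per_source, pooled)
```

In this toy case the pooled figure sits between the two per-source numbers and hides the gap on natural speech, which is the number that matters for real users.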

Honest corpus size estimate

Source	Estimated events	Notes
AphasiaBank (natural)	300–500	After annotation; depends on annotator bandwidth and task-type filtering
Synthetic augmentation	1,000–1,200	LLM-generated, prompted from target-word vocabulary lists
Total	~1,500	Natural + synthetic, reported separately in all experiments

These numbers are honest estimates, not aspirational targets. 300 natural events is enough to evaluate H1 and H2 with reasonable statistical power, with a caveat: at n=300 the 95% confidence interval on a single accuracy estimate is at most about ±5.7 points, so a 5-point accuracy difference is near the limit of what is detectable and realistically requires a paired comparison on the same events. It is not enough to fine-tune a large model from scratch, which is why the architecture relies on prompting and retrieval over TML-Interaction-Small rather than task-specific fine-tuning.
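A quick way to sanity-check what a given corpus size buys is the normal-approximation (Wald) half-width of a 95% confidence interval on an accuracy estimate. This is a rough sketch, not a power analysis: it ignores pairing between systems, and the function name is invented.

```python
import math


def wald_half_width(p, n, z=1.96):
    """Half-width of the normal-approximation CI for a proportion p at sample size n."""
    return z * math.sqrt(p * (1 - p) / n)


for n in (300, 500):
    worst_case = wald_half_width(0.5, n)   # widest interval, at 50% accuracy
    at_80 = wald_half_width(0.8, n)        # narrower at higher accuracy
    print(n, round(100 * worst_case, 1), round(100 * at_80, 1))
```

Under these assumptions the interval at n=300 is roughly five to six points wide on each side. Whether a 5-point difference between two systems reaches significance then depends on the comparison design: a paired test on the same events is more sensitive than comparing two independent estimates.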

Note(IRB and data sharing)

AphasiaBank data is available for research use under the AphasiaBank data use agreement, which permits secondary analysis and publication of aggregated results. Any new onset-timestamp annotations created as part of this project would be contributed back to AphasiaBank under the same terms, since the annotation work has value to the field beyond this proposal.

New data collected from participants (if any human subjects work is added to the protocol) would require separate IRB approval. The current proposal scope assumes AphasiaBank-only data for the initial prototype evaluation.

Alzheimer's data: the harder problem

Everything above is about anomic aphasia, which has a well-defined corpus. Early Alzheimer's discourse data is substantially more limited. The closest resource is DementiaBank, a sibling TalkBank corpus that shares AphasiaBank's infrastructure, but conversation with thread loss is much harder to elicit in a standardized way than picture description circumlocution.

Remark

For the Alzheimer's scenario — thread loss, contextual anchor firing, H3 evaluation — the honest position is that the training and evaluation data situation is less mature than for aphasia. DementiaBank's Cookie Theft subset is usable but smaller. H3's recovery delta measurement requires a conversational setup that the standardized discourse tasks don't naturally provide. This is the part of the evaluation plan that needs the most development before the prototype is built.

This doesn't change the viability of the aphasia side (H1, H2, H4), but it does mean H3 is the hypothesis most likely to need a purpose-collected pilot dataset before a full evaluation is possible.

Summary

AphasiaBank provides 300–500 usable circumlocution instances after annotation — enough to evaluate H1 and H2, not enough to fine-tune a model from scratch. The annotation work required to create onset-labeled timestamps (the gap between "this circumlocution occurred" and "this circumlocution began at t=0.9s") is infrastructure the field doesn't have yet, and contributing it back to AphasiaBank is a concrete deliverable beyond the prototype itself. Synthetic augmentation can fill scale and edge-case coverage gaps but should never be reported pooled with natural data. The Alzheimer's scenario (H3) has the weakest data foundation and is the hypothesis most likely to require purpose-collected pilot data before a full evaluation can run.

DATE	Jun 18, 2026
BY	gitcoder89431
READ	9 min
TAGS	#research #data #aphasiabank #annotation #methodology
STATUS	published