The Gold Standard for Patient Experience
HCAHPS — the Hospital Consumer Assessment of Healthcare Providers and Systems — is the most widely used patient experience survey in the United States. Mandated by CMS since 2006, it is the benchmark that every hospital in the country measures itself against. The survey covers the core dimensions of hospital care: how well nurses and doctors communicate, how responsive staff are to patient needs, how clean and quiet the hospital environment is, and how well discharge and medication information is communicated.
For any synthetic patient model, HCAHPS is the obvious first test. If a model cannot reproduce the national distribution of patient experience responses, it has no business generating patient survey data. This validation study asks a simple question: can Simsurveys replicate what 631,000 real patients reported about their hospital experiences?
Study Design
We compared Simsurveys output against the HCAHPS 2024 CMS dataset, covering January through December 2024 discharges and publicly released in October 2025. The reference dataset includes approximately 631,000 completed surveys collected from 4,304 US hospitals — representing the full national picture of hospital patient experience.
On the simulation side, we generated n=1,000 synthetic respondents using the Simsurveys general consumer model. This is a critical detail: we deliberately used the baseline consumer model before any patient-specific fine-tuning. The goal was to establish a floor — how well does a general-purpose model perform on healthcare-specific patient experience questions without any domain adaptation?
The survey covered 21 questions spanning eight care domains: Nurse Care, Doctor Care, Hospital Environment, Experiences in Hospital, Medication Communication, Leaving Hospital, Care Transition, and Overall Rating. Each question uses the standard HCAHPS response scales, and we measured alignment using KL Divergence across the full response distributions.
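The KL Divergence used here compares, for each question, the real and synthetic shares across the response scale. A minimal sketch of that computation, assuming a four-point frequency scale and illustrative shares (not figures from the report):

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """KL(P || Q) in nats between two discrete response distributions.
    p: real response shares, q: synthetic shares.
    eps guards against zero synthetic probabilities."""
    assert abs(sum(p) - 1) < 1e-6 and abs(sum(q) - 1) < 1e-6
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical shares on a Never / Sometimes / Usually / Always scale
real      = [0.02, 0.05, 0.15, 0.78]
synthetic = [0.02, 0.06, 0.22, 0.70]
print(round(kl_divergence(real, synthetic), 3))  # prints 0.018
```

A KL of 0 means the two distributions are identical; under the rating scheme described below, values under 0.15 count as "Good."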
Results: Strong Baseline Performance
Across all 21 questions, the model achieved an average KL Divergence of 0.091 and a median of 0.084. These are strong results for any synthetic data model, and especially notable given that this was a general consumer model with no patient-specific training.
Of the 21 questions tested, 19 achieved a "Good" rating (KL < 0.15). The two questions rated "Review" were Q8 — room quiet at night (KL = 0.153) — and Q16 — written discharge information (KL = 0.159). Both are just barely above the threshold, and the deviations follow a consistent, interpretable pattern.
19 of 21 questions rated "Good" (KL < 0.15) using a general consumer model with zero patient-specific fine-tuning. Average KL Divergence: 0.091. Median: 0.084.
The Central Tendency Pattern
The primary deviation pattern across the HCAHPS validation is what we call the central tendency effect. The model consistently understates the most extreme positive response — "Always" — and overstates the moderate positive response — "Usually." In practical terms, when real patients report 75% "Always" and 10% "Usually" on a nurse communication question, the model might generate 65% "Always" and 20% "Usually", shifting roughly ten points of probability mass from the top box into the adjacent category.
This pattern is entirely expected for a general consumer model encountering healthcare survey data. HCAHPS distributions are heavily top-skewed, with 70–80% of respondents selecting the highest response option on most questions. A model trained on general consumer surveys — where response distributions are typically more balanced — will naturally pull toward the center of the scale.
The important point is that the model captures the correct rank ordering and overall shape of each distribution. It knows that nurse communication scores higher than hospital quietness, that doctor communication is rated highly, and that discharge information is a relative weak point. The directional intelligence is there; the intensity calibration is what needs adjustment.
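To make this concrete, the nurse-communication example above can be run through the same KL computation. The "Usually"/"Always" shares follow the text; the "Never"/"Sometimes" shares are assumptions added so the distributions sum to one:

```python
import math

def kl(p, q):
    """KL(P || Q) in nats for discrete response distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Never / Sometimes / Usually / Always shares; first two are illustrative assumptions
real  = [0.07, 0.08, 0.10, 0.75]
model = [0.07, 0.08, 0.20, 0.65]

# The ten-point Always -> Usually shift alone yields a nonzero KL of the same
# order as the question-level values reported above
divergence = kl(real, model)
print(round(divergence, 3))  # prints 0.038

# The rank ordering of the response options survives the shift intact
same_order = sorted(range(4), key=real.__getitem__) == sorted(range(4), key=model.__getitem__)
print(same_order)  # prints True
```

This is the central tendency effect in miniature: the distribution's shape and ordering are right, but mass has drifted from the top box toward the center.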
What Fine-Tuning Will Address
The central tendency effect is precisely the kind of systematic bias that patient-specific fine-tuning is designed to correct. By training on actual patient experience data — where top-box scores of 70–80% are the norm rather than the exception — the model will learn to calibrate its response intensity to match healthcare-specific distributions.
We expect the fine-tuned Patient model to push the two "Review" items well into the "Good" range and to tighten the already-strong performance on the remaining 19 questions. The baseline results give us confidence that the underlying response structure is sound; the fine-tuning step is about intensity calibration, not structural correction.
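As a toy illustration of what "intensity calibration, not structural correction" means — this is not Simsurveys' actual fine-tuning procedure, just a sketch — a simple temperature sharpening of the synthetic distribution already recovers much of the top-box mass. The response shares reuse the illustrative numbers from the central tendency example:

```python
import math

def kl(p, q):
    """KL(P || Q) in nats for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def sharpen(q, t):
    """Temperature-sharpen a distribution: raise each share to 1/t, renormalize.
    t < 1 concentrates mass on the mode; t > 1 flattens the distribution."""
    w = [qi ** (1.0 / t) for qi in q]
    total = sum(w)
    return [wi / total for wi in w]

# Illustrative shares (assumed): real patients vs. the baseline consumer model
real  = [0.07, 0.08, 0.10, 0.75]   # Never / Sometimes / Usually / Always
model = [0.07, 0.08, 0.20, 0.65]

# Sweep temperatures, keep the one minimizing KL(real || sharpened model)
best_kl, best_t = min((kl(real, sharpen(model, t / 100)), t / 100)
                      for t in range(50, 151))
print(round(best_kl, 3), best_t)
```

Because the baseline error is a uniform pull toward the center, a single scalar adjustment closes most of the gap — which is why a calibration-focused fine-tune is expected to work without restructuring the model's responses.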
Implications for Patient Research
These results demonstrate that even a general-purpose consumer model can approximate national patient experience distributions with meaningful accuracy. For research teams that need directional patient experience data quickly — for pilot studies, survey pretesting, or benchmarking exercises — the current model is already usable. For studies that require precise top-box calibration, the fine-tuned Patient model will deliver the additional accuracy needed.
The full validation report, including question-level distribution tables and metric summaries, is available for download. For more on the Patient model and its training data, visit the Patient model page.