Synthetic Data for Rare Disease Research: When You Can't Recruit Enough Patients

Research · April 2, 2026 · Myles Friedman · 7 min read

There are more than 7,000 known rare diseases, and collectively they affect roughly 30 million Americans. But individually, each condition affects a small population — by the FDA’s definition, fewer than 200,000 patients in the United States. Many affect far fewer. For the researchers, pharma teams, and advocacy organizations working in this space, this creates a fundamental problem: you cannot survey a population you cannot find.

Traditional patient survey recruitment for rare disease populations is among the hardest challenges in market research. Recruitment timelines stretch to months. Costs escalate into six figures. IRB requirements for vulnerable populations add layers of complexity. And even after all that effort, sample sizes are often too small to support statistically reliable conclusions. The result is that critical research questions about patient experience, unmet needs, and treatment preferences go unanswered, not because they are unimportant, but because the logistics are prohibitive.

Synthetic patient data offers an alternative path for many of these research questions. The Simsurveys Patient model is trained on more than 500,000 de-identified federal health records drawn from NHIS, BRFSS, MEPS, NHANES, CAHPS, and PROMIS. It generates survey responses that statistically mirror how patient populations respond — without recruiting a single real patient, without HIPAA concerns, and without an IRB.

The Recruitment Problem in Rare Disease Research

The challenges of rare disease patient recruitment are structural rather than a matter of effort. They cannot be solved by trying harder or spending more, because the underlying population constraints are fixed.

Tiny Patient Populations

A condition affecting 50,000 people in the U.S. sounds like a reasonable survey universe until you account for diagnosis rates, geographic dispersion, age distributions, and willingness to participate in research. The recruitable population for any given rare disease study is typically a small fraction of the total prevalence. Collecting enough completes for statistically reliable estimates can require recruiting an unrealistically large proportion of that recruitable population.
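To make the constraint concrete, here is a minimal sketch that applies the standard sample-size formula for a proportion, with a finite population correction, to a shrinking recruitment funnel. Only the 50,000 prevalence figure comes from the example above. The diagnosis and reachability rates are hypothetical placeholders chosen for illustration, and the ±5% margin at 95% confidence is an assumed target rather than a Simsurveys requirement.

```python
import math

def required_sample_size(population: int, margin: float = 0.05,
                         z: float = 1.96, p: float = 0.5) -> int:
    """Completes needed to estimate a proportion within +/- margin.

    Standard formula n0 = z^2 * p * (1 - p) / margin^2, followed by the
    finite population correction n = n0 / (1 + (n0 - 1) / N).
    """
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    return math.ceil(n0 / (1 + (n0 - 1) / population))

# Hypothetical funnel: both rates are illustrative assumptions.
prevalence = 50_000        # prevalence from the example in the text
diagnosed_rate = 0.50      # assumed share with a confirmed diagnosis
reachable_rate = 0.10      # assumed share reachable through any channel

recruitable = round(prevalence * diagnosed_rate * reachable_rate)
needed = required_sample_size(recruitable)
share = needed / recruitable

print(f"Recruitable pool: {recruitable:,}")
print(f"Completes needed (+/-5%, 95% confidence): {needed}")
print(f"Share of the recruitable pool that must complete: {share:.1%}")
```

Under these assumed rates, more than one in eight reachable patients would have to complete the survey, which is the kind of proportion that traditional recruitment rarely achieves.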

Recruitment Timelines Measured in Months

Where a general population survey might field in 1–2 weeks, rare disease patient recruitment routinely takes 3–6 months. Patients are identified through specialty clinics, patient registries, advocacy organizations, and social media outreach. Each channel has its own lead times, consent processes, and coordination overhead. For pharma teams working on orphan drug launch timelines, this pace is often incompatible with business needs.

Astronomical Costs

The cost per complete for rare disease patient surveys reflects the difficulty of recruitment. Where a general population health survey might cost $10–$30 per respondent, rare disease patient recruitment can cost $500–$2,000+ per complete, before analysis, project management, or reporting. At that range, a 200-patient study costs at least $100,000 in recruitment alone and exceeds $200,000 once the cost per complete passes $1,000. Many rare disease programs cannot justify that investment for directional market research.
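The arithmetic behind that estimate is simple enough to check directly. The sketch below uses only the per-complete range quoted above. The function name is hypothetical, and because the quoted range is open-ended ("$2,000+"), the upper figure should be read as a floor for expensive studies rather than a cap.

```python
# Cost-per-complete range quoted in the text (USD), recruitment only.
COST_PER_COMPLETE_LOW = 500
COST_PER_COMPLETE_HIGH = 2_000  # the text says "$2,000+", so this is not a ceiling

def recruitment_cost_range(completes: int) -> tuple[int, int]:
    """Return the (low, high) recruitment cost for a given number of completes."""
    return completes * COST_PER_COMPLETE_LOW, completes * COST_PER_COMPLETE_HIGH

low, high = recruitment_cost_range(200)
print(f"200 completes: ${low:,} to ${high:,}")  # $100,000 to $400,000
```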

IRB Complexity for Vulnerable Populations

Many rare diseases disproportionately affect children, elderly patients, or individuals with cognitive impairments — populations that require additional IRB protections. The review process for research involving vulnerable populations adds weeks or months to study timelines, requires detailed consent protocols, and may limit the types of questions that can be asked. These protections are appropriate and necessary for clinical research, but they create meaningful friction for market research and patient experience studies.

Geographic Dispersion

Rare disease patients are typically spread thinly across the population, with little geographic concentration to leverage. A patient with a condition affecting 20,000 Americans could be in any city, any state, any healthcare system. This dispersion makes every recruitment channel less efficient and makes in-person methodologies essentially impossible at scale.

How Synthetic Patient Data Addresses These Challenges

The Simsurveys Patient model takes a fundamentally different approach. Rather than recruiting individual patients, it draws on patterns learned from large-scale federal health datasets to generate survey responses that reflect how patient populations experience health, healthcare, and daily life.

The model is trained on 500,000+ de-identified records from six major federal health surveys: NHIS (National Health Interview Survey), BRFSS (Behavioral Risk Factor Surveillance System), MEPS (Medical Expenditure Panel Survey), NHANES (National Health and Nutrition Examination Survey), CAHPS (Consumer Assessment of Healthcare Providers and Systems), and PROMIS (Patient-Reported Outcomes Measurement Information System). Together, these datasets capture a broad spectrum of health conditions, treatment experiences, functional limitations, healthcare access patterns, and patient-reported outcomes.

Because the model learns relationships between conditions, demographics, comorbidities, and patient experiences, it can generate responses for rare disease populations based on the patterns present in the training data — even when any individual rare condition has limited direct representation.

No HIPAA. No IRB. No patient burden. Synthetic patient data contains no protected health information. It is not derived from identifiable patient records. There is no consent requirement, no IRB review, and no risk of burdening a vulnerable patient population with survey participation. For market research and patient experience studies, this removes the most significant barriers to rare disease research.

Validation: How Accurate Is Synthetic Patient Data?

Synthetic data is only useful if it is accurate. We measure accuracy using KL (Kullback-Leibler) divergence, a statistical measure of how far one probability distribution departs from another. A score of 0 means the two distributions are identical, lower scores indicate closer alignment, and we treat scores below 0.05 as indicating distributions that are nearly indistinguishable from the real survey data.
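For readers who want to see the calculation, here is a minimal sketch of KL divergence for a single survey question. It assumes the natural logarithm, treats the real survey as the reference distribution P and the synthetic responses as Q, and uses hypothetical response counts. The article does not specify the log base, the direction of the comparison, or how per-question scores are aggregated, so none of this should be read as the exact procedure behind the reported figures.

```python
import math

def normalize(counts: list[int]) -> list[float]:
    """Convert raw response counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p: list[float], q: list[float], eps: float = 1e-12) -> float:
    """D_KL(P || Q) = sum_i p_i * ln(p_i / q_i), in nats.

    Categories with p_i == 0 contribute nothing; eps guards against q_i == 0.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of categories")
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical counts for one 5-point question (not real validation data).
real_counts = [120, 310, 280, 190, 100]
synthetic_counts = [110, 330, 270, 200, 90]

score = kl_divergence(normalize(real_counts), normalize(synthetic_counts))
print(f"KL divergence: {score:.4f}")
print("Below the 0.05 threshold" if score < 0.05 else "At or above the 0.05 threshold")
```

KL divergence is asymmetric, so swapping the two arguments generally gives a different score. Any comparison across studies should therefore fix which distribution is the reference.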

The Patient model has been validated against three published benchmark surveys:

KFF GLP-1 Survey: KL divergence of 0.039, indicating near-identical alignment with survey data on patient medication experiences and health behaviors.
HCAHPS (Hospital Consumer Assessment): KL divergence of 0.091 against 631,000 real patient responses. This score is above the 0.05 threshold but still reflects strong alignment across patient satisfaction and care experience measures at massive scale.
US Pain Foundation Chronic Pain Survey: KL divergence of 0.029, the closest alignment achieved, on a survey covering a hard-to-reach population with complex health needs.

The chronic pain validation is particularly relevant for rare disease applications. Chronic pain patients share many characteristics with rare disease populations: complex conditions, multiple comorbidities, difficulty accessing appropriate care, and underrepresentation in traditional research. A KL divergence of 0.029 against real chronic pain survey data demonstrates the model’s ability to capture the experiences of hard-to-reach patient populations. Full validation reports are available on our validation studies page, and the underlying methodology is detailed on our publications page.

Use Cases for Rare Disease Synthetic Data

Synthetic patient data is not a replacement for clinical trial data or regulatory-grade evidence. It is a tool for the market research, strategy, and patient insights questions that currently go unanswered because recruitment is too hard. Key applications include:

Orphan drug launch planning: Understand patient needs, treatment satisfaction, and switching triggers before committing to a launch strategy. Generate directional reads on patient segments without waiting months for recruitment.
Patient journey mapping: Model the diagnostic odyssey, treatment decision points, and care access barriers that rare disease patients experience. Identify intervention opportunities across the patient journey.
Unmet needs assessment: Quantify the gaps between current treatment options and patient expectations. Prioritize which unmet needs represent the largest opportunities for new therapies or services.
Regulatory and HTA dossier support: Use synthetic patient data as supplementary evidence in health technology assessments and value dossiers. Synthetic data does not replace primary evidence, but it can fill gaps where patient survey data would be expected but is impractical to collect.
Natural history study augmentation: Supplement small natural history datasets with synthetic patient profiles that reflect the broader population characteristics. Useful for early-stage programs where the clinical evidence base is thin.

For a deeper exploration of how synthetic patient data models work and how they relate to traditional patient research, see our Patient Digital Twin white paper.

Limitations and Honest Boundaries

Synthetic data is a powerful tool for rare disease research, but it has boundaries that matter. Being transparent about these limitations is essential for responsible use.

Very rare conditions (<1,000 patients) have less training signal. The model learns from patterns in federal health surveys. Conditions with extremely low prevalence may have minimal or no direct representation in the training data. The model can still draw on related conditions and comorbidity patterns, but outputs for ultra-rare conditions should be interpreted as directional rather than definitive.
Synthetic data is not a replacement for clinical trial data. It should not be used as primary evidence for regulatory submissions, clinical endpoints, or safety assessments. Its strength is in market research, patient experience, and strategic planning — not clinical decision-making.
It works best alongside traditional research. The strongest research programs use synthetic data to screen, explore, and generate hypotheses, then validate the most important findings with real patient data where feasible. Synthetic data expands the questions you can ask; traditional research confirms the answers that matter most.
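One way to carry out the "validate with real patient data" step is a goodness-of-fit check of a small real sample against the synthetic distribution. The sketch below uses a chi-square statistic for that purpose. This is an illustrative choice of test, not a method described in this article, and both the synthetic proportions and the real response counts are hypothetical.

```python
# Chi-square critical value for 3 degrees of freedom at alpha = 0.05.
CRITICAL_VALUE_DF3 = 7.815

def chi_square_statistic(observed: list[int], expected_props: list[float]) -> float:
    """Sum of (observed - expected)^2 / expected across answer categories."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, expected_props))

# Hypothetical synthetic answer shares for one 4-option question.
synthetic_props = [0.25, 0.35, 0.25, 0.15]
# Hypothetical counts from a small real-patient follow-up (n = 80).
real_counts = [18, 30, 22, 10]

stat = chi_square_statistic(real_counts, synthetic_props)
consistent = stat < CRITICAL_VALUE_DF3

print(f"Chi-square statistic: {stat:.3f}")
print("No significant mismatch detected" if consistent
      else "Real sample differs from the synthetic distribution")
```

A result under the critical value means the small real sample does not contradict the synthetic distribution. It does not prove the two are equivalent, especially at sample sizes this small.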

Frequently Asked Questions

Can synthetic data replace patient recruitment for rare disease research?

Synthetic data does not replace clinical trial recruitment, but it can substitute for traditional survey-based market research in rare disease populations. The Simsurveys Patient model is trained on 500,000+ de-identified federal health records and has been validated with KL divergence scores as low as 0.029 against published benchmarks. For market research questions like patient journey mapping, unmet needs assessment, and orphan drug launch planning, synthetic data delivers statistically meaningful results without months-long recruitment timelines.

Is synthetic rare disease patient data HIPAA compliant?

Yes. Synthetic patient data contains no protected health information (PHI) and is not derived from identifiable patient records. The model is trained on de-identified, publicly available federal health datasets (NHIS, BRFSS, MEPS, NHANES, CAHPS, PROMIS). No IRB approval is required, no patient consent is needed, and there are no HIPAA obligations.

How is synthetic patient data for rare diseases validated?

Simsurveys validates its Patient model against published benchmark surveys using KL divergence, a statistical measure of distributional similarity. Results include KL divergence of 0.039 against KFF GLP-1 survey data, 0.091 against 631,000 HCAHPS responses, and 0.029 against US Pain Foundation chronic pain data. Scores below 0.05 indicate near-identical distributions. Full validation reports are published openly.

What rare disease use cases does synthetic patient data support?

Key applications include orphan drug launch planning, patient journey mapping, unmet needs assessment, regulatory and HTA dossier support (as supplementary evidence), natural history study augmentation, and early-stage market sizing. Synthetic data is most effective for conditions with some representation in federal health datasets and works best when used alongside traditional research methods.

Does synthetic data work for extremely rare diseases with fewer than 1,000 patients?

For ultra-rare conditions, synthetic data has less training signal because these populations are underrepresented in federal health surveys. The model is strongest for rare diseases affecting thousands to tens of thousands of patients. For ultra-rare conditions, it can still provide directional signal on broader disease-area attitudes and experiences, but results should be interpreted with appropriate caution and supplemented with whatever real patient data is available.

Getting Started

The Simsurveys Patient model is available now for rare disease market research, patient journey mapping, and orphan drug launch planning. You can create a free account and run your first synthetic patient study without a panel partner, without an IRB, and without the months-long recruitment timelines that have historically made rare disease patient research impractical.

For more on the Patient model’s methodology, see the Patient Digital Twin white paper. For validation details, browse our validation studies or published papers.
