HIPAA, IRB, and Synthetic Patient Data: What Pharma Researchers Need to Know

Synthetic survey data generated from public federal datasets is not PHI, does not require IRB approval, and eliminates weeks of compliance overhead. Here is the full regulatory picture.

Compliance · March 31, 2026 · Myles Friedman · 6 min read

Every pharma researcher who works with patient data knows the drill: HIPAA compliance reviews, IRB submissions, privacy impact assessments, data use agreements. These safeguards exist for good reason — but when they add 8–16 weeks to a project timeline and tens of thousands of dollars in legal and administrative costs, most research questions never get asked. The compliance burden is not just a procedural inconvenience. It is a structural barrier to faster, more iterative insights work.

Synthetic survey data changes the equation entirely. When the data is generated by AI models trained on publicly available, de-identified federal datasets — and no human subjects are involved at any stage — the regulatory framework that governs traditional patient research simply does not apply. No PHI. No human subjects. No HIPAA. No IRB.

Why Synthetic Patient Data Is Not Subject to HIPAA

HIPAA protects protected health information (PHI): individually identifiable health information held by covered entities and their business associates. The key phrase is “individually identifiable.” For data to be PHI, it must relate to a specific individual’s health condition, care, or payment for care, and it must either identify that individual or provide a reasonable basis for identifying them.

Synthetic survey data generated by Simsurveys meets none of these criteria. The Patient model is trained on 500,000+ de-identified records drawn from six federal datasets: the National Health Interview Survey (NHIS), the Behavioral Risk Factor Surveillance System (BRFSS), the Medical Expenditure Panel Survey (MEPS), the National Health and Nutrition Examination Survey (NHANES), the Consumer Assessment of Healthcare Providers and Systems (CAHPS), and the Patient-Reported Outcomes Measurement Information System (PROMIS).

Every one of these datasets is publicly available. Every one has been de-identified by the federal agencies that administer them. The data has already passed through the most rigorous de-identification standards in healthcare research. When Simsurveys trains models on these datasets, the models learn population-level statistical distributions — the probability that a 60-year-old with hypertension also reports chronic fatigue, for example — not individual patient records. The synthetic responses that the model generates are new data points drawn from those learned distributions. They do not correspond to any real person. They are not PHI.
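To make the distinction between learning a distribution and copying a record concrete, here is a minimal sketch in Python. It is a toy illustration only, not a description of the Simsurveys pipeline: the records, the function names (conditional_probability, sample_synthetic), and the single-variable setup are all hypothetical. It estimates one conditional probability from aggregate counts and then draws new responses from that probability alone.

```python
import random

# Hypothetical de-identified records: (age_band, has_hypertension, reports_fatigue).
# These values are invented for illustration; they come from no real dataset.
records = [
    ("60-69", True, True),
    ("60-69", True, False),
    ("60-69", True, True),
    ("60-69", False, False),
    ("60-69", False, False),
    ("60-69", False, True),
    ("60-69", True, True),
    ("60-69", False, False),
]


def conditional_probability(rows, age_band, has_hypertension):
    """Estimate P(fatigue | age_band, hypertension) from aggregate counts."""
    matching = [r for r in rows if r[0] == age_band and r[1] == has_hypertension]
    if not matching:
        raise ValueError("no records in this stratum")
    return sum(1 for r in matching if r[2]) / len(matching)


def sample_synthetic(p_fatigue, n, seed=0):
    """Draw n new yes/no responses using only the learned probability."""
    rng = random.Random(seed)
    return [rng.random() < p_fatigue for _ in range(n)]


# The "model" here is a single number; no source row is carried forward.
p = conditional_probability(records, "60-69", True)
synthetic = sample_synthetic(p, n=1000)

print(f"learned P(fatigue | 60-69, hypertension) = {p:.2f}")
print(f"share of synthetic respondents reporting fatigue = {sum(synthetic) / len(synthetic):.3f}")
```

In this sketch the sampler never sees the source rows, only the estimated probability, which is the property the paragraph above describes. A production model would learn many such relationships jointly rather than one at a time.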

The Healthcare (HCP) model operates on the same principle. It is trained on de-identified physician demographics and prescribing data. No patient records are involved in the HCP model at any stage.

Why No IRB Approval Is Required

Institutional Review Boards exist to protect human research subjects. The Common Rule (45 CFR 46) defines a human subject as a living individual about whom an investigator conducting research either obtains information or biospecimens through intervention or interaction with the individual, or obtains, uses, studies, analyzes, or generates identifiable private information or identifiable biospecimens.

Synthetic survey data involves no intervention, no interaction, and no identifiable private information. No living individual participates in the research. No one is surveyed, interviewed, observed, or contacted. The AI model generates responses based on statistical patterns learned from already-public, already-de-identified data. Because there are no human subjects, there is no IRB jurisdiction.

For pharma research teams, this is not a minor procedural shortcut. IRB submissions routinely take 4–12 weeks for initial review, and amendments can add further weeks. Eliminating the IRB requirement removes one of the largest fixed delays in the patient insights workflow.

The compliance advantage: Synthetic patient survey data from Simsurveys requires no HIPAA compliance review, no IRB submission, no informed consent forms, no data use agreements, and no privacy impact assessment. Research that would take 3–4 months to clear compliance can begin immediately.

The Re-Identification Question

The most common question from skeptics is a reasonable one: “What if the synthetic data accidentally re-identifies a real patient?” This concern applies to some forms of synthetic data, specifically approaches that synthesize new records from individual-level datasets by adding noise or perturbation to real records. In those cases, re-identification risk is a legitimate consideration.

Simsurveys does not work this way. The models learn population-level statistical distributions from the de-identified training data rather than storing or perturbing individual records. The training pipeline learns aggregate conditional probabilities (the relationship between age, condition, treatment, and reported outcomes across hundreds of thousands of records) and generates entirely new data points from those distributions. There is no mechanism to trace a synthetic response back to a specific individual because no specific individual’s record exists in the model. The architecture is fundamentally different from record-level synthesis, and the re-identification risk profile is correspondingly different: effectively zero.

For a detailed description of data handling and security practices, see the Simsurveys security and compliance page and privacy policy.

Sunshine Act: Not Applicable

The Physician Payments Sunshine Act (Open Payments) requires pharmaceutical and medical device companies to report payments and other transfers of value to physicians and teaching hospitals. Traditional physician surveys can trigger Sunshine Act reporting because they involve honoraria, which are direct payments to physicians in exchange for their time.

Synthetic HCP surveys involve no payments to physicians. No physician participates, no honorarium is paid, and no transfer of value occurs. There is no Sunshine Act reporting obligation. For pharma companies that run dozens of physician studies per year, eliminating Sunshine Act tracking and reporting for directional research represents a meaningful reduction in administrative overhead.

GDPR Considerations

Synthetic data generated from U.S. public federal datasets contains no personal data of EU residents and has no GDPR implications. The training data is sourced exclusively from U.S. federal surveys covering U.S. populations. The generated data does not relate to identified or identifiable natural persons under GDPR’s definition.

For research that specifically targets EU populations or uses EU-sourced training data, different regulatory considerations would apply. Simsurveys’ current models are trained on U.S. data and are designed for U.S. population research.

ESOMAR Transparency Standards

ESOMAR, the global market research industry association, recommends transparency when synthetic data is used in research. Its guidance emphasizes that researchers should clearly disclose when data is synthetically generated, describe the methodology and training data sources, and provide validation evidence.

Simsurveys publishes full methodology documentation, including training data sources, model architecture descriptions, and validation study results. Every validation study includes question-level statistical metrics so that users can evaluate accuracy for their specific use case. This level of transparency exceeds what most traditional survey vendors provide about their panel composition and quality control processes.

Validation Evidence

Regulatory clarity means nothing if the data is not accurate. The Simsurveys Patient model has been validated against three published benchmark surveys:

  • KFF GLP-1 Survey: KL divergence of 0.039 — measuring attitudes and experiences with GLP-1 medications against a nationally representative sample.
  • HCAHPS (Hospital Consumer Assessment): KL divergence of 0.091 against 631,000 real patient responses — the largest validation dataset in the synthetic survey industry.
  • US Pain Foundation Survey: KL divergence of 0.029 — capturing chronic pain patient experiences, treatment satisfaction, and quality of life measures.

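KL divergence measures how far one probability distribution departs from another: it is zero when the two are identical and grows as they diverge. In these validation studies, a KL divergence below 0.10 is taken to indicate closely matching distributions; at 0.029, the synthetic and real distributions are nearly indistinguishable. Full validation reports are available on the validation studies page.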
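For readers who want to see what a KL divergence score measures, here is a minimal sketch in Python for a single survey question. The answer shares below are made up for illustration. The direction (real relative to synthetic) and the natural-log base are assumptions, since the validation summaries above do not state which convention was used or how question-level scores were combined.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) = sum_i p_i * ln(p_i / q_i), in nats.

    p and q are lists of answer shares for the same question, each summing to 1.
    Terms where p_i is 0 contribute nothing; q_i is floored at eps so that an
    answer option missing from q does not cause a division by zero.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of categories")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0:
            total += p_i * math.log(p_i / max(q_i, eps))
    return total


# Hypothetical shares for one four-option question (not from any real study).
real = [0.10, 0.20, 0.40, 0.30]
synthetic = [0.12, 0.18, 0.38, 0.32]

score = kl_divergence(real, synthetic)
print(f"KL(real || synthetic) = {score:.4f}")
print(f"KL(real || real) = {kl_divergence(real, real):.4f}")
```

With answer shares that differ by two percentage points per option, as in this made-up example, the score comes out well under the 0.10 threshold, and comparing a distribution with itself gives exactly zero. Note that KL divergence is not symmetric, so swapping the two arguments generally gives a slightly different number.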

Frequently Asked Questions

Is synthetic patient data subject to HIPAA?

No. Synthetic survey data generated from publicly available, de-identified federal datasets does not contain protected health information and is not subject to HIPAA. The data is statistically generated from population-level models, not derived from individual patient records.

Does synthetic patient survey data require IRB approval?

No. IRB review is required when research involves human subjects. Synthetic survey data is generated by AI models, not collected from human participants. There are no human subjects, no informed consent requirements, and no IRB submission needed.

Can synthetic data accidentally re-identify a real patient?

No. Simsurveys models learn population-level statistical distributions from de-identified federal datasets. They capture aggregate patterns rather than memorizing any individual’s data. There is no mechanism for re-identification because the training data is already de-identified and the models retain no individual records.

Does the Sunshine Act apply to synthetic HCP surveys?

No. Synthetic HCP surveys do not involve any payments to physicians — no honoraria, no incentives, no transfers of value. There is no Sunshine Act reporting obligation.

Does GDPR apply to synthetic survey data?

Synthetic data generated from U.S. public federal datasets contains no personal data of EU residents and has no GDPR implications. For research involving EU populations or EU-sourced data, different considerations may apply.

How accurate is synthetic patient survey data?

In validation studies against published benchmarks, the Simsurveys Patient model achieved KL divergence scores of 0.029–0.091. The HCAHPS validation was conducted against 631,000 real patient responses. Full validation reports with question-level metrics are published on the validation studies page.

Getting Started

The Simsurveys Patient model is trained on 500,000+ de-identified records from six federal datasets and delivers results in minutes. The Healthcare model covers 15+ physician specialties with no Sunshine Act reporting required. You can create a free account and run your first synthetic study without a HIPAA review, without an IRB submission, and without a compliance delay.

For more on our security and data practices, see the security and compliance page and privacy policy.
