The Wrong Comparison
Most discussions of synthetic survey data start from the wrong premise. They ask: "How close is this to the real data?" The implicit assumption is that a live survey produces ground truth and that synthetic data is an approximation of it. This framing sounds reasonable, but it is incorrect, and it leads to validation standards that are too lenient or too strict for the wrong reasons.
Live survey results are not ground truth. They are measurements — and like all measurements, they are subject to error. Survey methodology has spent decades cataloging these errors under the framework of Total Survey Error: sampling error, coverage error, nonresponse error, and measurement error. Every live survey result is a function of who was sampled, who responded, how questions were worded, and what mode was used to collect answers. Change any of those parameters and the "ground truth" changes with it.
Synthetic Data as Model-Based Estimators
The more productive framing is to treat synthetic survey data as model-based estimators of opinion structure. A synthetic respondent system does not attempt to reproduce the exact dataset that a specific live survey would generate. Instead, it estimates the expected distribution of responses for a given population, question, and context — the signal that a well-designed live survey is also trying to measure, net of its own error sources.
This reframing matters because it changes the validation question. Instead of asking "Does synthetic data match live data?" we ask "Does the synthetic estimator approximate the underlying opinion structure as well as or better than a single live survey measurement?" In some cases, particularly when live surveys suffer from high nonresponse or satisficing, the answer may be yes.
Key insight: Synthetic estimates can borrow strength across correlated variables through shrinkage and partial pooling, reducing noise that affects individual survey items. This is the same statistical principle that makes multilevel models outperform raw cell means in small-sample survey analysis.
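To make the shrinkage idea concrete, here is a minimal sketch in Python. The estimator, cell sizes, and between-cell variance are illustrative assumptions, not part of the Simsurveys implementation; the sketch only shows how precision weighting pulls a noisy small-cell mean toward the grand mean while leaving a large cell mostly alone.

```python
import numpy as np

def partially_pooled_mean(cell_values, grand_mean, tau2):
    """Shrink a cell's raw mean toward the grand mean.

    Uses the standard precision-weighting rule from multilevel models:
    small, noisy cells are pulled strongly toward the grand mean, while
    large cells keep most of their own signal. tau2 is the assumed
    between-cell variance.
    """
    n = len(cell_values)
    raw_mean = np.mean(cell_values)
    sigma2 = np.var(cell_values, ddof=1) / n   # sampling variance of the raw cell mean
    weight = tau2 / (tau2 + sigma2)            # shrinkage factor in [0, 1]
    return weight * raw_mean + (1 - weight) * grand_mean

rng = np.random.default_rng(0)
grand_mean = 3.2                               # overall mean on a 1-5 scale
small_cell = rng.normal(3.2, 1.0, size=8)      # 8 respondents: noisy raw mean
large_cell = rng.normal(3.2, 1.0, size=400)    # 400 respondents: stable raw mean

print(partially_pooled_mean(small_cell, grand_mean, tau2=0.05))  # pulled close to 3.2
print(partially_pooled_mean(large_cell, grand_mean, tau2=0.05))  # stays near its raw mean
```

A model-based estimator that shares information across correlated items behaves the same way: it damps item-level sampling noise that a single live measurement cannot.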
Question-Type-Aware Metrics
Not all survey questions should be evaluated the same way. A binary yes/no question, a five-point Likert scale, a rank-order list, and a multi-select checkbox each produce different types of distributions with different properties. Using a single metric across all question types conflates distinct sources of divergence.
Our validation framework uses question-type-aware metrics. For single-select questions — including binary, nominal, and ordinal items — we use KL divergence to measure the information-theoretic distance between live and synthetic response distributions. KL divergence captures both the direction and magnitude of distributional differences, penalizing synthetic distributions that place probability mass where the live distribution does not.
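As a minimal sketch (the function name, smoothing constant, and direction of the divergence are our choices, not the published framework), KL divergence can be computed directly from the response proportions of a single-select item:

```python
import numpy as np

def kl_divergence(q_synth, p_live, eps=1e-9):
    """KL(synthetic || live) for a single-select item.

    Inputs are response proportions over the same answer options.
    This direction heavily penalizes the synthetic distribution for
    placing probability mass on options the live data barely supports.
    A small epsilon keeps the log finite when a cell is zero.
    """
    q = np.asarray(q_synth, dtype=float) + eps
    p = np.asarray(p_live, dtype=float) + eps
    q, p = q / q.sum(), p / p.sum()
    return float(np.sum(q * np.log(q / p)))

# Example: five-point Likert item, share of respondents per option
live  = [0.10, 0.22, 0.30, 0.25, 0.13]
synth = [0.08, 0.20, 0.34, 0.26, 0.12]
print(kl_divergence(synth, live))   # ~0.006: very close agreement
```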
For multi-select questions, where respondents choose all options that apply, we use Rank-Biased Overlap (RBO). Multi-select items produce selection frequency rankings rather than probability distributions, and RBO is designed specifically to compare ranked lists with different lengths and varying overlap depths. This avoids the distortions that occur when forcing multi-select data into a single-select metric framework.
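RBO comes in truncated and extrapolated variants; the sketch below implements only the simple truncated form, with illustrative option names, to show how top-weighted agreement between two selection-frequency rankings is scored. It is not the framework's implementation.

```python
def rbo_truncated(ranking_a, ranking_b, p=0.9):
    """Truncated Rank-Biased Overlap between two ranked option lists.

    ranking_a / ranking_b: options ordered by selection frequency
    (most-selected first), with no duplicates. p is the persistence
    parameter: higher p spreads weight deeper into the lists, lower p
    concentrates it near the top. Higher scores mean closer agreement.
    """
    depth = min(len(ranking_a), len(ranking_b))
    seen_a, seen_b = set(), set()
    weighted_agreement = 0.0
    for d in range(1, depth + 1):
        seen_a.add(ranking_a[d - 1])
        seen_b.add(ranking_b[d - 1])
        overlap = len(seen_a & seen_b) / d          # agreement at depth d
        weighted_agreement += (p ** (d - 1)) * overlap
    return (1 - p) * weighted_agreement

# Multi-select item: options ranked by how often they were chosen
live_rank  = ["price", "efficacy", "side effects", "dosing", "brand"]
synth_rank = ["efficacy", "price", "side effects", "brand", "dosing"]
print(rbo_truncated(live_rank, synth_rank))   # ~0.29 of a truncated maximum of ~0.41
```

Because the sum stops at the shorter list's length, this lower-bounds the full RBO; production implementations typically add an extrapolated term for the unseen tail.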
When Synthetic Estimates May Outperform
One of our case studies — a physician sarcopenia awareness survey — illustrates a scenario where synthetic estimates may actually reduce noise present in the live data. The live survey showed classic satisficing patterns on low-salience items: respondents selecting the midpoint or first available option rather than carefully evaluating their answer. The synthetic estimates, which are not subject to respondent fatigue or satisficing, produced distributions that were smoother and more consistent with the overall response patterns across related items.
We make this claim deliberately and conservatively. Synthetic data approximates expected distributions — it does not reveal objective truth. But in specific, identifiable scenarios, the model-based estimate may be closer to the true opinion structure than a single noisy live measurement.
When Live Data Remains Essential
The framework is equally clear about limitations. Live data remains essential for novel topics where the model has no training signal, for populations not well-represented in training data, for regulatory submissions requiring primary data collection, and for research questions where the precise measurement context (time, mode, framing) is part of the research question itself. Synthetic data is a complement to live research, not a wholesale replacement.
Best Practices
Researchers evaluating synthetic survey data should consider distributional divergence at the question level, not just aggregate accuracy. A model that performs well on average but fails on specific question types is not safe to use without qualification. We recommend reporting KL divergence for every single-select item and RBO for every multi-select item, with explicit thresholds: below 0.05 is "Excellent," below 0.15 is "Good," and above 0.15 warrants caution or exclusion.
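A small reporting helper might apply those bands as follows. Converting RBO, which is a similarity in [0, 1], to a divergence-style score via 1 - RBO is our assumption, made only so a single set of thresholds can be illustrated for both question types.

```python
def classify_item(score, is_similarity=False):
    """Map an item-level score to the reporting bands described above.

    KL divergence is used as-is. For similarity metrics such as RBO,
    pass is_similarity=True; the score is converted to 1 - score before
    the thresholds are applied (an assumption, not part of the
    published framework).
    """
    divergence = 1.0 - score if is_similarity else score
    if divergence < 0.05:
        return "Excellent"
    if divergence < 0.15:
        return "Good"
    return "Caution / exclude"

print(classify_item(0.006))                      # KL item  -> "Excellent"
print(classify_item(0.88, is_similarity=True))   # RBO item -> "Good"
```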
The full validation framework, including worked examples, metric definitions, and case study analyses, is available in our white paper (PDF). For published validation results across all Simsurveys models, visit our validation studies page.