Reference

Technical Specifications

Exact metrics, thresholds, and study rules for reproducibility.

Study Design: Paired Comparisons

Every validation study follows a paired-comparison protocol. A live panel dataset is fielded and a synthetic dataset is generated from the same survey instrument under identical conditions: the same quotas, the same demographic targets, and the same question wording.

The synthetic dataset is generated without access to live results. Model version, generation parameters, and prompt templates are frozen before generation begins and recorded for full reproducibility.

Encoding Rules & Leakage Controls

Each question type has specific encoding and comparison rules designed to prevent information leakage between the live results and the synthetic generation pipeline. The team that designs the survey instrument does not have access to the generation pipeline, and the team that generates synthetic data does not see live results until generation is complete.

Multi-select encoding: Each option in a multi-select question is treated as an independent binary variable and compared separately. Per-option comparison prevents inflating agreement scores by averaging across options and ensures the model is evaluated on each choice independently.
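A minimal sketch of this per-option encoding (the option labels and the dict-based row format are illustrative, not the production schema):

```python
def encode_multiselect(selections, options):
    """Expand one respondent's multi-select answer into per-option binaries."""
    chosen = set(selections)
    return {opt: int(opt in chosen) for opt in options}

# Hypothetical media-consumption question with three options
options = ["TV", "Streaming", "Podcasts"]
row = encode_multiselect(["TV", "Podcasts"], options)
print(row)  # {'TV': 1, 'Streaming': 0, 'Podcasts': 1}
```

Each binary column is then compared between live and synthetic datasets on its own, so one well-matched option cannot mask a poorly matched one.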

Metrics & Pass/Fail Standards

The table below defines the primary metrics, passing thresholds, and relevant notes for each question type.

Question Type      | Metrics                             | Standards                                | Notes
Single-choice      | KL-Divergence, JS-Divergence        | KL < 0.10, JS < 0.05                     | Likert collapsing permitted
Multi-choice       | JS-Divergence, Spearman, Top-K      | JS < 0.05, Spearman > 0.75, Top-K > 0.8  | Per-option binaries
Numeric (binned)   | KL-Divergence, JS-Divergence        | KL < 0.10, JS < 0.05                     | Bins pre-declared
Percent-allocation | KL-Divergence, JS-Divergence, Top-K | KL < 0.10, JS < 0.05, Top-K > 0.8        | Dominant allocations
Ranking            | Spearman, Top-K                     | Spearman > 0.75, Top-K > 0.8             | K scales with list length
Text responses     | BERTScore F1, Optimal Matching      | F1 > 0.75, OMS > 0.75                    | Semantic similarity
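The divergence checks above can be computed directly from the two response distributions. A minimal sketch, assuming natural-log units (the document does not state the log base) and illustrative single-choice shares:

```python
import math

def kl(p, q):
    """KL divergence D(p || q) in nats; assumes matching support and q > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence: symmetrised KL against the 50/50 mixture."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

live = [0.42, 0.31, 0.18, 0.09]       # illustrative single-choice shares
synthetic = [0.40, 0.33, 0.17, 0.10]
print(kl(live, synthetic) < 0.10, js(live, synthetic) < 0.05)
```

A question passes the single-choice standard only when both comparisons print True.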

Reporting Conventions

All validation reports follow a standardized structure for comparability across studies.

  • Full distribution tables: Live and synthetic distributions side-by-side for every question, with percentage point differences.
  • Sample sizes: Live and synthetic sample sizes (N) reported per question and per subgroup.
  • Metric summary: Pass/fail status for each metric on each question, with computed value and threshold.
  • Confidence intervals: 95% bootstrap confidence intervals on all key metrics.
  • Subgroup breakdowns: Key demographic subgroups analyzed separately where sample sizes permit (N ≥ 50).
  • Failure documentation: Questions that fail any metric are flagged with root cause analysis.
  • Model provenance: Model version, generation date, and configuration parameters recorded in every report.

Implementation Specifications

The following specifications define exact parameters used in metric computation and reporting.

Top-K defaults: K is set to 3 for questions with 5-8 options, 5 for questions with 9-15 options, and 7 for questions with 16+ options. Custom K values may be pre-declared before analysis for specific research needs.
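The K-selection rule above maps directly to a small lookup. This sketch raises an error below 5 options, since the document does not define a default there:

```python
def default_k(n_options):
    """Default K per the spec; custom K values may be pre-declared instead."""
    if n_options >= 16:
        return 7
    if n_options >= 9:
        return 5
    if n_options >= 5:
        return 3
    # Behaviour for questions with fewer than 5 options is unspecified
    raise ValueError("Top-K defaults are defined for questions with 5+ options")
```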

Spearman for multi-select: When computing Spearman correlation for multi-select questions, the selection rate for each option is used as the ranking variable. Options are ranked by selection frequency in both live and synthetic datasets independently.
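A sketch of that computation using the classic rank-difference formula; it assumes no tied selection rates (tie handling via average ranks is omitted for brevity), and the rates shown are illustrative:

```python
def ranks(values):
    """Rank values descending (1 = most-selected); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    """Spearman rho via the rank-difference formula (tie-free case)."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

live_rates = [0.61, 0.44, 0.30, 0.12]    # illustrative per-option selection rates
synth_rates = [0.58, 0.47, 0.28, 0.14]
print(spearman(live_rates, synth_rates) > 0.75)  # True: option ordering matches
```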

Confidence intervals: All CIs are computed using 1,000 bootstrap resamples with bias-corrected and accelerated (BCa) adjustment. The 95% CI is reported as the primary interval; 90% and 99% intervals are available on request.
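For orientation, a simplified percentile bootstrap with 1,000 resamples is sketched below; note it omits the BCa bias and acceleration adjustments the spec calls for, and the sample values are illustrative:

```python
import random
import statistics

def bootstrap_ci(sample, stat=statistics.mean, n_resamples=1000, alpha=0.05, seed=42):
    """Percentile bootstrap CI (simplified; no BCa adjustment)."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    n = len(sample)
    estimates = sorted(
        stat([rng.choice(sample) for _ in range(n)]) for _ in range(n_resamples)
    )
    lo = estimates[int((alpha / 2) * n_resamples)]
    hi = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

lo, hi = bootstrap_ci([0.51, 0.48, 0.55, 0.47, 0.52, 0.50, 0.49, 0.53])
print(round(lo, 3), round(hi, 3))
```

In practice a library implementation with `method="BCa"` (e.g. SciPy's bootstrap routine) would replace this sketch.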

Minimum subgroup size: Subgroup analyses are reported only when both the live and synthetic subgroup have N ≥ 50. Subgroups with 30-49 respondents are reported as directional only. Subgroups below 30 are excluded.
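The three reporting tiers reduce to a small rule on the smaller of the two subgroup sizes, since both datasets must qualify:

```python
def subgroup_tier(n_live, n_synth):
    """Classify a subgroup per the minimum-size rules."""
    n = min(n_live, n_synth)      # both live and synthetic must meet the bar
    if n >= 50:
        return "reported"
    if n >= 30:
        return "directional only"
    return "excluded"

print(subgroup_tier(80, 55))   # reported
print(subgroup_tier(45, 60))   # directional only
print(subgroup_tier(100, 20))  # excluded
```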

Reproducibility metadata: Every study records: model version ID, prompt template hash, generation timestamp, random seed, temperature setting, top-p value, and demographic quota specification. This metadata is sufficient to reproduce the synthetic dataset exactly.

Domain-Specific AI Architecture

Simsurveys operates specialized models rather than a single general-purpose system. Each model is optimized for its research domain.

Specialized Models: Three production models serve distinct research domains: Consumer & Market Research, Healthcare & HCP Research, and Social Research. Each model is independently trained, validated, and versioned.

Training Data: Models are trained on curated, validated population data specific to each domain. Consumer models use validated panel data. Healthcare models are augmented with U.S. physician-level prescription data across 15 medical specialties. Social research models incorporate national survey data with geographic structure.

Foundation Technology: Models are built on fine-tuned large language models with domain-specific instruction tuning, demographic conditioning, and response distribution calibration. The architecture ensures that outputs are probability distributions, not single-point predictions.

Infrastructure & Performance

Production infrastructure is designed for reliability, speed, and scale.

Multi-GPU inference: Production models run on distributed GPU clusters with automatic load balancing. Generation workloads are parallelized across respondents for maximum throughput.

Edge computing: The Oracle real-time query system uses edge-deployed model shards for sub-200ms response times. Models are replicated across geographic regions for latency optimization.

Real-time serving: The Oracle processes single demographic queries with a median latency of 187ms. Batch crosstab queries complete in 2-5 seconds depending on segment count.

Delivery performance: Full research studies (1,000 respondents, 25-question survey) complete generation in under 5 minutes. Large studies (10,000+ respondents) complete in under 15 minutes.

Data Formats & Export

Simsurveys exports respondent-level data in standard research formats.

File formats: CSV (UTF-8), SPSS (.sav with full variable labels and value labels), and Excel (.xlsx with formatted headers). All exports include respondent-level data with demographic variables and response codes.

Variable structure: Each respondent is a row. Each question generates one or more columns depending on type. Multi-select questions generate one binary column per option. Rankings generate one column per rank position. Metadata columns include respondent ID, generation timestamp, and demographic variables.
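A sketch of how the export header is assembled under these rules; the column-naming scheme (`Q2_tv`, `Q3_rank1`) and metadata field names are illustrative, not the documented format:

```python
def columns_for(question_id, qtype, options=None, n_ranks=None):
    """Column names one question contributes to the export (naming illustrative)."""
    if qtype == "single":
        return [question_id]                                   # one coded column
    if qtype == "multi":
        return [f"{question_id}_{opt}" for opt in options]     # one binary per option
    if qtype == "ranking":
        return [f"{question_id}_rank{i}" for i in range(1, n_ranks + 1)]
    raise ValueError(f"unknown question type: {qtype}")

# Metadata columns first, then one block of columns per question
header = ["respondent_id", "generated_at", "age_group"]
header += columns_for("Q1", "single")
header += columns_for("Q2", "multi", options=["tv", "stream", "podcast"])
header += columns_for("Q3", "ranking", n_ranks=3)
print(header)
```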

Analysis outputs: Automated crosstabs, statistical significance testing, charts, visualizations, and executive summary reports. Crosstabs support unlimited banner points and nested demographic breaks.

Quality Assurance & Validation

Automated quality checks run on every generated dataset before delivery.

Response consistency: Each synthetic respondent is checked for internal consistency. Skip logic compliance, logical response patterns, and demographic-response coherence are validated automatically. Respondents failing consistency checks are regenerated.

Distribution monitoring: Generated response distributions are compared against expected population parameters in real time. Statistically significant deviations from expected distributions trigger automatic review and potential regeneration.

Quota compliance: Demographic quotas are enforced at the respondent level. Final datasets are verified against quota specifications with zero tolerance for quota shortfalls. Oversampled cells are trimmed to exact targets.
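A minimal sketch of trimming oversampled cells to exact targets. The keep-earliest policy shown here is an assumption; the document does not specify which respondents are dropped:

```python
def trim_to_quota(respondents, quota):
    """Drop oversampled respondents so each demographic cell hits its target.

    Keeps the earliest-generated respondents per cell (illustrative policy).
    """
    counts = {}
    kept = []
    for r in respondents:
        cell = r["cell"]
        counts[cell] = counts.get(cell, 0) + 1
        if counts[cell] <= quota[cell]:
            kept.append(r)
    return kept

quota = {"18-34": 2, "35-54": 1}  # illustrative quota specification
respondents = [{"id": i, "cell": c} for i, c in
               enumerate(["18-34", "35-54", "18-34", "18-34", "35-54"])]
print(len(trim_to_quota(respondents, quota)))  # 3
```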

Limitations & Ongoing Work

Current technical limitations are actively being addressed in ongoing development.

  • Language support: Models currently support English-language surveys only. Multilingual support is in development.
  • Cultural context: Models are primarily trained on U.S. population data. International deployments require separate validation.
  • Question complexity: Very long surveys (>100 questions) may show reduced response consistency in later questions. We recommend surveys of 50 questions or fewer for optimal performance.
  • Real-time knowledge: Models reflect their training data period and do not have access to real-time information. Recalibration cycles run every 6-12 months.
  • Rare populations: Demographic segments representing <2% of the population may have limited representation in training data, resulting in higher uncertainty.

Ready to test the specifications?

Generate your first dataset and validate against your own benchmarks.