Synthetic Data Generation


Why Synthetic Data for openIMIS?

Synthetic data generation is essential for two main reasons:

  1. Performance Evaluation

    • Several implementations report performance issues when openIMIS scales up.

    • These issues are difficult to replicate on demo, release, or testing environments due to the absence of large, realistic datasets.

    • Synthetic data allows developers and implementers to simulate large-scale operations (millions of insurees, claims, policies, etc.) to identify bottlenecks and optimize performance.

  2. Interoperability Testing

    • Synthetic datasets can be used to simulate cross-system data flows in interoperability use cases (e.g., openCRVS ↔ openIMIS, FHIR-based exchanges).

    • Provides sample data for testing integration mediators (openHIM, FHIR APIs) without exposing real personal health information.

 

Key Objects for Data Generation

The main objects that should be covered include:

  • Insuree (individuals and households)

  • Policy and InsureePolicy (covering family/individual enrollment)

  • Claim (facility-based service utilization at different submission levels)

  • Medical Services & Items (linked to Master Product List where available)

By covering these, we ensure end-to-end workflows are tested: enrollment, policy management, claim adjudication, and reporting.



Proposed Methodology & Tools

Design Principles

  • Low-level generation via Django ORM:

    • Avoids performance limitations observed when generating data via FHIR APIs (as highlighted by Cameroon’s experience).

    • Ensures compatibility with both PostgreSQL and SQL Server, without database-specific scripts.

  • Realism through predefined data sets:

    • Use predefined data sets to generate names, addresses, dates, diagnoses, etc.

    • Add configurable demographic distributions (e.g., urban/rural ratio, disease prevalence) to mimic realistic populations.

  • Configurable Volume and Scale:

    • Allow generation of datasets of varying sizes (e.g., 10k → 1M insurees).

    • Support modular data creation (insurees only, claims only, or full system simulation).
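As an illustration of the "realism through predefined data sets" principle, the sketch below (a hypothetical helper, not part of the openIMIS codebase) draws insuree attributes from fixed reference lists and a configurable urban/rural ratio:

```python
import random

# Predefined reference data (illustrative values only)
FIRST_NAMES = ["Amina", "Jean", "Fatou", "Peter", "Grace", "Ibrahim"]
LAST_NAMES = ["Mbarga", "Okonkwo", "Diallo", "Nkemelu", "Abena"]

def generate_insurees(count, urban_ratio=0.4, seed=42):
    """Generate `count` synthetic insuree records with a configurable
    urban/rural split, using only predefined reference lists."""
    rng = random.Random(seed)  # seeded for reproducible datasets
    insurees = []
    for i in range(count):
        insurees.append({
            "chf_id": f"SYN{i:08d}",  # synthetic insurance number
            "given_name": rng.choice(FIRST_NAMES),
            "last_name": rng.choice(LAST_NAMES),
            "dob_year": rng.randint(1940, 2022),
            "area": "urban" if rng.random() < urban_ratio else "rural",
        })
    return insurees
```

Seeding the generator makes runs reproducible, which matters when the same dataset must be regenerated for repeatable performance profiling.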

Technical Workflow

  1. Django Management Command

    • Command: python manage.py bulk_generate_insurees --families 10000 --claims 2

    • Uses ORM to populate objects with integrity checks (foreign keys, policy linkage, etc.).

    • The generator accepts the following parameters to customize bulk synthetic data generation:

| Parameter | Type | Default | Description |
|---|---|---|---|
| --preset | Choice | None | Predefined configurations: small, medium, large |
| --families | Integer | 1000 | Number of families to generate |
| --members | Integer | 4 | Fixed number of members per family |
| --claims | Integer | 0 | Number of claims to generate per insuree (0 = no claims) |
| --batch-size | Integer | 2000 | Batch size for bulk operations |
| --no-confirm | Flag | False | Skip confirmation prompt |
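The real command declares these options in a Django management command's `add_arguments()`; as a standalone sketch of the same parameter surface (using plain `argparse`, with the defaults taken from the table above):

```python
import argparse

def build_parser():
    """Sketch of the generator's CLI surface; the real implementation
    would declare these in a Django BaseCommand's add_arguments()."""
    parser = argparse.ArgumentParser(prog="bulk_generate_insurees")
    parser.add_argument("--preset", choices=["small", "medium", "large"],
                        default=None, help="Predefined configuration")
    parser.add_argument("--families", type=int, default=1000,
                        help="Number of families to generate")
    parser.add_argument("--members", type=int, default=4,
                        help="Fixed number of members per family")
    parser.add_argument("--claims", type=int, default=0,
                        help="Claims per insuree (0 = no claims)")
    parser.add_argument("--batch-size", type=int, default=2000,
                        help="Batch size for bulk operations")
    parser.add_argument("--no-confirm", action="store_true",
                        help="Skip confirmation prompt")
    return parser
```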

Preset example (predefined dataset configurations):

```python
presets = {
    'small':  {'families': 1000,  'members': 3, 'claims': 1},
    'medium': {'families': 10000, 'members': 4, 'claims': 2},
    'large':  {'families': 50000, 'members': 5, 'claims': 2},
}
```

Usage examples:

```shell
# Basic usage - 1000 families, 4 members each, no claims
python manage.py bulk_generate_insurees

# Custom configuration
python manage.py bulk_generate_insurees --families 5000 --members 3 --claims 2

# Using presets
python manage.py bulk_generate_insurees --preset medium --no-confirm

# Large dataset with custom batch size
python manage.py bulk_generate_insurees --families 100000 --members 5 --batch-size 5000
```
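The --batch-size parameter implies chunked bulk inserts: records are split into fixed-size batches so a very large run never materializes one giant INSERT. A pure-Python sketch of that chunking logic (illustrative; in the real command each batch would be handed to Django's `bulk_create`):

```python
def chunked(items, batch_size=2000):
    """Yield successive batches of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def bulk_insert(records, insert_fn, batch_size=2000):
    """Insert `records` in batches; `insert_fn` stands in for
    Model.objects.bulk_create in the actual Django command."""
    inserted = 0
    for batch in chunked(records, batch_size):
        insert_fn(batch)
        inserted += len(batch)
    return inserted
```

Smaller batches bound memory use; larger batches amortize per-statement overhead, which is why the value is tunable per environment.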

 

  2. Dockerized Setup

    • Integrate into the Docker build process for development/demo environments.

    • On startup, the environment can optionally auto-generate synthetic data for testing.

  3. Performance Testing Integration

    • Enables repeatable performance profiling across implementations.

  4. Database Optimization to Support Bulk Insertion


Warning: The following optimizations are intended for testing large datasets only and are not suitable for production use.

PostgreSQL Optimizer

work_mem = 256MB: Increases memory for sorting and hashing operations
maintenance_work_mem = 512MB: More memory for maintenance operations like index creation
synchronous_commit = OFF: Disables waiting for WAL writes to complete
checkpoint_completion_target = 0.9: Spreads checkpoints over longer periods

Benefits for Bulk Insert:

Faster Sorting: Large work_mem allows in-memory sorting instead of disk-based
Reduced I/O Wait: Asynchronous commits reduce transaction latency
Better Index Building: Higher maintenance_work_mem speeds up index operations
Checkpoint Smoothing: Reduces I/O spikes during bulk operations

```sql
-- PostgreSQL Optimizations
SET work_mem = '256MB';
SET maintenance_work_mem = '512MB';
SET synchronous_commit = OFF;
SET checkpoint_completion_target = 0.9;
```

MSSQL Optimizer

SET NOCOUNT ON: Reduces network traffic by suppressing row-count messages
SET ANSI_WARNINGS OFF: Prevents warnings that can slow bulk operations
SET ARITHABORT OFF: Disables query termination on arithmetic errors
SET AUTO_UPDATE_STATISTICS OFF: Prevents automatic statistics updates during bulk insert

Benefits for Bulk Insert:

Reduced Network Overhead: NOCOUNT eliminates unnecessary messages
Fewer Interruptions: Disabled warnings prevent operation pauses
Consistent Performance: Disabled auto-statistics prevents unpredictable delays
Optimized Error Handling: ARITHABORT OFF allows operations to continue

```sql
-- SQL Server Optimizations
SET NOCOUNT ON;
SET ANSI_WARNINGS OFF;
SET ARITHABORT OFF;
SET AUTO_UPDATE_STATISTICS OFF;
```
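In a Django command, the appropriate statements could be selected by database vendor and executed at the start of the bulk session. A sketch of that selection (the statement lists mirror the SQL blocks above; the helper function and vendor strings are assumptions, not openIMIS code):

```python
# Testing-only session settings; never apply these in production.
PG_SETTINGS = [
    "SET work_mem = '256MB';",
    "SET maintenance_work_mem = '512MB';",
    "SET synchronous_commit = OFF;",
    "SET checkpoint_completion_target = 0.9;",
]

MSSQL_SETTINGS = [
    "SET NOCOUNT ON;",
    "SET ANSI_WARNINGS OFF;",
    "SET ARITHABORT OFF;",
    "SET AUTO_UPDATE_STATISTICS OFF;",
]

def session_settings(vendor):
    """Return the session-level optimization statements for a given
    database vendor string; unknown backends get no special settings."""
    if vendor == "postgresql":
        return PG_SETTINGS
    if vendor in ("microsoft", "mssql"):
        return MSSQL_SETTINGS
    return []
```

In the actual command these statements would presumably be run through `connection.cursor().execute(...)` before the first batch.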

Reference: PostgreSQL bulk insert/update optimization

Reference: SQL Server SET NOCOUNT, SET ARITHABORT, and SET ANSI_WARNINGS documentation

Data Flow Architecture:

1. Reference Data Loading → 2. Insuree Generation → 3. Family Creation → 4. Policy Generation → 5. InsureePolicy Linking → 6. Claims Generation → 7. Claim Items/Services → 8. Final Updates
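The eight stages above can be sketched as an ordered pipeline that threads a shared context through each step, so later stages (e.g. claims) can reference earlier output (e.g. insurees). The stage names and placeholder bodies below are purely illustrative:

```python
def run_pipeline(stages, context=None):
    """Run generation stages in order, passing a shared context dict
    and recording which stages have completed."""
    context = context if context is not None else {}
    for name, stage in stages:
        stage(context)
        context.setdefault("completed", []).append(name)
    return context

# Placeholder stages mirroring the documented data flow (illustrative)
STAGES = [
    ("reference_data", lambda ctx: ctx.update(locations=["Region A"])),
    ("insurees", lambda ctx: ctx.update(insurees=[{"chf_id": "SYN1"}])),
    ("families", lambda ctx: ctx.update(families=[{"head": "SYN1"}])),
    ("policies", lambda ctx: ctx.update(policies=[{"family": 0}])),
    ("insuree_policies", lambda ctx: ctx.update(links=[(0, 0)])),
    ("claims", lambda ctx: ctx.update(claims=[{"insuree": "SYN1"}])),
    ("claim_items", lambda ctx: ctx.update(items=[{"claim": 0}])),
    ("final_updates", lambda ctx: None),
]
```

Keeping the stages explicit also supports the modular-generation principle: a run can stop after the insuree stage, or start from an existing population and generate only claims.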

Activities & Implementation Roadmap

Proposed Activities

  1. Requirements & Design

    • Define data models and relationships for synthetic generation.

    • Align with openIMIS Core Models (Insuree, Policy, Claim).

  2. Tool Development

    • Develop Django management commands.

    • Extend with configurable parameters (data size, geographic distribution, insurance schemes).

  3. Integration with Docker

    • Update docker-compose setup to include auto-generation option for dev/demo builds.

    • Provide pre-configured demo datasets for training and sandbox environments.

  4. Testing & Validation

    • Test performance impact on PostgreSQL and SQL Server.

    • Validate data quality (consistency across insurees/policies/claims).

  5. Documentation & Training

    • Provide user guides for running synthetic data generation.

    • Share best practices for using synthetic data in interoperability workflows.

Expected Benefits

  • Improved Debugging & Performance Testing: Developers can reproduce real-world issues without requiring production data.

  • Interoperability Demonstrations: Partners can showcase openIMIS ↔ openCRVS or FHIR exchanges with realistic sample data.

  • Scalable Demo Environments: Training, testing, and capacity building become more meaningful with realistic datasets.

Did you encounter a problem or do you have a suggestion?

Please contact our Service Desk



This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. https://creativecommons.org/licenses/by-sa/4.0/