Synthetic Data Generation
Why Synthetic Data for openIMIS?
Synthetic data generation is essential for two main reasons:
Performance Evaluation
Several implementations report performance issues when openIMIS scales up.
These issues are difficult to replicate on demo, release, or testing environments due to the absence of large, realistic datasets.
Synthetic data allows developers and implementers to simulate large-scale operations (millions of insurees, claims, policies, etc.) to identify bottlenecks and optimize performance.
Interoperability Testing
Synthetic datasets can be used to simulate cross-system data flows in interoperability use cases (e.g., openCRVS ↔ openIMIS, FHIR-based exchanges).
Provides sample data for testing integration mediators (openHIM, FHIR APIs) without exposing real personal health information.
Key Objects for Data Generation
The main objects that should be covered include:
Insuree (individuals and households)
Policy and InsureePolicy (covering family/individual enrollment)
Claim (facility-based service utilization at different submission levels)
Medical Services & Items (linked to Master Product List where available)
By covering these, we ensure end-to-end workflows are tested: enrollment, policy management, claim adjudication, and reporting.
Proposed Methodology & Tools
Design Principles
Low-level generation via Django ORM:
Avoids performance limitations observed when generating data via FHIR APIs (as highlighted by Cameroon’s experience).
Ensures compatibility with both PostgreSQL and SQL Server, without database-specific scripts.
Realism through predefined data sets:
Use predefined data sets to generate names, addresses, dates, diagnoses, etc.
Add configurable demographic distributions (e.g., urban/rural ratio, disease prevalence) to mimic realistic populations.
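As an illustration of these two principles, the sketch below draws insurees from small predefined pools with a configurable urban/rural ratio and weighted disease prevalence. The name and diagnosis pools, parameter names, and weights are placeholders for illustration, not actual openIMIS data sets or configuration keys:

```python
import random

# Placeholder reference pools; a real implementation would load these
# from the predefined data sets (names, addresses, diagnoses, ...).
FIRST_NAMES = ["Amina", "John", "Fatou", "Li", "Maria"]
DIAGNOSES = ["Malaria", "Hypertension", "Diabetes", "Acute respiratory infection"]

def sample_insuree(rng, urban_ratio=0.4, disease_weights=(50, 20, 15, 15)):
    """Draw one synthetic insuree using configurable distributions.

    urban_ratio and disease_weights are illustrative parameters used to
    mimic a realistic population mix.
    """
    return {
        "name": rng.choice(FIRST_NAMES),
        "area": "urban" if rng.random() < urban_ratio else "rural",
        "diagnosis": rng.choices(DIAGNOSES, weights=disease_weights, k=1)[0],
    }

rng = random.Random(42)  # fixed seed for reproducible test datasets
population = [sample_insuree(rng) for _ in range(1000)]
urban_share = sum(p["area"] == "urban" for p in population) / len(population)
```

Seeding the generator keeps runs reproducible, which matters when comparing performance measurements across environments.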
Configurable Volume and Scale:
Allow generation of datasets of varying sizes (e.g., 10k → 1M insurees).
Support modular data creation (insurees only, claims only, or full system simulation).
Technical Workflow
Django Management Command
Command:
python manage.py generate_synthetic_data --insurees 10000 --claims 50000
Uses the ORM to populate objects with integrity checks (foreign keys, policy linkage, etc.).
The synthetic data generator accepts the following parameters to customize bulk data generation:
| Parameter | Type | Default | Description |
|---|---|---|---|
| --preset | Choice | None | Predefined configurations: small, medium, large |
| --families | Integer | 1000 | Number of families to generate |
| --members | Integer | 4 | Fixed number of members per family |
| --claims | Integer | 0 | Number of claims to generate per insuree (0 = no claims) |
| --batch-size | Integer | 2000 | Batch size for bulk operations |
| --no-confirm | Flag | False | Skip confirmation prompt |
Preset example, showing the predefined datasets that can be generated:
presets = {
'small': {'families': 1000, 'members': 3, 'claims': 1},
'medium': {'families': 10000, 'members': 4, 'claims': 2},
'large': {'families': 50000, 'members': 5, 'claims': 2}
}
Usage examples:
# Basic usage - 1000 families, 4 members each, no claims
python manage.py bulk_generate_insurees
# Custom configuration
python manage.py bulk_generate_insurees --families 5000 --members 3 --claims 2
# Using presets
python manage.py bulk_generate_insurees --preset medium --no-confirm
# Large dataset with custom batch size
python manage.py bulk_generate_insurees --families 100000 --members 5 --batch-size 5000
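The generation and batching logic behind such a command could be sketched database-free as follows. The record layout (plain dicts with id/family_id keys) is illustrative; the real management command would map these onto the ORM models (Family, Insuree, Claim) and persist each batch with bulk_create:

```python
import itertools
import uuid

def generate_families(n_families, members_per_family, claims_per_insuree):
    """Yield synthetic family records with nested insurees and claims."""
    for f in range(n_families):
        family_id = str(uuid.uuid4())
        insurees = []
        for m in range(members_per_family):
            insurees.append({
                "id": f"{f}-{m}",
                "family_id": family_id,
                "head": m == 0,  # first member acts as family head
                "claims": [
                    {"insuree_id": f"{f}-{m}", "seq": c}
                    for c in range(claims_per_insuree)
                ],
            })
        yield {"id": family_id, "insurees": insurees}

def batched(iterable, batch_size):
    """Group records into fixed-size batches for bulk insertion."""
    it = iter(iterable)
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            return
        yield batch

# 10 families, 4 members each, 2 claims per insuree, inserted in batches of 3
batches = list(batched(generate_families(10, 4, 2), batch_size=3))
```

Generating lazily and inserting per batch keeps memory flat even at the 100k-family scale shown above.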
Dockerized Setup
Integrate into Docker build process for development/demo environments.
On startup, the environment can optionally auto-generate synthetic data for testing.
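One possible way to gate the optional auto-generation at container startup is an environment-variable check. The variable names (SYNTHETIC_DATA, SYNTHETIC_PRESET) below are hypothetical, and an entrypoint script would pass the returned arguments to Django's call_command:

```python
def synthetic_data_command(env):
    """Return management-command arguments when generation is enabled.

    SYNTHETIC_DATA and SYNTHETIC_PRESET are assumed variable names for
    this sketch; env is any mapping (e.g. os.environ).
    """
    if env.get("SYNTHETIC_DATA", "").lower() not in ("1", "true", "yes"):
        return None  # generation disabled: normal startup
    preset = env.get("SYNTHETIC_PRESET", "small")
    return ["bulk_generate_insurees", "--preset", preset, "--no-confirm"]

args = synthetic_data_command({"SYNTHETIC_DATA": "1", "SYNTHETIC_PRESET": "medium"})
```

Keeping the check in plain Python (rather than shell) lets the same logic be unit-tested outside the container.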
Performance Testing Integration
Enables repeatable performance profiling across implementations.
Database optimization to support bulk insertion
Warning: The following optimizations are intended for testing large datasets only and are not suitable for production use.
| PostgreSQL Optimizer | MSSQL Optimizer |
|---|---|
| work_mem = 256MB: increases the memory available for sorting and hashing operations. Benefit for bulk insert: a large work_mem allows in-memory sorting instead of disk-based sorting. | SET NOCOUNT ON: reduces network traffic by suppressing row-count messages. Benefit for bulk insert: eliminates unnecessary per-statement messages. |
| Reference: Postgres bulk update optimization | Reference: SET NOCOUNT, ARITHABORT / ANSI_WARNINGS |
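As a sketch, these session-level settings could be applied by the data generator before a bulk load. The helper below returns the statements per database vendor; the values are illustrative, should be validated against the target server, and, per the warning above, are for testing only:

```python
def bulk_insert_tuning(vendor):
    """Return session-level SQL to run before a bulk load (testing only).

    Values are illustrative; validate them against the target server.
    """
    if vendor == "postgresql":
        return [
            "SET work_mem = '256MB'",        # larger in-memory sorts and hashes
            "SET synchronous_commit = off",  # fewer WAL flushes per commit
        ]
    if vendor == "mssql":
        return [
            "SET NOCOUNT ON",     # suppress per-statement row-count messages
            "SET ARITHABORT ON",  # consistent arithmetic/plan behaviour
        ]
    raise ValueError(f"unsupported vendor: {vendor}")
```

In a Django context these statements would be executed once per connection (e.g. via connection.cursor()) before the bulk_create calls.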
Data flow Architecture:
1. Reference Data Loading → 2. Insuree Generation → 3. Family Creation →
4. Policy Generation → 5. InsureePolicy Linking → 6. Claims Generation →
7. Claim Items/Services → 8. Final Updates
Activities & Implementation Roadmap
Proposed Activities
Requirements & Design
Define data models and relationships for synthetic generation.
Align with openIMIS Core Models (Insuree, Policy, Claim).
Tool Development
Develop Django management commands.
Extend with configurable parameters (data size, geographic distribution, insurance schemes).
Integration with Docker
Update the docker-compose setup to include an auto-generation option for dev/demo builds.
Provide pre-configured demo datasets for training and sandbox environments.
Testing & Validation
Test performance impact on PostgreSQL and SQL Server.
Validate data quality (consistency across insurees/policies/claims).
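A minimal consistency check could look like the sketch below, which verifies that every generated policy and claim references an existing insuree. The dict keys (id, insuree_id) are illustrative stand-ins; the real checks would query the ORM models instead:

```python
def validate_dataset(insurees, policies, claims):
    """Run basic referential-consistency checks on generated records."""
    insuree_ids = {i["id"] for i in insurees}
    errors = []
    for p in policies:
        if p["insuree_id"] not in insuree_ids:
            errors.append(f"policy {p['id']} references unknown insuree")
    for c in claims:
        if c["insuree_id"] not in insuree_ids:
            errors.append(f"claim {c['id']} references unknown insuree")
    return errors

# Example: one policy points at a non-existent insuree
errors = validate_dataset(
    insurees=[{"id": "i1"}, {"id": "i2"}],
    policies=[{"id": "p1", "insuree_id": "i1"},
              {"id": "p2", "insuree_id": "missing"}],
    claims=[{"id": "c1", "insuree_id": "i2"}],
)
```

Running such checks after each generation run catches linkage bugs before they skew performance measurements.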
Documentation & Training
Provide user guides for running synthetic data generation.
Share best practices for using synthetic data in interoperability workflows.
Expected Benefits
Improved Debugging & Performance Testing: Developers can reproduce real-world issues without requiring production data.
Interoperability Demonstrations: Partners can showcase openIMIS ↔ openCRVS or FHIR exchanges with realistic sample data.
Scalable Demo Environments: Training, testing, and capacity building become more meaningful with realistic datasets.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. https://creativecommons.org/licenses/by-sa/4.0/