Deduplication and Data Validation Mechanism in ETL (Individual or Beneficiary Module)

1. How does it work in the current coreMIS version?

 

  • Deduplication

    • This only works if unique field(s) are present. If there are no unique fields, deduplication does not work.

    • If there are unique field(s), identify them in both the Extraction and the Loading parts of the ETL process, for example mapping the “National Identification” column in the uploaded Excel file to the “national_id” column in the PostgreSQL database.

    • During the Loading stage of the ETL, check whether the data already exists; this is a simple count query using the value from the Excel column. If it already exists, flag the record and do not insert it. Possible data to flag: the record number in the Excel file and the entire line of the record in the Excel file, to help with identification (see the sketch after this list).

    • After the ETL process is completed, generate a summary report of all records that were flagged as duplicates.

    • Stakeholders are granted a means to fix duplicates: either accept that flagged duplicates should not be saved, or undo the Excel file upload, fix the duplicates, and re-import for the cycle to start again.
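
A minimal sketch of the duplicate check described above, written in Python with pandas and psycopg2. The table name “individual”, the Excel column “National Identification” and the database column “national_id” are illustrative assumptions, not the actual coreMIS names.

    import pandas as pd
    import psycopg2

    EXCEL_COLUMN = "National Identification"  # unique field in the uploaded file
    DB_COLUMN = "national_id"                 # mapped column in PostgreSQL

    def load_with_dedup_flags(excel_path, conn):
        """Insert non-duplicate rows; return flagged rows for the summary report."""
        frame = pd.read_excel(excel_path)
        flagged = []
        with conn.cursor() as cur:
            for index, row in frame.iterrows():
                value = row[EXCEL_COLUMN]
                # Simple count query against the mapped column.
                cur.execute(
                    f"SELECT COUNT(*) FROM individual WHERE {DB_COLUMN} = %s",
                    (value,),
                )
                if cur.fetchone()[0] > 0:
                    # Flag the record number and the entire line for identification.
                    flagged.append({"row": index + 2,  # +2: header row, 1-based
                                    "record": row.to_dict()})
                    continue  # duplicate: do not insert
                cur.execute(
                    f"INSERT INTO individual ({DB_COLUMN}) VALUES (%s)",
                    (value,),
                )
        conn.commit()
        return flagged  # input for the post-ETL duplicate summary report

For large files a single set-based query (for example a JOIN against a staging table) would replace the per-row count, but the per-row version mirrors the steps above most directly.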


  • Validation

    • This only works if there are fields to be validated. If there are no fields to validate, validation does not run.

    • If there are field(s) to be validated, for example ID format, phone number, or age, define regex patterns or other validation parameters (see the sketch after this list).

    • During the Loading stage of the ETL, check whether the data meets the validation criteria; if it does not, flag the record and do not insert it. Possible data to flag: the record number in the Excel file and the entire line of the record in the Excel file, to help with identification.

    • After the ETL process is completed, generate a summary report of all validation failures.

    • Stakeholders are granted a means to fix validation issues: either accept that records with validation errors should not be saved, or undo the Excel file upload, fix the data entry issues, and re-import for the cycle to start again.
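
A minimal sketch of the validation step, again in Python. The rules shown (a 10-digit ID, an international phone number, an age range) are illustrative assumptions; the real patterns would come from the validation parameters defined above.

    import re

    VALIDATION_RULES = {
        "National Identification": re.compile(r"^\d{10}$"),  # assumed ID format
        "Phone Number": re.compile(r"^\+?\d{7,15}$"),        # assumed phone format
    }

    def validate_row(row_number, row):
        """Return the validation errors for one Excel record (empty list if valid)."""
        errors = []
        for column, pattern in VALIDATION_RULES.items():
            value = str(row.get(column, ""))
            if not pattern.match(value):
                errors.append({
                    "row": row_number,       # record number in the Excel file
                    "column": column,
                    "record": dict(row),     # entire line, to help identification
                    "reason": f"{value!r} does not match {pattern.pattern}",
                })
        # Non-regex validation parameter, e.g. an age range (assumed bounds);
        # missing or non-numeric ages are flagged as well.
        age = str(row.get("Age", ""))
        if not (age.isdigit() and 0 <= int(age) <= 120):
            errors.append({"row": row_number, "column": "Age",
                           "record": dict(row), "reason": "age out of range"})
        return errors

Records with a non-empty error list are skipped during Loading and collected into the post-ETL validation summary report.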

 

 

2. Our knowledge based on the meetings

  • In the old CORE-MIS version this works as part of the data import.

  • It heavily depends on the schema definition of the programme/benefit plan.

  • Backend Definition: Specify in the backend whether a column can be unique or not, considering the schema flexibility of the benefit plan programme (see the schema sketch after this list).

  • Detection Criteria: If a column in the schema is marked as unique, duplicate detection marks both records (the existing one and the new one) when a non-unique value is found.

  • Deduplication Process: Searches new and existing records and identifies potential duplicates based on the specified criteria. It starts once the data import is triggered.
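
A sketch of how the unique flag could live in the benefit plan schema, written as a JSON-Schema-style definition held in a Python dict. The "uniqueness" keyword and the field names are assumptions for illustration, not the actual schema vocabulary.

    BENEFIT_PLAN_SCHEMA = {
        "type": "object",
        "properties": {
            "national_id": {"type": "string", "uniqueness": True},  # assumed flag
            "phone_number": {"type": "string"},
            "age": {"type": "integer"},
        },
    }

    def unique_columns(schema):
        """Columns that duplicate detection should run on for this benefit plan."""
        return [name for name, spec in schema.get("properties", {}).items()
                if spec.get("uniqueness")]

With the schema driving the check, marking both the new and the existing record when a shared value is found reduces to querying each column returned by unique_columns.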

3. Ideas on how to develop this feature in the migrated version

In the migrated version we should consider using tasks to process potential duplicates once some of the records are marked as potential duplicates (by a duplicate flag), as in the sketch below.
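
A minimal, self-contained sketch of that idea in plain Python; in the migrated version this would plug into the openIMIS tasks mechanism, and all names below are illustrative assumptions.

    from collections import defaultdict
    from dataclasses import dataclass, field

    @dataclass
    class ReviewTask:
        """One stakeholder decision: accept or reject a group of duplicates."""
        unique_value: str
        records: list = field(default_factory=list)
        status: str = "OPEN"  # OPEN -> ACCEPTED or REJECTED

    def create_duplicate_tasks(flagged_records):
        """Group records sharing the same flagged unique value into one task."""
        groups = defaultdict(list)
        for record in flagged_records:
            groups[record["national_id"]].append(record)  # assumed unique field
        return [ReviewTask(value, records) for value, records in groups.items()]

    # Usage: records marked with the duplicate flag during import become tasks
    # that stakeholders resolve before the affected rows are saved.
    tasks = create_duplicate_tasks([
        {"national_id": "1234567890", "row": 5},
        {"national_id": "1234567890", "row": 9},
    ])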

<TO-DO>

The process of deduplication/validation on the openIMIS/coreMIS migrated side.

A proposal for improving schema management.

 

 

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. https://creativecommons.org/licenses/by-sa/4.0/