Deduplication Functionality

Overview

Objective

Deduplication functionality is needed to improve the system's data integrity and usability. When data is imported into the system, we currently cannot tell whether a given patient is already in the database. To make that possible, we need to integrate deduplication functionality.

Suggested Approach

Develop and integrate a deduplication module within the application's micro-kernel architecture. This module will involve adding a new user interface for clerks to manage deduplication tasks, implementing dynamic field selection logic to accommodate the flexible schema, and developing an algorithm that detects duplicates based on the selected fields. Additionally, the system should allow clerks to make informed decisions on merging duplicate entries, and these functionalities should be integrated into the existing task management system.

 

Due to our structure and the fact that program schemas are flexible, we need to consider deduplication on two levels:

  1. Beneficiary level - this type of deduplication will be triggered on the program level with a new button next to the import functionality.

  2. Individual/Group level - this type of deduplication will be triggered from the Individual/Group view. Because its purpose is to unify data across benefit plans, the data within each benefit plan must be consistent first. This type of deduplication will therefore need to validate that the benefit plan data for the included beneficiaries is consistent.

Use Cases

Use Cases For Benefit Plans

Analysis for Benefit Deduplication

Use Case 1: Initiating Deduplication

  1. User Action: Initiates deduplication in a specific benefit plan view.

  2. System Response: Presents modal with possible fields for deduplication based on the benefit plan schema.

  3. Impediments:

    • Inconsistent or incomplete schema definitions could lead to confusion over field selection.

    • User might be unsure which fields are most effective for deduplication.

Mock 1. Deduplication initiation button added to the Benefit Plan view at {host}/front/benefitPlans/benefitPlan/{id}
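As an illustration of how the modal could derive its field options, the sketch below reads the property names from a benefit plan schema. It assumes the schema is stored as a JSON-Schema-like document with a top-level "properties" object; the function name deduplication_field_options and the example field names are hypothetical and not part of the existing code base.

```python
import json


def deduplication_field_options(schema_json):
    """Return the schema fields that could be offered in the deduplication modal."""
    schema = json.loads(schema_json)
    properties = schema.get("properties", {})
    # Assumption: only scalar fields are offered, because nested objects
    # and arrays are hard to compare for equality.
    scalar_types = {"string", "integer", "number", "boolean"}
    return sorted(
        name for name, spec in properties.items()
        if spec.get("type") in scalar_types
    )


# Hypothetical benefit plan schema used only for this example.
example_schema = json.dumps({
    "properties": {
        "national_id": {"type": "string"},
        "birth_year": {"type": "integer"},
        "household": {"type": "object"},
    }
})

print(deduplication_field_options(example_schema))
# prints ['birth_year', 'national_id']
```

Filtering out non-scalar fields is one possible way to reduce the user confusion over field selection listed in the impediments; whether it is appropriate depends on how schemas are actually defined.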

 

Use Case 2: Field Selection for Deduplication

  1. User Action: Selects fields on which to perform deduplication.

  2. System Response: Processes the selected fields and searches for duplicates.

  3. Impediments:

    • High complexity or volume of fields can overwhelm users.

    • System performance issues with large datasets or complex queries.

Mock 2. Example deduplication form in the modal. When open, the dropdown should list the schema fields of the given benefit plan.
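The duplicate search on the selected fields could be sketched as a simple grouping step. This is a minimal in-memory illustration, not the actual backend query. The names normalize and find_duplicate_groups, the normalization rules, and the decision to skip records with missing values are all assumptions.

```python
from collections import defaultdict


def normalize(value):
    """Make string comparison tolerant of case and surrounding whitespace."""
    if isinstance(value, str):
        return value.strip().casefold()
    return value


def find_duplicate_groups(records, selected_fields):
    """Group records that share the same values on all selected fields."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(normalize(record.get(name)) for name in selected_fields)
        # Assumption: a record missing any selected field is never a duplicate.
        if any(part is None for part in key):
            continue
        groups[key].append(record)
    # Keep only keys shared by more than one record.
    return {key: rows for key, rows in groups.items() if len(rows) > 1}


# Hypothetical beneficiary data for illustration.
beneficiaries = [
    {"id": 1, "first_name": "Ana", "birth_year": 1990},
    {"id": 2, "first_name": " ana ", "birth_year": 1990},
    {"id": 3, "first_name": "Ben", "birth_year": 1985},
]
duplicates = find_duplicate_groups(beneficiaries, ["first_name", "birth_year"])
```

For large datasets the same grouping would more likely be pushed down to the database, which is where the performance impediment above becomes relevant.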

Use Case 3: Displaying Deduplication Summary

  1. User Action: Waits for deduplication results.

  2. System Response: Displays a summary of duplicates found, or a message if no duplicates are found.

  3. Impediments:

    • Providing clear and understandable summaries for complex groupings.

    • Efficiently handling and presenting large sets of duplicate data.

Wireframe 1. Deduplication summary results, triggered by the user in the modal. [link]

 

The deduplication summary shouldn’t give a line-by-line projection of the differences in the benefit consumptions, but rather overall information about how many duplicates were found for each grouping. A group with a count of 1 contains no duplicates and shouldn’t be displayed in the table (or even passed from the backend).
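A minimal sketch of such a summary, assuming records are available as plain dictionaries: it counts how many records share each combination of selected values and drops every combination that occurs only once. The function name deduplication_summary and the shape of the returned rows are illustrative assumptions.

```python
from collections import Counter


def deduplication_summary(records, selected_fields):
    """Return one summary row per grouping that has more than one record."""
    counts = Counter(
        tuple(record.get(name) for name in selected_fields)
        for record in records
    )
    return [
        {"fields": dict(zip(selected_fields, key)), "count": count}
        for key, count in counts.most_common()  # largest groups first
        if count > 1  # a count of 1 means no duplicates, so it is omitted
    ]


# Hypothetical data for illustration.
rows = [
    {"first_name": "Ana", "village": "North"},
    {"first_name": "Ana", "village": "North"},
    {"first_name": "Ana", "village": "North"},
    {"first_name": "Ben", "village": "South"},
    {"first_name": "Ben", "village": "South"},
    {"first_name": "Cy", "village": "East"},
]
summary = deduplication_summary(rows, ["first_name", "village"])
```

Here summary holds two rows (a count of 3 for Ana/North and 2 for Ben/South), and the single Cy/East record is not reported.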

Use Case 4: Generating Tasks for Resolving Duplicates

  1. User Action: Chooses to generate tasks for each duplicate found.

  2. System Response: Creates individual tasks for each set of duplicates.

  3. Impediments:

    • Ensuring task creation is efficient and does not overload the system.

    • Clear task descriptions to aid in resolution.

This action creates tasks for the task executors to review. It is executed when the user presses “Create Deduplication Review Tasks” in Wireframe 1.
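Task generation could look roughly like the sketch below, which creates one task per set of duplicates. The DeduplicationTask class, its fields, the "RECEIVED" status and the shape of the input groups are hypothetical; the real implementation would go through the existing task management module.

```python
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass
class DeduplicationTask:
    """Hypothetical review task; field names are assumptions."""
    benefit_plan_id: str
    criteria: dict      # field values shared by the duplicates
    record_ids: list    # records the executor has to review
    status: str = "RECEIVED"  # assumed initial status name
    task_id: str = field(default_factory=lambda: str(uuid4()))


def create_review_tasks(benefit_plan_id, duplicate_groups):
    """Create one task for each group that contains at least two records."""
    tasks = []
    for group in duplicate_groups:
        if len(group["record_ids"]) < 2:
            continue  # nothing to review for a single record
        tasks.append(DeduplicationTask(
            benefit_plan_id=benefit_plan_id,
            criteria=dict(group["fields"]),
            record_ids=list(group["record_ids"]),
        ))
    return tasks


# Hypothetical input, as it might arrive from the summary step.
groups = [
    {"fields": {"first_name": "Ana"}, "record_ids": ["r1", "r2", "r3"]},
    {"fields": {"first_name": "Ben"}, "record_ids": ["r4"]},
]
tasks = create_review_tasks("plan-1", groups)
```

Storing the matching criteria on the task is one way to give executors a clear task description, which is one of the impediments listed above.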

Use Case Analysis for Task Resolution

Use Case 5: Task Selection and Review

  1. User Action: Selects a duplication task for review.

  2. System Response: Presents detailed information for the benefit consumption in question.

  3. Impediments:

    • Displaying comprehensive yet digestible information.

    • Ensuring data accuracy and up-to-date status.

Mock 3. New Deduplication Task section in {host}/front/tasks

 

This Tasks view lists all deduplication queries awaiting review that were assigned to a particular user or user group. Upon selecting one, the user is redirected to a detailed view with the deduplication action.

Use Case 6: Handling Duplication Tasks

  1. User Options: Reject duplication, accept one state, or merge data.

  2. System Response: Applies the chosen action to the database, updates states, and triggers events.

  3. Impediments:

    • User confusion over the best action to take in complex cases.

    • Handling merging logic when data from multiple sources is combined.

    • Implementing a deterministic strategy for benefit states not included in the user's decision.

Wireframe 2. Deduplication detailed view. [LINK]

In this view the user will see detailed information about the detected duplicate. The header (Field n on Wireframe 2) contains the names of all table columns in the given benefit plan. The displayed table is a full projection of the data, and its rows are the values of the benefit plan consumptions. Cells with a darker background have been selected as the “true value” for their column.

On the right side of the table, each row has a checkbox that determines whether the row is in fact a duplicate. If the checkbox is not selected, the row is not treated as a duplicate: it is excluded from the deduplication algorithm, it is read-only, and its values cannot be used as a “true value” for the deduplication. A “true value” has to be selected for each column (it could default to the most recent record). When the user triggers “Resolve Duplicates”, the post-resolution processing starts and the benefit plan is updated.
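The rules of this view can be sketched as a small function that builds the merged values from the executor's choices. The function name build_merged_record, the row layout (with "id" and "date_created" keys) and the error handling are assumptions made for illustration only.

```python
def build_merged_record(rows, duplicate_flags, true_value_choice, columns):
    """Build the merged values from the rows the executor confirmed.

    rows: list of dicts with "id", "date_created" and the column values
    duplicate_flags: {row id: True if the checkbox is selected}
    true_value_choice: {column: row id chosen as the "true value"}
    """
    # Rows without a selected checkbox take no part in the resolution.
    confirmed = [row for row in rows if duplicate_flags.get(row["id"])]
    if len(confirmed) < 2:
        raise ValueError("at least two rows must be confirmed as duplicates")
    by_id = {row["id"]: row for row in confirmed}
    # Default source when no choice was made: the most recent record.
    most_recent = max(confirmed, key=lambda row: row["date_created"])

    merged = {}
    for column in columns:
        source_id = true_value_choice.get(column, most_recent["id"])
        if source_id not in by_id:
            raise ValueError(f"row {source_id!r} is not a confirmed duplicate")
        merged[column] = by_id[source_id].get(column)
    return merged


# Hypothetical task data for illustration.
rows = [
    {"id": "a", "date_created": "2023-01-05", "first_name": "Ana", "phone": None},
    {"id": "b", "date_created": "2023-06-01", "first_name": "Anna", "phone": "555-0101"},
    {"id": "c", "date_created": "2023-09-09", "first_name": "Hannah", "phone": "555-0199"},
]
flags = {"a": True, "b": True, "c": False}
merged = build_merged_record(rows, flags, {"first_name": "a"}, ["first_name", "phone"])
```

In this example "first_name" comes from the explicitly chosen row "a", while "phone" falls back to row "b", the most recent of the confirmed rows. Row "c" is ignored because its checkbox is not selected.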

Use Case 7: Post-Resolution Processing

  1. User Action: Completes the task with a chosen action.

  2. System Response: Sends the request to the backend, applies changes, and triggers necessary events.

  3. Impediments:

    • Ensuring backend processing is robust and error-free.

    • Handling potential inconsistencies post-merge or post-rejection.

Notes:

  1. The workflow will update the oldest benefit plan consumption with the desired data.

  2. The benefit consumption state should also be updated.

  3. In a future implementation we could consider a partially deterministic scenario in which deduplication is resolved or simplified automatically by a set of rules (e.g. automatic merge of rows that are exactly the same).
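The rule mentioned in note 3 could be sketched as follows: groups whose rows are identical on every compared column are separated from groups that still need a manual review. The function name split_auto_resolvable and the data layout are hypothetical.

```python
def split_auto_resolvable(groups, columns):
    """Split duplicate groups into auto-mergeable and manual-review lists.

    A group is auto-mergeable when all of its rows hold exactly the same
    values in every compared column.
    """
    auto, manual = [], []
    for rows in groups:
        distinct = {tuple(row.get(column) for column in columns) for row in rows}
        if len(distinct) == 1:
            auto.append(rows)
        else:
            manual.append(rows)
    return auto, manual


# Hypothetical groups for illustration.
identical = [
    {"id": 1, "name": "Ana", "phone": "555-0101"},
    {"id": 2, "name": "Ana", "phone": "555-0101"},
]
conflicting = [
    {"id": 3, "name": "Ben", "phone": "555-0102"},
    {"id": 4, "name": "Ben", "phone": None},
]
auto, manual = split_auto_resolvable([identical, conflicting], ["name", "phone"])
```

Technical columns such as identifiers or timestamps are left out of the comparison here by not listing them in columns. Which columns count as "exactly the same" would need to be agreed on.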

General Impediments

  • Performance Optimization: Ensuring the system performs efficiently with large datasets and complex queries.

  • Error Handling and Logging: Robust error handling and detailed logging for troubleshooting and auditing purposes.

Additional Considerations (out of scope?)

Use Case 8: Field Schema Changes

Problem Statement

When the schema of a benefit plan changes (e.g., adding, removing, or modifying fields), it can affect ongoing deduplication tasks, leading to potential inconsistencies or errors.

Possible solutions (could be combined)

  1. Versioning the Schema: Implement version control for schema changes. Each version of a benefit plan schema should be uniquely identifiable.

  2. Task State Management: When a deduplication task is created, capture the schema version that the task is based on. This ensures the task is always aware of the schema it should be referencing.

  3. Handling Active Tasks on Schema Change:

    • Option A: Pause and Review: Automatically pause active deduplication tasks on schema change. Require a manual review by an administrator or system to adjust or restart the task based on the new schema.

    • Option B: Continue with Original Schema: Allow active tasks to continue using the schema version they were initiated with, ensuring consistency for that task's lifespan.

  4. User Notification: Notify users of any ongoing deduplication tasks affected by schema changes, providing options to adjust or restart the task.
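One possible way to make schema versions identifiable and to detect tasks created against an older schema is to fingerprint the schema content. The sketch below assumes the schema is available as a JSON-compatible dictionary; schema_version and task_is_stale are hypothetical helper names, and a stored version number would work equally well.

```python
import hashlib
import json


def schema_version(schema):
    """Derive a stable identifier from the schema content."""
    # Sorting keys makes the fingerprint independent of key order.
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def task_is_stale(task_schema_version, current_schema):
    """True when the schema changed after the task captured its version."""
    return task_schema_version != schema_version(current_schema)


# Hypothetical schemas for illustration.
schema_v1 = {"properties": {"name": {"type": "string"}, "age": {"type": "integer"}}}
schema_v2 = {"properties": {"name": {"type": "string"}}}

captured = schema_version(schema_v1)  # stored on the task at creation time
```

With this check a stale task can either be paused for review (Option A) or continue against the schema version it captured (Option B).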

Use Case 9: Concurrent Deduplication Processes

Problem Statement

Concurrent deduplication processes on overlapping datasets can lead to conflicts, data inconsistencies, or performance issues.

Solution Strategy

  1. Locking Mechanisms: Implement a locking mechanism at the data level to prevent concurrent processes from modifying the same data simultaneously.

    • Optimistic Locking: Useful when conflicts are rare. Each transaction checks if the data has been modified by others before committing changes.

    • Pessimistic Locking: Suitable for high-conflict scenarios. Data gets locked for the duration of a transaction, preventing others from modifying it.

  2. Transaction Management: Ensure deduplication operations are handled within transactions. This allows for rollback in case of conflicts or errors, maintaining data integrity.
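Optimistic locking can be illustrated with a version counter on each record: an update is accepted only if the record still has the version that was read. The in-memory store below is purely a sketch with hypothetical names. In a real database the version check and the write would have to happen atomically, for example in a single conditional update inside a transaction.

```python
class StaleRecordError(Exception):
    """Raised when a record was modified after it was read."""


class InMemoryStore:
    """Toy record store with a version counter per row (illustration only)."""

    def __init__(self):
        self._rows = {}

    def insert(self, row_id, data):
        self._rows[row_id] = {"version": 1, "data": dict(data)}

    def read(self, row_id):
        row = self._rows[row_id]
        return row["version"], dict(row["data"])

    def update(self, row_id, expected_version, data):
        row = self._rows[row_id]
        # Reject the write if someone else updated the row in the meantime.
        if row["version"] != expected_version:
            raise StaleRecordError(row_id)
        row["data"] = dict(data)
        row["version"] += 1


store = InMemoryStore()
store.insert("consumption-1", {"status": "ACTIVE"})

# Two deduplication processes read the same record.
version_a, data_a = store.read("consumption-1")
version_b, data_b = store.read("consumption-1")

# The first write succeeds and bumps the version.
store.update("consumption-1", version_a, {"status": "MERGED"})

# The second write is based on outdated data and is rejected.
try:
    store.update("consumption-1", version_b, {"status": "REJECTED"})
    conflict_detected = False
except StaleRecordError:
    conflict_detected = True
```

The rejected process would then re-read the record and decide whether its deduplication task still applies.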

High Level Workflow

Initialization

@startuml
actor User
participant "Benefit Plan Interface" as BPI
participant "Deduplication System" as DS
database "Benefit Plan Schema" as BPS

User -> BPI: Select "Initiate Deduplication"
BPI -> DS: Request deduplication interface
activate DS
DS -> BPS: Retrieve available fields
activate BPS
BPS -> DS: Return field data
deactivate BPS
DS -> BPI: Display deduplication interface\nwith field options
deactivate DS
BPI -> User: Present field selection options
@enduml

Duplication Selection

@startuml
actor User
participant "Deduplication Interface" as DI
participant "Deduplication Logic" as DL
database "Benefit Data" as BD

User -> DI: Request Summary for\n deduplication criteria
DI -> DL: Field selections
activate DL
DL -> BD: Query for duplicates based on fields
activate BD
BD -> DL: Duplicate data
deactivate BD
DL -> DI: Process and summarize duplicates
deactivate DL
DI -> User: Display summary of duplicates
@enduml

Request Deduplication Review

@startuml
actor User
participant "Deduplication Interface" as DI
participant "Deduplication Logic" as DL
database "Benefit Data" as BD
database "Task Data" as TD

User -> DI: Request Summary for\n deduplication criteria
DI -> DL: Field selections
activate DL
DL -> BD: Query for duplicates based on fields
activate BD
BD -> DL: Duplicate data
deactivate BD
DL -> DI: Process and summarize duplicates
deactivate DL
DI -> User: Display summary of duplicates

alt Duplicates found
opt Request Deduplication
User --> DI: Click "Generate Tasks for Duplicates"
DI -> DL: Initiate task generation
activate DL
DL -> TD: Create deduplication tasks
activate TD
TD -> DL: Confirm task creation
deactivate TD
DL -> DI: Display task generation status
deactivate DL
DI -> User: Show confirmation
end
else No duplicates found
User -> DI: Request Close Modal
DI -> User: Modal closed

end
@enduml

Task Selection and Review

@startuml
actor Executor
participant "Task Management Interface" as TMI
participant "Deduplication Task Logic" as DTL
database "Task Database" as TD
database "Benefit Data Storage" as DS

Executor -> TMI: Select deduplication tasks
TMI -> TD: Retrieve tasks details
activate TD
TD -> DTL: Tasks data
deactivate TD
DTL -> TMI: Present tasks details and options
TMI -> Executor: Display tasks
Executor -> TMI: Select task
TMI -> TD: Retrieve tasks details
activate TD
TD -> DTL: Tasks data
deactivate TD
DTL -> TMI: Present task details and options
TMI -> Executor: Display tasks details

loop For each potential duplication
opt Benefit Consumption is duplicate
Executor -> TMI: Select benefit consumption as duplicate
end
end
Executor -> TMI: Select "true value" for duplicates
Executor --> TMI: Process duplicates
TMI -> DTL: Resolve duplicates
DTL -> DS: Update initial benefit plan\nby merging data from different sources
DS -> DTL: Update successful
loop For each duplicate
DTL -> DS: Change duplicate references to merged benefit plan
DS -> DTL: Success
DTL -> DS: Remove duplicated benefit plan
DS -> DTL: Success
end
DTL --> TMI: Confirm update

@enduml

Deduplication Algorithm

@startuml
start
:Receive user-selected fields\nand entries;

:Identify duplicates\nbased on selected fields;
if (Duplicates found?) then (yes)
:Sort duplicates\nby entry date;
:Select the oldest entry\nas the primary record;
group Transaction
:Prepare 'merged' record\nusing primary record data;
:Apply selected values\nto primary record;
while (Each Duplicate entry)
:Transfer references from\nduplicate to primary record;
:Mark duplicate entry for removal;
endwhile
:Remove marked duplicate entries;
:Update database with changes;
end group
else (no)
:End process\n(no duplicates found);
endif
stop
@enduml
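The activity diagram above can be sketched in Python as follows. This is an in-memory illustration under several assumptions: entries and references are plain dictionaries, every name is hypothetical, and the transaction is imitated by working on copies so that the inputs stay untouched if anything fails.

```python
import copy
from datetime import date


def resolve_duplicates(entries, references, duplicate_ids, selected_values):
    """Merge the duplicates into the oldest entry and repoint references.

    entries: {entry id: {"date_created": date, "data": dict}}
    references: {referencing object id: entry id}
    Returns new (entries, references); the inputs are not modified.
    """
    if len(duplicate_ids) < 2:
        return entries, references  # no duplicates found

    new_entries = copy.deepcopy(entries)
    new_references = dict(references)

    # Sort by entry date and take the oldest entry as the primary record.
    ordered = sorted(duplicate_ids, key=lambda i: new_entries[i]["date_created"])
    primary_id, other_ids = ordered[0], ordered[1:]

    # Apply the selected "true values" to the primary record.
    new_entries[primary_id]["data"].update(selected_values)

    marked_for_removal = set()
    for duplicate_id in other_ids:
        # Transfer references from the duplicate to the primary record.
        for ref_id, target in new_references.items():
            if target == duplicate_id:
                new_references[ref_id] = primary_id
        marked_for_removal.add(duplicate_id)

    # Remove the marked duplicate entries.
    for duplicate_id in marked_for_removal:
        del new_entries[duplicate_id]
    return new_entries, new_references


# Hypothetical data for illustration.
entries = {
    "e1": {"date_created": date(2023, 1, 10), "data": {"name": "Ana", "phone": None}},
    "e2": {"date_created": date(2023, 3, 2), "data": {"name": "Anna", "phone": "555-0101"}},
    "e3": {"date_created": date(2022, 7, 1), "data": {"name": "Ben", "phone": None}},
}
references = {"payment-1": "e1", "payment-2": "e2", "payment-3": "e3"}

new_entries, new_references = resolve_duplicates(
    entries, references, ["e2", "e1"], {"phone": "555-0101"}
)
```

After the call, "e1" is the primary record and carries the selected phone number, "e2" is removed, and "payment-2" points to "e1". The unrelated entry "e3" and its reference are unchanged.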




This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. https://creativecommons.org/licenses/by-sa/4.0/