Deduplication Functionality
Overview
Objective
Deduplication functionality is needed to improve data integrity and usability. When data is imported into the system, we currently cannot identify whether a given patient is already in the database. Integrating deduplication functionality addresses this gap.
Suggested Approach
Develop and integrate a deduplication module within the application's micro-kernel architecture. This involves adding a new user interface for clerks to manage deduplication tasks, implementing dynamic field selection logic to accommodate the flexible schema, and developing an algorithm that detects duplicates based on the selected fields. Additionally, the system should allow clerks to make informed decisions on merging duplicate entries, and these functionalities should be integrated into the existing task management system.
Due to our structure and the fact that program schemas are flexible, we need to consider deduplication on two levels:
Beneficiary level - this type of deduplication is triggered at the program level, with a new button next to the import functionality.
Individual/Group level - this type of deduplication is triggered from the Individual/Group view. As its purpose is to unify data across benefit plans, the data at the benefit plan level must be consistent first. This type of deduplication therefore needs to validate that the benefit plan data for the included beneficiaries is consistent.
Use Cases
Use Cases For Benefit Plans
Analysis for Benefit Deduplication
Use Case 1: Initiating Deduplication
User Action: Initiates deduplication in a specific benefit plan view.
System Response: Presents a modal with the possible fields for deduplication, based on the benefit plan schema.
Impediments:
Inconsistent or incomplete schema definitions could lead to confusion over field selection.
User might be unsure which fields are most effective for deduplication.
Mock 1. Deduplication Initiation button added to Benefit Plan view in {host}/front/benefitPlans/benefitPlan/{id} view
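As an illustration of how the modal could obtain its field list, the sketch below reads the field names from a benefit plan schema. It assumes the schema is stored as a JSON document with a top-level "properties" object; the sample schema, its field names, and the function name deduplication_field_options are hypothetical.

```python
import json

# Hypothetical benefit plan schema; the real schema format may differ.
SCHEMA_JSON = """
{
  "properties": {
    "first_name": {"type": "string"},
    "last_name": {"type": "string"},
    "dob": {"type": "string", "format": "date"},
    "household_size": {"type": "integer"}
  }
}
"""


def deduplication_field_options(schema_json):
    """Return the sorted field names a clerk could choose from in the modal."""
    schema = json.loads(schema_json)
    properties = schema.get("properties", {})
    if not isinstance(properties, dict):
        # An incomplete or malformed schema yields no selectable fields.
        return []
    return sorted(properties)


options = deduplication_field_options(SCHEMA_JSON)
print(options)
```

Returning an empty list for an incomplete schema lets the user interface show an explicit "no fields available" state instead of failing, which addresses the first impediment listed above.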
Use Case 2: Field Selection for Deduplication
User Action: Selects fields on which to perform deduplication.
System Response: Processes the selected fields and searches for duplicates.
Impediments:
High complexity or volume of fields can overwhelm users.
System performance issues with large datasets or complex queries.
Mock 2. Example Deduplication Form in the modal. When open, the dropdown should show the list of schema fields in the given benefit plan.
Use Case 3: Displaying Deduplication Summary
User Action: Waits for deduplication results.
System Response: Displays a summary of duplicates found, or a message if no duplicates are found.
Impediments:
Providing clear and understandable summaries for complex groupings.
Efficiently handling and presenting large sets of duplicate data.
Wireframe 1. Results of the deduplication summary triggered by the user in the modal. [link]
The deduplication summary shouldn’t give a line-by-line projection of the differences between benefit consumptions, but rather overall information about how many duplicates were found for each grouping. If a given group has a count of 1, no duplicates were found for it, and it shouldn’t be displayed in the table (or even passed from the backend).
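The summary described above can be sketched as a count per combination of selected field values, with groups of size 1 filtered out on the backend. This is a minimal in-memory sketch, not the actual query: the record layout and the function name deduplication_summary are assumptions, and field values are assumed to be hashable (strings, numbers, None).

```python
from collections import Counter


def deduplication_summary(records, fields):
    """Count records per combination of the selected field values.

    Groups with a count of 1 contain no duplicates and are left out.
    """
    counts = Counter(
        tuple(record.get(field) for field in fields) for record in records
    )
    return [
        {"values": dict(zip(fields, key)), "count": count}
        for key, count in counts.items()
        if count > 1
    ]


# Hypothetical sample data.
records = [
    {"id": 1, "first_name": "Ann", "dob": "1990-01-01"},
    {"id": 2, "first_name": "Ann", "dob": "1990-01-01"},
    {"id": 3, "first_name": "Bob", "dob": "1985-05-05"},
]
summary = deduplication_summary(records, ["first_name", "dob"])
print(summary)
```

For large datasets the same grouping would more likely be pushed down to the database (a GROUP BY with a HAVING count greater than 1) rather than computed in application memory.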
Use Case 4: Generating Tasks for Resolving Duplicates
User Action: Chooses to generate tasks for each duplicate found.
System Response: Creates individual tasks for each set of duplicates.
Impediments:
Ensuring task creation is efficient and does not overload the system.
Clear task descriptions to aid in resolution.
This action creates the tasks for the task executors to review. It is executed upon pressing “Create Deduplication Review Tasks” in Wireframe 1.
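A possible shape for the task generation step is shown below: one review task per set of duplicates, carrying the grouping values as a readable description for the executor. The DeduplicationTask class, its fields, and the "PENDING" status are hypothetical and do not reflect the existing task management model.

```python
from dataclasses import dataclass


@dataclass
class DeduplicationTask:
    # Hypothetical task shape; the real task model may differ.
    key: dict          # selected field values shared by the duplicates
    record_ids: list   # records that the executor has to review
    status: str = "PENDING"


def build_review_tasks(records, fields):
    """Create one task for each group of records sharing the selected values."""
    groups = {}
    for record in records:
        key = tuple(record.get(field) for field in fields)
        groups.setdefault(key, []).append(record["id"])
    return [
        DeduplicationTask(key=dict(zip(fields, key)), record_ids=ids)
        for key, ids in groups.items()
        if len(ids) > 1  # a single record is not a duplicate set
    ]


# Hypothetical sample data.
records = [
    {"id": 1, "first_name": "Ann"},
    {"id": 2, "first_name": "Ann"},
    {"id": 3, "first_name": "Bob"},
]
tasks = build_review_tasks(records, ["first_name"])
print(tasks)
```

Building all tasks in one pass and persisting them in a single bulk operation is one way to keep task creation from overloading the system.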
Use Case Analysis for Task Resolution
Use Case 5: Task Selection and Review
User Action: Selects a duplication task for review.
System Response: Presents detailed information for the benefit consumption in question.
Impediments:
Displaying comprehensive yet digestible information.
Ensuring data accuracy and up-to-date status.
Mock 3. New Deduplication Task section in {host}/front/tasks
This Tasks view lists all deduplication queries awaiting review that are assigned to a particular user or user group. Upon selection, the user is redirected to a detailed view with the deduplication action.
Use Case 6: Handling Duplication Tasks
User Options: Reject duplication, accept one state, or merge data.
System Response: Applies the chosen action to the database, updates states, and triggers events.
Impediments:
User confusion over the best action to take in complex cases.
Handling merging logic when data from multiple sources is combined.
Implementing a deterministic strategy for benefit states not included in the user's decision.
Wireframe 2. Deduplication detailed view. [LINK]
In this view the user sees detailed information about the detected duplicate. The header (Field n on Wireframe 2) contains the names of all table columns in the given benefit plan. The displayed table is a full projection of the data: each row holds the values of one benefit plan consumption. Fields with a darker background have been selected as the “true value” for their column. On the right side of the table, each row has a checkbox that determines whether the row is in fact a duplicate. If the checkbox is not selected, the row is treated as not a duplicate: it is excluded from the deduplication algorithm, it is read-only, and its values cannot be used as a “true value”. A “true value” has to be selected for each column (it could default to the most recent record). When the user triggers “Resolve Duplicates”, the post-resolution processing starts and the benefit plan is updated.
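The selection rules in this view can be sketched as follows. The example assumes each row carries an "id" and a "date_created" value (ISO date strings here); these field names, the sample data, and the function name resolve_true_values are hypothetical. Columns without an explicit choice fall back to the most recent confirmed duplicate, which is the default suggested above.

```python
def resolve_true_values(rows, duplicate_ids, columns, selections=None):
    """Build the merged record from the per-column "true value" choices.

    rows: dicts with "id", "date_created" and the benefit plan columns.
    duplicate_ids: ids of rows whose duplicate checkbox is ticked.
    selections: optional {column: row id}; columns without a choice
    default to the most recent confirmed duplicate.
    """
    selections = selections or {}
    confirmed = [row for row in rows if row["id"] in duplicate_ids]
    if len(confirmed) < 2:
        raise ValueError("at least two confirmed duplicates are required")
    by_id = {row["id"]: row for row in confirmed}
    most_recent = max(confirmed, key=lambda row: row["date_created"])
    merged = {}
    for column in columns:
        source_id = selections.get(column, most_recent["id"])
        if source_id not in by_id:
            # Unticked rows are read-only and cannot supply a true value.
            raise ValueError(f"row {source_id} is not a confirmed duplicate")
        merged[column] = by_id[source_id][column]
    return merged


# Hypothetical sample data: row 3 is left unticked by the reviewer.
rows = [
    {"id": 1, "date_created": "2024-01-01", "first_name": "Ann", "phone": "111"},
    {"id": 2, "date_created": "2024-03-01", "first_name": "Anne", "phone": "222"},
    {"id": 3, "date_created": "2024-02-01", "first_name": "Ann", "phone": "333"},
]
merged = resolve_true_values(
    rows, {1, 2}, ["first_name", "phone"], {"first_name": 1}
)
print(merged)
```

Validating the selections on the backend as well as in the user interface guards against a request that picks a value from a row the reviewer excluded.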
Use Case 7: Post-Resolution Processing
User Action: Completes the task with a chosen action.
System Response: Sends the request to the backend, applies changes, and triggers necessary events.
Impediments:
Ensuring backend processing is robust and error-free.
Handling potential inconsistencies post-merge or post-rejection.
Notes:
The workflow will update the oldest benefit plan consumption with the desired data.
The benefit consumption state should also be updated.
In a future implementation we could consider a partially deterministic scenario in which deduplication is resolved or simplified automatically by a set of rules (e.g. automatic merge of rows that are exactly the same).
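The two notes on updating the oldest record and on rule-based automatic merging could look roughly like this. It is a sketch under assumptions: rows are plain dictionaries with a "date_created" value, and both function names are hypothetical.

```python
def is_exact_duplicate_group(rows, columns):
    """True when every row has identical values in all compared columns."""
    if len(rows) < 2:
        return False
    first = [rows[0][column] for column in columns]
    return all([row[column] for column in columns] == first for row in rows[1:])


def apply_to_oldest(rows, merged):
    """Return a copy of the oldest row updated with the merged values."""
    oldest = min(rows, key=lambda row: row["date_created"])
    return {**oldest, **merged}


# Hypothetical sample data: two rows that match on every compared column.
rows = [
    {"id": 2, "date_created": "2024-02-01", "first_name": "Ann", "phone": "111"},
    {"id": 1, "date_created": "2024-01-01", "first_name": "Ann", "phone": "111"},
]
columns = ["first_name", "phone"]

if is_exact_duplicate_group(rows, columns):
    # Identical rows need no human decision: any row supplies the values.
    merged = {column: rows[0][column] for column in columns}
    updated = apply_to_oldest(rows, merged)
    print(updated)
```

Only groups that pass the exact-match check would skip manual review; every other group would still go through the task flow described above.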
General Impediments
Performance Optimization: Ensuring the system performs efficiently with large datasets and complex queries.
Error Handling and Logging: Robust error handling and detailed logging for troubleshooting and auditing purposes.
Additional Considerations (out of scope?)
Use Case 8: Field Schema Changes
Problem Statement
When the schema of a benefit plan changes (e.g., adding, removing, or modifying fields), it can affect ongoing deduplication tasks, leading to potential inconsistencies or errors.
Possible solutions (could be combined)
Versioning the Schema: Implement version control for schema changes. Each version of a benefit plan schema should be uniquely identifiable.
Task State Management: When a deduplication task is created, capture the schema version that the task is based on. This ensures the task is always aware of the schema it should be referencing.
Handling Active Tasks on Schema Change:
Option A: Pause and Review: Automatically pause active deduplication tasks on schema change. Require a manual review by an administrator or system to adjust or restart the task based on the new schema.
Option B: Continue with Original Schema: Allow active tasks to continue using the schema version they were initiated with, ensuring consistency for that task's lifespan.
User Notification: Notify users of any ongoing deduplication tasks affected by schema changes, providing options to adjust or restart the task.
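The following sketch combines schema versioning, capturing the version on task creation, and Option A (pause on change). Deriving the version from a hash of the schema content is an assumption made for the example; an explicit version counter would work equally well. The task dictionary layout and the status names are hypothetical.

```python
import hashlib
import json


def schema_version(schema):
    """Derive a stable identifier from the schema content."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def create_task(schema, record_ids):
    """Capture the schema version the task is based on at creation time."""
    return {
        "schema_version": schema_version(schema),
        "record_ids": record_ids,
        "status": "ACTIVE",
    }


def on_schema_change(tasks, new_schema):
    """Option A: pause active tasks that were built on another version."""
    current = schema_version(new_schema)
    for task in tasks:
        if task["status"] == "ACTIVE" and task["schema_version"] != current:
            task["status"] = "PAUSED"
    return tasks


# Hypothetical schemas before and after a field is added.
schema_v1 = {"properties": {"first_name": {"type": "string"}}}
schema_v2 = {
    "properties": {
        "first_name": {"type": "string"},
        "phone": {"type": "string"},
    }
}

task = create_task(schema_v1, [1, 2])
on_schema_change([task], schema_v2)
print(task["status"])
```

With Option B the on_schema_change step would be skipped, and the resolution view would load the schema matching the version stored on the task instead of the current one.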
Use Case 9: Concurrent Deduplication Processes
Problem Statement
Concurrent deduplication processes on overlapping datasets can lead to conflicts, data inconsistencies, or performance issues.
Solution Strategy
Locking Mechanisms: Implement a locking mechanism at the data level to prevent concurrent processes from modifying the same data simultaneously.
Optimistic Locking: Useful when conflicts are rare. Each transaction checks if the data has been modified by others before committing changes.
Pessimistic Locking: Suitable for high-conflict scenarios. Data gets locked for the duration of a transaction, preventing others from modifying it.
Transaction Management: Ensure deduplication operations are handled within transactions. This allows for rollback in case of conflicts or errors, maintaining data integrity.
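Optimistic locking inside a transaction can be illustrated with a version column that is checked in the UPDATE statement itself. The example uses Python's built-in sqlite3 module with an in-memory database purely for demonstration; the beneficiary table, its columns, and the function name are hypothetical.

```python
import sqlite3


def update_with_version_check(conn, record_id, new_name, expected_version):
    """Apply the update only if nobody changed the row since it was read."""
    with conn:  # commits on success, rolls back if an exception is raised
        cursor = conn.execute(
            "UPDATE beneficiary SET name = ?, version = version + 1 "
            "WHERE id = ? AND version = ?",
            (new_name, record_id, expected_version),
        )
    # Zero affected rows means the version was stale: a conflict occurred.
    return cursor.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE beneficiary (id INTEGER PRIMARY KEY, name TEXT, version INTEGER)"
)
conn.execute("INSERT INTO beneficiary VALUES (1, 'Ann', 1)")
conn.commit()

# Two deduplication processes both read the row at version 1.
first = update_with_version_check(conn, 1, "Anne", expected_version=1)
second = update_with_version_check(conn, 1, "Anna", expected_version=1)
print(first, second)
```

The second process detects the conflict without ever holding a lock; it can then reload the row and ask the reviewer to decide again. Pessimistic locking would instead lock the rows when the task is opened, at the cost of blocking other processes for the duration of the transaction.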
High Level Workflow
Initialization
Duplication Selection
Request Deduplication Review
Task Selection and Review
Deduplication Algorithm
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. https://creativecommons.org/licenses/by-sa/4.0/