...
concatenation of tblClaims and tblClaimAdmins based on ClaimAdminId : 5’953’640 entities and and 24 fields selected (11’965 claims have no value for the ClaimAdminId, therefore they were excluded in the concatenation process)
tblClaims: 5’965’605 claims with 20 selected fields (from a total of 46 fields in the table)
tblClaimAdmins: 913 entries with 5 selected fields
the information related to insurees, families and their location is then concatenated to the previous dataset (based on InsureeID), resulting in 5’953’640 rows and 48 columns
the information related to the insurees, families and their location is obtained through the merging of the tables Insurees and Families (based on FamilyID) and with the Location table (based on LocationID), resulting in a dataset with 3’790’789 rows and 25 columns (from a total of 51)
the information related to health facilities and their locations is concatenated to the previous dataset (based on HFID column), resulting in 5’953’640 rows and 57 columns
the information related to the health facilities and locations tables are concatenated (based on LocationID column) resulting in a set of 780 rows and 10 columns (from a total of 37 columns)
concatenation of the diagnosis information to the previous dataset (based on ICDID column), resulting in a dataset of 5’953’640 rows and 61 columns.
the information related to the tblDiagnosis: 1’959 rows and 5 selected fields
concatenation of tblClaimItems and tblItems based on ItemID: 12’414’014 medication items and 28 fields selected
tblClaimItems: 12’414’014 medication items with 17 selected fields (from a total of 30 fields in the table)
tblItems: 1’528 rows and 12 selected fields (from a total of 17)
concatenation of tblClaimItems/tblItems to the previous dataset based on ClaimID : 12’371’992 rows and 82 fields (in the previous of concatenation 42’022 entities were excluded, they were related to the claims without any value for ClaimAdminId)
the ClaimServices and Services data is then concatenanted to the claim related dataset, resulting in 16’655’364 rows and 28 fields
tblClaimServices and tblServices – are concatenated in a similar manner as the tblClaimItems and tblItems tables, based on ServiceID; the selected columns in the tblServices are renamed to be similar to those in the tblItems; The resulting dataset is composed of 16’655’364 rows and 28 fields
the fields related to the tblClaimServices/tblServices is concatenated to the dataset related to the claims, based on ClaimID column, resulting in a dataset with 16’600’273 rows and 82 fields
the resulting datasets related to medication and activity definition items are appended, resulting in a dataset with 28’972’265 rows and 82 fields.
All All the necessary data for the development, training and testing of the AI categorization algorithm was regrouped in a single table with 28’972’265 rows and 82 columns, corresponding to :
...
As it can be observed in the previous figures, in 2019 and 2010 the openIMIS system has allowed the submission of more then 10’000 claims per day (respectively more then 20’000 medication items and more then 30’000 activity definition items per day). This illustrates also the need for an AI categorization step which can support the HIB reviewers.
Steps | Code and results | Table name | #row | #cols | memory (MB) | Remarks |
Color labels | Selection of fields in tables | |||||
Concatenation of tables | ||||||
1 | tblClaims | 5 965 605 | 20 | 910.28 | Selection of 20 columns (of 46) | |
tblClaimAdmins | 913 | 5 | 0.03 |
| ||
tblClaims + tblClaimsAdmins | 5 953 640 | 24 | 1 135.57 | Concatenation of the previous two tables, merged on ClaimAdminID column; 11965 claims have NULL value for the ClaimAdminId | ||
2 | tblInsurees + tblFamilies + tblLocations | 3 790 789 | 25 | 751.96 | Selection of columns | |
Step1 + tblInsurees + tblFamilies + tblLocations | 5 953 640 | 48 | 2 225.71 | Concatenation of the previous two tables, merged on InsureeID column | ||
3 | tblHFs + tblLocations | 780 | 10 | 0.07 |
| |
Step2 + tblHFs + tblLocations | 5 953 640 | 57 | 2 634.51 | Concatenation of the previous two tables, merged on HFID column | ||
4 | tblDiagnosis | 1 959 | 5 | 0.07 |
| |
Step3 + tblDiagnosis | 5 953 640 | 61 | 2 816.21 | Concatenation of the previous two tables, merged on HFID column | ||
5 | tblClaimItems | 12 414 014 | 17 | 1 610.09 | Selection of 16 columns (of 30) and added a column ‘ItemServiceType’ =’Medication’ | |
tblItems | 1 528 | 12 | 0.14 | Selection of 9 columns (of 15) and added a column ItemLevel = ‘M’ | ||
tblClaimItem + tblItems | 12 414 014 | 28 | 2 746.63 | Concatenation of the previous two tables, merged on ItemID column | ||
6 | Step5 + Step4 | 12 371 992 | 82 | 7 834.44 | Concatenation of the previous two tables, merged on ClaimID column; 42022 entities were excluded in the concatenation process as they corresponded to claims with null ClaimAdminID | |
7 | tblClaimServices | 16 655 364 | 17 | 2 160.20 | Selection of 16 columns (of 30) and added a column ‘ItemServiceType’ =’Activity definition’ | |
tblServices | 1 868 | 12 | 0.17 | Selection of 10 columns (of 16) | ||
tblClaimServices + tblServices | 16 655 364 | 28 | 3 685.04 | Concatenation of the previous two tables, merged on ClaimID column; | ||
8 | Step7 + Step4 | 16 600 273 | 82 | 10 511.95 | Concatenation of the previous two tables, merged on ClaimID column; 55091 entities were excluded in the concatenation process as they corresponded to claims with null ClaimAdminID | |
9 | Step6 + Step8 | 28 972 265 | 82 | 18 125.35 | Appended datasets |
...
76.01% (21’702’401 records) of all data is not labeled (accepted without any manual reviewing)
Code and results: 5-SaniyCheck.pdf
Selection of fields
From all the 82 columns selected in previous steps, some were necessary only for the sanity check process. For example, the DateProcessed (the date and time of the processing of the claim), ValidityFromReview and AuditUserIDReview were used to check is a claim was reviewed. On the other hand, there are other field which are null over the claim review process and are initiated only after the valuation process (like RemuneratedAmount, LimitationValue, DeductableAmount, ExceedCeilingAmount, ExceedCeilingAmountCategory, PriceOrigin, PriceAdjusted, PriceValuated) and therefore were not considered useful information for the AI categorization process. All the information related to the field selection is included in the Table 2.
...
Color legend | Selected field (column) ready for the normalization step | ||||
Selected field (column) having missing values that must be filled | |||||
Selected field (column), no missing value, but categorical variables that must be converted | |||||
Not selected field (column) | |||||
Field selection | Data type | Non null values | Null values | Distinct values | Remark |
Dataset reviewed by MO composed of 6 851 424 rows and 89 colums | |||||
ClaimItemID | int64 | 6851482 | 0 | 6179170 | not necessary, only used for data concatenation process |
ClaimID | int64 | 6851482 | 0 | 891939 | will be replaced by ClaimUUID |
ItemID | int64 | 6851482 | 0 | 2587 | will be replaced by ItemUUID |
InsureeID | int64 | 6851482 | 0 | 432426 | will be replaced by InsureeUUID |
InsureeHFID | float64 | 6850941 | 541 | 286 | necessary? If yes, must fill the missing values? |
ClaimAdminId | float64 | 6851482 | 0 | 559 | will be replaced by values in ClaimAdminUUID |
ClaimAdminHFID | int64 | 6851482 | 0 | 288 | will be replaced by UUID values |
ClaimHFID | int64 | 6851482 | 0 | 288 | will be replaced by ClaimHFIDUUID |
LocationId | int64 | 6851482 | 0 | 5613 | will be replaced by LocationUUID |
HFLocationId | int64 | 6851482 | 0 | 62 | will be replaced by HFLocationUUID |
FamilyID | int64 | 6851482 | 0 | 266668 | will be replaced by FamilyUUID |
CHFID | int64 | 6851482 | 0 | 432401 | values already checked by Rule Engine |
ProdID | float64 | 6851482 | 0 | 3 | values already checked by Rule Engine |
PolicyID | float64 | 6851482 | 0 | 354756 | values already checked by Rule Engine |
ICDID | int64 | 6851482 | 0 | 1783 | ok |
ICDID1 | float64 | 2546244 | 4305238 | 1451 | if null value then set to ICDID |
ClaimItemStatus | int64 | 6851482 | 0 | 2 | These values were necessary for the data cleaning step, no longer necessary, will be replaced by one column 'Status' with 0: Accepted and 1: Rejected |
ClaimStatus | float64 | 6851482 | 0 | 3 | |
RejectionReason | int64 | 6851482 | 0 | 2 | |
ReviewStatus | int64 | 6851482 | 0 | 1 | |
IsHead | int64 | 6851482 | 0 | 2 | ok |
FamHeadInsuree | int64 | 6851482 | 0 | 266668 | ok |
Poverty | int64 | 6851482 | 0 | 2 | ok |
QtyProvided | float64 | 6851482 | 0 | 572 | ok |
QtyApproved | float64 | 293584 | 6557898 | 161 | the null values need to be filled with corresponding values: if the item is rejected, then the QtyApproved = 0; else QtyApproved = QtyProvided |
PriceAsked | float64 | 6851482 | 0 | 1019 | ok |
PriceApproved | float64 | 39326 | 6812156 | 5429 | null values need to be filled with corresponding values: 0 if the itemis rejected and PriceAproved = PriceAsked if accepted; To verify: if the QtyApproved is not null, then PriceApproved = QtyApproved*ItemPrice |
PriceValuated | float64 | 6379129 | 472353 | 17389 | This value is automatically filled during the valuation process, not to be taken into account |
ItemPrice | float64 | 6851482 | 0 | 868 | ok |
Claimed | float64 | 6851482 | 0 | 236694 | not very informative as it is the sum of PriceAsked for all items related to the claim |
Approved | float64 | 1323329 | 5528153 | 69866 | it is only a sum of PriceApproved for all the items, not very informative; if it is necessary, need to fill all the null values in the column |
ItemFrequency | float64 | 6851482 | 0 | 5 | ok |
ItemPatCat | float64 | 6851482 | 0 | 7 | ok |
AuditUserIDReview | float64 | 327276 | 6524206 | 23 | used only to check if there are items with ReviewStatus=(Selected for Review, Not Selected, Bypassed) that were actually reviewed by a MO, no longer needed |
Relationship | float64 | 4431501 | 2419981 | 8 | the null values correspond to the insure itself? If yes, we need to fill the null values |
DateFrom | datetime64[ns] | 6851482 | 0 | 1581 | datetime variable, to be converted to numeric: Conversion solution: number of days since 01.01.2016 |
DateTo | datetime64[ns] | 6851482 | 0 | 1582 | datetime variable, to be converted to numeric: Conversion solution: number of days since 01.01.2016 |
DateClaimed | datetime64[ns] | 6851482 | 0 | 1525 | datetime variable, to be converted to numeric: Conversion solution: number of days since 01.01.2016 |
DateProcessed | datetime64[ns] | 6573627 | 277855 | 7458 | date of the processing of the claim, correspond generally to the valuation process, not necessary |
ValidityFromReview | datetime64[ns] | 327276 | 6524206 | 215406 | only used for sanity check, no longer necessary |
DOB | datetime64[ns] | 6851482 | 0 | 33393 | datetime variable, to be converted to numeric: Conversion solution: number of days until DateFrom |
PhotoDate | datetime64[ns] | 6851482 | 0 | 1920 | datetime variable, to be converted to numeric: Conversion solution: number of days since 01.01.2016 |
ItemUUID | object | 6851482 | 0 | 2587 | to be converted to numeric and to replace ItemID |
ClaimUUID | object | 6851482 | 0 | 891939 | to be converted to numeric and to replace ClaimID |
ClaimHFUUID | object | 6851482 | 0 | 288 | to be converted to numeric and to replace HFID |
HFLocationUUID | object | 6851482 | 0 | 62 | to be converted to numeric and to replace HFLocationID |
InsureeUUID | object | 6851482 | 0 | 432426 | to be converted to numeric and to replace InsureeID |
FamilyUUID | object | 6851482 | 0 | 266668 | to be converted to numeric and to replace FamilyID |
LocationUUID | object | 6851482 | 0 | 5613 | to be converted to numeric and to replace LocationID |
ClaimAdminUUID | object | 6851482 | 0 | 559 | to be converted to numeric and to replace ClaimAdminID |
ItemCode | object | 6851482 | 0 | 2580 | to be converted to numeric |
ItemName | object | 6851482 | 0 | 2529 | used only for visualization, not necessary |
ItemLevel | object | 6851482 | 0 | 4 | ok |
ItemType | object | 6851482 | 0 | 2 | to be converted to numeric |
ItemCareType | object | 6851482 | 0 | 1 | to be converted to numeric |
HFCode | object | 6851482 | 0 | 276 | to be converted to numeric |
HFLevel | object | 6851482 | 0 | 2 | to be converted to numeric |
HFSublevel | object | 6610691 | 240791 | 2 | only one distinct value in the dataset, no useful information |
HFCareType | object | 6851482 | 0 | 2 | only one distinct value in the dataset, no useful information |
HFLocationName | object | 6851482 | 0 | 62 | used only for visualization, not necessary |
HFLocationType | object | 863822 | 0 | 1 | only one distinct value in the dataset, no useful information |
Gender | object | 6851482 | 0 | 3 | to be converted to numeric |
Marital | object | 5242068 | 1609414 | 6 | only one distinct value in the dataset, no useful information |
InsureeLocationName | object | 6851482 | 0 | 42 | used only for visualization, not necessary |
InsureeLocationType | object | 6851482 | 0 | 1 | only one distinct value in the dataset, no useful information |
ICDCode | object | 6851482 | 0 | 1783 | to be converted to numeric |
ICDName | object | 6851482 | 0 | 1783 | used only for visualization, not necessary |
VisitType | object | 6851482 | 0 | 3 | to be converted to numeric and need to solve the null values problem |
ItemServiceType | object | 6851482 | 0 | 2 | redundancy with respect to ItemType, only one column should be retained |
Explanation | object | 187640 | 6663842 | 58554 | used for sanity check only |
Justification | object | 84372 | 6767110 | 10847 | used for sanity check only |
ClaimExplanation | object | 625858 | 6225624 | 47565 | used for sanity check only |
Adjustment | object | 205650 | 6645832 | 4421 | used for sanity check only |
ClaimCode | object | 6851482 | 0 | 863952 | values not coherent, to be checked with Nepal |
ValidItem | float64 | 6851482 | 0 | 2 | ok |
ValidClaimAdmin | float64 | 6851482 | 0 | 2 | ok |
ValidInsuree | float64 | 6851482 | 0 | 2 | ok |
ValidFamily | float64 | 6851482 | 0 | 2 | ok |
ValidLocation | float64 | 6851482 | 0 | 2 | ok |
ValidICD | float64 | 6851482 | 0 | 1 | ok |
Code and results: 5-SaniyCheck.pdf
Filling missing values
As it observed in Table 2, several fields contain missing values that need to be filled and each field is treated in a distinct way:
...
Fig 5. Values related to the Relationship field in the dataset, after filling the missing values
Code and results: 6-ConvertData.pdf
Conversion of categorical data
...
Fields/features related to datetime variables:
DOB: we have computed the ‘Age’ in order to replace the DOB fields, using the following expression:
Difference_seconds = (DateFrom – DOB) converted in seconds
Age = Difference_seconds/(60*60*24*365.25)
resulting in a float version of the age of the patient.
DateFrom, DateTo, DateClaimed: were converted to a count of number of days between the dates in the fields and a hard coded date (here we have chosen 2016-01-01) and resulting in integer fields
Fields/features related to string/text fields were encoded using methods existing in the python: OneHotEncoder, LabelEncoder, DictVectorizer, Get dummies. In this case, we have used the LabelEncoder for converting string fields to numerical ones.
Code and results: 6-ConvertData.pdf
Data normalization
One necessary pre-processing step for many machine learning algorithms is the normalization of the data. This task is mandatory as the values in the database may have different order of magnitude and measurement unit. Several research projects have outline the fact that anomaly detection algorithms on normalized data have higher performance than compared with the results obtained on not normalized dataset [Campos et al,2016; Kandanaarachchi et al, 2020]. The normalization methods commonly used for anomaly detection algorithms are:
...
The last two normalization methods, Median-IQR and Median-MAD, are more robust to outliers [Kandanaarachchi et al, 2020; Rousseeuw et al,2018]. As the AI methods are sensible to the normalization type, all the normalization methods were tested.
Code and results: 6-ConvertData.pdf
Source code
The source code for the above results are available on GitHub: https://github.com/openimis/openimis-ai-research_py/tree/develop/1-data-preparation
References
Campos, G. O., Zimek, A., Sander, J., Campello, R. J., Micenková, B., Schubert, E., ... & Houle, M. E. (2016). On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data mining and knowledge discovery, 30(4), 891-927.
Kandanaarachchi, S., Muñoz, M. A., Hyndman, R. J., & Smith-Miles, K. (2020). On normalization and algorithm selection for unsupervised outlier detection. Data Mining and Knowledge Discovery, 34(2), 309-354.
Rousseeuw, P. J., & Hubert, M. (2018). Anomaly detection by robust statistics. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(2), e1236.