openIMIS-AI - 3. Normalised input data sets

About the data

The available data corresponds to anonymized openIMIS insuree and claim response data from the Health Insurance Board (HIB) in Nepal. The data documents the claim adjudication process in Nepal and forms the basis for training the AI algorithm for claim adjudication. The dataset is an extraction of the HIB openIMIS database covering 06-2016 to 09-2020.

The dataset described here is assembled by concatenating several openIMIS tables (and the corresponding FHIR resources). It combines all the information related to a medication or an activity definition, so each row corresponds to a ClaimItem or ClaimService together with its associated information, such as:

  • Numeric fields:

    • IDs:  ClaimItemID, ClaimID, Item or Service ID, InsureeID, HFID, ClaimAdminID, ICDID, ICDID1, FamilyID, LocationID, PolicyID, ProdID

    • Quantities or prices: QtyProvided, QtyApproved, PriceAsked, PriceAdjusted, PriceApproved, PriceValuated, Claimed, Approved, Item/ServicePrice

    • Statuses: ClaimStatus, ReviewStatus, ClaimItemStatus, RejectionReason 

    • Other numeric fields: Item/ServPatCat, Item/ServiceFrequency, Relationship, IsHead, Poverty

  • Categorical variables:

    • UUID: related to the numeric IDs

    • Datetime: DateFrom, DateTo, DateClaimed, DateProcessed, DOB

    • Character fields: Item/ServiceType, Item/ServiceCareType, Item/ServicePackage, Item/ServCategory, VisitType, HFCareType, HFLevel, HFSublevel, LocationType

    • Text labels: Item/ServiceCode, Item/ServiceName, Item/ServiceType, HFCode, LocationCode, LocationName

    • Text fields: Explanation (in tblClaims and tblClaimItems), Justification, Adjustment

These examples give a non-exhaustive overview of the types of fields analyzed in the data gathering and processing stage.

Selection from Tables

In order to prepare the final dataset, the following fields were selected from the existing tables and resources (see Table 1; a pandas sketch of the merge logic follows the list):

  1. concatenation of tblClaims and tblClaimAdmins based on ClaimAdminId: 5’953’640 entities and 24 selected fields (11’965 claims have no value for ClaimAdminId and were therefore excluded during the concatenation)

    1. tblClaims: 5’965’605 claims with 20 selected fields (from a total of 46 fields in the table)

    2. tblClaimAdmins: 913 entries with 5 selected fields

  2. the information related to insurees, families and their location is then concatenated to the previous dataset (based on InsureeID), resulting in 5’953’640 rows and 48 columns

    1. the information related to the insurees, families and their location is obtained by merging the tblInsurees and tblFamilies tables (based on FamilyID) and then the tblLocations table (based on LocationID), resulting in a dataset with 3’790’789 rows and 25 columns (from a total of 51)

  3. the information related to health facilities and their locations is concatenated to the previous dataset (based on HFID column), resulting in 5’953’640 rows and 57 columns

    1. the health facilities and locations tables are concatenated (based on the LocationID column), resulting in a set of 780 rows and 10 columns (from a total of 37 columns)

  4. concatenation of the diagnosis information to the previous dataset (based on ICDID column), resulting in a dataset of 5’953’640 rows and 61 columns.

    1. tblDiagnosis: 1’959 rows and 5 selected fields

  5. concatenation of tblClaimItems and tblItems based on ItemID: 12’414’014 medication items and 28 fields selected

    1. tblClaimItems: 12’414’014 medication items with 17 selected fields (from a total of 30 fields in the table)

    2. tblItems: 1’528 rows and 12 selected fields (from a total of 17)

  6. concatenation of tblClaimItems/tblItems to the previous dataset based on ClaimID: 12’371’992 rows and 82 fields (42’022 entities were excluded during this concatenation, as they were related to claims without any value for ClaimAdminId)

  7. the ClaimServices and Services data is then concatenated, resulting in 16’655’364 rows and 28 fields

    1. tblClaimServices and tblServices are concatenated in a similar manner to tblClaimItems and tblItems, based on ServiceID; the selected columns of tblServices are renamed to match those of tblItems; the resulting dataset is composed of 16’655’364 rows and 28 fields

  8. the fields from tblClaimServices/tblServices are concatenated to the claims dataset, based on the ClaimID column, resulting in a dataset with 16’600’273 rows and 82 fields (55’091 entities related to claims without any value for ClaimAdminId were excluded)

  9. the resulting datasets related to medication and activity definition items are appended, resulting in a dataset with 28’972’265 rows and 82 fields.
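The concatenation logic can be illustrated with pandas. The minimal sketch below shows steps 1 and 9, assuming the selected fields of each table have been exported to CSV files; the file paths are illustrative, and the column names follow the text above.

```python
import pandas as pd

# Load the exported openIMIS tables (paths are illustrative assumptions).
claims = pd.read_csv("tblClaims.csv")             # 20 selected fields
claim_admins = pd.read_csv("tblClaimAdmins.csv")  # 5 selected fields

# Step 1: an inner merge on ClaimAdminId drops the 11'965 claims
# that have no value for ClaimAdminId.
claims_admins = claims.merge(claim_admins, on="ClaimAdminId", how="inner")

# Steps 2-4 attach the insuree/family/location, health facility and
# diagnosis information in the same way, keeping all claims, e.g.:
# claims_admins = claims_admins.merge(insurees, on="InsureeID", how="left")

# Step 9: the medication (ClaimItems) and activity definition
# (ClaimServices) datasets share the same 82 columns and are appended.
# full_dataset = pd.concat([items_dataset, services_dataset],
#                          ignore_index=True)
```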

All the data necessary for the development, training and testing of the AI categorization algorithm was regrouped into a single table with 28’972’265 rows and 82 columns, corresponding to:

  • 5’953’640 claims: 12’371’992 medication and 16’655’364 activity definition entities (see Fig. 1)

  • 3’790’789 insurees

  • 780 health facilities

  • 1’959 diagnosis items

Fig. 1 – Medication and activity definition items existing in the dataset

Fig. 2. Evolution of the submission and review of claims

Fig. 3. Evolution of the number of medication and activity definition items, respectively, submitted and reviewed per day

As can be observed in the previous figures, in 2019 and 2020 the openIMIS system received more than 10’000 claim submissions per day (respectively, more than 20’000 medication items and more than 30’000 activity definition items per day). This also illustrates the need for an AI categorization step that can support the HIB reviewers.

Table 1. Steps of the dataset construction. The Type column replaces the color labels of the original table (selection of fields in tables vs. concatenation of tables).

| Step | Code and results | Table name | #rows | #cols | Memory (MB) | Type | Remarks |
|------|------------------|------------|-------|-------|-------------|------|---------|
| 1 | See 1-Claims_selection.pdf | tblClaims | 5 965 605 | 20 | 910.28 | Selection | Selection of 20 columns (of 46) |
| 1 | | tblClaimAdmins | 913 | 5 | 0.03 | Selection | |
| 1 | | tblClaims + tblClaimAdmins | 5 953 640 | 24 | 1 135.57 | Concatenation | Merged on the ClaimAdminID column; 11 965 claims have a NULL value for ClaimAdminId |
| 2 | | tblInsurees + tblFamilies + tblLocations | 3 790 789 | 25 | 751.96 | Selection | Selection of columns |
| 2 | | Step 1 + tblInsurees + tblFamilies + tblLocations | 5 953 640 | 48 | 2 225.71 | Concatenation | Merged on the InsureeID column |
| 3 | | tblHFs + tblLocations | 780 | 10 | 0.07 | Selection | |
| 3 | | Step 2 + tblHFs + tblLocations | 5 953 640 | 57 | 2 634.51 | Concatenation | Merged on the HFID column |
| 4 | | tblDiagnosis | 1 959 | 5 | 0.07 | Selection | |
| 4 | | Step 3 + tblDiagnosis | 5 953 640 | 61 | 2 816.21 | Concatenation | Merged on the ICDID column |
| 5 | See 2-ClaimItems_items.pdf | tblClaimItems | 12 414 014 | 17 | 1 610.09 | Selection | Selection of 16 columns (of 30), plus an added column ItemServiceType = 'Medication' |
| 5 | | tblItems | 1 528 | 12 | 0.14 | Selection | Selection of 9 columns (of 15), plus an added column ItemLevel = 'M' |
| 5 | | tblClaimItems + tblItems | 12 414 014 | 28 | 2 746.63 | Concatenation | Merged on the ItemID column |
| 6 | | Step 5 + Step 4 | 12 371 992 | 82 | 7 834.44 | Concatenation | Merged on the ClaimID column; 42 022 entities excluded (claims with a NULL ClaimAdminID) |
| 7 | See 3-ClaimServices_Services.pdf | tblClaimServices | 16 655 364 | 17 | 2 160.20 | Selection | Selection of 16 columns (of 30), plus an added column ItemServiceType = 'Activity definition' |
| 7 | | tblServices | 1 868 | 12 | 0.17 | Selection | Selection of 10 columns (of 16) |
| 7 | | tblClaimServices + tblServices | 16 655 364 | 28 | 3 685.04 | Concatenation | Merged on the ServiceID column |
| 8 | | Step 7 + Step 4 | 16 600 273 | 82 | 10 511.95 | Concatenation | Merged on the ClaimID column; 55 091 entities excluded (claims with a NULL ClaimAdminID) |
| 9 | See 4-ConcatenateData.pdf | Step 6 + Step 8 | 28 972 265 | 82 | 18 125.35 | Concatenation | Appended datasets |

Data cleaning

Before applying the Machine Learning algorithm, a considerable effort was made to check and clean the dataset obtained in the previous step, inspecting it along several dimensions. The sanity check of the data covers the coherence of the data in terms of values, dates and statuses, the filling of missing values, and the handling of categorical data (datetime, character or text variables in the dataset).

Coherence analysis

During data cleaning, several exclusion conditions were defined; they are mainly related to the adjudication workflow.

Exclusion conditions

Condition 1

All medication and activity definition items with ClaimStatus ‘Entered’ (the claim was entered but not submitted, so not yet checked by the Rule Engine) were excluded from the analysis: 201’402 records (83’986 medication and 117’416 activity definition items), corresponding to 40’393 claims with submission dates between 05-12-2017 and 29-09-2020.

For this project, the input to the AI model corresponds to a medication or activity definition item accepted by the Rule Engine. An item rejected by the Rule Engine is rejected automatically and will not be reviewed by the AI model or by a Medical Officer. For this reason, items not yet verified by the Rule Engine are not considered. Fig. 4 shows the distribution of claims in terms of claim status and review status.

Fig. 4. Number of claims by claim status and review status

Condition 2

All items rejected by the Rule Engine were excluded from the analysis: 209’210 records, corresponding to 78’289 claims.

Condition 3

Records with missing values in the ClaimAdminID, PolicyID, ProdID or VisitType columns were excluded: 1’451 records corresponding to 578 claims.

Condition 4

Inconsistencies between the values of the ClaimItemStatus, ClaimStatus, ReviewStatus, RejectionReason and PriceValuated fields: 145 records corresponding to 74 claims.

  • Records where the ClaimItemStatus is Rejected and RejectionReason is Accepted;

  • Records where the ClaimItemStatus is Accepted and RejectionReason is Rejected;

  • Records where ClaimStatus is Rejected and RejectionReason is Accepted;

Condition 5

Inconsistency between status and PriceValuated: rejected items with a positive valuated price: 33 records corresponding to 24 claims.

Condition 6

Inconsistencies in the date fields DateFrom, DateTo, DateClaimed or DOB: 6’199 records corresponding to 1’340 claims.

  • DateFrom or DateTo before 16-05-2016;

  • DateClaimed before DateFrom;

  • DOB after DateClaimed

A total of 418’440 records were excluded in this step, resulting in a cleaned dataset of 28’553’825 rows and 82 columns.
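A minimal pandas sketch of some of these conditions is shown below; the status codes are assumptions to be checked against the openIMIS schema, and the column names are those of Table 2.

```python
import pandas as pd

# Assumed status codes; verify against the openIMIS schema.
CLAIM_STATUS_ENTERED = 2
ITEM_STATUS_REJECTED = 2

def apply_exclusions(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the records satisfying the coherence conditions above."""
    # Condition 1: claims entered but not yet submitted.
    keep = df["ClaimStatus"] != CLAIM_STATUS_ENTERED
    # Condition 3: required identifiers must be present.
    required = ["ClaimAdminId", "PolicyID", "ProdID", "VisitType"]
    keep &= df[required].notna().all(axis=1)
    # Condition 5: a rejected item cannot have a positive valuated price.
    keep &= ~((df["ClaimItemStatus"] == ITEM_STATUS_REJECTED)
              & (df["PriceValuated"] > 0))
    # Condition 6 (partial): a claim cannot be submitted before the visit.
    keep &= df["DateClaimed"] >= df["DateFrom"]
    return df[keep]
```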

Summary

The next step was to identify the medication and activity definition records that were reviewed by a Medical Officer, as these records form the labeled dataset. For this, we checked the value of the ReviewStatus field: ReviewStatus == 8 indicates that the claim was reviewed by a Medical Officer (a short verification sketch follows the summary). To summarize:

  • 23.99% (6’851’424 records) of all the data can be considered labeled (reviewed by a Medical Officer); 3.78% of the labeled records were rejected (259’075 records) and 96.22% were accepted (6’592’349 records)

  • 76.01% (21’702’401 records) of all the data is not labeled (accepted without any manual review)
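These proportions can be reproduced in a few lines; the sketch assumes the cleaned dataset is a DataFrame named `cleaned` and that ClaimItemStatus code 2 means Rejected (an assumption to verify against the schema).

```python
# ReviewStatus == 8 marks records reviewed by a Medical Officer.
labeled = cleaned[cleaned["ReviewStatus"] == 8]
print(f"labeled share: {len(labeled) / len(cleaned):.2%}")    # ~23.99%

# Accepted vs rejected shares within the labeled part.
ITEM_STATUS_REJECTED = 2  # assumed code
rejected_share = (labeled["ClaimItemStatus"] == ITEM_STATUS_REJECTED).mean()
print(f"rejected among labeled: {rejected_share:.2%}")        # ~3.78%
```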

Code and results: 5-SanityCheck.pdf

Selection of fields

Of the 82 columns selected in the previous steps, some were necessary only for the sanity check. For example, DateProcessed (the date and time of the processing of the claim), ValidityFromReview and AuditUserIDReview were used to check whether a claim was reviewed. Other fields are null throughout the claim review process and are populated only during the valuation process (such as RemuneratedAmount, LimitationValue, DeductableAmount, ExceedCeilingAmount, ExceedCeilingAmountCategory, PriceOrigin, PriceAdjusted, PriceValuated); they were therefore not considered useful for the AI categorization process. All the information related to the field selection is included in Table 2, and a sketch of the column dropping is shown below.
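Dropping these columns is straightforward, assuming the combined dataset is loaded as a DataFrame `df`; the list below is abridged, the complete selection is given in Table 2.

```python
# Columns used only for the sanity check or filled during valuation
# (abridged; see Table 2 for the complete selection).
drop_cols = ["DateProcessed", "ValidityFromReview", "AuditUserIDReview",
             "RemuneratedAmount", "LimitationValue", "DeductableAmount",
             "ExceedCeilingAmount", "ExceedCeilingAmountCategory",
             "PriceOrigin", "PriceAdjusted", "PriceValuated"]
df = df.drop(columns=drop_cols, errors="ignore")
```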

Table 2. Description of the field selection

Color legend (rows of the original table are color-coded):

  • Selected field (column) ready for the normalization step

  • Selected field (column) having missing values that must be filled

  • Selected field (column) with no missing values, but a categorical variable that must be converted

  • Field (column) not selected

The dataset reviewed by Medical Officers is composed of 6 851 424 rows and 89 columns.

| Field | Data type | Non-null values | Null values | Distinct values | Remark |
|-------|-----------|-----------------|-------------|-----------------|--------|
| ClaimItemID | int64 | 6 851 482 | 0 | 6 179 170 | not necessary, only used for the data concatenation process |
| ClaimID | int64 | 6 851 482 | 0 | 891 939 | will be replaced by ClaimUUID |
| ItemID | int64 | 6 851 482 | 0 | 2 587 | will be replaced by ItemUUID |
| InsureeID | int64 | 6 851 482 | 0 | 432 426 | will be replaced by InsureeUUID |
| InsureeHFID | float64 | 6 850 941 | 541 | 286 | necessary? If yes, the missing values must be filled |
| ClaimAdminId | float64 | 6 851 482 | 0 | 559 | will be replaced by the values in ClaimAdminUUID |
| ClaimAdminHFID | int64 | 6 851 482 | 0 | 288 | will be replaced by UUID values |
| ClaimHFID | int64 | 6 851 482 | 0 | 288 | will be replaced by ClaimHFUUID |
| LocationId | int64 | 6 851 482 | 0 | 5 613 | will be replaced by LocationUUID |
| HFLocationId | int64 | 6 851 482 | 0 | 62 | will be replaced by HFLocationUUID |
| FamilyID | int64 | 6 851 482 | 0 | 266 668 | will be replaced by FamilyUUID |
| CHFID | int64 | 6 851 482 | 0 | 432 401 | values already checked by the Rule Engine |
| ProdID | float64 | 6 851 482 | 0 | 3 | values already checked by the Rule Engine |
| PolicyID | float64 | 6 851 482 | 0 | 354 756 | values already checked by the Rule Engine |
| ICDID | int64 | 6 851 482 | 0 | 1 783 | ok |
| ICDID1 | float64 | 2 546 244 | 4 305 238 | 1 451 | if the value is null, set it to ICDID |
| ClaimItemStatus | int64 | 6 851 482 | 0 | 2 | the four status fields were necessary for the data cleaning step and are no longer needed; they will be replaced by one column 'Status' with 0: Accepted and 1: Rejected |
| ClaimStatus | float64 | 6 851 482 | 0 | 3 | (see ClaimItemStatus) |
| RejectionReason | int64 | 6 851 482 | 0 | 2 | (see ClaimItemStatus) |
| ReviewStatus | int64 | 6 851 482 | 0 | 1 | (see ClaimItemStatus) |
| IsHead | int64 | 6 851 482 | 0 | 2 | ok |
| FamHeadInsuree | int64 | 6 851 482 | 0 | 266 668 | ok |
| Poverty | int64 | 6 851 482 | 0 | 2 | ok |
| QtyProvided | float64 | 6 851 482 | 0 | 572 | ok |
| QtyApproved | float64 | 293 584 | 6 557 898 | 161 | null values to be filled: if the item is rejected, QtyApproved = 0; else QtyApproved = QtyProvided |
| PriceAsked | float64 | 6 851 482 | 0 | 1 019 | ok |
| PriceApproved | float64 | 39 326 | 6 812 156 | 5 429 | null values to be filled: 0 if the item is rejected, PriceApproved = PriceAsked if accepted; to verify: if QtyApproved is not null, then PriceApproved = QtyApproved * ItemPrice |
| PriceValuated | float64 | 6 379 129 | 472 353 | 17 389 | automatically filled during the valuation process, not to be taken into account |
| ItemPrice | float64 | 6 851 482 | 0 | 868 | ok |
| Claimed | float64 | 6 851 482 | 0 | 236 694 | not very informative, as it is the sum of PriceAsked over all items of the claim |
| Approved | float64 | 1 323 329 | 5 528 153 | 69 866 | only the sum of PriceApproved over all items, not very informative; if necessary, the null values in the column need to be filled |
| ItemFrequency | float64 | 6 851 482 | 0 | 5 | ok |
| ItemPatCat | float64 | 6 851 482 | 0 | 7 | ok |
| AuditUserIDReview | float64 | 327 276 | 6 524 206 | 23 | used only to check whether items with ReviewStatus in (Selected for Review, Not Selected, Bypassed) were actually reviewed by an MO; no longer needed |
| Relationship | float64 | 4 431 501 | 2 419 981 | 8 | do the null values correspond to the insuree itself? If yes, the null values need to be filled |
| DateFrom | datetime64[ns] | 6 851 482 | 0 | 1 581 | datetime variable, to be converted to numeric: number of days since 01.01.2016 |
| DateTo | datetime64[ns] | 6 851 482 | 0 | 1 582 | datetime variable, to be converted to numeric: number of days since 01.01.2016 |
| DateClaimed | datetime64[ns] | 6 851 482 | 0 | 1 525 | datetime variable, to be converted to numeric: number of days since 01.01.2016 |
| DateProcessed | datetime64[ns] | 6 573 627 | 277 855 | 7 458 | date of the processing of the claim, generally corresponds to the valuation process; not necessary |
| ValidityFromReview | datetime64[ns] | 327 276 | 6 524 206 | 215 406 | only used for the sanity check, no longer necessary |
| DOB | datetime64[ns] | 6 851 482 | 0 | 33 393 | datetime variable, to be converted to numeric: number of days until DateFrom. Attention: some DOB values are in the Nepali calendar and must be corrected |
| PhotoDate | datetime64[ns] | 6 851 482 | 0 | 1 920 | datetime variable, to be converted to numeric: number of days since 01.01.2016 |
| ItemUUID | object | 6 851 482 | 0 | 2 587 | to be converted to numeric, replacing ItemID |
| ClaimUUID | object | 6 851 482 | 0 | 891 939 | to be converted to numeric, replacing ClaimID |
| ClaimHFUUID | object | 6 851 482 | 0 | 288 | to be converted to numeric, replacing HFID |
| HFLocationUUID | object | 6 851 482 | 0 | 62 | to be converted to numeric, replacing HFLocationID |
| InsureeUUID | object | 6 851 482 | 0 | 432 426 | to be converted to numeric, replacing InsureeID |
| FamilyUUID | object | 6 851 482 | 0 | 266 668 | to be converted to numeric, replacing FamilyID |
| LocationUUID | object | 6 851 482 | 0 | 5 613 | to be converted to numeric, replacing LocationID |
| ClaimAdminUUID | object | 6 851 482 | 0 | 559 | to be converted to numeric, replacing ClaimAdminID |
| ItemCode | object | 6 851 482 | 0 | 2 580 | to be converted to numeric |
| ItemName | object | 6 851 482 | 0 | 2 529 | used only for visualization, not necessary |
| ItemLevel | object | 6 851 482 | 0 | 4 | ok |
| ItemType | object | 6 851 482 | 0 | 2 | to be converted to numeric |
| ItemCareType | object | 6 851 482 | 0 | 1 | to be converted to numeric |
| HFCode | object | 6 851 482 | 0 | 276 | to be converted to numeric |
| HFLevel | object | 6 851 482 | 0 | 2 | to be converted to numeric |
| HFSublevel | object | 6 610 691 | 240 791 | 2 | only one distinct value in the dataset, no useful information |
| HFCareType | object | 6 851 482 | 0 | 2 | only one distinct value in the dataset, no useful information |
| HFLocationName | object | 6 851 482 | 0 | 62 | used only for visualization, not necessary |
| HFLocationType | object | 863 822 | 0 | 1 | only one distinct value in the dataset, no useful information |
| Gender | object | 6 851 482 | 0 | 3 | to be converted to numeric |
| Marital | object | 5 242 068 | 1 609 414 | 6 | only one distinct value in the dataset, no useful information |
| InsureeLocationName | object | 6 851 482 | 0 | 42 | used only for visualization, not necessary |
| InsureeLocationType | object | 6 851 482 | 0 | 1 | only one distinct value in the dataset, no useful information |
| ICDCode | object | 6 851 482 | 0 | 1 783 | to be converted to numeric |
| ICDName | object | 6 851 482 | 0 | 1 783 | used only for visualization, not necessary |
| VisitType | object | 6 851 482 | 0 | 3 | to be converted to numeric; the null-value problem needs to be solved |
| ItemServiceType | object | 6 851 482 | 0 | 2 | redundant with ItemType, only one column should be retained |
| Explanation | object | 187 640 | 6 663 842 | 58 554 | used for the sanity check only |
| Justification | object | 84 372 | 6 767 110 | 10 847 | used for the sanity check only |
| ClaimExplanation | object | 625 858 | 6 225 624 | 47 565 | used for the sanity check only |
| Adjustment | object | 205 650 | 6 645 832 | 4 421 | used for the sanity check only |
| ClaimCode | object | 6 851 482 | 0 | 863 952 | values not coherent, to be checked with Nepal |
| ValidItem | float64 | 6 851 482 | 0 | 2 | ok |
| ValidClaimAdmin | float64 | 6 851 482 | 0 | 2 | ok |
| ValidInsuree | float64 | 6 851 482 | 0 | 2 | ok |
| ValidFamily | float64 | 6 851 482 | 0 | 2 | ok |
| ValidLocation | float64 | 6 851 482 | 0 | 2 | ok |
| ValidICD | float64 | 6 851 482 | 0 | 1 | ok |

Code and results: 5-SanityCheck.pdf

Filling missing values

As observed in Table 2, several fields contain missing values that need to be filled, and each field is treated in a distinct way (a pandas sketch follows the list):

  • fields related to quantities or prices (QtyApproved and PriceApproved) are filled with numerical values according to the status (accepted or rejected) of the medication or activity definition item:

    • for rejected items, the quantities and prices will be filled with ‘0’

    • for accepted items, a null value means that the quantity or price was not modified by the reviewer, i.e. QtyApproved = QtyProvided or PriceApproved = PriceAsked

  • ICDID1: 4’305’205 values are null; one possibility is to fill them with the same value as ICDID

  • Relationship:

    • the value is null and the insuree is the head of the family (IsHead == 1): these missing values can be filled with ‘8’ (corresponding to ‘head of the family’; the values found in the dataset are between 0 and 7)

    • the value is null but the insuree is not the head of the family (IsHead == 0): these missing values can be filled with ‘9’ (corresponding to ‘not known’)
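A minimal pandas sketch of these filling rules, assuming the rejected/accepted status is given by ClaimItemStatus (code 2 = Rejected is an assumption to verify):

```python
import pandas as pd

ITEM_STATUS_REJECTED = 2  # assumed code, to be checked against the schema

def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Fill the missing values according to the rules above."""
    rejected = df["ClaimItemStatus"] == ITEM_STATUS_REJECTED

    # Rejected items: approved quantity and price are set to 0.
    cols = ["QtyApproved", "PriceApproved"]
    df.loc[rejected, cols] = df.loc[rejected, cols].fillna(0)

    # Accepted items: a null means the reviewer kept the submitted value.
    df["QtyApproved"] = df["QtyApproved"].fillna(df["QtyProvided"])
    df["PriceApproved"] = df["PriceApproved"].fillna(df["PriceAsked"])

    # Secondary diagnosis defaults to the main diagnosis.
    df["ICDID1"] = df["ICDID1"].fillna(df["ICDID"])

    # Relationship: 8 = head of the family, 9 = not known.
    head = df["Relationship"].isna() & (df["IsHead"] == 1)
    df.loc[head, "Relationship"] = 8
    df["Relationship"] = df["Relationship"].fillna(9)
    return df
```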

Fig. 5. Values of the Relationship field in the dataset, after filling the missing values

Code and results: 6-ConvertData.pdf

Conversion of categorical data

Most Machine Learning algorithms can only handle numerical fields (columns, or features), meaning only floats or integers. These numerical variables are also known as continuous variables: they can theoretically take an infinite number of values. In this dataset, for example, the fields related to prices are continuous variables. Unlike continuous variables, categorical variables can take only a finite number of values and are often strings. Sometimes integers also hide categorical variables, as they can be associated with a label. For this reason, it is important to look at how many distinct values a variable can take before deciding whether it is continuous or categorical. For most AI algorithms, categorical variables must be encoded as numerical ones.

For this project, the following procedure was used to encode the categorical variables (a combined sketch follows the list):

  • Fields/features related to datetime variables:

    • DOB: we computed an ‘Age’ field to replace DOB, using the following expression:

      • Difference_seconds = (DateFrom – DOB) converted to seconds

      • Age = Difference_seconds/(60*60*24*365.25)

    • this results in a float representation of the patient’s age.

    • DateFrom, DateTo, DateClaimed: converted to the number of days between the date in the field and a hard-coded reference date (here, 2016-01-01), resulting in integer fields

  • Fields/features related to string/text variables can be encoded using methods available in Python, such as scikit-learn’s OneHotEncoder, LabelEncoder and DictVectorizer, or pandas’ get_dummies. In this case, we used LabelEncoder to convert string fields to numerical ones.
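The sketch below combines both conversions; it assumes the date columns are already parsed as datetime64 and lists only a few of the string columns by way of example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

REFERENCE_DATE = pd.Timestamp("2016-01-01")

def encode_categorical(df: pd.DataFrame) -> pd.DataFrame:
    """Convert datetime and string fields to numeric ones."""
    # Age in years (float), replacing the raw DOB field.
    seconds = (df["DateFrom"] - df["DOB"]).dt.total_seconds()
    df["Age"] = seconds / (60 * 60 * 24 * 365.25)
    df = df.drop(columns=["DOB"])

    # Date fields become integer day counts since the reference date.
    for col in ["DateFrom", "DateTo", "DateClaimed"]:
        df[col] = (df[col] - REFERENCE_DATE).dt.days

    # String fields are label-encoded to integers (abridged list).
    for col in ["ItemCode", "HFCode", "ICDCode", "VisitType", "Gender"]:
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))
    return df
```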

Code and results: 6-ConvertData.pdf

Data normalization

One necessary pre-processing step for many machine learning algorithms is the normalization of the data. This step is mandatory because the values in the database may have different orders of magnitude and measurement units. Several research projects have shown that anomaly detection algorithms perform better on normalized data than on non-normalized data [Campos et al., 2016; Kandanaarachchi et al., 2020]. The normalization methods commonly used for anomaly detection algorithms are:

  • Minimum and maximum normalization (Min-Max) : each column x is transformed to (x-min(x))/(max(x)-min(x)), where min(x) and max(x) are respectively the minimum and the maximum values of the column x.

  • Mean and standard deviation normalization (Mean-SD): each column x is transformed to (x-mean(x))/sd(x), where mean(x) and sd(x) represent respectively the mean and the standard deviation of the values in the column x.

  • Median and IQR normalization (Median-IQR): each column x is transformed to (x-median(x))/IQR(x), where median(x) and IQR(x) represent respectively the median and the IQR (interquartile range) of the values in the column x.

  • Median and median absolute deviation normalization (Median-MAD): each column x is transformed to (x-median(x))/MAD(x), where MAD(x) = median(|x-median(x)|) is the median absolute deviation.

The last two normalization methods, Median-IQR and Median-MAD, are more robust to outliers [Kandanaarachchi et al., 2020; Rousseeuw et al., 2018]. As the AI methods are sensitive to the type of normalization, all four normalization methods were tested (a sketch of the four transformations follows).
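A minimal sketch of the four transformations, applied column-wise with pandas:

```python
import pandas as pd

def normalize(x: pd.Series, method: str = "median-mad") -> pd.Series:
    """Apply one of the four column-wise normalizations."""
    if method == "min-max":
        return (x - x.min()) / (x.max() - x.min())
    if method == "mean-sd":
        return (x - x.mean()) / x.std()
    if method == "median-iqr":
        iqr = x.quantile(0.75) - x.quantile(0.25)
        return (x - x.median()) / iqr
    if method == "median-mad":
        mad = (x - x.median()).abs().median()
        return (x - x.median()) / mad
    raise ValueError(f"unknown method: {method}")

# Example: normalize every column with the robust Median-MAD method.
# df_norm = df.apply(normalize)
```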

Code and results: 6-ConvertData.pdf

Source code

The source code for the above results is available on GitHub: https://github.com/openimis/openimis-ai-research_py/tree/develop/1-data-preparation

References

  1. Campos, G. O., Zimek, A., Sander, J., Campello, R. J., Micenková, B., Schubert, E., ... & Houle, M. E. (2016). On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data Mining and Knowledge Discovery, 30(4), 891-927.

  2. Kandanaarachchi, S., Muñoz, M. A., Hyndman, R. J., & Smith-Miles, K. (2020). On normalization and algorithm selection for unsupervised outlier detection. Data Mining and Knowledge Discovery, 34(2), 309-354.

  3. Rousseeuw, P. J., & Hubert, M. (2018). Anomaly detection by robust statistics. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(2), e1236.



