openIMIS-AI - 3. Normalised input data sets

Excluded data

The following records will be excluded from the input data:

Claims where ClaimAdminId is NULL => no Claim Administrator is associated with the Claim => 2459 records
Claims rejected by the Rule-based engine (static validation) => 462 records
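
As a rough illustration, the two exclusion rules above could be applied as a simple filter. This is a minimal sketch, not the project's implementation: ClaimAdminId comes from the text above, while the ClaimID key and the rejected_by_rules flag are hypothetical names introduced only for this example, since the page does not say which field records a rule-based rejection.

```python
def keep_claim(claim):
    """Return True if the claim stays in the input data set."""
    # Rule 1: drop claims with no Claim Administrator (ClaimAdminId is NULL).
    has_admin = claim.get("ClaimAdminId") is not None
    # Rule 2: drop claims rejected by the rule-based engine.
    # "rejected_by_rules" is a hypothetical flag, not an openIMIS field name.
    rejected = claim.get("rejected_by_rules", False)
    return has_admin and not rejected


# Illustrative records only; "ClaimID" is an assumed key.
claims = [
    {"ClaimID": 1, "ClaimAdminId": 7, "rejected_by_rules": False},
    {"ClaimID": 2, "ClaimAdminId": None, "rejected_by_rules": False},
    {"ClaimID": 3, "ClaimAdminId": 9, "rejected_by_rules": True},
]

kept = [c for c in claims if keep_claim(c)]
```

In this toy data set only the first claim passes both rules.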

Data normalization

One necessary pre-processing step for many machine learning algorithms is the normalization of the data. This step is needed because the values in the database may differ in order of magnitude and in measurement unit. Several studies have pointed out that anomaly detection algorithms perform better on normalized data than on non-normalized data [Campos et al., 2016; Kandanaarachchi et al., 2020]. The normalization methods commonly used for anomaly detection algorithms are:

Minimum and maximum normalization (Min-Max): each column x is transformed to (x-min(x))/(max(x)-min(x)), where min(x) and max(x) are respectively the minimum and the maximum values of the column x.
Mean and standard deviation normalization (Mean-SD): each column x is transformed to (x-mean(x))/sd(x), where mean(x) and sd(x) represent respectively the mean and the standard deviation of the values in the column x.
Median and IQR normalization (Median-IQR): each column x is transformed to (x-median(x))/IQR(x), where median(x) and IQR(x) represent respectively the median and the IQR (interquartile range) of the values in the column x.
Median and median absolute deviation normalization (Median-MAD): each column x is transformed to (x-median(x))/MAD(x), where MAD(x) = median(|x-median(x)|) is the median absolute deviation.

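The last two normalization methods, Median-IQR and Median-MAD, are more robust to outliers than Min-Max and Mean-SD, because the median, the IQR and the MAD are far less affected by extreme values than the minimum, maximum, mean and standard deviation [Kandanaarachchi et al., 2020; Rousseeuw & Hubert, 2018].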
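
The four transformations can be sketched for a single numeric column as follows. This is a minimal illustration using only the Python standard library, not the project's implementation. The function names and the sample prices are made up for the example. Two choices are assumptions: sd(x) is taken as the sample standard deviation, and the quartiles use the "inclusive" interpolation method (other quartile conventions give slightly different IQR values). The sketch also assumes the column is not constant, so that no denominator is zero.

```python
import statistics


def min_max(x):
    """(x - min(x)) / (max(x) - min(x))"""
    lo, hi = min(x), max(x)
    return [(v - lo) / (hi - lo) for v in x]


def mean_sd(x):
    """(x - mean(x)) / sd(x), with sd taken as the sample standard deviation."""
    m = statistics.mean(x)
    s = statistics.stdev(x)
    return [(v - m) / s for v in x]


def median_iqr(x):
    """(x - median(x)) / IQR(x), with IQR = Q3 - Q1."""
    med = statistics.median(x)
    q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
    return [(v - med) / (q3 - q1) for v in x]


def median_mad(x):
    """(x - median(x)) / MAD(x), with MAD(x) = median(|x - median(x)|)."""
    med = statistics.median(x)
    mad = statistics.median([abs(v - med) for v in x])
    return [(v - med) / mad for v in x]


# Made-up prices with one extreme value (500.0).
prices = [10.0, 12.0, 11.0, 13.0, 500.0]

print(min_max(prices))     # typical values are squeezed close to 0
print(mean_sd(prices))
print(median_iqr(prices))  # [-1.0, 0.0, -0.5, 0.5, 244.0]
print(median_mad(prices))  # [-2.0, 0.0, -1.0, 1.0, 488.0]
```

With Min-Max, the single extreme price compresses the four ordinary prices into a narrow band just above 0, whereas the median-based scalings keep them spread out and leave the extreme value clearly separated.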

Questions

Are quantity and price modified by the Claim submission?
If quantity and price are not correct, is the Item/Service rejected?
A manually reviewed Claim has the Checked outcome with Status Reviewed. Which fields record the values of quantity and price?
There are 447003 claims with ICDID1 as NULL. How should the cases where ICDID1 is not NULL be handled?

References

Campos, G. O., Zimek, A., Sander, J., Campello, R. J., Micenková, B., Schubert, E., ... & Houle, M. E. (2016). On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data Mining and Knowledge Discovery, 30(4), 891-927.
Kandanaarachchi, S., Muñoz, M. A., Hyndman, R. J., & Smith-Miles, K. (2020). On normalization and algorithm selection for unsupervised outlier detection. Data Mining and Knowledge Discovery, 34(2), 309-354.
Rousseeuw, P. J., & Hubert, M. (2018). Anomaly detection by robust statistics. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(2), e1236.
