openIMIS-AI - 4. AI methods, model outputs, and evaluation metrics

AI model output

In this project, the aim is to categorize the medical items and services related to a claim. From a statistical point of view, a rejected item/service is a special case of outlier: the rejected cases correspond to records in the database that differ significantly from the majority of the records. Such rare cases are the result of unusual events that generate anomalous patterns of activity.

The analysed dataset corresponds to the openIMIS claim submissions in Nepal, from May 2016 to September 2020. After data cleaning, the dataset is composed of 3'460'592 claims and 18'362'682 entities (corresponding to medical services and medications). Only 199'167 entities were manually rejected by the Medical Officers, representing 1.073% of the considered database. As the Medical Officers are not able to manually review all the received claims, 63.49% of the entities were accepted without manual verification. The main challenges of the present project can be summarized as follows:

  • an unbalanced dataset, with only a few samples carrying valuable information about the rejection cases;

  • the need to deal with changing patterns in the rejected claims;

  • partial labeling of the database, as not all the data was manually reviewed by a Medical Officer.

The objective of the AI model is to classify the items/services (denoted herein by entity) related to health insurance claims. For the AI model:

  • Accepted entities correspond to the majority class and thus represent negative outcomes, converted to 0 in the implemented AI model.

  • Rejected entities correspond to the minority class and thus represent positive outcomes, converted to 1 in the implemented AI model.

AI evaluation metrics

In order to illustrate and analyse the results of an AI model, we will use the confusion matrix, described as follows:

                                Predicted label
                                Accepted                        Rejected
True label    Accepted          True Negative (TN) entities     False Positive (FP) entities
              Rejected          False Negative (FN) entities    True Positive (TP) entities

As we deal with a skewed database, specific evaluation metrics must be employed. These metrics are based on four scenarios:

  • True Positive (TP): the AI model predicts 1 and the actual class is 1 (both the prediction and the classification made by the Medical Officer correspond to a rejected item or service);

  • True Negative (TN): the AI model predicts 0 and the actual class is 0 (both the prediction and the classification made by the Medical Officer correspond to an accepted item or service);

  • False Positive (FP): the model predicts 1, but the actual class is 0 (the model classifies the item/service as rejected, while it should be accepted);

  • False Negative (FN): the model predicts 0, but the actual class is 1 (the model predicts an accepted item or service, while it should be rejected).

The evaluation metrics that can be employed for unbalanced databases are the following:

  • Precision is expressed as P = TP/(TP+FP) and indicates, of all the predictions y = 1, which ones were correctly classified;

  • Recall is expressed as R = TP/(TP+FN) and indicates, of all the actual y = 1, which ones the model predicted correctly (also known as sensitivity or True Positive Rate);

  • a tradeoff usually has to be made between Precision and Recall, as improving one tends to degrade the other;

  • the F score or F1 score is the harmonic mean of precision and recall, expressed as F1 = 2*(P*R)/(P+R);

  • Specificity or True Negative Rate is expressed as SPC = TN/(TN+FP).
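As an illustration, the sketch below computes these metrics with scikit-learn; the small y_true/y_pred arrays are placeholders standing in for the Medical Officer labels and the model predictions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# y_true: labels assigned by the Medical Officers (0 = accepted, 1 = rejected)
# y_pred: labels predicted by the AI model (illustrative values only)
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 1, 1, 0, 0, 1, 0, 0]

# Confusion matrix: rows = true labels, columns = predicted labels
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = precision_score(y_true, y_pred)   # TP / (TP + FP)
recall = recall_score(y_true, y_pred)         # TP / (TP + FN), sensitivity
f1 = f1_score(y_true, y_pred)                 # harmonic mean of precision and recall
specificity = tn / (tn + fp)                  # True Negative Rate

print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f} SPC={specificity:.3f}")
```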

Dependencies of the AI model

The results obtained by the AI model will depend on several variables:

  • selected features: for the present project, we have chosen two feature configurations: (1) a selection of 27 features, identified during the data analysis and visualization; (2) the same 27 features, plus 6 aggregated features related to the submitted items and the related amounts by the insurer per week, month and year;

  • normalization method applied to the features (a minimal normalization sketch is given after this list).
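As an illustration of this normalization step, the sketch below applies the two most common scalers from scikit-learn to a small, hypothetical feature table; the column names and values are only illustrative.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical subset of the selected features (illustrative names and values)
df = pd.DataFrame({
    "item_price": [100.0, 250.0, 80.0, 4000.0],
    "item_quantity": [1, 2, 1, 10],
})

# Standardization: zero mean, unit variance per feature
standardized = StandardScaler().fit_transform(df)

# Min-max scaling: values mapped to the [0, 1] interval
scaled = MinMaxScaler().fit_transform(df)
```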

AI model selection

For the claim categorization process, anomaly detection techniques can be employed in order to detect rejected claims or items. The definition of an outlier is given by Hawkins (1980) as “an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism”. Anomaly detection is also known as the process of detecting patterns in the data whose behavior is not as expected; it is important for detecting rare events [Shah, 20xx]. Anomaly detection techniques are based on concepts like classification, nearest neighbour, clustering, statistics and information theory [Shah, 20xx; Chandola2009; Ahmed2016]. For the present project, we are dealing with anomaly detection techniques able to assign a label to each test instance.

Supervised anomaly detection algorithms assume that we have access to a training dataset with labeled observations for both the normal class (accepted items/services) and the abnormal class (rejected cases). In general, these techniques build a predictive model for the normal class versus the abnormal/anomaly class. Several difficulties may arise when using these methods:

  • the abnormal cases are much rarer than the normal class in the training data; in our case, we deal with highly imbalanced (skewed) data;

  • changing patterns in the anomaly cases are difficult for the prediction model to detect;

  • having access to data with accurate and representative labels for the anomaly class is a real challenge; furthermore, not all the data in the database is categorized/labeled and only a small proportion of the database is manually reviewed by Medical Officers.

Semi-supervised anomaly detection algorithms require access only to the labelled normal class; their objective is to build a model of the normal class from the training set and use this model to identify abnormal cases in the test set. Some methods instead require access only to labelled abnormal cases; however, these methods have difficulties in accurately representing the anomaly cases and in dealing with changing patterns.

Unsupervised anomaly detection techniques have proven their efficiency for problems with imbalanced and highly imbalanced datasets, where the rejected items/services can be characterized by changing patterns. The existing methods have received tremendous attention in the last decade [Zong2018; Chandola2009].

  1.  [Shah, 20xx] Shah, P. A critical survey on Anomaly detection.

  2. [Chandola2009] Chandola, V., Banerjee, A., & Kumar, V. (2009). Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3), 1-58.

  3. [Ahmed2016] Ahmed, M., Mahmood, A. N., & Islam, M. R. (2016). A survey of anomaly detection techniques in financial domain. Future Generation Computer Systems, 55, 278-288.

  4. [Chalapathy2019] Chalapathy, R., & Chawla, S. (2019). Deep learning for anomaly detection: A survey. arXiv preprint arXiv:1901.03407.

  5. [Zong2018] Zong, Bo, Qi Song, Martin Renqiang Min, Wei Cheng, Cristian Lumezanu, Daeki Cho, and Haifeng Chen. "Deep autoencoding gaussian mixture model for unsupervised anomaly detection." In International Conference on Learning Representations. 2018.

1. Classification based Anomaly detection techniques

Classification techniques are employed in order to learn a model (called a ‘classifier’) from a dataset of labeled entities (the training dataset) and then classify new entities/instances based on the learned model. The basic hypothesis is that the classifier can distinguish between the normal (accepted claims/items) and abnormal (rejected claims/items) classes in the given feature space. The techniques can be grouped into two categories: multi-class and one-class anomaly detection techniques. The multi-class classification based techniques rely on the availability of accurate labels for the various normal classes, which is not always possible. Figure 1 illustrates one-class and multi-class classification (represented for a two-feature space).

a) One-class anomaly detection

b) Multi-class anomaly detection

Fig 1 - Classification based anomaly detection [Chandola2009]

Several methods are able to solve classification based anomaly detection problems, among them Decision Trees, Support Vector Machine (SVM) classification, Naïve Bayes networks, etc. Furthermore, once we have trained several predictors/classifiers, each one having at least an accuracy of 80%, we can combine their predictions through the use of Ensemble methods (Voting Classifiers, Bagging and Pasting, Random Forests, Extra-Trees, Adaptive Boosting (AdaBoost), Gradient Boosting, …).

1.1 Decision Trees

A Decision Tree (DT) is a non-parametric supervised machine learning method used for both classification and regression tasks. The trees are constructed via an algorithmic approach in which the feature space is split according to different conditions.

Like a real tree, a DT is composed of:

  • a root node (the first node of the tree),

  • decision nodes (which split the values with respect to a specific feature),

  • leaf nodes (end points with no further splitting).

Each node represents the place where we choose a feature and ask a question with respect to the values taken by the selected feature, whereas the leaves represent the output of the model (binary/multi-class values) or class labels.

Classification Decision Trees can take both categorical and numeric features as input. DTs divide the feature space into axis-parallel rectangles or hyperplanes.

The objective of a decision tree is to separate the entities in the dataset into mutually exclusive subgroups. Starting from the root node, each node is split into child nodes, in a binary or multi-way fashion depending on the method used, based on the value of the feature (input variable) that best separates the given entities according to the considered metric. The records in a node are recursively separated into child nodes until no further split makes a statistical difference in the distribution of the records, or until the number of records in the node is too small. It is worth noting that each decision tree method uses its own splitting algorithm and splitting metrics [Sahin&Duman, 2011].

Once the tree is constructed, the resulting tree may overfit the training data and some branches may reflect anomalies. For this reason, the obtained tree should be pruned, i.e. nodes/branches should be removed starting from the leaves, in order to improve the generalization performance of the tree.

Advantages of DTs:

  • DTs are suited to both balanced and skewed datasets, and to binary as well as multi-class labels;

  • compared with other Machine Learning algorithms, decision trees require less effort for data preparation and pre-processing: DTs do not require normalization or scaling of the data, they can use categorical and numerical features, and missing values do not prevent the tree from being built;

  • DTs are very intuitive, can be easily visualized and are easy to explain to multi-disciplinary team members.

Disadvantages of DTs:

  • DTs tend to overfit if they are allowed to grow unconstrained; to avoid this, the tree must be pruned. In order to prune the tree, one can use the different parameters of the algorithm, like max_depth and min_samples_split;

  • as DTs split the feature space with respect to the selected features (all features or a fixed number of features), the time necessary for the train/test step increases with a high number of features;

  • the data is split according to the first best split found, which influences the following splits (and may not lead to the globally best split);

  • a small change in the data can cause a large change in the structure of the decision tree, thus causing instability.

For the present project, we are using Python’s Scikit-Learn Decision Trees with the following parameters to be tuned (a minimal usage sketch is given after the list):

  • criterion: measures the impurity of the split with the following metrics: “gini” (default) and “entropy”

  • splitter: allows the searching of the features for a split and can be set to “best” (from all possible features, for each node, the algorithm chooses the best feature space split) or “random” (considers only a random subset of the considered features and from these ones find the best split; the size of the random subset is given by the max_features parameter)

  • max_depth: fixes the maximum depth of a tree and often regularizes the tree and prevents over-fitting

  • min_samples_split: corresponds to the minimum number of entities contained in a node in order to consider splitting; this parameter can be used to regularize the tree and the default value is 2

  • min_samples_leaf: corresponds to the minimum number of entities contained in a leaf; the default value is 1 and this parameter can limit the growth of the tree

  • max_features: represents the number of features considered when searching for the best split; if no value is set for this parameter, the decision tree will consider all the features.

  • class_weight: weights associated with the classes; one can use the ‘balanced’ mode, which adjusts weights inversely proportional to the class frequencies in the input data.
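As an illustration, the sketch below instantiates a Scikit-Learn decision tree with the parameters listed above on synthetic, imbalanced stand-in data; the chosen values are illustrative and would need to be tuned on the actual claim data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Synthetic imbalanced data standing in for the claim entities (~1% rejected)
X, y = make_classification(n_samples=20_000, n_features=27, weights=[0.99], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

tree = DecisionTreeClassifier(
    criterion="gini",          # impurity measure ("gini" or "entropy")
    splitter="best",           # or "random"
    max_depth=10,              # regularizes the tree and prevents over-fitting
    min_samples_split=50,      # minimum entities in a node to consider splitting
    min_samples_leaf=20,       # minimum entities in a leaf
    class_weight="balanced",   # adjust weights inversely to class frequencies
    random_state=0,
)
tree.fit(X_train, y_train)
print(classification_report(y_test, tree.predict(X_test)))
```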

  1. [Sahin&Duman, 2011] Sahin, Y., & Duman, E. (2011). Detecting credit card fraud by decision trees and support vector machines. Proceedings of the International MultiConference of Engineers and Computer Scientists, March 16-18, 2011, Hong Kong, Vol. I.

  2. [Rokach&Maimon, 2005] Rokach, L., & Maimon, O. (2005). Decision trees. In Data mining and knowledge discovery handbook (pp. 165-192). Springer, Boston, MA.

1.2 Support Vector Machines

A Support Vector Machine (SVM) is a Machine Learning model capable of performing linear and nonlinear classification (but also regression); here it is used for binary classification (only two labels are considered: 0: Accepted and 1: Rejected).

The SVM algorithm will train the model on the training set and test it on new entities from the test dataset. The algorithm constructs optimal hyperplanes (linear or nonlinear) that best separate the two considered classes (Accepted and Rejected). This is illustrated in Figure 2, which shows a pictorial example of a data set that can be separated into two classes by linear boundaries. SVM approaches this as an optimization problem which maximizes the distance (margin) between the decision boundary and the closest data entities.

a) Separating hyperplanes

b) Best hyperplane

Fig 2 - An overview sketch of the Support Vector Machine linear classifier [Savas, 2019]

There are several tuning parameters specific to the SVM model (a minimal usage sketch is given after the list):

  • kernel: the learning of the hyperplane is done using linear algebra, i.e. through the use of kernels, which can be linear, polynomial, exponential or Gaussian Radial Basis Functions (RBF);

  • regularization parameter: tells the SVM optimization how much to avoid misclassification of entities; a high value for the regularization parameter will lead to a smaller-margin hyperplane so that all the training entities are classified correctly, whereas a small value will allow some misclassification of points;

  • gamma parameter: characterizes the influence of a single training example; a low gamma value means that points far away from the separation line are also taken into account when computing it;

  • margin: represents the distance between the hyperplane and the closest points; a good margin is sufficiently large for both classes.
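A minimal sketch of an SVM classifier in Scikit-Learn, with the parameters above mapped to the SVC arguments (C is the regularization parameter); the values and the synthetic stand-in data are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic imbalanced data standing in for the claim entities
X, y = make_classification(n_samples=5_000, n_features=27, weights=[0.99], random_state=0)

# SVMs are sensitive to feature scale, hence the scaling step
svm = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf",              # linear, poly, rbf, ...
        C=1.0,                     # regularization parameter
        gamma="scale",             # influence of a single training example
        class_weight="balanced"),  # compensate for the rare rejected class
)
svm.fit(X, y)
```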

  1. [Savas,2019] Savas, C., & Dovis, F. (2019). The impact of different kernel functions on the performance of scintillation detection based on support vector machines. Sensors, 19(23), 5219.

1.3 Voting classifiers

Suppose that several predictors have been trained, each with a minimum accuracy of 80%. We could then create a better classifier by aggregating, for each new entity, the predictions of all classifiers and predicting the class that obtains most of the votes. This is illustrated in Figure 3 for an ensemble classifier combining the results of four predictors. Considering a new instance, 3 classifiers predict ‘1’, while the other one predicts ‘0’, so the ensemble classifier will predict ‘1’. The voting classifier can deal with label predictions (as in Figure 3), in which case it is called a hard voting classifier, or with probability predictions, in which case we deal with soft voting (the predicted class is the one with the highest class probability, averaged over all the individual predictors).

Fig 3 - Predictions of a hard voting classifier [Géron, 2019]
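A minimal sketch of a hard voting ensemble with Scikit-Learn, combining three of the classifiers discussed in this section; the base estimators and their settings are illustrative choices, not the project's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=10_000, n_features=27, weights=[0.99], random_state=0)

voting = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=10, class_weight="balanced")),
        ("forest", RandomForestClassifier(n_estimators=100, class_weight="balanced")),
        ("svm", make_pipeline(StandardScaler(), SVC(class_weight="balanced"))),
    ],
    voting="hard",   # majority vote on predicted labels; "soft" averages class probabilities
)
voting.fit(X, y)
```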

  1. [Géron, 2019] Géron, A. (2019). Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: Concepts, tools, and techniques to build intelligent systems. O'Reilly Media.

1.4 Random Forest

Training a group of Decision Tree classifiers, each on a different random subset of the training dataset, allows us to build an Ensemble Method. For a new entity/instance, we obtain the predictions of all the individual trees and predict the class having the majority of votes. Such an ensemble method is called a Random Forest.

 The Random Forest algorithm introduces extra randomness when growing trees; when splitting a node, it will search for the best feature among a random subset of features. This will result in a greater tree diversity, yielding an overall better model [Géron, 2019].

With a few exceptions, a Random Forest classifier has all the hyperparameters of a Decision Tree classifier (in order to control the growth of the trees), plus hyperparameters specific to the ensemble. The main advantages of a Random Forest classifier are that:

  • it is less prone to overfitting than a single decision tree;

  • it is interpretable by looking at the feature importances;

  • it can be adapted to imbalanced datasets.

The main drawback is that Random Forests are computationally expensive.
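A minimal Random Forest sketch with Scikit-Learn, which also shows how the feature importances can be inspected; the parameter values and the synthetic stand-in data are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=10_000, n_features=27, weights=[0.99], random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,          # number of trees in the forest
    max_depth=12,              # same growth-control parameters as a single tree
    class_weight="balanced",   # compensate for the rare rejected class
    n_jobs=-1,
    random_state=0,
)
forest.fit(X, y)

# Interpretability: which features drive the accept/reject decision the most
for idx in forest.feature_importances_.argsort()[::-1][:5]:
    print(f"feature {idx}: importance {forest.feature_importances_[idx]:.3f}")
```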

  1. [Breiman, 2001] Breiman, L. (2001). Random forests. Machine learning, 45(1), 5-32.

1.5 Extra-Trees

When growing a Random Forest, at each node only a random subset of features is considered for splitting. An Extremely Randomized Trees ensemble (or, in short, Extra-Trees classifier) constructs even more randomized trees by also using random thresholds for each feature (whereas Decision Trees search for the best possible thresholds, which is also the most time-consuming task when growing a tree). Thus, Extra-Trees are less time-consuming to train than Random Forests.

The hyperparameters of an Extra-Trees classifier are similar to the ones for the Random Forest [Géron, 2019].
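The Scikit-Learn Extra-Trees classifier exposes essentially the same hyperparameters as the Random Forest, as the short sketch below illustrates (values and stand-in data again illustrative).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=10_000, n_features=27, weights=[0.99], random_state=0)

extra = ExtraTreesClassifier(
    n_estimators=200,          # same hyperparameters as RandomForestClassifier
    max_depth=12,
    class_weight="balanced",
    n_jobs=-1,
    random_state=0,
)
extra.fit(X, y)
```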

2. Nearest Neighbor anomaly detection techniques

The nearest neighbor techniques are based on the hypothesis that the normal data instances (accepted claims or items) will occur in dense neighborhoods, while the rare cases (rejected claims/items) occur far from their closest neighbors. These techniques are based on distance or similarity measures between two data instances, generally computed for each attribute and then combined. These methods are generally grouped in two categories:

a) techniques that use the distance of an entity to its kth nearest neighbor;

b) techniques that compute relative density of each data instance to compute its anomaly score.

2.1 Distance-based techniques

The K-Nearest Neighbors (or simply KNN) algorithm relies on the hypothesis that similar entities are placed in close proximity, i.e. are near to each other. The result of the algorithm will thus depend on the value of k, the number of neighbors considered for the prediction of the label of a new entity. The principle of the KNN algorithm is the following: for each new entity, its k nearest neighbors are retrieved and majority voting among the entities in the neighborhood, with or without distance weighting, is used to predict the label of the new entity. The results are therefore highly dependent on the value of k.

Fig 4 - Label Prediction of a new entity based on KNN algorithm

The advantages of the KNN algorithm are its simplicity of implementation, its robustness to noisy training data and its effectiveness for large training datasets. Among the disadvantages, we can count the high computational cost (as it implies computing the distance between the new entity and all the entities of the training set) and the dependence on the value of the parameter k.
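A minimal KNN classification sketch with Scikit-Learn; since KNN relies on distances, the features are scaled first, and the value of k (n_neighbors) and the weighting scheme are the main parameters to tune. The values and stand-in data shown are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=10_000, n_features=27, weights=[0.99], random_state=0)

knn = make_pipeline(
    StandardScaler(),                          # KNN is sensitive to feature scale
    KNeighborsClassifier(n_neighbors=5,        # the k parameter
                         weights="distance"),  # weight votes by inverse distance
)
knn.fit(X, y)
```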

  1. [Guo,2003] Guo, G., Wang, H., Bell, D., Bi, Y., & Greer, K. (2003, November). KNN model-based approach in classification. In OTM Confederated International Conferences" On the Move to Meaningful Internet Systems" (pp. 986-996). Springer, Berlin, Heidelberg.

2.2 Density-based techniques

Density-based anomaly detection techniques compute the density of the neighborhood of each entity in the dataset. An entity with a sparse neighborhood will thus be considered anomalous, while entities with dense neighborhoods will be treated as normal. Most of the density-based algorithms employ anomaly scores such as the Local Outlier Factor (LOF), which is calculated as the ratio of the average local density of the k nearest neighbors of the entity and the local density of the entity itself.
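A minimal Local Outlier Factor sketch with Scikit-Learn; it is unsupervised, so no labels are used, and the contamination parameter (the assumed share of anomalies, here set close to the observed rejection rate) as well as the random stand-in data are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 27))   # stand-in for the entity feature matrix

lof = LocalOutlierFactor(
    n_neighbors=20,        # size of the neighborhood used for the density estimate
    contamination=0.011,   # expected proportion of anomalies (~1.1% rejected)
)
labels = lof.fit_predict(X)          # -1 = anomaly (candidate rejection), 1 = normal
scores = lof.negative_outlier_factor_
```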

  1. [Zhao,2018] Zhao, M., Chen, J., & Li, Y. (2018, April). A Review of Anomaly Detection Techniques Based on Nearest Neighbor. In 2018 International Conference on Computer Modeling, Simulation and Algorithm (CMSA 2018). Atlantis Press.

3. Clustering based anomaly detection techniques

Clustering can be described as the separation of the data into groups of similar entities in order to detect patterns in the data. Each group, or cluster, is composed of entities that are consistent with each other and different from the entities of other groups. Here again, the algorithms are highly dependent on the data normalization.

During training, the algorithm groups the data into k distinct clusters, where k is a hyperparameter, and computes the cluster centroids. The objective of the algorithm is to minimize the distance between the entities (data points) and their associated cluster centroids. Anomalies are detected as the entities situated far from any cluster centroid. Several methods exist in the literature to determine the optimum number of clusters, for example the Elbow curve or the Silhouette method.

Figure 4 illustrates the definition of two clusters, using the clustering technique, for the data shown in panel a).

Fig 4 - Illustration of the clustering algorithm: a) initial data; b) initialization of two cluster centroids (here k=2); c) calculate distances and redefine cluster centroid; the process is continued until convergence in fig f)

The K-means algorithm is one of the most widely employed clustering algorithms, due to its simplicity, adaptability, time and storage complexity, and invariance to data ordering. Furthermore, it has been used to initialize other clustering algorithms like Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) and Spectral Clustering [Habeeb,2019].
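A minimal sketch of distance-to-centroid anomaly scoring with Scikit-Learn's KMeans; the number of clusters, the 99th-percentile threshold and the random stand-in data are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(10_000, 27)))  # stand-in features

kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)

# Distance of each entity to its own cluster centroid
distances = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1)

# Flag the entities farthest from their centroid as anomaly candidates
threshold = np.quantile(distances, 0.99)
anomalies = distances > threshold
```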

  1. [Habeeb, 2019] Ariyaluran Habeeb, R. A., Nasaruddin, F., Gani, A., Amanullah, M. A., Abaker Targio Hashem, I., Ahmed, E., & Imran, M. (2019). Clustering‐based real‐time anomaly detection—A breakthrough in big data technologies. Transactions on Emerging Telecommunications Technologies, e3647.

4. Deep Learning based anomaly detection techniques

Deep Learning (DL) techniques are the most modern technologies used to classify data in many multidisciplinary domains. They are a subset of machine learning techniques that achieve good performance and flexibility by learning to represent the data as a nested hierarchy of concepts within the layers of a neural network. Deep learning algorithms have been proven to outperform traditional machine learning as the scale of data increases, as illustrated in Figure 5 [Chalapathy, 2019; Javaid, 2016; Peng, 2015].

Fig 5 - Performance comparison of Deep learning-based algorithm versus traditional Machine Learning algorithms [Bahnsen, 2016; Chalapathy, 2019]

We will concentrate hereafter on DL techniques that have proven good performance results in anomaly detection in health insurance related domains.

4.1 Artificial Neural Networks

An Artificial Neural Network (ANN) is inspired by the neural networks of the brain and is composed of many layers of neurons acting as simple processing units. Figure 6 shows an illustration of an Artificial Neural Network: it is composed of an input layer (with 5 neurons), 4 hidden layers (each with 7 neurons) and an output layer (with 4 neurons). The neurons of the input layer bring the data into the neural network. The neurons of a layer perform computations on the data from the previous layer and transfer the results to the next layer.

Fig 6 - Representation of an Artificial Neural Network
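A minimal sketch of a feed-forward ANN for the accepted/rejected classification, using the Keras API shipped with TensorFlow; the layer sizes, optimizer and class weights are illustrative assumptions, not a tuned configuration.

```python
import tensorflow as tf

# Input layer for 27 features, a few hidden layers, one sigmoid output neuron
model = tf.keras.Sequential([
    tf.keras.Input(shape=(27,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # P(rejected)
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])

# Training call (commented out: X_train/y_train stand for the prepared feature matrix and labels)
# model.fit(X_train, y_train, epochs=20, batch_size=256, class_weight={0: 1.0, 1: 90.0})
```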

4.2 Autoencoders

The autoencoder is a special type of neural network in which the input and output layers have the same number of neurons, as illustrated in Figure 7. This technique is considered an unsupervised learning technique, for the simple reason that the output values are set to the input values. The objective of such a neural network is to extract the information existing in the input data, i.e. to learn the most important patterns existing in the data while ignoring noise. In order to extract this information, the hidden layers have fewer neurons than the input layer, thus composing the ‘encoding’ model. The narrow middle layer compresses redundancies in the input data while keeping non-redundant information [Nicolau, 2016].

Fig 7 - Representation of an Autoencoder

Autoencoders have many different applications, a first one being dimensionality reduction. Compared to other dimensionality reduction techniques, like PCA, the advantage of autoencoders is that they use nonlinear transformations (while PCA uses linear algebra to compute the minimal representation). In order to detect anomalies, autoencoders are first trained on the normal data only (accepted claims/items) so as to minimize the reconstruction error (RE) of the trained AE. The RE represents the error between the reconstructed output and the input data; it is generally measured using the mean square error and minimized using stochastic gradient descent. A new entity is then considered an anomaly if its reconstruction error is higher than a pre-defined threshold.
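A minimal autoencoder sketch in Keras following this scheme: train on normal (accepted) entities only, then flag entities whose reconstruction error exceeds a threshold. The architecture, the synthetic stand-in data and the 99th-percentile threshold are illustrative assumptions.

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
X_normal = rng.normal(size=(5_000, 27)).astype("float32")  # accepted entities only

# Encoder compresses the 27 features into a narrow middle layer, decoder reconstructs them
autoencoder = tf.keras.Sequential([
    tf.keras.Input(shape=(27,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(8, activation="relu"),     # bottleneck layer
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(27, activation="linear"),  # output = reconstructed input
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X_normal, X_normal, epochs=10, batch_size=256, verbose=0)

# Reconstruction error per entity; entities above the threshold are anomaly candidates
reconstruction = autoencoder.predict(X_normal, verbose=0)
re = np.mean((X_normal - reconstruction) ** 2, axis=1)
threshold = np.quantile(re, 0.99)
```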

4.3 Long Short Term Memory

The Long Short Term Memory (LSTM) technique is an extension of Recurrent Neural Networks (RNN). A standard RNN is illustrated in Figure 8.

Fig 8 - Representation of a folded Recurrent Neural Network (left side) and an unfolded one (right side); x_t represents an entity of the dataset at a specific time step [Singh, 2017]

LSTM models take into account a sequence of data in order to predict the class of new entities. LSTM has been proven to be very effective in learning long-term dependencies (more efficient than standard RNNs). The structure of an LSTM cell is illustrated in Figure 9 and is composed of several states and gates [Alghofaili, 2020]:

  • The cell state is the major chain in the information flow.

  • The forget gate analyses the information that must be kept or not, by considering an activation function (here a sigmoid function) of the prior hidden state (ht-1) and current input (Xt). The cell state vector (Ct-1) controls which elements will be forgotten.

  • The input gate checks which data is necessary to add from the current input (Xt) and to update the cell state; it employs a tanh activation function to pass the current input state and the hidden state.

  • The output gate conditions the information to be transferred to the next hidden state.

Fig 9 - Representation of an LSTM cell structure [Alghofaili, 2020]

LSTM models have several hyperparameters to be tuned (a minimal sketch is given after this list):

  • optimizer: used to enhance performance; can be RMSProp, Adam, Adagrad, …

  • metrics: used to measure performance, like accuracy, f1-score, …

  • batch size: the number of entities passed through the network at each training step

  • epochs: the number of complete passes (forward and backward propagation) over the training set.
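A minimal Keras LSTM sketch for classifying sequences of entities, using the hyperparameters from the list above; the sequence length, layer sizes, toy data and parameter values are illustrative assumptions rather than the project's tuned configuration.

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
# Toy data: 1,000 sequences of 10 time steps, 27 features each, ~1% rejected labels
X = rng.normal(size=(1_000, 10, 27)).astype("float32")
y = (rng.random(1_000) < 0.01).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10, 27)),
    tf.keras.layers.LSTM(32),                        # LSTM cells over the time dimension
    tf.keras.layers.Dense(1, activation="sigmoid"),  # P(rejected)
])
model.compile(optimizer="adam",                      # optimizer hyperparameter
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.Recall()])   # metrics hyperparameter
model.fit(X, y, batch_size=64, epochs=5, verbose=0)  # batch size and epochs hyperparameters
```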

  1. [Chalapathy,2019] Chalapathy, R., & Chawla, S. (2019). Deep learning for anomaly detection: A survey. arXiv preprint arXiv:1901.03407.

  2. [Javaid,2016] Javaid, A., Niyaz, Q., Sun, W., & Alam, M. (2016, May). A deep learning approach for network intrusion detection system. In Proceedings of the 9th EAI International Conference on Bio-inspired Information and Communications Technologies (formerly BIONETICS) (pp. 21-26).

  3. [Peng,2015] Peng, H. K., & Marculescu, R. (2015). Multi-scale compositionality: identifying the compositional structures of social dynamics using deep learning. PloS one, 10(4), e0118309.

  4. [Bahnsen,2016] Bahnsen, A. C. (2016). Building AI applications using deep learning.

  5. [Nicolau,2016] Nicolau, M., & McDermott, J. (2016, September). A hybrid autoencoder and density estimation model for anomaly detection. In International Conference on Parallel Problem Solving from Nature (pp. 717-726). Springer, Cham.

  6. [Singh,2017] Singh, A. (2017). Anomaly detection for temporal data using long short-term memory (lstm).

  7. [Alghofaili,2020] Alghofaili, Y., Albattah, A., & Rassam, M. A. (2020). A Financial Fraud Detection Model Based on LSTM Deep Learning Technique. Journal of Applied Security Research, 15(4), 498-516.




This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. https://creativecommons.org/licenses/by-sa/4.0/