
Once the data preparation step is finished, we can continue with the development of the AI algorithm. As seen earlier, data preparation involves data cleaning (managing errors or missing values), field/feature selection, conversion of categorical data, normalization and randomization of ordering. Each AI algorithm may require specific data preparation, at least in terms of normalization and randomization of ordering.

Data segregation

AI model training and evaluation comprises three steps: data segregation, model training and model evaluation. Data segregation corresponds to splitting the data into two sets (training and test) or three sets (training, development and test), depending on the validation procedure selected.

Generally, the dataset is split into training and test sets (and sometimes training/development/test sets) through a simple process of random subsampling. However, as we deal with skewed datasets, it is important to verify that the subsampling does not alter the statistics of the sample in terms of mean, proportion and variance. Otherwise, the test set may contain only a small proportion of the minority class (or no instance/entity at all). To avoid this, a recommended practice is to divide the dataset in a stratified fashion, i.e. to split the dataset randomly while ensuring that each class is correctly represented (the subsets thus keep the same class proportions as the complete dataset). Common training/test ratios are 60/40, 70/30, 80/20 or 90/10 (for large datasets).
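
As an illustration of stratified splitting, here is a minimal sketch in plain Python. The function name `stratified_split`, the label values and the class counts are hypothetical and chosen only for the example; this is not the code used in the study, and libraries such as scikit-learn provide equivalent functionality.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.1, seed=0):
    """Return (train_idx, test_idx) so that each class keeps roughly
    the same proportion in both subsets (hypothetical helper)."""
    rng = random.Random(seed)

    # Group the sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle within each class, then cut off the test share.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Hypothetical skewed labels: 950 accepted (0) and 50 rejected (1).
labels = [0] * 950 + [1] * 50
train_idx, test_idx = stratified_split(labels, test_fraction=0.1)

# Both subsets keep the 95% / 5% class proportions of the full dataset.
test_counts = Counter(labels[i] for i in test_idx)
train_counts = Counter(labels[i] for i in train_idx)
```

With a plain random split of the same data, the number of rejected entities in the test set would vary from run to run; the stratified version fixes it at the class share of the full dataset.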

The training dataset is the data we use to train the model, i.e. to estimate the parameters and hyperparameters describing the model. The test set is used to evaluate the model obtained with the estimated parameters and hyperparameters.

The development dataset (also known as the validation set) allows an unbiased evaluation of the model fit on the training dataset while the model hyperparameters are being tuned. This set is used only to fine-tune the hyperparameters of the model, not the model parameters, and thus affects the model only indirectly.

AI model training and evaluation

The AI models documented here are those that obtained an f1-score of at least 0.60 in the evaluation step. All the selected models belong to the classification-based anomaly detection techniques: Decision Trees, Random Forest, Extra-Trees, Extreme Gradient Boosting (XGBoost) and Voting Classifier.

AI model dependencies

For the present case, we have used the following dependencies:

Feature engineering: two feature configurations were tested: (1) Features1: 27 features selected after the data analysis and visualization; (2) Features2: the previous 27 selected features and 6 aggregated features (related to submitted items and related amount by the insurer per week, month, year)
Rejected entities for missing document(s) or modified according to document(s) were not considered in the study (as there is no feature related to the submitted/necessary documents)
Normalization method: (0) no normalization; (1) mean and standard deviation normalization; (2) median and IQR normalization; (3) minimum and maximum normalization
Data segregation: training set composed of 90% of all labeled data; test set composed of 10% of the labeled data set
Hyperparameter tuning: we have used a stratified k-fold cross-validation procedure, which keeps the same class distribution in the training and validation folds while providing an accurate validation scheme for testing different hyperparameter configurations.
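
The three normalization methods listed above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names and the sample values are hypothetical, and the quartile convention (`method="inclusive"`) is an assumption, since the source does not state how the IQR was computed.

```python
import statistics


def standardize(values):
    """(1) Mean and standard deviation normalization."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]


def robust_scale(values):
    """(2) Median and interquartile range (IQR) normalization."""
    # Assumed quartile convention; other conventions give slightly
    # different quartiles on small samples.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]


def min_max_scale(values):
    """(3) Minimum and maximum normalization, mapping values to [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical feature values; a constant feature would divide by zero
# in all three functions and needs separate handling.
amounts = [1.0, 2.0, 3.0, 4.0, 5.0]

standardized = standardize(amounts)
robust = robust_scale(amounts)   # [-1.0, -0.5, 0.0, 0.5, 1.0]
scaled = min_max_scale(amounts)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

In practice the statistics (mean, median, minimum and so on) would be estimated on the training set only and then reused to transform the test set, so that no information from the test set leaks into training.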

Decision Tree Classifier

The hyperparameters tuned for the Decision Tree Classifier are:

criterion: measuring the impurity of the split with possible values: ‘gini’ (default value) and ‘entropy’
maximum depth of the tree, with values from 5 to 61 (default is ‘None’)
minimum number of entities contained in a node in order to consider splitting with possible values between 4 and 40 (default value is 2)
minimum number of entities contained in a leaf, with possible values between 3 and 40 (default value is 1)
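
A search over these four hyperparameters with stratified k-fold cross-validation could look like the sketch below. It assumes scikit-learn, whose `DecisionTreeClassifier` parameter names match the ones used on this page. The synthetic dataset, the reduced grid, the number of folds and the random seeds are assumptions made for illustration, not the configuration used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic, skewed stand-in for the labeled data (about 10% positives).
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Reduced grid taken from the value ranges listed above.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [5, 20],
    "min_samples_split": [4, 40],
    "min_samples_leaf": [3, 40],
}

# Stratified folds keep the class proportions in every train/validation split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="f1",  # f1-score of the positive (minority) class
    cv=cv,
)
search.fit(X, y)

best_params = search.best_params_
best_f1 = search.best_score_
```

`best_params` then holds the combination with the highest mean f1-score across the folds; all other parameters keep their default values.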

The confusion matrix obtained for the Decision Tree Classifier with the default values is the following:

[Figure: confusion matrix of the Decision Tree Classifier with default values on the test set (1-DecisionTrees_Noparams_CM_test_f1.png)]

Based on the confusion matrix, we can compute the evaluation metrics considered in this study:

Precision = 0.6794 (i.e. from all entities predicted as rejected, 67.94% were correctly classified)
Recall = 0.6023 (i.e. from all entities rejected by the Medical Officers, the classifier correctly categorized 60.23%)
f1 score = 0.6385
Accuracy = 0.9819 (i.e. from all the entities considered, 98.19% were correctly categorized)
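
For reference, these metrics follow from the four cells of the confusion matrix as sketched below. The function name and the counts passed to it are hypothetical (they are not the values of the study); only the last lines use figures from this page, to check that the reported f1 score is consistent with the reported Precision and Recall.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the evaluation metrics from confusion matrix counts,
    taking 'rejected' as the positive class."""
    precision = tp / (tp + fp)  # share of predicted rejections that are correct
    recall = tp / (tp + fn)     # share of actual rejections that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=60, fp=20, fn=40, tn=880)

# Consistency check with the Precision and Recall reported above.
reported_precision, reported_recall = 0.6794, 0.6023
reported_f1 = (
    2 * reported_precision * reported_recall
    / (reported_precision + reported_recall)
)
```

Rounded to four decimals, `reported_f1` gives 0.6385, which matches the f1 score listed above. The example also shows why accuracy alone is misleading on a skewed dataset: it stays high as long as the majority class is classified correctly.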

In terms of execution time, the training time was about 80.94 s and the prediction time was 0.08 s on the computer configuration mentioned earlier.

The default parameter values for the Decision Tree Classifier are the following: criterion = ‘gini’, splitter = ‘best’, max_depth = None, min_samples_split = 2, min_samples_leaf = 1, min_weight_fraction_leaf = 0, max_features = None, random_state = None, max_leaf_nodes = None, min_impurity_decrease = 0, min_impurity_split = 0, class_weight = None, ccp_alpha = 0.

The best hyperparameters obtained correspond to criterion = ‘entropy’, max_depth = 20, min_samples_leaf = 20, min_samples_split = 22 (for the other parameters, the default values were considered). The confusion matrix, evaluation time and evaluation metrics are given below. While the number of False Negative cases has slightly increased with respect to the previous prediction, the number of False Positive entities has decreased from 5’109 to 2’920, which results in a higher Precision value.

[Figure: confusion matrix of the Decision Tree Classifier with the best hyperparameters on the test set (1-DecisionTrees_BestParams_CM_test_f1.png)]

Evaluation metrics on the test set: Precision = 0.7864; Recall = 0.5980; f1-score = 0.6794; Accuracy = 0.9850

The training set was composed of 677’212 entities, with 659’235 accepted and 17’977 rejected entities.