Once the data preparation step is finished, we can continue with the development of the AI algorithm. As seen earlier, data preparation involves data cleaning (handling null or missing values and errors), field/feature selection, conversion of categorical data, normalization and randomization of the ordering. Each AI algorithm may require a specific data preparation, at least in terms of normalization and randomization of the ordering.
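
As an illustration, a minimal pandas/scikit-learn sketch of these preparation steps is given below; the file name "items.csv" and the column "category" are hypothetical placeholders for the real data, and dropping rows is only one possible cleaning strategy.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical input: "items.csv" stands in for the real database extract,
# with numeric features, a categorical column "category" and missing values.
df = pd.read_csv("items.csv")

# Data cleaning: drop rows with missing values (imputation is an alternative)
df = df.dropna()

# Conversion of categorical data: one-hot encoding
df = pd.get_dummies(df, columns=["category"])

# Normalization: mean / standard deviation scaling of the numeric columns
numeric_cols = df.select_dtypes("number").columns
df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])

# Randomization of the ordering
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
```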

Data segmentation

The AI model training and evaluation step includes data segmentation, model training and model evaluation. Data segmentation corresponds to splitting the data into two (train and test) or three data sets (train, development and test), depending on the validation procedure that was selected.

Generally, the dataset is split into training and test sets (and sometimes training/development/test sets) through a simple process of random subsampling. However, as we deal with skewed datasets, it is important to verify that the subsampling does not alter the statistics of the sample in terms of mean, proportion and variance. Indeed, the test set may contain only a small proportion of the minority class (or no instance/entity at all). In order to avoid this, a recommended practice is to divide the dataset in a stratified fashion, i.e. we randomly split the dataset while ensuring that each class is correctly represented (the subsets thus maintain the same proportion of classes as the complete dataset). Common training/test ratios are 60/40, 70/30, 80/20 or 90/10 (for large datasets).
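
A minimal scikit-learn sketch of such a stratified split; the synthetic skewed dataset below is only an illustrative stand-in for the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative skewed dataset: 950 "accepted" (0) and 50 "rejected" (1) observations
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 27))          # 27 features, as in the study
y = np.array([0] * 950 + [1] * 50)

# Stratified 80/20 split: both subsets keep the 95/5 class proportion
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```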

The training dataset is the data we use to train the model, i.e. to estimate the parameters and hyperparameters describing the model.

The development dataset (also known as the validation set) allows an unbiased evaluation of the model fit on the training dataset while the model hyperparameters are being tuned. This set is only used to fine-tune the hyperparameters of the model, not the model parameters, and thus only indirectly affects the model.

The test set allows the evaluation of the model with its estimated parameters and hyperparameters.

The train/dev/test split ratio is specific to each use case, the number of samples and the model that we are trying to train. This methodology for splitting the data makes it possible to detect bias and variance problems.

As each (item/service) observation in the database depends on several variables, we are dealing with a multivariate anomaly detection problem. As seen earlier, several machine learning algorithms can meet the proposed goals of this project. As the data is only partially labeled, unsupervised learning techniques like Robust Covariance, One-Class SVM and Isolation Forest usually give satisfactory results. These techniques are based on the principle of identifying groups of similar data points (implicitly corresponding to the normal class, i.e. accepted items and services) and considering the points outside these groups as anomalies/abnormal observations (rejected items or services).

For the Robust Covariance technique, the basic hypothesis is that the normal class follows a Gaussian distribution. The training set, composed only of normal class observations, allows the estimation of the parameters of the Robust Covariance model: the mean and covariance of the multivariate Gaussian distribution. The threshold of the model that separates the normal and abnormal classes is then estimated on the development set, taking into account the trained parameters of the model. The test set allows us to evaluate the performance of the trained algorithm. A possible splitting of the data for this algorithm (illustrated by the sketch after the list) can be:

  • training set composed of 60% of the normal data;

  • development set composed of 20% of normal data and 50% of the abnormal observations;

  • test set composed of the remaining 20% of normal observations and 50% of abnormal ones.
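
As an illustration of this scheme, the sketch below uses scikit-learn's EllipticEnvelope (its robust covariance estimator) with placeholder data standing in for the real normal/abnormal observations; the threshold-selection rule (maximizing the f1-score on the development set) is an assumption, not the documented procedure.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.metrics import f1_score

# Placeholder data standing in for the real normal/abnormal observations
rng = np.random.default_rng(0)
X_normal = rng.normal(0, 1, size=(1000, 27))
X_abnormal = rng.normal(5, 1, size=(60, 27))

# 60/20/20 split of the normal data, 50/50 split of the abnormal data
n, a = rng.permutation(len(X_normal)), rng.permutation(len(X_abnormal))
train = X_normal[n[:600]]
dev_n, test_n = X_normal[n[600:800]], X_normal[n[800:]]
dev_a, test_a = X_abnormal[a[:30]], X_abnormal[a[30:]]

# Estimate the mean/covariance of the Gaussian model on normal data only
model = EllipticEnvelope(support_fraction=1.0).fit(train)

# Pick the decision threshold on the development set (higher score = more normal);
# here the threshold maximizing the f1-score of the abnormal class is kept
dev_X = np.vstack([dev_n, dev_a])
dev_y = np.r_[np.ones(len(dev_n)), -np.ones(len(dev_a))]
scores = model.decision_function(dev_X)
thr = max(scores, key=lambda t: f1_score(dev_y, np.where(scores >= t, 1, -1), pos_label=-1))

# Evaluate the tuned detector on the held-out test set
test_X = np.vstack([test_n, test_a])
test_y = np.r_[np.ones(len(test_n)), -np.ones(len(test_a))]
pred = np.where(model.decision_function(test_X) >= thr, 1, -1)
print("test f1 (abnormal class):", f1_score(test_y, pred, pos_label=-1))
```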

The One-Class SVM algorithm relaxes the hypothesis that all the fields/columns in the normal class must follow a Gaussian distribution. Instead, it identifies arbitrarily shaped regions with a high density of normal points and classifies as an anomaly/outlier any point that lies outside the boundary. Here again, the algorithm is trained only on part of the normal class.
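
A minimal sketch of this approach with scikit-learn's OneClassSVM; the placeholder data and the nu/gamma settings are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Placeholder data standing in for the real observations:
# normal points around the origin, a few abnormal points far away
rng = np.random.default_rng(0)
X_train_normal = rng.normal(0, 1, size=(500, 27))
X_eval = np.vstack([rng.normal(0, 1, size=(50, 27)),
                    rng.normal(6, 1, size=(10, 27))])

# Scale the features: the RBF kernel is distance based
scaler = StandardScaler().fit(X_train_normal)

# nu bounds the fraction of training points allowed outside the boundary
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
ocsvm.fit(scaler.transform(X_train_normal))

# +1 = inside the learned high-density region (normal), -1 = anomaly
pred = ocsvm.predict(scaler.transform(X_eval))
```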

...

AI model training and evaluation

The AI models documented here are those that obtained an f1-score of at least 0.60 in the evaluation step. All the selected models belong to the classification-based anomaly detection techniques: Decision Trees, Random Forest, Extra-Trees, Extra Gradient Boost and Voting Classifier.

AI model dependencies

For the present case, we have used the following dependencies:

  • Feature engineering: two feature configurations were tested: (1) Features1: the 27 features selected after the data analysis and visualization; (2) Features2: the previous 27 selected features plus 6 aggregated features (related to the items submitted and the corresponding amounts by the insurer per week, month and year)

  • Entities rejected for missing document(s) or modified according to document(s) were not considered in the study (as there is no feature related to the submitted/necessary documents)

  • Normalization method: (0) no normalization; (1) Mean and standard deviation normalization; (2) Median and IQR normalization; (3) Minimum and maximum normalization

  • Data segmentation: training set composed of 90% of all labeled data; test set composed of 10% of the labeled data set

  • Hyperparameter tuning: we have used a stratified k-fold cross-validation procedure, which keeps the same class distribution in the training and validation folds while providing an accurate validation scheme for testing different hyperparameter configurations (see the sketch after this list)
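
A minimal scikit-learn sketch of this setup; the placeholder data and the choice of k = 5 folds are assumptions, while the scalers correspond to normalization options (1)-(3) and the 90/10 stratified split to the data segmentation bullet above.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Placeholder labeled data standing in for the real feature matrix and labels
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 27))
y = rng.integers(0, 2, size=1000)

# Data segmentation: 90% training / 10% test, stratified on the label
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=42
)

# Normalization options (1)-(3); option (0) simply skips this step
scalers = {
    "mean_std": StandardScaler(),   # (1) mean and standard deviation
    "median_iqr": RobustScaler(),   # (2) median and IQR
    "min_max": MinMaxScaler(),      # (3) minimum and maximum
}

# Stratified k-fold: every fold keeps the same class proportions, so the
# hyperparameter configurations are validated on the same data distribution
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
```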

Decision Tree Classifier

The hyperparameters tuned for the Decision Tree Classifier are (a tuning sketch follows the list):

  • criterion: the function measuring the impurity of the split, with possible values ‘gini’ (default value) and ‘entropy’

  • maximum depth of the tree, with values from 5 to 61

  • minimum number of entities contained in a node in order to consider splitting it, with possible values between 4 and 40 (default value is 2)

  • minimum number of entities contained in a leaf, with possible values between 3 and 40 (default value is 1)
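
A minimal sketch of this tuning with scikit-learn's GridSearchCV; the placeholder data, the value steps inside the listed ranges and the 5-fold setting are illustrative assumptions (scikit-learn names these hyperparameters criterion, max_depth, min_samples_split and min_samples_leaf).

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Placeholder training data; in practice X_train / y_train come from the 90/10 split above
rng = np.random.default_rng(0)
X_train = rng.normal(size=(900, 27))
y_train = rng.integers(0, 2, size=900)

# Hyperparameter grid covering the listed ranges (the value steps are illustrative)
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": list(range(5, 62, 8)),          # maximum depth of the tree, 5 to 61
    "min_samples_split": list(range(4, 41, 6)),  # minimum entities in a node to split it
    "min_samples_leaf": list(range(3, 41, 6)),   # minimum entities in a leaf
}

# Stratified 5-fold grid search maximizing the f1-score
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="f1",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```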