
Once the data preparation step is finished, we can continue with the development of the AI algorithm. As seen earlier, data preparation involves data cleaning (handling null values and errors), field selection, normalization, randomization of the record order, and splitting the database into training, development and test sets.
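As a minimal illustration of these preparation steps, the sketch below cleans, selects and normalizes a toy table with pandas; the column names and values are invented for the example and do not come from the actual project data.

```python
import pandas as pd

# Toy stand-in for the raw item/service records; columns and values are
# invented for illustration only.
df = pd.DataFrame({
    "price": [10.0, 12.5, None, 11.0, 950.0],
    "quantity": [1, 2, 3, 2, 1],
    "supplier": ["A", "B", "A", "C", "B"],
})

df = df.dropna()                                      # data cleaning: drop rows with null values
X = df[["price", "quantity"]].to_numpy(dtype=float)   # field selection
X = (X - X.mean(axis=0)) / X.std(axis=0)              # normalization (z-score per field)
```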

The training dataset is the data we use to train the model, i.e. to estimate the parameters describing the model.

The development dataset (also known as the validation set) allows an unbiased evaluation of the model fit on the training dataset while its hyperparameters are being tuned. This set is used only to fine-tune the hyperparameters of the model, not the model parameters, and thus affects the model only indirectly.

The test set allows an unbiased evaluation of the final model, with its estimated parameters and hyperparameters.

The train/dev/test split ratio is specific to each use case, the number of samples and the model we are trying to train. This way of splitting the data makes it possible to detect bias and variance problems.
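A minimal sketch of such a split (here 60/20/20) on synthetic data is shown below; the arrays X and y are stand-ins for the real prepared features and (partial) labels, with label 0 for normal and 1 for abnormal observations.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the split is reproducible

# Synthetic stand-in for the prepared data: the normal class around the
# origin, the abnormal class shifted away from it. The real features would
# come from the preparation step above.
X = np.vstack([rng.normal(size=(950, 5)),
               rng.normal(loc=4.0, size=(50, 5))])
y = np.r_[np.zeros(950, dtype=int), np.ones(50, dtype=int)]

idx = rng.permutation(len(X))    # randomization of the record order
train_end, dev_end = int(0.6 * len(X)), int(0.8 * len(X))

X_train, y_train = X[idx[:train_end]], y[idx[:train_end]]
X_dev,   y_dev   = X[idx[train_end:dev_end]], y[idx[train_end:dev_end]]
X_test,  y_test  = X[idx[dev_end:]], y[idx[dev_end:]]
```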

As each (item/service) observation in the database depends on several variables, we are dealing with a multivariate anomaly detection problem. As seen earlier, several machine learning algorithms can meet the goals of this project. As the data is only partially labeled, unsupervised learning techniques such as Robust Covariance, One-Class SVM and Isolation Forest usually give satisfactory results. These techniques are based on the principle of identifying groups of similar data points (implicitly corresponding to the normal class, i.e. accepted items and services) and treating the points outside these groups as anomalies/abnormal observations (rejected items or services).

For the Robust Covariance technique, the basic hypothesis is that the normal class has a Gaussian distribution. The training set, composed only of normal-class observations, allows the estimation of the parameters of the Robust Covariance model: the mean and covariance of the multivariate Gaussian distribution. The threshold that separates the normal and abnormal classes is then estimated on the development set, taking into account the trained parameters of the model. The test set is used to evaluate the performance of the trained algorithm. A possible split of the data for this algorithm, illustrated in the sketch after this list, is:

  • training set composed of 60% of the normal data;

  • development set composed of 20% of normal data and 50% of the abnormal observations;

  • test set composed of the remaining 20% of normal observations and 50% of abnormal ones.
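A minimal sketch of this workflow, reusing the synthetic X and y from the splitting sketch above, could look as follows; it assumes scikit-learn, where MinCovDet is the robust covariance estimator, and the percentile grid used to search for the threshold (maximizing F1 on the development set) is an arbitrary choice made for the example.

```python
import numpy as np
from sklearn.covariance import MinCovDet
from sklearn.metrics import f1_score

normal, abnormal = X[y == 0], X[y == 1]
n60, n80, half = int(0.6 * len(normal)), int(0.8 * len(normal)), len(abnormal) // 2

X_train = normal[:n60]                                   # 60% of the normal data
X_dev = np.vstack([normal[n60:n80], abnormal[:half]])    # 20% normal + 50% abnormal
y_dev = np.r_[np.zeros(n80 - n60, dtype=int), np.ones(half, dtype=int)]
X_test = np.vstack([normal[n80:], abnormal[half:]])      # remaining 20% + 50%
y_test = np.r_[np.zeros(len(normal) - n80, dtype=int),
               np.ones(len(abnormal) - half, dtype=int)]

# Robust estimate of the Gaussian mean and covariance from normal data only.
mcd = MinCovDet(random_state=0).fit(X_train)

# Squared Mahalanobis distance to the fitted Gaussian; large = far from the
# normal cluster. The separating threshold is chosen on the development set.
d_dev = mcd.mahalanobis(X_dev)
candidates = np.percentile(d_dev, np.arange(50, 100))
threshold = max(candidates, key=lambda t: f1_score(y_dev, (d_dev > t).astype(int)))

# Final evaluation on the held-out test set.
d_test = mcd.mahalanobis(X_test)
print("test F1:", f1_score(y_test, (d_test > threshold).astype(int)))
```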

The One-Class SVM algorithm relaxes the hypothesis that all the fields/columns in the normal class must follow a Gaussian distribution. Instead, it identifies arbitrarily shaped regions with a high density of normal points and classifies as an anomaly/outlier any point that lies outside the boundary. Here again, the algorithm is trained only on part of the normal class.
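A sketch with scikit-learn's OneClassSVM, reusing the normal-only training set and the development/test sets built in the Robust Covariance sketch above; the candidate values of the nu hyperparameter are illustrative only and would be tuned per use case.

```python
from sklearn.svm import OneClassSVM
from sklearn.metrics import f1_score

best_f1, best_model = -1.0, None
# nu upper-bounds the fraction of training points treated as outliers; it is
# a hyperparameter, so it is selected on the development set.
for nu in (0.01, 0.05, 0.1, 0.2):
    model = OneClassSVM(kernel="rbf", gamma="scale", nu=nu).fit(X_train)
    pred_dev = (model.predict(X_dev) == -1).astype(int)   # -1 = outlier -> label 1
    score = f1_score(y_dev, pred_dev)
    if score > best_f1:
        best_f1, best_model = score, model

# Final evaluation of the selected model on the test set.
pred_test = (best_model.predict(X_test) == -1).astype(int)
print("test F1:", f1_score(y_test, pred_test))
```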

Isolation Forests construct a collection of decision trees in which each tree splits the data at random points, without any hypothesis on the shape of the distribution. Anomalies are considered to be the most isolated points, i.e. those separated from the rest of the data after the fewest splits. In this case the training set must contain both normal-class and abnormal-class observations.
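A corresponding sketch with scikit-learn's IsolationForest; fitting on the combined train and development data (a mix of normal and abnormal points) and the contamination value of 0.05 are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score

# Unlike the two previous methods, the forest is fitted on a mix of normal
# and abnormal observations; contamination is the assumed anomaly fraction.
iforest = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)
iforest.fit(np.vstack([X_train, X_dev]))

pred_test = (iforest.predict(X_test) == -1).astype(int)   # -1 = anomaly
print("test F1:", f1_score(y_test, pred_test))
```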
