
Once the data preparation step is finished, we can continue with the development of the AI algorithm. As seen earlier, data preparation involves data cleaning (handling errors and missing values), field/feature selection, conversion of categorical data, normalization, and randomization of ordering. Each AI algorithm may require a specific data preparation, at least in terms of normalization and randomization of ordering.

Data segregation

The AI model training and evaluation includes the data segregation, the model training, and the model evaluation. The data segregation corresponds to splitting the data into two sets (train and test) or three sets (train, development, and test), depending on the validation procedure that was selected.
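The two-way split described above can be sketched with scikit-learn's `train_test_split`; the feature matrix `X`, label vector `y`, and the 80/20 split ratio below are illustrative placeholders, not values from this study.

```python
# Sketch of data segregation with a two-way (train/test) split.
# X and y are placeholder data; the 80/20 ratio is an assumption.
from sklearn.model_selection import train_test_split

X = [[i] for i in range(100)]    # placeholder features
y = [i % 2 for i in range(100)]  # placeholder labels

# Hold out 20% of the entities for testing; shuffling also provides
# the randomization of ordering mentioned in the data preparation step.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=0
)
print(len(X_train), len(X_test))  # 80 20
```

A three-way split can be obtained the same way, by applying a second `train_test_split` to the training portion to carve out the development set.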

...

  • criterion: measures the impurity of the split, with possible values ‘gini’ (default value) and ‘entropy’

  • maximum depth of the tree, with values from 5 to 61

  • minimum number of entities required in a node in order to consider splitting, with possible values between 4 and 40 (default value is 2)

  • minimum number of entities contained in a leaf, with possible values between 3 and 40 (default value is 1)
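A search over the four ranges listed above can be sketched with scikit-learn's `GridSearchCV`; the synthetic dataset and the coarse step sizes inside each range below are assumptions made to keep the example fast, not choices from this study.

```python
# Sketch of a hyperparameter search over the ranges listed above.
# The dataset is synthetic and the step sizes are illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": range(5, 62, 14),          # sampled from 5 to 61
    "min_samples_split": range(4, 41, 12),  # sampled from 4 to 40
    "min_samples_leaf": range(3, 41, 12),   # sampled from 3 to 40
}

# 3-fold cross-validation over every parameter combination.
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

In practice the full ranges (or `RandomizedSearchCV` for a cheaper search) would be used instead of the thinned-out grids shown here.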

The confusion matrix obtained for the Decision Tree Classifier with the default values is the following:

...

Based on the confusion matrix, we can compute the evaluation metrics considered in this study:

  • Precision = 0.5920 (i.e. from all entities predicted as rejected, 59.20% were correctly classified)

  • Recall = 0.6200 (i.e. from all entities rejected by the Medical Officers, the classifier correctly categorized 62%)

  • f1 score = 0.6057

  • Accuracy = 0.9786 (i.e. from all the entities considered, 97.86% of items were correctly categorized)

In terms of execution time, the training time was about 341 s, while the prediction time was 0.32 s on the computing configuration mentioned earlier.

The default parameter values for the Decision Tree Classifier are the following: criterion = ‘gini’, splitter = ‘best’, max_depth = None, min_samples_split = 2, min_samples_leaf = 1, min_weight_fraction_leaf = 0, max_features = None, random_state = None, max_leaf_nodes = None, min_impurity_decrease = 0, min_impurity_split = 0, class_weight = None, ccp_alpha = 0.
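These defaults are what a parameter-free construction of the classifier yields, as the sketch below shows; note that `min_impurity_split` has been removed in newer scikit-learn releases, so the exact parameter set depends on the installed version.

```python
# A default-constructed DecisionTreeClassifier carries the default
# values listed above (modulo scikit-learn version differences).
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
params = clf.get_params()
print(params["criterion"], params["max_depth"], params["min_samples_split"])
# gini None 2
```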

All the information concerning the confusion matrix, evaluation metrics, and execution time values is presented in Table 1.