Boosted Decision Trees for Lithiasis Type Identification

Several urologic studies have shown that determining the lithiasis type is important for limiting the risk of recurrence and the deterioration of renal function. Classifying urolithiasis is difficult because a large number of parameters (components, age, gender, background, ...) take part in the classification, and hence in determining the probable etiology. There are six types of urinary lithiasis, distinguished by their composition (chemical components in given proportions), their etiology and the patient profile. This work presents models based on boosted decision trees, which were compared according to their error rates and runtimes. The principal objectives of this work are to facilitate the classification of urinary lithiasis, to reduce the classification runtime, and to serve an epidemiologic interest. The experimental results show that the method is effective and encouraging for lithiasis type identification.


I. INTRODUCTION
Urinary lithiases are hard crystals that form in the urinary tract, mostly in the upper urinary tract. Through a cooperation with the University Hospital Center (CHU) of Annaba, we obtained a significant related data set. However, the major difficulty with these data lay in analyzing and interpreting them so as to define the problem well. The physics laboratory of the CHU provided us with data concerning the patients (age, sex, …) and the urolithiasis composition. Most of the collected data are important in determining the urinary lithiasis type.
The urolithiasis composition plays a significant role [1] [2] in determining the lithiasis types and their etiologies; knowing these makes it possible to understand why the stones occur and helps in prescribing a diet or an appropriate treatment.
The problem posed in this study was to identify the urinary lithiasis type from the stone composition and the patient's profile.
There are six types of urolithiasis, which differ in their morphology and chemical composition; they are presented in Fig. 1. Most studies [3] are based on the four most dominant types, namely types 1, 2, 3 and 4, because 80% of urinary lithiases belong to these four types. In this work, all six types have been included in the classification; they correspond respectively to the following etiologies: hyperoxaluria, hypercalciuria, hyperuricosuria, urinary infections, cystinuria and proteinuria.
Each of the six types is composed of the following substances: C1 for type 1; C2 and C1 for type 2; C1, C2 and AU for type 3; C1, C2 and CA for type 4; cystine and CA for type 5; C1 and protein for type 6 [3,4]. However, the six types are not composed only of these components: they contain tens of other components in relatively low amounts, and it is these that make it possible to distinguish the six types effectively, which is not the case for the components present in large amounts (appendix 1).
In this article, a boosted decision tree system is used to determine the urinary lithiasis types. The paper is organized as follows. Section II presents the related works. Section III discusses data analysis and data reduction. Section IV explains the methods and tools used. Section V presents the results of these methods and compares them with other learning models according to their classification accuracy, and thus their determination of etiologies. Finally, Section VI concludes the paper.

www.ijacsa.thesai.org

At the beginning of this project, we carried out a first work on the classification of urolithiasis, which gave promising results. We presented the results of three classifier systems (a neural network, an SVM and a neuro-fuzzy system), which were compared according to their effectiveness.
After the data of the 378 patients had been standardized (TABLE II), they were divided into two subsets: 378 samples for the training stage and 150 for the validation stage. Data normalization is an important step, especially for classifiers based on computing a distance between two samples, such as KNN. Normalization ensures that no variable takes on too much importance simply because of its measurement unit, and it gives equal weight to all the variables. Our features were normalized using the following formula:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

where $x'$ is the normalized feature value, $x$ is the original feature value, $x_{\min}$ is the minimum of the feature values and $x_{\max}$ is the maximum of the feature values.
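The min-max normalization described above can be illustrated with a minimal Python sketch. The function name `min_max_normalize`, the handling of constant features, and the example ages are assumptions made for illustration; they are not taken from the study's data or code.

```python
def min_max_normalize(column):
    """Rescale a list of numeric feature values to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption of this sketch: a constant feature is mapped to 0.0
        # instead of dividing by zero.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# Hypothetical ages, for illustration only (not from the CHU data set).
ages = [20, 35, 50, 80]
normalized_ages = min_max_normalize(ages)
print(normalized_ages)  # [0.0, 0.25, 0.5, 1.0]
```

Each feature would be normalized independently in this way, so that age (in years) and component proportions end up on the same scale.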
Several data analyses, in particular statistical analyses, were performed to better evaluate and interpret the data.
Of the 378 cases recorded, the man/woman ratio is 1.6, i.e. roughly 3 men for every 2 women suffering from renal lithiasis (Fig. 2).
Fig. 2. Proportion of the patients with urolithiasis according to sex
We found that the average age at which calculi appear is 47 years in the male population and 45 years in the female population (Fig. 3).
Statistical analysis of the distribution of urolithiasis types by sex showed that type 2 dominates in the male population while type 4 dominates in the female population (Fig. 4).
Principal component analysis (PCA) is a data analysis method, and more generally a multivariate statistics method, that transforms variables that are linked together ("correlated") into new variables that are decorrelated from each other. These new variables are called "principal components" or "principal axes". PCA allows the practitioner to reduce the number of variables and to make the information less redundant [5].
In order to extract as much information as possible and to obtain a global view of the data, a PCA was applied to the urolithiasis features to reduce the number of components (Fig. 5). The PCA showed that there are correlations between the various components. This is therefore a classification problem with a reduction of the primary variables.
By applying PCA, we were able to reduce the number of components from 21 to 11; adding the age and sex information, we obtained 13 features for the model (TABLE II).
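To make the reduction step concrete, the following is a minimal pure-Python sketch of PCA by power iteration with deflation. It is not the tool used in the study: the function names, the iteration count, the random starting vector and the toy data are all assumptions, and in practice a numerical library routine would normally be preferred.

```python
import math
import random


def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length feature rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    return [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]


def leading_components(cov, k, iterations=200, seed=0):
    """Approximate the k largest (eigenvalue, eigenvector) pairs of cov."""
    d = len(cov)
    residual = [row[:] for row in cov]  # copy, because deflation mutates it
    rng = random.Random(seed)
    pairs = []
    for _ in range(k):
        v = [rng.random() + 0.1 for _ in range(d)]
        start_norm = math.sqrt(sum(x * x for x in v))
        v = [x / start_norm for x in v]
        for _ in range(iterations):
            w = [sum(residual[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break  # no variance left in the residual matrix
            v = [x / norm for x in w]
        eigenvalue = sum(
            v[a] * residual[a][b] * v[b] for a in range(d) for b in range(d)
        )
        pairs.append((eigenvalue, v))
        # Deflation: subtract the variance explained by this component.
        for a in range(d):
            for b in range(d):
                residual[a][b] -= eigenvalue * v[a] * v[b]
    return pairs


def project(rows, vectors):
    """Express each centered row in the basis of the given components."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    return [
        [sum((row[j] - means[j]) * v[j] for j in range(d)) for v in vectors]
        for row in rows
    ]


# Toy data (hypothetical): the second feature is exactly twice the first,
# so a single principal component carries all of the variance.
rows = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
cov = covariance_matrix(rows)
eigenvalue, vector = leading_components(cov, k=1)[0]
reduced = project(rows, [vector])
```

In the setting of this paper, the same idea would be applied with `k=11` to the 21 component features, keeping the directions that carry most of the variance.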

A. Decision trees
A decision tree is a structure that deduces a final result from a succession of decisions. To traverse a decision tree in search of a solution, one starts from the root. Each node is an atomic decision, and the answer at each node determines which child node to move to. We thus descend step by step through the tree until a leaf is reached. The leaf represents the answer that the tree gives for the tested sample [6].
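The root-to-leaf traversal described above can be sketched as follows. The tree, its thresholds and its labels are entirely hypothetical and serve only to show the mechanics; they are not the trees learned in this study.

```python
def classify(node, sample):
    """Walk from the root to a leaf and return the label stored in the leaf."""
    while "label" not in node:
        # Each internal node is one atomic decision on a single feature.
        feature, threshold = node["feature"], node["threshold"]
        node = node["left"] if sample[feature] <= threshold else node["right"]
    return node["label"]


# Hypothetical depth-2 tree on two normalized component proportions.
tree = {
    "feature": "C1",
    "threshold": 0.5,
    "left": {"label": "other"},
    "right": {
        "feature": "C2",
        "threshold": 0.2,
        "left": {"label": "type 1"},
        "right": {"label": "type 2"},
    },
}

print(classify(tree, {"C1": 0.9, "C2": 0.05}))  # type 1
```

The number of decisions taken before reaching a leaf is bounded by the tree depth, which is why depth is one of the two parameters studied later.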
The algorithm used to generate the decision trees is C4.5; it builds on the ID3 algorithm and was proposed to overcome ID3's limitations.
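One well-known difference between the two algorithms is the splitting criterion: ID3 selects attributes by information gain, whereas C4.5 uses the gain ratio, which divides the gain by the entropy of the split itself. The sketch below computes the gain ratio for a categorical attribute only (C4.5 also handles continuous attributes through thresholds, which is omitted here). The function names and the toy data are assumptions for illustration.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy, in bits, of a list of discrete values."""
    total = len(items)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(items).values()
    )


def gain_ratio(values, labels):
    """Gain ratio obtained by splitting `labels` on a categorical attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    information_gain = entropy(labels) - remainder
    split_information = entropy(values)
    if split_information == 0.0:
        return 0.0  # the attribute takes a single value, so it cannot split
    return information_gain / split_information


# Hypothetical toy sample: here the attribute separates the classes perfectly.
sex = ["M", "M", "F", "F"]
stone_type = ["type 2", "type 2", "type 4", "type 4"]
ratio = gain_ratio(sex, stone_type)
```

Normalizing by the split information penalizes attributes with many distinct values, which information gain alone tends to favor.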

B. Boosting
Boosting [7,8] is a machine learning method; more precisely, it belongs to the family of meta-algorithms. There are several variants of boosting algorithms, some of which, like AdaBoost.MH [9], apply to multiclass problems. Many classification studies [8,9] have shown the effectiveness of the boosting algorithm only on simple decision rules.
In order to select the best decision trees for the boosting algorithm experimentally, many decision trees were generated separately, and the boosting algorithm was then applied to these decision trees.

C. Proposed method
The idea of this work is to apply the boosting algorithm to decision trees. Two stages are required: first the creation of the decision trees, then the application of the boosting algorithm. The decision trees were generated directly from randomly created data files. However, two important factors must be taken into consideration:
- Tree depth
- Number of trees
We decided to experiment with boosting on small decision trees, constrained by their depth: deep enough to separate the data and make a decision, but not so deep that the rules stop being general and the model overfits.
The number of trees used in boosting must be fixed: not so large that it slows down the training step, and not so small that boosting loses its power.

V. RESULTS AND DISCUSSIONS
The model used, boosting of decision trees, produces different results depending on the selected parameters. To evaluate the system, two parameters must be fixed: the tree depth and the number of trees. The results are presented in terms of training error rate and runtime.

A. Evaluation according to trees depth variation
The depth of the decision trees used in the boosting algorithm varies from 3 to 5.
TABLE IV shows the results obtained by our model when the depth is varied for boosting with 15 decision trees.
The results in TABLE IV show that shallow trees (depth 3), i.e. simpler rules, give better performance than deep trees (depth 5); however, the runtime for the shallow trees is almost doubled: 500 ms for trees of depth 3 versus 249 ms for trees of depth 5.

B. Evaluation according to the number of trees variation
The number of decision trees used in the boosting algorithm takes the following three values: 10, 15 and 20 trees. With our two parameters, the tree depth (three possible values) and the number of trees (three possible values), there are 9 different combinations and therefore 9 different systems, which are compared according to their error rate and their runtime (Fig. 5). Fig. 6 presents the error rate of each of the nine system combinations. The system that gives the best result is the one with 15 decision trees and a depth of 3.
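The nine systems are simply the Cartesian product of the two parameter ranges, which can be enumerated as in the sketch below. The names `candidate_systems` and `select_best`, and the rule of ranking by error rate first and runtime second, are assumptions of this sketch; the paper describes the final choice only as a compromise between the two criteria.

```python
from itertools import product

DEPTHS = (3, 4, 5)
TREE_COUNTS = (10, 15, 20)


def candidate_systems():
    """All (depth, number of trees) combinations explored in the study."""
    return [
        {"depth": depth, "n_trees": n_trees}
        for depth, n_trees in product(DEPTHS, TREE_COUNTS)
    ]


def select_best(evaluate):
    """Pick the configuration with the smallest evaluation result.

    `evaluate` is a caller-supplied function (hypothetical here) that trains
    one system and returns an (error_rate, runtime_ms) tuple; tuples compare
    by error rate first, then by runtime.
    """
    return min(candidate_systems(), key=evaluate)


systems = candidate_systems()
print(len(systems))  # 9
```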
The most powerful model, a compromise between minimizing execution time and error rate, is the one with 15 decision trees and a depth of 3. The details of this model and its confusion matrix are presented in Fig. 7. It reaches a classification accuracy of 98.41%, with a correct classification rate of 100% for types 3 and 5 and 99% for types 1 and 2. The execution time is 500 ms.
In Fig. 8, the blue curve represents the error rate of the learning stage and the red curve represents the error rate of the validation stage. The validation error is almost equal to the learning error, so our system is efficient, with an error rate of 1.59% for the training step and 1.35% for the validation step. Both steps take approximately 400 iterations. With a validation error of 1.35%, this model can be considered promising for the identification of urinary tract stones and the determination of etiologies.

Fig. 1. The six urolithiasis types: (a) Type I, (b) Type II, (c) Type III, (d) Type IV, (e) Type V, (f) Type VI

II. RELATED WORKS
The recognition of the various urolithiasis types has been the subject of some studies, in particular the work of Igor Kuzmanovski et al. In their article "Classification of Urinary Calculi using Feed-Forward Neural Network", they used a neural network to classify urolithiasis based on spectrophotometric analysis of the lithiasis. Genetic algorithms were used to optimize the selection of the most suitable spectral regions in order to improve the classification.

Fig. 3. Number of cases listed by age group and sex

One of the main ideas of AdaBoost is to set, at each step t with 1 ≤ t ≤ T, a new prior probability distribution Dt over the learning samples, based on the results of the algorithm at the previous step. The weight at step t of the example (xi, ui) of index i, where xi is a sample and ui is its class, is denoted pt(i). Initially, all examples have the same weight; then, at each step, the weights of the misclassified examples are increased, forcing the learner to focus on the difficult examples of the training sample.
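The reweighting idea can be illustrated with one round of the classic discrete AdaBoost update. This is a sketch under stated assumptions: it uses the standard two-class formulation with a learner weight alpha, whereas multiclass variants such as AdaBoost.MH differ in detail, and the paper does not specify which update rule was implemented. The function name and the toy example are hypothetical.

```python
import math


def update_weights(weights, correct):
    """One AdaBoost reweighting step.

    weights: current distribution over the training examples (sums to 1).
    correct: booleans, True where the weak learner classified example i
             correctly at this step.
    Returns the weak learner's weight alpha and the new distribution.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    if not 0.0 < error < 0.5:
        raise ValueError("weighted error must lie strictly between 0 and 0.5")
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Misclassified examples are scaled up, correct ones scaled down.
    unnormalized = [
        w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)
    ]
    z = sum(unnormalized)  # normalization constant
    return alpha, [w / z for w in unnormalized]


# Four examples with equal initial weight; only the last one is misclassified.
initial = [0.25, 0.25, 0.25, 0.25]
alpha, updated = update_weights(initial, [True, True, True, False])
```

With this update, the misclassified examples always carry half of the total weight at the next step, which is what forces the next tree to concentrate on them.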

Fig. 6. Curves of results according to the depth and number of trees used for boosting, based on error rate and number of iterations

VI. CONCLUSION
In conclusion, this work has allowed us to achieve our objective, namely the effective classification of urolithiasis. Of the boosting models tested, the one using 15 decision trees with a depth of 3 is the best for this classification problem. Its accuracy is 98.41% for the urolithiasis classification: it correctly classified 372 of the 378 cases.
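As a check on the reported figures, the accuracy and the training error rate quoted in Section V follow directly from the case counts given above:

```latex
\[
\text{accuracy} = \frac{372}{378} \approx 0.9841 = 98.41\%,
\qquad
\text{error rate} = \frac{378 - 372}{378} = \frac{6}{378} \approx 0.0159 = 1.59\%.
\]
```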

TABLE IV. RESULTS ACCORDING TO DECISION TREE DEPTHS

TABLE V illustrates the results obtained by our model when the number of trees used for boosting is varied at a fixed depth of 4. It shows that a greater number of trees gives better learning results and therefore a lower error rate.

TABLE V. RESULTS ACCORDING TO NUMBER OF DECISION TREES