Analytical Comparison between the Information Gain and Gini Index using Historical Geographical Data

The historical geographical data of Kashmir province is spread across two disparate files with the attributes Maximum Temperature, Minimum Temperature, Humidity measured at 12 A.M., Humidity measured at 3 P.M. and Rainfall, besides auxiliary parameters such as date and year. Maximum Temperature, Minimum Temperature, Humidity measured at 12 A.M. and Humidity measured at 3 P.M. are continuous in nature. In this study we apply Information Gain and Gini Index to these attributes to convert the continuous data into discrete values, and thereafter compare and evaluate the generated results. Of the four attributes, two have the same results for Information Gain and Gini Index, one has overlapping results, and only one has conflicting results. Subsequently, the continuous valued attributes are converted into discrete values using the Gini Index. Irrelevant attributes are not considered and auxiliary attributes are labeled accordingly. Consequently, the data set is ready for the application of machine learning (decision tree) algorithms.
Keywords: Geographical data mining; information gain; Gini index; machine learning; decision tree


A. Splitting Rules
A decision tree is built by recursively splitting data partitions into smaller partitions according to splitting rules or criteria. An attribute selection measure, or splitting rule, is a heuristic for choosing the criterion that best splits a class labeled training dataset into separate classes. Ideally the split should produce pure partitions, i.e. all the records in a given partition belong to the same class.
The attribute selection measure gives a score for each attribute describing the given class labeled training dataset, and the attribute with the best score is chosen as the splitting attribute for the given partition. In this paper we use Information Gain and Gini Index as attribute selection measures.

B. Information Gain and Gini Index
ID3 uses information gain as its attribute selection measure. For a given node that holds the tuples of partition D, the attribute with the highest information gain is chosen as the splitting attribute for that node [1] [6]. The chosen attribute minimizes the information needed to classify the records in the resulting partitions and reflects the least impurity in these partitions, which minimizes the expected number of tests required to classify a given record and tends to generate a simple decision tree. The information required to classify a record in D is given by (1) [5]:
\[ \mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i) \qquad (1) \]
where m is the number of classes and p_i is the probability that a record in D belongs to class C_i. The information still required to arrive at an exact classification after D is partitioned on attribute A into v partitions D_1, ..., D_v is measured by (2) [5]:
\[ \mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\mathrm{Info}(D_j) \qquad (2) \]
Information Gain is the difference between the original information requirement and the new requirement [5]:
\[ \mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D) \qquad (3) \]
Thus, Gain(A) is the gain if A is chosen for branching. Gain is calculated for all the attributes of the training set, and the attribute with the highest information gain is chosen as the splitting attribute for the given node [2][3] [7]. Calculating information gain therefore lets us choose the attribute that gives the best classification, that is, the one for which the amount of information still required to classify the records is minimal.
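The relationship between Info(D), Info_A(D) and Gain(A) can be illustrated with a minimal Python sketch. The function names and the toy labels are assumptions made for illustration; they are not taken from the study's dataset or tooling.

```python
import math
from collections import Counter

def info(labels):
    """Info(D): entropy in bits of the class labels in a partition."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_after_split(partitions):
    """Info_A(D): size-weighted entropy of the partitions produced by A."""
    n = sum(len(p) for p in partitions)
    return sum(len(p) / n * info(p) for p in partitions if p)

def gain(labels, partitions):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info(labels) - info_after_split(partitions)

# Toy class labels (illustrative only, not from the Kashmir data).
labels = ["Y", "Y", "N", "N"]
perfect_split = [["Y", "Y"], ["N", "N"]]   # both partitions are pure
useless_split = [["Y", "N"], ["Y", "N"]]   # both partitions stay mixed

print(gain(labels, perfect_split))
print(gain(labels, useless_split))
```

A split that yields pure partitions recovers all of Info(D) as gain, while a split that leaves every partition as mixed as D gains nothing.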
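The Gini Index is used by CART. The Gini index measures the impurity of D [10] [11]:
\[ \mathrm{Gini}(D) = 1 - \sum_{i=1}^{m} p_i^2 \]
The Gini index considers a binary split for each attribute, and the weighted sum of the impurity of each resulting partition is calculated. If a binary split on A partitions D into D_1 and D_2, then [5]:
\[ \mathrm{Gini}_A(D) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2) \]
The reduction in impurity that would be incurred by a binary split on a discrete or continuous valued attribute A is
\[ \Delta\mathrm{Gini}(A) = \mathrm{Gini}(D) - \mathrm{Gini}_A(D) \]
The process is repeated for every attribute, and the attribute that has the minimum Gini index (equivalently, the maximum reduction in impurity) is chosen as the splitting attribute [2][3] [8].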
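As a rough illustration of the Gini computations for a binary split, the following Python sketch may help. The helper names and the toy labels are hypothetical and serve only to show the arithmetic.

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(d1, d2):
    """Gini_A(D): size-weighted Gini impurity of a binary split into D1 and D2."""
    n = len(d1) + len(d2)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

def gini_reduction(labels, d1, d2):
    """Reduction in impurity obtained by the binary split."""
    return gini(labels) - gini_split(d1, d2)

# Toy class labels (illustrative only).
labels = ["Y", "Y", "N", "N"]
print(gini(labels))
print(gini_split(["Y", "Y"], ["N", "N"]))
print(gini_reduction(labels, ["Y", "Y"], ["N", "N"]))
```

Because Gini(D) is fixed for a given node, minimizing the weighted Gini index of the split and maximizing the reduction in impurity select the same split.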

C. Continuous Valued Attributes
For an attribute "A" that has continuous values, e.g. temperature or humidity, the best split point has to be determined. All the unique values of A are sorted in ascending order, and the midpoint between each pair of adjacent values is considered as a possible split point [5]. For u unique values of attribute A, u-1 candidate values are generated, and for each candidate Info_A(D) is calculated with the number of partitions equal to two [4][9] [12]. The midpoint with the minimum value is chosen as the split point of A, where D1 is the set of records satisfying A <= split_point and D2 is the set of records satisfying A > split_point. The other possible solution is to calculate the Gini index for every midpoint (instead of Info_A(D)) and to take the midpoint with the minimum Gini index for a given attribute as the split point of that attribute.
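The midpoint search described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names and the toy temperature and rain values are invented for illustration, and the same routine is reused for both criteria by passing the impurity function as a parameter.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def candidate_split_points(values):
    """Midpoints between adjacent unique values: u unique values give u-1 candidates."""
    unique = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(unique, unique[1:])]

def best_split(values, labels, impurity):
    """Return (split_point, score) with the minimum weighted impurity."""
    n = len(labels)
    best = None
    for s in candidate_split_points(values):
        d1 = [y for x, y in zip(values, labels) if x <= s]   # A <= split_point
        d2 = [y for x, y in zip(values, labels) if x > s]    # A >  split_point
        score = len(d1) / n * impurity(d1) + len(d2) / n * impurity(d2)
        if best is None or score < best[1]:
            best = (s, score)
    return best

# Toy values (illustrative only, not from the Kashmir data).
temps = [2.0, 4.0, 4.0, 10.0, 12.0, 20.0]
rain = ["Y", "Y", "Y", "N", "N", "N"]

print(candidate_split_points(temps))
print(best_split(temps, rain, entropy))
print(best_split(temps, rain, gini))
```

On this toy data both criteria pick the same midpoint; on real data they may disagree, which is the situation examined later in the paper.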

II. RELATED WORK
Gini index and Information gain have been used extensively over the years; the most relevant recent work comparing the two is presented below.
In their research paper entitled "Theoretical comparison between the Gini Index and Information Gain criteria", Laura Elena Raileanu and Kilian Stoffel proposed a formal methodology to compare multiple split criteria and presented a formal description of how to select between split criteria for a given data set. They concluded that Information Gain and Gini Index disagree in only 2% of all cases [13].
Mohammed A. Muharram and George D. Smith compared the performance of classifiers in their paper "Evolutionary Feature Construction Using Information Gain and Gini Index" to ascertain whether C5 or CART benefited in any way from the inclusion of an attribute evolved using Information gain or Gini index respectively. They found no evidence that any algorithm has an advantage over the other classifiers; according to them, all classifiers benefit from the inclusion of an evolved attribute [14].
A theoretical and empirical comparison of different split measures for the induction of decision trees in Random forest, and of their effect on the accuracy of Random forest, was carried out by Vrushali Y. Kulkarni, Manisha Petare and P. K. Sinha in their work entitled "Analyzing Random Forest Classifier with Different Split Measures". Their empirical results show no significant variation in the accuracy obtained, except for Chi Square. Further, Information gain and Gain ratio give comparable results for almost all datasets, while Gini index lags slightly on most of the datasets [15].

III. DATA
The data used in this paper is split across two CSV files and was collected from NDC Pune, India Meteorological Department (IMD), an agency of the Ministry of Earth Sciences, Government of India. IMD is the principal agency responsible for meteorological observations, weather forecasting and seismology, and it is one of the six Regional Specialized Meteorological Centres of the World Meteorological Organization.
The weather parameters in both data files are taken for the 3 regions of Kashmir division i.e. Gulmarg (North Kashmir), Srinagar (Central Kashmir) and Qazigund (South Kashmir). Gulmarg is geographically located at 34.05°N 74.38°E and has an average elevation of 2,650 m (8,690 ft.), Srinagar (Central) is located at 34.5°N 74.47°E and has an average elevation of 1,585 m (5,200 ft.), and Qazigund (South) is located at 33.59°N 75.16°E. It has an average elevation of 1,670 m (5,480 ft.).
The first data file (Fig. 1) consists of 12190 instances of relative humidity (in %) measured every day at 12 AM and 3 PM from 2012 to 2017 for all three stations.
The second data file (Fig. 2) consists of 6117 instances of Maximum temperature (°C), Minimum temperature (°C) and Rainfall (in mm) measured every day from 2012 to 2017 for all three stations.
The two data files are integrated into a single holistic dataset: discrepancies are resolved, and the data for each attribute is cleaned, transformed and loaded to form the single dataset shown below (Fig. 3). The integrated data has Maximum temperature (tmax), Minimum temperature (tmin), Rainfall (rfall), humidity measured at 12 AM (humid12) and humidity measured at 3 PM (humid3) for every day (with exceptions) from 2012 to 2017 for all three stations.
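One possible way to integrate two such files is sketched below in Python. The column layout of the real files is not given in the text, so the layout used here (one row per station and day in each file, keyed by station_id, year, mnth and dt) is an assumption, and the sample rows are invented. Days present in only one file are dropped, which is one plausible reading of "with exceptions".

```python
import csv
import io

# Hypothetical file contents; the real layout of the two CSV files is an assumption.
HUMIDITY_CSV = """station_id,year,mnth,dt,humid12,humid3
1,2012,1,1,88,70
1,2012,1,2,92,81
"""
WEATHER_CSV = """station_id,year,mnth,dt,tmax,tmin,rfall
1,2012,1,1,4.5,-2.0,0.0
1,2012,1,3,6.1,-1.2,2.4
"""

KEY = ("station_id", "year", "mnth", "dt")

def read_keyed(text):
    """Index the rows of a CSV text by (station_id, year, mnth, dt)."""
    rows = csv.DictReader(io.StringIO(text))
    return {tuple(r[k] for k in KEY): r for r in rows}

def integrate(humidity_text, weather_text):
    """Keep only station/day keys present in both files and merge their columns."""
    humidity = read_keyed(humidity_text)
    weather = read_keyed(weather_text)
    common = sorted(humidity.keys() & weather.keys())
    return [{**humidity[k], **weather[k]} for k in common]

dataset = integrate(HUMIDITY_CSV, WEATHER_CSV)
print(dataset)
```

In practice the cleaning and discrepancy resolution steps would sit between reading and merging; they are omitted here because the text does not describe them.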

A. Data Attributes
Of the nine attributes, five are geographical parameters: Maximum Temperature, Minimum Temperature, Rainfall, Humidity at 12 and Humidity at 3, termed tmax, tmin, rfall, humid12 and humid3 respectively. The other four are auxiliary/dependent parameters: station id, year, month and date, termed station_id, year, mnth and dt. In order to implement a decision tree for the prediction of rainfall, we have to evaluate each attribute of the resultant data independently.

1) Rainfall:
As per the resultant dataset, the rainfall in Kashmir province varies from no rainfall to above 100 mm of rainfall in one day. A broad inspection of the rain data of five years, recorded in 5951 entries, shows no rainfall in 4026 instances and rainfall in 1925 instances. The inference is that we can divide the rain data into two classes, presence and absence of rain. Accordingly, the dataset is modified with a new column "crfall", which is marked "Y" in case of rainfall (1925 entries) and "N" in case of no rainfall (4026 entries). The Decision Tree is trained to predict the presence or absence of rain on a given day.
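The labeling rule for the new column can be sketched in a few lines of Python. The function name and the sample rainfall values are hypothetical; only the rule itself (Y when rainfall is above zero, N otherwise) and the reported class counts come from the text.

```python
from collections import Counter

def label_rain(rfall_mm):
    """crfall is 'Y' when any rainfall was recorded and 'N' otherwise."""
    return "Y" if rfall_mm > 0 else "N"

# Invented sample values in mm (illustrative only).
rfall = [0.0, 3.2, 0.0, 101.5]
crfall = [label_rain(r) for r in rfall]
print(crfall, Counter(crfall))

# Consistency check on the reported class sizes: 4026 "N" + 1925 "Y" = 5951 entries.
print(4026 + 1925)
```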
2) Maximum temperature:
Maximum Temperature (tmax) is continuous valued rather than discrete valued, so we must determine the "best" split-point for it, where the split-point is a threshold on tmax. This can be determined by either of two techniques: Information Gain, used by ID3, or Gini Index, used by CART. In this paper we use both techniques to determine the split-point, compare the results and decide accordingly. To calculate Information Gain or Gini Index, we first determine the unique values of tmax and sort them in ascending order. In the dataset of 5951 records there are 380 unique values of tmax recorded, varying from -8.2°C to 35.4°C. Thereafter, the mid-point between each pair of adjacent values is considered as a possible split-point; a snapshot of the first 10, middle 10 and last 10 sorted records with mid-points is shown in Fig. 4. For each possible split-point for Maximum Temperature we evaluate Info_tmax(D) and Gini_tmax(D), but first we have to determine the prerequisites. For example, for the possible split value 33.85 we have to determine the following:
1) fyes: number of days with rain for tmax <= 33.85
2) fno: number of days with no rain for tmax <= 33.85
3) syes: number of days with rain for tmax > 33.85
4) sno: number of days with no rain for tmax > 33.85
These values have to be generated for all possible split-points; a snapshot of the first 10, middle 10 and last 10 records with the necessary values is shown below (Fig. 5).
Again, the first row is not considered because it has no mid-point; for every other possible point we have generated the necessary values.
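Counting fyes, fno, syes and sno for one candidate split-point can be sketched as follows. The function name and the five sample records are assumptions for illustration; the split value 8.05 is used only as an example threshold.

```python
def split_counts(values, crfall, split_point):
    """Return (fyes, fno, syes, sno) for the threshold `split_point`.

    f* counts records with value <= split_point, s* counts records above it;
    *yes / *no distinguish days with rain ("Y") from days without ("N").
    """
    fyes = fno = syes = sno = 0
    for value, label in zip(values, crfall):
        if value <= split_point:
            if label == "Y":
                fyes += 1
            else:
                fno += 1
        else:
            if label == "Y":
                syes += 1
            else:
                sno += 1
    return fyes, fno, syes, sno

# Invented sample records (illustrative only).
tmax = [5.0, 7.5, 9.0, 30.0, 34.0]
crfall = ["Y", "Y", "N", "N", "Y"]
print(split_counts(tmax, crfall, 8.05))
```

Repeating this for every mid-point yields the table of counts from which the two measures are computed.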
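For each possible split-point for Maximum Temperature, we calculate Info_spltpnt(D) and Gini_spltpnt(D). Following the definitions of Info_A(D) and Gini_A(D) for a binary split, and writing f = fyes + fno, s = syes + sno and n = f + s:
\[ \mathrm{Info}_{spltpnt}(D) = \frac{f}{n}\left(-\frac{fyes}{f}\log_2\frac{fyes}{f} - \frac{fno}{f}\log_2\frac{fno}{f}\right) + \frac{s}{n}\left(-\frac{syes}{s}\log_2\frac{syes}{s} - \frac{sno}{s}\log_2\frac{sno}{s}\right) \]
and
\[ \mathrm{Gini}_{spltpnt}(D) = \frac{f}{n}\left(1 - \left(\frac{fyes}{f}\right)^2 - \left(\frac{fno}{f}\right)^2\right) + \frac{s}{n}\left(1 - \left(\frac{syes}{s}\right)^2 - \left(\frac{sno}{s}\right)^2\right) \]
In this way we generate Info(D) and Gini(D) for every possible split-point, with the exception of rno 1 because it has no split point. Further, of the 379 possible split-points, 9 do not generate Info(D), as shown below (Fig. 7). This is because one of the values fyes, fno, syes, sno is zero, which leaves the corresponding logarithm undefined. We have generated Information Gain and Gini Index for every split point; we now compare the two results.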
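The two measures can be computed directly from the four counts, as in the Python sketch below. The function names are hypothetical. Returning None when any count is zero is a choice made here to mirror the split-points that do not generate Info(D); it is an assumption about how those cases were handled, and an alternative convention (treating 0 log 0 as 0) would produce a value instead.

```python
import math

def info_from_counts(fyes, fno, syes, sno):
    """Weighted entropy of a binary split; None when a count is zero (log2(0) undefined)."""
    if 0 in (fyes, fno, syes, sno):
        return None
    f, s = fyes + fno, syes + sno
    n = f + s

    def h(a, b):
        t = a + b
        return -(a / t) * math.log2(a / t) - (b / t) * math.log2(b / t)

    return f / n * h(fyes, fno) + s / n * h(syes, sno)

def gini_from_counts(fyes, fno, syes, sno):
    """Weighted Gini index of a binary split; defined whenever both sides are non-empty."""
    f, s = fyes + fno, syes + sno
    n = f + s
    g1 = 1 - (fyes / f) ** 2 - (fno / f) ** 2
    g2 = 1 - (syes / s) ** 2 - (sno / s) ** 2
    return f / n * g1 + s / n * g2

print(info_from_counts(1, 1, 1, 1), gini_from_counts(1, 1, 1, 1))
print(info_from_counts(2, 0, 1, 2), gini_from_counts(2, 0, 1, 2))
```

Note that the Gini index remains computable when a single count is zero, because mid-points between adjacent observed values never leave either side of the split empty.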

Case 1: Information Gain
The point with the minimum expected information requirement for Maximum Temperature (tmax) is selected as its split point. Since Info(D) is the same for every candidate, the lowest expected information requirement corresponds to the highest Information Gain. The five best cases, those with the lowest expected information requirement, are shown below (Fig. 8).
The above table is regenerated with the Gini Index for the same split-points (Fig. 9). In accordance with the rule of Information Gain, we have to choose 25.05 as the split-point for Maximum Temperature (tmax), since it has the lowest expected information requirement; split-point 25.05 with all the attributes is shown below (Fig. 10).

Case 2: Gini Index
The point giving the minimum Gini index for Maximum Temperature (tmax) is taken as its split-point; the five best cases with the minimum Gini Index are shown below (Fig. 11).
The above table is regenerated with Information Gain for the same split-points (Fig. 12).
In accordance with the rule, we have to choose 8.05 as the split-point for Maximum Temperature (tmax), since it has the lowest Gini Index; split-point 8.05 with all the attributes is shown below (Fig. 13).

3) Minimum Temperature:
Minimum Temperature (tmin) is again continuous valued rather than discrete valued, so we must determine the "best" split-point for it, where the split-point is a threshold on tmin. Again we use both techniques (Information Gain and Gini Index) to determine the split-point, compare the results and decide accordingly.
We determine the unique values of Minimum Temperature (tmin) and sort them in ascending order. In the dataset of 5951 records there are 354 unique values of tmin recorded, varying from -16.5°C to 23.8°C. Thereafter, the mid-point between each pair of adjacent values is generated as a possible split-point; a snapshot of the first 10, middle 10 and last 10 sorted records with mid-points is shown below (Fig. 14).
Therefore, given 354 values of Minimum Temperature (tmin), 353 possible splits are generated and evaluated; no mid-point is generated for the lowest recorded temperature, -16.5°C.
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 5, 2020, www.ijacsa.thesai.org
For each possible split-point for Minimum Temperature (tmin), we calculate the values of fyes, fno, syes and sno. These values have to be generated for all possible split-points; a snapshot of the first 10, middle 10 and last 10 records with the necessary values is shown below (Fig. 15).
Again, the first row is not considered because it has no mid-point; for every other possible point we have generated the necessary values.
For each possible split-point for Minimum Temperature, we calculate Info_spltpnt(D) and Gini_spltpnt(D) using Equations (13), (14), (15) and (16), which apply the weighted information and Gini definitions to the counts fyes, fno, syes and sno. A snapshot of the first 10, middle 10 and last 10 records with the necessary values is shown in Fig. 16.
We generate Info(D) and Gini(D) for every possible split-point, with the exception of rno 1 because it has no split point. Further, of the 353 possible split-points, 12 do not generate Info(D) because one of the values fyes, fno, syes, sno is zero, as shown in Fig. 17.
We have generated Information Gain and Gini Index for every split point; we now compare the two results.

Case 1: Information Gain
The point with the minimum expected information requirement for Minimum Temperature (tmin) is selected as its split point; the five best cases, those with the lowest expected information requirement (and hence the highest Information Gain), are shown below (Fig. 18).
The above table is regenerated with the Gini Index for the same split-points (Fig. 19). In accordance with the rule of Information Gain, we have to choose -0.35 as the split-point for Minimum Temperature (tmin), since it has the lowest expected information requirement; split-point -0.35 with all the attributes is shown below (Fig. 20).

Case 2: Gini Index
The point giving the minimum Gini index for Minimum Temperature (tmin) is taken as its split-point; the five best cases with the minimum Gini Index are shown below (Fig. 21).
The table is regenerated with Information Gain for the same split-points (Fig. 22).
In accordance with the rule of Gini Index, we have to choose -0.35 as the split-point for Minimum Temperature (tmin), since it has the lowest Gini Index; split-point -0.35 with all the attributes is shown below (Fig. 23).
The results of Information Gain and Gini Index are exactly the same; hence split-point -0.35 is chosen in either case and there is no conflict at all.

4) Humidity Measured at 12:00 A.M:
Humidity Measured at 12:00 A.M (humid12) is continuous valued rather than discrete valued, and we use the same procedure to determine its best split-point as for maximum and minimum temperature. In the dataset of 5951 records there are 82 unique values of humid12 recorded, varying from 18 to 100. A snapshot of the first 10, middle 10 and last 10 sorted records with mid-points is shown below (Fig. 24); 81 possible split-points are evaluated.
A snapshot of the first 10, middle 10 and last 10 records with the necessary values of fyes, fno, syes and sno is shown below (Fig. 25).
For each possible split-point for Humidity Measured at 12:00 A.M (humid12), we calculate Info_spltpnt(D) and Gini_spltpnt(D) using Equations (17), (18), (19) and (20), which apply the weighted information and Gini definitions to the counts fyes, fno, syes and sno. A snapshot of the first 10, middle 10 and last 10 records with the Information Gain and Gini Index values is shown below (Fig. 26).
We generate Info(D) and Gini(D) for every possible split-point, with the exception of rno 1 because it has no split point. Further, of the 81 possible split-points, 8 do not generate Info(D) because one of the values fyes, fno, syes, sno is zero, as shown below (Fig. 27).
We have generated Information Gain and Gini Index for every split point; we now compare the two results.

Case 1: Information Gain
The point with the minimum expected information requirement for Humidity Measured at 12:00 A.M (humid12) is selected as its split point; the five best cases, those with the lowest expected information requirement (and hence the highest Information Gain), are shown in Fig. 28. The above table is regenerated with the Gini Index for the same split-points (Fig. 29).
In accordance with the rule of Information Gain, we have to choose 69.5 as the split-point for Humidity Measured at 12:00 A.M (humid12), since it has the lowest expected information requirement; split-point 69.5 with all the attributes is shown below (Fig. 30).

Case 2: Gini Index
The point giving the minimum Gini index for Humidity Measured at 12:00 A.M (humid12) is taken as its split-point; the five best cases with the minimum Gini Index are shown below (Fig. 31).
The above table is regenerated with Information Gain for the same split-points (Fig. 32).
In accordance with the rule of Gini Index, we have to choose 69.5 as the split-point for Humidity Measured at 12:00 A.M (humid12), since it has the lowest Gini Index; split-point 69.5 with all the attributes is shown below (Fig. 33).
The results of Information Gain and Gini Index are exactly the same; hence split-point 69.5 is chosen in either case and there is no conflict at all.

5) Humidity Measured at 03:00 P.M:
Like the earlier three attributes, Humidity Measured at 03:00 P.M (humid3) is continuous valued rather than discrete valued, and accordingly its best split-point is generated and evaluated in the same way.
In the dataset of 5951 records there are 80 unique values of Humidity Measured at 03:00 P.M (humid3) recorded, varying from 16 to 100. A snapshot of the first 10, middle 10 and last 10 sorted records with mid-points is shown below (Fig. 34); 79 possible split-points are evaluated.
A snapshot of the first 10, middle 10 and last 10 records with the Information Gain and Gini Index values is shown below (Fig. 36).
We generate Info(D) and Gini(D) for every possible split-point, with the exception of rno 1 because it has no split point. Further, of the 79 possible split-points, 15 do not generate Info(D) because one of the values fyes, fno, syes, sno is zero, as shown below (Fig. 37).
We have generated Information Gain and Gini Index for every split point; we now compare the two results.

Case 1: Information Gain
The point with the minimum expected information requirement for Humidity Measured at 03:00 P.M (humid3) is selected as its split point; the five best cases, those with the lowest expected information requirement (and hence the highest Information Gain), are shown below (Fig. 38). The above table is regenerated with the Gini Index for the same split-points (Fig. 39).
In accordance with the rule of Information Gain, we have to choose 82.5 as the split-point for Humidity Measured at 03:00 P.M (humid3), since it has the lowest expected information requirement; split-point 82.5 with all the attributes is shown in Fig. 40.

Case 2: Gini Index
The point giving the minimum Gini index for Humidity Measured at 03:00 P.M (humid3) is taken as its split-point; the five best cases with the minimum Gini Index are shown below (Fig. 41).
The above table is regenerated with Information Gain for the same split-points (Fig. 42).
In accordance with the rule of Gini Index, we have to choose 89.5 as the split-point for Humidity Measured at 03:00 P.M (humid3), since it has the lowest Gini Index; split-point 89.5 with all the attributes is shown below (Fig. 43).
As per Information Gain the choice of split-point is 82.5, while as per Gini Index it is 89.5. To decide between them we compare the two generated lists, as shown below (Fig. 44). The comparison shows a visible overlap between the two results. We choose 89.5 as the split-point for Humidity Measured at 03:00 P.M (humid3) because it is the first choice as per Gini Index and the second choice as per Information Gain.
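The comparison of two ranked lists can be sketched in Python as follows. The function names, the split values and the scores below are all invented for illustration and do not reproduce Fig. 44; the selection rule shown (take the highest Gini-ranked candidate that also appears in the Information Gain list) is one possible reading of the decision described above.

```python
import heapq

def top_k(scores, k=5):
    """Return the k split-points with the smallest score, skipping undefined (None) scores."""
    valid = [(score, sp) for sp, score in scores.items() if score is not None]
    return [sp for score, sp in heapq.nsmallest(k, valid)]

def first_common_choice(primary, secondary):
    """First split-point in the `primary` ranking that also appears in `secondary`."""
    for sp in primary:
        if sp in secondary:
            return sp
    return None

# Toy scores keyed by candidate split-point (illustrative only).
info_scores = {10.5: 0.60, 20.5: 0.62, 30.5: 0.75, 40.5: None}
gini_scores = {10.5: 0.33, 20.5: 0.29, 30.5: 0.41, 40.5: 0.30}

info_top = top_k(info_scores, k=3)
gini_top = top_k(gini_scores, k=3)
print(info_top, gini_top)
print(first_common_choice(gini_top, info_top))
```

Here the toy data is arranged so that the best Gini candidate is the second-best candidate by expected information, the same pattern as reported for humid3.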

B. Evaluation: Information Gain vs. Gini Index
Four attributes are continuous valued rather than discrete valued. We employed Information Gain, used by ID3, and Gini Index, used by CART, to determine the best possible split-point for each; the results are shown below (Table I). Of the four attributes, Tmin and Humid12 have the same results for Information Gain and Gini Index. Humid3 has overlapping results for the two measures and, as already discussed, we choose 89.5 as its split-point. For Tmax the results of Information Gain and Gini Index do not corroborate each other, and hence we have to choose one of the values, either as per Information Gain (25.05) or as per Gini Index (8.05). We chose Gini Index over Information Gain primarily because the split-points of three attributes (Tmin, Humid12, Humid3) are as per Gini Index, whereas the split-points of only two attributes (Tmin and Humid12) are as per Information Gain. We thus go with the majority, i.e. Gini Index over Information Gain, and accordingly the split-point of Tmax is 8.05.
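Applying the chosen split-points to a record can be sketched as below. The four thresholds are the ones selected in this section; the function name, the label format ("<=8.05", ">8.05") and the sample record are assumptions, since the labels used in the resultant dataset are not stated in the text.

```python
# Split-points chosen in this section (Gini Index based).
SPLIT_POINTS = {"tmax": 8.05, "tmin": -0.35, "humid12": 69.5, "humid3": 89.5}

def discretize(record):
    """Replace each continuous attribute with a two-valued label around its split-point.

    The label strings are hypothetical; other keys of the record are kept unchanged.
    """
    out = dict(record)
    for attr, sp in SPLIT_POINTS.items():
        out[attr] = f"<={sp}" if record[attr] <= sp else f">{sp}"
    return out

# Invented sample record (illustrative only).
sample = {"tmax": 12.3, "tmin": -1.0, "humid12": 69.5, "humid3": 95, "crfall": "Y"}
print(discretize(sample))
```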

C. Rest of Data Attributes
Of the rest of the data attributes (Station_id, Year, Month and date), we decide not to consider the recording station (Station_id) as part of the decision tree for the prediction of rainfall, since all the stations belong to the same province. Further, a year comprises 365 days, 12 months or 4 seasons, so we group the months into seasons as shown below (Table II). We thus use seasons instead of months and decide not to use year and date as part of the decision table; this will also maximize information dissemination.

1) Resultant Dataset:
After the conversion of continuous valued attributes into discrete values, the conversion of months into seasons and the exclusion of some irrelevant attributes, a snapshot of the resultant dataset is shown below (Fig. 45). Months have been converted into seasons as per the table shown above, and crfall is Y if rfall > 0 and N if rfall = 0.

IV. CONCLUSION AND FUTURE WORK
In this paper two techniques, Information Gain and Gini index, are employed to convert continuous data into discrete valued data. This is a preliminary and prerequisite step for applying the Decision tree machine learning algorithm to the geographical data set. Besides preparing the historical geographical data for the application of the Decision tree algorithm, we have also compared the results of two different techniques applied to the same dataset.
Whilst this study was primarily aimed at the comparison of Information Gain and Gini index, a fuller work is underway in which two separate datasets will be generated on the basis of Information Gain and Gini index respectively. Decision tree algorithms will then be applied to these two datasets, which will enable us to compare the performance of Information Gain and Gini index at the individual level of implementation.