An Approach to Improve Classification Accuracy of Leaf Images Using Dorsal and Ventral Features

Abstract: This paper proposes to improve the classification accuracy of leaf images by extracting texture and statistical features that exploit the striking features present on the dorsal and ventral sides of leaves, which may not be as prominent on other types of objects. The texture features have been extracted from the dorsal, ventral and combined dorsal-ventral sides of leaf images using the gray level co-occurrence matrix. In addition, this work uses certain general statistical features for discriminating the images into various classes. Feature selection has been performed separately for the dorsal, ventral and combined data sets (for both texture and statistical features) using the most common feature selection algorithms. The classification accuracy has been calculated and compared to find which side of the leaf image (dorsal or ventral) gives better results with which type of features (texture or statistical). This study reveals that ventral leaf features can be another alternative for discriminating leaf images into various classes.


I. INTRODUCTION
Plants play an integral role in the ecological balance by providing shelter, improving the atmosphere, providing medicinal value and so on. Therefore, there is a dire necessity to preserve and conserve them. Plants have also been studied for increasing food production and for bringing forth new varieties of fruits, flowers and plant species. Several attempts have been made to classify plants on the basis of flowers, the arrangement of leaves on the plant, shape, color and texture, to name a few. Such studies are essential for the ecological balance, as some plants are on the verge of extinction. For a layman, the characteristic features of a digital image are texture, shape, color and size. For a computer system, however, there must be a computer-recognizable feature set which can be stored, refined and analysed for appropriate classification. The search for image textural features dates back to the 1970s, when Haralick [1] and Rosenfeld and Troy [2] obtained the textural coarseness of digital images by finding the differences between the gray values of adjacent pixels and then performing autocorrelation of the image values. Texture based properties of digital images have also been used in medical images [3], in tomography based images [4], in the analysis of ultrasound images [5] and in the classification of food items like Italian pasta and plum cakes [6,7]. Some common approaches for plant leaf classification using digital images are based on geometrical properties [8], texture and shape based features [9] and color features [10].
Nature has given two faces to the leaves: the dorsal (or the face up side) and the ventral (or the back side facing the substratum). The dorsal sides are generally smooth in texture and absorb sunlight, whereas the ventral sides have a prominent vein structure. The fine line present on the leaf is called the midrib or the prominent vein, and the other hair-like lines are called secondary veins. The pattern of leaf venation is an important characteristic for the identification of a plant.
www.ijacsa.thesai.org
Texture is an integral property of every surface: patterns of tiles, wood, fabric or crops in the field. A texture contains important information regarding the structural arrangement of a surface and its relationship with the surroundings. The human eye can interpret the texture of a surface as fine, coarse, rough or smooth, rippled or irregular. In the case of digital images, texture represents the arrangement of pixels and their distribution, which is very helpful in classifying images into various categories. A digital image is represented through pixels expressing relative brightness values. All the pixels in a digital image form the population and, in statistical terms, a sample is a subset of values taken from the image to draw appropriate conclusions about the characteristic properties exhibited by the population. A statistical sample drawn from a large population can be represented through a frequency distribution or through correlation curves, from which a detailed statistical analysis can be performed. Therefore, univariate image statistics like mean, mode, median or standard deviation can be utilized in studying and discriminating one class of digital images from another.
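As an illustration of the univariate statistics mentioned above, the following minimal Python sketch treats all the pixels of a gray image as the population and computes a few summary values. The function name `univariate_stats`, the use of excess kurtosis and the tiny pixel list are assumptions made for this example only; they are not taken from the ImageJ implementation used in this work.

```python
import statistics

def univariate_stats(pixels):
    """Return basic univariate statistics for a flat list of gray values."""
    n = len(pixels)
    mean = statistics.fmean(pixels)
    # Population standard deviation: the whole image is the population.
    sd = statistics.pstdev(pixels)
    # Third and fourth central moments for skewness and kurtosis.
    m3 = sum((p - mean) ** 3 for p in pixels) / n
    m4 = sum((p - mean) ** 4 for p in pixels) / n
    skewness = m3 / sd ** 3 if sd > 0 else 0.0
    # Excess kurtosis (normal distribution = 0); a convention chosen here.
    kurtosis = m4 / sd ** 4 - 3.0 if sd > 0 else 0.0
    return {
        "mean": mean,
        "median": statistics.median(pixels),
        "mode": statistics.mode(pixels),
        "sd": sd,
        "min": min(pixels),
        "skewness": skewness,
        "kurtosis": kurtosis,
    }

# Hypothetical gray values standing in for the pixels of one leaf image.
example_pixels = [10, 10, 20, 40]
stats = univariate_stats(example_pixels)
```

A vector of such values per image is the kind of record a classifier can use to separate one class of images from another.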
The present study proposes to improve the classification accuracy of leaf images by extracting texture and statistical features that exploit the striking features present on the dorsal and ventral sides of leaves, which may not be as prominent on other types of objects. The texture features have been extracted from the dorsal, ventral and combined dorsal-ventral sides of leaf images using the gray level co-occurrence matrix. In addition, certain general statistical properties [11] of leaf images, namely Mean, Median, Integrated Density, Skewness, Kurtosis, Minimum value, Standard Deviation, Raw-Integrated Density, XM and YM, have been used for discriminating the images into various classes. The most important task in achieving a higher degree of classification accuracy is to extract relevant features which can improve the overall accuracy of the classifier. Hence, the selection of an appropriate set of features is very important in pattern recognition problems. The feature extraction algorithms help in reducing the storage space required for the data, improve the visualization of the smaller dataset, make the features and their relations easier to understand and greatly shorten the training phase. This work performs feature selection for the dorsal, ventral and combined data sets (for both texture and statistical features) using the most common feature selection algorithms. After selecting the relevant features, the leaf images are classified using the classification algorithms K-Nearest Neighbor, J48, Naïve Bayes, Partial Least Square (PLS), Classification and Regression Tree (CART) and Classification Tree (CT), and the classification accuracy of each algorithm on each feature data set is calculated. One of the objectives of studying the dorsal and ventral sides of leaf images using texture and statistical features is to find which side (dorsal, ventral or dorsal-ventral) gives the best classification results and with which type of classification approach: texture or statistical.
The rest of the paper is divided into three sections. Section II presents the proposed methodology (creation of the colored leaf image data set, preprocessing of the digital images, generation of texture features, generation of statistical features, the feature selection process in different data sets and the application of classification algorithms). Section III describes the results obtained through the proposed methodology and compares them with similar recent work. Section IV gives the conclusions.

II. PROPOSED METHODOLOGY
This work proposes to use the dorsal and ventral sides of leaf images, with texture and statistical approaches, for discriminating leaves into classes. The proposed approach involves the following steps:

A. Creation of colored leaf Image data set
Leaf image data sets are available from several sources (data banks), including [12,13,14]. However, the images stored in these data sets are dorsal leaf images, whereas this work proposes to utilize both the dorsal and the ventral faces of the leaves. This necessitated the creation of a new data set with both faces of the leaves. To create the required database, 24-bit RGB images of the dorsal and ventral faces of leaves of Helianthus annuus L. (Sunflower), Psidium guajava (Guava) and Alcea rosea (Hollyhock) have been captured, as shown in Fig. 1, using a Sony Cybershot HX200V with an 18.2 MP "Exmor R™" CMOS sensor with extra high sensitivity technology and 30x optical zoom. The captured images include 100 dorsal side and 100 ventral side images for each of the above mentioned leaf categories, totaling a sample size of 600 images with a pixel size of 1080 X 920.

B. Preprocessing of the Digital images
The leaf images were extracted using a background removal technique. In order to find the texture features and to reduce the computational complexity, all the colored images were converted to 8-bit gray level and reduced to a pixel size of 256 X 256. All the image processing tasks have been performed in ImageJ (Version 1.44) [11]. Gray stacks of the slices of the dorsal, ventral and combined dorsal-ventral images have been prepared for further feature extraction using the Gray Level Co-occurrence Matrix and statistical techniques [11] in a batch processing mode in ImageJ.
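The conversion and size reduction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the gray value is taken as the unweighted mean of the R, G and B channels and the reduction uses nearest-neighbor sampling, neither of which is specified in this work; the function names and the tiny synthetic image are hypothetical.

```python
def to_gray8(rgb_image):
    """Convert rows of (r, g, b) tuples to rows of 8-bit gray values.

    Assumption: unweighted channel mean; the weights are not stated in the text.
    """
    return [[(r + g + b) // 3 for (r, g, b) in row] for row in rgb_image]

def resize_nearest(gray, out_h, out_w):
    """Reduce a 2-D list of gray values with nearest-neighbor sampling."""
    in_h, in_w = len(gray), len(gray[0])
    return [
        [gray[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]

# Tiny synthetic 4x4 RGB image standing in for a captured leaf photograph.
rgb = [[(x * 60, y * 60, 30) for x in range(4)] for y in range(4)]
gray = to_gray8(rgb)
small = resize_nearest(gray, 2, 2)  # the study reduces images to 256 x 256
```

For a real image the same two steps would be applied with `out_h = out_w = 256`.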

C. Generation of texture features
For batch processing, the gray stacks of the slices of the leaf images are processed through the texture extraction technique given by Haralick [1], which provides the probability of gray level i occurring in the neighborhood of gray level j for a given distance d, angle ϴ and total number of gray levels N (256 in the present case). The gray level co-occurrence matrix GM is expressed mathematically in equation (1). In order to reduce the complexity, the inter-pixel distance d is kept at unity. Normally, in gray level co-occurrence matrix based methods, the calculations for feature extraction are carried out at unit pixel distance with ϴ = 0° and 45°. This work goes further and extracts the image texture features for the dorsal and ventral sides of the leaf images at unit pixel distance with angular pixel positions ϴ = 0°, 45°, 90° and 135° independently, and then combines them using ImageJ software. The remaining angular positions of 225°, 270° and 315° are just mirror images and are therefore not considered.
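The construction of the co-occurrence matrix at unit distance for the four angles can be sketched as follows. This is a minimal illustration and not the ImageJ plugin used in this work: the offset convention, the symmetric counting (which is what makes the mirrored angles redundant), the `contrast` feature and the 4 x 4 test image with four gray levels are all assumptions chosen for the example.

```python
# (row step, column step) for unit distance d = 1; rows grow downward.
OFFSETS = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}

def glcm(image, angle, levels, symmetric=True):
    """Normalized co-occurrence matrix P[i][j] for one angle at distance 1."""
    dr, dc = OFFSETS[angle]
    rows, cols = len(image), len(image[0])
    counts = [[0] * levels for _ in range(levels)]
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                counts[i][j] += 1
                if symmetric:
                    counts[j][i] += 1  # also count the mirrored direction
    total = sum(map(sum, counts)) or 1  # avoid division by zero
    return [[v / total for v in row] for row in counts]

def contrast(p):
    """One texture feature: sum of (i - j)^2 * P[i][j]."""
    n = len(p)
    return sum((i - j) ** 2 * p[i][j] for i in range(n) for j in range(n))

# Small image with 4 gray levels (the study uses 256 levels on 256 x 256 images).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
p0 = glcm(image, 0, levels=4)
contrast_by_angle = {a: contrast(glcm(image, a, levels=4)) for a in OFFSETS}
```

Each angle yields its own matrix and therefore its own set of feature values; how the four sets are combined is left to the procedure described in the text.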
The texture feature dataset at the angular pixel position ϴ is represented by equation (2).

D. Generation of statistical features
The scale of calibration has been set to millimeter (mm). The statistical feature datasets have also been prepared using ImageJ software [11]. The following three statistical feature datasets have been prepared: Image Statistical Dorsal Dataset (ISDD), Image Statistical Ventral Dataset (ISVD) and Combined Image Statistical Dorsal-Ventral Dataset (CISDVD), using equations (7), (8) and (9) respectively.

E. Feature Selection process in different data sets
The feature selection process involves selecting those features in the data set that are most useful, or in simpler words most relevant, so as to provide better predictive accuracy and remove redundancy from the dataset. In addition, the feature selection process provides a better understanding of the features.
In this study, the following seven feature selection algorithms have been used: Best First Search (BFS), Correlation Based Feature Selection (CFS), Chi-square (Chisq), OneR, Random Forest (RForest), ReliefF and Hill Climbing (HC). In addition, one more configuration, which includes all the features extracted from the image set, i.e. No Feature Selection Algorithm (No Algo. Used), has also been used.
The algorithms mentioned above have been applied to the texture feature data sets (ITDD, ITVD, CITDVD) and the statistical feature data sets (ISDD, ISVD, CISDVD), which generates a total of 48 different data sets comprising 24 texture based and 24 statistical data sets. Tables I, II and III describe the number and names of the texture features selected by each of the feature selection algorithms used in the present analysis. Similar results are given in Tables IV, V and VI for the statistical features.
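To illustrate how a feature selection algorithm balances relevance against redundancy, the sketch below scores feature subsets with the correlation-based merit k·r_cf / sqrt(k + k(k-1)·r_ff), where r_cf is the mean feature-class correlation and r_ff the mean feature-feature correlation, and grows the subset greedily. It is a simplified sketch, not the implementation used in this study: Pearson correlation with a numerically coded two-class label, the greedy forward search, the feature names and the toy values are all assumptions made for the example.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0

def cfs_merit(subset, features, target):
    """Merit = k * r_cf / sqrt(k + k * (k - 1) * r_ff)."""
    k = len(subset)
    r_cf = sum(abs(pearson(features[f], target)) for f in subset) / k
    if k == 1:
        return r_cf
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = sum(abs(pearson(features[a], features[b])) for a, b in pairs) / len(pairs)
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)

def forward_select(features, target, tol=1e-9):
    """Greedily add the feature that raises the merit most; stop when none does."""
    selected, best = [], 0.0
    remaining = list(features)
    while remaining:
        merit, name = max((cfs_merit(selected + [f], features, target), f)
                          for f in remaining)
        if merit <= best + tol:
            break
        selected.append(name)
        remaining.remove(name)
        best = merit
    return selected, best

# Hypothetical toy data: six images, two classes coded 0 and 1.
target = [0, 0, 0, 1, 1, 1]
features = {
    "contrast": [1, 2, 1, 5, 6, 5],  # relevant to the class
    "entropy": [2, 3, 3, 5, 7, 6],   # relevant but redundant with contrast
    "noise": [3, 1, 2, 3, 1, 2],     # unrelated to the class
}
selected, merit = forward_select(features, target)
```

On this toy data only the most relevant feature is retained, because adding the redundant or the unrelated feature lowers the merit.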

F. Application of classification algorithms
To discriminate the features obtained in Section II-E into various classes (using 48 different data sets), the following six classification algorithms have been used: K-Nearest Neighbor (KNN), J48, Naïve Bayes, Partial Least Square (PLS), Classification and Regression Trees (CART) and Classification Tree (CT), using the "caret" package under RStudio [15]. Each data set was split into two groups (training and test sets) in the ratio 75:25. The training data set contains the class labels, whereas the testing data set does not. The preprocessing of the data involved centering and scaling of the data matrix. In the classification procedure, a 10-fold cross validation technique, repeated three times, has been applied for validating each predictive model. Predictive accuracy and kappa values have been adopted as the measurable parameters for the classification process. Kappa expresses how far the correct predictions of a model exceed those expected by chance. It was originally a measure of agreement between two classifiers and is calculated as $\kappa = (P_o - P_e) / (1 - P_e)$, where $P_o$ is the observed agreement and $P_e$ is the agreement expected by chance. In broad terms, a kappa below 0.2 indicates poor agreement and a kappa above 0.8 indicates very good agreement beyond chance [17,18].

III. RESULTS
The quantitative results for predictive accuracy, obtained by following the methodology proposed in Section II, are given in Tables VII, VIII and IX for the texture feature data sets (ITDD, ITVD, CITDVD) and in Tables X, XI and XII for the statistical feature data sets (ISDD, ISVD, CISDVD). The kappa values for the ITDD, ITVD and CITDVD texture feature data sets and the ISDD, ISVD and CISDVD statistical feature data sets are shown graphically in Fig. 2(a)-(c) and Fig. 3(a)-(c) respectively.

A. Analysis on the basis of Texture feature data sets
From the predictive accuracy values for the texture feature datasets, it has been observed that the ITVD feature data model has the highest average predictive accuracy (83.49%) amongst all the texture based data models studied in this work (CITDVD: 81.05%, ITDD: 80.89%). The present results are not directly comparable with those of [9] because of the differences in the datasets used. Despite this, this work compares its results with those of [9]. In [9], two classification algorithms, Neuro Fuzzy Controller (NFC) and Multi-Layer Perceptron (MLP), have been used for the texture based model with only dorsal side images, and the average predictive accuracy achieved is 81.6% and 87% respectively. In the proposed ITVD model, the ventral texture features provide an average predictive accuracy of 83.49%, and the J48 classification algorithm gives an accuracy of 97.18% using the Best First Search (BFS) algorithm for feature selection, which is comparable with the results of [9], as shown in Fig. 4.
Among the texture feature based models, the ITDD model provides an accuracy of 96% using the Best First Search (BFS) algorithm for feature selection applied to dorsal leaf images and J48 as the classification algorithm. When the CITDVD model is used, an accuracy of 94.85% has been observed for the J48 algorithm when all the texture features are used (no feature selection algorithm used), as shown in Tables VII, VIII and IX. The texture segmentation model of [19], which used the Brodatz album (each image of size 256 X 256) and prepared the gray level co-occurrence matrix at unit pixel distance with angular pixel positions ϴ = 0° and 45°, achieved a predictive accuracy as high as 90% (approx.). This work has prepared the gray level co-occurrence matrix at unit pixel distance with angular pixel positions ϴ = 0°, 45°, 90° and 135° and has achieved better predictive accuracy in all the texture based models (ITDD, ITVD, CITDVD), as shown in Table XIII.
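The predictive accuracy and kappa values reported for these models can both be derived from a confusion matrix of true versus predicted classes. The following minimal Python sketch shows the arithmetic, with p_o the observed agreement and p_e the agreement expected by chance. The function name and the three-class confusion matrix are hypothetical and are not results of this study.

```python
def accuracy_and_kappa(confusion):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    k = len(confusion)
    n = sum(map(sum, confusion))
    # Observed agreement: fraction of samples on the diagonal (the accuracy).
    p_o = sum(confusion[i][i] for i in range(k)) / n
    # Chance agreement from the row (true) and column (predicted) totals.
    row = [sum(confusion[i]) for i in range(k)]
    col = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    p_e = sum(row[i] * col[i] for i in range(k)) / (n * n)
    kappa = (p_o - p_e) / (1 - p_e) if p_e < 1 else 0.0
    return p_o, kappa

# Hypothetical test-set confusion matrix for three leaf classes.
confusion = [
    [45, 3, 2],
    [4, 44, 2],
    [1, 3, 46],
]
accuracy, kappa = accuracy_and_kappa(confusion)
misclassification = 1 - accuracy
```

For this example the accuracy is 0.90 and kappa is 0.85, which falls in the range described as very good agreement; the misclassification rate is simply one minus the accuracy.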

B. Analysis on the basis of Statistical feature data Sets
From the predictive accuracy values for the statistical feature models, it has been observed that the ISDD feature model has the highest average predictive accuracy (86.52%) amongst all the statistical feature data models studied in this work (CISDVD: 84.04%, ISVD: 79.92%), as shown in Fig. 4.
On comparing the statistical feature models proposed in this work with [9], which is based on dorsal image sets only, two of the proposed statistical feature models (ISDD and CISDVD) have fared better by giving higher values of average predictive accuracy. On comparing the texture based (ITDD, ITVD, CITDVD) and statistical (ISDD, ISVD, CISDVD) feature models proposed in this work, the statistical model ISDD fares the best amongst all the models in average predictive accuracy, as shown in Fig. 5. In the classification using statistical features, the ISDD model with the K-Nearest Neighbor algorithm and correlation based feature selection has given the highest accuracy value of 95.13%. The ISVD model has achieved its highest accuracy value of 88.87% with the K-Nearest Neighbor algorithm and Random Forest based feature selection. On combining the dorsal and ventral images, the predictive accuracy achieved is 91.77% with the K-Nearest Neighbor algorithm and chi-square as the feature selection algorithm, as shown in Tables X, XI and XII.

C. Analysis on the basis of number of texture features used and the Average Misclassification results
In the ITDD based model, the average misclassification rate with the selected features is 13.55%, and when no feature selection algorithm is used (all the 11 features used), the average misclassification rate is 13.23%, as shown in Fig. 6(a). In the ITVD based model, when the features are selected using the Correlation based feature selection algorithm (6 features selected), the average misclassification rate is 13.88%, and when no feature selection algorithm is used (all the 11 features used), the average misclassification rate is 11.68%, as shown in Fig. 6(b). In the CITDVD based model, when the features are selected using the Best First Search algorithm (8 features selected), the average misclassification rate is 14.47%, and when no feature selection algorithm is used (all the 11 features used), the average misclassification rate is 14.02%, as shown in Fig. 6(c).

D. Analysis on the basis of number of statistical features used and the Average Misclassification results
In the ISDD based model, when the features are selected using the Correlation based algorithm (4 features selected), the average misclassification rate is 5.94%, and when the Hill Climbing and Random Forest based algorithms are used with 5 features, the average misclassification rate is 8.04%, as shown in Fig. 7(a). In the ISVD based model, when the features are selected using the Random Forest algorithm (5 features selected), the average misclassification rate is 14.22%, as shown in Fig. 7(b). In the CISDVD based model, when the features are selected using Chi-square (5 features selected), the average misclassification rate is 11.59%, as shown in Fig. 7(c).

E. Analysis on the basis of Number of features selected for classification
The results of this work are compared with [19,20] in Fig. 8. The highest predictive accuracy in [19] is 90% with 32 features, and the highest predictive accuracy achieved on Lung Cancer Data [20] is 93.29% with 10 features, whereas in the present study, when 10 features are used, the accuracy achieved is 95.13% using Correlation based feature selection (CFS) for the ISDD based model. The ISDD model proposed in this study has thus achieved the highest accuracy value for 10 features.
The feature selection and misclassification results are not directly comparable because of the different datasets used, but taking the selection of 10 features as the criterion for classification, this work has compared its results with [19,20].

F. Analysis on the basis of dorsal and ventral features
The summary of the results, presented quantitatively in Table XIII and graphically in Fig. 5, demonstrates the supremacy of the ventral features over the dorsal features. The highest predictive accuracy (97.18%) is achieved by the J48 classification algorithm using the Best First Search algorithm applied to texture features obtained from ventral side leaf images. The statistical features give the best average predictive accuracy (86.52%) amongst all the models proposed in this work. The statistical feature set provides much better predictive accuracy than the models with texture based feature sets because the mutual information (MI), which is based on entropy, provided by two or more random variables in the dataset is higher for the statistical feature sets than for the texture based feature sets. Based on the extensive analysis performed in this work, it is proposed that the statistical model (ISDD), which is purely based on calculating statistical feature values, can be applied for studying any object of interest using the dorsal side of the image.

Fig. 1. Colored sample of dorsal and ventral leaf images.

The notation indicates all the 11 different values of texture features (mentioned above) measured at a particular value of ϴ, which is one of 0°, 45°, 90° and 135°. The following three texture feature datasets have been prepared in this study: Image Texture Dorsal Dataset (ITDD), Image Texture Ventral Dataset (ITVD) and Combined Image Texture Dorsal-Ventral Dataset (CITDVD), using equations (3), (4) and (5) respectively.

Fig. 6. Average misclassification rate vs. no. of features selected using (a) ITDD (b) ITVD (c) CITDVD respectively.

IV. CONCLUSIONS
This paper proposes to utilize the striking features present on both the dorsal and the ventral sides of leaves and has been modeled around texture and statistical features for dorsal, ventral and dorsal-ventral leaf images. It has been observed that, among the texture based models, the ITVD model gives better average predictive accuracy than the others. This strengthens the proposition of this work that the ventral sides of leaf images can be another alternative for extracting and discriminating features. Based on the results of all the statistical feature based models, it is inferred that the ISDD model is the best amongst all the texture and statistical models used in this work.

TABLE XI. CLASSIFICATION ACCURACY FOR ISVD