Morphological Features Analysis for Erythrocyte Classification in IDA and Thalassemia

Iron Deficiency Anemia (IDA) and Thalassemia are common diseases in the world population. In hospital routine, these diseases are recognized based on the hemoglobin level in the Complete Blood Count (CBC) result. Experts then conduct a visual examination under the light microscope, which is subject to human error. In this research, we propose a machine learning methodology to classify and characterize erythrocytes related to IDA and Thalassemia. We apply image pre-processing techniques such as gamma correction and morphological processing to the blood smear images to enhance edges and reduce image noise. Then, the background and foreground of every single erythrocyte image are segmented using Otsu's threshold method. Here, we consider nine types of erythrocyte (teardrop, echinocyte, elliptocyte, microcytic, hypochromic, target cell, acanthocyte, sickle cell and normal cell) to be classified and described based on their morphological features. Two sets of 24 and 31 features from Hu's moments, Zernike moments, Fourier descriptors and geometrical features are confirmed as potential features for each condition by one-way ANOVA. Next, the features are ranked by their information gain value from maximum to minimum and divided into subsets in increments of five features. We compare the performance of each subset with five selected classifiers, namely logistic regression, radial basis function network, multilayer perceptron, Naïve Bayes classifier and classification and regression tree. The best subset from the 31 features provides the highest classification result via logistic regression, with 83.5% accuracy, 83.5% sensitivity and 83.3% positive predictive value. This study could be extended in future work by using image datasets from other blood-based diseases.


I. INTRODUCTION
Red blood cells (RBCs), also termed erythrocytes, form a main part of the human body system. They deliver oxygen to the different body tissues via blood flow through the circulatory system. They also contain haemoglobin, which provides the red colour, and protein lipids that maintain their stability and deformability.
In hospital routine, a pathologist detects erythrocyte abnormalities in blood smear images under the light microscope. This process is a very subjective evaluation that leads to tedious, time-consuming and error-prone work [1], [2]. The pathologist makes a judgement based on his or her clinicopathological understanding of disease diagnostics. This exposes the evaluation to higher intra-observer variability, because the disparities in morphological features are small.
Among the various microscopic blood smear tests, the most commonly detected diseases include anaemia, thalassemia, leukaemia and red cell haemolysis. In particular, anaemia and thalassemia occur frequently in our population. According to [3], approximately 1 in 10 Malaysians is a carrier of Thalassemia. Currently, there are 6753 Thalassemia patients who require medical attention.
The detection of certain types of anaemia and thalassemia can be based on microscopic evaluation of the shape and size of erythrocytes. The classification of abnormal and normal erythrocytes from microscopic images of blood smears is therefore important for developing a rapid and efficient tool, and feature descriptors play an important role in this classification. In this study, we focus on nine types of erythrocyte shapes (teardrop, target, elliptocyte, hypochromic, microcytic, normocytic, burr and keratocyte) that are commonly seen in iron deficiency anaemia (IDA) and thalassemia patients (Fig. 1).
The study in [4] reported an approach to classify malaria, thalassemia and normal patients using a backpropagation neural network, while [5] proposed 271 colour, texture and geometrical features with PCA adopted. The authors in [6] and [7] presented a rule-based technique for four classes of erythrocyte such as elliptocyte, macrocyte, microcytic, spherocyte and thalassemia beta major. Other authors proposed cancer cell classification using geometric mean transform and dissimilarity metrics [8]. Others grouped shape geometric features into 8 features [9], 4 features [6] and 6 features [7], including area, perimeter, diameter, shape geometric feature, area proportion, deviation, central pallor, form factor, shape area factor and diameter area factor. Meanwhile, other authors used moment invariants [10] and geometric features in their classification [11], [12]. Recently, [13] proposed new shape descriptors using generalized support functions to describe convex figures. In conclusion, recent cell detection and recognition research still focuses on morphological features to boost the performance of overall digital pathology image analysis.
Our objective is to classify nine different categories of abnormal and normal erythrocytes in IDA and Thalassemia using sets of 24 and 31 features (Hu's moments, Zernike moments, Fourier descriptors and geometrical features). The gold standard analysis of the images was performed by two experts. Evaluation and statistical analysis were then conducted for both sets of features and compared. The rest of this paper is arranged as follows: Section II presents the proposed morphological feature analysis in IDA and Thalassemia blood smear images, Section III presents the results and discussion, and Section IV concludes this work.

A. Data Collection and Image Acquisitions
In this study, we considered seven healthy, nine Iron Deficiency Anaemia and seven Thalassemia blood smear images. These patients' peripheral blood smears were captured under a light microscope at the Pathology Department, Hospital Canselori Tuanku Muhriz (HCTM), Cheras, Malaysia. To display the RBC images, the optical image was digitized with a 40 times (40X) objective, which corresponds to approximately 400X magnification. Image acquisition was guided by a haematologist at HCTM (Fig. 2). The effective resolution and image size were 150 dpi and 1920 × 1440 pixels, respectively. The images were captured using an Olympus DP22 digital camera.

B. Erythrocyte Segmentation
As shown in Fig. 3, we first apply gamma correction to sharpen edges, followed by a few pre-processing steps for erythrocyte segmentation. Erythrocyte segmentation is vital for distinguishing the abnormalities present in IDA and Thalassemia. We used the connected component labelling method to segment erythrocytes from the peripheral blood smear images. Each segmented erythrocyte was then classed as a clump or a single cell based on its area. Clump cells are counted by the CP-SA method, which uses a skeleton algorithm to obtain the backbone of each RBC clump and analyses it together with the concavity point pixels to detect and count the RBCs. Single cells are passed on to the next stage.
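The segmentation steps in this subsection (gamma correction, thresholding, connected component labelling and the area-based split into single and clump cells) can be sketched as follows. This is a minimal, dependency-free illustration and not the authors' implementation: the function names, the 4-connectivity, the assumption that cells are darker than the background, and the `max_single_area` cutoff are all hypothetical choices. Otsu's method is used for the threshold, as stated in the abstract.

```python
from collections import deque


def gamma_correct(img, gamma):
    """Apply out = 255 * (in / 255) ** gamma to an 8-bit grayscale image."""
    return [[round(255 * (p / 255) ** gamma) for p in row] for row in img]


def otsu_threshold(img):
    """Return the gray level t that maximizes the between-class variance."""
    hist = [0] * 256
    for row in img:
        for p in row:
            hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = sum0 = 0
    for t in range(256):
        w0 += hist[t]              # number of pixels with value <= t
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mu0, mu1 = sum0 / w0, (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


def label_components(mask):
    """Label 4-connected foreground regions; return (labels, areas)."""
    h, w = len(mask), len(mask[0])
    labels = [[0] * w for _ in range(h)]
    areas = {}
    next_label = 0
    for r in range(h):
        for c in range(w):
            if not mask[r][c] or labels[r][c]:
                continue
            next_label += 1
            labels[r][c] = next_label
            queue, area = deque([(r, c)]), 0
            while queue:
                y, x = queue.popleft()
                area += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and mask[ny][nx] and not labels[ny][nx]):
                        labels[ny][nx] = next_label
                        queue.append((ny, nx))
            areas[next_label] = area
    return labels, areas


def split_single_and_clump(areas, max_single_area):
    """Separate component labels into single cells and clumps by area."""
    single = [k for k, a in areas.items() if a <= max_single_area]
    clump = [k for k, a in areas.items() if a > max_single_area]
    return single, clump
```

With dark cells on a bright background, the foreground mask would be `[[p <= t for p in row] for row in img]` with `t = otsu_threshold(img)`. In practice the area cutoff would need to be calibrated against the typical single-cell area at the acquisition magnification.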

C. Morphological Feature Extraction
In IDA and Thalassemia conditions, erythrocyte morphology varies from that of the normal cell. Thus, mathematical models were chosen to characterize abnormal RBCs. Two separate experiments were conducted. The first experiment uses the 24 morphological features in Table I, while the second uses the 31 morphological features shown in Table II.

a) Hu's moments
These moments were introduced by Hu (1962). They are invariant to translation, scaling and rotation. The method calculates the central moments, from which the seven invariant moments are derived [12].
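As an illustration of how the invariants follow from the central moments, the sketch below computes the first two of Hu's seven invariants for a small binary image. It uses the standard textbook definitions (normalized central moments $\eta_{pq} = \mu_{pq} / m_{00}^{1 + (p+q)/2}$); the function names are hypothetical and the remaining five invariants are omitted for brevity.

```python
def raw_moment(img, p, q):
    """Geometric moment m_pq = sum of x^p * y^q * f(x, y)."""
    return sum((x ** p) * (y ** q) * v
               for y, row in enumerate(img)
               for x, v in enumerate(row))


def hu_first_two(img):
    """Return (phi1, phi2), the first two Hu invariants of the image."""
    m00 = raw_moment(img, 0, 0)
    xc = raw_moment(img, 1, 0) / m00   # centroid gives translation invariance
    yc = raw_moment(img, 0, 1) / m00

    def mu(p, q):
        # Central moment about the centroid.
        return sum(((x - xc) ** p) * ((y - yc) ** q) * v
                   for y, row in enumerate(img)
                   for x, v in enumerate(row))

    def eta(p, q):
        # Normalized central moment gives scale invariance.
        return mu(p, q) / m00 ** (1 + (p + q) / 2)

    phi1 = eta(2, 0) + eta(0, 2)
    phi2 = (eta(2, 0) - eta(0, 2)) ** 2 + 4 * eta(1, 1) ** 2
    return phi1, phi2
```

A shape and a translated or 90-degree rotated copy of it should give the same pair of values up to floating-point error, which is the property that makes these moments useful as shape features.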

b) Zernike moments
Zernike moments are based on a complete orthogonal set of complex polynomials defined on the unit circle. They are better than other methods in terms of information redundancy, reconstruction capability and noise resilience.
In this method, the region of interest is mapped to polar coordinates in a unit circle. The origin of the circle is also the centre of the region of interest. Here, $\theta$ is the angle and $r$ is the radius of the polar coordinate. The mapping to polar coordinates is:
$$x = r\cos\theta, \qquad y = r\sin\theta \qquad (1)$$
where $r = \sqrt{x^2 + y^2}$ and $\theta = \tan^{-1}(y/x)$. Next, the Zernike moments $A(x, y)$ were calculated after normalizing the region of interest using polar coordinates. The following equation is considered for achieving both properties.
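$$g(x, y) = f\left(\frac{x}{a} + \bar{x},\ \frac{y}{a} + \bar{y}\right) \qquad (2)$$
where $a = \sqrt{\beta / m_{00}}$ for a predetermined value $\beta$, $\bar{x} = m_{10}/m_{00}$ and $\bar{y} = m_{01}/m_{00}$. Note that $m_{00}$, $m_{10}$ and $m_{01}$ are taken from Hu's moments. In a binary image, $m_{00}$ is the number of object pixels.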
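The Zernike moment computation described above can be sketched as follows, using the standard definition of the radial polynomial $R_{nm}(r)$ and the polar mapping of a square image onto the unit circle. This is an illustrative sketch only: the function names, the pixel-centre mapping and the choice to ignore pixels outside the unit circle are assumptions, and the normalization step is assumed to have been applied beforehand.

```python
import cmath
import math


def radial_poly(n, m, r):
    """Zernike radial polynomial R_nm(r) for n - |m| even and |m| <= n."""
    m = abs(m)
    if m > n or (n - m) % 2:
        raise ValueError("require |m| <= n and n - |m| even")
    total = 0.0
    for s in range((n - m) // 2 + 1):
        num = (-1) ** s * math.factorial(n - s)
        den = (math.factorial(s)
               * math.factorial((n + m) // 2 - s)
               * math.factorial((n - m) // 2 - s))
        total += num / den * r ** (n - 2 * s)
    return total


def zernike_moment(img, n, m):
    """Approximate A_nm of a square N x N image mapped onto the unit circle."""
    size = len(img)
    pixel_area = (2 / size) ** 2          # image spans [-1, 1] x [-1, 1]
    acc = 0j
    for row_idx, row in enumerate(img):
        for col_idx, value in enumerate(row):
            if not value:
                continue
            x = (2 * col_idx + 1 - size) / size   # pixel centre in [-1, 1]
            y = (2 * row_idx + 1 - size) / size
            r = math.hypot(x, y)
            if r > 1.0:
                continue                          # outside the unit circle
            theta = math.atan2(y, x)
            # Multiply by the complex conjugate of the basis function V_nm.
            acc += value * radial_poly(n, m, r) * cmath.exp(-1j * m * theta)
    return (n + 1) / math.pi * acc * pixel_area
```

The magnitude `abs(zernike_moment(img, n, m))` is the quantity usually taken as a rotation-invariant feature, since a rotation of the shape only changes the phase of $A_{nm}$.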
c) Fourier descriptor
A region of interest image has a boundary consisting of M pixel points [14]. The boundary point coordinates are $(x_0, y_0), (x_1, y_1), \ldots, (x_{M-1}, y_{M-1})$. These coordinates can be stated as $x(k) = x_k$ and $y(k) = y_k$. Each boundary point can be defined as a complex number as in Eq. (3):
$$s(k) = x(k) + j\,y(k) \qquad (3)$$
where $k = 0, 1, \ldots, M-1$. In the complex plane, the real axis is the x-axis and the imaginary axis is the y-axis. The Fourier descriptor of $s(k)$ is defined in Eq. (4):
$$a(u) = \sum_{k=0}^{M-1} s(k)\, e^{-j 2\pi u k / M} \qquad (4)$$
for $u = 0, 1, \ldots, M-1$, where $a(u)$ is the Fourier descriptor of the boundary. The four features fd1, fd2, fp1 and fp2 were calculated based on this descriptor [15].
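A minimal sketch of the descriptor computation is given below: the boundary points are turned into complex numbers and passed through a discrete Fourier transform. The function name is hypothetical, the unnormalized transform is one common convention, and the derivation of fd1, fd2, fp1 and fp2 from $a(u)$ is not shown because it is not specified here.

```python
import cmath


def fourier_descriptors(boundary):
    """Return a(u), u = 0..M-1, for a list of (x, y) boundary points."""
    count = len(boundary)
    s = [complex(x, y) for x, y in boundary]   # s(k) = x(k) + j*y(k)
    return [
        sum(s[k] * cmath.exp(-2j * cmath.pi * u * k / count)
            for k in range(count))
        for u in range(count)
    ]
```

Translating the shape only changes $a(0)$, which is why descriptors built from the coefficients with $u \geq 1$ do not depend on where the cell sits in the image.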

d) Other morphological features
Other morphological features, namely eccentricity, roundness, area, fill area and area proportion, which have also been considered in [16], [17], were included. Some of them are detailed as follows.
Fill Area: Specifies the number of pixels in the region, with all holes filled in.
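The fill area definition can be illustrated with a short sketch that floods the background from the image border and counts everything that is not reached. The function names are hypothetical, 4-connectivity for the background is an assumption, and the ratio of area to fill area shown in the usage note is only one plausible reading of "area proportion", which is not defined in this section.

```python
from collections import deque


def area(mask):
    """Number of foreground pixels in a binary mask."""
    return sum(1 for row in mask for v in row if v)


def filled_area(mask):
    """Number of foreground pixels after all enclosed holes are filled."""
    h, w = len(mask), len(mask[0])
    outside = [[False] * w for _ in range(h)]
    queue = deque()
    # Seed the flood fill with every background pixel on the image border.
    for r in range(h):
        for c in range(w):
            on_border = r in (0, h - 1) or c in (0, w - 1)
            if on_border and not mask[r][c]:
                outside[r][c] = True
                queue.append((r, c))
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < h and 0 <= nc < w
                    and not mask[nr][nc] and not outside[nr][nc]):
                outside[nr][nc] = True
                queue.append((nr, nc))
    # Everything not reachable from the border is object or enclosed hole.
    return h * w - sum(row.count(True) for row in outside)
```

For a ring-shaped mask, `area(mask) / filled_area(mask)` is below 1, which is the kind of contrast that could help separate cells with a pale or hollow centre from solid ones.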

D. Selection of Significant Feature for IDA and Thalassemia
Feature selection plays an important role in the machine learning process, because it can increase the potential of each classifier. Some features are suitable for differentiating the textural and morphological characteristics that describe normal and abnormal erythrocytes. For prediction, however, the set of features that gives the highest prediction accuracy needs to be identified, which also reduces the computational complexity by discarding non-selected features. Here, one-way ANOVA (analysis of variance) was applied to the extracted features to pinpoint a set of significant features [18]. Next, the information gain value was computed for each significant feature using Weka version 3.7.5, and the features were ranked based on it [19]. One-way ANOVA calculates and compares the means of two or more groups using the F-test statistic [18], as denoted below:
$$F = \frac{\mathrm{MS}_{\text{between}}}{\mathrm{MS}_{\text{within}}} \qquad (9)$$
where $\mathrm{MS}_{\text{between}}$ is the mean square between groups and $\mathrm{MS}_{\text{within}}$ is the mean square within groups. A lower F value indicates lower discrimination potential and a larger F value indicates better discrimination potential. Table III displays the one-way ANOVA analysis of the morphological features extracted from the erythrocyte images for both the 24 and 31 features. Each feature was modelled for the nine types of erythrocyte that represent IDA and Thalassemia. From this table, we can observe that all 24 and 31 features are statistically significant.

a) Features ranking
The significant features were ranked based on their information gain. This study compared the ranked features under two conditions: (i) 24 features (Table IV) and (ii) 31 features (Table V). The ranked features were allocated into five subsets (24 features) and seven subsets (31 features) based on their ranks.
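The one-way ANOVA F statistic used in this subsection to screen features can be sketched for a single feature as follows. Each inner list holds the values of that feature for one erythrocyte class; the function name and the toy values in the usage note are illustrative assumptions, and the p-value lookup is omitted.

```python
def one_way_anova_f(groups):
    """Return F = MS_between / MS_within for a list of value groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of the samples around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within
```

For example, `one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])` gives a large F because the third group's mean is far from the others relative to the spread inside each group.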

b) Information gain ranking
Information gain is derived from Shannon's entropy theory [19]. The top-ranked features have the highest information gain values. All significant features were ordered from the maximum to the minimum value of their information gain.
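A minimal sketch of the ranking and of the incremental subsets of five features is given below. It assumes each feature has already been discretized into bins (how Weka discretizes numeric attributes is not reproduced here), and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_bins, labels):
    """Entropy of the labels minus their entropy conditioned on the feature."""
    n = len(labels)
    groups = {}
    for b, y in zip(feature_bins, labels):
        groups.setdefault(b, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def ranked_subsets(gains, step=5):
    """Rank feature names by gain (max to min) and grow subsets by `step`."""
    ranked = sorted(gains, key=gains.get, reverse=True)
    sizes = list(range(step, len(ranked), step)) + [len(ranked)]
    return [ranked[:size] for size in sizes]
```

With `step=5`, a ranking of 24 features yields subsets of 5, 10, 15, 20 and 24 features, and a ranking of 31 features yields seven subsets, which matches the subset counts stated in this section.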

E. IDA and Thalassemia Cell Identification
Our main objective was to classify and compare nine different categories of abnormal and normal erythrocytes under two conditions (24 and 31 features). The erythrocyte types identified in IDA and Thalassemia patients were teardrop, target, elliptocyte, acanthocyte, hypochromic, echinocyte, microcytic, sickle cell and normal erythrocyte. We set up five classification approaches, namely radial basis function network, logistic regression, Naïve Bayes classifier, classification and regression tree, and multilayer perceptron, using Weka version 3.7.5.

a) Logistic regression (LR)
Logistic regression is a supervised classification method that determines the class membership of each sample.
As our experiment involved nine classes of erythrocyte, a multiclass logistic regression model with the one-against-all algorithm was used [20].
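The one-against-all scheme can be sketched as follows: one binary logistic model is trained per class, and a sample is assigned to the class whose model gives the highest probability. This is an illustrative sketch trained with plain batch gradient descent; the learning rate, epoch count and function names are assumptions and do not reflect Weka's internal optimizer.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_binary(X, y, lr=0.1, epochs=1000):
    """Fit weights (last entry is the bias) by batch gradient descent."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + w[-1])
            err = p - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
            grad[-1] += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w


def train_one_vs_all(X, labels):
    """Train one binary model per class (that class against all others)."""
    return {c: train_binary(X, [1 if l == c else 0 for l in labels])
            for c in sorted(set(labels))}


def predict(models, x):
    """Return the class whose binary model gives the highest probability."""
    def score(w):
        return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + w[-1])
    return max(models, key=lambda c: score(models[c]))
```

With nine erythrocyte classes, `train_one_vs_all` would return nine weight vectors, one for each class against the remaining eight.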

b) Radial basis function network (RBF)
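The radial basis function network is a type of feed-forward network [21]. It consists of an input layer, an output layer and a single hidden layer. Assume that $x = [x_1, x_2, \ldots, x_d]$ is a d-dimensional input feature vector and that the target output is $t = [t_1, t_2, \ldots, t_c]$. The output of the RBF network is defined by Eq. (6):
$$y_k(x) = \sum_{j=1}^{H} w_{kj}\, \phi_j(x) \qquad (6)$$
Here, $\phi_j$ is the basis function, $w_{kj}$ are the weights of the hidden nodes and $H$ is the number of hidden nodes.
The basis function $\phi_j(x)$ is given in Eq. (7):
$$\phi_j(x) = \exp\left(-\frac{\lVert x - \mu_j \rVert^2}{2\sigma_j^2}\right) \qquad (7)$$
where $\mu_j$ is the centre of the radial basis function, $x$ is the input vector and $\sigma_j$ is the width of $\phi_j$. The K-means clustering algorithm is applied to determine the basis function parameters. Let $\sigma_j = 0.1$.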
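The forward pass of such a network can be sketched as below, assuming a Gaussian basis function, which is a common choice but is an assumption here. The centres would come from K-means clustering, which is not shown; the function names and the optional bias term are hypothetical.

```python
import math


def gaussian_basis(x, centre, width):
    """Gaussian radial basis: exp(-||x - centre||^2 / (2 * width^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, centre))
    return math.exp(-sq_dist / (2 * width ** 2))


def rbf_output(x, centres, widths, weights, bias=0.0):
    """One output node: weighted sum of the hidden-node basis responses."""
    return bias + sum(w * gaussian_basis(x, c, s)
                      for w, c, s in zip(weights, centres, widths))
```

The response of a hidden node is 1 when the input coincides with its centre and decays towards 0 as the input moves away, with the width controlling how quickly it decays.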

c) Multilayer perceptron (MLP)
This method is a type of back-propagation neural network algorithm with multiple layers [21]. The algorithm updates the weights of each layer in order to minimize the output error and give better accuracy. Eq. (8) is applied to calculate the training error:
$$J = \frac{1}{2} \sum_{k=1}^{C} (t_k - z_k)^2 \qquad (8)$$
where $z_k$ is the network output and $t_k$ is the target output. The number of output nodes is denoted by $C$. Here, we have nine output nodes with a single hidden layer.
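As a small worked instance of the training error, the sketch below evaluates the sum-of-squares criterion and its gradient with respect to the outputs for a nine-node output layer. The function names and the one-hot target encoding are assumptions made for illustration.

```python
def training_error(targets, outputs):
    """J = 0.5 * sum over output nodes of (t_k - z_k)^2."""
    return 0.5 * sum((t - z) ** 2 for t, z in zip(targets, outputs))


def output_gradient(targets, outputs):
    """dJ/dz_k = z_k - t_k, the error signal that back-propagation starts from."""
    return [z - t for t, z in zip(targets, outputs)]


# Nine output nodes, one per erythrocyte class; the true class is node 0.
targets = [1.0] + [0.0] * 8
outputs = [0.5] + [0.0] * 8
error = training_error(targets, outputs)
```

Only the first node contributes in this example, so the error is half the squared difference between 1.0 and 0.5.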

d) Naïve bayes classifier (NB)
This classifier is rooted in Bayesian theory under the assumption that the features are independent of each other. The method estimates the posterior probability of each class label given a feature set, and predicts the class label with the maximum posterior probability [21].
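The maximum-posterior rule can be sketched as follows, assuming each numeric feature follows a Gaussian distribution within a class. That distributional choice, the small variance floor and the function names are assumptions for illustration.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate the prior, per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for xi, yi in zip(X, y):
        by_class[yi].append(xi)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [sum((v - m) ** 2 for v in col) / n + var_floor
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model


def log_posteriors(model, x):
    """Unnormalized log posterior of each class under feature independence."""
    scores = {}
    for label, (prior, means, variances) in model.items():
        log_p = math.log(prior)
        for v, m, var in zip(x, means, variances):
            log_p += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        scores[label] = log_p
    return scores


def predict_nb(model, x):
    """Return the class label with the maximum posterior probability."""
    scores = log_posteriors(model, x)
    return max(scores, key=scores.get)
```

Working in log space avoids numerical underflow when the per-feature likelihoods of 24 or 31 features are multiplied together.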

e) Classification and regression tree (CART)
CART is commonly used in machine learning for different classification problems. The algorithm constructs a classification and regression tree via a top-down, recursive, divide-and-conquer approach. A CART comprises a root node, internal nodes and leaf nodes. The topmost node is the root node and the bottom nodes are leaf nodes. Class labels are assigned at the leaf nodes. Here, the splitting criterion for data partitioning is the Gini index measure [19].
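The Gini-based splitting criterion can be sketched for a single numeric feature as follows. The search over midpoints between consecutive sorted values is a common convention and an assumption here, as are the function names.

```python
from collections import Counter


def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted Gini) of the best binary split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # no threshold between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best
```

A tree builder would call `best_split` for every feature at a node, keep the feature with the lowest weighted Gini index, and recurse on the two partitions until the leaves are pure or a stopping rule applies.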

III. RESULT AND DISCUSSION
Based on Table VI (31 features) and Table VII (24 features), the whole feature set does not always give the highest prediction performance. Hence, we divided the features into subsets according to their information gain value. Each subset grows in increments of five features, which gives 7 subsets for the 31 features and 5 subsets for the 24 features. All subsets were then evaluated with the five selected classifiers. This research includes 725 abnormal and 99 normal erythrocytes. We also observed a fluctuating trend in these tables across subsets and classifiers. It can be concluded that each feature carries its own value, which leads to a better or worse result. For the 31 ranked features, fewer features give better performance and accuracy: subsets of approximately ten, fifteen and twenty features perform better than all 31 features. However, for the 24 ranked features, more features are needed to obtain a more accurate result. Hence, information gain ranking helps to obtain a more precise and optimized result from the features.
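The three reported measures can be computed from a multiclass confusion matrix as sketched below. Averaging the per-class sensitivity and positive predictive value with weights proportional to class size is an assumption about how the figures were aggregated; the function name is hypothetical.

```python
def classification_metrics(confusion):
    """Return (accuracy, weighted sensitivity, weighted PPV).

    confusion[i][j] is the number of samples of true class i
    that were predicted as class j.
    """
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[i][i] for i in range(k)) / total
    sensitivity = ppv = 0.0
    for i in range(k):
        support = sum(confusion[i])                       # true class size
        predicted = sum(confusion[r][i] for r in range(k))
        tp = confusion[i][i]
        recall_i = tp / support if support else 0.0
        ppv_i = tp / predicted if predicted else 0.0
        sensitivity += support / total * recall_i
        ppv += support / total * ppv_i
    return accuracy, sensitivity, ppv
```

Under this class-size weighting the averaged sensitivity is algebraically equal to the accuracy, which would be consistent with the identical accuracy and sensitivity values reported for logistic regression.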

IV. CONCLUSION
In conclusion, we have characterized and classified nine types of erythrocyte for the identification of IDA and thalassemia conditions. A set of morphological features was extracted from segmented blood smear images and ranked via information gain value under two conditions: (i) 24 features and (ii) 31 features. The experiments with the five classifiers showed that logistic regression gave the best accuracy, sensitivity and positive predictive value (83.5%, 83.5% and 83.3%, respectively). Thus, we propose these geometrical features to characterize each of the shapes.
We also found that a smaller subset of features (for the 31 features) provides better performance in most classifiers. This result shows that each feature has a different significance depending on the segmented image, and that it is important to rank the features correctly. This research could be extended in future work by using image datasets from other blood-based diseases.

TABLE III. ONE-WAY ANOVA ANALYSIS OF IDA AND THALASSEMIA FEATURES

TABLE IV. FEATURE RANKING BASED ON INFORMATION GAIN FOR 24 FEATURES

Table VI and Table VII show the variation in performance of different subsets of ranked morphological features across the five classifiers. In Table VI, the accuracy varies from 68.2 to 78.3 for the Naïve Bayes classifier, 73.0 to 80.5 for RBF, 75.8 to 83.3 for MLP, 75.6 to 83.5 for LR and 71.2 to 78.5 for CART. In Table VII, the accuracy is much lower, within 46.